Best Practice

Summary

Sesam is an Integration Platform using a unique Datahub approach for collecting, connecting and sharing data. With Sesam, data can quickly be re-purposed, re-structured and used without changing the systems that own the original data. In this way, all the valuable data within your company becomes available to the whole organization.

Because Sesam takes a unique approach to integrating data and is a very generic platform, there has been a growing need for a best practice that describes and teaches how to best utilize its possibilities.

As the amount of data in a Sesam node grows, the need for an optimized dataset structure increases. Without proper structure, each added system results in more time spent on connecting and joining the corresponding raw data.

When data is processed in a sequential way, i.e. through many pipes that add point-to-point connections inside Sesam, locating and re-using data becomes significantly more time consuming.

These challenges may be solved by grouping data of the same type or concept in what we call “global datasets”.

Global datasets are key to getting the most out of Sesam. They mean fewer connections inside Sesam, and more re-usable datasets. In addition, they make it easier to find the required data, as it is now grouped in one location, e.g. invoice or employee. It also becomes easier to add new integrations for consuming data, as they can connect to ready-to-use, enriched data in the existing global datasets.

Data model

The data model in Sesam can be described in short as collect, connect, share. The different sources are connected to Sesam with connectors, and the data is then imported into datasets inside Sesam. Once imported, all data within Sesam is in the same format (JSON), which means data can be connected and merged independently of the source system.

When importing data into Sesam, there are two important principles to keep in mind. First, always try to get as much of the data as possible (e.g. if importing a SQL table, do a SELECT *). Second, keep the data as close to the original as possible; do not apply transforms or change the semantics unless necessary.

The next step is to create the global datasets. These consist of data from the "raw" imported datasets, categorized and connected (where possible). Here, too, there are two principles to keep in mind. First, ALL the data must be available through global datasets, which means every raw dataset needs to be imported into a global dataset. Second, always try to merge new data with existing data in the global dataset that concerns the same thing. The goal of doing it this way is to make it easy to consume and reuse the data inside Sesam, in line with our bold vision: "All the data from all the systems, connected and available as a single shared resource".
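As a minimal sketch, an input pipe following the two import principles above could look like the one below: it reads the whole table and applies no transforms. The system id "hrsystem" and the table name "person" are assumptions for illustration.

{
  "_id": "hrsystem-person",
  "type": "pipe",
  "source": {
    "type": "sql",
    "system": "hrsystem",
    "table": "person"
  }
}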

[Figure: Generic pipe concept]

To read about the main concepts and how to get started in Sesam, please see the Sesam documentation.

Global datasets

Sesam organizes entities by storing them in global datasets.

Definition

A global dataset is a collection of data of the same type, or concept, from different sources. In other words, a global dataset combines data from sources with logically linked data to provide one common place to retrieve this data from when needed. This reduces the total number of pipes compared to a setup where you get data from the original sources each time.

A global dataset is generated by merging data from various sources. The data merge can be performed by simply merging datasets together, or by selecting which properties to merge through transformations (see the Sesam documentation on transforms). It is also possible to simply add datasets to a global dataset without merging.

It is important to remember that a global dataset requires knowledge or understanding of the basic data from the different sources. Only by locating the logically linked data is it possible to effectively structure it into global datasets.

Example:

There are three sources containing person data, as shown below. If any target system wants data about this person, it would have to go through each of the raw datasets every time. With a global-person dataset, however, the information can easily be fetched from one single location.

HR system
{
  "_id": "hrsystem-person:02023688018",
  "hrsystem-person:EmailAddress": "IsakEikeland@teleworm.us",
  "hrsystem-person:Gender": "male"
}

CRM
{
  "_id": "crm-person:100",
  "crm-person:EmailAddress": "IsakEikeland@teleworm.us",
  "crm-person:ID": "100",
  "crm-person:SSN": "02023688018",
  "crm-person:SSN-ni": "~:hrsystem-person:02023688018"
}

ERP
{
  "_id": "erp-person:0202",
  "erp-person:SSN": "02023688018",
  "erp-person:SSN-ni": "~:hrsystem-person:02023688018",
  "erp-person:ID": "0202",
  "erp-person:country": "NO"
}

The dataset below is what a global dataset of the above three datasets looks like in Sesam when merging on equality of social security number (SSN).

{
  "$ids": [
    "~:crm-person:100",
    "~:hrsystem-person:02023688018",
    "~:erp-person:0202"
  ],
  "_id": "crm-person:100",
  "hrsystem-person:EmailAddress": "IsakEikeland@teleworm.us",
  "hrsystem-person:Gender": "male",
  "crm-person:EmailAddress": "IsakEikeland@teleworm.us",
  "crm-person:ID": "100",
  "crm-person:SSN": "02023688018",
  "crm-person:SSN-ni": "~:hrsystem-person:02023688018",
  "erp-person:SSN": "02023688018",
  "erp-person:SSN-ni": "~:hrsystem-person:02023688018",
  "erp-person:ID": "0202",
  "erp-person:country": "NO"
}

Positive effects of global datasets

  • By decoupling data from the original sources, point-to-point integrations within Sesam can be avoided; fewer connections result in lower maintenance costs. In addition, data is available without concern for the original source
  • All logic related to connecting and enriching data is done only once
  • Data in global datasets is re-used, which saves work and makes adding new integrations easier
  • Only one look-up is needed, instead of having to “look for data” in various datasets
  • Input datasets can be kept raw and as similar to the real source as possible, independent of how the data will be used, thus avoiding “early binding”
  • Adding additional integrations further refines the global datasets, and therefore continuously improves the data quality

A data model without global datasets might look like the figure below. This example consists of only four sources and three target systems; a real setup will generally be a lot more complicated.

[Figure: Data model without global datasets]

As shown in the figure below, a Sesam node containing global datasets results in fewer connections, making it both tidier and easier to manage.

[Figure: Data model with global datasets]

What do you have to take into account, and what are the challenges of global datasets?

Global datasets will most likely grow and become large. If the configuration or logic is changed, this can in some cases mean that the whole dataset needs to be reprocessed, which can potentially be a big job and take time.

As an example, an energy company has 700 000 customers, and each customer has a power meter connected to their home. When the historical data the company is required to store is added as well, the total number of data objects sums to 30 000 000. One way of managing this large amount of data is to divide it into different global datasets. In this case, the energy company chose to store their historical data in one global dataset and the current data in another.

Namespace and namespaced identifiers

Namespace

A namespaced property consists of two parts: a namespace and a property name. The namespace part can consist of any characters and ends with a colon. The property part can consist of any character except colons. In the examples below, "crm-person" and "hrsystem-person" are namespaces and "ssn" is the property.

E.g.

"crm-person:ssn"

"hrsystem-person:ssn"

Namespaced identifiers

Namespaces are used to create namespaced identifiers, which make it possible to merge data without losing track of the source. In addition, namespaced identifiers can be mapped to complete URLs, as we have unique identifiers for each object. Namespaced identifiers provide the same functionality as foreign keys in databases. These references are usually added in the input pipe.

A namespaced identifier may take the following form:

"hrsystem-person:SSN-ni":"~:hrsystem-person:18057653453"

"namespace:propertyName":"namespaced-identifier:value"

Using namespaced identifiers is the recommended way of referring to other datasets when matching properties during transformations, as it eases the connection of data. If you have three different person datasets and want to merge them on a common property, like e-mail or SSN, you should use namespaced identifiers. The code below, placed inside the DTL of the "crm-person" pipe, will add a namespaced identifier pointing from "crm-person" to "hrsystem-person", based on the common SSN property. In a similar way, we need to create a namespaced identifier from "erp-person" to "hrsystem-person", so that both can be referred to during merging.

["make-ni", "hrsystem-person", "SSN"],

This will produce the following output:

"crm-person:SSN-ni": "~:hrsystem-person:23072451376",

You now have unique namespaced identifiers based on SSN, which you can refer to when merging.

{
  "_id": "global-person",
  "type": "pipe",
  "source": {
    "type": "merge",
    "datasets": ["crm-person cp", "hrsystem-person hr", "erp-person ep"],
    "equality": [
      ["eq", "cp.SSN-ni", "hr.$ids"],
      ["eq", "ep.SSN-ni", "hr.$ids"]
    ],
    "identity": "first",
    "version": 2
  }
}

In the above code we connect the foreign keys "SSN-ni" of "erp-person" and "crm-person" with the primary key "$ids" of "hrsystem-person". You do not need to add a third equality between "erp-person" and "crm-person", as that connection follows automatically.

By default, namespaced identifiers are stripped from the output.

Naming conventions

It is essential to have an agreed naming convention across integrations within Sesam. The motivation is better visibility and understanding of where your data comes from and where it is heading, as well as how it is transformed internally. It also makes it easier to switch between projects.

General rules

  • lower case
  • dash - as delimiter

Systems

  • name after the name of the service you integrate with, not the technology used (e.g. salesforce instead of mysql)
  • if multiple systems are required to talk to the same service, postfix them with a qualifier (e.g. salesforce-out)

Pipes

  • name input pipes after the system they read from, postfixed with the type of content (e.g. salesforce-sale)
  • do not use plural names (e.g. salesforce-sale, not salesforce-sales)
  • prefix merge pipes with merged- (e.g. merged-sale)
  • prefix global pipes with global- (e.g. global-sale)
  • name intermediate output pipes with the type of the content and the name of the system to send to (e.g. sale-bigquery)
  • name outgoing pipes by postfixing the intermediate output with -endpoint (e.g. sale-bigquery-endpoint)

Datasets

  • name them the same as the pipe that produced them (this is the default and does not need to be specified)

Tips for global datasets

  • All datasets should go into a global dataset
  • In most data models, between 10 and 20 global datasets are sufficient. This is based on experience from Sesam projects of various sizes: the smaller projects have close to 10, and some of the bigger projects have over 20 global datasets, with hundreds of pipes connected to them. To identify how many global datasets a project needs, it is important to perform a proper analysis; if a company’s needs are met by five global datasets, there is no reason to insist on ten. These numbers are guidance only, and there are examples of larger data models with fewer than ten global datasets
  • Start general with big “buckets” and re-arrange and split into smaller global datasets if necessary
  • Think less about properties and more about “what it is”, e.g. person vs user. Something that stops being a user might not stop being a person
  • Keep it generic
  • Avoid system-specific global datasets. E.g. a document management system contains metadata about various concepts (title, revision, status, equipment, owner, date, generated files). These are static in nature, and to make them useful you can put the “equipment data” in a global equipment dataset, the “owner data” in a global person dataset, etc. This way you gather concepts across sources and enrich them, such that they are available for other systems to use
  • Global datasets give us the opportunity to define “golden records”

How to do global datasets in Sesam

When initiating a new project in Sesam, it is important to begin with the data model. Start by analyzing the sources and data to determine the needs of the organization. This will have an impact on the data model, and more specifically on how the global datasets will be organized. It is here the organization needs to ask: what is important to us? What data do we use often, and therefore needs to be easily available? The answers vary for each organization and each data model. It is however normal to add global datasets, or to re-arrange them, as the amount of data grows.

To get an idea of the granularity, please see final chapter called “Examples of real global datasets”.

Generally, most organizations need five basic global datasets. This is not true for all organizations and data integrations, but it is a good basis to start from.

These five are:

Global-person

Global-project

Global-classification

Global-organization

Global-task

This is only the first part of the analysis. The second part is determining how to enrich data in the global datasets, and which aggregated datasets are needed. These questions need to be asked in order to make the enriched datasets as useful as possible.

Recipe for generating global datasets

It is impossible to make a universal recipe for all integration projects using Sesam, as all projects are unique. The data variety, the data model complexity and the customer requirements are all integral to how each individual Sesam node is structured. In addition, the order in which you do the various tasks might vary, so please use this as a guideline only, not a comprehensive recipe.

  1. The first step is to consider what the goal of the integration is: what do you want to achieve?
  2. The next step is to determine which data from which sources you need to achieve your goal.
  3. Get information regarding the existing data model and how data needs to be joined.
  4. Access the data source and copy the necessary data into Sesam.
  5. Analyze and decide on how you want to organize your global datasets. There is no right or wrong way of doing this. In time you will gain experience on which datasets work as global datasets and which do not. Try to use common sense and organize by concept or type.
  6. Once decided, it is important to analyze how the data is going to be added to the global dataset: is there a need to merge the data, or should data be “placed” in a global dataset without merging? For example, generating a global location dataset is logical. It contains countries, regions, cities, boroughs, counties and offices. It does not make sense to merge them, but it does make sense to put them in a common global dataset. This way you gather data concerning the same concept, and have a single place for looking up this information.

In many cases however, it does make sense to merge the data, such as person data as shown earlier, which was merged on SSN, email etc.

  7. Some data may need to be processed before being added to a global dataset. This involves e.g. selecting what to use as the ID, converting data types, changing property names etc. (see the sketch after this list)
  8. When the global datasets are set up, the data can either be re-used as is, or undergo further transformations. This might encompass filtering specific data, joining with other datasets etc. to enhance quality and usefulness.
  9. Based on the target systems and your requirements, adapting data to target systems is done as late as possible in the data flow and as close to the target as possible (late binding).
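As a hedged sketch of the pre-processing mentioned in step 7, the DTL below copies everything from the source except a "ZipCode" property, then adds it back renamed to "zipcode" and cast to a string. The property names are assumptions for illustration.

["comment", "copy all properties except the original ZipCode, then add it back renamed and cast"],
["copy", "*", "ZipCode"],
["add", "zipcode", ["string", "_S.ZipCode"]]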

Let’s start with a simplified example to demonstrate. Below we have four datasets from two different sources, "crm" and "erp":

erp-person

crm-person

erp-organisation

crm-organisation

Looking at the names of the datasets, it would be logical to create two global datasets. The first could contain data about persons, such as users, customers, names, employees and so on.

global-person

[Figure: global-person dataset]

The second could contain data concerning the organization. This might include names of departments, customers, regions and so on.

global-organisation

[Figure: global-organisation dataset]

When the number of sources and datasets increases it will become natural to add more “buckets” or global datasets to put them in.

Below, new sources with data from Difi and Salesforce have been added, along with more datasets from the existing sources.

Datasets:

erp-person

crm-person

difi-ssn

hrsystem-person

difi-orgnumber

salesforce-opportunity

erp-projectnumber

crm-order

The datasets might be organized as shown below. As seen, there are no changes to “global-organisation”; new datasets are added to “global-person”, and a new “bucket” called “global-project” is created.

global-person

[Figure: global-person dataset]

The new “bucket” could contain data concerning projects. This might include orders, project numbers, sales opportunities etc.

global-project

[Figure: global-project dataset]

It is important to emphasize that this is only a suggestion on how it might be logical to organize the datasets. The end result is highly individual and will most likely vary. This does however give an idea on how architecture in Sesam is built and developed using global datasets.

Additional Sesam tips

Golden record

A golden record is a single, well-defined version of all the data entities in an organizational ecosystem. In this context, a golden record is sometimes called the "single version of the truth", where "truth" is understood to mean the reference to which data users can turn when they want to ensure that they have the correct version of a piece of information.

In the example below, all three sources provide a zip code, so some properties in a global dataset may be duplicated across sources. In this case it could be fitting to add a "global-person:zipcode" property to the global dataset. This property should contain the most reliable zip code value of the three sources and will be the property we access when we want the person's zip code. This global property becomes part of a "golden record", which ensures a single, well-defined representation of the person.

{
  "$ids": [
  "~:crm-person:100",
  "~:hrsystem-person:02023688018",
  "~:erp-person:0202"
  ],
  "_id": "crm-person:100",
  "hrsystem-person:EmailAddress": "IsakEikeland@teleworm.us",
  "hrsystem-person:Gender": "male",
  "hrsystem-person:ZipCode": "null",
  "crm-person:EmailAddress": "IsakEikeland@teleworm.us",
  "crm-person:ID":"100",
  "crm-person:SSN": "02023688018",
  "crm-person:SSN-ni": "~:hrsystem-person:02023688018",
  "crm-person:PostalCode": "3732",
  "erp-person:SSN": "02023688018",
  "erp-person:SSN-ni": "~:hrsystem-person:02023688018",
  "erp-person:ID":"0202",
  "erp-person:ZipCode": "5003",
  "global-person:zipcode": "3732"
}

In addition to the zip codes from the three different data sources, the "global-person" dataset now also contains a "global-person:zipcode" property. When creating a golden record in Sesam, one configures the priority of the sources; the value from the highest-priority source that actually has data is used.

"hrsystem-person:ZipCode": null,
"crm-person:PostalCode": "3732",
"erp-person:ZipCode": "5003",
"global-person:zipcode": "3732"

Now, the most trusted zip-code value can be accessed without evaluating all three at every inquiry.
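How this priority is expressed depends on the configuration, but as a hedged sketch in DTL, a coalesce-style rule that picks the first non-null value could look like the following, using the property names from the example above (the exact function signature may vary between Sesam versions):

["comment", "source priority for zipcode: crm first, then erp, then hrsystem (illustrative order)"],
["add", "zipcode",
  ["coalesce", ["list", "_S.crm-person:PostalCode", "_S.erp-person:ZipCode", "_S.hrsystem-person:ZipCode"]]]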

RDF types

In central datasets a property for classification is sometimes added. In Sesam, this property is called "rdf:type". It is used when one wants to extract a specific data type from the global dataset.
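As a hedged sketch, the DTL below shows how a consuming pipe might extract only entities of a given type from a global dataset; the type value "person" in the "global-person" namespace is an assumption for illustration.

["comment", "keep only entities classified as persons (illustrative type value)"],
["filter", ["eq", "_S.rdf:type", ["ni", "global-person", "person"]]],
["copy", "*"]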

Data modelling

Below are the principles for data modelling in Sesam.

Raw input

When reading data into Sesam it is best practice to copy it and not start changing it. This way we have a dataset which is identical, or close to identical, to the source data. It is, however, common practice to add namespaced identifiers on the source pipe to keep track of where the data comes from, as sketched below.
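A minimal sketch of such an input pipe, assuming a SQL system named "crm" with a "person" table (names are illustrative): everything is copied unchanged, and the only addition is the namespaced identifier.

{
  "_id": "crm-person",
  "type": "pipe",
  "source": {
    "type": "sql",
    "system": "crm",
    "table": "person"
  },
  "transform": {
    "type": "dtl",
    "rules": {
      "default": [
        ["copy", "*"],
        ["add", "SSN-ni", ["make-ni", "hrsystem-person", "_S.SSN"]]
      ]
    }
  }
}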

Benefits:

  • Not configured specifically for any project or use-case, therefore much easier to re-use the data over time
  • No decisions have to be made before the data is imported

Drawbacks:

  • Increased storage use if not all the data is needed

Data flow

In Sesam, data is collected, connected, enriched and transformed from the datasets formed by retrieving data from the source systems. This is done by compiling data from multiple datasets, transforming data into new formats or standards, and adapting the data to new target systems. In this way, new value is created from the use and re-use of data. This happens in the global datasets, where the main purpose is that one should not need to look up multiple datasets and compile data every time it is needed, but rather connect and enrich once and look it up in one place.

Enrich data

There are multiple ways to enrich the original source data. The most common one is a transformation; a simple example would be to concatenate “firstname” and “lastname” into a new property called “name” that consists of both, as sketched below. This will be stored in the global dataset (in addition to the two original properties), and will be available for future integrations that might need the same transformation.
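A minimal DTL sketch of this enrichment; the property names "firstname" and "lastname" are assumptions for illustration.

["comment", "concatenate first and last name into an enriched name property"],
["add", "name", ["concat", "_S.firstname", " ", "_S.lastname"]]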

Another way to enrich data is to derive new properties from the original ones. One example is a “map-coordinate” property stored in the coordinate system Google uses, while a target system needs it in another coordinate system. This can be achieved by calling a coordinate microservice that returns one or more extra properties based on other coordinate systems. These are then added to the global dataset in addition to the original one, giving future integrations more options if needed (see the sketch below).
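A hedged sketch of such a step: an HTTP transform entry in a pipe's list of transforms sends the entities to the microservice, which returns them with the extra coordinate properties added. The system id "coordinate-service" and the url are hypothetical.

{
  "type": "http",
  "system": "coordinate-service",
  "url": "/convert-coordinates"
}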

Yet another example of how to enrich data is by adding mappings to the properties to support a corporate standard information model, or simply mapping to a target system. This adds the mapped properties to the global dataset in addition to the original properties, making it possible for integrations to choose between a standard information model and the native information model of the source system.

Output data (late binding)

Principle - adaptation of data to the receiving system is done as late as possible in the data flow, and as close to the receiving system as possible.

Unmodified dataset as output

When writing data out of Sesam, the dataset might be transferred as it is (unmodified dataset as output), transformed on the way out, or transferred directly to other sources. A sketch of an unmodified output pipe is shown below.
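A minimal sketch of such an unmodified output pipe, following the naming conventions above; the "bigquery" system id and the url are hypothetical.

{
  "_id": "sale-bigquery-endpoint",
  "type": "pipe",
  "source": {
    "type": "dataset",
    "dataset": "sale-bigquery"
  },
  "sink": {
    "type": "json",
    "system": "bigquery",
    "url": "/receiver"
  }
}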

Manage source code

Sesam projects usually use a Git-based source control service for collaboration and version control of source code.

Git is an open source version control system used to manage code (DTL when working in Sesam). When working in a project, the code is updated constantly and released in new versions, and Git helps manage this. As with all projects, it’s up to the project itself to decide how to manage the source code and what kind of service to use. It is not required to use a source control service, but it is highly recommended.

Examples of real global datasets

Below is an example from a Sesam customer:

global-workorder

global-vehicle

global-sale

global-reporting

global-reading

global-project

global-poweroutage

global-person

global-meterpoint

global-location

global-invoicemain

global-invoicedetail

global-invoice

global-grid

global-fault

global-customer

global-contract

global-communication

global-classification

global-asset

global-account

Another organization’s data model with 13 global datasets:

global-subscription

global-skills

global-site

global-sesam-product

global-person

global-paymentmethod

global-machine

global-event

global-department-employee

global-department

global-CV

global-company

global-customer

A public sector company’s growing list of global datasets:

global-klassifisering

global-organisasjon

global-person

global-prosjekt

global-prosjektoekonomi

global-soeknad

global-statistikk

An energy company’s list of global datasets:

global-asset

global-catalogue

global-classification

global-consumption

global-contract

global-customer

global-document

global-exportobjects

global-facility

global-grid

global-inventory

global-invoice

global-job

global-location

global-market

global-meterpoint

global-sale

global-timeseries

global-vendor

global-workorder

Another public sector company’s list of global datasets:

global-access

global-address

global-asset

global-case

global-classification

global-company

global-contract

global-course

global-document

global-file

global-order

global-person

global-project

global-task