Concepts

Introduction

Sesam

This document introduces concepts that are key to understanding and working with Sesam.

Sesam is a Master Data Hub built on a streaming dataflow data integration and processing system. It stores data in a data hub. The hub is optimised for getting data from source systems, transforming data, and providing data to target systems.

Sesam gets raw data from source systems and stores it in datasets. Pipes can be defined to process datasets to construct new datasets. Transforms can join data across datasets to create new shapes of data. Data from these datasets can be exposed and delivered to other systems. The entire system is driven by the state change of entities.

The primary building block for building flows is the pipe. A pipe gets data from a source, transforms it and writes it to a sink. The data that flows through pipes is a stream of entities, which are like JSON objects. Pipes are the active components that get data into the data hub, make data flow through it and provide data to target systems.

Why Sesam

The data hub is the go-to place for data within the enterprise. Integrations no longer have to be point-to-point. Systems can be loosely coupled instead of being tightly coupled, as is the case for direct integrations. With Sesam, individual systems no longer have to depend on other systems being up. It is also a lot easier to replace systems or to perform migrations. Sesam is the active part and will schedule how and when pipes are run. If a system is down, the pipe will try getting or sending the data once the system is back up.

With the help of features like streaming, merging, namespaces and global datasets, Sesam enables higher-quality master data management.

The Sesam service is built around the principle that Sesam does not own the data stored in the data hub. The idea is that all the data in the data hub can be re-read from the sources and thus be fully rebuilt from scratch.

Streams of data

Sesam consumes and produces streams of entities. An entity is very much like a JSON object and consists of a number of key-value pairs along with some special reserved property names. See the entity data model document for more details about entities.

The following is a quick example of the shape of entities that are consumed and exposed by Sesam.

[
    {
        "_id": "1",
        "name": "Bill",
        "dob": "01-01-1980"
    },
    {
        "_id": "2",
        "name": "Jane",
        "dob": "04-10-1992"
    }
]

Streams of entities flow through pipes. A pipe has an associated pump that is scheduled to regularly pull data entities from the source, push them through any transforms and send the results to the sink. The most common source is the dataset source which reads entities from a dataset. The most common sink is the dataset sink which writes entities to a dataset. There are also sources and sinks that can read and write data to and from external systems outside of Sesam.

Note

Sesam's service API is not built to serve a large number of concurrent clients. Sesam is primarily an asynchronous batching and stream processing system. The service API is not meant to be used by user-facing applications that have low latency and high throughput requirements, and for that reason we do not currently give any guarantees in this regard. In practice this means that if you have such a requirement you should stream the data out of Sesam and host it in a dedicated publishing system that can scale its endpoints.

Datasets

A dataset is the basic means of storage inside Sesam. A dataset is a log of entities supported by primary and secondary indexes. A dataset sink can write entities to the dataset. An entity is appended to the log if it is new (as in, an entity with a never-before-seen _id property) or if it is different from the previous version of the same entity.

A content hash is generated from the content of each entity. This hash value is used to determine if an entity has changed over time. The content hashing is what enables change tracking.

The dataset source exposes the entities from the dataset so that they can be streamed through pipes. As the main data structure is a log, the source can read from a specific location in the log. Datasets have full continuation support.


Configuration

Systems

A system is any database or API that could be used as a source of data for Sesam or as the target of entities coming out of Sesam. The system components provide a way to represent the actual systems being connected or integrated.

The system component has a couple of uses. Firstly, it can be used to introspect the underlying system and return lists of possible 'source' or 'sink' targets. Often this information can be used on the command line or in the Sesam Management Studio to quickly and efficiently configure how Sesam consumes or delivers data.

The other use of the system is that it allows configuration that may apply to many source definitions, e.g. connection strings, to be located and managed in just one place. Systems also provide services like connection pooling and rate limiting.
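
As an illustration, a system definition might gather connection details in one place so that several pipes can refer to the system by its _id. The type name and properties below are illustrative assumptions rather than a definitive configuration:

{
    "_id": "our-crm",
    "type": "system:url",
    "base_url": "https://crm.example.com/api/"
}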

You can also run your own extension systems.

Pipes

A pipe is composed of a source, a chain of transforms, a sink, and a pump. It is an atomic unit that makes sure that data flows from the source to the sink. It is a simple way to talk about the flow of data from a source system to a target system. The pipe is also the only way to specify how entities flow from dataset to dataset.
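
As a sketch, a minimal pipe configuration could read from one dataset and write to another; here we assume the dataset sink writes to a dataset named after the pipe, and the property names are illustrative rather than a complete reference:

{
    "_id": "customer-copy",
    "type": "pipe",
    "source": {
        "type": "dataset",
        "dataset": "customers"
    },
    "sink": {
        "type": "dataset"
    }
}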


Sources

A source exposes a stream of entities. Typically, this stream of entities will be the entities in a dataset, rows of data in a SQL database table, the rows in a CSV file, or JSON data from an API.


Sources have varying support for continuations. Sources that do support them accept an additional parameter called a since token. This token is used to fetch only the entities that have changed since the last time Sesam asked for them. The since token is an opaque string that may take any form; it is interpreted only by the source. For example, for a SQL source it might be a datestamp, while for a log-based source it might be an offset.

Sesam provides a number of out-of-the-box source types, such as SQL and LDAP. It is also easy for developers to expose a microservice that can supply data from an external service. The built-in json source is able to consume data from these endpoints. These custom data providers can be written and hosted in any language.

To make this process as easy as possible, there are a number of template projects hosted on our GitHub.
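
As a sketch of what such a custom data provider might return, each entity could carry a property (here assumed to be _updated, holding a timestamp) that the provider also accepts back as a since request parameter, so that only newer entities are returned on the next run:

[
    {
        "_id": "order-1001",
        "_updated": "2023-05-01T10:15:00Z",
        "status": "shipped"
    },
    {
        "_id": "order-1002",
        "_updated": "2023-05-01T10:20:00Z",
        "status": "pending"
    }
]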

Transforms

Entities streaming through a pipe can be transformed on their way from the source to the sink. A transform chain takes a stream of entities, transforms them, and creates a new stream of entities. There are several different transform types supported; the primary one being the DTL transform, which uses the Data Transformation Language (DTL) to join and transform data into new shapes.

DTL has a simple syntax and model where the user declares how to construct a new data entity. It has commands such as 'add', 'copy', and 'merge'. These may operate on properties, lists of values or complete entities.
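
A small DTL rule using the commands mentioned above might look like the following sketch, which copies all source properties and adds a new one; the rule structure and the _S reference to the source entity are shown for illustration:

{
    "rules": {
        "default": [
            ["copy", "*"],
            ["add", "display-name", "_S.name"]
        ]
    }
}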


In general, DTL is applied to entities in a dataset and the resulting entities are pushed into a sink that writes to a new dataset. The new dataset is then used as a source for sinks that write the data to external systems.

Sinks

A sink is a component that consumes entities fed to it by a pump. The sink is responsible for writing these entities to the target, handling transactional boundaries and, where the target system supports it, batching multiple entities together.

Several types of sinks, such as the SQL sink, are available. Using the JSON push sink enables entities to be pushed to custom microservices or other Sesam service instances.
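
As a hedged sketch, a pipe that delivers entities to an external endpoint might use the JSON push sink roughly like this; the sink property names and the referenced system are illustrative assumptions:

{
    "_id": "customers-to-crm",
    "type": "pipe",
    "source": {
        "type": "dataset",
        "dataset": "customers-out"
    },
    "sink": {
        "type": "json",
        "system": "our-crm",
        "url": "receivers/customers"
    }
}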


Pumps

A pump handles the mechanics of moving data from a source to a sink. It runs periodically, or on a cron schedule, reading entities from the source and writing them to the sink.

It's also capable of rescanning the source from scratch at configurable points in time. If errors occur during reading or writing of entities, it will keep a log of the failed entities and in the case of writes it can retry writing an entity later.

The retry strategy is configurable in several ways and if an end state is reached for a failed entity, it can be written to a dead letter dataset for further processing.

Flows

Pipes read from sources and write to sinks. The output of one pipe can be read by many downstream pipes. In this way pipes can be chained together into a directed graph – also called a flow. In some special situations you may also have cycles in this graph. The Sesam Management Studio has features for visualising and inspecting flows.

Environment Variables

An environment variable is a named value that you can reference in your configuration. Environment variables are used to parameterize your configuration so that you can easily enable/disable or change certain aspects of your configuration. If you have an environment variable called myvariable then you can reference it in configuration like this: "$ENV(myvariable)". Do not use environment variables for sensitive values; use secrets instead. Environment variables are global only.
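
For example, a configuration property could be parameterized like this (the surrounding property name is just an illustration):

{
    "base_url": "$ENV(myvariable)"
}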

Secrets

Secrets are like environment variables except that they are write-only. Once written to the API you cannot read them back out, but you can reference them in your configuration. They should be used for sensitive values like passwords and other credentials. A secret can only be used in certain locations of the configuration. If you have a secret called mysecret then you can reference it in configuration like this: "$SECRET(mysecret)". Secrets can either be global or be local to a system (recommended).
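
For example, a system configuration could reference the secret for a credential property (the surrounding properties are illustrative):

{
    "_id": "our-crm",
    "type": "system:url",
    "base_url": "$ENV(myvariable)",
    "password": "$SECRET(mysecret)"
}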

Service Metadata

The service metadata is a singleton configuration entity that is used for service-wide settings.

Features

Scheduling and signalling

The active part of a pipe is called a pump. A pump makes entities flow through the pipe. It can be scheduled to run at regular intervals. These intervals can be specified in seconds or using a cron expression. One can also optionally schedule the pipe to do full rescans.
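
As a sketch, the schedule is configured on the pipe's pump; the property names below (an interval in seconds, or alternatively a cron expression) reflect typical pump settings and should be treated as illustrative:

{
    "_id": "customer-copy",
    "type": "pipe",
    "source": {
        "type": "dataset",
        "dataset": "customers"
    },
    "sink": {
        "type": "dataset"
    },
    "pump": {
        "schedule_interval": 30
    }
}

A cron_expression property would typically be used in place of schedule_interval when a cron schedule is wanted.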

Signalling is an optional feature that automatically signals downstream pipes when data changes upstream. The signal then schedules the pump for immediate execution. This feature allows for new data to flow downstream at a much faster pace than if the pumps just ran at scheduled intervals.

Continuation Support

Sources can optionally support a since marker which lets them pick up where the previous stream of entities left off - like a "bookmark" in the entity stream. This continuation support allows a pipe to process changes incrementally. The next time the pipe runs it will continue where the previous run finished. Combined with change tracking this reduces the amount of work that needs to be done.

Change Tracking

Sesam is special in that it really cares when data has changed. The typical pattern is to read data from a source and push it to a sink that is writing into a dataset. The dataset is essentially a log of the entities it receives. However, if a new log entry was added every time the source was checked then the log would grow very fast and be of little use. There are mechanisms at both ends to prevent this. When reading data from a source, it may be possible to ask only for the entities that have changed since the last time, if the source supports it. This uses the knowledge of the source, such as a last updated timestamp, to ensure that only entities that have been created, deleted or modified are exposed. On the side of the dataset, regardless of where the data comes from, an incoming entity is compared with the existing version of that entity and only written to the log if they are different. The comparison is done by comparing the hashes of the old and new entity.

Deletion Tracking

The dataset sink is capable of detecting that entities have disappeared from the source. It can do this when the pipe does a full rescan. At the end of a pipe run the sink will write a deleted version of those entities (where the "_deleted" property is set to true). This is a particularly useful feature when the source itself is not able to emit deletes. It is also useful in cases where filters or other configuration changes cause previously emitted entities to no longer be produced by the pipe.
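
For illustration, a deleted version of an entity in a dataset carries the "_deleted" property set to true, along the lines of:

{
    "_id": "2",
    "_deleted": true,
    "name": "Jane",
    "dob": "04-10-1992"
}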

Dependency Tracking

One of the really smart things that Sesam can do is to understand complex dependencies in DTL. This is best described with an example. Imagine a dataset of customers and a dataset of addresses. Each address has a property customer_id that is the primary key of the customer entity to which it belongs. A user creates a DTL transform that processes all customers and creates a new customer-with-address structure that includes the address as a property. To do this they can use the hops function to connect the customer and address. This DTL transform forms part of a pipe, so when a customer entity is updated, added or deleted it will be at the head of the dataset log and will be processed the next time the pump runs. But what if the address changes? As far as the expected output is concerned, the customer entity has effectively changed as well.
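
A hedged sketch of the customer-with-address transform described above might use the hops function roughly as follows; the exact hops syntax (the dataset alias and the where clause) is shown as an assumption for illustration:

["add", "address",
    ["hops", {
        "datasets": ["addresses a"],
        "where": [
            ["eq", "a.customer_id", "_S._id"]
        ]
    }]
]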

This is in essence a problem of cache invalidation for complex queries. Sesam solves it thanks to its dedicated transform language, which allows the transform to be introspected to see where the dependencies are. Once the dependencies are understood, Sesam can build data structures and events that recognise that a change to an address should put the corresponding customer entity at the front of the dataset log again. Once it is there, it will be pulled the next time the pump runs and a new customer entity containing the updated address is exposed.

Note

Only pipes that use the dataset source support dependency tracking. The primary reason for that is a technical one; the tracked entities need to be looked up by id before a specific point in time and fed through the pipe. This is currently only implemented for the dataset source type. It is unlikely that it can be implemented for other source types as those have latency and ambiguity issues.

Automatic Reprocessing

There are many possible reasons why a pipe may fall out of sync. Configuration may change, datasets may be deleted and then recreated, sources may be truncated, data may be restored from backup, joins to new datasets can be introduced and so on. In these cases the pipe should be reset and perform a full rescan to get a new view of the world. Sesam has a feature called automatic reprocessing that detects that a pipe has fallen out of sync and needs to be reset. This is currently an opt-in feature; if you enable it on the pipe or in the service metadata, the pipe will automatically reset itself and perform a full rescan, making sure that it is no longer out of sync. In some situations it may only need to rewind a little instead of doing a full rescan; either way, the end result is a pipe that is back in sync.

Namespaces

Namespaces are inspired by the Resource Description Framework (RDF). You'll see them in terms of namespaced identifiers, also called NIs. An NI is a special datatype defined in the entity data model. In essence it is a string consisting of two parts, the namespace and the identifier. "~:global-person:john-doe" is an example. The ~: is the type part that tells you that it is a namespaced identifier. In this case global-person is the namespace and john-doe is the identifier.

Properties can also have namespaces, but here the ~: part is not used. global-person:fullname is an example of such a namespaced property. Namespaced properties are essential when merging to avoid naming collisions and to maintain provenance of the properties.
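
Putting these together, an entity using namespaced identifiers and namespaced properties could look something like this illustrative example:

{
    "_id": "global-person:john-doe",
    "global-person:fullname": "John Doe",
    "global-person:manager": "~:global-person:jane-roe"
}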

A namespaced identifier is a unique reference to an abstract thing. It is an identifier. In Sesam it is not a globally unique identifier, but it is unique inside one Sesam datahub. There are mechanisms in place for collapsing and expanding namespaced identifiers to globally unique identifiers on import and export.

Namespaced identifiers and properties with namespaces will automatically expand to fully qualified Uniform Resource Identifiers (URIs) when exporting to RDF. URIs in RDF are similarly collapsed into namespaced identifiers and properties with namespaces on import. They can also be expanded and collapsed using DTL.

Sesam can utilize RDF for input, transformation or producing data for external consumption.

Global datasets

The use of global datasets is described in depth in the Best Practice document. The principle is to have a single go-to dataset for each type of data. A global dataset typically co-locates and merges data from many different sources.

Merging

An essential feature that enables global datasets is the ability to merge different entities that represent the same thing into one entity. Organizations often have multiple systems that share overlapping information about employees, customers, products etc. The merge source lets you define equivalence rules that enable you to merge entities. The merge source is able to merge incrementally, producing a stream of entities that have been merged – or unmerged when an equivalence rule no longer applies.
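
A merge source configuration might sketch out such equivalence rules roughly as follows; the property names and rule syntax are assumptions for illustration:

{
    "_id": "global-person",
    "type": "pipe",
    "source": {
        "type": "merge",
        "datasets": ["crm-person cp", "hr-person hp"],
        "equality": [
            ["eq", "cp.email", "hp.email"]
        ]
    },
    "sink": {
        "type": "dataset"
    }
}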

Transit encoding

Sesam's entity data model is a JSON compatible data model. JSON itself supports a limited number of data types, so in order to make the model richer, the entity data model supports a subset of the Transit data types. Transit encoding is a technique for encoding a larger set of data types in JSON. See the entity data model for more information about this encoding.

Compaction

A dataset is an append-only immutable log of data that would, left unchecked, grow forever. This problem is partly mitigated as entities are only written to the log if they are new or different (based on a content hash comparison) from the most recent version of that entity. To supplement this, and to ensure that a dataset does not consume all available disk space, a retention policy can be defined. A retention policy describes the general way in which the log should be compacted. The default policy is to keep two versions of every entity. This is the minimal number of versions to keep in order to make dependency tracking work. A time-based policy is also available, allowing you to say how old an entity can be before it becomes a candidate for compaction.

Completeness

Completeness is a feature that you typically enable on outgoing pipes. It makes sure that all pipes that this pipe is dependent on have run before it processes the source entities of this pipe. The timestamp of the source entity is compared with the completeness timestamp that was inherited from its upstream and dependent pipes. This feature effectively holds back the processing of source entities until it can be sure that dependent pipes have completed. This is useful when you want to have a final entity version before you send it to the target system. It also reduces the number of times you have to send the entity to the target system as there might be several state transitions until the entity can be considered complete.

Circuit Breakers

A circuit breaker is a safety mechanism that one can enable on the dataset sink. The circuit breaker will trip if a larger than expected number of entities are written to a dataset in a pipe run. When tripped, the pipe will refuse to run and it has to be untripped manually. This safety mechanism is there to prevent unforeseen tsunamis of changes and to prevent them from propagating downstream.

Notifications

Monitoring of pipes can be enabled. Once a pipe is being monitored, you can add notification rules to it and be alerted when those rules are triggered. You can get notification alerts in the user interface or by email.

Extensions

Sesam provides a finite number of built-in systems, but you can build and run your own microservice extension systems. The microservice system allows you to host custom Docker images inside the Sesam service.
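
A hedged sketch of such a microservice system definition, with the Docker-related properties as illustrative assumptions, might look like this:

{
    "_id": "our-connector",
    "type": "system:microservice",
    "docker": {
        "image": "ourcompany/our-connector:1.0",
        "port": 5000
    }
}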

Network Policy

A network policy gives you the option of blocking all public access, or of denying all access except from a whitelist of IP addresses and ranges. In the new architecture it is possible to push the IP whitelisting down to the reverse proxy and also to allow public and restricted access to pipes through custom rules on the pipes. There are currently no restrictions on outgoing traffic.

VPN

You can extend Sesam into your own network using an IPsec-based Virtual Private Network. You can configure VPN under Subscriptions Settings in the Management Studio. Note that there is an additional surcharge for VPN; see Subscription Fee, payment terms for more information.

A list of supported VPN devices and configuration guides can be found at https://docs.microsoft.com/en-us/azure/vpn-gateway/vpn-gateway-about-vpn-devices.

Status Page

Sesam hosts a status page at https://status.sesam.io/. There you will find the real-time operational status of the Sesam services. Any incidents will be reported there, but you can also register and get emails when there are changes in the operational status. A notification badge will also be shown in the Sesam Management Studio when incidents occur. If you have other custom requirements there is also a provisional Status API that you can use.

Software channels

Sesam software is released through a phased rollout scheme. There are four different release channels – commonly called canaries. This is done to give changes and new features some time in non-production environments before they are rolled out to production. The goal is to reduce risk.

The available channels are:

  • weekly-prod is released bi-weekly and is the most stable release. Use this in production!
  • weekly is released once a week. Use this in staging environments.
  • nightly is released every night. Use this in development environments.
  • latest is released every time a pull request is merged. Use this only for development environments, and only when you know what you're doing.

Note

We may, for any reason, choose not to promote a new version to a software channel, so build dates correspond to a minimum age, not a maximum age.

Weekly and nightly upgrades are performed between 00:00 and 03:00 CET. Weekly upgrades are performed during the night leading into Monday. Security hotfixes will not wait for the scheduled window. Downgrades are not supported.