Dataset sink

The dataset sink writes the entities it is given to an identified dataset. The configuration looks like:

Prototype

{
    "type": "dataset",
    "dataset": "id-of-dataset"
}

Properties

Property

Type

Description

Default

Req

dataset

String

The id of the dataset to write entities into. You should normally not have to specify the dataset id as the default value is the pipe id, and there should be a very good reason for the dataset id to be different from the pipe id.

Note: if it doesn’t exist before entities are written to the sink, it will be created on the fly.

Note

The dataset id cannot contain forward slash characters (/) nor can it reference a system: dataset.

The pipe id.

Yes

set_initial_offset

Enum<String>

This property specifies when the sink should set the initial offset on its dataset.

When the initial offset is set then the dataset is considered to be populated.

  • if-source-populated (the default) means that the pipe will set the initial offset when the source is populated and the pipe has consumed all the source entities. This is a very useful default as the populated flag will propagate automatically downstream once datasets get populated upstream.

  • never means that the pipe will never set the initial offset.

  • always means that the pipe will always set the initial offset when the pipe completes successfully.

  • initially means that the pipe will set the initial offset at the start of the pump run.

  • onload means that the initial offset will be set when the pipe is loaded / configured.

if-source-populated
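
As a hedged illustration of the values listed above, a sink that marks its dataset as populated after every successful run might be configured like this. The dataset id is the placeholder from the prototype, not a real dataset:

```json
{
    "type": "dataset",
    "dataset": "id-of-dataset",
    "set_initial_offset": "always"
}
```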

indexes

String or Array

If set to "$ids" then an index on the $ids property will be automatically maintained. This index will then be used by the dataset browser to look up entities both by _id and $ids. The property global_defaults.always_index_ids can be enabled in the service metadata if all dataset sinks should by default maintain an index on $ids.

If the value is an array then it can contain index expressions that should be maintained on the sink dataset. This is typically used for declaring subset indexes.

[]
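
A minimal sketch of a sink that maintains the $ids index, assuming the placeholder dataset id from the prototype. Only the string form described above is shown; the array form depends on index expressions that are not specified here:

```json
{
    "type": "dataset",
    "dataset": "id-of-dataset",
    "indexes": "$ids"
}
```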

track_children

Boolean

If true then the $children property will be compared against the previous version of the entity and a delta produced. This will cause the $children property to be updated on entities just before they are written to the dataset.

This is a special feature that can be used in combination with the ["create-child", ...] DTL function and the emit_children pipe transform. The purpose is to be able to detect deleted children entities when doing incremental syncs.

The effective value of this property is inferred to be true if any of the pipe’s transforms use the create-child DTL function. It is possible to override this by setting the property’s value to false.

Inferred

enable_optimistic_locking

Boolean

If true then the _updated property in each entity will be compared against the previous version of the entity. If the _updated property of at least one entity doesn’t match, an error will be raised and no entities will be written to the target dataset.

The purpose is to guard against two agents trying to update the same entity at the same time; in some cases one doesn’t want the last edit to “win” automatically. The typical use case is a pipe with an http_endpoint source where the HTTP endpoint can be accessed by several independent processes that use the Sesam instance as a storage service. In this case the pipe should not have any transforms, since the http_endpoint will send the resulting entity back to the calling process; if the entity has been transformed by DTL or some other transform, the result might make little sense to the calling process.

false
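
The use case described above could be sketched as follows. This is an illustrative pipe fragment only: the rest of the pipe configuration is omitted, the dataset id is the placeholder from the prototype, and there is deliberately no transform between the source and the sink:

```json
{
    "source": {
        "type": "http_endpoint"
    },
    "sink": {
        "type": "dataset",
        "dataset": "id-of-dataset",
        "enable_optimistic_locking": true
    }
}
```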

circuit_breaker_threshold_factor

Decimal

Specifying this property will enable a circuit breaker on the pipe. It specifies a factor that is used to calculate the circuit breaker limit. The limit is calculated based on the number of unique entity ids in the dataset, i.e. the number of latest entities in the dataset (including deleted entities).

Note that this is a factor and not a percentage, e.g. 0.32 means 32% and 1.5 means 150%. If the factor is 0.5 and the dataset already contains 100 entities, then the circuit breaker will trip if it sees more than 50 new entities.

null

No

circuit_breaker_threshold_count

Integer

Specifying this property will enable a circuit breaker on the pipe. The count specifies the circuit breaker limit directly. The limit defines how many new entities can be written to the dataset before the circuit breaker trips. If this property is set to 100, then 100 entities can be written before it trips.

Note

If both circuit_breaker_threshold_factor and circuit_breaker_threshold_count are specified then the larger of the two resulting limits is used as the circuit breaker limit. The count is in this case typically used to specify the lower limit.

null

No
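
To make the interaction between the two circuit breaker properties concrete, here is a hedged example that sets both. The values are chosen for illustration only:

```json
{
    "type": "dataset",
    "dataset": "id-of-dataset",
    "circuit_breaker_threshold_factor": 0.5,
    "circuit_breaker_threshold_count": 100
}
```

Following the rules described above, if the dataset already contains 100 entities the factor gives a limit of 50 and the count gives a limit of 100, so the larger value, 100, would apply. If the dataset contains 1000 entities the factor gives a limit of 500, which would then be the effective limit.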

deletion_tracking

Boolean

If true (the default), then after a full run any entities that existed in the dataset before the run but that weren’t seen during the run will be deleted.

If false, then any existing entities in the dataset will not be touched. This is only useful in very special circumstances.

true

No

mark_deletion_tracked

Boolean

If true (the default is false), a "$deletion_tracked":true property will be added to entities deleted by deletion tracking after full runs or rescans. See also the deletion_tracking property.

false

No
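
A small sketch of a sink that keeps deletion tracking enabled and also marks the entities it deletes. The dataset id is the placeholder from the prototype, and deletion_tracking is written out explicitly even though true is its default:

```json
{
    "type": "dataset",
    "dataset": "id-of-dataset",
    "deletion_tracking": true,
    "mark_deletion_tracked": true
}
```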

bitset_commit_interval

Integer

Specifies how often dataset bitsets and dataset compaction changes are written to disk. The higher the number the fewer writes, but at the cost of having to redo the work if the pipe fails before completion. The changes are always written to disk once the pipe completes.

1000000

No

prevent_multiple_versions

Boolean or Enum<String>

If true then the pipe will fail if an attempt is made to write a new version of an existing entity to the sink dataset. This is useful if one wants to prevent multiple versions of the same entity from being written to the sink dataset. If set to "ignore" the pipe will not fail but will instead ignore any updates to existing entities in the dataset.

false

No
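
As an illustration of the non-failing variant described above, a sink that silently skips updates to entities already in the dataset might look like this (placeholder dataset id):

```json
{
    "type": "dataset",
    "dataset": "id-of-dataset",
    "prevent_multiple_versions": "ignore"
}
```

Setting the value to true instead would make the pipe fail on the first such update.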

suppress_filtered

Boolean

The default value is false unless it is a full sync and the source is of type dataset and include_previous_versions is false [*]. The purpose of this property is to make it possible to opt-in or opt-out of a specific optimization in the pipe. The optimization is to suppress entities that are filtered out in a transform early so that they are not passed to the sink. This optimization should only be used when the pipe produces exactly one version per _id in the output. The optimization is useful when the pipe filters out a lot of entities.

false [*]

No

max_entity_bytes_size

Integer

Defines the maximum size in bytes of an individual entity as it is stored in a dataset.

104857600 (100MB)

Example configuration

The outermost object would be your pipe configuration, which is omitted here for brevity:

{
    "sink": {
        "type": "dataset",
        "dataset": "Northwind:Customer"
    }
}