Replicated dataset sink¶

The replicated dataset sink writes entities to a replicated dataset. Unlike the normal dataset sink, it preserves the original entity sequence order and offsets from the upstream source, making the resulting dataset a faithful copy.

A typical use case is renaming a pipe: create a new pipe that sources from the existing dataset and uses a replicated_dataset sink writing to the new dataset name. Downstream pipes can then be switched to the new dataset, and the original pipe retired.

Note

Deletion tracking, entity re-posting, and circuit breakers are not supported by this sink type.

Prototype¶

{
    "type": "replicated_dataset",
    "dataset": "id-of-dataset"
}

Properties¶

Property	Type	Description	Default	Req
`dataset`	String	The id of the dataset to write entities into. If the dataset does not exist it will be created as a replicated dataset. Note: the dataset id cannot contain forward slash characters (`/`) nor can it reference a `system:` dataset.	The pipe id.	Yes
`set_initial_offset`	Enum<String>	Controls when the sink marks its dataset as populated. Accepts the same values as the dataset sink: `if-source-populated` (default), `never`, `always`, `initially`, and `onload`.	`if-source-populated`
`indexes`	String or Array	If set to `"$ids"` then an index on the `$ids` property will be automatically maintained. If the value is an array then it can contain index expressions that should be maintained on the sink dataset. This is typically used for declaring subset indexes.	`[]`
`change_tracking`	Boolean	If `true` (the default), the sink will deduplicate identical entity versions and avoid writing an entity if it is identical to the previous version already stored in the dataset.	`true`	No
`bitset_commit_interval`	Integer(>=1)	Specifies how many entities are processed before bitset updates are persisted to disk. The higher the number the fewer writes, but at the cost of having to redo the work if the pipe fails before completion. The changes are always written to disk once the pipe completes.	`1000000`	No
`allow_normal_dataset`	Boolean	If `true`, the sink will fall back to the normal dataset writer if the sink dataset already exists as a `NORMAL` type dataset, instead of raising an error. This is useful during a migration from a normal dataset to a replicated dataset.	`false`	No
`auto_migrate_to_replicated_dataset`	Boolean	If `true`, the sink will automatically convert an existing `NORMAL` dataset to a `REPLICATED` type in an atomic, re-entrant operation. All downstream pipes are quiesced, the old dataset is replaced, and downstream offsets are updated before the pipe re-runs.	`false`	No

Example configuration¶

Renaming a pipe (copying old-pipe dataset to new-pipe):

{
    "_id": "new-pipe",
    "type": "pipe",
    "source": {
        "type": "dataset",
        "dataset": "old-pipe",
        "include_previous_versions": true
    },
    "sink": {
        "type": "replicated_dataset",
        "dataset": "new-pipe"
    }
}

Once new-pipe is fully populated, downstream pipes can be switched from old-pipe to new-pipe and old-pipe can be retired.

Null sink

REST sink