Replicated dataset sink
The replicated dataset sink writes entities to a replicated dataset. Unlike the normal dataset sink, it preserves the original entity sequence order and offsets from the upstream source, making the resulting dataset a faithful copy.
A typical use case is renaming a pipe: create a new pipe that sources from the existing dataset and
uses a replicated_dataset sink writing to the new dataset name. Downstream pipes can then be
switched to the new dataset, and the original pipe retired.
Note
Deletion tracking, entity re-posting, and circuit breakers are not supported by this sink type.
Prototype
    {
        "type": "replicated_dataset",
        "dataset": "id-of-dataset"
    }
Properties
| Property | Type | Description | Default | Req |
|---|---|---|---|---|
| dataset | String | The id of the dataset to write entities into. If the dataset does not exist it will be created as a replicated dataset. Note: the dataset id cannot contain forward slash characters (/). | The pipe id. | Yes |
|  | Enum<String> | Controls when the sink marks its dataset as populated. Accepts the same values as the dataset sink: |  |  |
|  | String or Array | If set to |  |  |
|  | Boolean | If |  | No |
|  | Integer(>=1) | Specifies how many entities are processed before bitset updates are persisted to disk. The higher the number the fewer writes, but at the cost of having to redo the work if the pipe fails before completion. The changes are always written to disk once the pipe completes. |  | No |
|  | Boolean | If |  | No |
|  | Boolean | If |  | No |
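The trade-off described for the Integer(>=1) commit-interval property can be illustrated with a little arithmetic. The sketch below is a toy model only, not how the sink is implemented: it assumes a commit happens after every `interval` processed entities and that the pipe fails after `processed_before_failure` entities. The function name and both parameters are hypothetical.

```python
def commit_tradeoff(processed_before_failure, interval):
    """Toy model of a periodic commit.

    Returns (commits_written, entities_to_redo) for a run that fails
    after `processed_before_failure` entities, assuming one commit is
    written after every `interval` entities.
    """
    if interval < 1:
        raise ValueError("interval must be >= 1")
    # Whole intervals completed before the failure were committed;
    # the remainder was processed but never persisted.
    commits_written, entities_to_redo = divmod(processed_before_failure, interval)
    return commits_written, entities_to_redo


# Small interval: many writes, little work lost on failure.
small = commit_tradeoff(995, 10)
# Large interval: few writes, much more work lost on failure.
large = commit_tradeoff(995, 500)
```

In this model a failure after 995 entities costs 99 commits and 5 entities of rework with an interval of 10, versus 1 commit and 495 entities of rework with an interval of 500.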
Example configuration
Renaming a pipe by copying the old-pipe dataset to new-pipe:
    {
        "_id": "new-pipe",
        "type": "pipe",
        "source": {
            "type": "dataset",
            "dataset": "old-pipe",
            "include_previous_versions": true
        },
        "sink": {
            "type": "replicated_dataset",
            "dataset": "new-pipe"
        }
    }
Once new-pipe is fully populated, downstream pipes can be switched from old-pipe to
new-pipe and old-pipe can be retired.
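The switch-over step can be sketched as a small script that edits pipe configurations before they are uploaded. This is a minimal sketch under stated assumptions: it works on plain Python dicts shaped like the configuration above, it only handles pipes whose source is a single dataset source, and the function name, the `skip_ids` parameter, and the downstream pipe ids are hypothetical. It calls no Sesam API.

```python
import copy


def repoint_dataset_sources(pipes, old_dataset, new_dataset, skip_ids=()):
    """Return copies of pipe configs with dataset sources switched.

    Pipes listed in `skip_ids` are left alone; the replicating pipe
    itself must keep reading from the old dataset.
    """
    updated = []
    for pipe in pipes:
        pipe = copy.deepcopy(pipe)  # never mutate the caller's configs
        source = pipe.get("source", {})
        if (
            pipe.get("_id") not in skip_ids
            and source.get("type") == "dataset"
            and source.get("dataset") == old_dataset
        ):
            source["dataset"] = new_dataset
        updated.append(pipe)
    return updated


pipes = [
    {"_id": "new-pipe", "type": "pipe",
     "source": {"type": "dataset", "dataset": "old-pipe"},
     "sink": {"type": "replicated_dataset", "dataset": "new-pipe"}},
    {"_id": "downstream-a", "type": "pipe",
     "source": {"type": "dataset", "dataset": "old-pipe"}},
    {"_id": "downstream-b", "type": "pipe",
     "source": {"type": "dataset", "dataset": "unrelated"}},
]

switched = repoint_dataset_sources(
    pipes, "old-pipe", "new-pipe", skip_ids={"new-pipe"}
)
```

Here only downstream-a is repointed; new-pipe keeps sourcing from old-pipe until the original pipe is retired, and downstream-b is untouched because it reads a different dataset.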