Pipes

A pipe defines the flow of data from a source to a sink on some schedule as defined by the pump settings. Optionally, a pipe may define an ordered list of transforms that are applied to entities as they flow from the source to the sink. As the name implies, a pump “pumps” data in the form of entities from the source to the sink at regular or scheduled intervals. A chain of transforms can be placed in between the source and the sink, so that entities are transformed on their way to the sink.

The pipe configuration consists of a source, transform, sink and a pump.

Note that the forward slash character (“/”) is not allowed in the pipe _id property.

Prototype

The following JSON snippet shows the general form of a pipe definition.

{
  "_id": "pipe-id",
  "name": "Name of pipe",
  "description": "This is a description of the pipe",
  "comment": "This is a comment",
  "type": "pipe",
  "source": {
  },
  "transform": {
  },
  "sink": {
  },
  "pump": {
  },
  "metadata": {
  }
}

Note that if no name property is explicitly set for the source, sink or pump configurations one will be generated based on the name of the pipe (i.e. the contents of this property postfixed with “source”, “sink” or “pump” respectively).

Batching

Pipes support batching if the sink supports batching. It does this by accumulating source entities in a buffer before writing the batch to transforms and the sink. The size of each batch can be specified using the batch_size property on the pipe. The default batch size is usually 100, but this may vary depending on the source- and sink-type used in the pipe. The REST sink will for instance make the default batch_size 1.

Note that the sink may have its own batch_size property. This is useful if the pipe has transforms that produce more entities than the number of entities taken as input.

Properties

Property

Type

Description

Default

Req

_id

String

The id of the pipe, this should be unique within a Sesam service instance. Note that you cannot use the / character in the id property.

Yes

type

String

The type of the component, for pipes the only allowed value is “pipe”

Yes

name

String

A human readable name of the component.

description

String or list of strings

A human readable description of the component.

comment

String or list of strings

A human readable comment on the component.

permissions

List of ACL elements

This property should contain a list of ACL definitions (itself a 3-element list (tuple) of “ALLOW”|”DENY”,[list,of,groups,or,roles],[list,of,permissions]) that defines which permissions should be applied to the pipe when it’s uploaded and instantiated. See the example configuration below for how this should be formatted. You will find the list of pipe and system permissions here.

batch_size

Integer(>=1)

The number of source entities to consume before writing to the sink. The batch size can be used to buffer up entities so that they can be written together to the sink in one go. The sink must support batch for the bulking to happen. This may increase the throughput of the pipe, at the cost of extra memory usage. If the batch fails, then entities will be retried individually.

usually 100, but varies with other pipe settings.

checkpoint_interval

Integer(>=1)

Specifies how often the pipe offset is saved. It says how many batches must be processed before the pipe offset is saved the next time. Note that the pipe offset is always saved at the end of the sync if it changed.

The default value is 10000/batch_size = 100, i.e. the checkpoint happens every 100 batches. The exception is if batch_size is 1, in which case the default value of checkpoint_interval is also set to 1.

100 (1 if batch_size=1)

disable_set_last_seen

Boolean

If this flag is set to true, it will no longer be possible to reset or set the ‘last seen’ parameter for this pipe. The primary use case for this property is when you need to protect the pipe from accidental resets.

false

enable_background_rescan

Boolean

When set to true, enables running pipe rescans in the background for this pipe.

false

rescan_when_config_changes

Boolean

If automatic reprocessing is enabled, setting this property to true will make this pipe automatically reset itself when the pipe config changes (the reset happens the next time the pipe is run). If automatic reprocessing is not enabled, setting this property to true will mark this pipe as out-of-sync when the pipe config changes (the out-of-sync marking happens the next time the pipe is run). The default value is inherited from the service metadata.

false

source

Object

A configuration object for the source component of the pipe.

Yes

transform

Object/List

Zero or more configuration objects for the transform components of the pipe. The default is to do no transformation of the entities. If a list of more than one transform components is given, then they are chained together in the order given. This means that the output of the first transform is passed as the input of the second, and so on. The output of the last transform is then passed to the sink. The first transform gets its input from the source.

sink

Object

A configuration object for the sink component of the pipe. If omitted, it defaults to a dataset sink with its dataset property set to same as the pipe’s _id property.

pump

Object

A configuration object for the pump component of the pipe.

infer_pipe_entity_types

Boolean

Schema inference is enabled for all pipes by default. Setting this property to false will disable schema inference for this pipe.

Note

The default value is false for developer subscriptions.

true

dependency_tracking.dependency_warning_threshold

Integer

The number of entities that dependency tracking can keep in memory at a given time. If this number is exceeded then a warning message is written to the log. The default value is inherited from the service metadata.

10000

dependency_tracking.dependency_error_threshold

Integer

The number of entities that dependency tracking can keep in memory at a given time. If this number is exceeded then the pump will fail. The default value is inherited from the service metadata. Do not set this value too high as it may cause excessive memory usage.

50000

dependency_tracking.dependency_warning_threshold_total_bytes

Integer

The number of bytes that dependency tracking can keep in memory at a given time. If this number is exceeded then a warning message is written to the log. The default value is inherited from the service metadata.

33554432 (32MB)

dependency_tracking.dependency_error_threshold_total_bytes

Integer

The number of bytes that dependency tracking can keep in memory at a given time. If this number is exceeded then the pump will fail. The default value is inherited from the service metadata. Do not set this value too high as it may cause excessive memory usage.

134217728 (128MB)

dependency_tracking.enable_hops_thresholds

Boolean

If true, then warning and error thresholds that apply for dependency tracking also apply for regular "hops" expressions. The default value is inherited from the service metadata. It is recommended that you set this property to true in development environments.

false

Example configuration

The following example shows a pipe definition that exposes data from HubSpot’s REST API through the endpoint companies, and feeds it into a sink that writes the data into a dataset with the same _id as the pipe.

{
  "_id": "hubspot-company-collect",
  "type": "pipe",
  "source": {
    "type": "rest",
    "system": "hubspot",
    "id_expression": "{{ id }}",
    "operation": "get",
    "payload_property": "results",
    "properties": {
      "url": "companies"
    }
  },
  "permissions": [
       ["allow", ["group:Developer"], ["read_config", "read_data", "write_data"]],
       ["deny", ["1298aedf-f1c9-42ed-bfbf-00b6488d39b7_Clown"], ["read_config"]]
   ]
  "add_namespaces": false,
  "namespaced_identifiers": false
}

Pipe Features

The following are available features that can be activate and/or changed on any pipe in Sesam.

Feature

Description

Compaction

Compaction decides how and when old entities should be removed from datasets.

Circuit breakers

Circuit breakers prevent pipes from running if the number of changed entities is too large.

Automatic reprocessing

Automatic reprocessing automatically resets a pipe when it is out of sync with its input data.

Completeness

Completeness lets pipes hold of processing of data until upstream dependencies are processed.

Namespaces

Namespaces allows tracking of property origins in Sesam.

Durable data

Durable data allows additional data storage to minimize likelihood of dataloss.