CSV source

The CSV data source translates the rows of files in CSV format to entities.

The configuration options are:

Prototype

{
   "type": "csv",
   "system": "a-valid-url-or-microservice-system-id",
   "url": "url-to-csv-file",
   "has_header": true,
   "field_names": ["mappings","from","columns","to","properties"],
   "auto_dialect": true,
   "dialect": "excel",
   "encoding": "utf-8",
   "decode_error_strategy": "strict-or-replace",
   "primary_key": ["list","of","column","names"],
   "whitelist": ["list","of","column","names","to","include"],
   "blacklist": ["list","of","column","names","to","exclude"],
   "preserve_empty_strings": false,
   "delimiter": ",",
   "escape_null_bytes": false
}

Properties

Property	Type	Description	Default	Req
`url`	String	The URL of the `CSV` file to load.		Yes
`system`	String	The ID of the URL system or microservice system component to use.		Yes
`has_header`	Boolean	Flag that indicates to the source that the first row in the `CSV` file contains the names of the columns. If this property is set to `false`, you will have to provide a list of column names in the `field_names` property.	true
`field_names`	List	If set, specifies the names of the columns. It takes precedence over the header in the CSV file if present.
`auto_dialect`	Boolean	Flag that hints to the source that it should try to guess the dialect of the `CSV` file on its own. Note that if `dialect` is explicitly set, `auto_dialect` is ignored.	true
`dialect`	String	Specifies what type of CSV file the file is. A dialect is essentially a preset of the other format properties. The recognised values are `"excel"`, `"excel_tab"` and `"unix_dialect"`. Note that if `dialect` is explicitly set, `auto_dialect` is ignored. If `auto_dialect` is `false` and no `dialect` has been explicitly set, the `excel` dialect is used.
`encoding`	String	The character set used to encode the text in the CSV file.	“UTF-8”
`decode_error_strategy`	String	An enumeration of “strict” and “replace” that tells the character decoder how to deal with illegal characters in the input data. The default is “strict”, which raises an error and stops processing. The “replace” option will log a warning and attempt to replace the offending character(s) with the Unicode “replacement character”; see https://en.wikipedia.org/wiki/Specials_%28Unicode_block%29 for more details. Use the “replace” option with extreme care, as it can lead to data loss if you are not absolutely sure of what you are doing. The preferred option should always be to fix the data at the source.	“strict”
`primary_key`	List<String> or String	The name of the column(s) to use as `_id` in the generated entities. It can be either a list of strings (if the identity is a compound value) or a single column name (i.e. a string). The column name(s) are case sensitive and must match the contents of either `field_names` or the header of the CSV file.		Yes
`whitelist`	List<String>	The names of the columns to include in the generated entities. If there is a `blacklist` also specified, the whitelist will be filtered against the contents of the blacklist.
`blacklist`	List<String>	The names of the columns to exclude from the generated entities. If there is a `whitelist` also specified, the blacklist operates on the values of the whitelist (and not on the whole set of columns).
`preserve_empty_strings`	Boolean	If set to `true`, column values that are empty strings are included in the generated entities. By default these are omitted.	`false`
`delimiter`	String	The character or string to use as the `CSV` field separator (delimiter).	“,”
`escape_null_bytes`	Boolean	If set to `true`, null characters in the CSV will be escaped before the data is parsed. Null characters in a CSV file can cause the pipe to fail if they are not escaped. By default, this is set to `false` for performance reasons.	`false`
`if_source_empty`	Enum<String>	Determines the behaviour of the pipe when the CSV source returns no entities. Normally, any previously synced entities will be deleted even if the pipe does not receive any entities from its source. If set to `"fail"`, the pipe will automatically fail if the source returns no entities. This means that any previous entities in the pipe’s dataset are not deleted. If set to `"accept"`, the pipe will not fail and any previously synced entities will be deleted. The global default `global_defaults.if_source_empty` can be set for all pipes in the service metadata.	`"accept"`
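
As an illustration of how several of these properties combine, the following is a minimal sketch of a source for a file without a header row. The system ID `my-url-system`, the file path and the column names are hypothetical; only the property names are taken from the table above.

```json
{
    "type": "csv",
    "system": "my-url-system",
    "url": "exports/customers.csv",
    "has_header": false,
    "field_names": ["customer_id", "country", "name", "email"],
    "primary_key": ["customer_id", "country"],
    "whitelist": ["customer_id", "country", "name", "email"],
    "blacklist": ["email"],
    "preserve_empty_strings": true
}
```

Going by the property descriptions, the `_id` of each entity would be derived from the `customer_id` and `country` columns, and the `email` column would be left out because the blacklist is applied to the contents of the whitelist.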

Continuation support

See the section on continuation support for more information.

Property	Value
`supports_since`	`false` (Default)
`is_since_comparable`	`true` (Default)
`is_chronological`	`false` (Default)

Example configuration

The outermost object is your pipe configuration; its other properties are omitted here for brevity:

{
    "source": {
        "type": "csv",
        "url": "http://blog.plsoucy.com/wp-content/uploads/2012/04/countries-20140629.csv",
        "primary_key": "Code",
        "encoding": "iso-8859-1"
    }
}
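
A variant of the configuration above, sketched under stated assumptions: it adds the `system` property that the properties table marks as required (the ID `countries-url-system` and the shortened `url` are hypothetical), states the decoding strategy explicitly, and sets `if_source_empty` to `"fail"` so that an empty result would fail the pipe instead of deleting previously synced entities.

```json
{
    "source": {
        "type": "csv",
        "system": "countries-url-system",
        "url": "countries-20140629.csv",
        "primary_key": "Code",
        "encoding": "iso-8859-1",
        "decode_error_strategy": "strict",
        "if_source_empty": "fail"
    }
}
```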
