Native batch ingestion with firehose

Firehoses are deprecated in 0.17.0. It’s highly recommended to use the Native batch ingestion input sources instead.

There are several firehoses readily available in Druid, some are meant for examples, others can be used directly in a production environment.

StaticS3Firehose

You need to include the druid-s3-extensions as an extension to use the StaticS3Firehose.

This firehose ingests events from a predefined list of S3 objects. This firehose is splittable and can be used by the Parallel task. Since each split represents an object in this firehose, each worker task of index_parallel will read an object.

Sample spec:

  1. "firehose" : {
  2. "type" : "static-s3",
  3. "uris": ["s3://foo/bar/file.gz", "s3://bar/foo/file2.gz"]
  4. }

This firehose provides caching and prefetching features. In the Simple task, a firehose can be read twice if intervals or shardSpecs are not specified, and, in this case, caching can be useful. Prefetching is preferred when direct scan of objects is slow. Note that prefetching or caching isn’t that useful in the Parallel task.

propertydescriptiondefaultrequired?
typeThis should be static-s3.Noneyes
urisJSON array of URIs where s3 files to be ingested are located.Noneuris or prefixes must be set
prefixesJSON array of URI prefixes for the locations of s3 files to be ingested.Noneuris or prefixes must be set
maxCacheCapacityBytesMaximum size of the cache space in bytes. 0 means disabling cache. Cached files are not removed until the ingestion task completes.1073741824no
maxFetchCapacityBytesMaximum size of the fetch space in bytes. 0 means disabling prefetch. Prefetched files are removed immediately once they are read.1073741824no
prefetchTriggerBytesThreshold to trigger prefetching s3 objects.maxFetchCapacityBytes / 2no
fetchTimeoutTimeout for fetching an s3 object.60000no
maxFetchRetryMaximum retry for fetching an s3 object.3no

StaticGoogleBlobStoreFirehose

You need to include the druid-google-extensions as an extension to use the StaticGoogleBlobStoreFirehose.

This firehose ingests events, similar to the StaticS3Firehose, but from an Google Cloud Store.

As with the S3 blobstore, it is assumed to be gzipped if the extension ends in .gz

This firehose is splittable and can be used by the Parallel task. Since each split represents an object in this firehose, each worker task of index_parallel will read an object.

Sample spec:

  1. "firehose" : {
  2. "type" : "static-google-blobstore",
  3. "blobs": [
  4. {
  5. "bucket": "foo",
  6. "path": "/path/to/your/file.json"
  7. },
  8. {
  9. "bucket": "bar",
  10. "path": "/another/path.json"
  11. }
  12. ]
  13. }

This firehose provides caching and prefetching features. In the Simple task, a firehose can be read twice if intervals or shardSpecs are not specified, and, in this case, caching can be useful. Prefetching is preferred when direct scan of objects is slow. Note that prefetching or caching isn’t that useful in the Parallel task.

propertydescriptiondefaultrequired?
typeThis should be static-google-blobstore.Noneyes
blobsJSON array of Google Blobs.Noneyes
maxCacheCapacityBytesMaximum size of the cache space in bytes. 0 means disabling cache. Cached files are not removed until the ingestion task completes.1073741824no
maxFetchCapacityBytesMaximum size of the fetch space in bytes. 0 means disabling prefetch. Prefetched files are removed immediately once they are read.1073741824no
prefetchTriggerBytesThreshold to trigger prefetching Google Blobs.maxFetchCapacityBytes / 2no
fetchTimeoutTimeout for fetching a Google Blob.60000no
maxFetchRetryMaximum retry for fetching a Google Blob.3no

Google Blobs:

propertydescriptiondefaultrequired?
bucketName of the Google Cloud bucketNoneyes
pathThe path where data is located.Noneyes

HDFSFirehose

You need to include the druid-hdfs-storage as an extension to use the HDFSFirehose.

This firehose ingests events from a predefined list of files from the HDFS storage. This firehose is splittable and can be used by the Parallel task. Since each split represents an HDFS file, each worker task of index_parallel will read files.

Sample spec:

  1. "firehose" : {
  2. "type" : "hdfs",
  3. "paths": "/foo/bar,/foo/baz"
  4. }

This firehose provides caching and prefetching features. During native batch indexing, a firehose can be read twice if intervals are not specified, and, in this case, caching can be useful. Prefetching is preferred when direct scanning of files is slow. Note that prefetching or caching isn’t that useful in the Parallel task.

PropertyDescriptionDefault
typeThis should be hdfs.none (required)
pathsHDFS paths. Can be either a JSON array or comma-separated string of paths. Wildcards like * are supported in these paths.none (required)
maxCacheCapacityBytesMaximum size of the cache space in bytes. 0 means disabling cache. Cached files are not removed until the ingestion task completes.1073741824
maxFetchCapacityBytesMaximum size of the fetch space in bytes. 0 means disabling prefetch. Prefetched files are removed immediately once they are read.1073741824
prefetchTriggerBytesThreshold to trigger prefetching files.maxFetchCapacityBytes / 2
fetchTimeoutTimeout for fetching each file.60000
maxFetchRetryMaximum number of retries for fetching each file.3

You can also ingest from other storage using the HDFS firehose if the HDFS client supports that storage. However, if you want to ingest from cloud storage, consider using the service-specific input source for your data storage. If you want to use a non-hdfs protocol with the HDFS firehose, you need to include the protocol you want in druid.ingestion.hdfs.allowedProtocols. See HDFS firehose security configuration for more details.

LocalFirehose

This Firehose can be used to read the data from files on local disk, and is mainly intended for proof-of-concept testing, and works with string typed parsers. This Firehose is splittable and can be used by native parallel index tasks. Since each split represents a file in this Firehose, each worker task of index_parallel will read a file. A sample local Firehose spec is shown below:

  1. {
  2. "type": "local",
  3. "filter" : "*.csv",
  4. "baseDir": "/data/directory"
  5. }
propertydescriptionrequired?
typeThis should be “local”.yes
filterA wildcard filter for files. See here for more information.yes
baseDirdirectory to search recursively for files to be ingested.yes

HttpFirehose

This Firehose can be used to read the data from remote sites via HTTP, and works with string typed parsers. This Firehose is splittable and can be used by native parallel index tasks. Since each split represents a file in this Firehose, each worker task of index_parallel will read a file. A sample HTTP Firehose spec is shown below:

  1. {
  2. "type": "http",
  3. "uris": ["http://example.com/uri1", "http://example2.com/uri2"]
  4. }

You can only use protocols listed in the druid.ingestion.http.allowedProtocols property as HTTP firehose input sources. The http and https protocols are allowed by default. See HTTP firehose security configuration for more details.

The below configurations can be optionally used if the URIs specified in the spec require a Basic Authentication Header. Omitting these fields from your spec will result in HTTP requests with no Basic Authentication Header.

propertydescriptiondefault
httpAuthenticationUsernameUsername to use for authentication with specified URIsNone
httpAuthenticationPasswordPasswordProvider to use with specified URIsNone

Example with authentication fields using the DefaultPassword provider (this requires the password to be in the ingestion spec):

  1. {
  2. "type": "http",
  3. "uris": ["http://example.com/uri1", "http://example2.com/uri2"],
  4. "httpAuthenticationUsername": "username",
  5. "httpAuthenticationPassword": "password123"
  6. }

You can also use the other existing Druid PasswordProviders. Here is an example using the EnvironmentVariablePasswordProvider:

  1. {
  2. "type": "http",
  3. "uris": ["http://example.com/uri1", "http://example2.com/uri2"],
  4. "httpAuthenticationUsername": "username",
  5. "httpAuthenticationPassword": {
  6. "type": "environment",
  7. "variable": "HTTP_FIREHOSE_PW"
  8. }
  9. }

The below configurations can optionally be used for tuning the Firehose performance. Note that prefetching or caching isn’t that useful in the Parallel task.

propertydescriptiondefault
maxCacheCapacityBytesMaximum size of the cache space in bytes. 0 means disabling cache. Cached files are not removed until the ingestion task completes.1073741824
maxFetchCapacityBytesMaximum size of the fetch space in bytes. 0 means disabling prefetch. Prefetched files are removed immediately once they are read.1073741824
prefetchTriggerBytesThreshold to trigger prefetching HTTP objects.maxFetchCapacityBytes / 2
fetchTimeoutTimeout for fetching an HTTP object.60000
maxFetchRetryMaximum retries for fetching an HTTP object.3

IngestSegmentFirehose

This Firehose can be used to read the data from existing druid segments, potentially using a new schema and changing the name, dimensions, metrics, rollup, etc. of the segment. This Firehose is splittable and can be used by native parallel index tasks. This firehose will accept any type of parser, but will only utilize the list of dimensions and the timestamp specification. A sample ingest Firehose spec is shown below:

  1. {
  2. "type": "ingestSegment",
  3. "dataSource": "wikipedia",
  4. "interval": "2013-01-01/2013-01-02"
  5. }
propertydescriptionrequired?
typeThis should be “ingestSegment”.yes
dataSourceA String defining the data source to fetch rows from, very similar to a table in a relational databaseyes
intervalA String representing the ISO-8601 interval. This defines the time range to fetch the data over.yes
dimensionsThe list of dimensions to select. If left empty, no dimensions are returned. If left null or not defined, all dimensions are returned.no
metricsThe list of metrics to select. If left empty, no metrics are returned. If left null or not defined, all metrics are selected.no
filterSee Filtersno
maxInputSegmentBytesPerTaskDeprecated. Use Segments Split Hint Spec instead. When used with the native parallel index task, the maximum number of bytes of input segments to process in a single task. If a single segment is larger than this number, it will be processed by itself in a single task (input segments are never split across tasks). Defaults to 150MB.no

SqlFirehose

This Firehose can be used to ingest events residing in an RDBMS. The database connection information is provided as part of the ingestion spec. For each query, the results are fetched locally and indexed. If there are multiple queries from which data needs to be indexed, queries are prefetched in the background, up to maxFetchCapacityBytes bytes. This Firehose is splittable and can be used by native parallel index tasks. This firehose will accept any type of parser, but will only utilize the list of dimensions and the timestamp specification. See the extension documentation for more detailed ingestion examples.

Requires one of the following extensions:

  1. {
  2. "type": "sql",
  3. "database": {
  4. "type": "mysql",
  5. "connectorConfig": {
  6. "connectURI": "jdbc:mysql://host:port/schema",
  7. "user": "user",
  8. "password": "password"
  9. }
  10. },
  11. "sqls": ["SELECT * FROM table1", "SELECT * FROM table2"]
  12. }
propertydescriptiondefaultrequired?
typeThis should be “sql”.Yes
databaseSpecifies the database connection details. The database type corresponds to the extension that supplies the connectorConfig support. The specified extension must be loaded into Druid:



You can selectively allow JDBC properties in connectURI. See JDBC connections security config for more details.
Yes
maxCacheCapacityBytesMaximum size of the cache space in bytes. 0 means disabling cache. Cached files are not removed until the ingestion task completes.1073741824No
maxFetchCapacityBytesMaximum size of the fetch space in bytes. 0 means disabling prefetch. Prefetched files are removed immediately once they are read.1073741824No
prefetchTriggerBytesThreshold to trigger prefetching SQL result objects.maxFetchCapacityBytes / 2No
fetchTimeoutTimeout for fetching the result set.60000No
foldCaseToggle case folding of database column names. This may be enabled in cases where the database returns case insensitive column names in query results.falseNo
sqlsList of SQL queries where each SQL query would retrieve the data to be indexed.Yes

Database

propertydescriptiondefaultrequired?
typeThe type of database to query. Valid values are mysql and postgresql_Yes
connectorConfigSpecify the database connection properties via connectURI, user and passwordYes

InlineFirehose

This Firehose can be used to read the data inlined in its own spec. It can be used for demos or for quickly testing out parsing and schema, and works with string typed parsers. A sample inline Firehose spec is shown below:

  1. {
  2. "type": "inline",
  3. "data": "0,values,formatted\n1,as,CSV"
  4. }
propertydescriptionrequired?
typeThis should be “inline”.yes
dataInlined data to ingest.yes

CombiningFirehose

This Firehose can be used to combine and merge data from a list of different Firehoses.

  1. {
  2. "type": "combining",
  3. "delegates": [ { firehose1 }, { firehose2 }, ... ]
  4. }
propertydescriptionrequired?
typeThis should be “combining”yes
delegatesList of Firehoses to combine data fromyes