s3

The s3 sink saves batches of events to Amazon Simple Storage Service (Amazon S3) objects.

Usage

The following example creates a pipeline configured with an s3 sink. It includes additional options that customize the event count and size thresholds at which the pipeline writes objects to S3, and it sets the codec type to ndjson:

  pipeline:
    ...
    sink:
      - s3:
          aws:
            region: us-east-1
            sts_role_arn: arn:aws:iam::123456789012:role/Data-Prepper
            sts_header_overrides:
          max_retries: 5
          bucket:
            name: bucket_name
          object_key:
            path_prefix: my-elb/%{yyyy}/%{MM}/%{dd}/
          threshold:
            event_count: 2000
            maximum_size: 50mb
            event_collect_timeout: 15s
          codec:
            ndjson:
          buffer_type: in_memory

Configuration

Use the following options when customizing the s3 sink.

| Option | Required | Type | Description |
| --- | --- | --- | --- |
| bucket | Yes | String | The name of the S3 bucket to which objects are written. The name must match the name of your object store. |
| codec | Yes | Codec | The codec that determines how data is serialized into each S3 object. See codec for more information. |
| aws | Yes | AWS | The AWS configuration. See aws for more information. |
| threshold | Yes | Threshold | Configures when to write an object to S3. |
| object_key | No | Object key | Sets the path_prefix and the file_pattern of the object store. Defaults to the S3 object events-%{yyyy-MM-dd'T'hh-mm-ss} found inside the root directory of the bucket. |
| compression | No | String | The compression algorithm to apply: none, gzip, or snappy. Default is none. |
| buffer_type | No | Buffer type | Determines the buffer type. See Buffer type for more information. |
| max_retries | No | Integer | The maximum number of times a single request should retry when ingesting data to S3. Defaults to 5. |
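For instance, the optional compression and retry settings can be added alongside the required options. The following fragment is a sketch showing only the options being illustrated; the bucket name is hypothetical:

```yaml
sink:
  - s3:
      bucket:
        name: example-bucket  # hypothetical bucket name
      compression: gzip       # compress objects before they are written
      max_retries: 5          # retry a failed request up to 5 times
```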

aws

| Option | Required | Type | Description |
| --- | --- | --- | --- |
| region | No | String | The AWS Region to use for credentials. Defaults to standard SDK behavior to determine the Region. |
| sts_role_arn | No | String | The AWS Security Token Service (AWS STS) role to assume for requests to Amazon S3. Defaults to null, which uses the standard SDK behavior for credentials. |
| sts_header_overrides | No | Map | A map of header overrides that the IAM role assumes for the sink plugin. |
| sts_external_id | No | String | The external ID to attach to AssumeRole requests from AWS STS. |
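As an example, a pipeline that assumes a role with an external ID and a custom STS header might configure the aws block as follows. The role ARN, external ID, and header values are hypothetical:

```yaml
aws:
  region: us-west-2
  sts_role_arn: arn:aws:iam::123456789012:role/Example-Role  # hypothetical role
  sts_external_id: example-external-id                       # hypothetical external ID
  sts_header_overrides:
    x-example-header: example-value                          # hypothetical header override
```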

Threshold configuration

Use the following options to set ingestion thresholds for the s3 sink.

| Option | Required | Type | Description |
| --- | --- | --- | --- |
| event_count | Yes | Integer | The number of events to accumulate before writing an object to S3. |
| maximum_size | Yes | String | The maximum number of bytes to accumulate, after compression, before writing an object to S3. Defaults to 50mb. |
| event_collect_timeout | Yes | String | The maximum amount of time to collect events before writing them to S3. All values are strings that represent a duration, either in ISO 8601 notation, such as PT20.345S, or in simple notation, such as 60s or 1500ms. |
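For example, the following threshold block writes an object after 2,000 events, 25 MB of compressed data, or 20.345 seconds, whichever comes first. The values are illustrative, and the timeout uses ISO 8601 notation:

```yaml
threshold:
  event_count: 2000
  maximum_size: 25mb
  event_collect_timeout: PT20.345S  # ISO 8601 duration; 60s or 1500ms also valid
```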

Buffer type

buffer_type is an optional setting that determines how events are stored temporarily before they are flushed to the S3 bucket. The default value is in_memory. Use one of the following options:

  • local_file: Flushes the record into a file on your machine.
  • in_memory: Stores the record in memory.
  • multipart: Writes using the S3 multipart upload. Every 10 MB is written as a part.
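For example, to buffer records in a local file rather than in memory, set the option at the sink level. This is a fragment showing only the relevant option:

```yaml
sink:
  - s3:
      buffer_type: local_file  # buffer events on disk before flushing to S3
```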

Object key configuration

| Option | Required | Type | Description |
| --- | --- | --- | --- |
| path_prefix | No | String | The S3 key prefix path to use. Accepts date-time formatting. For example, you can use %{yyyy}/%{MM}/%{dd}/%{HH}/ to create hourly folders in S3. By default, events write to the root of the bucket. |
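For example, to create hourly folders under a prefix, combine a fixed path with date-time formatting. The logs/ prefix below is hypothetical:

```yaml
object_key:
  path_prefix: logs/%{yyyy}/%{MM}/%{dd}/%{HH}/  # hypothetical prefix; creates hourly folders
```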

codec

The codec determines how the s3 sink formats data written to each S3 object.

avro codec

The avro codec writes an event as an Apache Avro document.

Because Avro requires a schema, you can either define the schema yourself or let Data Prepper generate one automatically. In general, you should define your own schema because it will most accurately reflect your needs.

We recommend that you make your Avro fields use a null union. Without the null union, each field must be present or the data will fail to write to the sink. If you can be certain that every event has a given field, you can make that field non-nullable.

When you provide your own Avro schema, that schema defines the final structure of your data. Therefore, any values in an incoming event that are not mapped in the Avro schema are not included in the final destination. To avoid confusion between a custom Avro schema and the include_keys or exclude_keys sink configurations, Data Prepper does not allow the use of include_keys or exclude_keys with a custom schema.

In cases where your data is uniform, you may be able to automatically generate a schema. Automatically generated schemas are based on the first event received by the codec and only contain keys from that event, so all keys must be present in every event for the automatically generated schema to work. Automatically generated schemas make all fields nullable. Use the sink's include_keys and exclude_keys configurations to control what data is included in the auto-generated schema.

| Option | Required | Type | Description |
| --- | --- | --- | --- |
| schema | Yes | String | The Avro schema declaration. Not required if auto_schema is set to true. |
| auto_schema | No | Boolean | When set to true, automatically generates the Avro schema declaration from the first event. |
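A codec block with a user-defined schema might look like the following sketch. The namespace, record name, and fields are hypothetical, and each field uses a null union with a null default so that events missing the field still write successfully:

```yaml
codec:
  avro:
    schema: >
      {
        "type": "record",
        "namespace": "org.example",
        "name": "ExampleEvent",
        "fields": [
          { "name": "message", "type": ["null", "string"], "default": null },
          { "name": "status",  "type": ["null", "int"],    "default": null }
        ]
      }
```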

ndjson codec

The ndjson codec writes each event as a JSON object on its own line.

The ndjson codec does not take any configurations.

json codec

The json codec writes events in a single large JSON file. Each event is written into an object within a JSON array.

| Option | Required | Type | Description |
| --- | --- | --- | --- |
| key_name | No | String | The name of the key for the JSON array. By default this is events. |
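For example, setting a custom array key (the key name below is hypothetical) produces objects of the form {"log_events": [...]} instead of {"events": [...]}:

```yaml
codec:
  json:
    key_name: log_events  # hypothetical; events are written into this JSON array
```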

parquet codec

The parquet codec writes events into a Parquet file. When using the Parquet codec, set the buffer_type to in_memory.

The Parquet codec writes data using the Avro schema. Because Parquet requires an Avro schema, you may either define the schema yourself, or Data Prepper will automatically generate a schema. However, we generally recommend that you define your own schema so that it can best meet your needs.

For details on the Avro schema and recommendations, see the Avro codec documentation.

| Option | Required | Type | Description |
| --- | --- | --- | --- |
| schema | Yes | String | The Avro schema declaration. Not required if auto_schema is set to true. |
| auto_schema | No | Boolean | When set to true, automatically generates the Avro schema declaration from the first event. |
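Putting this together, a parquet codec that relies on automatic schema generation might be sketched as follows. Note the in_memory buffer at the sink level, as described above:

```yaml
sink:
  - s3:
      buffer_type: in_memory  # required when using the parquet codec
      codec:
        parquet:
          auto_schema: true   # generate the Avro schema from the first event
```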