Data Prepper

Data Prepper is an independent component, not an OpenSearch plugin, that converts data for use with OpenSearch. It’s not bundled with the all-in-one OpenSearch installation packages.

Install Data Prepper

To use the Docker image, pull it like any other image:

  docker pull opensearchproject/data-prepper:latest

Otherwise, download the appropriate archive for your operating system and unzip it.
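
For example, on Linux or macOS (a sketch only; the URL below is a placeholder, so use the actual archive link for your operating system from the OpenSearch downloads page):

  # Download the archive (placeholder URL) and unpack it
  curl -L -O https://example.com/opensearch-data-prepper.tar.gz
  tar -xzf opensearch-data-prepper.tar.gz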

Configure pipelines

To use Data Prepper, you define pipelines in a YAML configuration file. Each pipeline is a combination of a source, a buffer, zero or more preppers, and one or more sinks:

  sample-pipeline:
    workers: 4 # the number of workers
    delay: 100 # in milliseconds, how long workers wait between read attempts
    source:
      otel_trace_source:
        ssl: true
        sslKeyCertChainFile: "config/demo-data-prepper.crt"
        sslKeyFile: "config/demo-data-prepper.key"
    buffer:
      bounded_blocking:
        buffer_size: 1024 # max number of records the buffer accepts
        batch_size: 256 # max number of records the buffer drains after each read
    prepper:
      - otel_trace_raw_prepper:
    sink:
      - opensearch:
          hosts: ["https://localhost:9200"]
          cert: "config/root-ca.pem"
          username: "ta-user"
          password: "ta-password"
          trace_analytics_raw: true
  • Sources define where your data comes from. In this case, the source is the OpenTelemetry Collector (otel_trace_source) with some optional SSL settings.

  • Buffers store data as it passes through the pipeline.

    By default, Data Prepper uses its one and only buffer, the bounded_blocking buffer, so you can omit this section unless you've developed a custom buffer or need to tune the buffer settings (see the sketch after this list).

  • Preppers perform some action on your data: filter, transform, enrich, etc.

    You can have multiple preppers, which run sequentially from top to bottom, not in parallel. The otel_trace_raw_prepper prepper converts OpenTelemetry data into OpenSearch-compatible JSON documents.

  • Sinks define where your data goes. In this case, the sink is an OpenSearch cluster.
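
For example, here is a minimal sketch of a pipeline that omits the buffer section entirely and falls back on the default bounded_blocking buffer. The random source, string_converter prepper, and stdout sink used here are assumptions based on the built-in plugins listed in the Data Prepper configuration reference; substitute whichever source, preppers, and sinks you actually use:

  simple-sample-pipeline:
    workers: 2 # the number of workers
    delay: "5000" # in milliseconds, how long workers wait between read attempts
    source:
      random: # an assumed built-in test source that generates random strings
    # no buffer section: the default bounded_blocking buffer is used
    prepper:
      - string_converter: # an assumed built-in prepper that upper-cases each record
          upper_case: true
    sink:
      - stdout: # an assumed built-in sink that writes records to standard output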

Pipelines can act as the source for other pipelines. In the following example, a pipeline takes data from the OpenTelemetry Collector and uses two other pipelines as sinks:

  entry-pipeline:
    delay: "100"
    source:
      otel_trace_source:
        ssl: true
        sslKeyCertChainFile: "config/demo-data-prepper.crt"
        sslKeyFile: "config/demo-data-prepper.key"
    sink:
      - pipeline:
          name: "raw-pipeline"
      - pipeline:
          name: "service-map-pipeline"
  raw-pipeline:
    source:
      pipeline:
        name: "entry-pipeline"
    prepper:
      - otel_trace_raw_prepper:
    sink:
      - opensearch:
          hosts: ["https://localhost:9200"]
          cert: "config/root-ca.pem"
          username: "ta-user"
          password: "ta-password"
          trace_analytics_raw: true
  service-map-pipeline:
    delay: "100"
    source:
      pipeline:
        name: "entry-pipeline"
    prepper:
      - service_map_stateful:
    sink:
      - opensearch:
          hosts: ["https://localhost:9200"]
          cert: "config/root-ca.pem"
          username: "ta-user"
          password: "ta-password"
          trace_analytics_service_map: true

To learn more, see the Data Prepper configuration reference.

Configure the Data Prepper server

Data Prepper itself provides administrative HTTP endpoints, such as /list, which lists pipelines, and /metrics/prometheus, which provides Prometheus-compatible metrics data. The port that serves these endpoints, as well as its TLS configuration, is specified in a separate YAML file. Example:

  ssl: true
  keyStoreFilePath: "/usr/share/data-prepper/keystore.jks"
  keyStorePassword: "password"
  privateKeyPassword: "other_password"
  serverPort: 1234
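
With the example configuration above, the server listens on port 1234 over TLS, so you can query the administrative endpoints with curl. This is a quick sketch; -k skips certificate verification, which is only appropriate for the demo keystore:

  # List the running pipelines
  curl -k https://localhost:1234/list

  # Scrape Prometheus-compatible metrics
  curl -k https://localhost:1234/metrics/prometheus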

Start Data Prepper

Docker

  docker run --name data-prepper --expose 21890 \
    -v /full/path/to/pipelines.yaml:/usr/share/data-prepper/pipelines.yaml \
    -v /full/path/to/data-prepper-config.yaml:/usr/share/data-prepper/data-prepper-config.yaml \
    opensearchproject/data-prepper:latest

macOS and Linux

  ./data-prepper-tar-install.sh config/pipelines.yaml config/data-prepper-config.yaml

For production workloads, you likely want to run Data Prepper on a dedicated machine, which makes connectivity a concern. Data Prepper uses port 21890 and must be able to connect to both the OpenTelemetry Collector and the OpenSearch cluster. In the sample applications, you can see that all components use the same Docker network and expose the appropriate ports.
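
A minimal sketch of that kind of setup, assuming hypothetical container and network names (otel-collector and data-prepper-net are placeholders for however you run the OpenTelemetry Collector and name your network):

  # Create a shared network so the containers can reach each other by name
  docker network create data-prepper-net

  # Run Data Prepper on the shared network, exposing the OTel trace port
  docker run --name data-prepper --network data-prepper-net --expose 21890 \
    -v /full/path/to/pipelines.yaml:/usr/share/data-prepper/pipelines.yaml \
    -v /full/path/to/data-prepper-config.yaml:/usr/share/data-prepper/data-prepper-config.yaml \
    opensearchproject/data-prepper:latest

  # Run the OpenTelemetry Collector on the same network so it can export traces
  # to data-prepper:21890 (the collector config path is a placeholder)
  docker run --name otel-collector --network data-prepper-net \
    -v /full/path/to/otel-collector-config.yaml:/etc/otel-collector-config.yaml \
    otel/opentelemetry-collector:latest --config /etc/otel-collector-config.yaml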