Trace analytics

Trace analytics allows you to collect trace data and customize a pipeline that ingests and transforms the data for use in OpenSearch. The following provides an overview of the trace analytics workflow in Data Prepper, how to configure it, and how to visualize trace data.

Introduction

When using Data Prepper as a server-side component to collect trace data, you can customize a Data Prepper pipeline to ingest and transform the data for use in OpenSearch. After transformation, you can visualize the trace data using the Observability plugin in OpenSearch Dashboards. Trace data provides visibility into your application’s performance and helps you gain more information about individual traces.

The following flowchart illustrates the trace analytics workflow, from running OpenTelemetry Collector to using OpenSearch Dashboards for visualization.

Trace analytics component overview

To monitor trace analytics, you need to set up the following components in your service environment:

  • Add instrumentation to your application so it can generate telemetry data and send it to an OpenTelemetry Collector.
  • Run an OpenTelemetry Collector as a sidecar or daemonset for Amazon Elastic Kubernetes Service (Amazon EKS), a sidecar for Amazon Elastic Container Service (Amazon ECS), or an agent on Amazon Elastic Compute Cloud (Amazon EC2). You should configure the collector to export trace data to Data Prepper.
  • Deploy Data Prepper as the ingestion collector for OpenSearch. Configure it to send the enriched trace data to your OpenSearch cluster or to the Amazon OpenSearch Service domain.
  • Use OpenSearch Dashboards to visualize and detect problems in your distributed applications.

Trace analytics pipeline

To monitor trace analytics in Data Prepper, we provide three pipelines: entry-pipeline, raw-trace-pipeline, and service-map-pipeline. The following image provides an overview of how the pipelines work together to monitor trace analytics.

Trace analytics pipeline overview

OpenTelemetry trace source

The OpenTelemetry source accepts trace data from the OpenTelemetry Collector. The source follows the OpenTelemetry Protocol and officially supports transport over gRPC and the use of industry-standard encryption (TLS/HTTPS).

Processor

There are three processors for the trace analytics feature:

  • otel_trace_raw – The otel_trace_raw processor receives a collection of span records from otel_trace_source and performs stateful processing, extraction, and completion of trace-group-related fields.
  • otel_trace_group – The otel_trace_group processor fills in the missing trace-group-related fields in the collection of span records by looking up the OpenSearch backend.
  • service_map_stateful – The service_map_stateful processor performs the required preprocessing for trace data and builds metadata to display the service-map dashboards.

OpenSearch sink

OpenSearch provides a generic sink that writes data to OpenSearch as the destination. The OpenSearch sink has configuration options related to the OpenSearch cluster, such as endpoint, SSL, username/password, index name, index template, and index state management.

The sink provides specific configurations for the trace analytics feature. These configurations allow the sink to use indexes and index templates specific to trace analytics. The following OpenSearch indexes are specific to trace analytics:

  • otel-v1-apm-span – The otel-v1-apm-span index stores the output from the otel_trace_raw processor.
  • otel-v1-apm-service-map – The otel-v1-apm-service-map index stores the output from the service_map_stateful processor.
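As a rough sketch of how a sink selects these indexes, the fragment below sets index_type on two OpenSearch sinks. The index_type values are the ones used in the pipeline examples later on this page; the host is a placeholder, and the fragment omits the source, buffer, and processor sections that a complete pipeline needs.

```yaml
# Sink fragments only; not a complete pipeline definition.
raw-trace-pipeline:
  sink:
    - opensearch:
        hosts: ["https://localhost:9200"]   # placeholder endpoint
        # Selects the span index and its template (otel-v1-apm-span).
        index_type: trace-analytics-raw
service-map-pipeline:
  sink:
    - opensearch:
        hosts: ["https://localhost:9200"]   # placeholder endpoint
        # Selects the service map index and its template (otel-v1-apm-service-map).
        index_type: trace-analytics-service-map
```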

Trace tuning

Starting with version 0.8.x, Data Prepper supports both vertical and horizontal scaling for trace analytics. You can adjust the size of a single Data Prepper instance to meet your workload’s demands and scale vertically.

You can scale horizontally by using the core peer forwarder to deploy multiple Data Prepper instances to form a cluster. This enables Data Prepper instances to communicate with instances in the cluster and is required for horizontally scaling deployments.
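The following is a minimal sketch of what a peer forwarder configuration might look like in data-prepper-config.yaml when using static discovery. The host names are hypothetical, and the option names (discovery_mode, static_endpoints, ssl) are assumptions based on the core peer forwarder settings in Data Prepper 2.x, so check them against the documentation for your version.

```yaml
# data-prepper-config.yaml (sketch; host names are hypothetical)
peer_forwarder:
  # Static discovery: list every Data Prepper node in the cluster.
  discovery_mode: static
  static_endpoints:
    - "data-prepper-1.example.internal"
    - "data-prepper-2.example.internal"
  # Assumption: TLS between peers is disabled for local testing only.
  ssl: false
```

Each instance in the cluster would use the same list of endpoints so that spans belonging to the same trace can be routed to the same node.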

Scaling recommendations

Use the following recommended configurations to scale Data Prepper. We recommend that you modify parameters based on your workload requirements. We also recommend that you monitor the Data Prepper host metrics and OpenSearch metrics to ensure that the configuration works as expected.

Buffer

The total number of trace requests processed by Data Prepper is equal to the sum of the buffer_size values in otel-trace-pipeline and raw-pipeline. The total number of trace requests sent to OpenSearch is equal to the product of batch_size and workers in raw-pipeline. For more information about raw-pipeline, see Trace analytics pipeline, where it is named raw-trace-pipeline.

We recommend the following when making changes to buffer settings:

  • The buffer_size value in otel-trace-pipeline and raw-pipeline should be the same.
  • The buffer_size should be greater than or equal to workers * batch_size in the raw-pipeline.
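The two recommendations above can be sketched as the following fragment. The numbers are illustrative rather than tuned values, and the source, processor, and sink sections are omitted.

```yaml
# Buffer fragment only; values are illustrative.
otel-trace-pipeline:
  buffer:
    bounded_blocking:
      buffer_size: 1024   # same value as in raw-pipeline
      batch_size: 8
raw-pipeline:
  workers: 8
  buffer:
    bounded_blocking:
      # workers * batch_size = 8 * 64 = 512, so 1024 satisfies
      # buffer_size >= workers * batch_size.
      buffer_size: 1024
      batch_size: 64
```

With these values, up to 512 trace requests (8 workers, each taking a batch of 64) can be sent to OpenSearch at a time.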

Workers

The workers setting determines the number of threads that are used by Data Prepper to process requests from the buffer. We recommend that you set workers based on the CPU utilization. This value can be higher than the number of available processors because Data Prepper uses significant input/output time when sending data to OpenSearch.

Heap

Configure the Data Prepper heap by setting the JVM_OPTS environment variable. We recommend that you set the heap value to a minimum value of 4 * batch_size * otel_send_batch_size * maximum size of an individual span.

As mentioned in the OpenTelemetry Collector section, set otel_send_batch_size to a value of 50 in your OpenTelemetry Collector configuration.
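As a worked example of the heap formula, assume a batch_size of 256, an otel_send_batch_size of 50, and a maximum span size of 10 KB (the span size is an assumption for illustration; measure your own). The minimum heap is then 4 * 256 * 50 * 10 KB = 512,000 KB, or roughly 0.5 GB. The fragment below sketches one way to pass a heap setting through JVM_OPTS, assuming a hypothetical Docker Compose deployment; the service name and heap size are illustrative.

```yaml
# Hypothetical Docker Compose fragment; adapt to your own deployment.
services:
  data-prepper:
    image: opensearchproject/data-prepper:latest
    environment:
      # Minimum from the formula: 4 * 256 * 50 * 10 KB = 512,000 KB (about 0.5 GB).
      # 1 GB leaves headroom above that minimum.
      JVM_OPTS: "-Xms1g -Xmx1g"
```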

Local disk

Data Prepper uses the local disk to store metadata required for service map processing, so we recommend storing only the following key fields: traceId, spanId, parentSpanId, spanKind, spanName, and serviceName. The service-map plugin stores only two files, each of which stores window_duration seconds of data. As an example, testing with a throughput of 3000 spans/second resulted in a total disk usage of 4 MB.

Data Prepper also uses the local disk to write logs. In the most recent version of Data Prepper, you can redirect the logs to your preferred path.

AWS CloudFormation template and Kubernetes/Amazon EKS configuration files

The AWS CloudFormation template provides a user-friendly mechanism for configuring the scaling attributes described in the Trace tuning section.

The Kubernetes configuration files and Amazon EKS configuration files are available for configuring these attributes in a cluster deployment.

Benchmark tests

The benchmark tests were performed on an r5.xlarge EC2 instance with the following configuration:

  • buffer_size: 4096
  • batch_size: 256
  • workers: 8
  • Heap: 10 GB

This setup was able to handle a throughput of 2100 spans/second at 20 percent CPU utilization.

Pipeline configuration

The following sections provide examples of different types of pipelines and how to configure each type.

Example: Trace analytics pipeline

The following example demonstrates how to build a pipeline that supports the OpenSearch Dashboards Observability plugin. This pipeline takes data from the OpenTelemetry Collector and uses two other pipelines as sinks. These two separate pipelines serve two different purposes and write to different OpenSearch indexes. The first pipeline prepares trace data for OpenSearch and enriches and ingests the span documents into a span index within OpenSearch. The second pipeline aggregates traces into a service map and writes service map documents into a service map index within OpenSearch.

Starting with Data Prepper version 2.0, Data Prepper no longer supports the otel_trace_raw_prepper processor. Use the otel_trace_raw processor instead, which replaces otel_trace_raw_prepper and supports some of Data Prepper’s recent data model changes. See the following YAML file example:

    entry-pipeline:
      delay: "100"
      source:
        otel_trace_source:
          ssl: false
      buffer:
        bounded_blocking:
          buffer_size: 10240
          batch_size: 160
      sink:
        - pipeline:
            name: "raw-trace-pipeline"
        - pipeline:
            name: "service-map-pipeline"
    # The pipeline name must match the sink name used in entry-pipeline.
    raw-trace-pipeline:
      source:
        pipeline:
          name: "entry-pipeline"
      buffer:
        bounded_blocking:
          buffer_size: 10240
          batch_size: 160
      processor:
        - otel_trace_raw:
      sink:
        - opensearch:
            hosts: ["https://localhost:9200"]
            insecure: true
            username: admin
            password: admin
            index_type: trace-analytics-raw
    service-map-pipeline:
      delay: "100"
      source:
        pipeline:
          name: "entry-pipeline"
      buffer:
        bounded_blocking:
          buffer_size: 10240
          batch_size: 160
      processor:
        - service_map_stateful:
      sink:
        - opensearch:
            hosts: ["https://localhost:9200"]
            insecure: true
            username: admin
            password: admin
            index_type: trace-analytics-service-map

To maintain similar ingestion throughput and latency, scale the buffer_size and batch_size by the estimated maximum batch size in the client request payload.

Example: otel trace

The following is an example otel-trace-source .yaml file with SSL and basic authentication enabled. Note that you will need to modify your otel-collector-config.yaml file so that it uses your own credentials.

    source:
      otel_trace_source:
        #record_type: event # Add this when using Data Prepper 1.x. This option is removed in 2.0
        ssl: true
        sslKeyCertChainFile: "/full/path/to/certfile.crt"
        sslKeyFile: "/full/path/to/keyfile.key"
        authentication:
          http_basic:
            username: "my-user"
            password: "my_s3cr3t"
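On the collector side, the matching change might look like the following sketch. It assumes that you run a collector distribution that includes the basicauth extension (for example, the contrib distribution) and that the collector should trust the same certificate file configured on the source. The extension instance name basicauth/client is illustrative.

```yaml
# otel-collector-config.yaml fragment (sketch; assumes the basicauth extension is available)
extensions:
  basicauth/client:
    client_auth:
      username: "my-user"
      password: "my_s3cr3t"

exporters:
  otlp/data-prepper:
    endpoint: localhost:21890
    tls:
      # Trust the certificate presented by otel_trace_source.
      ca_file: "/full/path/to/certfile.crt"
    auth:
      authenticator: basicauth/client

service:
  # Extensions must be enabled here in addition to being defined above.
  extensions: [basicauth/client]
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/data-prepper]
```

The username and password must match the values configured under http_basic in the otel_trace_source settings.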

Example: pipeline.yaml

The following is an example pipeline.yaml file without SSL and basic authentication enabled for the otel-trace-pipeline pipeline:

    otel-trace-pipeline:
      # workers is the number of threads processing data in each pipeline.
      # We recommend the same value for all pipelines.
      # The default value is 1. Set a value based on the machine on which you are running Data Prepper.
      workers: 8
      # delay (in milliseconds) is how often the worker threads should process data.
      # We recommend not changing this setting because this pipeline should process data as quickly as possible.
      # The default value is 3_000 ms.
      delay: "100"
      source:
        otel_trace_source:
          #record_type: event # Add this when using Data Prepper 1.x. This option is removed in 2.0
          ssl: false # Change this to enable encryption in transit
          authentication:
            unauthenticated:
      buffer:
        bounded_blocking:
          # buffer_size is the number of ExportTraceRequest objects from otel-collector that Data Prepper should hold in memory.
          # We recommend keeping the same buffer_size for all pipelines.
          # Make sure you configure sufficient heap.
          # The default value is 512.
          buffer_size: 512
          # This is the maximum number of requests each worker thread will process within the delay.
          # The default is 8.
          # Make sure buffer_size >= workers * batch_size
          batch_size: 8
      sink:
        # Each name must match a pipeline defined in this file.
        - pipeline:
            name: "raw-pipeline"
        - pipeline:
            name: "service-map-pipeline"
    raw-pipeline:
      # Configure the same value as in otel-trace-pipeline
      workers: 8
      # We recommend using the default value for the raw-pipeline.
      delay: "3000"
      source:
        pipeline:
          name: "otel-trace-pipeline"
      buffer:
        bounded_blocking:
          # Configure the same value as in otel-trace-pipeline
          # Make sure you configure sufficient heap
          # The default value is 512
          buffer_size: 512
          # The raw processor sends bulk requests to your OpenSearch sink, so configure a higher batch_size.
          # If you use the recommended otel-collector setup, each ExportTraceRequest could contain a maximum of 50 spans. https://github.com/opensearch-project/data-prepper/tree/v0.7.x/deployment/aws
          # With 64 as the batch size, each worker thread could process up to 3200 spans (64 * 50)
          batch_size: 64
      processor:
        - otel_trace_raw:
        - otel_trace_group:
            hosts: [ "https://localhost:9200" ]
            # Change to your credentials
            username: "admin"
            password: "admin"
            # Add a certificate file if you are accessing an OpenSearch cluster with a self-signed certificate
            #cert: /path/to/cert
            # If you are connecting to an Amazon OpenSearch Service domain without
            # Fine-Grained Access Control, enable these settings. Comment out the
            # username and password above.
            #aws_sigv4: true
            #aws_region: us-east-1
      sink:
        - opensearch:
            hosts: [ "https://localhost:9200" ]
            index_type: trace-analytics-raw
            # Change to your credentials
            username: "admin"
            password: "admin"
            # Add a certificate file if you are accessing an OpenSearch cluster with a self-signed certificate
            #cert: /path/to/cert
            # If you are connecting to an Amazon OpenSearch Service domain without
            # Fine-Grained Access Control, enable these settings. Comment out the
            # username and password above.
            #aws_sigv4: true
            #aws_region: us-east-1
    service-map-pipeline:
      workers: 8
      delay: "100"
      source:
        pipeline:
          name: "otel-trace-pipeline"
      processor:
        - service_map_stateful:
            # The window duration is the maximum length of time that Data Prepper stores the most recent trace data to evaluate service-map relationships.
            # The default is 3 minutes, which means that relationships between services can be detected from spans reported in the last 3 minutes.
            # Set a higher value if your applications have higher latency.
            window_duration: 180
      buffer:
        bounded_blocking:
          # buffer_size is the number of ExportTraceRequest objects from otel-collector that Data Prepper should hold in memory.
          # We recommend keeping the same buffer_size for all pipelines.
          # Make sure you configure sufficient heap.
          # The default value is 512.
          buffer_size: 512
          # This is the maximum number of requests each worker thread will process within the delay.
          # The default is 8.
          # Make sure buffer_size >= workers * batch_size
          batch_size: 8
      sink:
        - opensearch:
            hosts: [ "https://localhost:9200" ]
            index_type: trace-analytics-service-map
            # Change to your credentials
            username: "admin"
            password: "admin"
            # Add a certificate file if you are accessing an OpenSearch cluster with a self-signed certificate
            #cert: /path/to/cert
            # If you are connecting to an Amazon OpenSearch Service domain without
            # Fine-Grained Access Control, enable these settings. Comment out the
            # username and password above.
            #aws_sigv4: true
            #aws_region: us-east-1

You need to modify the preceding configuration for your OpenSearch cluster so that the configuration matches your environment. Note that it has two opensearch sinks and an otel_trace_group processor, each of which needs to be modified.

You must make the following changes:

  • hosts – Set to your hosts.
  • username – Provide your OpenSearch username.
  • password – Provide your OpenSearch password.
  • aws_sigv4 – If you are using Amazon OpenSearch Service with AWS signing, set this value to true. It will sign requests with the default AWS credentials provider.
  • aws_region – If you are using Amazon OpenSearch Service with AWS signing, set this value to your AWS Region.
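As an illustration of these changes, a sink that targets an Amazon OpenSearch Service domain with AWS signing might look like the following fragment. The domain endpoint is a hypothetical placeholder, and the username and password settings are left out because requests are signed instead.

```yaml
# Sink fragment only; the endpoint below is a hypothetical placeholder.
sink:
  - opensearch:
      hosts: [ "https://search-my-domain.us-east-1.es.amazonaws.com" ]
      index_type: trace-analytics-raw
      # Sign requests with the default AWS credentials provider
      # instead of sending a username and password.
      aws_sigv4: true
      aws_region: us-east-1
```

The same settings would need to be applied to the service map sink and to the otel_trace_group processor.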

For other configurations available for OpenSearch sinks, see Data Prepper OpenSearch sink.

OpenTelemetry Collector

You need to run OpenTelemetry Collector in your service environment. Follow Getting Started to install an OpenTelemetry collector. Ensure that you configure the collector with an exporter configured for your Data Prepper instance. The following example otel-collector-config.yaml file receives data from various instrumentations and exports it to Data Prepper.

Example otel-collector-config.yaml file

The following is an example otel-collector-config.yaml file:

    receivers:
      jaeger:
        protocols:
          grpc:
      otlp:
        protocols:
          grpc:
      zipkin:

    processors:
      batch/traces:
        timeout: 1s
        send_batch_size: 50

    exporters:
      otlp/data-prepper:
        endpoint: localhost:21890
        tls:
          insecure: true

    service:
      pipelines:
        traces:
          receivers: [jaeger, otlp, zipkin]
          processors: [batch/traces]
          exporters: [otlp/data-prepper]

After you run the OpenTelemetry Collector in your service environment, you must configure your application to use it. The OpenTelemetry Collector typically runs alongside your application.

Next steps and more information

The OpenSearch Dashboards Observability plugin documentation provides additional information about configuring OpenSearch to view trace analytics in OpenSearch Dashboards.

For more information about how to tune and scale Data Prepper for trace analytics, see Trace tuning.

Migrating to Data Prepper 2.0

Starting with Data Prepper version 1.4, trace processing uses Data Prepper’s event model. This allows pipeline authors to configure other processors to modify spans or traces. To provide a migration path, Data Prepper version 1.4 introduced the following changes:

  • otel_trace_source has an optional record_type parameter that can be set to event. When configured, it will output event objects.
  • otel_trace_raw replaces otel_trace_raw_prepper for event-based spans.
  • otel_trace_group replaces otel_trace_group_prepper for event-based spans.

In Data Prepper version 2.0, otel_trace_source will only output events. Data Prepper version 2.0 also removes otel_trace_raw_prepper and otel_trace_group_prepper entirely. To migrate to Data Prepper version 2.0, you can configure your trace pipeline using the event model.
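A minimal sketch of an event-based trace pipeline on Data Prepper 1.4 or later 1.x might look like the following. The pipeline names, host, and credentials are placeholders taken from the examples earlier on this page. When you move to version 2.0, remove the record_type line, because that option no longer exists.

```yaml
entry-pipeline:
  source:
    otel_trace_source:
      ssl: false
      # Data Prepper 1.4 and later 1.x only: opt in to the event model.
      # Remove this line on 2.0, where events are the only output.
      record_type: event
  sink:
    - pipeline:
        name: "raw-trace-pipeline"
raw-trace-pipeline:
  source:
    pipeline:
      name: "entry-pipeline"
  processor:
    # Event-based replacements for otel_trace_raw_prepper and otel_trace_group_prepper.
    - otel_trace_raw:
    - otel_trace_group:
        hosts: [ "https://localhost:9200" ]
        username: "admin"
        password: "admin"
  sink:
    - opensearch:
        hosts: [ "https://localhost:9200" ]
        username: "admin"
        password: "admin"
        index_type: trace-analytics-raw
```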
