Ingest pipelines

An ingest pipeline is a sequence of processors that are applied to documents as they are ingested into an index. Each processor in a pipeline performs a specific task, such as filtering, transforming, or enriching data.

Processors are customizable tasks that run sequentially, in the order in which they appear in the request body. This order matters because each processor operates on the output of the previous processor. The modified documents appear in your index after all processors are applied.
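To illustrate why processor order matters, the following is a minimal Python sketch that models processors as plain functions applied one after another. It is a toy model, not OpenSearch code: `set_field`, `uppercase`, and `run_pipeline` are hypothetical helpers that only mimic the sequential behavior described above, and the `env` field is an invented example.

```python
def set_field(field, value):
    """Return a toy processor that writes `value` into `field`."""
    def processor(doc):
        return {**doc, field: value}
    return processor


def uppercase(field):
    """Return a toy processor that uppercases the string stored in `field`."""
    def processor(doc):
        return {**doc, field: doc[field].upper()}
    return processor


def run_pipeline(processors, doc):
    """Apply each processor in list order, feeding each one the previous output."""
    for processor in processors:
        doc = processor(doc)
    return doc


# Hypothetical source document.
source = {"message": "hello", "env": "staging"}

# Same two processors, opposite order.
set_then_upper = run_pipeline([set_field("env", "production"), uppercase("env")], source)
upper_then_set = run_pipeline([uppercase("env"), set_field("env", "production")], source)

print(set_then_upper["env"])  # PRODUCTION
print(upper_then_set["env"])  # production
```

Both runs use the same processors, but the result differs because the second processor always sees whatever the first one produced.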

OpenSearch ingest pipelines compared to Data Prepper

OpenSearch ingest pipelines run within the OpenSearch cluster, whereas Data Prepper is an external component that runs separately from the OpenSearch cluster.

OpenSearch ingest pipelines perform actions on indexes. They are recommended for simple data pre-processing and small datasets, and are preferred for use cases involving machine learning (ML) processors and vector embedding processors.

Data Prepper is recommended for any data processing tasks it supports, particularly when dealing with large datasets and complex data pre-processing requirements. It streamlines the process of transferring and fetching large datasets while providing robust capabilities for intricate data preparation and transformation operations. Refer to the Data Prepper documentation for more information.

OpenSearch ingest pipelines can only be managed using Ingest API operations.

Prerequisites

The following are prerequisites for using OpenSearch ingest pipelines:

  • When using ingestion in a production environment, your cluster should contain at least one node whose node roles include ingest. For information about setting up node roles within a cluster, see Cluster Formation.
  • If the OpenSearch Security plugin is enabled, you must have the cluster_manage_pipelines permission to manage ingest pipelines.

Define a pipeline

A pipeline definition describes the sequence of processors in an ingest pipeline and is written in JSON format. A pipeline definition consists of the following:

    {
        "description" : "...",
        "processors" : [...]
    }

Request body fields

Field       | Required | Type                       | Description
------------|----------|----------------------------|------------
processors  | Required | Array of processor objects | A component that performs a specific data processing task as the data is being ingested into OpenSearch.
description | Optional | String                     | A description of the ingest pipeline.
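As a sketch of how these fields fit together, the following Python snippet builds a pipeline definition, checks it against the fields listed above, and prepares a create-pipeline request for the Ingest API. The pipeline name `my-pipeline`, the base URL `http://localhost:9200`, the `env` field, and the helper functions `validate_pipeline` and `build_put_request` are assumptions made for illustration. The `set` and `uppercase` processors appear only as examples of processor objects. The request is built but not sent.

```python
import json
import urllib.request


def validate_pipeline(definition):
    """Check a definition against the request body fields: processors is required."""
    processors = definition.get("processors")
    if not isinstance(processors, list):
        raise ValueError('"processors" is required and must be an array')
    if not all(isinstance(p, dict) for p in processors):
        raise ValueError("each processor must be an object")
    description = definition.get("description")
    if description is not None and not isinstance(description, str):
        raise ValueError('"description" must be a string when present')
    return definition


def build_put_request(base_url, pipeline_id, definition):
    """Build (but do not send) a PUT _ingest/pipeline/<pipeline_id> request."""
    body = json.dumps(validate_pipeline(definition)).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/_ingest/pipeline/{pipeline_id}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Hypothetical definition: field names and values are illustrative only.
pipeline = {
    "description": "Tag each document with an uppercase environment label",
    "processors": [
        {"set": {"field": "env", "value": "production"}},
        {"uppercase": {"field": "env"}},
    ],
}

request = build_put_request("http://localhost:9200", "my-pipeline", pipeline)
print(request.get_method(), request.full_url)
print(json.dumps(pipeline, indent=2))
# To send it to a running cluster: urllib.request.urlopen(request)
```

Sending the request requires a reachable cluster. If the Security plugin is enabled, the connection would typically also need HTTPS and credentials that carry the cluster_manage_pipelines permission, which this sketch leaves out.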

Next steps

Learn how to: