Ingest processors

Ingest processors are a core component of ingest pipelines. They preprocess documents before indexing. For example, you can remove fields, extract values from text, convert data formats, or append additional information.

OpenSearch includes a standard set of ingest processors in every installation. To view the processors available on your cluster, use the Nodes Info API operation:

  GET /_nodes/ingest?filter_path=nodes.*.ingest.processors

To set up and deploy ingest processors, make sure you have the necessary permissions and access rights. See Security plugin REST API to learn more.

Supported processors

Processor types and their required or optional parameters vary depending on your use case. OpenSearch supports the following ingest processors. For tutorials on using these processors in an OpenSearch pipeline, see each processor's documentation.

| Processor type | Description |
| :--- | :--- |
| `append` | Adds one or more values to a field in a document. |
| `bytes` | Converts a human-readable byte value to its value in bytes. |
| `community_id` | Generates the community ID flow hash for network flow tuples. |
| `convert` | Changes the data type of a field in a document. |
| `copy` | Copies an entire object from an existing field to another field. |
| `csv` | Extracts CSV data from a field and stores the values as individual fields in a document. |
| `date` | Parses dates from fields and then uses the date or timestamp as the timestamp for the document. |
| `date_index_name` | Indexes documents into time-based indexes based on a date or timestamp field in a document. |
| `dissect` | Extracts structured fields from a text field using a defined pattern. |
| `dot_expander` | Expands a field containing dots into an object field. |
| `drop` | Drops a document without indexing it or raising any errors. |
| `fail` | Raises an exception and stops pipeline execution. |
| `fingerprint` | Generates a hash value for either specified fields or all fields in a document. |
| `foreach` | Applies another processor to each element of an array or object field in a document. |
| `geoip` | Adds information about the geographical location of an IP address. |
| `geojson-feature` | Indexes GeoJSON data into a geospatial field. |
| `grok` | Parses and structures unstructured data using pattern matching. |
| `gsub` | Replaces or deletes substrings within a string field of a document. |
| `html_strip` | Removes HTML tags from a text field and returns the plain text content. |
| `ip2geo` | Adds information about the geographical location of an IPv4 or IPv6 address. |
| `join` | Concatenates the elements of an array into a single string using a separator character. |
| `json` | Converts a JSON string into a structured JSON object. |
| `kv` | Automatically parses key-value pairs in a field. |
| `lowercase` | Converts text in a specific field to lowercase. |
| `pipeline` | Runs an inner pipeline. |
| `remove` | Removes fields from a document. |
| `remove_by_pattern` | Removes fields from a document by field pattern. |
| `rename` | Renames an existing field. |
| `script` | Runs an inline or stored script on incoming documents. |
| `set` | Sets the value of a field to a specified value. |
| `sort` | Sorts the elements of an array in ascending or descending order. |
| `sparse_encoding` | Generates sparse vector tokens and weights from text fields for neural sparse search. |
| `split` | Splits a field into an array using a separator character. |
| `text_chunking` | Splits long documents into smaller chunks. |
| `text_embedding` | Generates vector embeddings from text fields for semantic search. |
| `text_image_embedding` | Generates combined vector embeddings from text and image fields for multimodal neural search. |
| `trim` | Removes leading and trailing white space from a string field. |
| `uppercase` | Converts text in a specific field to uppercase. |
| `urldecode` | Decodes a string from URL-encoded format. |
| `user_agent` | Extracts details from the user agent string sent by a browser with its web requests. |
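For example, a pipeline combining the `set`, `lowercase`, and `trim` processors might be defined as follows (the pipeline name `normalize-user` and the field names are illustrative):

```
PUT _ingest/pipeline/normalize-user
{
  "description": "Normalize the user field before indexing",
  "processors": [
    { "set": { "field": "source", "value": "web" } },
    { "lowercase": { "field": "user" } },
    { "trim": { "field": "user" } }
  ]
}
```

You can test a pipeline like this with the Simulate Pipeline API (`POST _ingest/pipeline/normalize-user/_simulate`) before routing documents through it with the `pipeline` query parameter on index requests.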

Processor limit settings

You can limit the number of ingest processors using the cluster.ingest.max_number_processors cluster setting. The total count includes both regular processors and on_failure processors.

The default value of cluster.ingest.max_number_processors is Integer.MAX_VALUE. Creating a pipeline with more processors than the configured value throws an IllegalStateException.
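As a sketch, the limit can be lowered dynamically with the Cluster Settings API (the value 100 here is arbitrary):

```
PUT _cluster/settings
{
  "persistent": {
    "cluster.ingest.max_number_processors": 100
  }
}
```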

Batch-enabled processors

Some processors support batch ingestion: they can process multiple documents together as a batch. Batch-enabled processors usually provide better performance when used this way. To use batch processing, send documents through the Bulk API and provide a batch_size parameter. All batch-enabled processors also have a single-document mode: when you ingest documents using the PUT method, the processor runs in single-document mode and processes documents in series. Currently, only the text_embedding and sparse_encoding processors are batch enabled; all other processors process documents one at a time.
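For example, a bulk request that lets a batch-enabled processor operate on two documents at once might look like the following (the index name, pipeline name, and field values are illustrative):

```
POST _bulk?batch_size=2&pipeline=nlp-pipeline
{ "index": { "_index": "testindex" } }
{ "text": "hello world" }
{ "index": { "_index": "testindex" } }
{ "text": "hi planet" }
```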

Selectively enabling processors

Processors defined by the ingest-common module can be selectively enabled using the ingest-common.processors.allowed cluster setting. If the setting is not provided, all processors are enabled by default. Specifying an empty list disables all processors. If the setting is changed to remove previously enabled processors, any pipeline using a disabled processor will fail after node restart, when the new setting takes effect.
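Because the setting takes effect on node restart, it is configured as a static node setting. As a sketch, allowing only a handful of common processors might look like the following in `opensearch.yml` (the processor names chosen here are illustrative):

```
ingest-common.processors.allowed:
  - set
  - remove
  - lowercase
  - trim
```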