Sparse encoding processor
The sparse_encoding processor generates a sparse vector (a set of tokens and their weights) from text fields for use in neural sparse search.
PREREQUISITE
Before using the sparse_encoding processor, you must set up a machine learning (ML) model. For more information, see Choosing a model.
The following is the syntax for the sparse_encoding processor:
{
"sparse_encoding": {
"model_id": "<model_id>",
"field_map": {
"<input_field>": "<vector_field>"
}
}
}
Configuration parameters
The following table lists the required and optional parameters for the sparse_encoding processor.
Parameter | Data type | Required/Optional | Description |
---|---|---|---|
model_id | String | Required | The ID of the model that will be used to generate the embeddings. The model must be deployed in OpenSearch before it can be used in neural search. For more information, see Using custom models within OpenSearch and Neural sparse search. |
field_map | Object | Required | Contains key-value pairs that specify the mapping of a text field to a rank_features field. |
field_map.<input_field> | String | Required | The name of the field from which to obtain text for generating vector embeddings. |
field_map.<vector_field> | String | Required | The name of the vector field in which to store the generated vector embeddings. |
description | String | Optional | A brief description of the processor. |
tag | String | Optional | An identifier tag for the processor. Useful for debugging to distinguish between processors of the same type. |
batch_size | Integer | Optional | The number of documents to batch and process at a time. Default is 1. |
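The parameter table can be turned into a quick pre-flight check before a processor definition is sent to the cluster. The following Python sketch is illustrative only: validate_sparse_encoding is a hypothetical helper, not part of OpenSearch or any client library, and it checks only the names, data types, and required/optional status listed in the table.

```python
# Hypothetical helper: checks a sparse_encoding processor definition against
# the parameter table. It is not part of OpenSearch or its client libraries.

REQUIRED = {"model_id": str, "field_map": dict}
OPTIONAL = {"description": str, "tag": str, "batch_size": int}


def validate_sparse_encoding(config):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for name, expected in REQUIRED.items():
        if name not in config:
            problems.append(f"missing required parameter: {name}")
        elif not isinstance(config[name], expected):
            problems.append(f"{name} must be of type {expected.__name__}")
    for name, expected in OPTIONAL.items():
        if name in config and not isinstance(config[name], expected):
            problems.append(f"{name} must be of type {expected.__name__}")
    # field_map needs at least one <input_field>: <vector_field> pair.
    field_map = config.get("field_map")
    if isinstance(field_map, dict) and not field_map:
        problems.append("field_map must contain at least one entry")
    return problems


config = {
    "model_id": "aP2Q8ooBpBj3wT4HVS8a",
    "field_map": {"passage_text": "passage_embedding"},
}
print(validate_sparse_encoding(config))  # []
print(validate_sparse_encoding({"field_map": {}, "batch_size": "4"}))
```

The second call reports a missing model_id, a wrongly typed batch_size, and an empty field_map. The cluster performs its own validation when the pipeline is created, so a check like this is only a convenience.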
Using the processor
Follow these steps to use the processor in a pipeline. You must provide a model ID when creating the processor. For more information, see Using custom models within OpenSearch.
Step 1: Create a pipeline.
The following example request creates an ingest pipeline where the text from passage_text will be converted into text embeddings and the embeddings will be stored in passage_embedding:
PUT /_ingest/pipeline/nlp-ingest-pipeline
{
"description": "A sparse encoding ingest pipeline",
"processors": [
{
"sparse_encoding": {
"model_id": "aP2Q8ooBpBj3wT4HVS8a",
"field_map": {
"passage_text": "passage_embedding"
}
}
}
]
}
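When pipelines are created from a script rather than typed by hand, the request body can be assembled programmatically. The following Python sketch builds the same body as the Step 1 request; build_sparse_pipeline is a hypothetical helper name, and the code only prints the request instead of sending it, so no client library or running cluster is assumed.

```python
import json


def build_sparse_pipeline(model_id, field_map, description=None, batch_size=None):
    """Build the JSON body for PUT /_ingest/pipeline/<pipeline_name>."""
    processor = {"model_id": model_id, "field_map": dict(field_map)}
    if batch_size is not None:
        # Optional parameter; when omitted, the documented default of 1 applies.
        processor["batch_size"] = batch_size
    body = {}
    if description is not None:
        body["description"] = description
    body["processors"] = [{"sparse_encoding": processor}]
    return body


body = build_sparse_pipeline(
    model_id="aP2Q8ooBpBj3wT4HVS8a",
    field_map={"passage_text": "passage_embedding"},
    description="A sparse encoding ingest pipeline",
)
print("PUT /_ingest/pipeline/nlp-ingest-pipeline")
print(json.dumps(body, indent=2))
```

Passing batch_size=16, for example, adds that key to the sparse_encoding object; the value 16 is arbitrary and not a recommendation.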
Step 2 (Optional): Test the pipeline.
It is recommended that you test your pipeline before you ingest documents.
To test the pipeline, run the following query:
POST /_ingest/pipeline/nlp-ingest-pipeline/_simulate
{
"docs": [
{
"_index": "testindex1",
"_id": "1",
"_source":{
"passage_text": "hello world"
}
}
]
}
Response
The response confirms that, in addition to the passage_text field, the processor has generated text embeddings in the passage_embedding field:
{
"docs" : [
{
"doc" : {
"_index" : "testindex1",
"_id" : "1",
"_source" : {
"passage_embedding" : {
"!" : 0.8708904,
"door" : 0.8587369,
"hi" : 2.3929274,
"worlds" : 2.7839446,
"yes" : 0.75845814,
"##world" : 2.5432441,
"born" : 0.2682308,
"nothing" : 0.8625516,
"goodbye" : 0.17146169,
"greeting" : 0.96817183,
"birth" : 1.2788506,
"come" : 0.1623208,
"global" : 0.4371151,
"it" : 0.42951578,
"life" : 1.5750692,
"thanks" : 0.26481047,
"world" : 4.7300377,
"tiny" : 0.5462298,
"earth" : 2.6555297,
"universe" : 2.0308156,
"worldwide" : 1.3903781,
"hello" : 6.696973,
"so" : 0.20279501,
"?" : 0.67785245
},
"passage_text" : "hello world"
},
"_ingest" : {
"timestamp" : "2023-10-11T22:35:53.654650086Z"
}
}
}
]
}
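The passage_embedding object in the response is a plain map from token to weight, so it is easy to inspect outside OpenSearch. The following Python sketch lists the highest-weighted tokens and compares the document vector with a query-side vector. The query weights are invented for this illustration, and comparing vectors by inner product is an assumption made here to show the idea; the actual scoring is performed by OpenSearch when you run a neural_sparse query.

```python
def top_tokens(sparse_vector, k=5):
    """Return the k highest-weighted (token, weight) pairs."""
    ranked = sorted(sparse_vector.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


def dot_product(query_vector, doc_vector):
    """Sum the weight products over tokens present in both vectors."""
    return sum(
        weight * doc_vector[token]
        for token, weight in query_vector.items()
        if token in doc_vector
    )


# A subset of the passage_embedding values from the simulate response.
passage_embedding = {
    "hello": 6.696973, "world": 4.7300377, "worlds": 2.7839446,
    "earth": 2.6555297, "##world": 2.5432441, "hi": 2.3929274,
    "universe": 2.0308156, "life": 1.5750692,
}

# Hypothetical query-side weights, invented for this illustration.
query_vector = {"hello": 1.0, "earth": 0.5, "moon": 2.0}

print([token for token, _ in top_tokens(passage_embedding, k=3)])
# ['hello', 'world', 'worlds']
print(dot_product(query_vector, passage_embedding))
```

Note that the model expands "hello world" into related tokens such as "earth" and "universe", which is why a query containing "earth" still overlaps with this document, while "moon" contributes nothing.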
Once you have created an ingest pipeline, you need to create an index for ingestion and ingest documents into the index. To learn more, see Step 2: Create an index for ingestion and Step 3: Ingest documents into the index of Neural sparse search.
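As a rough sketch of those two follow-up steps, the following Python code prints an index creation request and a document ingestion request. The index name my-nlp-index is hypothetical, and attaching the pipeline through the default_pipeline index setting is an assumption about how you might wire things together; the passage_embedding field is mapped as rank_features to match the field_map description in the parameter table. Treat the Neural sparse search steps as the authoritative reference.

```python
import json

PIPELINE = "nlp-ingest-pipeline"  # the pipeline created in Step 1
INDEX = "my-nlp-index"            # hypothetical index name

# Index settings and mappings: the vector field uses the rank_features type.
index_body = {
    "settings": {"default_pipeline": PIPELINE},
    "mappings": {
        "properties": {
            "passage_text": {"type": "text"},
            "passage_embedding": {"type": "rank_features"},
        }
    },
}

# Only the text field is supplied; the processor fills in passage_embedding.
document = {"passage_text": "hello world"}

print(f"PUT /{INDEX}")
print(json.dumps(index_body, indent=2))
print(f"PUT /{INDEX}/_doc/1")
print(json.dumps(document, indent=2))
```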
Next steps
- To learn how to use the neural_sparse query for a sparse search, see Neural sparse query.
- To learn more about sparse search, see Neural sparse search.
- To learn more about using models in OpenSearch, see Choosing a model.
- For a comprehensive example, see Neural search tutorial.