Configuring ingest pipelines for neural sparse search
Generating sparse vector embeddings within OpenSearch enables neural sparse search to function like lexical search. To take advantage of this encapsulation, set up an ingest pipeline to create and store sparse vector embeddings from document text during ingestion. At query time, input plain text, which will be automatically converted into vector embeddings for search.
For this tutorial, you’ll use neural sparse search with OpenSearch’s built-in machine learning (ML) model hosting and ingest pipelines. Because the transformation of text to embeddings is performed within OpenSearch, you’ll use text when ingesting and searching documents.
At ingestion time, neural sparse search uses a sparse encoding model to generate sparse vector embeddings from text fields.
At query time, neural sparse search operates in one of two search modes:
- Bi-encoder mode (requires a sparse encoding model): A sparse encoding model generates sparse vector embeddings from both documents and query text. This approach provides better search relevance at the cost of an increase in latency.
- Doc-only mode (requires a sparse encoding model and a tokenizer): A sparse encoding model generates sparse vector embeddings from documents. In this mode, neural sparse search tokenizes query text using a tokenizer and obtains the token weights from a lookup table. This approach provides faster retrieval at the cost of a slight decrease in search relevance. The tokenizer is deployed and invoked using the Model API for a uniform neural sparse search experience.
For more information about choosing the neural sparse search mode that best suits your workload, see Choose the search mode.
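To make the difference between the two modes more concrete, the following Python sketch shows how a doc-only query could be scored. It assumes that the relevance score is the sum of query weight times document weight over the tokens shared by both sparse vectors. The whitespace tokenizer, the TOKEN_WEIGHTS lookup table, and every weight value are hypothetical stand-ins, not output of the real tokenizer or model.

```python
from typing import Dict

# Hypothetical lookup table standing in for the tokenizer's token weights.
TOKEN_WEIGHTS: Dict[str, float] = {"hi": 1.5, "world": 2.0}


def doc_only_query_vector(query_text: str) -> Dict[str, float]:
    """Doc-only mode: tokenize the query and look up a fixed weight per token."""
    tokens = query_text.lower().split()  # toy whitespace tokenizer
    return {t: TOKEN_WEIGHTS[t] for t in tokens if t in TOKEN_WEIGHTS}


def sparse_score(query_vec: Dict[str, float], doc_vec: Dict[str, float]) -> float:
    """Sum query_weight * doc_weight over the tokens both vectors share."""
    return sum(w * doc_vec[t] for t, w in query_vec.items() if t in doc_vec)


# Illustrative document embedding (token -> weight), not real model output.
doc_vec = {"hello": 6.0, "world": 4.0, "hi": 2.0}
query_vec = doc_only_query_vector("Hi world")
print(sparse_score(query_vec, doc_vec))  # 1.5 * 2.0 + 2.0 * 4.0 = 11.0
```

In bi-encoder mode, the query vector would instead come from running the sparse encoding model on the query text, which is where the extra latency originates.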
Tutorial
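This tutorial consists of the following steps:
- Step 1: Configure a sparse encoding model/tokenizer
- Step 2: Ingest data
- Step 3: Search the data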
Prerequisites
Before you start, complete the prerequisites for neural search.
Step 1: Configure a sparse encoding model/tokenizer
Both the bi-encoder and doc-only search modes require you to configure a sparse encoding model. Doc-only mode requires you to configure a tokenizer in addition to the model.
Step 1(a): Choose the search mode
Choose the search mode and the appropriate model/tokenizer combination:
- Bi-encoder: Use the amazon/neural-sparse/opensearch-neural-sparse-encoding-v2-distill model during both ingestion and search.
- Doc-only: Use the amazon/neural-sparse/opensearch-neural-sparse-encoding-doc-v2-distill model during ingestion and the amazon/neural-sparse/opensearch-neural-sparse-tokenizer-v1 tokenizer during search.
The following table provides a search relevance comparison for all available combinations of the two search modes so that you can choose the best combination for your use case.
Mode | Ingestion model | Search model | Avg search relevance on BEIR | Model parameters |
---|---|---|---|---|
Doc-only | amazon/neural-sparse/opensearch-neural-sparse-encoding-doc-v1 | amazon/neural-sparse/opensearch-neural-sparse-tokenizer-v1 | 0.49 | 133M |
Doc-only | amazon/neural-sparse/opensearch-neural-sparse-encoding-doc-v2-distill | amazon/neural-sparse/opensearch-neural-sparse-tokenizer-v1 | 0.504 | 67M |
Doc-only | amazon/neural-sparse/opensearch-neural-sparse-encoding-doc-v2-mini | amazon/neural-sparse/opensearch-neural-sparse-tokenizer-v1 | 0.497 | 23M |
Bi-encoder | amazon/neural-sparse/opensearch-neural-sparse-encoding-v1 | amazon/neural-sparse/opensearch-neural-sparse-encoding-v1 | 0.524 | 133M |
Bi-encoder | amazon/neural-sparse/opensearch-neural-sparse-encoding-v2-distill | amazon/neural-sparse/opensearch-neural-sparse-encoding-v2-distill | 0.528 | 67M |
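If you want to make this choice programmatically, the table can be encoded as data. The following Python sketch picks the combination with the highest BEIR relevance for a given mode, optionally under a parameter budget. The Combo type and the best_combo helper are hypothetical names introduced for illustration; the values are copied from the table above.

```python
from typing import List, NamedTuple, Optional


class Combo(NamedTuple):
    mode: str
    ingestion_model: str
    search_model: str
    beir_relevance: float
    parameters_millions: int


PREFIX = "amazon/neural-sparse/opensearch-neural-sparse-"

# Rows copied from the comparison table.
COMBOS: List[Combo] = [
    Combo("doc-only", PREFIX + "encoding-doc-v1", PREFIX + "tokenizer-v1", 0.49, 133),
    Combo("doc-only", PREFIX + "encoding-doc-v2-distill", PREFIX + "tokenizer-v1", 0.504, 67),
    Combo("doc-only", PREFIX + "encoding-doc-v2-mini", PREFIX + "tokenizer-v1", 0.497, 23),
    Combo("bi-encoder", PREFIX + "encoding-v1", PREFIX + "encoding-v1", 0.524, 133),
    Combo("bi-encoder", PREFIX + "encoding-v2-distill", PREFIX + "encoding-v2-distill", 0.528, 67),
]


def best_combo(mode: str, max_parameters_millions: Optional[int] = None) -> Optional[Combo]:
    """Return the most relevant combination for a mode, or None if none fits the budget."""
    candidates = [
        c
        for c in COMBOS
        if c.mode == mode
        and (max_parameters_millions is None or c.parameters_millions <= max_parameters_millions)
    ]
    return max(candidates, key=lambda c: c.beir_relevance, default=None)


print(best_combo("doc-only"))
print(best_combo("doc-only", max_parameters_millions=30))
```

Relevance and model size are only two of the factors involved; query latency, which the table does not list, also differs between the modes.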
Step 1(b): Register the model/tokenizer
When you register a model/tokenizer, OpenSearch creates a model group for the model/tokenizer. You can also explicitly create a model group before registering models. For more information, see Model access control.
Bi-encoder mode
When using bi-encoder mode, you only need to register the amazon/neural-sparse/opensearch-neural-sparse-encoding-v2-distill model.
Register the sparse encoding model:
POST /_plugins/_ml/models/_register?deploy=true
{
  "name": "amazon/neural-sparse/opensearch-neural-sparse-encoding-v2-distill",
  "version": "1.0.0",
  "model_format": "TORCH_SCRIPT"
}
Registering a model is an asynchronous task. OpenSearch returns a task ID for every model you register:
{
  "task_id": "aFeif4oB5Vm0Tdw8yoN7",
  "status": "CREATED"
}
You can check the status of the task by calling the Tasks API:
GET /_plugins/_ml/tasks/aFeif4oB5Vm0Tdw8yoN7
Once the task is complete, the task state will change to COMPLETED and the Tasks API response will contain the model ID of the registered model:
{
  "model_id": "<bi-encoder model ID>",
  "task_type": "REGISTER_MODEL",
  "function_name": "SPARSE_ENCODING",
  "state": "COMPLETED",
  "worker_node": [
    "4p6FVOmJRtu3wehDD74hzQ"
  ],
  "create_time": 1694358489722,
  "last_update_time": 1694358499139,
  "is_async": true
}
Note the model_id of the model you’ve created; you’ll need it for the following steps.
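The register-then-poll flow above can also be scripted. The following Python sketch uses only the standard library and assumes an unsecured test cluster at http://localhost:9200; BASE_URL, the helper names, the polling interval, and the handling of a FAILED state are assumptions made for illustration, not part of the tutorial.

```python
import json
import time
import urllib.request
from typing import Any, Callable, Dict, Optional

BASE_URL = "http://localhost:9200"  # assumption: local, unsecured test cluster


def call(method: str, path: str, body: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Send one JSON request to the cluster and decode the JSON response."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    request = urllib.request.Request(
        BASE_URL + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def register_body(name: str, version: str) -> Dict[str, str]:
    """Build the registration body used in this tutorial."""
    return {"name": name, "version": version, "model_format": "TORCH_SCRIPT"}


def wait_for_model_id(
    task_id: str,
    fetch: Callable[[str, str], Dict[str, Any]] = call,
    poll_seconds: float = 2.0,
    max_polls: int = 150,
) -> str:
    """Poll the Tasks API until the task is COMPLETED, then return its model_id."""
    for _ in range(max_polls):
        task = fetch("GET", f"/_plugins/_ml/tasks/{task_id}")
        state = task.get("state")
        if state == "COMPLETED":
            return task["model_id"]
        if state == "FAILED":  # assumption: treat FAILED as a terminal state
            raise RuntimeError(f"task {task_id} failed: {task}")
        time.sleep(poll_seconds)
    raise TimeoutError(f"task {task_id} did not complete after {max_polls} polls")


def register_bi_encoder() -> str:
    """Register the bi-encoder model and return its model ID (needs a running cluster)."""
    body = register_body(
        "amazon/neural-sparse/opensearch-neural-sparse-encoding-v2-distill", "1.0.0"
    )
    task = call("POST", "/_plugins/_ml/models/_register?deploy=true", body)
    return wait_for_model_id(task["task_id"])
```

Calling register_bi_encoder() against a running cluster returns the model ID once the task completes. The fetch parameter exists so that the polling logic can be exercised without a cluster.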
Doc-only mode
When using doc-only mode, you need to register the amazon/neural-sparse/opensearch-neural-sparse-encoding-doc-v2-distill model, which you’ll use at ingestion time, and the amazon/neural-sparse/opensearch-neural-sparse-tokenizer-v1 tokenizer, which you’ll use at search time.
Register the sparse encoding model:
POST /_plugins/_ml/models/_register?deploy=true
{
  "name": "amazon/neural-sparse/opensearch-neural-sparse-encoding-doc-v2-distill",
  "version": "1.0.0",
  "model_format": "TORCH_SCRIPT"
}
Register the tokenizer:
POST /_plugins/_ml/models/_register?deploy=true
{
  "name": "amazon/neural-sparse/opensearch-neural-sparse-tokenizer-v1",
  "version": "1.0.1",
  "model_format": "TORCH_SCRIPT"
}
As in bi-encoder mode, use the Tasks API to check the status of each registration task. After the Tasks API returns the task state as COMPLETED, note the model_id of the model and the tokenizer you’ve created; you’ll need them for the following steps.
Step 1(c): Deploy the model/tokenizer
Next, you’ll need to deploy the model/tokenizer you registered. Deploying a model creates a model instance and caches the model in memory.
Bi-encoder mode
To deploy the model, provide its model ID to the _deploy endpoint:
POST /_plugins/_ml/models/<bi-encoder model ID>/_deploy
As with the register operation, the deploy operation is asynchronous, so you’ll get a task ID in the response:
{
  "task_id": "ale6f4oB5Vm0Tdw8NINO",
  "status": "CREATED"
}
You can check the status of the task by using the Tasks API:
GET /_plugins/_ml/tasks/ale6f4oB5Vm0Tdw8NINO
Once the task is complete, the task state will change to COMPLETED:
{
  "model_id": "<bi-encoder model ID>",
  "task_type": "DEPLOY_MODEL",
  "function_name": "SPARSE_ENCODING",
  "state": "COMPLETED",
  "worker_node": [
    "4p6FVOmJRtu3wehDD74hzQ"
  ],
  "create_time": 1694360024141,
  "last_update_time": 1694360027940,
  "is_async": true
}
Doc-only mode
To deploy the model, provide its model ID to the _deploy endpoint:
POST /_plugins/_ml/models/<doc-only model ID>/_deploy
You can deploy the tokenizer in the same way:
POST /_plugins/_ml/models/<tokenizer ID>/_deploy
As with bi-encoder mode, you can check the status of both deploy tasks by using the Tasks API. Once each task is complete, its state will change to COMPLETED.
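Because doc-only mode involves two deploy tasks, it can help to track them together. The following Python sketch builds the _deploy paths for both IDs and reports which tasks have not yet reached COMPLETED. The helper names, the task IDs, and the sample responses (including the RUNNING state) are hypothetical and only illustrate the bookkeeping.

```python
from typing import Any, Dict, List


def deploy_path(model_id: str) -> str:
    """Return the _deploy endpoint for one registered model or tokenizer."""
    return f"/_plugins/_ml/models/{model_id}/_deploy"


def unfinished_tasks(task_responses: Dict[str, Dict[str, Any]]) -> List[str]:
    """Return the IDs of tasks whose Tasks API response is not yet COMPLETED."""
    return [
        task_id
        for task_id, response in task_responses.items()
        if response.get("state") != "COMPLETED"
    ]


# Placeholder IDs; substitute the values returned by your registration tasks.
for model_id in ("<doc-only model ID>", "<tokenizer ID>"):
    print("POST", deploy_path(model_id))

# Hypothetical Tasks API responses keyed by task ID.
responses = {
    "task-model": {"state": "COMPLETED", "task_type": "DEPLOY_MODEL"},
    "task-tokenizer": {"state": "RUNNING", "task_type": "DEPLOY_MODEL"},
}
print(unfinished_tasks(responses))  # ['task-tokenizer']
```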
Step 2: Ingest data
In both the bi-encoder and doc-only modes, you’ll use a sparse encoding model at ingestion time to generate sparse vector embeddings.
Step 2(a): Create an ingest pipeline
To generate sparse vector embeddings, you need to create an ingest pipeline that contains a sparse_encoding processor, which will convert the text in a document field to vector embeddings. The processor’s field_map determines the input fields from which to generate vector embeddings and the output fields in which to store the embeddings.
The following example request creates an ingest pipeline where the text from passage_text will be converted into sparse vector embeddings, which will be stored in passage_embedding. Provide the model ID of the registered model in the request:
PUT /_ingest/pipeline/nlp-ingest-pipeline-sparse
{
  "description": "A sparse encoding ingest pipeline",
  "processors": [
    {
      "sparse_encoding": {
        "model_id": "<bi-encoder or doc-only model ID>",
        "field_map": {
          "passage_text": "passage_embedding"
        }
      }
    }
  ]
}
To split long text into passages, use the text_chunking ingest processor before the sparse_encoding processor. For more information, see Text chunking.
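As a rough illustration of that ordering, the following Python sketch builds a pipeline body in which a text_chunking processor runs before the sparse_encoding processor. The passage_chunk and passage_chunk_embedding field names, the fixed_token_length settings, and their values are assumptions chosen for the example; check the Text chunking documentation for the options your version supports. The index mapping for chunked output also differs from the one used in this tutorial and is not covered here.

```python
import json
from typing import Any, Dict


def chunked_sparse_pipeline(
    model_id: str, token_limit: int = 384, overlap_rate: float = 0.2
) -> Dict[str, Any]:
    """Build a pipeline body that chunks passage_text, then encodes each chunk."""
    return {
        "description": "Chunks passage_text, then sparse-encodes each chunk",
        "processors": [
            {
                # Runs first: split the long text into passages.
                "text_chunking": {
                    "algorithm": {
                        "fixed_token_length": {
                            "token_limit": token_limit,  # illustrative value
                            "overlap_rate": overlap_rate,  # illustrative value
                            "tokenizer": "standard",
                        }
                    },
                    "field_map": {"passage_text": "passage_chunk"},
                }
            },
            {
                # Runs second: encode the chunks produced by the previous processor.
                "sparse_encoding": {
                    "model_id": model_id,
                    "field_map": {"passage_chunk": "passage_chunk_embedding"},
                }
            },
        ],
    }


print(json.dumps(chunked_sparse_pipeline("<bi-encoder or doc-only model ID>"), indent=2))
```

The order of the processors list matters: the output field of the chunking processor is the input field of the encoding processor.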
Step 2(b): Create an index for ingestion
To use the sparse encoding processor defined in your pipeline, create a rank features index, adding the pipeline created in the previous step as the default pipeline. Ensure that the fields defined in the field_map are mapped as the correct types. Continuing with the example, the passage_embedding field must be mapped as rank_features. Similarly, the passage_text field must be mapped as text.
The following example request creates a rank features index configured with a default ingest pipeline:
PUT /my-nlp-index
{
  "settings": {
    "default_pipeline": "nlp-ingest-pipeline-sparse"
  },
  "mappings": {
    "properties": {
      "id": {
        "type": "text"
      },
      "passage_embedding": {
        "type": "rank_features"
      },
      "passage_text": {
        "type": "text"
      }
    }
  }
}
To save disk space, you can exclude the embedding vector from the source as follows:
PUT /my-nlp-index
{
  "settings": {
    "default_pipeline": "nlp-ingest-pipeline-sparse"
  },
  "mappings": {
    "_source": {
      "excludes": [
        "passage_embedding"
      ]
    },
    "properties": {
      "id": {
        "type": "text"
      },
      "passage_embedding": {
        "type": "rank_features"
      },
      "passage_text": {
        "type": "text"
      }
    }
  }
}
Once the <token, weight> pairs are excluded from the source, they cannot be recovered. Before applying this optimization, make sure you don’t need the <token, weight> pairs for your application.
Step 2(c): Ingest documents into the index
To ingest documents into the index created in the previous step, send the following requests:
PUT /my-nlp-index/_doc/1
{
  "passage_text": "Hello world",
  "id": "s1"
}
PUT /my-nlp-index/_doc/2
{
  "passage_text": "Hi planet",
  "id": "s2"
}
Before the document is ingested into the index, the ingest pipeline runs the sparse_encoding processor on the document, generating vector embeddings for the passage_text field. The indexed document includes the passage_text field, which contains the original text, and the passage_embedding field, which contains the vector embeddings.
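For more than a handful of documents, you would typically batch the requests. The following Python sketch builds a newline-delimited body for the Bulk API containing the same two documents; because the index has a default pipeline, the sparse_encoding processor should run on bulk-indexed documents as well. The bulk_body helper is a hypothetical name, and sending the body (for example, as a POST to /_bulk with an NDJSON content type) is left to your client of choice.

```python
import json
from typing import Dict


def bulk_body(index: str, docs: Dict[str, Dict[str, str]]) -> str:
    """Build an NDJSON bulk body: one action line followed by one source line per document."""
    lines = []
    for doc_id, source in docs.items():
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps(source))
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"


docs = {
    "1": {"passage_text": "Hello world", "id": "s1"},
    "2": {"passage_text": "Hi planet", "id": "s2"},
}
print(bulk_body("my-nlp-index", docs), end="")
```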
Step 3: Search the data
To perform a neural sparse search on your index, use the neural_sparse query clause in Query DSL queries.
The following example request uses a neural_sparse query to search for relevant documents using a raw text query. Provide the model ID for bi-encoder mode or the tokenizer ID for doc-only mode:
GET my-nlp-index/_search
{
  "query": {
    "neural_sparse": {
      "passage_embedding": {
        "query_text": "Hi world",
        "model_id": "<bi-encoder or tokenizer ID>"
      }
    }
  }
}
The response contains the matching documents:
{
  "took" : 688,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 30.0029,
    "hits" : [
      {
        "_index" : "my-nlp-index",
        "_id" : "1",
        "_score" : 30.0029,
        "_source" : {
          "passage_text" : "Hello world",
          "passage_embedding" : {
            "!" : 0.8708904,
            "door" : 0.8587369,
            "hi" : 2.3929274,
            "worlds" : 2.7839446,
            "yes" : 0.75845814,
            "##world" : 2.5432441,
            "born" : 0.2682308,
            "nothing" : 0.8625516,
            "goodbye" : 0.17146169,
            "greeting" : 0.96817183,
            "birth" : 1.2788506,
            "come" : 0.1623208,
            "global" : 0.4371151,
            "it" : 0.42951578,
            "life" : 1.5750692,
            "thanks" : 0.26481047,
            "world" : 4.7300377,
            "tiny" : 0.5462298,
            "earth" : 2.6555297,
            "universe" : 2.0308156,
            "worldwide" : 1.3903781,
            "hello" : 6.696973,
            "so" : 0.20279501,
            "?" : 0.67785245
          },
          "id" : "s1"
        }
      },
      {
        "_index" : "my-nlp-index",
        "_id" : "2",
        "_score" : 16.480486,
        "_source" : {
          "passage_text" : "Hi planet",
          "passage_embedding" : {
            "hi" : 4.338913,
            "planets" : 2.7755864,
            "planet" : 5.0969057,
            "mars" : 1.7405145,
            "earth" : 2.6087382,
            "hello" : 3.3210192
          },
          "id" : "s2"
        }
      }
    ]
  }
}
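To inspect a response of this shape in application code, you might reduce each hit to its ID, score, and highest-weighted tokens. The following Python sketch does that for an abbreviated copy of the response above; summarize_hits and top_tokens are hypothetical helper names, and the sample keeps only a few of the token weights.

```python
from typing import Any, Dict, List, Tuple


def top_tokens(embedding: Dict[str, float], k: int = 3) -> List[str]:
    """Return the k tokens with the largest weights, highest first."""
    ranked = sorted(embedding.items(), key=lambda item: item[1], reverse=True)
    return [token for token, _ in ranked[:k]]


def summarize_hits(response: Dict[str, Any], k: int = 3) -> List[Tuple[str, float, List[str]]]:
    """Reduce each hit to (document ID, score, top-k embedding tokens)."""
    summary = []
    for hit in response["hits"]["hits"]:
        # The embedding is absent when it has been excluded from the source.
        embedding = hit.get("_source", {}).get("passage_embedding", {})
        summary.append((hit["_id"], hit["_score"], top_tokens(embedding, k)))
    return summary


# Abbreviated copy of the example response (only a few token weights kept).
response = {
    "hits": {
        "hits": [
            {"_id": "1", "_score": 30.0029, "_source": {"passage_embedding": {
                "hello": 6.696973, "world": 4.7300377, "worlds": 2.7839446, "earth": 2.6555297}}},
            {"_id": "2", "_score": 16.480486, "_source": {"passage_embedding": {
                "hi": 4.338913, "planets": 2.7755864, "planet": 5.0969057, "hello": 3.3210192}}},
        ]
    }
}
for doc_id, score, tokens in summarize_hits(response):
    print(doc_id, score, tokens)
```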
To reduce the disk and network I/O latency caused by returning sparse embeddings, you can exclude the embedding field from the source returned in the search response as follows:
GET my-nlp-index/_search
{
  "_source": {
    "excludes": [
      "passage_embedding"
    ]
  },
  "query": {
    "neural_sparse": {
      "passage_embedding": {
        "query_text": "Hi world",
        "model_id": "<bi-encoder or tokenizer ID>"
      }
    }
  }
}
Accelerating neural sparse search
To learn more about improving retrieval time for neural sparse search, see Accelerating neural sparse search.
Creating a search pipeline for neural sparse search
You can create a search pipeline that augments neural sparse search functionality by:
- Accelerating neural sparse search for faster retrieval.
- Setting the default model ID on an index for easier use.
To configure the pipeline, add a neural_sparse_two_phase_processor or a neural_query_enricher processor. The following request creates a pipeline with both processors:
PUT /_search/pipeline/neural_search_pipeline
{
  "request_processors": [
    {
      "neural_sparse_two_phase_processor": {
        "tag": "neural-sparse",
        "description": "Creates a two-phase processor for neural sparse search."
      }
    },
    {
      "neural_query_enricher" : {
        "default_model_id": "<bi-encoder model/tokenizer ID>"
      }
    }
  ]
}
Then set the default pipeline for your index to the newly created search pipeline:
PUT /my-nlp-index/_settings
{
  "index.search.default_pipeline" : "neural_search_pipeline"
}
For more information about setting a default model on an index, or to learn how to set a default model on a specific field, see Setting a default model on an index or field.
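With a default model ID configured through the neural_query_enricher processor, a neural_sparse query should no longer need its own model_id. The following Python sketch builds the query body either way; neural_sparse_query is a hypothetical helper, and the behavior when model_id is omitted relies on the search pipeline shown above being set as the index default.

```python
import json
from typing import Any, Dict, Optional


def neural_sparse_query(
    field: str, query_text: str, model_id: Optional[str] = None
) -> Dict[str, Any]:
    """Build a neural_sparse query body, including model_id only when one is given."""
    clause: Dict[str, Any] = {"query_text": query_text}
    if model_id is not None:
        clause["model_id"] = model_id
    return {"query": {"neural_sparse": {field: clause}}}


# Relies on the default model ID supplied by the search pipeline.
print(json.dumps(neural_sparse_query("passage_embedding", "Hi world"), indent=2))

# Explicit model ID, as in the earlier search example.
print(json.dumps(
    neural_sparse_query("passage_embedding", "Hi world", "<bi-encoder or tokenizer ID>"),
    indent=2,
))
```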
Troubleshooting
This section contains information about resolving common issues encountered while running neural sparse search.
Remote connector throttling exceptions
When using connectors to call a remote service such as Amazon SageMaker, ingestion and search calls sometimes fail because of remote connector throttling exceptions.
For OpenSearch versions earlier than 2.15, a throttling exception will be returned as an error from the remote service:
{
  "type": "status_exception",
  "reason": "Error from remote service: {\"message\":null}"
}
To mitigate throttling exceptions, decrease the maximum number of connections specified in the max_connection setting in the connector’s client_config object. Doing so will prevent the maximum number of concurrent connections from exceeding the threshold of the remote service. You can also modify the retry settings to avoid a request spike during ingestion.
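As a small illustration of the first mitigation, the following Python sketch produces a copy of a connector definition with a lower max_connection value. The connector fragment, its name, the starting value of 30, and the with_max_connection helper are hypothetical; only the client_config.max_connection setting comes from the text above, and retry settings are not covered by this sketch.

```python
import copy
from typing import Any, Dict


def with_max_connection(connector: Dict[str, Any], max_connection: int) -> Dict[str, Any]:
    """Return a copy of the connector definition with a new max_connection value."""
    if max_connection < 1:
        raise ValueError("max_connection must be at least 1")
    updated = copy.deepcopy(connector)  # leave the caller's definition untouched
    updated.setdefault("client_config", {})["max_connection"] = max_connection
    return updated


# Hypothetical connector fragment used only for illustration.
connector = {"name": "my-remote-model", "client_config": {"max_connection": 30}}
print(with_max_connection(connector, 10)["client_config"])  # {'max_connection': 10}
```

Choose the new value based on the concurrency limit of the remote service rather than on the number used here.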