Configuring ingest pipelines for neural sparse search

Generating sparse vector embeddings within OpenSearch enables neural sparse search to function like lexical search. To take advantage of this encapsulation, set up an ingest pipeline to create and store sparse vector embeddings from document text during ingestion. At query time, input plain text, which will be automatically converted into vector embeddings for search.

For this tutorial, you’ll use neural sparse search with OpenSearch’s built-in machine learning (ML) model hosting and ingest pipelines. Because the transformation of text to embeddings is performed within OpenSearch, you’ll use text when ingesting and searching documents.

At ingestion time, neural sparse search uses a sparse encoding model to generate sparse vector embeddings from text fields.

At query time, neural sparse search operates in one of two search modes:

  • Bi-encoder mode (requires a sparse encoding model): A sparse encoding model generates sparse vector embeddings from both documents and query text. This approach provides better search relevance at the cost of an increase in latency.

  • Doc-only mode (requires a sparse encoding model and a tokenizer): A sparse encoding model generates sparse vector embeddings from documents. In this mode, neural sparse search tokenizes query text using a tokenizer and obtains the token weights from a lookup table. This approach provides faster retrieval at the cost of a slight decrease in search relevance. The tokenizer is deployed and invoked using the Model API for a uniform neural sparse search experience.

For more information about choosing the neural sparse search mode that best suits your workload, see Choose the search mode.

Tutorial

This tutorial consists of the following steps:

  1. Configure a sparse encoding model/tokenizer.
    1. Choose the search mode
    2. Register the model/tokenizer
    3. Deploy the model/tokenizer
  2. Ingest data
    1. Create an ingest pipeline
    2. Create an index for ingestion
    3. Ingest documents into the index
  3. Search the data

Prerequisites

Before you start, complete the prerequisites for neural search.

Step 1: Configure a sparse encoding model/tokenizer

Both the bi-encoder and doc-only search modes require you to configure a sparse encoding model. Doc-only mode requires you to configure a tokenizer in addition to the model.

Step 1(a): Choose the search mode

Choose the search mode and the appropriate model/tokenizer combination:

  • Bi-encoder: Use the amazon/neural-sparse/opensearch-neural-sparse-encoding-v2-distill model during both ingestion and search.

  • Doc-only: Use the amazon/neural-sparse/opensearch-neural-sparse-encoding-doc-v2-distill model during ingestion and the amazon/neural-sparse/opensearch-neural-sparse-tokenizer-v1 tokenizer during search.

The following table provides a search relevance comparison for all available combinations of the two search modes so that you can choose the best combination for your use case.

| Mode | Ingestion model | Search model | Avg search relevance on BEIR | Model parameters |
|------|-----------------|--------------|------------------------------|------------------|
| Doc-only | amazon/neural-sparse/opensearch-neural-sparse-encoding-doc-v1 | amazon/neural-sparse/opensearch-neural-sparse-tokenizer-v1 | 0.49 | 133M |
| Doc-only | amazon/neural-sparse/opensearch-neural-sparse-encoding-doc-v2-distill | amazon/neural-sparse/opensearch-neural-sparse-tokenizer-v1 | 0.504 | 67M |
| Doc-only | amazon/neural-sparse/opensearch-neural-sparse-encoding-doc-v2-mini | amazon/neural-sparse/opensearch-neural-sparse-tokenizer-v1 | 0.497 | 23M |
| Bi-encoder | amazon/neural-sparse/opensearch-neural-sparse-encoding-v1 | amazon/neural-sparse/opensearch-neural-sparse-encoding-v1 | 0.524 | 133M |
| Bi-encoder | amazon/neural-sparse/opensearch-neural-sparse-encoding-v2-distill | amazon/neural-sparse/opensearch-neural-sparse-encoding-v2-distill | 0.528 | 67M |

Step 1(b): Register the model/tokenizer

When you register a model/tokenizer, OpenSearch creates a model group for the model/tokenizer. You can also explicitly create a model group before registering models. For more information, see Model access control.
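If you want to manage model access explicitly, you can register a model group first and reference it when registering models. The following is a minimal sketch using the Register Model Group API; the group name and description are placeholders, and the returned model_group_id can be supplied as model_group_id in the model registration requests that follow:

POST /_plugins/_ml/model_groups/_register
{
  "name": "neural_sparse_model_group",
  "description": "Model group for neural sparse search models"
}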

Bi-encoder mode

When using bi-encoder mode, you only need to register the amazon/neural-sparse/opensearch-neural-sparse-encoding-v2-distill model.

Register the sparse encoding model:

POST /_plugins/_ml/models/_register?deploy=true
{
  "name": "amazon/neural-sparse/opensearch-neural-sparse-encoding-v2-distill",
  "version": "1.0.0",
  "model_format": "TORCH_SCRIPT"
}


Registering a model is an asynchronous task. OpenSearch returns a task ID for every model you register:

{
  "task_id": "aFeif4oB5Vm0Tdw8yoN7",
  "status": "CREATED"
}

You can check the status of the task by calling the Tasks API:

GET /_plugins/_ml/tasks/aFeif4oB5Vm0Tdw8yoN7


Once the task is complete, the task state will change to COMPLETED and the Tasks API response will contain the model ID of the registered model:

{
  "model_id": "<bi-encoder model ID>",
  "task_type": "REGISTER_MODEL",
  "function_name": "SPARSE_ENCODING",
  "state": "COMPLETED",
  "worker_node": [
    "4p6FVOmJRtu3wehDD74hzQ"
  ],
  "create_time": 1694358489722,
  "last_update_time": 1694358499139,
  "is_async": true
}

Note the model_id of the model you’ve created; you’ll need it for the following steps.
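You can verify the registration at any time by passing this ID to the Get Model API, which returns the model's metadata, including its current state. A minimal sketch (the placeholder ID is illustrative):

GET /_plugins/_ml/models/<bi-encoder model ID>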

Doc-only mode

When using doc-only mode, you need to register the amazon/neural-sparse/opensearch-neural-sparse-encoding-doc-v2-distill model, which you’ll use at ingestion time, and the amazon/neural-sparse/opensearch-neural-sparse-tokenizer-v1 tokenizer, which you’ll use at search time.

Register the sparse encoding model:

POST /_plugins/_ml/models/_register?deploy=true
{
  "name": "amazon/neural-sparse/opensearch-neural-sparse-encoding-doc-v2-distill",
  "version": "1.0.0",
  "model_format": "TORCH_SCRIPT"
}


Register the tokenizer:

POST /_plugins/_ml/models/_register?deploy=true
{
  "name": "amazon/neural-sparse/opensearch-neural-sparse-tokenizer-v1",
  "version": "1.0.1",
  "model_format": "TORCH_SCRIPT"
}


As in bi-encoder mode, use the Tasks API to check the status of each registration task. Once the Tasks API returns the task state as COMPLETED, note the model_id values of the model and the tokenizer you've created; you'll need them for the following steps.
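If you lose track of either ID, one way to recover it is to search the registered models by name. The following is a minimal sketch using the Search Models API, which accepts standard query DSL; adjust the query to match your model or tokenizer name:

POST /_plugins/_ml/models/_search
{
  "query": {
    "match": {
      "name": "opensearch-neural-sparse-tokenizer-v1"
    }
  }
}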

Step 1(c): Deploy the model/tokenizer

Next, you’ll need to deploy the model/tokenizer you registered. Deploying a model creates a model instance and caches the model in memory.

Bi-encoder mode

To deploy the model, provide its model ID to the _deploy endpoint:

POST /_plugins/_ml/models/<bi-encoder model ID>/_deploy


As with the register operation, the deploy operation is asynchronous, so you’ll get a task ID in the response:

{
  "task_id": "ale6f4oB5Vm0Tdw8NINO",
  "status": "CREATED"
}

You can check the status of the task by using the Tasks API:

GET /_plugins/_ml/tasks/ale6f4oB5Vm0Tdw8NINO


Once the task is complete, the task state will change to COMPLETED:

{
  "model_id": "<bi-encoder model ID>",
  "task_type": "DEPLOY_MODEL",
  "function_name": "SPARSE_ENCODING",
  "state": "COMPLETED",
  "worker_node": [
    "4p6FVOmJRtu3wehDD74hzQ"
  ],
  "create_time": 1694360024141,
  "last_update_time": 1694360027940,
  "is_async": true
}

Doc-only mode

To deploy the model, provide its model ID to the _deploy endpoint:

POST /_plugins/_ml/models/<doc-only model ID>/_deploy


You can deploy the tokenizer in the same way:

POST /_plugins/_ml/models/<tokenizer ID>/_deploy


As with bi-encoder mode, you can check the status of both deploy tasks by using the Tasks API. Once the task is complete, the task state will change to COMPLETED.
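For example, assuming the two deploy calls returned task IDs, you can poll each one (the IDs below are placeholders):

GET /_plugins/_ml/tasks/<model deploy task ID>
GET /_plugins/_ml/tasks/<tokenizer deploy task ID>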

Step 2: Ingest data

In both the bi-encoder and doc-only modes, you’ll use a sparse encoding model at ingestion time to generate sparse vector embeddings.

Step 2(a): Create an ingest pipeline

To generate sparse vector embeddings, you need to create an ingest pipeline that contains a sparse_encoding processor, which will convert the text in a document field to vector embeddings. The processor’s field_map determines the input fields from which to generate vector embeddings and the output fields in which to store the embeddings.

The following example request creates an ingest pipeline where the text from passage_text will be converted into sparse vector embeddings, which will be stored in passage_embedding. Provide the model ID of the registered model in the request:

PUT /_ingest/pipeline/nlp-ingest-pipeline-sparse
{
  "description": "A sparse encoding ingest pipeline",
  "processors": [
    {
      "sparse_encoding": {
        "model_id": "<bi-encoder or doc-only model ID>",
        "field_map": {
          "passage_text": "passage_embedding"
        }
      }
    }
  ]
}


To split long text into passages, use the text_chunking ingest processor before the sparse_encoding processor. For more information, see Text chunking.
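As a sketch of what such a pipeline might look like, the following request chunks passage_text before encoding each chunk. The pipeline name, the passage_chunk and passage_chunk_embedding field names, and the token_limit value are illustrative assumptions; tune the chunking parameters to your model's context window:

PUT /_ingest/pipeline/nlp-ingest-pipeline-sparse-chunked
{
  "description": "A text chunking and sparse encoding ingest pipeline",
  "processors": [
    {
      "text_chunking": {
        "algorithm": {
          "fixed_token_length": {
            "token_limit": 384,
            "overlap_rate": 0.2,
            "tokenizer": "standard"
          }
        },
        "field_map": {
          "passage_text": "passage_chunk"
        }
      }
    },
    {
      "sparse_encoding": {
        "model_id": "<bi-encoder or doc-only model ID>",
        "field_map": {
          "passage_chunk": "passage_chunk_embedding"
        }
      }
    }
  ]
}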

Step 2(b): Create an index for ingestion

To use the sparse encoding processor defined in your pipeline, create a rank features index, adding the pipeline created in the previous step as the default pipeline. Ensure that the fields defined in the field_map are mapped as the correct types. Continuing with the example, the passage_embedding field must be mapped as rank_features. Similarly, the passage_text field must be mapped as text.

The following example request creates a rank features index configured with a default ingest pipeline:

PUT /my-nlp-index
{
  "settings": {
    "default_pipeline": "nlp-ingest-pipeline-sparse"
  },
  "mappings": {
    "properties": {
      "id": {
        "type": "text"
      },
      "passage_embedding": {
        "type": "rank_features"
      },
      "passage_text": {
        "type": "text"
      }
    }
  }
}


To save disk space, you can exclude the embedding vector from the source as follows:

PUT /my-nlp-index
{
  "settings": {
    "default_pipeline": "nlp-ingest-pipeline-sparse"
  },
  "mappings": {
    "_source": {
      "excludes": [
        "passage_embedding"
      ]
    },
    "properties": {
      "id": {
        "type": "text"
      },
      "passage_embedding": {
        "type": "rank_features"
      },
      "passage_text": {
        "type": "text"
      }
    }
  }
}


Once the <token, weight> pairs are excluded from the source, they cannot be recovered. Before applying this optimization, make sure you don’t need the <token, weight> pairs for your application.

Step 2(c): Ingest documents into the index

To ingest documents into the index created in the previous step, send the following requests:

PUT /my-nlp-index/_doc/1
{
  "passage_text": "Hello world",
  "id": "s1"
}


PUT /my-nlp-index/_doc/2
{
  "passage_text": "Hi planet",
  "id": "s2"
}


Before the document is ingested into the index, the ingest pipeline runs the sparse_encoding processor on the document, generating vector embeddings for the passage_text field. The indexed document includes the passage_text field, which contains the original text, and the passage_embedding field, which contains the vector embeddings.
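To preview the generated embeddings without writing a document, you can run the pipeline against a sample document using the Simulate Pipeline API. A minimal sketch, reusing the example pipeline and document:

POST /_ingest/pipeline/nlp-ingest-pipeline-sparse/_simulate
{
  "docs": [
    {
      "_index": "my-nlp-index",
      "_id": "1",
      "_source": {
        "passage_text": "Hello world"
      }
    }
  ]
}

The response shows the document as it would be indexed, including the generated passage_embedding field.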

Step 3: Search the data

To perform a neural sparse search on your index, use the neural_sparse query clause in Query DSL queries.

The following example request uses a neural_sparse query to search for relevant documents using a raw text query. Provide the model ID for bi-encoder mode or the tokenizer ID for doc-only mode:

GET my-nlp-index/_search
{
  "query": {
    "neural_sparse": {
      "passage_embedding": {
        "query_text": "Hi world",
        "model_id": "<bi-encoder or tokenizer ID>"
      }
    }
  }
}


The response contains the matching documents:

{
  "took" : 688,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 30.0029,
    "hits" : [
      {
        "_index" : "my-nlp-index",
        "_id" : "1",
        "_score" : 30.0029,
        "_source" : {
          "passage_text" : "Hello world",
          "passage_embedding" : {
            "!" : 0.8708904,
            "door" : 0.8587369,
            "hi" : 2.3929274,
            "worlds" : 2.7839446,
            "yes" : 0.75845814,
            "##world" : 2.5432441,
            "born" : 0.2682308,
            "nothing" : 0.8625516,
            "goodbye" : 0.17146169,
            "greeting" : 0.96817183,
            "birth" : 1.2788506,
            "come" : 0.1623208,
            "global" : 0.4371151,
            "it" : 0.42951578,
            "life" : 1.5750692,
            "thanks" : 0.26481047,
            "world" : 4.7300377,
            "tiny" : 0.5462298,
            "earth" : 2.6555297,
            "universe" : 2.0308156,
            "worldwide" : 1.3903781,
            "hello" : 6.696973,
            "so" : 0.20279501,
            "?" : 0.67785245
          },
          "id" : "s1"
        }
      },
      {
        "_index" : "my-nlp-index",
        "_id" : "2",
        "_score" : 16.480486,
        "_source" : {
          "passage_text" : "Hi planet",
          "passage_embedding" : {
            "hi" : 4.338913,
            "planets" : 2.7755864,
            "planet" : 5.0969057,
            "mars" : 1.7405145,
            "earth" : 2.6087382,
            "hello" : 3.3210192
          },
          "id" : "s2"
        }
      }
    ]
  }
}

To minimize disk and network I/O latency related to sparse embedding sources, you can exclude the embedding vector source from the query as follows:

GET my-nlp-index/_search
{
  "_source": {
    "excludes": [
      "passage_embedding"
    ]
  },
  "query": {
    "neural_sparse": {
      "passage_embedding": {
        "query_text": "Hi world",
        "model_id": "<bi-encoder or tokenizer ID>"
      }
    }
  }
}


To learn more about improving retrieval time for neural sparse search, see Accelerating neural sparse search.

You can create a search pipeline that augments neural sparse search functionality by:

  • Accelerating neural sparse search for faster retrieval.
  • Setting the default model ID on an index for easier use.

To configure the pipeline, add a neural_sparse_two_phase_processor or a neural_query_enricher processor. The following request creates a pipeline with both processors:

PUT /_search/pipeline/neural_search_pipeline
{
  "request_processors": [
    {
      "neural_sparse_two_phase_processor": {
        "tag": "neural-sparse",
        "description": "Creates a two-phase processor for neural sparse search."
      }
    },
    {
      "neural_query_enricher" : {
        "default_model_id": "<bi-encoder model/tokenizer ID>"
      }
    }
  ]
}


Then set the default pipeline for your index to the newly created search pipeline:

PUT /my-nlp-index/_settings
{
  "index.search.default_pipeline" : "neural_search_pipeline"
}

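With the neural_query_enricher processor supplying a default model ID, you can now omit model_id from the query. A minimal sketch that reuses the earlier example query:

GET /my-nlp-index/_search
{
  "query": {
    "neural_sparse": {
      "passage_embedding": {
        "query_text": "Hi world"
      }
    }
  }
}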

For more information about setting a default model on an index, or to learn how to set a default model on a specific field, see Setting a default model on an index or field.

Troubleshooting

This section contains information about resolving common issues encountered while running neural sparse search.

Remote connector throttling exceptions

When using connectors to call a remote service such as Amazon SageMaker, ingestion and search calls sometimes fail because of remote connector throttling exceptions.

For OpenSearch versions earlier than 2.15, a throttling exception will be returned as an error from the remote service:

{
  "type": "status_exception",
  "reason": "Error from remote service: {\"message\":null}"
}

To mitigate throttling exceptions, decrease the maximum number of connections specified in the max_connection setting in the connector’s client_config object. Doing so will prevent the maximum number of concurrent connections from exceeding the threshold of the remote service. You can also modify the retry settings to avoid a request spike during ingestion.
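As a sketch of what such a change might look like, the following request lowers the connection limit and adjusts retries on an existing connector using the Update Connector API. The connector ID is a placeholder, and the exact client_config fields available depend on your OpenSearch version, so verify them against the connector documentation before applying:

PUT /_plugins/_ml/connectors/<connector ID>
{
  "client_config": {
    "max_connection": 10,
    "max_retry_times": 3,
    "retry_backoff_millis": 200
  }
}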