Text chunking
Introduced 2.13
To split long text into passages, you can use a text_chunking processor as a preprocessing step for a text_embedding or sparse_encoding processor in order to obtain embeddings for each chunked passage. For more information about the processor parameters, see Text chunking processor.

Before you start, follow the steps outlined in the pretrained model documentation to register an embedding model. The following example preprocesses text by splitting it into passages and then produces embeddings using the text_embedding processor.
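If you have not yet registered a model, the following request is a minimal sketch for registering and deploying one of the pretrained models; the model name and version here are illustrative, so check the pretrained model documentation for currently supported values:

POST /_plugins/_ml/models/_register?deploy=true
{
  "name": "huggingface/sentence-transformers/msmarco-distilbert-base-tas-b",
  "version": "1.0.1",
  "model_format": "TORCH_SCRIPT"
}

The response contains a task ID; once the task completes, querying GET /_plugins/_ml/tasks/<task_id> returns the model_id that you will reference in the pipeline below.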
Step 1: Create a pipeline
The following example request creates an ingest pipeline that converts the text in the passage_text field into chunked passages, which will be stored in the passage_chunk field. The text in the passage_chunk field is then converted into text embeddings, and the embeddings are stored in the passage_chunk_embedding field:
PUT _ingest/pipeline/text-chunking-embedding-ingest-pipeline
{
  "description": "A text chunking and embedding ingest pipeline",
  "processors": [
    {
      "text_chunking": {
        "algorithm": {
          "fixed_token_length": {
            "token_limit": 10,
            "overlap_rate": 0.2,
            "tokenizer": "standard"
          }
        },
        "field_map": {
          "passage_text": "passage_chunk"
        }
      }
    },
    {
      "text_embedding": {
        "model_id": "LMLPWY4BROvhdbtgETaI",
        "field_map": {
          "passage_chunk": "passage_chunk_embedding"
        }
      }
    }
  ]
}
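Before ingesting real documents, you can verify the chunking and embedding behavior by simulating the pipeline. The following request is a minimal sketch using the ingest _simulate API; the sample document is illustrative:

POST _ingest/pipeline/text-chunking-embedding-ingest-pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "passage_text": "This is an example document to be chunked."
      }
    }
  ]
}

The simulated result should show the generated passage_chunk array along with a passage_chunk_embedding entry for each chunk.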
Step 2: Create an index for ingestion
In order to use the ingest pipeline, you need to create a k-NN index. The passage_chunk_embedding field must be of the nested type. The knn.dimension field must contain the number of dimensions for your model:
PUT testindex
{
  "settings": {
    "index": {
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "text"
      },
      "passage_chunk_embedding": {
        "type": "nested",
        "properties": {
          "knn": {
            "type": "knn_vector",
            "dimension": 768
          }
        }
      }
    }
  }
}
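The dimension value must match the output dimension of your embedding model (768 in this example). If you are unsure of the dimension, you can retrieve the model details; the following request is a sketch that assumes the model ID from Step 1, and the response should include the embedding dimension in its model configuration:

GET /_plugins/_ml/models/LMLPWY4BROvhdbtgETaI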
Step 3: Ingest documents into the index
To ingest a document into the index created in the previous step, send the following request:
POST testindex/_doc?pipeline=text-chunking-embedding-ingest-pipeline
{
  "passage_text": "This is an example document to be chunked. The document contains a single paragraph, two sentences and 24 tokens by standard tokenizer in OpenSearch."
}
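With a token_limit of 10 and an overlap_rate of 0.2 (a 2-token overlap), the 24-token passage should be split into three overlapping chunks, each of which receives its own embedding in the passage_chunk_embedding field. To inspect the generated chunks without retrieving the large embedding vectors, you can filter the returned _source; this request is a sketch:

GET testindex/_search
{
  "query": {
    "match_all": {}
  },
  "_source": ["passage_text", "passage_chunk"]
}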
Step 4: Search the index using neural search
You can use a nested query to perform a vector search on your index. We recommend setting score_mode to max so that the document score is set to the highest score out of all passage embeddings. As in the previous steps, the model_id must reference your own deployed text embedding model:
GET testindex/_search
{
  "query": {
    "nested": {
      "score_mode": "max",
      "path": "passage_chunk_embedding",
      "query": {
        "neural": {
          "passage_chunk_embedding.knn": {
            "query_text": "document",
            "model_id": "-tHZeI4BdQKclr136Wl7"
          }
        }
      }
    }
  }
}
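To identify which chunk produced the best match, you can add inner_hits to the nested query. The following sketch disables _source for the inner hits because the nested objects contain only embedding vectors; the offset of each inner hit corresponds to the position of the matching passage in the passage_chunk array:

GET testindex/_search
{
  "query": {
    "nested": {
      "score_mode": "max",
      "path": "passage_chunk_embedding",
      "query": {
        "neural": {
          "passage_chunk_embedding.knn": {
            "query_text": "document",
            "model_id": "-tHZeI4BdQKclr136Wl7"
          }
        }
      },
      "inner_hits": {
        "_source": false
      }
    }
  }
}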