Truncate hits processor

Introduced 2.12

The truncate_hits response processor discards returned search hits after a given hit count is reached. The truncate_hits processor is designed to work with the oversample request processor but may be used on its own.

The target_size parameter, which specifies where to truncate, is optional. If it is not specified, OpenSearch uses the original_size variable set by the oversample processor. If that variable is not available, the processor fails.
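
The fallback rule can be sketched in a few lines of Python. This is an illustrative model only, not the OpenSearch implementation: the function name `resolve_target_size` and the `context` mapping (a stand-in for the variables shared between request and response processors) are assumptions made for this example.

```python
from typing import Mapping, Optional


def resolve_target_size(target_size: Optional[int], context: Mapping[str, int]) -> int:
    """Return the hit count at which a truncate_hits-style step would cut off results.

    `context` is a hypothetical stand-in for the variables that request
    processors share with response processors.
    """
    if target_size is not None:
        if target_size < 0:
            raise ValueError("target_size must be >= 0")
        return target_size  # an explicit setting is used as-is
    if "original_size" not in context:
        # Nothing configured and nothing recorded by oversample: fail.
        raise ValueError("target_size is not set and original_size is unavailable")
    return context["original_size"]


print(resolve_target_size(5, {}))                       # explicit target_size
print(resolve_target_size(None, {"original_size": 3}))  # falls back to original_size
```

The two calls print 5 and 3. Calling the function with neither value raises an error, mirroring the failure described in the text.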

The following is a common usage pattern:

1. Add the oversample processor to a request pipeline to fetch a larger set of results.
2. In the response pipeline, apply a reranking processor (which may promote results from beyond the originally requested top N) or the collapse processor (which may discard results after deduplication).
3. Apply the truncate_hits processor to return (at most) the originally requested number of hits.

Request fields

The following table lists all request fields.

Field	Data type	Description
`target_size`	Integer	The maximum number of search hits to return. Must be greater than or equal to 0. If not specified, the processor reads the `original_size` variable and fails if that variable is not available. Optional.
`context_prefix`	String	A prefix that scopes the `original_size` variable so that the processor reads it from a specific scope, avoiding collisions between variables. Optional.
`tag`	String	The processor’s identifier. Optional.
`description`	String	A description of the processor. Optional.
`ignore_failure`	Boolean	If `true`, OpenSearch ignores any failure of this processor and continues to run the remaining processors in the search pipeline. Optional. Default is `false`.
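
To illustrate why a `context_prefix` avoids collisions, the following Python sketch models scoped variable names. It is a simplified illustration, not OpenSearch code: the helper `context_key` is hypothetical, and joining the prefix and the variable name with a dot is an assumption, because this page does not state the exact key format.

```python
from typing import Dict, Optional


def context_key(name: str, context_prefix: Optional[str] = None) -> str:
    """Build the variable name to look up.

    Assumption: the prefix is joined to the name with a dot. The exact key
    format is an implementation detail not stated in this document.
    """
    return f"{context_prefix}.{name}" if context_prefix else name


# Hypothetical shared context in which two oversample steps each recorded a size.
context: Dict[str, int] = {
    context_key("original_size", "first"): 10,
    context_key("original_size", "second"): 3,
}

# Each truncate step reads only the value stored under its own prefix.
print(context[context_key("original_size", "first")])
print(context[context_key("original_size", "second")])
```

Without prefixes, both values would be stored under the same `original_size` name, and the second would overwrite the first.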

Example

The following example demonstrates using a search pipeline with a truncate_hits processor.

Setup

Create an index named my_index containing 10 documents:

POST /_bulk
{ "create":{"_index":"my_index","_id":1}}
{ "doc": { "title" : "document 1" }}
{ "create":{"_index":"my_index","_id":2}}
{ "doc": { "title" : "document 2" }}
{ "create":{"_index":"my_index","_id":3}}
{ "doc": { "title" : "document 3" }}
{ "create":{"_index":"my_index","_id":4}}
{ "doc": { "title" : "document 4" }}
{ "create":{"_index":"my_index","_id":5}}
{ "doc": { "title" : "document 5" }}
{ "create":{"_index":"my_index","_id":6}}
{ "doc": { "title" : "document 6" }}
{ "create":{"_index":"my_index","_id":7}}
{ "doc": { "title" : "document 7" }}
{ "create":{"_index":"my_index","_id":8}}
{ "doc": { "title" : "document 8" }}
{ "create":{"_index":"my_index","_id":9}}
{ "doc": { "title" : "document 9" }}
{ "create":{"_index":"my_index","_id":10}}
{ "doc": { "title" : "document 10" }}

Creating a search pipeline

The following request creates a search pipeline named my_pipeline with a truncate_hits response processor that discards hits after the first five:

PUT /_search/pipeline/my_pipeline 
{
  "response_processors": [
    {
      "truncate_hits" : {
        "tag" : "truncate_1",
        "description" : "This processor will discard results after the first 5.",
        "target_size" : 5
      }
    }
  ]
}

Using a search pipeline

Search for documents in my_index without a search pipeline:

POST /my_index/_search
{
  "size": 8
}

The response contains eight hits:

Response

{
  "took" : 13,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "my_index",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "doc" : {
            "title" : "document 1"
          }
        }
      },
      {
        "_index" : "my_index",
        "_id" : "2",
        "_score" : 1.0,
        "_source" : {
          "doc" : {
            "title" : "document 2"
          }
        }
      },
      {
        "_index" : "my_index",
        "_id" : "3",
        "_score" : 1.0,
        "_source" : {
          "doc" : {
            "title" : "document 3"
          }
        }
      },
      {
        "_index" : "my_index",
        "_id" : "4",
        "_score" : 1.0,
        "_source" : {
          "doc" : {
            "title" : "document 4"
          }
        }
      },
      {
        "_index" : "my_index",
        "_id" : "5",
        "_score" : 1.0,
        "_source" : {
          "doc" : {
            "title" : "document 5"
          }
        }
      },
      {
        "_index" : "my_index",
        "_id" : "6",
        "_score" : 1.0,
        "_source" : {
          "doc" : {
            "title" : "document 6"
          }
        }
      },
      {
        "_index" : "my_index",
        "_id" : "7",
        "_score" : 1.0,
        "_source" : {
          "doc" : {
            "title" : "document 7"
          }
        }
      },
      {
        "_index" : "my_index",
        "_id" : "8",
        "_score" : 1.0,
        "_source" : {
          "doc" : {
            "title" : "document 8"
          }
        }
      }
    ]
  }
}

To search with a pipeline, specify the pipeline name in the search_pipeline query parameter:

POST /my_index/_search?search_pipeline=my_pipeline
{
  "size": 8
}

The response contains only five hits, even though eight were requested and 10 were available:

Response

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "my_index",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "doc" : {
            "title" : "document 1"
          }
        }
      },
      {
        "_index" : "my_index",
        "_id" : "2",
        "_score" : 1.0,
        "_source" : {
          "doc" : {
            "title" : "document 2"
          }
        }
      },
      {
        "_index" : "my_index",
        "_id" : "3",
        "_score" : 1.0,
        "_source" : {
          "doc" : {
            "title" : "document 3"
          }
        }
      },
      {
        "_index" : "my_index",
        "_id" : "4",
        "_score" : 1.0,
        "_source" : {
          "doc" : {
            "title" : "document 4"
          }
        }
      },
      {
        "_index" : "my_index",
        "_id" : "5",
        "_score" : 1.0,
        "_source" : {
          "doc" : {
            "title" : "document 5"
          }
        }
      }
    ]
  }
}
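
The effect on the response body can be modeled with a short Python sketch. It is an illustration under stated assumptions rather than the processor's actual code: the function `truncate_hits` below is a hypothetical helper, and the `response` dictionary is a trimmed-down stand-in for a search response. As in the responses shown in this example, only the `hits.hits` array is shortened, while `hits.total` still reports all 10 matching documents.

```python
import copy
from typing import Any, Dict


def truncate_hits(response: Dict[str, Any], target_size: int) -> Dict[str, Any]:
    """Return a copy of `response` that keeps at most `target_size` hits."""
    truncated = copy.deepcopy(response)
    truncated["hits"]["hits"] = truncated["hits"]["hits"][:target_size]
    return truncated


# A trimmed-down stand-in for an eight-hit response with 10 total matches.
response = {
    "hits": {
        "total": {"value": 10, "relation": "eq"},
        "hits": [{"_id": str(i)} for i in range(1, 9)],
    }
}

result = truncate_hits(response, 5)
print([hit["_id"] for hit in result["hits"]["hits"]])  # ['1', '2', '3', '4', '5']
print(result["hits"]["total"]["value"])                # 10
```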

Oversample, collapse, and truncate hits

The following is a more realistic example. It uses the oversample processor to request many candidate documents, the collapse processor to remove documents that duplicate a particular field value (producing more diverse results), and the truncate_hits processor to return the originally requested number of documents (avoiding a large result payload from the cluster).

Setup

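Create many documents containing a field that you’ll use for collapsing. If my_index still contains the documents from the previous example, delete the index first, because the create action fails for document IDs that already exist: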

POST /_bulk
{ "create":{"_index":"my_index","_id":1}}
{ "title" : "document 1", "color":"blue" }
{ "create":{"_index":"my_index","_id":2}}
{ "title" : "document 2", "color":"blue" }
{ "create":{"_index":"my_index","_id":3}}
{ "title" : "document 3", "color":"red" }
{ "create":{"_index":"my_index","_id":4}}
{ "title" : "document 4", "color":"red" }
{ "create":{"_index":"my_index","_id":5}}
{ "title" : "document 5", "color":"yellow" }
{ "create":{"_index":"my_index","_id":6}}
{ "title" : "document 6", "color":"yellow" }
{ "create":{"_index":"my_index","_id":7}}
{ "title" : "document 7", "color":"orange" }
{ "create":{"_index":"my_index","_id":8}}
{ "title" : "document 8", "color":"orange" }
{ "create":{"_index":"my_index","_id":9}}
{ "title" : "document 9", "color":"green" }
{ "create":{"_index":"my_index","_id":10}}
{ "title" : "document 10", "color":"green" }

Create a pipeline that contains only a collapse processor, which collapses on the color field:

PUT /_search/pipeline/collapse_pipeline
{
  "response_processors": [
    {
      "collapse" : {
        "field": "color"
      }
    }
  ]
}

Create another pipeline that oversamples, collapses, and then truncates results:

PUT /_search/pipeline/oversampling_collapse_pipeline
{
  "request_processors": [
    {
      "oversample": {
        "sample_factor": 3
      }
    }
  ],
  "response_processors": [
    {
      "collapse" : {
        "field": "color"
      }
    },
    {
      "truncate_hits": {
        "description": "Truncates back to the original size before oversample increased it."
      }
    }
  ]
}

Collapse without oversample

In this example, you request the top three documents before collapsing on the color field. Because the first two documents have the same color, the second one is discarded, and the request returns the first and third documents:

POST /my_index/_search?search_pipeline=collapse_pipeline
{
  "size": 3
}

Response

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "my_index",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "title" : "document 1",
          "color" : "blue"
        }
      },
      {
        "_index" : "my_index",
        "_id" : "3",
        "_score" : 1.0,
        "_source" : {
          "title" : "document 3",
          "color" : "red"
        }
      }
    ]
  },
  "profile" : {
    "shards" : [ ]
  }
}

Oversample, collapse, and truncate

Now use the oversampling_collapse_pipeline, which multiplies the requested size by 3 to fetch the top 9 documents, deduplicates them by color, and then returns the top 3 hits:

POST /my_index/_search?search_pipeline=oversampling_collapse_pipeline
{
  "size": 3
}

Response

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "my_index",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "title" : "document 1",
          "color" : "blue"
        }
      },
      {
        "_index" : "my_index",
        "_id" : "3",
        "_score" : 1.0,
        "_source" : {
          "title" : "document 3",
          "color" : "red"
        }
      },
      {
        "_index" : "my_index",
        "_id" : "5",
        "_score" : 1.0,
        "_source" : {
          "title" : "document 5",
          "color" : "yellow"
        }
      }
    ]
  },
  "profile" : {
    "shards" : [ ]
  }
}
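
The difference between the two pipelines can be reproduced with a small Python simulation. This is a sketch under simplifying assumptions, not how OpenSearch executes pipelines: the `collapse` and `search` functions are hypothetical, the documents are reduced to an ID and a color, the search is assumed to return documents in ID order, and the sample factor is treated as a whole number.

```python
from typing import Dict, List

# The ten documents from the setup step, in the order the search returns them.
COLORS = ["blue", "blue", "red", "red", "yellow", "yellow",
          "orange", "orange", "green", "green"]
DOCS: List[Dict[str, object]] = [
    {"_id": i + 1, "color": color} for i, color in enumerate(COLORS)
]


def collapse(hits, field):
    """Keep only the first hit for each distinct value of `field`."""
    seen = set()
    kept = []
    for hit in hits:
        if hit[field] not in seen:
            seen.add(hit[field])
            kept.append(hit)
    return kept


def search(size, sample_factor=None):
    """Simulate collapse_pipeline (no factor) or oversampling_collapse_pipeline."""
    fetch_size = size if sample_factor is None else size * sample_factor
    hits = collapse(DOCS[:fetch_size], "color")
    if sample_factor is not None:
        hits = hits[:size]  # truncate back to the originally requested size
    return [hit["_id"] for hit in hits]


print(search(3))                    # [1, 3]
print(search(3, sample_factor=3))   # [1, 3, 5]
```

Without oversampling, collapsing three fetched hits leaves only two. With a sample factor of 3, nine hits are fetched, collapsing leaves five distinct colors, and truncation returns the first three, matching the responses above.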