Keyword search

By default, OpenSearch calculates document scores using the Okapi BM25 algorithm. BM25 is a keyword-based algorithm that performs lexical search for words that appear in the query.

When determining a document’s relevance, BM25 considers term frequency/inverse document frequency (TF/IDF):

Term frequency stipulates that documents in which the search term appears more frequently are more relevant.
Inverse document frequency gives less weight to words that appear in many documents in the corpus (for example, articles like “the”).
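To make the interplay of the two factors concrete, the following is a minimal sketch of the textbook BM25 formula in Python. It is not the OpenSearch implementation, which may differ in constant factors; the function name `bm25_score`, the toy corpus, and the whitespace-style tokenization are assumptions made for illustration only.

```python
import math
from collections import Counter


def bm25_score(query_terms, doc_tokens, corpus, k1=1.2, b=0.75):
    """Score one tokenized document against the query terms (textbook BM25)."""
    n_docs = len(corpus)
    avg_len = sum(len(doc) for doc in corpus) / n_docs
    term_counts = Counter(doc_tokens)
    score = 0.0
    for term in query_terms:
        freq = term_counts[term]
        if freq == 0:
            continue  # a term that is absent from the document adds nothing
        # Inverse document frequency: terms found in fewer documents weigh more.
        doc_freq = sum(1 for doc in corpus if term in doc)
        idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
        # Term frequency, dampened by k1 and normalized by document length via b.
        length_norm = 1 - b + b * len(doc_tokens) / avg_len
        score += idf * freq * (k1 + 1) / (freq + k1 * length_norm)
    return score


# Hypothetical three-document corpus, already lowercased and tokenized.
corpus = [
    ["long", "live", "the", "king"],
    ["the", "king", "is", "dead"],
    ["the", "queen", "sleeps"],
]
query = ["long", "live", "king"]
scores = [bm25_score(query, doc, corpus) for doc in corpus]
print(scores)
```

In this sketch, the first document matches all three query terms and ranks highest, the second matches only `king`, and the third matches nothing and scores zero.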

Example

The following example query searches for the words `long live king` in the `shakespeare` index:

GET shakespeare/_search
{
  "query": {
    "match": {
      "text_entry": "long live king"
    }
  }
}

The response contains the matching documents, each with a relevance score in the `_score` field:

{
  "took": 113,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2352,
      "relation": "eq"
    },
    "max_score": 18.781435,
    "hits": [
      {
        "_index": "shakespeare",
        "_id": "32437",
        "_score": 18.781435,
        "_source": {
          "type": "line",
          "line_id": 32438,
          "play_name": "Hamlet",
          "speech_number": 3,
          "line_number": "1.1.3",
          "speaker": "BERNARDO",
          "text_entry": "Long live the king!"
        }
      },
      {
        "_index": "shakespeare",
        "_id": "83798",
        "_score": 16.523308,
        "_source": {
          "type": "line",
          "line_id": 83799,
          "play_name": "Richard III",
          "speech_number": 42,
          "line_number": "3.7.242",
          "speaker": "BUCKINGHAM",
          "text_entry": "Long live Richard, Englands royal king!"
        }
      },
      {
        "_index": "shakespeare",
        "_id": "82994",
        "_score": 15.588365,
        "_source": {
          "type": "line",
          "line_id": 82995,
          "play_name": "Richard III",
          "speech_number": 24,
          "line_number": "3.1.80",
          "speaker": "GLOUCESTER",
          "text_entry": "live long."
        }
      },
      {
        "_index": "shakespeare",
        "_id": "7199",
        "_score": 15.586321,
        "_source": {
          "type": "line",
          "line_id": 7200,
          "play_name": "Henry VI Part 2",
          "speech_number": 12,
          "line_number": "2.2.64",
          "speaker": "BOTH",
          "text_entry": "Long live our sovereign Richard, Englands king!"
        }
      }
      ...
    ]
  }
}

Similarity algorithms

The following table lists the supported similarity algorithms.

Algorithm	Description
`BM25`	The default OpenSearch Okapi BM25 similarity algorithm.
`boolean`	Assigns terms a score equal to their boost value. Use `boolean` similarity when you want the document scores to be based on the binary value of whether the terms match.
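The difference from BM25 can be illustrated with a small Python sketch. It assumes that the scores of matching terms are summed across the query, which is an assumption about how a multi-term query combines term scores; `boolean_score` and the boost values are hypothetical names and data, not part of OpenSearch.

```python
def boolean_score(query_boosts, doc_tokens):
    """Sum the boosts of the query terms that occur in the document."""
    present = set(doc_tokens)  # only presence matters, not the number of occurrences
    return sum(boost for term, boost in query_boosts.items() if term in present)


# Hypothetical query: each term carries a boost, with "king" boosted to 2.0.
query_boosts = {"long": 1.0, "live": 1.0, "king": 2.0}

once = boolean_score(query_boosts, ["long", "live", "the", "king"])
repeated = boolean_score(query_boosts, ["king", "king", "king", "long", "live"])
partial = boolean_score(query_boosts, ["live", "long"])
print(once, repeated, partial)
```

Under these assumptions, repeating a term or lengthening the document does not change the score; only the set of matching terms and their boosts does.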

Specifying similarity

You can specify the similarity algorithm in the `similarity` parameter when configuring mappings at the field level.

For example, the following request creates an index that specifies the `boolean` similarity for the `boolean_field`. The `bm25_field` is assigned the default `BM25` similarity:

PUT /testindex
{
  "mappings": {
    "properties": {
      "bm25_field": { 
        "type": "text"
      },
      "boolean_field": {
        "type": "text",
        "similarity": "boolean" 
      }
    }
  }
}

Configuring BM25 similarity

You can configure BM25 similarity parameters at the index level as follows:

PUT /testindex
{
  "settings": {
    "index": {
      "similarity": {
        "custom_similarity": {
          "type": "BM25",
          "k1": 1.2,
          "b": 0.75,
          "discount_overlaps": true
        }
      }
    }
  }
}
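The request above only defines `custom_similarity`. A field uses it once the field's mapping names it in the `similarity` parameter. The following Python sketch assembles a create-index body that does both in one request; `custom_field` is a hypothetical field name, and the script only prints the JSON body rather than sending it to a cluster.

```python
import json

# Index-level definition of the custom similarity, using the values shown above.
settings = {
    "index": {
        "similarity": {
            "custom_similarity": {
                "type": "BM25",
                "k1": 1.2,
                "b": 0.75,
                "discount_overlaps": True,
            }
        }
    }
}

# Field-level mapping that refers to the custom similarity by name.
# "custom_field" is a hypothetical field name used for illustration.
mappings = {
    "properties": {
        "custom_field": {"type": "text", "similarity": "custom_similarity"}
    }
}

index_body = {"settings": settings, "mappings": mappings}
request_body = json.dumps(index_body, indent=2)
print(request_body)  # body for a PUT /testindex request
```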

BM25 similarity supports the following parameters.

Parameter	Data type	Description
`k1`	Float	Determines non-linear term frequency normalization (saturation) properties. The default value is `1.2`.
`b`	Float	Determines the degree to which document length normalizes TF values. The default value is `0.75`.
`discount_overlaps`	Boolean	Determines whether overlap tokens (tokens with zero position increment) are ignored when computing the norm. The default value is `true` (overlap tokens do not count when computing the norm).
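The effect of `k1` and `b` can be seen by isolating the term frequency part of the textbook BM25 formula. The Python sketch below is an illustration under that assumption, not the OpenSearch implementation; `tf_component` and the document lengths are hypothetical.

```python
def tf_component(freq, doc_len, avg_len, k1=1.2, b=0.75):
    """Term frequency part of textbook BM25, without the IDF factor."""
    # b controls how strongly the document length scales the denominator.
    length_norm = 1 - b + b * doc_len / avg_len
    # k1 controls how quickly repeated occurrences stop adding to the score.
    return freq * (k1 + 1) / (freq + k1 * length_norm)


# Saturation: in an average-length document, each doubling of the term
# frequency adds less than the previous one, and the value stays below k1 + 1.
saturation = [tf_component(freq, doc_len=100, avg_len=100) for freq in (1, 2, 4, 8)]

# Length normalization: with the default b, a longer document scores lower
# for the same term frequency; with b=0, the length is ignored.
short_doc = tf_component(2, doc_len=50, avg_len=100)
long_doc = tf_component(2, doc_len=200, avg_len=100)
short_no_norm = tf_component(2, doc_len=50, avg_len=100, b=0)
long_no_norm = tf_component(2, doc_len=200, avg_len=100, b=0)

print(saturation, short_doc, long_doc, short_no_norm, long_no_norm)
```

In this formulation, a lower `k1` makes the score saturate sooner, and a `b` closer to `1` penalizes long documents more heavily.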

Next steps

Learn about query and filter context.
Learn about the query types OpenSearch supports.