Edge n-gram token filter

The edge_ngram token filter is similar to the ngram token filter in that it splits a string into substrings of varying lengths. However, the edge_ngram token filter generates n-grams (substrings) only from the beginning (edge) of a token. It's particularly useful in scenarios like autocomplete or prefix matching, where you want to match the beginning of words or phrases as the user types them.

Parameters

The edge_ngram token filter can be configured with the following parameters.

| Parameter | Required/Optional | Data type | Description |
|-----------|-------------------|-----------|-------------|
| `min_gram` | Optional | Integer | The minimum length of the generated n-grams. Default is `1`. |
| `max_gram` | Optional | Integer | The maximum length of the generated n-grams. Default is `1` for the `edge_ngram` filter and `2` for custom token filters. Avoid setting this parameter too low: if the value is too low, only very short n-grams are generated, and longer search terms will not be found. For example, if `max_gram` is set to `3` and you index the word `banana`, the longest generated token is `ban`, so a search for `banana` returns no matches. You can use the `truncate` token filter as a search analyzer to mitigate this risk, as sketched after this table. |
| `preserve_original` | Optional | Boolean | Includes the original token in the output. Default is `false`. |
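
The truncate mitigation mentioned in the max_gram description might look like the following sketch. The index, filter, and analyzer names (truncate_example, truncate_to_4, my_search_analyzer) are illustrative, and the length value is assumed to match the max_gram used at index time (4, as in the example in the next section):

```json
PUT /truncate_example
{
  "settings": {
    "analysis": {
      "filter": {
        "truncate_to_4": {
          "type": "truncate",
          "length": 4
        }
      },
      "analyzer": {
        "my_search_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "truncate_to_4"]
        }
      }
    }
  }
}
```

Applied at search time, this analyzer truncates each query term to the same maximum length as the indexed edge n-grams, so a search for "turtle" is truncated to "turt" and can match the indexed gram.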

Example

The following example request creates a new index named edge_ngram_example and configures an analyzer with the edge_ngram filter:

```json
PUT /edge_ngram_example
{
  "settings": {
    "analysis": {
      "filter": {
        "my_edge_ngram": {
          "type": "edge_ngram",
          "min_gram": 3,
          "max_gram": 4
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "my_edge_ngram"]
        }
      }
    }
  }
}
```

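To use the analyzer for autocomplete-style matching, you typically apply it to a text field at index time. The following sketch is an assumption rather than part of the example above: the products index and title field are hypothetical. The mapping pairs my_analyzer at index time with the standard analyzer at search time so that user queries are not themselves split into n-grams:

```json
PUT /products
{
  "settings": {
    "analysis": {
      "filter": {
        "my_edge_ngram": {
          "type": "edge_ngram",
          "min_gram": 3,
          "max_gram": 4
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "my_edge_ngram"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "my_analyzer",
        "search_analyzer": "standard"
      }
    }
  }
}
```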

Generated tokens

Use the following request to examine the tokens generated using the analyzer:

```json
POST /edge_ngram_example/_analyze
{
  "analyzer": "my_analyzer",
  "text": "slow green turtle"
}
```


The response contains the generated tokens:

```json
{
  "tokens": [
    {
      "token": "slo",
      "start_offset": 0,
      "end_offset": 4,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "slow",
      "start_offset": 0,
      "end_offset": 4,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "gre",
      "start_offset": 5,
      "end_offset": 10,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "gree",
      "start_offset": 5,
      "end_offset": 10,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "tur",
      "start_offset": 11,
      "end_offset": 17,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "turt",
      "start_offset": 11,
      "end_offset": 17,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}
```
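
Because min_gram is 3 and max_gram is 4, each word produces n-grams of lengths 3 and 4; for "slow", the 4-character gram is the word itself. A user's partially typed query can then match these indexed grams. The following sketch assumes the hypothetical products index and title field from the mapping example above:

```json
GET /products/_search
{
  "query": {
    "match": {
      "title": "turt"
    }
  }
}
```

With the standard analyzer applied at search time, the query term turt is left intact and matches the 4-gram turt indexed for turtle.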