Edge n-gram token filter
Forms an n-gram of a specified length from the beginning of a token. For example, you can use the `edge_ngram` token filter to change `quick` to `qu`.

When not customized, the filter creates 1-character edge n-grams by default.

This filter uses Lucene's EdgeNGramTokenFilter.

The `edge_ngram` filter is similar to the `ngram` token filter. However, the `edge_ngram` filter only outputs n-grams that start at the beginning of a token. These edge n-grams are useful for search-as-you-type queries.
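The difference between the two filters can be illustrated in plain Python (a simplified sketch, not Lucene's implementation):

```python
def edge_ngrams(token, min_gram=1, max_gram=2):
    """Edge n-grams are anchored at the start of the token."""
    return [token[:n] for n in range(min_gram, max_gram + 1) if n <= len(token)]


def ngrams(token, min_gram=1, max_gram=2):
    """Full n-grams start at every position in the token."""
    return [token[i:i + n]
            for n in range(min_gram, max_gram + 1)
            for i in range(len(token) - n + 1)]


print(edge_ngrams("quick"))  # ['q', 'qu']
print(ngrams("quick"))       # ['q', 'u', 'i', 'c', 'k', 'qu', 'ui', 'ic', 'ck']
```

Because every edge n-gram is a prefix of the token, the filter suits prefix-style matching such as autocomplete, at a fraction of the index size a full `ngram` filter would produce.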
Example
The following analyze API request uses the `edge_ngram` filter to convert `the quick brown fox jumps` to 1-character and 2-character edge n-grams:
```python
resp = client.indices.analyze(
    tokenizer="standard",
    filter=[
        {
            "type": "edge_ngram",
            "min_gram": 1,
            "max_gram": 2
        }
    ],
    text="the quick brown fox jumps",
)
print(resp)
```
```ruby
response = client.indices.analyze(
  body: {
    tokenizer: 'standard',
    filter: [
      {
        type: 'edge_ngram',
        min_gram: 1,
        max_gram: 2
      }
    ],
    text: 'the quick brown fox jumps'
  }
)
puts response
```
```js
const response = await client.indices.analyze({
  tokenizer: "standard",
  filter: [
    {
      type: "edge_ngram",
      min_gram: 1,
      max_gram: 2,
    },
  ],
  text: "the quick brown fox jumps",
});
console.log(response);
```
```console
GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "edge_ngram",
      "min_gram": 1,
      "max_gram": 2
    }
  ],
  "text": "the quick brown fox jumps"
}
```
The filter produces the following tokens:
```text
[ t, th, q, qu, b, br, f, fo, j, ju ]
```
Add to an analyzer
The following create index API request uses the `edge_ngram` filter to configure a new custom analyzer.
```python
resp = client.indices.create(
    index="edge_ngram_example",
    settings={
        "analysis": {
            "analyzer": {
                "standard_edge_ngram": {
                    "tokenizer": "standard",
                    "filter": ["edge_ngram"]
                }
            }
        }
    },
)
print(resp)
```
```ruby
response = client.indices.create(
  index: 'edge_ngram_example',
  body: {
    settings: {
      analysis: {
        analyzer: {
          standard_edge_ngram: {
            tokenizer: 'standard',
            filter: [
              'edge_ngram'
            ]
          }
        }
      }
    }
  }
)
puts response
```
```js
const response = await client.indices.create({
  index: "edge_ngram_example",
  settings: {
    analysis: {
      analyzer: {
        standard_edge_ngram: {
          tokenizer: "standard",
          filter: ["edge_ngram"],
        },
      },
    },
  },
});
console.log(response);
```
```console
PUT edge_ngram_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "standard_edge_ngram": {
          "tokenizer": "standard",
          "filter": [ "edge_ngram" ]
        }
      }
    }
  }
}
```
Configurable parameters
`max_gram`
(Optional, integer) Maximum character length of a gram. For custom token filters, defaults to `2`. For the built-in `edge_ngram` filter, defaults to `1`.
See Limitations of the `max_gram` parameter.

`min_gram`
(Optional, integer) Minimum character length of a gram. Defaults to `1`.

`preserve_original`
(Optional, Boolean) Emits the original token when set to `true`. Defaults to `false`.

`side`
(Optional, string) Deprecated in 8.16.0; use the `reverse` token filter instead. Indicates whether to truncate tokens from the `front` or `back`. Defaults to `front`.
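How `min_gram`, `max_gram`, and `preserve_original` interact can be sketched in plain Python (a simplified model of the filter's behavior, not the Lucene implementation):

```python
def edge_ngrams(token, min_gram=1, max_gram=2, preserve_original=False):
    """Emit prefixes from min_gram to max_gram characters long.

    With preserve_original=True, the full token is also emitted even
    when it is longer than max_gram.
    """
    grams = [token[:n] for n in range(min_gram, max_gram + 1) if n <= len(token)]
    if preserve_original and token not in grams:
        grams.append(token)
    return grams


print(edge_ngrams("quick"))                         # ['q', 'qu']
print(edge_ngrams("quick", preserve_original=True))  # ['q', 'qu', 'quick']
```

Emitting the original token alongside the prefixes lets exact matches on long terms succeed without raising `max_gram` for every gram.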
Customize
To customize the `edge_ngram` filter, duplicate it to create the basis for a new custom token filter. You can modify the filter using its configurable parameters.

For example, the following request creates a custom `edge_ngram` filter that forms n-grams between 3 and 5 characters.
```python
resp = client.indices.create(
    index="edge_ngram_custom_example",
    settings={
        "analysis": {
            "analyzer": {
                "default": {
                    "tokenizer": "whitespace",
                    "filter": ["3_5_edgegrams"]
                }
            },
            "filter": {
                "3_5_edgegrams": {
                    "type": "edge_ngram",
                    "min_gram": 3,
                    "max_gram": 5
                }
            }
        }
    },
)
print(resp)
```
```ruby
response = client.indices.create(
  index: 'edge_ngram_custom_example',
  body: {
    settings: {
      analysis: {
        analyzer: {
          default: {
            tokenizer: 'whitespace',
            filter: [
              '3_5_edgegrams'
            ]
          }
        },
        filter: {
          "3_5_edgegrams": {
            type: 'edge_ngram',
            min_gram: 3,
            max_gram: 5
          }
        }
      }
    }
  }
)
puts response
```
```js
const response = await client.indices.create({
  index: "edge_ngram_custom_example",
  settings: {
    analysis: {
      analyzer: {
        default: {
          tokenizer: "whitespace",
          filter: ["3_5_edgegrams"],
        },
      },
      filter: {
        "3_5_edgegrams": {
          type: "edge_ngram",
          min_gram: 3,
          max_gram: 5,
        },
      },
    },
  },
});
console.log(response);
```
```console
PUT edge_ngram_custom_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer": "whitespace",
          "filter": [ "3_5_edgegrams" ]
        }
      },
      "filter": {
        "3_5_edgegrams": {
          "type": "edge_ngram",
          "min_gram": 3,
          "max_gram": 5
        }
      }
    }
  }
}
```
Limitations of the max_gram parameter

The `edge_ngram` filter's `max_gram` value limits the character length of tokens. When the `edge_ngram` filter is used with an index analyzer, this means search terms longer than the `max_gram` length may not match any indexed terms.

For example, if the `max_gram` is `3`, searches for `apple` won't match the indexed term `app`.

To account for this, you can use the `truncate` filter with a search analyzer to shorten search terms to the `max_gram` character length. However, this could return irrelevant results.

For example, if the `max_gram` is `3` and search terms are truncated to three characters, the search term `apple` is shortened to `app`. This means searches for `apple` return any indexed terms matching `app`, such as `apply`, `snapped`, and `apple`.
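The trade-off above can be demonstrated with a small Python sketch (a simplified model of the indexed terms, not Elasticsearch itself):

```python
def edge_ngrams(token, min_gram=1, max_gram=3):
    """Set of edge n-grams up to max_gram characters long."""
    return {token[:n] for n in range(min_gram, max_gram + 1) if n <= len(token)}


# With max_gram=3, the indexed terms for "apple" top out at "app".
index = edge_ngrams("apple")          # {'a', 'ap', 'app'}

# An untruncated search term longer than max_gram finds no match:
print("apple" in index)               # False

# Truncating the search term to max_gram characters restores the match...
print("apple"[:3] in index)           # True

# ...but the truncated term now also matches unrelated words:
print("apple"[:3] in edge_ngrams("apply"))  # True
```

The truncation step recovers recall at the cost of precision, which is why testing both configurations against real queries matters.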
We recommend testing both approaches to see which best fits your use case and desired search experience.