Edge n-gram token filter
Forms an n-gram of a specified length from the beginning of a token. For example, you can use the `edge_ngram` token filter to change `quick` to `qu`.

When not customized, the filter creates 1-character edge n-grams by default.

This filter uses Lucene's EdgeNGramTokenFilter.

The `edge_ngram` filter is similar to the `ngram` token filter. However, the `edge_ngram` filter only outputs n-grams that start at the beginning of a token. These edge n-grams are useful for search-as-you-type queries.
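The difference between the two filters can be illustrated in plain Python (a simplified sketch, not Lucene's implementation):

```python
def edge_ngrams(token, min_gram=1, max_gram=2):
    """Edge n-grams are anchored at the start of the token."""
    return [token[:n] for n in range(min_gram, max_gram + 1) if n <= len(token)]


def ngrams(token, min_gram=1, max_gram=2):
    """Full n-grams start at every position in the token."""
    return [token[i:i + n]
            for n in range(min_gram, max_gram + 1)
            for i in range(len(token) - n + 1)]


print(edge_ngrams("quick"))  # ['q', 'qu']
print(ngrams("quick"))       # ['q', 'u', 'i', 'c', 'k', 'qu', 'ui', 'ic', 'ck']
```

Because every edge n-gram is a prefix of the token, the filter suits prefix-style matching such as autocomplete, at a fraction of the index size a full `ngram` filter would produce.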
Example
The following analyze API request uses the `edge_ngram` filter to convert `the quick brown fox jumps` to 1-character and 2-character edge n-grams:
```python
resp = client.indices.analyze(
    tokenizer="standard",
    filter=[
        {
            "type": "edge_ngram",
            "min_gram": 1,
            "max_gram": 2
        }
    ],
    text="the quick brown fox jumps",
)
print(resp)
```
```ruby
response = client.indices.analyze(
  body: {
    tokenizer: 'standard',
    filter: [
      {
        type: 'edge_ngram',
        min_gram: 1,
        max_gram: 2
      }
    ],
    text: 'the quick brown fox jumps'
  }
)
puts response
```
```js
const response = await client.indices.analyze({
  tokenizer: "standard",
  filter: [
    {
      type: "edge_ngram",
      min_gram: 1,
      max_gram: 2,
    },
  ],
  text: "the quick brown fox jumps",
});
console.log(response);
```
```console
GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "edge_ngram",
      "min_gram": 1,
      "max_gram": 2
    }
  ],
  "text": "the quick brown fox jumps"
}
```
The filter produces the following tokens:
```text
[ t, th, q, qu, b, br, f, fo, j, ju ]
```
Add to an analyzer
The following create index API request uses the `edge_ngram` filter to configure a new custom analyzer.
```python
resp = client.indices.create(
    index="edge_ngram_example",
    settings={
        "analysis": {
            "analyzer": {
                "standard_edge_ngram": {
                    "tokenizer": "standard",
                    "filter": ["edge_ngram"]
                }
            }
        }
    },
)
print(resp)
```
```ruby
response = client.indices.create(
  index: 'edge_ngram_example',
  body: {
    settings: {
      analysis: {
        analyzer: {
          standard_edge_ngram: {
            tokenizer: 'standard',
            filter: [
              'edge_ngram'
            ]
          }
        }
      }
    }
  }
)
puts response
```
```js
const response = await client.indices.create({
  index: "edge_ngram_example",
  settings: {
    analysis: {
      analyzer: {
        standard_edge_ngram: {
          tokenizer: "standard",
          filter: ["edge_ngram"],
        },
      },
    },
  },
});
console.log(response);
```
```console
PUT edge_ngram_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "standard_edge_ngram": {
          "tokenizer": "standard",
          "filter": [ "edge_ngram" ]
        }
      }
    }
  }
}
```
Configurable parameters
`max_gram`
(Optional, integer) Maximum character length of a gram. For custom token filters, defaults to `2`. For the built-in `edge_ngram` filter, defaults to `1`.
See Limitations of the `max_gram` parameter.

`min_gram`
(Optional, integer) Minimum character length of a gram. Defaults to `1`.

`preserve_original`
(Optional, Boolean) Emits the original token when set to `true`. Defaults to `false`.

`side`
(Optional, string) Deprecated in 8.16.0; use the `reverse` token filter instead. Indicates whether to truncate tokens from the `front` or `back`. Defaults to `front`.
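How `min_gram`, `max_gram`, and `preserve_original` interact can be sketched in plain Python (a simplified model of the filter's behavior, not the Lucene implementation):

```python
def edge_ngrams(token, min_gram=1, max_gram=2, preserve_original=False):
    """Emit prefixes from min_gram to max_gram characters long.

    With preserve_original=True, the full token is also emitted even
    when it is longer than max_gram.
    """
    grams = [token[:n] for n in range(min_gram, max_gram + 1) if n <= len(token)]
    if preserve_original and token not in grams:
        grams.append(token)
    return grams


print(edge_ngrams("quick"))                         # ['q', 'qu']
print(edge_ngrams("quick", preserve_original=True))  # ['q', 'qu', 'quick']
```

Emitting the original token alongside the prefixes lets exact matches on long terms succeed without raising `max_gram` for every gram.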
Customize
To customize the `edge_ngram` filter, duplicate it to create the basis for a new custom token filter. You can modify the filter using its configurable parameters.

For example, the following request creates a custom `edge_ngram` filter that forms n-grams between 3 and 5 characters.
```python
resp = client.indices.create(
    index="edge_ngram_custom_example",
    settings={
        "analysis": {
            "analyzer": {
                "default": {
                    "tokenizer": "whitespace",
                    "filter": ["3_5_edgegrams"]
                }
            },
            "filter": {
                "3_5_edgegrams": {
                    "type": "edge_ngram",
                    "min_gram": 3,
                    "max_gram": 5
                }
            }
        }
    },
)
print(resp)
```
```ruby
response = client.indices.create(
  index: 'edge_ngram_custom_example',
  body: {
    settings: {
      analysis: {
        analyzer: {
          default: {
            tokenizer: 'whitespace',
            filter: [
              '3_5_edgegrams'
            ]
          }
        },
        filter: {
          "3_5_edgegrams": {
            type: 'edge_ngram',
            min_gram: 3,
            max_gram: 5
          }
        }
      }
    }
  }
)
puts response
```
```js
const response = await client.indices.create({
  index: "edge_ngram_custom_example",
  settings: {
    analysis: {
      analyzer: {
        default: {
          tokenizer: "whitespace",
          filter: ["3_5_edgegrams"],
        },
      },
      filter: {
        "3_5_edgegrams": {
          type: "edge_ngram",
          min_gram: 3,
          max_gram: 5,
        },
      },
    },
  },
});
console.log(response);
```
```console
PUT edge_ngram_custom_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer": "whitespace",
          "filter": [ "3_5_edgegrams" ]
        }
      },
      "filter": {
        "3_5_edgegrams": {
          "type": "edge_ngram",
          "min_gram": 3,
          "max_gram": 5
        }
      }
    }
  }
}
```
Limitations of the max_gram parameter

The `edge_ngram` filter's `max_gram` value limits the character length of tokens. When the `edge_ngram` filter is used with an index analyzer, this means search terms longer than the `max_gram` length may not match any indexed terms.

For example, if the `max_gram` is `3`, searches for `apple` won't match the indexed term `app`.

To account for this, you can use the `truncate` filter with a search analyzer to shorten search terms to the `max_gram` character length. However, this could return irrelevant results.

For example, if the `max_gram` is `3` and search terms are truncated to three characters, the search term `apple` is shortened to `app`. This means searches for `apple` return any indexed terms matching `app`, such as `apply`, `snapped`, and `apple`.
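The trade-off above can be demonstrated with a small Python sketch (a simplified model of the indexed terms, not Elasticsearch itself):

```python
def edge_ngrams(token, min_gram=1, max_gram=3):
    """Set of edge n-grams up to max_gram characters long."""
    return {token[:n] for n in range(min_gram, max_gram + 1) if n <= len(token)}


# With max_gram=3, the indexed terms for "apple" top out at "app".
index = edge_ngrams("apple")          # {'a', 'ap', 'app'}

# An untruncated search term longer than max_gram finds no match:
print("apple" in index)               # False

# Truncating the search term to max_gram characters restores the match...
print("apple"[:3] in index)           # True

# ...but the truncated term now also matches unrelated words:
print("apple"[:3] in edge_ngrams("apply"))  # True
```

The truncation step recovers recall at the cost of precision, which is why testing both configurations against real queries matters.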
We recommend testing both approaches to see which best fits your use case and desired search experience.