Keyword repeat token filter
Outputs a keyword version of each token in a stream. These keyword tokens are not stemmed.
The keyword_repeat filter assigns keyword tokens a keyword attribute of true. Stemmer token filters, such as stemmer or porter_stem, skip tokens with a keyword attribute of true.
You can use the keyword_repeat filter with a stemmer token filter to output a stemmed and unstemmed version of each token in a stream.
To work properly, the keyword_repeat filter must be listed before any stemmer token filters in the analyzer configuration.
Stemming does not affect all tokens. This means streams could contain duplicate tokens in the same position, even after stemming.
To remove these duplicate tokens, add the remove_duplicates filter after the stemmer filter in the analyzer configuration.
The keyword_repeat filter uses Lucene’s KeywordRepeatFilter.
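The interaction between the three filters described above can be modeled in a few lines of Python. This is a conceptual sketch only, not Lucene's implementation: the Token class, the function names, and the toy stemming rule (strip a trailing "ing" and collapse a doubled final consonant) are all assumptions made for illustration.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Token:
    text: str
    position: int
    keyword: bool = False


def keyword_repeat(tokens: List[Token]) -> List[Token]:
    """Emit a keyword copy, then a non-keyword copy, of every token."""
    out = []
    for tok in tokens:
        out.append(replace(tok, keyword=True))
        out.append(replace(tok, keyword=False))
    return out


def toy_stem(word: str) -> str:
    """Hypothetical stemming rule, far simpler than a real stemmer."""
    if word.endswith("ing") and len(word) > 5:
        stem = word[:-3]
        if len(stem) >= 2 and stem[-1] == stem[-2]:
            stem = stem[:-1]  # "runn" -> "run"
        return stem
    return word


def toy_stemmer(tokens: List[Token]) -> List[Token]:
    """Stem only tokens whose keyword attribute is False."""
    return [t if t.keyword else replace(t, text=toy_stem(t.text)) for t in tokens]


def remove_duplicates(tokens: List[Token]) -> List[Token]:
    """Drop a token if the same text was already seen at the same position."""
    seen = set()
    out = []
    for tok in tokens:
        key = (tok.position, tok.text)
        if key not in seen:
            seen.add(key)
            out.append(tok)
    return out


def analyze(text: str) -> List[Token]:
    """Whitespace-tokenize, then apply the three filters in the required order."""
    tokens = [Token(word, pos) for pos, word in enumerate(text.split())]
    return remove_duplicates(toy_stemmer(keyword_repeat(tokens)))


for tok in analyze("fox running and jumping"):
    print(tok.position, tok.text, tok.keyword)
```

Swapping the order of keyword_repeat and toy_stemmer in analyze shows why the ordering matters: the tokens would be stemmed before they are duplicated, so no unstemmed version would survive.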
Example
The following analyze API request uses the keyword_repeat filter to output a keyword and non-keyword version of each token in "fox running and jumping".
To return the keyword attribute for these tokens, the analyze API request also includes the following arguments:
- explain: true
- attributes: keyword
GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    "keyword_repeat"
  ],
  "text": "fox running and jumping",
  "explain": true,
  "attributes": "keyword"
}
The API returns the following response. Note that one version of each token has a keyword attribute of true.
Response
{
"detail": {
"custom_analyzer": true,
"charfilters": [],
"tokenizer": ...,
"tokenfilters": [
{
"name": "keyword_repeat",
"tokens": [
{
"token": "fox",
"start_offset": 0,
"end_offset": 3,
"type": "word",
"position": 0,
"keyword": true
},
{
"token": "fox",
"start_offset": 0,
"end_offset": 3,
"type": "word",
"position": 0,
"keyword": false
},
{
"token": "running",
"start_offset": 4,
"end_offset": 11,
"type": "word",
"position": 1,
"keyword": true
},
{
"token": "running",
"start_offset": 4,
"end_offset": 11,
"type": "word",
"position": 1,
"keyword": false
},
{
"token": "and",
"start_offset": 12,
"end_offset": 15,
"type": "word",
"position": 2,
"keyword": true
},
{
"token": "and",
"start_offset": 12,
"end_offset": 15,
"type": "word",
"position": 2,
"keyword": false
},
{
"token": "jumping",
"start_offset": 16,
"end_offset": 23,
"type": "word",
"position": 3,
"keyword": true
},
{
"token": "jumping",
"start_offset": 16,
"end_offset": 23,
"type": "word",
"position": 3,
"keyword": false
}
]
}
]
}
}
To stem the non-keyword tokens, add the stemmer filter after the keyword_repeat filter in the previous analyze API request.
GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    "keyword_repeat",
    "stemmer"
  ],
  "text": "fox running and jumping",
  "explain": true,
  "attributes": "keyword"
}
The API returns the following response. Note the following changes:
- The non-keyword version of running was stemmed to run.
- The non-keyword version of jumping was stemmed to jump.
Response
{
"detail": {
"custom_analyzer": true,
"charfilters": [],
"tokenizer": ...,
"tokenfilters": [
{
"name": "keyword_repeat",
"tokens": ...
},
{
"name": "stemmer",
"tokens": [
{
"token": "fox",
"start_offset": 0,
"end_offset": 3,
"type": "word",
"position": 0,
"keyword": true
},
{
"token": "fox",
"start_offset": 0,
"end_offset": 3,
"type": "word",
"position": 0,
"keyword": false
},
{
"token": "running",
"start_offset": 4,
"end_offset": 11,
"type": "word",
"position": 1,
"keyword": true
},
{
"token": "run",
"start_offset": 4,
"end_offset": 11,
"type": "word",
"position": 1,
"keyword": false
},
{
"token": "and",
"start_offset": 12,
"end_offset": 15,
"type": "word",
"position": 2,
"keyword": true
},
{
"token": "and",
"start_offset": 12,
"end_offset": 15,
"type": "word",
"position": 2,
"keyword": false
},
{
"token": "jumping",
"start_offset": 16,
"end_offset": 23,
"type": "word",
"position": 3,
"keyword": true
},
{
"token": "jump",
"start_offset": 16,
"end_offset": 23,
"type": "word",
"position": 3,
"keyword": false
}
]
}
]
}
}
However, the keyword and non-keyword versions of "fox" and "and" are identical and occupy the same positions.
To remove these duplicate tokens, add the remove_duplicates filter after stemmer in the analyze API request.
GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    "keyword_repeat",
    "stemmer",
    "remove_duplicates"
  ],
  "text": "fox running and jumping",
  "explain": true,
  "attributes": "keyword"
}
The API returns the following response. Note that the duplicate tokens for "fox" and "and" have been removed.
Response
{
"detail": {
"custom_analyzer": true,
"charfilters": [],
"tokenizer": ...,
"tokenfilters": [
{
"name": "keyword_repeat",
"tokens": ...
},
{
"name": "stemmer",
"tokens": ...
},
{
"name": "remove_duplicates",
"tokens": [
{
"token": "fox",
"start_offset": 0,
"end_offset": 3,
"type": "word",
"position": 0,
"keyword": true
},
{
"token": "running",
"start_offset": 4,
"end_offset": 11,
"type": "word",
"position": 1,
"keyword": true
},
{
"token": "run",
"start_offset": 4,
"end_offset": 11,
"type": "word",
"position": 1,
"keyword": false
},
{
"token": "and",
"start_offset": 12,
"end_offset": 15,
"type": "word",
"position": 2,
"keyword": true
},
{
"token": "jumping",
"start_offset": 16,
"end_offset": 23,
"type": "word",
"position": 3,
"keyword": true
},
{
"token": "jump",
"start_offset": 16,
"end_offset": 23,
"type": "word",
"position": 3,
"keyword": false
}
]
}
]
}
}
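When reading an explain response like the ones above, it can help to group the final filter's tokens by position so the stemmed and unstemmed variants appear side by side. The following is a minimal sketch: the function name tokens_by_position is hypothetical, and sample_response is an abbreviated stand-in for a real response (earlier filters' token lists are left empty and only two positions are included).

```python
from collections import defaultdict
from typing import Any, Dict, List

# Abbreviated sample shaped like the explain responses on this page.
sample_response: Dict[str, Any] = {
    "detail": {
        "custom_analyzer": True,
        "charfilters": [],
        "tokenfilters": [
            {"name": "keyword_repeat", "tokens": []},
            {"name": "stemmer", "tokens": []},
            {
                "name": "remove_duplicates",
                "tokens": [
                    {"token": "fox", "position": 0, "keyword": True},
                    {"token": "running", "position": 1, "keyword": True},
                    {"token": "run", "position": 1, "keyword": False},
                ],
            },
        ],
    }
}


def tokens_by_position(response: Dict[str, Any]) -> Dict[int, List[str]]:
    """Group the tokens emitted by the last token filter by their position."""
    filters = response["detail"]["tokenfilters"]
    if not filters:
        return {}
    grouped: Dict[int, List[str]] = defaultdict(list)
    # The last entry in tokenfilters holds the analyzer's final output.
    for tok in filters[-1]["tokens"]:
        grouped[tok["position"]].append(tok["token"])
    return dict(grouped)


print(tokens_by_position(sample_response))
```

A position with two entries holds both the original and the stemmed form; a position with one entry is a token the stemmer left unchanged.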
Add to an analyzer
The following create index API request uses the keyword_repeat filter to configure a new custom analyzer.
This custom analyzer uses the keyword_repeat and porter_stem filters to create a stemmed and unstemmed version of each token in a stream. The remove_duplicates filter then removes any duplicate tokens from the stream.
PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "keyword_repeat",
            "porter_stem",
            "remove_duplicates"
          ]
        }
      }
    }
  }
}
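Because the filter order matters, it may be worth checking a filter list before sending it to the create index API. The sketch below builds the same request body as above and flags ordering mistakes. The helper names check_filter_order and build_settings are hypothetical, and the check assumes filters are referenced by their built-in names; by default it only recognizes the two stemmer filters named on this page.

```python
import json
from typing import Any, Dict, FrozenSet, List

# Assumption: only the stemmer filters named on this page are checked by default.
DEFAULT_STEMMERS: FrozenSet[str] = frozenset({"stemmer", "porter_stem"})


def check_filter_order(
    filters: List[str], stemmers: FrozenSet[str] = DEFAULT_STEMMERS
) -> List[str]:
    """Return a list of ordering problems; an empty list means the order looks fine."""
    problems: List[str] = []
    if "keyword_repeat" not in filters:
        return problems
    repeat_at = filters.index("keyword_repeat")
    stemmer_at = [i for i, name in enumerate(filters) if name in stemmers]
    if any(i < repeat_at for i in stemmer_at):
        problems.append("keyword_repeat must come before every stemmer filter")
    if "remove_duplicates" in filters and stemmer_at:
        if filters.index("remove_duplicates") < max(stemmer_at):
            problems.append("remove_duplicates should come after the stemmer filter")
    return problems


def build_settings(analyzer_name: str, tokenizer: str, filters: List[str]) -> Dict[str, Any]:
    """Build a create index request body that defines one custom analyzer."""
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    analyzer_name: {"tokenizer": tokenizer, "filter": filters}
                }
            }
        }
    }


filters = ["keyword_repeat", "porter_stem", "remove_duplicates"]
problems = check_filter_order(filters)
body = build_settings("my_custom_analyzer", "standard", filters)
print(problems)
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of the PUT request shown above. Custom-named filters would need to be added to the stemmers argument for the check to see them.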