Keep words token filter
Keeps only tokens contained in a specified word list.
This filter uses Lucene’s KeepWordFilter.
To remove a list of words from a token stream, use the stop filter.
Example
The following analyze API request uses the keep filter to keep only the fox and dog tokens from the quick fox jumps over the lazy dog.
resp = client.indices.analyze(
    tokenizer="whitespace",
    filter=[
        {
            "type": "keep",
            "keep_words": [
                "dog",
                "elephant",
                "fox"
            ]
        }
    ],
    text="the quick fox jumps over the lazy dog",
)
print(resp)
response = client.indices.analyze(
  body: {
    tokenizer: 'whitespace',
    filter: [
      {
        type: 'keep',
        keep_words: [
          'dog',
          'elephant',
          'fox'
        ]
      }
    ],
    text: 'the quick fox jumps over the lazy dog'
  }
)
puts response
const response = await client.indices.analyze({
  tokenizer: "whitespace",
  filter: [
    {
      type: "keep",
      keep_words: ["dog", "elephant", "fox"],
    },
  ],
  text: "the quick fox jumps over the lazy dog",
});
console.log(response);
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "keep",
      "keep_words": [ "dog", "elephant", "fox" ]
    }
  ],
  "text": "the quick fox jumps over the lazy dog"
}
The filter produces the following tokens:
[ fox, dog ]
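The filtering behavior in this example can be approximated in plain Python, without a running cluster. This is only a rough sketch: keep_tokens is a hypothetical helper, not part of Elasticsearch or its clients, and str.split() is assumed to be a close enough stand-in for the whitespace tokenizer on simple input.

```python
def keep_tokens(text, keep_words):
    """Return the whitespace-separated tokens of `text` found in `keep_words`."""
    allowed = set(keep_words)  # a set gives fast membership checks
    # str.split() with no arguments splits on runs of whitespace,
    # which approximates the whitespace tokenizer for simple input.
    return [token for token in text.split() if token in allowed]


tokens = keep_tokens(
    "the quick fox jumps over the lazy dog",
    ["dog", "elephant", "fox"],
)
print(tokens)  # ['fox', 'dog']
```

Tokens keep their original order from the text, and a keep word that never appears in the text (elephant here) has no effect on the output.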
Configurable parameters
keep_words
(Required*, array of strings) List of words to keep. Only tokens that match words in this list are included in the output.
Either this parameter or keep_words_path must be specified.
keep_words_path
(Required*, string) Path to a file that contains a list of words to keep. Only tokens that match words in this list are included in the output.
This path must be absolute or relative to the config location, and the file must be UTF-8 encoded. Each word in the file must be separated by a line break.
Either this parameter or keep_words must be specified.
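A word list file in the format described for keep_words_path (UTF-8, one word per line) could be generated with a short script. The sketch below is illustrative only: write_keep_words_file is a hypothetical helper, the word list is made up, and a temporary directory stands in for the real config location so that the script can run anywhere.

```python
import os
import tempfile


def write_keep_words_file(path, words):
    """Write one word per line to `path`, encoded as UTF-8."""
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for word in words:
            handle.write(word + "\n")


# Assumption: a temporary directory stands in for the Elasticsearch
# config directory; on a real node, use that node's config location.
config_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(config_dir, "analysis"), exist_ok=True)
word_list_path = os.path.join(config_dir, "analysis", "example_word_list.txt")

# Hypothetical word list, used here only for illustration.
write_keep_words_file(word_list_path, ["one", "two", "three"])
```

Passing encoding="utf-8" explicitly avoids depending on the platform default encoding, which may differ from UTF-8.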
keep_words_case
(Optional, Boolean) If true, lowercase all keep words. Defaults to false.
Customize and add to an analyzer
To customize the keep filter, duplicate it to create the basis for a new custom token filter. You can modify the filter using its configurable parameters.
For example, the following create index API request uses custom keep filters to configure two new custom analyzers:
standard_keep_word_array, which uses a custom keep filter with an inline array of keep words
standard_keep_word_file, which uses a custom keep filter with a keep words file
resp = client.indices.create(
    index="keep_words_example",
    settings={
        "analysis": {
            "analyzer": {
                "standard_keep_word_array": {
                    "tokenizer": "standard",
                    "filter": [
                        "keep_word_array"
                    ]
                },
                "standard_keep_word_file": {
                    "tokenizer": "standard",
                    "filter": [
                        "keep_word_file"
                    ]
                }
            },
            "filter": {
                "keep_word_array": {
                    "type": "keep",
                    "keep_words": [
                        "one",
                        "two",
                        "three"
                    ]
                },
                "keep_word_file": {
                    "type": "keep",
                    "keep_words_path": "analysis/example_word_list.txt"
                }
            }
        }
    },
)
print(resp)
const response = await client.indices.create({
  index: "keep_words_example",
  settings: {
    analysis: {
      analyzer: {
        standard_keep_word_array: {
          tokenizer: "standard",
          filter: ["keep_word_array"],
        },
        standard_keep_word_file: {
          tokenizer: "standard",
          filter: ["keep_word_file"],
        },
      },
      filter: {
        keep_word_array: {
          type: "keep",
          keep_words: ["one", "two", "three"],
        },
        keep_word_file: {
          type: "keep",
          keep_words_path: "analysis/example_word_list.txt",
        },
      },
    },
  },
});
console.log(response);
PUT keep_words_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "standard_keep_word_array": {
          "tokenizer": "standard",
          "filter": [ "keep_word_array" ]
        },
        "standard_keep_word_file": {
          "tokenizer": "standard",
          "filter": [ "keep_word_file" ]
        }
      },
      "filter": {
        "keep_word_array": {
          "type": "keep",
          "keep_words": [ "one", "two", "three" ]
        },
        "keep_word_file": {
          "type": "keep",
          "keep_words_path": "analysis/example_word_list.txt"
        }
      }
    }
  }
}
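Once the index exists, one way to check a custom analyzer is to run sample text through it with the analyze API. The following is a minimal sketch, assuming the Python client used in the earlier examples and a response that exposes a tokens list whose entries have a token field. The helper name analyze_tokens is hypothetical and not part of the client.

```python
def analyze_tokens(client, text, analyzer="standard_keep_word_array"):
    """Analyze `text` with an analyzer of keep_words_example; return token strings."""
    resp = client.indices.analyze(
        index="keep_words_example",
        analyzer=analyzer,
        text=text,
    )
    # Each entry in "tokens" describes one emitted token; only the text is kept.
    return [entry["token"] for entry in resp["tokens"]]


# Example call, assuming `client` is an already connected client object:
# analyze_tokens(client, "one two four")
```

Because the client is passed in as an argument, the helper can also be exercised against a stub object in unit tests without a running cluster.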