Stop token filter
Removes stop words from a token stream.
When not customized, the filter removes the following English stop words by default:
a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with
In addition to English, the stop filter supports predefined stop word lists for several languages. You can also specify your own stop words as an array or file.
The stop filter uses Lucene’s StopFilter.
Example
The following analyze API request uses the stop filter to remove the stop words "a" and "the" from "a quick fox jumps over the lazy dog":
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [ "stop" ],
  "text": "a quick fox jumps over the lazy dog"
}
The filter produces the following tokens:
[ quick, fox, jumps, over, lazy, dog ]
Add to an analyzer
The following create index API request uses the stop filter to configure a new custom analyzer.
PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "whitespace",
          "filter": [ "stop" ]
        }
      }
    }
  }
}
Configurable parameters
stopwords
(Optional, string or array of strings) Language value, such as _arabic_ or _thai_. Defaults to _english_.
Each language value corresponds to a predefined list of stop words in Lucene. See Stop words by language for supported language values and their stop words.
Also accepts an array of stop words.
For an empty list of stop words, use _none_.
stopwords_path
(Optional, string) Path to a file that contains a list of stop words to remove.
This path must be absolute or relative to the config location, and the file must be UTF-8 encoded. Each stop word in the file must be separated by a line break.
ignore_case
(Optional, Boolean) If true, stop word matching is case insensitive. For example, if true, a stop word of "the" matches and removes "The", "THE", or "the". Defaults to false.
remove_trailing
(Optional, Boolean) If true, the last token of a stream is removed if it’s a stop word. Defaults to true.
Set this parameter to false when using the filter with a completion suggester. This ensures that a query like "green a" matches and suggests "green apple" while still removing other stop words.
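As a rough illustration of remove_trailing, the following analyze API request defines an inline stop filter with the parameter set to false. This is a sketch rather than a recommended setup, and the sample text is an assumption chosen to mirror the "green a" query described above.

```console
# Sketch only: inline stop filter that keeps a trailing stop word.
# The sample text is an assumption made for illustration.
GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "stop",
      "remove_trailing": false
    }
  ],
  "text": "the green a"
}
```

With this setting, the leading "the" should still be removed, while the final "a" should be kept because it is the last token and the text has no trailing whitespace.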
Customize
To customize the stop filter, duplicate it to create the basis for a new custom token filter. You can modify the filter using its configurable parameters.
For example, the following request creates a custom case-insensitive stop filter that removes stop words from the _english_ stop words list:
PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer": "whitespace",
          "filter": [ "my_custom_stop_words_filter" ]
        }
      },
      "filter": {
        "my_custom_stop_words_filter": {
          "type": "stop",
          "ignore_case": true
        }
      }
    }
  }
}
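To try the effect of ignore_case without creating an index, an equivalent inline filter can be passed to the analyze API. The request below is a minimal sketch, and the sample text is an assumption chosen to include stop words in mixed case.

```console
# Sketch only: inline case-insensitive stop filter using the default _english_ list.
# The sample text is an assumption made for illustration.
GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "stop",
      "ignore_case": true
    }
  ],
  "text": "The quick fox jumps over THE lazy dog"
}
```

Because the whitespace tokenizer does not lowercase tokens, "The" and "THE" reach the filter unchanged. With ignore_case set to true, both should be removed, leaving quick, fox, jumps, over, lazy, and dog.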
You can also specify your own list of stop words. For example, the following request creates a custom case-insensitive stop filter that removes only the stop words "and", "is", and "the":
PUT /my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "tokenizer": "whitespace",
          "filter": [ "my_custom_stop_words_filter" ]
        }
      },
      "filter": {
        "my_custom_stop_words_filter": {
          "type": "stop",
          "ignore_case": true,
          "stopwords": [ "and", "is", "the" ]
        }
      }
    }
  }
}
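A custom list can also be loaded from a file through the stopwords_path parameter. The sketch below assumes a hypothetical UTF-8 file named analysis/my_stopwords.txt under the config location, containing one stop word per line. The index, analyzer, and filter names are placeholders as well.

```console
# Sketch only: the index name, analyzer name, filter name, and file path
# are all hypothetical. The file must exist on every node before the
# index is created.
PUT /my-index-000002
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_file_stop_analyzer": {
          "tokenizer": "whitespace",
          "filter": [ "my_file_stop_words_filter" ]
        }
      },
      "filter": {
        "my_file_stop_words_filter": {
          "type": "stop",
          "stopwords_path": "analysis/my_stopwords.txt"
        }
      }
    }
  }
}
```

Keeping the list in a file can be convenient when it is long or shared across several indices, at the cost of having to distribute the file to each node.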
Stop words by language
The following list contains the supported language values for the stopwords parameter. Each value corresponds to a predefined stop word list in Lucene.
_arabic_
_armenian_
_basque_
_bengali_
_brazilian_ (Brazilian Portuguese)
_bulgarian_
_catalan_
_cjk_ (Chinese, Japanese, and Korean)
_czech_
_danish_
_dutch_
_english_
_estonian_
_finnish_
_french_
_galician_
_german_
_greek_
_hindi_
_hungarian_
_indonesian_
_irish_
_italian_
_latvian_
_lithuanian_
_norwegian_
_persian_
_portuguese_
_romanian_
_russian_
_sorani_
_spanish_
_swedish_
_thai_
_turkish_
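Any of the language values above can be passed as the stopwords parameter. As a hedged sketch, the following analyze API request uses an inline stop filter with the _spanish_ list. The sample sentence is an assumption made for illustration.

```console
# Sketch only: inline stop filter using a predefined language list.
# The sample text is an assumption made for illustration.
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "stop",
      "stopwords": "_spanish_"
    }
  ],
  "text": "el zorro salta sobre la cerca"
}
```

Common Spanish words such as "el" and "la" should be removed from the output, while content words such as "zorro" remain.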