Stop token filter

Removes stop words from a token stream.

Unless customized, the filter removes the following English stop words:

a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with

In addition to English, the stop filter supports predefined stop word lists for several languages. You can also specify your own stop words as an array or file.

The stop filter uses Lucene’s StopFilter.

Example

The following analyze API request uses the stop filter to remove the stop words "a" and "the" from "a quick fox jumps over the lazy dog":

  GET /_analyze
  {
    "tokenizer": "standard",
    "filter": [ "stop" ],
    "text": "a quick fox jumps over the lazy dog"
  }

The filter produces the following tokens:

  [ quick, fox, jumps, over, lazy, dog ]

Add to an analyzer

The following create index API request uses the stop filter to configure a new custom analyzer.

  PUT /my-index-000001
  {
    "settings": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "whitespace",
            "filter": [ "stop" ]
          }
        }
      }
    }
  }
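
One way to check the new analyzer, assuming the index above has been created, is to run the analyze API against that index and name my_analyzer. The sample text is only an illustration.

```console
GET /my-index-000001/_analyze
{
  "analyzer": "my_analyzer",
  "text": "a quick fox jumps over the lazy dog"
}
```

Because the whitespace tokenizer does not lowercase tokens and stop word matching is case sensitive by default, a capitalized stop word such as "The" would not be removed by this analyzer.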

Configurable parameters

stopwords

(Optional, string or array of strings) Language value, such as _arabic_ or _thai_. Defaults to _english_.

Each language value corresponds to a predefined list of stop words in Lucene. See Stop words by language for supported language values and their stop words.

Also accepts an array of stop words.

For an empty list of stop words, use _none_.
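
As a sketch of how a language value is passed, the analyze API accepts an inline filter definition, so a predefined list can be tried without creating an index. The sample sentence is an assumption chosen for illustration.

```console
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [
    { "type": "stop", "stopwords": "_french_" }
  ],
  "text": "le renard et le chien"
}
```

Tokens found in the predefined French list are dropped. Replacing the value with _none_ should leave the token stream unchanged.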

stopwords_path

(Optional, string) Path to a file that contains a list of stop words to remove.

This path must be absolute or relative to the config location, and the file must be UTF-8 encoded. Each stop word in the file must be separated by a line break.
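
A minimal sketch of a filter that reads its stop words from a file follows. The index name, the filter name, and the path analysis/example_stopwords.txt are hypothetical. The file is assumed to exist relative to the config location, UTF-8 encoded, with one stop word per line.

```console
PUT /my-index-000002
{
  "settings": {
    "analysis": {
      "filter": {
        "my_file_stop_filter": {
          "type": "stop",
          "stopwords_path": "analysis/example_stopwords.txt"
        }
      }
    }
  }
}
```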

ignore_case

(Optional, Boolean) If true, stop word matching is case insensitive. For example, if true, a stop word of "the" matches and removes "The", "THE", or "the". Defaults to false.
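
The effect can be sketched with an inline filter definition in the analyze API. The sample text is an assumption chosen to include the same stop word in three casings.

```console
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [
    { "type": "stop", "ignore_case": true }
  ],
  "text": "The THE the quick fox"
}
```

With ignore_case set to true, the expected tokens are [ quick, fox ]. With the default of false, "The" and "THE" would be kept, because the standard tokenizer does not lowercase tokens.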

remove_trailing

(Optional, Boolean) If true, the last token of a stream is removed if it’s a stop word. Defaults to true.

Set this parameter to false when using the filter with a completion suggester. This ensures that a query like "green a" matches and suggests "green apple" while other stop words are still removed.
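
As an illustration, the following analyze API request uses an inline filter definition with remove_trailing disabled. The sample text mirrors the partial query mentioned above.

```console
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [
    { "type": "stop", "remove_trailing": false }
  ],
  "text": "green a"
}
```

Here the trailing "a" should be kept in the token stream. With the default of true, only "green" would remain.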

Customize

To customize the stop filter, duplicate it to create the basis for a new custom token filter. You can modify the filter using its configurable parameters.

For example, the following request creates a custom case-insensitive stop filter that removes stop words from the _english_ stop words list:

  PUT /my-index-000001
  {
    "settings": {
      "analysis": {
        "analyzer": {
          "default": {
            "tokenizer": "whitespace",
            "filter": [ "my_custom_stop_words_filter" ]
          }
        },
        "filter": {
          "my_custom_stop_words_filter": {
            "type": "stop",
            "ignore_case": true
          }
        }
      }
    }
  }

You can also specify your own list of stop words. For example, the following request creates a custom case-insensitive stop filter that removes only the stop words "and", "is", and "the":

  PUT /my-index-000001
  {
    "settings": {
      "analysis": {
        "analyzer": {
          "default": {
            "tokenizer": "whitespace",
            "filter": [ "my_custom_stop_words_filter" ]
          }
        },
        "filter": {
          "my_custom_stop_words_filter": {
            "type": "stop",
            "ignore_case": true,
            "stopwords": [ "and", "is", "the" ]
          }
        }
      }
    }
  }
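
To try the custom filter, assuming the index above has been created, the analyze API can be run against the index without naming an analyzer, in which case the index's default analyzer is used. The sample text is an assumption chosen to mix casings.

```console
GET /my-index-000001/_analyze
{
  "text": "The fox AND the dog is quick"
}
```

Since the filter sets ignore_case to true and lists only "and", "is", and "the", the expected tokens are [ fox, dog, quick ].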

Stop words by language

The following list contains supported language values for the stopwords parameter and a link to their predefined stop words in Lucene.

_arabic_

Arabic stop words

_armenian_

Armenian stop words

_basque_

Basque stop words

_bengali_

Bengali stop words

_brazilian_ (Brazilian Portuguese)

Brazilian Portuguese stop words

_bulgarian_

Bulgarian stop words

_catalan_

Catalan stop words

_cjk_ (Chinese, Japanese, and Korean)

CJK stop words

_czech_

Czech stop words

_danish_

Danish stop words

_dutch_

Dutch stop words

_english_

English stop words

_estonian_

Estonian stop words

_finnish_

Finnish stop words

_french_

French stop words

_galician_

Galician stop words

_german_

German stop words

_greek_

Greek stop words

_hindi_

Hindi stop words

_hungarian_

Hungarian stop words

_indonesian_

Indonesian stop words

_irish_

Irish stop words

_italian_

Italian stop words

_latvian_

Latvian stop words

_lithuanian_

Lithuanian stop words

_norwegian_

Norwegian stop words

_persian_

Persian stop words

_portuguese_

Portuguese stop words

_romanian_

Romanian stop words

_russian_

Russian stop words

_sorani_

Sorani stop words

_spanish_

Spanish stop words

_swedish_

Swedish stop words

_thai_

Thai stop words

_turkish_

Turkish stop words