Common grams token filter

Generates bigrams for a specified set of common words.

For example, you can specify "is" and "the" as common words. The filter then converts the tokens [the, quick, fox, is, brown] to [the, the_quick, quick, fox, fox_is, is, is_brown, brown].

You can use the common_grams filter in place of the stop token filter when you don’t want to completely ignore common words.

This filter uses Lucene’s CommonGramsFilter.
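
The transformation in the example above can be sketched in a few lines of Python. This is only an illustration of the output shape, not Lucene's implementation: the function name `common_grams` is hypothetical, and the sketch assumes a bigram is emitted for every adjacent pair in which at least one token is a common word, which is consistent with the example.

```python
def common_grams(tokens, common_words):
    """Sketch: keep every token and add a bigram for each adjacent pair
    in which at least one token is a common word."""
    common = set(common_words)
    out = []
    for i, token in enumerate(tokens):
        out.append(token)  # the unigram is always kept in this mode
        if i + 1 < len(tokens):
            nxt = tokens[i + 1]
            if token in common or nxt in common:
                # "_" joins the pair, as in the example output
                out.append(f"{token}_{nxt}")
    return out


print(common_grams(["the", "quick", "fox", "is", "brown"], ["is", "the"]))
# ['the', 'the_quick', 'quick', 'fox', 'fox_is', 'is', 'is_brown', 'brown']
```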

Example

The following analyze API request creates bigrams for "is" and "the":

  GET /_analyze
  {
    "tokenizer": "whitespace",
    "filter": [
      {
        "type": "common_grams",
        "common_words": ["is", "the"]
      }
    ],
    "text": "the quick fox is brown"
  }

The filter produces the following tokens:

  [ the, the_quick, quick, fox, fox_is, is, is_brown, brown ]

Add to an analyzer

The following create index API request uses the common_grams filter to configure a new custom analyzer:

  PUT /common_grams_example
  {
    "settings": {
      "analysis": {
        "analyzer": {
          "index_grams": {
            "tokenizer": "whitespace",
            "filter": [ "common_grams" ]
          }
        },
        "filter": {
          "common_grams": {
            "type": "common_grams",
            "common_words": [ "a", "is", "the" ]
          }
        }
      }
    }
  }

Configurable parameters

common_words

(Required*, array of strings) A list of tokens. The filter generates bigrams for these tokens.

Either this or the common_words_path parameter is required.

common_words_path

(Required*, string) Path to a file containing a list of tokens. The filter generates bigrams for these tokens.

This path must be absolute or relative to the config location. The file must be UTF-8 encoded. Each token in the file must be separated by a line break.
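
As an illustration of the file format described above, the following Python sketch writes a word list as UTF-8 with one token per line. The file name `common_words.txt` and the helper `write_common_words` are hypothetical, and a temporary directory stands in for the config location.

```python
import tempfile
from pathlib import Path


def write_common_words(path, words):
    """Write one token per line, UTF-8 encoded."""
    Path(path).write_text("\n".join(words) + "\n", encoding="utf-8")


# A temporary directory stands in for the config location (assumption).
config_dir = Path(tempfile.mkdtemp())
words_file = config_dir / "common_words.txt"  # hypothetical file name
write_common_words(words_file, ["a", "is", "the"])

print(words_file.read_text(encoding="utf-8").splitlines())
# ['a', 'is', 'the']
```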

Either this or the common_words parameter is required.

ignore_case

(Optional, Boolean) If true, matching for common words is case-insensitive. Defaults to false.
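
The effect of this parameter on matching can be sketched as follows. The helper `is_common_word` is hypothetical, and Python's `str.lower()` is used here only as an approximation of case-insensitive comparison; it is not how the filter is implemented.

```python
def is_common_word(token, common_words, ignore_case=False):
    """Sketch: return True if the token is in the common-word list."""
    if ignore_case:
        return token.lower() in {w.lower() for w in common_words}
    return token in set(common_words)


print(is_common_word("The", ["is", "the"]))                    # False
print(is_common_word("The", ["is", "the"], ignore_case=True))  # True
```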

query_mode

(Optional, Boolean) If true, the filter excludes the following tokens from the output:

  • Unigrams for common words
  • Unigrams for terms followed by common words

Defaults to false. We recommend enabling this parameter for search analyzers.

For example, you can enable this parameter and specify "is" and "the" as common words. The filter then converts the tokens [the, quick, fox, is, brown] to [the_quick, quick, fox_is, is_brown].
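
The query-mode output above can be reproduced with a short Python sketch. It is an illustration under stated assumptions, not Lucene's implementation: the function name `common_grams_query` is hypothetical, and the sketch assumes that a bigram replaces the unigram it starts with and that a final token already covered by a preceding bigram is dropped, which is what the example output shows for brown.

```python
def common_grams_query(tokens, common_words):
    """Sketch of query mode: prefer bigrams over the unigrams they cover."""
    common = set(common_words)
    out = []
    for i, token in enumerate(tokens):
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        if nxt is not None and (token in common or nxt in common):
            # The bigram replaces the unigram for this position.
            out.append(f"{token}_{nxt}")
        elif nxt is None and i > 0 and (token in common or tokens[i - 1] in common):
            # Last token is already covered by the preceding bigram: skip it.
            continue
        else:
            out.append(token)
    return out


print(common_grams_query(["the", "quick", "fox", "is", "brown"], ["is", "the"]))
# ['the_quick', 'quick', 'fox_is', 'is_brown']
```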

Customize

To customize the common_grams filter, duplicate it to create the basis for a new custom token filter. You can modify the filter using its configurable parameters.

For example, the following request creates a custom common_grams filter with ignore_case and query_mode set to true:

  PUT /common_grams_example
  {
    "settings": {
      "analysis": {
        "analyzer": {
          "index_grams": {
            "tokenizer": "whitespace",
            "filter": [ "common_grams_query" ]
          }
        },
        "filter": {
          "common_grams_query": {
            "type": "common_grams",
            "common_words": [ "a", "is", "the" ],
            "ignore_case": true,
            "query_mode": true
          }
        }
      }
    }
  }