N-gram token filter

The ngram token filter breaks a token into smaller substrings of configurable lengths, known as n-grams. Emitting these overlapping fragments improves partial matching and fuzzy search, so the filter is commonly used in search applications to support autocomplete, partial matches, and typo-tolerant search. For more information, see Autocomplete functionality and Did-you-mean.

Parameters

The ngram token filter can be configured with the following parameters.

| Parameter | Required/Optional | Data type | Description |
| :--- | :--- | :--- | :--- |
| `min_gram` | Optional | Integer | The minimum length of the n-grams. Default is `1`. |
| `max_gram` | Optional | Integer | The maximum length of the n-grams. Default is `2`. |
| `preserve_original` | Optional | Boolean | Whether to keep the original token as one of the outputs. Default is `false`. |
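
For example, a filter definition that also emits the unchanged token alongside its n-grams might look like the following sketch. The filter name `ngram_keep_original` is an illustrative placeholder, not part of the example that follows:

```json
"filter": {
  "ngram_keep_original": {
    "type": "ngram",
    "min_gram": 2,
    "max_gram": 3,
    "preserve_original": true
  }
}
```

With `preserve_original` set to `true`, analyzing a token such as `search` should emit `search` itself in addition to its 2- and 3-character n-grams.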

Example

The following example request creates a new index named ngram_example_index and configures an analyzer with an ngram filter:

```json
PUT /ngram_example_index
{
  "settings": {
    "analysis": {
      "filter": {
        "ngram_filter": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 3
        }
      },
      "analyzer": {
        "ngram_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "ngram_filter"
          ]
        }
      }
    }
  }
}
```


Generated tokens

Use the following request to examine the tokens generated by the analyzer:

```json
POST /ngram_example_index/_analyze
{
  "analyzer": "ngram_analyzer",
  "text": "Search"
}
```


The response contains the generated tokens:

```json
{
  "tokens": [
    {
      "token": "se",
      "start_offset": 0,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "sea",
      "start_offset": 0,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "ea",
      "start_offset": 0,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "ear",
      "start_offset": 0,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "ar",
      "start_offset": 0,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "arc",
      "start_offset": 0,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "rc",
      "start_offset": 0,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "rch",
      "start_offset": 0,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "ch",
      "start_offset": 0,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}
```
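
The expansion above can be cross-checked with a few lines of Python. This is a simplified emulation of the filter's substring logic, not the actual Lucene implementation; offsets and token types are omitted:

```python
def ngrams(token, min_gram=1, max_gram=2):
    """Emulate the n-gram token filter: for each start offset,
    emit every substring of length min_gram through max_gram."""
    out = []
    for start in range(len(token)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(token):
                out.append(token[start:start + size])
    return out

# "Search" is lowercased by the analyzer before the n-gram filter runs
print(ngrams("search", min_gram=2, max_gram=3))
```

Each start offset emits its 2-gram and then its 3-gram, which matches the order of the tokens in the response above.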