Keyword repeat token filter

Outputs a keyword version of each token in a stream. These keyword tokens are not stemmed.

The keyword_repeat filter assigns keyword tokens a keyword attribute of true. Stemmer token filters, such as stemmer or porter_stem, skip tokens with a keyword attribute of true.

You can use the keyword_repeat filter with a stemmer token filter to output a stemmed and unstemmed version of each token in a stream.

To work properly, the keyword_repeat filter must be listed before any stemmer token filters in the analyzer configuration.
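The interaction between the keyword attribute and filter order can be sketched in a few lines of Python. This is a simplified model, not Lucene's implementation: the `Token` class, the `whitespace_tokens` helper, and `toy_stemmer` (which only strips a trailing "ing") are hypothetical stand-ins invented for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Token:
    term: str
    position: int
    keyword: bool = False


def whitespace_tokens(text):
    """Hypothetical helper: split on whitespace and number the positions."""
    return [Token(term, pos) for pos, term in enumerate(text.split())]


def keyword_repeat(tokens):
    """Emit each token twice: a keyword copy first, then a non-keyword copy."""
    for tok in tokens:
        yield replace(tok, keyword=True)
        yield replace(tok, keyword=False)


def toy_stemmer(tokens):
    """Toy stand-in for a real stemmer (assumption, not the Porter algorithm).

    Skips keyword tokens. For other tokens, strips a trailing "ing" and
    collapses a doubled final consonant ("runn" becomes "run").
    """
    for tok in tokens:
        if tok.keyword or not tok.term.endswith("ing"):
            yield tok
            continue
        stem = tok.term[:-3]
        if len(stem) > 2 and stem[-1] == stem[-2]:
            stem = stem[:-1]
        yield replace(tok, term=stem)


tokens = whitespace_tokens("fox running and jumping")

# keyword_repeat first: the keyword copies keep their original form.
correct = [(t.term, t.keyword) for t in toy_stemmer(keyword_repeat(tokens))]

# Stemmer first: every token is stemmed before it is repeated,
# so the unstemmed forms "running" and "jumping" are lost.
wrong = [(t.term, t.keyword) for t in keyword_repeat(toy_stemmer(tokens))]

print(correct)
print(wrong)
```

In the first ordering the stream keeps both "running" and "run". In the second ordering only "run" survives, once with a keyword attribute of true and once with false.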

Stemming does not change every token. When a token's stem is the same as the original token, the stream contains duplicate tokens in the same position, even after stemming.

To remove these duplicate tokens, add the remove_duplicates filter after the stemmer filter in the analyzer configuration.
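As a rough illustration of what duplicate removal means here, the following Python sketch drops a token when the same term was already emitted at the same position. It is a simplified model under that assumption, not the actual remove_duplicates implementation. The `(term, position, keyword)` tuples and the function name are hypothetical.

```python
def remove_duplicates(tokens):
    """Drop a token if an identical term was already seen at its position.

    Each token is a (term, position, keyword) tuple. Tokens are assumed
    to arrive ordered by position.
    """
    seen = set()
    current_position = None
    for term, position, keyword in tokens:
        if position != current_position:
            # New position: forget the terms seen at the previous one.
            current_position = position
            seen = set()
        if term in seen:
            continue
        seen.add(term)
        yield term, position, keyword


# Stream after keyword_repeat and a stemmer for "fox running and jumping".
stemmed = [
    ("fox", 0, True), ("fox", 0, False),
    ("running", 1, True), ("run", 1, False),
    ("and", 2, True), ("and", 2, False),
    ("jumping", 3, True), ("jump", 3, False),
]

deduplicated = list(remove_duplicates(stemmed))
print(deduplicated)
```

Only the second copies of "fox" and "and" are dropped, because "running"/"run" and "jumping"/"jump" differ as terms even though they share a position.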

The keyword_repeat filter uses Lucene’s KeywordRepeatFilter.

Example

The following analyze API request uses the keyword_repeat filter to output a keyword and non-keyword version of each token in the text "fox running and jumping".

To return the keyword attribute for these tokens, the analyze API request also includes the following arguments:

  • explain: true
  • attributes: keyword

    GET /_analyze
    {
      "tokenizer": "whitespace",
      "filter": [
        "keyword_repeat"
      ],
      "text": "fox running and jumping",
      "explain": true,
      "attributes": "keyword"
    }

The API returns the following response. Note that one version of each token has a keyword attribute of true.

Response

    {
      "detail": {
        "custom_analyzer": true,
        "charfilters": [],
        "tokenizer": ...,
        "tokenfilters": [
          {
            "name": "keyword_repeat",
            "tokens": [
              {
                "token": "fox",
                "start_offset": 0,
                "end_offset": 3,
                "type": "word",
                "position": 0,
                "keyword": true
              },
              {
                "token": "fox",
                "start_offset": 0,
                "end_offset": 3,
                "type": "word",
                "position": 0,
                "keyword": false
              },
              {
                "token": "running",
                "start_offset": 4,
                "end_offset": 11,
                "type": "word",
                "position": 1,
                "keyword": true
              },
              {
                "token": "running",
                "start_offset": 4,
                "end_offset": 11,
                "type": "word",
                "position": 1,
                "keyword": false
              },
              {
                "token": "and",
                "start_offset": 12,
                "end_offset": 15,
                "type": "word",
                "position": 2,
                "keyword": true
              },
              {
                "token": "and",
                "start_offset": 12,
                "end_offset": 15,
                "type": "word",
                "position": 2,
                "keyword": false
              },
              {
                "token": "jumping",
                "start_offset": 16,
                "end_offset": 23,
                "type": "word",
                "position": 3,
                "keyword": true
              },
              {
                "token": "jumping",
                "start_offset": 16,
                "end_offset": 23,
                "type": "word",
                "position": 3,
                "keyword": false
              }
            ]
          }
        ]
      }
    }

To stem the non-keyword tokens, add the stemmer filter after the keyword_repeat filter in the previous analyze API request.

    GET /_analyze
    {
      "tokenizer": "whitespace",
      "filter": [
        "keyword_repeat",
        "stemmer"
      ],
      "text": "fox running and jumping",
      "explain": true,
      "attributes": "keyword"
    }

The API returns the following response. Note the following changes:

  • The non-keyword version of running was stemmed to run.
  • The non-keyword version of jumping was stemmed to jump.

Response

    {
      "detail": {
        "custom_analyzer": true,
        "charfilters": [],
        "tokenizer": ...,
        "tokenfilters": [
          {
            "name": "keyword_repeat",
            "tokens": ...
          },
          {
            "name": "stemmer",
            "tokens": [
              {
                "token": "fox",
                "start_offset": 0,
                "end_offset": 3,
                "type": "word",
                "position": 0,
                "keyword": true
              },
              {
                "token": "fox",
                "start_offset": 0,
                "end_offset": 3,
                "type": "word",
                "position": 0,
                "keyword": false
              },
              {
                "token": "running",
                "start_offset": 4,
                "end_offset": 11,
                "type": "word",
                "position": 1,
                "keyword": true
              },
              {
                "token": "run",
                "start_offset": 4,
                "end_offset": 11,
                "type": "word",
                "position": 1,
                "keyword": false
              },
              {
                "token": "and",
                "start_offset": 12,
                "end_offset": 15,
                "type": "word",
                "position": 2,
                "keyword": true
              },
              {
                "token": "and",
                "start_offset": 12,
                "end_offset": 15,
                "type": "word",
                "position": 2,
                "keyword": false
              },
              {
                "token": "jumping",
                "start_offset": 16,
                "end_offset": 23,
                "type": "word",
                "position": 3,
                "keyword": true
              },
              {
                "token": "jump",
                "start_offset": 16,
                "end_offset": 23,
                "type": "word",
                "position": 3,
                "keyword": false
              }
            ]
          }
        ]
      }
    }

However, the keyword and non-keyword versions of "fox" and "and" are identical and in the same respective positions.

To remove these duplicate tokens, add the remove_duplicates filter after stemmer in the analyze API request.

    GET /_analyze
    {
      "tokenizer": "whitespace",
      "filter": [
        "keyword_repeat",
        "stemmer",
        "remove_duplicates"
      ],
      "text": "fox running and jumping",
      "explain": true,
      "attributes": "keyword"
    }

The API returns the following response. Note that the duplicate tokens for "fox" and "and" have been removed.

Response

    {
      "detail": {
        "custom_analyzer": true,
        "charfilters": [],
        "tokenizer": ...,
        "tokenfilters": [
          {
            "name": "keyword_repeat",
            "tokens": ...
          },
          {
            "name": "stemmer",
            "tokens": ...
          },
          {
            "name": "remove_duplicates",
            "tokens": [
              {
                "token": "fox",
                "start_offset": 0,
                "end_offset": 3,
                "type": "word",
                "position": 0,
                "keyword": true
              },
              {
                "token": "running",
                "start_offset": 4,
                "end_offset": 11,
                "type": "word",
                "position": 1,
                "keyword": true
              },
              {
                "token": "run",
                "start_offset": 4,
                "end_offset": 11,
                "type": "word",
                "position": 1,
                "keyword": false
              },
              {
                "token": "and",
                "start_offset": 12,
                "end_offset": 15,
                "type": "word",
                "position": 2,
                "keyword": true
              },
              {
                "token": "jumping",
                "start_offset": 16,
                "end_offset": 23,
                "type": "word",
                "position": 3,
                "keyword": true
              },
              {
                "token": "jump",
                "start_offset": 16,
                "end_offset": 23,
                "type": "word",
                "position": 3,
                "keyword": false
              }
            ]
          }
        ]
      }
    }

Add to an analyzer

The following create index API request uses the keyword_repeat filter to configure a new custom analyzer.

This custom analyzer uses the keyword_repeat and porter_stem filters to create a stemmed and unstemmed version of each token in a stream. The remove_duplicates filter then removes any duplicate tokens from the stream.

    PUT /my-index-000001
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_custom_analyzer": {
              "tokenizer": "standard",
              "filter": [
                "keyword_repeat",
                "porter_stem",
                "remove_duplicates"
              ]
            }
          }
        }
      }
    }
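
Because the order of the filters matters, it can help to check an analyzer definition before sending it. The following Python sketch is one possible check, not part of any Elasticsearch client. The `filter_order_problems` function and the `STEMMER_FILTERS` set are hypothetical, and the set covers only the two stemmer filters named on this page. Custom-named filters would need to be resolved by their type first.

```python
# Assumption: only the built-in stemmer filter names mentioned on this page.
STEMMER_FILTERS = {"stemmer", "porter_stem"}


def filter_order_problems(filters):
    """Return a list of ordering problems for an analyzer's filter list."""
    problems = []
    if "keyword_repeat" not in filters:
        return problems

    repeat_at = filters.index("keyword_repeat")
    stemmer_positions = [
        i for i, name in enumerate(filters) if name in STEMMER_FILTERS
    ]

    # keyword_repeat must be listed before any stemmer filter.
    if any(i < repeat_at for i in stemmer_positions):
        problems.append("keyword_repeat must come before any stemmer filter")

    # remove_duplicates only helps when it runs after stemming.
    if "remove_duplicates" in filters and stemmer_positions:
        if filters.index("remove_duplicates") < max(stemmer_positions):
            problems.append("remove_duplicates should come after the stemmer filter")

    return problems


# Request body from the create index example above, as a Python dict.
body = {
    "settings": {
        "analysis": {
            "analyzer": {
                "my_custom_analyzer": {
                    "tokenizer": "standard",
                    "filter": ["keyword_repeat", "porter_stem", "remove_duplicates"],
                }
            }
        }
    }
}

analyzer = body["settings"]["analysis"]["analyzer"]["my_custom_analyzer"]
print(filter_order_problems(analyzer["filter"]))
```

For the analyzer above the check finds no problems. Swapping keyword_repeat and porter_stem in the list would report the first problem.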