Keyword repeat token filter

The keyword_repeat token filter emits each token twice at the same position: once marked as a keyword and once unmarked. Filters that honor the keyword flag, such as stemmers, leave the keyword copy untouched and transform only the unmarked copy. Use this filter when you want to retain both the original token and its modified version after later token transformations, such as stemming or synonym expansion, so that the unchanged token remains in the final analysis alongside the modified one.

Place the keyword_repeat token filter before any stemming filters. Because a stemmer does not change every token, the keyword copy and the stemmed copy can be identical, which leaves duplicate tokens at the same position. To remove these duplicates, add the remove_duplicates token filter after the stemmer.
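The ordering described above can be illustrated with a small, self-contained simulation. This is only a sketch of the idea, not the engine's implementation: the Token class, the toy_stem lookup table, and the analyze helper are hypothetical names introduced for illustration, and toy_stem is not the kstem algorithm.

```python
from dataclasses import dataclass, replace
from typing import Iterable, Iterator, List


@dataclass(frozen=True)
class Token:
    text: str
    position: int
    keyword: bool = False


def keyword_repeat(tokens: Iterable[Token]) -> Iterator[Token]:
    """Emit each token twice at the same position: keyword copy first, then a regular copy."""
    for tok in tokens:
        yield replace(tok, keyword=True)
        yield replace(tok, keyword=False)


def toy_stem(tokens: Iterable[Token]) -> Iterator[Token]:
    """Hypothetical stand-in for a stemmer: a lookup table that skips keyword tokens."""
    rules = {"stopped": "stop", "quickly": "quick"}  # toy data, not kstem
    for tok in tokens:
        if tok.keyword:
            yield tok  # keyword tokens pass through unchanged
        else:
            yield replace(tok, text=rules.get(tok.text, tok.text))


def remove_duplicates(tokens: Iterable[Token]) -> Iterator[Token]:
    """Drop a token whose text was already seen at the same position."""
    seen = set()
    for tok in tokens:
        key = (tok.position, tok.text)
        if key not in seen:
            seen.add(key)
            yield tok


def analyze(words: List[str]) -> List[str]:
    """Lowercase, repeat as keyword, stem, then deduplicate."""
    stream = (Token(w.lower(), i) for i, w in enumerate(words))
    return [t.text for t in remove_duplicates(toy_stem(keyword_repeat(stream)))]
```

In this sketch, a word that the toy stemmer leaves unchanged (for example, "many") would appear twice at the same position without the final remove_duplicates step.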

Example

The following example request creates a new index named my_index and configures an analyzer with a keyword_repeat filter:

    PUT /my_index
    {
      "settings": {
        "analysis": {
          "filter": {
            "my_kstem": {
              "type": "kstem"
            },
            "my_lowercase": {
              "type": "lowercase"
            }
          },
          "analyzer": {
            "my_custom_analyzer": {
              "type": "custom",
              "tokenizer": "standard",
              "filter": [
                "my_lowercase",
                "keyword_repeat",
                "my_kstem"
              ]
            }
          }
        }
      }
    }
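The analyzer in this example does not include the remove_duplicates filter. As a rough sketch, the request body could be built programmatically with that filter appended after the stemmer. The build_settings function and its dedupe parameter are hypothetical helpers introduced here for illustration; only the filter and analyzer names come from the example.

```python
import json


def build_settings(dedupe: bool = True) -> dict:
    """Build the example's index settings, optionally appending remove_duplicates."""
    filters = ["my_lowercase", "keyword_repeat", "my_kstem"]
    if dedupe:
        # remove_duplicates must come after the stemmer to catch unchanged tokens
        filters.append("remove_duplicates")
    return {
        "settings": {
            "analysis": {
                "filter": {
                    "my_kstem": {"type": "kstem"},
                    "my_lowercase": {"type": "lowercase"},
                },
                "analyzer": {
                    "my_custom_analyzer": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": filters,
                    }
                },
            }
        }
    }


if __name__ == "__main__":
    # Print the JSON body that would be sent with PUT /my_index
    print(json.dumps(build_settings(), indent=2))
```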

Generated tokens

Use the following request to examine the tokens generated using the analyzer:

    POST /my_index/_analyze
    {
      "analyzer": "my_custom_analyzer",
      "text": "Stopped quickly"
    }

The response contains the generated tokens:

    {
      "tokens": [
        {
          "token": "stopped",
          "start_offset": 0,
          "end_offset": 7,
          "type": "<ALPHANUM>",
          "position": 0
        },
        {
          "token": "stop",
          "start_offset": 0,
          "end_offset": 7,
          "type": "<ALPHANUM>",
          "position": 0
        },
        {
          "token": "quickly",
          "start_offset": 8,
          "end_offset": 15,
          "type": "<ALPHANUM>",
          "position": 1
        },
        {
          "token": "quick",
          "start_offset": 8,
          "end_offset": 15,
          "type": "<ALPHANUM>",
          "position": 1
        }
      ]
    }
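Tokens that share a position are the original and stemmed forms of the same word. The following sketch groups the tokens from the response above by position to make that pairing explicit. It assumes the response has already been parsed into a Python dictionary (pasted here as a literal with the offsets omitted); group_by_position is a hypothetical helper name.

```python
from collections import defaultdict
from typing import Dict, List

# Tokens copied from the response above, keeping only the fields used here
response = {
    "tokens": [
        {"token": "stopped", "position": 0},
        {"token": "stop", "position": 0},
        {"token": "quickly", "position": 1},
        {"token": "quick", "position": 1},
    ]
}


def group_by_position(resp: dict) -> Dict[int, List[str]]:
    """Collect token texts under the position they occupy, preserving order."""
    grouped: Dict[int, List[str]] = defaultdict(list)
    for tok in resp["tokens"]:
        grouped[tok["position"]].append(tok["token"])
    return dict(grouped)


if __name__ == "__main__":
    for position, forms in group_by_position(response).items():
        print(position, forms)
```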

To see the effect of the keyword_repeat token filter at each stage of the analyzer, add the explain and attributes parameters to the _analyze request:

    POST /my_index/_analyze
    {
      "analyzer": "my_custom_analyzer",
      "text": "Stopped quickly",
      "explain": true,
      "attributes": "keyword"
    }

The response shows the token stream after the tokenizer and after each token filter, including the keyword attribute set by the filters:

    {
      "detail": {
        "custom_analyzer": true,
        "charfilters": [],
        "tokenizer": {
          "name": "standard",
          "tokens": [
            {"token": "Stopped","start_offset": 0,"end_offset": 7,"type": "<ALPHANUM>","position": 0},
            {"token": "quickly","start_offset": 8,"end_offset": 15,"type": "<ALPHANUM>","position": 1}
          ]
        },
        "tokenfilters": [
          {
            "name": "my_lowercase",
            "tokens": [
              {"token": "stopped","start_offset": 0,"end_offset": 7,"type": "<ALPHANUM>","position": 0},
              {"token": "quickly","start_offset": 8,"end_offset": 15,"type": "<ALPHANUM>","position": 1}
            ]
          },
          {
            "name": "keyword_repeat",
            "tokens": [
              {"token": "stopped","start_offset": 0,"end_offset": 7,"type": "<ALPHANUM>","position": 0,"keyword": true},
              {"token": "stopped","start_offset": 0,"end_offset": 7,"type": "<ALPHANUM>","position": 0,"keyword": false},
              {"token": "quickly","start_offset": 8,"end_offset": 15,"type": "<ALPHANUM>","position": 1,"keyword": true},
              {"token": "quickly","start_offset": 8,"end_offset": 15,"type": "<ALPHANUM>","position": 1,"keyword": false}
            ]
          },
          {
            "name": "my_kstem",
            "tokens": [
              {"token": "stopped","start_offset": 0,"end_offset": 7,"type": "<ALPHANUM>","position": 0,"keyword": true},
              {"token": "stop","start_offset": 0,"end_offset": 7,"type": "<ALPHANUM>","position": 0,"keyword": false},
              {"token": "quickly","start_offset": 8,"end_offset": 15,"type": "<ALPHANUM>","position": 1,"keyword": true},
              {"token": "quick","start_offset": 8,"end_offset": 15,"type": "<ALPHANUM>","position": 1,"keyword": false}
            ]
          }
        ]
      }
    }
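An explain response can be long, so it may help to reduce it to the token text and keyword flag at each filter stage. The following is a minimal sketch under stated assumptions: keyword_flags is a hypothetical helper, and sample is a trimmed, hand-written dictionary with the same shape as the detail.tokenfilters section of an explain response, not output captured from a cluster.

```python
from typing import Dict, List, Tuple


def keyword_flags(explain: dict) -> Dict[str, List[Tuple[str, bool]]]:
    """Map each token filter name to its (token, keyword) pairs, in stream order."""
    stages: Dict[str, List[Tuple[str, bool]]] = {}
    for stage in explain["detail"]["tokenfilters"]:
        stages[stage["name"]] = [
            # Stages before the keyword attribute exists carry no flag; treat as False
            (tok["token"], tok.get("keyword", False))
            for tok in stage["tokens"]
        ]
    return stages


# Hand-written sample (assumption): one word passing through two filter stages
sample = {
    "detail": {
        "tokenfilters": [
            {
                "name": "keyword_repeat",
                "tokens": [
                    {"token": "stopped", "position": 0, "keyword": True},
                    {"token": "stopped", "position": 0, "keyword": False},
                ],
            },
            {
                "name": "my_kstem",
                "tokens": [
                    {"token": "stopped", "position": 0, "keyword": True},
                    {"token": "stop", "position": 0, "keyword": False},
                ],
            },
        ]
    }
}

if __name__ == "__main__":
    for name, pairs in keyword_flags(sample).items():
        print(name, pairs)
```

In the sample, only the copy with keyword set to false changes between the two stages, which is the behavior the filter is meant to produce.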