Dictionary decompounder token filter

The `dictionary_decompounder` token filter splits compound words into their constituent parts based on a predefined dictionary. This filter is particularly useful for languages like German, Dutch, or Finnish, in which compound words are common, so breaking them down can improve search relevance. The filter checks whether each token (word) can be split into smaller tokens based on a list of known words and, if it can, generates the subtokens for that token. For example, given the dictionary `["foot", "ball"]`, the token `football` produces the subtokens `foot` and `ball`.

Parameters

The `dictionary_decompounder` token filter has the following parameters.

| Parameter | Required/Optional | Data type | Description |
| --- | --- | --- | --- |
| `word_list` | Required unless `word_list_path` is configured | Array of strings | The dictionary of words that the filter uses to split compound words. |
| `word_list_path` | Required unless `word_list` is configured | String | A file path to a text file containing the dictionary words. Accepts either an absolute path or a path relative to the `config` directory. The dictionary file must be UTF-8 encoded, and each word must be listed on a separate line. |
| `min_word_size` | Optional | Integer | The minimum length of the entire compound word that will be considered for splitting. If a compound word is shorter than this value, it is not split. Default is `5`. |
| `min_subword_size` | Optional | Integer | The minimum length for any subword. If a subword is shorter than this value, it is not included in the output. Default is `2`. |
| `max_subword_size` | Optional | Integer | The maximum length for any subword. If a subword is longer than this value, it is not included in the output. Default is `15`. |
| `only_longest_match` | Optional | Boolean | If set to `true`, only the longest matching subword is returned. Default is `false`. |
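
For example, the following sketch configures a filter that reads its dictionary from a file and returns only the longest matches. The index name, filter name, and file path are illustrative; the file must be UTF-8 encoded, contain one word per line, and be readable from the `config` directory:

```json
PUT /decompound_file_example
{
  "settings": {
    "analysis": {
      "filter": {
        "my_file_decompounder": {
          "type": "dictionary_decompounder",
          "word_list_path": "analysis/compound_words.txt",
          "min_subword_size": 3,
          "only_longest_match": true
        }
      }
    }
  }
}
```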

Example

The following example request creates a new index named `decompound_example` and configures an analyzer with the `dictionary_decompounder` filter:

```json
PUT /decompound_example
{
  "settings": {
    "analysis": {
      "filter": {
        "my_dictionary_decompounder": {
          "type": "dictionary_decompounder",
          "word_list": ["slow", "green", "turtle"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "my_dictionary_decompounder"]
        }
      }
    }
  }
}
```

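Note that `lowercase` precedes `my_dictionary_decompounder` in the filter chain, so tokens are lowercased before they are matched against the dictionary. Keeping the dictionary entries lowercase ensures that capitalized input such as `SlowGreenTurtle` still decompounds correctly.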

Generated tokens

Use the following request to examine the tokens generated using the analyzer:

```json
POST /decompound_example/_analyze
{
  "analyzer": "my_analyzer",
  "text": "slowgreenturtleswim"
}
```


The response contains the generated tokens:

```json
{
  "tokens": [
    {
      "token": "slowgreenturtleswim",
      "start_offset": 0,
      "end_offset": 19,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "slow",
      "start_offset": 0,
      "end_offset": 19,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "green",
      "start_offset": 0,
      "end_offset": 19,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "turtle",
      "start_offset": 0,
      "end_offset": 19,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}
```
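
Note that the original token `slowgreenturtleswim` is preserved in the output and that no `swim` subtoken is produced because `swim` is not in the configured word list. All subtokens share the offsets and position of the original token.

To illustrate how the subtokens affect search relevance, the following sketch indexes a document containing the compound word and then matches it with a query for one of its parts. The index name `decompound_search_example` and the field name `description` are illustrative:

```json
PUT /decompound_search_example
{
  "settings": {
    "analysis": {
      "filter": {
        "my_dictionary_decompounder": {
          "type": "dictionary_decompounder",
          "word_list": ["slow", "green", "turtle"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "my_dictionary_decompounder"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "description": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}

PUT /decompound_search_example/_doc/1?refresh=true
{
  "description": "slowgreenturtleswim"
}

GET /decompound_search_example/_search
{
  "query": {
    "match": {
      "description": "turtle"
    }
  }
}
```

Because `turtle` is emitted as a subtoken at indexing time, the query matches the document even though the stored text contains only the unsplit compound word.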