Hyphenation decompounder token filter

The hyphenation_decompounder token filter breaks compound words down into their constituent parts. This filter is particularly useful for languages like German, Dutch, and Swedish, in which compound words are common. The filter uses hyphenation patterns (typically defined in .xml files) to identify the possible positions at which a compound word can be split into components. The candidate components are then checked against a provided dictionary, and those that match are emitted as valid tokens. For example, the German compound "Blumentopf" ("flower pot") can be split into "Blumen" and "Topf" if both words appear in the dictionary. For more information about hyphenation pattern files, see FOP XML Hyphenation Patterns.

Parameters

The hyphenation_decompounder token filter can be configured with the following parameters.

| Parameter | Required/Optional | Data type | Description |
| --- | --- | --- | --- |
| hyphenation_patterns_path | Required | String | The path (relative to the config directory or absolute) to the hyphenation patterns file, which contains the language-specific rules for word splitting. The file is typically in XML format. Sample files can be downloaded from the OFFO SourceForge project. |
| word_list | Required if word_list_path is not set | Array of strings | A list of words used to validate the components generated by the hyphenation patterns. |
| word_list_path | Required if word_list is not set | String | The path (relative to the config directory or absolute) to a file containing a list of subwords. |
| max_subword_size | Optional | Integer | The maximum subword length. Generated subwords longer than this length are not added to the generated tokens. Default is 15. |
| min_subword_size | Optional | Integer | The minimum subword length. Generated subwords shorter than this length are not added to the generated tokens. Default is 2. |
| min_word_size | Optional | Integer | The minimum word length. Tokens shorter than this length are not decomposed into subwords. Default is 5. |
| only_longest_match | Optional | Boolean | If true, only the longest matching subword is added to the generated tokens. Default is false. |
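
If the validation dictionary is long, it can be more convenient to store it in a file and reference it using word_list_path instead of word_list. The following filter definition is a minimal sketch that assumes a hypothetical file named analysis/subwords.txt, containing one subword per line, placed in the config directory:

"my_hyphenation_decompounder": {
  "type": "hyphenation_decompounder",
  "hyphenation_patterns_path": "analysis/hyphenation_patterns.xml",
  "word_list_path": "analysis/subwords.txt"
}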

Example

The following example request creates a new index named test_index and configures an analyzer with a hyphenation_decompounder filter:

PUT /test_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_hyphenation_decompounder": {
          "type": "hyphenation_decompounder",
          "hyphenation_patterns_path": "analysis/hyphenation_patterns.xml",
          "word_list": ["notebook", "note", "book"],
          "min_subword_size": 3,
          "min_word_size": 5,
          "only_longest_match": false
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_hyphenation_decompounder"
          ]
        }
      }
    }
  }
}

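To use the analyzer at index time, reference it in a field mapping. The following request is a minimal sketch that assumes a hypothetical text field named content:

PUT /test_index/_mapping
{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "my_analyzer"
    }
  }
}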

Generated tokens

Use the following request to examine the tokens generated using the analyzer:

POST /test_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "notebook"
}


The response contains the generated tokens:

{
  "tokens": [
    {
      "token": "notebook",
      "start_offset": 0,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "note",
      "start_offset": 0,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "book",
      "start_offset": 0,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}
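
Because the decompounded subwords are indexed at the same position as the original token, a full-text query for one of the components should match documents containing the whole compound. As a sketch, assuming a content field mapped to my_analyzer as shown earlier, the following query should match a document whose content field contains the word "notebook":

POST /test_index/_search
{
  "query": {
    "match": {
      "content": "book"
    }
  }
}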