Hyphenation decompounder token filter

The `hyphenation_decompounder` token filter breaks compound words into their constituent parts. It is particularly useful for languages such as German, Dutch, and Swedish, in which compound words are common. The filter uses hyphenation patterns (typically defined in `.xml` files) to identify the candidate locations at which a compound word can be split into components. These components are then checked against a provided dictionary (the word list). Each component that matches a dictionary entry is emitted as a token alongside the original word. For more information about hyphenation pattern files, see FOP XML Hyphenation Patterns.
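The splitting logic described above can be approximated in a few lines of Python. This is a simplified sketch, not the actual implementation: the `decompound` function and its `split_points` argument are hypothetical, and the split points are supplied by hand instead of being derived from a hyphenation patterns file. The size arguments mirror the filter parameters of the same names, and skipping the whole-word span is an assumption made to keep the output free of duplicates.

```python
def decompound(word, split_points, dictionary,
               min_word_size=5, min_subword_size=2, max_subword_size=15):
    """Simplified model of hyphenation-based decompounding (illustrative only)."""
    tokens = [word]  # the original token is always kept
    if len(word) < min_word_size:
        return tokens  # too short to be decomposed

    # Candidate boundaries: word start, each split point, word end.
    points = sorted({0, *split_points, len(word)})
    for i, start in enumerate(points):
        for end in points[i + 1:]:
            length = end - start
            if length > max_subword_size:
                break  # later boundaries only make the span longer
            if length < min_subword_size:
                continue
            if start == 0 and end == len(word):
                continue  # assumption: do not re-emit the whole word
            part = word[start:end]
            if part in dictionary:  # keep only components found in the word list
                tokens.append(part)
    return tokens


# Assumes a single split point after "note" (index 4).
result = decompound("notebook", [4], {"notebook", "note", "book"}, min_subword_size=3)
print(result)  # ['notebook', 'note', 'book']
```

With the word list and `min_subword_size` used in the example later on this page, the sketch returns the original word followed by `note` and `book`.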

Parameters

The `hyphenation_decompounder` token filter can be configured with the following parameters.

Parameter	Required/Optional	Data type	Description
`hyphenation_patterns_path`	Required	String	The path (relative to the `config` directory or absolute) to the hyphenation patterns file, which contains the language-specific rules for word splitting. The file is typically in XML format. Sample files can be downloaded from the OFFO SourceForge project.
`word_list`	Required if `word_list_path` is not set	Array of strings	A list of words used to validate the components generated by the hyphenation patterns.
`word_list_path`	Required if `word_list` is not set	String	The path (relative to the `config` directory or absolute) to a file containing the words used to validate the components generated by the hyphenation patterns.
`max_subword_size`	Optional	Integer	The maximum subword length. If the generated subword exceeds this length, it will not be added to the generated tokens. Default is `15`.
`min_subword_size`	Optional	Integer	The minimum subword length. If the generated subword is shorter than the specified length, it will not be added to the generated tokens. Default is `2`.
`min_word_size`	Optional	Integer	The minimum word character length. Word tokens shorter than this length are excluded from decomposition into subwords. Default is `5`.
`only_longest_match`	Optional	Boolean	If `true`, only the longest matching subword is added to the generated tokens. Default is `false`.
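The requirement rules in the table can be checked on the client side before a request is sent. The following Python sketch is one possible way to do that. The `build_decompounder_filter` helper and the `analysis/word_list.txt` file name are hypothetical and serve only to illustrate a configuration that uses `word_list_path` instead of `word_list`.

```python
import json

# Optional parameters listed in the table above.
OPTIONAL_PARAMS = {"max_subword_size", "min_subword_size",
                   "min_word_size", "only_longest_match"}


def build_decompounder_filter(patterns_path, word_list=None,
                              word_list_path=None, **optional):
    """Build a filter definition, enforcing the required parameters (hypothetical helper)."""
    if not patterns_path:
        raise ValueError("hyphenation_patterns_path is required")
    if word_list is None and word_list_path is None:
        raise ValueError("set either word_list or word_list_path")
    unknown = set(optional) - OPTIONAL_PARAMS
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")

    settings = {
        "type": "hyphenation_decompounder",
        "hyphenation_patterns_path": patterns_path,
    }
    if word_list is not None:
        settings["word_list"] = list(word_list)
    if word_list_path is not None:
        settings["word_list_path"] = word_list_path
    settings.update(optional)
    return settings


filter_settings = build_decompounder_filter(
    "analysis/hyphenation_patterns.xml",
    word_list_path="analysis/word_list.txt",  # hypothetical file name
    min_subword_size=3,
    only_longest_match=True,
)
print(json.dumps(filter_settings, indent=2))
```

The printed JSON object can be placed under `settings.analysis.filter` in an index creation request.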

Example

The following example request creates a new index named `test_index` and configures an analyzer with a `hyphenation_decompounder` filter:

PUT /test_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_hyphenation_decompounder": {
          "type": "hyphenation_decompounder",
          "hyphenation_patterns_path": "analysis/hyphenation_patterns.xml",
          "word_list": ["notebook", "note", "book"],
          "min_subword_size": 3,
          "min_word_size": 5,
          "only_longest_match": false
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_hyphenation_decompounder"
          ]
        }
      }
    }
  }
}

Generated tokens

Use the following request to examine the tokens generated using the analyzer:

POST /test_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "notebook"
}

The response contains the generated tokens:

{
  "tokens": [
    {
      "token": "notebook",
      "start_offset": 0,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "note",
      "start_offset": 0,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "book",
      "start_offset": 0,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}
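
In the response above, the subwords `note` and `book` carry the same offsets and position as the original token `notebook`. The following Python sketch makes this easy to verify. It assumes that the response body is available as a string (abbreviated here to the relevant fields), and the variable names are illustrative.

```python
import json

# Response body from the _analyze request, reduced to the fields used below.
response_body = """
{
  "tokens": [
    {"token": "notebook", "start_offset": 0, "end_offset": 8, "position": 0},
    {"token": "note",     "start_offset": 0, "end_offset": 8, "position": 0},
    {"token": "book",     "start_offset": 0, "end_offset": 8, "position": 0}
  ]
}
"""

tokens = json.loads(response_body)["tokens"]

# Token text in the order returned by the analyzer.
terms = [t["token"] for t in tokens]

# Distinct (start_offset, end_offset, position) triples across all tokens.
spans = {(t["start_offset"], t["end_offset"], t["position"]) for t in tokens}

print(terms)  # ['notebook', 'note', 'book']
print(spans)  # {(0, 8, 0)}
```

A single triple in `spans` confirms that every subword reuses the offsets and position of the original word.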