Hunspell token filter

The hunspell token filter performs stemming and morphological analysis of words in a specific language. It applies Hunspell dictionaries, which are widely used in spell checkers, to reduce each token to its root form (stemming).

The Hunspell dictionary files are automatically loaded at startup from the <OS_PATH_CONF>/hunspell/<locale> directory. For example, the en_GB locale must have at least one .aff file and one or more .dic files in the <OS_PATH_CONF>/hunspell/en_GB/ directory.

You can download these files from LibreOffice dictionaries.
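Because the dictionaries are loaded at startup, the locale directory must be in place before the node starts. The following sketch shows the expected layout; it uses a temporary directory to stand in for `<OS_PATH_CONF>`, and the file names `en_GB.aff` and `en_GB.dic` are illustrative:

```shell
# Sketch only: OS_PATH_CONF stands in for your real OpenSearch config path.
# Here a temporary directory is used so the example is self-contained.
OS_PATH_CONF=$(mktemp -d)

# The en_GB locale directory needs at least one .aff file
# and one or more .dic files.
mkdir -p "$OS_PATH_CONF/hunspell/en_GB"
touch "$OS_PATH_CONF/hunspell/en_GB/en_GB.aff" \
      "$OS_PATH_CONF/hunspell/en_GB/en_GB.dic"

# Verify the layout the node will scan at startup.
ls "$OS_PATH_CONF/hunspell/en_GB"
```

Restart the node after adding or changing dictionary files so the filter picks them up.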

Parameters

The hunspell token filter can be configured with the following parameters.

| Parameter | Required/Optional | Data type | Description |
| :--- | :--- | :--- | :--- |
| `language`/`lang`/`locale` | At least one of the three is required | String | Specifies the language for the Hunspell dictionary. |
| `dedup` | Optional | Boolean | Determines whether to remove multiple duplicate stemming terms for the same token. Default is `true`. |
| `dictionary` | Optional | Array of strings | Configures the dictionary files to be used for the Hunspell dictionary. Default is all files in the `<OS_PATH_CONF>/hunspell/<locale>` directory. |
| `longest_only` | Optional | Boolean | Specifies whether only the longest stemmed version of the token should be returned. Default is `false`. |
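If the locale directory contains several dictionary files and you want the filter to use only some of them, you can list them in the `dictionary` parameter. A minimal sketch, assuming a file named `en_GB.dic` exists in the locale directory (the file name is illustrative):

```json
{
  "my_hunspell_filter": {
    "type": "hunspell",
    "locale": "en_GB",
    "dictionary": ["en_GB.dic"]
  }
}
```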

Example

The following example request creates a new index named my_index and configures an analyzer with a hunspell filter:

```json
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_hunspell_filter": {
          "type": "hunspell",
          "lang": "en_GB",
          "dedup": true,
          "longest_only": true
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_hunspell_filter"
          ]
        }
      }
    }
  }
}
```


Generated tokens

Use the following request to examine the tokens generated using the analyzer:

```json
POST /my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "the turtle moves slowly"
}
```


The response contains the generated tokens:

```json
{
  "tokens": [
    {
      "token": "the",
      "start_offset": 0,
      "end_offset": 3,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "turtle",
      "start_offset": 4,
      "end_offset": 10,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "move",
      "start_offset": 11,
      "end_offset": 16,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "slow",
      "start_offset": 17,
      "end_offset": 23,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}
```