Phonetic token filter

The phonetic token filter transforms tokens into their phonetic representations, enabling more flexible matching of words that sound similar but are spelled differently. This is particularly useful for searching names, brands, or other entities that users might spell differently but pronounce similarly.
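
To make the idea of a phonetic representation concrete, the following minimal sketch implements a simplified American Soundex in plain Python. It is an illustration only: it is not the plugin's implementation, the function name soundex is chosen here for the example, and ASCII input is assumed. It shows how differently spelled names can collapse to the same code.

```python
# Simplified American Soundex, for illustration only (not the plugin's code).
# Consonants that sound alike share a digit; vowels and Y carry no digit.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(word: str) -> str:
    """Return the 4-character Soundex code for a word (ASCII assumed)."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    encoded = first
    prev = CODES.get(first, "")
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not separate letters with the same code
        digit = CODES.get(c, "")
        if digit and digit != prev:
            encoded += digit
        prev = digit  # a vowel resets prev, so a repeated code is kept
    return (encoded + "000")[:4]  # pad or truncate to 4 characters


print(soundex("Stephen"), soundex("Steven"))  # S315 S315
print(soundex("Smith"), soundex("Smyth"))     # S530 S530
```

Because both spellings of each name produce the same code, a search for one spelling can match documents that contain the other.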

The phonetic token filter is not included in OpenSearch distributions by default. To use this token filter, you must first install the analysis-phonetic plugin as follows and then restart OpenSearch:

    ./bin/opensearch-plugin install analysis-phonetic

For more information about installing plugins, see Installing plugins.

Parameters

The phonetic token filter can be configured with the following parameters.

| Parameter | Required/Optional | Data type | Description |
| --- | --- | --- | --- |
| encoder | Optional | String | Specifies the phonetic algorithm to use. Valid values are metaphone (default), double_metaphone, soundex, refined_soundex, caverphone1, caverphone2, cologne, nysiis, koelnerphonetik, haasephonetik, beider_morse, and daitch_mokotoff. |
| replace | Optional | Boolean | Whether to replace the original token. If false, the original token is included in the output along with the phonetic encoding. Default is true. |

Example

The following example request creates a new index named names_index and configures an analyzer with a phonetic filter:

    PUT /names_index
    {
      "settings": {
        "analysis": {
          "filter": {
            "my_phonetic_filter": {
              "type": "phonetic",
              "encoder": "double_metaphone",
              "replace": true
            }
          },
          "analyzer": {
            "phonetic_analyzer": {
              "tokenizer": "standard",
              "filter": [
                "my_phonetic_filter"
              ]
            }
          }
        }
      }
    }
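
If the request body is generated by a script, it can help to validate the encoder name before anything is sent to the cluster. The following sketch builds the same settings body in Python. The helper name phonetic_index_settings is hypothetical, the list of encoders is copied from the Parameters section, and sending the request is deliberately left out.

```python
import json

# Encoder names copied from the Parameters section of this page.
VALID_ENCODERS = {
    "metaphone", "double_metaphone", "soundex", "refined_soundex",
    "caverphone1", "caverphone2", "cologne", "nysiis",
    "koelnerphonetik", "haasephonetik", "beider_morse", "daitch_mokotoff",
}


def phonetic_index_settings(encoder="metaphone", replace=True,
                            filter_name="my_phonetic_filter",
                            analyzer_name="phonetic_analyzer"):
    """Build an index-creation body with one phonetic filter and analyzer.

    Hypothetical helper for illustration; it only assembles a dict.
    """
    if encoder not in VALID_ENCODERS:
        raise ValueError(f"unknown phonetic encoder: {encoder!r}")
    if not isinstance(replace, bool):
        raise TypeError("replace must be a boolean")
    return {
        "settings": {
            "analysis": {
                "filter": {
                    filter_name: {
                        "type": "phonetic",
                        "encoder": encoder,
                        "replace": replace,
                    }
                },
                "analyzer": {
                    analyzer_name: {
                        "tokenizer": "standard",
                        "filter": [filter_name],
                    }
                },
            }
        }
    }


# Same configuration as the request above, printed as JSON.
print(json.dumps(phonetic_index_settings("double_metaphone", True), indent=2))
```

The printed JSON can be used as the body of the PUT /names_index request. A misspelled encoder such as "doublemetaphone" raises a ValueError locally instead of failing at index creation.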

Generated tokens

Use the following requests to examine the tokens that the analyzer generates for the names Stephen and Steven:

    POST /names_index/_analyze
    {
      "text": "Stephen",
      "analyzer": "phonetic_analyzer"
    }

    POST /names_index/_analyze
    {
      "text": "Steven",
      "analyzer": "phonetic_analyzer"
    }

Both requests produce the same phonetic token, STFN. The following response is for Steven; for Stephen, the only difference is an end_offset of 7:

    {
      "tokens": [
        {
          "token": "STFN",
          "start_offset": 0,
          "end_offset": 6,
          "type": "<ALPHANUM>",
          "position": 0
        }
      ]
    }
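
When checking many name pairs, comparing responses by eye becomes tedious. As a rough sketch, the following Python helpers compare two _analyze responses by their token values only, ignoring offsets. The function names token_set and sounds_alike are hypothetical, and the sample dictionaries simply mirror the response shape shown above rather than being fetched from a cluster.

```python
def token_set(analyze_response: dict) -> set:
    """Collect the token strings from an _analyze response body."""
    return {t["token"] for t in analyze_response.get("tokens", [])}


def sounds_alike(resp_a: dict, resp_b: dict) -> bool:
    """Treat two inputs as a phonetic match if they share any token."""
    return bool(token_set(resp_a) & token_set(resp_b))


# Sample responses shaped like the one above (offsets differ, tokens match).
stephen = {"tokens": [{"token": "STFN", "start_offset": 0, "end_offset": 7,
                       "type": "<ALPHANUM>", "position": 0}]}
steven = {"tokens": [{"token": "STFN", "start_offset": 0, "end_offset": 6,
                      "type": "<ALPHANUM>", "position": 0}]}

print(sounds_alike(stephen, steven))  # True
```

Comparing sets rather than whole responses keeps the check independent of input length. If replace is set to false, the original tokens also appear in the output, so this simple overlap test would need to exclude them.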