Phonetic token filter

The phonetic token filter transforms tokens into their phonetic representations, enabling more flexible matching of words that sound similar but are spelled differently. This is particularly useful for searching names, brands, or other entities that users might spell differently but pronounce similarly.

The phonetic token filter is not included in OpenSearch distributions by default. To use this token filter, you must first install the `analysis-phonetic` plugin as follows and then restart OpenSearch:

./bin/opensearch-plugin install analysis-phonetic

For more information about installing plugins, see Installing plugins.

Parameters

The phonetic token filter can be configured with the following parameters.

Parameter	Required/Optional	Data type	Description
`encoder`	Optional	String	Specifies the phonetic algorithm to use. Valid values are `metaphone` (default), `double_metaphone`, `soundex`, `refined_soundex`, `caverphone1`, `caverphone2`, `cologne`, `nysiis`, `koelnerphonetik`, `haasephonetik`, `beider_morse`, and `daitch_mokotoff`.
`replace`	Optional	Boolean	Whether to replace the original token. If `false`, the original token is included in the output along with the phonetic encoding. Default is `true`.
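
As a sketch of how these parameters combine, the following request sets `replace` to `false` so that each original token is kept alongside its phonetic encoding. The index name `names_index_keep_original` and the filter and analyzer names are hypothetical and chosen only for illustration:

```json
PUT /names_index_keep_original
{
  "settings": {
    "analysis": {
      "filter": {
        "my_phonetic_filter_keep_original": {
          "type": "phonetic",
          "encoder": "metaphone",
          "replace": false
        }
      },
      "analyzer": {
        "phonetic_analyzer_keep_original": {
          "tokenizer": "standard",
          "filter": [
            "my_phonetic_filter_keep_original"
          ]
        }
      }
    }
  }
}
```

Keeping the original token can help when exact spellings should still match, though it typically adds more terms to the index.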

Example

The following example request creates a new index named `names_index` and configures an analyzer with a phonetic filter:

PUT /names_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_phonetic_filter": {
          "type": "phonetic",
          "encoder": "double_metaphone",
          "replace": true
        }
      },
      "analyzer": {
        "phonetic_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "my_phonetic_filter"
          ]
        }
      }
    }
  }
}

Generated tokens

Use the following request to examine the tokens generated for the names Stephen and Steven using the analyzer:

POST /names_index/_analyze
{
  "text": "Stephen",
  "analyzer": "phonetic_analyzer"
}

POST /names_index/_analyze
{
  "text": "Steven",
  "analyzer": "phonetic_analyzer"
}

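In both cases, the analyzer generates the same token, `STFN`. The following response is for `Steven`. The response for `Stephen` differs only in its `end_offset`, which is `7` because the name is one character longer: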

{
  "tokens": [
    {
      "token": "STFN",
      "start_offset": 0,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}
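
To show how the matching tokens translate into search behavior, the following sketch maps a `name` field to `phonetic_analyzer`, indexes one document, and searches for the alternate spelling. The field name, document ID, and document contents are assumptions made for illustration and are not part of the example above:

```json
PUT /names_index/_mapping
{
  "properties": {
    "name": {
      "type": "text",
      "analyzer": "phonetic_analyzer"
    }
  }
}
```

```json
PUT /names_index/_doc/1?refresh=true
{
  "name": "Stephen"
}
```

```json
GET /names_index/_search
{
  "query": {
    "match": {
      "name": "Steven"
    }
  }
}
```

Because the same analyzer runs at index time and at search time, both spellings are encoded as `STFN`, so the query for `Steven` should return the document containing `Stephen`.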