Synonym token filter

The synonym token filter makes it easy to handle synonyms during the analysis process. Synonyms are configured using a configuration file. Here is an example:

  PUT /test_index
  {
    "settings": {
      "index": {
        "analysis": {
          "analyzer": {
            "synonym": {
              "tokenizer": "whitespace",
              "filter": [ "synonym" ]
            }
          },
          "filter": {
            "synonym": {
              "type": "synonym",
              "synonyms_path": "analysis/synonym.txt"
            }
          }
        }
      }
    }
  }

The above configures a synonym filter, with a path of analysis/synonym.txt (relative to the config location). The synonym analyzer is then configured with the filter.

This filter tokenizes synonyms with whatever tokenizer and token filters appear before it in the chain.
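
As a rough illustration, the settings above could be sent from a script and the resulting analyzer checked with the _analyze API. The sketch below uses only the Python standard library. The cluster address http://localhost:9200, the helper names, and the text passed in are assumptions made for illustration, and analysis/synonym.txt is assumed to exist already under the config location.

```python
import json
import urllib.request

# Assumption: a local, unsecured development cluster at this address.
BASE_URL = "http://localhost:9200"

# Same settings as the request above, expressed as a Python dict.
INDEX_BODY = {
    "settings": {
        "index": {
            "analysis": {
                "analyzer": {
                    "synonym": {"tokenizer": "whitespace", "filter": ["synonym"]}
                },
                "filter": {
                    "synonym": {
                        "type": "synonym",
                        "synonyms_path": "analysis/synonym.txt",
                    }
                },
            }
        }
    }
}


def build_request(method, path, body):
    """Build (but do not send) a JSON request against BASE_URL."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method=method,
    )


def send(request):
    """Send a prepared request and decode the JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def create_index_and_analyze(text):
    """Create test_index, then return the terms the synonym analyzer emits."""
    send(build_request("PUT", "/test_index", INDEX_BODY))
    result = send(
        build_request(
            "POST",
            "/test_index/_analyze",
            {"analyzer": "synonym", "text": text},
        )
    )
    return [token["token"] for token in result["tokens"]]
```

Calling create_index_and_analyze with a short phrase against a running cluster returns the terms produced after synonym handling; what comes back depends entirely on the contents of the synonym file.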

Additional settings are:

  • expand (defaults to true). Controls how equivalent synonyms (rules without "=>") are mapped; see the Solr synonyms section.
  • lenient (defaults to false). If true, exceptions raised while parsing the synonym configuration are ignored. Note that only the synonym rules that cannot be parsed are ignored. For instance, consider the following request:
    PUT /test_index
    {
      "settings": {
        "index": {
          "analysis": {
            "analyzer": {
              "synonym": {
                "tokenizer": "standard",
                "filter": [ "my_stop", "synonym" ]
              }
            },
            "filter": {
              "my_stop": {
                "type": "stop",
                "stopwords": [ "bar" ]
              },
              "synonym": {
                "type": "synonym",
                "lenient": true,
                "synonyms": [ "foo, bar => baz" ]
              }
            }
          }
        }
      }
    }

With the above request, the word bar is skipped, but the mapping foo => baz is still added. However, if the mapping being added were foo, baz => bar, nothing would be added to the synonym list, because the target word of the mapping is itself eliminated as a stop word. Similarly, if the mapping were “bar, foo, baz” and expand were set to false, no mapping would be added, because when expand=false the target of the mapping is the first word. If expand=true, however, the mappings added would be equivalent to foo, baz => foo, baz, i.e., all mappings other than those involving the stop word.
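
The four cases described above can be traced with a small toy model. This is only a sketch of the behaviour described in this section, not Elasticsearch's actual parser: the function name lenient_rule is hypothetical, terms are treated as plain strings rather than token sequences, and the stop filter is reduced to a set lookup.

```python
def lenient_rule(rule, stopwords, expand=True):
    """Toy model: map each surviving input term to the surviving output terms."""

    def split_terms(side):
        return [term.strip() for term in side.split(",") if term.strip()]

    if "=>" in rule:
        # Explicit mapping: inputs on the left, replacements on the right.
        left, right = rule.split("=>", 1)
        inputs, outputs = split_terms(left), split_terms(right)
    else:
        # Equivalent synonyms: expand decides what the inputs map to.
        inputs = split_terms(rule)
        outputs = inputs if expand else inputs[:1]

    # The preceding stop filter removes stop words from both sides.
    kept_inputs = [term for term in inputs if term not in stopwords]
    kept_outputs = [term for term in outputs if term not in stopwords]

    # With no target left, the rule contributes nothing.
    if not kept_outputs:
        return {}
    return {term: kept_outputs for term in kept_inputs}


stop = {"bar"}
print(lenient_rule("foo, bar => baz", stop))              # {'foo': ['baz']}
print(lenient_rule("foo, baz => bar", stop))              # {}
print(lenient_rule("bar, foo, baz", stop, expand=False))  # {}
print(lenient_rule("bar, foo, baz", stop, expand=True))
# {'foo': ['foo', 'baz'], 'baz': ['foo', 'baz']}
```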

tokenizer and ignore_case are deprecated

The tokenizer parameter controls which tokenizer is used to tokenize the synonyms. It exists for backwards compatibility with indices that were created before 6.0. The ignore_case parameter works only together with the tokenizer parameter.

Two synonym formats are supported: Solr and WordNet.

Solr synonyms

The following is a sample format of the file:

  # Blank lines and lines starting with pound are comments.
  # Explicit mappings match any token sequence on the LHS of "=>"
  # and replace with all alternatives on the RHS. These types of mappings
  # ignore the expand parameter in the schema.
  # Examples:
  i-pod, i pod => ipod
  sea biscuit, sea biscit => seabiscuit
  # Equivalent synonyms may be separated with commas and give
  # no explicit mapping. In this case the mapping behavior will
  # be taken from the expand parameter in the schema. This allows
  # the same synonym file to be used in different synonym handling strategies.
  # Examples:
  ipod, i-pod, i pod
  foozball , foosball
  universe , cosmos
  lol, laughing out loud
  # If expand==true, "ipod, i-pod, i pod" is equivalent
  # to the explicit mapping:
  ipod, i-pod, i pod => ipod, i-pod, i pod
  # If expand==false, "ipod, i-pod, i pod" is equivalent
  # to the explicit mapping:
  ipod, i-pod, i pod => ipod
  # Multiple synonym mapping entries are merged.
  foo => foo bar
  foo => baz
  # is equivalent to
  foo => foo bar, baz
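
To make the file rules concrete, here is a minimal sketch of a reader for this format. It is an illustration only, not the parser Elasticsearch uses: the function name parse_solr_synonyms is hypothetical, phrases are kept as plain strings instead of being run through an analyzer, and the sample text is made up from the examples above.

```python
def parse_solr_synonyms(text, expand=True):
    """Toy parser: return {input phrase: [replacement phrases]}."""
    mapping = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Blank lines and lines starting with "#" are comments.
        if not line or line.startswith("#"):
            continue
        if "=>" in line:
            # Explicit mapping: the expand parameter is ignored.
            left, right = line.split("=>", 1)
            inputs = [t.strip() for t in left.split(",")]
            outputs = [t.strip() for t in right.split(",")]
        else:
            # Equivalent synonyms: expand picks all terms or only the first.
            inputs = [t.strip() for t in line.split(",")]
            outputs = inputs if expand else inputs[:1]
        for phrase in inputs:
            # Entries for an input that was seen before are merged.
            merged = mapping.setdefault(phrase, [])
            for out in outputs:
                if out not in merged:
                    merged.append(out)
    return mapping


sample = """
# Multiple entries for the same input are merged.
foo => foo bar
foo => baz
foozball , foosball
"""
print(parse_solr_synonyms(sample, expand=False))
# {'foo': ['foo bar', 'baz'], 'foozball': ['foozball'], 'foosball': ['foozball']}
```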

You can also define synonyms directly in the filter configuration (note the use of synonyms instead of synonyms_path):

  PUT /test_index
  {
    "settings": {
      "index": {
        "analysis": {
          "filter": {
            "synonym": {
              "type": "synonym",
              "synonyms": [
                "i-pod, i pod => ipod",
                "universe, cosmos"
              ]
            }
          }
        }
      }
    }
  }

However, it is recommended to define large synonym sets in a file using synonyms_path, because specifying them inline increases cluster size unnecessarily.

WordNet synonyms

Synonyms in WordNet format can be declared using the format parameter:

  PUT /test_index
  {
    "settings": {
      "index": {
        "analysis": {
          "filter": {
            "synonym": {
              "type": "synonym",
              "format": "wordnet",
              "synonyms": [
                "s(100000001,1,'abstain',v,1,0).",
                "s(100000001,2,'refrain',v,1,0).",
                "s(100000001,3,'desist',v,1,0)."
              ]
            }
          }
        }
      }
    }
  }

Using synonyms_path to define WordNet synonyms in a file is supported as well.
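
In the WordNet prolog format, each entry has the shape s(synset_id,w_num,'word',ss_type,sense_number,tag_count), and entries that share a synset ID belong to the same synonym set. The following sketch groups the entries from the example above by synset ID to show which words end up together. The function name group_by_synset and the regular expression are illustrative assumptions, not the parser Elasticsearch uses.

```python
import re
from collections import defaultdict

# Fields: s(synset_id,w_num,'word',ss_type,sense_number,tag_count).
ENTRY = re.compile(r"^s\((\d+),(\d+),'(.*)',([a-z]),(\d+),(\d+)\)\.$")


def group_by_synset(lines):
    """Group the words of WordNet s/6 entries by their synset ID."""
    groups = defaultdict(list)
    for line in lines:
        match = ENTRY.match(line.strip())
        if match is None:
            raise ValueError(f"not a WordNet s/6 entry: {line!r}")
        synset_id, word = match.group(1), match.group(3)
        # Prolog escapes a single quote inside a word by doubling it.
        groups[synset_id].append(word.replace("''", "'"))
    return dict(groups)


entries = [
    "s(100000001,1,'abstain',v,1,0).",
    "s(100000001,2,'refrain',v,1,0).",
    "s(100000001,3,'desist',v,1,0).",
]
print(group_by_synset(entries))
# {'100000001': ['abstain', 'refrain', 'desist']}
```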

Parsing synonym files

Elasticsearch uses the token filters preceding the synonym filter in the analysis chain to parse the entries in a synonym file. For example, if a synonym filter is placed after a stemmer, the stemmer is also applied to the synonym entries. Because entries in the synonym map cannot have stacked positions, some token filters may cause issues here. Token filters that produce multiple versions of a token may choose which version to emit when parsing synonyms; for example, asciifolding only produces the folded version of the token. Others, such as multiplexer, word_delimiter_graph, or ngram, throw an error.

If you need to build analyzers that include both multi-token filters and synonym filters, consider using the multiplexer filter, with the multi-token filters in one branch and the synonym filter in the other.