Similarity module

A similarity (scoring / ranking model) defines how matching documents are scored. Similarity is per field, meaning that via the mapping one can define a different similarity per field.

Similarity applies only to text and keyword fields.

Configuring a custom similarity is considered an expert feature; the built-in similarities are most likely sufficient, as described in the similarity mapping parameter documentation.

Configuring a similarity

Most existing or custom similarities have configuration options, which can be set via the index settings as shown below. These options can be provided when creating an index or when updating index settings.

  PUT /index
  {
    "settings": {
      "index": {
        "similarity": {
          "my_similarity": {
            "type": "DFR",
            "basic_model": "g",
            "after_effect": "l",
            "normalization": "h2",
            "normalization.h2.c": "3.0"
          }
        }
      }
    }
  }

Here we configure the DFR similarity so it can be referenced as my_similarity in mappings, as illustrated in the example below:

  PUT /index/_mapping
  {
    "properties" : {
      "title" : { "type" : "text", "similarity" : "my_similarity" }
    }
  }

Available similarities

BM25 similarity (default)

A TF/IDF-based similarity with built-in tf normalization that is supposed to work better for short fields (like names). See Okapi BM25 for more details. This similarity has the following options:

k1

Controls non-linear term frequency normalization (saturation). The default value is 1.2.

b

Controls to what degree document length normalizes tf values. The default value is 0.75.

discount_overlaps

Determines whether overlap tokens (tokens with a position increment of 0) are ignored when computing the norm. The default is true, meaning overlap tokens do not count when computing norms.

Type name: BM25
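
To make the roles of k1 and b more concrete, here is a minimal Python sketch of a textbook-style BM25 term score. The function name bm25_term_score, its argument names, and the exact idf form are assumptions made for illustration; this is not Elasticsearch's or Lucene's implementation, which may differ in constant factors.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, doc_count, doc_freq, k1=1.2, b=0.75):
    """Textbook-style BM25 score of one term in one document (illustrative only)."""
    # Inverse document frequency: rarer terms weigh more.
    idf = math.log(1.0 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    # b blends between no length normalization (b=0) and full normalization (b=1).
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # k1 controls how quickly repeated occurrences stop adding to the score.
    return idf * tf * (k1 + 1.0) / (tf + k1 * length_norm)

if __name__ == "__main__":
    # Hypothetical statistics: 1000 documents, the term occurs in 10 of them.
    for tf in (1, 2, 4, 8):
        print(tf, round(bm25_term_score(tf, 100, 100, 1000, 10), 3))
```

In this sketch, each additional occurrence of the term adds less than the previous one (saturation), and setting b to 0 makes the score independent of document length.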

DFR similarity

Similarity that implements the divergence from randomness framework. This similarity has the following options:

basic_model

Possible values: g, if, in and ine.

after_effect

Possible values: b and l.

normalization

Possible values: no, h1, h2, h3 and z.

All options except the first (no) need a normalization value, such as normalization.h2.c in the configuration example above.

Type name: DFR

DFI similarity

Similarity that implements the divergence from independence model. This similarity has the following options:

independence_measure

Possible values: standardized, saturated and chisquared.

To get good relevance with this similarity, it is highly recommended not to remove stop words. Also beware that terms whose frequency is less than the expected frequency get a score of 0.

Type name: DFI

IB similarity

Information-based model. The algorithm is based on the concept that the information content in any symbolic distribution sequence is primarily determined by the repetitive usage of its basic elements. For written texts this challenge would correspond to comparing the writing styles of different authors. This similarity has the following options:

distribution

Possible values: ll and spl.

lambda

Possible values: df and ttf.

normalization

Same as in DFR similarity.

Type name: IB

LM Dirichlet similarity

Language model similarity based on Bayesian smoothing using Dirichlet priors. This similarity has the following options:

mu

Defaults to 2000.

The scoring formula in the paper assigns negative scores to terms that have fewer occurrences than predicted by the language model. Lucene does not allow negative scores, so such terms get a score of 0.

Type name: LMDirichlet
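
The clamping to 0 described above can be illustrated with a small Python sketch of a Dirichlet-smoothed term score. The function name lm_dirichlet_score and the collection_prob argument (the probability of the term in the whole collection) are assumptions made for illustration; the formula follows the usual Dirichlet-smoothing form and is not taken from Elasticsearch itself.

```python
import math

def lm_dirichlet_score(freq, doc_len, collection_prob, mu=2000.0):
    """Dirichlet-smoothed term score, clamped at 0 (illustrative sketch)."""
    raw = (
        math.log(1.0 + freq / (mu * collection_prob))
        + math.log(mu / (doc_len + mu))
    )
    # Negative raw scores occur when freq < doc_len * collection_prob,
    # i.e. the term is rarer in the document than the model predicts.
    return max(raw, 0.0)

if __name__ == "__main__":
    # Hypothetical numbers: the model predicts 1000 * 0.01 = 10 occurrences.
    print(lm_dirichlet_score(freq=1, doc_len=1000, collection_prob=0.01))
    print(lm_dirichlet_score(freq=50, doc_len=1000, collection_prob=0.01))
```

With these hypothetical numbers, the first call is clamped to 0 because one occurrence is fewer than the ten the model predicts, while the second call yields a positive score.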

LM Jelinek Mercer similarity

Language model similarity based on Jelinek-Mercer smoothing. The algorithm attempts to capture important patterns in the text, while leaving out noise. This similarity has the following options:

lambda

The optimal value depends on both the collection and the query. The optimal value is around 0.1 for title queries and 0.7 for long queries. Defaults to 0.1. As the value approaches 0, documents that match more query terms are ranked higher than those that match fewer terms.

Type name: LMJelinekMercer
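
The effect of lambda on ranking can be sketched in Python. The formula below is the commonly cited Jelinek-Mercer form, log(1 + ((1 - lambda) * freq / doc_len) / (lambda * p_collection)); the function names, the collection_prob argument, and the sample statistics are all assumptions made for illustration, not Elasticsearch's implementation.

```python
import math

def lm_jelinek_mercer_score(freq, doc_len, collection_prob, lam=0.1):
    """Per-term score under Jelinek-Mercer smoothing (illustrative sketch)."""
    if not 0.0 < lam <= 1.0:
        raise ValueError("lam must be in (0, 1]")
    doc_part = (1.0 - lam) * freq / doc_len
    collection_part = lam * collection_prob
    return math.log(1.0 + doc_part / collection_part)

def query_score(term_freqs, doc_len, collection_prob, lam):
    """Sum the per-term scores of a query; a term with freq 0 contributes 0."""
    return sum(
        lm_jelinek_mercer_score(f, doc_len, collection_prob, lam)
        for f in term_freqs
    )

if __name__ == "__main__":
    # Hypothetical two-term query against two documents of length 100.
    for lam in (0.1, 0.7):
        both_terms_once = query_score([1, 1], 100, 0.01, lam)
        one_term_often = query_score([10, 0], 100, 0.01, lam)
        print(lam, both_terms_once > one_term_often)
```

In this sketch, the document matching both terms once outranks the one matching a single term ten times when lambda is 0.1, and the order flips when lambda is 0.7.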

Scripted similarity

A similarity that allows you to use a script to specify how scores should be computed. For instance, the example below shows how to reimplement TF-IDF:

  PUT /index
  {
    "settings": {
      "number_of_shards": 1,
      "similarity": {
        "scripted_tfidf": {
          "type": "scripted",
          "script": {
            "source": "double tf = Math.sqrt(doc.freq); double idf = Math.log((field.docCount+1.0)/(term.docFreq+1.0)) + 1.0; double norm = 1/Math.sqrt(doc.length); return query.boost * tf * idf * norm;"
          }
        }
      }
    },
    "mappings": {
      "properties": {
        "field": {
          "type": "text",
          "similarity": "scripted_tfidf"
        }
      }
    }
  }

  PUT /index/_doc/1
  {
    "field": "foo bar foo"
  }

  PUT /index/_doc/2
  {
    "field": "bar baz"
  }

  POST /index/_refresh

  GET /index/_search?explain=true
  {
    "query": {
      "query_string": {
        "query": "foo^1.7",
        "default_field": "field"
      }
    }
  }

Which yields:

  {
    "took": 12,
    "timed_out": false,
    "_shards": {
      "total": 1,
      "successful": 1,
      "skipped": 0,
      "failed": 0
    },
    "hits": {
      "total": {
        "value": 1,
        "relation": "eq"
      },
      "max_score": 1.9508477,
      "hits": [
        {
          "_shard": "[index][0]",
          "_node": "OzrdjxNtQGaqs4DmioFw9A",
          "_index": "index",
          "_type": "_doc",
          "_id": "1",
          "_score": 1.9508477,
          "_source": {
            "field": "foo bar foo"
          },
          "_explanation": {
            "value": 1.9508477,
            "description": "weight(field:foo in 0) [PerFieldSimilarity], result of:",
            "details": [
              {
                "value": 1.9508477,
                "description": "score from ScriptedSimilarity(weightScript=[null], script=[Script{type=inline, lang='painless', idOrCode='double tf = Math.sqrt(doc.freq); double idf = Math.log((field.docCount+1.0)/(term.docFreq+1.0)) + 1.0; double norm = 1/Math.sqrt(doc.length); return query.boost * tf * idf * norm;', options={}, params={}}]) computed from:",
                "details": [
                  {
                    "value": 1.0,
                    "description": "weight",
                    "details": []
                  },
                  {
                    "value": 1.7,
                    "description": "query.boost",
                    "details": []
                  },
                  {
                    "value": 2,
                    "description": "field.docCount",
                    "details": []
                  },
                  {
                    "value": 4,
                    "description": "field.sumDocFreq",
                    "details": []
                  },
                  {
                    "value": 5,
                    "description": "field.sumTotalTermFreq",
                    "details": []
                  },
                  {
                    "value": 1,
                    "description": "term.docFreq",
                    "details": []
                  },
                  {
                    "value": 2,
                    "description": "term.totalTermFreq",
                    "details": []
                  },
                  {
                    "value": 2.0,
                    "description": "doc.freq",
                    "details": []
                  },
                  {
                    "value": 3,
                    "description": "doc.length",
                    "details": []
                  }
                ]
              }
            ]
          }
        }
      ]
    }
  }
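
As a sanity check, the score in the response can be reproduced by hand from the statistics listed under _explanation. The Python sketch below mirrors the arithmetic of the Painless script; the function name scripted_tfidf is an assumption made for illustration and is not part of any Elasticsearch API.

```python
import math

def scripted_tfidf(boost, doc_count, doc_freq, freq, length):
    """Mirror of the Painless script's arithmetic (illustrative only)."""
    tf = math.sqrt(freq)
    idf = math.log((doc_count + 1.0) / (doc_freq + 1.0)) + 1.0
    norm = 1 / math.sqrt(length)
    return boost * tf * idf * norm

# Values taken from the explanation: query.boost, field.docCount,
# term.docFreq, doc.freq and doc.length.
score = scripted_tfidf(boost=1.7, doc_count=2, doc_freq=1, freq=2.0, length=3)
print(round(score, 4))  # prints 1.9508
```

The result agrees with the reported _score of 1.9508477 up to floating-point precision.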

While scripted similarities provide a lot of flexibility, there is a set of rules that they need to satisfy. Failing to do so could make Elasticsearch silently return wrong top hits or fail with internal errors at search time:

  • Returned scores must be positive.
  • All other variables remaining equal, scores must not decrease when doc.freq increases.
  • All other variables remaining equal, scores must not increase when doc.length increases.
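
Before deploying a scripted similarity, it can help to spot-check a candidate formula against these rules offline. The Python sketch below does this on a small grid of sample values; the helper find_rule_violations and the grid bounds are assumptions made for illustration, and passing the check on a grid does not prove that a formula satisfies the rules everywhere.

```python
import math

def tfidf_score(freq, length, weight=1.0):
    """Document-dependent part of the TF-IDF script: weight * sqrt(freq) / sqrt(length)."""
    return weight * math.sqrt(freq) / math.sqrt(length)

def find_rule_violations(score_fn, max_freq=50, max_length=200):
    """Check the three rules on a small grid; return messages (empty if none found)."""
    violations = []
    # Rules 1 and 2: no negative scores, and no decrease as freq grows.
    for length in range(1, max_length + 1):
        previous = None
        for freq in range(1, max_freq + 1):
            score = score_fn(freq, length)
            if score < 0:
                violations.append(f"negative score at freq={freq}, length={length}")
            if previous is not None and score < previous:
                violations.append(f"score decreases with freq at freq={freq}, length={length}")
            previous = score
    # Rule 3: no increase as length grows.
    for freq in range(1, max_freq + 1):
        previous = None
        for length in range(1, max_length + 1):
            score = score_fn(freq, length)
            if previous is not None and score > previous:
                violations.append(f"score increases with length at freq={freq}, length={length}")
            previous = score
    return violations

if __name__ == "__main__":
    print(find_rule_violations(tfidf_score))
    # A deliberately broken formula whose score grows with document length.
    bad = lambda freq, length: math.sqrt(freq) * math.log(length + 1.0)
    print(len(find_rule_violations(bad)) > 0)
```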

You might have noticed that a significant part of the above script depends on statistics that are the same for every document. It is possible to make the above slightly more efficient by providing a weight_script, which computes the document-independent part of the score and makes it available under the weight variable. When no weight_script is provided, weight is equal to 1. The weight_script has access to the same variables as the script except doc, since it is supposed to compute a document-independent contribution to the score.

The configuration below gives the same tf-idf scores but is slightly more efficient:

  PUT /index
  {
    "settings": {
      "number_of_shards": 1,
      "similarity": {
        "scripted_tfidf": {
          "type": "scripted",
          "weight_script": {
            "source": "double idf = Math.log((field.docCount+1.0)/(term.docFreq+1.0)) + 1.0; return query.boost * idf;"
          },
          "script": {
            "source": "double tf = Math.sqrt(doc.freq); double norm = 1/Math.sqrt(doc.length); return weight * tf * norm;"
          }
        }
      }
    },
    "mappings": {
      "properties": {
        "field": {
          "type": "text",
          "similarity": "scripted_tfidf"
        }
      }
    }
  }

Type name: scripted
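
The split between weight_script and script can be mirrored in plain Python to show why it yields the same scores: the document-independent factor is computed once and then reused for each document. The function names and the sample (freq, length) pairs below are assumptions made for illustration.

```python
import math

def weight_script(boost, doc_count, doc_freq):
    """Document-independent part: query.boost * idf."""
    idf = math.log((doc_count + 1.0) / (doc_freq + 1.0)) + 1.0
    return boost * idf

def script(weight, freq, length):
    """Document-dependent part: weight * tf * norm."""
    tf = math.sqrt(freq)
    norm = 1 / math.sqrt(length)
    return weight * tf * norm

def single_script(boost, doc_count, doc_freq, freq, length):
    """The original one-script formulation, for comparison."""
    tf = math.sqrt(freq)
    idf = math.log((doc_count + 1.0) / (doc_freq + 1.0)) + 1.0
    norm = 1 / math.sqrt(length)
    return boost * tf * idf * norm

# Computed once per term and query, not once per document.
weight = weight_script(boost=1.7, doc_count=2, doc_freq=1)

# Hypothetical (doc.freq, doc.length) pairs for a few matching documents.
docs = [(2.0, 3), (1.0, 2), (5.0, 40)]
split_scores = [script(weight, freq, length) for freq, length in docs]
single_scores = [single_script(1.7, 2, 1, freq, length) for freq, length in docs]
print(all(math.isclose(a, b) for a, b in zip(split_scores, single_scores)))
```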

Default Similarity

By default, Elasticsearch will use whatever similarity is configured as default.

You can change the default similarity for all fields in an index when it is created:

  PUT /index
  {
    "settings": {
      "index": {
        "similarity": {
          "default": {
            "type": "boolean"
          }
        }
      }
    }
  }

If you want to change the default similarity after creating the index, you must close the index, send the following request, and then open it again:

  POST /index/_close?wait_for_active_shards=0

  PUT /index/_settings
  {
    "index": {
      "similarity": {
        "default": {
          "type": "boolean"
        }
      }
    }
  }

  POST /index/_open