k-NN search with filters

To refine k-NN results, you can filter a k-NN search using one of the following methods:

  • Scoring script filter: This approach involves pre-filtering a document set and then running an exact k-NN search on the filtered subset. Because the exact search scores every document in the filtered subset, it does not scale well when the filtered subset is large.

  • Boolean filter: This approach runs an approximate nearest neighbor (ANN) search and then applies a filter to the results. Because of post-filtering, it may return significantly fewer than k results for a restrictive filter.

  • Lucene k-NN filter: This approach applies filtering during the k-NN search, as opposed to before or after the k-NN search, which ensures that k results are returned (if at least k documents match the filter). You can only use this method with the Hierarchical Navigable Small World (HNSW) algorithm implemented by the Lucene search engine in k-NN plugin versions 2.4 and later.

Filtered search optimization

Depending on your dataset and use case, you might be more interested in maximizing recall or minimizing latency. The following table provides guidance on various k-NN search configurations and the filtering methods used to optimize for higher recall or lower latency. The first three columns of the table provide several example k-NN search configurations. A search configuration consists of:

  • The number of documents in an index, where one OpenSearch document corresponds to one k-NN vector.
  • The percentage of documents left in the results after filtering. This value depends on the restrictiveness of the filter that you provide in the query. The most restrictive filter in the table returns 2.5% of documents in the index, while the least restrictive filter returns 80% of documents.
  • The desired number of returned results (k).

Once you’ve estimated the number of documents in your index, the restrictiveness of your filter, and the desired number of nearest neighbors, use the following table to choose a filtering method that optimizes for recall or latency.

Number of documents in an index | Percentage of documents the filter returns | k | Filtering method to use for higher recall | Filtering method to use for lower latency
10M | 2.5 | 100 | Scoring script | Scoring script
10M | 38 | 100 | Lucene filter | Boolean filter
10M | 80 | 100 | Scoring script | Lucene filter
1M | 2.5 | 100 | Lucene filter | Scoring script
1M | 38 | 100 | Lucene filter | Lucene filter/scoring script
1M | 80 | 100 | Boolean filter | Lucene filter
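
As a rough illustration, the table can be encoded as a small lookup. The following Python sketch is not part of the k-NN plugin; the names `RECOMMENDATIONS` and `recommend_filter` are hypothetical, and the lookup only covers the exact configurations listed in the table (it makes no attempt to interpolate between rows).

```python
# Recommendations transcribed from the table above, keyed by
# (documents in index, percentage of documents the filter returns).
# k is 100 in every row of the table.
RECOMMENDATIONS = {
    (10_000_000, 2.5): {"recall": "Scoring script", "latency": "Scoring script"},
    (10_000_000, 38): {"recall": "Lucene filter", "latency": "Boolean filter"},
    (10_000_000, 80): {"recall": "Scoring script", "latency": "Lucene filter"},
    (1_000_000, 2.5): {"recall": "Lucene filter", "latency": "Scoring script"},
    (1_000_000, 38): {"recall": "Lucene filter", "latency": "Lucene filter/scoring script"},
    (1_000_000, 80): {"recall": "Boolean filter", "latency": "Lucene filter"},
}


def recommend_filter(num_docs, filter_pct, optimize_for):
    """Return the table's recommendation, or None if the table has no such row.

    optimize_for is either "recall" or "latency".
    """
    row = RECOMMENDATIONS.get((num_docs, filter_pct))
    return None if row is None else row[optimize_for]


print(recommend_filter(10_000_000, 38, "latency"))  # Boolean filter
print(recommend_filter(1_000_000, 80, "recall"))    # Boolean filter
```

For configurations that fall between the rows, the table is only a starting point, so measuring recall and latency on your own data is still necessary.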

Scoring script filter

A scoring script filter first filters the documents and then uses a brute-force exact k-NN search on the results. For example, the following query first selects hotels that have a rating between 8 and 10, inclusive, and provide parking, and then performs an exact k-NN search on that subset to return the three hotels that are closest to the specified location:

  POST /hotels-index/_search
  {
    "size": 3,
    "query": {
      "script_score": {
        "query": {
          "bool": {
            "filter": {
              "bool": {
                "must": [
                  {
                    "range": {
                      "rating": {
                        "gte": 8,
                        "lte": 10
                      }
                    }
                  },
                  {
                    "term": {
                      "parking": "true"
                    }
                  }
                ]
              }
            }
          }
        },
        "script": {
          "source": "knn_score",
          "lang": "knn",
          "params": {
            "field": "location",
            "query_value": [
              5.0,
              4.0
            ],
            "space_type": "l2"
          }
        }
      }
    }
  }
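
To make the two phases concrete, the following Python sketch imitates the query on the 12 sample hotels that are indexed in Step 2 later on this page. It is a simplified, client-side model rather than the plugin's implementation, and the function names are hypothetical. It assumes that the l2 space type scores a document as 1 / (1 + squared Euclidean distance), which reproduces the scores shown in the responses on this page.

```python
# (id, location, parking, rating) for the 12 sample hotels from Step 2.
HOTELS = [
    ("1", (5.2, 4.4), "true", 5),
    ("2", (5.2, 3.9), "false", 4),
    ("3", (4.9, 3.4), "true", 9),
    ("4", (4.2, 4.6), "false", 6),
    ("5", (3.3, 4.5), "true", 8),
    ("6", (6.4, 3.4), "true", 9),
    ("7", (4.2, 6.2), "true", 5),
    ("8", (2.4, 4.0), "true", 8),
    ("9", (1.4, 3.2), "false", 5),
    ("10", (7.0, 9.9), "true", 9),
    ("11", (3.0, 2.3), "false", 6),
    ("12", (5.0, 1.0), "true", 3),
]


def l2_score(a, b):
    """Assumed l2 score: 1 / (1 + squared Euclidean distance)."""
    squared = sum((x - y) ** 2 for x, y in zip(a, b))
    return 1.0 / (1.0 + squared)


def scoring_script_search(hotels, query, size):
    # Phase 1: pre-filter (rating between 8 and 10, inclusive, with parking).
    candidates = [h for h in hotels if 8 <= h[3] <= 10 and h[2] == "true"]
    # Phase 2: brute-force scoring of every document that passed the filter.
    scored = [(h[0], l2_score(h[1], query)) for h in candidates]
    return sorted(scored, key=lambda s: s[1], reverse=True)[:size]


results = scoring_script_search(HOTELS, (5.0, 4.0), 3)
for hotel_id, score in results:
    print(hotel_id, round(score, 4))
# 3 0.7299
# 6 0.3012
# 5 0.2415
```

Phase 2 touches every filtered document, which is why the cost of this method grows with the size of the filtered subset.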

Boolean filter

A Boolean filter consists of a Boolean query that contains a k-NN query and a filter. For example, the following query searches for hotels that are closest to the specified location and then filters the results to return hotels with a rating between 8 and 10, inclusive, that provide parking:

  POST /hotels-index/_search
  {
    "size": 3,
    "query": {
      "bool": {
        "filter": {
          "bool": {
            "must": [
              {
                "range": {
                  "rating": {
                    "gte": 8,
                    "lte": 10
                  }
                }
              },
              {
                "term": {
                  "parking": "true"
                }
              }
            ]
          }
        },
        "must": [
          {
            "knn": {
              "location": {
                "vector": [
                  5,
                  4
                ],
                "k": 20
              }
            }
          }
        ]
      }
    }
  }

The response includes documents containing the matching hotels:

  {
    "took" : 95,
    "timed_out" : false,
    "_shards" : {
      "total" : 1,
      "successful" : 1,
      "skipped" : 0,
      "failed" : 0
    },
    "hits" : {
      "total" : {
        "value" : 5,
        "relation" : "eq"
      },
      "max_score" : 0.72992706,
      "hits" : [
        {
          "_index" : "hotels-index",
          "_id" : "3",
          "_score" : 0.72992706,
          "_source" : {
            "location" : [
              4.9,
              3.4
            ],
            "parking" : "true",
            "rating" : 9
          }
        },
        {
          "_index" : "hotels-index",
          "_id" : "6",
          "_score" : 0.3012048,
          "_source" : {
            "location" : [
              6.4,
              3.4
            ],
            "parking" : "true",
            "rating" : 9
          }
        },
        {
          "_index" : "hotels-index",
          "_id" : "5",
          "_score" : 0.24154587,
          "_source" : {
            "location" : [
              3.3,
              4.5
            ],
            "parking" : "true",
            "rating" : 8
          }
        }
      ]
    }
  }
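
The Boolean filter query requests k = 20 neighbors even though only three hits are returned. The following Python sketch suggests why a larger k helps with post-filtering. It is a simplified model on the 12 sample hotels from Step 2: the function names are hypothetical, and it uses an exact nearest neighbor ranking in place of a real ANN index.

```python
# (id, location, parking, rating) for the 12 sample hotels from Step 2.
HOTELS = [
    ("1", (5.2, 4.4), "true", 5),
    ("2", (5.2, 3.9), "false", 4),
    ("3", (4.9, 3.4), "true", 9),
    ("4", (4.2, 4.6), "false", 6),
    ("5", (3.3, 4.5), "true", 8),
    ("6", (6.4, 3.4), "true", 9),
    ("7", (4.2, 6.2), "true", 5),
    ("8", (2.4, 4.0), "true", 8),
    ("9", (1.4, 3.2), "false", 5),
    ("10", (7.0, 9.9), "true", 9),
    ("11", (3.0, 2.3), "false", 6),
    ("12", (5.0, 1.0), "true", 3),
]


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def matches_filter(hotel):
    # Rating between 8 and 10, inclusive, and parking available.
    return 8 <= hotel[3] <= 10 and hotel[2] == "true"


def post_filter_search(hotels, query, k, size):
    # Step 1: take the k nearest neighbors, ignoring the filter.
    nearest = sorted(hotels, key=lambda h: squared_l2(h[1], query))[:k]
    # Step 2: apply the filter to those k candidates only.
    kept = [h[0] for h in nearest if matches_filter(h)]
    return kept[:size]


print(post_filter_search(HOTELS, (5.0, 4.0), k=3, size=3))   # ['3']
print(post_filter_search(HOTELS, (5.0, 4.0), k=20, size=3))  # ['3', '6', '5']
```

With k = 3, two of the three nearest hotels fail the filter, so only one result survives. With k = 20, every hotel is a candidate, all five matching hotels pass the filter (consistent with the total of 5 in the response above), and the size parameter trims the result to three.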

Lucene k-NN filter implementation

k-NN plugin version 2.2 introduced support for running k-NN searches with the Lucene engine using HNSW graphs. Starting with version 2.4, which is based on Lucene version 9.4, you can use Lucene filters for k-NN searches.

When you specify a Lucene filter for a k-NN search, the Lucene algorithm decides whether to perform an exact k-NN search with pre-filtering or an approximate search with modified post-filtering. The algorithm uses the following variables:

  • N: The number of documents in the index.
  • P: The number of documents in the document subset after the filter is applied (P <= N).
  • k: The maximum number of vectors to return in the response.

The following flow chart outlines the Lucene algorithm.

[Figure: Lucene algorithm for filtering]

For more information about the Lucene filtering implementation and the underlying KnnVectorQuery, see the Apache Lucene documentation.

Using a Lucene k-NN filter

Consider a dataset that includes 12 documents containing hotel information. The following image shows all hotels on an xy coordinate plane by location. Additionally, the points for hotels that have a rating between 8 and 10, inclusive, are depicted with orange dots, and hotels that provide parking are depicted with green circles. The search point is colored in red:

[Figure: Graph of documents with filter criteria]

In this example, you will create an index and search for the three hotels with high ratings and parking that are the closest to the search location.

Step 1: Create a new index

Before you can run a k-NN search with a filter, you need to create an index with a knn_vector field. For this field, you need to specify lucene as the engine and hnsw as the method in the mapping.

The following request creates a new index called hotels-index with a knn_vector field called location:

  PUT /hotels-index
  {
    "settings": {
      "index": {
        "knn": true,
        "knn.algo_param.ef_search": 100,
        "number_of_shards": 1,
        "number_of_replicas": 0
      }
    },
    "mappings": {
      "properties": {
        "location": {
          "type": "knn_vector",
          "dimension": 2,
          "method": {
            "name": "hnsw",
            "space_type": "l2",
            "engine": "lucene",
            "parameters": {
              "ef_construction": 100,
              "m": 16
            }
          }
        }
      }
    }
  }

Step 2: Add data to your index

Next, add data to your index.

The following request adds 12 documents that contain hotel location, rating, and parking information:

  POST /_bulk
  { "index": { "_index": "hotels-index", "_id": "1" } }
  { "location": [5.2, 4.4], "parking" : "true", "rating" : 5 }
  { "index": { "_index": "hotels-index", "_id": "2" } }
  { "location": [5.2, 3.9], "parking" : "false", "rating" : 4 }
  { "index": { "_index": "hotels-index", "_id": "3" } }
  { "location": [4.9, 3.4], "parking" : "true", "rating" : 9 }
  { "index": { "_index": "hotels-index", "_id": "4" } }
  { "location": [4.2, 4.6], "parking" : "false", "rating" : 6 }
  { "index": { "_index": "hotels-index", "_id": "5" } }
  { "location": [3.3, 4.5], "parking" : "true", "rating" : 8 }
  { "index": { "_index": "hotels-index", "_id": "6" } }
  { "location": [6.4, 3.4], "parking" : "true", "rating" : 9 }
  { "index": { "_index": "hotels-index", "_id": "7" } }
  { "location": [4.2, 6.2], "parking" : "true", "rating" : 5 }
  { "index": { "_index": "hotels-index", "_id": "8" } }
  { "location": [2.4, 4.0], "parking" : "true", "rating" : 8 }
  { "index": { "_index": "hotels-index", "_id": "9" } }
  { "location": [1.4, 3.2], "parking" : "false", "rating" : 5 }
  { "index": { "_index": "hotels-index", "_id": "10" } }
  { "location": [7.0, 9.9], "parking" : "true", "rating" : 9 }
  { "index": { "_index": "hotels-index", "_id": "11" } }
  { "location": [3.0, 2.3], "parking" : "false", "rating" : 6 }
  { "index": { "_index": "hotels-index", "_id": "12" } }
  { "location": [5.0, 1.0], "parking" : "true", "rating" : 3 }
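
If you generate the bulk payload programmatically, each document becomes two lines: an action line and a source line. The following Python sketch shows one way to build such a payload using only the standard library. The helper name `build_bulk_body` is hypothetical, only two of the 12 hotels are shown for brevity, and sending the payload to a cluster is left out.

```python
import json


def build_bulk_body(index_name, docs):
    """Serialize (id, document) pairs as a newline-delimited bulk payload."""
    lines = []
    for doc_id, doc in docs:
        # Action line: tells the Bulk API where to index the next line.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        # Source line: the document itself, on a single line.
        lines.append(json.dumps(doc))
    # The Bulk API requires the payload to end with a newline character.
    return "\n".join(lines) + "\n"


hotels = [
    ("1", {"location": [5.2, 4.4], "parking": "true", "rating": 5}),
    ("2", {"location": [5.2, 3.9], "parking": "false", "rating": 4}),
]
body = build_bulk_body("hotels-index", hotels)
print(body, end="")
```

The resulting string can be sent as the body of the POST /_bulk request shown above.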

Step 3: Search your data with a filter

Now you can create a k-NN search with filters. In the k-NN query clause, include the point of interest that is used to search for nearest neighbors, the number of nearest neighbors to return (k), and a filter with the restriction criteria. Depending on how restrictive you want your filter to be, you can add multiple query clauses to a single request.

The following request creates a k-NN query that searches for the top three hotels near the location with the coordinates [5, 4] that are rated between 8 and 10, inclusive, and provide parking:

  POST /hotels-index/_search
  {
    "size": 3,
    "query": {
      "knn": {
        "location": {
          "vector": [
            5,
            4
          ],
          "k": 3,
          "filter": {
            "bool": {
              "must": [
                {
                  "range": {
                    "rating": {
                      "gte": 8,
                      "lte": 10
                    }
                  }
                },
                {
                  "term": {
                    "parking": "true"
                  }
                }
              ]
            }
          }
        }
      }
    }
  }

The response returns the three hotels that are nearest to the search point and have met the filter criteria:

  {
    "took" : 47,
    "timed_out" : false,
    "_shards" : {
      "total" : 1,
      "successful" : 1,
      "skipped" : 0,
      "failed" : 0
    },
    "hits" : {
      "total" : {
        "value" : 3,
        "relation" : "eq"
      },
      "max_score" : 0.72992706,
      "hits" : [
        {
          "_index" : "hotels-index",
          "_id" : "3",
          "_score" : 0.72992706,
          "_source" : {
            "location" : [
              4.9,
              3.4
            ],
            "parking" : "true",
            "rating" : 9
          }
        },
        {
          "_index" : "hotels-index",
          "_id" : "6",
          "_score" : 0.3012048,
          "_source" : {
            "location" : [
              6.4,
              3.4
            ],
            "parking" : "true",
            "rating" : 9
          }
        },
        {
          "_index" : "hotels-index",
          "_id" : "5",
          "_score" : 0.24154587,
          "_source" : {
            "location" : [
              3.3,
              4.5
            ],
            "parking" : "true",
            "rating" : 8
          }
        }
      ]
    }
  }

Note that there are multiple ways to construct a filter that returns hotels that provide parking, for example:

  • A term query clause in the should clause
  • A wildcard query clause in the should clause
  • A regexp query clause in the should clause
  • A must_not clause that eliminates hotels with parking set to false

The following request combines these four approaches in a single filter, together with a range clause that limits results to hotels rated between 1 and 6, inclusive:

  POST /hotels-index/_search
  {
    "size": 3,
    "query": {
      "knn": {
        "location": {
          "vector": [ 5.0, 4.0 ],
          "k": 3,
          "filter": {
            "bool": {
              "must": {
                "range": {
                  "rating": {
                    "gte": 1,
                    "lte": 6
                  }
                }
              },
              "should": [
                {
                  "term": {
                    "parking": "true"
                  }
                },
                {
                  "wildcard": {
                    "parking": {
                      "value": "t*e"
                    }
                  }
                },
                {
                  "regexp": {
                    "parking": "[a-zA-Z]rue"
                  }
                }
              ],
              "must_not": [
                {
                  "term": {
                    "parking": "false"
                  }
                }
              ],
              "minimum_should_match": 1
            }
          }
        }
      }
    }
  }
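
To see which sample hotels this filter lets through, the following Python sketch evaluates the same clauses on the client side. It is an approximation for illustration only: the function names are hypothetical, and Python's fnmatch and re modules stand in for the wildcard and regexp queries (a regexp query has to match the whole value, so the sketch uses re.fullmatch).

```python
import fnmatch
import re

# (id, parking, rating) for the 12 sample hotels from Step 2.
HOTELS = [
    ("1", "true", 5), ("2", "false", 4), ("3", "true", 9),
    ("4", "false", 6), ("5", "true", 8), ("6", "true", 9),
    ("7", "true", 5), ("8", "true", 8), ("9", "false", 5),
    ("10", "true", 9), ("11", "false", 6), ("12", "true", 3),
]


def parking_clauses(parking):
    """Evaluate the three should clauses against a parking value."""
    return [
        parking == "true",                                  # term
        fnmatch.fnmatchcase(parking, "t*e"),                # wildcard
        re.fullmatch(r"[a-zA-Z]rue", parking) is not None,  # regexp
    ]


def passes_filter(hotel, minimum_should_match=1):
    _hotel_id, parking, rating = hotel
    must = 1 <= rating <= 6                       # range clause
    should_hits = sum(parking_clauses(parking))   # number of matching should clauses
    must_not = parking == "false"                 # excluded by the must_not clause
    return must and should_hits >= minimum_should_match and not must_not


matching = [h[0] for h in HOTELS if passes_filter(h)]
print(matching)  # ['1', '7', '12']
```

On this dataset, all three should clauses agree for every hotel, and the must_not clause excludes exactly the hotels that the should clauses already reject, so any one of the four approaches alone would select the same documents.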
