k-NN vector field type

The k-NN plugin introduces a custom data type, knn_vector, that lets you ingest k-NN vectors into an OpenSearch index and run different kinds of k-NN search. The knn_vector field is highly configurable and can serve many different k-NN workloads. In general, you can build a knn_vector field either by providing a method definition or by specifying a model ID.

Example

For example, to map my_vector as a knn_vector, use the following request:

  PUT test-index
  {
    "settings": {
      "index": {
        "knn": true,
        "knn.algo_param.ef_search": 100
      }
    },
    "mappings": {
      "properties": {
        "my_vector": {
          "type": "knn_vector",
          "dimension": 3,
          "method": {
            "name": "hnsw",
            "space_type": "l2",
            "engine": "lucene",
            "parameters": {
              "ef_construction": 128,
              "m": 24
            }
          }
        }
      }
    }
  }


Method definitions

Method definitions are used when the underlying approximate k-NN algorithm does not require training. For example, the following knn_vector field specifies that nmslib’s implementation of hnsw should be used for approximate k-NN search. During indexing, nmslib will build the corresponding hnsw segment files.

  "my_vector": {
    "type": "knn_vector",
    "dimension": 4,
    "method": {
      "name": "hnsw",
      "space_type": "l2",
      "engine": "nmslib",
      "parameters": {
        "ef_construction": 128,
        "m": 24
      }
    }
  }

Model IDs

Model IDs are used when the underlying approximate k-NN algorithm requires a training step. As a prerequisite, you must create the model using the Train API. The model contains the information needed to initialize the native library segment files.

  "my_vector": {
    "type": "knn_vector",
    "model_id": "my-model"
  }

However, if you intend to use Painless scripting or a k-NN score script, you only need to pass the dimension.

  "my_vector": {
    "type": "knn_vector",
    "dimension": 128
  }
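
The three ways of defining a knn_vector field described above (method definition, model ID, or dimension only) can be sketched as a small Python helper that builds the mapping body as a plain dictionary. The function name knn_vector_mapping and its arguments are hypothetical and are not part of any OpenSearch client; the sketch only assembles the JSON shown in this section.

```python
def knn_vector_mapping(dimension=None, method=None, model_id=None):
    """Build a knn_vector field mapping as a dict (hypothetical helper)."""
    if model_id is not None:
        # Model-based field: the trained model supplies the configuration.
        return {"type": "knn_vector", "model_id": model_id}
    if dimension is None:
        raise ValueError("dimension is required when no model_id is given")
    mapping = {"type": "knn_vector", "dimension": dimension}
    if method is not None:
        # Method-based field: algorithm, engine, and parameters are explicit.
        mapping["method"] = method
    return mapping


hnsw_method = {
    "name": "hnsw",
    "space_type": "l2",
    "engine": "nmslib",
    "parameters": {"ef_construction": 128, "m": 24},
}

by_method = knn_vector_mapping(dimension=4, method=hnsw_method)
by_model = knn_vector_mapping(model_id="my-model")
script_only = knn_vector_mapping(dimension=128)
```

Each returned dictionary can be placed under "properties" in a create-index request body.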

Lucene byte vector

By default, k-NN vectors are float vectors, where each dimension is 4 bytes. If you want to save storage space, you can use byte vectors with the lucene engine. In a byte vector, each dimension is a signed 8-bit integer in the [-128, 127] range.

Byte vectors are supported only for the lucene engine. They are not supported for the nmslib and faiss engines.

In k-NN benchmarking tests, using byte rather than float vectors significantly reduced storage and memory usage, improved indexing throughput, and reduced query latency. Recall was not greatly affected (note that recall can depend on various factors, such as the quantization technique and data distribution).

When using byte vectors, expect some loss of recall compared to float vectors. Byte vectors are useful in large-scale applications and use cases that prioritize a reduced memory footprint in exchange for a minimal loss of recall.

Introduced in k-NN plugin version 2.9, the optional data_type parameter defines the data type of a vector. The default value of this parameter is float.

To use a byte vector, set the data_type parameter to byte when creating mappings for an index:

  PUT test-index
  {
    "settings": {
      "index": {
        "knn": true,
        "knn.algo_param.ef_search": 100
      }
    },
    "mappings": {
      "properties": {
        "my_vector": {
          "type": "knn_vector",
          "dimension": 3,
          "data_type": "byte",
          "method": {
            "name": "hnsw",
            "space_type": "l2",
            "engine": "lucene",
            "parameters": {
              "ef_construction": 128,
              "m": 24
            }
          }
        }
      }
    }
  }


Then ingest documents as usual. Make sure each dimension in the vector is in the supported [-128, 127] range:

  PUT test-index/_doc/1
  {
    "my_vector": [-126, 28, 127]
  }


  PUT test-index/_doc/2
  {
    "my_vector": [100, -128, 0]
  }
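
Before ingesting, it can help to check client-side that every value fits the byte range. The following is a minimal sketch of such a check; the function validate_byte_vector is a hypothetical helper, not part of the k-NN plugin or any client library.

```python
def validate_byte_vector(vector, dimension):
    """Return the vector unchanged if it is a valid byte vector, else raise."""
    if len(vector) != dimension:
        raise ValueError(f"expected {dimension} dimensions, got {len(vector)}")
    for i, value in enumerate(vector):
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"dimension {i} is not an integer: {value!r}")
        if not -128 <= value <= 127:
            raise ValueError(f"dimension {i} is outside [-128, 127]: {value}")
    return vector


doc_1 = {"my_vector": validate_byte_vector([-126, 28, 127], dimension=3)}
doc_2 = {"my_vector": validate_byte_vector([100, -128, 0], dimension=3)}
```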


When querying, be sure to use a byte vector:

  GET test-index/_search
  {
    "size": 2,
    "query": {
      "knn": {
        "my_vector": {
          "vector": [26, -120, 99],
          "k": 2
        }
      }
    }
  }


Quantization techniques

If your vectors are of the type float, you need to first convert them to the byte type before ingesting the documents. This conversion is accomplished by quantizing the dataset—reducing the precision of its vectors. There are many quantization techniques, such as scalar quantization or product quantization (PQ), which is used in the Faiss engine. The choice of quantization technique depends on the type of data you’re using and can affect the accuracy of recall values. The following sections describe the scalar quantization algorithms that were used to quantize the k-NN benchmarking test data for the L2 and cosine similarity space types. The provided pseudocode is for illustration purposes only.

Scalar quantization for the L2 space type

The following example pseudocode illustrates the scalar quantization technique used for the benchmarking tests on Euclidean datasets with the L2 space type. Euclidean distance is shift invariant. If you shift both \(x\) and \(y\) by the same \(z\), then the distance remains the same (\(\lVert x-y\rVert =\lVert (x-z)-(y-z)\rVert\)).

  import numpy as np

  # Random dataset (example values only)
  dataset = np.random.uniform(-300, 300, (100, 10))
  # Random query set (example values only)
  queryset = np.random.uniform(-350, 350, (100, 10))
  # Number of values
  B = 256

  # INDEXING:
  # Get min and max
  dataset_min = np.min(dataset)
  dataset_max = np.max(dataset)
  # Shift coordinates to be non-negative
  dataset -= dataset_min
  # Normalize into [0, 1]
  dataset *= 1. / (dataset_max - dataset_min)
  # Bucket into 256 values in the [-128, 127] range
  dataset = np.floor(dataset * (B - 1)) - int(B / 2)

  # QUERYING:
  # Clip (if the queryset range is outside the dataset range)
  queryset = queryset.clip(dataset_min, dataset_max)
  # Shift coordinates to be non-negative
  queryset -= dataset_min
  # Normalize into [0, 1]
  queryset *= 1. / (dataset_max - dataset_min)
  # Bucket into 256 values in the [-128, 127] range
  queryset = np.floor(queryset * (B - 1)) - int(B / 2)
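
The shift invariance that this technique relies on can be checked numerically. The following sketch uses arbitrary random vectors (the seed and value ranges are illustrative assumptions) and also shows that multiplying both vectors by the same positive factor scales the distance by that factor, so normalization does not change which neighbors are closest.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-300, 300, 10)
y = rng.uniform(-300, 300, 10)
z = rng.uniform(-300, 300, 10)

# Shifting both vectors by the same z leaves the L2 distance unchanged.
original = np.linalg.norm(x - y)
shifted = np.linalg.norm((x - z) - (y - z))

# Scaling both vectors by s > 0 multiplies the distance by s,
# so the relative ordering of neighbors is preserved.
s = 1.0 / 600.0
scaled = np.linalg.norm(s * x - s * y)

print(np.isclose(original, shifted), np.isclose(scaled, s * original))
```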


Scalar quantization for the cosine similarity space type

The following example pseudocode illustrates the scalar quantization technique used for the benchmarking tests on angular datasets with the cosine similarity space type. Cosine similarity is not shift invariant (\(\cos(x, y) \neq \cos(x-z, y-z)\)).

The following pseudocode is for positive numbers:

  import numpy as np

  # For positive numbers
  # INDEXING and QUERYING:
  def quantize_positive(val, dataset):
      # Get max of the training dataset
      max_val = np.max(dataset)
      min_val = 0
      B = 127
      # Normalize into [0, 1]
      val = (val - min_val) / (max_val - min_val)
      val = val * B
      # Get int and fraction values
      int_part = np.floor(val)
      frac_part = val - int_part
      # Round to the nearest integer
      if 0.5 < frac_part:
          bval = int_part + 1
      else:
          bval = int_part
      return np.int8(bval)


The following pseudocode is for negative numbers:

  import numpy as np

  # For negative numbers
  # INDEXING and QUERYING:
  def quantize_negative(val, dataset):
      # Get min of the training dataset
      min_val = 0
      max_val = -np.min(dataset)
      B = 128
      # Normalize into [-1, 0] (val is negative, max_val is positive)
      val = (val - min_val) / (max_val - min_val)
      val = val * B
      # Get int and fraction values
      int_part = np.floor(val)
      frac_part = val - int_part
      # Round to the nearest integer
      if 0.5 < frac_part:
          bval = int_part + 1
      else:
          bval = int_part
      return np.int8(bval)
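
To see why the cosine pseudocode only rescales values and keeps min at 0 rather than shifting them, consider the following small check. It is an illustrative sketch with hand-picked vectors, not benchmark data: shifting two orthogonal vectors changes their cosine similarity, while scaling both by the same positive factor does not.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


x = np.array([1.0, 0.0])
y = np.array([0.0, 1.0])
z = np.array([-1.0, -1.0])

before = cosine_similarity(x, y)                 # orthogonal vectors
after_shift = cosine_similarity(x - z, y - z)    # [2, 1] and [1, 2]
after_scale = cosine_similarity(0.5 * x, 0.5 * y)

print(before, after_shift, after_scale)
```

Here the shift moves the similarity from 0 to about 0.8, which is why a shift-based technique such as the one used for L2 would distort angular data.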


Binary k-NN vectors

You can reduce memory costs by a factor of 32 by switching from float to binary vectors. Using binary vector indexes can lower operational costs while maintaining high recall performance, making large-scale deployment more economical and efficient.
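
The factor of 32 follows from the per-dimension size: a float dimension takes 32 bits, while a binary dimension takes 1 bit. The following sketch works through the arithmetic for raw vector data only; it ignores graph and index overhead, and the vector count and dimension are illustrative assumptions.

```python
BITS_PER_DIMENSION = {"float": 32, "byte": 8, "binary": 1}


def raw_vector_bytes(num_vectors, dimension, data_type):
    """Bytes needed for the raw vector values only (no index overhead)."""
    total_bits = num_vectors * dimension * BITS_PER_DIMENSION[data_type]
    # Round up to whole bytes; binary dimensions are a multiple of 8 anyway.
    return -(-total_bits // 8)


float_bytes = raw_vector_bytes(1_000_000, 768, "float")
binary_bytes = raw_vector_bytes(1_000_000, 768, "binary")
print(float_bytes, binary_bytes, float_bytes // binary_bytes)
```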

Binary format is available for the following k-NN search types:

  • Approximate k-NN: Supports binary vectors only for the Faiss engine with the HNSW and IVF algorithms.
  • Script score k-NN: Enables the use of binary vectors in script scoring.
  • Painless extensions: Allows the use of binary vectors with Painless scripting extensions.

Requirements

There are several requirements for using binary vectors in the OpenSearch k-NN plugin:

  • The data_type of the binary vector index must be binary.
  • The space_type of the binary vector index must be hamming.
  • The dimension of the binary vector index must be a multiple of 8.
  • You must convert your binary data into 8-bit signed integers (int8) in the [-128, 127] range. For example, the binary sequence of 8 bits 0, 1, 1, 0, 0, 0, 1, 1 must be converted into its equivalent byte value of 99 to be used as a binary vector input.
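
The conversion in the last requirement can be sketched with NumPy: np.packbits packs each group of 8 bits (most significant bit first) into one unsigned byte, and viewing the result as int8 yields values in the [-128, 127] range. The helper names bits_to_int8 and hamming_distance are hypothetical; the second function computes the standard Hamming distance (the number of differing bits) and is not a statement about how the plugin derives scores.

```python
import numpy as np


def bits_to_int8(bits):
    """Pack a sequence of 0/1 bits into a list of signed 8-bit integers."""
    bits = np.asarray(bits, dtype=np.uint8)
    if bits.size % 8 != 0:
        raise ValueError("the number of bits must be a multiple of 8")
    return np.packbits(bits).view(np.int8).tolist()


def hamming_distance(a, b):
    """Count differing bits between two int8-encoded binary vectors."""
    a = np.asarray(a, dtype=np.int8).view(np.uint8)
    b = np.asarray(b, dtype=np.int8).view(np.uint8)
    return int(np.unpackbits(a ^ b).sum())


print(bits_to_int8([0, 1, 1, 0, 0, 0, 1, 1]))  # [99]
print(bits_to_int8([1, 1, 1, 1, 1, 1, 1, 1]))  # [-1]
print(hamming_distance([9], [10]))             # 2
```

Note that a leading 1 bit produces a negative value, which is why the input range is signed.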

Example: HNSW

To create a binary vector index with the Faiss engine and HNSW algorithm, send the following request:

  PUT /test-binary-hnsw
  {
    "settings": {
      "index": {
        "knn": true
      }
    },
    "mappings": {
      "properties": {
        "my_vector": {
          "type": "knn_vector",
          "dimension": 8,
          "data_type": "binary",
          "method": {
            "name": "hnsw",
            "space_type": "hamming",
            "engine": "faiss",
            "parameters": {
              "ef_construction": 128,
              "m": 24
            }
          }
        }
      }
    }
  }


Then ingest some documents containing binary vectors:

  PUT _bulk
  {"index": {"_index": "test-binary-hnsw", "_id": "1"}}
  {"my_vector": [7], "price": 4.4}
  {"index": {"_index": "test-binary-hnsw", "_id": "2"}}
  {"my_vector": [10], "price": 14.2}
  {"index": {"_index": "test-binary-hnsw", "_id": "3"}}
  {"my_vector": [15], "price": 19.1}
  {"index": {"_index": "test-binary-hnsw", "_id": "4"}}
  {"my_vector": [99], "price": 1.2}
  {"index": {"_index": "test-binary-hnsw", "_id": "5"}}
  {"my_vector": [80], "price": 16.5}


When querying, be sure to use a binary vector:

  GET /test-binary-hnsw/_search
  {
    "size": 2,
    "query": {
      "knn": {
        "my_vector": {
          "vector": [9],
          "k": 2
        }
      }
    }
  }


The response contains the two vectors closest to the query vector:

Response

  {
    "took": 8,
    "timed_out": false,
    "_shards": {
      "total": 1,
      "successful": 1,
      "skipped": 0,
      "failed": 0
    },
    "hits": {
      "total": {
        "value": 2,
        "relation": "eq"
      },
      "max_score": 0.5,
      "hits": [
        {
          "_index": "test-binary-hnsw",
          "_id": "2",
          "_score": 0.5,
          "_source": {
            "my_vector": [
              10
            ],
            "price": 14.2
          }
        },
        {
          "_index": "test-binary-hnsw",
          "_id": "5",
          "_score": 0.25,
          "_source": {
            "my_vector": [
              80
            ],
            "price": 16.5
          }
        }
      ]
    }
  }

Example: IVF

The IVF method requires a training step that creates and trains the model used to initialize the native library index during segment creation. For more information, see Building a k-NN index from a model.

First, create an index that will contain binary vector training data. Specify the Faiss engine and IVF algorithm and make sure that the dimension matches the dimension of the model you want to create:

  PUT train-index
  {
    "mappings": {
      "properties": {
        "train-field": {
          "type": "knn_vector",
          "dimension": 8,
          "data_type": "binary"
        }
      }
    }
  }


Ingest training data containing binary vectors into the training index:

Bulk ingest request

  PUT _bulk
  { "index": { "_index": "train-index", "_id": "1" } }
  { "train-field": [1] }
  { "index": { "_index": "train-index", "_id": "2" } }
  { "train-field": [2] }
  { "index": { "_index": "train-index", "_id": "3" } }
  { "train-field": [3] }
  { "index": { "_index": "train-index", "_id": "4" } }
  { "train-field": [4] }
  { "index": { "_index": "train-index", "_id": "5" } }
  { "train-field": [5] }
  { "index": { "_index": "train-index", "_id": "6" } }
  { "train-field": [6] }
  { "index": { "_index": "train-index", "_id": "7" } }
  { "train-field": [7] }
  { "index": { "_index": "train-index", "_id": "8" } }
  { "train-field": [8] }
  { "index": { "_index": "train-index", "_id": "9" } }
  { "train-field": [9] }
  { "index": { "_index": "train-index", "_id": "10" } }
  { "train-field": [10] }
  { "index": { "_index": "train-index", "_id": "11" } }
  { "train-field": [11] }
  { "index": { "_index": "train-index", "_id": "12" } }
  { "train-field": [12] }
  { "index": { "_index": "train-index", "_id": "13" } }
  { "train-field": [13] }
  { "index": { "_index": "train-index", "_id": "14" } }
  { "train-field": [14] }
  { "index": { "_index": "train-index", "_id": "15" } }
  { "train-field": [15] }
  { "index": { "_index": "train-index", "_id": "16" } }
  { "train-field": [16] }
  { "index": { "_index": "train-index", "_id": "17" } }
  { "train-field": [17] }
  { "index": { "_index": "train-index", "_id": "18" } }
  { "train-field": [18] }
  { "index": { "_index": "train-index", "_id": "19" } }
  { "train-field": [19] }
  { "index": { "_index": "train-index", "_id": "20" } }
  { "train-field": [20] }
  { "index": { "_index": "train-index", "_id": "21" } }
  { "train-field": [21] }
  { "index": { "_index": "train-index", "_id": "22" } }
  { "train-field": [22] }
  { "index": { "_index": "train-index", "_id": "23" } }
  { "train-field": [23] }
  { "index": { "_index": "train-index", "_id": "24" } }
  { "train-field": [24] }
  { "index": { "_index": "train-index", "_id": "25" } }
  { "train-field": [25] }
  { "index": { "_index": "train-index", "_id": "26" } }
  { "train-field": [26] }
  { "index": { "_index": "train-index", "_id": "27" } }
  { "train-field": [27] }
  { "index": { "_index": "train-index", "_id": "28" } }
  { "train-field": [28] }
  { "index": { "_index": "train-index", "_id": "29" } }
  { "train-field": [29] }
  { "index": { "_index": "train-index", "_id": "30" } }
  { "train-field": [30] }
  { "index": { "_index": "train-index", "_id": "31" } }
  { "train-field": [31] }
  { "index": { "_index": "train-index", "_id": "32" } }
  { "train-field": [32] }
  { "index": { "_index": "train-index", "_id": "33" } }
  { "train-field": [33] }
  { "index": { "_index": "train-index", "_id": "34" } }
  { "train-field": [34] }
  { "index": { "_index": "train-index", "_id": "35" } }
  { "train-field": [35] }
  { "index": { "_index": "train-index", "_id": "36" } }
  { "train-field": [36] }
  { "index": { "_index": "train-index", "_id": "37" } }
  { "train-field": [37] }
  { "index": { "_index": "train-index", "_id": "38" } }
  { "train-field": [38] }
  { "index": { "_index": "train-index", "_id": "39" } }
  { "train-field": [39] }
  { "index": { "_index": "train-index", "_id": "40" } }
  { "train-field": [40] }
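
Because the training request follows a regular pattern, a body like the one above could also be generated programmatically. The following is a minimal sketch; build_training_bulk_body is a hypothetical helper that only produces the newline-delimited JSON text and does not send it anywhere.

```python
import json


def build_training_bulk_body(index_name, field_name, values):
    """Return a newline-delimited bulk body with one document per value."""
    lines = []
    for doc_id, value in enumerate(values, start=1):
        # Action line, followed by the document source line.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": str(doc_id)}}))
        lines.append(json.dumps({field_name: [value]}))
    # A bulk body must end with a newline character.
    return "\n".join(lines) + "\n"


body = build_training_bulk_body("train-index", "train-field", range(1, 41))
```

The resulting string can be sent as the body of a _bulk request using the HTTP client of your choice.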


Then, create and train the model named test-binary-model. The model will be trained using the training data from the train-field field in the train-index index. Specify the binary data type and hamming space type:

  POST _plugins/_knn/models/test-binary-model/_train
  {
    "training_index": "train-index",
    "training_field": "train-field",
    "dimension": 8,
    "description": "model with binary data",
    "data_type": "binary",
    "method": {
      "name": "ivf",
      "engine": "faiss",
      "space_type": "hamming",
      "parameters": {
        "nlist": 1,
        "nprobes": 1
      }
    }
  }


To check the model training status, call the Get Model API:

  GET _plugins/_knn/models/test-binary-model?filter_path=state


Once the training is complete, the state changes to created.
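
If you script this workflow, you might poll the state until training finishes. The following sketch keeps the HTTP call out of scope: get_state is a caller-supplied function (an assumption of this example) that returns the state string from the Get Model API response. Treating a "failed" state as an error is also an assumption; this section only confirms the created state.

```python
import time


def wait_for_model(get_state, timeout_s=60.0, interval_s=1.0, sleep=time.sleep):
    """Poll get_state() until it returns "created", or raise on failure/timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        state = get_state()
        if state == "created":
            return state
        if state == "failed":  # assumed failure state, not confirmed here
            raise RuntimeError("model training failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"model still in state {state!r} after {timeout_s}s")
        sleep(interval_s)


# Usage with a stand-in for the real API call:
fake_states = iter(["training", "training", "created"])
final_state = wait_for_model(lambda: next(fake_states), sleep=lambda _: None)
print(final_state)
```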

Next, create an index that will initialize its native library indexes using the trained model:

  PUT test-binary-ivf
  {
    "settings": {
      "index": {
        "knn": true
      }
    },
    "mappings": {
      "properties": {
        "my_vector": {
          "type": "knn_vector",
          "model_id": "test-binary-model"
        }
      }
    }
  }


Ingest the data containing the binary vectors that you want to search into the created index:

  PUT _bulk?refresh=true
  {"index": {"_index": "test-binary-ivf", "_id": "1"}}
  {"my_vector": [7], "price": 4.4}
  {"index": {"_index": "test-binary-ivf", "_id": "2"}}
  {"my_vector": [10], "price": 14.2}
  {"index": {"_index": "test-binary-ivf", "_id": "3"}}
  {"my_vector": [15], "price": 19.1}
  {"index": {"_index": "test-binary-ivf", "_id": "4"}}
  {"my_vector": [99], "price": 1.2}
  {"index": {"_index": "test-binary-ivf", "_id": "5"}}
  {"my_vector": [80], "price": 16.5}


Finally, search the data. Be sure to provide a binary vector in the k-NN vector field:

  GET test-binary-ivf/_search
  {
    "size": 2,
    "query": {
      "knn": {
        "my_vector": {
          "vector": [8],
          "k": 2
        }
      }
    }
  }


The response contains the two vectors closest to the query vector:

Response

  {
    "took": 7,
    "timed_out": false,
    "_shards": {
      "total": 1,
      "successful": 1,
      "skipped": 0,
      "failed": 0
    },
    "hits": {
      "total": {
        "value": 2,
        "relation": "eq"
      },
      "max_score": 0.5,
      "hits": [
        {
          "_index": "test-binary-ivf",
          "_id": "2",
          "_score": 0.5,
          "_source": {
            "my_vector": [
              10
            ],
            "price": 14.2
          }
        },
        {
          "_index": "test-binary-ivf",
          "_id": "3",
          "_score": 0.25,
          "_source": {
            "my_vector": [
              15
            ],
            "price": 19.1
          }
        }
      ]
    }
  }