Working with Vector Search

Create vector keyspace

Create the keyspace you want to use for your Vector Search table. This example uses cycling as the keyspace name:

  1. CREATE KEYSPACE IF NOT EXISTS cycling
  2. WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : '1' };

Use vector keyspace

Select the keyspace you want to use for your Vector Search table. This example uses cycling as the keyspace name:

  1. USE cycling;

Create vector table

Create a new table in your keyspace, including the comments_vector column for vector. The code below creates a vector with five values:

  1. CREATE TABLE IF NOT EXISTS cycling.comments_vs (
  2. record_id timeuuid,
  3. id uuid,
  4. commenter text,
  5. comment text,
  6. comment_vector VECTOR <FLOAT, 5>,
  7. created_at timestamp,
  8. PRIMARY KEY (id, created_at)
  9. )
  10. WITH CLUSTERING ORDER BY (created_at DESC);

Optionally, you can alter an existing table to add a vector column:

  1. ALTER TABLE cycling.comments_vs
  2. ADD comment_vector VECTOR <FLOAT, 5>(1)

Create vector index

Create the custom index with Storage Attached Indexing (SAI):

  1. CREATE INDEX IF NOT EXISTS ann_index
  2. ON cycling.comments_vs(comment_vector) USING 'sai';

For more about SAI, see the Storage Attached Indexing documentation.

The index can be created with options that define the similarity function:

  1. CREATE INDEX IF NOT EXISTS ann_index
  2. ON vsearch.com(item_vector) USING sai
  3. WITH OPTIONS = { similarity_function’: DOT_PRODUCT };

Valid values for the similarity_function are DOT_PRODUCT, COSINE, or EUCLIDEAN.

Load vector data into your database

Insert data into the table using the new type:

  1. INSERT INTO cycling.comments_vs (record_id, id, created_at, comment, commenter, comment_vector)
  2. VALUES (
  3. now(),
  4. e7ae5cf3-d358-4d99-b900-85902fda9bb0,
  5. '2017-02-14 12:43:20-0800',
  6. 'Raining too hard should have postponed',
  7. 'Alex',
  8. [0.45, 0.09, 0.01, 0.2, 0.11]
  9. );
  10. INSERT INTO cycling.comments_vs (record_id, id, created_at, comment, commenter, comment_vector)
  11. VALUES (
  12. now(),
  13. e7ae5cf3-d358-4d99-b900-85902fda9bb0,
  14. '2017-03-21 13:11:09.999-0800',
  15. 'Second rest stop was out of water',
  16. 'Alex',
  17. [0.99, 0.5, 0.99, 0.1, 0.34]
  18. );
  19. INSERT INTO cycling.comments_vs (record_id, id, created_at, comment, commenter, comment_vector)
  20. VALUES (
  21. now(),
  22. e7ae5cf3-d358-4d99-b900-85902fda9bb0,
  23. '2017-04-01 06:33:02.16-0800',
  24. 'LATE RIDERS SHOULD NOT DELAY THE START',
  25. 'Alex',
  26. [0.9, 0.54, 0.12, 0.1, 0.95]
  27. );
  28. INSERT INTO cycling.comments_vs (record_id, id, created_at, comment, commenter, comment_vector)
  29. VALUES (
  30. now(),
  31. c7fceba0-c141-4207-9494-a29f9809de6f,
  32. totimestamp(now()),
  33. 'The gift certificate for winning was the best',
  34. 'Amy',
  35. [0.13, 0.8, 0.35, 0.17, 0.03]
  36. );
  37. INSERT INTO cycling.comments_vs (record_id, id, created_at, comment, commenter, comment_vector)
  38. VALUES (
  39. now(),
  40. c7fceba0-c141-4207-9494-a29f9809de6f,
  41. '2017-02-17 12:43:20.234+0400',
  42. 'Glad you ran the race in the rain',
  43. 'Amy',
  44. [0.3, 0.34, 0.2, 0.78, 0.25]
  45. );
  46. INSERT INTO cycling.comments_vs (record_id, id, created_at, comment, commenter, comment_vector)
  47. VALUES (
  48. now(),
  49. c7fceba0-c141-4207-9494-a29f9809de6f,
  50. '2017-03-22 5:16:59.001+0400',
  51. 'Great snacks at all reststops',
  52. 'Amy',
  53. [0.1, 0.4, 0.1, 0.52, 0.09]
  54. );
  55. INSERT INTO cycling.comments_vs (record_id, id, created_at, comment, commenter, comment_vector)
  56. VALUES (
  57. now(),
  58. c7fceba0-c141-4207-9494-a29f9809de6f,
  59. '2017-04-01 17:43:08.030+0400',
  60. 'Last climb was a killer',
  61. 'Amy',
  62. [0.3, 0.75, 0.2, 0.2, 0.5]
  63. );

Query vector data with CQL

To query data using Vector Search, use a SELECT query:

  1. SELECT * FROM cycling.comments_vs
  2. ORDER BY comment_vector ANN OF [0.15, 0.1, 0.1, 0.35, 0.55]
  3. LIMIT 3;

To obtain the similarity calculation of the best scoring node closest to the query data as part of the results, use a SELECT query:

  1. SELECT comment, similarity_cosine(comment_vector, [0.2, 0.15, 0.3, 0.2, 0.05])
  2. FROM cycling.comments_vs
  3. ORDER BY comment_vector ANN OF [0.1, 0.15, 0.3, 0.12, 0.05]
  4. LIMIT 1;

The supported functions for this type of query are:

  • similarity_dot_product

  • similarity_cosine

  • similarity_euclidean

with the parameters of (<vector_column>, <embedding_value>). Both parameters represent vectors.

  • The limit must be 1,000 or fewer.

  • Vector Search utilizes Approximate Nearest Neighbor (ANN) that in most cases yields results almost as good as the exact match. The scaling is superior to Exact Nearest Neighbor (KNN).

  • Least-similar searches are not supported.

  • Vector Search works optimally on tables with no overwrites or deletions of the item_vector column. For an item_vector column with changes, expect slower search results.