Embeddings
Embeddings databases are the engine that delivers semantic search. Data is transformed into embeddings vectors where similar concepts will produce similar vectors. Indexes both large and small are built with these vectors. The indexes are used to find results that have the same meaning, not necessarily the same keywords.
The following code snippet shows how to build and search an embeddings index.
from txtai import Embeddings
# Create embeddings model, backed by sentence-transformers & transformers
embeddings = Embeddings(path="sentence-transformers/nli-mpnet-base-v2")
data = [
"US tops 5 million confirmed virus cases",
"Canada's last fully intact ice shelf has suddenly collapsed, " +
"forming a Manhattan-sized iceberg",
"Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
"The National Park Service warns against sacrificing slower friends " +
"in a bear attack",
"Maine man wins $1M from $25 lottery ticket",
"Make huge profits without work, earn up to $100,000 a day"
]
# Index the list of text
embeddings.index(data)
print(f"{'Query':20} Best Match")
print("-" * 50)
# Run an embeddings search for each query
for query in ("feel good story", "climate change", "public health story", "war",
"wildlife", "asia", "lucky", "dishonest junk"):
# Extract uid of first result
# search result format: (uid, score)
uid = embeddings.search(query, 1)[0][0]
# Print text
print(f"{query:20} {data[uid]}")
Build
An embeddings instance is configuration-driven based on what is passed in the constructor. Vectors are stored with the option to also store content. Content storage enables additional filtering and data retrieval options.
The example above sets a specific embeddings vector model via the path parameter. An embeddings instance with no configuration can also be created.
embeddings = Embeddings()
In this case, when loading and searching for data, the default transformers vector model is used to vectorize data. See the model guide for current model recommentations.
Index
After creating a new embeddings instance, the next step is adding data to it.
embeddings.index(rows)
The index method takes an iterable and supports the following formats for each element.
(id, data, tags)
- default processing format
Element | Description |
---|---|
id | unique record id |
data | input data to index, can be text, a dictionary or object |
tags | optional tags string, used to mark/label data as it’s indexed |
(id, data)
Same as above but without tags.
data
Single element to index. In this case, unique id’s will automatically be generated. Note that for generated id’s, upsert and delete calls require a separate search to get the target ids.
When the data field is a dictionary, text is passed via the text
key, binary objects via the object
key. Note that content must be enabled to store metadata and objects to store binary object data. The id
and tags
keys will be extracted, if provided.
The input iterable can be a list or generator. Generators help with indexing very large datasets as only portions of the data is in memory at any given time.
More information on indexing can be found in the index guide.
Search
Once data is indexed, it is ready for search.
embeddings.search(query, limit)
The search method takes two parameters, the query and query limit. The results format is different based on whether content is stored or not.
- List of
(id, score)
when content is not stored - List of
{**query columns}
when content is stored
Both natural language and SQL queries are supported. More information can be found in the query guide.
More examples
See this link for a full list of embeddings examples.