Frequently Asked Questions and Commonly Encountered Issues
This section answers frequently asked questions and describes commonly encountered problems when working with Chroma. The information below is based on interactions with the Chroma community.
404 Answer Not Found
If your question is not answered here, please reach out to us on our Discord (@taz) or on GitHub Issues.
Frequently Asked Questions
What does Chroma use to index embedding vectors?
Chroma uses its own fork of HNSW lib for indexing and searching embeddings. In addition to HNSW, Chroma also uses a Brute Force index, which acts as a buffer (prior to updating the HNSW graph) and performs exhaustive search using the same distance metric as the HNSW index.
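To make "exhaustive search" concrete, here is a minimal sketch of a brute-force nearest-neighbour lookup using squared L2 distance. It only illustrates the general technique and is not Chroma's implementation; the names squared_l2 and brute_force_search and the sample data are assumptions made up for this example.

```python
import heapq


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def brute_force_search(query, vectors, k=2):
    """Compare the query against every stored vector and return the k closest.

    `vectors` maps an id to its embedding. The result is a list of
    (distance, id) pairs, nearest first.
    """
    scored = ((squared_l2(query, vec), vec_id) for vec_id, vec in vectors.items())
    return heapq.nsmallest(k, scored)


# Toy data: three 2-dimensional embeddings keyed by id.
store = {"a": [0.0, 0.0], "b": [1.0, 0.0], "c": [3.0, 4.0]}
print(brute_force_search([0.9, 0.1], store, k=2))
```

Every stored vector is visited on every query, which is why this approach suits a small buffer of recent writes better than a large collection.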
Alternative Questions:
- What library does Chroma use for vector index and search?
- What algorithm does Chroma use for vector search?
How do I set the dimensionality of my collections?
When creating a collection, its dimensionality is determined by the dimensionality of the first embedding added to it. Once the dimensionality is set, it cannot be changed. Therefore, it is important to consistently use embeddings of the same dimensionality when adding or querying a collection.
Example:
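The sketch below shows one way this can play out, assuming the chromadb package is installed. The helper embedding_dim, the demo function, and the collection name are hypothetical names chosen for illustration. The exact exception class raised on a mismatch may differ between Chroma versions, so the sketch catches Exception broadly.

```python
def embedding_dim(embeddings):
    """Return the common length of all vectors, or raise ValueError if they differ."""
    dims = {len(vec) for vec in embeddings}
    if len(dims) != 1:
        raise ValueError(f"Mixed or empty embedding dimensions: {sorted(dims)}")
    return dims.pop()


def demo():
    # Requires the `chromadb` package; imported here so the helper above
    # stays dependency-free.
    import chromadb

    client = chromadb.EphemeralClient()
    collection = client.get_or_create_collection("dimensionality_demo")  # hypothetical name

    # The first add fixes the collection's dimensionality at 3.
    first_batch = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
    print("batch dimensionality:", embedding_dim(first_batch))
    collection.add(ids=["a", "b"], embeddings=first_batch)

    # A 2-dimensional vector no longer fits, so Chroma rejects it.
    try:
        collection.add(ids=["c"], embeddings=[[0.1, 0.2]])
    except Exception as exc:  # exact exception class varies by Chroma version
        print(f"rejected: {exc}")
```

Call demo() to run the Chroma part. Checking a batch with embedding_dim before adding it is an optional client-side guard that catches mixed dimensionalities early.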
Alternative Questions:
- Can I change the dimensionality of a collection?
Can I use transformers models with Chroma?
Generally, yes, you can use transformers models with Chroma. Although Chroma does not provide a dedicated wrapper for them, you can use SentenceTransformerEmbeddingFunction to achieve the same result. The sentence-transformers library will implicitly apply mean pooling to the last hidden layer, and you will see a warning about it: No sentence-transformers model found with name [model name]. Creating a new one with MEAN pooling.
Example:
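A minimal sketch of this approach follows, assuming both chromadb and sentence-transformers are installed. The function names build_transformer_ef and demo, the collection name, and the model id "bert-base-uncased" are placeholders chosen for illustration; substitute the model you actually want to use.

```python
def build_transformer_ef(model_name, normalize=True):
    """Wrap a Hugging Face model in Chroma's SentenceTransformerEmbeddingFunction.

    Needs `chromadb` and `sentence-transformers`; they are imported lazily so
    that merely defining this function has no side effects.
    """
    from chromadb.utils import embedding_functions

    return embedding_functions.SentenceTransformerEmbeddingFunction(
        model_name=model_name,
        normalize_embeddings=normalize,
    )


def demo():
    import chromadb

    # Placeholder model id; loading it downloads the model and is expected to
    # trigger the MEAN pooling warning described above.
    ef = build_transformer_ef("bert-base-uncased")

    client = chromadb.EphemeralClient()
    collection = client.get_or_create_collection(
        "transformers_demo",  # hypothetical collection name
        embedding_function=ef,
    )
    collection.add(ids=["doc1"], documents=["Chroma stores embeddings."])
    print(collection.query(query_texts=["vector database"], n_results=1))
```

Call demo() to try it. Passing the same embedding function every time the collection is opened keeps add and query embeddings consistent.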
Warning
Not all models will work with the above method. Also, mean pooling may not be the best strategy for a given model. Read the model card to find out what pooling, if any, its creators recommend. You may also want to normalize the embeddings before adding them to Chroma (pass normalize_embeddings=True to the SentenceTransformerEmbeddingFunction constructor).
Should I store my documents in Chroma?
Note: This applies to Chroma single-node and local embedded clients. (Chroma version ca. 0.5.x)
Chroma allows users to store both embeddings and documents, alongside metadata, in collections. Documents and metadata are both optional; depending on your use case, you may choose to store them in Chroma, store them externally, or not store them at all.
Here are some pros/cons to help you decide whether to store your documents in Chroma:
Pros:
- Keeps all the data in the same place. You don’t have to manage a separate DB for the documents
- Allows you to do keyword searches on the documents
Cons:
- The database can grow substantially in size because documents are effectively stored twice: once as metadata for queries and once in the FTS5 index.
- Query performance may take a hit.
Commonly Encountered Problems
Collection Dimensionality Mismatch
Symptoms:
This problem usually surfaces as the following error message:
chromadb.errors.InvalidDimensionException: Embedding dimension XXX does not match collection dimensionality YYY
Context:
Occurs when adding, upserting, or querying a Chroma collection. The error is most visible when using the Python APIs, but it also surfaces in other clients.
Cause:
You are trying to add or query a collection with vectors of a different dimensionality than the collection was created with.
Explanation/Solution:
When you first create a collection with client.create_collection("name"), the collection has no knowledge of its dimensionality, which allows you to add vectors of any dimensionality to it. However, once your first batch of embeddings is added, the collection is locked to that dimensionality. Any subsequent query or add operation must use embeddings of the same dimensionality. The dimensionality of the embeddings is a characteristic of the embedding model (EmbeddingFunction) used to generate them, so it is important to consistently use the same EmbeddingFunction when adding to or querying a collection.
Tip
If you do not specify an embedding_function when creating (client.create_collection) or getting (client.get_or_create_collection) a collection, Chroma will use its default embedding function.
Large Distances in Search Results
Symptoms:
When querying a collection, you get distances that are in the 10s or 100s.
Context:
Frequently occurs when using your own embedding function.
Cause:
The embeddings are not normalized.
Explanation/Solution:
The L2 (Euclidean distance) and IP (inner product) distance metrics are sensitive to the magnitude of the vectors. Chroma uses L2 by default. Therefore, it is recommended to normalize the embeddings before adding them to Chroma.
Here is an example of how to normalize embeddings using the L2 norm:
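The following is a minimal, dependency-free sketch of L2 normalization. The function name l2_normalize and the sample vectors are illustrative assumptions; with NumPy available, the same result can be obtained by dividing each vector by its norm.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


# Two sample embeddings with very different magnitudes.
embeddings = [[3.0, 4.0], [10.0, 0.0]]
normalized = [l2_normalize(vec) for vec in embeddings]
print(normalized)
```

The normalized vectors can then be passed to the collection through the embeddings argument. Query embeddings should be normalized the same way so that distances stay comparable.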
OperationalError: no such column: collections.topic
Symptoms:
The error OperationalError: no such column: collections.topic is raised when trying to access Chroma locally or remotely.
Context:
After upgrading to Chroma 0.5.0 or accessing your Chroma persistent data with Chroma client version 0.5.0.
Cause:
In version 0.5.x, Chroma made some SQLite3 schema changes that are not backward compatible with previous versions. Once you access your persistent data, on the server or locally, with the new Chroma version, it is automatically migrated to the new schema. This operation is not reversible.
Explanation/Solution:
To resolve this issue, you will need to upgrade all clients accessing the Chroma data to version 0.5.x.
Here’s a link to the migration performed by Chroma - https://github.com/chroma-core/chroma/blob/main/chromadb/migrations/sysdb/00005-remove-topic.sqlite.sql
sqlite3.OperationalError: database or disk is full
Symptoms:
The error sqlite3.OperationalError: database or disk is full is raised when trying to access Chroma locally or remotely. The error can occur in any of the Chroma API calls.
Context:
There are two contexts in which this error can occur:
- When the persistent disk space is full or the disk quota is reached. This is the location your PERSIST_DIRECTORY points to.
- When there is not enough space in the temporary directory, frequently /tmp on your system or container.
Cause:
When you insert new data while the Chroma persistent disk is full or the disk quota is reached, Chroma cannot write metadata to the SQLite3 database and raises the error.
When performing large queries or multiple concurrent queries, the temporary disk space may be exhausted.
Explanation/Solution:
To work around the first issue, you can increase the available disk space or free up space on the disk. To work around the second issue, you can increase the temporary disk space (this works fine for containers but might be a problem for VMs) or point SQLite3 to a different temporary directory by using the SQLITE_TMPDIR environment variable.
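As a minimal sketch of the second workaround, the variable can be set from Python before Chroma opens its database. The helper name use_sqlite_tmpdir and the path in the comment are hypothetical. Note that SQLITE_TMPDIR applies to Unix-like systems; setting it in the shell or in the container configuration before the process starts is the safer option when that is possible.

```python
import os


def use_sqlite_tmpdir(path):
    """Create `path` if needed and point SQLite's temp files at it."""
    os.makedirs(path, exist_ok=True)
    os.environ["SQLITE_TMPDIR"] = path
    return path


# Call this before creating the Chroma client, for example:
# use_sqlite_tmpdir("/data/sqlite-tmp")  # hypothetical path on a larger volume
```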
SQLite Temp File
More information on how sqlite3 uses temp files can be found in the SQLite documentation on temporary files.
RuntimeError: Chroma is running in http-only client mode, and can only be run with 'chromadb.api.fastapi.FastAPI'
Symptoms and Context:
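The following error is raised when trying to create a new PersistentClient, EphemeralClient, or Client:
RuntimeError: Chroma is running in http-only client mode, and can only be run with 'chromadb.api.fastapi.FastAPI'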
Cause:
There are two possible causes for this error:
- chromadb-client is installed and you are trying to work with a local client.
- There is a dependency conflict between the chromadb-client and chromadb packages.
Explanation/Solution:
Chroma for Python comes in two packages: chromadb and chromadb-client. The chromadb-client package is used to interact with a remote Chroma server. If you are trying to work with a local client, you should use the chromadb package. If you plan to interact with a remote server only, it is recommended to use the chromadb-client package.
If you intend to work locally with Chroma (e.g. embed it in your app), we suggest that you uninstall the chromadb-client package and install the chromadb package.
To check which package you have installed:
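One way to do this from Python, sketched with the standard library's importlib.metadata, is shown below. The function name installed_chroma_packages is an illustrative assumption. A value of None means the package is not installed in the current environment.

```python
from importlib import metadata


def installed_chroma_packages():
    """Map each Chroma distribution name to its installed version, or None."""
    versions = {}
    for name in ("chromadb", "chromadb-client"):
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


print(installed_chroma_packages())
```

If both entries report a version, the two packages are installed side by side, which matches the dependency conflict described above.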
To uninstall the chromadb-client package, run pip uninstall chromadb-client.
Working with virtual environments
It is recommended to work with virtual environments to avoid dependency conflicts. To create a virtual environment, you can use Python's built-in venv module, for example by running python -m venv .venv and then activating the environment.
Alternatively, you can use conda or poetry to manage your environments.
Default Embedding Function
The default embedding function, chromadb.utils.embedding_functions.DefaultEmbeddingFunction, can only be used with the chromadb package.
ValueError: You must provide an embedding function to compute embeddings
Symptoms and Context:
The error ValueError: You must provide an embedding function to compute embeddings.https://docs.trychroma.com/embeddings is frequently raised when trying to add embeddings to a collection using the Python thin client (chromadb-client package).
Cause:
To reduce the size of the chromadb-client package, the default embedding function, which requires the onnxruntime package, is not included and is instead aliased to None.
Explanation/Solution:
To resolve this issue, always provide an embedding function when you call the get_collection or get_or_create_collection methods, so that the HTTP client has the information it needs to compute embeddings.
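The sketch below illustrates the idea with a toy embedding function, assuming a Chroma server is reachable at localhost:8000. ToyHashEmbeddingFunction, demo, and the collection name are hypothetical and exist only for this example. The hash-based vectors carry no semantic meaning, and the exact interface a custom embedding function must satisfy can vary by Chroma version, so real code would normally pass one of the embedding functions shipped with Chroma or subclass chromadb.EmbeddingFunction.

```python
import hashlib


class ToyHashEmbeddingFunction:
    """Deterministic stand-in embedding function, for illustration only."""

    def __init__(self, dim=8):
        # A SHA-256 digest has 32 bytes, so dim must not exceed 32.
        self.dim = dim

    def __call__(self, input):
        # Chroma calls the embedding function with a list of documents.
        vectors = []
        for text in input:
            digest = hashlib.sha256(text.encode("utf-8")).digest()
            vectors.append([byte / 255.0 for byte in digest[: self.dim]])
        return vectors


def demo():
    # Requires `chromadb-client` (or `chromadb`) and a running Chroma server.
    import chromadb

    client = chromadb.HttpClient(host="localhost", port=8000)  # assumed address
    collection = client.get_or_create_collection(
        "thin_client_demo",  # hypothetical collection name
        embedding_function=ToyHashEmbeddingFunction(),
    )
    # With an embedding function attached, documents can be added directly.
    collection.add(ids=["doc1"], documents=["hello chroma"])
```

Call demo() against a running server to try it. The same embedding function should be passed on every later get_collection call so that queries are embedded the same way as the stored documents.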
September 5, 2024