Collections

Collections are the grouping mechanism for embeddings, documents, and metadata.

Collection Basics

Collection Properties

Each collection is characterized by the following properties:

  • name: The name of the collection. The name can be changed as long as it remains unique within the database; use collection.modify(name="new_name") to rename the collection.
  • metadata: A dictionary of key-value pairs associated with the collection. Keys must be strings; values can be strings, integers, floats, or booleans. Metadata can be changed using collection.modify(metadata={"key": "value"}). (Note: metadata is always overwritten when modified.)
  • embedding_function: The embedding function used to embed documents in the collection.

Defaults:

  • Embedding function: if the embedding_function parameter is not provided to get_collection(), create_collection(), or get_or_create_collection(), Chroma uses chromadb.utils.embedding_functions.DefaultEmbeddingFunction to embed documents. The default embedding function runs the all-MiniLM-L6-v2 model on ONNX Runtime.
  • Distance metric: by default Chroma uses the L2 (squared Euclidean distance) metric for newly created collections. You can change it at creation time using the hnsw:space metadata key. Possible values are l2, cosine, and ip (inner product). (Note: cosine returns cosine distance rather than cosine similarity, i.e. values close to 0 mean the embeddings are more similar.)
  • Batch size: defined by the hnsw:batch_size metadata key. Default is 100. The batch size defines the size of the in-memory brute-force index. Once the threshold is reached, vectors are added to the HNSW index and the brute-force index is cleared. Greater values may improve ingest performance. When changing it, also consider changing the sync threshold.
  • Sync threshold: defined by the hnsw:sync_threshold metadata key. Default is 1000. The sync threshold defines the limit at which the HNSW index is synced to disk. This limit only applies to newly added vectors.
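
To make the three hnsw:space values concrete, here is a minimal pure-Python sketch of the distances they correspond to. It assumes the formulas commonly documented for Chroma's HNSW index (squared L2, one minus the inner product, and one minus the cosine similarity); the function names are illustrative and not part of the Chroma API.

```python
import math
from typing import Sequence


def l2_squared(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance (assumed meaning of hnsw:space = "l2")."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def inner_product_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """One minus the dot product (assumed meaning of hnsw:space = "ip")."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """One minus the cosine similarity (hnsw:space = "cosine")."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm
```

In all three cases a smaller value means the embeddings are closer, which is why cosine results near 0 indicate similar documents.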

Keep in Mind

The distance metric of a collection cannot be changed after the collection is created. To change the distance metric, see Cloning a collection below.
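
Because the metric is fixed at creation time, it can help to assemble and check the hnsw:* metadata before creating the collection. The sketch below is one possible way to do that; build_hnsw_metadata is a hypothetical helper, not a Chroma function, and it only uses the keys and defaults listed above.

```python
VALID_SPACES = {"l2", "cosine", "ip"}


def build_hnsw_metadata(space="l2", batch_size=100, sync_threshold=1000, **extra):
    """Return a metadata dict with the hnsw:* keys described above.

    Hypothetical helper: it validates the values locally and leaves
    any further checks to Chroma itself.
    """
    if space not in VALID_SPACES:
        raise ValueError(f"hnsw:space must be one of {sorted(VALID_SPACES)}")
    if batch_size < 1 or sync_threshold < 1:
        raise ValueError("batch_size and sync_threshold must be positive")
    metadata = dict(extra)  # user-supplied keys, e.g. {"owner": "team-a"}
    metadata.update({
        "hnsw:space": space,
        "hnsw:batch_size": batch_size,
        "hnsw:sync_threshold": sync_threshold,
    })
    return metadata


# Intended use (requires chromadb and a client):
# col = client.create_collection("test", metadata=build_hnsw_metadata(space="cosine"))
```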

Name Restrictions

Collection names in Chroma must adhere to the following restrictions:

  (1) contains 3-63 characters
  (2) starts and ends with an alphanumeric character
  (3) otherwise contains only alphanumeric characters, underscores or hyphens (-)
  (4) contains no two consecutive periods (..)
  (5) is not a valid IPv4 address
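
The rules above can be checked client-side before calling Chroma. The following is a sketch under one assumption: single periods are allowed inside a name (rules 4 and 5 only make sense if they are). is_valid_collection_name is an illustrative name, not a Chroma API; Chroma performs its own validation regardless.

```python
import ipaddress
import re

# Rules 1-3: 3-63 characters, alphanumeric at both ends,
# and (by assumption) letters, digits, '.', '_' or '-' in between.
_NAME_RE = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]{1,61}[A-Za-z0-9]$")


def is_valid_collection_name(name: str) -> bool:
    if not _NAME_RE.match(name):
        return False
    if ".." in name:  # rule 4: no two consecutive periods
        return False
    try:
        ipaddress.IPv4Address(name)  # rule 5: must not parse as IPv4
    except ValueError:
        return True
    return False
```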

Creating a collection

Official Docs

For more information on the create_collection or get_or_create_collection methods, see the official ChromaDB documentation.

Parameters:

| Name | Description | Default Value | Type |
|------|-------------|---------------|------|
| name | Name of the collection to create. Parameter is required | N/A | String |
| metadata | Metadata associated with the collection. This is an optional parameter | None | Dictionary |
| embedding_function | Embedding function to use for the collection. This is an optional parameter | chromadb.utils.embedding_functions.DefaultEmbeddingFunction | EmbeddingFunction |

    import chromadb

    client = chromadb.PersistentClient(path="test")  # or HttpClient()
    col = client.create_collection("test")

Alternatively, you can use the get_or_create_collection method to create a collection if it doesn't already exist.

    import chromadb

    client = chromadb.PersistentClient(path="test")  # or HttpClient()
    col = client.get_or_create_collection("test", metadata={"key": "value"})

Metadata with get_or_create_collection()

If the collection exists and metadata is passed to the method, Chroma will attempt to overwrite the existing metadata. This behaviour is tracked in a GH issue and may be fixed.

Deleting a collection

Official Docs

For more information on the delete_collection method, see the official ChromaDB documentation.

Parameters:

| Name | Description | Default Value | Type |
|------|-------------|---------------|------|
| name | Name of the collection to delete. Parameter is required | N/A | String |

    import chromadb

    client = chromadb.PersistentClient(path="test")  # or HttpClient()
    client.delete_collection("test")

Listing all collections

Official Docs

For more information on the list_collections method, see the official ChromaDB documentation.

Parameters:

| Name | Description | Default Value | Type |
|------|-------------|---------------|------|
| offset | The starting offset for listing collections. This is an optional parameter | None | Positive Integer |
| limit | The number of collections to return. If fewer collections remain after the offset, fewer are returned. This is an optional parameter | None | Positive Integer |

    import chromadb

    client = chromadb.PersistentClient(path="test")  # or HttpClient()
    collections = client.list_collections()

Getting a collection

Official Docs

For more information on the get_collection method, see the official ChromaDB documentation.

Parameters:

| Name | Description | Default Value | Type |
|------|-------------|---------------|------|
| name | Name of the collection to get. Parameter is required | N/A | String |
| embedding_function | Embedding function to use for the collection. This is an optional parameter | chromadb.utils.embedding_functions.DefaultEmbeddingFunction | EmbeddingFunction |

    import chromadb

    client = chromadb.PersistentClient(path="test")  # or HttpClient()
    col = client.get_collection("test")

Modifying a collection

Official Docs

For more information on the modify method, see the official ChromaDB documentation.

Modify method on collection

Note that, unlike the rest of the collection lifecycle methods, modify is called on the collection and not on the client.

Metadata Overwrite

Metadata is always overwritten when modified. If you want to add a new key-value pair to the metadata, you must first get the existing metadata and then add the new key-value pair to it.
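
The read-then-merge step described above can be sketched as follows. merged_metadata is a hypothetical helper name; the only Chroma calls assumed are col.metadata and col.modify(metadata=...), both of which appear elsewhere in this section.

```python
from typing import Any, Dict, Optional


def merged_metadata(existing: Optional[Dict[str, Any]],
                    updates: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the existing metadata with the updates applied.

    The existing dict is not mutated; None is treated as empty.
    """
    merged = dict(existing or {})
    merged.update(updates)
    return merged


# Intended use (requires chromadb and an existing collection `col`):
# col.modify(metadata=merged_metadata(col.metadata, {"new_key": "new_value"}))
```

If the existing metadata carries hnsw:* keys, it is worth checking whether your Chroma version accepts them in modify before passing the merged dictionary back.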

Parameters:

| Name | Description | Default Value | Type |
|------|-------------|---------------|------|
| name | The new name of the collection. This is an optional parameter | None | String |
| metadata | Metadata associated with the collection. This is an optional parameter | None | Dictionary |

Both collection properties (name and metadata) can be modified, separately or together.

    import chromadb

    client = chromadb.PersistentClient(path="test")  # or HttpClient()
    col = client.get_collection("test")
    col.modify(name="test2", metadata={"key": "value"})

Counting Collections

Official Docs

For more information on the count_collections method, see the official ChromaDB documentation.

    import chromadb

    client = chromadb.PersistentClient(path="test")  # or HttpClient()
    col = client.get_or_create_collection("test")  # create a new collection
    client.count_collections()

Iterating over a Collection

    import chromadb

    client = chromadb.PersistentClient(path="my_local_data")  # or HttpClient()
    collection = client.get_or_create_collection("local_collection")
    collection.add(
        ids=[f"{i}" for i in range(1000)],
        documents=[f"document {i}" for i in range(1000)],
        metadatas=[{"doc_id": i} for i in range(1000)])

    # page through the collection in fixed-size batches
    existing_count = collection.count()
    batch_size = 10
    for i in range(0, existing_count, batch_size):
        batch = collection.get(
            include=["metadatas", "documents", "embeddings"],
            limit=batch_size,
            offset=i)
        print(batch)  # do something with the batch
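
The paging loop above can also be wrapped in a generator so that callers only deal with batches. This is a minimal sketch; iter_batches is a hypothetical helper that relies only on the count() and get(include=..., limit=..., offset=...) calls used above.

```python
from typing import Any, Dict, Iterator, Sequence


def iter_batches(collection: Any,
                 batch_size: int = 100,
                 include: Sequence[str] = ("metadatas", "documents"),
                 ) -> Iterator[Dict[str, Any]]:
    """Yield successive get() results covering the whole collection."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    total = collection.count()  # snapshot of the size when iteration starts
    for offset in range(0, total, batch_size):
        yield collection.get(include=list(include),
                             limit=batch_size,
                             offset=offset)


# Intended use (requires chromadb and an existing collection):
# for batch in iter_batches(collection, batch_size=10):
#     print(batch["ids"])
```

Because the count is read once up front, records added while iterating may be missed; this mirrors the behaviour of the explicit loop.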

Collection Utilities

Copying Collections

Local To Remote

The following example demonstrates how to copy a local collection to a remote ChromaDB server. (It also works in reverse.)

    import chromadb

    client = chromadb.PersistentClient(path="my_local_data")
    remote_client = chromadb.HttpClient()

    collection = client.get_or_create_collection("local_collection")
    collection.add(
        ids=["1", "2"],
        documents=["hello world", "hello ChromaDB"],
        metadatas=[{"a": 1}, {"b": 2}])

    # create the remote collection with the same metadata as the source
    remote_collection = remote_client.get_or_create_collection(
        "remote_collection",
        metadata=collection.metadata)

    existing_count = collection.count()
    batch_size = 10
    for i in range(0, existing_count, batch_size):
        batch = collection.get(
            include=["metadatas", "documents", "embeddings"],
            limit=batch_size,
            offset=i)
        remote_collection.add(
            ids=batch["ids"],
            documents=batch["documents"],
            metadatas=batch["metadatas"],
            embeddings=batch["embeddings"])

Using ChromaDB Data Pipes

You can achieve the same result with the ChromaDB Data Pipes package.

    pip install chromadb-data-pipes
    cdp export "file://path/to_local_data/local_collection" | \
      cdp import "http://remote_chromadb:port/remote_collection" --create

Local To Local

The following example shows how to copy a collection from one local persistent DB to another local persistent DB.

    import chromadb

    source_client = chromadb.PersistentClient(path="source")
    target_client = chromadb.PersistentClient(path="target")

    collection = source_client.get_or_create_collection("my_source_collection")
    collection.add(
        ids=["1", "2"],
        documents=["hello world", "hello ChromaDB"],
        metadatas=[{"a": 1}, {"b": 2}])

    # create the target collection with the same metadata as the source
    target_collection = target_client.get_or_create_collection(
        "my_target_collection",
        metadata=collection.metadata)

    existing_count = collection.count()
    batch_size = 10
    for i in range(0, existing_count, batch_size):
        batch = collection.get(
            include=["metadatas", "documents", "embeddings"],
            limit=batch_size,
            offset=i)
        target_collection.add(
            ids=batch["ids"],
            documents=batch["documents"],
            metadatas=batch["metadatas"],
            embeddings=batch["embeddings"])

Using ChromaDB Data Pipes

You can achieve the above with the ChromaDB Data Pipes package.

    pip install chromadb-data-pipes
    cdp export "file://source_persist_dir/target_collection" | \
      cdp import "file://target_persist_dir/target_collection" --create

Cloning a collection

Here are some reasons why you might want to clone a collection:

  • Change the distance function (via metadata - hnsw:space)
  • Change HNSW hyperparameters (hnsw:M, hnsw:construction_ef, hnsw:search_ef)

    import chromadb

    client = chromadb.PersistentClient(path="test")  # or HttpClient()
    col = client.get_or_create_collection("test")  # create a new collection with L2 (default)
    col.add(ids=[f"{i}" for i in range(1000)], documents=[f"document {i}" for i in range(1000)])

    # let's change the distance function to cosine
    newCol = client.get_or_create_collection("test1", metadata={"hnsw:space": "cosine"})

    existing_count = col.count()
    batch_size = 10
    for i in range(0, existing_count, batch_size):
        batch = col.get(include=["metadatas", "documents", "embeddings"], limit=batch_size, offset=i)
        newCol.add(ids=batch["ids"], documents=batch["documents"], metadatas=batch["metadatas"],
                   embeddings=batch["embeddings"])

    print(newCol.count())
    print(newCol.get(offset=0, limit=10))  # get first 10 documents

Changing the embedding function

To change the embedding function of a collection, it must be cloned to a new collection with the desired embedding function.

    import os
    import chromadb
    from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction, DefaultEmbeddingFunction

    client = chromadb.PersistentClient(path="test")  # or HttpClient()
    default_ef = DefaultEmbeddingFunction()
    col = client.create_collection("default_ef_collection", embedding_function=default_ef)
    openai_ef = OpenAIEmbeddingFunction(api_key=os.getenv("OPENAI_API_KEY"), model_name="text-embedding-3-small")
    col.add(ids=[f"{i}" for i in range(1000)], documents=[f"document {i}" for i in range(1000)])
    newCol = client.get_or_create_collection("openai_ef_collection", embedding_function=openai_ef)

    existing_count = col.count()
    batch_size = 10
    for i in range(0, existing_count, batch_size):
        # embeddings are not copied: the new collection re-embeds the documents
        batch = col.get(include=["metadatas", "documents"], limit=batch_size, offset=i)
        newCol.add(ids=batch["ids"], documents=batch["documents"], metadatas=batch["metadatas"])

    # get first 10 documents with their OpenAI embeddings
    print(newCol.get(offset=0, limit=10, include=["metadatas", "documents", "embeddings"]))

Cloning a subset of a collection with query

The example below demonstrates how to select a slice of an existing collection using where and where_document filters and how to create a new collection from the selected slice.

Race Condition

The example below is not atomic: if data changes between the initial selection query (select_ids = col.get(...)) and the subsequent batch queries (batch = col.get(...)), the new collection may not contain the expected data.

    import chromadb

    client = chromadb.PersistentClient(path="test")  # or HttpClient()
    col = client.get_or_create_collection("test")  # create a new collection with L2 (default)
    col.add(ids=[f"{i}" for i in range(1000)], documents=[f"document {i}" for i in range(1000)])

    # let's change the distance function to cosine and M to 32
    newCol = client.get_or_create_collection("test1", metadata={"hnsw:space": "cosine", "hnsw:M": 32})

    # filters that define the slice; adjust them to your data
    query_where = {"metadata_key": "value"}
    query_where_document = {"$contains": "document"}
    select_ids = col.get(where_document=query_where_document, where=query_where, include=[])  # get only IDs

    batch_size = 10
    for i in range(0, len(select_ids["ids"]), batch_size):
        batch = col.get(include=["metadatas", "documents", "embeddings"], limit=batch_size, offset=i,
                        where=query_where, where_document=query_where_document)
        newCol.add(ids=batch["ids"], documents=batch["documents"], metadatas=batch["metadatas"],
                   embeddings=batch["embeddings"])

    print(newCol.count())
    print(newCol.get(offset=0, limit=10))  # get first 10 documents
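
One way to narrow the race window noted above is to fetch each batch by the IDs selected up front instead of re-running the filter with an offset. The sketch below assumes only collection.get(ids=..., include=...) and collection.add(...); chunked and copy_by_ids are hypothetical helper names. It is still not atomic: records deleted after the selection are skipped, and updated records are copied in their latest state.

```python
from typing import Any, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Split a sequence of IDs into lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def copy_by_ids(source: Any, target: Any, ids: Sequence[str],
                batch_size: int = 10) -> int:
    """Copy the given IDs from source to target; return the number copied."""
    copied = 0
    for id_chunk in chunked(ids, batch_size):
        batch = source.get(ids=id_chunk,
                           include=["metadatas", "documents", "embeddings"])
        if not batch["ids"]:  # every ID in this chunk has disappeared
            continue
        target.add(ids=batch["ids"],
                   documents=batch["documents"],
                   metadatas=batch["metadatas"],
                   embeddings=batch["embeddings"])
        copied += len(batch["ids"])
    return copied


# Intended use with the variables from the example above:
# copy_by_ids(col, newCol, select_ids["ids"], batch_size=10)
```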

Updating Document/Record Metadata

In this example we loop through all documents of a collection and strip leading and trailing whitespace from all string metadata values. Change the update_metadata function to suit your needs.

    from chromadb import Settings
    import chromadb

    client = chromadb.PersistentClient(path="test", settings=Settings(allow_reset=True))
    client.reset()  # reset the database so we can run this script multiple times
    col = client.get_or_create_collection("test")
    # sample records so the loop below has something to update
    col.add(ids=["1", "2"], documents=["doc 1", "doc 2"], metadatas=[{"k": " a "}, {"k": "b ", "n": 1}])
    count = col.count()


    def update_metadata(metadata: dict):
        # strip only string values; leave ints, floats and booleans untouched
        return {k: v.strip() if isinstance(v, str) else v for k, v in metadata.items()}


    for i in range(0, count, 10):
        batch = col.get(include=["metadatas"], limit=10, offset=i)
        col.update(ids=batch["ids"], metadatas=[update_metadata(metadata) for metadata in batch["metadatas"]])

Tips and Tricks

Getting IDs Only

The example below demonstrates how to get only the IDs of a collection. This is useful if you need to work with IDs without fetching any additional data. Chroma accepts an empty include array, indicating that no data other than the IDs should be returned.

    import chromadb

    client = chromadb.PersistentClient(path="test")
    col = client.get_or_create_collection("my_collection")
    ids_only_result = col.get(include=[])
    print(ids_only_result['ids'])

August 1, 2024