Batching
You will often need to ingest a large number of documents into Chroma. The difficulty is that the SQLite version on the machine running Chroma imposes a maximum number of statements and parameters per query. Chroma translates this limit into a maximum number of records per batch, exposed via the max_batch_size parameter of the ChromaClient class.
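To make the link between the SQLite limit and a batch size concrete, here is a minimal sketch that uses only Python's standard sqlite3 module. The helper names (sqlite_max_variable_number, estimate_max_batch_size) are hypothetical, and the divisor of 6 bound variables per record is an assumption made for illustration, not a value stated on this page; the batch size Chroma actually reports is the one to rely on.

```python
import sqlite3


def sqlite_max_variable_number() -> int:
    """Return the bound-parameter limit of the SQLite library linked into Python."""
    conn = sqlite3.connect(":memory:")
    try:
        rows = conn.execute("PRAGMA compile_options;").fetchall()
    finally:
        conn.close()
    for (option,) in rows:
        # Builds with a custom limit report e.g. 'MAX_VARIABLE_NUMBER=250000'.
        if option.startswith("MAX_VARIABLE_NUMBER="):
            return int(option.split("=", 1)[1])
    # Otherwise fall back to SQLite's documented compile-time defaults:
    # 999 before version 3.32.0, 32766 from 3.32.0 onwards.
    return 32766 if sqlite3.sqlite_version_info >= (3, 32, 0) else 999


def estimate_max_batch_size(variables_per_record: int = 6) -> int:
    """Estimate how many records fit in one statement.

    variables_per_record is an ASSUMED illustrative value, not a documented constant.
    """
    if variables_per_record <= 0:
        raise ValueError("variables_per_record must be positive")
    return sqlite_max_variable_number() // variables_per_record


if __name__ == "__main__":
    print(f"SQLite version: {sqlite3.sqlite_version}")
    print(f"Max bound variables: {sqlite_max_variable_number()}")
    print(f"Estimated max batch size: {estimate_max_batch_size()}")
```

Running this on different machines shows why the same ingestion script can succeed on one host and fail on another: the limit depends on how the local SQLite library was built.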
Creating Batches
For consistency and data-integrity reasons, Chroma does not yet offer out-of-the-box batching support. The code snippet below shows how to create batches of documents and ingest them into Chroma.
import uuid

import chromadb
from chromadb.utils.batch_utils import create_batches

client = chromadb.PersistentClient(path="test-large-batch")

# Build 100,000 synthetic records: (id, document, 1536-dimensional embedding).
large_batch = [(f"{uuid.uuid4()}", f"document {i}", [0.1] * 1536) for i in range(100000)]
ids, documents, embeddings = zip(*large_batch)

# Split the records into chunks no larger than the client's maximum batch size.
batches = create_batches(api=client, ids=list(ids), documents=list(documents), embeddings=list(embeddings))

collection = client.get_or_create_collection("test")
for batch in batches:
    # Each batch is a tuple of (ids, embeddings, metadatas, documents).
    print(f"Adding batch of size {len(batch[0])}")
    collection.add(ids=batch[0],
                   documents=batch[3],
                   embeddings=batch[1],
                   metadatas=batch[2])
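For readers who want to see the slicing that a batching helper performs, the following is a minimal sketch in plain Python with no Chroma dependency. The function split_into_batches and its max_batch_size argument are hypothetical names introduced here for illustration; this is not Chroma's implementation, only the general technique of cutting parallel lists into fixed-size chunks.

```python
from typing import Iterator, List, Sequence, Tuple

Batch = Tuple[List[str], List[str], List[List[float]]]


def split_into_batches(
    ids: Sequence[str],
    documents: Sequence[str],
    embeddings: Sequence[List[float]],
    max_batch_size: int,
) -> Iterator[Batch]:
    """Yield (ids, documents, embeddings) chunks of at most max_batch_size records."""
    if max_batch_size <= 0:
        raise ValueError("max_batch_size must be positive")
    if not (len(ids) == len(documents) == len(embeddings)):
        raise ValueError("ids, documents and embeddings must have the same length")
    for start in range(0, len(ids), max_batch_size):
        end = start + max_batch_size
        # Slice all three lists with the same bounds so records stay aligned.
        yield (
            list(ids[start:end]),
            list(documents[start:end]),
            list(embeddings[start:end]),
        )


# Ten synthetic records split with a limit of four records per batch.
sample_ids = [f"id-{i}" for i in range(10)]
sample_docs = [f"document {i}" for i in range(10)]
sample_embeddings = [[0.1, 0.2, 0.3] for _ in range(10)]

sample_batches = list(
    split_into_batches(sample_ids, sample_docs, sample_embeddings, max_batch_size=4)
)
print([len(batch[0]) for batch in sample_batches])  # prints [4, 4, 2]
```

Because the helper is a generator, only one batch is materialized at a time, which keeps memory use flat when the input is large. Each yielded tuple can be passed to collection.add in the same way as the batches above.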
January 20, 2024