Batching

You may often need to ingest a large number of documents into Chroma. The limit you can run into stems from the SQLite version on the machine running Chroma, which imposes a maximum number of statements and bound parameters per query. Chroma translates this into a maximum batchable record count, exposed via the max_batch_size property of the client.

    import chromadb

    client = chromadb.PersistentClient(path="test")
    print("Number of documents that can be inserted at once:", client.max_batch_size)
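If you only need simple chunking, the limit can also be applied by hand. Below is a minimal sketch of such a helper; the `chunked` name and the hard-coded limit are illustrative (in practice you would read the limit from `client.max_batch_size`):

```python
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while batch := list(islice(it, size)):
        yield batch


# Illustrative limit only; in practice use client.max_batch_size.
MAX_BATCH_SIZE = 5461
ids = [f"id-{i}" for i in range(12000)]
batches = list(chunked(ids, MAX_BATCH_SIZE))
print([len(b) for b in batches])  # → [5461, 5461, 1078]
```

Each chunk can then be passed to a separate `collection.add()` call.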

Creating Batches

For consistency and data-integrity reasons, Chroma does not yet offer out-of-the-box batching support. The snippet below shows how to split a large set of documents into batches and ingest them into Chroma.

    import chromadb
    from chromadb.utils.batch_utils import create_batches
    import uuid

    client = chromadb.PersistentClient(path="test-large-batch")
    # Generate 100k records of (id, document, 1536-dimensional embedding)
    large_batch = [(f"{uuid.uuid4()}", f"document {i}", [0.1] * 1536) for i in range(100000)]
    ids, documents, embeddings = zip(*large_batch)
    # Each batch is a tuple in the order (ids, embeddings, metadatas, documents)
    batches = create_batches(
        api=client,
        ids=list(ids),
        documents=list(documents),
        embeddings=list(embeddings),
    )
    collection = client.get_or_create_collection("test")
    for batch in batches:
        print(f"Adding batch of size {len(batch[0])}")
        collection.add(
            ids=batch[0],
            documents=batch[3],
            embeddings=batch[1],
            metadatas=batch[2],
        )
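To make the index arithmetic above concrete: each batch is a four-tuple ordered as (ids, embeddings, metadatas, documents). The standalone sketch below mimics that splitting logic without a running Chroma instance; the `create_record_batches` name is hypothetical (the real utility is `chromadb.utils.batch_utils.create_batches`):

```python
def create_record_batches(ids, embeddings, metadatas, documents, max_batch_size):
    """Split parallel record lists into (ids, embeddings, metadatas, documents) tuples."""
    batches = []
    for start in range(0, len(ids), max_batch_size):
        end = start + max_batch_size
        batches.append((
            ids[start:end],
            embeddings[start:end] if embeddings else None,
            metadatas[start:end] if metadatas else None,
            documents[start:end] if documents else None,
        ))
    return batches


ids = [str(i) for i in range(10)]
docs = [f"document {i}" for i in range(10)]
embs = [[0.1] * 4 for _ in range(10)]
batches = create_record_batches(ids, embs, None, docs, max_batch_size=4)
# 10 records with a batch size of 4 yield batches of 4, 4, and 2 records;
# metadatas stays None because none were supplied.
print([len(b[0]) for b in batches])  # → [4, 4, 2]
```

This is why the ingestion loop reads documents from `batch[3]`, embeddings from `batch[1]`, and metadatas from `batch[2]`.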

January 20, 2024