Document IDs

Chroma is unopinionated about document IDs and delegates those decisions to the user. This frees users to build semantics around their IDs.

Note on Compound IDs

While you can choose to use IDs that are composed of multiple sub-IDs (e.g. user_id + document_id), it is important to highlight that Chroma does not support querying by partial ID.
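
Because partial-ID lookups are not available, one possible workaround is to keep each sub-ID in the document's metadata as well, so that it can be filtered on. The following is a minimal sketch under stated assumptions: the helper name make_compound_id, the ":" separator, and the metadata keys are hypothetical and not part of Chroma.

```python
def make_compound_id(user_id: str, document_id: str, sep: str = ":") -> str:
    """Join two sub-IDs into a single ID string (hypothetical helper)."""
    if sep in user_id or sep in document_id:
        # A separator inside a sub-ID would make the compound ID ambiguous.
        raise ValueError(f"sub-IDs must not contain the separator {sep!r}")
    return f"{user_id}{sep}{document_id}"


# (user_id, document_id) pairs used purely for illustration.
records = [("user-1", "doc-a"), ("user-1", "doc-b"), ("user-2", "doc-a")]

ids = [make_compound_id(u, d) for u, d in records]
# Mirror the sub-IDs in metadata so they can be filtered on individually.
metadatas = [{"user_id": u, "document_id": d} for u, d in records]

print(ids)
```

The two lists could then be passed to collection.add(ids=ids, metadatas=metadatas, documents=...), and one user's documents retrieved with a metadata filter such as collection.get(where={"user_id": "user-1"}).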

Common Practices

chromadbx

We provide a convenient wrapper in the form of the chromadbx package, which provides ID generators for UUIDs, ULIDs, NanoIDs, and hashes, among other functions. You can install it with pip install chromadbx.

UUIDs

UUIDs are a common choice for document IDs. They are unique, and can be generated in a distributed fashion. They are also opaque, which means that they do not contain any information about the document itself. This can be a good thing, as it allows you to change the document without changing the ID.

chromadbx:

    import chromadb
    from chromadbx import UUIDGenerator

    client = chromadb.Client()
    col = client.get_or_create_collection("test")
    my_docs = [f"Document {i}" for i in range(10)]
    # Generate one UUID per document
    col.add(ids=UUIDGenerator(len(my_docs)), documents=my_docs)

Python:

    import uuid

    import chromadb

    my_documents = [
        "Hello, world!",
        "Hello, Chroma!",
    ]
    client = chromadb.Client()
    collection = client.get_or_create_collection("collection")
    # One random (version 4) UUID per document
    collection.add(ids=[str(uuid.uuid4()) for _ in my_documents], documents=my_documents)

Caveats

Predictable Ordering

UUIDs, especially v4, are random, so sorting them lexicographically does not reflect the order in which they were created. In its current version (0.4.x-0.5.0), Chroma orders the results of get() by document ID. Therefore, if you need predictable ordering, you may want to consider a different ID strategy.
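
To illustrate the ordering caveat, the sketch below sorts a small batch of version 4 UUID strings by ID and compares the result with the insertion order. The zero-padded counter shown as an alternative is only an assumption made for illustration, not a Chroma recommendation.

```python
import uuid

# Five IDs generated in insertion order.
uuid_ids = [str(uuid.uuid4()) for _ in range(5)]

# Sorting by ID reorders random UUIDs, so the sorted order
# generally differs from the insertion order.
print(sorted(uuid_ids) == uuid_ids)

# Hypothetical alternative: fixed-width, zero-padded counters sort
# lexicographically in the same order in which they were created.
counter_ids = [f"doc-{i:08d}" for i in range(5)]
print(sorted(counter_ids) == counter_ids)  # True
```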

Storage and Performance Overhead

Chroma stores document IDs as strings, and a UUID in its canonical form is 36 characters long, which can add up to significant overhead if you have a large number of documents. If you are concerned about storage overhead, you may want to consider a different ID strategy. Additionally, Chroma uses the document IDs when sorting results, which also incurs a performance cost.
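
As a rough way to compare the per-ID string length of a few encodings, the following sketch measures them with the standard library only. The encodings other than the canonical UUID string are illustrative alternatives, not something Chroma requires.

```python
import base64
import hashlib
import uuid

u = uuid.uuid4()

lengths = {
    # 32 hex digits plus 4 hyphens
    "uuid canonical string": len(str(u)),
    # the same 128 bits without hyphens
    "uuid hex, no hyphens": len(u.hex),
    # the 16 raw bytes in URL-safe Base64 with padding stripped
    "uuid base64url, no padding": len(base64.urlsafe_b64encode(u.bytes).rstrip(b"=")),
    # a SHA-256 digest is 256 bits, i.e. 64 hex digits
    "sha256 hex digest": len(hashlib.sha256(u.bytes).hexdigest()),
}

for name, n in lengths.items():
    print(f"{name}: {n} characters")
```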

ULIDs

ULIDs are an alternative to UUIDs that is lexicographically sortable. They are also 128 bits long, like UUIDs, but they start with a timestamp and are encoded so that sorting them as strings orders them by creation time. This can be useful if you need predictable ordering of your documents.

ULIDs are also shorter than UUIDs in string form (26 characters versus 36), which can save you some storage space. Apart from the creation timestamp, they are opaque, like UUIDs, which means that they do not contain any information about the document itself.
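
To make the sortable encoding concrete, here is a minimal standard-library sketch of the ULID layout: a 48-bit millisecond timestamp followed by 80 random bits, written as 26 Crockford Base32 characters. The function name ulid_sketch is hypothetical, and the sketch omits details such as monotonic ordering within a single millisecond, so it is not a substitute for a ULID library.

```python
import os
import time
from typing import Optional

# Crockford Base32 alphabet (no I, L, O, U); its symbols are in ASCII order.
CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def ulid_sketch(timestamp_ms: Optional[int] = None) -> str:
    """Build a ULID-shaped string: 48-bit timestamp + 80 random bits."""
    if timestamp_ms is None:
        timestamp_ms = time.time_ns() // 1_000_000
    if not 0 <= timestamp_ms < 2**48:
        raise ValueError("timestamp must fit in 48 bits")
    randomness = int.from_bytes(os.urandom(10), "big")  # 80 bits
    value = (timestamp_ms << 80) | randomness           # 128 bits in total
    chars = []
    for _ in range(26):  # 26 characters x 5 bits each
        chars.append(CROCKFORD[value & 0x1F])
        value >>= 5
    return "".join(reversed(chars))


earlier = ulid_sketch(1_700_000_000_000)
later = ulid_sketch(1_700_000_000_001)
print(earlier < later)  # True: the timestamp occupies the leading characters
```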

Install the ulid-py package to generate ULIDs.

    pip install ulid-py

chromadbx:

    import chromadb
    from chromadbx import ULIDGenerator

    client = chromadb.Client()
    col = client.get_or_create_collection("test")
    my_docs = [f"Document {i}" for i in range(10)]
    # Generate one ULID per document
    col.add(ids=ULIDGenerator(len(my_docs)), documents=my_docs)

Python:

    import chromadb
    import ulid  # provided by the ulid-py package

    my_documents = [
        "Hello, world!",
        "Hello, Chroma!",
    ]
    client = chromadb.Client()
    collection = client.get_or_create_collection("name")
    # ulid.new() creates a ULID from the current time; .str is its string form
    collection.add(ids=[ulid.new().str for _ in my_documents], documents=my_documents)

NanoIDs

NanoIDs provide a way to generate unique IDs that are shorter than UUIDs (21 characters by default). They are not lexicographically sortable, but they can be generated in a distributed fashion. They are also opaque and have low collision rates; see the collision probability calculator at https://zelark.github.io/nano-id-cc/.
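
For intuition about how such IDs are formed and why collisions are rare, the sketch below draws 21 characters from a 64-symbol URL-safe alphabet and estimates the collision probability with the birthday approximation. The function names nanoid_sketch and collision_probability are hypothetical; this is a standard-library illustration, not the nanoid package's implementation.

```python
import math
import secrets
import string

# 64 URL-safe symbols: A-Z, a-z, 0-9, "_" and "-".
ALPHABET = string.ascii_letters + string.digits + "_-"


def nanoid_sketch(size: int = 21) -> str:
    """Return `size` characters drawn from ALPHABET with a secure RNG."""
    return "".join(secrets.choice(ALPHABET) for _ in range(size))


def collision_probability(n_ids: int, size: int = 21) -> float:
    """Birthday approximation: p ~ 1 - exp(-n^2 / (2 * alphabet^size))."""
    space = len(ALPHABET) ** size
    return -math.expm1(-(n_ids**2) / (2 * space))


print(nanoid_sketch())
# Estimated chance of at least one collision among a billion IDs.
print(collision_probability(10**9))
```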

chromadbx:

    import chromadb
    from chromadbx import NanoIDGenerator

    client = chromadb.Client()
    col = client.get_or_create_collection("test")
    my_docs = [f"Document {i}" for i in range(10)]
    # Generate one NanoID per document
    col.add(ids=NanoIDGenerator(len(my_docs)), documents=my_docs)

Python:

    # Requires the nanoid package: pip install nanoid
    from nanoid import generate

    import chromadb

    client = chromadb.Client()
    col = client.get_or_create_collection("test")
    my_docs = [f"Document {i}" for i in range(10)]
    # Iterate over the documents themselves; range(my_docs) would raise a TypeError
    col.add(ids=[generate() for _ in my_docs], documents=my_docs)

Hashes

Hashes are another common choice for document IDs. They can be generated in a distributed fashion, and with a cryptographic hash such as SHA-256, collisions are extremely unlikely. A hash of random data is opaque, like a UUID: it does not contain any information about the document itself, so you can change the document without changing the ID. A hash of the document's content, in contrast, is tied to the text and changes whenever the text does.

chromadbx:

Random SHA256:

    import chromadb
    from chromadbx import RandomSHA256Generator

    client = chromadb.Client()
    col = client.get_or_create_collection("test")
    my_docs = [f"Document {i}" for i in range(10)]
    col.add(ids=RandomSHA256Generator(len(my_docs)), documents=my_docs)

Document-based SHA256:

    import chromadb
    from chromadbx import DocumentSHA256Generator

    client = chromadb.Client()
    col = client.get_or_create_collection("test")
    my_docs = [f"Document {i}" for i in range(10)]
    col.add(ids=DocumentSHA256Generator(documents=my_docs), documents=my_docs)

Python:

Random SHA256:

    import hashlib
    import os

    import chromadb


    def generate_sha256_hash() -> str:
        # Generate 16 random bytes
        random_data = os.urandom(16)
        # Create a SHA256 hash object
        sha256_hash = hashlib.sha256()
        # Update the hash object with the random data
        sha256_hash.update(random_data)
        # Return the hexadecimal representation of the hash
        return sha256_hash.hexdigest()


    my_documents = [
        "Hello, world!",
        "Hello, Chroma!",
    ]
    client = chromadb.Client()
    collection = client.get_or_create_collection("collection")
    collection.add(ids=[generate_sha256_hash() for _ in my_documents], documents=my_documents)

Document-based SHA256:

It is also possible to use the document text as the basis for the hash. The downside is that when the document changes, and your application relies on the ID being the hash of the text, you may need to update the ID to match the new hash.

    import hashlib

    import chromadb


    def generate_sha256_hash_from_text(text: str) -> str:
        # Create a SHA256 hash object
        sha256_hash = hashlib.sha256()
        # Update the hash object with the text encoded to bytes
        sha256_hash.update(text.encode("utf-8"))
        # Return the hexadecimal representation of the hash
        return sha256_hash.hexdigest()


    my_documents = [
        "Hello, world!",
        "Hello, Chroma!",
    ]
    client = chromadb.Client()
    collection = client.get_or_create_collection("collection")
    collection.add(
        ids=[generate_sha256_hash_from_text(doc) for doc in my_documents],
        documents=my_documents,
    )
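
One consequence of content-based IDs is that identical texts map to the same ID, so a batch containing repeated documents also contains repeated IDs. The sketch below, using only the standard library, deduplicates such a batch and shows that editing the text yields a different ID. The helper name content_id is hypothetical.

```python
import hashlib


def content_id(text: str) -> str:
    """SHA-256 hex digest of the UTF-8 encoded text (hypothetical helper)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


docs = ["Hello, world!", "Hello, Chroma!", "Hello, world!"]

# Keep the first occurrence of each distinct text, keyed by its hash.
unique = {}
for doc in docs:
    unique.setdefault(content_id(doc), doc)

ids, documents = list(unique), list(unique.values())
print(len(docs), "->", len(documents))  # 3 -> 2

# Editing the text produces a different ID.
print(content_id("Hello, world!") == content_id("Hello, world?"))  # False
```

The resulting ids and documents lists line up one-to-one and could be passed to collection.add as in the examples above.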

Semantic Strategies

In this section we’ll explore a few different use cases for building semantics around document IDs.

  • URL Slugs - if your docs are web pages with permalinks (e.g. blog posts), you can use the URL slug as the document ID.
  • File Paths - if your docs are files on disk, you can use the file path as the document ID.
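
The two strategies above can be sketched with the standard library as follows. The helper names slug_id and file_path_id, the example URL, and the example paths are hypothetical, and real data may need extra normalization (for example lowercasing or handling query strings).

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def slug_id(url: str) -> str:
    """Use the last non-empty path segment of a permalink as the ID."""
    path = urlparse(url).path.rstrip("/")
    return path.rsplit("/", 1)[-1]


def file_path_id(path: str, root: str) -> str:
    """Use the file path relative to a root directory as the ID."""
    return PurePosixPath(path).relative_to(root).as_posix()


print(slug_id("https://example.com/blog/my-first-post/"))        # my-first-post
print(file_path_id("/data/docs/guides/intro.md", "/data/docs"))  # guides/intro.md
```

Using a path relative to a root directory, rather than the absolute path, keeps the IDs stable if the files are moved to another machine or mount point.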

July 31, 2024