Concepts

Tenancy and DB Hierarchies

The following picture illustrates the tenancy and DB hierarchy in Chroma:

Tenancy and DB Hierarchy

Storage

In Chroma single-node, all data about tenancy, databases, collections and documents is stored in a single SQLite database.

Tenants

A tenant is a logical grouping for a set of databases. A tenant is designed to model a single organization or user. A tenant can have multiple databases.

Databases

A database is a logical grouping for a set of collections. A database is designed to model a single application or project. A database can have multiple collections.

Collections

Collections are the grouping mechanism for embeddings, documents, and metadata.

Documents

Chunks of text

Documents in ChromaDB lingo are chunks of text that fits within the embedding model’s context window. Unlike other frameworks that use the term “document” to mean a file, ChromaDB uses the term “document” to mean a chunk of text.

Documents are raw chunks of text that are associated with an embedding. Documents are stored in the database and can be queried for.

Metadata

Metadata is a dictionary of key-value pairs that can be associated with an embedding. Metadata is stored in the database and can be queried for.

Metadata values can be of the following types:

  • strings
  • integers
  • floats (float32)
  • booleans

Embedding Function

Also referred to as embedding model, embedding functions in ChromaDB are wrappers that expose a consistent interface for generating embedding vectors from documents or text queries.

For a list of supported embedding functions see Chroma’s official documentation.

Distance Function

Distance functions help in calculating the difference (distance) between two embedding vectors. ChromaDB supports the following distance functions:

  • Cosine - Useful for text similarity
  • Euclidean (L2) - Useful for text similarity, more sensitive to noise than cosine
  • Inner Product (IP) - Recommender systems

Embedding Model

Embeddings

A representation of a document in the embedding model’s latent space in te form of a vector, list of 32-bit floats (or ints).

Metadata Segment

The metadata segment holds both the documents and their respective metadata fields (if any). The metadata segment is stored in sqlite3 under <persistent_dir>/chroma.sqlite3.

Vector Segment

Segment or Index?

In the below paragraphs we use, the terms “segment” and “index” are used interchangeably.

Under the hood Chroma uses its own fork HNSW lib for indexing and searching vectors.

In a single-node mode, Chroma will create a single vector index for each collection. The index is stored in a UUID-named subdir in your persistent dir, named after the vector segment of the collection.

The HNSW lib uses fast ANN algo to search the vectors in the index.

In addition to the HNSW index, Chroma uses Brute Force index to buffer embeddings in memory before they are added to the HNSW index (see batch_size). As the name suggests the search in the Brute Force index is done by iterating over all the vectors in the index and comparing them to the query using the distance_function. Brute Force index search is exhaustive and works well on small datasets.

August 5, 2024