Write-ahead Log (WAL) Pruning

The Chroma write-ahead log (WAL) is unbounded by default and grows indefinitely, which can lead to high disk usage and slow performance. To prevent this, prune (clean up) the WAL periodically. Below are a couple of tools, including the official and recommended CLI, to help you prune your WAL.

Tooling

There are two ways to prune your WAL:

  • Chroma CLI - the official tooling provided by Chroma and the recommended way to prune your WAL. This functionality is available either from the main branch or in Chroma releases >0.5.5.
  • chroma-ops

Chroma CLI

To prune your WAL you need to install Chroma CLI (it comes as part of the core Chroma package):

    pip install chromadb
    chroma utils vacuum --path /path/to/persist_dir
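
If the vacuum needs to run from a maintenance script rather than by hand, the command shown above can be wrapped in a few lines of Python. This is only a sketch: the helper names `build_vacuum_command` and `run_vacuum` are hypothetical, and the only CLI arguments used are the ones shown above. It assumes the `chroma` executable is on your PATH and that the persist directory contains `chroma.sqlite3`.

```python
import os
import subprocess
from typing import List


def build_vacuum_command(persist_dir: str) -> List[str]:
    """Build the CLI invocation shown above for a given persist directory."""
    # Hypothetical safety check: refuse to run against a directory
    # that does not look like a Chroma persist dir.
    if not os.path.isfile(os.path.join(persist_dir, "chroma.sqlite3")):
        raise FileNotFoundError(f"No chroma.sqlite3 found in {persist_dir}")
    return ["chroma", "utils", "vacuum", "--path", persist_dir]


def run_vacuum(persist_dir: str) -> int:
    """Run the vacuum command and return its exit code.

    stdin/stdout are inherited from the calling process, so anything
    the CLI prints or asks stays visible in the terminal.
    """
    return subprocess.run(build_vacuum_command(persist_dir)).returncode
```

Calling `run_vacuum("/path/to/persist_dir")` returns the CLI's exit code, which a scheduler or wrapper script can check.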

Auto-pruning

Running the above command will enable auto WAL pruning. This means that Chroma will periodically prune the WAL during its normal operations.

Chroma Ops

To prune your WAL you can run the following command:

    pip install chroma-ops
    chops cleanup-wal /path/to/persist_dir

⚠️ IMPORTANT: Always back up your data before you prune the WAL.

Manual

Steps:

Stop Chroma

It is vitally important that you stop Chroma before you prune the WAL. If you don’t stop Chroma, you risk corrupting your data.

  • ⚠️ Stop Chroma
  • 💾 Create a backup of your chroma.sqlite3 file in your persistent dir
  • 👀 Check your current chroma.sqlite3 size (e.g. ls -lh /path/to/persist/dir/chroma.sqlite3)
  • 🖥️ Run the script below
  • 🔭 Check your current chroma.sqlite3 size again to verify that the WAL has been pruned
  • 🚀 Start Chroma
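
The backup and size-check steps above can also be scripted. The following is a minimal sketch using only the Python standard library. The function names `sqlite_size_bytes` and `backup_sqlite` and the `.backup` suffix are hypothetical and not part of Chroma. It assumes Chroma has already been stopped.

```python
import os
import shutil


def sqlite_size_bytes(persist_dir: str) -> int:
    """Return the size of chroma.sqlite3 in bytes."""
    return os.path.getsize(os.path.join(persist_dir, "chroma.sqlite3"))


def backup_sqlite(persist_dir: str, suffix: str = ".backup") -> str:
    """Copy chroma.sqlite3 next to itself and return the backup path.

    Assumes Chroma is stopped, so the file is not being written to.
    """
    src = os.path.join(persist_dir, "chroma.sqlite3")
    dst = src + suffix
    if os.path.exists(dst):
        # Never silently overwrite an earlier backup.
        raise FileExistsError(f"Backup already exists: {dst}")
    shutil.copy2(src, dst)  # copy2 also preserves file timestamps
    return dst
```

Record `sqlite_size_bytes(...)` before and after pruning to compare the two values.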

Script (store it in a file like wal_clean.py)

wal_clean.py

    #!/usr/bin/env python3
    # Call the script: python wal_clean.py ./chroma-test-compact
    import os
    import sqlite3
    from typing import cast, Optional, Dict
    import argparse
    import pickle


    class PersistentData:
        """Stores the data and metadata needed for a PersistentLocalHnswSegment"""

        dimensionality: Optional[int]
        total_elements_added: int
        max_seq_id: int

        id_to_label: Dict[str, int]
        label_to_id: Dict[int, str]
        id_to_seq_id: Dict[str, int]


    def load_from_file(filename: str) -> "PersistentData":
        """Load persistent data from a file"""
        with open(filename, "rb") as f:
            ret = cast(PersistentData, pickle.load(f))
            return ret


    def clean_wal(chroma_persist_dir: str):
        if not os.path.exists(chroma_persist_dir):
            raise Exception(f"Persist {chroma_persist_dir} dir does not exist")
        if not os.path.exists(f'{chroma_persist_dir}/chroma.sqlite3'):
            raise Exception(
                f"SQL file not found in persist dir {chroma_persist_dir}/chroma.sqlite3")
        # Connect to the SQLite database
        conn = sqlite3.connect(f'{chroma_persist_dir}/chroma.sqlite3')
        cursor = conn.cursor()
        # Look up the id and topic of every vector segment
        query = "SELECT id,topic FROM segments where scope='VECTOR'"
        cursor.execute(query)
        results = cursor.fetchall()
        wal_cleanup_queries = []
        for row in results:
            # Read max_seq_id from the segment's index metadata and delete
            # the queue entries for its topic that have a lower seq_id
            metadata = load_from_file(
                f'{chroma_persist_dir}/{row[0]}/index_metadata.pickle')
            wal_cleanup_queries.append(
                f"DELETE FROM embeddings_queue WHERE seq_id < {metadata.max_seq_id} AND topic='{row[1]}';")
        cursor.executescript('\n'.join(wal_cleanup_queries))
        # Close the cursor and connection
        cursor.close()
        conn.close()


    if __name__ == "__main__":
        parser = argparse.ArgumentParser()
        parser.add_argument('persist_dir', type=str)
        arg = parser.parse_args()
        print(arg.persist_dir)
        clean_wal(arg.persist_dir)
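
To see how many entries the WAL holds before and after running the script, you can count the rows per topic. The sketch below assumes the same table and column names that the script above queries (`embeddings_queue`, `topic`). The function name `wal_entries_per_topic` is hypothetical.

```python
import sqlite3
from typing import Dict


def wal_entries_per_topic(sqlite_path: str) -> Dict[str, int]:
    """Count rows in embeddings_queue, grouped by topic."""
    conn = sqlite3.connect(sqlite_path)
    try:
        rows = conn.execute(
            "SELECT topic, COUNT(*) FROM embeddings_queue GROUP BY topic"
        ).fetchall()
    finally:
        conn.close()
    # Each row is a (topic, count) pair
    return {topic: count for topic, count in rows}
```

For example, `wal_entries_per_topic("/path/to/persist/dir/chroma.sqlite3")` returns a dictionary mapping each topic to its number of queued entries. As with the script itself, run it only while Chroma is stopped.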

Run the script

    # Let's create a backup
    tar -czvf /path/to/persist/dir/chroma.sqlite3.backup.tar.gz /path/to/persist/dir/chroma.sqlite3
    lsof /path/to/persist/dir/chroma.sqlite3 # make sure that no process is using the file
    python wal_clean.py /path/to/persist/dir/
    # start chroma

July 30, 2024