Storage Engine
Cassandra processes data at several stages on the write path, starting with the immediate logging of a write and ending in with a write of data to disk:
Logging data in the commit log
Writing data to the memtable
Flushing data from the memtable
Storing data on disk in SSTables
Logging writes to commit logs
When a write occurs, Cassandra writes the data to a local append-only (cassandra.apache.org//glossary.html#commit-log)[commit log] on disk. This action provides configurable durability by logging every write made to a Cassandra node. If an unexpected shutdown occurs, the commit log provides permanent durable writes of the data. On startup, any mutations in the commit log will be applied to (cassandra.apache.org//glossary.html#memtable)[memtables]. The commit log is shared among tables.
All mutations are write-optimized on storage in commit log segments, reducing the number of seeks needed to write to disk. Commit log segments are limited by the commitlog_segment_size option. Once the defined size is reached, a new commit log segment is created. Commit log segments can be archived, deleted, or recycled once all the data is flushed to (SSTables). Commit log segments are truncated when Cassandra has written data older than a certain point to the SSTables. Running nodetool drain before stopping Cassandra will write everything in the memtables to SSTables and remove the need to sync with the commit logs on startup.
- commitlog_segment_size: The default size is 32MiB, which is almost always fine, but if you are archiving commitlog segments (see commitlog_archiving.properties), then you probably want a finer granularity of archiving; 8 or 16 MiB is reasonable.
commitlog_segment_size
also determines the default value of max_mutation_size incassandra.yaml
. By default,max_mutation_size
is a half the size ofcommitlog_segment_size
.
If |
commitlog_sync: may be either periodic or batch.
batch
: In batch mode, Cassandra won’t acknowledge writes until the commit log has been fsynced to disk.periodic
: In periodic mode, writes are immediately acknowledged, and the commit log is simply synced every “commitlog_sync_period” milliseconds.commitlog_sync_period
: Time to wait between “periodic” fsyncs Default Value: 10000ms
Default Value: batch
In the event of an unexpected shutdown, Cassandra can lose up to the sync period or more if the sync is delayed. If using |
- commitlog_directory: This option is commented out by default. When running on magnetic HDD, this should be a separate spindle than the data directories. If not set, the default directory is
$CASSANDRA_HOME/data/commitlog
.
Default Value: /var/lib/cassandra/commitlog
- commitlog_compression: Compression to apply to the commitlog. If omitted, the commit log will be written uncompressed. LZ4, Snappy,Deflate and Zstd compressors are supported.
Default Value: (complex option):
# - class_name: LZ4Compressor
# parameters:
- commitlog_total_space: Total space to use for commit logs on disk. This option is commented out by default. If space gets above this value, Cassandra will flush every dirty table in the oldest segment and remove it. So a small total commit log space will tend to cause more flush activity on less-active tables. The default value is the smallest between 8192 and 1/4 of the total space of the commitlog volume.
Default Value: 8192MiB
Memtables
When a write occurs, Cassandra also writes the data to a memtable. Memtables are in-memory structures where Cassandra buffers writes. In general, there is one active memtable per table. The memtable is a write-back cache of data partitions that Cassandra looks up by key. Memtables may be stored entirely on-heap or partially off-heap, depending on memtable_allocation_type.
The memtable stores writes in sorted order until reaching a configurable limit. When the limit is reached, memtables are flushed onto disk and become immutable SSTables. Flushing can be triggered in several ways:
The memory usage of the memtables exceeds the configured threshold (see memtable_cleanup_threshold)
The
commit log
approaches its maximum size, and forces memtable flushes in order to allow commit log segments to be freed.
When a triggering event occurs, the memtable is put in a queue that is flushed to disk. Flushing writes the data to disk, in the memtable-sorted order. A partition index is also created on the disk that maps the tokens to a location on disk.
The queue can be configured with either the memtable_heap_space or memtable_offheap_space setting in the cassandra.yaml
file. If the data to be flushed exceeds the memtable_cleanup_threshold
, Cassandra blocks writes until the next flush succeeds. You can manually flush a table using nodetool flush or nodetool drain
(flushes memtables without listening for connections to other nodes). To reduce the commit log replay time, the recommended best practice is to flush the memtable before you restart the nodes. If a node stops working, replaying the commit log restores writes to the memtable that were there before it stopped.
Data in the commit log is purged after its corresponding data in the memtable is flushed to an SSTable on disk.
SSTables
SSTables are the immutable data files that Cassandra uses for persisting data on disk. SSTables are maintained per table. SSTables are immutable, and never written to again after the memtable is flushed. Thus, a partition is typically stored across multiple SSTable files, as data is added or modified.
Each SSTable is comprised of multiple components stored in separate files:
Data.db
The actual data, i.e. the contents of rows.
Partitions.db
The partition index file maps unique prefixes of decorated partition keys to data file locations, or, in the case of wide partitions indexed in the row index file, to locations in the row index file.
Rows.db
The row index file only contains entries for partitions that contain more than one row and are bigger than one index block. For all such partitions, it stores a copy of the partition key, a partition header, and an index of row block separators, which map each row key into the first block where any content with equal or higher row key can be found.
Index.db
An index from partition keys to positions in the Data.db
file. For wide partitions, this may also include an index to rows within a partition.
Summary.db
A sampling of (by default) every 128th entry in the Index.db
file.
Filter.db
A Bloom Filter of the partition keys in the SSTable.
CompressionInfo.db
Metadata about the offsets and lengths of compression chunks in the Data.db
file.
Statistics.db
Stores metadata about the SSTable, including information about timestamps, tombstones, clustering keys, compaction, repair, compression, TTLs, and more.
Digest.crc32
A CRC-32 digest of the Data.db
file.
TOC.txt
A plain text list of the component files for the SSTable.
SAI*.db
Index information for Storage-Attached indexes. Only present if SAI is enabled for the table.
Note that the |
Within the Data.db
file, rows are organized by partition. These partitions are sorted in token order (i.e. by a hash of the partition key when the default partitioner, Murmur3Partition
, is used). Within a partition, rows are stored in the order of their clustering keys.
SSTables can be optionally compressed using block-based compression.
As SSTables are flushed to disk from memtables
or are streamed from other nodes, Cassandra triggers compactions which combine multiple SSTables into one. Once the new SSTable has been written, the old SSTables can be removed.
SSTable Versions
From (BigFormat#BigVersion).
The version numbers, to date are:
Version 0
b (0.7.0): added version to sstable filenames
c (0.7.0): bloom filter component computes hashes over raw key bytes instead of strings
d (0.7.0): row size in data component becomes a long instead of int
e (0.7.0): stores undecorated keys in data and index components
f (0.7.0): switched bloom filter implementations in data component
g (0.8): tracks flushed-at context in metadata component
Version 1
h (1.0): tracks max client timestamp in metadata component
hb (1.0.3): records compression ration in metadata component
hc (1.0.4): records partitioner in metadata component
hd (1.0.10): includes row tombstones in maxtimestamp
he (1.1.3): includes ancestors generation in metadata component
hf (1.1.6): marker that replay position corresponds to 1.1.5+ millis-based id (see CASSANDRA-4782)
ia (1.2.0):
column indexes are promoted to the index file
records estimated histogram of deletion times in tombstones
bloom filter (keys and columns) upgraded to Murmur3
ib (1.2.1): tracks min client timestamp in metadata component
ic (1.2.5): omits per-row bloom filter of column names
Version 2
ja (2.0.0):
super columns are serialized as composites (note that there is no real format change, this is mostly a marker to know if we should expect super columns or not. We do need a major version bump however, because we should not allow streaming of super columns into this new format)
tracks max local deletiontime in sstable metadata
records bloom_filter_fp_chance in metadata component
remove data size and column count from data file (CASSANDRA-4180)
tracks max/min column values (according to comparator)
jb (2.0.1):
switch from crc32 to adler32 for compression checksums
checksum the compressed data
ka (2.1.0):
new Statistics.db file format
index summaries can be downsampled and the sampling level is persisted
switch uncompressed checksums to adler32
tracks presence of legacy (local and remote) counter shards
la (2.2.0): new file name format
lb (2.2.7): commit log lower bound included
Version 3
ma (3.0.0):
swap bf hash order
store rows natively
mb (3.0.7, 3.7): commit log lower bound included
mc (3.0.8, 3.9): commit log intervals included
md (3.0.18, 3.11.4): corrected sstable min/max clustering
me (3.0.25, 3.11.11): added hostId of the node from which the sstable originated
Version 4
na (4.0-rc1): uncompressed chunks, pending repair session, isTransient, checksummed sstable metadata file, new Bloomfilter format
nb (4.0.0): originating host id
Version 5
oa (5.0): improved min/max, partition level deletion presence marker, key range (CASSANDRA-18134)
Long deletionTime to prevent TTL overflow
token space coverage
Trie-indexed Based SSTable Versions (BTI)
Cassandra 5.0 introduced new SSTable formats BTI for Trie-indexed SSTables. To use the BTI formats configure it cassandra.yaml
like
sstable:
selected_format: bti
Versions come from (BtiFormat#BtiVersion).
For implementation docs see (BtiFormat.md).
Version 5
- da (5.0): initial version of the BIT format
Example Code
The following example is useful for finding all sstables that do not match the “ib” SSTable version
find /var/lib/cassandra/data/ -type f | grep -v -- -ib- | grep -v "/snapshots"