Architecture
Storage Engine
M3DB is a time series database that was primarily designed to be horizontally scalable and able to handle high data throughput. Time Series Compression One of M3DB’s biggest strengths as a time series database (as opposed to using a more general-purpose horizontally scalable, distributed database like Cassandra) is its ability to compress time series data resulting in huge memory and disk savings. There are two compression algorithms used in M3DB: M3TSZ and protobuf encoding.
Sharding
Timeseries keys are hashed to a fixed set of virtual shards. Virtual shards are then assigned to physical nodes. M3DB can be configured to use any hashing function and a configured number of shards. By default murmur3 is used as the hashing function and 4096 virtual shards are configured. Benefits Shards provide a variety of benefits throughout the M3DB stack: They make horizontal scaling easier and adding / removing nodes without downtime trivial at the cluster level.
Consistency Levels
M3DB provides variable consistency levels for read and write operations, as well as cluster connection operations. These consistency levels are handled at the client level. Write consistency levels One: Corresponds to a single node succeeding for an operation to succeed. Majority: Corresponds to the majority of nodes succeeding for an operation to succeed. All: Corresponds to all nodes succeeding for an operation to succeed. Read consistency levels One: Corresponds to reading from a single node to designate success.
Storage
Overview The primary unit of long-term storage for M3DB are fileset files which store compressed streams of time series values, one per shard block time window size. They are flushed to disk after a block time window becomes unreachable, that is the end of the time window for which that block can no longer be written to. If a process is killed before it has a chance to flush the data for the current time window to disk it must be restored from the commit log (or a peer that is responsible for the same shard if replication factor is larger than 1.
Commit Logs And Snapshot Files
Overview M3DB has a commit log that is equivalent to the commit log or write-ahead-log in other databases. The commit logs are completely uncompressed (no M3TSZ encoding), and there is one per database (multiple namespaces in a single process will share a commit log.) Integrity Levels There are two integrity levels available for commit logs: Synchronous: write operations must wait until it has finished writing an entry in the commit log to complete.
Peer Streaming
Client Peer streaming is managed by the M3DB client. It fetches all blocks from peers for a specified time range for bootstrapping purposes. It performs the following steps: Fetch all metadata for blocks from all peers who own the specified shard Compares metadata from different peers and determines the best peer(s) from which to stream the actual data Streams the block data from peers Steps 1, 2 and 3 all happen concurrently.
Caching
Overview Blocks that are still being actively compressed / M3TSZ encoded must be kept in memory until they are sealed and flushed to disk. Blocks that have already been sealed, however, don’t need to remain in-memory. In order to support efficient reads, M3DB implements various caching policies which determine which flushed blocks are kept in memory, and which are not. The “cache” itself is not a separate datastructure in memory, cached blocks are simply stored in their respective in-memory objects with various different mechanisms (depending on the chosen cache policy) determining which series / blocks are evicted and which are retained.