MatrixCube Introduction
MatrixCube is a fundamental library for building distributed systems, offering guarantees of reliability, consistency, and scalability. It is designed to facilitate building distributed, stateful applications, allowing developers to focus only on the business logic of a single node. MatrixCube is currently built upon Multi-Raft to provide a replicated state machine, and will migrate to the Paxos family of protocols to better support scenarios spanning multiple data centers.
Unlike many other distributed systems, MatrixCube is designed to run as part of the storage nodes. A MatrixOne distributed deployment does not have dedicated scheduling nodes, and MatrixCube cannot work as a standalone module.
MatrixCube Architecture
Key Concepts
There are several key concepts for understanding how MatrixCube works.
Store
A MatrixCube distributed system consists of several physical machines, and data are stored across these machines. We call each machine inside this cluster a Store.
Shard
Data in a database are organized logically in tables. For physical storage, the data are split into different partitions in order to achieve better scalability. Each partition is called a Shard. In our design, a newly created table starts as a single Shard. When the size of the table exceeds the Shard size limit, the Shard splits.
Replica
To provide reliable service, each Shard is stored more than once: several copies are kept in different Stores. We call each copy a Replica. A Shard can have multiple Replicas, and the data in each Replica are the same.
Raft-group and Leader
Since the multiple Replicas of a Shard are located in different Stores, once one Replica is updated, the other Replicas must be updated as well to keep the data consistent. No matter which Replica a client queries, it always gets the same result. We use the Raft protocol to implement this consensus process: the Replicas of a particular Shard form a Raft-group.
In each Raft-group, a Leader is elected as the representative of the group. All consistent read and write requests are handled only by the Leader.
Learn more about: How does a Leader get elected in Raft?
Data Storage
DataStorage is an interface for implementing a distributed storage service. It must be defined before MatrixCube can be used, and it has to be implemented according to the characteristics of the underlying storage engine. Common distributed storage services, such as Distributed Redis, Distributed Key-Value, and Distributed File System, can easily be constructed on top of DataStorage. A default Key-Value based DataStorage is provided to meet the requirements of most scenarios.
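For illustration, the kind of contract such an engine has to satisfy might look roughly like the sketch below. The interface name, methods, and signatures here are simplified assumptions for explanation, not MatrixCube's actual DataStorage definition:

```go
package storage

// DataStorageSketch is an illustrative stand-in for the kind of contract a
// storage engine must satisfy. The real MatrixCube DataStorage interface
// differs in names and details.
type DataStorageSketch interface {
	// Write applies a batch of key-value mutations for one Shard.
	Write(shardID uint64, batch []KV) error
	// Read returns the value for a key within one Shard.
	Read(shardID uint64, key []byte) ([]byte, error)
	// SplitCheck reports a split key once the Shard exceeds size bytes.
	SplitCheck(shardID uint64, size uint64) (splitKey []byte, ok bool, err error)
	// Close releases the resources held by the engine.
	Close() error
}

// KV is a single key-value mutation.
type KV struct {
	Key   []byte
	Value []byte
}
```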
Prophet
Prophet is a scheduling module. It takes charge of Auto-Rebalance, which keeps the storage level and read/write throughput balanced across Stores. The initial 3 Stores of a MatrixCube cluster are all Prophet Nodes.
Learn more about: How does Prophet handle the scheduling?
Raftstore
Raftstore is the core component of MatrixCube; it implements the most important features of MatrixCube:

- Metadata storage: stores the metadata of Store, Shard, and Raft-log.
- Multi-Raft management: manages the relationships between Store, Shard, Replica, and Raft-group; the communication between multiple Raft-groups; and Leader election and re-election.
- Global Routing: a global routing table is constructed via the Event Notify mechanism of Prophet, and read/write requests are routed based on this table.
- Shard Proxy: a proxy for read/write requests to a Shard. With the proxy, the detailed implementation of Multi-Raft is invisible and all Stores are equal for users. A user can send a request to any Store, and it will be routed to the right Store by the Shard Proxy (see the sketch after this list).
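To make the "all Stores are equal" point concrete, here is a minimal sketch of a client session. The client type and its methods are hypothetical stand-ins, not MatrixCube's real API:

```go
package main

import "fmt"

// fakeClient stands in for a MatrixCube client; connecting to any Store is
// enough, because the Shard Proxy forwards each request to the right Leader.
type fakeClient struct{ addr string }

func dial(addr string) *fakeClient { return &fakeClient{addr: addr} }

func (c *fakeClient) Set(key, value string) error {
	// In a real cluster, the proxy on c.addr would route this write to the
	// Leader Replica of the Shard that owns key.
	fmt.Printf("SET %s=%s via store %s\n", key, value, c.addr)
	return nil
}

func main() {
	// Any Store address works; routing is handled transparently.
	c := dial("store-2:20000")
	_ = c.Set("user:1", "alice")
}
```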
Learn more about: How do the Shard Proxy and Global Routing work?
Key Features
Strong Consistency
MatrixCube provides strong consistency: after any successful write, subsequent reads are guaranteed to return the latest value, no matter which Store they are sent to.
Fault Tolerance
The distributed storage service implemented by MatrixCube is fault-tolerant and highly available. When a Shard has 2*N+1 Replicas, the system keeps working as long as no more than N Replicas fail.
For example, a cluster with 3 Stores can survive the failure of 1 Store; a cluster with 5 Stores can survive the failure of 2 Stores.
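This tolerance follows from Raft's majority quorum: out of 2*N+1 Replicas, any N failures still leave a majority of N+1 alive. The few lines of Go below simply restate that arithmetic:

```go
package main

import "fmt"

// tolerable returns how many Replica failures a Shard with the given number
// of Replicas can survive while still keeping a majority quorum.
func tolerable(replicas int) int {
	return (replicas - 1) / 2
}

func main() {
	for _, n := range []int{3, 5, 7} {
		fmt.Printf("%d replicas tolerate %d failures\n", n, tolerable(n))
	}
	// Output:
	// 3 replicas tolerate 1 failures
	// 5 replicas tolerate 2 failures
	// 7 replicas tolerate 3 failures
}
```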
Shard Splitting
There is a certain limit to a Shard
size. Whenever a Shard
exceeds its storage limit, MatrixCube splits a Shard
into two Shards
and keep each Shard
with the same storage level.
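As a sketch of the idea (with an assumed size limit and an assumed split key, since the real split-key selection is done by the storage engine):

```go
package main

import "fmt"

// shard models just enough state to illustrate splitting.
type shard struct {
	start, end string // key range [start, end)
	sizeBytes  uint64
}

// shardSizeLimit is an assumed limit for this sketch, not MatrixCube's default.
const shardSizeLimit = 96 << 20 // 96 MB

// maybeSplit returns two halves when the shard exceeds the size limit;
// splitKey is whatever key divides the data roughly in half.
func maybeSplit(s shard, splitKey string) (left, right shard, split bool) {
	if s.sizeBytes <= shardSizeLimit {
		return s, shard{}, false
	}
	left = shard{start: s.start, end: splitKey, sizeBytes: s.sizeBytes / 2}
	right = shard{start: splitKey, end: s.end, sizeBytes: s.sizeBytes - left.sizeBytes}
	return left, right, true
}

func main() {
	l, r, ok := maybeSplit(shard{start: "a", end: "z", sizeBytes: 200 << 20}, "m")
	fmt.Println(ok, l, r)
}
```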
You can find a more detailed description of this process in How does the Shard Splitting work?
Auto-Rebalance
A distributed system should leverage the computation power and storage of all its nodes. In a MatrixCube cluster, when Stores are added or removed, an Auto-Rebalance occurs, moving data across Stores until every single Store holds a balanced share.
Learn more about: How does the Auto-Rebalance work?
Scale-out
With Shard splitting and Auto-Rebalance, a MatrixCube distributed system is capable of scaling out: the cluster's storage and throughput capacity are proportional to the number of Stores.
User-defined storage engine
MatrixCube places no restrictions on the standalone data storage engine. Any storage engine that implements the DataStorage interface defined by MatrixCube can be used to build a MatrixCube-based distributed system. By default, MatrixCube provides a Pebble-based Key-Value storage engine (Pebble is a Go implementation of RocksDB).
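As an example of plugging in a standalone engine, the snippet below uses Pebble's public API directly. Hooking it up to MatrixCube's actual DataStorage interface is omitted, so treat this as the single-node half of the story:

```go
package main

import (
	"fmt"
	"log"

	"github.com/cockroachdb/pebble"
)

func main() {
	// Open (or create) a Pebble store in ./demo-data.
	db, err := pebble.Open("demo-data", &pebble.Options{})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Standalone writes and reads like these are what MatrixCube turns into
	// replicated, routed operations across the cluster.
	if err := db.Set([]byte("k1"), []byte("v1"), pebble.Sync); err != nil {
		log.Fatal(err)
	}
	value, closer, err := db.Get([]byte("k1"))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("k1=%s\n", value)
	closer.Close()
}
```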
User-defined Read/Write
As a general distributed framework, MatrixCube can serve as the base for many different distributed storage systems. Users can also customize their own read/write commands: as long as a command works in the standalone version, MatrixCube can help upgrade it to a distributed version.
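As a sketch of what a customized write command can look like on a single node, consider an INCR-style counter. The handler below is ordinary standalone logic; the hooks for registering it with MatrixCube are framework-specific and not shown here:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// incr is a user-defined write command: it bumps a counter stored under key.
// On a single node this is all the logic there is; once plugged into
// MatrixCube (via its command-registration hooks, omitted here), the same
// handler would run on every Replica in the Shard's Raft-group.
func incr(store map[string][]byte, key string, delta uint64) uint64 {
	cur := uint64(0)
	if v, ok := store[key]; ok {
		cur = binary.BigEndian.Uint64(v)
	}
	cur += delta
	buf := make([]byte, 8)
	binary.BigEndian.PutUint64(buf, cur)
	store[key] = buf
	return cur
}

func main() {
	store := map[string][]byte{}
	fmt.Println(incr(store, "hits", 1)) // 1
	fmt.Println(incr(store, "hits", 2)) // 3
}
```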