Sharding
ArangoDB organizes its collection data in shards. Sharding allows the use of multiple machines to run a cluster of ArangoDB instances that together constitute a single database system.
Sharding is used to distribute data across physical machines in an ArangoDB Cluster. It is a method to determine the optimal placement of documents on individual DB-Servers.
This enables you to store much more data, since ArangoDB distributes the data automatically across the different servers. In many situations you can also reap a benefit in data throughput, again because the load can be distributed to multiple machines.
Using sharding allows ArangoDB to support deployments with large amounts of data, which would not fit on a single machine. A high rate of write/read operations or AQL queries can also overwhelm a single server's RAM and disk capacity.
There are two main ways of scaling a database system:
- Vertical scaling
- Horizontal scaling
Vertical scaling means upgrading to better server hardware (faster CPU, more RAM/disk). This can be a cost-effective way of scaling, because administration is easy and performance characteristics do not change much. Reasoning about the behavior of a single machine is also a lot easier than reasoning about multiple machines. However, at a certain point larger machines either are not available anymore or their cost becomes prohibitive.
Horizontal scaling is about increasing the number of servers. Such servers are typically based on commodity hardware, which is readily available from many different cloud providers. The capability of each single machine may not be high, but the combined computing power of these machines can be arbitrarily large. Adding more machines on demand is also typically easier and more cost-effective than pre-provisioning a single large machine. Increased complexity in infrastructure can be managed using modern containerization and cluster orchestration tools like Kubernetes.
To achieve this, ArangoDB splits your dataset into so-called shards. You can choose the number of shards according to your needs. Proper sharding is essential to achieve optimal performance. From the outside, the process of splitting the data and assembling it again is fully transparent, and as such ArangoDB achieves the goals of what other systems call "master-master replication".
An application may talk to any Coordinator, which automatically figures out where the data is currently stored (read case) or is to be stored (write case). The information about the shards is shared across all Coordinators using the Agency.
Shards are configured per collection, so multiple shards of data form the collection as a whole. To determine in which shard a document is to be stored, ArangoDB computes a hash across the shard key values. By default, this hash is computed from the `_key` document attribute.
Every shard is a local collection on the DB-Server that hosts it. Consider an example with 5 shards and 3 replicas: every leading shard S1 through S5 is followed by 2 replicas R1 through R5. The collection creation mechanism on the ArangoDB Coordinators tries to distribute the shards of a collection among the DB-Servers as evenly as possible. Sharding the data into 5 parts makes the best use of all our machines in this example. We further choose a replication factor of 3, as it is a reasonable compromise between performance and data safety. This means that the collection creation ideally distributes 15 shards, 5 of which are leaders, each followed by 2 replicas. This in turn implies that a complete pristine replication involves 10 follower shards which need to catch up with their leaders.
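The arithmetic of the example can be made concrete with a small sketch. The round-robin placement and the server names are invented for illustration; ArangoDB's actual placement logic is more involved:

```javascript
// Illustrative sketch only: 5 shards with replicationFactor 3, spread
// round-robin over 3 DB-Servers (server names are made up).
const servers = ["DBServer1", "DBServer2", "DBServer3"];
const numberOfShards = 5;
const replicationFactor = 3;

const placement = [];
for (let s = 0; s < numberOfShards; s++) {
  const copies = [];
  for (let r = 0; r < replicationFactor; r++) {
    copies.push({
      shard: "S" + (s + 1),
      server: servers[(s + r) % servers.length],
      role: r === 0 ? "leader" : "follower", // first copy leads
    });
  }
  placement.push(copies);
}
// Result: 5 leaders + 10 followers = 15 shard copies in total,
// with each shard's copies on 3 distinct servers.
```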
Not all use cases require horizontal scalability. In such cases, consider the OneShard feature as an alternative to flexible sharding.
Shard Keys
ArangoDB uses the specified shard key attributes to determine in which shard a given document is stored. Choosing the right shard key can have a significant impact: it can reduce network traffic and increase performance.
ArangoDB uses consistent hashing to compute the target shard from the given values (as specified via the `shardKeys` collection property). The ideal set of shard keys allows ArangoDB to distribute documents evenly across your shards and your DB-Servers. By default, ArangoDB uses the `_key` field as the shard key. For a custom shard key you should consider a few different properties:
- Cardinality: The cardinality of a set is the number of distinct values that it contains. A shard key with only N distinct values cannot be hashed onto more than N shards. Consider using multiple shard keys if one of your values has a low cardinality.
- Frequency: Consider how often a given shard key value may appear in your data. Having a lot of documents with identical shard keys leads to unevenly distributed data. This means that a single shard could become a bottleneck in your cluster. The effectiveness of horizontal scaling is reduced if most documents end up in a single shard. Shards are not divisible at this time, so paying attention to the size of shards is important.
Consider both frequency and cardinality when picking a shard key; if necessary, consider picking multiple shard keys.
Configuring Shards
The number of shards can be configured at collection creation time, e.g. in the web interface or via arangosh:
```js
db._create("sharded_collection", { "numberOfShards": 4, "shardKeys": ["country"] });
```
In the example above, `country` is used as the shard key, which can be useful to keep the data of every country in one shard. This results in better performance for queries working on a per-country basis.
It is also possible to specify multiple `shardKeys`.
Note however that if you change the shard keys from their default `["_key"]`, then finding a document in the collection by its primary key involves a request to every single shard. This can be mitigated: all CRUD APIs and AQL support using the shard key values as lookup hints. Just send them as part of the update/replace or removal operation, or, in case of AQL, use a document reference or an object for the UPDATE, REPLACE, or REMOVE operation which includes the shard key attributes:
```aql
UPDATE { _key: "123", country: "…" } WITH { … } IN sharded_collection
If custom shard keys are used, you can no longer prescribe the primary key value of a new document but must use the automatically generated one. This restriction comes from the fact that ensuring the uniqueness of the primary key would be very inefficient if the user could specify it.
On which DB-Server in a Cluster a particular shard is kept is undefined. There is no option to configure an affinity based on certain shard keys.
For more information on shard rebalancing and administration topics, please have a look at the Cluster Administration section.
Indexes On Shards
Unique indexes (hash, skiplist, persistent) on sharded collections are only allowed if the fields used to determine the shard key are also included in the list of attribute paths for the index:
| shardKeys | indexKeys | Unique index |
|-----------|-----------|--------------|
| a         | a         | allowed      |
| a         | b         | not allowed  |
| a         | a, b      | allowed      |
| a, b      | a         | not allowed  |
| a, b      | b         | not allowed  |
| a, b      | a, b      | allowed      |
| a, b      | a, b, c   | allowed      |
| a, b, c   | a, b      | not allowed  |
| a, b, c   | a, b, c   | allowed      |
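The rule behind the table boils down to a subset check: a unique index is allowed if and only if every shard key attribute also appears among the index attributes. A small sketch of that check (the helper name is our own):

```javascript
// The unique-index rule from the table above: the shard key attributes
// must be a subset of the index attributes.
function uniqueIndexAllowed(shardKeys, indexKeys) {
  return shardKeys.every((k) => indexKeys.includes(k));
}

uniqueIndexAllowed(["a"], ["a", "b"]);         // true  (allowed)
uniqueIndexAllowed(["a", "b"], ["b"]);         // false (not allowed)
uniqueIndexAllowed(["a", "b"], ["a", "b", "c"]); // true (allowed)
```

This also explains why the default shard key `_key` poses no problem: the primary index always contains `_key`.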
High Availability
A cluster can still read from a collection if shards become unavailable for some reason. The data residing on the unavailable shards cannot be accessed, however reads on other shards can still succeed.
In a production environment, you should always deploy your collections with a `replicationFactor` greater than 1 to ensure that the shards stay available even when a machine fails.
Storage Capacity
The cluster distributes your data across the machines in your cluster. Every machine only contains a subset of your data; thus, the cluster has the combined storage capacity of all your machines.
Please note that increasing the replication factor also increases the space required to keep all your data in the cluster.
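As a back-of-the-envelope sketch (the function and numbers are purely illustrative, ignoring indexes and overhead): every document is stored `replicationFactor` times, so the usable net capacity shrinks accordingly:

```javascript
// Illustrative capacity estimate only: replication multiplies the space
// each document occupies, so net capacity divides by the factor.
function usableCapacityGiB(machines, capacityPerMachineGiB, replicationFactor) {
  return (machines * capacityPerMachineGiB) / replicationFactor;
}

usableCapacityGiB(3, 100, 1); // 300 GiB without replication
usableCapacityGiB(3, 100, 3); // 100 GiB with replicationFactor 3
```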