Architecture
The cluster architecture of ArangoDB is a CP master/master model with no single point of failure. With “CP” we mean that in the presence of anetwork partition, the database prefers internal consistency over availability. With “master/master” we mean that clients can send their requests to an arbitrary node, and experience the same view on thedatabase regardless. “No single point of failure” means that the clustercan continue to serve requests, even if one machine fails completely.
In this way, ArangoDB has been designed as a distributed multi-model database. This section gives a short outline on the cluster architecture andhow the above features and capabilities are achieved.
Structure of an ArangoDB cluster
An ArangoDB cluster consists of a number of ArangoDB instanceswhich talk to each other over the network. They play different roles,which will be explained in detail below. The current configuration of the cluster is held in the “Agency”, which is a highly-availableresilient key/value store based on an odd number of ArangoDB instancesrunning Raft Consensus Protocol.
For the various instances in an ArangoDB cluster there are 4 distinctroles: Agents, Coordinators, Primary and Secondary DBservers. In thefollowing sections we will shed light on each of them. Note that thetasks for all roles run the same binary from the same Docker image.
Agents
One or multiple Agents form the Agency in an ArangoDB cluster. TheAgency is the central place to store the configuration in a cluster. Itperforms leader elections and provides other synchronization services forthe whole cluster. Without the Agency none of the other components canoperate.
While generally invisible to the outside it is the heart of thecluster. As such, fault tolerance is of course a must have for theAgency. To achieve that the Agents are using the Raft ConsensusAlgorithm. The algorithm formally guaranteesconflict free configuration management within the ArangoDB cluster.
At its core the Agency manages a big configuration tree. It supportstransactional read and write operations on this tree, and other serverscan subscribe to HTTP callbacks for all changes to the tree.
Coordinators
Coordinators should be accessible from the outside. These are the onesthe clients talk to. They will coordinate cluster tasks likeexecuting queries and running Foxx services. They know where thedata is stored and will optimize where to run user supplied queries orparts thereof. Coordinators are stateless and can thus easily be shut downand restarted as needed.
Primary DBservers
Primary DBservers are the ones where the data is actually hosted. Theyhost shards of data and using synchronous replication a primary mayeither be leader or follower for a shard.
They should not be accessed from the outside but indirectly through thecoordinators. They may also execute queries in part or as a whole whenasked by a coordinator.
Secondaries
Secondary DBservers are asynchronous replicas of primaries. If one isusing only synchronous replication, one does not need secondaries at all.For each primary, there can be one or more secondaries. Since thereplication works asynchronously (eventual consistency), the replicationdoes not impede the performance of the primaries. On the other hand,their replica of the data can be slightly out of date. The secondariesare perfectly suitable for backups as they don’t interfere with thenormal cluster operation.
Cluster ID
Every non-Agency ArangoDB instance in a cluster is assigned a uniqueID during its startup. Using its ID a node is identifiablethroughout the cluster. All cluster operations will communicatevia this ID.
Sharding
Using the roles outlined above an ArangoDB cluster is able to distributedata in so called shards across multiple primaries. From the outsidethis process is fully transparent and as such we achieve the goals ofwhat other systems call “master-master replication”. In an ArangoDBcluster you talk to any coordinator and whenever you read or write datait will automatically figure out where the data is stored (read) or tobe stored (write). The information about the shards is shared across thecoordinators using the Agency.
Also see Sharding in theAdministration chapter.
Many sensible configurations
This architecture is very flexible and thus allows many configurations,which are suitable for different usage scenarios:
- The default configuration is to run exactly one coordinator andone primary DBserver on each machine. This achieves the classicalmaster/master setup, since there is a perfect symmetry between thedifferent nodes, clients can equally well talk to any one of thecoordinators and all expose the same view to the data store.
- One can deploy more coordinators than DBservers. This is a sensibleapproach if one needs a lot of CPU power for the Foxx services, because they run on the coordinators.
- One can deploy more DBservers than coordinators if more data capacity is needed and the query performance is the lesser bottleneck
- One can deploy a coordinator on each machine where an applicationserver (e.g. a node.js server) runs, and the Agents and DBservers on a separate set of machines elsewhere. This avoids a network hop between the application server and the database and thus decreaseslatency. Essentially, this moves some of the database distributionlogic to the machine where the client runs.These shall suffice for now. The important piece of information hereis that the coordinator layer can be scaled and deployed independentlyfrom the DBserver layer.
Replication
ArangoDB offers two ways of data replication within a cluster, synchronousand asynchronous. In this section we explain some details and highlightthe advantages and disadvantages respectively.
Synchronous replication with automatic fail-over
Synchronous replication works on a per-shard basis. One configures foreach collection, how many copies of each shard are kept in the cluster.At any given time, one of the copies is declared to be the “leader” andall other replicas are “followers”. Write operations for this shardare always sent to the DBserver which happens to hold the leader copy,which in turn replicates the changes to all followers before the operationis considered to be done and reported back to the coordinator.Read operations are all served by the server holding the leader copy,this allows to provide snapshot semantics for complex transactions.
If a DBserver fails that holds a follower copy of a shard, then the leadercan no longer synchronize its changes to that follower. After a short timeout(3 seconds), the leader gives up on the follower, declares it to beout of sync, and continues service without the follower. When the serverwith the follower copy comes back, it automatically resynchronizes itsdata with the leader and synchronous replication is restored.
If a DBserver fails that holds a leader copy of a shard, then the leadercan no longer serve any requests. It will no longer send a heartbeat to the Agency. Therefore, a supervision process running in the Raft leaderof the Agency, can take the necessary action (after 15 seconds of missingheartbeats), namely to promote one of the servers that hold in-syncreplicas of the shard to leader for that shard. This involves a reconfiguration in the Agency and leads to the fact that coordinatorsnow contact a different DBserver for requests to this shard. Serviceresumes. The other surviving replicas automatically resynchronize theirdata with the new leader. When the DBserver with the original leadercopy comes back, it notices that it now holds a follower replica,resynchronizes its data with the new leader and order is restored.
All shard data synchronizations are done in an incremental way, such thatresynchronizations are quick. This technology allows to move shards(follower and leader ones) between DBservers without service interruptions.Therefore, an ArangoDB cluster can move all the data on a specific DBserverto other DBservers and then shut down that server in a controlled way. This allows to scale down an ArangoDB cluster without service interruption,loss of fault tolerance or data loss. Furthermore, one can re-balance thedistribution of the shards, either manually or automatically.
All these operations can be triggered via a REST/JSON API or via thegraphical web UI. All fail-over operations are completely handled withinthe ArangoDB cluster.
Obviously, synchronous replication involves a certain increased latency forwrite operations, simply because there is one more network hop within the cluster for every request. Therefore the user can set the replication factorto 1, which means that only one copy of each shard is kept, therebyswitching off synchronous replication. This is a suitable setting forless important or easily recoverable data for which low latency write operations matter.
Asynchronous replication with automatic fail-over
Asynchronous replication works differently, in that it is organizedusing primary and secondary DBservers. Each secondary server replicatesall the data held on a primary by polling in an asynchronous way. Thisprocess has very little impact on the performance of the primary. Thedisadvantage is that there is a delay between the confirmation of awrite operation that is sent to the client and the actual replication ofthe data. If the master server fails during this delay, then committedand confirmed data can be lost.
Nevertheless, we also offer automatic fail-over with this setup. Contrary to the synchronous case, here the fail-over management is done from outsidethe ArangoDB cluster. In a future version we might move this managementinto the supervision process in the Agency, but as of now, the managementis done via the Mesos framework scheduler for ArangoDB (see below).
The granularity of the replication is a whole ArangoDB instance withall data that resides on that instance, which means thatyou need twice as many instances as without asynchronous replication. Synchronous replication is more flexible in that respect, you can have smaller and larger instances, and if one fails, the data can be rebalancedacross the remaining ones.
Microservices and zero administation
The design and capabilities of ArangoDB are geared towards usage inmodern microservice architectures of applications. With the Foxx services it is very easy to deploy a datacentric microservice within an ArangoDB cluster.
In addition, one can deploy multiple instances of ArangoDB within thesame project. One part of the project might need a scalable documentstore, another might need a graph database, and yet another might needthe full power of a multi-model database actually mixing the variousdata models. There are enormous efficiency benefits to be reaped bybeing able to use a single technology for various roles in a project.
To simplify life of the devops in such a scenario we try as much aspossible to use a zero administration approach for ArangoDB. A runningArangoDB cluster is resilient against failures and essentially repairsitself in case of temporary failures. See the next section for furthercapabilities in this direction.
Apache Mesos integration
For the distributed setup, we use the Apache Mesos infrastructure by default.ArangoDB is a fully certified package for DC/OS and can thusbe deployed essentially with a few mouse clicks or a single command, onceyou have an existing DC/OS cluster. But even on a plain Apache Mesos clusterone can deploy ArangoDB via Marathon with a single API call and some JSON configuration.
The advantage of this approach is that we can not only implement the initial deployment, but also the later management of automatic replacement of failed instances and the scaling of the ArangoDB cluster(triggered manually or even automatically). Since all manipulations areeither via the graphical web UI or via JSON/REST calls, one can evenimplement auto-scaling very easily.
A DC/OS cluster is a very natural environment to deploy microservicearchitectures, since it is so convenient to deploy various services,including potentially multiple ArangoDB cluster instances within thesame DC/OS cluster. The built-in service discovery makes it extremelysimple to connect the various microservices and Mesos automaticallytakes care of the distribution and deployment of the various tasks.
See the Deployment chapter and its subsectionsfor instructions.
It is possible to deploy an ArangoDB cluster by simply launching a bunch of Docker containers with the right command line options to link them up, or even on a single machine starting multiple ArangoDB processes. In that case, synchronous replication will work within the deployed ArangoDB cluster,and automatic fail-over in the sense that the duties of a failed server willautomatically be assigned to another, surviving one. However, since theArangoDB cluster cannot within itself launch additional instances, replacementof failed nodes is not automatic and scaling up and down has to be managedmanually. This is why we do not recommend this setup for production deployment.