12.4. Shard Management
12.4.1. Introduction
This document discusses how sharding works in CouchDB along with how to safely add, move, remove, and create placement rules for shards and shard replicas.
A shard is a horizontal partition of data in a database. Partitioning data into shards and distributing copies of each shard (called “shard replicas” or just “replicas”) to different nodes in a cluster gives the data greater durability against node loss. CouchDB clusters automatically shard databases and distribute the subsets of documents that compose each shard among nodes. Modifying cluster membership and sharding behavior must be done manually.
12.4.1.1. Shards and Replicas
How many shards and replicas each database has can be set at the global level, or on a per-database basis. The relevant parameters are q and n.
q is the number of database shards to maintain. n is the number of copies of each document to distribute. The default value for n is 3, and for q is 8. With q=8, the database is split into 8 shards. With n=3, the cluster distributes three replicas of each shard. Altogether, that’s 24 shard replicas for a single database. In a default 3-node cluster, each node would receive 8 shards. In a 4-node cluster, each node would receive 6 shards. We recommend in the general case that the number of nodes in your cluster should be a multiple of n, so that shards are distributed evenly.
CouchDB nodes have an etc/local.ini file with a section named cluster which looks like this:
- [cluster]
- q=8
- n=3
These settings can be modified to set sharding defaults for all databases, or they can be set on a per-database basis by specifying the q and n query parameters when the database is created. For example:
- $ curl -X PUT "$COUCH_URL:5984/database-name?q=4&n=2"
That creates a database that is split into 4 shards and 2 replicas, yielding 8 shard replicas distributed throughout the cluster.
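To double-check the values a database ended up with, you can read its database information document; in CouchDB 2.x and later the response includes a cluster object reporting q, n, w and r (the output shown here is illustrative):
- $ curl -s "$COUCH_URL:5984/database-name" | grep -o '"cluster":{[^}]*}'
- "cluster":{"q":4,"n":2,"w":2,"r":2}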
12.4.1.2. Quorum
Depending on the size of the cluster, the number of shards per database, and the number of shard replicas, not every node may have access to every shard, but every node knows where all the replicas of each shard can be found through CouchDB’s internal shard map.
Each request that comes in to a CouchDB cluster is handled by any one random coordinating node. This coordinating node proxies the request to the other nodes that have the relevant data, which may or may not include itself. The coordinating node sends a response to the client once a quorum of database nodes have responded; 2, by default. The default required size of a quorum is equal to r=w=((n+1)/2), where r refers to the size of a read quorum, w refers to the size of a write quorum, and n refers to the number of replicas of each shard. In a default cluster where n is 3, ((n+1)/2) would be 2.
Note
Each node in a cluster can be a coordinating node for any one request. There are no special roles for nodes inside the cluster.
The size of the required quorum can be configured at request time by setting the r parameter for document and view reads, and the w parameter for document writes. For example, here is a request that directs the coordinating node to send a response once at least two nodes have responded:
- $ curl "$COUCH_URL:5984/<doc>?r=2"
Here is a similar example for writing a document:
- $ curl -X PUT "$COUCH_URL:5984/<doc>?w=2" -d '{...}'
Setting r or w to be equal to n (the number of replicas) means you will only receive a response once all nodes with relevant shards have responded or timed out, and as such this approach does not guarantee ACIDic consistency. Setting r or w to 1 means you will receive a response after only one relevant node has responded.
12.4.2. Moving a shard
This section describes how to manually place and replace shards. These activities are critical steps when you determine your cluster is too big or too small, and want to resize it successfully, or you have noticed from server metrics that database/shard layout is non-optimal and you have some “hot spots” that need resolving.
Consider a three-node cluster with q=8 and n=3. Each database has 24 shards, distributed across the three nodes. If you add a fourth node to the cluster, CouchDB will not redistribute existing database shards to it. This leads to unbalanced load, as the new node will only host shards for databases created after it joined the cluster. To balance the distribution of shards from existing databases, they must be moved manually.
Moving shards between nodes in a cluster involves the following steps:
- Ensure the target node has joined the cluster.
- Copy the shard(s) and any secondary index shard(s) onto the target node.
- Set the target node to maintenance mode.
- Update cluster metadata to reflect the new target shard(s).
- Monitor internal replication to ensure up-to-date shard(s).
- Clear the target node’s maintenance mode.
- Update cluster metadata again to remove the source shard(s).
- Remove the shard file(s) and secondary index file(s) from the source node.
12.4.2.1. Copying shard files
Note
Technically, copying database and secondary index shards is optional. If you proceed to the next step without performing this data copy, CouchDB will use internal replication to populate the newly added shard replicas. However, copying files is faster than internal replication, especially on a busy cluster, which is why we recommend performing this manual data copy first.
Shard files live in the data/shards directory of your CouchDB install. Within those subdirectories are the shard files themselves. For instance, for a q=8 database called abc, here are its database shard files:
- data/shards/00000000-1fffffff/abc.1529362187.couch
- data/shards/20000000-3fffffff/abc.1529362187.couch
- data/shards/40000000-5fffffff/abc.1529362187.couch
- data/shards/60000000-7fffffff/abc.1529362187.couch
- data/shards/80000000-9fffffff/abc.1529362187.couch
- data/shards/a0000000-bfffffff/abc.1529362187.couch
- data/shards/c0000000-dfffffff/abc.1529362187.couch
- data/shards/e0000000-ffffffff/abc.1529362187.couch
Secondary indexes (including JavaScript views, Erlang views and Mango indexes) are also sharded, and their shards should be moved to save the new node the effort of rebuilding the view. View shards live in data/.shards. For example:
- data/.shards
- data/.shards/e0000000-ffffffff/_replicator.1518451591_design
- data/.shards/e0000000-ffffffff/_replicator.1518451591_design/mrview
- data/.shards/e0000000-ffffffff/_replicator.1518451591_design/mrview/3e823c2a4383ac0c18d4e574135a5b08.view
- data/.shards/c0000000-dfffffff
- data/.shards/c0000000-dfffffff/_replicator.1518451591_design
- data/.shards/c0000000-dfffffff/_replicator.1518451591_design/mrview
- data/.shards/c0000000-dfffffff/_replicator.1518451591_design/mrview/3e823c2a4383ac0c18d4e574135a5b08.view
- ...
Since they are files, you can use cp, rsync, scp or other file-copying commands to copy them from one node to another. For example:
- # on one machine
- $ mkdir -p data/.shards/<range>
- $ mkdir -p data/shards/<range>
- # on the other
- $ scp <couch-dir>/data/.shards/<range>/<database>.<datecode>* \
- <node>:<couch-dir>/data/.shards/<range>/
- $ scp <couch-dir>/data/shards/<range>/<database>.<datecode>.couch \
- <node>:<couch-dir>/data/shards/<range>/
Note
Remember to move view files before database files! If a view index is ahead of its database, the database will rebuild it from scratch.
12.4.2.2. Set the target node to true maintenance mode
Before telling CouchDB about these new shards on the node, the node must be put into maintenance mode. Maintenance mode instructs CouchDB to return a 404 Not Found response on the /_up endpoint, and ensures it does not participate in normal interactive clustered requests for its shards. A properly configured load balancer that uses GET /_up to check the health of nodes will detect this 404 and remove the node from circulation, preventing requests from being sent to that node. For example, to configure HAProxy to use the /_up endpoint, use:
- http-check disable-on-404
- option httpchk GET /_up
If you do not set maintenance mode, or the load balancer ignores this maintenance mode status, after the next step is performed the cluster may return incorrect responses when consulting the node in question. You don’t want this! In the next steps, we will ensure that this shard is up-to-date before allowing it to participate in end-user requests.
To enable maintenance mode:
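One way to do this is through the node’s configuration API, for example (substitute your own node name):
- # maintenance_mode is a key in the [couchdb] configuration section
- $ curl -X PUT -H "Content-Type: application/json" \
-     "$COUCH_URL:5984/_node/<nodename>/_config/couchdb/maintenance_mode" \
-     -d '"true"'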
Then, verify that the node is in maintenance mode by performing a GET /_up on that node’s individual endpoint.
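For example (the exact response body may differ between versions, but the status code should be 404 while maintenance mode is enabled):
- $ curl -v "$COUCH_URL:5984/_up"
- …
- < HTTP/1.1 404 Object Not Found
- …
- {"status":"maintenance_mode"}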
Finally, check that your load balancer has removed the node from the pool of available backend nodes.
12.4.2.3. Updating cluster metadata to reflect the new target shard(s)
Now we need to tell CouchDB that the target node (which must already be joined to the cluster) should be hosting shard replicas for a given database.
To update the cluster metadata, use the special /_dbs database, which is an internal CouchDB database that maps databases to shards and nodes. This database is replicated between nodes. It is accessible only via a node-local port, usually at port 5986. By default, this port is only available on the localhost interface for security purposes.
First, retrieve the database’s current metadata:
- $ curl http://localhost:5986/_dbs/{name}
- {
- "_id": "{name}",
- "_rev": "1-e13fb7e79af3b3107ed62925058bfa3a",
- "shard_suffix": [46, 49, 53, 51, 48, 50, 51, 50, 53, 50, 54],
- "changelog": [
- ["add", "00000000-1fffffff", "node1@xxx.xxx.xxx.xxx"],
- ["add", "00000000-1fffffff", "node2@xxx.xxx.xxx.xxx"],
- ["add", "00000000-1fffffff", "node3@xxx.xxx.xxx.xxx"],
- …
- ],
- "by_node": {
- "node1@xxx.xxx.xxx.xxx": [
- "00000000-1fffffff",
- …
- ],
- …
- },
- "by_range": {
- "00000000-1fffffff": [
- "node1@xxx.xxx.xxx.xxx",
- "node2@xxx.xxx.xxx.xxx",
- "node3@xxx.xxx.xxx.xxx"
- ],
- …
- }
- }
Here is a brief anatomy of that document:
- _id: The name of the database.
- _rev: The current revision of the metadata.
- shard_suffix: A timestamp of the database’s creation, marked as seconds after the Unix epoch mapped to the codepoints for ASCII numerals.
- changelog: History of the database’s shards.
- by_node: List of shards on each node.
- by_range: On which nodes each shard is.
To reflect the shard move in the metadata, there are three steps:
- Add appropriate changelog entries.
- Update the by_node entries.
- Update the by_range entries.
Warning
Be very careful! Mistakes during this process can irreparably corrupt the cluster!
As of this writing, this process must be done manually.
To add a shard to a node, add entries like this to the database metadata’s changelog attribute:
- ["add", "<range>", "<node-name>"]
The <range> is the specific shard range for the shard. The <node-name> should match the name and address of the node as displayed in GET /_membership on the cluster.
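If you are unsure of the exact node names, the _membership endpoint on the clustered port lists them; the node names shown below are placeholders:
- $ curl "$COUCH_URL:5984/_membership"
- {"all_nodes":["node1@xxx.xxx.xxx.xxx","node2@xxx.xxx.xxx.xxx","node3@xxx.xxx.xxx.xxx"],"cluster_nodes":["node1@xxx.xxx.xxx.xxx","node2@xxx.xxx.xxx.xxx","node3@xxx.xxx.xxx.xxx"]}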
Note
When removing a shard from a node, specify remove instead of add.
Once you have figured out the new changelog entries, you will need to update the by_node and by_range to reflect who is storing what shards. The data in the changelog entries and these attributes must match. If they do not, the database may become corrupted.
Continuing our example, here is an updated version of the metadata above that adds shards to an additional node called node4:
- {
- "_id": "{name}",
- "_rev": "1-e13fb7e79af3b3107ed62925058bfa3a",
- "shard_suffix": [46, 49, 53, 51, 48, 50, 51, 50, 53, 50, 54],
- "changelog": [
- ["add", "00000000-1fffffff", "node1@xxx.xxx.xxx.xxx"],
- ["add", "00000000-1fffffff", "node2@xxx.xxx.xxx.xxx"],
- ["add", "00000000-1fffffff", "node3@xxx.xxx.xxx.xxx"],
- ...
- ["add", "00000000-1fffffff", "node4@xxx.xxx.xxx.xxx"]
- ],
- "by_node": {
- "node1@xxx.xxx.xxx.xxx": [
- "00000000-1fffffff",
- ...
- ],
- ...
- "node4@xxx.xxx.xxx.xxx": [
- "00000000-1fffffff"
- ]
- },
- "by_range": {
- "00000000-1fffffff": [
- "node1@xxx.xxx.xxx.xxx",
- "node2@xxx.xxx.xxx.xxx",
- "node3@xxx.xxx.xxx.xxx",
- "node4@xxx.xxx.xxx.xxx"
- ],
- ...
- }
- }
Now you can PUT this new metadata:
- $ curl -X PUT http://localhost:5986/_dbs/{name} -d '{...}'
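Rather than pasting the JSON inline, one convenient workflow is to round-trip the metadata through a file so it can be edited carefully before being written back; the temporary file path here is arbitrary:
- $ curl http://localhost:5986/_dbs/{name} -o /tmp/{name}.json
- # edit /tmp/{name}.json: extend changelog, by_node and by_range as shown above
- $ curl -X PUT http://localhost:5986/_dbs/{name} -d @/tmp/{name}.json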
12.4.2.4. Monitor internal replication to ensure up-to-date shard(s)
After you complete the previous step, as soon as CouchDB receives a write request for a shard on the target node, CouchDB will check if the target node’s shard(s) are up to date. If it finds they are not up to date, it will trigger an internal replication job to complete this task. You can observe this happening by triggering a write to the database (update a document, or create a new one), while monitoring the /_node/<nodename>/_system endpoint, which includes the internal_replication_jobs metric.
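A simple way to watch that metric is to poll the _system endpoint and extract the counter; this sketch assumes you have admin credentials on the clustered port:
- $ curl -s "$COUCH_URL:5984/_node/<nodename>/_system" | \
-     grep -o '"internal_replication_jobs":[0-9]*'
- "internal_replication_jobs":0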
Once this metric has returned to the baseline from before you wrote the document, or is 0, the shard replica is ready to serve data and we can bring the node out of maintenance mode.
12.4.2.5. Clear the target node’s maintenance mode
You can now let the node start servicing data requests by putting "false" to the maintenance mode configuration endpoint, just as in step 2.
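Assuming the same configuration endpoint used in step 2, that looks like:
- $ curl -X PUT -H "Content-Type: application/json" \
-     "$COUCH_URL:5984/_node/<nodename>/_config/couchdb/maintenance_mode" \
-     -d '"false"'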
Verify that the node is not in maintenance mode by performing a GET /_up on that node’s individual endpoint.
Finally, check that your load balancer has returned the node to the poolof available backend nodes.
12.4.2.6. Update cluster metadata again to remove the source shard
Now, remove the source shard from the shard map the same way that you added the new target shard to the shard map in step 2. Be sure to add the ["remove", <range>, <source-shard>] entry to the end of the changelog as well as modifying both the by_node and by_range sections of the database metadata document.
12.4.2.7. Remove the shard and secondary index files from the source node
Finally, you can remove the source shard replica by deleting its file from the command line on the source host, along with any view shard replicas.
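A sketch of that cleanup, reusing the placeholder paths from the copy step (double-check the range and datecode before deleting anything):
- $ rm <couch-dir>/data/shards/<range>/<dbname>.<datecode>.couch
- $ rm -r <couch-dir>/data/.shards/<range>/<dbname>.<datecode>*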
Congratulations! You have moved a database shard replica. By adding and removing database shard replicas in this way, you can change the cluster’s shard layout, also known as a shard map.
12.4.3. Specifying database placement
You can configure CouchDB to put shard replicas on certain nodes at database creation time using placement rules.
First, each node must be labeled with a zone attribute. This defines which zone each node is in. You do this by editing the node’s document in the /_nodes database, which is accessed through the node-local port. Add a key value pair of the form:
- "zone": "{zone-name}"
Do this for all of the nodes in your cluster. For example:
- $ curl -X PUT http://localhost:5986/_nodes/<node-name> \
-     -d '{
-         "_id": "<node-name>",
-         "_rev": "<rev>",
-         "zone": "<zone-name>"
-         }'
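To confirm that the attribute was stored, you can read the node document back through the same node-local port; the values shown are placeholders:
- $ curl http://localhost:5986/_nodes/<node-name>
- {"_id":"<node-name>","_rev":"<rev>","zone":"<zone-name>"}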
In the local config file (local.ini) of each node, define a consistent cluster-wide setting like:
- [cluster]
- placement = <zone-name-1>:2,<zone-name-2>:1
In this example, CouchDB will ensure that two replicas for a shard will be hosted on nodes with the zone attribute set to <zone-name-1> and one replica will be hosted on a node with the zone attribute set to <zone-name-2>.
This approach is flexible, since you can also specify zones on a per-database basis by specifying the placement setting as a query parameter when the database is created, using the same syntax as the ini file:
- curl -X PUT $COUCH_URL:5984/<dbname>?zone=<zone>
Note that you can also use this system to ensure certain nodes in the cluster do not host any replicas for newly created databases, by giving them a zone attribute that does not appear in the [cluster] placement string.
12.4.4. Resharding a database to a new q value
The q value for a database can only be set when the database is created, precluding live resharding. Instead, to reshard a database, it must be regenerated. Here are the steps (a command-line sketch follows the list):
- Create a temporary database with the desired shard settings, by specifying the q value as a query parameter during the PUT operation.
- Stop clients accessing the database.
- Replicate the primary database to the temporary one. Multiple replications may be required if the primary database is under active use.
- Delete the primary database. Make sure nobody is using it!
- Recreate the primary database with the desired shard settings.
- Clients can now access the database again.
- Replicate the temporary database back to the primary.
- Delete the temporary database.
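As a rough command-line illustration of the list above, the following sketch uses the clustered /_replicate endpoint; db, db-temp and q=16 are placeholders, and it assumes clients have already been stopped:
- $ curl -X PUT "$COUCH_URL:5984/db-temp?q=16"
- $ curl -X POST "$COUCH_URL:5984/_replicate" -H "Content-Type: application/json" \
-     -d '{"source": "db", "target": "db-temp"}'
- $ curl -X DELETE "$COUCH_URL:5984/db"
- $ curl -X PUT "$COUCH_URL:5984/db?q=16"
- $ curl -X POST "$COUCH_URL:5984/_replicate" -H "Content-Type: application/json" \
-     -d '{"source": "db-temp", "target": "db"}'
- $ curl -X DELETE "$COUCH_URL:5984/db-temp"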
Once all steps have completed, the database can be used again. The cluster will create and distribute its shards according to placement rules automatically.
Downtime can be avoided in production if the client application(s) can be instructed to use the new database instead of the old one, and a cut-over is performed during a very brief outage window.