Recover failing disk
YugabyteDB can be configured to use multiple storage disks by setting the --fs_data_dirs configuration option. This introduces the possibility of disk failure and recovery issues.
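As a minimal sketch of how such a setting might be assembled, the Python helper below joins several mount points into a single --fs_data_dirs value, which takes a comma-separated list of directories. The helper name fs_data_dirs_flag and the paths /mnt/d0 and /mnt/d1 are illustrative assumptions, not part of YugabyteDB.

```python
def fs_data_dirs_flag(dirs):
    """Build a --fs_data_dirs argument from a list of mount points.

    Hypothetical helper for illustration; only the flag name and its
    comma-separated format come from the yb-tserver configuration.
    """
    cleaned = [d.strip() for d in dirs if d.strip()]
    if not cleaned:
        raise ValueError("at least one data directory is required")
    if len(set(cleaned)) != len(cleaned):
        raise ValueError("data directories must be distinct")
    return "--fs_data_dirs=" + ",".join(cleaned)


# Example paths are placeholders for two separately mounted disks.
flag = fs_data_dirs_flag(["/mnt/d0", "/mnt/d1"])
print(flag)
```

Each listed directory would normally sit on its own physical disk, which is why the failure of a single disk affects only part of the node's data.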
Cluster replication recovery
The yb-tserver service automatically detects disk failures and attempts to spread the data from the failed disk to other healthy nodes in the cluster.
In a single-zone setup with a replication factor (RF) of 3: if you started with four or more nodes, at least three nodes remain after one fails, so all three replicas can still be placed on separate nodes. In this case, rereplication is automatically started if a YB-TServer or disk is down for 10 minutes.
In a multi-zone setup with a replication factor (RF) of 3: YugabyteDB will try to keep one copy of the data per zone. In this case, for automatic rereplication of data, each zone needs at least two YB-TServers so that if one fails, its data can be rereplicated to the other. With three zones, this means a cluster of at least six nodes.
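The node-count reasoning for both setups can be sketched as a small Python check. This is only an illustration of the rules described above, not YugabyteDB's placement logic. The function name can_auto_rereplicate is hypothetical, and the multi-zone branch assumes one replica per zone, so the number of zones equals the RF.

```python
def can_auto_rereplicate(nodes_per_zone, rf=3):
    """Return True if data can be rereplicated after one node fails.

    nodes_per_zone: list with the YB-TServer count of each zone.
    Assumption: in a multi-zone setup there is one replica per zone,
    so the number of zones must equal rf.
    """
    if len(nodes_per_zone) == 1:
        # Single zone: rf nodes must remain after losing one.
        return nodes_per_zone[0] - 1 >= rf
    if len(nodes_per_zone) != rf:
        raise ValueError("this sketch assumes one zone per replica")
    # Multi-zone: every zone needs a second node to take over the data.
    return all(count >= 2 for count in nodes_per_zone)


print(can_auto_rereplicate([4]))        # single zone, four nodes
print(can_auto_rereplicate([3]))        # single zone, three nodes
print(can_auto_rereplicate([2, 2, 2]))  # three zones, six nodes
print(can_auto_rereplicate([3, 1, 2]))  # one zone has no spare node
```

The last case has six nodes in total but still fails the check, because the zone with a single YB-TServer has nowhere to place its copy if that node is lost.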
Failed disk replacement
The steps to replace a failed disk are:
- Stop the YB-TServer node.
- Replace the disks that have failed.
- Restart the yb-tserver service. On restart, the YB-TServer will see the new empty disk and start replicating tablets from other nodes.
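Before the restart in the last step, it may be worth confirming that the replacement disk is mounted and writable. The Python sketch below inspects each directory in a comma-separated --fs_data_dirs value. It assumes the new disk is mounted at the same path as the failed one. The function name check_data_dirs and the status labels are hypothetical and not part of any YugabyteDB tool.

```python
import os


def check_data_dirs(fs_data_dirs):
    """Report the state of each directory in a comma-separated list.

    Returns a dict mapping each path to one of:
    "missing", "not writable", "has data", or "empty".
    """
    status = {}
    for path in fs_data_dirs.split(","):
        path = path.strip()
        if not path:
            continue
        if not os.path.isdir(path):
            status[path] = "missing"        # disk not mounted or wrong path
        elif not os.access(path, os.W_OK | os.X_OK):
            status[path] = "not writable"   # ownership or permissions issue
        elif os.listdir(path):
            status[path] = "has data"       # an existing, healthy data disk
        else:
            status[path] = "empty"          # expected for the replaced disk
    return status
```

A replaced disk would be expected to show as "empty" and the surviving disks as "has data". A "missing" or "not writable" result suggests fixing the mount or permissions before restarting the service.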