Restoring a Cluster from Backup

Etcd backup and recovery for Rancher-launched Kubernetes clusters is straightforward. Snapshots of the etcd database are taken and saved either locally on the etcd nodes or to an S3-compatible target. The advantage of configuring S3 is that if all etcd nodes are lost, your snapshot is saved remotely and can still be used to restore the cluster.

Rancher recommends configuring recurring snapshots of etcd, but one-time snapshots can be taken as well. Rancher can restore a cluster from saved snapshots. If you don’t have any snapshots, you can still recover etcd.

Clusters can also be restored to a prior Kubernetes version and cluster configuration.

Viewing Available Snapshots

You can view the list of all available snapshots for a cluster.

  1. In the upper left corner, click ☰ > Cluster Management.
  2. In the Clusters page, go to the cluster where you want to view the snapshots and click the name of the cluster.
  3. Click the Snapshots tab. The listed snapshots include a timestamp of when they were created.

Restoring a Cluster from a Snapshot

If your Kubernetes cluster is broken, you can restore the cluster from a snapshot.

Snapshots are composed of the cluster data in etcd, the Kubernetes version, and the cluster configuration in the cluster.yml. These components allow you to select from the following options when restoring a cluster from a snapshot:

  • Restore just the etcd contents: This restore is similar to restoring from snapshots in Rancher before v2.4.0.
  • Restore etcd and Kubernetes version: This option should be used if a Kubernetes upgrade is the reason that your cluster is failing, and you haven’t made any cluster configuration changes.
  • Restore etcd, Kubernetes version, and cluster configuration: This option should be used if you changed both the Kubernetes version and cluster configuration when upgrading.

When rolling back to a prior Kubernetes version, the upgrade strategy options are ignored. Worker nodes are not cordoned or drained before being reverted to the older Kubernetes version, so that an unhealthy cluster can be more quickly restored to a healthy state.

Prerequisite:

To restore snapshots from S3, the cluster needs to be configured to take recurring snapshots on S3.

  1. In the upper left corner, click ☰ > Cluster Management.
  2. In the Clusters page, go to the cluster where you want to view the snapshots and click the name of the cluster.
  3. Click the Snapshots tab to view the list of saved snapshots.
  4. Go to the snapshot you want to restore and click ⋮ > Restore.
  5. Select a Restore Type.
  6. Click Restore.

Result: The cluster goes into an updating state and the process of restoring the etcd nodes from the snapshot starts. The cluster is restored when it returns to an active state.

Restoring a Cluster From a Snapshot When the controlplane/etcd Are Completely Unavailable

In a disaster recovery scenario, the control plane and etcd nodes managed by Rancher in a downstream cluster may no longer be available or functioning. The cluster can be rebuilt by adding control plane and etcd nodes again, followed by restoring from an available snapshot.

  • RKE: Follow the procedure described in the SUSE Knowledgebase.
  • RKE2/K3s: Follow the steps below.

If you have a complete cluster failure, you must remove all etcd nodes/machines from your cluster before you can add a “new” etcd node for restore.

Note:

Due to a known issue, this procedure requires Rancher v2.7.5 or newer.

Note:

If you are using local snapshots, it is very important to back up the snapshot you want to restore from. Copy it out of the /var/lib/rancher/<k3s/rke2>/server/db/snapshots/ folder on the etcd node you are going to remove. You can then copy the snapshot into the /var/lib/rancher/<k3s/rke2>/server/db/snapshots/ folder on your new node. Furthermore, if you are using local snapshots and restoring to a new node, restoration cannot currently be done through the UI.
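The copy steps in the note above could be sketched roughly as follows. This is a minimal sketch, not an official procedure: the function names, SSH targets, and snapshot file name are hypothetical, and it assumes the SSH user can read and write the snapshot folder (for example, root).

```shell
#!/bin/sh
# Sketch only: function names, hosts, and file names are hypothetical.

# Print the local snapshot directory for a distribution ("rke2" or "k3s").
snapshot_dir() {
  case "$1" in
    rke2|k3s) printf '/var/lib/rancher/%s/server/db/snapshots\n' "$1" ;;
    *) echo "unknown distribution: $1" >&2; return 1 ;;
  esac
}

# Copy one snapshot file from the old etcd node to the current directory.
# Usage: backup_snapshot <distro> <user@old-node> <snapshot-file>
backup_snapshot() {
  dir=$(snapshot_dir "$1") || return 1
  scp "$2:$dir/$3" "./$3"
}

# Create the snapshot directory on the new node and copy the file into it.
# Usage: stage_snapshot <distro> <user@new-node> <snapshot-file>
stage_snapshot() {
  dir=$(snapshot_dir "$1") || return 1
  ssh "$2" "mkdir -p '$dir'" && scp "./$3" "$2:$dir/$3"
}
```

For example, `backup_snapshot rke2 root@old-etcd-node my-snapshot` would be run before the old node is deleted, and `stage_snapshot rke2 root@new-etcd-node my-snapshot` after the new node has registered. Both host names and the snapshot name are placeholders.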

  1. Remove all etcd nodes from your cluster.

    1. In the upper left corner, click ☰ > Cluster Management.
    2. In the Clusters page, go to the cluster where you want to remove nodes.
    3. In the Machines tab, click ⋮ > Delete on each node you want to delete. Initially, the nodes will hang in a deleting state, but once all etcd nodes are deleting, they are removed together. This happens because Rancher sees that all etcd nodes are deleting and “short circuits” the etcd safe-removal logic.
  2. After all etcd nodes are removed, add the new etcd node that you are planning to restore from. Assign the new node all roles (etcd, controlplane, and worker).

    • If the node was previously in a cluster, clean the node first.
    • For custom clusters, go to the Registration tab and check the boxes for etcd, controlplane, and worker. Then copy and run the registration command on your node.
    • For node driver clusters, a new node is provisioned automatically.

    At this point, Rancher will indicate that restoration from etcd snapshot is required.

  3. Restore from an etcd snapshot.

    Note:

    As the etcd node is a clean node, you may need to manually create the /var/lib/rancher/<k3s/rke2>/server/db/snapshots/ path.

    • For S3 snapshots, restore using the UI.

      1. Click the Snapshots tab to view the list of saved snapshots.
      2. Go to the snapshot you want to restore and click ⋮ > Restore.
      3. Select a Restore Type.
      4. Click Restore.
    • For local snapshots, restoring through the UI is not available. Edit the cluster YAML instead:

      1. In the upper right corner, click ⋮ > Edit YAML.
      2. The example YAML below can be added under your rkeConfig to configure the etcd restore:
          ...
          rkeConfig:
            etcdSnapshotRestore:
              name: <string> # This field is required. Refers to the filename of the associated etcdsnapshot object.
          ...
  4. After restoration is successful, you can scale your etcd nodes back up to the desired redundancy.
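To avoid indentation mistakes when editing the YAML for a local-snapshot restore (step 3 above), the fragment could be generated from the snapshot name. This is a small sketch: `restore_fragment` is a hypothetical helper, and only the `rkeConfig`, `etcdSnapshotRestore`, and `name` keys come from the example above.

```shell
#!/bin/sh
# Sketch only: restore_fragment is a hypothetical helper, not a Rancher tool.

# Print the rkeConfig fragment that requests a restore from the named snapshot.
# Usage: restore_fragment <snapshot-name>
restore_fragment() {
  [ -n "$1" ] || { echo "snapshot name is required" >&2; return 1; }
  printf 'rkeConfig:\n  etcdSnapshotRestore:\n    name: %s\n' "$1"
}
```

Calling `restore_fragment my-snapshot` (a placeholder name) prints the three lines to merge under the existing rkeConfig key in the Edit YAML view.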

Recovering etcd without a Snapshot (RKE)

If the group of etcd nodes loses quorum, the Kubernetes cluster will report a failure because no operations, such as deploying workloads, can be executed in the cluster. A cluster should have three etcd nodes so that it can lose one node without losing quorum. If you want to recover your set of etcd nodes, follow these instructions:

  1. Keep only one etcd node in the cluster by removing all other etcd nodes.

  2. On the single remaining etcd node, run the following command:

    docker run --rm -v /var/run/docker.sock:/var/run/docker.sock assaflavie/runlike etcd

    This command outputs the command that was used to run the etcd container. Save the output to use later.

  3. Stop the running etcd container and rename it to etcd-old.

    docker stop etcd
    docker rename etcd etcd-old
  4. Take the saved command from Step 2 and revise it:

    • If you originally had more than 1 etcd node, then you need to change --initial-cluster to only contain the node that remains.
    • Add --force-new-cluster to the end of the command.
  5. Run the revised command.

  6. After the single node is up and running, Rancher recommends adding additional etcd nodes to your cluster. If you have a custom cluster and you want to reuse an old node, you must clean the node before adding it back into a cluster.
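The revision described in step 4 can be scripted instead of edited by hand. The sketch below makes several assumptions: the saved command is a single line, the flag appears in the form `--initial-cluster=<value>` with no spaces inside the value, and the member name and peer URL passed in are the ones for the remaining node. `revise_etcd_command` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: assumes a single-line command and the --initial-cluster=VALUE form.

# Rewrite a saved etcd "docker run" command for single-node recovery:
# keep only the remaining member in --initial-cluster and append
# --force-new-cluster to the end of the command.
# Usage: revise_etcd_command <saved-command> <member-name> <peer-url>
revise_etcd_command() {
  printf '%s\n' "$1" | sed \
    -e "s|--initial-cluster=[^ ]*|--initial-cluster=$2=$3|" \
    -e 's|$| --force-new-cluster|'
}
```

Because the pattern requires `=` directly after `--initial-cluster`, related flags such as `--initial-cluster-token` and `--initial-cluster-state` are left untouched. Review the printed command before running it in step 5.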