Disaster recovery considerations

Disaster recovery considerations

Loss of a client machine
Planning for a loss of client network connectivity
Unexpected loss of network connectivity
Recovering network connectivity manually
Emergency recovery of network connectivity
Loss of a network segment
Loss of a Tang server
Rekeying compromised key material

This section describes several potential disaster situations and the procedures to respond to each of them. Additional situations will be added here as they are discovered or presumed likely to be possible.

Loss of a client machine

The loss of a cluster node that uses the Tang server to decrypt its disk partition is not a disaster. Whether the machine was stolen, suffered hardware failure, or another loss scenario is not important: the disks are encrypted and considered unrecoverable.

However, in the event of theft, a precautionary rotation of the Tang server’s keys and rekeying of all remaining nodes would be prudent to ensure the disks remain unrecoverable even in the event the thieves subsequently gain access to the Tang servers.

To recover from this situation, either reinstall or replace the node.

Planning for a loss of client network connectivity

The loss of network connectivity to an individual node will cause it to become unable to boot in an unattended fashion.

If you are planning work that might cause a loss of network connectivity, you can reveal the passphrase for an onsite technician to use manually, and then rotate the keys afterwards to invalidate it:

Procedure

Before the network becomes unavailable, show the password used in the first slot -s 1 of device /dev/vda2 with this command:
```
$ sudo clevis luks pass -d /dev/vda2 -s 1
```
Invalidate that value and regenerate a new random boot-time passphrase with this command:
```
$ sudo clevis luks regen -d /dev/vda2 -s 1
```

Unexpected loss of network connectivity

If the network disruption is unexpected and a node reboots, consider the following scenarios:

If any nodes are still online, ensure that they do not reboot until network connectivity is restored. This is not applicable for single-node clusters.
The node will remain offline until such time that either network connectivity is restored, or a pre-established passphrase is entered manually at the console. In exceptional circumstances, network administrators might be able to reconfigure network segments to reestablish access, but this is counter to the intent of NBDE, which is that lack of network access means lack of ability to boot.
The lack of network access at the node can reasonably be expected to impact that node’s ability to function as well as its ability to boot. Even if the node were to boot via manual intervention, the lack of network access would make it effectively useless.

Recovering network connectivity manually

A somewhat complex and manually intensive process is also available to the onsite technician for network recovery.

Procedure

The onsite technician extracts the Clevis header from the hard disks. Depending on BIOS lockdown, this might involve removing the disks and installing them in a lab machine.
The onsite technician transmits the Clevis headers to a colleague with legitimate access to the Tang network who then performs the decryption.
Due to the necessity of limited access to the Tang network, the technician should not be able to access that network via VPN or other remote connectivity. Similarly, the technician cannot patch the remote server through to this network in order to decrypt the disks automatically.
The technician reinstalls the disk and manually enters the plain text passphrase provided by their colleague.
The machine successfully starts even without direct access to the Tang servers. Note that the transmission of the key material from the install site to another site with network access must be done carefully.
When network connectivity is restored, the technician rotates the encryption keys.

Emergency recovery of network connectivity

If you are unable to recover network connectivity manually, consider the following steps. Be aware that these steps are discouraged if other methods to recover network connectivity are available.

This method must only be performed by a highly trusted technician.
Taking the Tang server’s key material to the remote site is considered to be a breach of the key material and all servers must be rekeyed and re-encrypted.
This method must be used in extreme cases only, or as a proof of concept recovery method to demonstrate its viability.
Equally extreme, but theoretically possible, is to power the server in question with an Uninterruptible Power Supply (UPS), transport the server to a location with network connectivity to boot and decrypt the disks, and then restore the server at the original location on battery power to continue operation.
If you want to use a backup manual passphrase, you must create it before the failure situation occurs.
Just as attack scenarios become more complex with TPM and Tang compared to a stand-alone Tang installation, so emergency disaster recovery processes are also made more complex if leveraging the same method.

Loss of a network segment

The loss of a network segment, making a Tang server temporarily unavailable, has the following consequences:

OKD nodes continue to boot as normal, provided other servers are available.
New nodes cannot establish their encryption keys until the network segment is restored. In this case, ensure connectivity to remote geographic locations for the purposes of high availability and redundancy. This is because when you are installing a new node or rekeying an existing node, all of the Tang servers you are referencing in that operation must be available.

A hybrid model for a vastly diverse network, such as five geographic regions in which each client is connected to the closest three clients is worth investigating.

In this scenario, new clients are able to establish their encryption keys with the subset of servers that are reachable. For example, in the set of tang1, tang2 and tang3 servers, if tang2 becomes unreachable clients can still establish their encryption keys with tang1 and tang3, and at a later time re-establish with the full set. This can involve either a manual intervention or a more complex automation to be available.

Loss of a Tang server

The loss of an individual Tang server within a load balanced set of servers with identical key material is completely transparent to the clients.

The temporary failure of all Tang servers associated with the same URL, that is, the entire load balanced set, can be considered the same as the loss of a network segment. Existing clients have the ability to decrypt their disk partitions so long as another preconfigured Tang server is available. New clients cannot enroll until at least one of these servers comes back online.

You can mitigate the physical loss of a Tang server by either reinstalling the server or restoring the server from backups. Ensure that the backup and restore processes of the key material is adequately protected from unauthorized access.

Rekeying compromised key material

If key material is potentially exposed to unauthorized third parties, such as through the physical theft of a Tang server or associated data, immediately rotate the keys.

Procedure

Rekey any Tang server holding the affected material.
Rekey all clients using the Tang server.
Destroy the original key material.
Scrutinize any incidents that result in unintended exposure of the master encryption key. If possible, take compromised nodes offline and re-encrypt their disks.

Reformatting and reinstalling on the same physical hardware, although slow, is easy to automate and test.