Host-level tasks
Adding a host to the cluster
For information on adding master or node hosts to a cluster, see the Adding hosts to an existing cluster section in the Install and configuration guide.
Master host tasks
Deprecating a master host
Deprecating a master host removes it from the OKD environment.
Reasons to deprecate or scale down master hosts include hardware resizing or the replacement of the underlying infrastructure.
Highly available OKD environments require at least three master hosts and three etcd nodes. Usually, the master hosts are colocated with the etcd services. If you deprecate a master host, you also remove the etcd static pods from that host.
Ensure that the master and etcd services are always deployed in odd numbers due to the voting mechanisms that take place among those services.
Creating a master host backup
Perform this backup process before any change to the OKD infrastructure, such as a system update, upgrade, or any other significant modification. Back up data regularly to ensure that recent data is available if a failure occurs.
OKD files
The master instances run important services, such as the API server and controllers. The /etc/origin/master directory stores many important files:
The configuration of the API, controllers, services, and more
Certificates generated by the installation
All cloud provider-related configuration
Keys and other authentication files, such as htpasswd if you use htpasswd
And more
You can customize OKD services, such as increasing the log level or using proxies. The configuration files are stored in the /etc/sysconfig directory.
Because the masters are also nodes, back up the entire /etc/origin directory.
Procedure
You must perform the following steps on each master node.
Create a backup of the pod definitions, located in the /etc/origin/node/pods directory.
Create a backup of the master host configuration files:
$ MYBACKUPDIR=/backup/$(hostname)/$(date +%Y%m%d)
$ sudo mkdir -p ${MYBACKUPDIR}/etc/sysconfig
$ sudo cp -aR /etc/origin ${MYBACKUPDIR}/etc
$ sudo cp -aR /etc/sysconfig/ ${MYBACKUPDIR}/etc/sysconfig/
The master configuration file is /etc/origin/master/master-config.yaml.
The /etc/origin/master/ca.serial.txt file is generated on only the first master listed in the Ansible host inventory. If you deprecate the first master host, copy the /etc/origin/master/ca.serial.txt file to the rest of the master hosts before the process.
In OKD 3.11 clusters running multiple masters, one of the master nodes includes additional CA certificates in /etc/origin/master, /etc/etcd/ca, and /etc/etcd/generated_certs. These are required for application node and etcd node scale-up operations and would need to be restored on another master node should the originating master become permanently unavailable. These directories are included by default within the backup procedures documented here.
Other important files to consider when planning a backup include:
File                                    Description
/etc/cni/                               Container Network Interface configuration (if used)
/etc/sysconfig/iptables                 Where the iptables rules are stored
/etc/sysconfig/docker-storage-setup     The input file for the container-storage-setup command
/etc/sysconfig/docker                   The docker configuration file
/etc/sysconfig/docker-network           docker networking configuration (for example, MTU)
/etc/sysconfig/docker-storage           docker storage configuration (generated by container-storage-setup)
/etc/dnsmasq.conf                       Main configuration file for dnsmasq
/etc/dnsmasq.d/                         Different dnsmasq configuration files
/etc/sysconfig/flanneld                 flannel configuration file (if used)
/etc/pki/ca-trust/source/anchors/       Certificates added to the system (for example, for external registries)
Create a backup of those files:
$ MYBACKUPDIR=/backup/$(hostname)/$(date +%Y%m%d)
$ sudo mkdir -p ${MYBACKUPDIR}/etc/sysconfig
$ sudo mkdir -p ${MYBACKUPDIR}/etc/pki/ca-trust/source/anchors
$ sudo cp -aR /etc/sysconfig/{iptables,docker-*,flanneld} \
${MYBACKUPDIR}/etc/sysconfig/
$ sudo cp -aR /etc/dnsmasq* /etc/cni ${MYBACKUPDIR}/etc/
$ sudo cp -aR /etc/pki/ca-trust/source/anchors/* \
${MYBACKUPDIR}/etc/pki/ca-trust/source/anchors/
If a package is accidentally removed or you need to restore a file that is included in an rpm package, having a list of the rhel packages installed on the system can be useful.
If you use Red Hat Satellite features, such as content views or the facts store, they provide a proper mechanism for reinstalling missing packages and a historical record of the packages installed on your systems.
To create a list of the current rhel packages installed in the system:
$ MYBACKUPDIR=/backup/$(hostname)/$(date +%Y%m%d)
$ sudo mkdir -p ${MYBACKUPDIR}
$ rpm -qa | sort | sudo tee $MYBACKUPDIR/packages.txt
If you used the previous steps, the following files are present in the backup directory:
$ MYBACKUPDIR=/backup/$(hostname)/$(date +%Y%m%d)
$ sudo find ${MYBACKUPDIR} -mindepth 1 -type f -printf '%P\n'
etc/sysconfig/flanneld
etc/sysconfig/iptables
etc/sysconfig/docker-network
etc/sysconfig/docker-storage
etc/sysconfig/docker-storage-setup
etc/sysconfig/docker-storage-setup.rpmnew
etc/origin/master/ca.crt
etc/origin/master/ca.key
etc/origin/master/ca.serial.txt
etc/origin/master/ca-bundle.crt
etc/origin/master/master.proxy-client.crt
etc/origin/master/master.proxy-client.key
etc/origin/master/service-signer.crt
etc/origin/master/service-signer.key
etc/origin/master/serviceaccounts.private.key
etc/origin/master/serviceaccounts.public.key
etc/origin/master/openshift-master.crt
etc/origin/master/openshift-master.key
etc/origin/master/openshift-master.kubeconfig
etc/origin/master/master.server.crt
etc/origin/master/master.server.key
etc/origin/master/master.kubelet-client.crt
etc/origin/master/master.kubelet-client.key
etc/origin/master/admin.crt
etc/origin/master/admin.key
etc/origin/master/admin.kubeconfig
etc/origin/master/etcd.server.crt
etc/origin/master/etcd.server.key
etc/origin/master/master.etcd-client.key
etc/origin/master/master.etcd-client.csr
etc/origin/master/master.etcd-client.crt
etc/origin/master/master.etcd-ca.crt
etc/origin/master/policy.json
etc/origin/master/scheduler.json
etc/origin/master/htpasswd
etc/origin/master/session-secrets.yaml
etc/origin/master/openshift-router.crt
etc/origin/master/openshift-router.key
etc/origin/master/registry.crt
etc/origin/master/registry.key
etc/origin/master/master-config.yaml
etc/origin/generated-configs/master-master-1.example.com/master.server.crt
...[OUTPUT OMITTED]...
etc/origin/cloudprovider/openstack.conf
etc/origin/node/system:node:master-0.example.com.crt
etc/origin/node/system:node:master-0.example.com.key
etc/origin/node/ca.crt
etc/origin/node/system:node:master-0.example.com.kubeconfig
etc/origin/node/server.crt
etc/origin/node/server.key
etc/origin/node/node-dnsmasq.conf
etc/origin/node/resolv.conf
etc/origin/node/node-config.yaml
etc/origin/node/flannel.etcd-client.key
etc/origin/node/flannel.etcd-client.csr
etc/origin/node/flannel.etcd-client.crt
etc/origin/node/flannel.etcd-ca.crt
etc/pki/ca-trust/source/anchors/openshift-ca.crt
etc/pki/ca-trust/source/anchors/registry-ca.crt
etc/dnsmasq.conf
etc/dnsmasq.d/origin-dns.conf
etc/dnsmasq.d/origin-upstream-dns.conf
etc/dnsmasq.d/node-dnsmasq.conf
packages.txt
If needed, you can compress the files to save space:
$ MYBACKUPDIR=/backup/$(hostname)/$(date +%Y%m%d)
$ sudo tar -zcvf /backup/$(hostname)-$(date +%Y%m%d).tar.gz $MYBACKUPDIR
$ sudo rm -Rf ${MYBACKUPDIR}
To create any of these files from scratch, the openshift-ansible-contrib repository contains the backup_master_node.sh script, which performs the previous steps. The script creates a directory on the host where you run the script and copies all the files previously mentioned.
The openshift-ansible-contrib scripts are not supported by Red Hat, but the reference architecture team performs testing to ensure the code operates as defined and is secure.
You can run the script on every master host with:
$ mkdir ~/git
$ cd ~/git
$ git clone https://github.com/openshift/openshift-ansible-contrib.git
$ cd openshift-ansible-contrib/reference-architecture/day2ops/scripts
$ ./backup_master_node.sh -h
Deprecating a master host
Master hosts run important services, such as the OKD API and controller services. To deprecate a master host, these services must be stopped.
The OKD API service is an active/active service, so stopping the service does not affect the environment as long as the requests are sent to a separate master server. However, the OKD controllers service is an active/passive service, where the services use etcd to decide the active master.
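Before removing a master from the pool, you can spot-check that the other masters' API endpoints answer. This is a minimal sketch, assuming the example host names used below and the default master API port 8443; the /healthz endpoint returns ok while the API server is serving:
$ for master in master-1.example.com master-2.example.com; do
    curl -sk https://${master}:8443/healthz; echo " - ${master}"
  done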
Deprecating a master host in a multi-master architecture includes removing the master from the load balancer pool to avoid new connections attempting to use that master. This process depends heavily on the load balancer used. The steps below show the details of removing the master from haproxy. If OKD is running on a cloud provider, or you use an F5 appliance, see the specific product documentation to remove the master from rotation.
Procedure
Remove the master host from the backend section in the /etc/haproxy/haproxy.cfg configuration file. For example, if deprecating a master named master-0.example.com using haproxy, ensure the host name is removed from the following:
backend mgmt8443
balance source
mode tcp
# MASTERS 8443
server master-1.example.com 192.168.55.12:8443 check
server master-2.example.com 192.168.55.13:8443 check
Then, restart the haproxy service:
$ sudo systemctl restart haproxy
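You can also ask haproxy to validate the edited configuration; the -c flag performs a configuration check without affecting the running service:
$ sudo haproxy -c -f /etc/haproxy/haproxy.cfg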
Once the master is removed from the load balancer, disable the API and controller services by moving their definition files out of the static pods directory /etc/origin/node/pods:
# mkdir -p /etc/origin/node/pods/disabled
# mv /etc/origin/node/pods/controller.yaml /etc/origin/node/pods/disabled/
Because the master host is a schedulable OKD node, follow the steps in the Deprecating a node host section.
Remove the master host from the [masters] and [nodes] groups in the /etc/ansible/hosts Ansible inventory file to avoid issues if running any Ansible tasks using that inventory file.
Deprecating the first master host listed in the Ansible inventory file requires extra precautions.
The /etc/origin/master/ca.serial.txt file is generated on only the first master listed in the Ansible host inventory. If you deprecate the first master host, copy the /etc/origin/master/ca.serial.txt file to the rest of the master hosts before the process.
In OKD 3.11 clusters running multiple masters, one of the master nodes includes additional CA certificates in /etc/origin/master, /etc/etcd/ca, and /etc/etcd/generated_certs. These are required for application node and etcd node scale-up operations and must be restored on another master node if the CA host master is being deprecated.
The kubernetes service includes the master host IPs as endpoints. To verify that the master has been properly deprecated, review the kubernetes service output and see if the deprecated master has been removed:
$ oc describe svc kubernetes -n default
Name: kubernetes
Namespace: default
Labels: component=apiserver
provider=kubernetes
Annotations: <none>
Selector: <none>
Type: ClusterIP
IP: 10.111.0.1
Port: https 443/TCP
Endpoints: 192.168.55.12:8443,192.168.55.13:8443
Port: dns 53/UDP
Endpoints: 192.168.55.12:8053,192.168.55.13:8053
Port: dns-tcp 53/TCP
Endpoints: 192.168.55.12:8053,192.168.55.13:8053
Session Affinity: ClientIP
Events: <none>
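If you only need the endpoint list, a jsonpath query against the endpoints object is a compact alternative (a sketch using standard oc output options):
$ oc get endpoints kubernetes -n default \
    -o jsonpath='{.subsets[*].addresses[*].ip}'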
After the master has been successfully deprecated, the host where the master was previously running can be safely deleted.
Removing an etcd host
If an etcd host fails beyond restoration, remove it from the cluster.
Steps to be performed on all master hosts
Procedure
Remove the failed etcd host from the etcd cluster. Run the following command, specifying a surviving etcd host and the ID of the failed member:
# etcdctl -C https://<surviving host IP address>:2379 \
--ca-file=/etc/etcd/ca.crt \
--cert-file=/etc/etcd/peer.crt \
--key-file=/etc/etcd/peer.key member remove <failed member ID>
Restart the API and controller services on every master:
# master-restart api
# master-restart controllers
Steps to be performed in the current etcd cluster
Procedure
Remove the failed host from the cluster:
# etcdctl2 cluster-health
member 5ee217d19001 is healthy: got healthy result from https://192.168.55.12:2379
member 2a529ba1840722c0 is healthy: got healthy result from https://192.168.55.8:2379
failed to check the health of member 8372784203e11288 on https://192.168.55.21:2379: Get https://192.168.55.21:2379/health: dial tcp 192.168.55.21:2379: getsockopt: connection refused
member 8372784203e11288 is unreachable: [https://192.168.55.21:2379] are all unreachable
member ed4f0efd277d7599 is healthy: got healthy result from https://192.168.55.13:2379
cluster is healthy
# etcdctl2 member remove 8372784203e11288 (1)
Removed member 8372784203e11288 from cluster
# etcdctl2 cluster-health
member 5ee217d19001 is healthy: got healthy result from https://192.168.55.12:2379
member 2a529ba1840722c0 is healthy: got healthy result from https://192.168.55.8:2379
member ed4f0efd277d7599 is healthy: got healthy result from https://192.168.55.13:2379
cluster is healthy
1 The remove command requires the etcd member ID, not the hostname.
To ensure the etcd configuration does not use the failed host when the etcd service is restarted, modify the /etc/etcd/etcd.conf file on all remaining etcd hosts and remove the failed host from the value of the ETCD_INITIAL_CLUSTER variable:
# vi /etc/etcd/etcd.conf
For example:
ETCD_INITIAL_CLUSTER=master-0.example.com=https://192.168.55.8:2380,master-1.example.com=https://192.168.55.12:2380,master-2.example.com=https://192.168.55.13:2380
becomes:
ETCD_INITIAL_CLUSTER=master-0.example.com=https://192.168.55.8:2380,master-1.example.com=https://192.168.55.12:2380
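If you prefer not to edit the file by hand, a sed one-liner can strip the failed host's entry; this is a sketch assuming the same example values shown above (adjust the host name and URL for your environment):
# sed -i 's|,master-2.example.com=https://192.168.55.13:2380||' /etc/etcd/etcd.conf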
Restarting the etcd services is not required, because the failed host was removed using etcdctl.
Modify the Ansible inventory file to reflect the current status of the cluster and to avoid issues when re-running a playbook:
[OSEv3:children]
masters
nodes
etcd
... [OUTPUT ABBREVIATED] ...
[etcd]
master-0.example.com
master-1.example.com
If you are using Flannel, modify the flanneld service configuration located at /etc/sysconfig/flanneld on every host and remove the etcd host:
FLANNEL_ETCD_ENDPOINTS=https://master-0.example.com:2379,https://master-1.example.com:2379,https://master-2.example.com:2379
Restart the flanneld service:
# systemctl restart flanneld.service
Restoring a master host backup
After creating a backup of important master host files, if they become corrupted or accidentally removed, you can restore the files by copying them back to the master host, ensuring they contain the proper content, and restarting the affected services.
Procedure
Restore the /etc/origin/master/master-config.yaml file:
# MYBACKUPDIR=/backup/$(hostname)/$(date +%Y%m%d)
# cp /etc/origin/master/master-config.yaml /etc/origin/master/master-config.yaml.old
# cp ${MYBACKUPDIR}/etc/origin/master/master-config.yaml /etc/origin/master/master-config.yaml
# master-restart api
# master-restart controllers
Restarting the master services can lead to downtime. However, you can remove the master host from the highly available load balancer pool, then perform the restore operation. Once the service has been properly restored, you can add the master host back to the load balancer pool.
Perform a full reboot of the affected instance to restore the iptables configuration.
If you cannot restart OKD because packages are missing, reinstall the packages.
Get the list of the current installed packages:
$ rpm -qa | sort > /tmp/current_packages.txt
View the differences between the package lists:
$ diff /tmp/current_packages.txt ${MYBACKUPDIR}/packages.txt
> ansible-2.4.0.0-5.el7.noarch
Reinstall the missing packages:
# yum reinstall -y <packages> (1)
1 Replace <packages> with the packages that are different between the package lists.
Restore a system certificate by copying the certificate to the /etc/pki/ca-trust/source/anchors/ directory and running update-ca-trust:
$ MYBACKUPDIR=/backup/$(hostname)/$(date +%Y%m%d)
$ sudo cp ${MYBACKUPDIR}/etc/pki/ca-trust/source/anchors/my_company.crt /etc/pki/ca-trust/source/anchors/
$ sudo update-ca-trust
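To confirm that the certificate is now in the system trust store, you can search the consolidated trust list; the trust tool is provided by p11-kit on RHEL 7, and the certificate label here is an assumption based on the example file name:
$ trust list | grep -i my_company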
Always ensure that the user ID and group ID, as well as the SELinux context, are restored when the files are copied back.
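A short sketch of restoring ownership and SELinux context after copying files back, assuming the default root ownership of /etc/origin and using the restorecon tool that appears elsewhere in this guide:
$ sudo chown -R root:root /etc/origin
$ sudo restorecon -RvF /etc/origin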
Node host tasks
Deprecating a node host
The procedure is the same whether deprecating an infrastructure node or an application node.
Prerequisites
Ensure enough capacity is available to migrate the existing pods from the node set to be removed. Removing an infrastructure node is advised only when at least two more nodes will stay online after the infrastructure node is removed.
Procedure
List all available nodes to find the node to deprecate:
$ oc get nodes
NAME STATUS AGE VERSION
ocp-infra-node-b7pl Ready 23h v1.6.1+5115d708d7
ocp-infra-node-p5zj Ready 23h v1.6.1+5115d708d7
ocp-infra-node-rghb Ready 23h v1.6.1+5115d708d7
ocp-master-dgf8 Ready,SchedulingDisabled 23h v1.6.1+5115d708d7
ocp-master-q1v2 Ready,SchedulingDisabled 23h v1.6.1+5115d708d7
ocp-master-vq70 Ready,SchedulingDisabled 23h v1.6.1+5115d708d7
ocp-node-020m Ready 23h v1.6.1+5115d708d7
ocp-node-7t5p Ready 23h v1.6.1+5115d708d7
ocp-node-n0dd Ready 23h v1.6.1+5115d708d7
As an example, this topic deprecates the ocp-infra-node-b7pl infrastructure node.
Describe the node and its running services:
$ oc describe node ocp-infra-node-b7pl
Name: ocp-infra-node-b7pl
Role:
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=n1-standard-2
beta.kubernetes.io/os=linux
failure-domain.beta.kubernetes.io/region=europe-west3
failure-domain.beta.kubernetes.io/zone=europe-west3-c
kubernetes.io/hostname=ocp-infra-node-b7pl
role=infra
Annotations: volumes.kubernetes.io/controller-managed-attach-detach=true
Taints: <none>
CreationTimestamp: Wed, 22 Nov 2017 09:36:36 -0500
Phase:
Conditions:
...
Addresses: 10.156.0.11,ocp-infra-node-b7pl
Capacity:
cpu: 2
memory: 7494480Ki
pods: 20
Allocatable:
cpu: 2
memory: 7392080Ki
pods: 20
System Info:
Machine ID: bc95ccf67d047f2ae42c67862c202e44
System UUID: 9762CC3D-E23C-AB13-B8C5-FA16F0BCCE4C
Boot ID: ca8bf088-905d-4ec0-beec-8f89f4527ce4
Kernel Version: 3.10.0-693.5.2.el7.x86_64
OS Image: Employee SKU
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://1.12.6
Kubelet Version: v1.6.1+5115d708d7
Kube-Proxy Version: v1.6.1+5115d708d7
ExternalID: 437740049672994824
Non-terminated Pods: (2 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits
--------- ---- ------------ ---------- --------------- -------------
default docker-registry-1-5szjs 100m (5%) 0 (0%) 256Mi (3%)0 (0%)
default router-1-vzlzq 100m (5%) 0 (0%) 256Mi (3%)0 (0%)
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
CPU Requests CPU Limits Memory Requests Memory Limits
------------ ---------- --------------- -------------
200m (10%) 0 (0%) 512Mi (7%) 0 (0%)
Events: <none>
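The Non-terminated Pods list in the describe output is authoritative; as a quick cross-check, wide pod output is easy to filter by node name (a sketch):
$ oc get pods --all-namespaces -o wide | grep ocp-infra-node-b7pl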
The oc describe output above shows that the node is running two pods: router-1-vzlzq and docker-registry-1-5szjs. Two more infrastructure nodes are available to migrate these two pods.
The cluster described above is a highly available cluster; this means that both the router and docker-registry services are running on all infrastructure nodes.
Mark the node as unschedulable and evacuate all of its pods:
$ oc adm drain ocp-infra-node-b7pl --delete-local-data
node "ocp-infra-node-b7pl" cordoned
WARNING: Deleting pods with local storage: docker-registry-1-5szjs
pod "docker-registry-1-5szjs" evicted
pod "router-1-vzlzq" evicted
node "ocp-infra-node-b7pl" drained
If the pod has attached local storage (for example, EmptyDir), the --delete-local-data option must be provided. Generally, pods running in production should use local storage only for temporary or cache files, not for anything important or persistent. For regular storage, applications should use object storage or persistent volumes. In this case, the docker-registry pod's local storage is empty, because object storage is used instead to store the container images.
The above operation deletes existing pods running on the node. Then, new pods are created according to the replication controller.
In general, every application should be deployed with a deployment configuration, which creates pods using the replication controller.
oc adm drain will not delete any bare pods (pods that are neither mirror pods nor managed by a ReplicationController, ReplicaSet, DaemonSet, StatefulSet, or a job). To delete bare pods, the --force option is required. Be aware that bare pods will not be recreated on other nodes and data may be lost during this operation.
The example below shows the output of the replication controller of the registry:
$ oc describe rc/docker-registry-1
Name: docker-registry-1
Namespace: default
Selector: deployment=docker-registry-1,deploymentconfig=docker-registry,docker-registry=default
Labels: docker-registry=default
openshift.io/deployment-config.name=docker-registry
Annotations: ...
Replicas: 3 current / 3 desired
Pods Status: 3 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
Labels: deployment=docker-registry-1
deploymentconfig=docker-registry
docker-registry=default
Annotations: openshift.io/deployment-config.latest-version=1
openshift.io/deployment-config.name=docker-registry
openshift.io/deployment.name=docker-registry-1
Service Account: registry
Containers:
registry:
Image: openshift3/ose-docker-registry:v3.6.173.0.49
Port: 5000/TCP
Requests:
cpu: 100m
memory: 256Mi
Liveness: http-get https://:5000/healthz delay=10s timeout=5s period=10s #success=1 #failure=3
Readiness: http-get https://:5000/healthz delay=0s timeout=5s period=10s #success=1 #failure=3
Environment:
REGISTRY_HTTP_ADDR: :5000
REGISTRY_HTTP_NET: tcp
REGISTRY_HTTP_SECRET: tyGEnDZmc8dQfioP3WkNd5z+Xbdfy/JVXf/NLo3s/zE=
REGISTRY_MIDDLEWARE_REPOSITORY_OPENSHIFT_ENFORCEQUOTA: false
REGISTRY_HTTP_TLS_KEY: /etc/secrets/registry.key
OPENSHIFT_DEFAULT_REGISTRY: docker-registry.default.svc:5000
REGISTRY_CONFIGURATION_PATH: /etc/registry/config.yml
REGISTRY_HTTP_TLS_CERTIFICATE: /etc/secrets/registry.crt
Mounts:
/etc/registry from docker-config (rw)
/etc/secrets from registry-certificates (rw)
/registry from registry-storage (rw)
Volumes:
registry-storage:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
registry-certificates:
Type: Secret (a volume populated by a Secret)
SecretName: registry-certificates
Optional: false
docker-config:
Type: Secret (a volume populated by a Secret)
SecretName: registry-config
Optional: false
Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
49m 49m 1 replication-controller Normal SuccessfulCreate Created pod: docker-registry-1-dprp5
The event at the bottom of the output displays information about new pod creation. So, when listing all pods:
$ oc get pods
NAME READY STATUS RESTARTS AGE
docker-registry-1-dprp5 1/1 Running 0 52m
docker-registry-1-kr8jq 1/1 Running 0 1d
docker-registry-1-ncpl2 1/1 Running 0 1d
registry-console-1-g4nqg 1/1 Running 0 1d
router-1-2gshr 0/1 Pending 0 52m
router-1-85qm4 1/1 Running 0 1d
router-1-q5sr8 1/1 Running 0 1d
The docker-registry-1-5szjs and router-1-vzlzq pods that were running on the now deprecated node are no longer available. Instead, two new pods have been created: docker-registry-1-dprp5 and router-1-2gshr. As shown above, the new router pod is router-1-2gshr, but it is in the Pending state. This is because each node can run only a single router, which is bound to ports 80 and 443 of the host.
When observing the newly created registry pod, the example below shows that the pod has been created on the ocp-infra-node-rghb node, which is different from the deprecated node:
$ oc describe pod docker-registry-1-dprp5
Name: docker-registry-1-dprp5
Namespace: default
Security Policy: hostnetwork
Node: ocp-infra-node-rghb/10.156.0.10
...
The only difference between deprecating an infrastructure node and an application node is that, once the infrastructure node is evacuated and there is no plan to replace it, the services running on infrastructure nodes can be scaled down:
$ oc scale dc/router --replicas 2
deploymentconfig "router" scaled
$ oc scale dc/docker-registry --replicas 2
deploymentconfig "docker-registry" scaled
Now, each remaining infrastructure node runs one of each kind of pod:
$ oc get pods
NAME READY STATUS RESTARTS AGE
docker-registry-1-kr8jq 1/1 Running 0 1d
docker-registry-1-ncpl2 1/1 Running 0 1d
registry-console-1-g4nqg 1/1 Running 0 1d
router-1-85qm4 1/1 Running 0 1d
router-1-q5sr8 1/1 Running 0 1d
$ oc describe po/docker-registry-1-kr8jq | grep Node:
Node: ocp-infra-node-p5zj/10.156.0.9
$ oc describe po/docker-registry-1-ncpl2 | grep Node:
Node: ocp-infra-node-rghb/10.156.0.10
To provide full high availability, at least three infrastructure nodes should always be available.
To verify that scheduling on the node is disabled:
$ oc get nodes
NAME STATUS AGE VERSION
ocp-infra-node-b7pl Ready,SchedulingDisabled 1d v1.6.1+5115d708d7
ocp-infra-node-p5zj Ready 1d v1.6.1+5115d708d7
ocp-infra-node-rghb Ready 1d v1.6.1+5115d708d7
ocp-master-dgf8 Ready,SchedulingDisabled 1d v1.6.1+5115d708d7
ocp-master-q1v2 Ready,SchedulingDisabled 1d v1.6.1+5115d708d7
ocp-master-vq70 Ready,SchedulingDisabled 1d v1.6.1+5115d708d7
ocp-node-020m Ready 1d v1.6.1+5115d708d7
ocp-node-7t5p Ready 1d v1.6.1+5115d708d7
ocp-node-n0dd Ready 1d v1.6.1+5115d708d7
Verify that the node does not contain any pods:
$ oc describe node ocp-infra-node-b7pl
Name: ocp-infra-node-b7pl
Role:
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=n1-standard-2
beta.kubernetes.io/os=linux
failure-domain.beta.kubernetes.io/region=europe-west3
failure-domain.beta.kubernetes.io/zone=europe-west3-c
kubernetes.io/hostname=ocp-infra-node-b7pl
role=infra
Annotations: volumes.kubernetes.io/controller-managed-attach-detach=true
Taints: <none>
CreationTimestamp: Wed, 22 Nov 2017 09:36:36 -0500
Phase:
Conditions:
...
Addresses: 10.156.0.11,ocp-infra-node-b7pl
Capacity:
cpu: 2
memory: 7494480Ki
pods: 20
Allocatable:
cpu: 2
memory: 7392080Ki
pods: 20
System Info:
Machine ID: bc95ccf67d047f2ae42c67862c202e44
System UUID: 9762CC3D-E23C-AB13-B8C5-FA16F0BCCE4C
Boot ID: ca8bf088-905d-4ec0-beec-8f89f4527ce4
Kernel Version: 3.10.0-693.5.2.el7.x86_64
OS Image: Employee SKU
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://1.12.6
Kubelet Version: v1.6.1+5115d708d7
Kube-Proxy Version: v1.6.1+5115d708d7
ExternalID: 437740049672994824
Non-terminated Pods: (0 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits
--------- ---- ------------ ---------- --------------- -------------
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
CPU Requests CPU Limits Memory Requests Memory Limits
------------ ---------- --------------- -------------
0 (0%) 0 (0%) 0 (0%) 0 (0%)
Events: <none>
Remove the infrastructure instance from the backend section in the /etc/haproxy/haproxy.cfg configuration file:
backend router80
balance source
mode tcp
server infra-1.example.com 192.168.55.12:80 check
server infra-2.example.com 192.168.55.13:80 check
backend router443
balance source
mode tcp
server infra-1.example.com 192.168.55.12:443 check
server infra-2.example.com 192.168.55.13:443 check
Then, restart the haproxy service:
$ sudo systemctl restart haproxy
After all pods are evicted, remove the node from the cluster with the following command:
$ oc delete node ocp-infra-node-b7pl
node "ocp-infra-node-b7pl" deleted
$ oc get nodes
NAME STATUS AGE VERSION
ocp-infra-node-p5zj Ready 1d v1.6.1+5115d708d7
ocp-infra-node-rghb Ready 1d v1.6.1+5115d708d7
ocp-master-dgf8 Ready,SchedulingDisabled 1d v1.6.1+5115d708d7
ocp-master-q1v2 Ready,SchedulingDisabled 1d v1.6.1+5115d708d7
ocp-master-vq70 Ready,SchedulingDisabled 1d v1.6.1+5115d708d7
ocp-node-020m Ready 1d v1.6.1+5115d708d7
ocp-node-7t5p Ready 1d v1.6.1+5115d708d7
ocp-node-n0dd Ready 1d v1.6.1+5115d708d7
For more information on evacuating and draining pods or nodes, see the Node maintenance section.
Replacing a node host
If a node must be added in place of the deprecated node, follow the Adding hosts to an existing cluster section.
Creating a node host backup
Creating a backup of a node host is a different use case from backing up a master host. Because master hosts contain many important files, creating a backup is highly recommended. Nodes, however, are designed so that anything special is replicated across the nodes in case of failover, and they typically do not contain data that is necessary to run an environment. If a node does contain something necessary to run an environment, then creating a backup is recommended.
The backup process is to be performed before any change to the infrastructure, such as a system update, upgrade, or any other significant modification. Backups should be performed on a regular basis to ensure the most recent data is available if a failure occurs.
OKD files
Node instances run applications in the form of pods, which are based on containers. The /etc/origin/ and /etc/origin/node directories house important files, such as:
The configuration of the node services
Certificates generated by the installation
Cloud provider-related configuration
Keys and other authentication files, such as the dnsmasq configuration
The OKD services can be customized to increase the log level, use proxies, and more, and the configuration files are stored in the /etc/sysconfig directory.
Procedure
Create a backup of the node configuration files:
$ MYBACKUPDIR=/backup/$(hostname)/$(date +%Y%m%d)
$ sudo mkdir -p ${MYBACKUPDIR}/etc/sysconfig
$ sudo cp -aR /etc/origin ${MYBACKUPDIR}/etc
$ sudo cp -aR /etc/sysconfig/atomic-openshift-node ${MYBACKUPDIR}/etc/sysconfig/
OKD uses specific files that must be taken into account when planning the backup policy, including:
File                                    Description
/etc/cni/                               Container Network Interface configuration (if used)
/etc/sysconfig/iptables                 Where the iptables rules are stored
/etc/sysconfig/docker-storage-setup     The input file for the container-storage-setup command
/etc/sysconfig/docker                   The docker configuration file
/etc/sysconfig/docker-network           docker networking configuration (for example, MTU)
/etc/sysconfig/docker-storage           docker storage configuration (generated by container-storage-setup)
/etc/dnsmasq.conf                       Main configuration file for dnsmasq
/etc/dnsmasq.d/                         Different dnsmasq configuration files
/etc/sysconfig/flanneld                 flannel configuration file (if used)
/etc/pki/ca-trust/source/anchors/       Certificates added to the system (for example, for external registries)
Create a backup of those files:
$ MYBACKUPDIR=/backup/$(hostname)/$(date +%Y%m%d)
$ sudo mkdir -p ${MYBACKUPDIR}/etc/sysconfig
$ sudo mkdir -p ${MYBACKUPDIR}/etc/pki/ca-trust/source/anchors
$ sudo cp -aR /etc/sysconfig/{iptables,docker-*,flanneld} \
${MYBACKUPDIR}/etc/sysconfig/
$ sudo cp -aR /etc/dnsmasq* /etc/cni ${MYBACKUPDIR}/etc/
$ sudo cp -aR /etc/pki/ca-trust/source/anchors/* \
${MYBACKUPDIR}/etc/pki/ca-trust/source/anchors/
If a package is accidentally removed, or a file included in an rpm package must be restored, having a list of the rhel packages installed on the system can be useful.
If you use Red Hat Satellite features, such as content views or the facts store, they provide a proper mechanism for reinstalling missing packages and a historical record of the packages installed on your systems.
To create a list of the current rhel packages installed in the system:
$ MYBACKUPDIR=/backup/$(hostname)/$(date +%Y%m%d)
$ sudo mkdir -p ${MYBACKUPDIR}
$ rpm -qa | sort | sudo tee $MYBACKUPDIR/packages.txt
The following files should now be present in the backup directory:
$ MYBACKUPDIR=/backup/$(hostname)/$(date +%Y%m%d)
$ sudo find ${MYBACKUPDIR} -mindepth 1 -type f -printf '%P\n'
etc/sysconfig/atomic-openshift-node
etc/sysconfig/flanneld
etc/sysconfig/iptables
etc/sysconfig/docker-network
etc/sysconfig/docker-storage
etc/sysconfig/docker-storage-setup
etc/sysconfig/docker-storage-setup.rpmnew
etc/origin/node/system:node:app-node-0.example.com.crt
etc/origin/node/system:node:app-node-0.example.com.key
etc/origin/node/ca.crt
etc/origin/node/system:node:app-node-0.example.com.kubeconfig
etc/origin/node/server.crt
etc/origin/node/server.key
etc/origin/node/node-dnsmasq.conf
etc/origin/node/resolv.conf
etc/origin/node/node-config.yaml
etc/origin/node/flannel.etcd-client.key
etc/origin/node/flannel.etcd-client.csr
etc/origin/node/flannel.etcd-client.crt
etc/origin/node/flannel.etcd-ca.crt
etc/origin/cloudprovider/openstack.conf
etc/pki/ca-trust/source/anchors/openshift-ca.crt
etc/pki/ca-trust/source/anchors/registry-ca.crt
etc/dnsmasq.conf
etc/dnsmasq.d/origin-dns.conf
etc/dnsmasq.d/origin-upstream-dns.conf
etc/dnsmasq.d/node-dnsmasq.conf
packages.txt
If needed, the files can be compressed to save space:
$ MYBACKUPDIR=/backup/$(hostname)/$(date +%Y%m%d)
$ sudo tar -zcvf /backup/$(hostname)-$(date +%Y%m%d).tar.gz $MYBACKUPDIR
$ sudo rm -Rf ${MYBACKUPDIR}
To create any of these files from scratch, the openshift-ansible-contrib repository contains the backup_master_node.sh script, which performs the previous steps. The script creates a directory on the host running the script and copies all the files previously mentioned.
The openshift-ansible-contrib scripts are not supported by Red Hat, but the reference architecture team performs testing to ensure the code operates as defined and is secure.
The script can be executed on every master host with:
$ mkdir ~/git
$ cd ~/git
$ git clone https://github.com/openshift/openshift-ansible-contrib.git
$ cd openshift-ansible-contrib/reference-architecture/day2ops/scripts
$ ./backup_master_node.sh -h
Restoring a node host backup
After creating a backup of important node host files, if they become corrupted or accidentally removed, you can restore a file by copying it back, ensuring it contains the proper content, and restarting the affected services.
Procedure
Restore the /etc/origin/node/node-config.yaml file:
# MYBACKUPDIR=/backup/$(hostname)/$(date +%Y%m%d)
# cp /etc/origin/node/node-config.yaml /etc/origin/node/node-config.yaml.old
# cp /backup/$(hostname)/$(date +%Y%m%d)/etc/origin/node/node-config.yaml /etc/origin/node/node-config.yaml
# reboot
Restarting the services can lead to downtime. See the Node maintenance section for tips on how to ease the process.
Perform a full reboot of the affected instance to restore the iptables configuration.
If you cannot restart OKD because packages are missing, reinstall the packages.
Get the list of the current installed packages:
$ rpm -qa | sort > /tmp/current_packages.txt
View the differences between the package lists:
$ diff /tmp/current_packages.txt ${MYBACKUPDIR}/packages.txt
> ansible-2.4.0.0-5.el7.noarch
Reinstall the missing packages:
# yum reinstall -y <packages> (1)
1 Replace <packages> with the packages that are different between the package lists.
Restore a system certificate by copying the certificate to the /etc/pki/ca-trust/source/anchors/ directory and running update-ca-trust:
$ MYBACKUPDIR=/backup/$(hostname)/$(date +%Y%m%d)
$ sudo cp ${MYBACKUPDIR}/etc/pki/ca-trust/source/anchors/my_company.crt /etc/pki/ca-trust/source/anchors/
$ sudo update-ca-trust
Always ensure that the proper user ID and group ID, as well as the SELinux context, are restored when the files are copied back.
Node maintenance and next steps
See the Managing nodes and Managing pods topics for various node management options. These include:
A node can reserve a portion of its resources to be used by specific components. These include the kubelet, kube-proxy, Docker, or other remaining system components such as sshd and NetworkManager. See the Allocating node resources section in the Cluster Administrator guide for more information.
etcd tasks
etcd backup
etcd is the key value store for all object definitions, as well as the persistent master state. Other components watch for changes, then bring themselves into the desired state.
OKD versions prior to 3.5 use etcd version 2 (v2), while 3.5 and later use version 3 (v3). The data model between the two versions of etcd is different. etcd v3 can use both the v2 and v3 data models, whereas etcd v2 can only use the v2 data model. In an etcd v3 server, the v2 and v3 data stores exist in parallel and are independent.
For both v2 and v3 operations, you can use the ETCDCTL_API environment variable to select the correct API:
$ etcdctl -v
etcdctl version: 3.2.5
API version: 2
$ ETCDCTL_API=3 etcdctl version
etcdctl version: 3.2.5
API version: 3.2
See Migrating etcd Data (v2 to v3) section in the OKD 3.7 documentation for information about how to migrate to v3.
In OKD version 3.10 and later, you can either install etcd on separate hosts or run it as a static pod on your master hosts. If you do not specify separate etcd hosts, etcd runs as a static pod on master hosts. Because of this difference, the backup process is different if you use static pods.
The etcd backup process is composed of two different procedures:
Configuration backup: Including the required etcd configuration and certificates
Data backup: Including both the v2 and v3 data models.
You can perform the data backup process on any host that has connectivity to the etcd cluster, where the proper certificates are provided, and where the etcdctl tool is installed.
The backup files must be copied to an external system, ideally outside the OKD environment, and then encrypted.
Note that the etcd backup still has all the references to current storage volumes. When you restore etcd, OKD starts launching the previous pods on nodes and reattaching the same storage. This process is no different than the process of when you remove a node from the cluster and add a new one back in its place. Anything attached to that node is reattached to the pods on whatever nodes they are rescheduled to.
Backing up etcd
When you back up etcd, you must back up both the etcd configuration files and the etcd data.
Backing up etcd configuration files
The etcd configuration files to be preserved are all stored in the /etc/etcd directory of the instances where etcd is running. This includes the etcd configuration file (/etc/etcd/etcd.conf) and the required certificates for cluster communication. All those files are generated at installation time by the Ansible installer.
Procedure
For each etcd member of the cluster, back up the etcd configuration.
$ ssh master-0
# mkdir -p /backup/etcd-config-$(date +%Y%m%d)/
# cp -R /etc/etcd/ /backup/etcd-config-$(date +%Y%m%d)/
The certificates and configuration files on each etcd cluster member are unique.
Backing up etcd data
Prerequisites
The OKD installer creates aliases to avoid typing all the flags, named etcdctl2 for etcd v2 tasks and etcdctl3 for v3 tasks. However, the etcdctl3 alias does not provide the full endpoint list to the etcdctl command, so you must specify the --endpoints option and list all the endpoints.
Before backing up etcd:
etcdctl binaries must be available or, in containerized installations, the rhel7/etcd container must be available.
Ensure that the OKD API service is running.
Ensure connectivity with the etcd cluster (port 2379/tcp); a quick check is shown after this list.
Ensure that the proper certificates to connect to the etcd cluster are available.
Ensure go is installed.
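A minimal connectivity check, assuming the certificate paths and example host name used elsewhere in this topic; etcd exposes a /health endpoint over the client port:
# curl -s --cacert /etc/etcd/ca.crt \
    --cert /etc/etcd/peer.crt --key /etc/etcd/peer.key \
    https://master-0.example.com:2379/health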
Procedure
While the etcdctl backup command is used to back up the etcd cluster, etcd v3 has no concept of a backup. Instead, you either take a snapshot from a live member with the etcdctl snapshot save command or copy the member/snap/db file from an etcd data directory.
The etcdctl backup command rewrites some of the metadata contained in the backup, specifically the node ID and cluster ID, which means that in the backup, the node loses its former identity. To recreate a cluster from the backup, you create a new, single-node cluster, then add the rest of the nodes to the cluster.
Back up the etcd data:
Clusters upgraded from previous versions of OKD might contain v2 data stores. Back up all etcd data stores.
Make a snapshot of the etcd node:
# systemctl show etcd --property=ActiveState,SubState
# mkdir -p /var/lib/etcd/backup/etcd-$(date +%Y%m%d) (1)
# etcdctl3 snapshot save /var/lib/etcd/backup/etcd-$(date +%Y%m%d)/db
1 You must write the snapshot to a directory under /var/lib/etcd/.
The etcdctl snapshot save command requires the etcd service to be running.
Stop all etcd services by removing the etcd pod definition and rebooting the host:
# mkdir -p /etc/origin/node/pods-stopped
# mv /etc/origin/node/pods/* /etc/origin/node/pods-stopped/
Create the etcd data backup and copy the etcd db file:
# etcdctl2 backup \
--data-dir /var/lib/etcd \
--backup-dir /backup/etcd-$(date +%Y%m%d)
A /backup/etcd-<date>/ directory is created, where <date> represents the current date. Copy this directory to an external NFS share, S3 bucket, or any other external storage location.
In the case of an all-in-one cluster, the etcd data directory is located in the /var/lib/origin/openshift.local.etcd directory.
If etcd runs as a static pod, run the following commands:
If you use static pods, use the v3 API.
Obtain the etcd endpoint IP address from the static pod manifest:
$ export ETCD_POD_MANIFEST="/etc/origin/node/pods/etcd.yaml"
$ export ETCD_EP=$(grep https ${ETCD_POD_MANIFEST} | cut -d '/' -f3)
Obtain the etcd pod name:
$ oc login -u system:admin
$ export ETCD_POD=$(oc get pods -n kube-system | grep -o -m 1 '\S*etcd\S*')
Take a snapshot of the etcd data in the pod and store it locally:
$ oc project kube-system
$ oc exec ${ETCD_POD} -c etcd -- /bin/bash -c "ETCDCTL_API=3 etcdctl \
--cert /etc/etcd/peer.crt \
--key /etc/etcd/peer.key \
--cacert /etc/etcd/ca.crt \
--endpoints $ETCD_EP \
snapshot save /var/lib/etcd/snapshot.db"
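Whichever method you use, copy the resulting snapshot off the host afterwards, in line with the note above about external storage. A sketch with scp, where the destination host and path are placeholders:
# scp /var/lib/etcd/snapshot.db backup-host:/srv/backups/etcd-$(date +%Y%m%d).db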
Restoring etcd
The restore procedure for etcd configuration files replaces the appropriate files, then restarts the service or static pod.
If an etcd host has become corrupted and the /etc/etcd/etcd.conf file is lost, restore it using:
$ ssh master-0
# cp /backup/yesterday/master-0-files/etcd.conf /etc/etcd/etcd.conf
# restorecon -RvF /etc/etcd/etcd.conf
In this example, the backup file is stored at the /backup/yesterday/master-0-files/etcd.conf path, which can be on an external NFS share, S3 bucket, or other storage solution.
If you run etcd as a static pod, follow only the steps in that section. If you run etcd as a separate service on either master or standalone nodes, follow the steps to restore data as required.
Restoring etcd data
The following process restores healthy data files and starts the etcd cluster as a single node, then adds the rest of the nodes if an etcd cluster is required.
Procedure
Stop all etcd services by removing the etcd pod definition and rebooting the host:
# mkdir -p /etc/origin/node/pods-stopped
# mv /etc/origin/node/pods/* /etc/origin/node/pods-stopped/
# reboot
To ensure the proper backup is restored, delete the etcd directories:
To back up the current etcd data before you delete the directory, run the following commands:
# mv /var/lib/etcd /var/lib/etcd.old
# mkdir /var/lib/etcd
# restorecon -RvF /var/lib/etcd/
Or, to delete the directory and the etcd data, run the following command:
# rm -Rf /var/lib/etcd/*
In an all-in-one cluster, the etcd data directory is located in the /var/lib/origin/openshift.local.etcd directory.
Restore a healthy backup data file to each of the etcd nodes. Perform this step on all etcd hosts, including master hosts colocated with etcd.
# cp -R /backup/etcd-xxx/* /var/lib/etcd/
# mv /var/lib/etcd/db /var/lib/etcd/member/snap/db
# chcon -R --reference /backup/etcd-xxx/* /var/lib/etcd/
Run the etcd service on one of your etcd hosts, forcing a new cluster.
The following commands create a custom drop-in file for the etcd service, which overrides the execution command, adding the --force-new-cluster option:
# mkdir -p /etc/systemd/system/etcd.service.d/
# echo "[Service]" > /etc/systemd/system/etcd.service.d/temp.conf
# echo "ExecStart=" >> /etc/systemd/system/etcd.service.d/temp.conf
# sed -n '/ExecStart/s/"$/ --force-new-cluster"/p' \
/usr/lib/systemd/system/etcd.service \
>> /etc/systemd/system/etcd.service.d/temp.conf
# systemctl daemon-reload
# master-restart etcd
Check for error messages:
# master-logs etcd etcd
Check for health status:
# etcdctl3 cluster-health
member 5ee217d17301 is healthy: got healthy result from https://192.168.55.8:2379
cluster is healthy
Restart the etcd service in cluster mode:
# rm -f /etc/systemd/system/etcd.service.d/temp.conf
# systemctl daemon-reload
# master-restart etcd
Check for health status and member list:
# etcdctl3 cluster-health
member 5ee217d17301 is healthy: got healthy result from https://192.168.55.8:2379
cluster is healthy
# etcdctl3 member list
5ee217d17301: name=master-0.example.com peerURLs=http://localhost:2380 clientURLs=https://192.168.55.8:2379 isLeader=true
After the first instance is running, you can add the remaining peers back into the cluster.
Fix the peerURLs parameter
After restoring the data and creating a new cluster, the peerURLs parameter shows localhost instead of the IP where etcd is listening for peer communication:
# etcdctl3 member list
5ee217d17301: name=master-0.example.com peerURLs=http://localhost:2380 clientURLs=https://192.168.55.8:2379 isLeader=true
Procedure
Get the member ID using the etcdctl3 member list command:
# etcdctl3 member list
Get the IP where etcd listens for peer communication:
$ ss -l4n | grep 2380
Update the member information with that IP:
# etcdctl3 member update 5ee217d17301 https://192.168.55.8:2380
Updated member with ID 5ee217d17301 in cluster
To verify, check that the IP is in the member list:
$ etcdctl3 member list
5ee217d17301: name=master-0.example.com peerURLs=https://192.168.55.8:2380 clientURLs=https://192.168.55.8:2379 isLeader=true
Restoring etcd snapshot
Snapshot integrity can optionally be verified at restore time. If the snapshot is taken with etcdctl snapshot save, it has an integrity hash that is checked by etcdctl snapshot restore. If the snapshot is copied from the data directory, there is no integrity hash and the snapshot restores only with the --skip-hash-check option.
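You can inspect a snapshot's hash, revision count, and size before attempting a restore; etcdctl's snapshot status subcommand prints these fields (shown with the etcdctl3 alias used throughout this topic):
# etcdctl3 snapshot status /backup/etcd-xxxxxx/backup.db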
The procedure to restore the data must be performed on a single etcd host. You can then add the rest of the nodes to the cluster.
Procedure
Stop all etcd services by removing the etcd pod definition and rebooting the host:
# mkdir -p /etc/origin/node/pods-stopped
# mv /etc/origin/node/pods/* /etc/origin/node/pods-stopped/
# reboot
Clear all old data, because etcdctl recreates it on the node where the restore procedure is performed:
# rm -Rf /var/lib/etcd
Run the snapshot restore command, substituting the values from the /etc/etcd/etcd.conf file:
# etcdctl3 snapshot restore /backup/etcd-xxxxxx/backup.db \
--data-dir /var/lib/etcd \
--name master-0.example.com \
--initial-cluster "master-0.example.com=https://192.168.55.8:2380" \
--initial-cluster-token "etcd-cluster-1" \
--initial-advertise-peer-urls https://192.168.55.8:2380 \
--skip-hash-check=true
2017-10-03 08:55:32.440779 I | mvcc: restore compact to 1041269
2017-10-03 08:55:32.468244 I | etcdserver/membership: added member 40bef1f6c79b3163 [https://192.168.55.8:2380] to cluster 26841ebcf610583c
Restore permissions and SELinux context to the restored files:
# restorecon -RvF /var/lib/etcd
Start the etcd service:
# systemctl start etcd
Check for any error messages:
# master-logs etcd etcd
Restoring etcd on a static pod
Before restoring etcd on a static pod:
etcdctl binaries must be available or, in containerized installations, the rhel7/etcd container must be available.
You can install the etcdctl binary with the etcd package by running the following command:
# yum install etcd
The package also installs the systemd service. Disable and mask the service so that it does not run as a systemd service when etcd runs in a static pod. By disabling and masking the service, you ensure that you do not accidentally start it and prevent it from automatically restarting when you reboot the system.
# systemctl disable etcd.service
# systemctl mask etcd.service
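You can confirm that the unit is masked before continuing; systemctl reports the state directly (typical output shown):
# systemctl is-enabled etcd.service
masked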
To restore etcd on a static pod:
If the pod is running, stop the etcd pod by moving the pod manifest YAML file to another directory:
# mkdir -p /etc/origin/node/pods-stopped
# mv /etc/origin/node/pods/etcd.yaml /etc/origin/node/pods-stopped
Clear all old data:
# rm -rf /var/lib/etcd
etcdctl recreates the data on the node where you restore the pod.
Restore the etcd snapshot to the mount path for the etcd pod:
# export ETCDCTL_API=3
# etcdctl snapshot restore /etc/etcd/backup/etcd/snapshot.db \
--data-dir /var/lib/etcd/ \
--name ip-172-18-3-48.ec2.internal \
--initial-cluster "ip-172-18-3-48.ec2.internal=https://172.18.3.48:2380" \
--initial-cluster-token "etcd-cluster-1" \
--initial-advertise-peer-urls https://172.18.3.48:2380 \
--skip-hash-check=true
Obtain the values for your cluster from the $/backup_files/etcd.conf file.
Set the required permissions and SELinux context on the data directory:
# restorecon -RvF /var/lib/etcd/
Restart the etcd pod by moving the pod manifest YAML file to the required directory:
# mv /etc/origin/node/pods-stopped/etcd.yaml /etc/origin/node/pods/
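To confirm that the restored pod starts cleanly, you can reuse the same log check as in the systemd-based restore procedure above:
# master-logs etcd etcd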
Replacing an etcd host
To replace an etcd host, scale up the etcd cluster and then remove the host. This process ensures that you keep quorum if you lose an etcd host during the replacement procedure.
The etcd cluster must maintain a quorum during the replacement operation, which means that a majority of the etcd hosts must remain in operation at all times. If the host replacement operation occurs while the etcd cluster maintains a quorum, cluster operations are usually not affected. If a large amount of etcd data must replicate, some operations might slow down. |
Before you start any procedure involving the etcd cluster, you must have a backup of the etcd data and configuration files so that you can restore the cluster if the procedure fails. |
Scaling etcd
You can scale the etcd cluster vertically by adding more resources to the etcd hosts or horizontally by adding more etcd hosts.
Due to the voting system etcd uses, the cluster must always contain an odd number of members. Compared with the next-lower even number of hosts, an odd number of etcd hosts does not increase the quorum size, but it does increase the tolerance for failure. For example, with a cluster of three members, quorum is two, which leaves a failure tolerance of one: the cluster continues to operate as long as two of the members are healthy. Having an in-production cluster of three etcd hosts is recommended. |
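As a quick reference for the arithmetic above (quorum is a majority of the member count):
etcd hosts: 1    quorum: 1    failure tolerance: 0
etcd hosts: 3    quorum: 2    failure tolerance: 1
etcd hosts: 4    quorum: 3    failure tolerance: 1
etcd hosts: 5    quorum: 3    failure tolerance: 2
Note that moving from three hosts to four raises the quorum without adding any failure tolerance.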
The new host requires a fresh, dedicated host running Red Hat Enterprise Linux 7. For maximum performance, locate the etcd storage on an SSD and on a dedicated disk mounted at /var/lib/etcd.
Prerequisites
Before you add a new etcd host, perform a backup of both etcd configuration and data to prevent data loss.
Check the current etcd cluster status to avoid adding new hosts to an unhealthy cluster. Run this command:
# ETCDCTL_API=3 etcdctl --cert="/etc/etcd/peer.crt" \
--key=/etc/etcd/peer.key \
--cacert="/etc/etcd/ca.crt" \
--endpoints="https://*master-0.example.com*:2379,\
https://*master-1.example.com*:2379,\
https://*master-2.example.com*:2379"
endpoint health
https://master-0.example.com:2379 is healthy: successfully committed proposal: took = 5.011358ms
https://master-1.example.com:2379 is healthy: successfully committed proposal: took = 1.305173ms
https://master-2.example.com:2379 is healthy: successfully committed proposal: took = 1.388772ms
Before running the scaleup playbook, ensure the new host is registered to the proper Red Hat software channels:
# subscription-manager register \
    --username=<username> --password=<password>
# subscription-manager attach --pool=<poolid>
# subscription-manager repos --disable="*"
# subscription-manager repos \
--enable=rhel-7-server-rpms \
--enable=rhel-7-server-extras-rpms
etcd is hosted in the rhel-7-server-extras-rpms software channel.
Make sure all unused etcd members are removed from the etcd cluster before running the scaleup playbook.
List the etcd members:
# etcdctl --cert="/etc/etcd/peer.crt" --key="/etc/etcd/peer.key" \
--cacert="/etc/etcd/ca.crt" --endpoints=ETCD_LISTEN_CLIENT_URLS member list -w table
Copy the unused etcd member ID, if applicable.
Remove the unused member by specifying its ID in the following command:
# etcdctl --cert="/etc/etcd/peer.crt" --key="/etc/etcd/peer.key" \
--cacert="/etc/etcd/ca.crt" --endpoints=ETCD_LISTEN_CLIENT_URL member remove UNUSED_ETCD_MEMBER_ID
Upgrade etcd and iptables on the current etcd nodes:
# yum update etcd iptables-services
Back up the /etc/etcd configuration for the etcd hosts.
If the new etcd members will also be OKD nodes, add the desired number of hosts to the cluster.
The rest of this procedure assumes you added one host, but if you add multiple hosts, perform all steps on each host.
Adding a new etcd host using Ansible
Procedure
In the Ansible inventory file, create a new group named [new_etcd] and add the new host. Then, add the new_etcd group as a child of the [OSEv3] group:
[OSEv3:children]
masters
nodes
etcd
new_etcd (1)
... [OUTPUT ABBREVIATED] ...
[etcd]
master-0.example.com
master-1.example.com
master-2.example.com
[new_etcd] (1)
etcd0.example.com (1)
1 Add these lines.
From the host that installed OKD and hosts the Ansible inventory file, change to the playbook directory and run the etcd scaleup playbook:
$ cd /usr/share/ansible/openshift-ansible
$ ansible-playbook playbooks/openshift-etcd/scaleup.yml
After the playbook runs, modify the inventory file to reflect the current status by moving the new etcd host from the [new_etcd] group to the [etcd] group:
[OSEv3:children]
masters
nodes
etcd
new_etcd
... [OUTPUT ABBREVIATED] ...
[etcd]
master-0.example.com
master-1.example.com
master-2.example.com
etcd0.example.com
If you use Flannel, modify the flanneld service configuration, located at /etc/sysconfig/flanneld on every OKD host, to include the new etcd host:
FLANNEL_ETCD_ENDPOINTS=https://master-0.example.com:2379,https://master-1.example.com:2379,https://master-2.example.com:2379,https://etcd0.example.com:2379
Restart the flanneld service:
# systemctl restart flanneld.service
Manually adding a new etcd host
If you do not run etcd as static pods on master nodes, you might need to add another etcd host.
Procedure
Modify the current etcd cluster
To create the etcd certificates, run the openssl command, replacing the values with those from your environment.
Create some environment variables:
export NEW_ETCD_HOSTNAME="etcd0.example.com"
export NEW_ETCD_IP="192.168.55.21"
export CN=$NEW_ETCD_HOSTNAME
export SAN="IP:${NEW_ETCD_IP}, DNS:${NEW_ETCD_HOSTNAME}"
export PREFIX="/etc/etcd/generated_certs/etcd-$CN/"
export OPENSSLCFG="/etc/etcd/ca/openssl.cnf"
The custom openssl extensions used as etcd_v3_ca_* include the $SAN environment variable as subjectAltName. See /etc/etcd/ca/openssl.cnf for more information.
Create the directory to store the configuration and certificates:
# mkdir -p ${PREFIX}
Create the server certificate request and sign it (this generates server.csr and server.crt):
# openssl req -new -config ${OPENSSLCFG} \
-keyout ${PREFIX}server.key \
-out ${PREFIX}server.csr \
-reqexts etcd_v3_req -batch -nodes \
-subj /CN=$CN
# openssl ca -name etcd_ca -config ${OPENSSLCFG} \
-out ${PREFIX}server.crt \
-in ${PREFIX}server.csr \
-extensions etcd_v3_ca_server -batch
Create the peer certificate request and sign it (this generates peer.csr and peer.crt):
# openssl req -new -config ${OPENSSLCFG} \
-keyout ${PREFIX}peer.key \
-out ${PREFIX}peer.csr \
-reqexts etcd_v3_req -batch -nodes \
-subj /CN=$CN
# openssl ca -name etcd_ca -config ${OPENSSLCFG} \
-out ${PREFIX}peer.crt \
-in ${PREFIX}peer.csr \
-extensions etcd_v3_ca_peer -batch
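Optionally, you can confirm that the signed certificates carry the expected subjectAltName before distributing them (a quick check; the values follow from the $SAN variable set above):
# openssl x509 -in ${PREFIX}server.crt -noout -text | grep -A1 "Subject Alternative Name"
    X509v3 Subject Alternative Name:
        IP Address:192.168.55.21, DNS:etcd0.example.com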
Copy the current etcd configuration and ca.crt files from the current node as examples to modify later:
# cp /etc/etcd/etcd.conf ${PREFIX}
# cp /etc/etcd/ca.crt ${PREFIX}
While still on the surviving etcd host, add the new host to the cluster. To add additional etcd members to the cluster, you must first adjust the default localhost peer in the peerURLs value for the first member:
Get the member ID for the first member using the member list command:
# etcdctl --cert-file=/etc/etcd/peer.crt \
--key-file=/etc/etcd/peer.key \
--ca-file=/etc/etcd/ca.crt \
--peers="https://172.18.1.18:2379,https://172.18.9.202:2379,https://172.18.0.75:2379" \ (1)
member list
1 Ensure that you specify the URLs of only active etcd members in the --peers parameter value.
Obtain the IP address where etcd listens for cluster peers:
$ ss -l4n | grep 2380
Update the value of peerURLs using the etcdctl member update command by passing the member ID and IP address obtained from the previous steps:
# etcdctl --cert-file=/etc/etcd/peer.crt \
--key-file=/etc/etcd/peer.key \
--ca-file=/etc/etcd/ca.crt \
--peers="https://172.18.1.18:2379,https://172.18.9.202:2379,https://172.18.0.75:2379" \
member update 511b7fb6cc0001 https://172.18.1.18:2380
Re-run the member list command and ensure that the peer URLs no longer include localhost.
Add the new host to the etcd cluster. Note that the new host is not yet configured, so the status stays as unstarted until you configure the new host.
You must add each member and bring it online one at a time. When you add each additional member to the cluster, you must adjust the peerURLs list for the current peers: the list grows by one for each member added. The etcdctl member add command outputs the values that you must set in the etcd.conf file as you add each member, as described in the following instructions.
# etcdctl -C https://${CURRENT_ETCD_HOST}:2379 \
--ca-file=/etc/etcd/ca.crt \
--cert-file=/etc/etcd/peer.crt \
--key-file=/etc/etcd/peer.key member add ${NEW_ETCD_HOSTNAME} https://${NEW_ETCD_IP}:2380 (1)
Added member named 10.3.9.222 with ID 4e1db163a21d7651 to cluster
ETCD_NAME="<NEW_ETCD_HOSTNAME>"
ETCD_INITIAL_CLUSTER="<NEW_ETCD_HOSTNAME>=https://<NEW_HOST_IP>:2380,<CLUSTERMEMBER1_NAME>=https://<CLUSTERMEMBER1_IP>:2380,<CLUSTERMEMBER2_NAME>=https://<CLUSTERMEMBER2_IP>:2380,<CLUSTERMEMBER3_NAME>=https://<CLUSTERMEMBER3_IP>:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
1 In this line, 10.3.9.222 is a label for the etcd member. You can specify the host name, IP address, or a simple name.
Update the sample ${PREFIX}/etcd.conf file:
Replace the following values with the values generated in the previous step:
ETCD_NAME
ETCD_INITIAL_CLUSTER
ETCD_INITIAL_CLUSTER_STATE
Modify the following variables with the new host IP from the output of the previous step. You can use ${NEW_ETCD_IP} as the value.
ETCD_LISTEN_PEER_URLS
ETCD_LISTEN_CLIENT_URLS
ETCD_INITIAL_ADVERTISE_PEER_URLS
ETCD_ADVERTISE_CLIENT_URLS
If you previously used the member system as an etcd node, you must overwrite the current values in the /etc/etcd/etcd.conf file.
Check the file for syntax errors or missing IP addresses, otherwise the etcd service might fail:
# vi ${PREFIX}/etcd.conf
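For orientation, a sketch of what the four URL variables look like after editing, assuming the example ${NEW_ETCD_IP} of 192.168.55.21 and the standard etcd ports (2379 for clients, 2380 for peers):
ETCD_LISTEN_PEER_URLS=https://192.168.55.21:2380
ETCD_LISTEN_CLIENT_URLS=https://192.168.55.21:2379
ETCD_INITIAL_ADVERTISE_PEER_URLS=https://192.168.55.21:2380
ETCD_ADVERTISE_CLIENT_URLS=https://192.168.55.21:2379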
On the node that hosts the installation files, update the [etcd] hosts group in the /etc/ansible/hosts inventory file. Remove the old etcd hosts and add the new ones.
Create a tgz file that contains the certificates, the sample configuration file, and the ca.crt file, and copy it to the new host:
# tar -czvf /etc/etcd/generated_certs/${CN}.tgz -C ${PREFIX} .
# scp /etc/etcd/generated_certs/${CN}.tgz ${CN}:/tmp/
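You can list the archive contents to confirm that the certificates, the sample etcd.conf, and ca.crt were included (an optional check):
# tar -tzf /etc/etcd/generated_certs/${CN}.tgz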
Modify the new etcd host
Install iptables-services to provide iptables utilities to open the required ports for etcd:
# yum install -y iptables-services
Create the OS_FIREWALL_ALLOW firewall rules to allow etcd to communicate:
Port 2379/tcp for clients
Port 2380/tcp for peer communication
# systemctl enable iptables.service --now
# iptables -N OS_FIREWALL_ALLOW
# iptables -t filter -I INPUT -j OS_FIREWALL_ALLOW
# iptables -A OS_FIREWALL_ALLOW -p tcp -m state --state NEW -m tcp --dport 2379 -j ACCEPT
# iptables -A OS_FIREWALL_ALLOW -p tcp -m state --state NEW -m tcp --dport 2380 -j ACCEPT
# iptables-save | tee /etc/sysconfig/iptables
In this example, a new chain OS_FIREWALL_ALLOW is created, which is the standard name the OKD installer uses for firewall rules.
If the environment is hosted in an IaaS environment, modify the security groups for the instance to allow incoming traffic to those ports as well.
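To confirm the chain and its rules are active, you can list them (an optional check):
# iptables -nL OS_FIREWALL_ALLOW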
Install etcd:
# yum install -y etcd
Ensure that version etcd-2.3.7-4.el7.x86_64 or greater is installed.
Ensure the etcd service is not running by removing the etcd pod definition:
# mkdir -p /etc/origin/node/pods-stopped
# mv /etc/origin/node/pods/* /etc/origin/node/pods-stopped/
Remove any etcd configuration and data:
# rm -Rf /etc/etcd/*
# rm -Rf /var/lib/etcd/*
Extract the certificates and configuration files:
# tar xzvf /tmp/etcd0.example.com.tgz -C /etc/etcd/
Start etcd on the new host:
# systemctl enable etcd --now
Verify that the host is part of the cluster and the current cluster health:
If you use the v2 etcd API, run the following command:
# etcdctl --cert-file=/etc/etcd/peer.crt \
--key-file=/etc/etcd/peer.key \
--ca-file=/etc/etcd/ca.crt \
--peers="https://*master-0.example.com*:2379,\
https://*master-1.example.com*:2379,\
https://*master-2.example.com*:2379,\
https://*etcd0.example.com*:2379"\
cluster-health
member 5ee217d19001 is healthy: got healthy result from https://192.168.55.12:2379
member 2a529ba1840722c0 is healthy: got healthy result from https://192.168.55.8:2379
member 8b8904727bf526a5 is healthy: got healthy result from https://192.168.55.21:2379
member ed4f0efd277d7599 is healthy: got healthy result from https://192.168.55.13:2379
cluster is healthy
If you use the v3 etcd API, run the following command:
# ETCDCTL_API=3 etcdctl --cert="/etc/etcd/peer.crt" \
--key=/etc/etcd/peer.key \
--cacert="/etc/etcd/ca.crt" \
--endpoints="https://*master-0.example.com*:2379,\
https://*master-1.example.com*:2379,\
https://*master-2.example.com*:2379,\
https://*etcd0.example.com*:2379"\
endpoint health
https://master-0.example.com:2379 is healthy: successfully committed proposal: took = 5.011358ms
https://master-1.example.com:2379 is healthy: successfully committed proposal: took = 1.305173ms
https://master-2.example.com:2379 is healthy: successfully committed proposal: took = 1.388772ms
https://etcd0.example.com:2379 is healthy: successfully committed proposal: took = 1.498829ms
Modify each OKD master
Modify the master configuration in the etcdClientInfo section of the /etc/origin/master/master-config.yaml file on every master. Add the new etcd host to the list of the etcd servers OKD uses to store the data, and remove any failed etcd hosts:
etcdClientInfo:
ca: master.etcd-ca.crt
certFile: master.etcd-client.crt
keyFile: master.etcd-client.key
urls:
- https://master-0.example.com:2379
- https://master-1.example.com:2379
- https://master-2.example.com:2379
- https://etcd0.example.com:2379
Restart the master API and controllers services on every master:
# master-restart api
# master-restart controllers
The number of etcd nodes must be odd, so you must add at least two hosts.
If you use Flannel, modify the flanneld service configuration, located at /etc/sysconfig/flanneld on every OKD host, to include the new etcd host:
FLANNEL_ETCD_ENDPOINTS=https://master-0.example.com:2379,https://master-1.example.com:2379,https://master-2.example.com:2379,https://etcd0.example.com:2379
Restart the flanneld service:
# systemctl restart flanneld.service
Removing an etcd host
If an etcd host fails beyond restoration, remove it from the cluster.
Steps to be performed on all master hosts
Procedure
Remove the failed etcd host from the etcd cluster. Run the following command for each failed etcd node:
# etcdctl -C https://<surviving host IP address>:2379 \
--ca-file=/etc/etcd/ca.crt \
--cert-file=/etc/etcd/peer.crt \
--key-file=/etc/etcd/peer.key member remove <failed member ID>
Restart the master API service on every master:
# master-restart api
# master-restart controllers
Steps to be performed in the current etcd cluster
Procedure
Remove the failed host from the cluster:
# etcdctl2 cluster-health
member 5ee217d19001 is healthy: got healthy result from https://192.168.55.12:2379
member 2a529ba1840722c0 is healthy: got healthy result from https://192.168.55.8:2379
failed to check the health of member 8372784203e11288 on https://192.168.55.21:2379: Get https://192.168.55.21:2379/health: dial tcp 192.168.55.21:2379: getsockopt: connection refused
member 8372784203e11288 is unreachable: [https://192.168.55.21:2379] are all unreachable
member ed4f0efd277d7599 is healthy: got healthy result from https://192.168.55.13:2379
cluster is healthy
# etcdctl2 member remove 8372784203e11288 (1)
Removed member 8372784203e11288 from cluster
# etcdctl2 cluster-health
member 5ee217d19001 is healthy: got healthy result from https://192.168.55.12:2379
member 2a529ba1840722c0 is healthy: got healthy result from https://192.168.55.8:2379
member ed4f0efd277d7599 is healthy: got healthy result from https://192.168.55.13:2379
cluster is healthy
1 The remove command requires the etcd ID, not the hostname.
To ensure the etcd configuration does not use the failed host when the etcd service is restarted, modify the /etc/etcd/etcd.conf file on all remaining etcd hosts and remove the failed host from the value of the ETCD_INITIAL_CLUSTER variable:
# vi /etc/etcd/etcd.conf
For example:
ETCD_INITIAL_CLUSTER=master-0.example.com=https://192.168.55.8:2380,master-1.example.com=https://192.168.55.12:2380,master-2.example.com=https://192.168.55.13:2380
becomes:
ETCD_INITIAL_CLUSTER=master-0.example.com=https://192.168.55.8:2380,master-1.example.com=https://192.168.55.12:2380
Restarting the etcd services is not required, because the failed host is removed using etcdctl.
Modify the Ansible inventory file to reflect the current status of the cluster and to avoid issues when re-running a playbook:
[OSEv3:children]
masters
nodes
etcd
... [OUTPUT ABBREVIATED] ...
[etcd]
master-0.example.com
master-1.example.com
If you are using Flannel, modify the flanneld service configuration, located at /etc/sysconfig/flanneld on every host, and remove the failed etcd host:
FLANNEL_ETCD_ENDPOINTS=https://master-0.example.com:2379,https://master-1.example.com:2379,https://master-2.example.com:2379
Restart the flanneld service:
# systemctl restart flanneld.service