About node maintenance
About node maintenance mode
Nodes can be placed into maintenance mode using the oc adm
utility, or using NodeMaintenance
custom resources (CRs).
Placing a node into maintenance marks the node as unschedulable and drains all the virtual machines and pods from it. Virtual machine instances that have a LiveMigrate
eviction strategy are live migrated to another node without loss of service. This eviction strategy is configured by default in virtual machine created from common templates but must be configured manually for custom virtual machines.
Virtual machine instances without an eviction strategy are shut down. Virtual machines with a RunStrategy
of Running
or RerunOnFailure
are recreated on another node. Virtual machines with a RunStrategy
of Manual
are not automatically restarted.
Virtual machines must have a persistent volume claim (PVC) with a shared |
When installed as part of OKD Virtualization, Node Maintenance Operator watches for new or deleted NodeMaintenance
CRs. When a new NodeMaintenance
CR is detected, no new workloads are scheduled and the node is cordoned off from the rest of the cluster. All pods that can be evicted are evicted from the node. When a NodeMaintenance
CR is deleted, the node that is referenced in the CR is made available for new workloads.
Using a |
Maintaining bare metal nodes
When you deploy OKD on bare metal infrastructure, there are additional considerations that must be taken into account compared to deploying on cloud infrastructure. Unlike in cloud environments where the cluster nodes are considered ephemeral, re-provisioning a bare metal node requires significantly more time and effort for maintenance tasks.
When a bare metal node fails, for example, if a fatal kernel error happens or a NIC card hardware failure occurs, workloads on the failed node need to be restarted elsewhere else on the cluster while the problem node is repaired or replaced. Node maintenance mode allows cluster administrators to gracefully power down nodes, moving workloads to other parts of the cluster and ensuring workloads do not get interrupted. Detailed progress and node status details are provided during maintenance.
Additional resources: