Remediating nodes with Machine Health Checks
Machine health checks automatically repair unhealthy machines in a particular machine pool.
About machine health checks
You can only apply a machine health check to control plane machines on clusters that use control plane machine sets.
To monitor machine health, create a resource that defines the configuration for a controller. Set a condition to check, such as staying in the NotReady status for five minutes or displaying a permanent condition in the node-problem-detector, and a label for the set of machines to monitor.
The controller that observes a MachineHealthCheck resource checks for the defined condition. If a machine fails the health check, the machine is automatically deleted and a new one is created to take its place. When a machine is deleted, you see a machine deleted event.
To limit the disruptive impact of machine deletion, the controller drains and deletes only one node at a time. If there are more unhealthy machines than the maxUnhealthy threshold allows for in the targeted pool of machines, remediation stops so that you can intervene manually.
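The threshold check can be illustrated with a short shell sketch. It assumes that maxUnhealthy is given either as an absolute count or as a percentage, and that a percentage is rounded down to a whole number of machines; treat the rounding as an assumption to verify against your cluster version. The function names allowed_unhealthy and remediation_allowed are hypothetical helpers for illustration and are not part of oc or the controller.

```shell
#!/bin/sh
# Sketch only: mirrors the maxUnhealthy threshold arithmetic described above.
# Assumption: a percentage value is rounded down to a whole machine count.

# allowed_unhealthy TOTAL MAX_UNHEALTHY
# Prints how many unhealthy machines the threshold tolerates.
allowed_unhealthy() {
  total=$1
  max=$2
  case "$max" in
    *%)
      pct=${max%\%}                      # strip the trailing percent sign
      echo $(( total * pct / 100 ))      # integer division rounds down
      ;;
    *)
      echo "$max"                        # absolute count is used as-is
      ;;
  esac
}

# remediation_allowed TOTAL UNHEALTHY MAX_UNHEALTHY
# Exits 0 if remediation would proceed, non-zero if it would stop.
remediation_allowed() {
  limit=$(allowed_unhealthy "$1" "$3")
  [ "$2" -le "$limit" ]
}

# Example usage (6 machines in the pool, 3 unhealthy, threshold 40%):
#   remediation_allowed 6 3 40% || echo "remediation stops; intervene manually"
```

Under these assumptions, a pool of 6 machines with a 40% threshold tolerates 2 unhealthy machines, so a third unhealthy machine stops remediation.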
Consider the timeouts carefully, accounting for workloads and requirements.
To stop the check, remove the resource.
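As a minimal sketch, removing the resource could look like the following. The wrapper function stop_machine_health_check is a hypothetical helper, and the default namespace openshift-machine-api is assumed from the sample CR in this document; only the oc delete call itself is a real command.

```shell
#!/bin/sh
# Sketch only: hypothetical wrapper around "oc delete" for a MachineHealthCheck.
# Assumption: the resource lives in openshift-machine-api unless a namespace is given.

# stop_machine_health_check NAME [NAMESPACE]
stop_machine_health_check() {
  oc delete machinehealthcheck "$1" -n "${2:-openshift-machine-api}"
}

# Example usage, with the sample name from this document:
#   stop_machine_health_check machine-health-check
```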
Limitations when deploying machine health checks
There are limitations to consider before deploying a machine health check:
Only machines owned by a machine set are remediated by a machine health check.
If the node for a machine is removed from the cluster, a machine health check considers the machine to be unhealthy and remediates it immediately.
If the corresponding node for a machine does not join the cluster after the nodeStartupTimeout, the machine is remediated.
A machine is remediated immediately if the Machine resource phase is Failed.
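To see whether a machine is in the Failed phase before a health check acts on it, the phase can be read from the Machine resource. The following is a sketch under assumptions: the helper name machine_phase is hypothetical, the namespace defaults to openshift-machine-api as in the sample CRs in this document, and the phase is assumed to be exposed at .status.phase.

```shell
#!/bin/sh
# Sketch only: hypothetical helper that prints the phase of one Machine.
# Assumption: the phase is stored in .status.phase of the Machine resource.

# machine_phase NAME [NAMESPACE]
machine_phase() {
  oc get machines.machine.openshift.io "$1" \
    -n "${2:-openshift-machine-api}" \
    -o jsonpath='{.status.phase}'
}

# Example usage (replace the argument with a real machine name):
#   machine_phase my-worker-machine
```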
Configuring machine health checks to use the Self Node Remediation Operator
Use the following procedure to configure the worker or control-plane machine health checks to use the Self Node Remediation Operator as a remediation provider.
Prerequisites
Install the OpenShift CLI (oc).
Log in as a user with cluster-admin privileges.
Procedure
Create a SelfNodeRemediationTemplate CR:
Define the SelfNodeRemediationTemplate CR:
apiVersion: self-node-remediation.medik8s.io/v1alpha1
kind: SelfNodeRemediationTemplate
metadata:
  namespace: openshift-machine-api
  name: selfnoderemediationtemplate-sample
spec:
  template:
    spec:
      remediationStrategy: ResourceDeletion # (1)
(1) Specifies the remediation strategy. The default strategy is ResourceDeletion.
To create the SelfNodeRemediationTemplate CR, run the following command:
$ oc create -f <snrt-name>.yaml
Create or update the MachineHealthCheck CR to point to the SelfNodeRemediationTemplate CR:
Define or update the MachineHealthCheck CR:
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: machine-health-check
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels: # (1)
      machine.openshift.io/cluster-api-machine-role: "worker"
      machine.openshift.io/cluster-api-machine-type: "worker"
  unhealthyConditions:
  - type: "Ready"
    timeout: "300s"
    status: "False"
  - type: "Ready"
    timeout: "300s"
    status: "Unknown"
  maxUnhealthy: "40%"
  nodeStartupTimeout: "10m"
  remediationTemplate: # (2)
    kind: SelfNodeRemediationTemplate
    apiVersion: self-node-remediation.medik8s.io/v1alpha1
    name: selfnoderemediationtemplate-sample
(1) Selects whether the machine health check is for worker or control-plane nodes. The label can also be user-defined.
(2) Specifies the details for the remediation template.
To create a MachineHealthCheck CR, run the following command:
$ oc create -f <mhc-name>.yaml
To update a MachineHealthCheck CR, run the following command:
$ oc apply -f <mhc-name>.yaml
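After creating or updating the CR, it can be useful to confirm what the cluster stored, including the remediationTemplate reference. The following sketch wraps oc get for that purpose; the function name show_machine_health_check is a hypothetical helper, and the default namespace is assumed from the sample CR above.

```shell
#!/bin/sh
# Sketch only: hypothetical helper that prints a MachineHealthCheck as YAML.
# Assumption: the resource lives in openshift-machine-api unless a namespace is given.

# show_machine_health_check NAME [NAMESPACE]
show_machine_health_check() {
  oc get machinehealthcheck "$1" -n "${2:-openshift-machine-api}" -o yaml
}

# Example usage, with the sample name from this procedure:
#   show_machine_health_check machine-health-check
```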