Troubleshooting the control plane machine set
Use the information in this section to understand and recover from issues you might encounter.
Checking the control plane machine set custom resource state
You can verify the existence and state of the ControlPlaneMachineSet custom resource (CR).
Procedure
Determine the state of the CR by running the following command:
$ oc get controlplanemachineset.machine.openshift.io cluster --namespace openshift-machine-api
A result of Active indicates that the ControlPlaneMachineSet CR exists and is activated. No administrator action is required.
A result of Inactive indicates that a ControlPlaneMachineSet CR exists but is not activated.
A result of NotFound indicates that there is no existing ControlPlaneMachineSet CR.
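The three results above can be turned into a small lookup for use in scripts. This is a minimal sketch, not part of the product: describe_cpms_state is a hypothetical helper name, and the commented invocation assumes the state is exposed at .spec.state and treats any failed lookup as NotFound, which is a simplification.

```shell
# Map a reported state to the interpretation given in the procedure above.
# The state is passed as an argument so the function can be exercised
# without cluster access. describe_cpms_state is a hypothetical helper.
describe_cpms_state() {
  case "$1" in
    Active)      echo "CR exists and is activated; no administrator action required" ;;
    Inactive)    echo "CR exists but is not activated" ;;
    NotFound|"") echo "no ControlPlaneMachineSet CR exists" ;;
    *)           echo "unrecognized state: $1" ;;
  esac
}

# Possible invocation against a live cluster (assumption: the state is
# stored at .spec.state; any lookup failure is reported as NotFound):
#   state=$(oc get controlplanemachineset.machine.openshift.io cluster \
#     --namespace openshift-machine-api \
#     -o jsonpath='{.spec.state}' 2>/dev/null) || state=NotFound
#   describe_cpms_state "$state"
```

Passing the state in as an argument keeps the interpretation separate from the cluster query, so the mapping can be checked on its own.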
Next steps
To use the control plane machine set, you must ensure that a ControlPlaneMachineSet CR with the correct settings for your cluster exists.
If your cluster has an existing CR, you must verify that the configuration in the CR is correct for your cluster.
If your cluster does not have an existing CR, you must create one with the correct configuration for your cluster.
Adding a missing Azure internal load balancer
The internalLoadBalancer parameter is required in both the ControlPlaneMachineSet and control plane Machine custom resources (CRs) for Azure. If this parameter is not preconfigured on your cluster, you must add it to both CRs.
For more information about where this parameter is located in the Azure provider specification, see the sample Azure provider specification. The placement in the control plane Machine CR is similar.
Procedure
List the control plane machines in your cluster by running the following command:
$ oc get machines -l machine.openshift.io/cluster-api-machine-role==master -n openshift-machine-api
For each control plane machine, edit the CR by running the following command:
$ oc edit machine <control_plane_machine_name> -n openshift-machine-api
Add the internalLoadBalancer parameter with the correct details for your cluster and save your changes.
Edit your control plane machine set CR by running the following command:
$ oc --namespace openshift-machine-api edit controlplanemachineset.machine.openshift.io cluster
Add the internalLoadBalancer parameter with the correct details for your cluster and save your changes.
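To find which control plane machines still lack the parameter, a filter along the following lines might help. It is a sketch under stated assumptions: list_missing_ilb is a hypothetical helper, and the jsonpath in the commented invocation assumes the parameter sits at .spec.providerSpec.value.internalLoadBalancer in the Machine CR.

```shell
# Print the names of machines that have no internalLoadBalancer value.
# Reads tab-separated "name<TAB>value" lines on stdin; an empty second
# field means the parameter is not set. list_missing_ilb is a
# hypothetical helper name.
list_missing_ilb() {
  awk -F'\t' '$2 == "" { print $1 }'
}

# Possible invocation against a live cluster (the field path is an
# assumption about the Azure provider specification layout):
#   oc get machines -n openshift-machine-api \
#     -l machine.openshift.io/cluster-api-machine-role==master \
#     -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.providerSpec.value.internalLoadBalancer}{"\n"}{end}' \
#     | list_missing_ilb
```

An empty result suggests that every listed machine already carries the parameter. The ControlPlaneMachineSet CR still needs to be checked separately.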
Next steps
For clusters that use the default RollingUpdate update strategy, the Operator automatically propagates the changes to your control plane configuration.
For clusters that are configured to use the OnDelete update strategy, you must replace your control plane machines manually.
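Which of the two cases applies depends on the update strategy configured in the control plane machine set. The following sketch maps a strategy name to the corresponding next step. The helper name next_step_for_strategy is hypothetical, and the commented invocation assumes the strategy is stored at .spec.strategy.type.

```shell
# Map an update strategy name to the follow-up action described above.
# next_step_for_strategy is a hypothetical helper name.
next_step_for_strategy() {
  case "$1" in
    RollingUpdate) echo "no action: the Operator propagates the changes automatically" ;;
    OnDelete)      echo "action required: replace the control plane machines manually" ;;
    *)             echo "unrecognized update strategy: $1" ;;
  esac
}

# Possible invocation against a live cluster (assumption: the strategy
# is stored at .spec.strategy.type):
#   strategy=$(oc get controlplanemachineset.machine.openshift.io cluster \
#     --namespace openshift-machine-api -o jsonpath='{.spec.strategy.type}')
#   next_step_for_strategy "$strategy"
```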
Recovering a degraded etcd Operator
Certain situations can cause the etcd Operator to become degraded.
For example, while performing remediation, the machine health check might delete a control plane machine that is hosting etcd. If the etcd member is not reachable at that time, the etcd Operator becomes degraded.
When the etcd Operator is degraded, manual intervention is required to force the Operator to remove the failed member and restore the cluster state.
Procedure
List the control plane machines in your cluster by running the following command:
$ oc get machines -l machine.openshift.io/cluster-api-machine-role==master -n openshift-machine-api -o wide
Any of the following conditions might indicate a failed control plane machine:
The STATE value is stopped.
The PHASE value is Failed.
The PHASE value is Deleting for more than ten minutes.
Before continuing, ensure that your cluster has two healthy control plane machines. Performing the actions in this procedure on more than one control plane machine risks losing etcd quorum and can cause data loss.
If you have lost the majority of your control plane hosts, leading to etcd quorum loss, then you must follow the disaster recovery procedure “Restoring to a previous cluster state” instead of this procedure.
Edit the machine CR for the failed control plane machine by running the following command:
$ oc edit machine <control_plane_machine_name> -n openshift-machine-api
Remove the contents of the lifecycleHooks parameter from the failed control plane machine and save your changes.
The etcd Operator removes the failed machine from the cluster and can then safely add new etcd members.
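The failure conditions listed in the first step of this procedure can also be checked with a small filter. This is a sketch, not an official tool: flag_suspect_machines is a hypothetical helper, and the commented invocation assumes that PHASE comes from .status.phase and STATE from the machine.openshift.io/instance-state annotation. The filter does not measure how long a machine has been in the Deleting phase, so that condition still needs a manual check.

```shell
# Flag machines that match one of the failure conditions from the
# procedure. Reads tab-separated "name<TAB>phase<TAB>state" lines on
# stdin. flag_suspect_machines is a hypothetical helper name.
flag_suspect_machines() {
  awk -F'\t' '
    $3 == "stopped"  { print $1 ": STATE is stopped"; next }
    $2 == "Failed"   { print $1 ": PHASE is Failed"; next }
    $2 == "Deleting" { print $1 ": PHASE is Deleting (confirm it has lasted more than ten minutes)" }
  '
}

# Possible invocation against a live cluster (the field paths are
# assumptions about where the PHASE and STATE columns come from):
#   oc get machines -n openshift-machine-api \
#     -l machine.openshift.io/cluster-api-machine-role==master \
#     -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\t"}{.metadata.annotations.machine\.openshift\.io/instance-state}{"\n"}{end}' \
#     | flag_suspect_machines
```

If more than one machine is flagged, the quorum warning in this procedure applies before any machine CR is edited.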