Using the vSphere Problem Detector Operator

About the vSphere Problem Detector Operator
Running the vSphere Problem Detector Operator checks
Viewing the events from the vSphere Problem Detector Operator
Viewing the logs from the vSphere Problem Detector Operator
Configuration checks run by the vSphere Problem Detector Operator
About the storage class configuration check
Metrics for the vSphere Problem Detector Operator
Additional resources

About the vSphere Problem Detector Operator

The vSphere Problem Detector Operator checks clusters that are deployed on vSphere for common installation and misconfiguration issues that are related to storage.

The Operator runs in the openshift-cluster-storage-operator namespace and is started by the Cluster Storage Operator when the Cluster Storage Operator detects that the cluster is deployed on vSphere. The vSphere Problem Detector Operator communicates with the vSphere vCenter Server to determine the virtual machines in the cluster, the default datastore, and other information about the vSphere vCenter Server configuration. The Operator uses the credentials from the Cloud Credential Operator to connect to vSphere.

The Operator runs the checks according to the following schedule:

The checks run every 8 hours.
If any check fails, the Operator runs the checks again at intervals of 1 minute, 2 minutes, 4 minutes, 8 minutes, and so on. The Operator doubles the interval after each failed run, up to a maximum interval of 8 hours.
When all checks pass, the schedule returns to an 8-hour interval.
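
The retry schedule after a failure can be illustrated with a short shell sketch. It is only a model of the schedule described above, not the Operator's implementation. The function name `retry_intervals` is hypothetical, intervals are expressed in minutes (8 hours is 480 minutes), and capping a doubled interval at exactly 480 minutes is an assumption about how the maximum is applied.

```shell
#!/usr/bin/env bash
# Print the first N retry intervals, in minutes, after a failed check.
# Model: start at 1 minute, double after each failed run, cap at 8 hours.
retry_intervals() {
  local count=$1
  local interval=1
  local max=480   # 8 hours, expressed in minutes
  local i
  for ((i = 0; i < count; i++)); do
    echo "$interval"
    interval=$((interval * 2))
    if ((interval > max)); then
      interval=$max   # assumption: the doubled value is clamped to the maximum
    fi
  done
}
```

Under this model, `retry_intervals 11` prints 1, 2, 4, 8, 16, 32, 64, 128, 256, 480, and 480, so the schedule reaches the 8-hour ceiling on the tenth consecutive failed run.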

The Operator increases the frequency of the checks after a failure so that the Operator can report success quickly after the failure condition is remedied. You can run the Operator manually for immediate troubleshooting information.

Running the vSphere Problem Detector Operator checks

You can override the schedule for running the vSphere Problem Detector Operator checks and run the checks immediately.

The vSphere Problem Detector Operator automatically runs the checks every 8 hours. However, when the Operator starts, it runs the checks immediately. The Operator is started by the Cluster Storage Operator when the Cluster Storage Operator starts and determines that the cluster is running on vSphere. To run the checks immediately, scale the vSphere Problem Detector Operator deployment to 0 and back to 1, which restarts the Operator.

Prerequisites

Access to the cluster as a user with the cluster-admin role.

Procedure

Scale the Operator to 0:

$ oc scale deployment/vsphere-problem-detector-operator --replicas=0 \
    -n openshift-cluster-storage-operator

If the deployment does not scale to zero immediately, you can run the following command to wait for the pods to exit:

$ oc wait pods -l name=vsphere-problem-detector-operator \
    --for=delete --timeout=5m -n openshift-cluster-storage-operator

Scale the Operator back to 1:

$ oc scale deployment/vsphere-problem-detector-operator --replicas=1 \
    -n openshift-cluster-storage-operator

Delete the old leader lock to speed up the new leader election for the Cluster Storage Operator:
```
$ oc delete -n openshift-cluster-storage-operator \
    cm vsphere-problem-detector-lock
```
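
The steps in this procedure can be combined into one shell function, sketched below. The function name `rerun_vsphere_checks` is hypothetical; the `oc` commands and resource names are the ones used in the preceding steps. The sketch treats a failure of `oc wait` (for example, a timeout) as non-fatal, which is a design assumption rather than documented behavior.

```shell
#!/usr/bin/env bash
# Restart the vSphere Problem Detector Operator so that it runs the checks
# immediately. Requires the oc CLI and a user with the cluster-admin role.
rerun_vsphere_checks() {
  local ns=openshift-cluster-storage-operator
  local deploy=deployment/vsphere-problem-detector-operator

  # Scale the Operator to 0.
  oc scale "$deploy" --replicas=0 -n "$ns" || return 1

  # Wait for the pods to exit. A failure here is ignored in this sketch so
  # that the Operator is always scaled back up.
  oc wait pods -l name=vsphere-problem-detector-operator \
    --for=delete --timeout=5m -n "$ns" || true

  # Scale the Operator back to 1, then delete the old leader lock.
  oc scale "$deploy" --replicas=1 -n "$ns" || return 1
  oc delete -n "$ns" cm vsphere-problem-detector-lock
}
```

After sourcing the function in a shell that is logged in to the cluster, run `rerun_vsphere_checks` and then continue with the verification step.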

Verification

View the events or logs that are generated by the vSphere Problem Detector Operator. Confirm that the events or logs have recent timestamps.

Viewing the events from the vSphere Problem Detector Operator

After the vSphere Problem Detector Operator runs and performs the configuration checks, it creates events that can be viewed from the command line or from the OKD web console.

Procedure

To view the events by using the command line, run the following command:

$ oc get event -n openshift-cluster-storage-operator \
    --sort-by={.metadata.creationTimestamp}

Example output

16m     Normal    Started             pod/vsphere-problem-detector-operator-xxxxx         Started container vsphere-problem-detector
16m     Normal    Created             pod/vsphere-problem-detector-operator-xxxxx         Created container vsphere-problem-detector
16m     Normal    LeaderElection      configmap/vsphere-problem-detector-lock    vsphere-problem-detector-operator-xxxxx became leader

To view the events by using the OKD web console, navigate to Home → Events and select openshift-cluster-storage-operator from the Project menu.

Viewing the logs from the vSphere Problem Detector Operator

After the vSphere Problem Detector Operator runs and performs the configuration checks, it creates log records that can be viewed from the command line or from the OKD web console.

Procedure

To view the logs by using the command line, run the following command:

$ oc logs deployment/vsphere-problem-detector-operator \
    -n openshift-cluster-storage-operator

Example output

I0108 08:32:28.445696       1 operator.go:209] ClusterInfo passed
I0108 08:32:28.451029       1 datastore.go:57] CheckStorageClasses checked 1 storage classes, 0 problems found
I0108 08:32:28.451047       1 operator.go:209] CheckStorageClasses passed
I0108 08:32:28.452160       1 operator.go:209] CheckDefaultDatastore passed
I0108 08:32:28.480648       1 operator.go:271] CheckNodeDiskUUID:<host_name> passed
I0108 08:32:28.480685       1 operator.go:271] CheckNodeProviderID:<host_name> passed

To view the Operator logs with the OKD web console, perform the following steps:
1. Navigate to Workloads → Pods.
2. Select openshift-cluster-storage-operator from the Project menu.
3. Click the link for the vsphere-problem-detector-operator pod.
4. Click the Logs tab on the Pod details page to view the logs.
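
When reading the command-line logs, it can help to reduce the output to the names of the checks that passed and compare that list with the checks you expect. The following filter is a minimal sketch based only on the example output above, where a passing check is logged as `<check_name> passed` at the end of the line. The function name `passed_checks` is hypothetical, and the format of a failing check's log line is not shown in this document, so the sketch does not try to match failures.

```shell
#!/usr/bin/env bash
# Read Operator log lines on stdin and print the name of every check whose
# log line ends with the word "passed", one name per line.
passed_checks() {
  awk '$NF == "passed" { print $(NF - 1) }'
}
```

For example, pipe the logs through the filter:

```shell
oc logs deployment/vsphere-problem-detector-operator \
    -n openshift-cluster-storage-operator | passed_checks
```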

Configuration checks run by the vSphere Problem Detector Operator

The following tables identify the configuration checks that the vSphere Problem Detector Operator runs. Some checks verify the configuration of the cluster. Other checks verify the configuration of each node in the cluster.

Table 1. Cluster configuration checks
Name	Description
`CheckDefaultDatastore`	Verifies that the default datastore name in the vSphere configuration is short enough for use with dynamic provisioning. If this check fails, you can expect the following: `systemd` logs errors to the journal, such as `Failed to set up mount unit: Invalid argument`; `systemd` does not unmount volumes if the virtual machine is shut down or rebooted without draining all the pods from the node. To resolve the failure, reconfigure vSphere with a shorter name for the default datastore.
`CheckFolderPermissions`	Verifies the permission to list volumes in the default datastore. This permission is required to create volumes. The Operator verifies the permission by listing the `/` and `/kubevols` directories. The root directory must exist. It is acceptable if the `/kubevols` directory does not exist when the check runs. The `/kubevols` directory is created when the datastore is used with dynamic provisioning if the directory does not already exist. If this check fails, review the required permissions for the vCenter account that was specified during the OKD installation.
`CheckStorageClasses`	Verifies the following: the fully qualified path to each persistent volume that is provisioned by a storage class is less than 255 characters; if a storage class uses a storage policy, the storage class uses only one policy and that policy is defined.
`CheckTaskPermissions`	Verifies the permission to list recent tasks and datastores.
`ClusterInfo`	Collects the cluster version and UUID from vSphere vCenter.

Table 2. Node configuration checks
Name	Description
`CheckNodeDiskUUID`	Verifies that all the vSphere virtual machines are configured with `disk.enableUUID=TRUE`. If this check fails, see the Red Hat Knowledgebase solution "How to check ‘disk.EnableUUID’ parameter from VM in vSphere".
`CheckNodeProviderID`	Verifies that all nodes are configured with the `ProviderID` from vSphere vCenter. This check fails when the output from the following command does not include a provider ID for each node. `$ oc get nodes -o custom-columns=NAME:.metadata.name,PROVIDER_ID:.spec.providerID,UUID:.status.nodeInfo.systemUUID` If this check fails, refer to the vSphere product documentation for information about setting the provider ID for each node in the cluster.
`CollectNodeESXiVersion`	Reports the version of the ESXi hosts that run nodes.
`CollectNodeHWVersion`	Reports the virtual machine hardware version for a node.
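
To narrow a `CheckNodeProviderID` failure down to specific nodes, the output of the `oc get nodes -o custom-columns=...` command from Table 2 can be filtered. The following is a minimal sketch: the function name `nodes_without_provider_id` is hypothetical, and it assumes that `oc` prints `<none>` in a custom column when the field is not set.

```shell
#!/usr/bin/env bash
# Read the custom-columns output (NAME, PROVIDER_ID, UUID) on stdin and print
# the names of nodes that have no provider ID.
# Assumption: oc prints <none> for a field that is not set.
nodes_without_provider_id() {
  awk 'NR > 1 && $2 == "<none>" { print $1 }'   # NR > 1 skips the header row
}
```

For example:

```shell
oc get nodes -o custom-columns=NAME:.metadata.name,PROVIDER_ID:.spec.providerID,UUID:.status.nodeInfo.systemUUID \
    | nodes_without_provider_id
```

An empty result means that every node reports a provider ID.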

About the storage class configuration check

The names for persistent volumes that use vSphere storage are related to the datastore name and cluster ID.

When a persistent volume is created, systemd creates a mount unit for the persistent volume. The systemd process has a 255-character limit for the length of the fully qualified path to the VMDK file that is used for the persistent volume.

The fully qualified path is based on the naming conventions for systemd and vSphere. The naming conventions use the following pattern:

/var/lib/kubelet/plugins/kubernetes.io/vsphere-volume/mounts/[<datastore>] 00000000-0000-0000-0000-000000000000/<cluster_id>-dynamic-pvc-00000000-0000-0000-0000-000000000000.vmdk

The naming conventions require 205 characters of the 255-character limit.
The datastore name and the cluster ID are determined from the deployment.
The datastore name and cluster ID are substituted into the preceding pattern. Then the path is processed with the systemd-escape command to escape special characters. For example, a hyphen character uses four characters after it is escaped. The escaped value is \x2d.
After processing with systemd-escape to ensure that systemd can access the fully qualified path to the VMDK file, the length of the path must be less than 255 characters.
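
The effect of escaping on the path length can be estimated with the following shell sketch. It is an approximation, not the check that the Operator runs: `escape_path` is a hypothetical, simplified stand-in for `systemd-escape --path` that handles ASCII input only and ignores corner cases such as repeated slashes or a leading dot. `escaped_path_length` is likewise a hypothetical helper that substitutes a datastore name and cluster ID into the pattern shown above.

```shell
#!/usr/bin/env bash
# Simplified approximation of `systemd-escape --path` for ASCII input:
#   "/" becomes "-"; letters, digits, ":", "_" and "." are kept;
#   every other character becomes \xXX, which is four characters long.
escape_path() {
  local path=${1#/}   # drop the leading slash
  local out="" c i
  for ((i = 0; i < ${#path}; i++)); do
    c=${path:i:1}
    case $c in
      /) out+="-" ;;
      [a-zA-Z0-9:_.]) out+="$c" ;;
      *) out+=$(printf '\\x%02x' "'$c") ;;   # "'$c" yields the character code
    esac
  done
  printf '%s\n' "$out"
}

# Print the length of the escaped path for a datastore name and cluster ID.
escaped_path_length() {
  local datastore=$1 cluster_id=$2
  local uuid=00000000-0000-0000-0000-000000000000
  local path="/var/lib/kubelet/plugins/kubernetes.io/vsphere-volume/mounts/[${datastore}] ${uuid}/${cluster_id}-dynamic-pvc-${uuid}.vmdk"
  local escaped
  escaped=$(escape_path "$path")
  echo "${#escaped}"
}
```

For example, `escaped_path_length 'ds-1' 'mycluster'` reports the length for a datastore named `ds-1`; the hyphen alone contributes four characters. With an empty datastore name and cluster ID the sketch reports 199 characters. Adding the six-character `.mount` unit suffix gives 205, which is consistent with the figure quoted above, although that accounting is an inference and is not stated in this document.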

Metrics for the vSphere Problem Detector Operator

The vSphere Problem Detector Operator exposes the following metrics for use by the OKD monitoring stack.

Table 3. Metrics exposed by the vSphere Problem Detector Operator
Name	Description
`vsphere_cluster_check_total`	Cumulative number of cluster-level checks that the vSphere Problem Detector Operator performed. This count includes both successes and failures.
`vsphere_cluster_check_errors`	Number of failed cluster-level checks that the vSphere Problem Detector Operator performed. For example, a value of `1` indicates that one cluster-level check failed.
`vsphere_esxi_version_total`	Number of ESXi hosts with a specific version. Be aware that if a host runs more than one node, the host is counted only once.
`vsphere_node_check_total`	Cumulative number of node-level checks that the vSphere Problem Detector Operator performed. This count includes both successes and failures.
`vsphere_node_check_errors`	Number of failed node-level checks that the vSphere Problem Detector Operator performed. For example, a value of `1` indicates that one node-level check failed.
`vsphere_node_hw_version_total`	Number of vSphere nodes with a specific hardware version.
`vsphere_vcenter_info`	Information about the vSphere vCenter Server.

Additional resources

Understanding the monitoring stack