Troubleshooting local persistent storage using LVMS
Because OKD does not scope a persistent volume (PV) to a single project, it can be shared across the cluster and claimed by any project using a persistent volume claim (PVC). This can lead to a number of issues that require troubleshooting.
Investigating a PVC stuck in the Pending state
A persistent volume claim (PVC) can get stuck in a Pending state for a number of reasons. For example:
Insufficient computing resources
Network problems
Mismatched storage class or node selector
No available volumes
The node with the persistent volume (PV) is in a Not Ready state
Identify the cause by using the oc describe command to review details about the stuck PVC.
Procedure
Retrieve the list of PVCs by running the following command:
$ oc get pvc
Example output
NAME        STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
lvms-test   Pending                                      lvms-vg1       11s
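If the cluster contains PVCs in several namespaces, you can narrow the output to pending claims. This is an optional filter that uses standard oc and shell tooling:
$ oc get pvc -A | grep Pending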
Inspect the events associated with a PVC stuck in the Pending state by running the following command:
$ oc describe pvc <pvc_name> (1)
1 Replace <pvc_name> with the name of the PVC. For example, lvms-test.
Example output
Type     Reason              Age               From                          Message
----     ------              ----              ----                          -------
Warning  ProvisioningFailed  4s (x2 over 17s)  persistentvolume-controller   storageclass.storage.k8s.io "lvms-vg1" not found
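In this example, provisioning fails because the referenced storage class does not exist. As an optional check, you can list the storage classes that are available in the cluster by running the following standard command:
$ oc get storageclass
If the LVMS storage class is missing, continue with the next section to verify the LVMCluster resource and the LVMS pods.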
Recovering from missing LVMS or Operator components
If you encounter a storage class “not found” error, check the LVMCluster resource and ensure that all the logical volume manager storage (LVMS) pods are running. You can create an LVMCluster resource if it does not exist.
Procedure
Verify the presence of the LVMCluster resource by running the following command:
$ oc get lvmcluster -n openshift-storage
Example output
NAME            AGE
my-lvmcluster   65m
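To review the configuration and status of an existing LVMCluster resource, you can inspect it with a standard oc command. The resource name my-lvmcluster is taken from the example output above:
$ oc get lvmcluster my-lvmcluster -n openshift-storage -o yaml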
If the cluster does not have an LVMCluster resource, create one by running the following command:
$ oc create -n openshift-storage -f <custom_resource> (1)
1 Replace <custom_resource> with a custom resource URL or file tailored to your requirements.
Example custom resource
apiVersion: lvm.topolvm.io/v1alpha1
kind: LVMCluster
metadata:
  name: my-lvmcluster
spec:
  storage:
    deviceClasses:
      - name: vg1
        default: true
        thinPoolConfig:
          name: thin-pool-1
          sizePercent: 90
          overprovisionRatio: 10
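For example, assuming you saved the custom resource above to a local file named lvmcluster.yaml (the file name is only an illustration), you could create the resource by running:
$ oc create -n openshift-storage -f lvmcluster.yaml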
Check that all the pods from LVMS are in the Running state in the openshift-storage namespace by running the following command:
$ oc get pods -n openshift-storage
Example output
NAME                                  READY   STATUS    RESTARTS   AGE
lvms-operator-7b9fb858cb-6nsml        3/3     Running   0          70m
topolvm-controller-5dd9cf78b5-7wwr2   5/5     Running   0          66m
topolvm-node-dr26h                    4/4     Running   0          66m
vg-manager-r6zdv                      1/1     Running   0          66m
The expected output is one running instance of lvms-operator and vg-manager. One instance of topolvm-controller and topolvm-node is expected for each node.
If topolvm-node is stuck in the Init state, there is a failure to locate an available disk for LVMS to use. To retrieve the information necessary to troubleshoot, review the logs of the vg-manager pod by running the following command:
$ oc logs -l app.kubernetes.io/component=vg-manager -n openshift-storage
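If the logs indicate that no usable disk was found, you can also inspect the block devices on the affected node. The following is one optional check that starts a debug pod on the node; <node_name> is a placeholder for the name of the node that hosts the stuck topolvm-node pod:
$ oc debug node/<node_name> -- chroot /host lsblk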
Recovering from node failure
Sometimes a persistent volume claim (PVC) is stuck in a Pending state because a particular node in the cluster has failed. To identify the failed node, you can examine the restart count of the topolvm-node pod. An increased restart count indicates potential problems with the underlying node, which might require further investigation and troubleshooting.
Procedure
Examine the restart count of the topolvm-node pod instances by running the following command:
$ oc get pods -n openshift-storage
Example output
NAME                                  READY   STATUS    RESTARTS      AGE
lvms-operator-7b9fb858cb-6nsml        3/3     Running   0             70m
topolvm-controller-5dd9cf78b5-7wwr2   5/5     Running   0             66m
topolvm-node-dr26h                    4/4     Running   0             66m
topolvm-node-54as8                    4/4     Running   0             66m
topolvm-node-78fft                    4/4     Running   17 (8s ago)   66m
vg-manager-r6zdv                      1/1     Running   0             66m
vg-manager-990ut                      1/1     Running   0             66m
vg-manager-an118                      1/1     Running   0             66m
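In this example output, the topolvm-node-78fft pod has restarted 17 times, which suggests a problem on its node. The following optional checks use standard oc commands to find the node that hosts the restarting pod and review its condition; replace <node_name> with the node shown for that pod:
$ oc get pods -n openshift-storage -o wide
$ oc describe node <node_name>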
After you resolve any issues with the node, you might need to perform the forced cleanup procedure if the PVC is still stuck in a Pending state.
Additional resources
Performing a forced cleanup
Recovering from disk failure
If you see a failure message while inspecting the events associated with the persistent volume claim (PVC), there might be a problem with the underlying volume or disk. Disk and volume provisioning issues often result in a generic error first, such as Failed to provision volume with StorageClass <storage_class_name>. A second, more specific error message usually follows.
Procedure
Inspect the events associated with a PVC by running the following command:
$ oc describe pvc <pvc_name> (1)
1 Replace <pvc_name> with the name of the PVC.
Here are some examples of disk or volume failure error messages and their causes:
Failed to check volume existence: Indicates a problem in verifying whether the volume already exists. Volume verification failure can be caused by network connectivity problems or other failures.
Failed to bind volume: Failure to bind a volume can happen if the persistent volume (PV) that is available does not match the requirements of the PVC.
FailedMount or FailedUnMount: This error indicates problems when trying to mount the volume to a node or unmount a volume from a node. If the disk has failed, this error might appear when a pod tries to use the PVC.
Volume is already exclusively attached to one node and can’t be attached to another: This error can appear with storage solutions that do not support ReadWriteMany access modes.
Establish a direct connection to the host where the problem is occurring. One possible approach is shown in the sketch after this procedure.
Resolve the disk issue.
After you have resolved the issue with the disk, you might need to perform the forced cleanup procedure if failure messages persist or reoccur.
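One possible way to establish a direct connection is to start a debug pod on the affected node and inspect the disk from the host. The node name is a placeholder, and lsblk and dmesg are standard host utilities available from the debug shell:
$ oc debug node/<node_name>
# chroot /host
# lsblk
# dmesg | tail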
Additional resources
Performing a forced cleanup
Performing a forced cleanup
If disk- or node-related problems persist after you complete the troubleshooting procedures, it might be necessary to perform a forced cleanup procedure. A forced cleanup is used to comprehensively address persistent issues and ensure the proper functioning of the LVMS.
Prerequisites
All of the persistent volume claims (PVCs) created using the logical volume manager storage (LVMS) driver have been removed.
The pods using those PVCs have been stopped.
Procedure
Switch to the openshift-storage namespace by running the following command:
$ oc project openshift-storage
Ensure that there are no LogicalVolume custom resources (CRs) remaining by running the following command:
$ oc get logicalvolume
Example output
No resources found
If there are any LogicalVolume CRs remaining, remove their finalizers by running the following command:
$ oc patch logicalvolume <name> -p '{"metadata":{"finalizers":[]}}' --type=merge (1)
1 Replace <name> with the name of the CR.
After removing their finalizers, delete the CRs by running the following command:
$ oc delete logicalvolume <name> (1)
1 Replace <name> with the name of the CR.
Make sure there are no LVMVolumeGroup CRs left by running the following command:
$ oc get lvmvolumegroup
Example output
No resources found
If there are any LVMVolumeGroup CRs left, remove their finalizers by running the following command:
$ oc patch lvmvolumegroup <name> -p '{"metadata":{"finalizers":[]}}' --type=merge (1)
1 Replace <name> with the name of the CR.
After removing their finalizers, delete the CRs by running the following command:
$ oc delete lvmvolumegroup <name> (1)
1 Replace <name> with the name of the CR.
Remove any LVMVolumeGroupNodeStatus CRs by running the following command:
$ oc delete lvmvolumegroupnodestatus --all
Remove the LVMCluster CR by running the following command:
$ oc delete lvmcluster --all
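After the cleanup, you can verify that no LVMS-related custom resources remain before recreating the LVMCluster resource. This optional check combines the queries from the previous steps into a single command:
$ oc get logicalvolume,lvmvolumegroup,lvmvolumegroupnodestatus -n openshift-storage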