- Troubleshooting
- Downloading the Velero CLI tool
- Accessing the Velero binary in the Velero deployment in the cluster
- Debugging Velero resources with the OpenShift CLI tool
- Debugging Velero resources with the Velero CLI tool
- Pods crash or restart due to lack of memory or CPU
- Issues with Velero and admission webhooks
- Installation issues
- OADP timeouts
- Backup and Restore CR issues
- Restic issues
- Using the must-gather tool
- OADP Monitoring
Troubleshooting
You can debug Velero custom resources (CRs) by using the OpenShift CLI tool or the Velero CLI tool. The Velero CLI tool provides more detailed logs and information.
You can check installation issues, backup and restore CR issues, and Restic issues.
You can collect logs and CR information by using the must-gather tool.
You can obtain the Velero CLI tool by:
Downloading the Velero CLI tool
Accessing the Velero binary in the Velero deployment in the cluster
Downloading the Velero CLI tool
You can download and install the Velero CLI tool by following the instructions on the Velero documentation page.
The page includes instructions for:
macOS by using Homebrew
GitHub
Windows by using Chocolatey
Prerequisites
You have access to a Kubernetes cluster, v1.16 or later, with DNS and container networking enabled.
You have installed kubectl locally.
Procedure
Open a browser and navigate to “Install the CLI” on the Velero website.
Follow the appropriate procedure for macOS, GitHub, or Windows.
Download the Velero version appropriate for your version of OADP and OKD.
OADP-Velero-OKD version relationship
OADP version | Velero version | OKD version
---|---|---
1.1.0 | 1.9 | 4.9 and later
1.1.1 | 1.9 | 4.9 and later
1.1.2 | 1.9 | 4.9 and later
1.1.3 | 1.9 | 4.9 and later
1.1.4 | 1.9 | 4.9 and later
1.1.5 | 1.9 | 4.9 and later
1.1.6 | 1.9 | 4.11 and later
1.2.0 | 1.11 | 4.11 and later
1.2.1 | 1.11 | 4.11 and later
1.2.2 | 1.11 | 4.11 and later
Accessing the Velero binary in the Velero deployment in the cluster
You can use a shell command to access the Velero binary in the Velero deployment in the cluster.
Prerequisites
- Your DataProtectionApplication custom resource has a status of Reconcile complete.
Procedure
Enter the following command to set the needed alias:
$ alias velero='oc -n openshift-adp exec deployment/velero -c velero -it -- ./velero'
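With the alias set, you can run Velero CLI commands as if the binary were installed locally. For example, to list all backups:
$ velero backup get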
Debugging Velero resources with the OpenShift CLI tool
You can debug a failed backup or restore by checking Velero custom resources (CRs) and the Velero pod log with the OpenShift CLI tool.
Velero CRs
Use the oc describe command to retrieve a summary of warnings and errors associated with a Backup or Restore CR:
$ oc describe <velero_cr> <cr_name>
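For example, to summarize a Backup CR in the openshift-adp namespace, where the CR name is a placeholder:
$ oc describe backup.velero.io <backup_name> -n openshift-adp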
Velero pod logs
Use the oc logs command to retrieve the Velero pod logs:
$ oc logs pod/<velero>
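Because the Velero pod name changes whenever the pod is recreated, you can also target the deployment directly; for example:
$ oc logs deployment/velero -n openshift-adp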
Velero pod debug logs
You can specify the Velero log level in the DataProtectionApplication resource as shown in the following example.
This option is available starting from OADP 1.0.3.
apiVersion: oadp.openshift.io/v1alpha1
kind: DataProtectionApplication
metadata:
name: velero-sample
spec:
configuration:
velero:
logLevel: warning
The following logLevel values are available:
trace
debug
info
warning
error
fatal
panic
It is recommended to use info for most logs.
Debugging Velero resources with the Velero CLI tool
You can debug Backup and Restore custom resources (CRs) and retrieve logs with the Velero CLI tool.
The Velero CLI tool provides more detailed information than the OpenShift CLI tool.
Syntax
Use the oc exec command to run a Velero CLI command:
$ oc -n openshift-adp exec deployment/velero -c velero -- ./velero \
<backup_restore_cr> <command> <cr_name>
Example
$ oc -n openshift-adp exec deployment/velero -c velero -- ./velero \
backup describe 0e44ae00-5dc3-11eb-9ca8-df7e5254778b-2d8ql
Help option
Use the velero --help option to list all Velero CLI commands:
$ oc -n openshift-adp exec deployment/velero -c velero -- ./velero \
--help
Describe command
Use the velero describe command to retrieve a summary of warnings and errors associated with a Backup or Restore CR:
$ oc -n openshift-adp exec deployment/velero -c velero -- ./velero \
<backup_restore_cr> describe <cr_name>
Example
$ oc -n openshift-adp exec deployment/velero -c velero -- ./velero \
backup describe 0e44ae00-5dc3-11eb-9ca8-df7e5254778b-2d8ql
Logs command
Use the velero logs command to retrieve the logs of a Backup or Restore CR:
$ oc -n openshift-adp exec deployment/velero -c velero -- ./velero \
<backup_restore_cr> logs <cr_name>
Example
$ oc -n openshift-adp exec deployment/velero -c velero -- ./velero \
restore logs ccc7c2d0-6017-11eb-afab-85d0007f5a19-x4lbf
Pods crash or restart due to lack of memory or CPU
If a Velero or Restic pod crashes due to a lack of memory or CPU, you can set specific resource requests for either of those resources.
Setting resource requests for a Velero pod
You can use the configuration.velero.podConfig.resourceAllocations specification field in the oadp_v1alpha1_dpa.yaml file to set specific resource requests for a Velero pod.
Procedure
Set the cpu and memory resource requests in the YAML file, as in the following example:
Example Velero file
apiVersion: oadp.openshift.io/v1alpha1
kind: DataProtectionApplication
...
configuration:
velero:
podConfig:
resourceAllocations: (1)
requests:
cpu: 200m
memory: 256Mi
1 The resourceAllocations listed are for average usage.
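To confirm that the requests took effect after the Operator reconciles the DPA, you can inspect the Velero deployment; a quick check, assuming the default openshift-adp namespace:
$ oc get deployment velero -n openshift-adp -o jsonpath='{.spec.template.spec.containers[0].resources}'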
Setting resource requests for a Restic pod
You can use the configuration.restic.podConfig.resourceAllocations specification field to set specific resource requests for a Restic pod.
Procedure
Set the cpu and memory resource requests in the YAML file, as in the following example:
Example Restic file
apiVersion: oadp.openshift.io/v1alpha1
kind: DataProtectionApplication
...
configuration:
restic:
podConfig:
resourceAllocations: (1)
requests:
cpu: 1000m
memory: 16Gi
1 The resourceAllocations listed are for average usage.
The values for the resource request fields must follow the same format as Kubernetes resource requirements. Also, if you do not specify configuration.velero.podConfig.resourceAllocations or configuration.restic.podConfig.resourceAllocations, the default resource specifications for the Velero or Restic pod are applied.
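Similarly, you can verify the Restic requests by inspecting the daemon set; a sketch, assuming the daemon set is named restic as in default OADP deployments:
$ oc get daemonset restic -n openshift-adp -o jsonpath='{.spec.template.spec.containers[0].resources}'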
Issues with Velero and admission webhooks
Velero has limited abilities to resolve admission webhook issues during a restore. If you have workloads with admission webhooks, you might need to use an additional Velero plugin or make changes to how you restore the workload.
Typically, workloads with admission webhooks require you to create a resource of a specific kind first. This is especially true if your workload has child resources because admission webhooks typically block child resources.
For example, creating or restoring a top-level object such as service.serving.knative.dev typically creates child resources automatically. If you do this first, you will not need to use Velero to create and restore these resources. This avoids the problem of child resources being blocked by an admission webhook that Velero might use.
Restoring workarounds for Velero backups that use admission webhooks
This section describes the additional steps required to restore resources for several types of Velero backups that use admission webhooks.
Restoring Knative resources
You might encounter problems using Velero to back up Knative resources that use admission webhooks.
You can avoid such problems by restoring the top level Service resource first whenever you back up and restore Knative resources that use admission webhooks.
Procedure
Restore the top level service.serving.knative.dev Service resource:
$ velero restore create <restore_name> \
--from-backup=<backup_name> --include-resources \
service.serving.knative.dev
Restoring IBM AppConnect resources
If you experience issues when you use Velero to restore an IBM AppConnect resource that has an admission webhook, you can run the checks in this procedure.
Procedure
Check if you have any mutating admission plugins of kind: MutatingWebhookConfiguration in the cluster:
$ oc get mutatingwebhookconfigurations
Examine the YAML file of each kind: MutatingWebhookConfiguration to ensure that none of its rules block creation of the objects that are experiencing issues. For more information, see the official Kubernetes documentation.
Check that any spec.version in type: Configuration.appconnect.ibm.com/v1beta1 used at backup time is supported by the installed Operator.
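For example, to examine the rules of a specific webhook configuration, where the name is a placeholder:
$ oc get mutatingwebhookconfiguration <webhook_name> -o yaml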
Installation issues
You might encounter issues caused by using invalid directories or incorrect credentials when you install the Data Protection Application.
Backup storage contains invalid directories
The Velero pod log displays the error message, Backup storage contains invalid top-level directories.
Cause
The object storage contains top-level directories that are not Velero directories.
Solution
If the object storage is not dedicated to Velero, you must specify a prefix for the bucket by setting the spec.backupLocations.velero.objectStorage.prefix parameter in the DataProtectionApplication manifest.
Incorrect AWS credentials
The oadp-aws-registry pod log displays the error message, InvalidAccessKeyId: The AWS Access Key Id you provided does not exist in our records.
The Velero pod log displays the error message, NoCredentialProviders: no valid providers in chain.
Cause
The credentials-velero file used to create the Secret object is incorrectly formatted.
Solution
Ensure that the credentials-velero file is correctly formatted, as in the following example:
Example credentials-velero file
[default] (1)
aws_access_key_id=AKIAIOSFODNN7EXAMPLE (2)
aws_secret_access_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
1 AWS default profile.
2 Do not enclose the values with quotation marks (" or ').
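After correcting the file, recreate the Secret object from it. A sketch, assuming the default AWS secret name cloud-credentials in the openshift-adp namespace:
$ oc delete secret cloud-credentials -n openshift-adp
$ oc create secret generic cloud-credentials -n openshift-adp --from-file cloud=credentials-velero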
OADP timeouts
Extending a timeout allows complex or resource-intensive processes to complete successfully without premature termination. This configuration can reduce the likelihood of errors, retries, or failures.
Ensure that you balance timeout extensions in a logical manner so that you do not configure excessively long timeouts that might hide underlying issues in the process. Carefully consider and monitor an appropriate timeout value that meets the needs of the process and the overall system performance.
The following are various OADP timeouts, with instructions on how and when to implement these parameters:
Restic timeout
timeout defines the Restic timeout. The default value is 1h.
Use the Restic timeout for the following scenarios:
For Restic backups with total PV data usage that is greater than 500GB.
If backups are timing out with the following error:
level=error msg="Error backing up item" backup=velero/monitoring error="timed out waiting for all PodVolumeBackups to complete"
Procedure
Edit the values in the spec.configuration.restic.timeout block of the DataProtectionApplication CR manifest, as in the following example:
apiVersion: oadp.openshift.io/v1alpha1
kind: DataProtectionApplication
metadata:
name: <dpa_name>
spec:
configuration:
restic:
timeout: 1h
# ...
Velero resource timeout
resourceTimeout defines how long to wait for several Velero resources before timeout occurs, such as Velero custom resource definition (CRD) availability, volumeSnapshot deletion, and repository availability. The default is 10m.
Use the resourceTimeout for the following scenarios:
For backups with total PV data usage that is greater than 1TB. This parameter is used as a timeout value when Velero tries to clean up or delete the Container Storage Interface (CSI) snapshots, before marking the backup as complete.
- A subtask of this cleanup tries to patch the VolumeSnapshotContent (VSC), and this timeout can be used for that task.
To create or ensure a backup repository is ready for filesystem based backups for Restic or Kopia.
To check if the Velero CRD is available in the cluster before restoring the custom resource (CR) or resource from the backup.
Procedure
Edit the values in the spec.configuration.velero.resourceTimeout block of the DataProtectionApplication CR manifest, as in the following example:
apiVersion: oadp.openshift.io/v1alpha1
kind: DataProtectionApplication
metadata:
name: <dpa_name>
spec:
configuration:
velero:
resourceTimeout: 10m
# ...
Data Mover timeout
timeout is a user-supplied timeout to complete VolumeSnapshotBackup and VolumeSnapshotRestore. The default value is 10m.
Use the Data Mover timeout for the following scenarios:
If creation of VolumeSnapshotBackups (VSBs) and VolumeSnapshotRestores (VSRs) times out after 10 minutes.
For large scale environments with total PV data usage that is greater than 500GB. Set the timeout to 1h.
With the VolumeSnapshotMover (VSM) plugin.
Only with OADP 1.1.x.
Procedure
Edit the values in the spec.features.dataMover.timeout block of the DataProtectionApplication CR manifest, as in the following example:
apiVersion: oadp.openshift.io/v1alpha1
kind: DataProtectionApplication
metadata:
name: <dpa_name>
spec:
features:
dataMover:
timeout: 10m
# ...
CSI snapshot timeout
CSISnapshotTimeout specifies the time during creation to wait until the CSI VolumeSnapshot status becomes ReadyToUse, before returning a timeout error. The default value is 10m.
Use the CSISnapshotTimeout for the following scenarios:
With the CSI plugin.
For very large storage volumes that may take longer than 10 minutes to snapshot. Adjust this timeout if timeouts are found in the logs.
Typically, the default value for CSISnapshotTimeout does not require adjustment.
Procedure
Edit the values in the spec.csiSnapshotTimeout block of the Backup CR manifest, as in the following example:
apiVersion: velero.io/v1
kind: Backup
metadata:
name: <backup_name>
spec:
csiSnapshotTimeout: 10m
# ...
Velero default item operation timeout
defaultItemOperationTimeout defines how long to wait on asynchronous BackupItemActions and RestoreItemActions to complete before timing out. The default value is 1h.
Use the defaultItemOperationTimeout for the following scenarios:
Only with Data Mover 1.2.x.
To specify the amount of time a particular backup or restore should wait for the asynchronous actions to complete. In the context of OADP features, this value is used for the asynchronous actions involved in the Container Storage Interface (CSI) Data Mover feature.
When defaultItemOperationTimeout is defined in the Data Protection Application (DPA), it applies to both backup and restore operations. You can use itemOperationTimeout to define only the backup or only the restore of those CRs, as described in the following "Item operation timeout - restore" and "Item operation timeout - backup" sections.
Procedure
Edit the values in the spec.configuration.velero.defaultItemOperationTimeout block of the DataProtectionApplication CR manifest, as in the following example:
apiVersion: oadp.openshift.io/v1alpha1
kind: DataProtectionApplication
metadata:
name: <dpa_name>
spec:
configuration:
velero:
defaultItemOperationTimeout: 1h
# ...
Item operation timeout - restore
ItemOperationTimeout specifies the time that is used to wait for RestoreItemAction operations. The default value is 1h.
Use the restore ItemOperationTimeout for the following scenarios:
Only with Data Mover 1.2.x.
For Data Mover uploads and downloads to or from the BackupStorageLocation. If the restore action is not completed when the timeout is reached, it is marked as failed. If Data Mover operations are failing due to timeout issues because of large storage volume sizes, you might need to increase this timeout setting.
Procedure
Edit the values in the Restore.spec.itemOperationTimeout block of the Restore CR manifest, as in the following example:
apiVersion: velero.io/v1
kind: Restore
metadata:
name: <restore_name>
spec:
itemOperationTimeout: 1h
# ...
Item operation timeout - backup
ItemOperationTimeout specifies the time used to wait for asynchronous BackupItemAction operations. The default value is 1h.
Use the backup ItemOperationTimeout for the following scenarios:
Only with Data Mover 1.2.x.
For Data Mover uploads and downloads to or from the BackupStorageLocation. If the backup action is not completed when the timeout is reached, it is marked as failed. If Data Mover operations are failing due to timeout issues because of large storage volume sizes, you might need to increase this timeout setting.
Procedure
Edit the values in the Backup.spec.itemOperationTimeout block of the Backup CR manifest, as in the following example:
apiVersion: velero.io/v1
kind: Backup
metadata:
name: <backup_name>
spec:
itemOperationTimeout: 1h
# ...
Backup and Restore CR issues
You might encounter these common issues with Backup and Restore custom resources (CRs).
Backup CR cannot retrieve volume
The Backup CR displays the error message, InvalidVolume.NotFound: The volume 'vol-xxxx' does not exist.
Cause
The persistent volume (PV) and the snapshot locations are in different regions.
Solution
Edit the value of the spec.snapshotLocations.velero.config.region key in the DataProtectionApplication manifest so that the snapshot location is in the same region as the PV.
Create a new Backup CR.
Backup CR status remains in progress
The status of a Backup CR remains in the InProgress phase and does not complete.
Cause
If a backup is interrupted, it cannot be resumed.
Solution
Retrieve the details of the Backup CR:
$ oc -n openshift-adp exec deployment/velero -c velero -- ./velero \
backup describe <backup>
Delete the Backup CR:
$ oc delete backup <backup> -n openshift-adp
You do not need to clean up the backup location because a Backup CR in progress has not uploaded files to object storage.
Create a new Backup CR.
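Alternatively, you can delete the backup with the Velero CLI, which also removes any associated data from object storage; for example:
$ oc -n openshift-adp exec deployment/velero -c velero -- ./velero \
backup delete <backup> --confirm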
Backup CR status remains in PartiallyFailed
The status of a Backup CR without Restic in use remains in the PartiallyFailed phase and does not complete. A snapshot of the affiliated PVC is not created.
Cause
If the backup is created based on the CSI snapshot class, but the label is missing, the CSI snapshot plugin fails to create a snapshot. As a result, the Velero pod logs an error similar to the following:
time="2023-02-17T16:33:13Z" level=error msg="Error backing up item" backup=openshift-adp/user1-backup-check5 error="error executing custom action (groupResource=persistentvolumeclaims, namespace=busy1, name=pvc1-user1): rpc error: code = Unknown desc = failed to get volumesnapshotclass for storageclass ocs-storagecluster-ceph-rbd: failed to get volumesnapshotclass for provisioner openshift-storage.rbd.csi.ceph.com, ensure that the desired volumesnapshot class has the velero.io/csi-volumesnapshot-class label" logSource="/remote-source/velero/app/pkg/backup/backup.go:417" name=busybox-79799557b5-vprq
Solution
Delete the Backup CR:
$ oc delete backup <backup> -n openshift-adp
If required, clean up the stored data on the BackupStorageLocation to free up space.
Apply the label velero.io/csi-volumesnapshot-class=true to the VolumeSnapshotClass object:
$ oc label volumesnapshotclass/<snapclass_name> velero.io/csi-volumesnapshot-class=true
Create a new Backup CR.
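You can confirm that the label was applied by listing the labels on the VolumeSnapshotClass object:
$ oc get volumesnapshotclass <snapclass_name> --show-labels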
Restic issues
You might encounter these issues when you back up applications with Restic.
Restic permission error for NFS data volumes with root_squash enabled
The Restic pod log displays the error message: controller=pod-volume-backup error="fork/exec /usr/bin/restic: permission denied".
.
Cause
If your NFS data volumes have root_squash enabled, Restic maps to nfsnobody and does not have permission to create backups.
Solution
You can resolve this issue by creating a supplemental group for Restic and adding the group ID to the DataProtectionApplication manifest:
Create a supplemental group for Restic on the NFS data volume.
Set the setgid bit on the NFS directories so that group ownership is inherited.
Add the spec.configuration.restic.supplementalGroups parameter and the group ID to the DataProtectionApplication manifest, as in the following example:
spec:
configuration:
restic:
enable: true
supplementalGroups:
- <group_id> (1)
1 Specify the supplemental group ID.
Wait for the Restic pods to restart so that the changes are applied.
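You can watch the Restic pods restart with a command such as the following:
$ oc get pods -n openshift-adp --watch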
Restic Backup CR cannot be recreated after bucket is emptied
If you create a Restic Backup CR for a namespace, empty the object storage bucket, and then recreate the Backup CR for the same namespace, the recreated Backup CR fails.
The velero pod log displays the following error message: stderr=Fatal: unable to open config file: Stat: The specified key does not exist.\nIs there a repository at the following location?
Cause
Velero does not recreate or update the Restic repository from the ResticRepository manifest if the Restic directories are deleted from object storage. See Velero issue 4421 for more information.
Solution
Remove the related Restic repository from the namespace by running the following command:
$ oc delete resticrepository -n openshift-adp <name_of_the_restic_repository>
In the following error log, mysql-persistent is the problematic Restic repository:
time="2021-12-29T18:29:14Z" level=info msg="1 errors
encountered backup up item" backup=velero/backup65
logSource="pkg/backup/backup.go:431" name=mysql-7d99fc949-qbkds
time="2021-12-29T18:29:14Z" level=error msg="Error backing up item"
backup=velero/backup65 error="pod volume backup failed: error running
restic backup, stderr=Fatal: unable to open config file: Stat: The
specified key does not exist.\nIs there a repository at the following
location?\ns3:http://minio-minio.apps.mayap-oadp-
veleo-1234.qe.devcluster.openshift.com/mayapvelerooadp2/velero1/
restic/mysql-persistent\n: exit status 1" error.file="/remote-source/
src/github.com/vmware-tanzu/velero/pkg/restic/backupper.go:184"
error.function="github.com/vmware-tanzu/velero/
pkg/restic.(*backupper).BackupPodVolumes"
logSource="pkg/backup/backup.go:435" name=mysql-7d99fc949-qbkds
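To find the repository name without parsing logs, you can also list the Restic repositories in the namespace. Note that in Velero 1.10 and later the resource is named backuprepositories instead:
$ oc get resticrepositories.velero.io -n openshift-adp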
Restic restore partially failing on OCP 4.14 due to changed PSA policy
OpenShift Container Platform 4.14 enforces a Pod Security Admission (PSA) policy that can hinder the readiness of pods during a Restic restore process.
If a SecurityContextConstraints
(SCC) resource is not found when a pod is created, and the PSA policy on the pod is not set up to meet the required standards, pod admission is denied.
This issue arises due to the resource restore order of Velero.
Sample error
\"level=error\" in line#2273: time=\"2023-06-12T06:50:04Z\"
level=error msg=\"error restoring mysql-869f9f44f6-tp5lv: pods\\\
"mysql-869f9f44f6-tp5lv\\\" is forbidden: violates PodSecurity\\\
"restricted:v1.24\\\": privil eged (container \\\"mysql\\\
" must not set securityContext.privileged=true),
allowPrivilegeEscalation != false (containers \\\
"restic-wait\\\", \\\"mysql\\\" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (containers \\\
"restic-wait\\\", \\\"mysql\\\" must set securityContext.capabilities.drop=[\\\"ALL\\\"]), seccompProfile (pod or containers \\\
"restic-wait\\\", \\\"mysql\\\" must set securityContext.seccompProfile.type to \\\
"RuntimeDefault\\\" or \\\"Localhost\\\")\" logSource=\"/remote-source/velero/app/pkg/restore/restore.go:1388\" restore=openshift-adp/todolist-backup-0780518c-08ed-11ee-805c-0a580a80e92c\n
velero container contains \"level=error\" in line#2447: time=\"2023-06-12T06:50:05Z\"
level=error msg=\"Namespace todolist-mariadb,
resource restore error: error restoring pods/todolist-mariadb/mysql-869f9f44f6-tp5lv: pods \\\
"mysql-869f9f44f6-tp5lv\\\" is forbidden: violates PodSecurity \\\"restricted:v1.24\\\": privileged (container \\\
"mysql\\\" must not set securityContext.privileged=true),
allowPrivilegeEscalation != false (containers \\\
"restic-wait\\\",\\\"mysql\\\" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (containers \\\
"restic-wait\\\", \\\"mysql\\\" must set securityContext.capabilities.drop=[\\\"ALL\\\"]), seccompProfile (pod or containers \\\
"restic-wait\\\", \\\"mysql\\\" must set securityContext.seccompProfile.type to \\\
"RuntimeDefault\\\" or \\\"Localhost\\\")\"
logSource=\"/remote-source/velero/app/pkg/controller/restore_controller.go:510\"
restore=openshift-adp/todolist-backup-0780518c-08ed-11ee-805c-0a580a80e92c\n]",
Solution
In your DPA custom resource (CR), check or set the restore-resource-priorities field on the Velero server to ensure that securitycontextconstraints is listed before pods in the list of resources:
$ oc get dpa -o yaml
Example DPA CR
# ...
configuration:
restic:
enable: true
velero:
args:
restore-resource-priorities: 'securitycontextconstraints,customresourcedefinitions,namespaces,storageclasses,volumesnapshotclass.snapshot.storage.k8s.io,volumesnapshotcontents.snapshot.storage.k8s.io,volumesnapshots.snapshot.storage.k8s.io,datauploads.velero.io,persistentvolumes,persistentvolumeclaims,serviceaccounts,secrets,configmaps,limitranges,pods,replicasets.apps,clusterclasses.cluster.x-k8s.io,endpoints,services,-,clusterbootstraps.run.tanzu.vmware.com,clusters.cluster.x-k8s.io,clusterresourcesets.addons.cluster.x-k8s.io' (1)
defaultPlugins:
- gcp
- openshift
1 If you have an existing restore resource priority list, ensure you combine that existing list with the complete list.
Ensure that the security standards for the application pods are aligned, as provided in Fixing PodSecurity Admission warnings for deployments, to prevent deployment warnings. If the application is not aligned with the security standards, an error can occur regardless of the SCC.
This solution is temporary, and ongoing discussions are in progress to address it.
Using the must-gather tool
You can collect logs, metrics, and information about OADP custom resources by using the must-gather tool.
The must-gather data must be attached to all customer cases.
Prerequisites
You must be logged in to the OKD cluster as a user with the cluster-admin role.
You must have the OpenShift CLI (oc) installed.
Procedure
Navigate to the directory where you want to store the must-gather data.
Run the oc adm must-gather command for one of the following data collection options:
$ oc adm must-gather --image=registry.redhat.io/oadp/oadp-mustgather-rhel8:v1.1
The data is saved as must-gather/must-gather.tar.gz. You can upload this file to a support case on the Red Hat Customer Portal.
$ oc adm must-gather --image=registry.redhat.io/oadp/oadp-mustgather-rhel8:v1.1 \
-- /usr/bin/gather_metrics_dump
This operation can take a long time. The data is saved as must-gather/metrics/prom_data.tar.gz.
OADP Monitoring
OKD provides a monitoring stack that allows users and administrators to effectively monitor and manage their clusters, monitor and analyze the workload performance of user applications and services running on the clusters, and receive alerts if an event occurs.
OADP monitoring setup
The OADP Operator leverages the user workload monitoring provided by the OpenShift monitoring stack to retrieve metrics from the Velero service endpoint. The monitoring stack allows creating user-defined alerting rules or querying metrics by using the OpenShift metrics query front end.
With user workload monitoring enabled, you can configure and use any Prometheus-compatible third-party UI, such as Grafana, to visualize Velero metrics.
Monitoring metrics requires enabling monitoring for user-defined projects and creating a ServiceMonitor resource to scrape those metrics from the already enabled OADP service endpoint that resides in the openshift-adp namespace.
Prerequisites
You have access to an OKD cluster using an account with cluster-admin permissions.
You have created a cluster monitoring config map.
Procedure
Edit the cluster-monitoring-config ConfigMap object in the openshift-monitoring namespace:
$ oc edit configmap cluster-monitoring-config -n openshift-monitoring
Add or enable the enableUserWorkload option in the config.yaml field of the data section:
apiVersion: v1
data:
config.yaml: |
enableUserWorkload: true (1)
kind: ConfigMap
metadata:
# ...
1 Add this option or set it to true.
Wait a short period of time, and then verify the user workload monitoring setup by checking whether the following components are up and running in the openshift-user-workload-monitoring namespace:
$ oc get pods -n openshift-user-workload-monitoring
Example output
NAME READY STATUS RESTARTS AGE
prometheus-operator-6844b4b99c-b57j9 2/2 Running 0 43s
prometheus-user-workload-0 5/5 Running 0 32s
prometheus-user-workload-1 5/5 Running 0 32s
thanos-ruler-user-workload-0 3/3 Running 0 32s
thanos-ruler-user-workload-1 3/3 Running 0 32s
Verify the existence of the user-workload-monitoring-config ConfigMap in the openshift-user-workload-monitoring namespace. If it exists, skip the remaining steps in this procedure.
$ oc get configmap user-workload-monitoring-config -n openshift-user-workload-monitoring
Example output
Error from server (NotFound): configmaps "user-workload-monitoring-config" not found
Create a user-workload-monitoring-config ConfigMap object for the user workload monitoring, and save it under the 2_configure_user_workload_monitoring.yaml file name:
Example ConfigMap object
apiVersion: v1
kind: ConfigMap
metadata:
name: user-workload-monitoring-config
namespace: openshift-user-workload-monitoring
data:
config.yaml: |
Apply the 2_configure_user_workload_monitoring.yaml file:
$ oc apply -f 2_configure_user_workload_monitoring.yaml
Example output
configmap/user-workload-monitoring-config created
Creating OADP service monitor
OADP provides the openshift-adp-velero-metrics-svc service, which is created when the DPA is configured. The service monitor used by the user workload monitoring must point to this service.
Get details about the service by running the following commands:
Procedure
Ensure that the openshift-adp-velero-metrics-svc service exists. It should contain the app.kubernetes.io/name=velero label, which is used as the selector for the ServiceMonitor object.
$ oc get svc -n openshift-adp -l app.kubernetes.io/name=velero
Example output
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
openshift-adp-velero-metrics-svc ClusterIP 172.30.38.244 <none> 8085/TCP 1h
Create a ServiceMonitor YAML file that matches the existing service label, and save the file as 3_create_oadp_service_monitor.yaml. The service monitor is created in the openshift-adp namespace where the openshift-adp-velero-metrics-svc service resides.
Example ServiceMonitor object
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
labels:
app: oadp-service-monitor
name: oadp-service-monitor
namespace: openshift-adp
spec:
endpoints:
- interval: 30s
path: /metrics
targetPort: 8085
scheme: http
selector:
matchLabels:
app.kubernetes.io/name: "velero"
Apply the 3_create_oadp_service_monitor.yaml file:
$ oc apply -f 3_create_oadp_service_monitor.yaml
Example output
servicemonitor.monitoring.coreos.com/oadp-service-monitor created
Verification
Confirm that the new service monitor is in an Up state by using the Administrator perspective of the OKD web console:
Navigate to the Observe → Targets page.
Ensure that the Filter is unselected, or that the User source is selected, and type openshift-adp in the Text search field.
Verify that the Status of the service monitor is Up.
Figure 1. OADP metrics targets
Creating an alerting rule
The OKD monitoring stack allows you to receive alerts that are configured by using alerting rules. To create an alerting rule for the OADP project, use one of the metrics that are scraped with the user workload monitoring.
Procedure
Create a PrometheusRule YAML file with the sample OADPBackupFailing alert, and save it as 4_create_oadp_alert_rule.yaml.
Sample OADPBackupFailing alert
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: sample-oadp-alert
namespace: openshift-adp
spec:
groups:
- name: sample-oadp-backup-alert
rules:
- alert: OADPBackupFailing
annotations:
description: 'OADP had {{$value | humanize}} backup failures over the last 2 hours.'
summary: OADP has issues creating backups
expr: |
increase(velero_backup_failure_total{job="openshift-adp-velero-metrics-svc"}[2h]) > 0
for: 5m
labels:
severity: warning
In this sample, the alert displays under the following conditions:
There is an increase of new failing backups during the last 2 hours that is greater than 0, and the state persists for at least 5 minutes.
If the time of the first increase is less than 5 minutes, the alert is in a Pending state, after which it turns into a Firing state.
Apply the 4_create_oadp_alert_rule.yaml file, which creates the PrometheusRule object in the openshift-adp namespace:
$ oc apply -f 4_create_oadp_alert_rule.yaml
Example output
prometheusrule.monitoring.coreos.com/sample-oadp-alert created
Verification
After the alert is triggered, you can view it in the following ways:
In the Developer perspective, select the Observe menu.
In the Administrator perspective, under the Observe → Alerting menu, select User in the Filter box. Otherwise, by default only the Platform alerts are displayed.
Figure 2. OADP backup failing alert
List of available metrics
The following table lists the metrics provided by OADP, together with their types.
Metric name | Description | Type
---|---|---
kopia_content_cache_hit_bytes | Number of bytes retrieved from the cache | Counter
kopia_content_cache_hit_count | Number of times content was retrieved from the cache | Counter
kopia_content_cache_malformed | Number of times malformed content was read from the cache | Counter
kopia_content_cache_miss_count | Number of times content was not found in the cache and fetched | Counter
kopia_content_cache_missed_bytes | Number of bytes retrieved from the underlying storage | Counter
kopia_content_cache_miss_error_count | Number of times content could not be found in the underlying storage | Counter
kopia_content_cache_store_error_count | Number of times content could not be saved in the cache | Counter
kopia_content_get_bytes | Number of bytes retrieved using GetContent() calls | Counter
kopia_content_get_count | Number of times GetContent() was called | Counter
kopia_content_get_error_count | Number of times GetContent() was called and an error was returned | Counter
kopia_content_get_not_found_count | Number of times GetContent() was called and the content was not found | Counter
kopia_content_write_bytes | Number of bytes passed to WriteContent() | Counter
kopia_content_write_count | Number of times WriteContent() was called | Counter
velero_backup_attempt_total | Total number of attempted backups | Counter
velero_backup_deletion_attempt_total | Total number of attempted backup deletions | Counter
velero_backup_deletion_failure_total | Total number of failed backup deletions | Counter
velero_backup_deletion_success_total | Total number of successful backup deletions | Counter
velero_backup_duration_seconds | Time taken to complete backup, in seconds | Histogram
velero_backup_failure_total | Total number of failed backups | Counter
velero_backup_items_errors | Total number of errors encountered during backup | Gauge
velero_backup_items_total | Total number of items backed up | Gauge
velero_backup_last_status | Last status of the backup. A value of 1 is success, 0 is failure | Gauge
velero_backup_last_successful_timestamp | Last time a backup ran successfully, Unix timestamp in seconds | Gauge
velero_backup_partial_failure_total | Total number of partially failed backups | Counter
velero_backup_success_total | Total number of successful backups | Counter
velero_backup_tarball_size_bytes | Size, in bytes, of a backup | Gauge
velero_backup_total | Current number of existent backups | Gauge
velero_backup_validation_failure_total | Total number of validation failed backups | Counter
velero_backup_warning_total | Total number of warned backups | Counter
velero_csi_snapshot_attempt_total | Total number of CSI attempted volume snapshots | Counter
velero_csi_snapshot_failure_total | Total number of CSI failed volume snapshots | Counter
velero_csi_snapshot_success_total | Total number of CSI successful volume snapshots | Counter
velero_restore_attempt_total | Total number of attempted restores | Counter
velero_restore_failed_total | Total number of failed restores | Counter
velero_restore_partial_failure_total | Total number of partially failed restores | Counter
velero_restore_success_total | Total number of successful restores | Counter
velero_restore_total | Current number of existent restores | Gauge
velero_restore_validation_failed_total | Total number of failed restores failing validations | Counter
velero_volume_snapshot_attempt_total | Total number of attempted volume snapshots | Counter
velero_volume_snapshot_failure_total | Total number of failed volume snapshots | Counter
velero_volume_snapshot_success_total | Total number of successful volume snapshots | Counter
Viewing metrics using the Observe UI
You can view metrics in the OKD web console from the Administrator or Developer perspective. You must have access to the openshift-adp project.
Procedure
Navigate to the Observe → Metrics page:
If you are using the Developer perspective, follow these steps:
Select Custom query, or click the Show PromQL link.
Type the query and press Enter.
If you are using the Administrator perspective, type the expression in the text field and select Run Queries.
Figure 3. OADP metrics query
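For example, the following PromQL query returns the total number of successful backups, using one of the metrics listed above. The job label value is an assumption based on the openshift-adp-velero-metrics-svc service name and might differ in your environment:
velero_backup_success_total{job="openshift-adp-velero-metrics-svc"}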