Prometheus Cluster Monitoring

Overview

OKD ships with a pre-configured and self-updating monitoring stack that is based on the Prometheus open source project and its wider ecosystem. It monitors cluster components and ships with a set of alerts that immediately notify the cluster administrator about problems, as well as a set of Grafana dashboards.

[Figure: monitoring diagram]

As the diagram above highlights, the OKD Cluster Monitoring Operator (CMO) sits at the heart of the monitoring stack. It watches over the deployed monitoring components and resources and ensures that they are always up to date.

The Prometheus Operator (PO) creates, configures, and manages Prometheus and Alertmanager instances. It also automatically generates monitoring target configurations based on familiar Kubernetes label queries.

In addition to Prometheus and Alertmanager, OKD Monitoring also includes node-exporter and kube-state-metrics. Node-exporter is an agent deployed on every node to collect metrics about it. The kube-state-metrics exporter agent converts Kubernetes objects to metrics consumable by Prometheus.

The targets monitored as part of the cluster monitoring are:

  • Prometheus itself

  • Prometheus-Operator

  • cluster-monitoring-operator

  • Alertmanager cluster instances

  • Kubernetes apiserver

  • kubelets (the kubelet embeds cAdvisor for per-container metrics)

  • kube-controllers

  • kube-state-metrics

  • node-exporter

  • etcd (if etcd monitoring is enabled)

All these components are automatically updated.

For more information about the OKD Cluster Monitoring Operator, see the Cluster Monitoring Operator GitHub project.

To deliver updates with guaranteed compatibility, the configurability of the OKD Monitoring stack is limited to the explicitly available options.

Configuring OKD cluster monitoring

The OKD Ansible openshift_cluster_monitoring_operator role configures and deploys the Cluster Monitoring Operator using the variables from the inventory file.

Table 1. Ansible variables

Variable | Description
openshift_cluster_monitoring_operator_install | Deploy the Cluster Monitoring Operator if true. Otherwise, undeploy. This variable is set to true by default.
openshift_cluster_monitoring_operator_prometheus_storage_capacity | The persistent volume claim size for each of the Prometheus instances. This variable applies only if openshift_cluster_monitoring_operator_prometheus_storage_enabled is set to true. Defaults to 50Gi.
openshift_cluster_monitoring_operator_alertmanager_storage_capacity | The persistent volume claim size for each of the Alertmanager instances. This variable applies only if openshift_cluster_monitoring_operator_alertmanager_storage_enabled is set to true. Defaults to 2Gi.
openshift_cluster_monitoring_operator_node_selector | Set to the desired, existing node selector to ensure that pods are placed onto nodes with specific labels. Defaults to node-role.kubernetes.io/infra=true.
openshift_cluster_monitoring_operator_alertmanager_config | Configures Alertmanager.
openshift_cluster_monitoring_operator_prometheus_storage_enabled | Enable persistent storage of Prometheus’ time-series data. This variable is set to false by default.
openshift_cluster_monitoring_operator_alertmanager_storage_enabled | Enable persistent storage of Alertmanager notifications and silences. This variable is set to false by default.
openshift_cluster_monitoring_operator_prometheus_storage_class_name | If you enabled the openshift_cluster_monitoring_operator_prometheus_storage_enabled option, set a specific StorageClass to ensure that pods are configured to use the PVC with that storageclass. Defaults to none, which applies the default storage class name.
openshift_cluster_monitoring_operator_alertmanager_storage_class_name | If you enabled the openshift_cluster_monitoring_operator_alertmanager_storage_enabled option, set a specific StorageClass to ensure that pods are configured to use the PVC with that storageclass. Defaults to none, which applies the default storage class name.

Monitoring prerequisites

The monitoring stack imposes additional resource requirements. See computing resources recommendations for details.

Installing monitoring stack

The monitoring stack is installed with OKD by default. To prevent it from being installed, set this variable to false in the Ansible inventory file:

openshift_cluster_monitoring_operator_install

For example, you can pass the variable on the command line when you run the monitoring playbook:

  $ ansible-playbook [-i </path/to/inventory>] <OPENSHIFT_ANSIBLE_DIR>/playbooks/openshift-monitoring/config.yml \
      -e openshift_cluster_monitoring_operator_install=False

A common path for the Ansible directory is /usr/share/ansible/openshift-ansible/. In this case, the path to the configuration file is /usr/share/ansible/openshift-ansible/playbooks/openshift-monitoring/config.yml.

Persistent storage

Running cluster monitoring with persistent storage means that your metrics are stored on a persistent volume and can survive a pod being restarted or recreated. This is ideal if you require your metrics or alerting data to be protected against data loss. For production environments, it is highly recommended to configure persistent storage using block storage technology.

Enabling persistent storage

By default, persistent storage is disabled for both Prometheus time-series data and for Alertmanager notifications and silences. You can configure the cluster to persistently store either one or both.

  • To enable persistent storage of Prometheus time-series data, set this variable to true in the Ansible inventory file:

    openshift_cluster_monitoring_operator_prometheus_storage_enabled

  • To enable persistent storage of Alertmanager notifications and silences, set this variable to true in the Ansible inventory file:

    openshift_cluster_monitoring_operator_alertmanager_storage_enabled

Determining how much storage is necessary

How much storage you need depends on the number of pods. It is the administrator’s responsibility to dedicate sufficient storage to ensure that the disk does not become full. For information on system requirements for persistent storage, see Capacity Planning for Cluster Monitoring Operator.

Setting persistent storage size

To specify the size of the persistent volume claim for Prometheus and Alertmanager, change these Ansible variables:

  • openshift_cluster_monitoring_operator_prometheus_storage_capacity (default: 50Gi)

  • openshift_cluster_monitoring_operator_alertmanager_storage_capacity (default: 2Gi)

Each of these variables applies only if its corresponding storage_enabled variable is set to true.
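
As a rough sketch, the inventory settings from the two preceding sections could be combined as follows. The scratch file name monitoring-storage.ini and the [OSEv3:vars] section header are assumptions made for illustration; place the variables wherever your own inventory keeps cluster-wide variables. The capacities shown are the documented defaults.

```shell
# Write an illustrative inventory fragment to a scratch file (hypothetical name).
cat > monitoring-storage.ini <<'EOF'
[OSEv3:vars]
# Persist Prometheus time-series data and set the size of each PVC.
openshift_cluster_monitoring_operator_prometheus_storage_enabled=true
openshift_cluster_monitoring_operator_prometheus_storage_capacity=50Gi
# Persist Alertmanager notifications and silences and set the size of each PVC.
openshift_cluster_monitoring_operator_alertmanager_storage_enabled=true
openshift_cluster_monitoring_operator_alertmanager_storage_capacity=2Gi
EOF

# Print the fragment so it can be reviewed before merging it into the real inventory.
cat monitoring-storage.ini
```

Leaving out a capacity variable keeps its default, so the two capacity lines are only needed when you want a different size.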

Allocating enough persistent volumes

Unless you use dynamically-provisioned storage, you need to make sure you have a persistent volume (PV) ready to be claimed by the PVC, one PV for each replica. Prometheus has two replicas and Alertmanager has three replicas, which amounts to five PVs.
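
The arithmetic can be sketched in a few lines of shell. The replica counts and default capacities come from this guide; the variable names are made up for the example, and the total assumes that persistent storage is enabled for both components with the default sizes.

```shell
# Replica counts stated in this guide.
prometheus_replicas=2
alertmanager_replicas=3

# Default PVC sizes in Gi, taken from the Ansible variable defaults.
prometheus_capacity_gi=50
alertmanager_capacity_gi=2

# One PV is needed per replica.
required_pvs=$((prometheus_replicas + alertmanager_replicas))

# Total capacity the PVs must provide when the default sizes are kept.
total_capacity_gi=$((prometheus_replicas * prometheus_capacity_gi + alertmanager_replicas * alertmanager_capacity_gi))

echo "PVs required: ${required_pvs}"
echo "Total capacity: ${total_capacity_gi}Gi"
```

If you enable persistent storage for only one of the two components, count only that component's replicas.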

Enabling dynamically-provisioned storage

Instead of statically-provisioned storage, you can use dynamically-provisioned storage. See Dynamic Volume Provisioning for details.

To enable dynamic storage for Prometheus and Alertmanager, set the following parameters to true in the Ansible inventory file:

  • openshift_cluster_monitoring_operator_prometheus_storage_enabled (default: false)

  • openshift_cluster_monitoring_operator_alertmanager_storage_enabled (default: false)

After you enable dynamic storage, you can also set the storage class for the persistent volume claim of each component with the following parameters in the Ansible inventory file:

  • openshift_cluster_monitoring_operator_prometheus_storage_class_name (default: “”)

  • openshift_cluster_monitoring_operator_alertmanager_storage_class_name (default: “”)

Each of these variables applies only if its corresponding storage_enabled variable is set to true.

Supported configuration

The supported way to configure OKD Monitoring is to use the options described in this guide. Beyond those explicit configuration options, it is possible to inject additional configuration into the stack. However, this is unsupported, because configuration paradigms might change across Prometheus releases, and such changes can be handled gracefully only if all configuration possibilities are controlled.

Explicitly unsupported cases include:

  • Creating additional ServiceMonitor objects in the openshift-monitoring namespace, thereby extending the targets that the cluster monitoring Prometheus instance scrapes. This can cause collisions and load differences that cannot be accounted for, so the Prometheus setup can become unstable.

  • Creating additional ConfigMap objects that cause the cluster monitoring Prometheus instance to include additional alerting and recording rules. Note that this is known to cause breakage, because Prometheus 2.0 ships with a new rule file syntax.

Configuring Alertmanager

The Alertmanager manages incoming alerts; this includes silencing, inhibition, aggregation, and sending out notifications through methods such as email, PagerDuty, and HipChat.

The default configuration of the OKD Monitoring Alertmanager cluster is:

  global:
    resolve_timeout: 5m
  route:
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 12h
    receiver: default
    routes:
    - match:
        alertname: DeadMansSwitch
      repeat_interval: 5m
      receiver: deadmansswitch
  receivers:
  - name: default
  - name: deadmansswitch

This configuration can be overwritten using the Ansible variable openshift_cluster_monitoring_operator_alertmanager_config from the openshift_cluster_monitoring_operator role.

The following example configures PagerDuty for notifications. See the PagerDuty documentation for Alertmanager to learn how to retrieve the service_key.

  openshift_cluster_monitoring_operator_alertmanager_config: |+
    global:
      resolve_timeout: 5m
    route:
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      receiver: default
      routes:
      - match:
          alertname: DeadMansSwitch
        repeat_interval: 5m
        receiver: deadmansswitch
      - match:
          service: example-app
        routes:
        - match:
            severity: critical
          receiver: team-frontend-page
    receivers:
    - name: default
    - name: deadmansswitch
    - name: team-frontend-page
      pagerduty_configs:
      - service_key: "<key>"

The sub-route matches only on alerts that have a severity of critical and sends them using the receiver called team-frontend-page. As the name indicates, someone should be paged for alerts that are critical. See Alertmanager configuration for configuring alerting through different alert receivers.

Dead man’s switch

OKD Monitoring ships with a dead man’s switch to ensure the availability of the monitoring infrastructure.

The dead man’s switch is a simple Prometheus alerting rule that always triggers. The Alertmanager continuously sends notifications for the dead man’s switch to a notification provider that supports this functionality. This also ensures that communication between the Alertmanager and the notification provider is working.

This mechanism is supported by PagerDuty to issue alerts when the monitoring system itself is down. For more information, see Dead man’s switch PagerDuty below.
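
To illustrate what "always triggers" means, a rule of this kind can be written with an expression that always returns a value. The following is only a sketch in the Prometheus 2.x rule file format, not the rule file shipped with OKD. The group name and the scratch file name are assumptions, while the alert name, severity, summary, and description follow the alerting rules table in this guide. Because custom alerting rules cannot be added, treat it as an illustration only.

```shell
# Write an illustrative rule file to a scratch location (hypothetical file name).
cat > deadmansswitch-rule-sketch.yaml <<'EOF'
groups:
- name: example-general-rules   # hypothetical group name
  rules:
  - alert: DeadMansSwitch
    expr: vector(1)             # always returns a sample, so the alert always fires
    labels:
      severity: none
    annotations:
      summary: Alerting DeadMansSwitch
      description: This is a DeadMansSwitch meant to ensure that the entire Alerting pipeline is functional.
EOF

# Print the sketch for review.
cat deadmansswitch-rule-sketch.yaml
```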

Grouping alerts

Once alerts are firing against the Alertmanager, you must configure how it logically groups them.

For this example, a new route is added to reflect alert routing of the frontend team.

Procedure

  1. Add new routes. Multiple routes may be added beneath the original route, typically to define the receiver for the notification. The following example uses a matcher to ensure that only alerts coming from the service example-app are used:

    global:
      resolve_timeout: 5m
    route:
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      receiver: default
      routes:
      - match:
          alertname: DeadMansSwitch
        repeat_interval: 5m
        receiver: deadmansswitch
      - match:
          service: example-app
        routes:
        - match:
            severity: critical
          receiver: team-frontend-page
    receivers:
    - name: default
    - name: deadmansswitch
    # Defines the receiver that the sub-route above refers to; add its notification settings here.
    - name: team-frontend-page

    The sub-route matches only on alerts that have a severity of critical, and sends them using the receiver called team-frontend-page. As the name indicates, someone should be paged for alerts that are critical.

Dead man’s switch PagerDuty

PagerDuty supports this mechanism through an integration called Dead Man’s Snitch. Simply add a PagerDuty configuration to the default deadmansswitch receiver. Use the process described above to add this configuration.

Configure Dead Man’s Snitch to page the operator if the Dead man’s switch alert is silent for 15 minutes. With the default Alertmanager configuration, the Dead man’s switch alert is repeated every five minutes. If Dead Man’s Snitch triggers after 15 minutes, it indicates that the notification has been unsuccessful at least twice.

Learn how to configure Dead Man’s Snitch for PagerDuty.
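
As a sketch of the two points above, the first part below writes a receivers fragment in which the deadmansswitch receiver gets a PagerDuty configuration, mirroring the team-frontend-page receiver shown earlier in this guide. The second part checks the timing. The scratch file name and the variable names are assumptions, and <key> is a placeholder for your own service key.

```shell
# Receivers fragment: only the deadmansswitch receiver changes.
cat > deadmansswitch-receiver.yaml <<'EOF'
receivers:
- name: default
- name: deadmansswitch
  pagerduty_configs:
  - service_key: "<key>"
EOF

# Timing check: how many repeats fit inside the silence window?
repeat_interval_min=5    # repeat_interval of the DeadMansSwitch route
silence_window_min=15    # silence tolerated before Dead Man's Snitch pages
repeats_in_window=$((silence_window_min / repeat_interval_min))

# The repeat due at the very end of the window might still be in flight,
# so count only the repeats that fall strictly inside the window.
missed_at_least=$((repeats_in_window - 1))

echo "Notifications missed before paging: at least ${missed_at_least}"
```

A shorter silence window pages sooner but tolerates fewer missed notifications, so keep it at a multiple of the repeat interval.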

Alerting rules

OKD Cluster Monitoring ships with the following alerting rules configured by default. Currently, you cannot add custom alerting rules.

Some alerting rules have identical names. This is intentional. They are alerting about the same event with different thresholds, with different severity, or both. With the inhibition rules, the lower severity is inhibited when the higher severity is firing.

For more details on the alerting rules, see the configuration file.
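
For orientation, an Alertmanager inhibition rule that mutes a warning while a critical alert with the same name is firing could look roughly like this. It is a sketch of the general mechanism, not the exact rule that OKD ships; the label listed under equal and the scratch file name are assumptions.

```shell
# Write an illustrative inhibition rule to a scratch file (hypothetical name).
cat > inhibit-rule-sketch.yaml <<'EOF'
inhibit_rules:
- source_match:
    severity: critical   # the alert that does the inhibiting
  target_match:
    severity: warning    # the alert that is muted
  equal:
  - alertname            # inhibit only when both alerts share this label value
EOF

# Print the sketch for review.
cat inhibit-rule-sketch.yaml
```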

Alert | Severity | Description
ClusterMonitoringOperatorErrors | critical | Cluster Monitoring Operator is experiencing X% errors.
AlertmanagerDown | critical | Alertmanager has disappeared from Prometheus target discovery.
ClusterMonitoringOperatorDown | critical | ClusterMonitoringOperator has disappeared from Prometheus target discovery.
KubeAPIDown | critical | KubeAPI has disappeared from Prometheus target discovery.
KubeControllerManagerDown | critical | KubeControllerManager has disappeared from Prometheus target discovery.
KubeSchedulerDown | critical | KubeScheduler has disappeared from Prometheus target discovery.
KubeStateMetricsDown | critical | KubeStateMetrics has disappeared from Prometheus target discovery.
KubeletDown | critical | Kubelet has disappeared from Prometheus target discovery.
NodeExporterDown | critical | NodeExporter has disappeared from Prometheus target discovery.
PrometheusDown | critical | Prometheus has disappeared from Prometheus target discovery.
PrometheusOperatorDown | critical | PrometheusOperator has disappeared from Prometheus target discovery.
KubePodCrashLooping | critical | Namespace/Pod (Container) is restarting X times / second.
KubePodNotReady | critical | Namespace/Pod is not ready.
KubeDeploymentGenerationMismatch | critical | Deployment Namespace/Deployment generation mismatch.
KubeDeploymentReplicasMismatch | critical | Deployment Namespace/Deployment replica mismatch.
KubeStatefulSetReplicasMismatch | critical | StatefulSet Namespace/StatefulSet replica mismatch.
KubeStatefulSetGenerationMismatch | critical | StatefulSet Namespace/StatefulSet generation mismatch.
KubeDaemonSetRolloutStuck | critical | Only X% of desired pods scheduled and ready for daemon set Namespace/DaemonSet.
KubeDaemonSetNotScheduled | warning | A number of pods of daemonset Namespace/DaemonSet are not scheduled.
KubeDaemonSetMisScheduled | warning | A number of pods of daemonset Namespace/DaemonSet are running where they are not supposed to run.
KubeCronJobRunning | warning | CronJob Namespace/CronJob is taking more than 1h to complete.
KubeJobCompletion | warning | Job Namespace/Job is taking more than 1h to complete.
KubeJobFailed | warning | Job Namespace/Job failed to complete.
KubeCPUOvercommit | warning | Overcommitted CPU resource requests on Pods, cannot tolerate node failure.
KubeMemOvercommit | warning | Overcommitted Memory resource requests on Pods, cannot tolerate node failure.
KubeCPUOvercommit | warning | Overcommitted CPU resource request quota on Namespaces.
KubeMemOvercommit | warning | Overcommitted Memory resource request quota on Namespaces.
KubeQuotaExceeded | warning | X% usage of Resource in namespace Namespace.
KubePersistentVolumeUsageCritical | critical | The persistent volume claimed by PersistentVolumeClaim in namespace Namespace has X% free.
KubePersistentVolumeFullInFourDays | critical | Based on recent sampling, the persistent volume claimed by PersistentVolumeClaim in namespace Namespace is expected to fill up within four days. Currently X bytes are available.
KubeNodeNotReady | warning | Node has been unready for more than an hour.
KubeVersionMismatch | warning | There are X different versions of Kubernetes components running.
KubeClientErrors | warning | Kubernetes API server client ‘Job/Instance’ is experiencing X% errors.
KubeClientErrors | warning | Kubernetes API server client ‘Job/Instance’ is experiencing X errors / sec.
KubeletTooManyPods | warning | Kubelet Instance is running X pods, close to the limit of 110.
KubeAPILatencyHigh | warning | The API server has a 99th percentile latency of X seconds for Verb Resource.
KubeAPILatencyHigh | critical | The API server has a 99th percentile latency of X seconds for Verb Resource.
KubeAPIErrorsHigh | critical | API server is erroring for X% of requests.
KubeAPIErrorsHigh | warning | API server is erroring for X% of requests.
KubeClientCertificateExpiration | warning | Kubernetes API certificate is expiring in less than 7 days.
KubeClientCertificateExpiration | critical | Kubernetes API certificate is expiring in less than 1 day.
AlertmanagerConfigInconsistent | critical | Summary: Configuration out of sync. Description: The configuration of the instances of the Alertmanager cluster Service are out of sync.
AlertmanagerFailedReload | warning | Summary: Alertmanager’s configuration reload failed. Description: Reloading Alertmanager’s configuration has failed for Namespace/Pod.
TargetDown | warning | Summary: Targets are down. Description: X% of Job targets are down.
DeadMansSwitch | none | Summary: Alerting DeadMansSwitch. Description: This is a DeadMansSwitch meant to ensure that the entire Alerting pipeline is functional.
NodeDiskRunningFull | warning | Device Device of node-exporter Namespace/Pod is running full within the next 24 hours.
NodeDiskRunningFull | critical | Device Device of node-exporter Namespace/Pod is running full within the next 2 hours.
PrometheusConfigReloadFailed | warning | Summary: Reloading Prometheus’ configuration failed. Description: Reloading Prometheus’ configuration has failed for Namespace/Pod.
PrometheusNotificationQueueRunningFull | warning | Summary: Prometheus’ alert notification queue is running full. Description: Prometheus’ alert notification queue is running full for Namespace/Pod.
PrometheusErrorSendingAlerts | warning | Summary: Errors while sending alerts from Prometheus. Description: Errors while sending alerts from Prometheus Namespace/Pod to Alertmanager Alertmanager.
PrometheusErrorSendingAlerts | critical | Summary: Errors while sending alerts from Prometheus. Description: Errors while sending alerts from Prometheus Namespace/Pod to Alertmanager Alertmanager.
PrometheusNotConnectedToAlertmanagers | warning | Summary: Prometheus is not connected to any Alertmanagers. Description: Prometheus Namespace/Pod is not connected to any Alertmanagers.
PrometheusTSDBReloadsFailing | warning | Summary: Prometheus has issues reloading data blocks from disk. Description: Job at Instance had X reload failures over the last four hours.
PrometheusTSDBCompactionsFailing | warning | Summary: Prometheus has issues compacting sample blocks. Description: Job at Instance had X compaction failures over the last four hours.
PrometheusTSDBWALCorruptions | warning | Summary: Prometheus write-ahead log is corrupted. Description: Job at Instance has a corrupted write-ahead log (WAL).
PrometheusNotIngestingSamples | warning | Summary: Prometheus isn’t ingesting samples. Description: Prometheus Namespace/Pod isn’t ingesting samples.
PrometheusTargetScrapesDuplicate | warning | Summary: Prometheus has many samples rejected. Description: Namespace/Pod has many samples rejected due to duplicate timestamps but different values.
EtcdInsufficientMembers | critical | Etcd cluster “Job”: insufficient members (X).
EtcdNoLeader | critical | Etcd cluster “Job”: member Instance has no leader.
EtcdHighNumberOfLeaderChanges | warning | Etcd cluster “Job”: instance Instance has seen X leader changes within the last hour.
EtcdHighNumberOfFailedGRPCRequests | warning | Etcd cluster “Job”: X% of requests for GRPC_Method failed on etcd instance Instance.
EtcdHighNumberOfFailedGRPCRequests | critical | Etcd cluster “Job”: X% of requests for GRPC_Method failed on etcd instance Instance.
EtcdGRPCRequestsSlow | critical | Etcd cluster “Job”: gRPC requests to GRPC_Method are taking Xs on etcd instance Instance.
EtcdMemberCommunicationSlow | warning | Etcd cluster “Job”: member communication with To is taking Xs on etcd instance Instance.
EtcdHighNumberOfFailedProposals | warning | Etcd cluster “Job”: X proposal failures within the last hour on etcd instance Instance.
EtcdHighFsyncDurations | warning | Etcd cluster “Job”: 99th percentile fsync durations are Xs on etcd instance Instance.
EtcdHighCommitDurations | warning | Etcd cluster “Job”: 99th percentile commit durations are Xs on etcd instance Instance.
FdExhaustionClose | warning | Job instance Instance will exhaust its file descriptors soon.
FdExhaustionClose | critical | Job instance Instance will exhaust its file descriptors soon.

Configuring etcd monitoring

If the etcd service does not run correctly, the operation of the whole OKD cluster is at risk. It is therefore advisable to configure monitoring of etcd.

Follow these steps to configure etcd monitoring:

Procedure

  1. Verify that the monitoring stack is running:

    $ oc -n openshift-monitoring get pods
    NAME                                           READY   STATUS              RESTARTS   AGE
    alertmanager-main-0                            3/3     Running             0          34m
    alertmanager-main-1                            3/3     Running             0          33m
    alertmanager-main-2                            3/3     Running             0          33m
    cluster-monitoring-operator-67b8797d79-sphxj   1/1     Running             0          36m
    grafana-c66997f-pxrf7                          2/2     Running             0          37s
    kube-state-metrics-7449d589bc-rt4mq            3/3     Running             0          33m
    node-exporter-5tt4f                            2/2     Running             0          33m
    node-exporter-b2mrp                            2/2     Running             0          33m
    node-exporter-fd52p                            2/2     Running             0          33m
    node-exporter-hfqgv                            2/2     Running             0          33m
    prometheus-k8s-0                               4/4     Running             1          35m
    prometheus-k8s-1                               0/4     ContainerCreating   0          21s
    prometheus-operator-6c9fddd47f-9jfgk           1/1     Running             0          36m

  2. Open the configuration file for the cluster monitoring stack:

    $ oc -n openshift-monitoring edit configmap cluster-monitoring-config

  3. Under config.yaml: |+, add the etcd section.

    1. If you run etcd in static pods on your master nodes, you can specify the etcd nodes using the selector:

      ...
      data:
        config.yaml: |+
          ...
          etcd:
            targets:
              selector:
                openshift.io/component: etcd
                openshift.io/control-plane: "true"

    2. If you run etcd on separate hosts, you need to specify the nodes using IP addresses:

      ...
      data:
        config.yaml: |+
          ...
          etcd:
            targets:
              ips:
              - "127.0.0.1"
              - "127.0.0.2"
              - "127.0.0.3"

      If the IP addresses for etcd nodes change, you must update this list.

  4. Verify that the etcd service monitor is now running:

    $ oc -n openshift-monitoring get servicemonitor
    NAME                  AGE
    alertmanager          35m
    etcd                  1m (1)
    kube-apiserver        36m
    kube-controllers      36m
    kube-state-metrics    34m
    kubelet               36m
    node-exporter         34m
    prometheus            36m
    prometheus-operator   37m

    (1) The etcd service monitor.

    It might take up to a minute for the etcd service monitor to start.

  5. Now you can navigate to the web interface to see more information about the status of etcd monitoring.

    1. To get the URL, run:

      $ oc -n openshift-monitoring get routes
      NAME             HOST/PORT                                                                        PATH   SERVICES         PORT   TERMINATION   WILDCARD
      ...
      prometheus-k8s   prometheus-k8s-openshift-monitoring.apps.msvistun.origin-gce.dev.openshift.com          prometheus-k8s   web    reencrypt     None

    2. Using https, navigate to the URL listed for prometheus-k8s. Log in.

  6. Ensure the user belongs to the cluster-monitoring-view role. This role provides access to viewing cluster monitoring UIs.

    For example, to add user developer to the cluster-monitoring-view role, run:

    $ oc adm policy add-cluster-role-to-user cluster-monitoring-view developer

  7. In the web interface, log in as the user belonging to the cluster-monitoring-view role.

  8. Click Status, then Targets. If you see an etcd entry, etcd is being monitored.

    [Figure: etcd no certificate]

  9. While etcd is now being monitored, Prometheus is not yet able to authenticate against etcd, and so cannot gather metrics.

    To configure Prometheus authentication against etcd:

    1. Copy the /etc/etcd/ca/ca.crt and /etc/etcd/ca/ca.key credentials files from the master node to the local machine:

      $ ssh -i gcp-dev/ssh-privatekey cloud-user@35.237.54.213

    2. Create the openssl.cnf file with these contents:

      [ req ]
      req_extensions = v3_req
      distinguished_name = req_distinguished_name
      [ req_distinguished_name ]
      [ v3_req ]
      basicConstraints = CA:FALSE
      keyUsage = nonRepudiation, keyEncipherment, digitalSignature
      extendedKeyUsage=serverAuth, clientAuth

    3. Generate the etcd.key private key file:

      $ openssl genrsa -out etcd.key 2048

    4. Generate the etcd.csr certificate signing request file:

      $ openssl req -new -key etcd.key -out etcd.csr -subj "/CN=etcd" -config openssl.cnf

    5. Generate the etcd.crt certificate file:

      $ openssl x509 -req -in etcd.csr -CA ca.crt -CAkey ca.key -CAcreateserial -out etcd.crt -days 365 -extensions v3_req -extfile openssl.cnf

    6. Put the credentials into the format used by OKD:

      $ cat <<-EOF > etcd-cert-secret.yaml
      apiVersion: v1
      data:
        etcd-client-ca.crt: "$(cat ca.crt | base64 --wrap=0)"
        etcd-client.crt: "$(cat etcd.crt | base64 --wrap=0)"
        etcd-client.key: "$(cat etcd.key | base64 --wrap=0)"
      kind: Secret
      metadata:
        name: kube-etcd-client-certs
        namespace: openshift-monitoring
      type: Opaque
      EOF

      This creates the etcd-cert-secret.yaml file.

    7. Apply the credentials file to the cluster:

      $ oc apply -f etcd-cert-secret.yaml

  10. Now that you have configured authentication, visit the Targets page of the web interface again. Verify that etcd is now being correctly monitored. It might take several minutes for changes to take effect.

    [Figure: etcd monitoring working]

  11. If you want etcd monitoring to be automatically updated when you update OKD, set this variable in the Ansible inventory file to true:

    openshift_cluster_monitoring_operator_etcd_enabled=true

    If you run etcd on separate hosts, specify the nodes by IP addresses using this Ansible variable:

    openshift_cluster_monitoring_operator_etcd_hosts=[<address1>, <address2>, ...]

    If the IP addresses of the etcd nodes change, you must update this list.

Accessing Prometheus, Alertmanager, and Grafana

OKD Monitoring ships with a Prometheus instance for cluster monitoring and a central Alertmanager cluster. In addition to Prometheus and Alertmanager, OKD Monitoring also includes a Grafana instance as well as pre-built dashboards for cluster monitoring troubleshooting. The Grafana instance that is provided with the monitoring stack, along with its dashboards, is read-only.

To get the addresses for accessing Prometheus, Alertmanager, and Grafana web UIs:

Procedure

  1. Run the following command:

    $ oc -n openshift-monitoring get routes
    NAME                HOST/PORT
    alertmanager-main   alertmanager-main-openshift-monitoring.apps._url_.openshift.com
    grafana             grafana-openshift-monitoring.apps._url_.openshift.com
    prometheus-k8s      prometheus-k8s-openshift-monitoring.apps._url_.openshift.com

    Make sure to prepend https:// to these addresses. You cannot access web UIs using unencrypted connections.

  2. Authentication is performed against the OKD identity and uses the same credentials or means of authentication as is used elsewhere in OKD. You must use a role that has read access to all namespaces, such as the cluster-monitoring-view cluster role.