Setting up Prometheus and Grafana to monitor Longhorn

Overview

Longhorn natively exposes metrics in Prometheus text format on a REST endpoint http://LONGHORN_MANAGER_IP:PORT/metrics. See Longhorn’s metrics for descriptions of all available metrics. You can use any collecting tool, such as Prometheus, Graphite, or Telegraf, to scrape these metrics, then visualize the collected data with a tool such as Grafana.
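To eyeball the raw metrics before wiring up Prometheus, you can port-forward one Longhorn manager pod and fetch the endpoint with curl. This is a quick sketch; port 9500 is the Longhorn manager's default listening port, so adjust it if your installation differs:

```shell
# Grab the name of one Longhorn manager pod.
POD=$(kubectl -n longhorn-system get pods -l app=longhorn-manager -o name | head -n 1)

# Forward the manager port to localhost (9500 is the default manager port).
kubectl -n longhorn-system port-forward "$POD" 9500:9500 &

# Fetch the metrics in Prometheus text format.
curl -s http://localhost:9500/metrics | head
```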

This document presents an example setup to monitor Longhorn. The monitoring system uses Prometheus for collecting data and alerting, and Grafana for visualizing and dashboarding the collected data. At a high level, the monitoring system contains:

  • Prometheus server, which scrapes and stores time series data from the Longhorn metrics endpoints. Prometheus is also responsible for generating alerts based on configured rules and the collected data. Prometheus servers then send alerts to an Alertmanager.
  • Alertmanager, which manages those alerts, including silencing, inhibition, aggregation, and sending out notifications via methods such as email, on-call notification systems, and chat platforms.
  • Grafana, which queries the Prometheus server for data and draws a dashboard for visualization.

The picture below describes the detailed architecture of the monitoring system.

[architecture diagram]

There are two components in the above picture that have not been mentioned yet:

  • The Longhorn Backend service is a service pointing to the set of Longhorn manager pods. Longhorn’s metrics are exposed in the Longhorn manager pods at the endpoint http://LONGHORN_MANAGER_IP:PORT/metrics.
  • The Prometheus Operator makes running Prometheus on top of Kubernetes very easy. The operator watches 3 custom resources: ServiceMonitor, Prometheus, and Alertmanager. When users create those custom resources, the Prometheus Operator deploys and manages the Prometheus server and Alertmanager with the user-specified configurations.

Installation

Following these instructions will install all components into the monitoring namespace. To install them into a different namespace, change the field namespace: OTHER_NAMESPACE in each manifest.

Create monitoring namespace

  apiVersion: v1
  kind: Namespace
  metadata:
    name: monitoring
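The manifest can be saved to a file and applied with kubectl; the filename monitoring-namespace.yaml below is just an example:

```shell
# Apply the namespace manifest.
kubectl apply -f monitoring-namespace.yaml

# Confirm the namespace was created.
kubectl get namespace monitoring
```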

Install Prometheus Operator

Deploy Prometheus Operator and its required ClusterRole, ClusterRoleBinding, and Service Account.

  apiVersion: rbac.authorization.k8s.io/v1
  kind: ClusterRoleBinding
  metadata:
    labels:
      app.kubernetes.io/component: controller
      app.kubernetes.io/name: prometheus-operator
      app.kubernetes.io/version: v0.38.3
    name: prometheus-operator
    namespace: monitoring
  roleRef:
    apiGroup: rbac.authorization.k8s.io
    kind: ClusterRole
    name: prometheus-operator
  subjects:
  - kind: ServiceAccount
    name: prometheus-operator
    namespace: monitoring
  ---
  apiVersion: rbac.authorization.k8s.io/v1
  kind: ClusterRole
  metadata:
    labels:
      app.kubernetes.io/component: controller
      app.kubernetes.io/name: prometheus-operator
      app.kubernetes.io/version: v0.38.3
    name: prometheus-operator
    namespace: monitoring
  rules:
  - apiGroups:
    - apiextensions.k8s.io
    resources:
    - customresourcedefinitions
    verbs:
    - create
  - apiGroups:
    - apiextensions.k8s.io
    resourceNames:
    - alertmanagers.monitoring.coreos.com
    - podmonitors.monitoring.coreos.com
    - prometheuses.monitoring.coreos.com
    - prometheusrules.monitoring.coreos.com
    - servicemonitors.monitoring.coreos.com
    - thanosrulers.monitoring.coreos.com
    resources:
    - customresourcedefinitions
    verbs:
    - get
    - update
  - apiGroups:
    - monitoring.coreos.com
    resources:
    - alertmanagers
    - alertmanagers/finalizers
    - prometheuses
    - prometheuses/finalizers
    - thanosrulers
    - thanosrulers/finalizers
    - servicemonitors
    - podmonitors
    - prometheusrules
    verbs:
    - '*'
  - apiGroups:
    - apps
    resources:
    - statefulsets
    verbs:
    - '*'
  - apiGroups:
    - ""
    resources:
    - configmaps
    - secrets
    verbs:
    - '*'
  - apiGroups:
    - ""
    resources:
    - pods
    verbs:
    - list
    - delete
  - apiGroups:
    - ""
    resources:
    - services
    - services/finalizers
    - endpoints
    verbs:
    - get
    - create
    - update
    - delete
  - apiGroups:
    - ""
    resources:
    - nodes
    verbs:
    - list
    - watch
  - apiGroups:
    - ""
    resources:
    - namespaces
    verbs:
    - get
    - list
    - watch
  ---
  apiVersion: apps/v1
  kind: Deployment
  metadata:
    labels:
      app.kubernetes.io/component: controller
      app.kubernetes.io/name: prometheus-operator
      app.kubernetes.io/version: v0.38.3
    name: prometheus-operator
    namespace: monitoring
  spec:
    replicas: 1
    selector:
      matchLabels:
        app.kubernetes.io/component: controller
        app.kubernetes.io/name: prometheus-operator
    template:
      metadata:
        labels:
          app.kubernetes.io/component: controller
          app.kubernetes.io/name: prometheus-operator
          app.kubernetes.io/version: v0.38.3
      spec:
        containers:
        - args:
          - --kubelet-service=kube-system/kubelet
          - --logtostderr=true
          - --config-reloader-image=jimmidyson/configmap-reload:v0.3.0
          - --prometheus-config-reloader=quay.io/prometheus-operator/prometheus-config-reloader:v0.38.3
          image: quay.io/prometheus-operator/prometheus-operator:v0.38.3
          name: prometheus-operator
          ports:
          - containerPort: 8080
            name: http
          resources:
            limits:
              cpu: 200m
              memory: 200Mi
            requests:
              cpu: 100m
              memory: 100Mi
          securityContext:
            allowPrivilegeEscalation: false
        nodeSelector:
          beta.kubernetes.io/os: linux
        securityContext:
          runAsNonRoot: true
          runAsUser: 65534
        serviceAccountName: prometheus-operator
  ---
  apiVersion: v1
  kind: ServiceAccount
  metadata:
    labels:
      app.kubernetes.io/component: controller
      app.kubernetes.io/name: prometheus-operator
      app.kubernetes.io/version: v0.38.3
    name: prometheus-operator
    namespace: monitoring
  ---
  apiVersion: v1
  kind: Service
  metadata:
    labels:
      app.kubernetes.io/component: controller
      app.kubernetes.io/name: prometheus-operator
      app.kubernetes.io/version: v0.38.3
    name: prometheus-operator
    namespace: monitoring
  spec:
    clusterIP: None
    ports:
    - name: http
      port: 8080
      targetPort: http
    selector:
      app.kubernetes.io/component: controller
      app.kubernetes.io/name: prometheus-operator

Install Longhorn ServiceMonitor

The Longhorn ServiceMonitor has a label selector app: longhorn-manager to select the Longhorn backend service. Later on, the Prometheus custom resource can include the Longhorn ServiceMonitor so that the Prometheus server can discover all Longhorn manager pods and their endpoints.

  apiVersion: monitoring.coreos.com/v1
  kind: ServiceMonitor
  metadata:
    name: longhorn-prometheus-servicemonitor
    namespace: monitoring
    labels:
      name: longhorn-prometheus-servicemonitor
  spec:
    selector:
      matchLabels:
        app: longhorn-manager
    namespaceSelector:
      matchNames:
      - longhorn-system
    endpoints:
    - port: manager
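After applying the manifest, a quick way to confirm the selector will match anything is to check that the ServiceMonitor exists and that a service in longhorn-system carries the app: longhorn-manager label:

```shell
# The ServiceMonitor created above.
kubectl -n monitoring get servicemonitor longhorn-prometheus-servicemonitor

# The Longhorn backend service the label selector should match.
kubectl -n longhorn-system get services -l app=longhorn-manager
```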

Install and configure Prometheus AlertManager

  1. Create a highly available Alertmanager deployment with 3 instances:

    apiVersion: monitoring.coreos.com/v1
    kind: Alertmanager
    metadata:
      name: longhorn
      namespace: monitoring
    spec:
      replicas: 3
  2. The Alertmanager instances will not start up unless they are given a valid configuration. See the Alertmanager configuration documentation for more details. The following is an example configuration:

    global:
      resolve_timeout: 5m
    route:
      group_by: [alertname]
      receiver: email_and_slack
    receivers:
    - name: email_and_slack
      email_configs:
      - to: <the email address to send notifications to>
        from: <the sender address>
        smarthost: <the SMTP host through which emails are sent>
        # SMTP authentication information.
        auth_username: <the username>
        auth_identity: <the identity>
        auth_password: <the password>
        headers:
          subject: 'Longhorn-Alert'
        text: |-
          {{ range .Alerts }}
          *Alert:* {{ .Annotations.summary }} - `{{ .Labels.severity }}`
          *Description:* {{ .Annotations.description }}
          *Details:*
          {{ range .Labels.SortedPairs }} *{{ .Name }}:* `{{ .Value }}`
          {{ end }}
          {{ end }}
      slack_configs:
      - api_url: <the Slack webhook URL>
        channel: <the channel or user to send notifications to>
        text: |-
          {{ range .Alerts }}
          *Alert:* {{ .Annotations.summary }} - `{{ .Labels.severity }}`
          *Description:* {{ .Annotations.description }}
          *Details:*
          {{ range .Labels.SortedPairs }} *{{ .Name }}:* `{{ .Value }}`
          {{ end }}
          {{ end }}

    Save the above Alertmanager config in a file called alertmanager.yaml and create a secret from it using kubectl.

    Alertmanager instances require the secret resource naming to follow the format alertmanager-{ALERTMANAGER_NAME}. In the previous step, the name of the Alertmanager is longhorn, so the secret name must be alertmanager-longhorn:

    $ kubectl create secret generic alertmanager-longhorn --from-file=alertmanager.yaml -n monitoring
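To verify the secret and the resulting pods, you can run the checks below; the alertmanager=longhorn pod label follows the operator's usual naming convention and is an assumption here:

```shell
# The secret the Alertmanager instances read their config from.
kubectl -n monitoring get secret alertmanager-longhorn

# With a valid config, the operator should bring up 3 Alertmanager pods.
kubectl -n monitoring get pods -l alertmanager=longhorn
```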
  3. To be able to view the web UI of the Alertmanager, expose it through a Service. A simple way to do this is to use a Service of type NodePort:

    apiVersion: v1
    kind: Service
    metadata:
      name: alertmanager-longhorn
      namespace: monitoring
    spec:
      type: NodePort
      ports:
      - name: web
        nodePort: 30903
        port: 9093
        protocol: TCP
        targetPort: web
      selector:
        alertmanager: longhorn

    After creating the above service, you can access the web UI of Alertmanager via a Node’s IP and the port 30903.

    Use the above NodePort service for quick verification only, because it doesn’t communicate over TLS. You may want to change the service type to ClusterIP and set up an Ingress controller to expose the web UI of Alertmanager over a TLS connection.

Install and configure Prometheus server

  1. Create a PrometheusRule custom resource that defines the alert conditions. See more examples of Longhorn alert rules at Longhorn Alert Rule Examples.

    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      labels:
        prometheus: longhorn
        role: alert-rules
      name: prometheus-longhorn-rules
      namespace: monitoring
    spec:
      groups:
      - name: longhorn.rules
        rules:
        - alert: LonghornVolumeUsageCritical
          annotations:
            description: Longhorn volume {{$labels.volume}} on {{$labels.node}} is at {{$value}}% used
              for more than 5 minutes.
            summary: Longhorn volume capacity is over 90% used.
          expr: 100 * (longhorn_volume_usage_bytes / longhorn_volume_capacity_bytes) > 90
          for: 5m
          labels:
            issue: Longhorn volume {{$labels.volume}} usage on {{$labels.node}} is critical.
            severity: critical

    For more information on how to define alert rules, see the Prometheus documentation on alerting rules.
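    If you have promtool (shipped with Prometheus) available, you can lint the rule group before wrapping it in the PrometheusRule resource. Save only the contents of spec.groups to a plain rule file, with groups: as the top-level key, and run:

```shell
# longhorn-rules.yaml holds the plain rule-file form, i.e. starting at:
# groups:
# - name: longhorn.rules
#   rules:
#   ...
promtool check rules longhorn-rules.yaml
```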

  2. If RBAC authorization is enabled, create a ServiceAccount, ClusterRole, and ClusterRoleBinding for the Prometheus pods:

    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: prometheus
      namespace: monitoring
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRole
    metadata:
      name: prometheus
    rules:
    - apiGroups: [""]
      resources:
      - nodes
      - services
      - endpoints
      - pods
      verbs: ["get", "list", "watch"]
    - apiGroups: [""]
      resources:
      - configmaps
      verbs: ["get"]
    - nonResourceURLs: ["/metrics"]
      verbs: ["get"]
    ---
    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: prometheus
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: prometheus
    subjects:
    - kind: ServiceAccount
      name: prometheus
      namespace: monitoring
  3. Create a Prometheus custom resource. Notice that we select the Longhorn service monitor and Longhorn rules in the spec.

    apiVersion: monitoring.coreos.com/v1
    kind: Prometheus
    metadata:
      name: prometheus
      namespace: monitoring
    spec:
      replicas: 2
      serviceAccountName: prometheus
      alerting:
        alertmanagers:
        - namespace: monitoring
          name: alertmanager-longhorn
          port: web
      serviceMonitorSelector:
        matchLabels:
          name: longhorn-prometheus-servicemonitor
      ruleSelector:
        matchLabels:
          prometheus: longhorn
          role: alert-rules
  4. To be able to view the web UI of the Prometheus server, expose it through a Service. A simple way to do this is to use a Service of type NodePort:

    apiVersion: v1
    kind: Service
    metadata:
      name: prometheus
      namespace: monitoring
    spec:
      type: NodePort
      ports:
      - name: web
        nodePort: 30904
        port: 9090
        protocol: TCP
        targetPort: web
      selector:
        prometheus: prometheus

    After creating the above service, you can access the web UI of Prometheus server via a Node’s IP and the port 30904.

    At this point, you should be able to see all Longhorn manager targets as well as Longhorn rules in the targets and rules section of the Prometheus server UI.

    Use the above NodePort service for quick verification only, because it doesn’t communicate over TLS. You may want to change the service type to ClusterIP and set up an Ingress controller to expose the web UI of the Prometheus server over a TLS connection.
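    As a sketch of the ClusterIP-plus-Ingress alternative, assuming an NGINX ingress controller, a DNS name prometheus.example.com, and a pre-created TLS secret prometheus-tls (all three are illustrative assumptions):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: prometheus
  namespace: monitoring
spec:
  ingressClassName: nginx          # assumes an NGINX ingress controller
  tls:
  - hosts:
    - prometheus.example.com       # hypothetical DNS name
    secretName: prometheus-tls     # TLS secret you must create beforehand
  rules:
  - host: prometheus.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: prometheus       # the Service above, changed to type ClusterIP
            port:
              name: web
```

    A similar manifest can expose the Alertmanager and Grafana services.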

Install Grafana

  1. Create Grafana datasource config:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: grafana-datasources
      namespace: monitoring
    data:
      prometheus.yaml: |-
        {
          "apiVersion": 1,
          "datasources": [
            {
              "access": "proxy",
              "editable": true,
              "name": "prometheus",
              "orgId": 1,
              "type": "prometheus",
              "url": "http://prometheus:9090",
              "version": 1
            }
          ]
        }
  2. Create Grafana deployment:

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: grafana
      namespace: monitoring
      labels:
        app: grafana
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: grafana
      template:
        metadata:
          name: grafana
          labels:
            app: grafana
        spec:
          containers:
          - name: grafana
            image: grafana/grafana:7.1.5
            ports:
            - name: grafana
              containerPort: 3000
            resources:
              limits:
                memory: "500Mi"
                cpu: "300m"
              requests:
                memory: "500Mi"
                cpu: "200m"
            volumeMounts:
            - mountPath: /var/lib/grafana
              name: grafana-storage
            - mountPath: /etc/grafana/provisioning/datasources
              name: grafana-datasources
              readOnly: false
          volumes:
          - name: grafana-storage
            emptyDir: {}
          - name: grafana-datasources
            configMap:
              defaultMode: 420
              name: grafana-datasources
  3. Expose Grafana on NodePort 32000:

    apiVersion: v1
    kind: Service
    metadata:
      name: grafana
      namespace: monitoring
    spec:
      selector:
        app: grafana
      type: NodePort
      ports:
      - port: 3000
        targetPort: 3000
        nodePort: 32000

    Use the above NodePort service for quick verification only, because it doesn’t communicate over TLS. You may want to change the service type to ClusterIP and set up an Ingress controller to expose Grafana over a TLS connection.

  4. Access the Grafana dashboard using any node IP on port 32000. The default credentials are:

    User: admin
    Password: admin
  5. Set up the Longhorn dashboard

    Once inside Grafana, import the prebuilt Longhorn example dashboard.

    See https://grafana.com/docs/grafana/latest/reference/export_import/ for instructions on how to import a Grafana dashboard.

    You should see the Longhorn dashboard upon successful import.
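    If you would rather build a panel by hand instead of importing the dashboard, the expression from the alert rule earlier in this document also works as a Grafana panel query against the prometheus datasource, showing each volume's usage as a percentage of its capacity:

```
100 * (longhorn_volume_usage_bytes / longhorn_volume_capacity_bytes)
```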