Integrate OSM with your own Prometheus and Grafana stack

The following article shows you how to create an example bring-your-own (BYO) Prometheus and Grafana stack on your cluster and configure that stack for observability and monitoring of OSM. For an example that uses an automatically provisioned Prometheus and Grafana stack with OSM, see the Observability getting started guide.

IMPORTANT: The configuration created in this article should not be used in production environments. For production-grade deployments, see Prometheus Operator and Deploy Grafana in Kubernetes.

Prerequisites

  • Kubernetes cluster running Kubernetes v1.20.0 or greater.
  • OSM installed on the Kubernetes cluster.
  • kubectl installed and access to the cluster’s API server.
  • osm CLI installed.
  • helm CLI installed.
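
You can sanity-check these prerequisites from the command line before continuing. A minimal sketch, assuming OSM was installed into its default osm-system namespace:

  kubectl version --short          # server version should be v1.20.0 or greater
  osm version                      # confirms the osm CLI is installed
  helm version --short             # confirms the helm CLI is installed
  kubectl get pods -n osm-system   # OSM control plane pods should be Running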

Deploy an example Prometheus instance

Use helm to deploy a Prometheus instance to your cluster in the default namespace.

  helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
  helm repo update
  helm install stable prometheus-community/prometheus

The output of the helm install command contains the DNS name of the Prometheus server. For example:

  ...
  The Prometheus server can be accessed via port 80 on the following DNS name from within your cluster:
  stable-prometheus-server.default.svc.cluster.local
  ...

Record this DNS name for use in a later step.
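
If the installation output has scrolled away, the same DNS name can be derived from the service the chart created, since in-cluster names follow the <service>.<namespace>.svc.cluster.local pattern:

  # The chart names the service after the release; its in-cluster DNS name is
  # <service-name>.<namespace>.svc.cluster.local
  kubectl get svc stable-prometheus-server --namespace default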

Configure Prometheus for OSM

Prometheus needs to be configured to scrape the OSM endpoints and properly handle OSM’s labeling, relabeling, and endpoint configuration. This configuration also helps the OSM Grafana dashboards, which are configured in a later step, properly display the data scraped from OSM.
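
Prometheus discovers the mesh workloads through the prometheus.io/* scrape annotations that OSM adds to pods once metrics are enabled for their namespace. If your application namespaces are not yet emitting metrics, enable them first. A minimal sketch, where <app-namespace> is a placeholder for a namespace that is already part of the mesh:

  # Enable Envoy sidecar metrics for a namespace in the mesh
  osm metrics enable --namespace <app-namespace>

  # Inspect the prometheus.io/* annotations OSM added to a pod in that namespace
  kubectl get pods -n <app-namespace> -o jsonpath='{.items[0].metadata.annotations}'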

Use kubectl get configmap to verify that the stable-prometheus-server configmap has been created. For example:

  $ kubectl get configmap
  NAME                             DATA   AGE
  ...
  stable-prometheus-alertmanager   1      18m
  stable-prometheus-server         5      18m
  ...

Create update-prometheus-configmap.yaml with the following:

  apiVersion: v1
  kind: ConfigMap
  metadata:
    name: stable-prometheus-server
  data:
    prometheus.yml: |
      global:
        scrape_interval: 10s
        scrape_timeout: 10s
        evaluation_interval: 1m

      scrape_configs:
        - job_name: 'kubernetes-apiservers'
          kubernetes_sd_configs:
            - role: endpoints
          scheme: https
          tls_config:
            ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
            # TODO need to remove this when the CA and SAN match
            insecure_skip_verify: true
          bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
          metric_relabel_configs:
            - source_labels: [__name__]
              regex: '(apiserver_watch_events_total|apiserver_admission_webhook_rejection_count)'
              action: keep
          relabel_configs:
            - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
              action: keep
              regex: default;kubernetes;https

        - job_name: 'kubernetes-nodes'
          scheme: https
          tls_config:
            ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
          bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
          kubernetes_sd_configs:
            - role: node
          relabel_configs:
            - action: labelmap
              regex: __meta_kubernetes_node_label_(.+)
            - target_label: __address__
              replacement: kubernetes.default.svc:443
            - source_labels: [__meta_kubernetes_node_name]
              regex: (.+)
              target_label: __metrics_path__
              replacement: /api/v1/nodes/${1}/proxy/metrics

        - job_name: 'kubernetes-pods'
          kubernetes_sd_configs:
            - role: pod
          metric_relabel_configs:
            - source_labels: [__name__]
              regex: '(envoy_server_live|envoy_cluster_health_check_.*|envoy_cluster_upstream_rq_xx|envoy_cluster_upstream_cx_active|envoy_cluster_upstream_cx_tx_bytes_total|envoy_cluster_upstream_cx_rx_bytes_total|envoy_cluster_upstream_rq_total|envoy_cluster_upstream_cx_destroy_remote_with_active_rq|envoy_cluster_upstream_cx_connect_timeout|envoy_cluster_upstream_cx_destroy_local_with_active_rq|envoy_cluster_upstream_rq_pending_failure_eject|envoy_cluster_upstream_rq_pending_overflow|envoy_cluster_upstream_rq_timeout|envoy_cluster_upstream_rq_rx_reset|^osm.*)'
              action: keep
          relabel_configs:
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
              action: keep
              regex: true
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
              action: replace
              target_label: __metrics_path__
              regex: (.+)
            - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
              action: replace
              regex: ([^:]+)(?::\d+)?;(\d+)
              replacement: $1:$2
              target_label: __address__
            - source_labels: [__meta_kubernetes_namespace]
              action: replace
              target_label: source_namespace
            - source_labels: [__meta_kubernetes_pod_name]
              action: replace
              target_label: source_pod_name
            - regex: '(__meta_kubernetes_pod_label_app)'
              action: labelmap
              replacement: source_service
            - regex: '(__meta_kubernetes_pod_label_osm_envoy_uid|__meta_kubernetes_pod_label_pod_template_hash|__meta_kubernetes_pod_label_version)'
              action: drop
            # for non-ReplicaSets (DaemonSet, StatefulSet)
            # __meta_kubernetes_pod_controller_kind=DaemonSet
            # __meta_kubernetes_pod_controller_name=foo
            # =>
            # workload_kind=DaemonSet
            # workload_name=foo
            - source_labels: [__meta_kubernetes_pod_controller_kind]
              action: replace
              target_label: source_workload_kind
            - source_labels: [__meta_kubernetes_pod_controller_name]
              action: replace
              target_label: source_workload_name
            # for ReplicaSets
            # __meta_kubernetes_pod_controller_kind=ReplicaSet
            # __meta_kubernetes_pod_controller_name=foo-bar-123
            # =>
            # workload_kind=Deployment
            # workload_name=foo-bar
            # deployment=foo
            - source_labels: [__meta_kubernetes_pod_controller_kind]
              action: replace
              regex: ^ReplicaSet$
              target_label: source_workload_kind
              replacement: Deployment
            - source_labels:
                - __meta_kubernetes_pod_controller_kind
                - __meta_kubernetes_pod_controller_name
              action: replace
              regex: ^ReplicaSet;(.*)-[^-]+$
              target_label: source_workload_name

        - job_name: 'smi-metrics'
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
              action: keep
              regex: true
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
              action: replace
              target_label: __metrics_path__
              regex: (.+)
            - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
              action: replace
              regex: ([^:]+)(?::\d+)?;(\d+)
              replacement: $1:$2
              target_label: __address__
          metric_relabel_configs:
            - source_labels: [__name__]
              regex: 'envoy_.*osm_request_(total|duration_ms_(bucket|count|sum))'
              action: keep
            - source_labels: [__name__]
              action: replace
              regex: envoy_response_code_(\d{3})_source_namespace_.*_source_kind_.*_source_name_.*_source_pod_.*_destination_namespace_.*_destination_kind_.*_destination_name_.*_destination_pod_.*_osm_request_total
              target_label: response_code
            - source_labels: [__name__]
              action: replace
              regex: envoy_response_code_\d{3}_source_namespace_(.*)_source_kind_.*_source_name_.*_source_pod_.*_destination_namespace_.*_destination_kind_.*_destination_name_.*_destination_pod_.*_osm_request_total
              target_label: source_namespace
            - source_labels: [__name__]
              action: replace
              regex: envoy_response_code_\d{3}_source_namespace_.*_source_kind_(.*)_source_name_.*_source_pod_.*_destination_namespace_.*_destination_kind_.*_destination_name_.*_destination_pod_.*_osm_request_total
              target_label: source_kind
            - source_labels: [__name__]
              action: replace
              regex: envoy_response_code_\d{3}_source_namespace_.*_source_kind_.*_source_name_(.*)_source_pod_.*_destination_namespace_.*_destination_kind_.*_destination_name_.*_destination_pod_.*_osm_request_total
              target_label: source_name
            - source_labels: [__name__]
              action: replace
              regex: envoy_response_code_\d{3}_source_namespace_.*_source_kind_.*_source_name_.*_source_pod_(.*)_destination_namespace_.*_destination_kind_.*_destination_name_.*_destination_pod_.*_osm_request_total
              target_label: source_pod
            - source_labels: [__name__]
              action: replace
              regex: envoy_response_code_\d{3}_source_namespace_.*_source_kind_.*_source_name_.*_source_pod_.*_destination_namespace_(.*)_destination_kind_.*_destination_name_.*_destination_pod_.*_osm_request_total
              target_label: destination_namespace
            - source_labels: [__name__]
              action: replace
              regex: envoy_response_code_\d{3}_source_namespace_.*_source_kind_.*_source_name_.*_source_pod_.*_destination_namespace_.*_destination_kind_(.*)_destination_name_.*_destination_pod_.*_osm_request_total
              target_label: destination_kind
            - source_labels: [__name__]
              action: replace
              regex: envoy_response_code_\d{3}_source_namespace_.*_source_kind_.*_source_name_.*_source_pod_.*_destination_namespace_.*_destination_kind_.*_destination_name_(.*)_destination_pod_.*_osm_request_total
              target_label: destination_name
            - source_labels: [__name__]
              action: replace
              regex: envoy_response_code_\d{3}_source_namespace_.*_source_kind_.*_source_name_.*_source_pod_.*_destination_namespace_.*_destination_kind_.*_destination_name_.*_destination_pod_(.*)_osm_request_total
              target_label: destination_pod
            - source_labels: [__name__]
              action: replace
              regex: .*(osm_request_total)
              target_label: __name__
            - source_labels: [__name__]
              action: replace
              regex: envoy_source_namespace_(.*)_source_kind_.*_source_name_.*_source_pod_.*_destination_namespace_.*_destination_kind_.*_destination_name_.*_destination_pod_.*_osm_request_duration_ms_(bucket|sum|count)
              target_label: source_namespace
            - source_labels: [__name__]
              action: replace
              regex: envoy_source_namespace_.*_source_kind_(.*)_source_name_.*_source_pod_.*_destination_namespace_.*_destination_kind_.*_destination_name_.*_destination_pod_.*_osm_request_duration_ms_(bucket|sum|count)
              target_label: source_kind
            - source_labels: [__name__]
              action: replace
              regex: envoy_source_namespace_.*_source_kind_.*_source_name_(.*)_source_pod_.*_destination_namespace_.*_destination_kind_.*_destination_name_.*_destination_pod_.*_osm_request_duration_ms_(bucket|sum|count)
              target_label: source_name
            - source_labels: [__name__]
              action: replace
              regex: envoy_source_namespace_.*_source_kind_.*_source_name_.*_source_pod_(.*)_destination_namespace_.*_destination_kind_.*_destination_name_.*_destination_pod_.*_osm_request_duration_ms_(bucket|sum|count)
              target_label: source_pod
            - source_labels: [__name__]
              action: replace
              regex: envoy_source_namespace_.*_source_kind_.*_source_name_.*_source_pod_.*_destination_namespace_(.*)_destination_kind_.*_destination_name_.*_destination_pod_.*_osm_request_duration_ms_(bucket|sum|count)
              target_label: destination_namespace
            - source_labels: [__name__]
              action: replace
              regex: envoy_source_namespace_.*_source_kind_.*_source_name_.*_source_pod_.*_destination_namespace_.*_destination_kind_(.*)_destination_name_.*_destination_pod_.*_osm_request_duration_ms_(bucket|sum|count)
              target_label: destination_kind
            - source_labels: [__name__]
              action: replace
              regex: envoy_source_namespace_.*_source_kind_.*_source_name_.*_source_pod_.*_destination_namespace_.*_destination_kind_.*_destination_name_(.*)_destination_pod_.*_osm_request_duration_ms_(bucket|sum|count)
              target_label: destination_name
            - source_labels: [__name__]
              action: replace
              regex: envoy_source_namespace_.*_source_kind_.*_source_name_.*_source_pod_.*_destination_namespace_.*_destination_kind_.*_destination_name_.*_destination_pod_(.*)_osm_request_duration_ms_(bucket|sum|count)
              target_label: destination_pod
            - source_labels: [__name__]
              action: replace
              regex: .*(osm_request_duration_ms_(bucket|sum|count))
              target_label: __name__

        - job_name: 'kubernetes-cadvisor'
          scheme: https
          tls_config:
            ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
          bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
          kubernetes_sd_configs:
            - role: node
          metric_relabel_configs:
            - source_labels: [__name__]
              regex: '(container_cpu_usage_seconds_total|container_memory_rss)'
              action: keep
          relabel_configs:
            - action: labelmap
              regex: __meta_kubernetes_node_label_(.+)
            - target_label: __address__
              replacement: kubernetes.default.svc:443
            - source_labels: [__meta_kubernetes_node_name]
              regex: (.+)
              target_label: __metrics_path__
              replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
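
If you want the API server to validate the updated manifest before persisting it, a server-side dry run does so without changing the cluster:

  kubectl apply -f update-prometheus-configmap.yaml --dry-run=server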

Use kubectl apply to update the Prometheus server configmap.

  kubectl apply -f update-prometheus-configmap.yaml
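
The chart runs a configmap-reload sidecar next to the Prometheus server, so the new configuration is typically picked up within a minute or two without intervention. If you would rather not wait, restarting the server deployment (the name below assumes the stable release name used earlier) forces a fresh load:

  kubectl rollout restart deployment stable-prometheus-server
  kubectl rollout status deployment stable-prometheus-server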

Verify Prometheus is able to scrape the OSM mesh and API endpoints by using kubectl port-forward to forward the traffic between the Prometheus management application and your development computer.

  export POD_NAME=$(kubectl get pods -l "app=prometheus,component=server" -o jsonpath="{.items[0].metadata.name}")
  kubectl port-forward $POD_NAME 9090

Open a web browser to http://localhost:9090/targets to access the Prometheus management application and verify the endpoints are connected, up, and being scraped.


Targets with the relabeling configuration established by OSM should show a state of “up”.
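
The same check can be scripted against the Prometheus HTTP API while the port-forward is running. This sketch queries for envoy_server_live, one of the metrics the configuration above keeps; an empty result list means no Envoy sidecars have been scraped yet:

  curl -s 'http://localhost:9090/api/v1/query?query=envoy_server_live'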

Stop the port-forwarding command.

Deploy an example Grafana instance

Use helm to deploy a Grafana instance to your cluster in the default namespace.

  helm repo add grafana https://grafana.github.io/helm-charts
  helm repo update
  helm install grafana/grafana --generate-name
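
Before continuing, confirm the Grafana pod has reached the Running state. The label selector below is the one the chart applies to its pods:

  kubectl get pods -l "app.kubernetes.io/name=grafana"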

Use kubectl get secret to display the administrator password for Grafana.

  export SECRET_NAME=$(kubectl get secret -l "app.kubernetes.io/name=grafana" -o jsonpath="{.items[0].metadata.name}")
  kubectl get secret $SECRET_NAME -o jsonpath="{.data.admin-password}" | base64 --decode ; echo

Use kubectl port-forward to forward the traffic between the Grafana management application and your development computer.

  export POD_NAME=$(kubectl get pods -l "app.kubernetes.io/name=grafana" -o jsonpath="{.items[0].metadata.name}")
  kubectl port-forward $POD_NAME 3000

Open a web browser to http://localhost:3000 to access the Grafana management application. Sign in with admin as the username and the administrator password from the previous step.

From the management application:

  • Select Settings then Data Sources.
  • Select Add data source.
  • Find the Prometheus data source and select Select.
  • In URL, enter the DNS name recorded in the earlier step, for example stable-prometheus-server.default.svc.cluster.local.

Select Save and Test and confirm you see Data source is working.
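
If you prefer to configure the data source declaratively instead of through the UI, the Grafana chart accepts data sources in its values file. A minimal sketch, where grafana-datasource-values.yaml contains the Prometheus DNS name from earlier and <grafana-release-name> is the generated release name shown by helm list:

  datasources:
    datasources.yaml:
      apiVersion: 1
      datasources:
        - name: Prometheus
          type: prometheus
          url: http://stable-prometheus-server.default.svc.cluster.local
          access: proxy
          isDefault: true

Apply it with helm upgrade <grafana-release-name> grafana/grafana -f grafana-datasource-values.yaml.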

Import the OSM dashboards

The OSM dashboards are available in the OSM GitHub repository and can be imported as JSON blobs through the management application.

To import a dashboard:

  • Hover your cursor over the + and select Import.
  • Copy the JSON from the osm-mesh-envoy-details dashboard and paste it in Import via panel json.
  • Select Load.
  • Select Import.

Confirm you see a Mesh and Envoy Details dashboard created.
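
For repeatable setups, the import can also be scripted against the Grafana HTTP API. A minimal sketch, assuming the port-forward from earlier is still running, the admin password is in $GRAFANA_PASSWORD, jq is available, and the dashboard JSON was saved locally as osm-mesh-envoy-details.json (dashboards with data source input variables may need additional fields):

  # Wrap the dashboard JSON in the payload shape the API expects
  jq '{dashboard: ., overwrite: true}' osm-mesh-envoy-details.json > payload.json
  curl -s -u "admin:$GRAFANA_PASSWORD" -H 'Content-Type: application/json' \
    -d @payload.json http://localhost:3000/api/dashboards/db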