Integrate OSM with a Self-Managed Prometheus and Grafana

This article shows you how to create a self-managed Prometheus and Grafana in your cluster and configure them for observability and monitoring of OSM. For an example that uses OSM to automatically provision Prometheus and Grafana, see the Observability getting started guide.

Important: the configuration created in this article should not be used in production. For production-grade deployments, see Prometheus Operator and Deploy Grafana in Kubernetes.

Prerequisites

  • A Kubernetes cluster running version v1.22.9 or higher.
  • OSM installed on the cluster.
  • kubectl installed for access to the API server.
  • The osm command-line tool installed.
  • The helm command-line tool installed.
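As a side note, the minimum-version requirement above can be checked mechanically. A minimal sketch of a dotted-version comparison in shell, using a hypothetical server version string (on a real cluster you would read it from kubectl version):

```shell
# Compare two dotted versions using sort -V; succeeds when $1 >= $2.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n1)" = "$2" ]
}

# SERVER_VERSION is a hypothetical value; on a real cluster you would
# obtain it from the output of `kubectl version`.
SERVER_VERSION="1.24.3"

if version_ge "$SERVER_VERSION" "1.22.9"; then
  echo "cluster version OK"
else
  echo "cluster version too old"
fi
```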

Deploy an Example Prometheus Instance

Install a Prometheus instance in the default namespace using helm.

  helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
  helm repo update
  helm install stable prometheus-community/prometheus

The output of the helm install command includes the DNS name of the Prometheus server. For example:

  ...
  The Prometheus server can be accessed via port 80 on the following DNS name from within your cluster:
  stable-prometheus-server.metrics.svc.cluster.local
  ...

Note the DNS name; it will be used in a later step.
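The name follows the standard Kubernetes service DNS scheme, &lt;service&gt;.&lt;namespace&gt;.svc.cluster.local. A small sketch of how it is composed, assuming the release name stable from the helm install above and an installation into the default namespace:

```shell
# Compose the in-cluster DNS name of the Prometheus server service.
RELEASE="stable"      # release name given to `helm install`
NAMESPACE="default"   # namespace the chart was installed into (an assumption)
DNS_NAME="${RELEASE}-prometheus-server.${NAMESPACE}.svc.cluster.local"
echo "$DNS_NAME"   # → stable-prometheus-server.default.svc.cluster.local
```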

Configure Prometheus for OSM

Prometheus needs a scrape configuration for the OSM endpoints, along with the corresponding OSM labeling, relabeling, and endpoint configuration. This configuration also helps the OSM Grafana dashboards (configured in a later step) correctly display the data scraped from OSM.
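The pod scrape job below keys off the conventional Prometheus scrape annotations. For reference, a pod enabled for OSM metrics carries annotations along these lines (the port and path shown are the ones OSM's Envoy sidecar conventionally exposes; treat the exact values as illustrative):

```yaml
metadata:
  annotations:
    prometheus.io/scrape: "true"             # matched by the keep rule on the scrape annotation
    prometheus.io/port: "15010"              # rewritten into __address__ by the port relabel rule
    prometheus.io/path: "/stats/prometheus"  # rewritten into __metrics_path__
```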

Use kubectl get configmap to verify that the stable-prometheus-server ConfigMap was created. For example:

  $ kubectl get configmap
  NAME                             DATA   AGE
  ...
  stable-prometheus-alertmanager   1      18m
  stable-prometheus-server         5      18m
  ...

Create update-prometheus-configmap.yaml with the following content:

  apiVersion: v1
  kind: ConfigMap
  metadata:
    name: stable-prometheus-server
  data:
    prometheus.yml: |
      global:
        scrape_interval: 10s
        scrape_timeout: 10s
        evaluation_interval: 1m

      scrape_configs:
        - job_name: 'kubernetes-apiservers'
          kubernetes_sd_configs:
            - role: endpoints
          scheme: https
          tls_config:
            ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
            # TODO need to remove this when the CA and SAN match
            insecure_skip_verify: true
          bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
          metric_relabel_configs:
            - source_labels: [__name__]
              regex: '(apiserver_watch_events_total|apiserver_admission_webhook_rejection_count)'
              action: keep
          relabel_configs:
            - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
              action: keep
              regex: default;kubernetes;https

        - job_name: 'kubernetes-nodes'
          scheme: https
          tls_config:
            ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
          bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
          kubernetes_sd_configs:
            - role: node
          relabel_configs:
            - action: labelmap
              regex: __meta_kubernetes_node_label_(.+)
            - target_label: __address__
              replacement: kubernetes.default.svc:443
            - source_labels: [__meta_kubernetes_node_name]
              regex: (.+)
              target_label: __metrics_path__
              replacement: /api/v1/nodes/${1}/proxy/metrics

        - job_name: 'kubernetes-pods'
          kubernetes_sd_configs:
            - role: pod
          metric_relabel_configs:
            - source_labels: [__name__]
              regex: '(envoy_server_live|envoy_cluster_health_check_.*|envoy_cluster_upstream_rq_xx|envoy_cluster_upstream_cx_active|envoy_cluster_upstream_cx_tx_bytes_total|envoy_cluster_upstream_cx_rx_bytes_total|envoy_cluster_upstream_rq_total|envoy_cluster_upstream_cx_destroy_remote_with_active_rq|envoy_cluster_upstream_cx_connect_timeout|envoy_cluster_upstream_cx_destroy_local_with_active_rq|envoy_cluster_upstream_rq_pending_failure_eject|envoy_cluster_upstream_rq_pending_overflow|envoy_cluster_upstream_rq_timeout|envoy_cluster_upstream_rq_rx_reset|^osm.*)'
              action: keep
          relabel_configs:
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
              action: keep
              regex: true
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
              action: replace
              target_label: __metrics_path__
              regex: (.+)
            - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
              action: replace
              regex: ([^:]+)(?::\d+)?;(\d+)
              replacement: $1:$2
              target_label: __address__
            - source_labels: [__meta_kubernetes_namespace]
              action: replace
              target_label: source_namespace
            - source_labels: [__meta_kubernetes_pod_name]
              action: replace
              target_label: source_pod_name
            - regex: '(__meta_kubernetes_pod_label_app)'
              action: labelmap
              replacement: source_service
            - regex: '(__meta_kubernetes_pod_label_osm_envoy_uid|__meta_kubernetes_pod_label_pod_template_hash|__meta_kubernetes_pod_label_version)'
              action: drop
            # for non-ReplicaSets (DaemonSet, StatefulSet)
            # __meta_kubernetes_pod_controller_kind=DaemonSet
            # __meta_kubernetes_pod_controller_name=foo
            # =>
            # workload_kind=DaemonSet
            # workload_name=foo
            - source_labels: [__meta_kubernetes_pod_controller_kind]
              action: replace
              target_label: source_workload_kind
            - source_labels: [__meta_kubernetes_pod_controller_name]
              action: replace
              target_label: source_workload_name
            # for ReplicaSets
            # __meta_kubernetes_pod_controller_kind=ReplicaSet
            # __meta_kubernetes_pod_controller_name=foo-bar-123
            # =>
            # workload_kind=Deployment
            # workload_name=foo-bar
            # deployment=foo
            - source_labels: [__meta_kubernetes_pod_controller_kind]
              action: replace
              regex: ^ReplicaSet$
              target_label: source_workload_kind
              replacement: Deployment
            - source_labels:
                - __meta_kubernetes_pod_controller_kind
                - __meta_kubernetes_pod_controller_name
              action: replace
              regex: ^ReplicaSet;(.*)-[^-]+$
              target_label: source_workload_name

        - job_name: 'smi-metrics'
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
              action: keep
              regex: true
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
              action: replace
              target_label: __metrics_path__
              regex: (.+)
            - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
              action: replace
              regex: ([^:]+)(?::\d+)?;(\d+)
              replacement: $1:$2
              target_label: __address__
          metric_relabel_configs:
            - source_labels: [__name__]
              regex: 'envoy_.*osm_request_(total|duration_ms_(bucket|count|sum))'
              action: keep
            - source_labels: [__name__]
              action: replace
              regex: envoy_response_code_(\d{3})_source_namespace_.*_source_kind_.*_source_name_.*_source_pod_.*_destination_namespace_.*_destination_kind_.*_destination_name_.*_destination_pod_.*_osm_request_total
              target_label: response_code
            - source_labels: [__name__]
              action: replace
              regex: envoy_response_code_\d{3}_source_namespace_(.*)_source_kind_.*_source_name_.*_source_pod_.*_destination_namespace_.*_destination_kind_.*_destination_name_.*_destination_pod_.*_osm_request_total
              target_label: source_namespace
            - source_labels: [__name__]
              action: replace
              regex: envoy_response_code_\d{3}_source_namespace_.*_source_kind_(.*)_source_name_.*_source_pod_.*_destination_namespace_.*_destination_kind_.*_destination_name_.*_destination_pod_.*_osm_request_total
              target_label: source_kind
            - source_labels: [__name__]
              action: replace
              regex: envoy_response_code_\d{3}_source_namespace_.*_source_kind_.*_source_name_(.*)_source_pod_.*_destination_namespace_.*_destination_kind_.*_destination_name_.*_destination_pod_.*_osm_request_total
              target_label: source_name
            - source_labels: [__name__]
              action: replace
              regex: envoy_response_code_\d{3}_source_namespace_.*_source_kind_.*_source_name_.*_source_pod_(.*)_destination_namespace_.*_destination_kind_.*_destination_name_.*_destination_pod_.*_osm_request_total
              target_label: source_pod
            - source_labels: [__name__]
              action: replace
              regex: envoy_response_code_\d{3}_source_namespace_.*_source_kind_.*_source_name_.*_source_pod_.*_destination_namespace_(.*)_destination_kind_.*_destination_name_.*_destination_pod_.*_osm_request_total
              target_label: destination_namespace
            - source_labels: [__name__]
              action: replace
              regex: envoy_response_code_\d{3}_source_namespace_.*_source_kind_.*_source_name_.*_source_pod_.*_destination_namespace_.*_destination_kind_(.*)_destination_name_.*_destination_pod_.*_osm_request_total
              target_label: destination_kind
            - source_labels: [__name__]
              action: replace
              regex: envoy_response_code_\d{3}_source_namespace_.*_source_kind_.*_source_name_.*_source_pod_.*_destination_namespace_.*_destination_kind_.*_destination_name_(.*)_destination_pod_.*_osm_request_total
              target_label: destination_name
            - source_labels: [__name__]
              action: replace
              regex: envoy_response_code_\d{3}_source_namespace_.*_source_kind_.*_source_name_.*_source_pod_.*_destination_namespace_.*_destination_kind_.*_destination_name_.*_destination_pod_(.*)_osm_request_total
              target_label: destination_pod
            - source_labels: [__name__]
              action: replace
              regex: .*(osm_request_total)
              target_label: __name__
            - source_labels: [__name__]
              action: replace
              regex: envoy_source_namespace_(.*)_source_kind_.*_source_name_.*_source_pod_.*_destination_namespace_.*_destination_kind_.*_destination_name_.*_destination_pod_.*_osm_request_duration_ms_(bucket|sum|count)
              target_label: source_namespace
            - source_labels: [__name__]
              action: replace
              regex: envoy_source_namespace_.*_source_kind_(.*)_source_name_.*_source_pod_.*_destination_namespace_.*_destination_kind_.*_destination_name_.*_destination_pod_.*_osm_request_duration_ms_(bucket|sum|count)
              target_label: source_kind
            - source_labels: [__name__]
              action: replace
              regex: envoy_source_namespace_.*_source_kind_.*_source_name_(.*)_source_pod_.*_destination_namespace_.*_destination_kind_.*_destination_name_.*_destination_pod_.*_osm_request_duration_ms_(bucket|sum|count)
              target_label: source_name
            - source_labels: [__name__]
              action: replace
              regex: envoy_source_namespace_.*_source_kind_.*_source_name_.*_source_pod_(.*)_destination_namespace_.*_destination_kind_.*_destination_name_.*_destination_pod_.*_osm_request_duration_ms_(bucket|sum|count)
              target_label: source_pod
            - source_labels: [__name__]
              action: replace
              regex: envoy_source_namespace_.*_source_kind_.*_source_name_.*_source_pod_.*_destination_namespace_(.*)_destination_kind_.*_destination_name_.*_destination_pod_.*_osm_request_duration_ms_(bucket|sum|count)
              target_label: destination_namespace
            - source_labels: [__name__]
              action: replace
              regex: envoy_source_namespace_.*_source_kind_.*_source_name_.*_source_pod_.*_destination_namespace_.*_destination_kind_(.*)_destination_name_.*_destination_pod_.*_osm_request_duration_ms_(bucket|sum|count)
              target_label: destination_kind
            - source_labels: [__name__]
              action: replace
              regex: envoy_source_namespace_.*_source_kind_.*_source_name_.*_source_pod_.*_destination_namespace_.*_destination_kind_.*_destination_name_(.*)_destination_pod_.*_osm_request_duration_ms_(bucket|sum|count)
              target_label: destination_name
            - source_labels: [__name__]
              action: replace
              regex: envoy_source_namespace_.*_source_kind_.*_source_name_.*_source_pod_.*_destination_namespace_.*_destination_kind_.*_destination_name_.*_destination_pod_(.*)_osm_request_duration_ms_(bucket|sum|count)
              target_label: destination_pod
            - source_labels: [__name__]
              action: replace
              regex: .*(osm_request_duration_ms_(bucket|sum|count))
              target_label: __name__

        - job_name: 'kubernetes-cadvisor'
          scheme: https
          tls_config:
            ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
          bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
          kubernetes_sd_configs:
            - role: node
          metric_relabel_configs:
            - source_labels: [__name__]
              regex: '(container_cpu_usage_seconds_total|container_memory_rss)'
              action: keep
          relabel_configs:
            - action: labelmap
              regex: __meta_kubernetes_node_label_(.+)
            - target_label: __address__
              replacement: kubernetes.default.svc:443
            - source_labels: [__meta_kubernetes_node_name]
              regex: (.+)
              target_label: __metrics_path__
              replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor

Update the Prometheus server ConfigMap with kubectl apply.

  kubectl apply -f update-prometheus-configmap.yaml

Verify that Prometheus can scrape the OSM mesh and API endpoints by using kubectl port-forward to forward traffic between the Prometheus management application and your development machine.

  export POD_NAME=$(kubectl get pods -l "app=prometheus,component=server" -o jsonpath="{.items[0].metadata.name}")
  kubectl port-forward $POD_NAME 9090

Open http://localhost:9090/targets in a web browser to access the Prometheus management application, and verify that the endpoints are connected, up, and being scraped.

(Figure: the Prometheus targets page. Targets with the specific relabeling config established by OSM should be "up".)

Stop the port-forward command.

Deploy a Grafana Instance

Install a Grafana instance in the default namespace using helm.

  helm repo add grafana https://grafana.github.io/helm-charts
  helm repo update
  helm install grafana/grafana --generate-name

Use kubectl get secret to display the administrator password for Grafana.

  export SECRET_NAME=$(kubectl get secret -l "app.kubernetes.io/name=grafana" -o jsonpath="{.items[0].metadata.name}")
  kubectl get secret $SECRET_NAME -o jsonpath="{.data.admin-password}" | base64 --decode ; echo
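The second command simply base64-decodes the admin-password field of the Secret. The decoding step in isolation, with a made-up encoded value standing in for the real Secret data:

```shell
# Hypothetical base64-encoded value, standing in for .data.admin-password.
ENCODED="cGFzc3dvcmQxMjM="
PASSWORD="$(echo "$ENCODED" | base64 --decode)"
echo "$PASSWORD"   # → password123
```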

Use kubectl port-forward to forward traffic between the Grafana management application and your development machine.

  export POD_NAME=$(kubectl get pods -l "app.kubernetes.io/name=grafana" -o jsonpath="{.items[0].metadata.name}")
  kubectl port-forward $POD_NAME 3000

Open http://localhost:3000 in a web browser to access the Grafana management application. Log in with admin as the username and the administrator password obtained earlier.

In the management application:

  • Select Settings, then Data Sources.
  • Select Add data source.
  • Find the Prometheus data source and select Select.
  • Enter the DNS name obtained earlier for the URL, for example stable-prometheus-server.default.svc.cluster.local.

Select Save and Test and confirm you see Data source is working.
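As an alternative to clicking through the UI, Grafana can also provision the data source from a configuration file. A hypothetical provisioning fragment, assuming the DNS name from the earlier step (how the file is mounted depends on your Grafana deployment):

```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                      # Grafana proxies queries server-side
    url: http://stable-prometheus-server.default.svc.cluster.local
    isDefault: true
```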

Import the OSM Dashboards

The OSM dashboards are available in the OSM GitHub repository and can be imported into the management application as JSON blobs.

To import a dashboard:

  • Hover over + and select Import.
  • Copy the JSON content of the osm-mesh-envoy-details dashboard into Import via panel json.
  • Select Load.
  • Select Import.

Confirm that the Mesh and Envoy Details dashboard is created.