Observability Best Practices

Using Prometheus for production-scale monitoring

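The recommended approach for production-scale monitoring of Istio meshes with Prometheus is to use hierarchical federation in combination with a collection of recording rules.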

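Although installing Istio does not deploy Prometheus by default, the Getting Started instructions install the "Option 1: Quick Start" deployment of Prometheus described in the Prometheus integration guide. This Prometheus deployment is intentionally configured with a very short retention window (6 hours). The quick-start deployment is also configured to collect metrics from every Envoy proxy running in the mesh, augmenting each metric with a set of labels about its origin (instance, pod, and namespace).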

Production-scale Istio monitoring with Istio

Workload-level aggregation via recording rules

To aggregate metrics across instances and pods, update the default Prometheus configuration with the following recording rules:

The rules are given in two forms: first as plain Prometheus rules, then as a Prometheus Operator rules CRD.

groups:
- name: "istio.recording-rules"
  interval: 5s
  rules:
  - record: "workload:istio_requests_total"
    expr: |
      sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_requests_total)
  - record: "workload:istio_request_duration_milliseconds_count"
    expr: |
      sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_request_duration_milliseconds_count)
  - record: "workload:istio_request_duration_milliseconds_sum"
    expr: |
      sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_request_duration_milliseconds_sum)
  - record: "workload:istio_request_duration_milliseconds_bucket"
    expr: |
      sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_request_duration_milliseconds_bucket)
  - record: "workload:istio_request_bytes_count"
    expr: |
      sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_request_bytes_count)
  - record: "workload:istio_request_bytes_sum"
    expr: |
      sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_request_bytes_sum)
  - record: "workload:istio_request_bytes_bucket"
    expr: |
      sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_request_bytes_bucket)
  - record: "workload:istio_response_bytes_count"
    expr: |
      sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_response_bytes_count)
  - record: "workload:istio_response_bytes_sum"
    expr: |
      sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_response_bytes_sum)
  - record: "workload:istio_response_bytes_bucket"
    expr: |
      sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_response_bytes_bucket)
  - record: "workload:istio_tcp_connections_opened_total"
    expr: |
      sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_tcp_connections_opened_total)
  - record: "workload:istio_tcp_connections_closed_total"
    expr: |
      sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_tcp_connections_closed_total)
  - record: "workload:istio_tcp_sent_bytes_total_count"
    expr: |
      sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_tcp_sent_bytes_total_count)
  - record: "workload:istio_tcp_sent_bytes_total_sum"
    expr: |
      sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_tcp_sent_bytes_total_sum)
  - record: "workload:istio_tcp_sent_bytes_total_bucket"
    expr: |
      sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_tcp_sent_bytes_total_bucket)
  - record: "workload:istio_tcp_received_bytes_total_count"
    expr: |
      sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_tcp_received_bytes_total_count)
  - record: "workload:istio_tcp_received_bytes_total_sum"
    expr: |
      sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_tcp_received_bytes_total_sum)
  - record: "workload:istio_tcp_received_bytes_total_bucket"
    expr: |
      sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_tcp_received_bytes_total_bucket)

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: istio-metrics-aggregation
  labels:
    app.kubernetes.io/name: istio-prometheus
spec:
  groups:
  - name: "istio.metricsAggregation-rules"
    interval: 5s
    rules:
    - record: "workload:istio_requests_total"
      expr: "sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_requests_total)"
    - record: "workload:istio_request_duration_milliseconds_count"
      expr: "sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_request_duration_milliseconds_count)"
    - record: "workload:istio_request_duration_milliseconds_sum"
      expr: "sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_request_duration_milliseconds_sum)"
    - record: "workload:istio_request_duration_milliseconds_bucket"
      expr: "sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_request_duration_milliseconds_bucket)"
    - record: "workload:istio_request_bytes_count"
      expr: "sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_request_bytes_count)"
    - record: "workload:istio_request_bytes_sum"
      expr: "sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_request_bytes_sum)"
    - record: "workload:istio_request_bytes_bucket"
      expr: "sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_request_bytes_bucket)"
    - record: "workload:istio_response_bytes_count"
      expr: "sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_response_bytes_count)"
    - record: "workload:istio_response_bytes_sum"
      expr: "sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_response_bytes_sum)"
    - record: "workload:istio_response_bytes_bucket"
      expr: "sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_response_bytes_bucket)"
    - record: "workload:istio_tcp_connections_opened_total"
      expr: "sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_tcp_connections_opened_total)"
    - record: "workload:istio_tcp_connections_closed_total"
      expr: "sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_tcp_connections_closed_total)"
    - record: "workload:istio_tcp_sent_bytes_total_count"
      expr: "sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_tcp_sent_bytes_total_count)"
    - record: "workload:istio_tcp_sent_bytes_total_sum"
      expr: "sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_tcp_sent_bytes_total_sum)"
    - record: "workload:istio_tcp_sent_bytes_total_bucket"
      expr: "sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_tcp_sent_bytes_total_bucket)"
    - record: "workload:istio_tcp_received_bytes_total_count"
      expr: "sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_tcp_received_bytes_total_count)"
    - record: "workload:istio_tcp_received_bytes_total_sum"
      expr: "sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_tcp_received_bytes_total_sum)"
    - record: "workload:istio_tcp_received_bytes_total_bucket"
      expr: "sum without(instance, kubernetes_namespace, kubernetes_pod_name) (istio_tcp_received_bytes_total_bucket)"

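The recording rules above only aggregate across pods and instances. They still preserve the full set of Istio Standard Metrics, including all Istio dimensions. While this helps control metrics cardinality via federation, you may want to further optimize the recording rules to match your existing dashboards, alerts, and ad-hoc queries.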

For more information on tailoring your recording rules, see the section on optimizing metrics collection with recording rules.

Federation using workload-level aggregated metrics

To establish Prometheus federation, modify the configuration of your production Prometheus deployment to scrape the federation endpoint of the Istio Prometheus.

Add the following job to your configuration:

- job_name: 'istio-prometheus'
  honor_labels: true
  metrics_path: '/federate'
  kubernetes_sd_configs:
  - role: pod
    namespaces:
      names: ['istio-system']
  metric_relabel_configs:
  - source_labels: [__name__]
    regex: 'workload:(.*)'
    target_label: __name__
    action: replace
  params:
    'match[]':
    - '{__name__=~"workload:(.*)"}'
    - '{__name__=~"pilot(.*)"}'

If you are using the Prometheus Operator, use the following configuration instead:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: istio-federation
  labels:
    app.kubernetes.io/name: istio-prometheus
spec:
  namespaceSelector:
    matchNames:
    - istio-system
  selector:
    matchLabels:
      app: prometheus
  endpoints:
  - interval: 30s
    scrapeTimeout: 30s
    params:
      'match[]':
      - '{__name__=~"workload:(.*)"}'
      - '{__name__=~"pilot(.*)"}'
    path: /federate
    targetPort: 9090
    honorLabels: true
    metricRelabelings:
    - sourceLabels: ["__name__"]
      regex: 'workload:(.*)'
      targetLabel: "__name__"
      action: replace

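The key to the federation configuration is to first match on the job in the Istio-deployed Prometheus that collects the Istio Standard Metrics, and then to rename the collected metrics by removing the prefix used in the workload-level recording rules (workload:). This allows existing dashboards and queries to keep working seamlessly when pointed at the production Prometheus instance (and no longer at the Istio instance).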

You can include additional metrics (for example, envoy, go, and so on) when setting up federation.

Control plane metrics are also collected and federated up to the production Prometheus.
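As a rough sketch of including additional metrics, the federation job shown earlier could be extended with extra match[] selectors. The envoy_ and go_ metric-name prefixes below are assumptions about how those metrics are named in your Istio Prometheus. Check the actual metric names before relying on them.

```yaml
# Sketch only: the federation job from above with two extra selectors.
# The "envoy_" and "go_" prefixes are assumed, not taken from this guide.
- job_name: 'istio-prometheus'
  honor_labels: true
  metrics_path: '/federate'
  kubernetes_sd_configs:
  - role: pod
    namespaces:
      names: ['istio-system']
  metric_relabel_configs:
  # Strip the "workload:" prefix so existing dashboards keep working.
  - source_labels: [__name__]
    regex: 'workload:(.*)'
    target_label: __name__
    action: replace
  params:
    'match[]':
    - '{__name__=~"workload:(.*)"}'
    - '{__name__=~"pilot(.*)"}'
    - '{__name__=~"envoy_(.*)"}'   # assumed prefix for Envoy proxy metrics
    - '{__name__=~"go_(.*)"}'      # assumed prefix for Go runtime metrics
```

Metrics federated this way are not passed through the workload-level recording rules, so they keep their per-pod and per-instance labels. That can add noticeable cardinality on the production instance.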

Optimizing metrics collection with recording rules

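Beyond using recording rules to aggregate over pods and instances, you may want to use recording rules to generate aggregated metrics tailored specifically to your existing dashboards and alerts. Optimizing collection in this way can significantly reduce resource consumption in your production Prometheus instance and also speed up query performance.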

For example, imagine a monitoring dashboard that uses the following Prometheus queries:

Request rate averaged over the past minute, aggregated by destination service and namespace

sum(irate(istio_requests_total{reporter="source"}[1m]))
by (
    destination_canonical_service,
    destination_workload_namespace
)

P95 client latency averaged over the past minute, aggregated by source and destination service and namespace

histogram_quantile(0.95,
  sum(irate(istio_request_duration_milliseconds_bucket{reporter="source"}[1m]))
  by (
    destination_canonical_service,
    destination_workload_namespace,
    source_canonical_service,
    source_workload_namespace,
    le
  )
)

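The following recording rules could be added to the Istio Prometheus configuration, using the istio prefix to make these metrics easy to identify for federation.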

groups:
- name: "istio.recording-rules"
  interval: 5s
  rules:
  - record: "istio:istio_requests:by_destination_service:rate1m"
    expr: |
      sum(irate(istio_requests_total{reporter="source"}[1m]))
      by (
        destination_canonical_service,
        destination_workload_namespace
      )
  - record: "istio:istio_request_duration_milliseconds_bucket:p95:rate1m"
    expr: |
      histogram_quantile(0.95,
        sum(irate(istio_request_duration_milliseconds_bucket{reporter="source"}[1m]))
        by (
          destination_canonical_service,
          destination_workload_namespace,
          source_canonical_service,
          source_workload_namespace,
          le
        )
      )

The production instance of Prometheus would then be updated to federate from the Istio instance with:

a match clause of {__name__=~"istio:(.*)"}
a metric relabeling config with regex: "istio:(.*)"
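Put together, the two changes above might look like the following federation job. This is a minimal sketch that reuses the job name and service discovery settings from the earlier federation example. Keeping the workload: selector and relabel rule next to the new istio: ones is an assumption, not something this guide prescribes.

```yaml
# Sketch only: federate both the workload-level and the istio-prefixed
# recorded metrics, stripping each prefix on ingestion.
- job_name: 'istio-prometheus'
  honor_labels: true
  metrics_path: '/federate'
  kubernetes_sd_configs:
  - role: pod
    namespaces:
      names: ['istio-system']
  metric_relabel_configs:
  # "workload:istio_requests_total" becomes "istio_requests_total"
  - source_labels: [__name__]
    regex: 'workload:(.*)'
    target_label: __name__
    action: replace
  # "istio:istio_requests:by_destination_service:rate1m" becomes
  # "istio_requests:by_destination_service:rate1m"
  - source_labels: [__name__]
    regex: 'istio:(.*)'
    target_label: __name__
    action: replace
  params:
    'match[]':
    - '{__name__=~"workload:(.*)"}'
    - '{__name__=~"istio:(.*)"}'
```

Prometheus anchors relabeling regexes at both ends, so the second rule leaves names that have already been renamed (such as istio_requests_total, which contains no "istio:" prefix) unchanged.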

The original queries would then be replaced with:

istio_requests:by_destination_service:rate1m
avg(istio_request_duration_milliseconds_bucket:p95:rate1m)

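A more detailed write-up on metrics collection optimization in production at AutoTrader provides a fuller set of examples of aggregating directly to the queries that power dashboards and alerts.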