Monitor Node Health

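Node Problem Detector is a daemon that monitors and reports on the health of a node. You can run Node Problem Detector as a DaemonSet or as a standalone daemon. Node Problem Detector collects information about node problems from various daemons and reports them to the API server as NodeCondition and Event objects.

To learn how to install and use Node Problem Detector, see the Node Problem Detector project documentation.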
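
Because problems are reported as NodeConditions, one way to check a node is to read the conditions from its status. The following Python sketch is only an illustration: `SAMPLE_NODE` is a hypothetical, trimmed stand-in for the output of `kubectl get node <name> -o json`, `KernelDeadlock` is used merely as an example of a condition type a problem daemon might report, and the function name `problem_conditions` is an assumption of this example.

```python
# Hypothetical, trimmed node object such as `kubectl get node <name> -o json`
# might return. The KernelDeadlock condition is an illustrative example.
SAMPLE_NODE = {
    "status": {
        "conditions": [
            {"type": "Ready", "status": "True"},
            {"type": "MemoryPressure", "status": "False"},
            {"type": "KernelDeadlock", "status": "True"},
        ]
    }
}


def problem_conditions(node, healthy_when_true=("Ready",)):
    """Return the condition types that currently indicate a problem.

    For types listed in healthy_when_true, status "True" means healthy.
    For every other type, status "True" is treated as a problem.
    """
    problems = []
    for cond in node.get("status", {}).get("conditions", []):
        is_true = cond.get("status") == "True"
        if cond["type"] in healthy_when_true:
            if not is_true:
                problems.append(cond["type"])
        elif is_true:
            problems.append(cond["type"])
    return problems


print(problem_conditions(SAMPLE_NODE))
```

In practice you would load the real node JSON with `json.load` instead of using the hard-coded sample.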

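Before you begin

You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. If you do not already have a cluster, you can create one by using Minikube, or you can use one of the online Kubernetes playgrounds.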

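Limitations

Node Problem Detector only supports file-based kernel logs. It does not support log tools such as journald.
Node Problem Detector uses the kernel log format for reporting kernel issues. To learn how to extend the kernel log format, see Add support for another log format.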

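Enabling Node Problem Detector

Some cloud providers enable Node Problem Detector as an addon. You can also enable Node Problem Detector with kubectl or by creating an Addon Pod.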

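Using kubectl to enable Node Problem Detector

kubectl provides the most flexible management of Node Problem Detector. You can overwrite the default configuration to fit your environment or to detect customized node problems. For example:

Create a Node Problem Detector configuration similar to node-problem-detector.yaml: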

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-problem-detector-v0.1
  namespace: kube-system
  labels:
    k8s-app: node-problem-detector
    version: v0.1
    kubernetes.io/cluster-service: "true"
spec:
  selector:
    matchLabels:
      k8s-app: node-problem-detector
      version: v0.1
      kubernetes.io/cluster-service: "true"
  template:
    metadata:
      labels:
        k8s-app: node-problem-detector
        version: v0.1
        kubernetes.io/cluster-service: "true"
    spec:
      hostNetwork: true
      containers:
      - name: node-problem-detector
        image: k8s.gcr.io/node-problem-detector:v0.1
        securityContext:
          privileged: true
        resources:
          limits:
            cpu: "200m"
            memory: "100Mi"
          requests:
            cpu: "20m"
            memory: "20Mi"
        volumeMounts:
        - name: log
          mountPath: /log
          readOnly: true
      volumes:
      - name: log
        hostPath:
          path: /var/log/

Note: You should verify that the system log directory is correct for your operating system distribution.

Start Node Problem Detector with kubectl:

kubectl apply -f https://k8s.io/examples/debug/node-problem-detector.yaml

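Using an Addon Pod to enable Node Problem Detector

If you are using a custom cluster bootstrap solution and don't need to overwrite the default configuration, you can use the Addon Pod to further automate the deployment.

Create node-problem-detector.yaml, and save the configuration in the Addon Pod's directory /etc/kubernetes/addons/node-problem-detector on a control plane node.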

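Overwrite the configuration

The default configuration is embedded when the Docker image of Node Problem Detector is built.

However, you can use a ConfigMap to overwrite the configuration:

Change the configuration files in config/

Create the ConfigMap node-problem-detector-config: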

kubectl create configmap node-problem-detector-config --from-file=config/ -n kube-system

Change node-problem-detector.yaml to use the ConfigMap:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-problem-detector-v0.1
  namespace: kube-system
  labels:
    k8s-app: node-problem-detector
    version: v0.1
    kubernetes.io/cluster-service: "true"
spec:
  selector:
    matchLabels:
      k8s-app: node-problem-detector
      version: v0.1
      kubernetes.io/cluster-service: "true"
  template:
    metadata:
      labels:
        k8s-app: node-problem-detector
        version: v0.1
        kubernetes.io/cluster-service: "true"
    spec:
      hostNetwork: true
      containers:
      - name: node-problem-detector
        image: k8s.gcr.io/node-problem-detector:v0.1
        securityContext:
          privileged: true
        resources:
          limits:
            cpu: "200m"
            memory: "100Mi"
          requests:
            cpu: "20m"
            memory: "20Mi"
        volumeMounts:
        - name: log
          mountPath: /log
          readOnly: true
        - name: config # Overwrite the config/ directory with ConfigMap volume
          mountPath: /config
          readOnly: true
      volumes:
      - name: log
        hostPath:
          path: /var/log/
      - name: config # Define ConfigMap volume
        configMap:
          name: node-problem-detector-config

Recreate Node Problem Detector with the new configuration files:

# If Node Problem Detector is already running, delete it before recreating it
kubectl delete -f https://k8s.io/examples/debug/node-problem-detector.yaml
kubectl apply -f https://k8s.io/examples/debug/node-problem-detector-configmap.yaml

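Note: This approach only applies to a Node Problem Detector started with kubectl.

Overwriting the configuration is not supported if Node Problem Detector runs as a cluster Addon. The Addon manager does not support ConfigMap.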
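
To make the ConfigMap step more concrete, the Python sketch below approximates what `kubectl create configmap --from-file=config/` produces: one key per regular file in the directory, with the file content as the value. It assumes kubectl's documented behavior of ignoring subdirectories and symlinks; the function name `configmap_data_from_dir` is hypothetical and exists only for this illustration. When the ConfigMap is mounted at /config, each key appears as a file of the same name.

```python
from pathlib import Path


def configmap_data_from_dir(directory):
    """Approximate the data section of a ConfigMap built from a directory.

    Each regular file becomes one key (the file's base name) whose value
    is the file content. Subdirectories and symlinks are skipped.
    """
    data = {}
    for entry in sorted(Path(directory).iterdir()):
        if entry.is_file() and not entry.is_symlink():
            data[entry.name] = entry.read_text(encoding="utf-8")
    return data


if __name__ == "__main__":
    # Print the keys that a ConfigMap built from ./config would contain,
    # provided that directory exists.
    if Path("config").is_dir():
        for key in configmap_data_from_dir("config"):
            print(key)
```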

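Kernel Monitor

Kernel Monitor is a system log monitor daemon supported in Node Problem Detector. Kernel Monitor watches the kernel log and detects known kernel issues according to predefined rules.

Kernel Monitor matches kernel issues against a list of predefined rules in config/kernel-monitor.json. The rule list is extensible: you can extend it by overwriting the configuration.

Add new NodeConditions

To support a new NodeCondition, create a condition definition within the conditions field in config/kernel-monitor.json: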

{
  "type": "NodeConditionType",
  "reason": "CamelCaseDefaultNodeConditionReason",
  "message": "arbitrary default node condition message"
}
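
As a rough illustration of where such a definition goes, the Python sketch below appends a condition to an in-memory configuration and rejects duplicate types. It is a sketch under assumptions: only the conditions and rules fields named in this page are modeled, the helper `add_condition` is hypothetical, and the example type `ExampleProblem` is invented for the demonstration.

```python
import json


def add_condition(config, cond_type, reason, message):
    """Append a condition definition to config["conditions"].

    Raises ValueError if a condition of the same type is already defined.
    """
    conditions = config.setdefault("conditions", [])
    if any(c["type"] == cond_type for c in conditions):
        raise ValueError(f"condition {cond_type!r} is already defined")
    conditions.append({"type": cond_type, "reason": reason, "message": message})
    return config


# Minimal stand-in for the parsed content of config/kernel-monitor.json.
config = {"conditions": [], "rules": []}
add_condition(
    config,
    "ExampleProblem",              # hypothetical NodeCondition type
    "ExampleSubsystemIsHealthy",   # default reason, in CamelCase
    "example subsystem is functioning properly",
)
print(json.dumps(config, indent=2))
```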

Detect new problems

To detect new problems, you can extend the rules field in config/kernel-monitor.json with a new rule definition:

{
  "type": "temporary/permanent",
  "condition": "NodeConditionOfPermanentIssue",
  "reason": "CamelCaseShortReason",
  "message": "regexp matching the issue in the kernel log"
}
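
The Python sketch below illustrates how rules of this shape could be applied to kernel log lines. It is not Node Problem Detector's implementation: the rule contents in `RULES` are illustrative, the function `classify` is hypothetical, and Python's `re` module stands in for the regular expression engine the daemon actually uses. It assumes that a permanent rule updates the NodeCondition named in its condition field, while a temporary rule only produces an Event.

```python
import re

# Illustrative rules using the field names shown above.
RULES = [
    {
        "type": "temporary",
        "reason": "OOMKilling",
        "message": r"Kill process \d+ \(.+\) score \d+ or sacrifice child",
    },
    {
        "type": "permanent",
        "condition": "KernelDeadlock",
        "reason": "TaskHung",
        "message": r"task \S+:\w+ blocked for more than \d+ seconds\.",
    },
]


def classify(line, rules):
    """Return (kind, condition, reason) for every rule matching the line.

    kind is "condition" for permanent rules and "event" for temporary ones.
    """
    results = []
    for rule in rules:
        if not re.search(rule["message"], line):
            continue
        if rule["type"] == "permanent":
            results.append(("condition", rule["condition"], rule["reason"]))
        else:
            results.append(("event", None, rule["reason"]))
    return results


print(classify("INFO: task docker:20744 blocked for more than 120 seconds.", RULES))
```

Testing a new pattern against sample log lines in this way before deploying it can catch regular expressions that are too broad or too narrow.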

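Configure the path for the kernel log device

Check the kernel log path location in your operating system (OS) distribution. The Linux kernel log device is usually presented as /dev/kmsg. However, the log path location varies by OS distribution. The log field in config/kernel-monitor.json represents the log path inside the container. You can configure the log field to match the device path as seen by Node Problem Detector.

Add support for another log format

Kernel Monitor uses the Translator plugin to translate kernel log entries into its internal data structure. You can implement a new translator for a new log format.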
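
To show the idea of a translator, the Python sketch below parses one record in the /dev/kmsg format (`priority,sequence,timestamp,flags;message`) into a small structure. It is only a conceptual illustration: a real Translator plugin is implemented inside Node Problem Detector itself, and the names `KernelLog` and `translate_kmsg` are hypothetical. The priority value encodes the syslog facility in its upper bits and the log level in its lowest three bits.

```python
from dataclasses import dataclass


@dataclass
class KernelLog:
    """Hypothetical internal representation of one kernel log record."""
    level: int          # syslog level, 0 (emergency) to 7 (debug)
    facility: int       # syslog facility, 0 for the kernel
    sequence: int       # monotonically increasing record number
    timestamp_us: int   # microseconds since boot
    message: str


def translate_kmsg(record):
    """Translate a /dev/kmsg record into a KernelLog.

    Raises ValueError if the record does not follow the expected
    "priority,sequence,timestamp,flags;message" layout.
    """
    header, sep, message = record.partition(";")
    if not sep:
        raise ValueError("missing ';' between header and message")
    fields = header.split(",")
    if len(fields) < 3:
        raise ValueError("header needs at least priority, sequence, timestamp")
    priority = int(fields[0])
    return KernelLog(
        level=priority & 7,
        facility=priority >> 3,
        sequence=int(fields[1]),
        timestamp_us=int(fields[2]),
        message=message.rstrip("\n"),
    )


print(translate_kmsg("6,339,5140900,-;NET: Registered protocol family 10\n"))
```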

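Recommendations and restrictions

It is recommended to run Node Problem Detector in your cluster to monitor node health. When running Node Problem Detector, you can expect extra resource overhead on each node. Usually this is fine, because:

The kernel log grows relatively slowly.
A resource limit is set for Node Problem Detector.
Even under high load, the resource usage is acceptable. For more information, see the Node Problem Detector benchmark result.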

Last modified July 07, 2021 at 1:55 PM PST: fix markdown syntax error (a87976387)