Monitor Node Health

Node problem detector is a DaemonSet that monitors node health. It collects node problems from various daemons and reports them to the apiserver as NodeCondition and Event.

It currently detects some known kernel issues, and will detect more node problems over time.

Currently Kubernetes won’t take any action on the node conditions and events generated by node problem detector. In the future, a remedy system could be introduced to deal with node problems.
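
Because Kubernetes does not act on these conditions itself, it can help to inspect them directly. The following is a minimal sketch, not part of node problem detector, that lists the conditions on a Node object that signal a problem. It assumes the JSON shape produced by kubectl get node <name> -o json (status.conditions entries with type, status, reason, and message) and the convention that any condition other than Ready with status "True" indicates a problem. The function name active_problems and the sample values (KernelDeadlock, DockerHung) are illustrative assumptions.

```python
import json


def active_problems(node: dict) -> list:
    """Return (type, reason, message) for each condition that signals a problem."""
    problems = []
    for cond in node.get("status", {}).get("conditions", []):
        # Assumption: "Ready" is the only condition where "True" means healthy.
        if cond.get("type") != "Ready" and cond.get("status") == "True":
            problems.append(
                (cond["type"], cond.get("reason", ""), cond.get("message", ""))
            )
    return problems


# Hypothetical sample shaped like the output of `kubectl get node <name> -o json`.
SAMPLE_NODE = json.loads("""
{
  "status": {
    "conditions": [
      {"type": "Ready", "status": "True",
       "reason": "KubeletReady", "message": "kubelet is posting ready status"},
      {"type": "KernelDeadlock", "status": "True",
       "reason": "DockerHung",
       "message": "task docker:1234 blocked for more than 120 seconds."}
    ]
  }
}
""")

if __name__ == "__main__":
    for ctype, reason, message in active_problems(SAMPLE_NODE):
        print(f"{ctype}: {reason}: {message}")
```

In practice the input would come from the apiserver rather than from a hard-coded sample.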

See the node problem detector project repository for more information.

Before you begin

You need to have a Kubernetes cluster, and the kubectl command-line tool must be configured to communicate with your cluster. If you do not already have a cluster, you can create one by using minikube, or you can use a Kubernetes playground.

To check the version, enter kubectl version.

Limitations

The kernel issue detection of node problem detector currently supports only file-based kernel logs. It doesn't support log tools such as journald.
The kernel issue detection of node problem detector assumes a particular kernel log format, and currently works only on Ubuntu and Debian. However, it is easy to extend it to support other log formats.

Enable/Disable in GCE cluster

Node problem detector runs as a cluster addon and is enabled by default in GCE clusters.

You can enable or disable it by setting the environment variable KUBE_ENABLE_NODE_PROBLEM_DETECTOR before running kube-up.sh.

Use in Other Environments

To enable node problem detector in environments outside of GCE, you can use either kubectl or an addon pod.

Kubectl

This is the recommended way to start node problem detector outside of GCE. It provides more flexible management, such as overwriting the default configuration to fit your environment or to detect customized node problems.

Step 1: Create node-problem-detector.yaml:

debug/node-problem-detector.yaml

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-problem-detector-v0.1
  namespace: kube-system
  labels:
    k8s-app: node-problem-detector
    version: v0.1
    kubernetes.io/cluster-service: "true"
spec:
  selector:
    matchLabels:
      k8s-app: node-problem-detector
      version: v0.1
      kubernetes.io/cluster-service: "true"
  template:
    metadata:
      labels:
        k8s-app: node-problem-detector
        version: v0.1
        kubernetes.io/cluster-service: "true"
    spec:
      hostNetwork: true
      containers:
      - name: node-problem-detector
        image: k8s.gcr.io/node-problem-detector:v0.1
        securityContext:
          privileged: true
        resources:
          limits:
            cpu: "200m"
            memory: "100Mi"
          requests:
            cpu: "20m"
            memory: "20Mi"
        volumeMounts:
        - name: log
          mountPath: /log
          readOnly: true
      volumes:
      - name: log
        hostPath:
          path: /var/log/

Make sure that the system log directory is correct for your OS distribution.

Step 2: Start node problem detector with kubectl:

 kubectl apply -f https://k8s.io/examples/debug/node-problem-detector.yaml

Addon Pod

This option is for those who have their own cluster bootstrap solution and don't need to overwrite the default configuration. They can use the addon pod to further automate the deployment.

Create node-problem-detector.yaml, and place it in the addon pods directory /etc/kubernetes/addons/node-problem-detector on the master node.

Overwrite the Configuration

The default configuration is embedded when building the Docker image of node problem detector.

However, you can use a ConfigMap to overwrite it by following these steps:

Step 1: Change the config files in config/.
Step 2: Create the ConfigMap node-problem-detector-config in the same namespace as the DaemonSet (kube-system in this example) with kubectl create configmap node-problem-detector-config --from-file=config/ --namespace=kube-system.
Step 3: Change node-problem-detector.yaml to use the ConfigMap:

debug/node-problem-detector-configmap.yaml

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-problem-detector-v0.1
  namespace: kube-system
  labels:
    k8s-app: node-problem-detector
    version: v0.1
    kubernetes.io/cluster-service: "true"
spec:
  selector:
    matchLabels:
      k8s-app: node-problem-detector
      version: v0.1
      kubernetes.io/cluster-service: "true"
  template:
    metadata:
      labels:
        k8s-app: node-problem-detector
        version: v0.1
        kubernetes.io/cluster-service: "true"
    spec:
      hostNetwork: true
      containers:
      - name: node-problem-detector
        image: k8s.gcr.io/node-problem-detector:v0.1
        securityContext:
          privileged: true
        resources:
          limits:
            cpu: "200m"
            memory: "100Mi"
          requests:
            cpu: "20m"
            memory: "20Mi"
        volumeMounts:
        - name: log
          mountPath: /log
          readOnly: true
        - name: config # Overwrite the config/ directory with ConfigMap volume
          mountPath: /config
          readOnly: true
      volumes:
      - name: log
        hostPath:
          path: /var/log/
      - name: config # Define ConfigMap volume
        configMap:
          name: node-problem-detector-config

Step 4: Re-create node problem detector with the new YAML file:

 kubectl delete -f https://k8s.io/examples/debug/node-problem-detector.yaml # If you have a node-problem-detector running
 kubectl apply -f https://k8s.io/examples/debug/node-problem-detector-configmap.yaml

Note that this approach applies only to a node problem detector started with kubectl.

Overwriting the configuration is not currently supported for a node problem detector running as a cluster addon, because the addon manager does not support ConfigMap.

Kernel Monitor

Kernel Monitor is a problem daemon in node problem detector. It monitors the kernel log and detects known kernel issues according to predefined rules.

The Kernel Monitor matches kernel issues against a predefined rule list in config/kernel-monitor.json. The rule list is extensible, and you can extend it by overwriting the configuration.

Add New NodeConditions

To support new node conditions, you can extend the conditions field in config/kernel-monitor.json with a new condition definition:

{
  "type": "NodeConditionType",
  "reason": "CamelCaseDefaultNodeConditionReason",
  "message": "arbitrary default node condition message"
}
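
As a rough illustration of this step, the sketch below appends a condition definition to a configuration that has already been loaded from config/kernel-monitor.json into a Python dict. Only the conditions field and its type, reason, and message keys come from the definition above. The helper name add_condition, the CamelCase check, the duplicate-type check, and the sample values are assumptions made for this example, not behavior of node problem detector.

```python
import json
import re

# Assumption: a CamelCase reason starts with a capital letter and has no separators.
CAMEL_CASE = re.compile(r"^[A-Z][A-Za-z0-9]*$")


def add_condition(config: dict, ctype: str, reason: str, message: str) -> dict:
    """Return a copy of config with one more entry in its conditions list."""
    if not CAMEL_CASE.match(reason):
        raise ValueError(f"reason should be CamelCase, got {reason!r}")
    existing = config.get("conditions", [])
    if any(c["type"] == ctype for c in existing):
        raise ValueError(f"condition type {ctype!r} is already defined")
    updated = dict(config)  # shallow copy; the caller's dict is left untouched
    updated["conditions"] = list(existing) + [
        {"type": ctype, "reason": reason, "message": message}
    ]
    return updated


if __name__ == "__main__":
    # Hypothetical starting configuration with an empty conditions list.
    config = {"conditions": []}
    config = add_condition(
        config, "FilesystemCorrupt", "FilesystemIsHealthy", "filesystem is healthy"
    )
    print(json.dumps(config, indent=2))
```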

Detect New Problems

To detect new problems, you can extend the rules field in config/kernel-monitor.json with a new rule definition:

{
  "type": "temporary/permanent",
  "condition": "NodeConditionOfPermanentIssue",
  "reason": "CamelCaseShortReason",
  "message": "regexp matching the issue in the kernel log"
}
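
To make the rule semantics more concrete, here is a simplified sketch of how such rules could be evaluated against kernel log lines. It is not the Kernel Monitor implementation. It assumes that the message regular expression may match anywhere in a line, that a temporary match is reported as an event, and that a permanent match sets the node condition named in the condition field. The function name evaluate, the sample rules, and the sample log lines are hypothetical.

```python
import re


def evaluate(rules, log_lines):
    """Match every log line against every rule; return (events, conditions)."""
    events = []      # (reason, matching line) for each temporary match
    conditions = {}  # condition type -> (reason, matching line)
    compiled = [(rule, re.compile(rule["message"])) for rule in rules]
    for line in log_lines:
        for rule, pattern in compiled:
            if not pattern.search(line):
                continue
            if rule["type"] == "temporary":
                events.append((rule["reason"], line))
            elif rule["type"] == "permanent":
                conditions[rule["condition"]] = (rule["reason"], line)
    return events, conditions


# Hypothetical rules in the shape shown above.
RULES = [
    {"type": "temporary", "reason": "OOMKilling",
     "message": r"Kill process \d+ \(.+\) score \d+ or sacrifice child"},
    {"type": "permanent", "condition": "KernelDeadlock", "reason": "DockerHung",
     "message": r"task docker:\w+ blocked for more than \w+ seconds\."},
]

# Hypothetical kernel log messages.
LINES = [
    "Kill process 1012 (heapster) score 1000 or sacrifice child",
    "task docker:7 blocked for more than 120 seconds.",
    "eth0: link up",
]

if __name__ == "__main__":
    events, conditions = evaluate(RULES, LINES)
    print("events:", events)
    print("conditions:", conditions)
```

Testing a new regular expression against sample log lines in this way, before adding it to the rule list, is one way to catch patterns that never match.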

Change Log Path

The kernel log may be located at a different path depending on the OS distribution. The log field in config/kernel-monitor.json is the log path inside the container, so configure it to match your OS distribution.
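
Because the log field refers to a path inside the container, it depends on how the host log directory is mounted. The sketch below works out the container path for a given log file on the node. It assumes the hostPath /var/log/ and mountPath /log used in the DaemonSet shown earlier. The helper name container_log_path and the example file names are hypothetical.

```python
import posixpath


def container_log_path(host_log, host_dir="/var/log/", mount_path="/log"):
    """Translate a log path on the node into the path seen inside the container."""
    host_dir = posixpath.normpath(host_dir)
    host_log = posixpath.normpath(host_log)
    # The file is only visible in the container if it lives under the mounted directory.
    if posixpath.commonpath([host_dir, host_log]) != host_dir:
        raise ValueError(f"{host_log} is not under the mounted directory {host_dir}")
    return posixpath.join(mount_path, posixpath.relpath(host_log, host_dir))


if __name__ == "__main__":
    # Example host paths; check where your distribution actually writes the kernel log.
    print(container_log_path("/var/log/kern.log"))
    print(container_log_path("/var/log/messages"))
```

If the kernel log lives outside the mounted directory, the hostPath volume in the DaemonSet needs to change as well as the log field.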

Support Other Log Format

Kernel Monitor uses a Translator plugin to translate the kernel log into its internal data structure. It is easy to implement a new translator for a new log format.
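
The idea of a translator can be sketched as a function that turns one raw log line into a structured record. The example below is an illustration only and does not reflect the real plugin interface. It assumes a syslog-style line that contains "kernel: [<seconds since boot>] <message>". The KernelLog record, its fields, and the function name translate are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class KernelLog(NamedTuple):
    """Hypothetical internal record for one kernel log entry."""
    uptime_seconds: float
    message: str


# Assumed line format: "<date> <host> kernel: [  123.456789] <message>"
LINE = re.compile(r"kernel: \[\s*(?P<uptime>\d+\.\d+)\]\s*(?P<message>.*)$")


def translate(line: str) -> Optional[KernelLog]:
    """Parse one raw log line; return None if it is not in the assumed format."""
    match = LINE.search(line)
    if match is None:
        return None
    return KernelLog(float(match.group("uptime")), match.group("message"))


if __name__ == "__main__":
    raw = ("Jan  2 03:04:05 node-1 kernel: [  123.456789] "
           "task docker:7 blocked for more than 120 seconds.")
    print(translate(raw))
```

Supporting another log format would then mean supplying a different pattern or parsing function that produces the same kind of record.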

Caveats

It is recommended to run the node problem detector in your cluster to monitor the node health. However, you should be aware that this will introduce extra resource overhead on each node. Usually this is fine, because:

The kernel log is generated relatively slowly.
A resource limit is set for node problem detector.
Even under high load, the resource usage is acceptable (see the benchmark result).