Monitoring Cluster Events and Logs

Introduction

In addition to security measures mentioned in other sections of this guide, the ability to monitor and audit an OKD cluster is an important part of safeguarding the cluster and its users against inappropriate usage.

There are two main sources of cluster-level information that are useful for this purpose: events and logs.

Cluster Events

Cluster administrators are encouraged to familiarize themselves with the Event resource type and review the list of event types to determine which events are of interest. Depending on the master controller and plugin configuration, there are typically more potential event types than those documented.

Events are associated with a namespace, either the namespace of the resource they are related to or, for cluster events, the default namespace. The default namespace holds relevant events for monitoring or auditing a cluster, such as Node events and resource events related to infrastructure components.
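
For example, a straightforward way to review what is currently recorded cluster-wide is to list the events in the default namespace directly (output omitted here):

  $ oc get events -n default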

The master API and oc command do not provide parameters to scope a listing of events to only those related to nodes. A simple approach would be to use grep:

  $ oc get event -n default | grep Node
  1h         20h         3         origin-node-1.example.local   Node      Normal    NodeHasDiskPressure   ...

A more flexible approach is to output the events in a form that other tools can process. For example, the following command uses the jq tool against JSON output to extract only NodeHasDiskPressure events:

  $ oc get events -n default -o json \
    | jq '.items[] | select(.involvedObject.kind == "Node" and .reason == "NodeHasDiskPressure")'

  {
    "apiVersion": "v1",
    "count": 3,
    "involvedObject": {
      "kind": "Node",
      "name": "origin-node-1.example.local",
      "uid": "origin-node-1.example.local"
    },
    "kind": "Event",
    "reason": "NodeHasDiskPressure",
    ...
  }

Events related to resource creation, modification, or deletion can also be good candidates for detecting misuse of the cluster. The following query, for example, can be used to look for excessive pulling of images:

  $ oc get events --all-namespaces -o json \
    | jq '[.items[] | select(.involvedObject.kind == "Pod" and .reason == "Pulling")] | length'

  4

When a namespace is deleted, its events are deleted as well. Events also expire and are deleted to prevent filling up etcd storage. Events are not stored as a permanent record, so frequent polling is necessary to capture statistics over time.
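
Because of this, any statistics built on events must be collected while the events still exist. As a minimal sketch (the one-hour interval and the /tmp/pull-counts.log output file are arbitrary choices for illustration), the image-pull count from the previous query could be sampled periodically and appended to a file for later analysis:

  $ while true; do
      echo "$(date -Is) $(oc get events --all-namespaces -o json \
        | jq '[.items[] | select(.involvedObject.kind == "Pod" and .reason == "Pulling")] | length')" \
        >> /tmp/pull-counts.log
      sleep 3600
    done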

Cluster Logs

This section describes the types of operational logs produced on the cluster.

Service Logs

OKD produces logs for services that run as static pods in the cluster:

  • origin-master-api

  • origin-master-controllers

  • etcd

  • origin-node

These logs are intended more for debugging purposes than for security auditing. You can retrieve logs for each service with the master-logs api api, master-logs controllers controllers, or master-logs etcd etcd commands. If your cluster runs an aggregated logging stack, such as an Ops cluster, cluster administrators can retrieve these logs from the .operations indexes.
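
For example, the API server log can be captured to a file for offline review. The following is a minimal sketch, assuming the command is run directly on a master host; the log text is written to standard error, so it is redirected with 2>, and the output file name is arbitrary:

  # master-logs api api 2> /tmp/master-api.log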

The API server, controllers, and etcd static pods run in the kube-system namespace.
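
To confirm which static pods are present and running, they can be listed like any other pods (the pod names include the host name and will differ per cluster):

  $ oc get pods -n kube-system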

Master API Audit Log

To log master API requests by users, administrators, or system components, enable audit logging for the master API. Audit entries are written to a file on each master host or, if no file is configured, to the service’s journal. Entries in the journal can be found by searching for “AUDIT”.
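
Audit logging is enabled in the master configuration file. The following is a minimal sketch of the relevant stanza in /etc/origin/master/master-config.yaml; the file path matches the one used in the examples below, and the rotation values are illustrative only:

  auditConfig:
    enabled: true
    auditFilePath: "/var/log/openshift-audit.log"
    maximumFileRetentionDays: 10
    maximumFileSizeMegabytes: 100
    maximumRetainedFiles: 10

The master API service must be restarted for changes to the master configuration to take effect.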

Audit log entries consist of one line recording each REST request when it is received and one line with the HTTP response code when it completes. For example, here is a record of the system administrator requesting a list of nodes:

  2017-10-17T13:12:17.635085787Z AUDIT: id="410eda6b-88d4-4491-87ff-394804ca69a1" ip="192.168.122.156" method="GET" user="system:admin" groups="\"system:cluster-admins\",\"system:authenticated\"" as="<self>" asgroups="<lookup>" namespace="<none>" uri="/api/v1/nodes"
  2017-10-17T13:12:17.636081056Z AUDIT: id="410eda6b-88d4-4491-87ff-394804ca69a1" response="200"

It might be useful to poll the log periodically for the number of recent requests per response code, as shown in the following example:

  $ tail -5000 /var/log/openshift-audit.log \
    | grep -Po 'response="..."' \
    | sort | uniq -c | sort -rn

  3288 response="200"
     8 response="404"
     6 response="201"

The following list describes some of the response codes in more detail:

  • 200 or 201 response codes indicate a successful request.

  • 400 response codes may be of interest as they indicate a malformed request, which should not occur with most clients.

  • 404 response codes are typically benign requests for a resource that does not exist.

  • 500-599 response codes indicate server errors, which can be a result of bugs, system failures, or even malicious activity.

If an unusual number of error responses is found, the audit log entries for the corresponding requests can be retrieved for further investigation.
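
One rough way to do this with standard shell tools (the log path matches the earlier examples) is to extract the request IDs of recent server errors and then pull every line, request and response, that carries one of those IDs:

  $ tail -5000 /var/log/openshift-audit.log \
    | grep 'response="5' \
    | grep -Po 'id="[^"]*"' \
    | sort -u \
    | while read id; do grep -F "$id" /var/log/openshift-audit.log; done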

The IP address in the request record is typically that of a cluster host or the API load balancer; the audit log does not record the client IP address behind a load balancer or proxy, although load balancer logs can be useful for determining the origin of a request.

It can be useful to look for unusual numbers of requests by a particular user or group.

The following example lists the top 10 users by number of requests in the last 5000 lines of the audit log:

  $ tail -5000 /var/log/openshift-audit.log \
    | grep -Po ' user="(.*?)(?<!\\)"' \
    | sort | uniq -c | sort -rn | head -10

  976 user="system:openshift-master"
  270 user="system:node:origin-node-1.example.local"
  270 user="system:node:origin-master.example.local"
   66 user="system:anonymous"
   32 user="system:serviceaccount:kube-system:cronjob-controller"
   24 user="system:serviceaccount:kube-system:pod-garbage-collector"
   18 user="system:serviceaccount:kube-system:endpoint-controller"
   14 user="system:serviceaccount:openshift-infra:serviceaccount-pull-secrets-controller"
   11 user="test user"
    4 user="test \" user"

More advanced queries generally require the use of additional log analysis tools. Auditors will need a detailed familiarity with the OpenShift v1 API and Kubernetes v1 API to aggregate request summaries from the audit log according to which kind of resource is involved (the uri field). See REST API Reference for details.
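
As a first approximation that needs only standard shell tools, the distinct request paths (with query strings stripped) can be counted directly; mapping those paths back to resource kinds is left to a proper analysis tool:

  $ tail -5000 /var/log/openshift-audit.log \
    | grep -Po 'uri="[^"?]*' \
    | sort | uniq -c | sort -rn | head -10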

More advanced audit logging capabilities are introduced with OKD 3.7 as a Technology Preview feature. This feature makes it possible to provide an audit policy file that controls which requests are logged and the level of detail to log. Advanced audit log entries provide more detail in JSON format and can be sent to a webhook instead of a file or the system journal.
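
As an illustration of the policy file format (which follows the upstream Kubernetes audit API; the rules shown here are only an example, not a recommendation), a minimal policy that skips health-check noise and records metadata for everything else might look like this:

  apiVersion: audit.k8s.io/v1beta1
  kind: Policy
  rules:
    # Do not log requests to read-only health and version endpoints.
    - level: None
      nonResourceURLs:
        - /healthz*
        - /version
    # Record request metadata (user, verb, resource, response code) for everything else.
    - level: Metadata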