Optimizing CPU usage with mount namespace encapsulation

Optimizing CPU usage with mount namespace encapsulation

You can optimize CPU usage in OKD clusters by using mount namespace encapsulation to provide a private namespace for kubelet and CRI-O processes. This reduces the cluster CPU resources used by systemd with no difference in functionality.

Mount namespace encapsulation is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.

Encapsulating mount namespaces

Mount namespaces are used to isolate mount points so that processes in different namespaces cannot view each others’ files. Encapsulation is the process of moving Kubernetes mount namespaces to an alternative location where they will not be constantly scanned by the host operating system.

The host operating system uses systemd to constantly scan all mount namespaces: both the standard Linux mounts and the numerous mounts that Kubernetes uses to operate. The current implementation of kubelet and CRI-O both use the top-level namespace for all container runtime and kubelet mount points. However, encapsulating these container-specific mount points in a private namespace reduces systemd overhead with no difference in functionality. Using a separate mount namespace for both CRI-O and kubelet can encapsulate container-specific mounts from any systemd or other host operating system interaction.

This ability to potentially achieve major CPU optimization is now available to all OKD administrators. Encapsulation can also improve security by storing Kubernetes-specific mount points in a location safe from inspection by unprivileged users.

The following diagrams illustrate a Kubernetes installation before and after encapsulation. Both scenarios show example containers which have mount propagation settings of bidirectional, host-to-container, and none.

Here we see systemd, host operating system processes, kubelet, and the container runtime sharing a single mount namespace.

systemd, host operating system processes, kubelet, and the container runtime each have access to and visibility of all mount points.
Container 1, configured with bidirectional mount propagation, can access systemd and host mounts, kubelet and CRI-O mounts. A mount originating in Container 1, such as /run/a is visible to systemd, host operating system processes, kubelet, container runtime, and other containers with host-to-container or bidirectional mount propagation configured (as in Container 2).
Container 2, configured with host-to-container mount propagation, can access systemd and host mounts, kubelet and CRI-O mounts. A mount originating in Container 2, such as /run/b, is not visible to any other context.
Container 3, configured with no mount propagation, has no visibility of external mount points. A mount originating in Container 3, such as /run/c, is not visible to any other context.

The following diagram illustrates the system state after encapsulation.

The main systemd process is no longer devoted to unnecessary scanning of Kubernetes-specific mount points. It only monitors systemd-specific and host mount points.
The host operating system processes can access only the systemd and host mount points.
Using a separate mount namespace for both CRI-O and kubelet completely separates all container-specific mounts away from any systemd or other host operating system interaction whatsoever.
The behavior of Container 1 is unchanged, except a mount it creates such as /run/a is no longer visible to systemd or host operating system processes. It is still visible to kubelet, CRI-O, and other containers with host-to-container or bidirectional mount propagation configured (like Container 2).
The behavior of Container 2 and Container 3 is unchanged.

Configuring mount namespace encapsulation

You can configure mount namespace encapsulation so that a cluster runs with less resource overhead.

Mount namespace encapsulation is a Technology Preview feature and it is disabled by default. To use it, you must enable the feature manually.

Prerequisites

You have installed the OpenShift CLI (oc).
You have logged in as a user with cluster-admin privileges.

Procedure

Create a file called mount_namespace_config.yaml with the following YAML:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 99-kubens-master
spec:
  config:
    ignition:
      version: 3.2.0
    systemd:
      units:
      - enabled: true
        name: kubens.service
---
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-kubens-worker
spec:
  config:
    ignition:
      version: 3.2.0
    systemd:
      units:
      - enabled: true
        name: kubens.service

Apply the mount namespace MachineConfig CR by running the following command:

$ oc apply -f mount_namespace_config.yaml

Example output

machineconfig.machineconfiguration.openshift.io/99-kubens-master created
machineconfig.machineconfiguration.openshift.io/99-kubens-worker created

The MachineConfig CR can take up to 30 minutes to finish being applied in the cluster. You can check the status of the MachineConfig CR by running the following command:

$ oc get mcp

Example output

NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-03d4bc4befb0f4ed3566a2c8f7636751   False     True       False      3              0                   0                     0                      45m
worker   rendered-worker-10577f6ab0117ed1825f8af2ac687ddf   False     True       False      3              1                   1

Wait for the MachineConfig CR to be applied successfully across all control plane and worker nodes after running the following command:

$ oc wait --for=condition=Updated mcp --all --timeout=30m

Example output

machineconfigpool.machineconfiguration.openshift.io/master condition met
machineconfigpool.machineconfiguration.openshift.io/worker condition met

Verification

To verify encapsulation for a cluster host, run the following commands:

Open a debug shell to the cluster host:
```
$ oc debug node/<node_name>
```
Open a chroot session:
```
sh-4.4# chroot /host
```
Check the systemd mount namespace:
```
sh-4.4# readlink /proc/1/ns/mnt
```
Example output
```
mnt:[4026531953]
```

Check kubelet mount namespace:

sh-4.4# readlink /proc/$(pgrep kubelet)/ns/mnt

Example output

mnt:[4026531840]

Check the CRI-O mount namespace:

sh-4.4# readlink /proc/$(pgrep crio)/ns/mnt

Example output

mnt:[4026531840]

These commands return the mount namespaces associated with systemd, kubelet, and the container runtime. In OKD, the container runtime is CRI-O.

Encapsulation is in effect if systemd is in a different mount namespace to kubelet and CRI-O as in the above example. Encapsulation is not in effect if all three processes are in the same mount namespace.

Inspecting encapsulated namespaces

You can inspect Kubernetes-specific mount points in the cluster host operating system for debugging or auditing purposes by using the kubensenter script that is available in Fedora CoreOS (FCOS).

SSH shell sessions to the cluster host are in the default namespace. To inspect Kubernetes-specific mount points in an SSH shell prompt, you need to run the kubensenter script as root. The kubensenter script is aware of the state of the mount encapsulation, and is safe to run even if encapsulation is not enabled.

oc debug remote shell sessions start inside the Kubernetes namespace by default. You do not need to run kubensenter to inspect mount points when you use oc debug.

If the encapsulation feature is not enabled, the kubensenter findmnt and findmnt commands return the same output, regardless of whether they are run in an oc debug session or in an SSH shell prompt.

Prerequisites

You have installed the OpenShift CLI (oc).
You have logged in as a user with cluster-admin privileges.
You have configured SSH access to the cluster host.

Procedure

Open a remote SSH shell to the cluster host. For example:
```
$ ssh core@<node_name>
```

Run commands using the provided kubensenter script as the root user. To run a single command inside the Kubernetes namespace, provide the command and any arguments to the kubensenter script. For example, to run the findmnt command inside the Kubernetes namespace, run the following command:

[core@control-plane-1 ~]$ sudo kubensenter findmnt

Example output

kubensenter: Autodetect: kubens.service namespace found at /run/kubens/mnt
TARGET                                SOURCE                 FSTYPE     OPTIONS
/                                     /dev/sda4[/ostree/deploy/rhcos/deploy/32074f0e8e5ec453e56f5a8a7bc9347eaa4172349ceab9c22b709d9d71a3f4b0.0]
|                                                            xfs        rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,prjquota
                                      shm                    tmpfs
...

To start a new interactive shell inside the Kubernetes namespace, run the kubensenter script without any arguments:
```
[core@control-plane-1 ~]$ sudo kubensenter
```
Example output
```
kubensenter: Autodetect: kubens.service namespace found at /run/kubens/mnt
```

Running additional services in the encapsulated namespace

Any monitoring tool that relies on the ability to run in the host operating system and have visibility of mount points created by kubelet, CRI-O, or containers themselves, must enter the container mount namespace to see these mount points. The kubensenter script that is provided with OKD executes another command inside the Kubernetes mount point and can be used to adapt any existing tools.

The kubensenter script is aware of the state of the mount encapsulation feature status, and is safe to run even if encapsulation is not enabled. In that case the script executes the provided command in the default mount namespace.

For example, if a systemd service needs to run inside the new Kubernetes mount namespace, edit the service file and use the ExecStart= command line with kubensenter.

[Unit]
Description=Example service
[Service]
ExecStart=/usr/bin/kubensenter /path/to/original/command arg1 arg2

Optimizing CPU usage

Optimizing CPU usage with mount namespace encapsulation

Encapsulating mount namespaces

Configuring mount namespace encapsulation

Inspecting encapsulated namespaces

Running additional services in the encapsulated namespace

Additional resources