Using sysctls in containers

Using sysctls in containers

Sysctl settings are exposed via Kubernetes, allowing users to modify certain kernel parameters at runtime for namespaces within a container. Only sysctls that are namespaced can be set independently on pods. If a sysctl is not namespaced, called node-level, you must use another method of setting the sysctl, such as the Node Tuning Operator. Moreover, only those sysctls considered safe are whitelisted by default; you can manually enable other unsafe sysctls on the node to be available to the user.

About sysctls

In Linux, the sysctl interface allows an administrator to modify kernel parameters at runtime. Parameters are available via the /proc/sys/ virtual process file system. The parameters cover various subsystems, such as:

kernel (common prefix: kernel.)
networking (common prefix: net.)
virtual memory (common prefix: vm.)
MDADM (common prefix: dev.)

More subsystems are described in Kernel documentation. To get a list of all parameters, run:

$ sudo sysctl -a

Namespaced versus node-level sysctls

A number of sysctls are namespaced in the Linux kernels. This means that you can set them independently for each pod on a node. Being namespaced is a requirement for sysctls to be accessible in a pod context within Kubernetes.

The following sysctls are known to be namespaced:

kernel.shm*
kernel.msg*
kernel.sem
fs.mqueue.*

Additionally, most of the sysctls in the net.* group are known to be namespaced. Their namespace adoption differs based on the kernel version and distributor.

Sysctls that are not namespaced are called node-level and must be set manually by the cluster administrator, either by means of the underlying Linux distribution of the nodes, such as by modifying the /etc/sysctls.conf file, or by using a daemon set with privileged containers. You can use the Node Tuning Operator to set node-level sysctls.

Consider marking nodes with special sysctls as tainted. Only schedule pods onto them that need those sysctl settings. Use the taints and toleration feature to mark the nodes.

Safe versus unsafe sysctls

Sysctls are grouped into safe and unsafe sysctls.

For a sysctl to be considered safe, it must use proper namespacing and must be properly isolated between pods on the same node. This means that if you set a sysctl for one pod it must not:

Influence any other pod on the node
Harm the node’s health
Gain CPU or memory resources outside of the resource limits of a pod

OKD supports, or whitelists, the following sysctls in the safe set:

kernel.shm_rmid_forced
net.ipv4.ip_local_port_range
net.ipv4.tcp_syncookies
net.ipv4.ping_group_range

All safe sysctls are enabled by default. You can use a sysctl in a pod by modifying the Pod spec.

Any sysctl not whitelisted by OKD is considered unsafe for OKD. Note that being namespaced alone is not sufficient for the sysctl to be considered safe.

All unsafe sysctls are disabled by default, and the cluster administrator must manually enable them on a per-node basis. Pods with disabled unsafe sysctls are scheduled but do not launch.

$ oc get pod

Example output

NAME        READY   STATUS            RESTARTS   AGE
hello-pod   0/1     SysctlForbidden   0          14s

Setting sysctls for a pod

You can set sysctls on pods using the pod’s securityContext. The securityContext applies to all containers in the same pod.

Safe sysctls are allowed by default. A pod with unsafe sysctls fails to launch on any node unless the cluster administrator explicitly enables unsafe sysctls for that node. As with node-level sysctls, use the taints and toleration feature or labels on nodes to schedule those pods onto the right nodes.

The following example uses the pod securityContext to set a safe sysctl kernel.shm_rmid_forced and two unsafe sysctls, net.core.somaxconn and kernel.msgmax. There is no distinction between safe and unsafe sysctls in the specification.

To avoid destabilizing your operating system, modify sysctl parameters only after you understand their effects.

Procedure

To use safe and unsafe sysctls:

Modify the YAML file that defines the pod and add the securityContext spec, as shown in the following example:

apiVersion: v1
kind: Pod
metadata:
  name: sysctl-example
spec:
  securityContext:
    sysctls:
    - name: kernel.shm_rmid_forced
      value: "0"
    - name: net.core.somaxconn
      value: "1024"
    - name: kernel.msgmax
      value: "65536"
  ...

Create the pod:

$ oc apply -f <file-name>.yaml

If the unsafe sysctls are not allowed for the node, the pod is scheduled, but does not deploy:

$ oc get pod

Example output

NAME        READY   STATUS            RESTARTS   AGE
hello-pod   0/1     SysctlForbidden   0          14s

Enabling unsafe sysctls

A cluster administrator can allow certain unsafe sysctls for very special situations such as high performance or real-time application tuning.

If you want to use unsafe sysctls, a cluster administrator must enable them individually for a specific type of node. The sysctls must be namespaced.

You can further control which sysctls can be set in pods by specifying lists of sysctls or sysctl patterns in the forbiddenSysctls and allowedUnsafeSysctls fields of the Security Context Constraints.

The forbiddenSysctls option excludes specific sysctls.
The allowedUnsafeSysctls option controls specific needs such as high performance or real-time application tuning.

Due to their nature of being unsafe, the use of unsafe sysctls is at-your-own-risk and can lead to severe problems, such as improper behavior of containers, resource shortage, or breaking a node.

Procedure

Add a label to the machine config pool where the containers where containers with the unsafe sysctls will run:

$ oc edit machineconfigpool worker

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  creationTimestamp: 2019-02-08T14:52:39Z
  generation: 1
  labels:
    custom-kubelet: sysctl (1)

1	Add a `key: pair` label.

Create a KubeletConfig custom resource (CR):

apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: custom-kubelet
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: sysctl (1)
  kubeletConfig:
    allowedUnsafeSysctls: (2)
      - "kernel.msg*"
      - "net.core.somaxconn"

1	Specify the label from the machine config pool.
2	List the unsafe sysctls you want to allow.

Create the object:
```
$ oc apply -f set-sysctl-worker.yaml
```
A new MachineConfig object named in the 99-worker-XXXXXX-XXXXX-XXXX-XXXXX-kubelet format is created.

Wait for the cluster to reboot usng the machineconfigpool object status fields:

For example:

status:
  conditions:
    - lastTransitionTime: '2019-08-11T15:32:00Z'
      message: >-
        All nodes are updating to
        rendered-worker-ccbfb5d2838d65013ab36300b7b3dc13
      reason: ''
      status: 'True'
      type: Updating

A message similar to the following appears when the cluster is ready:

   - lastTransitionTime: '2019-08-11T16:00:00Z'
      message: >-
        All nodes are updated with
        rendered-worker-ccbfb5d2838d65013ab36300b7b3dc13
      reason: ''
      status: 'True'
      type: Updated

When the cluster is ready, check for the merged KubeletConfig object in the new MachineConfig object:

$ oc get machineconfig 99-worker-XXXXXX-XXXXX-XXXX-XXXXX-kubelet -o json | grep ownerReference -A7

        "ownerReferences": [
            {
                "apiVersion": "machineconfiguration.openshift.io/v1",
                "blockOwnerDeletion": true,
                "controller": true,
                "kind": "KubeletConfig",
                "name": "custom-kubelet",
                "uid": "3f64a766-bae8-11e9-abe8-0a1a2a4813f2"
            }
        ]

You can now add unsafe sysctls to pods as needed.