Using sysctls in containers
Sysctl settings are exposed through Kubernetes, allowing users to modify certain kernel parameters at runtime for the namespaces within a container. Only sysctls that are namespaced can be set independently on pods. If a sysctl is not namespaced (called node-level), you must use another method of setting it, such as the Node Tuning Operator. Moreover, only sysctls considered safe are whitelisted by default; you can manually enable other, unsafe sysctls on the node to make them available to users.
About sysctls
In Linux, the sysctl interface allows an administrator to modify kernel parameters at runtime. Parameters are available via the /proc/sys/ virtual process file system. The parameters cover various subsystems, such as:
kernel (common prefix: kernel.)
networking (common prefix: net.)
virtual memory (common prefix: vm.)
MDADM (common prefix: dev.)
More subsystems are described in Kernel documentation. To get a list of all parameters, run:
$ sudo sysctl -a
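For example, to read a single parameter, you can query it by name or read the corresponding file under /proc/sys. The kernel.shm_rmid_forced parameter is used here only as an illustration, and the value shown is its common default:
$ sysctl kernel.shm_rmid_forced
Example output
kernel.shm_rmid_forced = 0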
Namespaced versus node-level sysctls
A number of sysctls are namespaced in current Linux kernels. This means that you can set them independently for each pod on a node. Being namespaced is a requirement for sysctls to be accessible in a pod context within Kubernetes.
The following sysctls are known to be namespaced:
kernel.shm*
kernel.msg*
kernel.sem
fs.mqueue.*
Additionally, most of the sysctls in the net.* group are known to be namespaced. Their namespace adoption differs based on the kernel version and distributor.
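As a sketch of what namespacing means, the following commands, run directly on a Linux host with the unshare utility available, set a namespaced net.* parameter inside a new network namespace; the host value is not affected. The values shown are typical defaults, used here only for illustration:
$ sudo unshare --net sysctl -w net.ipv4.ip_local_port_range="32768 65000"
Example output
net.ipv4.ip_local_port_range = 32768 65000
$ sysctl net.ipv4.ip_local_port_range
Example output
net.ipv4.ip_local_port_range = 32768	60999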
Sysctls that are not namespaced are called node-level and must be set manually by the cluster administrator, either by means of the underlying Linux distribution of the nodes, such as by modifying the /etc/sysctl.conf file, or by using a daemon set with privileged containers. You can use the Node Tuning Operator to set node-level sysctls.
Consider marking nodes with special sysctls as tainted. Only schedule pods onto them that need those sysctl settings. Use the taints and toleration feature to mark the nodes.
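For example, a cluster administrator could taint and label such nodes so that only pods that tolerate the taint are scheduled there. The node name, taint key, and label value below are placeholders:
$ oc adm taint nodes <node-name> sysctl=special:NoSchedule
$ oc label node <node-name> sysctl=special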
Safe versus unsafe sysctls
Sysctls are grouped into safe and unsafe sysctls.
For a sysctl to be considered safe, it must use proper namespacing and must be properly isolated between pods on the same node. This means that if you set a sysctl for one pod it must not:
Influence any other pod on the node
Harm the node’s health
Gain CPU or memory resources outside of the resource limits of a pod
OKD supports, or whitelists, the following sysctls in the safe set:
kernel.shm_rmid_forced
net.ipv4.ip_local_port_range
net.ipv4.tcp_syncookies
All safe sysctls are enabled by default. You can use a sysctl in a pod by modifying the Pod spec.
Any sysctl not whitelisted by OKD is considered unsafe. Note that being namespaced alone is not sufficient for a sysctl to be considered safe.
All unsafe sysctls are disabled by default, and the cluster administrator must manually enable them on a per-node basis. Pods with disabled unsafe sysctls are scheduled but do not launch.
$ oc get pod
Example output
NAME READY STATUS RESTARTS AGE
hello-pod 0/1 SysctlForbidden 0 14s
Setting sysctls for a pod
You can set sysctls on pods using the pod’s securityContext. The securityContext applies to all containers in the same pod.
Safe sysctls are allowed by default. A pod with unsafe sysctls fails to launch on any node unless the cluster administrator explicitly enables unsafe sysctls for that node. As with node-level sysctls, use the taints and toleration feature or labels on nodes to schedule those pods onto the right nodes.
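A minimal sketch of the corresponding pod-side scheduling settings, assuming the placeholder taint and label from the earlier example (sysctl=special), could combine a toleration with a node selector:
apiVersion: v1
kind: Pod
metadata:
  name: sysctl-tolerating-pod
spec:
  nodeSelector:
    sysctl: special (1)
  tolerations:
  - key: "sysctl"
    operator: "Equal"
    value: "special"
    effect: "NoSchedule" (2)
  containers:
  - name: app
    image: <image>
1 Schedule only onto nodes that carry the placeholder label.
2 Tolerate the placeholder taint set by the cluster administrator.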
The following example uses the pod securityContext to set a safe sysctl, kernel.shm_rmid_forced, and two unsafe sysctls, net.core.somaxconn and kernel.msgmax. There is no distinction between safe and unsafe sysctls in the specification.
To avoid destabilizing your operating system, modify sysctl parameters only after you understand their effects.
Procedure
To use safe and unsafe sysctls:
Modify the YAML file that defines the pod and add the securityContext spec, as shown in the following example:
apiVersion: v1
kind: Pod
metadata:
  name: sysctl-example
spec:
  securityContext:
    sysctls:
    - name: kernel.shm_rmid_forced
      value: "0"
    - name: net.core.somaxconn
      value: "1024"
    - name: kernel.msgmax
      value: "65536"
...
Create the pod:
$ oc apply -f <file-name>.yaml
If the unsafe sysctls are not allowed for the node, the pod is scheduled, but does not deploy:
$ oc get pod
Example output
NAME READY STATUS RESTARTS AGE
hello-pod 0/1 SysctlForbidden 0 14s
Enabling unsafe sysctls
A cluster administrator can allow certain unsafe sysctls for very special situations such as high performance or real-time application tuning.
If you want to use unsafe sysctls, a cluster administrator must enable them individually for a specific type of node. The sysctls must be namespaced.
You can further control which sysctls can be set in pods by specifying lists of sysctls or sysctl patterns in the forbiddenSysctls and allowedUnsafeSysctls fields of the Security Context Constraints.
The forbiddenSysctls option excludes specific sysctls.
The allowedUnsafeSysctls option controls specific needs such as high performance or real-time application tuning.
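For illustration only, an excerpt of a custom Security Context Constraints object that uses both fields might look like the following. The object name and the patterns are placeholders, and the other required SCC fields are omitted:
apiVersion: security.openshift.io/v1
kind: SecurityContextConstraints
metadata:
  name: custom-sysctls (1)
forbiddenSysctls:
- "kernel.msg*" (2)
allowedUnsafeSysctls:
- "net.core.somaxconn" (3)
1 A hypothetical SCC name.
2 Sysctls matching this pattern cannot be set in pods admitted by this SCC.
3 This unsafe sysctl can be set in pods admitted by this SCC, provided the node also allows it.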
Because they are unsafe, you use unsafe sysctls at your own risk; doing so can lead to severe problems, such as improper behavior of containers, resource shortages, or a broken node.
Procedure
Add a label to the machine config pool where the containers with the unsafe sysctls will run:
$ oc edit machineconfigpool worker
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  creationTimestamp: 2019-02-08T14:52:39Z
  generation: 1
  labels:
    custom-kubelet: sysctl (1)
1 Add a key: value pair label.
Create a KubeletConfig custom resource (CR):
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: custom-kubelet
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: sysctl (1)
  kubeletConfig:
    allowedUnsafeSysctls: (2)
    - "kernel.msg*"
    - "net.core.somaxconn"
1 Specify the label from the machine config pool.
2 List the unsafe sysctls you want to allow.
Create the object:
$ oc apply -f set-sysctl-worker.yaml
A new MachineConfig object named in the 99-worker-XXXXXX-XXXXX-XXXX-XXXXX-kubelet format is created.
Wait for the cluster to reboot, using the status fields of the machineconfigpool object to monitor progress.
For example:
status:
  conditions:
  - lastTransitionTime: '2019-08-11T15:32:00Z'
    message: >-
      All nodes are updating to
      rendered-worker-ccbfb5d2838d65013ab36300b7b3dc13
    reason: ''
    status: 'True'
    type: Updating
A message similar to the following appears when the cluster is ready:
  - lastTransitionTime: '2019-08-11T16:00:00Z'
    message: >-
      All nodes are updated with
      rendered-worker-ccbfb5d2838d65013ab36300b7b3dc13
    reason: ''
    status: 'True'
    type: Updated
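One way to view these status fields, assuming the pool is named worker, is to inspect the pool object directly:
$ oc get machineconfigpool worker -o yaml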
When the cluster is ready, check for the merged KubeletConfig object in the new MachineConfig object:
$ oc get machineconfig 99-worker-XXXXXX-XXXXX-XXXX-XXXXX-kubelet -o json | grep ownerReference -A7
"ownerReferences": [
{
"apiVersion": "machineconfiguration.openshift.io/v1",
"blockOwnerDeletion": true,
"controller": true,
"kind": "KubeletConfig",
"name": "custom-kubelet",
"uid": "3f64a766-bae8-11e9-abe8-0a1a2a4813f2"
}
]
You can now add unsafe sysctls to pods as needed.