Restrict a Container’s Syscalls with seccomp

FEATURE STATE: Kubernetes v1.19 [stable]

Seccomp stands for secure computing mode and has been a feature of the Linux kernel since version 2.6.12. It can be used to sandbox the privileges of a process, restricting the calls it is able to make from userspace into the kernel. Kubernetes lets you automatically apply seccomp profiles loaded onto a node to your Pods and containers.

Identifying the privileges required for your workloads can be difficult. In this tutorial, you will go through how to load seccomp profiles into a local Kubernetes cluster, how to apply them to a Pod, and how you can begin to craft profiles that give only the necessary privileges to your container processes.

Objectives

  • Learn how to load seccomp profiles on a node
  • Learn how to apply a seccomp profile to a container
  • Observe auditing of syscalls made by a container process
  • Observe behavior when a missing profile is specified
  • Observe a violation of a seccomp profile
  • Learn how to create fine-grained seccomp profiles
  • Learn how to apply a container runtime default seccomp profile

Before you begin

In order to complete all steps in this tutorial, you must install kind and kubectl.

This tutorial shows some examples that are still beta (since v1.25) and others that use only generally available seccomp functionality. You should make sure that your cluster is configured correctly for the version you are using.

The tutorial also uses the curl tool for downloading examples to your computer. You can adapt the steps to use a different tool if you prefer.

Note: It is not possible to apply a seccomp profile to a container running with privileged: true set in the container’s securityContext. Privileged containers always run as Unconfined.

Download example seccomp profiles

The contents of these profiles will be explored later on, but for now go ahead and download them into a directory named profiles/ so that they can be loaded into the cluster.

pods/security/seccomp/profiles/audit.json Restrict a Container’s Syscalls with seccomp - 图1

  1. {
  2. "defaultAction": "SCMP_ACT_LOG"
  3. }

pods/security/seccomp/profiles/violation.json Restrict a Container’s Syscalls with seccomp - 图2

  1. {
  2. "defaultAction": "SCMP_ACT_ERRNO"
  3. }

pods/security/seccomp/profiles/fine-grained.json Restrict a Container’s Syscalls with seccomp - 图3

  1. {
  2. "defaultAction": "SCMP_ACT_ERRNO",
  3. "architectures": [
  4. "SCMP_ARCH_X86_64",
  5. "SCMP_ARCH_X86",
  6. "SCMP_ARCH_X32"
  7. ],
  8. "syscalls": [
  9. {
  10. "names": [
  11. "accept4",
  12. "epoll_wait",
  13. "pselect6",
  14. "futex",
  15. "madvise",
  16. "epoll_ctl",
  17. "getsockname",
  18. "setsockopt",
  19. "vfork",
  20. "mmap",
  21. "read",
  22. "write",
  23. "close",
  24. "arch_prctl",
  25. "sched_getaffinity",
  26. "munmap",
  27. "brk",
  28. "rt_sigaction",
  29. "rt_sigprocmask",
  30. "sigaltstack",
  31. "gettid",
  32. "clone",
  33. "bind",
  34. "socket",
  35. "openat",
  36. "readlinkat",
  37. "exit_group",
  38. "epoll_create1",
  39. "listen",
  40. "rt_sigreturn",
  41. "sched_yield",
  42. "clock_gettime",
  43. "connect",
  44. "dup2",
  45. "epoll_pwait",
  46. "execve",
  47. "exit",
  48. "fcntl",
  49. "getpid",
  50. "getuid",
  51. "ioctl",
  52. "mprotect",
  53. "nanosleep",
  54. "open",
  55. "poll",
  56. "recvfrom",
  57. "sendto",
  58. "set_tid_address",
  59. "setitimer",
  60. "writev"
  61. ],
  62. "action": "SCMP_ACT_ALLOW"
  63. }
  64. ]
  65. }

Run these commands:

  1. mkdir ./profiles
  2. curl -L -o profiles/audit.json https://k8s.io/examples/pods/security/seccomp/profiles/audit.json
  3. curl -L -o profiles/violation.json https://k8s.io/examples/pods/security/seccomp/profiles/violation.json
  4. curl -L -o profiles/fine-grained.json https://k8s.io/examples/pods/security/seccomp/profiles/fine-grained.json
  5. ls profiles

You should see three profiles listed at the end of the final step:

  1. audit.json fine-grained.json violation.json

Create a local Kubernetes cluster with kind

For simplicity, kind can be used to create a single node cluster with the seccomp profiles loaded. Kind runs Kubernetes in Docker, so each node of the cluster is a container. This allows for files to be mounted in the filesystem of each container similar to loading files onto a node.

pods/security/seccomp/kind.yaml Restrict a Container’s Syscalls with seccomp - 图4

  1. apiVersion: kind.x-k8s.io/v1alpha4
  2. kind: Cluster
  3. nodes:
  4. - role: control-plane
  5. extraMounts:
  6. - hostPath: "./profiles"
  7. containerPath: "/var/lib/kubelet/seccomp/profiles"

Download that example kind configuration, and save it to a file named kind.yaml:

  1. curl -L -O https://k8s.io/examples/pods/security/seccomp/kind.yaml

You can set a specific Kubernetes version by setting the node’s container image. See Nodes within the kind documentation about configuration for more details on this. This tutorial assumes you are using Kubernetes v1.25.

As a beta feature, you can configure Kubernetes to use the profile that the container runtime prefers by default, rather than falling back to Unconfined. If you want to try that, see enable the use of RuntimeDefault as the default seccomp profile for all workloads before you continue.

Once you have a kind configuration in place, create the kind cluster with that configuration:

  1. kind create cluster --config=kind.yaml

After the new Kubernetes cluster is ready, identify the Docker container running as the single node cluster:

  1. docker ps

You should see output indicating that a container is running with name kind-control-plane. The output is similar to:

  1. CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
  2. 6a96207fed4b kindest/node:v1.18.2 "/usr/local/bin/entr…" 27 seconds ago Up 24 seconds 127.0.0.1:42223->6443/tcp kind-control-plane

If observing the filesystem of that container, you should see that the profiles/ directory has been successfully loaded into the default seccomp path of the kubelet. Use docker exec to run a command in the Pod:

  1. # Change 6a96207fed4b to the container ID you saw from "docker ps"
  2. docker exec -it 6a96207fed4b ls /var/lib/kubelet/seccomp/profiles
  1. audit.json fine-grained.json violation.json

You have verified that these seccomp profiles are available to the kubelet running within kind.

Enable the use of RuntimeDefault as the default seccomp profile for all workloads

FEATURE STATE: Kubernetes v1.25 [beta]

To use seccomp profile defaulting, you must run the kubelet with the SeccompDefault feature gate enabled (this is the default). You must also explicitly enable the defaulting behavior for each node where you want to use this with the corresponding --seccomp-default command line flag. Both have to be enabled simultaneously to use the feature.

If enabled, the kubelet will use the RuntimeDefault seccomp profile by default, which is defined by the container runtime, instead of using the Unconfined (seccomp disabled) mode. The default profiles aim to provide a strong set of security defaults while preserving the functionality of the workload. It is possible that the default profiles differ between container runtimes and their release versions, for example when comparing those from CRI-O and containerd.

Note: Enabling the feature will neither change the Kubernetes securityContext.seccompProfile API field nor add the deprecated annotations of the workload. This provides users the possibility to rollback anytime without actually changing the workload configuration. Tools like crictl inspect can be used to verify which seccomp profile is being used by a container.

Some workloads may require a lower amount of syscall restrictions than others. This means that they can fail during runtime even with the RuntimeDefault profile. To mitigate such a failure, you can:

  • Run the workload explicitly as Unconfined.
  • Disable the SeccompDefault feature for the nodes. Also making sure that workloads get scheduled on nodes where the feature is disabled.
  • Create a custom seccomp profile for the workload.

If you were introducing this feature into production-like cluster, the Kubernetes project recommends that you enable this feature gate on a subset of your nodes and then test workload execution before rolling the change out cluster-wide.

You can find more detailed information about a possible upgrade and downgrade strategy in the related Kubernetes Enhancement Proposal (KEP): Enable seccomp by default.

Kubernetes 1.25 lets you configure the seccomp profile that applies when the spec for a Pod doesn’t define a specific seccomp profile. This is a beta feature and the corresponding SeccompDefault feature gate is enabled by default. However, you still need to enable this defaulting for each node where you would like to use it.

If you are running a Kubernetes 1.25 cluster and want to enable the feature, either run the kubelet with the --seccomp-default command line flag, or enable it through the kubelet configuration file. To enable the feature gate in kind, ensure that kind provides the minimum required Kubernetes version and enables the SeccompDefault feature in the kind configuration:

  1. kind: Cluster
  2. apiVersion: kind.x-k8s.io/v1alpha4
  3. featureGates:
  4. SeccompDefault: true
  5. nodes:
  6. - role: control-plane
  7. image: kindest/node:v1.23.0@sha256:49824ab1727c04e56a21a5d8372a402fcd32ea51ac96a2706a12af38934f81ac
  8. kubeadmConfigPatches:
  9. - |
  10. kind: JoinConfiguration
  11. nodeRegistration:
  12. kubeletExtraArgs:
  13. seccomp-default: "true"
  14. - role: worker
  15. image: kindest/node:v1.23.0@sha256:49824ab1727c04e56a21a5d8372a402fcd32ea51ac96a2706a12af38934f81ac
  16. kubeadmConfigPatches:
  17. - |
  18. kind: JoinConfiguration
  19. nodeRegistration:
  20. kubeletExtraArgs:
  21. feature-gates: SeccompDefault=true
  22. seccomp-default: "true"

If the cluster is ready, then running a pod:

  1. kubectl run --rm -it --restart=Never --image=alpine alpine -- sh

Should now have the default seccomp profile attached. This can be verified by using docker exec to run crictl inspect for the container on the kind worker:

  1. docker exec -it kind-worker bash -c \
  2. 'crictl inspect $(crictl ps --name=alpine -q) | jq .info.runtimeSpec.linux.seccomp'
  1. {
  2. "defaultAction": "SCMP_ACT_ERRNO",
  3. "architectures": ["SCMP_ARCH_X86_64", "SCMP_ARCH_X86", "SCMP_ARCH_X32"],
  4. "syscalls": [
  5. {
  6. "names": ["..."]
  7. }
  8. ]
  9. }

Create a Pod with a seccomp profile for syscall auditing

To start off, apply the audit.json profile, which will log all syscalls of the process, to a new Pod.

Here’s a manifest for that Pod:

pods/security/seccomp/ga/audit-pod.yaml Restrict a Container’s Syscalls with seccomp - 图5

  1. apiVersion: v1
  2. kind: Pod
  3. metadata:
  4. name: audit-pod
  5. labels:
  6. app: audit-pod
  7. spec:
  8. securityContext:
  9. seccompProfile:
  10. type: Localhost
  11. localhostProfile: profiles/audit.json
  12. containers:
  13. - name: test-container
  14. image: hashicorp/http-echo:0.2.3
  15. args:
  16. - "-text=just made some syscalls!"
  17. securityContext:
  18. allowPrivilegeEscalation: false

Note:

The functional support for the already deprecated seccomp annotations seccomp.security.alpha.kubernetes.io/pod (for the whole pod) and container.seccomp.security.alpha.kubernetes.io/[name] (for a single container) is going to be removed with a future release of Kubernetes. Please always use the native API fields in favor of the annotations.

Since Kubernetes v1.25, kubelets no longer support the annotations, use of the annotations in static pods is no longer supported, and the seccomp annotations are no longer auto-populated when pods with seccomp fields are created. Auto-population of the seccomp fields from the annotations is planned to be removed in a future release.

Create the Pod in the cluster:

  1. kubectl apply -f https://k8s.io/examples/pods/security/seccomp/ga/audit-pod.yaml

This profile does not restrict any syscalls, so the Pod should start successfully.

  1. kubectl get pod/audit-pod
  1. NAME READY STATUS RESTARTS AGE
  2. audit-pod 1/1 Running 0 30s

In order to be able to interact with this endpoint exposed by this container, create a NodePort Services that allows access to the endpoint from inside the kind control plane container.

  1. kubectl expose pod audit-pod --type NodePort --port 5678

Check what port the Service has been assigned on the node.

  1. kubectl get service audit-pod

The output is similar to:

  1. NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
  2. audit-pod NodePort 10.111.36.142 <none> 5678:32373/TCP 72s

Now you can use curl to access that endpoint from inside the kind control plane container, at the port exposed by this Service. Use docker exec to run the curl command within the container belonging to that control plane container:

  1. # Change 6a96207fed4b to the control plane container ID you saw from "docker ps"
  2. docker exec -it 6a96207fed4b curl localhost:32373
  1. just made some syscalls!

You can see that the process is running, but what syscalls did it actually make? Because this Pod is running in a local cluster, you should be able to see those in /var/log/syslog. Open up a new terminal window and tail the output for calls from http-echo:

  1. tail -f /var/log/syslog | grep 'http-echo'

You should already see some logs of syscalls made by http-echo, and if you curl the endpoint in the control plane container you will see more written.

For example:

  1. Jul 6 15:37:40 my-machine kernel: [369128.669452] audit: type=1326 audit(1594067860.484:14536): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=29064 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=51 compat=0 ip=0x46fe1f code=0x7ffc0000
  2. Jul 6 15:37:40 my-machine kernel: [369128.669453] audit: type=1326 audit(1594067860.484:14537): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=29064 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=54 compat=0 ip=0x46fdba code=0x7ffc0000
  3. Jul 6 15:37:40 my-machine kernel: [369128.669455] audit: type=1326 audit(1594067860.484:14538): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=29064 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=202 compat=0 ip=0x455e53 code=0x7ffc0000
  4. Jul 6 15:37:40 my-machine kernel: [369128.669456] audit: type=1326 audit(1594067860.484:14539): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=29064 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=288 compat=0 ip=0x46fdba code=0x7ffc0000
  5. Jul 6 15:37:40 my-machine kernel: [369128.669517] audit: type=1326 audit(1594067860.484:14540): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=29064 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=0 compat=0 ip=0x46fd44 code=0x7ffc0000
  6. Jul 6 15:37:40 my-machine kernel: [369128.669519] audit: type=1326 audit(1594067860.484:14541): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=29064 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=270 compat=0 ip=0x4559b1 code=0x7ffc0000
  7. Jul 6 15:38:40 my-machine kernel: [369188.671648] audit: type=1326 audit(1594067920.488:14559): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=29064 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=270 compat=0 ip=0x4559b1 code=0x7ffc0000
  8. Jul 6 15:38:40 my-machine kernel: [369188.671726] audit: type=1326 audit(1594067920.488:14560): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=29064 comm="http-echo" exe="/http-echo" sig=0 arch=c000003e syscall=202 compat=0 ip=0x455e53 code=0x7ffc0000

You can begin to understand the syscalls required by the http-echo process by looking at the syscall= entry on each line. While these are unlikely to encompass all syscalls it uses, it can serve as a basis for a seccomp profile for this container.

Clean up that Pod and Service before moving to the next section:

  1. kubectl delete service audit-pod --wait
  2. kubectl delete pod audit-pod --wait --now

Create Pod with a seccomp profile that causes violation

For demonstration, apply a profile to the Pod that does not allow for any syscalls.

The manifest for this demonstration is:

pods/security/seccomp/ga/violation-pod.yaml Restrict a Container’s Syscalls with seccomp - 图6

  1. apiVersion: v1
  2. kind: Pod
  3. metadata:
  4. name: violation-pod
  5. labels:
  6. app: violation-pod
  7. spec:
  8. securityContext:
  9. seccompProfile:
  10. type: Localhost
  11. localhostProfile: profiles/violation.json
  12. containers:
  13. - name: test-container
  14. image: hashicorp/http-echo:0.2.3
  15. args:
  16. - "-text=just made some syscalls!"
  17. securityContext:
  18. allowPrivilegeEscalation: false

Attempt to create the Pod in the cluster:

  1. kubectl apply -f https://k8s.io/examples/pods/security/seccomp/ga/violation-pod.yaml

The Pod creates, but there is an issue. If you check the status of the Pod, you should see that it failed to start.

  1. kubectl get pod/violation-pod
  1. NAME READY STATUS RESTARTS AGE
  2. violation-pod 0/1 CrashLoopBackOff 1 6s

As seen in the previous example, the http-echo process requires quite a few syscalls. Here seccomp has been instructed to error on any syscall by setting "defaultAction": "SCMP_ACT_ERRNO". This is extremely secure, but removes the ability to do anything meaningful. What you really want is to give workloads only the privileges they need.

Clean up that Pod before moving to the next section:

  1. kubectl delete pod violation-pod --wait --now

Create Pod with a seccomp profile that only allows necessary syscalls

If you take a look at the fine-grained.json profile, you will notice some of the syscalls seen in syslog of the first example where the profile set "defaultAction": "SCMP_ACT_LOG". Now the profile is setting "defaultAction": "SCMP_ACT_ERRNO", but explicitly allowing a set of syscalls in the "action": "SCMP_ACT_ALLOW" block. Ideally, the container will run successfully and you will see no messages sent to syslog.

The manifest for this example is:

pods/security/seccomp/ga/fine-pod.yaml Restrict a Container’s Syscalls with seccomp - 图7

  1. apiVersion: v1
  2. kind: Pod
  3. metadata:
  4. name: fine-pod
  5. labels:
  6. app: fine-pod
  7. spec:
  8. securityContext:
  9. seccompProfile:
  10. type: Localhost
  11. localhostProfile: profiles/fine-grained.json
  12. containers:
  13. - name: test-container
  14. image: hashicorp/http-echo:0.2.3
  15. args:
  16. - "-text=just made some syscalls!"
  17. securityContext:
  18. allowPrivilegeEscalation: false

Create the Pod in your cluster:

  1. kubectl apply -f https://k8s.io/examples/pods/security/seccomp/ga/fine-pod.yaml
  1. kubectl get pod fine-pod

The Pod should be showing as having started successfully:

  1. NAME READY STATUS RESTARTS AGE
  2. fine-pod 1/1 Running 0 30s

Open up a new terminal window and use tail to monitor for log entries that mention calls from http-echo:

  1. # The log path on your computer might be different from "/var/log/syslog"
  2. tail -f /var/log/syslog | grep 'http-echo'

Next, expose the Pod with a NodePort Service:

  1. kubectl expose pod fine-pod --type NodePort --port 5678

Check what port the Service has been assigned on the node:

  1. kubectl get service fine-pod

The output is similar to:

  1. NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
  2. fine-pod NodePort 10.111.36.142 <none> 5678:32373/TCP 72s

Use curl to access that endpoint from inside the kind control plane container:

  1. # Change 6a96207fed4b to the control plane container ID you saw from "docker ps"
  2. docker exec -it 6a96207fed4b curl localhost:32373
  1. just made some syscalls!

You should see no output in the syslog. This is because the profile allowed all necessary syscalls and specified that an error should occur if one outside of the list is invoked. This is an ideal situation from a security perspective, but required some effort in analyzing the program. It would be nice if there was a simple way to get closer to this security without requiring as much effort.

Clean up that Pod and Service before moving to the next section:

  1. kubectl delete service fine-pod --wait
  2. kubectl delete pod fine-pod --wait --now

Create Pod that uses the container runtime default seccomp profile

Most container runtimes provide a sane set of default syscalls that are allowed or not. You can adopt these defaults for your workload by setting the seccomp type in the security context of a pod or container to RuntimeDefault.

Note: If you have the SeccompDefault feature gate enabled, then Pods use the RuntimeDefault seccomp profile whenever no other seccomp profile is specified. Otherwise, the default is Unconfined.

Here’s a manifest for a Pod that requests the RuntimeDefault seccomp profile for all its containers:

pods/security/seccomp/ga/default-pod.yaml Restrict a Container’s Syscalls with seccomp - 图8

  1. apiVersion: v1
  2. kind: Pod
  3. metadata:
  4. name: default-pod
  5. labels:
  6. app: default-pod
  7. spec:
  8. securityContext:
  9. seccompProfile:
  10. type: RuntimeDefault
  11. containers:
  12. - name: test-container
  13. image: hashicorp/http-echo:0.2.3
  14. args:
  15. - "-text=just made some more syscalls!"
  16. securityContext:
  17. allowPrivilegeEscalation: false

Create that Pod:

  1. kubectl apply -f https://k8s.io/examples/pods/security/seccomp/ga/default-pod.yaml
  1. kubectl get pod default-pod

The Pod should be showing as having started successfully:

  1. NAME READY STATUS RESTARTS AGE
  2. default-pod 1/1 Running 0 20s

Finally, now that you saw that work OK, clean up:

  1. kubectl delete pod default-pod --wait --now

What’s next

You can learn more about Linux seccomp: