Security Contexts, Seccomp, and AppArmor#
Security contexts control what a container can do at the Linux kernel level: which user it runs as, which syscalls it can make, which files it can access, and whether it can escalate privileges. These settings are your last line of defense when a container is compromised. A properly configured security context limits the blast radius of a breach by preventing an attacker from escaping the container, accessing the host, or escalating to root.
Pod-Level vs Container-Level Security Context#
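Kubernetes supports security contexts at two levels. Pod-level settings apply to all containers in the pod. For fields that exist at both levels (such as runAsUser), a container-level setting overrides the pod-level value for that specific container. Some fields exist at only one level: fsGroup is pod-only, while capabilities, readOnlyRootFilesystem, and allowPrivilegeEscalation are container-only.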

    apiVersion: v1
    kind: Pod
    metadata:
      name: hardened-app
    spec:
      securityContext:          # Pod-level: applies to all containers
        runAsNonRoot: true
        runAsUser: 1000
        runAsGroup: 1000
        fsGroup: 1000
        fsGroupChangePolicy: OnRootMismatch
        seccompProfile:
          type: RuntimeDefault
      containers:
        - name: app
          image: myregistry.io/app:v2.1.0
          securityContext:      # Container-level: overrides pod-level for this container
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities:
              drop: ["ALL"]
              add: ["NET_BIND_SERVICE"]

Key Security Context Fields#
runAsUser / runAsGroup – the UID and GID the container process runs as. Set these explicitly rather than relying on the image’s Dockerfile USER directive, which may be missing or may be changed by whoever builds the image.

    securityContext:
      runAsUser: 10001
      runAsGroup: 10001

runAsNonRoot – the kubelet refuses to start the container if it would run as UID 0. The API server still accepts the pod, but the container stays in an error state instead of running as root. This is a safety net: even if someone pushes an image that defaults to root, the container will not start.
readOnlyRootFilesystem – prevents any writes to the container’s root filesystem. This blocks attackers from modifying binaries, dropping scripts, or writing to unexpected locations. Applications that need writable directories (logs, temp files, caches) use mounted volumes instead.
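allowPrivilegeEscalation – when set to false, sets the kernel’s no_new_privs flag on the container process, so child processes cannot gain more privileges than the parent. With the flag set, setuid and setgid binaries and file capabilities no longer grant extra privileges. Set this to false unless you have a specific reason not to.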
capabilities – Linux capabilities are fine-grained privilege controls. Instead of giving a container full root access, you grant only the specific capabilities it needs.
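fsGroup – sets group ownership to this GID on mounted volumes that support ownership management, and adds the GID to the container process’s supplementary groups. Containers running as non-root can then read and write to volumes owned by the fsGroup.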
fsGroupChangePolicy – controls when Kubernetes applies the fsGroup ownership change. Always recursively chowns every file on every mount, which is slow on large volumes. OnRootMismatch only changes ownership if the root directory of the volume does not already match fsGroup. Use OnRootMismatch for any volume with more than a few files.
Linux Capabilities#
The correct approach is to drop ALL capabilities and add back only what the container needs:

    securityContext:
      capabilities:
        drop: ["ALL"]
        add: ["NET_BIND_SERVICE"]

Common capabilities you might need to add back:
| Capability | Use Case |
|---|---|
| NET_BIND_SERVICE | Bind to ports below 1024 (nginx on port 80, etc.) |
| SYS_PTRACE | Process tracing, needed by some monitoring/debugging tools |
| NET_RAW | Raw sockets, needed by ping and some network tools |
| CHOWN | Change file ownership, needed by some init containers |
| SETUID / SETGID | Change process UID/GID, needed by some init containers |
If you are unsure which capabilities an application needs, run it with all capabilities dropped and check the error messages. Most applications need zero additional capabilities.
Seccomp Profiles#
Seccomp (Secure Computing Mode) filters which Linux syscalls a container can make. This is a powerful defense against container escapes – even if an attacker gets code execution inside a container, a restrictive profile keeps them from calling dangerous syscalls such as mount, reboot, or kexec_load.
Three profile types:
RuntimeDefault – the container runtime’s built-in seccomp profile. Docker’s default profile, for example, is an allowlist that disables around 44 of the 300+ Linux syscalls, and containerd ships a similar default. This is the right choice for most workloads.

    securityContext:
      seccompProfile:
        type: RuntimeDefault

Localhost – a custom profile loaded from the node’s filesystem. Use this when RuntimeDefault is too permissive or too restrictive for your workload.

    securityContext:
      seccompProfile:
        type: Localhost
        localhostProfile: profiles/custom-app.json

The profile file must exist at /var/lib/kubelet/seccomp/profiles/custom-app.json on every node where the pod might be scheduled.
Unconfined – no seccomp filtering at all. Avoid in production. Any syscall is allowed.
A custom seccomp profile is a JSON file listing allowed or blocked syscalls:

    {
      "defaultAction": "SCMP_ACT_ERRNO",
      "architectures": ["SCMP_ARCH_X86_64", "SCMP_ARCH_AARCH64"],
      "syscalls": [
        {
          "names": ["read", "write", "open", "close", "stat", "fstat",
                    "mmap", "mprotect", "munmap", "brk", "ioctl",
                    "access", "pipe", "select", "sched_yield", "clone",
                    "execve", "exit", "exit_group", "futex", "epoll_wait",
                    "socket", "connect", "accept", "bind", "listen",
                    "sendto", "recvfrom", "getsockname", "getpeername"],
          "action": "SCMP_ACT_ALLOW"
        }
      ]
    }

The defaultAction: SCMP_ACT_ERRNO blocks everything by default, and then the syscalls list allows specific calls. The list above is illustrative rather than complete: a real application typically needs more calls (openat, for example), and several of the names shown, such as open, select, and epoll_wait, do not exist on arm64. Building a custom profile requires knowing exactly which syscalls your application uses. Tools like strace or the Security Profiles Operator can help generate profiles.
Security Profiles Operator#
Managing seccomp profiles as files on every node is painful. The Security Profiles Operator (SPO) lets you manage profiles as Kubernetes CRDs:

    apiVersion: security-profiles-operator.x-k8s.io/v1beta1
    kind: SeccompProfile
    metadata:
      name: custom-app-profile
      namespace: production
    spec:
      defaultAction: SCMP_ACT_ERRNO
      syscalls:
        - action: SCMP_ACT_ALLOW
          names:
            - read
            - write
            - open
            - close
            - exit_group

SPO syncs the profile to all nodes and handles cleanup. It also supports recording mode – run a pod and SPO records which syscalls it makes, then generates a profile.
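As a rough sketch of how a pod might consume an SPO-managed profile such as custom-app-profile: the operator typically places profiles under an operator/<namespace>/ subdirectory of the kubelet’s seccomp root, so the pod can reference it as a Localhost profile. The pod name and the exact path below are assumptions; the authoritative path is the one reported in the SeccompProfile’s status.localhostProfile field.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: custom-app            # hypothetical pod name
  namespace: production
spec:
  securityContext:
    seccompProfile:
      type: Localhost
      # Assumed layout: operator/<namespace>/<profile-name>.json
      # Verify against status.localhostProfile of the SeccompProfile object.
      localhostProfile: operator/production/custom-app-profile.json
  containers:
    - name: app
      image: myregistry.io/app:v2.1.0
```

Reading the path from the profile’s status rather than hard-coding it avoids breakage if the operator’s directory layout differs between versions.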
AppArmor#
AppArmor is a Mandatory Access Control (MAC) system that restricts which files, paths, and network operations a process can access. It complements seccomp: seccomp controls which syscalls are allowed, AppArmor controls what resources those syscalls can access.
Apply an AppArmor profile to a container via annotation. The annotation key ends with the container name (app in this example). Kubernetes 1.30 and later also accept a securityContext.appArmorProfile field and treat the annotation as deprecated:

    apiVersion: v1
    kind: Pod
    metadata:
      name: apparmor-demo
      annotations:
        container.apparmor.security.beta.kubernetes.io/app: localhost/custom-profile
    spec:
      containers:
        - name: app
          image: myregistry.io/app:v2.1.0

The profile must be loaded on every node where the pod might run. Most container runtimes apply a default AppArmor profile that blocks writes to sensitive paths under /proc and /sys and denies mounting filesystems.
Common profile values:
runtime/default – the container runtime’s default AppArmor profile
localhost/<profile-name> – a custom profile loaded on the node
unconfined – no AppArmor restrictions (avoid in production)
AppArmor is available on Debian/Ubuntu-based nodes. For RHEL/CentOS nodes, use SELinux instead, configured via seLinuxOptions in the security context.
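On clusters running Kubernetes 1.30 or later, the same AppArmor profile can be expressed through the appArmorProfile field of the security context instead of the annotation. A minimal sketch, reusing the pod from the annotation example and assuming a profile named custom-profile is already loaded on the node:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: apparmor-demo
spec:
  containers:
    - name: app
      image: myregistry.io/app:v2.1.0
      securityContext:
        appArmorProfile:
          type: Localhost               # alternatives: RuntimeDefault, Unconfined
          localhostProfile: custom-profile   # profile name as loaded on the node
```

The field form has the practical advantage of being validated by the API server like any other security context setting, whereas a mistyped annotation key is easy to miss.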
SELinux#
On RHEL, CentOS, and Fedora nodes, SELinux provides mandatory access control instead of AppArmor:

    securityContext:
      seLinuxOptions:
        level: "s0:c123,c456"
        type: "container_t"

SELinux labels control which files and processes a container can interact with. The default container_t type provides reasonable isolation. Custom types allow fine-grained control but require SELinux policy expertise.
Production Hardening Checklist#
Apply this baseline to every application pod:

    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 10001
        runAsGroup: 10001
        fsGroup: 10001
        fsGroupChangePolicy: OnRootMismatch
        seccompProfile:
          type: RuntimeDefault
      containers:
        - name: app
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities:
              drop: ["ALL"]
          volumeMounts:
            - name: tmp
              mountPath: /tmp
            - name: cache
              mountPath: /var/cache
      volumes:
        - name: tmp
          emptyDir: {}
        - name: cache
          emptyDir: {}

Then verify:

    # Check what user the container is actually running as
    kubectl exec hardened-app -- id
    # uid=10001 gid=10001 groups=10001

    # Verify read-only filesystem
    kubectl exec hardened-app -- touch /test-file
    # touch: /test-file: Read-only file system

    # Check capabilities
    kubectl exec hardened-app -- cat /proc/1/status | grep Cap
    # CapBnd: 0000000000000000 with all capabilities dropped
    # CapBnd: 0000000000000400 if only NET_BIND_SERVICE was added back

Common Gotchas#
readOnlyRootFilesystem breaks apps that write to /tmp. Many applications, frameworks, and language runtimes write temporary files. Mount an emptyDir volume at /tmp and any other writable paths the application needs. Common paths include /var/cache, /var/run, and /home/<user>.
Some images run as root by default. The official nginx image starts as root. The official postgres image also starts as root and only drops to the postgres user in its entrypoint. Use -unprivileged or -nonroot image variants when available (e.g., nginxinc/nginx-unprivileged), or set runAsUser to a non-root UID. Check the image’s Dockerfile to see what user it expects.
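For example, a pod using the unprivileged nginx variant mentioned above might look like the following sketch. The UID (101), the listening port (8080), and the tag are assumptions based on that image’s usual defaults; check the image documentation for the tag you actually use.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx-nonroot          # hypothetical name
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 101             # assumed UID of the nginx user in this image
    runAsGroup: 101
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: nginx
      image: nginxinc/nginx-unprivileged:stable   # tag chosen for illustration
      ports:
        - containerPort: 8080  # assumed default; above 1024, so NET_BIND_SERVICE is not needed
      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]
```

readOnlyRootFilesystem is left out here on purpose: enabling it would also require emptyDir mounts for whatever paths the image writes to at runtime.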
fsGroup is slow on large volumes. With fsGroupChangePolicy: Always (the default), Kubernetes recursively chowns every file on every mounted volume every time the pod starts. On volumes with millions of files this can take minutes. Switch to OnRootMismatch which only changes ownership if the top-level directory’s group does not match fsGroup.
Init containers need their own security context. An init container that needs to run as root (e.g., to set file permissions) must have its own security context with runAsUser: 0 and runAsNonRoot: false. The pod-level runAsNonRoot: true will block it otherwise. However, under the restricted Pod Security Standard, root init containers are not allowed – use fsGroup instead.
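A minimal sketch of such an override, assuming a hypothetical init container that fixes ownership on a data volume before the application starts. The pod name, init container name, busybox image, and /data path are all illustrative. The init container sets runAsNonRoot: false alongside runAsUser: 0 so that the pod-level check is relaxed for that one container only.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app-with-init          # hypothetical name
spec:
  securityContext:             # pod-level defaults stay non-root
    runAsNonRoot: true
    runAsUser: 10001
    runAsGroup: 10001
  initContainers:
    - name: fix-permissions    # hypothetical init container
      image: busybox:1.36
      command: ["sh", "-c", "chown -R 10001:10001 /data"]
      securityContext:
        runAsUser: 0           # root, for this container only
        runAsNonRoot: false    # overrides the pod-level runAsNonRoot: true
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]
          add: ["CHOWN"]       # the only capability the chown needs
      volumeMounts:
        - name: data
          mountPath: /data
  containers:
    - name: app
      image: myregistry.io/app:v2.1.0
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      emptyDir: {}
```

Even when root is unavoidable in an init container, dropping every capability except the one it needs keeps the exception narrow.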
Seccomp RuntimeDefault may break some applications. Most Java and Go services work fine with RuntimeDefault. But applications that rely on less common syscalls (e.g., io_uring, perf_event_open) may see those calls rejected, which can surface as a crash or as silently degraded behavior. If an application works without seccomp but breaks with RuntimeDefault, use strace to identify the blocked syscall and either switch to a custom profile that allows it or work around the missing syscall.