Why Runtime Security Matters#

Container images get scanned for vulnerabilities before deployment. Admission controllers enforce pod security standards at creation time. But neither addresses what happens after the container starts running. Runtime security fills this gap: it detects and prevents malicious behavior inside running containers.

A compromised container with a properly hardened runtime is limited in what damage it can cause. Without runtime hardening, a single container escape can compromise the entire node.

Seccomp Profiles#

Seccomp (Secure Computing Mode) restricts which Linux system calls a container process can make. Depending on the profile's configured action, the kernel either kills the process or returns an error (typically EPERM) when it attempts a blocked syscall. This is the most effective single hardening measure because it directly limits what the kernel will do on behalf of the container.

The RuntimeDefault Profile#

Kubernetes applies no seccomp profile by default. The RuntimeDefault profile is the container runtime's built-in profile (containerd or CRI-O), which blocks roughly 44 of the 300+ available syscalls, including mount, reboot, kexec_load, unshare, and bpf.

apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: app
      image: myapp:1.0.0
      securityContext:
        seccompProfile:
          type: RuntimeDefault

Apply RuntimeDefault to every workload as a starting point. It breaks very few applications because it blocks only syscalls that ordinary applications rarely, if ever, use.
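
RuntimeDefault is also what the restricted Pod Security Standard requires, so the built-in Pod Security admission controller can enforce it at the namespace level instead of relying on every manifest. A sketch (the namespace name is illustrative):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    # Reject pods that do not meet the restricted standard, which
    # includes requiring a RuntimeDefault or Localhost seccomp profile
    pod-security.kubernetes.io/enforce: restricted
```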

Custom Seccomp Profiles#

For higher security, create a custom profile that only allows the specific syscalls your application needs. This follows the principle of least privilege at the kernel level.

Step 1: Record which syscalls your application uses.

# Use strace to record syscalls made by the application
strace -f -o /tmp/syscalls.log -e trace=all /path/to/application

# Extract unique syscall names
grep -oP '^\d+\s+\K[a-z0-9_]+(?=\()' /tmp/syscalls.log | sort -u > /tmp/used-syscalls.txt

# Alternatively, use the OCI seccomp BPF hook to generate a profile automatically
# Install oci-seccomp-bpf-hook, then run the container with:
sudo podman run --annotation io.containers.trace-syscall=of:/tmp/seccomp-profile.json myapp:1.0.0
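
The recorded list can then be folded into a profile skeleton with standard tools. A sketch, assuming one syscall name per line in /tmp/used-syscalls.txt as produced above (the three names written here stand in for a real recording):

```shell
# Stand-in for the list produced by the strace step above
printf 'accept4\nbind\nclose\n' > /tmp/used-syscalls.txt

# Join the names into a quoted, comma-separated JSON array body
names=$(awk '{printf "%s\"%s\"", sep, $1; sep=", "}' /tmp/used-syscalls.txt)

# Emit a minimal allowlist profile around them
cat > /tmp/seccomp-profile.json <<EOF
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64", "SCMP_ARCH_AARCH64"],
  "syscalls": [{ "names": [${names}], "action": "SCMP_ACT_ALLOW" }]
}
EOF

# Sanity-check the result before shipping it to nodes
python3 -m json.tool /tmp/seccomp-profile.json > /dev/null && echo "valid profile JSON"
```

Review the generated list by hand before deploying: strace only sees the code paths exercised during recording, so syscalls from rare branches (signal handling, error paths) can be missing.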

Step 2: Create the seccomp profile.

{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64", "SCMP_ARCH_AARCH64"],
  "syscalls": [
    {
      "names": [
        "accept4", "access", "arch_prctl", "bind", "brk", "clone", "close",
        "connect", "epoll_create1", "epoll_ctl", "epoll_pwait", "execve",
        "exit_group", "fcntl", "fstat", "futex", "getdents64", "getpid",
        "getsockname", "getsockopt", "ioctl", "listen", "lseek", "madvise",
        "mmap", "mprotect", "munmap", "nanosleep", "newfstatat", "openat",
        "pipe2", "pread64", "read", "recvfrom", "rt_sigaction", "rt_sigprocmask",
        "rt_sigreturn", "sched_getaffinity", "sched_yield", "sendto", "set_robust_list",
        "set_tid_address", "setsockopt", "sigaltstack", "socket", "tgkill",
        "write", "writev"
      ],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}

Step 3: Deploy the profile via a Kubernetes SeccompProfile resource or by placing it on nodes.

Place the profile on each node at /var/lib/kubelet/seccomp/profiles/myapp.json, then reference it with a Localhost-type profile (the path is relative to the kubelet's seccomp root):

apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  securityContext:
    seccompProfile:
      type: Localhost
      localhostProfile: profiles/myapp.json
  containers:
    - name: app
      image: myapp:1.0.0
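
Where you do control the nodes but want the profile managed from the cluster, one common pattern is to ship it in a ConfigMap and have a DaemonSet copy it into the kubelet's seccomp directory. The sketch below assumes a ConfigMap named seccomp-profiles containing myapp.json; every other name is illustrative:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: seccomp-installer
spec:
  selector:
    matchLabels:
      app: seccomp-installer
  template:
    metadata:
      labels:
        app: seccomp-installer
    spec:
      initContainers:
        - name: install
          image: busybox:1.36
          # Copy every profile from the ConfigMap into the kubelet seccomp dir
          command: ["sh", "-c", "cp /profiles/* /host-seccomp/"]
          volumeMounts:
            - name: profiles
              mountPath: /profiles
            - name: host-seccomp
              mountPath: /host-seccomp
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
      volumes:
        - name: profiles
          configMap:
            name: seccomp-profiles   # holds myapp.json
        - name: host-seccomp
          hostPath:
            path: /var/lib/kubelet/seccomp/profiles
            type: DirectoryOrCreate
```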

For managed Kubernetes where you cannot place files on nodes, use the Security Profiles Operator to manage seccomp profiles as Kubernetes resources:

apiVersion: security-profiles-operator.x-k8s.io/v1beta1
kind: SeccompProfile
metadata:
  name: myapp-seccomp
  namespace: production
spec:
  defaultAction: SCMP_ACT_ERRNO
  architectures:
    - SCMP_ARCH_X86_64
    - SCMP_ARCH_AARCH64
  syscalls:
    - action: SCMP_ACT_ALLOW
      names:
        - accept4
        - bind
        - clone
        - close
        - connect
        # ... remaining syscalls
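
The operator writes reconciled profiles onto each node under an operator/ subdirectory of the kubelet's seccomp root, and the profile's status reports the exact path. A pod then references it as a Localhost profile; a sketch, assuming the default layout:

```yaml
spec:
  securityContext:
    seccompProfile:
      type: Localhost
      localhostProfile: operator/production/myapp-seccomp.json
```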

AppArmor Profiles#

AppArmor provides mandatory access control on Debian/Ubuntu-based systems. It restricts file access, network access, and capability usage per program.

Default Docker/Containerd Profile#

The default container runtime profile (docker-default or cri-containerd.apparmor.d) restricts mounting filesystems, accessing /proc and /sys files, and loading kernel modules. Like seccomp’s RuntimeDefault, this is a reasonable baseline.

Custom AppArmor Profile#

# /etc/apparmor.d/myapp
#include <tunables/global>

profile myapp flags=(attach_disconnected,mediate_deleted) {
  #include <abstractions/base>

  # Allow reading application files
  /app/** r,
  /app/bin/myapp ix,

  # Allow writing to specific directories only
  /tmp/** rw,
  /var/log/myapp/** rw,

  # Network access: allow TCP only
  network inet stream,
  network inet6 stream,

  # Deny raw sockets (prevents packet sniffing)
  deny network raw,
  deny network packet,

  # Deny mount operations
  deny mount,

  # Deny access to sensitive host paths
  deny /proc/*/mem rw,
  deny /sys/firmware/** rw,
  deny /etc/shadow r,
  deny /etc/passwd w,

  # Deny ptrace (prevents debugging/inspection of other processes)
  deny ptrace,
}

Load and apply the profile:

# Load the profile
sudo apparmor_parser -r /etc/apparmor.d/myapp

# Verify it loaded
sudo aa-status | grep myapp

Reference the loaded profile from a pod via the AppArmor annotation:

apiVersion: v1
kind: Pod
metadata:
  name: app
  annotations:
    container.apparmor.security.beta.kubernetes.io/app: localhost/myapp
spec:
  containers:
    - name: app
      image: myapp:1.0.0
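
On Kubernetes v1.30 and later, where AppArmor support is GA, the annotation is superseded by a first-class securityContext field; the same loaded profile is referenced by name:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
    - name: app
      image: myapp:1.0.0
      securityContext:
        appArmorProfile:
          type: Localhost
          localhostProfile: myapp
```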

SELinux for RHEL/CentOS Nodes#

On RHEL-based systems, SELinux provides equivalent mandatory access control. The container_t SELinux type is applied to containers by default and restricts host filesystem access, network operations, and inter-process communication.

apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  securityContext:
    seLinuxOptions:
      type: container_t
      level: "s0:c123,c456"
  containers:
    - name: app
      image: myapp:1.0.0

The level field assigns MCS (Multi-Category Security) labels. Containers with different MCS labels cannot access each other’s files even if they run on the same node.

Capability Dropping#

Linux capabilities split root’s powers into discrete units. Containers start with a default set of 14 capabilities. Most applications need none of them.
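
On a Linux host you can read the capability sets a process actually holds straight from /proc; run the same command inside a container to see what the runtime granted:

```shell
# Show the capability bitmasks for the current process.
# Inside a container started with drop: [ALL], CapEff reads 0000000000000000.
grep Cap /proc/self/status
```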

Drop All, Add Back Selectively#

securityContext:
  capabilities:
    drop:
      - ALL
    add: []

This is the most secure default. Only add capabilities back when the application fails without them, and add only the specific capability needed:

| Capability | What It Allows | When Needed |
|---|---|---|
| NET_BIND_SERVICE | Bind to ports below 1024 | Web servers on port 80/443 |
| CHOWN | Change file ownership | Init containers setting up volumes |
| SETUID / SETGID | Change user/group ID | Applications that drop privileges at startup |
| DAC_OVERRIDE | Bypass file permission checks | Rarely legitimate in containers |
| SYS_PTRACE | Trace/debug other processes | Debugging sidecars, security tools |
| NET_RAW | Use raw sockets | Ping, network diagnostics |

Never add these to ordinary production workloads: SYS_ADMIN (near-equivalent of full root), SYS_PTRACE (allows container escape via process injection; reserve it for dedicated debugging or security tooling), NET_ADMIN (allows network namespace manipulation).
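
For example, a web server that must bind port 443 adds back only NET_BIND_SERVICE:

```yaml
securityContext:
  capabilities:
    drop:
      - ALL
    add:
      - NET_BIND_SERVICE   # bind ports below 1024 without full root
```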

Read-Only Root Filesystem#

A read-only root filesystem prevents attackers from modifying binaries, installing tools, or writing scripts in the container:

apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
    - name: app
      image: myapp:1.0.0
      securityContext:
        readOnlyRootFilesystem: true
      volumeMounts:
        - name: tmp
          mountPath: /tmp
        - name: var-run
          mountPath: /var/run
        - name: var-cache
          mountPath: /var/cache
  volumes:
    - name: tmp
      emptyDir:
        sizeLimit: 100Mi
    - name: var-run
      emptyDir:
        sizeLimit: 10Mi
    - name: var-cache
      emptyDir:
        sizeLimit: 50Mi

Mount emptyDir volumes for every path where the application needs to write. Set sizeLimit to prevent a compromised container from filling the node’s disk.

Falco: Runtime Threat Detection#

Falco monitors system calls made by containers in real time and alerts on suspicious behavior. It is the runtime equivalent of an intrusion detection system for containers.

Installation#

# Install via Helm
helm repo add falcosecurity https://falcosecurity.github.io/charts
helm install falco falcosecurity/falco \
  --namespace falco --create-namespace \
  --set tty=true \
  --set falcosidekick.enabled=true \
  --set falcosidekick.config.slack.webhookurl=https://hooks.slack.com/services/XXX

Key Detection Rules#

Falco ships with rules that detect common attack patterns. These fire without any custom configuration:

  • Terminal shell in container: Detects interactive shell access (bash, sh, zsh) inside a container. Almost always suspicious in production.
  • Read sensitive file untrusted: Detects reading /etc/shadow, /etc/sudoers, or private keys.
  • Write below /etc or /bin: Detects modification of system files, a common persistence technique.
  • Contact K8s API server: Detects containers making Kubernetes API calls, which is unexpected unless the workload intentionally uses the API.
  • Outbound connection to C2: Detects connections to known command-and-control infrastructure.

Custom Rules#

Write rules for your specific environment:

- rule: Unexpected process in production container
  desc: Detect processes that are not part of the normal application
  condition: >
    spawned_process and
    container and
    container.image.repository = "registry.example.com/myapp" and
    not proc.name in (myapp, node, python, gunicorn)
  output: >
    Unexpected process in production container
    (user=%user.name command=%proc.cmdline container=%container.name
     image=%container.image.repository:%container.image.tag)
  priority: WARNING
  tags: [container, process]

- rule: Sensitive path accessed in container
  desc: Detect containers reading or writing sensitive host paths
  condition: >
    (open_read or open_write) and
    container and
    (fd.name startswith /etc/kubernetes or
     fd.name startswith /var/lib/kubelet or
     fd.name startswith /var/run/docker.sock)
  output: >
    Sensitive path accessed in container
    (user=%user.name path=%fd.name container=%container.name)
  priority: CRITICAL
  tags: [container, filesystem]
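
As an allowlist grows, Falco's list objects keep conditions readable; the first rule above can be restated against a named list (a sketch reusing the same rule):

```yaml
- list: myapp_allowed_procs
  items: [myapp, node, python, gunicorn]

- rule: Unexpected process in production container
  desc: Detect processes that are not part of the normal application
  condition: >
    spawned_process and
    container and
    container.image.repository = "registry.example.com/myapp" and
    not proc.name in (myapp_allowed_procs)
  output: >
    Unexpected process in production container
    (user=%user.name command=%proc.cmdline container=%container.name)
  priority: WARNING
  tags: [container, process]
```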

Alert Routing with Falcosidekick#

Falcosidekick forwards Falco alerts to external systems:

# Values for Falcosidekick Helm installation
config:
  slack:
    webhookurl: "https://hooks.slack.com/services/XXX"
    minimumpriority: "warning"
  pagerduty:
    routingkey: "ROUTING_KEY"
    minimumpriority: "critical"
  elasticsearch:
    hostport: "https://elasticsearch:9200"
    index: "falco-alerts"
    minimumpriority: "notice"

gVisor: Application Kernel Isolation#

gVisor interposes a user-space kernel (called Sentry) between the container and the host kernel. System calls from the container are handled by Sentry rather than the host kernel, providing defense-in-depth against kernel vulnerabilities.

Setup with containerd#

# Install gVisor runsc binary
curl -fsSL https://gvisor.dev/archive.key | sudo gpg --dearmor -o /usr/share/keyrings/gvisor-archive-keyring.gpg
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/gvisor-archive-keyring.gpg] https://storage.googleapis.com/gvisor/releases release main" | \
  sudo tee /etc/apt/sources.list.d/gvisor.list
sudo apt update && sudo apt install -y runsc

Then register gVisor as a containerd runtime in /etc/containerd/config.toml:

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runsc]
  runtime_type = "io.containerd.runsc.v1"

Restart containerd to pick up the new runtime:

sudo systemctl restart containerd

Create a RuntimeClass in Kubernetes#

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc

Use gVisor for Specific Workloads#

apiVersion: v1
kind: Pod
metadata:
  name: untrusted-workload
spec:
  runtimeClassName: gvisor
  containers:
    - name: app
      image: untrusted-app:1.0.0

gVisor adds latency to system calls (roughly 2-10x for syscall-heavy workloads). Use it for untrusted workloads, multi-tenant environments, or workloads processing untrusted input. Do not use it for latency-sensitive workloads like databases.

Kata Containers: VM-Level Isolation#

Kata Containers runs each container inside a lightweight virtual machine. This provides hardware-level isolation via the hypervisor. A container escape reaches the guest VM, not the host.

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata
handler: kata
overhead:
  podFixed:
    memory: "160Mi"
    cpu: "250m"
---
apiVersion: v1
kind: Pod
metadata:
  name: isolated-workload
spec:
  runtimeClassName: kata
  containers:
    - name: app
      image: sensitive-app:1.0.0

Kata Containers have higher overhead than gVisor (each pod gets a VM with its own kernel) but provide stronger isolation because they use hardware virtualization. Use Kata for workloads that require the strongest possible isolation, such as running customer-supplied code or processing classified data.

Runtime Isolation Comparison#

| Feature | runc (default) | gVisor | Kata Containers |
|---|---|---|---|
| Isolation boundary | Linux namespaces + cgroups | User-space kernel | Hardware VM |
| Syscall overhead | None | 2-10x | 1.5-3x |
| Memory overhead | Minimal | ~50MB per sandbox | ~160MB per pod |
| Startup time | <1 second | ~1 second | 2-5 seconds |
| Kernel vulnerability protection | None | Strong | Strongest |
| Compatibility | Full | Most workloads | Most workloads |
| Best for | Trusted workloads | Multi-tenant, untrusted input | Highest-security, multi-tenant |

Layered Defense#

No single mechanism provides complete runtime security. Layer them:

  1. Seccomp: Restrict which syscalls are available. The kernel-level filter.
  2. AppArmor/SELinux: Restrict file and network access. The OS-level policy.
  3. Capabilities: Drop unnecessary root powers. The privilege-level control.
  4. Read-only filesystem: Prevent runtime modification. The immutability guarantee.
  5. Falco: Detect when something bypasses the above controls. The detection layer.
  6. gVisor/Kata: Isolate the workload from the host kernel entirely. The containment layer.

Apply layers 1 through 5 to every workload. Add layer 6 for untrusted or highest-risk workloads. Each layer reduces the attack surface that the next layer must defend.