Security Hardening a Kubernetes Cluster#

This operational sequence takes a default Kubernetes cluster and locks it down. Phases are ordered by impact and dependency: assessment first, then RBAC, pod security, networking, images, auditing, and finally data protection. Each phase includes the commands, policy YAML, and verification steps.

Do not skip the assessment phase. You need to know what you are fixing before you start fixing it.


Phase 1 – Assessment#

Before changing anything, establish a baseline. This phase produces a prioritized list of findings that drives the order of remediation in later phases.

Step 1: Run CIS Benchmark Scan#

# kube-bench runs the CIS Kubernetes Benchmark checks
# On self-managed clusters, run as a Job:
kubectl apply -f https://raw.githubusercontent.com/aquasecurity/kube-bench/main/job.yaml

# Wait for completion and read results
kubectl wait --for=condition=complete job/kube-bench --timeout=120s
kubectl logs job/kube-bench

# On managed clusters (EKS, GKE, AKS), use the platform-specific variant:
# EKS:
kubectl apply -f https://raw.githubusercontent.com/aquasecurity/kube-bench/main/job-eks.yaml

Focus on FAIL results. WARN results are informational but should be reviewed.
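
To pull just the failures out of kube-bench's default text output:

kubectl logs job/kube-bench | grep "\[FAIL\]"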

Step 2: Scan Running Images for CVEs#

# Trivy can scan an entire cluster
trivy k8s --report=summary cluster

# For detailed results on a specific namespace:
trivy k8s --report=all --namespace=app-production

# Or scan individual images:
trivy image --severity HIGH,CRITICAL nginx:1.25.3

Step 3: Audit Current RBAC Bindings#

# List all ClusterRoleBindings and their subjects
kubectl get clusterrolebindings -o json | jq -r '
  .items[] |
  "\(.metadata.name) -> \(.roleRef.name) : \(.subjects // [] | map("\(.kind)/\(.name)") | join(", "))"'

# Find cluster-admin bindings specifically
kubectl get clusterrolebindings -o json | jq -r '
  .items[] |
  select(.roleRef.name == "cluster-admin") |
  "\(.metadata.name): \(.subjects // [] | map("\(.kind)/\(.name)") | join(", "))"'

Step 4: Check for Privileged Pods#

# Find pods running as privileged
kubectl get pods -A -o json | jq -r '
  .items[] |
  .metadata.namespace as $ns |
  .metadata.name as $pod |
  .spec.containers[] |
  select(.securityContext.privileged == true) |
  "\($ns)/\($pod)/\(.name): PRIVILEGED"'

Step 5: List Pods with Host Access#

# Pods with hostPath mounts
kubectl get pods -A -o json | jq -r '
  .items[] |
  select(.spec.volumes[]? | .hostPath != null) |
  "\(.metadata.namespace)/\(.metadata.name): hostPath volumes"'

# Pods with hostNetwork
kubectl get pods -A -o json | jq -r '
  .items[] |
  select(.spec.hostNetwork == true) |
  "\(.metadata.namespace)/\(.metadata.name): hostNetwork=true"'

# Pods with hostPID
kubectl get pods -A -o json | jq -r '
  .items[] |
  select(.spec.hostPID == true) |
  "\(.metadata.namespace)/\(.metadata.name): hostPID=true"'

Phase 1 Output#

Compile findings into a prioritized list:

  • Critical: Privileged pods in application namespaces, cluster-admin bindings to service accounts, images with critical CVEs
  • High: No network policies, pods running as root, no pod security standards
  • Medium: Missing resource limits, no audit logging, no image registry restrictions
  • Low: Writable root filesystems, missing read-only annotations

Use this list to guide urgency in subsequent phases.


Phase 2 – RBAC Hardening#

Step 7: Remove Unnecessary cluster-admin Bindings#

# List them first (from Step 3 output)
kubectl get clusterrolebindings -o json | jq -r '
  .items[] |
  select(.roleRef.name == "cluster-admin") |
  .metadata.name'

# Delete non-essential bindings (DO NOT delete system bindings)
# System bindings to keep: system:*, kubeadm:*, cluster-admin (the built-in one)
# Review each one before deleting:
kubectl delete clusterrolebinding <binding-name>

Step 8: Create Team-Specific Roles#

cat <<'EOF' | kubectl apply -f -
---
# Developer role: can manage workloads but not RBAC or secrets
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: developer
  namespace: app-production
rules:
- apiGroups: ["apps"]
  resources: ["deployments", "replicasets", "statefulsets"]
  verbs: ["get", "list", "watch", "create", "update", "patch"]
- apiGroups: [""]
  resources: ["pods", "services", "configmaps"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: [""]
  resources: ["pods/log"]
  verbs: ["get"]
- apiGroups: [""]
  resources: ["pods/exec"]
  verbs: ["create"]
---
# Read-only role for auditors and monitoring tools
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: readonly-all
rules:
- apiGroups: ["", "apps", "batch", "networking.k8s.io"]
  resources: ["*"]
  verbs: ["get", "list", "watch"]
---
# CI/CD deployer: can update deployments, configmaps, and secrets, nothing else
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: deployer
  namespace: app-production
rules:
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["get", "list", "patch", "update"]
- apiGroups: [""]
  resources: ["configmaps", "secrets"]
  verbs: ["get", "list", "create", "update", "patch"]
EOF
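
The roles above grant nothing until they are bound to subjects. A minimal sketch of the bindings, using the hypothetical developer@company.com user and deploy-bot service account that appear in the verification commands below; substitute your own users, groups, and CI service accounts:

cat <<'EOF' | kubectl apply -f -
---
# Bind the developer Role to a user (the account used in the Phase 2 verification)
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: developer-binding
  namespace: app-production
subjects:
- kind: User
  name: developer@company.com        # hypothetical user; use your IdP's users or groups
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: developer
  apiGroup: rbac.authorization.k8s.io
---
# Bind the deployer Role to the CI/CD service account
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: deployer-binding
  namespace: app-production
subjects:
- kind: ServiceAccount
  name: deploy-bot                   # hypothetical CI/CD service account
  namespace: app-production
roleRef:
  kind: Role
  name: deployer
  apiGroup: rbac.authorization.k8s.io
EOF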

Step 9-10: Service Account Hygiene#

# Find pods using the default service account
kubectl get pods -A -o json | jq -r '
  .items[] |
  select(.spec.serviceAccountName == "default" or .spec.serviceAccountName == null) |
  select(.metadata.namespace != "kube-system") |
  "\(.metadata.namespace)/\(.metadata.name)"'

# Create dedicated service accounts for each workload
kubectl create serviceaccount web-api -n app-production

# Or define the service account declaratively, disabling auto-mounting of its token on pods that do not need Kubernetes API access
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
  name: web-api
  namespace: app-production
automountServiceAccountToken: false
EOF

Update deployments to use the dedicated service account:

spec:
  template:
    spec:
      serviceAccountName: web-api
      automountServiceAccountToken: false  # unless the pod needs API access
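
The same change can be applied in place with a strategic merge patch (deployment and namespace names are the ones used above):

kubectl patch deployment web-api -n app-production \
  -p '{"spec":{"template":{"spec":{"serviceAccountName":"web-api","automountServiceAccountToken":false}}}}'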

Phase 2 Verification#

# Test developer permissions
kubectl auth can-i create deployments -n app-production --as=developer@company.com
# Should return: yes

kubectl auth can-i create clusterroles --as=developer@company.com
# Should return: no

# Test deployer permissions
kubectl auth can-i delete namespaces --as=system:serviceaccount:app-production:deploy-bot
# Should return: no

# Verify no stray cluster-admin bindings
kubectl get clusterrolebindings -o json | jq '[.items[] | select(.roleRef.name == "cluster-admin") | .metadata.name]'
# Only system bindings should remain

Rollback: Recreate deleted ClusterRoleBindings with kubectl create clusterrolebinding. RBAC changes take effect immediately with no restart required.


Phase 3 – Pod Security#

Step 12: Apply Pod Security Standards#

# Enforce baseline on all application namespaces
# Warn and audit on restricted to identify further hardening opportunities
for ns in app-production app-staging; do
  kubectl label namespace $ns \
    pod-security.kubernetes.io/enforce=baseline \
    pod-security.kubernetes.io/enforce-version=latest \
    pod-security.kubernetes.io/warn=restricted \
    pod-security.kubernetes.io/warn-version=latest \
    pod-security.kubernetes.io/audit=restricted \
    pod-security.kubernetes.io/audit-version=latest \
    --overwrite
done

# Monitoring and system namespaces need more permissive settings
# because components like node-exporter and promtail need host access
kubectl label namespace monitoring \
  pod-security.kubernetes.io/enforce=privileged \
  pod-security.kubernetes.io/warn=baseline \
  --overwrite
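
Before rolling enforcement out to further namespaces, a server-side dry run of the label change returns warnings listing existing pods that would violate the level (app-batch here is a hypothetical namespace):

kubectl label --dry-run=server --overwrite namespace app-batch \
  pod-security.kubernetes.io/enforce=baseline
# Warnings name each existing pod that would violate baseline; nothing is changed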

Step 13: Fix Workloads That Violate Standards#

Common fixes for pods that violate the baseline standard:

spec:
  containers:
  - name: app
    image: myapp:1.2.3
    securityContext:
      runAsNonRoot: true
      runAsUser: 1000
      runAsGroup: 1000
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop:
          - ALL
      seccompProfile:
        type: RuntimeDefault
    volumeMounts:
    - name: tmp
      mountPath: /tmp
  volumes:
  - name: tmp
    emptyDir: {}

The readOnlyRootFilesystem with an emptyDir for /tmp is the standard pattern. Applications that need writable directories get explicit emptyDir mounts rather than a writable root filesystem.

Step 14-15: Install Kyverno with Custom Policies#

helm repo add kyverno https://kyverno.github.io/kyverno
helm install kyverno kyverno/kyverno \
  --namespace kyverno \
  --create-namespace \
  --set replicaCount=3 \
  --set resources.requests.cpu=100m \
  --set resources.requests.memory=256Mi

# Custom policy: require resource limits on all containers
cat <<'EOF' | kubectl apply -f -
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-resource-limits
spec:
  validationFailureAction: Enforce
  background: true
  rules:
  - name: check-limits
    match:
      any:
      - resources:
          kinds:
          - Pod
          namespaces:
          - app-production
          - app-staging
    validate:
      message: "All containers must have CPU and memory limits."
      pattern:
        spec:
          containers:
          - resources:
              limits:
                memory: "?*"
                cpu: "?*"
---
# Custom policy: restrict image registries
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-image-registries
spec:
  validationFailureAction: Enforce
  background: true
  rules:
  - name: validate-registries
    match:
      any:
      - resources:
          kinds:
          - Pod
          namespaces:
          - app-production
    validate:
      message: "Images must come from approved registries: registry.company.com or gcr.io/my-project."
      pattern:
        spec:
          containers:
          - image: "registry.company.com/* | gcr.io/my-project/*"
EOF

Phase 3 Verification#

# Test PSA enforcement
kubectl run priv-test --image=busybox:1.36 --restart=Never -n app-production \
  --overrides='{"spec":{"containers":[{"name":"test","image":"busybox:1.36","securityContext":{"privileged":true}}]}}'
# Expected: Error from server (Forbidden): pod security standards violation

# Test Kyverno policy: deploy without resource limits
kubectl run no-limits --image=busybox:1.36 --restart=Never -n app-production
# Expected: Error from server: admission webhook denied the request

# Test registry restriction
kubectl run bad-registry --image=docker.io/nginx:latest --restart=Never -n app-production
# Expected: Error from server: admission webhook denied the request

Rollback: Remove namespace labels with kubectl label namespace app-production pod-security.kubernetes.io/enforce-. Delete Kyverno policies with kubectl delete clusterpolicy <name>. Uninstall Kyverno with helm uninstall kyverno -n kyverno.


Phase 4 – Network Security#

Step 17: Default-Deny Network Policies#

# Apply to every application namespace
for ns in app-production app-staging; do
cat <<EOF | kubectl apply -f -
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: $ns
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
---
# CRITICAL: Allow DNS -- without this, service discovery breaks entirely
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
  namespace: $ns
spec:
  podSelector: {}
  policyTypes:
  - Egress
  egress:
  - ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
EOF
done

Step 18: Add Explicit Allow Policies#

cat <<'EOF' | kubectl apply -f -
---
# Allow ingress controller to reach app pods
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-ingress-to-web
  namespace: app-production
spec:
  podSelector:
    matchLabels:
      app: web-api
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: ingress-system
    ports:
    - protocol: TCP
      port: 8080
---
# Allow web-api to talk to the database
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-web-to-db
  namespace: app-production
spec:
  podSelector:
    matchLabels:
      app: web-api
  policyTypes:
  - Egress
  egress:
  - to:
    - podSelector:
        matchLabels:
          app: postgres
    ports:
    - protocol: TCP
      port: 5432
---
# Allow Prometheus to scrape metrics from all pods
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-prometheus-scrape
  namespace: app-production
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: monitoring
      podSelector:
        matchLabels:
          app.kubernetes.io/name: prometheus
    ports:
    - protocol: TCP
      port: 9090
    - protocol: TCP
      port: 8080
EOF

Step 19: Verify DNS Still Works#

This is the most common failure after applying network policies. Test immediately.

kubectl run dns-test --image=busybox:1.36 --restart=Never --rm -it -n app-production -- nslookup kubernetes.default.svc.cluster.local

If DNS fails, the allow-dns policy is not applied correctly. Check that the policy’s egress rule allows UDP port 53 to all destinations (the to field should be empty, meaning all destinations).
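
A quick way to confirm the policy object and its egress rule:

kubectl get networkpolicy allow-dns -n app-production -o yaml
# spec.egress should contain a single rule with UDP and TCP port 53 and no "to" selector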

Phase 4 Verification#

# Verify policies are applied
kubectl get networkpolicy -n app-production

# Test allowed traffic: ingress to web-api should work
kubectl exec -n ingress-system deploy/ingress-nginx-controller -- \
  wget -qO- --timeout=5 http://web-api.app-production:8080/healthz

# Test blocked traffic: direct pod-to-pod across namespaces should fail
kubectl run test-block --image=busybox:1.36 --restart=Never --rm -it -n app-staging -- \
  wget -qO- --timeout=5 http://web-api.app-production:8080/healthz
# Expected: timeout (traffic blocked)

Rollback: kubectl delete networkpolicy --all -n app-production. This removes all policies and returns to the default allow-all behavior.


Phase 5 – Image Security#

Step 22: Block Unscanned Images#

If using Kyverno (installed in Phase 3), require images to be referenced by digest so that only the exact artifacts you have scanned can run:

cat <<'EOF' | kubectl apply -f -
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-image-digest
spec:
  validationFailureAction: Enforce
  background: true
  rules:
  - name: require-digest
    match:
      any:
      - resources:
          kinds:
          - Pod
          namespaces:
          - app-production
    validate:
      message: "Images must use a digest (sha256) reference, not a tag."
      pattern:
        spec:
          containers:
          - image: "*@sha256:*"
EOF

Step 23: Restrict Image Sources#

Already handled by the restrict-image-registries policy from Phase 3, Step 15. Verify it is active:

kubectl get clusterpolicy restrict-image-registries -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
# Should return: True

Step 24: Enable Image Signing Verification (Optional)#

# Using Cosign with Kyverno
cat <<'EOF' | kubectl apply -f -
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: verify-image-signature
spec:
  validationFailureAction: Enforce
  webhookTimeoutSeconds: 30
  rules:
  - name: check-signature
    match:
      any:
      - resources:
          kinds:
          - Pod
          namespaces:
          - app-production
    verifyImages:
    - imageReferences:
      - "registry.company.com/*"
      attestors:
      - entries:
        - keys:
            publicKeys: |
              -----BEGIN PUBLIC KEY-----
              YOUR_COSIGN_PUBLIC_KEY_HERE
              -----END PUBLIC KEY-----
EOF
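
The public key in the policy comes from a Cosign key pair, and images must be signed in CI before the policy will admit them. A minimal key-based sketch (keyless OIDC signing is also possible); the image reference is illustrative:

# Generate a key pair: cosign.key stays in CI secrets, cosign.pub goes into the policy above
cosign generate-key-pair

# Sign the image by digest as part of the CI pipeline
cosign sign --key cosign.key registry.company.com/web-api@sha256:<digest>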

Phase 5 Verification#

# Try deploying with a tag (should fail)
kubectl run tag-test --image=nginx:1.25.3 --restart=Never -n app-production
# Expected: blocked by require-image-digest policy

# Try deploying from unauthorized registry (should fail)
kubectl run reg-test --image=docker.io/nginx@sha256:abc123 --restart=Never -n app-production
# Expected: blocked by restrict-image-registries policy

Rollback: kubectl delete clusterpolicy require-image-digest restrict-image-registries verify-image-signature.


Phase 6 – Audit and Monitoring#

Step 27: Enable Kubernetes Audit Logging#

For self-managed clusters, create an audit policy and pass it to the API server. For managed clusters (EKS, GKE, AKS), enable audit logging through the cloud provider console.
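
On a managed cluster this is a provider API call rather than an API server flag; for example, on EKS (cluster name is illustrative), audit events then flow to CloudWatch Logs:

aws eks update-cluster-config --name my-cluster \
  --logging '{"clusterLogging":[{"types":["audit"],"enabled":true}]}'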

# Audit policy file (place on control plane nodes)
cat <<'EOF' > /etc/kubernetes/audit-policy.yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
# Log secret access at Metadata level (who accessed, not what)
- level: Metadata
  resources:
  - group: ""
    resources: ["secrets"]

# Log all RBAC changes at RequestResponse level
- level: RequestResponse
  resources:
  - group: "rbac.authorization.k8s.io"
    resources: ["clusterrolebindings", "rolebindings", "clusterroles", "roles"]

# Log pod exec and attach
- level: Request
  resources:
  - group: ""
    resources: ["pods/exec", "pods/attach"]

# Catch-all: log everything else (including failed authentication and
# authorization attempts) at Metadata level; skip the RequestReceived
# stage to reduce log volume
- level: Metadata
  omitStages:
  - RequestReceived
EOF

Add to kube-apiserver flags:

--audit-policy-file=/etc/kubernetes/audit-policy.yaml
--audit-log-path=/var/log/kubernetes/audit.log
--audit-log-maxage=30
--audit-log-maxbackup=10
--audit-log-maxsize=100
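
On kubeadm-based clusters these flags live in the kube-apiserver static pod manifest, and the policy file and log directory must also be mounted into the container or the API server will not start. A sketch of the additions to /etc/kubernetes/manifests/kube-apiserver.yaml, assuming the paths above; the kubelet restarts the API server automatically when the manifest changes:

# Additions to /etc/kubernetes/manifests/kube-apiserver.yaml
spec:
  containers:
  - command:
    - kube-apiserver
    - --audit-policy-file=/etc/kubernetes/audit-policy.yaml
    - --audit-log-path=/var/log/kubernetes/audit.log
    # ...existing flags unchanged...
    volumeMounts:
    - name: audit-policy
      mountPath: /etc/kubernetes/audit-policy.yaml
      readOnly: true
    - name: audit-logs
      mountPath: /var/log/kubernetes
  volumes:
  - name: audit-policy
    hostPath:
      path: /etc/kubernetes/audit-policy.yaml
      type: File
  - name: audit-logs
    hostPath:
      path: /var/log/kubernetes
      type: DirectoryOrCreate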

Step 29: Set Up Security Alerts#

cat <<'EOF' | kubectl apply -f -
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: security-alerts
  namespace: monitoring
  labels:
    release: kube-prometheus
spec:
  groups:
  - name: security
    rules:
    - alert: PodExecDetected
      expr: increase(apiserver_request_total{verb="create", resource="pods", subresource="exec"}[5m]) > 0
      labels:
        severity: warning
      annotations:
        summary: "kubectl exec detected"
        description: "Someone exec'd into a pod in the last 5 minutes. Check audit logs."

    - alert: NewClusterAdminBinding
      expr: increase(apiserver_request_total{verb="create", resource="clusterrolebindings"}[5m]) > 0
      labels:
        severity: critical
      annotations:
        summary: "New ClusterRoleBinding created"
        description: "A new ClusterRoleBinding was created. Verify it is authorized."

    - alert: FailedAuthAttempts
      expr: increase(apiserver_request_total{code=~"40[13]"}[5m]) > 10
      labels:
        severity: warning
      annotations:
        summary: "Multiple failed auth attempts"
        description: "More than 10 failed authentication/authorization attempts in 5 minutes."
EOF

Step 30: Install Falco for Runtime Security#

helm repo add falcosecurity https://falcosecurity.github.io/charts
helm install falco falcosecurity/falco \
  --namespace falco \
  --create-namespace \
  --set falcosidekick.enabled=true \
  --set falcosidekick.config.slack.webhookurl="https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK" \
  --set driver.kind=modern_ebpf \
  --set collectors.kubernetes.enabled=true

Falco detects runtime security events like shell spawning inside containers, sensitive file access, and unexpected network connections using eBPF.

Phase 6 Verification#

# Test audit logging: exec into a pod
kubectl exec -it deploy/web-api -n app-production -- whoami

# Check audit log for the exec event
# On self-managed: grep "exec" /var/log/kubernetes/audit.log | tail -5
# On managed: check cloud provider's audit log console

# Verify Falco detects the exec
kubectl logs -l app.kubernetes.io/name=falco -n falco --tail=20 | grep "exec"
# Should show a notice about a shell being spawned in a container

Rollback: Falco: helm uninstall falco -n falco. Audit policy: remove the flags from kube-apiserver and restart. Alerting rules: kubectl delete prometheusrule security-alerts -n monitoring.


Phase 7 – Data Protection#

Step 32: Enable etcd Encryption at Rest#

For self-managed clusters, create an encryption configuration:

# Note: unquoted heredoc so the key below is generated via command substitution
cat <<EOF > /etc/kubernetes/encryption-config.yaml
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
- resources:
  - secrets
  - configmaps
  providers:
  - aescbc:
      keys:
      - name: key1
        secret: $(head -c 32 /dev/urandom | base64)
  - identity: {}
EOF

Add to kube-apiserver: --encryption-provider-config=/etc/kubernetes/encryption-config.yaml

After enabling encryption, re-encrypt all existing secrets:

kubectl get secrets -A -o json | kubectl replace -f -

For managed clusters, the provider encrypts etcd at rest with its own managed keys by default; for customer-managed keys, configure envelope encryption through AWS KMS (EKS), Cloud KMS (GKE), or Azure Key Vault (AKS).
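
For example, on EKS, envelope encryption of secrets with a customer-managed KMS key can be added to an existing cluster (cluster name and key ARN are illustrative):

aws eks associate-encryption-config --cluster-name my-cluster \
  --encryption-config '[{"resources":["secrets"],"provider":{"keyArn":"arn:aws:kms:us-east-1:111122223333:key/EXAMPLE"}}]'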

Step 33: Check Certificate Expiry#

# On self-managed clusters using kubeadm:
kubeadm certs check-expiration

# For any cluster, check the API server certificate:
echo | openssl s_client -connect $(kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}' | sed 's|https://||') 2>/dev/null | openssl x509 -noout -dates

Pass: All certificates expire in more than 90 days. Set a calendar reminder for renewal.

Step 34: Configure Secret Rotation#

Document a rotation schedule and procedure:

# Rotate a Kubernetes secret
kubectl create secret generic db-credentials \
  --from-literal=password=$(openssl rand -base64 24) \
  -n app-production \
  --dry-run=client -o yaml | kubectl apply -f -

# Trigger a rolling restart to pick up the new secret
kubectl rollout restart deployment/web-api -n app-production

For automated rotation, consider integrating with HashiCorp Vault or a cloud secrets manager (AWS Secrets Manager, GCP Secret Manager, Azure Key Vault) using the External Secrets Operator.
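
A sketch of what that looks like with the External Secrets Operator, assuming a SecretStore named aws-secrets-manager already exists and the credential lives at prod/db-credentials in the external manager:

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
  namespace: app-production
spec:
  refreshInterval: 1h                # re-sync from the external manager hourly
  secretStoreRef:
    name: aws-secrets-manager        # assumed pre-existing SecretStore
    kind: SecretStore
  target:
    name: db-credentials             # Kubernetes Secret kept in sync
  data:
  - secretKey: password
    remoteRef:
      key: prod/db-credentials       # assumed path in the external manager
      property: password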

Phase 7 Verification#

# Verify etcd encryption (self-managed only)
# Read a secret directly from etcd -- it should be encrypted
ETCDCTL_API=3 etcdctl get /registry/secrets/app-production/db-credentials \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key | hexdump -C | head
# Should show encrypted bytes, not plaintext

# Check cert expiry
kubeadm certs check-expiration 2>/dev/null || echo "Managed cluster -- check via cloud console"
# All certs > 90 days from expiry

Rollback: etcd encryption cannot be trivially reversed. To decrypt, change the encryption config to put identity first, then re-encrypt all secrets. Certificate rotation: use kubeadm certs renew all on self-managed clusters.


Hardening Summary#

Phase          Critical Outcome                      Verification Command
Assessment     Risk baseline documented              kube-bench, trivy scan results
RBAC           No unnecessary cluster-admin          kubectl auth can-i tests
Pod Security   Privileged pods blocked               Deploy privileged pod (rejected)
Network        Default deny enforced, DNS works      Connectivity tests
Images         Untrusted registries blocked          Deploy from Docker Hub (rejected)
Audit          Security events logged and alerted    Exec into pod, check logs
Data           Secrets encrypted, certs valid        etcd read shows encrypted data

After completing all phases, re-run the CIS benchmark scan from Phase 1 to measure improvement. Target: zero FAIL results for items within your control (some items on managed clusters are the provider’s responsibility).
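
A quick way to re-run the benchmark and count the remaining failures, reusing the commands from Phase 1:

kubectl delete job kube-bench --ignore-not-found
kubectl apply -f https://raw.githubusercontent.com/aquasecurity/kube-bench/main/job.yaml
kubectl wait --for=condition=complete job/kube-bench --timeout=120s
kubectl logs job/kube-bench | grep -c "\[FAIL\]"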