Jobs and CronJobs#
Deployments manage long-running processes. Jobs manage work that finishes. A Job creates one or more pods, runs them to completion, and tracks whether they succeeded. CronJobs run Jobs on a schedule. Both are essential for database migrations, report generation, data pipelines, and any workload that is not a continuously running server.
Job Basics#
A Job runs a pod until it exits successfully (exit code 0). The simplest case is a single pod that runs once:
apiVersion: batch/v1
kind: Job
metadata:
  name: db-migration
spec:
  template:
    spec:
      containers:
      - name: migrate
        image: myapp:2.3.0
        command: ["./migrate", "--up"]
        env:
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: db-credentials
              key: url
      restartPolicy: Never

The restartPolicy must be Never or OnFailure – Jobs do not support Always because the pod is expected to terminate.
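To run this migration from a deploy pipeline and block until it finishes, you can pair kubectl apply with kubectl wait (the filename here is just an assumption):

kubectl apply -f db-migration-job.yaml
# exits non-zero if the Job does not complete within the timeout
kubectl wait --for=condition=complete job/db-migration --timeout=300s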
Completions and Parallelism#
For workloads that need multiple successful runs, use completions and parallelism:
spec:
  completions: 10    # 10 pods must succeed
  parallelism: 3     # run 3 pods at a time

This runs up to 3 pods concurrently until 10 pods have completed successfully. Useful for processing a batch of items where each pod handles one item.
Completion Modes#
Jobs support two completion modes:
- NonIndexed (default): each pod is interchangeable. The Job just counts successful completions.
- Indexed: each pod gets a unique completion index (0, 1, 2, …) available as the JOB_COMPLETION_INDEX environment variable. Useful when each pod must process a specific shard or partition.
spec:
  completions: 5
  parallelism: 5
  completionMode: Indexed

Each pod sees its index and can use it to determine which partition of work to handle – for example, processing records where id % 5 == $JOB_COMPLETION_INDEX.
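Building on that spec, here is a minimal sketch of a container consuming the index – the batch-worker image and process-shard.sh script are hypothetical placeholders:

spec:
  completions: 5
  parallelism: 5
  completionMode: Indexed
  template:
    spec:
      containers:
      - name: worker
        image: batch-worker:1.0                 # hypothetical image
        command: ["/bin/sh", "-c"]
        args:
        # JOB_COMPLETION_INDEX is injected automatically for Indexed Jobs
        - ./process-shard.sh --shard "$JOB_COMPLETION_INDEX" --total 5
      restartPolicy: Never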
Restart Policies and Their Tradeoffs#
The restartPolicy on a Job’s pod template controls what happens when a container fails:
Never: Kubernetes does not restart the failed container. Instead, it creates a new pod. The failed pod stays around in Error status. This is useful for debugging because you can inspect the logs of the failed pod, but it leaves pods behind that you need to clean up.
OnFailure: Kubernetes restarts the failed container in the same pod. The pod is reused, and the restart count increments. Logs from earlier attempts are harder to reach – kubectl logs --previous only shows the immediately preceding container run. This uses fewer resources but makes debugging harder.
For production, use Never when you need to inspect failures and OnFailure when you want automatic retry with minimal pod sprawl.
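With OnFailure, the logs of the most recent failed attempt are still reachable (the pod name below is illustrative):

# current (or last) container run
kubectl logs report-job-abc12
# the immediately preceding, failed container run in the same pod
kubectl logs report-job-abc12 --previous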
Retry Logic: backoffLimit#
The backoffLimit field controls how many times a Job retries before marking itself as failed. The default is 6.
spec:
  backoffLimit: 3
  template:
    spec:
      containers:
      - name: report
        image: report-generator:1.0
      restartPolicy: Never

Between retries, Kubernetes uses exponential backoff: 10 seconds, 20 seconds, 40 seconds, and so on, capped at 6 minutes. This prevents a Job from hammering a failing dependency.
When restartPolicy: Never, each retry creates a new pod. When restartPolicy: OnFailure, each retry restarts the container in place. The backoff limit counts both types of failures.
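With restartPolicy: Never, those failed attempts pile up as separate pods; you can list them via the job-name label the Job controller puts on every pod it creates:

# list every pod the Job created, including failed attempts
kubectl get pods -l job-name=db-migration
# inspect one failed attempt
kubectl logs db-migration-7x4k2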
Timeouts and Cleanup#
activeDeadlineSeconds#
Sets an absolute time limit for the entire Job, including all retries:
spec:
  activeDeadlineSeconds: 600    # fail the whole Job after 10 minutes
  backoffLimit: 3

If the Job exceeds this deadline, Kubernetes terminates all running pods and marks the Job as failed with reason DeadlineExceeded. This is your safety net against Jobs that hang indefinitely.
ttlSecondsAfterFinished#
Controls automatic cleanup of completed and failed Jobs:
spec:
  ttlSecondsAfterFinished: 3600    # delete Job and pods 1 hour after completion

Without this, completed Jobs and their pods stick around forever, cluttering kubectl get jobs output and consuming etcd storage. Set this for any Job you do not need to inspect long-term.
Pod Failure Policy (v1.26+)#
The podFailurePolicy lets you handle specific exit codes differently. This is powerful when certain failures should not trigger a retry:
spec:
  backoffLimit: 5
  podFailurePolicy:
    rules:
    - action: FailJob
      onExitCodes:
        containerName: migrate
        operator: In
        values: [42]
    - action: Ignore
      onExitCodes:
        containerName: migrate
        operator: In
        values: [143]
    - action: Count
      onExitCodes:
        containerName: migrate
        operator: NotIn
        values: [0, 42, 143]
  template:
    spec:
      containers:
      - name: migrate
        image: myapp:2.3.0
        command: ["./migrate"]
      restartPolicy: Never

The actions:
- FailJob: immediately fail the Job, no more retries. Use for permanent errors like invalid configuration (exit code 42 in this example).
- Ignore: do not count this failure against backoffLimit. Exit code 143 is SIGTERM – the pod was killed externally, not a real failure.
- Count: count the failure toward backoffLimit as normal. This is the default behavior.
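The policy only works if the application exits with deliberate codes. A hypothetical wrapper around the migrate command from the example (the --validate-config flag is invented for illustration):

#!/bin/sh
# hypothetical entrypoint: translate known-permanent failures into exit code 42
if ! ./migrate --validate-config; then
  echo "invalid migration config" >&2
  exit 42    # matches the FailJob rule: retrying will not help
fi
exec ./migrate --up    # transient failures exit non-zero and count toward backoffLimit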
CronJobs#
A CronJob creates Jobs on a schedule. The schedule field uses standard cron syntax:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: db-backup
spec:
  schedule: "0 */6 * * *"          # every 6 hours
  timeZone: "America/New_York"     # v1.27+ -- explicit timezone
  concurrencyPolicy: Forbid
  startingDeadlineSeconds: 300
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 5
  jobTemplate:
    spec:
      activeDeadlineSeconds: 1800
      backoffLimit: 2
      ttlSecondsAfterFinished: 86400
      template:
        spec:
          containers:
          - name: backup
            image: pg-backup:1.2
            command: ["./backup.sh"]
            env:
            - name: PGHOST
              value: "postgres-svc"
            - name: PGDATABASE
              value: "appdb"
            - name: BACKUP_BUCKET
              value: "s3://backups/db"
            volumeMounts:
            - name: credentials
              mountPath: /secrets
              readOnly: true
          volumes:
          - name: credentials
            secret:
              secretName: backup-credentials
          restartPolicy: OnFailure

Concurrency Policy#
Controls what happens when the next scheduled run fires while a previous Job is still running:
- Allow (default): multiple Jobs can run simultaneously. Fine for independent, idempotent work.
- Forbid: skip the new run if the previous one is still active. Use this when concurrent runs would conflict – like two backup jobs writing to the same location.
- Replace: kill the running Job and start a new one. Useful when freshness matters more than completion.
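As an illustration, a freshness-oriented workload – say a periodic cache refresh (a hypothetical example, not from the manifest above) – would pair a short schedule with Replace:

spec:
  schedule: "*/15 * * * *"        # every 15 minutes
  concurrencyPolicy: Replace      # a still-running refresh is killed; only the newest run matters
  jobTemplate:
    spec:
      activeDeadlineSeconds: 840  # keep each run well under the 15-minute interval
      template:
        spec:
          containers:
          - name: refresh
            image: cache-refresher:1.0    # hypothetical image
            command: ["./refresh.sh"]     # hypothetical script
          restartPolicy: OnFailure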
startingDeadlineSeconds#
If the CronJob controller misses a scheduled time (because of controller downtime or high cluster load), startingDeadlineSeconds defines how late is too late. If the current time is past scheduledTime + startingDeadlineSeconds, the run is skipped entirely.
This is critical for time-sensitive jobs. A report that must run between midnight and 12:05 AM should set startingDeadlineSeconds: 300. If the scheduler is down and wakes up at 12:10 AM, the run for that period is skipped rather than producing a late report with incorrect data boundaries.
If startingDeadlineSeconds is not set and the controller misses more than 100 consecutive schedules, the CronJob stops scheduling entirely as a safety measure.
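A sketch of that midnight-report configuration:

spec:
  schedule: "0 0 * * *"             # midnight
  startingDeadlineSeconds: 300      # if the run cannot start by 12:05 AM, skip it
  concurrencyPolicy: Forbid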
History Limits#
successfulJobsHistoryLimit (default 3) and failedJobsHistoryLimit (default 1) control how many completed Jobs are retained. In production, keep more failed Jobs than successful ones – you need the failed pods for debugging:
successfulJobsHistoryLimit: 3
failedJobsHistoryLimit: 5

Timezone Handling#
Before Kubernetes 1.27, CronJobs used the kube-controller-manager’s timezone (usually UTC). Starting in v1.27, the .spec.timeZone field lets you specify a timezone explicitly:
spec:
  schedule: "0 2 * * *"
  timeZone: "America/Chicago"

If your cluster is older, schedule in UTC and do the mental math, or accept the surprise when daylight saving time shifts your job by an hour.
Monitoring Jobs#
# List all Jobs in a namespace
kubectl get jobs -n batch-workloads
# Watch a Job's progress
kubectl get jobs db-migration -w
# See Job events (scheduling, pod creation, failures)
kubectl describe job db-migration
# Get logs from the Job's pod
kubectl logs job/db-migration
# For multi-pod Jobs, get logs from a specific pod
kubectl logs db-migration-7x4k2
# List CronJob schedules and last run
kubectl get cronjobs
# See last scheduled time and active Jobs
kubectl describe cronjob db-backup

When a Job fails, kubectl describe job <name> shows events that explain why: BackoffLimitExceeded, DeadlineExceeded, or pod-level failure messages.
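To test a CronJob without waiting for its schedule, create a one-off Job from its template:

# run the db-backup CronJob's jobTemplate right now as a standalone Job
kubectl create job db-backup-manual --from=cronjob/db-backup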
Common Patterns#
Database migrations as Jobs: Run as a pre-deploy Job (or init container). Set backoffLimit: 0 if migrations are not idempotent – a failed migration that partially applied should not retry automatically.
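A minimal sketch of that setting applied to the db-migration Job from the start of this section (the deadline value is an assumption):

spec:
  backoffLimit: 0              # a partially applied migration must not retry automatically
  activeDeadlineSeconds: 300   # assumption: cap a hung migration at 5 minutes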
Report generation as CronJobs: Schedule with concurrencyPolicy: Forbid to prevent overlap. Use activeDeadlineSeconds to catch hung report generators.
Queue workers as parallel Jobs: Set completions to the number of queue items and parallelism to your desired concurrency. Each pod pulls one item from the queue and exits.
One-shot admin tasks: Create a Job imperatively for debugging or one-off operations:
kubectl create job manual-cleanup --image=myapp:2.3.0 -- ./cleanup.sh --dry-run

Practical Example: Database Backup CronJob#
This CronJob runs a PostgreSQL backup every 6 hours. It retries twice on failure, times out after 30 minutes, and keeps failed Jobs for 5 days for debugging:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-backup
  namespace: database
spec:
  schedule: "0 0,6,12,18 * * *"
  timeZone: "UTC"
  concurrencyPolicy: Forbid
  startingDeadlineSeconds: 600
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 5
  jobTemplate:
    spec:
      activeDeadlineSeconds: 1800
      backoffLimit: 2
      ttlSecondsAfterFinished: 432000
      podFailurePolicy:
        rules:
        - action: FailJob
          onExitCodes:
            containerName: backup
            operator: In
            values: [2]
        - action: Count
          onExitCodes:
            containerName: backup
            operator: NotIn
            values: [0, 2]
      template:
        spec:
          containers:
          - name: backup
            image: postgres:16-alpine
            command:
            - /bin/sh
            - -c
            - |
              pg_dump -Fc "$DATABASE_URL" > /backup/db-$(date +%Y%m%d-%H%M%S).dump
              if [ $? -ne 0 ]; then exit 1; fi
              aws s3 cp /backup/*.dump "s3://${BACKUP_BUCKET}/postgres/"
            env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: postgres-credentials
                  key: url
            - name: BACKUP_BUCKET
              value: "company-backups"
            resources:
              requests:
                memory: "256Mi"
                cpu: "250m"
              limits:
                memory: "512Mi"
            volumeMounts:
            - name: backup-scratch
              mountPath: /backup
          volumes:
          - name: backup-scratch
            emptyDir:
              sizeLimit: 5Gi
          restartPolicy: Never

Exit code 2 means a permanent configuration error (wrong credentials, missing database) – no point retrying. Exit code 1 means a transient error (network timeout, S3 unavailable) – retry up to twice with exponential backoff. The emptyDir volume provides scratch space for the dump file without requiring a PersistentVolume. One caveat: the stock postgres:16-alpine image does not include the aws CLI, so in practice the backup container needs a custom image that adds it.