Jobs and CronJobs#
Deployments manage long-running processes. Jobs manage work that finishes. A Job creates one or more pods, runs them to completion, and tracks whether they succeeded. CronJobs run Jobs on a schedule. Both are essential for database migrations, report generation, data pipelines, and any workload that is not a continuously running server.
Job Basics#
A Job runs a pod until it exits successfully (exit code 0). The simplest case is a single pod that runs once:
apiVersion: batch/v1
kind: Job
metadata:
  name: db-migration
spec:
  template:
    spec:
      containers:
      - name: migrate
        image: myapp:2.3.0
        command: ["./migrate", "--up"]
        env:
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: db-credentials
              key: url
      restartPolicy: Never

The restartPolicy must be Never or OnFailure – Jobs do not support Always because the pod is expected to terminate.
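To run this migration from a deploy pipeline and block until it finishes, you can pair kubectl apply with kubectl wait (the filename here is just an assumption):

kubectl apply -f db-migration-job.yaml
# exits non-zero if the Job does not complete within the timeout
kubectl wait --for=condition=complete job/db-migration --timeout=300s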
Completions and Parallelism#
For workloads that need multiple successful runs, use completions and parallelism:
spec:
  completions: 10    # 10 pods must succeed
  parallelism: 3     # run 3 pods at a time

This runs up to 3 pods concurrently until 10 pods have completed successfully. Useful for processing a batch of items where each pod handles one item.
Completion Modes#
Jobs support two completion modes:
- NonIndexed (default): each pod is interchangeable. The Job just counts successful completions.
- Indexed: each pod gets a unique completion index (0, 1, 2, …) available as the JOB_COMPLETION_INDEX environment variable. Useful when each pod must process a specific shard or partition.
spec:
  completions: 5
  parallelism: 5
  completionMode: Indexed

Each pod sees its index and can use it to determine which partition of work to handle – for example, processing records where id % 5 == $JOB_COMPLETION_INDEX.
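Building on that spec, here is a minimal sketch of a container consuming the index – the batch-worker image and process-shard.sh script are hypothetical placeholders:

spec:
  completions: 5
  parallelism: 5
  completionMode: Indexed
  template:
    spec:
      containers:
      - name: worker
        image: batch-worker:1.0                 # hypothetical image
        command: ["/bin/sh", "-c"]
        args:
        # JOB_COMPLETION_INDEX is injected automatically for Indexed Jobs
        - ./process-shard.sh --shard "$JOB_COMPLETION_INDEX" --total 5
      restartPolicy: Never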
Restart Policies and Their Tradeoffs#
The restartPolicy on a Job’s pod template controls what happens when a container fails:
Never: Kubernetes does not restart the failed container. Instead, it creates a new pod. The failed pod stays around in Error status. This is useful for debugging because you can inspect the logs of the failed pod, but it leaves pods behind that you need to clean up.
OnFailure: Kubernetes restarts the failed container in the same pod. The pod is reused, and the restart count increments. Logs from earlier attempts are harder to reach – kubectl logs --previous only shows the immediately preceding container run. This uses fewer resources but makes debugging harder.
For production, use Never when you need to inspect failures and OnFailure when you want automatic retry with minimal pod sprawl.
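With OnFailure, the logs of the most recent failed attempt are still reachable (the pod name below is illustrative):

# current (or last) container run
kubectl logs report-job-abc12
# the immediately preceding, failed container run in the same pod
kubectl logs report-job-abc12 --previous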
Retry Logic: backoffLimit#
The backoffLimit field controls how many times a Job retries before marking itself as failed. The default is 6.
spec:
  backoffLimit: 3
  template:
    spec:
      containers:
      - name: report
        image: report-generator:1.0
      restartPolicy: Never

Between retries, Kubernetes uses exponential backoff: 10 seconds, 20 seconds, 40 seconds, and so on, capped at 6 minutes. This prevents a Job from hammering a failing dependency.
When restartPolicy: Never, each retry creates a new pod. When restartPolicy: OnFailure, each retry restarts the container in place. The backoff limit counts both types of failures.
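With restartPolicy: Never, those failed attempts pile up as separate pods; you can list them via the job-name label the Job controller puts on every pod it creates:

# list every pod the Job created, including failed attempts
kubectl get pods -l job-name=db-migration
# inspect one failed attempt
kubectl logs db-migration-7x4k2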
Timeouts and Cleanup#
activeDeadlineSeconds#
Sets an absolute time limit for the entire Job, including all retries:
spec:
  activeDeadlineSeconds: 600    # fail the whole Job after 10 minutes
  backoffLimit: 3

If the Job exceeds this deadline, Kubernetes terminates all running pods and marks the Job as failed with reason DeadlineExceeded. This is your safety net against Jobs that hang indefinitely.
ttlSecondsAfterFinished#
Controls automatic cleanup of completed and failed Jobs:
spec:
  ttlSecondsAfterFinished: 3600    # delete Job and pods 1 hour after completion

Without this, completed Jobs and their pods stick around forever, cluttering kubectl get jobs output and consuming etcd storage. Set this for any Job you do not need to inspect long-term.
Pod Failure Policy (v1.26+)#
The podFailurePolicy lets you handle specific exit codes differently. This is powerful when certain failures should not trigger a retry:
spec:
  backoffLimit: 5
  podFailurePolicy:
    rules:
    - action: FailJob
      onExitCodes:
        containerName: migrate
        operator: In
        values: [42]
    - action: Ignore
      onExitCodes:
        containerName: migrate
        operator: In
        values: [143]
    - action: Count
      onExitCodes:
        containerName: migrate
        operator: NotIn
        values: [0, 42, 143]
  template:
    spec:
      containers:
      - name: migrate
        image: myapp:2.3.0
        command: ["./migrate"]
      restartPolicy: Never

The actions:
- FailJob: immediately fail the Job, no more retries. Use for permanent errors like invalid configuration (exit code 42 in this example).
- Ignore: do not count this failure against backoffLimit. Exit code 143 is SIGTERM – the pod was killed externally, not a real failure.
- Count: count the failure toward backoffLimit as normal. This is the default behavior.
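The policy only works if the application exits with deliberate codes. A hypothetical wrapper around the migrate command from the example (the --validate-config flag is invented for illustration):

#!/bin/sh
# hypothetical entrypoint: translate known-permanent failures into exit code 42
if ! ./migrate --validate-config; then
  echo "invalid migration config" >&2
  exit 42    # matches the FailJob rule: retrying will not help
fi
exec ./migrate --up    # transient failures exit non-zero and count toward backoffLimit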
CronJobs#
A CronJob creates Jobs on a schedule. The schedule field uses standard cron syntax:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: db-backup
spec:
  schedule: "0 */6 * * *"          # every 6 hours
  timeZone: "America/New_York"     # v1.27+ -- explicit timezone
  concurrencyPolicy: Forbid
  startingDeadlineSeconds: 300
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 5
  jobTemplate:
    spec:
      activeDeadlineSeconds: 1800
      backoffLimit: 2
      ttlSecondsAfterFinished: 86400
      template:
        spec:
          containers:
          - name: backup
            image: pg-backup:1.2
            command: ["./backup.sh"]
            env:
            - name: PGHOST
              value: "postgres-svc"
            - name: PGDATABASE
              value: "appdb"
            - name: BACKUP_BUCKET
              value: "s3://backups/db"
            volumeMounts:
            - name: credentials
              mountPath: /secrets
              readOnly: true
          volumes:
          - name: credentials
            secret:
              secretName: backup-credentials
          restartPolicy: OnFailure

Concurrency Policy#
Controls what happens when the next scheduled run fires while a previous Job is still running:
- Allow (default): multiple Jobs can run simultaneously. Fine for independent, idempotent work.
- Forbid: skip the new run if the previous one is still active. Use this when concurrent runs would conflict – like two backup jobs writing to the same location.
- Replace: kill the running Job and start a new one. Useful when freshness matters more than completion.
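As an illustration, a freshness-oriented workload – say a periodic cache refresh (a hypothetical example, not from the manifest above) – would pair a short schedule with Replace:

spec:
  schedule: "*/15 * * * *"        # every 15 minutes
  concurrencyPolicy: Replace      # a still-running refresh is killed; only the newest run matters
  jobTemplate:
    spec:
      activeDeadlineSeconds: 840  # keep each run well under the 15-minute interval
      template:
        spec:
          containers:
          - name: refresh
            image: cache-refresher:1.0    # hypothetical image
            command: ["./refresh.sh"]     # hypothetical script
          restartPolicy: OnFailure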
startingDeadlineSeconds#
If the CronJob controller misses a scheduled time (because of controller downtime or high cluster load), startingDeadlineSeconds defines how late is too late. If the current time is past scheduledTime + startingDeadlineSeconds, the run is skipped entirely.
This is critical for time-sensitive jobs. A report that must run between midnight and 12:05 AM should set startingDeadlineSeconds: 300. If the scheduler is down and wakes up at 12:10 AM, the run for that period is skipped rather than producing a late report with incorrect data boundaries.
If startingDeadlineSeconds is not set and the controller misses more than 100 consecutive schedules, the CronJob stops scheduling entirely as a safety measure.
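A sketch of that midnight-report configuration:

spec:
  schedule: "0 0 * * *"             # midnight
  startingDeadlineSeconds: 300      # if the run cannot start by 12:05 AM, skip it
  concurrencyPolicy: Forbid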
History Limits#
successfulJobsHistoryLimit (default 3) and failedJobsHistoryLimit (default 1) control how many completed Jobs are retained. In production, keep more failed Jobs than successful ones – you need the failed pods for debugging:
successfulJobsHistoryLimit: 3
failedJobsHistoryLimit: 5

Timezone Handling#
Before Kubernetes 1.27, CronJobs used the kube-controller-manager’s timezone (usually UTC). Starting in v1.27, the .spec.timeZone field lets you specify a timezone explicitly:
spec:
  schedule: "0 2 * * *"
  timeZone: "America/Chicago"

If your cluster is older, schedule in UTC and do the mental math, or accept the surprise when daylight saving time shifts your job by an hour.
Monitoring Jobs#
# List all Jobs in a namespace
kubectl get jobs -n batch-workloads
# Watch a Job's progress
kubectl get jobs db-migration -w
# See Job events (scheduling, pod creation, failures)
kubectl describe job db-migration
# Get logs from the Job's pod
kubectl logs job/db-migration
# For multi-pod Jobs, get logs from a specific pod
kubectl logs db-migration-7x4k2
# List CronJob schedules and last run
kubectl get cronjobs
# See last scheduled time and active Jobs
kubectl describe cronjob db-backup

When a Job fails, kubectl describe job <name> shows events that explain why: BackoffLimitExceeded, DeadlineExceeded, or pod-level failure messages.
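To test a CronJob without waiting for its schedule, create a one-off Job from its template:

# run the db-backup CronJob's jobTemplate right now as a standalone Job
kubectl create job db-backup-manual --from=cronjob/db-backup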
Common Patterns#
Database migrations as Jobs: Run as a pre-deploy Job (or init container). Set backoffLimit: 0 if migrations are not idempotent – a failed migration that partially applied should not retry automatically.
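A minimal sketch of that setting applied to the db-migration Job from the start of this section (the deadline value is an assumption):

spec:
  backoffLimit: 0              # a partially applied migration must not retry automatically
  activeDeadlineSeconds: 300   # assumption: cap a hung migration at 5 minutes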
Report generation as CronJobs: Schedule with concurrencyPolicy: Forbid to prevent overlap. Use activeDeadlineSeconds to catch hung report generators.
Queue workers as parallel Jobs: Set completions to the number of queue items and parallelism to your desired concurrency. Each pod pulls one item from the queue and exits.
One-shot admin tasks: Create a Job imperatively for debugging or one-off operations:
kubectl create job manual-cleanup --image=myapp:2.3.0 -- ./cleanup.sh --dry-run

Practical Example: Database Backup CronJob#
This CronJob runs a PostgreSQL backup every 6 hours. It retries twice on failure, times out after 30 minutes, and keeps failed Jobs for 5 days for debugging:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-backup
  namespace: database
spec:
  schedule: "0 0,6,12,18 * * *"
  timeZone: "UTC"
  concurrencyPolicy: Forbid
  startingDeadlineSeconds: 600
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 5
  jobTemplate:
    spec:
      activeDeadlineSeconds: 1800
      backoffLimit: 2
      ttlSecondsAfterFinished: 432000
      podFailurePolicy:
        rules:
        - action: FailJob
          onExitCodes:
            containerName: backup
            operator: In
            values: [2]
        - action: Count
          onExitCodes:
            containerName: backup
            operator: NotIn
            values: [0, 2]
      template:
        spec:
          containers:
          - name: backup
            image: postgres:16-alpine
            command:
            - /bin/sh
            - -c
            - |
              pg_dump -Fc "$DATABASE_URL" > /backup/db-$(date +%Y%m%d-%H%M%S).dump
              if [ $? -ne 0 ]; then exit 1; fi
              aws s3 cp /backup/*.dump "s3://${BACKUP_BUCKET}/postgres/"
            env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: postgres-credentials
                  key: url
            - name: BACKUP_BUCKET
              value: "company-backups"
            resources:
              requests:
                memory: "256Mi"
                cpu: "250m"
              limits:
                memory: "512Mi"
            volumeMounts:
            - name: backup-scratch
              mountPath: /backup
          volumes:
          - name: backup-scratch
            emptyDir:
              sizeLimit: 5Gi
          restartPolicy: Never

Exit code 2 means a permanent configuration error (wrong credentials, missing database) – no point retrying. Exit code 1 means a transient error (network timeout, S3 unavailable) – retry up to twice with exponential backoff. The emptyDir volume provides scratch space for the dump file without requiring a PersistentVolume. One caveat: the stock postgres:16-alpine image does not include the aws CLI, so in practice the backup container needs a custom image that adds it.