GPU and ML Workloads on Kubernetes#

Running GPU workloads on Kubernetes requires hardware-aware scheduling that the default scheduler does not provide out of the box. GPUs are expensive – an NVIDIA A100 node costs $3-12/hour on cloud providers – so efficient utilization matters far more than with CPU workloads. This article covers the full stack from device plugin installation through GPU sharing and monitoring.

The NVIDIA Device Plugin#

Kubernetes has no native understanding of GPUs. The NVIDIA device plugin bridges that gap by exposing GPUs as a schedulable resource (nvidia.com/gpu). Without it, the scheduler has no idea which nodes have GPUs or how many are available.

Installation#

The recommended deployment method is via the NVIDIA GPU Operator, which installs the device plugin along with drivers, container toolkit, and monitoring components:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --set driver.enabled=true \
  --set toolkit.enabled=true \
  --set devicePlugin.enabled=true \
  --set dcgmExporter.enabled=true

For clusters where GPU drivers are pre-installed on the node image (common on EKS with GPU AMIs, GKE with GPU node pools), set driver.enabled=false instead.
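
A sketch of the same install for such a cluster (same chart and flags as above; whether the container toolkit is also pre-installed varies by node image, so treat toolkit.enabled as an assumption to verify):

helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --set driver.enabled=false \
  --set toolkit.enabled=true \
  --set devicePlugin.enabled=true \
  --set dcgmExporter.enabled=true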

Verification#

After installation, GPU nodes should report available GPUs:

# Check GPU resources on nodes
kubectl describe node <gpu-node> | grep -A10 "Allocatable"
# Should show: nvidia.com/gpu: 4 (or however many GPUs the node has)

# Quick test: run nvidia-smi in a pod
# (the command is set inside --overrides, so no extra arguments are needed)
kubectl run gpu-test --rm -it --restart=Never \
  --image=nvidia/cuda:12.6.0-base-ubuntu24.04 \
  --overrides='{"spec":{"containers":[{"name":"gpu-test","image":"nvidia/cuda:12.6.0-base-ubuntu24.04","command":["nvidia-smi"],"resources":{"limits":{"nvidia.com/gpu":1}}}]}}'

Requesting GPUs in Pod Specs#

GPUs are requested via resource limits. Unlike CPU and memory, a GPU request cannot differ from its limit: you set the limit and the request defaults to the same value (an explicit request must match it exactly). GPU counts must be whole numbers – you cannot request 0.5 GPUs through the standard resource model.

apiVersion: v1
kind: Pod
metadata:
  name: ml-training
spec:
  containers:
  - name: trainer
    image: pytorch/pytorch:2.5.1-cuda12.6-cudnn9-runtime
    resources:
      limits:
        nvidia.com/gpu: 2  # Request exactly 2 GPUs
    command: ["python", "train.py"]
  restartPolicy: Never

Key rules for GPU resource specifications:

  • You set limits; the request is implicitly set to the same value. An explicit request is allowed only if it equals the limit (see the snippet after this list).
  • GPU limits must be positive integers. Fractional GPUs require GPU sharing (covered below).
  • If you request more GPUs than any single node has, the pod stays Pending forever.
  • GPUs are not oversubscribed by default. If a node has 4 GPUs and 4 pods each request 1, a 5th pod will not schedule.
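
As a sketch of the first rule, an explicit request is valid only when it matches the limit:

resources:
  requests:
    nvidia.com/gpu: 1  # optional; if present, must equal the limit
  limits:
    nvidia.com/gpu: 1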

Scheduling GPU Workloads#

Taints and Tolerations for GPU Nodes#

GPU nodes should be tainted to prevent non-GPU workloads from consuming their expensive resources:

kubectl taint nodes gpu-node-1 nvidia.com/gpu=present:NoSchedule

GPU workloads need a matching toleration. The GPU Operator's own DaemonSets can be given the same toleration through the daemonsets.tolerations Helm value, so its components still run on tainted nodes.

spec:
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  containers:
  - name: inference
    image: nvcr.io/nvidia/tritonserver:24.12-py3
    resources:
      limits:
        nvidia.com/gpu: 1

Node Affinity for GPU Type Selection#

Different ML workloads need different GPU types. Training often requires A100 or H100 GPUs, while inference can run on T4 or L4 GPUs. Label nodes by GPU type (kubectl label nodes gpu-node-1 gpu-type=nvidia-a100) and use node affinity to direct workloads. Cloud providers typically set GPU labels automatically (GKE uses cloud.google.com/gke-accelerator, EKS uses k8s.amazonaws.com/accelerator).

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: gpu-type
          operator: In
          values: ["nvidia-a100", "nvidia-h100"]

GPU Sharing Strategies#

A single GPU is expensive and often underutilized. An inference workload might use only 20% of an A100’s compute capacity. GPU sharing lets multiple workloads share a single physical GPU, dramatically improving cost efficiency.

Time-Slicing#

Time-slicing divides GPU time across multiple pods, similar to how CPU time-sharing works. Pods take turns on the GPU's compute, and all of them see and share the full GPU memory – nothing is partitioned or isolated. This is the simplest sharing mechanism.

Configure the device plugin with time-slicing:

# ConfigMap for NVIDIA device plugin
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config
  namespace: gpu-operator
data:
  config: |
    version: v1
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: true
        resources:
        - name: nvidia.com/gpu
          replicas: 4  # Each physical GPU appears as 4 virtual GPUs

With replicas: 4, a node with 2 physical GPUs reports 8 nvidia.com/gpu resources. Eight pods can each request 1 GPU.
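
To make the operator-managed device plugin pick up this config, point it at the ConfigMap. A hedged sketch using the devicePlugin.config Helm values (value names as in recent gpu-operator charts; verify against your chart version):

helm upgrade gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --reuse-values \
  --set devicePlugin.config.name=nvidia-device-plugin-config \
  --set devicePlugin.config.default=config  # the key under data: in the ConfigMap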

Trade-offs:

  • No memory isolation. If one pod allocates all GPU memory, others get CUDA out-of-memory errors.
  • No compute isolation. A pod running heavy training can starve inference pods sharing the same GPU.
  • Best for workloads with similar, predictable memory footprints (multiple inference services of similar size).

Multi-Instance GPU (MIG)#

MIG is an A100/A30/H100 hardware feature that partitions a single GPU into up to seven isolated instances (up to four on A30). Each MIG instance has dedicated compute cores, memory, and memory bandwidth. MIG provides true hardware-level isolation – workloads cannot interfere with each other.

# Enable MIG mode on the GPU (requires GPU reset)
nvidia-smi -i 0 -mig 1

# Create MIG instances (example: partition A100 80GB into profiles)
# 3g.40gb = 3 compute slices, 40GB memory
nvidia-smi mig -i 0 -cgi 9,9 -C  # Two 3g.40gb instances

# List MIG devices
nvidia-smi mig -i 0 -lgi

The GPU Operator manages MIG declaratively: its MIG Manager applies partition layouts based on the nvidia.com/mig.config node label, and with the mixed MIG strategy the device plugin advertises each profile as its own resource. Pods request MIG instances by profile name:

resources:
  limits:
    nvidia.com/mig-3g.40gb: 1  # Request one 3g.40gb MIG instance
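
The partition layout itself can be driven through the node label. A sketch assuming the operator's default mig-parted config, which includes named layouts such as all-3g.40gb (verify the layout name exists for your GPU model):

# Ask the MIG Manager to repartition this node into 3g.40gb instances
kubectl label nodes <gpu-node> nvidia.com/mig.config=all-3g.40gb --overwrite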

MIG profile options on A100 80GB:

| Profile | Compute Slices | Memory | Use Case |
|---------|----------------|--------|----------|
| 7g.80gb | 7 (full GPU) | 80GB | Large training jobs |
| 4g.40gb | 4 | 40GB | Medium training/large inference |
| 3g.40gb | 3 | 40GB | Medium inference |
| 2g.20gb | 2 | 20GB | Small inference |
| 1g.10gb | 1 | 10GB | Lightweight inference, notebooks |

Trade-offs:

  • Only available on A100, A30, H100, and newer GPUs.
  • Changing MIG profiles requires draining the node and resetting the GPU.
  • Fixed partitions – you cannot dynamically resize a MIG instance.
  • Best for multi-tenant clusters where isolation is required.

Multi-Process Service (MPS)#

MPS allows multiple CUDA contexts to share a GPU simultaneously with fine-grained compute partitioning. Unlike time-slicing (which context-switches), MPS enables true concurrent execution. Configuration is similar to time-slicing in the device plugin ConfigMap, replacing the timeSlicing block with an mps block. MPS provides better utilization than time-slicing but weaker isolation than MIG. Best for trusted workloads from the same team that need concurrent GPU access.
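
A hedged sketch of that config (the mps block is supported in recent device plugin releases; field names mirror the time-slicing example above, so verify against your plugin version):

# Same ConfigMap shape as the time-slicing example, with an mps block instead of timeSlicing
data:
  config: |
    version: v1
    sharing:
      mps:
        resources:
        - name: nvidia.com/gpu
          replicas: 4  # up to 4 MPS clients share each physical GPU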

Choosing a Sharing Strategy#

| Criteria | Time-Slicing | MIG | MPS |
|----------|--------------|-----|-----|
| Isolation | None | Hardware-enforced | Partial |
| Supported GPUs | All NVIDIA | A100, A30, H100+ | Volta and newer |
| Setup complexity | Low | Medium-High | Medium |
| Memory isolation | None | Full | Configurable |
| Best for | Homogeneous inference | Multi-tenant, security-sensitive | Concurrent small workloads |
| Worst for | Mixed memory requirements | Dynamic workloads needing flexibility | Untrusted workloads |

Monitoring GPU Utilization#

GPU monitoring is essential because GPU nodes are your most expensive resources. A GPU sitting at 5% utilization while costing $12/hour is the most egregious form of waste in Kubernetes.

DCGM Exporter#

The NVIDIA DCGM Exporter exposes GPU metrics to Prometheus (installed automatically by the GPU Operator, or deploy separately with Helm). Key metrics: DCGM_FI_DEV_GPU_UTIL (compute utilization), DCGM_FI_DEV_FB_USED/DCGM_FI_DEV_FB_FREE (GPU memory), and DCGM_FI_DEV_GPU_TEMP (temperature). Set alerts on utilization below 10% for 30 minutes (waste), memory above 95% (OOM risk), and temperature above 85C (throttling).
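
A sketch of those alerts as Prometheus rules, using the thresholds above (assumes the Prometheus Operator CRDs are installed and that Prometheus scrapes the DCGM Exporter; names, namespace, and the shorter "for" durations are placeholders):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gpu-alerts
  namespace: gpu-operator
spec:
  groups:
  - name: gpu
    rules:
    - alert: GPUUnderutilized
      expr: DCGM_FI_DEV_GPU_UTIL < 10   # idle GPU = wasted money
      for: 30m
      labels: {severity: warning}
    - alert: GPUMemoryNearlyFull
      expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 0.95
      for: 5m
      labels: {severity: warning}
    - alert: GPUTooHot
      expr: DCGM_FI_DEV_GPU_TEMP > 85   # thermal throttling risk
      for: 10m
      labels: {severity: critical}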

Quick Utilization Check#

For a fast spot-check without Prometheus:

# Run nvidia-smi on a GPU node via a debug pod. The debug pod itself gets no GPU
# passthrough, so run the host's nvidia-smi via the node filesystem mounted at /host.
kubectl debug node/<gpu-node> -it --image=nvidia/cuda:12.6.0-base-ubuntu24.04 -- chroot /host nvidia-smi

# Or check from inside a running GPU pod
kubectl exec -it <gpu-pod> -- nvidia-smi --query-gpu=utilization.gpu,utilization.memory,memory.used,memory.total --format=csv

Common Gotchas#

Pods stuck Pending with “insufficient nvidia.com/gpu”. Check that the NVIDIA device plugin DaemonSet is running on GPU nodes (kubectl get pods -n gpu-operator). If the plugin pod is in CrashLoopBackOff, the drivers may not be installed or the container runtime is not configured for GPU passthrough.
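
A quick triage sequence for this case (DaemonSet name as deployed by the GPU Operator; adjust if yours differs):

# Is the device plugin running on every GPU node?
kubectl get daemonset -n gpu-operator nvidia-device-plugin-daemonset
kubectl get pods -n gpu-operator -o wide | grep device-plugin

# Does the node actually advertise the resource?
kubectl get node <gpu-node> -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'

# Why exactly is the pod unschedulable?
kubectl describe pod <pending-pod> | grep -A10 Events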

CUDA version mismatches. The CUDA version in your container must be compatible with the driver version on the node. Use the GPU Operator to keep drivers consistent across nodes.
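
To compare the two, the nvidia-smi header reports the driver version and the newest CUDA version that driver supports; checking the toolkit inside the container assumes a -devel image that ships nvcc:

# Driver version and max supported CUDA version (header of the output)
kubectl exec -it <gpu-pod> -- nvidia-smi
# CUDA toolkit version baked into the image (devel images only)
kubectl exec -it <gpu-pod> -- nvcc --version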

GPU memory leaks on abnormal exit. If a pod is killed without clean CUDA context teardown, GPU memory can leak until the GPU is reset. MIG eliminates this because each instance has isolated memory.

No GPU oversubscription by default. Unlike CPU, GPU requests are absolute. There is no concept of GPU “burstable” in standard Kubernetes. GPU sharing (time-slicing, MIG, MPS) is the only way to share a physical GPU across pods.