Pod Topology Spread Constraints#
Pod anti-affinity gives you binary control: either a pod avoids another pod’s topology domain or it does not. But it does not give you even distribution. If you have 6 replicas and 3 zones, anti-affinity cannot express “put exactly 2 in each zone.” Topology spread constraints solve this by letting you specify the maximum allowed imbalance between any two topology domains.
How Topology Spread Works#
A topology spread constraint defines:
- Which topology domains to spread across (via topologyKey)
- How much imbalance is acceptable (via maxSkew)
- What to do when the constraint cannot be met (via whenUnsatisfiable)
- Which pods count toward the distribution (via labelSelector)
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-api
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web-api
  template:
    metadata:
      labels:
        app: web-api
    spec:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: web-api
      containers:
      - name: api
        image: web-api:latest
```

With 6 replicas, 3 zones, and maxSkew: 1, the scheduler distributes pods as evenly as possible. The result is 2 pods per zone. If one zone already has 2 pods and another has 1, the scheduler places the next pod in the zone with fewer pods to keep the skew within 1.
The maxSkew Parameter#
maxSkew is the maximum allowed difference in pod count between any two topology domains. It is always a positive integer.
- maxSkew: 1 – the strictest possible. Domains can differ by at most 1 pod. With 6 pods across 3 zones, you get 2-2-2. With 7 pods, you get 3-2-2, with the extra pod in any one zone.
- maxSkew: 2 – more relaxed. One zone can have up to 2 more pods than another. With 6 pods across 3 zones, you could get 3-2-1.
In most production scenarios, maxSkew: 1 is the right choice for zone-level spreading. Use higher values only when you need scheduling flexibility and can tolerate uneven distribution.
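For illustration, a relaxed zone constraint might look like the following. This is a minimal sketch that reuses the app: web-api labels from the earlier example:

```yaml
topologySpreadConstraints:
- maxSkew: 2                          # any zone may have up to 2 more pods than the emptiest zone
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: web-api
```

With 6 replicas this permits uneven distributions such as 3-2-1 in addition to the even 2-2-2.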
whenUnsatisfiable#
This controls what happens when the constraint cannot be met:
| Value | Behavior |
|---|---|
| DoNotSchedule | Hard constraint. The pod stays Pending if placing it would violate maxSkew. |
| ScheduleAnyway | Soft constraint. The scheduler minimizes skew as a scoring factor but still schedules the pod even if skew is exceeded. |
```yaml
topologySpreadConstraints:
# Hard: zone spread is critical for HA
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: web-api
# Soft: node spread is nice to have
- maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: ScheduleAnyway
  labelSelector:
    matchLabels:
      app: web-api
```

Use DoNotSchedule for failure-domain spreading that is truly critical (zones, regions). Use ScheduleAnyway for best-effort spreading (across nodes within a zone) where you prefer even distribution but cannot afford Pending pods.
minDomains#
minDomains ensures pods spread across at least a minimum number of topology domains. This prevents all pods from piling into a single zone when the cluster currently has nodes in fewer zones than you intend to use.
```yaml
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule
  minDomains: 3
  labelSelector:
    matchLabels:
      app: web-api
```

Without minDomains, if your cluster only has nodes in one zone, the scheduler happily places all pods there (the skew is 0 because there is only one domain). With minDomains: 3, the scheduler treats the missing zones as domains with 0 pods. The first pod can land in the single available zone (skew 1 - 0 = 1), but the next pod would push the skew to 2, violating maxSkew: 1, so it stays Pending until nodes exist in enough zones.
minDomains requires whenUnsatisfiable: DoNotSchedule and the MinDomainsInPodTopologySpread feature gate (stable since Kubernetes 1.30).
Multiple Constraints#
You can combine multiple topology spread constraints to achieve multi-level spreading. The scheduler must satisfy all constraints simultaneously.
```yaml
topologySpreadConstraints:
# Level 1: spread across zones
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: web-api
# Level 2: spread across nodes within each zone
- maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: ScheduleAnyway
  labelSelector:
    matchLabels:
      app: web-api
```

This gives you two-dimensional spreading: even across zones (hard requirement) and even across nodes within each zone (soft preference). For 6 replicas across 3 zones with 2 nodes per zone, the ideal result is 1 pod per node.
Interaction with Node Affinity#
Topology spread constraints only consider nodes that the pod is eligible to run on. If you combine spread constraints with node affinity, the scheduler first filters to matching nodes and then evaluates spread within that subset.
```yaml
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: node-type
            operator: In
            values:
            - compute
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: web-api
```

Here, the pod only runs on node-type=compute nodes, and the spread is calculated only across those nodes. If compute nodes exist in only 2 of 3 zones, the scheduler spreads across 2 zones, not 3.
Interaction with Pod Affinity#
Topology spread constraints and pod affinity/anti-affinity work independently. The scheduler must satisfy both. This can create conflicts:
- Topology spread says “distribute evenly across zones”
- Pod affinity says “run near Redis pods, which are only in zone-a”
In this case, the pod might stay Pending because it cannot be both evenly spread and co-located with Redis. If this happens, use ScheduleAnyway for the spread constraint or preferred for the affinity.
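One way to relax the combination is sketched below, assuming the Redis pods carry an app: redis label (that label is an assumption, not something defined earlier). The affinity becomes a preference and the spread becomes soft, so neither can leave the pod Pending:

```yaml
spec:
  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          topologyKey: topology.kubernetes.io/zone
          labelSelector:
            matchLabels:
              app: redis               # assumed label on the Redis pods
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: ScheduleAnyway  # soft: prefer even spread, never block
    labelSelector:
      matchLabels:
        app: web-api
```

Keeping one of the two as a hard requirement is also reasonable; relax whichever constraint you can afford to break.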
Cluster-Level Defaults#
You can set default topology spread constraints for all pods at the cluster level by configuring the kube-scheduler. This is useful when you want every workload to spread across zones without requiring every team to add constraints to their specs.
```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- pluginConfig:
  - name: PodTopologySpread
    args:
      defaultConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway
      defaultingType: List
```

Cluster defaults are overridden by any topologySpreadConstraints defined in the pod spec.
Pod Anti-Affinity vs Topology Spread Constraints#
Both features control pod distribution, but they work differently:
| Aspect | Pod Anti-Affinity | Topology Spread Constraints |
|---|---|---|
| Control | Binary: avoid or do not avoid | Numeric: maxSkew defines allowed imbalance |
| Even distribution | Cannot enforce even spread | Designed for even spread |
| Performance | Expensive to evaluate at scale | More efficient at scale |
| Flexibility | required or preferred | DoNotSchedule or ScheduleAnyway, plus maxSkew tuning |
| Simplicity | Simpler to understand | More parameters to configure |
| Best for | “No two replicas on the same node” | “Spread 6 replicas evenly across 3 zones” |
Use pod anti-affinity when you need simple binary exclusion (no two pods on the same node). Use topology spread constraints when you need controlled, even distribution across multiple domains.
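To make the contrast concrete, here is a sketch of both approaches side by side, reusing the app: web-api labels from the earlier examples:

```yaml
# Pod anti-affinity: binary exclusion, no two web-api pods on the same node
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - topologyKey: kubernetes.io/hostname
      labelSelector:
        matchLabels:
          app: web-api

# Topology spread: numeric control, zones may differ by at most 1 pod
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: DoNotSchedule
  labelSelector:
    matchLabels:
      app: web-api
```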
Common Gotchas#
Label selector must match the pods being scheduled. The labelSelector in a topology spread constraint should match the labels of the pods in the same Deployment. If it does not match, the constraint has no effect because no existing pods count toward the distribution. The scheduler sees zero pods in every domain and considers skew to be 0.
```bash
# Verify labels match
kubectl get pods -l app=web-api --show-labels
```

maxSkew: 1 with fewer replicas than domains. If you have 3 zones and only 2 replicas with maxSkew: 1 and DoNotSchedule, one zone will always be empty. This is fine: the skew is 1 - 0 = 1, which satisfies the constraint. Even with minDomains: 3 and only 2 replicas, a 1-1-0 distribution has a skew of 1 and still satisfies maxSkew: 1. The combination only blocks scheduling when placing a pod would give a single domain more than maxSkew pods while empty domains still count toward the minimum.
Topology domains with no eligible nodes are ignored. If zone-c exists but has no nodes matching your nodeSelector, it is not counted as a topology domain. Pods will not be Pending waiting for a domain that has no eligible nodes.
Rollout interactions. During a rolling update, old and new pods both count toward the spread calculation, because the labelSelector matches both the old and the new ReplicaSet. Surge pods are scheduled against a distribution that still includes pods about to be terminated, so the spread can end up uneven once the old pods are gone, and with DoNotSchedule a surge pod can sit Pending when the only zones that would keep the skew within maxSkew have no capacity left for it. Set appropriate maxSurge and maxUnavailable in your Deployment strategy so the rollout does not stall.
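As a sketch, a rollout strategy that pairs reasonably with a strict zone constraint might look like this; the exact values depend on your replica count and capacity:

```yaml
spec:
  replicas: 6
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1         # only one extra pod competes for a spread slot at a time
      maxUnavailable: 1   # let one old pod terminate so its slot frees up quickly
```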