# Pod Affinity and Anti-Affinity
Node affinity controls which nodes a pod can run on. Pod affinity and anti-affinity go further – they control whether a pod should run near or away from other specific pods. This is how you co-locate a frontend with its cache for low latency, or spread database replicas across failure domains for high availability.
## Pod Affinity: Schedule Near Other Pods
Pod affinity tells the scheduler “place this pod in the same topology domain as pods matching a label selector.” The topology domain is defined by topologyKey – it could be the same node, the same zone, or any other node label.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-frontend
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-frontend
  template:
    metadata:
      labels:
        app: web-frontend
    spec:
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - redis-cache
              topologyKey: kubernetes.io/hostname
      containers:
        - name: frontend
          image: web-frontend:latest
```

This places each web-frontend pod on a node that already has a pod labeled `app=redis-cache`. If no node runs a matching pod, the frontend pod stays Pending.
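One way to sanity-check the result (assuming the cache Deployment really does label its pods `app=redis-cache`) is to list both sets of pods along with their node assignments:

```bash
# Show frontend and cache pods along with the node each one landed on
kubectl get pods -l 'app in (web-frontend,redis-cache)' -o wide
```

Each web-frontend pod's NODE column should match the node of one of the redis-cache pods.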
## topologyKey: Defining “Near”
The topologyKey is a node label that defines the scope of co-location or separation. The scheduler groups nodes by the value of this label and treats each group as a topology domain.
| topologyKey | Meaning | Use Case |
|---|---|---|
| `kubernetes.io/hostname` | Same node | Co-locate for lowest latency, separate for node-level HA |
| `topology.kubernetes.io/zone` | Same availability zone | Zone-level co-location or spreading |
| `topology.kubernetes.io/region` | Same region | Regional affinity |
| Custom label (e.g., `rack`) | Same rack/custom group | Rack-aware placement |
When you set `topologyKey: topology.kubernetes.io/zone`, the scheduler groups all nodes by the value of their zone label. Two pods with affinity will be placed in the same zone but not necessarily on the same node.
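To see which topology domains your cluster actually has, list the node labels as columns (the zone label is only present if your cloud provider or node setup populates it):

```bash
# Print each node with its hostname and zone label values
kubectl get nodes -L kubernetes.io/hostname -L topology.kubernetes.io/zone
```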
## Pod Anti-Affinity: Schedule Away from Other Pods
Anti-affinity is the opposite – it tells the scheduler “do not place this pod in the same topology domain as pods matching a label selector.” This is critical for spreading replicas.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: postgres
spec:
  replicas: 3
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - postgres
              topologyKey: kubernetes.io/hostname
      containers:
        - name: postgres
          image: postgres:16
```

This ensures every postgres pod runs on a different node. With 3 replicas, you need at least 3 nodes or the extra pods stay Pending.
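To confirm the spread, you can print each replica next to the node it landed on (output names will vary per cluster):

```bash
# One row per postgres pod with the node it was scheduled to
kubectl get pods -l app=postgres -o custom-columns=POD:.metadata.name,NODE:.spec.nodeName
```

Each row should show a distinct node; a Pending replica shows `<none>` in the NODE column.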
## Required vs Preferred
Just like node affinity, pod affinity/anti-affinity has hard and soft variants:
**Required** (`requiredDuringSchedulingIgnoredDuringExecution`): The pod will not schedule if the rule cannot be satisfied. Use this for hard requirements like “database replicas must be on different nodes.”
**Preferred** (`preferredDuringSchedulingIgnoredDuringExecution`): The scheduler tries to satisfy the rule but will schedule the pod elsewhere if it cannot. Each preferred rule has a weight from 1 to 100.
```yaml
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
              - key: app
                operator: In
                values:
                  - web-api
          topologyKey: topology.kubernetes.io/zone
```

The weight system works like scoring. When the scheduler evaluates candidate nodes, it sums the weights of all satisfied preferred rules. A node where the weight-100 anti-affinity is satisfied scores 100 points higher than one where it is not. With multiple preferred rules, the scheduler picks the node with the highest total score.
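To make the scoring concrete, here is a sketch with two preferred rules; the second term's `app: batch-worker` target is purely hypothetical. A node that satisfies both terms scores 100 + 30 = 130, one that satisfies only the first scores 100, and the scheduler prefers the highest total:

```yaml
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      # Strong preference: stay out of zones already running web-api pods
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: web-api
          topologyKey: topology.kubernetes.io/zone
      # Weaker preference: avoid nodes running a hypothetical batch-worker pod
      - weight: 30
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: batch-worker
          topologyKey: kubernetes.io/hostname
```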
## Namespace Scoping
By default, pod affinity/anti-affinity only matches pods in the same namespace as the pod being scheduled. You can expand or restrict this with namespaceSelector and namespaces:
```yaml
podAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
          - key: app
            operator: In
            values:
              - shared-cache
      topologyKey: kubernetes.io/hostname
      # Match pods in namespaces with this label
      namespaceSelector:
        matchLabels:
          team: platform
      # Or explicitly list namespaces
      # namespaces:
      #   - cache-namespace
      #   - shared-services
```

If you set an empty `namespaceSelector: {}`, it matches pods in all namespaces. Some form of namespace scoping (`namespaces` or a `namespaceSelector`) is required whenever your affinity target lives in a different namespace; otherwise the selector only sees pods in the pod's own namespace.
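For the match-everywhere case, the selector is simply left empty – a minimal sketch reusing the `shared-cache` target from above:

```yaml
podAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: shared-cache
      topologyKey: kubernetes.io/hostname
      # Empty selector: consider matching pods in every namespace
      namespaceSelector: {}
```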
## Practical Use Cases
### Spread Replicas Across Zones
The most common anti-affinity pattern: ensure a stateless service has replicas in multiple availability zones.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-api
  template:
    metadata:
      labels:
        app: web-api
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: web-api
                topologyKey: topology.kubernetes.io/zone
      containers:
        - name: api
          image: web-api:latest
```

Using preferred here means if you only have 2 zones but 3 replicas, the third replica still schedules (it just doubles up in one zone). With required, the third replica would stay Pending.
### Co-locate Frontend with Cache
Place frontend pods on the same node as Redis for minimal network latency:
```yaml
affinity:
  podAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: redis
          topologyKey: kubernetes.io/hostname
```

Using preferred with a weight avoids blocking scheduling when the Redis node is full.
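For this to do anything, the Redis pods must actually carry the `app: redis` label on the pod template, not just on the Deployment object. A minimal sketch of a matching cache Deployment (name, replica count, and image are assumptions):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis   # the label the frontend's podAffinity term matches
    spec:
      containers:
        - name: redis
          image: redis:7
```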
### Combined Pattern: Spread and Co-locate
A 3-replica stateless service that spreads across zones while preferring to be near its cache:
```yaml
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: web-api
          topologyKey: kubernetes.io/hostname
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 50
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app: redis
            topologyKey: topology.kubernetes.io/zone
```

This says: each replica must be on a different node (hard anti-affinity), and preferably in the same zone as a Redis pod (soft affinity).
## Performance Considerations
Pod affinity and anti-affinity are significantly more expensive for the scheduler to evaluate than node affinity. For every candidate node, the scheduler must check which pods are already running on nodes in the same topology domain. In large clusters (500+ nodes), this can noticeably slow scheduling.
Mitigations:
- Use `preferredDuringSchedulingIgnoredDuringExecution` instead of `required` when possible – it short-circuits faster.
- Limit `topologyKey` to smaller scopes (`kubernetes.io/hostname` is faster to evaluate than `topology.kubernetes.io/zone` because fewer nodes share the same hostname).
- Consider topology spread constraints (covered in a separate article) as a more efficient alternative for even distribution – see the sketch after this list.
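A minimal sketch of that alternative, assuming the same `web-api` labels as earlier – it spreads replicas across zones with at most one pod of imbalance, without the per-pod matching cost of anti-affinity:

```yaml
spec:
  topologySpreadConstraints:
    - maxSkew: 1                                # at most 1 pod difference between zones
      topologyKey: topology.kubernetes.io/zone  # spread across zones
      whenUnsatisfiable: ScheduleAnyway         # soft, like a preferred rule
      labelSelector:
        matchLabels:
          app: web-api
  containers:
    - name: api
      image: web-api:latest
```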
## Common Gotchas
**Required anti-affinity with not enough nodes.** If you have 5 replicas with required anti-affinity on hostname but only 4 nodes, the fifth replica is Pending forever. Use preferred unless you genuinely need the hard constraint.
```bash
# Find pending pods and see why
kubectl get pods --field-selector=status.phase=Pending
kubectl describe pod <pending-pod-name>
# Events will show: "0/4 nodes are available: 4 node(s) didn't match pod anti-affinity rules."
```

**Zone spreading with uneven capacity.** If you have 3 zones but zone-c is full, required anti-affinity across zones can leave pods Pending even though other zones have room. The scheduler cannot place a pod in zone-c if there are no available nodes there.
**Label selector must match existing pods.** If your `labelSelector` does not match any running pods, affinity has no effect (no pods to be “near”), and anti-affinity is trivially satisfied (no pods to avoid). Double-check your labels with `kubectl get pods --show-labels`.
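A quick way to check what a selector would actually match, using the cache label from the earlier example:

```bash
# List pods carrying the label the affinity term targets, with all their labels
kubectl get pods -l app=redis-cache --show-labels
```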
**Self-referencing anti-affinity.** When a Deployment uses anti-affinity with its own labels, the first pod schedules fine (no existing pods to conflict with). The second pod then avoids the first pod’s node. This is the expected behavior, but it means your anti-affinity is tested starting with the second replica, not the first.