DaemonSets#
A DaemonSet ensures that a copy of a pod runs on every node in the cluster – or on a selected subset of nodes. When a new node joins the cluster, the DaemonSet controller automatically schedules a pod on it. When a node is removed, the pod is garbage collected.
This is the right abstraction for infrastructure that needs to run everywhere: log collectors, monitoring agents, network plugins, storage drivers, and security tooling.
When to Use DaemonSets#
DaemonSets solve problems where per-node presence matters:
- Log collection: Fluent Bit, Fluentd, or Promtail reading container logs from each node's /var/log and forwarding to a central system.
- Metrics: Prometheus node-exporter exposing hardware and OS metrics from every node.
- Networking: Calico, Cilium, or kube-proxy running on every node to provide pod networking and network policy enforcement.
- Storage: CSI drivers that must run on every node to provide volume mount capabilities.
- Security: Falco, Sysdig, or other runtime security agents monitoring system calls on each node.
Basic DaemonSet#
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: node-exporter
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      containers:
      - name: node-exporter
        image: prom/node-exporter:v1.7.0
        ports:
        - containerPort: 9100
          hostPort: 9100
        resources:
          requests:
            cpu: 50m
            memory: 64Mi
          limits:
            memory: 128Mi
        volumeMounts:
        - name: proc
          mountPath: /host/proc
          readOnly: true
        - name: sys
          mountPath: /host/sys
          readOnly: true
      volumes:
      - name: proc
        hostPath:
          path: /proc
      - name: sys
        hostPath:
          path: /sys
      hostNetwork: true
      hostPID: true
```
Node-exporter uses hostNetwork and hostPID because it needs direct access to node-level metrics. Most DaemonSets need some form of host access – log collectors mount /var/log, network plugins mount /opt/cni.
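To roll this out and confirm that the controller placed a pod on every eligible node, something like the following works (the manifest filename is assumed):

```bash
kubectl apply -f node-exporter-daemonset.yaml

# DESIRED should equal the number of eligible nodes; READY should catch up shortly after
kubectl get daemonset node-exporter -n monitoring
kubectl get pods -l app=node-exporter -n monitoring -o wide
```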
Node Selection#
Not every DaemonSet needs to run on every node. Use nodeSelector or nodeAffinity to restrict placement:
```yaml
spec:
  template:
    spec:
      nodeSelector:
        node-role.kubernetes.io/worker: ""
```
For more complex rules, use node affinity:
```yaml
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: kubernetes.io/os
                operator: In
                values: ["linux"]
              - key: node-role.kubernetes.io/control-plane
                operator: DoesNotExist
```
This schedules pods only on Linux worker nodes, excluding control plane nodes.
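Note that the node-role.kubernetes.io/worker label used in the nodeSelector example above is not applied automatically on most clusters; you would label worker nodes yourself (the node name here is illustrative):

```bash
kubectl label node worker-1 node-role.kubernetes.io/worker=""
```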
Tolerations#
Tolerations are critical for DaemonSets. Nodes often have taints to prevent regular workloads from scheduling on them – control plane nodes, GPU nodes, dedicated tenant nodes. A DaemonSet pod without the right tolerations will not schedule on tainted nodes, leaving gaps in your coverage.
For cluster-wide agents (logging, monitoring), tolerate everything:
```yaml
spec:
  template:
    spec:
      tolerations:
      - operator: Exists
```
The operator: Exists with no key matches all taints. This ensures the DaemonSet runs everywhere regardless of what taints exist.
For more selective targeting, tolerate specific taints:
```yaml
spec:
  template:
    spec:
      tolerations:
      - key: node-role.kubernetes.io/control-plane
        operator: Exists
        effect: NoSchedule
      - key: node.kubernetes.io/not-ready
        operator: Exists
        effect: NoExecute
      - key: node.kubernetes.io/unreachable
        operator: Exists
        effect: NoExecute
```
The not-ready and unreachable tolerations are important for monitoring agents – you want them running on unhealthy nodes precisely because those nodes need monitoring the most.
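To decide which taints a DaemonSet needs to tolerate, it helps to list the taints currently present in the cluster – a quick sketch using custom columns:

```bash
kubectl get nodes -o custom-columns='NODE:.metadata.name,TAINTS:.spec.taints[*].key'
```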
Update Strategies#
RollingUpdate (Default)#
Updates DaemonSet pods one (or more) at a time across nodes:
```yaml
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 0
```
- maxUnavailable: how many nodes can have their DaemonSet pod down simultaneously during the update. The default is 1. Set it higher for large clusters where updating one node at a time would take hours.
- maxSurge (v1.22+): how many extra pods can exist during the update. With maxSurge: 1, Kubernetes creates the new pod before killing the old one on each node, reducing downtime. Not all DaemonSets support this – if the pod uses hostPort or hostNetwork, two pods cannot coexist on the same node.
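For an agent that does not use hostPort or hostNetwork, a surge-based rollout keeps the old pod serving while its replacement starts – a sketch:

```yaml
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
```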
```bash
# Trigger a rolling update by changing the image
kubectl set image daemonset/fluent-bit fluent-bit=fluent/fluent-bit:3.0 -n logging

# Watch the rollout progress
kubectl rollout status daemonset/fluent-bit -n logging

# Roll back if something goes wrong
kubectl rollout undo daemonset/fluent-bit -n logging
```
OnDelete#
Pods are only replaced when manually deleted:
```yaml
spec:
  updateStrategy:
    type: OnDelete
```
This gives you full control over the update pace. Use it for sensitive node-level agents where you want to update one node, verify it works, then proceed. The tradeoff is operational overhead – you must delete pods yourself to trigger the update.

```bash
# Update the DaemonSet spec, then manually roll one node at a time
kubectl delete pod fluent-bit-7k2x4 -n logging

# Verify the replacement pod is healthy
kubectl get pod -l app=fluent-bit -n logging --field-selector spec.nodeName=worker-1

# Proceed to the next node
kubectl delete pod fluent-bit-9m3z8 -n logging
```
Resource Management#
DaemonSet pods compete with workload pods for node resources. A log collector with aggressive resource requests can starve application pods on small nodes.
Set requests conservatively and limits generously:
```yaml
resources:
  requests:
    cpu: 50m
    memory: 64Mi
  limits:
    memory: 256Mi
```
This reserves minimal resources for scheduling purposes but allows the pod to burst for short periods. Avoid setting CPU limits on DaemonSets – a log collector that gets throttled during a burst of application logs will fall behind and potentially lose data.
On nodes with limited capacity (e.g., small worker nodes, edge nodes), DaemonSet pods with high requests may be unable to schedule, leaving the pod in Pending state while the DaemonSet controller reports the node as missing coverage.
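Pending DaemonSet pods are easy to spot with a field selector; the label and namespace here follow the node-exporter example:

```bash
kubectl get pods -l app=node-exporter -n monitoring --field-selector=status.phase=Pending

# Then check the events to see why the scheduler rejected it
kubectl describe pod <pending-pod-name> -n monitoring
```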
Priority and Preemption#
Use PriorityClass to ensure critical DaemonSet pods are not evicted when a node is under resource pressure:
```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  # system-node-critical ships with Kubernetes; shown here for reference only,
  # you do not need to create it yourself
  name: system-node-critical
value: 2000001000
globalDefault: false
description: "Critical node-level infrastructure"
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
spec:
  template:
    spec:
      priorityClassName: system-node-critical
```
Kubernetes provides two built-in priority classes: system-node-critical and system-cluster-critical. Use these for infrastructure DaemonSets that must not be evicted. Application-level DaemonSets should use a custom PriorityClass with a lower value.
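For an application-level DaemonSet, a custom class with a lower value might look like this (the name and value are illustrative):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: app-daemonset-priority
value: 100000
globalDefault: false
description: "Application-level DaemonSets; evictable before system agents"
```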
DaemonSets vs Static Pods#
Static pods are managed directly by the kubelet, not by the Kubernetes API. They are defined as YAML files in /etc/kubernetes/manifests/ on each node. The control plane components (kube-apiserver, etcd, kube-scheduler) run as static pods.
DaemonSets are managed by the DaemonSet controller through the Kubernetes API. Use DaemonSets for everything except the control plane itself. They support rolling updates, label selectors, resource quotas, and all the lifecycle management that static pods lack.
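On a kubeadm-provisioned control plane node, for example, the static pod manifests are plain files on disk rather than API objects:

```bash
ls /etc/kubernetes/manifests/
# etcd.yaml  kube-apiserver.yaml  kube-controller-manager.yaml  kube-scheduler.yaml
```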
Debugging DaemonSets#
```bash
# Check rollout status
kubectl rollout status daemonset/fluent-bit -n logging

# See which nodes have pods and which are missing
kubectl get pods -l app=fluent-bit -n logging -o wide

# Compare against expected node count
kubectl get nodes --no-headers | wc -l
kubectl get pods -l app=fluent-bit -n logging --no-headers | wc -l

# If a pod is missing from a node, check for scheduling issues
kubectl describe daemonset fluent-bit -n logging
# Look for "Pods Status" and events showing why pods cannot schedule

# Check a specific node for taints that might block scheduling
kubectl describe node worker-3 | grep -A5 Taints
```
When a DaemonSet pod is missing from a node, the cause is almost always one of: the pod does not tolerate the node's taints, the pod's nodeSelector or nodeAffinity excludes the node, or the pod's resource requests exceed the node's available capacity.
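To identify exactly which nodes are missing a pod, you can diff the node list against the nodes the pods landed on – a sketch for the fluent-bit example:

```bash
comm -23 \
  <(kubectl get nodes -o custom-columns=NAME:.metadata.name --no-headers | sort) \
  <(kubectl get pods -l app=fluent-bit -n logging -o custom-columns=NODE:.spec.nodeName --no-headers | sort)
```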
Common Gotchas#
Node drain blocked by DaemonSet pods: When draining a node, DaemonSet pods are ignored by default (kubectl drain --ignore-daemonsets). However, if DaemonSet pods have a PodDisruptionBudget (PDB), the drain may block. This is unusual but can happen if someone applies a PDB that matches DaemonSet pods by label. The fix is either to exclude DaemonSet pods from the PDB selector or to use --delete-emptydir-data --ignore-daemonsets when draining.
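A typical drain invocation that skips DaemonSet pods and clears emptyDir volumes looks like this (the node name is illustrative):

```bash
kubectl drain worker-3 --ignore-daemonsets --delete-emptydir-data
```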
Resource requests too high: If a DaemonSet requests 1 CPU and 2Gi memory, and your nodes have 4 CPUs and 8Gi, you have given 25% of every node’s resources to a single infrastructure pod. Multiply by several DaemonSets (logging, monitoring, networking, security) and you can lose half your node capacity before any application pods schedule. Audit your DaemonSet resource requests regularly.
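A quick audit of DaemonSet requests across the cluster can be done with custom columns – a sketch:

```bash
kubectl get daemonsets -A -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,CPU-REQ:.spec.template.spec.containers[*].resources.requests.cpu,MEM-REQ:.spec.template.spec.containers[*].resources.requests.memory'
```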
Forgetting tolerations after adding taints: You add a new taint to a node group and your monitoring agent stops running on those nodes. Always audit DaemonSet tolerations when modifying node taints.
Practical Example: Fluent Bit DaemonSet#
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: logging
  labels:
    app: fluent-bit
spec:
  selector:
    matchLabels:
      app: fluent-bit
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
  template:
    metadata:
      labels:
        app: fluent-bit
    spec:
      priorityClassName: system-node-critical
      serviceAccountName: fluent-bit
      tolerations:
      - operator: Exists
      containers:
      - name: fluent-bit
        image: fluent/fluent-bit:3.0
        resources:
          requests:
            cpu: 50m
            memory: 64Mi
          limits:
            memory: 256Mi
        volumeMounts:
        - name: varlog
          mountPath: /var/log
          readOnly: true
        - name: containers
          mountPath: /var/lib/docker/containers
          readOnly: true
        - name: config
          mountPath: /fluent-bit/etc/
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: containers
        hostPath:
          path: /var/lib/docker/containers
      - name: config
        configMap:
          name: fluent-bit-config
```
This DaemonSet tolerates all taints (runs on every node including control plane), uses system-node-critical priority (will not be evicted under pressure), mounts host log directories read-only, and uses conservative resource requests to avoid starving application pods. The rolling update strategy updates one node at a time, ensuring log collection continues on other nodes during the rollout.
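The manifest references a fluent-bit-config ConfigMap (and a fluent-bit ServiceAccount) that are not shown above. A minimal sketch of the ConfigMap, tailing container logs and printing to stdout – a real deployment would output to Elasticsearch, Loki, or similar – might look like:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: logging
data:
  fluent-bit.conf: |
    [SERVICE]
        Log_Level  info

    [INPUT]
        Name       tail
        Path       /var/log/containers/*.log
        Tag        kube.*

    [OUTPUT]
        Name       stdout
        Match      *
```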