# Running Kafka on Kubernetes with Strimzi
Running Kafka on Kubernetes without an operator is painful. You need StatefulSets, headless Services, init containers for broker ID assignment, and careful handling of storage and networking. Strimzi eliminates most of this by managing the entire Kafka lifecycle through Custom Resource Definitions.
## Installing Strimzi

```bash
# Option 1: Helm
helm repo add strimzi https://strimzi.io/charts
helm install strimzi strimzi/strimzi-kafka-operator \
  --namespace kafka \
  --create-namespace

# Option 2: Direct YAML install (quote the URL so the shell doesn't interpret '?')
kubectl create namespace kafka
kubectl apply -f 'https://strimzi.io/install/latest?namespace=kafka' -n kafka
```

Verify the operator is running:

```bash
kubectl get pods -n kafka -l name=strimzi-cluster-operator
```

The operator watches for Kafka custom resources across the cluster (or in specific namespaces, depending on installation configuration).
## Deploying a Kafka Cluster

The Kafka CRD defines the entire cluster: brokers, ZooKeeper (or KRaft), and the Entity Operator:

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
  namespace: kafka
spec:
  kafka:
    version: 3.7.0
    replicas: 3
    listeners:
      - name: plain
        port: 9092
        type: internal
        tls: false
      - name: tls
        port: 9093
        type: internal
        tls: true
    config:
      offsets.topic.replication.factor: 3
      transaction.state.log.replication.factor: 3
      transaction.state.log.min.isr: 2
      default.replication.factor: 3
      min.insync.replicas: 2
      log.retention.hours: 168
      log.segment.bytes: 1073741824
      num.partitions: 6
    storage:
      type: persistent-claim
      size: 50Gi
      class: gp3
      deleteClaim: false
    resources:
      requests:
        memory: 2Gi
        cpu: 500m
      limits:
        memory: 4Gi
        cpu: "2"
    jvmOptions:
      "-Xms": "1g"
      "-Xmx": "2g"
  zookeeper:
    replicas: 3
    storage:
      type: persistent-claim
      size: 10Gi
      class: gp3
      deleteClaim: false
    resources:
      requests:
        memory: 512Mi
        cpu: 250m
  entityOperator:
    topicOperator: {}
    userOperator: {}
```

Apply it and wait:

```bash
kubectl apply -f kafka-cluster.yaml
kubectl wait kafka/my-cluster --for=condition=Ready --timeout=300s -n kafka
```

Strimzi creates the Kafka broker pods (`my-cluster-kafka-0/1/2`) and ZooKeeper pods (`my-cluster-zookeeper-0/1/2`), along with Services, ConfigMaps, and Secrets for inter-broker communication. (Older Strimzi versions manage these pods through StatefulSets; recent versions use StrimziPodSets.)
For Kafka 3.7+ you can use KRaft mode instead of ZooKeeper by annotating the Kafka resource with `strimzi.io/kraft: enabled` (together with `strimzi.io/node-pools: enabled`) and replacing the `zookeeper` section with KafkaNodePool resources. KRaft removes the ZooKeeper dependency entirely.
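A sketch of a KRaft node pool for the cluster above; the pool name, sizing, and combined roles are illustrative choices, not requirements:

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaNodePool
metadata:
  name: dual-role            # illustrative pool name
  namespace: kafka
  labels:
    strimzi.io/cluster: my-cluster
spec:
  replicas: 3
  roles:                     # each node acts as both KRaft controller and broker
    - controller
    - broker
  storage:
    type: persistent-claim
    size: 50Gi
    class: gp3
    deleteClaim: false
```

Larger clusters typically split `controller` and `broker` roles into separate pools so they can be scaled and sized independently.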
## Storage: Persistent Volumes and JBOD

A single volume (shown above) works for most deployments. For high-throughput workloads, JBOD (Just a Bunch Of Disks) spreads partitions across multiple volumes:

```yaml
storage:
  type: jbod
  volumes:
    - id: 0
      type: persistent-claim
      size: 100Gi
      class: gp3
      deleteClaim: false
    - id: 1
      type: persistent-claim
      size: 100Gi
      class: gp3
      deleteClaim: false
```

Each broker gets two PVCs, and Kafka distributes partition logs across the volumes. With independently backed disks this can roughly double throughput by parallelizing disk I/O, though the actual gain depends on the storage backend.

Set `deleteClaim: false` in production. When set to `true`, deleting the Kafka resource deletes all PVCs, and your data with them.
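To see which PVCs Strimzi created (and which `deleteClaim: true` would remove), you can list them by the cluster label; this assumes the cluster name and namespace used above:

```bash
kubectl get pvc -n kafka -l strimzi.io/cluster=my-cluster
```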
## Listener Configuration

Listeners control how clients connect to Kafka. Strimzi supports several listener types for external access:

```yaml
listeners:
  # Internal cluster access
  - name: plain
    port: 9092
    type: internal
    tls: false
  # External via NodePort
  - name: external
    port: 9094
    type: nodeport
    tls: true
    authentication:
      type: tls
  # External via LoadBalancer (one per broker)
  - name: extlb
    port: 9095
    type: loadbalancer
    tls: true
  # External via Ingress (requires an ingress controller with TLS passthrough, e.g. NGINX)
  - name: extingress
    port: 9096
    type: ingress
    tls: true
    configuration:
      bootstrap:
        host: kafka-bootstrap.example.com
      brokers:
        - broker: 0
          host: kafka-0.example.com
        - broker: 1
          host: kafka-1.example.com
        - broker: 2
          host: kafka-2.example.com
```

Internal clients connect to `my-cluster-kafka-bootstrap.kafka.svc:9092`. NodePort is free but exposes high-numbered ports; LoadBalancer gives clean endpoints but creates one load balancer per broker, plus one for bootstrap.
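For the TLS listeners, clients need the cluster CA certificate, which Strimzi stores in the `<cluster>-cluster-ca-cert` Secret. A sketch of extracting it and building a Java truststore; the output file names and password are illustrative:

```bash
# Extract the cluster CA certificate (assumes cluster "my-cluster" in namespace "kafka")
kubectl get secret my-cluster-cluster-ca-cert -n kafka \
  -o jsonpath='{.data.ca\.crt}' | base64 -d > ca.crt

# Import it into a truststore for Java clients (password is illustrative)
keytool -import -trustcacerts -alias strimzi-ca \
  -file ca.crt -keystore truststore.jks -storepass changeit -noprompt
```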
## Topic and User Management

The Entity Operator watches for KafkaTopic and KafkaUser resources:

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: orders
  namespace: kafka
  labels:
    strimzi.io/cluster: my-cluster
spec:
  partitions: 12
  replicas: 3
  config:
    retention.ms: 604800000  # 7 days
    cleanup.policy: delete
    max.message.bytes: 1048576
    min.insync.replicas: 2
---
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaUser
metadata:
  name: order-processor
  namespace: kafka
  labels:
    strimzi.io/cluster: my-cluster
spec:
  authentication:
    type: tls
  authorization:
    type: simple
    acls:
      - resource:
          type: topic
          name: orders
          patternType: literal
        operations: [Read, Write, Describe]
        host: "*"
      - resource:
          type: group
          name: order-processor-group
          patternType: literal
        operations: [Read]
        host: "*"
```

The User Operator creates a Secret named `order-processor` containing the client certificate and key. Mount this into your consumer/producer pods.
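A sketch of mounting the generated Secret into a client pod; the pod name and image are illustrative, while the Secret names and key files (`user.crt`, `user.key`, `user.p12`, `user.password`) are what Strimzi generates:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: order-consumer                           # illustrative name
  namespace: kafka
spec:
  containers:
    - name: consumer
      image: registry.example.com/order-consumer:latest   # illustrative image
      volumeMounts:
        - name: user-certs
          mountPath: /etc/kafka/user             # user.crt, user.key, user.p12, user.password
          readOnly: true
        - name: cluster-ca
          mountPath: /etc/kafka/cluster-ca       # ca.crt
          readOnly: true
  volumes:
    - name: user-certs
      secret:
        secretName: order-processor
    - name: cluster-ca
      secret:
        secretName: my-cluster-cluster-ca-cert
```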
## Monitoring with JMX and Prometheus

Enable JMX metrics export in the Kafka resource:

```yaml
spec:
  kafka:
    metricsConfig:
      type: jmxPrometheusExporter
      valueFrom:
        configMapKeyRef:
          name: kafka-metrics
          key: kafka-metrics-config.yml
```

Key metrics: `UnderReplicatedPartitions` (replication health), `OfflinePartitionsCount` (partitions without a leader), `MessagesInPerSec` (throughput), `RequestHandlerAvgIdlePercent` (broker load), and `kafka_log_Log_Size` (disk usage per partition).
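The referenced ConfigMap holds JMX Prometheus Exporter relabeling rules. A minimal sketch; the two rule patterns here are illustrative, and the Strimzi examples repository ships a much fuller rule set:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: kafka-metrics
  namespace: kafka
data:
  kafka-metrics-config.yml: |
    lowercaseOutputName: true
    rules:
      # Export broker-level counters such as MessagesInPerSec
      - pattern: kafka.server<type=BrokerTopicMetrics, name=(.+)><>Count
        name: kafka_server_brokertopicmetrics_$1_total
        type: COUNTER
      # Export replica manager gauges such as UnderReplicatedPartitions
      - pattern: kafka.server<type=ReplicaManager, name=(.+)><>Value
        name: kafka_server_replicamanager_$1
        type: GAUGE
```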
## Common Issues
Under-replicated partitions. Causes: a broker is down, a disk is slow, or the network is congested. Check with `kafka-topics.sh --describe --under-replicated-partitions`.
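With Strimzi, one way to run that check is to exec into a broker pod; this assumes the cluster name and namespace used above:

```bash
kubectl exec -n kafka my-cluster-kafka-0 -- \
  /opt/kafka/bin/kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --under-replicated-partitions
```

Empty output means all partitions are fully replicated.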
Broker not joining the cluster. Check headless Service DNS resolution and ensure no NetworkPolicy blocks inter-broker traffic on ports 9091 (replication) and 2181 (ZooKeeper).
Disk full. Brokers become unresponsive. Monitor PVC usage and set log.retention.hours and log.retention.bytes. To recover, expand the PVC or reduce retention and wait for cleanup.
Consumer group rebalancing storms. Frequently restarting consumer pods triggers rebalances that pause every consumer in the group. Fix the root cause, and increase `session.timeout.ms` and `max.poll.interval.ms` to tolerate brief interruptions.
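As a sketch, the relevant consumer settings look like this; the values are illustrative starting points, not recommendations for every workload:

```properties
# Allow up to 45s without heartbeats before the broker evicts the consumer
session.timeout.ms=45000
# Allow up to 10 minutes between poll() calls before a rebalance is triggered
max.poll.interval.ms=600000
# Heartbeat at roughly one third of the session timeout
heartbeat.interval.ms=15000
```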