Event-Driven Architecture for Microservices#
In a microservices architecture, services need to communicate. The two fundamental approaches are synchronous (request-response) and asynchronous (event-driven). Most systems use both – the decision is which interactions should be synchronous and which should be event-driven.
Synchronous vs Asynchronous Communication#
Synchronous (request-response): Service A calls Service B and waits for a response. Simple, familiar, and works well when A needs the response to continue. The cost is temporal coupling – if B is down, A fails.
Asynchronous (event-driven): Service A publishes an event and continues without waiting. Service B processes the event later. This decouples services in time – A does not know or care when B processes the event.
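To make the contrast concrete, here is a minimal Go sketch of both styles; the pricing URL and the Publisher interface are illustrative stand-ins, not a specific framework:

```go
import "net/http"

// priceServiceURL and Publisher are illustrative assumptions for this sketch.
const priceServiceURL = "http://pricing-service.internal/prices/"

// Publisher stands in for whatever message broker client is in use.
type Publisher interface {
	Publish(topic string, payload []byte) error
}

// Synchronous request-response: the caller blocks and fails if pricing is down.
func getPrice(productID string) (*http.Response, error) {
	return http.Get(priceServiceURL + productID)
}

// Asynchronous event: publish and continue; consumers process it later.
func announceOrderPlaced(broker Publisher, orderJSON []byte) error {
	return broker.Publish("order.placed", orderJSON)
}
```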
When to use synchronous calls#
- Queries that need immediate answers. A checkout page needs the current price from the pricing service right now. An event-based “eventually the price will arrive” model is unacceptable here.
- Operations that must complete atomically. A payment authorization that must succeed before confirming an order.
- Simple service-to-service calls with low latency requirements. A user service looking up a profile by ID. Adding a message broker for this is overhead without benefit.
When to use events#
- Notifications that do not need a response. “Order was placed” does not need to wait for the email service, the analytics service, or the inventory service to acknowledge.
- Operations that can be processed later. Image resizing, report generation, data synchronization. The user does not wait for these.
- Cross-service data propagation. When the customer service updates an address, the order service, shipping service, and billing service all need to know. Events let you add consumers without changing the publisher.
- Long-running workflows. An order fulfillment process that spans multiple services over minutes or hours. Holding an HTTP connection open for this is not viable.
Event Patterns#
Event notification#
The simplest pattern. A service publishes an event when something happens. Other services react to it.
```json
{
  "event_type": "order.created",
  "event_id": "evt-abc-123",
  "timestamp": "2026-02-22T14:30:00Z",
  "data": {
    "order_id": "ord-456",
    "customer_id": "cust-789"
  }
}
```

The event carries minimal data – just enough for consumers to know something happened. If consumers need more details, they call back to the originating service. This keeps events small and avoids leaking internal data models.
Tradeoff: Extra call back to the source service. If many consumers call back, the source service sees amplified load.
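For illustration, a notification consumer does its work by calling back to the owning service. In the sketch below, the order-service base URL and the /orders/{id} endpoint are assumptions, not part of the event contract:

```go
import (
	"encoding/json"
	"fmt"
	"net/http"
)

// OrderEvent mirrors the minimal "order.created" payload above.
type OrderEvent struct {
	EventType string `json:"event_type"`
	EventID   string `json:"event_id"`
	Data      struct {
		OrderID    string `json:"order_id"`
		CustomerID string `json:"customer_id"`
	} `json:"data"`
}

// orderAPIBaseURL is an illustrative internal address, not a real service.
const orderAPIBaseURL = "http://order-service.internal"

func handleOrderCreated(raw []byte) error {
	var evt OrderEvent
	if err := json.Unmarshal(raw, &evt); err != nil {
		return fmt.Errorf("decode event: %w", err)
	}

	// The event only says that something happened, so call back to the
	// originating service for the full order details.
	resp, err := http.Get(fmt.Sprintf("%s/orders/%s", orderAPIBaseURL, evt.Data.OrderID))
	if err != nil {
		return fmt.Errorf("fetch order %s: %w", evt.Data.OrderID, err)
	}
	defer resp.Body.Close()

	// ... decode the response and do consumer-specific work (email, analytics, etc.)
	return nil
}
```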
Event-carried state transfer#
The event carries the full data that consumers need:
```json
{
  "event_type": "customer.address_updated",
  "event_id": "evt-def-456",
  "timestamp": "2026-02-22T14:31:00Z",
  "data": {
    "customer_id": "cust-789",
    "address": {
      "street": "123 Main St",
      "city": "Portland",
      "state": "OR",
      "zip": "97201"
    }
  }
}
```

Consumers store the data locally and do not need to call back. This improves autonomy and performance but means consumers hold copies of data they do not own. Those copies can be stale – if a consumer missed an event or processed events out of order, its local copy is wrong.
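One common mitigation is to version the carried data and make the consumer's local write conditional on that version, so an older or out-of-order event cannot overwrite newer data. A minimal sketch, assuming the publisher adds a version field (not shown in the event above) and the consumer keeps its copy in PostgreSQL:

```go
import "database/sql"

// CustomerAddressUpdated is the consumer-side view of the event payload;
// AddressVersion is an assumed field the publisher would have to add.
type CustomerAddressUpdated struct {
	CustomerID     string
	Address        string // raw address JSON from the event
	AddressVersion int64
}

func applyAddressUpdated(db *sql.DB, evt CustomerAddressUpdated) error {
	// Conditional upsert: an older event cannot overwrite a newer local copy
	// because of the version check in the WHERE clause.
	_, err := db.Exec(`
		INSERT INTO customer_addresses (customer_id, address, version)
		VALUES ($1, $2, $3)
		ON CONFLICT (customer_id) DO UPDATE
		   SET address = EXCLUDED.address, version = EXCLUDED.version
		 WHERE customer_addresses.version < EXCLUDED.version`,
		evt.CustomerID, evt.Address, evt.AddressVersion)
	return err
}
```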
Event sourcing#
Instead of storing the current state of an entity, store the sequence of events that led to that state. The current state is derived by replaying events.
```text
# Traditional: store current state
orders table: { id: "ord-456", status: "shipped", total: 5999 }

# Event sourced: store the event log
events table:
  { order_id: "ord-456", type: "OrderCreated", data: { items: [...], total: 5999 } }
  { order_id: "ord-456", type: "PaymentReceived", data: { amount: 5999 } }
  { order_id: "ord-456", type: "OrderShipped", data: { tracking: "1Z999..." } }
```

Benefits: Complete audit trail, ability to reconstruct state at any point in time, natural fit for event-driven architectures.
Costs: Complexity in querying (you cannot SELECT * FROM orders WHERE status = 'shipped' without a read model), event schema evolution is hard (you must handle old event formats forever), eventual consistency between the event store and read models.
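That read-model requirement follows directly from how state is stored: the current state only exists after folding the event log. A minimal sketch of that replay step, using illustrative types rather than a prescribed schema:

```go
// Event and Order are illustrative types for this sketch only.
type Event struct {
	Type string
	Data map[string]any
}

type Order struct {
	Status   string
	Total    int
	Tracking string
}

// Replay folds the event log into the current state, oldest event first.
func Replay(events []Event) Order {
	var o Order
	for _, e := range events {
		switch e.Type {
		case "OrderCreated":
			o.Status = "created"
			o.Total, _ = e.Data["total"].(int)
		case "PaymentReceived":
			o.Status = "paid"
		case "OrderShipped":
			o.Status = "shipped"
			o.Tracking, _ = e.Data["tracking"].(string)
		}
	}
	return o
}
```

In a real system a projection would persist this derived state as a read model rather than replay the log on every query.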
Use event sourcing when: You need a complete audit trail (financial systems, compliance-regulated domains), you need to reconstruct historical state, or the domain is naturally event-centric (trading systems, logistics).
Do not use event sourcing when: Simple CRUD is sufficient, your team does not have distributed systems experience, or you cannot accept eventual consistency for reads.
CQRS (Command Query Responsibility Segregation)#
CQRS separates the write model (commands) from the read model (queries). Writes go to one data store optimized for consistency, reads come from a different store optimized for query patterns.
```text
# Write side (command)
Client -> Command Handler -> Event Store
                          -> Publish Event

# Read side (query)
Event  -> Projection Handler -> Read Database (optimized for queries)
Client -> Query Handler      -> Read Database
```

CQRS is often paired with event sourcing but does not require it. You can use CQRS with a traditional database for writes and a materialized view or search index for reads.
Example: An order service writes to a normalized PostgreSQL database. A projection handler listens for order events and updates an Elasticsearch index optimized for the “search orders” API. Writes go to Postgres. Reads go to Elasticsearch.
```text
Order Created Event -> Projection Handler -> Elasticsearch
                                          -> Redis (for real-time dashboards)
                                          -> PostgreSQL read replica (for reports)
```

The cost is eventual consistency. After a write, the read model may not reflect the change immediately. For most use cases, a few hundred milliseconds of lag is acceptable. For cases where it is not (user just created an order and immediately views it), use read-your-writes consistency – route the read to the write database for that specific user for a short window after the write.
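A sketch of that routing rule, assuming the service tracks the time of each user's last write in memory (the map, the lock, and the window length are illustrative choices):

```go
import (
	"sync"
	"time"
)

// Router decides whether a user's reads should bypass the (possibly lagging)
// read model and go straight to the write database.
type Router struct {
	mu          sync.Mutex
	lastWriteAt map[string]time.Time
	writeWindow time.Duration // e.g. 1 * time.Second
}

func NewRouter(window time.Duration) *Router {
	return &Router{lastWriteAt: make(map[string]time.Time), writeWindow: window}
}

// NoteWrite records that this user just wrote data.
func (r *Router) NoteWrite(userID string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.lastWriteAt[userID] = time.Now()
}

// UseWriteDB reports whether this user's reads should hit the write database.
func (r *Router) UseWriteDB(userID string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	return time.Since(r.lastWriteAt[userID]) < r.writeWindow
}
```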
Saga Pattern#
A saga manages a distributed transaction that spans multiple services. Instead of a single ACID transaction, a saga is a sequence of local transactions with compensating actions to undo completed steps if a later step fails.
Choreography-based saga#
Each service listens for events and decides what to do. There is no central coordinator.
```text
1. Order Service:     publishes "OrderCreated"
2. Payment Service:   hears "OrderCreated", charges payment, publishes "PaymentCompleted"
3. Inventory Service: hears "PaymentCompleted", reserves stock, publishes "StockReserved"
4. Shipping Service:  hears "StockReserved", creates shipment, publishes "ShipmentCreated"

# If payment fails:
2. Payment Service: publishes "PaymentFailed"
1. Order Service:   hears "PaymentFailed", cancels order
```

Advantages: Simple to implement for 2-3 step flows, no single point of failure, services are loosely coupled.
Disadvantages: Hard to understand the complete flow by looking at any one service, difficult to add new steps, no centralized monitoring of saga progress. At 5+ steps, choreography becomes a tangled web of events.
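For concreteness, each choreographed step is just an event handler that performs its local transaction and publishes the next event. A minimal sketch of the payment step, where the Broker interface and the charge helper are hypothetical stand-ins for a real client and payment integration:

```go
import "encoding/json"

// Broker is a hypothetical publish abstraction, not a specific client library.
type Broker interface {
	Publish(topic string, payload []byte) error
}

type OrderCreated struct {
	OrderID    string `json:"order_id"`
	CustomerID string `json:"customer_id"`
	Total      int    `json:"total"`
}

type PaymentService struct {
	broker Broker
}

// charge stands in for the real payment-provider call.
func (s *PaymentService) charge(customerID string, amount int) error {
	// ... call the payment provider; omitted in this sketch.
	return nil
}

// OnOrderCreated is one choreographed step: do the local transaction,
// then publish the next event (success or failure) for others to react to.
func (s *PaymentService) OnOrderCreated(raw []byte) error {
	var evt OrderCreated
	if err := json.Unmarshal(raw, &evt); err != nil {
		return err
	}
	if err := s.charge(evt.CustomerID, evt.Total); err != nil {
		// Failure is also an event; the order service hears it and cancels.
		return s.broker.Publish("payment.failed", raw)
	}
	return s.broker.Publish("payment.completed", raw)
}
```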
Orchestration-based saga#
A central orchestrator (saga coordinator) directs the flow. It sends commands to services and handles responses.
```go
// Saga orchestrator pseudocode
func (s *OrderSaga) Execute(ctx context.Context, order Order) error {
	// Step 1: Reserve payment
	paymentRef, err := s.paymentService.Reserve(ctx, order.Total, order.CustomerID)
	if err != nil {
		return fmt.Errorf("payment reservation failed: %w", err)
	}

	// Step 2: Reserve inventory
	err = s.inventoryService.Reserve(ctx, order.Items)
	if err != nil {
		// Compensate: release payment
		s.paymentService.Release(ctx, paymentRef)
		return fmt.Errorf("inventory reservation failed: %w", err)
	}

	// Step 3: Create shipment
	err = s.shippingService.CreateShipment(ctx, order.ShippingAddress, order.Items)
	if err != nil {
		// Compensate: release inventory and payment
		s.inventoryService.Release(ctx, order.Items)
		s.paymentService.Release(ctx, paymentRef)
		return fmt.Errorf("shipment creation failed: %w", err)
	}

	// Step 4: Confirm payment
	err = s.paymentService.Confirm(ctx, paymentRef)
	if err != nil {
		// Compensate: cancel shipment, release inventory
		s.shippingService.CancelShipment(ctx, order.ID)
		s.inventoryService.Release(ctx, order.Items)
		return fmt.Errorf("payment confirmation failed: %w", err)
	}

	return nil
}
```

Advantages: Easy to understand the complete flow, centralized monitoring, straightforward to add or remove steps, explicit compensation logic.
Disadvantages: The orchestrator is a coordination point (not necessarily a bottleneck but an additional service to maintain), risk of the orchestrator becoming a “god service” that contains too much business logic.
Use choreography for simple 2-3 step flows where each step is owned by a different team. Use orchestration for complex workflows with 4+ steps, conditional logic, or when you need centralized visibility into saga progress.
Outbox Pattern#
The outbox pattern solves a fundamental problem: how do you atomically update a database and publish an event? If you write to the database and then publish to Kafka, the publish can fail after the write succeeds, leaving the system inconsistent.
The problem#
```text
# This is NOT atomic:
1. BEGIN TRANSACTION
2. INSERT INTO orders (id, ...) VALUES (...)
3. COMMIT
4. kafka.publish("order.created", ...)   # <-- This can fail after step 3
```

If step 4 fails, the order exists in the database but no event was published. Downstream services never learn about the order.
The solution#
Write the event to an outbox table in the same database transaction. A separate process reads the outbox table and publishes events.
```sql
-- Same transaction: both writes succeed or both fail
BEGIN;

INSERT INTO orders (id, customer_id, status, total)
VALUES ('ord-456', 'cust-789', 'pending', 5999);

INSERT INTO outbox (id, aggregate_type, aggregate_id, event_type, payload, created_at)
VALUES (
  'evt-abc-123',
  'order',
  'ord-456',
  'order.created',
  '{"order_id": "ord-456", "customer_id": "cust-789", "total": 5999}',
  NOW()
);

COMMIT;
```

A polling publisher or CDC-based publisher reads the outbox and publishes to the message broker:
```sql
-- Polling publisher: grab unpublished events
SELECT * FROM outbox WHERE published = false ORDER BY created_at LIMIT 100;

-- After successful publish:
UPDATE outbox SET published = true WHERE id IN (...);
```

For higher throughput and lower latency, use Debezium to capture changes from the outbox table via the database's write-ahead log (WAL/binlog) and publish them directly to Kafka. This avoids polling entirely.
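A sketch of what the polling loop might look like as a standalone process, assuming PostgreSQL placeholders and a generic publish callback (the interval and batch size are arbitrary illustrative values):

```go
import (
	"database/sql"
	"time"
)

// runOutboxPublisher reads unpublished outbox rows, publishes them, and marks
// them as published. A crash between publish and UPDATE causes a re-publish,
// which is why downstream consumers must be idempotent.
func runOutboxPublisher(db *sql.DB, publish func(eventType string, payload []byte) error) {
	for {
		rows, err := db.Query(`
			SELECT id, event_type, payload FROM outbox
			WHERE published = false
			ORDER BY created_at
			LIMIT 100`)
		if err != nil {
			time.Sleep(time.Second)
			continue
		}

		var publishedIDs []string
		for rows.Next() {
			var id, eventType string
			var payload []byte
			if err := rows.Scan(&id, &eventType, &payload); err != nil {
				continue
			}
			if err := publish(eventType, payload); err != nil {
				break // broker unavailable; retry this event on the next poll
			}
			publishedIDs = append(publishedIDs, id)
		}
		rows.Close()

		for _, id := range publishedIDs {
			db.Exec(`UPDATE outbox SET published = true WHERE id = $1`, id)
		}

		time.Sleep(500 * time.Millisecond) // polling interval
	}
}
```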
Message Broker Selection#
Apache Kafka#
Kafka is a distributed log. Messages are written to partitions and retained for a configurable period (days, weeks, forever). Consumers track their position (offset) in the log.
Strengths: Extreme throughput (millions of messages per second), message retention (replay events from any point), strong ordering within a partition, consumer groups for parallel processing, ecosystem (Kafka Connect, Kafka Streams, ksqlDB).
Weaknesses: Operational complexity (ZooKeeper/KRaft, partition rebalancing, ISR management), higher latency than RabbitMQ for simple queuing, overkill for low-throughput scenarios, steep learning curve.
Best for: High-throughput event streams, event sourcing, data pipelines, cases where message replay is important.
RabbitMQ#
RabbitMQ is a traditional message broker implementing AMQP. Messages are published to exchanges, routed to queues, and consumed.
Strengths: Flexible routing (direct, topic, fanout, headers exchanges), message acknowledgment and redelivery, priority queues, dead letter exchanges for failed messages, simpler operational model than Kafka, lower latency for individual messages.
Weaknesses: No built-in message replay in classic queues (once a message is acknowledged it is gone; the newer RabbitMQ Streams feature adds Kafka-style replay), lower throughput ceiling than Kafka, no native partitioning for ordered processing.
Best for: Task queues, request-response patterns over messaging, complex routing requirements, systems where message replay is not needed.
NATS#
NATS is a lightweight, high-performance messaging system. NATS JetStream adds persistence and exactly-once delivery.
Strengths: Extremely low latency, simple operational model, tiny resource footprint, built-in clustering, request-reply pattern support, subject-based routing.
Weaknesses: Smaller ecosystem than Kafka or RabbitMQ, JetStream is newer and less battle-tested at extreme scale, fewer enterprise integrations.
Best for: Low-latency microservice communication, edge computing, IoT, systems where operational simplicity matters most, Kubernetes-native environments.
Selection guidance#
| Requirement | Kafka | RabbitMQ | NATS |
|---|---|---|---|
| Throughput (msg/sec) | Millions | Hundreds of thousands | Millions |
| Message replay | Native | No (workarounds exist) | JetStream |
| Latency | Higher (batching) | Low | Lowest |
| Ordering guarantees | Per-partition | Per-queue | Per-subject (JetStream) |
| Operational complexity | High | Medium | Low |
| Complex routing | Limited | Best | Good |
| Resource footprint | Heavy | Medium | Light |
Choose Kafka when you need high-throughput event streaming, message replay, or are building data pipelines. Choose RabbitMQ when you need flexible routing, task queues, or dead letter handling for failed messages. Choose NATS when you need the lowest latency, simplest operations, or are working in resource-constrained environments.
Eventual Consistency Tradeoffs#
Event-driven architectures introduce eventual consistency. Data across services will converge to a consistent state, but there is a window where different services have different views of reality.
Strategies for managing eventual consistency:
- Communicate the window. If the consistency window is usually under 500ms, document it. Most users and calling services can tolerate this.
- Read-your-writes. After a user writes data, route subsequent reads for that user to the authoritative source for a short window (1-2 seconds).
- Optimistic UI. The client assumes the write succeeded and shows the updated state immediately. If an event later indicates failure, the UI corrects itself.
- Idempotent consumers. Events may be delivered more than once. Every consumer must handle duplicate events safely. Use the event ID as a deduplication key.
```sql
-- Idempotent event processing
INSERT INTO processed_events (event_id, processed_at)
VALUES ('evt-abc-123', NOW())
ON CONFLICT (event_id) DO NOTHING;
-- Only process if the insert succeeded (event was not already processed)
```

The fundamental question is not “synchronous or asynchronous?” but “where can we accept eventual consistency and where do we need immediate consistency?” Draw that line explicitly for each interaction between services, document it, and build your architecture accordingly.