Saga Pattern

In a monolith, a single database transaction can span multiple operations atomically. In microservices, each service owns its database, so no single local transaction can span them, and distributed transaction protocols rarely work reliably across service boundaries. The saga pattern solves this by breaking a transaction into a sequence of local transactions, each with a corresponding compensating transaction that undoes its work if a later step fails.

The Problem: No Distributed ACID

Consider an order placement that must: (1) reserve inventory, (2) charge payment, (3) create shipment. In a monolith, this is one transaction. In microservices, these are three services with three databases. Two-phase commit (2PC) across them is fragile and slow, and many message brokers and databases do not support it across service boundaries.

Sagas accept that the system will be temporarily inconsistent and provide a mechanism to either complete all steps or undo the ones that succeeded.

Choreography vs Orchestration

There are two approaches to coordinating saga steps.

Choreography (Event-Based)

Each service listens for events and reacts independently. No central coordinator exists.

Order Service → publishes "OrderCreated"
Inventory Service → listens for "OrderCreated" → reserves stock → publishes "InventoryReserved"
Payment Service → listens for "InventoryReserved" → charges card → publishes "PaymentCharged"
Shipping Service → listens for "PaymentCharged" → creates shipment → publishes "ShipmentCreated"
Order Service → listens for "ShipmentCreated" → marks order complete

On failure, services publish compensating events:

Payment Service → charge fails → publishes "PaymentFailed"
Inventory Service → listens for "PaymentFailed" → releases reserved stock → publishes "InventoryReleased"
Order Service → listens for "InventoryReleased" → marks order failed
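The two event flows above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: `EventBus` is a hypothetical synchronous in-memory stand-in for a real message broker, `run_order` and the handler names are invented for the example, and the `card_ok` flag simulates whether the charge succeeds.

```python
from collections import defaultdict


class EventBus:
    """Synchronous in-memory stand-in for a message broker (illustration only)."""

    def __init__(self):
        self.handlers = defaultdict(list)
        self.log = []  # names of all published events, in order

    def subscribe(self, event_name, handler):
        self.handlers[event_name].append(handler)

    def publish(self, event_name, order):
        self.log.append(event_name)
        for handler in list(self.handlers[event_name]):
            handler(order)


def run_order(card_ok):
    """Wire the services together and place one order.

    Returns (event log, stock levels, order statuses).
    """
    bus = EventBus()
    stock = {"widget": 5}
    status = {}

    # Inventory Service
    def reserve(order):
        stock[order["sku"]] -= order["qty"]
        bus.publish("InventoryReserved", order)

    def release(order):  # compensation for reserve
        stock[order["sku"]] += order["qty"]
        bus.publish("InventoryReleased", order)

    # Payment Service (card_ok simulates the outcome of the charge)
    def charge(order):
        bus.publish("PaymentCharged" if card_ok else "PaymentFailed", order)

    # Shipping Service
    def ship(order):
        bus.publish("ShipmentCreated", order)

    # Order Service
    def complete(order):
        status[order["id"]] = "COMPLETED"

    def fail(order):
        status[order["id"]] = "FAILED"

    # Each service only knows which events it reacts to, not who emits them.
    bus.subscribe("OrderCreated", reserve)
    bus.subscribe("InventoryReserved", charge)
    bus.subscribe("PaymentCharged", ship)
    bus.subscribe("PaymentFailed", release)
    bus.subscribe("ShipmentCreated", complete)
    bus.subscribe("InventoryReleased", fail)

    bus.publish("OrderCreated", {"id": "o1", "sku": "widget", "qty": 2})
    return bus.log, stock, status
```

Calling `run_order(card_ok=False)` walks the failure path: the stock is restored and the order ends up `FAILED`. Note that the overall flow exists only implicitly in the subscriptions, which is exactly what makes longer choreographed sagas hard to trace.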

When choreography works well: Few steps (3-4), simple linear flow, teams own their services independently and want loose coupling.

When choreography breaks down: More than 5 steps, complex branching logic, difficulty tracing the overall flow, no single place to see saga state. Debugging a choreographed saga across 8 services by reading event logs is painful.

Orchestration (Coordinator-Based)

A central saga orchestrator tells each service what to do and handles the flow.

class OrderSaga:
    def execute(self, order):
        # Step 1: Reserve inventory
        inventory_result = self.inventory_service.reserve(order.items)
        if not inventory_result.success:
            self.fail_order(order, "Insufficient inventory")
            return

        # Step 2: Charge payment
        payment_result = self.payment_service.charge(order.payment_info, order.total)
        if not payment_result.success:
            self.inventory_service.release(inventory_result.reservation_id)  # compensate
            self.fail_order(order, "Payment failed")
            return

        # Step 3: Create shipment
        shipment_result = self.shipping_service.create_shipment(order)
        if not shipment_result.success:
            self.payment_service.refund(payment_result.charge_id)  # compensate
            self.inventory_service.release(inventory_result.reservation_id)  # compensate
            self.fail_order(order, "Shipment creation failed")
            return

        self.complete_order(order)

When orchestration works well: Complex flows with branching, many steps, need for visibility into saga state, teams that want a clear definition of the business process.

Where orchestration carries risk: The orchestrator becomes a coupling point. If every saga change requires modifying the orchestrator, control is effectively centralized again and the orchestrator's owners become a bottleneck for every team. Keep orchestrators thin: they manage flow, not business logic.

Compensating Transactions

Compensating transactions are not rollbacks. They are new forward operations that semantically undo previous work.

Key differences from database rollbacks:

They can fail. A payment refund might fail. You need retry logic and potentially manual intervention.
They may not perfectly undo. A notification email cannot be unsent. The compensation might be sending a cancellation email.
They are visible. Unlike a rolled-back transaction, compensations create observable side effects (a refund appears on a credit card statement).

Design rules for compensating transactions:

Every compensable saga step must have a defined compensation.
Compensations execute in reverse order of the original steps.
Compensations must be idempotent, because they may be retried.
Some steps are non-compensable (an email cannot be unsent). Place these last in the saga, after every step that can still fail.
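These rules can be captured in a small generic runner. The following is a sketch under stated assumptions: `Step`, `run_saga`, and the returned status strings are hypothetical names chosen for illustration, every failure is signalled by an exception, and a compensation that keeps failing is recorded for manual follow-up rather than retried forever.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    name: str
    action: Callable[[dict], None]
    # None marks a non-compensable step, which should be placed last.
    compensation: Optional[Callable[[dict], None]] = None


def run_saga(steps: List[Step], ctx: dict, max_compensation_attempts: int = 3) -> str:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[Step] = []
    for step in steps:
        try:
            step.action(ctx)
        except Exception as exc:
            ctx["error"] = f"{step.name}: {exc}"
            stuck = _compensate(completed, ctx, max_compensation_attempts)
            return "NEEDS_MANUAL_INTERVENTION" if stuck else "COMPENSATED"
        completed.append(step)
    return "COMPLETED"


def _compensate(completed: List[Step], ctx: dict, max_attempts: int) -> List[str]:
    """Undo completed steps newest-first; return names whose compensation never succeeded."""
    stuck: List[str] = []
    for step in reversed(completed):
        if step.compensation is None:
            continue
        for _ in range(max_attempts):
            try:
                step.compensation(ctx)  # must be idempotent: it may run more than once
                break
            except Exception:
                continue
        else:
            # All attempts failed: record the step so a human can follow up.
            stuck.append(step.name)
    ctx["stuck_compensations"] = stuck
    return stuck
```

With steps `reserve`, `charge`, and `ship`, a failure in `ship` triggers the refund first and the inventory release second, mirroring the reverse-order rule.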

Idempotency Requirements

Every saga participant must handle duplicate messages. A message broker may deliver the same event twice. The orchestrator may retry a failed call. Without idempotency, you charge the customer twice or reserve inventory twice.

Implement idempotency with idempotency keys:

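def reserve_inventory(request):
    # Return the stored result if this key was already processed
    existing = db.query(
        "SELECT * FROM reservations WHERE idempotency_key = %s",
        request.idempotency_key
    )
    if existing:
        return existing  # Duplicate delivery: return the previous result

    # First time this key is seen: proceed with the reservation.
    # The check above is not atomic with the insert, so the table also needs
    # a UNIQUE constraint on idempotency_key to reject concurrent duplicates.
    reservation = create_reservation(request)
    return reservation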

Pass idempotency keys through the entire saga. The orchestrator generates a unique key per saga instance, and each step derives its own key from it:

saga_id: "order-abc-123"
step keys: "order-abc-123:inventory", "order-abc-123:payment", "order-abc-123:shipping"
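One way to make the key derivation and the duplicate check concrete is to let the database enforce uniqueness. The sketch below assumes an in-memory SQLite database and a simplified `reservations` table; `step_key`, `open_db`, and this version of `reserve_inventory` are hypothetical helpers for illustration, not part of any particular framework.

```python
import sqlite3


def step_key(saga_id: str, step_name: str) -> str:
    """Derive a per-step idempotency key from the saga ID."""
    return f"{saga_id}:{step_name}"


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE reservations (
            idempotency_key TEXT PRIMARY KEY,  -- uniqueness enforced by the database
            sku TEXT NOT NULL,
            qty INTEGER NOT NULL
        )
        """
    )
    return conn


def reserve_inventory(conn: sqlite3.Connection, key: str, sku: str, qty: int) -> bool:
    """Insert the reservation once.

    Returns True if this call created it, False if the key was already used.
    """
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO reservations (idempotency_key, sku, qty) VALUES (?, ?, ?)",
                (key, sku, qty),
            )
        return True
    except sqlite3.IntegrityError:
        # Duplicate delivery or retry: the work was already done.
        return False
```

Relying on the primary key rather than a separate lookup means two concurrent duplicates cannot both succeed, since the second insert is rejected by the database itself.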

Saga State Management

The orchestrator must persist saga state to survive crashes. If the orchestrator restarts mid-saga, it must know which steps completed and resume from the correct point.

Store saga state in a database:

CREATE TABLE saga_instances (
    saga_id UUID PRIMARY KEY,
    saga_type VARCHAR(100) NOT NULL,
    current_step INT NOT NULL,
    state VARCHAR(20) NOT NULL,  -- RUNNING, COMPENSATING, COMPLETED, FAILED
    payload JSONB NOT NULL,
    step_results JSONB DEFAULT '[]',
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW()
);
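To show how such a table supports crash recovery, here is a minimal sketch that checkpoints after every step and resumes from `current_step`. It makes several assumptions: SQLite stands in for the database (so `UUID` and `JSONB` columns become `TEXT`), the timestamp columns are omitted, and `open_store`, `start_saga`, and `run_from_checkpoint` are hypothetical helper names.

```python
import json
import sqlite3


def open_store() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE saga_instances (
            saga_id TEXT PRIMARY KEY,
            saga_type TEXT NOT NULL,
            current_step INTEGER NOT NULL,
            state TEXT NOT NULL,
            payload TEXT NOT NULL,
            step_results TEXT NOT NULL DEFAULT '[]'
        )
        """
    )
    return conn


def start_saga(conn, saga_id, saga_type, payload):
    with conn:
        conn.execute(
            "INSERT INTO saga_instances (saga_id, saga_type, current_step, state, payload) "
            "VALUES (?, ?, 0, 'RUNNING', ?)",
            (saga_id, saga_type, json.dumps(payload)),
        )


def run_from_checkpoint(conn, saga_id, steps):
    """Run the remaining steps, persisting progress after each one.

    `steps` is a list of callables that take the payload and return a
    JSON-serializable result.
    """
    row = conn.execute(
        "SELECT current_step, state, payload, step_results "
        "FROM saga_instances WHERE saga_id = ?",
        (saga_id,),
    ).fetchone()
    current_step, state = row[0], row[1]
    payload, results = json.loads(row[2]), json.loads(row[3])
    if state != "RUNNING":
        return state  # nothing to resume

    for index in range(current_step, len(steps)):
        results.append(steps[index](payload))
        with conn:  # checkpoint: the next run starts at index + 1
            conn.execute(
                "UPDATE saga_instances SET current_step = ?, step_results = ? "
                "WHERE saga_id = ?",
                (index + 1, json.dumps(results), saga_id),
            )

    with conn:
        conn.execute(
            "UPDATE saga_instances SET state = 'COMPLETED' WHERE saga_id = ?",
            (saga_id,),
        )
    return "COMPLETED"
```

A crash between running a step and writing its checkpoint causes that step to run again on resume, which is one more reason every participant has to be idempotent.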

Temporal and Cadence Workflow Engines

Writing saga orchestrators from scratch means handling state persistence, retries, timeouts, and crash recovery yourself. Workflow engines like Temporal (and its predecessor Cadence) handle all of this.

A Temporal workflow for the order saga:

import (
    "time"

    "go.temporal.io/sdk/temporal"
    "go.temporal.io/sdk/workflow"
)

func OrderSagaWorkflow(ctx workflow.Context, order Order) error {
    opts := workflow.ActivityOptions{
        StartToCloseTimeout: 30 * time.Second,
        RetryPolicy: &temporal.RetryPolicy{
            MaximumAttempts: 3,
        },
    }
    ctx = workflow.WithActivityOptions(ctx, opts)

    // Step 1: Reserve inventory
    var reservationID string
    err := workflow.ExecuteActivity(ctx, ReserveInventory, order).Get(ctx, &reservationID)
    if err != nil {
        return err  // Temporal handles retry; if all retries fail, workflow fails
    }

    // Step 2: Charge payment
    var chargeID string
    err = workflow.ExecuteActivity(ctx, ChargePayment, order).Get(ctx, &chargeID)
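    if err != nil {
        // Compensate: release inventory. Compensations can fail too, so log
        // the error instead of silently discarding it.
        if cerr := workflow.ExecuteActivity(ctx, ReleaseInventory, reservationID).Get(ctx, nil); cerr != nil {
            workflow.GetLogger(ctx).Error("release inventory failed", "error", cerr)
        }
        return err
    }

    // Step 3: Create shipment
    err = workflow.ExecuteActivity(ctx, CreateShipment, order).Get(ctx, nil)
    if err != nil {
        // Compensate in reverse order: refund first, then release inventory
        if cerr := workflow.ExecuteActivity(ctx, RefundPayment, chargeID).Get(ctx, nil); cerr != nil {
            workflow.GetLogger(ctx).Error("refund failed", "error", cerr)
        }
        if cerr := workflow.ExecuteActivity(ctx, ReleaseInventory, reservationID).Get(ctx, nil); cerr != nil {
            workflow.GetLogger(ctx).Error("release inventory failed", "error", cerr)
        }
        return err
    }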

    return nil
}

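Temporal provides: durable execution (survives process crashes), automatic retries, timeouts, saga state persistence, workflow history and visibility, and compensation via standard language constructs. You do not need a separate saga state table, because Temporal persists workflow progress internally. Activities must still be idempotent: Temporal may execute an activity more than once when it retries after a timeout or worker failure, so the charge and reservation calls still need idempotency keys.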

Failure Handling Strategies

Retry with backoff: Transient failures (network timeout, temporary unavailability) should be retried before triggering compensation. Use exponential backoff with jitter.
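A minimal sketch of exponential backoff with "full jitter" (each delay is drawn uniformly between zero and the exponential ceiling) might look like this. `TransientError`, `backoff_delays`, and `call_with_retry` are hypothetical names, and the default base delay, cap, and attempt count are arbitrary illustration values, not recommendations.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def backoff_delays(max_attempts, base=0.1, cap=5.0, rng=random.random):
    """Yield the sleep before each retry: rng() * min(cap, base * 2**attempt)."""
    for attempt in range(max_attempts - 1):
        yield rng() * min(cap, base * 2 ** attempt)


def call_with_retry(fn, max_attempts=4, sleep=time.sleep, **backoff):
    """Call fn until it succeeds or max_attempts calls have failed."""
    delays = backoff_delays(max_attempts, **backoff)
    while True:
        try:
            return fn()
        except TransientError:
            delay = next(delays, None)
            if delay is None:
                raise  # retries exhausted: the saga should compensate or escalate
            sleep(delay)
```

Only `TransientError` is retried here. Any other exception propagates immediately, on the assumption that business failures such as a declined card should trigger compensation rather than another attempt.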

Timeout-based decisions: If a step does not respond within a deadline, the saga must decide: retry, compensate, or escalate. Do not wait indefinitely.

Dead letter / manual intervention: Some failures cannot be resolved automatically. A payment that partially processed needs human review. Route these to a dead letter queue with full saga context.

Forward recovery vs backward recovery: Sometimes it is cheaper to push forward (retry the failing step) than to compensate all previous steps. If 9 of 10 steps completed and step 10 is a transient failure, retry step 10 rather than compensating 9 steps.
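The strategies above can be combined into a single decision point. The function below is one possible policy, sketched under simple assumptions: the caller already knows whether the failure is transient, how many attempts were made, and whether every completed step has a compensation. `Recovery` and `choose_recovery` are hypothetical names.

```python
from enum import Enum


class Recovery(Enum):
    RETRY = "retry"            # forward recovery: try the failing step again
    COMPENSATE = "compensate"  # backward recovery: undo completed steps
    ESCALATE = "escalate"      # dead letter queue / manual intervention


def choose_recovery(
    transient: bool,
    attempts: int,
    max_attempts: int,
    completed_steps_compensable: bool,
) -> Recovery:
    """Pick a recovery strategy for a failed saga step."""
    if transient and attempts < max_attempts:
        return Recovery.RETRY
    if completed_steps_compensable:
        return Recovery.COMPENSATE
    # Cannot move forward and cannot cleanly undo: a human has to decide.
    return Recovery.ESCALATE
```

A real policy would likely also weigh the cost of compensating many completed steps against the cost of waiting longer, as in the 9-of-10 example.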

When to Use Sagas

Use sagas when a business transaction spans multiple services that each own their data. Do not use sagas for operations within a single service – use a regular database transaction. Do not use sagas when strong consistency is required and a single database can handle it. The complexity cost of sagas is significant; only pay it when you have no alternative.