Multi-Stage Temporal Workflows#
The HelloWorkflow from Temporal Go Workflow Basics calls one activity and returns. Real workflows are not that simple. A deployment pipeline provisions infrastructure, configures networking, deploys the application, runs health checks, and updates DNS. Each step depends on the previous one. Any step can fail. Some failures require undoing earlier steps.
This article covers the patterns you need for production multi-stage workflows: sequential activities with data passing, retry policies, timeouts, child workflows, error propagation, and compensation.
Sequential Activities#
The simplest multi-stage pattern chains activities, passing each output as input to the next:
func ProvisionWorkflow(ctx workflow.Context, req ProvisionRequest) (ProvisionResult, error) {
actCtx := workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
StartToCloseTimeout: 5 * time.Minute,
})
// Step 1: Create the resource
var createResult CreateResult
err := workflow.ExecuteActivity(actCtx, CreateResource, req).Get(ctx, &createResult)
if err != nil {
return ProvisionResult{}, fmt.Errorf("create resource: %w", err)
}
// Step 2: Configure the resource using step 1's output
var configResult ConfigResult
err = workflow.ExecuteActivity(actCtx, ConfigureResource, ConfigRequest{
ResourceID: createResult.ResourceID,
Settings: req.Settings,
}).Get(ctx, &configResult)
if err != nil {
return ProvisionResult{}, fmt.Errorf("configure resource: %w", err)
}
// Step 3: Validate the resource
var validResult ValidationResult
err = workflow.ExecuteActivity(actCtx, ValidateResource, ValidateRequest{
ResourceID: createResult.ResourceID,
}).Get(ctx, &validResult)
if err != nil {
return ProvisionResult{}, fmt.Errorf("validate resource: %w", err)
}
return ProvisionResult{
ResourceID: createResult.ResourceID,
Status: "provisioned",
}, nil
}Each .Get(ctx, &result) blocks until the activity completes. If the activity fails (after exhausting retries), the error propagates up and the workflow returns it. This is straightforward but has no compensation – if step 3 fails, steps 1 and 2 are not undone.
Retry Policies#
By default, Temporal retries failed activities indefinitely with exponential backoff. In production, you almost always want to customize this.
actCtx := workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
StartToCloseTimeout: 2 * time.Minute,
RetryPolicy: &temporal.RetryPolicy{
InitialInterval: time.Second,
BackoffCoefficient: 2.0,
MaximumInterval: time.Minute,
MaximumAttempts: 5,
NonRetryableErrorTypes: []string{"InvalidInputError", "AuthenticationError"},
},
})RetryPolicy fields:
- InitialInterval – Wait time before the first retry. Subsequent retries multiply by
BackoffCoefficient. Start at 1 second for most activities. - BackoffCoefficient – Multiplier for each successive retry interval. A coefficient of 2.0 produces waits of 1s, 2s, 4s, 8s, etc. Use 1.0 for constant-interval retries (useful for queue-like polling).
- MaximumInterval – Upper bound on the retry interval. Without this, exponential backoff can produce very long waits. Set to 1 minute for API calls, longer for batch jobs.
- MaximumAttempts – Total number of attempts including the initial one. Set to 0 for unlimited retries (the default). Use a finite number when the operation is unlikely to succeed after several tries.
- NonRetryableErrorTypes – Error type names that should not be retried. Use this for errors where retrying is pointless: invalid input, authentication failures, resource-not-found. Match these against the error type string returned by your activity.
To mark an error as non-retryable from within an activity:
func CreateResource(ctx context.Context, req CreateRequest) (CreateResult, error) {
if req.Name == "" {
return CreateResult{}, temporal.NewNonRetryableApplicationError(
"resource name is required",
"InvalidInputError",
nil,
)
}
// ...
}Different activities often need different retry policies. Create separate activity contexts:
fastCtx := workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
StartToCloseTimeout: 30 * time.Second,
RetryPolicy: &temporal.RetryPolicy{MaximumAttempts: 3},
})
slowCtx := workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
StartToCloseTimeout: 10 * time.Minute,
RetryPolicy: &temporal.RetryPolicy{MaximumAttempts: 2},
})Timeout Strategies#
Temporal has four timeout types. Understanding when to use each prevents common production issues.
ScheduleToStartTimeout – How long a task can sit in the task queue before a worker picks it up. If no worker is available for this long, the activity fails. Rarely used – it indicates a capacity problem, not an activity problem. Default is unlimited.
StartToCloseTimeout – How long the activity can run after a worker picks it up. This is the most commonly set timeout. Set it to the maximum expected execution time plus margin. Required if ScheduleToCloseTimeout is not set.
ScheduleToCloseTimeout – Total time from when the activity is scheduled to when it must complete. This includes queue time and execution time. Useful as an overall SLA. If set, StartToCloseTimeout is optional.
HeartbeatTimeout – For long-running activities that report progress. If the activity does not heartbeat within this interval, Temporal considers it stalled and retries it. Essential for activities that can hang – large file downloads, database migrations, container operations.
// Short API call
apiCtx := workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
StartToCloseTimeout: 30 * time.Second,
})
// Long-running operation with heartbeat
longCtx := workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
StartToCloseTimeout: 30 * time.Minute,
HeartbeatTimeout: 30 * time.Second,
})Within the activity, send heartbeats periodically:
func LongRunningActivity(ctx context.Context, req Request) (Result, error) {
for i, item := range req.Items {
// Process item...
activity.RecordHeartbeat(ctx, fmt.Sprintf("processed %d/%d", i+1, len(req.Items)))
}
return Result{}, nil
}Child Workflows#
A child workflow is a workflow started by another workflow. Use them when you need:
- Reusable sub-processes. A “stop and snapshot” workflow can be called from multiple parent workflows.
- Different retry policies. Child workflows have their own retry policy, independent of the parent.
- Separate event history. Each workflow has its own history. Breaking large workflows into parent/child keeps individual histories manageable (Temporal has a 50,000-event history limit per workflow).
- Independent lifecycle. Depending on the
ParentClosePolicy, a child can outlive its parent.
func ParentWorkflow(ctx workflow.Context, req ParentRequest) (ParentResult, error) {
childCtx := workflow.WithChildOptions(ctx, workflow.ChildWorkflowOptions{
WorkflowID: fmt.Sprintf("child-%s", req.ID),
TaskQueue: "child-task-queue",
ParentClosePolicy: enums.PARENT_CLOSE_POLICY_TERMINATE,
WorkflowRunTimeout: 10 * time.Minute,
})
var childResult ChildResult
err := workflow.ExecuteChildWorkflow(childCtx, ChildWorkflow, ChildRequest{
ResourceID: req.ResourceID,
}).Get(ctx, &childResult)
if err != nil {
return ParentResult{}, fmt.Errorf("child workflow: %w", err)
}
return ParentResult{ChildData: childResult.Data}, nil
}Parent-Child Lifecycle#
When a parent workflow fails or is cancelled, what happens to its children depends on ParentClosePolicy:
- TERMINATE – The child is immediately terminated. Use this when the child’s work is meaningless without the parent. This is the most common choice.
- REQUEST_CANCEL – The child receives a cancellation request but can handle it gracefully (run cleanup). Use this when the child holds resources that need cleanup.
- ABANDON – The child continues running independently. Use this when the child represents work that should complete regardless of the parent – for example, a notification workflow.
// Child terminates if parent fails (default and most common)
workflow.ChildWorkflowOptions{
ParentClosePolicy: enums.PARENT_CLOSE_POLICY_TERMINATE,
}
// Child gets a chance to clean up
workflow.ChildWorkflowOptions{
ParentClosePolicy: enums.PARENT_CLOSE_POLICY_REQUEST_CANCEL,
}
// Child runs to completion regardless
workflow.ChildWorkflowOptions{
ParentClosePolicy: enums.PARENT_CLOSE_POLICY_ABANDON,
}Error Propagation#
Errors flow upward naturally. An activity error becomes the return value of ExecuteActivity(...).Get(). A child workflow error becomes the return value of ExecuteChildWorkflow(...).Get(). Wrap errors at each level for debugging:
// In the parent workflow
err := workflow.ExecuteChildWorkflow(childCtx, StopAndSnapshotWorkflow, req).Get(ctx, &result)
if err != nil {
return Result{}, fmt.Errorf("stop-and-snapshot for container %s: %w", req.ContainerID, err)
}// In the child workflow
err := workflow.ExecuteActivity(actCtx, StopContainer, stopReq).Get(ctx, &stopResult)
if err != nil {
return Result{}, fmt.Errorf("stop container %s: %w", req.ContainerID, err)
}The resulting error chain reads: stop-and-snapshot for container abc123: stop container abc123: connection refused. Each layer adds context without losing the original error.
Compensation Pattern#
When a multi-stage workflow fails partway through, you often need to undo completed steps. The compensation pattern tracks what has been done and reverses it on failure.
func MultiStageWorkflow(ctx workflow.Context, req Request) (Result, error) {
actCtx := workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
StartToCloseTimeout: 5 * time.Minute,
})
var completedSteps []string
// Step 1
var step1Result StepOneResult
err := workflow.ExecuteActivity(actCtx, StepOne, req).Get(ctx, &step1Result)
if err != nil {
return Result{}, err
}
completedSteps = append(completedSteps, "step-one")
// Step 2
var step2Result StepTwoResult
err = workflow.ExecuteActivity(actCtx, StepTwo, step1Result).Get(ctx, &step2Result)
if err != nil {
compensate(ctx, completedSteps, step1Result)
return Result{}, fmt.Errorf("step two failed, compensated: %w", err)
}
completedSteps = append(completedSteps, "step-two")
// Step 3
var step3Result StepThreeResult
err = workflow.ExecuteActivity(actCtx, StepThree, step2Result).Get(ctx, &step3Result)
if err != nil {
compensate(ctx, completedSteps, step1Result)
return Result{}, fmt.Errorf("step three failed, compensated: %w", err)
}
return Result{Data: step3Result.Data}, nil
}
func compensate(ctx workflow.Context, completedSteps []string, data StepOneResult) {
compCtx := workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
StartToCloseTimeout: 2 * time.Minute,
RetryPolicy: &temporal.RetryPolicy{
MaximumAttempts: 5,
},
})
// Compensate in reverse order
for i := len(completedSteps) - 1; i >= 0; i-- {
switch completedSteps[i] {
case "step-two":
_ = workflow.ExecuteActivity(compCtx, CompensateStepTwo, data).Get(ctx, nil)
case "step-one":
_ = workflow.ExecuteActivity(compCtx, CompensateStepOne, data).Get(ctx, nil)
}
}
}Compensation activities should be idempotent. If the workflow is retried or the compensation itself fails and is retried, running compensation twice should be safe. Give compensation activities their own retry policy – you want them to try hard to succeed, since a failed compensation leaves the system in an inconsistent state.
Testing compensation is covered in Testing Temporal Workflows.
Complete Example Preview#
The container lifecycle workflow puts all of these patterns together:
- Sequential activities: inspect, stop, commit, tag
- Child workflow: stop-and-snapshot as a reusable sub-process
- Compensation: if commit fails after stop, restart the container
- DI for activities: mock Docker client for testing, real client for production
- Retry policies: different policies for fast operations (inspect) vs slow ones (commit)
That article walks through the complete implementation with tests.
Summary#
Multi-stage workflows combine several patterns: sequential activity execution with data passing, retry policies tuned per activity, timeouts matched to expected durations, child workflows for reusable sub-processes, and compensation for handling partial failures. The key design principle is that every step that creates a side effect should have a corresponding compensation step. Keep activities idempotent, wrap errors with context at every level, and use child workflows when a sub-process needs its own lifecycle or retry policy.