Trace, Span, and Context#
A trace represents a single request flowing through a distributed system. It is identified by a 128-bit trace ID. A span represents one unit of work within that trace – an HTTP handler, a database query, a message publish. Each span has a name, start time, duration, status, attributes (key-value pairs), and events (timestamped annotations). Spans form a tree: every span except the root has a parent span ID.
Context propagation carries the trace ID and parent span ID across service boundaries. When Service A calls Service B, it injects trace context into HTTP headers. Service B extracts the context and creates a child span under the same trace. Without propagation, you get disconnected single-service traces instead of an end-to-end view.
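As a concrete illustration, here is a minimal Go sketch of propagation using the OpenTelemetry API (the service-b URL and function names are made up for this example); in practice the instrumentation libraries shown later inject and extract these headers for you:

```go
package example

import (
	"context"
	"net/http"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
)

func init() {
	// Use the W3C Trace Context format (the "traceparent" header)
	otel.SetTextMapPropagator(propagation.TraceContext{})
}

// Service A: inject the current trace context into the outgoing request
// so Service B can continue the same trace.
func callServiceB(ctx context.Context) (*http.Response, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, "http://service-b/work", nil)
	if err != nil {
		return nil, err
	}
	otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(req.Header))
	return http.DefaultClient.Do(req)
}

// Service B: extract the incoming context; spans started from the returned
// context become children of Service A's span.
func serviceBHandler(w http.ResponseWriter, r *http.Request) {
	ctx := otel.GetTextMapPropagator().Extract(r.Context(), propagation.HeaderCarrier(r.Header))
	_ = ctx // start spans from ctx here
}
```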
Jaeger on Kubernetes#
Jaeger is one of the most widely deployed open-source tracing backends. Deploy it with the Jaeger Operator:
# Install cert-manager (required by Jaeger Operator)
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/latest/download/cert-manager.yaml
# Install Jaeger Operator
kubectl create namespace observability
kubectl apply -f https://github.com/jaegertracing/jaeger-operator/releases/latest/download/jaeger-operator.yaml \
  -n observability

Create a Jaeger instance:
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger
  namespace: observability
spec:
  strategy: production
  storage:
    type: elasticsearch
    options:
      es:
        server-urls: http://elasticsearch:9200
        index-prefix: jaeger
  collector:
    resources:
      limits:
        memory: 512Mi
  query:
    resources:
      limits:
        memory: 512Mi

For smaller setups, use strategy: allInOne with in-memory or Badger storage. For production, use Elasticsearch or Cassandra as the backend.
Grafana Tempo as a Lightweight Alternative#
Tempo is Grafana’s trace backend. It stores traces in object storage (S3, GCS, local filesystem) without requiring Elasticsearch or Cassandra. This makes it dramatically simpler and cheaper to operate.
helm repo add grafana https://grafana.github.io/helm-charts
helm install tempo grafana/tempo \
--namespace observability --create-namespace \
--set tempo.storage.trace.backend=local \
  --set tempo.storage.trace.local.path=/var/tempo/traces

For production with S3:
# tempo-values.yaml
tempo:
  storage:
    trace:
      backend: s3
      s3:
        bucket: tempo-traces
        endpoint: s3.amazonaws.com
        region: us-east-1
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
  metricsGenerator:
    enabled: true
    remoteWrite:
      - url: http://prometheus:9090/api/v1/write

The metricsGenerator derives RED metrics (rate, errors, duration) from traces automatically, so you get service-level metrics without separate instrumentation.
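With the OTLP gRPC receiver listening on port 4317, applications export spans straight to Tempo (recent Jaeger versions accept OTLP on the same port). A minimal Go SDK setup might look like the sketch below; the in-cluster endpoint tempo.observability:4317 and the service name are assumptions for this example:

```go
package example

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/propagation"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// initTracer wires the SDK to an OTLP/gRPC endpoint and registers the
// W3C propagator. Call it once at startup and Shutdown the provider on exit.
func initTracer(ctx context.Context) (*sdktrace.TracerProvider, error) {
	exp, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint("tempo.observability:4317"), // assumed in-cluster address
		otlptracegrpc.WithInsecure(),
	)
	if err != nil {
		return nil, err
	}
	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exp),
		sdktrace.WithResource(resource.NewSchemaless(
			attribute.String("service.name", "order-service"),
		)),
	)
	otel.SetTracerProvider(tp)
	otel.SetTextMapPropagator(propagation.TraceContext{})
	return tp, nil
}
```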
Trace-to-Log and Trace-to-Metric Correlation#
The real power of tracing comes from correlation. In Grafana, configure the Tempo data source with trace-to-logs and trace-to-metrics links, either in the UI or via provisioning:
# Grafana datasource provisioning
datasources:
  - name: Tempo
    type: tempo
    url: http://tempo:3200
    jsonData:
      tracesToLogs:
        datasourceUid: loki
        filterByTraceID: true
        filterBySpanID: false
        mapTagNamesEnabled: true
        mappedTags:
          - key: service.name
            value: app
      tracesToMetrics:
        datasourceUid: prometheus
        tags:
          - key: service.name
            value: service
        queries:
          - name: Request rate
            query: sum(rate(traces_spanmetrics_calls_total{$$__tags}[5m]))
          - name: Error rate
            query: sum(rate(traces_spanmetrics_calls_total{status_code="STATUS_CODE_ERROR",$$__tags}[5m]))

This lets you click a trace span and jump directly to the relevant logs in Loki filtered by trace ID, or see the related metrics for that service.
For the reverse direction (logs to traces), include the trace ID in your structured logs and configure Loki’s derived fields in Grafana to link back to Tempo.
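For example, a small Go helper along these lines stamps every log line with the active trace and span IDs (a sketch assuming Go 1.21+ slog; the trace_id and span_id field names just need to match the regex you configure in Loki's derived fields):

```go
package example

import (
	"context"
	"log/slog"

	"go.opentelemetry.io/otel/trace"
)

// logWithTrace emits a structured log line carrying the active trace and
// span IDs so Grafana can link the line back to the trace in Tempo.
func logWithTrace(ctx context.Context, logger *slog.Logger, msg string, attrs ...slog.Attr) {
	sc := trace.SpanContextFromContext(ctx)
	if sc.IsValid() {
		attrs = append(attrs,
			slog.String("trace_id", sc.TraceID().String()),
			slog.String("span_id", sc.SpanID().String()),
		)
	}
	logger.LogAttrs(ctx, slog.LevelInfo, msg, attrs...)
}
```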
Instrumenting HTTP Clients and Servers#
Using Go with the OTel SDK:
package main

import (
	"log"
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

func main() {
	// Wrap the HTTP handler -- creates a span for each incoming request
	handler := otelhttp.NewHandler(http.HandlerFunc(handleOrder), "HandleOrder")
	http.Handle("/orders", handler)

	// Wrap the HTTP client -- creates a span for each outgoing request
	client := &http.Client{Transport: otelhttp.NewTransport(http.DefaultTransport)}

	// Use the client -- a span is created automatically; build the request with
	// http.NewRequestWithContext to parent it under the current span
	resp, err := client.Get("http://inventory-service/check")
	if err == nil {
		resp.Body.Close()
	}

	log.Fatal(http.ListenAndServe(":8080", nil))
}

func handleOrder(w http.ResponseWriter, r *http.Request) {
	ctx := r.Context()
	tracer := otel.Tracer("order-service")

	// Create a child span for a specific operation; pass ctx to downstream
	// calls so their spans nest under validate-order
	ctx, span := tracer.Start(ctx, "validate-order")
	defer span.End()

	span.SetAttributes(
		attribute.String("order.id", "ord-12345"),
		attribute.Int("order.items", 3),
	)

	// Add an event for a notable occurrence
	span.AddEvent("payment-processed", trace.WithAttributes(
		attribute.Float64("amount", 99.50),
	))
}

For Python with Flask:
from flask import Flask
from opentelemetry import trace
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

app = Flask(__name__)

# Auto-instrument Flask and the requests library
FlaskInstrumentor().instrument_app(app)
RequestsInstrumentor().instrument()

tracer = trace.get_tracer("order-service")

@app.route("/orders", methods=["POST"])
def create_order():
    with tracer.start_as_current_span("validate-order") as span:
        # order_id, db, and query come from your application code
        span.set_attribute("order.id", order_id)
        # Database call, external API, etc. are captured as child spans
        result = db.execute(query)
        span.add_event("query-executed", {"rows": len(result)})
    return {"status": "created"}

Common Instrumentation Patterns#
Database calls: Wrap database queries in spans carrying the database semantic attributes (db.system, db.name, db.statement). Most OTel instrumentation libraries do this automatically for popular database drivers.
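Where no driver instrumentation exists, a hand-rolled wrapper is straightforward. This is a sketch, not a drop-in library; the tracer name, db.system value, and database name are assumptions:

```go
package example

import (
	"context"
	"database/sql"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
)

// queryWithSpan wraps a query in a span carrying database semantic attributes.
func queryWithSpan(ctx context.Context, db *sql.DB, query string, args ...any) (*sql.Rows, error) {
	ctx, span := otel.Tracer("order-service").Start(ctx, "db.query")
	defer span.End()
	span.SetAttributes(
		attribute.String("db.system", "postgresql"), // assumed driver
		attribute.String("db.name", "orders"),       // assumed database
		attribute.String("db.statement", query),
	)
	rows, err := db.QueryContext(ctx, query, args...)
	if err != nil {
		span.RecordError(err)
		span.SetStatus(codes.Error, err.Error())
	}
	return rows, err
}
```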
External API calls: The instrumented HTTP client creates spans automatically. Add attributes for the target service name and endpoint for easier filtering.
Queue consumers: When consuming from Kafka or RabbitMQ, extract trace context from message headers to continue the trace started by the producer. The OTel Kafka instrumentation handles this:
// Producer injects context into message headers
otel.GetTextMapPropagator().Inject(ctx, otelsarama.NewProducerMessageCarrier(msg))
// Consumer extracts context from message headers
ctx = otel.GetTextMapPropagator().Extract(context.Background(),
    otelsarama.NewConsumerMessageCarrier(msg))

Debugging Slow Requests with Traces#
When a user reports a slow request, the workflow is:
- Find the trace by searching Tempo/Jaeger for the trace ID (from a log entry or response header) or by querying for slow traces: {duration > 2s && resource.service.name = "api-gateway"}.
- Open the trace waterfall view. Look for the widest span – that is where time was spent.
- Check span attributes and events. A database span with db.statement shows the exact query. An HTTP span shows the URL and status code.
- Look for gaps between spans. A gap means the application was doing CPU work, waiting on something uninstrumented, or blocked on I/O without a span.
- Correlate to logs. Click through to Loki with the trace ID to see detailed log output from the slow operation.
Returning trace IDs in HTTP response headers (X-Trace-Id) makes it easy for developers and support to find the relevant trace when investigating issues reported by users.
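A sketch of that pattern in Go (the middleware must run inside otelhttp.NewHandler so the server span is already in the request context; the header name is a convention, not a standard):

```go
package example

import (
	"net/http"

	"go.opentelemetry.io/otel/trace"
)

// withTraceIDHeader echoes the current trace ID back to the caller, e.g.
// otelhttp.NewHandler(withTraceIDHeader(mux), "server").
func withTraceIDHeader(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if sc := trace.SpanContextFromContext(r.Context()); sc.HasTraceID() {
			w.Header().Set("X-Trace-Id", sc.TraceID().String())
		}
		next.ServeHTTP(w, r)
	})
}
```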