Trace, Span, and Context#
A trace represents a single request flowing through a distributed system. It is identified by a 128-bit trace ID. A span represents one unit of work within that trace – an HTTP handler, a database query, a message publish. Each span has a name, start time, duration, status, attributes (key-value pairs), and events (timestamped annotations). Spans form a tree: every span except the root has a parent span ID.
Context propagation carries the trace ID and parent span ID across service boundaries. When Service A calls Service B, it injects trace context into HTTP headers. Service B extracts the context and creates a child span under the same trace. Without propagation, you get disconnected single-service traces instead of an end-to-end view.
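As a concrete illustration, here is a minimal Go sketch of propagation using the OpenTelemetry API (the service-b URL and function names are made up for this example); in practice the instrumentation libraries shown later inject and extract these headers for you:

```go
package example

import (
	"context"
	"net/http"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
)

func init() {
	// Use the W3C Trace Context format (the "traceparent" header)
	otel.SetTextMapPropagator(propagation.TraceContext{})
}

// Service A: inject the current trace context into the outgoing request
// so Service B can continue the same trace.
func callServiceB(ctx context.Context) (*http.Response, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, "http://service-b/work", nil)
	if err != nil {
		return nil, err
	}
	otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(req.Header))
	return http.DefaultClient.Do(req)
}

// Service B: extract the incoming context; spans started from the returned
// context become children of Service A's span.
func serviceBHandler(w http.ResponseWriter, r *http.Request) {
	ctx := otel.GetTextMapPropagator().Extract(r.Context(), propagation.HeaderCarrier(r.Header))
	_ = ctx // start spans from ctx here
}
```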
Jaeger on Kubernetes#
Jaeger is one of the most widely deployed open-source tracing backends. Deploy it with the Jaeger Operator:
# Install cert-manager (required by Jaeger Operator)
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/latest/download/cert-manager.yaml
# Install Jaeger Operator
kubectl create namespace observability
kubectl apply -f https://github.com/jaegertracing/jaeger-operator/releases/latest/download/jaeger-operator.yaml \
  -n observability

Create a Jaeger instance:
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger
  namespace: observability
spec:
  strategy: production
  storage:
    type: elasticsearch
    options:
      es:
        server-urls: http://elasticsearch:9200
        index-prefix: jaeger
  collector:
    resources:
      limits:
        memory: 512Mi
  query:
    resources:
      limits:
        memory: 512Mi

For smaller setups, use strategy: allInOne with in-memory or Badger storage. For production, use Elasticsearch or Cassandra as the backend.
Grafana Tempo as a Lightweight Alternative#
Tempo is Grafana’s trace backend. It stores traces in object storage (S3, GCS, local filesystem) without requiring Elasticsearch or Cassandra. This makes it dramatically simpler and cheaper to operate.
helm repo add grafana https://grafana.github.io/helm-charts
helm install tempo grafana/tempo \
--namespace observability --create-namespace \
--set tempo.storage.trace.backend=local \
  --set tempo.storage.trace.local.path=/var/tempo/traces

For production with S3:
# tempo-values.yaml
tempo:
  storage:
    trace:
      backend: s3
      s3:
        bucket: tempo-traces
        endpoint: s3.amazonaws.com
        region: us-east-1
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
  metricsGenerator:
    enabled: true
    remoteWrite:
      - url: http://prometheus:9090/api/v1/write

The metricsGenerator derives RED metrics (rate, errors, duration) from traces automatically, so you get service-level metrics without separate instrumentation.
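With the OTLP gRPC receiver listening on port 4317, applications export spans straight to Tempo (recent Jaeger versions accept OTLP on the same port). A minimal Go SDK setup might look like the sketch below; the in-cluster endpoint tempo.observability:4317 and the service name are assumptions for this example:

```go
package example

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/propagation"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// initTracer wires the SDK to an OTLP/gRPC endpoint and registers the
// W3C propagator. Call it once at startup and Shutdown the provider on exit.
func initTracer(ctx context.Context) (*sdktrace.TracerProvider, error) {
	exp, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint("tempo.observability:4317"), // assumed in-cluster address
		otlptracegrpc.WithInsecure(),
	)
	if err != nil {
		return nil, err
	}
	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exp),
		sdktrace.WithResource(resource.NewSchemaless(
			attribute.String("service.name", "order-service"),
		)),
	)
	otel.SetTracerProvider(tp)
	otel.SetTextMapPropagator(propagation.TraceContext{})
	return tp, nil
}
```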
Trace-to-Log and Trace-to-Metric Correlation#
The real power of tracing comes from correlation. In Grafana, configure the Tempo data source with trace-to-logs and trace-to-metrics links, either in the UI or via provisioning:
# Grafana datasource provisioning
datasources:
  - name: Tempo
    type: tempo
    url: http://tempo:3200
    jsonData:
      tracesToLogs:
        datasourceUid: loki
        filterByTraceID: true
        filterBySpanID: false
        mapTagNamesEnabled: true
        mappedTags:
          - key: service.name
            value: app
      tracesToMetrics:
        datasourceUid: prometheus
        tags:
          - key: service.name
            value: service
        queries:
          - name: Request rate
            query: sum(rate(traces_spanmetrics_calls_total{$$__tags}[5m]))
          - name: Error rate
            query: sum(rate(traces_spanmetrics_calls_total{status_code="STATUS_CODE_ERROR",$$__tags}[5m]))

This lets you click a trace span and jump directly to the relevant logs in Loki filtered by trace ID, or see the related metrics for that service.
For the reverse direction (logs to traces), include the trace ID in your structured logs and configure Loki’s derived fields in Grafana to link back to Tempo.
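For example, a small Go helper along these lines stamps every log line with the active trace and span IDs (a sketch assuming Go 1.21+ slog; the trace_id and span_id field names just need to match the regex you configure in Loki's derived fields):

```go
package example

import (
	"context"
	"log/slog"

	"go.opentelemetry.io/otel/trace"
)

// logWithTrace emits a structured log line carrying the active trace and
// span IDs so Grafana can link the line back to the trace in Tempo.
func logWithTrace(ctx context.Context, logger *slog.Logger, msg string, attrs ...slog.Attr) {
	sc := trace.SpanContextFromContext(ctx)
	if sc.IsValid() {
		attrs = append(attrs,
			slog.String("trace_id", sc.TraceID().String()),
			slog.String("span_id", sc.SpanID().String()),
		)
	}
	logger.LogAttrs(ctx, slog.LevelInfo, msg, attrs...)
}
```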
Instrumenting HTTP Clients and Servers#
Using Go with the OTel SDK:
package main

import (
	"log"
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

func main() {
	// Wrap the HTTP handler -- creates a span for each incoming request
	handler := otelhttp.NewHandler(http.HandlerFunc(handleOrder), "HandleOrder")
	http.Handle("/orders", handler)

	// Wrap the HTTP client -- creates a span for each outgoing request
	client := &http.Client{Transport: otelhttp.NewTransport(http.DefaultTransport)}

	// Use the client -- a span is created automatically; build the request with
	// http.NewRequestWithContext to parent it under the current span
	resp, err := client.Get("http://inventory-service/check")
	if err == nil {
		resp.Body.Close()
	}

	log.Fatal(http.ListenAndServe(":8080", nil))
}

func handleOrder(w http.ResponseWriter, r *http.Request) {
	ctx := r.Context()
	tracer := otel.Tracer("order-service")

	// Create a child span for a specific operation; pass ctx to downstream
	// calls so their spans nest under validate-order
	ctx, span := tracer.Start(ctx, "validate-order")
	defer span.End()

	span.SetAttributes(
		attribute.String("order.id", "ord-12345"),
		attribute.Int("order.items", 3),
	)

	// Add an event for a notable occurrence
	span.AddEvent("payment-processed", trace.WithAttributes(
		attribute.Float64("amount", 99.50),
	))
}

For Python with Flask:
from flask import Flask
from opentelemetry import trace
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

app = Flask(__name__)

# Auto-instrument Flask and the requests library
FlaskInstrumentor().instrument_app(app)
RequestsInstrumentor().instrument()

tracer = trace.get_tracer("order-service")

@app.route("/orders", methods=["POST"])
def create_order():
    with tracer.start_as_current_span("validate-order") as span:
        # order_id, db, and query come from your application code
        span.set_attribute("order.id", order_id)
        # Database call, external API, etc. are captured as child spans
        result = db.execute(query)
        span.add_event("query-executed", {"rows": len(result)})
    return {"status": "created"}

Common Instrumentation Patterns#
Database calls: Wrap database queries in spans carrying the database semantic attributes (db.system, db.name, db.statement). Most OTel instrumentation libraries do this automatically for popular database drivers.
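Where no driver instrumentation exists, a hand-rolled wrapper is straightforward. This is a sketch, not a drop-in library; the tracer name, db.system value, and database name are assumptions:

```go
package example

import (
	"context"
	"database/sql"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
)

// queryWithSpan wraps a query in a span carrying database semantic attributes.
func queryWithSpan(ctx context.Context, db *sql.DB, query string, args ...any) (*sql.Rows, error) {
	ctx, span := otel.Tracer("order-service").Start(ctx, "db.query")
	defer span.End()
	span.SetAttributes(
		attribute.String("db.system", "postgresql"), // assumed driver
		attribute.String("db.name", "orders"),       // assumed database
		attribute.String("db.statement", query),
	)
	rows, err := db.QueryContext(ctx, query, args...)
	if err != nil {
		span.RecordError(err)
		span.SetStatus(codes.Error, err.Error())
	}
	return rows, err
}
```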
External API calls: The instrumented HTTP client creates spans automatically. Add attributes for the target service name and endpoint for easier filtering.
Queue consumers: When consuming from Kafka or RabbitMQ, extract trace context from message headers to continue the trace started by the producer. The OTel Kafka instrumentation handles this:
// Producer injects context into message headers
otel.GetTextMapPropagator().Inject(ctx, otelsarama.NewProducerMessageCarrier(msg))
// Consumer extracts context from message headers
ctx = otel.GetTextMapPropagator().Extract(context.Background(),
    otelsarama.NewConsumerMessageCarrier(msg))

Debugging Slow Requests with Traces#
When a user reports a slow request, the workflow is:
- Find the trace by searching Tempo/Jaeger for the trace ID (from a log entry or response header) or by querying for slow traces: {duration > 2s && resource.service.name = "api-gateway"}.
- Open the trace waterfall view. Look for the widest span – that is where time was spent.
- Check span attributes and events. A database span with db.statement shows the exact query. An HTTP span shows the URL and status code.
- Look for gaps between spans. A gap means the application was doing CPU work, waiting on something uninstrumented, or blocked on I/O without a span.
- Correlate to logs. Click through to Loki with the trace ID to see detailed log output from the slow operation.
Returning trace IDs in HTTP response headers (X-Trace-Id) makes it easy for developers and support to find the relevant trace when investigating issues reported by users.
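A sketch of that pattern in Go (the middleware must run inside otelhttp.NewHandler so the server span is already in the request context; the header name is a convention, not a standard):

```go
package example

import (
	"net/http"

	"go.opentelemetry.io/otel/trace"
)

// withTraceIDHeader echoes the current trace ID back to the caller, e.g.
// otelhttp.NewHandler(withTraceIDHeader(mux), "server").
func withTraceIDHeader(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if sc := trace.SpanContextFromContext(r.Context()); sc.HasTraceID() {
			w.Header().Set("X-Trace-Id", sc.TraceID().String())
		}
		next.ServeHTTP(w, r)
	})
}
```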