The Decision Landscape#
Log management is deceptively simple on the surface – applications write text, you store it, you search it later. In practice, every decision in the log pipeline involves tradeoffs between cost, query speed, retention depth, operational complexity, and correlation with other observability signals. This guide provides a framework for making those decisions based on your actual requirements rather than defaults or trends.
Structured Logging: The Foundation#
Before choosing any aggregation tool, standardize on structured logging. Unstructured logs are human-readable but machine-hostile. Structured logs are both.
Unstructured vs. Structured#
Unstructured:
2026-02-22 10:15:23 ERROR [api-gateway] Failed to process request for user john@example.com: connection timeout to auth-service after 30s
Structured (JSON):
{
"timestamp": "2026-02-22T10:15:23.456Z",
"level": "error",
"service": "api-gateway",
"message": "Failed to process request",
"user_email": "john@example.com",
"error_type": "connection_timeout",
"upstream_service": "auth-service",
"timeout_seconds": 30,
"trace_id": "abc123def456",
"span_id": "789ghi",
"request_id": "req-001234"
}
The structured version enables:
- Filtering by field: upstream_service=auth-service AND error_type=connection_timeout without regex.
- Aggregation: Count errors by upstream_service to find the dependency that fails most often.
- Correlation: The trace_id links this log entry to the distributed trace for the same request.
- Alerting: Alert when error_type=connection_timeout exceeds a threshold for a specific upstream_service.
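With structured JSON in Loki, for example, the first two points become straightforward LogQL. A sketch, assuming a service stream label exists and the JSON fields are parsed at query time rather than indexed:
# Filter by field, no regex needed
{service="api-gateway"} | json | upstream_service="auth-service" and error_type="connection_timeout"
# Aggregate: error count per upstream service over the last hour
sum by (upstream_service) (
  count_over_time({service="api-gateway"} | json | level="error" [1h])
)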
Structured Logging Standards#
Define a standard set of fields that every service must include.
Required fields:
timestamp: ISO 8601 with timezone (always UTC)
level: debug, info, warn, error, fatal
service: service name matching the Kubernetes deployment name
message: human-readable description (no variable interpolation)
trace_id: W3C trace ID (when in a request context)
Recommended fields:
span_id: W3C span ID
request_id: application-level request correlation ID
user_id: anonymized or hashed user identifier (never PII in logs)
environment: production, staging, development
version: application version or git SHA
duration_ms: operation duration in milliseconds
error_type: machine-readable error classification
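To make the standard stick, enforce it in a shared logging setup rather than in each service individually. A minimal sketch in Python using only the standard library; the service name, version, and the optional-field list are illustrative assumptions:
# Shared JSON formatter enforcing the standard fields
import json
import logging
from datetime import datetime, timezone

SERVICE = "api-gateway"   # assumed: injected from deployment configuration
VERSION = "2026.02.1"     # assumed: release tag or git SHA
LEVEL_MAP = {"WARNING": "warn", "CRITICAL": "fatal"}

class StructuredFormatter(logging.Formatter):
    """Render every record as one JSON object carrying the required fields."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": LEVEL_MAP.get(record.levelname, record.levelname.lower()),
            "service": SERVICE,
            "message": record.getMessage(),
            "version": VERSION,
        }
        # Optional fields arrive via `extra=` and are copied through when present.
        for field in ("trace_id", "span_id", "request_id", "error_type", "duration_ms"):
            value = getattr(record, field, None)
            if value is not None:
                entry[field] = value
        return json.dumps(entry)

handler = logging.StreamHandler()          # stdout/stderr, picked up by the kubelet
handler.setFormatter(StructuredFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.getLogger(__name__).error(
    "Failed to process request",
    extra={"error_type": "connection_timeout", "trace_id": "abc123def456"},
)
Structured logging libraries such as structlog (or their equivalents in other languages) achieve the same result with less boilerplate; the point is that the field names come from one place.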
When to Log#
Not everything should be logged. Excessive logging increases costs and makes the signal harder to find in the noise.
Log at ERROR level: Failures that affect the current request or operation. Database errors, upstream timeouts, unhandled exceptions. These should be actionable.
Log at WARN level: Conditions that are not failures but indicate potential problems. Retry attempts, degraded performance, approaching resource limits. These are early signals.
Log at INFO level: Significant business events and lifecycle events. Request start/end (with duration), service startup/shutdown, configuration changes. These provide context during investigations.
Log at DEBUG level: Detailed internal state useful for development. Disabled in production unless temporarily enabled for debugging. Never leave debug logging on permanently – it generates enormous volume with minimal operational value.
Do not log: Sensitive data (passwords, tokens, full credit card numbers), health check requests (they generate noise with no value), high-frequency internal events (every cache hit, every metric sample).
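For the sensitive-data rule, a scrubbing step close to the logger is a cheap safety net in case a credential ever reaches a log call. A minimal sketch in Python; the field list is an assumption, not a complete PII catalogue:
# Redact known-sensitive fields before they are formatted
import logging

SENSITIVE_FIELDS = {"password", "token", "authorization", "credit_card"}  # assumed list

class RedactSensitiveFields(logging.Filter):
    """Replace known-sensitive extra fields with a placeholder."""

    def filter(self, record: logging.LogRecord) -> bool:
        for field in SENSITIVE_FIELDS:
            if hasattr(record, field):
                setattr(record, field, "[REDACTED]")
        return True  # keep the record, only scrubbed

# Attach to handlers so it applies to every record they emit.
for existing_handler in logging.getLogger().handlers:
    existing_handler.addFilter(RedactSensitiveFields())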
Log Aggregation Architecture Decisions#
Decision: Centralized vs. Distributed#
Centralized aggregation sends all logs to a single backend (Loki, Elasticsearch, CloudWatch). This simplifies querying – one place to search – and enables cross-service correlation. The tradeoff is network bandwidth for log shipping and a single point of failure in the log pipeline.
Distributed aggregation keeps logs closer to their source, with a lightweight local query layer. Each cluster or region has its own log backend. This reduces network costs and latency but makes cross-region queries difficult. Use this only when regulatory requirements mandate data residency or when network costs are prohibitive.
Recommendation for most teams: Centralized. The operational simplicity and correlation capability outweigh the bandwidth cost for log volumes under 500GB/day. Above 500GB/day, evaluate whether aggressive filtering at the collector level can reduce volume, or whether a regional tier with a global query federation (Loki’s multi-tenant mode, Elasticsearch cross-cluster search) is justified.
Decision: Push vs. Pull#
Push-based (most common): Log collectors (Fluentd, Fluent Bit, Vector, Promtail) run as agents on each node, tail log files or read from stdout, and push log entries to the backend. This is the standard model for Kubernetes environments where pods write to stdout and the kubelet writes those logs to files on the node.
Pull-based: The backend periodically fetches logs from sources. Rarely used for application logs but common for infrastructure logs (pulling from cloud provider APIs, for example).
Recommendation: Push-based for application logs in Kubernetes. The DaemonSet pattern (one collector pod per node) is well-established, efficient, and handles pod lifecycle automatically.
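A minimal sketch of that DaemonSet pattern with Fluent Bit; the image tag, namespace, and ConfigMap name are assumptions, and the official Helm charts for Fluent Bit, Promtail, and Alloy generate equivalent manifests:
# Collector DaemonSet: one pod per node, tailing /var/log/pods via a hostPath mount
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: log-collector
  namespace: logging
spec:
  selector:
    matchLabels:
      app: log-collector
  template:
    metadata:
      labels:
        app: log-collector
    spec:
      serviceAccountName: log-collector    # needs RBAC to read pod metadata
      containers:
        - name: fluent-bit
          image: fluent/fluent-bit:3.1     # assumed tag
          resources:
            limits:
              memory: 200Mi
          volumeMounts:
            - name: varlog                 # pod logs written by the kubelet
              mountPath: /var/log
              readOnly: true
            - name: config
              mountPath: /fluent-bit/etc/
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
        - name: config
          configMap:
            name: fluent-bit-config        # tail INPUT, kubernetes FILTER, backend OUTPUT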
Decision: Collector Selection#
Choose your collector based on where you send logs and what processing you need.
IF you use Loki exclusively:
USE Promtail or Grafana Alloy
REASON: Tightest integration, label extraction designed for Loki
IF you need to send logs to multiple backends:
USE Fluent Bit or Vector
REASON: Both support many output plugins
IF you need complex routing, filtering, or transformation:
USE Vector
REASON: Built-in VRL (Vector Remap Language) for powerful transforms
IF memory footprint is the primary constraint:
USE Fluent Bit
REASON: ~5MB baseline memory, written in C
IF you need a specific plugin from a large ecosystem:
USE Fluentd
REASON: 700+ community plugins, though heavier than Fluent Bit
Architecture Pattern: Kubernetes DaemonSet Collector#
+------------------+
| Application Pod |
| (writes to |
| stdout/stderr) |
+--------+---------+
|
kubelet writes to
/var/log/pods/...
|
+--------+---------+
| Collector |
| (DaemonSet pod) |
| Reads log files |
| Adds labels |
| Filters/parses |
+--------+---------+
|
Pushes to backend
|
+--------------+--------------+
| |
+--------+--------+ +---------+--------+
| Loki / ES / | | Object Storage |
| ClickHouse | | (S3/GCS for |
| (hot storage) | | cold/archive) |
+--------+--------+ +------------------+
|
+--------+--------+
| Grafana / |
| Kibana |
| (query & viz) |
+-----------------+
Log Retention Policies#
Retention is a cost and compliance decision, not a technical one. Every day of retained logs costs storage and may have legal implications.
Decision Framework for Retention#
IF compliance requires specific retention (SOC 2, HIPAA, PCI-DSS):
Retain for the required period (typically 1-7 years)
USE tiered storage: hot (7-30 days), warm (30-90 days), cold (90+ days)
IF no compliance requirement:
Hot retention: 7-14 days (fast queries)
Warm retention: 30-90 days (slower queries, cheaper storage)
Cold retention: Optional, archive to object storage for cost
IF budget is the primary constraint:
Retain 7 days in the log backend
Archive raw logs to S3/GCS with lifecycle policies for long-term
Accept that querying archived logs requires rehydration
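For the S3/GCS archive path above, lifecycle rules handle the aging automatically. A sketch of an S3 lifecycle configuration; the prefix, tiers, and retention numbers are assumptions, and GCS lifecycle rules are the equivalent:
{
  "Rules": [
    {
      "ID": "archive-raw-logs",
      "Status": "Enabled",
      "Filter": { "Prefix": "raw-logs/" },
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" },
        { "Days": 90, "StorageClass": "GLACIER" }
      ],
      "Expiration": { "Days": 365 }
    }
  ]
}
Applied with aws s3api put-bucket-lifecycle-configuration; anything past the hot tier then requires rehydration before it can be queried, as noted above.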
Tiered Storage Implementation#
Loki: Uses object storage (S3, GCS) natively. Set retention with limits_config.retention_period, enforced by the compactor; the older table_manager.retention_period applies only to legacy index stores. Older data automatically moves to cheaper storage tiers if the bucket uses S3 Intelligent-Tiering or GCS Nearline.
# Loki retention configuration
limits_config:
retention_period: 720h # 30 days
# Per-tenant overrides (multi-tenant Loki)
overrides:
production:
retention_period: 2160h # 90 days
development:
retention_period: 168h # 7 days
Elasticsearch: Use Index Lifecycle Management (ILM) to transition indices through phases.
{
"policy": {
"phases": {
"hot": {
"actions": {
"rollover": { "max_age": "1d", "max_size": "50gb" }
}
},
"warm": {
"min_age": "7d",
"actions": {
"shrink": { "number_of_shards": 1 },
"forcemerge": { "max_num_segments": 1 }
}
},
"cold": {
"min_age": "30d",
"actions": {
"searchable_snapshot": { "snapshot_repository": "logs-archive" }
}
},
"delete": {
"min_age": "90d"
}
}
}
}
What to Filter Before Storage#
Reduce volume (and cost) by filtering at the collector level. Not all logs need to reach the backend.
# Fluent Bit filter example: drop health check logs
[FILTER]
Name grep
Match kube.*
Exclude log /health|/ready|/live/
# Vector filter: drop debug logs in production
[transforms.filter_debug]
type = "filter"
inputs = ["kubernetes_logs"]
condition = '.level != "debug"'
Filtering before ingestion is the most cost-effective optimization. A service that logs every health check response at INFO level can generate 30-50% of its total log volume from those entries alone. Drop them at the collector.
Log-Based Alerting#
Metrics-based alerting is preferred for most scenarios because metrics are cheap to store and fast to query. But some conditions are only detectable in logs: specific error messages, stack traces, security events, or business logic failures that are not captured as metrics.
Decision: When to Alert on Logs vs. Metrics#
USE metric-based alerting when:
- The condition is a rate, count, or threshold (error rate > 1%)
- The metric already exists in Prometheus
- You need alerting latency under 30 seconds
USE log-based alerting when:
- The condition is a specific error message or pattern
- The error does not have a corresponding metric
- You need to alert on the first occurrence of a specific event
- Security events: unauthorized access attempts, privilege escalation
Loki LogQL Alerting#
Loki can evaluate LogQL queries as alerting rules, similar to Prometheus alerting rules.
# Loki alerting rules
groups:
- name: log-alerts
rules:
- alert: AuthServicePanicDetected
expr: |
count_over_time(
{job="auth-service"} |= "panic" [5m]
) > 0
for: 0m
labels:
severity: critical
annotations:
summary: "Panic detected in auth-service logs"
- alert: HighRateOfDatabaseErrors
expr: |
sum(rate(
{job="api-gateway"} |= "database error" [5m]
)) > 0.5
for: 5m
labels:
severity: warning
annotations:
summary: "Database errors in api-gateway exceeding 0.5/s"
- alert: UnauthorizedAccessAttempt
expr: |
count_over_time(
{job="api-gateway"} | json | level="warn" | message="unauthorized access attempt" [5m]
) > 10
for: 0m
labels:
severity: critical
annotations:
summary: "More than 10 unauthorized access attempts in 5 minutes"Elasticsearch Watcher Alerting#
{
"trigger": {
"schedule": { "interval": "5m" }
},
"input": {
"search": {
"request": {
"indices": ["logs-*"],
"body": {
"query": {
"bool": {
"must": [
{ "match": { "level": "error" } },
{ "match": { "error_type": "database_connection_failed" } },
{ "range": { "@timestamp": { "gte": "now-5m" } } }
]
}
}
}
}
}
},
"condition": {
"compare": { "ctx.payload.hits.total.value": { "gt": 5 } }
},
"actions": {
"notify_slack": {
"webhook": {
"scheme": "https",
"host": "hooks.slack.com",
"port": 443,
"method": "post",
"path": "/services/T.../B.../...",
"body": "{\"text\": \"Database connection errors detected: {{ctx.payload.hits.total.value}} in last 5 minutes\"}"
}
}
}
}
Converting Log Patterns to Metrics#
For recurring log-based alerts, convert the pattern into a metric. This gives you the performance and reliability of metric-based alerting while still capturing the log-originating signal.
Loki metric extraction:
# Extract a metric from logs using LogQL metric queries
# This counts database errors per service, usable as a recording rule
sum by (service) (
count_over_time({namespace="production"} | json | error_type="database_error" [5m])
)
Application-level approach: Emit a Prometheus counter alongside the log statement. When the application logs a database error, it also increments database_errors_total{service="api-gateway", error_type="connection_timeout"}. This is the most reliable approach because the metric and the log come from the same code path.
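A sketch of that pattern in Python with prometheus_client; the helper function and its arguments are illustrative, while the metric name and labels follow the paragraph above:
# Increment the counter and write the log from the same code path
import logging
from prometheus_client import Counter

log = logging.getLogger("api-gateway")

# Registered once at import time, labeled the same way the log entry is tagged.
DATABASE_ERRORS = Counter(
    "database_errors_total",
    "Database errors observed by the application",
    ["service", "error_type"],
)

def record_database_error(error_type: str, detail: str) -> None:
    # Metric and log are emitted together, so they cannot drift apart.
    DATABASE_ERRORS.labels(service="api-gateway", error_type=error_type).inc()
    log.error("Database error", extra={"error_type": error_type, "detail": detail})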
Correlation with Traces and Metrics#
Logs become dramatically more useful when they link to traces and metrics. During an incident, the workflow is: alert fires (metrics) -> view logs for the failing time window -> click through to the trace for a specific failing request -> see the full distributed call path.
Implementing Correlation#
Log-to-trace: Include trace_id and span_id in every structured log entry. In Grafana, configure the Loki data source with a “Derived field” that links trace_id values to the Tempo or Jaeger data source.
# Grafana Loki data source configuration (provisioning)
datasources:
- name: Loki
type: loki
url: http://loki:3100
jsonData:
derivedFields:
- datasourceUid: tempo
matcherRegex: '"trace_id":"([a-f0-9]+)"'
name: TraceID
url: '$${__value.raw}'
Log-to-metric: Use matching labels between logs and metrics. If the Loki log stream has {service="api-gateway", namespace="production"} and Prometheus has metrics with {job="api-gateway", namespace="production"}, Grafana can correlate them in split-view panels.
Metric-to-log: From a Grafana metric panel showing elevated error rates, link to a Loki query filtered to the same time range and service labels. Grafana’s Explore view supports this natively – select a time range on a metric graph and click “Split” to open a Loki query for the same window.
OpenTelemetry as the Correlation Layer#
OpenTelemetry provides a unified SDK and collector that handles metrics, logs, and traces with built-in correlation. The OpenTelemetry SDK automatically injects trace context into log entries, and the OpenTelemetry Collector can export all three signals to their respective backends while preserving the correlation IDs.
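On the application side, one way to get that injection with the Python SDK is the logging instrumentation, which adds the active trace and span IDs as attributes on every log record. A minimal sketch, assuming the opentelemetry-sdk and opentelemetry-instrumentation-logging packages; verify the exact module paths against the SDK version you use:
# Bridge standard-library logging into the active trace context
import logging

from opentelemetry import trace
from opentelemetry.instrumentation.logging import LoggingInstrumentor
from opentelemetry.sdk.trace import TracerProvider

# Tracer setup; an OTLP exporter pointing at the collector would be added here.
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer("api-gateway")

# Adds otelTraceID / otelSpanID attributes to every LogRecord, which a JSON
# formatter can then emit as the trace_id / span_id fields described earlier.
LoggingInstrumentor().instrument()

logging.basicConfig(level=logging.INFO)

with tracer.start_as_current_span("handle-request"):
    logging.getLogger(__name__).info("processing request")  # carries trace context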
# OpenTelemetry Collector pipeline configuration
receivers:
otlp:
protocols:
grpc:
http:
processors:
batch:
exporters:
prometheusremotewrite:
endpoint: "http://mimir:9090/api/v1/push"
loki:
endpoint: "http://loki:3100/loki/api/v1/push"
otlp/tempo:
endpoint: "http://tempo:4317"
service:
pipelines:
metrics:
receivers: [otlp]
processors: [batch]
exporters: [prometheusremotewrite]
logs:
receivers: [otlp]
processors: [batch]
exporters: [loki]
traces:
receivers: [otlp]
processors: [batch]
exporters: [otlp/tempo]
This architecture means the application instruments once with the OpenTelemetry SDK and gets correlated metrics, logs, and traces without managing three separate instrumentation libraries.
Decision Summary#
STRUCTURED LOGGING:
Always. Non-negotiable. JSON format with standardized fields.
This is the single highest-leverage investment in your log pipeline.
AGGREGATION ARCHITECTURE:
Centralized push-based with DaemonSet collectors for most teams.
Distributed only for data residency or extreme volume (>500GB/day).
COLLECTOR:
Promtail/Alloy for Loki-only shops.
Vector or Fluent Bit for multi-backend or complex transformation needs.
RETENTION:
Match compliance requirements. Default to 30 days hot, archive to object storage.
Filter noise at the collector to reduce volume before it hits the backend.
ALERTING:
Metrics-first. Log-based only for patterns that cannot be expressed as metrics.
Convert recurring log alerts into metrics over time.
CORRELATION:
Include trace_id in all log entries. Configure Grafana derived fields.
Consider OpenTelemetry for unified instrumentation across all three signals.