The Decision Landscape#

Log management is deceptively simple: applications write text, you store it, you search it later. In practice, every decision in the log pipeline involves tradeoffs between cost, query speed, retention depth, operational complexity, and correlation with other observability signals. This guide provides a framework for making those decisions based on your actual requirements rather than defaults or trends.

Structured Logging: The Foundation#

Before choosing any aggregation tool, standardize on structured logging. Unstructured logs are human-readable but machine-hostile. Structured logs are both.

Unstructured vs. Structured#

Unstructured:

2026-02-22 10:15:23 ERROR [api-gateway] Failed to process request for user john@example.com: connection timeout to auth-service after 30s

Structured (JSON):

{
  "timestamp": "2026-02-22T10:15:23.456Z",
  "level": "error",
  "service": "api-gateway",
  "message": "Failed to process request",
  "user_email": "john@example.com",
  "error_type": "connection_timeout",
  "upstream_service": "auth-service",
  "timeout_seconds": 30,
  "trace_id": "abc123def456",
  "span_id": "789ghi",
  "request_id": "req-001234"
}

The structured version enables:

  • Filtering by field: upstream_service=auth-service AND error_type=connection_timeout without regex.
  • Aggregation: Count errors by upstream_service to find the dependency that fails most often (see the sketch after this list).
  • Correlation: The trace_id links this log entry to the distributed trace for the same request.
  • Alerting: Alert when error_type=connection_timeout exceeds a threshold for a specific upstream_service.
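
Because every entry is parseable, even a few lines of scripting reproduce these operations without regex. A minimal sketch of the aggregation case (the log file path is a hypothetical example; a real backend does this at query time):

# Minimal sketch: count errors by upstream_service from a file of JSON log
# lines ("app.log" is a hypothetical path); no regex or log parsing needed.
import json
from collections import Counter

errors_by_upstream = Counter()
with open("app.log") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        if entry.get("level") == "error" and "upstream_service" in entry:
            errors_by_upstream[entry["upstream_service"]] += 1

print(errors_by_upstream.most_common(5))  # dependencies that fail most often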

Structured Logging Standards#

Define a standard set of fields that every service must include.

Required fields:
  timestamp     ISO 8601 with timezone (always UTC)
  level         debug, info, warn, error, fatal
  service       service name matching the Kubernetes deployment name
  message       human-readable description (no variable interpolation)
  trace_id      W3C trace ID (when in a request context)

Recommended fields:
  span_id       W3C span ID
  request_id    application-level request correlation ID
  user_id       anonymized or hashed user identifier (never PII in logs)
  environment   production, staging, development
  version       application version or git SHA
  duration_ms   operation duration in milliseconds
  error_type    machine-readable error classification
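
To make the standard stick, bake it into a shared logging helper rather than relying on convention. A minimal sketch for Python's standard logging module (the formatter class, service name, and the extra={"fields": ...} convention are illustrative assumptions, not a prescribed library):

# Minimal sketch: a JSON formatter that emits the required fields above.
import json
import logging
from datetime import datetime, timezone

class StructuredFormatter(logging.Formatter):
    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "service": self.service,
            "message": record.getMessage(),
        }
        # Optional fields (trace_id, error_type, ...) are passed via extra=.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)

logger = logging.getLogger("api-gateway")
handler = logging.StreamHandler()
handler.setFormatter(StructuredFormatter(service="api-gateway"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("Failed to process request",
             extra={"fields": {"error_type": "connection_timeout",
                               "upstream_service": "auth-service"}})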

When to Log#

Not everything should be logged. Excessive logging increases costs and makes signal harder to find in the noise.

Log at ERROR level: Failures that affect the current request or operation. Database errors, upstream timeouts, unhandled exceptions. These should be actionable.

Log at WARN level: Conditions that are not failures but indicate potential problems. Retry attempts, degraded performance, approaching resource limits. These are early signals.

Log at INFO level: Significant business events and lifecycle events. Request start/end (with duration), service startup/shutdown, configuration changes. These provide context during investigations.

Log at DEBUG level: Detailed internal state useful for development. Disabled in production unless temporarily enabled for debugging. Never leave debug logging on permanently – it generates enormous volume with minimal operational value.

Do not log: Sensitive data (passwords, tokens, full credit card numbers), health check requests (they generate noise with no value), high-frequency internal events (every cache hit, every metric sample).
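
For the sensitive-data rule in particular, a defensive scrub at the logging layer is cheap insurance against a stray field reaching the backend. A minimal sketch (the key list and helper name are assumptions; extend it to nested structures if your fields can contain them):

# Minimal sketch: mask well-known sensitive keys before a structured log
# entry is emitted. The key list is illustrative, not exhaustive.
SENSITIVE_KEYS = {"password", "token", "authorization", "card_number", "secret"}

def scrub(fields: dict) -> dict:
    """Return a copy of the structured fields with sensitive values masked."""
    return {
        key: "[REDACTED]" if key.lower() in SENSITIVE_KEYS else value
        for key, value in fields.items()
    }

# Usage with the formatter sketched earlier:
# logger.info("User login", extra={"fields": scrub(request_fields)})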

Log Aggregation Architecture Decisions#

Decision: Centralized vs. Distributed#

Centralized aggregation sends all logs to a single backend (Loki, Elasticsearch, CloudWatch). This simplifies querying – one place to search – and enables cross-service correlation. The tradeoff is network bandwidth for log shipping and a single point of failure in the log pipeline.

Distributed aggregation keeps logs closer to their source, with a lightweight local query layer. Each cluster or region has its own log backend. This reduces network costs and latency but makes cross-region queries difficult. Use this only when regulatory requirements mandate data residency or when network costs are prohibitive.

Recommendation for most teams: Centralized. The operational simplicity and correlation capability outweigh the bandwidth cost for log volumes under 500GB/day. Above 500GB/day, evaluate whether aggressive filtering at the collector level can reduce volume, or whether a regional tier with a global query federation (Loki’s multi-tenant mode, Elasticsearch cross-cluster search) is justified.

Decision: Push vs. Pull#

Push-based (most common): Log collectors (Fluentd, Fluent Bit, Vector, Promtail) run as agents on each node, tail log files or read from stdout, and push log entries to the backend. This is the standard model for Kubernetes environments where pods write to stdout and the kubelet writes those logs to files on the node.

Pull-based: The backend periodically fetches logs from sources. Rarely used for application logs but common for infrastructure logs (pulling from cloud provider APIs, for example).

Recommendation: Push-based for application logs in Kubernetes. The DaemonSet pattern (one collector pod per node) is well-established, efficient, and handles pod lifecycle automatically.

Decision: Collector Selection#

Choose your collector based on where you send logs and what processing you need.

IF you use Loki exclusively:
  USE Promtail or Grafana Alloy
  REASON: Tightest integration, label extraction designed for Loki

IF you need to send logs to multiple backends:
  USE Fluent Bit or Vector
  REASON: Both support many output plugins

IF you need complex routing, filtering, or transformation:
  USE Vector
  REASON: Built-in VRL (Vector Remap Language) for powerful transforms

IF memory footprint is the primary constraint:
  USE Fluent Bit
  REASON: ~5MB baseline memory, written in C

IF you need a specific plugin from a large ecosystem:
  USE Fluentd
  REASON: 700+ community plugins, though heavier than Fluent Bit

Architecture Pattern: Kubernetes DaemonSet Collector#

                    +------------------+
                    |  Application Pod |
                    |  (writes to      |
                    |   stdout/stderr) |
                    +--------+---------+
                             |
                    kubelet writes to
                    /var/log/pods/...
                             |
                    +--------+---------+
                    | Collector        |
                    | (DaemonSet pod)  |
                    | Reads log files  |
                    | Adds labels      |
                    | Filters/parses   |
                    +--------+---------+
                             |
                    Pushes to backend
                             |
              +--------------+--------------+
              |                             |
     +--------+--------+         +---------+--------+
     |  Loki / ES /    |         |  Object Storage  |
     |  ClickHouse     |         |  (S3/GCS for     |
     |  (hot storage)  |         |   cold/archive)  |
     +--------+--------+         +------------------+
              |
     +--------+--------+
      |  Grafana /      |
      |  Kibana         |
      |  (query & viz)  |
      +-----------------+

Log Retention Policies#

Retention is a cost and compliance decision, not a technical one. Every day of retained logs costs storage and may have legal implications.

Decision Framework for Retention#

IF compliance requires specific retention (SOC 2, HIPAA, PCI-DSS):
  Retain for the required period (typically 1-7 years)
  USE tiered storage: hot (7-30 days), warm (30-90 days), cold (90+ days)

IF no compliance requirement:
  Hot retention: 7-14 days (fast queries)
  Warm retention: 30-90 days (slower queries, cheaper storage)
  Cold retention: Optional, archive to object storage for cost

IF budget is the primary constraint:
  Retain 7 days in the log backend
  Archive raw logs to S3/GCS with lifecycle policies for long-term
  Accept that querying archived logs requires rehydration
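
For the budget-constrained path, the archive tier is typically nothing more than an object-storage lifecycle rule. A sketch using boto3 (bucket name, prefix, and day counts are illustrative assumptions):

# Minimal sketch: S3 lifecycle rule that moves archived logs to Glacier after
# 30 days and deletes them after 365 days.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="acme-log-archive",          # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-then-expire-logs",
                "Filter": {"Prefix": "raw-logs/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
                "Expiration": {"Days": 365},
            }
        ]
    },
)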

Tiered Storage Implementation#

Loki: Uses object storage (S3, GCS) natively. Set retention with limits_config.retention_period, which the compactor enforces when retention_enabled: true is set (legacy deployments use table_manager.retention_period instead). Older data automatically moves to cheaper storage tiers if using S3 Intelligent-Tiering or GCS Nearline.

# Loki retention configuration
limits_config:
  retention_period: 720h  # 30 days

# Per-tenant overrides (multi-tenant Loki)
overrides:
  production:
    retention_period: 2160h  # 90 days
  development:
    retention_period: 168h   # 7 days

Elasticsearch: Use Index Lifecycle Management (ILM) to transition indices through phases.

{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_age": "1d", "max_size": "50gb" }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "shrink": { "number_of_shards": 1 },
          "forcemerge": { "max_num_segments": 1 }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "searchable_snapshot": { "snapshot_repository": "logs-archive" }
        }
      },
      "delete": {
        "min_age": "90d"
      }
    }
  }
}

What to Filter Before Storage#

Reduce volume (and cost) by filtering at the collector level. Not all logs need to reach the backend.

# Fluent Bit filter example: drop health check logs
[FILTER]
    Name    grep
    Match   kube.*
    Exclude log /(health|ready|live)

# Vector filter: drop debug logs in production
[transforms.filter_debug]
  type = "filter"
  inputs = ["kubernetes_logs"]
  condition = '.level != "debug"'

Filtering before ingestion is the most cost-effective optimization. A service that logs every health check response at INFO level can generate 30-50% of its total log volume from those entries alone. Drop them at the collector.

Log-Based Alerting#

Metrics-based alerting is preferred for most scenarios because metrics are cheap to store and fast to query. But some conditions are only detectable in logs: specific error messages, stack traces, security events, or business logic failures that are not captured as metrics.

Decision: When to Alert on Logs vs. Metrics#

USE metric-based alerting when:
  - The condition is a rate, count, or threshold (error rate > 1%)
  - The metric already exists in Prometheus
  - You need alerting latency under 30 seconds

USE log-based alerting when:
  - The condition is a specific error message or pattern
  - The error does not have a corresponding metric
  - You need to alert on the first occurrence of a specific event
  - Security events: unauthorized access attempts, privilege escalation

Loki LogQL Alerting#

Loki can evaluate LogQL queries as alerting rules, similar to Prometheus alerting rules.

# Loki alerting rules
groups:
  - name: log-alerts
    rules:
      - alert: AuthServicePanicDetected
        expr: |
          count_over_time(
            {job="auth-service"} |= "panic" [5m]
          ) > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Panic detected in auth-service logs"

      - alert: HighRateOfDatabaseErrors
        expr: |
          sum(rate(
            {job="api-gateway"} |= "database error" [5m]
          )) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Database errors in api-gateway exceeding 0.5/s"

      - alert: UnauthorizedAccessAttempt
        expr: |
          count_over_time(
            {job="api-gateway"} | json | level="warn" | message="unauthorized access attempt" [5m]
          ) > 10
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "More than 10 unauthorized access attempts in 5 minutes"

Elasticsearch Watcher Alerting#

{
  "trigger": {
    "schedule": { "interval": "5m" }
  },
  "input": {
    "search": {
      "request": {
        "indices": ["logs-*"],
        "body": {
          "query": {
            "bool": {
              "must": [
                { "match": { "level": "error" } },
                { "match": { "error_type": "database_connection_failed" } },
                { "range": { "@timestamp": { "gte": "now-5m" } } }
              ]
            }
          }
        }
      }
    }
  },
  "condition": {
    "compare": { "ctx.payload.hits.total.value": { "gt": 5 } }
  },
  "actions": {
    "notify_slack": {
      "webhook": {
        "scheme": "https",
        "host": "hooks.slack.com",
        "port": 443,
        "method": "post",
        "path": "/services/T.../B.../...",
        "body": "{\"text\": \"Database connection errors detected: {{ctx.payload.hits.total.value}} in last 5 minutes\"}"
      }
    }
  }
}

Converting Log Patterns to Metrics#

For recurring log-based alerts, convert the pattern into a metric. This gives you the performance and reliability of metric-based alerting while still capturing the log-originating signal.

Loki metric extraction:

# Extract a metric from logs using LogQL metric queries
# This counts database errors per service, usable as a recording rule
sum by (service) (
  count_over_time({namespace="production"} | json | error_type="database_error" [5m])
)

Application-level approach: Emit a Prometheus counter alongside the log statement. When the application logs a database error, it also increments database_errors_total{service="api-gateway", error_type="connection_timeout"}. This is the most reliable approach because the metric and the log come from the same code path.
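
A minimal sketch of that pattern with the Python prometheus_client library (the metric name matches the example above; the service label and helper function are assumptions):

# Minimal sketch: increment the counter in the same code path that logs the
# error, so the metric and the log can never drift apart.
from prometheus_client import Counter

DATABASE_ERRORS = Counter(
    "database_errors_total",
    "Database errors observed by the service",
    ["service", "error_type"],
)

def record_database_error(logger, error_type: str):
    DATABASE_ERRORS.labels(service="api-gateway", error_type=error_type).inc()
    logger.error("Database operation failed",
                 extra={"fields": {"error_type": error_type}})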

Correlation with Traces and Metrics#

Logs become dramatically more useful when they link to traces and metrics. During an incident, the workflow is: alert fires (metrics) -> view logs for the failing time window -> click through to the trace for a specific failing request -> see the full distributed call path.

Implementing Correlation#

Log-to-trace: Include trace_id and span_id in every structured log entry. In Grafana, configure the Loki data source with a “Derived field” that links trace_id values to the Tempo or Jaeger data source.

# Grafana Loki data source configuration (provisioning)
datasources:
  - name: Loki
    type: loki
    url: http://loki:3100
    jsonData:
      derivedFields:
        - datasourceUid: tempo
          matcherRegex: '"trace_id":"([a-f0-9]+)"'
          name: TraceID
          url: '$${__value.raw}'
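
On the application side, the trace_id must actually be present in the log entry for the derived field to match. A sketch that copies it from the active OpenTelemetry span (the helper name is an assumption; feed the returned fields into whatever structured logger you use):

# Minimal sketch: copy the active span's IDs into a structured log entry so
# the derived field above has something to match.
from opentelemetry import trace

def trace_fields() -> dict:
    ctx = trace.get_current_span().get_span_context()
    if not ctx.is_valid:
        return {}
    return {
        "trace_id": format(ctx.trace_id, "032x"),  # 32-char hex, W3C format
        "span_id": format(ctx.span_id, "016x"),
    }

# Usage: logger.info("Processed request", extra={"fields": trace_fields()})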

Log-to-metric: Use matching labels between logs and metrics. If the Loki log stream has {service="api-gateway", namespace="production"} and Prometheus has metrics with {job="api-gateway", namespace="production"}, Grafana can correlate them in split-view panels.

Metric-to-log: From a Grafana metric panel showing elevated error rates, link to a Loki query filtered to the same time range and service labels. Grafana’s Explore view supports this natively – select a time range on a metric graph and click “Split” to open a Loki query for the same window.

OpenTelemetry as the Correlation Layer#

OpenTelemetry provides a unified SDK and collector that handles metrics, logs, and traces with built-in correlation. The OpenTelemetry SDK automatically injects trace context into log entries, and the OpenTelemetry Collector can export all three signals to their respective backends while preserving the correlation IDs.

# OpenTelemetry Collector pipeline configuration
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:

exporters:
  prometheusremotewrite:
    endpoint: "http://mimir:9090/api/v1/push"
  loki:
    endpoint: "http://loki:3100/loki/api/v1/push"
  otlp/tempo:
    endpoint: "http://tempo:4317"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo]

This architecture means the application instruments once with the OpenTelemetry SDK and gets correlated metrics, logs, and traces without managing three separate instrumentation libraries.
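
As a sketch of what that single instrumentation point looks like in a Python service (traces only for brevity; the collector endpoint, service name, and span name are assumptions):

# Minimal sketch: configure the OpenTelemetry SDK once at startup and export
# spans to the local collector; metrics and logs follow the same pattern.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "api-gateway"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="otel-collector:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("handle_request"):
    pass  # application work; logs emitted here can carry the active trace context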

Decision Summary#

STRUCTURED LOGGING:
  Always. Non-negotiable. JSON format with standardized fields.
  This is the single highest-leverage investment in your log pipeline.

AGGREGATION ARCHITECTURE:
  Centralized push-based with DaemonSet collectors for most teams.
  Distributed only for data residency or extreme volume (>500GB/day).

COLLECTOR:
  Promtail/Alloy for Loki-only shops.
  Vector or Fluent Bit for multi-backend or complex transformation needs.

RETENTION:
  Match compliance requirements. Default to 30 days hot, archive to object storage.
  Filter noise at the collector to reduce volume before it hits the backend.

ALERTING:
  Metrics-first. Log-based only for patterns that cannot be expressed as metrics.
  Convert recurring log alerts into metrics over time.

CORRELATION:
  Include trace_id in all log entries. Configure Grafana derived fields.
  Consider OpenTelemetry for unified instrumentation across all three signals.