Loki Architecture#
Loki is a log aggregation system designed by Grafana Labs. Unlike Elasticsearch, Loki does not index log content. It indexes only metadata labels, then stores compressed log chunks in object storage. This makes it cheaper to operate and simpler to scale, at the cost of slower full-text search across massive datasets.
The core components are:
- Distributor: Receives incoming log streams from agents, validates labels, and forwards them to ingesters via consistent hashing (see the push example after this list).
- Ingester: Buffers log data in memory, builds compressed chunks, and flushes them to long-term storage (S3, GCS, filesystem).
- Querier: Executes LogQL queries by fetching chunk references from the index and reading chunk data from storage.
- Compactor: Runs periodic compaction on the index (especially for boltdb-shipper) and handles retention enforcement by deleting old data.
- Query Frontend (optional): Splits large queries into smaller ones, caches results, and distributes work across queriers.
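To make the write path concrete, you can push a log line straight through the gateway to the distributor. A minimal sketch, assuming a loki-gateway.logging.svc Service name from a Helm install and made-up app/namespace labels; adjust both to your environment:

# Send one test line to the push API (served by the distributor behind the gateway).
# Hostname and labels are placeholders, not part of any specific install. Requires GNU date for %N.
curl -s -X POST http://loki-gateway.logging.svc/loki/api/v1/push \
  -H "Content-Type: application/json" \
  -d '{
    "streams": [
      {
        "stream": { "app": "push-test", "namespace": "default" },
        "values": [ [ "'"$(date +%s%N)"'", "manual test line from curl" ] ]
      }
    ]
  }'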
Deployment Modes#
Loki supports three deployment modes, each suited to different scales.
Monolithic: All components run in a single process. Good for development and clusters ingesting under 100GB/day. Deploy with a single StatefulSet.
Simple Scalable (SSD): Splits into read path and write path. Write nodes handle distribution and ingestion; read nodes handle queries. This is the recommended starting point for production. The grafana/loki Helm chart deploys this mode by default.
Microservices: Each component runs as a separate Deployment or StatefulSet. Required for very high throughput (multi-TB/day). Complex to operate – use only when the simple scalable mode hits limits.
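For the simple scalable mode, sizing is expressed as replica counts per target in the grafana/loki chart values. The sketch below uses key names from recent chart versions; confirm them for your version with helm show values grafana/loki:

# Rough simple-scalable sizing sketch (key names per recent grafana/loki chart versions).
write:
  replicas: 3        # write path: distributor + ingester
read:
  replicas: 2        # read path: queriers
backend:
  replicas: 2        # backend target (compactor, ruler, index gateway) in 3-target charts
singleBinary:
  replicas: 0        # keep the monolithic target disabled when running SSD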
Helm Installation#
The grafana/loki-stack chart bundles Loki with Promtail and optionally Grafana:
helm repo add grafana https://grafana.github.io/helm-charts
helm install loki grafana/loki-stack \
  --namespace logging --create-namespace \
  --set loki.storage.type=filesystem \
  --set loki.commonConfig.replication_factor=1 \
  --set promtail.enabled=true \
  --set grafana.enabled=true
For production with object storage:
# loki-values.yaml
loki:
  storage:
    type: s3
    s3:
      endpoint: s3.amazonaws.com
      bucketnames: loki-chunks
      region: us-east-1
      access_key_id: ${AWS_ACCESS_KEY_ID}
      secret_access_key: ${AWS_SECRET_ACCESS_KEY}
  commonConfig:
    replication_factor: 3
  limits_config:
    retention_period: 30d
  compactor:
    retention_enabled: true
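Applying the values file is an ordinary helm upgrade against the release; a sketch assuming the release name and namespace from the quick start above (swap in the grafana/loki chart if you are deploying the simple scalable mode rather than loki-stack):

helm upgrade --install loki grafana/loki-stack \
  --namespace logging \
  -f loki-values.yaml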
Promtail as Log Collector#
Promtail runs as a DaemonSet on every node, tails container log files from /var/log/pods, and ships them to Loki. It discovers pods via the Kubernetes API and attaches labels automatically.
# promtail scrape config (inside promtail-values.yaml)
config:
  clients:
    - url: http://loki-gateway/loki/api/v1/push
  scrape_configs:
    - job_name: kubernetes-pods
      kubernetes_sd_configs:
        - role: pod
      relabel_configs:
        - source_labels: [__meta_kubernetes_namespace]
          target_label: namespace
        - source_labels: [__meta_kubernetes_pod_name]
          target_label: pod
        - source_labels: [__meta_kubernetes_pod_container_name]
          target_label: container
      pipeline_stages:
        - docker: {}
        - json:
            expressions:
              level: level
              msg: msg
              caller: caller
        - labels:
            level:
        - timestamp:
            source: time
            format: RFC3339Nano
Pipeline stages parse log lines sequentially. The json stage extracts fields from JSON logs. The labels stage promotes extracted values to Loki labels (use sparingly – see cardinality section). The timestamp stage sets the log entry timestamp from the parsed field instead of the ingestion time.
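As a concrete illustration, consider a made-up JSON line emitted by a container (the docker stage first unwraps the container runtime's JSON-file envelope, leaving this as the log line):

{"level":"error","msg":"upstream timeout","caller":"client.go:87","time":"2024-05-01T12:00:00.123456789Z"}

The json stage extracts level, msg, and caller; the labels stage attaches level="error" as a queryable label; the timestamp stage takes the entry's timestamp from the time field; and because no output stage rewrites the line, the full JSON is shipped to Loki unchanged.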
LogQL Query Language#
LogQL has two query types: log queries (return log lines) and metric queries (return numeric values derived from logs).
Stream selectors filter by labels:
{namespace="production", container="api-server"}Filter expressions search within log content:
{namespace="production"} |= "error"
{namespace="production"} !~ "health_check|readiness"
{namespace="production"} |= "timeout" != "context canceled"|= is contains, != is not-contains, |~ is regex match, !~ is regex not-match.
Parser stages structure unstructured logs at query time:
# Parse JSON logs and filter on extracted fields
{container="api-server"} | json | status_code >= 500
# Parse logfmt
{container="nginx"} | logfmt | duration > 2s
# Parse with regex
{container="legacy-app"} | regexp `(?P<ip>\S+) - - \[(?P<ts>[^\]]+)\] "(?P<method>\w+) (?P<path>\S+)"`
| path=~"/api/.*" | method="POST"Metric queries compute rates and aggregations:
# Error rate per namespace over 5 minutes
sum by (namespace) (rate({job="pods"} |= "error" [5m]))
# Count occurrences of each distinct error message over the last hour
count_over_time({container="api-server"} |= "error" | json | keep msg [1h])
# 99th percentile of extracted durations
quantile_over_time(0.99, {container="api-server"} | json | unwrap duration_ms [5m]) by (endpoint)
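Metric queries also feed Loki's ruler, so the same expressions can drive alerts. A sketch of a ruler rule built on the first query above, assuming the ruler is enabled; the group name, threshold, and annotation text are placeholders:

groups:
  - name: log-error-rates
    rules:
      - alert: HighErrorLogRate
        expr: sum by (namespace) (rate({job="pods"} |= "error" [5m])) > 10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Error log rate above 10 lines/sec in {{ $labels.namespace }}"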
Label Cardinality: The Number One Loki Mistake#
Loki’s index is built on labels. Every unique label combination creates a new stream. High-cardinality labels (user IDs, request IDs, IP addresses, trace IDs) create millions of streams and destroy Loki’s performance.
Rules to follow:
- Use only low-cardinality labels: namespace, pod, container, level, app, env. These have bounded, small value sets.
- Never promote request-scoped values (user ID, trace ID, request path) to labels. Instead, keep them in the log line and filter with |= or | json | field="value" at query time (see the query example at the end of this section).
- Monitor stream count with the Loki metric loki_ingester_streams_created_total. If it climbs continuously, you have a cardinality problem.
- Set max_streams_per_user in limits_config to catch runaway cardinality before it takes down ingesters.
limits_config:
  max_streams_per_user: 10000
  max_entries_limit_per_query: 5000
  max_label_names_per_series: 15
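To see the second rule above in practice: a request-scoped value such as a trace ID stays in the log line and is filtered at query time instead of becoming a label. The app label and trace_id field here are hypothetical:

# Don't: a label per trace ID means a new stream per request
{app="api-server", trace_id="4bf92f3577b34da6"}

# Do: select by low-cardinality labels, then filter the line at query time
{app="api-server"} |= "4bf92f3577b34da6"
{app="api-server"} | json | trace_id="4bf92f3577b34da6"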
Loki vs Elasticsearch#
Elasticsearch indexes every word in every log line, enabling fast full-text search. Loki indexes only labels and compresses log chunks, making it 10-20x cheaper on storage. Elasticsearch requires significant heap tuning and cluster management. Loki stores chunks in object storage, offloading durability.
Choose Elasticsearch when you need fast ad-hoc full-text search across logs and have the operational budget. Choose Loki when you know your query patterns (filter by service, then grep), want tight Grafana integration, and want to minimize storage costs. For most Kubernetes-native teams already running Grafana and Prometheus, Loki is the natural fit.