Pull-Based Scraping Model#

Prometheus pulls metrics from targets rather than having targets push metrics to it. At every scrape interval (the built-in default is 1m; the global config below sets the common 15s), Prometheus sends an HTTP GET to each target’s metrics endpoint. The target responds with all of its current metric values in the Prometheus exposition format.

This pull model has concrete advantages. Prometheus controls the scrape rate, so a misbehaving target cannot flood the system. You can scrape a target from your laptop with curl http://target:8080/metrics to see exactly what Prometheus sees. Targets that go down are immediately detectable because the scrape fails.
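
What comes back from that curl is plain text in the exposition format. A response from a hypothetical service might look like this (metric names and values are illustrative):

# HELP http_requests_total Total HTTP requests handled.
# TYPE http_requests_total counter
http_requests_total{method="get",code="200"} 1027
http_requests_total{method="get",code="500"} 3
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 2.4576e+07

The scrape cadence and timeout that govern these requests live in the global section of prometheus.yml: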

global:
  scrape_interval: 15s
  scrape_timeout: 10s
  evaluation_interval: 15s

The scrape_timeout must not exceed scrape_interval; Prometheus rejects a configuration where it does. If a target takes longer than the timeout to respond, the scrape is marked as failed and the synthetic up{job="..."} series drops to 0.
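
Failed scrapes are easy to alert on. A minimal alerting rule sketch, assuming a rule file is already referenced under rule_files (the group and alert names are illustrative):

groups:
  - name: scrape-health
    rules:
      - alert: TargetDown
        expr: up == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.job }}/{{ $labels.instance }} has been unreachable for 5 minutes"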

TSDB Storage#

Prometheus stores data in a custom time-series database on local disk. The TSDB organizes data into two-hour blocks. Incoming samples land in an in-memory head block, backed by a write-ahead log (WAL) for crash recovery. Roughly every two hours, the head is compacted into an immutable on-disk block.

Compaction merges smaller blocks into larger ones over time, improving query performance for longer time ranges. Each block is a directory containing chunks (the actual sample data), an index (label-to-series mappings), and metadata.
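
On disk, the data directory looks roughly like the sketch below; the block name is an illustrative ULID, and exact contents vary by Prometheus version:

/prometheus/
  01HV7Q2Z0N5T9M0C8R4K6XW3YZ/   # one immutable block
    chunks/                     # compressed sample data
    index                       # label-to-series index
    meta.json                   # time range, compaction level
    tombstones                  # pending deletions
  chunks_head/                  # memory-mapped head chunks
  wal/                          # write-ahead log for the head block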

Key storage flags:

prometheus \
  --storage.tsdb.path=/prometheus \
  --storage.tsdb.retention.time=30d \
  --storage.tsdb.retention.size=50GB \
  --storage.tsdb.min-block-duration=2h \
  --storage.tsdb.max-block-duration=36h

When both retention.time and retention.size are set, whichever limit is hit first triggers block deletion. Monitor prometheus_tsdb_storage_blocks_bytes to track actual disk usage.
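
A quick PromQL check against the size limit, assuming the 50GB cap from the flags above:

# Fraction of the 50GB retention.size budget used by persisted blocks
prometheus_tsdb_storage_blocks_bytes / (50 * 1024^3)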

Service Discovery#

Static configs work for a handful of targets but break down in dynamic environments. Prometheus has native service discovery for Kubernetes, Consul, EC2, GCE, Azure, DNS, and file-based discovery.

Kubernetes SD discovers targets from the Kubernetes API server:

scrape_configs:
  - job_name: "k8s-pods"
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: ["production", "staging"]
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Combine the pod IP from __address__ with the port declared in the annotation
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: ${1}:${2}
        target_label: __address__

The role field determines what is discovered: pod, service, endpoints, endpointslice, node, or ingress. Each role exposes different __meta_kubernetes_* labels for relabeling.
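
A common follow-up is to carry some of that discovery metadata onto the scraped series. A small sketch (the __meta_* labels on the left are real; the resulting namespace and pod label names are a convention, not a requirement):

relabel_configs:
  # Copy every Kubernetes pod label onto the target as a Prometheus label
  - action: labelmap
    regex: __meta_kubernetes_pod_label_(.+)
  # Record where the metrics came from
  - source_labels: [__meta_kubernetes_namespace]
    target_label: namespace
  - source_labels: [__meta_kubernetes_pod_name]
    target_label: pod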

Consul SD discovers services registered in Consul:

scrape_configs:
  - job_name: "consul-services"
    consul_sd_configs:
      - server: "consul.service.consul:8500"
        services: []  # empty = all services
        tags: ["prometheus"]
    relabel_configs:
      - source_labels: [__meta_consul_service]
        target_label: job

File SD watches JSON or YAML files for target lists, which is useful when a configuration management system or script generates the targets:

scrape_configs:
  - job_name: "file-targets"
    file_sd_configs:
      - files: ["/etc/prometheus/targets/*.json"]
        refresh_interval: 5m
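
Each file contains a list of target groups. An illustrative /etc/prometheus/targets/api.json:

[
  {
    "targets": ["10.0.0.5:8080", "10.0.0.6:8080"],
    "labels": {
      "job": "api",
      "env": "production"
    }
  }
]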

Relabeling#

Relabeling is the most powerful and most confusing part of Prometheus configuration. There are two places it applies, and they serve different purposes.

relabel_configs runs before the scrape. It manipulates target labels and metadata to control which targets are scraped and how. It operates on labels prefixed with __meta_* (from service discovery) and special labels like __address__, __metrics_path__, and __scheme__.

metric_relabel_configs runs after the scrape, on every individual metric sample. Use it to drop expensive metrics, rename them, or modify their labels.

scrape_configs:
  - job_name: "kubelet"
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      # Before scrape: rewrite the port to kubelet's metrics port
      - action: replace
        source_labels: [__address__]
        regex: (.+):(.+)
        replacement: ${1}:10250
        target_label: __address__
    metric_relabel_configs:
      # After scrape: drop high-cardinality metrics we don't need
      - source_labels: [__name__]
        regex: "kubelet_runtime_operations_duration_seconds_bucket"
        action: drop

Common relabeling actions: keep (keep only targets or samples whose label values match the regex), drop (discard matches), replace (regex-replace label values into a target label), labelmap (copy label names matching a regex to new names given by the replacement), labeldrop (remove labels whose names match the regex), and hashmod (hash a label value modulo N, used for sharding across multiple Prometheus instances, as sketched below).
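
A hashmod sharding sketch, assuming three Prometheus instances and that this one keeps shard 0 (the __tmp_shard label name is arbitrary scratch space):

relabel_configs:
  # Hash the target address into one of 3 buckets
  - source_labels: [__address__]
    modulus: 3
    target_label: __tmp_shard
    action: hashmod
  # This instance only scrapes targets in bucket 0
  - source_labels: [__tmp_shard]
    regex: "0"
    action: keep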

Federation and Remote Storage#

A single Prometheus instance has limits. Federation lets a higher-level Prometheus scrape selected metrics from downstream instances:

scrape_configs:
  - job_name: "federate"
    honor_labels: true
    metrics_path: /federate
    params:
      match[]:
        - '{job="api-server"}'
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets: ["prometheus-dc1:9090", "prometheus-dc2:9090"]

For true long-term storage, use remote_write to send samples to a dedicated backend. The major options are Thanos, Cortex/Mimir (Grafana Mimir is a fork of Cortex maintained by Grafana Labs), and VictoriaMetrics.

remote_write:
  - url: "http://mimir-distributor:9009/api/v1/push"
    queue_config:
      max_samples_per_send: 5000
      batch_send_deadline: 5s
      max_shards: 30
    write_relabel_configs:
      - source_labels: [__name__]
        regex: "go_.*"
        action: drop

Thanos adds a sidecar to each Prometheus that uploads blocks to object storage (S3, GCS). A querier component provides a unified query interface across all Prometheus instances and historical data. Best for: multi-cluster setups where you want to keep Prometheus instances independent.
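
A rough sketch of the sidecar wiring (flag values and the object-store config path are illustrative; check the Thanos docs for your version):

# Prometheus: keep blocks at 2h so the sidecar can upload them as they are cut
prometheus \
  --storage.tsdb.min-block-duration=2h \
  --storage.tsdb.max-block-duration=2h

# Sidecar: read local blocks, upload them to object storage, serve the Store API
thanos sidecar \
  --tsdb.path=/prometheus \
  --prometheus.url=http://localhost:9090 \
  --objstore.config-file=/etc/thanos/objstore.yaml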

Grafana Mimir is a horizontally scalable, multi-tenant TSDB. Prometheus pushes via remote_write and Mimir handles storage, compaction, and querying. Best for: large-scale centralized monitoring with strict multi-tenancy requirements.

When Prometheus Is the Right Choice#

Prometheus excels at infrastructure and application monitoring with dimensional data. It handles millions of active time series on a single instance with modest hardware. The ecosystem (exporters, alerting, Grafana integration) is unmatched.

Prometheus is not the right choice for: event logging (use Loki or Elasticsearch), distributed tracing (use Jaeger or Tempo), business analytics requiring SQL joins across metrics and other data, or situations requiring 100% accuracy of every data point (Prometheus may lose samples during scrape failures or restarts). For high-cardinality use cases like per-user metrics, consider a system built for that workload like Mimir or VictoriaMetrics from the start.