Why Monitor Your Monitoring#
If Prometheus runs out of memory and crashes, you lose all alerting. If its disk fills up, it stops ingesting and you have a blind spot that may last hours before anyone notices. If scrapes start timing out, metrics go stale and alerts based on rate() produce no data (which means they silently stop firing rather than triggering). Prometheus must be the most reliably monitored component in your stack.
The irony is real: Prometheus exposes hundreds of metrics about itself, but if the instance is down, nobody is scraping those metrics. The solution is cross-monitoring – each Prometheus instance scrapes the other. In single-instance setups, forward critical self-monitoring metrics via remote_write to an external system, or use a lightweight external prober.
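As a minimal sketch of cross-monitoring, instance A scrapes instance B with an ordinary job (the job name and hostname here are illustrative assumptions); the mirror-image job goes on instance B:
# Illustrative cross-monitoring job on instance A; hostname is an assumption
scrape_configs:
  - job_name: "prometheus-peer"
    static_configs:
      - targets: ["prometheus-b.internal:9090"]
An alert on up{job="prometheus-peer"} == 0 on each side then covers the other instance going dark.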
Key Self-Monitoring Metrics#
Series and Ingestion#
# Active time series count -- the single most important capacity metric
prometheus_tsdb_head_series
# Rate of new chunks being created (correlates with ingestion rate)
rate(prometheus_tsdb_head_chunks_created_total[5m])
# Samples ingested per second
rate(prometheus_tsdb_head_samples_appended_total[5m])
# Series created per second (new series appearing)
rate(prometheus_tsdb_head_series_created_total[5m])
Track prometheus_tsdb_head_series on a daily basis. A healthy production Prometheus typically grows series count slowly as new services deploy. Sudden jumps indicate a cardinality problem – a new deployment with high-cardinality labels or a misconfigured service discovery.
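To catch such jumps automatically, a rough alert sketch (the 20% threshold and one-hour window are illustrative assumptions, not recommendations):
# Illustrative: fire when active series grew more than 20% within an hour
- alert: PrometheusSeriesJump
  expr: prometheus_tsdb_head_series / prometheus_tsdb_head_series offset 1h > 1.2
  for: 15m
  labels:
    severity: warning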
Compaction Health#
# Compaction rate and failures
rate(prometheus_tsdb_compactions_total[1h])
rate(prometheus_tsdb_compactions_failed_total[1h])
# Compaction duration -- should stay consistent
histogram_quantile(0.99, rate(prometheus_tsdb_compaction_duration_seconds_bucket[1h]))
# Head block compaction specifically
prometheus_tsdb_head_truncations_total
Failed compactions are serious. They mean on-disk blocks are not being merged, which degrades query performance for longer time ranges and wastes disk space. If you see prometheus_tsdb_compactions_failed_total increasing, check disk space and Prometheus logs for corruption errors.
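Since disk pressure is the usual culprit, it helps to watch free space on the data volume alongside the compaction metrics. A sketch assuming node_exporter runs on the host and the TSDB lives at /prometheus (adjust the mountpoint to your setup):
# Illustrative: less than 10% free space on the Prometheus data volume
node_filesystem_avail_bytes{mountpoint="/prometheus"}
  / node_filesystem_size_bytes{mountpoint="/prometheus"} < 0.10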
Scrape Performance#
# Duration of each scrape -- should be well under scrape_interval
scrape_duration_seconds
# How many samples each target returns per scrape
scrape_samples_scraped
# Targets exceeding configured limits
prometheus_target_scrape_pool_exceeded_target_limit_total
# Scrape failures by reason
rate(prometheus_target_scrapes_exceeded_body_size_limit_total[5m])
rate(prometheus_target_scrapes_exceeded_sample_limit_total[5m])
A useful derivative metric: identify the slowest targets dragging down your scrape cycle:
topk(10, scrape_duration_seconds)
If any target’s scrape_duration_seconds approaches your scrape_timeout (default 10s), that target is at risk of intermittent scrape failures. Either increase the timeout for that job, reduce the number of metrics the target exposes, or increase the scrape interval for that specific job.
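Both the timeout and the interval are per-job settings. A sketch with illustrative values (the job name and timings are assumptions; scrape_timeout must not exceed scrape_interval):
scrape_configs:
  - job_name: "slow-exporter"   # illustrative job name
    scrape_interval: 60s
    scrape_timeout: 30s         # must be <= scrape_interval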
Rule Evaluation#
# How long rule groups take to evaluate
prometheus_rule_group_duration_seconds{quantile="0.99"}
# Overall rule evaluation duration (summary quantiles)
prometheus_rule_evaluation_duration_seconds
# Missed evaluations (rule group took longer than its interval)
rate(prometheus_rule_group_iterations_missed_total[5m])
# Rule evaluation failures
rate(prometheus_rule_evaluation_failures_total[5m])
Missed rule iterations mean your recording rules and alerts are not being computed on schedule. This is a direct reliability risk. The fix is usually to either optimize the underlying PromQL expressions, convert them to recording rules, or increase the rule group’s evaluation interval.
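For the last option, the interval is set per rule group. A sketch with an illustrative group name, interval, and recording rule:
groups:
  - name: expensive-aggregations   # illustrative
    interval: 1m                   # overrides the global evaluation_interval for this group
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))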
Alertmanager Communication#
# Dropped notifications (Alertmanager unreachable or rejecting)
rate(prometheus_notifications_dropped_total[5m])
# Notification queue depth vs. capacity (a full queue drops notifications)
prometheus_notifications_queue_length
prometheus_notifications_queue_capacity
# Alertmanager discovery
prometheus_notifications_alertmanagers_discovered
If prometheus_notifications_dropped_total is increasing, alerts are firing but never reaching Alertmanager. This is a silent failure mode – you think you have alerting, but notifications are being discarded.
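A complementary sketch: alert when Prometheus has discovered no Alertmanager at all (the alert name and timing are illustrative):
- alert: PrometheusNoAlertmanagers   # illustrative name
  expr: prometheus_notifications_alertmanagers_discovered < 1
  for: 5m
  labels:
    severity: critical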
Resource Consumption#
# Prometheus memory usage
process_resident_memory_bytes{job="prometheus"}
# CPU usage
rate(process_cpu_seconds_total{job="prometheus"}[5m])
# Open file descriptors (can hit OS limits)
process_open_fds{job="prometheus"}
# Go garbage collection impact
rate(go_gc_duration_seconds_sum{job="prometheus"}[5m])
Capacity Planning Formulas#
Memory#
The primary driver of Prometheus memory usage is the number of active time series. Each series in the head block consumes approximately 2-4 KB of memory, depending on label count and churn rate. A rough formula:
Base memory ≈ active_series * 3 KB
Query overhead ≈ 20-50% additional during queries
Total ≈ active_series * 3 KB * 1.5
For 1 million active series, budget approximately 4.5 GB of RAM. For 5 million series, budget approximately 22 GB. These are baseline numbers – complex queries, many concurrent dashboard viewers, and high churn (series appearing and disappearing rapidly) all increase memory pressure.
Monitor actual usage and correlate with series count:
# Bytes per series (actual)
process_resident_memory_bytes{job="prometheus"}
/ prometheus_tsdb_head_series
Disk#
Disk usage depends on samples per second, bytes per sample after compression, and retention duration:
Disk usage ≈ ingestion_rate_samples_per_sec * bytes_per_sample * retention_seconds
Prometheus achieves roughly 1.5-2 bytes per sample after compression (down from 16 bytes raw). Calculate your ingestion rate:
# Samples ingested per second
rate(prometheus_tsdb_head_samples_appended_total[5m])
Example: 100,000 samples/second with 1.5 bytes/sample and 15-day retention:
100,000 * 1.5 * (15 * 86,400) = ~194 GB
Add 20% for WAL, temporary compaction space, and index overhead.
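You can turn the same arithmetic into a query that projects disk needs from the live ingestion rate; the 1.5 bytes/sample and 15-day retention are the example's assumptions:
# Projected TSDB bytes: ingestion rate * bytes per sample * retention seconds
rate(prometheus_tsdb_head_samples_appended_total[1h]) * 1.5 * 15 * 86400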
Scrape Load#
Each scrape target adds load proportional to the number of metrics it exposes:
Total series ≈ sum(metrics_per_target * instances_of_target) for all jobs
Ingestion rate ≈ total_series / scrape_interval
A Kubernetes cluster with 200 pods, each exposing 500 metrics, scraped every 15 seconds: 100,000 series, ~6,667 samples/second. This is well within a single Prometheus instance’s capability.
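To see where your series budget actually goes, Prometheus's own scrape metadata gives an approximate per-job breakdown:
# Approximate series contribution per job, from each target's last scrape
sum by (job) (scrape_samples_scraped)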
Scaling Patterns#
Vertical Scaling#
Before introducing architectural complexity, scale up. Prometheus is surprisingly efficient:
- SSDs are mandatory for any meaningful workload; beyond roughly 500K active series, an HDD becomes an I/O bottleneck during compaction
- More RAM directly reduces query latency by keeping more data in the page cache
- CPU is rarely the bottleneck unless you have hundreds of expensive recording rules
A single well-provisioned Prometheus instance (64 GB RAM, NVMe SSD, 16 cores) comfortably handles 10-15 million active series.
Functional Sharding#
Split by responsibility rather than data:
prometheus-infra: scrapes node_exporter, kube-state-metrics, kubelet
prometheus-apps: scrapes application pods
prometheus-platform: scrapes databases, message queues, caches
Each instance has its own storage, alerting rules, and retention. Configure Grafana with multiple data sources and use mixed queries or dashboard variables to select the right source.
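The split is expressed purely through each instance's scrape_configs. A sketch of what the infra shard might scrape (job names and targets are illustrative):
# prometheus-infra: illustrative scrape config
scrape_configs:
  - job_name: "node-exporter"
    static_configs:
      - targets: ["node-1.internal:9100", "node-2.internal:9100"]
  - job_name: "kube-state-metrics"
    static_configs:
      - targets: ["kube-state-metrics.kube-system.svc:8080"]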
Hierarchical Federation#
Edge Prometheus instances scrape targets locally. A central Prometheus scrapes aggregated metrics from edges via the /federate endpoint:
# Central Prometheus config
scrape_configs:
  - job_name: "federate-dc1"
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        - '{__name__=~"job:.*"}'   # Only pull recording rule results
        - 'up'
    static_configs:
      - targets: ["prometheus-dc1.internal:9090"]
        labels:
          datacenter: "dc1"
Only federate recording rules and critical metrics. Pulling raw metrics defeats the purpose and creates a bottleneck at the central instance.
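The match[] pattern above assumes the edge instances name their recording rules with a job: prefix (the level:metric:operations convention), so there are job:-prefixed series to pull. An illustrative edge-side rule whose result the central instance would federate:
groups:
  - name: edge-aggregations        # illustrative
    rules:
      - record: job:node_cpu_utilization:avg5m
        expr: 1 - avg by (job) (rate(node_cpu_seconds_total{mode="idle"}[5m]))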
Remote Write to Long-Term Storage#
For retention beyond what local disk supports, pair Prometheus with long-term storage, either via remote_write or a block-shipping sidecar:
Thanos: Sidecar uploads TSDB blocks to object storage. A Querier provides a unified query layer. Best when you want Prometheus instances to remain independent and queryable locally. The Store Gateway serves historical data from object storage.
Grafana Mimir: Prometheus pushes via remote_write to a horizontally scalable cluster. Handles deduplication, compaction, and long-term storage. Best for centralized multi-tenant monitoring at scale.
VictoriaMetrics: Drop-in remote write target with excellent compression and query performance. Single-binary deployment option makes it simpler to operate than Thanos or Mimir for smaller teams.
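If you go the remote_write route (Mimir, VictoriaMetrics, or Thanos Receive), a minimal sketch on the Prometheus side; the endpoint URL and queue settings are illustrative and depend on the backend:
remote_write:
  - url: "https://metrics.example.internal/api/v1/push"   # illustrative endpoint
    queue_config:
      capacity: 20000
      max_samples_per_send: 5000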
Prometheus High Availability#
Run two identical Prometheus instances scraping the same targets. Both evaluate the same alerting rules and send duplicate alerts; Alertmanager deduplicates them through its grouping and notification logic.
# Instance A and B have identical configs except for external_labels
global:
  external_labels:
    cluster: "production"
    replica: "a"   # "b" on the other instance
Both instances send alerts to the same Alertmanager cluster. Alertmanager deduplicates based on the alert’s identity (name + labels), so the replica label must be stripped before alerts are sent; otherwise the two copies look like different alerts. Configure Alertmanager’s --cluster.peer flag so Alertmanager instances form a cluster and synchronize notification state.
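A sketch of the Prometheus-side alerting block that strips the replica label; the Alertmanager hostnames are illustrative:
alerting:
  alert_relabel_configs:
    - action: labeldrop
      regex: replica          # drop the replica label so both copies deduplicate
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager-0.internal:9093", "alertmanager-1.internal:9093"]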
Queries during failover are handled at the Grafana level – configure both Prometheus instances as data sources and use Grafana’s data source proxy or a load balancer with health checks.
Health Alerts#
groups:
  - name: prometheus-self-monitoring
    rules:
      - alert: PrometheusHighMemory
        expr: |
          process_resident_memory_bytes{job="prometheus"}
            / on() (prometheus_tsdb_head_series * 4096) > 1.5
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Prometheus memory usage is disproportionately high relative to series count"
      - alert: PrometheusSlowScrapes
        expr: |
          scrape_duration_seconds > 15
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Target {{ $labels.instance }} taking {{ $value | printf \"%.1f\" }}s to scrape"
      - alert: PrometheusDroppedNotifications
        expr: |
          rate(prometheus_notifications_dropped_total[5m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Prometheus is dropping alert notifications"
      - alert: PrometheusRuleMissedEvaluations
        expr: |
          rate(prometheus_rule_group_iterations_missed_total[5m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Rule group {{ $labels.rule_group }} missing evaluations"
      - alert: PrometheusTSDBCompactionsFailing
        expr: |
          rate(prometheus_tsdb_compactions_failed_total[1h]) > 0
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "TSDB compactions are failing -- check disk space and logs"
      - alert: PrometheusWALCorruption
        expr: |
          prometheus_tsdb_wal_corruptions_total > 0
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "WAL corruption detected -- Prometheus may lose data on restart"
TSDB Maintenance#
Snapshots#
Create a consistent point-in-time backup without stopping Prometheus:
curl -X POST http://prometheus:9090/api/v1/admin/tsdb/snapshot
This creates a snapshot in <storage_path>/snapshots/<timestamp>. The snapshot is a hard-link copy of the current blocks, so it is fast and space-efficient. Copy the snapshot directory to your backup target.
The admin API must be enabled with --web.enable-admin-api.
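A shell sketch of the full flow; the snapshot name in the response, the data path, and the backup destination are illustrative:
# The API responds with the snapshot directory name
curl -s -X POST http://prometheus:9090/api/v1/admin/tsdb/snapshot
# {"status":"success","data":{"name":"20240101T000000Z-1a2b3c4d"}}   (illustrative)
# Copy that directory out of <storage_path>/snapshots/ to the backup target
rsync -a /prometheus/snapshots/20240101T000000Z-1a2b3c4d/ backup-host:/backups/prometheus/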
Clean Shutdown#
Prometheus flushes the in-memory head block to WAL on shutdown. A SIGKILL or OOM kill skips this flush, which means WAL replay on the next startup takes longer and may lose the most recent samples. Always use SIGTERM and wait for graceful shutdown.
In Kubernetes, set a generous terminationGracePeriodSeconds (at least 120 seconds) for the Prometheus pod. Prometheus needs time to flush the head block and finish any in-progress compactions.
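A fragment of what that looks like in a pod spec; the 300-second value is an illustrative starting point:
spec:
  terminationGracePeriodSeconds: 300   # give Prometheus time to flush the head block
  containers:
    - name: prometheus
      image: prom/prometheus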
WAL Replay#
On startup, Prometheus replays the WAL to reconstruct the head block. With a large WAL (many series, long time since last compaction), this can take minutes. Monitor startup time and consider reducing --storage.tsdb.min-block-duration if startup takes too long, though this increases compaction frequency.
Self-Monitoring Dashboard#
A production Prometheus self-monitoring dashboard should include these panels:
| Panel | Query | Type |
|---|---|---|
| Active Series | prometheus_tsdb_head_series | Stat / Time series |
| Ingestion Rate | rate(prometheus_tsdb_head_samples_appended_total[5m]) | Time series |
| Memory Usage | process_resident_memory_bytes{job="prometheus"} | Time series |
| Scrape Duration (p99) | quantile(0.99, scrape_duration_seconds) | Time series |
| Slowest Targets | topk(10, scrape_duration_seconds) | Table |
| Rule Eval Duration | prometheus_rule_group_duration_seconds{quantile="0.99"} | Time series |
| Dropped Notifications | rate(prometheus_notifications_dropped_total[5m]) | Stat (should be 0) |
| Compaction Duration | histogram_quantile(0.99, rate(prometheus_tsdb_compaction_duration_seconds_bucket[1h])) | Time series |
| Disk Usage | prometheus_tsdb_storage_blocks_bytes | Time series |
| WAL Size | prometheus_tsdb_wal_storage_size_bytes | Time series |
| Series Churn | rate(prometheus_tsdb_head_series_created_total[5m]) | Time series |
| Head Chunks | prometheus_tsdb_head_chunks | Stat |
Set alert thresholds on the stat panels. A red disk usage panel that nobody notices is no better than no panel at all.