Cardinality Explosion#

Cardinality is the number of unique time series Prometheus tracks. Every unique combination of metric name and label key-value pairs creates a separate series. A metric with 3 labels, each having 100 possible values, generates up to 1,000,000 series. In practice, cardinality explosions are the single most common way to kill a Prometheus instance.

The usual culprits are labels containing user IDs, request paths with embedded IDs (like /api/users/a3f7b2c1), session tokens, trace IDs, or any unbounded value set. A seemingly innocent label like path on an HTTP metric becomes catastrophic when your API has RESTful routes with UUIDs in the path.
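
To see how bad a suspect label already is, count its distinct values and the series behind them (path and http_requests_total are stand-ins for your own label and metric):

# Number of distinct path values currently in the head
count(count by (path) (http_requests_total))

# Series count per path value, highest first
topk(10, count by (path) (http_requests_total))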

Detecting Cardinality Problems#

Query the TSDB status endpoint directly for the fastest overview:

curl -s http://prometheus:9090/api/v1/status/tsdb | jq '.data.headStats'

This returns numSeries, numLabelPairs, and chunkCount. Compare numSeries against your baseline – a sudden jump usually means a new deployment introduced high-cardinality labels.

Find which metrics have the most series:

# Top 10 metric names by series count
topk(10, count by (__name__) ({__name__=~".+"}))

Warning: this query itself is expensive on large instances. Use the TSDB status API endpoint instead when possible: it returns the top 10 metric names by series count without evaluating a query.
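
For example, the same top-10 list is available straight from the endpoint (the seriesCountByMetricName field in recent Prometheus releases):

curl -s http://prometheus:9090/api/v1/status/tsdb | jq '.data.seriesCountByMetricName'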

Track the symbol table size, which grows with label cardinality:

prometheus_tsdb_symbol_table_size_bytes

Monitor active series count over time to catch gradual growth:

prometheus_tsdb_head_series
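
A simple way to catch sudden growth is to compare the current value against itself an hour ago; the 50% threshold below is an arbitrary starting point, not a recommendation:

# Alert if active series grew by more than 50% in the last hour
prometheus_tsdb_head_series > 1.5 * (prometheus_tsdb_head_series offset 1h)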

Fixing Cardinality#

Use metric_relabel_configs to drop high-cardinality labels or entire metrics before they hit storage:

scrape_configs:
  - job_name: "api-server"
    metric_relabel_configs:
      # Drop a specific high-cardinality label from all metrics
      - action: labeldrop
        regex: "request_id"

      # Drop an entire metric that generates too many series
      - source_labels: [__name__]
        regex: "http_request_duration_seconds_bucket"
        action: drop

      # Replace high-cardinality path labels with a normalized version
      - source_labels: [path]
        regex: "/api/users/[a-f0-9-]+"
        target_label: path
        replacement: "/api/users/:id"

The distinction matters: relabel_configs runs before the scrape and operates on target labels. metric_relabel_configs runs after the scrape and operates on metric labels. Cardinality control is almost always a metric_relabel_configs concern.
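
For contrast, a minimal relabel_configs sketch that runs before the scrape, on target labels, assuming Kubernetes service discovery (the namespace value is illustrative):

scrape_configs:
  - job_name: "api-server"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only targets discovered in the prod namespace
      - source_labels: [__meta_kubernetes_namespace]
        regex: "prod"
        action: keep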

Expensive Query Patterns#

Some PromQL patterns look simple but are extremely expensive to evaluate.

Selecting all series. {__name__=~".+"} matches every single time series. On an instance with 2 million series, this loads all of them into memory. Never use this in dashboards or alerts. If you need it for cardinality analysis, use the TSDB status API.

Unbounded range vectors. rate(http_requests_total[30d]) forces Prometheus to load 30 days of samples for every matching series. Prefer shorter windows and use recording rules to aggregate over longer periods.

Regex on high-cardinality labels. {path=~".*error.*"} scans every value of path across every series. If path has 50,000 unique values, this is extremely slow. Pre-filter with exact matches or restructure your labels.

Missing label matchers. rate(http_requests_total[5m]) without any label selectors matches every series for that metric name across all jobs, instances, and status codes. Always add at least one restricting label.
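
A before-and-after sketch of that last point (job and handler values are illustrative):

# Unbounded: every series for the metric, across all jobs, instances, and codes
rate(http_requests_total[5m])

# Scoped: only the series you actually care about
rate(http_requests_total{job="api-server", handler="/checkout"}[5m])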

Query Performance Internals#

Prometheus evaluates queries by first selecting all matching series from the inverted index, then loading the required sample data for those series within the time range. The cost is roughly proportional to: (number of series matched) * (number of samples per series in the range window).

rate() over [5m] at a 15-second scrape interval loads about 20 samples per series. rate() over [1m] loads about 4 samples. The 5-minute window is not much more expensive per series, but it produces smoother, more robust rates because a single missed scrape or counter reset has less influence across the longer window. The 1-minute window is cheaper only if it causes fewer series to be matched, and it does not: the label selector determines that, not the range.

The step interval in range queries (the step parameter in the HTTP API, controlled by Grafana’s resolution) determines how many times the expression is evaluated across the query range. A 24-hour dashboard panel with a 15-second step evaluates 5,760 times. Changing to a 1-minute step reduces that to 1,440 evaluations. For dashboards covering long time ranges, increase the step or use recording rules.
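
The step is explicit in the raw HTTP API; a sketch of the same range query at 1-minute resolution (host, query, and time range are illustrative):

# 24 hours at step=15s is 5,760 evaluations; step=1m cuts that to 1,440
curl -sG 'http://prometheus:9090/api/v1/query_range' \
  --data-urlencode 'query=sum(rate(http_requests_total[5m]))' \
  --data-urlencode 'start=2024-01-01T00:00:00Z' \
  --data-urlencode 'end=2024-01-02T00:00:00Z' \
  --data-urlencode 'step=1m'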

Binary Operators and Vector Matching#

Arithmetic and comparison operators between two instant vectors require Prometheus to match series from both sides. By default, it performs one-to-one matching on all label names (the metric name is excluded).

# Simple ratio -- works when both sides have identical label sets
http_errors_total / http_requests_total

When the label sets differ, use on() or ignoring() to control matching:

# Match only on the 'job' label, ignoring all others
sum by (job) (rate(http_requests_total{code=~"5.."}[5m]))
/ on(job)
sum by (job) (rate(http_requests_total[5m]))

For many-to-one or one-to-many joins, use group_left() or group_right(). The “group” side is the one with more series:

# Attach an app label from kube-state-metrics to container-level metrics
# Left side (container metrics) has many series per pod -- one per container
# Right side (kube_pod_labels) has exactly one series per pod
sum by (namespace, pod, container) (rate(container_cpu_usage_seconds_total[5m]))
* on(namespace, pod) group_left(label_app_kubernetes_io_name)
  kube_pod_labels{label_app_kubernetes_io_name!=""}

A common mistake: using group_left when both sides still have multiple series per set of matching labels. Prometheus rejects this with a "many-to-many matching not allowed" error. Aggregate one side first so the join is truly many-to-one.
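
One common case is a duplicated info metric on the right side (for example, two kube-state-metrics replicas both exporting kube_pod_labels). Collapsing the right side down to the join labels restores the many-to-one shape; a sketch building on the example above:

sum by (namespace, pod, container) (rate(container_cpu_usage_seconds_total[5m]))
* on(namespace, pod) group_left(label_app_kubernetes_io_name)
  max by (namespace, pod, label_app_kubernetes_io_name) (
    kube_pod_labels{label_app_kubernetes_io_name!=""}
  )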

Detection and Prediction Functions#

absent() and absent_over_time()#

These functions return 1 when a metric is missing. They are critical for dead-man switch alerts:

# Alert if no scrape data exists for the payment service
absent(up{job="payment-service"})

# Alert if no samples appeared in the last 15 minutes
absent_over_time(up{job="payment-service"}[15m])

Both catch a target that has vanished or a metric that is no longer exported; the practical difference is timing. absent() looks only at the evaluation instant, so it fires as soon as the series goes stale. absent_over_time() requires no samples across the entire range, which tolerates short scrape gaps and makes for less noisy alerts.
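
A minimal alerting-rule sketch for the dead-man switch (group name, alert name, and severity label are illustrative):

groups:
  - name: deadman_rules
    rules:
      - alert: PaymentServiceAbsent
        expr: absent_over_time(up{job="payment-service"}[15m])
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "No scrape data from payment-service for 15 minutes"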

predict_linear()#

Fits a linear regression to a range vector and predicts the value at a future point:

# Predict disk usage 24 hours from now
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 24*3600) < 0

# Alert if, at the current trend, the certificate will already be expired 7 days from now
# (useful when not_after moves forward as renewals land)
predict_linear(x509_cert_not_after[24h], 7*24*3600) < time() + 7*24*3600

The range vector should cover enough time to establish a trend. Using [6h] with predict_linear for a 24-hour prediction gives reasonable results. Using [5m] would be dominated by noise.

changes(), resets(), and deriv()#

# Detect process restarts (counter resets on a long-running counter)
resets(process_cpu_seconds_total[1h]) > 0

# Detect flapping -- a gauge changing too frequently
changes(kube_node_status_condition{condition="Ready", status="true"}[1h]) > 5

# Rate of change of a gauge (derivative)
deriv(node_filesystem_avail_bytes{mountpoint="/"}[1h])

resets() counts counter resets, useful for detecting restarts. changes() counts the number of times a gauge value changed, useful for detecting flapping. deriv() computes the per-second derivative of a gauge using simple linear regression, similar to predict_linear but returning the slope itself rather than a predicted value.

Label Reshaping#

label_replace() and label_join()#

When you need to join metrics from different sources that use different label conventions:

# Create a normalized 'service' label from the 'job' label
label_replace(up, "service", "$1", "job", "(.*)-metrics")

# Join multiple labels into one
label_join(kube_pod_info, "pod_id", "/", "namespace", "pod")

A practical use case: your application metrics use service_name while kube-state-metrics uses deployment. You need to join them:

label_replace(
  sum by (service_name) (rate(app_request_errors_total[5m])),
  "deployment", "$1", "service_name", "(.*)"
)
/ on(deployment) group_left()
  kube_deployment_spec_replicas

Subqueries#

Subqueries let you apply range functions to the output of an instant vector expression. The syntax is <expression>[<range>:<resolution>]:

# Maximum 5-minute error rate seen in the last hour, evaluated every minute
max_over_time(
  rate(http_requests_total{code=~"5.."}[5m])[1h:1m]
)

# Standard deviation of request rate over 24 hours
stddev_over_time(
  sum(rate(http_requests_total[5m]))[24h:5m]
)

The performance impact of subqueries is significant. The inner expression is evaluated at every step within the outer range. A subquery with [24h:1m] evaluates the inner expression 1,440 times. If the inner expression is already expensive, this compounds badly. Always use recording rules for subqueries that appear in alerts or dashboards.
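
For the first subquery above, one way to restructure it is to record the inner rate (aggregated to job level here; the same rule appears again in the rule-group example below) and apply the cheap outer function to the recorded series:

groups:
  - name: error_rate_rules
    interval: 30s
    rules:
      - record: job:http_errors:rate5m
        expr: sum by (job) (rate(http_requests_total{code=~"5.."}[5m]))

Dashboards and alerts then query max_over_time(job:http_errors:rate5m[1h]) instead of the subquery.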

Recording Rules Strategy#

Naming Convention#

Follow the level:metric:operations convention:

job:http_requests:rate5m           -- rate aggregated to job level
namespace:container_cpu:sum_rate5m -- sum of rates aggregated to namespace level
cluster:node_cpu:ratio             -- ratio computed at cluster level

The level prefix tells you the aggregation granularity. The operations suffix tells you what computations were applied. This convention makes recording rules self-documenting.

When to Create Recording Rules#

Create a recording rule when any of these are true: the query is used in an alert, the query is used in a dashboard panel that refreshes frequently, the query takes more than 2 seconds to evaluate, the query is a subquery or contains a subquery, or the query will be used as input to another query (layered recording rules).

Organizing Rule Groups#

groups:
  - name: http_request_rules
    interval: 30s
    rules:
      # Layer 1: basic rates
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      - record: job:http_errors:rate5m
        expr: sum by (job) (rate(http_requests_total{code=~"5.."}[5m]))

      # Layer 2: ratios built on layer 1
      - record: job:http_errors:ratio5m
        expr: job:http_errors:rate5m / job:http_requests:rate5m

  - name: sli_rules
    interval: 1m
    rules:
      # Layer 3: SLI built on layer 2
      - record: job:http_availability:ratio30d
        expr: 1 - (sum_over_time(job:http_errors:ratio5m[30d]) / count_over_time(job:http_errors:ratio5m[30d]))

Set the interval per group based on how critical the data is. SLI-related rules can evaluate less frequently than request-rate rules.

Practical Example: Building a Complex SLI#

Build an availability SLI step by step for a multi-service platform:

# Step 1: total request rate per service, excluding health checks
sum by (service) (
  rate(http_requests_total{handler!="/healthz", handler!="/readyz"}[5m])
)

# Step 2: failed request rate (5xx only, not 4xx -- those are client errors)
sum by (service) (
  rate(http_requests_total{handler!="/healthz", handler!="/readyz", code=~"5.."}[5m])
)

# Step 3: availability ratio
1 - (
  sum by (service) (
    rate(http_requests_total{handler!="/healthz", handler!="/readyz", code=~"5.."}[5m])
  )
  /
  sum by (service) (
    rate(http_requests_total{handler!="/healthz", handler!="/readyz"}[5m])
  )
)

# Step 4: as a recording rule for the 30-day SLI window
# Record the 5m error ratio first, then aggregate over 30d
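
Step 4 as a recording-rule sketch, recording the 5m error ratio first and then aggregating it over 30 days (rule and group names follow the convention from earlier and are otherwise illustrative):

groups:
  - name: availability_sli_rules
    rules:
      # 5m error ratio per service (step 2 divided by step 1)
      - record: service:http_errors:ratio5m
        expr: |
          sum by (service) (rate(http_requests_total{handler!="/healthz", handler!="/readyz", code=~"5.."}[5m]))
          /
          sum by (service) (rate(http_requests_total{handler!="/healthz", handler!="/readyz"}[5m]))

      # 30-day availability SLI built on the recorded 5m error ratio
      - record: service:http_availability:ratio30d
        expr: 1 - avg_over_time(service:http_errors:ratio5m[30d])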

The key decisions in this SLI: exclude health check endpoints (they inflate success counts), treat only 5xx as errors (4xx is user error, not service failure), and group by service to get per-service availability. Each of these decisions should be documented alongside your SLO definition, because changing what counts as “successful” changes your reported availability.