Instant Vectors vs Range Vectors

An instant vector returns one sample per time series at a single point in time. A range vector returns multiple samples per time series over a time window.

# Instant vector: current value of each series
http_requests_total{job="api"}

# Range vector: last 5 minutes of samples for each series
http_requests_total{job="api"}[5m]

You cannot graph a range vector directly. Functions like rate() and increase() consume a range vector and return an instant vector, which Grafana can then plot.
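
For example, increase() consumes the same range vector and returns a plottable instant vector (the 1h window is illustrative):

# Total requests over the last hour, per series
increase(http_requests_total{job="api"}[1h])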

rate() vs irate()

Both compute per-second rates from counter metrics, but they behave differently.

rate() calculates the average per-second increase over the entire range window. It smooths out spikes and is the right choice for alerting and dashboards where you want stable trends:

# Average requests per second over last 5 minutes
rate(http_requests_total[5m])

irate() uses only the last two data points in the range window. It reacts to spikes immediately but is noisy:

# Instantaneous rate based on last two samples
irate(http_requests_total[5m])

The range window in irate() is only a lookback to find two samples. The actual rate is computed over the gap between those two scrapes, regardless of the window size.

Rule of thumb: use rate() for alerts and recording rules, irate() only for high-resolution interactive dashboards.
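
For Grafana panels specifically, the built-in $__rate_interval variable is usually safer than a hard-coded window, since it adapts to the panel's time range and the configured scrape interval (a sketch; requires a recent Grafana):

# Let Grafana choose a window wide enough to span several scrapes
rate(http_requests_total[$__rate_interval])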

histogram_quantile() for Latency Percentiles

Histogram metrics store observation counts in cumulative buckets. To get percentiles, use histogram_quantile():

# p99 latency across all instances
histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)

# p95 latency broken down by endpoint
histogram_quantile(0.95,
  sum by (le, handler) (rate(http_request_duration_seconds_bucket[5m]))
)

# p50 (median) latency
histogram_quantile(0.50,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)

The le label (less-than-or-equal) carries the bucket boundaries and is required in the by clause. Always apply rate() before histogram_quantile() to compute per-second bucket fill rates; otherwise you operate on cumulative counts that reflect the distribution over the metric's entire lifetime, not recent latency.
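
For intuition, the cumulative bucket counts for one series might look like this (values are hypothetical):

http_request_duration_seconds_bucket{le="0.1"}   940
http_request_duration_seconds_bucket{le="0.25"}  990
http_request_duration_seconds_bucket{le="+Inf"}  1000

histogram_quantile() locates the bucket containing the target rank and linearly interpolates within it, so the result is an estimate whose accuracy depends on bucket width.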

Bucket boundaries are set in your application instrumentation. Default Go client buckets are [.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10]. Choose buckets that span your realistic latency range for accurate percentiles.
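
To check which boundaries a histogram actually exposes, you can group its bucket series by le (a quick sketch; substitute your own metric and selector):

# One output series per configured bucket boundary
count by (le) (http_request_duration_seconds_bucket{job="api"})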

Aggregation Operators

Aggregation operators reduce dimensions by collapsing label sets.

# Sum across all instances, keep only the job label
sum by (job) (rate(http_requests_total[5m]))

# Equivalent using 'without': drop the instance label, keep everything else
sum without (instance) (rate(http_requests_total[5m]))

# Average CPU utilization per node (fraction busy)
avg by (instance) (1 - rate(node_cpu_seconds_total{mode="idle"}[5m]))

# Count the number of pods per namespace
count by (namespace) (kube_pod_info)

# Maximum memory working set per pod (cAdvisor metrics carry pod labels, not deployment)
max by (namespace, pod) (container_memory_working_set_bytes{container!=""})

# Minimum available disk space across all nodes
min by (instance) (node_filesystem_avail_bytes{mountpoint="/"})

# Top 5 pods by CPU usage
topk(5, sum by (pod) (rate(container_cpu_usage_seconds_total{container!=""}[5m])))
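
One ordering pitfall: take rate() per series before aggregating, because rate() must observe each counter's resets individually. A sketch of the wrong and right forms:

# Wrong: summing first hides per-series counter resets from rate()
rate(sum(http_requests_total)[5m:])

# Right: rate per series first, then aggregate
sum(rate(http_requests_total[5m]))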

Offset Modifier and Subqueries

The offset modifier shifts a query back in time. Useful for comparing current values to historical baselines:

# Request rate now vs 1 hour ago
rate(http_requests_total[5m]) / rate(http_requests_total[5m] offset 1h)

# Request rate now vs same time yesterday
rate(http_requests_total[5m]) - rate(http_requests_total[5m] offset 1d)
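
The same pattern expresses relative change, which can be more robust for alerting than a raw difference (the 1h baseline is illustrative):

# Percentage change in request rate vs 1 hour ago
(rate(http_requests_total[5m]) - rate(http_requests_total[5m] offset 1h))
  / rate(http_requests_total[5m] offset 1h) * 100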

Subqueries evaluate a query over a range at a specified resolution:

# Maximum 5-minute error rate over the last hour, evaluated every minute
max_over_time(
  rate(http_requests_total{status_code=~"5.."}[5m])[1h:1m]
)

# Average of p99 latency over the last 24 hours
avg_over_time(
  histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))[24h:5m]
)
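
If the resolution after the colon is omitted, Prometheus uses the global evaluation interval:

# Same query at the default resolution
max_over_time(rate(http_requests_total{status_code=~"5.."}[5m])[1h:])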

Label Matching

PromQL supports four label matchers: = (exact match), != (not equal), =~ (regex match), !~ (negative regex match):

# All 5xx status codes
http_requests_total{status_code=~"5.."}

# All non-GET methods
http_requests_total{method!="GET"}

# Multiple namespaces
kube_pod_info{namespace=~"production|staging"}

# Exclude system containers
container_memory_working_set_bytes{container!="", container!="POD"}
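
Note that PromQL regexes are fully anchored: =~"api" matches only the exact string "api", not "api-gateway". Add an explicit wildcard for prefix matches:

# Anchored: matches job="api" only
http_requests_total{job=~"api"}

# Prefix match: api, api-gateway, api-internal, ...
http_requests_total{job=~"api.*"}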

10 Common Monitoring Queries

# 1. Error rate as a percentage
sum(rate(http_requests_total{status_code=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m])) * 100

# 2. Request throughput by endpoint
sum by (handler) (rate(http_requests_total[5m]))

# 3. p95 latency per service
histogram_quantile(0.95,
  sum by (le, job) (rate(http_request_duration_seconds_bucket[5m])))

# 4. CPU saturation (load average vs cores)
node_load1 / count without (cpu, mode) (node_cpu_seconds_total{mode="idle"})

# 5. Memory usage percentage per node
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# 6. Disk IOPS
sum by (instance) (rate(node_disk_reads_completed_total[5m])
  + rate(node_disk_writes_completed_total[5m]))

# 7. Container restart count in last hour
increase(kube_pod_container_status_restarts_total[1h])

# 8. Network receive throughput per pod (bits per second)
sum by (pod) (rate(container_network_receive_bytes_total[5m])) * 8

# 9. PVC usage percentage
kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes * 100

# 10. Deployment replica availability
kube_deployment_status_replicas_available / kube_deployment_spec_replicas

Recording Rules

Expensive queries that run frequently (dashboard panels refreshing every 15s, alerts evaluating every minute) should be converted to recording rules. Prometheus pre-computes the result and stores it as a new time series.

groups:
  - name: http_recording_rules
    interval: 30s
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      - record: job:http_errors:ratio5m
        expr: |
          sum by (job) (rate(http_requests_total{status_code=~"5.."}[5m]))
          / sum by (job) (rate(http_requests_total[5m]))

      - record: job:http_latency:p99_5m
        expr: |
          histogram_quantile(0.99,
            sum by (le, job) (rate(http_request_duration_seconds_bucket[5m])))

  - name: node_recording_rules
    rules:
      - record: instance:node_cpu:utilization5m
        expr: |
          1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

Naming convention: level:metric_name:operations. The level indicates the aggregation level (job, instance, namespace). The operations suffix describes what was applied (rate5m, ratio5m, p99_5m).

Use recording rules in your alerts and dashboards by referencing the recorded metric name directly: job:http_errors:ratio5m > 0.05 instead of the full expression. This reduces query load on Prometheus and makes alert rules more readable.
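
A minimal alerting rule built on the recorded series might look like this (alert name, threshold, and labels are illustrative):

groups:
  - name: http_alerts
    rules:
      - alert: HighErrorRatio
        expr: job:http_errors:ratio5m > 0.05
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Error ratio above 5% for {{ $labels.job }}"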