“I’m Not Seeing Metrics” – Systematic Diagnosis#

This is the most common observability complaint. Work through these steps in order to isolate where the pipeline breaks.

Step 1: Is the Target Being Scraped?#

Open the Prometheus UI at /targets. Search for the job name or target address. Look at three things: state (UP or DOWN), last scrape timestamp, and error message.

Status: UP    Last Scrape: 3s ago    Duration: 12ms    Error: (none)
Status: DOWN  Last Scrape: 15s ago   Duration: 0ms     Error: connection refused
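
The same information is available from the targets API, which helps when the UI is not reachable. This assumes Prometheus has been port-forwarded to localhost:9090 (for example kubectl -n monitoring port-forward svc/prometheus-operated 9090; the service name varies by install):

# List each active target with its health, scrape URL, and last error
curl -s http://localhost:9090/api/v1/targets \
  | jq -r '.data.activeTargets[] | [.labels.job, .health, .scrapeUrl, .lastError] | @tsv'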

If the target does not appear at all, Prometheus does not know about it, which means the scrape configuration (or ServiceMonitor) is not matching the target. Jump to the ServiceMonitor checklist later in this guide.

Step 2: Target Is DOWN#

The target appears in /targets but its state is DOWN. Common causes:

Wrong port or path:

# Verify the application is actually serving metrics
curl -s http://<pod-ip>:8080/metrics | head -20

# If the path is different from /metrics
curl -s http://<pod-ip>:8080/actuator/prometheus | head -20

Port-forward to test from within the cluster:

kubectl port-forward -n myapp pod/myapp-abc123 8080:8080
curl -s http://localhost:8080/metrics | head -5

Network policy blocking scrape traffic:

# Check if NetworkPolicies exist in the target namespace
kubectl get networkpolicies -n myapp

# Prometheus typically runs in the monitoring namespace
# The NetworkPolicy must allow ingress from the monitoring namespace
# NetworkPolicy that allows Prometheus to scrape
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-prometheus-scrape
  namespace: myapp
spec:
  podSelector: {}
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: monitoring
      ports:
        - port: 8080
          protocol: TCP

Application not exposing /metrics:

The application code needs a metrics endpoint. If using a framework, verify the metrics middleware is enabled; for Go applications this usually means registering the Prometheus client library's promhttp handler. Once the endpoint exists, confirm it returns valid exposition format:

# Check if the /metrics path returns valid Prometheus format
curl -s http://localhost:8080/metrics | grep "^# HELP" | head -5
# Should show lines like: # HELP http_requests_total Total number of HTTP requests

Step 3: Target Is UP but Metric Is Missing#

The target is being scraped successfully but the specific metric you need does not appear in Prometheus queries.

# Check the raw /metrics endpoint for the metric name
curl -s http://localhost:8080/metrics | grep "my_metric_name"

If the metric is not in the /metrics output, the application is not emitting it. Check the application code or configuration.

If the metric is in /metrics but not queryable in Prometheus, it may be dropped by metric_relabel_configs:

# Check Prometheus config for relabeling rules that drop metrics
kubectl get secret -n monitoring prometheus-config -o jsonpath='{.data.prometheus\.yml}' | base64 -d | grep -A5 "metric_relabel"
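
If Prometheus is managed by the prometheus-operator, the rendered configuration usually lives in a generated, gzip-compressed secret instead; the secret name prometheus-k8s below is an example (the pattern is prometheus-<name-of-the-Prometheus-CR>) and the key may differ by operator version:

kubectl get secret -n monitoring prometheus-k8s -o jsonpath='{.data.prometheus\.yaml\.gz}' \
  | base64 -d | gunzip | grep -A5 "metric_relabel"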

Step 4: Metric Exists but Query Returns Empty#

The metric is present in Prometheus (you can see it with a simple query like my_metric_name) but your full query returns nothing.

Wrong label matchers:

-- This returns data
http_requests_total

-- This returns nothing because the label value is wrong
http_requests_total{namespace="production"}

-- Check what label values actually exist
group by (namespace) (http_requests_total)
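
The label values API gives the same answer without writing a query, assuming Prometheus is reachable on localhost:9090 (for example via port-forward):

curl -s http://localhost:9090/api/v1/label/namespace/values | jq .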

Staleness:

Prometheus only returns a series from instant queries if it has a sample within the last 5 minutes (the staleness/lookback window), and it writes an explicit staleness marker when a target or series disappears. If a pod was terminated and a new one took over, the old series goes stale and the replacement series carries different labels (a new pod name).

-- Check when the metric was last updated
timestamp(my_metric_name)

-- Compare to current time
time() - timestamp(my_metric_name)
-- If this is > 300, the series is stale
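
To confirm the series existed recently even though instant queries no longer return it, widen the lookback with last_over_time (available in Prometheus 2.26+; my_metric_name is the same placeholder as above):

-- Most recent sample within the last hour, even if the series is currently stale
last_over_time(my_metric_name[1h])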

Time range too narrow:

The Prometheus UI defaults to showing data for the current time. If the metric was only emitted in the past, expand the time range or use a range query.

“Prometheus Is Slow or OOM” – Diagnosis#

Check Cardinality#

High cardinality (too many unique time series) is the primary cause of Prometheus memory issues and slow queries.

-- Total number of active time series
prometheus_tsdb_head_series

-- Find the top 10 metric names by series count
topk(10, count by (__name__) ({__name__=~".+"}))

-- Find which label has high cardinality on a specific metric
count by (pod) (http_requests_total)
-- If this returns thousands of results, "pod" is a high-cardinality label

Common cardinality bombs:

  • A label containing request IDs, user IDs, or URLs with path parameters.
  • Histogram metrics with too many buckets multiplied by high-cardinality labels.
  • Metrics emitted per-connection or per-request instead of aggregated.
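
The TSDB status API summarizes head cardinality by metric name and by label, which is often the fastest way to spot the offender (assumes Prometheus is reachable on localhost:9090):

curl -s http://localhost:9090/api/v1/status/tsdb \
  | jq '.data | {seriesCountByMetricName, labelValueCountByLabelName}'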

Fix: Drop unnecessary labels at scrape time:

# ServiceMonitor syntax shown; plain scrape_configs use metric_relabel_configs with snake_case keys
metricRelabelings:
  - sourceLabels: [__name__]
    regex: "go_.*"
    action: drop
  - regex: "request_id"
    action: labeldrop

Check Slow Scrape Targets#

-- Targets that take more than 2 seconds to scrape
scrape_duration_seconds > 2

-- Targets returning the most samples per scrape
scrape_samples_scraped > 10000

A target returning 50,000 samples per scrape puts significant load on Prometheus ingestion. Consider reducing the metric count at the source or scraping less frequently.
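
Scrape frequency is set per endpoint. In a ServiceMonitor it looks like the sketch below; interval and scrapeTimeout are standard fields, the values are illustrative, and scrapeTimeout must not exceed interval:

endpoints:
  - port: http-metrics
    interval: 60s
    scrapeTimeout: 30s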

Check Expensive Queries#

-- Queries that take more than 10 seconds to evaluate
prometheus_engine_query_duration_seconds{quantile="0.99"} > 10

Grafana dashboards with dozens of panels, each running unoptimized queries with wide time ranges and no recording rules, are the usual culprit. Use recording rules to pre-compute expensive aggregations:

groups:
  - name: recording-rules
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
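
With the prometheus-operator, the same group is typically deployed as a PrometheusRule resource; the name, namespace, and labels below are illustrative, and the labels must match the Prometheus CR's ruleSelector:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: recording-rules
  namespace: monitoring
  labels:
    release: monitoring
spec:
  groups:
    - name: recording-rules
      rules:
        - record: job:http_requests:rate5m
          expr: sum by (job) (rate(http_requests_total[5m]))

Dashboards then query job:http_requests:rate5m instead of recomputing the rate on every refresh.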

Emergency: Prevent Disk Full#

# Limit TSDB retention by size (will delete oldest blocks first)
prometheus --storage.tsdb.retention.size=40GB

# Or by time
prometheus --storage.tsdb.retention.time=15d
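
Before and after tightening retention, check how much space the persisted blocks already use; Prometheus reports this about itself (the WAL adds to this figure):

-- Size of persisted TSDB blocks, in bytes
prometheus_tsdb_storage_blocks_bytes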

“Alertmanager Isn’t Sending Notifications”#

Step 1: Is the Alert Firing in Prometheus?#

Check the Prometheus /alerts page. The alert should show as firing (not pending or inactive).

ALERTS{alertname="HighErrorRate", alertstate="firing"}

If it is pending, the for duration has not been satisfied yet. If it is inactive, the expression is not returning results.
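
An alert can also stay inactive because its rule group is failing to evaluate at all, so check for evaluation errors before debugging the expression itself:

-- Rule groups with recent evaluation failures
rate(prometheus_rule_evaluation_failures_total[15m]) > 0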

Step 2: Did the Alert Reach Alertmanager?#

amtool alert query --alertmanager.url=http://localhost:9093
amtool alert query --alertmanager.url=http://localhost:9093 alertname=HighErrorRate
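
If amtool is not installed, the same check works against the Alertmanager v2 API (assuming a port-forward to port 9093):

curl -s http://localhost:9093/api/v2/alerts | jq -r '.[].labels.alertname'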

If the alert is not in Alertmanager, check the connection between Prometheus and Alertmanager:

-- Errors sending alerts from Prometheus to Alertmanager
prometheus_notifications_errors_total
prometheus_notifications_dropped_total

-- Alertmanager URL configured in Prometheus
prometheus_notifications_alertmanagers_discovered

Step 3: Is Routing Correct?#

# Test which receiver would handle this alert
amtool config routes test --alertmanager.url=http://localhost:9093 \
  severity=critical namespace=production alertname=HighErrorRate

# Show the full routing tree
amtool config routes show --alertmanager.url=http://localhost:9093

Step 4: Is It Silenced?#

amtool silence query --alertmanager.url=http://localhost:9093

Review each active silence. Someone may have created a broad silence during a maintenance window and forgotten to expire it.
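
A stale silence can be removed by ID; the ID appears in the query output above:

amtool silence expire --alertmanager.url=http://localhost:9093 <silence-id>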

Step 5: Is It Inhibited?#

Review the inhibit_rules section in alertmanager.yml. An inhibition rule suppresses the target alert when the source alert is active. If a higher-severity alert is firing for the same labels, the lower-severity one is silently suppressed.
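
A typical rule looks like the sketch below. The source_matchers/target_matchers syntax applies to Alertmanager 0.22 and newer (older configs use source_match/target_match maps), and the label names are illustrative:

inhibit_rules:
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ["alertname", "namespace"]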

Step 6: Is the Receiver Broken?#

Check Alertmanager logs for delivery errors:

kubectl logs -n monitoring alertmanager-main-0 --tail=100 | grep -E "error|fail|notify"

Common failures:

  • msg="Error sending notification" err="unexpected status code 403" – Slack token expired or channel permissions changed.
  • msg="Error sending notification" err="Post ... dial tcp: lookup ... no such host" – DNS resolution failure for the webhook endpoint.
  • msg="Error sending notification" err="context deadline exceeded" – Webhook endpoint is too slow to respond.
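
To exercise a receiver end to end, post a synthetic alert straight to the Alertmanager API; pick labels that route to the receiver you are testing (the values below are examples). Without an endsAt, the alert resolves itself after resolve_timeout:

curl -s -XPOST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels": {"alertname": "ReceiverTest", "severity": "critical", "namespace": "production"}}]'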

Step 7: HA Deduplication Issue#

In a multi-replica Alertmanager cluster, alerts are deduplicated through the gossip protocol. If peers cannot communicate, the same alert may be sent multiple times or not at all.

-- Check cluster member count (should match the replica count)
alertmanager_cluster_members

-- Check for cluster communication failures
alertmanager_cluster_messages_received_total
alertmanager_cluster_reconnections_total
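
The cluster view is also exposed on the status API, which shows the peers each replica currently sees (port-forward to one replica at a time):

curl -s http://localhost:9093/api/v2/status | jq '.cluster'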

“Grafana Dashboard Shows No Data”#

Data Source Connectivity#

Go to Grafana Settings (gear icon) then Data Sources. Click on the data source and hit “Test.” If it fails:

# Test connectivity from the Grafana pod to Prometheus
kubectl exec -n monitoring grafana-abc123 -- \
  wget -qO- http://prometheus-operated:9090/api/v1/query?query=up

Common causes: Service name changed after a Helm upgrade, Prometheus moved to a different namespace, or a NetworkPolicy was added that blocks Grafana-to-Prometheus traffic.
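
To find the Service name the data source URL should point at (names differ between kube-prometheus-stack, a plain operator install, and other setups):

kubectl get svc -n monitoring | grep -i prometheus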

Query Syntax#

Copy the panel’s query and run it directly in the Prometheus UI at /graph. If it returns data there but not in Grafana, the issue is on the Grafana side (typically variable interpolation or time range).

Time Range#

The dashboard’s time range picker may be set to a window where no data exists. Change it to “Last 1 hour” or “Last 6 hours” and check again. Also verify the dashboard is not using an absolute time range that was saved from a past investigation.

Template Variable Issues#

If panels use variables like $namespace and the variable’s query returns no values, all panels depending on that variable show “No data.”

-- Variable query that might break after label changes
label_values(kube_pod_info{cluster="$cluster"}, namespace)

If the cluster label was renamed to kubernetes_cluster, this variable query returns nothing, and every panel filtering by $namespace breaks silently. Check the variable definitions under dashboard settings.

Grafana-Prometheus Version Incompatibility#

Grafana’s query editor can generate PromQL syntax that older Prometheus versions do not support. Negative offsets, subqueries, and certain functions were added in specific Prometheus versions. If you see “parse error” in Prometheus but the query looks correct, check the Prometheus version.

“Logs Aren’t Appearing in Loki”#

Is the Log Collector Running?#

# Check Promtail pods
kubectl get pods -n monitoring -l app.kubernetes.io/name=promtail
kubectl logs -n monitoring promtail-abc123 --tail=50

# Check Fluent Bit pods (if using Fluent Bit instead)
kubectl get pods -n monitoring -l app.kubernetes.io/name=fluent-bit

If the collector pod is crash-looping, check logs for configuration errors (wrong Loki endpoint, invalid pipeline stages, permission denied reading log files).
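
kubectl describe shows why the last container instance died, which is usually faster than scrolling logs for a crash-looping collector:

kubectl describe pod -n monitoring promtail-abc123 | grep -A10 "Last State"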

Are Log Labels Correct?#

Loki indexes logs by labels, not by content. If the label set changed (e.g., a pipeline stage that extracted app from the log line was modified), existing LogQL queries that filter on that label return nothing.

-- Check whether any streams match this selector (logfmt parses line content, not labels)
{namespace="myapp"} | logfmt
-- If this returns no results, try broader label selectors:
{job="myapp/myapp"}
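
When you are not sure which labels exist at all, ask Loki directly (assuming a port-forward to Loki's HTTP port, 3100 by default):

curl -s http://localhost:3100/loki/api/v1/labels | jq .
curl -s http://localhost:3100/loki/api/v1/label/namespace/values | jq .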

Rate Limiting#

Loki enforces ingestion rate limits. If the log volume exceeds the configured limit, Loki drops logs silently (or returns 429 errors to the collector).

-- Check ingestion rates
loki_distributor_lines_received_total
loki_distributor_bytes_received_total

-- Check for rejected streams
loki_discarded_samples_total
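
loki_discarded_samples_total carries a reason label that distinguishes rate limiting from other causes (such as lines that are too long or streams over the per-stream limit):

-- Discarded log lines, broken down by reason
sum by (reason) (rate(loki_discarded_samples_total[5m]))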

Increase limits in the Loki config if needed:

limits_config:
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20
  per_stream_rate_limit: 5MB

Retention#

Logs may have expired. Check the retention configuration; older Loki versions use the table manager shown below, while newer versions handle retention through the compactor and limits_config.retention_period:

# Loki config
table_manager:
  retention_deletes_enabled: true
  retention_period: 720h   # 30 days

If you are querying for logs older than the retention period, they no longer exist.

ServiceMonitor Not Working – Checklist#

When you deploy a ServiceMonitor and the target does not appear in Prometheus /targets, work through this checklist.

1. Labels Match serviceMonitorSelector#

The Prometheus CR specifies which ServiceMonitors it picks up via label selectors:

# Find the Prometheus CR's serviceMonitorSelector
kubectl get prometheus -n monitoring -o yaml | grep -A5 serviceMonitorSelector
# Typical output
serviceMonitorSelector:
  matchLabels:
    release: monitoring

Your ServiceMonitor must have this label:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp
  namespace: myapp
  labels:
    release: monitoring   # MUST match the selector above
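
Confirm the label actually made it onto the deployed object; Helm values or overlays sometimes strip it:

kubectl get servicemonitor myapp -n myapp --show-labels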

2. Namespace Is Included#

The Prometheus CR may restrict which namespaces it watches:

kubectl get prometheus -n monitoring -o yaml | grep -A5 serviceMonitorNamespaceSelector

If serviceMonitorNamespaceSelector is empty ({}), all namespaces are watched. If it has a selector, the namespace must have matching labels:

kubectl label namespace myapp monitoring=enabled

3. Port Name Matches#

The ServiceMonitor endpoints[].port must match the name of the port on the Service, not the number:

# Service
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  ports:
    - name: http-metrics    # <-- this name
      port: 8080
      targetPort: 8080

# ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
spec:
  endpoints:
    - port: http-metrics    # <-- must match the Service port name
      interval: 15s

4. Service Selector Matches Pod Labels#

The Service must actually select the target pods:

# Check the Service's selector
kubectl get svc myapp -n myapp -o yaml | grep -A5 selector

# Check the pod's labels
kubectl get pods -n myapp --show-labels
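
If the selector does not match, the Service has no endpoints and there is nothing for Prometheus to scrape; an empty ENDPOINTS column confirms the mismatch:

kubectl get endpoints myapp -n myapp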

5. Verify the ServiceMonitor Exists#

kubectl get servicemonitor -n myapp
kubectl describe servicemonitor myapp -n myapp

If the ServiceMonitor does not exist, your Helm chart or Kustomize overlay may not be deploying it. Check that the CRD is installed: kubectl get crd servicemonitors.monitoring.coreos.com.

Decision Tree: “Something Is Wrong with Monitoring”#

Follow this text-based flowchart when someone reports a monitoring issue.

START: What is the symptom?
|
+-- "Metric is missing from queries"
|     |
|     +-- Is the target in Prometheus /targets?
|           |
|           +-- NO --> Is there a ServiceMonitor? --> Check ServiceMonitor checklist above
|           +-- YES, state UP --> curl /metrics on the target directly
|           |     |
|           |     +-- Metric present in /metrics? --> Check relabel_configs dropping it
|           |     +-- Metric absent from /metrics? --> Application not emitting it, fix the app
|           +-- YES, state DOWN --> Check port, path, network policy
|
+-- "Alert is not firing"
|     |
|     +-- Does the expression return results in /graph? --> No: fix the expression
|     +-- Is alert in pending state? --> for duration not met yet
|     +-- Is alert firing in Prometheus /alerts? --> Check Alertmanager routing, silences, receiver
|
+-- "Grafana panel shows no data"
|     |
|     +-- Does data source test pass? --> No: fix connectivity
|     +-- Does query return data in Prometheus UI? --> No: metric issue, go to "metric is missing"
|     +-- Yes in Prometheus, no in Grafana --> Check variables, time range, query compatibility
|
+-- "Prometheus is slow or crashing"
|     |
|     +-- Check prometheus_tsdb_head_series --> High cardinality? Drop labels or metrics
|     +-- Check scrape_duration_seconds --> Slow targets? Reduce scrape frequency
|     +-- Check query duration --> Expensive dashboards? Add recording rules
|     +-- OOM? --> Increase memory limits, set retention.size
|
+-- "Logs not in Loki"
|     |
|     +-- Is Promtail/FluentBit running? --> No: fix the collector
|     +-- Are labels correct in LogQL? --> Try broader label query
|     +-- Rate limited? --> Check loki_discarded_samples_total, increase limits
|     +-- Expired? --> Check retention_period
|
END: Escalate to observability team lead with findings from above

This decision tree covers the five most common monitoring failure categories. In each case, the goal is to isolate the broken component before attempting a fix. Start at the data source (the application emitting metrics or logs), work through the pipeline (scraping, ingestion, storage), and end at the presentation layer (Grafana, Alertmanager notifications).