Data Source Configuration#
Grafana connects to backend data stores through data sources. For a complete Kubernetes observability stack, you need three: Prometheus for metrics, Loki for logs, and Tempo for traces.
Provision data sources declaratively so they survive Grafana restarts and are version-controlled:
# grafana/provisioning/datasources/observability.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    uid: prometheus   # referenced by Tempo's tracesToMetrics and serviceMap below
    access: proxy
    url: http://prometheus-operated:9090
    isDefault: true
    jsonData:
      timeInterval: "15s"
      exemplarTraceIdDestinations:
        - name: traceID
          datasourceUid: tempo
  - name: Loki
    type: loki
    uid: loki
    access: proxy
    url: http://loki-gateway:3100
    jsonData:
      derivedFields:
        - name: TraceID
          matcherRegex: '"traceID":"(\w+)"'
          url: "$${__value.raw}"
          datasourceUid: tempo
  - name: Tempo
    type: tempo
    uid: tempo        # referenced by the exemplar and derived-field links above
    access: proxy
    url: http://tempo:3100
    jsonData:
      tracesToMetrics:
        datasourceUid: prometheus
        tags: [{key: "service.name", value: "job"}]
      serviceMap:
        datasourceUid: prometheus
      nodeGraph:
        enabled: true
The cross-linking configuration lets you click from a metric data point to the trace that generated it, and extract trace IDs from log lines to link to Tempo. Note the explicit uid fields: the datasourceUid references only resolve if the target data sources have those exact UIDs.
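In Kubernetes you rarely mount this file by hand. If Grafana is deployed through the Grafana Helm chart or kube-prometheus-stack, its provisioning sidecar can pick up data sources from labeled ConfigMaps, mirroring the dashboard sidecar described later. A minimal sketch, assuming the chart's default grafana_datasource label and a monitoring namespace:
apiVersion: v1
kind: ConfigMap
metadata:
  name: observability-datasources
  namespace: monitoring
  labels:
    grafana_datasource: "1"   # default label watched by the chart's datasource sidecar; check your values
data:
  observability.yml: |
    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        uid: prometheus
        access: proxy
        url: http://prometheus-operated:9090
        isDefault: true
Depending on chart configuration, the sidecar can also call Grafana's provisioning reload endpoint so data source changes apply without restarting Grafana.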
Dashboard Design: USE Method#
The USE method tracks Utilization, Saturation, and Errors for each infrastructure resource. Build one dashboard row per resource type, with a panel for each of the three signals.
CPU Row:
# Utilization - time series panel
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
# Saturation - time series panel
node_load1 / count without (cpu, mode) (node_cpu_seconds_total{mode="idle"})
# Errors - stat panel (should normally be 0)
sum by (instance) (rate(node_cpu_guest_seconds_total[5m]))
Memory Row:
# Utilization
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
# Saturation (swap usage indicates memory pressure)
(node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes) / node_memory_SwapTotal_bytes
# Errors (OOM kills)
increase(node_vmstat_oom_kill[1h])
Disk Row:
# Utilization
1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})
# Saturation (IO wait)
rate(node_cpu_seconds_total{mode="iowait"}[5m])
# Errors (disk error counters depend on platform and exporter; this may return no data on Linux)
rate(node_disk_io_errors_total[5m])
Dashboard Design: RED Method#
The RED method tracks Rate, Errors, and Duration for request-driven services. Build one dashboard per service.
# Rate - time series panel
sum by (handler) (rate(http_requests_total[5m]))
# Errors - time series panel (show as percentage)
sum by (handler) (rate(http_requests_total{status_code=~"5.."}[5m]))
/ sum by (handler) (rate(http_requests_total[5m])) * 100
# Duration - time series panel with multiple percentile lines
# p50
histogram_quantile(0.50, sum by (le, handler) (rate(http_request_duration_seconds_bucket[5m])))
# p95
histogram_quantile(0.95, sum by (le, handler) (rate(http_request_duration_seconds_bucket[5m])))
# p99
histogram_quantile(0.99, sum by (le, handler) (rate(http_request_duration_seconds_bucket[5m])))
Place all three queries on the same duration panel with distinct colors. Seeing p50, p95, and p99 together reveals tail latency issues that averages would hide.
Variable Templates#
Dashboard variables make dashboards reusable across namespaces, clusters, and workloads. Define them in the dashboard settings.
Namespace selector – variable type Query, data source Prometheus:
label_values(kube_pod_info, namespace)
Enable multi-value and “Include All” to allow selecting multiple namespaces or all at once.
Pod selector (chained to namespace):
label_values(kube_pod_info{namespace=~"$namespace"}, pod)
Node selector:
label_values(kube_node_info, node)
Use variables in panel queries with $variable syntax:
sum by (pod) (rate(container_cpu_usage_seconds_total{
  namespace=~"$namespace",
  pod=~"$pod",
  container!=""
}[5m]))
The =~ operator with $namespace handles both single selection and the “All” option (which produces a regex like ns1|ns2|ns3).
Panel Types#
Choose the right panel type for the data:
- Time series: Primary panel for anything over time – CPU, memory, request rate, latency.
- Stat: Single-value with thresholds – error count, uptime, active replicas.
- Gauge: Value within a known range – disk/memory/CPU percentage.
- Table: Multi-column data – top pods by CPU, certificate expiration dates.
- Logs: Loki log streams. Pair with metrics panels above to correlate spikes.
- Bar gauge: Horizontal bars for ranked comparisons – top 10 pods by memory.
Dashboard Provisioning#
In Kubernetes, provision dashboards via ConfigMaps. Grafana’s sidecar container watches for ConfigMaps with a specific label and loads their contents as dashboards.
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1" # label the sidecar watches for
data:
  app-dashboard.json: |
    {
      "title": "Application Overview",
      "panels": [ ... ],
      "templating": { ... }
    }
The JSON is the dashboard model itself (no "dashboard" wrapper; that envelope is only used by the HTTP API). With kube-prometheus-stack, the sidecar label is configured via grafana.sidecar.dashboards.label in Helm values (default: grafana_dashboard).
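A sketch of the corresponding kube-prometheus-stack values; the keys shown are assumptions about the bundled Grafana chart, so confirm them against your chart version's values.yaml:
# values.yaml (kube-prometheus-stack)
grafana:
  sidecar:
    dashboards:
      enabled: true
      label: grafana_dashboard   # must match the ConfigMap label above
      labelValue: "1"
      searchNamespace: ALL       # watch ConfigMaps in all namespaces, not only Grafana's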
For file-based provisioning outside Kubernetes:
# grafana/provisioning/dashboards/default.yml
apiVersion: 1
providers:
  - name: default
    orgId: 1
    folder: ""   # leave empty when foldersFromFilesStructure is set; folders mirror the directory layout under path
    type: file
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: true
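For a quick local setup, one way to wire this together is a Compose file that mounts the provisioning directory and the dashboards path used above; the image tag and host paths here are placeholder assumptions:
# docker-compose.yml (local sketch; adjust paths and version to your setup)
services:
  grafana:
    image: grafana/grafana:10.4.2
    ports:
      - "3000:3000"
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning   # datasources/ and dashboards/ provider configs
      - ./grafana/dashboards:/var/lib/grafana/dashboards   # JSON dashboards loaded by the provider above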
Grafana as Code#
Grafonnet is a Jsonnet library for generating dashboards programmatically:
local grafana = import 'github.com/grafana/grafonnet/gen/grafonnet-latest/main.libsonnet';

grafana.dashboard.new('Service Overview')
+ grafana.dashboard.withPanels([
  grafana.panel.timeSeries.new('Request Rate')
  + grafana.panel.timeSeries.queryOptions.withTargets([
    grafana.query.prometheus.new(
      'Prometheus',
      'sum by (handler) (rate(http_requests_total{namespace="$namespace"}[5m]))'
    ),
  ])
  + grafana.panel.timeSeries.standardOptions.withUnit('reqps'),
])
Build with jsonnet -J vendor service-dashboard.jsonnet > service-dashboard.json.
The Grafana Terraform provider manages Grafana resources as infrastructure:
resource "grafana_dashboard" "app" {
config_json = file("dashboards/app.json")
folder = grafana_folder.monitoring.id
}Community Dashboards#
Import proven dashboards by ID rather than building from scratch: Node Exporter Full (1860), Kubernetes Cluster (7249), Kubernetes Pods (6879), CoreDNS (5926), NGINX Ingress (9614). Import via grafana.com/grafana/dashboards/{ID} and customize thresholds to match your environment.
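If Grafana runs from its Helm chart, these imports can also be pinned in values instead of clicked through the UI. A sketch, assuming the chart's dashboardProviders/dashboards mechanism and a provider named community:
# values.yaml (Grafana Helm chart, or under grafana: in kube-prometheus-stack)
dashboardProviders:
  dashboardproviders.yaml:
    apiVersion: 1
    providers:
      - name: community
        folder: Community
        type: file
        options:
          path: /var/lib/grafana/dashboards/community
dashboards:
  community:
    node-exporter-full:
      gnetId: 1860       # dashboard ID from grafana.com
      revision: 37       # example pin; pick a revision listed on the dashboard's page
      datasource: Prometheus
Pinning a revision keeps dashboards reproducible across cluster rebuilds instead of silently tracking the latest upload.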
Grafana Alerting vs Alertmanager#
Grafana 9 and later ships a unified alerting engine that evaluates rules against any data source. Use Grafana alerting for Loki log queries or for conditions that span multiple data sources. Use Prometheus alerting rules with Alertmanager for purely metrics-based alerts, where Alertmanager's gossip-based HA handles deduplication. Running both is common: Prometheus rules route through Alertmanager, while Loki-based alerts go through Grafana alerting. Avoid defining the same alert in both systems.