Data Source Configuration#
Grafana connects to backend data stores through data sources. For a complete Kubernetes observability stack, you need three: Prometheus for metrics, Loki for logs, and Tempo for traces.
Provision data sources declaratively so they survive Grafana restarts and are version-controlled:
# grafana/provisioning/datasources/observability.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    uid: prometheus   # referenced by Tempo's tracesToMetrics and serviceMap below
    access: proxy
    url: http://prometheus-operated:9090
    isDefault: true
    jsonData:
      timeInterval: "15s"
      exemplarTraceIdDestinations:
        - name: traceID
          datasourceUid: tempo
  - name: Loki
    type: loki
    uid: loki
    access: proxy
    url: http://loki-gateway:3100
    jsonData:
      derivedFields:
        - name: TraceID
          matcherRegex: '"traceID":"(\w+)"'
          url: "$${__value.raw}"
          datasourceUid: tempo
  - name: Tempo
    type: tempo
    uid: tempo        # referenced by the exemplar and derived-field links above
    access: proxy
    url: http://tempo:3100
    jsonData:
      tracesToMetrics:
        datasourceUid: prometheus
        tags: [{key: "service.name", value: "job"}]
      serviceMap:
        datasourceUid: prometheus
      nodeGraph:
        enabled: true
The cross-linking configuration lets you click from a metric data point to the trace that generated it, and extract trace IDs from log lines to link to Tempo. Note the explicit uid fields: the datasourceUid references only resolve if the target data sources have those exact UIDs.
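In Kubernetes you rarely mount this file by hand. If Grafana is deployed through the Grafana Helm chart or kube-prometheus-stack, its provisioning sidecar can pick up data sources from labeled ConfigMaps, mirroring the dashboard sidecar described later. A minimal sketch, assuming the chart's default grafana_datasource label and a monitoring namespace:
apiVersion: v1
kind: ConfigMap
metadata:
  name: observability-datasources
  namespace: monitoring
  labels:
    grafana_datasource: "1"   # default label watched by the chart's datasource sidecar; check your values
data:
  observability.yml: |
    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        uid: prometheus
        access: proxy
        url: http://prometheus-operated:9090
        isDefault: true
Depending on chart configuration, the sidecar can also call Grafana's provisioning reload endpoint so data source changes apply without restarting Grafana.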
Dashboard Design: USE Method#
The USE method tracks Utilization, Saturation, and Errors for each infrastructure resource. Build one dashboard row per resource type, with a panel for each of the three signals.
CPU Row:
# Utilization - time series panel
1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
# Saturation - time series panel
node_load1 / count without (cpu, mode) (node_cpu_seconds_total{mode="idle"})
# Errors - stat panel (should normally be 0)
sum by (instance) (rate(node_cpu_guest_seconds_total[5m]))
Memory Row:
# Utilization
1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)
# Saturation (swap usage indicates memory pressure)
(node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes) / node_memory_SwapTotal_bytes
# Errors (OOM kills)
increase(node_vmstat_oom_kill[1h])
Disk Row:
# Utilization
1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})
# Saturation (IO wait)
rate(node_cpu_seconds_total{mode="iowait"}[5m])
# Errors (disk error counters depend on platform and exporter; this may return no data on Linux)
rate(node_disk_io_errors_total[5m])
Dashboard Design: RED Method#
The RED method tracks Rate, Errors, and Duration for request-driven services. Build one dashboard per service.
# Rate - time series panel
sum by (handler) (rate(http_requests_total[5m]))
# Errors - time series panel (show as percentage)
sum by (handler) (rate(http_requests_total{status_code=~"5.."}[5m]))
/ sum by (handler) (rate(http_requests_total[5m])) * 100
# Duration - time series panel with multiple percentile lines
# p50
histogram_quantile(0.50, sum by (le, handler) (rate(http_request_duration_seconds_bucket[5m])))
# p95
histogram_quantile(0.95, sum by (le, handler) (rate(http_request_duration_seconds_bucket[5m])))
# p99
histogram_quantile(0.99, sum by (le, handler) (rate(http_request_duration_seconds_bucket[5m])))
Place all three queries on the same duration panel with distinct colors. Seeing p50, p95, and p99 together reveals tail latency issues that averages would hide.
Variable Templates#
Dashboard variables make dashboards reusable across namespaces, clusters, and workloads. Define them in the dashboard settings.
Namespace selector – variable type Query, data source Prometheus:
label_values(kube_pod_info, namespace)
Enable multi-value and “Include All” to allow selecting multiple namespaces or all at once.
Pod selector (chained to namespace):
label_values(kube_pod_info{namespace=~"$namespace"}, pod)
Node selector:
label_values(kube_node_info, node)
Use variables in panel queries with $variable syntax:
sum by (pod) (rate(container_cpu_usage_seconds_total{
  namespace=~"$namespace",
  pod=~"$pod",
  container!=""
}[5m]))
The =~ operator with $namespace handles both single selection and the “All” option (which produces a regex like ns1|ns2|ns3).
Panel Types#
Choose the right panel type for the data:
- Time series: Primary panel for anything over time – CPU, memory, request rate, latency.
- Stat: Single-value with thresholds – error count, uptime, active replicas.
- Gauge: Value within a known range – disk/memory/CPU percentage.
- Table: Multi-column data – top pods by CPU, certificate expiration dates.
- Logs: Loki log streams. Pair with metrics panels above to correlate spikes.
- Bar gauge: Horizontal bars for ranked comparisons – top 10 pods by memory.
Dashboard Provisioning#
In Kubernetes, provision dashboards via ConfigMaps. Grafana’s sidecar container watches for ConfigMaps with a specific label and loads their contents as dashboards.
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1" # label the sidecar watches for
data:
  app-dashboard.json: |
    {
      "title": "Application Overview",
      "panels": [ ... ],
      "templating": { ... }
    }
The JSON is the dashboard model itself (no "dashboard" wrapper; that envelope is only used by the HTTP API). With kube-prometheus-stack, the sidecar label is configured via grafana.sidecar.dashboards.label in Helm values (default: grafana_dashboard).
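A sketch of the corresponding kube-prometheus-stack values; the keys shown are assumptions about the bundled Grafana chart, so confirm them against your chart version's values.yaml:
# values.yaml (kube-prometheus-stack)
grafana:
  sidecar:
    dashboards:
      enabled: true
      label: grafana_dashboard   # must match the ConfigMap label above
      labelValue: "1"
      searchNamespace: ALL       # watch ConfigMaps in all namespaces, not only Grafana's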
For file-based provisioning outside Kubernetes:
# grafana/provisioning/dashboards/default.yml
apiVersion: 1
providers:
  - name: default
    orgId: 1
    folder: ""   # leave empty when foldersFromFilesStructure is set; folders mirror the directory layout under path
    type: file
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: true
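For a quick local setup, one way to wire this together is a Compose file that mounts the provisioning directory and the dashboards path used above; the image tag and host paths here are placeholder assumptions:
# docker-compose.yml (local sketch; adjust paths and version to your setup)
services:
  grafana:
    image: grafana/grafana:10.4.2
    ports:
      - "3000:3000"
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning   # datasources/ and dashboards/ provider configs
      - ./grafana/dashboards:/var/lib/grafana/dashboards   # JSON dashboards loaded by the provider above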
Grafana as Code#
Grafonnet is a Jsonnet library for generating dashboards programmatically:
local grafana = import 'github.com/grafana/grafonnet/gen/grafonnet-latest/main.libsonnet';

grafana.dashboard.new('Service Overview')
+ grafana.dashboard.withPanels([
  grafana.panel.timeSeries.new('Request Rate')
  + grafana.panel.timeSeries.queryOptions.withTargets([
    grafana.query.prometheus.new(
      'Prometheus',
      'sum by (handler) (rate(http_requests_total{namespace="$namespace"}[5m]))'
    ),
  ])
  + grafana.panel.timeSeries.standardOptions.withUnit('reqps'),
])
Build with jsonnet -J vendor service-dashboard.jsonnet > service-dashboard.json.
The Grafana Terraform provider manages Grafana resources as infrastructure:
resource "grafana_dashboard" "app" {
config_json = file("dashboards/app.json")
folder = grafana_folder.monitoring.id
}Community Dashboards#
Import proven dashboards by ID rather than building from scratch: Node Exporter Full (1860), Kubernetes Cluster (7249), Kubernetes Pods (6879), CoreDNS (5926), NGINX Ingress (9614). Import via grafana.com/grafana/dashboards/{ID} and customize thresholds to match your environment.
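If Grafana runs from its Helm chart, these imports can also be pinned in values instead of clicked through the UI. A sketch, assuming the chart's dashboardProviders/dashboards mechanism and a provider named community:
# values.yaml (Grafana Helm chart, or under grafana: in kube-prometheus-stack)
dashboardProviders:
  dashboardproviders.yaml:
    apiVersion: 1
    providers:
      - name: community
        folder: Community
        type: file
        options:
          path: /var/lib/grafana/dashboards/community
dashboards:
  community:
    node-exporter-full:
      gnetId: 1860       # dashboard ID from grafana.com
      revision: 37       # example pin; pick a revision listed on the dashboard's page
      datasource: Prometheus
Pinning a revision keeps dashboards reproducible across cluster rebuilds instead of silently tracking the latest upload.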
Grafana Alerting vs Alertmanager#
Grafana 9 and later ships a unified alerting engine that evaluates rules against any data source. Use Grafana alerting for Loki log queries or for conditions that span multiple data sources. Use Prometheus alerting rules with Alertmanager for purely metrics-based alerts, where Alertmanager's gossip-based HA handles deduplication. Running both is common: Prometheus rules route through Alertmanager, while Loki-based alerts go through Grafana alerting. Avoid defining the same alert in both systems.