What Synthetic Monitoring Is#
Synthetic monitoring means actively probing your services on a schedule rather than waiting for users to report problems. Instead of relying on internal health checks or real user traffic to detect issues, you send controlled requests and measure the results. The fundamental question it answers is: “Is my service reachable and responding correctly right now?”
This is distinct from real user monitoring (RUM), which observes actual user interactions. Synthetic probes run 24/7 regardless of traffic volume, so they catch outages at 3 AM when no users are active. They provide consistent, repeatable measurements that are easy to alert on. The tradeoff is that synthetic probes test a narrow, predefined path – they do not capture the full range of user experience.
Use both. Synthetic monitoring catches outages fast. RUM catches performance degradation that users actually experience. They are complementary, not competing, approaches.
Blackbox Exporter#
The Blackbox Exporter is the standard Prometheus tool for synthetic monitoring. It probes endpoints over HTTP, TCP, DNS, and ICMP, and exposes the results as Prometheus metrics. Prometheus scrapes the Blackbox Exporter, passing the target URL as a parameter, and gets back probe results as time series.
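You can exercise a probe by hand before wiring it into Prometheus. The exporter listens on port 9115 by default and takes the target and module as query parameters; assuming an exporter reachable at blackbox-exporter:9115 and the http_2xx module shown later, the probe URL looks like this:
http://blackbox-exporter:9115/probe?target=https://example.com&module=http_2xx
The response is plain-text Prometheus metrics for that single probe, with probe_success among them.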
Probe Types#
HTTP probe: sends an HTTP request and checks the response. Measures response code, response time, SSL certificate expiry, and can validate response body content with regex matching. This is the most commonly used probe type.
TCP probe: connects to a port and optionally performs a TLS handshake. Useful for checking that a database port is reachable or a non-HTTP service is accepting connections. Does not send application-level data.
DNS probe: queries a DNS server for a specific record and validates the response. Measures resolution time and can check for expected record values. Use this to detect DNS propagation issues or misconfigurations.
ICMP probe: sends ping packets and measures round-trip time. The simplest reachability check. Requires the Blackbox Exporter to run with CAP_NET_RAW capability or as root.
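If the exporter runs in Kubernetes and you want ICMP probes, the capability can be granted in the pod spec instead of running the container as root. A minimal sketch; the container name and image tag are illustrative:
# Sketch: granting CAP_NET_RAW so ICMP probes work without root.
# Container name and image tag are illustrative.
containers:
  - name: blackbox-exporter
    image: prom/blackbox-exporter:v0.25.0
    securityContext:
      capabilities:
        add: ["NET_RAW"]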
Configuration#
Probes are defined as modules in blackbox.yml:
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: [200]
      method: GET
      follow_redirects: true
      preferred_ip_protocol: ip4
  http_post_2xx:
    prober: http
    http:
      method: POST
      headers:
        Content-Type: application/json
      body: '{"test": true}'
      valid_status_codes: [200, 201]
  http_with_body_match:
    prober: http
    http:
      method: GET
      fail_if_body_not_matches_regexp:
        - '"status":\s*"healthy"'
  tcp_connect:
    prober: tcp
    timeout: 5s
  dns_resolution:
    prober: dns
    dns:
      query_name: "example.com"
      query_type: "A"
      valid_rcodes: ["NOERROR"]
  icmp_ping:
    prober: icmp
    timeout: 5s
    icmp:
      preferred_ip_protocol: ip4

The Prometheus scrape config uses relabel_configs to pass the target URL to the Blackbox Exporter:
scrape_configs:
  - job_name: "blackbox-http"
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://example.com
          - https://api.example.com/health
          - https://dashboard.example.com
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115

This relabeling pattern is essential. Without it, Prometheus would try to scrape the target URLs directly instead of routing through the Blackbox Exporter. The instance label gets set to the actual target URL for identification in dashboards and alerts.
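The same pattern extends to other modules: each gets its own scrape job pointing at its own set of targets. A sketch for the dns_resolution module defined earlier, appended under the same scrape_configs list (the resolver IPs are placeholders; for DNS probes the target is the resolver to query, while query_name in blackbox.yml is what gets resolved):
  # Sketch: probing DNS resolution via the dns_resolution module defined above.
  # Resolver IPs here are examples.
  - job_name: "blackbox-dns"
    metrics_path: /probe
    params:
      module: [dns_resolution]
    static_configs:
      - targets:
          - 1.1.1.1
          - 8.8.8.8
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115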
Key Metrics#
The Blackbox Exporter exposes several metrics per probe:
- probe_success: 1 if the probe succeeded (correct status code, body match, etc.), 0 if it failed. This is the primary metric for uptime alerting.
- probe_duration_seconds: total time the probe took from start to finish, including DNS resolution, TCP connection, TLS handshake, and data transfer.
- probe_http_status_code: the HTTP status code returned. Useful for distinguishing between different failure modes (404 vs 500 vs timeout).
- probe_ssl_earliest_cert_expiry: Unix timestamp of when the earliest certificate in the chain expires. Critical for preventing certificate-related outages.
- probe_dns_lookup_time_seconds: time spent on DNS resolution, broken out from the total probe duration.
- probe_http_content_length: response body size in bytes. A sudden drop might indicate a broken page returning an error body.
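For orientation, the relevant part of a successful HTTPS probe's /probe output looks roughly like this (abridged; the values are illustrative):
# Abridged /probe output for a healthy HTTPS target; values are illustrative
probe_success 1
probe_http_status_code 200
probe_duration_seconds 0.184
probe_dns_lookup_time_seconds 0.004
probe_ssl_earliest_cert_expiry 1.7672e+09
probe_http_content_length 1270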
Alerting on Synthetic Probes#
Three alerts cover the most critical synthetic monitoring scenarios:
groups:
  - name: synthetic-monitoring
    rules:
      - alert: EndpointDown
        expr: probe_success == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Endpoint {{ $labels.instance }} is down"
          description: "Probe has been failing for more than 2 minutes."
      - alert: EndpointSlow
        expr: probe_duration_seconds > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Endpoint {{ $labels.instance }} is slow"
          description: "Probe taking {{ $value }}s (threshold: 5s)."
      - alert: SSLCertExpiringSoon
        expr: (probe_ssl_earliest_cert_expiry - time()) / 86400 < 30
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "SSL cert for {{ $labels.instance }} expires in {{ $value | humanize }} days"
          description: "Certificate expires in less than 30 days."

The for duration on EndpointDown prevents alerting on brief network blips. Two minutes of sustained failure is a reasonable threshold. The SSL certificate alert uses a 1-hour for duration because certificate expiry does not fluctuate; the 1h window just prevents noise from a single failed probe. Note that $value in that alert is a day count (the expression divides by 86400), so it is formatted with humanize rather than humanizeDuration, which would interpret the number as seconds.
External Probe Locations#
Probing from inside your infrastructure catches application-level failures but misses problems visible only from outside: DNS resolution failures, ingress controller misconfigurations, CDN issues, ISP routing problems, or firewall rules blocking external traffic.
Deploy Blackbox Exporter instances outside your primary infrastructure. Options include running probes from a different cloud provider, using cloud functions (AWS Lambda, Google Cloud Functions) that run probe checks on a schedule, or deploying lightweight VMs in different regions.
Multi-location probing reveals the scope of failures. If one probe location fails but others succeed, the problem is likely network-level or regional. If all locations fail simultaneously, the service itself is down. This distinction changes your incident response approach significantly.
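This distinction can be encoded in PromQL when probes from all locations land in one Prometheus and each location's series are distinguishable, for example via a probe_location label your setup would need to add through relabeling or external_labels (the label name is an assumption). A sketch:
# All locations agree the target is down: the service itself is likely the problem
max by (instance) (probe_success{job="blackbox-http"}) == 0

# Some locations fail while others succeed: likely a network or regional issue
  min by (instance) (probe_success{job="blackbox-http"}) == 0
and
  max by (instance) (probe_success{job="blackbox-http"}) == 1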
Grafana Synthetic Monitoring (a cloud service) provides probes from multiple global locations with built-in dashboard integration. It handles the infrastructure of running distributed probe endpoints, which is useful if you do not want to manage external probe deployments yourself.
What to Monitor Synthetically#
Prioritize endpoints that represent critical user paths and external dependencies:
- Login page or authentication endpoint: the front door to your application.
- API health endpoint: the most basic service availability check.
- Critical user flows: if your application has a checkout flow, probe the checkout page.
- DNS resolution: probe your domain names from external DNS resolvers to catch propagation issues.
- SSL certificates: monitor expiry for all externally-facing certificates, including those on CDNs and load balancers you might forget about.
- Third-party dependencies: probe APIs you depend on (payment processors, identity providers, CDNs) so you know immediately when a dependency goes down.
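One way to keep these priorities visible in the monitoring itself is to label target groups inside the blackbox-http job so alert severity and routing can differ per group; a sketch using hypothetical URLs and an arbitrary tier label:
    # Sketch: grouping probe targets so alerts can route differently per tier.
    # URLs and tier values are illustrative.
    static_configs:
      - targets:
          - https://example.com/login
          - https://api.example.com/health
        labels:
          tier: critical
      - targets:
          - https://payments.example-dependency.com/status
        labels:
          tier: third-party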
Grafana Dashboards for Synthetic Monitoring#
A synthetic monitoring dashboard should show at minimum:
- Uptime percentage over time, computed from avg_over_time(probe_success[24h]) * 100.
- Response time trends, plotting probe_duration_seconds as a time series.
- Certificate expiry countdown, days until expiry as a stat panel.
- A probe success heatmap showing which targets are healthy at a glance.
# Uptime percentage over the last 7 days
avg_over_time(probe_success{job="blackbox-http"}[7d]) * 100

# Response time 95th percentile
quantile_over_time(0.95, probe_duration_seconds{job="blackbox-http"}[1h])

# Days until certificate expires
(probe_ssl_earliest_cert_expiry - time()) / 86400

Common Gotchas#
Probing too frequently triggers rate limiting or WAF blocks. A 30-second probe interval from multiple locations means your monitoring generates steady traffic. If your WAF sees consistent requests from the same IPs to the same endpoints, it may block them. Whitelist probe source IPs in your WAF rules, or set probe intervals to 60 seconds or longer.
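Probe frequency is set per scrape job. A minimal sketch of that adjustment on the earlier blackbox-http job (targets and relabel_configs stay as shown before; the values are suggestions, not defaults):
scrape_configs:
  - job_name: "blackbox-http"
    # One probe per minute keeps monitoring traffic below typical WAF / rate-limit thresholds
    scrape_interval: 60s
    # Must cover the 5s module timeout in blackbox.yml
    scrape_timeout: 10s
    metrics_path: /probe
    params:
      module: [http_2xx]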
Probing from inside the cluster only gives a false sense of security. A Blackbox Exporter running on the same Kubernetes cluster as your application bypasses the ingress controller, external DNS, and public network path. The probe succeeds even when external users cannot reach the service. Always include at least one external probe location that traverses the full path a real user would take.
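Concretely, the two targets below hit the same application but exercise very different paths; both are worth probing, yet only the first reflects what a real user traverses (hostnames and ports are hypothetical):
    static_configs:
      - targets:
          # External path: public DNS, CDN/ingress, TLS termination
          - https://app.example.com/health
          # In-cluster path: bypasses ingress, external DNS, and the public network
          - http://app.default.svc.cluster.local:8080/health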