Status Page Setup and Management

Purpose of a Status Page

A status page is the single source of truth for service health. It communicates current status, provides historical reliability data, and sets expectations during incidents through regular updates. A well-maintained status page reduces support tickets during incidents, builds customer trust, and gives teams a structured communication channel.

Platform Options

Statuspage.io (Atlassian)

One of the most widely adopted hosted solutions. It integrates with the rest of the Atlassian ecosystem.

# Create a component (the body is JSON, so declare the content type)
curl -X POST "https://api.statuspage.io/v1/pages/${PAGE_ID}/components" \
  -H "Authorization: OAuth ${API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{"component": {"name": "API", "status": "operational", "showcase": true}}'

# Create an incident (replace "id" with the ID of an affected component)
curl -X POST "https://api.statuspage.io/v1/pages/${PAGE_ID}/incidents" \
  -H "Authorization: OAuth ${API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{"incident": {"name": "Elevated Error Rates", "status": "investigating",
       "impact_override": "minor", "component_ids": ["id"]}}'

Strengths: Highly reliable, subscriber notifications built-in, custom domains, API-first. Weaknesses: Expensive ($399+/month business plan), limited customization, component limits on lower tiers.

Cachet

Open-source, self-hosted status page in PHP/Laravel.

docker run -d --name cachet -p 8000:8000 \
  -e DB_DRIVER=pgsql -e DB_HOST=postgres \
  -e DB_DATABASE=cachet -e APP_KEY=base64:key \
  cachethq/docker:latest

Strengths: Free, self-hosted, full data ownership, fully customizable. Weaknesses: Requires hosting infrastructure, community support only, you own the uptime.

Instatus

Modern hosted status page with competitive pricing.

Strengths: Clean UI, lower pricing than Statuspage.io, good API, custom domains. Weaknesses: Smaller integration ecosystem, fewer enterprise features.

Custom Solutions

Build in-house when you need deep integration with internal systems. Minimum requirements: static site on independent infrastructure, API for monitoring integration, incident history store, and subscriber notification system.

Critical rule: Host on completely separate infrastructure from production. If production is down, the status page must still be reachable.

Component Organization

Components represent services or features users interact with.

Production Services
  ├── Website
  ├── API
  ├── Authentication
  ├── Dashboard
  └── Mobile App

Data Processing
  ├── Real-time Pipeline
  ├── Batch Processing
  └── Data Exports

Integrations
  ├── Webhook Delivery
  └── Email Notifications

Infrastructure
  ├── CDN
  ├── DNS
  └── Object Storage

Design principles: Group by user experience, not internal architecture. Users do not know your API is 12 microservices – they care if it works. Include only what users interact with. Keep the count to 10-20 components.
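One way to group by user experience is to roll internal service health up into the user-facing component. The sketch below is illustrative only: the service names, the mapping, and the "worst status wins" rule are all assumptions, and a real roll-up may need to weight services by how much user traffic they carry.

```python
# Statuses ordered from best to worst.
SEVERITY_ORDER = ["operational", "degraded_performance", "partial_outage", "major_outage"]

# Hypothetical mapping from internal services to user-facing components.
SERVICE_TO_COMPONENT = {
    "auth-gateway": "API",
    "rate-limiter": "API",
    "orders-svc": "API",
    "web-frontend": "Website",
}

def rollup(service_statuses):
    """Collapse per-service statuses into one status per component.

    Assumption: a component takes the worst status of any service behind it.
    Services that are not mapped to a component are ignored.
    """
    components = {}
    for service, status in service_statuses.items():
        component = SERVICE_TO_COMPONENT.get(service)
        if component is None:
            continue
        current = components.get(component, "operational")
        components[component] = max(current, status, key=SEVERITY_ORDER.index)
    return components

print(rollup({
    "auth-gateway": "operational",
    "rate-limiter": "degraded_performance",
    "web-frontend": "operational",
}))
```

Users see two components here, "API" and "Website", regardless of how many services sit behind each.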

Component Statuses

Status	When to Use
Operational	All metrics within SLO
Degraded Performance	Latency elevated, some requests slow
Partial Outage	Major feature down for some users
Major Outage	Service completely unavailable

Tie statuses to monitoring thresholds:

component_status_rules:
  api:
    operational:     { error_rate: "< 0.1%", p99_latency: "< 500ms" }
    degraded:        { error_rate: "0.1% - 1%", p99_latency: "500ms - 2s" }
    partial_outage:  { error_rate: "1% - 10%" }
    major_outage:    { error_rate: "> 10%" }
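The rules above can be turned into a small classifier. This is a minimal sketch, not a definitive implementation: the function name is hypothetical, and two boundary decisions are assumptions because the rules leave them open (an error rate of exactly 1% is treated as a partial outage, and p99 latency above 2s with a healthy error rate is still reported as degraded).

```python
def classify_api_status(error_rate_pct, p99_latency_ms):
    """Map API metrics to a component status using the thresholds above.

    error_rate_pct is a percentage (0.5 means 0.5%); p99_latency_ms is in
    milliseconds. Checks run from worst to best so the worst match wins.
    """
    if error_rate_pct > 10:
        return "major_outage"
    if error_rate_pct >= 1:  # assumption: exactly 1% counts as partial outage
        return "partial_outage"
    if error_rate_pct >= 0.1 or p99_latency_ms >= 500:
        return "degraded"
    return "operational"

for metrics in [(0.05, 200), (0.5, 200), (0.05, 800), (5, 200), (25, 200)]:
    print(metrics, classify_api_status(*metrics))
```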

Incident Templates

Pre-written templates ensure consistent communication during stressful incidents.

Investigating:

We are aware of [symptoms] affecting [component]. Our team is
actively investigating. We will provide updates every [cadence].

Identified:

We have identified the cause as [brief explanation]. We are
implementing [mitigation]. Expected resolution: [timeframe].

Monitoring:

A fix has been implemented. We are monitoring to ensure stability.
Error rates have returned to normal. We will mark this resolved
after [monitoring period] of stable operation.

Resolved:

This incident has been resolved as of [timestamp]. [Brief summary
of cause and fix]. Duration: [total]. We apologize for the
disruption and will conduct a post-incident review.
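Templates like these can be stored in code so that an update cannot be posted with a placeholder left unfilled. The sketch below uses Python's standard `string.Template`; the `TEMPLATES` dictionary, the field names, and `render_update` are hypothetical names chosen for illustration, and only two of the four stages are shown.

```python
from string import Template

# Two of the templates above, with bracketed placeholders turned into $fields.
TEMPLATES = {
    "investigating": Template(
        "We are aware of $symptoms affecting $component. Our team is "
        "actively investigating. We will provide updates every $cadence."
    ),
    "identified": Template(
        "We have identified the cause as $cause. We are "
        "implementing $mitigation. Expected resolution: $timeframe."
    ),
}

def render_update(stage, **fields):
    """Fill in a template; raises KeyError if the stage or a field is missing."""
    return TEMPLATES[stage].substitute(**fields)

print(render_update(
    "investigating",
    symptoms="elevated error rates",
    component="the API",
    cadence="30 minutes",
))
```

Using `substitute` rather than `safe_substitute` is deliberate: a missing field fails loudly instead of publishing a literal `$cadence` to subscribers.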

Maintenance Windows

Scheduled maintenance communicates planned work that may affect users.

# Statuspage.io - schedule maintenance (times are UTC, ISO 8601)
curl -X POST "https://api.statuspage.io/v1/pages/${PAGE_ID}/incidents" \
  -H "Authorization: OAuth ${API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{"incident": {
    "name": "Scheduled Database Maintenance",
    "status": "scheduled",
    "scheduled_for": "2026-02-25T02:00:00Z",
    "scheduled_until": "2026-02-25T04:00:00Z",
    "body": "Brief interruptions possible during database cluster maintenance.",
    "scheduled_auto_in_progress": true,
    "scheduled_auto_completed": true
  }}'

Best practices: Announce at least 72 hours in advance. Include expected impact in plain language. Specify timezone (use UTC plus local conversion). Auto-transition status if supported. Send a reminder 24 hours before. Update during the window if timing or impact changes.
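The timing rules above are simple date arithmetic. The following sketch derives the announcement deadline, the reminder time, and a local-time rendering from a UTC start time. The function name and the fixed UTC-5 offset are assumptions for illustration; production code would more likely use a DST-aware zone from the standard `zoneinfo` module.

```python
from datetime import datetime, timedelta, timezone

# Fixed offset used only for illustration; it does not follow DST changes.
EXAMPLE_LOCAL_TZ = timezone(timedelta(hours=-5), "UTC-5")

def maintenance_timeline(start_utc, local_tz=EXAMPLE_LOCAL_TZ):
    """Return the key communication times for a maintenance window."""
    return {
        "announce_by": start_utc - timedelta(hours=72),  # latest announcement
        "remind_at": start_utc - timedelta(hours=24),    # reminder to subscribers
        "start_utc": start_utc,
        "start_local": start_utc.astimezone(local_tz),   # same instant, local clock
    }

start = datetime(2026, 2, 25, 2, 0, tzinfo=timezone.utc)
for name, when in maintenance_timeline(start).items():
    print(f"{name}: {when.isoformat()}")
```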

Subscriber Notifications

Channel	Best For	When
Email	Detailed updates, maintenance	All incidents
SMS	Critical outages	SEV-1 only
Webhook	Internal tool integration	All updates
RSS	Pull-based consumers	All updates

Rules: Do not spam – update every 30 minutes during major incidents, not on every status change. Include actionable information. Allow granular subscriptions by component. Test delivery quarterly.
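As a rough sketch of how the channel table and the 30-minute rule might be enforced in code, the two helpers below pick channels by severity and throttle repeat notifications. The function names and the `"SEV-1"` label format are assumptions; hosted platforms usually handle this for you.

```python
from datetime import datetime, timedelta, timezone

MIN_UPDATE_INTERVAL = timedelta(minutes=30)

def channels_for(severity):
    """Pick notification channels for an update, following the table above."""
    channels = ["email", "webhook", "rss"]
    if severity == "SEV-1":  # SMS is reserved for critical outages
        channels.append("sms")
    return channels

def should_notify(last_sent, now, interval=MIN_UPDATE_INTERVAL):
    """Allow the first notification, then at most one per interval."""
    return last_sent is None or now - last_sent >= interval

now = datetime(2026, 2, 25, 2, 30, tzinfo=timezone.utc)
print(channels_for("SEV-1"))
print(should_notify(now - timedelta(minutes=10), now))
```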

Uptime Calculation

Uptime % = ((Total minutes - Downtime minutes) / Total minutes) * 100

For a 30-day month (43,200 minutes): 99.9% allows 43.2 minutes of downtime, 99.95% allows 21.6 minutes, and 99.99% allows about 4.3 minutes.

What counts as downtime: Major outage counts as full downtime. Partial outage counts proportionally (30% of users affected for 10 minutes = 3 minutes effective downtime). Degraded performance typically does not count unless below an SLO threshold. Scheduled maintenance during announced windows is excluded.
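The formula and the proportional rule for partial outages can be checked with a few lines of arithmetic. This is a minimal sketch with hypothetical function names; each event is a pair of duration in minutes and fraction of users affected, so a major outage uses a fraction of 1.0.

```python
def allowed_downtime_minutes(slo_pct, days=30):
    """Downtime budget in minutes for a given uptime target."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - slo_pct / 100)

def effective_downtime_minutes(events):
    """Sum downtime, weighting each event by the fraction of users affected."""
    return sum(duration * fraction for duration, fraction in events)

def uptime_pct(downtime_minutes, days=30):
    """Uptime % = ((Total minutes - Downtime minutes) / Total minutes) * 100."""
    total_minutes = days * 24 * 60
    return (total_minutes - downtime_minutes) / total_minutes * 100

# One 10-minute partial outage hitting 30% of users, one 20-minute major outage.
downtime = effective_downtime_minutes([(10, 0.3), (20, 1.0)])
print(round(downtime, 1), "effective minutes of downtime")
print(round(uptime_pct(downtime), 3), "% uptime for the month")
print(round(allowed_downtime_minutes(99.9), 1), "minutes allowed at 99.9%")
```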

Integration with Monitoring

Automate the connection between monitoring and status page updates.

Prometheus -> Alertmanager -> Webhook Receiver -> Status Page API

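# Webhook receiver that updates status page from alerts
from flask import Flask, request

app = Flask(__name__)

# Example mappings; extend with your own actions and real component IDs.
STATUS_MAP = {"set_degraded": "degraded_performance"}
COMPONENT_MAP = {"api": "your-component-id"}

@app.route("/webhook/alertmanager", methods=["POST"])
def handle_alert():
    payload = request.get_json(silent=True) or {}
    for alert in payload.get("alerts", []):
        annotations = alert.get("annotations", {})
        component_id = COMPONENT_MAP.get(annotations.get("component"))
        action = annotations.get("status_page_action")
        if not component_id or not action:
            continue  # skip alerts without a known component or action
        # A resolved alert restores the component; otherwise map the action.
        status = "operational" if alert.get("status") == "resolved" \
            else STATUS_MAP.get(action)
        if status:
            # update_component is your own helper that calls the status page API.
            update_component(component_id, status)
    return "", 200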
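The receiver above calls `update_component` without showing it. One possible shape for the HTTP request behind it, assuming Statuspage.io and its component update endpoint, is sketched below with only the standard library. The function name `build_component_update` is hypothetical, and the request is built but not sent so the example has no side effects.

```python
import json
import urllib.request

API_BASE = "https://api.statuspage.io/v1"

def build_component_update(page_id, component_id, status, api_key):
    """Build a PATCH request that sets one component's status."""
    body = json.dumps({"component": {"status": status}}).encode("utf-8")
    return urllib.request.Request(
        f"{API_BASE}/pages/{page_id}/components/{component_id}",
        data=body,
        method="PATCH",
        headers={
            "Authorization": f"OAuth {api_key}",
            "Content-Type": "application/json",
        },
    )

req = build_component_update("page123", "comp456", "degraded_performance", "secret")
print(req.get_method(), req.full_url)
```

To send it, pass the request to `urllib.request.urlopen(req, timeout=10)` and handle `urllib.error.HTTPError`, so that a failed status page update does not crash the webhook receiver.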

Define Prometheus alert annotations that specify the status page action:

- alert: ComponentDegraded
  expr: component:availability:ratio_5m < 0.999
  for: 5m
  annotations:
    status_page_action: "set_degraded"
    component: "api"

Agent Operational Notes

Never delay updates to gather more information. Post “investigating” immediately and refine later.
Use templates. Do not write incident updates from scratch during an incident.
Match component status to monitoring data. Do not leave a component “operational” when metrics show degradation.
Verify independence. Regularly confirm the status page loads from outside your infrastructure.
Close incidents promptly. An incident left in “monitoring” for days erodes trust.
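The independence check in the notes above can be automated with a small probe run from a machine outside your own infrastructure. The sketch below uses only the standard library; the function name is hypothetical, and the `opener` parameter exists so the check can be exercised without network access.

```python
import urllib.request

def status_page_reachable(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True if the page answers with a 2xx status within the timeout."""
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        return False
```

Scheduling this from an external host (for example, a cron job at a different provider) and alerting when it returns False gives early warning that the status page shares a failure domain with production.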