Purpose of a Status Page#

A status page is the single source of truth for service health. It communicates current status, provides historical reliability data, and sets expectations during incidents through regular updates. A well-maintained status page reduces support tickets during incidents, builds customer trust, and gives teams a structured communication channel.

Platform Options#

Statuspage.io (Atlassian)#

One of the most widely adopted hosted solutions. It integrates with the rest of the Atlassian ecosystem.

# Create a component (the JSON body needs an explicit Content-Type header)
curl -X POST "https://api.statuspage.io/v1/pages/${PAGE_ID}/components" \
  -H "Authorization: OAuth ${API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{"component": {"name": "API", "status": "operational", "showcase": true}}'

# Create an incident (replace "id" with a real component ID)
curl -X POST "https://api.statuspage.io/v1/pages/${PAGE_ID}/incidents" \
  -H "Authorization: OAuth ${API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{"incident": {"name": "Elevated Error Rates", "status": "investigating",
       "impact_override": "minor", "component_ids": ["id"]}}'

Strengths: Highly reliable, subscriber notifications built-in, custom domains, API-first. Weaknesses: Expensive ($399+/month business plan), limited customization, component limits on lower tiers.

Cachet#

An open-source, self-hosted status page written in PHP (Laravel).

docker run -d --name cachet -p 8000:8000 \
  -e DB_DRIVER=pgsql -e DB_HOST=postgres \
  -e DB_DATABASE=cachet -e APP_KEY=base64:key \
  cachethq/docker:latest

Strengths: Free, self-hosted, full data ownership, fully customizable. Weaknesses: Requires hosting infrastructure, community support only, you own the uptime.

Instatus#

Modern hosted status page with competitive pricing.

Strengths: Clean UI, lower pricing than Statuspage.io, good API, custom domains. Weaknesses: Smaller integration ecosystem, fewer enterprise features.

Custom Solutions#

Build in-house when you need deep integration with internal systems. Minimum requirements: static site on independent infrastructure, API for monitoring integration, incident history store, and subscriber notification system.

Critical rule: Host on completely separate infrastructure from production. If production is down, the status page must still be reachable.

Component Organization#

Components represent services or features users interact with.

Production Services        Data Processing
  ├── Website                ├── Real-time Pipeline
  ├── API                    ├── Batch Processing
  ├── Authentication         └── Data Exports
  ├── Dashboard
  └── Mobile App           Infrastructure
                             ├── CDN
Integrations                 ├── DNS
  ├── Webhook Delivery       └── Object Storage
  └── Email Notifications

Design principles: Group by user experience, not internal architecture. Users do not know your API is 12 microservices – they care if it works. Include only what users interact with. Keep the count to 10-20 components.

Component Statuses#

| Status | When to Use |
| --- | --- |
| Operational | All metrics within SLO |
| Degraded Performance | Latency elevated, some requests slow |
| Partial Outage | Major feature down for some users |
| Major Outage | Service completely unavailable |

Tie statuses to monitoring thresholds:

component_status_rules:
  api:
    operational:     { error_rate: "< 0.1%", p99_latency: "< 500ms" }
    degraded:        { error_rate: "0.1% - 1%", p99_latency: "500ms - 2s" }
    partial_outage:  { error_rate: "1% - 10%" }
    major_outage:    { error_rate: "> 10%" }
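
The rules above can be turned into a small classifier. The following is a minimal sketch, not a definitive implementation: the function name `classify_component` is hypothetical, boundary values (exactly 1%, exactly 10%) are assigned by assumption, and latency above 2s is treated as degraded because the rules do not map it to an outage status.

```python
from typing import Optional


def classify_component(error_rate_pct: float,
                       p99_latency_ms: Optional[float] = None) -> str:
    """Return the worst status implied by either metric.

    error_rate_pct is a percentage (0.5 means 0.5%); p99_latency_ms is in
    milliseconds and may be omitted if latency is not measured.
    """
    # Error-rate thresholds follow the YAML rules; "> 10%" is a major outage.
    if error_rate_pct > 10:
        return "major_outage"
    # Assumption: exactly 1% falls into the more severe bucket.
    if error_rate_pct >= 1:
        return "partial_outage"
    # Assumption: any p99 at or above 500ms counts as degraded, with no
    # upper bound, since the rules define no latency-based outage status.
    latency_degraded = p99_latency_ms is not None and p99_latency_ms >= 500
    if error_rate_pct >= 0.1 or latency_degraded:
        return "degraded"
    return "operational"
```

Evaluating the most severe condition first means a component with a low error rate but slow responses is still reported as degraded rather than operational.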

Incident Templates#

Pre-written templates ensure consistent communication during stressful incidents.

Investigating:

We are aware of [symptoms] affecting [component]. Our team is
actively investigating. We will provide updates every [cadence].

Identified:

We have identified the cause as [brief explanation]. We are
implementing [mitigation]. Expected resolution: [timeframe].

Monitoring:

A fix has been implemented. We are monitoring to ensure stability.
Error rates have returned to normal. We will mark this resolved
after [monitoring period] of stable operation.

Resolved:

This incident has been resolved as of [timestamp]. [Brief summary
of cause and fix]. Duration: [total]. We apologize for the
disruption and will conduct a post-incident review.
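
One way to keep these templates usable under pressure is to store them in code and refuse to post an update with an unfilled placeholder. This is a sketch under assumptions: the `TEMPLATES` dictionary, the `render_update` function, and the field names (`symptoms`, `cadence`, and so on) are illustrative names, not part of any status page product.

```python
import string

# Template text follows the four stages above; $name marks a placeholder.
TEMPLATES = {
    "investigating": (
        "We are aware of $symptoms affecting $component. Our team is "
        "actively investigating. We will provide updates every $cadence."
    ),
    "identified": (
        "We have identified the cause as $cause. We are "
        "implementing $mitigation. Expected resolution: $timeframe."
    ),
    "monitoring": (
        "A fix has been implemented. We are monitoring to ensure stability. "
        "Error rates have returned to normal. We will mark this resolved "
        "after $monitoring_period of stable operation."
    ),
    "resolved": (
        "This incident has been resolved as of $timestamp. $summary. "
        "Duration: $duration. We apologize for the "
        "disruption and will conduct a post-incident review."
    ),
}


def render_update(stage: str, **fields: str) -> str:
    """Fill the template for a stage.

    Raises KeyError if the stage is unknown or a placeholder is missing,
    so a half-filled update is never published.
    """
    return string.Template(TEMPLATES[stage]).substitute(**fields)
```

For example, `render_update("investigating", symptoms="elevated error rates", component="the API", cadence="30 minutes")` returns the first template with all three fields filled in.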

Maintenance Windows#

Scheduled maintenance notices communicate planned work that may affect users.

# Statuspage.io - schedule maintenance
curl -X POST "https://api.statuspage.io/v1/pages/${PAGE_ID}/incidents" \
  -H "Authorization: OAuth ${API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{"incident": {
    "name": "Scheduled Database Maintenance",
    "status": "scheduled",
    "scheduled_for": "2026-02-25T02:00:00Z",
    "scheduled_until": "2026-02-25T04:00:00Z",
    "body": "Brief interruptions possible during database cluster maintenance.",
    "scheduled_auto_in_progress": true,
    "scheduled_auto_completed": true
  }}'

Best practices: Announce at least 72 hours in advance. Include expected impact in plain language. Specify timezone (use UTC plus local conversion). Auto-transition status if supported. Send a reminder 24 hours before. Update during the window if timing or impact changes.
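
The timing rules above (72-hour announcement, 24-hour reminder, UTC plus a local conversion) can be computed rather than worked out by hand. The sketch below uses hypothetical helper names and a fixed UTC offset for the local zone to stay dependency-free; a real deployment would more likely use named zones (for example via `zoneinfo`) so that daylight saving time is handled.

```python
from datetime import datetime, timedelta, timezone


def maintenance_schedule(start: datetime) -> dict:
    """Return the latest announcement time and the reminder time in UTC."""
    if start.tzinfo is None:
        raise ValueError("start must be timezone-aware")
    start_utc = start.astimezone(timezone.utc)
    return {
        "announce_by": start_utc - timedelta(hours=72),
        "reminder_at": start_utc - timedelta(hours=24),
        "starts_at": start_utc,
    }


def describe_window(start: datetime, end: datetime,
                    local_tz: timezone, local_label: str) -> str:
    """Format a window as UTC followed by a local-time conversion."""
    fmt = "%Y-%m-%d %H:%M"
    utc_part = (f"{start.astimezone(timezone.utc).strftime(fmt)} to "
                f"{end.astimezone(timezone.utc).strftime(fmt)} UTC")
    local_part = (f"{start.astimezone(local_tz).strftime(fmt)} to "
                  f"{end.astimezone(local_tz).strftime(fmt)} {local_label}")
    return f"{utc_part} ({local_part})"
```

For the window in the API example (02:00 to 04:00 UTC on 2026-02-25), `maintenance_schedule` gives an announcement deadline of 02:00 UTC on 2026-02-22 and a reminder at 02:00 UTC on 2026-02-24.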

Subscriber Notifications#

| Channel | Best For | When |
| --- | --- | --- |
| Email | Detailed updates, maintenance | All incidents |
| SMS | Critical outages | SEV-1 only |
| Webhook | Internal tool integration | All updates |
| RSS | Pull-based consumers | All updates |

Rules: Do not spam – update every 30 minutes during major incidents, not on every status change. Include actionable information. Allow granular subscriptions by component. Test delivery quarterly.
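
The channel routing and the 30-minute cadence can be expressed as two small checks. This is an illustrative sketch only: `channels_for`, `should_notify`, and the severity label format are assumed names, and a real notifier would also need per-component subscription filtering.

```python
from datetime import datetime, timedelta
from typing import List, Optional

# Minimum gap between pushed updates during a major incident.
MIN_INTERVAL = timedelta(minutes=30)


def channels_for(severity: str) -> List[str]:
    """Pick channels for an update; SMS is reserved for SEV-1."""
    channels = ["email", "webhook", "rss"]
    if severity == "SEV-1":
        channels.append("sms")
    return channels


def should_notify(last_sent: Optional[datetime], now: datetime) -> bool:
    """Allow the first update immediately, then at most one per interval."""
    return last_sent is None or now - last_sent >= MIN_INTERVAL
```

Keeping the throttle separate from the routing makes it straightforward to apply the interval only to push channels such as email and SMS if pull-based consumers should see every change.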

Uptime Calculation#

Uptime % = ((Total minutes - Downtime minutes) / Total minutes) * 100

For a 30-day month (43,200 minutes): 99.9% allows about 43.2 minutes of downtime, 99.95% about 21.6 minutes, and 99.99% about 4.3 minutes.

What counts as downtime: Major outage counts as full downtime. Partial outage counts proportionally (30% of users affected for 10 minutes = 3 minutes effective downtime). Degraded performance typically does not count unless below an SLO threshold. Scheduled maintenance during announced windows is excluded.
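
The formula and the proportional-counting rule can be checked with a few lines of arithmetic. In this sketch the function names and the `(duration_minutes, fraction_affected)` event format are assumptions made for illustration; a major outage is simply an event with a fraction of 1.0.

```python
from typing import Iterable, Tuple

# One event: (duration in minutes, fraction of users affected from 0.0 to 1.0)
Event = Tuple[float, float]


def allowed_downtime_minutes(slo_pct: float, days: int = 30) -> float:
    """Downtime budget in minutes for a given uptime target."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - slo_pct / 100)


def effective_downtime_minutes(events: Iterable[Event]) -> float:
    """Weight each event's duration by the share of users affected."""
    return sum(duration * fraction for duration, fraction in events)


def uptime_pct(events: Iterable[Event], days: int = 30) -> float:
    """Uptime % = ((total - downtime) / total) * 100."""
    total_minutes = days * 24 * 60
    downtime = effective_downtime_minutes(events)
    return (total_minutes - downtime) / total_minutes * 100
```

With this representation, the example in the text is `effective_downtime_minutes([(10, 0.3)])`, which yields 3 minutes, and events inside announced maintenance windows are simply left out of the list.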

Integration with Monitoring#

Automate the connection between monitoring and status page updates.

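Prometheus -> Alertmanager -> Webhook Receiver -> Status Page API

# Webhook receiver that updates the status page from Alertmanager alerts
from flask import Flask, request

app = Flask(__name__)

# Placeholder mappings: fill in your own actions and component IDs.
STATUS_MAP = {"set_degraded": "degraded_performance"}  # annotation -> status
COMPONENT_MAP = {"api": "component-id"}                # annotation -> component ID

def update_component(component_id, status):
    """Placeholder: call your status page API here."""

@app.route("/webhook/alertmanager", methods=["POST"])
def handle_alert():
    payload = request.get_json(silent=True) or {}
    for alert in payload.get("alerts", []):
        annotations = alert.get("annotations", {})
        component_id = COMPONENT_MAP.get(annotations.get("component"))
        action = annotations.get("status_page_action")
        if not component_id or not action:
            continue  # skip alerts that do not map to a known component
        status = "operational" if alert.get("status") == "resolved" \
            else STATUS_MAP.get(action)
        if status:
            update_component(component_id, status)
    return "", 200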
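
The receiver above calls `update_component` without showing it. One possible shape for that helper, assuming Statuspage.io as the target and the same `PAGE_ID` and `API_KEY` environment variables used in the curl examples, is sketched below. The split into `build_component_update` (a hypothetical name) and `update_component` is a design choice that lets the request be inspected without making a network call; other platforms will need a different URL and payload.

```python
import json
import os
import urllib.request

API_BASE = "https://api.statuspage.io/v1"


def build_component_update(page_id: str, component_id: str,
                           status: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a PATCH request that sets a component status."""
    body = json.dumps({"component": {"status": status}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{API_BASE}/pages/{page_id}/components/{component_id}",
        data=body,
        method="PATCH",
        headers={
            "Authorization": f"OAuth {api_key}",
            "Content-Type": "application/json",
        },
    )


def update_component(component_id: str, status: str) -> int:
    """Send the update and return the HTTP status code.

    Reads PAGE_ID and API_KEY from the environment (assumed to be set).
    urlopen raises urllib.error.HTTPError on 4xx/5xx responses.
    """
    request = build_component_update(
        os.environ["PAGE_ID"], component_id, status, os.environ["API_KEY"]
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

A short timeout matters here: the receiver runs inside the alerting path, so a slow status page API should fail fast rather than block delivery of later alerts.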

Define Prometheus alert annotations that specify the status page action:

- alert: ComponentDegraded
  expr: component:availability:ratio_5m < 0.999
  for: 5m
  annotations:
    status_page_action: "set_degraded"
    component: "api"

Agent Operational Notes#

  • Never delay updates to gather more information. Post “investigating” immediately and refine later.
  • Use templates. Do not write incident updates from scratch during an incident.
  • Match component status to monitoring data. Do not leave a component “operational” when metrics show degradation.
  • Verify independence. Regularly confirm the status page loads from outside your infrastructure.
  • Close incidents promptly. An incident left in “monitoring” for days erodes trust.