Purpose of a Status Page#
A status page is the single source of truth for service health. It communicates current status, provides historical reliability data, and sets expectations during incidents through regular updates. A well-maintained status page reduces support tickets during incidents, builds customer trust, and gives teams a structured communication channel.
Platform Options#
Statuspage.io (Atlassian)#
The most widely adopted hosted solution. Integrates with the Atlassian ecosystem.
# Create a component
curl -X POST "https://api.statuspage.io/v1/pages/${PAGE_ID}/components" \
  -H "Authorization: OAuth ${API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{"component": {"name": "API", "status": "operational", "showcase": true}}'

# Create an incident ("id" is a placeholder for a real component ID)
curl -X POST "https://api.statuspage.io/v1/pages/${PAGE_ID}/incidents" \
  -H "Authorization: OAuth ${API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{"incident": {"name": "Elevated Error Rates", "status": "investigating",
       "impact_override": "minor", "component_ids": ["id"]}}'

Strengths: Highly reliable, subscriber notifications built-in, custom domains, API-first. Weaknesses: Expensive ($399+/month business plan), limited customization, component limits on lower tiers.
Cachet#
Open-source, self-hosted status page in PHP/Laravel.
docker run -d --name cachet -p 8000:8000 \
  -e DB_DRIVER=pgsql -e DB_HOST=postgres \
  -e DB_DATABASE=cachet -e APP_KEY=base64:key \
  cachethq/docker:latest

Strengths: Free, self-hosted, full data ownership, fully customizable. Weaknesses: Requires hosting infrastructure, community support only, you own the uptime.
Instatus#
Modern hosted status page with competitive pricing.
Strengths: Clean UI, lower pricing than Statuspage.io, good API, custom domains. Weaknesses: Smaller integration ecosystem, fewer enterprise features.
Custom Solutions#
Build in-house when you need deep integration with internal systems. Minimum requirements: static site on independent infrastructure, API for monitoring integration, incident history store, and subscriber notification system.
Critical rule: Host on completely separate infrastructure from production. If production is down, the status page must still be reachable.
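One coarse way to check the separation rule is to compare the IP addresses that the status page and production hostnames resolve to. The sketch below uses only Python's standard `socket` module; the hostnames in the usage comment are hypothetical. An empty overlap does not prove independence (a shared provider, region, or DNS zone is not detected), but a non-empty overlap is a clear warning sign.

```python
import socket


def resolve_ips(hostname: str) -> set:
    """Return the set of IP addresses a hostname currently resolves to."""
    infos = socket.getaddrinfo(hostname, None)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return {info[4][0] for info in infos}


def shared_addresses(status_ips: set, production_ips: set) -> set:
    """Return the addresses that serve both the status page and production."""
    return set(status_ips) & set(production_ips)


# Usage (hypothetical hostnames, requires network access):
#   overlap = shared_addresses(resolve_ips("status.example.com"),
#                              resolve_ips("api.example.com"))
#   if overlap:
#       print("Status page shares addresses with production:", overlap)
```

Running a check like this from a machine outside your own network also exercises the "reachable when production is down" path more realistically than an internal probe.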
Component Organization#
Components represent services or features users interact with.
Production Services
├── Website
├── API
├── Authentication
├── Dashboard
└── Mobile App

Data Processing
├── Real-time Pipeline
├── Batch Processing
└── Data Exports

Infrastructure
├── CDN
├── DNS
└── Object Storage

Integrations
├── Webhook Delivery
└── Email Notifications

Design principles: Group by user experience, not internal architecture. Users do not know your API is 12 microservices – they care if it works. Include only what users interact with. Keep the count to 10-20 components.
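The grouping above can be kept as plain data and checked against the 10-20 component guideline. This is a minimal sketch: the group and component names come from the example layout, while `component_count` and `within_guideline` are hypothetical helper names.

```python
# Example component layout from this section, grouped by user-facing area.
COMPONENT_GROUPS = {
    "Production Services": ["Website", "API", "Authentication", "Dashboard", "Mobile App"],
    "Data Processing": ["Real-time Pipeline", "Batch Processing", "Data Exports"],
    "Infrastructure": ["CDN", "DNS", "Object Storage"],
    "Integrations": ["Webhook Delivery", "Email Notifications"],
}


def component_count(groups: dict) -> int:
    """Count components across all groups."""
    return sum(len(components) for components in groups.values())


def within_guideline(groups: dict, low: int = 10, high: int = 20) -> bool:
    """Check the total against the suggested 10-20 component range."""
    return low <= component_count(groups) <= high


print(component_count(COMPONENT_GROUPS), within_guideline(COMPONENT_GROUPS))
```

A check like this could run in CI whenever the component list changes, so the page does not slowly drift toward mirroring internal architecture.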
Component Statuses#
| Status | When to Use |
|---|---|
| Operational | All metrics within SLO |
| Degraded Performance | Latency elevated, some requests slow |
| Partial Outage | Major feature down for some users |
| Major Outage | Service completely unavailable |
Tie statuses to monitoring thresholds:
component_status_rules:
  api:
    operational: { error_rate: "< 0.1%", p99_latency: "< 500ms" }
    degraded: { error_rate: "0.1% - 1%", p99_latency: "500ms - 2s" }
    partial_outage: { error_rate: "1% - 10%" }
    major_outage: { error_rate: "> 10%" }

Incident Templates#
Pre-written templates ensure consistent communication during stressful incidents.
Investigating:
We are aware of [symptoms] affecting [component]. Our team is
actively investigating. We will provide updates every [cadence].

Identified:
We have identified the cause as [brief explanation]. We are
implementing [mitigation]. Expected resolution: [timeframe].

Monitoring:
A fix has been implemented. We are monitoring to ensure stability.
Error rates have returned to normal. We will mark this resolved
after [monitoring period] of stable operation.

Resolved:
This incident has been resolved as of [timestamp]. [Brief summary
of cause and fix]. Duration: [total]. We apologize for the
disruption and will conduct a post-incident review.

Maintenance Windows#
Scheduled maintenance communicates planned work that may affect users.
# Statuspage.io - schedule maintenance
curl -X POST "https://api.statuspage.io/v1/pages/${PAGE_ID}/incidents" \
  -H "Authorization: OAuth ${API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{"incident": {
    "name": "Scheduled Database Maintenance",
    "status": "scheduled",
    "scheduled_for": "2026-02-25T02:00:00Z",
    "scheduled_until": "2026-02-25T04:00:00Z",
    "body": "Brief interruptions possible during database cluster maintenance.",
    "scheduled_auto_in_progress": true,
    "scheduled_auto_completed": true
  }}'

Best practices: Announce at least 72 hours in advance. Include expected impact in plain language. Specify timezone (use UTC plus local conversion). Auto-transition status if supported. Send a reminder 24 hours before. Update during the window if timing or impact changes.
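The timing rules above (announce 72 hours ahead, remind 24 hours ahead, show UTC plus a local conversion) can be derived from the window itself. The following is a sketch using only the standard library; the function names and the fixed UTC-5 offset are assumptions chosen for illustration, and a real deployment would likely use named time zones instead.

```python
from datetime import datetime, timedelta, timezone


def notification_times(start_utc: datetime) -> dict:
    """Return the latest announcement time and the reminder time for a window."""
    return {
        "announce_by": start_utc - timedelta(hours=72),
        "reminder_at": start_utc - timedelta(hours=24),
    }


def format_window(start_utc: datetime, end_utc: datetime,
                  local_tz: timezone, local_label: str) -> str:
    """Render the window in UTC followed by a local-time conversion."""
    local_start = start_utc.astimezone(local_tz)
    local_end = end_utc.astimezone(local_tz)
    return (f"{start_utc:%Y-%m-%d %H:%M} to {end_utc:%H:%M} UTC "
            f"({local_start:%Y-%m-%d %H:%M} to {local_end:%H:%M} {local_label})")


# Window taken from the maintenance example in this section.
start = datetime(2026, 2, 25, 2, 0, tzinfo=timezone.utc)
end = datetime(2026, 2, 25, 4, 0, tzinfo=timezone.utc)
local = timezone(timedelta(hours=-5))  # assumed fixed offset for illustration

print(notification_times(start))
print(format_window(start, end, local, "UTC-5"))
```

Note that the short form prints only the end time of day, so it reads correctly only when the window does not cross midnight in either zone.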
Subscriber Notifications#
| Channel | Best For | When |
|---|---|---|
| Email | Detailed updates, maintenance | All incidents |
| SMS | Critical outages | SEV-1 only |
| Webhook | Internal tool integration | All updates |
| RSS | Pull-based consumers | All updates |
Rules: Do not spam – update every 30 minutes during major incidents, not on every status change. Include actionable information. Allow granular subscriptions by component. Test delivery quarterly.
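These rules can be expressed as two small decisions: which push channels apply to an incident, and whether enough time has passed to send another update. The sketch below assumes severity labels such as "SEV-1" and treats email and webhooks as the always-on push channels (RSS is pull-based, so it is not pushed); the function names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional

UPDATE_INTERVAL = timedelta(minutes=30)  # cadence suggested for major incidents


def push_channels(severity: str) -> list:
    """Pick push channels: email and webhook always, SMS only for SEV-1."""
    channels = ["email", "webhook"]
    if severity == "SEV-1":
        channels.append("sms")
    return channels


def update_due(last_sent: Optional[datetime], now: datetime) -> bool:
    """Send the first update immediately, then at most once per interval."""
    return last_sent is None or now - last_sent >= UPDATE_INTERVAL


now = datetime(2026, 2, 25, 12, 0)
print(push_channels("SEV-1"), update_due(now - timedelta(minutes=10), now))
```

Throttling on elapsed time rather than on status changes keeps a flapping component from generating a burst of notifications.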
Uptime Calculation#
Uptime % = ((Total minutes - Downtime minutes) / Total minutes) * 100

For a 30-day month (43,200 minutes): 99.9% allows about 43 minutes of downtime, 99.95% about 21 minutes, and 99.99% about 4 minutes.
What counts as downtime: Major outage counts as full downtime. Partial outage counts proportionally (30% of users affected for 10 minutes = 3 minutes effective downtime). Degraded performance typically does not count unless below an SLO threshold. Scheduled maintenance during announced windows is excluded.
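The formula and the proportional rule for partial outages can be worked through in a few lines. This is a minimal sketch with hypothetical function names; it assumes a 30-day month by default and weights partial outages by the fraction of users affected, as described above.

```python
def allowed_downtime_minutes(slo_percent: float, days: int = 30) -> float:
    """Downtime budget in minutes for a given uptime target."""
    total_minutes = days * 24 * 60
    return total_minutes * (100 - slo_percent) / 100


def effective_downtime_minutes(duration_minutes: float,
                               fraction_affected: float = 1.0) -> float:
    """Weight an outage by the share of users affected (1.0 = major outage)."""
    return duration_minutes * fraction_affected


def uptime_percent(total_minutes: float, downtime_minutes: float) -> float:
    """Uptime % = ((total - downtime) / total) * 100."""
    return (total_minutes - downtime_minutes) / total_minutes * 100


month = 30 * 24 * 60  # 43,200 minutes
downtime = (effective_downtime_minutes(20)          # 20-minute major outage
            + effective_downtime_minutes(10, 0.3))  # 10 minutes, 30% of users
print(round(allowed_downtime_minutes(99.9), 1), round(uptime_percent(month, downtime), 4))
```

In this example the two incidents add up to roughly 23 effective minutes, which fits inside a 99.9% budget but exceeds a 99.95% one.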
Integration with Monitoring#
Automate the connection between monitoring and status page updates.
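Prometheus -> Alertmanager -> Webhook Receiver -> Status Page API

# Webhook receiver that updates the status page from Alertmanager alerts
from flask import Flask, request

app = Flask(__name__)

# Example mappings (assumed): annotation values -> status names and component IDs
STATUS_MAP = {"set_degraded": "degraded_performance", "set_major_outage": "major_outage"}
COMPONENT_MAP = {"api": "your-component-id"}

def update_component(component_id, status):
    """Placeholder: call your status page API here to set the component status."""
    raise NotImplementedError

@app.route("/webhook/alertmanager", methods=["POST"])
def handle_alert():
    payload = request.get_json(silent=True) or {}
    for alert in payload.get("alerts", []):
        annotations = alert.get("annotations", {})
        component_id = COMPONENT_MAP.get(annotations.get("component"))
        action = annotations.get("status_page_action")
        if not component_id or not action:
            continue  # skip alerts not mapped to a status page component
        # Resolved alerts restore the component; firing alerts apply the mapped status
        status = "operational" if alert.get("status") == "resolved" else STATUS_MAP.get(action)
        if status:
            update_component(component_id, status)
    return "", 200

Define Prometheus alert annotations that specify the status page action: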
- alert: ComponentDegraded
  expr: component:availability:ratio_5m < 0.999
  for: 5m
  annotations:
    status_page_action: "set_degraded"
    component: "api"

Agent Operational Notes#
- Never delay updates to gather more information. Post “investigating” immediately and refine later.
- Use templates. Do not write incident updates from scratch during an incident.
- Match component status to monitoring data. Do not leave a component “operational” when metrics show degradation.
- Verify independence. Regularly confirm the status page loads from outside your infrastructure.
- Close incidents promptly. An incident left in “monitoring” for days erodes trust.
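The "match component status to monitoring data" note can be sketched as a small classifier built on the example API thresholds from the Component Statuses section. The boundary handling is an assumption: values exactly on a boundary fall into the more severe band, and p99 latency above 2s is still reported as degraded because the example rules define no latency threshold for outages.

```python
def classify_api_status(error_rate_pct: float, p99_latency_ms: float) -> str:
    """Map current metrics to a component status using the example thresholds."""
    if error_rate_pct > 10:
        return "major_outage"
    if error_rate_pct >= 1:
        return "partial_outage"
    # Either an elevated error rate or elevated latency counts as degraded.
    if error_rate_pct >= 0.1 or p99_latency_ms >= 500:
        return "degraded"
    return "operational"


def status_is_stale(displayed_status: str, error_rate_pct: float,
                    p99_latency_ms: float) -> bool:
    """Flag a mismatch between the displayed status and the metrics."""
    return displayed_status != classify_api_status(error_rate_pct, p99_latency_ms)


print(classify_api_status(0.5, 300), status_is_stale("operational", 0.5, 300))
```

A periodic job that calls something like `status_is_stale` could page the on-call engineer whenever the published status lags behind the metrics.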