Last updated: March 21, 2026
Production incidents don’t wait for business hours. Distributed teams need defined processes for alert routing, on-call escalation, runbook execution, and post-incident reviews. Here’s what works without chaos.
Table of Contents
- Why Distributed Teams Need Structure
- 1. Alert Routing: PagerDuty vs OpsGenie
- 2. On-Call Rotation Schedule
- 3. Runbook Template
- 4. Incident Communication During Active Incident
- 5. Post-Mortem Template
- 6. Complete Setup Checklist
- Real Metrics to Track
- Common Mistakes
Why Distributed Teams Need Structure
Centralized office:
- Someone hears about incident (Slack, word of mouth)
- Walks to engineer’s desk
- Incident commander coordinates response
- It’s visible, noisy, gets attention
Distributed team without process:
- Incident fires at 2am UTC
- You hope someone notices Slack/email alert
- No clarity on who owns fixing vs. who coordinates
- People in US don’t know about incident affecting EU customers until morning
- Chaos, delayed response, escalated impact
With structure:
- Alert auto-escalates to on-call engineer (by timezone if possible)
- Phone call wakes them (silence isn’t acceptable)
- 1-page runbook tells them what to do
- Post-mortem identifies root cause and prevention
- Next incident response is faster
1. Alert Routing: PagerDuty vs OpsGenie
PagerDuty (Better for Large Teams)
Setup flow:
1. Your monitoring (DataDog, New Relic, Prometheus) fires alert
2. Alert webhook hits PagerDuty API
3. PagerDuty routes to on-call engineer
4. If no response in 5 minutes, escalates to backup
5. If still no response, escalates to team lead
6. Engineer gets email + SMS + phone call
Configuration example:
Escalation Policy: Engineering On-Call
Level 1 (0 minutes):
- On-call engineer primary
- Trigger: High severity (P1, P2)
- Notify: SMS (1 min), then phone call (2 min)
Level 2 (5 minutes if engineer doesn't acknowledge):
- On-call backup engineer
- Notify: Phone call immediately
Level 3 (10 minutes if backup doesn't acknowledge):
- Engineering manager
- Notify: Phone call + SMS
Rotation: Primary on-call for 1 week
Backup on-call for 1 week
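Step 2 of the flow (monitoring hits the PagerDuty API) can be exercised by hand with the Events API v2. A minimal sketch; the routing key is a placeholder for your service's integration key, and the curl call is commented out so you don't page anyone by accident:

```shell
#!/bin/sh
# Sketch: trigger a PagerDuty incident via the Events API v2.
# ROUTING_KEY is a placeholder -- substitute your integration key.
ROUTING_KEY="YOUR_INTEGRATION_KEY"

# Build the event payload (severity must be critical/error/warning/info).
PAYLOAD=$(cat <<EOF
{
  "routing_key": "${ROUTING_KEY}",
  "event_action": "trigger",
  "payload": {
    "summary": "P1: DB query timeout, 5000 failures/min",
    "source": "prometheus",
    "severity": "critical"
  }
}
EOF
)
echo "$PAYLOAD"

# Send it (uncomment to actually page the on-call engineer):
# curl -s -X POST https://events.pagerduty.com/v2/enqueue \
#   -H 'Content-Type: application/json' -d "$PAYLOAD"
```

In practice your monitoring tool's PagerDuty integration builds this payload for you; the sketch is useful for testing that routing and escalation behave as configured.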
Real example:
2:30 AM UTC — Database query timing out
↓ (Prometheus alert fires)
2:30:01 — PagerDuty SMS to Primary: "P1: DB query timeout, 5000 failures/min"
2:30:45 — Engineer wakes, reads SMS, acknowledges alert
2:31:00 — Engineer has runbook open, running investigation
2:35:00 — Root cause found (index missing), fix deployed
2:37:00 — Incident resolved, post-mortem scheduled
Total time: 7 minutes from alert to fix
Alternative without PagerDuty (Slack-only alerts, nobody paged):
2:30 AM — Alert fires on Slack
2:32 AM — Alert fires on Slack again (10 failures, 50% traffic affected)
6:00 AM — EU person notices alerts, wakes primary engineer
6:05 AM — Engineer starts investigation
6:45 AM — Fix deployed
7:00 AM — Incident resolved, 4.5 hours customer impact
PagerDuty Pricing
Free tier:
- 1 team member
- Basic scheduling
- 1 escalation policy
Professional ($9/user/month):
- Unlimited schedules
- Advanced routing
- Mobile app
- Slack integration
- Team of 5: $45/month
Enterprise ($29/user/month):
- Custom routing rules
- Third-party integrations
- Team of 5: $145/month
For most teams: Professional tier is sufficient.
OpsGenie (Better for Small/Cost-Sensitive Teams)
Setup similar to PagerDuty, slightly different UI.
Pricing:
- Free: 1 team, limited features
- Standard ($10/user/month): Unlimited teams, schedules, escalation
- Pro ($30/user/month): Custom branding, advanced rules
Difference:
PagerDuty: Enterprise standard, better for large ops
OpsGenie: Simpler, lower cost, better for smaller teams
Most teams use PagerDuty for established operations, OpsGenie for startups.
This guide focuses on PagerDuty but concepts apply to OpsGenie equally.
2. On-Call Rotation Schedule
Simple Weekly Rotation
For team of 6 engineers:
Mon-Sun Week 1: Alice (primary), Bob (backup)
Mon-Sun Week 2: Charlie (primary), Dave (backup)
Mon-Sun Week 3: Emma (primary), Frank (backup)
Repeat every 3 weeks (cycles through everyone fairly).
Considerations:
- Each person on-call every 3 weeks (manageable)
- Always have 2 people covering (primary + backup)
- Handoff happens Sunday 11:59 PM UTC (or timezone best for team)
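The handoff above can be computed from the calendar instead of maintained by hand. A minimal sketch assuming a fixed 3-week cycle; the names are the example's and the week-to-pair mapping is an assumption:

```shell
#!/bin/sh
# Sketch: derive this week's on-call pair from the ISO week number.
on_call_pair() {
  week="${1#0}"                  # strip a leading zero ("08" -> "8")
  case $(( week % 3 )) in
    0) echo "Alice (primary), Bob (backup)" ;;
    1) echo "Charlie (primary), Dave (backup)" ;;
    2) echo "Emma (primary), Frank (backup)" ;;
  esac
}

# Whoever is on call this week:
on_call_pair "$(date +%V)"
```

A scheduling tool (PagerDuty schedules, OpsGenie rotations) does this for you; the point is that the rotation is a pure function of the date, so nobody has to remember whose turn it is.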
Timezone-Aware Rotation
For distributed team:
Primary on-call: Engineer in currently active timezone
(incidents are more likely to be detected during business hours)
Backup on-call: Engineer in opposite timezone
(If primary doesn't respond, backup is in their morning/evening)
Example setup (US + EU team):
Week 1:
Daily 00:00-19:00 UTC: Charlie (EU morning/afternoon)
Daily 19:00-00:00 UTC: Alice (US afternoon/evening)
Week 2:
Daily 00:00-19:00 UTC: Emma (EU morning/afternoon)
Daily 19:00-00:00 UTC: Frank (US afternoon/evening)
Better for on-call experience (fewer middle-of-night wakeups).
Respecting Boundaries
Quiet-hours pattern (configurable in PagerDuty via notification and routing rules):
Quiet hours: 2 AM - 7 AM in the on-call engineer's local time
- Alerts still trigger but don't notify (no SMS/call)
- Escalate to backup immediately instead
Example: Alice on-call in US Pacific (UTC-7)
Quiet hours: 2-7 AM PT (9 AM - 2 PM UTC)
Incident at 4 AM PT → escalates to backup immediately
This prevents burnout (on-call is stressful; middle-of-night wakeups are worse).
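The quiet-hours routing reduces to a tiny decision function. A sketch of the 2 AM-7 AM rule above; the function name and output strings are illustrative, and a real setup would configure this in the paging tool rather than a script:

```shell
#!/bin/sh
# Sketch: decide whether to notify the primary or escalate straight to
# backup, based on the primary's local hour. Quiet hours: 02:00-06:59.
route_alert() {
  local_hour="${1#0}"            # strip a leading zero ("04" -> "4")
  if [ "$local_hour" -ge 2 ] && [ "$local_hour" -lt 7 ]; then
    echo "escalate-to-backup"    # primary is asleep; don't wake them
  else
    echo "notify-primary"
  fi
}

# Route based on the current hour in the primary's timezone:
route_alert "$(TZ=America/Los_Angeles date +%H)"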
3. Runbook Template
A runbook is “what to do when X breaks.” 1-page maximum.
Template Structure
# Incident Runbook: Database Connection Pool Exhaustion
## Symptoms
- API returns "Connection timeout" errors
- Database connection count maxed
- Latency spikes on all endpoints
## Diagnosis (< 2 minutes)
1. Log in to Datadog dashboard (link: https://...)
2. Check metric: "postgres_active_connections"
3. If > 90% of the pool is in use, proceed to Quick Fix
4. Check metric: "query_duration_p99"
5. If > 5s, the database is slow (switch to the slow query runbook)
## Quick Fix (5 minutes)
1. SSH into app-server-1: `ssh ubuntu@app-1.internal`
2. Check connection status: `curl localhost:8080/health`
3. Restart app container: `docker restart app`
4. Verify: Check API returns 200 OK, Datadog shows recovery
If not recovered within the 5-minute budget, escalate to the database team.
## Root Cause Investigation (post-incident)
- Check logs: `grep "Connection pool" /var/log/app.log | tail -100`
- Look for: Query hangs, connection leaks, traffic spike
- Common causes: Slow query, missing index, upstream service failure
## Escalation
If database team on-call unreachable after 3 min, escalate to VP Eng
## Verification Metrics
- Connection count: < 50 (normal)
- Query latency p99: < 200ms
- Error rate: < 0.1%
- All checks green: Incident resolved
## Post-Incident
- Schedule follow-up meeting to investigate root cause
- Implement prevention (e.g., connection pool monitoring)
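The verification metrics above can be folded into one check so the on-call engineer gets a single RESOLVED/STILL-DEGRADED answer. A sketch, assuming the three values are fetched separately (e.g. from `pg_stat_activity` or your monitoring API); the function name and thresholds mirror the template:

```shell
#!/bin/sh
# Sketch: evaluate the runbook's verification thresholds in one place.
# Inputs: active connections, p99 latency in ms, error rate in percent.
check_recovery() {
  connections="$1"; p99_ms="$2"; error_rate="$3"
  if [ "$connections" -lt 50 ] && [ "$p99_ms" -lt 200 ] \
     && awk -v r="$error_rate" 'BEGIN{ exit !(r < 0.1) }'; then
    echo "RESOLVED"
  else
    echo "STILL-DEGRADED"
  fi
}

check_recovery 42 180 0.05     # all three checks green
```

Printing a single verdict matters at 3 AM: the engineer shouldn't have to eyeball three dashboards to decide whether to go back to bed.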
Real Runbook Examples
Example: Disk Space Exhaustion
# Incident Runbook: Production Disk Space Critical
## Symptoms
- File writes failing (500 errors)
- Datadog alert: "Disk > 95%"
- Log streaming stopping
## Diagnosis (< 2 minutes)
SSH: `ssh ubuntu@prod-1`
Check disk: `df -h /data`
Identify large files: `du -sh /data/* | sort -h`
## Quick Fix
# Delete old logs (safe)
find /data/logs -type f -mtime +30 -delete
# Restart logging
systemctl restart rsyslog
# Verify
df -h /data (should drop to < 80%)
curl localhost:8080/health (should return 200)
## If Still Critical
Delete container cache: `docker system prune -a` (also removes unused images)
This is more aggressive; verify services and required images afterwards
## Escalation
If above steps don't free space, page infra team
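The `find ... -delete` in the Quick Fix is destructive, so it pays to dry-run it first. A sketch; the path is the runbook's, and the helper function is illustrative:

```shell
#!/bin/sh
# Sketch: dry-run the log cleanup before deleting anything.
list_stale_logs() {
  dir="$1"
  [ -d "$dir" ] || { echo "no such dir: $dir" >&2; return 1; }
  # Files that WOULD be deleted (older than 30 days):
  find "$dir" -type f -mtime +30 -print
}

# Review the list first, then run the delete for real:
list_stale_logs /data/logs || true
# find /data/logs -type f -mtime +30 -delete
```

Reviewing the candidate list takes seconds and prevents the classic secondary incident where the cleanup deletes something the app still needs.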
Example: Payment Service Failure
# Incident Runbook: Payment Processing Down
## Symptoms
- Checkout fails with "Payment gateway error"
- Stripe webhook queue backing up
- Customer emails arriving
## Diagnosis (< 2 minutes)
Check Stripe API status: https://status.stripe.com/
Check internal status page: https://internal/status/stripe-integration
Check logs: `grep "stripe_error" app.log | tail -20`
## Quick Fix Option 1: Stripe is Down
Wait for Stripe recovery, display banner to customers
Enable "maintenance mode" to prevent orders during outage
https://internal/admin/maintenance-mode
## Quick Fix Option 2: Our Integration is Broken
Restart Stripe sync: `kubectl rollout restart deployment/stripe-sync`
Verify: `curl https://internal/api/stripe-health`
Check queue size: `redis-cli GET stripe:queue:length`
## If Queue Backing Up > 1 hour
Page payments team, consider manual order approval
Escalate to CTO
## Post-Incident
- Review Stripe API logs for error patterns
- Add more detailed error logging to catch next time
- Improve monitoring on queue depth
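The queue-depth escalation rule above can run as a watchdog instead of relying on someone noticing. A sketch; in production the depth would come from `redis-cli GET stripe:queue:length`, and the threshold value is an assumption:

```shell
#!/bin/sh
# Sketch: decide whether the webhook queue depth warrants a page.
# Depth would come from: redis-cli GET stripe:queue:length
queue_action() {
  depth="$1"; threshold="${2:-1000}"
  if [ "$depth" -gt "$threshold" ]; then
    echo "page-payments-team"    # queue backing up; escalate
  else
    echo "ok"
  fi
}

queue_action 1500    # over threshold
queue_action 200     # within normal range
```

Wired to a cron job or your monitoring tool, this turns the "if queue backing up > 1 hour" judgment call into an automatic page.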
Runbook Best Practices
- One page maximum — longer and people skip it
- Time budgets in headings — “Diagnosis (< 2 min)” sets expectations
- Exact commands — copy/paste should work
- Links to tools — don’t make people search for dashboard URLs
- Escalation criteria — “if X hasn’t resolved in 5 min, escalate to Y”
- Post-incident section — so you improve next time
4. Incident Communication During Active Incident
Slack Channel Setup
Create: #incidents (or #incident-response for larger teams)
During incident:
- Create thread in #incidents with incident ID
- One person is “scribe” (writes updates in thread)
- Responders post findings/actions to thread
- Every 5 minutes scribe posts status update
Example thread:
Thread started: 2026-03-21 02:30 UTC by Alice
Incident ID: INC-2026-3421
Severity: P1 (customers affected)
Status: Investigating
[02:31] Alice: Confirmed database connection exhaustion (492/500 active)
[02:32] Bob: Restarting connection pool service
[02:33] Bob: Pool restarted, connections dropping (now 280/500)
[02:35] Alice: API latency recovering, error rate dropping
[02:37] Status: RESOLVED - all metrics normal, error rate < 0.1%
Root cause: Query optimization missing on bulk user export
Impact: 7 min outage, 2% of transactions failed during window
Post-mortem: Thursday 2pm UTC
Key: Everyone knows status without jumping between channels.
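Scribe updates can also be posted programmatically, e.g. via a Slack incoming webhook. A sketch; the webhook URL is a placeholder, the function name is illustrative, and the send is commented out (real payloads also need JSON escaping for quotes in the message):

```shell
#!/bin/sh
# Sketch: format and post a scribe status update to #incidents
# via a Slack incoming webhook. The URL is a placeholder.
WEBHOOK_URL="https://hooks.slack.com/services/XXX/YYY/ZZZ"

post_status() {
  incident="$1"; status="$2"; detail="$3"
  msg=$(printf '[%s] %s %s: %s' "$(date -u +%H:%M)" "$incident" "$status" "$detail")
  printf '{"text": "%s"}' "$msg"
  # Uncomment to actually send:
  # printf '{"text": "%s"}' "$msg" | \
  #   curl -s -X POST -H 'Content-Type: application/json' -d @- "$WEBHOOK_URL"
}

post_status "INC-2026-3421" "INVESTIGATING" "DB connections at 492/500"
```

Automating the timestamped format keeps updates consistent, which matters later when the post-mortem reconstructs the timeline from the thread.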
Customer Communication
Public status page setup (tools: Atlassian Statuspage (formerly StatusPage.io), or a custom page):
During incident:
INVESTIGATING — Service partially unavailable
Some customers may experience slow payment processing. Our team is investigating.
[02:31] We've identified unusual database activity
[02:35] We've deployed a fix and are monitoring recovery
[02:37] Service is recovering, all systems normal
Post-incident:
RESOLVED — Full details available in blog post
Root cause: Missing index on bulk export query
Duration: 7 minutes (02:30-02:37 UTC)
Impact: 2% of transactions failed
Prevention: Added database monitoring, index optimization
Full technical post-mortem: https://...
5. Post-Mortem Template
Conducted within 48 hours, while details are fresh.
Format
# Post-Mortem: Database Connection Pool Exhaustion (INC-2026-3421)
## Timeline
02:30 UTC — Prometheus alert fires (DB connections 95%)
02:31 UTC — PagerDuty notifies Alice (on-call engineer)
02:32 UTC — Alice acknowledges, starts investigation
02:33 UTC — Root cause identified: missing index on user export query
02:35 UTC — Index added, query redeployed
02:37 UTC — Connections drop, latency recovers
02:45 UTC — All systems stable, incident declared resolved
## Impact
- Duration: 7 minutes
- Affected: ~2% of payment transactions (450 failed)
- Customer-facing: Payment page returned errors
- Team effort: 1 engineer, ~15 min response + fix
## Root Cause
Bulk user export feature added Friday, no performance testing on production dataset.
Query performed full table scan (50M users) instead of indexed lookup.
Query took 45+ seconds per request, exhausted connection pool within minutes.
## Why Wasn't This Caught?
1. Feature had unit tests (passed)
2. Feature had integration tests on staging data (passed, only 10k test users)
3. No performance test against production-scale data
4. No index on table, even though query required it
## Lessons Learned
1. All new queries should have EXPLAIN ANALYZE review
2. Staging environment doesn't match production scale
3. Index recommendations should be automated in code review
## Action Items (Who / When)
1. [Alice] Add database.md runbook for connection pool exhaustion (by Friday)
2. [Bob] Create script to compare staging vs prod data volumes (by next week)
3. [Charlie] Set up automated EXPLAIN ANALYZE checks in CI (by sprint end)
4. [Dave] Review all bulk query code for index coverage (by next week)
## Follow-Up
- Review in 1 week (are action items complete?)
- Monitor bulk export performance daily for next 2 weeks
- Mention in team standup (everyone learns from this)
Post-Mortem Best Practices
What NOT to do:
- Blame people (“Bob didn’t test”)
- Assign preventions without owners/dates (“we should monitor better”)
- Use as punishment (people won’t report incidents honestly)
What TO do:
- Focus on systems/processes (“We need scale testing”)
- Specific, actionable items (“Add EXPLAIN ANALYZE to CI by March 28”)
- Blameless (focus on “how do we prevent” not “whose fault”)
- Follow up (actually do action items)
6. Complete Setup Checklist
Week 1: Foundation
- Choose PagerDuty or OpsGenie
- Create account, set up basic team
- Connect monitoring tool (DataDog, New Relic, Prometheus) to send alerts to PagerDuty
- Test alert: Trigger fake alert, verify SMS/call reaches someone
- Create #incidents Slack channel
- Establish on-call rotation (first week)
Week 2: Runbooks
- Write runbooks for top 5 incidents (use template above)
- Link runbooks in PagerDuty (in alert description)
- Conduct table-top drill (simulate incident, follow runbook, time it)
- Update runbooks based on drill feedback
- Create post-mortem template in Notion/Google Docs
Week 3: Communication
- Set up StatusPage.io or similar
- Create incident response Slack bot (for status page updates)
- Document escalation policy (who to contact if primary unavailable)
- Create “incident commander” runbook (who coordinates during big incident)
Week 4: Validation
- Conduct live incident drill (deliberately break something non-critical, time response)
- Measure: Alert fires → engineer aware (should be < 2 min)
- Measure: Engineer starts fix (should be < 5 min)
- Adjust PagerDuty settings based on learnings
Real Metrics to Track
After 2 weeks of process:
Mean Time to Acknowledge (MTTA):
- Before: No process (alerts buried in Slack, ~30 min)
- After: 2 minutes (PagerDuty SMS/call)
Mean Time to Recovery (MTTR):
- Before: 45 minutes (waiting for morning, lack of runbook)
- After: 12 minutes (runbook + prepared engineer)
Time to Escalation:
- Before: No process, unclear
- After: 5 minutes to first backup, 10 to manager
Customer Impact Severity:
- Before: Major incidents often hit customers before team aware
- After: Usually resolved before notification goes out
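Both MTTA and MTTR fall out of simple timestamp arithmetic over the incident timeline. A sketch using GNU `date` with timestamps from the example incident:

```shell
#!/bin/sh
# Sketch: minutes between two UTC timestamps (requires GNU date).
minutes_between() {
  start_s=$(date -u -d "$1" +%s)
  end_s=$(date -u -d "$2" +%s)
  echo $(( (end_s - start_s) / 60 ))
}

# MTTA: alert fired 02:30, engineer acknowledged 02:32
minutes_between "2026-03-21 02:30" "2026-03-21 02:32"
# MTTR: alert fired 02:30, incident resolved 02:37
minutes_between "2026-03-21 02:30" "2026-03-21 02:37"
```

PagerDuty and OpsGenie both report these metrics out of the box; computing them yourself is mainly useful for incidents tracked outside the paging tool.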
Common Mistakes
Mistake 1: Runbook too long (3+ pages)
- People don’t read it during incident
- Keep to 1 page, action-focused
Mistake 2: Post-mortems become blame sessions
- Team stops reporting incidents honestly
- Switch to blameless post-mortems immediately
Mistake 3: On-call rotation unfair
- The same few people end up carrying most of the on-call load
- Use scheduling tool, everyone rotates equally
Mistake 4: No escalation policy
- Easy to get stuck (primary unreachable, not clear who to page)
- Define clear escalation in PagerDuty
Mistake 5: Runbooks never updated
- System changes, runbooks become obsolete
- Update runbook every time you fix an incident
Related Articles
- How to Scale Remote Team Incident Response Process
- Scale Remote Team Incident Response From Startup to Mid-Size
- Remote Team Security Incident Response Plan Template
- Best Tools for Remote Team Incident Postmortems in 2026
- How to Create Remote Team Escalation Communication Template
Frequently Asked Questions
How long does it take to set up remote team incident response process?
The tooling setup (PagerDuty account, monitoring integration, Slack channel) takes a few hours. Rolling out the full process — runbooks, drills, and communication — typically takes about four weeks, following the checklist above. Having credentials and integrations ready before starting saves significant time.
What are the most common mistakes to avoid?
The most frequent issues are runbooks that run past one page, post-mortems that drift into blame, unfair on-call rotations, missing escalation policies, and runbooks that never get updated — see the Common Mistakes section above for fixes to each.
Do I need prior experience to follow this guide?
Basic familiarity with the relevant tools and command line is helpful but not strictly required. Each step is explained with context. If you get stuck, the official documentation for each tool covers fundamentals that may fill in knowledge gaps.
Can I adapt this for different tools?
Yes. The underlying concepts — alert routing, escalation policies, runbooks, and blameless post-mortems — transfer directly. Swap in equivalent features in OpsGenie or another paging tool; the workflow design stays the same even when the tool changes.
Where can I get help if I run into issues?
Start with the official documentation for each tool mentioned. Stack Overflow and GitHub Issues are good next steps for specific error messages. Community forums and Discord servers for the relevant tools often have active members who can help with setup problems.