Last updated: March 21, 2026

Production incidents don’t wait for business hours. Distributed teams need defined processes for alert routing, on-call escalation, runbook execution, and post-incident reviews. Here’s what works without chaos.


Why Distributed Teams Need Structure

Centralized office: someone sees the outage, walks over, and the right people are in a room within minutes.

Distributed team without process: alerts land in a Slack channel nobody is watching at 2 AM, and hours pass before anyone responds.

With structure: alerts route automatically to a named on-call engineer, escalate on a timer, and get resolved from a runbook, regardless of timezone.

1. Alert Routing: PagerDuty vs OpsGenie

PagerDuty (Better for Large Teams)

Setup flow:

1. Your monitoring (DataDog, New Relic, Prometheus) fires alert
2. Alert webhook hits PagerDuty API
3. PagerDuty routes to on-call engineer
4. If no response in 5 minutes, escalates to backup
5. If still no response, escalates to team lead
6. Engineer gets email + SMS + phone call
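Step 2 of the flow can be exercised directly with PagerDuty's Events API v2. A minimal Python sketch of the trigger payload; the routing key and dedup key are placeholders, and the commented `requests.post` call is how you would actually send it:

```python
import json

# Placeholder — replace with your PagerDuty integration (routing) key.
ROUTING_KEY = "YOUR_INTEGRATION_KEY"
EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"

def build_trigger_event(summary, source, severity="critical", dedup_key=None):
    """Build a PagerDuty Events API v2 'trigger' payload."""
    event = {
        "routing_key": ROUTING_KEY,
        "event_action": "trigger",
        "payload": {
            "summary": summary,    # what shows up in the SMS / push
            "source": source,      # host or service that raised the alert
            "severity": severity,  # critical / error / warning / info
        },
    }
    if dedup_key:
        # Repeat firings with the same dedup_key fold into one incident.
        event["dedup_key"] = dedup_key
    return event

event = build_trigger_event(
    "P1: DB query timeout, 5000 failures/min",
    source="prometheus",
    dedup_key="db-query-timeout",
)
# To send it (requires the `requests` package):
#   requests.post(EVENTS_URL, data=json.dumps(event),
#                 headers={"Content-Type": "application/json"})
print(json.dumps(event, indent=2))
```

The `dedup_key` matters in practice: without it, an alert firing every minute opens a new incident every minute.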

Configuration example:

Escalation Policy: Engineering On-Call

Level 1 (0 minutes):
  - On-call engineer primary
  - Trigger: High severity (P1, P2)
  - Notify: SMS (1 min), then phone call (2 min)

Level 2 (5 minutes if engineer doesn't acknowledge):
  - On-call backup engineer
  - Notify: Phone call immediately

Level 3 (10 minutes if backup doesn't acknowledge):
  - Engineering manager
  - Notify: Phone call + SMS

Rotation: Primary on-call for 1 week
          Backup on-call for 1 week
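The escalation policy above reduces to a small timing table. A sketch of the routing decision, assuming the level thresholds shown (0, 5, and 10 minutes):

```python
# Escalation thresholds in minutes, matching the policy above.
ESCALATION_LEVELS = [
    (0, "on-call engineer (primary)"),
    (5, "on-call backup engineer"),
    (10, "engineering manager"),
]

def current_escalation_target(minutes_unacknowledged):
    """Who is being paged after this many minutes without an ack?"""
    target = ESCALATION_LEVELS[0][1]
    for threshold, who in ESCALATION_LEVELS:
        if minutes_unacknowledged >= threshold:
            target = who
    return target

print(current_escalation_target(3))   # → on-call engineer (primary)
print(current_escalation_target(7))   # → on-call backup engineer
print(current_escalation_target(12))  # → engineering manager
```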

Real example:

2:30 AM UTC — Database query timing out
  ↓ (Prometheus alert fires)
2:30:01 — PagerDuty SMS to Primary: "P1: DB query timeout, 5000 failures/min"
2:30:45 — Engineer wakes, reads SMS, acknowledges alert
2:31:00 — Engineer has runbook open, running investigation
2:35:00 — Root cause found (index missing), fix deployed
2:37:00 — Incident resolved, post-mortem scheduled

Total time: 7 minutes from alert to fix

The same incident without PagerDuty (alerts posted only to Slack, no one notified):

2:30 AM — Alert fires on Slack
2:32 AM — Alert fires on Slack again (10 failures, 50% traffic affected)
6:00 AM — EU person notices alerts, wakes primary engineer
6:05 AM — Engineer starts investigation
6:45 AM — Fix deployed
7:00 AM — Incident resolved, 4.5 hours customer impact

PagerDuty Pricing

Pricing tiers: Free; Professional ($9/user/month); Enterprise ($29/user/month).

For most teams: Professional tier is sufficient.

OpsGenie (Better for Small/Cost-Sensitive Teams)

Setup is similar to PagerDuty's, with a slightly different UI.

Pricing: generally lower than PagerDuty at comparable tiers; check current rates before committing.

The difference:

PagerDuty: enterprise standard, better for large ops organizations
OpsGenie: simpler, lower cost, better for smaller teams

Most teams use PagerDuty for established operations, OpsGenie for startups.

This guide focuses on PagerDuty but concepts apply to OpsGenie equally.

2. On-Call Rotation Schedule

Simple Weekly Rotation

For a team of 7 engineers:

Mon-Sun Week 1: Alice (primary), Bob (backup)
Mon-Sun Week 2: Charlie (primary), Dave (backup)
Mon-Sun Week 3: Emma (primary), Frank (backup)
Mon-Sun Week 4: Grace (primary), Alice (backup)

With an odd-sized roster the pairings shift each cycle, so over 7 weeks everyone serves exactly once as primary and once as backup.
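The weekly pairing can be generated mechanically. A sketch, assuming a simple round-robin over the roster (a real schedule also needs PTO and holiday overrides):

```python
from itertools import cycle

def weekly_rotation(engineers, weeks):
    """Assign (week, primary, backup) pairs round-robin over the roster.

    Sketch only. With an odd-sized roster, the pairings shift each
    cycle, so everyone eventually serves as both primary and backup.
    """
    roster = cycle(engineers)
    return [(week, next(roster), next(roster)) for week in range(1, weeks + 1)]

team = ["Alice", "Bob", "Charlie", "Dave", "Emma", "Frank", "Grace"]
for week, primary, backup in weekly_rotation(team, 4):
    print(f"Week {week}: {primary} (primary), {backup} (backup)")
```

Running this reproduces the four-week table above, ending with Grace (primary) and Alice (backup).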


Timezone-Aware Rotation

For distributed team:

Primary on-call: Engineer whose timezone is currently in business hours
  (incident volume and detection are both highest during business hours)

Backup on-call: Engineer in opposite timezone
  (If primary doesn't respond, backup is in their morning/evening)

Example setup (US + EU team):

Week 1:
  Mon-Fri 07:00-19:00 UTC: Charlie (EU daytime)
  Nights and weekends (19:00-07:00 UTC; Fri 19:00 - Mon 07:00): Alice (US afternoon/evening/night)

Week 2:
  Mon-Fri 07:00-19:00 UTC: Emma (EU daytime)
  Nights and weekends: Frank (US afternoon/evening/night)

Better for on-call experience (fewer middle-of-night wakeups).

Respecting Boundaries

PagerDuty sleep rule:

Quiet hours: 2 AM - 7 AM on-call engineer's local time
  - Alerts still trigger but don't notify (no SMS/call)
  - Escalate to backup immediately instead

Example: Alice on-call in US Pacific (UTC-7)
  Quiet hours: 2-7 AM PT (9 AM - 2 PM UTC)
  Incident at 4 AM PT → escalates to backup immediately

This prevents burnout (on-call is stressful; middle-of-night wakeups are worse).
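The quiet-hours rule can be sketched as a small routing check. This is an illustration rather than PagerDuty configuration; the zone name and thresholds are the ones from the Alice example above:

```python
from datetime import datetime, time, timezone
from zoneinfo import ZoneInfo  # stdlib, Python 3.9+

QUIET_START, QUIET_END = time(2, 0), time(7, 0)  # engineer's local time

def notify_target(utc_now, oncall_tz):
    """Route to the primary, or straight to the backup in quiet hours.

    oncall_tz is an IANA zone name, e.g. "America/Los_Angeles".
    """
    local = utc_now.astimezone(ZoneInfo(oncall_tz)).time()
    if QUIET_START <= local < QUIET_END:
        return "backup"  # don't wake the primary at 4 AM
    return "primary"

# Incident at 11:00 UTC = 4:00 AM Pacific (during DST) -> escalate
alert = datetime(2026, 3, 21, 11, 0, tzinfo=timezone.utc)
print(notify_target(alert, "America/Los_Angeles"))  # → backup
```

Doing the comparison in the engineer's local zone (via `zoneinfo`) keeps the rule correct across daylight-saving transitions.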


3. Runbook Template

A runbook is “what to do when X breaks.” One page maximum.

Template Structure

# Incident Runbook: Database Connection Pool Exhaustion

## Symptoms
- API returns "Connection timeout" errors
- Database connection count maxed
- Latency spikes on all endpoints

## Diagnosis (< 2 minutes)
1. Log in to Datadog dashboard (link: https://...)
2. Check metric: "postgres_active_connections"
3. If > 90, proceed to resolution
4. Check metric: "query_duration_p99"
5. If > 5s, database is slow (switch to the slow-query runbook)

## Quick Fix (5 minutes)
1. SSH into app-server-1: `ssh ubuntu@app-1.internal`
2. Check connection status: `curl localhost:8080/health`
3. Restart app container: `docker restart app`
4. Verify: Check API returns 200 OK, Datadog shows recovery

If not recovered in 2 minutes, escalate to database team.

## Root Cause Investigation (post-incident)
- Check logs: `grep "Connection pool" /var/log/app.log | tail -100`
- Look for: Query hangs, connection leaks, traffic spike
- Common causes: Slow query, missing index, upstream service failure

## Escalation
If database team on-call unreachable after 3 min, escalate to VP Eng

## Verification Metrics
- Connection count: < 50 (normal)
- Query latency p99: < 200ms
- Error rate: < 0.1%
- All checks green: Incident resolved

## Post-Incident
- Schedule follow-up meeting to investigate root cause
- Implement prevention (e.g., connection pool monitoring)
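The “Verification Metrics” step is mechanical enough to script. A sketch assuming the thresholds from the template; in practice the metric values would come from your monitoring API rather than being passed in directly:

```python
# Recovery thresholds from the runbook's Verification Metrics section.
THRESHOLDS = {
    "postgres_active_connections": 50,  # normal: below 50
    "query_duration_p99_ms": 200,       # normal: below 200 ms
    "error_rate_pct": 0.1,              # normal: below 0.1%
}

def verify_recovery(metrics):
    """Return (resolved, failing); failing lists any metric still red.

    A missing metric counts as failing — no data is not green.
    """
    failing = [name for name, limit in THRESHOLDS.items()
               if metrics.get(name, float("inf")) >= limit]
    return (not failing, failing)

resolved, failing = verify_recovery({
    "postgres_active_connections": 32,
    "query_duration_p99_ms": 140,
    "error_rate_pct": 0.05,
})
print(resolved)  # → True
```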

Real Runbook Examples

Example: Disk Space Exhaustion

# Incident Runbook: Production Disk Space Critical

## Symptoms
- File writes failing (500 errors)
- Datadog alert: "Disk > 95%"
- Log streaming stopping

## Diagnosis (< 2 minutes)
1. SSH in: `ssh ubuntu@prod-1`
2. Check disk: `df -h /data`
3. Identify large files: `du -sh /data/* | sort -h`

## Quick Fix
1. Delete old logs (safe): `find /data/logs -type f -mtime +30 -delete`
2. Restart logging: `systemctl restart rsyslog`
3. Verify: `df -h /data` should drop below 80%, and `curl localhost:8080/health` should return 200

## If Still Critical
Reclaim Docker disk space: `docker system prune -a`
This is more aggressive (it removes all unused images); verify services still run correctly afterward

## Escalation
If above steps don't free space, page infra team
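The shell one-liner in the Quick Fix is what you run under pressure; for a scheduled cleanup job, a Python equivalent with a dry-run guard is safer. A sketch, using the runbook's path and 30-day retention:

```python
import time
from pathlib import Path

def delete_old_logs(log_dir, max_age_days=30, dry_run=True):
    """Delete files older than max_age_days, mirroring
    `find /data/logs -type f -mtime +30 -delete`. Dry-run by default."""
    cutoff = time.time() - max_age_days * 86400
    removed = []
    for path in sorted(Path(log_dir).rglob("*")):
        if path.is_file() and path.stat().st_mtime < cutoff:
            removed.append(str(path))
            if not dry_run:
                path.unlink()
    return removed

# Preview what would be deleted, then re-run for real:
#   delete_old_logs("/data/logs")                  # list only
#   delete_old_logs("/data/logs", dry_run=False)   # actually delete
```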

Example: Payment Service Failure

# Incident Runbook: Payment Processing Down

## Symptoms
- Checkout fails with "Payment gateway error"
- Stripe webhook queue backing up
- Customer emails arriving

## Diagnosis (< 2 minutes)
Check Stripe API status: https://status.stripe.com/
Check internal status page: https://internal/status/stripe-integration
Check logs: `grep "stripe_error" app.log | tail -20`

## Quick Fix Option 1: Stripe is Down
Wait for Stripe recovery, display banner to customers
Enable "maintenance mode" to prevent orders during outage
https://internal/admin/maintenance-mode

## Quick Fix Option 2: Our Integration is Broken
Restart Stripe sync: `kubectl rollout restart deployment/stripe-sync`
Verify: `curl https://internal/api/stripe-health`
Check queue size: `redis-cli GET stripe:queue:length`

## If Queue Backing Up > 1 hour
Page payments team, consider manual order approval
Escalate to CTO

## Post-Incident
- Review Stripe API logs for error patterns
- Add more detailed error logging to catch next time
- Improve monitoring on queue depth

Runbook Best Practices

  1. One page maximum — longer and people skip it
  2. Time budgets in headings — “Diagnosis (< 2 min)” sets expectations
  3. Exact commands — copy/paste should work
  4. Links to tools — don’t make people search for dashboard URLs
  5. Escalation criteria — “if X hasn’t resolved in 5 min, escalate to Y”
  6. Post-incident section — so you improve next time

4. Incident Communication During an Active Incident

Slack Channel Setup

Create: #incidents (or #incident-response for larger teams)

During incident:

  1. Create thread in #incidents with incident ID
  2. One person is “scribe” (writes updates in thread)
  3. Responders post findings/actions to thread
  4. Every 5 minutes, the scribe posts a status update

Example thread:

Thread started: 2026-03-21 02:30 UTC by Alice
Incident ID: INC-2026-3421
Severity: P1 (customers affected)
Status: Investigating

[02:31] Alice: Confirmed database connection exhaustion (492/500 active)
[02:32] Bob: Restarting connection pool service
[02:33] Bob: Pool restarted, connections dropping (now 280/500)
[02:35] Alice: API latency recovering, error rate dropping
[02:37] Status: RESOLVED - all metrics normal, error rate < 0.1%

Root cause: Query optimization missing on bulk user export
Impact: 7 min outage, 2% of transactions failed during window
Post-mortem: Thursday 2pm UTC

Key: Everyone knows status without jumping between channels.

Customer Communication

Public status page setup (tools: Atlassian Statuspage, Instatus, or a custom page):

During incident:

INVESTIGATING — Service partially unavailable
Some customers may experience slow payment processing. Our team is investigating.

[02:31] We've identified unusual database activity
[02:35] We've deployed a fix and are monitoring recovery
[02:37] Service is recovering, all systems normal

Post-incident:

RESOLVED — Full details available in blog post

Root cause: Missing index on bulk export query
Duration: 7 minutes (02:30-02:37 UTC)
Impact: 2% of transactions failed
Prevention: Added database monitoring, index optimization

Full technical post-mortem: https://...

5. Post-Mortem Template

Conduct the post-mortem within 48 hours, while details are fresh.

Format

# Post-Mortem: Database Connection Pool Exhaustion (INC-2026-3421)

## Timeline
02:30 UTC — Prometheus alert fires (DB connections 95%)
02:31 UTC — PagerDuty notifies Alice (on-call engineer)
02:32 UTC — Alice acknowledges, starts investigation
02:33 UTC — Root cause identified: missing index on user export query
02:35 UTC — Missing index added, query redeployed
02:37 UTC — Connections drop, latency recovers
02:45 UTC — All systems stable, incident declared resolved

## Impact
- Duration: 7 minutes
- Affected: ~2% of payment transactions (450 failed)
- Customer-facing: Payment page returned errors
- Team effort: 1 engineer, ~15 min response + fix

## Root Cause
Bulk user export feature added Friday, no performance testing on production dataset.
Query performed full table scan (50M users) instead of indexed lookup.
Query took 45+ seconds per request, exhausted connection pool within minutes.

## Why Wasn't This Caught?
1. Feature had unit tests (passed)
2. Feature had integration tests on staging data (passed, only 10k test users)
3. No performance test against production-scale data
4. No index on table, even though query required it

## Lessons Learned
1. All new queries should have EXPLAIN ANALYZE review
2. Staging environment doesn't match production scale
3. Index recommendations should be automated in code review

## Action Items (Who / When)
1. [Alice] Add database.md runbook for connection pool exhaustion (by Friday)
2. [Bob] Create script to compare staging vs prod data volumes (by next week)
3. [Charlie] Set up automated EXPLAIN ANALYZE checks in CI (by sprint end)
4. [Dave] Review all bulk query code for index coverage (by next week)

## Follow-Up
- Review in 1 week (are action items complete?)
- Monitor bulk export performance daily for next 2 weeks
- Mention in team standup (everyone learns from this)

Post-Mortem Best Practices

What NOT to do: assign blame, single out the engineer who shipped the change, or stop at “human error” as the root cause.

What TO do: keep it blameless, focus on the system and process gaps that let the failure through, and leave with action items that have owners and dates.


6. Complete Setup Checklist

Week 1: Foundation

Set up PagerDuty (or OpsGenie), define the escalation policy, and publish the on-call rotation.

Week 2: Runbooks

Write one-page runbooks for your three most common incident types.

Week 3: Communication

Create the #incidents channel, stand up the public status page, and agree on a post-mortem template.

Week 4: Validation

Run a fire drill end to end, time the escalation path, and start tracking MTTA and MTTR.


Real Metrics to Track

After 2 weeks of process:

Mean Time to Acknowledge (MTTA):
- Before: No process (alerts buried in Slack, ~30 min)
- After: 2 minutes (PagerDuty SMS/call)

Mean Time to Recovery (MTTR):
- Before: 45 minutes (waiting for morning, lack of runbook)
- After: 12 minutes (runbook + prepared engineer)

Time to Escalation:
- Before: No process, unclear
- After: 5 minutes to first backup, 10 to manager

Customer Impact Severity:
- Before: Major incidents often hit customers before the team was aware
- After: Most incidents resolved before the status page update even goes out
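MTTA and MTTR are averages of per-incident timestamps. A sketch that computes the per-incident values, using the 2:30 AM example from section 1:

```python
from datetime import datetime

def incident_metrics(alert_at, acked_at, resolved_at):
    """Per-incident inputs for MTTA/MTTR, in seconds.
    Average these across incidents to get the mean values."""
    return {
        "time_to_acknowledge_s": (acked_at - alert_at).total_seconds(),
        "time_to_recovery_s": (resolved_at - alert_at).total_seconds(),
    }

# The 2:30 AM database incident from section 1:
m = incident_metrics(
    datetime(2026, 3, 21, 2, 30, 0),   # Prometheus alert fires
    datetime(2026, 3, 21, 2, 30, 45),  # engineer acknowledges
    datetime(2026, 3, 21, 2, 37, 0),   # incident resolved
)
print(m["time_to_acknowledge_s"], m["time_to_recovery_s"])  # → 45.0 420.0
```

A 45-second acknowledge and a 7-minute recovery are exactly the numbers that make the before/after comparison above worth the setup effort.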

Common Mistakes

Mistake 1: Runbook too long (3+ pages). Nobody reads it mid-incident; cap it at one page.

Mistake 2: Post-mortems become blame sessions. Engineers stop reporting honestly; keep them blameless.

Mistake 3: On-call rotation unfair. The same people absorb the night shifts; rotate primary and backup evenly across the team.

Mistake 4: No escalation policy. Unacknowledged alerts go nowhere; define who gets paged at 5 and 10 minutes.

Mistake 5: Runbooks never updated. Stale commands fail during the next incident; review runbooks in every post-mortem.


Frequently Asked Questions

How long does it take to set up a remote-team incident response process?

A basic setup (a PagerDuty account, one escalation policy, one rotation) takes an afternoon. The full process in this guide, including runbooks, timezone-aware rotations, and post-mortem practice, follows the four-week checklist above at a few hours per week.

What are the most common mistakes to avoid?

The most frequent issues are the ones listed under Common Mistakes: runbooks longer than a page, no escalation policy, blameful post-mortems, and rotations that always hit the same people at night. Fix the escalation policy first; it has the largest effect on response time.

Do I need prior experience to follow this guide?

Basic familiarity with your monitoring stack and the command line helps but isn't strictly required. Each section explains the reasoning alongside the configuration, and the PagerDuty and OpsGenie documentation covers tool fundamentals.

Can I adapt this for a different tech stack?

Yes. Alert routing, escalation, runbooks, and blameless post-mortems are tool-agnostic. Swap PagerDuty for OpsGenie or a similar pager, and Prometheus or DataDog for whatever monitoring you run; the timings and templates carry over unchanged.

Where can I get help if I run into issues?

Start with the official PagerDuty or OpsGenie documentation for configuration specifics. For process questions, PagerDuty's public incident response documentation and the incident management chapters of Google's SRE book cover the same ground in more depth.