A playbook is a documented procedure that any team member can execute without needing the author present. For remote teams across time zones, good playbooks turn what would be a “wake someone up” situation into a “follow these steps” situation. The difference is in the structure and completeness of the template.
This guide provides templates for three core playbook types: incident response, deployment, and onboarding.
Incident Response Playbook Template
Incident - [INCIDENT-NAME]
Severity - P1 / P2 / P3
Status - Active / Resolved
Incident Commander - @[owner]
Started - YYYY-MM-DD HH:MM UTC
Resolved - YYYY-MM-DD HH:MM UTC (fill when resolved)
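Filling in the header by hand at the start of an incident is easy to get wrong under pressure. As a minimal sketch, a small shell function could stamp the fields with the current UTC time. The function name `incident_header` and its argument order are assumptions for illustration, not part of the template.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print a pre-filled incident header.
# Usage: incident_header NAME SEVERITY OWNER [STARTED]
incident_header() {
  local name="$1" severity="$2" owner="$3"
  # Default the start time to "now" in UTC, matching the template's format.
  local started="${4:-$(date -u '+%Y-%m-%d %H:%M UTC')}"
  printf 'Incident - %s\n' "$name"
  printf 'Severity - %s\n' "$severity"
  printf 'Status - Active\n'
  printf 'Incident Commander - @%s\n' "$owner"
  printf 'Started - %s\n' "$started"
  printf 'Resolved - (fill when resolved)\n'
}

# Example: incident_header "payments-502s" P1 alice > incident.md
```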
---
What Is Happening
A one-paragraph, plain-language description of the incident. What is affected? Who is affected? What is the user-visible impact?
> Example - "The payments API is returning 502 errors for ~40% of checkout attempts. Approximately 200 users per hour are unable to complete purchases. The error started at 14:23 UTC."
Current Status
> Example - "Identified root cause (database connection pool exhausted). Implementing fix. ETA 30 minutes."
Timeline
| Time (UTC) | Event |
|------------|-------|
| 14:23 | First error alerts fired |
| 14:31 | On-call acknowledged |
| 14:45 | Root cause identified |
| 15:10 | Fix deployed |
| 15:15 | Confirmed resolved |
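
A timeline like this one is also the raw material for response metrics in the post-mortem. The following is a rough sketch of that arithmetic; the helper name `minutes_between` is hypothetical, and it assumes both timestamps are `HH:MM` values no more than 24 hours apart.

```bash
#!/usr/bin/env bash
# Hypothetical helper: minutes elapsed between two HH:MM timestamps.
minutes_between() {
  local start="$1" end="$2"
  # 10# forces base 10 so values such as "08" are not parsed as octal.
  local s=$(( 10#${start%%:*} * 60 + 10#${start##*:} ))
  local e=$(( 10#${end%%:*} * 60 + 10#${end##*:} ))
  # If the end time is earlier than the start, assume it crossed midnight.
  if (( e < s )); then
    e=$(( e + 1440 ))
  fi
  echo $(( e - s ))
}

echo "Time to acknowledge: $(minutes_between 14:23 14:31) min"
echo "Time to identify:    $(minutes_between 14:23 14:45) min"
echo "Time to resolve:     $(minutes_between 14:23 15:15) min"
```

For the example timeline this gives 8, 22, and 52 minutes respectively.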
Impact
- Services affected: payments-api, checkout-frontend
- Error rate: ~40%
- Users affected: ~200/hour
- Revenue impact: ~$4,000/hour estimate
- External customers notified: [Yes/No] via [status.yourcompany.com]
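The revenue figure is usually a back-of-the-envelope estimate: blocked users per hour multiplied by an average order value. The sketch below shows that calculation; the $20 average order value is an assumption chosen to be consistent with the example numbers (4,000 / 200), and `revenue_impact_per_hour` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: estimated revenue loss per hour, in whole dollars.
# Arguments: failed checkouts per hour, average order value in dollars.
revenue_impact_per_hour() {
  local failed_per_hour="$1" avg_order_value="$2"
  echo $(( failed_per_hour * avg_order_value ))
}

# ~200 blocked users per hour at an assumed $20 average order value.
echo "Revenue impact: ~\$$(revenue_impact_per_hour 200 20)/hour estimate"
```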
---
Diagnosis Steps
Run these commands to gather context:
```bash
# Check service health
kubectl get pods -n production | grep payments

# View recent error logs
kubectl logs -n production deployment/payments-api --since=30m | grep ERROR | tail -50

# Check database connectivity
kubectl exec -n production deployment/payments-api -- \
  pg_isready -h "$DB_HOST" -p 5432

# View active DB connections
psql -h "$DB_HOST" -U postgres -c \
  "SELECT count(*), state FROM pg_stat_activity GROUP BY state;"
```
Mitigation Options
| Option | Risk | ETA | Steps |
|---|---|---|---|
| Restart pods | Low | 2 min | `kubectl rollout restart deployment/payments-api -n production` |
| Roll back deployment | Low | 5 min | `kubectl rollout undo deployment/payments-api -n production` |
| Increase DB pool size | Medium | 15 min | Edit the `DB_POOL_SIZE` env var and redeploy |
| Enable maintenance mode | Medium | 2 min | Set `MAINTENANCE_MODE=true` in config and redeploy |
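
The two low-risk options map directly to single commands, so a responder working at 3 a.m. may prefer a wrapper over retyping them. This is only a sketch: the `mitigate` function and the `DRY_RUN` variable are hypothetical, and the default deployment and namespace are taken from the example incident.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper around the low-risk mitigation options.
# Usage: mitigate restart|rollback [DEPLOYMENT] [NAMESPACE]
# With DRY_RUN=1 it prints the command instead of running it.
mitigate() {
  local option="$1"
  local deployment="${2:-payments-api}" namespace="${3:-production}"
  local -a cmd
  case "$option" in
    restart)  cmd=(kubectl rollout restart "deployment/$deployment" -n "$namespace") ;;
    rollback) cmd=(kubectl rollout undo "deployment/$deployment" -n "$namespace") ;;
    *)
      echo "unknown option: $option (expected restart or rollback)" >&2
      return 2
      ;;
  esac
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "${cmd[*]}"
  else
    "${cmd[@]}"
  fi
}

# Example: DRY_RUN=1 mitigate rollback
```

Running with `DRY_RUN=1` first lets the incident commander confirm the exact command in the channel before anyone executes it.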
Resolution
What was done to resolve the incident. What was the root cause?
Follow-up Actions
- Write post-mortem by [DATE] (@[owner])
- Add alert for [condition] (@[owner])
- Fix root cause permanently (ENG-[ticket])
- Update runbook with new steps (@[owner])
Deployment Playbook Template
Deployment - [SERVICE-NAME] v[VERSION]
Deployer - @[name]
Date - YYYY-MM-DD
Environment - staging / production
Deploy type - Standard / Hotfix / Rollback
PR / Release - [link]
---
Pre-Deployment Checklist
Code
- [ ] PR approved by required reviewers
- [ ] All CI checks passing
- [ ] CHANGELOG updated
- [ ] Migration scripts reviewed (if applicable)
Staging Verified
- [ ] Deployed to staging successfully
- [ ] Smoke tests passing on staging
- [ ] New feature tested on staging
Dependencies
- [ ] Dependent services notified
- [ ] External APIs/webhooks compatible with new version
- [ ] Feature flags configured for gradual rollout
Rollback Plan
- [ ] Previous version noted: `v[PREVIOUS_VERSION]`
- [ ] Rollback command tested: `kubectl rollout undo deployment/[service]`
- [ ] Database migrations are reversible: [Yes / No; explain if No]
---
Deployment Steps
1. Announce
Post in #deployments Slack channel:
Deploying [service] v[version] to production.
Changes - [brief summary]
Risk - Low / Medium / High
Rollback ready - yes
2. Deploy
```bash
# Tag and push (if not automated)
git tag v[version]
git push origin v[version]

# Trigger deploy (if manual)
kubectl set image deployment/[service] [service]=[registry]/[service]:v[version] -n production

# Wait for rollout
kubectl rollout status deployment/[service] -n production --timeout=5m
```
3. Verify
```bash
# Check pods are running
kubectl get pods -n production -l app=[service]

# Confirm new version
kubectl describe deployment/[service] -n production | grep Image

# Check error rate (first 5 minutes): run this every 60 seconds, five times.
# -G with --data-urlencode keeps curl from treating {} and [] in the query as URL globs.
curl -sG "https://monitoring.yourcompany.com/api/v1/query" \
  --data-urlencode "query=rate(http_requests_total{service='[service]',status=~'5..'}[1m])" \
  | jq -r '.data.result[0].value[1]'
```
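Repeating the error-rate check by hand every 60 seconds is easy to forget mid-deploy. One possible way to automate it is sketched below. The function names `rate_exceeds` and `watch_error_rate` are hypothetical, and the threshold is whatever your team treats as acceptable; the 0.05 in the usage comment is an illustrative assumption. The query command is expected to print a bare number.

```bash
#!/usr/bin/env bash
# Hypothetical helper: exit 0 when RATE is strictly above THRESHOLD.
# awk does the floating-point comparison that plain bash cannot.
rate_exceeds() {
  awk -v rate="$1" -v max="$2" 'BEGIN { exit !(rate + 0 > max + 0) }'
}

# Poll QUERY_CMD (a command or function that prints the current 5xx rate).
# Usage: watch_error_rate QUERY_CMD THRESHOLD [CHECKS] [INTERVAL_SECONDS]
# Returns 1 as soon as one sample is above the threshold, 0 otherwise.
watch_error_rate() {
  local query_cmd="$1" threshold="$2" checks="${3:-5}" interval="${4:-60}"
  local i rate
  for (( i = 1; i <= checks; i++ )); do
    rate="$("$query_cmd")"
    echo "check $i/$checks: 5xx rate = ${rate:-no data}"
    # Empty or "null" output means the query returned no series; skip it.
    if [ -n "$rate" ] && [ "$rate" != "null" ] && rate_exceeds "$rate" "$threshold"; then
      echo "5xx rate $rate is above $threshold: consider rolling back" >&2
      return 1
    fi
    # Sleep between samples, but not after the last one.
    if (( i < checks )); then
      sleep "$interval"
    fi
  done
  return 0
}

# Example, assuming query_rate wraps your monitoring query:
#   watch_error_rate query_rate 0.05
```

A non-zero exit status makes the function easy to chain with the rollback command in a deploy script.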
4. Post-Deploy
- Confirm smoke tests pass in production
- Update status page if maintenance window was posted
- Announce completion in #deployments
- Enable feature flags for gradual rollout (if applicable)
Rollback Procedure
If error rate increases or critical errors appear:
```bash
# Immediate rollback
kubectl rollout undo deployment/[service] -n production
kubectl rollout status deployment/[service] -n production

# Verify rollback
kubectl describe deployment/[service] -n production | grep Image
```
Post in #deployments:
ROLLBACK - [service] rolled back to v[previous_version].
Reason - [brief explanation]
Investigation ongoing in #incidents

---
Onboarding Playbook Template
Onboarding - [ENGINEER_NAME]
Start Date - YYYY-MM-DD
Role - [role]
Manager - @[manager]
Buddy - @[buddy]
Team - [team name]
---
Week 1 - Foundation
Day 1. Access and Setup
IT/Admin tasks (Manager)
- [ ] Google Workspace account created
- [ ] GitHub org invitation sent
- [ ] Slack invitation sent
- [ ] 1Password team invitation sent
- [ ] PagerDuty account created (if on-call eligible)
- [ ] AWS/GCP/Azure console access configured
Environment setup (New Hire)
```bash
# Clone the onboarding repo for setup scripts
git clone git@github.com:your-org/onboarding.git
cd onboarding && make setup
```
- Dev environment set up (run `make setup`)
- Can run the project locally
- VPN configured and tested
- 2FA enabled on all accounts
Day 2-3. Codebase Orientation
- Read architecture overview in Notion: [link]
- Read team norms doc: [link]
- Review last 3 post-mortems: [link]
- Complete first “good first issue”: [ticket link]
- First PR submitted and reviewed
Day 4-5. Process and Context
- Attended team standup
- Met with buddy (30-min async Loom or sync call)
- Met with manager (1:1 scheduled)
- Read current sprint goals
Week 2 - Contributing
- First PR merged to main
- Attended or watched team retro recording
- Added to on-call rotation schedule (shadow week)
- Read and acknowledged security policies
- Set up error tracking alerts for owned services
Ongoing - 30/60/90 Day Goals
30 Days
- Ship 3 non-trivial PRs
- Understand the data model for core services
- Be able to debug a production issue independently
60 Days
- Own at least one feature end-to-end
- Be confident with deployment process
- Contribute to team norms doc with at least one edit
90 Days
- Lead a project or feature
- Mentor a more junior engineer on a PR review
- Identify and fix one piece of technical debt
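The 30/60/90-day checkpoints are simple date offsets from the start date, so they can be computed when the playbook is created and added to the manager's calendar. A minimal sketch, assuming GNU `date` (its `-d` option is not available in BSD/macOS `date`); `milestone_dates` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print 30/60/90-day review dates for a start date.
# Relies on GNU date's -d option; -u avoids daylight-saving surprises.
milestone_dates() {
  local start="$1" days
  for days in 30 60 90; do
    echo "$days days: $(date -u -d "$start +$days days" +%F)"
  done
}

milestone_dates 2024-01-15
```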
Key Resources
| Resource | Link |
|----------|------|
| Architecture overview | [link] |
| API documentation | [link] |
| Deployment runbook | [link] |
| Incident response playbook | [link] |
| On-call rotation | [link] |
| Team calendar | [link] |
---
Storing and Accessing Playbooks
Playbooks rot if they're not maintained. The best storage is wherever your team already looks:
```text
# Notion database with properties
Title - [text]
Type - [Incident / Deployment / Onboarding / Process]
Owner - [person]
Last Reviewed - [date]
Status - [Active / Draft / Archived]

# GitHub Wiki (version-controlled)
docs/
  playbooks/
    incident-response.md
    deployment-standard.md
    onboarding.md
```
Use a scheduled GitHub Actions workflow to flag stale playbooks for review:
```yaml
name: Playbook Review Reminder
on:
  schedule:
    - cron: '0 9 1 * *'  # 09:00 UTC on the first of each month
jobs:
  remind:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # full history, needed for last-commit dates
      - name: Check for stale playbooks
        env:
          SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}
        run: |
          # A fresh checkout resets file mtimes, so compare last-commit dates
          # instead of using find -mtime.
          cutoff=$(date -d '90 days ago' +%s)
          git ls-files 'docs/playbooks/*.md' | while read -r file; do
            last=$(git log -1 --format=%ct -- "$file")
            if [ -n "$last" ] && [ "$last" -lt "$cutoff" ]; then
              curl -s -X POST \
                -H 'Content-type: application/json' \
                --data "{\"text\":\"STALE PLAYBOOK: $file. Please review and update.\"}" \
                "$SLACK_WEBHOOK"
            fi
          done
```