Last updated: March 21, 2026
A runbook is a document that an engineer unfamiliar with a system can follow to complete an operational task correctly, alone, under time pressure, at 3am. That is the test. If your runbooks require institutional knowledge, Slack messages to get context, or judgment calls that are not documented, they will fail exactly when you need them most.
## Table of Contents
- What a Runbook Is Not
- Runbook Structure
- Write for the Worst Case
- Decision Trees for Non-Linear Procedures
- Embed Exact Commands, Not Descriptions
- Keep Commands Copy-Pasteable
- Maintenance: Keep Runbooks Current
- Runbook Inventory
- Related Reading
Remote teams are especially dependent on good runbooks — there is no one to turn to in the next cubicle. This guide covers how to write runbooks that actually work.
## What a Runbook Is Not
Before writing, clarify the distinction:
- Runbook: step-by-step procedure for a specific operational task (deploy a hotfix, restart a service, rotate a certificate)
- Architecture doc: how the system is designed
- Postmortem: what went wrong and why
- Playbook: collection of runbooks and decision guides for an incident type
Runbooks are narrow and task-specific. “Deploy to production” is a runbook. “How our deployment architecture works” is not.
## Runbook Structure
Every runbook follows the same structure regardless of the task:
````markdown
# [Task Name]

**Owner**: [team or person responsible for keeping this current]
**Last tested**: [YYYY-MM-DD]
**Estimated time**: [X minutes]
**Impact**: [what this affects, e.g. "restarts the API, expect 30s downtime"]

## Prerequisites

What the executor needs before starting:
- [ ] Access to [system name] with [permission level]
- [ ] [Tool] installed and configured
- [ ] Notify [#channel] before starting

## Steps

### 1. [First major action]

Brief explanation of why this step exists (one sentence).

```bash
# Command to run
exact-command --with-flags
```

Expected output:

```
what you should see if it worked
```

If you see [error], do [specific action]. If you see [other error], STOP and escalate to [contact].

### 2. [Second major action]

…

## Verification

How to confirm the task completed successfully:

```bash
# Check command
curl -s https://yourservice.com/health | jq '.status'
```

Expected: `"ok"`. If not, see Rollback.

## Rollback

If the task needs to be reversed:

```bash
# Rollback command
exact-rollback-command
```

## Escalation

If this runbook does not resolve the situation:
- Ping [person/team] in [#channel]
- Page via [PagerDuty rotation name] for P1 issues
- Link to postmortem template: [link]
````
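Because every runbook shares the same sections, a small check can flag a file that is missing one. The following is a minimal sketch, not part of the template itself: the helper name `missing_sections` is hypothetical, and it assumes each section appears as an exact `## Name` heading line.

```bash
# Hypothetical helper: print each required section heading that is missing
# from the runbook file given as $1. Returns non-zero if any are missing.
missing_sections() {
  runbook_file=$1
  status=0
  for section in Prerequisites Steps Verification Rollback Escalation; do
    # -x matches the whole line, -F treats the heading as a literal string
    if ! grep -qxF "## $section" "$runbook_file"; then
      echo "$section"
      status=1
    fi
  done
  return "$status"
}
```

Running `missing_sections deploy-to-production.md` (a hypothetical file name) prints nothing and returns 0 when all five sections are present.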
## Write for the Worst Case
The person executing your runbook may be:
- Junior, unfamiliar with the system
- Tired, during an incident that has been ongoing for 4 hours
- In a different time zone, with no one else online
- Your newest hire, in their second week
Write accordingly. Every step should answer: “What do I type, what do I see if it worked, what do I do if it doesn’t?”
````markdown
## BAD: Ambiguous step

### 3. Restart the application

Restart the app server.

---

## GOOD: Explicit step

### 3. Restart the application server

The app server may enter a stuck state during high traffic. Restarting clears the connection pool.

SSH into the app server:

```bash
ssh deploy@app-server-1.internal
```

Check the current service status before restarting:

```bash
sudo systemctl status myapp
```

Expected output includes `Active: active (running)`. If you see `failed`, note the error before continuing. Do not restart without understanding why it failed first.

Restart the service:

```bash
sudo systemctl restart myapp
```

Wait 15 seconds, then verify it started cleanly:

```bash
sudo systemctl status myapp
journalctl -u myapp -n 20 --no-pager
```

Expected: status shows `Active: active (running)` for at least 10 seconds. Logs show no `ERROR` or `FATAL` lines.

If the service fails to start after restart, STOP. Do not retry. Escalate to [#on-call] immediately.
````
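The "what do I see if it worked" part of a step can also be made mechanical, so a tired reader does not have to scan log output by eye. This is a minimal sketch, assuming the application logs the literal words `ERROR` or `FATAL` as in the example step; `logs_are_clean` is a hypothetical helper name.

```bash
# Hypothetical helper: read log lines on stdin and fail if any contain
# ERROR or FATAL. Matching lines are echoed so they can be pasted into
# the escalation message.
logs_are_clean() {
  if grep -E 'ERROR|FATAL'; then
    echo "Log check failed: STOP and escalate." >&2
    return 1
  fi
  return 0
}
```

One possible use is `journalctl -u myapp -n 20 --no-pager | logs_are_clean`, which succeeds silently only when none of the last 20 log lines match.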
## Decision Trees for Non-Linear Procedures
Some procedures have branching paths — the right steps depend on what you observe. Decision trees prevent silent wrong choices.
````markdown
## Diagnose Database Connection Failures

Start here:

**Can you connect to the database directly?**

```bash
psql -h db.internal -U appuser -d myapp -c "SELECT 1"
```

→ YES (returns 1): Application config issue. Go to Step 3: Check App Config.

→ NO (connection refused): Database is down or unreachable. Go to Step 2: Check Database Status.

→ NO (authentication failed): Credential rotation may have happened. Go to Step 4: Rotate Credentials.

→ NO (timeout): Network issue. Go to Step 5: Check Network.
````
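The failure branches of this tree can be sketched as a shell `case` statement that maps the error text to the next step. Treat this as an illustration under stated assumptions: `next_step_for` is a hypothetical helper, the matched phrases are guesses at typical PostgreSQL client messages that should be confirmed against your own output, and the final catch-all branch is an addition that is not in the tree above.

```bash
# Hypothetical helper: map the error text of a failed psql attempt ($1)
# to the next step in the decision tree. The matched phrases are
# assumptions about typical client messages, not a complete list.
next_step_for() {
  case "$1" in
    *[Cc]"onnection refused"*)  echo "Step 2: Check Database Status" ;;
    *"authentication failed"*)  echo "Step 4: Rotate Credentials" ;;
    *"timeout"*|*"timed out"*)  echo "Step 5: Check Network" ;;
    # Assumed fallback: anything unrecognized is escalated, not guessed at
    *)                          echo "STOP: unrecognized failure, escalate" ;;
  esac
}
```

To feed it real output, capture only the error stream of the connection test, for example `err=$(psql -h db.internal -U appuser -d myapp -c "SELECT 1" 2>&1 >/dev/null) || next_step_for "$err"`.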
## Embed Exact Commands, Not Descriptions
````markdown
## BAD: Description only

Check the disk usage and free up space if needed.

---

## GOOD: Exact commands

Check disk usage:

```bash
df -h /
```

If `/` is above 85% used, find and remove old log files:

```bash
# Find logs older than 30 days
find /var/log -name "*.gz" -mtime +30 -type f

# Review the list, then delete
find /var/log -name "*.gz" -mtime +30 -type f -delete

# Verify space freed
df -h /
```
````
Never use `...` or `etc.` in a runbook. Every step must be fully specified.
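A threshold such as "above 85% used" is itself a judgment the reader has to make from `df` output, and it can be turned into a yes/no check. The sketch below assumes the POSIX output format of `df -P`, where the fifth column is the capacity; `usage_percent` and `needs_cleanup` are hypothetical helper names.

```bash
# Hypothetical helper: read `df -P` output on stdin and print the use
# percentage of the first filesystem listed, without the % sign.
usage_percent() {
  awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Hypothetical helper: succeed only when usage is above the threshold ($1).
needs_cleanup() {
  [ "$(usage_percent)" -gt "$1" ]
}
```

One possible use is `df -P / | needs_cleanup 85 && echo "Above 85%: run the cleanup step"`.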
## Keep Commands Copy-Pasteable
Remote engineers executing a runbook at 3am should not be transcribing commands. Every command block should be:
1. Complete — includes all flags and arguments, not just the relevant portion
2. Executable as-is — no `[INSERT_VALUE_HERE]` placeholders in the middle of commands
3. Correct for the target OS — do not mix macOS and Linux commands without labeling them
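The second rule can be checked mechanically before a runbook is published. This is a minimal sketch, assuming placeholders follow the `[UPPER_CASE]` style shown above; `find_placeholders` is a hypothetical helper name, and the pattern deliberately ignores Markdown checkboxes such as `[ ]`.

```bash
# Hypothetical helper: print (with line numbers) every line of the file
# given as $1 that still contains an unfilled [UPPER_CASE] placeholder.
# Returns non-zero when at least one is found.
find_placeholders() {
  if grep -nE '\[[A-Z][A-Z0-9_]*\]' "$1"; then
    return 1
  fi
  return 0
}
```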
````markdown
## BAD: Requires substitution mid-command

```bash
kubectl rollout restart deployment/[APP_NAME] -n [NAMESPACE]
```

---

## GOOD: Variables declared explicitly before commands

Set these variables for your deployment:

```bash
export APP_NAME=myapp
export NAMESPACE=production
```

Then restart the deployment:

```bash
kubectl rollout restart deployment/${APP_NAME} -n ${NAMESPACE}
kubectl rollout status deployment/${APP_NAME} -n ${NAMESPACE} --timeout=120s
```
````
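Declaring variables up front leaves one gap: if the reader skips the `export` lines, the later commands run with empty values. A guard along the following lines can close it. This is a sketch; `require_vars` is a hypothetical helper, not part of `kubectl` or any standard tool.

```bash
# Hypothetical helper: fail with a clear message if any named variable
# is unset or empty, so a command never runs with a blank value.
require_vars() {
  missing=0
  for var_name in "$@"; do
    # Indirect lookup: copy the value of the variable named by $var_name
    eval "var_value=\${$var_name:-}"
    if [ -z "$var_value" ]; then
      echo "Missing required variable: $var_name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Placing `require_vars APP_NAME NAMESPACE || exit 1` before the first `kubectl` command would stop a scripted runbook early instead of restarting the wrong deployment.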
## Maintenance: Keep Runbooks Current
A runbook that is six months out of date is worse than no runbook — the engineer follows it with confidence and hits unexpected errors.
```markdown
# Runbook Maintenance Process

## When a runbook must be updated:
- After any system change that affects the procedure
- After an incident where following the runbook led to unexpected results
- After each quarterly review

## Quarterly review checklist:
- [ ] Test the procedure end-to-end in staging
- [ ] Update all screenshots (if any)
- [ ] Verify all command outputs still match expected
- [ ] Update "Last tested" date
- [ ] Confirm all linked resources still exist
```
Assign runbook ownership explicitly. A runbook without a named owner gets updated by nobody.
## Runbook Inventory
Track all runbooks in a single index:
# Runbook Index
| Runbook | Owner | Last Tested | Estimated Time |
|---|---|---|---|
| Deploy to Production | @mike | 2026-03-01 | 15 min |
| Database Failover | @sarah | 2026-02-15 | 45 min |
| SSL Certificate Renewal | @alex | 2026-01-20 | 10 min |
| Rollback a Deploy | @mike | 2026-03-01 | 10 min |
| Add a New Engineer's Access | @ops | 2026-02-28 | 20 min |
The index should live in the same location as the runbooks (Obsidian vault, Confluence space, or Notion database) and be the first page an on-call engineer opens.
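If the index is kept as a Markdown table like the one above, the "Last Tested" column can drive the quarterly review. The sketch below lists every runbook tested before a given cutoff date. `stale_runbooks` is a hypothetical helper, and it assumes the column order shown in the example index and ISO `YYYY-MM-DD` dates.

```bash
# Hypothetical helper: read the index table on stdin and print the name of
# every runbook whose "Last Tested" date is earlier than the cutoff ($1).
# ISO dates (YYYY-MM-DD) sort correctly as plain strings.
stale_runbooks() {
  awk -F'|' -v cutoff="$1" '
    {
      name = $2; tested = $4
      gsub(/^ +| +$/, "", name)
      gsub(/^ +| +$/, "", tested)
      # Only rows with a real date qualify; this skips header and separator
      if (tested ~ /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]$/ &&
          (tested "") < (cutoff "")) print name
    }'
}
```

The cutoff is passed in, not computed, because date arithmetic flags differ between GNU and BSD `date`. With the example index and a cutoff of `2026-02-01`, only "SSL Certificate Renewal" is reported.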
## Related Reading
- ADR Tools for Remote Engineering Teams
- Obsidian for Remote Team Knowledge Management
- Async Decision Making with RFC Documents for Engineering Teams
- Async Release Notes Writing Process for Distributed
## Related Articles
- Remote Incident Response Runbook Guide (2026)
- Migration runbook example structure
- How to Organize Remote Team Runbook Documentation for
- How to Build a Remote Team Runbook Library 2026
- Best GitBook Alternative for Remote Engineering Teams

Built by theluckystrike. More at zovo.one
## Frequently Asked Questions

### How long does it take to write runbooks for remote engineering teams?

For a straightforward setup, expect 30 minutes to 2 hours depending on your familiarity with the tools involved. Complex configurations with custom requirements may take longer. Having your credentials and environment ready before starting saves significant time.

### What are the most common mistakes to avoid?

The most frequent issues are skipping prerequisite steps, using outdated package versions, and not reading error messages carefully. Follow the steps in order, verify each one works before moving on, and check the official documentation if something behaves unexpectedly.

### Do I need prior experience to follow this guide?

Basic familiarity with the relevant tools and command line is helpful but not strictly required. Each step is explained with context. If you get stuck, the official documentation for each tool covers fundamentals that may fill in knowledge gaps.

### Can I adapt this for a different tech stack?

Yes, the underlying concepts transfer to other stacks, though the specific implementation details will differ. Look for equivalent libraries and patterns in your target stack. The architecture and workflow design remain similar even when the syntax changes.

### Where can I get help if I run into issues?

Start with the official documentation for each tool mentioned. Stack Overflow and GitHub Issues are good next steps for specific error messages. Community forums and Discord servers for the relevant tools often have active members who can help with setup problems.