Why Agents Need Runbooks
Most agent failures are not model failures. They are operating failures. The agent had a vague goal, unclear permissions, no proof requirement, and no recovery path when reality did not match the prompt. A runbook fixes that by turning an agent from a clever assistant into a managed business process.
A strong runbook tells the agent what to do, when to do it, what tools it may use, what evidence it must produce, and when it must stop. It also tells the human operator how to inspect, restart, or retire the workflow without reverse-engineering the original prompt.
VIP rule: If a workflow touches customers, revenue, production data, public content, or credentials, it needs a runbook before it gets more autonomy.
The Seven-Part Agent Runbook
Every production agent runbook should fit on one page before it expands into supporting details. Keep the top layer simple enough that another operator can understand the system in five minutes.
- Mission: the business result this agent is responsible for, written in plain language.
- Trigger: what starts the run — schedule, assignment, webhook, human command, file change, or queue item.
- Inputs: the files, APIs, records, context, or messages the agent is allowed to read.
- Actions: the tools and mutations the agent may perform.
- Proof gates: the checks required before the work is trusted or shipped.
- Escalations: the exact conditions where the agent stops and asks for a decision.
- Recovery: how to retry, roll back, disable, or hand the task to a human.
Start With the Trigger
The trigger defines the agent's shape. A scheduled agent needs drift detection and quiet-success behavior. A webhook agent needs idempotency. A human-commanded agent needs clear confirmation and scope boundaries. A queue worker needs checkout, lock, and completion rules.
trigger:
  kind: scheduled heartbeat
  cadence: every 4 hours
  source: paperclip issue queue
  start_condition: assigned issue exists
  quiet_success: no assigned issue found
  duplicate_guard: do not run if another checkout lock is active
Without this section, the agent will eventually run at the wrong time, process the same work twice, or invent work because it has no definition of a clean idle state.
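As a rough sketch, the three trigger rules in that config (start condition, quiet success, duplicate guard) reduce to one decision made before any work starts. The function name, its arguments, and the returned strings are assumptions for illustration; a real queue would supply the issue list and the lock state.

```python
from typing import List


def decide_run(assigned_issues: List[str], lock_active: bool) -> str:
    """Decide what a scheduled heartbeat should do on this tick."""
    if lock_active:
        # duplicate_guard: another run holds the checkout lock
        return "skip"
    if not assigned_issues:
        # quiet_success: a clean idle state, not a reason to invent work
        return "quiet_success"
    # start_condition: at least one assigned issue exists
    return "start"
```

Checking the lock first means a second heartbeat never touches work that is already checked out, even when new issues are waiting.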
Define Permissions Before Tools
Tool lists are not enough. The runbook should explain which actions are allowed inside normal operation and which actions require approval. This prevents the most dangerous form of agent drift: using a valid tool for an invalid purpose.
- Read-only: inspect source files, logs, dashboards, and assigned tickets.
- Safe writes: create draft files, comments, local proof artifacts, or non-public branches.
- Controlled writes: commit code, open pull requests, update tickets, or deploy preview environments.
- Approval-gated: spend money, email customers, publish public claims, delete data, alter credentials, or deploy production.
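One way to make these tiers enforceable is to classify actions rather than tools, and to deny anything unclassified by default. The sketch below assumes hypothetical action names; only the four tiers come from the list above.

```python
from enum import IntEnum


class Tier(IntEnum):
    READ_ONLY = 0
    SAFE_WRITE = 1
    CONTROLLED_WRITE = 2
    APPROVAL_GATED = 3


# Hypothetical action names mapped to the four tiers.
ACTION_TIERS = {
    "read_logs": Tier.READ_ONLY,
    "create_draft_file": Tier.SAFE_WRITE,
    "open_pull_request": Tier.CONTROLLED_WRITE,
    "deploy_production": Tier.APPROVAL_GATED,
    "email_customers": Tier.APPROVAL_GATED,
}


def is_allowed(action: str, approved: bool = False) -> bool:
    """Allow lower tiers freely; gated or unknown actions need approval."""
    tier = ACTION_TIERS.get(action, Tier.APPROVAL_GATED)
    return tier < Tier.APPROVAL_GATED or approved
```

Treating unknown actions as approval-gated is a design choice in this sketch: a valid tool used for an unlisted purpose stops at the gate instead of passing silently.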
Build Proof Into the Definition of Done
A runbook should not say, “summarize what happened.” It should say what evidence must exist. The proof gate changes by workflow, but the pattern stays constant: inspect the changed state, run the smallest meaningful verification, and attach the result where the next operator can find it.
definition_of_done:
  changed_state:
    - branch pushed
    - pull request opened
    - ticket comment posted
  proof_required:
    - npm run build passes
    - affected route renders locally or in preview
    - PR URL is attached to the ticket
  trust_label: VERIFIED WORKING or PARTIALLY VERIFIED
This is how you stop “looks good” from becoming an operating standard. The agent is not finished when it produces an explanation. It is finished when the proof artifact matches the runbook.
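A minimal sketch of how proof results could map to a trust label. The two labels `VERIFIED WORKING` and `PARTIALLY VERIFIED` come from the config above; the `UNVERIFIED` label and the `trust_label` function are assumptions added for the case where no gate passed.

```python
from typing import Dict


def trust_label(proof_results: Dict[str, bool]) -> str:
    """Map proof-gate outcomes (gate name -> passed?) to a trust label."""
    passed = [name for name, ok in proof_results.items() if ok]
    if proof_results and len(passed) == len(proof_results):
        return "VERIFIED WORKING"
    if passed:
        return "PARTIALLY VERIFIED"
    # Assumed third label: nothing was proven, so nothing is claimed.
    return "UNVERIFIED"
```

An empty result set returns `UNVERIFIED` here on purpose, so a run that skipped every gate cannot be labeled as working.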
Escalation Is a Feature, Not a Failure
Good agents escalate early. Bad agents improvise around missing context. The runbook should name the stop conditions clearly enough that the agent can pause without shame and the human can resolve the blocker quickly.
escalate_when:
  credentials_missing: true
  external_account_choice_required: true
  legal_or_compliance_claim_uncertain: true
  deployment_target_ambiguous: true
  proof_gate_unavailable: true
  destructive_operation_required: true
  task_contradicts_current_operating_rules: true
The escalation message should include what was requested, what was attempted, the blocker, the decision needed, and the next action after the decision. Anything longer is usually hiding uncertainty.
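The five-part escalation message can be sketched as a fixed structure, which keeps the agent from padding it. The `Escalation` class and the example values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Escalation:
    requested: str
    attempted: str
    blocker: str
    decision_needed: str
    next_action: str

    def render(self) -> str:
        """Render exactly five labeled lines, one per required field."""
        return "\n".join([
            f"Requested: {self.requested}",
            f"Attempted: {self.attempted}",
            f"Blocker: {self.blocker}",
            f"Decision needed: {self.decision_needed}",
            f"Next action: {self.next_action}",
        ])


message = Escalation(
    requested="deploy lesson to production",
    attempted="build passed; deploy step started",
    blocker="deploy token missing",
    decision_needed="provide a token or approve preview-only delivery",
    next_action="rerun the deploy step once the token is available",
).render()
print(message)
```

Because every field is required, a message cannot be sent with the decision or the follow-up step left out.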
Recovery: The Section Everyone Skips
Recovery is where amateur automations become real operations. Assume the agent will fail eventually. The question is whether the next operator can see the failure, understand the blast radius, and restore a safe state.
- Retry: when the agent may safely rerun the same task.
- Rollback: which commit, feature flag, queue item, or config value returns the system to the previous state.
- Disable: how to stop the agent if it creates repeated bad output.
- Hand off: what a human needs to know to finish the work manually.
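The retry and disable rules can be sketched as a bounded loop. This assumes the task is safe to rerun and reports success as a boolean; the function name, the default of two attempts, and the returned strings are illustrative assumptions.

```python
from typing import Callable


def run_with_recovery(task: Callable[[], bool], max_attempts: int = 2) -> str:
    """Retry a rerunnable task a bounded number of times, then stop."""
    for attempt in range(1, max_attempts + 1):
        if task():
            return f"succeeded on attempt {attempt}"
    # Repeated failure: stop retrying so bad output does not pile up.
    return "disabled: hand off to a human"
```

The bound matters more than its exact value: without it, a failing agent keeps producing the same bad output on every trigger.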
A Complete Mini Runbook
Use this template for any agent you want to move from experiment to operating lane:
agent_runbook:
  name: daily_course_builder
  mission: publish one useful course lesson on the correct tier schedule
  trigger: assigned issue or scheduled content task
  inputs:
    - course rotation calendar
    - existing lesson files
    - site repository
  allowed_actions:
    - create TSX lesson file
    - update sitemap or index references
    - run build
    - commit and push branch/main according to repo policy
    - post completion note to task system
  approval_gated_actions:
    - paid promotion
    - changing pricing
    - rewriting legal, refund, or subscription terms
  proof_gates:
    - npm run build passes
    - route path exists
    - commit hash recorded
    - deployment URL recorded if deployed
  escalate_when:
    - build fails twice
    - deploy token missing
    - public posting account unavailable
    - content tier conflicts with schedule
  recovery:
    - revert commit if route breaks
    - leave task in progress with blocker comment if deploy/post fails
    - disable scheduled trigger after repeated failures
Build This Today
Pick one AI workflow that already runs more than once per week. Write the seven-part runbook for it before improving the prompt. You will usually find that the prompt was not the real bottleneck; the missing operating rules were.
Your Homework
- Choose one active agent or automation in your business.
- Write its mission, trigger, inputs, allowed actions, and stop rules.
- Add one proof gate that would catch the most likely failure.
- Define the rollback or disable step before expanding autonomy.
- Review the runbook with someone who did not build the workflow.