The Parallel Development Book
How to Run Multiple AI Agents in Parallel Without the Quality Collapsing
What This Book Is
A practical guide for engineers who already use Cursor, Claude Code, Codex, or similar agents single-threaded, and want to take the next step: running several agents in parallel on real projects without the quality collapsing.
The argument of the book is simple:
Parallel AI development is not about launching more agents. It is about removing the human from the three chokepoints where one agent used to need you constantly — and only then do more agents start to pay off.
Those three chokepoints are requirement alignment, correctness verification, and maintainability. The book gives you one chapter on each, plus one chapter on the prerequisite nobody writes about — the break-in period during which you and your workflow co-adapt.
How It's Organized
Part I — The Premise sets up the thesis: the bottleneck in AI development has moved from "typing code" to three specific places where you, the human, are still required. Adding more agents while those chokepoints are still manned by you just builds a longer queue.
Part II — The Break-in Period describes the four-phase learning curve every team walks through before parallel AI development actually pays off. Skipping or denying this period is the single biggest reason people conclude "parallel AI is hype."
Part III — The Three Keys is one chapter per chokepoint. Each one shows how to replace live human involvement with a mechanism: structured requirement alignment, test-plan-driven development with complexity-triaged review, and engineering discipline encoded as skills.
Part IV — The Parallel Playbook covers the execution layer: the economic phase-change of cheap failure, the four scheduling patterns (from cross-project down to agent-internal), and the next bottleneck — digesting the flood of parallel output — which itself gets solved by letting agents triage agents.
Part V — The Honest Account closes the book by telling the truth about cost: you will not be more relaxed, you will be more tired. Throughput multiplies; cognitive load stays flat or rises. The final chapter generalizes the framework beyond code to any decomposable knowledge task.
Reference — the Source Catalog lists pioneer posts, product writeups, and the zero-review/* skills referenced throughout.
Reading Paths
- "I'm skeptical that parallel AI dev pays off at all" → Chapter 1 → Chapter 9 (read the opening thesis and the honest cost in one sitting before deciding)
- "I tried two agents at once and it was chaos" → Chapter 2 (you're in Phase 1 of the break-in, and that's normal)
- "I have one agent that works, I want to scale to three" → Chapters 3, 4, 5, then Chapter 7
- "I have three agents running and I can't keep up with the output" → Chapter 8
- "I want the underlying economics" → Chapter 6
The Meta-Principle
A parallel AI workflow is not a configuration you install. It is a practice that emerges from months of co-adaptation between you, your codebase, and your agents. This book describes the shape of that practice. It cannot shortcut it for you.
By Atum — Source: github.com/A7um/ParallelDevelopmentBook
Chapter 1: The Bottleneck Moves, It Doesn't Disappear
Thesis: Parallel AI development is not about launching more agents. It is about removing the human from the three specific chokepoints where one agent still needs you constantly — and until you do that, more agents just build a longer queue.
The story everyone tells
By late 2025, the story that "AI can code for you" is no longer speculation. Cursor, Claude Code, Codex, Devin, and a handful of others have crossed the line from autocomplete into autonomous task execution. Any working engineer reading this book has already experienced the shift: you describe a feature, the agent writes it, runs tests, fixes bugs, and opens a pull request. One agent. One task. One reviewer — you.
The obvious next question is: what if you ran five of them?
The obvious answer — "you go five times faster" — is wrong. People who actually try it hit a wall in the first week and quietly go back to one. What they discovered, without being able to name it, is that the real bottleneck in AI-assisted development was never the agent. It was them.
Consider a data point from the inside. Boris Cherny, creator of Claude Code at Anthropic, shared in late 2025 that he runs ten to fifteen Claude Code sessions at a time, split between numbered terminal tabs, web sessions, and his phone, and that he shipped 259 PRs in thirty days (all of the code written by agents). His setup is not magic: system notifications to know when an agent needs input, a CLAUDE.md rules file that grows every time an agent makes a recoverable mistake, and slash commands like /commit-push-pr that automate the repeatable parts of his own attention. Read carefully and you can see the shape: he has not moved the human out of the loop — he has mechanized the places where the human used to block, so his one unit of attention can rotate through ten or fifteen workstreams without bottlenecking. This book is about the mechanisms behind that rotation.
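To make the last mechanism concrete: in Claude Code, a custom slash command is just a markdown file under .claude/commands/ whose contents get expanded into the prompt when you type the command. A sketch of what a /commit-push-pr command could contain — the file below is illustrative, not Cherny's actual command:

```markdown
<!-- .claude/commands/commit-push-pr.md — hypothetical contents -->
Commit, push, and open a PR for the current work:
1. Run the test suite first; if it is red, stop and report instead.
2. Stage the changes and commit with a one-line summary plus a body
   explaining *why*, not just what.
3. Push the branch and open a PR whose description links the spec file
   and lists anything a reviewer still needs to decide.
```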
The three chokepoints
When you watch a single engineer work with a single agent, three moments keep coming up where only the human can advance the work:
- Requirement alignment. The thing in your head is vague, full of implicit context, and hasn't been decided at the edges. The agent needs crisp, executable instructions. Closing the gap between the two means conversation, clarification, back-and-forth. It eats the human's deepest attention.
- Correctness verification. The agent writes something that looks right. Is it? In practice, someone has to read the diff, run it, catch the bug the model won't catch itself, and feed the fix back in. Historically that someone has been you.
- Maintainability. One agent, given a free hand, will write code that runs today and collapses in three months. Someone has to steer architecture — which module, which layer, what pattern — or the codebase becomes a patchwork that nobody, human or agent, can work in later.
All three share a structural feature: they require human attention in the loop. And human attention is serial. You can't spread it across five agents; you have to switch between them, one at a time. Launch five agents with all three chokepoints still manned by you and you don't get 5× throughput. You get five queues waiting on one person.
The reason parallel AI "doesn't work" for most people is not that the agents are bad. It is that those three chokepoints are still staffed by the human, so adding agents just adds queues.
This is the first and most important frame of the book: the bottleneck moves, it does not disappear. You can't make AI development faster by only adding agents. You have to move the human out of the place the queue forms.
Why this wasn't possible until recently
For most of 2025, the honest answer to "can AI take over these three chokepoints?" was no. Agents couldn't debug themselves well enough to guarantee correctness. They couldn't reliably enforce architectural discipline. They couldn't drive a terminal or a browser to verify the output they'd produced.
So the only answer was the copilot pattern: the human drove, the AI assisted. Cursor's early genius was making that collaboration smoother — a better steering wheel, not a self-driving car. Spec-first tools pushed on the requirement side. Various code-review wrappers pushed on the correctness side. All of them were building a better cockpit, not removing the pilot.
Running multiple agents in this era was mechanically possible and practically useless. Five copilots still need one pilot. The queue was always on you.
What changed
Somewhere in the second half of 2025 — call it the Opus 4.5 / Claude Code / GPT-5 / Codex wave — four capability jumps landed close enough together to change the arithmetic:
- Autonomous debugging got real. Given a shell, logs, and tests, a modern agent can diagnose and fix most day-to-day bugs without a human pointing at the stack trace.
- "Understanding correct ≈ implementation correct" became approximately true. For well-scoped work, if the agent actually understood the requirement and had a way to run tests, the code it shipped was usually correct. The dominant failure mode stopped being "wrong code" and became "wrong understanding of the ask."
- Skills became a stable injection mechanism. You can now hand an agent a structured document describing your engineering norms — naming, layering, commit style, whatever — and it will mostly follow them. This is new. Agents used to drift within a session; now the drift is manageable.
- General computer use matured. The agent isn't confined to a file tree. It can run a terminal, drive a browser, click through a GUI, read documentation written for humans. Installation instructions are quietly being rewritten for agents to execute, not humans.
No single one of these would be enough. Together they mean the three chokepoints are finally mechanizable — not perfectly, not universally, but enough that you can move the human out of the live loop on most tasks.
I want to flag something here that will come up again in Chapter 2: none of these capabilities shows up usefully until you and your workflow have adapted to them. The model being capable of autonomous debugging does not mean your setup will get autonomous debugging on day one. That gap is the break-in period, and skipping it is the main reason people read about these capabilities and then don't see them in their own work.
The three keys
If the three chokepoints are what kept parallel dev from working, the rest of the book is about how to dismantle them. The structure is symmetric:
- Chapter 3 — Key #1 — how to hand off requirements well enough that the agent won't need you mid-execution.
- Chapter 4 — Key #2 — how to treat correctness as a contract signed before coding starts, so you don't read every diff.
- Chapter 5 — Key #3 — how to encode the engineering discipline you'd otherwise enforce in review, as skills the agent applies itself.
Each one replaces "human in the live loop" with "human at the start and end, mechanism in the middle." That's the entire trick.
If you strip Cherny's shared workflow down, he is doing exactly this: CLAUDE.md accumulates the project-specific rules that used to be enforced in review (Key #3), slash commands compress handoff at the execution edges, and the numbered-tab setup is just an ergonomic wrapper around rotating his attention between agents while each one runs autonomously in its middle. Geoffrey Huntley's publicly documented Ralph Wiggum loop is a different shape of the same move: a bash loop (while :; do cat PROMPT.md | claude-code ; done) that runs a fresh context window per iteration against a PROMPT.md and a specs/ directory, with tests as backpressure — the human is the author of the spec and the author of the prompt, nothing in between. The 2026 convention that has grown out of this — Plan Mode in Claude Code, Spec-Driven Development as methodology (see Chapter 3) — is the mature form. Harper Reed's early-2025 three-stage LLM codegen workflow is the widely-recognized ancestor, and its prompts still circulate, but by 2026 standards that pattern is considered incomplete without explicit verification criteria and interface contracts. Three independent pioneers, three different aesthetics, one structural claim: replace live human presence with durable artifacts at the boundaries, and the middle becomes safe to parallelize.
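Huntley's loop is worth seeing in its runnable shape. A minimal sketch as described in his write-ups — the comments are my gloss, and claude-code stands in for whatever agent CLI you actually run:

```bash
# Ralph Wiggum loop: every iteration starts a fresh context window.
# PROMPT.md tells the agent to read specs/, pick the next unimplemented
# piece, build it, and run the tests. A red suite is the "backpressure":
# the next iteration cannot meaningfully advance until it fixes it.
while :; do
  cat PROMPT.md | claude-code
done
```

Nothing persists between iterations except the repo, the specs, and the test results — which is exactly the point: the durable artifacts are the memory.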
Once those three are in place, Chapter 6 and Chapter 7 cover the execution side — how to actually schedule multiple agents, and the economic phase-change (cheap failure) that unlocks best-of-N as a default move. Chapter 8 covers the bottleneck that re-emerges once execution is parallelized: the output itself. Chapter 9 closes by telling the truth nobody else does — this won't make you more relaxed.
The honest summary
If you remember only one thing from this chapter:
Adding agents only multiplies throughput after you have removed yourself from the three live chokepoints. Doing that is what the next four chapters are for.
Everything else — scheduling patterns, best-of-N, worktrees, subagents — is mechanics on top of that foundation. Mechanics don't save you if the foundation isn't built.
External voices
- Supporting — "code is not the bottleneck": Boris Cherny, creator of Claude Code, has made exactly this argument the flagship line of his public positioning; see Boris Cherny: "code is not the bottleneck" and the Lenny's Newsletter interview Head of Claude Code: What happens after coding is solved. His own workflow — reportedly 15+ parallel Claude Code sessions with structured human oversight (Educative recap) — is a working example of what "move the human out of the three chokepoints" looks like in practice.
- Challenging — the honest disaster log: Harper Foley's Ten AI Agents Destroyed Production. Zero Postmortems. catalogs, among others, a Replit agent deleting a production DB and fabricating 4,000 fake records (July 2025), a Claude Code agent running terraform destroy on live infra (Feb 2026), and a Cursor IDE agent deleting 70 tracked files despite an explicit "DO NOT RUN ANYTHING" instruction. Every one of these is consistent with this chapter's frame: the human was removed from the live loop before the three chokepoints were mechanized. The incidents are not an argument against the thesis; they are a forecast of what happens when you skip the mechanization.
- Challenging — the ladder-vs-drop: Marc Nuri's The Missing Levels of AI-Assisted Development: From Agent Chaos to Orchestration is the best single articulation of the phenomenon that adding agents feels like a "drop" into chaos rather than a climb. Chapter 2 treats this as Phase 1 of the break-in, not as an argument that the ladder doesn't exist.
What's next
Chapter 2 describes the break-in period: the learning curve every team walks through before the three keys start paying off, and the four phases you'll recognize yourself in.
Chapter 2: The Break-in Period — The Hidden Tuition
Thesis: Everything in this book only works after you and your workflow have co-adapted. Most people who conclude "parallel AI is hype" are stuck in Phase 1 of a four-phase break-in and don't know there are three more phases ahead.
What "break-in" means
Break a new pair of leather boots in and the first week hurts. Run a new engine for a thousand miles before you trust it at redline. Hire a senior engineer and expect three months before they're producing at their real level. In all three cases, the tool and the environment are co-adapting — the tool shaping to the task, the environment learning the tool's real edges.
AI agent workflows are the same. You don't install parallel development; you break into it. The residue of the break-in period — skills you've written, prompt templates you've refined, task-decomposition habits you've learned, failure modes you now know to avoid — is what makes the three keys in Chapters 3, 4, and 5 actually pay off. Without that residue, the mechanisms in those chapters are just words on a page.
This is the chapter most AI development books skip, because it's not flattering to sell. Everyone writes about the destination; almost no one writes about the road.
The four phases
The break-in isn't gradual in the sense of smooth. It has shape. Four phases, each with a distinct experience and a distinct signal for when you've moved on.
Phase 1: Chaos
What it feels like: You've read this book or something like it. You launch three agents at once. Within an hour, two of them have touched overlapping files, one has interpreted the requirement in a way you didn't intend, and you have three PRs to review that all need substantial rework. By the end of the day, you're more tired than if you'd written the code yourself, and the output is worse.
The conclusion people draw at this stage: "Parallel AI is hype. It doesn't work."
Why they're wrong: Every step of that disaster was predictable. You hadn't built the shared-context substrate (skills, AGENTS.md, architectural norms) that keeps three agents coherent. You hadn't written the test plans that would have caught the interpretation drift early. You hadn't picked orthogonal tasks that wouldn't collide. This isn't the workflow failing — it's you being a Phase 1 user of a Phase 4 workflow.
Signal you're leaving Phase 1: You start recognizing categories of failure instead of individual bugs. "Oh — the agent does this thing again where it invents a utility function when one already exists." Once you can name a failure mode, you can write a skill for it.
Phase 2: Awareness
What it feels like: You're still mostly running one agent at a time. But you've stopped reacting to each mistake individually. You're starting to notice that the agent keeps getting one particular thing wrong, or keeps making the same architectural guess, or keeps missing one kind of edge case. Your AGENTS.md grows. You write your first real skill — probably covering a specific class of mistake that cost you a lot in Phase 1.
Mitchell Hashimoto's My AI Adoption Journey is the clearest public account of this transition. He describes the earlier period as "excruciating" and a "period of inefficiency" — and credits his turnaround to engineering the harness: writing AGENTS.md files that encode constraints, using deterministic hooks to prevent recurring errors, and deliberately reproducing the agent's work by hand first to build the expertise needed to delegate it. His description of the transition ("the agents failed at architectural tasks, high-performance data structures, complex language-specific logic, forcing me to manually rewrite or fight the agent") is a textbook Phase 1 report. His fix is a textbook Phase 2 move.
What the work looks like: Less time typing, more time writing down rules. You feel like you're doing "less engineering," and that feels wrong, but the code quality starts climbing.
Signal you're leaving Phase 2: You start reusing skills across tasks without modification. The first time you invoke a skill you wrote two weeks ago on a new feature and it Just Works, you've moved on.
Phase 3: Templating
What it feels like: You have a working playbook. Requirement alignment follows a shape you recognize. Test plans come out looking roughly similar in structure. Code review, when you do it, checks the same handful of things the skills should have already covered. You might have two agents running at once regularly now, usually on tasks you deliberately picked because they don't collide.
What the work looks like: New skills still get written, but the rate is slowing. You're no longer finding new categories of mistakes; you're refining the skills you have.
Signal you're leaving Phase 3: You start successfully running more than three agents concurrently on non-trivial work, and you don't spend the whole day context-switching. The scheduling patterns in Chapter 7 start feeling natural rather than effortful.
Harper Reed's My LLM codegen workflow atm — written in Feb 2025, ancestral to the current 2026 Plan Mode / SDD convention — is a public Phase 3 artifact from the previous practice era. What he published was not a clever trick; it was a template distilled from enough projects to know its shape. That's what Phase 3 looks like on the outside — you start being able to write down your own workflow because it has stabilized enough to be describable. (The current 2026 replacement — Chapter 3's Plan Mode + SDD shape — is the same pattern with verification contracts added; it didn't exist when Reed wrote his post.)
Phase 4: Leverage
What it feels like: The three keys run on autopilot for most work. Parallel agents produce code in a consistent style. You spend most of your time on requirement alignment and high-level judgment, not execution. When you read tech Twitter claiming "AI doesn't work for real dev," you know they're not lying — they're just describing Phase 1.
Cherny's shared monthly totals — 259 PRs, 497 commits, 40k lines added and 38k lines removed, all generated by Claude Code, across 1.6k sessions totaling roughly 325 million tokens in thirty days — are what Phase 4 output looks like from the outside. The most telling number in that list is not how much he shipped; it's how much he removed. You don't delete 38k lines by accident. You delete them because, at Phase 4, you have the throughput to actually clean up what was already there, on top of shipping the new features.
What you can do here that you couldn't do earlier: best-of-N as a default (Chapter 6), agent-internal parallelism (Chapter 7 mode 4), and — honestly — taking on work that would have been too ambitious for a solo engineer. This is the leverage the book is actually selling.
The trap of Phase 4: You forget the break-in. You recommend the workflow to a friend, they land in Phase 1 overnight, and they conclude you were lying.
How long each phase takes
There is no honest universal answer. Factors that dominate:
- Codebase maturity. A codebase with strong conventions, good tests, and a clear module structure shortens every phase.
- How much you write down. Phase 2 is essentially "writing things down." Engineers who default to verbal knowledge transfer move through it slowly.
- Tool churn. Every time you switch primary agent (Cursor → Claude Code → Codex), you reset somewhere between half a phase and a full phase.
- Team size. A solo developer moves faster through early phases; a team plateaus until the shared skills are written down for everyone.
My rough calibration, and I want to be honest that this is calibration, not measurement: most solo engineers reach Phase 3 with one codebase they care about. Fewer reach Phase 4. Teams reach Phase 4 with a codebase only when someone has explicitly invested in writing skills as a shared asset.
What to measure
You can't manage what you don't measure. The cheapest break-in-phase metric I know of is new-skill-per-week rate on a codebase you work on seriously:
- Phase 1: zero (you don't know what to write yet)
- Phase 2: 2–5 per week (the floodgates open)
- Phase 3: declining, from 2/week toward <1/week
- Phase 4: near zero on steady state; brief spikes when you touch a new subsystem
When the rate drops near zero and you're shipping cleanly, you're in Phase 4. When the rate drops near zero and you're shipping poorly, you've stopped paying attention.
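If your skills live in the repo, the rate is one git query away. A sketch, assuming skills sit under skills/ — swap in .claude/skills/ or whatever your layout uses:

```bash
# Count skill files added per ISO week. --diff-filter=A lists only the
# files each commit added; the awk pass buckets them by commit week.
git log --diff-filter=A --name-only \
    --pretty=format:'WEEK:%ad' --date=format:'%G-%V' -- skills/ |
  awk '/^WEEK:/ { w = substr($0, 6) }
       /skills\// { n[w]++ }
       END { for (k in n) print k, n[k] }' |
  sort
```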
A secondary signal: the ratio of time spent on requirement alignment to time spent on review. In Phase 1, review dominates because you don't trust the output. In Phase 4, alignment dominates because review is mostly automated. The crossover usually happens somewhere in late Phase 2.
How the Opus 4.5 "inflection" actually works
Chapter 1 hedged carefully about the "late 2025 capability jump." The break-in framework explains the hedge.
Model capability raises the ceiling of what Phase 4 can do. It does not shorten Phase 1, Phase 2, or Phase 3 for you. This is why the same model release looks world-changing to one engineer and unremarkable to another. The one who already had a skill library, a test-plan habit, and a worktree-based workflow got more leverage from the better model. The one who'd been typing prompts into a single chat window got a slightly smarter chat window.
The model raises the ceiling. The break-in determines how close to the ceiling you actually live.
Skills as residue, not recipes
One subtle implication of the break-in frame: skills are not a library you download; they are the fossil record of your break-in period.
A skill that encodes a generic engineering principle — "prefer composition over inheritance," say — is close to useless. An agent already knows that. A skill that encodes "in this codebase, background jobs are defined in jobs/*.ts and must have a corresponding retry policy defined in jobs/retries.ts, and the agent keeps forgetting the second part" — that one is gold, because it's specific to your scar tissue.
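Written down, that gold-grade skill might be nothing more than this — a hypothetical file matching the example above:

```markdown
<!-- skills/background-jobs.md — hypothetical scar-tissue skill -->
When adding or changing a background job:
1. Define the job in jobs/<name>.ts.
2. Add a matching retry policy in jobs/retries.ts. Every job MUST have
   one — this has been missed twice before.
3. If a job genuinely needs no retries, say so explicitly in the PR.
```

Note how little of it is general engineering wisdom. All of it is local.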
This is why copying someone else's AGENTS.md wholesale rarely works. Their scar tissue isn't yours. The forty lines of hard-won specificity that make their workflow run are specific to their codebase, their tools, and the mistakes their agent made twice before they codified it. You can steal the shape of their skills file, but the content has to be yours.
Chapter 5 treats this more formally. For now, the point is: your skill library is the visible evidence of how far you've broken in.
Advice by phase
Since every reader is at a different phase, the rest of this book's advice is not uniform. A partial map:
- If you're in Phase 1: don't try to run more than one agent at a time. Pick small, well-scoped tasks. When something goes wrong, resist the urge to fix it by typing the code yourself — instead, write down why it went wrong. That note is your first skill.
- If you're in Phase 2: the three keys (Chapters 3, 4, 5) are for you. Start with Key #2 (test plans), because it pays back fastest on a single-agent workflow.
- If you're in Phase 3: Chapter 7's modes 2 and 3 are for you. You have enough discipline to run two or three agents concurrently on orthogonal work. Don't reach for mode 4 yet.
- If you're in Phase 4: Chapter 6 (cheap failure / best-of-N) is where the real leverage is. Also: you are now a teacher, whether you wanted to be or not. The most common mistake at this phase is confusing your current capability with the baseline everyone else has.
The tuition metaphor
The break-in period is tuition. You pay it by spending attention on tasks you could have done faster yourself, watching the agent fail in ways you'd never have failed, and writing down what you saw. There is no refund, there is no accelerator. The only variable is whether you notice you're paying tuition — and keep the receipts (the skills) — or whether you conclude the school is bad and drop out.
Everyone pays the tuition. The ones who succeed are the ones who treat the receipts as the asset.
External voices
- Supporting — break-in chronicled, in public: Simon Willison's ai-assisted-programming tag archive is close to a real-time record of one engineer breaking in, explicitly revising prior positions as his tooling and skills matured. His Embracing the parallel coding agent lifestyle (Oct 2025) opens by saying he spent months skeptical of the parallel pattern before adopting it — a clean example of moving from Phase 1/2 to Phase 3.
- Supporting — "it was excruciating": Mitchell Hashimoto's My AI Adoption Journey is the most direct "break-in memoir" in this genre. He describes the initial phase as "excruciating" and a "period of inefficiency," and explicitly credits the turnaround to engineering his own harness —
AGENTS.mdfiles, deterministic tools, hooks — which is exactly the Phase 2 → Phase 3 transition this chapter describes. His follow-up Vibing a Non-Trivial Ghostty Feature is an honest account of a Phase 4 run with costs disclosed. - Challenging — the skeptic-turned-convert genre: Max Woolf's detailed skeptic-tries-agents post, summarized by Simon Willison as An AI agent coding skeptic tries AI agent coding, in excessive detail, is worth reading in full. Woolf's early frustration reads as Phase 1; his later adjustments are the beginning of Phase 2. Most "AI coding doesn't work" posts are this same shape caught earlier in the arc.
- Challenging — is the break-in actually climbable?: Marc Nuri's The Missing Levels of AI-Assisted Development describes the jump from one agent to many as a discontinuity, not a smooth curve. This is compatible with the four-phase model if you read the "missing level" as the skill-library substrate — without it, the jump is indeed a drop, not a step.
What's next
Chapter 3 begins Part III with Key #1: requirement alignment. The one step in the whole workflow that genuinely cannot be parallelized.
Chapter 3: Key #1 — Requirement Alignment
Thesis: Requirements are the one step in the workflow that genuinely cannot be parallelized. Invest deeply here, up-front, or pay for it downstream — across every agent, every test plan, every review.
Why this is the key that can't be avoided
Chapter 1 named three chokepoints that had to be mechanized for parallel dev to pay off. Chapters 4 and 5 will show how to mechanize correctness and maintainability. This chapter has a harder job, because requirement alignment cannot be mechanized. It can only be compressed — made faster and more thorough per unit of human attention — so that the unavoidable serial step is as short and as final as possible.
The failure mode of ignoring this chapter is specific and nasty: five agents working in parallel, each with a slightly different interpretation of what was asked for, each producing internally consistent code that doesn't fit together. The bug is not in any single agent's output. The bug is in the requirement itself, and it has been copied five times.
Ambiguity multiplied by parallelism equals divergence. You pay for every unresolved requirements gap once per running agent.
Why one-shot specs aren't enough
The instinct of most engineers, confronted with "write down the requirement," is to open a doc and describe the feature. This is necessary and insufficient. What you write down is the requirement as it exists in your head. The problem is that what's in your head is full of gaps you can't see, because they are filled in automatically by your context, your taste, and your knowledge of the codebase. The agent has none of that and will make different fill-in choices than you would.
The job of requirement alignment is not to "write a good spec." It is to surface and decide every gap that the agent would otherwise guess about.
Two techniques do most of the work. They are complementary.
Technique 1: Exhaustive questioning
The move: describe the feature as best you can in natural language, then — before the agent starts planning or coding — explicitly tell it:
Do not begin work. First, list every question you still have about this requirement. Include things that seem small or obvious. I'll answer them, and then you will ask me more questions based on my answers. Continue until you have no more.
Harper Reed's widely-copied Feb 2025 post My LLM codegen workflow atm (ancestral to today's Plan Mode consensus) codified this exact move into a publishable prompt. The prompts remain structurally sound and still circulate in 2026 practice. He opens every project with:
Ask me one question at a time so we can develop a thorough, step-by-step spec for this idea. Each question should build on my previous answers, and our end goal is to have a detailed specification I can hand off to a developer. Let's do this iteratively and dig into every relevant detail. Remember, only one question at a time. Here's the idea:
<IDEA>
And closes with:
Now that we've wrapped up the brainstorming process, can you compile our findings into a comprehensive, developer-ready specification? Include all relevant requirements, architecture choices, data handling details, error handling strategies, and a testing plan so a developer can immediately begin implementation.
The two-prompt pair produces a spec.md that is already in the right shape to feed into Chapter 4's test-plan step. Adopt his prompts verbatim if you don't want to roll your own; the mechanic is what matters, not the wording.
The agent will produce a list. Some of the questions will feel pedantic. Most of them are the gaps you couldn't see. Answer them. Then:
Given my answers, list any new questions that have surfaced.
Repeat until the questions turn trivial or repetitive. For a medium feature, this takes 10–30 minutes; for a complex one, an hour. That feels long until you compare it to the cost of an agent interpreting the gap three different ways in three different branches.
When it's done, have the agent produce a structured spec document, not prose. Sections for: user goal, inputs, outputs, happy path, error cases, edge cases, non-goals. That document becomes the input to everything downstream — test plan, architecture, scheduling.
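A minimal skeleton for that document — section names from above, contents illustrative:

```markdown
# spec: <feature name>
## User goal
One sentence, in the user's vocabulary.
## Inputs
Each input with type, default, and validation rule.
## Outputs
Each output with type and format.
## Happy path
The end-to-end walkthrough.
## Error cases
Each failure with its defined behavior (message, retry, escalation).
## Edge cases
The boundary decisions surfaced by the question rounds.
## Non-goals
Things an agent might plausibly add that you don't want.
```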
Why this works
Agents are good at generating candidate questions from a requirement description. They've seen millions of requirements and they pattern-match well. They are not good at making silent decisions that match your taste — but you never asked them to make silent decisions. You asked them to expose every decision point.
The role swap is the critical thing: you stop being the writer of the spec and become the answerer of questions. Answering is a cheaper cognitive operation than composing. You can answer forty questions in the time it would take you to write a spec that would have covered fifteen of them.
Technique 2: Solution generation + human filter
The move, for requirements where you genuinely don't care much about some of the details: describe the feature as before, but this time tell the agent:
Identify the decision points in this requirement — the places where a reasonable engineer would have to decide between multiple plausible options. For each, propose the top three options, drawing on how comparable products handle this (search the web if needed). Recommend one, with a one-line rationale. I will pick.
The agent comes back with decision points you hadn't thought of — "should unregistered users get a preview of this?" "what does the error state look like when the payment provider times out?" — each with three options and a recommendation. You read the recommendations, agree with most of them, overrule one or two, and in twenty minutes you have an aligned spec for something that would have taken an hour of questions-and-answers.
When to prefer this over Technique 1
Technique 1 is heavier and more thorough. Technique 2 is lighter but leans on the agent's prior — its model of how similar products behave. Use Technique 1 when:
- the requirement is core to the product (you have taste here and need to express it)
- the domain is unusual (the agent's prior won't match your reality)
- the stakes of getting a minor decision wrong are high
Use Technique 2 when:
- the requirement is auxiliary (auth flow, export format, pagination style)
- you genuinely don't have strong opinions on most of the details
- you'd rather spend your attention on something else
In practice, most serious features use both: Technique 1 for the core, Technique 2 for the edges.
Complexity-triaged depth
Chapter 2 introduced a heuristic from the author's own practice:
The higher the complexity, the deeper you audit. The lower the complexity, let it go.
This applies directly to requirement alignment. Signals for "go deep":
- irreversible actions (payments, deletions, external API calls with side effects)
- cross-team dependencies (someone else has to integrate with what you build)
- novel problem domains (agent's prior is probably wrong)
- anything touching auth, permissions, or billing
Signals for "let it go":
- CRUD features with well-understood patterns
- internal tools or one-off scripts
- anything easily reversible
For "let it go" work, fifteen minutes of Technique 2 is often enough. For "go deep" work, budget an hour and use both techniques in sequence: Technique 1 first to expose everything, Technique 2 second to resolve the long tail you don't care about.
The deliverable
The output of requirement alignment is a document, not a conversation. The conversation is the means; the document is the artifact. The document has to be crisp enough that:
- a different agent, given only this document, would build the same thing
- the test plan in Chapter 4 can be written directly from it
- when the agent returns halfway through execution with a question, you can point at a section of the document and say "answered."
The last property is especially important. The whole reason you're doing this step is so the agent doesn't need you mid-execution. If the document doesn't answer the mid-execution questions, you haven't finished alignment — you've just delayed the conversation.
Worked example — the 2026 Plan Mode + Spec-Driven Development flow
The industry has converged, in the six months before this chapter was written, on a specific shape for requirement alignment. It goes by two names that refer to the same structural move: Plan Mode (in tool vocabulary — Claude Code, Cursor's planning step, Gemini's Antigravity) and Spec-Driven Development (SDD) (in methodology vocabulary — see the Augment Code practitioner's guide (April 2026) and the Jan 2026 arXiv paper of the same name).
Both crystallize a pattern earlier practitioners were hand-rolling in 2025. The authoritative shape today is a four-phase cycle, well-documented in the Plan Mode in Claude Code guide (Feb 2026) and Addy Osmani's late-2025 workflow post:
Phase 1 — Explore (read-only). Enter the agent's Plan Mode — a read-only context where it can grep the codebase, map dependencies, and read specs, but cannot modify files. You narrate what you want; the agent explores the terrain and surfaces what it already knows and what it needs to ask. This is where Technique 1's exhaustive-questioning loop runs.
Phase 2 — Plan (spec + implementation plan). The agent produces an implementation plan against the spec. In SDD vocabulary, a plan-ready spec now requires six concrete elements (paraphrased from the Augment guide):
- Outcomes and scope — what to build, explicit about what's out.
- Constraints and prior decisions — hard pins on libraries, schemas, non-negotiables, so the agent doesn't re-invent them.
- Task breakdown — decomposition into discrete sub-tasks small enough to fit one context.
- Verification criteria — explicit, testable acceptance conditions for each sub-task. These become the contract for a separate Verifier agent in Chapter 4.
- Interfaces between sub-tasks — so that parallel execution (Chapter 7) becomes safe.
- Model tiering — which roles use which models. Current 2026 convention: use your most capable model for spec writing, mid-range for implementation, fast/cheap for verification.
Write the spec to a file — the community has largely unified on docs/plans/<feature>.md or spec.md in the feature worktree. AGENTS.md (covered in Chapter 5) references it. The conversation that produced it is disposable; the file is not.
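As a sketch, a plan-ready spec file with the six elements as its skeleton — file path per the convention above; the feature and its contents are hypothetical:

```markdown
<!-- docs/plans/csv-export.md -->
# Spec: CSV export
## Outcomes and scope
Users can export any report as CSV. Out of scope: XLSX, scheduled exports.
## Constraints and prior decisions
Use the existing report query layer; no new dependencies.
## Task breakdown
1. Serializer. 2. Download endpoint. 3. UI button + progress state.
## Verification criteria
Per task: explicit, runnable acceptance checks (the Ch. 4 contract).
## Interfaces between sub-tasks
Serializer exposes toCsv(report): stream; the endpoint consumes it.
## Model tiering
Spec: top model. Implementation: mid-tier. Verification: fast/cheap.
```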
Phase 3 — Implement (small chunks). Hand the plan to an execution agent. The agent works one sub-task at a time, with the spec's verification criteria as its red/green signal.
Phase 4 — Commit. Structured PR (see the end-of-task-report skill in Chapter 8) referencing the plan file.
```mermaid
flowchart LR
    E["1 · Explore<br/>read-only recon"] --> P["2 · Plan<br/>spec + implementation plan"]
    P --> I["3 · Implement<br/>sub-tasks vs verification criteria"]
    I --> C["4 · Commit<br/>PR + plan reference"]
```
The "one-sentence rule"
A useful micro-practice from the Plan Mode guide: if you can describe the required diff in one sentence, skip the plan. Otherwise, Plan Mode is mandatory. That rule sets a clean cutoff between "alignment is overhead" and "alignment is the work." It's the practical answer to the critique that spec-first is too heavy for small changes.
What this looks like historically
This shape did not appear from nowhere. Harper Reed's Feb 2025 post My LLM codegen workflow atm was the first widely-copied public write-up of the three-file pattern (spec.md + prompt_plan.md + todo.md). His prompts are still circulating and are structurally sound — they're the 2025 ancestors of the 2026 Plan Mode convention. If you want a runnable starting point for a tool that doesn't have Plan Mode built in, Reed's original prompts are a solid place to start; just know that the 2026 convention adds verification criteria and interface specification as mandatory elements Reed's original prompts underweighted.
The structural claim the Plan Mode / SDD consensus makes — and that I'm endorsing here — is: requirement alignment produces a file with six specific elements, the agent's implementation is contractual against that file, and the file is a first-class repo artifact reviewed at complexity-triaged depth.
A second pattern worth stealing: specs-as-files
Reed's workflow puts the output of the alignment step into spec.md and the planning output into prompt_plan.md + todo.md. Addy Osmani's My LLM coding workflow going into 2026 lands on the same structure independently, and Geoffrey Huntley's Ralph loop is built on PROMPT.md plus a specs/ directory. Three pioneers, three independent workflows, one shared move:
The alignment artifact is a file, not a conversation. Files are portable across agents, they survive session compaction, and they make "what we agreed" legible to a future sub-agent that wasn't present when you agreed it.
If there's one thing you steal from this section, let it be that: the deliverable of requirement alignment is a document named spec.md (or whatever your project's convention is), sitting in the repo, referenced by every downstream step. The conversation is the means; the file is the artifact.
Why requirement alignment cannot be parallelized
Everything else in this book can be parallelized. Requirement alignment cannot. It requires:
- your deepest attention (you are making decisions that shape everything downstream)
- your serial cognitive bandwidth (you can only think deeply about one feature at a time)
- your taste (which the agent doesn't have)
This constraint has a practical consequence for scheduling. When running three or four agents in parallel, the realistic cadence looks like this:
- Align requirements on task A (30 min of your attention)
- Hand off to agent A for planning + test-plan generation (agent's work, you're free)
- While agent A works, start aligning requirements on task B
- Hand off to agent B
- While B plans, check on A's test plan, approve or adjust
- Continue the rotation
You are serial on alignment, parallel on everything else. Think of alignment as the loading step and execution as the firing step of a rifle: the rifle fires many rounds at once; it loads one at a time.
A checklist for "alignment is done"
Before you let an agent move from "planning" to "coding":
- The spec names the user goal in one sentence.
- Every input to the feature has a defined type, default, and validation rule.
- Every output has a defined type and format.
- The happy path is described end-to-end.
- Every error case has a defined behavior — user-facing message, retry policy, or escalation.
- Non-goals are listed (things an agent might plausibly add that you don't want).
- Every decision the agent asked about has an answer written in the document.
- If another agent, not the one you're working with, were handed only this document, you'd expect them to build the same thing.
If any box is unchecked, you are not done aligning. If you hand off now, you will pay for it in Chapter 4 or later.
The zero-review reference
The zero-review/auto-req skill is the author's concrete encoding of the two techniques above into a runnable skill that an agent can follow. It's referenced throughout the book; it's worth reading to see what "requirement alignment encoded as a skill" looks like end-to-end.
Reference: zero-review/auto-req
External voices
- Supporting — 2026 SDD consensus: the Augment Code Spec-Driven Development guide (April 2026) codifies the six-element spec and the spec-first / spec-anchored / spec-as-source rigor levels. The Jan 2026 paper of the same name is the academic companion.
- Supporting — Plan Mode as a tool primitive: Plan Mode in Claude Code (Feb 2026) and the Get AI Perks complete guide (Mar 2026) are the current how-tos. Plan Mode essentially turns the exhaustive-questioning loop of this chapter into a tool feature; if you're using Claude Code, use Plan Mode every time the change exceeds the one-sentence rule.
- Supporting — harness, don't prompt: Mitchell Hashimoto in My AI Adoption Journey (Feb 2026) reports that AGENTS.md-style constraint documents plus deterministic hooks matter more than any individual prompt. Requirement alignment, in his framing, is partly a document the agent reads and partly a harness that catches the classes of mistake documents can't.
- Challenging — "requirements change until they don't": Hillel Wayne's Requirements change until they don't is the right push-back on spec-first purism — when the requirement is genuinely fluid, heavy up-front specification is expensive and often wrong. His point does not invalidate this chapter's technique; it tightens the scope: use the deepest alignment on the parts of the requirement you believe won't move, and keep the mobile parts light. The one-sentence rule above is one practical response to his critique.
What's next
Chapter 4 covers Key #2: how to replace line-by-line code review with a test plan written before coding starts, and audited at a depth proportional to the complexity of the work.
Chapter 4: Key #2 — Correctness as Contract, Not Review
Thesis: Stop reviewing code line-by-line. Write a test plan before coding starts, treat it as the acceptance contract, let the agent close the correctness loop itself, and audit the plan at a depth proportional to the complexity of the work.
Why line-by-line review is dead at scale
In a one-agent workflow, reading the diff is tractable. The agent generates fifty lines, you read fifty lines, you approve. Even here it's slow — but it's possible.
In a three-agent workflow, a full day's output is maybe two or three thousand lines across different PRs, different branches, and different parts of the codebase. If you try to read all of it with the same attention, two things happen:
- You become the bottleneck again. All the parallel speedup you gained in execution is lost in review.
- Your attention degrades. Somewhere around the eighth PR of the day, you start skimming. You approve something you shouldn't have. Skim-reviewing at three-agent scale is worse than careful-reviewing at one-agent scale.
The way out is not "review harder." The way out is stop reviewing the implementation and start reviewing the contract.
What "correctness as contract" means
The correctness of a piece of code is, operationally, the set of behaviors it must exhibit. Tests encode behaviors. A sufficient set of tests, verified green, is evidence of correctness.
If the tests you wrote before coding express the full correctness contract, and the tests pass, the implementation is correct by construction.
That single sentence is the whole frame. The agent writes the code. The agent runs the tests. If the tests pass, you don't need to read the diff. If the tests fail, the agent debugs and re-runs them until they don't fail. You have been removed from the inner loop of correctness verification.
The human role has moved: from reviewing implementations (which you had to do because the agent's output might be wrong) to reviewing the test plan (which the agent couldn't have written without your taste and domain understanding in the first place).
The Adversarial Agent Pattern — 2026 consensus for correctness
The cleanest current practice for correctness-as-contract is the Adversarial Agent Pattern, crystallized in the Augment Code Spec-Driven Development guide (April 2026). It formalizes what several 2025 practitioners were doing ad-hoc, and — because it assigns the verification role to a separate agent with different context and often a different model — it directly addresses the blind-spot problem described later in this chapter.
The pattern has three roles:
- Coordinator. Reads the spec (from Chapter 3), decomposes it into sub-tasks, assigns them.
- Implementor(s). One or more agents, each working on an isolated sub-task in its own git worktree. They cannot see each other's context — only the spec's interface contracts.
- Verifier. A separate agent whose only job is to check each Implementor's output against the spec's verification criteria. It has not seen the implementation process — only the spec and the final diff.
Model tiering has emerged as the convention: the most capable model writes the spec, a mid-tier model implements, and a fast low-cost model verifies. Cost-wise, this is cheaper than running your top model on everything. Correctness-wise, it is substantially stronger than a single agent writing code and its own tests, because the Verifier never shared an understanding with the Implementor — it can only check the spec against reality.
This directly answers the concern you'll see below about "tests and code sharing a blind spot." When the Verifier is independent, a shared misunderstanding between the Implementor's implementation and the Implementor's tests still gets caught — because the Verifier is reading the spec fresh and checking what reality actually does against it.
Adopt the Adversarial Agent Pattern explicitly for any non-trivial correctness surface. Model it as three roles with three separate contexts. A single agent writing code and tests and grading itself is a regression, not a workflow.
The prompt that turns a spec into a test plan
If your tool doesn't have a built-in Verifier role, you can still create one manually — start a fresh session with only the spec loaded (no implementation context), and ask:
You are the Verifier. You will receive a spec and a diff. Your job: for each verification criterion in the spec, state whether the diff satisfies it (YES / NO / PARTIAL), and for each NO or PARTIAL, cite the specific file and behavior that falls short. You do not have access to the Implementor's reasoning — only the spec, the diff, and the repo as-shipped. Prefer skepticism over agreement.
That prompt, paired with the six-element spec from Chapter 3, is the minimum viable Verifier. It is an hour of work to set up and catches a meaningful percentage of the "implementation drifted from spec, but the Implementor's own tests passed" failures.
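If you're hand-rolling this, the mechanics are one shell command. A sketch, assuming Claude Code's non-interactive -p (print) mode and hypothetical file names; any agent CLI with a one-shot mode works the same way:

```bash
# Give the Verifier only the spec and the final diff — no implementation
# context, no conversation history.
git diff main...feature-branch > /tmp/feature.diff
claude -p "$(cat verifier-prompt.md spec.md /tmp/feature.diff)" \
  > verifier-report.md
```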
TPD vs TDD
Test-driven development, in its classic red-green-refactor form, is a human discipline: one test at a time, one small behavior increment, tight feedback loop, the test exists to guide the writing.
Test-Plan-Driven Development (TPD) is different. You produce, in one up-front pass, a full test plan that covers the behaviors the feature must exhibit. Then you hand the whole thing to the agent, which writes the implementation and the test bodies together and closes the loop against the plan.
| | TDD (human) | TPD (agent-assisted) |
|---|---|---|
| Granularity | One test at a time | Whole-feature test plan |
| Purpose | Guide writing incrementally | Define a correctness boundary for autonomous execution |
| Feedback cadence | Red-green per test | Red-green per full suite |
| Primary beneficiary | The human developer | The agent closing its own loop |
TPD is not a replacement for TDD as an intellectual practice. It's a different shape, adapted to the situation where the implementation is going to be written in one go by something that can run the whole suite every thirty seconds.
TDD is a programming discipline for humans. TPD is an acceptance contract for agents.
(The name "TPD" is a convenience coined to contrast with TDD. It isn't a term of art in the wider industry. Don't fight about the label.)
```mermaid
sequenceDiagram
    participant H as Human
    participant S as Spec
    participant TP as Test plan
    participant A as Agent
    participant CI as Suite / CI
    H->>S: Alignment (Ch. 3)
    H->>TP: Review plan (contract)
    TP->>A: Implement + tests vs plan
    A->>CI: Run until green
    CI-->>H: Signal only on plan gaps or red suite
```
What pioneers are already doing (last six months)
Current practice — pinned explicitly to the November 2025 through April 2026 window:
- Mitchell Hashimoto, in My AI Adoption Journey (Feb 2026), treats failures as occasions to add deterministic hooks and tests that prevent the failure class from recurring. The tests are not a development byproduct — they're the permanent harness.
- The Opus 4.5 "No Restart" workflow, documented in Claude Opus 4.5 Unlocks the "No Restart" Workflow (Dec 2025), makes extended autonomous test-fix-test-fix loops practical for the first time. The implication is direct: if the agent can run a suite, debug failures, and re-run without losing context for hours, you genuinely can stop reading the diff.
- Geoffrey Huntley's Ralph loop uses tests (and builds, and lints) as explicit backpressure: the loop is structurally incapable of advancing past a failing suite. Documentation current through late 2025.
- Claude Code's Plan Mode, formalized across late 2025 and early 2026 (2026 complete guide), bakes the "plan before code, verify against plan" loop into the tool itself. You're no longer applying TPD on top of a generic chatbot; the tool now enforces it.
- The Adversarial Agent Pattern (above) is the consensus formalization. Separate Implementor from Verifier, tier models by role, never let one agent grade its own work.
The convergence is notable because these practitioners are not copying each other. Each arrived at "tests before code, verification by a role that didn't write the code" as the obvious fix to the same underlying problem: line-by-line review is the thing that stops scaling first.
What a good test plan covers
The test plan is the deliverable. It should cover three layers:
- Unit tests. Individual function and module behaviors. "Given this input, this function returns this output." The agent will write these against pure logic.
- Integration tests. Interactions between modules. "When the job scheduler calls the retry handler, retries are scheduled on the expected backoff schedule." The agent will write these against the module boundaries defined in the architecture step.
- End-to-end / functional tests. The user path. "A user can upload a file, wait for processing, and download the result, and the file they download matches what they uploaded after the expected transformation."
Each layer has a different sensitivity. Unit tests catch logic bugs. Integration tests catch wiring bugs. E2E tests catch real-world composition bugs. Missing any layer leaves a class of mistakes uncaught.
A good test plan also names what is not covered — behaviors you're explicitly not testing (performance, rare concurrency paths, visual regressions). Naming non-coverage prevents the illusion of completeness.
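In practice the plan can be a short markdown file next to the spec. A sketch for a hypothetical upload feature, showing all three layers plus the non-coverage section:

```markdown
<!-- docs/plans/upload-test-plan.md — illustrative -->
## Unit
- parseUpload() rejects files over the size limit with a typed error.
- transform() is idempotent: transform(transform(x)) == transform(x).
## Integration
- Scheduler retries a failed processing job on the configured backoff.
- Storage adapter surfaces provider timeouts as retryable errors.
## End-to-end
- Upload → process → download round-trip: the downloaded file matches
  the expected transformation of the uploaded one.
## Not covered
- Performance under concurrent uploads; visual regressions.
```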
The agent-operable environment — not optional for TPD
TPD only works if the agent can observe the same reality your test plan asserts and act on failures without pulling you back into the loop. A reproducible Docker (or container-equivalent) image is the usual starting point — pinned dependencies, seeded data, one command to stand the stack up. Treat that image as part of the contract, checked in and versioned like code.
A frozen image is necessary but not sufficient. The environment must also expose the modalities the acceptance criteria actually need. If you skip this, you get green CI on a hollow stub while the product path the spec cares about stays untested.
- Browser-backed products. If users interact through a web UI, the agent (and your E2E harness) must have real browser automation inside the environment: a headed or headless browser the tool can drive, a stable base URL, cookies/session fixtures as documented. "The container runs npm test but nothing can open https://localhost:3000" is a broken TPD surface — the agent cannot close the loop on layout, flows, or client-side regressions your plan names.
- GUI / native / desktop applications. If correctness includes windows, menus, or native widgets, the environment must expose GUI use — e.g. a virtual display with documented remote access, or an agent-accessible desktop session — not only a CLI and unit tests. Otherwise the test plan will quietly omit the only channel where the bugs show up.
- Complex or concurrent systems. When failures are timing-dependent, stateful across processes, or require stepping through live code, debugger access (attach to the right process, set breakpoints, inspect variables, capture stacks) must be available to the agent under the same constraints a senior engineer would use. Relying on println alone reintroduces you as the bottleneck the moment the suite goes red for a non-obvious reason.
Same bar for the Verifier. An independent Verifier that cannot run the stack, drive the browser, or attach a debugger is verifying text against text — useful, but not a substitute for checking behavior in the modalities the spec promised.
Document these capabilities in the repo (compose.yaml, AGENTS.md, or a short docs/agent-environment.md): how to start the environment, which ports expose HTTP, how to reach the browser driver, how to open a GUI session, how to attach the debugger. If it is not documented and reachable by the agent, it is not part of your correctness contract — it is wishful thinking.
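A minimal compose sketch of such an environment — service names, images, and ports are illustrative, and you should pin exact image tags in practice:

```yaml
# compose.yaml — one command (docker compose up) stands the stack up.
services:
  app:
    build: .
    ports:
      - "3000:3000"           # stable base URL for the E2E harness
    environment:
      DATABASE_URL: postgres://dev:dev@db:5432/app
    depends_on: [db]
  db:
    image: postgres:16        # pinned major; seed data via init scripts
    environment:
      POSTGRES_USER: dev
      POSTGRES_PASSWORD: dev
      POSTGRES_DB: app
    volumes:
      - ./seed:/docker-entrypoint-initdb.d:ro
  browser:
    image: selenium/standalone-chromium   # browser the agent can drive
    ports:
      - "4444:4444"           # WebDriver endpoint for agent + E2E tests
```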
The hidden risk: tests and code sharing a blind spot
This is the part almost nobody writes about. If the agent writes both the implementation and the tests from the same understanding of the requirement, and that understanding is wrong, the tests will pass and the code will still be wrong. The tests verify what the code does, not what it should do. They lock in the misunderstanding.
The mitigation is structural:
- The test plan is reviewed by a human, before implementation starts. Not the test bodies — the test plan. The plan describes what should be true; the bodies describe how we verify it. Reviewing the plan is reviewing the intent.
- The test plan is written from the requirement spec, not from the proposed implementation. If you let the agent write the plan after it has written the code (or worse, at the same time), you have lost this property entirely. Order of operations matters.
- A human audits test coverage of high-risk behaviors at complexity-triaged depth. This is the complexity heuristic applied again.
Complexity-triaged review depth
From the author's own practice: the higher the complexity of the change, the deeper the audit. Low complexity, let it go.
Applied to test plans:
Go deep on:
- irreversible actions (payment, deletion, data migration)
- cross-module changes
- security and auth paths
- anything with a hard-to-rollback failure mode
- domains where the agent's prior is known to be weak
Let it go on:
- CRUD
- internal tools and one-off scripts
- low-stakes, well-understood patterns
- changes easily rolled back
"Let it go" does not mean "no test plan." It means: skim the plan, make sure the shape looks right, trust the agent to fill in the details, don't audit every test case. You are still requiring tests; you're just not spending your attention on them at the same depth as on a payment flow.
This heuristic is the single practical instruction I'd most want a Phase 2 reader to internalize. The naive Phase 1 failure is to audit everything equally hard; the naive Phase 4 failure is to audit nothing. Triaging by complexity is the middle path that scales.
Agent-as-user testing for UI work
Automated unit, integration, and E2E tests miss the UI-layer complaints: bad copy, cluttered layout, error messages that are technically correct but useless, flows that "work" but require too many clicks. For these the agent can act as a user — driving a browser, filling forms, reporting what the experience was like. The single trick worth knowing: make the agent play a specific named user role (novice / power user / adversarial), and constrain it to only the senses that role has. A novice-role agent can only report "I clicked save and the page went blank for five seconds" — not "the init failed with a 500," because it can't see the console. That constraint is what makes the reports real-user-shaped instead of engineer-shaped. The implementation details live in zero-review/auto-test.
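In practice, "constrain it to only the senses that role has" is just a per-role prompt file handed to the browsing agent. A minimal sketch; the file name and wording are illustrative, not the zero-review/auto-test implementation:

```bash
# Hypothetical role file for agent-as-user testing.
mkdir -p roles
cat > roles/novice.md <<'EOF'
You are a first-time user of this product.
You can see only what is rendered in the browser window.
You may NOT open devtools, read the console, inspect network traffic,
or reason about the implementation. Report what you tried, what you
saw, and where you got stuck, in plain user language.
EOF
```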
What you stop doing
To make this concrete, the review practices you are retiring at Phase 3+:
- Reading every diff line-by-line. Gone.
- Catching bugs by reading code. Gone. The tests should catch bugs; if they don't, the plan is incomplete and that is what you fix.
- Style nitpicks in PR comments. Replaced by skill-enforced conventions (Chapter 5).
- Checking that the code runs. Replaced by CI + agent-run suite.
What remains:
- Reviewing test plans for coverage adequacy.
- Spot-checking implementation on high-risk changes only.
- Reading agent-as-user reports.
- Final green-light approval.
The zero-review reference
The zero-review/auto-dev skill encodes the TPD loop end-to-end, including the architecture step described in the next chapter. Together with auto-req (Chapter 3) and auto-test, they form the author's working skill stack for single-agent execution. Parallel scheduling (Chapter 7) runs these skills across multiple agents simultaneously.
Reference: zero-review/auto-dev
External voices
- Supporting — the Adversarial Agent Pattern: the Augment Code SDD practitioner's guide (April 2026) is the definitive current reference for the Coordinator / Implementors / Verifier split with model tiering. The Jan 2026 Spec-Driven Development paper is the academic companion.
- Supporting — extended autonomous loops: Opus 4.5's No Restart workflow (Dec 2025) is the capability that makes TPD-style unattended test-fix-test-fix loops genuinely practical at scale.
- Supporting — stop reviewing, start engineering: Geoffrey Huntley's Ralph Loop argues that line-by-line review is structurally obsolete once agents can self-verify against backpressure; the engineer's job becomes designing guardrails — pre-commit hooks, property-based tests, snapshot tests — not reading diffs.
- Challenging — tests don't prove correctness: Hillel Wayne's Why Don't People Use Formal Methods? remains the sharpest articulation of the limit TPD inherits. The Adversarial Agent Pattern narrows the gap considerably (independent Verifier, spec as contract) but does not close it. For genuinely high-stakes correctness surfaces, testing remains a correctness toolkit item, not a proof.
- Challenging — "you don't know if you have the right spec": Wayne's fundamental verification problem — any test suite is only as good as the requirement it encodes — is why the 2026 consensus spec has six mandatory elements (Chapter 3) rather than three. Verification criteria without explicit outcomes and constraints still lock misunderstandings in.
What's next
Chapter 5 covers Key #3: how to encode engineering discipline — naming, layering, module design, commit style — as skills the agent enforces on itself, and where that still isn't enough.
Chapter 5: Key #3 — Engineering Discipline as Code
Thesis: Skills are not a one-time install. They are the living residue of your break-in process, encoding this project's specific bad habits. The generic engineering principles are the starting line, not the finish.
Why correctness isn't enough
Chapter 4 made correctness a mechanism. A piece of code can be correct and still be catastrophic for the codebase — badly structured, overcoupled, using three different naming conventions, inventing utilities that already exist two directories over. Correctness gets you past today; maintainability decides whether next week's agent can work in the code you shipped this week.
In a one-agent, human-reviewed workflow, maintainability was enforced by you in review. You'd say "don't use inheritance here, use a strategy object" and the agent would adjust. That path doesn't scale to three agents producing three PRs in the same hour. Either you enforce it by mechanism, or it stops getting enforced.
The mechanism is skills. Skills are structured documents, loaded at agent startup, that shape how the agent approaches design and self-reviews its output. They turn "things you'd say in review" into "things the agent checks before it ships."
Generic principles — the starting line
The first half of what belongs in a project's skill set is the generic software-design discipline you'd expect in any review. Most of it comes from Ousterhout's A Philosophy of Software Design and can be encoded cleanly into a short skill that raises an agent's first-draft architecture from default to usably-competent:
- Deep modules. Simple interfaces, significant functionality behind them. Agents default to breaking things into too-small pieces; the skill should push back explicitly.
- Information hiding. Modules shouldn't leak internals. The agent's common failure mode here is splitting by execution order ("step A module, step B module") instead of by knowledge ownership — which almost guarantees leakage.
- Layered abstraction. Each layer provides a distinct mental model. A layer that only forwards calls to the next layer isn't earning its keep.
- Cohesion and separation. Code that must be understood together stays together; generic and special-case logic that confuses each other gets split.
- Error handling through definition. Prefer designs that define errors away (default behavior, simplified semantics) over designs that spray `try/catch` everywhere.
- Naming and obviousness. Readers shouldn't be surprised. Names should be specific, consistent across the codebase, and free of invented abbreviations.
- Documentation that adds information. Comments should describe what the code cannot — intent, trade-offs, invariants. Not what it literally does.
- Strategic over tactical design. Every change is an investment in the structure. Quick fixes compound into tech debt with interest.
These are good defaults. Encoded as a skill, they put the agent's first-draft architecture substantially above its untrained baseline.
But generic principles have a ceiling. An agent that "knows" them can still miss that your codebase already has the utility it's about to reinvent, use a naming style that matches one file from two years ago but not the rest, or pick a pattern that's right in the abstract and wrong for the framework you're on. The generic skill can't catch these because they are specific to your project. Past the generic line, skills have to be yours.
If you want the craft of writing a single skill document well — structure, description wording, triggers, failure modes — that's the subject of The Skill Design Book. This chapter is about how skills function as the maintainability mechanism in a parallel workflow, not how to author one.
How pioneers actually accumulate their skills (and what 2026 research says)
The industry has standardized, in the six months before this writing, on AGENTS.md as the cross-tool context file. It is adopted by Claude Code (via CLAUDE.md symlink), Cursor, Codex, Gemini's Antigravity, and most other major agents. The most useful current reference is How to Build Your AGENTS.md (2026) from the Augment team (March 2026).
Several findings from the last six months are worth internalizing before you write yours:
- Keep it under about 150 lines. ETH Zurich's Feb 2026 research (summarized in Paul Withers' Is AGENTS.md Engineering the next optimisation approach?) found that verbose or LLM-generated `AGENTS.md` files actually reduce task success rates and inflate cost, because of "lost in the middle" degradation on long context. Human-curated, concise files perform measurably better.
- Nest for modularity. Agents prioritize the `AGENTS.md` closest to the current working directory. Use a short root file for repo-wide rules, and drop focused `AGENTS.md` files into subdirectories that need different rules.
- Symlink for cross-tool compatibility. The conventional 2026 move is `ln -s AGENTS.md CLAUDE.md` so every agent you use reads the same file regardless of its preferred name. This is now standard.
- Treat it like code. Check it in, version-control it, review it in PRs. Mitchell Hashimoto's My AI Adoption Journey (Feb 2026) treats `AGENTS.md` as a living contract updated every time a failure class is observed — not a one-time write.
- Skills are a separate channel. Anthropic's Agent Skills and Addy Osmani's 2026 writeup of the same pattern distinguish always-loaded context (`AGENTS.md`) from loaded-on-demand skills (`SKILL.md` in its own directory, triggered by task relevance). The split is significant because it lets you have fifty specialized skills without bloating every session.
The 2026 consensus, compressed: one lean AGENTS.md for durable repo-wide rules, a library of focused SKILL.md packages for task-specific workflows, both checked in, both versioned, both reviewed in PR. A mature project ends up with one root AGENTS.md of roughly 60–120 lines and ten to forty focused skill documents, each tied to a specific class of mistake the agent made once and should never make again. The shape of an individual skill — description, triggers, body, checklist — is covered in The Skill Design Book; what matters here is that every rule in the file comes from a specific failure, carries its reason, and would not be there if you hadn't watched the agent get it wrong.
The skills that actually move the quality needle are the ones that encode the mistakes your agent made in your codebase. "Follow good naming conventions" is useless — the agent already tries to. "In this codebase the convention for handler names is handle<EntityName><Action>; the agent tends to write <entityName>Handler and gets it wrong" is gold. That specificity is the whole point.
The generic skills are the default setup. The specific skills are where the break-in residue lives. The former you write once; the latter you accumulate one failure at a time.
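In shell terms, the consensus layout is small enough to set up in four lines. A sketch: the symlink is the standard move named above, the skills directory follows Anthropic's `SKILL.md` convention, and the specific skill name is hypothetical:

```bash
ln -s AGENTS.md CLAUDE.md                   # one context file, every tool reads it
mkdir -p .claude/skills/handler-naming      # one focused skill per failure class
$EDITOR .claude/skills/handler-naming/SKILL.md
git add AGENTS.md CLAUDE.md .claude/        # treated like code: versioned, reviewed in PRs
```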
The three-stage execution flow
With skills in place, the work an agent does on a feature looks like this:
```mermaid
flowchart TB
  SK[(Skills + AGENTS.md)]
  SK --> D[1 · Architecture design]
  D --> V[2 · Implementation + verification]
  V --> U[3 · Self-audit vs skills]
  SK --> V
  SK --> U
```
1. Architecture design. Given requirement spec (Ch 3) and test plan (Ch 4), the agent proposes module boundaries, interfaces, file organization, and abstractions. It does this with skills loaded — so the design already reflects the deep-modules, information-hiding, naming-consistency rules. The human reviews this design at complexity-triaged depth. Getting this step right is what makes Chapter 7's mode 4 (agent-internal parallelism) possible at all, because the interface contracts defined here are what let sub-agents work in parallel without colliding.
2. Implementation and verification. The agent writes code and tests, runs the suite, debugs failures, iterates to green (Chapter 4).
3. Self-audit. Before declaring done, the agent re-reads its own diff with the skill set loaded and checks for violations: shallow modules, redundant layers, inconsistent naming, utilities that duplicate existing ones. It fixes what it finds.
Steps 1 and 3 are new. They replace the architectural judgment and final polish that you'd otherwise apply in review. You're still involved — you approve the architecture, you spot-check the self-audit on high-complexity work — but you are no longer the only line of defense.
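The self-audit also works as a detached pass, using the same pipe convention the Ralph loop in Chapter 7 uses. A sketch; the skill file name is an assumption:

```bash
# Re-read the feature diff with the skill set prepended as context.
git diff main...HEAD | cat skills/self-audit.md - | claude-code > self-audit-report.md
```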
Where skill injection still isn't enough
Being honest about the limits:
- Conflicts between principles. Deep modules vs small composable pieces; strategic design vs YAGNI. These genuinely conflict, and skills can't arbitrate — judgment does. For contested cases the agent needs explicit guidance in the skill: "in this codebase, prefer deep modules even if it means the module is harder to unit-test in isolation; we value the interface simplicity more."
- Cross-cutting concerns. Security, observability, performance — they don't live in one module, and a skill that says "think about security" is too vague to act on. These usually need either (a) dedicated tooling (linters, scanners) or (b) very specific skills ("on any endpoint that writes to the database, require that the caller's permission was checked in the handler before the DB call").
- Taste drift between agents. Different agents trained on different data have different biases. A skill set tuned to Claude may read differently to a Codex model. This is a real limit of the portability story; budget time for re-tuning when you switch primary agents.
- Novel subsystems. The first time you touch a new framework, a new language, or a new service, you don't yet know the failure modes. There are no skills to write yet. You pay tuition (Chapter 2) on that subsystem specifically, then write skills from what you learned.
None of these is fatal. All of them mean "skills get you a long way, not all the way."
The link to parallel scheduling
This chapter lives in Part III of the book, about unlocking parallel work. The connection to Chapter 7 (scheduling patterns) is direct:
- Modes 1–3 work better when agents load a shared skill set. Without it, three agents produce three styles of code and merges become a mess.
- Mode 4 — agent-internal parallelism — is essentially impossible without interface contracts defined in the architecture step. Sub-agents working in parallel on different modules can only merge cleanly if the interfaces between them were pinned down up-front. This is the direct dividend of Key #3: the architectural discipline you enforce is what makes parallel decomposition feasible.
Skipping architecture and letting the agent "just code" is the single fastest way to lose the parallel benefit. It works for one agent on a small feature. It catastrophically fails with three agents on a medium one.
The zero-review reference
The zero-review/auto-dev skill encodes this three-stage loop — architecture design, implementation-and-verification, self-audit — as a runnable skill, including Ousterhout-derived design principles and a concrete self-audit checklist. It's worth reading as the canonical example of "engineering discipline encoded as a skill."
Reference: zero-review/auto-dev
External voices
- Supporting: A Philosophy of Software Design (Ousterhout) remains the best single source on the underlying principles. Anthropic's Agent Skills documentation formalizes the injection mechanism. For the craft of authoring a single skill document, defer to The Skill Design Book.
- Challenging: critics of "rules-based design" argue that encoded principles ossify into cargo-cult checklists that miss the point. This is a real risk, especially for generic skills. The counter is that project-specific skills don't generalize and therefore don't ossify — they stay tied to the scar they came from.
What's next
Part III is complete. Chapter 6 opens Part IV with the economic phase-change that unlocks parallel execution: when attempts cost minutes instead of hours, exploration gets cheap, and best-of-N stops being a luxury.
Chapter 6: The Cheap Failure Principle
Thesis: When an attempt costs minutes instead of hours, the whole economics of exploration flip. Best-of-N stops being a luxury move and becomes a default — and the anti-pattern this invites is skipping the steps that made it work.
The phase change
Before AI agents, every attempt to write a feature cost a human several hours. That cost shaped every decision: you picked the safest path, you didn't spike risky ideas, you didn't try a second implementation "just to compare." Exploration was rationed.
With agents running the three keys (Chapters 3–5), an attempt costs minutes to an hour. The cost curve has moved by one or two orders of magnitude. And when execution cost drops by an order of magnitude, the decision cost of "try another approach" starts to exceed the execution cost of actually doing it.
That's the phase change. It sounds like a quantitative shift; it's a qualitative one. A different class of strategy becomes rational that wasn't before.
Before: deciding what to build is cheap; building is expensive; therefore, think hard before building, build once. Now: building is cheap; deciding is the cost that remains; therefore, try several things, pick a winner.
Best-of-N as a daily move
The direct consequence: best-of-N stops being an occasional technique and becomes a default.
In a best-of-N workflow, you spec a feature once (Chapter 3), write a test plan once (Chapter 4), and then launch N parallel attempts — possibly using different agents, different prompts, or different architectural approaches. Each attempt runs against the same test plan. You read the winners, compare, and pick.
This sounds expensive in agent cost. It is. It is radically cheaper in your cost, because:
- One of the three attempts will almost always pass the tests first, and you review that one.
- The other two give you a cheap comparison — you see three design choices for the same contract, and you can tell which feels cleanest in thirty seconds.
- If all three fail, you've learned something real about the requirement that one attempt would have hidden.
```mermaid
flowchart LR
  subgraph once["Align once"]
    SP[Spec]
    TP[Test plan]
  end
  SP --> W1[Worktree / attempt 1]
  SP --> W2[Worktree / attempt 2]
  SP --> W3[Worktree / attempt 3]
  TP --> W1
  TP --> W2
  TP --> W3
  W1 --> P[Pick winner vs contract]
  W2 --> P
  W3 --> P
```
Best-of-N is especially useful for:
- Ambiguous architectural choices. When you don't know the right abstraction, let three attempts show you.
- Risky refactors. The safe path and the aggressive path, run in parallel, compared on the test plan.
- "I want the good one." Tone of copy, UX flow choices, any place where you'd recognize quality faster than you'd specify it.
It is not useful for:
- Tasks with one right answer (CRUD, bug fixes with a known cause). Running N attempts just wastes compute.
- Tasks where the contract itself is unclear. Best-of-N finds variation on the implementation, not the spec. If the spec is wrong, all N attempts are wrong.
A concrete example — parallel attempts via git worktrees
The cheapest way to run best-of-N on a real feature today is to combine cheap failure with mode 2 from Chapter 7. Concretely:
```bash
# Align requirements once, produce spec.md and prompt_plan.md.
# (See Chapter 3's worked example — Harper Reed's flow.)

# Now launch three parallel attempts at the same feature.
# (-b creates each attempt branch along with its worktree.)
git worktree add -b attempt-a ../feature-attempt-a
git worktree add -b attempt-b ../feature-attempt-b
git worktree add -b attempt-c ../feature-attempt-c

# In attempt-a: hand the agent spec.md + prompt_plan.md with no extra instructions.
# In attempt-b: same inputs, but prepend a hint that pushes toward a different
#   architectural choice ("prefer a state machine here even if it feels heavy").
# In attempt-c: same inputs, different agent entirely — e.g., Codex instead of
#   Claude Code — to get a genuinely independent attempt.
```
Each attempt runs against the same test plan. Run them in parallel; wall-clock time is roughly the time of one attempt. When two or three have produced green suites, you read them side-by-side and pick.
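Closing out the round is mechanical. A sketch, reusing the branch names from the block above (add `--force` to `worktree remove` if a losing attempt left uncommitted state):

```bash
# Suppose attempt-b wins: merge it, then discard the losing attempts.
git checkout main
git merge attempt-b
git worktree remove ../feature-attempt-a && git branch -D attempt-a
git worktree remove ../feature-attempt-c && git branch -D attempt-c
```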
The diversity trick matters. Three runs of the same agent with the same prompt against the same spec converge to nearly identical output; you learn very little. The real leverage is when each attempt varies on one axis — the agent, the architectural hint, or the temperature — so the differences between the attempts carry information.
Simon Willison's Embracing the parallel coding agent lifestyle and Mitchell Hashimoto's Vibing a Non-Trivial Ghostty Feature both describe using parallel attempts against the same spec specifically to see variance across agents — which implementation seems cleaner, which caught an edge case the others missed. That variance is the product you are buying with the extra compute.
The "exploration becomes the default" shift
There's a second, subtler consequence of cheap failure that I think is more important than best-of-N itself.
In the old economy, when you had a vague idea — "I wonder if we should switch from approach A to approach B" — the cost of finding out was writing approach B, which meant days or weeks, which meant you almost always didn't. The question died unanswered. Approaches stuck because no one could afford to check.
In the new economy, finding out is a thirty-minute agent run. Most vague ideas can be checked. This changes the cadence of decision-making: you stop relying on debate to resolve "should we" questions and start relying on cheap experiments. An engineer in Phase 4 resolves more architectural debates in a day than a team used to resolve in a quarter — not because the engineer is smarter, but because the cost curve of "find out" has collapsed.
This is the real productivity gain, and it's not captured in any "lines of code per day" metric. It's captured in "decisions made per week with actual evidence behind them."
The anti-patterns cheap failure invites
Every phase change creates new ways to fail. The four most common ones here:
Anti-pattern 1: Skipping requirement alignment because "we'll try a few and see"
This is the Phase 1 failure. "Why spend thirty minutes on requirement alignment when I can just run three agents and pick the best?" The answer: because all three agents will produce coherent work against different misinterpretations of the ambiguous requirement. You'll end up with three working implementations of three subtly different features, and the choice between them is a choice between three things you didn't want.
Cheap failure is leverage on top of good alignment. It does not replace alignment. If Chapter 3 isn't done, Chapter 6 actively hurts.
Anti-pattern 2: Letting best-of-N substitute for test plans
"I'll just pick the one that looks best." This works for UI tone; it doesn't work for correctness. Without a test plan, "best" collapses to "the one that compiles and looks familiar," which is the agent's favorite attempt, not the correct one. Chapter 4 is a precondition for Chapter 6; skipping it means best-of-N degenerates into aesthetic voting.
Anti-pattern 3: Running N forever
N attempts is fine. N + M attempts, chasing the perfect implementation, is a compute-burning habit that produces diminishing returns. In practice N = 2 or 3 covers 90% of the cases where variance matters; beyond 3, the marginal attempts almost never change the pick. The rule of thumb is: if the first two attempts disagree meaningfully, run a third to break the tie; if they agree, one attempt would have been fine.
Anti-pattern 4: Forgetting the human bottleneck is still there
Three parallel attempts produce three things you still have to compare. If comparison takes as long as reading one implementation three times, you've just tripled your review load. The fix is either (a) let the test plan do the comparison for you (if all three pass, pick one on test results plus a quick aesthetic read) or (b) don't use best-of-N on tasks where you can't compare quickly.
Cost math, roughly
For a working engineer in Phase 3+, the rough numbers:
- One agent attempt on a medium feature: 30–90 minutes of agent time, ~$1–5 of compute.
- Three parallel attempts: same wall clock, 3× agent time, 3× compute, same human attention at alignment + test-plan stages, maybe 20% extra at pick-a-winner stage.
- Expected value of "one of three attempts teaches you something about the problem you didn't know": hard to quantify, but anecdotally, high on novel work and low on routine work.
The dominant cost is not compute. It's whether your human attention budget has room for "look at three things instead of one." Plan accordingly.
Interaction with the scheduling patterns
Best-of-N is one of four parallel modes discussed in Chapter 7, but it differs from the other three in a key way: the other three modes parallelize across different tasks; best-of-N parallelizes across alternative attempts at the same task.
In practice, a mature parallel workflow uses both at once. You might have three agents on three different features (Chapter 7 mode 2) and within one of those features, three attempts in best-of-N (this chapter). That's nine agent-runs in flight. Chapter 8 will talk about what happens to the flood of output this creates.
External voices
- Supporting: ML practitioners have used best-of-N sampling on model outputs for years; applying the same pattern at the task level is the natural generalization. Chase-Lambert, Anthropic, and others have discussed "draft multiple, pick one" workflows; Geoffrey Huntley's posts on agent-of-agents patterns touch the same territory.
- Challenging: some practitioners caution that best-of-N can mask systematic errors (if all N attempts share the model's blind spot, you pick the best wrong answer with high confidence). This is a real critique and the mitigation is diversity — different agents, different prompts, different starting architectures — not just N runs of the same prompt.
TODO (author's note): pioneer links where someone actually ran best-of-N in production and reported the cost/benefit; any Anthropic/OpenAI posts on sampling-for-coding that fit.
What's next
Chapter 7 covers the four scheduling patterns for running parallel agents — from cross-project coordination (coarsest) down to agent-internal parallelism (finest) — and which patterns match which break-in phase.
Chapter 7: Scheduling Patterns
Thesis: There are four parallel modes, from coarsest to finest, and they match different break-in phases. Don't reach for mode 4 from Phase 1. Don't stay on mode 1 once you're in Phase 3.
What "scheduling" means here
With the three keys in place and cheap failure understood, the remaining question is how you actually run multiple agents at the same time. "Open five Claude Code windows and type into them" is a strategy, and a bad one. Different granularities of parallelism have different coordination costs, different failure modes, and different prerequisites in your break-in phase.
Four modes, from coarsest to finest:
| Mode | Granularity | Coordination cost | Break-in phase that can use it |
|---|---|---|---|
| 1 | Across separate projects | Very low | Phase 1+ |
| 2 | Within one project, across non-overlapping features | Low (via git worktree) | Phase 2+ |
| 3 | Within one feature, across transaction types | Medium | Phase 3+ |
| 4 | Within one task, across sub-agents | High setup, low runtime | Phase 3–4, mostly experimental |
You will use multiple modes at the same time. A Phase 4 engineer in a busy week might be running mode 1 across three projects, mode 2 inside one of those projects, and mode 3 inside one of those features. That's six agents in flight without even reaching for mode 4.
Mode 1: Cross-project parallelism
The move: you have two or three separate projects (different repos, different products). You run one agent per project, each working its own backlog. Projects don't share code, so agents don't interfere.
Why it's easy: there is no coordination. Each project has its own skills, its own conventions, its own requirements. The only thing that's parallel is your attention, and you manage that by rotating between the projects at the alignment stage and letting each project execute asynchronously.
The cadence that works:
- Align requirements on Project A. (20–30 min of deep attention.)
- Hand Project A to its agent for planning and test-plan generation.
- While Project A works, switch to Project B. Align.
- Hand Project B to its agent.
- While B plans, check on A's test plan, approve or adjust.
- Rotate.
This is the "multiple meeting rooms" mode. You are the tech lead walking between them. If you have the break-in residue to hand off cleanly, this mode costs you almost nothing and gives you two or three projects' worth of throughput on one engineer's schedule.
Prerequisite: the project itself has to be in good enough shape to hand off cleanly. A mature project with conventions and skills: trivial. A new project where you're still figuring out architecture: mode 1 doesn't help because the deep attention is on Project A, not on the rotation.
Mode 2: Worktree parallelism — within one project, non-overlapping work
The move: within a single codebase, you have two or three features that don't touch the same files. You create a separate git worktree for each, check out a separate branch in each, and run one agent per worktree. They don't interfere because they're in separate directories.
Why git worktree matters: running multiple agents on the same checkout gets chaotic fast. They fight over the working tree, tests overlap, one agent's WIP changes break another's test run. Worktrees give each agent its own isolated directory tied to the same repo, which solves the mechanical problem cleanly.
The workflow:
- `git worktree add -b featureA-branch ../project-featureA` (the `-b` creates the branch along with the worktree)
- `git worktree add -b featureB-branch ../project-featureB`
- Launch one agent per worktree, each loading the same shared skills and the same `AGENTS.md`.
- Each agent runs through Chapters 3–5 independently.
- When an agent finishes, merge its branch back to main. Resolve merge conflicts — ideally by letting an agent handle the mechanical ones and handling strategy conflicts yourself.
Claude Code ships a convenience shortcut for this: the claude -w flag (documented in Cherny's 15-tips roundup) automatically creates a worktree for an agent session, so you don't have to manage the worktree directory manually. If you're using a different agent, the shell commands above work identically; the mechanism is git's, not any specific tool's.
Rule of thumb on orthogonality: if the two features mostly touch different files (say, <20% overlap), mode 2 works. If they heavily overlap (>50%), either serialize them or redesign the task boundaries. The time you lose to merge conflicts can easily erase the time you gained from parallelism.
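The overlap estimate doesn't have to be a guess; git can report it, as sketched below for the two branch names used above:

```bash
# Files touched by both feature branches relative to main. If this list is long
# compared to either branch's own diff, serialize the features instead.
comm -12 \
  <(git diff --name-only main...featureA-branch | sort) \
  <(git diff --name-only main...featureB-branch | sort)
```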
Break-in prerequisite: you need enough shared skills (Chapter 5) that two agents working in parallel will produce code in the same style. Without that, merge cleanup is dominated by stylistic inconsistency, which is demoralizing and hard to automate. This is why mode 2 mostly shows up in Phase 2+.
Mode 3: Cross-transaction-type parallelism — within one feature
The move: within a single feature, different kinds of work can run in parallel even though they're all about the same feature. You launch one agent per work type.
An example, concrete: you're building an "export orders to CSV" feature. Three agents run simultaneously:
- Agent A — backend correctness. Builds unit and integration tests around the export logic: empty inputs, giant inputs, concurrent exports, corrupted orders. Writes the backend implementation. Closes the test plan.
- Agent B — UI end-to-end. Builds Playwright tests around the export button: clicks it, verifies download, checks error states, tests the disabled state while loading. Writes and adjusts the UI glue code.
- Agent C — known bug backlog. Takes the list of small bugs you've been sitting on (unrelated to this feature but in the same area), and works through them so they're not still there when the feature ships.
These three live in the same feature branch but touch different layers. The coordination cost is medium: they share a branch, so order of integration matters. Usually A and C merge first; B waits for A to stabilize the backend before running against it.
Why this mode matters: a lot of feature work has this shape — backend logic, UI glue, tests at different levels, an ancillary bug list — and naively one engineer does it all serially. Splitting by transaction type gives you real parallelism within a single feature, without the merge complexity of mode 2.
Break-in prerequisite: you need the discipline to define the interfaces between the pieces up-front, so agent B isn't constantly blocked on agent A's decisions. This is the payoff from Chapter 5's architecture step: if the interfaces are specified before coding starts, mode 3 runs cleanly.
Mode 4: Agent-internal parallelism — within one task, across sub-agents
The move: you hand the agent one large task. The agent decomposes it into sub-tasks and runs sub-agents in parallel on each, merging the results. You don't manage the decomposition; the top-level agent does.
Claude Code's "team" mode and similar "orchestrator + workers" patterns implement this. The appeal is obvious: you don't have to pre-plan the parallelization, the agent figures it out.
Why it's the hardest mode: the agent's decomposition is only as good as the interface contracts it defines. If the sub-agents have to coordinate mid-execution ("what does your function return again?"), they collide, merge badly, or produce incompatible outputs. You are back to the coordination problem, but now solved by the agent, which may or may not be good at it.
Current state, being honest: mode 4 is real — it works on scoped, well-specified tasks where the boundaries naturally decompose. On messy real-world tasks where the decomposition itself is the hard part, mode 4 often produces worse results than a single agent running sequentially, because the merge cost outweighs the parallelism gain.
Cognition's Devin can now Manage Devins describes their mode 4 implementation from the inside: a coordinator Devin scopes work and monitors progress while each delegated Devin runs in its own isolated VM. Crucially, the same team published Don't Build Multi-Agents — the two are not contradictory. The second post is a warning against naive multi-agent patterns that don't share context; the first is an implementation that does. Their explicit principle: "share context, not just messages" and "actions carry implicit decisions; conflicting decisions lead to failure." If the top-level agent can't share its reasoning trace with sub-agents, and the sub-agents can't converge on compatible decisions, mode 4 degrades to exactly the "chaos" that most skeptical reports describe.
When it actually pays off: the same conditions that make mode 3 work, but at a smaller granularity. If the architecture step (Chapter 5) has pinned down module boundaries clearly, and each sub-task fits inside a module, agent-internal decomposition is reliable. If the architecture is vague, it isn't.
The direct dependency on Chapter 5: mode 4 is basically the payoff for doing architecture design well. A codebase with well-defined module boundaries and interface contracts is automatically amenable to parallel decomposition. A codebase where agents "code-as-they-design" isn't. This is the strongest practical argument for not skipping the architecture step, even when you're tempted.
When to reach for it: Phase 3 and up, on tasks where the decomposition is obvious (you could have done it by hand with mode 3 but it's tedious). Treat it as a labor-saver for cases you'd otherwise run mode 3, not as a magic productivity multiplier.
How many agents at once?
A practical upper bound, roughly:
- Phase 2: one or two agents. More than two breaks.
- Phase 3: three to four agents comfortably. Five starts requiring deliberate attention discipline.
- Phase 4: five to eight, with mode combinations. Past eight, even experienced engineers lose the thread.
Cherny reports running ten to fifteen concurrent Claude Code sessions. That is the high end of what one deeply practiced operator does on a mature codebase; it is not a baseline. Read about his workflow (Educative) and you see the supporting machinery — numbered terminal tabs, system notifications, a /commit-push-pr slash command, a Chrome extension that lets Claude test the UI it builds — that is specifically there to make fifteen agents manageable by one attention. Without that machinery, the upper bound drops sharply.
The limiting factor is almost never agent compute or tooling. It's your ability to keep context while rotating between tasks at the alignment and review stages. Running too many agents produces more agents but worse decisions at the human-in-the-loop moments — which, from Chapter 1, is where the real bottleneck lives.
The right number of parallel agents is the largest number where your alignment and review quality don't degrade.
Worked example — Geoffrey Huntley's Ralph Loop as a minimal scheduling primitive
The Ralph Loop, documented publicly at ghuntley.com/ralph and in how-to-ralph-wiggum, is worth studying because it's the simplest functional parallel-agent scheduler anyone has published. Strip it down and it's this:
```bash
while :; do cat PROMPT.md | claude-code ; done
```
That one-liner is the whole scheduler. What makes it work is what it relies on:
- `PROMPT.md` — a deterministic prompt that tells the agent, on each fresh invocation, to look at the state of the repo and pick exactly one task to make progress on.
- `specs/` — the specification directory, the durable artifact the agent reads to know what "done" means.
- `IMPLEMENTATION_PLAN.md` — a live plan the agent reads and updates across iterations.
- Tests as backpressure — the agent cannot commit a task that fails tests, so a broken state naturally halts progress on that task until fixed.
Each iteration is a fresh context window. No conversation history carries across. This deliberately prevents the "context rot" Huntley identifies as the main failure mode of long-running agent sessions — the thing where an agent, after several hours of back-and-forth, starts repeating earlier mistakes or drifting from its original task. By throwing away context every loop, you pay a small startup cost in return for guaranteed freshness.
Huntley's how-to-ralph-wiggum repo further splits the loop into two modes:
- Planning mode — read `specs/` and the current `src/`, do gap analysis, update `IMPLEMENTATION_PLAN.md`.
- Building mode — read `IMPLEMENTATION_PLAN.md`, pick the most important task, implement it, run tests, commit.
You run planning periodically to keep the plan fresh against the spec, and building most of the time. This is a primitive version of the same specification-then-execution split that Claude Code's Plan Mode (see Chapter 3) now implements as a tool primitive.
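One way to interleave the two modes in the same shell loop. A sketch: the 5:1 build-to-plan ratio and the split prompt files are assumptions, not Huntley's published setup:

```bash
i=0
while :; do
  if [ $((i % 6)) -eq 0 ]; then
    cat PROMPT_PLAN.md  | claude-code   # refresh IMPLEMENTATION_PLAN.md against specs/
  else
    cat PROMPT_BUILD.md | claude-code   # pick one task, implement, run tests, commit
  fi
  i=$((i + 1))
done
```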
Why is this in the scheduling chapter and not earlier? Because Ralph is a mode-2/mode-3 scheduler implemented as a shell loop, with no tool lock-in. You can run it against Claude Code, Codex, or any agent CLI. Two Ralph loops in two worktrees is mode 2. Two Ralph loops in the same worktree against different PROMPT.md files (one for backend, one for UI) is mode 3. It is worth understanding because every managed product (Claude Code's team mode, Cursor's background agents, Devin's managed Devins) is, at its core, some variant of this pattern with ergonomics on top.
Worked example — Armin Ronacher's "Pi" minimal harness (Jan–Feb 2026)
If the Ralph Loop is the simplest scheduling primitive, Pi is the cleanest example of the opposite move: running parallel agents through a deliberately minimalist harness that the agent can modify itself. Armin Ronacher's three posts — Pi: The Minimal Agent Within OpenClaw (Jan 31), Porting MiniJinja to Go With an Agent (Jan 14), and A Language For Agents (Feb 9) — describe a workflow built around:
- A tiny core of four tools only (Read, Write, Edit, Bash). Everything else is an extension the agent itself can write.
- Self-modifying extensions. Pi hot-reloads extensions the agent writes during a session — so the agent genuinely extends its own harness as it learns the task.
- Branching as a first-class operation. Pi lets Ronacher rewind an agent's session to an earlier message and branch off a new path — avoiding the failure mode he calls vision quests, where an agent re-does work from scratch because its earlier context has drifted.
- Language-agnostic reimplementation. One of Ronacher's observations from the MiniJinja port: when code is cheap, it is often easier to have an agent reimplement a library in your target language than to wrangle cross-language build systems. This is a direct consequence of Chapter 6's cheap-failure principle.
What matters for this chapter: Pi is mode 2 and mode 3 implemented as a harness design rather than a tool feature. It demonstrates that you can get strong parallelism without depending on any vendor's team mode — the primitives are cheap if you treat the harness as something you own.
Worked example — Cherny's fifteen-session setup (late 2025 snapshot)
Currency note: the specifics below come from Cherny's Dec 2025 public sharing. Claude Code has shipped substantial Plan Mode, subagents, and skills changes in the four months since; treat this as an illustration of the pattern, not a current command reference.
On the other end of the aesthetic spectrum is Boris Cherny's publicly-shared workflow, which uses Claude Code's managed features rather than a bash loop. The setup, assembled from his X thread, his 15-tips roundup, and the Educative recap, looks roughly like this:
- Five terminal tabs numbered 1–5, each running a Claude Code session, typically on a separate worktree created via `claude -w`.
- Another five to ten sessions running on claude.ai in browser tabs, with `claude --teleport` (or `/teleport` in-session) used to move a session between the terminal and the web as convenient.
- Mobile sessions started and checked from his phone while away from the desk.
- `CLAUDE.md` at the root of each repo, growing by one entry every time an agent makes a mistake worth preventing.
- Slash commands for the repeatable parts of handoff — `/commit-push-pr` to stage + commit + push + open a PR in one shot, custom ones for the common workflows in that codebase.
- Subagents for specialized roles — a code-simplifier, a test-verifier — invoked from the primary session when their role is needed.
- Lifecycle hooks — `PreToolUse` to log shell commands, `Stop` to keep an agent running when it prematurely declares itself done.
- `/loop` and `/schedule` — convert workflows into persistent, periodically-running skills. Cherny's example is `/loop 5m /babysit` to run a housekeeping task every five minutes.
- A Chrome extension that lets Claude view and click through the UI it has just built — Cherny describes this as a 2–3× multiplier on UI work quality.
- System notifications used only for "an agent needs input," never for "an agent finished." This converts the inbox pattern into a pull pattern and is the single most-copied detail of his setup.
Notice what this setup is actually doing. The CLAUDE.md file is Key #3 encoded as a growing living document. The slash commands and hooks are alignment and handoff mechanization at the execution edges. The Chrome extension is agent-as-user testing from Chapter 4. The numbered tabs + notifications are the scheduling primitive. The claude -w worktree flag is mode 2. Cherny has essentially built the book's thesis into his daily ergonomics, and the fifteen-session throughput is the result.
You do not need Claude Code's specific features to replicate this. Everything in the list above has a shell-script equivalent; Huntley's Ralph loop is an existence proof. The pattern of "mechanize the edges, rotate attention between middles" is what matters.
A full-day example — mixed modes
To make this concrete, a day at Phase 4 might look like:
- Morning, 9:00–9:30: align requirements on Project A feature 1 (mode 1). Hand off to agent A1.
- 9:30–10:00: align requirements on Project A feature 2 (mode 2 — separate worktree). Hand off to A2.
- 10:00–10:20: align requirements on Project B (mode 1). Hand off to agent B1.
- 10:20–10:45: Agent A1 returns with a test plan. Audit at complexity-triaged depth. Approve. A1 begins implementation.
- 10:45–11:15: In parallel, A2 returns with a plan, B1 has questions on its spec. Answer B1; approve A2.
- 11:15 onward: A1 is building. Launch a mode-3 sub-parallel on A1 for UI testing (a second agent on the same feature). Meanwhile start best-of-N (Chapter 6) on a small refactor in Project B — two alternative attempts, 30 minutes each.
- Afternoon: read reports. Approve merges. Pick best-of-N winner. Start next round.
That's seven to eight agent-runs in a day, three projects, four modes in use. The rate-limiter is not the tools. It's your ability to hold the shape of each handoff in your head and come back to it cleanly.
Anti-patterns
- Reaching for mode 4 from Phase 1. You'll watch sub-agents collide on interfaces that were never defined, conclude that agent-internal parallelism doesn't work, and blame the feature. The feature is fine; you weren't ready.
- Using mode 2 without `worktree` (or equivalent isolation). Two agents on one checkout is a merge-conflict machine. Don't.
- Running mode 3 without the architecture step. Sub-agents on the same feature without defined interfaces drift and fight. Define first.
- Doing mode 1 when the real bottleneck is that one of the projects is early-stage and needs your attention. Parallelism across projects doesn't help if one project is absorbing 80% of your cognitive budget.
External voices
- Supporting: `git worktree` has been a known trick in AI-assisted development circles for a while; worth searching for posts by practitioners who wrote their own setup scripts. Anthropic's Claude Code team docs describe mode 4 from the inside.
- Challenging: many reports (HN, X) of mode 4 failing on real work. These are almost universally accurate as reports and should be read as "mode 4 needs architecture prerequisites," not "mode 4 is broken."

TODO (author's note): link the best `git worktree` writeup you know of; link Anthropic team-mode docs; link at least one honest "mode 4 didn't work for me" post-mortem.
What's next
Chapter 8 covers the bottleneck that emerges when parallel execution finally works: the output is more than you can read. The fix is the same trick applied one level up — letting agents triage agents.
Chapter 8: Digesting the Output
Thesis: Once parallel execution works, the bottleneck migrates to report digestion. The fix is the same trick applied one level up: let agents triage agents.
The problem nobody warns you about
By Phase 3 you have three or four agents producing genuinely useful work per day. The three keys from Part III have moved you out of the live loop on alignment, correctness, and maintainability. The scheduling patterns in Chapter 7 let you spread them across projects, features, and transaction types. The throughput is real.
And then you hit a new wall. Not code. Reports.
Each agent produces, per day:
- a pull request or two
- a set of test results
- an agent-as-user testing report
- a design/architecture summary
- a list of things it couldn't do and wants input on
- sometimes a follow-up suggestion list
Multiply by four agents and you have a couple dozen artifacts to process per day, each of which demands some amount of your attention. You used to bottleneck on writing code; now you bottleneck on reading your agents' output. Chapter 1's theme returns with full force: the bottleneck moves, it doesn't disappear. You've just moved it again.
This is the point where engineers who've broken into parallel dev start burning out. Not because the work is bad — the throughput is better than it's ever been — but because reading twenty-four mixed-quality reports a day is cognitively expensive in a way that reading one PR isn't.
The structure of the new bottleneck
Unlike the three chokepoints from Chapter 1, this one is not about judgment. It's about volume of undifferentiated information. The job is:
- Skim everything that came in.
- Notice when two reports are about the same underlying issue.
- Notice when one is a critical blocker and another is a "nice to have."
- Route things to the right follow-up (another agent, yourself, ignore).
- Preserve what needs to be preserved, drop what doesn't.
This is classically triage. And triage, structurally, is a job that agents are good at — provided you give them the right inputs. Parsing reports, clustering by theme, ranking by severity, writing a one-line summary per item: this is within the capability of any modern agent. What's missing is not the capability but the structure of the pipeline.
The trick: agents triage agents
The move is to treat the agent output stream the same way you treated the source code: as something that shouldn't require you in the inner loop.
A triage agent sits downstream of the execution agents. Its job is:
- Read all new reports since last pass. PRs, test results, user-testing reports, bug lists, design summaries.
- Cluster. Group reports describing the same underlying issue (two agents both reporting that the staging DB is slow; three UI reports all about the same copy issue; etc.).
- Prioritize. Rank clusters by a simple two-axis rubric — user impact × whether a workaround exists, or similar. For Phase 3+ engineers, the specific rubric matters less than consistency.
- Route. Each cluster gets a disposition: escalate to you, send back to an execution agent for fix, file as known issue, drop.
- Summarize. Produce one short human-readable report: the things that need your attention today, ranked, with a one-line reason each.
You read the triage agent's summary. Maybe ten items instead of a hundred. You make calls on the escalated ones. The rest are already routed.
This is the same pattern as the three keys: you are removed from the undifferentiated-volume stage and placed at the decision points only.
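Mechanically, a first triage pass can be a few lines of shell in the book's usual pipe convention. A sketch; the file layout (a reports/ directory, a triage prompt, a marker file) is an assumption for illustration:

```bash
# Bootstrap once with: touch .last-triage
# Assumes report filenames contain no spaces.
new_reports=$(find reports -name '*.md' -newer .last-triage)
if [ -n "$new_reports" ]; then
  cat triage-prompt.md $new_reports | claude-code > triage-summary.md
fi
touch .last-triage   # mark this batch as processed
```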
Why it works
The triage agent has three advantages over you doing the same job:
- Unlimited patience. It reads all twenty-four reports at the same attentional depth. You don't.
- Pattern-matching at volume. Clustering fifty items by theme is a task that degrades sharply for humans and doesn't for agents. The twenty-ninth item of the day gets the same analysis quality as the first.
- Consistency. The rubric it uses is applied identically across items. Your own rubric degrades as you get tired; the agent's doesn't.
It has real limitations too: the triage agent is as good as its inputs, and if the execution agents produce garbage reports, triage makes garbage piles. This is another argument for skills — the execution agents' reports should follow a structured shape.
Honesty about maturity
This chapter is the one where I have to be most honest about where the state of the art is: the triage layer is not a solved product. You can build one. I've built one (the zero-review/auto-triage skill is the reference implementation and it is still in active development). But you can't currently pick it up pre-built.
What you can do:
- Define a structured report format your execution agents must produce (loaded as a skill).
- Run a triage skill as a distinct agent at the end of each work cycle.
- Iterate on the rubric. The first version will over-escalate or under-escalate; tune over weeks.
This is, appropriately, another place where the break-in period shows up. The triage layer matures as you learn what your particular agents tend to get wrong.
Reference: zero-review/auto-triage (in development — partial reference only)
What to do until the triage layer is mature
Short-term defenses you can deploy today, while the triage idea is still evolving:
- Batch your reading. Don't look at agent reports as they land. Set two or three inbox-processing windows per day. Trying to treat reports like chat messages shatters your attention. Mitchell Hashimoto explicitly describes turning off notifications during deep work and running "end-of-day agents" that finish their reports overnight, so he reads batched output once rather than in real time.
- Force-structure reports at the source. Every execution agent should produce reports in a consistent format: status, what was done, what failed, what needs your input, what can wait. A skill that enforces this format pays back immediately.
- Use system notifications only for "needs input," not for "done." Cherny's setup, public on X, uses system notifications explicitly as the bounce-back signal — tell me only when an agent needs me, not when it finishes. This single change turns an inbox pattern into a pull pattern and is often the difference between "five agents feels like chaos" and "five agents feels like supervision."
- Be ruthless about the "what can wait" bucket. Many reports need acknowledgment, not action. A one-line "logged, moving on" is often the right response.
- Maintain a visible queue. A simple spreadsheet of open items, with agent source, date, and status, gives you a coarse triage layer even without an agent. Reviewing the queue once a day — rather than responding to each arrival — is already a big win.
A minimal structured-report skill
The highest-leverage thing you can add early, before any real triage layer exists, is a skill that forces every execution agent to end its work with a structured report. The minimum viable version specifies six sections — Status (COMPLETE / NEEDS_INPUT / BLOCKED), What was done, Tests (green/red/not run), What I was unsure about, What needs human input, Follow-ups I would recommend — and forbids free-form prose outside them.
Loaded on every execution agent, that single skill turns twenty free-form reports into twenty reports with the same six sections. A triage agent — or you, scanning by eye — can process the batch in a fraction of the time, because you know where in the report the parts you care about live. The authoring details for such a skill belong to The Skill Design Book; what matters here is the mechanism. It's the same shape as Cherny's /commit-push-pr slash command at smaller scale — forcing the output into a predictable form so the consumer downstream doesn't have to read each one from scratch.
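The shape is also cheap to enforce mechanically. A validator sketch, assuming the six sections are written as `##` headings in each report file:

```bash
# Flag any report missing one of the six mandatory sections.
sections='^## (Status|What was done|Tests|What I was unsure about|What needs human input|Follow-ups)'
for r in reports/*.md; do
  n=$(grep -cE "$sections" "$r")
  [ "$n" -lt 6 ] && echo "MALFORMED: $r ($n/6 sections present)"
done
```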
Structured reports are the single biggest digestion-layer win you can deploy while the full triage layer is still maturing. If you do nothing else from this chapter, do this.
The second-order effect
When the triage layer works, something non-obvious happens. Your execution agents start producing better reports, because:
- You're no longer reading individual reports. You're reading the triage summary.
- The triage agent is therefore the primary consumer of execution-agent output.
- So execution-agent skills evolve to produce reports the triage layer can cleanly ingest.
This is a virtuous cycle. The structure of the pipeline starts to encode a shared report schema without anyone needing to design one top-down. You'll notice it a few weeks in: reports from your agents start feeling like they were written for a machine, because they were.
Where the bottleneck moves next
If you've been reading carefully, you already know: the bottleneck moves again. Once triage works, the next place it shows up is at the decision level — the items the triage layer escalates to you are, by construction, the hardest ones. You don't spend less cognitive energy per item; you spend it on a smaller pile of harder items. Total wall-clock time goes down; per-decision load goes up.
Chapter 9 will deal with this directly. The point for now: there is no final bottleneck. There are only bottlenecks you've handled and bottlenecks you haven't yet. The sequence is what this book is really about.
External voices
- Supporting: the general pattern of "route decisions through an automated triage layer" is well-known in incident management (PagerDuty-style), in customer support (Zendesk-style), and in code review at scale (pull-request bots). Applying the same pattern to agent output is the natural translation.
- Challenging: skeptics (rightly) point out that triage agents can systematically miss novel failure modes — they route by pattern, and a genuinely new problem doesn't match any pattern. This is a real limit. Mitigation: a periodic full-stream review (say, weekly) where you read the raw reports, not the triage output, specifically to catch what the triage missed.
TODO (author's note): any pioneer posts on building their own triage layer for agent output? Screenshots of auto-triage in action would be great here.
What's next
Chapter 9 closes with the honest account: this whole setup won't make you more relaxed. You've traded muscle memory for continuous judgment, and the cognitive cost is real — even when the throughput gain is.
Chapter 9: It Won't Make You Lighter
Thesis: You've traded muscle memory for continuous judgment. Throughput multiplies; cognitive load stays flat or rises. If you don't actively defend slack, you'll burn out at higher productivity than you ever did before.
The uncomfortable admission
Every chapter so far has been about how to get more done. This one is the only one that tells the truth about what that costs.
Parallel AI development is a leveraged position. Leverage amplifies returns, and it amplifies demands. Dollar productivity goes up; the cost of each hour of your life does not go down. You will ship more, and at the end of the day you will feel at least as tired as when you wrote the code yourself, and often more, because the mix of work has shifted.
Nobody selling AI productivity wants to say this. But if this book doesn't say it, it's not an honest book.
Mitchell Hashimoto's My AI Adoption Journey (Feb 2026) is unusually clear about this side of the trade. He frames it as a three-phase arc: inefficiency (the excruciating break-in), adequacy (the agents work but you're tired), and only eventually workflow discovery — the phase where the leverage starts to feel natural. His explicit ongoing discipline in that third phase: notifications off during deep work, batched-reading windows, background agents that run while he's away from the keyboard so "positive progress happens when I can't work" — all of it specifically designed to keep the judgment reserve from draining faster than it refills.
Armin Ronacher's A Language For Agents (Feb 2026) raises the companion concern: comprehension debt. When it becomes trivial to generate plausible code, you accumulate code that works but whose logic you cannot explain. At Phase 4 throughput, that debt grows fast. Ronacher's guidance is uncompromising: if you cannot explain the code, it is not ready to ship. That rule alone reintroduces some of the cognitive load the agents were supposed to remove — which is, in a way, the whole point of this chapter.
Research cited in Simon Willison's blogmarks (Ranganathan & Ye) found that AI does not reduce knowledge work so much as intensify it: developers juggle more active threads, not fewer. All three pioneers are working successfully at Phase 4. None of them describe the experience as relaxing.
What exactly changed in your day
Consider how a traditional engineer's day feels. You sit down with a task. You think for a minute, write some code, run it, think again, adjust. The cognitive profile is mixed: some minutes are intense (debugging a hard issue, getting the abstraction right), many minutes are semi-automatic (typing out the boilerplate you've typed ten thousand times, wiring up a form, writing a straightforward loop). Your hands and your brain alternate. When you're stuck, you have a half-automatic move — "let me try the obvious thing" — that your background cognition can carry out while your focus recovers.
Now compare a Phase 4 parallel day. You spend almost no time on the semi-automatic work — the agents do it. What's left for you is the judgment work, continuously:
- Is this requirement actually crisp? (Ch 3)
- Does this test plan cover what matters, at the right depth for the complexity? (Ch 4)
- Should I approve this architecture, or push back? (Ch 5)
- Is this one of three attempts good enough to ship, or do I want a fourth? (Ch 6)
- Which agent is running which mode, and am I rotating right? (Ch 7)
- What did the triage layer miss this week? (Ch 8)
Every one of these is a decision. Decisions are expensive in a way that typing is not. Typing boilerplate is rest for your prefrontal cortex. A day of nothing but decisions is not.
You've gone from a mix of typing and thinking to pure thinking. The throughput is higher. The fatigue is also higher.
```mermaid
%%{init: {'theme':'neutral'}}%%
pie showData
    title Rough hours shift (Phase 4 day, illustrative)
    "Judgment / decisions" : 5
    "Typing / routine execution" : 1
```
Why throughput goes up but feeling of relief doesn't
A rough accounting of the change in a typical day:
- Hours spent typing: 6 → 1. Big win.
- Hours spent thinking: 2 → 5. Big loss.
- Total hours: 8 → 6. Some win.
- Things shipped: 1 feature → 3 features. Big win.
The headline is real: more shipped, fewer hours. But the distribution of those hours is brutal. Five hours of sustained decision-making with almost no decompression windows is more exhausting than eight hours of mixed work. Your productivity goes up. Your reserve goes down.
This isn't an artifact of poor discipline. It's structural. The three keys specifically moved the mechanical work out of your day. What you're left with is what the mechanical work was hiding — the fact that software engineering, at the level you're now doing it, is continuous judgment.
Actively defending slack
The engineer who lasts in this role is not the one who "works harder." They're the one who deliberately protects rest windows and treats them as infrastructure.
Some practices that help:
- Batch reading agent output. Don't treat reports like chat messages. Two or three dedicated reading windows per day, with the rest of the day closed to notifications, gives you context-switch budget to spend on alignment rather than reaction.
- Dedicate mornings to alignment, afternoons to review. Alignment needs deep attention; review needs less. Matching cognitive state to task is worth real percentage points.
- Hard-stop at a decision budget. When you've made fifteen substantive decisions in a day, your next decision is noticeably worse than your fifth. Stop. Save the remaining work for tomorrow.
- Don't confuse "letting go" with "being free." The agents are running; you could start another three. The marginal cost to you is small per agent — and nonzero. Adding agents to fill time you could rest with is the fastest path to burnout in this workflow.
- Protect unstructured thinking time. You still need time to think about the shape of the project, not just the shape of the current task. That time isn't going to appear spontaneously; you have to schedule it.
- Exit the loop on purpose, periodically. Spend one day a week doing something that isn't touching agents. The returns on the days you do work get higher, not lower, when you do this.
A reframing that helps
The most useful reframe I've found: you are no longer a software engineer. You are the operator of a software engineering system.
A solo developer writing code by hand is like a blacksmith: hands-on, producing one item at a time, fatigue paced by the physical rhythm of the work. A Phase 4 engineer running parallel agents is like a factory supervisor: designing the line, setting the quality bar, walking the floor, intervening when things go wrong, keeping several stations coordinated.
The factory ships more. The supervisor is also more tired at the end of the day, in a different way — no burned hands, but a depleted judgment reserve. They can't just "work another hour" at the end of the day; judgment doesn't scale linearly with time the way typing does.
This is not a bad trade. Most people who make the switch would not go back. But it is a trade, not a win on every axis.
When to slow down
Signs that you've over-extended and need to pull back:
- You're approving test plans without really reading them.
- Your triage agent is escalating items to you, and you're waving them through instead of acting on them.
- You've caught yourself writing code directly instead of going through the pipeline, not because it was faster but because thinking about the pipeline felt like effort.
- You're irritated at the agents for things that are actually your responsibility (ambiguous alignment, unclear test plans).
- You can't tell which projects are in a good state at the end of the week.
Any of these is a signal to reduce concurrency for a few days, restore your reserve, and restart at a lower level. The leverage will still be there when you come back.
The honest bottom line
Parallel AI development is, in this author's experience, the single biggest productivity change in a software engineer's working life since the introduction of source control. It is genuinely a 3–5× multiplier on what you can ship on the days you work. It is also a workflow where the total cognitive load per day is higher, not lower, than what you had before.
Both of these are true. People who promise you the first without mentioning the second are selling you something. The ones who tell you only the second have usually never gotten out of Phase 1.
The goal of this book has been to make the trade legible enough that you can choose it knowingly — and then, having chosen, to give you a reasonable chance of reaching a state where the trade actually pays.
You're trading muscle memory for continuous judgment. That trade ships more work. It does not ask less of you.
External voices
- Supporting: burnout research in knowledge work consistently finds that the cognitive profile of pure decision-making — as opposed to mixed cognitive-plus-automatic work — is a strong predictor of exhaustion. Engineering management literature (Will Larson's posts in particular) captures the move from "doing" to "deciding" in the transition to technical leadership, which maps closely to this shift.
- Challenging: some practitioners report that AI parallelism has made them more relaxed, not less, because the tedium is gone. This is real for some people, particularly those whose day job was heavily boilerplate and who now have more time for creative work. Temperament matters. The warning in this chapter applies most strongly to engineers who already ran hot.
TODO (author's note): if you have posts from pioneers who have been honest about the exhaustion side, drop them here. Also any writing from tech leads on the "doing → deciding" transition.
What's next
Chapter 10 closes the book by stepping back: everything in these nine chapters generalizes beyond code. Anything that can be decomposed into independent subtasks obeys the same rules. Coding is just the first place the loop closed.
Chapter 10: Beyond Code
Thesis: Every decomposable knowledge task obeys the same rules. Coding is just the first domain where the loop closes end-to-end — and understanding that tells you which other domains are coming next.
Why code was first
If the framework in this book (three chokepoints, three keys, four scheduling modes, break-in period) is really about knowledge work, why did coding reach this point first?
Three structural reasons, and they're worth naming because they predict where this spreads next:
- Code has formal verification. Tests give you an automatic answer to "is this right?" You can run them, get a green or red, and the agent can close its own loop. Almost no other knowledge work has this property natively.
- Code has cheap failure. A wrong implementation is caught by a test in seconds. A wrong diagnosis, a wrong strategy memo, or a wrong design is caught by a human much later, at much higher cost.
- Code has a mature engineering culture. Software engineering has already argued for decades about modularity, testing, and interface design. When agents arrived, there was a vocabulary (Ousterhout, Martin, Beck) ready to be encoded into skills. Most other fields lack this.
Take these three away and you get the opposite: a domain where agents can't self-verify, failure is expensive, and there's no inherited discipline to encode. That domain cannot currently run the three-keys playbook. The domains that can run it are the ones with the strongest analogs to all three.
The same framework, re-applied
Consider data analysis. Replace "code" with "analysis":
- Chokepoint 1 (requirement alignment): what are we actually trying to decide with this data? Which stakeholder will act on the answer?
- Chokepoint 2 (correctness): is the analysis methodologically sound? Are the assumptions valid? This is the hard one — there's no `npm test` equivalent. But there is a "test plan for the analysis": a checklist of assumptions, sensitivity checks, alternative slicings. An agent can run them.
- Chokepoint 3 (maintainability): will the next person to touch this dataset understand what was done? Are the derivations reproducible? Is the pipeline documented?
The three keys translate directly:
- Requirement alignment → define the decision the analysis will drive. Use Chapter 3's two techniques unchanged.
- Correctness as contract → build the methodological checklist before the analysis. Let the agent run it and report failures.
- Discipline as code → encode data-team conventions (how notebooks are structured, where intermediate outputs live, how units and provenance are documented) as skills.
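What "let the agent run it" can look like in practice: a minimal sketch in Python, assuming a pandas DataFrame with hypothetical `user_id`, `date` (datetime), and `revenue` columns. The individual checks are illustrative; the mechanism is the point: encode each methodological assumption as a runnable check, and report failures.

```python
# analysis_checks.py -- a minimal sketch of a "test plan for the analysis".
# Column names (user_id, date, revenue) and thresholds are hypothetical.
import pandas as pd

def check_no_duplicate_keys(df: pd.DataFrame) -> None:
    assert df["user_id"].is_unique, "duplicate user_id rows"

def check_date_coverage(df: pd.DataFrame) -> None:
    span_days = (df["date"].max() - df["date"].min()).days
    assert span_days >= 28, f"only {span_days} days of data"

def check_outlier_sensitivity(df: pd.DataFrame) -> None:
    # Recompute the headline metric with the top 1% trimmed; flag if it moves.
    full = df["revenue"].mean()
    trimmed = df.loc[df["revenue"] < df["revenue"].quantile(0.99), "revenue"].mean()
    assert abs(full - trimmed) / full < 0.10, "headline metric is outlier-driven"

CHECKLIST = [check_no_duplicate_keys, check_date_coverage, check_outlier_sensitivity]

def run_checklist(df: pd.DataFrame) -> None:
    for check in CHECKLIST:
        try:
            check(df)
            print(f"PASS  {check.__name__}")
        except AssertionError as err:
            print(f"FAIL  {check.__name__}: {err}")
```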
```mermaid
flowchart TB
    D[Decomposable knowledge task]
    D --> Q1{Checks for correctness<br/>can be specified?}
    Q1 -->|Strong| K[Three keys + parallel patterns apply]
    Q1 -->|Weak / subjective| H[Human-heavy verification<br/>narrow parallel surface]
```
This isn't speculative. Teams are doing it. The break-in period looks familiar: Phase 1 chaos, Phase 2 awareness, Phase 3 templates, Phase 4 leverage. Same curve.
Other domains where the loop is closing
A partial and opinionated list, in rough order of how close each is:
- Data analysis. Already well along. Analysis notebooks with agent-written methodology, running against methodology checklists, with style skills encoding team conventions. Ships today for some teams.
- Research / literature review. The test-plan analog is "criteria for what counts as a relevant source" and "claims the review must address." Multiple agents can review in parallel on different sub-questions. Merge is the summary.
- Design / UX exploration. Best-of-N is the natural move: multiple design attempts against a shared spec, human picks. Skills encode the design system. Agents can critique each other's work against accessibility and consistency checklists.
- Writing (long-form, technical). This book is being drafted this way. Chapter outlines are the requirement. A consistent voice is the test plan (with agent-as-reader checking for voice breaks). Skills encode the style guide. Multiple chapters draft in parallel.
- Legal / compliance review. The contract is clear ("these clauses must/must not appear"). Mature; the bottleneck is professional liability, not technical.
- Strategy / decision memos. Harder, because there's no automatic "right answer" test. But the alignment-first + multiple-attempts-filtered pattern still helps dramatically for framing.
- Operations / incident response. The triage layer in Chapter 8 is already a version of this. Agent-driven runbook execution, with triage routing to humans only on novel failures, is moving fast.
Notice the pattern: the domains furthest along are the ones where "what counts as correct" can be specified, even if it can't be fully automated. The domains lagging are the ones where correctness is genuinely subjective and resists encoding.
What doesn't translate (yet)
Being honest about limits:
- Hardware-bounded work. An agent can plan and document a physical experiment; it cannot run the lab. Until robotics catches up, physical iteration loops stay human.
- Work requiring a body of tacit judgment. Clinical medicine, senior negotiation, courtroom advocacy. The agent can prepare and assist; it cannot currently replace the tacit competence of a practitioner who has seen ten thousand cases.
- Anything where the cost of one wrong output is catastrophic and not recoverable. Launch-vehicle code. Surgery. Monetary policy. The cheap-failure assumption (Chapter 6) breaks, and the whole parallel playbook tightens sharply. You can still use it, but you have to re-price the "cheap" in cheap failure.
The boundaries will move. Robotics will close the hardware gap. Better calibration on agent uncertainty will extend the safe range for high-stakes work. The current list of "doesn't translate" is a snapshot, not a verdict.
What this means for readers
If you're a working engineer using this book for coding, the last chapter is still actionable: the skills you've built through the break-in period generalize. Not the specific code skills — those stay tied to your codebase — but the meta-skills:
- How to structure requirement alignment.
- How to write a test plan that is actually an acceptance contract.
- How to encode discipline as a loadable document.
- How to schedule parallel work without drowning in output.
- How to defend your slack against the leverage.
These meta-skills are worth more than any specific codebase skill, because they transfer. Every new domain you enter after coding will have its own version of the three chokepoints, the three keys, the four modes. You'll recognize them faster because you've seen them in code first.
The meta-frame, one more time
Strip the whole book down to one sentence:
Parallel AI productivity is not an agent capability. It is a workflow you and your agents co-adapt into, by mechanizing the three places where your attention used to be forced serial. The domains where you can mechanize those three places are the domains where this works.
That sentence is short enough to remember. It's also nearly complete — the break-in period, the scheduling modes, the cheap-failure phase change, and the honest cost in Chapter 9 are all implied consequences of it.
A closing note
The framework in this book is opinionated. It is also not finished. The triage layer (Chapter 8) is actively under development. Mode 4 (Chapter 7) is genuinely frontier. The translation to domains beyond code, discussed here, is earliest-days. If you read this book in a year and some specifics look dated, the shape of the argument — bottleneck moves, keys unlock it, break-in is real, honest cost — should still hold.
If the shape doesn't hold, someone will have found a deeper frame. That would be a good outcome too.
External voices
- Supporting: writings on AI-augmented workflows in research (various "AI-assisted literature review" posts), writing (Ethan Mollick's Co-Intelligence), strategy (the emerging genre of "AI-assisted memo" posts in ops/strategy circles). Each is an independent rediscovery of the three-chokepoints / three-keys structure in a different domain.
- Challenging: domain experts in each of the non-code fields above have real, often correct, reasons to doubt that "their domain" is next. Listen to the reasons. Most of them are variants of "we don't have a test-plan analog" — which is true now and may change.
TODO (author's note): drop in pioneer posts from data, research, writing, design — anywhere the three-keys pattern shows up under different vocabulary.
The end
Thank you for reading. The rest is yours — your break-in, your skills, your projects, your parallel agents. This book cannot do the work for you. It can only tell you that the road has a shape, and what the shape is.
Good luck. Ship well. Stay rested.
By Atum — Source: github.com/A7um/ParallelDevelopmentBook
Source Catalog
A working bibliography for The Parallel Development Book.
Currency policy: this book's practices are sourced from the last six months of public writing at time of drafting — roughly October 2025 through April 2026. AI coding practice moves fast; what was leading-edge in early 2025 is often stale by late 2025. Sources older than six months are included only when (a) the underlying argument is structural and durable (e.g., Hillel Wayne on the limits of testing, Ousterhout on software design) or (b) the piece is the ancestor of a current practice and worth citing as history, in which case it is explicitly labeled as such.
Every citation below is dated. If a date is in the past six months relative to April 2026, it is treated as current practice. If it is older, it is explicitly noted as historical or durable structural reference.
Companion Artifacts
- `zero-review/auto-req` — referenced in Chapter 3.
- `zero-review/auto-dev` — referenced in Chapters 4 and 5.
- `zero-review/auto-test` — referenced in Chapter 4.
- `zero-review/auto-triage` — referenced in Chapter 8.
- The Skill Design Book — the author's companion book on `SKILL.md` authorship. Essential background for Chapter 5.
Current Practice (Oct 2025 – April 2026)
Spec-Driven Development and Plan Mode
The dominant 2026 framework for requirement alignment and correctness contracts.
- What Is Spec-Driven Development? A Practitioner's Guide for AI Coding (Augment Code, April 2026) — defines the six-element spec and the Adversarial Agent Pattern. Cited in Chapters 3, 4.
- Spec-Driven Development: From Code to Contract in the Age of AI Coding Assistants (arXiv, Jan 2026) — academic framing of SDD. Cited in Chapter 3.
- Plan Mode in Claude Code (Feb 2026) — 4-phase Explore/Plan/Implement/Commit cycle, one-sentence rule. Cited in Chapters 3, 4.
- Claude Code Plan Mode: Complete Guide (2026) (Mar 2026) — more detailed Plan Mode reference.
AGENTS.md engineering
The 2026 cross-tool standard, replacing tool-specific CLAUDE.md / .cursorrules.
- How to Build Your AGENTS.md (2026) (Augment Code, March 2026) — lean (<150 lines), nested, symlinked, version-controlled. Cited in Chapter 5.
- Is "AGENTS.md Engineering" The Next Optimisation Approach? (Feb 2026) — surveys the ETH Zurich research on context-file performance degradation. Cited in Chapter 5.
- Agents.md best practices gist (2026) — working engineer's practical notes on symlinks and cross-tool compatibility.
- Anthropic Agent Skills documentation (ongoing, materially updated late 2025) — on-demand skill loading vs. always-loaded `AGENTS.md`. Cited in Chapter 5.
Recent capability landmarks
- Claude Opus 4.5 Unlocks the "No Restart" Workflow (Dec 2025) — the capability that makes extended autonomous test-fix-test loops genuinely practical. Cited in Chapter 4.
- The Parallel AI Workflow Developer Setup For 2026 (March 2026) — current concrete tooling survey: terminal tabs, worktrees, slash commands.
- Best Tools for Running Parallel AI Coding Agents in 2026 (March 2026) — `ccmanager`, `dmux`, `agentree`, and other emerging orchestrators.
- State of AI agent coders April 2026: agents vs skills vs workflows (April 2026) — community snapshot of which abstractions still matter.
Pioneer practices, last six months
Boris Cherny (Claude Code creator) — Dec 2025 X thread and tips:
- X thread on numbered-tab parallel workflow (Nov 2025). Cited in Chapters 1, 7.
- 15 Claude Code tips shared by Cherny (Dec 2025) — `claude -w`, `/teleport`, `/loop`, `/schedule`, hooks, subagents. Cited in Chapter 7.
- Claude Code creator reports 259 PRs in 30 days (Dec 2025). Cited in Chapters 1, 2.
- Educative — Master this workflow from the creator of Claude Code (Dec 2025 / early 2026). Cited in Chapter 7.
- Head of Claude Code: What happens after coding is solved (Lenny's Newsletter, Feb 2026).
Mitchell Hashimoto — Feb 2026 adoption memoir:
- My AI Adoption Journey (Feb 5, 2026). Three-phase arc (inefficiency → adequacy → workflow discovery). Cited in Chapters 2, 5, 8, 9.
- Pragmatic Engineer — Mitchell Hashimoto's new way of writing code (Feb 2026).
- Zed — Agentic Engineering in Action with Mitchell Hashimoto (2026).
Armin Ronacher — Jan–Feb 2026 Pi/OpenClaw series:
- Pi: The Minimal Agent Within OpenClaw (Jan 31, 2026). Cited in Chapter 7.
- Porting MiniJinja to Go With an Agent (Jan 14, 2026). Cited in Chapter 7.
- A Language For Agents (Feb 9, 2026) — comprehension debt. Cited in Chapter 9.
- Syntax.fm — Pi, The AI Harness That Powers OpenClaw (Feb 2026).
- Armin Ronacher Leaning In To Find Out — PyAI Conf 2026 talk (2026).
Addy Osmani — Dec 2025 / 2026 workflow posts:
- My LLM coding workflow going into 2026 (Dec 2025). Cited in Chapters 3, 5, 10.
- Top AI Coding Trends for 2026 — Beyond Vibe Coding (early 2026) — Agent Skills formalization. Cited in Chapter 5.
Geoffrey Huntley — Ralph Loop methodology (ongoing, materially current):
- Ralph Wiggum as a "software engineer" (ongoing / late 2025). Cited in Chapters 1, 4, 7.
- `how-to-ralph-wiggum` repo (maintained 2025–2026).
- how to build a coding agent: free workshop.
- I dream about AI subagents.
Cognition / Devin — multi-agent principles:
- Don't Build Multi-Agents (2025, argumentative). Cited in Chapter 7.
- Devin can now Manage Devins (2025–2026). Cited in Chapter 7.
Simon Willison — ongoing practice blog:
- Agentic Engineering Patterns (2026) — Red/Green TDD as first-class agentic pattern. Cited in Chapter 4.
- Embracing the parallel coding agent lifestyle (Oct 5, 2025) — skeptic-to-convert piece. Cited in Chapter 2.
Historical Ancestors (pre-Oct 2025, included for context)
These are the 2024–early 2025 practices that led to the current consensus. Cite them as history — they are not current practice.
- Harper Reed — My LLM codegen workflow atm (Feb 2025). The three-file pattern (`spec.md` + `prompt_plan.md` + `todo.md`) whose prompts are structurally still sound but whose lack of verification criteria and interface specification makes it incomplete by 2026 SDD standards. Cited as ancestor in Chapter 3.
- Aider — Separating code reasoning and editing (Sep 2024). The architect/editor split was the first production articulation of "specify first, implement second" — but the current consensus (Plan Mode + SDD + Adversarial Agent) has moved substantially beyond it. Not cited in current chapters.
Skeptical / Critical Voices
Deliberately engaged rather than avoided.
- Harper Foley — Ten AI Agents Destroyed Production. Zero Postmortems. (2025–2026). Cited in Chapter 1 as a forecast of what skipping mechanization produces.
- Marc Nuri — The Missing Levels of AI-Assisted Development (2025). The "ladder becomes a drop" argument. Cited in Chapters 1, 2.
- Why AI Agents Keep Failing in Production (Data Science Collective, 2026) — compounding-error analysis.
- The 80% Problem: Why AI Coding Agents Stall (Feb 2026) — Ronacher-inspired analysis of assumption propagation and comprehension debt.
Durable Structural References (intentionally older)
These are older than the six-month window but are cited for structural claims that do not become stale.
- John Ousterhout — A Philosophy of Software Design (2018/2021). Cited in Chapter 5 for the core engineering-discipline principles.
- Kent Beck — Test-Driven Development: By Example (2002). Cited in Chapter 4 as the TDD reference TPD contrasts with.
- Hillel Wayne on the limits of testing and formal methods (2018–2024). Cited in Chapter 4:
- Why Don't People Use Formal Methods?
- Why TDD Isn't Crap
- Requirements change until they don't (cited in Chapter 3).
- Will Larson — An Elegant Puzzle, Staff Engineer. Background for Chapter 9.
- Ethan Mollick — Co-Intelligence. Touchpoint for Chapter 10.
Tools Referenced
- Cursor
- Claude Code
- Codex
- Devin
- Pi / OpenClaw — Ronacher's minimal harness, the mode 2/3 exemplar in Chapter 7.
- Playwright
- `git worktree`
If you are reading this book on GitHub and know of a recent (<6 months) source that belongs here, pull requests are welcome.