AI-Driven Development: Don't Slow Down, Double Down

A couple of months ago Mario Zechner published "Thoughts on slowing the fuck down". It's the most honest piece I've read about what's actually going wrong as teams hand more of their codebases over to coding agents, and it started a debate with my colleague that I haven't been able to drop. This post is my answer.

Zechner's diagnosis is sharp. Agents don't learn from their mistakes the way humans do, so the tiny errors that humans used to catch through the friction of typing now compound at an unsustainable rate. He calls agents "complexity merchants": they reach for the over-engineered patterns from their training data and produce enterprise-grade mess in days, not years. Agentic search in a large codebase has poor recall, so agents duplicate code that already exists. His prescription is to put the human back in the loop as the bottleneck: hand-code the architecture, keep humans as the final quality gate, reserve agents for the boring, scoped, evaluable stuff.

I agree with almost every word of his diagnosis. I think his prescription is solving the right problem the wrong way.

A scoping note: this post is mostly about small-to-medium features, where the speed/quality tension is sharpest and most day-to-day shipping happens. For bigger architectural work my workflow doesn't change much — same agent-drafted plan, same multi-agent review — there are just more humans in the design-review step.

The naive AI flow

A developer picks up a Jira ticket, opens Claude Code or Codex or Cursor, explains the feature, and starts implementing. In the better version, either the developer or the tool produces a plan first; the developer skims it, nudges it, and the agent goes off and writes the code and the tests. The developer runs it locally, opens a PR. CI passes — unit, integration, lint. Another developer reviews.

AI tends to produce large PRs covering whole features, so the reviewer can't realistically check everything. They look at the plan, scan the main logic, ship it.

At a glance, this is the same flow we always had: develop, test, PR, review. So what's the problem?

What the old flow gave us by accident

Before agents, a lot of things that mattered for code quality happened as side effects of typing. The developer read the surrounding code because they had to find where to make their change. They built up scope understanding because writing forced them through the branches. Unit tests surfaced edge cases because writing the test made the gap obvious. PRs were smaller because typing is slow, so reviewers could trust the author had read it and could realistically check the whole thing. Style and conventions got enforced because a different human was reading it later.

None of those things were policies. They were byproducts of the bottleneck. When the bottleneck went away, the byproducts went with it.

This is what Zechner is really pointing at when he talks about tiny errors compounding. He's right that we lost something. Where I disagree is on how to get it back.

The wrong fix and the right one

His answer is to put the human bottleneck back where it matters most — hand-own architecture and the system-shaping APIs, keep humans as the final quality gate, restrict agents to scoped, evaluable work where the cost of an oversight is bounded. Not "type everything," but keep agency and judgement firmly with the human wherever mistakes compound.

I think that throws out much of the actual win — the speed — to recover a side effect. The byproducts of typing weren't the typing. They were understanding, coverage, an outside perspective, and a check against over-engineering. We can rebuild every one of those explicitly. With agents.

The nostalgia trap

It's worth being honest about the counterfactual. The "good old days" of careful hand-coded software reviewed by attentive humans weren't most days. Real human review at 5pm on a 600-line diff was not a high-water mark. Most of the bugs that shipped in the pre-AI era were stupid oversights: a missing null check, a branch nobody walked, a contract somebody forgot existed, a class of issue that needed breadth the reviewer happened not to have that week.

Agents don't just bring speed. They cross-check. A second model reading the same diff has the patience to walk every branch (which a human reviewer almost never does on a real PR), and finds things humans probably wouldn't have found at all. Multi-agent setups catch whole categories of stupid mistake and oversight at a rate that doesn't degrade after lunch.

When this works, the output isn't "shipped faster, same quality as before." It's better than what we used to ship. Not because any individual agent is smarter than the engineers were, but because the bottleneck used to be human attention — and human attention has bad days, narrow expertise, and finite patience for boring code. Replacing it with several patient readers who are less likely to share the same blind spots is, on the merits, an upgrade.

Ownership, not orchestration

Before any of the workflow below works, one thing has to be true: you own the code the agent wrote under your name. Not "Claude did that." Not "the agent decided." You did, because you shipped it.

That mental shift is what separates a developer using agents from an orchestrator pressing merge. An owner reviews the plan deeply, spot-checks that the code actually matches it, and treats every agent decision as one they're personally accountable for, including the ones they trusted without reading line by line. You can delegate the work; you can't delegate the accountability. The multi-agent review pipeline that follows isn't a substitute for ownership; it's leverage for an owner. Without ownership, every check below is theater: more reviewers, same rubber stamp.

How AI-driven development should actually work

A few principles, roughly in the order they apply during a feature:

0. Use at least two coding agents, backed by different models from different vendors. This is the single highest-leverage change on the list. Claude Code and Codex, for example. Every model has its own blind spots and its own complexity-merchant instincts. Two models disagreeing about a plan or a piece of code is a much harder signal to ignore than one model being confident — and it's the closest thing we have to the second pair of eyes that the old flow gave us for free.

It's also part of the answer to Zechner's "agents don't learn" worry: a single agent doesn't learn from its mistakes inside a session, but two agents reviewing each other don't have to — they catch them in the moment. The other half of that answer is principle 7.

One caveat: multiple agents don't compensate for context that all of them are missing. If none of them has the system map, the related-service code, or the existing tests, they can all confidently miss the same thing. Principle 0 only pays off if principle 1 is real.

1. Give the agent the same context a human would have. Requirements, design docs, Jira, relevant Confluence pages, the source of related services. Most "the AI did something dumb" stories are really "the AI didn't have the context a human would have demanded." This is also the answer to the poor-search-recall point: don't make the agent guess at the codebase, hand it the map.

Handing over the map presupposes someone has drawn it. In practice this is one of the highest-leverage places a human stays in the loop: maintaining a lightweight index — a CLAUDE.md / AGENTS.md or a top-level pointer file that names the important modules, ownership boundaries, and where to look for related services. The agent doesn't need you to recite the codebase; it needs to know which doors to open. That index is also the thing that pays compounding interest across every future feature.

2. Start with a plan / spec document. Together with principle 3, this is the contract you sign with the agent before any code gets written — if you skimp here, no amount of downstream review saves you. Have the primary agent produce a .md spec from the requirements before any code is written. This is the artifact everything else hangs off — and crucially, it catches the complexity-merchant tendency before it turns into 2,000 lines of premature abstraction.

3. Review the plan with a human and the other agents. This is where the multi-agent setup pays off the most. A second model reading the same spec routinely catches ambiguities, missing edge cases, and over-engineering the first agent reached for instinctively. Fix the plan now, not the code later. Nobody's plan is right on the first try — not yours, not the agent's. Iterating the spec until it's actually right is the cheapest work you'll do on the whole feature; iterating the code later is the most expensive.

4. Implement against the plan. One agent drives. If the feature naturally splits across multiple PRs, keep a single shared plan/design document referenced from each PR — so the thread of intent is visible across the whole feature, not just inside one diff.

5. Multi-agent review before the PR opens. Different models, different prompts, looking for different classes of issues — correctness, security, style, test coverage. The developer is one of the reviewers, not the only one. Diagram below.

6. Human PR review still matters — but adapt it to reality. Hardly anyone is manually checking a large AI-generated PR end to end, and pretending otherwise is how we end up rubber-stamping. This is where the owner mentality earns its keep: the reviewer should check the plan, focus on the parts where judgement matters, and lean on automated AI reviewers (ideally wired into the source control system) for the mechanical coverage. When the human reviewer does catch something the agents missed, that finding goes straight into principle 7.

7. Turn every recurring agent mistake into something structural. This is the other half of the answer to "agents don't learn." Two reviewers catch each other's one-off mistakes inside a session, but the pattern of mistakes — the same wrong abstraction, the same forgotten edge case, the same wrong import — is the real cost across time. Each time you spot a class of issue an agent reliably reaches for, lift it into something durable: a test, a lint rule, a checklist line in the plan template, a skill or instruction entry (CLAUDE.md, AGENTS.md, a .claude/skills/ file, whatever your stack uses), an item in the auto-generated system doc.

The highest-signal trigger for this is the human reviewer. Anything they catch is something the whole pipeline let through — which makes it the most expensive lesson you'll buy on a given PR. The fix can't just be "merge the patch and move on." Every finding from a human reviewer should also turn into a rule the agents will follow next time: a test that locks in the behavior, a line in the skill or instruction file that names the failure mode, a checklist item the planning agent has to satisfy. That's how you teach. Whatever memory agents pick up between sessions is per-user and opaque; the repo memory is shared, enforceable, and seen by every agent on every run.
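As one illustration of lifting a recurring mistake into a rule, here is a minimal sketch of a repo-level check for the "same wrong import" case. Everything named in it is hypothetical: the banned module `legacy_http`, its suggested replacement, and the function `find_forbidden_imports`. The only point is that a reviewer's finding becomes a check that runs on every future change, whichever agent wrote it.

```python
import ast

# Hypothetical rule set: modules an agent keeps reaching for, mapped to the
# guidance a human reviewer gave. These names are illustrative only.
FORBIDDEN_IMPORTS = {
    "legacy_http": "use http_client instead",
}


def find_forbidden_imports(source: str, filename: str = "<source>") -> list[str]:
    """Return one message per forbidden import found in the given source."""
    findings = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules = [node.module]
        else:
            continue
        for module in modules:
            # Match on the top-level package so submodules are covered too.
            root = module.split(".")[0]
            if root in FORBIDDEN_IMPORTS:
                findings.append(
                    f"{filename}:{node.lineno}: import of '{module}' is banned "
                    f"({FORBIDDEN_IMPORTS[root]})"
                )
    return findings
```

Wired into the test suite or a pre-commit hook, a non-empty result fails the build. Adding a rule is one more dictionary entry, which keeps the cost of recording a lesson close to zero.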

The shape, end to end

The shape to notice: every stage that used to depend on a single human's diligence now has at least two independent checks, and at least one of them is a model that didn't write the code. That's the friction Zechner is asking for — just installed somewhere other than the keyboard.

This isn't theoretical — we actually run it

The pre-PR review step is the one most teams try to handwave with "I'll skim the diff before pushing." We wired it into a small CLI instead. It takes the diff of your branch and fans it out to several reviewers in parallel — different models from different vendors, by default. Their findings get aggregated and deduplicated, each one validated against the actual source, and then a second model validates that aggregation. The output is a single report: a verdict, issues grouped by severity, and which reviewers found each one.

It isn't sophisticated — just a script that shells out to the agent CLIs you already have, runs them concurrently, and merges what comes back. That's the whole trick, and it's why the specific tool doesn't matter: any team can build the equivalent in an afternoon, tuned to its own models, prompts, and source-control conventions.
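To make the shape concrete, here is a minimal sketch of that fan-out. It is not our actual tool. Reviewers are modeled as plain Python callables so the example runs as is; in a real script each one would wrap a call to whatever agent CLI you have installed and parse its output. The `Finding` fields, the severity names, and the dedup key (file, line, normalized message) are all assumptions, and the validation passes described above are left out.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable

# Assumed severity scale; lower number sorts first in the report.
SEVERITY_ORDER = {"high": 0, "medium": 1, "low": 2}


@dataclass(frozen=True)
class Finding:
    file: str
    line: int
    severity: str
    message: str


# A reviewer takes the diff text and returns its findings.
Reviewer = Callable[[str], list[Finding]]


def review_diff(diff: str, reviewers: dict[str, Reviewer]) -> list[dict]:
    """Run every reviewer on the same diff concurrently and merge the results."""
    with ThreadPoolExecutor(max_workers=max(1, len(reviewers))) as pool:
        futures = {name: pool.submit(fn, diff) for name, fn in reviewers.items()}
        results = {name: fut.result() for name, fut in futures.items()}

    # Deduplicate: findings at the same location with the same message
    # (ignoring case and whitespace) count as one issue seen by several reviewers.
    merged = defaultdict(lambda: {"finding": None, "found_by": []})
    for name, findings in results.items():
        for f in findings:
            key = (f.file, f.line, " ".join(f.message.lower().split()))
            entry = merged[key]
            current = entry["finding"]
            # Keep the most severe rating any reviewer gave this issue.
            if current is None or SEVERITY_ORDER[f.severity] < SEVERITY_ORDER[current.severity]:
                entry["finding"] = f
            entry["found_by"].append(name)

    return sorted(
        merged.values(),
        key=lambda e: (SEVERITY_ORDER[e["finding"].severity], e["finding"].file, e["finding"].line),
    )
```

Each entry in the returned list carries the finding plus the names of the reviewers that reported it, which is enough to render a report grouped by severity. Exact-match deduplication is the crudest option that works; different models rarely phrase the same issue identically, which is one reason a model-based aggregation step earns its place.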

None of this has to stay local. The same fan-out can also run as a hosted or CI-backed pipeline: kicked off automatically when a PR opens, results posted straight back to the diff, nothing to run by hand. It's the same review either way; what differs is whether you run it or a service runs it for you.

The point is that "multi-agent review" has to be a button you press, or a pipeline that presses it for you — not a discipline you remember. Skipping costs nothing, so unless pressing that button costs almost nothing too, you'll skip it.

One more thing: keep the docs alive

If you're shipping features at AI speed across multiple repos, the system documentation goes stale faster than any human can maintain it. The lightweight index from principle 1 is worth keeping by hand (it's small and judgement-heavy), but the full cross-repo description of the system isn't. Don't hand-maintain that; have the agents keep it up to date, regenerated as part of the workflow. It's cheap, and it's the context that feeds principle 1 on every future feature.

The short version: Zechner is right that something real broke when we stopped typing. He's wrong that the fix is to restore the keyboard bottleneck. The fix is to rebuild what we lost — explicitly, with multiple agents and a real plan artifact — and if we do it honestly, we land somewhere better than where we started.

Don't slow down. Double down.

A note on how this post was produced: same workflow. My raw thoughts plus Zechner's article went in as context; Claude drafted; I reviewed manually; Codex reviewed the draft and surfaced fixes; Claude applied them. Several rounds of this loop tightened the argument and the prose. Then a human PR reviewer caught what the loop had glossed over — exactly the role principle 6 describes — and those notes drove another revision pass. The piece you just read is the output of the loop it's arguing for.