AI coding agents are making software development faster.
But not evenly faster.
This is not the same conversation we were having in 2024. Back then, most tools still felt like autocomplete, chat, or clever demos. The real shift came later: Claude Code, Codex-style cloud agents, agent SDKs, browser/computer-use agents, and longer-horizon coding models made it realistic to delegate chunks of engineering work rather than just ask for snippets.
That makes the problem more important, not less.
Modern agents are very good at producing code. They can scaffold features, refactor files, add tests, migrate APIs, generate configuration, and produce a pull request while you are still thinking about the implications.
That feels powerful. It is powerful.
But it also creates a new problem: someone still has to understand what changed.
And that someone is usually a senior developer, tech lead, or architect.
The bottleneck is moving.
It used to be mostly about writing code. Now it is increasingly about reviewing code, understanding intent, validating correctness, checking architecture, and deciding whether the generated solution is something the team wants to live with.
That work is slower, more subtle, and mentally expensive.
Generated code still becomes owned code
The dangerous illusion is that if code was generated quickly, it was also delivered quickly.
That is not how real systems work.
A coding agent can produce a large diff in minutes. But the team still owns the result for months or years. The production incidents, edge cases, security concerns, performance behavior, maintainability problems, and future refactorings do not care whether the original code was typed by a human or generated by a model.
Once it is merged, it is your code.
This is where I think the current AI coding conversation often becomes too shallow. We talk a lot about generation speed. We talk less about comprehension bandwidth.
But comprehension is the scarce resource.
If an agent creates 800 lines of code, a senior engineer may need to reconstruct the reasoning behind those lines:
- Why was this design chosen?
- Which assumptions are hidden in the implementation?
- What are the failure modes?
- Does this fit the existing architecture?
- Are the tests proving behavior or just confirming the implementation?
- Is this code simple enough that someone else can safely change it later?
None of these questions disappear because the code compiles.
The review burden is real
This is not just a feeling.
AWS has written about the problem using systems thinking: when AI assistants speed up coding, bottlenecks shift elsewhere in the value stream. One of their examples is pull request review queues, where senior developers become overloaded reviewing AI-generated code that is syntactically correct but raises architectural questions.
The 2025 Stack Overflow Developer Survey shows the same tension more clearly than the 2024 data. Usage kept climbing: 84% of respondents were using or planning to use AI tools, and 51% of professional developers used them daily. But trust moved in the opposite direction. More developers distrusted AI output accuracy than trusted it: 46% versus 33%. The biggest frustration, reported by 66%, was that AI solutions are “almost right, but not quite”. Another 45% said debugging AI-generated code is more time-consuming.
That is the review bottleneck in plain terms: AI is useful enough to produce work, but not reliable enough to remove verification.
The same survey also found that agents were not yet universal in 2025. A majority of developers either did not use agents or stuck to simpler AI tools, and 38% had no plans to adopt agents. But among people who did use agents, around 70% said agents reduced time spent on specific development tasks and 69% said they increased productivity. Only 17% said agents improved team collaboration. That distinction matters. Agents can help individuals move faster while making team-level coordination and review harder.
Jellyfish found the same pattern in pull request data. Higher AI adoption correlated with both more throughput and larger pull requests. Moving from 0% to 100% AI adoption corresponded to PRs growing from 74.8 to 88.4 additions on average, an 18.2% increase. Their interpretation is balanced: bigger PRs may contain more robust handling and documentation, but they may also be more complex and harder to maintain.
Either way, bigger PRs mean more to review.
METR’s 2025 study adds a useful cold shower. In their randomized trial with experienced open-source developers working on real issues in large repositories, developers were 19% slower when allowed to use early-2025 AI tools, even though they expected to be faster. That does not mean AI makes developers slower in general. The study is narrower than that. But it is strong evidence that in mature codebases, the cost of understanding, steering, correcting, and verifying AI output can eat the apparent speed gain.
DORA’s 2025 report keeps the focus in the right place: AI-assisted software development has to be measured at the system level, not only as individual typing speed or local task completion. The question is not whether an agent can generate a patch. The question is whether the organization can safely absorb, verify, operate, and maintain the increased flow of changes.
The generation got faster.
The system did not automatically get better.
Senior people become the constraint
This is especially visible for senior developers and architects.
Junior developers can ask an agent to implement something. Product people can ask for a prototype. Multiple agents can work in parallel. The amount of code entering the system can grow very quickly.
But architectural judgment does not scale the same way.
A senior reviewer is not just checking formatting. They are protecting the shape of the system.
They are asking whether the implementation respects boundaries, whether the abstraction is right, whether the error handling matches operational reality, whether the solution will survive the next five feature requests, whether the data model is drifting, and whether a simple business rule is becoming a framework.
This is exhausting because it requires holding the existing system, the proposed change, the business context, and the future maintenance cost in your head at the same time.
AI makes this harder when it produces plausible code without a clear design trail.
A human-written pull request often carries some implicit narrative. You can ask the author why they did something. They remember the tradeoffs. They can explain the false starts.
With agent-generated code, the person who opened the pull request may not fully understand every line. The reviewer then has to review both the code and the human’s understanding of the code.
That is a new kind of cognitive load.
The slope toward vibe coding
I understand the temptation to stop reading everything.
When agents produce changes faster than you can review them, the natural response is to trust the tests, skim the diff, and move on. Sometimes that is reasonable. For low-risk code, internal tools, prototypes, or well-contained changes, full manual review of every line may be wasteful.
But there is a slope here.
At the top of the slope, AI is a disciplined assistant. It helps you explore options, write tests, make small changes, and verify behavior.
At the bottom of the slope, you are merging code you do not really understand because the demo works and the agent sounded confident.
That is vibe coding.
The problem is not that vibe coding exists. It is fine for experiments. It is fine for learning. It is fine for throwaway prototypes.
The problem is when vibe coding silently enters production systems.
You can ship without understanding for a while. But the debt accumulates in places where tests are weak, requirements are fuzzy, and architecture matters. Eventually someone has to debug it. Usually under time pressure. Usually the same senior people who were already overloaded.
Code review has to change
The answer is not to reject AI-generated code.
That would be silly. The productivity gain is real, and the tools will keep improving.
The answer is to change what we expect from code review and from the agents producing the code.
A pull request should not merely contain a diff. It should contain evidence.
For AI-assisted work, I want to see things like:
- the problem statement in plain language,
- the intended behavior,
- the architectural boundaries touched,
- the main alternatives considered,
- the risks and assumptions,
- the tests added or changed,
- the commands run,
- screenshots or logs where relevant,
- known limitations,
- and a short explanation of why the change is safe to merge.
This is not bureaucracy. It is compression.
It gives the reviewer a way to understand the change without reverse-engineering the entire thought process from the diff.
Agents should help produce this evidence. If an agent can write the code, it can also summarize the design, list the files changed, explain the test strategy, and identify risky areas. If it cannot explain the change clearly, that is already a review signal.
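The evidence list above can also be checked mechanically before a human looks at the pull request. Here is a minimal sketch, assuming the PR description uses markdown-style headings; the section names and the `missing_evidence` function are hypothetical and only mirror the list in this post. Screenshots and logs are left out because they are only needed where relevant.

```python
import re

# Hypothetical section names, mirroring the evidence list in this post.
REQUIRED_SECTIONS = [
    "Problem",
    "Intended behavior",
    "Boundaries touched",
    "Alternatives considered",
    "Risks and assumptions",
    "Tests",
    "Commands run",
    "Known limitations",
    "Why this is safe to merge",
]


def missing_evidence(pr_description: str) -> list[str]:
    """Return required sections that are absent or have no content."""
    # Build {heading: body} from markdown-style "## Heading" lines.
    sections: dict[str, str] = {}
    current = None
    for line in pr_description.splitlines():
        match = re.match(r"^#{1,6}\s+(.*\S)\s*$", line)
        if match:
            current = match.group(1).lower()
            sections[current] = ""
        elif current is not None:
            sections[current] += line.strip()
    # A section counts as missing when the heading is absent or its body is empty.
    return [name for name in REQUIRED_SECTIONS if not sections.get(name.lower())]
```

A check like this cannot judge whether the evidence is any good. It only makes sure the reviewer is not the first one to notice that it is missing.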
Smaller diffs matter more now
AI makes large diffs cheap.
That does not make large diffs good.
In fact, small batches become more important when using agents. If code generation is cheap, we should spend that advantage on making changes smaller, safer, and easier to review, not on dumping more code into each pull request.
A good AI-assisted workflow should push toward:
- smaller tasks,
- narrower pull requests,
- stronger automated tests,
- clearer acceptance criteria,
- explicit design notes,
- repeatable verification,
- and reviewable evidence.
The goal is not maximum code output.
The goal is maximum trustworthy progress.
Those are different things.
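One way to keep pull requests narrow is a size budget in CI. The sketch below parses the output of `git diff --numstat`, which prints added lines, deleted lines, and the path separated by tabs, with `-` in place of counts for binary files. The 400-line limit and the function names are assumptions for illustration, not a recommended number.

```python
def diff_size(numstat: str) -> tuple[int, int]:
    """Sum added and deleted lines from `git diff --numstat` output."""
    added = deleted = 0
    for line in numstat.splitlines():
        parts = line.split("\t")
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        a, d = parts[0], parts[1]
        if a == "-" or d == "-":
            continue  # binary files report "-" instead of line counts
        added += int(a)
        deleted += int(d)
    return added, deleted


# Hypothetical budget; tune it per team and repository.
MAX_CHANGED_LINES = 400


def within_budget(numstat: str, limit: int = MAX_CHANGED_LINES) -> bool:
    """True when the total of added and deleted lines stays within the limit."""
    added, deleted = diff_size(numstat)
    return added + deleted <= limit
```

A failing budget does not have to block the merge. It can simply ask the author, or the agent, to split the change before a reviewer spends time on it.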
Architecture becomes a review discipline
For architects, this shift is important.
Architecture used to be partly enforced by implementation friction. If a change was hard to make, people had time to notice that it crossed boundaries or violated the design.
Agents reduce that friction. They can make sweeping changes quickly. They can connect things that should not be connected. They can produce a working implementation that quietly damages the structure of the system.
So architecture has to become more explicit.
Not in the old heavyweight enterprise architecture sense. I do not mean giant documents nobody reads.
I mean lightweight, reviewable constraints:
- module boundaries,
- dependency rules,
- API contracts,
- naming conventions,
- testing expectations,
- security requirements,
- observability standards,
- and clear examples of good local design.
The better these constraints are encoded, the easier it becomes for both humans and agents to stay inside them.
This is one reason I think the next serious productivity gains will not come from agents that simply write more code. They will come from teams that build better engineering systems around agents: repository instructions, isolated sandboxes, checkpoints, architectural rules, automated tests, observability, and review evidence.
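A dependency rule is one of the easier constraints to encode. The sketch below uses Python's standard `ast` module to list the packages a source file imports and compares them against a rule table. The package names (`domain`, `web`, `database`) and the `FORBIDDEN` table are hypothetical; a real project would derive them from its own module boundaries.

```python
import ast

# Hypothetical layering rule: code in the key package must not import the listed packages.
FORBIDDEN = {
    "domain": {"web", "database"},
}


def imported_modules(source: str) -> set[str]:
    """Collect top-level package names imported by a piece of Python source."""
    names: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # Relative imports (level > 0) stay inside the package and are ignored.
            names.add(node.module.split(".")[0])
    return names


def violations(package: str, source: str) -> set[str]:
    """Return the forbidden packages that `source`, living in `package`, imports."""
    return imported_modules(source) & FORBIDDEN.get(package, set())
```

Run over every file in CI, a check like this turns a boundary from something a reviewer has to remember into something an agent gets immediate feedback on.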
The new senior skill: verification design
Senior developers will still need to read code.
But the job is changing.
The new skill is not manually inspecting every generated line forever. That does not scale.
The new skill is designing verification systems that make code review less heroic:
- tests that capture business behavior,
- linters and static analysis that enforce mechanical rules,
- architectural fitness functions,
- CI gates that check the boring things,
- agents that do first-pass review,
- observability that catches real-world failure,
- and pull request templates that force evidence into the open.
Human review should focus on the things humans are still best at: intent, judgment, taste, ownership, risk, and fit with the larger system.
If senior people spend their time catching formatting issues, missing null checks, or obvious test gaps, the process is broken. Machines should handle that.
If senior people spend their time deciding whether a change belongs in the system at all, they are doing the work that matters.
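That split can be made explicit in how reviews are routed. The following is a rough sketch under stated assumptions: the `Change` type, the path prefixes, the 200-line threshold, and the route names are all hypothetical. The idea is only that failed automated checks go back to the author first, and senior attention is reserved for changes that are risky or large.

```python
from dataclasses import dataclass


@dataclass
class Change:
    paths: list[str]
    changed_lines: int
    checks_passed: bool  # outcome of the automated gates (tests, linters, and so on)


# Hypothetical path prefixes where a mistake is expensive.
HIGH_RISK_PREFIXES = ("auth/", "billing/", "migrations/")


def review_route(change: Change, max_light_review_lines: int = 200) -> str:
    """Decide who should look at a change next."""
    if not change.checks_passed:
        return "fix-automated-checks"  # do not spend human attention yet
    if any(path.startswith(HIGH_RISK_PREFIXES) for path in change.paths):
        return "senior-review"
    if change.changed_lines > max_light_review_lines:
        return "senior-review"
    return "peer-review"
```

The exact rules matter less than having them written down, so that where senior attention goes is a decision rather than an accident.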
My current rule
My working rule is simple:
I do not need to type every line, but I need to understand what I merge.
That is the line I do not want to cross.
AI can help me move faster. It can help me explore. It can write first drafts. It can generate tests. It can review its own output. It can find mistakes I would miss.
But it cannot own the consequences.
The team owns the consequences.
So if reviewing feels like the bottleneck now, I do not think that is a temporary annoyance. I think it is a signal. We are discovering where the real constraint moved.
The future of AI-assisted software development will not be decided by who can generate the most code.
It will be decided by who can review, verify, and own code without burning out the people whose judgment the system depends on.
Sources
- AWS Executive Insights: Measuring the Impact of AI Assistants on Software Development
- Stack Overflow: 2025 Developer Survey — AI
- DORA: State of AI-assisted Software Development 2025
- Jellyfish: Better Code, or Just Bigger? AI-Assisted Pull Requests Are 18% Larger
- METR: Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity
- OpenAI: New tools for building agents
- OpenAI: Introducing Codex
- OpenAI: Introducing ChatGPT agent
- Anthropic: Claude 3.7 Sonnet and Claude Code
- Anthropic: Introducing Claude Sonnet 4.5
- Addy Osmani: Code Review in the Age of AI