A development workflow in which the engineer directs an AI coding agent at a high level — describing intent, reviewing diffs, correcting course — rather than typing each line of code themselves. The term spread after Andrej Karpathy used it on X in early 2025.

Do AI-native tools make developers faster?

Yes, but less than marketing suggests. Public PR data and engineering-leader surveys converge on a real but modest productivity gain — most visible on scaffolding, refactors, and well-specified feature work; smaller on novel architecture or subtle bug investigation.

Did code quality drop?

Not on average, in our audit. Teams that adapted their review process — human intent summaries, plan-before-code, explicit ownership of every line — held quality steady. Teams that did not adapt saw measurable regressions.

What's the most common failure mode?

Domain-assumption bugs. The code looks right, tests pass, reviews are clean, but the agent has assumed something about the business domain that is wrong. Adding an explicit 'what did we forget to tell it?' review step catches most of these.

Which tool has won?

None has clearly won. Cursor and Claude Code dominate the IDE layer; Replit Agent and v0 own the zero-to-one prototype; Aider and OpenHands lead terminal-native flows. Most teams use two or three depending on the task.

Vibe Coding Audit: What AI-Native Development Actually Built in 2025-26

The term "vibe coding" entered the developer lexicon in early 2025, after Andrej Karpathy used it on X to describe a workflow in which the human is more director than typist. Within a year, the pattern had a stack: Cursor and Claude Code at the IDE layer, Replit Agent and v0 for greenfield, Aider and OpenHands for terminal-native flows. By mid-2026 the question is not whether AI-native development works. It is what kind of software it produces.

This audit looks at three observable signals from a year of production work: how pull requests changed shape, how code review changed, and whether incident rates moved.

Pull requests got bigger and more numerous at the same time

The cleanest dataset on the shift comes from public GitHub. In the GH Archive dump for 2024-Q4 through 2026-Q1, the pattern is consistent across mid-sized open source projects: median PR size grew roughly forty percent, and PR count per active contributor grew by about half. The two trends together explain why senior engineers describe the year as "more shipped, more code per ship" rather than as straightforward speedup.
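
For readers who want to reproduce this kind of measurement, the two metrics can be sketched roughly as below. This is a minimal illustration, not the audit's actual pipeline. It assumes each record is a GitHub-style `PullRequestEvent` dict whose `payload.pull_request` carries `additions` and `deletions`, that "PR size" means additions plus deletions, and that an "active contributor" is anyone who opened at least one PR in the window. The function names `pr_shape` and `relative_growth` are hypothetical.

```python
from collections import Counter
from statistics import median


def pr_shape(events):
    """Summarise opened PRs in one time window.

    Returns the median PR size (additions + deletions, an assumed
    definition) and the number of opened PRs per active contributor.
    """
    sizes = []
    opened_by = Counter()
    for event in events:
        if event.get("type") != "PullRequestEvent":
            continue
        payload = event.get("payload", {})
        if payload.get("action") != "opened":
            continue  # count each PR once, when it is opened
        pr = payload.get("pull_request", {})
        sizes.append(pr.get("additions", 0) + pr.get("deletions", 0))
        opened_by[event["actor"]["login"]] += 1
    if not sizes:
        return {"median_size": 0, "prs_per_contributor": 0.0}
    return {
        "median_size": median(sizes),
        "prs_per_contributor": len(sizes) / len(opened_by),
    }


def relative_growth(before, after):
    """Fractional change between two windows, e.g. 0.4 for +40%."""
    return (after - before) / before
```

Running `pr_shape` once per quarter and passing the two medians (or the two per-contributor counts) to `relative_growth` gives the kind of year-over-year comparison described above.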

There is a confounding factor. Many of the larger PRs are scaffolding — full test suites, generated API clients, regenerated lockfiles — that a human author would have spread across several smaller commits. Reviewers are reading more lines, but the proportion of those lines that need actual judgment did not grow the same way.
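
One rough way to correct for this confound is to split each PR's changed lines into scaffolding and everything else before comparing sizes. The sketch below does that by file path. The pattern list is purely illustrative and hypothetical; a real project would need its own, and path matching will misclassify some files.

```python
from fnmatch import fnmatchcase

# Hypothetical path patterns for the scaffolding categories named above:
# regenerated lockfiles, generated API clients, and test suites.
SCAFFOLDING_PATTERNS = (
    "*.lock",
    "package-lock.json",
    "*/generated/*",
    "tests/*",
    "*_test.py",
)


def split_changed_lines(files):
    """Split changed lines into (scaffolding, judgment) totals.

    `files` is an iterable of (path, changed_line_count) pairs.
    A file counts as scaffolding if its full path or its base name
    matches any pattern; everything else is treated as needing judgment.
    """
    scaffolding = 0
    judgment = 0
    for path, changed in files:
        name = path.rsplit("/", 1)[-1]
        is_scaffolding = any(
            fnmatchcase(path, pattern) or fnmatchcase(name, pattern)
            for pattern in SCAFFOLDING_PATTERNS
        )
        if is_scaffolding:
            scaffolding += changed
        else:
            judgment += changed
    return scaffolding, judgment
```

Tracking the second number over time, rather than raw PR size, is closer to the quantity reviewers actually feel.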

Review changed shape, not depth

The review side is more interesting. Two sources, a study by the DORA team at Google and an informal survey we ran with twelve engineering leaders at companies of between fifty and two thousand engineers, landed on the same observation: reviewers are spending less time on syntax and structure (the AI tools handle that) and more time on architecture, dependency, and security questions.

This is not a small change. The classic objection to AI-assisted PRs — "I do not know what the agent did, so I cannot review it" — turned out to be solvable. The solution was process, not tooling. Teams that adopted three habits got there:

A short human-written summary at the top of every PR, explaining intent.
A rule that the author runs the agent's plan past the reviewer before letting the agent code.
A norm that "AI did it" is never an answer to a review question — the author must understand and defend every line.
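
The first two habits are mechanical enough that a team could lint for them. The sketch below is one hypothetical way to do so, assuming a PR description convention with an `Intent:` line and a `Plan reviewed by: @handle` line. Neither convention comes from the teams surveyed, and the five-word minimum is an arbitrary threshold.

```python
import re

# Assumed convention: "Intent: <one-line summary>" somewhere in the description.
INTENT_LINE = re.compile(r"^Intent:[ \t]*(.+)$", re.MULTILINE)
# Assumed convention: "Plan reviewed by: @handle" records the reviewer's sign-off.
PLAN_LINE = re.compile(r"^Plan reviewed by:[ \t]*@\w[\w-]*", re.MULTILINE)

MIN_INTENT_WORDS = 5  # arbitrary threshold, chosen for illustration


def missing_habits(description):
    """Return a list of process problems found in a PR description."""
    problems = []
    intent = INTENT_LINE.search(description)
    if intent is None or len(intent.group(1).split()) < MIN_INTENT_WORDS:
        problems.append("intent summary missing or too short")
    if PLAN_LINE.search(description) is None:
        problems.append("no reviewer sign-off on the agent's plan")
    return problems
```

The third habit, that the author can defend every line, is not something a script can check. It only shows up in the review conversation itself.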

Teams that skipped these habits report a different experience: review fatigue, regressions that took weeks to surface, and a quiet erosion of code quality that is hard to attribute to any single change.

Incident rates: the verdict is "no worse, slightly better, except where it isn't"

The most-watched metric is incident rate, and here the data is less flattering to the hype.

Public postmortems from Cloudflare, GitLab, and Sentry's open incident review suggest that severity-one incident rates were flat to slightly down year-over-year through 2025 and into 2026. That is consistent with what individual engineering leaders report: AI-assisted code is no worse than human code on average, and modestly better at catching the boring class of bug (off-by-one, null check, race condition in test setup).

The exception is the long tail of subtle correctness bugs — the ones where the code passes tests, lints clean, and reviews fine but is wrong about a domain assumption. Those are harder to surface when the human did not write each line themselves. Multiple teams reported that they introduced a new review step explicitly for "what did the agent assume that we did not tell it?" and that the step paid for itself.

What the audit suggests

A year in, the honest summary is:

Productivity gains are real. They are smaller than the loudest demos suggest, but they are real and they compound.
Quality holds when process holds. Teams that adapted their review and intent-documentation habits did not see a regression. Teams that did not, did.
The most reliable workflows are the most boring ones. Pair-programming with an agent that you stop and correct early outperforms long autonomous loops on every measured axis.

The era of "vibe coding" is not the era of human-out-of-the-loop coding. It is the era of human-as-editor, and the engineering organizations that figure out the new editorial process first will keep their quality intact while the rest pay for the delay in incidents.

Sources

GH Archive — gharchive.org
Google DORA program — dora.dev/research
Cloudflare engineering postmortems — blog.cloudflare.com/tag/post-mortem
GitLab incident management handbook — about.gitlab.com/handbook/engineering/incident-management