AI Code Generation: What the Velocity Numbers Hide
Your team adopted an AI coding assistant three months ago. The velocity metrics look fantastic. Pull requests are way up. The engineering director presented the numbers at the all-hands. Everyone applauded.
Then you look at the bug tracker. Defect rate is climbing too. Not fast. A steady uptick in bugs per sprint, the kind that doesn’t trigger alarms until you plot the trendline. The bugs are subtle. Edge cases the AI handled confidently but incorrectly. Error handling that swallows critical exceptions. An API call to a method that doesn’t exist in the version you’re running. Nobody connected the dots because the bugs look like code a senior engineer would write. They pass review. They pass the happy-path tests. They fail in production for a fraction of requests in ways that take hours to diagnose.
You turned on the autopilot. The altitude looks great. Nobody’s watching the instruments.
- AI coding assistants speed up routine tasks meaningfully. Complex tasks show minimal gain and sometimes negative impact after correction time.
- Defect rates climb without adapted review practices. The bugs are subtle, confident, and pass superficial review because AI-generated code looks polished.
- Security vulnerabilities increase with AI assistance, not decrease. Stanford research found developers using AI wrote more vulnerable code while rating it more secure.
- Code review must evolve, not relax. Traditional review catches logic errors. AI-generated code demands scrutiny on edge cases, hallucinated APIs, and security anti-patterns.
- Net productivity gain is far smaller than vendor demos suggest after accounting for review overhead, bug fixes, and correction cycles.
Velocity and defect rate rarely share a dashboard. That gap is where the real story hides.
What the Productivity Numbers Actually Show
DORA’s research consistently shows that throughput metrics without quality gates produce misleading signals. More pull requests is an input metric. Working software is an output metric. Vendor studies report the input.
The gain isn’t spread evenly. Boilerplate, test scaffolding, CRUD, configs, documentation: meaningfully faster. These are pattern-matching tasks where the AI shines because public repositories have millions of examples. Complex tasks tell a different story. System design, cross-service coordination, performance-sensitive code. The AI misses design rules it can’t see. Your system has 15 years of decisions in the heads of three senior engineers. The AI fills the gap with confident guesses. Turbulence at 30,000 feet and the autopilot is holding course into a mountain.
Net productivity is real but smaller than the pitch deck. And that gain only holds if review and testing practices keep up. The altimeter looks great. The fuel gauge tells a different story.
The 70% Trap
AI gets you 70% to a working solution. The happy path works. The structure follows recognizable patterns. Code from a talented contractor who has never seen your architecture docs and never asked.
The remaining 30% is where engineering lives. Boundary conditions. Race conditions under concurrency. Error handling that keeps context instead of swallowing it. Timeout values that match your actual latency, not a round number from training data.
A function that processes a list works for 10 items. At 10 million, it runs out of memory because the AI loaded the entire collection instead of streaming. Syntactically correct. Practically a ticking bomb. And the code review passed because it looked clean. The autopilot was flying level. Just toward the wrong airport.
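A minimal sketch of that contrast, assuming a line-oriented file and a hypothetical processLine handler; Node's fs and readline APIs do the streaming:

```typescript
import { createReadStream, readFileSync } from "node:fs";
import { createInterface } from "node:readline";

// What the assistant typically generates: load everything, then iterate.
// Fine at 10 items. At 10 million, the process runs out of memory.
function processAllAtOnce(path: string): void {
  const lines = readFileSync(path, "utf8").split("\n");
  for (const line of lines) processLine(line);
}

// The streaming version: constant memory, same result.
async function processStreaming(path: string): Promise<void> {
  const rl = createInterface({ input: createReadStream(path, "utf8") });
  for await (const line of rl) processLine(line);
}

// Hypothetical per-record handler, standing in for real business logic.
function processLine(line: string): void {
  if (line.length > 0) console.log(line.length);
}
```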
Edge cases are one thing. Security is where the stakes get genuinely dangerous.
Security at Scale
Research from Stanford found that developers using AI assistants produced more security vulnerabilities while rating their code as more secure. Confidence up. Security down. The Dunning-Kruger effect, automated.
The mechanism is simple. AI learns from public repositories. Public repositories contain mountains of insecure code. SQL string concatenation. Unsanitized input rendering. Leaked stack traces. Hardcoded API keys. The AI reproduces these patterns because frequency drives the ranking, not correctness. The most popular answer is not the safest answer. (Ask any security team.)
The patterns are subtly insecure. Parameterized WHERE clauses paired with string-interpolated ORDER BY. Token validation that skips scope checks. Length-checked input with unchecked content. OWASP Top 10 vulnerabilities wearing a clean suit.
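Here is that first pattern in practice: a hedged sketch using a node-postgres-style client (client.query with positional parameters); the table and column names are hypothetical:

```typescript
import { Client } from "pg";

// Subtly insecure: WHERE is parameterized, ORDER BY is not.
// "sortColumn" comes straight from the request, so this is SQL injection
// hiding behind a query that looks parameterized at a glance.
async function listUsersUnsafe(client: Client, minAge: number, sortColumn: string) {
  return client.query(
    `SELECT id, name FROM users WHERE age >= $1 ORDER BY ${sortColumn}`,
    [minAge]
  );
}

// Identifiers can't be bound as parameters, so allowlist them instead.
const SORTABLE = new Set(["name", "age", "created_at"]);

async function listUsersSafe(client: Client, minAge: number, sortColumn: string) {
  if (!SORTABLE.has(sortColumn)) throw new Error(`unsortable column: ${sortColumn}`);
  return client.query(
    `SELECT id, name FROM users WHERE age >= $1 ORDER BY ${sortColumn}`,
    [minAge]
  );
}
```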
The scale problem adds up fast. A human might introduce one SQL injection vulnerability per quarter. An AI reproducing the pattern across 20 endpoints in a single sprint introduces twenty. Copy-paste on steroids. Application security tooling catches some, but static analysis was built for human-written code at human speed.
Code Review in the AI Era
Traditional review assumes the author understood the code they wrote. With AI-assisted development, that assumption breaks. The author may have accepted a plausible suggestion without checking every edge case. Different starting point. Different protocol needed.
| Aspect | Traditional Review | AI-Aware Review |
|---|---|---|
| Author assumption | Understood the code | May have accepted a plausible suggestion |
| Edge case coverage | Spot-check is enough | Step-by-step check required |
| Dependency accuracy | Trust imports | Verify API signatures exist in pinned version |
| Security patterns | General awareness | Explicit checklist per review |
| Error handling | Check for presence | Check for correctness and specificity |
| Test coverage | Adequate if green | Must cover AI-specific risk areas |
Three review practices become non-negotiable when AI writes big chunks of your codebase.
Dependency verification. AI assistants make up API methods. response.getStatusMessage() when the API actually has response.statusText. A library import for a package that exists on npm but isn’t in your package.json. Citing sources that don’t exist. Verify every import against pinned versions. Every time.
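A sketch of that statusText example. The endpoint is hypothetical; the point is that the hallucinated method only blows up on a path the happy-path tests never hit (TypeScript flags it at compile time, plain JavaScript does not):

```typescript
// Hallucinated: Response has no getStatusMessage() method.
async function checkHealthHallucinated(url: string) {
  const response = await fetch(url);
  // @ts-expect-error -- the method the assistant invented; throws at runtime in JS
  return response.getStatusMessage();
}

// What the fetch API actually exposes.
async function checkHealth(url: string) {
  const response = await fetch(url);
  return `${response.status} ${response.statusText}`;
}
```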
Edge case interrogation. For every AI-generated function, ask: what happens with null input? Empty input? Maximum-size input? Concurrent access? Network timeout? The AI covered the happy path. It doesn't think about production failure modes. It thinks about training data.
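What that interrogation produces, sketched against a network call. The naive version is typical assistant output; the 5-second budget in the hardened version is a placeholder you would replace with your measured latency:

```typescript
// Happy-path version: no timeout, no status check. Hangs forever if the
// server stalls, and treats a 500 error body as valid JSON.
async function fetchJsonNaive(url: string) {
  const res = await fetch(url);
  return res.json();
}

// After interrogation: what happens on timeout? On a 500?
async function fetchJson(url: string, timeoutMs = 5_000) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetch(url, { signal: controller.signal });
    if (!res.ok) throw new Error(`HTTP ${res.status} from ${url}`);
    return await res.json();
  } finally {
    clearTimeout(timer); // don't leak the timer on the success path
  }
}
```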
Security pattern checklist. For AI-generated code touching user input, auth, data access, or external calls: parameterized queries for every clause? Error responses exclude stack traces? Auth checks verify scope, not just presence? Input validation covers content, not just length? Security-aware code review checklists catch what general awareness misses.
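One item from that checklist, sketched with a jsonwebtoken-style verify call. The claim shape and space-delimited scope format are assumptions, not a prescription:

```typescript
import jwt from "jsonwebtoken";

interface Claims {
  sub: string;
  scope?: string; // e.g. "orders:read orders:write"
}

// Presence check only: the pattern assistants reproduce. Any valid token
// passes, regardless of what it is actually allowed to do.
function authorizeWeak(token: string, secret: string): Claims {
  return jwt.verify(token, secret) as Claims;
}

// Scope check: valid signature AND the specific permission required.
function authorize(token: string, secret: string, requiredScope: string): Claims {
  const claims = jwt.verify(token, secret) as Claims;
  const scopes = (claims.scope ?? "").split(" ");
  if (!scopes.includes(requiredScope)) {
    throw new Error(`missing scope: ${requiredScope}`);
  }
  return claims;
}
```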
Review time per line should go up, even as code production speeds up. That feels backwards. It is. The temptation to rubber-stamp clean-looking AI code is exactly how the defect increase creeps in. Smooth flight. Eyes closed. Netflix on in the cockpit.
Testing Strategy for Generated Code
AI-generated tests tend to test the implementation, not the specification. The test verifies the function does what it does. If the function has a bug, the test confirms the bug. 95% coverage. Zero confidence. The autopilot writing its own pre-flight checklist. “Wings still attached? Check.”
Don’t: Let AI generate tests for AI-generated code without specification constraints. The test mirrors the implementation’s assumptions, including its bugs. High coverage means nothing when the tests validate the wrong behavior.
Do: Write specification-based test cases first (what should happen), then let AI help with the boilerplate. Tests encode the business requirement (“users cannot transfer more than their available balance”) not the implementation detail (“the function returns false when amount exceeds balance”).
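A sketch of the difference, using the balance-transfer rule above with Node's built-in test runner; the transfer implementation is hypothetical:

```typescript
import { test } from "node:test";
import assert from "node:assert/strict";

interface Account { id: string; balance: number }

// Hypothetical function under test.
function transfer(from: Account, to: Account, amount: number): boolean {
  if (amount <= 0 || amount > from.balance) return false; // the business rule
  from.balance -= amount;
  to.balance += amount;
  return true;
}

// An implementation-mirroring test would only restate the return value.
// The specification-based test encodes the business rule: a rejected
// transfer must not move money, in either direction.
test("users cannot transfer more than their available balance", () => {
  const from: Account = { id: "a", balance: 100 };
  const to: Account = { id: "b", balance: 0 };

  assert.equal(transfer(from, to, 150), false);
  assert.equal(from.balance, 100, "rejected transfer must not debit the sender");
  assert.equal(to.balance, 0, "rejected transfer must not credit the receiver");
});
```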
Three testing approaches actually work when AI generates production code.
Property-based testing. Define rules that must always hold: “output length never exceeds input length,” “running the same operations in any order gives the same result.” Frameworks like Hypothesis and fast-check generate hundreds of random inputs testing these rules. They find edge cases the AI missed because they don’t share its happy-path bias. Throwing darts at the spec instead of following the tour guide.
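A minimal fast-check sketch of the first rule; dedupe is a hypothetical function under test:

```typescript
import fc from "fast-check";

// Hypothetical function under test.
function dedupe<T>(items: T[]): T[] {
  return [...new Set(items)];
}

// Property: for ANY input array, output length never exceeds input length
// and no element is lost. fast-check generates hundreds of random arrays,
// then shrinks any failure to a minimal counterexample.
fc.assert(
  fc.property(fc.array(fc.integer()), (items) => {
    const result = dedupe(items);
    return (
      result.length <= items.length &&
      items.every((x) => result.includes(x))
    );
  })
);
```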
Mutation testing. Tools like Stryker and PIT introduce small changes (flipping > to >=, removing a null check, inverting a boolean) and check if tests catch each one. A surviving mutation is a test gap. AI-generated test suites consistently produce more survivors. They look thorough while missing actual faults. A security camera pointed at the wrong door.
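A hand-worked illustration of what a mutation tool automates; canWithdraw is hypothetical, and the flipped operator is the kind of change Stryker generates:

```typescript
import assert from "node:assert/strict";

// Original: overdrafts rejected, withdrawing the exact balance allowed.
function canWithdraw(balance: number, amount: number): boolean {
  return amount <= balance;
}

// A mutant the tool would generate: <= flipped to <.
function canWithdraw_mutant(balance: number, amount: number): boolean {
  return amount < balance;
}

// This boundary test kills the mutant: original says true, mutant says false.
// A suite without it lets the mutant survive -- a measured test gap.
assert.equal(canWithdraw(100, 100), true);
assert.equal(canWithdraw_mutant(100, 100), false); // demo only: the gap a boundary test closes
```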
Contract testing. AI generates service-to-service code using patterns from training data. Those patterns may not match your actual API contracts. Contract testing across service boundaries catches mismatches between what the AI assumed the API returns and what it actually returns. This matters most when AI generates both the client and the test. That’s a closed loop validating its own assumptions. Grading your own homework. With the answer key you wrote yourself.
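A minimal hand-rolled version of the idea, validating a hypothetical orders endpoint against the shape the generated client assumed, with zod doing the schema check; dedicated tools like Pact formalize this across teams:

```typescript
import { z } from "zod";

// The shape the AI-generated client ASSUMES the service returns.
const OrderContract = z.object({
  id: z.string(),
  total_cents: z.number().int(),
  status: z.enum(["pending", "paid", "shipped"]),
});

// Contract test: hit the real (staging) service and validate the assumption
// against reality, instead of against a mock that encodes the same assumption.
async function checkOrderContract(baseUrl: string, orderId: string) {
  const res = await fetch(`${baseUrl}/orders/${orderId}`);
  if (!res.ok) throw new Error(`HTTP ${res.status}`);
  // Throws with a precise diff if the real response deviates.
  return OrderContract.parse(await res.json());
}
```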
| Testing Approach | Traditional | AI-Adapted | Why It Matters for AI Code |
|---|---|---|---|
| Unit tests | Test implementation behavior, happy-path focused | Specification-based: tests encode business rules, not implementation | AI-generated tests confirm the bug works as implemented. 95% coverage, zero confidence |
| Edge cases | Developer intuition + code review | Property-based: randomized inputs test invariants | AI misses edge cases humans would catch through experience. Property tests find them mechanically |
| Test quality | Code coverage metrics | Mutation testing: measure surviving mutations | AI test suites have more surviving mutations (tests that pass even when code is broken) |
| Integration | Manual API testing | Contract testing: verify API assumptions against real service behavior | AI hallucinates API signatures. Contract tests catch mismatches before production |
When NOT to Use AI Code Generation
First, the readiness question: AI-generated code belongs in production only when these safeguards are in place.
- Code review process explicitly addresses AI-generated code risks (hallucinated APIs, edge case gaps, security patterns)
- Testing strategy includes specification-based tests not generated by the same AI
- Security scanning tools are configured and catching AI-specific vulnerability patterns
- Defect escape rate tracking is in place alongside velocity metrics
- Engineers can explain when to reject AI suggestions, not just accept them

Even with every box checked, four areas call for human-written code.
Security-critical paths. Auth, crypto, session management. These need reasoning about rules that must always hold, and AI handles those unreliably. Write these manually. Review twice. Test like an attacker would.
Financial calculation precision. Same problem, different domain. AI frequently uses floating-point types that work for most cases and produce quietly wrong results for the rest. The billing rounding bug from safe deployment practices, generated at high velocity. Pennies off per transaction. Nobody notices until someone does.
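The failure mode is easy to demonstrate; the integer-cents fix is the standard workaround, sketched here:

```typescript
// IEEE 754 floats: fine for most values, quietly wrong for the rest.
console.log(0.1 + 0.2);         // 0.30000000000000004
console.log(0.1 + 0.2 === 0.3); // false

// Summing 100 charges of $0.10 drifts by accumulated float error.
let total = 0;
for (let i = 0; i < 100; i++) total += 0.1;
console.log(total === 10);      // false -- not exactly 10

// The standard fix for money: integer cents, converted only for display.
let cents = 0;
for (let i = 0; i < 100; i++) cents += 10;
console.log(cents / 100);       // 10, exact
```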
Novel algorithms. This one’s less obvious. No established patterns in public repositories means no good training signal. The AI is mixing and matching patterns it’s seen before. The output will be subtly wrong for the cases that matter most. A chef improvising from memory when you ordered off-menu.
System architecture decisions. AI generates patterns. It can’t reason about whether the pattern fits your situation. Generating a Kafka consumer when you need a simple queue is unnecessary complexity shipped at record speed. A sports car solution to a bicycle problem.
| When AI code generation works well | When it doesn’t |
|---|---|
| Boilerplate and scaffolding for known patterns | Security-critical authentication and crypto |
| CRUD operations with standard validation | Financial calculations requiring precision |
| Test scaffolding (with human-written specs) | Novel algorithms without public precedent |
| Configuration files and infrastructure code | Cross-service architectural decisions |
| Documentation drafts and code comments | Performance-critical hot paths |
What the Industry Gets Wrong About AI Code Generation
“AI coding assistants make junior developers as productive as seniors.” They make junior developers faster at producing code. Producing correct, secure, maintainable code requires judgment that comes from experience: knowing which suggestion to reject, which edge case the AI missed, which pattern will cause problems at scale. AI amplifies whatever engineering judgment the developer already has. A senior pilot with autopilot lands safely in fog. A student pilot with autopilot doesn’t know they’re descending. The seniority gap widens in both directions.
“AI-generated code needs less review because it follows consistent patterns.” It needs more review, not less. Consistent patterns include consistently reproduced vulnerabilities, consistently hallucinated APIs, and consistently missed edge cases. The surface polish of AI-generated code is precisely what makes it dangerous in review. Reviewers lower their guard for code that looks like it was written by a competent colleague. That’s the trap.
“Lines of code per day is a valid productivity metric for AI tools.” Lines of code per day was never a valid productivity metric. Adding AI to a bad metric makes the metric worse. Measure task completion time, defect escape rate, and time-to-resolution together. Any single metric in isolation tells a flattering lie.
Pull requests up. Defects up too. Now the team tracks both on the same dashboard. Property-based tests catch what the AI misses. Specification-first design means tests encode what the code should do, not what it does. The gain is smaller than the demo promised. But it’s real. And it’s honest. The autopilot still flies. The pilot still watches. That’s the deal.