AI Code Generation: What the Velocity Numbers Hide
Your team adopted an AI coding assistant three months ago. The velocity metrics look fantastic. Pull requests are way up. The engineering director presented the numbers at the all-hands. Everyone applauded.
Then you look at the bug tracker. Defect rate is climbing too. Not fast. A steady uptick in bugs per sprint, the kind that doesn’t trigger alarms until you plot the trendline. The bugs are subtle. Edge cases the AI handled confidently but incorrectly. Error handling that swallows critical exceptions. An API call to a method that doesn’t exist in the version you’re running. Nobody connected the dots because the bugs look like code a senior engineer would write. They pass review. They pass the happy-path tests. They fail in production for a fraction of requests in ways that take hours to diagnose.
You turned on the autopilot. The altitude looks great. Nobody’s watching the instruments.
- AI coding assistants speed up routine tasks meaningfully. Complex tasks show minimal gain and sometimes negative impact after correction time.
- Defect rates climb without adapted review practices. The bugs are subtle, confident, and pass superficial review because AI-generated code looks polished.
- Security vulnerabilities increase with AI assistance, not decrease. Stanford research found developers using AI wrote more vulnerable code while rating it more secure.
- Code review must evolve, not relax. Traditional review catches logic errors. AI-generated code demands scrutiny on edge cases, hallucinated APIs, and security anti-patterns.
- Net productivity gain is far smaller than vendor demos suggest after accounting for review overhead, bug fixes, and correction cycles.
Velocity and defect rate rarely share a dashboard. That gap is where the real story hides.
What the Productivity Numbers Actually Show
DORA’s research consistently shows that throughput metrics without quality gates produce misleading signals. More pull requests is an input metric. Working software is an output metric. Vendor studies report the input.
The gain isn’t spread evenly. Boilerplate, test scaffolding, CRUD, configs, documentation: meaningfully faster. These are pattern-matching tasks where the AI shines because public repositories have millions of examples. Complex tasks tell a different story. System design, cross-service coordination, performance-sensitive code. The AI misses design rules it can’t see. Your system has 15 years of decisions in the heads of three senior engineers. The AI fills the gap with confident guesses. Turbulence at 30,000 feet and the autopilot is holding course into a mountain.
Net productivity is real but smaller than the pitch deck. And that gain only holds if review and testing practices keep up. The altimeter looks great. The fuel gauge tells a different story.
The 70% Trap
AI gets you 70% to a working solution. The happy path works. The structure follows recognizable patterns. Code from a talented contractor who has never seen your architecture docs and never asked.
The remaining 30% is where engineering lives. Boundary conditions. Race conditions under concurrency. Error handling that keeps context instead of swallowing it. Timeout values that match your actual latency, not a round number from training data.
A function that processes a list works for 10 items. At 10 million, it runs out of memory because the AI loaded the entire collection instead of streaming. Syntactically correct. Practically a ticking bomb. And the code review passed because it looked clean. The autopilot was flying level. Just toward the wrong airport.
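A minimal sketch of that contrast, assuming a line-oriented file and a hypothetical processLine handler; Node's fs and readline APIs do the streaming:

```typescript
import { createReadStream, readFileSync } from "node:fs";
import { createInterface } from "node:readline";

// What the assistant typically generates: load everything, then iterate.
// Fine at 10 items. At 10 million, the process runs out of memory.
function processAllAtOnce(path: string): void {
  const lines = readFileSync(path, "utf8").split("\n");
  for (const line of lines) processLine(line);
}

// The streaming version: constant memory, same result.
async function processStreaming(path: string): Promise<void> {
  const rl = createInterface({ input: createReadStream(path, "utf8") });
  for await (const line of rl) processLine(line);
}

// Hypothetical per-record handler, standing in for real business logic.
function processLine(line: string): void {
  if (line.length > 0) console.log(line.length);
}
```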
Edge cases are one thing. Security is where the stakes get genuinely dangerous.
Security at Scale
Research from Stanford found that developers using AI assistants produced more security vulnerabilities while rating their code as more secure. Confidence up. Security down. The Dunning-Kruger effect, automated.
The mechanism is simple. AI learns from public repositories. Public repositories contain mountains of insecure code. SQL string concatenation. Unsanitized input rendering. Leaked stack traces. Hardcoded API keys. The AI reproduces these patterns because frequency drives the ranking, not correctness. The most popular answer is not the safest answer. (Ask any security team.)
The patterns are subtly insecure. Parameterized WHERE clauses paired with string-interpolated ORDER BY. Token validation that skips scope checks. Length-checked input with unchecked content. OWASP Top 10 vulnerabilities wearing a clean suit.
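Here is that first pattern in practice: a hedged sketch using a node-postgres-style client (client.query with positional parameters); the table and column names are hypothetical:

```typescript
import { Client } from "pg";

// Subtly insecure: WHERE is parameterized, ORDER BY is not.
// "sortColumn" comes straight from the request, so this is SQL injection
// hiding behind a query that looks parameterized at a glance.
async function listUsersUnsafe(client: Client, minAge: number, sortColumn: string) {
  return client.query(
    `SELECT id, name FROM users WHERE age >= $1 ORDER BY ${sortColumn}`,
    [minAge]
  );
}

// Identifiers can't be bound as parameters, so allowlist them instead.
const SORTABLE = new Set(["name", "age", "created_at"]);

async function listUsersSafe(client: Client, minAge: number, sortColumn: string) {
  if (!SORTABLE.has(sortColumn)) throw new Error(`unsortable column: ${sortColumn}`);
  return client.query(
    `SELECT id, name FROM users WHERE age >= $1 ORDER BY ${sortColumn}`,
    [minAge]
  );
}
```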
The scale problem adds up fast. A human might introduce one SQL injection vulnerability per quarter. An AI reproducing the pattern across 20 endpoints in a single sprint introduces twenty. Copy-paste on steroids. Application security tooling catches some, but static analysis was built for human-written code at human speed.
Code Review in the AI Era
Traditional review assumes the author understood the code they wrote. With AI-assisted development, that assumption breaks. The author may have accepted a plausible suggestion without checking every edge case. Different starting point. Different protocol needed.
| Aspect | Traditional Review | AI-Aware Review |
|---|---|---|
| Author assumption | Understood the code | May have accepted a plausible suggestion |
| Edge case coverage | Spot-check is enough | Step-by-step check required |
| Dependency accuracy | Trust imports | Verify API signatures exist in pinned version |
| Security patterns | General awareness | Explicit checklist per review |
| Error handling | Check for presence | Check for correctness and specificity |
| Test coverage | Adequate if green | Must cover AI-specific risk areas |
Three review practices become non-negotiable when AI writes big chunks of your codebase.
Dependency verification. AI assistants make up API methods. response.getStatusMessage() when the API actually has response.statusText. A library import for a package that exists on npm but isn’t in your package.json. Citing sources that don’t exist. Verify every import against pinned versions. Every time.
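A sketch of that statusText example. The endpoint is hypothetical; the point is that the hallucinated method only blows up on a path the happy-path tests never hit (TypeScript flags it at compile time, plain JavaScript does not):

```typescript
// Hallucinated: Response has no getStatusMessage() method.
async function checkHealthHallucinated(url: string) {
  const response = await fetch(url);
  // @ts-expect-error -- the method the assistant invented; throws at runtime in JS
  return response.getStatusMessage();
}

// What the fetch API actually exposes.
async function checkHealth(url: string) {
  const response = await fetch(url);
  return `${response.status} ${response.statusText}`;
}
```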
Edge case interrogation. For every AI-generated function, ask: what happens with null input? Empty input? Maximum-size input? Concurrent access? Network timeout? The AI covered the happy path. It doesn't think about production failure modes. It thinks about training data.
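What that interrogation produces, sketched against a network call. The naive version is typical assistant output; the 5-second budget in the hardened version is a placeholder you would replace with your measured latency:

```typescript
// Happy-path version: no timeout, no status check. Hangs forever if the
// server stalls, and treats a 500 error body as valid JSON.
async function fetchJsonNaive(url: string) {
  const res = await fetch(url);
  return res.json();
}

// After interrogation: what happens on timeout? On a 500?
async function fetchJson(url: string, timeoutMs = 5_000) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetch(url, { signal: controller.signal });
    if (!res.ok) throw new Error(`HTTP ${res.status} from ${url}`);
    return await res.json();
  } finally {
    clearTimeout(timer); // don't leak the timer on the success path
  }
}
```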
Security pattern checklist. For AI-generated code touching user input, auth, data access, or external calls: parameterized queries for every clause? Error responses exclude stack traces? Auth checks verify scope, not just presence? Input validation covers content, not just length? Security-aware code review checklists catch what general awareness misses.
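One item from that checklist, sketched with a jsonwebtoken-style verify call. The claim shape and space-delimited scope format are assumptions, not a prescription:

```typescript
import jwt from "jsonwebtoken";

interface Claims {
  sub: string;
  scope?: string; // e.g. "orders:read orders:write"
}

// Presence check only: the pattern assistants reproduce. Any valid token
// passes, regardless of what it is actually allowed to do.
function authorizeWeak(token: string, secret: string): Claims {
  return jwt.verify(token, secret) as Claims;
}

// Scope check: valid signature AND the specific permission required.
function authorize(token: string, secret: string, requiredScope: string): Claims {
  const claims = jwt.verify(token, secret) as Claims;
  const scopes = (claims.scope ?? "").split(" ");
  if (!scopes.includes(requiredScope)) {
    throw new Error(`missing scope: ${requiredScope}`);
  }
  return claims;
}
```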
Review time per line should go up, even as code production speeds up. That feels backwards. It is. The temptation to rubber-stamp clean-looking AI code is exactly how the defect increase creeps in. Smooth flight. Eyes closed. Netflix on in the cockpit.
Testing Strategy for Generated Code
AI-generated tests tend to test the implementation, not the specification. The test verifies the function does what it does. If the function has a bug, the test confirms the bug. 95% coverage. Zero confidence. The autopilot writing its own pre-flight checklist. “Wings still attached? Check.”
Don’t: Let AI generate tests for AI-generated code without specification constraints. The test mirrors the implementation’s assumptions, including its bugs. High coverage means nothing when the tests validate the wrong behavior.
Do: Write specification-based test cases first (what should happen), then let AI help with the boilerplate. Tests encode the business requirement (“users cannot transfer more than their available balance”) not the implementation detail (“the function returns false when amount exceeds balance”).
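A sketch of the difference, using the balance-transfer rule above with Node's built-in test runner; the transfer implementation is hypothetical:

```typescript
import { test } from "node:test";
import assert from "node:assert/strict";

interface Account { id: string; balance: number }

// Hypothetical function under test.
function transfer(from: Account, to: Account, amount: number): boolean {
  if (amount <= 0 || amount > from.balance) return false; // the business rule
  from.balance -= amount;
  to.balance += amount;
  return true;
}

// An implementation-mirroring test would only restate the return value.
// The specification-based test encodes the business rule: a rejected
// transfer must not move money, in either direction.
test("users cannot transfer more than their available balance", () => {
  const from: Account = { id: "a", balance: 100 };
  const to: Account = { id: "b", balance: 0 };

  assert.equal(transfer(from, to, 150), false);
  assert.equal(from.balance, 100, "rejected transfer must not debit the sender");
  assert.equal(to.balance, 0, "rejected transfer must not credit the receiver");
});
```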
Three testing approaches actually work when AI generates production code.
Property-based testing. Define rules that must always hold: “output length never exceeds input length,” “running the same operations in any order gives the same result.” Frameworks like Hypothesis and fast-check generate hundreds of random inputs testing these rules. They find edge cases the AI missed because they don’t share its happy-path bias. Throwing darts at the spec instead of following the tour guide.
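A minimal fast-check sketch of the first rule; dedupe is a hypothetical function under test:

```typescript
import fc from "fast-check";

// Hypothetical function under test.
function dedupe<T>(items: T[]): T[] {
  return [...new Set(items)];
}

// Property: for ANY input array, output length never exceeds input length
// and no element is lost. fast-check generates hundreds of random arrays,
// then shrinks any failure to a minimal counterexample.
fc.assert(
  fc.property(fc.array(fc.integer()), (items) => {
    const result = dedupe(items);
    return (
      result.length <= items.length &&
      items.every((x) => result.includes(x))
    );
  })
);
```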
Mutation testing. Tools like Stryker and PIT introduce small changes (flipping > to >=, removing a null check, inverting a boolean) and check if tests catch each one. A surviving mutation is a test gap. AI-generated test suites consistently produce more survivors. They look thorough while missing actual faults. A security camera pointed at the wrong door.
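A hand-worked illustration of what a mutation tool automates; canWithdraw is hypothetical, and the flipped operator is the kind of change Stryker generates:

```typescript
import assert from "node:assert/strict";

// Original: overdrafts rejected, withdrawing the exact balance allowed.
function canWithdraw(balance: number, amount: number): boolean {
  return amount <= balance;
}

// A mutant the tool would generate: <= flipped to <.
function canWithdraw_mutant(balance: number, amount: number): boolean {
  return amount < balance;
}

// This boundary test kills the mutant: original says true, mutant says false.
// A suite without it lets the mutant survive -- a measured test gap.
assert.equal(canWithdraw(100, 100), true);
assert.equal(canWithdraw_mutant(100, 100), false); // demo only: the gap a boundary test closes
```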
Contract testing. AI generates service-to-service code using patterns from training data. Those patterns may not match your actual API contracts. Contract testing across service boundaries catches mismatches between what the AI assumed the API returns and what it actually returns. This matters most when AI generates both the client and the test. That’s a closed loop validating its own assumptions. Grading your own homework. With the answer key you wrote yourself.
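A minimal hand-rolled version of the idea, validating a hypothetical orders endpoint against the shape the generated client assumed, with zod doing the schema check; dedicated tools like Pact formalize this across teams:

```typescript
import { z } from "zod";

// The shape the AI-generated client ASSUMES the service returns.
const OrderContract = z.object({
  id: z.string(),
  total_cents: z.number().int(),
  status: z.enum(["pending", "paid", "shipped"]),
});

// Contract test: hit the real (staging) service and validate the assumption
// against reality, instead of against a mock that encodes the same assumption.
async function checkOrderContract(baseUrl: string, orderId: string) {
  const res = await fetch(`${baseUrl}/orders/${orderId}`);
  if (!res.ok) throw new Error(`HTTP ${res.status}`);
  // Throws with a precise diff if the real response deviates.
  return OrderContract.parse(await res.json());
}
```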
| Testing Approach | Traditional | AI-Adapted | Why It Matters for AI Code |
|---|---|---|---|
| Unit tests | Test implementation behavior, happy-path focused | Specification-based: tests encode business rules, not implementation | AI-generated tests confirm the bug works as implemented. 95% coverage, zero confidence |
| Edge cases | Developer intuition + code review | Property-based: randomized inputs test invariants | AI misses edge cases humans would catch through experience. Property tests find them mechanically |
| Test quality | Code coverage metrics | Mutation testing: measure surviving mutations | AI test suites have more surviving mutations (tests that pass even when code is broken) |
| Integration | Manual API testing | Contract testing: verify API assumptions against real service behavior | AI hallucinates API signatures. Contract tests catch mismatches before production |
When NOT to Use AI Code Generation
First, the readiness question: AI-generated code belongs in production only when these safeguards are in place.
- Code review process explicitly addresses AI-generated code risks (hallucinated APIs, edge case gaps, security patterns)
- Testing strategy includes specification-based tests not generated by the same AI
- Security scanning tools are configured and catching AI-specific vulnerability patterns
- Defect escape rate tracking is in place alongside velocity metrics
- Engineers can explain when to reject AI suggestions, not just accept them

Even with every box checked, four areas call for human-written code.
Security-critical paths. Auth, crypto, session management. These need reasoning about rules that must always hold, and AI handles those unreliably. Write these manually. Review twice. Test like an attacker would.
Financial calculation precision. Same problem, different domain. AI frequently uses floating-point types that work for most cases and produce quietly wrong results for the rest. The billing rounding bug from safe deployment practices, generated at high velocity. Pennies off per transaction. Nobody notices until someone does.
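The failure mode is easy to demonstrate; the integer-cents fix is the standard workaround, sketched here:

```typescript
// IEEE 754 floats: fine for most values, quietly wrong for the rest.
console.log(0.1 + 0.2);         // 0.30000000000000004
console.log(0.1 + 0.2 === 0.3); // false

// Summing 100 charges of $0.10 drifts by accumulated float error.
let total = 0;
for (let i = 0; i < 100; i++) total += 0.1;
console.log(total === 10);      // false -- not exactly 10

// The standard fix for money: integer cents, converted only for display.
let cents = 0;
for (let i = 0; i < 100; i++) cents += 10;
console.log(cents / 100);       // 10, exact
```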
Novel algorithms. This one’s less obvious. No established patterns in public repositories means no good training signal. The AI is mixing and matching patterns it’s seen before. The output will be subtly wrong for the cases that matter most. A chef improvising from memory when you ordered off-menu.
System architecture decisions. AI generates patterns. It can’t reason about whether the pattern fits your situation. Generating a Kafka consumer when you need a simple queue is unnecessary complexity shipped at record speed. A sports car solution to a bicycle problem.
| When AI code generation works well | When it doesn’t |
|---|---|
| Boilerplate and scaffolding for known patterns | Security-critical authentication and crypto |
| CRUD operations with standard validation | Financial calculations requiring precision |
| Test scaffolding (with human-written specs) | Novel algorithms without public precedent |
| Configuration files and infrastructure code | Cross-service architectural decisions |
| Documentation drafts and code comments | Performance-critical hot paths |
What the Industry Gets Wrong About AI Code Generation
“AI coding assistants make junior developers as productive as seniors.” They make junior developers faster at producing code. Producing correct, secure, maintainable code requires judgment that comes from experience: knowing which suggestion to reject, which edge case the AI missed, which pattern will cause problems at scale. AI amplifies whatever engineering judgment the developer already has. A senior pilot with autopilot lands safely in fog. A student pilot with autopilot doesn’t know they’re descending. The seniority gap widens in both directions.
“AI-generated code needs less review because it follows consistent patterns.” It needs more review, not less. Consistent patterns include consistently reproduced vulnerabilities, consistently hallucinated APIs, and consistently missed edge cases. The surface polish of AI-generated code is precisely what makes it dangerous in review. Reviewers lower their guard for code that looks like it was written by a competent colleague. That’s the trap.
“Lines of code per day is a valid productivity metric for AI tools.” Lines of code per day was never a valid productivity metric. Adding AI to a bad metric makes the metric worse. Measure task completion time, defect escape rate, and time-to-resolution together. Any single metric in isolation tells a flattering lie.
Pull requests up. Defects up too. Now the team tracks both on the same dashboard. Property-based tests catch what the AI misses. Specification-first design means tests encode what the code should do, not what it does. The gain is smaller than the demo promised. But it’s real. And it’s honest. The autopilot still flies. The pilot still watches. That’s the deal.