
The Split Screen: Why AI Self-Improvement Works and AI-Generated Code Fails Silently

Constrained, externally grounded self-improvement is real — but the same agents that get measurably better at scored tasks ship code that compiles, runs, passes its own tests, and quietly returns wrong answers. The governance answer is to treat AI output as untrusted code with an independent test oracle.

DSE-Experts
Operator-led practice
May 28, 2026
10 min · 2,168 words

There are two stories about autonomous AI agents in 2026, and they are almost never told in the same room. In one, self-improving coding agents are posting genuine, benchmark-verified gains — the kind of progress that makes “recursive self-improvement” sound less like a slogan and more like a roadmap. In the other, the code those same agents produce fails silently: it compiles, it runs, it passes the tests written alongside it, and it quietly returns the wrong answer at scale. Both stories are true. The gap between them is exactly where enterprise risk now lives, and it is the gap most AI adoption budgets do not fund.

This is a split screen, and reading only one half of it is how organizations get hurt. The half that gets the conference keynote is the optimistic one. The half that shows up in your training runs, your inference pipelines, and eventually your liability exposure is the quiet one. This paper argues that the two halves are governed by the same underlying principle — external grounding — and that the practical job of any serious data and AI organization is to build the grounding layer that the technology itself does not provide.

Executive Summary. Constrained self-improvement genuinely works, but only when an agent is anchored to an objective external oracle (a benchmark, a compiler, a scored task). Strip away that grounding and the loop provably collapses. Meanwhile, the code these agents generate fails silently — it runs and passes superficial, often self-authored tests while corrupting results. The verified evidence is consistent across kernels, pull requests, and security scans. The governance response is non-negotiable: treat AI output as untrusted code, build independent test oracles, and never let one model write both the implementation and the tests that judge it. Under the EU Product Liability Directive (2024/2853), this is now a liability question, not merely an engineering one.

Self-Improvement Is Real — and That’s the Boring Part

The headline result is genuine. The Darwin Gödel Machine (DGM) — Zhang et al., arXiv 2505.22954 — is a self-modifying coding agent that rewrites its own toolchain and validates each change against benchmarks. It moved SWE-bench from 20.0% to 50.0% and Polyglot from 14.2% to 30.7%. Those are not projections; they are scored gains on held-out coding tasks. The follow-on work, Meta’s DGM-H “HyperAgents” (ai.meta.com, March 2026), generalized the same self-improvement pattern across coding, paper review, robotics reward design, and Olympiad math grading, beating non-self-improving baselines in each domain.

Read the methodology and the magic dissolves into engineering. DGM ran inside a sandbox, under human oversight, and — critically — every self-modification was judged by an external benchmark the agent did not control. The agent did not decide whether it had improved. SWE-bench decided. That distinction is the entire ballgame.
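The accept-or-reject structure described above can be sketched in a few lines. This is a minimal illustration, not DGM's actual implementation: `grounded_self_improvement`, `benchmark`, `mutate`, and `TARGET` are hypothetical names, and a toy numeric objective stands in for a real benchmark such as SWE-bench. The point it shows is only that the proposer never grades its own change.

```python
import random
from typing import Callable, List


def grounded_self_improvement(
    initial: List[float],
    propose: Callable[[List[float], random.Random], List[float]],
    external_score: Callable[[List[float]], float],
    steps: int = 200,
    seed: int = 0,
) -> List[float]:
    """Keep a proposed change only if the external scorer rates it higher."""
    rng = random.Random(seed)
    best = list(initial)
    best_score = external_score(best)
    for _ in range(steps):
        candidate = propose(best, rng)
        score = external_score(candidate)  # the proposer never grades itself
        if score > best_score:
            best, best_score = candidate, score
    return best


# Hypothetical stand-ins: a fixed target plays the role of the benchmark.
TARGET = [3.0, -1.0]


def benchmark(params: List[float]) -> float:
    """External oracle: higher is better, and the proposer cannot edit it."""
    return -sum((p - t) ** 2 for p, t in zip(params, TARGET))


def mutate(params: List[float], rng: random.Random) -> List[float]:
    """Proposer: a random self-modification with no notion of quality."""
    return [p + rng.gauss(0.0, 0.5) for p in params]


improved = grounded_self_improvement([0.0, 0.0], mutate, benchmark)
print(benchmark([0.0, 0.0]), benchmark(improved))
```

Because a candidate is kept only when the external score rises, the final score can never be lower than the starting one. Remove `external_score` from the loop and there is nothing left to distinguish an improvement from a regression.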

The reason it has to work this way is now a formal result. Zenil (arXiv 2601.05280, early 2026) proves that purely recursive self-training, meaning improvement with no external grounding signal, collapses. Using martingale convergence and the data-processing inequality, the proof shows entropy decay and variance amplification driving the system to mode collapse. In Zenil’s framing, “fully autonomous recursive density matching leads to degenerative fixed points, whereas externally anchored… approaches operate under fundamentally different asymptotic dynamics.”

Translated for practitioners: you cannot bootstrap a model to superintelligence on its own outputs. The loop eats itself. Improvement only happens when the loop is closed against reality. That is why self-improvement is, in the most useful sense, boring: it works precisely to the degree that you have already built an objective oracle to score it. No oracle, no improvement. “Build the oracle first” is not a caveat to the self-improvement story. It is the whole story.
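A toy simulation makes the difference between a closed loop and an anchored one tangible. This is a sketch under simplifying assumptions, not a reproduction of Zenil's construction: the "model" is a single Gaussian refit by maximum likelihood each generation, and `final_sigma` and `real_fraction` are hypothetical names. It shows only one simple degenerate fixed point, the fitted spread shrinking toward zero when the model trains on nothing but its own samples.

```python
import random
import statistics


def final_sigma(generations: int, n: int, real_fraction: float, seed: int = 0) -> float:
    """Refit a Gaussian once per generation and return the last fitted std dev.

    real_fraction = 0.0 trains only on the model's own samples (closed loop).
    real_fraction > 0.0 mixes in fresh draws from the true N(0, 1) source.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0
    n_real = int(n * real_fraction)
    for _ in range(generations):
        own = [rng.gauss(mu, sigma) for _ in range(n - n_real)]
        real = [rng.gauss(0.0, 1.0) for _ in range(n_real)]
        data = own + real
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data)  # maximum-likelihood estimate
    return sigma


closed = final_sigma(generations=1000, n=50, real_fraction=0.0)
anchored = final_sigma(generations=1000, n=50, real_fraction=0.5)
print(f"closed loop sigma:   {closed:.3e}")
print(f"anchored loop sigma: {anchored:.3f}")
```

In the closed loop, each refit slightly underestimates the spread on average and nothing ever corrects it, so the errors compound. Mixing in external data every generation pulls the estimate back toward the true distribution.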

The Other Half of the Screen: Code That Lies

Now turn to the half nobody budgets for. When agents write code that has no clean oracle — performance-critical GPU kernels, subtle business logic, security-sensitive paths — the failure mode is not a crash. It is a confident, plausible, wrong answer.

Start with kernels, because the evidence is unusually crisp. KernelBench (Stanford, May 2025) ran 250 PyTorch workloads and found that frontier reasoning models matched the PyTorch baseline in fewer than 20% of cases. Evolutionary search helps with speed: EvoEngineer (arXiv 2510.03760) reached a 2.72× median speedup and a maximum 36.75× over PyTorch across 91 kernels — but at only 69.8% code validity. Nearly a third of the generated kernels were not even valid, and that is the optimistic reading.

The pessimistic reading is worse and more important. The ProofWright work on CUDA correctness (arXiv 2511.12294) found that roughly 70% of AI-generated kernels still contain correctness issues that evade conventional testing. These are not issues that fail tests; they are issues that pass them. Industry analysis of silent CUDA errors describes kernels that “look correct while quietly corrupting training and inference”: misaligned memory, race conditions, and indexing bugs that never crash and instead just perturb the outputs. Benchmark work like AgentKernelArena (arXiv 2605.16819) and FastKernels (arXiv 2605.23215) adds a second warning: agents overfit the benchmark they optimize against, so a kernel that aces its target benchmark may generalize poorly to your actual workload.
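An indexing bug of this kind is easy to reproduce in miniature. The sketch below uses plain Python rather than CUDA, and every name in it (`BLOCK`, `blocked_sum_buggy`, `reference_sum`, `first_disagreement`) is hypothetical. It assumes a tiled reduction that drops the tail whenever the input length is not a multiple of the tile width, then compares it against a slow reference on randomized shapes, which is one simple form of differential testing.

```python
import random
from typing import Callable, List, Optional

BLOCK = 4  # hypothetical tile width, standing in for a CUDA block size


def reference_sum(xs: List[float]) -> float:
    """Slow, obviously correct baseline."""
    return sum(xs)


def blocked_sum_buggy(xs: List[float]) -> float:
    """Tiled reduction that silently drops the tail when len(xs) % BLOCK != 0."""
    total = 0.0
    for start in range(0, len(xs) - BLOCK + 1, BLOCK):
        total += sum(xs[start:start + BLOCK])
    return total


def first_disagreement(
    fast: Callable[[List[float]], float],
    ref: Callable[[List[float]], float],
    trials: int = 200,
    seed: int = 0,
) -> Optional[List[float]]:
    """Return the first random input where fast and ref differ, else None."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.uniform(-1.0, 1.0) for _ in range(rng.randint(0, 33))]
        if abs(fast(xs) - ref(xs)) > 1e-9:
            return xs
    return None


# A happy-path test with a tile-aligned length passes despite the bug.
print(blocked_sum_buggy([1.0] * 8) == 8.0)
# Randomized shapes compared against the reference expose it.
print(first_disagreement(blocked_sum_buggy, reference_sum) is not None)
```

The buggy function never raises and is exactly right on aligned sizes, which is why a handful of hand-picked test shapes can stay green indefinitely.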

This is not a kernels-only problem. It is the structure of the technology, and it shows up everywhere code is generated.

The Evidence, Side by Side

The silent-failure pattern is consistent across the studies, the domains, and the tooling. The numbers below are not cherry-picked anecdotes — they are the convergent finding of independent 2025–2026 measurements.

| Source | What was measured | Result |
|---|---|---|
| KernelBench (Stanford, 2025) | Frontier models matching PyTorch baseline | Match in <20% of 250 workloads |
| EvoEngineer (arXiv 2510.03760) | Evolved CUDA kernel quality | 2.72× median speedup, 36.75× max, but only 69.8% valid |
| ProofWright (arXiv 2511.12294) | AI kernel correctness vs. conventional tests | ~70% of kernels carry correctness bugs that evade testing |
| CodeRabbit (470 GitHub PRs) | AI-co-authored vs. human PRs | ~1.7× more issues, 1.75× more logic errors, 1.42× more performance problems |
| Veracode 2025 GenAI report | Vulnerabilities in AI-generated code | 45% of cases introduced vulnerabilities; XSS defenses failed in 86% of relevant samples; no improvement with larger models |
| Sonar 2026 | Developer trust vs. verification behavior | 96% of devs don’t fully trust AI code, but only 48% always verify before committing |

Read the table as one sentence: AI code arrives looking finished, scores worse than human code on every quality axis measured, and the people shipping it know not to trust it, yet barely half always verify before committing. The Veracode finding deserves a second look, because it kills the most common executive assumption, that the next model will fix this: the report found no improvement with newer or larger models. You cannot wait out this problem by buying a bigger model.

The Test-Mirroring Trap

There is one failure mode that quietly defeats most “we have tests” defenses, and it is worth naming precisely. When the same model writes both the implementation and the tests, the tests inherit the implementation’s errors. As the pattern is described: “if the implementation has an off-by-one error, the AI-generated test will assert the wrong value with full confidence.”

This is the mechanism behind so many green CI runs over broken code. The test is not an independent check. It is a mirror. It encodes the model’s belief about what the code does, and the model’s belief is exactly what is wrong. A passing test suite that was co-generated with the code under test provides almost no evidence of correctness — it provides evidence of internal consistency, which is a very different and much weaker property.
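The mirror effect can be shown with a deliberately tiny example. This is an illustrative sketch, with `sum_to`, `mirrored_test`, and `independent_oracle_test` as hypothetical names. It assumes the mirrored test's expected value was produced by running the implementation, while the independent test derives its expectation from the specification by a different route.

```python
def sum_to(n: int) -> int:
    """Spec: return 1 + 2 + ... + n, inclusive. Contains an off-by-one."""
    return sum(range(1, n))  # bug: range() excludes n


def mirrored_test() -> bool:
    # The expected value was obtained by running the implementation,
    # so it encodes the same mistake and the check passes.
    return sum_to(10) == 45


def independent_oracle_test() -> bool:
    # The expected value comes from the spec via a different route:
    # the closed form n * (n + 1) / 2.
    return all(sum_to(n) == n * (n + 1) // 2 for n in range(0, 50))


print("mirrored test passes:   ", mirrored_test())
print("independent test passes:", independent_oracle_test())
```

The mirrored test is green and the independent one is red for the same code. Nothing about the first test is malformed; it simply measures agreement with the implementation rather than agreement with the specification.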

Pair this with the Sonar finding — 96% of developers don’t fully trust AI code, but only 48% always verify before committing — and you have the full anatomy of the risk. Teams know the output is suspect. They have tests that say it’s fine. They ship.

Why This Gets Worse as Agents Get More Autonomous

The autonomy curve is bending upward fast. METR’s time-horizon measurements (May 2026) put GPT-5.2 (high effort) at a 50% time horizon of roughly 6.6 hours on software tasks (95% CI 3h20m–17h30m), the highest reported to date. The Stanford HAI 2026 AI Index reports that frontier models gained 30 points in a single year on Humanity’s Last Exam, with a tight competitive band on Arena Elo (Anthropic 1503, xAI 1495, Google 1494, OpenAI 1481 as of March 2026).

But the same Index notes that agents still fail roughly 1 in 3 structured tasks. Now combine the two facts. Agents can run autonomously for hours, and they fail a third of the time. Every additional hour of unattended autonomy, every additional sub-agent in a fan-out, multiplies the silent-failure surface — the number of places where wrong-but-plausible output can enter a system without a human ever looking at it. Autonomy does not reduce the verification burden. It concentrates it, and it moves it downstream where it is most expensive to catch.
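A back-of-the-envelope calculation shows how quickly the exposure grows with fan-out. The sketch below rests on a strong assumption that is not from the Index: each task fails independently with the same probability. Real agent failures are likely correlated, so treat the numbers as illustrative arithmetic rather than a measurement. `p_any_failure` is a hypothetical helper name.

```python
def p_any_failure(per_task_failure: float, tasks: int) -> float:
    """P(at least one failure) across `tasks` independent tasks."""
    return 1.0 - (1.0 - per_task_failure) ** tasks


# Assumed per-task failure rate of 1 in 3, applied to growing fan-outs.
for n in (1, 3, 5, 10):
    print(n, round(p_any_failure(1 / 3, n), 3))
```

Under that assumption, a fan-out of three sub-agents already carries roughly a 70% chance (19/27) that at least one result is wrong, and nothing in the pipeline indicates which one.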

The encouraging counterpoint is that silent failures are detectable when you build for it. TrainCheck (University of Michigan, early 2026) catches silent training errors by checking invariants rather than outputs — it caught 18 of 20 real silent errors in a single iteration (versus 2 for prior methods) and surfaced 6 previously unknown library bugs. The lesson is not “this is hopeless.” The lesson is that detection requires a deliberately constructed, independent checking layer. It does not come for free, and it does not come from the model.
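The idea of checking invariants rather than outputs can be sketched without any framework. This is not TrainCheck's API or its inference method; `sgd_step`, `check_update_invariant`, `gradients`, and the `frozen_by_mistake` parameter are hypothetical, and the invariant here is hand-written. It assumes a one-example linear model in which a bug silently prevents the bias from ever being updated.

```python
from typing import Dict, List


def gradients(params: Dict[str, float], x: float, y: float) -> Dict[str, float]:
    """Gradient of the squared error (w * x + b - y) ** 2 for one example."""
    err = params["w"] * x + params["b"] - y
    return {"w": 2 * err * x, "b": 2 * err}


def sgd_step(
    params: Dict[str, float],
    grads: Dict[str, float],
    lr: float,
    frozen_by_mistake: List[str],
) -> Dict[str, float]:
    """One SGD update. `frozen_by_mistake` simulates a silent bug in which
    some parameters never receive their update."""
    return {
        name: value if name in frozen_by_mistake else value - lr * grads[name]
        for name, value in params.items()
    }


def check_update_invariant(
    before: Dict[str, float],
    after: Dict[str, float],
    grads: Dict[str, float],
) -> List[str]:
    """Invariant: a parameter with a nonzero gradient must change after a step.
    Returns the names of the parameters that violate it."""
    return [n for n in before if grads[n] != 0.0 and after[n] == before[n]]


params = {"w": 0.0, "b": 0.0}
grads = gradients(params, x=2.0, y=5.0)
after = sgd_step(params, grads, lr=0.1, frozen_by_mistake=["b"])
violations = check_update_invariant(params, after, grads)
print("violations:", violations)
```

In this toy run the squared error still falls from 25 to 1 after the step, so a loss curve alone looks healthy. The invariant flags the frozen bias immediately because it inspects what the training step did, not what it produced.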

This Is Now a Liability Question

The framing has changed underneath us. The EU Product Liability Directive (2024/2853) brings AI-generated software defects into the scope of product liability. A silent kernel bug that corrupts a model’s outputs, or a generated function that quietly returns wrong financial figures, is no longer just an engineering embarrassment to be patched in the next sprint. It is a defect in a product, with the legal exposure that designation carries. “The AI wrote it” is not a defense. If anything, it sharpens the question of what verification you exercised before shipping.

For leadership, this collapses the decision. The verification layer was already an engineering best practice. It is now a risk-management and compliance obligation. The cost of building independent test oracles is small and known. The cost of a silent defect that ships, scales, and surfaces in production — or in court — is neither.

The Governance Playbook

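The two halves of the split screen converge on one principle: external grounding is the difference between an AI system that improves and one that quietly degrades. Self-improvement works because of the oracle. Code generation fails because there isn’t one. Build the oracle. Concretely, that means treating AI output as untrusted code, building independent test oracles (independent tests, differential validation, and invariant checks), and never letting one model write both the implementation and the tests that judge it.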

What This Means For You

The seductive read of 2026 is that agents are getting good enough to trust. The accurate read is that they are getting good enough to be dangerous without a verification layer — productive, autonomous, and confidently wrong in ways that don’t announce themselves. The organizations that win with agentic AI will not be the ones with the biggest model. They will be the ones that built the oracle: independent tests, differential validation, and invariant checks that treat AI output as untrusted until proven otherwise.

That layer is precisely the work most enterprises skip, because it is invisible when it works and catastrophic when it’s absent. It is the work Data Science & Engineering Experts builds. We architect the verification and governance layer that lets you adopt autonomous agents without inheriting their silent failures — independent test oracles, differential validation pipelines, and the operational controls that turn “the AI wrote it” from a liability into an auditable, defensible process.

If your organization is scaling AI-generated code faster than it is scaling the means to verify it, that gap is the risk. Engage with our team to close it before it ships.

Founder · Principal Engineer
Data & AI engineer · 10+ yrs hands-on

Writes most of the long-form here. Lives in the codebase. Active on GitHub and LinkedIn.
