Enterprise AI · Post-Mortem · Production AI · Evaluation

Five Post-Mortems From the Field: Why Enterprise AI Projects Actually Die

Five anonymized failure modes from real engagements — the data problem wearing an AI costume, the missing eval harness, the governance gap, the pilot with no path to production, and security as an afterthought. For each: the symptom, the root cause, and the countermeasure.

DSE-Experts
Operator-led practice
May 27, 2026
7 min · 1,450 words

Executive Summary

Industry analysts keep publishing the same headline: a large share of enterprise AI projects will be canceled. The number moves, the explanation rarely does. “Lack of business value” and “data quality issues” are true and useless — they describe the autopsy, not the cause of death.

We have been called in to triage enough stalled and failed AI projects to recognize that they die in a small number of specific, repeatable ways. This is a field post-mortem of five of them. Each is anonymized into a pattern. For each, three things: the symptom you actually observe, the root cause underneath it, and the countermeasure that prevents it. None of these failures are about the model. All of them were preventable before the first sprint.

Failure Mode 1: The Data Problem Wearing an AI Costume

The symptom. The system gives confidently wrong answers. The team responds by changing models, tuning prompts, and swapping embeddings. Nothing moves the needle. Every fix improves some cases and breaks others, and morale drains as the model gets blamed for a problem the model never had.

The root cause. The data was never ready, only present. Fields meant different things across source systems. A status column carried six meanings. Retrieval chunking severed conditions from their exceptions, so the system retrieved a rule such as “Sales qualify immediately” without the clause that states its exception. The AI was faithfully reasoning over a broken world it had been handed.

The countermeasure. Run a semantic-integrity and coverage audit before any model work. Treat the readiness verdict as a real finding: if the data is present but not ready, the first sprint is data engineering, not model selection. Structure ingestion and chunking so meaning survives splitting — provenance travels with each chunk, exceptions stay attached to rules. You cannot prompt your way out of a data problem, and every hour spent trying is an hour confirming the wrong diagnosis.
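One way to keep exceptions attached to their rules during chunking can be sketched as follows. This is a minimal illustration, not a production chunker: the marker list `EXCEPTION_MARKERS`, the `Chunk` type, the `chunk_rules` function, the file name, and the sample policy text are all hypothetical, and real documents would need a more robust way to detect qualifying clauses.

```python
import re
from dataclasses import dataclass

# Hypothetical markers signalling that a sentence qualifies the one before it.
EXCEPTION_MARKERS = ("except", "unless", "however", "provided that")


@dataclass(frozen=True)
class Chunk:
    text: str
    source: str    # provenance: the document the chunk came from
    position: int  # provenance: index of the chunk's first sentence


def chunk_rules(document: str, source: str) -> list[Chunk]:
    """Split into sentences, folding each exception sentence into the rule it qualifies."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    chunks: list[Chunk] = []
    for index, sentence in enumerate(sentences):
        is_exception = sentence.lower().startswith(EXCEPTION_MARKERS)
        if is_exception and chunks:
            # Keep the exception in the same chunk as its rule.
            previous = chunks[-1]
            chunks[-1] = Chunk(f"{previous.text} {sentence}", previous.source, previous.position)
        else:
            chunks.append(Chunk(sentence, source, index))
    return chunks


# Invented sample text, for illustration only.
policy = (
    "Sales staff qualify immediately. "
    "However, contractors qualify after 90 days. "
    "Claims are filed through the portal."
)
chunks = chunk_rules(policy, source="benefits-policy.txt")
for chunk in chunks:
    print(chunk.source, chunk.position, chunk.text)
```

The point of the design is that a retriever can no longer return the rule without its exception, and every chunk can be traced back to its source document and position.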

Failure Mode 2: No Eval Harness, So Quality Regressed Silently

The symptom. The demo was great. Three weeks after launch, users start reporting that answers “got worse,” but nobody can say when, which change caused it, or by how much. Every prompt tweak is a gamble. The team is afraid to change anything because they cannot tell whether a change helps or hurts.

The root cause. There was never a definition of “done” and never a mechanism to enforce it. Quality lived in the memory of whoever ran the last demo. With no curated eval set and no regression gate, a prompt change that fixed one case silently broke four others, and the degradation compounded invisibly until a human happened to notice.

The countermeasure. Build the eval harness before launch, from the written definition of correct, wrong, and acceptable-refusal. Curate representative cases including the hard exceptions. Score every change automatically for correctness, faithfulness to retrieved context, and refusal behavior. Make a quality drop a hard gate that blocks release the way a failing test blocks a merge. Add production observability so live drift is caught before users report it. A probabilistic system without a regression gate does not stay correct — it erodes.
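A regression gate of this kind can be sketched in a few lines. The sketch below assumes a deliberately simple scoring rule (a correct answer must contain an expected substring, and a refusal case must return a sentinel); faithfulness scoring is omitted. The names `EvalCase`, `REFUSE`, `score`, and `release_gate`, along with the toy `baseline` and `candidate` systems and their questions, are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

REFUSE = "<refuse>"  # hypothetical sentinel for an acceptable refusal


@dataclass(frozen=True)
class EvalCase:
    question: str
    expected: str  # substring a correct answer must contain, or REFUSE


def score(system: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Fraction of curated cases the system answers (or refuses) correctly."""
    passed = 0
    for case in cases:
        answer = system(case.question)
        if case.expected == REFUSE:
            passed += int(answer == REFUSE)
        else:
            passed += int(case.expected.lower() in answer.lower())
    return passed / len(cases)


def release_gate(candidate_score: float, baseline_score: float, tolerance: float = 0.0) -> bool:
    """Allow release only if quality did not drop below the baseline."""
    return candidate_score >= baseline_score - tolerance


CASES = [
    EvalCase("What is the refund window?", "30 days"),
    EvalCase("Who approves travel?", "manager"),
    EvalCase("Where are claims filed?", "portal"),
    EvalCase("What is the CEO's home address?", REFUSE),
]


def baseline(question: str) -> str:
    answers = {
        "What is the refund window?": "Refunds are accepted within 30 days.",
        "Who approves travel?": "Your direct manager approves travel.",
        "Where are claims filed?": "Claims are filed through the portal.",
    }
    return answers.get(question, REFUSE)


def candidate(question: str) -> str:
    # A prompt change that silently breaks one previously correct case.
    if question == "Who approves travel?":
        return "Travel needs no approval."
    return baseline(question)


baseline_score = score(baseline, CASES)
candidate_score = score(candidate, CASES)
print(baseline_score, candidate_score, release_gate(candidate_score, baseline_score))
```

Wired into CI, a `False` from `release_gate` would fail the build, which is what turns "answers got worse" from a user complaint into a blocked merge.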

Failure Mode 3: The Governance and Ownership Gap

The symptom. The system works in the dev account and cannot go to production. Months pass in approval limbo. No one will sign off because no one can say who is allowed to see what the system produces, or who is accountable if it surfaces the wrong thing to the wrong person.

The root cause. Governance was treated as a launch-time compliance checkbox instead of an architecture decision. There was no named, accountable data owner — only a committee that could not decide. Access control was assumed (“only the right people have logins”) rather than enforced at the retrieval layer. There was no audit trail, so the system could not answer “who asked what” after the fact, which made approval impossible in a regulated context.

The countermeasure. Assign a named, accountable owner for the data domain — a person, not a committee. Enforce permissions at the retrieval and API layer so identity propagates from the request into what the system is allowed to return; permissions are part of the query, not a filter on results. Design auditability in from the start, because it cannot be retrofitted cheaply. Governance is an architecture decision made on day one, not a gate discovered on the eve of launch.
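The difference between "permissions as part of the query" and "a filter on results" can be illustrated with a small in-memory sketch. It assumes a group-based access model and keyword matching in place of a real vector search; `Document`, `Identity`, `Retriever`, and the sample documents are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Document:
    doc_id: str
    text: str
    allowed_groups: frozenset[str]


@dataclass(frozen=True)
class Identity:
    user_id: str
    groups: frozenset[str]


@dataclass
class Retriever:
    documents: list[Document]
    audit_log: list[dict] = field(default_factory=list)

    def search(self, identity: Identity, query: str) -> list[Document]:
        """Match documents with the permission check inside the query predicate."""
        terms = query.lower().split()
        hits = [
            doc
            for doc in self.documents
            # Unauthorized documents never become candidates at all.
            if doc.allowed_groups & identity.groups
            and any(term in doc.text.lower() for term in terms)
        ]
        # Every request is recorded, so "who asked what" is answerable later.
        self.audit_log.append(
            {"user": identity.user_id, "query": query, "returned": [d.doc_id for d in hits]}
        )
        return hits


retriever = Retriever(
    documents=[
        Document("hr-1", "Salary bands by level", frozenset({"hr"})),
        Document("pub-1", "Salary review process overview", frozenset({"hr", "staff"})),
    ]
)
staff_user = Identity("u-17", frozenset({"staff"}))
results = retriever.search(staff_user, "salary")
print([doc.doc_id for doc in results], retriever.audit_log)
```

Because the identity is a required argument to `search`, there is no code path that retrieves without a subject to enforce against, and the audit record is written in the same call.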

Failure Mode 4: The Pilot With No Path to Production

The symptom. A successful pilot that never becomes a system. It demos beautifully on a curated dataset for a single user, leadership is delighted, and then it stalls — because making it real means multi-tenancy, isolation, auth, scale, and operations that were never in scope. The pilot was an island with no bridge to the mainland.

The root cause. The pilot was architected to impress, not to extend. It used a copy of production data with no permission model, ran single-user with no tenancy concept, and had no path to enforce isolation or scale. Every production requirement was deferred as “later,” and later turned out to be a near-total rebuild. The pilot optimized for the demo and mortgaged the system.

The countermeasure. Architect the pilot as the first slice of the production system, not a throwaway. Even at pilot scale, wire the real auth pattern, a real (if small) tenancy model, and permission-aware retrieval. Define the non-functional envelope — concurrency, isolation, residency — in the requirements before the pilot, so the pilot proves the architecture, not just the idea. A pilot that cannot grow into production is a successful demo and a failed project.
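One lightweight way to write the non-functional envelope down before the pilot is as a small, comparable record. The sketch below is illustrative only: the `Envelope` fields, the string values, and the `architectural_gaps` helper are hypothetical, and the assumption is that a pilot may run at lower concurrency but must match production on the architectural choices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Envelope:
    max_concurrent_users: int
    tenant_isolation: str  # e.g. "per-tenant-index" or "none"
    data_residency: str    # e.g. "in-region"
    auth_pattern: str      # e.g. "gateway-jwt"


# Fields where the pilot must match production; scale alone may differ.
ARCHITECTURAL_FIELDS = ("tenant_isolation", "data_residency", "auth_pattern")


def architectural_gaps(pilot: Envelope, production: Envelope) -> list[str]:
    """Names of architectural fields where the pilot diverges from the production target."""
    return [
        name
        for name in ARCHITECTURAL_FIELDS
        if getattr(pilot, name) != getattr(production, name)
    ]


production = Envelope(500, "per-tenant-index", "in-region", "gateway-jwt")
pilot = Envelope(5, "none", "in-region", "gateway-jwt")
print(architectural_gaps(pilot, production))
```

An empty list means the pilot is a smaller instance of the production architecture. Anything else is a rebuild that has been deferred, named while it is still cheap to fix.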

Failure Mode 5: Security and Auth as an Afterthought

The symptom. The build is nearly done, then a security review halts it. Auth was bolted on at the edge but the data behind it is open. Tenants can, under the right request, see each other’s context. There is no record of who accessed what. The launch date slips indefinitely while the team retrofits a security model into an architecture that did not anticipate one.

The root cause. Security was sequenced last instead of shaping the design from the start. Authentication existed at the gateway, but identity never propagated into retrieval, so the system enforced “you are logged in” rather than “you are allowed to see this.” Multi-tenant isolation was assumed rather than built into the data layer. Audit was absent. Each of these is cheap to design in and expensive — sometimes impossible — to retrofit.

The countermeasure. Make security a design input, not a final stage. Authenticate every request at the gateway (JWT/JWKS, signed-token validation) before it reaches application logic. Propagate the authenticated identity into the retrieval layer so permission-aware retrieval has a real subject to enforce against. Isolate tenants at the data layer — in multi-tenant AI, isolation is the product. Record every request for audit by design. Auth at the edge with open data behind it is a breach waiting to happen.
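The chain from signed-token validation to tenant-scoped reads can be sketched with the standard library alone. Note the substitution: this sketch verifies an HS256 (shared-secret) token so that it stays self-contained, whereas a JWKS-based setup would typically verify asymmetric signatures against keys published by the issuer, using a vetted JWT library rather than hand-rolled code. The claim names `sub` and `tenant`, the `TenantStore` class, and the sample data are hypothetical.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_token(claims: dict, secret: bytes) -> str:
    """Issue an HS256 token (stands in for the identity provider in this sketch)."""
    header = _b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url_encode(json.dumps(claims).encode())
    signature = hmac.new(secret, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    return f"{header}.{payload}.{_b64url_encode(signature)}"


def verify_token(token: str, secret: bytes, now: Optional[float] = None) -> dict:
    """Gateway check: reject anything unsigned, mis-signed, or expired."""
    parts = token.split(".")
    if len(parts) != 3:
        raise PermissionError("malformed token")
    header_b64, payload_b64, signature_b64 = parts
    if json.loads(_b64url_decode(header_b64)).get("alg") != "HS256":
        raise PermissionError("unexpected algorithm")
    expected = hmac.new(secret, f"{header_b64}.{payload_b64}".encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
        raise PermissionError("bad signature")
    claims = json.loads(_b64url_decode(payload_b64))
    # A token with no "exp" claim is treated as expired.
    if claims.get("exp", 0) <= (time.time() if now is None else now):
        raise PermissionError("token expired")
    return claims


class TenantStore:
    """Data layer keyed by tenant: there is no read that spans tenants."""

    def __init__(self) -> None:
        self._rows: dict[str, list[str]] = {}

    def add(self, tenant_id: str, text: str) -> None:
        self._rows.setdefault(tenant_id, []).append(text)

    def read(self, claims: dict) -> list[str]:
        # The tenant comes from verified claims, never from request parameters.
        return list(self._rows.get(claims["tenant"], []))


secret = b"demo-secret"  # illustration only; real keys come from a secret manager
store = TenantStore()
store.add("acme", "acme contract")
store.add("globex", "globex contract")

token = sign_token({"sub": "u-1", "tenant": "acme", "exp": time.time() + 60}, secret)
claims = verify_token(token, secret)
print(store.read(claims))
```

The store accepts only verified claims, so "you are logged in" and "you are allowed to see this" are enforced by the same object rather than by two layers that can drift apart.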

The Pattern Across All Five

Failure mode | What everyone blames | The actual root cause | The countermeasure
--- | --- | --- | ---
1. Data in an AI costume | The model | Data present, not ready | Readiness audit before model work
2. Silent regression | The prompt | No eval harness or gate | Eval harness gating every release
3. Governance gap | Compliance/legal | No owner, no enforced access | Named owner; access enforced in architecture
4. Pilot dead-ends | “We need more budget” | Pilot built to impress, not extend | Pilot as first production slice
5. Security afterthought | The security team | Auth not designed into the stack | Security as a day-one design input

The common thread is sequencing. Every one of these failures is a decision deferred past the point where it was cheap to make correctly. The model was never the variable that decided the outcome. Data readiness, evaluation discipline, governance ownership, a production path, and security-by-design were — and all five are knowable before the first sprint.

Applicability

This post-mortem framework is a pre-mortem checklist for any organization about to fund an enterprise AI project. Run the five failure modes against your plan before you build. If you can name the countermeasure already in place for each, you are in the minority of projects that ship. If you cannot, you have just found the work that needs to happen first — at a fraction of the cost of discovering it in month four.


These failure modes are drawn from triage and recovery work across enterprise and regulated-industry AI engagements. All details are anonymized into patterns; the framework is published as a pre-mortem reference for technology and data leaders.

Founder · Principal Engineer
Data & AI engineer · 10+ yrs hands-on

Writes most of the long-form here. Lives in the codebase. Active on GitHub and LinkedIn.
