Refinery Report / AI Engineering / post · 0042

Eval harnesses are the moat, not the model.

Most LLM apps fail not because the model was wrong, but because nobody owned the eval suite. Here's how we think about owning it — and why we'd rather hand off an eval harness than a model choice.

Founder · DSE-Experts
@dsee · 10+ yrs hands-on AI engineering
April 30, 2026
12 min · 2,840 words · v3 · revised 1×

Every week we land in a discovery call where the customer wants to know which model they should pick. Anthropic? OpenAI? Bedrock? Self-hosted? The right answer, almost without exception, is that this is the wrong first question. The question that matters is: what is your evaluation suite, and who owns it?

Models are commodities: increasingly cheap, increasingly substitutable. The thing that determines whether your application works is the suite of cases that defines "working." If you own that suite (if you can run it in CI, watch it for drift, and refuse to ship without it), you can swap models, swap providers, swap prompt strategies, and your product still works. If you don't own that suite, every model release is a coin flip.

What an eval suite actually is

An eval suite is not a benchmark. Benchmarks are about the model. Suites are about your product. The smallest viable suite is a list of input-output cases that, taken together, define "working" for your application. The minimum is around 50 cases. Below that you're guessing. Above 2,000 it gets expensive without much marginal signal. We tend to land in the 600–900 range.

A good case has three properties: it's specific (a real customer interaction, anonymized), it's graded (a deterministic check or rubric, not vibes), and it's weighted (some cases matter more than others: your golden cases catch regressions that are ship-stoppers, your edge cases catch creeping drift).
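To make the three properties concrete, here is a minimal sketch of what a single case could look like in Python. The `EvalCase` class, the `weighted_pass_rate` helper, and the two sample cases are illustrative assumptions, not part of the `dsee.evals` harness.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """Hypothetical shape of one eval case: specific, graded, weighted."""
    case_id: str
    prompt: str                     # specific: an anonymized real interaction
    check: Callable[[str], bool]    # graded: a deterministic pass/fail check
    weight: float = 1.0             # weighted: golden cases carry more weight


def weighted_pass_rate(cases: list[EvalCase], outputs: dict[str, str]) -> float:
    """Fraction of total weight earned by cases whose check passes."""
    total = sum(c.weight for c in cases)
    passed = sum(c.weight for c in cases if c.check(outputs[c.case_id]))
    return passed / total


# Two made-up cases: one golden (weight 5), one edge case (weight 1).
cases = [
    EvalCase("refund-001", "What is the refund window?",
             lambda out: "30 days" in out, weight=5.0),
    EvalCase("tone-014", "WHY IS MY ORDER LATE",
             lambda out: not out.isupper(), weight=1.0),
]

# Stand-in model outputs keyed by case id; the second one fails its check.
outputs = {
    "refund-001": "Refunds are accepted within 30 days.",
    "tone-014": "CALM DOWN",
}

rate = weighted_pass_rate(cases, outputs)
```

In this sketch the golden case passes and the edge case fails, so the weighted rate is 5/6 rather than the unweighted 1/2. That is the point of weighting: a failure on a low-weight edge case moves the score less than a failure on a golden case.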

Here's a snippet from the harness we run on every push to the PrivateStack codebase:

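# dsee/evals/harness.py

from dsee.evals import GoldenSuite, Drift


class CIFailure(Exception):
    """Fails the CI job. Defined here so the snippet is self-contained."""


# Load the versioned golden cases and run them against the prod model.
suite = GoldenSuite.load("./golden/v3")
result = suite.run(model="prod", k=5)

# Threshold 1: overall pass rate.
if result.pass_rate < 0.997:
    raise CIFailure(result.failures)

# Threshold 2: drift against the stored baseline. Raise explicitly instead
# of using assert, because assert statements are skipped under `python -O`.
drift = Drift.vs_baseline(result)
if drift.score >= 0.005:
    raise CIFailure(f"drift {drift.score:.3%} is at or above 0.5%")
# → 842/842 pass · drift 0.3% · ship it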

Two thresholds. A pass-rate floor of 0.997 means we're allowed to lose three cases out of a thousand; on the 842-case suite above, that works out to at most two failures. A drift ceiling of 0.005 means even the cases that pass can't shift too far from their baseline answers between releases. The CI run takes 90 seconds.
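The arithmetic behind the two thresholds can be sketched in a few lines. Both `max_failures` and `mean_drift` are hypothetical helpers written for this illustration. In particular, the post does not say how `Drift.vs_baseline` computes its score, so the text-similarity measure below is only one plausible stand-in, built on the standard library's `difflib`.

```python
from difflib import SequenceMatcher
from fractions import Fraction
from math import floor


def max_failures(n_cases: int, min_pass_rate: str) -> int:
    """Largest number of failing cases that still meets the pass-rate floor."""
    # Fraction keeps the boundary exact instead of relying on float rounding.
    return floor(n_cases * (1 - Fraction(min_pass_rate)))


def mean_drift(baseline: dict[str, str], current: dict[str, str]) -> float:
    """Assumed drift score: average textual dissimilarity per case, in [0, 1]."""
    distances = [
        1.0 - SequenceMatcher(None, baseline[cid], current[cid]).ratio()
        for cid in baseline
    ]
    return sum(distances) / len(distances)


# Failure budget at the 0.997 floor for two suite sizes.
budget_842 = max_failures(842, "0.997")
budget_1000 = max_failures(1000, "0.997")

# Identical answers produce zero drift under this measure.
unchanged = mean_drift(
    {"refund-001": "Refunds are accepted within 30 days."},
    {"refund-001": "Refunds are accepted within 30 days."},
)
```

Passing the threshold as a string lets `Fraction` parse it exactly, so a suite that sits right on the boundary (997 of 1,000) is counted as passing, matching the `<` comparison in the harness snippet.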

The pull-quote moment

The team that owns the eval suite controls the product.
Not the team that picks the model.

This is the single most important sentence in this post. It's why we tell every customer in their first week: start writing the suite before you write the prompts. The suite is the spec. The prompts are the implementation. You can iterate the implementation cheaply; you can't iterate the spec cheaply.

The four categories of cases

We sort eval cases into four buckets:

What happens if you don't own the suite

The dead-canary signal is this: a model release shows up, your application's behavior changes, and you don't know why. You roll back. You wait. You hope. You're managing a product you don't actually control.

I saw this in three engagements last year. Each time, the team owned the prompts, owned the routing, owned the infrastructure, but the eval suite was a Notion doc with 12 examples that the founder ran by hand once a month. When the model provider updated its model, behavior shifted, customer complaints arrived, and there was no instrument that said "yes, this is a regression" or "no, it's noise."
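One way such an instrument could separate regression from noise is a paired comparison of pass/fail results before and after the provider's update. The sketch below uses a one-sided exact sign test (the exact form of McNemar's test), which is a technique chosen for this illustration and not something the harness is known to use. The function name and the sample data are assumptions.

```python
from math import comb


def regression_p_value(before: dict[str, bool], after: dict[str, bool]) -> float:
    """One-sided exact sign test on paired pass/fail results.

    Returns the probability of seeing at least this many pass-to-fail flips
    if a flip in either direction were equally likely (that is, pure noise).
    """
    broke = sum(1 for cid in before if before[cid] and not after[cid])
    fixed = sum(1 for cid in before if not before[cid] and after[cid])
    flips = broke + fixed
    if flips == 0:
        return 1.0  # nothing changed, so there is no evidence of regression
    # Upper tail of Binomial(flips, 0.5) at the observed number of breakages.
    tail = sum(comb(flips, k) for k in range(broke, flips + 1))
    return tail / 2 ** flips


# Made-up example: 20 cases passed before; 6 of them fail after the update.
before = {f"case-{i}": True for i in range(20)}
after = {cid: (i >= 6) for i, cid in enumerate(before)}

p_regression = regression_p_value(before, after)
p_no_change = regression_p_value(before, before)
```

A small p-value says the breakages are unlikely to be coin-flip noise. The test only looks at cases that changed, which is also why a 12-example suite gives it very little to work with.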

Without a suite, every model upgrade is a vibes check. — a CTO, after we shipped them an eval harness

What we hand off

For every AI engineering engagement we ship, the eval harness is a named deliverable. It comes with:

The harness outlives the engagement, the team, and the model you're using when we hand it off. That's the point.


If you've got an LLM application in production and no eval suite — or one that's a single Notion page — we run a four-week fixed-fee engagement to build one and integrate it into your CI. Drop us a note at hello@thedataexperts.us or scope a call.

