Every week we land in a discovery call where the customer wants to know which model they should pick. Anthropic? OpenAI? Bedrock? Self-hosted? The right answer, almost without exception, is that this is the wrong first question. The question that matters is: what is your evaluation suite, and who owns it?
Models are commodities: increasingly cheap, increasingly substitutable. The thing that determines whether your application works is the suite of cases that defines what "working" means. If you own that suite, meaning you can run it in CI, watch it for drift, and refuse to ship without it, you can swap models, swap providers, swap prompt strategies, and your product still works. If you don't own that suite, every model release is a coin flip.
What an eval suite actually is
An eval suite is not a benchmark. Benchmarks are about the model. Suites are about your product. The smallest viable suite is a list of input-output cases that, taken together, define "working" for your application. The minimum is around 50 cases. Below that you're guessing. Above 2,000 it gets expensive without much marginal signal. We tend to land in the 600–900 range.
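One rough way to see why a tiny suite is guesswork is to look at the margin of error on a measured pass rate. The sketch below uses the textbook normal approximation and assumes cases behave like independent samples, which is our simplification rather than something the harness computes. The function name and the 0.90 pass rate are illustrative.

```python
import math


def pass_rate_margin(pass_rate: float, n_cases: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error on an observed pass rate.

    Normal approximation to the binomial; assumes independent cases.
    """
    return z * math.sqrt(pass_rate * (1 - pass_rate) / n_cases)


# Compare a hand-run doc of 12 examples with suites of increasing size.
for n in (12, 50, 800, 2000):
    print(n, round(pass_rate_margin(0.90, n), 3))
```

Under these assumptions the margin shrinks with the square root of the suite size, so each additional case buys less signal than the one before it.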
A good case has three properties: it's specific (a real customer interaction, anonymized), it's graded (a deterministic check or rubric, not vibes), and it's weighted (some cases matter more than others: your golden cases catch regressions that are ship-stoppers, your edge cases catch creeping drift).
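To make the three properties concrete, here is a minimal sketch of what a single case could look like in code. The `EvalCase` class, its field names, the example cases, and the weighted pass rate are all hypothetical illustrations, not the schema our harness uses.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """One eval case. Every field name here is a hypothetical choice."""
    case_id: str
    prompt: str                    # specific: an anonymized real interaction
    check: Callable[[str], bool]   # graded: a deterministic pass/fail check
    weight: float = 1.0            # weighted: golden cases count for more


def weighted_pass_rate(cases: list[EvalCase], outputs: dict[str, str]) -> float:
    """Share of total weight carried by cases whose output passes its check."""
    total = sum(c.weight for c in cases)
    passed = sum(c.weight for c in cases if c.check(outputs[c.case_id]))
    return passed / total


cases = [
    EvalCase("refund-001", "Customer asks for a refund on day 29 of 30.",
             check=lambda out: "refund" in out.lower(), weight=5.0),
    EvalCase("greeting-014", "Customer says hello.",
             check=lambda out: len(out) < 200),
]
outputs = {
    "refund-001": "You are still eligible for a refund.",
    "greeting-014": "Hi! How can I help today?",
}
print(weighted_pass_rate(cases, outputs))
```

Keeping the check as a plain function makes the grade reproducible: the same output always gets the same verdict, which is what separates a graded case from a vibes check.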
Here's a snippet from the harness we run on every push to the PrivateStack codebase:
    # dsee/evals/harness.py
    from dsee.evals import GoldenSuite, Drift

    suite = GoldenSuite.load("./golden/v3")
    result = suite.run(model="prod", k=5)
    if result.pass_rate < 0.997:
        raise CIFailure(result.failures)

    drift = Drift.vs_baseline(result)
    assert drift.score < 0.005
    # → 842/842 pass · drift 0.3% · ship it
Two thresholds. Pass rate at 0.997 means we're allowed to lose three out of a thousand. Drift at 0.005 means even the cases that pass can't shift too far from their baseline answers between releases. The CI run takes 90 seconds.
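To see what the two thresholds mean for a suite of a given size, the arithmetic can be sketched as below. The helper names `max_failures` and `gate` are hypothetical, and the sketch assumes the pass rate is simply passed cases divided by total cases.

```python
def max_failures(total_cases: int, min_pass_rate: float = 0.997) -> int:
    """Largest number of failing cases that still meets the pass-rate floor."""
    if total_cases <= 0:
        raise ValueError("total_cases must be positive")
    failures = 0
    # Keep allowing one more failure while the resulting rate stays at or above the floor.
    while (total_cases - failures - 1) / total_cases >= min_pass_rate:
        failures += 1
    return failures


def gate(passed: int, total: int, drift_score: float,
         min_pass_rate: float = 0.997, max_drift: float = 0.005) -> bool:
    """True only if both the pass-rate and drift thresholds are satisfied."""
    return passed / total >= min_pass_rate and drift_score < max_drift


print(max_failures(1000))       # failures tolerated per thousand cases
print(max_failures(842))        # failures tolerated in an 842-case suite
print(gate(842, 842, 0.003))    # the run quoted in the snippet's last comment
```

For a suite smaller than a thousand cases the budget rounds down, so the effective tolerance is tighter than "three in a thousand" suggests.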
The pull-quote moment
Not the team that picks the model.
This is the single most important sentence in this post. It's why we tell every customer in their first week: start writing the suite before you write the prompts. The suite is the spec. The prompts are the implementation. You can iterate the implementation cheaply; you can't iterate the spec cheaply.
The four categories of cases
We sort eval cases into four buckets:
- Golden. The 50–100 cases that absolutely must pass. A regression here is a ship-stopper. These are usually drawn from real customer escalations or critical-path interactions.
- Coverage. The 300–500 cases that exercise every code path through the prompt graph. If we add a tool, we add coverage cases for it.
- Adversarial. The 50–150 cases designed to break things — prompt injection, exfil attempts, jailbreaks, weird unicode, very long inputs. We run these less often (nightly, not per-commit) but block release on them.
- Drift. Cases that don't have a "right" answer per se, but have a baseline answer we compare every release against. The drift score is the cosine distance between the new answer and the baseline; we alert on a moving average over a few releases.
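The drift bucket can be sketched roughly as follows. This assumes each answer has already been turned into a numeric vector by some embedding model, which is not shown here. The `DriftMonitor` class, the window of three releases, and reusing 0.005 as the alert level are illustrative assumptions.

```python
import math
from collections import deque
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 minus cosine similarity; 0.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


class DriftMonitor:
    """Alerts on a moving average of drift scores over recent releases."""

    def __init__(self, window: int = 3, alert_above: float = 0.005) -> None:
        self.scores = deque(maxlen=window)  # oldest score drops off automatically
        self.alert_above = alert_above

    def record(self, score: float) -> bool:
        """Store a release's drift score; return True if the average alerts."""
        self.scores.append(score)
        return sum(self.scores) / len(self.scores) > self.alert_above


monitor = DriftMonitor()
for release_score in (0.002, 0.004, 0.012):
    print(release_score, monitor.record(release_score))
```

Averaging over a few releases means a single noisy run does not page anyone, while a steady slide away from the baseline eventually does.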
What happens if you don't own the suite
The dead-canary signal is this: a model release shows up, your application's behavior changes, and you don't know why. You roll back. You wait. You hope. You're managing a product you don't actually control.
I saw this in three engagements last year. Each time, the team owned the prompts, owned the routing, owned the infrastructure, but the eval suite was a Notion doc with 12 examples that the founder ran by hand once a month. When the model provider updated their model, behavior shifted, customer complaints arrived, and there was no instrument that said "yes, this is a regression" or "no, it's noise."
"Without a suite, every model upgrade is a vibes check." (a CTO, after we shipped them an eval harness)
What we hand off
For every AI engineering engagement we ship, the eval harness is a named deliverable. It comes with:
- A golden/ directory of 600–900 cases as JSON or YAML, with provenance comments on each case explaining what it's testing.
- A CLI to run the suite locally in < 90 seconds.
- A CI workflow that blocks merges on pass-rate and drift thresholds.
- A weekly job that reruns the full suite against the production model and writes results to a dashboard.
- A runbook for the engineers on your team to maintain it: how to add cases, how to triage failures, how to raise the bar.
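As a rough illustration of the first deliverable, a loader for a directory of JSON cases might look like the sketch below. The one-file-per-case layout and the field names (`id`, `input`, `expected`, `provenance`) are hypothetical. JSON has no comment syntax, so this sketch stores provenance as a regular field.

```python
import json
from pathlib import Path

# Hypothetical schema: every case must say where it came from.
REQUIRED_FIELDS = {"id", "input", "expected", "provenance"}


def load_cases(directory: str) -> list[dict]:
    """Load every *.json case in a directory, rejecting incomplete cases."""
    cases = []
    for path in sorted(Path(directory).glob("*.json")):
        case = json.loads(path.read_text(encoding="utf-8"))
        missing = REQUIRED_FIELDS - case.keys()
        if missing:
            raise ValueError(f"{path.name}: missing fields {sorted(missing)}")
        cases.append(case)
    return cases
```

Failing loudly on a missing provenance field keeps the suite maintainable: whoever inherits it can always tell why a case exists before deciding to change or retire it.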
The harness outlives the engagement, the team, and the model you're using when we hand it off. That's the point.
If you've got an LLM application in production and no eval suite — or one that's a single Notion page — we run a four-week fixed-fee engagement to build one and integrate it into your CI. Drop us a note at hello@thedataexperts.us or scope a call.