Building a RAG Evaluation Harness: Stopping the Million-Dollar Chunking Mistake
Executive Summary
Most enterprise RAG systems ship on vibes. A team builds a retrieval-augmented chatbot, runs a dozen hand-picked questions, sees plausible answers, and declares victory. Then a quiet change—a new chunk size, a different embedding model, a reworded system prompt—silently degrades answer quality, and nobody notices until a regulator, an auditor, or a frustrated user does.
The most expensive failures we see are not exotic. They are chunking decisions made once, never measured, and never re-validated. A chunk size that looks fine in a demo can quietly halve retrieval recall on the long-tail questions that matter most. In a regulated-industry engagement, that gap is not a UX annoyance—it is a compliance exposure with a real dollar figure attached.
This framework is DSE’s opinionated method for building evaluation into RAG systems as infrastructure, not as an afterthought. The core thesis: you cannot ship RAG responsibly without a versioned golden dataset, a layered metric taxonomy, calibrated LLM judges, and CI gates that block regressions before they reach production. Everything below is how we operationalize that.
Why a Dedicated Harness Is Necessary
RAG combines two failure surfaces—retrieval and generation—and they fail independently. A retriever can return perfect context that the model then ignores. A model can produce a fluent, confident answer grounded in nothing. Spot-checking conflates these failures and hides both.
The deeper problem is that classical retrieval quality does not predict end-to-end answer quality. Recent research found that information-retrieval metrics like nDCG, MAP, and MRR explain only about 60 percent of the variance in downstream RAG accuracy. Those metrics assume every relevant document in the top ranks is equally useful and ignore the possibility that a retrieved passage actively distracts the model into a worse answer.
The conclusion is unavoidable: a single number is never enough. A robust harness is an infrastructure layer combining data curation, retrieval metrics, generation metrics, judge calibration, CI gates, and production observability. Here is how we assemble it.
Layer 1: The Golden Dataset as a Product
Every evaluation effort begins with a curated, version-controlled set of queries, reference answers, and—where it matters—annotated ground-truth passages. We treat this golden dataset as a product: it has an owner, a roadmap, release notes, and a maintenance process tied to corpus changes.
Sizing matters. A common mistake is a 30-question “eval set” that produces statistically meaningless pass rates. As a rule of thumb, if you expect an 80 percent pass rate and want a margin of error of 5 percentage points at 95 percent confidence, the normal approximation for a proportion calls for roughly 246 samples per slice. The word slice is load-bearing: you size per scenario (product line, language, risk category), not per system.
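The 246 figure can be reproduced with the textbook normal-approximation formula for estimating a proportion, n = z² · p(1 − p) / e². The sketch below assumes that formula; the function name samples_per_slice is ours, and z = 1.96 is the usual value for 95 percent confidence.

```python
import math


def samples_per_slice(expected_pass_rate: float, margin: float, z: float = 1.96) -> int:
    """Sample size for estimating a pass rate within +/- margin.

    Uses the normal approximation n = z^2 * p * (1 - p) / e^2, rounded up.
    z = 1.96 corresponds to 95 percent confidence.
    """
    p = expected_pass_rate
    return math.ceil(z * z * p * (1.0 - p) / (margin * margin))


# 80 percent expected pass rate, 5 percentage point margin, 95 percent confidence
print(samples_per_slice(0.80, 0.05))  # 246
```

Halving the margin roughly quadruples the required samples, which is why per-slice budgets should be agreed before annotation starts.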
We seed the dataset from real production logs, not from the imagination of the build team. Synthetic generation has a role for coverage, but we enforce one hard rule: the held-out test set used for CI gates must contain no duplicates or near-duplicates of any examples used to tune the retriever, reranker, or prompts.
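One way to enforce the no-near-duplicates rule is a lexical overlap check between the held-out queries and the tuning queries. The sketch below uses Jaccard similarity over word 3-grams. The 0.8 threshold, the shingle size, and the function names are illustrative assumptions, and a purely lexical check will miss paraphrases that an embedding-based check could catch.

```python
import re


def shingles(text: str, n: int = 3) -> set:
    """Lowercased word n-grams; short texts collapse to a single shingle."""
    tokens = re.findall(r"\w+", text.lower())
    if len(tokens) < n:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def find_leaks(test_queries, tuning_queries, threshold: float = 0.8):
    """Return (test_query, tuning_query) pairs whose overlap meets the threshold."""
    tuning = [(q, shingles(q)) for q in tuning_queries]
    leaks = []
    for test_query in test_queries:
        test_shingles = shingles(test_query)
        for tuning_query, tuning_shingles in tuning:
            if jaccard(test_shingles, tuning_shingles) >= threshold:
                leaks.append((test_query, tuning_query))
    return leaks
```

A non-empty result from find_leaks is a reason to drop or replace the offending test examples before the set is used for gating.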
In regulated engagements, the golden dataset doubles as a governance artifact. It is the concrete evidence in a risk review—showing how the system performs on safety-critical scenarios, what thresholds CI enforces, and how those thresholds map to business risk tolerance. This is also where alignment to frameworks like the NIST AI Risk Management Framework becomes tangible rather than aspirational.
Layer 2: Retrieval Metrics That Actually Tell You Something
Retrieval is where most quality is won or lost, so we measure it explicitly before we ever look at the answer.
Recall@k measures whether the relevant passages made it into the top-k results at all. It is the single most diagnostic retrieval metric, because if the answer-bearing chunk is not retrieved, no model can recover it.
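As a minimal sketch, recall@k for one query can be computed from the ranked list of retrieved passage IDs and the annotated set of relevant IDs. The function name and the ID format are illustrative.

```python
def recall_at_k(retrieved_ids, relevant_ids, k: int) -> float:
    """Fraction of the relevant passages that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("recall@k is undefined for a query with no relevant passages")
    hits = relevant & set(retrieved_ids[:k])
    return len(hits) / len(relevant)


ranked = ["d3", "d1", "d7", "d2"]
print(recall_at_k(ranked, {"d1", "d2"}, k=3))  # 0.5
print(recall_at_k(ranked, {"d1", "d2"}, k=4))  # 1.0
```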
Mean Reciprocal Rank (MRR) captures how early the first relevant passage appears, which matters in QA-style systems where one answer-bearing passage near the top is what counts:
MRR = (1/N) * Σ (1 / rank_i)
where rank_i is the rank of the first relevant document for query i.
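The formula translates directly into code. In this sketch a query with no relevant document in its ranked list contributes zero, which is a common convention but still an assumption to confirm for your setup.

```python
def mean_reciprocal_rank(results) -> float:
    """MRR over a list of (ranked_ids, relevant_ids) pairs, one pair per query.

    A query whose ranked list contains no relevant document contributes 0.
    """
    if not results:
        raise ValueError("need at least one query")
    total = 0.0
    for ranked_ids, relevant_ids in results:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            if doc_id in relevant_ids:
                total += 1.0 / rank
                break  # only the first relevant document counts
    return total / len(results)
```

For example, three queries whose first relevant hits are at rank 1, rank 3, and nowhere give (1 + 1/3 + 0) / 3, or about 0.44.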
Normalized Discounted Cumulative Gain (nDCG) handles graded relevance, rewarding highly relevant passages near the top and discounting lower ranks logarithmically:
DCG_p = Σ (rel_i / log2(i + 1)) for i = 1..p
nDCG_p = DCG_p / IDCG_p
nDCG ranges from 0 to 1, with 1 indicating a perfect ranking. It is the standard metric for evaluating retrievers and rerankers on benchmarks such as BEIR and MTEB.
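A short sketch of the two formulas above. It assumes graded relevance labels are available both for the retrieved ranking and for every judged document of the query, because the ideal ranking (IDCG) is built from the judged pool. The function names are ours.

```python
import math


def dcg(relevances) -> float:
    """DCG = sum of rel_i / log2(i + 1) over 1-based positions i."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances, start=1))


def ndcg_at_p(ranked_relevances, judged_relevances, p: int) -> float:
    """nDCG_p = DCG_p / IDCG_p.

    ranked_relevances: grades of the retrieved documents, in ranked order.
    judged_relevances: grades of all judged documents for the query, any order.
    """
    ideal = dcg(sorted(judged_relevances, reverse=True)[:p])
    if ideal == 0:
        return 0.0  # no relevant documents judged for this query
    return dcg(ranked_relevances[:p]) / ideal
```

A ranking that already lists grades in descending order scores 1.0, and swapping a highly relevant passage below a weaker one lowers the score.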
But because these IR metrics explain only about 60 percent of the variance in downstream accuracy, we always pair them with RAG-specific retrieval metrics:
| Metric | What it catches | Plain-English failure signal |
|---|---|---|
| Context recall | Did retrieval surface the facts needed to answer? | Low value = retriever missed critical information |
| Context precision | Are relevant chunks ranked above irrelevant ones? | Low value = reranker is weak or top-k too large |
| Context relevance | What fraction of retrieved context is actually needed? | Low value = chunk size / top-k misconfigured |
Context relevance is our early warning system for the chunking mistake. Small chunks with high top-k inflate precision but can wreck recall and cost; large chunks dilute precision with superfluous text. We measure this rather than guess.
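To make the three context metrics concrete, here is a sketch that computes them from per-chunk and per-fact labels. It assumes the labels already exist, whether from annotators or from a judge, and it uses one common formulation (average-precision style for context precision). Evaluation libraries define their own variants, so treat these definitions as illustrative and not as any library's API.

```python
def context_precision(relevance_flags) -> float:
    """Mean of precision@k taken at each rank k that holds a relevant chunk.

    relevance_flags: booleans in ranked order, True if the chunk is relevant.
    Rewards rankings that place relevant chunks above irrelevant ones.
    """
    hits = 0
    total = 0.0
    for k, is_relevant in enumerate(relevance_flags, start=1):
        if is_relevant:
            hits += 1
            total += hits / k
    return total / hits if hits else 0.0


def context_relevance(relevance_flags) -> float:
    """Fraction of retrieved chunks that are actually needed."""
    flags = list(relevance_flags)
    return sum(flags) / len(flags) if flags else 0.0


def context_recall(fact_supported_flags) -> float:
    """Fraction of reference-answer facts supported by some retrieved chunk."""
    flags = list(fact_supported_flags)
    if not flags:
        raise ValueError("need at least one reference fact")
    return sum(flags) / len(flags)
```

Tracking these three numbers across a chunk-size sweep shows the trade-off directly: a configuration that raises context relevance while lowering context recall is buying tidiness at the cost of missing facts.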
Layer 3: Answer Metrics and the Hallucination Triad
Once retrieval is sound, we score the answer. We organize generation metrics around the triad that TruLens documents as the RAG triad and that we call the hallucination triad; it is the cleanest mental model we’ve found:
- Groundedness / Faithfulness — Is the answer supported by the retrieved context? We treat faithfulness as the fraction of statements in the answer that are consistent with (or at least not contradicted by) the context. This is the hallucination detector.
- Context relevance — Is the retrieved context relevant to the question?
- Answer relevance — Does the answer actually address what was asked?
Each metric measures exactly one property. We resist the temptation to compute a single blended “quality score,” because when it drops you have no idea which lever to pull. Single-aspect metrics with consistent 0–1 scoring ranges are what make aggregation and root-cause analysis tractable.
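The faithfulness definition above reduces to a simple ratio once each statement in the answer has a verdict. The sketch below assumes a judge has already labeled every statement with one of three hypothetical labels: supported, contradicted, or unverifiable. The strict mode counts only supported statements; the lenient mode also accepts statements that are merely not contradicted.

```python
from collections import Counter

# Hypothetical verdict labels; a real judge rubric would define its own.
VALID_VERDICTS = {"supported", "contradicted", "unverifiable"}


def faithfulness(verdicts, lenient: bool = False) -> float:
    """Fraction of answer statements consistent with the retrieved context.

    strict:  only "supported" statements count.
    lenient: "unverifiable" statements (not contradicted) also count.
    """
    verdicts = list(verdicts)
    if not verdicts:
        raise ValueError("need at least one statement verdict")
    unknown = set(verdicts) - VALID_VERDICTS
    if unknown:
        raise ValueError(f"unknown verdicts: {sorted(unknown)}")
    counts = Counter(verdicts)
    accepted = counts["supported"] + (counts["unverifiable"] if lenient else 0)
    return accepted / len(verdicts)
```

Reporting both modes side by side keeps the 0–1 range while showing how much of the score depends on statements the context neither confirms nor denies.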
Layer 4: LLM-as-a-Judge—and Its Failure Modes
You cannot have humans rate every response at CI cadence. So faithfulness, relevance, and correctness are scored by an LLM judge—ideally a model at least as capable as the one being served. The recipe we follow: define single-aspect criteria, craft an explicit rubric with worked examples, force the judge to emit its reasoning, and calibrate against a small human-labeled set until agreement is acceptable.
The judge is itself a model with errors, and we treat it that way. The failure modes we actively defend against:
- Position bias — the judge favors whichever answer appears first. We randomize order and run paired comparisons.
- Verbosity bias — longer answers score higher regardless of correctness. We instruct the judge to ignore length.
- Self-bias — a judge prefers outputs from its own model family. We avoid judging a model with itself where stakes are high.
- Temperature drift — judge consistency degrades at high temperature. We pin judges to low temperature for reproducibility.
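For position bias, one simple paired-comparison scheme is to ask the judge twice with the answer order swapped and accept a winner only when both passes agree. This is a deterministic variant of randomizing order, offered as a sketch. The judge callable and its "first"/"second" return values are hypothetical stand-ins for whatever judge client you use.

```python
from typing import Callable

# Hypothetical judge signature: (question, first_answer, second_answer) -> "first" | "second"
Judge = Callable[[str, str, str], str]


def debiased_preference(judge: Judge, question: str, answer_a: str, answer_b: str) -> str:
    """Return "a", "b", or "tie" after judging both presentation orders."""
    pass_one = judge(question, answer_a, answer_b)  # A shown first
    pass_two = judge(question, answer_b, answer_a)  # B shown first
    winner_one = "a" if pass_one == "first" else "b"
    winner_two = "b" if pass_two == "first" else "a"
    # Disagreement between the two orders signals position sensitivity.
    return winner_one if winner_one == winner_two else "tie"
```

The share of comparisons that come back as ties is itself a useful health metric: a high tie rate suggests the judge is reacting to position more than to content.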
We re-calibrate judges against human labels on a schedule, not once. An uncalibrated judge is a confident liar wired directly into your release gate.
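Agreement between the judge and human labels can be tracked with a chance-corrected statistic. The sketch below uses Cohen's kappa, which is one reasonable choice and not the only one; what counts as acceptable agreement is a decision for the team and is not encoded here.

```python
from collections import Counter


def cohens_kappa(human_labels, judge_labels) -> float:
    """Chance-corrected agreement between two label sequences of equal length."""
    if len(human_labels) != len(judge_labels) or not human_labels:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human_labels)
    observed = sum(h == j for h, j in zip(human_labels, judge_labels)) / n
    human_counts = Counter(human_labels)
    judge_counts = Counter(judge_labels)
    # Agreement expected by chance if the two raters labeled independently.
    expected = sum(human_counts[c] * judge_counts[c] for c in human_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)
```

Running this on each scheduled calibration batch and plotting the value over time makes judge drift visible before it affects a release gate.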
Layer 5: The Tooling Landscape (And How We Pick)
We are deliberately non-dogmatic about tools. The harness, not the vendor, is the asset. How the landscape maps to the job:
| Tool | Primary strength | Where it fits |
|---|---|---|
| RAGAS | RAG-specific metrics (faithfulness, context precision/recall) | Dataset-centric offline scoring |
| DeepEval | “Pytest for LLMs,” threshold-based assert_test | CI regression gates |
| TruLens | Tracing + hallucination triad feedback functions | Debugging specific failures |
| Phoenix / Arize AX | OpenTelemetry-native observability | Online / production evaluation |
| promptfoo, LangSmith, Maxim, Braintrust | Config-driven evals, dashboards, governance | Scale and team workflows |
Our default starting stack for a regulated client: RAGAS or DeepEval for offline metrics and CI gates, TruLens for failure debugging, and Phoenix for production tracing—feeding production failures back into the golden dataset. We adopt a small set of tools that match the client’s stack and risk profile, then integrate them into one coherent harness rather than running isolated experiments.
Layer 6: CI Regression Gates—Where the Money Is Saved
This is the layer that prevents the million-dollar mistake. Traditional software tests assume determinism; RAG systems do not provide it. So our gates assert on metric thresholds, not exact-match equality.
The pattern we deploy:
- Any change to chunking, index config, embedding model, reranker, or prompts triggers an evaluation run against the golden test set.
- The run computes retrieval and answer metrics (recall@k, context precision/recall, faithfulness, answer relevance).
- The pipeline fails the build if any metric falls below its agreed threshold for the deployed corpus version.
- The CI step is mandatory release governance—not an optional manual check a developer can skip on a Friday.
With a DeepEval-style harness, the gate is a single GitHub Actions step that invokes deepeval test run and fails on any threshold violation. The discipline is what matters: a chunking change that quietly drops context recall now turns the build red instead of turning into a production incident.
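Independent of any particular tool, the gate logic itself is small. The sketch below is a tool-agnostic illustration in plain Python, not DeepEval's API. The metric names and threshold values are hypothetical placeholders that would be agreed per corpus version.

```python
# Hypothetical minimum thresholds; real values are agreed per corpus version.
THRESHOLDS = {
    "recall_at_5": 0.85,
    "context_recall": 0.80,
    "faithfulness": 0.90,
    "answer_relevance": 0.85,
}


def gate_failures(metrics: dict, thresholds: dict) -> list:
    """Return one message per metric that is missing or below its threshold."""
    failures = []
    for name, minimum in thresholds.items():
        value = metrics.get(name)
        if value is None:
            # A metric that silently disappears must not pass the gate.
            failures.append(f"{name}: missing from evaluation run")
        elif value < minimum:
            failures.append(f"{name}: {value:.3f} < required {minimum:.3f}")
    return failures


def exit_code(metrics: dict, thresholds: dict = THRESHOLDS) -> int:
    """0 when every threshold holds, 1 otherwise; pass the result to sys.exit in CI."""
    failures = gate_failures(metrics, thresholds)
    for message in failures:
        print("GATE FAIL", message)
    return 1 if failures else 0
```

Treating a missing metric as a failure is a deliberate choice here: it stops a broken evaluation job from being read as a clean pass.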
Layer 7: Online Evaluation—Because Golden Sets Drift
Offline gates protect against regressions on known cases. They cannot anticipate the queries users actually invent. So we attach evaluators to production traces using OpenTelemetry-style instrumentation—spans for routing, retrieval, generation, and policy checks, annotated with query text, retrieved context IDs, model version, and latency.
We sample and stratify: compute cheap proxies (answer length, retrieval latency, cache hit rate) on every request, and run expensive judge-based metrics on 1–5 percent of traffic, stratified by product, language, or risk category, with rolling confidence intervals per slice. Emerging failure modes get re-labeled and folded back into the golden dataset, closing the loop.
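The sampling and confidence-interval steps can be sketched with the standard library alone. The example assumes each trace has a stable ID and a slice name, uses hash-based sampling so the decision is reproducible per trace, and reports a Wilson score interval for the pass rate in a slice. The per-slice rates shown are hypothetical.

```python
import hashlib
import math

# Hypothetical per-slice sampling rates within the 1-5 percent band.
SAMPLE_RATES = {"high_risk": 0.05, "default": 0.01}


def should_judge(trace_id: str, slice_name: str, rates: dict = SAMPLE_RATES) -> bool:
    """Deterministically select a fraction of traces for judge-based scoring."""
    rate = rates.get(slice_name, rates.get("default", 0.01))
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate


def wilson_interval(passes: int, n: int, z: float = 1.96):
    """Wilson score interval for a pass rate; z = 1.96 gives about 95 percent."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = passes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1.0 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)
```

Recomputing the interval over a rolling window per slice shows when a slice has too few judged samples to support a conclusion, which is the cue to raise its sampling rate.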
The DSE Method in One Picture
┌───────────────────────────────────────────────────────────────┐
│ GOLDEN DATASET (versioned, ~246 samples/slice, owned)         │
└───────────────┬───────────────────────────────┬───────────────┘
                │                               │
       ┌────────▼─────────┐            ┌────────▼─────────┐
       │ RETRIEVAL EVAL   │            │ ANSWER EVAL      │
       │ recall@k, MRR,   │            │ faithfulness,    │
       │ nDCG, context    │            │ groundedness,    │
       │ precision/recall │            │ answer relevance │
       └────────┬─────────┘            └────────┬─────────┘
                │    (LLM judge, calibrated)    │
                └───────────────┬───────────────┘
                        ┌───────▼────────┐
                        │ CI GATE        │ ← blocks the release
                        │ threshold-based│   if metrics regress
                        └───────┬────────┘
                        ┌───────▼────────┐
                        │ ONLINE EVAL    │ ← samples 1–5% of prod,
                        │ OTel tracing   │   feeds failures back
                        └────────────────┘
What This Means For You
If you are running RAG in production without a versioned golden dataset and a CI gate, you are not measuring quality—you are hoping for it. Three moves close most of the gap:
- Build the golden dataset first. Size it per slice (target ~246 samples per scenario), seed it from real logs, and treat it as a product with an owner.
- Separate retrieval from generation in your metrics. Use recall@k and context recall to catch the chunking failures that IR-only metrics miss, then layer faithfulness and answer relevance on top.
- Make the gate non-optional. Wire threshold-based evaluation into CI so a chunking or embedding change cannot ship if it degrades agreed-upon quality.
The chunking mistake is expensive precisely because it is invisible without measurement. A harness makes it visible—and cheap to fix—before it reaches a user or an auditor.
This framework reflects research and engineering practice by the DSE team across enterprise RAG engagements, including regulated-industry clients where answer quality is a compliance requirement. Metric formulas and threshold guidance synthesize current evaluation research and the open-source tooling landscape; specific thresholds should be calibrated to your corpus, risk profile, and stack. It is offered as a reference method for organizations putting retrieval-augmented generation into production responsibly.