Stop demoing RAG. Start measuring it.

Every enterprise RAG demo looks impressive. The same RAG, six weeks into production, hallucinates account numbers and tells a customer their refund was processed when it wasn’t.

The difference between the two outcomes is an evaluation harness. Not vibes. Not “we ran 10 queries and it looked good.” A harness.

What an evaluation harness actually is

A harness has four moving parts:

A dataset — typically 100–500 question/expected-answer pairs that represent real user intent, including the awkward edge cases.
A judge — either a model-graded check (LLM-as-judge with strict rubrics) or a programmatic check (string match, regex, schema validation).
A scoring pipeline — runs the dataset through the current system on a schedule, produces a JSON report.
A regression budget — a defined threshold below which the system doesn’t get to ship.

That’s it. The complexity is in the dataset, not the tooling.
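
A minimal sketch of those four parts in Python, assuming a programmatic exact-match judge and an absolute pass threshold. Every name here (EvalRow, exact_match_judge, run_harness, MIN_CORRECTNESS, toy_system) is hypothetical, as are the two sample rows and the 0.90 value.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalRow:
    """Dataset: one question paired with the answer we expect."""
    question: str
    expected: str


def exact_match_judge(answer: str, expected: str) -> bool:
    """Judge: programmatic check, ignoring case and surrounding whitespace."""
    return answer.strip().lower() == expected.strip().lower()


def run_harness(rows: list[EvalRow], answer_fn: Callable[[str], str]) -> dict:
    """Scoring pipeline: run every row through the system and build a report."""
    if not rows:
        raise ValueError("the eval dataset is empty")
    results = []
    for row in rows:
        answer = answer_fn(row.question)
        results.append({
            "question": row.question,
            "answer": answer,
            "passed": exact_match_judge(answer, row.expected),
        })
    correct = sum(r["passed"] for r in results)
    return {"correctness": correct / len(results), "rows": results}


# Regression budget (assumed value): below this score the build does not ship.
MIN_CORRECTNESS = 0.90

# Made-up rows and a stand-in system, for illustration only.
dataset = [
    EvalRow("What is the refund window?", "30 days"),
    EvalRow("Which plan includes SSO?", "Enterprise"),
]


def toy_system(question: str) -> str:
    """Stand-in for the real RAG system under test."""
    return "30 days" if "refund" in question else "Pro"


report = run_harness(dataset, toy_system)
print(json.dumps(report, indent=2))
print("ship" if report["correctness"] >= MIN_CORRECTNESS else "blocked")
```

The tooling really is this small. The two hardcoded rows are the part that takes months to get right.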

The dataset matters most

The single biggest mistake teams make is testing what’s easy to test instead of what matters.

A good RAG eval dataset has at least these slices:

Happy path — the queries the demo team wrote. Should score >95%.
Adjacent intent — same user goal, different phrasing. Where retrieval falls over.
Counter-intent — the exact opposite question, to check the system doesn’t hallucinate to please.
Out-of-domain — questions outside the corpus. The system should say “I don’t know” with high reliability.
Adversarial — prompt injection, jailbreak attempts, off-topic distractions.
Hallucination probes — questions that look answerable but the corpus doesn’t contain the answer. The system should refuse, not invent.

We typically build this dataset incrementally. Every customer support ticket where the AI gave a wrong answer becomes a new eval row. Within three months, the dataset is the most valuable artifact of the project.
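
One way to keep the slices honest is to tag every row and report a pass rate per slice. The sketch below assumes a simple in-memory dataset. The names (Slice, EvalRow, row_from_ticket, score_by_slice), the "REFUSE" marker, and the ticket ID are all invented for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Slice(Enum):
    HAPPY_PATH = "happy_path"
    ADJACENT_INTENT = "adjacent_intent"
    COUNTER_INTENT = "counter_intent"
    OUT_OF_DOMAIN = "out_of_domain"
    ADVERSARIAL = "adversarial"
    HALLUCINATION_PROBE = "hallucination_probe"


@dataclass(frozen=True)
class EvalRow:
    question: str
    expected: str                # expected answer, or "REFUSE" if the system should decline
    slice_tag: Slice
    source: str = "handwritten"  # where the row came from


def row_from_ticket(ticket_id: str, question: str, expected: str, slice_tag: Slice) -> EvalRow:
    """Turn a support ticket where the AI answered wrongly into a new eval row."""
    return EvalRow(question, expected, slice_tag, source=f"ticket:{ticket_id}")


def score_by_slice(rows: list[EvalRow], passed: list[bool]) -> dict:
    """Pass rate per slice, so a strong happy path cannot hide a weak slice."""
    totals, hits = defaultdict(int), defaultdict(int)
    for row, ok in zip(rows, passed):
        totals[row.slice_tag] += 1
        hits[row.slice_tag] += int(ok)
    return {tag.value: hits[tag] / totals[tag] for tag in totals}


rows = [
    EvalRow("How do I reset my password?", "Use the reset link on the login page.", Slice.HAPPY_PATH),
    EvalRow("I can't get into my account", "Use the reset link on the login page.", Slice.ADJACENT_INTENT),
    row_from_ticket("T-1042", "Was my refund processed?", "REFUSE", Slice.HALLUCINATION_PROBE),
]
passed = [True, False, True]  # judge verdicts, hardcoded here for the example
scores = score_by_slice(rows, passed)
print(scores)
```

Keeping the source on each row makes it possible to trace a failing eval back to the ticket that motivated it.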

What we check on every run

For each row, the harness produces:

Answer correctness — does the answer match the expected ground truth? (LLM-judge with a strict rubric, plus exact-match for structured fields.)
Citation grounding — every claim in the answer must point to a retrieved chunk. We mark hallucinations.
Refusal correctness — did the system refuse when it should have? Did it refuse when it shouldn’t have?
Latency — p50, p90, p99 on retrieval and generation separately.
Cost — tokens in, tokens out, retrieval calls.
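
As a rough sketch, the refusal, latency, and cost checks can be aggregated from per-row records like this. The RowResult fields and the synthetic data are assumptions, and the percentile uses the nearest-rank definition, which is one of several in common use.

```python
from dataclasses import dataclass


@dataclass
class RowResult:
    """Per-row measurements; the field names are illustrative."""
    should_refuse: bool
    did_refuse: bool
    retrieval_ms: float
    generation_ms: float
    tokens_in: int
    tokens_out: int
    retrieval_calls: int = 1


def percentile(values: list[float], p: int) -> float:
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, (p * len(ordered) + 99) // 100)  # integer ceil(p * n / 100)
    return ordered[rank - 1]


def latency_summary(values: list[float]) -> dict:
    return {f"p{p}": percentile(values, p) for p in (50, 90, 99)}


def summarize(results: list[RowResult]) -> dict:
    return {
        # Refused when it should have answered.
        "false_refusals": sum(r.did_refuse and not r.should_refuse for r in results),
        # Answered when it should have refused.
        "missed_refusals": sum(r.should_refuse and not r.did_refuse for r in results),
        # Retrieval and generation are reported separately.
        "retrieval_ms": latency_summary([r.retrieval_ms for r in results]),
        "generation_ms": latency_summary([r.generation_ms for r in results]),
        "tokens_in": sum(r.tokens_in for r in results),
        "tokens_out": sum(r.tokens_out for r in results),
        "retrieval_calls": sum(r.retrieval_calls for r in results),
    }


# Synthetic data purely for illustration: 100 rows with latencies of 1..100 ms.
results = [
    RowResult(
        should_refuse=(i % 10 == 0),
        did_refuse=(i % 20 == 0 or i == 1),
        retrieval_ms=float(i),
        generation_ms=float(10 * i),
        tokens_in=1000,
        tokens_out=100,
    )
    for i in range(1, 101)
]
summary = summarize(results)
print(summary)
```

Counting the two refusal errors separately matters because they fail in opposite directions: a missed refusal is a hallucination risk, a false refusal is a usability problem.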

The harness emits a JSON report. The regression budget is hardcoded: drop more than 3 percentage points on overall correctness, you don’t ship.
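
A sketch of that gate, assuming the drop is measured against a stored baseline report (for example, the last run that shipped; the text does not say which). The report shape, the check_budget name, and the sample scores are hypothetical.

```python
import json

MAX_DROP_POINTS = 3.0  # from the text: more than 3 percentage points blocks the ship


def check_budget(baseline: dict, current: dict, max_drop: float = MAX_DROP_POINTS):
    """Return (ok, drop), where drop is in percentage points and positive means worse."""
    drop = (baseline["correctness"] - current["correctness"]) * 100
    drop = round(drop, 6)  # avoid float noise such as 3.0000000000000027
    return drop <= max_drop, drop


# Made-up reports; in practice these would be the JSON files the harness emits.
baseline = json.loads('{"correctness": 0.92}')
current = json.loads('{"correctness": 0.88}')

ok, drop = check_budget(baseline, current)
print(f"correctness dropped {drop} points: {'ship' if ok else 'blocked'}")
```

In CI, the same check would typically end with a non-zero exit code when ok is false, so the pipeline fails without anyone having to read the report.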

Where we run it

The harness runs in three places:

One: on every prompt change. Engineer pushes a new system prompt or retriever config; CI runs the harness against a 50-row smoke set; PR review sees the delta.

Two: nightly against the full set. Same code, same data, different day. Catches drift in the model provider’s responses.

Three: on a weekly schedule against real production traffic. A subset of real production conversations is mirrored to a shadow harness, scored, and reported. This catches the drift the dev dataset misses.
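
Both the smoke set and the production mirror need a selection rule. One possible approach, sketched below, is deterministic hashing, so the same rows and conversations are picked on every run. The function names, the ID formats, and the 5% mirror rate are assumptions, not details from the setup described above.

```python
import hashlib


def stable_hash(key: str) -> int:
    """Hash a string to the same integer in every process (unlike built-in hash())."""
    return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)


def smoke_set(row_ids: list[str], size: int = 50) -> list[str]:
    """Pick the same `size` rows on every CI run so PR deltas stay comparable."""
    return sorted(row_ids, key=stable_hash)[:size]


def mirror_to_shadow(conversation_id: str, rate: float = 0.05) -> bool:
    """Decide whether a production conversation is copied to the shadow harness."""
    return stable_hash(conversation_id) % 10_000 < round(rate * 10_000)


# Illustrative IDs only.
all_ids = [f"row-{i}" for i in range(300)]
smoke = smoke_set(all_ids)
sampled = [c for c in (f"conv-{i}" for i in range(2000)) if mirror_to_shadow(c)]
print(len(smoke), "smoke rows;", len(sampled), "of 2000 conversations mirrored")
```

A hash-ordered smoke set is arbitrary rather than representative. If the 50 rows should cover every slice, picking a fixed number per slice would be a reasonable variation.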

What we don’t do

We don’t trust LLM-as-judge for anything safety-critical without a programmatic check beside it. A judge model tends to share the generator’s blind spots, so their errors are correlated: two GPT-4 instances can confidently hallucinate the same way.
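
A sketch of what "a programmatic check beside it" can look like, using the hallucinated account number case from the opening. The account-number pattern (8 to 12 digits), the function names, and the always-approving stand-in judge are assumptions for illustration.

```python
import re
from typing import Callable

ACCOUNT_RE = re.compile(r"\b\d{8,12}\b")  # assumed account-number format


def unsupported_account_numbers(answer: str, context: str) -> set[str]:
    """Account numbers that appear in the answer but nowhere in the retrieved context."""
    return set(ACCOUNT_RE.findall(answer)) - set(ACCOUNT_RE.findall(context))


def safety_verdict(answer: str, context: str, llm_judge: Callable[[str, str], bool]) -> bool:
    """Pass only if the programmatic check and the model-graded check both pass."""
    if unsupported_account_numbers(answer, context):
        return False  # hard veto: the judge's opinion is not consulted
    return llm_judge(answer, context)


def lenient_judge(answer: str, context: str) -> bool:
    """Stand-in for an LLM judge that shares the generator's blind spot."""
    return True


context = "Account 1234567890 has a pending refund."
good = "The refund on account 1234567890 is still pending."
bad = "The refund on account 1234567899 was processed."

print(safety_verdict(good, context, lenient_judge))  # True
print(safety_verdict(bad, context, lenient_judge))   # False
```

The regex cannot tell whether the answer is right. It can only guarantee that an invented account number never passes, whatever the judge thinks.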

We don’t ship to production if the harness hasn’t been run on a representative dataset. “It worked in the demo” is not a launch criterion.

We don’t run the eval once before launch and forget it. Model providers update weights. Retrieval corpora drift. The eval is a continuous integration test, not a one-time release gate.

The shape of a production-grade RAG system

The pattern is:

question
  → retrieval (with metadata filters, hybrid search, reranking)
  → context assembly (with citation IDs preserved)
  → generation (with strict system prompt, low temp, structured output where possible)
  → grounding check (every claim → citation, refuse if any unsourced)
  → response with inline citations
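
The chain above can be sketched end to end with toy components. Everything here is a stand-in: the keyword retriever replaces hybrid search and reranking, the generators replace the LLM call, and the one-claim-per-line convention is an assumption that keeps the grounding check simple.

```python
import re
from dataclasses import dataclass
from typing import Callable

REFUSAL = "I don't know."
CITATION_RE = re.compile(r"\[([\w-]+)\]")


@dataclass
class Chunk:
    chunk_id: str
    text: str


def words(text: str) -> set[str]:
    return set(re.findall(r"\w+", text.lower()))


def retrieve(question: str, corpus: list[Chunk], k: int = 2) -> list[Chunk]:
    """Toy keyword-overlap retriever; returns at most k chunks that share a word."""
    scored = [(len(words(question) & words(c.text)), c) for c in corpus]
    ranked = sorted(scored, key=lambda pair: -pair[0])
    return [c for score, c in ranked if score > 0][:k]


def assemble_context(chunks: list[Chunk]) -> str:
    """Context assembly with citation IDs preserved."""
    return "\n".join(f"[{c.chunk_id}] {c.text}" for c in chunks)


def is_grounded(answer: str, chunks: list[Chunk]) -> bool:
    """Every claim (one per line) must cite at least one retrieved chunk ID.

    This checks that citations exist and are valid, not that the chunk
    actually supports the claim; that needs a judge or an entailment check.
    """
    known = {c.chunk_id for c in chunks}
    claims = [line.strip() for line in answer.splitlines() if line.strip()]
    if not claims:
        return False
    for claim in claims:
        cited = set(CITATION_RE.findall(claim))
        if not cited or not cited <= known:
            return False
    return True


def answer_question(question: str, corpus: list[Chunk],
                    generate: Callable[[str, str], str]) -> str:
    chunks = retrieve(question, corpus)
    if not chunks:
        return REFUSAL  # nothing retrieved: refuse instead of inventing
    draft = generate(question, assemble_context(chunks))
    return draft if is_grounded(draft, chunks) else REFUSAL


corpus = [
    Chunk("kb-1", "Refunds are issued within 5 business days."),
    Chunk("kb-2", "Enterprise plans include SSO."),
]


def cited_generator(question: str, context: str) -> str:
    return "Refunds are issued within 5 business days. [kb-1]"


def sloppy_generator(question: str, context: str) -> str:
    return "Refunds are issued within 5 business days. [kb-1]\nYour refund was processed yesterday."


print(answer_question("How long do refunds take?", corpus, cited_generator))
print(answer_question("How long do refunds take?", corpus, sloppy_generator))
print(answer_question("What is the weather?", corpus, cited_generator))
```

The second and third calls both return the refusal: one because a claim has no citation, the other because nothing relevant was retrieved.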

The eval harness is what tells you each link in that chain is working. Without it, you’re shipping a demo with extra steps.