Skip to content
RingMod
← Notes
Field note

Evals are the unit tests of your LLM feature

No evals, no production. An eval harness turns 'it looked fine in the demo' into a scored, repeatable gate. Here's what one is and how to build the first one.

Last updated

An eval harness is the unit-test suite of an LLM feature: a representative dataset, each output scored against a rubric or expected result, run on every change, gating what ships. Without one, quality is judged by whoever last clicked through a demo — and a prompt tweak that quietly degrades a slice of responses ships invisibly. This is bar #2 of The Production-Readiness Bar, and it’s the one teams most often defer and most regret deferring. Here’s what a harness actually is, and how to build the first one.

Why “it looked good in the demo” is not a quality bar

LLM outputs are non-deterministic, so the same prompt can pass on Tuesday and fail on Thursday. That breaks the intuition engineers carry from ordinary software, where a passing manual check usually means a working feature. With an LLM, a manual demo tells you the system can produce a good answer, not that it reliably does — and definitely not that your last prompt change didn’t silently degrade a slice of inputs you didn’t click on. Regressions in LLM features are invisible by default. Evals are how you make them visible before users find them.

What an eval harness actually is

Four parts. If any is missing, it isn’t a harness — it’s a spot check.

PartWhat it isFail signal without it
DatasetRepresentative inputs, including edge cases and known failuresYou test the happy path and ship the rest
ScorerA way to grade each output — deterministic, rubric, judge, or human”Looks fine” is the grade
RunnerExecutes the full set on every change, automaticallyEvals exist but nobody runs them
GateA threshold that blocks a merge/deploy when the score dropsYou measure regressions after shipping them

The gate is the part that makes the other three matter. Evals that produce a number nobody acts on are documentation, not a control.

How to build the first one

  1. Collect real inputs. Pull 30–100 cases from actual or expected usage. Deliberately include the edge cases and failure modes you’re worried about — the adversarial input, the ambiguous request, the one that leaked something last time. A set that only covers the happy path certifies the happy path.
  2. Write the pass criteria. For each case, what makes an output acceptable? Sometimes it’s an exact value; often it’s a rubric (“cites a real source, no PHI, answers the actual question”).
  3. Pick a scoring method per case (see below). Different cases want different graders; don’t force one method on all of them.
  4. Run the whole set and record a score. This is your baseline.
  5. Wire it into CI as a gate. Every prompt change, model swap, or retrieval tweak runs the set; a drop below threshold blocks the change. Now a deploy is a scored decision, not a held breath.

Choosing a scoring method

  • Deterministic checks — for anything you can verify exactly: JSON shape, required fields present, forbidden content absent, a number in range. Cheap, reliable, un-gameable. Use these wherever the answer allows it.
  • Rubric + human review — for the hard, subjective slice where correctness is a judgment call. Slow and expensive, so reserve it for the cases that decide whether you ship.
  • LLM-as-judge — fast and scalable for the subjective middle, but biased, non-deterministic, and gameable, especially by its own model family. Calibrate it against human labels on a sample before trusting it, and never let an uncalibrated judge be your only gate.

Most real harnesses mix all three: deterministic checks for the verifiable, a judge for the scalable middle, humans for the consequential edge.

Common mistakes

  • Evaluating only the happy path. The failures you don’t put in the set are the ones you ship.
  • An LLM judge with no calibration. If you haven’t checked the judge against human labels, you don’t know what its score means.
  • Evals that never gate. A dashboard nobody blocks a deploy on measures regressions instead of preventing them.
  • A static set. Every incident is a missing eval case. Add it, and the same regression can’t ship twice.

Building this well — the dataset, the scorers, the gate wired into a pipeline — is a core part of the production-readiness audit: the assessment scores your current eval coverage against this bar, and the buildout stands the harness up as the gate that keeps regressions out of production. No evals, no production — so this is usually where the work starts.

Questions this raises

Straight answers.

What is an eval harness for an LLM?
It's the unit-test suite of an LLM feature: a representative dataset of inputs, a way to score each output against a rubric or expected result, a runner that executes the whole set on every change, and a threshold that gates what ships. It turns 'it looked good in the demo' into a repeatable, scored pass/fail. Without one, quality is judged by whoever last clicked through the app.
How do I build my first LLM eval set?
Start from real inputs, not invented ones — pull 30–100 representative cases from actual or expected usage, including the edge cases and failure modes you're worried about. Write down the pass criteria for each, pick a scoring method (deterministic check, rubric, LLM-as-judge, or human), run it, and wire the result into CI as a gate. A small, honest set that runs on every change beats a large one that never runs.
Is LLM-as-a-judge reliable for scoring evals?
It's useful but not free. An LLM grader is fast and scales, but it's biased, non-deterministic, and can be gamed by the same model family it's judging. Calibrate it against human labels on a sample before you trust it, use it for the subjective middle, and keep deterministic checks for anything you can verify exactly (JSON shape, required fields, forbidden content). Never let an uncalibrated judge be the only gate.
How many eval examples do I need to start?
Fewer than you think. Thirty to a hundred well-chosen cases that cover your real inputs, edge cases, and known failure modes will catch most regressions and is enough to gate a deploy. Coverage of the ways the feature actually fails matters far more than raw count. Grow the set every time an incident reveals a case you didn't have.

Production-Readiness Audit

This is the work, not just the writeup.

If this is your situation, the production-readiness audit is where it gets fixed — by the person who wrote this.

Request an audit