Eval harnesses for chatbots: catch regressions before customers do
Production chatbots degrade in three predictable ways
First, the underlying model is updated by the vendor — sometimes silently, sometimes announced — and a behavior the prompt relied on changes. Second, the knowledge base drifts as documents are added, edited, or removed, and retrieval starts surfacing different chunks for the same questions. Third, what customers ask evolves as the product changes, the seasons turn, or competitors shift the market.
All three are invisible from inside the system. The bot keeps responding. CSAT scores arrive a quarter later. By the time someone sees the trend, three releases have shipped on top of the regression and root-cause analysis is archaeological. The eval harness is what makes this loop visible in real time.
A graded conversation set is the spine of the eval
The core eval is a curated set of representative customer conversations, each labeled with the correct outcome — first-contact resolution, escalation to a specific team, refusal with routing — and the supporting facts the bot's answer should reference. We build this from real production conversations selected by the support team, not by the AI vendor. The team that owns customer outcomes owns the eval definition of correct.
Each conversation is replayed against the candidate model and prompt configuration. The harness scores: did the bot reach the correct outcome, did it cite the right sources, did it stay on topic, did it surface the right next action. We track per-category scores so a regression in 'billing disputes' is visible even when overall accuracy is flat.
- Eval suite size: ~600 graded conversations + 200 refusal cases
- Run cadence: every PR, CI-gated, 8–12 min
- Regression threshold: ~2% on a critical category fails the build
- Eval coverage of prod traffic: >80% category match
Refusal evals are how you catch the chatbot becoming over-helpful
A separate suite covers cases the bot should decline: out-of-scope questions, requests for advice the brand does not give, jailbreak-style prompts, queries that should escalate. We will not ship a release that drops below 95% on the refusal suite. The reason: a model upgrade can make the bot suddenly willing to give advice it had been correctly declining, and the regression looks like 'helpful' in casual testing.
We also test for the inverse — false refusals where the bot declines questions it should answer. Both directions are tracked because over-refusal looks safe but tanks customer satisfaction in production. The eval set is curated by both the support team and the legal/compliance team.
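A sketch of how both directions might be tracked, with a ship gate on the refusal side; the run_candidate callable and the is_refusal flag are placeholders for however the harness actually detects a decline:

```python
def refusal_scores(should_refuse, should_answer, run_candidate) -> dict[str, float]:
    """Score both directions of the refusal suite.

    `should_refuse` and `should_answer` are lists of conversations; the reply's
    `is_refusal` flag is an assumed stand-in for refusal detection.
    """
    refusals = [run_candidate(c)["is_refusal"] for c in should_refuse]
    false_refusals = [run_candidate(c)["is_refusal"] for c in should_answer]
    return {
        "refusal_accuracy": sum(refusals) / len(refusals),
        "false_refusal_rate": sum(false_refusals) / len(false_refusals),
    }

def refusal_gate_passes(scores: dict[str, float], minimum: float = 0.95) -> bool:
    # The release blocks when refusal accuracy drops below the threshold;
    # the false-refusal rate is tracked and reviewed alongside it.
    return scores["refusal_accuracy"] >= minimum
```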
Retrieval quality is the silent contributor to chatbot regressions
Most eval harnesses focus on the model's output and ignore retrieval. That misses the largest source of degradation: retrieval started returning different chunks. We score retrieval separately: for each eval question, does retrieval surface the document chunk that contains the correct answer within the top-5, top-10, and top-20 results.
Retrieval-only failures are common and masquerade as model failures, because a model fed the wrong chunks gives a vague answer. The eval distinguishes them, and a retrieval-only regression is fixed at the index layer — chunking, embedding, reranking — not at the model layer. Without that separation, the team chases the wrong fix for weeks.
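A retrieval-only score can be as simple as recall-at-k over labeled gold chunks. In this sketch, the retrieve callable and the gold_chunk_id field are assumptions standing in for whatever the index and the eval records actually expose:

```python
def recall_at_k(questions, retrieve, k_values=(5, 10, 20)) -> dict[int, float]:
    """For each eval question, check whether retrieval surfaces the chunk
    containing the correct answer within the top-k results.

    `retrieve(text)` is a placeholder returning ranked chunk IDs; each question
    is assumed to carry the ID of its gold chunk under `gold_chunk_id`.
    """
    hits = {k: 0 for k in k_values}
    for question in questions:
        ranked = retrieve(question["text"])
        for k in k_values:
            if question["gold_chunk_id"] in ranked[:k]:
                hits[k] += 1
    return {k: hits[k] / len(questions) for k in k_values}
```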
LLM-as-judge is useful but only with calibration discipline
Grading evals manually doesn't scale past a few hundred cases. We use LLM-as-judge for the bulk of the grading — a scoring model evaluates whether the candidate response satisfies the labeled criteria. The risk is that the judge model has its own biases, drifts when upgraded, and can be miscalibrated against ground truth.
We calibrate the judge against a human-graded subset every release. If the judge agrees with humans on at least 92% of the calibration set, the judge's scores on the full eval are trusted. If agreement drops, the judge is retrained or replaced before any model decisions are made. Treating the judge as ground truth without calibration is the most common eval-harness failure we see in audits.
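The calibration check itself is small. In this sketch, judge_grade and the human_pass label are placeholders for the scoring-model call and the human grade on each calibration case:

```python
def judge_agreement(calibration_set, judge_grade) -> float:
    """Fraction of the human-graded calibration subset where the judge's
    pass/fail verdict matches the human label."""
    matches = sum(judge_grade(case) == case["human_pass"] for case in calibration_set)
    return matches / len(calibration_set)

def judge_is_trusted(calibration_set, judge_grade, threshold: float = 0.92) -> bool:
    # Below the agreement threshold, the judge's full-suite scores are not used
    # for ship decisions until it has been recalibrated or replaced.
    return judge_agreement(calibration_set, judge_grade) >= threshold
```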
Production-comparison evals catch what the offline harness misses
Offline eval suites are necessary but never sufficient. We also run a continuous online eval where a small percentage of production traffic gets a parallel response from a candidate model, and the responses are compared by the judge model and by sampled human grading. Differences between the candidate and production responses are surfaced as candidate-only behaviors and reviewed as potential regressions.
This catches phenomena the offline suite misses: weird customer phrasings, edge-case retrievals, novel topics the eval set hasn't been updated to cover. The eval set itself gets updated from these findings on a monthly cadence. The whole loop stays in motion.
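In outline, the shadow path might look like the sketch below, where run_production, run_candidate, and compare (the judge, with sampled human grading downstream) stand in for the deployed components; the customer always receives the production answer:

```python
import random

SHADOW_LOG: list[dict] = []  # in practice this would feed the eval store

def maybe_shadow(conversation, run_production, run_candidate, compare,
                 sample_rate: float = 0.02):
    """Serve the production response; for a small sample of traffic, also
    generate a candidate response and record the comparison."""
    prod_reply = run_production(conversation)
    if random.random() < sample_rate:
        cand_reply = run_candidate(conversation)
        SHADOW_LOG.append({
            "conversation": conversation,
            "verdict": compare(conversation, prod_reply, cand_reply),
        })
    return prod_reply  # the customer only ever sees the production answer
```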
The eval suite has to live with the support team, not the AI team
When the AI team owns the definition of correct, evals become an exercise in self-grading. The right structure: support team defines the cases and the correct outcomes, AI team builds the harness, both review the report on every release. The decision to ship a regression is a joint one, not an AI-team-internal one.
The cultural side of this matters more than the engineering side. A chatbot deployment that doesn't have a customer-outcome team co-owning the eval will degrade. We have seen it consistently across deployments. The harness is half the work; the ownership is the other half.
We caught a regression on a Tuesday afternoon that would have shipped Friday and tanked refund-policy answers all weekend. The new model was technically smarter and got worse on the one thing we cared about. The eval told us. Without it, we would have found out from Monday morning escalations.
— Head of CX, fintech client
Frequently asked
What does a chatbot eval harness actually test?
A graded conversation set scores correctness on representative customer conversations, with per-category breakdowns. A refusal suite scores both refusal accuracy on cases the bot should decline and false-refusal rate on cases it should answer. A retrieval-quality suite scores whether retrieval surfaces the correct document chunks in the top-K. All three run on every PR before deployment.
How often should evals run?
On every prompt change, model version change, knowledge-base update, and retrieval-configuration change. In practice, that means every PR. The full suite needs to complete in under 10–15 minutes to fit a CI gate. Releases that fail any of the configured thresholds — including category-level regressions — block automatically and require an explicit override.
Who owns the eval set definition?
The team that owns customer outcomes — usually customer support, sometimes paired with legal or compliance. Not the AI team. The AI team owns the harness; the customer-outcome team owns the cases and the definition of correct. When ownership inverts, evals become self-grading and lose their purpose. The cultural structure is as important as the engineering.
Can LLM-as-judge be trusted for eval grading?
Only with calibration discipline. The judge model is calibrated against a human-graded subset every release; agreement under 92% means the judge is retrained or replaced before any decisions are made. Treating the judge as ground truth without calibration is the most common eval-harness failure we see in audits. The judge is a useful tool, not a replacement for ground truth.
Why are retrieval evals separate from model evals?
Because retrieval failures and model failures need different fixes. A retrieval-only regression is fixed at chunking, embedding, or reranking — not at the model layer. Without separation, the team can chase the wrong fix for weeks while the actual problem is that retrieval started surfacing different documents. Scoring retrieval independently makes the diagnosis straightforward.
What is a continuous online eval?
A small percentage of production traffic gets a parallel response from a candidate model. The candidate's response is compared to the production response by the judge model and by sampled human grading. Differences are surfaced as candidate-only behaviors, which catches phenomena the offline suite misses — novel customer phrasings, edge-case retrievals, topics the eval hasn't been updated to cover. Findings update the offline eval set monthly.