You don't read AI-generated code. Why are you listening to every call?

Deepesh Jayal

May 2026

14 min read

Eval-driven development is what TDD became when humans stopped reading every line of code. Test-driven development was a "best practice" for twenty years. Most teams ignored it. Production stayed mostly stable because senior engineers actually read every pull request, line by line, and caught what the tests would have. Then AI started writing measurable fractions of new code, and the line-by-line review model quietly collapsed. The test suite became the only contract that survived the change. TDD didn't win on merit — it became mandatory the moment human reading dropped out of the loop.

Voice agents are at the same inflection point, one substrate over. Nobody listens to every call. The transcript is the equivalent of the code review — technically inspectable, structurally unscalable. The eval suite is what stays when human listening drops out. The teams that win the voice agent market won't be the ones with the best demos. They will be the ones who compressed evaluation into the development cycle. Eval-driven development is not a methodology choice. It is the only review surface left.

How TDD actually became mandatory

The story of TDD as it gets retold is that engineers gradually came around to writing tests because it was the right thing to do. The real story is less flattering. TDD has existed since the late 1990s. Through the 2000s and 2010s it was a flag senior engineers waved at code reviews while the rest of the team rolled their eyes. Coverage hovered around 30%. Production stayed roughly stable because someone on the team actually read every diff before it merged. The tests were a safety net. The reviewer was the floor.

Two compounding shifts broke that model. Codebases grew past the point where any one person could hold the full surface area in working memory. And AI assistants started generating chunks of code that humans approved without scrutinising the way they would have scrutinised a colleague's pull request. By the time large parts of new commits were AI-authored, the reviewer floor was gone. The tests had to be the check.

That shift is what made TDD load-bearing. The test stopped being a check on the code and became the only contract that survived the lifecycle. The code might be generated, regenerated, refactored by another agent, and shipped, but if the test passed, the behaviour was preserved. Teams that had taken testing seriously the whole time were already operating in the new regime. The teams that had not were forced to retrofit a decade of discipline in twelve months.

Eval-driven development is the same shift, one substrate over. The substrate is probabilistic output instead of deterministic code, and the consumption channel is audio instead of text — but the structural move is identical. When you stop being able to inspect every output, the evaluation suite becomes the only contract left.

Why voice agents are pre-TDD on this axis

Most voice agent teams in 2026 are operating the way text-agent teams operated in 2022. They ship, they monitor, they react to user complaints, they patch. The demos are great. The launch goes well. Production traffic exposes failure patterns no manual test pass had surfaced — accents the team didn't think to try, interruption timing that breaks the turn detector, latency cliffs under load. The team listens to a handful of failed calls, fixes the obvious cases, ships again, and repeats.

This is the equivalent of a 2018 codebase with 30% test coverage shipping to a million daily users. It works until it doesn't, and when it stops working, the team has no instrumentation to know why. The transcript shows a successful conversation. The metrics dashboard shows uptime. The user hung up at second 12 and never came back.

The compression event is already happening for voice; it just hasn't been named yet. Call volume per agent is rising. Listening to every call is no longer humanly possible at most production scales. The teams that win this space will be the ones who decided early that the eval suite was the review surface. Production AI is unforgiving in ways staging never is, and AI failure modes in voice show up as silent quality drift, not loud errors. Evaluation in AI systems has to be the gate, not the audit.

What eval-driven development actually is

EDD is the discipline of treating evaluations as the working specification: written first, enforced on every change, gated before production. The eval is not a test you run after the code works. The eval is the artefact that defines what "works" means. Implementation is the search for a configuration that satisfies the eval. The shift from ad hoc AI test runs to systematic AI evaluation is what makes this a discipline.

There are three things EDD is not, and they get confused with it constantly. EDD is not MLOps. MLOps is about deployment infrastructure for ML systems; the evaluation of AI output quality is a different discipline. You can run perfect MLOps and ship terrible AI because nothing in the deployment pipeline asked whether the output was correct. EDD is not prompt engineering. Iterating on prompts without an oracle is guessing: you change a word, you read three sample outputs, you decide it feels better, you ship. EDD requires the oracle to exist before you start iterating. And EDD is not LLM-as-judge alone. A judge without a calibrated rubric is theatre. The judge produces a number; the number means nothing unless the rubric was specified independently and validated against human judgement first. Real AI agent evaluation needs the rubric, the conditions, and the threshold defined before the first scored run.

Prior work on EDD is worth citing. Chip Huyen's AI Engineering lays out the discipline in detail for text systems. Braintrust's writing on eval-driven development names the methodology and walks through implementation. Google's web.dev guide frames EDD as TDD adapted for AI uncertainty. All of these focus on text. What follows extends that work into voice — where the discipline has to get stricter, not looser.

TDD and EDD side by side

The two disciplines share a structural backbone and diverge on three substantive dimensions. The familiar rows are not where teams get into trouble. The unfamiliar rows are.

Dimension	TDD	EDD
What you write first	The test (the spec)	The eval (the spec)
What "pass" means	Binary — output equals expected	Statistical — N runs above threshold
What you test against	Deterministic function	Probabilistic pipeline
When the test fails	Code is wrong, fix it	Could be code, model, prompt, data, or condition
Regression trigger	Code change	Prompt, model, data, voice config, telephony provider, codec
Cost of skipping	Bugs surface in production	Silent quality drift, invisible failures

The first three rows look like a direct port. The last three are where teams importing TDD habits into AI development fall over. A failing TDD test points to a single fixable thing. A failing eval has five candidate root causes (code, model, prompt, data, or condition) and no compiler error to narrow them down. The cost of skipping unit tests is bugs you'll see; the cost of skipping evals is bugs you won't.

Why voice EDD is strictly harder than text EDD

Voice is not "EDD with extra steps." It is EDD on a substrate where the failure surface is structurally larger and the cost of skipping is non-recoverable. Five reasons, each of which would be enough on its own to argue for tighter discipline.

The eval target is a pipeline, not a function

Text agent EDD has one transformation under test: input prompt → LLM → output. One function. Voice agent EDD has five functions in series: audio → voice activity detector → speech-to-text → LLM → text-to-speech → audio. Each stage has its own failure distribution. Failures compound non-linearly: a 5% degradation at each stage compounds to roughly 23% end-to-end, and interaction effects are worse because errors at early stages mislead later stages. EDD on voice has to evaluate the system, not the model.
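The compounding figure is easy to check. The sketch below assumes each stage independently retains 95% of its quality and that the retained fractions multiply; the function name is illustrative, and the interaction effects mentioned above are deliberately ignored.

```python
def end_to_end_degradation(per_stage_loss: float, stages: int) -> float:
    """Fraction of quality lost end-to-end when every stage independently
    loses `per_stage_loss` and the retained fractions multiply."""
    retained = (1.0 - per_stage_loss) ** stages
    return 1.0 - retained


# Five stages in series, 5% loss at each one.
loss = end_to_end_degradation(0.05, 5)
print(f"{loss:.1%}")  # 22.6%
```

Because this treats stages as independent, it is a lower bound on the pipeline effect described here, not a model of it.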

The input space is acoustic, not textual

Text eval sets can be curated. You write 200 representative prompts, you run them through the system, you score them. Voice eval sets cannot be curated the same way because the failure modes live in acoustic conditions text cannot represent — background noise, accent variation, speaking pace, interruption timing, PSTN codecs, packet loss profile. The eval set has to be generated across a condition matrix instead of fixed. Five caller profiles times five scenarios is 25 cells; three runs per cell is 75 calls. That is the minimum eval set for a single regression check, and the conditions matter as much as the prompts.
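A condition matrix of this shape can be enumerated mechanically. The following is a minimal sketch, assuming placeholder scenario and profile names (none of them come from a real tool) and three runs per cell as in the arithmetic above.

```python
from itertools import product

# Placeholder names; a real suite would use its own scenario and profile IDs.
SCENARIOS = [f"scenario_{i}" for i in range(1, 6)]
PROFILES = [f"profile_{i}" for i in range(1, 6)]
RUNS_PER_CELL = 3


def build_call_plan(scenarios, profiles, runs_per_cell):
    """Return one (scenario, profile, run_index) tuple per synthetic call."""
    return [
        (scenario, profile, run)
        for scenario, profile in product(scenarios, profiles)
        for run in range(1, runs_per_cell + 1)
    ]


plan = build_call_plan(SCENARIOS, PROFILES, RUNS_PER_CELL)
print(len(SCENARIOS) * len(PROFILES), len(plan))  # 25 75
```

Generating the plan from the two lists, instead of hand-writing calls, means adding a profile grows the matrix automatically.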

Transcripts are lossy. Outcomes are the eval target.

The transcript is what the ASR thought the caller said. It strips out latency, prosody, audio quality, barge-in behaviour, and downstream tool state. An LLM-as-judge scoring a transcript can return 95% quality for a call where the caller hung up frustrated at second 12. This is the structural reason transcript-only evaluation fails for voice — covered in detail in LLM-as-judge limits for voice agents. Voice EDD has to score on what users experienced, not what was said.

Reliability is statistical, not binary

Run the same scenario × profile through a voice pipeline twice and you can get two different outcomes. Pipeline interactions are non-deterministic: a noise spike at the wrong moment changes ASR confidence, an interruption at the wrong word breaks turn detection, a long LLM token sequence pushes total latency past the engagement threshold. Voice EDD requires N-run reliability scoring — the same eval run 3, 5, or 10 times, with a pass threshold on the success rate. This is what scenario success rate captures. It changes threshold setting, version comparison, and regression suite runtime.
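N-run scoring reduces to a success rate per scenario × profile cell, compared against a threshold. This is a minimal sketch: the function names and the 80% threshold are assumptions for illustration, and each run is assumed to have already been judged pass or fail.

```python
def scenario_success_rate(outcomes: list[bool]) -> float:
    """Percentage of runs that met the scenario's success conditions."""
    if not outcomes:
        raise ValueError("need at least one run")
    return 100.0 * sum(outcomes) / len(outcomes)


def cell_passes(outcomes: list[bool], threshold_pct: float) -> bool:
    """A cell passes when its success rate meets the threshold."""
    return scenario_success_rate(outcomes) >= threshold_pct


# Five runs of one scenario x profile cell; one run failed.
runs = [True, True, False, True, True]
print(scenario_success_rate(runs))            # 80.0
print(cell_passes(runs, threshold_pct=80.0))  # True
```

With only three to ten runs per cell, a single failure moves the rate by 10 to 33 points, which is one reason threshold setting and version comparison need care.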

Cost asymmetry is brutal

A text agent failure makes a user retry. A voice agent failure makes them hang up, not call back, and tell someone. Production stakes are categorically different. The cost of a missed eval in a voice agent shipped into a healthcare or banking context is a regulatory escalation, a churned customer, or a viral complaint thread. Voice EDD has to be tighter than text EDD, not looser — the loss function is asymmetric in a way text doesn't share. This is exactly why voice agents fail in production at rates that surprise teams who tested them as text agents.

A reference architecture for voice EDD

What does eval-driven development for voice agents look like as an actual development workflow? Walk it through with a concrete case — a team building an outbound healthcare appointment-booking agent. The eval artefacts get built in this order, before most of the implementation work happens.

Stage 1 — Scenarios are the working spec. Before any prompt is written, the team defines the scenarios the agent must handle end-to-end. New patient booking. Rescheduling. Cancellation. Insurance verification. Each scenario has explicit success conditions — not "the conversation completed" but "the appointment was created with the correct date, time, provider, and confirmation reference." The scenario set is the spec. Voice agent evaluation differs from text evals here: every scenario carries acoustic and behavioural conditions attached to it.
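One way to make a success condition explicit is to check the record the booking system ends up holding, not the conversation. The sketch below assumes a hypothetical record shape (a plain dict with a `confirmation_ref` key); the field names and values are illustrative only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BookingScenario:
    name: str
    expected: dict  # fields the created appointment must contain


def scenario_succeeded(scenario: BookingScenario, record: Optional[dict]) -> bool:
    """Success means the appointment exists, carries a confirmation reference,
    and matches every expected field. A completed call is not enough."""
    if record is None or not record.get("confirmation_ref"):
        return False
    return all(record.get(key) == value for key, value in scenario.expected.items())


new_patient = BookingScenario(
    name="new_patient_booking",
    expected={"date": "2026-06-02", "time": "10:30", "provider": "Dr. Rao"},
)
record = {"date": "2026-06-02", "time": "10:30", "provider": "Dr. Rao",
          "confirmation_ref": "A1B2"}

print(scenario_succeeded(new_patient, record))                       # True
print(scenario_succeeded(new_patient, {**record, "time": "11:30"}))  # False
print(scenario_succeeded(new_patient, None))                         # False
```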

Stage 2 — Caller profiles are the condition matrix. Each scenario gets run against multiple synthetic callers — a clean baseline, an accent variant representative of the patient population, a fast-pace impatient caller, a caller from a noisy environment, an elderly caller speaking slowly. Eight behavioural parameters define each profile: accent, speech pace, background noise, interruption level, latency, language, emotional register, voice gender. The matrix is scenarios times profiles. For a five-scenario, five-profile setup, that is 25 cells.
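The eight parameters map naturally onto a small record type. This is a sketch only: the field types, units, and example values are assumptions, since the text names the parameters but not how they are encoded.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class CallerProfile:
    accent: str
    speech_pace_wpm: int      # assumed unit: words per minute
    background_noise: str     # assumed labels such as "none" or "cafe"
    interruption_level: str
    latency_ms: int           # assumed unit: added network latency
    language: str
    emotional_register: str
    voice_gender: str


baseline = CallerProfile(
    accent="neutral", speech_pace_wpm=150, background_noise="none",
    interruption_level="none", latency_ms=50, language="en",
    emotional_register="calm", voice_gender="female",
)

# Derive a variant by changing only the conditions under test.
noisy_impatient = replace(
    baseline, speech_pace_wpm=190, background_noise="cafe",
    interruption_level="high", emotional_register="impatient",
)

print(len(fields(CallerProfile)))  # 8
```

Deriving variants from a baseline keeps each profile's differences visible, which helps when a regression shows up in only one cell of the matrix.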

Stage 3 — Metrics with explicit thresholds. Telemetry metrics are measured automatically from call data: response latency P95, call duration, silence ratio, interruption count. LLM-based metrics are scored against a rubric: tone consistency on a 1–5 scale with threshold ≥ 4, instruction adherence binary pass, knowledge accuracy binary pass. Every metric has a threshold. "Sounds professional" is not a metric. "Tone consistency ≥ 4 on the defined rubric" is. Good metrics evaluation also verifies downstream state: whether the appointment was actually created in the booking system, not just whether the agent said it was.
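A threshold table can be as simple as a mapping from metric name to a pass check. In this sketch the tone threshold and binary checks follow the text, while the 1200 ms latency budget, the metric key names, and the sample values are assumptions.

```python
def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


# Each check returns True when the metric meets its threshold.
THRESHOLDS = {
    "latency_p95_ms": lambda v: v <= 1200,         # hypothetical budget
    "tone_consistency": lambda v: v >= 4,          # 1-5 rubric
    "instruction_adherence": lambda v: v is True,  # binary pass
    "knowledge_accuracy": lambda v: v is True,     # binary pass
    "appointment_created": lambda v: v is True,    # downstream state check
}


def score_metrics(metrics: dict) -> dict:
    """Map each metric name to pass/fail; a missing metric raises KeyError."""
    return {name: check(metrics[name]) for name, check in THRESHOLDS.items()}


latencies_ms = [640, 700, 720, 750, 800, 820, 900, 950, 1000, 1500]
results = score_metrics({
    "latency_p95_ms": p95(latencies_ms),
    "tone_consistency": 4,
    "instruction_adherence": True,
    "knowledge_accuracy": True,
    "appointment_created": True,
})
print([name for name, ok in results.items() if not ok])  # ['latency_p95_ms']
```

Raising on a missing metric is a deliberate choice here: a metric that silently fails to report should not count as a pass.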

Stage 4 — Build the agent until evals pass. Implementation iteration is bounded by the eval suite. Prompt changes, model swaps, retrieval tweaks, voice configuration adjustments all run through the eval matrix before they are kept. The eval scores are the optimisation target. This is the inversion that makes the workflow eval-driven rather than eval-adjacent — the suite gates iteration, not just release.

Stage 5 — Regression gating on every change. Once the agent passes a baseline, every subsequent change runs the full matrix. A drop of more than 5 points in scenario success rate (SSR) on any critical scenario is a regression. This is TDD's red-green-refactor loop adapted for probabilistic systems: make a change, run the suite, accept or revert. Regression testing under EDD is not a phase; it is the iteration loop itself.
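The gate itself is a comparison of per-scenario success rates between a baseline and a candidate. A minimal sketch follows, assuming scenario success rate (SSR) is stored as a percentage per scenario; the scenario names and scores are made up for illustration.

```python
def find_regressions(baseline: dict, candidate: dict,
                     critical: set, max_drop: float = 5.0) -> dict:
    """Return critical scenarios whose SSR fell by more than `max_drop` points."""
    regressions = {}
    for scenario in sorted(critical):
        drop = baseline[scenario] - candidate[scenario]
        if drop > max_drop:
            regressions[scenario] = drop
    return regressions


baseline = {"new_booking": 93.0, "reschedule": 90.0, "cancellation": 96.0}
candidate = {"new_booking": 92.0, "reschedule": 83.0, "cancellation": 91.0}

regressions = find_regressions(baseline, candidate, critical=set(baseline))
print(regressions)  # {'reschedule': 7.0}
print("revert" if regressions else "accept")  # revert
```

Note that the cancellation scenario dropped exactly 5 points and is not flagged, because the rule is a drop greater than 5.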

Stage 6 — Continuous evaluation in production. Pre-deployment evaluation is necessary but not sufficient. The same scenario suite runs against production traffic patterns after launch, watching for behavioural drift, emerging failures, and acoustic shifts. The development eval and the production monitoring loop share the same scenario library — one runs on synthetic callers, one on patterns derived from real traffic.

This is what eval-driven development looks like when it is fully applied to voice. Other teams will implement it with different tooling. The primitives — scenarios, profiles, metrics, evaluations, continuous monitoring — are the discipline's moving parts. Evalgent names them as products because that is the operational shape they take in practice, but the discipline is independent of any vendor.

Smoke, regression, production: the three-tier eval model

TDD's mature form is a tiered model — fast unit tests on every change, slower integration tests on commits, end-to-end tests on release. Voice EDD has the same structure adapted for the cost profile of running synthetic callers. Three tiers, three cadences.

Tier         | Trigger              | Scope                    | Runtime
─────────────┼──────────────────────┼──────────────────────────┼────────
Smoke evals  | Every prompt change  | 5 scenarios × 1 profile  | 2–5 min
             |                      | × 1 run = 5 calls        |
Regression   | Every release        | Full matrix × 3+ runs    | 1–4 hrs
             |                      | = 75–250 calls           |
Production   | Continuous           | Sampled from live        | Always-on
             |                      | traffic + scheduled      |
             |                      | synthetic probes         |

Smoke evals catch the obvious — a prompt change that broke the basic happy path, a model swap that lost a critical capability. They run fast enough to sit in a CI loop. Regression evals catch the subtle — the change that improved happy-path performance but degraded reliability under noise, the model upgrade that improved benchmarks but pushed latency past the engagement cliff. Production evals catch the emergent — the failure pattern that only shows up when real users start saying things the eval set didn't anticipate.
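The call counts in the table follow from scenarios × profiles × runs. The sketch below reproduces them, assuming a five-scenario, five-profile matrix and reading the 75–250 range as 3 to 10 runs per cell; the class and field names are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalTier:
    name: str
    trigger: str
    scenarios: int
    profiles: int
    runs: int

    @property
    def calls(self) -> int:
        """Synthetic calls needed for one execution of this tier."""
        return self.scenarios * self.profiles * self.runs


smoke = EvalTier("smoke", "every prompt change", scenarios=5, profiles=1, runs=1)
regression_min = EvalTier("regression", "every release", scenarios=5, profiles=5, runs=3)
regression_max = EvalTier("regression", "every release", scenarios=5, profiles=5, runs=10)

print(smoke.calls, regression_min.calls, regression_max.calls)  # 5 75 250
```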

Teams that succeed at voice EDD treat the smoke layer as developer ergonomics, the regression layer as the launch gate, and the production layer as the early-warning system. Teams that fail try to use one tier for everything.

The cost of not doing this

Three failure patterns recur in teams that ship voice agents without eval-first discipline.

The first is the demo-to-production gap. The agent works flawlessly in staging on the test calls the team has been running for weeks. It launches. Within a week task completion rate sits 20 points below where staging numbers suggested it would. Per Hamming AI's analysis of more than 4 million voice agent calls, general AI testing tools miss approximately 40% of voice-specific failures. The eval set didn't match the production distribution because the team didn't build the eval set as a discipline. The gap is not a tooling problem. It is a discipline problem.

The second is regression after model swap. The team upgrades the LLM or the TTS engine. Smoke tests pass. Demo calls sound great. Production performance silently degrades because the new model is faster and louder but produces transcripts the existing prompts are not calibrated for, or pronounces a key product name differently. Without a full regression matrix run between versions, this is invisible until users complain. The regression problem in voice agents is structural; no benchmark on the model card predicts how it behaves in your pipeline.

The third is silent acoustic drift. The agent shipped fine. Six weeks later, task completion has dropped four points and nobody knows why. The model didn't change. The prompts didn't change. The user base shifted — more callers from a region with a different accent distribution, or a product launch shifted question patterns outside coverage. Without continuous evaluation, the first signal is a quarterly metrics review showing degradation, by which point the cost has accumulated.

All three patterns are preventable. None of them are preventable without an eval-first workflow.

Why this matters now

Voice agents are entering high-stakes channels at production scale — healthcare scheduling, banking support, insurance verification, sales outbound. The cost of failure in these channels is not "user has to retry." It is a missed appointment, a regulatory escalation, a churned high-value customer, a viral support thread. The competitive moat in voice AI is shifting from "best demo" to "tightest production discipline." The teams that ship reliably are the ones that decided early that the eval suite was load-bearing.

TDD became mandatory when human reading dropped out of the code loop. Eval-driven development becomes mandatory when human listening drops out of the call loop. That moment has already arrived for most production-scale voice deployments. The question is whether the team caught the shift before the cost did.

Summary

TDD became mandatory because the alternative — trusting code without trusting tests — stopped being defensible. EDD for voice is the same shift on a different substrate, and the teams that compress evaluation into the development cycle first will own the voice agent market.

Frequently asked questions

What is eval-driven development?

Eval-driven development is the discipline of treating evaluations as the working specification for AI systems: written first, enforced on every change, gated before production. The eval defines what "works" means. Implementation becomes the search for a configuration that satisfies the eval. It is the AI engineering analogue of test-driven development.

How does TDD compare with eval-driven development?

The comparison between TDD and eval-driven development comes down to substrate and pass criteria. TDD tests deterministic code with binary pass/fail. EDD tests probabilistic systems with statistical thresholds across multiple runs. Both write the spec first. Both gate changes on the spec. EDD adds tolerance for non-deterministic output, condition-matrix evaluation, and N-run reliability scoring that TDD does not need.

Why do AI agents need eval-driven development?

AI agents need eval-driven development for the same reason code needed TDD once it stopped being read line by line. Output volume exceeds inspection capacity. Failures hide in conditions humans didn't think to test. The eval suite becomes the only review surface left. Without EDD, AI agent quality drifts silently, and the first signal is users complaining.

What does eval-driven development for voice agents look like?

Eval-driven development for voice agents extends text EDD across five additional dimensions: pipeline-level evaluation across STT, LLM, and TTS layers; acoustic condition matrices instead of curated text inputs; outcome verification beyond transcripts; N-run statistical reliability scoring; and continuous post-deployment monitoring. The discipline is the same; the surface area is larger and the cost of skipping is non-recoverable.

What is an eval-driven development reference architecture?

An eval-driven development reference architecture for voice has six stages. Define scenarios as the spec. Configure caller profiles across acoustic and behavioural conditions. Set metrics with explicit thresholds. Build until evals pass. Gate every change on a full regression matrix. Continuously evaluate against production traffic patterns. Each stage has an artefact that survives the lifecycle.

How do smoke evals, regression evals, and production evals differ?

Smoke evals, regression evals, and production evals form a three-tier model. Smoke evals run on every prompt change against a minimal scenario set in 2–5 minutes. Regression evals run on every release against the full scenario × profile matrix in 1–4 hours. Production evals run continuously against sampled live traffic and scheduled synthetic probes. Each catches different categories of failure.

How do you write evals before code?

Writing evals before code means defining success conditions and rubrics before any prompt is written. List the scenarios the system must handle. For each, define explicit success conditions: not "completes the task" but "extracts the correct field, calls the right API, returns the right confirmation." Set thresholds on every metric. Only then start implementation.

What is the cost of skipping eval-driven development?

The cost of skipping eval-driven development is asymmetric and silent. Bugs from skipped unit tests surface as visible failures. Failures from skipped evals surface as drift: task completion rates decay quietly, user trust erodes without a single dramatic incident, regressions ship invisibly. For voice agents in high-stakes channels, the cost compounds before any signal reaches the team.
