LLM as judge for voice agents: the hidden limits of transcript evaluation

Deepesh Jayal

April 2026

The setup is fast, cheap, and feels rigorous. You have transcripts. You have access to GPT-4o. You write an evaluation prompt, score your conversations, and track the quality metric over time. When it reads 94%, confidence is high.

Then your customer satisfaction score drops. Repeat call rates climb. A user posts publicly about the agent that transferred £500 instead of £50, even though the post-call transcript showed a clean successful transaction.

The problem is not the LLM. The problem is the transcript.

LLM-based evaluation of voice agents has become a default practice because it is genuinely easy to implement. But ease and validity are different properties. This article explains what LLM as judge actually measures, where it fails systematically, and what voice agent evaluation beyond transcripts requires to be reliable.

What is LLM as judge and what does it actually measure?

LLM as judge is a form of LLM evaluation in which a language model receives a conversation transcript and a scoring rubric, then outputs a quality score with optional reasoning. In voice AI, this is applied post-call: the STT layer produces a transcript, and the transcript is then passed to a judge model (typically GPT-4o or Claude) with a prompt asking whether the agent handled the conversation well.

When applied to voice agents, transcript analysis of this kind measures four things:

The words said. What the agent said, what the user said, and the order of turns — as the ASR system transcribed them.
Response coherence. Whether the agent's language makes sense given the user's apparent intent.
Tone and register. Whether the agent is courteous, patient, and professional.
Apparent task completion. Whether the dialogue looks like it concluded successfully.

These are not nothing. Coherence, tone, and apparent flow are real signal. The problem is what they do not capture — and in voice AI, the gap between what transcripts capture and what users experience is consistently wider than teams expect.
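The judging setup described above can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation: the rubric wording, the 1 to 5 scale, and the names `build_judge_prompt` and `parse_verdict` are assumptions, and the model call itself is left out. The point is that the judge's entire input is text.

```python
import json

# Hypothetical rubric; real rubrics are usually longer and more specific.
RUBRIC = (
    "Score the agent from 1 to 5 on coherence, tone, and apparent task "
    'completion. Reply with JSON: {"score": <int>, "reasoning": "<one sentence>"}'
)

def build_judge_prompt(turns, rubric=RUBRIC):
    """Build the judge input from (speaker, text) pairs produced by the STT layer."""
    lines = [f"{speaker}: {text}" for speaker, text in turns]
    return rubric + "\n\nTranscript:\n" + "\n".join(lines)

def parse_verdict(raw):
    """Parse the judge's JSON reply and check that the score is in range."""
    verdict = json.loads(raw)
    score = int(verdict["score"])
    if not 1 <= score <= 5:
        raise ValueError(f"score out of range: {score}")
    return score, verdict.get("reasoning", "")

turns = [
    ("user", "I'd like to book an appointment for Tuesday at ten"),
    ("agent", "Your appointment is confirmed for Tuesday at 10 am."),
]
prompt = build_judge_prompt(turns)

# The model call is omitted; this reply is a hand-written placeholder.
example_reply = '{"score": 5, "reasoning": "Request handled clearly and politely."}'
print(parse_verdict(example_reply))
```

Nothing in `prompt` records audio, timing, or what the booking system actually did, which is the limitation the rest of this article explores.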

The 5 blind spots of LLM as judge in voice AI

1. ASR errors are invisible in transcripts

The most fundamental problem with transcript analysis for voice agents is that the transcript is not what the user said. It is what the ASR system believed the user said.

When a user says "transfer fifty dollars" and the ASR system transcribes it as "transfer fifteen dollars," the transcript is internally consistent. The agent correctly processes a transfer request for fifteen dollars. The LLM judge evaluates a coherent, well-handled transaction. The score is high.

The user expected fifty dollars transferred. The wrong amount was moved.

This is not an edge case. ASR errors are endemic in production voice systems, particularly for numbers and currency amounts, arguably the most consequential category of data in financial, healthcare, and transactional voice agents. According to Deepgram's State of Voice AI 2025 benchmarks, Deepgram Nova-3 achieves 6.84% word error rate on clean audio. In contact-centre noise conditions (55–65 dB SNR), that WER rises by 15–30%. The impact of ASR errors on voice agent evaluation is structural: every transcription error that reaches the transcript is treated as ground truth by any downstream LLM evaluation.

The transcript captures the conversation as the ASR understood it. LLM as judge evaluates that understanding — not the reality.
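To make the fifty/fifteen example concrete, here is a minimal sketch of word error rate computed as word-level edit distance divided by the reference length. The function name `word_error_rate` is an assumption for illustration. Note that the reference string (what the user actually said) has to come from human labelling or audio review; it is never available from the ASR transcript alone.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)

spoken = "transfer fifty dollars"         # human-labelled reference
transcribed = "transfer fifteen dollars"  # what the ASR produced
print(word_error_rate(spoken, transcribed))
```

One substituted word out of three gives a WER of about 0.33 for this utterance, yet both strings are grammatical and plausible. A judge that only ever sees `transcribed` has nothing to flag.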

2. Latency is invisible in transcripts

A transcript of a voice conversation strips out all timing information. A 4-second pause between a user's question and the agent's response looks identical to a 400ms response. Both produce the same text.

If there was a 5-second silence between a user's question and the agent's answer — because LLM inference timed out, the TTS provider was slow, or the downstream API stalled — the user experienced something very different from what the transcript shows. According to ITU-T G.114, one-way audio delay above 150ms begins affecting conversational quality. A 5-second wait in a voice interaction is enough for most users to repeat their question, assume the agent failed, or hang up.

The LLM sees the words. It cannot see the P95 latency. It cannot see the 8% of calls that experienced 3+ second delays. Conversation analytics built purely on transcript quality scores will never surface this class of failure.
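A minimal sketch of the timing instrumentation that a transcript discards, assuming the pipeline can log two timestamps per turn. The `Turn` fields, the nearest-rank percentile method, the 3-second threshold, and the sample values are all illustrative assumptions rather than measurements.

```python
import math
from dataclasses import dataclass

@dataclass
class Turn:
    user_speech_end_ms: int    # when the user stopped speaking
    agent_audio_start_ms: int  # when the agent's audio began playing

def response_latencies(turns):
    """Per-turn gap between end of user speech and start of agent audio."""
    return [t.agent_audio_start_ms - t.user_speech_end_ms for t in turns]

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def share_over(values, threshold_ms):
    """Fraction of turns whose latency exceeds the threshold."""
    return sum(v > threshold_ms for v in values) / len(values)

# Illustrative turns: mostly fast, with two multi-second stalls.
gaps = [400, 450, 380, 520, 410, 4000, 430, 390, 470, 5000]
turns = [Turn(10_000 * i, 10_000 * i + gap) for i, gap in enumerate(gaps)]
latencies_ms = response_latencies(turns)

print("P50:", percentile(latencies_ms, 50), "ms")
print("P95:", percentile(latencies_ms, 95), "ms")
print("Share over 3 s:", share_over(latencies_ms, 3000))
```

The median here looks healthy while the tail does not, and none of these numbers can be recovered from the text of the ten turns.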

3. Task completion cannot be verified from text

This is the most commercially damaging blind spot. The transcript shows what was said. LLM as judge cannot verify what was done.

A transcript might show the agent confirming a booking and the user saying "Perfect, thank you." The LLM judge evaluates this as a successful task completion and scores the conversation highly. But the booking API may have returned a 500 error that the agent's error handling swallowed. The appointment may have been created in the wrong location. The confirmation email may never have arrived.

Verifying voice agent task completion requires checking downstream system state: querying the booking database, confirming the API response code, validating the CRM write, and verifying the email delivery. This is end-to-end verification, not text evaluation. No LLM evaluation of a transcript can substitute for it, because the transcript does not contain the information needed to verify the action.

Teams discover this gap in a consistent pattern: high transcript quality scores coexist with elevated first call resolution failures, repeat call rates, and customer complaints — because the agent sounded like it succeeded even when it did not.
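The downstream check described above might be sketched as follows. Everything here is hypothetical: `InMemoryBookingStore` stands in for a real booking database, and the `Booking` fields and identifiers are invented for illustration. The idea is simply to compare what the agent promised with what the system of record contains.

```python
from dataclasses import dataclass, fields
from typing import List, Optional

@dataclass(frozen=True)
class Booking:
    customer_id: str
    location: str
    start_iso: str

class InMemoryBookingStore:
    """Stand-in for the real booking database (hypothetical)."""
    def __init__(self):
        self._rows = {}

    def save(self, booking_id: str, booking: Booking) -> None:
        self._rows[booking_id] = booking

    def get(self, booking_id: str) -> Optional[Booking]:
        return self._rows.get(booking_id)

def verify_booking(store, booking_id: Optional[str], expected: Booking) -> List[str]:
    """Return discrepancies; an empty list means the booking verifiably exists as promised."""
    if booking_id is None:
        return ["agent never received a booking id from the API"]
    actual = store.get(booking_id)
    if actual is None:
        return [f"booking {booking_id} not found in downstream system"]
    problems = []
    for f in fields(expected):
        want, got = getattr(expected, f.name), getattr(actual, f.name)
        if want != got:
            problems.append(f"{f.name}: expected {want!r}, found {got!r}")
    return problems

store = InMemoryBookingStore()
store.save("bk_123", Booking("cust_9", "Leeds North", "2026-04-14T10:00"))

# What the agent told the caller, reconstructed from the dialogue.
promised = Booking("cust_9", "Leeds Central", "2026-04-14T10:00")

print(verify_booking(store, "bk_123", promised))  # wrong location
print(verify_booking(store, "bk_404", promised))  # never created
```

In both failing cases the transcript could still end with the user saying "Perfect, thank you."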

4. Multimodal and downstream failures are invisible

Voice agents rarely operate in isolation. They trigger SMS confirmations, email follow-ups, app state changes, CRM updates, and payment completions. Users form expectations based on what the agent said would happen, then experience those expectations against what actually happened downstream.

A transcript captures the promise. It does not capture whether the promise was kept.

When a user calls back to ask why their appointment confirmation never arrived, the original transcript will show a perfectly handled call. When a user complains about a charge they did not authorise, the transcript will show a professional, accurate agent that handled a payment request correctly. The voice agent monitoring gap, the distance between what the transcript says and what the downstream systems did, is not accessible to any form of transcript analysis.

This is why voice agent accuracy as measured by LLM judges consistently overstates the actual reliability of a system. The judge only evaluates the voice interaction layer. The failures that matter most to users often live in the layers the transcript cannot see.

5. Abandoned conversations leave no transcript

The most invisible failure in transcript-based evaluation is the conversation that never happened — or never completed.

A caller who abandons during hold time leaves no transcript. A user who hangs up 20 seconds into the greeting because the agent did not understand their accent leaves no transcript. A session that crashed due to an infrastructure failure leaves no transcript. Users who tried once, failed, and never called back leave no transcript.

Transcript analysis covers the conversations that made it to a scorable state. Call abandonment rate, hang-up-before-completion rate, and retry-within-24-hours rate are often better indicators of voice agent quality than any transcript quality score — but none of them are accessible from transcripts.

A team measuring only transcript quality can achieve 96% quality scores while systematically driving away their least patient, most acoustically diverse, and technically least forgiving users. Those users never enter the dataset.
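A small sketch of this survivorship effect, assuming call records can be pulled from telephony logs. The `CallRecord` shape, the outcome labels, and the sample numbers are illustrative assumptions, not measured data.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CallRecord:
    call_id: str
    outcome: str                       # "completed", "abandoned", or "crashed" (hypothetical labels)
    transcript_score: Optional[float]  # None when no scorable transcript exists

def coverage_report(calls):
    """Compare the transcript-based score with how many calls it actually covers."""
    total = len(calls)
    scored = [c.transcript_score for c in calls if c.transcript_score is not None]
    not_completed = sum(c.outcome != "completed" for c in calls)
    return {
        "transcript_coverage": len(scored) / total,
        "non_completion_rate": not_completed / total,
        "mean_score_of_scored_calls": sum(scored) / len(scored) if scored else None,
    }

# Eight completed calls with high scores, plus two that never produced a transcript.
calls = [CallRecord(f"c{i}", "completed", 0.96) for i in range(8)]
calls += [CallRecord("c8", "abandoned", None), CallRecord("c9", "crashed", None)]

report = coverage_report(calls)
print(report)
```

The mean score is computed over the eight survivors only. The two calls that failed hardest contribute nothing to it, so the quality score and the non-completion rate have to be read side by side.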

The "looks good" problem

LLM evaluation models are trained on human preferences — specifically, the preferences of annotators reading text. They are optimised to identify conversations that look good to a reader. Voice quality is not the same as text quality.

An agent that uses warm, empathetic language, mirrors the user's concerns, ends with clear next steps, and maintains a consistent tone throughout will score well on any LLM-based evaluation framework. It will score well even if it misunderstood the user's request due to an ASR error, took 4 seconds to respond on 20% of turns, triggered a payment for the wrong amount, or left the user's underlying problem unresolved.

LLM as judge answers the question: if I read this conversation, would I think it went well? That is a different question from: did this conversation achieve the user's goal?

The blind spots of transcript-based LLM evaluation are not gaps that better prompting can close. They are structural limitations of the data source. No rubric, no chain-of-thought reasoning, no few-shot examples will allow a language model to detect what the transcript does not contain.

This is why transcript analysis fails for voice AI at the metrics level: the metrics it produces measure text quality, not voice agent outcomes. Voice agent analytics built on transcript scores alone measure the shadow of performance, not performance itself.

Comparison: what transcript analysis captures vs what it misses

Dimension	Captured by transcript analysis?	Required instead
Response coherence	Well	—
Tone and register	Well	—
Apparent flow	Well	—
Compliance language check	Well	—
ASR transcription accuracy	Never	Audio analysis required
End-to-end latency	Never	Timing instrumentation required
Task completion (system state)	Never	API / DB verification required
Downstream channel failures	Never	Cross-system monitoring required
Abandoned / incomplete calls	Never	Call metadata and CDR analysis required
Real user outcomes (CSAT, FCR)	Never	Outcome tracking required

What voice agent evaluation beyond transcripts actually requires

End-to-end task verification

The only way to know whether a task completed is to check whether it completed. This means correlating agent dialogue with downstream system state: did the booking appear in the database? Did the API return a success response? Did the payment process? Did the email arrive?

Hamming AI's QA framework integrates with downstream systems to perform end-to-end verification as part of the evaluation pipeline. Cekura tracks hallucination rate and task completion against expected outcomes rather than transcript quality. This is a fundamentally different evaluation model — one that treats the voice interaction as a means to an end, and verifies the end directly.

Audio-level analysis

A transcript is a lossy compression of a voice conversation. Working at the audio level captures what transcripts discard: ASR confidence scores and alternative hypotheses; timing at every stage of every turn; acoustic signal quality including background noise and codec degradation; barge-in events and overlapping speech; and the complete silence record that reveals where latency lived.

Audio-level analysis is harder to implement than transcript scoring — but it measures what users actually experience. For identifying why voice agents fail in production, it is the difference between analysing the shadow and analysing the object.
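As one example of using a signal the transcript discards, the sketch below flags number-like words that the ASR engine itself was unsure about. Many ASR services report per-word confidence, but the `AsrWord` shape, the 0.85 threshold, and the sample confidences are assumptions for illustration and would need tuning against real data.

```python
from dataclasses import dataclass

@dataclass
class AsrWord:
    text: str
    confidence: float  # 0.0 to 1.0, as reported by the ASR engine

NUMBER_WORDS = {
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen", "twenty", "thirty",
    "forty", "fifty", "sixty", "seventy", "eighty", "ninety",
    "hundred", "thousand",
}

def risky_number_tokens(words, threshold=0.85):
    """Return number-like words whose ASR confidence falls below the threshold."""
    flagged = []
    for word in words:
        token = word.text.lower().strip(".,")
        is_number = token in NUMBER_WORDS or any(ch.isdigit() for ch in token)
        if is_number and word.confidence < threshold:
            flagged.append(word)
    return flagged

utterance = [
    AsrWord("transfer", 0.98),
    AsrWord("fifteen", 0.61),  # the engine was unsure between similar-sounding numbers
    AsrWord("dollars", 0.97),
]
for word in risky_number_tokens(utterance):
    print(f"review: {word.text!r} (confidence {word.confidence})")
```

A flag like this can route a call to human review or trigger an explicit read-back of the amount, neither of which is possible once the confidence values have been flattened into plain text.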

Outcome correlation

The gold standard for voice agent monitoring is correlating agent performance metrics with business outcomes: task completion rate, first call resolution (FCR), call abandonment rate, repeat call rate within 24 hours, and CSAT scores. These metrics require tracking users across calls and systems, not just scoring individual transcripts.

None of this is achievable with a single LLM evaluation prompt. It requires instrumentation across the full stack — voice interaction layer, downstream APIs, CRM, and customer feedback channels. The platform metrics layer is where this kind of multi-signal evaluation becomes feasible, connecting voice conversation data to downstream outcome verification.
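One of these outcome metrics, repeat calls within 24 hours, can be sketched from call history alone. The input shape (caller id plus call start time), the function name, and the sample data are assumptions; a real pipeline would also need reliable caller identification across channels.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def repeat_call_rate(calls, window=timedelta(hours=24)):
    """Share of calls followed by another call from the same caller within `window`.

    `calls` is an iterable of (caller_id, started_at) pairs.
    """
    by_caller = defaultdict(list)
    for caller_id, started_at in calls:
        by_caller[caller_id].append(started_at)

    total = 0
    followed = 0
    for times in by_caller.values():
        times.sort()
        # Pair each call with the same caller's next call, if there is one.
        for current, nxt in zip(times, times[1:] + [None]):
            total += 1
            if nxt is not None and nxt - current <= window:
                followed += 1
    return followed / total if total else 0.0

history = [
    ("caller_a", datetime(2026, 4, 1, 10, 0)),
    ("caller_a", datetime(2026, 4, 1, 15, 0)),  # 5 hours after the first call
    ("caller_a", datetime(2026, 4, 3, 9, 0)),   # 42 hours later, outside the window
    ("caller_b", datetime(2026, 4, 1, 11, 0)),
    ("caller_c", datetime(2026, 4, 2, 8, 0)),
    ("caller_c", datetime(2026, 4, 3, 7, 59)),  # just inside 24 hours
]
print(repeat_call_rate(history))
```

Two of the six calls were followed by a repeat call inside the window. Each of those first calls may well have a high transcript score, which is why the two signals need to be joined rather than read separately.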

Proactive testing before production

Transcript analysis is reactive: it evaluates conversations that already happened. Proactive evaluation uses synthetic callers to test the agent before real users encounter failures.

Scenario-based testing with synthetic callers can verify task completion end-to-end before deployment; test across acoustic conditions that transcripts would never distinguish; catch regressions after model updates before they reach users; and systematically explore breaking points and edge cases that random production sampling will miss for months.

Reactive transcript analysis tells you about yesterday's failures. Proactive evaluation prevents tomorrow's.

When LLM as judge is legitimate

This is not an argument that LLM-based evaluation is worthless. It has genuine value in the right context.

Filtering at scale. When a voice agent handles 10,000 conversations per day, LLM evaluation can surface the transcripts worth human review. As a filtering and prioritisation tool — not a quality measurement tool — it is efficient and useful.

Tone and compliance auditing. For questions like "Did the agent follow the required disclosure language?" or "Was prohibited phrasing absent?", transcript analysis works well. The compliance question is answered by the words used, and LLM evaluation can check that reliably.

Coaching and qualitative review. Reviewing conversation transcripts for agent coaching, identifying patterns in user language, and exploring qualitative conversational issues — all legitimate uses. LLM analysis can pre-categorise transcripts at scale, making human review more targeted.

Directional trend signals. If your LLM evaluation scores drop suddenly across a cohort, something probably changed. The absolute score may not be trustworthy, but the relative trend is an informative signal for investigation.

The key constraint is using transcript analysis for questions that transcripts can answer — and not extending it to questions that require system state, audio data, or outcome tracking.

The metrics that actually measure voice agent quality

Conversation analytics built on transcript quality scores alone is insufficient. The voice metrics that reliably predict user experience require a broader instrumentation model.

Metric	What it measures	Data source required
Task completion rate	Whether the agent's actions produced the intended outcome	Downstream system state
First call resolution (FCR)	Whether the issue was resolved without a repeat call	Call history + outcome tracking
Call abandonment rate	Users who gave up before the conversation was evaluable	CDR / telephony logs
P95 response latency	The tail of the timing experience (95% of turns are at least this fast)	Pipeline instrumentation
WER per acoustic condition	ASR accuracy under real-world noise profiles	Audio analysis
Escalation rate	How often the agent could not handle the request	Dialogue state tracking
Repeat call within 24h	Proxy for unresolved issues	Call history correlation
CSAT post-call	User's direct assessment of the experience	Post-call survey or IVR

These AI evaluation metrics cannot be extracted from transcripts alone. They require cross-layer instrumentation that connects voice interaction data to downstream system state and user behaviour. Evaluation of AI voice systems that relies only on transcript quality scores will systematically overestimate reliability, because the data source excludes precisely the failure modes that matter most to users.

Summary

LLM as judge is easy to implement and produces outputs that look authoritative. For voice agents, it systematically misses the failures that damage user experience: ASR errors that corrupt the transcript before the judge receives it, latency that transcripts cannot represent, task completion that requires system state verification, downstream channel failures that live outside the voice layer, and abandoned calls that never become transcripts.

Outcome-based voice agent evaluation requires instrumentation beyond the transcript. It means verifying what happened in downstream systems, analysing audio rather than compressed text, tracking business outcome metrics, and testing proactively rather than analysing reactively. High transcript quality scores and poor user outcomes coexist regularly, and will continue to until evaluation is grounded in what users experience rather than in what conversations look like on paper.

Frequently asked questions

What is LLM as judge for voice agents?

LLM as judge is the practice of passing voice agent conversation transcripts to a language model — typically GPT-4o or Claude — with a scoring rubric asking whether the conversation was handled well. It measures coherence, tone, and apparent task completion from the transcript text. It cannot detect ASR errors, latency problems, task verification failures, downstream channel failures, or abandoned conversations.

Why does transcript analysis fail for voice AI?

Transcripts are lossy: they strip out timing, compress audio quality into text, and represent ASR output rather than user intent. Five structural blind spots make LLM evaluation of transcripts unreliable for voice agents: invisible ASR errors, invisible latency, unverifiable task completion, invisible downstream failures, and missing abandoned calls. These gaps cannot be closed with better prompting — they require different data sources.

What does LLM as judge miss in voice agents?

LLM as judge cannot detect ASR transcription errors where the wrong words appear correct in context; response latency that made the conversation feel slow; task completion failures where the agent said the right thing but the API call failed; downstream failures in email, SMS, or CRM systems the agent triggered; or conversations that were abandoned before they became scorable transcripts.

How do you verify voice agent task completion?

Task completion must be verified against downstream system state, not inferred from dialogue text. Check the booking database for the appointment, confirm the API returned a success response, verify the CRM write, and confirm email delivery. None of this is accessible from a transcript. Platforms like Hamming and Cekura integrate downstream verification; this is fundamentally different from transcript-based LLM evaluation.

What is outcome-based voice agent evaluation?

Outcome-based voice agent evaluation measures whether the agent's actions produced the intended real-world result. It tracks task completion rate against system state, first call resolution rate, call abandonment rate, P95 latency, and CSAT, not transcript quality scores. It requires cross-layer instrumentation connecting voice interaction data to downstream APIs, CRM records, and user behaviour tracking.

When is LLM as judge legitimate for voice agents?

LLM as judge is legitimate for filtering large transcript volumes to surface conversations worth human review, auditing tone and compliance language, supporting agent coaching workflows, and identifying directional trend changes. It is not reliable for measuring task completion, detecting technical failures, or producing an accurate quality score for a voice agent in production. Use it as a filtering tool, not a measurement tool.

What metrics actually measure voice agent quality?

The metrics that reliably correlate with user experience are task completion rate (verified against system state), first call resolution, call abandonment rate, P95 response latency, WER per acoustic condition, escalation rate, repeat call rate within 24 hours, and post-call CSAT. These voice metrics require cross-layer instrumentation and cannot be extracted from conversation transcripts alone.

What is the difference between transcript quality and voice agent accuracy?

Transcript quality measures whether conversation text looks coherent, professional, and complete to a reader. Voice agent accuracy measures whether the agent correctly understood user intent under real acoustic conditions, took the correct action in downstream systems, and resolved the issue without a repeat call. The two metrics are weakly correlated: high transcript quality scores alongside accuracy failures are common in production.