STT evaluation: how to test speech-to-text for voice agents

Deepesh Jayal

Speech-to-text is the first stage of every voice agent, and it sets the ceiling for everything after it. If the transcription is wrong, the language model reasons over the wrong words. STT evaluation is how you find out whether your transcription holds up before real callers expose it. This guide covers the metrics, the method, and the conditions that benchmarks ignore.

Evalgent treats STT evaluation as part of full-call assessment, not an isolated benchmark. We explain why at the end. First, the fundamentals.

What is STT evaluation?

STT evaluation: the process of measuring a speech-to-text system's accuracy, latency, and robustness under realistic audio conditions, rather than on clean reference recordings.

Speech-to-text evaluation answers a simple question: how often does the system get the words right, and under what conditions does it fail. A model can look excellent on a clean benchmark and fall apart on a noisy phone line. The point is to find that gap before production does.

This is broader than a single number. Evaluating automatic speech recognition (ASR) covers accuracy, but also latency, behaviour under noise, and performance across speaker groups. The output is not "the model is 95% accurate," but "the model is 95% accurate on clean audio and 78% accurate on accented calls in noise." That second statement predicts production.

How do you evaluate speech to text accuracy?

Accuracy testing compares the system's transcript against a ground truth, the correct reference. The standard metric is word error rate, or WER, which counts the wrong, missing, and extra words (substitutions, deletions, and insertions) and divides by the number of words in the reference. Lower is better.

The mechanics are straightforward. Collect representative audio, transcribe it, and compare each transcript to a human-verified reference after normalising case and punctuation. The hard part is not the math. It is assembling audio that represents your callers, with the accents, noise, and vocabulary they bring. Without that, you are measuring speech recognition on the wrong distribution. The test set should match the distribution of calls you actually receive.
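As a rough illustration of the comparison step, here is a minimal Python sketch of WER. It assumes a simple normalisation (lowercasing and stripping punctuation other than apostrophes); the function names are illustrative, and production pipelines typically also normalise numbers, spellings, and filler words.

```python
import re


def normalise(text: str) -> list[str]:
    """Lowercase, replace punctuation (except apostrophes) with spaces, split into words."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)
    return text.split()


def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance: substitutions + deletions + insertions."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from transcript
                curr[j - 1] + 1,     # insertion: extra word in transcript
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: edit distance divided by the reference word count."""
    ref = normalise(reference)
    hyp = normalise(hypothesis)
    if not ref:
        raise ValueError("reference transcript is empty")
    return edit_distance(ref, hyp) / len(ref)


# One substituted word out of four reference words.
score = wer("Please transfer fifty dollars.", "please transfer fifteen dollars")
print(f"WER: {score:.2f}")
```

Because the denominator is the reference length, WER can exceed 100% when the transcript contains many extra words.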

Metrics that matter for STT

WER is the headline, but it is not the only signal. A complete assessment tracks several.

  • Word error rate: the core accuracy measure, ideally reported per cohort.
  • Character error rate: catches small errors in names and numbers.
  • Latency: how fast transcripts return, including streaming partials.
  • Real-time factor: processing time relative to audio length.
  • Per-entity accuracy: correctness on the words that matter, like account numbers.

Track these together. A model with great WER but high latency is wrong for a real-time voice agent. A model with good average WER that mangles names is wrong for support. The metrics interact, which is why STT accuracy testing should never lean on one number alone. Endpointing matters too: where the system decides a turn ended shapes both latency and the words it captures.
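The non-WER metrics are simple to compute once you have the raw measurements. The sketch below is one possible shape, not a standard API: the function names are hypothetical, the percentile uses the nearest-rank method, and per-entity accuracy assumes you have already extracted the entity values from both the reference and the transcript.

```python
import math


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time relative to audio length; below 1.0 means faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile, e.g. pct=95 for p95 latency."""
    if not values:
        raise ValueError("no values supplied")
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def entity_accuracy(pairs: list[tuple[str, str]]) -> float:
    """Share of (expected, transcribed) entity pairs that match exactly,
    ignoring case and whitespace."""
    if not pairs:
        raise ValueError("no entity pairs supplied")

    def squash(value: str) -> str:
        return "".join(value.lower().split())

    correct = sum(1 for expected, heard in pairs if squash(expected) == squash(heard))
    return correct / len(pairs)


# Illustrative numbers only.
latencies_ms = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
print("RTF:", real_time_factor(2.0, 10.0))
print("p95 latency (ms):", percentile(latencies_ms, 95))
print("Entity accuracy:", entity_accuracy([("4471 9920", "44719920"), ("Nguyen", "Newin")]))
```

Reporting a high percentile rather than the mean is a deliberate choice here: in a live call, the slow transcripts are the ones the caller notices.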

How do you test STT for accents and noise?

This is where most STT quality is won or lost. A model trained mostly on clean, standard-accent speech degrades sharply outside that distribution. Real callers live outside it.

Build the test set deliberately:

  • Source audio across the accents and dialects your callers actually use.
  • Add background noise at realistic levels: quiet, office, and street.
  • Include domain vocabulary, names, and code-switching between languages.
  • Route audio through your real telephony, not just clean browser capture.

Then report WER per cohort, never as a single blend. A blended score hides the accent group sitting at 30% WER while the average looks fine. Our Deepgram STT testing guide walks through this kind of condition-based testing in depth.
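To see how a blend hides a failing group, consider this small Python sketch. The cohort names and error counts are invented for illustration, and it assumes the per-utterance error and reference word counts have already been computed.

```python
from collections import defaultdict


def wer_by_cohort(results: list[tuple[str, int, int]]) -> dict[str, float]:
    """Aggregate (cohort, word_errors, reference_words) rows into WER per cohort."""
    errors: dict[str, int] = defaultdict(int)
    words: dict[str, int] = defaultdict(int)
    for cohort, word_errors, reference_words in results:
        errors[cohort] += word_errors
        words[cohort] += reference_words
    return {cohort: errors[cohort] / words[cohort] for cohort in words}


def blended_wer(results: list[tuple[str, int, int]]) -> float:
    """Single pooled WER across all cohorts."""
    total_errors = sum(row[1] for row in results)
    total_words = sum(row[2] for row in results)
    return total_errors / total_words


# Hypothetical test run: a large clean cohort and a small difficult one.
results = [
    ("standard_accent_quiet", 25, 600),
    ("standard_accent_quiet", 15, 400),
    ("regional_accent_street_noise", 30, 100),
]

print(f"Blended WER: {blended_wer(results):.1%}")
for cohort, score in wer_by_cohort(results).items():
    print(f"{cohort}: {score:.1%}")
```

With these made-up counts, the pooled figure stays in single digits while the small noisy cohort sits at 30%, which is exactly the gap a blended report conceals.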

Metric versus method

People sometimes use "WER" and "STT evaluation" interchangeably, but they are not the same. WER is a metric. STT evaluation is the practice that uses it, along with others, under controlled conditions.

The distinction matters because WER alone is easy to misread. A low WER on the wrong audio tells you nothing. The practice adds context: which audio, which speakers, which noise, and what happened to the call as a result. WER is the thermometer; the method is the diagnosis. You need the metric, but the process is what makes it meaningful.

How do you benchmark an STT model?

Benchmarking compares candidate models on the same audio so you can choose. The rule is to hold the audio constant and vary only the model. Run each candidate over an identical, representative test set and compare WER, latency, and per-entity accuracy side by side.

Treat published vendor numbers with suspicion. They are usually measured on clean, favourable audio and rarely predict your results. The only benchmark that matters is the one on your traffic. When a new model version ships, re-run the comparison, because STT benchmarking is not a one-time event. Diarisation quality matters too if your calls have multiple speakers, since telling who said what affects downstream logic, as covered in speaker diarisation.
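A benchmark harness that holds the audio constant can be quite small. The Python sketch below is an assumption-laden outline: each candidate is modelled as a plain function from an audio path to a transcript, the two "models" are stubs returning canned text rather than real vendor clients, and the file names are hypothetical.

```python
import time
from typing import Callable

Transcriber = Callable[[str], str]  # audio path -> transcript (assumed interface)


def word_errors(ref: list[str], hyp: list[str]) -> int:
    """Word-level edit distance between reference and hypothesis."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = curr
    return prev[-1]


def benchmark(models: dict[str, Transcriber],
              test_set: list[tuple[str, str]]) -> dict[str, dict[str, float]]:
    """Run every model over the same (audio_path, reference) pairs."""
    report: dict[str, dict[str, float]] = {}
    for name, transcribe in models.items():
        errors = words = 0
        elapsed = 0.0
        for audio_path, reference in test_set:
            start = time.perf_counter()
            hypothesis = transcribe(audio_path)
            elapsed += time.perf_counter() - start
            ref_words = reference.lower().split()
            errors += word_errors(ref_words, hypothesis.lower().split())
            words += len(ref_words)
        report[name] = {"wer": errors / words,
                        "mean_latency_s": elapsed / len(test_set)}
    return report


# Identical test set for every candidate (hypothetical files and references).
TEST_SET = [
    ("call_001.wav", "i want to move my appointment to friday"),
    ("call_002.wav", "my account number is four four seven one"),
]

# Stub candidates standing in for real STT clients.
CANNED_A = dict(TEST_SET)
CANNED_B = {**CANNED_A, "call_002.wav": "my account number is four four seven nine"}

report = benchmark({"model_a": CANNED_A.__getitem__, "model_b": CANNED_B.__getitem__},
                   TEST_SET)
for name, scores in report.items():
    print(name, scores)
```

Swapping a stub for a real client only requires wrapping the vendor call in a function with the same shape, so the test set and scoring stay untouched when a new model version ships.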

What is a good STT accuracy for voice agents?

There is no universal number, because it depends on conditions and on what the words drive. On clean audio, modern STT can reach a WER of 5% or lower. On real calls with accents and noise, 10 to 20% is common, and what counts as acceptable depends on how much a wrong word costs you.

A booking agent can tolerate more transcription noise than one handling account numbers. The right target is set by your downstream task, not by a leaderboard. Judge accuracy against the audio you actually receive and the cost of the errors that slip through. Adoption context for these systems is tracked in the Stanford HAI AI Index.

Why it belongs in full-call testing

Here is the trap with judging STT in isolation. A transcript can be slightly wrong in a way WER barely notices, yet that small error breaks the whole call. The reverse happens too: a transcript with a few harmless errors still completes the task. WER on its own cannot tell these apart.

That is why voice agent STT testing should connect transcription accuracy to call outcomes. The question is not only what the WER is, but whether the call succeeded despite or because of the transcription. Judging speech-to-text in isolation misses this link entirely. Our voice agent stack guide shows how the STT layer feeds everything downstream.
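One simple way to make that link visible is to cross-tabulate per-call WER against task completion. The Python sketch below is illustrative only: the `CallResult` fields and the 10% threshold are assumptions, not a description of how any particular platform stores results.

```python
from dataclasses import dataclass


@dataclass
class CallResult:
    call_id: str
    wer: float            # per-call word error rate
    task_completed: bool  # did the caller's goal succeed?


def cross_tabulate(calls: list[CallResult],
                   wer_threshold: float = 0.10) -> dict[str, list[str]]:
    """Group call IDs by WER level (assumed threshold) and outcome."""
    buckets: dict[str, list[str]] = {
        "low_wer_success": [],
        "low_wer_failure": [],   # small transcription error, broken call
        "high_wer_success": [],  # noisy transcript, task still completed
        "high_wer_failure": [],
    }
    for call in calls:
        level = "low_wer" if call.wer <= wer_threshold else "high_wer"
        outcome = "success" if call.task_completed else "failure"
        buckets[f"{level}_{outcome}"].append(call.call_id)
    return buckets


# Hypothetical results from a test batch.
calls = [
    CallResult("c1", 0.02, True),
    CallResult("c2", 0.04, False),  # one wrong digit in an account number
    CallResult("c3", 0.18, True),   # filler-word errors only
    CallResult("c4", 0.25, False),
]

for bucket, call_ids in cross_tabulate(calls).items():
    print(bucket, call_ids)
```

The two off-diagonal buckets are the ones worth listening to: low-WER failures point at errors on critical words, and high-WER successes show where the agent is robust to transcription noise.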

Common pitfalls to avoid

A few mistakes recur often enough to call out. The first is testing on clean audio and trusting the number. The second is reporting a single blended score that hides the worst cohort. The third is measuring transcription alone, with no link to whether the call succeeded.

Two more are subtle. Teams forget to refresh the test set as their caller base shifts, so the suite slowly stops matching reality. And they treat a one-time pass as permanent, even though a model or telephony change can move the numbers overnight. Avoid these by testing on real audio, reporting per cohort, tying scores to outcomes, and re-running after every change.

Tooling helps, but discipline matters more. The best suite is the one you actually re-run, on audio that looks like today's traffic. A modest test set you trust beats a large one you never re-run.

How do you evaluate STT for a voice agent?

The best approach runs inside a real call, not on a static file. You want transcription measured under the same conditions the agent faces, and tied to whether the caller's goal was met.

This is where Evalgent fits. Evalgent runs realistic calls and measures STT as one signal in context. Scenarios reproduce noisy, accented, and jargon-heavy calls. Profiles vary caller voices so accuracy is reported per cohort. Metrics track WER and latency alongside task completion, so a transcription error that broke an outcome is visible, not buried in an average. Evaluations run this as automated batches of synthetic callers, and Reviews let you hear the audio behind a bad number.

The result is STT accuracy you can trust, measured on your audio and connected to outcomes. The transcript score tells you the words; the call result tells you whether they mattered. For the full discipline, see the AI voice agent testing pillar and the synthetic callers guide.

Frequently asked questions

What is STT evaluation?

STT evaluation is the process of measuring a speech-to-text system's accuracy, latency, and robustness under realistic audio conditions rather than clean recordings. It uses word error rate as the headline metric, broken down by accent, noise level, and speaker group, and ties transcription accuracy to whether calls succeed. The goal is to predict production performance before launch.

How do you evaluate speech to text accuracy?

Evaluate speech to text accuracy by comparing the system's transcript against a human-verified reference, after normalising case and punctuation. The standard metric is word error rate, which counts wrong, missing, and extra words. The critical step is using audio that represents your real callers, with the accents, noise, and vocabulary they bring, not a clean benchmark set.

What metrics are used for STT evaluation?

The core metric for STT evaluation is word error rate, ideally reported per cohort. Character error rate catches small errors in names and numbers. Latency and real-time factor measure speed, which matters for voice. Per-entity accuracy checks the words that carry meaning, like account numbers. Track these together, since one good metric can hide a fatal weakness in another.

How do you test STT for accents and noise?

Test STT for accents and noise by building a deliberate test set: source audio across the accents your callers use, add background noise at realistic levels, include domain jargon and code-switching, and route audio through your real telephony. Then report word error rate per cohort, never as a single blend, so a failing accent group is not hidden by a healthy average.

STT evaluation vs WER: what is the difference?

WER is a single metric; STT evaluation is the practice that uses it under controlled conditions, alongside latency and per-entity accuracy. WER alone is easy to misread, because a low score on the wrong audio means nothing. The practice adds the context of which speakers, which noise, and what happened to the call, turning a number into a diagnosis.

How do you benchmark an STT model?

Benchmark an STT model by holding the audio constant and varying only the model. Run each candidate over an identical, representative test set and compare word error rate, latency, and per-entity accuracy. Treat vendor benchmarks with suspicion, since they use clean audio. The benchmark that matters is the one on your own traffic, re-run whenever a model updates.

What is a good STT accuracy for voice agents?

A good STT accuracy depends on conditions and on what the words drive. On clean audio, modern speech-to-text reaches 5% word error rate or lower; on real calls with accents and noise, 10 to 20% is common. The right target is set by your downstream task and the cost of a wrong word, not by a vendor leaderboard. Judge it on your real audio.

How do you evaluate STT for a voice agent?

Evaluate STT for a voice agent inside realistic calls, not on static files. Measure word error rate and latency under the accents and noise the agent faces, report per cohort, and tie the transcription to whether the call succeeded. Platform-agnostic testing with synthetic callers, such as Evalgent, makes this an in-context measurement rather than an isolated benchmark.

Conclusion

STT evaluation is how you prove the first stage of your voice agent works under real conditions. Word error rate is the headline, but the value comes from breaking it down by cohort and tying it to call outcomes.

Measure on your own audio, never on a vendor's clean benchmark. The transcript sets the ceiling for the whole agent, so test it as the foundation it is, then test the call it feeds.
