Voice AI Evaluation

Word error rate (WER): the complete guide for voice agents

Deepesh Jayal

June 2026

10 min read

Every voice agent starts by turning speech into text, and WER is how you measure whether that step works. It is the foundation metric of the whole pipeline, because a transcription error becomes a wrong intent and then a wrong action. This guide explains the metric, how to calculate it, what counts as good, and where it falls short.

Evalgent's stance is that WER is necessary but not sufficient. We explain why at the end. First, the metric itself.

What is WER?

Word error rate (WER): the share of words a speech-to-text system gets wrong, measured as the edits needed to match the correct transcript, divided by the number of words spoken.

WER compares two things: what the system transcribed and the ground truth, the correct reference transcript. The further apart they are, the higher the score. It is the standard measure of speech recognition accuracy, used to benchmark STT and ASR systems across the industry, and is defined in detail on the WER reference.

A score of 0% means a perfect transcript. A score of 10% means roughly one word in ten is wrong, in some form. Because it counts errors, lower is always better.

How do you calculate WER?

The word error rate formula is simple. Add up three kinds of error, then divide by the number of words in the reference.

WER = (S + D + I) / N

S = substitutions  (wrong word)
D = deletions      (missed word)
I = insertions     (extra word)
N = total words in the reference transcript

The three error types come from edit distance, specifically the Levenshtein distance between the transcript and the reference, computed over words rather than characters. The WER calculation finds the smallest number of edits that turns one into the other, then expresses that as a fraction of the words in the reference. One step comes first: normalization. Before scoring, both transcripts are lowercased and stripped of punctuation, so "Okay" and "okay" do not count as an error.

Here is a worked example. The caller says "I want to cancel my account today," which is seven words. The system transcribes "I want to cancel my count today." That is one substitution, "account" became "count," so S is 1, D and I are 0, and N is 7. The score is 1 divided by 7, about 14%.
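The worked example can be reproduced in a few lines. This is a minimal sketch, not a reference implementation: the function names `normalize` and `wer` are illustrative, and the normalization shown (lowercase, strip ASCII punctuation, split on whitespace) is an assumption. Production scoring tools usually apply more rules, such as number and contraction handling.

```python
import string


def normalize(text: str) -> list[str]:
    """Lowercase, strip ASCII punctuation, and split into words (assumed rules)."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def wer(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edits needed to turn the reference words seen so far
    # into the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


score = wer("I want to cancel my account today.",
            "I want to cancel my count today")
print(f"{score:.1%}")  # 14.3%
```

Note that stripping all punctuation also removes apostrophes, so "don't" becomes "dont" on both sides. That is harmless as long as the same rule is applied to the reference and the transcript.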

Substitutions, deletions, and insertions

The three error types tell you how the system failed, not just that it did. Each behaves differently in a voice agent.

Substitutions: the system hears a different word. "Account" becomes "count." These are the most dangerous, because they produce confident, wrong text.
Deletions: the system drops a word. "I do not agree" becomes "I do agree." Deletions can flip meaning entirely.
Insertions: the system adds a word that was not said, often from noise or breath. These clutter intent detection.

Note that the score can exceed 100%. If the system inserts many extra words, the error count can be larger than the reference length. That surprises people, but it follows directly from the formula.
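To see the three error types separately, and to see a score above 100%, you can walk back through the edit-distance table and label each step. The sketch below is one way to do that. `error_counts` is a hypothetical helper, it expects already-normalized word lists, and when several minimal alignments exist it reports just one of them, so the split between S, D, and I can differ between tools even when the total matches.

```python
def error_counts(ref: list[str], hyp: list[str]) -> dict[str, int]:
    """Count substitutions, deletions, and insertions along one minimal alignment."""
    rows, cols = len(ref) + 1, len(hyp) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete every reference word
    for j in range(cols):
        dist[0][j] = j  # insert every hypothesis word
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,
                             dist[i - 1][j] + 1,
                             dist[i][j - 1] + 1)

    # Walk back from the bottom-right corner, labelling each step.
    counts = {"S": 0, "D": 0, "I": 0}
    i, j = len(ref), len(hyp)
    while i > 0 or j > 0:
        cost = 1 if i > 0 and j > 0 and ref[i - 1] != hyp[j - 1] else 0
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + cost:
            counts["S"] += cost  # adds 0 when the words match
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            counts["D"] += 1
            i -= 1
        else:
            counts["I"] += 1
            j -= 1
    return counts


print(error_counts("i do not agree".split(), "i do agree".split()))
# {'S': 0, 'D': 1, 'I': 0}

# A one-word reference with four extra words in the transcript.
ref, hyp = ["yes"], "uh yes i think so".split()
c = error_counts(ref, hyp)
print(c, (c["S"] + c["D"] + c["I"]) / len(ref))
# {'S': 0, 'D': 0, 'I': 4} 4.0
```

The second case scores 400%: four insertions against a reference of one word.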

What is a good WER?

There is no single threshold, because it depends on conditions. On clean, read speech, modern STT can reach a WER of 5% or lower. On real phone calls with accents and noise, the same system can climb to 20% or higher. The right benchmark is one measured under your conditions, not a vendor's lab number.

This is the trap in benchmarks. A provider quotes a low score from clean audio, and teams assume production will match. It rarely does. Our Deepgram STT testing guide covers how benchmark numbers diverge from production. Judge a system by its accuracy on audio that looks like your traffic.

What causes high WER?

A high score is usually about conditions, not just the model. Real audio is messy, and the messiness shows up as errors.

Background noise: traffic, offices, and crosstalk degrade the signal.
Accents and dialects: speech outside the training distribution raises errors sharply.
Domain jargon: names, drug names, and product codes the model never saw.
Telephony: narrowband codecs and packet loss strip detail from the audio.
Overlapping speech: two people talking at once confuse the recognizer.

Each factor compounds. A noisy line with an accented caller using jargon can push the score far above the clean benchmark. This is why testing across conditions matters, a theme in our voice agent stack guide.

WER vs accuracy: what WER misses

WER and accuracy are two sides of one coin: accuracy is roughly one minus the error rate. But both share a blind spot. They treat every word as equal, and in a voice agent, words are not equal.

Getting "the" wrong does not matter. Getting "not" or "cancel" or an account number wrong can break the entire call. A transcript can have a low WER overall and still fail on the one word that carried the intent. The metric measures transcription, not meaning, and it does not know which words were load-bearing. That gap is why it cannot tell you on its own whether a call succeeded.
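One simple way to expose this blind spot is to check the load-bearing words separately from the overall score. The sketch below is an illustration only: the `CRITICAL` word list and the `missed_critical_words` helper are hypothetical, and the check ignores word position, so treat it as a supplement to WER rather than a replacement.

```python
from collections import Counter

# Hypothetical list; in practice it would come from your own intents and entities.
CRITICAL = {"not", "cancel", "refund"}


def missed_critical_words(reference: str, hypothesis: str,
                          critical: set[str] = CRITICAL) -> list[str]:
    """Return critical words that occur in the reference more often than in the transcript."""
    ref_counts = Counter(w for w in reference.lower().split() if w in critical)
    hyp_counts = Counter(w for w in hypothesis.lower().split() if w in critical)
    missing = ref_counts - hyp_counts  # Counter subtraction keeps positive counts only
    return sorted(missing.elements())


print(missed_critical_words("please do not cancel my order",
                            "please do cancel my order"))
# ['not']
```

Here the transcript has a single deletion in six words, a WER of about 17%, yet the dropped word reverses the caller's intent.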

Why WER matters for voice agents

Despite its limits, the metric matters because it sits at the front of the pipeline. Everything downstream depends on it. A wrong transcript leads the language model to the wrong intent, which leads to the wrong response or tool call. The caller hears a fluent, confident, incorrect answer.

This is error compounding, and it is why a voice agent dashboard should track the score per stage, not just overall. Our piece on why voice agents fail in production shows how a few points lost to noise ripple through the whole call. Track it, but track it as the first link in a chain, not the whole story.

WER vs character error rate

WER counts whole words. Character error rate, or CER, counts characters instead. CER is useful for languages without clear word boundaries, or for catching small errors in names and numbers where one character matters.

For most English voice agents, WER is the headline metric and CER is a useful supplement for fields like account numbers. Use WER for the overall view and CER where character-level precision is the point. Both are accuracy measures; they just count at different granularity.
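CER uses the same edit-distance idea, applied to characters instead of words. A minimal sketch follows, with two assumptions worth flagging: the helper names are illustrative, and spaces are removed before scoring, which is one common convention but not the only one.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # match or substitution
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate, ignoring case and spaces (an assumed convention)."""
    ref = reference.lower().replace(" ", "")
    hyp = hypothesis.lower().replace(" ", "")
    if not ref:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref, hyp) / len(ref)


# One wrong digit in an eight-digit account number.
print(cer("4471 9023", "4471 9028"))  # 0.125
```

Scored as words, that account number is either entirely right or entirely wrong. CER shows how close the transcript came, which is useful when you are tuning recognition of digits and names.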

How WER fits with latency and other metrics

A transcription score never travels alone. It sits beside latency, task completion, and containment in any honest scorecard. A system can post a low error rate and still feel broken if its latency is high, because accuracy and speed are separate problems.

Read WER together with these signals. High accuracy with high latency is a bad agent. Low accuracy with fast responses is a confident, wrong agent. The point of a scorecard is to stop one good number from hiding a bad one. Transcription accuracy is one column in that table, not the table itself.

How to reduce WER in production

Once you measure the score, the next question is how to lower it. Most gains come from matching the system to your real audio, not from chasing a leaderboard model. Pick an STT model on your own calls, add domain vocabulary so names and jargon stop becoming errors, and clean the audio path with noise suppression before it reaches the recognizer.

Telephony choices help too. A better codec and a stable connection preserve detail the recognizer needs. None of these are one-time fixes. Re-measure after each change, per cohort, because a tweak that helps clean calls can hurt noisy ones. Reduction is a loop, not a setting.
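The re-measure step can be as simple as comparing per-cohort scores before and after a change and flagging any cohort that got worse. In the sketch below, the `regressions` helper, the cohort names, and the numbers are all made up for illustration.

```python
def regressions(before: dict[str, float], after: dict[str, float],
                tolerance: float = 0.0) -> dict[str, float]:
    """Return cohorts whose WER rose by more than `tolerance`, with the size of the rise."""
    return {
        cohort: after[cohort] - before[cohort]
        for cohort in before
        if cohort in after and after[cohort] - before[cohort] > tolerance
    }


# Illustrative numbers only: WER per cohort before and after a config change.
before = {"clean": 0.06, "noisy": 0.18}
after = {"clean": 0.04, "noisy": 0.21}

for cohort, delta in regressions(before, after).items():
    print(f"{cohort}: WER up {delta:.1%}")
# noisy: WER up 3.0%
```

In this made-up case the change improves clean calls and hurts noisy ones. An overall average could easily hide that, depending on the traffic mix.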

How to test WER for a voice agent

Measuring the score well means measuring it under real conditions, broken down by what matters. A single blended number hides the cohorts that fail. You want it per accent, per noise level, and on the calls that carry critical words.
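A per-cohort breakdown can be sketched as follows. The cohort labels and sample transcripts are invented for illustration, `wer_by_cohort` is a hypothetical helper, and the transcripts are assumed to be normalized already. Errors and reference words are pooled within each cohort, so long calls weigh more than short ones.

```python
from collections import defaultdict


def word_errors(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = curr
    return prev[-1]


def wer_by_cohort(samples: list[tuple[str, str, str]]) -> dict[str, float]:
    """Pool errors and reference words per cohort, then divide."""
    errors: dict[str, int] = defaultdict(int)
    words: dict[str, int] = defaultdict(int)
    for cohort, reference, hypothesis in samples:
        ref, hyp = reference.split(), hypothesis.split()
        errors[cohort] += word_errors(ref, hyp)
        words[cohort] += len(ref)
    return {cohort: errors[cohort] / words[cohort] for cohort in words}


# Invented samples: (cohort, reference, transcript)
samples = [
    ("clean", "i want to cancel my account", "i want to cancel my account"),
    ("noisy", "i want to cancel my account", "i want to cancel my count"),
    ("noisy", "please do not renew", "please do renew"),
]
print(wer_by_cohort(samples))  # {'clean': 0.0, 'noisy': 0.2}
```

Blended together, these three samples score 2 errors over 16 words, or 12.5%, which hides the fact that every error came from the noisy cohort.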

This is where Evalgent fits. Evalgent runs realistic calls and measures the score as one signal among many. Scenarios reproduce noisy, accented, and jargon-heavy calls. Profiles vary caller voices so accuracy is reported per cohort, not just on average. Metrics track it alongside task completion, so you see when transcription errors actually broke outcomes. Evaluations run this as automated batches of synthetic callers, and Reviews let you hear the call behind a bad number.

The result is a metric you can trust, paired with the outcome measures it cannot capture alone. The transcript score tells you the transcription is off; testing tells you whether the call failed because of it. For the full discipline, see the AI voice agent testing pillar, and for the broader split see AI agent testing vs voice agent testing.

Frequently asked questions

What is word error rate?

WER is the standard measure of speech-to-text accuracy. It counts the substitutions, deletions, and insertions needed to turn a system's transcript into the correct reference transcript, divided by the total words spoken. A lower value means more accurate transcription. It is widely used to benchmark speech recognition and ASR systems across the industry.

How do you calculate WER?

Calculate WER with the formula substitutions plus deletions plus insertions, divided by the number of words in the reference transcript. The errors come from the word-level edit distance between the transcript and the correct reference, after normalizing case and punctuation. For example, one wrong word in a seven-word sentence gives about 14%.

What is a good word error rate?

A good WER depends on conditions. On clean, read speech, modern speech-to-text can reach 5% or lower. On real phone calls with accents and background noise, the same system may rise to 20% or higher. Judge a system by its accuracy on audio that resembles your real traffic, not by a vendor's clean lab benchmark.

Why does word error rate matter for voice agents?

WER matters because transcription sits at the front of the voice pipeline. A wrong transcript leads the language model to the wrong intent and then the wrong action, so errors compound downstream. A few points lost to noise can ripple into failed calls, which is why the metric is worth tracking as the first link in the chain.

What causes high WER?

A high WER is usually driven by conditions rather than the model alone. Background noise, accents and dialects outside the training data, domain jargon like names and product codes, narrowband telephony codecs, and overlapping speech all raise errors. These factors compound, so a noisy line with an accented caller using jargon can push the score far above a clean benchmark.

Word error rate vs accuracy: what is the difference?

WER and accuracy are closely related: accuracy is roughly one minus the error rate. The difference is framing, not substance. Both share a blind spot, though, because they weight every word equally. In a voice agent, getting a filler word wrong is harmless, while getting "not" or an account number wrong can break the call entirely.

Does WER measure meaning?

No. WER measures transcription accuracy, not meaning. It counts wrong, missing, and extra words, but it does not know which words mattered. A transcript can have a low WER overall and still fail on the single word that carried the intent. That is why the metric alone cannot tell you whether a call actually succeeded.

How do you test WER for a voice agent?

Test WER under realistic conditions, broken down by cohort rather than a single average. Measure it per accent, per noise level, and on calls that carry critical words. Pair it with task completion so you see when transcription errors broke outcomes. Platform-agnostic testing with synthetic callers, such as Evalgent, makes this measurable before production.

Conclusion

WER is the foundation metric of voice AI, counting the substitutions, deletions, and insertions between a transcript and the truth. Lower is better, but the average hides the one wrong word that can break a call.

Measure the score under real conditions, per cohort, and never alone. Pair it with outcome metrics, because transcription accuracy is the start of a reliable voice agent, not the proof of one.
