TTS evaluation: how to test text-to-speech for voice agents

Text-to-speech is the last stage of a voice agent, and it is the part callers actually hear. A voice can sound impressive in a sample and still mispronounce names, lag on the first word, or flatten an important number. TTS evaluation is how you catch those failures before they reach callers. This guide covers the metrics, the method, and what a demo clip never shows.
Evalgent treats TTS evaluation as part of full-call assessment, not as a standalone audio demo. We explain why at the end. First, the fundamentals.
What is TTS evaluation?
TTS evaluation: the process of measuring a text-to-speech system's naturalness, intelligibility, latency, and pronunciation accuracy on the content a voice agent actually speaks.
Text-to-speech evaluation answers whether the synthesized voice works for real calls. A voice can be pleasant and still fail: it mangles a surname, reads a phone number too fast, or pauses oddly mid-sentence. The point is to measure those failures, not just overall pleasantness.
This is broader than "does it sound human." TTS quality testing covers four things at once: how natural the voice sounds, how easily it is understood, how fast it starts, and whether it says the right words. A weak result on any one of them hurts the call.
How do you evaluate text-to-speech quality?
Quality has a subjective side and an objective side, and you need both. The subjective side asks people how the voice sounds. The objective side measures things you can count, like latency and pronunciation errors.
Start by listening on your real scripts, not a vendor demo. Then layer in measurable checks. The classic subjective measure is MOS, and the objective checks include latency and a round-trip accuracy test. Used together, they tell you whether callers will both like and understand the voice. The underlying technology is covered in speech synthesis.
What is a MOS score?
MOS (mean opinion score): the average rating listeners give a voice on a 1-to-5 scale, where 5 is perfectly natural and 1 is bad.
MOS is the standard subjective metric for voice quality. Listeners rate samples, and you average the scores. A high MOS means the voice sounds natural to people, which matters because naturalness affects trust and how callers behave. The metric is defined in detail on the mean opinion score reference.
MOS has limits. It is slow and expensive to gather from humans, and it captures overall impression rather than specific failures. Automated MOS estimators exist and are faster, but they approximate human judgment rather than replace it. Use MOS for the big picture, and pair it with objective checks for the details.
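As a minimal sketch of the averaging step, the snippet below computes a MOS from a small panel of ratings and attaches a rough 95% confidence interval. The function name, the sample ratings, and the normal-approximation interval are illustrative assumptions, not part of any standard tooling.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (mos, ci_half_width) for a list of 1-to-5 listener ratings."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be on the 1-to-5 scale")
    mos = statistics.mean(ratings)
    # Normal-approximation 95% interval: an assumption that is rough for small panels.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Hypothetical ratings from eight listeners for one voice sample.
ratings = [4, 5, 4, 3, 4, 5, 4, 4]
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

Reporting the interval alongside the mean makes it easier to tell whether two voices really differ or whether the panel was simply too small to say.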
How do you test TTS pronunciation?
Pronunciation is where polished voices quietly fail. A natural-sounding voice that says a customer's name wrong still breaks the call. The fix is to test the words that matter for your domain.
Build a pronunciation set deliberately:
- Real customer names, including non-English ones.
- Numbers: phone numbers, amounts, dates, and account IDs.
- Domain terms, drug names, product codes, and abbreviations.
- Edge cases like emails, URLs, and mixed-language strings.
Use SSML where the engine supports it to correct stubborn cases, and re-test after every change. A round-trip check helps too: synthesize the text, transcribe it back with STT, and compare to the original with word error rate. A high round-trip error flags words the voice is saying wrong.
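The round-trip check can be sketched roughly as follows. The word error rate is a standard word-level edit distance divided by the reference length; `synthesize` and `transcribe` are hypothetical callables standing in for whatever TTS and STT engines you use, and the fake versions here exist only so the example runs.

```python
import re


def normalize(text):
    """Lowercase and strip punctuation so only word differences count."""
    return re.sub(r"[^a-z0-9' ]+", " ", text.lower()).split()


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            # deletion, insertion, substitution (or match)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)


def round_trip_wer(text, synthesize, transcribe):
    """synthesize/transcribe are hypothetical wrappers around your TTS and STT."""
    return word_error_rate(text, transcribe(synthesize(text)))


# Fake engines for illustration only: the "STT" mishears one surname.
def fake_tts(text):
    return text.encode("utf-8")


def fake_stt(audio):
    return audio.decode("utf-8").replace("Nguyen", "Newen")


line = "Your appointment with Dr. Nguyen is confirmed"
print(round(round_trip_wer(line, fake_tts, fake_stt), 3))
```

Keep in mind that a round-trip error can come from the STT side as well as the TTS side, and that numbers often come back as digits rather than words, so treat a high score as a prompt to listen rather than as proof of a mispronunciation.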
TTS metrics that matter
A complete assessment tracks both perception and performance.
- MOS: subjective naturalness, from human or estimated ratings.
- Intelligibility: how reliably listeners understand the words.
- Pronunciation accuracy: correctness on names, numbers, and jargon.
- Latency: time to first byte, meaning how fast audio starts streaming.
- Prosody: natural rhythm, stress, and intonation.
Track these together, because they trade off. A voice tuned for maximum naturalness can add latency that hurts the call. A fast voice can sound robotic. The right balance depends on your use case, which is why voice agent TTS testing should weight the metrics to your priorities, not chase one in isolation.
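One way to make that weighting explicit is a simple scorecard. This is only a sketch: it assumes each metric has already been normalized to a 0-to-1 score, and the metric keys, sample scores, and weights are all hypothetical.

```python
def weighted_score(scores, weights):
    """Combine per-metric scores (each normalized to 0..1) using use-case weights."""
    missing = set(weights) - set(scores)
    if missing:
        raise KeyError(f"missing scores for: {sorted(missing)}")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[m] * w for m, w in weights.items()) / total


# Hypothetical normalized results for one candidate voice.
scores = {
    "mos": 0.85,
    "intelligibility": 0.95,
    "pronunciation": 0.90,
    "latency": 0.60,
    "prosody": 0.80,
}

# Hypothetical weights for a support line where names and speed matter most.
support_line = {
    "mos": 1,
    "intelligibility": 2,
    "pronunciation": 3,
    "latency": 3,
    "prosody": 1,
}

print(round(weighted_score(scores, support_line), 3))
```

A single number is convenient for ranking candidate voices, but it can hide one badly failing metric, so it works best next to per-metric thresholds rather than instead of them.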
What is a good TTS latency for voice agents?
For voice, latency means time to first byte: how soon audio starts after the text is ready. Callers feel the gap before the agent speaks, so the first byte matters more than total synthesis time. A good target is well under a few hundred milliseconds, and streaming TTS that begins speaking before the full sentence is generated is what makes natural pacing possible.
The trap is evaluating a voice on a downloaded sample, where latency is invisible. In a real call, a beautiful voice that starts slowly feels broken. Measure latency in context, under streaming, on your stack. A slightly less natural voice that starts instantly usually beats a perfect voice that lags.
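A minimal sketch of measuring first-audio latency under streaming is shown below. It assumes your TTS client exposes the response as an iterable of audio chunks; `fake_stream` is a hypothetical stand-in with made-up delays so the example runs without a real engine.

```python
import time


def time_to_first_audio(chunks, clock=time.perf_counter):
    """Return (seconds until first non-empty chunk, total seconds) for a chunk iterable."""
    start = clock()
    first = None
    for chunk in chunks:
        if first is None and chunk:
            first = clock() - start
    total = clock() - start
    if first is None:
        raise RuntimeError("stream produced no audio")
    return first, total


def fake_stream(first_delay=0.05, chunk_delay=0.01, n_chunks=5):
    """Stand-in for a streaming TTS response; replace with your engine's chunk iterator."""
    time.sleep(first_delay)  # simulated wait before the first audio arrives
    for _ in range(n_chunks):
        yield b"\x00" * 320
        time.sleep(chunk_delay)


ttfa, total = time_to_first_audio(fake_stream())
print(f"first audio after {ttfa * 1000:.0f} ms, full stream after {total * 1000:.0f} ms")
```

For a real measurement, start the clock when the text is handed to the engine, not when the first chunk is requested, and run it over the same network path your calls use.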
Subjective vs objective testing
These two approaches answer different questions, and a good program uses both. Subjective evaluation, like MOS, captures how the voice feels to people. Objective evaluation captures what you can measure: latency, pronunciation accuracy, and audio quality at a given sample rate.
Lean too far subjective and you cannot automate or compare reliably. Lean too far objective and you miss the human impression that drives trust. The strongest setup uses objective checks for fast, repeatable gating and periodic subjective scoring for the big picture. Neither alone is enough, which is the same lesson that runs through why voice agents fail in production.
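Objective gating can be as simple as comparing measured values against thresholds. The sketch below is illustrative only: the metric names and limits are hypothetical and should be replaced with values that fit your own use case.

```python
# Hypothetical limits; tune these to your own priorities.
THRESHOLDS = {
    "ttfb_ms": ("max", 300),
    "round_trip_wer": ("max", 0.05),
    "estimated_mos": ("min", 4.0),
}


def gate(measured, thresholds=THRESHOLDS):
    """Return a list of readable failures; an empty list means the gate passes."""
    failures = []
    for metric, (kind, limit) in thresholds.items():
        value = measured.get(metric)
        if value is None:
            failures.append(f"{metric}: not measured")
        elif kind == "max" and value > limit:
            failures.append(f"{metric}: {value} exceeds {limit}")
        elif kind == "min" and value < limit:
            failures.append(f"{metric}: {value} below {limit}")
    return failures


run = {"ttfb_ms": 450, "round_trip_wer": 0.02}
for problem in gate(run):
    print("FAIL", problem)
```

Treating a missing metric as a failure is a deliberate choice here: a gate that silently passes when a measurement was skipped defeats the point of automating it.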
Why it belongs in full-call testing
A voice never performs in isolation on a real call. It speaks dynamic text, full of names and numbers the demo never used, under latency pressure, right after the language model decides what to say. Testing a voice on a fixed script misses all of that.
That is why text-to-speech evaluation should run inside the call, on the content the agent actually generates. A voice that aces a scripted demo can still mispronounce the dynamic fields that appear in production. Our voice agent stack guide shows how TTS sits at the end of the pipeline, speaking whatever the earlier stages produce.
Common TTS evaluation pitfalls
A few mistakes recur. Judging a voice on a downloaded clip hides latency, because real-time streaming is where the first-byte delay actually shows. Testing only generic sentences hides the names and numbers that break in production. And picking a voice for naturalness alone ignores whether it starts fast enough for a live call.
Voice cloning adds its own trap. A cloned voice can sound right on a short sample yet drift on unusual words or long sentences, so it needs the same pronunciation and latency testing as any other voice. Treat a clone as unproven until it passes on your scripts.
Whatever the voice, re-test after every model or SSML change, on your real content, under streaming. A voice that passed last month can regress when the engine updates, and only re-testing on dynamic text will catch it before callers do.
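Re-testing is easier to act on when each run is compared against a stored baseline. The following sketch assumes you keep a per-phrase error rate from a previous run; the phrases, values, and function name are hypothetical.

```python
def find_regressions(baseline, current, tolerance=0.0):
    """Compare per-phrase error rates from two runs and return phrases that got worse."""
    regressions = {}
    for phrase, old in baseline.items():
        new = current.get(phrase)
        if new is None:
            continue  # phrase was not re-tested in this run
        if new > old + tolerance:
            regressions[phrase] = (old, new)
    return regressions


# Hypothetical round-trip error rates before and after an engine update.
baseline = {"Dr. Nguyen": 0.0, "order A-1047": 0.0, "555-0142": 0.25}
current = {"Dr. Nguyen": 0.5, "order A-1047": 0.0, "555-0142": 0.25}

for phrase, (old, new) in find_regressions(baseline, current).items():
    print(f"REGRESSED {phrase!r}: {old} -> {new}")
```

A small tolerance can absorb run-to-run noise from the STT side, at the cost of missing slight degradations.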
How do you evaluate TTS for a voice agent?
The best approach measures the voice inside realistic calls, on the dynamic text the agent really speaks, with latency and pronunciation captured in context.
This is where Evalgent fits. Evalgent runs realistic calls and measures the voice as one signal among many. Scenarios drive the agent through real scripts full of names, numbers, and jargon. Profiles vary the calls so pronunciation is tested across cases, not one happy-path clip. Metrics track latency and round-trip accuracy with custom thresholds. Evaluations run this as automated batches of synthetic callers, and Reviews let your team hear the exact moment a voice mispronounced a name or lagged on the first word.
The result is a voice you have verified in context, not in a demo. The sample tells you how it can sound; the call tells you how it actually performs. For the full discipline, see the AI voice agent testing pillar and the synthetic callers guide.
Frequently asked questions
What is TTS evaluation?
TTS evaluation is the process of measuring a text-to-speech system's naturalness, intelligibility, latency, and pronunciation accuracy on the content a voice agent actually speaks. It combines subjective scores like MOS with objective checks on latency and pronunciation. The goal is a voice that callers both understand and trust, tested on real scripts rather than a demo clip.
How do you evaluate text-to-speech quality?
Evaluate text-to-speech quality with both subjective and objective measures. The subjective side uses listener ratings, typically MOS, to capture how natural the voice sounds. The objective side measures latency, pronunciation accuracy, and audio quality. Test on your real scripts, not a vendor demo, and combine the two views so you cover both impression and measurable performance.
What is a MOS score?
A MOS, or mean opinion score, is the average rating listeners give a voice on a 1-to-5 scale, where 5 is perfectly natural. It is the standard subjective measure of text-to-speech quality. A high score means the voice sounds natural to people, which affects trust. MOS is slow to gather from humans, so it is often paired with faster objective checks.
How do you test TTS pronunciation?
Test TTS pronunciation by building a set of the words that matter in your domain: real customer names, numbers like phone numbers and account IDs, jargon, and edge cases such as emails and mixed-language strings. Use SSML to correct stubborn cases. A round-trip check, which synthesizes the text and then transcribes it back, flags words the voice is saying wrong.
What metrics are used for TTS evaluation?
The main metrics for TTS evaluation are MOS for naturalness, intelligibility for comprehension, pronunciation accuracy on names and numbers, latency measured as time to first byte, and prosody for natural rhythm. Track them together, since they trade off. A voice tuned only for naturalness can add latency, and a fast voice can sound robotic, so balance them to your use case.
What is a good TTS latency for voice agents?
TTS latency for voice agents is measured as time to first byte, meaning how soon audio starts after the text is ready, and a good target sits well under a few hundred milliseconds. Streaming TTS that begins speaking before the full sentence is ready is what makes natural pacing possible. A slightly less natural voice that starts instantly usually beats a perfect voice that lags.
Subjective vs objective TTS evaluation: which is better?
Neither subjective nor objective TTS evaluation is better alone; a good program uses both. Subjective measures like MOS capture how the voice feels to people. Objective measures capture latency, pronunciation accuracy, and audio quality, which are fast and repeatable. Use objective checks for automated gating and periodic subjective scoring for the human impression that drives trust.
How do you evaluate TTS for a voice agent?
Evaluate TTS for a voice agent inside realistic calls, on the dynamic text the agent actually speaks, with latency and pronunciation captured in context. Drive scenarios full of names, numbers, and jargon, and measure round-trip accuracy and time to first byte. Platform-agnostic testing with synthetic callers, such as Evalgent, makes this an in-context measurement rather than a demo clip.
Conclusion
TTS evaluation proves the part of your voice agent that callers actually hear. A great demo voice is not enough; the test is whether it stays natural, fast, and correct on the real, dynamic calls your agent handles every day.
Measure naturalness with MOS, but always pair it with latency and pronunciation checks. The voice is the last impression your agent makes, so evaluate it in the call, not in a clip. Test it on real scripts, under streaming, and re-test after every change.