A/B testing voice agents: how to compare versions the right way

Every change to a voice agent is a bet: a new prompt, a faster model, a different voice. A/B testing voice agents turns that bet into evidence. Instead of guessing whether version B is better, you run both on identical calls and read the numbers side by side. This guide covers how to structure the test, which metrics matter, and how many calls you actually need.
Evalgent is built to run exactly this kind of comparison offline, before production. We come back to that. First, the method.
What is A/B testing for voice agents?
A/B testing voice agents means comparing two versions of an agent on the same set of calls, changing one variable, and measuring which version performs better on metrics defined in advance.
Classic A/B testing splits live traffic between two variants. For voice agents you can do that in production, but the stronger approach is offline: run both versions against the same synthetic scenarios and compare. That removes the noise of different callers hitting different versions.
The offline approach also makes results repeatable, since the same calls can be replayed against any future version.
The core rule is one variable at a time. If you change the prompt and the model together, a difference in results tells you nothing about which change caused it. Hold everything constant except the one thing you are testing.
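One way to enforce the one-variable rule is to diff the two configurations before any call runs. The sketch below assumes each version is described by a flat dictionary; the field names (`prompt`, `model`, `voice`) and both helper functions are illustrative, not part of any particular tool.

```python
_MISSING = object()  # sentinel so a missing key differs from an explicit None


def changed_fields(control: dict, variant: dict) -> list[str]:
    """Return the sorted names of config fields whose values differ."""
    keys = set(control) | set(variant)
    return sorted(
        k for k in keys
        if control.get(k, _MISSING) != variant.get(k, _MISSING)
    )


def single_change(control: dict, variant: dict) -> str:
    """Return the one changed field, or raise if the test is confounded."""
    diff = changed_fields(control, variant)
    if len(diff) != 1:
        raise ValueError(f"expected exactly one changed field, got {diff}")
    return diff[0]


# Hypothetical configs: only the prompt differs between the two versions.
control = {"prompt": "v1", "model": "model-a", "voice": "voice-x"}
variant = {**control, "prompt": "v2"}
print(single_change(control, variant))
```

Running a check like this at the start of every experiment makes a confounded comparison fail loudly instead of producing a result nobody can interpret.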
How do you A/B test a voice agent?
The structure is simple and repeatable. Keep the calls identical so the only difference is the version.
1. Define the variants: a control (current version) and a variant (the change).
2. Fix the scenarios: the same set of calls runs against both, using synthetic callers.
3. Change one thing: prompt, model, voice, or a single parameter.
4. Run both: execute the identical scenario set against each version.
5. Compare metrics: task completion, latency, adherence, per cohort.
6. Decide: ship the winner, or iterate if the result is inconclusive.
Because the scenarios are identical, any difference in outcome beyond run-to-run variation is attributable to the one change. This is the offline equivalent of a controlled experiment, and it is far cleaner than splitting live traffic.
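The six steps above can be sketched as a small harness. This is a minimal illustration, not a real integration: the `Agent` interface, the `CallResult` fields, and the two stub agents are assumptions standing in for whatever actually places and scores your calls.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class CallResult:
    scenario_id: str
    completed: bool
    latency_ms: float


# Hypothetical agent interface: takes a scenario ID, returns the call outcome.
Agent = Callable[[str], CallResult]


def run_ab(scenarios: list[str], control: Agent, variant: Agent) -> dict[str, list[CallResult]]:
    """Run the identical scenario list against both versions."""
    return {
        "control": [control(s) for s in scenarios],
        "variant": [variant(s) for s in scenarios],
    }


def completion_rate(results: list[CallResult]) -> float:
    return sum(r.completed for r in results) / len(results)


def mean_latency_ms(results: list[CallResult]) -> float:
    return sum(r.latency_ms for r in results) / len(results)


# Stub agents for illustration only; real ones would run a synthetic call.
def stub_control(scenario_id: str) -> CallResult:
    return CallResult(scenario_id, completed=scenario_id != "refund", latency_ms=900.0)


def stub_variant(scenario_id: str) -> CallResult:
    return CallResult(scenario_id, completed=True, latency_ms=1100.0)


scenarios = ["booking", "refund", "reschedule", "cancel"]
results = run_ab(scenarios, stub_control, stub_variant)
for name, calls in results.items():
    print(name, completion_rate(calls), mean_latency_ms(calls))
```

In this toy run the variant completes more calls but responds more slowly, which is exactly the kind of trade-off the decision step has to weigh.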
How do you compare two voice agent prompts?
Prompt comparison is the most common A/B test, because prompts change constantly. The method is the same: same calls, two prompts, one difference.
Run the identical scenario set against prompt A and prompt B, then compare task completion and adherence side by side. Watch for regressions as much as gains, since a prompt tweak that helps one flow often breaks another. This side-by-side prompt A/B testing is what stops a well-meaning edit from quietly degrading production. The best LLM for voice agents guide covers the model side of the same idea.
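Because both prompts run on the same scenarios, results can be paired call by call, which makes regressions easy to list. The sketch below assumes each run is summarised as a mapping from scenario ID to pass or fail; the scenario names and the `paired_changes` helper are hypothetical.

```python
def paired_changes(
    prompt_a: dict[str, bool], prompt_b: dict[str, bool]
) -> tuple[list[str], list[str]]:
    """Compare per-scenario outcomes and return (regressions, gains).

    A regression is a scenario prompt A passed and prompt B failed;
    a gain is the reverse.
    """
    if prompt_a.keys() != prompt_b.keys():
        raise ValueError("both prompts must be run on the same scenarios")
    regressions = sorted(s for s in prompt_a if prompt_a[s] and not prompt_b[s])
    gains = sorted(s for s in prompt_a if not prompt_a[s] and prompt_b[s])
    return regressions, gains


# Hypothetical outcomes: True means the caller's task was completed.
prompt_a = {"booking": True, "refund": False, "reschedule": True, "cancel": True}
prompt_b = {"booking": True, "refund": True, "reschedule": False, "cancel": True}

regressions, gains = paired_changes(prompt_a, prompt_b)
print("regressions:", regressions)
print("gains:", gains)
```

Here both prompts complete three of four calls, so the headline rate is unchanged, yet prompt B fixed one flow and broke another. Only the paired view shows that.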
How do you test voice variations?
Voice variation testing compares different TTS voices or speech settings on the same scripts. The trap is judging a voice on a demo clip; the real test is how callers respond across your actual content.
Run the same calls with each voice and measure completion, and where possible caller sentiment and drop-off. A voice that sounds pleasant in isolation can pronounce names badly or pace numbers poorly in context. Voice variation testing catches that before it reaches callers, using the same controlled structure as any other A/B test. Skip generic sample sentences and test the voices on the numbers, names, and phrases your callers actually hear, because that is where synthetic voices most often stumble.
What metrics matter for voice agent A/B testing?
The metric you optimise decides the winner, so choose it before you run. A single vanity number leads to the wrong call.
- Task completion: did the caller achieve their goal? The bottom line.
- Latency: did response time change, and by how much?
- Instruction adherence: did the version follow policy and required steps?
- Containment: did it resolve without escalation?
- Per-cohort splits: did the winner win across accents and conditions, or only on easy calls?
Read these together. A variant that lifts completion but adds latency may not be a win. And always split by cohort, because an average can hide a variant that improved overall while breaking one accent group.
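A small worked example shows how an average can hide a broken cohort. The cohort labels and call counts below are invented for illustration, and the helper names are assumptions rather than any tool's API.

```python
from collections import defaultdict

Record = tuple[str, bool]  # (cohort label, task completed)


def rate_by_cohort(records: list[Record]) -> dict[str, float]:
    """Completion rate for each cohort."""
    totals: dict[str, list[int]] = defaultdict(lambda: [0, 0])
    for cohort, completed in records:
        totals[cohort][0] += int(completed)
        totals[cohort][1] += 1
    return {cohort: done / n for cohort, (done, n) in totals.items()}


def overall_rate(records: list[Record]) -> float:
    return sum(completed for _, completed in records) / len(records)


def regressed_cohorts(control: list[Record], variant: list[Record]) -> list[str]:
    """Cohorts where the variant completes fewer calls than the control."""
    base, new = rate_by_cohort(control), rate_by_cohort(variant)
    return sorted(c for c in base if new.get(c, 0.0) < base[c])


def make_records(cohort: str, completed: int, total: int) -> list[Record]:
    """Build illustrative records: `completed` successes out of `total` calls."""
    return [(cohort, i < completed) for i in range(total)]


# Invented data: the variant helps the large cohort and hurts the small one.
control = make_records("accent_a", 60, 80) + make_records("accent_b", 16, 20)
variant = make_records("accent_a", 72, 80) + make_records("accent_b", 10, 20)

print(overall_rate(control), overall_rate(variant))
print(regressed_cohorts(control, variant))
```

With these numbers the variant lifts overall completion from 76% to 82% while completion for `accent_b` falls from 80% to 50%. The overall figure alone would have shipped a regression.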
How many calls do you need for A/B test significance?
This is where voice A/B tests go wrong. Ten calls prove nothing; a difference seen in so small a sample could be pure chance. You need enough calls per variant that the observed difference is unlikely to be chance alone, which is what a statistical significance test checks.
There is no universal number, but a few principles hold. The smaller the true difference, the more calls you need to detect it. Rare failure modes need more volume to show up at all. And you must run the same scenario distribution against both variants. Synthetic callers make the required volume cheap, so you can run hundreds of calls per variant instead of a handful, which is often the difference between a real result and a coin flip.
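For a rough planning figure, the standard normal-approximation formula for comparing two proportions gives the calls needed per variant. The sketch below assumes a two-sided test at 5% significance with 80% power and treats the two samples as independent; those defaults and the example completion rates are assumptions, so read the output as an order of magnitude rather than a guarantee.

```python
from math import ceil
from statistics import NormalDist


def calls_per_variant(
    p_control: float, p_variant: float, alpha: float = 0.05, power: float = 0.80
) -> int:
    """Approximate calls per variant needed to detect p_control vs p_variant.

    Uses n = (z_alpha + z_power)^2 * (p1(1-p1) + p2(1-p2)) / (p1 - p2)^2.
    """
    if p_control == p_variant:
        raise ValueError("the two rates must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_power = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_variant * (1 - p_variant)
    effect = p_variant - p_control
    return ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


# Hypothetical completion rates: 80% today, hoping for 85% or 90%.
small_lift = calls_per_variant(0.80, 0.85)
large_lift = calls_per_variant(0.80, 0.90)
print(small_lift, large_lift)
```

Under these assumptions, detecting a lift from 80% to 85% completion takes roughly 900 calls per variant, while a lift to 90% takes roughly 200. Halving the difference you want to detect roughly quadruples the calls required.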
A/B testing vs offline evaluation
These two overlap but are not the same. Offline evaluation scores one version against criteria. A/B testing compares two versions head to head. You often use evaluation inside an A/B test, as the scoring layer that decides which variant won.
The practical difference is intent. Evaluation asks "is this version good enough to ship." A/B testing asks "is this version better than the one we have." The strongest workflow uses both: evaluate each version, then A/B test the candidate against the current agent before promoting it. Our AI voice agent testing pillar covers the wider discipline.
Building experimentation into your workflow
A single A/B test is useful; voice agent experimentation as a habit is transformative. Teams that compare voice agent versions on every meaningful change catch regressions before they ship. They also build a record of what works. The goal is a repeatable loop, not a one-off exercise.
A few practices make A/B testing for voice AI reliable at scale. Fix a holdout set of scenarios you never tune against, so you measure honest generalisation rather than overfitting to your test cases. Decide the sample size before you run. Base it on how small a difference you care about. That keeps you from stopping early on noise. Keep a regression suite alongside the experiment. A variant can win on the target metric yet break an unrelated flow. The suite catches that in the same pass.
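One simple way to keep a holdout fixed is to assign scenarios by hashing their IDs, so a scenario's membership never changes between runs or as new scenarios are added. This is a sketch under assumptions: the scenario IDs are invented, and the 20% holdout share is an arbitrary default.

```python
import hashlib


def is_holdout(scenario_id: str, holdout_pct: int = 20) -> bool:
    """Deterministically assign a scenario to the holdout set by its ID."""
    digest = hashlib.sha256(scenario_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < holdout_pct


def split_scenarios(
    scenario_ids: list[str], holdout_pct: int = 20
) -> tuple[list[str], list[str]]:
    """Return (tuning, holdout); tune against the first, report on the second."""
    tuning = [s for s in scenario_ids if not is_holdout(s, holdout_pct)]
    holdout = [s for s in scenario_ids if is_holdout(s, holdout_pct)]
    return tuning, holdout


# Hypothetical scenario IDs.
scenario_ids = [f"scenario-{i}" for i in range(200)]
tuning, holdout = split_scenarios(scenario_ids)
print(len(tuning), len(holdout))
```

Hashing only approximates the target share, so check the holdout size and confirm it still covers the cohorts you care about.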
Treat every promotion as a comparison. Before a new prompt, model, or voice reaches production, it should have beaten the current agent on the same calls. That single rule turns experimentation from an occasional exercise into a release gate. Over time, the record of past experiments becomes an asset. You stop relitigating settled questions. You focus on the changes that still move the numbers.
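A release gate of this kind can be as small as one function. The sketch below promotes a variant only if it has no paired regressions and its completion rate beats the control under a pooled two-proportion z-test. The thresholds and function names are assumptions. The z-test also ignores the pairing between calls, so treat it as a rough check; a paired test such as McNemar's may suit the same-scenario design better.

```python
from math import sqrt
from statistics import NormalDist


def one_sided_p_value(
    control_ok: int, control_n: int, variant_ok: int, variant_n: int
) -> float:
    """P-value for 'variant completes more often than control' (pooled z-test)."""
    p_control = control_ok / control_n
    p_variant = variant_ok / variant_n
    pooled = (control_ok + variant_ok) / (control_n + variant_n)
    se = sqrt(pooled * (1 - pooled) * (1 / control_n + 1 / variant_n))
    if se == 0:
        return 1.0  # every call had the same outcome: no evidence either way
    z = (p_variant - p_control) / se
    return 1 - NormalDist().cdf(z)


def should_promote(
    control_ok: int,
    control_n: int,
    variant_ok: int,
    variant_n: int,
    regressions: int,
    alpha: float = 0.05,
) -> bool:
    """Gate: no regressions, and a statistically significant completion gain."""
    if regressions > 0:
        return False
    return one_sided_p_value(control_ok, control_n, variant_ok, variant_n) < alpha


# Hypothetical runs: the same 80% vs 90% split at two sample sizes.
print(should_promote(160, 200, 180, 200, regressions=0))
print(should_promote(8, 10, 9, 10, regressions=0))
```

The same ten-point gap passes the gate at 200 calls per variant and fails at 10, because at that sample size the difference is indistinguishable from chance.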
One caution: an experiment is only as good as its scenarios. If your test set does not reflect real callers, a winning variant may still lose in production. Refresh the scenarios from real call patterns, keep the holdout representative, and re-run past experiments when your traffic shifts.
Running A/B tests offline with Evalgent
The cleanest A/B test never touches production. Evalgent runs both versions against the same synthetic calls, so the only variable is your change. Scenarios hold the calls constant across variants. Profiles vary accents and behaviour so you can split results per cohort. Metrics score task completion, latency, and adherence with custom thresholds. Evaluations run the identical batch against each version, and Reviews let your team hear where one variant beat the other.
The result is a controlled comparison with a clear winner, run before release rather than on real callers. You change one thing, re-run the same calls, and keep the version that wins. Because the run is automated, you can compare far more versions than manual calling would ever allow. For the production side of watching those changes, see voice agent observability and the STT evaluation guide.
Frequently asked questions
What is A/B testing for voice agents?
A/B testing for voice agents means comparing two versions of an agent on the same set of calls and measuring which performs better. You change one variable, such as a prompt, model, or voice, keep the scenarios identical, and compare metrics like task completion and latency. The controlled structure makes any difference in results attributable to the change.
How do you A/B test a voice agent?
A/B test a voice agent by defining a control and a variant, fixing the scenario set, changing exactly one thing, and running the identical calls against both versions. Then compare task completion, latency, and adherence per cohort, and ship the winner. Using synthetic callers keeps the calls identical across variants, which is what makes the comparison clean.
How do you compare two voice agent prompts?
Compare two voice agent prompts by running the same scenario set against prompt A and prompt B, changing nothing else. Measure task completion and instruction adherence side by side, and watch for regressions as well as gains, since a prompt tweak that helps one flow can break another. This side-by-side test prevents a small edit from silently degrading production.
How do you test voice variations?
Test voice variations by running the same call scripts with each candidate voice and measuring completion, and where possible caller sentiment and drop-off. Judge the voice in context, not on a demo clip, because a pleasant voice can mispronounce names or pace numbers badly on real content. Keep the calls identical so the voice is the only variable.
What metrics matter for voice agent A/B testing?
The metrics that matter are task completion, latency, instruction adherence, and containment, split per cohort. Choose the primary metric before running, since it decides the winner. Read the metrics together, because a variant that lifts completion but adds latency may not be a real win, and always split by cohort so an average does not hide a broken accent group.
How many calls do you need for A/B test significance?
There is no universal number, but the smaller the true difference between versions, the more calls per variant you need to detect it reliably. Rare failure modes need more volume to appear at all. Synthetic callers make hundreds of calls per variant cheap, which is usually the difference between a statistically meaningful result and a chance outcome.
A/B testing vs offline evaluation: what is the difference?
Offline evaluation scores one version against defined criteria; A/B testing compares two versions head to head. Evaluation asks whether a version is good enough to ship, while A/B testing asks whether it is better than the current agent. They work together: evaluate each version, then A/B test the candidate against the current one before promoting it to production.
Can you A/B test voice models?
Yes. A/B testing voice models means running the same scenario set against two models, changing only the model, and comparing task completion, latency, and adherence. Because latency and instruction-following vary widely between models, holding the calls constant is the only way to attribute a difference to the model rather than to the calls. Synthetic callers make this comparison repeatable.
Conclusion
A/B testing voice agents replaces opinion with evidence: run two versions on the same calls, change one thing, and let the metrics pick the winner. The discipline is holding the scenarios constant and measuring per cohort, so the difference you see is real.
Run it offline before release, not on live callers. The version you ship should be the one that already beat your current agent on the calls that matter most, measured on a representative sample.