Evalgent
Metrics

Define how you measure your voice agent

Set your own evaluation criteria — choose what to measure, how to score it, and what success looks like. Evalgent handles the rest.

Example metric (LLM-based)

Tone consistency
"Rate how consistently the agent maintains a calm, empathetic tone"
Scoring type: Scale 1–5
Threshold: ≥ 4
Evaluated across: All scenarios
Last run avg: 4

What is a voice agent metric?

A metric is an evaluation criterion that measures a specific quality of your voice agent — like tone, response speed, or knowledge accuracy. Metrics are not scenario-specific pass/fail checks. They score every conversation your agent handles, giving you a consistent measure of performance across all scenarios, runs, and versions.

1. Define: set what quality to measure and how to score it
2. Apply: the metric is evaluated across every conversation
3. Track: monitor trends across runs and versions
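
For concreteness, here is a minimal sketch of what a metric definition could look like as configuration. The TypeScript shape below (MetricDefinition, its field names, and the enum values) is illustrative, not a documented Evalgent schema.

```typescript
// Hypothetical metric definition. Field names and enum values are
// illustrative assumptions, not a documented Evalgent schema.
interface MetricDefinition {
  name: string;
  type: "llm" | "telemetry";
  criterion: string;                  // natural-language description of what to measure
  scoringType: "scale_1_5" | "binary";
  threshold: number | "pass";         // success cutoff applied to every conversation
  scope: "all_scenarios";             // metrics score every conversation, not one scenario
}

const toneConsistency: MetricDefinition = {
  name: "Tone consistency",
  type: "llm",
  criterion: "Rate how consistently the agent maintains a calm, empathetic tone",
  scoringType: "scale_1_5",
  threshold: 4,                       // a conversation passes at a score of 4 or higher
  scope: "all_scenarios",
};
```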

How do we ensure metric coverage?

Telemetry metrics

Metrics automatically extracted from call metadata — no AI judgment needed. Objective, fast, and always consistent. Measure response latency, call duration, silence ratios, and more.

Telemetry metrics (example)
Response latency: 320 ms
Call duration: 2m 14s
Silence ratio: 8%
Interruption count: 3
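
To make the distinction concrete, here is a rough sketch of how telemetry metrics can be computed purely from call metadata, with no model in the loop. The TurnEvent shape and the assumption of non-overlapping turns are ours, not Evalgent internals.

```typescript
// Hypothetical per-turn event extracted from call metadata.
interface TurnEvent {
  speaker: "agent" | "caller";
  startMs: number;      // turn start, relative to call start
  endMs: number;        // turn end
  interrupted: boolean; // whether this turn was cut off by the other party
}

function telemetryMetrics(events: TurnEvent[], callDurationMs: number) {
  // Response latency: gap between a caller turn ending and the next agent turn starting.
  const gaps: number[] = [];
  for (let i = 1; i < events.length; i++) {
    if (events[i - 1].speaker === "caller" && events[i].speaker === "agent") {
      gaps.push(events[i].startMs - events[i - 1].endMs);
    }
  }
  // Total speech time; assumes turns do not overlap.
  const speechMs = events.reduce((sum, e) => sum + (e.endMs - e.startMs), 0);
  return {
    avgResponseLatencyMs:
      gaps.length > 0 ? gaps.reduce((a, b) => a + b, 0) / gaps.length : 0,
    callDurationMs,
    silenceRatio: 1 - speechMs / callDurationMs,
    interruptionCount: events.filter((e) => e.interrupted).length,
  };
}
```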

LLM-based metrics

AI evaluates conversation quality against your defined criteria. Define what to measure in natural language, choose a scoring type, and set a success threshold.

LLM-based metrics (example)
Tone consistency: 4 (Scale 1–5, threshold ≥ 4)
Knowledge accuracy: Pass (Binary, threshold: Pass)
Instruction adherence: 2 (Scale 1–5, threshold ≥ 3)
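
Under the hood, this kind of scoring typically follows an LLM-as-judge pattern. The sketch below shows one common way to implement it with the OpenAI SDK; the prompt, model choice, and scoreMetric helper are illustrative assumptions, not Evalgent's documented implementation.

```typescript
// One common LLM-as-judge pattern: ask a model to score a transcript
// against a natural-language criterion on a fixed scale.
import OpenAI from "openai";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function scoreMetric(transcript: string, criterion: string): Promise<number> {
  const res = await client.chat.completions.create({
    model: "gpt-4o-mini", // any judge model works here
    messages: [
      {
        role: "system",
        content:
          "You are an evaluator. Score the conversation against the criterion " +
          "on a 1-5 scale. Reply with the number only.",
      },
      { role: "user", content: `Criterion: ${criterion}\n\nTranscript:\n${transcript}` },
    ],
  });
  // A non-numeric reply yields NaN; production code would retry or validate.
  return Number(res.choices[0].message.content?.trim());
}

// A score passes when it meets the metric's threshold, e.g. >= 4.
const passes = (score: number, threshold: number) => score >= threshold;
```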

Already measuring internally? Good — bring those same metrics here.

Limited visibility

Internal testing today

  • You measure metrics on real production calls — after they happen
  • Or on scripted test calls that don't reflect real-world conditions
  • No consistent simulation environment
  • Metrics exist, but the test conditions don't match reality

Production-ready

With Evalgent

  • Same metrics, but evaluated inside realistic simulated calls
  • Synthetic callers replicate real accents, noise, interruptions, and conversational chaos
  • Your agent is tested under production-like conditions — before it reaches production
  • Metrics stay in sync between your internal system and your evaluation environment

"The gap isn't in the metrics — it's in the environment you test them in. Most teams measure on clean, scripted calls. Evalgent measures on calls that behave like real ones."

The difference a well-defined metric makes

Unreliable

Poorly defined metric

  • "Agent should sound professional" — too vague to score consistently
  • No measurable threshold — different evaluators, different results
  • Can't compare across versions — no baseline to track
  • Results you can't act on

Actionable

Properly defined metric

  • "Rate how consistently the agent maintains a calm, empathetic tone" — specific and scoreable
  • Clear scoring type (scale 1–5) with defined success threshold (≥ 4)
  • Consistent results across all scenarios and runs
  • Track quality trends version over version
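
A defined scoring type and threshold are what make results comparable run over run. As a rough illustration (the types and names below are hypothetical), aggregating per-run scores against a fixed cutoff is all it takes to flag a regression between versions.

```typescript
// Hypothetical run result: one score per evaluated conversation.
interface RunResult {
  version: string;
  scores: number[];
}

function runAverage(run: RunResult): number {
  return run.scores.reduce((a, b) => a + b, 0) / run.scores.length;
}

// Compare a candidate agent version against a baseline using the same
// threshold every run, so trends are meaningful version over version.
function compareVersions(baseline: RunResult, candidate: RunResult, threshold: number) {
  const before = runAverage(baseline);
  const after = runAverage(candidate);
  return {
    passes: after >= threshold, // same cutoff on every run
    delta: after - before,      // positive = improvement, negative = regression
  };
}
```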

Know if your voice agent is ready for production