Evalgent
Metrics

Define how you measure your voice agent

Set your own evaluation criteria — choose what to measure, how to score it, and what success looks like. Evalgent handles the rest.

Example metric (LLM-based)

Tone consistency
"Rate how consistently the agent maintains a calm, empathetic tone"
Scoring type: Scale 1–5
Threshold: ≥ 4
Evaluated across: All scenarios
Last run avg: 4

What is a voice agent metric?

A metric is an evaluation criterion that measures a specific quality of your voice agent — like tone, response speed, or knowledge accuracy. Metrics are not scenario-specific pass/fail checks. They score every conversation your agent handles, giving you a consistent measure of performance across all scenarios, runs, and versions.

1. Define: set what quality to measure and how to score it
2. Apply: the metric is evaluated across every conversation
3. Track: monitor trends across runs and versions
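
For concreteness, here is a minimal sketch of what a metric definition could look like as configuration. The TypeScript shape below (MetricDefinition, its field names, and the enum values) is illustrative, not a documented Evalgent schema.

```typescript
// Hypothetical metric definition. Field names and enum values are
// illustrative assumptions, not a documented Evalgent schema.
interface MetricDefinition {
  name: string;
  type: "llm" | "telemetry";
  criterion: string;                  // natural-language description of what to measure
  scoringType: "scale_1_5" | "binary";
  threshold: number | "pass";         // success cutoff applied to every conversation
  scope: "all_scenarios";             // metrics score every conversation, not one scenario
}

const toneConsistency: MetricDefinition = {
  name: "Tone consistency",
  type: "llm",
  criterion: "Rate how consistently the agent maintains a calm, empathetic tone",
  scoringType: "scale_1_5",
  threshold: 4,                       // a conversation passes at a score of 4 or higher
  scope: "all_scenarios",
};
```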

How do we ensure metric coverage?

Telemetry metrics

Metrics automatically extracted from call metadata — no AI judgment needed. Objective, fast, and always consistent. Measure response latency, call duration, silence ratios, and more.

Telemetry metrics (example)
Response latency: 320 ms
Call duration: 2m 14s
Silence ratio: 8%
Interruption count: 3
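
To make the distinction concrete, here is a rough sketch of how telemetry metrics can be computed purely from call metadata, with no model in the loop. The TurnEvent shape and the assumption of non-overlapping turns are ours, not Evalgent internals.

```typescript
// Hypothetical per-turn event extracted from call metadata.
interface TurnEvent {
  speaker: "agent" | "caller";
  startMs: number;      // turn start, relative to call start
  endMs: number;        // turn end
  interrupted: boolean; // whether this turn was cut off by the other party
}

function telemetryMetrics(events: TurnEvent[], callDurationMs: number) {
  // Response latency: gap between a caller turn ending and the next agent turn starting.
  const gaps: number[] = [];
  for (let i = 1; i < events.length; i++) {
    if (events[i - 1].speaker === "caller" && events[i].speaker === "agent") {
      gaps.push(events[i].startMs - events[i - 1].endMs);
    }
  }
  // Total speech time; assumes turns do not overlap.
  const speechMs = events.reduce((sum, e) => sum + (e.endMs - e.startMs), 0);
  return {
    avgResponseLatencyMs:
      gaps.length > 0 ? gaps.reduce((a, b) => a + b, 0) / gaps.length : 0,
    callDurationMs,
    silenceRatio: 1 - speechMs / callDurationMs,
    interruptionCount: events.filter((e) => e.interrupted).length,
  };
}
```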

LLM-based metrics

AI evaluates conversation quality against your defined criteria. Define what to measure in natural language, choose a scoring type, and set a success threshold.

LLM-based metrics (example)
Tone consistency: 4 (Scale 1–5, threshold ≥ 4)
Knowledge accuracy: Pass (Binary, threshold: Pass)
Instruction adherence: 2 (Scale 1–5, threshold ≥ 3)
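
Under the hood, this kind of scoring typically follows an LLM-as-judge pattern. The sketch below shows one common way to implement it with the OpenAI SDK; the prompt, model choice, and scoreMetric helper are illustrative assumptions, not Evalgent's documented implementation.

```typescript
// One common LLM-as-judge pattern: ask a model to score a transcript
// against a natural-language criterion on a fixed scale.
import OpenAI from "openai";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function scoreMetric(transcript: string, criterion: string): Promise<number> {
  const res = await client.chat.completions.create({
    model: "gpt-4o-mini", // any judge model works here
    messages: [
      {
        role: "system",
        content:
          "You are an evaluator. Score the conversation against the criterion " +
          "on a 1-5 scale. Reply with the number only.",
      },
      { role: "user", content: `Criterion: ${criterion}\n\nTranscript:\n${transcript}` },
    ],
  });
  // A non-numeric reply yields NaN; production code would retry or validate.
  return Number(res.choices[0].message.content?.trim());
}

// A score passes when it meets the metric's threshold, e.g. >= 4.
const passes = (score: number, threshold: number) => score >= threshold;
```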

Already measuring internally? Good — bring those same metrics here.

Limited visibility

Internal testing today

  • You measure metrics on real production calls — after they happen
  • Or on scripted test calls that don't reflect real-world conditions
  • No consistent simulation environment
  • Metrics exist, but the test conditions don't match reality

Production-ready

With Evalgent

  • Same metrics, but evaluated inside realistic simulated calls
  • Synthetic callers replicate real accents, noise, interruptions, and conversational chaos
  • Your agent is tested under production-like conditions — before it reaches production
  • Metrics stay in sync between your internal system and your evaluation environment

"The gap isn't in the metrics — it's in the environment you test them in. Most teams measure on clean, scripted calls. Evalgent measures on calls that behave like real ones."

The difference a well-defined metric makes

Unreliable

Poorly defined metric

  • "Agent should sound professional" — too vague to score consistently
  • No measurable threshold — different evaluators, different results
  • Can't compare across versions — no baseline to track
  • Results you can't act on

Actionable

Properly defined metric

  • "Rate how consistently the agent maintains a calm, empathetic tone" — specific and scoreable
  • Clear scoring type (scale 1–5) with defined success threshold (≥ 4)
  • Consistent results across all scenarios and runs
  • Track quality trends version over version
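
A defined scoring type and threshold are what make results comparable run over run. As a rough illustration (the types and names below are hypothetical), aggregating per-run scores against a fixed cutoff is all it takes to flag a regression between versions.

```typescript
// Hypothetical run result: one score per evaluated conversation.
interface RunResult {
  version: string;
  scores: number[];
}

function runAverage(run: RunResult): number {
  return run.scores.reduce((a, b) => a + b, 0) / run.scores.length;
}

// Compare a candidate agent version against a baseline using the same
// threshold every run, so trends are meaningful version over version.
function compareVersions(baseline: RunResult, candidate: RunResult, threshold: number) {
  const before = runAverage(baseline);
  const after = runAverage(candidate);
  return {
    passes: after >= threshold, // same cutoff on every run
    delta: after - before,      // positive = improvement, negative = regression
  };
}
```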

Know if your voice agent is ready for production