Challenge LLM judgements with human-in-the-loop reviews
When the AI gets a verdict wrong, flag it. Submit an appeal, provide context, and get the outcome corrected — keeping your evaluation scores accurate.
Refund reason collected
Success condition — Refund request handling
Appeal comment
"The caller confirmed the reason at turn 4 — the LLM missed the implicit confirmation."
What is a voice agent review?
A review is a human-in-the-loop correction. When an LLM scores a condition incorrectly, you submit an appeal explaining why the judgement is wrong. A reviewer examines the evidence, approves or rejects the appeal, and, if approved, corrects the outcome and recalculates your metrics.
Reviews ensure your evaluation scores stay accurate by letting domain experts correct the mistakes that automated scoring inevitably makes.
How does the review process work?
Flag a judgement
Select a condition from your evaluation results and challenge the LLM's verdict. See the evidence and transcript context before flagging.
Evidence from transcript
Submit your appeal
Write a comment explaining why the judgement is incorrect. Reference the specific turns or evidence the LLM missed.
Condition
Current verdict
Your comment
Get a decision
A reviewer examines the appeal, the original evidence, and the transcript. They approve with a corrected outcome or reject with notes.
Reviewer note
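For readers who think in code, the three steps above can be sketched as a small state machine. This is only an illustration under stated assumptions, not the product's API: the `Appeal` class, the `Status` enum, the `decide` and `effective_verdict` functions, and the choice to model a verdict as a simple pass/fail boolean are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class Appeal:
    """Hypothetical appeal record; field names mirror the form labels above."""
    condition: str                 # the condition being challenged
    current_verdict: bool          # the LLM's original pass/fail verdict
    comment: str                   # why the verdict is believed to be wrong
    status: Status = Status.PENDING
    corrected_verdict: Optional[bool] = None
    reviewer_note: str = ""


def decide(appeal: Appeal, approve: bool, note: str) -> Appeal:
    """Record the reviewer's decision on a pending appeal."""
    if appeal.status is not Status.PENDING:
        raise ValueError("appeal has already been decided")
    appeal.reviewer_note = note
    if approve:
        appeal.status = Status.APPROVED
        # Assumption: with a binary verdict, correcting it means flipping it.
        appeal.corrected_verdict = not appeal.current_verdict
    else:
        appeal.status = Status.REJECTED
    return appeal


def effective_verdict(appeal: Appeal) -> bool:
    """Return the verdict that should count toward scores."""
    if appeal.status is Status.APPROVED:
        return bool(appeal.corrected_verdict)
    return appeal.current_verdict


# Usage: the refund example from the top of the page.
appeal = Appeal(
    condition="Refund reason collected",
    current_verdict=False,
    comment="The caller confirmed the reason at turn 4.",
)
decide(appeal, approve=True, note="Implicit confirmation at turn 4 is valid.")
print(appeal.status.value, effective_verdict(appeal))
```

A rejected appeal keeps the original verdict but still stores the reviewer note, so every decision leaves a record either way.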
What you get back
Corrected outcomes
Approved appeals replace the original LLM judgement with the correct outcome
Recalculated metrics
SSR scores and pass/fail verdicts update automatically after corrections
Audit trail
Every appeal, decision, and reviewer note is preserved for traceability
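To make the effect of a correction on metrics concrete, here is a minimal sketch that uses a plain pass rate as a stand-in for whatever score the platform actually computes. The function names `apply_corrections` and `pass_rate` are hypothetical, and every condition name other than "Refund reason collected" is invented for illustration.

```python
def apply_corrections(llm_verdicts: dict, corrections: dict) -> dict:
    """Overlay approved corrections on the original LLM verdicts."""
    unknown = set(corrections) - set(llm_verdicts)
    if unknown:
        raise KeyError(f"corrections for unknown conditions: {sorted(unknown)}")
    # The original dict is left untouched so the LLM's verdicts stay auditable.
    return {**llm_verdicts, **corrections}


def pass_rate(verdicts: dict) -> float:
    """Fraction of conditions that passed (0.0 when there are none)."""
    if not verdicts:
        return 0.0
    return sum(1 for passed in verdicts.values() if passed) / len(verdicts)


# Hypothetical evaluation with four conditions; one was appealed and approved.
llm_verdicts = {
    "Refund reason collected": False,
    "Identity verified": True,
    "Refund amount confirmed": True,
    "Call closed politely": False,
}
approved_corrections = {"Refund reason collected": True}

before = pass_rate(llm_verdicts)
after = pass_rate(apply_corrections(llm_verdicts, approved_corrections))
print(before, after)  # 0.5 0.75
```

Keeping the original verdicts separate from the corrections, rather than overwriting them, is one simple way to preserve the audit trail described above.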
The difference human reviews make
Trust the LLM blindly
- Accept every LLM judgement at face value
- No way to correct false positives or false negatives
- Metrics drift from reality over time
Human-corrected accuracy
- Challenge any verdict with a structured appeal
- Corrected outcomes feed back into your scores
- Continuous improvement loop between human and AI