A benchmark is only as useful as what it can see.
For voice agents — AI systems that conduct spoken conversations to complete tasks — most evaluation has been component-level. You test the speech recognition separately. You test the language model separately. You test the text-to-speech separately. Each part passes. The full system ships.
The problem: failures compound. A slight speech recognition misunderstanding becomes a wrong interpretation becomes an incorrect action becomes a confused follow-up. None of those steps fails catastrophically in isolation. Together, they produce an agent that users find unreliable in ways that are hard to pin down.
ServiceNow Research just released EVA-Bench, an open-source framework that tries to measure what component benchmarks miss.
How EVA-Bench works:
The core idea is bot-to-bot evaluation. Instead of human testers listening to voice agents, EVA-Bench uses a separate AI agent to play the role of the user. This "user bot" conducts a realistic multi-turn spoken conversation with the voice agent being tested, and a judge model then evaluates the result.
This is fully automated — from speech input to final judgment — and can scale to thousands of conversations without human listeners.
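To make the loop concrete, here is a minimal sketch of one bot-to-bot episode. Every name in it (`run_episode`, `scripted_user`, `echo_agent`, `toy_judge`) is hypothetical and not taken from the EVA-Bench codebase, and plain strings stand in for the audio a real run would pass through speech synthesis and recognition.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical types for illustration only; this is not EVA-Bench's API.
Turn = Tuple[str, str]  # (speaker, utterance)


@dataclass
class Conversation:
    goal: str
    turns: List[Turn] = field(default_factory=list)


def run_episode(
    goal: str,
    user_bot: Callable[[Conversation], Optional[str]],
    agent: Callable[[Conversation], str],
    judge: Callable[[Conversation], Dict[str, object]],
    max_turns: int = 12,
) -> Dict[str, object]:
    """Alternate user-bot and agent turns, then hand the transcript to a judge."""
    convo = Conversation(goal=goal)
    for _ in range(max_turns):
        user_says = user_bot(convo)
        if user_says is None:  # the user bot signals that it is done
            break
        convo.turns.append(("user", user_says))
        convo.turns.append(("agent", agent(convo)))
    return judge(convo)


# Toy stand-ins so the loop can run without any models.
def scripted_user(convo: Conversation) -> Optional[str]:
    script = ["I need to book an appointment.", "Tuesday at 3 pm."]
    spoken = sum(1 for speaker, _ in convo.turns if speaker == "user")
    return script[spoken] if spoken < len(script) else None


def echo_agent(convo: Conversation) -> str:
    return f"Got it: {convo.turns[-1][1]}"


def toy_judge(convo: Conversation) -> Dict[str, object]:
    # A real judge would be a model; this one only checks the last agent turn.
    return {"turns": len(convo.turns), "completed": "Tuesday" in convo.turns[-1][1]}


result = run_episode("book an appointment", scripted_user, echo_agent, toy_judge)
```

Because the user, agent, and judge are just callables here, swapping in real models would not change the loop itself, which is roughly what lets this style of evaluation scale without human listeners.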
The framework measures two dimensions:
- EVA-A (Accuracy): Did the agent complete the task correctly and faithfully? Did it do what it was asked to do, without making things up or taking wrong actions?
- EVA-X (Experience): Was the interaction natural, concise, and appropriate for spoken dialogue? Did it sound like a good conversation, or like a chatbot reading from a script?
Both matter. A technically accurate agent that sounds robotic won't be deployed. An engaging agent that makes errors is worse than useless in production.
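One way to picture "both matter" is a gate that requires both scores to clear a bar, rather than averaging them. The sketch below is an assumption for illustration: the `EpisodeScore` type, the 0-to-1 scale, and the thresholds are invented here and are not how EVA-Bench reports or combines EVA-A and EVA-X.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EpisodeScore:
    accuracy: float    # stand-in for an EVA-A style score, assumed in [0, 1]
    experience: float  # stand-in for an EVA-X style score, assumed in [0, 1]


def passes_both(
    score: EpisodeScore,
    min_accuracy: float = 0.9,    # assumed threshold
    min_experience: float = 0.7,  # assumed threshold
) -> bool:
    """Require both dimensions to clear their own bar; neither can offset the other."""
    return score.accuracy >= min_accuracy and score.experience >= min_experience


accurate_but_robotic = EpisodeScore(accuracy=0.97, experience=0.40)
pleasant_but_wrong = EpisodeScore(accuracy=0.60, experience=0.95)
solid = EpisodeScore(accuracy=0.93, experience=0.81)
```

With an average, the first two agents would look similar to the third. With a gate, only `solid` passes.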
Why this is harder than it looks:
Voice introduces failure modes that text benchmarks can't capture. Turn-taking is one: in spoken conversation, interruptions, pauses, and overlaps are part of the signal. A voice agent that can't navigate a natural pause in the conversation degrades the experience in ways that don't show up in a text transcript.
Latency is another. A response that's technically correct but takes four seconds feels wrong: by then, the user may already have started speaking again.
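A rough sketch of how a harness might track this: time each agent reply and flag the turns that exceed a budget. The helper names and the one-second budget are assumptions made for this example, not values from EVA-Bench.

```python
import time
from typing import Callable, List, Sequence, Tuple


def timed_reply(agent: Callable[[str], str], utterance: str) -> Tuple[str, float]:
    """Return the agent's reply together with the wall-clock seconds it took."""
    start = time.perf_counter()
    reply = agent(utterance)
    return reply, time.perf_counter() - start


def over_budget(latencies_s: Sequence[float], budget_s: float = 1.0) -> List[int]:
    """Indices of turns whose latency exceeded the (assumed) budget."""
    return [i for i, seconds in enumerate(latencies_s) if seconds > budget_s]


# Illustrative latencies for a four-turn conversation, in seconds.
slow_turns = over_budget([0.4, 4.0, 0.9, 1.2])
```

Here `slow_turns` picks out the second and fourth turns. A transcript-only evaluation would see four correct replies and nothing else.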
And then there's the compounding problem: errors at the speech recognition layer create drift that affects every subsequent turn. By the end of a multi-turn conversation, the accumulated error looks very different from any single-turn evaluation.
EVA-Bench is built to capture these effects because it evaluates the full end-to-end pipeline across complete conversations, not isolated utterances.
What this means:
Voice agents are increasingly deployed in enterprise contexts — customer service, medical intake, scheduling, support. The evaluation standard for these systems has to match the deployment reality.
A component benchmark that shows 95% accuracy on individual utterances doesn't tell you what happens in a 12-turn conversation where context drifts, the user clarifies twice, and the agent has to track state across all of it.
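A back-of-the-envelope calculation shows the gap. It assumes, purely for illustration, that each turn succeeds independently with the same probability; real conversations have correlated errors and recoveries, so treat this as intuition, not a prediction.

```python
def clean_conversation_probability(per_turn_accuracy: float, turns: int) -> float:
    """Chance that every turn succeeds, assuming independent, identical turns."""
    return per_turn_accuracy ** turns


p = clean_conversation_probability(0.95, 12)
print(f"{p:.2f}")  # 0.54
```

Under that assumption, 95% per-utterance accuracy leaves only about a 54% chance that a 12-turn conversation contains no error at all.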
EVA-Bench is a step toward evaluating the thing that actually ships — not the parts that compose it.
The code is open-source at github.com/ServiceNow/eva. Worth looking at if you're building or evaluating voice systems.
— sami