Voice AI Evaluation

OpenTelemetry observability for AI voice agents

Deepesh Jayal

June 2026

A voice agent makes at least three service calls on every turn: speech-to-text (STT), the language model (LLM), and text-to-speech (TTS). When a conversation goes wrong, you need to see which of those stages failed and why. OpenTelemetry gives you that view as one trace per conversation, with every turn laid out as its own group of spans. This guide covers the conventions, the spans to emit, and how to wire it up.

Evalgent's interest here is the loop between production traces and pre-release testing. We close on it. First, the standard.

What is OpenTelemetry for voice agents?

OpenTelemetry for voice agents means instrumenting a voice pipeline with the OpenTelemetry standard so that each stage of a turn emits a span, producing one distributed trace per conversation.

OpenTelemetry is the vendor-neutral observability standard backed by the CNCF. Applied to voice, it records each call as structured telemetry: traces made of spans, where every span captures a stage with timing and metadata. The general model is explained in the OpenTelemetry traces documentation.

The value is portability. Because the format is standard, you instrument once and send the data to any backend that speaks OTLP. You are not locked into one vendor's logging, and the same traces work across tools.

Why voice agents need distributed tracing

A voice turn is a chain, and chains hide where they break. The audio enters STT, the text goes to the LLM, the LLM may call a tool, and the response goes to TTS. Each hop has its own latency and failure modes, and the failures cascade.

Without distributed tracing, you see the symptom (a bad call) but not the cause. With it, you see the whole turn as a waterfall: how long STT took, whether the LLM picked the right tool, where the latency spiked. This is the difference between guessing and knowing, and it is why distributed tracing for voice AI has converged on OpenTelemetry. Our voice agent observability guide covers the broader monitoring picture.

The GenAI semantic conventions

A trace is only useful if everyone agrees what the fields mean. That is what the OpenTelemetry GenAI semantic conventions provide: a standard schema for AI telemetry, backed by the CNCF and adopted across major clouds.

The conventions define how to describe prompts, responses, token usage, and tool calls. Spans carry attributes like the request model, input and output token counts, and the response finish reason. Span trees follow a natural shape: an agent span contains chat spans for each LLM call and tool spans for each tool invocation. The OpenTelemetry GenAI observability post walks through this in detail. Because the GenAI semantic conventions are shared, any compliant backend can read your traces.
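As a rough illustration of what those attributes look like on a chat span, here is a minimal sketch that builds the attribute set for one LLM call. The `gen_ai.*` keys follow the convention names as published at the time of writing; the conventions are still evolving, so check the current specification before relying on them. The helper `genai_chat_attributes` and the model name `example-model` are hypothetical.

```python
from typing import Any


def genai_chat_attributes(
    model: str,
    input_tokens: int,
    output_tokens: int,
    finish_reason: str,
) -> dict[str, Any]:
    """Build span attributes for one LLM call using GenAI convention names."""
    return {
        "gen_ai.operation.name": "chat",
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        # The convention models finish reasons as a list, one entry per choice.
        "gen_ai.response.finish_reasons": [finish_reason],
    }


# Illustrative values only; "example-model" is a placeholder.
attrs = genai_chat_attributes("example-model", 412, 57, "stop")
for key, value in attrs.items():
    print(f"{key} = {value}")
```

A dictionary like this can be attached to a span in one call, which keeps the naming in a single place instead of scattered across the pipeline.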

What spans should a voice agent emit?

The span tree should follow the structure of the conversation. A clean hierarchy makes a turn readable at a glance.

conversation
  turn
    stt    (audio -> text)      attrs: latency, confidence
    llm    (text -> response)   attrs: model, tokens, finish_reason
    tool   (action)             attrs: tool name, arguments, result
    tts    (response -> audio)  attrs: latency, voice, first-byte

Each leaf span captures one stage with the attributes that matter for that stage. STT records confidence and latency. The LLM records model, token usage, and finish reason. Tool spans record the call and its result. TTS records time to first byte. Together they reconstruct exactly what happened, which is what good voice agent tracing is for.

How to instrument STT, LLM, and TTS with OpenTelemetry

Instrumentation is mechanical once the structure is clear. Configure a TracerProvider with an OTLP exporter pointing at your collector, then wrap each stage in a span.

The steps are consistent:

1. Set up a TracerProvider and an OTLP exporter to your collector.
2. Open a conversation span when the call starts, and a turn span per exchange.
3. Wrap STT, LLM, tool, and TTS calls in child spans.
4. Attach attributes: latency, token usage, confidence, and finish reason.
5. Export the spans through the collector to your chosen backend.

The collector is the hub. It receives spans over OTLP, batches them, and forwards them to one or more backends. This keeps your instrumentation independent of where the data ends up, which is the core benefit of an OpenTelemetry voice agent setup.

Reading a voice turn as a waterfall

Once spans flow, a turn becomes a waterfall trace. You see each stage as a bar on a timeline, with its duration and where it sits in the sequence. The picture answers the questions that matter fast.

Was the latency in STT or the LLM? Did the model spend time on a tool call that failed? Did TTS lag on the first byte? A waterfall shows time spent per hop, so you stop guessing which layer caused the delay. This is the everyday payoff of OpenTelemetry observability: failures and slowdowns become visible at the exact span where they happened.
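To make the idea concrete, here is a small sketch that renders span timings as a text waterfall and picks out the slowest stage. It uses only the standard library. The `StageSpan` type and the timings are hypothetical; in practice your backend draws this view from exported spans.

```python
from dataclasses import dataclass


@dataclass
class StageSpan:
    name: str
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


def render_waterfall(spans: list[StageSpan], width: int = 40) -> str:
    """Draw each span as a bar positioned on a shared timeline."""
    origin = min(s.start_ms for s in spans)
    total = max(s.end_ms for s in spans) - origin
    lines = []
    for s in sorted(spans, key=lambda s: s.start_ms):
        offset = round((s.start_ms - origin) / total * width)
        length = max(1, round(s.duration_ms / total * width))
        lines.append(f"{s.name:<5}{' ' * offset}{'#' * length} {s.duration_ms:.0f} ms")
    return "\n".join(lines)


def slowest_stage(spans: list[StageSpan]) -> StageSpan:
    """Return the span that consumed the most time in the turn."""
    return max(spans, key=lambda s: s.duration_ms)


# Illustrative timings, not measurements.
turn = [
    StageSpan("stt", 0, 180),
    StageSpan("llm", 180, 900),
    StageSpan("tts", 900, 1150),
]
print(render_waterfall(turn))
print("slowest:", slowest_stage(turn).name)
```

With these made-up numbers the LLM bar dominates the timeline, which is the kind of answer a waterfall gives at a glance.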

Does Pipecat support OpenTelemetry?

Yes. Pipecat ships built-in support for OpenTelemetry tracing, and it organises traces the way this guide describes: conversation spans contain turns, which contain STT, LLM, and TTS child spans. The Pipecat OpenTelemetry docs cover the setup.

Framework support is broadening. The OpenTelemetry GenAI group is standardising instrumentation across agent frameworks, and AI agent conventions for state and memory are in progress. If you build on a framework with native tracing, you get most of this for free. If you build your own pipeline, you wire the spans yourself, following the same structure.

OpenTelemetry vs custom logging

Custom logging can capture the same data, so why standardise? The answer is portability and structure. Ad hoc logs are unstructured, vendor-specific, and hard to correlate across stages. A trace ties every stage of a turn together with shared IDs and a shared schema.

With custom logging, you reinvent correlation and lock yourself to one tool. With OpenTelemetry, the same instrumentation works across backends, and the GenAI conventions mean others can read your data without translation. For anything beyond a quick prototype, the standard wins. The link to call failures is covered in why voice agents fail in production.

From LLM observability to voice observability

Most OpenTelemetry work in AI started with text. The push for OpenTelemetry LLM observability gave us the GenAI conventions, the span shapes, and the attribute names for prompts, tokens, and tool calls. Voice agents inherit all of that, then add the parts text never had.

The extra parts are the acoustic stages. An LLM trace captures the model call; a voice trace also captures STT before it and TTS after it. Those stages bring their own attributes, like transcription confidence and time to first byte, that the text conventions do not cover. So voice observability is LLM observability plus the audio pipeline around it.

This matters when you choose tools. A backend built only for LLM traces will show the model call but leave the STT and TTS spans as generic, unlabelled work. Pick instrumentation and a backend that understand the full voice turn, or you stay blind to exactly the stages where voice agents fail most.

Common pitfalls to avoid

A few mistakes recur. The first is tracing only the LLM and ignoring STT and TTS, which hides the acoustic failures. The second is exporting spans without a deliberate sampling and retention policy, so the data you need to investigate a trend has already been dropped or has expired. The third is collecting rich traces that no alert or person ever reads.

Avoid these by instrumenting the whole turn, tagging spans by agent version and model, and routing anomalies to an owner. Tracing is only worth the cost if someone, or an alert, actually acts on what the spans reveal.

Closing the loop: observability and testing

Tracing tells you what happened in production. It does not stop the next failure from reaching callers. That gap is where testing comes in, and the two together form a loop.

This is where Evalgent fits. OpenTelemetry traces surface a failure pattern in production, and Evalgent reproduces it before the next release. Scenarios turn a traced failure into a repeatable test. Profiles vary the conditions. Metrics gate the release on the behaviour the trace exposed. Reviews let your team inspect the call with audio and transcript, the same way a trace lets you inspect the spans. Observability and testing answer different questions: what broke, and will it break again. For the testing half, see the voice agent stack guide and the AI voice agent testing pillar.

Frequently asked questions

What is OpenTelemetry for voice agents?

OpenTelemetry for voice agents is the use of the open, vendor-neutral tracing standard to instrument a voice pipeline. Each stage of a turn, STT, the LLM, tool calls, and TTS, emits a span with timing and metadata, stitched into one trace per conversation. The GenAI semantic conventions give these spans a shared schema so any backend can read them.

How do you trace a voice agent with OpenTelemetry?

Trace a voice agent by configuring a TracerProvider with an OTLP exporter to a collector, then wrapping each pipeline stage in a span. Open a conversation span at call start, a turn span per exchange, and child spans for STT, LLM, tool, and TTS calls. Attach attributes like latency, token usage, and confidence, then export the spans to your backend.

What are GenAI semantic conventions?

GenAI semantic conventions are an OpenTelemetry standard schema for AI telemetry, backed by the CNCF. They define how to describe prompts, responses, token usage, and tool calls, with attributes like request model, input and output tokens, and finish reason. Because the schema is shared, any compliant backend can read traces without custom translation, which makes instrumentation portable.

How do you instrument STT, LLM, and TTS with OpenTelemetry?

Instrument STT, LLM, and TTS by wrapping each call in its own child span under a turn span. STT spans record confidence and latency, LLM spans record model, token usage, and finish reason, and TTS spans record time to first byte. Configure a TracerProvider and an OTLP exporter so the spans flow through a collector to your observability backend.

Does Pipecat support OpenTelemetry?

Yes, Pipecat has built-in OpenTelemetry tracing. It organises traces hierarchically, with conversation spans that contain turns, which contain STT, LLM, and TTS child spans. This matches the structure recommended for voice observability, so you get readable, standards-based traces without building the instrumentation yourself. Other agent frameworks are adding similar native support.

What spans should a voice agent emit?

A voice agent should emit a conversation span containing turn spans, each with child spans for STT, LLM, any tool calls, and TTS. STT spans carry confidence and latency, LLM spans carry model and token usage, tool spans carry the call and result, and TTS spans carry time to first byte. This hierarchy reconstructs exactly what happened in each turn.

OpenTelemetry vs custom logging for voice agents: which is better?

OpenTelemetry beats custom logging for anything beyond a prototype. Custom logs are unstructured, vendor-specific, and hard to correlate across pipeline stages. OpenTelemetry ties every stage of a turn together with shared IDs and a standard schema, works across backends, and follows the GenAI conventions so others can read the data. Custom logging only suits the simplest cases.

How does OpenTelemetry help debug voice agents?

OpenTelemetry helps debug voice agents by turning each turn into a waterfall trace. You see every stage on a timeline with its duration, so you can tell whether latency came from STT, the LLM, a failed tool call, or TTS. Instead of guessing from a bad-call symptom, you jump to the exact span where the problem occurred.

Conclusion

OpenTelemetry for voice agents turns each call into a readable trace across STT, LLM, and TTS, using the GenAI conventions so any backend can consume it. It is how you see which stage of a turn failed, and why, instead of guessing.

Tracing is half the loop. Pair it with testing so the failures your traces reveal in production become the scenarios that gate your next release. See the trace, write the test, and ship with confidence.
