Latency Metrics

VoiceRun captures five per-turn latency metrics that measure responsiveness from the moment a user stops speaking to when audio plays back.

End-to-End Turn Taking is the most user-perceptible metric.

Time to First Transcription

Time from when the user stops speaking until the first transcription arrives from STT. Measures the pure STT processing latency after speech completes.

How it's measured: Timer starts when the user stops speaking (UserStoppedSpeakingFrame); records on the first TranscriptionFrame received. The timer resets at turn end and session stop.
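The start, record, and reset behavior described above can be sketched as a small state machine. This is an illustration only, not VoiceRun's implementation: the frame classes below are empty stand-ins that borrow the names used in this section, and `TurnEndFrame`, `FirstTranscriptionTimer`, and the injectable `clock` parameter are hypothetical names introduced for the example.

```python
import time
from typing import Callable, Optional


class UserStoppedSpeakingFrame:
    """Stand-in for the frame emitted when the user stops speaking."""


class TranscriptionFrame:
    """Stand-in for a transcription frame arriving from STT."""


class TurnEndFrame:
    """Hypothetical marker for turn end (name assumed for this sketch)."""


class FirstTranscriptionTimer:
    """Records the delay between end of speech and the first transcription."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._started_at: Optional[float] = None
        self.latency_s: Optional[float] = None

    def on_frame(self, frame: object) -> None:
        if isinstance(frame, UserStoppedSpeakingFrame):
            # Start (or restart) the timer when speech ends.
            self._started_at = self._clock()
            self.latency_s = None
        elif isinstance(frame, TranscriptionFrame):
            # Only the first transcription after speech ends is recorded.
            if self._started_at is not None and self.latency_s is None:
                self.latency_s = self._clock() - self._started_at
        elif isinstance(frame, TurnEndFrame):
            # Reset so the next turn starts from a clean state.
            self._started_at = None
            self.latency_s = None
```

Passing the clock in as a parameter keeps the sketch deterministic under test; a monotonic clock is used by default so wall-clock adjustments cannot produce negative durations.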

Why it matters: Measures the STT service's actual processing speed, excluding user speaking time. Lower values indicate faster STT response, enabling quicker responses to users. This metric helps identify STT performance bottlenecks independent of speech duration.

Time to First Speech Event

Latency from the handler receiving the user input (TextEvent) to the first speech-producing event from the handler.

How it's measured: Captured on the first Text-to-Speech event produced by your handler each turn.

Why it matters: A good proxy for the time the LLM or handler spends on prompt processing and thinking before speech starts. It is one component of End-to-End Turn Taking.

Time to First Audio

Time from TTS start to the first audio frame streamed to the listener.

How it's measured: Starts at TTS start; recorded on the first TTS audio frame for the turn. VoiceRun also records provider/model-level TTS histograms for time-to-first-audio and full synthesis duration.

Why it matters: Indicates TTS startup and streaming latency, which affects perceived snappiness. It is one component of End-to-End Turn Taking.

End-to-End Turn Taking

The overall time from when the user stops speaking to the first audio frame streamed to the listener.

How it's measured: Timer starts when the user stops speaking (UserStoppedSpeakingFrame); ends at the first TTS audio frame emitted to the listener.

Why it matters: Represents perceived responsiveness after a user finishes talking. This is the most user-perceptible metric.

Approximate relation:

End-to-End Turn Taking ≈ Time to First Transcription
                       + Time to First Speech Event
                       + Time to First Audio
                       + small pipeline/transport overhead
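To see how the approximate relation can be used, the sketch below subtracts the three component metrics from the end-to-end value to estimate the remaining pipeline/transport overhead. The `TurnLatency` class, its field names, and the numbers are made up for illustration; they are not VoiceRun's schema or measured results.

```python
from dataclasses import dataclass


@dataclass
class TurnLatency:
    """Per-turn latency values in milliseconds (hypothetical field names)."""

    time_to_first_transcription_ms: float
    time_to_first_speech_event_ms: float
    time_to_first_audio_ms: float
    end_to_end_turn_taking_ms: float

    def component_sum_ms(self) -> float:
        # Sum of the three measured components in the relation.
        return (
            self.time_to_first_transcription_ms
            + self.time_to_first_speech_event_ms
            + self.time_to_first_audio_ms
        )

    def overhead_ms(self) -> float:
        # Whatever the components do not explain is attributed to overhead.
        return self.end_to_end_turn_taking_ms - self.component_sum_ms()


# Illustrative numbers only.
turn = TurnLatency(180.0, 420.0, 150.0, 790.0)
print(turn.component_sum_ms())  # 750.0
print(turn.overhead_ms())       # 40.0
```

Because the relation is approximate, a large or negative residual is more likely a hint to look at how the individual timers overlap than a precise overhead figure.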

Function Runtime

Duration of the handler's work for the turn (end-to-end function time).

How it's measured: Recorded when the turn ends and shown as the duration chip in the debugger.

Why it matters: Helps identify slow logic, blocking I/O, or long-running tool calls.
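When Function Runtime is high, timing individual steps inside the handler can show which one dominates. The following is a minimal sketch using only the Python standard library; the `timed` helper and the `timings` dictionary are hypothetical names for this example and are not part of VoiceRun.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


@contextmanager
def timed(label: str, sink: Dict[str, float]) -> Iterator[None]:
    """Store the wall-clock duration of the wrapped block in sink[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Recorded even if the wrapped block raises.
        sink[label] = time.perf_counter() - start


timings: Dict[str, float] = {}

with timed("lookup", timings):
    time.sleep(0.01)  # placeholder for a tool call or database query

print(timings)
```

Wrapping each external call this way makes it easier to tell slow logic apart from blocking I/O before changing the handler.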

Tips to Improve Latency

Stream responses: Use streaming TTS and stream partial LLM responses (speak as you think).
Trim prompts: Keep prompts concise and cache static opening lines with the TTS cache.
Choose fast STT models: Select models optimized for first-token speed if backchanneling is important.
Avoid blocking I/O: Make external calls concurrent when possible in your handler.
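As an illustration of the last tip, the sketch below runs two independent lookups concurrently with `asyncio.gather`, so the total wait is roughly the slower of the two rather than their sum. The `fetch_profile`, `fetch_orders`, and `gather_context` functions are hypothetical placeholders; `asyncio.sleep` stands in for real network calls, and nothing here is VoiceRun handler API.

```python
import asyncio
import time


async def fetch_profile(user_id: str) -> dict:
    await asyncio.sleep(0.05)  # placeholder for an external API call
    return {"user_id": user_id, "name": "Ada"}


async def fetch_orders(user_id: str) -> list:
    await asyncio.sleep(0.05)  # placeholder for a second, independent call
    return ["order-1"]


async def gather_context(user_id: str) -> dict:
    # Both coroutines are scheduled at once and awaited together.
    profile, orders = await asyncio.gather(
        fetch_profile(user_id),
        fetch_orders(user_id),
    )
    return {"profile": profile, "orders": orders}


if __name__ == "__main__":
    start = time.perf_counter()
    context = asyncio.run(gather_context("u-1"))
    elapsed = time.perf_counter() - start
    print(context)
    print(f"elapsed: {elapsed:.3f}s")
```

This only helps when the calls do not depend on each other's results; dependent calls still have to run in sequence.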

TTS Provider Benchmarks

The TTS Lab Benchmarks tab summarizes production TTS latency across providers and models. It shows:

TTFA p50 and TTFA p95 — median and 95th-percentile time-to-first-audio
Duration p50 and Duration p95 — median and 95th-percentile full synthesis duration
Samples — number of TTS generations in the selected time range

These aggregates are global across all agents. The source Prometheus histograms include provider and model labels, but not organization or agent labels, so they are useful for provider/model comparison rather than tenant-specific performance analysis.
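For readers less familiar with p50 and p95, the sketch below computes them from a list of raw samples using the nearest-rank method. This is a different technique from what the Benchmarks tab relies on: Prometheus histograms estimate quantiles from bucket counts, so values will not match exactly. The `percentile` function and the sample values are made up for illustration.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Illustrative time-to-first-audio samples in milliseconds.
ttfa_ms = [120, 135, 150, 160, 180, 210, 240, 300, 450, 900]

print(percentile(ttfa_ms, 50))  # 180
print(percentile(ttfa_ms, 95))  # 900
```

The gap between the two values shows why both are reported: the median describes a typical generation, while p95 exposes the slow tail that users occasionally hit.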

Latency metrics are included in session events with the metrics.latency.* prefix. See Text to Speech for TTS Lab usage.
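A minimal sketch of pulling latency values out of a session event by prefix follows. It assumes, purely for illustration, that an event is available as a flat mapping with dotted keys; the actual event shape may differ. The `latency_metrics` helper and the key names `example_a` and `example_b` are placeholders, not real VoiceRun metric names.

```python
from typing import Any, Dict, Mapping

LATENCY_PREFIX = "metrics.latency."


def latency_metrics(event: Mapping[str, Any]) -> Dict[str, Any]:
    """Return only the latency entries, with the shared prefix removed."""
    return {
        key[len(LATENCY_PREFIX):]: value
        for key, value in event.items()
        if key.startswith(LATENCY_PREFIX)
    }


# Hypothetical event; key names after the prefix are placeholders.
sample_event = {
    "type": "example",
    "metrics.latency.example_a": 0.18,
    "metrics.latency.example_b": 0.79,
}

print(latency_metrics(sample_event))
```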