# Speech to Text
VoiceRun supports multiple speech-to-text (STT) providers for real-time transcription. STT is configured in the agent environment, and available configuration options vary by model.
## Test Models and Configurations
Test STT models and configurations in the STT Lab.
## Supported Models
| Provider | Model | Prompt / bias | Silence-based VAD | Advanced VAD |
|---|---|---|---|---|
| Deepgram | nova-3 | Keyterm prompting | Yes | No |
| Deepgram | flux-general-en | Keyterm prompting | Yes | CSR (conversational) |
| Qwen | qwen3-asr-flash-realtime | Context (corpus text) | Yes | No |
| OpenAI | gpt-4o-transcribe | Context prompt | Optional | Semantic |
| OpenAI | gpt-4o-mini-transcribe | Context prompt | Optional | Semantic |
| OpenAI | gpt-realtime | Context prompt / session instructions | Yes | Semantic |
| Cartesia | ink-whisper | None | Yes | No |
| ElevenLabs | scribe-v2-realtime | Keyterm prompting | Yes | No |
| Soniox | stt-rt-v4 | Context | No | Semantic |
## Configuration by Model

Configuration options differ by model. The table below shows which parameters are available for each model.

| Parameter | Nova-3 | Flux | OpenAI | Qwen3 | Cartesia | ElevenLabs | Soniox |
|---|---|---|---|---|---|---|---|
| `language` | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| `prompt` | Keywords | Keywords | Context | Context | No | Keywords | Context |
| `endpointing` | Yes (300 ms) | No | Yes (500 ms) | Yes (800 ms) | Yes (300 ms) | Yes (300 ms) | No |
| `noiseReductionType` | No | No | Yes | No | No | No | No |
| `vadMode` | No | No | Yes | No | No | No | Yes |
| `vadEagerness` | No | No | Yes | No | No | No | Yes |
| `eotThreshold` | No | Yes | No | No | No | No | No |
| `eotTimeoutMs` | No | Yes | No | No | No | No | No |
| `fallbackModel` | Yes | No | Yes | No | Yes | No | No |
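The matrix above can be handy as a client-side sanity check before deploying a configuration. The sketch below transcribes it into a plain Python dict; the dict and the helper are illustrative only, not part of the VoiceRun API:

```python
# Per-model supported parameters, transcribed from the table above
# (eagerEotThreshold added for Flux per its detail table below).
# This mapping is an assumption for illustration, not an official schema.
SUPPORTED_PARAMS = {
    "nova-3": {"language", "prompt", "endpointing", "fallbackModel"},
    "flux-general-en": {"language", "prompt", "eotThreshold",
                        "eotTimeoutMs", "eagerEotThreshold"},
    "gpt-4o-transcribe": {"language", "prompt", "endpointing",
                          "noiseReductionType", "vadMode", "vadEagerness",
                          "fallbackModel"},
    "qwen3-asr-flash-realtime": {"language", "prompt", "endpointing"},
    "ink-whisper": {"language", "endpointing", "fallbackModel"},
    "scribe-v2-realtime": {"language", "prompt", "endpointing"},
    "stt-rt-v4": {"language", "prompt", "vadMode", "vadEagerness"},
}

def unsupported_params(model: str, config: dict) -> set:
    """Return the config keys that the given model does not support."""
    return set(config) - SUPPORTED_PARAMS.get(model, set()) - {"model"}
```

For example, passing `endpointing` to Flux would be flagged, since Flux relies on EOT detection rather than silence-based endpointing.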
## Deepgram Nova-3

General-purpose model with broad language support.

### Configuration

| Parameter | Type | Default | Description |
|---|---|---|---|
| `language` | string | `en` | Language code, or `multi` for auto-detect |
| `endpointing` | number | 300 | Silence detection threshold (ms) |
| `prompt` | string | — | Comma-separated keywords to bias transcription |
### Prompt Format

Nova-3 uses keyword-style prompts. Provide comma-separated terms to improve recognition of domain-specific words:

```text
policy, premium, deductible, copay, beneficiary
```
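Putting the table together, a Nova-3 entry in the agent environment might look like the following. The dict shape is an assumption for illustration; only the field names come from the parameter table above:

```python
# Hypothetical agent-environment STT block for Deepgram Nova-3.
# The surrounding schema (a plain dict) is assumed, not VoiceRun's
# documented format; parameter names mirror the table above.
stt_config = {
    "model": "nova-3",
    "language": "multi",   # auto-detect across languages
    "endpointing": 300,    # end the turn after 300 ms of silence
    "prompt": "policy, premium, deductible, copay, beneficiary",
}
```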
## Deepgram Flux

Ultra-low-latency English model with end-of-turn (EOT) detection.

### Configuration

| Parameter | Type | Default | Description |
|---|---|---|---|
| `language` | string | `en` | Language code |
| `prompt` | string | — | Comma-separated keywords |
| `eotThreshold` | float | 0.8 | End-of-turn confidence threshold (0.5–0.9) |
| `eotTimeoutMs` | number | 2500 | Maximum silence wait time (ms) |
| `eagerEotThreshold` | float | — | Lower threshold for immediate EOT (optional) |
### EOT Detection
Flux uses confidence-based end-of-turn detection instead of simple silence timing. Higher thresholds wait for more confident turn endings; lower thresholds end turns more quickly.
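The interplay of `eotThreshold` and `eotTimeoutMs` can be sketched as follows. This is an illustrative model of the behavior described above, not Deepgram's actual algorithm:

```python
def should_end_turn(eot_confidence: float, silence_ms: float,
                    eot_threshold: float = 0.8,
                    eot_timeout_ms: float = 2500) -> bool:
    """Illustrative sketch of confidence-based turn ending: end the
    turn once the model's end-of-turn confidence clears the threshold,
    or once silence exceeds the eotTimeoutMs safety net."""
    return eot_confidence >= eot_threshold or silence_ms >= eot_timeout_ms
```

Lowering `eot_threshold` (say, to 0.6) ends turns sooner and feels snappier, at the cost of occasionally cutting the speaker off mid-thought.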
## OpenAI Models

`gpt-4o-transcribe`, `gpt-4o-mini-transcribe`, and `gpt-realtime` share the same configuration options.

### Configuration

| Parameter | Type | Default | Description |
|---|---|---|---|
| `language` | string | `en` | Language code, or `auto` for auto-detect |
| `endpointing` | number | 500 | Silence duration (ms) when using `server_vad` |
| `prompt` | string | — | Natural-language context description |
| `noiseReductionType` | string | `near_field` | `near_field` or `far_field` |
| `vadMode` | string | `server_vad` | `server_vad` or `semantic_vad` |
| `vadEagerness` | string | `auto` | `auto`, `low`, `medium`, or `high` (`semantic_vad` only) |
### VAD Modes

- `server_vad`: Uses silence duration (`endpointing`) to detect turn end
- `semantic_vad`: Uses context-aware detection with configurable eagerness

When using `semantic_vad`, the `endpointing` parameter is ignored and `vadEagerness` controls sensitivity.
### Prompt Format

OpenAI models use context-style prompts. Provide a natural-language description of the conversation context:

```text
This is a customer service call about insurance claims.
```
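Combining the options above, a semantic-VAD configuration might look like this. The dict shape is an assumption; parameter names and values come from the tables above. Note that `endpointing` is deliberately omitted, since it is ignored under `semantic_vad`:

```python
# Hypothetical configuration for gpt-4o-transcribe with semantic VAD.
# The dict shape is assumed for illustration, not VoiceRun's documented
# schema; parameter names mirror the table above.
stt_config = {
    "model": "gpt-4o-transcribe",
    "language": "auto",
    "vadMode": "semantic_vad",
    "vadEagerness": "high",             # end turns aggressively
    "noiseReductionType": "far_field",  # e.g. speakerphone audio
    "prompt": "This is a customer service call about insurance claims.",
}
```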
## Qwen3 ASR

Low-latency model optimized for Chinese and English.

### Configuration

| Parameter | Type | Default | Description |
|---|---|---|---|
| `language` | string | `auto` | Language code (`zh`, `en`, `auto`, etc.) |
| `endpointing` | number | 800 | Silence detection threshold (ms) |
| `prompt` | string | — | Context corpus for domain adaptation |
## Cartesia Ink-Whisper

Broad language coverage with simple configuration.

### Configuration

| Parameter | Type | Default | Description |
|---|---|---|---|
| `language` | string | `en` | Language code, or `multi` for auto-detect |
| `endpointing` | number | 300 | Silence detection threshold (ms) |

Cartesia does not support prompt/keyword biasing.
## ElevenLabs Scribe v2

Low-latency multilingual model with keyword biasing.

### Configuration

| Parameter | Type | Default | Description |
|---|---|---|---|
| `language` | string | `en` | Language code |
| `endpointing` | number | 300 | Silence detection threshold (ms) |
| `prompt` | string | — | Comma-separated keywords to bias transcription |

### Prompt Format

ElevenLabs uses keyword-style prompts, similar to Deepgram. Provide comma-separated terms:

```text
policy, premium, deductible, copay, beneficiary
```
## Soniox

Real-time STT with semantic VAD for context-aware turn detection.

### Configuration

| Parameter | Type | Default | Description |
|---|---|---|---|
| `language` | string | `en` | Language code |
| `prompt` | string | — | Natural-language context for domain adaptation |
| `vadMode` | string | `semantic_vad` | `semantic_vad` (silence-based VAD not supported) |
| `vadEagerness` | string | `auto` | `auto`, `low`, `medium`, or `high` |
### VAD

Soniox uses semantic VAD by default and does not support silence-based endpointing. Use `vadEagerness` to control how aggressively turns are ended.
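Because silence-based endpointing is unsupported, a Soniox configuration has no `endpointing` key; turn-taking is tuned solely through `vadEagerness`. A sketch under the same assumed dict shape as the earlier examples (the prompt text is a made-up illustration):

```python
# Hypothetical Soniox configuration; the dict shape is assumed for
# illustration, not VoiceRun's documented schema.
stt_config = {
    "model": "stt-rt-v4",
    "language": "en",
    "vadMode": "semantic_vad",  # the only supported mode
    "vadEagerness": "low",      # wait longer before ending a turn
    "prompt": "Medical intake call; expect medication and dosage names.",
}
```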
## Updating Settings at Runtime

Use `STTUpdateSettingsEvent` to change STT settings during a conversation. Available parameters vary by model; see the model-specific sections above for supported options.

```python
async def handler(event: Event, context: Context):
    if isinstance(event, TextEvent):
        user_message = event.data.get("text", "N/A")
        if "habla español" in user_message.lower():
            # Switch transcription to Spanish mid-conversation
            yield STTUpdateSettingsEvent(language="es")
```
## Receiving Transcriptions

Transcribed speech is delivered to your handler as a `TextEvent`:

```python
from primfunctions.logger import logger

async def handler(event: Event, context: Context):
    if isinstance(event, TextEvent):
        user_text = event.data.get("text", "N/A")
        source = event.data.get("source")      # "speech" for STT
        language = event.data.get("language")  # detected language (if available)
        logger.info(f"User said: {user_text}")
```
## Fallback Configuration

Any STT model can be configured with a fallback model in the agent environment. When the primary model fails to connect, VoiceRun automatically switches to the fallback.

Available fallback options: `nova-3`, `gpt-4o-transcribe`, `gpt-4o-mini-transcribe`, `gpt-realtime`, `ink-whisper`, `scribe-v2-realtime`
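A primary-plus-fallback pairing might look like the following. As before, the dict shape is an assumption; `fallbackModel` is the parameter name from the configuration matrix above:

```python
# Hypothetical primary + fallback STT configuration. The dict shape is
# assumed for illustration; "fallbackModel" comes from the matrix above.
stt_config = {
    "model": "nova-3",
    "language": "en",
    "fallbackModel": "gpt-4o-transcribe",  # used if Nova-3 fails to connect
}
```

Note that prompt semantics differ between the pair: Nova-3 takes comma-separated keywords while gpt-4o-transcribe takes a context sentence, so a keyword prompt may bias the fallback less effectively.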
