# Speech to Text
VoiceRun supports multiple speech-to-text (STT) providers for real-time transcription. STT is configured in the agent environment, and available configuration options vary by model.
## Test Models and Configurations
Test STT models and configurations in the STT Lab.
## Supported Models
| Provider | Model | Prompt / bias | Silence-based VAD | Advanced VAD |
|---|---|---|---|---|
| Deepgram | nova-3 | Keyterm prompting | Yes | No |
| Deepgram | flux-general-en | Keyterm prompting | Yes | CSR (conversational) |
| Qwen | qwen3-asr-flash-realtime | Context (corpus text) | Yes | No |
| OpenAI | gpt-4o-transcribe | Context prompt | Optional | Semantic |
| OpenAI | gpt-4o-mini-transcribe | Context prompt | Optional | Semantic |
| OpenAI | gpt-realtime | Context prompt / session instructions | Yes | Semantic |
| Cartesia | ink-whisper | None | Yes | No |
| ElevenLabs | scribe-v2-realtime | Keyterm prompting | Yes | No |
| Soniox | stt-rt-v4 | Context | No | Semantic |
## Configuration by Model

Configuration options differ by model. The table below shows which parameters are available for each model.

| Parameter | Nova-3 | Flux | OpenAI | Qwen3 | Cartesia | ElevenLabs | Soniox |
|---|---|---|---|---|---|---|---|
| `language` | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| `prompt` | Keywords | Keywords | Context | Context | No | Keywords | Context |
| `endpointing` | Yes (300 ms) | No | Yes (500 ms) | Yes (800 ms) | Yes (300 ms) | Yes (300 ms) | No |
| `noiseReductionType` | No | No | Yes | No | No | No | No |
| `vadMode` | No | No | Yes | No | No | No | Yes |
| `vadEagerness` | No | No | Yes | No | No | No | Yes |
| `eotThreshold` | No | Yes | No | No | No | No | No |
| `eotTimeoutMs` | No | Yes | No | No | No | No | No |
| `fallbackModel` | Yes | No | Yes | No | Yes | No | No |
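The matrix above can be handy as a client-side sanity check before deploying a configuration. The sketch below transcribes it into a plain Python dict; the dict and the helper are illustrative only, not part of the VoiceRun API:

```python
# Per-model supported parameters, transcribed from the table above
# (eagerEotThreshold added for Flux per its detail table below).
# This mapping is an assumption for illustration, not an official schema.
SUPPORTED_PARAMS = {
    "nova-3": {"language", "prompt", "endpointing", "fallbackModel"},
    "flux-general-en": {"language", "prompt", "eotThreshold",
                        "eotTimeoutMs", "eagerEotThreshold"},
    "gpt-4o-transcribe": {"language", "prompt", "endpointing",
                          "noiseReductionType", "vadMode", "vadEagerness",
                          "fallbackModel"},
    "qwen3-asr-flash-realtime": {"language", "prompt", "endpointing"},
    "ink-whisper": {"language", "endpointing", "fallbackModel"},
    "scribe-v2-realtime": {"language", "prompt", "endpointing"},
    "stt-rt-v4": {"language", "prompt", "vadMode", "vadEagerness"},
}

def unsupported_params(model: str, config: dict) -> set:
    """Return the config keys that the given model does not support."""
    return set(config) - SUPPORTED_PARAMS.get(model, set()) - {"model"}
```

For example, passing `endpointing` to Flux would be flagged, since Flux relies on EOT detection rather than silence-based endpointing.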
## Deepgram Nova-3

General-purpose model with broad language support.

### Configuration

| Parameter | Type | Default | Description |
|---|---|---|---|
| `language` | string | `en` | Language code, or `multi` for auto-detect |
| `endpointing` | number | 300 | Silence detection threshold (ms) |
| `prompt` | string | — | Comma-separated keywords to bias transcription |
### Prompt Format

Nova-3 uses keyword-style prompts. Provide comma-separated terms to improve recognition of domain-specific words:

```text
policy, premium, deductible, copay, beneficiary
```
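Putting the table together, a Nova-3 entry in the agent environment might look like the following. The dict shape is an assumption for illustration; only the field names come from the parameter table above:

```python
# Hypothetical agent-environment STT block for Deepgram Nova-3.
# The surrounding schema (a plain dict) is assumed, not VoiceRun's
# documented format; parameter names mirror the table above.
stt_config = {
    "model": "nova-3",
    "language": "multi",   # auto-detect across languages
    "endpointing": 300,    # end the turn after 300 ms of silence
    "prompt": "policy, premium, deductible, copay, beneficiary",
}
```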
## Deepgram Flux

Ultra-low-latency English model with end-of-turn (EOT) detection.

### Configuration

| Parameter | Type | Default | Description |
|---|---|---|---|
| `language` | string | `en` | Language code |
| `prompt` | string | — | Comma-separated keywords |
| `eotThreshold` | float | 0.8 | End-of-turn confidence threshold (0.5–0.9) |
| `eotTimeoutMs` | number | 2500 | Maximum silence wait time (ms) |
| `eagerEotThreshold` | float | — | Lower threshold for immediate EOT (optional) |
### EOT Detection
Flux uses confidence-based end-of-turn detection instead of simple silence timing. Higher thresholds wait for more confident turn endings; lower thresholds end turns more quickly.
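The interplay of `eotThreshold` and `eotTimeoutMs` can be sketched as follows. This is an illustrative model of the behavior described above, not Deepgram's actual algorithm:

```python
def should_end_turn(eot_confidence: float, silence_ms: float,
                    eot_threshold: float = 0.8,
                    eot_timeout_ms: float = 2500) -> bool:
    """Illustrative sketch of confidence-based turn ending: end the
    turn once the model's end-of-turn confidence clears the threshold,
    or once silence exceeds the eotTimeoutMs safety net."""
    return eot_confidence >= eot_threshold or silence_ms >= eot_timeout_ms
```

Lowering `eot_threshold` (say, to 0.6) ends turns sooner and feels snappier, at the cost of occasionally cutting the speaker off mid-thought.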
## OpenAI Models

`gpt-4o-transcribe`, `gpt-4o-mini-transcribe`, and `gpt-realtime` share the same configuration options.

### Configuration

| Parameter | Type | Default | Description |
|---|---|---|---|
| `language` | string | `en` | Language code, or `auto` for auto-detect |
| `endpointing` | number | 500 | Silence duration (ms) when using `server_vad` |
| `prompt` | string | — | Natural-language context description |
| `noiseReductionType` | string | `near_field` | `near_field` or `far_field` |
| `vadMode` | string | `server_vad` | `server_vad` or `semantic_vad` |
| `vadEagerness` | string | `auto` | `auto`, `low`, `medium`, or `high` (`semantic_vad` only) |
### VAD Modes

- `server_vad`: Uses silence duration (`endpointing`) to detect turn end
- `semantic_vad`: Uses context-aware detection with configurable eagerness

When using `semantic_vad`, the `endpointing` parameter is ignored and `vadEagerness` controls sensitivity.
### Prompt Format

OpenAI models use context-style prompts. Provide a natural-language description of the conversation context:

```text
This is a customer service call about insurance claims.
```
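Combining the options above, a semantic-VAD configuration might look like this. The dict shape is an assumption; parameter names and values come from the tables above. Note that `endpointing` is deliberately omitted, since it is ignored under `semantic_vad`:

```python
# Hypothetical configuration for gpt-4o-transcribe with semantic VAD.
# The dict shape is assumed for illustration, not VoiceRun's documented
# schema; parameter names mirror the table above.
stt_config = {
    "model": "gpt-4o-transcribe",
    "language": "auto",
    "vadMode": "semantic_vad",
    "vadEagerness": "high",             # end turns aggressively
    "noiseReductionType": "far_field",  # e.g. speakerphone audio
    "prompt": "This is a customer service call about insurance claims.",
}
```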
## Qwen3 ASR

Low-latency model optimized for Chinese and English.

### Configuration

| Parameter | Type | Default | Description |
|---|---|---|---|
| `language` | string | `auto` | Language code (`zh`, `en`, `auto`, etc.) |
| `endpointing` | number | 800 | Silence detection threshold (ms) |
| `prompt` | string | — | Context corpus for domain adaptation |
## Cartesia Ink-Whisper

Broad language coverage with simple configuration.

### Configuration

| Parameter | Type | Default | Description |
|---|---|---|---|
| `language` | string | `en` | Language code, or `multi` for auto-detect |
| `endpointing` | number | 300 | Silence detection threshold (ms) |

Cartesia does not support prompt/keyword biasing.
## ElevenLabs Scribe v2

Low-latency multilingual model with keyword biasing.

### Configuration

| Parameter | Type | Default | Description |
|---|---|---|---|
| `language` | string | `en` | Language code |
| `endpointing` | number | 300 | Silence detection threshold (ms) |
| `prompt` | string | — | Comma-separated keywords to bias transcription |

### Prompt Format

ElevenLabs uses keyword-style prompts, similar to Deepgram. Provide comma-separated terms:

```text
policy, premium, deductible, copay, beneficiary
```
## Soniox

Real-time STT with semantic VAD for context-aware turn detection.

### Configuration

| Parameter | Type | Default | Description |
|---|---|---|---|
| `language` | string | `en` | Language code |
| `prompt` | string | — | Natural-language context for domain adaptation |
| `vadMode` | string | `semantic_vad` | `semantic_vad` (silence-based VAD not supported) |
| `vadEagerness` | string | `auto` | `auto`, `low`, `medium`, or `high` |
### VAD

Soniox uses semantic VAD by default and does not support silence-based endpointing. Use `vadEagerness` to control how aggressively turns are ended.
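Because silence-based endpointing is unsupported, a Soniox configuration has no `endpointing` key; turn-taking is tuned solely through `vadEagerness`. A sketch under the same assumed dict shape as the earlier examples (the prompt text is a made-up illustration):

```python
# Hypothetical Soniox configuration; the dict shape is assumed for
# illustration, not VoiceRun's documented schema.
stt_config = {
    "model": "stt-rt-v4",
    "language": "en",
    "vadMode": "semantic_vad",  # the only supported mode
    "vadEagerness": "low",      # wait longer before ending a turn
    "prompt": "Medical intake call; expect medication and dosage names.",
}
```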
## Updating Settings at Runtime

Use `STTUpdateSettingsEvent` to change STT settings during a conversation. Available parameters vary by model; see the model-specific sections above for supported options.

```python
async def handler(event: Event, context: Context):
    if isinstance(event, TextEvent):
        user_message = event.data.get("text", "N/A")
        if "habla español" in user_message.lower():
            # Switch transcription to Spanish mid-conversation
            yield STTUpdateSettingsEvent(language="es")
```
## Receiving Transcriptions

Transcribed speech is delivered to your handler as a `TextEvent`:

```python
from primfunctions.logger import logger

async def handler(event: Event, context: Context):
    if isinstance(event, TextEvent):
        user_text = event.data.get("text", "N/A")
        source = event.data.get("source")      # "speech" for STT
        language = event.data.get("language")  # detected language (if available)
        logger.info(f"User said: {user_text}")
```
## Fallback Configuration

Any STT model can be configured with a fallback model in the agent environment. When the primary model fails to connect, VoiceRun automatically switches to the fallback.

Available fallback options: `nova-3`, `gpt-4o-transcribe`, `gpt-4o-mini-transcribe`, `gpt-realtime`, `ink-whisper`, `scribe-v2-realtime`
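A primary-plus-fallback pairing might look like the following. As before, the dict shape is an assumption; `fallbackModel` is the parameter name from the configuration matrix above:

```python
# Hypothetical primary + fallback STT configuration. The dict shape is
# assumed for illustration; "fallbackModel" comes from the matrix above.
stt_config = {
    "model": "nova-3",
    "language": "en",
    "fallbackModel": "gpt-4o-transcribe",  # used if Nova-3 fails to connect
}
```

Note that prompt semantics differ between the pair: Nova-3 takes comma-separated keywords while gpt-4o-transcribe takes a context sentence, so a keyword prompt may bias the fallback less effectively.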
