# Speech to Text

VoiceRun supports multiple speech-to-text (STT) providers for real-time transcription. STT is configured in the agent environment, and available configuration options vary by model.

## Test Models and Configurations

Test STT models and configurations in the STT Lab.

## Supported Models

| Provider | Model | Prompt / bias | Silence-based VAD | Advanced VAD |
|---|---|---|---|---|
| Deepgram | nova-3 | Keyterm prompting | Yes | No |
| Deepgram | flux-general-en | Keyterm prompting | Yes | CSR (conversational) |
| Qwen | qwen3-asr-flash-realtime | Context (corpus text) | Yes | No |
| OpenAI | gpt-4o-transcribe | Context prompt | Optional | Semantic |
| OpenAI | gpt-4o-mini-transcribe | Context prompt | Optional | Semantic |
| OpenAI | gpt-realtime | Context prompt / session instructions | Yes | Semantic |
| Cartesia | ink-whisper | None | Yes | No |
| ElevenLabs | scribe-v2-realtime | Keyterm prompting | Yes | No |
| Soniox | stt-rt-v4 | Context | No | Semantic |

## Configuration by Model

Configuration options differ by model. The table below shows which parameters are available for each model.

| Parameter | Nova-3 | Flux | OpenAI | Qwen3 | Cartesia | ElevenLabs | Soniox |
|---|---|---|---|---|---|---|---|
| language | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| prompt | Keywords | Keywords | Context | Context | No | Keywords | Context |
| endpointing | Yes (300 ms) | No | Yes (500 ms) | Yes (800 ms) | Yes (300 ms) | Yes (300 ms) | No |
| noiseReductionType | No | No | Yes | No | No | No | No |
| vadMode | No | No | Yes | No | No | No | Yes |
| vadEagerness | No | No | Yes | No | No | No | Yes |
| eotThreshold | No | Yes | No | No | No | No | No |
| eotTimeoutMs | No | Yes | No | No | No | No | No |
| fallbackModel | Yes | No | Yes | No | Yes | No | No |

## Deepgram Nova-3

General-purpose model with broad language support.

### Configuration

| Parameter | Type | Default | Description |
|---|---|---|---|
| language | string | en | Language code, or multi for auto-detect |
| endpointing | number | 300 | Silence detection threshold (ms) |
| prompt | string | | Comma-separated keywords to bias transcription |

### Prompt Format

Nova-3 uses keyword-style prompts. Provide comma-separated terms to improve recognition of domain-specific words:

```text
policy, premium, deductible, copay, beneficiary
```
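Since the prompt is just a comma-separated string, it can be assembled from a term list. A minimal sketch (the keyword_prompt helper is hypothetical, not part of VoiceRun):

```python
# Hypothetical helper: assemble a keyword-style prompt from a list of
# domain-specific terms. The joined string is what goes in the prompt field.
def keyword_prompt(terms: list[str]) -> str:
    return ", ".join(terms)

print(keyword_prompt(["policy", "premium", "deductible", "copay", "beneficiary"]))
# → policy, premium, deductible, copay, beneficiary
```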

## Deepgram Flux

Ultra-low-latency English model with end-of-turn (EOT) detection.

### Configuration

| Parameter | Type | Default | Description |
|---|---|---|---|
| language | string | en | Language code |
| prompt | string | | Comma-separated keywords |
| eotThreshold | float | 0.8 | End-of-turn confidence threshold (0.5-0.9) |
| eotTimeoutMs | number | 2500 | Maximum silence wait time (ms) |
| eagerEotThreshold | float | | Lower threshold for immediate EOT (optional) |

### EOT Detection

Flux uses confidence-based end-of-turn detection instead of simple silence timing. Higher thresholds wait for more confident turn endings; lower thresholds end turns more quickly.
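The trade-off can be sketched with two contrasting settings. Plain dicts stand in here for whatever configuration surface the agent environment actually exposes; the values are illustrative, not recommendations:

```python
# Two hypothetical Flux configurations illustrating the eotThreshold trade-off.
snappy = {
    "model": "flux-general-en",
    "eotThreshold": 0.6,   # lower confidence bar: turns end quickly, risking cut-offs
    "eotTimeoutMs": 1500,  # give up waiting after 1.5 s of silence
}
patient = {
    "model": "flux-general-en",
    "eotThreshold": 0.9,   # wait for a confident end of turn
    "eotTimeoutMs": 2500,  # the documented default maximum silence wait
}
```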


## OpenAI Models

gpt-4o-transcribe, gpt-4o-mini-transcribe, and gpt-realtime share the same configuration options.

### Configuration

| Parameter | Type | Default | Description |
|---|---|---|---|
| language | string | en | Language code, or auto for auto-detect |
| endpointing | number | 500 | Silence duration (ms) when using server_vad |
| prompt | string | | Natural language context description |
| noiseReductionType | string | near_field | near_field or far_field |
| vadMode | string | server_vad | server_vad or semantic_vad |
| vadEagerness | string | auto | auto, low, medium, or high (semantic_vad only) |

### VAD Modes

- server_vad: uses silence duration (endpointing) to detect turn end
- semantic_vad: uses context-aware detection with configurable eagerness

When using semantic_vad, the endpointing parameter is ignored and vadEagerness controls sensitivity.
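The two modes can be contrasted side by side. As above, plain dicts stand in for the agent-environment config format, which may differ:

```python
# Hypothetical OpenAI STT settings contrasting the two VAD modes.
silence_based = {
    "model": "gpt-4o-transcribe",
    "vadMode": "server_vad",
    "endpointing": 500,       # ms of silence that ends the turn
}
context_aware = {
    "model": "gpt-4o-transcribe",
    "vadMode": "semantic_vad",
    "vadEagerness": "high",   # end turns aggressively
    # endpointing is omitted: it would be ignored in this mode anyway
}
```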

### Prompt Format

OpenAI models use context-style prompts. Provide a natural language description of the conversation context:

```text
This is a customer service call about insurance claims.
```

## Qwen3 ASR

Low-latency model optimized for Chinese and English.

### Configuration

| Parameter | Type | Default | Description |
|---|---|---|---|
| language | string | auto | Language code (zh, en, auto, etc.) |
| endpointing | number | 800 | Silence detection threshold (ms) |
| prompt | string | | Context corpus for domain adaptation |

## Cartesia Ink-Whisper

Broad language coverage with simple configuration.

### Configuration

| Parameter | Type | Default | Description |
|---|---|---|---|
| language | string | en | Language code, or multi for auto-detect |
| endpointing | number | 300 | Silence detection threshold (ms) |

Cartesia does not support prompt/keyword biasing.


## ElevenLabs Scribe v2

Low-latency multilingual model with keyword biasing.

### Configuration

| Parameter | Type | Default | Description |
|---|---|---|---|
| language | string | en | Language code |
| endpointing | number | 300 | Silence detection threshold (ms) |
| prompt | string | | Comma-separated keywords to bias transcription |

### Prompt Format

ElevenLabs uses keyword-style prompts, similar to Deepgram. Provide comma-separated terms:

```text
policy, premium, deductible, copay, beneficiary
```

## Soniox

Real-time STT with semantic VAD for context-aware turn detection.

### Configuration

| Parameter | Type | Default | Description |
|---|---|---|---|
| language | string | en | Language code |
| prompt | string | | Natural language context for domain adaptation |
| vadMode | string | semantic_vad | semantic_vad only (silence-based VAD not supported) |
| vadEagerness | string | auto | auto, low, medium, or high |

### VAD

Soniox uses semantic VAD by default and does not support silence-based endpointing. Use vadEagerness to control how aggressively turns are ended.
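For example, a call center that expects long, hesitant pauses might keep eagerness low. A hypothetical config sketch (dict format assumed, as elsewhere on this page):

```python
# Hypothetical Soniox settings: semantic VAD is the only mode, so turn
# sensitivity is tuned entirely through vadEagerness.
soniox_config = {
    "model": "stt-rt-v4",
    "vadMode": "semantic_vad",
    "vadEagerness": "low",  # let speakers pause without ending the turn
}
```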


## Updating Settings at Runtime

Use STTUpdateSettingsEvent to change STT settings during a conversation. Available parameters vary by model; see the model-specific sections above for supported options.

```python
async def handler(event: Event, context: Context):
    if isinstance(event, TextEvent):
        user_message = event.data.get("text", "N/A")
        if "habla español" in user_message.lower():
            # Switch transcription to Spanish mid-conversation
            yield STTUpdateSettingsEvent(language="es")
```

## Receiving Transcriptions

Transcribed speech is delivered to your handler as a TextEvent:

```python
from primfunctions.logger import logger

async def handler(event: Event, context: Context):
    if isinstance(event, TextEvent):
        user_text = event.data.get("text", "N/A")
        source = event.data.get("source")      # "speech" for STT
        language = event.data.get("language")  # detected language (if available)
        logger.info(f"User said: {user_text}")
```

## Fallback Configuration

Any STT model can be configured with a fallback model in the agent environment. When the primary model fails to connect, VoiceRun automatically switches to the fallback.

Available fallback options: nova-3, gpt-4o-transcribe, gpt-4o-mini-transcribe, gpt-realtime, ink-whisper, scribe-v2-realtime
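A sketch of what such a configuration might look like, again using a plain dict to stand in for the agent-environment config format:

```python
# Hypothetical STT config with a fallback: nova-3 is the primary model;
# if it fails to connect, VoiceRun switches to gpt-4o-transcribe.
stt_config = {
    "model": "nova-3",
    "language": "en",
    "fallbackModel": "gpt-4o-transcribe",
}
```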
