# VoiceRun Transcribe
Real-time speech-to-text over WebSocket. Stream audio from a microphone or file and receive live transcription events including partial results, final transcripts, and speech activity detection.
## How It Works

The Transcribe API provides a WebSocket endpoint for streaming audio and receiving real-time transcription. The protocol flow is:

- **Connect** — open a WebSocket with your API key
- **Configure** — send `session.update` to select model, language, and prompt
- **Stream** — send `audio.append` messages with base64 audio chunks
- **Receive** — get `transcription.partial` and `transcription.completed` events
- **Close** — send `session.close` to end the session
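The client-to-server messages above can be sketched as JSON payloads. The `type` values come from the flow above; the exact field names (`model`, `language`, `prompt`, `audio`) are assumptions, so check the message reference for the real schema.

```python
import base64
import json

def session_update(model, language=None, prompt=None):
    # Configure the session; field names beyond "type" are assumed
    msg = {"type": "session.update", "model": model}
    if language is not None:
        msg["language"] = language
    if prompt is not None:
        msg["prompt"] = prompt
    return json.dumps(msg)

def audio_append(chunk):
    # Wrap a raw audio chunk as a base64 string
    return json.dumps({
        "type": "audio.append",
        "audio": base64.b64encode(chunk).decode("ascii"),
    })

def session_close():
    return json.dumps({"type": "session.close"})
```

Each helper returns a JSON string ready to pass to `ws.send(...)`.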
## Authentication

Create an API key in the VoiceRun Console, then pass it as a Bearer token in the WebSocket connection header.

```python
import websockets

ws = await websockets.connect(
    "wss://transcribe.voicerun.com/ws",
    extra_headers={"Authorization": "Bearer YOUR_API_KEY"},
)
```
If the API key is invalid, the server closes the connection with code 4001 (Unauthorized).
## Connection Flow

```
Client                                Server
  |                                     |
  | ---- WebSocket connect -------->    |
  |                                     |
  | <---- session.created ----------    | (server sends session ID)
  |                                     |
  | ---- session.update ----------->    | (client sends model config)
  |                                     |
  | <---- session.updated ----------    | (server confirms config)
  |                                     |
  | ---- audio.append ------------->    | (stream audio chunks)
  | ---- audio.append ------------->    |
  |                                     |
  | <---- speech.started -----------    | (VAD detected speech)
  | <---- transcription.partial ----    | (interim result)
  | <---- transcription.partial ----    |
  | <---- transcription.completed --    | (final transcript)
  | <---- speech.stopped -----------    | (VAD detected silence)
  |                                     |
  | ---- session.close ------------>    |
  | <---- session.closed -----------    |
```
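The server-to-client half of this diagram can be handled with a small dispatcher. The event names are taken from the flow above; the assumption that each event is a JSON object carrying `type` and `text` fields is illustrative, not from the spec.

```python
import json

def handle_event(raw, transcript):
    """Route one server event; append final text to `transcript`."""
    event = json.loads(raw)
    etype = event.get("type")
    if etype == "transcription.partial":
        # Interim hypothesis; may be revised by later events
        return ("partial", event.get("text", ""))
    if etype == "transcription.completed":
        # Final transcript for this utterance
        transcript.append(event.get("text", ""))
        return ("final", event.get("text", ""))
    if etype in ("speech.started", "speech.stopped"):
        # VAD boundary markers
        return ("vad", etype)
    return ("other", etype)
```

In a real client this would run inside `async for raw in ws:` until `session.closed` arrives.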
## Supported Models
| Provider | Model | Prompt / bias mechanism | Silence-based VAD | Advanced VAD |
|---|---|---|---|---|
| Deepgram | nova-3 | Keyterm prompting | Yes | No |
| Deepgram | flux-general-en | Keyterm prompting | Yes | CSR (conversational) |
| Qwen | qwen3-asr-flash-realtime | Context (corpus text) | Yes | No |
| OpenAI | gpt-4o-transcribe | Context prompt | Optional | Semantic |
| OpenAI | gpt-4o-mini-transcribe | Context prompt | Optional | Semantic |
| OpenAI | gpt-realtime | Context prompt / session instructions | Yes | Semantic |
| Cartesia | ink-whisper | None | Yes | No |
| ElevenLabs | scribe-v2-realtime | Keyterm prompting | Yes | No |
| Soniox | stt-rt-v4 | Context | No | Semantic |
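As a sketch, a `session.update` selecting a model and its bias mechanism from the table might look like the fragment below. The `keyterms` field name is an assumption; each provider exposes its prompt/bias mechanism differently, so consult the per-model reference.

```json
{
  "type": "session.update",
  "model": "nova-3",
  "language": "en",
  "keyterms": ["VoiceRun", "WebSocket", "PCM16"]
}
```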
## Audio Format

### PCM16 (default)
- 16-bit signed little-endian (int16)
- Mono (1 channel)
- Default sample rate: 16,000 Hz
- 20ms chunk = 320 samples = 640 bytes
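The chunk arithmetic above follows directly from the format. A small helper (names are illustrative) to size and pack chunks:

```python
import struct

SAMPLE_RATE = 16_000   # default sample rate, Hz
CHUNK_MS = 20

# 20 ms at 16 kHz -> 320 samples; int16 -> 2 bytes each -> 640 bytes
SAMPLES_PER_CHUNK = SAMPLE_RATE * CHUNK_MS // 1000
BYTES_PER_CHUNK = SAMPLES_PER_CHUNK * 2

def pack_pcm16(samples):
    # Pack int16 samples as little-endian bytes (the wire format)
    return struct.pack(f"<{len(samples)}h", *samples)
```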
### mulaw
- 8-bit mu-law encoded
- Mono (1 channel)
- Automatically converted to PCM16 on server
- 20ms chunk = 320 bytes at 16kHz
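Although the server converts mu-law to PCM16 for you, the G.711 mu-law expansion is simple enough to sketch client-side, e.g. for local debugging:

```python
import struct

def mulaw_to_pcm16(data):
    """Decode G.711 mu-law bytes to PCM16 little-endian bytes."""
    out = bytearray()
    for b in data:
        b = ~b & 0xFF                      # mu-law bytes are stored inverted
        sign = b & 0x80
        exponent = (b >> 4) & 0x07
        mantissa = b & 0x0F
        # Rebuild the linear magnitude from the segmented encoding
        magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
        out += struct.pack("<h", -magnitude if sign else magnitude)
    return bytes(out)
```

Each input byte yields one 16-bit sample, so a 320-byte mu-law chunk decodes to the 640-byte PCM16 chunk described above.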
