VoiceRun Transcribe

Real-time speech-to-text over WebSocket. Stream audio from a microphone or file and receive live transcription events including partial results, final transcripts, and speech activity detection.

How It Works

The Transcribe API provides a WebSocket endpoint for streaming audio and receiving real-time transcription. The protocol flow is:

  1. Connect — open a WebSocket with your API key
  2. Configure — send session.update to select model, language, and prompt
  3. Stream — send audio.append messages with base64 audio chunks
  4. Receive — get transcription.partial and transcription.completed events
  5. Close — send session.close to end the session
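The five steps above can be sketched in Python. This is a minimal sketch, assuming JSON messages keyed by the `type` values shown; all other payload fields (`model`, `language`, `audio`, `text`) are assumptions, not a documented schema:

```python
import asyncio
import base64
import json


def encode_chunk(chunk: bytes) -> str:
    """Base64-encode a raw audio chunk for an audio.append message."""
    return base64.b64encode(chunk).decode("ascii")


async def transcribe(chunks):
    # Third-party dependency (pip install websockets), imported here so the
    # encoding helper above stays usable without it.
    import websockets

    # 1. Connect with the API key
    async with websockets.connect(
        "wss://transcribe.voicerun.com/ws",
        extra_headers={"Authorization": "Bearer YOUR_API_KEY"},
    ) as ws:
        # 2. Configure model and language
        await ws.send(json.dumps(
            {"type": "session.update", "model": "nova-3", "language": "en"}
        ))
        # 3. Stream base64 audio chunks
        for chunk in chunks:
            await ws.send(json.dumps(
                {"type": "audio.append", "audio": encode_chunk(chunk)}
            ))
        await ws.send(json.dumps({"type": "session.close"}))  # 5. Close
        # 4. Receive events until the server confirms the close
        async for raw in ws:
            event = json.loads(raw)
            if event["type"] == "transcription.completed":
                print(event.get("text"))
            elif event["type"] == "session.closed":
                break
```

Run with `asyncio.run(transcribe(chunks))`, where `chunks` is an iterable of raw PCM16 byte strings.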

Authentication

Create an API key in the VoiceRun Console, then pass it as a Bearer token in the Authorization header of the WebSocket handshake.

```python
import websockets

# Note: websockets >= 14 renamed this keyword to additional_headers=
ws = await websockets.connect(
    "wss://transcribe.voicerun.com/ws",
    extra_headers={"Authorization": "Bearer YOUR_API_KEY"},
)
```

If the API key is invalid, the server closes the connection with code 4001 (Unauthorized).


Connection Flow

Client                              Server
  |                                    |
  |  ---- WebSocket connect ------>    |
  |                                    |
  |  <---- session.created ----------  |  (server sends session ID)
  |                                    |
  |  ---- session.update ---------->   |  (client sends model config)
  |                                    |
  |  <---- session.updated ----------  |  (server confirms config)
  |                                    |
  |  ---- audio.append ------------>   |  (stream audio chunks)
  |  ---- audio.append ------------>   |
  |                                    |
  |  <---- speech.started -----------  |  (VAD detected speech)
  |  <---- transcription.partial ----  |  (interim result)
  |  <---- transcription.partial ----  |
  |  <---- transcription.completed --  |  (final transcript)
  |  <---- speech.stopped -----------  |  (VAD detected silence)
  |                                    |
  |  ---- session.close ----------->   |
  |  <---- session.closed -----------  |
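Concretely, the client-side payloads in this exchange might look like the following. Only the `type` values come from the flow above; every other field name is an assumption:

```python
import base64
import json

# One 20 ms chunk of PCM16 silence at 16 kHz: 320 samples x 2 bytes = 640 bytes
pcm_chunk = b"\x00\x00" * 320

session_update = json.dumps({
    "type": "session.update",
    "model": "nova-3",   # hypothetical config fields
    "language": "en",
})

audio_append = json.dumps({
    "type": "audio.append",
    "audio": base64.b64encode(pcm_chunk).decode("ascii"),
})

session_close = json.dumps({"type": "session.close"})

print(session_close)  # -> {"type": "session.close"}
```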

Supported Models

| Provider   | Model                    | Prompt / bias mechanism               | Silence-based VAD | Advanced VAD         |
|------------|--------------------------|---------------------------------------|-------------------|----------------------|
| Deepgram   | nova-3                   | Keyterm prompting                     | Yes               | No                   |
| Deepgram   | flux-general-en          | Keyterm prompting                     | Yes               | CSR (conversational) |
| Qwen       | qwen3-asr-flash-realtime | Context (corpus text)                 | Yes               | No                   |
| OpenAI     | gpt-4o-transcribe        | Context prompt                        | Optional          | Semantic             |
| OpenAI     | gpt-4o-mini-transcribe   | Context prompt                        | Optional          | Semantic             |
| OpenAI     | gpt-realtime             | Context prompt / session instructions | Yes               | Semantic             |
| Cartesia   | ink-whisper              | None                                  | Yes               | No                   |
| ElevenLabs | scribe-v2-realtime       | Keyterm prompting                     | Yes               | No                   |
| Soniox     | stt-rt-v4                | Context                               | No                | Semantic             |
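To make the prompt / bias column concrete, here is a hypothetical pair of session.update payloads; the `keyterms` and `prompt` field names are illustrative assumptions, not the documented schema:

```python
# Keyterm prompting (e.g. Deepgram nova-3, ElevenLabs scribe-v2-realtime):
# bias recognition toward a short list of expected terms.
deepgram_cfg = {
    "type": "session.update",
    "model": "nova-3",
    "keyterms": ["VoiceRun", "PCM16", "mu-law"],  # hypothetical field
}

# Context prompting (e.g. OpenAI gpt-4o-transcribe, Soniox stt-rt-v4):
# supply free-form text the model can draw vocabulary from.
openai_cfg = {
    "type": "session.update",
    "model": "gpt-4o-transcribe",
    "prompt": "A call about the VoiceRun Transcribe API.",  # hypothetical field
}
```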

Audio Format

PCM16 (default)

  • 16-bit signed little-endian (int16)
  • Mono (1 channel)
  • Default sample rate: 16,000 Hz
  • 20ms chunk = 320 samples = 640 bytes
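The chunk arithmetic follows directly from duration × sample rate; a quick sanity check:

```python
BYTES_PER_SAMPLE = 2  # 16-bit signed int = 2 bytes

def pcm16_chunk_size(ms: int, sample_rate: int = 16_000) -> tuple[int, int]:
    """Return (samples, bytes) for a mono PCM16 chunk of the given duration."""
    samples = sample_rate * ms // 1000
    return samples, samples * BYTES_PER_SAMPLE

print(pcm16_chunk_size(20))  # -> (320, 640), matching the bullet above
```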

mulaw

  • 8-bit mu-law encoded
  • Mono (1 channel)
  • Automatically converted to PCM16 on server
  • 20ms chunk = 320 bytes at 16kHz
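The server handles this conversion, but for reference, G.711 mu-law expansion can be sketched in pure Python (this is the standard algorithm, not VoiceRun's implementation):

```python
MULAW_BIAS = 0x84  # 132, the G.711 bias

def mulaw_decode(data: bytes) -> list[int]:
    """Decode 8-bit mu-law bytes to 16-bit signed PCM samples (G.711)."""
    samples = []
    for byte in data:
        byte = ~byte & 0xFF            # mu-law bytes are stored inverted
        sign = byte & 0x80
        exponent = (byte >> 4) & 0x07
        mantissa = byte & 0x0F
        magnitude = (((mantissa << 3) + MULAW_BIAS) << exponent) - MULAW_BIAS
        samples.append(-magnitude if sign else magnitude)
    return samples

print(mulaw_decode(b"\xff\x80"))  # -> [0, 32124]: silence, then the max sample
```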