# Reliability — Retries & Fallbacks
Two independent mechanisms make your handler more robust against transient provider failures:
- Retry — try the same provider again after a transient error, with exponential backoff
- Fallback — try a different provider when the primary exhausts retries or fails outright
They compose: a provider can retry before failing, and once it does fail, the next fallback kicks in.
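The composed attempt order can be sketched with a small helper (illustrative only, not part of the library — each provider gets one initial try plus its retries before the next fallback is attempted):

```python
def attempt_order(providers: list[str], max_retries: int) -> list[str]:
    """Illustrative: the order of attempts when retries and fallbacks compose.
    Each provider gets 1 initial try + max_retries retries before the next
    provider in the chain is tried."""
    return [
        f"{provider} attempt #{n + 1}"
        for provider in providers
        for n in range(max_retries + 1)
    ]

print(attempt_order(["anthropic", "openai"], 1))
# ['anthropic attempt #1', 'anthropic attempt #2',
#  'openai attempt #1', 'openai attempt #2']
```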
## Retries

### Basic retry
```python
from primfunctions.completions import (
    RetryConfiguration,
    configure_provider,
    generate_chat_completion,
)

async def handler(event, context):
    if isinstance(event, StartEvent):
        configure_provider("anthropic", voicerun_managed=True)
    if isinstance(event, TextEvent):
        user_message = event.data.get("text", "N/A")
        response = await generate_chat_completion({
            "provider": "anthropic",
            "model": "claude-haiku-4-5",
            "messages": [{"role": "user", "content": user_message}],
            "retry": RetryConfiguration(
                enabled=True,
                max_retries=3,
                retry_delay=1.0,         # initial delay (seconds)
                backoff_multiplier=2.0,  # exponential: 1s, 2s, 4s
            ),
        })
```
### Retry as a dict
Both forms are accepted:
"retry": { "max_retries": 5, "retry_delay": 0.5, "backoff_multiplier": 1.5, }
### Retry fields

| Field | Default | Purpose |
|---|---|---|
| `enabled` | `True` | Master switch |
| `max_retries` | `3` | Attempts after the first |
| `retry_delay` | `1.0` | Initial delay between attempts, in seconds |
| `backoff_multiplier` | `2.0` | Delay is multiplied by this each attempt |
Default schedule (attempts after the first):
- attempt 1: immediate
- attempt 2: wait 1s
- attempt 3: wait 2s
- attempt 4: wait 4s
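The schedule above follows directly from `retry_delay` and `backoff_multiplier`. As a sketch (the helper below is illustrative, not a library function):

```python
def backoff_delays(max_retries: int, retry_delay: float,
                   backoff_multiplier: float) -> list[float]:
    """Illustrative: delay in seconds before retry n is
    retry_delay * backoff_multiplier ** n."""
    return [retry_delay * backoff_multiplier ** n for n in range(max_retries)]

print(backoff_delays(3, 1.0, 2.0))  # [1.0, 2.0, 4.0]
```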
### Streaming retries
For streaming, retries only fire before the first chunk is emitted. Once the server starts producing content, a mid-stream failure surfaces as an exception inside async for rather than triggering a retry — retrying would duplicate output the user has already heard.
Retried before first chunk:
- Connection failures and timeouts
- 429 rate limits
- 5xx server errors
- Auth errors (401)
Not retried after the first chunk:
- Mid-stream disconnects
- Provider errors that arrive after streaming began
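Since a mid-stream failure surfaces as an exception inside `async for`, one way to structure the consuming side is to catch it around the loop. A minimal sketch, assuming the stream is an async iterator of chunks (the fake stream is a stand-in, not library code):

```python
import asyncio

async def consume(stream):
    """Sketch: a mid-stream failure raises inside `async for`. Chunks already
    emitted cannot be retried, so decide how to end the turn here."""
    received = []
    try:
        async for chunk in stream:
            received.append(chunk)  # already delivered downstream
    except Exception:
        pass  # e.g. log the error and close the turn gracefully
    return received

# Stand-in stream that fails after two chunks (illustrative only):
async def fake_stream():
    yield "Hello, "
    yield "world"
    raise RuntimeError("provider disconnected mid-stream")

print(asyncio.run(consume(fake_stream())))  # ['Hello, ', 'world']
```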
## Fallbacks

### Basic fallback
Every provider you list — primary and fallback — must be registered via configure_provider. Do not put an api_key field on a fallback entry; the proxy injects the right key from your session's provider map automatically.
```python
from primfunctions.completions import (
    configure_provider,
    generate_chat_completion,
)

async def handler(event, context):
    if isinstance(event, StartEvent):
        configure_provider("anthropic", voicerun_managed=True)
        configure_provider("openai", voicerun_managed=True)
    if isinstance(event, TextEvent):
        user_message = event.data.get("text", "N/A")
        response = await generate_chat_completion({
            "provider": "anthropic",
            "model": "claude-haiku-4-5",
            "messages": [{"role": "user", "content": user_message}],
            "fallbacks": [
                {"provider": "openai", "model": "gpt-4.1-mini"},
            ],
        })
```
### Fallback chain
Provide multiple fallbacks. They're attempted in order — the proxy tries primary first, then fallbacks[0], then fallbacks[1], etc.
```python
# Every provider registered up front
configure_provider("anthropic", voicerun_managed=True)
configure_provider("openai", voicerun_managed=True)
configure_provider("google", voicerun_managed=True)

response = await generate_chat_completion({
    "provider": "anthropic",
    "model": "claude-haiku-4-5",
    "messages": [{"role": "user", "content": user_message}],
    "fallbacks": [
        {"provider": "openai", "model": "gpt-4.1-mini"},
        {"provider": "google", "model": "gemini-2.5-flash"},
    ],
})
# Tries: anthropic → openai → google
```
### Typed fallbacks
FallbackRequest gives you the same type safety as ChatCompletionRequest:
```python
from primfunctions.completions import (
    ChatCompletionRequest,
    FallbackRequest,
    UserMessage,
    generate_chat_completion,
)

response = await generate_chat_completion(ChatCompletionRequest(
    provider="anthropic",
    model="claude-haiku-4-5",
    messages=[UserMessage(content=user_message)],
    fallbacks=[
        FallbackRequest(provider="openai", model="gpt-4.1-mini"),
    ],
))
```
### Partial overrides
A FallbackRequest only needs to specify what differs from the primary. Everything else inherits:
```python
response = await generate_chat_completion({
    "provider": "anthropic",
    "model": "claude-haiku-4-5",
    "messages": [{"role": "user", "content": user_message}],
    "temperature": 0.7,
    "max_tokens": 1000,
    "tools": [...],
    "fallbacks": [
        # model, messages, temperature, max_tokens, tools all inherited.
        # The fallback still needs configure_provider("openai", ...).
        {"provider": "openai"},
    ],
})
```
A field explicitly set on a fallback always overrides the primary's value; only unset fields inherit.
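The inheritance rule can be sketched as a plain dict merge (an illustration of the semantics only, not the proxy's actual code; `resolve_fallback` is a hypothetical helper):

```python
def resolve_fallback(primary: dict, fallback: dict) -> dict:
    """Hypothetical illustration: fields set on the fallback win,
    everything else comes from the primary request."""
    merged = {**primary, **fallback}
    merged.pop("fallbacks", None)  # a fallback entry has no nested fallbacks
    return merged

primary = {"provider": "anthropic", "model": "claude-haiku-4-5", "temperature": 0.7}
print(resolve_fallback(primary, {"provider": "openai"}))
# {'provider': 'openai', 'model': 'claude-haiku-4-5', 'temperature': 0.7}
```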
`api_key` is not an allowed fallback field. Passing it raises `ValueError`. The proxy resolves the key from `configure_provider` for each fallback's `provider`.
### Streaming fallbacks
Fallbacks work with streaming exactly as they do with non-streaming — the proxy only falls back before the first chunk. Once any chunk has reached your handler, the current provider owns the rest of the turn.
```python
stream = await generate_chat_completion_stream(
    request={
        "provider": "anthropic",
        "model": "claude-haiku-4-5",
        "messages": [{"role": "user", "content": user_message}],
        "fallbacks": [
            {"provider": "openai", "model": "gpt-4.1-mini"},
        ],
    },
    stream_options={"chunk_by_sentence": True, "clean_sentences": True},
)
```
## Retries + fallbacks
Each provider (the primary and every fallback) runs its own full set of retries against the given retry config. A fallback kicks in only after the previous provider in the chain has exhausted its retries.
```python
response = await generate_chat_completion({
    "provider": "anthropic",
    "model": "claude-haiku-4-5",
    "messages": [{"role": "user", "content": user_message}],
    "retry": {"max_retries": 3},
    "fallbacks": [
        # inherits the retry config unless it sets its own
        {"provider": "openai", "model": "gpt-4.1-mini"},
    ],
})
# Attempts, in order:
#   anthropic #1, #2, #3, #4 (all fail)
#   openai    #1, #2, #3, #4
# If all fail → CompletionsProxyError.
```
A fallback can also override the retry config if you want a faster second try:
"fallbacks": [ {"provider": "openai", "model": "gpt-4.1-mini", "retry": {"max_retries": 0}}, ],
## Detecting which provider answered
The response has response.provider and response.model, which reflect the actual provider and model that produced the output — useful for logging when fallbacks fired:
```python
response = await generate_chat_completion({...})
yield LogEvent(f"served by {response.provider} / {response.model}")
```
## Next steps
- Examples — full retry + fallback handler
- Provider Configuration — why every provider must be registered
