Reliability — Retries & Fallbacks

Two independent mechanisms make your handler more robust against transient provider failures:

  • Retry — try the same provider again after a transient error, with exponential backoff
  • Fallback — try a different provider when the primary exhausts retries or fails outright

They compose: a provider can retry before failing, and once it does fail, the next fallback kicks in.

Retries#

Basic retry#

from primfunctions.completions import (
    RetryConfiguration,
    configure_provider,
    generate_chat_completion,
)

async def handler(event, context):
    if isinstance(event, StartEvent):
        configure_provider("anthropic", voicerun_managed=True)

    if isinstance(event, TextEvent):
        user_message = event.data.get("text", "N/A")
        response = await generate_chat_completion({
            "provider": "anthropic",
            "model": "claude-haiku-4-5",
            "messages": [{"role": "user", "content": user_message}],
            "retry": RetryConfiguration(
                enabled=True,
                max_retries=3,
                retry_delay=1.0,         # initial delay (seconds)
                backoff_multiplier=2.0,  # exponential: 1s, 2s, 4s
            ),
        })

retry as a dict#

Both forms are accepted:

"retry": { "max_retries": 5, "retry_delay": 0.5, "backoff_multiplier": 1.5, }

Retry fields#

Field               Default  Purpose
enabled             True     Master switch
max_retries         3        Attempts after the first
retry_delay         1.0      Initial delay between attempts, in seconds
backoff_multiplier  2.0      Delay is multiplied by this after each attempt

Default schedule with the defaults above (the first attempt plus three retries; a sketch for previewing other schedules follows the list):

attempt 1: immediate
attempt 2: wait 1s
attempt 3: wait 2s
attempt 4: wait 4s
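
The delay before retry n is retry_delay × backoff_multiplier^(n−1), so you can preview a schedule before committing to one. This is plain arithmetic, not a library call; the helper below is only illustrative:

def retry_schedule(max_retries=3, retry_delay=1.0, backoff_multiplier=2.0):
    """Delays (in seconds) waited before each retry, mirroring the defaults above."""
    return [retry_delay * backoff_multiplier ** n for n in range(max_retries)]

print(retry_schedule())             # [1.0, 2.0, 4.0]
print(retry_schedule(5, 0.5, 1.5))  # [0.5, 0.75, 1.125, 1.6875, 2.53125]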

Streaming retries#

For streaming, retries only fire before the first chunk is emitted. Once the server starts producing content, a mid-stream failure surfaces as an exception inside your async for loop rather than triggering a retry, because retrying would duplicate output the user has already heard. A handling sketch follows the lists below.

Retried before first chunk:

  • Connection failures and timeouts
  • 429 rate limits
  • 5xx server errors
  • Auth errors (401)

Not retried after the first chunk:

  • Mid-stream disconnects
  • Provider errors that arrive after streaming began
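
Because nothing is retried after the first chunk, a mid-stream failure is yours to handle. A minimal sketch, assuming your handler is an async generator as in the other examples on this page and that the stream yields sentence chunks; how you finish the interrupted turn is up to you:

stream = await generate_chat_completion_stream(
    request={
        "provider": "anthropic",
        "model": "claude-haiku-4-5",
        "messages": [{"role": "user", "content": user_message}],
        "retry": {"max_retries": 3},  # applies only before the first chunk
    },
    stream_options={"chunk_by_sentence": True},
)

try:
    async for sentence in stream:
        ...  # forward each sentence downstream as it arrives
except Exception as exc:
    # The provider failed after streaming began: no automatic retry or fallback,
    # so decide here how to finish the turn (apologize, re-prompt, etc.).
    yield LogEvent(f"stream aborted mid-turn: {exc}")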

Fallbacks#

Basic fallback#

Every provider you list — primary and fallback — must be registered via configure_provider. Do not put an api_key field on a fallback entry; the proxy injects the right key from your session's provider map automatically.

from primfunctions.completions import (
    configure_provider,
    generate_chat_completion,
)

async def handler(event, context):
    if isinstance(event, StartEvent):
        configure_provider("anthropic", voicerun_managed=True)
        configure_provider("openai", voicerun_managed=True)

    if isinstance(event, TextEvent):
        user_message = event.data.get("text", "N/A")
        response = await generate_chat_completion({
            "provider": "anthropic",
            "model": "claude-haiku-4-5",
            "messages": [{"role": "user", "content": user_message}],
            "fallbacks": [
                {"provider": "openai", "model": "gpt-4.1-mini"},
            ],
        })

Fallback chain#

Provide multiple fallbacks. They're attempted in order — the proxy tries primary first, then fallbacks[0], then fallbacks[1], etc.

# Every provider registered up front
configure_provider("anthropic", voicerun_managed=True)
configure_provider("openai", voicerun_managed=True)
configure_provider("google", voicerun_managed=True)

response = await generate_chat_completion({
    "provider": "anthropic",
    "model": "claude-haiku-4-5",
    "messages": [{"role": "user", "content": user_message}],
    "fallbacks": [
        {"provider": "openai", "model": "gpt-4.1-mini"},
        {"provider": "google", "model": "gemini-2.5-flash"},
    ],
})
# Tries: anthropic → openai → google

Typed fallbacks#

FallbackRequest gives you the same type safety as ChatCompletionRequest:

from primfunctions.completions import (
    ChatCompletionRequest,  # assumed to live alongside FallbackRequest
    FallbackRequest,
    UserMessage,            # assumed to live alongside FallbackRequest
)

response = await generate_chat_completion(ChatCompletionRequest(
    provider="anthropic",
    model="claude-haiku-4-5",
    messages=[UserMessage(content=user_message)],
    fallbacks=[
        FallbackRequest(provider="openai", model="gpt-4.1-mini"),
    ],
))

Partial overrides#

A FallbackRequest only needs to specify what differs from the primary. Everything else inherits:

response = await generate_chat_completion({
    "provider": "anthropic",
    "model": "claude-haiku-4-5",
    "messages": [{"role": "user", "content": user_message}],
    "temperature": 0.7,
    "max_tokens": 1000,
    "tools": [...],
    "fallbacks": [
        {"provider": "openai"},
        # model, messages, temperature, max_tokens, tools all inherited.
        # The fallback still needs configure_provider("openai", ...).
    ],
})

A field explicitly set on a fallback overrides the primary's value; only fields left unset inherit.

api_key is not an allowed fallback field. Passing it raises ValueError. The proxy resolves the key from configure_provider for each fallback's provider.
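
If you would rather log that misconfiguration than let it propagate, you can catch the ValueError at the call site. A sketch; the placeholder key is only there to show the rejected field:

try:
    response = await generate_chat_completion({
        "provider": "anthropic",
        "model": "claude-haiku-4-5",
        "messages": [{"role": "user", "content": user_message}],
        "fallbacks": [
            # Rejected: api_key is not an allowed fallback field.
            {"provider": "openai", "model": "gpt-4.1-mini", "api_key": "sk-..."},
        ],
    })
except ValueError as exc:
    yield LogEvent(f"invalid fallback config: {exc}")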

Streaming fallbacks#

Fallbacks work with streaming just as they do with non-streaming, with the same caveat as retries: the proxy only falls back before the first chunk. Once any chunk has reached your handler, the current provider owns the rest of the turn.

stream = await generate_chat_completion_stream(
    request={
        "provider": "anthropic",
        "model": "claude-haiku-4-5",
        "messages": [{"role": "user", "content": user_message}],
        "fallbacks": [
            {"provider": "openai", "model": "gpt-4.1-mini"},
        ],
    },
    stream_options={"chunk_by_sentence": True, "clean_sentences": True},
)

Retries + fallbacks#

Each provider (the primary and each fallback) runs its own full set of retries under the given retry config. The fallback kicks in only after the primary has exhausted its retries.

response = await generate_chat_completion({
    "provider": "anthropic",
    "model": "claude-haiku-4-5",
    "messages": [{"role": "user", "content": user_message}],
    "retry": {"max_retries": 3},
    "fallbacks": [
        # inherits the retry config unless it sets its own
        {"provider": "openai", "model": "gpt-4.1-mini"},
    ],
})
# Attempts, in order:
#   anthropic #1, #2, #3, #4 (all fail)
#   openai    #1, #2, #3, #4
# If all fail → CompletionsProxyError.

A fallback can also override the retry config if you want a faster second try:

"fallbacks": [ {"provider": "openai", "model": "gpt-4.1-mini", "retry": {"max_retries": 0}}, ],

Detecting which provider answered#

The response has response.provider and response.model, which reflect the actual provider and model that produced the output — useful for logging when fallbacks fired:

response = await generate_chat_completion({...})

yield LogEvent(f"served by {response.provider} / {response.model}")
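
Since your request names the intended primary, comparing against it is a straightforward way to spot turns served by a fallback. A sketch, reusing the request shape from the earlier examples:

request = {
    "provider": "anthropic",
    "model": "claude-haiku-4-5",
    "messages": [{"role": "user", "content": user_message}],
    "fallbacks": [{"provider": "openai", "model": "gpt-4.1-mini"}],
}
response = await generate_chat_completion(request)

if response.provider != request["provider"]:
    # A fallback served this turn; worth surfacing in your logs or metrics.
    yield LogEvent(f"fallback fired: {response.provider} / {response.model}")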

Next steps#

  • retries
  • fallbacks
  • error-handling