Reliability — Retries & Fallbacks

Two independent mechanisms make your handler more robust against transient provider failures:

  • Retry — try the same provider again after a transient error, with exponential backoff
  • Fallback — try a different provider when the primary exhausts retries or fails outright

They compose: a provider can retry before failing, and once it does fail, the next fallback kicks in.

Retries#

Basic retry#

from primfunctions.completions import (
    RetryConfiguration,
    configure_provider,
    generate_chat_completion,
)

async def handler(event, context):
    if isinstance(event, StartEvent):
        configure_provider("anthropic", voicerun_managed=True)

    if isinstance(event, TextEvent):
        user_message = event.data.get("text", "N/A")
        response = await generate_chat_completion({
            "provider": "anthropic",
            "model": "claude-haiku-4-5",
            "messages": [{"role": "user", "content": user_message}],
            "retry": RetryConfiguration(
                enabled=True,
                max_retries=3,
                retry_delay=1.0,         # initial delay (seconds)
                backoff_multiplier=2.0,  # exponential: 1s, 2s, 4s
            ),
        })

retry as a dict#

Both forms are accepted:

"retry": { "max_retries": 5, "retry_delay": 0.5, "backoff_multiplier": 1.5, }

Retry fields#

Field               Default  Purpose
enabled             True     Master switch
max_retries         3        Attempts after the first
retry_delay         1.0      Initial delay between attempts, in seconds
backoff_multiplier  2.0      Delay is multiplied by this after each attempt

Default schedule with the defaults above (the first attempt plus three retries; a sketch for previewing other schedules follows the list):

attempt 1: immediate
attempt 2: wait 1s
attempt 3: wait 2s
attempt 4: wait 4s
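
The delay before retry n is retry_delay × backoff_multiplier^(n−1), so you can preview a schedule before committing to one. This is plain arithmetic, not a library call; the helper below is only illustrative:

def retry_schedule(max_retries=3, retry_delay=1.0, backoff_multiplier=2.0):
    """Delays (in seconds) waited before each retry, mirroring the defaults above."""
    return [retry_delay * backoff_multiplier ** n for n in range(max_retries)]

print(retry_schedule())             # [1.0, 2.0, 4.0]
print(retry_schedule(5, 0.5, 1.5))  # [0.5, 0.75, 1.125, 1.6875, 2.53125]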

Streaming retries#

For streaming, retries only fire before the first chunk is emitted. Once the server starts producing content, a mid-stream failure surfaces as an exception inside your async for loop rather than triggering a retry, because retrying would duplicate output the user has already heard. A handling sketch follows the lists below.

Retried before first chunk:

  • Connection failures and timeouts
  • 429 rate limits
  • 5xx server errors
  • Auth errors (401)

Not retried after the first chunk:

  • Mid-stream disconnects
  • Provider errors that arrive after streaming began
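
Because nothing is retried after the first chunk, a mid-stream failure is yours to handle. A minimal sketch, assuming your handler is an async generator as in the other examples on this page and that the stream yields sentence chunks; how you finish the interrupted turn is up to you:

stream = await generate_chat_completion_stream(
    request={
        "provider": "anthropic",
        "model": "claude-haiku-4-5",
        "messages": [{"role": "user", "content": user_message}],
        "retry": {"max_retries": 3},  # applies only before the first chunk
    },
    stream_options={"chunk_by_sentence": True},
)

try:
    async for sentence in stream:
        ...  # forward each sentence downstream as it arrives
except Exception as exc:
    # The provider failed after streaming began: no automatic retry or fallback,
    # so decide here how to finish the turn (apologize, re-prompt, etc.).
    yield LogEvent(f"stream aborted mid-turn: {exc}")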

Fallbacks#

Basic fallback#

Every provider you list — primary and fallback — must be registered via configure_provider. Do not put an api_key field on a fallback entry; the proxy injects the right key from your session's provider map automatically.

from primfunctions.completions import (
    configure_provider,
    generate_chat_completion,
)

async def handler(event, context):
    if isinstance(event, StartEvent):
        configure_provider("anthropic", voicerun_managed=True)
        configure_provider("openai", voicerun_managed=True)

    if isinstance(event, TextEvent):
        user_message = event.data.get("text", "N/A")
        response = await generate_chat_completion({
            "provider": "anthropic",
            "model": "claude-haiku-4-5",
            "messages": [{"role": "user", "content": user_message}],
            "fallbacks": [
                {"provider": "openai", "model": "gpt-4.1-mini"},
            ],
        })

Fallback chain#

Provide multiple fallbacks. They're attempted in order — the proxy tries primary first, then fallbacks[0], then fallbacks[1], etc.

# Every provider registered up front
configure_provider("anthropic", voicerun_managed=True)
configure_provider("openai", voicerun_managed=True)
configure_provider("google", voicerun_managed=True)

response = await generate_chat_completion({
    "provider": "anthropic",
    "model": "claude-haiku-4-5",
    "messages": [{"role": "user", "content": user_message}],
    "fallbacks": [
        {"provider": "openai", "model": "gpt-4.1-mini"},
        {"provider": "google", "model": "gemini-2.5-flash"},
    ],
})
# Tries: anthropic → openai → google

Typed fallbacks#

FallbackRequest gives you the same type safety as ChatCompletionRequest:

from primfunctions.completions import (
    ChatCompletionRequest,  # assumed to live alongside FallbackRequest
    FallbackRequest,
    UserMessage,            # assumed to live alongside FallbackRequest
)

response = await generate_chat_completion(ChatCompletionRequest(
    provider="anthropic",
    model="claude-haiku-4-5",
    messages=[UserMessage(content=user_message)],
    fallbacks=[
        FallbackRequest(provider="openai", model="gpt-4.1-mini"),
    ],
))

Partial overrides#

A FallbackRequest only needs to specify what differs from the primary. Everything else inherits:

response = await generate_chat_completion({
    "provider": "anthropic",
    "model": "claude-haiku-4-5",
    "messages": [{"role": "user", "content": user_message}],
    "temperature": 0.7,
    "max_tokens": 1000,
    "tools": [...],
    "fallbacks": [
        {"provider": "openai"},
        # model, messages, temperature, max_tokens, tools all inherited.
        # The fallback still needs configure_provider("openai", ...).
    ],
})

A field explicitly set on a fallback overrides the primary's value; only fields left unset inherit.

api_key is not an allowed fallback field. Passing it raises ValueError. The proxy resolves the key from configure_provider for each fallback's provider.
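
If you would rather log that misconfiguration than let it propagate, you can catch the ValueError at the call site. A sketch; the placeholder key is only there to show the rejected field:

try:
    response = await generate_chat_completion({
        "provider": "anthropic",
        "model": "claude-haiku-4-5",
        "messages": [{"role": "user", "content": user_message}],
        "fallbacks": [
            # Rejected: api_key is not an allowed fallback field.
            {"provider": "openai", "model": "gpt-4.1-mini", "api_key": "sk-..."},
        ],
    })
except ValueError as exc:
    yield LogEvent(f"invalid fallback config: {exc}")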

Streaming fallbacks#

Fallbacks work with streaming just as they do with non-streaming, with the same caveat as retries: the proxy only falls back before the first chunk. Once any chunk has reached your handler, the current provider owns the rest of the turn.

stream = await generate_chat_completion_stream(
    request={
        "provider": "anthropic",
        "model": "claude-haiku-4-5",
        "messages": [{"role": "user", "content": user_message}],
        "fallbacks": [
            {"provider": "openai", "model": "gpt-4.1-mini"},
        ],
    },
    stream_options={"chunk_by_sentence": True, "clean_sentences": True},
)

Retries + fallbacks#

Each provider (the primary and each fallback) runs its own full set of retries under the given retry config. The fallback kicks in only after the primary has exhausted its retries.

response = await generate_chat_completion({
    "provider": "anthropic",
    "model": "claude-haiku-4-5",
    "messages": [{"role": "user", "content": user_message}],
    "retry": {"max_retries": 3},
    "fallbacks": [
        # inherits the retry config unless it sets its own
        {"provider": "openai", "model": "gpt-4.1-mini"},
    ],
})
# Attempts, in order:
#   anthropic #1, #2, #3, #4 (all fail)
#   openai    #1, #2, #3, #4
# If all fail → CompletionsProxyError.

A fallback can also override the retry config if you want a faster second try:

"fallbacks": [ {"provider": "openai", "model": "gpt-4.1-mini", "retry": {"max_retries": 0}}, ],

Detecting which provider answered#

The response has response.provider and response.model, which reflect the actual provider and model that produced the output — useful for logging when fallbacks fired:

response = await generate_chat_completion({...})

yield LogEvent(f"served by {response.provider} / {response.model}")
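
Since your request names the intended primary, comparing against it is a straightforward way to spot turns served by a fallback. A sketch, reusing the request shape from the earlier examples:

request = {
    "provider": "anthropic",
    "model": "claude-haiku-4-5",
    "messages": [{"role": "user", "content": user_message}],
    "fallbacks": [{"provider": "openai", "model": "gpt-4.1-mini"}],
}
response = await generate_chat_completion(request)

if response.provider != request["provider"]:
    # A fallback served this turn; worth surfacing in your logs or metrics.
    yield LogEvent(f"fallback fired: {response.provider} / {response.model}")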

Next steps#

  • retries
  • fallbacks
  • error-handling