Advanced Features

Provider-Specific Arguments

Each provider supports provider-specific keyword arguments through provider_kwargs. Arguments are keyed by provider name, so you can specify kwargs for multiple providers in a single request—useful when configuring fallbacks.

Structure

```python
provider_kwargs={
    "openai": {...},            # OpenAI-specific args
    "anthropic": {...},         # Anthropic-specific args
    "google": {...},            # Google Gemini-specific args
    "anthropic_vertex": {...},  # Anthropic Vertex-specific args
}
```

Only the kwargs for the active provider are used; others are ignored.

OpenAI

```python
response = await generate_chat_completion({
    "provider": "openai",
    "api_key": "sk-...",
    "model": "gpt-5-mini",
    "messages": [{"role": "user", "content": "Hello"}],
    "provider_kwargs": {
        "openai": {
            "service_tier": "priority",
            "reasoning_effort": "none",
        }
    }
})
```

See OpenAI API Reference for all valid kwargs.

Anthropic

```python
response = await generate_chat_completion({
    "provider": "anthropic",
    "api_key": "sk-ant-...",
    "model": "claude-sonnet-4-5-20250514",
    "max_tokens": 16000,
    "messages": [{"role": "user", "content": "Think about this..."}],
    "provider_kwargs": {
        "anthropic": {
            "thinking": {"type": "enabled", "budget_tokens": 10000},
        }
    }
})
```

Note: Extended thinking is disabled by default in this client. To enable it, explicitly set thinking.type to "enabled".

See Anthropic API Reference for all valid kwargs.

Google

```python
response = await generate_chat_completion({
    "provider": "google",
    "api_key": "...",
    "model": "gemini-2.5-flash",
    "messages": [{"role": "user", "content": "Hello"}],
    "provider_kwargs": {
        "google": {
            "thinking_config": {"thinking_budget": 10000, "include_thoughts": True},
            "safety_settings": [...],
            # Any other GenerateContentConfig options
        }
    }
})
```

Note: Thinking is disabled by default in this client.

See Google AI API Reference for all valid kwargs.

Alibaba (Qwen via DashScope)

```python
response = await generate_chat_completion({
    "provider": "alibaba",
    "api_key": context.variables.get("DASHSCOPE_API_KEY"),
    "model": "qwen3.5-plus",
    "messages": [{"role": "user", "content": "Hello"}],
    "provider_kwargs": {
        "alibaba": {
            "enable_search": True,  # Enable Qwen's built-in web search
        }
    }
})
```

You can also override the regional endpoint:

```python
"provider_kwargs": {
    "alibaba": {
        "base_url": "https://dashscope-us.aliyuncs.com/compatible-mode/v1",  # Virginia
    }
}
```

Available regions:

  • Singapore (default): https://dashscope-intl.aliyuncs.com/compatible-mode/v1
  • Virginia (US): https://dashscope-us.aliyuncs.com/compatible-mode/v1
  • Beijing (CN): https://dashscope.aliyuncs.com/compatible-mode/v1
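
If you select the region per deployment, the endpoints above can live in a small mapping. This is a hedged sketch; the mapping and helper names are illustrative, not part of the library:

```python
DASHSCOPE_ENDPOINTS = {
    "singapore": "https://dashscope-intl.aliyuncs.com/compatible-mode/v1",  # default
    "virginia": "https://dashscope-us.aliyuncs.com/compatible-mode/v1",
    "beijing": "https://dashscope.aliyuncs.com/compatible-mode/v1",
}

def alibaba_kwargs(region: str = "singapore") -> dict:
    """Build the "alibaba" provider_kwargs entry for a region."""
    # Omit base_url for the default region; Singapore is used when unset.
    if region == "singapore":
        return {}
    return {"base_url": DASHSCOPE_ENDPOINTS[region]}
```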

See DashScope API Reference for all valid kwargs.

Multi-Provider Example (with Fallbacks)

```python
response = await generate_chat_completion({
    "provider": "openai",
    "api_key": os.getenv("OPENAI_API_KEY"),
    "model": "gpt-5-mini",
    "messages": [{"role": "user", "content": "Hello"}],
    "provider_kwargs": {
        "openai": {"service_tier": "flex"},
        "anthropic": {"thinking": {"type": "disabled"}},
    },
    "fallbacks": [
        {
            "provider": "anthropic",
            "api_key": os.getenv("ANTHROPIC_API_KEY"),
            "model": "claude-haiku-4-5",
        }
    ]
})
```

When OpenAI fails, the fallback uses Anthropic with its configured kwargs.

Anthropic Cache Breakpoints

For Anthropic models, you can use cache breakpoints to optimize prompt caching.

Cache prefixes are built in the order: tools → system → messages. Place large, stable content early and put a breakpoint at the end of each cacheable section. When mixing TTLs, longer durations ("1h") must appear before shorter ones ("5m"). The cached prefix must meet the model's minimum token threshold (1024–4096 tokens, depending on the model).

```python
from voicerun_completions import (
    SystemMessage,
    UserMessage,
    AssistantMessage,
    CacheBreakpoint,
)

# 1. Tools
tools = [
    {
        "type": "function",
        "function": {
            "name": "lookup_order",
            "description": "Look up an order by ID",
            "parameters": {
                "type": "object",
                "properties": {"order_id": {"type": "string"}},
                "required": ["order_id"]
            }
        },
        "cache_breakpoint": {"ttl": "1h"}  # Breakpoint 1: cache all tools
    }
]

# 2. System + 3. Messages — system is stable, conversation grows each turn
messages = [
    SystemMessage(
        content="You are a customer support agent.\n\n<long reference document, 2000+ tokens>",
        cache_breakpoint=CacheBreakpoint(ttl="1h")  # Breakpoint 2: cache system prompt
    ),
    UserMessage(content="Can you look up order #1234?"),
    AssistantMessage(content="Let me look that up for you."),
    UserMessage(content="What's the shipping status?"),
]

# Add a cache breakpoint to the last message right before sending
messages[-1].cache_breakpoint = CacheBreakpoint(ttl="5m")  # Breakpoint 3

response = await generate_chat_completion({
    "provider": "anthropic",
    "api_key": context.variables.get("ANTHROPIC_API_KEY"),
    "model": "claude-haiku-4-5",
    "messages": messages,
    "tools": tools,
})
```

Notes:

  • Maximum of 4 cache breakpoints per request.
  • When mixing TTLs, longer durations must appear before shorter ones (1h before 5m).
  • "5m" — 5 minutes (default, refreshed on each cache hit).
  • "1h" — 1 hour (higher write cost, useful for infrequent access patterns).
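
Both constraints can be checked before a request is sent. The sketch below is an illustration of the rules, not library code; the helper name and the flat list-of-TTLs representation are assumptions:

```python
# Seconds per supported TTL value
DURATION = {"1h": 3600, "5m": 300}

def validate_breakpoints(ttls: list[str]) -> None:
    """Check breakpoint TTLs listed in request order (tools, then system, then messages)."""
    if len(ttls) > 4:
        raise ValueError("At most 4 cache breakpoints per request")
    seconds = [DURATION[t] for t in ttls]
    # Durations must be non-increasing: every "1h" before any "5m".
    if any(a < b for a, b in zip(seconds, seconds[1:])):
        raise ValueError('Longer TTLs must appear before shorter ones ("1h" before "5m")')

validate_breakpoints(["1h", "1h", "5m"])  # OK: matches the example above
```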

Tip: Don't persist cache_breakpoint on user messages in your conversation history. Instead, set it on the last message right before calling generate_chat_completion. The breakpoint position shifts as the conversation grows, so applying it dynamically avoids stale breakpoints on earlier messages, or a bad-request error from sending more than 4 breakpoints.

Google Thought Signatures

Google Gemini models support thought signatures for maintaining context across turns. The library automatically handles these:

```python
# The library automatically captures and preserves thought_signature
# in AssistantMessage and ToolCall objects for Google models
response = await generate_chat_completion({
    "provider": "google",
    "api_key": "your-key",
    "model": "gemini-2.5-flash",
    "messages": [...]
})

# thought_signature is automatically included in subsequent requests
# when using the same AssistantMessage object
```

The thought_signature is a Google-specific field that helps maintain context across conversation turns. It's automatically:

  • Captured from Google responses
  • Stored in AssistantMessage.thought_signature and ToolCall.thought_signature
  • Included in subsequent requests when using the same message objects

Using Anthropic Vertex

Anthropic Vertex provides access to Anthropic models through Google Cloud Platform. It requires explicit service account credentials passed via provider_kwargs["anthropic_vertex"].

Setup

  1. Obtain a GCP service account JSON key file with appropriate Vertex AI permissions
  2. Make the key contents available to your application (e.g. via an environment variable)

Usage

```python
import json

sa_info = json.loads(context.variables.get("GCP_SERVICE_ACCOUNT_JSON"))

response = await generate_chat_completion({
    "provider": "anthropic_vertex",
    "api_key": "",  # Not used - auth via service account
    "model": "claude-haiku-4-5",
    "messages": [{"role": "user", "content": "Hello"}],
    "provider_kwargs": {
        "anthropic_vertex": {
            "region": "us-central1",
            "project_id": "your-gcp-project-id",
            "service_account_credentials": sa_info,  # dict from SA JSON key file
            "thinking": {"type": "disabled"},  # Optional: configure thinking
        }
    }
})
```

Structured Output

Use response_schema to instruct the model to return JSON conforming to a given JSON Schema. The library automatically denormalizes the schema for each provider:

  • OpenAI → response_format with json_schema
  • Anthropic → output_config with json_schema format
  • Google → response_mime_type: "application/json" + sanitized response_schema
  • Alibaba — same as OpenAI
  • Anthropic Vertex — same as Anthropic

```python
import json

response = await generate_chat_completion({
    "provider": "anthropic",
    "api_key": context.variables.get("ANTHROPIC_API_KEY"),
    "model": "claude-haiku-4-5",
    "messages": [
        {"role": "user", "content": "Invent a fictional person with a name, age, and city."},
    ],
    "response_schema": {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "age": {"type": "integer"},
            "city": {"type": "string"},
        },
        "required": ["name", "age", "city"],
        "additionalProperties": False,
    },
})

person = json.loads(response.message.content)
# {"name": "Elara Voss", "age": 34, "city": "Portland"}
```

The schema is a plain JSON Schema dict (OpenAI-style). For Google, the library automatically sanitizes it using the same rules as tool parameter schemas — see JSON Schema Support for details.

response_schema works with fallbacks — it's inherited by fallback requests unless explicitly overridden:

```python
response = await generate_chat_completion({
    "provider": "openai",
    "api_key": os.getenv("OPENAI_API_KEY"),
    "model": "gpt-4.1-mini",
    "messages": [{"role": "user", "content": "Invent a person."}],
    "response_schema": {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "age": {"type": "integer"},
        },
        "required": ["name", "age"],
        "additionalProperties": False,
    },
    "fallbacks": [
        {
            "provider": "google",
            "api_key": os.getenv("GEMINI_API_KEY"),
            "model": "gemini-2.0-flash",
            # response_schema inherited — Google sanitization applied automatically
        },
    ],
})
```

Next Steps

  • Review the API Reference for complete type documentation
  • Check out Examples for advanced usage patterns