Advanced Features

Provider-specific kwargs

Each provider exposes tuning knobs beyond the common set (temperature, max_tokens, etc.). Pass them via provider_kwargs, keyed by provider name. The proxy applies only the entry matching the currently-executing provider; entries for other providers are ignored, which lets you configure kwargs for every member of a fallback chain in one request.

provider_kwargs = { "openai": {...}, # applied when provider == "openai" "anthropic": {...}, # applied when provider == "anthropic" "google": {...}, # applied when provider == "google" "anthropic_vertex": {...}, # applied when provider == "anthropic_vertex" "alibaba": {...}, # applied when provider == "alibaba" }

OpenAI

response = await generate_chat_completion({ "provider": "openai", "model": "gpt-5-mini", "messages": [{"role": "user", "content": "Hello"}], "provider_kwargs": { "openai": { "service_tier": "priority", "reasoning_effort": "none", }, }, })

See the OpenAI chat/completions reference for the full list.

Anthropic

response = await generate_chat_completion({ "provider": "anthropic", "model": "claude-sonnet-4-5-20250514", "max_tokens": 16000, "messages": [{"role": "user", "content": "Think about this..."}], "provider_kwargs": { "anthropic": { "thinking": {"type": "enabled", "budget_tokens": 10000}, }, }, })

Extended thinking is disabled by default. Set thinking.type to "enabled" to turn it on.

See the Anthropic messages API reference.

Google

response = await generate_chat_completion({ "provider": "google", "model": "gemini-2.5-flash", "messages": [{"role": "user", "content": "Hello"}], "provider_kwargs": { "google": { "thinking_config": {"thinking_budget": 10000, "include_thoughts": True}, "safety_settings": [...], }, }, })

Thinking is disabled by default on Google models.

See the Google GenerateContent reference.

Alibaba (Qwen via DashScope)

response = await generate_chat_completion({ "provider": "alibaba", "model": "qwen3.5-plus", "messages": [{"role": "user", "content": "Hello"}], "provider_kwargs": { "alibaba": { "enable_search": True, }, }, })

Override the regional endpoint with base_url:

"provider_kwargs": { "alibaba": { "base_url": "https://dashscope-us.aliyuncs.com/compatible-mode/v1", }, }

Available regions:

  • Singapore (default): https://dashscope-intl.aliyuncs.com/compatible-mode/v1
  • Virginia (US): https://dashscope-us.aliyuncs.com/compatible-mode/v1
  • Beijing (CN): https://dashscope.aliyuncs.com/compatible-mode/v1

See the DashScope model reference.

Multi-provider example (with fallback)

configure_provider("openai", voicerun_managed=True) configure_provider("anthropic", voicerun_managed=True) response = await generate_chat_completion({ "provider": "openai", "model": "gpt-5-mini", "messages": [{"role": "user", "content": "Hello"}], "provider_kwargs": { "openai": {"service_tier": "flex"}, "anthropic": {"thinking": {"type": "disabled"}}, }, "fallbacks": [ {"provider": "anthropic", "model": "claude-haiku-4-5"}, ], })

When the primary fails and the fallback fires, the proxy applies the anthropic entry instead of the openai one.

Anthropic cache breakpoints

Anthropic supports prompt caching. The library exposes it via CacheBreakpoint, which attaches to a tool, system message, assistant message, user message, or tool-result message.

The cache is built in order tools → system → messages. Place large, stable content early and put a breakpoint at the end of each cacheable section. When mixing TTLs, the longer duration ("1h") must appear before the shorter ("5m").

```python
from primfunctions.completions import (
    AssistantMessage,
    CacheBreakpoint,
    SystemMessage,
    UserMessage,
    configure_provider,
    generate_chat_completion,
)

configure_provider("anthropic", voicerun_managed=True)

# 1. Tools: cache the full tool block
tools = [
    {
        "type": "function",
        "function": {
            "name": "lookup_order",
            "description": "Look up an order by ID",
            "parameters": {
                "type": "object",
                "properties": {"order_id": {"type": "string"}},
                "required": ["order_id"],
            },
        },
        "cache_breakpoint": {"ttl": "1h"},
    },
]

# 2. System (stable) + 3. Messages (growing)
messages = [
    SystemMessage(
        content="You are a customer support agent.\n\n<2000+ token reference doc>",
        cache_breakpoint=CacheBreakpoint(ttl="1h"),
    ),
    UserMessage(content="Can you look up order #1234?"),
    AssistantMessage(content="Let me look that up for you."),
    UserMessage(content="What's the shipping status?"),
]

# Dynamic breakpoint on the last message for the growing conversation prefix
messages[-1].cache_breakpoint = CacheBreakpoint(ttl="5m")

response = await generate_chat_completion({
    "provider": "anthropic",
    "model": "claude-haiku-4-5",
    "messages": messages,
    "tools": tools,
})
```

Rules

  • At most 4 cache breakpoints per request.
  • Longer TTLs (1h) must come before shorter ones (5m) in the prefix order.
  • "5m" — 5-minute ephemeral cache, refreshed on each hit.
  • "1h" — 1-hour ephemeral cache, higher write cost.
  • Each model has a minimum cached-block size. Blocks smaller than the minimum are silently ignored (usage reports cache_creation_input_tokens: 0); see the sketch after this list.
    • Opus / Sonnet: 1024 tokens
    • Haiku 4.5: ~4096 tokens (empirical; higher than older Haikus)
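
To check whether a breakpoint actually produced a cache write, inspect the usage counters on the response from the example above. A minimal sketch, assuming the library surfaces Anthropic's cache_creation_input_tokens / cache_read_input_tokens counters on response.usage (attribute names may differ in your response type):

```python
# Assumption: response.usage mirrors Anthropic's raw usage fields.
usage = response.usage

if usage.cache_creation_input_tokens == 0 and usage.cache_read_input_tokens == 0:
    # No write and no hit: the cacheable block was likely below the
    # model's minimum size and was silently ignored.
    print("cache breakpoint had no effect; block likely too small")
else:
    print(f"cache write: {usage.cache_creation_input_tokens} tokens, "
          f"cache read: {usage.cache_read_input_tokens} tokens")
```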

Don't persist breakpoints on stored messages

Set cache_breakpoint on the last message only right before the call. If you persist it into conversation history and the conversation keeps growing, you end up with stale breakpoints in the middle of the prefix and potentially more than 4 breakpoints — the request will fail. Apply breakpoints dynamically each turn.
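
A minimal per-turn sketch of that pattern, continuing from the example above. The apply_dynamic_breakpoint helper is illustrative, not a library API, and it assumes that setting cache_breakpoint to None clears a breakpoint:

```python
def apply_dynamic_breakpoint(messages):
    # Illustrative helper (not part of the library): strip any breakpoints
    # accidentally persisted on earlier turns, then mark only the current
    # last message.
    for message in messages:
        message.cache_breakpoint = None  # assumes None means "no breakpoint"
    messages[-1].cache_breakpoint = CacheBreakpoint(ttl="5m")

# Each turn: load history, append the new user message, then apply.
messages.append(UserMessage(content="Where is my refund?"))
apply_dynamic_breakpoint(messages)
response = await generate_chat_completion({
    "provider": "anthropic",
    "model": "claude-haiku-4-5",
    "messages": messages,
    "tools": tools,
})
```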

Google thought signatures

Google Gemini maintains context across turns via a thought_signature field on assistant messages and tool calls. The library captures and re-emits it automatically — as long as you keep the same AssistantMessage / ToolCall dataclass around (or round-trip through serialize_conversation / deserialize_conversation), the next turn's request will include the signature.

response = await generate_chat_completion({ "provider": "google", "model": "gemini-2.5-flash", "messages": [...], }) # response.message.thought_signature is preserved # on the AssistantMessage dataclass and survives context.set_completion_messages / get.

No handler code is required to propagate it — just reuse the response message objects on the next turn.
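
For conversations persisted between turns, round-tripping through the serializers mentioned above keeps the signature intact. A sketch, assuming serialize_conversation / deserialize_conversation are importable from primfunctions.completions and round-trip the message dataclasses through a JSON-serializable payload; save_state / load_state are hypothetical persistence functions of your own:

```python
from primfunctions.completions import (
    UserMessage,
    deserialize_conversation,
    serialize_conversation,
)

# End of turn N: persist the history, including the assistant reply that
# carries thought_signature.
payload = serialize_conversation(messages + [response.message])
save_state(payload)  # hypothetical: write to your datastore

# Start of turn N+1: restore the dataclasses (thought_signature included)
# and append the new user message before the next call.
messages = deserialize_conversation(load_state())  # hypothetical read
messages.append(UserMessage(content="Continue from where we left off."))
```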

Anthropic Vertex

Anthropic Vertex runs Anthropic models through Google Cloud. It requires explicit service-account credentials passed via provider_kwargs["anthropic_vertex"].

Setup

  1. Obtain a GCP service-account JSON key with Vertex AI permissions.
  2. Make it available to your agent via context.variables (e.g. GCP_SERVICE_ACCOUNT_JSON).

Usage

```python
import json

from primfunctions.completions import configure_provider, generate_chat_completion

configure_provider("anthropic_vertex", voicerun_managed=True)

sa_info = json.loads(context.variables.get("GCP_SERVICE_ACCOUNT_JSON"))

response = await generate_chat_completion({
    "provider": "anthropic_vertex",
    "model": "claude-haiku-4-5",
    "messages": [{"role": "user", "content": "Hello"}],
    "provider_kwargs": {
        "anthropic_vertex": {
            "region": "us-central1",
            "project_id": "your-gcp-project-id",
            "service_account_credentials": sa_info,
            "thinking": {"type": "disabled"},
        },
    },
})
```

Structured output

Use response_schema to instruct the model to return JSON matching a JSON Schema. The proxy maps it to each provider's native format:

  • OpenAI: response_format with json_schema mode
  • Anthropic: output_config with json_schema format
  • Google: response_mime_type: "application/json" + sanitized response_schema
  • Alibaba: same shape as OpenAI
  • Anthropic Vertex: same shape as Anthropic
response = await generate_chat_completion({ "provider": "anthropic", "model": "claude-haiku-4-5", "messages": [ {"role": "user", "content": "Invent a fictional person with a name, age, and city."}, ], "response_schema": { "type": "object", "properties": { "name": {"type": "string"}, "age": {"type": "integer"}, "city": {"type": "string"}, }, "required": ["name", "age", "city"], "additionalProperties": False, }, }) import json person = json.loads(response.message.content) # {"name": "Elara Voss", "age": 34, "city": "Portland"}

response_schema is inherited by fallbacks unless the fallback sets its own, and Google's sanitizer runs automatically when the request lands on the Google provider:

response = await generate_chat_completion({ "provider": "openai", "model": "gpt-4.1-mini", "messages": [{"role": "user", "content": "Invent a person."}], "response_schema": { "type": "object", "properties": {"name": {"type": "string"}, "age": {"type": "integer"}}, "required": ["name", "age"], "additionalProperties": False, }, "fallbacks": [ {"provider": "google", "model": "gemini-2.0-flash"}, ], })

See JSON Schema support for the cross-provider compatibility matrix.

Next steps

  • caching
  • anthropic
  • advanced
  • structured-output