Evaluations

Automatically analyze completed voice agent sessions using LLM judges or data extraction. Evaluations run against the transcript or event log of real calls to score agent performance, check compliance, and extract structured data.

When to Use

- Quality monitoring: automatically score a percentage of live sessions
- Compliance checking: verify agents follow required scripts or collect required information
- Data extraction: pull structured data from conversations (customer name, intent, outcome)
- Regression testing: re-evaluate historical sessions after changing evaluator criteria

Evaluator Types

An evaluator is a reusable template that defines how to analyze a session. There are two types:

Judge Evaluators

A judge evaluator scores a session as pass or fail based on criteria you define. It reads the session transcript or events, sends them to an LLM with your prompt, and checks the response against success criteria.

Example: Did the agent collect the customer's name?

Field	Value
Name	Customer Name Collection
Type	Judge
Target Format	Transcript
System Prompt	`Review this conversation. Did the agent successfully ask for and receive the customer's full name? Respond with {"collected_name": true/false, "name": "the name or null"}`
Response Schema	`{"collected_name": {"type": "boolean"}, "name": {"type": "string"}}`
Success Criteria	`{"collected_name": {"$eq": true}}`

Success Criteria Operators

Operator	Description	Example
`$eq`	Equals	`{"field": {"$eq": true}}`
`$ne`	Not equals	`{"field": {"$ne": "error"}}`
`$gt`	Greater than	`{"score": {"$gt": 5}}`
`$gte`	Greater than or equal	`{"score": {"$gte": 7}}`
`$lt`	Less than	`{"duration": {"$lt": 300}}`
`$lte`	Less than or equal	`{"errors": {"$lte": 0}}`
`$in`	Value in list	`{"status": {"$in": ["resolved", "escalated"]}}`
`$nin`	Value not in list	`{"tone": {"$nin": ["rude", "dismissive"]}}`
`$inc`	String includes	`{"summary": {"$inc": "greeting"}}`
`$ninc`	String does not include	`{"response": {"$ninc": "I don't know"}}`
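
The operators above can be read as simple comparisons between a field in the LLM's JSON response and the value in the criteria. The following is a minimal sketch of that idea, not the platform's implementation: `OPERATORS` and `meets_criteria` are hypothetical names, and it assumes that all conditions must hold together, that a missing field counts as a failure, and that `$inc` is a plain substring check. The response values are made-up sample data.

```python
from typing import Any, Callable, Dict

# Operator semantics are assumptions inferred from the table above.
OPERATORS: Dict[str, Callable[[Any, Any], bool]] = {
    "$eq": lambda actual, expected: actual == expected,
    "$ne": lambda actual, expected: actual != expected,
    "$gt": lambda actual, expected: actual > expected,
    "$gte": lambda actual, expected: actual >= expected,
    "$lt": lambda actual, expected: actual < expected,
    "$lte": lambda actual, expected: actual <= expected,
    "$in": lambda actual, expected: actual in expected,
    "$nin": lambda actual, expected: actual not in expected,
    "$inc": lambda actual, expected: expected in actual,
    "$ninc": lambda actual, expected: expected not in actual,
}


def meets_criteria(response: Dict[str, Any], criteria: Dict[str, Dict[str, Any]]) -> bool:
    """Return True only if every condition on every field holds (assumed AND semantics)."""
    for field, conditions in criteria.items():
        if field not in response:
            return False  # assumption: a missing field fails the check
        for op, expected in conditions.items():
            if op not in OPERATORS:
                raise ValueError(f"unknown operator: {op}")
            if not OPERATORS[op](response[field], expected):
                return False
    return True


# Sample response, shaped like the Customer Name Collection example.
response = {"collected_name": True, "name": "Dana Reyes"}
print(meets_criteria(response, {"collected_name": {"$eq": True}}))  # True
print(meets_criteria({"tone": "rude"}, {"tone": {"$nin": ["rude", "dismissive"]}}))  # False
```

Several operators can be combined on one field under these assumptions, for example `{"score": {"$gte": 7, "$lt": 10}}`.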

Extraction Evaluators

An extraction evaluator pulls structured data from a session. It produces no pass/fail ruling, only the extracted fields.

Example: Extract call summary

Field	Value
Name	Call Summary Extraction
Type	Extraction
Target Format	Transcript
System Prompt	`Extract the following from this conversation: the caller's intent, whether the issue was resolved, and a one-sentence summary.`
Response Schema	`{"intent": {"type": "string"}, "resolved": {"type": "boolean"}, "summary": {"type": "string"}}`
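
To make the role of the Response Schema concrete, here is a rough sketch of checking an extracted result against the schema from the example above. It is an illustration only: `schema_errors`, `TYPE_MAP`, and `SCHEMA` are hypothetical names, the sketch assumes every field in the schema is required, and the platform's own validation may behave differently. The extracted values are made-up sample data.

```python
from typing import Any, Dict, List

# Maps the type names used in the example schema to Python types.
TYPE_MAP = {"string": str, "boolean": bool, "number": (int, float), "integer": int}

# Schema copied from the Call Summary Extraction example.
SCHEMA = {
    "intent": {"type": "string"},
    "resolved": {"type": "boolean"},
    "summary": {"type": "string"},
}


def schema_errors(data: Dict[str, Any], schema: Dict[str, Dict[str, str]]) -> List[str]:
    """Return a list of problems; an empty list means the data matches the schema."""
    errors = []
    for field, spec in schema.items():
        if field not in data:
            errors.append(f"missing field: {field}")  # assumption: all fields required
            continue
        expected = TYPE_MAP[spec["type"]]
        value = data[field]
        # bool is a subclass of int in Python, so reject it for non-boolean types.
        wrong_bool = isinstance(value, bool) and expected is not bool
        if wrong_bool or not isinstance(value, expected):
            errors.append(f"{field}: expected {spec['type']}, got {type(value).__name__}")
    return errors


good = {"intent": "billing question", "resolved": True, "summary": "Caller asked about an invoice."}
print(schema_errors(good, SCHEMA))  # []
print(schema_errors({"intent": "billing question", "resolved": "yes"}, SCHEMA))
# ['resolved: expected boolean, got str', 'missing field: summary']
```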

Target Formats

Evaluators can analyze sessions in two formats:

- Transcript: the human-readable conversation turns (recommended for most use cases)
- Events: the full structured JSON event log (useful when you need to inspect timing, tool calls, or internal events)

Creating an Evaluator

1. Go to Tooling > Evaluators in the web dashboard
2. Click Create evaluator at the bottom of the Evaluators table
3. Fill in:
   - Title
   - Evaluator Type: Judge or Extraction
   - Target Format: Transcript or Events
   - Provider and Model: which LLM to use for evaluation
   - System Prompt: instructions for the LLM
   - Response Schema: JSON Schema for the expected output (optional for extraction)

Assigning Evaluators to Agents

Evaluators don't run automatically until you assign them to an agent.

1. Go to your agent's Evaluation tab
2. Click the Assignments sub-tab
3. Click Add Assignment
4. Select an evaluator and set a sampling rate:
   - 100%: run on every session
   - 50%: run on half of sessions (random)
   - 10%: run on 10% of sessions
   - 0%: disabled

When a session ends, the system rolls a random number against the sampling rate to decide whether each assigned evaluator runs. This lets you control cost while still monitoring quality.
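
The sampling decision described above can be sketched as follows. This is an assumption-laden illustration rather than the system's actual code: `select_evaluators` and the evaluator names are hypothetical, and the sketch assumes each assigned evaluator is rolled independently.

```python
import random
from typing import Dict, List, Optional


def select_evaluators(
    assignments: Dict[str, float], rng: Optional[random.Random] = None
) -> List[str]:
    """Pick which assigned evaluators run for one finished session.

    `assignments` maps a (hypothetical) evaluator name to its sampling rate in percent.
    """
    rng = rng or random.Random()
    selected = []
    for name, rate in assignments.items():
        # random() returns a value in [0.0, 1.0), so 100 always runs and 0 never runs.
        if rng.random() * 100 < rate:
            selected.append(name)
    return selected


assignments = {"Customer Name Collection": 100, "Call Summary Extraction": 10, "Tone Check": 0}
print(select_evaluators(assignments))
```

Under this model, a 10% rate means each session has a one-in-ten chance of being evaluated, so the share of evaluated sessions approaches 10% only over many sessions.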

Viewing Results

Web Dashboard

Go to your agent's Evaluation tab, Evaluations sub-tab. You can filter by:

- Status (pending, complete, error)
- Type (judge, extraction)
- Trigger (automatic, batch)
- Success (pass/fail)
- Session origin, environment, date range

Click any evaluation to see the full ruling or extracted data, the model used, and token costs.

CLI

# List evaluations for an agent
vr evaluation list <agent>

# List evaluations for a specific session
vr evaluation list <agent> --session <session-id>

# Filter by status or type
vr evaluation list <agent> --status complete --type judge

# View evaluation details
vr evaluation info <evaluation-id>

Recommended Workflow

1. Create evaluators that test your agent's key behaviors (greeting, data collection, issue resolution, tone)
2. Assign them to your agent at 100% sampling rate during development
3. Make real calls to your agent. Evaluators run automatically on live sessions based on the sampling rate
4. Review results in the Evaluation tab. Check for failures and fix agent behavior
5. Lower sampling rates in production (e.g., 10-20%) to monitor ongoing quality without excessive cost
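
To see how much lowering the sampling rate reduces evaluation volume, a rough estimate is sessions multiplied by the sum of the assigned sampling rates. The sketch below uses a hypothetical helper, `expected_evaluations`, and made-up session counts. It estimates the number of LLM evaluation runs only, not token cost, which depends on the model and on transcript length.

```python
from typing import Sequence


def expected_evaluations(sessions: int, sampling_rates: Sequence[float]) -> float:
    """Expected number of evaluation runs, given one sampling rate (in percent) per assigned evaluator."""
    return sum(sessions * rate for rate in sampling_rates) / 100


# Hypothetical volume: 2,000 sessions per day with three assigned evaluators.
print(expected_evaluations(2000, [100, 100, 100]))  # 6000.0
print(expected_evaluations(2000, [10, 10, 10]))     # 600.0
```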

Note: Evaluators do not run automatically on debug sessions (`vr debug`); they trigger automatically only on real calls. To evaluate a debug session, trigger an evaluation manually.