Evaluations

Automatically analyze completed voice agent sessions using LLM judges or data extraction. Evaluations run against real session transcripts or event logs to score agent performance, check compliance, and extract structured data.

When to Use#

  • Quality monitoring — automatically score a percentage of live sessions
  • Compliance checking — verify agents follow required scripts or collect required information
  • Data extraction — pull structured data from conversations (customer name, intent, outcome)
  • Regression testing — re-evaluate historical sessions after changing evaluator criteria

Evaluator Types#

An evaluator is a reusable template that defines how to analyze a session. There are two types:

Judge Evaluators#

A judge evaluator scores a session as pass or fail based on criteria you define. It reads the session transcript or events, sends them to an LLM with your prompt, and checks the response against success criteria.

Example: Did the agent collect the customer's name?

| Field | Value |
| --- | --- |
| Name | Customer Name Collection |
| Type | Judge |
| Target Format | Transcript |
| System Prompt | Review this conversation. Did the agent successfully ask for and receive the customer's full name? Respond with `{"collected_name": true/false, "name": "the name or null"}` |
| Response Schema | `{"collected_name": {"type": "boolean"}, "name": {"type": "string"}}` |
| Success Criteria | `{"collected_name": {"$eq": true}}` |
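For example, if the LLM judge returns the response below (the name value is illustrative), the `collected_name` field satisfies the `{"$eq": true}` criterion and the session is scored as passing:

```json
{
  "collected_name": true,
  "name": "Jane Doe"
}
```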

Success Criteria Operators

| Operator | Description | Example |
| --- | --- | --- |
| `$eq` | Equals | `{"field": {"$eq": true}}` |
| `$ne` | Not equals | `{"field": {"$ne": "error"}}` |
| `$gt` | Greater than | `{"score": {"$gt": 5}}` |
| `$gte` | Greater than or equal | `{"score": {"$gte": 7}}` |
| `$lt` | Less than | `{"duration": {"$lt": 300}}` |
| `$lte` | Less than or equal | `{"errors": {"$lte": 0}}` |
| `$in` | Value in list | `{"status": {"$in": ["resolved", "escalated"]}}` |
| `$nin` | Value not in list | `{"tone": {"$nin": ["rude", "dismissive"]}}` |
| `$inc` | String includes | `{"summary": {"$inc": "greeting"}}` |
| `$ninc` | String does not include | `{"response": {"$ninc": "I don't know"}}` |
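Criteria key into the fields of your response schema, and you can constrain more than one field. Below is a sketch of combined criteria for a hypothetical judge that returns a numeric `score` and a `status` string, assuming that every listed criterion must hold for the evaluation to pass:

```json
{
  "score": {"$gte": 7},
  "status": {"$nin": ["escalated"]}
}
```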

Extraction Evaluators#

An extraction evaluator pulls structured data from a session. It produces no pass/fail ruling, only the extracted fields.

Example: Extract call summary

| Field | Value |
| --- | --- |
| Name | Call Summary Extraction |
| Type | Extraction |
| Target Format | Transcript |
| System Prompt | Extract the following from this conversation: the caller's intent, whether the issue was resolved, and a one-sentence summary. |
| Response Schema | `{"intent": {"type": "string"}, "resolved": {"type": "boolean"}, "summary": {"type": "string"}}` |
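A completed run of this evaluator stores the extracted data rather than a ruling. The output might look like this (values are illustrative):

```json
{
  "intent": "billing question",
  "resolved": true,
  "summary": "The caller asked about an unexpected charge and the agent explained and resolved it."
}
```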

Target Formats#

Evaluators can analyze sessions in two formats:

  • Transcript — the human-readable conversation turns (recommended for most use cases)
  • Events — the full structured JSON event log (useful when you need to inspect timing, tool calls, or internal events)

Creating an Evaluator#

  1. Go to Tooling > Evaluators in the web dashboard
  2. Click Create evaluator at the bottom of the Evaluators table
  3. Fill in:
    • Title
    • Evaluator Type — Judge or Extraction
    • Target Format — Transcript or Events
    • Provider and Model — which LLM to use for evaluation
    • System Prompt — instructions for the LLM
    • Response Schema — JSON Schema for the expected output (optional for extraction)
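The examples above express the Response Schema as a flat map from field name to JSON Schema type. A minimal judge schema for a compliance check might look like this (the field name is hypothetical):

```json
{"followed_required_script": {"type": "boolean"}}
```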

Assigning Evaluators to Agents#

Evaluators don't run automatically until you assign them to an agent.

  1. Go to your agent's Evaluation tab
  2. Click the Assignments sub-tab
  3. Click Add Assignment
  4. Select an evaluator and set a sampling rate:
    • 100% — run on every session
    • 50% — run on half of sessions (random)
    • 10% — run on 10% of sessions
    • 0% — disabled

When a session ends, the system rolls a random number against the sampling rate to decide whether each assigned evaluator runs. This lets you control cost while still monitoring quality.
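Conceptually, the per-evaluator sampling decision works like the sketch below (a simplified illustration of the described behavior, not the platform's actual code):

```python
import random

def should_run_evaluator(sampling_rate_percent: float) -> bool:
    # Draw a uniform random number in [0, 100) and compare it with the
    # assignment's sampling rate: 100 always runs, 0 never runs.
    return random.random() * 100 < sampling_rate_percent
```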


Viewing Results#

Web Dashboard#

Go to your agent's Evaluation tab, Evaluations sub-tab. You can filter by:

  • Status (pending, complete, error)
  • Type (judge, extraction)
  • Trigger (automatic, batch)
  • Success (pass/fail)
  • Session origin, environment, date range

Click any evaluation to see the full ruling or extracted data, the model used, and token costs.

CLI#

```bash
# List evaluations for an agent
vr evaluation list <agent>

# List evaluations for a specific session
vr evaluation list <agent> --session <session-id>

# Filter by status or type
vr evaluation list <agent> --status complete --type judge

# View evaluation details
vr evaluation info <evaluation-id>
```

Recommended Workflow#

  1. Create evaluators that test your agent's key behaviors (greeting, data collection, issue resolution, tone); a minimal example follows this list
  2. Assign them to your agent at 100% sampling rate during development
  3. Make real calls to your agent — evaluators run automatically on live sessions based on the sampling rate
  4. Review results in the Evaluation tab — check for failures and fix agent behavior
  5. Lower sampling rates in production (e.g., 10-20%) to monitor ongoing quality without excessive cost

Note: Evaluators do not run automatically on debug sessions (vr debug). They only trigger automatically on real calls. To evaluate a debug session, you need to manually trigger an evaluation.
