Evaluations

Automatically analyze completed voice agent sessions. Evaluations score agent performance, check compliance, and extract structured data using LLM judges, structured-data extractors, or deterministic predicates over the session's derived facts and events.

When to Use#

  • Quality monitoring — automatically score a percentage of live sessions
  • Compliance checking — verify agents follow required scripts or collect required information
  • Data extraction — pull structured data from conversations (customer name, intent, outcome)
  • Regression testing — re-evaluate historical sessions after changing evaluator criteria

Evaluator Types#

An evaluator is a reusable template that defines how to analyze a session. There are three types:

Judge Evaluators#

A judge evaluator scores a session as pass or fail based on criteria you define. It reads the session transcript or events, sends them to an LLM with your prompt, and checks the response against success criteria.

Example: Did the agent collect the customer's name?

FieldValue
NameCustomer Name Collection
TypeJudge
Target FormatTranscript
System PromptReview this conversation. Did the agent successfully ask for and receive the customer's full name? Respond with {"collected_name": true/false, "name": "the name or null"}
Response Schema{"collected_name": {"type": "boolean"}, "name": {"type": "string"}}
Success Criteria{"collected_name": {"$eq": true}}

Success Criteria Operators

OperatorDescriptionExample
$eqEquals (deep equality){"field": {"$eq": true}}
$neNot equals{"field": {"$ne": "error"}}
$gtGreater than{"score": {"$gt": 5}}
$gteGreater than or equal{"score": {"$gte": 7}}
$ltLess than{"duration": {"$lt": 300}}
$lteLess than or equal{"errors": {"$lte": 0}}
$inValue is in a literal list{"status": {"$in": ["resolved", "escalated"]}}
$ninValue is not in a literal list{"tone": {"$nin": ["rude", "dismissive"]}}
$incInput array contains this value{"flags": {"$inc": "vip"}}
$nincInput array does not contain this value{"flags": {"$ninc": "test"}}

Deterministic assertions and preconditions support all of the above, plus $contains, $icontains, and $any for working with strings and event payloads. See Deterministic Evaluators below.

Extraction Evaluators#

An extraction evaluator pulls structured data from a session — no pass/fail, just data extraction.

Example: Extract call summary

FieldValue
NameCall Summary Extraction
TypeExtraction
Target FormatTranscript
System PromptExtract the following from this conversation: the caller's intent, whether the issue was resolved, and a one-sentence summary.
Response Schema{"intent": {"type": "string"}, "resolved": {"type": "boolean"}, "summary": {"type": "string"}}

Deterministic Evaluators#

A deterministic evaluator asserts on the derived session view — no LLM call, no token spend, fully repeatable. Use these for checks that are purely factual: was a tool called, did the agent transfer, did the call end with a goodbye, was the duration within bounds. The assertion runs as a JSON predicate against a flat view of the session.

Example: Did the caller mention cancellation?

FieldValue
NameCaller Mentioned Cancellation
TypeDeterministic
Assertion{"events": {"$any": {"name": "transcript_part", "data.role": "user", "data.content": {"$icontains": "cancel"}}}}

Session view fields the assertion can read:

FieldDescription
turn_countNumber of completed turns (count of turn_end events). A mid-turn hangup doesn't count.
duration_secondsSeconds between startedAt and endedAt. Returns 0 when the session hasn't ended.
directioninbound or outbound
originWhere the session came from (e.g. phone, web, simulation, native)
tagsSession tags as a string array. Matchable by bare primitive (tags: "billing") or by $inc / $ninc.
environment[id, name] for the session's environment — prefers the org-scoped environmentId, falls back to legacy agentEnvironmentId. Matchable by bare primitive against either value: environment: "production" or environment: "<uuid>" both work.
eventsRaw event list in arrival order — each element { name, data, timestamp }. Use with $any to assert on event names or payloads (e.g. transcript content, tool arguments).

Operators. Assertions support the Success Criteria Operators table above plus three more for payload work:

OperatorDescriptionExample
$containsString input contains this substring (case-sensitive){"data.content": {"$contains": "refund"}}
$icontainsString input contains this substring (case-insensitive){"data.content": {"$icontains": "REFUND"}}
$anyAt least one element of the input array matches the sub-predicate{"events": {"$any": {"name": "transcript_part"}}}

Dotted field paths walk nested objects (e.g. data.content). Mixing operators and field names at the same level is rejected at evaluation time.

Bare primitive vs array input. When a predicate's value is a primitive and the input field is an array, the engine does membership matching (MongoDB-style). This lets tags: "billing" work without $inc, and lets environment: "production" match against the resolved [id, name] regardless of whether you wrote the name or the ID. Scalar-vs-scalar equality is unchanged.

Example: did the caller say "refund"?

assertion: events: $any: name: "transcript_part" data.role: "user" data.content: { $icontains: "refund" }

Example: production-tagged inbound calls that ended cleanly:

assertion: direction: "inbound" environment: "production" # matches name OR ID via bare-primitive membership tags: "customer-vip" # tag presence — no $inc needed

When the assertion matches, the row records success: true with details: { matched: true }. When it fails, success: false with details: { matched: false, failedPath: "...", reason: "..." } — the failing field path is captured so reviewers can see exactly which clause didn't hold.


Preconditions#

Any evaluator type can declare an optional precondition predicate that gates whether it runs. If the predicate doesn't match the session, the evaluator is skipped instead of executed — no LLM call, no token spend — and a skipped row is recorded with the reason. This stops you from paying for "did the agent handle the objection well?" evals on 1-turn hangups, while keeping the skip auditable.

Preconditions use the same predicate language and session-view fields as deterministic assertions.

Example: Skip the eval unless the call ran long enough to score

FieldValue
Precondition{"turn_count": {"$gte": 3}, "duration_seconds": {"$gte": 30}}

When a session has fewer than 3 turns or shorter than 30 seconds, the evaluation row is written with status="skipped" and a skipReason like precondition not met at "turn_count": $gte 3 failed for 1.

To audit which sessions were skipped:

vr evaluation list <agent> --status skipped

Or filter by status in the web dashboard. Skipped rows never incur token cost.


Target Formats#

Judge and extraction evaluators choose what they send to the LLM:

  • Transcript — the human-readable conversation turns (recommended for most use cases)
  • Events — the full structured JSON event log (useful when you need to inspect timing, tool calls, or internal events)

Deterministic evaluators ignore this setting — they always run against the derived session view (see the field table above).


Creating an Evaluator#

  1. Go to Tooling > Evaluators in the web dashboard
  2. Click Create evaluator at the bottom of the Evaluators table
  3. Fill in:
    • Title
    • Evaluator Type — Judge, Extraction, or Deterministic
    • Target Format — Transcript or Events (deterministic always reads the session view)
    • Provider and Model — which LLM to use (not applicable for deterministic)
    • System Prompt — instructions for the LLM (not applicable for deterministic)
    • Response Schema — JSON Schema for the expected output (optional for extraction, not applicable for deterministic)
    • Assertion — JSON predicate that determines pass/fail (deterministic only)
    • Precondition — optional JSON predicate; sessions that don't match are recorded as skipped instead of evaluated

Assigning Evaluators to Agents#

Evaluators don't run automatically until you assign them to an agent.

  1. Go to your agent's Evaluation tab
  2. Click the Assignments sub-tab
  3. Click Add Assignment
  4. Select an evaluator and set a sampling rate:
    • 100% — run on every session
    • 50% — run on half of sessions (random)
    • 10% — run on 10% of sessions
    • 0% — disabled

When a session ends, the system rolls a random number against the sampling rate to decide whether each assigned evaluator runs. This lets you control cost while still monitoring quality.


Viewing Results#

Web Dashboard#

Go to your agent's Evaluation tab, Evaluations sub-tab. You can filter by:

  • Status (pending, complete, error, skipped)
  • Type (judge, extraction, deterministic)
  • Trigger (automatic, batch)
  • Success (pass/fail — applies to judge and deterministic)
  • Session origin, environment, date range

Skipped rows render with a neutral gray "Skipped" pill and a Skip-reason panel naming the failing precondition field. Deterministic rows render with their assertion JSON and a structured details payload — no model / provider / token block, since no LLM call was made.

Click any evaluation to see the full ruling or extracted data, the model used, and token costs.

CLI#

# List evaluations for an agent vr evaluation list <agent> # List evaluations for a specific session vr evaluation list <agent> --session <session-id> # Filter by status or type vr evaluation list <agent> --status complete --type judge # Audit which sessions were skipped (precondition not met) vr evaluation list <agent> --status skipped # Audit cheap deterministic evals vr evaluation list <agent> --type deterministic # View evaluation details vr evaluation info <evaluation-id>

  1. Create evaluators that test your agent's key behaviors (greeting, data collection, issue resolution, tone)
  2. Assign them to your agent at 100% sampling rate during development
  3. Make real calls to your agent — evaluators run automatically on live sessions based on the sampling rate
  4. Review results in the Evaluation tab — check for failures and fix agent behavior
  5. Lower sampling rates in production (e.g., 10-20%) to monitor ongoing quality without excessive cost

Note: Evaluators do not run automatically on debug sessions (vr debug). They only trigger automatically on real calls. To evaluate a debug session, you need to manually trigger an evaluation.

evaluationstestingquality