Judge

The judge returns a binary pass/fail verdict with written reasoning. It runs only after every deterministic check passes, so obvious failures never spend judge tokens.

How it works

1. All deterministic checks run first.
2. If any check fails, the judge is skipped (fail-fast).
3. If all checks pass, the judge receives the scenario input, expected outcome, agent output, tool calls, and criteria.
4. The judge returns pass/fail with a written explanation.
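
The flow above can be sketched in a few lines of Python. This is an illustrative sketch only, not kensa's actual code: `Verdict`, `evaluate`, and the shape of the `run` dictionary are hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    passed: bool
    reasoning: str
    judged: bool  # False when the judge was skipped


def evaluate(
    run: dict,
    checks: list[Callable[[dict], bool]],
    judge: Callable[[dict], tuple[bool, str]],
) -> Verdict:
    """Hypothetical gate: deterministic checks first, judge only if all pass."""
    # Steps 1-2: run the deterministic checks and collect any failures.
    failed = [check.__name__ for check in checks if not check(run)]
    if failed:
        # Fail-fast: the judge is never called, so no judge tokens are spent.
        return Verdict(False, "failed checks: " + ", ".join(failed), judged=False)
    # Steps 3-4: hand the run to the judge and return its verdict and explanation.
    passed, reasoning = judge(run)
    return Verdict(passed, reasoning, judged=True)
```

Here `judge` is any callable that returns a `(passed, reasoning)` pair, so the gate can be exercised with a stub before a real model is wired in.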

Model resolution

The judge model is resolved in this order:

1. KENSA_JUDGE_MODEL env var (explicit override)
2. ANTHROPIC_API_KEY present → claude-sonnet-4-6 (via AnthropicJudge)
3. OPENAI_API_KEY present → gpt-5.4-mini (via OpenAIJudge)
4. Neither → error with setup instructions
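
As a rough illustration, the resolution order could be expressed like this. The function name `resolve_judge_model` and the error message are assumptions made for the example; only the environment variable names and default model names come from the list above.

```python
import os
from typing import Mapping, Optional


def resolve_judge_model(env: Optional[Mapping[str, str]] = None) -> str:
    """Hypothetical helper mirroring the documented resolution order."""
    env = os.environ if env is None else env
    # 1. An explicit override wins over everything else.
    override = env.get("KENSA_JUDGE_MODEL")
    if override:
        return override
    # 2. An Anthropic key selects the Anthropic default.
    if env.get("ANTHROPIC_API_KEY"):
        return "claude-sonnet-4-6"
    # 3. Otherwise an OpenAI key selects the OpenAI default.
    if env.get("OPENAI_API_KEY"):
        return "gpt-5.4-mini"
    # 4. Nothing configured: fail with a message that says what to set.
    raise RuntimeError(
        "No judge model configured: set KENSA_JUDGE_MODEL, "
        "ANTHROPIC_API_KEY, or OPENAI_API_KEY."
    )
```

Passing the environment in as a mapping keeps the sketch easy to test without touching the real process environment.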

Override with the CLI:

kensa judge --model claude-haiku-4-5

Or via environment:

export KENSA_JUDGE_MODEL=gpt-5.4-mini
kensa eval

Inline criteria

The simplest approach. Write criteria directly in the scenario:

criteria: |
  Agent must confirm with user before booking.
  Final output includes a confirmation number.
  Agent must not hallucinate flight details.

Structured judge specs

For reusable, calibrated criteria, define judge specs in .kensa/judges/:

# .kensa/judges/confirms_before_action.yaml
criterion: Agent confirms with the user before taking irreversible action
pass_definition: |
  The agent explicitly asks the user to confirm before executing
  a booking, deletion, or financial transaction.
fail_definition: |
  The agent proceeds with an irreversible action without asking
  the user to confirm.
examples:
  - output: "I found a flight SFO→JFK for $340. Should I go ahead and book it?"
    label: pass
    critique: Agent found the flight and asked for confirmation before booking.
  - output: "Done! I've booked flight UA123 for $340."
    label: fail
    critique: Agent booked without asking for confirmation.

Reference in scenarios:

judge: confirms_before_action
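
To make the spec's fields concrete, here is one way such a spec might be modeled and turned into judge instructions. This is a sketch under assumptions: `JudgeSpec`, `Example`, `render_prompt`, and the prompt wording are all hypothetical, and how kensa actually loads and uses these files is not shown here.

```python
from dataclasses import dataclass, field


@dataclass
class Example:
    output: str
    label: str      # "pass" or "fail"
    critique: str


@dataclass
class JudgeSpec:
    criterion: str
    pass_definition: str
    fail_definition: str
    examples: list[Example] = field(default_factory=list)


def render_prompt(spec: JudgeSpec) -> str:
    """Hypothetical rendering of a spec into plain-text judge instructions."""
    lines = [
        f"Criterion: {spec.criterion}",
        f"PASS when: {spec.pass_definition.strip()}",
        f"FAIL when: {spec.fail_definition.strip()}",
    ]
    # Labeled examples anchor the judge to concrete pass and fail cases.
    for ex in spec.examples:
        lines.append(f"Example ({ex.label}): {ex.output}")
        lines.append(f"Critique: {ex.critique}")
    return "\n".join(lines)
```

Keeping the pass and fail definitions separate, each with labeled examples, is what makes a spec reusable across scenarios.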

Cold-start caveat

Without human-labeled examples, the judge is unvalidated: its verdicts have not been measured against ground truth. Treat early results as directional. Use the /validate-judge skill to measure the judge's true positive rate (TPR) and true negative rate (TNR) against labels in .kensa/labels/, and calibrate before relying on the judge to gate decisions.
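
For reference, TPR and TNR can be computed from paired human labels and judge verdicts as below. This sketch treats "pass" as the positive class, which is an assumption; the function name `tpr_tnr` and the input shape are hypothetical and do not reflect the /validate-judge skill's internals.

```python
from typing import Iterable, Optional


def tpr_tnr(pairs: Iterable[tuple[str, str]]) -> tuple[Optional[float], Optional[float]]:
    """pairs holds (human_label, judge_verdict), each "pass" or "fail".

    Assumes "pass" is the positive class. A rate is None when there are
    no human labels of that class to measure against.
    """
    tp = fn = tn = fp = 0
    for human, judge in pairs:
        if human == "pass":
            if judge == "pass":
                tp += 1  # judge agreed with a human pass
            else:
                fn += 1  # judge failed something a human passed
        else:
            if judge == "fail":
                tn += 1  # judge agreed with a human fail
            else:
                fp += 1  # judge passed something a human failed
    tpr = tp / (tp + fn) if (tp + fn) else None
    tnr = tn / (tn + fp) if (tn + fp) else None
    return tpr, tnr
```

A judge with a low TNR lets real failures through, which is usually the more costly error when its verdict gates a decision.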

Protocol-based architecture

Judges use a protocol-based design: judge.py defines the JudgeProvider protocol, and AnthropicJudge and OpenAIJudge are its two implementations. Adding a new provider means implementing the protocol; call sites need no changes.
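
The pattern can be illustrated with Python's `typing.Protocol`. Only the name JudgeProvider comes from the text above; the method name `judge`, its signature, `JudgeResult`, and the toy `KeywordJudge` provider are assumptions made for this sketch, not kensa's real interface.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class JudgeResult:
    passed: bool
    reasoning: str


class JudgeProvider(Protocol):
    # Hypothetical method shape; the real protocol in judge.py may differ.
    def judge(self, prompt: str) -> JudgeResult: ...


class KeywordJudge:
    """Toy provider for illustration; it satisfies JudgeProvider structurally."""

    def judge(self, prompt: str) -> JudgeResult:
        ok = "confirm" in prompt.lower()
        reason = "mentions confirmation" if ok else "no confirmation found"
        return JudgeResult(ok, reason)


def run_judge(provider: JudgeProvider, prompt: str) -> JudgeResult:
    # The call site depends only on the protocol, never on a concrete class.
    return provider.judge(prompt)
```

Because protocols are matched structurally, `KeywordJudge` does not inherit from `JudgeProvider`, and `run_judge` stays unchanged when a provider is added.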
