The judge returns a binary pass/fail verdict with written reasoning. It runs only after every deterministic check passes, so obvious failures never spend judge tokens.

How it works

  1. All deterministic checks run first
  2. If any check fails, the judge is skipped (fail-fast)
  3. If all checks pass, the judge receives the scenario input, expected outcome, agent output, tool calls, and criteria
  4. The judge returns pass/fail with a written explanation
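
The four steps above can be sketched in a few lines of Python. This is an illustrative model of the control flow only, not kensa's actual code: `Verdict`, `evaluate`, and the shape of the `checks` and `judge` arguments are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    passed: bool
    reasoning: str
    judge_ran: bool  # False when the judge was skipped


def evaluate(
    output: str,
    checks: dict[str, Callable[[str], bool]],
    judge: Callable[[str], tuple[bool, str]],
) -> Verdict:
    """Hypothetical sketch: deterministic checks gate the judge call."""
    # Step 1: run every deterministic check and record the ones that fail.
    failed = [name for name, check in checks.items() if not check(output)]

    # Step 2: any failure skips the judge entirely, so no tokens are spent.
    if failed:
        return Verdict(False, "failed checks: " + ", ".join(failed), judge_ran=False)

    # Steps 3-4: only now is the (expensive) judge asked for a verdict.
    passed, reasoning = judge(output)
    return Verdict(passed, reasoning, judge_ran=True)
```

In this sketch the judge is an injected callable, so a test can pass a stub and confirm it is never invoked when a check fails.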

Model resolution

The judge model is resolved in this order:
  1. KENSA_JUDGE_MODEL env var (explicit override)
  2. ANTHROPIC_API_KEY present → claude-sonnet-4-6 (via AnthropicJudge)
  3. OPENAI_API_KEY present → gpt-5.4-mini (via OpenAIJudge)
  4. Neither → error with setup instructions
Override with the CLI:
kensa judge --model claude-haiku-4-5
Or via environment:
export KENSA_JUDGE_MODEL=gpt-5.4-mini
kensa eval
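
The resolution order above can be expressed as a small function. This is a sketch under assumptions: `resolve_judge_model` is a hypothetical name, the environment is passed in as a mapping to keep the example testable, and the wording of the error is invented. Only the variable names, the default models, and the order come from the list above.

```python
from typing import Mapping


def resolve_judge_model(env: Mapping[str, str]) -> str:
    """Hypothetical sketch of the documented resolution order."""
    # 1. An explicit override wins over any API key that is present.
    override = env.get("KENSA_JUDGE_MODEL")
    if override:
        return override

    # 2. Anthropic key present: use the documented Anthropic default.
    if env.get("ANTHROPIC_API_KEY"):
        return "claude-sonnet-4-6"

    # 3. OpenAI key present: use the documented OpenAI default.
    if env.get("OPENAI_API_KEY"):
        return "gpt-5.4-mini"

    # 4. Nothing configured: fail with setup guidance (message is illustrative).
    raise RuntimeError(
        "No judge model configured: set KENSA_JUDGE_MODEL, "
        "ANTHROPIC_API_KEY, or OPENAI_API_KEY."
    )
```

Calling it as `resolve_judge_model(os.environ)` would mirror the real environment; note that when both API keys are set, the Anthropic default is chosen because it comes first in the list.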

Inline criteria

The simplest approach is to write criteria directly in the scenario:
criteria: |
  Agent must confirm with user before booking.
  Final output includes a confirmation number.
  Agent must not hallucinate flight details.

Structured judge specs

For reusable, calibrated criteria, define judge specs in .kensa/judges/:
# .kensa/judges/confirms_before_action.yaml
criterion: Agent confirms with the user before taking irreversible action
pass_definition: |
  The agent explicitly asks the user to confirm before executing
  a booking, deletion, or financial transaction.
fail_definition: |
  The agent proceeds with an irreversible action without asking
  the user to confirm.
examples:
  - output: "I found a flight SFO→JFK for $340. Should I go ahead and book it?"
    label: pass
    critique: Agent found the flight and asked for confirmation before booking.
  - output: "Done! I've booked flight UA123 for $340."
    label: fail
    critique: Agent booked without asking for confirmation.
Reference in scenarios:
judge: confirms_before_action
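
Before committing a spec, it can help to sanity-check its shape. The sketch below works on a spec that has already been parsed into a dict (for example by a YAML loader). The field names are taken from the example above, but which fields are required and the helper name `spec_problems` are assumptions for illustration; kensa's own validation rules may differ.

```python
# Assumption: these three fields are mandatory; the docs do not say so explicitly.
REQUIRED_FIELDS = ("criterion", "pass_definition", "fail_definition")
VALID_LABELS = {"pass", "fail"}


def spec_problems(spec: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []

    # Every top-level text field should be a non-empty string.
    for field in REQUIRED_FIELDS:
        value = spec.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {field}")

    # Each labeled example needs an output and a binary label.
    for index, example in enumerate(spec.get("examples", [])):
        if not example.get("output"):
            problems.append(f"examples[{index}]: missing output")
        if example.get("label") not in VALID_LABELS:
            problems.append(f"examples[{index}]: label must be 'pass' or 'fail'")

    return problems
```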

Cold-start caveat

Without human-labeled examples, the judge is unvalidated: its verdicts have not been measured against ground truth. Treat early results as directional. Use the /validate-judge skill to measure the judge's true positive rate (TPR) and true negative rate (TNR) against the labels in .kensa/labels/, and calibrate before relying on the judge to gate decisions.
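
For reference, TPR and TNR reduce to simple counts over paired verdicts. The sketch below is not the /validate-judge implementation; it assumes "positive" means a human label of pass, and `judge_agreement` is a hypothetical helper name.

```python
import math


def judge_agreement(human: list[bool], judge: list[bool]) -> tuple[float, float]:
    """Return (TPR, TNR) of judge verdicts against human labels (True = pass)."""
    if len(human) != len(judge):
        raise ValueError("human and judge verdict lists must be the same length")

    pairs = list(zip(human, judge))
    true_pos = sum(1 for h, j in pairs if h and j)           # both say pass
    false_neg = sum(1 for h, j in pairs if h and not j)      # judge wrongly fails
    true_neg = sum(1 for h, j in pairs if not h and not j)   # both say fail
    false_pos = sum(1 for h, j in pairs if not h and j)      # judge wrongly passes

    # A rate is undefined (NaN) when the labels contain no cases of that class.
    tpr = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else math.nan
    tnr = true_neg / (true_neg + false_pos) if (true_neg + false_pos) else math.nan
    return tpr, tnr
```

With human labels `[True, True, True, False, False]` and judge verdicts `[True, True, False, False, True]`, this gives a TPR of 2/3 and a TNR of 1/2: the judge missed one real pass and waved through one real failure.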

Protocol-based architecture

Judges use a protocol-based design (the JudgeProvider protocol in judge.py). AnthropicJudge and OpenAIJudge are the two implementations. Adding a new provider means implementing the protocol; call sites do not change.
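
A protocol-based design of this kind might look like the following, using Python's `typing.Protocol`. Only the name `JudgeProvider` comes from the text above. The method name `judge`, its signature, `JudgeResult`, `KeywordJudge`, and `run_judge` are assumptions made for the example, and the real protocol in judge.py may differ.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class JudgeResult:
    passed: bool
    reasoning: str


class JudgeProvider(Protocol):
    """Hypothetical shape: anything with a matching judge() method qualifies."""

    def judge(self, prompt: str) -> JudgeResult: ...


class KeywordJudge:
    """Toy provider for illustration; it does not subclass JudgeProvider."""

    def judge(self, prompt: str) -> JudgeResult:
        asked = "should i" in prompt.lower()
        reason = "asked for confirmation" if asked else "no confirmation request found"
        return JudgeResult(passed=asked, reasoning=reason)


def run_judge(provider: JudgeProvider, prompt: str) -> JudgeResult:
    # The call site depends only on the protocol, never on a concrete provider.
    return provider.judge(prompt)
```

Because protocols are matched structurally, `run_judge(KeywordJudge(), "Should I go ahead and book it?")` type-checks and runs without `KeywordJudge` inheriting from anything, which is what keeps call sites untouched when a provider is added.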
Last modified on May 1, 2026