Binary pass/fail with written reasoning. The judge only runs after every deterministic check passes, so obvious failures never spend judge tokens.
How it works
- All deterministic checks run first
- If any check fails, the judge is skipped (fail-fast)
- If all checks pass, the judge receives the scenario input, expected outcome, agent output, tool calls, and criteria
- The judge returns pass/fail with a written explanation
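The flow above can be sketched in a few lines of Python. This is a minimal illustration, not kensa's actual code: `JudgeInput`, `Verdict`, `run_scenario`, and the field names are hypothetical, chosen only to mirror the five inputs listed above.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class JudgeInput:
    # Hypothetical container for the five things the judge receives.
    scenario_input: str
    expected_outcome: str
    agent_output: str
    tool_calls: list[str]
    criteria: str


@dataclass
class Verdict:
    passed: bool
    explanation: str


Check = Callable[[str], bool]
Judge = Callable[[JudgeInput], Verdict]


def run_scenario(payload: JudgeInput, checks: list[Check], judge: Judge) -> Optional[Verdict]:
    """Run deterministic checks first; call the judge only if all of them pass."""
    for check in checks:
        if not check(payload.agent_output):
            # Fail-fast: the judge is skipped, so no judge tokens are spent.
            return None
    return judge(payload)
```

In this sketch a return value of `None` means the judge never ran because a deterministic check failed; a real runner would presumably also report which check failed.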
Model resolution
The judge model is resolved in this order:
- KENSA_JUDGE_MODEL env var (explicit override)
- ANTHROPIC_API_KEY present → claude-sonnet-4-6 (via AnthropicJudge)
- OPENAI_API_KEY present → gpt-5.4-mini (via OpenAIJudge)
- Neither → error with setup instructions
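The resolution order can be expressed as a small function. This is a sketch under assumptions: the function name `resolve_judge_model` is hypothetical, empty environment values are treated as unset, and the error type and message are placeholders. Only the variable names, model names, and ordering come from the list above.

```python
import os
from typing import Mapping, Optional


def resolve_judge_model(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the judge model name, following the documented precedence."""
    env = os.environ if env is None else env

    # 1. Explicit override wins over everything else.
    override = env.get("KENSA_JUDGE_MODEL")
    if override:
        return override

    # 2. Anthropic key present: default Anthropic judge model.
    if env.get("ANTHROPIC_API_KEY"):
        return "claude-sonnet-4-6"

    # 3. OpenAI key present: default OpenAI judge model.
    if env.get("OPENAI_API_KEY"):
        return "gpt-5.4-mini"

    # 4. Nothing configured: fail with setup instructions (placeholder message).
    raise RuntimeError(
        "No judge model configured: set KENSA_JUDGE_MODEL, "
        "ANTHROPIC_API_KEY, or OPENAI_API_KEY."
    )
```

Passing the environment as a mapping keeps the function easy to test without mutating `os.environ`. Note that when both API keys are present, the Anthropic default is chosen because it is checked first.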
Inline criteria
The simplest approach. Write criteria directly in the scenario:
Structured judge specs
For reusable, calibrated criteria, define judge specs in .kensa/judges/:
Cold-start caveat
Without human-labeled examples, the judge is unvalidated: its verdicts have not been measured against ground truth. Treat early results as directional. Use the /validate-judge skill to measure TPR/TNR against labels in .kensa/labels/ and calibrate before relying on the judge to gate decisions.
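For reference, TPR (true positive rate) is the share of human-labeled passes the judge also passes, and TNR (true negative rate) is the share of human-labeled failures the judge also fails. The sketch below computes both from two parallel lists of booleans. It does not read .kensa/labels/ or reproduce what the /validate-judge skill does; the function name `tpr_tnr` and the list-based input are assumptions made for illustration.

```python
def tpr_tnr(labels: list[bool], verdicts: list[bool]) -> tuple[float, float]:
    """Compute (TPR, TNR) of judge verdicts against human labels.

    labels[i]   -- True if a human labeled example i as a pass.
    verdicts[i] -- True if the judge passed example i.
    """
    if len(labels) != len(verdicts):
        raise ValueError("labels and verdicts must have the same length")

    tp = sum(1 for l, v in zip(labels, verdicts) if l and v)
    fn = sum(1 for l, v in zip(labels, verdicts) if l and not v)
    tn = sum(1 for l, v in zip(labels, verdicts) if not l and not v)
    fp = sum(1 for l, v in zip(labels, verdicts) if not l and v)

    if tp + fn == 0 or tn + fp == 0:
        # Both rates need at least one labeled pass and one labeled fail.
        raise ValueError("need at least one positive and one negative label")

    return tp / (tp + fn), tn / (tn + fp)


# Four labeled passes (judge agrees on three), two labeled fails (judge agrees on one).
labels = [True, True, True, True, False, False]
verdicts = [True, True, True, False, False, True]
print(tpr_tnr(labels, verdicts))  # (0.75, 0.5)
```

A low TNR is usually the more dangerous of the two for a gating judge, since it means real failures are being waved through.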
Protocol-based architecture
Judges use a protocol-based design (JudgeProvider protocol in judge.py). AnthropicJudge and OpenAIJudge are the two implementations. Adding a new provider means implementing the protocol; call sites do not change.
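To illustrate why call sites stay untouched, here is a sketch of the pattern using Python's `typing.Protocol`. The real `JudgeProvider` signature in judge.py is not shown in this page, so the `judge(prompt)` method, the `Verdict` type, and the toy `KeywordJudge` provider are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Protocol, runtime_checkable


@dataclass
class Verdict:
    passed: bool
    explanation: str


@runtime_checkable
class JudgeProvider(Protocol):
    # Assumed method shape; the actual protocol may differ.
    def judge(self, prompt: str) -> Verdict: ...


class KeywordJudge:
    """Hypothetical toy provider: passes if a keyword appears in the prompt."""

    def __init__(self, keyword: str) -> None:
        self.keyword = keyword

    def judge(self, prompt: str) -> Verdict:
        found = self.keyword in prompt
        reason = f"keyword {self.keyword!r} {'found' if found else 'missing'}"
        return Verdict(passed=found, explanation=reason)


def run_judge(provider: JudgeProvider, prompt: str) -> Verdict:
    # The call site depends only on the protocol, not on a concrete class.
    return provider.judge(prompt)


verdict = run_judge(KeywordJudge("refund"), "Agent issued a refund to the customer.")
print(verdict.passed, verdict.explanation)  # True keyword 'refund' found
```

Because protocols use structural typing, `KeywordJudge` does not inherit from `JudgeProvider`; having a matching `judge` method is enough for `run_judge` to accept it.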