Skills
Five skills that orchestrate the complete evals workflow.
Skills run when you ask a coding agent such as Claude Code to "evaluate my agent". They orchestrate the eval workflow by calling kensa's CLI commands under the hood.
Installation
# Skills + CLI (recommended) — works with Codex, Cursor, OpenCode, Gemini CLI, and more
npx skills add satyaborg/kensa
uv add kensa
# Or, for Claude Code, install as a plugin
/plugin marketplace add satyaborg/kensa
/plugin install kensa
Lifecycle
Setup (audit-evals) → Design (generate-scenarios) → Calibrate (generate-judges)
→ Validate (validate-judge) → Execute (kensa eval) → Diagnose (diagnose-errors) → Iterate
/audit-evals
The default entry point. Assesses readiness, identifies testable behaviors, and prepares the environment.
What it does:
- Checks kensa installation
- Determines current state (scenarios exist? traces exist?)
- Scans codebase for entry point, SDK, tools, behaviors, env vars
- Verifies instrumentation with kensa doctor
- Routes to the appropriate next skill
/generate-scenarios
Generates test scenarios covering five categories:
- Happy path - expected behavior with valid inputs
- Tool usage - correct tool selection and ordering
- Edge cases - boundary conditions, unusual inputs
- Error handling - graceful failure, meaningful error messages
- Cost/latency bounds - resource usage stays within limits
Outputs YAML files to .kensa/scenarios/.
/generate-judges
Creates structured judge prompts for subjective evaluation criteria:
- Binary pass/fail definitions (no Likert scales)
- 2-4 few-shot examples with critiques
- Designed for reuse across scenarios
Outputs YAML specs to .kensa/judges/.
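The constraints listed above (a binary verdict, 2-4 few-shot examples, each with a critique) can be expressed as a small data model. This is a minimal sketch only: the class names, field names, and the `problems` helper are hypothetical and do not reflect kensa's actual YAML schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FewShotExample:
    output: str     # agent output shown to the judge
    critique: str   # explanation of why the output passes or fails
    verdict: bool   # True = pass, False = fail (binary, no Likert scale)


@dataclass
class JudgeSpec:
    # Hypothetical structure; kensa's real spec fields may differ.
    name: str
    pass_definition: str
    fail_definition: str
    examples: list = field(default_factory=list)

    def problems(self):
        """Return a list of human-readable issues; empty means the spec looks fine."""
        issues = []
        if not self.pass_definition.strip() or not self.fail_definition.strip():
            issues.append("both pass and fail definitions are required")
        if not 2 <= len(self.examples) <= 4:
            issues.append(f"expected 2-4 few-shot examples, got {len(self.examples)}")
        if any(not ex.critique.strip() for ex in self.examples):
            issues.append("every few-shot example needs a critique")
        return issues


spec = JudgeSpec(
    name="polite_tone",
    pass_definition="The reply is courteous and acknowledges the user's problem.",
    fail_definition="The reply is dismissive, sarcastic, or ignores the problem.",
    examples=[
        FewShotExample("Sorry about that, let me check your order.", "Acknowledges the issue.", True),
        FewShotExample("Not my problem.", "Dismissive and unhelpful.", False),
    ],
)
spec_issues = spec.problems()
```

Because the spec carries no scenario-specific fields, the same judge can be referenced from several scenarios, which matches the reuse goal above.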
/validate-judge
Tests judge accuracy against human-labeled examples:
- Requires 8-20 labeled examples
- Measures TPR (true positive rate) and TNR (true negative rate)
- Target threshold: both ≥ 90%
- Iterates on the judge prompt until thresholds are met
- Optional bootstrap resampling for confidence intervals
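To make the metrics concrete, here is a minimal sketch of how TPR, TNR, and a percentile bootstrap interval could be computed from labeled examples. It is not kensa's implementation. It assumes "pass" is the positive class, and the function names, the sample labels, and the resampling defaults are illustrative.

```python
import math
import random


def tpr_tnr(human, judge):
    """Return (TPR, TNR) for boolean labels, where True means 'pass'."""
    on_pass = [j for h, j in zip(human, judge) if h]       # judge verdicts where the human said pass
    on_fail = [j for h, j in zip(human, judge) if not h]   # judge verdicts where the human said fail
    tpr = sum(on_pass) / len(on_pass) if on_pass else math.nan
    tnr = sum(not j for j in on_fail) / len(on_fail) if on_fail else math.nan
    return tpr, tnr


def percentile_interval(values, alpha):
    """Percentile interval over the non-NaN values."""
    values = sorted(v for v in values if not math.isnan(v))
    if not values:
        return math.nan, math.nan
    lo = values[int((alpha / 2) * len(values))]
    hi = values[min(len(values) - 1, int((1 - alpha / 2) * len(values)))]
    return lo, hi


def bootstrap_ci(human, judge, n_resamples=2000, alpha=0.05, seed=0):
    """Resample (human, judge) pairs with replacement and return (TPR CI, TNR CI)."""
    rng = random.Random(seed)
    pairs = list(zip(human, judge))
    stats = []
    for _ in range(n_resamples):
        sample = rng.choices(pairs, k=len(pairs))
        h, j = zip(*sample)
        stats.append(tpr_tnr(h, j))
    tpr_ci = percentile_interval([s[0] for s in stats], alpha)
    tnr_ci = percentile_interval([s[1] for s in stats], alpha)
    return tpr_ci, tnr_ci


# Ten illustrative labeled examples: six human "pass" labels, four "fail".
human = [True, True, True, True, True, True, False, False, False, False]
judge = [True, True, True, True, True, False, False, False, False, True]

tpr, tnr = tpr_tnr(human, judge)          # 5/6 and 3/4 for this data
meets_target = tpr >= 0.90 and tnr >= 0.90
tpr_ci, tnr_ci = bootstrap_ci(human, judge)
print(f"TPR={tpr:.2f} TNR={tnr:.2f} meets target: {meets_target}")
print(f"95% CI  TPR={tpr_ci}  TNR={tnr_ci}")
```

With only 8-20 labeled examples the intervals will be wide, which is why the bootstrap step is useful: a point estimate of 90% on ten examples says much less than the same figure on a hundred.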
/diagnose-errors
Analyzes eval results after a run:
- Categorizes failures: check failures, judge rejections, errors, uncertain
- Reads .kensa/results/ and .kensa/traces/
- Identifies failure patterns across scenarios
- Recommends next action: fix agent, improve judge, add scenarios, etc.
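The categorization and pattern-finding steps above can be sketched as a simple grouping pass. The record shape, status names, and scenario names below are hypothetical and chosen for illustration. The actual files under .kensa/results/ may be structured differently.

```python
from collections import Counter, defaultdict

# Hypothetical result records, one per scenario run.
results = [
    {"scenario": "refund_happy_path", "status": "pass", "reason": None},
    {"scenario": "refund_missing_order", "status": "check_failure", "reason": "lookup_order tool not called"},
    {"scenario": "refund_partial", "status": "check_failure", "reason": "lookup_order tool not called"},
    {"scenario": "tone_angry_user", "status": "judge_rejection", "reason": "reply not empathetic"},
    {"scenario": "timeout_upstream", "status": "error", "reason": "agent raised TimeoutError"},
    {"scenario": "ambiguous_request", "status": "uncertain", "reason": "judge returned no verdict"},
]


def summarize(results):
    """Count failures per category and collect reasons shared by several scenarios."""
    failures = [r for r in results if r["status"] != "pass"]
    by_category = Counter(r["status"] for r in failures)

    scenarios_by_cause = defaultdict(list)
    for r in failures:
        scenarios_by_cause[(r["status"], r["reason"])].append(r["scenario"])

    # A cause seen in more than one scenario is treated as a pattern.
    recurring = {cause: names for cause, names in scenarios_by_cause.items() if len(names) > 1}
    return by_category, recurring


by_category, recurring = summarize(results)
print(dict(by_category))
print(recurring)
```

A recurring check failure such as the missing tool call here points toward fixing the agent, while a cluster of judge rejections or uncertain verdicts would suggest revisiting the judge prompt first.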