Say “evaluate my agent” in Claude Code or any skill-aware coding agent, and one of five skills picks the next step: audit setup, generate scenarios, validate the judge, run, or diagnose failures. Each skill drives the CLI under the hood.

Installation

uvx kensa init
Adds kensa to your dev dependencies, scaffolds .kensa/, and prompts you to choose which coding agent to install skills for. Works with Claude Code, Codex, Cursor, OpenCode, Gemini CLI, and more. Run kensa skills install again after a kensa upgrade to refresh the installed skills.
kensa skills install -a claude
kensa skills install -a codex
kensa skills install -a cursor
kensa skills install -a opencode
kensa skills install -a gemini
kensa skills install -a all
Claude Code uses .claude/skills. Codex, Cursor, OpenCode, Gemini CLI, and others use the open Agent Skills directory at .agents/skills.
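
The directory rule above can be restated in a few lines of Python. This is only an illustrative sketch: the helper name skills_dir is hypothetical and not part of kensa, and the agent names simply mirror the -a values listed above.

```python
from pathlib import Path


def skills_dir(agent: str, project_root: Path = Path(".")) -> Path:
    """Return the directory where skills for `agent` are expected to live.

    Hypothetical helper: it restates the rule from this page and is not
    a kensa API.
    """
    if agent == "claude":
        return project_root / ".claude" / "skills"
    # Codex, Cursor, OpenCode, Gemini CLI, and others share the open
    # Agent Skills directory.
    return project_root / ".agents" / "skills"
```

For example, skills_dir("codex", Path("/repo")) points at /repo/.agents/skills, the same location as for Cursor or Gemini CLI.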

Lifecycle

Setup (audit-evals) → Design (generate-scenarios) → Calibrate (generate-judges)
  → Validate (validate-judge) → Execute (kensa eval) → Diagnose (diagnose-errors) → Iterate

/audit-evals

The default entry point. Assesses readiness, identifies testable behaviors, and prepares the environment. What it does:
  • Checks kensa installation
  • Determines current state (scenarios exist? traces exist?)
  • Scans codebase for entry point, SDK, tools, behaviors, env vars
  • Verifies instrumentation with kensa doctor
  • Routes to the appropriate next skill
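
The "current state" check can be pictured as a quick look at what already exists under .kensa/. The sketch below is an assumption about how such a check might work, not the skill's actual logic: the function name detect_state and the returned keys are hypothetical, and only the directory names come from this page.

```python
from pathlib import Path


def detect_state(project_root: Path) -> dict[str, bool]:
    """Report which kensa artifacts already exist under .kensa/ (hypothetical helper)."""
    kensa_dir = project_root / ".kensa"

    def has_files(sub: str) -> bool:
        directory = kensa_dir / sub
        # any() stops at the first regular file found below the directory.
        return directory.is_dir() and any(p.is_file() for p in directory.rglob("*"))

    return {
        "scaffolded": kensa_dir.is_dir(),
        "scenarios": has_files("scenarios"),
        "traces": has_files("traces"),
    }
```

A project with scenarios but no traces would plausibly be ready to run, while one with neither would need scenarios first; the real routing rules are decided by the skill.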

/generate-scenarios

Generates test scenarios covering five categories:
  1. Happy path - expected behavior with valid inputs
  2. Tool usage - correct tool selection and ordering
  3. Edge cases - boundary conditions, unusual inputs
  4. Error handling - graceful failure, meaningful error messages
  5. Cost/latency bounds - resource usage stays within limits
Outputs YAML files to .kensa/scenarios/.

/generate-judges

Creates structured judge prompts for subjective evaluation criteria:
  • Binary pass/fail definitions (no Likert scales)
  • 2-4 few-shot examples with critiques
  • Designed for reuse across scenarios
Outputs YAML specs to .kensa/judges/.
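
To make the constraints above concrete, here is a sketch of a judge spec as a Python data structure. The class and field names are hypothetical and do not describe the YAML schema kensa writes; the sketch only encodes the rules listed above (binary verdicts, 2-4 few-shot examples, each with a critique).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FewShotExample:
    output: str    # agent output being judged
    verdict: bool  # True = pass, False = fail (binary, no Likert scale)
    critique: str  # short explanation of the verdict


@dataclass(frozen=True)
class JudgeSpec:
    """Hypothetical in-memory shape of a judge; not kensa's YAML schema."""
    name: str
    pass_definition: str
    fail_definition: str
    examples: tuple[FewShotExample, ...]

    def __post_init__(self) -> None:
        if not 2 <= len(self.examples) <= 4:
            raise ValueError("a judge needs 2-4 few-shot examples")
        if any(not ex.critique.strip() for ex in self.examples):
            raise ValueError("every example needs a critique")
```

Keeping the pass and fail definitions separate from any one scenario is what would let the same judge be reused across scenarios.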

/validate-judge

Tests judge accuracy against human-labeled examples:
  • Requires 8-20 labeled examples
  • Measures TPR (true positive rate) and TNR (true negative rate)
  • Target threshold: both ≥ 90%
  • Iterates on the judge prompt until thresholds are met
  • Optional bootstrap resampling for confidence intervals
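
The two metrics and the bootstrap step can be sketched in plain Python. This is a minimal illustration under stated assumptions, not kensa's implementation: the function names are hypothetical, True stands for "pass" in both the human label and the judge verdict, and the interval is a simple percentile bootstrap.

```python
import random


def tpr_tnr(labels: list[bool], verdicts: list[bool]) -> tuple[float, float]:
    """Return (TPR, TNR). True means "pass" for both human label and judge."""
    if len(labels) != len(verdicts):
        raise ValueError("labels and verdicts must have the same length")
    # Judge verdicts on the examples the human passed / failed.
    on_pass = [v for h, v in zip(labels, verdicts) if h]
    on_fail = [v for h, v in zip(labels, verdicts) if not h]
    if not on_pass or not on_fail:
        raise ValueError("need at least one passing and one failing label")
    tpr = sum(on_pass) / len(on_pass)                 # passes the judge also passed
    tnr = sum(not v for v in on_fail) / len(on_fail)  # fails the judge also failed
    return tpr, tnr


def bootstrap_interval(hits: list[bool], n_resamples: int = 2000,
                       alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap interval for the fraction of True values in `hits`."""
    rng = random.Random(seed)
    n = len(hits)
    # Resample with replacement and record the rate in each resample.
    rates = sorted(sum(rng.choices(hits, k=n)) / n for _ in range(n_resamples))
    lower = rates[int(alpha / 2 * n_resamples)]
    upper = rates[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


labels = [True] * 5 + [False] * 5             # human labels (made-up data)
verdicts = [True, True, True, True, False,    # judge on the 5 human passes
            False, False, False, False, True]  # judge on the 5 human fails
tpr, tnr = tpr_tnr(labels, verdicts)
meets_target = tpr >= 0.90 and tnr >= 0.90    # target from this page
tpr_hits = [v for h, v in zip(labels, verdicts) if h]
tpr_low, tpr_high = bootstrap_interval(tpr_hits)
```

In this made-up example both rates are 4/5, so the judge misses the 90% target and its prompt would need another iteration. With only 8-20 labeled examples the interval is typically wide, which is a reason to read it alongside the point estimate.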

/diagnose-errors

Analyzes eval results after a run:
  • Categorizes failures: check failures, judge rejections, errors, uncertain
  • Reads .kensa/results/ and .kensa/traces/
  • Identifies failure patterns across scenarios
  • Recommends next action: fix agent, improve judge, add scenarios, etc.
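
Spotting patterns across scenarios amounts to grouping failures by category. The sketch below assumes each failure has already been reduced to a record with scenario and category keys; that record shape and the category identifiers are hypothetical and are not the format of the files in .kensa/results/.

```python
from collections import defaultdict

# Category names follow the list above; the identifiers are made up.
CATEGORIES = ("check_failure", "judge_rejection", "error", "uncertain")


def failure_patterns(records: list[dict[str, str]]) -> dict[str, list[str]]:
    """Map each failure category to the sorted scenarios in which it occurred."""
    by_category: dict[str, set[str]] = defaultdict(set)
    for record in records:
        category = record["category"]
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category!r}")
        by_category[category].add(record["scenario"])
    # Only categories that actually occurred appear in the result.
    return {c: sorted(by_category[c]) for c in CATEGORIES if c in by_category}


records = [
    {"scenario": "refund", "category": "judge_rejection"},
    {"scenario": "lookup", "category": "judge_rejection"},
    {"scenario": "refund", "category": "error"},
]
patterns = failure_patterns(records)
```

A category that shows up in many scenarios, such as judge rejections here, hints at a shared cause (the agent or the judge) rather than a single flawed scenario.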
Last modified on May 1, 2026