# kensa
The open source agent evals harness. Tell your coding agent to "evaluate my agent" and get a working eval suite in minutes. Deterministic checks + LLM judge. No platform needed. MIT licensed. Python 3.10+.
## Install
Three entry points, same harness underneath.
- **Skills + CLI (recommended)**: `uvx kensa init` adds `kensa` to your project's dev deps, scaffolds a bare `.kensa/`, and prompts you to choose which coding agent gets the bundled skills. Non-interactive: `uvx kensa init --cli --agent all`. Add `--example` to scaffold a demo agent and demo scenario alongside the bare layout. Use `--no-cli` to skip the dev-dep install (scaffold-only). Use `--agent none` to skip skill installation.
- **Claude Code plugin**: `/plugin marketplace add satyaborg/kensa` then `/plugin install kensa`. Bundles the same five skills.
- **CLI-only**: `uvx kensa init` (bare scaffold) → `kensa capture -i "" -- ` (record one real run as a trace) → `kensa generate` (synthesize scenarios from the capture) → `kensa eval` (run + judge + report).
Skill targets supported by `kensa init` and `kensa skills install`: `claude`, `codex`, `cursor`, `opencode`, `gemini`, `other`, `all`. `other` drops a generic copy you can wire into any agent. Use `--global` on `kensa skills install` to install to your home dir instead of the project.
Provider extras: `uv add "kensa[anthropic]"`, `uv add "kensa[openai]"`, `uv add "kensa[langchain]"`, `uv add "kensa[all]"`. MCP server extra: `uv add "kensa[mcp]"`.
## Docs
- [GitHub](https://github.com/satyaborg/kensa): source, README, examples
- [Architecture (AGENTS.md / CLAUDE.md)](https://github.com/satyaborg/kensa/blob/main/AGENTS.md): data flow, module dependency graph, design patterns, conventions
- [Homepage](https://kensa.sh)
## How it works
Your coding agent reasons: it reads your codebase, identifies failure modes from past traces, and writes scenarios. The CLI computes: it instruments, executes, judges, and reports. Skills orchestrate the workflow between them.
Each scenario runs in its own subprocess with `KENSA_TRACE_DIR` set. Kensa injects a `sitecustomize.py` via `PYTHONPATH` that calls `instrument()` before the agent's code runs, configures OpenTelemetry, writes spans as JSONL, and auto-instruments any detected SDK (Anthropic, OpenAI, LangChain). No code changes in the agent. The injected directory is stripped from `PYTHONPATH` post-instrumentation so child subprocesses don't re-instrument.
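The injection mechanism described above relies on standard Python behavior: a `sitecustomize.py` found on `PYTHONPATH` is imported at interpreter startup, before the target script runs. The following is a minimal sketch of that mechanism only, not kensa's actual runner. The hook body, the `DEMO_INSTRUMENTED` marker, and the `run_with_hook` helper are hypothetical stand-ins; a real hook would call `instrument()` where the marker is set.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical hook body. A real hook would set up tracing here; this
# stand-in only sets a marker so the effect is observable.
HOOK_SOURCE = """\
import os

inject_dir = os.path.dirname(os.path.abspath(__file__))
os.environ["DEMO_INSTRUMENTED"] = "1"

# Drop the injected directory from PYTHONPATH so that child processes
# started by the agent do not run this hook again.
parts = os.environ.get("PYTHONPATH", "").split(os.pathsep)
kept = [p for p in parts if p and os.path.abspath(p) != inject_dir]
if kept:
    os.environ["PYTHONPATH"] = os.pathsep.join(kept)
else:
    os.environ.pop("PYTHONPATH", None)
"""


def run_with_hook(code: str, trace_dir: str) -> str:
    """Run `code` in a fresh interpreter with the hook on PYTHONPATH."""
    with tempfile.TemporaryDirectory() as inject_dir:
        Path(inject_dir, "sitecustomize.py").write_text(HOOK_SOURCE)
        env = dict(os.environ)
        existing = env.get("PYTHONPATH")
        env["PYTHONPATH"] = (
            inject_dir + os.pathsep + existing if existing else inject_dir
        )
        env["KENSA_TRACE_DIR"] = trace_dir
        result = subprocess.run(
            [sys.executable, "-c", code],
            env=env, capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()


# The "agent" sees the marker set by the hook and no injected directory.
AGENT_CODE = (
    "import os; "
    "print(os.environ.get('DEMO_INSTRUMENTED'), os.environ.get('PYTHONPATH'))"
)
print(run_with_hook(AGENT_CODE, trace_dir="traces"))
```

The agent code itself is untouched: everything happens through the environment passed to the subprocess.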
Deterministic checks run first; if any check fails, the LLM judge is skipped (fail-fast, save tokens). A scenario passes only when all checks pass AND the judge passes (or no `criteria` is set).
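The gating rule above can be sketched in a few lines. This is an illustration of the stated pass condition, not kensa's code: the `Check` and `Judge` callable shapes and the `scenario_passes` name are hypothetical, and running every check before deciding (rather than stopping at the first failure) is an assumption.

```python
from typing import Callable, List, Optional, Sequence

Check = Callable[[], bool]     # hypothetical: returns True when the check passes
Judge = Callable[[str], bool]  # hypothetical: maps criteria text to a verdict


def scenario_passes(
    checks: Sequence[Check], criteria: Optional[str], judge: Judge
) -> bool:
    # Deterministic checks run first. Running all of them (instead of
    # stopping at the first failure) is an assumption of this sketch.
    results = [check() for check in checks]
    if not all(results):
        return False  # fail fast: the judge is never invoked
    if not criteria:
        return True   # no criteria set, so the checks alone decide
    return judge(criteria)


judge_calls: List[str] = []


def fake_judge(criteria: str) -> bool:
    judge_calls.append(criteria)
    return True


failed = scenario_passes([lambda: True, lambda: False], "Be polite.", fake_judge)
passed = scenario_passes([lambda: True], "Be polite.", fake_judge)
print(failed, passed, len(judge_calls))  # prints: False True 1
```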
The suite improves with each run. Feed traces from previous runs back through `kensa generate`, and Kensa synthesizes scenarios that target observed failure modes instead of educated guesses.
## Compatible coding agents
The bundled skills install for: Claude Code, Codex, Cursor, OpenCode, and Gemini CLI. A generic `other` target works with any MCP-aware or skill-aware agent. The MCP server (see below) drives the same workflow from any MCP client, including Claude Desktop.
## Features
- **Zero to eval**: the coding agent bootstraps your eval suite, which solves the cold-start problem. You review the result instead of scaffolding it by hand.
- **Capture-driven scenarios**: `kensa capture -- ` runs your agent once with full tracing. `kensa generate` synthesizes scenario YAMLs from the captured trace via an LLM.
- **Checks gate the judge**: deterministic checks (output match, tool ordering, trajectory, cost caps, latency, turn count) run before the LLM judge. If a check fails, no judge tokens are spent.
- **Trace everything**: auto-instruments Anthropic, OpenAI, and LangChain via OpenTelemetry. Each scenario runs in its own subprocess with isolated tracing and cache-aware cost tracking.
- **Dataset-driven evals**: point at a JSONL file. Each row becomes a run with its own trace and verdict. Re-run for variance stats, flaky detection, and anomaly flagging.
- **Structured judges**: define judge criteria in YAML with pass/fail definitions and few-shot examples. Reuse specs across scenarios for consistent grading.
- **No platform**: pip install, BYO API keys, all data stays local. Same CLI on your laptop and in CI.
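
To make the dataset-driven bullet above concrete, here is a rough sketch of reading JSONL rows and flagging flaky ones across repeated runs. It assumes a hypothetical row schema (`id`, `input`) and defines "flaky" as a row with mixed pass/fail verdicts across reruns; kensa's own schema and statistics may differ.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def load_rows(jsonl_text: str) -> List[dict]:
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]


def flaky_rows(verdicts: Iterable[Tuple[str, bool]]) -> List[str]:
    """Return row ids whose verdicts differ across repeated runs."""
    outcomes: Dict[str, Set[bool]] = defaultdict(set)
    for row_id, passed in verdicts:
        outcomes[row_id].add(passed)
    return sorted(row_id for row_id, seen in outcomes.items() if len(seen) > 1)


# Hypothetical dataset: each row would become one run with its own verdict.
DATASET = (
    '{"id": "row-1", "input": "refund request"}\n'
    '{"id": "row-2", "input": "login outage"}\n'
)
rows = load_rows(DATASET)

# Verdicts from two reruns of each row (made-up values for illustration).
verdicts = [("row-1", True), ("row-1", True), ("row-2", True), ("row-2", False)]
print(len(rows), flaky_rows(verdicts))  # prints: 2 ['row-2']
```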
## Skills
Five skills orchestrate the eval workflow when used with a coding agent:
- `/audit-evals`: assess readiness, identify testable behaviors, prepare the environment. The default entry point. Triggered by "evaluate my agent", "set up evals", "is my agent ready?".
- `/generate-scenarios`: happy paths, edge cases, tool usage, error handling, cost bounds. One command.
- `/generate-judges`: binary pass/fail definitions with few-shot examples, ready to reuse across scenarios.
- `/validate-judge`: test judge accuracy against human labels. Iterates until the true-positive rate (TPR) and true-negative rate (TNR) meet the threshold.
- `/diagnose-errors`: categorize failures, identify patterns, recommend next action.
The full lifecycle: `Setup → Design → Execute → Diagnose → Iterate`.
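The two rates that `/validate-judge` iterates on are standard classification measures. A minimal sketch of computing them from paired labels follows; treating "pass" as the positive class and the `tpr_tnr` helper name are assumptions of this example.

```python
from typing import Sequence, Tuple


def tpr_tnr(human: Sequence[bool], judge: Sequence[bool]) -> Tuple[float, float]:
    """Compare judge verdicts with human labels (True means "pass")."""
    if len(human) != len(judge):
        raise ValueError("label sequences must have the same length")
    positives = sum(1 for h in human if h)
    negatives = len(human) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one human pass and one human fail")
    true_pos = sum(1 for h, j in zip(human, judge) if h and j)
    true_neg = sum(1 for h, j in zip(human, judge) if not h and not j)
    return true_pos / positives, true_neg / negatives


human_labels = [True, True, True, False, False]
judge_labels = [True, True, False, False, True]
tpr, tnr = tpr_tnr(human_labels, judge_labels)
print(round(tpr, 3), round(tnr, 3))  # prints: 0.667 0.5
```

A judge that agrees with humans on passes but waves through failures shows a high TPR and a low TNR, which is why both rates are tracked.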
## Checks
Deterministic check types. Fast, cheap, binary. Run before the LLM judge to save cost:
- `output_contains`: output includes a string or pattern
- `output_matches`: output matches a regex
- `tools_called`: all listed tools were invoked (set membership, order-free)
- `tools_not_called`: none of the listed tools were invoked
- `tool_order`: tools called in this temporal sequence (use only when order is load-bearing)
- `trajectory`: validate tool-call sequences against expected patterns (strict or any-order) with optional accuracy thresholds, `max_steps`, and `max_tokens`
- `max_cost`: total cost (USD) under threshold
- `max_turns`: LLM call count under limit
- `max_duration`: execution time under limit
- `no_repeat_calls`: no duplicate tool calls with identical arguments
Checks live in a registry (`CHECK_REGISTRY` in `checks.py`). Add a new check by registering a function. No call-site changes needed.
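The registry pattern mentioned above can be sketched as a dictionary populated by a decorator. Only the `CHECK_REGISTRY` name comes from the text; the `register` decorator, the check signature, the `tools` parameter key, and the trace shape are hypothetical. The sketch also illustrates the difference between set membership (`tools_called`) and temporal order (`tool_order`), with order read here as a subsequence that allows other calls in between, which is an assumption.

```python
from typing import Callable, Dict, List, Tuple

# A tool call as (tool_name, serialized_arguments); hypothetical trace shape.
ToolCall = Tuple[str, str]
CheckFn = Callable[[List[ToolCall], dict], bool]

CHECK_REGISTRY: Dict[str, CheckFn] = {}


def register(name: str) -> Callable[[CheckFn], CheckFn]:
    """Add a check function to the registry under `name`."""
    def decorator(fn: CheckFn) -> CheckFn:
        CHECK_REGISTRY[name] = fn
        return fn
    return decorator


@register("tools_called")
def tools_called(calls: List[ToolCall], params: dict) -> bool:
    # Set membership: every listed tool appears, in any order.
    return set(params["tools"]) <= {name for name, _ in calls}


@register("tool_order")
def tool_order(calls: List[ToolCall], params: dict) -> bool:
    # Subsequence test: each listed tool must appear after the previous one.
    names = iter(name for name, _ in calls)
    return all(tool in names for tool in params["tools"])


@register("no_repeat_calls")
def no_repeat_calls(calls: List[ToolCall], params: dict) -> bool:
    # No two calls share both tool name and arguments.
    return len(calls) == len(set(calls))


def run_check(check_type: str, calls: List[ToolCall], params: dict) -> bool:
    """Dispatch by type; new checks need no changes here."""
    return CHECK_REGISTRY[check_type](calls, params)


trace = [("search", "{}"), ("fetch", '{"id": 1}'), ("summarize", "{}")]
print(run_check("tools_called", trace, {"tools": ["summarize", "search"]}))  # True
print(run_check("tool_order", trace, {"tools": ["summarize", "search"]}))    # False
```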
## Judge
Natural-language criteria assessed against the full execution trace. Binary pass/fail with written reasoning.
Judge model resolution order:
1. `KENSA_JUDGE_MODEL` env var (explicit override)
2. `ANTHROPIC_API_KEY` present → `claude-sonnet-4-6`
3. `OPENAI_API_KEY` present → `gpt-5.4-mini`
4. Neither → error with setup instructions
Judges are protocol-based (`JudgeProvider` in `judge.py`). `AnthropicJudge` and `OpenAIJudge` ship in-tree.
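The resolution order listed above amounts to a short chain of environment lookups. The sketch below mirrors that order using the model names from the list; the `resolve_judge_model` function name, the error type, and the error message are hypothetical.

```python
import os
from typing import Mapping, Optional


def resolve_judge_model(env: Optional[Mapping[str, str]] = None) -> str:
    """Pick a judge model using the documented precedence."""
    env = os.environ if env is None else env
    override = env.get("KENSA_JUDGE_MODEL")
    if override:
        return override  # 1. explicit override wins
    if env.get("ANTHROPIC_API_KEY"):
        return "claude-sonnet-4-6"  # 2. Anthropic key present
    if env.get("OPENAI_API_KEY"):
        return "gpt-5.4-mini"  # 3. OpenAI key present
    # 4. nothing configured: fail with setup instructions
    raise RuntimeError(
        "No judge model configured. Set KENSA_JUDGE_MODEL, "
        "ANTHROPIC_API_KEY, or OPENAI_API_KEY."
    )


print(resolve_judge_model({"ANTHROPIC_API_KEY": "a", "OPENAI_API_KEY": "b"}))
# prints: claude-sonnet-4-6
```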
## Scenario format
Scenarios are YAML files in `.kensa/scenarios/`. The `input` is appended as the final argv element of `run_command` at execution time. Example:
```yaml
id: classify_ticket
name: Support ticket triage
description: Classify a support ticket by severity.
source: user
input: "Our entire team can't log in. SSO has returned 502 since 7am."
run_command: [python, agent.py]
expected_outcome: Agent returns the correct priority label.
checks:
  - type: trajectory
    params:
      steps:
        - tool: classify_ticket
      max_steps: 1
      max_tokens: 2000
  - type: output_matches
    params: { pattern: "^P[123]$" }
    description: Output must be exactly P1, P2, or P3.
  - type: max_cost
    params: { max_usd: 0.05 }
    description: Stay under five cents.
criteria: |
  P1 is for outages or data loss affecting multiple users.
  The agent must classify based on business impact, not tone.
```
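Because `input` is appended as the final argv element, the agent can read it from the end of its argument list. A small sketch of how the command line would be assembled under that rule; the `build_argv` helper is hypothetical and the scenario is represented as a plain dict rather than a parsed YAML file.

```python
from typing import Any, List, Mapping


def build_argv(scenario: Mapping[str, Any]) -> List[str]:
    """Append the scenario input as the last argument of run_command."""
    return [*map(str, scenario["run_command"]), str(scenario["input"])]


scenario = {
    "run_command": ["python", "agent.py"],
    "input": "Our entire team can't log in. SSO has returned 502 since 7am.",
}
argv = build_argv(scenario)
print(argv[:2], argv[-1] == scenario["input"])  # prints: ['python', 'agent.py'] True
```

On the agent side, this means `sys.argv[-1]` holds the scenario input when `agent.py` is launched this way.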
## CLI
Works standalone without a coding agent. Python 3.10+.
Setup and capture:
- `kensa init`: bare scaffold of `.kensa/`, adds `kensa` to project dev deps, prompts for skill target
- `kensa init --example`: scaffold with a demo agent and demo scenario
- `kensa init --cli --agent all`: non-interactive, install all bundled skill targets
- `kensa init --no-cli --agent none`: scaffold-only, no project mutations
- `kensa skills install --agent `: install bundled skills (also `--global`, `--force`)
- `kensa capture -- [args...]`: capture one real agent invocation as a trace
- `kensa capture -i "" -- [args...]`: capture with an explicit input string (recommended)
- `kensa doctor`: pre-flight environment checks (Python, SDKs, API keys, scenarios, judge)
Run, judge, report:
- `kensa run`: run all scenarios
- `kensa run --scenario-id `: run specific scenario
- `kensa run --dry-run`: list scenarios that would run, without executing
- `kensa judge`: run checks + LLM judge on the latest run
- `kensa judge --run-id `: judge a specific run
- `kensa judge --model `: override judge model
- `kensa report`: rich terminal output for the latest run
- `kensa report --format markdown`: CI-friendly markdown
- `kensa report --format json`: machine-readable
- `kensa report --format html`: standalone HTML file
- `kensa report --verbose`: show full check details and judge reasoning
- `kensa eval`: run + judge + report in one shot (terminal, markdown, or json)
- `kensa eval -s `: eval a specific scenario
Generate and analyze:
- `kensa generate`: synthesize scenario YAMLs from the latest run's traces via an LLM
- `kensa generate --run-id `: synthesize from a specific run
- `kensa generate --trace path/to/trace.jsonl -n 5`: synthesize N scenarios from a trace file
- `kensa generate --dry-run`: print generated YAML without writing
- `kensa analyze`: surface cost, latency, and anomalies across runs
MCP:
- `kensa mcp`: serve kensa over MCP (stdio by default)
- `kensa mcp --http --host 127.0.0.1 --port 8765`: MCP over HTTP
## CI
```yaml
- name: Run evals
  run: uv run kensa eval --format markdown
  # Exit codes: 0 = all pass, 1 = any fail
```
Deterministic checks need no API keys. Add judge keys as secrets for LLM-judged criteria; if omitted, judge criteria are skipped and don't block the pipeline.
## MCP server
Kensa ships as an MCP server so any MCP-aware client (Claude Code, Cursor, Codex, OpenCode, Gemini CLI, Claude Desktop) can drive the full eval workflow.
Tools (7): `init`, `doctor`, `run`, `judge`, `eval`, `report`, `analyze`. Failures come back as a stable `MCPError(error, code, hint)` envelope rather than raising across the protocol boundary.
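Returning failures as data rather than raising can be sketched with a small wrapper. The three field names match the `MCPError(error, code, hint)` envelope named above; the field types, the specific `code` values, the hint texts, and the `safe_tool` decorator are hypothetical.

```python
import functools
from dataclasses import asdict, dataclass
from typing import Any, Callable, Dict


@dataclass(frozen=True)
class MCPError:
    error: str  # human-readable message
    code: str   # stable machine-readable identifier (values here are made up)
    hint: str   # suggested next step for the caller


def safe_tool(fn: Callable[..., Dict[str, Any]]) -> Callable[..., Dict[str, Any]]:
    """Convert exceptions into an error envelope instead of raising."""
    @functools.wraps(fn)
    def wrapper(*args: Any, **kwargs: Any) -> Dict[str, Any]:
        try:
            return fn(*args, **kwargs)
        except FileNotFoundError as exc:
            return asdict(MCPError(str(exc), "not_found", "Run the init tool first."))
        except Exception as exc:  # last-resort catch at the protocol boundary
            return asdict(MCPError(str(exc), "internal_error", "Run the doctor tool."))
    return wrapper


@safe_tool
def report(run_id: str) -> Dict[str, Any]:
    # Hypothetical tool body that fails for an unknown run.
    raise FileNotFoundError(f"run {run_id} not found")


print(report("abc123"))
```

The client always receives a dictionary, so it can branch on `code` without parsing tracebacks.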
Resources (8) under `kensa://`: `runs`, `runs/{run_id}`, `runs/{run_id}/results`, `runs/{run_id}/trace/{scenario}/{index}`, `scenarios`, `scenarios/{scenario_id}`, `judges`, `judges/{name}`.
One-liner for Claude Code:
```bash
claude mcp add kensa -- uvx kensa-mcp
```
JSON config (`.mcp.json` or `.cursor/mcp.json`):
```json
{
  "mcpServers": {
    "kensa": {
      "command": "uvx",
      "args": ["kensa-mcp"]
    }
  }
}
```
Codex (`.codex/config.toml`):
```toml
[mcp_servers.kensa]
command = "uvx"
args = ["kensa-mcp"]
```
The `kensa-mcp` PyPI shim pins to the matching `kensa[mcp]` version, so no pre-install is needed. The shim's launcher prints a clean install hint instead of a two-level import traceback when `fastmcp` is missing.
## OTel compatibility
Spans are standard OpenTelemetry. Kensa writes them as JSONL locally via a custom span exporter (`exporter.py`), which is what `kensa run` and `kensa analyze` consume. The only public Python export is `instrument()` (`from kensa import instrument`), an opt-in escape hatch for environments where `sitecustomize` cannot run (for example, `python -S`).
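For the escape-hatch case, opting in manually would look roughly like the sketch below. It assumes `instrument()` can be called with no arguments, as the call shown earlier suggests, and wraps the import so the agent still runs where kensa is not installed; the `enable_tracing` helper is hypothetical.

```python
def enable_tracing() -> bool:
    """Call kensa's instrument() if the package is importable."""
    try:
        from kensa import instrument
    except ImportError:
        return False  # kensa not installed: run the agent without tracing
    instrument()  # assumed to take no arguments
    return True


# Call this at the top of the agent entry point, before any SDK client
# is created, when sitecustomize cannot run (for example under `python -S`).
tracing_enabled = enable_tracing()
```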
## Architecture
Flat module structure in `src/kensa/`. No nested packages. No circular dependencies. `models.py` is the dependency root.
Key modules:
- `models.py`: Pydantic domain objects, dependency root
- `paths.py`: centralized `.kensa/` path resolution (stdlib only)
- `pricing.py`: model price lookup, OpenRouter fetch
- `trace_semantics.py`: canonical tool-call dedup and ordering
- `trajectory.py`: trajectory check helpers
- `checks.py`: registry pattern (`CHECK_REGISTRY`)
- `judge.py`: protocol-based providers (`JudgeProvider`, `AnthropicJudge`, `OpenAIJudge`)
- `llm.py`: shared `Completer` protocol and Anthropic/OpenAI adapters
- `runner.py`: subprocess execution, sitecustomize injection
- `capture.py`: `kensa capture` (subprocess + trace + capture manifest)
- `generate.py`: scenario synthesis from traces
- `report.py`: formatters registry (terminal, markdown, json, html)
- `analyzer.py`: cost / latency / anomaly stats
- `aggregate.py`: multi-run variance and flaky detection
- `exporter.py`: OTel JSONL span exporter (no kensa imports)
- `scaffold.py`: idempotent `.kensa/` scaffolding (shared by CLI and MCP)
- `skills_install.py`: bundled-skill installer + `uv add --dev` helper
- `mcp_server.py`: MCP tools and resources (thin adapters)
- `_mcp_launcher.py`: clean install-hint wrapper consumed by the `kensa-mcp` shim
- `cli.py`: Click entry point, lazy imports for fast startup
## Examples
Five example agents in `examples/`, each targeting a domain where reliability matters:
- `sql-analyst`: Finance/BI - wrong revenue numbers, hallucinated metrics
- `incident-triage`: SRE - missed P0, false 3am page
- `code-reviewer`: Security - shipped CVE, false positive fatigue
- `customer-support`: CX - misrouted ticket, unauthorized refund promise
- `sdr-qualifier`: Sales - wasted AE time, lost hot lead
## Links
- [GitHub](https://github.com/satyaborg/kensa): source and documentation
- [Homepage](https://kensa.sh)
- [PyPI: kensa](https://pypi.org/project/kensa/)
- [PyPI: kensa-mcp](https://pypi.org/project/kensa-mcp/)
- [X](https://x.com/kensa_sh): updates