# kensa

The open source agent evals harness. Tell your coding agent to "evaluate my agent" and get a working eval suite in minutes. Deterministic checks + LLM judge. No platform needed. MIT licensed.

## Install

- Skills + CLI (recommended): `npx skills add satyaborg/kensa`, then `uv add kensa` (or `pip install kensa`). Works with Codex, Cursor, OpenCode, Gemini CLI, and other coding agents.
- Claude Code plugin: `/plugin marketplace add satyaborg/kensa`, then `/plugin install kensa`

Provider extras: `uv add "kensa[anthropic]"`, `uv add "kensa[openai]"`, `uv add "kensa[langchain]"`, or `uv add "kensa[all]"`.

## Docs

- [GitHub](https://github.com/satyaborg/kensa): Source, README, and examples
- [Architecture (AGENTS.md)](https://github.com/satyaborg/kensa/blob/main/AGENTS.md): Data flow, module dependency graph, design patterns, conventions

## How it works

Your coding agent reasons: it reads your codebase, identifies failure modes from past traces, and writes scenarios. The CLI computes: it instruments, executes, judges, and reports. Skills orchestrate the workflow between them.

Each scenario runs in its own subprocess with `KENSA_TRACE_DIR` set. The agent's entry point calls `from kensa import instrument; instrument()`, which configures OpenTelemetry, writes spans as JSONL, and auto-instruments any detected SDK (Anthropic, OpenAI, LangChain). Deterministic checks run first; if any check fails, the LLM judge is skipped (fail-fast). A scenario passes only when all checks pass AND the judge passes.

It gets smarter each run: feed traces from previous runs back in, and kensa generates scenarios targeting real failure modes instead of educated guesses.

## Compatible coding agents

Works with Claude Code, Cursor, Codex CLI, Gemini CLI, GitHub Copilot, Kiro, OpenCode, and Pi.

## Features

- **Zero to eval**: The coding agent bootstraps your eval suite to solve cold-start. You review, not scaffold.
- **Checks gate the judge**: Deterministic checks (tool ordering, cost caps, latency, output matching) run before the LLM judge. If a check fails, no tokens spent.
- **Trace everything**: Auto-instruments Anthropic, OpenAI, and LangChain via OpenTelemetry. Each scenario runs in its own subprocess with isolated tracing and cache-aware cost tracking.
- **Dataset-driven evals**: Point at a JSONL file. Each row becomes a run with its own trace and verdict. Re-run for variance stats, flaky detection, and anomaly flagging.
- **Structured judges**: Define judge criteria in YAML with pass/fail definitions and few-shot examples. Reuse specs across scenarios for consistent grading.
- **No platform**: `pip install`, BYO API keys, all data stays local. Same CLI on your laptop and in CI.

## Skills

Five skills orchestrate the eval workflow when used with a coding agent:

- `/audit-evals`: Assess readiness, identify testable behaviors, prepare the environment. The default entry point.
- `/generate-scenarios`: Happy paths, edge cases, tool usage, error handling, cost bounds. One command.
- `/generate-judges`: Binary pass/fail definitions with few-shot examples, ready to reuse across scenarios.
- `/validate-judge`: Test judge accuracy against human labels. Iterates until TPR and TNR meet threshold.
- `/diagnose-errors`: Categorize failures, identify patterns, recommend next action.

## Checks

Deterministic check types: fast, cheap, binary.
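As a rough illustration of what "deterministic" means here, a check like `tool_order` reduces to a pure function over the recorded tool calls. This is a hypothetical sketch, not kensa's actual implementation (the real checks operate on full trace spans in `checks.py`):

```python
def tool_order_check(tool_calls: list[str], expected: list[str]) -> bool:
    """Return True if `expected` appears, in order, as a subsequence
    of `tool_calls`. Extra calls in between are allowed.

    Hypothetical sketch: kensa's real checks consume trace spans,
    not bare tool-name lists.
    """
    remaining = iter(tool_calls)
    # `name in remaining` advances the iterator, so order is enforced.
    return all(name in remaining for name in expected)
```

A binary result like this costs nothing to compute, which is why such checks can gate the (token-spending) LLM judge.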
Run before the LLM judge to save cost:

- `output_contains`: Output includes a string or pattern
- `output_matches`: Output matches a regex
- `tool_called`: A specific tool was invoked
- `tool_not_called`: A specific tool was not invoked
- `tool_order`: Tools called in expected sequence
- `max_cost`: Total cost under threshold
- `max_turns`: LLM call count under limit
- `max_duration`: Execution time under limit
- `no_repeat_calls`: No duplicate tool calls with identical arguments

## Judge

Natural-language criteria assessed against the full execution trace. Binary pass/fail with written reasoning.

Judge model resolution order:

1. `KENSA_JUDGE_MODEL` env var (explicit override)
2. `ANTHROPIC_API_KEY` present → claude-sonnet-4-6
3. `OPENAI_API_KEY` present → gpt-5.4-mini
4. Neither → error with setup instructions

## Scenario format

Scenarios are YAML files in `.kensa/scenarios/`. Example:

```yaml
id: classify_ticket
name: Support ticket triage
description: Classify a support ticket by severity.
source: user
input: "Our entire team can't log in. SSO has returned 502 since 7am."
run_command: python agent.py {{input}}
expected_outcome: Agent returns the correct priority label.
checks:
  - type: output_matches
    params: { pattern: "^P[123]$" }
    description: Output must be exactly P1, P2, or P3.
  - type: max_cost
    params: { max_usd: 0.05 }
    description: Stay under five cents.
criteria: |
  P1 is for outages or data loss affecting multiple users.
  The agent must classify based on business impact, not tone.
```

## CLI

Works standalone without a coding agent. Python 3.10+.
- `kensa eval`: Run + judge + report in one shot
- `kensa eval -s <scenario>`: Eval a specific scenario
- `kensa init`: Set up the `.kensa/` dir with an example agent
- `kensa run`: Run scenarios and capture traces
- `kensa judge`: Score the latest run with checks + LLM judge
- `kensa judge --model <model>`: Override the judge model
- `kensa report`: Rich terminal output
- `kensa report --format markdown`: CI-friendly markdown
- `kensa report --format json`: Machine-readable
- `kensa report --format html`: Standalone HTML file
- `kensa analyze`: Surface cost, latency, and anomalies across runs
- `kensa doctor`: Verify your setup is ready to run

## CI

```yaml
- name: Run evals
  run: uv run kensa eval --format markdown
# Exit codes: 0 = all pass, 1 = any fail
```

Deterministic checks need no API keys. Add judge keys as secrets for LLM-judged criteria; if omitted, judge criteria are skipped and don't block the pipeline.

## OTel compatibility

Spans are standard OpenTelemetry, emitted via OpenInference instrumentors. kensa's built-in exporter writes them as JSONL to `KENSA_TRACE_DIR`. To ship spans to a remote OTel backend, wire up your own TracerProvider with an OTLP exporter before importing kensa, or skip `instrument()` and feed JSONL spans in via `KENSA_TRACE_DIR`. A built-in OTLP passthrough is on the roadmap.

## Architecture

Flat module structure in `src/kensa/`, with no circular dependencies. Key modules:

- `models.py`: Pydantic domain objects, dependency root
- `checks.py`: registry pattern
- `judge.py`: protocol-based providers
- `runner.py`: subprocess execution
- `report.py`: formatters registry
- `exporter.py`: OTel JSONL span exporter
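Because traces land as plain JSONL files, external tooling can consume them without kensa at all. A minimal sketch, assuming a span schema with `name`, `start_time`, and `end_time` fields (illustrative field names, not kensa's documented schema):

```python
import json
from pathlib import Path


def load_spans(trace_dir: str) -> list[dict]:
    """Read every JSONL span file in trace_dir into a list of dicts,
    skipping blank lines. One JSON object per line."""
    spans = []
    for path in sorted(Path(trace_dir).glob("*.jsonl")):
        with path.open() as f:
            for line in f:
                if line.strip():
                    spans.append(json.loads(line))
    return spans


def total_duration(spans: list[dict]) -> float:
    """Sum per-span wall time. Assumes start_time/end_time are
    numeric timestamps in the same unit (hypothetical fields)."""
    return sum(s["end_time"] - s["start_time"] for s in spans)
```

The same file-per-run layout is what lets you skip `instrument()` entirely and feed externally produced JSONL spans in via `KENSA_TRACE_DIR`.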
## Examples

Five example agents in `examples/`, each targeting a domain where reliability matters:

- `sql-analyst`: Finance/BI (wrong revenue numbers, hallucinated metrics)
- `incident-triage`: SRE (missed P0, false 3am page)
- `code-reviewer`: Security (shipped CVE, false positive fatigue)
- `customer-support`: CX (misrouted ticket, unauthorized refund promise)
- `sdr-qualifier`: Sales (wasted AE time, lost hot lead)

## Links

- [GitHub](https://github.com/satyaborg/kensa): Source and documentation
- [Discord](https://discord.gg/n77EqxUH): Community
- [X](https://x.com/kensa_sh): Updates