The open source agent evals harness

Tell your coding agent to evaluate an agent and get a working eval suite in minutes. No platform needed.

$ npx skills add satyaborg/kensa

Installs eval skills for coding agents; the CLI auto-installs on first use.

Works with
Claude Code · Cursor · Codex CLI · Gemini CLI · GitHub Copilot · Kiro · OpenCode · Pi

How it works

Your coding agent reasons: it reads your codebase, identifies failure modes from traces, and writes scenarios. The CLI computes: it instruments, executes, judges, and reports. Skills orchestrate the workflow between them.

01

Zero to eval

The coding agent bootstraps your evals to solve the cold-start problem. You review, not scaffold.

02

Checks gate the judge

Deterministic checks run before the LLM judge. If a check fails, no tokens are spent.
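The gating pattern can be sketched in plain Python. This is a conceptual illustration only, not kensa's actual API; all names here are hypothetical:

```python
def run_checks(output, checks):
    """Run cheap deterministic checks; return the names of any that fail."""
    return [name for name, check in checks if not check(output)]

def evaluate(output, checks, llm_judge):
    failures = run_checks(output, checks)
    if failures:
        # A deterministic check failed: short-circuit, spend no judge tokens.
        return {"verdict": "fail", "failed_checks": failures, "judge_called": False}
    return {"verdict": llm_judge(output), "failed_checks": [], "judge_called": True}

# Example: plain-text output fails the JSON check, so the judge never runs.
checks = [
    ("non_empty", lambda s: bool(s.strip())),
    ("is_json", lambda s: s.lstrip().startswith("{")),
]
result = evaluate("plain text", checks, llm_judge=lambda s: "pass")
```

The design choice is the usual cheap-filter-first one: deterministic checks are fast and free, so they absorb the obvious failures before any paid call.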

03

Trace everything

Auto-instruments Anthropic, OpenAI, and LangChain via OpenTelemetry (OTel).

04

Dataset-driven evals

Point at a JSONL file; each row becomes a run with its own trace and verdict. Re-run for variance stats, flakiness detection, and anomaly flagging.
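Conceptually, the per-row fan-out and re-run statistics look like this. The sketch below is illustrative only, not kensa's internals, and `run_agent` is a stand-in for executing the real agent:

```python
import json
import statistics

# A tiny inline JSONL dataset: one eval run per row.
dataset = "\n".join(json.dumps(r) for r in [
    {"input": "refund order 123", "expected": "refund_issued"},
    {"input": "cancel order 456", "expected": "cancelled"},
])

def run_agent(row, attempt):
    # Hypothetical stand-in for the real agent; returns (passed, latency_seconds).
    return (True, 1.0 + 0.1 * attempt)

rows = [json.loads(line) for line in dataset.splitlines()]
report = []
for row in rows:
    results = [run_agent(row, attempt) for attempt in range(3)]  # re-run for variance
    latencies = [lat for _, lat in results]
    passes = sum(ok for ok, _ in results)
    report.append({
        "input": row["input"],
        "pass_rate": passes / len(results),
        "latency_stdev": round(statistics.stdev(latencies), 3),
        "flaky": 0 < passes < len(results),  # mixed pass/fail across re-runs
    })
```

A row that sometimes passes and sometimes fails across re-runs is what gets flagged as flaky.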

05

Structured judges

Define judge criteria in YAML with pass/fail definitions and few-shot examples. Reuse specs across scenarios for consistent grading.
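As a rough sketch of the shape such a spec might take (the field names below are illustrative only, not kensa's actual schema):

```yaml
# Hypothetical judge spec: field names are illustrative, not kensa's schema.
criterion: answers_politely
definition:
  pass: "Response is courteous and addresses the user's question."
  fail: "Response is dismissive, rude, or ignores the question."
examples:
  - output: "Happy to help! Your refund was issued."
    verdict: pass
  - output: "Not my problem."
    verdict: fail
```

Binary definitions plus few-shot examples are what make a spec reusable: the same criterion grades consistently across scenarios.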

06

No platform

uv or pip install, BYO API keys, all data stays local. Same CLI on your laptop and in CI.

Skills

Five skills take you from zero to eval, or from traces to targeted iteration.

/audit-evals

Assess readiness, identify testable behaviors, prepare the environment. The default entry point.

/generate-scenarios

Happy paths, edge cases, tool usage, error handling, cost bounds. One command.

/generate-judges

Binary pass/fail definitions with few-shot examples, ready to reuse across scenarios.

/validate-judge

Test judge accuracy against human labels. Iterates until TPR and TNR meet your threshold.
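The accuracy target is just true positive rate and true negative rate against a human-labeled set. A minimal sketch of the computation (generic, not kensa's code):

```python
def judge_accuracy(judge_verdicts, human_labels):
    """TPR: fraction of human-labeled passes the judge also passes.
    TNR: fraction of human-labeled fails the judge also fails."""
    pairs = list(zip(judge_verdicts, human_labels))
    on_positives = [j for j, h in pairs if h == "pass"]
    on_negatives = [j for j, h in pairs if h == "fail"]
    tpr = sum(j == "pass" for j in on_positives) / len(on_positives)
    tnr = sum(j == "fail" for j in on_negatives) / len(on_negatives)
    return tpr, tnr

# Judge agrees on 2 of 3 human passes and 1 of 2 human fails.
tpr, tnr = judge_accuracy(
    judge_verdicts=["pass", "pass", "fail", "fail", "pass"],
    human_labels=["pass", "pass", "fail", "pass", "fail"],
)
```

Tracking both rates matters: a judge that passes everything has perfect TPR and useless TNR.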

/diagnose-errors

Categorize failures, identify patterns, recommend next action.

CLI (Python 3.10+)

Works standalone for CI and local iteration. Checks run before the judge, so obvious failures stop early without spending tokens.

kensa init     Scaffold with an example agent
kensa eval     Run + judge + report in one shot
kensa run      Execute scenarios, capture traces
kensa judge    Deterministic checks + LLM judge
kensa report   Terminal, markdown, JSON, or HTML output
kensa analyze  Cost/latency stats + anomaly flagging
kensa doctor   Pre-flight environment checks

FAQ

What agents does kensa work with?

Any Python agent that makes LLM calls. Auto-instrumentation covers Anthropic, OpenAI, and LangChain out of the box. Other providers work with manual OTel config.

Do I need to modify my agent code?

Two lines, added before your SDK imports: from kensa import instrument; instrument(). kensa runs your agent in a subprocess and captures traces automatically. Coding agents add these lines for you.

Can I run kensa in CI?

Yes. kensa eval --format markdown is all you need. Deterministic checks need no API keys. Add judge keys as secrets for LLM-judged criteria.
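A minimal GitHub Actions step might look like the following sketch. The workflow layout and package name are assumptions; only the kensa eval --format markdown command comes from the answer above:

```yaml
# Hypothetical CI job; adapt names and secrets to your repo.
jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install kensa
      - run: kensa eval --format markdown >> "$GITHUB_STEP_SUMMARY"
        env:
          # Judge key: only needed for LLM-judged criteria, not deterministic checks.
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
```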

Is kensa free?

Yes, it is MIT licensed. The only cost is the LLM API calls behind judge criteria, and those are optional.