Eval your agents before they hit production. Kensa generates scenarios, auto-instruments traces, runs deterministic checks, and scores with an LLM judge. Run it from the CLI, your coding agent (Claude Code, Codex), or CI to catch regressions. Open source, local-first, framework agnostic, and OTel-compatible. Zero code changes to your first eval in minutes.
Documentation Index
Fetch the complete documentation index at: https://kensa.sh/docs/llms.txt
Use this file to discover all available pages before exploring further.
Prefer the guided path? Quickstart walks you through installing the skills and the CLI, then running your first eval.
Run your first eval
Install kensa, add the provider extra that matches your stack, and evaluate a real repo.
Learn the mental model
Understand how scenarios, traces, checks, judges, and reports fit together.
Pick a workflow
Start with skills, drop down to the CLI, wire up MCP, or gate changes in CI.
Try a realistic example
Use one of the included example agents to see the full loop on a codebase with real stakes.
How it works
Zero to eval
Ask your coding agent to inspect the codebase and draft the first scenarios. You review evals instead of starting from a blank file.
Runs become traces
Kensa captures LLM calls, tool use, tokens, cost, and latency while your agent runs each scenario.
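To make the idea of a run becoming a trace concrete, here is a minimal sketch of how per-step records could roll up into the totals a report shows. The `Span` shape and the `summarize` helper are hypothetical and only illustrate the concept; they are not kensa's schema or API.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """One recorded step in an agent run (hypothetical shape, not kensa's schema)."""
    kind: str                 # "llm" or "tool"
    name: str
    input_tokens: int = 0
    output_tokens: int = 0
    cost_usd: float = 0.0
    latency_ms: float = 0.0


def summarize(spans: list[Span]) -> dict:
    """Roll spans up into per-scenario totals for calls, tokens, cost, and latency."""
    return {
        "llm_calls": sum(1 for s in spans if s.kind == "llm"),
        "tool_calls": [s.name for s in spans if s.kind == "tool"],
        "tokens": sum(s.input_tokens + s.output_tokens for s in spans),
        "cost_usd": round(sum(s.cost_usd for s in spans), 6),
        "latency_ms": sum(s.latency_ms for s in spans),
    }


# Illustrative data for one scenario: two LLM calls around one tool call.
spans = [
    Span("llm", "plan", input_tokens=120, output_tokens=30, cost_usd=0.0021, latency_ms=800.0),
    Span("tool", "search_orders", latency_ms=45.0),
    Span("llm", "answer", input_tokens=200, output_tokens=60, cost_usd=0.0035, latency_ms=950.0),
]
summary = summarize(spans)
print(summary)
```

Keeping the raw spans alongside the summary is what lets a failed scenario be traced back to the specific call or tool invocation that caused it.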
Checks gate judges
Assertions run before LLM judges, catching obvious regressions without spending tokens.
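The gating pattern described above can be sketched in a few lines. This is a generic illustration under stated assumptions, not kensa's implementation: `evaluate`, the check names, and `stub_judge` are all hypothetical, and the judge is a local stub standing in for an LLM call.

```python
from typing import Callable

Check = Callable[[str], bool]

# Records every output the (stubbed) judge sees, so we can confirm gating works.
judge_calls: list[str] = []


def stub_judge(output: str) -> bool:
    """Stand-in for an LLM judge; a real one would cost tokens per call."""
    judge_calls.append(output)
    return "refund" in output.lower()


def evaluate(output: str, checks: list[tuple[str, Check]], judge: Callable[[str], bool]) -> dict:
    """Run cheap deterministic checks first; call the judge only if all of them pass."""
    for name, check in checks:
        if not check(output):
            return {"verdict": "fail", "failed_check": name, "judge_called": False}
    passed = judge(output)
    return {"verdict": "pass" if passed else "fail", "failed_check": None, "judge_called": True}


# Hypothetical deterministic checks for an agent's final answer.
checks: list[tuple[str, Check]] = [
    ("non_empty", lambda out: bool(out.strip())),
    ("no_stack_trace", lambda out: "Traceback" not in out),
]

bad = evaluate("Traceback (most recent call last): ...", checks, stub_judge)
good = evaluate("Your refund was issued.", checks, stub_judge)
print(bad)
print(good)
```

In this sketch the failing output never reaches the judge, so only the output that passed every check incurs a judge call.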
Ship with evidence
Get verdicts, traces, cost, latency, and failure details in terminal, Markdown, JSON, or HTML.
Where to start
| If you want to | Go to |
|---|---|
| Get running in under a minute | Quickstart |
| Understand the mental model | Concepts |
| See the full eval workflow | Skills |
| Look up exact commands | CLI Reference |
| Drive kensa from an MCP client | MCP Server |
Why teams use it
- Cold-start friendly: kensa can start from code understanding even when you have no labels and no existing eval harness.
- Trace-informed iteration: previous runs become input for better scenarios instead of dead artifacts.
- Cost-aware by default: checks gate the expensive judge call, so obvious failures do not spend tokens.
- Works with existing tooling: use it through skills, the CLI, MCP, or CI depending on how your team already operates.
It gets smarter each run
Feed traces from previous runs back in with `kensa generate`, and kensa synthesizes scenarios targeting real failure modes instead of educated guesses.
Eval loop