Eval your agents before they hit production. Kensa generates scenarios, auto-instruments traces, runs deterministic checks, and scores with an LLM judge. Run it from the CLI, your coding agent (Claude Code, Codex), or CI to catch regressions. Open source, local-first, framework-agnostic, and OTel-compatible. Get to your first eval in minutes with zero code changes.
Prefer the guided path? Quickstart walks you through installing the skills and CLI and running your first eval.

Run your first eval

Install kensa, add the provider extra that matches your stack, and evaluate a real repo.

Learn the mental model

Understand how scenarios, traces, checks, judges, and reports fit together.

Pick a workflow

Start with skills, drop down to the CLI, wire up MCP, or gate changes in CI.

Try a realistic example

Use one of the included example agents to see the full loop on a codebase with real stakes.

How it works

Zero to eval

Ask your coding agent to inspect the codebase and draft the first scenarios. You review evals instead of starting from a blank file.

Runs become traces

Kensa captures LLM calls, tool use, tokens, cost, and latency while your agent runs each scenario.
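
As a rough illustration of what a captured run can be reduced to, the sketch below totals tokens, cost, and latency over a few span records stored one JSON object per line. The field names (`kind`, `tokens`, `cost_usd`, `latency_ms`) and the sample values are hypothetical and chosen only for this example; they are not kensa's actual span schema.

```python
import json
from collections import Counter

# Hypothetical span records, one JSON object per line (JSONL).
# Field names and values are illustrative, not kensa's real schema.
SAMPLE_JSONL = """\
{"kind": "llm_call", "tokens": 420, "cost_usd": 0.0021, "latency_ms": 830}
{"kind": "tool_use", "tokens": 0, "cost_usd": 0.0, "latency_ms": 45}
{"kind": "llm_call", "tokens": 310, "cost_usd": 0.0016, "latency_ms": 610}
"""


def summarize_spans(jsonl_text):
    """Aggregate totals and per-kind counts from JSONL span text."""
    summary = {"tokens": 0, "cost_usd": 0.0, "latency_ms": 0, "kinds": Counter()}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines between records
        span = json.loads(line)
        summary["tokens"] += span.get("tokens", 0)
        summary["cost_usd"] += span.get("cost_usd", 0.0)
        summary["latency_ms"] += span.get("latency_ms", 0)
        summary["kinds"][span.get("kind", "unknown")] += 1
    return summary


summary = summarize_spans(SAMPLE_JSONL)
print(summary["tokens"], summary["latency_ms"], dict(summary["kinds"]))
```

A per-scenario summary like this is the kind of data a report can show next to each verdict.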

Checks gate judges

Assertions run before LLM judges, catching obvious regressions without spending tokens.

Ship with evidence

Get verdicts, traces, cost, latency, and failure details in the terminal or as Markdown, JSON, or HTML.
Each run leaves traces your coding agent can turn into sharper scenarios.

Where to start

If you want to                  | Go to
Get running in under a minute   | Quickstart
Understand the mental model     | Concepts
See the full eval workflow      | Skills
Look up exact commands          | CLI Reference
Drive kensa from an MCP client  | MCP Server

Why teams use it

  • Cold-start friendly: kensa can start from code understanding even when you have no labels and no existing eval harness.
  • Trace-informed iteration: previous runs become input for better scenarios instead of dead artifacts.
  • Cost-aware by default: checks gate the expensive judge call, so obvious failures do not spend tokens.
  • Works with existing tooling: use it through skills, the CLI, MCP, or CI depending on how your team already operates.

It gets smarter each run

Feed traces from previous runs back in with kensa generate, and kensa synthesizes scenarios targeting real failure modes instead of educated guesses.
Eval loop
Run 1 (cold-start):    code → baseline scenarios → traces (1)
Run 2 (with traces):   code + traces (1) → better scenarios → traces (2)
Run 3:                 code + traces (1,2) → even better scenarios

Data flow

Each scenario runs in its own subprocess. Kensa auto-instruments the agent’s LLM SDK, captures spans as JSONL, and translates them to its internal format.

Checks are deterministic, cheap, and fast, and they gate the expensive LLM judge call. If a check fails, the scenario fails immediately: the judge is skipped and no tokens are spent. A scenario passes only when all checks pass and the judge passes.
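
The gating rule described above can be sketched in a few lines. This is a minimal illustration under assumptions, not kensa's implementation: `Verdict`, `run_scenario`, the two checks, and the stub judge are hypothetical names invented for the example, and the judge is a plain function standing in for an LLM call.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Verdict:
    passed: bool
    judge_called: bool
    failed_check: Optional[str] = None


def run_scenario(
    output: str,
    checks: List[Callable[[str], bool]],
    judge: Callable[[str], bool],
) -> Verdict:
    """Run cheap deterministic checks first; call the judge only if all pass."""
    for check in checks:
        if not check(output):
            # Fail fast: the judge is never invoked, so no tokens are spent.
            return Verdict(False, False, check.__name__)
    # All checks passed, so the judge decides the final verdict.
    return Verdict(judge(output), True)


# Hypothetical deterministic checks for a support-agent scenario.
def non_empty(output: str) -> bool:
    return bool(output.strip())


def mentions_refund(output: str) -> bool:
    return "refund" in output.lower()


judge_calls = []


def stub_judge(output: str) -> bool:
    # Stand-in for an LLM judge; records each call so gating is observable.
    judge_calls.append(output)
    return True


checks = [non_empty, mentions_refund]
failed = run_scenario("Sorry, I cannot help.", checks, stub_judge)
passed = run_scenario("Your refund was issued.", checks, stub_judge)
print(failed)
print(passed)
print(len(judge_calls))
```

In this sketch the first scenario fails on `mentions_refund` without touching the judge, so only the second scenario incurs a judge call.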

Compatible coding agents

Kensa works with any coding agent that can run shell commands and use skills, including Claude Code, Codex, Cursor, OpenCode, and Gemini CLI.

License

MIT.
Last modified on April 24, 2026