0.8.0 — 2026-05-05
Kensa now runs as a pytest plugin, so you can write evals as ordinary tests and run them in your existing test suite.
- New `@pytest.mark.kensa(...)` marker binds a pytest test to a kensa scenario, with `cases=` for parametrized inputs and `trials=` for repeated runs against the same case. Recorded outputs are captured and fed through the same checks and judge pipeline as YAML scenarios.
- New `kensa eval --pytest <path>` invokes the bundled pytest plugin and writes a normal kensa run manifest, so reports, judge, and analyze work unchanged.
- New `kensa init --pytest` scaffolds a starter pytest-native eval under `tests/evals/`.
- Scenario schema gained `cases` and `trials` as the canonical keys; `dataset` and `input_field` remain as legacy aliases.
- Fixed cost backfill missing models whose slug ends in a single-digit version segment (e.g. `claude-sonnet-4`), which previously fell through the slug normalizer.
0.7.0 — 2026-05-01
`kensa capture` records one real agent invocation, and `kensa generate` synthesizes scenarios straight from the capture, so you can bootstrap evals without writing any scenarios by hand.
- New `kensa capture -- <cmd>` command runs your agent once with full instrumentation and writes a capture-kind run manifest plus a JSONL trace under `.kensa/`. Pass `-i`/`--input` to mirror `scenario.input`. `kensa run` rejects capture-kind manifests, so captures and evals stay separate.
- `kensa generate` is now capture-aware: source priority is `--trace` → `--run-id` → latest capture manifest → latest run manifest. Generated scenarios inherit the observed `run_command` from the manifest verbatim.
- `kensa init` now prompts for a coding agent (`-a`/`--agent claude|codex|cursor|opencode|gemini|other|all|none`) and installs the bundled skills into the right directory for that agent. Use `--no-cli` to skip the `uv add --dev kensa` step.
- Internal: a new `RunKind` discriminator on the run manifest distinguishes `eval` runs from `capture` runs end to end, so the MCP server, report, and judge surfaces only operate on eval runs.
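
The source priority described for `kensa generate` amounts to a first-match lookup. The following is a minimal sketch of that idea only; the `Sources` container, its field names, and `pick_source` are hypothetical and are not taken from kensa's code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sources:
    """Hypothetical container for the inputs a generator could draw from."""
    trace: Optional[str] = None           # value of --trace, if passed
    run_id: Optional[str] = None          # value of --run-id, if passed
    latest_capture: Optional[str] = None  # newest capture manifest, if any
    latest_run: Optional[str] = None      # newest run manifest, if any


def pick_source(s: Sources) -> tuple[str, str]:
    """Return (kind, value) for the first available source, in priority order."""
    candidates = [
        ("trace", s.trace),
        ("run_id", s.run_id),
        ("capture", s.latest_capture),
        ("run", s.latest_run),
    ]
    for kind, value in candidates:
        if value is not None:
            return kind, value
    raise LookupError("no trace, run id, or manifest to generate from")
```

Under this reading, an explicit flag always wins over whatever manifests happen to be on disk, and a capture is preferred over an eval run when neither flag is given.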
0.6.2 — 2026-04-27
`uvx kensa init` is now a one-shot bootstrap: scaffold scenarios, add kensa as a dev dep, and drop the Claude Code skills into the project.
- New `kensa skills install` command copies the bundled skills into `.claude/skills/` (Claude Code) and `.agents/skills/` (Codex, OpenCode, Cursor, and other adopters of the Agent Skills standard). Use `--global` to install into `~`, `--claude`/`--codex` to scope to one target, and `--force` to overwrite existing files.
- `kensa init` gained `--cli`/`--skills` flags (and their negations). In an interactive terminal, each step prompts before mutating state. In CI, both default to skip unless passed explicitly.
- When `kensa init` adds kensa via `uv add --dev`, it now points at `uv run kensa doctor` if the active interpreter is outside the project venv, so doctor checks reflect the right environment.
0.6.1 — 2026-04-24
Release tooling fix.
- Fixed `uv.lock` drifting from the bumped package version on release. The release script now refreshes the lockfile, and a packaging test guards against future drift in the `kensa-mcp` shim’s pin.
0.6.0 — 2026-04-24
`kensa generate` synthesizes new scenarios from real traces, so coverage grows with usage.
- New `kensa generate` command replays the latest run (or a specific `--run-id`/`--trace` file) through an LLM and writes fresh scenario YAML to `.kensa/scenarios/`. Use `-n` to set the count (1–20), `--dry-run` to preview, `--model` to override the LLM, and `--force` to overwrite existing files.
- Fixed the generator shipping invalid scenarios: every synthesized scenario is now validated against the runtime schema and `-n` is enforced.
- Fixed the generator silently returning fewer scenarios than requested: underproduction now surfaces as a warning.
- Fixed OpenAI judge verdicts truncating mid-response on reasoning models by switching to `max_completion_tokens`.
- Fixed the MCP `scenarios` resource URI returning a 404 for clients that followed the documented path.
0.5.2 — 2026-04-18
Cost backfill now recognizes the full range of model slugs SDKs report.
- Pricing lookups normalize SDK-reported model IDs against OpenRouter’s canonical dotted slugs, handling provider prefixes, dashed variants, and dated suffixes. Size segments like `70b`, `24b`, and `405b` are left untouched.
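
To make the normalization concrete, here is a rough sketch of one way such a normalizer could behave. It is an illustration, not kensa's implementation: the function name `normalize_slug`, the regexes, and the choice to drop the provider prefix before comparison are all assumptions.

```python
import re

# Trailing date stamps such as -20241022 or -2024-08-06 (assumed formats).
_DATE_SUFFIX = re.compile(r"-(?:\d{8}|\d{4}-\d{2}-\d{2})$")
# Two adjacent purely numeric segments, e.g. the "3-5" in "claude-3-5-sonnet".
# The lookarounds keep size segments like "70b" from being treated as versions.
_DASHED_VERSION = re.compile(r"(?<![\w.])(\d+)-(\d+)(?![\w.])")


def normalize_slug(model_id: str) -> str:
    """Hypothetical normalizer: prefix-free, undated, dotted-version slug."""
    slug = model_id.strip().lower()
    slug = slug.split("/", 1)[-1]               # drop a provider prefix, if any
    slug = _DATE_SUFFIX.sub("", slug)           # drop a dated suffix, if any
    slug = _DASHED_VERSION.sub(r"\1.\2", slug)  # 3-5 becomes 3.5
    return slug
```

With these rules, `normalize_slug("anthropic/claude-3-5-sonnet-20241022")` returns `"claude-3.5-sonnet"`, while `"llama-3-1-70b-instruct"` becomes `"llama-3.1-70b-instruct"` with the `70b` segment intact.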
0.5.1 — 2026-04-17
Instrumentation is zero-config: agents run without any code changes.
- The runner injects a bootstrap `sitecustomize.py` via `PYTHONPATH`, so OpenTelemetry and SDK auto-instrumentation are set up before agent code runs. No more `from kensa import instrument; instrument()` boilerplate in scenario files.
- `instrument()` stays exported as an idempotent escape hatch for environments where `sitecustomize` can’t run (e.g. `python -S`). Existing agents that still call it keep working, with no duplicate spans.
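
The bootstrap mechanism relies on standard Python behavior: the `site` module imports `sitecustomize` at interpreter startup if it can be found on the path. The sketch below demonstrates that mechanism in isolation. The helper name `run_with_bootstrap` and the `BOOTSTRAP_RAN` marker are made up for illustration, and the real bootstrap would set up instrumentation rather than an environment variable.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path


def run_with_bootstrap(argv: list[str]) -> str:
    """Run argv in a child process with a sitecustomize.py on PYTHONPATH."""
    with tempfile.TemporaryDirectory() as bootstrap_dir:
        # Stand-in for instrumentation setup: leave a marker the child can see.
        Path(bootstrap_dir, "sitecustomize.py").write_text(
            "import os\nos.environ['BOOTSTRAP_RAN'] = '1'\n"
        )
        env = dict(os.environ)
        existing = env.get("PYTHONPATH")
        # Prepend so the bootstrap wins over any other sitecustomize on the path.
        env["PYTHONPATH"] = (
            bootstrap_dir + os.pathsep + existing if existing else bootstrap_dir
        )
        result = subprocess.run(
            argv, env=env, capture_output=True, text=True, check=True
        )
        return result.stdout


# Stand-in "agent" that reports whether the bootstrap ran before its own code.
PROBE = "import os; print(os.environ.get('BOOTSTRAP_RAN'))"
AGENT = [sys.executable, "-c", PROBE]
AGENT_NO_SITE = [sys.executable, "-S", "-c", PROBE]
```

Running `AGENT` through the helper prints `1`, while `AGENT_NO_SITE` prints `None` because `-S` skips the `site` import. That second case is the situation in which an explicit `instrument()` call would still be needed.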
0.5.0 — 2026-04-15
Kensa now runs as an MCP server, so any MCP-aware client can drive the full eval workflow as tools.
- New `kensa mcp` subcommand serves the harness over the Model Context Protocol, exposing `init`, `doctor`, `run`, `judge`, `eval`, `report`, and `analyze` as tools, plus eight `kensa://` resources. Stdio by default, `--http --port` for HTTP transport.
- Separate `kensa-mcp` PyPI shim lets you run `uvx kensa-mcp` without installing kensa first. The shim pins to the matching `kensa[mcp]` version and prints a clean install hint if the `mcp` extra is missing.
- MCP errors come back as a stable `MCPError(error, code, hint)` envelope instead of raising across the protocol boundary, and `doctor` now distinguishes scenario-not-found from invalid-run-id.
0.4.0 — 2026-04-12
Trajectory checks let you assert that an agent followed the right tool-call path.
- New `trajectory` check type validates tool-call sequences against expected patterns. It supports strict ordering and any-order matching, with optional accuracy thresholds and inline budgets.
- Aggregate reports now include estimated k-run pass rates per scenario, so you can spot flaky evals without guessing.
- Fixed trajectory placeholder validation rejecting valid unordered sequences.
0.3.0 — 2026-04-10
Tool checks now accept lists, so you can assert multiple tools in one check.
- Breaking: `tool_called` and `tool_not_called` are now `tools_called` and `tools_not_called`. They take a list of tool names with set-membership semantics (order-free). Use `tool_order` when sequence matters.
- Validation errors now tell you which item in a scenario is invalid, not just that something is wrong.
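
The difference between set-membership and order-sensitive checks can be sketched in a few lines. These functions only borrow the check names for readability; they are not kensa's implementation, and treating `tool_order` as a subsequence match is an assumption.

```python
def tools_called(expected: list[str], calls: list[str]) -> bool:
    """Pass when every expected tool appears among the calls, in any order."""
    return set(expected) <= set(calls)


def tools_not_called(forbidden: list[str], calls: list[str]) -> bool:
    """Pass when none of the forbidden tools appears among the calls."""
    return set(forbidden).isdisjoint(calls)


def tool_order(expected: list[str], calls: list[str]) -> bool:
    """Pass when expected occurs as a subsequence of calls (assumed semantics)."""
    remaining = iter(calls)
    # "name in remaining" consumes the iterator, so matches must occur in order.
    return all(name in remaining for name in expected)
```

For a run that called `["search", "fetch", "summarize"]`, `tools_called(["summarize", "search"], ...)` passes because order is ignored, whereas `tool_order(["summarize", "search"], ...)` fails.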
0.2.0 — 2026-04-08
Run commands are safer: no more shell interpolation of inputs.
- Breaking: `run_command` now takes an argv list instead of a shell string with `{{input}}` templates. Input is appended as the final argument. This removes the command-injection surface from the old `shlex.quote` approach.
- Omitted `input` fields and explicit empty strings are now handled as distinct cases.
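
Why an argv list removes the injection surface is easiest to see with a small example. This is a sketch using the standard `subprocess` module, not kensa's runner; `run_agent` and the echoing stand-in agent are hypothetical.

```python
import subprocess
import sys


def run_agent(run_command: list[str], user_input: str) -> str:
    """Run the agent with the input appended as the final argv element."""
    argv = [*run_command, user_input]
    # No shell is involved, so metacharacters in user_input are never interpreted.
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return result.stdout


# Stand-in agent that prints its last argument back unchanged.
ECHO_AGENT = [sys.executable, "-c", "import sys; print(sys.argv[-1])"]
hostile = "hello; echo pwned $(whoami)"
print(run_agent(ECHO_AGENT, hostile))
```

The agent receives the whole string, semicolon and `$(...)` included, as a single argument. With a templated shell string the same input would have to be quoted correctly every time to stay inert.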
0.1.0 — 2026-04-07
Initial harness release.
- Each scenario runs in its own subprocess with `KENSA_TRACE_DIR` set. Add `from kensa import instrument; instrument()` to your agent; that’s the only code change.
- Auto-instruments Anthropic, OpenAI, and LangChain SDKs via OpenTelemetry. Writes tool calls, token counts, and latency as JSONL spans.
- Deterministic checks: `output_contains`, `output_not_contains`, `tools_called`, `tools_not_called`, `tool_order`, `cost_threshold`. Checks run before the judge; if a check fails, no tokens are spent.
- LLM judge with Anthropic and OpenAI providers. Auto-resolves from whichever API key is set.
- Reports in four formats: terminal, markdown, JSON, HTML.
- `kensa analyze` computes multi-run variance, flags flaky scenarios, and reports cost/latency anomalies.
- Dataset mode: point `dataset` at a JSONL file, each row becomes a run.
- `kensa doctor` validates environment, dependencies, API keys, and scenario files.
- Five Claude Code skills: `audit-evals`, `generate-scenarios`, `generate-judges`, `validate-judge`, `diagnose-errors`.
- Five example agents: code reviewer, customer support, incident triage, SDR qualifier, SQL analyst.
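
The "checks run before the judge" ordering is a short-circuit: cheap deterministic checks gate the expensive LLM call. A minimal sketch of that control flow, with a hypothetical `evaluate` function and made-up status strings, might look like this:

```python
from typing import Callable

Check = Callable[[str], bool]


def evaluate(output: str, checks: list[Check], judge: Callable[[str], bool]) -> str:
    """Run deterministic checks first; call the judge only if all of them pass."""
    for check in checks:
        if not check(output):
            # The judge is never invoked here, so no judge tokens are spent.
            return "failed_check"
    return "passed" if judge(output) else "failed_judge"
```

Any callable can stand in for the judge when trying this out, which also makes it easy to confirm that the judge is skipped whenever a check fails.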