0.8.0 — 2026-05-05

Kensa now runs as a pytest plugin, so you can write evals as ordinary tests and run them in your existing test suite.
  • New @pytest.mark.kensa(...) marker binds a pytest test to a kensa scenario, with cases= for parametrized inputs and trials= for repeated runs against the same case. Recorded outputs are captured and fed through the same checks and judge pipeline as YAML scenarios.
  • New kensa eval --pytest <path> invokes the bundled pytest plugin and writes a normal kensa run manifest, so reports, judge, and analyze work unchanged.
  • New kensa init --pytest scaffolds a starter pytest-native eval under tests/evals/.
  • Scenario schema gained cases and trials as the canonical keys; dataset and input_field remain as legacy aliases.
  • Fixed cost backfill missing models whose slug ends in a single-digit version segment (e.g. claude-sonnet-4), which previously fell through the slug normalizer.
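
As a rough illustration of how cases and trials combine, the sketch below expands a list of cases into one planned run per case per trial. The helper name expand_runs and the shape of each case are hypothetical and are not taken from kensa's API; the plugin's real scheduling may differ.

```python
from itertools import product


def expand_runs(cases, trials=1):
    """Return one (case_index, trial_index, case) tuple per planned run.

    Hypothetical helper: it only illustrates that N cases with T trials
    yield N * T runs.
    """
    if trials < 1:
        raise ValueError("trials must be >= 1")
    return [
        (ci, ti, cases[ci])
        for ci, ti in product(range(len(cases)), range(trials))
    ]


cases = [{"input": "refund request"}, {"input": "password reset"}]
runs = expand_runs(cases, trials=3)
print(len(runs))  # 2 cases x 3 trials = 6 planned runs
```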

0.7.0 — 2026-05-01

kensa capture records one real agent invocation, and kensa generate synthesizes scenarios straight from the capture, so you can bootstrap evals without writing any scenarios by hand.
  • New kensa capture -- <cmd> command runs your agent once with full instrumentation and writes a capture-kind run manifest plus a JSONL trace under .kensa/. Pass -i/--input to mirror scenario.input. kensa run rejects capture-kind manifests, so captures and evals stay separate.
  • kensa generate is now capture-aware: source priority is --trace / --run-id → latest capture manifest → latest run manifest. Generated scenarios inherit the observed run_command from the manifest verbatim.
  • kensa init now prompts for a coding agent (-a/--agent claude|codex|cursor|opencode|gemini|other|all|none) and installs the bundled skills into the right directory for that agent. Use --no-cli to skip the uv add --dev kensa step.
  • Internal: a new RunKind discriminator on the run manifest distinguishes eval runs from capture runs end to end, so the MCP server, report, and judge surfaces only operate on eval runs.
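
The source priority described for kensa generate can be pictured as a first-match lookup. This is a minimal sketch under stated assumptions: pick_source and its parameters are hypothetical names, and an explicitly passed trace file or run ID is treated as a single "explicit" tier.

```python
from typing import Optional, Tuple


def pick_source(
    explicit: Optional[str],
    latest_capture: Optional[str],
    latest_run: Optional[str],
) -> Tuple[str, str]:
    """Return (kind, value) for the first available source, in priority order."""
    candidates = [
        ("explicit", explicit),       # a trace file or run ID given on the command line
        ("capture", latest_capture),  # most recent capture manifest
        ("run", latest_run),          # most recent eval run manifest
    ]
    for kind, value in candidates:
        if value is not None:
            return kind, value
    raise LookupError("no explicit source, capture manifest, or run manifest found")


# With no explicit argument, the capture wins over the older eval run.
chosen = pick_source(None, "capture-0007", "run-0042")
print(chosen)
```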

0.6.2 — 2026-04-27

uvx kensa init is now a one-shot bootstrap: scaffold scenarios, add kensa as a dev dep, and drop the Claude Code skills into the project.
  • New kensa skills install command copies the bundled skills into .claude/skills/ (Claude Code) and .agents/skills/ (Codex, OpenCode, Cursor, and other adopters of the Agent Skills standard). Use --global to install into ~, --claude / --codex to scope to one target, and --force to overwrite existing files.
  • kensa init gained --cli / --skills flags (and their negations). In an interactive terminal, each step prompts before mutating state. In CI, both default to skip unless passed explicitly.
  • When kensa init adds kensa via uv add --dev, it now points at uv run kensa doctor if the active interpreter is outside the project venv, so doctor checks reflect the right environment.

0.6.1 — 2026-04-24

Release tooling fix.
  • Fixed uv.lock drifting from the bumped package version on release. The release script now refreshes the lockfile, and a packaging test guards against future drift in the kensa-mcp shim’s pin.

0.6.0 — 2026-04-24

kensa generate synthesizes new scenarios from real traces, so coverage grows with usage.
  • New kensa generate command replays the latest run (or a specific --run-id / --trace file) through an LLM and writes fresh scenario YAML to .kensa/scenarios/. Use -n to set the count (1–20), --dry-run to preview, --model to override the LLM, and --force to overwrite existing files.
  • Fixed the generator shipping invalid scenarios: every synthesized scenario is now validated against the runtime schema and -n is enforced.
  • Fixed the generator silently returning fewer scenarios than requested: underproduction now surfaces as a warning.
  • Fixed OpenAI judge verdicts truncating mid-response on reasoning models by switching to max_completion_tokens.
  • Fixed the MCP scenarios resource URI returning a 404 for clients that followed the documented path.

0.5.2 — 2026-04-18

Cost backfill now recognizes the full range of model slugs SDKs report.
  • Pricing lookups normalize SDK-reported model IDs against OpenRouter’s canonical dotted slugs, handling provider prefixes, dashed variants, and dated suffixes. Size segments like 70b, 24b, and 405b are left untouched.
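
The kind of normalization described here might look like the sketch below. The regular expressions and the function name normalize_slug are illustrative assumptions, not kensa's actual normalizer; in particular, this version drops the provider prefix entirely, whereas a real lookup may map it to a canonical provider instead.

```python
import re

# Trailing date stamps such as -20241022 or -2024-08-06.
DATE_SUFFIX = re.compile(r"-(?:\d{8}|\d{4}-\d{2}-\d{2})$")
# Two short numeric segments joined by a dash (3-5 becomes 3.5). The lookahead
# keeps size segments such as 70b from being treated as a version part.
DASHED_VERSION = re.compile(r"(?<=-)(\d{1,2})-(\d{1,2})(?=-|$)")


def normalize_slug(model_id: str) -> str:
    """Normalize an SDK-reported model ID to a dotted slug (illustrative only)."""
    slug = model_id.strip().lower()
    slug = slug.rsplit("/", 1)[-1]             # drop a provider prefix
    slug = DATE_SUFFIX.sub("", slug)           # drop a dated suffix
    slug = DASHED_VERSION.sub(r"\1.\2", slug)  # dashed variant to dotted
    return slug


print(normalize_slug("anthropic/claude-3-5-sonnet-20241022"))
print(normalize_slug("llama-3.1-70b-instruct"))
```

A single-digit version segment such as the 4 in claude-sonnet-4 has no second numeric part to join, so this sketch leaves it unchanged.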

0.5.1 — 2026-04-17

Instrumentation is zero-config: agents run without any code changes.
  • The runner injects a bootstrap sitecustomize.py via PYTHONPATH, so OpenTelemetry and SDK auto-instrumentation are set up before agent code runs. No more from kensa import instrument; instrument() boilerplate in scenario files.
  • instrument() stays exported as an idempotent escape hatch for environments where sitecustomize can’t run (e.g. python -S). Existing agents that still call it keep working, with no duplicate spans.
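
An idempotent setup function of this kind is commonly built around a module-level guard. The sketch below shows the general pattern only; the counter stands in for real tracer setup, and none of it reflects kensa's internal implementation.

```python
_instrumented = False
setup_calls = 0  # stand-in for the real exporter and SDK patching work


def instrument() -> bool:
    """Set up tracing once; later calls are no-ops.

    Returns True if this call performed the setup, False otherwise.
    """
    global _instrumented, setup_calls
    if _instrumented:
        return False
    setup_calls += 1
    _instrumented = True
    return True


first = instrument()   # e.g. triggered by the bootstrap at interpreter start
second = instrument()  # e.g. a leftover explicit call in agent code
print(first, second, setup_calls)
```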

0.5.0 — 2026-04-15

Kensa now runs as an MCP server, so any MCP-aware client can drive the full eval workflow as tools.
  • New kensa mcp subcommand serves the harness over the Model Context Protocol, exposing init, doctor, run, judge, eval, report, and analyze as tools, plus eight kensa:// resources. It uses stdio by default; pass --http --port for HTTP transport.
  • Separate kensa-mcp PyPI shim lets you run uvx kensa-mcp without installing kensa first. The shim pins to the matching kensa[mcp] version and prints a clean install hint if the mcp extra is missing.
  • MCP errors come back as a stable MCPError(error, code, hint) envelope instead of raising across the protocol boundary, and doctor now distinguishes scenario-not-found from invalid-run-id.
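
One way to picture the error envelope is a small dataclass returned as data instead of a raised exception. Only the three field names come from the entry above; the field types, the safe_call wrapper, the code strings, and the hint text are assumptions made for illustration.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class MCPError:
    error: str
    code: str
    hint: Optional[str] = None


def load_scenario(name: str) -> dict:
    """Hypothetical tool body that fails for an unknown scenario."""
    raise KeyError(name)


def safe_call(fn, *args) -> dict:
    """Run a tool and convert known failures into an envelope dict."""
    try:
        return {"result": fn(*args)}
    except KeyError as exc:
        return asdict(MCPError(
            error=f"scenario {exc.args[0]!r} not found",
            code="scenario-not-found",
            hint="check the scenario name",
        ))
    except ValueError as exc:
        return asdict(MCPError(error=str(exc), code="invalid-run-id"))


response = safe_call(load_scenario, "missing")
print(response["code"])
```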

0.4.0 — 2026-04-12

Trajectory checks let you assert that an agent followed the right tool-call path.
  • New trajectory check type validates tool-call sequences against expected patterns. It supports strict ordering and any-order matching, with optional accuracy thresholds and inline budgets.
  • Aggregate reports now include estimated k-run pass rates per scenario, so you can spot flaky evals without guessing.
  • Fixed trajectory placeholder validation rejecting valid unordered sequences.
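
To make the difference between ordered and any-order matching concrete, here is a minimal sketch. It scores ordered matches with a longest-common-subsequence count and any-order matches with a multiset intersection, then compares the score to a threshold. Both scoring rules are assumptions chosen for illustration and may not match how kensa computes trajectory accuracy.

```python
from collections import Counter


def lcs_length(a, b):
    """Length of the longest common subsequence of two lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def trajectory_accuracy(actual, expected, ordered=True):
    """Fraction of expected tool calls matched by the actual calls."""
    if not expected:
        return 1.0
    if ordered:
        matched = lcs_length(actual, expected)
    else:
        matched = sum((Counter(actual) & Counter(expected)).values())
    return matched / len(expected)


expected = ["search", "fetch", "summarize"]
actual = ["fetch", "search", "summarize"]
threshold = 1.0

ordered_ok = trajectory_accuracy(actual, expected, ordered=True) >= threshold
any_order_ok = trajectory_accuracy(actual, expected, ordered=False) >= threshold
print(ordered_ok, any_order_ok)
```

In this example the agent called the right tools in the wrong order, so the ordered check falls below the threshold while the any-order check meets it.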

0.3.0 — 2026-04-10

Tool checks now accept lists, so you can assert multiple tools in one check.
  • Breaking: tool_called and tool_not_called are now tools_called and tools_not_called. They take a list of tool names with set-membership semantics (order-free). Use tool_order when sequence matters.
  • Validation errors now tell you which item in a scenario is invalid, not just that something is wrong.
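
The set-membership semantics can be sketched in a few lines. The functions below borrow the check names for readability but are standalone illustrations, not kensa's implementation; treating tool_order as an in-order subsequence test is an assumption.

```python
def tools_called(trace_tools, required):
    """True if every required tool appears somewhere in the trace (order-free)."""
    return set(required) <= set(trace_tools)


def tools_not_called(trace_tools, forbidden):
    """True if none of the forbidden tools appear in the trace."""
    return set(forbidden).isdisjoint(trace_tools)


def tool_order(trace_tools, sequence):
    """True if the sequence appears in order, allowing other calls in between."""
    remaining = iter(trace_tools)
    return all(tool in remaining for tool in sequence)


calls = ["lookup_order", "issue_refund", "send_email"]
print(tools_called(calls, ["send_email", "lookup_order"]))  # order does not matter
print(tool_order(calls, ["send_email", "lookup_order"]))    # order matters
```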

0.2.0 — 2026-04-08

Run commands are safer — no more shell interpolation of inputs.
  • Breaking: run_command now takes an argv list instead of a shell string with {{input}} templates. Input is appended as the final argument. This removes the command-injection surface from the old shlex.quote approach.
  • Omitted input fields and explicit empty strings are now handled as distinct cases.
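
The sketch below shows why an argv list avoids the injection problem: the input travels as a single argument and no shell ever parses it. The helper build_argv is hypothetical, and the choice to append nothing when the input is omitted is an assumption about how the two cases differ.

```python
import subprocess
import sys


def build_argv(run_command, user_input=None):
    """Append the input as the final argument; an omitted input adds nothing."""
    argv = list(run_command)
    if user_input is not None:
        argv.append(user_input)  # an explicit "" is still passed through
    return argv


# A stand-in agent that echoes its first argument.
run_command = [sys.executable, "-c", "import sys; print(sys.argv[1])"]
hostile = "hello; echo INJECTED"

result = subprocess.run(
    build_argv(run_command, hostile),
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout.strip())  # the whole string arrives as one argument
```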

0.1.0 — 2026-04-07

Initial harness release.
  • Each scenario runs in its own subprocess with KENSA_TRACE_DIR set. Add from kensa import instrument; instrument() to your agent — that’s the only code change.
  • Auto-instruments Anthropic, OpenAI, and LangChain SDKs via OpenTelemetry. Writes tool calls, token counts, and latency as JSONL spans.
  • Deterministic checks: output_contains, output_not_contains, tools_called, tools_not_called, tool_order, cost_threshold. Checks run before the judge — if a check fails, no tokens are spent.
  • LLM judge with Anthropic and OpenAI providers. Auto-resolves from whichever API key is set.
  • Reports in four formats: terminal, markdown, JSON, HTML.
  • kensa analyze computes multi-run variance, flags flaky scenarios, and reports cost/latency anomalies.
  • Dataset mode: point dataset at a JSONL file, and each row becomes a run.
  • kensa doctor validates environment, dependencies, API keys, and scenario files.
  • Five Claude Code skills: audit-evals, generate-scenarios, generate-judges, validate-judge, diagnose-errors.
  • Five example agents: code reviewer, customer support, incident triage, SDR qualifier, SQL analyst.
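
The "checks run before the judge" behavior noted in the list above amounts to a short-circuit. The sketch below is an assumption-laden illustration: evaluate, fake_judge, and the result dictionary are hypothetical, and the lambdas only mimic what output_contains and output_not_contains might test.

```python
judge_calls = 0


def fake_judge(output: str) -> bool:
    """Stand-in for an LLM judge; counts how often it is invoked."""
    global judge_calls
    judge_calls += 1
    return True


def evaluate(output, checks, judge):
    """Run deterministic checks first; call the judge only if all of them pass."""
    failed = [name for name, check in checks if not check(output)]
    if failed:
        return {"passed": False, "failed_checks": failed, "judged": False}
    return {"passed": bool(judge(output)), "failed_checks": [], "judged": True}


checks = [
    ("output_contains", lambda out: "refund" in out),
    ("output_not_contains", lambda out: "password" not in out),
]

bad = evaluate("Your password is hunter2", checks, fake_judge)
good = evaluate("Your refund was issued", checks, fake_judge)
print(bad["judged"], good["judged"], judge_calls)
```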
Last modified on May 5, 2026