Introduction
Kensa is the open-source harness for evaluating your agents.
Most eval frameworks ask you to write harnesses, define schemas, and wire up tracing before you test a single scenario. Kensa handles all that for you.
It's an opinionated CLI tool and skill that turns a coding agent like Claude Code into an AI engineer. Just ask it to eval your agent codebase.
Where to start
| If you want to | Go to |
|---|---|
| Get running in under a minute | Quickstart |
| Understand the mental model | Concepts |
| See the full eval workflow | Skills |
| Look up a CLI command | CLI Reference |
Philosophy
Your coding agent reasons: it reads your codebase, identifies failure modes from past traces, and writes scenarios. The CLI computes: it instruments, executes, judges, and reports. Skills orchestrate the workflow between them.
Kensa is both a CLI and a Python package that sets up tracing for your agents via OTel.
What you get
Say "evaluate my agent" (triggers the /audit-evals skill) and kensa meets you where you are:
| You have | kensa does |
|---|---|
| Nothing (cold-start) | Reads your agent codebase, generates baseline scenarios |
| Existing traces | Surfaces failure patterns from previous runs, generates targeted scenarios |
| Both | Code understanding + real failure data = highest-quality scenarios |
It gets smarter each run
Feed traces from previous runs back in, and kensa generates scenarios targeting real failure modes instead of educated guesses.
Run 1 (cold-start): code → baseline scenarios → traces (1)
Run 2 (with traces): code + traces (1) → better scenarios → traces (2)
Run 3: code + traces (1,2) → even better scenarios
Data flow
.kensa/scenarios/*.yaml → load scenarios → subprocess execution
→ OTel spans captured via KENSA_TRACE_DIR → JSONL trace files
→ deterministic checks → LLM judge (if criteria set)
→ Result objects → terminal / markdown / JSON / HTML report
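A scenario file at the head of this pipeline might look something like the sketch below. The field names are illustrative, not kensa's actual schema; only the check types (`tool_called`, `tool_order`, `max_cost`, `max_turns`) and the judge criteria come from this document.

```yaml
# .kensa/scenarios/refund_flow.yaml -- illustrative shape, not the real schema
name: refund_flow
prompt: "A customer asks to refund their latest order."
checks:
  tool_called: lookup_order                 # deterministic, free
  tool_order: [lookup_order, issue_refund]
  max_cost: 0.50                            # USD
  max_turns: 8
judge:
  criteria: "The agent confirms the order before issuing the refund."
```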
Each scenario runs in its own subprocess with KENSA_TRACE_DIR set. The agent's entry point calls instrument() which configures OpenTelemetry, writes spans as JSONL, and auto-instruments any detected SDK. The runner reads spans post-execution and translates them to kensa's internal format.
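To make that concrete, here is a stdlib-only sketch of the behavior `instrument()` provides on the agent side: resolve `KENSA_TRACE_DIR` from the environment and append one JSON object per span to a JSONL file for the runner to read back. This is a conceptual stand-in, not kensa's actual implementation (the real `instrument()` configures OpenTelemetry and auto-instruments detected SDKs).

```python
import json
import os
import time
import uuid
from pathlib import Path

def instrument_sketch():
    """Illustrative stand-in for kensa's instrument(): resolve the trace
    directory from KENSA_TRACE_DIR and return a span-recording function."""
    trace_dir = Path(os.environ["KENSA_TRACE_DIR"])
    trace_dir.mkdir(parents=True, exist_ok=True)
    out = trace_dir / f"{os.getpid()}.jsonl"

    def record_span(name, attributes=None):
        # One JSON object per line: the JSONL format the runner parses
        # post-execution and translates into kensa's internal format.
        span = {
            "span_id": uuid.uuid4().hex[:16],
            "name": name,
            "timestamp": time.time(),
            "attributes": attributes or {},
        }
        with out.open("a") as f:
            f.write(json.dumps(span) + "\n")

    return record_span
```

Because each scenario gets its own subprocess and its own `KENSA_TRACE_DIR`, trace files never interleave across scenarios.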
Checks are deterministic, cheap, and fast. They gate the expensive LLM judge call. If a check fails, the scenario fails immediately without spending tokens. A scenario passes only when all checks pass AND the judge passes.
scenario
├─ checks (deterministic, free)
│ ├─ tool_called ✓
│ ├─ tool_order ✓
│ ├─ max_cost ✓
│ └─ max_turns ✗ → FAIL (judge skipped)
│
└─ judge (LLM call, costs tokens)
└─ only runs if all checks pass
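The gating order above can be sketched in a few lines. This is a minimal illustration of the control flow, not kensa's actual runner code; `checks` and `judge` are hypothetical callables standing in for the real check implementations and the LLM judge call.

```python
def evaluate_scenario(checks, judge=None):
    """Sketch of check-gated judging: deterministic checks run first,
    and the token-costing judge runs only if every check passes.

    checks: dict mapping check name -> zero-argument callable -> bool
    judge:  optional zero-argument callable standing in for the LLM call
    """
    for name, check in checks.items():
        if not check():
            # Fail fast: no tokens spent, and we know which check failed.
            return {"passed": False, "failed_check": name, "judge_ran": False}
    if judge is None:
        return {"passed": True, "failed_check": None, "judge_ran": False}
    verdict = judge()  # expensive LLM call, reached only when all checks pass
    return {"passed": verdict, "failed_check": None, "judge_ran": True}
```

In the tree above, `max_turns` failing means the judge is never invoked, so the scenario costs nothing beyond running the agent itself.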
Compatible coding agents
Kensa works with any coding agent that can run shell commands and use skills.
License
MIT. The only ongoing cost is LLM API calls for judge criteria, and those are optional.