Scenarios are YAML files in .kensa/scenarios/. Your coding agent generates these, but you can write them by hand.

Full example

id: classify_ticket
name: Support ticket triage
description: Classify a support ticket by severity.
source: user  # code | traces | user

input: "Our entire team can't log in. SSO has returned 502 since 7am."
run_command: [python, agent.py]

expected_outcome: Agent returns the correct priority label.

checks:
  - type: trajectory
    params:
      steps:
        - tool: classify_ticket
      max_steps: 1
      max_tokens: 2000
    description: One classifier tool call, no extra wandering.
  - type: output_matches
    params: { pattern: "^P[123]$" }
    description: Output must be exactly P1, P2, or P3.

criteria: |
  P1 is for outages or data loss affecting multiple users.
  The agent must classify based on business impact, not tone.
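
The output_matches pattern in the example is anchored at both ends, so only a bare label passes. The snippet below is a minimal sketch of how that pattern behaves under Python's re module. It assumes the check applies the pattern with re.search semantics, which this page does not state, and the helper name output_matches is a hypothetical stand-in rather than kensa's own function.

```python
import re

PATTERN = r"^P[123]$"  # pattern copied from the scenario above


def output_matches(output: str, pattern: str = PATTERN) -> bool:
    """Hypothetical stand-in for the output_matches check (assumes re.search)."""
    return re.search(pattern, output) is not None


# Print each candidate output next to whether the pattern accepts it.
for candidate in ["P1", "P3", "P4", "Priority: P1", "P1\n"]:
    print(repr(candidate), output_matches(candidate))
```

Under this assumption, "P1\n" still matches because `$` in Python also matches just before a trailing newline, while "Priority: P1" and "P4" are rejected.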

Fields

| Field | Required | Description |
| --- | --- | --- |
| id | Yes | Unique identifier |
| name | No | Human-readable name. Defaults to id. |
| description | No | What this scenario tests |
| source | No | How it was generated: code, traces, or user |
| input | No | Literal input, or the JSONL field selector when cases is set. |
| cases | No | Path to a JSONL file for parameterized cases. Resolves relative to the scenario YAML. |
| trials | No | Number of repeated executions per case. Defaults to 1 (smoke). Values above 1 are measured runs. |
| run_command | Command mode | Argv list passed to subprocess.run (no shell). When literal input is set, it is appended as the final argv element. |
| env_overrides | No | Extra environment variables for this scenario’s subprocess |
| dataset | No | Legacy alias for cases |
| input_field | No | Legacy alias for input when cases/dataset is set |
| expected_outcome | No | Natural-language description of success |
| checks | No | List of deterministic checks |
| criteria | No | Natural-language criteria for the LLM judge (mutually exclusive with judge) |
| judge | No | Reference to a judge spec in .kensa/judges/ (mutually exclusive with criteria) |
| trace_refs | No | Paths to previous trace files for context |
| failure_pattern | No | Known failure pattern this scenario targets |
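
The run_command and env_overrides rows describe an argv list handed to subprocess.run without a shell, with the literal input appended as the last element. The sketch below illustrates that behaviour under stated assumptions: run_scenario is a hypothetical helper, not kensa's runner, the TICKET_REGION variable is made up for the example, and an inline Python one-liner stands in for agent.py.

```python
import os
import subprocess
import sys


def run_scenario(run_command, literal_input=None, env_overrides=None):
    """Hypothetical runner: argv list, no shell, input appended as the last element."""
    argv = list(run_command)
    if literal_input is not None:
        argv.append(literal_input)
    # Overrides are layered on top of the parent environment (an assumption).
    env = {**os.environ, **(env_overrides or {})}
    result = subprocess.run(argv, capture_output=True, text=True, env=env)
    return result.stdout.strip()


# Stand-in for agent.py: echo the last argv element and one environment variable.
agent = [
    sys.executable,
    "-c",
    "import os, sys; print(sys.argv[-1] + '|' + os.environ.get('TICKET_REGION', ''))",
]

print(run_scenario(agent, "SSO has returned 502 since 7am.", {"TICKET_REGION": "eu"}))
```

Because no shell is involved, the input arrives as a single argument even when it contains spaces, quotes, or shell metacharacters, so the agent can read it from the end of sys.argv.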

Checks vs criteria

Checks are deterministic and free. Use them for objective, binary conditions:
  • Was a specific tool called?
  • Did the agent follow the expected tool trajectory?
  • Did the agent stay under budget?
  • Did it complete in fewer than N turns?

Criteria are evaluated by the LLM judge and cost tokens. Use them for subjective or nuanced conditions:
  • Did the agent confirm before taking action?
  • Was the response professional in tone?
  • Did the agent avoid hallucinating details?

Checks run first. If any check fails, criteria are skipped (fail-fast).
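
The fail-fast ordering can be sketched as follows. This is an illustration only, not kensa's evaluator: evaluate, the check callables, and fake_judge are hypothetical names. The sketch also assumes that every check runs before the decision to skip the judge, whereas the page only states that criteria are skipped when any check fails.

```python
from typing import Callable, Optional


def evaluate(
    output: str,
    checks: list[Callable[[str], bool]],
    judge: Optional[Callable[[str], bool]] = None,
) -> dict:
    """Run deterministic checks first; call the judge only if all of them pass."""
    failed = [i for i, check in enumerate(checks) if not check(output)]
    if failed:
        # Fail-fast: the (token-costing) judge is never invoked.
        return {"passed": False, "failed_checks": failed, "judge_called": False}
    if judge is None:
        return {"passed": True, "failed_checks": [], "judge_called": False}
    return {"passed": judge(output), "failed_checks": [], "judge_called": True}


judge_calls = []


def fake_judge(output: str) -> bool:
    """Stand-in for the LLM judge; records each call so skips are visible."""
    judge_calls.append(output)
    return True


checks = [lambda out: out in {"P1", "P2", "P3"}, lambda out: len(out) <= 2]

print(evaluate("P1", checks, fake_judge))      # judge runs
print(evaluate("urgent", checks, fake_judge))  # judge skipped
print(len(judge_calls))
```

The practical consequence is cost: a scenario whose checks fail spends no judge tokens, so cheap deterministic checks are worth placing in front of any criteria.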

Case-driven scenarios

Point a scenario at a JSONL file in which each row becomes a separate case. Set cases to the file path and input to the name of the field that holds the input:

id: booking_variations
name: Booking across routes
cases: data/routes.jsonl
input: query
run_command: [python, agent.py]

checks:
  - type: tools_called
    params: { tools: [search_flights] }
  - type: max_turns
    params: { max: 5 }

criteria: |
  The agent must confirm with the user before booking.
  The final answer must include a confirmation number.

The selected input field becomes the scenario input. Other fields can be referenced in check params via {{...}} placeholders.

dataset and input_field still load for older scenarios, but new scenarios should use cases and input.

For pytest-native evals, case rows often hold partial conversations:

{"id":"draft_no_send","messages":[{"role":"user","content":"Draft it, but do not send it."}]}

The pytest driver can then pass case.messages into the real application and record the result with case.output(...).
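
To make the row-to-case expansion concrete, here is a minimal sketch of reading JSONL rows, selecting the input field, and filling placeholders in check params. Everything in it is an assumption for illustration: load_cases and fill are hypothetical helpers, the rows and the expected_tool field are invented, and the exact placeholder grammar is not specified on this page, so the sketch assumes the form {{field_name}}.

```python
import json
import re

# Assumed placeholder form: {{field_name}}, with optional inner whitespace.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def load_cases(jsonl_text: str, input_field: str):
    """Yield (scenario_input, row) pairs, one per non-empty JSONL line."""
    for line in jsonl_text.splitlines():
        if line.strip():
            row = json.loads(line)
            yield row[input_field], row


def fill(value, row):
    """Recursively replace placeholders in strings, lists, and dicts."""
    if isinstance(value, str):
        return PLACEHOLDER.sub(lambda m: str(row[m.group(1)]), value)
    if isinstance(value, list):
        return [fill(v, row) for v in value]
    if isinstance(value, dict):
        return {k: fill(v, row) for k, v in value.items()}
    return value


# Hypothetical rows, standing in for a cases file.
ROWS = "\n".join([
    '{"query": "SFO to JFK on Friday", "expected_tool": "search_flights"}',
    '{"query": "LHR to CDG tomorrow", "expected_tool": "search_flights"}',
])
check_params = {"tools": ["{{expected_tool}}"]}

for scenario_input, row in load_cases(ROWS, "query"):
    print(scenario_input, fill(check_params, row))
```

Each row yields one case: the query field becomes the input, and the remaining fields stay available for substitution into that case's check params.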

Trajectory checks

Use trajectory when the correctness of the whole tool-call sequence matters more than any single tool event:

checks:
  - type: trajectory
    params:
      steps:
        - tool: search_docs
        - tool: answer_user
      ordering: exact      # or: any_order
      args: ignore         # or: exact
      min_accuracy: 0.8
      max_steps: 3
      max_tokens: 2500
      max_duration_seconds: 15

This check emits trajectory_accuracy and step_efficiency metrics in reports. In V1, each scenario can define at most one trajectory check.
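
This page names the two metrics but does not define them, so the sketch below uses one plausible reading purely for illustration: accuracy as the fraction of expected steps matched by the actual tool calls, and efficiency as expected step count over actual step count. Both formulas, and the function names themselves, are assumptions rather than kensa's definitions. The sketch compares tool names only, which corresponds to args: ignore.

```python
from collections import Counter


def trajectory_accuracy(expected, actual, ordering="exact"):
    """Fraction of expected tool steps matched by actual calls (assumed definition)."""
    if not expected:
        return 1.0
    if ordering == "any_order":
        # Multiset overlap: order is ignored, repeats are counted.
        matched = sum((Counter(expected) & Counter(actual)).values())
    else:
        # Position-by-position comparison.
        matched = sum(e == a for e, a in zip(expected, actual))
    return matched / len(expected)


def step_efficiency(expected, actual):
    """Expected step count over actual step count, capped at 1.0 (assumed definition)."""
    if not actual:
        return 0.0
    return min(1.0, len(expected) / len(actual))


expected = ["search_docs", "answer_user"]
actual = ["answer_user", "search_docs", "search_docs"]

print(trajectory_accuracy(expected, actual, "exact"))
print(trajectory_accuracy(expected, actual, "any_order"))
print(round(step_efficiency(expected, actual), 3))
```

With these inputs, exact ordering matches no positions while any_order matches both expected tools, which shows how strongly the ordering parameter can move the score relative to min_accuracy.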
Last modified on May 4, 2026