Eval Spec Format

Complete schema walkthrough for <skill-name>.eval.json: discovery rules, every supported field, input-file staging, output-file capture, and the format validation DSL. Read this when you're authoring or debugging an eval spec; it's the full reference the shorter README teaser points at.

Returning from the root README. This doc is the full reference; the README has a summary with code examples.

Place <skill-name>.eval.json alongside your .claude/commands/<skill-name>.md:

.claude/commands/
├── find-kid-activities.md
├── find-kid-activities.eval.json    ← clauditor auto-discovers this
├── find-restaurants.md
└── find-restaurants.eval.json
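
The sibling-file convention can be sketched as a small lookup. This is an illustration only: `discover_eval_specs` is a hypothetical helper, not clauditor's actual discovery code, and it assumes discovery is a plain "same directory, same stem" match.

```python
import tempfile
from pathlib import Path


def discover_eval_specs(commands_dir: Path) -> dict[str, Path]:
    """Map each skill name to its sibling <skill-name>.eval.json, if one exists."""
    found = {}
    for skill_md in sorted(commands_dir.glob("*.md")):
        spec = skill_md.with_name(f"{skill_md.stem}.eval.json")
        if spec.is_file():
            found[skill_md.stem] = spec
    return found


# Demo against a throwaway directory that mirrors the layout above,
# except that find-restaurants has no eval spec yet.
with tempfile.TemporaryDirectory() as tmp:
    commands = Path(tmp) / ".claude" / "commands"
    commands.mkdir(parents=True)
    for name in ("find-kid-activities.md", "find-kid-activities.eval.json", "find-restaurants.md"):
        (commands / name).write_text("")
    discovered = sorted(discover_eval_specs(commands))

print(discovered)
```

Only skills that have a spec next to them show up in the result, which is why a missing or misnamed `.eval.json` is the first thing to check when a skill is not picked up.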

File-based output: Many skills save results to files instead of printing to stdout. Use output_file for skills that write to one known path (e.g., research/results.md). Use output_files with glob patterns for skills that produce multiple files (e.g., ["research/*.md"]). If both are set, output_file takes precedence. When set, clauditor reads the file(s) after running the skill instead of capturing stdout.
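
A minimal sketch of that precedence, assuming the spec is already loaded as a plain dict. `read_skill_output` and the `"<stdout>"` key are hypothetical names chosen for the illustration; only the ordering (output_file, then output_files, then stdout) comes from the paragraph above.

```python
import tempfile
from pathlib import Path


def read_skill_output(spec: dict, workdir: Path, stdout: str) -> dict[str, str]:
    """Pick the output source for one run: output_file, then output_files, then stdout."""
    if spec.get("output_file"):
        # A single known path wins even when output_files is also set.
        return {spec["output_file"]: (workdir / spec["output_file"]).read_text()}
    if spec.get("output_files"):
        captured = {}
        for pattern in spec["output_files"]:
            # Globs are evaluated relative to the skill's working directory.
            for path in sorted(workdir.glob(pattern)):
                captured[path.relative_to(workdir).as_posix()] = path.read_text()
        return captured
    return {"<stdout>": stdout}


with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    (workdir / "research").mkdir()
    (workdir / "research" / "results.md").write_text("# Results")
    (workdir / "research" / "notes.md").write_text("# Notes")

    both = read_skill_output(
        {"output_file": "research/results.md", "output_files": ["research/*.md"]},
        workdir,
        stdout="ignored",
    )
    globbed = read_skill_output({"output_files": ["research/*.md"]}, workdir, stdout="ignored")
    plain = read_skill_output({}, workdir, stdout="printed text")
```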

Input files

Some skills need sample inputs — a CSV to clean, a log file to summarize, a PDF to extract. Declare them with input_files and clauditor will stage them into each variance run's working directory before invoking the skill:

{
  "skill_name": "csv-cleaner",
  "test_args": "--strict",
  "input_files": ["fixtures/sales.csv"]
}

At grade time, fixtures/sales.csv is copied (via shutil.copy2) into .clauditor/iteration-N/csv-cleaner/run-K/inputs/sales.csv for each of the variance.n_runs runs, and the skill's subprocess is launched with that inputs/ directory as its CWD. So /csv-cleaner --strict sees sales.csv as a plain basename in its own current directory — no path wrangling required. Each run-K gets its own fresh copy, so a skill that mutates its input in one run does not affect the next.
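
The per-run isolation described above can be sketched with the standard library alone. `stage_inputs` is a hypothetical helper, and numbering runs from 0 is an assumption of this sketch; the doc only states the run-K naming and the use of shutil.copy2.

```python
import shutil
import tempfile
from pathlib import Path


def stage_inputs(spec_dir: Path, input_files: list[str], skill_root: Path, n_runs: int) -> list[Path]:
    """Copy every declared input into a fresh run-K/inputs/ directory per run."""
    staged_dirs = []
    for k in range(n_runs):  # run numbering from 0 is an assumption of this sketch
        inputs_dir = skill_root / f"run-{k}" / "inputs"
        inputs_dir.mkdir(parents=True, exist_ok=True)
        for rel in input_files:
            src = spec_dir / rel
            shutil.copy2(src, inputs_dir / src.name)  # staged under its basename
        staged_dirs.append(inputs_dir)
    return staged_dirs


with tempfile.TemporaryDirectory() as tmp:
    spec_dir = Path(tmp) / "commands"
    (spec_dir / "fixtures").mkdir(parents=True)
    (spec_dir / "fixtures" / "sales.csv").write_text("id,total\n1,10\n")

    skill_root = Path(tmp) / ".clauditor" / "iteration-1" / "csv-cleaner"
    run_dirs = stage_inputs(spec_dir, ["fixtures/sales.csv"], skill_root, n_runs=2)

    # Simulate a skill that rewrites its input during the first run.
    (run_dirs[0] / "sales.csv").write_text("mutated")
    second_run_copy = (run_dirs[1] / "sales.csv").read_text()
    staged_names = [d.parent.name for d in run_dirs]
```

Because each run directory receives its own copy, the mutation in the first run leaves the second run's sales.csv untouched.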

Rules enforced at spec-load time (EvalSpec.from_file):

  • Paths are relative to the eval spec's parent directory, not the repo root. An input_files entry of fixtures/sales.csv next to my-skill.eval.json resolves to <spec-dir>/fixtures/sales.csv. This intentionally differs from output_files, which globs relative to the skill's working directory.
  • Absolute paths are rejected. Use a relative path under the spec directory.
  • Source containment is enforced. The resolved path (including symlink targets) must live under the spec's parent directory. Escapes via .. or symlinks pointing outside the spec tree raise ValueError.
  • Missing files fail loudly. Paths are resolved with Path.resolve(strict=True) — a typo fails at load, not at grade time.
  • Destinations are flattened to basenames. input_files: ["data/sales.csv"] stages as run-K/inputs/sales.csv, not run-K/inputs/data/sales.csv. Two entries that would flatten to the same basename (e.g. a/data.csv and b/data.csv) raise ValueError at load.
  • Collision guard with output_files. Any literal output_files pattern whose basename matches an input_files basename raises ValueError at load. If your skill mutates sales.csv in place and you want to capture the result, either declare the output under a different basename / subdirectory in output_files, or read the post-run file back from the persisted iteration-N/<skill>/run-K/inputs/ directory after grading.
  • No file-size cap. Files are copied verbatim — eval specs are author-controlled, so keep fixtures reasonable.
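
The path rules in the list above can be approximated as follows. This is a sketch under stated assumptions, not the body of EvalSpec.from_file: `resolve_input_files` and `outcome` are hypothetical names, the output_files collision guard is left out, and the missing-file case simply lets the FileNotFoundError from Path.resolve(strict=True) propagate because the doc does not say which exception type clauditor surfaces there.

```python
import tempfile
from pathlib import Path


def resolve_input_files(spec_path: Path, input_files: list[str]) -> dict[str, Path]:
    """Return {staged basename: resolved source}, applying the load-time path rules."""
    spec_dir = spec_path.parent.resolve()
    staged: dict[str, Path] = {}
    for entry in input_files:
        if Path(entry).is_absolute():
            raise ValueError(f"input_files entry must be relative: {entry}")
        # strict=True raises FileNotFoundError for a missing file (typo guard).
        source = (spec_dir / entry).resolve(strict=True)
        # resolve() follows symlinks and collapses "..", so one check covers both escapes.
        if not source.is_relative_to(spec_dir):
            raise ValueError(f"input_files entry escapes the spec directory: {entry}")
        if source.name in staged:
            raise ValueError(f"input_files basename collision: {source.name}")
        staged[source.name] = source
    return staged


def outcome(spec_path: Path, entries: list[str]) -> str:
    """Demo helper: report 'ok' or the name of the exception type raised."""
    try:
        resolve_input_files(spec_path, entries)
        return "ok"
    except (ValueError, FileNotFoundError) as exc:
        return type(exc).__name__


with tempfile.TemporaryDirectory() as tmp:
    spec_dir = Path(tmp) / "specs"
    for rel in ("fixtures/sales.csv", "a/data.csv", "b/data.csv"):
        (spec_dir / rel).parent.mkdir(parents=True, exist_ok=True)
        (spec_dir / rel).write_text("x")
    (Path(tmp) / "outside.csv").write_text("x")
    spec_path = spec_dir / "my-skill.eval.json"

    results = {
        "valid": outcome(spec_path, ["fixtures/sales.csv"]),
        "typo": outcome(spec_path, ["fixtures/salse.csv"]),
        "escape": outcome(spec_path, ["../outside.csv"]),
        "collision": outcome(spec_path, ["a/data.csv", "b/data.csv"]),
    }
```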

Captured-output mode: clauditor grade --output <file> reads a pre-captured output file instead of running the skill. In that mode, staging is skipped. If a spec declares input_files and --output is passed, clauditor prints WARNING: --output bypasses the runner; input_files declaration is ignored. to stderr and continues.

Persistence: staged inputs are preserved post-finalize under .clauditor/iteration-N/<skill>/run-K/inputs/ alongside output.txt and output.jsonl, so you can inspect exactly what the skill saw (and what it did to the files) after each run.

Pytest plugin: the clauditor_spec fixture transparently stages input_files into tmp_path when a loaded spec declares them, so existing tests need zero changes.

Security / trust model: Eval specs are developer-authored and run with the repo owner's filesystem access. Clauditor resolves input_files paths under the spec's parent directory, enforces source containment, and rejects absolute paths — but the underlying assumption is that eval specs live in a repo you already trust. Do not run clauditor against eval specs from untrusted sources without reviewing them first.

A complete eval spec with all three layers:

{
  "skill_name": "find-kid-activities",
  "description": "Finds kid-friendly activities near a location",
  "test_args": "\"Cupertino, CA\" --ages 4-6 --count 5 --depth quick",
  "input_files": ["fixtures/sample-venues.csv"],

  "assertions": [
    {"id": "contains_venues", "type": "contains", "needle": "Venues"},
    {"id": "has_entries_3", "type": "has_entries", "count": 3},
    {"id": "has_urls_3", "type": "has_urls", "count": 3},
    {"id": "min_length_500", "type": "min_length", "length": 500},
    {"id": "no_error", "type": "not_contains", "needle": "Error"}
  ],

  "sections": [
    {
      "name": "Venues",
      "tiers": [
        {
          "label": "default",
          "min_entries": 3,
          "fields": [
            {"id": "venue_name", "name": "name", "required": true},
            {"id": "venue_address", "name": "address", "required": true},
            {"id": "venue_website", "name": "website", "required": true}
          ]
        }
      ]
    }
  ],

  "output_file": "research/results.md",
  "output_files": ["research/*.md", "research/*.json"],

  "grading_criteria": [
    {"id": "distance_match", "criterion": "Are all venues within the specified distance?"},
    {"id": "age_appropriate", "criterion": "Are venues appropriate for the specified age range?"},
    {"id": "cost_tier_match", "criterion": "Do cost tiers match the budget filter?"}
  ],
  "grading_model": "claude-sonnet-4-6",
  "grade_thresholds": {
    "min_pass_rate": 0.7,
    "min_mean_score": 0.5
  },

  "trigger_tests": {
    "should_trigger": [
      "Find kid activities in Cupertino",
      "What are some things to do with kids near me?"
    ],
    "should_not_trigger": [
      "What's the weather today?",
      "Help me write a Python script"
    ]
  },

  "variance": {
    "n_runs": 5,
    "min_stability": 0.8
  }
}

See examples/ for a complete working eval spec.

Assertion types and per-type keys

Each Layer 1 assertion carries a type plus the per-type semantic keys listed below (in addition to id, type, and optional name). Integer fields are native JSON ints, not strings — {"length": 500}, not {"length": "500"}. Unknown keys raise ValueError at load time with a "did you mean?" migration hint.

| Type | Required keys | Optional keys | Description |
| --- | --- | --- | --- |
| contains | needle (str) | | Output contains the needle substring |
| not_contains | needle (str) | | Output does NOT contain the needle |
| regex | pattern (str) | | Output matches the regex pattern (search, not fullmatch) |
| min_count | pattern (str), count (int) | | Regex pattern appears at least count times |
| min_length | length (int) | | Output length is at least length chars |
| max_length | length (int) | | Output length is at most length chars |
| has_urls | | count (int, default 1) | Output contains at least count URLs |
| has_entries | | count (int, default 1) | Output contains at least count numbered entries |
| urls_reachable | | count (int, default 1) | At least count URLs in output return 2xx on HEAD |
| has_format | format (str) | count (int, default 1) | Output contains at least count strings matching the format (see format registry) |

Example — one of each shape:

{
  "assertions": [
    {"id": "has_title",      "type": "contains",       "needle": "Results"},
    {"id": "no_error",       "type": "not_contains",   "needle": "Error"},
    {"id": "numbered",       "type": "regex",          "pattern": "\\*\\*\\d+\\."},
    {"id": "three_bullets",  "type": "min_count",      "pattern": "^- ", "count": 3},
    {"id": "long_enough",    "type": "min_length",     "length": 500},
    {"id": "not_too_long",   "type": "max_length",     "length": 5000},
    {"id": "has_3_urls",     "type": "has_urls",       "count": 3},
    {"id": "has_3_entries",  "type": "has_entries",    "count": 3},
    {"id": "urls_work",      "type": "urls_reachable", "count": 2},
    {"id": "two_phones",     "type": "has_format",     "format": "phone_us", "count": 2}
  ]
}
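
To make the semantics of the simpler types concrete, here is a minimal sketch of how they could be evaluated against captured output. `check_assertion` is a hypothetical function, not clauditor's evaluator. It covers only the types whose behavior the table pins down, and it assumes min_count compiles its pattern with re.MULTILINE so that `^- ` can match at the start of each line; the doc does not state which flags are used.

```python
import re


def check_assertion(assertion: dict, output: str) -> bool:
    """Evaluate one Layer 1 assertion dict against the skill output."""
    kind = assertion["type"]
    if kind == "contains":
        return assertion["needle"] in output
    if kind == "not_contains":
        return assertion["needle"] not in output
    if kind == "regex":
        # search, not fullmatch: the pattern may match anywhere in the output
        return re.search(assertion["pattern"], output) is not None
    if kind == "min_count":
        # ASSUMPTION: MULTILINE, so "^" anchors at every line start
        hits = sum(1 for _ in re.finditer(assertion["pattern"], output, flags=re.MULTILINE))
        return hits >= assertion["count"]
    if kind == "min_length":
        return len(output) >= assertion["length"]
    if kind == "max_length":
        return len(output) <= assertion["length"]
    raise ValueError(f"type not covered by this sketch: {kind}")


sample = "Results\n\n- first\n- second\n- third\n\n**1. Alpha**\n"
checks = [
    {"id": "has_title", "type": "contains", "needle": "Results"},
    {"id": "no_error", "type": "not_contains", "needle": "Error"},
    {"id": "numbered", "type": "regex", "pattern": "\\*\\*\\d+\\."},
    {"id": "three_bullets", "type": "min_count", "pattern": "^- ", "count": 3},
    {"id": "long_enough", "type": "min_length", "length": 500},
    {"id": "not_too_long", "type": "max_length", "length": 5000},
]
outcomes = {a["id"]: check_assertion(a, sample) for a in checks}
```

With this short sample every check passes except long_enough, since the text is far below 500 characters.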

Field validation with format

Each FieldRequirement accepts a single format key that validates the extracted value. format does double duty:

  1. Registered format name: a shorthand for a built-in regex in the format registry. Run python -c "from clauditor.formats import list_formats; print(list_formats())" to print the current list. There are 20 registered formats as of writing: currency_eur, currency_usd, date_iso, date_us, domain, email, hex_color, ipv4, latitude, longitude, percentage, phone_intl, phone_us, star_rating, time_12h, time_24h, url, uuid, zip_uk, zip_us.
  2. Inline regex: any string that isn't a registered name is compiled with re.compile and used as an anchored fullmatch against the value. Invalid regexes raise ValueError at spec load time.

{
  "sections": [
    {
      "name": "Restaurants",
      "tiers": [
        {
          "label": "default",
          "min_entries": 1,
          "max_entries": 3,
          "fields": [
            {"id": "r_name",    "name": "name",    "required": true},
            {"id": "r_phone",   "name": "phone",   "required": true,  "format": "phone_us"},
            {"id": "r_website", "name": "website", "required": true,  "format": "domain"},
            {"id": "r_zip",     "name": "zip",     "required": false, "format": "^\\d{5}$"}
          ]
        }
      ]
    }
  ]
}
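
The double duty of format can be sketched as a lookup with a regex fallback. Everything named here is hypothetical: `STAND_IN_REGISTRY` holds two illustrative patterns that are not the regexes shipped in clauditor.formats, and applying fullmatch to registered names as well as inline regexes is an assumption of the sketch.

```python
import re

# ASSUMPTION: stand-in patterns for illustration, not clauditor's real registry.
STAND_IN_REGISTRY = {
    "date_iso": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "zip_us": re.compile(r"\d{5}(?:-\d{4})?"),
}


def compile_format(fmt: str) -> re.Pattern:
    """Resolve a format string: registered name first, otherwise an inline regex."""
    if fmt in STAND_IN_REGISTRY:
        return STAND_IN_REGISTRY[fmt]
    try:
        return re.compile(fmt)
    except re.error as exc:
        raise ValueError(f"invalid inline regex {fmt!r}: {exc}") from exc


def value_matches(fmt: str, value: str) -> bool:
    """fullmatch means the whole value must match, not just a substring."""
    return compile_format(fmt).fullmatch(value) is not None


registered_ok = value_matches("zip_us", "95014-1234")
inline_ok = value_matches("^\\d{5}$", "95014")
inline_partial = value_matches("\\d{5}", "95014-1234")  # extra characters, so no fullmatch
```

The last line shows why an inline regex such as `\d{5}` behaves as if it were anchored: a value with trailing characters fails even though the pattern itself has no `^` or `$`.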

url vs domain: LLMs commonly extract the display text of markdown links ([paesanosj.com](https://paesanosj.com/) → paesanosj.com), which are valid domains but not URLs with a scheme. Use format: "url" only when you really need https://…; use format: "domain" to accept bare hostnames too.

max_entries: A precision signal — when set, clauditor emits a count_max assertion if extraction returns more entries than the cap. Field-level checks still run over all extracted entries so you see both the count failure and any per-entry failures.

Optional top-level fields

A few EvalSpec fields tune specific code paths and are safe to omit:

  • user_prompt (string, default null) — a natural-language query fed to the blind A/B judge (blind_compare_from_spec and the clauditor_blind_compare pytest fixture). Distinct from test_args: test_args is the CLI argument string passed to the skill subprocess, while user_prompt is the conversational framing the judge sees when comparing two skill outputs. Required only on the blind-compare code path; other commands (validate, grade, extract, triggers) ignore it.
  • allow_hang_heuristic (bool, default true): controls the interactive-hang detector in SkillRunner. The heuristic flags a run as a likely interactive hang when the skill stops after one turn with a trailing ? or an AskUserQuestion tool call. Set to false to opt a specific skill out when the heuristic consistently misclassifies its output (e.g. a skill whose correct answer ends in a rhetorical question). When disabled, a suppressed-heuristic run still lands in SkillResult but without the error_category="interactive" signal.
  • grading_model (string or null, default null post-#182): the model used for Layer 3 grading (and L2 extraction, suggest, propose-eval, triggers, blind compare). Set it per spec when you want to trade cost for fidelity. The default is null per #146 DEC-004b: when unset (omitted or explicit null), the per-provider resolver clauditor._providers.resolve_grading_model picks the standard default for the resolved grading_provider: "claude-sonnet-4-6" for anthropic, "gpt-5.4" for openai. Set explicitly to override (e.g. "claude-opus-4-1" or "gpt-5.4-mini"). to_dict() omits the key when it equals null, so pre-#146 specs round-trip with no synthetic key.
  • grading_provider (string or null, default "auto" post-#182): selects which provider's SDK handles LLM-grader calls. One of "anthropic", "openai", "auto", or null (legacy #145-vintage; treated the same as the default at the CLI seam). When the resolved value is "auto", the auto-inference layer maps claude-* → anthropic and gpt-* / o[0-9]+* → openai; when no model is available the resolver falls back to "anthropic" (subscription-first historical default). An unknown model prefix raises a load-time error advising the operator to set --grading-provider explicitly. Precedence: --grading-provider on the CLI wins, then the CLAUDITOR_GRADING_PROVIDER env var, then this field, then the "auto" default. to_dict() omits the key when it equals "auto" (or None from legacy specs), matching the transport / harness pattern, so pre-#146 specs round-trip with no synthetic key. Full reference: docs/cli-reference.md (per-command --grading-provider row).
  • grade_thresholds (object, default null) — an object with min_pass_rate and/or min_mean_score (both floats in [0.0, 1.0]) that gate clauditor grade's exit code. When set, a run whose metrics fall below either threshold exits 1 (signal failed) rather than 0.
  • variance (object, default null) — {"n_runs": int, "min_stability": float} for clauditor grade --variance. Runs the skill n_runs times, grades each, and asserts cross-run agreement.
  • trigger_tests (object, default null) — {"should_trigger": [str, ...], "should_not_trigger": [str, ...]} for clauditor triggers. Required by that command; other commands ignore it.
  • timeout (int, default null) — per-skill runner timeout in seconds. Overrides the built-in 300-second watchdog for skills that legitimately need longer (e.g. multi-agent research skills). Precedence: --timeout <seconds> on the CLI wins when passed explicitly; otherwise EvalSpec.timeout wins when set; otherwise the runner falls back to its 300-second default. Load-time validation rejects non-int values (including true/false) and values <= 0.
  • transport (string, default "auto"): selects the backend used for LLM-mediated grader calls (Layer 2 extraction, Layer 3 rubric grading, blind compare, trigger judge, suggest proposer, propose-eval). One of "api" (Anthropic SDK over HTTP), "cli" (subprocess to the local claude binary, reuses the user's cached auth), or "auto" (prefers cli when claude is on PATH, falls back to api). Precedence: --transport on the CLI wins, then the CLAUDITOR_TRANSPORT env var, then this field, then the auto default. Load-time validation rejects values outside the three-choice set. See docs/transport-architecture.md for the full backend matrix.
  • harness (string, default "auto") — selects which CLI runs the skill subprocess: "claude-code" (Claude Code via claude -p), "codex" (the OpenAI Codex CLI), or "auto" (prefers claude when on PATH, falls back to codex). This is the runtime axis and is independent of transport / grading_provider, which govern the grader call — a skill can run under Codex while the L3 grader still calls Claude Sonnet. Precedence: --harness on the CLI wins (on validate, grade, capture, run), then the CLAUDITOR_HARNESS env var, then this field, then the auto default. Load-time validation rejects values outside the three-choice set. Running under Codex requires CODEX_API_KEY or OPENAI_API_KEY in the subprocess env. See docs/codex-harness.md for the full harness reference.
  • sync_tasks (bool, default false): when true, clauditor sets CLAUDE_CODE_DISABLE_BACKGROUND_TASKS=1 in the claude -p subprocess env, forcing Task(run_in_background=true) spawns to run synchronously. Skill authors who ship with background sub-agents for parallel fanout but want clauditor to evaluate the full transcript set this to true in the spec. Precedence: --sync-tasks on the CLI wins; otherwise this field wins; otherwise the default false applies. Load-time validation rejects non-bool values. Read the fidelity caveats in docs/skill-usage.md#--sync-tasks-force-task-mode-synchronous-at-eval-time before enabling this field: evaluating sync is not equivalent to evaluating async.
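
Several fields in the list above share the same "CLI flag, then env var, then spec field, then default" chain, and two of them have load-time rules that are easy to get subtly wrong. The sketch below illustrates those three ideas with hypothetical helpers (`resolve_setting`, `validate_timeout`, `infer_provider`); it is not clauditor's resolver code, and the model-prefix mapping is read from the grading_provider description.

```python
import re


def resolve_setting(cli_value, env_value, spec_value, default):
    """Return the first value that is set: CLI flag, env var, spec field, then default."""
    for candidate in (cli_value, env_value, spec_value):
        if candidate is not None:
            return candidate
    return default


def validate_timeout(value):
    """Accept None or a positive int; reject bools, other types, and values <= 0."""
    if value is None:
        return None
    # bool is a subclass of int in Python, so it has to be excluded explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        raise ValueError(f"timeout must be an int, got {type(value).__name__}")
    if value <= 0:
        raise ValueError(f"timeout must be > 0, got {value}")
    return value


def infer_provider(model):
    """Map a model name to a provider the way the "auto" setting is described."""
    if model is None:
        return "anthropic"  # fallback when no model is available
    if model.startswith("claude-"):
        return "anthropic"
    if model.startswith("gpt-") or re.match(r"o[0-9]+", model):
        return "openai"
    raise ValueError(f"unknown model prefix {model!r}; set the provider explicitly")


# transport: the env var beats the spec field when no CLI flag is passed
transport = resolve_setting(None, "cli", "api", "auto")
# timeout has no env var step: CLI, then spec, then the 300-second default
timeout = resolve_setting(None, None, validate_timeout(900), 300)
default_timeout = resolve_setting(None, None, validate_timeout(None), 300)
providers = [infer_provider(m) for m in ("claude-sonnet-4-6", "gpt-5.4-mini", "o3-mini", None)]
```

The explicit bool check in `validate_timeout` is the detail worth copying: without it, `true` in the JSON would load as Python `True` and pass an `isinstance(value, int)` test.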

System Prompt

EvalSpec.system_prompt (string, default null) is the prompt body sent to harnesses that consume a separate system-prompt channel. The shipping ClaudeCodeHarness ignores it (the claude -p CLI has no analogue — slash-command identity carries the skill body), while the CodexHarness (#149) prepends it before the user message. It is part of the cross-harness Harness.build_prompt contract introduced in #150 so a single EvalSpec shape feeds every harness identically.

Three-level precedence. clauditor resolves the effective system_prompt once per run:

  1. Explicit EvalSpec.system_prompt (set in eval.json) wins.
  2. Auto-derived AGENTS.md — when the field is null or omitted, SkillSpec.run consults resolve_agents_md(skill_dir, project_root) (per #154 DEC-009; sibling-then-project-root, strict containment per .claude/rules/path-validation.md). If a readable AGENTS.md is found, its contents become the system prompt.
  3. Auto-derived SKILL.md body — falls through when no AGENTS.md is available. clauditor reads the skill file referenced by SkillSpec, strips the YAML frontmatter via parse_frontmatter, and uses the remaining body as the system prompt.

There is no fourth fallback. The empty-string body case threads through verbatim — a SKILL.md with frontmatter but no body resolves to "", not None, so a misconfigured skill surfaces clearly downstream rather than silently masking a missing prompt.
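
The three-level fallback, including the empty-body case, can be sketched as below. Both functions are hypothetical: `strip_frontmatter` is a simplified stand-in for parse_frontmatter, and the AGENTS.md lookup is reduced to an already-read string (or None) rather than a call to resolve_agents_md.

```python
from typing import Optional


def strip_frontmatter(text: str) -> str:
    """Simplified stand-in: drop a leading '---' ... '---' block, keep the body verbatim."""
    lines = text.splitlines(keepends=True)
    if not lines or lines[0].strip() != "---":
        return text  # no frontmatter at all
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            return "".join(lines[i + 1:])
    raise ValueError("unterminated frontmatter block")


def resolve_system_prompt(explicit: Optional[str], agents_md: Optional[str], skill_md: str) -> str:
    """Explicit field first, then AGENTS.md contents, then the SKILL.md body."""
    if explicit is not None:
        return explicit
    if agents_md is not None:
        return agents_md
    return strip_frontmatter(skill_md)


skill_md = "---\nname: find-kid-activities\n---\nFind kid-friendly venues.\n"

from_explicit = resolve_system_prompt("Be careful.", "Agents text", skill_md)
from_agents = resolve_system_prompt(None, "Agents text", skill_md)
from_body = resolve_system_prompt(None, None, skill_md)
# Frontmatter with no body resolves to the empty string, not None.
from_empty_body = resolve_system_prompt(None, None, "---\nname: x\n---\n")
```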

Validation rules. When set, system_prompt must be a non-empty, non-whitespace string. The loader raises ValueError at EvalSpec.from_file time on any of: non-string value, empty string (""), or a string that strips to empty (whitespace-only). Mirrors user_prompt's validation shape exactly.
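
A minimal sketch of that rule, written so the same check could serve both system_prompt and user_prompt. `validate_prompt_field` is a hypothetical name, and the error messages are illustrative rather than the loader's actual wording.

```python
def validate_prompt_field(name: str, value):
    """Allow None (field omitted); otherwise require a string with visible content."""
    if value is None:
        return None
    if not isinstance(value, str):
        raise ValueError(f"{name} must be a string, got {type(value).__name__}")
    if not value.strip():
        # Covers both "" and whitespace-only strings such as "  \n".
        raise ValueError(f"{name} must not be empty or whitespace-only")
    return value


accepted = validate_prompt_field("system_prompt", "You are a careful researcher.")
omitted = validate_prompt_field("system_prompt", None)

rejected = []
for bad in ("", "   \n", 42, ["prompt"]):
    try:
        validate_prompt_field("system_prompt", bad)
    except ValueError:
        rejected.append(bad)
```

Note the asymmetry with the auto-derive path: an explicit empty string is rejected at load time, while an empty SKILL.md body is passed through as "".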

Frontmatter system_prompt: keys inside SKILL.md are NOT supported (DEC-003 of #150). The body of SKILL.md is the auto-derive source; authors who need a value distinct from the body must set it explicitly in eval.json.

Auto-derive failure mode. If the auto-derive path is taken (no explicit system_prompt in the eval spec) and the skill file is missing, unreadable, or has malformed frontmatter, SkillSpec.run raises RuntimeError naming both the skill identity and the resolved skill path. The underlying FileNotFoundError / OSError / ValueError is chained through __cause__ for debugging. This is a hard failure, not a warning, because every grader code path needs a deterministic system prompt for cross-run comparability.
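
The chaining pattern looks like this in plain Python. `read_skill_body` is a hypothetical helper that shows only the file-read half of the failure mode (the malformed-frontmatter ValueError would be wrapped the same way), and the message text is illustrative.

```python
import tempfile
from pathlib import Path


def read_skill_body(skill_name: str, skill_path: Path) -> str:
    """Read the skill file, converting low-level errors into one RuntimeError."""
    try:
        return skill_path.read_text(encoding="utf-8")
    except OSError as exc:  # FileNotFoundError is a subclass of OSError
        raise RuntimeError(
            f"cannot auto-derive system_prompt for skill {skill_name!r} from {skill_path}"
        ) from exc  # "from exc" stores the original error on __cause__


with tempfile.TemporaryDirectory() as tmp:
    missing = Path(tmp) / "find-kid-activities" / "SKILL.md"
    try:
        read_skill_body("find-kid-activities", missing)
        failure = None
    except RuntimeError as err:
        failure = err

root_cause = type(failure.__cause__).__name__
```

When debugging a failed run, inspecting `__cause__` (or reading the "The above exception was the direct cause" section of the traceback) tells you whether the file was missing, unreadable, or malformed.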

Example — explicit override in eval.json:

{
  "skill_name": "find-kid-activities",
  "test_args": "\"Cupertino, CA\" --ages 4-6 --count 5",
  "system_prompt": "You are a careful local-events researcher. Prefer primary sources (venue websites, official park pages) over aggregators. When confidence is low, say so explicitly.",
  "assertions": [
    {"id": "no_error", "type": "not_contains", "needle": "Error"}
  ]
}

When the field is omitted, clauditor derives the prompt automatically: from AGENTS.md if one resolves, otherwise from the body of find-kid-activities/SKILL.md. The explicit field is only needed when the eval-time prompt should diverge from the shipped skill body (e.g. tightening rubric framing for a CI grader without editing SKILL.md).

Schema history

Issue #67 — per-type assertion keys. Assertion dicts previously carried a single overloaded value key whose meaning depended on type — a string needle for contains, a regex pattern for regex, a stringly-typed count for has_urls, and so on. Issue #67 replaced value with per-type semantic keys (needle, pattern, length, count, format) and switched integer fields from stringly-typed ("value": "500") to native JSON ints ("length": 500). The loader rejects the old shape at load time with a "did you mean?" hint pointing at the correct per-type key. No back-compat window: hand-edit old specs to the new shape, or run the spec through clauditor propose-eval --force to regenerate it.
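
For hand-migrating a batch of old specs, a throwaway script along these lines may help. It is a sketch, not a clauditor tool: `migrate_assertion` and `KEY_FOR_TYPE` are hypothetical, the mapping is inferred from the per-type key table rather than from the loader, and types that now carry two keys (min_count, has_format) are deliberately refused because the doc does not describe how their old value was encoded.

```python
# ASSUMPTION: mapping inferred from the assertion-type table, one key per type.
KEY_FOR_TYPE = {
    "contains": ("needle", str),
    "not_contains": ("needle", str),
    "regex": ("pattern", str),
    "min_length": ("length", int),
    "max_length": ("length", int),
    "has_urls": ("count", int),
    "has_entries": ("count", int),
    "urls_reachable": ("count", int),
}


def migrate_assertion(old: dict) -> dict:
    """Rewrite a pre-#67 assertion ({"value": ...}) into the per-type key shape."""
    if "value" not in old:
        return dict(old)  # already in the new shape
    kind = old["type"]
    if kind not in KEY_FOR_TYPE:
        raise ValueError(f"migrate {kind!r} assertions by hand")
    key, cast = KEY_FOR_TYPE[kind]
    new = {k: v for k, v in old.items() if k != "value"}
    new[key] = cast(old["value"])  # "500" becomes the native int 500
    return new


migrated = [
    migrate_assertion({"id": "min_length_500", "type": "min_length", "value": "500"}),
    migrate_assertion({"id": "contains_venues", "type": "contains", "value": "Venues"}),
    migrate_assertion({"id": "no_error", "type": "not_contains", "needle": "Error"}),
]
```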