Eval Spec Format
Complete schema walkthrough for <skill-name>.eval.json: discovery rules, every supported field, input-file staging, output-file capture, and the format validation DSL. Read this when you're authoring or debugging an eval spec; it's the full reference the shorter README teaser points at.
Returning from the root README. This doc is the full reference; the README has a summary with code examples.
Place <skill-name>.eval.json alongside your .claude/commands/<skill-name>.md:
.claude/commands/
├── find-kid-activities.md
├── find-kid-activities.eval.json ← clauditor auto-discovers this
├── find-restaurants.md
└── find-restaurants.eval.json
File-based output: Many skills save results to files instead of printing to stdout. Use output_file for skills that write to one known path (e.g., research/results.md). Use output_files with glob patterns for skills that produce multiple files (e.g., ["research/*.md"]). If both are set, output_file takes precedence. When set, clauditor reads the file(s) after running the skill instead of capturing stdout.
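The selection order described above can be sketched in a few lines of Python. This is a minimal illustration, not clauditor's code: collect_output is a hypothetical helper, and joining multiple matched files with a newline is an assumption about how the captured text is combined.

```python
from pathlib import Path
from typing import Optional, Sequence


def collect_output(workdir: Path, stdout: str,
                   output_file: Optional[str] = None,
                   output_files: Optional[Sequence[str]] = None) -> str:
    """Hypothetical helper: choose the text to grade for one run."""
    if output_file:
        # A single known path takes precedence over any glob patterns.
        return (workdir / output_file).read_text()
    if output_files:
        # Expand each glob relative to the skill's working directory.
        matches = sorted(p for pattern in output_files for p in workdir.glob(pattern))
        # Assumption: multiple files are joined with a newline.
        return "\n".join(p.read_text() for p in matches)
    # Neither field set: fall back to captured stdout.
    return stdout
```

With both fields set, only research/results.md would be read in this sketch; the glob list is consulted only when output_file is absent.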
Input files
Some skills need sample inputs — a CSV to clean, a log file to summarize, a PDF to extract. Declare them with input_files and clauditor will stage them into each variance run's working directory before invoking the skill:
{
  "skill_name": "csv-cleaner",
  "test_args": "--strict",
  "input_files": ["fixtures/sales.csv"]
}
At grade time, fixtures/sales.csv is copied (via shutil.copy2) into .clauditor/iteration-N/csv-cleaner/run-K/inputs/sales.csv for each of the variance.n_runs runs, and the skill's subprocess is launched with that inputs/ directory as its CWD. So /csv-cleaner --strict sees sales.csv as a plain basename in its own current directory — no path wrangling required. Each run-K gets its own fresh copy, so a skill that mutates its input in one run does not affect the next.
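As a rough sketch of the staging step, assuming the sources have already been resolved and validated: stage_inputs is a hypothetical name, and only the use of shutil.copy2 and the inputs/ layout come from the description above.

```python
import shutil
from pathlib import Path
from typing import Sequence


def stage_inputs(sources: Sequence[Path], run_dir: Path) -> Path:
    """Copy each input file into run_dir/inputs under its basename."""
    inputs_dir = run_dir / "inputs"
    inputs_dir.mkdir(parents=True, exist_ok=True)
    for src in sources:
        # copy2 copies file contents and also tries to preserve metadata.
        shutil.copy2(src, inputs_dir / src.name)
    # The caller would launch the skill subprocess with this directory as cwd.
    return inputs_dir
```

Calling the helper once per run-K directory gives every run its own copy, which is why a mutation in one run cannot leak into the next.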
Rules enforced at spec-load time (EvalSpec.from_file):
- Paths are relative to the eval spec's parent directory, not the repo root. An input_files entry of fixtures/sales.csv next to my-skill.eval.json resolves to <spec-dir>/fixtures/sales.csv. This intentionally differs from output_files, which globs relative to the skill's working directory.
- Absolute paths are rejected. Use a relative path under the spec directory.
- Source containment is enforced. The resolved path (including symlink targets) must live under the spec's parent directory. Escapes via .. or symlinks pointing outside the spec tree raise ValueError.
- Missing files fail loudly. Paths are resolved with Path.resolve(strict=True), so a typo fails at load, not at grade time.
- Destinations are flattened to basenames. input_files: ["data/sales.csv"] stages as run-K/inputs/sales.csv, not run-K/inputs/data/sales.csv. Two entries that would flatten to the same basename (e.g. a/data.csv and b/data.csv) raise ValueError at load.
- Collision guard with output_files. Any literal output_files pattern whose basename matches an input_files basename raises ValueError at load. If your skill mutates sales.csv in place and you want to capture the result, either declare the output under a different basename / subdirectory in output_files, or read the post-run file back from the persisted iteration-N/<skill>/run-K/inputs/ directory after grading.
- No file-size cap. Files are copied verbatim. Eval specs are author-controlled, so keep fixtures reasonable.
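The path rules above can be illustrated with a short Python sketch. It is a simplified stand-in for what EvalSpec.from_file is described as doing, not the actual loader: resolve_input_files is a hypothetical name, the error messages are invented, and the output_files collision guard is left out.

```python
from pathlib import Path
from typing import Dict, List, Sequence


def resolve_input_files(spec_path: Path, entries: Sequence[str]) -> List[Path]:
    """Hypothetical sketch of the load-time checks on input_files."""
    spec_dir = spec_path.resolve().parent
    resolved: List[Path] = []
    basenames: Dict[str, str] = {}
    for entry in entries:
        if Path(entry).is_absolute():
            raise ValueError(f"absolute path not allowed: {entry!r}")
        # strict=True raises FileNotFoundError for a missing file and
        # follows symlinks, so the containment check sees the real target.
        target = (spec_dir / entry).resolve(strict=True)
        if not target.is_relative_to(spec_dir):  # Python 3.9+
            raise ValueError(f"path escapes the spec directory: {entry!r}")
        name = Path(entry).name  # destinations are flattened to basenames
        if name in basenames:
            raise ValueError(f"{entry!r} and {basenames[name]!r} both stage as {name!r}")
        basenames[name] = entry
        resolved.append(target)
    return resolved
```

Resolving before the containment check is the important ordering: comparing unresolved paths would let a symlink or a .. segment slip past the guard.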
Captured-output mode: clauditor grade --output <file> reads a pre-captured output file instead of running the skill. In that mode, staging is skipped. If a spec declares input_files and --output is passed, clauditor prints WARNING: --output bypasses the runner; input_files declaration is ignored. to stderr and continues.
Persistence: staged inputs are preserved post-finalize under .clauditor/iteration-N/<skill>/run-K/inputs/ alongside output.txt and output.jsonl, so you can inspect exactly what the skill saw (and what it did to the files) after each run.
Pytest plugin: the clauditor_spec fixture transparently stages input_files into tmp_path when a loaded spec declares them, so existing tests need zero changes.
Security / trust model: Eval specs are developer-authored and run with the repo owner's filesystem access. Clauditor resolves input_files paths under the spec's parent directory, enforces source containment, and rejects absolute paths — but the underlying assumption is that eval specs live in a repo you already trust. Do not run clauditor against eval specs from untrusted sources without reviewing them first.
A complete eval spec with all three layers:
{
  "skill_name": "find-kid-activities",
  "description": "Finds kid-friendly activities near a location",
  "test_args": "\"Cupertino, CA\" --ages 4-6 --count 5 --depth quick",
  "input_files": ["fixtures/sample-venues.csv"],
  "assertions": [
    {"id": "contains_venues", "type": "contains", "needle": "Venues"},
    {"id": "has_entries_3", "type": "has_entries", "count": 3},
    {"id": "has_urls_3", "type": "has_urls", "count": 3},
    {"id": "min_length_500", "type": "min_length", "length": 500},
    {"id": "no_error", "type": "not_contains", "needle": "Error"}
  ],
  "sections": [
    {
      "name": "Venues",
      "tiers": [
        {
          "label": "default",
          "min_entries": 3,
          "fields": [
            {"id": "venue_name", "name": "name", "required": true},
            {"id": "venue_address", "name": "address", "required": true},
            {"id": "venue_website", "name": "website", "required": true}
          ]
        }
      ]
    }
  ],
  "output_file": "research/results.md",
  "output_files": ["research/*.md", "research/*.json"],
  "grading_criteria": [
    {"id": "distance_match", "criterion": "Are all venues within the specified distance?"},
    {"id": "age_appropriate", "criterion": "Are venues appropriate for the specified age range?"},
    {"id": "cost_tier_match", "criterion": "Do cost tiers match the budget filter?"}
  ],
  "grading_model": "claude-sonnet-4-6",
  "grade_thresholds": {
    "min_pass_rate": 0.7,
    "min_mean_score": 0.5
  },
  "trigger_tests": {
    "should_trigger": [
      "Find kid activities in Cupertino",
      "What are some things to do with kids near me?"
    ],
    "should_not_trigger": [
      "What's the weather today?",
      "Help me write a Python script"
    ]
  },
  "variance": {
    "n_runs": 5,
    "min_stability": 0.8
  }
}
See examples/ for a complete working eval spec.
Assertion types and per-type keys
Each Layer 1 assertion carries a type plus the per-type semantic keys
listed below (in addition to id, type, and optional name). Integer
fields are native JSON ints, not strings — {"length": 500}, not
{"length": "500"}. Unknown keys raise ValueError at load time with a
"did you mean?" migration hint.
| Type | Required keys | Optional keys | Description |
|---|---|---|---|
| contains | needle (str) | none | Output contains the needle substring |
| not_contains | needle (str) | none | Output does NOT contain the needle |
| regex | pattern (str) | none | Output matches the regex pattern (search, not fullmatch) |
| min_count | pattern (str), count (int) | none | Regex pattern appears at least count times |
| min_length | length (int) | none | Output length is at least length chars |
| max_length | length (int) | none | Output length is at most length chars |
| has_urls | none | count (int, default 1) | Output contains at least count URLs |
| has_entries | none | count (int, default 1) | Output contains at least count numbered entries |
| urls_reachable | none | count (int, default 1) | At least count URLs in output return 2xx on HEAD |
| has_format | format (str) | count (int, default 1) | Output contains at least count strings matching the format (see format registry) |
Example — one of each shape:
{
  "assertions": [
    {"id": "has_title", "type": "contains", "needle": "Results"},
    {"id": "no_error", "type": "not_contains", "needle": "Error"},
    {"id": "numbered", "type": "regex", "pattern": "\\*\\*\\d+\\."},
    {"id": "three_bullets", "type": "min_count", "pattern": "^- ", "count": 3},
    {"id": "long_enough", "type": "min_length", "length": 500},
    {"id": "not_too_long", "type": "max_length", "length": 5000},
    {"id": "has_3_urls", "type": "has_urls", "count": 3},
    {"id": "has_3_entries", "type": "has_entries", "count": 3},
    {"id": "urls_work", "type": "urls_reachable", "count": 2},
    {"id": "two_phones", "type": "has_format", "format": "phone_us", "count": 2}
  ]
}
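To make the per-type keys concrete, here is a minimal Python sketch of how the text-only assertion types in the table could be evaluated. check_assertion is a hypothetical function, not clauditor's implementation. Using re.MULTILINE for min_count is an assumption, made so that a pattern such as "^- " can match at the start of each line. The URL, entry, and format types are omitted because their detection rules are not specified here.

```python
import re


def check_assertion(assertion: dict, output: str) -> bool:
    """Hypothetical evaluator for the text-only Layer 1 assertion types."""
    kind = assertion["type"]
    if kind == "contains":
        return assertion["needle"] in output
    if kind == "not_contains":
        return assertion["needle"] not in output
    if kind == "regex":
        # search, not fullmatch: the pattern may match anywhere in the output.
        return re.search(assertion["pattern"], output) is not None
    if kind == "min_count":
        # Assumption: MULTILINE, so "^" anchors at the start of every line.
        hits = re.findall(assertion["pattern"], output, re.MULTILINE)
        return len(hits) >= assertion["count"]
    if kind == "min_length":
        return len(output) >= assertion["length"]
    if kind == "max_length":
        return len(output) <= assertion["length"]
    raise ValueError(f"type not covered by this sketch: {kind!r}")
```

Because count and length arrive as native JSON ints, the comparisons need no string conversion.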
Field validation with format
Each FieldRequirement accepts a single format key that validates the
extracted value. format does double duty:
- Registered format name: a shorthand for a built-in regex in the format registry. Run python -c "from clauditor.formats import list_formats; print(list_formats())" to print the registered names (20 as of writing): currency_eur, currency_usd, date_iso, date_us, domain, email, hex_color, ipv4, latitude, longitude, percentage, phone_intl, phone_us, star_rating, time_12h, time_24h, url, uuid, zip_uk, zip_us.
- Inline regex: any string that isn't a registered name is compiled with re.compile and used as an anchored fullmatch against the value. Invalid regexes raise ValueError at spec load time.
{
  "sections": [
    {
      "name": "Restaurants",
      "tiers": [
        {
          "label": "default",
          "min_entries": 1,
          "max_entries": 3,
          "fields": [
            {"id": "r_name", "name": "name", "required": true},
            {"id": "r_phone", "name": "phone", "required": true, "format": "phone_us"},
            {"id": "r_website", "name": "website", "required": true, "format": "domain"},
            {"id": "r_zip", "name": "zip", "required": false, "format": "^\\d{5}$"}
          ]
        }
      ]
    }
  ]
}
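The double duty of format can be sketched as a lookup followed by a compile. The registry below is a hypothetical two-entry stand-in with illustrative patterns; the real patterns in clauditor.formats may differ. Only the lookup-then-compile order, the fullmatch semantics, and the ValueError on a bad regex are taken from the description above.

```python
import re

# Hypothetical stand-in for the format registry; patterns are illustrative only.
EXAMPLE_REGISTRY = {
    "zip_us": re.compile(r"\d{5}(-\d{4})?"),
    "date_iso": re.compile(r"\d{4}-\d{2}-\d{2}"),
}


def compile_format(fmt: str) -> re.Pattern:
    """Return the registered pattern, or compile fmt as an inline regex."""
    if fmt in EXAMPLE_REGISTRY:
        return EXAMPLE_REGISTRY[fmt]
    try:
        return re.compile(fmt)
    except re.error as exc:
        # Surface a bad inline regex as ValueError, as at spec load time.
        raise ValueError(f"invalid format regex {fmt!r}: {exc}") from exc


def value_matches(fmt: str, value: str) -> bool:
    """Anchored check: the whole value must match, not just a substring."""
    return compile_format(fmt).fullmatch(value) is not None
```

Since fullmatch already anchors both ends, the ^ and $ in an inline pattern such as "^\\d{5}$" are harmless but redundant.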
url vs domain: LLMs commonly extract the display text of markdown
links ([paesanosj.com](https://paesanosj.com/) → paesanosj.com),
which are valid domains but not URLs with a scheme. Use format: "url"
only when you really need https://…; use format: "domain" to accept
bare hostnames too.
max_entries: A precision signal — when set, clauditor emits a
count_max assertion if extraction returns more entries than the cap.
Field-level checks still run over all extracted entries so you see both
the count failure and any per-entry failures.
Optional top-level fields
A few EvalSpec fields tune specific code paths and are safe to omit:
- user_prompt (string, default null): a natural-language query fed to the blind A/B judge (blind_compare_from_spec and the clauditor_blind_compare pytest fixture). Distinct from test_args: test_args is the CLI argument string passed to the skill subprocess, while user_prompt is the conversational framing the judge sees when comparing two skill outputs. Required only on the blind-compare code path; other commands (validate, grade, extract, triggers) ignore it.
- allow_hang_heuristic (bool, default true): controls the interactive-hang detector in SkillRunner. The heuristic flags a run as a likely-interactive-hang when the skill stops after one turn with a trailing ? or an AskUserQuestion tool call. Set to false to opt a specific skill out when the heuristic consistently mis-classifies its output (e.g. a skill whose correct answer ends in a rhetorical question). When disabled, a suppressed-heuristic run still lands in SkillResult but without the error_category = "interactive" signal.
- grading_model (string or null, default null post-#182): the model used for Layer 3 grading (and L2 extraction, suggest, propose-eval, triggers, blind compare). Per-spec override when you want to trade cost for fidelity. Default is null per #146 DEC-004b: when unset (omitted or explicit null), the per-provider resolver clauditor._providers.resolve_grading_model picks the standard default for the resolved grading_provider, which is "claude-sonnet-4-6" for anthropic and "gpt-5.4" for openai. Set explicitly to override (e.g. "claude-opus-4-1" or "gpt-5.4-mini"). to_dict() omits the key when it equals null, so pre-#146 specs round-trip with no synthetic key.
- grading_provider (string or null, default "auto" post-#182): selects which provider's SDK handles LLM-grader calls. One of "anthropic", "openai", "auto", or null (legacy #145-vintage; treated the same as the default at the CLI seam). When the resolved value is "auto", the auto-inference layer maps claude-* → anthropic and gpt-* / o[0-9]+* → openai; when no model is available the resolver falls back to "anthropic" (subscription-first historical default). An unknown model prefix raises a load-time error advising the operator to set --grading-provider explicitly. Precedence: --grading-provider on the CLI wins, then the CLAUDITOR_GRADING_PROVIDER env var, then this field, then the "auto" default. to_dict() omits the key when it equals "auto" (or None from legacy specs), matching the transport / harness pattern, so pre-#146 specs round-trip with no synthetic key. Full reference: docs/cli-reference.md (per-command --grading-provider row).
- grade_thresholds (object, default null): an object with min_pass_rate and/or min_mean_score (both floats in [0.0, 1.0]) that gate clauditor grade's exit code. When set, a run whose metrics fall below either threshold exits 1 (signal failed) rather than 0.
- variance (object, default null): {"n_runs": int, "min_stability": float} for clauditor grade --variance. Runs the skill n_runs times, grades each, and asserts cross-run agreement.
- trigger_tests (object, default null): {"should_trigger": [str, ...], "should_not_trigger": [str, ...]} for clauditor triggers. Required by that command; other commands ignore it.
- timeout (int, default null): per-skill runner timeout in seconds. Overrides the built-in 300-second watchdog for skills that legitimately need longer (e.g. multi-agent research skills). Precedence: --timeout <seconds> on the CLI wins when passed explicitly; otherwise EvalSpec.timeout wins when set; otherwise the runner falls back to its 300-second default. Load-time validation rejects non-int values (including true/false) and values <= 0.
- transport (string, default "auto"): selects the backend used for LLM-mediated grader calls (Layer 2 extraction, Layer 3 rubric grading, blind compare, trigger judge, suggest proposer, propose-eval). One of "api" (Anthropic SDK over HTTP), "cli" (subprocess to the local claude binary, reuses the user's cached auth), or "auto" (prefers cli when claude is on PATH, falls back to api). Precedence: --transport on the CLI wins, then the CLAUDITOR_TRANSPORT env var, then this field, then the auto default. Load-time validation rejects values outside the three-choice set. See docs/transport-architecture.md for the full backend matrix.
- harness (string, default "auto"): selects which CLI runs the skill subprocess: "claude-code" (Claude Code via claude -p), "codex" (the OpenAI Codex CLI), or "auto" (prefers claude when on PATH, falls back to codex). This is the runtime axis and is independent of transport / grading_provider, which govern the grader call: a skill can run under Codex while the L3 grader still calls Claude Sonnet. Precedence: --harness on the CLI wins (on validate, grade, capture, run), then the CLAUDITOR_HARNESS env var, then this field, then the auto default. Load-time validation rejects values outside the three-choice set. Running under Codex requires CODEX_API_KEY or OPENAI_API_KEY in the subprocess env. See docs/codex-harness.md for the full harness reference.
- sync_tasks (bool, default false): when true, clauditor sets CLAUDE_CODE_DISABLE_BACKGROUND_TASKS=1 in the claude -p subprocess env, forcing Task(run_in_background=true) spawns to run synchronously. Skill authors who ship with background sub-agents for parallel fanout but want clauditor to evaluate the full transcript set this to true in the spec. Precedence: --sync-tasks on the CLI wins; otherwise this field wins; otherwise the default false applies. Load-time validation rejects non-bool values. Read the fidelity caveats in docs/skill-usage.md#--sync-tasks-force-task-mode-synchronous-at-eval-time before enabling this field, because evaluating sync is not equivalent to evaluating async.
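Several of the fields above (grading_provider, transport, harness) share the same precedence chain, and grading_provider adds prefix-based inference. The following Python sketch shows both ideas under stated assumptions: resolve_setting and infer_provider are hypothetical names, treating an empty environment variable as unset is an assumption, and the error message is invented.

```python
import os
import re
from typing import Mapping, Optional


def resolve_setting(cli_value: Optional[str], env_name: str,
                    spec_value: Optional[str], default: str,
                    env: Mapping[str, str] = os.environ) -> str:
    """Apply the order: CLI flag, then env var, then spec field, then default."""
    if cli_value is not None:
        return cli_value
    env_value = env.get(env_name)
    if env_value:  # assumption: an empty variable counts as unset
        return env_value
    if spec_value is not None:
        return spec_value
    return default


def infer_provider(model: Optional[str]) -> str:
    """Map a model name to a provider, following the "auto" rules above."""
    if model is None:
        return "anthropic"  # documented fallback when no model is available
    if model.startswith("claude-"):
        return "anthropic"
    if model.startswith("gpt-") or re.match(r"o[0-9]+", model):
        return "openai"
    raise ValueError(f"unknown model prefix {model!r}; set --grading-provider explicitly")
```

For example, resolve_setting(None, "CLAUDITOR_TRANSPORT", "api", "auto") returns the environment value when the variable is set and "api" otherwise. The timeout and sync_tasks fields follow a shorter chain with no environment variable.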
System Prompt
EvalSpec.system_prompt (string, default null) is the prompt body
sent to harnesses that consume a separate system-prompt channel. The
shipping ClaudeCodeHarness ignores it (the claude -p CLI has no
analogue — slash-command identity carries the skill body), while the
CodexHarness (#149) prepends it before the user
message. It is part of the cross-harness Harness.build_prompt
contract introduced in #150 so a single EvalSpec shape feeds every
harness identically.
Three-level precedence. clauditor resolves the effective
system_prompt once per run:
- Explicit EvalSpec.system_prompt (set in eval.json) wins.
- Auto-derived AGENTS.md: when the field is null or omitted, SkillSpec.run consults resolve_agents_md(skill_dir, project_root) (per #154 DEC-009; sibling-then-project-root, strict containment per .claude/rules/path-validation.md). If a readable AGENTS.md is found, its contents become the system prompt.
- Auto-derived SKILL.md body: falls through when no AGENTS.md is available. clauditor reads the skill file referenced by SkillSpec, strips the YAML frontmatter via parse_frontmatter, and uses the remaining body as the system prompt.
There is no fourth fallback. The empty-string body case threads
through verbatim — a SKILL.md with frontmatter but no body
resolves to "", not None, so a misconfigured skill surfaces
clearly downstream rather than silently masking a missing prompt.
Validation rules. When set, system_prompt must be a non-empty,
non-whitespace string. The loader raises ValueError at
EvalSpec.from_file time on any of: non-string value, empty string
(""), or a string that strips to empty (whitespace-only). Mirrors
user_prompt's validation shape exactly.
Frontmatter system_prompt: keys inside SKILL.md are NOT supported
(DEC-003 of #150). The body of SKILL.md is the auto-derive source;
authors who need a value distinct from the body must set it explicitly
in eval.json.
Auto-derive failure mode. If the auto-derive path is taken (no
explicit system_prompt in the eval spec) and the skill file is
missing, unreadable, or has malformed frontmatter, SkillSpec.run
raises RuntimeError naming both the skill identity and the
resolved skill path. The underlying FileNotFoundError / OSError /
ValueError chains through __cause__ for debug. This is a hard
failure — not a warning — because every grader code path needs a
deterministic system prompt for cross-run comparability.
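The precedence, validation, and failure behavior described above can be sketched together in Python. This is an illustration under assumptions, not the code in SkillSpec.run: all three function names are hypothetical, strip_frontmatter is a simplified stand-in for parse_frontmatter, and the AGENTS.md path is taken as already located rather than found via resolve_agents_md.

```python
from pathlib import Path
from typing import Optional


def validate_system_prompt(value: object) -> None:
    """Reject non-strings, empty strings, and whitespace-only strings."""
    if value is None:
        return  # omitted or null: the auto-derive path applies
    if not isinstance(value, str) or not value.strip():
        raise ValueError("system_prompt must be a non-empty, non-whitespace string")


def strip_frontmatter(text: str) -> str:
    """Simplified stand-in for parse_frontmatter: drop a leading '---' block."""
    if not text.startswith("---\n"):
        return text
    end = text.find("\n---", 3)
    if end == -1:
        raise ValueError("unterminated frontmatter")
    return text[end + 4:].lstrip("\n")


def resolve_system_prompt(explicit: Optional[str],
                          agents_md: Optional[Path],
                          skill_md: Path) -> str:
    """Explicit value, then AGENTS.md contents, then SKILL.md body."""
    if explicit is not None:
        return explicit
    if agents_md is not None and agents_md.is_file():
        return agents_md.read_text()
    try:
        # A frontmatter-only SKILL.md yields "", which is returned verbatim.
        return strip_frontmatter(skill_md.read_text())
    except (OSError, ValueError) as exc:
        # Hard failure; the original error stays reachable via __cause__.
        raise RuntimeError(f"cannot derive system prompt from {skill_md}") from exc
```

The raise ... from exc form is what makes the underlying FileNotFoundError, OSError, or ValueError available on __cause__ for debugging.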
Example — explicit override in eval.json:
{
  "skill_name": "find-kid-activities",
  "test_args": "\"Cupertino, CA\" --ages 4-6 --count 5",
  "system_prompt": "You are a careful local-events researcher. Prefer primary sources (venue websites, official park pages) over aggregators. When confidence is low, say so explicitly.",
  "assertions": [
    {"id": "no_error", "type": "not_contains", "needle": "Error"}
  ]
}
When system_prompt is omitted, clauditor derives the prompt automatically:
from AGENTS.md if one is found, otherwise from the body of
find-kid-activities/SKILL.md. The explicit field is only needed when the
eval-time prompt should diverge from the shipped skill body (e.g.
tightening rubric framing for a CI grader without editing SKILL.md).
Schema history
Issue #67 — per-type assertion keys. Assertion dicts previously
carried a single overloaded value key whose meaning depended on
type — a string needle for contains, a regex pattern for regex, a
stringly-typed count for has_urls, and so on. Issue #67 replaced
value with per-type semantic keys (needle, pattern, length,
count, format) and switched integer fields from stringly-typed
("value": "500") to native JSON ints ("length": 500). The loader
rejects the old shape at load time with a "did you mean?" hint pointing
at the correct per-type key. No back-compat window: hand-edit old specs
to the new shape, or run the spec through clauditor propose-eval
--force to regenerate it.
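For hand-migrating many old specs, a small script can do the mechanical part. The sketch below is hypothetical and not shipped with clauditor. It covers only the types that take exactly one semantic key in the table above and refuses min_count and has_format, because the old single value key cannot be mapped onto their multiple keys without guessing.

```python
# Hypothetical mapping from assertion type to (new key, caster); it covers
# only the types that take exactly one semantic key.
SINGLE_KEY = {
    "contains": ("needle", str),
    "not_contains": ("needle", str),
    "regex": ("pattern", str),
    "min_length": ("length", int),
    "max_length": ("length", int),
    "has_urls": ("count", int),
    "has_entries": ("count", int),
    "urls_reachable": ("count", int),
}


def migrate_assertion(old: dict) -> dict:
    """Rewrite a pre-#67 assertion ("value" key) into the per-type shape."""
    if "value" not in old:
        return dict(old)  # already in the new shape
    kind = old.get("type")
    if kind not in SINGLE_KEY:
        raise ValueError(f"cannot migrate type {kind!r} automatically")
    key, cast = SINGLE_KEY[kind]
    new = {k: v for k, v in old.items() if k != "value"}
    new[key] = cast(old["value"])  # e.g. "500" becomes the int 500
    return new
```

Applied to {"id": "long_enough", "type": "min_length", "value": "500"}, the helper returns the same assertion with "length": 500 in place of "value".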