Changelog
All notable changes to clauditor are tracked here. Pre-1.0 releases may contain breaking changes without a deprecation shim; see the Breaking changes sections below for migration guidance.
The format is loosely based on Keep a Changelog and this project adheres to Semantic Versioning.
How this is maintained. Day-to-day work goes under [Unreleased]; on release, those entries are promoted under a new version header dated with the release day. The /release-manager skill uses the version section as the GitHub Release body via gh release create --notes-file, so this file is the single source of truth for human-facing release notes.
Unreleased
0.1.3 - 2026-05-26
Added
- npm wrapper package clauditor-eval (#4). A Node.js subprocess bridge that shells out to the Python clauditor CLI: a bin/clauditor.js launcher, a JS API (runSkill / validate / loadSpec) with TypeScript types, a toPassClauditor jest/vitest matcher, and an npm-publish CI workflow. Lets JS/TS projects run clauditor evals without a Python entry point of their own.
- clauditor run --json structured output and python -m clauditor entry point (#4). run --json emits a single pure-JSON document on stdout (skill exit code carried in the payload, all progress routed to stderr) so external wrappers can JSON.parse it; python -m clauditor invokes the CLI as a module. Both back the npm bridge.
- parallel-research example skill demonstrating the background-task refactor (#103). Added examples/.claude/skills/parallel-research/, a runnable companion to "Recipe A" (drop run_in_background: true, run lanes sequentially) for skills that fan out research across parallel sub-agents. Its SKILL.eval.json includes a not_contains assertion on "run_in_background" so the refactored-good shape stays locked in.
- Live-gated background-task non-completion test fixture (#103). Added a triple-lock-gated (CLAUDITOR_RUN_LIVE=1 + auth + claude on PATH) test that drives a real Task(run_in_background=true) skill through the runner and asserts that the background-task: warning fires, plus an offline unit assertion on the warning text. Default CI does not spend tokens.
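A wrapper that consumes run --json can rely on stdout holding exactly one JSON document. The sketch below illustrates that contract from Python. To stay runnable without clauditor installed it uses a stand-in child process, and the payload keys exit_code and output are assumptions for illustration, not the documented schema.

```python
import json
import subprocess
import sys

# Stand-in for `clauditor run --json`: progress goes to stderr and a single
# JSON document goes to stdout. The payload keys here are hypothetical.
FAKE_CLI = (
    "import json, sys; "
    "print('running skill...', file=sys.stderr); "
    "print(json.dumps({'exit_code': 0, 'output': 'ok'}))"
)


def run_json(cmd: list) -> dict:
    """Run a command and parse its stdout as a single JSON document."""
    proc = subprocess.run(cmd, capture_output=True, text=True, check=False)
    # stderr carries progress only, so stdout can be parsed directly.
    return json.loads(proc.stdout)


payload = run_json([sys.executable, "-c", FAKE_CLI])
```

In real use the stand-in command would be replaced by the actual clauditor run --json invocation, and the skill's exit code would be read from the payload rather than from the process return code.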
Documentation
- Background-task / --sync-tasks doc-coherence sweep (#103). The background-task: runner warning now links to docs/skill-usage.md#skill-compatibility for refactoring recipes. Added a worked before/after example (Recipe A and Recipe B) for a parallel-research fan-out to the skill-usage compatibility section, cross-linked from the README's "Skill compatibility" parallel-sub-tasks bullet and from the compatibility matrix row, and catalogued the async-fidelity gap (blocked on upstream anthropics/claude-code#52917) in a new Tier 3 subsection of docs/adr/transport-research-103.md.
- Surface the Codex harness across the discovery docs. The Codex runtime shipped in 0.1.2 but was only mentioned in a buried paragraph of the README's auth section. Added a dedicated "Running skills under Codex" README section (with a tagline clause and Contents-nav entry), documented the --harness {claude-code,codex,auto} flag on the four skill-running commands in the CLI reference's shared-runner-flags table, added a harness field entry to the eval-spec reference, and added a "Running under Codex" subsection to the quick-start. All link to the existing docs/codex-harness.md as the single source of truth. Also fixed the README and quick-start pytest examples to use the post-#155 clauditor_runner() factory form (they still showed the pre-#155 value-fixture call).
0.1.2 - 2026-05-21
Breaking changes
- Codex auth pre-flight refuses ChatGPT-mode credentials (#177, PR #181). check_codex_auth now examines ~/.codex/auth.json (or $CODEX_HOME/auth.json when set) and refuses pre-flight when the file declares auth_mode == "chatgpt", regardless of whether CODEX_API_KEY / OPENAI_API_KEY is exported. Rationale: the codex subprocess reads auth.json and routes via ChatGPT in that mode, rejecting every model server-side with "The 'gpt-5-codex' model is not supported when using Codex with a ChatGPT account." The pre-flight refusal fails fast with an actionable CodexAuthMissingError pointing at codex login --with-api-key, rather than letting users burn a subprocess + API round-trip on a guaranteed-to-fail call.
Auth.json reads are failure-open per DEC-005 — a missing, unreadable,
malformed, or oversize (> 1 MB) auth.json falls through to the
existing env-var / PATH-on-disk checks. Parsed content (tokens,
account ids) is never serialized to sidecars, logs, or error
messages per DEC-014.
Supersedes #175 DECs 001, 002, 003, 004, 008, and 009 (see the
cross-link table in plans/super/177-codex-auth-mode-conflict.md).
The #175 plan doc stays historical; the refinement trail lives in
the #177 plan.
Migration: users with ~/.codex/auth.json in chatgpt-mode should
run codex login --with-api-key to re-materialize the credentials
file in API-key mode. Users who never logged in via the ChatGPT flow
are unaffected.
- clauditor_runner is now a factory fixture (#155). Pytest fixtures cannot accept call-site kwargs; only factory fixtures can. To support the new harness= operator-intent kwarg, clauditor_runner was converted from a value fixture to a factory.
Migration:
    # Before
    def test_my_skill(clauditor_runner):
        runner = clauditor_runner  # value fixture
        result = runner.run("my-skill")

    # After
    def test_my_skill(clauditor_runner):
        runner = clauditor_runner()  # factory: call it
        result = runner.run("my-skill")
No deprecation shim ships — pre-1.0 accepts the hard break (same
precedent as the SkillResult.assert_* → SkillAsserter migration in
v0.1.0). See docs/pytest-plugin.md#parametrizing-harness--provider.
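The fail-open auth.json read described in the Codex pre-flight entry above can be sketched as follows. This is a minimal illustration, not the project's check_codex_auth: the helper names read_auth_mode and should_refuse, and the exact byte value used for the 1 MB cap, are assumptions.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional

MAX_AUTH_JSON_BYTES = 1_000_000  # assumed byte value for the "1 MB" cap


def read_auth_mode(path: Path) -> Optional[str]:
    """Return the declared auth_mode, or None when the file cannot be used."""
    try:
        if path.stat().st_size > MAX_AUTH_JSON_BYTES:
            return None  # oversize: fall through to the env-var / PATH checks
        data = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return None  # missing, unreadable, or malformed: fail open
    if not isinstance(data, dict):
        return None
    mode = data.get("auth_mode")
    return mode if isinstance(mode, str) else None


def should_refuse(path: Path) -> bool:
    """Refuse pre-flight only when the file positively declares chatgpt mode."""
    return read_auth_mode(path) == "chatgpt"


# Small demonstration against a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    auth = Path(tmp) / "auth.json"
    missing = should_refuse(auth)
    auth.write_text('{"auth_mode": "chatgpt"}', encoding="utf-8")
    chatgpt = should_refuse(auth)
    auth.write_text("{not json", encoding="utf-8")
    malformed = should_refuse(auth)
```

Only the mode string leaves the helper; the parsed tokens and account ids are dropped, which is in the spirit of the rule that parsed content never reaches sidecars, logs, or error messages.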
Added
- Documentation: three new deep-dive docs. Filled doc gaps that accumulated as the multi-provider / multi-harness / cost-tracking surface landed:
  - docs/codex-harness.md: user-facing narrative for running skills under the OpenAI Codex CLI (auth, four-layer precedence, sandbox modes, troubleshooting, useful harness × grader pairings).
  - docs/cost-tracking.md: the context.json sidecar shape, cost_usd estimation rules, the pricing table, reasoning-token semantics, and the always-v1 contract.
  - docs/audit-trend-workflow.md: end-to-end story for the iteration-history surface: how grade builds history, how audit / trend / compare / badge consume it, and how cross-axis comparability refusal protects against silent averaging across stacks.
README updated to mention the harness axis and cost tracking in
prose, and the Reference docs list now links all three new docs
plus the previously-orphaned codex-stream-schema.md.
- Pytest fixtures: harness × provider parametrization (#155). The pytest plugin gained two new CLI options and four new factory kwargs so a single test can sweep across {claude-code, codex} × {anthropic, openai} without changing skill or eval-spec files.
  - New pytest CLI options:
    - --clauditor-harness {claude-code, codex, auto}: overrides the harness used by clauditor_runner session-wide. (Note: clauditor_spec honors only EvalSpec.harness; set harness: in eval.json for per-skill author preference.)
    - --clauditor-grading-provider {anthropic, openai, auto}: overrides the grading provider used by clauditor_grader, clauditor_blind_compare, and clauditor_triggers session-wide.
  - New factory kwargs (operator-intent, top of precedence stack):
    - clauditor_runner(harness=...): pin harness for this runner.
    - clauditor_grader(skill, eval_path, output, *, provider=..., model=...): both new factory kwargs. (Pre-#155 the fixture only consulted the --clauditor-model pytest option; the kwarg layer is new.)
    - clauditor_blind_compare(skill, output_a, output_b, eval_path, *, provider=..., model=...): both new.
    - clauditor_triggers(skill, eval_path, *, provider=..., model=...): both new.
  - Operator-intent precedence (highest → lowest): factory kwarg > pytest CLI option > env var (CLAUDITOR_HARNESS, CLAUDITOR_GRADING_PROVIDER) > spec field (EvalSpec.harness, EvalSpec.grading_provider) > default "auto". Mirrors the CLI seam exactly per .claude/rules/spec-cli-precedence.md.
  - Eager check_codex_auth from clauditor_runner and clauditor_spec. When the resolved harness is "codex", both factories fire check_codex_auth before returning the runner / wrapped spec; missing auth raises CodexAuthMissingError (a sibling of Exception, NOT a subclass of AnthropicAuthMissingError / OpenAIAuthMissingError) so callers route on a structural except ladder rather than substring-matching error text.
  - Asymmetry note. There is intentionally no CLAUDITOR_FIXTURE_ALLOW_OPENAI env var (DEC-001 of #155). The existing CLAUDITOR_FIXTURE_ALLOW_CLI=1 opts into a relaxed Anthropic guard (accepts subscription auth via the claude CLI on PATH); OpenAI has no CLI-fallback / subscription analog (per #145 DEC-002), so there is nothing to opt into. See docs/pytest-plugin.md for the full documentation including a pytest.mark.parametrize worked example.
- EvalSpec.system_prompt: str | None = None field (#150). Mirrors user_prompt's shape and validation: optional at load time; when set, it must be a non-empty, non-whitespace string (EvalSpec.from_file rejects empty strings, whitespace-only strings, and non-string values). When unset, clauditor auto-derives the system prompt from the SKILL.md body (post-frontmatter, via parse_frontmatter) at SkillSpec.run time. Explicit EvalSpec.system_prompt wins over the auto-derived SKILL.md body. Auto-derive failures (missing file, malformed frontmatter) raise a RuntimeError naming the skill and path. Frontmatter system_prompt: keys inside SKILL.md are NOT supported (DEC-003); the body is the auto-derive source. See docs/eval-spec-reference.md#system-prompt.
- Harness.build_prompt(skill_name, args, *, system_prompt) -> str protocol method (#150). Third member of the cross-harness Harness protocol (alongside invoke and strip_auth_keys). Each harness owns its identity-to-prompt strategy: ClaudeCodeHarness keeps the slash-command synthesis (f"/{skill_name} {args}", or f"/{skill_name}" when args is empty) and ignores system_prompt because the claude -p CLI has no separate system-prompt channel; MockHarness records (skill_name, args, system_prompt) on build_prompt_calls for test assertions and returns a deterministic stub that surfaces all three inputs. The forthcoming CodexHarness (#149) will consume system_prompt as the system message. See docs/architecture.md#3-harness-protocol.
- SkillRunner.run(..., system_prompt=...) keyword-only kwarg (#150). Threads the resolved system_prompt from SkillSpec.run to the harness's build_prompt. Keyword-only and placed last so it cannot collide positionally with the existing cwd / env / timeout kwargs. SkillSpec.run resolves the effective value once (explicit EvalSpec.system_prompt > auto-derived SKILL.md body) and threads the resolved string through this kwarg.
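The operator-intent precedence stack for the harness (factory kwarg > pytest CLI option > env var > spec field > default) can be sketched as a first-set-value-wins lookup. The function name resolve_harness and its signature are hypothetical. The sketch also assumes that an explicit "auto" at a higher layer is final rather than falling through, which the entries above do not specify.

```python
import os
from typing import Mapping, Optional


def resolve_harness(
    kwarg: Optional[str] = None,
    cli_option: Optional[str] = None,
    env: Mapping[str, str] = os.environ,
    spec_field: Optional[str] = None,
) -> str:
    """Return the highest-precedence layer that is set, else "auto"."""
    layers = (kwarg, cli_option, env.get("CLAUDITOR_HARNESS"), spec_field)
    for value in layers:
        if value is not None:
            return value
    return "auto"  # built-in default when no layer is set
```

Passing env explicitly keeps the lookup testable without mutating the process environment; the grading-provider axis would follow the same shape with CLAUDITOR_GRADING_PROVIDER and EvalSpec.grading_provider.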
Changed
- Codex subprocess error mapping: chatgpt-mode rejection classified as "auth" (#177, PR #181). When the pre-flight refusal is bypassed (e.g. ~/.codex/auth.json is created / mutated after the pre-flight check, or a sandboxed CI environment skips the parser), the codex subprocess emits the server-side rejection string "The 'gpt-5-codex' model is not supported when using Codex with a ChatGPT account." _classify_codex_failure now classifies this as error_category = "auth" rather than "api", matching the actual failure mode (credentials are pointing at the wrong auth surface). No sidecar schema bump per DEC-011: new Literal values inside an existing field are additive.
- Codex CLI-on-PATH announcement body reworded (#177, PR #181). _CODEX_CLI_ON_PATH_ANNOUNCEMENT no longer mentions "the ChatGPT-login flow"; the post-#177 body describes codex resolving credentials from ~/.codex/auth.json in API-key mode. Three durable substrings remain pinned by tests: "codex", "PATH", "~/.codex/auth.json". The announcement fires under the narrowed condition that the PATH branch is the load-bearing acceptance signal (env vars unset, codex on PATH, and auth.json is absent or declares auth_mode != "chatgpt"); the chatgpt-mode refusal raises CodexAuthMissingError and does not fire the announcement (the exception is the user signal).
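The reclassification in the first entry above amounts to recognizing the server-side rejection text and mapping it to "auth". A minimal sketch, assuming a plain substring match and a hypothetical classify_failure helper (the real _classify_codex_failure is not shown here and presumably handles more categories):

```python
# Stable fragment of the rejection string quoted in the changelog entry.
CHATGPT_REJECTION = "is not supported when using Codex with a ChatGPT account"


def classify_failure(message: str) -> str:
    """Map a codex failure message to an error category (two-way sketch)."""
    if CHATGPT_REJECTION in message:
        return "auth"  # credentials point at the wrong auth surface
    return "api"  # assumed fallback for other server-side failures
```

Matching on the model-independent fragment rather than the full sentence is a design choice of this sketch, so the check would not depend on the model name embedded in the message.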
0.1.1 - 2026-04-26
Added
- AGENTSKILLS_REFERENCE_DEPTH_TOO_DEEP conformance warning (#129): agentskills.io spec sync. clauditor lint now flags Markdown link / image targets and reference-style definitions in a SKILL.md body that point more than one directory deep, plus parent-escape (..) paths. Matches the agentskills.io spec's "one level deep from SKILL.md" guidance. Same-directory and one-subdirectory references stay silent; fenced code blocks, URL schemes, anchors, and absolute paths are skipped; targets are de-duplicated, so a target referenced N times produces a single warning.
- Bundled /clauditor skill: diagnostics discoverability (#134). The SKILL.md "Common errors" section now mentions clauditor lint and clauditor doctor, so users hitting a lint failure or a runner misconfiguration find the right next-step command from the in-skill text.
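The depth rule behind AGENTSKILLS_REFERENCE_DEPTH_TOO_DEEP can be illustrated per target. The sketch below covers only the classification of a single link target; the helper name is_too_deep is hypothetical, and link extraction, fenced-code skipping, and de-duplication are left out.

```python
import re
from pathlib import PurePosixPath

# Matches a leading URL scheme such as "https:" or "mailto:".
_SCHEME = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*:")


def is_too_deep(target: str) -> bool:
    """True when a target escapes upward or sits more than one directory deep."""
    if not target or target.startswith(("#", "/")) or _SCHEME.match(target):
        return False  # anchors, absolute paths, and URL schemes are skipped
    path = target.split("#", 1)[0]  # drop a trailing fragment
    parts = PurePosixPath(path).parts
    if ".." in parts:
        return True  # parent-escape
    return len(parts) > 2  # at most one directory plus the file name
```

Under these assumptions references/guide.md stays silent, while references/api/guide.md and ../shared.md would both be flagged.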
Changed
- AGENTSKILLS_LICENSE_EMPTY message phrasing (#129): agentskills.io spec sync. Previously suggested a "non-empty SPDX identifier", misleading for authors who want free-form license text. Updated to "non-empty license name or path to a bundled license file" to match the spec wording. No behavior change.
- Bundled /clauditor skill packaging (#134). The maintainer-only assets/clauditor.eval.json (the pre-release dogfood gate per DEC-007 of #43) no longer ships in the wheel; it remains in the repo for in-source dogfood runs only. SKILL.md docs/cli-reference.md references now point at stable blob/dev GitHub URLs so they resolve when SKILL.md is rendered outside the repo. allowed-tools trimmed to Bash(clauditor *), Bash(uv run clauditor *) (the redundant narrower entries are removed).
0.1.0 - 2026-04-25
First stable release on PyPI: https://pypi.org/project/clauditor-eval/0.1.0/.
Breaking changes
- Assertion dicts use per-type semantic keys (#67). The single overloaded value key on each assertions[] entry has been replaced with per-type keys: needle (on contains / not_contains), pattern (on regex / min_count), length (on min_length / max_length), count (on min_count / has_urls / has_entries / urls_reachable / has_format), and format (on has_format). Integer fields are native JSON ints, not strings: {"length": 500}, not {"length": "500"}. The loader rejects the old shape at load time with a per-type "did you mean?" hint pointing at the correct key. No back-compat window ships; hand-edit old specs to the new shape, or regenerate them with clauditor propose-eval --force. See docs/eval-spec-reference.md#assertion-types-and-per-type-keys for the full per-type key table.
- SkillResult.assert_* methods moved to SkillAsserter. The assert_contains / assert_not_contains / assert_matches / assert_has_urls / assert_has_entries / assert_min_count / assert_min_length / run_assertions helpers previously lived directly on SkillResult. They now live on a separate SkillAsserter class (src/clauditor/asserters.py) to preserve the data/asserter split per .claude/rules/data-vs-asserter-split.md. No deprecation shim ships; pre-1.0 accepts the hard break.
Migration: update tests that use the pytest plugin's
clauditor_runner fixture to wrap the result through
clauditor_asserter:
    # Before
    def test_my_skill(clauditor_runner):
        result = clauditor_runner.run("my-skill", args="...")
        result.assert_contains("Results")

    # After
    def test_my_skill(clauditor_runner, clauditor_asserter):
        result = clauditor_runner.run("my-skill", args="...")
        clauditor_asserter(result).assert_contains("Results")
See docs/pytest-plugin.md for the full
fixture list and assertion-method reference.
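For the per-type assertion keys introduced by #67, a load-time check in the spirit of the "did you mean?" hint might look like the sketch below. The key table is taken from the entry above; the "type" discriminator key, the helper name validate_assertion, and the message wording are assumptions, not the loader's actual behavior.

```python
from typing import Optional

# Per-type keys as listed in the changelog entry for #67.
KEYS_BY_TYPE = {
    "contains": ("needle",),
    "not_contains": ("needle",),
    "regex": ("pattern",),
    "min_count": ("pattern", "count"),
    "min_length": ("length",),
    "max_length": ("length",),
    "has_urls": ("count",),
    "has_entries": ("count",),
    "urls_reachable": ("count",),
    "has_format": ("format", "count"),
}


def validate_assertion(entry: dict) -> Optional[str]:
    """Return an error message for a bad entry, or None when it looks valid."""
    kind = entry.get("type")  # hypothetical discriminator key
    if "value" in entry:
        hint = " / ".join(KEYS_BY_TYPE.get(kind, ())) or "a per-type key"
        return f"{kind}: 'value' is no longer accepted; did you mean {hint}?"
    for key in ("length", "count"):
        # Native ints only: the string "500" is rejected. Using type() also
        # rejects bool, which is an assumption of this sketch.
        if key in entry and type(entry[key]) is not int:
            return f"{kind}: '{key}' must be a JSON integer, got {entry[key]!r}"
    return None
```

With this sketch, {"type": "min_length", "length": 500} passes, while {"type": "min_length", "length": "500"} and {"type": "contains", "value": "x"} each yield a message.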
Added
- clauditor badge: shields.io endpoint JSON for skill quality (#77). A non-LLM aggregator that reads the latest (or --from-iteration N) iteration sidecars and produces a shields.io-compatible JSON file that can be embedded in a README as a one-line Markdown image. Badge JSON carries a top-level schemaVersion: 1 (shields.io's contract) and a nested clauditor.schema_version: 1 extension block (our internal version, per .claude/rules/json-schema-version.md).
  - Positional <skill-path> loads via SkillSpec.from_file (derives skill_name from frontmatter, per .claude/rules/skill-identity-from-frontmatter.md).
  - Default output .clauditor/badges/<skill>.json; --output PATH accepts absolute paths, --force required to overwrite.
  - --url-only prints the Markdown image line to stdout with git remote get-url origin + default-branch auto-detection; --repo USER/REPO [--branch NAME] overrides; USER/REPO/main placeholder with a stderr warning when detection fails.
  - Color logic: brightgreen (L1 all-pass + L3 met or absent), yellow (L1 all-pass + L3 below thresholds), red (any L1 fail OR L3 all parse-failed), lightgrey (no iteration yet OR spec declares zero L1 assertions; both write a "no data" placeholder and exit 0 so CI pipelines have a persistent badge).
  - --style KEY=VALUE (repeatable) passes shields.io fields through: style, logoSvg, logoColor, labelColor, cacheSeconds, link. Unknown keys warn but still emit (shields.io ignores them).
  - Exit codes 0 / 1 / 2 per .claude/rules/llm-cli-exit-code-taxonomy.md (non-LLM branch): 0 success, 1 runtime failure (corrupt iteration, collision without --force, explicit-missing iteration), 2 input validation (mutual exclusion, bad --output parent, bad --style, bad --label).
  - Pure compute lives in src/clauditor/badge.py per .claude/rules/pure-compute-vs-io-split.md; CLI I/O lives in src/clauditor/cli/badge.py; git metadata wrapper in src/clauditor/_git.py.

  See docs/badges.md for the full placement guide (README primary, catalog-page secondary, SKILL.md body tradeoff) and the CI embedding recipe.
- clauditor lint: agentskills.io spec conformance check (#71). A non-LLM static check that validates a SKILL.md file against the agentskills.io specification. Safe to run on every commit and in CI: no tokens spent, no network calls.
  - Positional path argument; resolves absolute paths, follows symlinks, rejects directories with exit 1.
  - --strict promotes warnings to exit 2 (pre-publish gate). Load-layer parse failures (AGENTSKILLS_FRONTMATTER_INVALID_YAML, unreadable path) always exit 1 regardless of --strict.
  - --json emits a single JSON envelope to stdout ({"schema_version": 1, "skill_path": ..., "passed": bool, "issues": [...]}) with identical exit codes to the human-text output path.
  - Pure compute lives in src/clauditor/conformance.py per .claude/rules/pure-compute-vs-io-split.md: check_conformance returns a list[ConformanceIssue] with stable code / severity / message fields, testable without tmp_path or capsys.
  - Soft-warn hook on SkillSpec.from_file: every skill load now runs check_conformance and emits warning-severity issues to stderr with the clauditor.conformance: <CODE>: <message> prefix. Errors are silent at this seam; they surface through clauditor lint. The hook never blocks spec loading.
  - KNOWN_CLAUDE_CODE_EXTENSION_KEYS allowlist for frontmatter keys that Claude Code uses but the agentskills.io spec does not define. Initial contents: argument-hint, disable-model-invocation. Keys in the allowlist do NOT trigger AGENTSKILLS_FRONTMATTER_UNKNOWN_KEY. The maintainer-only /review-agentskills-spec skill maintains the allowlist against Claude Code's published frontmatter documentation (DEC-013).
  - paths.derive_skill_name warning emission retired: the previous warnings for invalid frontmatter-name fallback and frontmatter-vs-filesystem mismatch are now produced by check_conformance via the soft-warn hook, the single source of truth for frontmatter-name warnings. The helper's return type simplified from tuple[str, str | None] to plain str.

  See docs/cli-reference.md#lint and .claude/rules/skill-identity-from-frontmatter.md.
- Runner auth-source control + configurable timeout (#64). Four skill-invoking CLI commands (validate, grade, capture, run) and the pytest plugin gained two knobs that together unblock Pro/Max subscribers iterating on research-heavy skills:
  - --no-api-key / --clauditor-no-api-key (pytest) strip both ANTHROPIC_API_KEY and ANTHROPIC_AUTH_TOKEN from the claude -p subprocess environment so the child falls back to whatever auth is cached in ~/.claude/ (typically a Pro/Max subscription with a much higher throughput ceiling than the API-key tier). Non-auth Anthropic env vars such as ANTHROPIC_BASE_URL are preserved.
  - --timeout SECONDS overrides the runner's 180-second watchdog on a per-invocation basis. Must be a positive integer; argparse rejects <= 0 / non-int with exit 2. Precedence is CLI > spec > default: the flag wins when passed explicitly, otherwise a new EvalSpec.timeout field wins when set, otherwise the built-in 180s default applies. EvalSpec.timeout is load-time validated (positive int only; bool is explicitly rejected because it is an int subclass in Python).

  SkillResult.api_key_source carries the apiKeySource value parsed from the stream-json system/init event (when present); the runner prints one stderr info line of the form clauditor.runner: apiKeySource=<value> per run. Values are labels ("ANTHROPIC_API_KEY", "claude.ai", "none"), not secrets. Older claude builds that omit the field leave api_key_source at None and suppress the stderr line. See docs/cli-reference.md#shared-runner-flags-validate-grade-capture-run, docs/eval-spec-reference.md#optional-top-level-fields, and docs/stream-json-schema.md. Precedence shape codified in .claude/rules/spec-cli-precedence.md.
- Runner error surfacing (#63). SkillResult gained an error_category field ("rate_limit" | "auth" | "api" | "interactive" | "subprocess" | "timeout" | None) that classifies any non-clean signal alongside the existing error string. CLI error rendering now surfaces stream-json is_error: true result messages with the correct category hint, and an interactive-hang heuristic flags runs that stop after one turn with a trailing ? or AskUserQuestion tool call. The heuristic can be disabled per-spec via allow_hang_heuristic: false. A new succeeded_cleanly predicate distinguishes "actually completed cleanly" from the lenient succeeded flag. See docs/pytest-plugin.md and docs/stream-json-schema.md.
- Modern <name>/SKILL.md skill layout (#66). Skill discovery now supports both the legacy .claude/commands/<name>.md layout and the modern .claude/skills/<name>/SKILL.md directory layout used by Anthropic's plugin / agentskills.io ecosystem. Skill identity is derived from YAML frontmatter name: first, falling back to a layout-aware filesystem derivation (directory name for modern layout, file stem for legacy). Invalid or mismatched names emit a stderr warning and fall through rather than hard-failing. See .claude/rules/skill-identity-from-frontmatter.md.
- Blind A/B judge framing via user_prompt. EvalSpec gained an optional top-level user_prompt: str | None field that feeds the conversational framing into blind_compare_from_spec and the clauditor_blind_compare pytest fixture. Distinct from test_args (which is the CLI arg string for the skill subprocess). See docs/eval-spec-reference.md#optional-top-level-fields.
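The badge color rules listed under clauditor badge above can be written as a small pure function. This is a sketch under stated assumptions: the function name, its parameters, and the l3_status labels are hypothetical, and checking the "no data" case before the red case is a choice the changelog does not spell out.

```python
from typing import Optional


def badge_color(
    l1_total: int,
    l1_failed: int,
    l3_status: Optional[str],
    has_iteration: bool = True,
) -> str:
    """Pick a shields.io color from L1 results and an L3 summary.

    l3_status is a hypothetical summary label: None (no L3 configured),
    "met", "below", or "parse_failed" (every L3 result failed to parse).
    """
    if not has_iteration or l1_total == 0:
        return "lightgrey"  # "no data" placeholder
    if l1_failed > 0 or l3_status == "parse_failed":
        return "red"
    if l3_status == "below":
        return "yellow"  # L1 all-pass, L3 below thresholds
    return "brightgreen"  # L1 all-pass, L3 met or absent
```

Keeping the decision free of file and git access matches the pure-compute versus I/O split that the entry describes for src/clauditor/badge.py.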
Added (prior unreleased)
- clauditor propose-eval <SKILL.md>: LLM-assisted EvalSpec bootstrap. Reads SKILL.md and an optional captured run, asks Sonnet to propose a full 3-layer EvalSpec (L1 assertions, L2 tiered extraction, L3 rubric), validates the proposal through EvalSpec.from_dict, and writes the sibling <skill>.eval.json (the same discovery path SkillSpec.from_file and clauditor init use, so validate / grade auto-discover it). Captures are scrubbed through transcripts.redact (DEC-008) and the sidecar preserves the non-mutating-scrub invariant. See docs/cli-reference.md#propose-eval for flags and exit codes.
- Privacy: SuggestReport.to_json() scrubs api_error through transcripts.redact() before emitting so secret-shaped substrings (Anthropic keys, GitHub PATs, Bearer tokens) are redacted on disk. In-memory self.api_error is unchanged (non-mutating scrub per .claude/rules/non-mutating-scrub.md).
- cmd_triggers now exits 1 with ERROR: No trigger_tests defined in eval spec when the spec is missing trigger_tests (previously printed an empty Trigger Precision: block and exited 0, a CI-silent-failure hazard).
- Root README restructured: deep-reference content promoted into docs/*.md files with teasers + anchor preservation. README is now ~165 lines (was 770). See .claude/rules/readme-promotion-recipe.md for the codified recipe.
- Bundled /clauditor Claude Code slash command installable via clauditor setup.
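The non-mutating scrub mentioned in the Privacy entry above means the redaction is applied to a copy at serialization time while the in-memory value is left alone. A minimal sketch, assuming a stand-in Report class and a single illustrative Bearer-token pattern (the real transcripts.redact rules are not shown in this changelog):

```python
import json
import re
from dataclasses import dataclass
from typing import Optional

# Illustrative pattern only; it stands in for the project's redaction rules.
_SECRET = re.compile(r"Bearer\s+[A-Za-z0-9._-]+")


def redact(text: str) -> str:
    """Return a scrubbed copy of text; the input string is not modified."""
    return _SECRET.sub("Bearer [REDACTED]", text)


@dataclass
class Report:
    """Hypothetical stand-in for a report object with an api_error field."""

    api_error: Optional[str] = None

    def to_json(self) -> str:
        # Scrub a copy while serializing; self.api_error stays as-is.
        scrubbed = redact(self.api_error) if self.api_error is not None else None
        return json.dumps({"api_error": scrubbed})


report = Report(api_error="401 for Bearer abc.def-123")
serialized = report.to_json()
```

After the call, the serialized text no longer contains the token, while report.api_error still holds the original string for in-process debugging.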