Skip to content

Quick Start

End-to-end walkthrough from "I have a skill" to "it's validated, graded, and covered in pytest." Read this after installing clauditor when you want a concrete example of the init → validate → test loop. For a deeper dive into each layer's behavior, see Three Layers of Validation.

Returning from the root README. This doc is the full reference; the README has a summary with code examples.

1. Create an eval spec for your skill

clauditor init .claude/commands/my-skill.md

This creates my-skill.eval.json alongside your skill file:

{
  "skill_name": "my-skill",
  "description": "Eval spec for /my-skill",
  "test_args": "",
  "assertions": [
    {"id": "min_length_500", "type": "min_length", "length": 500},
    {"id": "has_urls_3", "type": "has_urls", "count": 3},
    {"id": "has_entries_3", "type": "has_entries", "count": 3},
    {"id": "no_error", "type": "not_contains", "needle": "Error"}
  ],
  "sections": [
    {
      "name": "Results",
      "tiers": [
        {
          "label": "default",
          "min_entries": 3,
          "fields": [
            {"id": "results_name", "name": "name", "required": true},
            {"id": "results_address", "name": "address", "required": true}
          ]
        }
      ]
    }
  ],
  "grading_criteria": [
    {"id": "relevant", "criterion": "Are results relevant to the query?"},
    {"id": "specific", "criterion": "Are descriptions specific (not generic filler)?"}
  ],
  "grading_model": "claude-sonnet-4-6",
  "trigger_tests": {
    "should_trigger": [],
    "should_not_trigger": []
  },
  "variance": {
    "n_runs": 3,
    "min_stability": 0.8
  }
}

Fill in test_args and customize assertions, sections, and grading criteria for your skill. Or skip straight to step 2 — propose-eval can bootstrap a full three-layer spec from the SKILL.md plus a capture.

# Run the skill and save output to tests/eval/captured/my-skill.txt
clauditor capture my-skill

# Pass initial context if the skill needs it upfront
clauditor capture my-skill -- "San Jose, CA"

# Subscription auth + longer watchdog for research-heavy skills
clauditor capture my-skill --no-api-key --timeout 300

The captured output becomes the grounding context for propose-eval (auto-discovered from tests/eval/captured/) and a replayable fixture for tightening assertions in validate.

Interactive skills: capture runs claude -p — strictly single-turn, no stdin. If the skill asks a question mid-run it will hang at that point. Put all decision context in -- args (and mirror it in test_args in your eval spec) so the skill gets everything upfront.

3. Bootstrap a full eval spec with propose-eval

# Reads SKILL.md + the capture from step 2, writes my-skill.eval.json
clauditor propose-eval .claude/commands/my-skill.md

Requires ANTHROPIC_API_KEY or an authenticated claude CLI.

4. Validate against captured output

# Run skill and validate in one step:
clauditor validate .claude/commands/my-skill.md

# JSON output for CI:
clauditor validate .claude/commands/my-skill.md --json

Running under Codex instead of Claude Code

Every skill-running command (validate, grade, capture, run) accepts --harness codex to drive the skill through the OpenAI Codex CLI instead of claude -p. Export CODEX_API_KEY (or OPENAI_API_KEY) first:

export CODEX_API_KEY=sk-...
clauditor validate .claude/commands/my-skill.md --harness codex

The default --harness auto picks Claude Code when claude is on PATH, else Codex. The runtime axis is independent of the grader — a skill can run under Codex while L3 still grades with Claude Sonnet. Full reference: Codex harness.

5. Use in pytest

def test_my_skill(clauditor_runner, clauditor_asserter):
    runner = clauditor_runner()  # factory fixture; pass harness="codex" to run under Codex
    result = runner.run("my-skill", '"San Jose, CA" --depth quick')
    asserter = clauditor_asserter(result)
    asserter.assert_contains("Results")
    asserter.assert_has_entries(minimum=3)
    asserter.assert_has_urls(minimum=3)

def test_with_eval_spec(clauditor_spec):
    spec = clauditor_spec(".claude/commands/my-skill.md")
    results = spec.evaluate()
    assert results.passed, results.summary()