Codex harness
This doc covers running your skill under the OpenAI Codex CLI instead of Claude Code. If you just want to grade Anthropic-authored skills with Claude as the runtime, you can skip this — the default --harness auto resolves to claude-code whenever the claude binary is on PATH.
Returning from the root README. This doc is the full reference for the harness axis; the README has a one-paragraph summary.
The harness axis is independent of the grader provider axis. A common shape: skill runs under Codex (--harness codex), and the L3 grader still calls Claude Sonnet (--grading-provider anthropic). The two CLIs are evaluating each other.
When to use it
- You don't have a Claude Code subscription and don't want one: Codex bills against OPENAI_API_KEY directly.
- You want to compare how the same skill behaves under different runtimes (e.g. baseline A/B between Claude and Codex).
- You're auditing a skill author's claim that their SKILL.md is harness-agnostic.
Quick start
# 1. Install the Codex CLI separately (npm install -g @openai/codex or similar).
$ which codex
/usr/local/bin/codex
# 2. Export an auth env var: CODEX_API_KEY or OPENAI_API_KEY. Either works; CODEX_API_KEY wins if both are set.
$ export CODEX_API_KEY=sk-...
# 3. Run a skill under Codex.
$ clauditor validate path/to/SKILL.md --harness codex
$ clauditor grade path/to/SKILL.md --harness codex
$ clauditor capture path/to/SKILL.md --harness codex --output codex-run.txt
Four-layer harness resolution
Same precedence shape as --transport and --grading-provider:
| Layer | Value | Notes |
|---|---|---|
| CLI flag | --harness {claude-code,codex,auto} | On validate, grade, capture, run only. LLM-mediated commands (extract, triggers, compare --blind, propose-eval, suggest) have no harness axis: they call call_model directly without running a skill subprocess. |
| Env var | CLAUDITOR_HARNESS={claude-code,codex,auto} | Whitespace-only values are normalized to "unset". |
| Spec field | EvalSpec.harness: str = "auto" | Per-skill author preference. Validated at load time. |
| Default | "auto" | shutil.which("claude") first; then shutil.which("codex"); hard-fail with a three-escape-hatch error if neither is on PATH. |
When the auto branch lands on codex because claude is not on PATH, clauditor emits a one-time stderr notice per process pointing at the env vars and explicit-pin escape hatches — see "Implicit-coupling announcements" in .claude/rules/centralized-sdk-call.md. An explicit --harness codex (or any non-auto value) stays silent.
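The precedence above can be sketched in a few lines of Python. This is an illustration, not clauditor's actual code: resolve_harness, VALID_HARNESSES, and the notice text are hypothetical names, and the sketch assumes that an explicit "auto" in a higher layer counts as a value (it triggers PATH detection) rather than falling through to the next layer.

```python
import shutil
import sys
from typing import Callable, Mapping, Optional

VALID_HARNESSES = ("claude-code", "codex", "auto")
_auto_notice_shown = False  # keeps the auto->codex notice to once per process


def resolve_harness(
    cli_flag: Optional[str],
    env: Mapping[str, str],
    spec_value: str = "auto",
    which: Callable[[str], Optional[str]] = shutil.which,
) -> str:
    """Hypothetical helper: pick a harness using CLI > env > spec > default."""
    global _auto_notice_shown
    # Whitespace-only env values count as unset.
    env_value = env.get("CLAUDITOR_HARNESS", "").strip() or None
    # The first layer that supplies a value wins (assumption: "auto" is a value).
    chosen = next(
        v for v in (cli_flag, env_value, spec_value, "auto") if v is not None
    )
    if chosen not in VALID_HARNESSES:
        raise ValueError(f"unknown harness {chosen!r}")
    if chosen != "auto":
        return chosen  # explicit pins stay silent
    if which("claude"):
        return "claude-code"
    if which("codex"):
        if not _auto_notice_shown:
            # Illustrative wording only; the real notice text may differ.
            print("notice: auto resolved to codex (claude not on PATH)",
                  file=sys.stderr)
            _auto_notice_shown = True
        return "codex"
    raise RuntimeError("neither 'claude' nor 'codex' found on PATH")
```

Passing which as a parameter keeps the sketch testable without touching the real PATH.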
Auth precedence inside the Codex subprocess
The harness exports auth into the skill subprocess in this order:
1. CODEX_API_KEY: Codex-specific key. Wins if set.
2. OPENAI_API_KEY: fallback. Codex CLI accepts this directly.
3. Cached ~/.codex/auth.json: only honored when auth_mode == "apikey". Clauditor refuses ChatGPT-mode credentials at pre-flight per #177 because the codex subprocess would route via ChatGPT and reject every model.
If none of the three is available, the pre-flight check_codex_auth guard raises CodexAuthMissingError and the CLI exits 2 with an actionable message naming the required env vars.
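A rough sketch of that lookup order, for orientation only. pick_auth_source, load_cached_auth, and AuthUnavailable are hypothetical names standing in for the real check_codex_auth guard and CodexAuthMissingError, and the sketch assumes auth.json is a JSON object with a top-level auth_mode key.

```python
import json
from pathlib import Path
from typing import Any, Mapping, Optional


class AuthUnavailable(Exception):
    """Hypothetical stand-in for the real CodexAuthMissingError."""


def load_cached_auth(path: Path) -> Optional[dict]:
    """Return the parsed auth file, or None if it is missing or unreadable."""
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):  # missing file, bad encoding, or invalid JSON
        return None
    return data if isinstance(data, dict) else None


def pick_auth_source(
    env: Mapping[str, str],
    cached: Optional[Mapping[str, Any]] = None,
) -> str:
    """Name the auth source that would be used, in precedence order."""
    for name in ("CODEX_API_KEY", "OPENAI_API_KEY"):
        if env.get(name):
            return name  # env vars win over any cached credentials
    if cached is not None:
        if cached.get("auth_mode") == "apikey":
            return "auth.json"
        # Assumed layout: any other auth_mode (e.g. "chatgpt") is unusable.
        raise AuthUnavailable("cached credentials are not API-key mode")
    raise AuthUnavailable("set CODEX_API_KEY or OPENAI_API_KEY")
```

A caller would pass os.environ and load_cached_auth(Path.home() / ".codex" / "auth.json"), then map AuthUnavailable to exit code 2.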
Sandbox modes
Codex supports three sandbox levels: read-only, workspace-write, danger-full-access. Clauditor pins workspace-write today — enough for skills that need to write under the workspace but not enough for skills that need network or arbitrary filesystem access. The pinned value is stamped on IterationContext.sandbox_mode for audit visibility. Making it user-configurable is a future ticket; if your skill needs a different mode, file an issue.
Stream parser
Codex emits NDJSON on stdout (different from Claude's stream-json). The parser is in src/clauditor/_harnesses/_codex.py; the on-disk contract — event types, failure surface, harness_metadata keys, advisory warnings — is documented at docs/codex-stream-schema.md. You shouldn't need to read it unless you're contributing to the harness itself or debugging a parse-level failure.
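If you do end up debugging a parse-level failure, it helps to know what NDJSON parsing looks like in general: one JSON object per line, with unparseable lines counted rather than fatal. The sketch below is generic and is not the parser in _codex.py; parse_ndjson is a hypothetical name, and the event types in the usage example are placeholders, not real Codex event types.

```python
import json
from typing import Any, Dict, Iterable, List, Tuple


def parse_ndjson(lines: Iterable[str]) -> Tuple[List[Dict[str, Any]], int]:
    """Parse newline-delimited JSON; return (events, dropped_count)."""
    events: List[Dict[str, Any]] = []
    dropped = 0
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank lines are separators, not events
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            dropped += 1  # malformed line: count it and keep going
            continue
        if isinstance(obj, dict):
            events.append(obj)
        else:
            dropped += 1  # valid JSON, but not an object-shaped event
    return events, dropped


# Placeholder event types, for illustration only.
sample = ['{"type": "example.start"}', "", "not json", '{"type": "example.end"}']
events, dropped = parse_ndjson(sample)
```

A dropped count like this is the kind of value that could feed a dropped-events entry in harness_metadata; the authoritative contract remains docs/codex-stream-schema.md.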
What lands on disk
Every Codex run produces the same sidecar shape as Claude Code runs, with one observability difference:
- assertions.json records harness: "codex".
- grading.json / extraction.json record harness: "codex", plus the grader's provider/model untouched.
- context.json records harness: "codex", sandbox_mode: "workspace-write", plus harness_metadata carrying Codex-specific keys (auth source, dropped-events count when applicable).
- history.jsonl: each appended record carries harness: "codex" so clauditor trend can refuse to average across harnesses unless --cross-harness is passed.
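To see why the per-record harness field matters, here is a minimal sketch of a mixed-harness guard over history.jsonl-style records. It is not the clauditor trend implementation: trend_mean is a hypothetical name, and the score field is an assumed metric key used only for illustration. Only the harness key comes from the description above.

```python
import json
from statistics import mean
from typing import Iterable, List, Set


def trend_mean(jsonl_lines: Iterable[str], cross_harness: bool = False) -> float:
    """Average the (assumed) score field, refusing to mix harnesses by default."""
    harnesses: Set[str] = set()
    scores: List[float] = []
    for line in jsonl_lines:
        if not line.strip():
            continue  # tolerate trailing blank lines
        record = json.loads(line)
        harnesses.add(record["harness"])
        scores.append(float(record["score"]))  # "score" is a hypothetical key
    if not scores:
        raise ValueError("no history records to average")
    if len(harnesses) > 1 and not cross_harness:
        raise ValueError(
            f"records span harnesses {sorted(harnesses)}; "
            "pass cross_harness=True to average anyway"
        )
    return mean(scores)
```

With one record from each harness, the default call raises, and cross_harness=True returns the pooled mean.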
Limitations
- No subscription path. Anthropic users with a Claude Pro/Max subscription can grade for free via --transport cli; Codex has no equivalent, so every call bills against an API key.
- Hardcoded sandbox. workspace-write only. Future ticket.
- No allow-hang heuristic. The Claude Code harness has an allow_hang_heuristic knob that classifies "model returned a question" as a failure; Codex doesn't expose the parse hooks needed for it. The flag is silently a no-op when the resolved harness is Codex.
- Pytest fixtures. clauditor_runner and clauditor_spec both honor the harness axis; the eager check_codex_auth guard fires at fixture-setup time, so a missing key surfaces as a test-setup error rather than a deep subprocess failure mid-run.
Pairing harness with grader
The four useful combinations:
| Harness | Grader | When to use |
|---|---|---|
| claude-code | anthropic | Default. Subscription-friendly if you have Claude Pro/Max. |
| claude-code | openai | Audit Claude-authored skills with a non-Claude grader to reduce same-vendor bias. |
| codex | anthropic | Anthropic-graded evaluation of Codex behavior. Common when comparing runtimes. |
| codex | openai | All-OpenAI stack. |
Resolve provider per docs/transport-architecture.md and pair via independent flags — clauditor will refuse to silently average across mismatched harness OR provider axes during clauditor trend / clauditor compare.
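Because the two axes are independent flags, the full pairing matrix is just a cross product. The sketch below only builds the command lines and does not run them. grade_commands is a hypothetical helper, and it assumes --grading-provider accepts openai in the same form as the anthropic value shown earlier.

```python
import itertools
import shlex
from typing import List

HARNESSES = ("claude-code", "codex")
GRADERS = ("anthropic", "openai")  # assumption: both are valid flag values


def grade_commands(skill_path: str) -> List[List[str]]:
    """Build one `clauditor grade` argv per harness/grader combination."""
    return [
        ["clauditor", "grade", skill_path,
         "--harness", harness, "--grading-provider", grader]
        for harness, grader in itertools.product(HARNESSES, GRADERS)
    ]


commands = grade_commands("path/to/SKILL.md")
for argv in commands:
    print(shlex.join(argv))  # shell-quoted, ready to paste
```

Keeping each argv as a list makes it easy to hand to subprocess.run later without re-parsing a shell string.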
Troubleshooting
| Symptom | Cause | Fix |
|---|---|---|
| Exit 2: CodexAuthMissingError | Neither CODEX_API_KEY nor OPENAI_API_KEY set; no usable cached auth | Export one of the two env vars. |
| Exit 2: "ChatGPT-mode credentials refused" | ~/.codex/auth.json has auth_mode == "chatgpt" | Re-authenticate Codex with an API key flow, OR export CODEX_API_KEY to override the cached creds. |
| Auto resolved to codex unexpectedly | claude not on PATH | Install Claude Code, or pin --harness=claude-code / CLAUDITOR_HARNESS=claude-code. |
| "stderr line about auto→codex" | First time the auto branch picked Codex this process | One-shot notice. Pin explicitly to silence. |
| Skills that worked under Claude fail under Codex | Different sandbox / runtime semantics | Expected — that's what cross-harness evaluation reveals. The skill is harness-coupled. |
See also clauditor doctor for environment-level diagnostics (which binaries are on PATH, which env vars are set, which auth paths resolve).