SignalForge

LLM-drafted dbt schema.yml, tests, and docs — pruned against real warehouse data so only signal-bearing tests ship.

Status: v0.1 alpha. Eleven issues shipped — single-model draft + warehouse prune, BigQuery adapter, signalforge CLI, signalforge init-demo for first-run UX. Designing in the open on the dev branch.

Why this exists

Authoring schema.yml, tests, and documentation is the most-cited drudgery in the dbt ecosystem. AI tools that generate them already exist — dbt Copilot, dbt-codegen, Paradime DinoAI, Altimate datapilot — but their output is consistently described the same way: noise. Hundreds of not_null and unique tests that always pass. Generic docstrings that paraphrase the column name. Schemas that drift from the SELECT.

SignalForge generates the same artifacts, then asks a different question: does this test produce signal? Every candidate test is run against your real warehouse data. Tests that always pass are dropped. Docs are graded against a project-specific rubric. Only signal-bearing artifacts are written to disk.

And you don't have to start from SignalForge's own drafts. Point it at a schema.yml that dbt Copilot, dbt-codegen, DinoAI, datapilot — or your own hands — already produced, and it prunes that: signalforge prune-existing <model> --schema <path> runs the same warehouse-backed prune over your existing tests, no LLM call required.

What it does

  • Drafts schema.yml from your model SQL using an LLM with project-aware context (manifest, sibling models, your team's terminology).
  • Generates tests: not_null, unique, accepted_values, relationships, plus dbt-expectations-style data tests where appropriate.
  • Prunes the noise. Each candidate test runs against warehouse samples; tests that pass on every row of historical data add no signal and are dropped before they reach your repo.
  • Generates documentation — column-level descriptions and model-level overviews — graded by an LLM-as-judge against a configurable rubric.
  • Reports what was kept and what was dropped, with a one-line "why" per artifact. No black-box generation.
  • Prunes tests you already have (v0.2). Point it at an existing schema.yml — from dbt-codegen, dbt Copilot, DinoAI, datapilot, or hand-written — and the warehouse tells you which of those tests add no signal. Same prune step, no LLM call (signalforge prune-existing).

How it works

┌──────────────┐    ┌─────────────┐    ┌──────────────┐    ┌──────────────┐
│ model.sql +  │ -> │ LLM drafts  │ -> │ Run tests    │ -> │ Quality-     │
│ manifest +   │    │ candidate   │    │ against the  │    │ graded YAML  │
│ project ctx  │    │ artifacts   │    │ warehouse    │    │ + diff       │
└──────────────┘    └─────────────┘    └──────────────┘    └──────────────┘
                                              │
                                              v
                                       Drop always-pass tests;
                                       drop tests that fail on
                                       known-clean data.
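
The always-pass rule in the diagram can be sketched in a few lines of Python. This is an illustration only, not SignalForge's implementation: `CandidateResult`, `classify`, and the `failing_rows` field are hypothetical names, and the sketch covers only the always-pass rule and the case where the warehouse could not be queried, not the known-clean-data check.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CandidateResult:
    """Outcome of one candidate test on a warehouse sample (hypothetical shape)."""
    name: str
    failing_rows: Optional[int]  # None means the test could not be evaluated


def classify(result: CandidateResult) -> str:
    """Map a result to one of the tier names used later in this README."""
    if result.failing_rows is None:
        return "kept-uncertain"  # no evidence either way, so do not drop
    if result.failing_rows == 0:
        return "dropped"  # passed on every sampled row: no signal
    return "kept"  # the test caught at least one row


results = [
    CandidateResult("region.not_null", 0),
    CandidateResult("trip_id.not_null", 3),
    CandidateResult("bike_id.unique", None),
]
tiers = {r.name: classify(r) for r in results}
```

The point of the three-way split is that "could not evaluate" is kept separate from "evaluated and found useless", so a connectivity problem never looks like a reason to delete a test.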

The grading layer reuses clauditor's LLM-as-judge methodology, applied to a new artifact class.

There's a second entry point that skips the LLM entirely. If you already have a schema.yml (from another generator or written by hand), signalforge prune-existing reads its tests and runs them straight through the prune step:

┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│ existing     │ -> │ Run tests    │ -> │ diff: which  │
│ schema.yml   │    │ against the  │    │ tests add    │
│ + manifest   │    │ warehouse    │    │ signal       │
└──────────────┘    └──────────────┘    └──────────────┘

No draft, no grade, no LLM call — just "which of these tests earn their place?" Tests SignalForge can't evaluate (custom / dbt-expectations / namespaced generics) are reported as skipped, never silently dropped.

Status (v0.1): Live on PyPI — pip install signalforge-dbt. See Quick start.

Quick start

The wheel ships a minimal dbt demo project (Austin bikeshare staging model against the public bigquery-public-data.austin_bikeshare.bikeshare_trips dataset), copied out of the install via signalforge init-demo, so you can run signalforge end-to-end against a real warehouse with no infrastructure beyond a Google Cloud billing project and an Anthropic API key. A run scans ~200–500 MB of BigQuery (well under $0.01 at on-demand pricing) plus ~$0.13 of Anthropic spend (one draft call + ~84 grade calls on Sonnet 4.6); end-to-end wall-clock is roughly 5–6 minutes.
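
The cost figures above can be sanity-checked with a little arithmetic. The sketch below assumes a BigQuery on-demand rate of 6.25 USD per TiB scanned; that rate is an assumption and not taken from this README, so check current pricing for your region. `scan_cost_usd` is a hypothetical helper, not part of SignalForge.

```python
TIB = 2**40  # bytes in one tebibyte


def scan_cost_usd(bytes_scanned: int, usd_per_tib: float = 6.25) -> float:
    """Estimate on-demand query cost from bytes scanned (assumed rate)."""
    return bytes_scanned / TIB * usd_per_tib


low = scan_cost_usd(200 * 10**6)   # ~200 MB scan
high = scan_cost_usd(500 * 10**6)  # ~500 MB scan
llm_spend = 0.13                   # Anthropic spend quoted above
total_high = high + llm_spend

print(f"BigQuery: ${low:.4f} to ${high:.4f}; total with LLM spend: about ${total_high:.2f}")
```

Under that assumed rate both ends of the scan range stay well below one cent, so the Anthropic calls dominate the cost of a run.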

1. Install

SignalForge requires Python 3.11+.

pip install signalforge-dbt

Verify the CLI is on your PATH:

signalforge --version

The PyPI distribution name is signalforge-dbt (the bare signalforge name is held by an unrelated DSP package); the import package and CLI command are both signalforge.

Prefer an isolated CLI install? uv tool install signalforge-dbt (or pipx install signalforge-dbt) puts the signalforge command on your PATH without adding it to a project environment.

Working from a clone (contributing)? Install the dev toolchain with uv sync --dev — see CONTRIBUTING.md for the full workflow.

2. Authenticate to BigQuery and Anthropic

gcloud auth application-default login
export GOOGLE_CLOUD_PROJECT=<your-billing-project>   # any GCP project you have query access to
export ANTHROPIC_API_KEY=sk-ant-...

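Use a fresh shell session (or unset ANTHROPIC_API_KEY after the run) so the key doesn't linger in your environment. Note that typing the key into an export command can also record it in your shell history, which unset does not clear.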
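
If you wrap the demo in a script, a quick check that both variables are set can save a failed run. This is a minimal sketch; `missing_env` is a hypothetical helper and not a SignalForge function.

```python
import os
from typing import Mapping

# The two variables exported in the step above.
REQUIRED = ("GOOGLE_CLOUD_PROJECT", "ANTHROPIC_API_KEY")


def missing_env(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env()
    if missing:
        raise SystemExit("missing environment variables: " + ", ".join(missing))
```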

3. Minimum signalforge.yml

The fixture ships a working config; a minimum that exercises the full pipeline is:

# signalforge.yml — alongside dbt_project.yml
llm:
  model: claude-sonnet-4-6
safety:
  mode: aggregate-only   # schema-only is the default; aggregate-only sends column profiles, never row data
prune:
  sample_strategy: materialised   # v0.2 default; one temp-table CTAS feeds every per-test query
grade:
  min_pass_rate: 0.95
  min_mean_score: 0.95
  fail_on_below_threshold: false   # report-only; flip to true to exit 2 on flagged artifacts

Full reference: docs/safety-ops.md, docs/prune-ops.md, docs/grade-ops.md.
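
To make the grade block concrete, here is one way the three settings could interact. The semantics are assumed, not confirmed by this README: the sketch treats pass rate as the fraction of rubric criteria an artifact passed and mean score as the arithmetic mean of its per-criterion scores. `is_flagged` and `exit_code` are hypothetical names; docs/grade-ops.md is the authority on the real behaviour.

```python
from statistics import mean


def is_flagged(
    scores: list[float],
    passed: list[bool],
    min_pass_rate: float = 0.95,
    min_mean_score: float = 0.95,
) -> bool:
    """Flag an artifact when either threshold is missed (assumed semantics)."""
    pass_rate = sum(passed) / len(passed)
    return pass_rate < min_pass_rate or mean(scores) < min_mean_score


def exit_code(flagged_count: int, fail_on_below_threshold: bool = False) -> int:
    """Report-only by default; exit 2 on flagged artifacts when enabled."""
    return 2 if fail_on_below_threshold and flagged_count > 0 else 0


flagged = [
    is_flagged([0.97, 0.98], [True, True]),  # clears both thresholds
    is_flagged([0.45], [False]),             # well below 0.95
]
```

With fail_on_below_threshold left at false, the second artifact would still be reported as flagged, but the process would exit 0.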

4. Prepare the fixture

Copy the bundled demo project to a writable directory; the later steps run signalforge against it:

signalforge init-demo /tmp/sf-austin

5. Pre-flight check (signalforge lint)

Before paying for an LLM call, run the pre-flight validator. It loads signalforge.yml (every per-stage block) and the dbt manifest — no warehouse calls, no Anthropic calls, no network — and reports every failure in one shot. Sub-second; catches typos like safety: { mdoel: ... } that the extra="forbid" config models would otherwise surface only after a billable generate run, plus manifest schema-version mismatches (e.g. dbt 1.13 → v13, outside the supported v9–v12 range) that would otherwise surface mid-pipeline:

signalforge lint --project-dir /tmp/sf-austin

On success, stdout is silent (git-style) and the exit code is 0. Failures are listed on stderr with the offending block(s) named — single-failure runs use the ERROR: <message> shape; multi-failure runs emit a header + one bullet per block. Pass --model <name> to also confirm a specific model resolves in the manifest (accepts a bare name, a unique_id, or a file path). See docs/cli-ops.md § signalforge lint for the full contract.
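
The "report every failure in one shot" behaviour is easy to picture with a standard-library analogue of an extra="forbid" check. This is a sketch, not the lint implementation: `ALLOWED` lists only the keys from the minimum signalforge.yml shown earlier rather than the full schema, and `lint_config` is a hypothetical name.

```python
# Partial allow-list: only the keys from the minimum config in this README.
ALLOWED = {
    "llm": {"model"},
    "safety": {"mode"},
    "prune": {"sample_strategy"},
    "grade": {"min_pass_rate", "min_mean_score", "fail_on_below_threshold"},
}


def lint_config(config: dict) -> list[str]:
    """Collect every unknown block or key instead of stopping at the first."""
    errors = []
    for block, values in config.items():
        if block not in ALLOWED:
            errors.append(f"{block}: unknown block")
            continue
        for key in sorted(set(values or {}) - ALLOWED[block]):
            errors.append(f"{block}.{key}: unknown key")
    return errors


errors = lint_config(
    {
        "safety": {"mdoel": "aggregate-only"},  # typo from the example above
        "grade": {"min_pass_rate": 0.95, "min_score": 0.9},
    }
)
```

Collecting errors into a list, rather than raising on the first one, is what lets a single pre-flight run surface typos in several blocks at once.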

6. First run

signalforge generate models/staging/stg_bikeshare_trips.sql --project-dir /tmp/sf-austin

The bundled profiles.yml reads GOOGLE_CLOUD_PROJECT from your environment, so no profile editing is required. signalforge init-demo prints a next-steps message naming the env vars and the exact commands to run; pass --force to atomically replace a non-empty destination (refuses /, $HOME, and the current working directory as a blast-radius guard).

Want to preview cost first? signalforge generate --estimate <model> prints the projected USD + warehouse bytes without making any billable Anthropic or warehouse call (one count_tokens round-trip per prompt plus a single BigQuery dryRun). See docs/cli-ops.md § --estimate for the full contract.

7. Expected output

The diff lists drafted column descriptions and signal-bearing tests alongside dropped tests with a one-line "why". Every artifact lands in one of four tiers — kept (survived prune with positive evidence), kept-uncertain (kept, but the warehouse couldn't be reached to evaluate it — e.g. a budget or connectivity issue), dropped (prune found it adds no signal), and flagged (kept, but graded below the quality threshold). The table looks like this (truncated):

diff: model.austin.stg_bikeshare_trips  kept=8  kept-uncertain=0  dropped=2  flagged=1

TIER      ARTIFACT                      TEST            REASON                  SCORE    WHY
kept      column.trip_id.description                                            0.97     Description added; passed all grading criteria.
kept      test.column.trip_id.not_null  not_null                                —        Test returned non-zero failing rows on the warehouse sample.
dropped   test.column.region.not_null   not_null        always-passes           —        Test returned zero failing rows on the representative sample.
flagged   column.bike_id.description                                            0.45     Grading score 0.45 below threshold 0.95.
...
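
If you only need the headline counts, the summary line can be parsed with a small regular expression. This sketch assumes the line keeps the key=value shape shown above; `parse_summary` is a hypothetical helper. For anything durable, diff.json or --format json is probably the better source, though its schema is not shown here.

```python
import re

# Matches pairs such as kept=8 or kept-uncertain=0.
PAIR = re.compile(r"([a-z][a-z-]*)=(\d+)")


def parse_summary(line: str) -> dict[str, int]:
    """Extract the tier counts from a diff summary line."""
    return {key: int(value) for key, value in PAIR.findall(line)}


line = "diff: model.austin.stg_bikeshare_trips  kept=8  kept-uncertain=0  dropped=2  flagged=1"
counts = parse_summary(line)
```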

At least one dropped row with always-passes is mathematically guaranteed — the fixture's staging SQL aliases a literal 'austin' AS region column, so any LLM-drafted not_null on it must always-pass and the prune engine drops it. The strict 0.95 grade thresholds in the fixture config typically surface at least one flagged artifact.

A high drop rate is the working state, not the failure state. A typical staging model drops ~60-80% of the LLM-drafted tests as always-passes — the LLM proposes broadly and the prune layer trims the ones the warehouse data doesn't contradict. Internal testing on bigquery-public-data.austin_bikeshare.bikeshare_trips shows 5 of 8 drafted tests dropped (62.5%); see docs/prune-ops.md § Expected drop rates for the per-test-type breakdown.

Two durable artifacts land under /tmp/sf-austin/.signalforge/: grade.json (per-criterion LLM-judge scores) and diff.json (the full rendered diff). The committed .gitignore covers .signalforge/.

Troubleshooting

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| User does not have bigquery.jobs.create permission in project bigquery-public-data | GOOGLE_CLOUD_PROJECT not set; SDK fell back to the source project | Export GOOGLE_CLOUD_PROJECT=<billing-project> where you have the BigQuery Job User role |
| Query exceeded max_bytes_billed (limit=100000000, ...) | Editing the profile dropped or lowered maximum_bytes_billed | Keep maximum_bytes_billed: 1000000000 (1 GB); the bundled demo profiles.yml ships this cap intentionally so the materialised-sample scan clears the adapter's 100 MB default |
| Manifest not found / dbt_project.yml not found at ... | CLI walked up from the wrong cwd, or --project-dir doesn't directly contain dbt_project.yml | Either cd into the project root, or pass --project-dir <abs-path> pointing at the directory holding dbt_project.yml |
| aggregate_complete=False in grade.json | Network blip during a grade call exhausted retries | Re-run; if it persists, raise grade.total_budget_seconds in signalforge.yml |
| LLM response did not match the CandidateSchema shape | Anthropic response shape drifted vs. the parser | Set ANTHROPIC_LOG=info and inspect ~/.anthropic-debug/; file an issue |

Full per-flag reference, exit-code taxonomy, and environment variables: docs/cli-ops.md. For multi-model dbt projects, see Running across many models for the --select flag and shell-loop pattern. Maintainer-only walkthrough of the same flow as a gated test (pytest -m e2e --no-cov): docs/e2e-smoke-test.md.

Prune the tests you already have

If you already have a schema.yml — written by hand, or generated by dbt-codegen / dbt Copilot / DinoAI / datapilot — you don't need SignalForge to redraft it. Point prune-existing at it and the warehouse tells you which of those tests add signal. There's no LLM call, so the only requirement is warehouse access (a dbt profile).

Availability: prune-existing is a v0.2 feature, in development on the dev branch — it is not in the current pip install signalforge-dbt (v0.1) release. To use it now, install from source: pip install "signalforge-dbt @ git+https://github.com/wjduenow/SignalForge.git@dev".

# From inside your dbt project (with target/manifest.json present):
signalforge prune-existing customers --schema models/marts/schema.yml

What you get on stdout is a diff of your file: a kept / kept-uncertain / dropped table with a one-line "why" per test, plus a unified diff showing exactly which tests to remove. Tests SignalForge doesn't yet evaluate — custom generics, dbt_utils.*, dbt_expectations.* — are summarised on stderr as skipped (run with --verbose for the per-test breakdown), never silently dropped.

It is read-only by design: there is no --write flag, so your hand-authored file is never overwritten. The rendered diff goes to stdout and a machine-readable copy to .signalforge/diff.json (--dry-run suppresses even that). Apply the removals yourself from the diff. See docs/cli-ops.md § signalforge prune-existing for the full flag set and docs/ingest-ops.md for which dbt test shapes are supported vs. skipped.

CLI

The CLI exposes five subcommands. Four ship in the v0.1 PyPI release; the fifth, prune-existing, is on the in-development v0.2 line (see Prune the tests you already have):

signalforge generate <model>                     # full draft -> prune -> grade -> diff pipeline for one model
signalforge prune-existing <model> --schema <p>  # prune an existing schema.yml's tests (ingest -> prune -> diff, no LLM) [v0.2]
signalforge init-demo [<dest>]                   # copy the bundled Austin demo project into <dest>
signalforge lint                                 # validate signalforge.yml config blocks (no LLM/warehouse calls)
signalforge version                              # print the SignalForge version

Key flags by subcommand:

  • generate: --project-dir, --manifest, --profiles-dir (point at the project / manifest / profile); --mode {schema-only,aggregate-only,sample} and --min-score (pipeline behaviour); --write / --dry-run and --format {ansi,markdown,json} (output); --estimate (cost preview, no billable calls); --select <expr> (run across many models); --scope, --sample-strategy; and the --quiet / --verbose / --no-color observability triad.
  • prune-existing: the required --schema <path> plus --project-dir, --manifest, --profiles-dir, --scope, --sample-strategy, --format {ansi,markdown,json}, --dry-run, and the --quiet / --verbose / --no-color triad. It is read-only by design (no --write) and makes no LLM call.
  • init-demo: --force.
  • lint: --config, --manifest, --model, --project-dir.

signalforge --help prints the top-level help; each subcommand has its own --help page. See docs/cli-ops.md for the full reference, exit-code taxonomy, and environment variables.

Configuration

Configuring the BigQuery adapter

SignalForge reads your dbt profile and instantiates a BigQueryAdapter via WarehouseAdapter.from_profile(profile). See docs/warehouse-adapter-ops.md for ADC setup, cost defaults, sampling strategy (and the TABLESAMPLE cost-asterisk), PartitionFilter use, and the typed-error reference.

Data safety

Schema-only is the default. The LLM never sees row data unless you explicitly opt in via safety.mode: sample in signalforge.yml (or the --mode sample CLI flag). Even column names that match the built-in PII patterns (*email, *phone, *ssn) — or that you flag via dbt tags: ["pii"] / meta.contains_pii: true / meta.signalforge.sample: false — are replaced with stable hashed placeholders (col_<8 hex>) before reaching the LLM.
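
The placeholder idea can be sketched as follows. The glob patterns come from the paragraph above; everything else is an assumption made for illustration: the use of unsalted SHA-256, the lowercase matching, and the names `is_pii`, `placeholder`, and `redact` are all hypothetical, and the real redaction logic (including the tag and meta opt-outs) lives in the safety layer.

```python
import hashlib
from fnmatch import fnmatchcase

# Built-in patterns named in this README.
PII_PATTERNS = ("*email", "*phone", "*ssn")


def is_pii(column: str) -> bool:
    """True when the column name matches one of the built-in patterns."""
    return any(fnmatchcase(column.lower(), pattern) for pattern in PII_PATTERNS)


def placeholder(column: str) -> str:
    """Stable col_<8 hex> stand-in derived from the column name (assumed hash)."""
    digest = hashlib.sha256(column.encode("utf-8")).hexdigest()
    return f"col_{digest[:8]}"


def redact(columns: list[str]) -> dict[str, str]:
    """Map each column name to what an LLM prompt would see."""
    return {c: placeholder(c) if is_pii(c) else c for c in columns}


mapping = redact(["trip_id", "user_email", "contact_phone"])
```

Because the placeholder is a pure function of the name, the same column maps to the same token on every run, which keeps the LLM's output mappable back to the real column.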

Note: the prune step runs warehouse SQL on every invocation regardless of safety.mode. To skip prune entirely (no warehouse contact), set prune.enabled: false in signalforge.yml — see docs/prune-ops.md.

Every LLM call produces one structured record at .signalforge/audit.jsonl (default; configurable via safety.audit_path). The file contains plaintext column-name metadata and should be treated as sensitive: this repo's .gitignore already covers .signalforge/; the writer creates the directory at 0o700 and the audit file at 0o600. The audit writer is fail-closed — if the write fails, the LLM call is aborted (no silent drafts without an audit trail). See docs/safety-ops.md for the JSONL schema.
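
A fail-closed writer with those permissions might look like the sketch below. It is not SignalForge's writer: `write_audit` and `audited_call` are hypothetical names, the record shape is made up, and writing the record before the call is an assumed ordering. The permission bits are requests that the process umask can narrow further.

```python
import json
import os
from pathlib import Path
from typing import Callable


def write_audit(path: Path, record: dict) -> None:
    """Append one JSON line; directory requested at 0o700, file at 0o600."""
    path.parent.mkdir(mode=0o700, parents=True, exist_ok=True)
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o600)
    with os.fdopen(fd, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")


def audited_call(path: Path, record: dict, call_llm: Callable[[], str]) -> str:
    """Fail closed: if the audit write raises, call_llm is never invoked."""
    write_audit(path, record)  # an OSError here propagates to the caller
    return call_llm()
```

The ordering is the whole design: because the write happens first and its exception is not caught, there is no code path that reaches the LLM without an audit line on disk.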

Full reference — mode semantics, the four opt-out signals and their precedence, the signalforge.yml schema, the audit schema, debugging, and the typed-error reference — is in docs/safety-ops.md.

LLM drafting

How drafting works

signalforge.draft.draft_schema takes a manifest model + warehouse adapter + safety policy and returns a DraftOutcome carrying the parsed CandidateSchema, the typed LLMRequest that was sent, and the LLMResult from the LLM. One LLM call per model; pre-send token counting, the full retry taxonomy, prompt caching, and a fail-closed response audit are all owned by the layer.

Manifest + Model + LLMRequest (from safety layer)
  -> render_prompt  (system + cached manifest summary + dynamic per-model SQL)
  -> call_anthropic (1 SDK seam, full retry taxonomy, prompt caching)
  -> parse_draft_response (JSON + anchor-contract validator)
  -> write_response_event (fail-closed JSONL audit)
  -> DraftOutcome(candidate, request, result)

Auditability

Two parallel audit streams sit under policy.audit_path.parent:

  • audit.jsonl (safety layer) records WHAT data went to the LLM — columns sent, redactions applied, sampling mode in effect.
  • llm_responses.jsonl (draft layer) records WHAT the LLM produced — hashes of the response text, the parsed schema, and the SQL sent; token usage including cache creation/read; the prompt_version.

Both streams are fail-closed: an audit-write failure aborts the call, the partial work is dropped, and an unaudited LLM call cannot silently happen. A reviewer correlates the two streams by model_unique_id + timestamp window. See docs/draft-ops.md for the response-audit schema, the retry taxonomy, the cache pre-send checks, and the typed-error reference.
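
Correlating the two streams could be done along these lines. Only model_unique_id is named in this README; the `timestamp` field name, its ISO 8601 format, the 60-second default window, and the `correlate` helper are assumptions for the sake of the sketch. docs/draft-ops.md and docs/safety-ops.md define the real schemas.

```python
from datetime import datetime, timedelta


def correlate(
    audit_events: list[dict],
    response_events: list[dict],
    window: timedelta = timedelta(seconds=60),
) -> list[tuple[dict, dict]]:
    """Pair each request record with responses for the same model inside the window."""
    pairs = []
    for sent in audit_events:
        sent_at = datetime.fromisoformat(sent["timestamp"])
        for got in response_events:
            if got["model_unique_id"] != sent["model_unique_id"]:
                continue
            delta = datetime.fromisoformat(got["timestamp"]) - sent_at
            if timedelta(0) <= delta <= window:
                pairs.append((sent, got))
    return pairs


audit = [{"model_unique_id": "model.austin.stg_bikeshare_trips",
          "timestamp": "2025-01-01T12:00:00+00:00"}]
responses = [
    {"model_unique_id": "model.austin.stg_bikeshare_trips",
     "timestamp": "2025-01-01T12:00:20+00:00"},
    {"model_unique_id": "model.austin.stg_bikeshare_trips",
     "timestamp": "2025-01-01T13:00:00+00:00"},  # outside the window
]
pairs = correlate(audit, responses)
```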

Roadmap

| Version | Scope |
| --- | --- |
| v0.1 | Single-model draft + warehouse prune; first warehouse adapter (BigQuery); CLI only |
| v0.2 | Prune externally-authored tests (prune-existing); additional warehouse adapters (Snowflake, Postgres); project-wide drift detection |
| v0.3 | GitHub Action with PR comment integration |
| v0.4 | Rubric customization; organization-wide style profiles |
| v1.0 | dbt Fusion engine compatibility; dbt MCP server consumption |

The architecture is warehouse-agnostic — adapters plug in behind a thin sampling/profiling interface. BigQuery is the v0.1 target because of its generous query-bytes pricing for sampled reads and its first-class INFORMATION_SCHEMA.JOBS history for downstream cost analysis. Snowflake, Databricks, Postgres, and Redshift are all on the roadmap; PRs welcome.

Detail is tracked in GitHub Issues against this repo.

Design principles

  1. Signal over volume. A test that always passes is worse than no test — it consumes review attention without catching anything. SignalForge's job is to produce fewer, better artifacts.
  2. Evaluation in the loop. Generation without grading is what produced the current "AI-test fatigue." Every artifact SignalForge ships has been scored.
  3. OSS-first, Core-friendly. No dependency on dbt Cloud. Runs against any dbt-core project, locally or in CI.
  4. Explainable diffs. Every kept and dropped artifact has a one-line "why." Reviewers see what changed and what the tool's reasoning was.
  5. Permissive license. Apache-2.0. Use it commercially, vendor it, embed it.

Related projects

  • clauditor — the LLM-graded evaluation framework SignalForge's quality layer is built on.
  • dbt-codegen — the rule-based YAML scaffolder SignalForge complements (codegen scaffolds; SignalForge drafts, prunes, and grades).
  • dbt-osmosis — schema.yml management and propagation; orthogonal concern.
  • Recce — PR-time data diff for dbt; complementary, addresses a different pain point.

License

Apache-2.0. See LICENSE.

Contributing

Alpha. Issues are welcome to shape the design: open one against the dev branch describing the use case you'd like SignalForge to handle. With v0.1 released, code contributions are open too; see CONTRIBUTING.md for the workflow.