Ingest layer — operations guide
Operational reference for users of signalforge.ingest. Companion to
docs/prune-ops.md,
docs/draft-ops.md, and
docs/manifest-loader-ops.md, and to the design
record in
plans/super/104-ingest-external-tests.md.
What it does
The drafter (signalforge.draft) is one way to produce candidate tests:
an LLM writes them. The ingest layer is the other way — point
SignalForge at an existing dbt schema.yml (hand-written, or output from
dbt-codegen, dbt Copilot, DinoAI, datapilot, …) and let the warehouse tell
you which of those tests actually add signal.
This extends Architectural Commitment #1 ("signal over volume") beyond
SignalForge's own LLM drafts: you can prune any generator's tests, not
just ours. signalforge.ingest.read_schema parses standard dbt test
syntax into the same typed CandidateSchema the drafter emits, so
signalforge.prune.prune_tests consumes it unchanged.
The reader understands the four test types SignalForge can compile to warehouse SQL:
| dbt test | What prune checks |
|---|---|
not_null |
column has no NULLs in the sample |
unique |
column has no duplicate non-NULL values |
accepted_values |
column has no values outside the declared set |
relationships |
every child key resolves to a parent key |
Every other test — dbt_utils.*, dbt_expectations.*, singular/custom
generics, anything namespaced — is skipped and recorded, never
silently dropped (see Supported vs skipped).
!!! tip "There's a CLI for this"
The ingest layer is a library entry point — call read_schema and
hand the result's candidate to prune_tests yourself — but most
operators won't need to. The signalforge prune-existing <model>
--schema <path> subcommand (shipped in
#105) wires
ingest → prune → diff end to end with no LLM call. See the
CLI reference.
Public API
Import from signalforge.ingest.
Reader
read_schema(schema, model, *, project_dir=None) -> IngestResult— the entry point. Parses an externalschema.ymlfor one model and returns anIngestResult.
The schema argument is overloaded by type — this str-vs-Path split
is the contract:
schema: pathlib.Path→ a FILE path. It is canonicalised (symlink-/containment-hardened) againstproject_dir— defaulting to the file's parent directory whenproject_dirisNone— then read. A missing file raisesIngestSchemaNotFoundError; a symlink-loop / escape re-raises asIngestSchemaParseError.schema: str→ RAW YAML CONTENT, not a path. No file read and no canonicalisation happen; the string is parsed directly. (Useful for testing or when the YAML is already in memory.)
model is the manifest signalforge.manifest.Model whose tests are being
ingested. model.name selects the matching models: entry in the
schema.yml; model.columns is the column set the
anchor contract validates against.
Result shapes
IngestResult— the reader's return value. Two fields:candidate: CandidateSchema— the converted schema the prune stage consumes unchanged.skipped: tuple[SkippedTest, ...]— one record per test the reader could not convert, in encounter order.
SkippedTest— one structured skip record. Fields:test_name: str,column: str | None(Nonefor a model-level test),reason: SkipReason,detail: str(a short free-text diagnostic).SkipReason— the closed reason literal (three values; see below).
These models are produced in process and handed straight to prune — they are not serialised to a JSONL audit / sidecar.
Errors
The IngestError hierarchy: IngestError (base), plus
IngestSchemaNotFoundError, IngestSchemaParseError,
IngestSchemaTooLargeError, IngestModelNotFoundError,
IngestAnchorContractError. See the Error reference.
Minimal usage
from pathlib import Path
from signalforge.ingest import read_schema
from signalforge.manifest import load
from signalforge.prune import prune_tests
from signalforge.warehouse import WarehouseAdapter, load_profile
project_dir = Path("/path/to/dbt/project")
manifest = load(project_dir)
model = manifest.get_model("model.my_project.orders")
result = read_schema(project_dir / "models/marts/schema.yml", model)
print(f"{len(result.candidate.columns)} columns, "
f"{len(result.skipped)} tests skipped")
for s in result.skipped:
print(f" skipped {s.test_name} on {s.column}: {s.reason} — {s.detail}")
# The candidate feeds prune unchanged. Pass an un-entered adapter —
# prune_tests owns the `with adapter:` lifecycle.
profile = load_profile(project_dir)
adapter = WarehouseAdapter.from_profile(profile)
prune_result = prune_tests(model, adapter, result.candidate, manifest)
Supported vs skipped tests
The four supported types map to a typed CandidateTest; everything else is
recorded as a SkippedTest. The SkipReason literal is closed at three
values:
SkipReason |
Triggered by |
|---|---|
"unsupported-test-type" |
a bare-string test that isn't not_null / unique (e.g. - positive) |
"custom-or-generic-test" |
a namespaced or project-defined test (dbt_utils.*, dbt_expectations.*, any custom generic), or a malformed test-entry shape |
"malformed-supported-test" |
a supported type whose required args are missing or empty (accepted_values with no values; relationships missing to or field) |
A skip is never a failure — the run continues. The skip records exist so an
operator can see what was left out and why — the signalforge
prune-existing CLI surfaces them as a "N tests skipped, here's why"
stderr report (one summary line, with per-test detail under --verbose).
Contrast this with the anchor contract, which does
fail loud.
Parser tolerances
The reader accepts the range of shapes dbt itself accepts:
tests:anddata_tests:are both read and unioned. dbt renamedtests:→data_tests:in 1.8; the reader concatenates both lists (intests:-first encounter order) at column level and model level.- Identical tests are deduped. A test appearing under both
tests:anddata_tests:(or twice in one list) collapses to oneCandidateTest, keyed by(type, column, args). - Inline args AND
arguments:-nested args. Bothaccepted_values: {values: [...]}and the dbt 1.8+accepted_values: {arguments: {values: [...]}}shapes are read. - Config keys are ignored, never mistaken for args. Interleaved
config,severity,where,name,tags,error_if,warn_if,store_failures,limitkeys are stripped before the required-arg check. ref()/source()are unwrapped onrelationships.to.ref('m')→m;ref('pkg', 'm')→m(last positional);source('s', 't')→s.t. This is a bounded regex unwrap — no Jinja engine, no new dependency. Atostring matching neither pattern is carried verbatim.- A column with no supported tests is still a
CandidateColumn(it just carries an emptyteststuple), and a column / model with nodescriptiondefaults to"".
Anchor contract
A test that references a column absent from the manifest Model means
the ingested YAML is stale or wrong against the warehouse schema. That is a
correctness error the operator must fix, so the reader fails loud —
unlike unsupported types, which are skipped.
read_schema collects every anchor violation across the whole file
(it does not stop at the first) and raises one IngestAnchorContractError
whose violations tuple lists all of them — so you can fix the schema in a
single pass. Three checks run:
- each
CandidateColumnname must exist inmodel.columns; - each column-scoped test must reference its own column;
- each test's referenced column (column-level or model-level) must exist in
model.columns.
A clean candidate raises nothing.
Safety posture
Two attack surfaces, both mitigated before any parse:
- Malicious YAML. Only
yaml.safe_loadis used (neveryaml.load), and the raw byte length is size-capped before the parse runs (5 MB) so a billion-laughs / deeply-nested-anchor payload never reaches the parser. Oversize raisesIngestSchemaTooLargeError. - Path traversal / symlinks. A
Pathargument is canonicalised through the project's symlink-/containment-hardened path safety helper before any read.
A relationships.to value carrying ref('x') or raw SQL is not an
injection vector at ingest — prune's identifier safety check gates every
identifier before it reaches compiled SQL.
The ingest layer is a stage-0 reader: it emits no logging. Observability lives in the consuming prune / grade stages.
Error reference
Every error subclasses IngestError, which renders its message plus a
↳ Remediation: line. The CLI exit-code tier each maps to (for the
fast-follow #105) is noted.
IngestSchemaNotFoundError — tier 1 (load)
The supplied schema.yml Path does not exist.
Verify the schema.yml path is correct and the file exists. Pass the path to the dbt YAML file that declares the model's tests (commonly
models/<dir>/schema.ymlor a per-model_<name>.yml).
IngestSchemaParseError — tier 1 (load)
The schema.yml could not be parsed: malformed YAML, an unreadable file
(encoding / OS error), or a path-containment failure (symlink loop / escape
from the project directory). The triggering exception is chained via
__cause__.
Verify the file is valid YAML (yaml.safe_load must parse it), is readable, and that no symlink in its path escapes the project directory. Run
dbt parseto confirm dbt itself accepts the schema.
IngestSchemaTooLargeError — tier 1 (load)
The raw schema.yml byte length exceeds the parse-safety cap (checked
before any yaml.safe_load).
The schema.yml exceeded the configured byte safety cap applied before yaml.safe_load. Inspect the file for accidental bloat or an attempted billion-laughs / deeply-nested-anchor payload. Trim the schema or split it across multiple files.
IngestModelNotFoundError — tier 2 (input)
No models: entry in the schema.yml matches model.name. A schema.yml
can declare several models; the operator named one the file does not
contain.
Verify the model name matches a
name:undermodels:in the schema.yml. Model names are matched exactly (case-sensitive).
IngestAnchorContractError — tier 2 (input)
One or more tests reference a column absent from the Model. Carries the
full violations tuple (all violations across the file).
Each listed test references a column that is absent from the model. Either correct the column name in the schema.yml to match the model, or regenerate the manifest (
dbt parse) so the model's column set is current.
Worked example
Given this schema.yml (a dbt-codegen-shaped file) and an orders model
whose columns are order_id, status, customer_id, amount,
created_at:
version: 2
models:
- name: customers
description: "An unrelated model in the same file — not selected."
columns:
- name: id
tests:
- not_null
- name: orders
description: "Orders fact table."
columns:
- name: order_id
description: "Surrogate key for the order."
tests:
- not_null
- unique
data_tests:
- unique # duplicate of the unique above — deduped
- name: status
tests:
- accepted_values: # inline args
values: ["placed", "shipped", "cancelled"]
- dbt_utils.unique_combination_of_columns: # namespaced — skipped
combination_of_columns: [order_id, status]
- name: customer_id
data_tests:
- relationships: # arguments:-nested + ref() unwrap
arguments:
to: "ref('customers')"
field: id
- name: amount
tests:
- positive # bare unsupported string — skipped
- name: created_at # no description, no tests
tests:
- dbt_utils.expression_is_true: # model-level custom generic — skipped
expression: "amount >= 0"
read_schema(yaml_str, orders_model) returns an IngestResult with:
Kept candidate tests (on result.candidate):
| column | test | notes |
|---|---|---|
order_id |
not_null |
|
order_id |
unique |
the data_tests: duplicate collapsed away |
status |
accepted_values values ("placed", "shipped", "cancelled") |
inline args |
customer_id |
relationships to="customers", field="id" |
ref('customers') unwrapped |
created_at is present as a CandidateColumn with an empty tests tuple
and description="". The customers model is ignored — only the named
orders model is selected.
Skipped tests (result.skipped):
test_name |
column |
reason |
|---|---|---|
dbt_utils.unique_combination_of_columns |
status |
custom-or-generic-test |
positive |
amount |
unsupported-test-type |
dbt_utils.expression_is_true |
None (model-level) |
custom-or-generic-test |
The four kept tests flow into prune_tests exactly as if the drafter had
produced them; the three skipped tests are surfaced for the operator's
review and do not stop the run.