Warehouse adapter — operations guide
Operational reference for users of signalforge.warehouse. Companion to
docs/manifest-loader-ops.md and the design
record in plans/super/3-bigquery-adapter.md.
v0.1 ships the BigQuery adapter only. Snowflake and Postgres are tracked
for v0.2; the public ABC (WarehouseAdapter) and the from_profile
factory are warehouse-agnostic so adding a sibling adapter is purely
additive.
Quick start
One-time, on a fresh machine:
gcloud auth application-default login
Application Default Credentials (ADC) is the only supported auth method in v0.1 (see §2 dbt profile resolution).
Then, from Python:
from pathlib import Path
from signalforge.warehouse import (
WarehouseAdapter,
TableRef,
load_profile,
)
profile = load_profile(Path("my_dbt_project"))
with WarehouseAdapter.from_profile(profile) as adapter:
sample = adapter.sample_rows(
TableRef(project="my-gcp-project", dataset="analytics", name="dim_users"),
n=100,
)
WarehouseAdapter.from_profile dispatches on profile.type. v0.1 only
supports profile.type == "bigquery"; anything else raises
UnsupportedProfileTypeError with a remediation pointing at the v0.2
roadmap entry.
dbt profile resolution
load_profile(project_dir, target=None) resolves a profiles.yml in
this order (DEC-009):
$DBT_PROFILES_DIR/profiles.yml— user-trusted; honoured first when the env var is set.<project_dir>/profiles.yml— symlink-hardened via the same path gate the manifest loader uses (canonicalise_path); a symlink that escapes the project tree raisesProfileNotFoundErrorrather than silently falling through to the home-dir path.~/.dbt/profiles.yml— user-trusted.
ProfileNotFoundError lists every path searched in its remediation, so
"why didn't you find my profile?" answers itself from the exception
message.
Active-target resolution. Within the resolved profile, the active
output is selected as: explicit target= argument → the profile's own
target: field → ProfileTargetNotFoundError. ProfileTargetNotFoundError
inherits from ProfileNotFoundError, so a single except clause covers
both "no profile" and "wrong target" if the caller does not need to
distinguish them.
Auth-method support (v0.1, DEC-017). Only method: oauth (or unset,
which means "let dbt-bigquery default to ADC") is accepted. Every other
documented dbt-bigquery method raises UnsupportedAuthMethodError from
the Pydantic field validator, with the remediation pointing at
gcloud auth application-default login:
method value |
v0.1 behaviour |
|---|---|
oauth / unset |
accepted; uses ADC |
service-account |
UnsupportedAuthMethodError |
service-account-json |
UnsupportedAuthMethodError |
oauth-secrets |
UnsupportedAuthMethodError |
impersonate-service-account |
UnsupportedAuthMethodError |
Service-account methods land in v0.2; the v0.1 surface is intentionally narrow so the auth path has one well-tested branch.
DbtProfileTarget is a strict (extra="forbid") Pydantic v2 model.
Unknown profile keys raise ValidationError; this is a deliberate
divergence from the manifest reader's extra="ignore" posture (DEC-017),
because silently dropping an auth-config key could mean SignalForge
falls back to ADC when the user thought they had configured something
else. Forward-compat against new dbt-bigquery fields is the drift-detector
test's responsibility (tests/warehouse/test_profiles.py).
Cost defaults
The BigQuery adapter is opinionated about cost on every query.
maximum_bytes_billed = 100 MBby default (DEC-005).BigQueryAdaptertakes amax_bytes_billed=kwarg; the dbt profile'smaximum_bytes_billedfield flows throughload_profileandfrom_profileand overrides the default. Queries that exceed the cap raiseBytesBilledExceededError. The exception'slimitfield always carries the configured cap;job_idandbytes_billedare populated only when BigQuery'sBadRequestexposes them (it usually doesn't on the pre-execution rejection path) and are otherwiseNone. The error message and remediation are sufficient to act on without those fields; v0.2 may revisit by surfacing the failedQueryJobso the IDs flow through.use_query_cache=Falseon every query (DEC-015). Architectural Commitment #5 — explainable diffs — requires that the same input produce the same prune decision; cached results break that contract. v0.2 may re-enable caching behind an explicit opt-in; in v0.1 it is unconditionally off.- Per-call
timeout_ms(DEC-013 of issue #6): pass an integer to_default_job_config(stage="...", timeout_ms=...)to setQueryJobConfig.job_timeout_ms; BigQuery cancels the job server-side at expiry. Bytes-scanned through the cancellation point still bill — set conservatively. DefaultNone(no timeout). Reserved for v0.2 prune layer integration (issue #6 ships withtotal_budget_secondsenforcement only; v0.1 has no publicWarehouseAdapter.run_test_sqlkwarg for per-test timeouts). - BigQuery job labels are auto-set on every query:
signalforge_stage— the pipeline stage that issued the query. Values arewarehouse_sample(fromsample_rows),warehouse_sample_materialise(frommaterialise_sample, v0.2 — see Materialised sampling),warehouse_stats(fromcolumn_stats),warehouse_test(fromrun_test_sql), andwarehouse_session_abort(from the__exit__cleanupCALL BQ.ABORT_SESSION()query — DEC-013 of #22).signalforge_version— the package version (with.rewritten to_to satisfy BigQuery's label-character constraint).
Both are filterable in INFORMATION_SCHEMA.JOBS_BY_PROJECT for v0.2
cost analysis. Stage labels are warehouse_sample,
warehouse_sample_materialise (v0.2), warehouse_stats,
warehouse_test, and warehouse_session_abort (v0.2; one per
pipeline stage that issues a query):
SELECT job_id, total_bytes_billed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE labels.signalforge_stage = 'warehouse_sample'
Sampling strategy
adapter.sample_rows(table, n, partition_filter=None) returns up to
n rows from table, deterministically.
Default: hash-mod (DEC-006). Every call wraps the table in:
SELECT * FROM <quoted> AS t
WHERE MOD(ABS(FARM_FINGERPRINT(TO_JSON_STRING(t))), bucket) < 1
ORDER BY FARM_FINGERPRINT(TO_JSON_STRING(t))
LIMIT n
The trailing ORDER BY makes the LIMIT truncation deterministic when
the bucket filter retains more than n rows; without it BigQuery's
LIMIT picks an arbitrary subset and breaks the same-input →
same-output prune contract.
bucket is sized from Table.num_rows so the expected sample size lands
near n. The hash-mod approach is deterministic across runs, works on
views, materialised views, wildcard tables, and CTEs, and behaves
correctly when TABLESAMPLE does not.
The TABLESAMPLE cost-asterisk. TABLESAMPLE SYSTEM is documented in
BigQuery as the canonical sampling primitive, but it does not
proportionally reduce bytes billed on un-clustered tables — it scans the
whole table and then drops blocks. It is only cost-effective on
clustered tables or in conjunction with a partition filter. Hash-mod
has the same cost story (it scans everything too) without the
determinism downside, so it is the v0.1 default. v0.2 will add an
opt-in TABLESAMPLE strategy for clustered tables where the bytes-billed
math works out.
PartitionFilter (DEC-014). Scope a sample to a specific partition
to actually reduce bytes billed:
from datetime import date
from signalforge.warehouse import PartitionFilter
adapter.sample_rows(
table,
n=100,
partition_filter=PartitionFilter(
column="event_date",
op=">=",
value=date(2024, 1, 1),
),
)
Each adapter renders its own SQL for PartitionFilter. The typed
column/op/value triple — op is a Literal["=", ">", ">=", "<",
"<=", "!="] and column is identifier-validated at construction time —
removes a SQL-injection seam and prevents cross-warehouse SQL leaks
(DEC-018).
Fail-loud thresholds (DEC-024). Sampling refuses to silently over-spend in two cases:
| Condition | Exception |
|---|---|
Table.num_rows is None/0 and no partition filter |
UnknownTableSizeError |
Table.num_rows >= 100_000_000 and no partition filter |
SamplingRequiresPartitionFilterError |
Both carry the offending table (and num_rows, where known) and a
remediation that names the fix. Fail-loud is preferred over a guessed
bucket size because the worst case — a terabyte-scale unscoped scan —
is silent on the user's side and very loud on the bill.
column_stats access pattern
adapter.column_stats(table, column) returns a ColumnStats object
with count, distinct, nulls, min, max, and data_type. The
call MUST be inside a with adapter: block (DEC-025); calling it
outside one raises RuntimeError.
Flush semantics in v0.1. Inside an active with adapter: block,
calls to adapter.column_stats(table, col) accumulate per table. The
first call for a given table flushes every column queued for that
table in a single batched aggregate query, populating the cache;
subsequent column_stats(table, ...) calls for columns already in the
cache return without issuing another query. Columns queued after a
flush are batched into the next flush when the first uncached column is
requested.
The returned ColumnStats is a fully-populated typed value (no lazy
proxy). v0.2 may add a lazy proxy that defers the flush until a field is
read, but the v0.1 contract is "first call flushes the queued batch" —
predictable and easy to reason about at a with-block boundary.
Use the recommended pattern below to keep call sites cheap to refactor when the lazy form lands:
with WarehouseAdapter.from_profile(profile) as adapter:
refs = {col: adapter.column_stats(table, col) for col in ["a", "b", "c"]}
for col, stats in refs.items():
print(col, stats.count, stats.distinct, stats.nulls)
In v0.1 this issues three queries (one per column_stats call); v0.2
will collapse all three into a single batched query at first field
read.
Complex types (DEC-016). For BigQuery types where ordering is not
meaningful — GEOGRAPHY, JSON, BYTES, ARRAY<...>, STRUCT<...>,
RANGE<...> — min and max are None. count, distinct, and
nulls are populated for every type. The prune layer keys decisions on
data_type (the raw BigQuery type string) without re-reading the
catalog.
Materialised sampling (v0.2, issue #22)
The WarehouseAdapter ABC ships a materialise_sample method in v0.2
that pre-computes a deterministic sample into a session-scoped temp
table, so every candidate test's per-test query reads from the
narrow materialised sample rather than re-running the full-row hash
filter against the source table for every test (see
docs/prune-ops.md § Cost model
for the cost story).
ABC signature (signalforge.warehouse.base):
def materialise_sample(
self,
table: TableRef,
n: int,
*,
partition_filter: PartitionFilter | None = None,
ttl_seconds: int = 3600,
) -> TableRef: ...
The default ABC implementation raises MaterialisationNotSupportedError
with a remediation pointing at prune.sample_strategy: oneshot in
signalforge.yml. The BigQueryAdapter overrides; non-BQ adapters
inherit the default until v0.3.
BigQueryAdapter session-state pattern. The first call to
materialise_sample runs CREATE TEMP TABLE _sf_sample_<run_id> AS SELECT ...
(the CTAS itself uses the bare _sf_sample_<run_id> name — no
_SESSION. prefix) with QueryJobConfig(create_session=True, ...)
against the warehouse_sample_materialise stage label. BigQuery
assigns the session_id server-side; the adapter captures it from
job.session_info.session_id and stores it on the adapter instance
as self._active_session_id for the duration of the prune run.
Subsequent run_test_sql calls automatically attach
ConnectionProperty(key="session_id", value=self._active_session_id)
so the per-test query resolves _SESSION._sf_sample_<run_id> against
the same session. The returned TableRef carries
project=None, dataset="_SESSION", name="_sf_sample_<run_id>" —
two-part qualified_name _SESSION._sf_sample_<run_id>. project=None
is load-bearing because BigQuery rejects the three-part
<project>._SESSION.<name> form even inside the owning session.
The run_id is OUR derivation —
blake2b(table.qualified_name + signalforge_version + str(n) + canonical_json(partition_filter), digest_size=8).hexdigest()
(inputs joined with NUL separator; 16 hex chars) — so the temp-table
name is deterministic across runs and the compiled_sql_hash
reproducibility invariant on PruneEvent (DEC-005 of issue #6) is
preserved.
ttl_seconds is OUR-side hint, not a BQ knob. BigQuery sessions
have a server-enforced max lifetime (~24h regardless of activity)
plus a BQ-default idle timeout. The ttl_seconds=3600 parameter is
NOT passed to BigQuery — it's a hint to the cleanup-WARNING text
(the "auto-expire in Ns" line below). Don't go looking for a BQ SDK
call to set it; there isn't one in v0.2.
The session-state pattern mirrors column_stats's batching state
(DEC-008 / DEC-025 of issue #3): adapter-instance state scoped to a
with adapter: block; cleanup driven by __exit__.
v0.2 → v0.3 migration story for non-BQ adapters. Snowflake and
Postgres adapters in v0.2 inherit the default materialise_sample →
MaterialisationNotSupportedError raise. Operators on those
warehouses opt in to the v0.1 oneshot path via
prune.sample_strategy: oneshot in signalforge.yml. Each
non-BigQuery adapter then ships its own session-equivalent in v0.3
(Snowflake: temporary tables tied to the session; Postgres:
CREATE TEMP TABLE inside a transaction). The ABC default-raise is
the v0.2 stop-gap, not a permanent surface.
Query-bytes estimation (v0.2, issue #36)
The WarehouseAdapter ABC ships an estimate_query_bytes method in
v0.2 so the signalforge generate --estimate cost-preview flow can
estimate how many bytes a candidate query would process WITHOUT
actually scanning the source table. The BigQuery override uses
QueryJobConfig(dry_run=True) and reads job.total_bytes_processed
off the returned job; non-BigQuery adapters inherit the ABC's default
EstimateNotSupportedError raise.
ABC signature (signalforge.warehouse.base):
def estimate_query_bytes(self, sql: str) -> int: ...
The default ABC implementation raises EstimateNotSupportedError with
the locked remediation: "Use --estimate with a BigQuery profile, or
wait for v0.3 multi-warehouse estimation support." Concrete adapters
override; v0.2 ships the BigQuery override only.
BigQuery override mechanism. A dry_run=True query asks BigQuery
to validate the SQL server-side and return the estimated bytes
processed, without committing to any actual scan or row return.
BigQuery does NOT bill bytes for a dry_run, so the production
QueryJobConfig deliberately omits maximum_bytes_billed — a cap on
something that never bills would be dead config; worse, it could
mislead a reader into thinking the dry_run was guarded against
runaway cost. The job is tagged with the warehouse_estimate_query_bytes
stage label for INFORMATION_SCHEMA.JOBS_BY_PROJECT cost attribution.
The same _sql_safety.validate_test_sql cheap-reject pass that
run_test_sql applies fires before the SDK call: a SQL with a
top-level ;, a -- comment, a /* */ block comment, or unbalanced
parens raises QuerySyntaxError and never reaches BigQuery.
v0.2 → v0.3 migration story for non-BQ adapters. Snowflake and
Postgres adapters in v0.2 (and v0.3 multi-warehouse) inherit the
default estimate_query_bytes → EstimateNotSupportedError raise
until each grows its own override (Snowflake's EXPLAIN /
Postgres's EXPLAIN are the natural primitives). The CLI's
--estimate flow surfaces the typed error with the locked
remediation so operators see the v0.3 expansion plan inline.
Session cleanup & manual recovery
Sessions opened by materialise_sample need to be torn down so
their _SESSION._sf_sample_<run_id> temp tables don't linger until
BigQuery's server-side timeout reaps them (~24h). The adapter
implements a three-layer cleanup model — explicit close on the happy
path, swallow-and-warn on cleanup failure, BQ's own session timeout
as the durable fallback (issue #22 DEC-013 / DEC-014).
Layer 1 — explicit __exit__ close (happy path). When an
operator wraps the adapter in a with block (the recommended pattern
that the CLI's cmd_generate always uses), __exit__ checks
self._active_session_id; if non-None, it issues
CALL BQ.ABORT_SESSION(); on the same session via
ConnectionProperty(key="session_id", value=self._active_session_id).
On success, the adapter emits one INFO log (session_id_hash,
ttl_remaining_seconds) and resets _active_session_id = None in a
finally clause so subsequent __exit__ calls are no-ops.
Layer 2 — swallow-and-warn (cleanup failure). If
CALL BQ.ABORT_SESSION(); itself raises (network blip, session
already revoked, quota issue), the adapter swallows the exception
and emits a single multi-line WARNING — cleanup must never block the
user's actual work, which already succeeded. The WARNING contains
the raw session_id (deliberate exception to the otherwise-strict
session-id redaction rule, see DEC-003 / DEC-014 of issue #22) and
the manual bq query command the operator can run to clean up
immediately. State is reset in finally so a second __exit__ call
is a no-op.
Layer 3 — BigQuery server-side session timeout (durable
fallback). Hard process death (SIGKILL, OOM, host failure, the
operator forgetting to use a with block in a notebook session)
cannot fire __exit__. BigQuery's own session timeout (BQ-managed,
~24h max regardless of activity) reaps the orphan automatically.
The operator pays a small cost in temp-table storage until the
timeout fires, but no human intervention is required.
Manual recovery command. When the cleanup-failure WARNING fires, the operator copy-pastes the manual command verbatim from the WARNING body. The exact form is (per DEC-014 of issue #22 — this is the text the WARNING emits, do not paraphrase):
bq query --connection_property=session_id=<raw> --use_legacy_sql=false "CALL BQ.ABORT_SESSION();"
<raw> is the same raw session_id printed in the WARNING's
Session ID: line; the manual command is the only remediation
that doesn't wait for the BQ timeout. Authorisation: BigQuery
rejects BQ.ABORT_SESSION() calls from any identity other than the
session's owner, so only the operator who started the prune run can
execute the manual command — hence the raw session_id in the WARNING
is bounded in surface (read-only to the principal who already owned
the session).
Reading the cleanup-failure WARNING. See
docs/cli-ops.md § Stderr shapes (WARNING)
for the full WARNING shape and the --quiet interaction (the
cleanup-failure WARNING is operator-actionable and is NOT suppressed
by --quiet).
Edge case: SDK returns session_info=None
If materialise_sample's first query succeeds server-side but the
BigQuery SDK returns job.session_info=None (or session_id=None),
the adapter cannot stash the id and __exit__ will not fire
BQ.ABORT_SESSION() — _active_session_id is None, so the cleanup
short-circuits to a no-op. The server-side session lives until BQ's
own timeout (~24h max). This is the SDK contract violating its own
documented behaviour and is not expected in practice; the
Spotting orphaned sessions query below catches it. If you see
materialisation jobs in INFORMATION_SCHEMA.JOBS_BY_PROJECT whose
session_info.session_id is set but no matching BQ.ABORT_SESSION
job ever ran for that session, this is the path that produced them.
Spotting orphaned sessions
When a maintainer wants to audit a project for orphan sessions
(e.g., after a known-bad release that bypassed __exit__, or as a
periodic cleanup hygiene task), the adapter's signalforge_stage
job label is the durable signal. Run this INFORMATION_SCHEMA.JOBS_BY_PROJECT
query to list materialisation jobs older than 2× the expected TTL
(default ttl_seconds=3600 → look for jobs older than 2h that may
have leaked sessions):
SELECT
job_id,
user_email,
creation_time,
session_info.session_id,
state,
TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), creation_time, MINUTE) AS age_minutes
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE labels.signalforge_stage = 'warehouse_sample_materialise'
AND creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
AND TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), creation_time, MINUTE) > 120
ORDER BY creation_time DESC;
Adjust region-us to your dataset's region and the
INTERVAL 24 HOUR window to your retention if INFORMATION_SCHEMA.JOBS_BY_PROJECT
is set up differently. Each row identifies one session that was
opened but whose __exit__-driven CALL BQ.ABORT_SESSION(); never
fired (or fired and failed silently in v0.2 if the WARNING was
suppressed). Operators with permission to abort each session can
reuse the manual recovery command above with the per-row
session_info.session_id.
Integration tests (maintainer-only)
The default pytest invocation skips warehouse-touching tests via
addopts = -m 'not bigquery' (DEC-021). To run them locally:
SF_RUN_BQ=1 pytest -m bigquery --no-cov
The SF_RUN_BQ=1 env-var gate lives on top of the marker so even an
explicit -m bigquery will not fire BigQuery requests in CI by accident.
Fixtures use bigquery-public-data.samples.shakespeare, which is free
under BigQuery's 1 TB/month query tier. There is no CI job for these
tests in v0.1; revisit when external contributors arrive and a billing
account on the project is available.
Debugging
Enable DEBUG-level logs from anywhere in the adapter layer:
import logging
logging.getLogger("signalforge.warehouse").setLevel(logging.DEBUG)
The adapter never logs full SQL or row data (DEC-027) — only batch-flush
events, metadata-cache misses, and the two soft-threshold warnings
(profile size, query bytes). Treat DEBUG output as a hint about when
the adapter went to the warehouse, not what it sent.
Finding the BigQuery job ID for a failed query. Every typed error carries the contextual fields the BigQuery console needs:
BytesBilledExceededError—.job_id,.bytes_billed,.limitQuerySyntaxError—.detail(the BigQuery error text verbatim)TableNotFoundError—.tableColumnNotFoundError—.table,.column
Common errors and where they surface:
| Error | First thing to check |
|---|---|
WarehouseAuthError |
gcloud auth application-default login (ADC not set up). |
BytesBilledExceededError |
Raise max_bytes_billed= on BigQueryAdapter, or supply a PartitionFilter. |
UnknownTableSizeError |
Supply a PartitionFilter, or call adapter.refresh_table_metadata. |
SamplingRequiresPartitionFilterError |
Supply a PartitionFilter to scope the sample. |
InvalidIdentifierError |
Check TableRef/PartitionFilter.column against [A-Za-z_][A-Za-z0-9_]*. |
Error reference
Public API: from signalforge.warehouse import errors. Every exception
subclasses WarehouseError and carries a default_remediation rendered
on a ↳ Remediation: line by __str__.
| Class | When raised | Carried fields | Default remediation (abbreviated) |
|---|---|---|---|
WarehouseError |
Base class; never raised directly. | message, remediation |
(no remediation set — base class) |
WarehouseAuthError |
Wraps google.auth.exceptions.DefaultCredentialsError / RefreshError. |
message |
Run gcloud auth application-default login to set up ADC. |
UnsupportedProfileTypeError |
dbt profile's type is not "bigquery". |
profile_type |
v0.1 supports type: bigquery only; Snowflake/Postgres tracked for v0.2. |
UnsupportedAuthMethodError |
dbt profile's method is not "oauth" (or unset). |
method |
v0.1 supports method: oauth (or unset) only; run gcloud auth application-default login. |
ProfileNotFoundError |
None of the three search paths yielded a profiles.yml (or the project file is missing/malformed). |
searched_paths |
Create a profiles.yml at one of the searched paths, or set DBT_PROFILES_DIR. |
ProfileTargetNotFoundError |
The profile resolved but the requested target is missing. Inherits ProfileNotFoundError. |
profile_name, target, searched_paths |
Add the target to profiles.yml, or pass an explicit target= that exists in the profile. |
ManifestProjectNotFoundError |
Model.database is None so TableRef.from_model cannot construct a fully-qualified ref. |
model_unique_id |
Set database: for the model in dbt, or pass an explicit project=. |
ManifestSchemaNotFoundError |
Model.schema_ is None so TableRef.from_model cannot construct a fully-qualified ref. |
model_unique_id |
Set schema: for the model in dbt, or pass an explicit dataset=. |
InvalidIdentifierError |
A SQL identifier (project / dataset / table / column) failed the [A-Za-z_][A-Za-z0-9_]* regex. |
field, value |
Identifiers must match [A-Za-z_][A-Za-z0-9_]*. |
BytesBilledExceededError |
BigQuery rejected a query because maximum_bytes_billed was exceeded. |
job_id, bytes_billed, limit |
Narrow the query (partition filter / smaller sample) or raise max_bytes_billed. |
TableNotFoundError |
BigQuery 404 for the requested TableRef. |
table |
Verify the project.dataset.table exists and credentials have read access. |
ColumnNotFoundError |
A column reference does not exist on the resolved table schema. | table, column |
Verify the column name against the table schema (INFORMATION_SCHEMA.COLUMNS). |
QuerySyntaxError |
BigQuery rejected a query as malformed (separates from BytesBilledExceededError despite both being 400). |
detail |
Inspect the BigQuery error detail and fix the SQL (or update the drafter prompt if recurring). |
SamplingError |
Parent for sampling-time failures; never raised directly. Catch it to handle both subclasses uniformly. | (none) | Inspect the subclass remediation; fail-loud is preferred to silent over-spend. |
SamplingRequiresPartitionFilterError |
Table.num_rows >= 100_000_000 and no PartitionFilter was supplied. |
table, num_rows |
Pass a PartitionFilter to scope the sample. |
UnknownTableSizeError |
Table.num_rows is None/0 and no PartitionFilter was supplied. |
table |
Provide partition_filter, or call adapter.refresh_table_metadata once num_rows is populated. |
MaterialisationFailedError (v0.2) |
BigQueryAdapter.materialise_sample wraps an SDK / network / quota failure during the materialisation query. |
cause |
Inspect .cause for the underlying exception; falls back to prune.sample_strategy: oneshot to bypass materialisation. |
MaterialisationNotSupportedError (v0.2) |
WarehouseAdapter.materialise_sample default impl raised because the concrete adapter doesn't override it (any non-BigQuery adapter in v0.2). |
(none) | Set prune.sample_strategy: oneshot in signalforge.yml to fall back to per-test sampling, or wait for v0.3 multi-warehouse materialisation support. |
EstimateNotSupportedError (v0.2) |
WarehouseAdapter.estimate_query_bytes default impl raised because the concrete adapter doesn't override it (any non-BigQuery adapter in v0.2). |
adapter_name |
Use --estimate with a BigQuery profile, or wait for v0.3 multi-warehouse estimation support. |
v0.2 follow-ups
Known gaps that ship in a later release; tracked here so users can plan around them and maintainers can keep the doc honest as scope changes.
- Lazy
column_statsproxy. v0.1 flushes one query percolumn_statscall (see §column_statsaccess pattern). v0.2 will land a proxy onColumnStatsso the queued columns of a table flush as a single batched query at first field read. - Legacy domain-scoped project IDs. v0.1 rejects
example.com:my-projectstyle IDs (Google Workspace tenants pre-2014). The fix needs both a regex update and a_quote()change to render the colon outside the backtick group. Defer to v0.2 with proper round-trip tests against an actual domain-scoped project. - Drift detector for
DbtProfileTarget. v0.1 validates a hand-curated fixture against a hand-curatedStrictModel— both drift together if a maintainer updates one without the other. v0.2 will regeneratedbt_bigquery_drift_v1_X.ymlfrom a pinneddbt-bigquery==1.X.*release viauvx, mirroring the manifest fixture regeneration pattern from issue #2'stests/fixtures/regenerate.sh.