Provenance and Version Pinning¶
This page explains how PBI pins upstream public data context and how to audit what was ingested.
What is pinned¶
Public-source pinning is configured in `workflow/config/config.yaml`:

- `public_data_provider.name`
- `public_data_provider.release`
- `public_data_provider.snapshot_date`
- `public_data_provider.schema_profile`
- `public_data_provider.api_base_url`
- `public_data_provider.provenance_mode`
These values are recorded for every run, so downstream users can identify exactly which upstream snapshot/profile was used.
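As an illustration, the pinning block might look like the following. All field values here are hypothetical, not the pipeline's actual defaults:

```yaml
public_data_provider:
  name: example_provider          # illustrative provider name
  release: "2024_01"              # upstream release identifier
  snapshot_date: "2024-01-15"     # date of the pinned snapshot
  schema_profile: default         # expected upstream schema profile
  api_base_url: https://example.org/api
  provenance_mode: strict         # illustrative value
```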
How provenance is captured¶
During each public file download, PBI writes a sidecar record with:
- source URL and local path
- retrieval timestamp
- file size
- optional checksum (`sha256`)
- optional HTTP headers (`ETag`, `Last-Modified`)
- detected tabular columns + schema fingerprint
- status/error message
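Concretely, a sidecar record with these fields might look like this (all values are hypothetical; the exact key names may differ):

```json
{
  "source_url": "https://example.org/data/phages.tsv",
  "local_path": "public_data/phages.tsv",
  "retrieved_at": "2024-01-15T12:00:00Z",
  "file_size": 1048576,
  "sha256": "ab12...ef90",
  "http_headers": {
    "ETag": "\"abc123\"",
    "Last-Modified": "Mon, 15 Jan 2024 11:59:00 GMT"
  },
  "columns": ["phage_id", "host", "genome_length"],
  "schema_fingerprint": "f3a9c1",
  "status": "ok",
  "error_message": null
}
```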
Then all sidecars are consolidated into:
- `pipeline_logs/csv/public_data_manifest.json`
- `pipeline_logs/csv/public_data_manifest.csv`
- `pipeline_logs/csv/pipeline_run_provenance.json`
- `pipeline_logs/csv/pipeline_run_provenance.csv`
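The consolidation step can be sketched roughly as follows. The sidecar glob pattern (`*.provenance.json`) and the function shape are assumptions for illustration, not the pipeline's actual implementation:

```python
import csv
import json
from pathlib import Path

def consolidate_sidecars(sidecar_dir: Path, out_dir: Path) -> list[dict]:
    """Merge per-download sidecar JSON records into one manifest (JSON + CSV)."""
    records = [
        json.loads(p.read_text())
        for p in sorted(sidecar_dir.glob("*.provenance.json"))  # assumed naming
    ]
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / "public_data_manifest.json").write_text(json.dumps(records, indent=2))
    if records:
        # Union of all keys, so records with differing fields still fit one CSV.
        fields = sorted({k for r in records for k in r})
        with open(out_dir / "public_data_manifest.csv", "w", newline="") as fh:
            writer = csv.DictWriter(fh, fieldnames=fields)
            writer.writeheader()
            writer.writerows(records)
    return records
```

Writing both JSON and CSV keeps the manifest convenient for programmatic use and for quick inspection in a spreadsheet.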
DuckDB provenance tables¶
The database build loads these files into:
- `dataset_provenance`
- `pipeline_run_provenance`
Quick checks:
```sql
SELECT status, COUNT(*) FROM dataset_provenance GROUP BY status;

SELECT provider_release, provider_snapshot_date, provider_schema_profile
FROM pipeline_run_provenance;
```
Private-data ingestion diagnostics¶
Private datasets are validated source-by-source from `private_data/<Source_DB>/`.
Validation output is written to:
`private_data/private_manifest.json` (or `/private-data/private_manifest.json` in containers)
Important behavior:
- only sources with `is_valid: true` are ingested
- invalid sources are skipped and listed with explicit `errors`
- `Source_DB` in each `metadata.csv` must match the folder name exactly
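The folder-name check above can be sketched as a small validator. The metadata column name, the record shape, and the function itself are assumptions for illustration, not the pipeline's real validator:

```python
import csv
from pathlib import Path

def validate_private_source(source_dir: Path) -> dict:
    """Check one private_data/<Source_DB>/ folder against the rules above."""
    errors = []
    meta_path = source_dir / "metadata.csv"
    if not meta_path.exists():
        errors.append("metadata.csv is missing")
    else:
        with open(meta_path, newline="") as fh:
            rows = list(csv.DictReader(fh))
        # Every Source_DB value in metadata.csv must equal the folder name.
        names = {row.get("Source_DB") for row in rows}
        if names != {source_dir.name}:
            errors.append(f"Source_DB mismatch: expected only '{source_dir.name}'")
    return {"Source_DB": source_dir.name, "is_valid": not errors, "errors": errors}
```

Sources whose record comes back with `is_valid: false` would be skipped at ingestion time, matching the behavior described above.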
If a notebook query filtering on `Source_DB = 'test_private'` returns 0 rows, first verify which private sources are available:
```sql
SELECT Source_DB, source_type, COUNT(*) AS phage_count
FROM fact_phages
GROUP BY Source_DB, source_type
ORDER BY source_type, Source_DB;
```
Then inspect:
- `private_manifest.json` for skipped sources and validation errors
- `pipeline_logs/logs/host_download_failures.log` for host-resolution/download failures
- `dataset_provenance` rows with `status = 'failed'` for public-source provenance failures
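The manifest inspection can be done by hand or with a short helper like the one below. It assumes the manifest is a JSON list of per-source records carrying `Source_DB`, `is_valid`, and `errors` fields, as described above:

```python
import json
from pathlib import Path

def skipped_sources(manifest_path: Path) -> dict[str, list[str]]:
    """Return {Source_DB: errors} for every source that was not ingested."""
    records = json.loads(manifest_path.read_text())
    return {
        rec["Source_DB"]: rec.get("errors", [])
        for rec in records
        if not rec.get("is_valid", False)
    }
```

An empty result means every source in the manifest validated, so a missing source would instead point to the manifest itself or to the query filter.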
Common error indicators¶
- Source not ingested: source missing from `fact_phages`, present in manifest with `is_valid: false`
- Source name mismatch: manifest error `Source_DB mismatch: expected only '<folder>'`
- Public provenance failure: `dataset_provenance.status = 'failed'` with non-empty `error_message`
- Schema drift warning/failure: schema fingerprint change in download logs and provenance record