Private Data Ingestion
PBI can ingest private sources from private_data/ in addition to public PhageScope data.
Required per source
Mandatory rules
- metadata.csv is required
- phage.fasta is required
- host sequences are required as hosts/<Host_ID>.fna
- every Host_ID in metadata must map to a host FASTA file
- every Phage_ID in metadata must exist in phage.fasta
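The rules above can be approximated with a short standalone check. This is a minimal sketch, not the pipeline's own validator: the function names `fasta_ids` and `check_source` are hypothetical, and it assumes that metadata.csv has `Phage_ID`, `Host_ID`, and `Source_DB` header columns and that a FASTA record ID is the first whitespace-delimited token after `>`. It also applies the Source_DB folder-name rule listed under Runtime behavior.

```python
import csv
from pathlib import Path


def fasta_ids(path):
    """Collect record IDs (assumed: first token after '>') from a FASTA file."""
    ids = set()
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                tokens = line[1:].split()
                if tokens:
                    ids.add(tokens[0])
    return ids


def check_source(source_dir):
    """Return a list of rule violations for one private source folder."""
    source_dir = Path(source_dir)
    metadata = source_dir / "metadata.csv"
    phage_fasta = source_dir / "phage.fasta"

    errors = []
    if not metadata.is_file():
        errors.append("metadata.csv is missing")
    if not phage_fasta.is_file():
        errors.append("phage.fasta is missing")
    if errors:
        return errors  # nothing further can be checked

    known_phages = fasta_ids(phage_fasta)
    with open(metadata, newline="") as handle:
        for row in csv.DictReader(handle):
            # Column names below are assumptions about the metadata layout.
            phage_id = (row.get("Phage_ID") or "").strip()
            host_id = (row.get("Host_ID") or "").strip()
            source_db = (row.get("Source_DB") or "").strip()
            if phage_id not in known_phages:
                errors.append(f"Phage_ID {phage_id!r} not in phage.fasta")
            if not (source_dir / "hosts" / f"{host_id}.fna").is_file():
                errors.append(f"Host_ID {host_id!r} has no hosts/{host_id}.fna")
            if source_db != source_dir.name:
                errors.append(f"Source_DB {source_db!r} != folder {source_dir.name!r}")
    return errors
```

Calling `check_source("private_data/<source_folder>")` returns an empty list when the folder satisfies every rule, so it can be used as a quick pre-flight check before running the real validation step.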
Validate before pipeline
Runtime behavior
- Valid private sources are ingested and linked with source_type=private
- Invalid sources are skipped (public pipeline still completes)
- Re-running pipeline synchronizes removals/additions
- Source_DB in metadata.csv must match the source folder name exactly
Validate what was ingested
Use DuckDB (or SequenceRetriever) to inspect available source labels:
SELECT Source_DB, source_type, COUNT(*) AS phage_count
FROM fact_phages
GROUP BY Source_DB, source_type
ORDER BY source_type, Source_DB;
If filtering on Source_DB = 'test_private' returns 0 results, run the query above first to confirm the exact source name currently present (for example, test_private_2).
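The lookup described above can be scripted. The sketch below is illustrative only: `suggest_source_names` is a hypothetical helper, and an in-memory SQLite table stands in for the real DuckDB database so the example runs without extra dependencies. The helper only calls `con.execute(...).fetchall()`, so a DuckDB connection should work the same way, though the database file location is not assumed here.

```python
import difflib
import sqlite3

LABEL_QUERY = """
SELECT Source_DB, source_type, COUNT(*) AS phage_count
FROM fact_phages
GROUP BY Source_DB, source_type
ORDER BY source_type, Source_DB
"""


def suggest_source_names(con, wanted):
    """Return ingested Source_DB labels that match or closely resemble `wanted`."""
    rows = con.execute(LABEL_QUERY).fetchall()
    labels = [row[0] for row in rows]
    if wanted in labels:
        return [wanted]
    return difflib.get_close_matches(wanted, labels, n=3, cutoff=0.6)


# Stand-in data for illustration; real labels come from the ingested database.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE fact_phages (Phage_ID TEXT, Source_DB TEXT, source_type TEXT)"
)
con.executemany(
    "INSERT INTO fact_phages VALUES (?, ?, ?)",
    [
        ("p1", "test_private_2", "private"),
        ("p2", "test_private_2", "private"),
        ("p3", "public_demo", "public"),
    ],
)
print(suggest_source_names(con, "test_private"))
```

With the stand-in rows, asking for test_private surfaces test_private_2 as the nearest label actually present.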
Output mappings
- private_phage_mapping.json routes private phage retrieval
- host_fasta_mapping.json includes host paths (public + private)
Logs
In Docker runs, logs/reports are available in ./pipeline_logs/.
Private-source validation details are written to:
- private_data/private_manifest.json (host path)
- /private-data/private_manifest.json (inside container)
This manifest explicitly lists:
- is_valid per source
- validation errors
- skipped/ingested source counts
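To pull the failing sources out of the manifest quickly, something like the following could be used. It is a sketch under stated assumptions: the manifest's overall layout is not documented here, so the code walks the JSON and reports any object that carries an `is_valid` key, reading an `errors` key when present. The helper names and the example content are hypothetical.

```python
import json
from pathlib import Path


def find_source_entries(node, path=""):
    """Yield (location, entry) for every JSON object that has an 'is_valid' key."""
    if isinstance(node, dict):
        if "is_valid" in node:
            yield path or "<root>", node
        for key, value in node.items():
            yield from find_source_entries(value, f"{path}/{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from find_source_entries(value, f"{path}/{index}")


def summarize_manifest(manifest):
    """Return {location: errors} for entries whose is_valid is falsy."""
    return {
        location: entry.get("errors", [])
        for location, entry in find_source_entries(manifest)
        if not entry["is_valid"]
    }


# Hypothetical manifest content, for illustration only.
example = {
    "sources": {
        "lab_a": {"is_valid": True, "errors": []},
        "lab_b": {"is_valid": False, "errors": ["missing phage.fasta"]},
    }
}
print(summarize_manifest(example))

# For a real run, load the file from the host path instead:
# manifest = json.loads(Path("private_data/private_manifest.json").read_text())
```

Because the walker does not depend on a fixed top-level structure, it should keep working if the manifest nests its per-source entries differently from the example.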
For provenance/version-pinning details and public-source diagnostics, see: