Skip to content

Private Data Ingestion

PBI can ingest private sources from private_data/ in addition to public PhageScope data.

Required per source

private_data/
  <Source_DB>/
    metadata.csv
    phage.fasta
    hosts/
      <Host_ID>.fna

Mandatory rules

  • metadata.csv is required
  • phage.fasta is required
  • host sequences are required as hosts/<Host_ID>.fna
  • every Host_ID in metadata must map to a host FASTA file
  • every Phage_ID in metadata must exist in phage.fasta

Validate before pipeline

pbi validate-private

Runtime behavior

  • Valid private sources are ingested and linked with source_type=private
  • Invalid sources are skipped (public pipeline still completes)
  • Re-running pipeline synchronizes removals/additions
  • Source_DB in metadata.csv must match the source folder name exactly

Validate what was ingested

Use DuckDB (or SequenceRetriever) to inspect available source labels:

SELECT Source_DB, source_type, COUNT(*) AS phage_count
FROM fact_phages
GROUP BY Source_DB, source_type
ORDER BY source_type, Source_DB;

If you filter Source_DB = 'test_private' and get 0, first check this query to confirm the exact source name currently present (for example test_private_2).

Output mappings

  • private_phage_mapping.json routes private phage retrieval
  • host_fasta_mapping.json includes host paths (public + private)

Logs

In Docker runs, logs/reports are available in ./pipeline_logs/.

Private-source validation details are written to:

  • private_data/private_manifest.json (host path)
  • /private-data/private_manifest.json (inside container)

This manifest explicitly lists:

  • is_valid per source
  • validation errors
  • skipped/ingested source counts

For provenance/version-pinning details and public-source diagnostics, see: