PBI Story — a one-read walkthrough

This page explains the full PBI flow in execution order.

For setup and commands, see the Installation Guide.

1) What PhageScope brings

PhageScope aggregates several public phage resources into consistent exports. PBI uses those exports as its main public phage input (metadata and FASTA).
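
PhageScope's FASTA exports are plain text, so they can be streamed with a few lines of standard-library Python. A minimal reader sketch (the record layout shown is the generic FASTA convention; actual PhageScope headers may carry more fields):

```python
import io

def read_fasta(handle):
    """Yield (record_id, sequence) pairs from a FASTA text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            # keep only the ID token, drop the free-text description
            header, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

# Tiny inline stand-in for a PhageScope FASTA export.
demo = io.StringIO(">phage_A some description\nACGT\nACGT\n>phage_B\nTTTT\n")
records = dict(read_fasta(demo))
```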

2) Why Docker is central

PBI relies on Docker to keep paths, environments, and large intermediate data consistent.

  • pipeline: builds the data products
  • analysis: reads that data for notebooks and scripts
  • api: legacy, with limited functionality

Named volumes and bind mounts keep outputs persistent and auditable.
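
A hedged sketch of what such a Compose layout could look like (service, volume, and path names here are illustrative, not PBI's actual compose file):

```yaml
services:
  pipeline:
    build: .
    volumes:
      - pbi_data:/data                    # named volume: persistent outputs
      - ./private_data:/private_data:ro   # bind mount: auditable private inputs
  analysis:
    build: .
    volumes:
      - pbi_data:/data:ro                 # analysis reads what pipeline built

volumes:
  pbi_data:
```

Mounting the same named volume read-only in `analysis` is what keeps the two roles separated: the pipeline is the only writer, so outputs stay reproducible.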

3) What the pipeline does (in order)

  1. Download public phage data from PhageScope sources
  2. Merge and normalize metadata with schema contracts
  3. Build merged FASTA files for phages and proteins + create indexes
  4. Validate private sources (if folders exist in private_data/)
  5. Prepare private mappings (private phages and hosts)
  6. Parse host fields from phage metadata
  7. Resolve hosts to NCBI assemblies
  8. Download host FASTAs from NCBI RefSeq
  9. Create the DuckDB database and optimize it for analytical access
  10. Store reports and logs (validation, quality, failure logs)
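
Step 6 above can be sketched as a small parser. This is a minimal illustration, assuming host fields are free text like "Escherichia coli str. K-12"; the real field format in the metadata may differ:

```python
def parse_host(field):
    """Split a free-text host field into (genus, species).

    Assumes the common 'Genus species [strain ...]' layout; anything
    ambiguous (e.g. 'Escherichia sp.') gets species=None.
    """
    tokens = field.strip().split()
    if not tokens:
        return None, None
    genus = tokens[0]
    species = None
    if len(tokens) > 1 and tokens[1] not in {"sp.", "bacterium"}:
        species = tokens[1]
    return genus, species
```

Resolving the parsed (genus, species) pair to a concrete NCBI assembly (steps 7 and 8) then becomes a lookup problem rather than a string-matching one.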

4) Resulting data product

After completion, PBI provides:

  • DuckDB database for metadata exploration
  • Indexed phage/protein FASTA files
  • Host FASTA mapping for host retrieval
  • Private phage mapping when private sources are present
  • Pipeline logs and reports for traceability

5) How users work with it

The recommended interface is the analysis container with the pbi package.

  • Use VS Code Dev Containers for full IDE workflow (preferred)
  • Use Jupyter Lab for notebook-first workflow

6) Where to go next