PBI Story: a one-read walkthrough
This page explains the full PBI flow in execution order.
For setup and commands, use the Installation Guide.
1) What PhageScope brings
PhageScope aggregates several public phage resources into consistent exports. PBI uses those exports as its main public phage input (metadata and FASTA).
2) Why Docker is central
PBI relies on Docker to keep paths, environments, and large intermediate data consistent.
- pipeline builds data
- analysis reads data for notebooks/scripts
- api is legacy and limited
Named volumes and bind mounts keep outputs persistent and auditable.
3) What the pipeline does (in order)
- Download public phage data from PhageScope sources
- Merge and normalize metadata with schema contracts
- Build merged FASTA files for phages and proteins, then create indexes
- Validate private sources (if folders exist in private_data/)
- Prepare private mappings (private phages and hosts)
- Parse host fields from phage metadata
- Resolve hosts to NCBI assemblies
- Download host FASTAs from NCBI RefSeq
- Create DuckDB database and optimize analytical access
- Store reports and logs (validation, quality, failure logs)
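The ordering above can be sketched as a simple execution plan. This is a minimal illustration, not PBI's actual orchestration code: the step labels and both function names are hypothetical, and treating "Prepare private mappings" as conditional on private sources is an assumption based on the description of the validation step.

```python
from pathlib import Path
from typing import List, Tuple


def has_private_sources(private_root: Path) -> bool:
    """Return True if private_root exists and contains at least one subfolder."""
    return private_root.is_dir() and any(p.is_dir() for p in private_root.iterdir())


def build_plan(private_present: bool) -> List[Tuple[str, bool]]:
    """Return (step label, will_run) pairs in execution order.

    The labels are hypothetical names for the stages listed above.
    """
    return [
        ("download_public_data", True),
        ("merge_and_normalize_metadata", True),
        ("build_fasta_and_indexes", True),
        # Assumption: both private steps are skipped when no private folders exist.
        ("validate_private_sources", private_present),
        ("prepare_private_mappings", private_present),
        ("parse_host_fields", True),
        ("resolve_hosts_to_assemblies", True),
        ("download_host_fastas", True),
        ("create_duckdb", True),
        ("store_reports_and_logs", True),
    ]


def describe(plan: List[Tuple[str, bool]]) -> List[str]:
    """Render the plan as human-readable lines, one per step."""
    return [f"{'run' if enabled else 'skip'}: {name}" for name, enabled in plan]


if __name__ == "__main__":
    present = has_private_sources(Path("private_data"))
    for line in describe(build_plan(present)):
        print(line)
```

Keeping the plan as data makes the order explicit and lets a run record which steps were skipped, which fits the traceability goal of the final step.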
4) Resulting data product
After completion, PBI provides:
- DuckDB database for metadata exploration
- Indexed phage/protein FASTA files
- Host FASTA mapping for host retrieval
- Private phage mapping when private sources are present
- Pipeline logs and reports for traceability
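To illustrate why indexed FASTA files matter, the sketch below builds a byte-offset index so that a single record can be read without scanning the whole file. It is a generic illustration using only the standard library; the index layout and the names `build_index` and `fetch_sequence` are assumptions and do not describe the index format PBI actually writes.

```python
import io
from typing import BinaryIO, Dict, Tuple


def build_index(handle: BinaryIO) -> Dict[str, Tuple[int, int]]:
    """Map each record ID to (byte offset of its header, record length in bytes)."""
    index: Dict[str, Tuple[int, int]] = {}
    current_id = None
    start = 0
    offset = 0
    for line in handle:
        if line.startswith(b">"):
            # Close the previous record before starting a new one.
            if current_id is not None:
                index[current_id] = (start, offset - start)
            parts = line[1:].split()
            current_id = parts[0].decode() if parts else ""
            start = offset
        offset += len(line)
    if current_id is not None:
        index[current_id] = (start, offset - start)
    return index


def fetch_sequence(handle: BinaryIO, index: Dict[str, Tuple[int, int]], record_id: str) -> str:
    """Seek directly to one record and return its sequence without the header."""
    start, length = index[record_id]
    handle.seek(start)
    lines = handle.read(length).decode().splitlines()
    return "".join(lines[1:])


if __name__ == "__main__":
    # Tiny in-memory FASTA with made-up records, standing in for a merged file.
    fasta = io.BytesIO(b">phage1 example\nACGT\nGG\n>phage2\nTTAA\n")
    idx = build_index(fasta)
    print(fetch_sequence(fasta, idx, "phage2"))
```

With a real file, the handle would be opened in binary mode so that offsets are byte-exact regardless of line endings.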
5) How users work with it
The recommended interface is the analysis container with the pbi package.
- Use VS Code Dev Containers for a full IDE workflow (preferred)
- Use Jupyter Lab for a notebook-first workflow