Welcome to PBI Documentation
Phage-Bacteria Interaction Database Pipeline
What is PBI?
PBI is a bioinformatics pipeline designed to make phage genomic data from PhageScope available in an efficient, structured way for training neural networks and AI models for phage-host interaction prediction. It integrates data from 14+ phage databases via PhageScope and downloads matching bacterial host genomes from NCBI RefSeq.
Note: PBI is a proof of concept and is dependent on PhageScope as its primary data source. Future development will aim to provide more precise host strain information when available.
What you get after running the pipeline:
- ~873,000 phage genomes with complete metadata
- ~43 million protein annotations with functional predictions
- Bacterial host reference genomes from NCBI RefSeq
- Optimized DuckDB database (~5 GB) for fast analytical queries
- Indexed FASTA files (~100 GB) with pyfaidx for rapid sequence retrieval
- Python package (pbi) for easy data access and machine learning dataset preparation
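An indexed FASTA file lets a single record be read without scanning the whole file. pyfaidx achieves this through a .fai index; as a rough, standard-library-only illustration of the same idea (not pyfaidx's actual implementation, and not code from the pbi package), the sketch below records the byte offset of each record and seeks straight to it. The file name and record IDs are made up for the example.

```python
import os
import tempfile


def build_offset_index(fasta_path):
    """Map each record ID to the byte offset of its header line."""
    index = {}
    with open(fasta_path, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                # Record ID = first whitespace-delimited token after '>'
                record_id = line[1:].split()[0].decode()
                index[record_id] = offset
    return index


def fetch_sequence(fasta_path, index, record_id):
    """Seek to one record and return its sequence as a single string."""
    parts = []
    with open(fasta_path, "rb") as handle:
        handle.seek(index[record_id])
        handle.readline()  # skip the header line
        for line in handle:
            if line.startswith(b">"):
                break  # next record reached
            parts.append(line.strip().decode())
    return "".join(parts)


# Toy FASTA file with hypothetical record IDs
demo_path = os.path.join(tempfile.mkdtemp(), "demo.fasta")
with open(demo_path, "w") as out:
    out.write(">phage_A demo\nACGT\nTTGA\n>phage_B demo\nGGCC\n")

demo_index = build_offset_index(demo_path)
phage_b = fetch_sequence(demo_path, demo_index, "phage_B")
print(phage_b)
```

The index is built once and reused, so later lookups cost one seek plus the length of the requested record rather than a pass over the full file.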
Getting Started
The recommended (and primary) way to run PBI is via Docker. See the Installation Guide for step-by-step instructions on setting up Docker, cloning the repository, running the pipeline container, and connecting to the analysis container via SSH port forwarding.
- How to install Docker, clone the repository, configure the pipeline, run the containers, and connect to the Jupyter Lab analysis environment via SSH port forwarding.
- Explanation of the pipeline internals, the pbi Python package (including key files like host_fasta_mapping.json), and the overall architecture.
- How to explore the database, retrieve sequences, and prepare machine learning datasets using the Jupyter Lab analysis container. Includes links to the three demo notebooks.
- REST API reference. ⚠️ Work In Progress: the API has not yet been updated for host management and is not the recommended way to interact with the data.
Pipeline Overview
The PBI pipeline follows a systematic data flow from download to analysis-ready outputs:
┌─────────────────────────────────────────────────────────────────┐
│                          PBI Data Flow                          │
│                                                                 │
│  ┌──────────────┐           ┌──────────────────────────────┐    │
│  │  PhageScope  │────────>  │ Stage 1: Phage Metadata      │    │
│  │  (14+ DBs)   │           │ Download & merge metadata    │    │
│  └──────────────┘           │ + FASTA sequences            │    │
│                             │ ~4 hours first run           │    │
│                             └──────────────┬───────────────┘    │
│                                            │                    │
│                                            ▼                    │
│  ┌──────────────┐           ┌──────────────────────────────┐    │
│  │ NCBI RefSeq  │────────>  │ Stage 2: Host Resolution     │    │
│  │ (Bacterial   │           │ Parse host fields, resolve   │    │
│  │  genomes)    │           │ to assemblies, download      │    │
│  └──────────────┘           │ ~18-24 hours first run       │    │
│                             └──────────────┬───────────────┘    │
│                                            │                    │
│                                            ▼                    │
│            ┌──────────────────────────────────────┐             │
│            │  Final Outputs (in pbi-data volume)  │             │
│            │  ├─ DuckDB Database (~5 GB)          │             │
│            │  ├─ Phage FASTA + index (~40 GB)     │             │
│            │  ├─ Protein FASTA + index (~60 GB)   │             │
│            │  ├─ Host FASTA files + JSON (~90 GB) │             │
│            │  └─ HTML Validation Reports          │             │
│            └───────────────────┬──────────────────┘             │
│                                │                                │
│               ┌────────────────┴────────────────┐               │
│               ▼                                 ▼               │
│      ┌─────────────────┐               ┌─────────────────┐      │
│      │ Analysis Service│               │    REST API     │      │
│      │  (Jupyter Lab)  │               │    (FastAPI)    │      │
│      │    Port 8888    │               │    Port 8000    │      │
│      └─────────────────┘               └─────────────────┘      │
└─────────────────────────────────────────────────────────────────┘
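The output sizes listed in the diagram add up to roughly 195 GB, so it can be worth checking free disk space before a first run. The following is a minimal sketch using only the standard library; the dictionary keys and the path being checked are illustrative, and any temporary or intermediate files the pipeline may need are not accounted for.

```python
import shutil

# Approximate final output sizes in GB, taken from the data-flow diagram
OUTPUT_SIZES_GB = {
    "duckdb_database": 5,
    "phage_fasta": 40,
    "protein_fasta": 60,
    "host_fasta": 90,
}


def total_required_gb(sizes=OUTPUT_SIZES_GB):
    """Sum the approximate sizes of all final outputs."""
    return sum(sizes.values())


def has_enough_space(path, required_gb):
    """Return True if the filesystem holding `path` has the space free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb


required = total_required_gb()
# "." is a placeholder; point this at wherever the data volume is stored
enough = has_enough_space(".", required)
print(f"Need ~{required} GB; enough free space: {enough}")
```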
Database Schema
The database uses a star schema with phage metadata at the center and host genomes linked via a separate dimension table. See the Database Overview for full details.
dim_proteins                        ──┐
dim_terminators                     ──┤
dim_anti_crispr                     ──┤
dim_virulent_factors                ──┤
dim_transmembrane_proteins          ──┤──▶ fact_phages (central)
dim_trna_tmrna                      ──┤
dim_antimicrobial_resistance_genes  ──┤
dim_crispr_array                    ──┘
dim_hosts ──▶ (linked via phage_host_links.csv
               and host_fasta_mapping.json)
All phage dimension tables link to fact_phages via Phage_ID. Host genomes are stored as separate FASTA files and indexed via host_fasta_mapping.json for fast retrieval.
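Because every phage dimension table carries Phage_ID, per-phage summaries reduce to a join against fact_phages. The sketch below shows the pattern on toy data. It uses Python's built-in sqlite3 module as a stand-in so that it runs without extra dependencies; the same SQL should carry over to a DuckDB connection. Only the table names and Phage_ID come from the schema above; the other columns (Host, Protein_ID) and all rows are invented for illustration.

```python
import sqlite3

# In-memory stand-in for the real database (assumption: toy tables only)
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE fact_phages (Phage_ID TEXT PRIMARY KEY, Host TEXT);
    CREATE TABLE dim_proteins (Protein_ID TEXT, Phage_ID TEXT);
    INSERT INTO fact_phages VALUES
        ('P1', 'Escherichia coli'),
        ('P2', 'Bacillus subtilis');
    INSERT INTO dim_proteins VALUES
        ('prot_1', 'P1'),
        ('prot_2', 'P1'),
        ('prot_3', 'P2');
    """
)

# Count proteins per phage by joining the dimension table on Phage_ID
query = """
    SELECT f.Phage_ID, f.Host, COUNT(p.Protein_ID) AS n_proteins
    FROM fact_phages AS f
    LEFT JOIN dim_proteins AS p ON p.Phage_ID = f.Phage_ID
    GROUP BY f.Phage_ID, f.Host
    ORDER BY f.Phage_ID
"""
rows = conn.execute(query).fetchall()
for phage_id, host, n_proteins in rows:
    print(phage_id, host, n_proteins)
```

A LEFT JOIN keeps phages that have no rows in the dimension table, which matters when a given annotation type is absent for some genomes.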
Documentation
- Installation, how it works, analysis container usage, and pipeline execution
- Schema documentation, tables, host data, and data sources
- REST API endpoints (currently untested)
- Architecture, code structure, and contributing
Current Status
| Component | Status | Description |
|---|---|---|
| Pipeline | ✅ Complete | Snakemake workflow with 14+ data sources |
| Phage Database | ✅ Complete | Optimized DuckDB with star schema |
| Host Genomes | ✅ Complete | NCBI RefSeq downloads with multi-host support |
| Sequences | ✅ Complete | Indexed FASTA files (phages, proteins, hosts) |
| Docker | ✅ Complete | Production-ready containers |
| Python Package | 🔧 Active Development | Core functionality available |
| REST API | 🚧 Work In Progress | Basic endpoints implemented; not tested, not updated for host management |
| Documentation | 🔧 Active Development | Continuously improving |
Need Help?
- Browse the guides for detailed instructions
- Report issues on GitHub
- Check the troubleshooting sections in our guides
PBI is a proof of concept built with Snakemake, DuckDB, and FastAPI. It is under active development.