Welcome to PBI Documentation¶

Phage-Bacteria Interaction Database Pipeline

What is PBI?¶

PBI is a bioinformatics pipeline designed to make phage genomic data from PhageScope available in an efficient, structured way for training neural networks and AI models for phage-host interaction prediction. It integrates data from 14+ phage databases via PhageScope and downloads matching bacterial host genomes from NCBI RefSeq.

Note: PBI is a proof of concept and is dependent on PhageScope as its primary data source. Future development will aim to provide more precise host strain information when available.

What you get after running the pipeline:

~873,000 phage genomes with complete metadata
~43 million protein annotations with functional predictions
Bacterial host reference genomes from NCBI RefSeq
Optimized DuckDB database (~5 GB) for fast analytical queries
Indexed FASTA files (~100 GB) with pyfaidx for rapid sequence retrieval
Python package (pbi) for easy data access and machine learning dataset preparation

Getting Started¶

The recommended (and primary) way to run PBI is via Docker. See the Installation Guide for step-by-step instructions on setting up Docker, cloning the repository, running the pipeline container, and connecting to the analysis container via SSH port forwarding.

Installation Guide

How to install Docker, clone the repository, configure the pipeline, run the containers, and connect to the Jupyter Lab analysis environment via SSH port forwarding.
How It Works

Explanation of the pipeline internals, the pbi Python package (including key files like host_fasta_mapping.json), and the overall architecture.
Analysis Container Usage

How to explore the database, retrieve sequences, and prepare machine learning datasets using the Jupyter Lab analysis container. Includes links to the three demo notebooks.
API Usage

REST API reference. ⚠️ Work In Progress — the API is not yet updated for host management and is not the recommended way to interact with data.

Pipeline Overview¶

The PBI pipeline follows a systematic data flow from download to analysis-ready outputs:

┌─────────────────────────────────────────────────────────────────┐
│                    PBI Data Flow                                │
│                                                                 │
│  ┌──────────────┐          ┌──────────────────────────────┐     │
│  │  PhageScope  │────────> │  Stage 1: Phage Metadata     │     │
│  │  (14+ DBs)   │          │  Download & merge metadata   │     │
│  └──────────────┘          │  + FASTA sequences           │     │
│                            │     ~4 hours first run       │     │
│                            └──────────────┬───────────────┘     │
│                                           │                     │
│                                           ▼                     │
│  ┌──────────────┐          ┌──────────────────────────────┐     │
│  │  NCBI RefSeq │────────> │  Stage 2: Host Resolution    │     │
│  │ (Bacterial   │          │  Parse host fields, resolve  │     │
│  │  genomes)    │          │  to assemblies, download     │     │
│  └──────────────┘          │     ~18-24 hours first run   │     │
│                            └──────────────┬───────────────┘     │
│                                           │                     │
│                                           ▼                     │
│                    ┌──────────────────────────────────────┐     │
│                    │  Final Outputs (in pbi-data volume)  │     │
│                    │  ├─ DuckDB Database  (~5 GB)         │     │
│                    │  ├─ Phage FASTA + index (~40 GB)     │     │
│                    │  ├─ Protein FASTA + index (~60 GB)   │     │
│                    │  ├─ Host FASTA files + JSON (~90 GB) │     │
│                    │  └─ HTML Validation Reports          │     │
│                    └─────────────────┬────────────────────┘     │
│                                      │                          │
│              ┌───────────────────────┴────────────────────┐     │
│              ▼                                            ▼     │
│     ┌─────────────────┐                    ┌─────────────────┐  │
│     │ Analysis Service│                    │   REST API      │  │
│     │  (Jupyter Lab)  │                    │  (FastAPI)      │  │
│     │  Port 8888      │                    │  Port 8000      │  │
│     └─────────────────┘                    └─────────────────┘  │
└─────────────────────────────────────────────────────────────────┘

Database Schema¶

The database uses a star schema with phage metadata at the center and host genomes linked via a separate dimension table. See the Database Overview for full details.

                     dim_proteins ──┐
                  dim_terminators ──┤
                  dim_anti_crispr ──┤
             dim_virulent_factors ──┤
       dim_transmembrane_proteins ──┤──▶ fact_phages (central)
                   dim_trna_tmrna ──┤
dim_antimicrobial_resistance_genes ─┤
                  dim_crispr_array ─┘
                       dim_hosts  ──▶ (linked via phage_host_links.csv
                                       and host_fasta_mapping.json)

All phage dimension tables link to fact_phages via Phage_ID. Host genomes are stored as separate FASTA files and indexed via host_fasta_mapping.json for fast retrieval.

Documentation¶

Guides

Installation, how it works, analysis container usage, and pipeline execution
Database

Schema documentation, tables, host data, and data sources
API Reference

REST API endpoints — currently untested
Developer Guide

Architecture, code structure, and contributing

Current Status¶

Component	Status	Description
Pipeline	✅ Complete	Snakemake workflow with 14+ data sources
Phage Database	✅ Complete	Optimized DuckDB with star schema
Host Genomes	✅ Complete	NCBI RefSeq downloads with multi-host support
Sequences	✅ Complete	Indexed FASTA files (phages, proteins, hosts)
Docker	✅ Complete	Production-ready containers
Python Package	🔧 Active Development	Core functionality available
REST API	🚧 Work In Progress	Basic endpoints implemented; not tested, not updated for host management
Documentation	🔧 Active Development	Continuously improving

Need Help?¶

Browse the guides for detailed instructions
Report issues on GitHub
Check the troubleshooting sections in our guides

PBI is a proof of concept built with Snakemake, DuckDB, and FastAPI. It is under active development.