How PBI Works¶
This page explains the internal architecture of PBI: the Snakemake pipeline, the pbi Python package, key data files, and the REST API.
The Big Picture¶
PBI is a data pipeline that:
- Downloads phage genomic data from PhageScope (which aggregates 14+ databases)
- Parses host information from phage metadata and resolves it to NCBI RefSeq assemblies
- Downloads bacterial reference genomes from NCBI
- Merges everything into an optimized DuckDB database and indexed FASTA files
- Exposes this data through a Python package (pbi) and optionally a REST API
The ultimate goal is to provide phage-host interaction data in an efficient, structured format for training neural networks and AI models.
The Snakemake Pipeline¶
Location: workflow/
The pipeline is orchestrated by Snakemake, a workflow manager that tracks file dependencies and only re-runs steps when inputs change.
Pipeline Stages¶
```text
workflow/Snakefile
│
├── rules/phagescope.smk → Download phage metadata (CSV) + FASTA archives
│                          from PhageScope API for each of the 14+ databases
│
├── rules/database.smk   → Merge CSV files, create and optimize DuckDB database,
│                          generate HTML validation reports
│
├── rules/sequences.smk  → Merge and index phage/protein FASTA files with pyfaidx
│
└── rules/hosts.smk      → Parse host fields → resolve to NCBI assemblies →
                           download host FASTA files → build host_fasta_mapping.json
```
Configuration¶
workflow/config/config.yaml — main pipeline configuration:
- PhageScope API endpoints and database list
- Output paths (all under /data/ in Docker)
- NCBI credentials (email, api_key)
- Download parameters (concurrency, retries, timeouts)
Key Scripts¶
| Script | Purpose |
|---|---|
| `workflow/scripts/preprocessing/mergers/` | Merge per-database CSV files into unified metadata |
| `workflow/scripts/database/create_duckdb.py` | Create the star-schema DuckDB database |
| `workflow/scripts/database/optimize_duckdb.py` | Add indexes and views for query performance |
| `workflow/scripts/database/validate_db.py` | Generate HTML validation reports |
| `workflow/scripts/sequences/` | FASTA merging, indexing, host genome downloads |
| `workflow/scripts/sequences/download_host_genomes_robust.py` | Multi-host parsing and NCBI download |
The pbi Python Package¶
Location: src/pbi/
The pbi package is the primary way to interact with PBI data in Python. It handles database connections, sequence retrieval, and machine learning dataset preparation.
Main Classes¶
SequenceRetriever¶
The central class for accessing the database and FASTA files.
```python
from pbi import quick_connect

# In Docker (paths auto-detected via DATA_PATH environment variable)
retriever = quick_connect()

# Manual initialization
from pbi import SequenceRetriever

retriever = SequenceRetriever(
    db_path="/data/processed/databases/phage_database_optimized.duckdb",
    phage_fasta_path="/data/processed/sequences/all_phages.fasta",
    protein_fasta_path="/data/processed/sequences/all_proteins.fasta",
    host_mapping_path="/data/processed/sequences/host_fasta_mapping.json"
)
```
Key methods:
- get_phage_metadata(where=None, limit=None) — query phage metadata
- get_host_metadata(where=None) — query host metadata
- get_phage_host_pairs(where=None, limit=None) — get linked phage-host pairs
- get_sequences_by_ids(ids, sequence_type='phage') — retrieve FASTA sequences
- get_stats() — database and file statistics
- export_fasta(df, path, id_col) — export sequences to FASTA
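To illustrate how the `where` and `limit` arguments of the query methods might compose into SQL, here is a minimal stand-in sketch. It uses the stdlib `sqlite3` module in place of DuckDB, and the table name and columns are invented for the example; the actual `pbi` implementation is not shown here.

```python
import sqlite3

def get_phage_metadata(conn, where=None, limit=None):
    """Illustrative stand-in: compose optional where/limit clauses into SQL."""
    sql = "SELECT * FROM phages"  # hypothetical table name, not the real schema
    if where:
        sql += f" WHERE {where}"
    if limit:
        sql += f" LIMIT {limit}"
    return conn.execute(sql).fetchall()

# Toy in-memory table standing in for the DuckDB phage metadata
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE phages (phage_id TEXT, lifestyle TEXT)")
conn.executemany("INSERT INTO phages VALUES (?, ?)",
                 [("P1", "Lytic"), ("P2", "Temperate"), ("P3", "Lytic")])

rows = get_phage_metadata(conn, where="lifestyle = 'Lytic'", limit=1)
```

The same pattern applies to `get_host_metadata` and `get_phage_host_pairs`: the caller supplies a SQL fragment and the method handles the surrounding query.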
NegativeExampleGenerator¶
Generates negative training examples (non-interacting phage-host pairs) for machine learning.
```python
from pbi import NegativeExampleGenerator

neg_gen = NegativeExampleGenerator(retriever)
dataset = neg_gen.generate_balanced_dataset(
    positive_pairs=pairs,
    strategy='mixed',
    positive_ratio=0.5
)
```
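The core idea behind negative example generation, pairing phages with hosts they are not known to infect, can be sketched as follows. This is a simplified illustration; the actual sampling strategies in `pbi` (such as `'mixed'`) are not reproduced here.

```python
import random

def generate_negatives(positive_pairs, n_negatives, seed=0):
    """Sample (phage, host) pairs that never appear among the positives."""
    rng = random.Random(seed)
    positives = set(positive_pairs)
    phages = sorted({p for p, _ in positives})
    hosts = sorted({h for _, h in positives})
    negatives = set()
    while len(negatives) < n_negatives:
        pair = (rng.choice(phages), rng.choice(hosts))
        if pair not in positives:  # reject known interactions
            negatives.add(pair)
    return sorted(negatives)

# Toy positive pairs (phage ID, host assembly accession)
pairs = [("P1", "GCF_000005845.2"), ("P2", "GCF_000006945.2"), ("P3", "GCF_000005845.2")]
negs = generate_negatives(pairs, n_negatives=2)
```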
PhageHostStreamingDataset / PhageHostIndexedDataset¶
PyTorch-compatible dataset classes for memory-efficient streaming through large datasets.
```python
from pbi import PhageHostStreamingDataset, PhageHostIndexedDataset, phage_host_collate_fn
from torch.utils.data import DataLoader

dataset = PhageHostStreamingDataset(retriever, where_clause="p.Lifestyle = 'Lytic'")
loader = DataLoader(dataset, batch_size=32, collate_fn=phage_host_collate_fn)
```
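The streaming idea behind these dataset classes, yielding examples lazily rather than loading everything into memory, can be sketched without PyTorch as a generator plus a batching helper. This is only a sketch of the pattern; the real classes wrap `SequenceRetriever` and produce tensors for training.

```python
def stream_pairs(pair_source):
    """Lazily yield (phage_seq, host_seq, label) examples one at a time."""
    for example in pair_source:
        yield example

def batched(iterable, batch_size):
    """Group a stream into fixed-size batches, like a DataLoader's collate step."""
    batch = []
    for item in iterable:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch

# Toy examples standing in for retrieved sequences
examples = [("ATGC", "GGCC", 1), ("TTAA", "CCGG", 0), ("ATAT", "GCGC", 1)]
batches = list(batched(stream_pairs(examples), batch_size=2))
```

Because nothing is materialized until iteration, memory use stays constant regardless of dataset size.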
Key Data Files¶
The pbi package works with these key files produced by the pipeline:
| File | Location | Description |
|---|---|---|
| `phage_database_optimized.duckdb` | `/data/processed/databases/` | Main DuckDB database with all phage metadata |
| `all_phages.fasta` + `.fai` | `/data/processed/sequences/` | All phage genome sequences, indexed with pyfaidx |
| `all_proteins.fasta` + `.fai` | `/data/processed/sequences/` | All protein sequences, indexed with pyfaidx |
| `host_fasta_mapping.json` | `/data/processed/sequences/` | Maps host assembly IDs to their FASTA file paths |
| Individual host FASTA files | `/data/processed/sequences/hosts/` | One FASTA file per downloaded host assembly |
The host_fasta_mapping.json File¶
This JSON file is the key index for host genome access. It maps each host assembly accession to the path of its downloaded FASTA file:
```json
{
  "GCF_000005845.2": "/data/processed/sequences/hosts/Escherichia_coli_GCF_000005845.2.fna",
  "GCF_000006945.2": "/data/processed/sequences/hosts/Salmonella_enterica_GCF_000006945.2.fna",
  ...
}
```
The SequenceRetriever uses this file to efficiently retrieve individual host sequences without loading all host FASTA data into memory.
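The lookup pattern is simple enough to sketch with the stdlib `json` module and a toy mapping. In the real retriever, the resolved path would then be opened with pyfaidx; the helper name below is hypothetical.

```python
import json

# Toy mapping with the same shape as host_fasta_mapping.json
mapping_json = '''{
  "GCF_000005845.2": "/data/processed/sequences/hosts/Escherichia_coli_GCF_000005845.2.fna"
}'''

host_mapping = json.loads(mapping_json)

def host_fasta_path(assembly_id):
    """Resolve an assembly accession to its FASTA path, or None if absent."""
    return host_mapping.get(assembly_id)

path = host_fasta_path("GCF_000005845.2")
```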
How quick_connect() Works¶
```python
def quick_connect():
    paths = get_default_paths()  # Reads DATA_PATH env var or uses project-relative defaults
    return SequenceRetriever(
        db_path=paths['database'],
        phage_fasta_path=paths['phage_fasta'],
        protein_fasta_path=paths['protein_fasta'],
        host_mapping_path=paths['host_mapping']  # Uses host_fasta_mapping.json
    )
```
In Docker, the DATA_PATH environment variable is set to /data/processed, so quick_connect() automatically finds all files.
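A minimal sketch of what `get_default_paths()` could look like, assuming the file names and layout listed above. This is a hypothetical implementation for illustration; the real function in `pbi` may differ in its fallback behavior.

```python
import os

def get_default_paths(fallback="./data/processed"):
    """Build file paths from DATA_PATH, falling back to a project-relative root."""
    root = os.environ.get("DATA_PATH", fallback)
    return {
        "database": os.path.join(root, "databases", "phage_database_optimized.duckdb"),
        "phage_fasta": os.path.join(root, "sequences", "all_phages.fasta"),
        "protein_fasta": os.path.join(root, "sequences", "all_proteins.fasta"),
        "host_mapping": os.path.join(root, "sequences", "host_fasta_mapping.json"),
    }

os.environ["DATA_PATH"] = "/data/processed"  # as set in the Docker containers
paths = get_default_paths()
```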
Host Resolution Process¶
Host genome resolution is a multi-stage process because phage metadata contains complex host fields, often with multiple semicolon-separated entries mixing species names, strain designations, and assembly accessions.
Resolution Stages¶
1. Parsing (`phage_host_candidates.csv`): Each host field is split into tokens, and each token is classified as an assembly accession, species name, or other identifier.
2. Resolution (`phage_host_assemblies.csv`): Each token is resolved to an NCBI assembly accession via:
    - Direct assembly lookup (for `GCA_`/`GCF_` accessions)
    - NCBI Taxonomy + Assembly search (for species names)
    - Fallback species search (for other identifiers)
3. Download: Unique assemblies are downloaded from NCBI RefSeq, deduplicated across phages.
4. Mapping: `host_fasta_mapping.json` is built, mapping host IDs to downloaded FASTA files.
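The parsing and classification step can be sketched with a regular expression for GenBank/RefSeq assembly accessions and a crude binomial-name check. The heuristics below are illustrative only; the real logic lives in `download_host_genomes_robust.py` and is more thorough.

```python
import re

ACCESSION_RE = re.compile(r"^GC[AF]_\d{9}\.\d+$")   # GCA_/GCF_ assembly accessions
SPECIES_RE = re.compile(r"^[A-Z][a-z]+ [a-z]+")     # crude "Genus species" check

def classify_host_token(token):
    """Classify one host token as accession, species name, or other identifier."""
    token = token.strip()
    if ACCESSION_RE.match(token):
        return "assembly_accession"
    if SPECIES_RE.match(token):
        return "species_name"
    return "other"

def parse_host_field(field):
    """Split a semicolon-separated host field and classify each token."""
    return [(t.strip(), classify_host_token(t)) for t in field.split(";") if t.strip()]

tokens = parse_host_field("Escherichia coli; GCF_000005845.2; strain K-12")
```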
See the Host Resolution Details page for full documentation.
The REST API¶
Location: api/
The REST API is built with FastAPI and provides HTTP endpoints for querying the database and retrieving sequences.
⚠️ Note: The REST API is currently a work in progress and is not the recommended way to interact with PBI data. It has not been updated for host management. For analysis and machine learning, use the analysis container with the `pbi` Python package directly.
How the API Works¶
```python
# api/app.py — simplified structure
from fastapi import FastAPI
import duckdb

app = FastAPI()

# On startup, connects to DuckDB and indexes FASTA files
# Exposes endpoints for health checks, SQL queries, and sequence retrieval
```
Key endpoints:
- GET /health — health check
- GET /stats — database statistics
- POST /query — execute SQL queries
- POST /phages — retrieve phage sequences
- POST /phages/fasta — export FASTA format
See the API Reference for full documentation.
Docker Services¶
The docker-compose.yml defines three services:
| Service | Image | Purpose | Port |
|---|---|---|---|
| `pipeline` | `pbi-pipeline` | Runs Snakemake to build the database | — |
| `analysis` | `pbi-analysis` | Jupyter Lab with `pbi` package pre-installed | 8888 |
| `api` | `pbi-api` | FastAPI REST server | 8000 |
All services share the pbi-data Docker volume (read-only for analysis and API, read-write for pipeline).
Data Flow Summary¶
```text
PhageScope API (14+ databases)
  │
  ▼ [phagescope.smk]
Raw CSVs + compressed FASTA archives
  /data/raw/
  │
  ▼ [database.smk + sequences.smk]
Merged metadata + merged FASTA files
  /data/intermediate/
  │
  ▼ [database.smk]
phage_database_optimized.duckdb   ←── all phage/protein metadata
  /data/processed/databases/
  │
all_phages.fasta + .fai           ←── phage genome sequences
all_proteins.fasta + .fai         ←── protein sequences
  /data/processed/sequences/
  │
NCBI RefSeq
  │
  ▼ [hosts.smk]
Individual host FASTA files       ←── bacterial reference genomes
host_fasta_mapping.json           ←── index: host_id → FASTA path
  /data/processed/sequences/
  │
  ▼
pbi Python package (SequenceRetriever)
  │
  ├──▶ Jupyter Lab notebooks (analysis container, port 8888)
  └──▶ REST API (api container, port 8000) [untested]
```