# Code Structure

Overview of the PBI project architecture and code organization.

## Project Layout
```
PBI/
├── workflow/                         # Snakemake pipeline
│   ├── Snakefile                     # Main pipeline definition
│   ├── rules/                        # Modular pipeline rules
│   │   ├── phagescope.smk            # Download phage data from PhageScope
│   │   ├── database.smk              # Database creation and optimization
│   │   ├── sequences.smk             # FASTA merging and indexing
│   │   └── hosts.smk                 # Host genome resolution and download
│   ├── scripts/                      # Processing scripts
│   │   ├── database/                 # DuckDB creation, optimization, validation
│   │   ├── preprocessing/
│   │   │   └── mergers/              # Per-feature CSV merger scripts
│   │   ├── sequences/                # FASTA handling + host genome download
│   │   │   ├── download_host_genomes_robust.py  # Multi-host parser + NCBI downloader
│   │   │   ├── assembly_resolver.py             # NCBI assembly lookup
│   │   │   ├── index_sequences.py               # pyfaidx indexing
│   │   │   └── index_individual_hosts.py        # Build host_fasta_mapping.json
│   │   └── utils/                    # Report generation utilities
│   ├── config/                       # Pipeline configuration
│   │   └── config.yaml               # Main config (NCBI credentials, paths, etc.)
│   ├── envs/                         # Conda environment specifications
│   └── dag/                          # Workflow DAG visualizations
│
├── src/pbi/                          # Python package
│   ├── __init__.py                   # Package init + quick_connect(), get_default_paths()
│   ├── sequence_retrieval.py         # SequenceRetriever class
│   ├── negative_examples.py          # NegativeExampleGenerator class
│   └── streaming_dataset.py          # PhageHostStreamingDataset, PhageHostIndexedDataset
│
├── api/                              # REST API (untested)
│   └── app.py                        # FastAPI application with endpoints
│
├── notebooks/                        # Jupyter notebooks
│   ├── 01_database_exploration.ipynb # DB stats, quality control
│   ├── 02_sequence_retrieval.ipynb   # Sequence retrieval with pbi package
│   ├── 03_ml_streaming.ipynb         # ML dataset preparation
│   ├── README.md
│   ├── bin/                          # Previous versions (kept for reference)
│   └── exploration/                  # Development notebooks
│
├── docs/                             # Documentation (MkDocs)
│   ├── index.md
│   ├── guides/                       # Installation and usage guides
│   ├── database/                     # Database documentation
│   ├── api/                          # API reference
│   ├── reference/                    # Command reference
│   ├── developer/                    # This page
│   └── archive/                      # Historical documentation
│
├── tests/                            # Unit tests
│   └── test_multi_host_parsing.py    # Host parsing and resolution tests
│
├── docker-compose.yml                # Docker orchestration (pipeline, analysis, api)
├── Dockerfile.analysis               # Analysis container (Jupyter Lab + pbi package)
├── Dockerfile.api                    # API container (FastAPI)
├── setup.py                          # Package configuration
└── mkdocs.yml                        # Documentation configuration
```
## Key Components

### Snakemake Pipeline

Location: `workflow/`

The pipeline orchestrates data download, processing, and database creation:

- **`Snakefile`**: Main workflow definition; includes all rule files
- **`rules/phagescope.smk`**: Downloads phage metadata (CSV) and FASTA archives from the PhageScope API for each of 14+ databases
- **`rules/database.smk`**: Merges CSV files, creates and optimizes the DuckDB database, generates HTML validation reports
- **`rules/sequences.smk`**: Merges per-database FASTA files, creates pyfaidx indexes
- **`rules/hosts.smk`**: Parses host fields → resolves them to NCBI assemblies → downloads host FASTA files → builds `host_fasta_mapping.json`

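As a rough sketch of the rule style used throughout `workflow/rules/`, a merge step might look like the fragment below. The rule name, paths, and config key are illustrative assumptions, not taken from the actual `sequences.smk`:

```
# Hypothetical sketch — see workflow/rules/sequences.smk for the real rules
rule merge_phage_fastas:
    input:
        expand("data/intermediate/{db}_phages.fasta",
               db=config["phagescope_databases"])
    output:
        "data/processed/sequences/all_phages.fasta"
    shell:
        "cat {input} > {output}"
```

Snakemake infers the dependency graph from these input/output declarations, which is how the `rules/*.smk` modules compose into one pipeline.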
Key scripts:
| Script | Purpose |
|---|---|
| `scripts/preprocessing/mergers/merge_phage_metadata.py` | Merge per-database phage CSV files |
| `scripts/preprocessing/mergers/merge_annotated_proteins_metadata.py` | Merge protein annotation CSVs |
| `scripts/database/create_duckdb.py` | Create star-schema DuckDB from merged CSVs |
| `scripts/database/optimize_db.py` | Add indexes and views for performance |
| `scripts/database/validate_db.py` | Generate HTML validation/quality reports |
| `scripts/sequences/download_host_genomes_robust.py` | Multi-host parsing + NCBI download |
| `scripts/sequences/assembly_resolver.py` | NCBI Taxonomy + Assembly API lookups |
| `scripts/sequences/index_individual_hosts.py` | Build `host_fasta_mapping.json` |
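Host fields in the source metadata can name several organisms in one cell, which is why the downloader needs a multi-host parser. The real logic lives in `download_host_genomes_robust.py`; the sketch below only illustrates the general idea, and the function name and delimiters are assumptions rather than the script's actual API:

```python
def parse_host_field(raw: str) -> list[str]:
    """Split a free-text host field into individual host names.

    Assumed delimiters: ';' and ','. Real fields may need more
    normalization (strain suffixes, bracketed qualifiers, etc.).
    """
    if not raw or raw.strip().lower() in {"", "na", "unknown"}:
        return []
    hosts = []
    for part in raw.replace(";", ",").split(","):
        name = " ".join(part.split())  # collapse internal whitespace
        if name:
            hosts.append(name)
    return hosts


print(parse_host_field("Escherichia coli; Salmonella enterica"))
# → ['Escherichia coli', 'Salmonella enterica']
```

Each parsed name would then go through `assembly_resolver.py`-style lookups to find a downloadable NCBI assembly.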
### Python Package

Location: `src/pbi/`

The PBI package provides the primary interface for data access and ML dataset preparation:

- **`__init__.py`**: Exports main classes; defines `quick_connect()` and `get_default_paths()` (reads the `DATA_PATH` env var)
- **`sequence_retrieval.py`**: `SequenceRetriever` — connects to DuckDB and FASTA files; provides metadata query methods, sequence retrieval, and phage-host pair retrieval
- **`negative_examples.py`**: `NegativeExampleGenerator` — generates non-interacting phage-host pairs for ML training (multiple strategies: random, taxonomy-aware, etc.)
- **`streaming_dataset.py`**: `PhageHostStreamingDataset`, `PhageHostIndexedDataset`, `phage_host_collate_fn` — PyTorch-compatible dataset classes for memory-efficient streaming

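To illustrate the env-var-driven path lookup, `get_default_paths()` could look roughly like this. Only the function name, the `DATA_PATH` variable, and the output filenames come from this page; the dictionary keys and defaults are assumptions:

```python
import os
from pathlib import Path


def get_default_paths() -> dict:
    """Hypothetical sketch: resolve standard data artifacts under DATA_PATH."""
    # DATA_PATH overrides the default /data root used in the containers
    root = Path(os.environ.get("DATA_PATH", "/data"))
    return {
        "database": root / "processed" / "databases" / "phage_database_optimized.duckdb",
        "phage_fasta": root / "processed" / "sequences" / "all_phages.fasta",
        "host_mapping": root / "processed" / "sequences" / "host_fasta_mapping.json",
    }


os.environ["DATA_PATH"] = "/tmp/pbi-data"
print(get_default_paths()["database"])
# → /tmp/pbi-data/processed/databases/phage_database_optimized.duckdb
```

A `quick_connect()` would presumably build on these paths to open the DuckDB database and FASTA indexes in one call.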
### REST API

Location: `api/`

FastAPI-based REST interface (currently untested):

- **`app.py`**: Main API application with endpoints for health check, stats, SQL queries, and FASTA export
### Documentation

Location: `docs/`

MkDocs-based documentation (this site):

- **`guides/`**: Installation, how-it-works, analysis usage, pipeline execution
- **`database/`**: Database schema and host resolution details
- **`api/`**: API reference (untested)
- **`reference/`**: Command reference
- **`developer/`**: This page
- **`archive/`**: Historical/development documentation
## Data Flow

```
PhageScope API (14+ databases)
        │
        ▼  [phagescope.smk]
Raw CSVs + compressed FASTA archives → /data/raw/
        │
        ▼  [preprocessing/mergers/]
Merged metadata CSVs + merged FASTA files → /data/intermediate/
        │
        ├──▶ [database.smk]  → phage_database_optimized.duckdb → /data/processed/databases/
        └──▶ [sequences.smk] → all_phages.fasta + all_proteins.fasta → /data/processed/sequences/

NCBI RefSeq
        │
        ▼  [hosts.smk]
Individual host FASTA files → /data/processed/sequences/hosts/
host_fasta_mapping.json     → /data/processed/sequences/

All outputs → pbi-data Docker volume
        │
        ├──▶ Analysis container (Jupyter Lab, port 8888) via pbi package
        └──▶ API container (FastAPI, port 8000) [untested]
```
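The `host_fasta_mapping.json` produced at the end of this flow lets consumers resolve a host name to its FASTA file. A minimal sketch, assuming the mapping is a flat name-to-path JSON object (the actual file layout may differ):

```python
import json
from pathlib import Path


def host_fasta_path(mapping_file, host_name):
    """Look up the FASTA file for a host, ignoring case and extra spaces."""
    mapping = json.loads(Path(mapping_file).read_text())
    # Normalize both sides so 'Escherichia  COLI' still matches
    normalized = {k.strip().lower(): v for k, v in mapping.items()}
    hit = normalized.get(" ".join(host_name.split()).lower())
    return Path(hit) if hit else None
```

In practice `index_individual_hosts.py` builds the mapping and the `pbi` package consumes it; this helper only shows the lookup direction.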
## Development Workflow

- **Data Pipeline**: Modify `workflow/rules/` and `workflow/scripts/` for data processing changes
- **Python Package**: Update `src/pbi/` for new utilities or ML features
- **API**: Extend `api/app.py` for new endpoints (test after adding)
- **Documentation**: Update `docs/` for user-facing changes
- **Testing**: Add tests to `tests/`
## Configuration Files

- **`workflow/config/config.yaml`**: Pipeline configuration (NCBI credentials, paths, PhageScope DBs)
- **`workflow/envs/*.yaml`**: Conda environment specifications
- **`setup.py`**: Python package metadata and dependencies
- **`mkdocs.yml`**: Documentation configuration
- **`docker-compose.yml`**: Container orchestration (pipeline, analysis, api services)
- **`.gitignore`**: Excludes the `data/` directory (too large for git)
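For orientation, a hypothetical shape for `workflow/config/config.yaml`. Only the NCBI credentials, paths, and PhageScope database list are documented here; every key name below is an assumption:

```yaml
# Illustrative only — consult workflow/config/config.yaml for the real keys
ncbi:
  email: you@example.org    # Entrez requires a contact email
  api_key: "..."            # optional; raises NCBI rate limits
paths:
  data_root: /data          # mounted as the pbi-data volume in Docker
phagescope_databases:       # the 14+ upstream sources to download
  - "..."
```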
## Tests

Location: `tests/`

- **`test_multi_host_parsing.py`**: 31 unit tests for host field parsing and resolution logic

Run with:
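The original command block did not survive extraction; assuming the suite targets pytest, a typical invocation would be:

```
pytest tests/ -v
# or just the single module:
pytest tests/test_multi_host_parsing.py
```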
## Resources

For contributing guidelines, see the project README or open an issue on GitHub.