Installation Guide¶
This guide explains how to install and configure PBI using Docker, which is the primary and recommended execution method. Docker is required to run the full pipeline, as the host genome download step requires significant disk space and benefits from containerization.
Note on local execution: It is technically possible to run the pipeline locally without Docker (using Snakemake directly), but this is not recommended for the full pipeline due to the large disk space requirements (~225 GB) and complex dependency management. See the Pipeline Execution Guide for local execution details.
System Requirements¶
- Operating System: Linux/macOS (Windows via WSL2)
- Docker: Version 20.10 or later
- Docker Compose: Version 2.0 or later
- Disk Space: At least 225 GB of free disk space (for data, cache, and Docker volumes)
- RAM: 16 GB minimum (32 GB recommended)
- Internet: Stable connection for initial data downloads
- Time:
- First pipeline run (phage metadata only): ~4 hours
- Full pipeline (including host genome resolution + download): ~12–18 hours
Tip: Use `tmux` or another terminal multiplexer to keep your session alive during long pipeline runs, especially when connecting via SSH.
Step 1: Install Docker¶
On Linux (Ubuntu/Debian)¶
# Install Docker
curl -fsSL https://get.docker.com | sh
sudo usermod -aG docker $USER
newgrp docker
# Install Docker Compose plugin
sudo apt-get install -y docker-compose-plugin
# Verify
docker --version
docker compose version
On macOS¶
Download and install Docker Desktop. Docker Compose is included.
On Windows (WSL2)¶
Install Docker Desktop with WSL2 backend enabled. Run all commands from within your WSL2 terminal.
Step 2: Clone the Repository¶
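A sketch of the clone step; the repository URL below is a placeholder, so substitute the actual PBI repository location:

```shell
# Clone the repository and enter the project root (URL is a placeholder)
git clone https://github.com/your-org/pbi.git
cd pbi
```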
Step 3: Configure the Pipeline¶
Before running, set your NCBI credentials as environment variables (required for host genome downloads). Using environment variables keeps sensitive credentials out of version-controlled config files.
export NCBI_EMAIL="your.email@example.com" # Required by NCBI Terms of Service
export NCBI_API_KEY="YOUR_NCBI_API_KEY"    # Optional but strongly recommended (higher rate limit)
Alternatively, copy the provided .env.example to a .env file in the project root; Docker Compose reads it automatically.
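For example, using the values from the exports above (both values are placeholders):

```shell
# Create a .env file from the provided template
cp .env.example .env
# then edit .env so it contains:
#   NCBI_EMAIL=your.email@example.com
#   NCBI_API_KEY=your-api-key
```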
NCBI API Key: Get a free API key at https://www.ncbi.nlm.nih.gov/account/ — it increases the download rate limit from 3 to 10 requests/second, significantly speeding up the host genome download stage.
Step 4: Build and Run the Pipeline Container¶
4.1 Build the Pipeline Image¶
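A sketch of the build step, assuming the service is named `pipeline` as in the run command used below:

```shell
# Build the pipeline image defined in docker-compose.yml
docker compose build pipeline
```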
4.2 Run the Pipeline¶
# Run in a tmux session to survive SSH disconnections
tmux new -s pbi-pipeline
docker compose run --rm pipeline
What happens:
- Stage 1 — Phage data (~4 hours): Downloads and processes phage metadata and FASTA sequences from 14+ databases via PhageScope
- Stage 2 — Host resolution (~12–18 hours): Parses host fields from phage metadata, resolves them to NCBI assemblies, downloads reference genomes
Output (stored in the pbi-data Docker volume):
- /data/processed/databases/phage_database_optimized.duckdb — Main database
- /data/processed/sequences/all_phages.fasta (+ .fai index)
- /data/processed/sequences/all_proteins.fasta (+ .fai index)
- /data/processed/sequences/host_fasta_mapping.json — Maps host IDs to FASTA file paths
- /data/processed/reports/ — HTML validation and statistics reports
Step 5: Connect to the Analysis Container¶
The analysis container provides a Jupyter Lab environment with direct access to the database and sequences.
5.1 Set Up SSH Port Forwarding (for Remote Servers)¶
If PBI is running on a remote server, forward port 8888 before you start the container.
This maps port 8888 on the remote server to your local machine so you can open Jupyter Lab in your browser at http://localhost:8888.
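A typical forwarding command; the username and hostname are placeholders for your own server:

```shell
# Forward remote port 8888 to localhost:8888 (replace user@remote-server)
ssh -L 8888:localhost:8888 user@remote-server
```

Keep this SSH session open for as long as you need browser access to Jupyter Lab.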
5.2 Build and Start the Analysis Container¶
# Build the analysis image
docker compose build analysis
# Start in detached mode
docker compose up -d analysis
5.3 Access Jupyter Lab¶
Open your browser and navigate to http://localhost:8888.
No password is required (local development mode). Start with the demo notebooks in the notebooks/ directory.
⚠️ Security Warning: Do not expose port 8888 to untrusted networks. Always use SSH tunneling for remote access.
Environment Variables¶
| Variable | Purpose | Default | Required |
|---|---|---|---|
| `DATA_PATH` | Path for API/analysis to find processed data | `/data/processed` | Set automatically in Docker |
| `NCBI_EMAIL` | Your email for NCBI API | — | Yes, for genome downloads |
| `NCBI_API_KEY` | NCBI API key (increases rate limit) | — | Optional but recommended |
NCBI_EMAIL and NCBI_API_KEY must be set in your shell before running Docker Compose, or placed in a .env file at the project root. Docker Compose reads the .env file automatically and passes both variables into the container. DATA_PATH and other internal variables are configured in docker-compose.yml and set automatically.
# Option A: export in your current shell session
export NCBI_EMAIL="your.email@example.com"
export NCBI_API_KEY="your-api-key"
# Option B: .env file in the project root (persists across sessions)
cp .env.example .env
# then edit .env with your values
Troubleshooting¶
"No space left on device"¶
# Check disk space
df -h
# Clean Docker system (careful: removes unused images and volumes)
docker system prune
# Clean only the Snakemake cache volume (preserves data)
./cleanup_cache.sh
Pipeline Fails Mid-Run¶
The pipeline uses Snakemake's file-based checkpointing. If a run fails, simply re-run the same `docker compose run --rm pipeline` command; already-completed steps and downloaded files will be reused.
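To resume after a failure, repeat the original run command (Snakemake's checkpointing skips finished steps):

```shell
# Optional: start a tmux session so the resumed run survives SSH disconnects
tmux new -s pbi-pipeline
# Re-run the pipeline; completed steps and downloads are reused
docker compose run --rm pipeline
```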
Jupyter Lab Not Loading¶
# Check container status
docker ps | grep analysis
# Check logs
docker compose logs analysis
# Restart
docker compose restart analysis
Pipeline Runs Again After Completion¶
Ensure that the pbi-data volume is intact and that the main database file still exists.
If the database file is missing, the pipeline needs to re-run.
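One way to check is with a throwaway container, assuming the volume is mounted at `/data` as in the output paths listed in Step 4:

```shell
# Confirm the volume exists
docker volume inspect pbi-data
# List the main database file inside the volume
docker run --rm -v pbi-data:/data alpine ls -lh /data/processed/databases/
```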
Next Steps¶
- Read How It Works to understand the pipeline and the `pbi` package
- Explore the database with Analysis Container Usage
- Check the Docker Guide for advanced Docker operations