
Analysis Container Usage Guide

This guide explains how to use the PBI analysis container (Jupyter Lab) to explore the database, retrieve sequences, and prepare machine learning datasets. This is the recommended way to interact with PBI data.

Demo Notebooks

Three demo notebooks are included in the notebooks/ directory. Start with these to get familiar with PBI:

| Notebook | Description |
| --- | --- |
| 📊 01_database_exploration.ipynb | Database statistics, phage quality metrics, host coverage analysis, and data quality assessment |
| 🔬 02_sequence_retrieval.ipynb | Connecting to the database, querying metadata, retrieving phage/host/protein sequences |
| 🤖 03_ml_streaming.ipynb | End-to-end ML workflow: feature engineering, negative example generation, streaming datasets for PyTorch |

Open these in Jupyter Lab after starting the analysis container (see Installation Guide for setup).

Overview

The PBI analysis container provides direct access to the DuckDB database and FASTA files, enabling:

  • 5-50x faster bulk data retrieval compared to the REST API
  • Batch processing of millions of records without memory issues
  • Safe read-only access to production data
  • Interactive analysis with Jupyter Lab
  • Zero network overhead - direct file system access

Python Package Usage

Using the pbi Package

The pbi Python package provides convenient classes for database and sequence access. It is designed to work seamlessly within the Docker analysis container.

Recommended Approach: Docker Container

The pbi package is pre-installed in the analysis container with all dependencies configured correctly. This is the recommended way to use the package:

# Start the analysis container (see Installation Guide for SSH port forwarding first)
docker compose up -d analysis
# Access Jupyter Lab at http://localhost:8888

Local Installation (Not Recommended for Full Pipeline)

While it's possible to install the package locally for development, this requires manual management of large data files (~225 GB):

# Install the package
pip install -e .

# Set environment variables
export DATA_PATH=/path/to/your/data/processed

# Use in Python
from pbi import quick_connect
retriever = quick_connect()

Note: For production use and analysis, always prefer the Docker container approach.

Architecture

┌──────────────┐      ┌──────────────┐     ┌──────────────┐
│   Pipeline   │────> │  pbi-data    │<────│   Analysis   │
│  (writes)    │      │   volume     │     │  (read-only) │
└──────────────┘      └──────────────┘     └──────────────┘
                       ┌─────┴─────┐
                       │    API    │
                       │(read-only)│
                       └───────────┘

Key Design Principles:

  • Shared Volume, Separate Services: Pipeline writes data, analysis reads it
  • Read-Only Mounts: :ro flag prevents accidental data corruption
  • Direct File Access: Zero network overhead
  • Batch Processing: Process large datasets in chunks to avoid OOM errors
  • Hybrid Approach: DuckDB for metadata + SequenceRetriever for sequences

Quick Start

1. Build and Start the Analysis Service

# Build the analysis container
docker compose build analysis

# Start the analysis service
docker compose up -d analysis

# Access Jupyter Lab
# Open http://localhost:8888 in your browser
# (Use SSH port forwarding if on a remote server - see Installation Guide)

2. Open Example Notebook

Navigate to one of the main notebooks in Jupyter Lab to get started:

  • notebooks/01_database_exploration.ipynb — Database exploration and quality control
  • notebooks/02_sequence_retrieval.ipynb — Sequence retrieval with the PBI package
  • notebooks/03_ml_streaming.ipynb — AI/ML with streaming data

3. Create Your First Analysis

Create a new notebook in Jupyter Lab:

Option A: Using quick_connect() (Recommended)

The easiest way to get started: quick_connect() handles all paths automatically:

from pbi import quick_connect

# Connect to database with all sequence files
# This automatically uses DATA_PATH environment variable in Docker
retriever = quick_connect()

# Get database statistics
stats = retriever.get_stats()
print(f"Phages: {stats['database']['phages']:,}")
print(f"Proteins: {stats['database']['proteins']:,}")
print(f"Hosts: {stats['database']['hosts']:,}")

# Query and retrieve sequences
phages = retriever.get_phage_metadata("Length > 100000", limit=10)
phage_ids = phages['Phage_ID'].tolist()
sequences = retriever.get_sequences_by_ids(phage_ids, sequence_type='phage')

Option B: Manual Connection (Advanced)

For more control over connection parameters:

import duckdb
from pathlib import Path
from pbi import SequenceRetriever

# Connect to database (read-only)
DB_PATH = "/data/processed/databases/phage_database_optimized.duckdb"
conn = duckdb.connect(DB_PATH, read_only=True)

# Query phages
query = """
SELECT Phage_ID, Length, GC_Content
FROM fact_phages
WHERE Length > 100000
LIMIT 10
"""
df = conn.execute(query).fetchdf()
print(df)

# Initialize SequenceRetriever
retriever = SequenceRetriever(
    db_path=DB_PATH,
    phage_fasta_path="/data/processed/sequences/all_phages.fasta",
    protein_fasta_path="/data/processed/sequences/all_proteins.fasta",
    host_mapping_path="/data/processed/sequences/host_fasta_mapping.json"
)

# Get sequences
phage_ids = df['Phage_ID'].tolist()
sequences = retriever.get_sequences_by_ids(phage_ids, sequence_type='phage')

# Cleanup
conn.close()

🗄️ Accessing the Database

Understanding the PBI Package

The pbi Python package provides a high-level interface to the PBI database and sequence files:

Main Components:

  1. quick_connect() - Convenience function for instant database access
  2. SequenceRetriever - Core class for querying metadata and retrieving sequences
  3. NegativeExampleGenerator - ML utility for generating negative training examples
  4. PhageHostStreamingDataset / PhageHostIndexedDataset - PyTorch-compatible dataset classes

How it works in Docker:

The pbi package automatically detects the DATA_PATH environment variable set in the Docker container (/data/processed) and uses it to locate the database and FASTA files. This means you don't need to specify paths manually.
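To make the mechanism concrete, here is a sketch of what that detection amounts to. This is not the actual pbi internals; resolve_data_paths is a hypothetical name, but the file layout matches the paths used throughout this guide:

```python
# Illustrative sketch of DATA_PATH-based path resolution (hypothetical
# helper; the real pbi package does this internally).
import os
from pathlib import Path

def resolve_data_paths(base=None):
    """Build the standard PBI file layout from DATA_PATH (default /data/processed)."""
    root = Path(base or os.environ.get("DATA_PATH", "/data/processed"))
    return {
        "db": root / "databases" / "phage_database_optimized.duckdb",
        "phage_fasta": root / "sequences" / "all_phages.fasta",
        "protein_fasta": root / "sequences" / "all_proteins.fasta",
        "host_mapping": root / "sequences" / "host_fasta_mapping.json",
    }

paths = resolve_data_paths()
# Inside the container this resolves to e.g.
# /data/processed/databases/phage_database_optimized.duckdb
print(paths["db"])
```

Because the container sets DATA_PATH=/data/processed, no arguments are needed; outside Docker you would pass (or export) the base path yourself.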

Retrieving Genomes and Metadata

1. Get Phage Genomes

from pbi import quick_connect

retriever = quick_connect()

# Get phage metadata with filters
phages = retriever.get_phage_metadata(
    "Length > 50000 AND Lifestyle = 'Lytic'",
    limit=100
)
print(phages.head())

# Method 1: get sequences for the phages just queried
sequences = retriever.get_sequences_by_ids(
    phages['Phage_ID'].tolist(),
    sequence_type='phage'
)

# Method 2: direct ID-based retrieval
phage_ids = ['phage_001', 'phage_002', 'phage_003']
sequences = retriever.get_sequences_by_ids(phage_ids, sequence_type='phage')

# sequences is a dict mapping ID to sequence string
for phage_id, seq in sequences.items():
    print(f"{phage_id}: {len(seq)} bp")

2. Get Host Genomes

from pbi import quick_connect

retriever = quick_connect()

# Query hosts
hosts_df = retriever.query_hosts("""
    SELECT Host_ID, Host_Name, Host_Genus, Host_Family
    FROM dim_hosts
    WHERE Host_Genus = 'Escherichia'
    LIMIT 10
""")

# Get host sequences
host_sequences = retriever.get_sequences_by_ids(
    hosts_df['Host_ID'].tolist(),
    sequence_type='host'
)

3. Get Protein Sequences

from pbi import quick_connect

retriever = quick_connect()

# Query proteins from a specific phage
proteins_df = retriever.query_proteins("""
    SELECT Protein_ID, Phage_ID, Annotation, Length
    FROM fact_proteins
    WHERE Phage_ID = 'NC_000001'
    AND Annotation LIKE '%terminase%'
""")

# Get protein sequences
protein_sequences = retriever.get_sequences_by_ids(
    proteins_df['Protein_ID'].tolist(),
    sequence_type='protein'
)

4. Get Metadata Without Sequences

If you only need metadata (no sequences), use direct database queries:

from pbi import quick_connect

retriever = quick_connect()

# Use the underlying database connection
stats = retriever.get_stats()
print(f"Total phages: {stats['database']['phages']:,}")

# Or query directly
import duckdb
conn = duckdb.connect("/data/processed/databases/phage_database_optimized.duckdb", read_only=True)

# Get phage metadata
metadata = conn.execute("""
    SELECT p.*, ph.Host_Name, ph.Host_Genus
    FROM fact_phages p
    LEFT JOIN fact_phage_host ph ON p.Phage_ID = ph.Phage_ID
    WHERE p.GC_Content > 0.5
""").fetchdf()

conn.close()

5. Get Phage-Host Interaction Pairs

from pbi import quick_connect

retriever = quick_connect()

# Get known phage-host interactions
pairs = retriever.get_phage_host_pairs(limit=1000)

# Returns DataFrame with columns:
# - Phage_ID
# - Host_ID  
# - Phage_Length
# - Phage_GC
# - Host_Name
# - Host_Genus

Common Query Patterns

Filter by taxonomy:

# E. coli phages
ecoli_phages = retriever.query_phages("""
    SELECT DISTINCT p.*
    FROM fact_phages p
    INNER JOIN fact_phage_host ph ON p.Phage_ID = ph.Phage_ID
    WHERE ph.Host_Genus LIKE '%Escherichia%'
    OR ph.Host_Name LIKE '%coli%'
""")

Filter by genome characteristics:

# Large, complete phages with high GC content
large_phages = retriever.query_phages("""
    SELECT * FROM fact_phages
    WHERE Length > 100000
    AND Completeness = 'complete'
    AND GC_Content > 0.55
    ORDER BY Length DESC
""")

Join multiple tables:

import duckdb

conn = duckdb.connect("/data/processed/databases/phage_database_optimized.duckdb", read_only=True)

# Get phages with their proteins and hosts
complex_query = conn.execute("""
    SELECT 
        p.Phage_ID,
        p.Accession,
        p.Length,
        COUNT(DISTINCT pr.Protein_ID) as Protein_Count,
        ph.Host_Name,
        ph.Host_Genus
    FROM fact_phages p
    LEFT JOIN fact_proteins pr ON p.Phage_ID = pr.Phage_ID
    LEFT JOIN fact_phage_host ph ON p.Phage_ID = ph.Phage_ID
    GROUP BY p.Phage_ID, p.Accession, p.Length, ph.Host_Name, ph.Host_Genus
    HAVING COUNT(DISTINCT pr.Protein_ID) > 10
""").fetchdf()

conn.close()

Common Use Cases

Use Case 1: Bulk Metadata Export

Export large datasets efficiently using DuckDB's native capabilities:

import duckdb

conn = duckdb.connect("/data/processed/databases/phage_database_optimized.duckdb", 
                      read_only=True)

# Export to Parquet (most efficient)
query = """
COPY (
    SELECT p.*, ph.Host_Name, ph.Host_Genus
    FROM fact_phages p
    LEFT JOIN fact_phage_host ph ON p.Phage_ID = ph.Phage_ID
    WHERE p.Length > 50000
) TO '/workspace/exports/large_phages.parquet' (FORMAT PARQUET)
"""
conn.execute(query)

# Export to CSV
query = """
COPY (
    SELECT * FROM fact_phages
    WHERE Completeness = 'complete'
) TO '/workspace/exports/complete_phages.csv' (HEADER, DELIMITER ',')
"""
conn.execute(query)

conn.close()

Performance: Exports millions of records in seconds, compared to minutes via API.

Use Case 2: Batch Sequence Retrieval

Retrieve sequences for thousands of phages efficiently:

from pbi import SequenceRetriever
import duckdb
import pandas as pd

# Get phage IDs from database
conn = duckdb.connect("/data/processed/databases/phage_database_optimized.duckdb",
                      read_only=True)
query = """
SELECT Phage_ID FROM fact_phages
WHERE Completeness = 'complete' AND Length > 50000
"""
phage_ids_df = conn.execute(query).fetchdf()
conn.close()

# Process in batches to avoid memory issues
retriever = SequenceRetriever(
    db_path="/data/processed/databases/phage_database_optimized.duckdb",
    phage_fasta_path="/data/processed/sequences/all_phages.fasta",
    protein_fasta_path="/data/processed/sequences/all_proteins.fasta"
)

BATCH_SIZE = 1000
all_sequences = {}

for i in range(0, len(phage_ids_df), BATCH_SIZE):
    batch_ids = phage_ids_df.iloc[i:i+BATCH_SIZE]['Phage_ID'].tolist()
    batch_sequences = retriever.get_sequences_by_ids(batch_ids, sequence_type='phage')
    all_sequences.update(batch_sequences)
    print(f"Processed {i + len(batch_sequences)} sequences...")

Performance: Retrieves ~1000 sequences per second using indexed FASTA access.
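The speed comes from the .fai index: it records each record's byte offset, so retrieval is a seek plus a short read rather than a scan of the whole file. A stripped-down stdlib illustration of the idea (pyfaidx implements this robustly in production; the tiny FASTA here is made up for the demo):

```python
# Minimal sketch of .fai-style indexed FASTA access: build an in-memory
# index of byte offsets, then fetch records by seeking directly to them.
import os
import tempfile

fasta = ">phage_001\nATGCATGCAT\nGGCC\n>phage_002\nTTTTAAAA\n"
path = os.path.join(tempfile.mkdtemp(), "demo.fasta")
with open(path, "w") as fh:
    fh.write(fasta)

# Index: ID -> [byte offset of first sequence line, total sequence length]
index = {}
offset = 0
current = None
with open(path, "rb") as fh:
    for line in fh:
        if line.startswith(b">"):
            current = line[1:].strip().decode()
            index[current] = [offset + len(line), 0]
        else:
            index[current][1] += len(line.strip())
        offset += len(line)

def fetch(seq_id):
    """Random access: seek straight to the record instead of scanning."""
    start, length = index[seq_id]
    with open(path, "rb") as fh:
        fh.seek(start)
        out = b""
        while len(out) < length:
            out += fh.readline().strip()
        return out.decode()

print(fetch("phage_002"))  # TTTTAAAA
```

This is why batch retrieval scales: each lookup costs one seek regardless of how large the FASTA file is.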

Use Case 3: Complex Joins and Analysis

Combine multiple tables for sophisticated analyses:

import duckdb
import pandas as pd
import matplotlib.pyplot as plt

conn = duckdb.connect("/data/processed/databases/phage_database_optimized.duckdb",
                      read_only=True)

# Find phages with specific protein annotations
query = """
SELECT 
    p.Phage_ID,
    p.Accession,
    p.Length,
    p.GC_Content,
    COUNT(DISTINCT pr.Protein_ID) as Protein_Count,
    ph.Host_Genus,
    ph.Host_Name
FROM fact_phages p
LEFT JOIN fact_proteins pr ON p.Phage_ID = pr.Phage_ID
LEFT JOIN fact_phage_host ph ON p.Phage_ID = ph.Phage_ID
WHERE pr.Annotation LIKE '%terminase%'
GROUP BY p.Phage_ID, p.Accession, p.Length, p.GC_Content, 
         ph.Host_Genus, ph.Host_Name
HAVING COUNT(DISTINCT pr.Protein_ID) > 0
ORDER BY p.Length DESC
"""

results = conn.execute(query).fetchdf()
conn.close()

# Visualize
plt.figure(figsize=(10, 6))
plt.scatter(results['Length'] / 1000, results['GC_Content'] * 100, 
            alpha=0.6, s=50)
plt.xlabel('Genome Length (kb)')
plt.ylabel('GC Content (%)')
plt.title('Phages with Terminase Proteins')
plt.tight_layout()
plt.savefig('/workspace/exports/terminase_phages.png', dpi=300)
plt.show()

Use Case 4: Machine Learning Dataset Preparation

Prepare datasets for ML models using the convenience quick_connect() function:

from pbi import quick_connect, NegativeExampleGenerator

# Connect using quick_connect (automatically uses correct paths)
retriever = quick_connect()

# Get positive examples (known phage-host pairs)
positive_pairs = retriever.get_phage_host_pairs(limit=5000)

# Filter for quality
positive_pairs = positive_pairs[
    (positive_pairs['Phage_Length'] > 10000) &
    (positive_pairs['Phage_GC'].notna())
]

# Generate negative examples
neg_gen = NegativeExampleGenerator(retriever)
dataset = neg_gen.generate_balanced_dataset(
    positive_pairs=positive_pairs,
    strategy='mixed',  # Combines random, GC-based, and taxonomy-based negatives
    positive_ratio=0.5
)

# Export for training
dataset.to_parquet('/workspace/exports/phage_host_ml_dataset.parquet')
print(f"Dataset created: {len(dataset)} samples")
print(f"Positive: {(dataset['Label'] == 1).sum()}")
print(f"Negative: {(dataset['Label'] == 0).sum()}")

⚙️ Configuration

Docker Compose Service

The analysis service is defined in docker-compose.yml:

analysis:
  build:
    context: .
    dockerfile: Dockerfile.analysis
  container_name: pbi-analysis
  ports:
    - "8888:8888"
  volumes:
    # Read-only access to data
    - pbi-data:/data:ro
    # Persistent notebook storage
    - ./notebooks:/workspace
  environment:
    - DATA_PATH=/data/processed
  networks:
    - pbi-network
  restart: unless-stopped

Environment Variables

  • DATA_PATH: Base path for processed data (default: /data/processed)

Volume Mounts

  • pbi-data:/data:ro - Read-only access to database and FASTA files
  • ./notebooks:/workspace - Persistent storage for notebooks and exports
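A quick way to confirm these mounts and variables from inside a running notebook (a small sanity-check snippet; the printed values depend on your environment):

```python
# Sanity check: does this kernel see the mounted data and a writable workspace?
import os
from pathlib import Path

data_path = os.environ.get("DATA_PATH", "/data/processed")
print("DATA_PATH:          ", data_path)
print("Databases mounted:  ", Path(data_path, "databases").exists())
print("Workspace writable: ", os.access("/workspace", os.W_OK))
```

If the first check prints False, the pipeline has likely not populated the pbi-data volume yet (see Troubleshooting below).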

Package Management & Custom Environments

Installing Additional Packages

The analysis container comes with a comprehensive scientific computing stack. To install additional packages:

1. Install packages temporarily (current session only)

From a Jupyter notebook or terminal:

# Install Python package
!pip install scikit-learn xgboost lightgbm

# Install from conda-forge (if conda is available)
!conda install -c conda-forge package_name

2. Install packages persistently

To make packages available across container restarts, modify Dockerfile.analysis:

# Add your packages to the RUN pip install command
RUN pip install --no-cache-dir --trusted-host pypi.org --trusted-host files.pythonhosted.org \
    jupyterlab==4.0.9 \
    duckdb>=0.9.0 \
    # ... existing packages ...
    scikit-learn>=1.3.0 \
    xgboost>=2.0.0 \
    lightgbm>=4.0.0 \
    plotly>=5.0.0

Then rebuild the container:

docker compose build analysis
docker compose up -d analysis

Creating Custom Analysis Environments

You can create specialized analysis environments for different projects:

Option 1: Use Conda environments inside the container

# Create a new conda environment
!conda create -n my_analysis python=3.10 -y

# Activate and install packages
!conda activate my_analysis && conda install pandas numpy scikit-learn -y

# Use in Jupyter by installing ipykernel
!conda activate my_analysis && pip install ipykernel
!conda activate my_analysis && python -m ipykernel install --user --name my_analysis

Then select the "my_analysis" kernel from Jupyter's kernel menu.

Option 2: Create a separate Docker service

Add a new service to docker-compose.yml for specialized analysis:

analysis-ml:
  build:
    context: .
    dockerfile: Dockerfile.analysis-ml  # Your custom Dockerfile
  container_name: pbi-analysis-ml
  ports:
    - "8889:8888"  # Different port
  volumes:
    - pbi-data:/data:ro
    - ./notebooks:/workspace
    - ./workflow:/app/workflow:ro
    - ./src:/app/src:ro
  environment:
    - DATA_PATH=/data/processed
    - PYTHONPATH=/app
  networks:
    - pbi-network
  restart: unless-stopped

Create Dockerfile.analysis-ml with your custom package set:

FROM python:3.10-slim

WORKDIR /workspace

# Install system dependencies
RUN apt-get update && apt-get install -y \
    git curl build-essential \
    && rm -rf /var/lib/apt/lists/*

# Copy project
COPY . /app/
ENV PYTHONPATH=/app:$PYTHONPATH

# Install PBI package
RUN pip install -e /app/

# Install ML-specific packages
RUN pip install \
    jupyterlab \
    duckdb pyfaidx pandas numpy \
    scikit-learn xgboost lightgbm \
    torch torchvision \
    transformers \
    optuna \
    mlflow

# Configure Jupyter
RUN jupyter lab --generate-config && \
    echo "c.ServerApp.token = ''" >> ~/.jupyter/jupyter_lab_config.py && \
    echo "c.ServerApp.password = ''" >> ~/.jupyter/jupyter_lab_config.py && \
    echo "c.ServerApp.allow_root = True" >> ~/.jupyter/jupyter_lab_config.py && \
    echo "c.ServerApp.ip = '0.0.0.0'" >> ~/.jupyter/jupyter_lab_config.py && \
    echo "c.ServerApp.port = 8888" >> ~/.jupyter/jupyter_lab_config.py

EXPOSE 8888
ENV DATA_PATH=/data/processed

CMD ["jupyter", "lab", "--allow-root", "--ip=0.0.0.0", "--port=8888", "--no-browser", "--notebook-dir=/workspace"]

Start your custom environment:

docker compose build analysis-ml
docker compose up -d analysis-ml
# Access at http://localhost:8889

Option 3: Use Python virtual environments

From a Jupyter terminal:

# Create virtual environment
python -m venv /workspace/.venvs/my_project
source /workspace/.venvs/my_project/bin/activate
pip install specific-packages

# Install kernel for Jupyter
pip install ipykernel
python -m ipykernel install --user --name=my_project --display-name="My Project"

Writing Custom Functions

Where to write reusable functions:

1. In the PBI package (for general-purpose functionality)

Add functions to /app/src/pbi/ and they'll be available via from pbi import ...

Example - add to src/pbi/utils.py:

def calculate_phage_metrics(retriever, phage_ids):
    """Calculate basic per-genome metrics (length, GC) for the given phages."""
    sequences = retriever.get_sequences_by_ids(phage_ids, sequence_type='phage')
    return {pid: {'length': len(seq),
                  'gc': (seq.count('G') + seq.count('C')) / len(seq)}
            for pid, seq in sequences.items()}

Use in notebooks:

from pbi.utils import calculate_phage_metrics
metrics = calculate_phage_metrics(retriever, phage_ids)

2. In a shared notebook utilities module

Create notebooks/utils.py:

# Custom analysis functions
def plot_gc_distribution(df, output_path):
    import matplotlib.pyplot as plt
    plt.hist(df['GC_Content'])
    plt.savefig(output_path)

Use in any notebook:

import sys
sys.path.insert(0, '/workspace')
from utils import plot_gc_distribution

plot_gc_distribution(df, '/workspace/exports/gc_plot.png')

3. In individual notebooks (for one-off analyses)

Define functions directly in notebook cells for exploratory work.

Best Practices for Custom Environments

  1. Document your dependencies: Keep a requirements.txt or environment.yml file
  2. Version control: Track your custom Dockerfiles and conda environments
  3. Test in isolation: Use separate containers to avoid dependency conflicts
  4. Share configurations: Use Docker Compose profiles for different team members
  5. Resource limits: Set memory/CPU limits for resource-intensive ML environments

🎓 Best Practices

1. Always Use Read-Only Connections

# ✅ GOOD: Read-only connection
conn = duckdb.connect(db_path, read_only=True)

# ❌ BAD: Writable connection (risk of corruption)
conn = duckdb.connect(db_path)

Why? The data volume is mounted as :ro (read-only) to prevent accidental modifications. Additionally, using read_only=True in DuckDB prevents database lock conflicts.

2. Process Data in Batches

# ✅ GOOD: Batch processing
BATCH_SIZE = 1000
for offset in range(0, total_count, BATCH_SIZE):
    query = f"SELECT * FROM fact_phages LIMIT {BATCH_SIZE} OFFSET {offset}"
    batch = conn.execute(query).fetchdf()
    # Process batch...

# ❌ BAD: Loading everything at once
all_data = conn.execute("SELECT * FROM fact_phages").fetchdf()  # 800K+ rows!

Why? Loading millions of records into memory will cause out-of-memory errors and crash the Jupyter kernel.
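One caveat on the LIMIT/OFFSET pattern above: the database must re-scan and discard all skipped rows, so deep pages get progressively slower. Keyset pagination, which resumes after the last ID seen, avoids this. A minimal sketch, using stdlib sqlite3 only so it runs anywhere; the same SQL pattern applies to the DuckDB database (toy table and IDs are made up):

```python
# Keyset pagination: filter on the last ID seen instead of using OFFSET.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_phages (Phage_ID TEXT PRIMARY KEY, Length INT)")
conn.executemany(
    "INSERT INTO fact_phages VALUES (?, ?)",
    [(f"phage_{i:03d}", 1000 * i) for i in range(1, 11)],
)

BATCH_SIZE = 4
last_id = ""  # keyset cursor: resume after this ID
batches = []
while True:
    rows = conn.execute(
        "SELECT Phage_ID, Length FROM fact_phages "
        "WHERE Phage_ID > ? ORDER BY Phage_ID LIMIT ?",
        (last_id, BATCH_SIZE),
    ).fetchall()
    if not rows:
        break
    batches.append(rows)
    last_id = rows[-1][0]  # advance the cursor

print([len(b) for b in batches])  # [4, 4, 2]
conn.close()
```

Each page is an indexed range scan starting at the cursor, so page 1000 costs the same as page 1.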

3. Use DuckDB's Native Export Functions

# ✅ GOOD: Direct export (fast, memory-efficient)
conn.execute("""
    COPY (SELECT * FROM fact_phages WHERE Length > 50000)
    TO '/workspace/exports/large_phages.parquet' (FORMAT PARQUET)
""")

# ❌ BAD: Load then export (slow, memory-intensive)
df = conn.execute("SELECT * FROM fact_phages WHERE Length > 50000").fetchdf()
df.to_parquet('/workspace/exports/large_phages.parquet')

Why? DuckDB can stream results directly to disk without loading into memory, making it orders of magnitude faster for large datasets.

4. Filter Data in SQL, Not Pandas

# ✅ GOOD: Filter in database
query = """
SELECT * FROM fact_phages
WHERE Length > 50000 AND Completeness = 'complete'
"""
df = conn.execute(query).fetchdf()

# ❌ BAD: Load everything then filter
all_phages = conn.execute("SELECT * FROM fact_phages").fetchdf()
filtered = all_phages[(all_phages['Length'] > 50000) & 
                      (all_phages['Completeness'] == 'complete')]

Why? Filtering in SQL is faster and more memory-efficient, as only the filtered data is transferred.

5. Close Connections Properly

# ✅ GOOD: Ensure cleanup
try:
    conn = duckdb.connect(db_path, read_only=True)
    # Work...
finally:
    conn.close()

# Or use context manager
with duckdb.connect(db_path, read_only=True) as conn:
    # Work...
# Auto-closes

Why? Proper cleanup prevents resource leaks and ensures database locks are released.

6. Monitor Memory Usage

# Check memory usage in notebook
import psutil
import os

process = psutil.Process(os.getpid())
memory_mb = process.memory_info().rss / 1024 / 1024
print(f"Current memory usage: {memory_mb:.2f} MB")

Troubleshooting

Issue 1: "Database is locked" Error

Symptom:

duckdb.IOException: IO Error: Could not set lock on file

Solutions:

  1. Ensure you're using read_only=True:

     conn = duckdb.connect(db_path, read_only=True)

  2. Check whether another process has the database open in write mode

  3. Verify the volume is mounted as read-only (:ro)

Issue 2: Out of Memory Errors

Symptom:

MemoryError: Unable to allocate array
or kernel crashes

Solutions:

  1. Reduce batch size:

     BATCH_SIZE = 1000  # Instead of 10000

  2. Use DuckDB aggregations instead of loading raw data:

     # Good: Aggregate in database
     stats = conn.execute("SELECT AVG(Length), STDDEV(Length) FROM fact_phages").fetchdf()

     # Bad: Load all data
     all_lengths = conn.execute("SELECT Length FROM fact_phages").fetchdf()

  3. Stream results to disk using COPY:

     conn.execute("COPY (...) TO 'output.parquet'")

  4. Monitor memory and restart the kernel if needed

Issue 3: Slow Sequence Retrieval

Symptom: Retrieving sequences takes several minutes

Solutions:

  1. Verify FASTA index files exist:

     from pathlib import Path
     fasta_file = Path("/data/processed/sequences/all_phages.fasta")
     index_file = Path(str(fasta_file) + ".fai")
     print(f"Index exists: {index_file.exists()}")

     If index files are missing, re-run the pipeline (it creates them automatically):

     docker compose run --rm pipeline

     Alternatively, create the indexes manually (requires pyfaidx):

     docker compose exec analysis python -c "from pyfaidx import Fasta; Fasta('/data/processed/sequences/all_phages.fasta')"

  2. Retrieve sequences in batches, not individually:

     # Good: Batch retrieval
     sequences = retriever.get_sequences_by_ids(phage_ids)

     # Bad: Individual retrieval
     for phage_id in phage_ids:
         seq = retriever.get_sequences_by_ids([phage_id])

  3. Use preload=True when initializing SequenceRetriever

Issue 4: Jupyter Lab Not Accessible

Symptom: Cannot connect to http://localhost:8888

Solutions:

  1. Check whether the container is running:

     docker ps | grep pbi-analysis

  2. Check the logs for errors:

     docker logs pbi-analysis

  3. Verify the port mapping:

     docker port pbi-analysis

  4. Restart the service:

     docker compose restart analysis

Issue 5: Path Not Found Errors

Symptom:

FileNotFoundError: /data/processed/databases/phage_database_optimized.duckdb

Solutions:

  1. Verify you're running in the Docker container (not locally)

  2. Check that the pipeline has completed:

     docker compose run --rm pipeline

  3. Inspect the volume contents:

     docker run --rm -v pbi-data:/data alpine ls -lh /data/processed/databases

  4. Ensure paths use /data/processed (Docker), not ./data/processed (local)

Performance Tips

Optimize SQL Queries

  1. Use appropriate indexes (already optimized in PBI database)

  2. Select only needed columns:

    -- Good
    SELECT Phage_ID, Length, GC_Content FROM fact_phages

    -- Bad
    SELECT * FROM fact_phages
    

  3. Use LIMIT for exploration:

    # Explore data structure with small sample
    sample = conn.execute("SELECT * FROM fact_phages LIMIT 100").fetchdf()
    

  4. Leverage WHERE clauses:

    WHERE Length > 50000 AND Completeness = 'complete'
    

Parallel Processing

For CPU-intensive tasks, use multiprocessing with caution:

from multiprocessing import Pool

def process_batch(batch_ids):
    """
    Create a new retriever instance in each worker process.

    Note: SequenceRetriever contains database connections and file handles
    that cannot be pickled across processes. Always create a new instance
    in the worker function.
    """
    from pbi import SequenceRetriever

    # Create new retriever in worker process
    retriever = SequenceRetriever(
        db_path="/data/processed/databases/phage_database_optimized.duckdb",
        phage_fasta_path="/data/processed/sequences/all_phages.fasta",
        protein_fasta_path="/data/processed/sequences/all_proteins.fasta"
    )

    sequences = retriever.get_sequences_by_ids(batch_ids, sequence_type='phage')
    # Process sequences (here: record each genome's length as the result)
    return {pid: len(seq) for pid, seq in sequences.items()}

# Split IDs into batches
batches = [phage_ids[i:i+1000] for i in range(0, len(phage_ids), 1000)]

# Process in parallel
with Pool(processes=4) as pool:
    results = pool.map(process_batch, batches)

⚠️ Important Notes:

  • Always create new SequenceRetriever instances in worker processes
  • Database connections and file handles are not picklable
  • Monitor memory usage closely: each process loads its own FASTA indexes
  • For large-scale parallel processing, consider Dask or Ray instead

Caching Results

Cache expensive computations:

import pickle
from pathlib import Path

cache_file = Path("/workspace/.cache/expensive_query.pkl")

if cache_file.exists():
    # Load from cache
    with open(cache_file, 'rb') as f:
        results = pickle.load(f)
else:
    # Compute and cache
    results = conn.execute(expensive_query).fetchdf()
    cache_file.parent.mkdir(exist_ok=True)
    with open(cache_file, 'wb') as f:
        pickle.dump(results, f)
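A small refinement on this pattern: keying the cache file by a hash of the query text lets one cache directory serve many different queries without collisions. A sketch (cached and fake_compute are illustrative names; a temp directory stands in for /workspace/.cache):

```python
# Cache results keyed by a hash of the query text, so each distinct
# query gets its own cache file.
import hashlib
import pickle
import tempfile
from pathlib import Path

CACHE_DIR = Path(tempfile.mkdtemp())  # use /workspace/.cache in the container

def cached(query, compute):
    """Memoize compute(query) on disk, keyed by a hash of the query text."""
    key = hashlib.sha256(query.encode()).hexdigest()[:16]
    path = CACHE_DIR / f"{key}.pkl"
    if path.exists():
        return pickle.loads(path.read_bytes())
    result = compute(query)
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    path.write_bytes(pickle.dumps(result))
    return result

calls = []
def fake_compute(q):
    """Stand-in for an expensive conn.execute(...).fetchdf() call."""
    calls.append(q)
    return q.upper()

print(cached("select 1", fake_compute))  # SELECT 1 (computed)
print(cached("select 1", fake_compute))  # SELECT 1 (served from cache)
print(len(calls))  # 1 -- the expensive function ran only once
```

In a notebook, compute would wrap the real DuckDB query; deleting the cache directory invalidates everything at once.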

🔄 Workflow Example: Complete Analysis Pipeline

Here's a complete example analyzing E. coli phages:

import duckdb
from pbi import SequenceRetriever
import pandas as pd
import matplotlib.pyplot as plt
from Bio.SeqUtils import gc_fraction
from Bio.Seq import Seq

# 1. Setup
DB_PATH = "/data/processed/databases/phage_database_optimized.duckdb"
conn = duckdb.connect(DB_PATH, read_only=True)

# 2. Query E. coli phages
query = """
SELECT DISTINCT
    p.Phage_ID,
    p.Accession,
    p.Length,
    p.GC_Content,
    ph.Host_Name
FROM fact_phages p
INNER JOIN fact_phage_host ph ON p.Phage_ID = ph.Phage_ID
WHERE (ph.Host_Genus LIKE '%Escherichia%' 
       OR ph.Host_Name LIKE '%coli%')
  AND p.GC_Content IS NOT NULL
ORDER BY p.Length DESC
"""
ecoli_phages = conn.execute(query).fetchdf()
print(f"Found {len(ecoli_phages)} E. coli phages")

# 3. Visualize distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].hist(ecoli_phages['Length'] / 1000, bins=50, edgecolor='black')
axes[0].set_xlabel('Genome Length (kb)')
axes[0].set_title('E. coli Phage Length Distribution')

axes[1].hist(ecoli_phages['GC_Content'] * 100, bins=50, edgecolor='black')
axes[1].set_xlabel('GC Content (%)')
axes[1].set_title('E. coli Phage GC Distribution')

plt.tight_layout()
plt.savefig('/workspace/exports/ecoli_phages_analysis.png', dpi=300)

# 4. Retrieve sequences (sample)
retriever = SequenceRetriever(
    db_path=DB_PATH,
    phage_fasta_path="/data/processed/sequences/all_phages.fasta",
    protein_fasta_path="/data/processed/sequences/all_proteins.fasta"
)

sample_ids = ecoli_phages.head(100)['Phage_ID'].tolist()
sequences = retriever.get_sequences_by_ids(sample_ids, sequence_type='phage')

# 5. Calculate sequence statistics
seq_stats = []
for phage_id, seq_str in sequences.items():
    seq = Seq(seq_str)
    seq_stats.append({
        'Phage_ID': phage_id,
        'Calculated_GC': gc_fraction(seq),
        'A_count': seq_str.count('A'),
        'T_count': seq_str.count('T'),
        'G_count': seq_str.count('G'),
        'C_count': seq_str.count('C')
    })

seq_stats_df = pd.DataFrame(seq_stats)
results = ecoli_phages.merge(seq_stats_df, on='Phage_ID')

# 6. Export results
results.to_csv('/workspace/exports/ecoli_phages_complete.csv', index=False)
results.to_parquet('/workspace/exports/ecoli_phages_complete.parquet')

# 7. Cleanup
conn.close()

print("✅ Analysis complete!")
print(f"   Analyzed: {len(results)} phages")
print(f"   Outputs: /workspace/exports/")

💻 Connecting with VSCode

While Jupyter Lab in the browser is convenient, you can also connect to the remote Jupyter server using VSCode for a more integrated development experience.

Prerequisites

  1. VSCode installed on your local machine
  2. Jupyter extension for VSCode installed (from Microsoft)
  3. PBI analysis service running: docker compose up -d analysis

Connection Steps

  1. Get the Jupyter Server URL with Token
# Get the Jupyter server URL with authentication token
docker logs pbi-analysis 2>&1 | grep "http://127.0.0.1:8888/lab?token="

This will output something like:

http://127.0.0.1:8888/lab?token=abc123def456...

  2. Open VSCode and Create/Open a Notebook

     Create a new .ipynb file or open an existing one. Click "Select Kernel"
     in the top-right corner and choose "Existing Jupyter Server".

  3. Enter the Jupyter Server URL

     Paste the complete URL with token from step 1. VSCode connects to the
     remote Jupyter server, and you can now run cells directly in VSCode.

  4. Test the Connection

     Create a new notebook and run:

from pbi import quick_connect

# This should work if connected to the Docker container
retriever = quick_connect()
stats = retriever.get_stats()
print(f"Connected! Database has {stats['database']['phages']:,} phages")

Port Forwarding (Remote Servers)

If the Docker container runs on a remote server, you may need port forwarding:

  1. Set up an SSH port forward:

     ssh -L 8888:localhost:8888 user@remote-server

  2. Follow the connection steps above using http://localhost:8888/lab?token=...

Advantages of VSCode

  • Integrated Development: Edit code, notebooks, and documentation in one place
  • Git Integration: Easy version control for your analysis notebooks
  • IntelliSense: Better code completion and suggestions
  • Debugging: Advanced debugging capabilities
  • Extensions: Use other VSCode extensions alongside Jupyter
  • Multi-file Editing: Work with multiple notebooks and Python files simultaneously

Tips for VSCode + Jupyter

  1. Save Your Server URL: VSCode remembers the server URL, so you only need to enter it once
  2. Restart Kernel: Use the "Restart Kernel" button in the top menu if needed
  3. Variable Inspector: Enable the variable inspector in VSCode for easier debugging
  4. Keyboard Shortcuts: VSCode uses standard Jupyter keyboard shortcuts (Shift+Enter, etc.)

Troubleshooting VSCode Connection

Issue: "Failed to connect to Jupyter server"

  • Verify the analysis service is running: docker ps | grep pbi-analysis
  • Check the token is correct: docker logs pbi-analysis | grep token

Issue: "Cannot import pbi module"

  • Make sure you're connected to the Docker container's kernel, not your local Python
  • Verify the kernel name shows "Python 3 (ipykernel)" from the remote server

Issue: "Connection timeout"

  • If on a remote server, ensure SSH port forwarding is active
  • Check that firewall settings allow port 8888

🆚 API vs Direct Access Comparison

| Aspect | REST API | Direct Access |
| --- | --- | --- |
| Speed | Baseline | 5-50x faster |
| Bulk Operations | ⚠️ Limited by request size | ✅ No limits |
| Batch Processing | ❌ Multiple requests needed | ✅ Single query |
| Export Large Data | ❌ Pagination required | ✅ Direct export |
| Sequence Retrieval | ⚠️ Network overhead | ✅ Direct file access |
| Complex Joins | ❌ Limited | ✅ Full SQL power |
| Use Case | External integrations | Bulk analysis |

When to Use Each:

Use the API when:

  • External applications need access
  • You need cross-network access
  • Working with small datasets (< 1,000 records)
  • You want a standardized REST interface

Use direct access when:

  • Analyzing large datasets (> 10,000 records)
  • You need complex SQL queries with joins
  • Exporting bulk data
  • Doing interactive exploratory analysis
  • Building ML datasets

📚 Additional Resources

  • Example Notebooks: notebooks/01_database_exploration.ipynb, notebooks/02_sequence_retrieval.ipynb, notebooks/03_ml_streaming.ipynb
  • Docker Guide: docs/guides/docker-guide.md
  • DuckDB Documentation: https://duckdb.org/docs/
  • pyfaidx Documentation: https://github.com/mdshw5/pyfaidx
  • Jupyter Lab Documentation: https://jupyterlab.readthedocs.io/

🙋 Getting Help

If you encounter issues:

  1. Check this troubleshooting guide
  2. Review the example notebooks
  3. Check Docker logs: docker logs pbi-analysis
  4. Open an issue on GitHub with:
     • Error message
     • Steps to reproduce
     • Environment details (Docker version, OS, etc.)