Database Overview¶
The PBI database consists of two main components: the phage metadata database (DuckDB) and the host genome data (individual FASTA files + mapping index).
Phage Metadata Database¶
- Database Engine: DuckDB (columnar, embedded analytical database)
- Schema: Star schema with 1 fact table and 8 dimension tables
- Total Size: ~15 GB (optimized) / ~18 GB (unoptimized)
- Total Phages: ~873,000
- Total Proteins: ~43 million
- Data Sources: 14+ major phage databases via PhageScope
Host Genome Data¶
- Storage: Individual FASTA files, one per assembly
- Index:
host_fasta_mapping.jsonmaps host IDs to file paths - Source: NCBI RefSeq bacterial reference genomes
- Coverage: ~9,000 unique assembly accessions attempted; variable success rate depending on NCBI availability
- Location:
/data/processed/sequences/(in Docker)
The host genome data is not stored in the DuckDB database — it is kept as separate FASTA files for memory efficiency, since individual host genomes can be very large. The pbi Python package uses host_fasta_mapping.json to locate and retrieve host sequences on demand.
See Host Resolution Details for information on how host species names are parsed and resolved to NCBI assemblies.
Database Schema¶
dim_proteins ──┐
dim_terminators ──┤
dim_anti_crispr ──┤
dim_virulent_factors ──┤
dim_transmembrane_proteins ──┤──▶ fact_phages (central)
dim_trna_tmrna ──┤
dim_antimicrobial_resistance_genes ─┤
dim_crispr_array ─┘
All dimension tables link to fact_phages via Phage_ID foreign key, enabling comprehensive multi-dimensional analysis.
Host genome information is linked via external CSV/JSON files rather than a database table — see Host-Phage Link Files below.
Database Tables¶
Fact Table: fact_phages¶
Central table containing core phage metadata.
Key Columns:
- Phage_ID (Primary Key): Unique phage identifier
- Source_DB: Origin database (GenBank, RefSeq, PhagesDB, etc.)
- Length: Genome length in base pairs
- GC_content: GC content percentage
- Taxonomy: Taxonomic classification
- Host: Bacterial host species
- Lifestyle: Lytic, Lysogenic, or Unknown
Row Count: ~873,718 phages
Example Query:
SELECT Source_DB, COUNT(*) as count, AVG(Length) as avg_length
FROM fact_phages
GROUP BY Source_DB
ORDER BY count DESC;
Dimension Tables¶
dim_proteins¶
Protein annotations with functional predictions and physicochemical properties.
Key Columns:
- Protein_ID (Primary Key)
- Phage_ID (Foreign Key)
- Product: Protein function/description
- Protein_length: Length in amino acids
- Molecular_weight: Molecular weight
- Isoelectric_point: pI value
- Source_DB: Origin database
Row Count: ~43,088,582 proteins
dim_terminators¶
Transcription terminator predictions.
Key Columns:
- Phage_ID (Foreign Key)
- terminator_type: Type of terminator
- start, end: Genomic coordinates
- confidence: Prediction confidence score
- strand: DNA strand orientation
Row Count: ~6,462,417 terminators
dim_anti_crispr¶
Anti-CRISPR protein predictions.
Key Columns:
- Phage_ID (Foreign Key)
- Protein_ID: Associated protein
- Source_DB: Prediction source
- Score: Confidence score
Row Count: ~307,329 anti-CRISPR proteins
dim_virulent_factors¶
Virulence factor annotations aligned to VFDB.
Key Columns:
- Phage_ID (Foreign Key)
- Protein_ID: Associated protein
- VF_ID: VFDB identifier
- VF_Name: Virulence factor name
- Source_DB: Origin database
Row Count: ~41,609 virulence factors
dim_transmembrane_proteins¶
Transmembrane helix predictions (TMHMM).
Key Columns:
- Phage_ID (Foreign Key)
- Protein_ID: Associated protein
- predicted_tmhs_number: Number of transmembrane helices
- protein_length: Protein length
- topology: Membrane topology
Row Count: ~4,020,770 predictions
dim_trna_tmrna¶
tRNA and tmRNA predictions with sequences.
Key Columns:
- trna_tmrna_id (Primary Key)
- Phage_ID (Foreign Key)
- trna_type: Amino acid specificity
- start, end: Genomic coordinates
- sequence: RNA sequence
- Source_DB: Origin database
Row Count: ~702,607 tRNA/tmRNA features
dim_crispr_array¶
CRISPR array metadata.
Key Columns:
- Phage_ID (Foreign Key)
- array_id: Array identifier
- repeat_sequence: CRISPR repeat sequence
- spacer_count: Number of spacers
Row Count: ~56,652 arrays
dim_antimicrobial_resistance_genes¶
Antimicrobial resistance gene annotations.
Key Columns:
- Phage_ID (Foreign Key)
- gene_id: Gene identifier
- gene_name: AMR gene name
- resistance_type: Type of resistance
Row Count: ~2,602 genes
Host-Phage Link Files (CSV / JSON)¶
⚠️ These are flat files, not database tables. Host genome data is intentionally kept outside the DuckDB database because individual host genomes can be very large. The files below provide the link between phage records and their associated bacterial host assemblies.
The pipeline produces the following host-related files in /data/processed/sequences/ (inside Docker):
| File | Type | Description |
|---|---|---|
phage_host_candidates.csv |
CSV file | One row per (Phage_ID, token): lossless record of every host candidate parsed from the raw Host field. Used for auditing and traceability. |
phage_host_assemblies.csv |
CSV file | One row per (Phage_ID, Assembly_Accession): authoritative flat mapping of phages to resolved NCBI assembly accessions. Drives genome downloads. |
phage_host_links.csv |
CSV file | One row per unique (Phage_ID, Assembly_Accession) pair. Backward-compatible output used by the pbi package to retrieve phage–host pairs. |
host_metadata.csv |
CSV file | One row per unique assembly: per-assembly metadata (level, RefSeq category, quality score, etc.). |
assembly_metadata.csv |
CSV file | Detailed NCBI assembly metadata for each downloaded host genome. |
host_fasta_mapping.json |
JSON file | Maps each host assembly accession (e.g. GCF_000005845.2) to the path of its downloaded FASTA file. Used by SequenceRetriever for fast on-demand access. |
| Individual FASTA files | FASTA (.fna) |
One file per downloaded host assembly, stored under hosts/. |
Key Columns in phage_host_links.csv¶
| Column | Description |
|---|---|
Phage_ID |
Phage identifier (matches fact_phages.Phage_ID) |
Assembly_Accession |
Resolved NCBI assembly accession |
Host_Raw |
Original un-parsed Host field (traceability) |
Confidence |
Float 0–1 indicating resolution confidence |
Resolution_Source |
How the accession was resolved (accession_in_host_field / species_to_taxid_to_assembly / fallback) |
See Host Resolution Details for a full description of how host fields are parsed and resolved to NCBI assemblies.
Data Sources¶
The database integrates data from 14 major phage databases via PhageScope:
| Database | Type | Approx. Records | Focus Area |
|---|---|---|---|
| GenBank | General | ~200,000 | Comprehensive phage collection |
| RefSeq | Curated | ~15,000 | Reference phage genomes |
| PhagesDB | Specialized | ~2,000 | Mycobacteriophages |
| EMBL | European | ~50,000 | European phage data |
| DDBJ | Japanese | ~30,000 | Japanese phage data |
| GOV2 | Viral genomes | ~100,000 | Ocean viral genomes |
| MGV | Metagenomics | ~50,000 | Gut virome data |
| GVD | Metagenomics | ~40,000 | Global virome database |
| IMGVR | IMG/VR | ~30,000 | IMG viral database |
| GPD | Specialized | ~20,000 | Gut phage database |
| CHVD | Specialized | ~5,000 | Chicken virome data |
| STV | Environmental | ~3,000 | Soil/plant phages |
| TemPhD | Specialized | ~2,000 | Temperate phages |
| IGVD | Specialized | ~1,000 | Insect gut viruses |
Indexes and Optimization¶
The database includes 14 targeted indexes for optimal query performance:
Primary Key Indexes¶
fact_phages.Phage_IDdim_proteins.Protein_IDdim_trna_tmrna.trna_tmrna_id
Foreign Key Indexes¶
All dimension tables have indexes on Phage_ID for efficient joins.
Source Tracking Indexes¶
All tables include indexes on Source_DB for source-based filtering.
Example Optimized Queries¶
-- Fast: Uses Phage_ID index
SELECT * FROM dim_proteins
WHERE Phage_ID = 'NC_000866';
-- Fast: Uses Source_DB index
SELECT COUNT(*) FROM fact_phages
WHERE Source_DB = 'GenBank';
-- Fast: Uses both indexes
SELECT p.*
FROM dim_proteins p
JOIN fact_phages f ON p.Phage_ID = f.Phage_ID
WHERE f.Source_DB = 'RefSeq' AND f.Length > 100000;
Analytical Views¶
The database includes pre-built views for common analyses:
phage_summary¶
Aggregated statistics by data source.
phage_complete_profile¶
Comprehensive phage characterization with all dimension counts.
phage_size_distribution¶
Genome size categorization analysis.
Data Validation & Quality¶
Automated Validation¶
The pipeline performs comprehensive quality control:
- Schema Validation
- Table existence verification (all 9 tables)
- Column presence and data type checks
-
Index verification
-
Data Integrity
- Duplicate detection (Phage_ID, Protein_ID, etc.)
- Orphaned record detection (foreign key consistency)
-
NULL value distribution analysis
-
Statistical Validation
- Row count summaries
- Source distribution analysis
-
Numerical range validation (genome sizes, GC content, TMH counts)
-
Relationship Verification
- Foreign key consistency across all dimension tables
- Source database alignment
Validation Reports¶
HTML reports are generated with: - Visual database schema diagram - Row count statistics for all tables - Data quality metrics with pass/fail indicators - Distribution charts for key metrics - Warning flags for data quality issues
View Reports:
Reports are generated in the workflow/reports/ directory after running the pipeline. They include:
- Database Validation Report
- Phage Metadata Report
- Annotated Proteins Report
- And more for each data type
Database Validation Report¶
Report scope
The validation report below was generated before host data was added to the pipeline. It reflects the state of the DuckDB database (all phage tables) and does not yet include host-phage link statistics.
The full interactive report is embedded below. You can also open it in a new tab.
Sequence Files¶
In addition to the database, the pipeline generates indexed FASTA files:
Phage Genomes¶
- File:
data/sequences/all_phages.fasta - Index:
data/sequences/all_phages.fasta.fai - Size: ~40 GB
- Records: ~873,000 phage genomes
Protein Sequences¶
- File:
data/sequences/all_proteins.fasta - Index:
data/sequences/all_proteins.fasta.fai - Size: ~60 GB
- Records: ~43 million protein sequences
Both files are indexed with pyfaidx for fast random access.
Usage Examples¶
Python/DuckDB¶
import duckdb
# Connect to database
conn = duckdb.connect('data/databases/phage_database_optimized.duckdb')
# Get database statistics
stats = conn.execute("""
SELECT
(SELECT COUNT(*) FROM fact_phages) as phages,
(SELECT COUNT(*) FROM dim_proteins) as proteins,
(SELECT COUNT(*) FROM dim_trna_tmrna) as trna
""").fetchone()
print(f"Phages: {stats[0]:,}")
print(f"Proteins: {stats[1]:,}")
print(f"tRNA/tmRNA: {stats[2]:,}")
# Complex analytical query
results = conn.execute("""
SELECT
f.Source_DB,
COUNT(DISTINCT f.Phage_ID) as phage_count,
AVG(f.Length) as avg_length,
AVG(f.GC_content) as avg_gc,
COUNT(DISTINCT p.Protein_ID) as protein_count
FROM fact_phages f
LEFT JOIN dim_proteins p ON f.Phage_ID = p.Phage_ID
GROUP BY f.Source_DB
ORDER BY phage_count DESC
""").df()
print(results)
conn.close()
SQL Queries¶
-- Find large phages with many proteins
SELECT
f.Phage_ID,
f.Length,
f.Host,
COUNT(p.Protein_ID) as protein_count
FROM fact_phages f
JOIN dim_proteins p ON f.Phage_ID = p.Phage_ID
WHERE f.Length > 200000
GROUP BY f.Phage_ID, f.Length, f.Host
HAVING COUNT(p.Protein_ID) > 200
ORDER BY f.Length DESC;
-- Analyze tRNA distribution
SELECT
trna_type,
COUNT(*) as count,
COUNT(DISTINCT Phage_ID) as phage_count
FROM dim_trna_tmrna
WHERE trna_type IS NOT NULL
GROUP BY trna_type
ORDER BY count DESC;
-- Find phages with anti-CRISPR systems
SELECT
f.Phage_ID,
f.Host,
f.Lifestyle,
COUNT(a.Protein_ID) as anti_crispr_count
FROM fact_phages f
JOIN dim_anti_crispr a ON f.Phage_ID = a.Phage_ID
GROUP BY f.Phage_ID, f.Host, f.Lifestyle
HAVING COUNT(a.Protein_ID) > 0
ORDER BY anti_crispr_count DESC;
Performance Considerations¶
- Database Size: ~15 GB (optimized with columnar storage)
- Query Performance: Fast analytical queries thanks to indexes and columnar format
- Memory Usage: DuckDB is memory-efficient, but complex queries may need >8 GB RAM
- Disk I/O: Use SSD for better performance
Next Steps¶
- API Reference - Query the database via REST API
- Command Reference - Common database operations
- Installation Guide - Set up your local environment
Note: The database is read-only in production. For updates, re-run the pipeline with updated source data.