đŸ§Ŧ PhageScope Database Validation Report

Generated: 2026-01-20T16:45:48.826230

Database: /data/processed/databases/phage_database_optimized.duckdb

📊 Database Overview

Phages
873,718
Proteins
36,718,799
Terminators
4,640,129
Anti-CRISPR
299,832
Virulent Factors
23,774
Transmembrane
4,020,770
tRNA/tmRNA
1,298,181
CRISPR Arrays
56,652
AMR Genes
2,602

đŸ—‚ī¸ Database Schema & Relationships

All dimension tables are linked to fact_phages via Phage_ID

dim_proteins
36,718,799 rows
FK: Phage_ID
dim_terminators
4,640,129 rows
FK: Phage_ID
fact_phages
873,718 rows
PK: Phage_ID
dim_anti_crispr
299,832 rows
FK: Phage_ID
dim_virulent_factors
23,774 rows
FK: Phage_ID
dim_crispr_arrays
56,652 rows
FK: Phage_ID
dim_transmembrane_proteins
4,020,770 rows
FK: Phage_ID
dim_trna_tmrna
1,298,181 rows
FK: Phage_ID
dim_antimicrobial_resistance_genes
2,602 rows
FK: Phage_ID

📋 Table Details

fact_phages
873,718
Columns: 10
Columns with NULLs: 3
Data Types:
  • VARCHAR: 8
  • INTEGER: 1
  • DOUBLE: 1
Column Names:
  • Phage_ID (VARCHAR)
  • Source_DB (VARCHAR)
  • Length (INTEGER)
  • GC_content (DOUBLE)
  • Taxonomy (VARCHAR)
  • Completeness (VARCHAR)
  • Host (VARCHAR)
  • Lifestyle (VARCHAR)
  • Cluster (VARCHAR)
  • Subcluster (VARCHAR)
dim_proteins
36,718,799
Columns: 19
Columns with NULLs: 4
Data Types:
  • VARCHAR: 8
  • INTEGER: 2
  • DOUBLE: 9
Column Names:
  • Phage_ID (VARCHAR)
  • Protein_ID (VARCHAR)
  • Protein_source (VARCHAR)
  • Function_prediction_source (VARCHAR)
  • Start (INTEGER)
  • Stop (INTEGER)
  • Strand (VARCHAR)
  • Product (VARCHAR)
  • Protein_classification (VARCHAR)
  • Molecular_weight (DOUBLE)
  • Aromaticity (DOUBLE)
  • Instability_index (DOUBLE)
  • Isoelectric_point (DOUBLE)
  • Helix_fraction (DOUBLE)
  • Turn_fraction (DOUBLE)
  • Sheet_fraction (DOUBLE)
  • Reduced_coefficient (DOUBLE)
  • Oxidized_coefficient (DOUBLE)
  • Source_DB (VARCHAR)
dim_terminators
4,640,129
Columns: 8
Columns with NULLs: 0
Data Types:
  • VARCHAR: 5
  • INTEGER: 2
  • DOUBLE: 1
Column Names:
  • Phage_ID (VARCHAR)
  • terminator_type (VARCHAR)
  • terminator_start (INTEGER)
  • terminator_end (INTEGER)
  • confidence_score (DOUBLE)
  • Sense (VARCHAR)
  • Loc (VARCHAR)
  • Source_DB (VARCHAR)
dim_anti_crispr
299,832
Columns: 4
Columns with NULLs: 0
Data Types:
  • VARCHAR: 4
Column Names:
  • Phage_ID (VARCHAR)
  • Protein_ID (VARCHAR)
  • Source (VARCHAR)
  • Source_DB (VARCHAR)
dim_virulent_factors
23,774
Columns: 4
Columns with NULLs: 0
Data Types:
  • VARCHAR: 4
Column Names:
  • Phage_ID (VARCHAR)
  • Protein_ID (VARCHAR)
  • aligned_protein_vfdb (VARCHAR)
  • Source_DB (VARCHAR)
dim_transmembrane_proteins
4,020,770
Columns: 18
Columns with NULLs: 1
Data Types:
  • VARCHAR: 7
  • INTEGER: 8
  • DOUBLE: 3
Column Names:
  • Phage_ID (VARCHAR)
  • Protein_ID (VARCHAR)
  • protein_length (INTEGER)
  • predicted_tmhs_number (INTEGER)
  • exp_aa_in_tmhs (DOUBLE)
  • exp_first_60_aa (DOUBLE)
  • total_prob_n_in (DOUBLE)
  • possible_n_term (VARCHAR)
  • inside_source (VARCHAR)
  • inside_start (INTEGER)
  • inside_end (INTEGER)
  • tm_helix_source (VARCHAR)
  • tm_helix_start (INTEGER)
  • tm_helix_end (INTEGER)
  • outside_source (VARCHAR)
  • outside_start (INTEGER)
  • outside_end (INTEGER)
  • Source_DB (VARCHAR)
dim_trna_tmrna
1,298,181
Columns: 11
Columns with NULLs: 3
Data Types:
  • VARCHAR: 8
  • INTEGER: 3
Column Names:
  • Phage_ID (VARCHAR)
  • trna_tmrna_id (VARCHAR)
  • source (VARCHAR)
  • trna_type (VARCHAR)
  • start_pos (INTEGER)
  • stop_pos (INTEGER)
  • Strand (VARCHAR)
  • length (INTEGER)
  • permuted (VARCHAR)
  • sequence (VARCHAR)
  • Source_DB (VARCHAR)
dim_crispr_arrays
56,652
Columns: 26
Columns with NULLs: 0
Data Types:
  • VARCHAR: 8
  • INTEGER: 9
  • DOUBLE: 9
Column Names:
  • Phage_ID (VARCHAR)
  • crispr_id (VARCHAR)
  • duplicated_spacers (INTEGER)
  • crispr_start (INTEGER)
  • crispr_end (INTEGER)
  • crispr_length (INTEGER)
  • potential_orientation_at_percent (VARCHAR)
  • crispr_direction (VARCHAR)
  • consensus_repeat (VARCHAR)
  • repeat_id_crisprdb (VARCHAR)
  • nb_crisprs_same_repeat (INTEGER)
  • repeat_length (INTEGER)
  • spacers_count (INTEGER)
  • mean_spacer_size (DOUBLE)
  • std_dev_spacers (DOUBLE)
  • repeats_matching_consensus (INTEGER)
  • ratio_repeats_match_total (DOUBLE)
  • conservation_repeats_pct (DOUBLE)
  • ebcons_repeats (DOUBLE)
  • conservation_spacers_pct (DOUBLE)
  • ebcons_spacers (DOUBLE)
  • repeat_plus_spacer_length (DOUBLE)
  • ratio_repeat_spacer_length (DOUBLE)
  • crispr_found_in_db (VARCHAR)
  • evidence_level (INTEGER)
  • Source_DB (VARCHAR)
dim_antimicrobial_resistance_genes
2,602
Columns: 4
Columns with NULLs: 1
Data Types:
  • VARCHAR: 4
Column Names:
  • Phage_ID (VARCHAR)
  • Protein_id (VARCHAR)
  • aligned_protein_card (VARCHAR)
  • Source_DB (VARCHAR)

✅ Data Quality Checks

CheckStatusDetails
All expected tables present ✅ PASS All 9 tables found
Duplicate Phage IDs ✅ PASS 0 duplicates found
Orphaned Proteins ✅ PASS 0 proteins without matching phages
Duplicate Protein IDs âš ī¸ WARNING 1912 duplicate protein IDs
Orphaned Terminators ✅ PASS 0 terminators without matching phages
Orphaned Anti-CRISPR ✅ PASS 0 anti-CRISPR entries without matching phages
Duplicate Protein IDs (Anti-CRISPR) ✅ PASS 0 duplicate protein IDs
Orphaned Virulent Factors ✅ PASS 0 virulent factors without matching phages
Duplicate Protein IDs (Virulent Factors) ✅ PASS 0 duplicate protein IDs
Orphaned Transmembrane Proteins ✅ PASS 0 transmembrane proteins without matching phages
Duplicate Protein IDs (Transmembrane) ✅ PASS 0 duplicate protein IDs
TMH Statistics â„šī¸ INFO Avg TMHs: 1.86, Range: 1-48
Orphaned tRNA/tmRNA ✅ PASS 0 tRNA/tmRNA entries without matching phages
Duplicate tRNA IDs ✅ PASS 0 duplicate tRNA IDs
Orphaned CRISPR Arrays ✅ PASS 0 CRISPR arrays without matching phages
Duplicate CRISPR IDs âš ī¸ WARNING 31 duplicate CRISPR IDs
Orphaned AMR Genes ✅ PASS 0 AMR genes without matching phages
Duplicate Protein IDs (AMR) ✅ PASS 0 duplicate protein IDs

🔧 Database Objects

Indexes: idx_amr_phage, idx_amr_protein, idx_amr_source, idx_anti_crispr_phage, idx_anti_crispr_source, idx_crispr_evidence, idx_crispr_id, idx_crispr_phage, idx_crispr_source, idx_phages_id, idx_phages_source, idx_proteins_phage, idx_proteins_source, idx_terminators_phage, idx_terminators_source, idx_transmembrane_phage, idx_transmembrane_source, idx_trna_phage, idx_trna_source, idx_virulent_phage, idx_virulent_source

Views: amr_gene_summary, crispr_array_summary, crispr_evidence_distribution, host_amr_profile, host_crispr_profile, phage_complete_profile, phage_defense_profile, phage_size_distribution, phage_summary