Host Genome Resolution
Overview
This page describes how PBI resolves phage host information to downloadable bacterial genome assemblies from NCBI RefSeq.
Host genome resolution is a critical step because phage metadata from PhageScope contains complex, varied host field formats. A single phage's "Host" field may contain multiple identifiers in different formats, separated by semicolons, for example `NA;GCA 900066335.1;UBA9502;Blautia obeum`.
The pipeline parses this into individual tokens, classifies each token, and resolves them to NCBI assembly accessions.
Solution
Stage 1 – Lossless parsing: phage_host_candidates.csv
The new standalone parse_host_field(host_raw) function splits the raw Host
field into individual tokens:
- Splits on semicolons.
- Normalises `GCA 900066335.1` → `GCA_900066335.1` (space → underscore).
- Drops empty, `NA`, `unknown*`, and `unidentified*` values.
- Classifies each token:
  - `assembly_accession` – matches the `GCA_`/`GCF_` pattern.
  - `species_name` – two or more words, genus capitalized (binomial nomenclature).
  - `other` – single words, codes such as `UBA9502`, etc.
- Preserves `Token_Order` (1-based position in the original field).
The pipeline writes one row per (Phage_ID, token) to
phage_host_candidates.csv. This is the lossless, auditable record of every
host candidate parsed from the metadata.
Example
| Phage_ID | Host_Raw | Host_Token | Token_Type | Token_Order |
|---|---|---|---|---|
| phage1 | NA;GCA 900066335.1;UBA9502;Blautia obeum | GCA_900066335.1 | assembly_accession | 2 |
| phage1 | NA;GCA 900066335.1;UBA9502;Blautia obeum | UBA9502 | other | 3 |
| phage1 | NA;GCA 900066335.1;UBA9502;Blautia obeum | Blautia obeum | species_name | 4 |
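The parsing rules listed above can be sketched in a few lines of Python. This is a minimal illustration, not the actual `parse_host_field()` implementation: the `Token` tuple, the regular expressions, and the case-insensitive handling of `NA`, `unknown*`, and `unidentified*` are all assumptions made for the example.

```python
import re
from typing import List, NamedTuple

# Assumed accession shape: GCA_/GCF_ prefix, nine digits, optional ".version".
ACCESSION_RE = re.compile(r"^GC[AF]_\d{9}(\.\d+)?$")
# Matches an accession written with a space instead of an underscore.
SPACED_ACCESSION_RE = re.compile(r"^(GC[AF])\s+(\d)")
# Placeholder prefixes to drop (compared case-insensitively in this sketch).
DROP_PREFIXES = ("unknown", "unidentified")


class Token(NamedTuple):
    """Illustrative stand-in for the real HostToken."""
    text: str
    token_type: str
    order: int


def classify(text: str) -> str:
    if ACCESSION_RE.match(text):
        return "assembly_accession"
    words = text.split()
    # Two or more words with a capitalized first word looks like a binomial.
    if len(words) >= 2 and words[0][:1].isupper():
        return "species_name"
    return "other"


def sketch_parse_host_field(host_raw: str) -> List[Token]:
    tokens = []
    # enumerate() from 1 keeps the position in the original field.
    for order, piece in enumerate(host_raw.split(";"), start=1):
        text = SPACED_ACCESSION_RE.sub(r"\1_\2", piece.strip())
        lowered = text.lower()
        if not text or lowered == "na" or lowered.startswith(DROP_PREFIXES):
            continue
        tokens.append(Token(text, classify(text), order))
    return tokens


for token in sketch_parse_host_field("NA;GCA 900066335.1;UBA9502;Blautia obeum"):
    print(token)
```

Run on the field from the example table, the sketch yields the same three tokens with orders 2, 3, and 4, because the dropped `NA` still occupies position 1.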
Stage 2 – Resolution: phage_host_assemblies.csv
Each unique token is resolved independently via resolve_host_token():
- `assembly_accession` → direct NCBI Assembly lookup (confidence 0.95).
- `species_name` → NCBI Taxonomy + Assembly search (confidence 0.70).
- `other` → attempted species search as fallback (confidence 0.30).
The pipeline writes one row per (Phage_ID, Assembly_Accession) to
phage_host_assemblies.csv. This is the authoritative flat mapping used to
drive host genome downloads.
| Column | Description |
|---|---|
| Phage_ID | Phage identifier |
| Host_Raw | Original un-parsed Host field (traceability) |
| Host_Token | Specific token that was resolved |
| Token_Type | assembly_accession / species_name / other |
| Token_Order | 1-based position in Host_Raw |
| Assembly_Accession | Resolved NCBI accession |
| Resolution_Source | accession_in_host_field / species_to_taxid_to_assembly / fallback |
| Resolution_Rank | 1-based rank within results for this token |
| Confidence | Float 0–1 derived from source + rank |
| Assembly_Level | Complete Genome, Chromosome, Scaffold, or Contig |
| RefSeq_Category | reference genome, representative genome, or na |
| Quality_Score | Integer quality score |
| Ambiguous | True when multiple equally-plausible hits exist |
| Ambiguity_Reason | Human-readable reason when ambiguous |
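One way to picture the per-token-type dispatch in Stage 2 is a small lookup table. The sketch below is illustrative only: the `Strategy` tuple and `strategy_for()` helper are hypothetical names, and pairing each token type with a `Resolution_Source` label is an assumption based on the order in which both are listed. How `Resolution_Rank` further adjusts `Confidence` is not specified here, so the sketch stops at the base value.

```python
from typing import Dict, NamedTuple


class Strategy(NamedTuple):
    resolution_source: str
    base_confidence: float


# Token type -> (Resolution_Source label, base confidence).
# The confidences come from the Stage 2 list; the label pairing is assumed.
STRATEGIES: Dict[str, Strategy] = {
    "assembly_accession": Strategy("accession_in_host_field", 0.95),
    "species_name": Strategy("species_to_taxid_to_assembly", 0.70),
    "other": Strategy("fallback", 0.30),
}


def strategy_for(token_type: str) -> Strategy:
    # Unrecognised token types use the weakest strategy (an assumption).
    return STRATEGIES.get(token_type, STRATEGIES["other"])


print(strategy_for("species_name"))
```

Keeping the mapping in one table makes it easy to audit which source and base confidence a given row in `phage_host_assemblies.csv` should carry.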
Stage 3 – Download unique assemblies
Host genome downloads are now driven by the unique Assembly_Accession values
in phage_host_assemblies.csv. Each accession is downloaded exactly once,
even if linked to many phages (deduplication).
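The deduplication step amounts to collecting the distinct `Assembly_Accession` values before downloading. A minimal sketch, assuming the file is read with the standard `csv` module; `unique_accessions()` is a hypothetical helper and the accessions in the sample are made-up placeholders:

```python
import csv
import io
from typing import Dict, Iterable, List


def unique_accessions(rows: Iterable[Dict[str, str]]) -> List[str]:
    """Return each non-empty Assembly_Accession once, in first-seen order."""
    seen: Dict[str, None] = {}
    for row in rows:
        accession = (row.get("Assembly_Accession") or "").strip()
        if accession:
            seen.setdefault(accession, None)
    return list(seen)


# In-memory stand-in for phage_host_assemblies.csv (placeholder accessions).
sample = io.StringIO(
    "Phage_ID,Assembly_Accession\n"
    "phage1,GCF_000000001.1\n"
    "phage2,GCF_000000001.1\n"
    "phage2,GCF_000000002.1\n"
)
to_download = unique_accessions(csv.DictReader(sample))
print(to_download)  # ['GCF_000000001.1', 'GCF_000000002.1']
```

Using a dict rather than a set keeps the download order deterministic, which makes logs easier to compare between runs.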
Backward-compatible outputs
The following outputs remain unchanged (same columns):
- `host_metadata.csv` – per-assembly metadata (one row per unique assembly).
- `assembly_metadata.csv` – detailed assembly metadata.
- `phage_host_links.csv` – phage→assembly links (extended, one row per unique (Phage_ID, Assembly_Accession) pair).
Snakemake caching (idempotency)
Snakemake's file-based dependency tracking ensures the download_host_genomes
rule is not re-executed when all output files already exist and are newer
than the input phage CSV.
The new outputs (phage_host_candidates and phage_host_assemblies) are
declared as rule outputs in hosts.smk, so Snakemake tracks them automatically.
Within a single run, the skip_existing=True parameter (default) prevents
re-downloading individual genome files that were already successfully retrieved.
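The per-file check behind `skip_existing` can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's code: `fetch_genome()`, the injected `download` callable, and the `<accession>.fna.gz` file layout are all hypothetical.

```python
from pathlib import Path
from typing import Callable


def fetch_genome(
    accession: str,
    out_dir: Path,
    download: Callable[[str, Path], None],
    skip_existing: bool = True,
) -> bool:
    """Download one genome; return True if a download was performed."""
    target = out_dir / f"{accession}.fna.gz"  # hypothetical file layout
    # A non-empty file is treated as a previously successful download.
    if skip_existing and target.exists() and target.stat().st_size > 0:
        return False
    out_dir.mkdir(parents=True, exist_ok=True)
    download(accession, target)
    return True
```

Checking for a non-empty file, rather than mere existence, avoids treating a zero-byte file left by an interrupted transfer as a completed download.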
Testing
New unit tests in tests/test_multi_host_parsing.py:
- `TestParseHostField` – 18 tests for the `parse_host_field()` function, including all examples from the problem statement.
- `TestGenerateCandidates` – 5 tests for `_generate_candidates()`.
- `TestBuildAssemblyLinks` – 7 tests for `_build_assembly_links()`, including multi-host, unresolved, and ambiguous cases.
All tests in this file pass without NCBI credentials.
API

```python
from download_host_genomes_robust import parse_host_field, resolve_host_token, HostToken

# Parse a complex Host field into tokens
tokens = parse_host_field("NA;GCA 900066335.1;UBA9502;Blautia obeum")
# → [HostToken('GCA_900066335.1', 'assembly_accession', 2),
#    HostToken('UBA9502', 'other', 3),
#    HostToken('Blautia obeum', 'species_name', 4)]

# Resolve a token to assembly links (requires NCBI credentials)
from assembly_resolver import AssemblyResolver
resolver = AssemblyResolver(email='user@example.org')
links = resolve_host_token(tokens[0], resolver, phage_id='p1', host_raw='NA;GCA 900066335.1;...')
# → [ResolvedAssemblyLink(assembly_accession='GCA_900066335.1', confidence=0.95, ...)]
```
Files Changed
- `workflow/scripts/sequences/download_host_genomes_robust.py` – Added `HostToken`, `ResolvedAssemblyLink`, `parse_host_field()`, `resolve_host_token()`, helper methods `_generate_candidates()` and `_build_assembly_links()`, and replaced the single-host `process_all_hosts()` with a multi-host pipeline.
- `workflow/rules/hosts.smk` – Added `phage_host_candidates` and `phage_host_assemblies` as rule outputs.
- `workflow/config/config.yaml` – Added `phage_host_candidates_output` and `phage_host_assemblies_output` config keys.
- `tests/test_multi_host_parsing.py` – New unit tests.