Pipeline Logs Reference
All files produced by the PBI pipeline that are useful for monitoring and post-run analysis are written under pipeline_logs/ (bind-mounted from ./pipeline_logs in the repository root to /pipeline-logs inside the container).
The directory is organised into three sub-directories:
| Sub-directory | Purpose |
|---|---|
| logs/ | Plain-text and CSV logs emitted during rule execution |
| reports/ | HTML validation reports and the host-status summary CSV |
| csv/ | Intermediate CSV/JSON data files useful for log analysis |
Local runs (without Docker): the PBI_LOGS_DIR environment variable controls where these files land. When unset, the Snakefile falls back to the value of PBI_DATA_DIR (default: data/), so the same relative paths appear under data/ instead of pipeline_logs/.
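The fallback described in the note can be sketched in a few lines of Python. This is a minimal illustration, not the Snakefile's actual code: the helper name `resolve_logs_dir` is hypothetical, and treating an empty PBI_LOGS_DIR the same as an unset one is an assumption.

```python
import os
from pathlib import Path


def resolve_logs_dir(env=None):
    """Return the directory where the pipeline log files are expected to land."""
    env = os.environ if env is None else env
    # PBI_LOGS_DIR wins when it is set (assumption: empty counts as unset).
    logs_dir = env.get("PBI_LOGS_DIR")
    if logs_dir:
        return Path(logs_dir)
    # Otherwise fall back to PBI_DATA_DIR, which defaults to data/.
    return Path(env.get("PBI_DATA_DIR", "data/"))


# Build the expected location of one log file for the current environment.
print(resolve_logs_dir() / "logs" / "host_download.log")
```

Passing a plain dict instead of relying on `os.environ` makes it easy to check how a given combination of variables would resolve before starting a run.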
logs/: Execution logs
host_download.log
Config key: host_download_log
Verbose log of the download_host_genomes rule. Each downloaded, skipped, or failed genome is recorded here with timestamps.
# Watch progress in real time
tail -f pipeline_logs/logs/host_download.log
# Count successful downloads
grep -c "✅" pipeline_logs/logs/host_download.log
# List failed accessions
grep "❌" pipeline_logs/logs/host_download.log
host_download_failures.log
Config key: host_failure_log
Structured failure log written at the end of download_host_genomes. Each line describes a host token that could not be resolved or a genome that could not be downloaded, with the reason categorised.
Common categories:
- No assembly found: species absent from NCBI RefSeq
- GTDB identifier: placeholder IDs such as sp001234567 (filtered automatically)
- Generic name: e.g. Acidovorax sp. without strain information
- Network error: transient NCBI connection failures
- Empty / placeholder: -, unknown host, etc.
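To see which categories dominate a run, the failure log can be tallied with the standard library. The sketch below assumes each log line contains its category label verbatim; the exact line layout is not documented here, so the substring matching and the helper name `count_failure_categories` are illustrative assumptions.

```python
from collections import Counter

# Category labels as listed above. Matching them as substrings is an
# assumption about the log layout, not a documented format.
CATEGORIES = [
    "No assembly found",
    "GTDB identifier",
    "Generic name",
    "Network error",
    "Empty / placeholder",
]


def count_failure_categories(lines):
    """Count how many non-empty lines mention each known failure category."""
    counts = Counter()
    for line in lines:
        lowered = line.lower()
        for category in CATEGORIES:
            if category.lower() in lowered:
                counts[category] += 1
                break
        else:
            # No known label matched; keep track of these separately.
            if line.strip():
                counts["uncategorised"] += 1
    return counts
```

A file handle can be passed directly, for example `count_failure_categories(open("pipeline_logs/logs/host_download_failures.log", encoding="utf-8"))`. A large "uncategorised" count would suggest the matching assumption does not hold for the actual log format.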
host_fasta_qc.csv
Config key: host_fasta_qc_log
CSV produced by the index_individual_host_sequences rule. One row per FASTA file that was evaluated. Load directly with pandas for analysis.
Key columns:
| Column | Description |
|---|---|
| Host_ID | Unique host identifier |
| fasta_path | Absolute path to the FASTA file |
| status | indexed / rejected / warning |
| reason | Human-readable reason for rejection or warning |
| duplicate_headers | Number of duplicate sequence identifiers found |
| identical_sequences | Number of identical sequences detected |
import pandas as pd
qc = pd.read_csv("pipeline_logs/logs/host_fasta_qc.csv")
# Files rejected due to duplicate headers
print(qc[(qc["status"] == "rejected") & (qc["duplicate_headers"] > 0)])
# Files with duplicate sequence content (indexed but flagged)
print(qc[qc["identical_sequences"] > 0])
create_host_mapping.log
Config key: create_host_mapping_log
Log of the create_host_mapping rule, which builds the JSON mapping from Host_ID to individual FASTA file paths. Useful for diagnosing missing or mismatched files.
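When diagnosing missing files, it can help to check the mapping itself against the filesystem. The sketch below assumes the mapping is a JSON object whose keys are Host_ID values and whose values are either a single path or a list of paths. That layout and the helper name `missing_fasta_files` are assumptions, and since the mapping file's location is not given here, the function takes an already-loaded dict.

```python
from pathlib import Path


def missing_fasta_files(mapping):
    """Return {Host_ID: [paths that do not exist]} for an already-loaded mapping."""
    missing = {}
    for host_id, paths in mapping.items():
        # Accept either one path or a list of paths per host (assumed layout).
        if isinstance(paths, str):
            paths = [paths]
        absent = [p for p in paths if not Path(p).is_file()]
        if absent:
            missing[host_id] = absent
    return missing
```

Load the mapping with `json.load` and pass the resulting dict in. An empty result means every referenced FASTA file is present on disk.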
index_individual_host_sequences.log
Config key: index_individual_host_sequences_log
Log of the index_individual_host_sequences rule. Records which FASTA files were indexed with pyfaidx and which were rejected by the QC checks.
create_host_status_report.log
Config key: create_host_status_report_log
Log of the create_host_status_report rule, which joins the four host-tracking CSVs into host_status_report.csv (see below).
merge_phage_fasta.log
Config key: merge_phage_fasta_log
Log from the rule that merges per-source phage FASTA files into a single all_phages.fasta.
merge_protein_fasta.log
Config key: merge_protein_fasta_log
Log from the rule that merges per-source protein FASTA files into a single all_proteins.fasta.
index_phage_sequences.log
Config key: index_phage_sequences_log
Log from the rule that creates the pyfaidx .fai index for all_phages.fasta.
index_protein_sequences.log
Config key: index_protein_sequences_log
Log from the rule that creates the pyfaidx .fai index for all_proteins.fasta.
reports/: HTML reports and summary CSVs
database_validation.html
Config key: database_validation_report_output
Comprehensive HTML report generated after the DuckDB database is built. Covers row counts, null-value rates, cross-table join statistics, and data-quality checks.
host_status_report.csv
Config key: host_status_report
Combined per-phage host-status table produced by create_host_status_report. One row per (Phage_ID, Host_Token) pair, joining data from phage_host_candidates.csv, phage_host_assemblies.csv, assembly_metadata.csv, and host_fasta_qc.csv.
Key columns:
| Column | Description |
|---|---|
| Phage_ID | Phage identifier |
| Host_Token | Individual parsed host token |
| Token_Type | assembly_accession / species_name / other |
| Assembly_Accession | Resolved assembly accession (if any) |
| Assembly_Level | Complete Genome / Chromosome / Scaffold / Contig |
| Downloaded | Whether the FASTA was successfully downloaded |
| Indexed | Whether the FASTA was successfully indexed |
| QC_Status | indexed / rejected / warning |
import pandas as pd
report = pd.read_csv("pipeline_logs/reports/host_status_report.csv")
# How many phages have at least one resolved and indexed host?
resolved = report[report["Indexed"] == True]
print(f"Phages with ≥1 indexed host: {resolved['Phage_ID'].nunique()}")
# Phages with no indexed host at all
all_phage_ids = report["Phage_ID"].unique()
phages_with_host = resolved["Phage_ID"].unique()
missing = set(all_phage_ids) - set(phages_with_host)
print(f"Phages with no indexed host: {len(missing)}")
Feature metadata reports (*_report.html)
One HTML report per feature is generated by the metadata-merging rules. Each report summarises per-source row counts, column coverage, and data-quality indicators for the merged metadata CSV.
| File | Feature |
|---|---|
| phage_metadata_report.html | Core phage metadata |
| annotated_proteins_metadata_report.html | Annotated protein sequences |
| transcription_terminator_metadata_report.html | Transcription terminators |
| phage_trna_tmrna_metadata_report.html | tRNA / tmRNA features |
| phage_anti_crispr_metadata_report.html | Anti-CRISPR proteins |
| phage_virulent_factor_metadata_report.html | Virulence factors |
| phage_transmembrane_protein_metadata_report.html | Transmembrane proteins |
| crispr_array_metadata_report.html | CRISPR arrays |
| antimicrobial_resistance_gene_metadata_report.html | AMR genes |
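After a run, a quick way to confirm that every feature report was written is to compare the file names in the table against the reports directory. This is a minimal sketch: the helper name `missing_reports` is hypothetical, and the default path assumes the Docker layout described at the top of this page.

```python
from pathlib import Path

# File names taken from the table above.
EXPECTED_REPORTS = [
    "phage_metadata_report.html",
    "annotated_proteins_metadata_report.html",
    "transcription_terminator_metadata_report.html",
    "phage_trna_tmrna_metadata_report.html",
    "phage_anti_crispr_metadata_report.html",
    "phage_virulent_factor_metadata_report.html",
    "phage_transmembrane_protein_metadata_report.html",
    "crispr_array_metadata_report.html",
    "antimicrobial_resistance_gene_metadata_report.html",
]


def missing_reports(reports_dir="pipeline_logs/reports"):
    """Return the expected report file names that are not present on disk."""
    base = Path(reports_dir)
    return [name for name in EXPECTED_REPORTS if not (base / name).is_file()]
```

Calling `missing_reports()` returns an empty list when all nine reports exist. For local runs, pass the directory that PBI_LOGS_DIR or PBI_DATA_DIR resolves to.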
csv/: Intermediate data files for analysis
These files are produced during the host-genome download and resolution steps. They are bind-mounted so that they survive container restarts and are immediately accessible from the host for log analysis without needing to enter the container.
.host_indexes_complete
Config key: host_index_complete_flag
Hidden sentinel file created by Snakemake's touch() after all host FASTA files have been indexed successfully. Its existence is the only signal Snakemake needs to know the indexing step is complete.
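Because the sentinel is a hidden file, it is easy to overlook in a directory listing. A small check like the one below can confirm whether indexing has finished. The helper name `host_indexing_complete` is hypothetical, and the csv/ location follows the layout documented on this page.

```python
from pathlib import Path


def host_indexing_complete(logs_root="pipeline_logs"):
    """Return True when the hidden sentinel file exists under csv/."""
    return (Path(logs_root) / "csv" / ".host_indexes_complete").is_file()


print("Host indexing complete:", host_indexing_complete())
```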
host_metadata.csv
Config key: host_metadata_output
Per-assembly metadata for every successfully downloaded host genome. One row per unique assembly accession.
Key columns:
| Column | Description |
|---|---|
| Host_ID | Unique host identifier ({species}_{accession}) |
| Species_Name | Original species name from phage metadata |
| Assembly_Accession | NCBI accession (GCF_ preferred) |
| Assembly_Level | Complete Genome / Chromosome / Scaffold / Contig |
| Genome_Length | Total genome size in base pairs |
| GC_Content | GC percentage |
| Sequence_Count | Number of sequences in the assembly |
| Download_Date | Timestamp of the download |
import pandas as pd
meta = pd.read_csv("pipeline_logs/csv/host_metadata.csv")
print(f"Total host genomes: {len(meta)}")
print(meta["Assembly_Level"].value_counts())
assembly_metadata.csv
Config key: assembly_metadata_output
Detailed NCBI Assembly metadata retrieved during the resolution step. Broader than host_metadata.csv; includes RefSeq category, submission date, and other assembly attributes.
phage_host_links.csv
Config key: phage_host_links_output
Flat mapping of phage → assembly accession links. One row per unique (Phage_ID, Assembly_Accession) pair. This is the authoritative table loaded into DuckDB to build the phage–host relationship.
Key columns:
| Column | Description |
|---|---|
| Phage_ID | Phage identifier |
| Assembly_Accession | Resolved NCBI accession |
| Host_Raw | Original un-parsed host field (for traceability) |
| Confidence | Float 0–1 reflecting resolution quality |
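The Confidence column makes it straightforward to list links that may deserve manual review. The sketch below uses only the standard library. The helper name `low_confidence_links` and the 0.5 threshold are illustrative assumptions, as is the expectation that Confidence is populated on every row.

```python
import csv
import io


def low_confidence_links(csv_file, threshold=0.5):
    """Return (Phage_ID, Assembly_Accession, Confidence) for links below threshold."""
    flagged = []
    for row in csv.DictReader(csv_file):
        # Assumes Confidence is always a parseable float.
        confidence = float(row["Confidence"])
        if confidence < threshold:
            flagged.append((row["Phage_ID"], row["Assembly_Accession"], confidence))
    return flagged


# Illustrative rows only; real data lives in pipeline_logs/csv/phage_host_links.csv.
sample = io.StringIO(
    "Phage_ID,Assembly_Accession,Host_Raw,Confidence\n"
    "phage_A,GCF_000000001.1,example host one,0.95\n"
    "phage_B,GCF_000000002.1,example host two,0.30\n"
)
print(low_confidence_links(sample))
```

For real data, pass an open file handle instead of the `io.StringIO` sample and pick a threshold that suits the analysis.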
phage_host_candidates.csv
Config key: phage_host_candidates_output
Lossless, auditable record of every host token parsed from the phage metadata. One row per (Phage_ID, token) pair — this is the input to the resolution step.
Key columns:
| Column | Description |
|---|---|
| Phage_ID | Phage identifier |
| Host_Raw | Original un-parsed Host field |
| Host_Token | Individual token extracted from the field |
| Token_Type | assembly_accession / species_name / other |
| Token_Order | 1-based position in the original field |
Useful for auditing the parser: every token that entered the pipeline is visible here, including those that were ultimately unresolvable.
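One way to surface the unresolvable tokens is to compare this file against phage_host_assemblies.csv on the (Phage_ID, Host_Token) pair. The sketch below is illustrative: the helper name `unresolved_tokens` is hypothetical, and it assumes that a token with no row in phage_host_assemblies.csv was not resolved.

```python
import csv
import io


def unresolved_tokens(candidates_file, assemblies_file):
    """Return (Phage_ID, Host_Token, Token_Type) for candidates with no assembly row."""
    # Collect every (phage, token) pair that produced a resolution result.
    resolved = {
        (row["Phage_ID"], row["Host_Token"])
        for row in csv.DictReader(assemblies_file)
    }
    return [
        (row["Phage_ID"], row["Host_Token"], row["Token_Type"])
        for row in csv.DictReader(candidates_file)
        if (row["Phage_ID"], row["Host_Token"]) not in resolved
    ]


# Illustrative rows only; pass open file handles for the real CSVs instead.
candidates = io.StringIO(
    "Phage_ID,Host_Raw,Host_Token,Token_Type,Token_Order\n"
    "phage_A,token one; token two,token one,species_name,1\n"
    "phage_A,token one; token two,token two,other,2\n"
)
assemblies = io.StringIO(
    "Phage_ID,Host_Token,Assembly_Accession\n"
    "phage_A,token one,GCF_000000001.1\n"
)
print(unresolved_tokens(candidates, assemblies))
```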
phage_host_assemblies.csv
Config key: phage_host_assemblies_output
Per-token resolution results. One row per (Phage_ID, Assembly_Accession) pair, produced after NCBI resolution. Includes confidence scores and resolution metadata.
Key columns:
| Column | Description |
|---|---|
| Phage_ID | Phage identifier |
| Host_Token | Specific token that was resolved |
| Assembly_Accession | Resolved NCBI accession |
| Resolution_Source | accession_in_host_field / species_to_taxid_to_assembly / fallback |
| Confidence | Float 0–1 |
| Assembly_Level | Complete Genome / Chromosome / Scaffold / Contig |
| Ambiguous | True when multiple equally-plausible hits exist |
import pandas as pd
assemblies = pd.read_csv("pipeline_logs/csv/phage_host_assemblies.csv")
# Resolution source distribution
print(assemblies["Resolution_Source"].value_counts())
# Ambiguous resolutions
print(assemblies[assemblies["Ambiguous"] == True][["Phage_ID", "Host_Token", "Ambiguity_Reason"]])
host_token_resolution_cache.json
Config key: host_resolution_cache_output
Persistent JSON cache mapping host tokens to their resolved NCBI assembly accessions. Reused across reruns when reuse_host_resolution_cache: true (the default), so expensive NCBI taxonomy/assembly lookups are not repeated for tokens that have already been resolved.
To force a full re-resolution pass (ignoring this cache):
snakemake --cores 4 --use-conda \
--forcerun download_host_genomes \
--config reuse_host_resolution_cache=false
The cache is a plain JSON object and can be inspected or edited manually if needed.
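For a quick look at what the cache holds, it can be loaded with the standard library and split by whether a token has a stored value. The shape of each value is not documented here, so the sketch only tests truthiness. That simplification, the idea that unresolved tokens are cached with an empty value, and the helper name `summarise_cache` are all assumptions.

```python
import json


def summarise_cache(cache):
    """Split cached tokens into those with a non-empty value and those without."""
    resolved = sorted(token for token, value in cache.items() if value)
    unresolved = sorted(token for token, value in cache.items() if not value)
    return resolved, unresolved


# Illustrative object only. Load the real cache with json.load on
# pipeline_logs/csv/host_token_resolution_cache.json.
example = json.loads('{"token one": "GCF_000000001.1", "token two": null}')
resolved, unresolved = summarise_cache(example)
print(f"{len(resolved)} tokens with a value, {len(unresolved)} without")
```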
public_data_manifest.json and public_data_manifest.csv
Config keys:
- public_data_provenance.manifest_json_output
- public_data_provenance.manifest_csv_output
Per-download provenance records for each public input source.
Key diagnostics:
- status = failed means that the source failed download or provenance capture
- error_message contains the failure reason
- schema_fingerprint tracks header/schema drift over time
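The status and error_message fields make failed sources easy to list from the CSV manifest. The following sketch relies only on those documented column names. The helper name `failed_sources`, the `source` column, and the sample status values other than failed are hypothetical.

```python
import csv
import io


def failed_sources(csv_file):
    """Return the manifest rows whose status column equals 'failed'."""
    return [row for row in csv.DictReader(csv_file) if row.get("status") == "failed"]


# Illustrative manifest; the "source" column name is hypothetical.
sample = io.StringIO(
    "source,status,error_message,schema_fingerprint\n"
    "example_a,ok,,abc123\n"
    "example_b,failed,connection timed out,\n"
)
for row in failed_sources(sample):
    print(row["source"], "->", row["error_message"])
```

For a real run, open pipeline_logs/csv/public_data_manifest.csv and pass the file handle in place of the sample.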
pipeline_run_provenance.json and pipeline_run_provenance.csv
Config keys:
- public_data_provenance.pipeline_run_provenance_json_output
- public_data_provenance.pipeline_run_provenance_csv_output
Run-level provenance snapshot that captures pinned provider metadata for the pipeline run (provider_release, provider_snapshot_date, provider_schema_profile, git_commit, pbi_version).
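Comparing the pinned fields of two runs is a simple way to spot a changed provider release or code version. The sketch below assumes the JSON file is a flat object with the five field names above as top-level keys. That layout and the helper names are assumptions.

```python
PINNED_FIELDS = (
    "provider_release",
    "provider_snapshot_date",
    "provider_schema_profile",
    "git_commit",
    "pbi_version",
)


def pinned_metadata(provenance):
    """Return the pinned fields, using None for any that are missing."""
    return {field: provenance.get(field) for field in PINNED_FIELDS}


def diff_runs(old, new):
    """Return {field: (old_value, new_value)} for pinned fields that differ."""
    a, b = pinned_metadata(old), pinned_metadata(new)
    return {field: (a[field], b[field]) for field in PINNED_FIELDS if a[field] != b[field]}
```

Load two snapshots with `json.load` and pass the resulting dicts to `diff_runs`. An empty result means the pinned metadata is identical between the runs.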
private_manifest.json
Config key: private_manifest_output
Location:
- host: private_data/private_manifest.json
- container: /private-data/private_manifest.json
This file is the authoritative private-source validation summary.
Key diagnostics:
- sources_valid / sources_invalid
- per-source is_valid
- per-source errors and warnings
If a private source is missing from the database, check this file first to confirm whether it was skipped during validation.
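That check can be scripted. The exact JSON layout is not documented on this page, so the sketch below assumes a top-level `sources` object mapping each source name to a record with is_valid, errors, and warnings. That structure and the helper name `invalid_sources` are hypothetical and may need adjusting to the real file.

```python
def invalid_sources(manifest):
    """Map each invalid source name to its recorded errors."""
    # "sources" as a name -> record mapping is a hypothetical layout.
    sources = manifest.get("sources", {})
    return {
        name: record.get("errors", [])
        for name, record in sources.items()
        if not record.get("is_valid", False)
    }
```

Load private_data/private_manifest.json with `json.load` and pass the dict in. Any source listed in the result failed validation, which would explain its absence from the database.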
Quick-reference table
| File (relative to pipeline_logs/) | Config key | Format | Produced by rule |
|---|---|---|---|
| logs/host_download.log | host_download_log | text | download_host_genomes |
| logs/host_download_failures.log | host_failure_log | text | download_host_genomes |
| logs/host_fasta_qc.csv | host_fasta_qc_log | CSV | index_individual_host_sequences |
| logs/create_host_mapping.log | create_host_mapping_log | text | create_host_mapping |
| logs/index_individual_host_sequences.log | index_individual_host_sequences_log | text | index_individual_host_sequences |
| logs/create_host_status_report.log | create_host_status_report_log | text | create_host_status_report |
| logs/merge_phage_fasta.log | merge_phage_fasta_log | text | merge_phage_fasta |
| logs/merge_protein_fasta.log | merge_protein_fasta_log | text | merge_protein_fasta |
| logs/index_phage_sequences.log | index_phage_sequences_log | text | index_phage_sequences |
| logs/index_protein_sequences.log | index_protein_sequences_log | text | index_protein_sequences |
| reports/database_validation.html | database_validation_report_output | HTML | validate_database |
| reports/host_status_report.csv | host_status_report | CSV | create_host_status_report |
| reports/phage_metadata_report.html | phage_metadata_report_output | HTML | generate_report |
| reports/annotated_proteins_metadata_report.html | annotated_proteins_metadata_report_output | HTML | generate_report |
| reports/transcription_terminator_metadata_report.html | transcription_terminator_metadata_report_output | HTML | generate_report |
| reports/phage_trna_tmrna_metadata_report.html | phage_trna_tmrna_metadata_report_output | HTML | generate_report |
| reports/phage_anti_crispr_metadata_report.html | phage_anti_crispr_metadata_report_output | HTML | generate_report |
| reports/phage_virulent_factor_metadata_report.html | phage_virulent_factor_metadata_report_output | HTML | generate_report |
| reports/phage_transmembrane_protein_metadata_report.html | phage_transmembrane_protein_metadata_report_output | HTML | generate_report |
| reports/crispr_array_metadata_report.html | crispr_array_metadata_report_output | HTML | generate_report |
| reports/antimicrobial_resistance_gene_metadata_report.html | antimicrobial_resistance_gene_metadata_report_output | HTML | generate_report |
| csv/.host_indexes_complete | host_index_complete_flag | flag | index_individual_host_sequences |
| csv/host_metadata.csv | host_metadata_output | CSV | download_host_genomes |
| csv/assembly_metadata.csv | assembly_metadata_output | CSV | download_host_genomes |
| csv/phage_host_links.csv | phage_host_links_output | CSV | download_host_genomes |
| csv/phage_host_candidates.csv | phage_host_candidates_output | CSV | download_host_genomes |
| csv/phage_host_assemblies.csv | phage_host_assemblies_output | CSV | download_host_genomes |
| csv/host_token_resolution_cache.json | host_resolution_cache_output | JSON | download_host_genomes |
| csv/public_data_manifest.json | public_data_provenance.manifest_json_output | JSON | build_public_data_provenance_manifest |
| csv/public_data_manifest.csv | public_data_provenance.manifest_csv_output | CSV | build_public_data_provenance_manifest |
| csv/pipeline_run_provenance.json | public_data_provenance.pipeline_run_provenance_json_output | JSON | build_public_data_provenance_manifest |
| csv/pipeline_run_provenance.csv | public_data_provenance.pipeline_run_provenance_csv_output | CSV | build_public_data_provenance_manifest |