ViralEntropR: A Computational Pipeline for Entropy-Informed Detection of Emerging Viral Variants
Source: R/ViralEntropR-package.R
ViralEntropR-package.Rd
A computational pipeline for detecting emerging variants in viral amino acid sequence data, combining per-site Shannon entropy, Gaussian mixture model site selection, Gower-distance Partitioning Around Medoids clustering, Hellinger-distance quantification of distributional shifts, and multivariate non-parametric change-point detection.
Pipeline overview
The package supports a four-stage workflow:
1. Preprocessing. Parse FASTA headers, filter ambiguous residues, and convert between integer and character representations of amino acid sequences under a 25-symbol alphabet. See extract_fasta_dates, extract_fasta_countries, fasta_to_char_matrix, filter_ambiguous_sequences, encode_aa_sequence, and decode_aa_sequence.
2. Site selection. Compute per-site Shannon entropy across temporal partitions and cluster sites by entropy via Gaussian mixture models. See calculate_entropy, partition_time_windows, cluster_sites_by_entropy, and relabel_entropy_classes.
3. Distributional analysis. Quantify residue-composition shifts between time windows using the Hellinger distance. See calculate_hellinger_matrix.
4. Change-point detection. Identify temporal change points non-parametrically using energy statistics or wild binary segmentation. See detect_changepoints_ecp and detect_changepoints_hdcp.
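The quantities behind the site-selection and distributional-analysis stages can be illustrated in base R. The sketch below is not the package's implementation: the helper names site_entropy and hellinger, the toy alignments, the choice of log base 2, and the 1/sqrt(2) normalisation of the Hellinger distance are all assumptions made for illustration, and calculate_entropy and calculate_hellinger_matrix may follow different conventions.

```r
# Toy alignments for two time windows (hypothetical data):
# rows are sequences, columns are amino acid sites.
window_a <- rbind(
  c("M", "F", "V"),
  c("M", "F", "V"),
  c("M", "L", "V"),
  c("M", "L", "V")
)
window_b <- rbind(
  c("M", "L", "V"),
  c("M", "L", "V"),
  c("M", "L", "V"),
  c("M", "L", "A")
)

# Shannon entropy (in bits) of the observed residue distribution at one site.
# table() only returns observed residues, so no zero probabilities occur.
site_entropy <- function(residues) {
  p <- table(residues) / length(residues)
  -sum(p * log2(p))
}

# Hellinger distance between the residue distributions of two samples,
# scaled to lie in [0, 1] (illustrative helper, not a package function).
hellinger <- function(x, y) {
  symbols <- union(x, y)
  p <- table(factor(x, levels = symbols)) / length(x)
  q <- table(factor(y, levels = symbols)) / length(y)
  sqrt(sum((sqrt(p) - sqrt(q))^2) / 2)
}

# Per-site entropy within each window.
entropy_a <- apply(window_a, 2, site_entropy)
entropy_b <- apply(window_b, 2, site_entropy)

# Per-site distributional shift between the two windows.
shift <- vapply(
  seq_len(ncol(window_a)),
  function(j) hellinger(window_a[, j], window_b[, j]),
  numeric(1)
)
```

In this toy example the first site is invariant in both windows, so its entropy and its Hellinger distance are both zero, while the second site moves from an even F/L split to all L and therefore shows the largest shift.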
Visualisation and tabulation
- plot_entropy_trajectories: customizable multi-site Shannon entropy trajectories, summarizing evolutionary dynamics across time.
- plot_site_class_trajectory: single-site entropy trajectory with class-change markers for inspecting individual residues of interest.
- tabulate_site_evolution: per-site amino acid count and proportion tables across partitions, optionally rendered as styled HTML.
Simulation
simulate_variant_evolution provides a configurable
multi-variant simulation engine with user-specified emergence
schedules, growth rates, mutation-rate variability, and
deleterious-mutation injection. It generates synthetic time-series
data with known ground truth for benchmarking detection pipelines.
Bundled data
- sarscov2_variants: curated metadata for twelve SARS-CoV-2 Variants of Concern and Variants of Interest, including WHO labels, Pango lineages, GISAID and Nextstrain clades, dates and countries of first detection, defining Spike-protein mutations and SNP sites, and 21 peer-reviewed references with DOIs.
- sarscov2_sample: a random sample of 100 NCBI Spike protein sequences for end-to-end testing without external downloads.
External data
The complete preprocessed NCBI Spike protein dataset (137,132
sequences, ~181.5 MB uncompressed FASTA) underlying the package's
real-data preprocessing vignette is archived on Zenodo:
doi:10.5281/zenodo.19040165. The dataset can be read directly with
readAAStringSet and processed end-to-end
using the preprocessing toolkit; see
vignette("preprocessing_pipeline", "ViralEntropR") for the
full workflow.
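As a rough illustration of that first reading step, the sketch below writes a two-record FASTA file to a temporary path and reads it back with readAAStringSet from the Bioconductor Biostrings package. The sequences, header text, and variable names are made up for the example; nothing here assumes the file name or header layout of the Zenodo archive.

```r
# Write a tiny, made-up FASTA file so the example is self-contained.
fasta_path <- tempfile(fileext = ".fasta")
writeLines(
  c(
    ">seq1 hypothetical header",
    "MFVFLVLLPL",
    ">seq2 hypothetical header",
    "MFVFLVLLPLVSSQ"
  ),
  fasta_path
)

# Biostrings is a Bioconductor package; skip the read if it is not installed.
if (requireNamespace("Biostrings", quietly = TRUE)) {
  spike <- Biostrings::readAAStringSet(fasta_path)
  print(length(spike))               # number of sequences read
  print(names(spike))                # full FASTA header lines
  print(nchar(as.character(spike)))  # sequence lengths
}
```

For the real dataset, fasta_path would instead point to the FASTA file downloaded from the Zenodo record, and the resulting object would be passed on to the preprocessing functions as described in the vignette.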
Vignettes
Three pre-rendered vignettes walk through the full workflow on real
and simulated data: vignette("preprocessing_pipeline",
"ViralEntropR"), vignette("detecting_variants_simulation",
"ViralEntropR"), and vignette("clustering_accuracy",
"ViralEntropR").
Author
Maintainer: Vadim Tyuryaev vadim.tyuryaev@gmail.com (ORCID)
Authors:
Jane Heffernan jmheffer@yorku.ca
Hanna Jankowski hkj@yorku.ca