ViralEntropR: A Computational Pipeline for Entropy-Informed Detection of Emerging Viral Variants

ViralEntropR is a computational pipeline for detecting emerging variants in viral amino acid sequence data. It combines per-site Shannon entropy, Gaussian mixture model site selection, Gower-distance Partitioning Around Medoids (PAM) clustering, Hellinger-distance quantification of distributional shifts, and multivariate non-parametric change-point detection.

Pipeline overview

The package supports a four-stage workflow:

Preprocessing. Parse FASTA headers, filter ambiguous residues, and convert between integer and character representations of amino acid sequences under a 25-symbol alphabet. See extract_fasta_dates, extract_fasta_countries, fasta_to_char_matrix, filter_ambiguous_sequences, encode_aa_sequence, and decode_aa_sequence.
Site selection. Compute per-site Shannon entropy across temporal partitions and cluster sites by entropy via Gaussian mixture models. See calculate_entropy, partition_time_windows, cluster_sites_by_entropy, and relabel_entropy_classes.
Distributional analysis. Quantify residue-composition shifts between time windows using the Hellinger distance. See calculate_hellinger_matrix.
Change-point detection. Identify temporal change points non-parametrically using energy statistics or wild binary segmentation. See detect_changepoints_ecp and detect_changepoints_hdcp.
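
The two quantities at the core of the site-selection and distributional-analysis stages can be sketched in base R. This is a minimal illustration on a made-up toy alignment, not the package's implementation: the helper names site_entropy and site_hellinger are hypothetical, the base-2 logarithm is an assumption, and the actual interfaces of calculate_entropy and calculate_hellinger_matrix may differ.

```r
# Toy alignments for two time windows (hypothetical data):
# rows are sequences, columns are sites.
window_a <- rbind(
  c("A", "K", "N"),
  c("A", "K", "N"),
  c("A", "R", "N"),
  c("A", "R", "N")
)
window_b <- rbind(
  c("A", "R", "Y"),
  c("A", "R", "Y"),
  c("A", "R", "Y"),
  c("A", "R", "Y")
)

# Shannon entropy (in bits; the log base is an assumption) of the residue
# distribution at one site. table() only reports observed residues, so every
# proportion is strictly positive and log2() is always finite.
site_entropy <- function(residues) {
  p <- as.numeric(table(residues)) / length(residues)
  -sum(p * log2(p))
}

# Hellinger distance between the residue distributions of two windows at one
# site, computed over the union of residues seen in either window.
site_hellinger <- function(res_a, res_b) {
  symbols <- union(res_a, res_b)
  p <- as.numeric(table(factor(res_a, levels = symbols))) / length(res_a)
  q <- as.numeric(table(factor(res_b, levels = symbols))) / length(res_b)
  sqrt(sum((sqrt(p) - sqrt(q))^2) / 2)
}

# Per-site entropy within each window.
entropy_a <- apply(window_a, 2, site_entropy)
entropy_b <- apply(window_b, 2, site_entropy)

# Per-site Hellinger distance between the two windows.
hellinger_ab <- vapply(
  seq_len(ncol(window_a)),
  function(j) site_hellinger(window_a[, j], window_b[, j]),
  numeric(1)
)
```

In this toy example only the second site is variable within window A, so it is the only site with non-zero entropy there. The third site is fixed in both windows but for different residues, so its entropy is zero in each window while its Hellinger distance is at the maximum of 1. This is why the two measures are complementary: entropy captures within-window diversity, and the Hellinger distance captures between-window shifts.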

Visualisation and tabulation

plot_entropy_trajectories — customizable multi-site Shannon entropy trajectories, summarizing evolutionary dynamics across time.
plot_site_class_trajectory — single-site entropy trajectory with class-change markers for inspecting individual residues of interest.
tabulate_site_evolution — per-site amino acid count and proportion tables across partitions, optionally rendered as styled HTML.
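
As a rough idea of what a per-site count and proportion table across partitions contains, the following base R sketch tabulates one site. The data frame and its column names are hypothetical, and the output layout of tabulate_site_evolution itself may differ.

```r
# Hypothetical observations at a single site: one row per sequence, with the
# time partition it falls in and the residue it carries at that site.
obs <- data.frame(
  partition = c("2021-01", "2021-01", "2021-01",
                "2021-02", "2021-02", "2021-02", "2021-02"),
  residue   = c("N", "N", "Y", "N", "Y", "Y", "Y"),
  stringsAsFactors = FALSE
)

# Residue counts per partition (rows are partitions, columns are residues).
counts <- table(obs$partition, obs$residue)

# Row-wise proportions: the residue composition within each partition.
props <- prop.table(counts, margin = 1)

print(counts)
print(round(props, 3))
```

Reading the proportions row by row shows the residue composition shifting from mostly N in the first partition to mostly Y in the second.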

Simulation

simulate_variant_evolution provides a configurable multi-variant simulation engine with user-specified emergence schedules, growth rates, mutation-rate variability, and deleterious-mutation injection. It generates synthetic time-series data with known ground truth for benchmarking detection pipelines.
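
To make the idea of simulated data with known ground truth concrete, here is a deliberately simplified base R sketch with a single emerging variant and one defining site. It is not how simulate_variant_evolution works internally: the logistic growth curve, all parameter names, and all values are assumptions chosen for illustration.

```r
set.seed(1)

n_steps     <- 20   # number of time points (hypothetical)
n_per_step  <- 200  # sequences sampled per time point (hypothetical)
emerge_at   <- 8    # time point at which the new variant appears (hypothetical)
growth_rate <- 0.8  # logistic growth rate of the new variant (hypothetical)

# Ground-truth frequency of the emerging variant: zero before emergence,
# logistic growth from a low starting frequency afterwards.
steps <- seq_len(n_steps)
variant_freq <- ifelse(
  steps < emerge_at,
  0,
  plogis(growth_rate * (steps - emerge_at) - 4)
)

# Sample variant membership at each time point, then derive the residue at one
# defining site: the ancestral variant carries "N", the emerging one "Y".
sim <- do.call(rbind, lapply(steps, function(s) {
  is_new <- rbinom(n_per_step, size = 1, prob = variant_freq[s]) == 1
  data.frame(
    step    = s,
    variant = ifelse(is_new, "emerging", "ancestral"),
    residue = ifelse(is_new, "Y", "N"),
    stringsAsFactors = FALSE
  )
}))

# Observed share of the emerging residue per time point, for comparison with
# the ground-truth curve in variant_freq.
observed_freq <- tapply(sim$residue == "Y", sim$step, mean)
```

Because emerge_at and variant_freq are known, a detection method applied to sim can be scored by how close its estimated change point falls to the true emergence time.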

Bundled data

sarscov2_variants — curated metadata for twelve SARS-CoV-2 Variants of Concern and Variants of Interest, including WHO labels, Pango lineages, GISAID and Nextstrain clades, dates and countries of first detection, defining Spike-protein mutations and SNP sites, and 21 peer-reviewed references with DOIs.
sarscov2_sample — a random sample of 100 NCBI Spike protein sequences for end-to-end testing without external downloads.

External data

The complete preprocessed NCBI Spike protein dataset (137,132 sequences, ~181.5 MB uncompressed FASTA) underlying the package's real-data preprocessing vignette is archived on Zenodo: doi:10.5281/zenodo.19040165. The dataset can be read directly with Biostrings::readAAStringSet and processed end-to-end using the preprocessing toolkit; see vignette("preprocessing_pipeline", "ViralEntropR") for the full workflow.

Vignettes

Three pre-rendered vignettes walk through the full workflow on real and simulated data: vignette("preprocessing_pipeline", "ViralEntropR"), vignette("detecting_variants_simulation", "ViralEntropR"), and vignette("clustering_accuracy", "ViralEntropR").

Author

Maintainer: Vadim Tyuryaev vadim.tyuryaev@gmail.com (ORCID)

Authors:

Jane Heffernan jmheffer@yorku.ca
Hanna Jankowski hkj@yorku.ca