
A computational pipeline for detecting emerging variants in viral amino acid sequence data, combining per-site Shannon entropy, Gaussian mixture model site selection, Gower-distance Partitioning Around Medoids clustering, Hellinger-distance quantification of distributional shifts, and multivariate non-parametric change-point detection.
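Two of the quantities named above have short closed forms, so a small base-R sketch may help fix ideas. It is not the package's implementation: the toy alignment `aln` and the helpers `site_entropy`, `site_props`, and `hellinger` are hypothetical names introduced only for illustration. The block computes the Shannon entropy (in bits) at each aligned site, then the Hellinger distance between the residue distributions of one site in two groups of sequences.

```r
# Toy alignment (hypothetical data): rows are sequences, columns are aligned sites.
aln <- rbind(
  c("M", "F", "V", "N"),
  c("M", "F", "V", "K"),
  c("M", "L", "V", "K"),
  c("M", "L", "V", "Y")
)

# Shannon entropy (bits) of the observed residue distribution at one site.
# table() only returns observed residues, so no 0 * log(0) terms arise.
site_entropy <- function(residues) {
  p <- table(residues) / length(residues)
  -sum(p * log2(p))
}

entropy_by_site <- apply(aln, 2, site_entropy)

# Residue proportions over a fixed set of categories, so that two groups
# can be compared position by position.
aa_levels <- sort(unique(as.vector(aln)))
site_props <- function(residues) {
  as.numeric(table(factor(residues, levels = aa_levels))) / length(residues)
}

# Hellinger distance between two discrete distributions on the same categories.
# It ranges from 0 (identical) to 1 (disjoint support).
hellinger <- function(p, q) {
  sqrt(sum((sqrt(p) - sqrt(q))^2) / 2)
}

# Compare site 4 between the first two and the last two sequences.
early <- site_props(aln[1:2, 4])
late  <- site_props(aln[3:4, 4])
shift <- hellinger(early, late)
```

Conserved sites (columns 1 and 3 here) have zero entropy, while a site whose residue distribution moves between groups yields a positive Hellinger distance.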

Pipeline overview

The package supports a four-stage workflow:

Visualisation and tabulation

  • plot_entropy_trajectories — customizable multi-site Shannon entropy trajectories, summarizing evolutionary dynamics across time.

  • plot_site_class_trajectory — single-site entropy trajectory with class-change markers for inspecting individual residues of interest.

  • tabulate_site_evolution — per-site amino acid count and proportion tables across partitions, optionally rendered as styled HTML.
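To make the idea of per-site count and proportion tables concrete, here is a minimal base-R sketch. It does not call tabulate_site_evolution and makes no claim about that function's arguments or output format; the data frame `obs` and its columns are hypothetical, and partitions are assumed here to be labelled groups of sequences such as time windows.

```r
# Hypothetical data: one row per sequence, giving its partition label and the
# residue observed at a single site of interest.
obs <- data.frame(
  partition = c("P1", "P1", "P1", "P2", "P2", "P2", "P2"),
  residue   = c("N", "N", "K", "K", "K", "Y", "K")
)

# Counts of each residue within each partition (rows: partitions).
counts <- table(obs$partition, obs$residue)

# Row-wise proportions: each partition's residue distribution sums to 1.
props <- prop.table(counts, margin = 1)

print(counts)
print(round(props, 2))
```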

Simulation

simulate_variant_evolution provides a configurable multi-variant simulation engine with user-specified emergence schedules, growth rates, mutation-rate variability, and deleterious-mutation injection. It generates synthetic time-series data with known ground truth for benchmarking detection pipelines.
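The notion of an emergence schedule with per-variant growth rates can be illustrated with a deliberately simplified base-R sketch. This is not simulate_variant_evolution and does not reproduce its interface; every name and number below (`variants`, `emergence_week`, `growth_rate`, `n_per_week`) is an assumption chosen for illustration, and mutation-rate variability and deleterious mutations are left out entirely.

```r
set.seed(1)

weeks <- 0:29

# Hypothetical schedule: when each variant appears and how fast it grows.
variants <- data.frame(
  name           = c("A", "B", "C"),
  emergence_week = c(0, 8, 16),
  growth_rate    = c(0.00, 0.25, 0.45)
)

# Relative abundance: zero before emergence, exponential growth afterwards.
abundance <- sapply(seq_len(nrow(variants)), function(i) {
  t_since <- weeks - variants$emergence_week[i]
  ifelse(t_since >= 0, exp(variants$growth_rate[i] * t_since), 0)
})
colnames(abundance) <- variants$name

# Normalise each week to variant frequencies (the known ground truth).
freq <- abundance / rowSums(abundance)

# Draw a fixed number of sampled sequences per week from those frequencies.
n_per_week <- 200
counts <- t(apply(freq, 1, function(p) rmultinom(1, n_per_week, p)))
colnames(counts) <- variants$name
```

Because `freq` is known exactly, any detection method run on `counts` can be scored against the true emergence weeks.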

Bundled data

  • sarscov2_variants — curated metadata for twelve SARS-CoV-2 Variants of Concern and Variants of Interest, including WHO labels, Pango lineages, GISAID and Nextstrain clades, dates and countries of first detection, defining Spike-protein mutations and SNP sites, and 21 peer-reviewed references with DOIs.

  • sarscov2_sample — a random sample of 100 NCBI Spike protein sequences for end-to-end testing without external downloads.

External data

The complete preprocessed NCBI Spike protein dataset (137,132 sequences, ~181.5 MB uncompressed FASTA) underlying the package's real-data preprocessing vignette is archived on Zenodo: doi:10.5281/zenodo.19040165. The dataset can be read directly with readAAStringSet and processed end-to-end using the preprocessing toolkit; see vignette("preprocessing_pipeline", "ViralEntropR") for the full workflow.
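As a rough sketch of the reading step, the block below writes a two-sequence FASTA file to a temporary path and loads it with readAAStringSet from the Biostrings package. The temporary file and its toy sequences are stand-ins, not the Zenodo archive, whose file name is not given here; substitute the path of the downloaded FASTA for `fasta_path`. The call is guarded in case Biostrings is not installed.

```r
# Stand-in FASTA file with two toy protein sequences (hypothetical content).
fasta_path <- tempfile(fileext = ".fasta")
writeLines(
  c(">seq1 toy record", "MFVFLVLLPLVSS",
    ">seq2 toy record", "MFVFLVLLPLVSSQ"),
  fasta_path
)

if (requireNamespace("Biostrings", quietly = TRUE)) {
  # Read all records as an AAStringSet (one element per sequence).
  spike <- Biostrings::readAAStringSet(fasta_path)

  # Convert to a named character vector to inspect sequence lengths.
  seq_lengths <- nchar(as.character(spike))
  print(length(spike))
  print(seq_lengths)
} else {
  message("Biostrings is not installed; skipping the read step.")
}
```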

Vignettes

Three pre-rendered vignettes walk through the full workflow on real and simulated data: vignette("preprocessing_pipeline", "ViralEntropR"), vignette("detecting_variants_simulation", "ViralEntropR"), and vignette("clustering_accuracy", "ViralEntropR").

Author

Maintainer: Vadim Tyuryaev <vadim.tyuryaev@gmail.com>

Authors: