ViralEntropR is an R package for the efficient preprocessing of large FASTA archives and computational surveillance of emerging viral variants. It provides a fully vectorized preprocessing layer, including header parsing, country and date extraction, ambiguity filtering, and amino-acid encoding, together with the core computational components for downstream analysis: per-site Shannon entropy (Shannon 1948), time-window partitioning, Gaussian-mixture-based entropy clustering (Everitt et al. 2011; Scrucca et al. 2016), pairwise Hellinger distances (van der Vaart 1998) between empirical amino-acid distributions, and non-parametric change-point detection (Fryzlewicz 2014; Matteson and James 2014). A controlled variant-simulation engine is included for benchmarking detection performance against known ground truth.
The package was developed alongside a proposed analysis pipeline for emerging-variant detection, demonstrated in the bundled vignettes: GMM-driven selection of high-entropy sites, partitioning around medoids on the Gower distance (Gower 1971; Kaufman and Rousseeuw 1990; Rousseeuw 1987) over the selected sites, t-SNE visualization of cluster geometry (van der Maaten and Hinton 2008), non-parametric change-point detection validation, and hierarchical agglomerative clustering (HAC) (Everitt et al. 2011; Murtagh and Legendre 2014; Sangalli et al. 2010; Tucker et al. 2013) of per-site entropy curves and time-indexed Hellinger-distance trajectories, treated within a functional-data-analysis framework (Ramsay and Silverman 2005) as discrete realizations of continuous functions of time. The pipeline was evaluated on filtered, post-processed SARS-CoV-2 Spike-protein sequences: 137,132 raw NCBI records (Sayers et al. 2022) reduced to 109,536 US-filtered sequences (archived on Zenodo at DOI 10.5281/zenodo.19040165), and 16.7 million raw GISAID records (Shu and McCauley 2017) reduced to 129,371 unique US-filtered sequences. The methods are general enough to apply to any aligned amino-acid or nucleotide time series.
Pipeline
┌─────────────────────────┐
│ FASTA file │
└─────────────────────────┘
│
▼
┌───────────────────────────────────────┐
│ All sequences of equal width? │
└───────────────────────────────────────┘
│ │
yes no
│ │
│ ▼
│ ┌───────────────────────────────┐
│ │ Multiple-sequence │
│ │ alignment, e.g. │
│ │ msa::msa() / │
│ │ DECIPHER::AlignSeqs() │
│ └───────────────────────────────┘
│ │
└────────────────┬────────────────┘
▼
┌───────────────────────────────┐
│ Aligned sequence set │
└───────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────┐
│ Preprocessing layer │
│ ─────────────────────────────────────────────── │
│ extract_fasta_dates() │
│ extract_fasta_countries() │
│ fasta_to_char_matrix() │
│ filter_ambiguous_sequences() │
│ encode_aa_sequence() │
└───────────────────────────────────────────────────┘
│
▼
┌───────────────────────────────────────┐
│ Feature matrix: │
│ m sequences × n sites + │
│ Date + Country │
└───────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────┐
│ partition_time_windows() │
│ │
│ cumulative / sliding / disjoint │
│ per-window entropies + GMM, │
│ via cluster_sites_by_entropy() internally │
└───────────────────────────────────────────────────┘
│
┌────────────────┴────────────────┐
▼ ▼
┌───────────────────────────────┐ ┌───────────────────────────────┐
│ Entropy / GMM-class │ │ calculate_hellinger_matrix() │
│ trajectories │ │ │
│ │ │ │
│ relabel_entropy_classes() │ │ pairwise per-window │
│ plot_entropy_trajectories() │ │ residue-distribution │
│ plot_site_class_trajectory() │ │ distances from the │
│ tabulate_site_evolution() │ │ reference window │
└───────────────────────────────┘ └───────────────────────────────┘
│ │
└────────────────┬────────────────┘
▼
┌───────────────────────────────────────────────────┐
│ Non-parametric change-point detection │
│ applied to entropy AND Hellinger series │
│ ─────────────────────────────────────────────── │
│ detect_changepoints_ecp() │
│ detect_changepoints_hdcp() │
└───────────────────────────────────────────────────┘
│
┌────────────────┴────────────────┐
▼ ▼
┌───────────────────────────────┐ ┌───────────────────────────────┐
│ Gower + Silhouette + PAM │ │ HAC of entropy curves and │
│ + t-SNE visualization │ │ Hellinger trajectories │
│ │ │ │
│ cluster::daisy(), │ │ fdacluster::fdahclust() │
│ cluster::pam(), │ │ │
│ Rtsne::Rtsne() │ │ │
│ │ │ │
│ orchestrated in vignettes; │ │ orchestrated in vignettes; │
│ not exported by the package │ │ not exported by the package │
└───────────────────────────────┘ └───────────────────────────────┘
│ │
└────────────────┬────────────────┘
▼
┌───────────────────────────────────────────────────────────────────────┐
│ Pipeline outputs │
│ ─────────────────────────────────────────────────────────────────── │
│ • Post-processed feature matrix │
│ • Partitioned time windows with GMM-ranked sites per partition │
│ • Hellinger distance matrices │
│ • Clustered sites (candidate variants) │
│ • Change points (variant emergence times) │
│ • Frequency tables + entropy class changes │
│ (sites of high variability / under selection) │
└───────────────────────────────────────────────────────────────────────┘Installation
You can install the released version from CRAN:
install.packages("ViralEntropR")The development version is on GitHub:
# install.packages("remotes")
remotes::install_github("vadimtyuryaev/ViralEntropR")To build the vignettes locally, install with:
remotes::install_github("vadimtyuryaev/ViralEntropR", build_vignettes = TRUE)Quick start
The package ships with a 100-sequence sample of NCBI SARS-CoV-2 Spike sequences in inst/extdata/ for runnable examples:
# This example requires Biostrings (Bioconductor):
# install.packages("BiocManager"); BiocManager::install("Biostrings")
library(ViralEntropR)
# 1. Load the bundled sample (100 NCBI Spike sequences).
sample_path <- system.file("extdata", "sarscov2_sample.fasta.gz",
package = "ViralEntropR")
fasta <- Biostrings::readAAStringSet(sample_path)
# 2. Extract dates and countries from FASTA headers.
# NCBI Virus format: option = 4 (date at end), position = 2 (between pipes).
dates_result <- extract_fasta_dates(fasta, option = 4)
countries_result <- extract_fasta_countries(fasta, position = 2)
# diagnostic: were any headers unparseable?
dates_result$message
countries_result$message
# 3. Drop sequences whose date or country could not be parsed.
# `missing_id` is NA when nothing failed and an integer vector of indices
# otherwise — strip NAs from the union before use.
ids <- c(dates_result$missing_id, countries_result$missing_id)
drop_ids <- unique(ids[!is.na(ids)])
keep <- setdiff(seq_len(length(fasta)), drop_ids)
fasta <- fasta[keep]
corrected_dates <- dates_result$corrected_dates[keep]
countries <- countries_result$countries[keep]
# 4. Convert to a character matrix, drop sequences with ambiguous residues
# (B, J, X, Z), and align metadata to the surviving rows.
char_mat <- fasta_to_char_matrix(fasta)
filtered <- filter_ambiguous_sequences(char_mat, option = 2)
keep_idx <- setdiff(seq_len(nrow(char_mat)), filtered$DeletedSeqId)
corrected_dates <- corrected_dates[keep_idx]
countries <- countries[keep_idx]
# 5. Integer-encode under the 25-symbol ViralEntropR alphabet.
int_mat <- encode_aa_sequence(filtered$FilteredMatrix)
# 6. Assemble the analysis-ready data frame: sites 1..1273 + Date + Country.
AL_df <- as.data.frame(int_mat)
colnames(AL_df) <- as.character(seq_len(ncol(int_mat)))
AL_df$Date <- as.Date(format(corrected_dates, "%Y-%m-01"))
AL_df$Country <- countries
AL_df <- AL_df[order(AL_df$Date), ]
# 7. Per-site Shannon entropy + GMM-based site classification.
# Sites of zero entropy and singletons are excluded.
ent <- apply(int_mat, 2, calculate_entropy)
cls <- cluster_sites_by_entropy(ent, nr = nrow(int_mat))
cls_labeled <- relabel_entropy_classes(cls$DataFrame)
head(cls_labeled, 10)For the full pipeline — including time-window partitioning, Hellinger distances, change-point detection, and visualization — see the bundled vignettes.
Vignettes
Three pre-rendered vignettes walk through the full workflow:
browseVignettes("ViralEntropR")| Vignette | Topic |
|---|---|
preprocessing_pipeline |
Complete preprocessing of raw NCBI SARS-CoV-2 Spike-protein FASTA: two-pass date extraction (yyyy-mm-dd then yyyy-mm), country extraction from pipe-delimited headers, ambiguity filtering (B/J/X/Z), 25-symbol integer encoding, and assembly of an analysis-ready data frame keyed on Date and Country. Applied to the full 137,132-sequence NCBI Spike archive on Zenodo. |
detecting_variants_simulation |
End-to-end variant detection demonstration on a controlled synthetic benchmark: four variants emerging over 24 months under pairwise competition with stochastic growth; per-partition entropies and Hellinger distances; three complementary change-point methods — ks.cp3o (dynamic programming, exact globally optimal), detect_changepoints_ecp() (expanding-window for online surveillance), and e.agglo (agglomerative hierarchical, K-free) — compared against the known emergence schedule. |
clustering_accuracy |
Empirical evaluation of Entropy-GMM-Gower-Silhouette-PAM clustering against a labelled ground truth: wild-type period (May–June 2020) versus Delta-dominant period (July–August 2021); GMM-driven site selection, PAM on the Gower distance with silhouette-based k (optimal number of clusters) selection over k = 2 … 40, 2D t-SNE embedding, precision / recall / F1 at selected k levels, and medoid analysis cross-referenced against the curated mutation catalog of SARS-CoV-2 mutations in sarscov2_variants. |
Analysis
Seven standalone studies live in analysis/ (excluded from the installed package via .Rbuildignore; available by cloning the GitHub repository). The package’s workflow runs in stages: it measures how variable each position of the Spike protein is over time (per-position Shannon entropy, an information measure of diversity); selects the most variable positions using a Gaussian mixture model (a statistical method that groups the positions by their entropy and isolates the high-entropy ones); tracks how each position’s amino-acid composition shifts over time (the pairwise Hellinger distance, a measure of how far apart two probability distributions sit); and finally groups positions by the shape of those time curves (functional clustering) or flags abrupt shifts in them (change-point detection), all referenced against the bundled catalog of WHO variants and their first-detection dates. Each study below either prepares the input data, checks that one stage behaves as intended, measures how much data a stage needs, or carries out a core analysis. Three time-windowing strategies recur throughout: a one-month window that expands from a fixed start, an overlapping two-month window that advances one month at a time, and a non-overlapping two-month window.
Study (in analysis/) |
What it does | Why it is needed and where it fits |
|---|---|---|
Changepoint_Detection_Study/ |
Searches for abrupt shifts (“change points”) in the time series of per-position entropy, for the processed NCBI and GISAID sequence collections, and checks whether those shifts coincide with the dates each WHO variant was first reported in the United States (taken from the bundled sarscov2_variants table). It evaluates the package’s own binary segmentation and wild binary segmentation routines (detect_changepoints_hdcp(), ViralEntropR) against two established methods from the ecp R package — an energy agglomerative method (e.agglo) and a dynamic-programming method with a tunable change-point budget (ks.cp3o) — under all three time-windowing strategies. Key Findings: the ranking of methods replicates almost perfectly across the two independent datasets (Spearman rank correlation 0.961), so it reflects method properties rather than dataset quirks. The energy agglomerative method is best on short windows (median F1 ≈ 0.80); on mid-range windows it is comparable to the dynamic-programming method, which in turn takes over on long windows as the agglomerative method over-segments, while binary and wild binary segmentation fail to detect short and mid ranges windows and achieve results comparable to the agglomerative method on the long range windows. Reducing the data to the model-selected positions preserves detection accuracy to within about 5% F1. |
Validates the change-point branch of the workflow: confirms that shifts in entropy track real variant-emergence events, and identifies which method and time-slicing strategy locate them most accurately. |
Correlation_Analysis/ |
Tests whether reducing the data to only its most variable positions distorts the overall similarity between sequences. It compares two distance tables built on the Spike sequences — one using all 1,273 positions, the other using only the positions kept by the Gaussian mixture model — and measures how closely they agree as more entropy groups are added back (using the Mantel test, a permutation-based correlation between two distance tables, with 9,999 permutations, reporting a Pearson correlation at each step). Distances use the Gower dissimilarity, which suits categorical sequence data. Run separately on NCBI US (8,155 distinct sequences) and GISAID US (a manageable random subset of 10,000 distinct sequences). Key Findings: the structural topology of the complete 1,273-position distance matrix appears to be adequately captured by the two highest-entropy Gaussian mixture models clusters. This restricted subset largely recapitulates the full matrix structure, yielding Mantel Pearson correlations of 0.98 and 0.92 on the NCBI US and GISAID US datasets, respectively. | Validates the position-reduction step: shows that keeping only the high-entropy positions preserves the between-sequence relationships the later clustering relies on, so the reduction does not discard meaningful structure. |
Entropy_Partitioning_Study/ |
Explores how the entropy at each variant-defining position rises and falls over time, under the three time-windowing strategies, for five variants — Alpha (B.1.1.7), Beta (B.1.351), Gamma (P.1), Delta (B.1.617.2), Omicron (B.1.1.529) — plus the D614G Spike mutation (a single amino-acid change on the early B.1 background, used as a reference point, not a designated variant). For each it produces: entropy-over-time curves; plots tracking which entropy group the model assigns each position to over time; tables of which amino acids appear at each position in each window; and summary tables of those group assignments, including the number of sequences and the number of model components fitted in each window. | Characterizes the raw entropy-over-time signal and motivates the functional-data analysis: the curves it surfaces are exactly what the FDA study clusters, and it first exercises the three time-windowing strategies reused downstream. |
FDA_Analysis/ |
Groups positions by the shape of their time curves, treating each curve as a continuous function of time. It clusters two kinds of curve: the entropy-over-time curve at each position, and the curve of how far that position’s amino-acid composition has drifted from a reference time point (pairwise Hellinger distance, measured both from the start of the window and from the predecessor variant). Run for all twelve WHO variants — five Variants of Concern: Alpha (B.1.1.7), Beta (B.1.351), Gamma (P.1), Delta (B.1.617.2), Omicron (B.1.1.529); seven Variants of Interest: Epsilon (B.1.427/B.1.429), Zeta (P.2), Eta (B.1.525), Iota (B.1.526), Theta (P.3), Kappa (B.1.617.1), Lambda (C.37) — across two sequence collections (NCBI US, GISAID US) and the three time-windowing strategies. It reports cluster quality (silhouette width), how reproducible the clusters are under resampling (bootstrap Jaccard stability), how much the clusters agree across windowing strategies (the Adjusted Rand Index), and whether a variant’s defining substitutions concentrate in one cluster (a Fisher exact test), and tracks frame-by-frame when the defining positions first cluster together relative to the official detection date (rendered as animations). Outputs: static figures, animated trajectory videos (.gif), and summary tables. Key Findings: clusters are well-separated (average silhouette ≈ 0.50 for entropy, ≈ 0.63–0.65 for Hellinger) but none reaches the strict resampling-stability threshold (mean Jaccard 0.57–0.61). The clustering is robust to the choice of windowed resolution (the overlapping (sliding) and non-overlapping(disjoint) two-month windows agree closely, median Adjusted Rand Index 1.0) but diverges for the cumulative window. Strict co-clustering of all of a variant’s defining substitutions occurs in only about 21% of evaluable cases, and when it does it coincides with the official detection date (median +1 month) rather than preceding it. |
The central analysis: tests whether the most variable Spike positions organise into coherent groups before, at, and after a variant’s emergence. |
GISAID_data_preprocessing/ |
Turns the raw GISAID Spike-protein archive into the clean, analysis-ready table the other studies consume, in six steps: load the archive; extract collection date and country; keep only United States sequences within the study date range; remove sequences containing ambiguous amino-acid codes; align the short (under-1,273-amino-acid) sequences to full length with the Clustal Omega aligner and merge them with the full-length sequences; convert amino acids to a numeric code; and assemble the final dated table. | The data-preparation entry point: produces the dated, aligned, numerically encoded sequence table that every downstream study reads. |
GMM_transformation_study/ |
Checks whether the position-selection step depends on how the entropy values are mathematically rescaled beforehand. It runs the Gaussian mixture model on per-position entropy under six different order-preserving rescalings and compares the selected positions against the untransformed case; the selected positions are then checked against published lists of each variant’s defining substitutions — Alpha (B.1.1.7), Beta (B.1.351), Gamma (P.1), Delta (B.1.617.2) for NCBI, with Omicron (B.1.1.529) added for GISAID. Key Findings: the selection is essentially unchanged by the rescaling, yielding identical sets of targeted residues. | Confirms a key methodological choice: site selection remains insensitive to data transformation, preventing analytical bias. After removing invariant (zero-entropy) residues and singleton variants prior to fitting, the Gaussian mixture model is applied directly to the highly right-skewed entropy distribution of the remaining polymorphic sites. This accurately represents the full distance matrix, thereby obviating the need for a zero-inflated model. |
Sample_Size_Simulation_Study/ |
Estimates how many sequences of a newly emerging variant are needed before the position-selection step can reliably detect it against a realistic background of already-circulating variants. It runs a large simulation — four scenarios spanning one to three co-circulating background variants and two levels of mutational load on the emerging variant (Omicron-like versus not) — over a grid of 8,400 combinations, each repeated 30 times, for 252,000 simulated runs in total. Results are summarised with time-to-detection (survival) analysis (Kaplan–Meier), bias-corrected bootstrap confidence intervals, and per-combination detection rates. Key Findings: with a reference background of 1,000–10,000 sequences the proposed pipeline reliably detects an emerging variant from roughly 40–50 of its sequences (100% detection). At low volumes (10–100 sequences) detection collapses as more variants co-circulate — from 97.5% with one background variant to 17.3% with three plus an Omicron-like emerging variant — and a heavier mutational load does not rescue it. | A sample size analysis for the detection phase, equipping genomic surveillance practitioners with the minimum sequencing depth necessary to ensure reliable variant detection. |
Data
-
sarscov2_variants— bundled metadata for twelve WHO-designated SARS-CoV-2 variants of concern and variants of interest, plus 21 peer-reviewed literature references with DOIs. Loaded automatically with the package; see?sarscov2_variants. -
inst/extdata/sarscov2_sample.fasta.gz— 100 randomly sampled NCBI Spike-protein sequences for runnable examples (see the Quick start above). -
data-raw/NCBI_US_unaligned_feature_matrix_1273aa.rds— the post-processed NCBI feature matrix (US-filtered Spike protein, 109,536 sequences × 1,273 sites withDateandCountrycolumns) produced by thepreprocessing_pipelinevignette. Tracked in the repository for direct downstream use. - Source NCBI Spike-protein archive — the input FASTA underlying the preprocessing pipeline (137,132 sequences, ~181.5 MB uncompressed) is archived on Zenodo: https://doi.org/10.5281/zenodo.19040165.
-
GISAID data are not redistributed. The raw GISAID Spike-protein archive (
spikeprot0410.fasta, ~22.6 GB) and the derived post-processed feature matrix (GISAID_US_aligned_feature_matrix_1273aa.rds, 129,371 US-filtered sequences) are excluded from the repository and from Zenodo in accordance with the GISAID Database Access Agreement. Users may reproduce these objects independently by registering at GISAID, downloading the equivalent release, and re-runninganalysis/GISAID_data_preprocessing/.
Installing on R server and computational notes
The vignettes and analysis scripts depend on kableExtra for HTML table rendering. On a fresh Linux R server install.packages("kableExtra") may fail because its dependencies systemfonts and svglite need system-level cairo, freetype, and fontconfig development headers. If installation fails, install those headers first (apt install libcairo2-dev libfontconfig1-dev libfreetype-dev libxt-dev on Debian / Ubuntu, or the equivalent -devel packages on RHEL / CentOS) and retry; if the server administrator has already installed kableExtra system-wide, no action is needed. Biostrings is similarly required and must be installed from Bioconductor (BiocManager::install("Biostrings")).
Three scripts currently in analysis/, as well as scripts in progress that will appear there, have substantial memory or runtime footprints and should be run on a server rather than a laptop. Estimated peak RAM varies between approximately 15 GB and 112 GB depending on the script and the dataset to which it is applied.
The three pre-rendered vignettes and the sarscov2_sample.fasta.gz quick-start are designed to run on a laptop without any of the above caveats.
Citation
If you use ViralEntropR in your research, please cite both the software and the underlying methods:
citation("ViralEntropR")BibTeX entry:
@Manual{,
title = {ViralEntropR: A Computational Pipeline for Entropy-Informed Detection of Emerging Viral Variants},
author = {Vadim Tyuryaev and Jane Heffernan and Hanna Jankowski},
year = {2026},
note = {R package version 0.6.2},
url = {https://CRAN.R-project.org/package=ViralEntropR},
doi = {10.32614/CRAN.package.ViralEntropR},
}Acknowledgments
We gratefully acknowledge the authors from the originating laboratories responsible for obtaining the specimens, and the submitting laboratories that generated and shared the genetic sequence data via GISAID and NCBI Virus, on which this research is based.
License
MIT © Vadim Tyuryaev, Jane Heffernan, Hanna Jankowski. See LICENSE.md for the full text.