Entropy-based site selection is only useful if the sites it
identifies carry genuine biological signal. This vignette provides a
rigorous empirical test: we take real SARS-CoV-2 Spike protein sequences
from two well-separated surveillance windows — May–June
2020, when the ancestral wild-type was circulating in the
United States, and July–August 2021, when the Delta
variant (B.1.617.2) was dominant — and ask whether the sites
selected by cluster_sites_by_entropy() are sufficient to
recover the correct temporal structure when the two groups are clustered
together.
Several terms are used throughout this vignette and are defined here for clarity.
During the COVID-19 pandemic, the World Health Organization (WHO) established a variant classification framework to prioritise public health responses based on the epidemiological and clinical significance of emerging SARS-CoV-2 lineages. Three principal designations were used in practice.
A is a lineage for which there is evidence of one or more of the following: increased transmissibility or a detrimental change in COVID-19 epidemiology; increased virulence or change in clinical disease presentation; or decreased effectiveness of public health measures, diagnostics, vaccines, or therapeutics. Five lineages were designated as VOCs during the pandemic: Alpha (B.1.1.7), Beta (B.1.351), Gamma (P.1), Delta (B.1.617.2), and Omicron (B.1.1.529).
A is a lineage with genetic changes that are predicted or known to affect virus characteristics such as transmissibility, disease severity, immune escape, or diagnostic and therapeutic performance, and which has been identified to cause significant community transmission or multiple COVID-19 clusters in multiple countries. VOIs were monitored closely but had not yet accumulated sufficient evidence for VOC designation. Examples, at various points during the pandemic, included Lambda (C.37), Mu (B.1.621), Kappa (B.1.617.1), Iota (B.1.526), and Epsilon (B.1.427/B.1.429).
A is a lineage carrying genetic changes suspected to affect virus characteristics but for which evidence of phenotypic or epidemiological impact is limited, warranting enhanced surveillance without immediate escalation to VOI or VOC status. Variant classifications were dynamic and subject to revision as new evidence became available.
A single nucleotide polymorphism (SNP) is formally defined as a single nucleotide substitution in the genome that is present at an appreciable frequency in a population, often operationally taken as at least 1%. In the context of SARS-CoV-2 genomic surveillance, however, the term “SNP” has frequently been used more loosely to refer to lineage-associated mutations, including those described at the amino acid level. In this work, we retain the term “SNP” for consistency with the surveillance literature, but use it interchangeably with the more precise notion of a defining mutation at a site, referring to amino acid substitutions in the Spike protein that characterize specific viral lineages.
A Gaussian mixture model (GMM) is a probabilistic model that represents a given distribution as a weighted sum of Gaussian components; here it is applied to the vector of per-site Shannon entropies to identify sites with high variability — those sites form the discriminating feature set passed to the clustering step.
A partitioning around medoids (PAM) algorithm is a deterministic clustering method that partitions observations into a fixed number of clusters by selecting representative data points, called medoids, and assigning each observation to the nearest medoid according to a specified dissimilarity measure. In this work, PAM is applied to the reduced sequence representation derived from high-entropy sites, using the Gower distance to quantify pairwise dissimilarity between sequences. Unlike centroid-based methods such as k-means, PAM is robust to noise and accommodates mixed or non-Euclidean data, making it well suited to categorical sequence data.
The pipeline mirrors the approach used in the main detection study,
but applied retrospectively to a known ground truth. Because we know
which sequences belong to which period, we can evaluate clustering
accuracy formally using precision,
recall and F1 score, and
cross-reference the discriminating sites against the curated Delta SNP
catalogue in sarscov2_variants. The dominance of the Delta
variant in the United States during July–August 2021 is well established
from national surveillance data (CDC2022?).
The analysis proceeds in nine steps:
Positive class definition: Period 1 (wild-type, May–June 2020) sequences correctly assigned to the cluster with the highest Period 1 purity. Precision, recall and F1 are computed for this positive class.
All user-facing parameters for steps 1—8 are gathered here. Adjust
COUNTRY, the two time windows, and K_HIGHLIGHT
before re-rendering; no other chunk needs to change.
# ── User-modifiable parameters ───────────────────────────────────────────────
# Data
COUNTRY <- "USA" # filter to this country (NULL = all countries)
UNIQUE_SEQS <- TRUE # TRUE = deduplicate sequences before PAM
# FALSE = keep all sequences (including duplicates)
N_SITES <- 1273L # number of Spike protein sites
# Time windows
PERIOD1_START <- "2020-05-01" # Period 1: wild-type phase
PERIOD1_END <- "2020-07-01"
PERIOD2_START <- "2021-07-01" # Period 2: Delta-dominant phase
PERIOD2_END <- "2021-09-01"
# PAM silhouette sweep
K_MAX <- 40L # maximum number of PAM clusters to evaluate
K_HIGHLIGHT <- c(2L, 3L, 4L, 9L) # PAM k values to plot in detail
# t-SNE parameters
TSNE_PERPLEXITY <- 10L
TSNE_SEED <- 42L
# Parallelism (future.apply)
N_WORKERS <- max(1L, parallel::detectCores() - 1L)The preprocessing steps applied here follow the same pipeline described in detail in the NCBI SARS-CoV-2 Spike Protein Sequence Preprocessing: From Raw FASTA to an Analysis-Ready Integer-Encoded Matrix vignette. Readers are referred to that vignette for a full discussion of each step, including date extraction, ambiguous residue filtering, integer encoding, and data frame assembly.
The raw NCBI FASTA file is loaded and processed in a single pipeline.
Dates are extracted from sequence headers using a two-pass regex
strategy: a full yyyy-mm-dd pass first, followed by a
yyyy-mm pass to recover sequences with month-level
precision only. Sequences with unresolvable dates or missing country
annotations are removed. The remaining sequences are converted to a
character matrix, filtered for ambiguous amino acid codes (B, J, X, Z),
integer-encoded using the 25-character ViralEntropR
alphabet, and assembled into a data frame with integer site columns
1:N_SITES plus Date and Country
columns. Rows are sorted by date and the index is reset.
# Replace 'fasta_path' with the file path to your Zenodo download
fasta_path <- file.path("data-raw", "sequences.fasta") # replace(!)
fasta <- Biostrings::readAAStringSet(fasta_path)
# --- Dates ------------------------------------------------------------------
# Pass 1: identify sequences with year-only dates
dates_ymd <- extract_fasta_dates(
fasta,
custom_pattern = "(?<=\\|)[0-9]{4}-(0?[1-9]|1[0-2])-(0?[1-9]|[12][0-9]|3[01]|00)"
)
# Pass 2: extract yyyy-mm, remove year-only
dates_result <- extract_fasta_dates(
fasta,
custom_pattern = "(?<=\\|)[0-9]{4}-(0[1-9]|1[0-2])",
date_format = "%Y-%m"
)
if (!is.na(dates_result$missing_id[1]) && length(dates_result$missing_id) > 0) {
fasta <- fasta[-dates_result$missing_id]
dates_result <- extract_fasta_dates(
fasta,
custom_pattern = "(?<=\\|)[0-9]{4}-(0[1-9]|1[0-2])",
date_format = "%Y-%m"
)
}
# --- Countries --------------------------------------------------------------
countries_result <- extract_fasta_countries(fasta, position = 2)
if (!is.na(countries_result$missing_id[1]) && length(countries_result$missing_id) > 0) {
fasta <- fasta[-countries_result$missing_id]
countries_result <- extract_fasta_countries(fasta, position = 2)
dates_result <- extract_fasta_dates(
fasta,
custom_pattern = "(?<=\\|)[0-9]{4}-(0[1-9]|1[0-2])",
date_format = "%Y-%m"
)
}
# --- Convert, filter, encode ------------------------------------------------
char_mat <- fasta_to_char_matrix(fasta)
filtered <- filter_ambiguous_sequences(char_mat, option = 2)
clean_char_mat <- filtered$FilteredMatrix
deleted_rows <- filtered$DeletedSeqId
corrected_dates <- dates_result$corrected_dates[-deleted_rows]
countries <- countries_result$countries[-deleted_rows]
int_mat <- encode_aa_sequence(clean_char_mat)
# --- Assemble data frame ----------------------------------------------------
n_sites <- ncol(int_mat)
AL_df <- as.data.frame(int_mat)
colnames(AL_df) <- as.character(seq_len(n_sites))
AL_df[] <- lapply(AL_df, as.integer)
AL_df$Date <- as.Date(format(corrected_dates, "%Y-%m-01"))
AL_df$Country <- countries
AL_df <- AL_df[order(AL_df$Date), ]
rownames(AL_df) <- NULL
sprintf("Final data frame: %d sequences x %d sites", nrow(AL_df), n_sites)
#> [1] "Final data frame: 125130 sequences x 1273 sites"Restricting to a single country (US) reduces confounding
from inter-country sampling biases and surveillance heterogeneity. All
downstream analyses use AL_df after this filter.
We extract two non-overlapping 2-month windows: Period 1 covers the reference wild-type phase and Period 2 covers the Delta-dominant phase.
part1 <- partition_time_windows(
data = AL_df,
n_sites = N_SITES,
window_length = 2L,
window_type = 2L, # single 2-month window (start_date to end_date)
start_date = PERIOD1_START,
end_date = PERIOD1_END,
verbose = FALSE
)
part2 <- partition_time_windows(
data = AL_df,
n_sites = N_SITES,
window_length = 2L,
window_type = 2L,
start_date = PERIOD2_START,
end_date = PERIOD2_END,
verbose = FALSE
)
df_p1 <- part1$Partitions[[1]]
df_p2 <- part2$Partitions[[1]]
sprintf("Period 1 (%s to %s): %d sequences",
PERIOD1_START, PERIOD1_END, nrow(df_p1))
#> [1] "Period 1 (2020-05-01 to 2020-07-01): 5293 sequences"
sprintf("Period 2 (%s to %s): %d sequences",
PERIOD2_START, PERIOD2_END, nrow(df_p2))
#> [1] "Period 2 (2021-07-01 to 2021-09-01): 1651 sequences"# Helper: extract highest-entropy sites from a cluster DataFrame
top_sites <- function(clust_df) {
sort(clust_df[clust_df$class == max(clust_df$class), ]$sites)
}
# Period 1 top sites
sites_p1 <- top_sites(part1$Clusters[[1]]$DataFrame)
cat("Period 1 highest-entropy sites:\n")
#> Period 1 highest-entropy sites:
print(sites_p1)
#> [1] 5 6 52 54 245 253 520 614 681 1228
# Period 2 top sites
sites_p2 <- top_sites(part2$Clusters[[1]]$DataFrame)
cat("\nPeriod 2 highest-entropy sites:\n")
#>
#> Period 2 highest-entropy sites:
print(sites_p2)
#> [1] 5 11 18 19 20 25 26 29 32 70 75 76 77 78 95 98 112 120 126 138 141 142 143
#> [24] 144 145 146 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167
#> [47] 168 169 170 171 172 173 174 175 177 178 179 180 181 182 183 184 186 187 188 189 190 191 192
#> [70] 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215
#> [93] 222 246 252 253 417 427 440 452 477 478 484 501 558 570 640 653 655 675 677 679 681 688 701
#> [116] 716 732 735 790 845 859 936 939 950 957 982 1027 1118 1153 1162 1176 1228 1233 1237 1260 1264Biological note — site 681 (furin cleavage site). Site 681 is of exceptional epidemiological significance and its detection in the Period 1 (wild-type) entropy profile is noteworthy. Position 681 sits immediately upstream of the RRAR polybasic furin cleavage site at the S1/S2 boundary, a structural feature that is unique to SARS-CoV-2 among closely related sarbecoviruses and was absent in SARS-CoV-1. The furin cleavage site enables the Spike protein to be pre-cleaved by furin-family proteases within the secretory pathway before virion release, substantially increasing the efficiency of cell entry and facilitating airway tropism. This feature is widely regarded as a key determinant of SARS-CoV-2’s high transmissibility in humans (Peacock et al. 2021).
Substitutions at position 681 subsequently became defining mutations of two of the most consequential VOCs: P681H (proline to histidine) in Alpha (B.1.1.7), and P681R (proline to arginine) in Delta (B.1.617.2). Both substitutions enhance furin cleavage efficiency, P681R more so than P681H, by altering the charge and conformation of the cleavage loop, resulting in faster S1/S2 processing, more efficient membrane fusion, and increased competitive fitness. The entropy signal at site 681 in the wild-type period therefore reflects early, low-frequency variation at a position that was already under strong positive selection pressure (Johnson, Xie, and Menachery 2021).
We combine both periods and run GMM entropy clustering on the union to identify sites that discriminate the two time points. This achieves dimensionality reduction from 1273 sites to a smaller informative subset.
# Combine periods -- rows are sequences, columns are sites
df_combined <- rbind(df_p1, df_p2)
rownames(df_combined) <- seq_len(nrow(df_combined))
# Period 1 size -- used throughout for labelling
n_p1 <- nrow(df_p1)
n_p2 <- nrow(df_p2)
n_total <- nrow(df_combined)
cat(sprintf("Combined: %d sequences (%d Period 1 + %d Period 2)\n",
n_total, n_p1, n_p2))
#> Combined: 6944 sequences (5293 Period 1 + 1651 Period 2)
# Entropy on site columns only
entrp_combined <- apply(df_combined[, seq_len(N_SITES), drop = FALSE],
2, calculate_entropy)
clust_combined <- cluster_sites_by_entropy(entrp_combined, nr = n_total)
clust_combined_df <- clust_combined$DataFrame
selected_sites <- sort(
clust_combined_df[clust_combined_df$class == max(clust_combined_df$class), ]$sites
)
cat(sprintf("\nGMM selected %d sites from combined data:\n",
length(selected_sites)))
#>
#> GMM selected 34 sites from combined data:
print(selected_sites)
#> [1] 5 18 19 20 26 54 95 138 141 142 144 152 153 155 157 158 180 183 190 253 417 452 477
#> [24] 478 484 501 520 614 655 681 701 950 1027 1176# ── Helper: pull SNP catalogue for a named variant ──────────────────────────
get_variant_snps <- function(label) {
idx <- which(unlist(sarscov2_variants$WHO_Label) == label)
sites <- sarscov2_variants$Defining_SNP_Sites[[idx]]
snps <- sarscov2_variants$Defining_SNPs[[idx]]
list(label = label, sites = sites, snps = snps)
}
# Retrieve catalogues for the four VOCs of interest
var_delta <- get_variant_snps("Delta")
var_alpha <- get_variant_snps("Alpha")
var_beta <- get_variant_snps("Beta")
var_gamma <- get_variant_snps("Gamma")
# Convenience aliases (used throughout downstream chunks)
delta_sites <- var_delta$sites; delta_snps <- var_delta$snps
alpha_sites <- var_alpha$sites; alpha_snps <- var_alpha$snps
beta_sites <- var_beta$sites; beta_snps <- var_beta$snps
gamma_sites <- var_gamma$sites; gamma_snps <- var_gamma$snps
# ── Print SNP tables ─────────────────────────────────────────────────────────
for (v in list(var_delta, var_alpha, var_beta, var_gamma)) {
cat(sprintf("\n%s defining SNP sites:\n", v$label))
print(data.frame(SNP = v$snps, Site = v$sites))
}
#>
#> Delta defining SNP sites:
#> SNP Site
#> 1 T19R 19
#> 2 L452R 452
#> 3 T478K 478
#> 4 P681R 681
#> 5 D950N 950
#>
#> Alpha defining SNP sites:
#> SNP Site
#> 1 N501Y 501
#> 2 A570D 570
#> 3 P681H 681
#> 4 T716I 716
#> 5 S982A 982
#> 6 D1118H 1118
#>
#> Beta defining SNP sites:
#> SNP Site
#> 1 D80A 80
#> 2 D215G 215
#> 3 K417N 417
#> 4 N501Y 501
#> 5 E484K 484
#> 6 A701V 701
#>
#> Gamma defining SNP sites:
#> SNP Site
#> 1 L18F 18
#> 2 T20N 20
#> 3 P26S 26
#> 4 D138Y 138
#> 5 R190S 190
#> 6 K417T 417
#> 7 E484K 484
#> 8 N501Y 501
#> 9 H655Y 655
#> 10 T1027I 1027
# ── Overlap with GMM-selected sites ─────────────────────────────────────────
overlap_list <- lapply(list(var_delta, var_alpha, var_beta, var_gamma),
function(v) intersect(selected_sites, v$sites))
names(overlap_list) <- c("Delta", "Alpha", "Beta", "Gamma")
# Keep Delta overlap as scalar for backward compatibility
overlap <- overlap_list[["Delta"]]
# voc_list defined here so it is available to all downstream chunks
# (delta-flag, pam-and-plots, mode-profile, contrast-gmm)
voc_list <- list(Delta = var_delta, Alpha = var_alpha,
Beta = var_beta, Gamma = var_gamma)
cat("\nGMM-selected sites overlapping with VOC defining SNP sites:\n")
#>
#> GMM-selected sites overlapping with VOC defining SNP sites:
for (nm in names(overlap_list)) {
ov <- overlap_list[[nm]]
cat(sprintf(" %-6s: %s\n",
nm,
if (length(ov) > 0) paste(ov, collapse = ", ") else "none"))
}
#> Delta : 19, 452, 478, 681, 950
#> Alpha : 501, 681
#> Beta : 417, 484, 501, 701
#> Gamma : 18, 20, 26, 138, 190, 417, 484, 501, 655, 1027
# ── Note on shared mutations across VOCs ─────────────────────────────────────
# Identify sites present in the SNP catalogues of more than one VOC
all_voc_sites <- lapply(voc_list, function(v) v$sites)
site_freq <- table(unlist(all_voc_sites))
shared_sites <- as.integer(names(site_freq[site_freq > 1L]))
cat("\nSites shared by more than one VOC catalogue:\n")
#>
#> Sites shared by more than one VOC catalogue:
for (s in sort(shared_sites)) {
vocs_with_site <- names(voc_list)[
vapply(voc_list, function(v) s %in% v$sites, logical(1L))
]
# Collect the actual SNP names at this site for each VOC
snp_labels <- vapply(vocs_with_site, function(vn) {
v <- voc_list[[vn]]
idx <- which(v$sites == s)
paste(v$snps[idx], collapse = "/")
}, character(1L))
cat(sprintf(" Site %-5d: %s\n", s,
paste(sprintf("%s (%s)", vocs_with_site, snp_labels),
collapse = ", ")))
}
#> Site 417 : Beta (K417N), Gamma (K417T)
#> Site 484 : Beta (E484K), Gamma (E484K)
#> Site 501 : Alpha (N501Y), Beta (N501Y), Gamma (N501Y)
#> Site 681 : Delta (P681R), Alpha (P681H)Note on the combined dataset. The union of the two time windows spans the wild-type phase through the peak of Delta dominance, and the GMM-selected sites capture mutation signatures from all major VOCs circulating during this period. Several sites carry mutations that recur independently across multiple variants, a hallmark of convergent evolution under strong positive selection. For example, site 501 (N501Y) is shared by Alpha, Beta and Gamma — all three independently acquired the same asparagine-to-tyrosine substitution because it enhances ACE2 receptor binding affinity. Similarly, site 484 (E484K in Beta and Gamma, E484Q in Delta/Kappa) is a key immune evasion site; the different amino acid outcomes (K vs Q) at the same position across variants illustrate that while the site is under selection, the specific substitution depends on the genetic background. Site 614 (D614G) is present across virtually all VOCs and was the first globally dominant mutation, increasing Spike conformational stability and transmissibility (Korber, Fischer, and Gnanakaran 2020).
Partitioning Around Medoids (PAM) is a robust clustering algorithm that minimises the sum of pairwise dissimilarities between each sequence and the medoid of its assigned cluster (Kaufman and Rousseeuw 1990). Unlike K-Means, PAM medoids are actual observed sequences rather than computed centroids, making them directly interpretable as representative viral haplotypes.
# Subset to selected sites and convert to factors (required for Gower distance)
AL_Cat_sub <- df_combined[, as.character(selected_sites), drop = FALSE]
AL_Cat_sub[] <- lapply(AL_Cat_sub, as.integer)
if (UNIQUE_SEQS) {
# Deduplicate -- record original row indices for labelling
AL_Cat_fac <- AL_Cat_sub %>% mutate_if(is.integer, as.factor)
unique_idx <- which(!duplicated(AL_Cat_fac))
AL_Cat_fac <- AL_Cat_fac[unique_idx, , drop = FALSE]
orig_period <- ifelse(unique_idx <= n_p1, 1L, 2L)
cat(sprintf("Unique sequences: %d (from %d total)\n",
nrow(AL_Cat_fac), n_total))
} else {
AL_Cat_fac <- AL_Cat_sub %>% mutate_if(is.integer, as.factor)
orig_period <- ifelse(seq_len(n_total) <= n_p1, 1L, 2L)
cat(sprintf("All sequences retained: %d\n", nrow(AL_Cat_fac)))
}
#> Unique sequences: 127 (from 6944 total)
n_p1_eff <- sum(orig_period == 1L)
n_p2_eff <- sum(orig_period == 2L)
cat(sprintf("Period 1 sequences in PAM input: %d\n", n_p1_eff))
#> Period 1 sequences in PAM input: 36
cat(sprintf("Period 2 sequences in PAM input: %d\n", n_p2_eff))
#> Period 2 sequences in PAM input: 91Gower distance between two sequences \(\mathbf{x}\) and \(\mathbf{y}\) is defined as \(d_G(\mathbf{x}, \mathbf{y}) = \frac{1}{p} \sum_{j=1}^{p} \delta_j \, s_j\), where \(p\) is the number of sites, \(\delta_j \in \{0, 1\}\) indicates whether site \(j\) is comparable for the given pair, and \(s_j\) is a site-specific dissimilarity score that equals \(0\) when \(x_j = y_j\) and \(1\) otherwise for categorical variables such as amino acid codes (Gower 1971).
The silhouette width for sequence \(i\) is defined as \(s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}\), where \(a(i)\) is the mean Gower distance from \(i\) to all other sequences in its own cluster and \(b(i)\) is the mean Gower distance from \(i\) to all sequences in the nearest neighbouring cluster; \(s(i) \in [-1, 1]\), with values close to \(1\) indicating a well-separated assignment and values close to \(0\) or negative indicating overlap or potential misclassification (Rousseeuw 1987).
We sweep k from 2 to K_MAX in parallel to find the
optimal number of clusters.
future::plan(future::multisession, workers = N_WORKERS)
sil_width <- c(NA, future_sapply(2:K_MAX, function(k) {
pam(gower_dist, diss = TRUE, k = k)$silinfo$avg.width
}))
future::plan(future::sequential)
# Silhouette plot -- grid lines, x-ticks every 5
# Drop k=1 (undefined, NA) before plotting to avoid missing-value warnings.
# na.rm = TRUE on geom calls provides an additional safety net.
sil_df <- data.frame(k = seq_len(K_MAX), width = sil_width)
sil_df <- sil_df[!is.na(sil_df$width), ] # removes k=1 row
ggplot(sil_df, aes(x = k, y = width)) +
geom_line(colour = "black", na.rm = TRUE) +
geom_point(colour = "darkorange", shape = 1, size = 2, na.rm = TRUE) +
geom_vline(xintercept = K_HIGHLIGHT, linetype = "dashed",
colour = "steelblue", alpha = 0.6) +
scale_x_continuous(limits = c(2L, K_MAX),
breaks = seq(5, K_MAX, by = 5),
minor_breaks = seq(2, K_MAX, by = 1)) +
scale_y_continuous(breaks = scales::pretty_breaks(n = 6)) +
labs(title = "Silhouette Analysis: Optimal Number of Clusters",
subtitle = sprintf("Blue dashed lines mark k = %s (highlighted solutions)",
paste(K_HIGHLIGHT, collapse = ", ")),
x = "Number of clusters",
y = "Average Silhouette Width") +
theme_bw() +
theme(panel.grid.major = element_line(colour = "grey85"),
panel.grid.minor = element_line(colour = "grey93"),
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5, size = 9),
axis.title.x = element_text(colour = "steelblue"),
axis.title.y = element_text(colour = "darkorange"))t-distributed Stochastic Neighbor Embedding (t-SNE) is a non-linear dimensionality reduction method that maps high-dimensional pairwise distances to a 2D layout by preserving local neighborhood structure (Maaten and Hinton 2008). Distances in the t-SNE plot are not interpretable in absolute terms — only the relative proximity of points within and between clusters carries meaning — and the same Gower distance matrix is used as input for both PAM and t-SNE to ensure consistency between the clustering and its visualization.
To reiterate, t-SNE is run once on the Gower distance matrix and the resulting 2D coordinates are reused for all k values. This ensures that spatial positions are identical across plots so that differences in cluster boundaries are directly comparable.
Each plot uses three visual layers:
The following helpers are used throughout Steps 4–8 to count and name
VOC defining SNP matches in individual sequences.
count_snp_matches() returns an integer count;
get_matched_snps() returns the names of matched SNPs. These
functions operate only on the selected_sites subset, so
overlap counts refer to defining SNPs that were also selected by the
entropy GMM.
# ── Helper: count how many of a variant's SNPs are present in a sequence ────
# seq_int: named integer vector (site name -> encoded amino acid value)
# voc: list with $sites and $snps (from get_variant_snps())
# Returns: integer count of matching SNP alleles
count_snp_matches <- function(seq_int, voc) {
# Only consider SNP sites that are present in the selected_sites subset
overlap_sites <- intersect(as.character(voc$sites), names(seq_int))
if (length(overlap_sites) == 0L) return(0L)
expected_codes <- vapply(as.integer(overlap_sites), function(s) {
snp_idx <- which(voc$sites == s)
aa_char <- substring(voc$snps[snp_idx], nchar(voc$snps[snp_idx]))
as.integer(encode_aa_sequence(matrix(aa_char))[1L, 1L])
}, integer(1L))
names(expected_codes) <- overlap_sites
sum(vapply(overlap_sites, function(s) {
isTRUE(seq_int[[s]] == expected_codes[[s]])
}, logical(1L)))
}
# ── Apply to all Period 1 sequences ─────────────────────────────────────────
# p1_rows_int: row indices into AL_Cat_int that correspond to Period 1
p1_rows_int <- which(orig_period == 1L)
# Build integer version of the deduplicated feature matrix for display/mode calc
AL_Cat_int <- AL_Cat_fac %>%
mutate_if(is.factor, function(x) as.integer(as.character(x)))
# Convert to named integer matrix for fast row-wise access
AL_Cat_int_mat <- as.matrix(AL_Cat_int)
colnames(AL_Cat_int_mat) <- colnames(AL_Cat_int)
# voc_list <- list(Delta = var_delta, Alpha = var_alpha,
# Beta = var_beta, Gamma = var_gamma)
# match_counts: matrix [n_p1_int x 4] -- SNP match count per Period 1 sequence
match_counts <- do.call(cbind, lapply(voc_list, function(voc) {
vapply(p1_rows_int, function(r) {
count_snp_matches(AL_Cat_int_mat[r, ], voc)
}, integer(1L))
}))
rownames(match_counts) <- p1_rows_int
# ── Helper: return names of matched SNPs (not just count) ───────────────────
get_matched_snps <- function(seq_int, voc) {
overlap_sites <- intersect(as.character(voc$sites), names(seq_int))
if (length(overlap_sites) == 0L) return(character(0L))
vapply(overlap_sites, function(s) {
snp_idx <- which(voc$sites == as.integer(s))
aa_char <- substring(voc$snps[snp_idx], nchar(voc$snps[snp_idx]))
exp_code <- as.integer(encode_aa_sequence(matrix(aa_char))[1L, 1L])
if (isTRUE(seq_int[[s]] == exp_code)) voc$snps[snp_idx] else NA_character_
}, character(1L)) |> (\(x) x[!is.na(x)])()
}
# ── Summary table with SNP detail ───────────────────────────────────────────
cat("Period 1 sequences by number of VOC SNP matches\n")
#> Period 1 sequences by number of VOC SNP matches
cat("(only SNPs at GMM-selected sites are considered)\n\n")
#> (only SNPs at GMM-selected sites are considered)
for (voc_nm in names(voc_list)) {
voc <- voc_list[[voc_nm]]
n_overlap <- length(intersect(as.character(voc$sites),
as.character(selected_sites)))
counts <- match_counts[, voc_nm]
cat(sprintf("%s (%d defining SNPs, %d overlap with selected sites):\n",
voc_nm, length(voc$sites), n_overlap))
if (n_overlap == 0L) {
cat(" No overlap with GMM-selected sites -- skip.\n\n")
next
}
for (m in seq(n_overlap, 0L)) {
seq_idx <- which(counts == m)
if (length(seq_idx) == 0L && m < n_overlap) next
cat(sprintf(" %d/%d SNP match(es): %d sequence(s)",
m, n_overlap, length(seq_idx)))
if (m > 0L && length(seq_idx) > 0L) {
# Collect matched SNP names for each sequence at this tier
snp_sets <- lapply(p1_rows_int[seq_idx], function(r) {
get_matched_snps(AL_Cat_int_mat[r, ], voc)
})
# Unique SNP patterns
snp_patterns <- sort(unique(vapply(snp_sets,
paste, character(1L), collapse = "+")))
cat(sprintf(" [SNP pattern(s): %s]", paste(snp_patterns, collapse = " | ")))
}
cat("\n")
}
cat("\n")
}
#> Delta (5 defining SNPs, 5 overlap with selected sites):
#> 5/5 SNP match(es): 0 sequence(s)
#> 1/5 SNP match(es): 1 sequence(s) [SNP pattern(s): D950N]
#> 0/5 SNP match(es): 35 sequence(s)
#>
#> Alpha (6 defining SNPs, 2 overlap with selected sites):
#> 2/2 SNP match(es): 0 sequence(s)
#> 1/2 SNP match(es): 1 sequence(s) [SNP pattern(s): P681H]
#> 0/2 SNP match(es): 35 sequence(s)
#>
#> Beta (6 defining SNPs, 4 overlap with selected sites):
#> 4/4 SNP match(es): 0 sequence(s)
#> 0/4 SNP match(es): 36 sequence(s)
#>
#> Gamma (10 defining SNPs, 10 overlap with selected sites):
#> 10/10 SNP match(es): 0 sequence(s)
#> 1/10 SNP match(es): 3 sequence(s) [SNP pattern(s): D138Y | H655Y | L18F]
#> 0/10 SNP match(es): 33 sequence(s)A reusable helper that renders the t-SNE embedding coloured by PAM cluster assignment. Three visual layers are overlaid: cluster-coloured points, black crosses for Period 1 sequences, and red diamonds for cluster medoids.
# Reusable plot function: takes PAM clustering vector and medoid row indices,
# produces a t-SNE plot with three visual layers:
# 1. Coloured shapes by cluster
# 2. Black cross overlay for Period 1 (wild-type) sequences
# 3. Filled red diamond for each cluster medoid
plot_tsne <- function(pam_clust, k, medoid_rows,
tsne_df = tsne_base,
period_vec = orig_period,
xlim_fixed = NULL,
ylim_fixed = NULL) {
df <- tsne_df %>%
mutate(cluster = factor(pam_clust),
is_p1 = (period_vec == 1L),
is_medoid = row_number() %in% medoid_rows)
p <- ggplot(df, aes(x = X, y = Y, colour = cluster, shape = cluster)) +
geom_point(size = 2, alpha = 0.7) +
scale_shape_manual(values = seq(0, k)) +
# Black cross overlay for all Period 1 sequences
geom_point(data = subset(df, is_p1),
aes(x = X, y = Y),
colour = "black", shape = 4, size = 2,
inherit.aes = FALSE) +
# Semi-transparent red diamond: alpha reveals the medoid is a real data point.
geom_point(data = subset(df, is_medoid),
aes(x = X, y = Y),
colour = "red", shape = 18, size = 5,
alpha = 0.45,
inherit.aes = FALSE) +
labs(title = sprintf("PAM k = %d | t-SNE visualisation", k),
subtitle = "Crosses = Period 1 (wild-type); Red diamonds = cluster medoids",
x = "t-SNE dimension 1",
y = "t-SNE dimension 2") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5, size = 9))
# Apply fixed axis ranges if supplied -- ensures identical plot area across
# all k values regardless of legend size differences
if (!is.null(xlim_fixed) && !is.null(ylim_fixed))
p <- p + coord_cartesian(xlim = xlim_fixed, ylim = ylim_fixed)
print(p)
invisible(p)
}# Helper: render kableExtra table in results='asis' chunks inside loops
kt <- function(tbl) cat(as.character(tbl), "\n")
# Fit PAM for all highlighted k values
pam_fits <- lapply(K_HIGHLIGHT, function(k) {
pam(gower_dist, diss = TRUE, k = k)
})
names(pam_fits) <- paste0("k", K_HIGHLIGHT)
# Precompute fixed axis ranges from the common t-SNE coordinates so that all
# plots have an identical data area regardless of legend size differences.
x_pad <- diff(range(tsne_base$X)) * 0.05
y_pad <- diff(range(tsne_base$Y)) * 0.05
xlim_fix <- range(tsne_base$X) + c(-x_pad, x_pad)
ylim_fix <- range(tsne_base$Y) + c(-y_pad, y_pad)
for (k in K_HIGHLIGHT) {
fit <- pam_fits[[paste0("k", k)]]
medoid_rows <- fit$id.med
# ── Subsection heading ──────────────────────────────────────────────────
cat(sprintf("\n## PAM k = %d\n\n### t-SNE Plot\n\n", k))
# ── Plot ─────────────────────────────────────────────────────────────────
plot_tsne(fit$clustering, k, medoid_rows,
xlim_fixed = xlim_fix, ylim_fixed = ylim_fix)
# ── Medoid sequences kable ───────────────────────────────────────────────
cat(sprintf("\n### Medoid Sequences (k = %d)\n\n", k))
medoid_aa <- decode_aa_sequence(
as.matrix(AL_Cat_int[medoid_rows, , drop = FALSE])
)
medoid_df <- as.data.frame(medoid_aa)
colnames(medoid_df) <- as.character(selected_sites)
rownames(medoid_df) <- paste0("cl_", seq_len(k))
# Red cell_spec wherever cl_i (i > 1) differs from cl_1 at that site.
# Direct [row, col_name] access -- avoids any as.character(df[row,]) pitfalls.
if (k > 1L) {
for (ri in 2L:k) {
for (nm in colnames(medoid_df)) {
v1 <- medoid_df[1L, nm]; vi <- medoid_df[ri, nm]
if (vi != v1)
medoid_df[ri, nm] <- cell_spec(vi, format = "html",
color = "red", bold = TRUE)
}
}
}
kt(
kbl(medoid_df, format = "html", escape = FALSE,
caption = sprintf("Medoid sequences (k = %d) | red = differs from cl_1", k)) |>
kable_styling(bootstrap_options = c("condensed", "hover"),
font_size = 11, full_width = FALSE) |>
row_spec(1L, background = "#d6eaf8")
)
# ── VOC SNP content kable ────────────────────────────────────────────────
cat(sprintf("\n### VOC SNP Content of Medoids (k = %d)\n\n", k))
voc_tbl <- do.call(rbind, lapply(seq_len(k), function(cl_i) {
row_i <- medoid_rows[cl_i]
vapply(names(voc_list), function(voc_nm) {
voc <- voc_list[[voc_nm]]
n_ov <- length(intersect(as.character(voc$sites),
as.character(selected_sites)))
n_m <- count_snp_matches(AL_Cat_int_mat[row_i, ], voc)
sprintf("%d / %d", n_m, n_ov)
}, character(1L))
}))
voc_tbl <- as.data.frame(voc_tbl)
rownames(voc_tbl) <- paste0("cl_", seq_len(k))
kt(
kbl(voc_tbl, format = "html",
caption = sprintf("VOC SNP matches per medoid (matched / overlap sites), k = %d", k)) |>
kable_styling(bootstrap_options = c("condensed", "hover"),
font_size = 11, full_width = FALSE)
)
cat("\n")
}
### Medoid Sequences (k = 2)
| 5 | 18 | 19 | 20 | 26 | 54 | 95 | 138 | 141 | 142 | 144 | 152 | 153 | 155 | 157 | 158 | 180 | 183 | 190 | 253 | 417 | 452 | 477 | 478 | 484 | 501 | 520 | 614 | 655 | 681 | 701 | 950 | 1027 | 1176 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| cl_1 | L | L | T | T | P | L | T | D | L | G | Y | W | M | S | F | R | E | Q | R | D | K | L | S | T | E | N | A | G | H | P | A | D | T | V |
| cl_2 | L | F | T | N | S | L | T | Y | L | G | Y | W | M | S | F | R | E | Q | S | D | T | L | S | T | K | Y | A | G | Y | P | A | D | I | F |
| Delta | Alpha | Beta | Gamma | |
|---|---|---|---|---|
| cl_1 | 0 / 5 | 0 / 2 | 0 / 4 | 0 / 10 |
| cl_2 | 0 / 5 | 1 / 2 | 2 / 4 | 10 / 10 |
### Medoid Sequences (k = 3)
| 5 | 18 | 19 | 20 | 26 | 54 | 95 | 138 | 141 | 142 | 144 | 152 | 153 | 155 | 157 | 158 | 180 | 183 | 190 | 253 | 417 | 452 | 477 | 478 | 484 | 501 | 520 | 614 | 655 | 681 | 701 | 950 | 1027 | 1176 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| cl_1 | L | L | T | T | P | L | T | D | L | G | Y | W | M | S | F | R | E | Q | R | D | K | L | S | T | E | N | A | G | H | P | A | D | T | V |
| cl_2 | L | F | T | N | S | L | T | Y | L | G | Y | W | M | S | F | R | E | Q | S | D | T | L | S | T | K | Y | A | G | Y | P | A | D | I | F |
| cl_3 | L | L | R | T | P | L | T | D | L | G | Y | W | M | S | F | G | E | Q | R | D | K | R | S | K | E | N | A | G | H | R | A | N | T | V |
| Delta | Alpha | Beta | Gamma | |
|---|---|---|---|---|
| cl_1 | 0 / 5 | 0 / 2 | 0 / 4 | 0 / 10 |
| cl_2 | 0 / 5 | 1 / 2 | 2 / 4 | 10 / 10 |
| cl_3 | 5 / 5 | 0 / 2 | 0 / 4 | 0 / 10 |
### Medoid Sequences (k = 4)
| 5 | 18 | 19 | 20 | 26 | 54 | 95 | 138 | 141 | 142 | 144 | 152 | 153 | 155 | 157 | 158 | 180 | 183 | 190 | 253 | 417 | 452 | 477 | 478 | 484 | 501 | 520 | 614 | 655 | 681 | 701 | 950 | 1027 | 1176 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| cl_1 | L | L | T | T | P | L | T | D | L | G | Y | W | M | S | F | R | E | Q | R | D | K | L | S | T | E | N | A | G | H | P | A | D | T | V |
| cl_2 | L | F | T | N | S | L | T | Y | L | G | Y | W | M | S | F | R | E | Q | S | D | T | L | S | T | K | Y | A | G | Y | P | A | D | I | F |
| cl_3 | L | L | T | T | P | L | T | D | Y | Y | K | S | E | R | Y | S | Q | F | V | D | K | R | S | T | E | N | A | G | H | P | A | D | T | V |
| cl_4 | L | L | R | T | P | L | T | D | L | G | Y | W | M | S | F | G | E | Q | R | D | K | R | S | K | E | N | A | G | H | R | A | N | T | V |
| Delta | Alpha | Beta | Gamma | |
|---|---|---|---|---|
| cl_1 | 0 / 5 | 0 / 2 | 0 / 4 | 0 / 10 |
| cl_2 | 0 / 5 | 1 / 2 | 2 / 4 | 10 / 10 |
| cl_3 | 1 / 5 | 0 / 2 | 0 / 4 | 0 / 10 |
| cl_4 | 5 / 5 | 0 / 2 | 0 / 4 | 0 / 10 |
### Medoid Sequences (k = 9)
| 5 | 18 | 19 | 20 | 26 | 54 | 95 | 138 | 141 | 142 | 144 | 152 | 153 | 155 | 157 | 158 | 180 | 183 | 190 | 253 | 417 | 452 | 477 | 478 | 484 | 501 | 520 | 614 | 655 | 681 | 701 | 950 | 1027 | 1176 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| cl_1 | L | L | T | T | P | L | T | D | L | G | Y | W | M | S | F | R | E | Q | R | D | K | L | S | T | E | N | A | G | H | P | A | D | T | V |
| cl_2 | L | F | T | N | S | L | T | Y | L | G | Y | W | M | S | F | R | E | Q | S | D | T | L | S | T | K | Y | A | G | Y | P | A | D | I | F |
| cl_3 | F | L | T | T | P | L | I | D | L | G | Y | W | M | S | F | R | E | Q | R | G | K | L | N | T | E | N | A | G | H | P | V | D | T | V |
| cl_4 | L | F | T | N | S | L | T | Y | L | G | Y | W | M | S | F | R | E | Q | R | D | K | R | S | K | E | N | A | G | H | R | A | D | I | V |
| cl_5 | L | L | T | T | P | L | T | D | Y | Y | K | S | E | R | Y | S | Q | F | V | D | K | R | S | T | E | N | A | G | H | P | A | D | T | V |
| cl_6 | L | L | R | T | P | L | T | D | L | G | Y | W | M | S | F | G | E | Q | R | D | K | R | S | K | E | N | A | G | H | R | A | N | T | V |
| cl_7 | L | L | T | T | S | L | T | D | L | G | Y | W | T | S | S | R | E | Q | R | D | K | L | N | T | K | N | A | G | Y | H | A | D | T | V |
| cl_8 | L | L | R | T | P | L | T | C | P | F | D | K | S | M | S | G | E | Q | R | D | K | R | S | K | E | N | A | G | H | R | A | N | T | V |
| cl_9 | L | L | T | T | P | L | T | D | L | G | Y | W | M | S | F | R | E | Q | R | D | K | L | S | T | E | Y | A | G | H | H | A | D | T | V |
| Delta | Alpha | Beta | Gamma | |
|---|---|---|---|---|
| cl_1 | 0 / 5 | 0 / 2 | 0 / 4 | 0 / 10 |
| cl_2 | 0 / 5 | 1 / 2 | 2 / 4 | 10 / 10 |
| cl_3 | 0 / 5 | 0 / 2 | 1 / 4 | 0 / 10 |
| cl_4 | 3 / 5 | 0 / 2 | 0 / 4 | 5 / 10 |
| cl_5 | 1 / 5 | 0 / 2 | 0 / 4 | 0 / 10 |
| cl_6 | 5 / 5 | 0 / 2 | 0 / 4 | 0 / 10 |
| cl_7 | 0 / 5 | 1 / 2 | 1 / 4 | 3 / 10 |
| cl_8 | 5 / 5 | 0 / 2 | 0 / 4 | 0 / 10 |
| cl_9 | 0 / 5 | 2 / 2 | 1 / 4 | 1 / 10 |
For each k value, we compute the most frequent (modal) amino acid at each GMM-selected site within each cluster. This reveals the characteristic sequence signature of each temporal group and allows direct comparison with the medoid. The mode matrices computed here are also used by the contrast analysis in Step 7.
get_mode <- function(x) { ux <- unique(x); ux[which.max(tabulate(match(x, ux)))] }
# Store mode matrices for Step 7 contrast analysis.
mode_results <- list()
for (k in K_HIGHLIGHT) {
fit <- pam_fits[[paste0("k", k)]]
clust_vec <- fit$clustering
medoid_rows <- fit$id.med
k_levels <- sort(unique(clust_vec))
mode_mat <- do.call(rbind, lapply(k_levels, function(cl) {
rows <- which(clust_vec == cl)
vapply(colnames(AL_Cat_int), function(s) get_mode(AL_Cat_int[rows, s]),
integer(1L))
}))
rownames(mode_mat) <- paste0("cl_", k_levels)
colnames(mode_mat) <- colnames(AL_Cat_int)
mode_aa <- decode_aa_sequence(as.matrix(mode_mat))
rownames(mode_aa) <- paste0("cl_", k_levels)
colnames(mode_aa) <- as.character(selected_sites)
medoid_int <- as.matrix(AL_Cat_int[medoid_rows, , drop = FALSE])
medoid_aa <- decode_aa_sequence(medoid_int)
rownames(medoid_aa) <- paste0("med_", k_levels)
colnames(medoid_aa) <- as.character(selected_sites)
mode_results[[paste0("k", k)]] <- list(
mode_int = mode_mat, k_levels = k_levels, clust_vec = clust_vec)
same_flag <- all(mode_mat == medoid_int)
obs_note <- if (same_flag)
sprintf("**Mode = Medoid for all clusters (k = %d): tight cohesion confirmed.**", k)
else {
diff_s <- which(apply(mode_mat != medoid_int, 2, any))
sprintf("**Sites where mode differs from medoid (k = %d): %s**",
k, paste(selected_sites[diff_s], collapse = ", "))
}
cat(sprintf("\n### k = %d\n\n%s\n\n", k, obs_note))
# Build combined df with cell_spec red highlighting where mode != medoid
combined_df <- as.data.frame(rbind(mode_aa, medoid_aa), stringsAsFactors = FALSE)
n_cl <- length(k_levels)
for (col_i in seq_along(selected_sites)) {
col_nm <- as.character(selected_sites[col_i])
for (ci in seq_along(k_levels)) {
mv <- mode_aa[ci, col_i]; dv <- medoid_aa[ci, col_i]
if (mv != dv) {
combined_df[ci, col_nm] <- cell_spec(mv, color="red", bold=TRUE)
combined_df[n_cl + ci, col_nm] <- cell_spec(dv, color="red", bold=TRUE)
}
}
}
kt(
kbl(combined_df, format="html", escape=FALSE,
caption=sprintf("Mode (cl_* = blue) / Medoid (med_* = yellow) | red = differs [k=%d]", k)) |>
kable_styling(bootstrap_options=c("condensed","hover"), font_size=11, full_width=FALSE) |>
row_spec(seq_len(n_cl), background="#d6eaf8") |>
row_spec(seq(n_cl+1L, n_cl*2L), background="#fef9e7")
)
cat("\n")
}Mode = Medoid for all clusters (k = 2): tight cohesion confirmed.
| 5 | 18 | 19 | 20 | 26 | 54 | 95 | 138 | 141 | 142 | 144 | 152 | 153 | 155 | 157 | 158 | 180 | 183 | 190 | 253 | 417 | 452 | 477 | 478 | 484 | 501 | 520 | 614 | 655 | 681 | 701 | 950 | 1027 | 1176 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| cl_1 | L | L | T | T | P | L | T | D | L | G | Y | W | M | S | F | R | E | Q | R | D | K | L | S | T | E | N | A | G | H | P | A | D | T | V |
| cl_2 | L | F | T | N | S | L | T | Y | L | G | Y | W | M | S | F | R | E | Q | S | D | T | L | S | T | K | Y | A | G | Y | P | A | D | I | F |
| med_1 | L | L | T | T | P | L | T | D | L | G | Y | W | M | S | F | R | E | Q | R | D | K | L | S | T | E | N | A | G | H | P | A | D | T | V |
| med_2 | L | F | T | N | S | L | T | Y | L | G | Y | W | M | S | F | R | E | Q | S | D | T | L | S | T | K | Y | A | G | Y | P | A | D | I | F |
Mode = Medoid for all clusters (k = 3): tight cohesion confirmed.
| 5 | 18 | 19 | 20 | 26 | 54 | 95 | 138 | 141 | 142 | 144 | 152 | 153 | 155 | 157 | 158 | 180 | 183 | 190 | 253 | 417 | 452 | 477 | 478 | 484 | 501 | 520 | 614 | 655 | 681 | 701 | 950 | 1027 | 1176 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| cl_1 | L | L | T | T | P | L | T | D | L | G | Y | W | M | S | F | R | E | Q | R | D | K | L | S | T | E | N | A | G | H | P | A | D | T | V |
| cl_2 | L | F | T | N | S | L | T | Y | L | G | Y | W | M | S | F | R | E | Q | S | D | T | L | S | T | K | Y | A | G | Y | P | A | D | I | F |
| cl_3 | L | L | R | T | P | L | T | D | L | G | Y | W | M | S | F | G | E | Q | R | D | K | R | S | K | E | N | A | G | H | R | A | N | T | V |
| med_1 | L | L | T | T | P | L | T | D | L | G | Y | W | M | S | F | R | E | Q | R | D | K | L | S | T | E | N | A | G | H | P | A | D | T | V |
| med_2 | L | F | T | N | S | L | T | Y | L | G | Y | W | M | S | F | R | E | Q | S | D | T | L | S | T | K | Y | A | G | Y | P | A | D | I | F |
| med_3 | L | L | R | T | P | L | T | D | L | G | Y | W | M | S | F | G | E | Q | R | D | K | R | S | K | E | N | A | G | H | R | A | N | T | V |
Mode = Medoid for all clusters (k = 4): tight cohesion confirmed.
| 5 | 18 | 19 | 20 | 26 | 54 | 95 | 138 | 141 | 142 | 144 | 152 | 153 | 155 | 157 | 158 | 180 | 183 | 190 | 253 | 417 | 452 | 477 | 478 | 484 | 501 | 520 | 614 | 655 | 681 | 701 | 950 | 1027 | 1176 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| cl_1 | L | L | T | T | P | L | T | D | L | G | Y | W | M | S | F | R | E | Q | R | D | K | L | S | T | E | N | A | G | H | P | A | D | T | V |
| cl_2 | L | F | T | N | S | L | T | Y | L | G | Y | W | M | S | F | R | E | Q | S | D | T | L | S | T | K | Y | A | G | Y | P | A | D | I | F |
| cl_3 | L | L | T | T | P | L | T | D | Y | Y | K | S | E | R | Y | S | Q | F | V | D | K | R | S | T | E | N | A | G | H | P | A | D | T | V |
| cl_4 | L | L | R | T | P | L | T | D | L | G | Y | W | M | S | F | G | E | Q | R | D | K | R | S | K | E | N | A | G | H | R | A | N | T | V |
| med_1 | L | L | T | T | P | L | T | D | L | G | Y | W | M | S | F | R | E | Q | R | D | K | L | S | T | E | N | A | G | H | P | A | D | T | V |
| med_2 | L | F | T | N | S | L | T | Y | L | G | Y | W | M | S | F | R | E | Q | S | D | T | L | S | T | K | Y | A | G | Y | P | A | D | I | F |
| med_3 | L | L | T | T | P | L | T | D | Y | Y | K | S | E | R | Y | S | Q | F | V | D | K | R | S | T | E | N | A | G | H | P | A | D | T | V |
| med_4 | L | L | R | T | P | L | T | D | L | G | Y | W | M | S | F | G | E | Q | R | D | K | R | S | K | E | N | A | G | H | R | A | N | T | V |
Sites where mode differs from medoid (k = 9): 26, 144, 190
| 5 | 18 | 19 | 20 | 26 | 54 | 95 | 138 | 141 | 142 | 144 | 152 | 153 | 155 | 157 | 158 | 180 | 183 | 190 | 253 | 417 | 452 | 477 | 478 | 484 | 501 | 520 | 614 | 655 | 681 | 701 | 950 | 1027 | 1176 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| cl_1 | L | L | T | T | P | L | T | D | L | G | Y | W | M | S | F | R | E | Q | R | D | K | L | S | T | E | N | A | G | H | P | A | D | T | V |
| cl_2 | L | F | T | N | S | L | T | Y | L | G | Y | W | M | S | F | R | E | Q | S | D | T | L | S | T | K | Y | A | G | Y | P | A | D | I | F |
| cl_3 | F | L | T | T | P | L | I | D | L | G | Y | W | M | S | F | R | E | Q | R | G | K | L | N | T | E | N | A | G | H | P | V | D | T | V |
| cl_4 | L | F | T | N | S | L | T | Y | L | G | Y | W | M | S | F | R | E | Q | S | D | K | R | S | K | E | N | A | G | H | R | A | D | I | V |
| cl_5 | L | L | T | T | P | L | T | D | Y | Y | K | S | E | R | Y | S | Q | F | V | D | K | R | S | T | E | N | A | G | H | P | A | D | T | V |
| cl_6 | L | L | R | T | P | L | T | D | L | G | Y | W | M | S | F | G | E | Q | R | D | K | R | S | K | E | N | A | G | H | R | A | N | T | V |
| cl_7 | L | L | T | T | P | L | T | D | L | G | Y | W | T | S | S | R | E | Q | R | D | K | L | N | T | K | N | A | G | Y | H | A | D | T | V |
| cl_8 | L | L | R | T | P | L | T | C | P | F | G | K | S | M | S | G | E | Q | R | D | K | R | S | K | E | N | A | G | H | R | A | N | T | V |
| cl_9 | L | L | T | T | P | L | T | D | L | G | Y | W | M | S | F | R | E | Q | R | D | K | L | S | T | E | Y | A | G | H | H | A | D | T | V |
| med_1 | L | L | T | T | P | L | T | D | L | G | Y | W | M | S | F | R | E | Q | R | D | K | L | S | T | E | N | A | G | H | P | A | D | T | V |
| med_2 | L | F | T | N | S | L | T | Y | L | G | Y | W | M | S | F | R | E | Q | S | D | T | L | S | T | K | Y | A | G | Y | P | A | D | I | F |
| med_3 | F | L | T | T | P | L | I | D | L | G | Y | W | M | S | F | R | E | Q | R | G | K | L | N | T | E | N | A | G | H | P | V | D | T | V |
| med_4 | L | F | T | N | S | L | T | Y | L | G | Y | W | M | S | F | R | E | Q | R | D | K | R | S | K | E | N | A | G | H | R | A | D | I | V |
| med_5 | L | L | T | T | P | L | T | D | Y | Y | K | S | E | R | Y | S | Q | F | V | D | K | R | S | T | E | N | A | G | H | P | A | D | T | V |
| med_6 | L | L | R | T | P | L | T | D | L | G | Y | W | M | S | F | G | E | Q | R | D | K | R | S | K | E | N | A | G | H | R | A | N | T | V |
| med_7 | L | L | T | T | S | L | T | D | L | G | Y | W | T | S | S | R | E | Q | R | D | K | L | N | T | K | N | A | G | Y | H | A | D | T | V |
| med_8 | L | L | R | T | P | L | T | C | P | F | D | K | S | M | S | G | E | Q | R | D | K | R | S | K | E | N | A | G | H | R | A | N | T | V |
| med_9 | L | L | T | T | P | L | T | D | L | G | Y | W | M | S | F | R | E | Q | R | D | K | L | S | T | E | Y | A | G | H | H | A | D | T | V |
Observation. Across k = 2, 3 and 4, the modal amino acid at every selected site within each cluster coincides exactly with the PAM medoid sequence. This agreement confirms that each cluster is internally homogeneous: the most centrally located sequence in Gower space is also the most frequent sequence at every position, which is precisely what one expects from a viral population dominated by a single emergent haplotype. At k = 9 the correspondence is no longer exact at every site, reflecting the finer subdivision of sequence space into smaller, less internally uniform clusters. The overall mutational profile of each medoid nonetheless remains intact, indicating that the core variant signal is preserved even as the clustering resolution increases.
For each k value tested (2, 3, 4 and 9), we compute classification metrics treating Period 1 sequences as the positive class. Since PAM cluster labels are arbitrary integers, we first identify the cluster with the highest Period 1 purity and designate that as the predicted positive label.
compute_metrics <- function(clust_vec, period_vec, n_p1_eff) {
p1_idx <- which(period_vec == 1L)
cl_purity <- tapply(clust_vec[p1_idx], clust_vec[p1_idx], length)
positive_cluster <- as.integer(names(which.max(cl_purity)))
pred_positive <- clust_vec == positive_cluster
true_positive <- period_vec == 1L
TP <- sum( pred_positive & true_positive)
FP <- sum( pred_positive & !true_positive)
FN <- sum(!pred_positive & true_positive)
TN <- sum(!pred_positive & !true_positive)
precision <- if ((TP + FP) > 0) TP / (TP + FP) else NA_real_
recall <- if ((TP + FN) > 0) TP / (TP + FN) else NA_real_
f1 <- if (!is.na(precision) && !is.na(recall) && (precision + recall) > 0)
2 * precision * recall / (precision + recall) else NA_real_
list(k = length(unique(clust_vec)), positive_cluster = positive_cluster,
TP = TP, FP = FP, FN = FN, TN = TN,
precision = round(precision, 4), recall = round(recall, 4),
f1 = round(f1, 4))
}
metrics_list <- lapply(K_HIGHLIGHT, function(k) {
compute_metrics(pam_fits[[paste0("k", k)]]$clustering, orig_period, n_p1_eff)
})
metrics_df <- do.call(rbind, lapply(metrics_list, as.data.frame))
print(metrics_df)
#> k positive_cluster TP FP FN TN precision recall f1
#> 1 2 1 36 64 0 27 0.3600 1 0.5294
#> 2 3 1 36 38 0 53 0.4865 1 0.6545
#> 3 4 1 36 34 0 57 0.5143 1 0.6792
#> 4 9 1 36 12 0 79 0.7500 1 0.8571On the interplay between statistical optimality and biological knowledge. The silhouette analysis provides a useful unsupervised criterion for cluster compactness and separation. It correctly identifies k = 2 as the most statistically compact solution. However, k = 2 conflates all non-ancestral variants into a single cluster, obscuring the variant-level structure that is biologically meaningful. As k increases to 3 and 4, F1 scores improve and individual VOC-associated clusters become identifiable, which is confirmed visually through t-SNE, quantitatively through the medoid SNP content tables, and formally through precision and recall. Practitioners working with biological sequence data should treat silhouette analysis as a starting point, not a definitive criterion: statistical optimality and biological interpretability need not coincide. Selecting k requires integrating three complementary sources of evidence: the silhouette curve, visual inspection of the t-SNE layout, and domain knowledge. No single criterion is sufficient on its own.
Using the mode matrices computed in Step 5, we ask: at which Spike sites does the modal amino acid in each non-wild-type cluster differ from that of the wild-type cluster?
A site flagged across multiple k values and multiple non-wild-type clusters is a robust, reproducible differentiator. The ranked frequency table below counts how many of the 14 total (k, cluster) comparisons flag each site, and annotates each site with its VOC SNP membership where applicable (Korber, Fischer, and Gnanakaran 2020; CDC2022?).
# annotate_site_full(s): compact VOC annotation for Spike site s.
# Uses ALL 12 variants in sarscov2_variants (not just the 4 in voc_list).
# Distinguishes:
# "VOC-SNP" = site is a defining SNP for that VOC (Defining_SNP_Sites)
# "VOC" = site is mutated in that VOC but is NOT a defining SNP
# "" = not found in any VOC catalogue
# Examples: 452 -> "Alpha/Delta/Epsilon/Iota/Kappa-SNP"
# 95 -> "Delta/Iota/Kappa/Omicron"
# 190 -> "Gamma-SNP; Iota"
annotate_site_full <- function(s) {
labels <- unlist(sarscov2_variants$WHO_Label)
snp_sites <- sarscov2_variants$Defining_SNP_Sites
mut_sites <- sarscov2_variants$Mutation_Sites
snp_vocs <- labels[vapply(snp_sites, function(ss) s %in% ss, logical(1L))]
mut_only <- setdiff(labels[vapply(mut_sites, function(ms) s %in% ms, logical(1L))],
snp_vocs)
parts <- c(
if (length(snp_vocs) > 0L) paste0(paste(snp_vocs, collapse = "/"), "-SNP"),
if (length(mut_only) > 0L) paste(mut_only, collapse = "/"))
if (length(parts) == 0L) "" else paste(parts, collapse = "; ")
}max_possible <- sum(vapply(K_HIGHLIGHT, function(k) {
length(unique(pam_fits[[paste0("k", k)]]$clustering)) - 1L
}, integer(1L)))
site_diff_counts <- integer(length(selected_sites))
names(site_diff_counts) <- as.character(selected_sites)
contrast_rows <- list()
for (k in K_HIGHLIGHT) {
mr <- mode_results[[paste0("k", k)]]
clust_vec <- mr$clust_vec
k_levels <- mr$k_levels
mode_int <- mr$mode_int
p1_idx <- which(orig_period == 1L)
purity <- tapply(clust_vec[p1_idx], clust_vec[p1_idx], length)
wt_cl <- as.integer(names(which.max(purity)))
wt_row <- which(k_levels == wt_cl)
for (cl in k_levels[k_levels != wt_cl]) {
cl_row <- which(k_levels == cl)
diff_vec <- mode_int[cl_row, ] != mode_int[wt_row, ]
diff_sites_cl <- selected_sites[diff_vec]
site_diff_counts[as.character(diff_sites_cl)] <-
site_diff_counts[as.character(diff_sites_cl)] + 1L
voc_str <- if (length(diff_sites_cl) > 0L)
vapply(diff_sites_cl, annotate_site_full, character(1L))
else character(0L)
contrast_rows[[length(contrast_rows)+1L]] <- data.frame(
k=k, wt=wt_cl, vs=cl, n=length(diff_sites_cl),
sites = if (length(diff_sites_cl)>0L) paste(diff_sites_cl, collapse=", ") else "-",
VOC = if (length(diff_sites_cl)>0L)
paste(sprintf("%s[%s]", diff_sites_cl, voc_str), collapse=", ") else "-",
stringsAsFactors=FALSE)
}
}
# Ranked frequency table
freq_df <- data.frame(
Site = as.integer(names(site_diff_counts)),
Freq = as.integer(site_diff_counts),
Pct = round(100 * site_diff_counts / max_possible, 1),
stringsAsFactors = FALSE)
freq_df <- freq_df[freq_df$Freq > 0L, ]
if (nrow(freq_df) > 0L) {
freq_df <- freq_df[order(-freq_df$Freq), ]
freq_df$VOC <- vapply(freq_df$Site, annotate_site_full, character(1L))
cat(sprintf("\n**Site differentiation across all %d (k, cluster) comparisons:**\n\n",
max_possible))
kt(
kbl(freq_df, format="html",
col.names=c("Site","Freq",sprintf("%% of %d comparisons", max_possible),"VOC"),
caption="Sites ranked by differentiation frequency (wild-type vs other clusters)") |>
kable_styling(bootstrap_options=c("condensed","hover","striped"),
font_size=11, full_width=FALSE) |>
column_spec(3,
color = ifelse(freq_df$Pct >= 50, "darkred", "black"),
bold = freq_df$Pct >= 50)
)
cat("\n")
} else {
cat("\n*No differentiating sites found (all clusters share the same modal residues).*\n\n")
}Site differentiation across all 14 (k, cluster) comparisons:
| Site | Freq | % of 14 comparisons | VOC | |
|---|---|---|---|---|
| 190 | 190 | 7 | 50.0 | Gamma-SNP |
| 452 | 452 | 7 | 50.0 | Delta/Kappa-SNP; Alpha/Epsilon/Iota/Lambda |
| 681 | 681 | 7 | 50.0 | Alpha/Delta/Kappa/Omicron-SNP; Gamma/Theta |
| 138 | 138 | 6 | 42.9 | Gamma-SNP |
| 158 | 158 | 6 | 42.9 | Delta |
| 18 | 18 | 5 | 35.7 | Gamma-SNP |
| 20 | 20 | 5 | 35.7 | Gamma-SNP |
| 26 | 26 | 5 | 35.7 | Gamma-SNP |
| 478 | 478 | 5 | 35.7 | Delta/Omicron-SNP |
| 484 | 484 | 5 | 35.7 | Beta/Eta/Gamma/Kappa/Omicron-SNP; Alpha/Iota/Theta/Zeta |
| 501 | 501 | 5 | 35.7 | Alpha/Beta/Gamma/Omicron-SNP; Theta |
| 655 | 655 | 5 | 35.7 | Gamma/Omicron-SNP |
| 1027 | 1027 | 5 | 35.7 | Gamma-SNP |
| 19 | 19 | 4 | 28.6 | Delta-SNP |
| 153 | 153 | 4 | 28.6 | |
| 157 | 157 | 4 | 28.6 | Delta/Iota |
| 417 | 417 | 4 | 28.6 | Beta/Gamma/Omicron-SNP; Delta |
| 950 | 950 | 4 | 28.6 | Delta-SNP; Iota |
| 1176 | 1176 | 4 | 28.6 | |
| 141 | 141 | 3 | 21.4 | |
| 142 | 142 | 3 | 21.4 | Delta/Kappa/Omicron |
| 144 | 144 | 3 | 21.4 | Alpha/Eta/Iota |
| 152 | 152 | 3 | 21.4 | Epsilon |
| 155 | 155 | 3 | 21.4 | |
| 180 | 180 | 2 | 14.3 | |
| 183 | 183 | 2 | 14.3 | |
| 477 | 477 | 2 | 14.3 | Omicron-SNP; Iota |
| 5 | 5 | 1 | 7.1 | Iota |
| 95 | 95 | 1 | 7.1 | Omicron-SNP; Delta/Iota/Kappa |
| 253 | 253 | 1 | 7.1 | Iota |
| 701 | 701 | 1 | 7.1 | Beta-SNP; Iota |
# Per-k breakdown
if (length(contrast_rows) > 0L) {
contrast_df <- do.call(rbind, contrast_rows)
cat("\n**Per-(k, cluster) breakdown:**\n\n")
kt(
kbl(contrast_df, format="html",
caption="Differentiating sites: wild-type cluster vs each other cluster") |>
kable_styling(bootstrap_options=c("condensed","hover"), font_size=11, full_width=FALSE)
)
}Per-(k, cluster) breakdown:
| k | wt | vs | n | sites | VOC |
|---|---|---|---|---|---|
| 2 | 1 | 2 | 11 | 18, 20, 26, 138, 190, 417, 484, 501, 655, 1027, 1176 | 18[Gamma-SNP], 20[Gamma-SNP], 26[Gamma-SNP], 138[Gamma-SNP], 190[Gamma-SNP], 417[Beta/Gamma/Omicron-SNP; Delta], 484[Beta/Eta/Gamma/Kappa/Omicron-SNP; Alpha/Iota/Theta/Zeta], 501[Alpha/Beta/Gamma/Omicron-SNP; Theta], 655[Gamma/Omicron-SNP], 1027[Gamma-SNP], 1176[] |
| 3 | 1 | 2 | 11 | 18, 20, 26, 138, 190, 417, 484, 501, 655, 1027, 1176 | 18[Gamma-SNP], 20[Gamma-SNP], 26[Gamma-SNP], 138[Gamma-SNP], 190[Gamma-SNP], 417[Beta/Gamma/Omicron-SNP; Delta], 484[Beta/Eta/Gamma/Kappa/Omicron-SNP; Alpha/Iota/Theta/Zeta], 501[Alpha/Beta/Gamma/Omicron-SNP; Theta], 655[Gamma/Omicron-SNP], 1027[Gamma-SNP], 1176[] |
| 3 | 1 | 3 | 6 | 19, 158, 452, 478, 681, 950 | 19[Delta-SNP], 158[Delta], 452[Delta/Kappa-SNP; Alpha/Epsilon/Iota/Lambda], 478[Delta/Omicron-SNP], 681[Alpha/Delta/Kappa/Omicron-SNP; Gamma/Theta], 950[Delta-SNP; Iota] |
| 4 | 1 | 2 | 11 | 18, 20, 26, 138, 190, 417, 484, 501, 655, 1027, 1176 | 18[Gamma-SNP], 20[Gamma-SNP], 26[Gamma-SNP], 138[Gamma-SNP], 190[Gamma-SNP], 417[Beta/Gamma/Omicron-SNP; Delta], 484[Beta/Eta/Gamma/Kappa/Omicron-SNP; Alpha/Iota/Theta/Zeta], 501[Alpha/Beta/Gamma/Omicron-SNP; Theta], 655[Gamma/Omicron-SNP], 1027[Gamma-SNP], 1176[] |
| 4 | 1 | 3 | 12 | 141, 142, 144, 152, 153, 155, 157, 158, 180, 183, 190, 452 | 141[], 142[Delta/Kappa/Omicron], 144[Alpha/Eta/Iota], 152[Epsilon], 153[], 155[], 157[Delta/Iota], 158[Delta], 180[], 183[], 190[Gamma-SNP], 452[Delta/Kappa-SNP; Alpha/Epsilon/Iota/Lambda] |
| 4 | 1 | 4 | 6 | 19, 158, 452, 478, 681, 950 | 19[Delta-SNP], 158[Delta], 452[Delta/Kappa-SNP; Alpha/Epsilon/Iota/Lambda], 478[Delta/Omicron-SNP], 681[Alpha/Delta/Kappa/Omicron-SNP; Gamma/Theta], 950[Delta-SNP; Iota] |
| 9 | 1 | 2 | 11 | 18, 20, 26, 138, 190, 417, 484, 501, 655, 1027, 1176 | 18[Gamma-SNP], 20[Gamma-SNP], 26[Gamma-SNP], 138[Gamma-SNP], 190[Gamma-SNP], 417[Beta/Gamma/Omicron-SNP; Delta], 484[Beta/Eta/Gamma/Kappa/Omicron-SNP; Alpha/Iota/Theta/Zeta], 501[Alpha/Beta/Gamma/Omicron-SNP; Theta], 655[Gamma/Omicron-SNP], 1027[Gamma-SNP], 1176[] |
| 9 | 1 | 3 | 5 | 5, 95, 253, 477, 701 | 5[Iota], 95[Omicron-SNP; Delta/Iota/Kappa], 253[Iota], 477[Omicron-SNP; Iota], 701[Beta-SNP; Iota] |
| 9 | 1 | 4 | 9 | 18, 20, 26, 138, 190, 452, 478, 681, 1027 | 18[Gamma-SNP], 20[Gamma-SNP], 26[Gamma-SNP], 138[Gamma-SNP], 190[Gamma-SNP], 452[Delta/Kappa-SNP; Alpha/Epsilon/Iota/Lambda], 478[Delta/Omicron-SNP], 681[Alpha/Delta/Kappa/Omicron-SNP; Gamma/Theta], 1027[Gamma-SNP] |
| 9 | 1 | 5 | 12 | 141, 142, 144, 152, 153, 155, 157, 158, 180, 183, 190, 452 | 141[], 142[Delta/Kappa/Omicron], 144[Alpha/Eta/Iota], 152[Epsilon], 153[], 155[], 157[Delta/Iota], 158[Delta], 180[], 183[], 190[Gamma-SNP], 452[Delta/Kappa-SNP; Alpha/Epsilon/Iota/Lambda] |
| 9 | 1 | 6 | 6 | 19, 158, 452, 478, 681, 950 | 19[Delta-SNP], 158[Delta], 452[Delta/Kappa-SNP; Alpha/Epsilon/Iota/Lambda], 478[Delta/Omicron-SNP], 681[Alpha/Delta/Kappa/Omicron-SNP; Gamma/Theta], 950[Delta-SNP; Iota] |
| 9 | 1 | 7 | 6 | 153, 157, 477, 484, 655, 681 | 153[], 157[Delta/Iota], 477[Omicron-SNP; Iota], 484[Beta/Eta/Gamma/Kappa/Omicron-SNP; Alpha/Iota/Theta/Zeta], 655[Gamma/Omicron-SNP], 681[Alpha/Delta/Kappa/Omicron-SNP; Gamma/Theta] |
| 9 | 1 | 8 | 14 | 19, 138, 141, 142, 144, 152, 153, 155, 157, 158, 452, 478, 681, 950 | 19[Delta-SNP], 138[Gamma-SNP], 141[], 142[Delta/Kappa/Omicron], 144[Alpha/Eta/Iota], 152[Epsilon], 153[], 155[], 157[Delta/Iota], 158[Delta], 452[Delta/Kappa-SNP; Alpha/Epsilon/Iota/Lambda], 478[Delta/Omicron-SNP], 681[Alpha/Delta/Kappa/Omicron-SNP; Gamma/Theta], 950[Delta-SNP; Iota] |
| 9 | 1 | 9 | 2 | 501, 681 | 501[Alpha/Beta/Gamma/Omicron-SNP; Theta], 681[Alpha/Delta/Kappa/Omicron-SNP; Gamma/Theta] |
The tables above reveal sites that consistently differentiate the wild-type cluster from all others across multiple k values. Beyond the expected VOC-defining SNP sites — D614G (site 614, present in virtually all VOCs (Korber, Fischer, and Gnanakaran 2020), sites 452, 478, 501 and 681 — the analysis flags two clusters of NTD residues whose significance is not immediately apparent from VOC catalogues alone.
Site 190 – NTD allosteric control node. Residue N188 and nearby NTD residues including position 190 show large state-dependent interaction energy changes with S2 residues that regulate RBD opening, indicating an allosteric control node modulating the closed-to-open transition of the Spike trimer (Wrobel, Benton, and Bhatt 2020). Substitutions here can alter the open/closed equilibrium, influencing ACE2 engagement and antibody accessibility without residing within the receptor-binding motif itself.
Sites 141–158 – NTD antigenic supersite and immune-escape module. Residues 140–150 form the N-terminal half of the NTD antigenic supersite loop, while residues 152, 153, 155, 157 and 158 form a single exposed loop functioning as an immune-escape module, heavily targeted by NTD-directed neutralising antibodies (McCallum et al. 2021). Deletions and substitutions in 141–158 arise repeatedly under immune pressure, remodel the supersite conformation, and confer resistance to NTD-directed antibodies while leaving ACE2 binding largely intact (Liu et al. 2022). Their recurrent selection across Alpha, Beta, Gamma and Delta is a hallmark of convergent immune-escape evolution. Crucially, these sites were identified here by entropy alone — no prior annotation was used.
PAM assigns each sequence to its nearest medoid by Gower distance, so no wild-type-cluster sequence can be strictly closer to a foreign medoid in Gower space — that is the definition of cluster membership. However, the t-SNE projection reveals that some cluster-1 sequences sit visually close to the territory of neighbouring clusters in 2D. These sequences are not misclassified in Gower space, but their position in the t-SNE layout suggests they occupy an intermediate region of sequence space, making them biologically interesting candidates for closer inspection. We identify them by hardcoded coordinate regions derived from direct visual inspection of the k = 4 t-SNE plot.
# Borderline analysis for k=4 only. Boundary regions from visual inspection:
# Region A: X > 15 AND Y <= 0 (right border, near cluster 2)
# Region B: X > -10 AND X < 10 AND Y < -10 (center bottom border, clusters 3/4)
# Applies to ALL cluster-1 sequences regardless of time period.
k <- 4L
fit <- pam_fits[[paste0("k", k)]]
clust_vec <- fit$clustering
med_rows <- fit$id.med
wt_cl <- 1L
foreign_cls <- setdiff(sort(unique(clust_vec)), wt_cl)
wt_rows <- which(clust_vec == wt_cl)
cat(sprintf("\n### k = %d (wild-type = cluster %d)\n\n", k, wt_cl))
# Apply coordinate filter
region_A <- tsne_base$X[wt_rows] > 15 & tsne_base$Y[wt_rows] <= 0
region_B <- tsne_base$X[wt_rows] > -10 & tsne_base$X[wt_rows] < 10 &
tsne_base$Y[wt_rows] < -10
is_bl <- region_A | region_B
bl_rows <- wt_rows[is_bl]
cat(sprintf("%d wt-cluster sequences fall in the boundary t-SNE region(s).\n\n",
length(bl_rows)))4 wt-cluster sequences fall in the boundary t-SNE region(s).
if (length(bl_rows) == 0L) {
cat("*No sequences in the defined boundary regions.*\n\n")
} else {
# t-SNE distances to own and nearest foreign medoid
wt_med_x <- tsne_base$X[med_rows[wt_cl]]
wt_med_y <- tsne_base$Y[med_rows[wt_cl]]
d_own <- sqrt((tsne_base$X[bl_rows] - wt_med_x)^2 +
(tsne_base$Y[bl_rows] - wt_med_y)^2)
fgn_dist_mat <- vapply(foreign_cls, function(cl) {
fx <- tsne_base$X[med_rows[cl]]; fy <- tsne_base$Y[med_rows[cl]]
sqrt((tsne_base$X[bl_rows] - fx)^2 + (tsne_base$Y[bl_rows] - fy)^2)
}, numeric(length(bl_rows)))
d_fgn <- if (length(foreign_cls) == 1L) as.numeric(fgn_dist_mat)
else apply(fgn_dist_mat, 1L, min)
margin <- d_own - d_fgn
period_lbl <- ifelse(orig_period[bl_rows] == 1L, "P1", "P2")
dist_tbl <- data.frame(
global_idx = bl_rows,
period = period_lbl,
tsne_X = round(tsne_base$X[bl_rows], 1),
tsne_Y = round(tsne_base$Y[bl_rows], 1),
d_own = round(d_own, 2),
d_fgn = round(d_fgn, 2),
margin = round(margin, 2)
)
dist_tbl <- dist_tbl[order(dist_tbl$margin), ]
kt(
kbl(dist_tbl, format = "html",
col.names = c("Row idx", "Period", "t-SNE X", "t-SNE Y",
"d(own med)", "d(nearest fgn med)", "Margin"),
caption = sprintf(
"Boundary-region wt-cluster sequences, sorted by margin [k=%d, wt=cl%d]",
k, wt_cl)) |>
kable_styling(bootstrap_options = c("condensed","hover"),
font_size = 11, full_width = FALSE) |>
column_spec(7L,
color = ifelse(dist_tbl$margin < 0, "red", "black"),
bold = dist_tbl$margin < 0)
)
cat("\n")
# VOC SNP content table
n_ref <- 1L + length(foreign_cls)
all_rows_bl <- c(med_rows[wt_cl], med_rows[foreign_cls], dist_tbl$global_idx)
row_labels_bl <- c(
sprintf("wt_med (cl%d)", wt_cl),
sprintf("fgn_med (cl%d)", foreign_cls),
sprintf("border[row%d] %s", dist_tbl$global_idx, dist_tbl$period)
)
voc_bl <- do.call(rbind, lapply(all_rows_bl, function(r) {
vapply(names(voc_list), function(voc_nm) {
voc <- voc_list[[voc_nm]]
n_ov <- length(intersect(as.character(voc$sites),
as.character(selected_sites)))
n_m <- count_snp_matches(AL_Cat_int_mat[r, ], voc)
sprintf("%d / %d", n_m, n_ov)
}, character(1L))
}))
voc_bl <- as.data.frame(voc_bl)
rownames(voc_bl) <- row_labels_bl
n_bl <- nrow(voc_bl) - n_ref
tbl_voc <- kbl(voc_bl, format = "html",
caption = sprintf(
"VOC SNP content: wt medoid | foreign medoids | borderline sequences [k=%d]",
k)) |>
kable_styling(bootstrap_options = c("condensed","hover"),
font_size = 11, full_width = FALSE) |>
row_spec(1L, background = "#d6eaf8") |>
row_spec(seq(2L, n_ref), background = "#fef9e7")
if (n_bl > 0L)
tbl_voc <- row_spec(tbl_voc, seq(n_ref + 1L, nrow(voc_bl)),
background = "#fce4ec")
kt(tbl_voc)
cat("\n")
# Sequence comparison table
own_med_aa <- as.data.frame(decode_aa_sequence(
as.matrix(AL_Cat_int[med_rows[wt_cl], , drop = FALSE])))
foreign_aa <- do.call(rbind, lapply(foreign_cls, function(cl)
as.data.frame(decode_aa_sequence(
as.matrix(AL_Cat_int[med_rows[cl], , drop = FALSE])))))
border_aa <- as.data.frame(decode_aa_sequence(
as.matrix(AL_Cat_int[bl_rows, , drop = FALSE])))
compare_df <- rbind(own_med_aa, foreign_aa, border_aa)
colnames(compare_df) <- as.character(selected_sites)
rownames(compare_df) <- c(
sprintf("wt_med (cl%d)", wt_cl),
sprintf("fgn_med (cl%d)", foreign_cls),
sprintf("border[row%d] %s margin=%.2f",
dist_tbl$global_idx, dist_tbl$period, dist_tbl$margin)
)
out_df <- compare_df
for (ri in 2L:nrow(compare_df)) {
for (nm in colnames(compare_df)) {
v1 <- compare_df[1L, nm]; vi <- compare_df[ri, nm]
if (!is.na(vi) && vi != v1)
out_df[ri, nm] <- cell_spec(vi, format = "html", color = "red", bold = TRUE)
}
}
kt(
kbl(out_df, format = "html", escape = FALSE,
caption = sprintf(
"wt medoid (blue) | foreign medoids (yellow) | borderline (pink) | red = differs from wt medoid [k=%d]",
k)) |>
kable_styling(bootstrap_options = c("condensed","hover"),
font_size = 11, full_width = FALSE) |>
row_spec(1L, background = "#d6eaf8") |>
row_spec(seq(2L, n_ref), background = "#fef9e7") |>
row_spec(seq(n_ref + 1L, nrow(out_df)), background = "#fce4ec")
)
cat("\n")
}| Row idx | Period | t-SNE X | t-SNE Y | d(own med) | d(nearest fgn med) | Margin | |
|---|---|---|---|---|---|---|---|
| 6822 | 123 | P2 | -2.1 | -15.1 | 24.37 | 5.31 | 19.05 |
| 6775 | 118 | P2 | 1.5 | -14.3 | 24.81 | 5.51 | 19.30 |
| 6776 | 119 | P2 | 17.8 | -0.3 | 27.53 | 7.16 | 20.37 |
| 6877 | 127 | P2 | 18.0 | -0.3 | 27.71 | 6.96 | 20.74 |
| Delta | Alpha | Beta | Gamma | |
|---|---|---|---|---|
| wt_med (cl1) | 0 / 5 | 0 / 2 | 0 / 4 | 0 / 10 |
| fgn_med (cl2) | 0 / 5 | 1 / 2 | 2 / 4 | 10 / 10 |
| fgn_med (cl3) | 1 / 5 | 0 / 2 | 0 / 4 | 0 / 10 |
| fgn_med (cl4) | 5 / 5 | 0 / 2 | 0 / 4 | 0 / 10 |
| border[row123] P2 | 3 / 5 | 0 / 2 | 0 / 4 | 0 / 10 |
| border[row118] P2 | 3 / 5 | 0 / 2 | 1 / 4 | 3 / 10 |
| border[row119] P2 | 2 / 5 | 0 / 2 | 0 / 4 | 5 / 10 |
| border[row127] P2 | 3 / 5 | 0 / 2 | 0 / 4 | 5 / 10 |
| 5 | 18 | 19 | 20 | 26 | 54 | 95 | 138 | 141 | 142 | 144 | 152 | 153 | 155 | 157 | 158 | 180 | 183 | 190 | 253 | 417 | 452 | 477 | 478 | 484 | 501 | 520 | 614 | 655 | 681 | 701 | 950 | 1027 | 1176 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| wt_med (cl1) | L | L | T | T | P | L | T | D | L | G | Y | W | M | S | F | R | E | Q | R | D | K | L | S | T | E | N | A | G | H | P | A | D | T | V |
| fgn_med (cl2) | L | F | T | N | S | L | T | Y | L | G | Y | W | M | S | F | R | E | Q | S | D | T | L | S | T | K | Y | A | G | Y | P | A | D | I | F |
| fgn_med (cl3) | L | L | T | T | P | L | T | D | Y | Y | K | S | E | R | Y | S | Q | F | V | D | K | R | S | T | E | N | A | G | H | P | A | D | T | V |
| fgn_med (cl4) | L | L | R | T | P | L | T | D | L | G | Y | W | M | S | F | G | E | Q | R | D | K | R | S | K | E | N | A | G | H | R | A | N | T | V |
| border[row123] P2 margin=19.05 | L | L | R | T | P | L | T | Y | L | G | Y | W | M | S | F | R | E | Q | S | D | K | L | S | K | K | N | A | G | H | R | A | D | T | V |
| border[row118] P2 margin=19.30 | L | F | T | N | S | L | T | Y | L | G | Y | W | M | S | F | R | E | Q | S | D | K | L | S | K | E | N | A | G | H | R | A | D | T | V |
| border[row119] P2 margin=20.37 | L | L | T | T | P | L | T | D | L | G | Y | W | M | S | F | R | E | Q | R | D | K | R | S | K | E | N | A | G | H | P | A | N | T | V |
| border[row127] P2 margin=20.74 | L | F | T | N | S | L | T | Y | L | G | Y | W | M | S | F | R | E | Q | R | D | K | R | S | K | E | N | A | G | H | R | A | D | I | V |
The preceding analysis was conducted on a focused two-period subset
to enable supervised evaluation against a known ground truth. Here we
apply the same entropy-GMM-PAM pipeline to the complete
preprocessed dataset (AL_df, all available USA
sequences across all time periods) to demonstrate that the algorithm
achieves robust cluster separation at scale, without restricting to a
specific surveillance window.
# Entropy on all N_SITES sites across all sequences in AL_df
entrp_full <- apply(AL_df[, seq_len(N_SITES), drop = FALSE], 2L, calculate_entropy)
clust_full <- cluster_sites_by_entropy(entrp_full, nr = nrow(AL_df))
clust_full_rl <- relabel_entropy_classes(clust_full$DataFrame)
selected_sites_full <- sort(
clust_full_rl[clust_full_rl$class == 1L, ]$sites
)
cat(sprintf("Full-dataset GMM selected %d sites from %d sequences:\n",
length(selected_sites_full), nrow(AL_df)))
#> Full-dataset GMM selected 29 sites from 109536 sequences:
print(selected_sites_full)
#> [1] 5 13 18 20 26 95 138 152 190 253 417 452 477 478 484 494 501 614 655 677 681 688 701
#> [24] 716 732 769 957 1027 1176
# Overlap with VOC defining SNP sites -- uses same get_variant_snps() as Step 2.
# The difference vs. the two-period selected_sites is expected: entropy was
# computed on 109k sequences with different frequency distributions, so GMM
# selects a different (potentially overlapping) site subset.
# Delta two-period: 5 overlap sites (19,452,478,681,950).
# Delta full-dataset overlap depends on selected_sites_full computed above.
overlap_full <- lapply(voc_list, function(v)
intersect(selected_sites_full, v$sites))
names(overlap_full) <- names(voc_list)
cat("\nGMM-selected sites overlapping with VOC defining SNP sites (full dataset):\n")
#>
#> GMM-selected sites overlapping with VOC defining SNP sites (full dataset):
for (nm in names(overlap_full)) {
ov <- overlap_full[[nm]]
cat(sprintf(" %-6s: %s\n", nm,
if (length(ov) > 0L) paste(ov, collapse = ", ") else "none"))
}
#> Delta : 452, 478, 681
#> Alpha : 501, 681, 716
#> Beta : 417, 484, 501, 701
#> Gamma : 18, 20, 26, 138, 190, 417, 484, 501, 655, 1027# Subset to selected sites and convert to factors
AL_full_fac <- AL_df[, as.character(selected_sites_full), drop = FALSE]
AL_full_fac[] <- lapply(AL_full_fac, function(x) as.factor(as.integer(x)))
# Deduplicate before Gower: 109k sequences produce an infeasible (for laptop RAM)
# distance matrix (~48 GB). Unique sequences are sufficient for PAM/t-SNE and
# dramatically reduce computation while preserving the full sequence diversity.
full_unique_idx <- which(!duplicated(AL_full_fac))
AL_full_fac_uniq <- AL_full_fac[full_unique_idx, , drop = FALSE]
cat(sprintf("Unique sequences for full-dataset clustering: %d (from %d total)\n",
nrow(AL_full_fac_uniq), nrow(AL_full_fac)))
#> Unique sequences for full-dataset clustering: 718 (from 109536 total)
gower_dist_full <- daisy(AL_full_fac_uniq, metric = "gower")future::plan(future::multisession, workers = N_WORKERS)
sil_width_full <- c(NA, future_sapply(2:K_MAX, function(k) {
pam(gower_dist_full, diss = TRUE, k = k)$silinfo$avg.width
}))
future::plan(future::sequential)
sil_df_full <- data.frame(k = seq_len(K_MAX), width = sil_width_full)
sil_df_full <- sil_df_full[!is.na(sil_df_full$width), ]
ggplot(sil_df_full, aes(x = k, y = width)) +
geom_line(colour = "black") +
geom_point(colour = "darkorange", shape = 1, size = 2) +
geom_vline(xintercept = c(6L, 7L, 9L), linetype = "dashed", colour = "steelblue", alpha = 0.6) +
scale_x_continuous(limits = c(2L, K_MAX),
breaks = seq(5, K_MAX, by = 5),
minor_breaks = seq(2, K_MAX, by = 1)) +
scale_y_continuous(breaks = scales::pretty_breaks(n = 6)) +
labs(title = "Silhouette Analysis: Full Dataset (All Sequences)",
subtitle = "Blue dashed lines mark k = 6, 7, 9",
x = "Number of clusters", y = "Average Silhouette Width") +
theme_bw() +
theme(panel.grid.major = element_line(colour = "grey85"),
panel.grid.minor = element_line(colour = "grey93"),
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5, size = 9),
axis.title.x = element_text(colour = "steelblue"),
axis.title.y = element_text(colour = "darkorange"))We run t-SNE once on the full Gower distance matrix and recolour by each PAM solution (k = 2, 3 and 7). Fixed axis ranges ensure spatial positions are identical across all plots.
K_HIGHLIGHT_FULL <- c(6L, 7L, 9L)
pam_fits_full <- lapply(K_HIGHLIGHT_FULL, function(k)
pam(gower_dist_full, diss = TRUE, k = k))
names(pam_fits_full) <- paste0("k", K_HIGHLIGHT_FULL)
for (k in K_HIGHLIGHT_FULL) {
cat(sprintf("k=%d cluster sizes: ", k))
cat(paste(table(pam_fits_full[[paste0("k",k)]]$clustering), collapse=" | "), "\n")
}
#> k=6 cluster sizes: 321 | 150 | 65 | 73 | 41 | 68
#> k=7 cluster sizes: 385 | 65 | 76 | 35 | 46 | 42 | 69
#> k=9 cluster sizes: 309 | 70 | 63 | 67 | 35 | 43 | 41 | 69 | 21set.seed(TSNE_SEED)
tsne_full_obj <- Rtsne(gower_dist_full, is_distance = TRUE,
perplexity = TSNE_PERPLEXITY)
tsne_full_base <- as.data.frame(tsne_full_obj$Y) |>
setNames(c("X", "Y"))# Integer matrix for medoid decoding and VOC matching (built once, reused)
AL_full_int <- AL_full_fac_uniq |>
mutate_if(is.factor, function(x) as.integer(as.character(x)))
AL_full_int_mat <- as.matrix(AL_full_int)
# VOC list for full dataset (Alpha, Beta, Delta, Gamma only)
voc_list_full <- list(
Delta = get_variant_snps("Delta"),
Alpha = get_variant_snps("Alpha"),
Beta = get_variant_snps("Beta"),
Gamma = get_variant_snps("Gamma")
)
# Fixed axis ranges for identical plot area across all k values
x_pad_f <- diff(range(tsne_full_base$X)) * 0.05
y_pad_f <- diff(range(tsne_full_base$Y)) * 0.05
xlim_fix_f <- range(tsne_full_base$X) + c(-x_pad_f, x_pad_f)
ylim_fix_f <- range(tsne_full_base$Y) + c(-y_pad_f, y_pad_f)for (k in K_HIGHLIGHT_FULL) {
fit_f <- pam_fits_full[[paste0("k", k)]]
med_f <- fit_f$id.med
cl_sizes_f <- as.integer(table(fit_f$clustering))
# ── t-SNE plot ─────────────────────────────────────────────────────────────
cat(sprintf("\n## Full Dataset PAM k = %d\n\n### t-SNE Plot\n\n", k))
df_f <- tsne_full_base |>
mutate(cluster = factor(fit_f$clustering),
is_medoid = row_number() %in% med_f)
p_f <- ggplot(df_f, aes(x = X, y = Y, colour = cluster, shape = cluster)) +
geom_point(size = 1.5, alpha = 0.5) +
scale_shape_manual(values = seq(0, k)) +
geom_point(data = subset(df_f, is_medoid),
aes(x = X, y = Y),
colour = "red", shape = 18, size = 5, alpha = 0.45,
inherit.aes = FALSE) +
coord_cartesian(xlim = xlim_fix_f, ylim = ylim_fix_f) +
labs(title = sprintf("PAM k = %d | Full Dataset t-SNE", k),
subtitle = sprintf("Unique sequences: %d | Red diamonds = cluster medoids",
nrow(AL_full_fac_uniq)),
x = "t-SNE dimension 1", y = "t-SNE dimension 2") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5, size = 9))
print(p_f)
# ── Medoid sequences table ──────────────────────────────────────────────────
cat(sprintf("\n### Medoid Sequences (k = %d)\n\n", k))
med_aa <- decode_aa_sequence(as.matrix(AL_full_int[med_f, , drop = FALSE]))
med_df <- as.data.frame(med_aa)
colnames(med_df) <- as.character(selected_sites_full)
rownames(med_df) <- paste0("cl_", seq_len(k))
med_df <- cbind(N = cl_sizes_f, med_df)
if (k > 1L) {
for (ri in 2L:k) {
for (nm in as.character(selected_sites_full)) {
v1 <- med_df[1L, nm]; vi <- med_df[ri, nm]
if (!is.na(vi) && vi != v1)
med_df[ri, nm] <- cell_spec(vi, format = "html", color = "red", bold = TRUE)
}
}
}
kt(
kbl(med_df, format = "html", escape = FALSE,
caption = sprintf(
"Medoid sequences (k=%d, full dataset) | N = cluster size | red = differs from cl_1",
k)) |>
kable_styling(bootstrap_options = c("condensed","hover"),
font_size = 11, full_width = FALSE) |>
row_spec(1L, background = "#d6eaf8")
)
# ── VOC SNP content table ───────────────────────────────────────────────────
cat(sprintf("\n### VOC SNP Content (k = %d)\n\n", k))
voc_tbl_f <- do.call(rbind, lapply(seq_len(k), function(cl_i) {
row_i <- med_f[cl_i]
vapply(names(voc_list_full), function(voc_nm) {
voc <- voc_list_full[[voc_nm]]
n_ov <- length(intersect(as.character(voc$sites),
as.character(selected_sites_full)))
n_m <- count_snp_matches(AL_full_int_mat[row_i, ], voc)
sprintf("%d / %d", n_m, n_ov)
}, character(1L))
}))
voc_tbl_f <- as.data.frame(voc_tbl_f)
rownames(voc_tbl_f) <- paste0("cl_", seq_len(k))
kt(
kbl(voc_tbl_f, format = "html",
caption = sprintf(
"VOC SNP matches per medoid (matched / overlap sites), full dataset k=%d",
k)) |>
kable_styling(bootstrap_options = c("condensed","hover"),
font_size = 11, full_width = FALSE)
)
cat("\n")
}
### Medoid Sequences (k = 6)
| N | 5 | 13 | 18 | 20 | 26 | 95 | 138 | 152 | 190 | 253 | 417 | 452 | 477 | 478 | 484 | 494 | 501 | 614 | 655 | 677 | 681 | 688 | 701 | 716 | 732 | 769 | 957 | 1027 | 1176 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| cl_1 | 321 | L | S | L | T | P | T | D | W | R | D | K | L | S | T | E | S | N | G | H | Q | P | A | A | T | T | G | Q | T | V |
| cl_2 | 150 | L | S | L | T | P | T | D | W | R | D | K | L | S | T | E | S | N | G | H | Q | H | A | A | T | T | G | Q | T | V |
| cl_3 | 65 | L | I | L | T | P | T | D | C | R | D | K | R | S | T | E | S | N | G | H | Q | P | A | A | T | T | G | Q | T | V |
| cl_4 | 73 | F | S | L | T | P | I | D | W | R | G | K | L | S | T | K | S | N | G | H | Q | P | A | V | T | T | G | Q | T | V |
| cl_5 | 41 | F | S | L | T | P | I | D | W | R | G | K | L | N | T | E | S | N | G | H | Q | P | A | A | T | T | G | R | T | V |
| cl_6 | 68 | L | S | F | N | S | T | Y | W | S | D | T | L | S | T | K | S | Y | G | Y | Q | P | A | A | T | T | G | Q | I | F |
| Delta | Alpha | Beta | Gamma | |
|---|---|---|---|---|
| cl_1 | 0 / 3 | 0 / 3 | 0 / 4 | 0 / 10 |
| cl_2 | 0 / 3 | 1 / 3 | 0 / 4 | 0 / 10 |
| cl_3 | 1 / 3 | 0 / 3 | 0 / 4 | 0 / 10 |
| cl_4 | 0 / 3 | 0 / 3 | 2 / 4 | 1 / 10 |
| cl_5 | 0 / 3 | 0 / 3 | 0 / 4 | 0 / 10 |
| cl_6 | 0 / 3 | 1 / 3 | 2 / 4 | 10 / 10 |
### Medoid Sequences (k = 7)
| N | 5 | 13 | 18 | 20 | 26 | 95 | 138 | 152 | 190 | 253 | 417 | 452 | 477 | 478 | 484 | 494 | 501 | 614 | 655 | 677 | 681 | 688 | 701 | 716 | 732 | 769 | 957 | 1027 | 1176 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| cl_1 | 385 | L | S | L | T | P | T | D | W | R | D | K | L | S | T | E | S | N | G | H | Q | P | A | A | T | T | G | Q | T | V |
| cl_2 | 65 | L | I | L | T | P | T | D | C | R | D | K | R | S | T | E | S | N | G | H | Q | P | A | A | T | T | G | Q | T | V |
| cl_3 | 76 | F | S | L | T | P | I | D | W | R | G | K | L | S | T | K | S | N | G | H | Q | P | A | V | T | T | G | Q | T | V |
| cl_4 | 35 | L | S | L | T | P | T | D | W | R | D | K | L | S | K | E | S | N | G | H | Q | H | A | A | T | A | G | Q | T | V |
| cl_5 | 46 | L | S | L | T | P | T | D | W | R | D | K | L | S | T | E | S | Y | G | H | Q | H | A | A | I | T | G | Q | T | V |
| cl_6 | 42 | F | S | L | T | P | I | D | W | R | G | K | L | N | T | E | S | N | G | H | Q | P | A | A | T | T | G | R | T | V |
| cl_7 | 69 | L | S | F | N | S | T | Y | W | S | D | T | L | S | T | K | S | Y | G | Y | Q | P | A | A | T | T | G | Q | I | F |
| Delta | Alpha | Beta | Gamma | |
|---|---|---|---|---|
| cl_1 | 0 / 3 | 0 / 3 | 0 / 4 | 0 / 10 |
| cl_2 | 1 / 3 | 0 / 3 | 0 / 4 | 0 / 10 |
| cl_3 | 0 / 3 | 0 / 3 | 2 / 4 | 1 / 10 |
| cl_4 | 1 / 3 | 1 / 3 | 0 / 4 | 0 / 10 |
| cl_5 | 0 / 3 | 3 / 3 | 1 / 4 | 1 / 10 |
| cl_6 | 0 / 3 | 0 / 3 | 0 / 4 | 0 / 10 |
| cl_7 | 0 / 3 | 1 / 3 | 2 / 4 | 10 / 10 |
### Medoid Sequences (k = 9)
| N | 5 | 13 | 18 | 20 | 26 | 95 | 138 | 152 | 190 | 253 | 417 | 452 | 477 | 478 | 484 | 494 | 501 | 614 | 655 | 677 | 681 | 688 | 701 | 716 | 732 | 769 | 957 | 1027 | 1176 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| cl_1 | 309 | L | S | L | T | P | T | D | W | R | D | K | L | S | T | E | S | N | G | H | Q | P | A | A | T | T | G | Q | T | V |
| cl_2 | 70 | F | S | L | T | P | T | D | W | R | D | K | L | S | T | E | S | N | G | H | Q | P | A | A | T | T | G | Q | T | V |
| cl_3 | 63 | L | I | L | T | P | T | D | C | R | D | K | R | S | T | E | S | N | G | H | Q | P | A | A | T | T | G | Q | T | V |
| cl_4 | 67 | F | S | L | T | P | I | D | W | R | G | K | L | S | T | K | S | N | G | H | Q | P | A | V | T | T | G | Q | T | V |
| cl_5 | 35 | L | S | L | T | P | T | D | W | R | D | K | L | S | K | E | S | N | G | H | Q | H | A | A | T | A | G | Q | T | V |
| cl_6 | 43 | L | S | L | T | P | T | D | W | R | D | K | L | S | T | E | S | Y | G | H | Q | H | A | A | I | T | G | Q | T | V |
| cl_7 | 41 | F | S | L | T | P | I | D | W | R | G | K | L | N | T | E | S | N | G | H | Q | P | A | A | T | T | G | R | T | V |
| cl_8 | 69 | L | S | F | N | S | T | Y | W | S | D | T | L | S | T | K | S | Y | G | Y | Q | P | A | A | T | T | G | Q | I | F |
| cl_9 | 21 | L | S | L | T | P | T | D | W | R | D | K | R | S | K | E | S | N | G | H | Q | R | A | A | T | T | G | Q | T | V |
| Delta | Alpha | Beta | Gamma | |
|---|---|---|---|---|
| cl_1 | 0 / 3 | 0 / 3 | 0 / 4 | 0 / 10 |
| cl_2 | 0 / 3 | 0 / 3 | 0 / 4 | 0 / 10 |
| cl_3 | 1 / 3 | 0 / 3 | 0 / 4 | 0 / 10 |
| cl_4 | 0 / 3 | 0 / 3 | 2 / 4 | 1 / 10 |
| cl_5 | 1 / 3 | 1 / 3 | 0 / 4 | 0 / 10 |
| cl_6 | 0 / 3 | 3 / 3 | 1 / 4 | 1 / 10 |
| cl_7 | 0 / 3 | 0 / 3 | 0 / 4 | 0 / 10 |
| cl_8 | 0 / 3 | 1 / 3 | 2 / 4 | 10 / 10 |
| cl_9 | 3 / 3 | 0 / 3 | 0 / 4 | 0 / 10 |
cat("== Analysis Summary ==\n\n")
#> == Analysis Summary ==
cat(sprintf("Country filter: %s\n",
ifelse(is.null(COUNTRY), "none (all countries)", COUNTRY)))
#> Country filter: USA
cat(sprintf("Unique sequences: %s\n", UNIQUE_SEQS))
#> Unique sequences: TRUE
cat(sprintf("Period 1 sequences: %d (%s to %s)\n",
n_p1_eff, PERIOD1_START, PERIOD1_END))
#> Period 1 sequences: 36 (2020-05-01 to 2020-07-01)
cat(sprintf("Period 2 sequences: %d (%s to %s)\n",
n_p2_eff, PERIOD2_START, PERIOD2_END))
#> Period 2 sequences: 91 (2021-07-01 to 2021-09-01)
cat(sprintf("GMM-selected sites: %d [%s]\n",
length(selected_sites),
paste(selected_sites, collapse = ", ")))
#> GMM-selected sites: 34 [5, 18, 19, 20, 26, 54, 95, 138, 141, 142, 144, 152, 153, 155, 157, 158, 180, 183, 190, 253, 417, 452, 477, 478, 484, 501, 520, 614, 655, 681, 701, 950, 1027, 1176]
cat("\nVOC SNP overlap with GMM-selected sites:\n")
#>
#> VOC SNP overlap with GMM-selected sites:
for (nm in names(overlap_list)) {
ov <- overlap_list[[nm]]
cat(sprintf(" %-6s: %s\n",
nm,
if (length(ov) > 0) paste(ov, collapse = ", ") else "none"))
}
#> Delta : 19, 452, 478, 681, 950
#> Alpha : 501, 681
#> Beta : 417, 484, 501, 701
#> Gamma : 18, 20, 26, 138, 190, 417, 484, 501, 655, 1027
cat(sprintf("\nOptimal k by silhouette: %d\n", optimal_k))
#>
#> Optimal k by silhouette: 2
cat(sprintf("Optimal k by F1 score: %d\n\n",
K_HIGHLIGHT[which.max(metrics_df$f1)]))
#> Optimal k by F1 score: 9
cat("Classification metrics (k = 2, 3, 4, 9):\n")
#> Classification metrics (k = 2, 3, 4, 9):
print(metrics_df[, c("k", "positive_cluster", "TP", "FP", "FN",
"precision", "recall", "f1")])
#> k positive_cluster TP FP FN precision recall f1
#> 1 2 1 36 64 0 0.3600 1 0.5294
#> 2 3 1 36 38 0 0.4865 1 0.6545
#> 3 4 1 36 34 0 0.5143 1 0.6792
#> 4 9 1 36 12 0 0.7500 1 0.8571ViralEntropR’s entropy-driven dimensionality reduction,
combined with PAM clustering on Gower distance, recovers the temporal
structure of SARS-CoV-2 evolution without any supervised learning or
prior knowledge of mutation topology.
Perfect recall across all solutions. All PAM solutions (k = 2, 3 and 4) achieve recall = 1.00: every wild-type sequence (Period 1, May–June 2020) is assigned to the wild-type cluster. No ancestral sequence is assigned to the Delta-dominant partition despite the method being entirely unsupervised.
F1 score peaks at k = 9, consistent with known biology. While k = 2 achieves the highest silhouette width, k = 9 yields the best classification performance and is the biologically justified solution. During the combined surveillance window, the United States was host not only to the ancestral wild-type and the four designated VOCs (Alpha, Beta, Gamma and Delta) but also to several co-circulating Variants of Interest — Kappa, Iota, Epsilon and Lambda — each carrying distinct subsets of Spike mutations. The pairwise contrast analysis in Step 7 recovers mutation signatures consistent with several of these lineages, suggesting that the nine-cluster solution captures genuine sub-variant structure.
The method is purely information-theoretic. No feature engineering beyond Shannon entropy scoring is applied, and the spatial topology of sites along the Spike protein is entirely ignored. Sites are selected because their amino acid distribution changed between surveillance periods — as Shannon’s foundational framework predicts, entropy naturally captures the informative positions (Shannon 1948).
Mode equals medoid: cluster cohesion confirmed. For k = 2, 3 and 4, the modal amino acid at every selected site within each cluster matches the PAM medoid sequence exactly, confirming that these clusters are not statistical artefacts but reflect genuine population bottlenecks imposed by emergent variants. At k = 9, a small number of sites show minor discrepancies between the mode and the medoid, as expected when clusters become smaller and the within-cluster frequency distribution at some positions is more diffuse. Critically, these discrepancies do not alter the VOC SNP classification of any medoid: the defining mutation signatures remain intact, and the affected sites carry non-defining substitutions that do not change the lineage assignment.
Unsupervised recovery of functionally annotated sites. Steps 7 and 8 used no biological annotation. Yet the analysis independently flagged the NTD allosteric control node near residue 190 (Wrobel, Benton, and Bhatt 2020) and the antigenic supersite spanning positions 141–158 (McCallum et al. 2021; Liu et al. 2022) — both implicated in immune evasion and RBD conformational regulation. This convergence with independently established structural biology validates that the entropy pipeline captures genuine evolutionary signal.
The full-dataset analysis (Step 9) is conceptually distinct from the
accuracy study. It operates on all ~109,536 available USA sequences
across the full surveillance timeline, with GMM site selection
(selected_sites_full) performed on the entire dataset. The
selected sites therefore reflect population-wide entropy rather than the
contrast between two specific windows, and the two site sets are
expected to differ — this is by design. Prior to Gower distance
computation and PAM clustering, sequences are deduplicated and only
unique haplotypes are retained. This step is computationally necessary:
the full 109,536-sequence distance matrix would require approximately 48
GB of RAM, which exceeds the capacity of a typical workstation.
Deduplication reduces the working set to a tractable number of unique
sequences while preserving the full haplotype diversity of the dataset,
since PAM medoids and t-SNE coordinates are determined by sequence
identity rather than sequence count.
Across all values of k tried, the t-SNE plots show visible separation among clusters. At k = 9, the VOC SNP content of the medoid sequences confirms clear separation of all four major VOCs circulating at the cut-off time of the dataset: Alpha, Beta, Delta and Gamma each have a dedicated cluster whose medoid carries the corresponding defining mutations. This demonstrates that the entropy-GMM-PAM pipeline is capable of recovering the known variant structure of the pandemic without any labelled outcomes or restriction to time windows.
ViralEntropR provides a transparent, reproducible, and
computationally efficient framework for detecting and characterizing
viral variant structure in large sequence datasets, combining rigorous
information theory with the interpretability required for
epidemiological application.The results confirm that Shannon entropy,
computed without any prior biological annotation, is sufficient to
select a compact site subset that recovers known variant structure at
both the two-period and full-dataset scales.