Simulate Viral Variant Evolution
Source:R/simulate_variant_evolution.R
simulate_variant_evolution.RdGenerates a synthetic, ground-truth-traceable time series of viral amino-acid sequences in which a reference strain is progressively challenged by emerging variants under stochastic growth and pairwise competition. Designed as a controllable signal generator for benchmarking the entropy, Hellinger, and change-point components of the ViralEntropR pipeline; the dynamics are deliberately simplified rather than a faithful biological model of evolutionary fitness.
Usage
simulate_variant_evolution(
ref_sequences,
n_ref_months = 3L,
start_date,
end_date,
variants_config,
variant_intervals,
n_new_mutations = 1L,
mutation_rate = 2,
mutation_rate_variability = 0.25,
deleterious_rate = 0,
n_deleterious_limit = 1L,
n_sequences_total = 140L,
ref_variability = FALSE,
n_ref_sequences = 100L,
prob_deletion_event = 0,
n_rows_to_delete = 0L,
seed = NULL
)Arguments
- ref_sequences
Either a single character string giving the reference amino-acid sequence, or a data.frame with a
Datecolumn and site columns named with integer position labels ("1","2", ...,as.character(L); one column per site). The sequence lengthLmust be at least 3, since variant mutations and random substitution events are sampled from sites \(>2\) (sites 1 and 2 are reserved for reference variability whenref_variability = TRUE).- n_ref_months
Integer. Duration of the initial reference phase (months). Default
3.- start_date
Character or Date. Simulation start.
- end_date
Character or Date. Simulation end (inclusive month).
- variants_config
Integer vector. Number of mutations per variant, e.g.,
c(2, 6).- variant_intervals
Integer vector. Months between consecutive variant emergences (length
length(variants_config) - 1).- n_new_mutations
Integer. Number of de-novo sequence copies introduced at a variant's first appearance. Despite the name, this controls the count of seeded sequences, not the number of mutations carried by the variant (which is set per-variant by
variants_config). Default1.- mutation_rate
Numeric scalar or vector. Monthly growth multiplier per variant. Default
2.- mutation_rate_variability
Numeric in
[0, 1). Fractional spread aroundmutation_rate: growth is drawn fromUniform(mult*(1-v), mult*(1+v)). Set to0for deterministic growth. Default0.25.- deleterious_rate
Numeric in
[0, 1]. Per-mutation probability of being flagged as "deleterious". Default0.- n_deleterious_limit
Integer. Per-site cap on the number of variant rows allowed to retain a flagged ("deleterious") allele in a given month; overflow is reverted to the reference allele at the flagged site only. See Capped "Deleterious" Sites in Details. Default
1.- n_sequences_total
Integer. Total sequences generated over the full mutation phase (spread evenly per month). Default
140.- ref_variability
Logical. If
TRUE, introduces low-level variability at sites 1 and 2 of the reference pool. Requiresn_ref_sequences >= 6 * n_ref_monthsfor the variability rows to be correctly captured by the multi-variant fill pool; smaller values trigger a warning. DefaultFALSE.- n_ref_sequences
Integer. When
ref_sequencesis a single string, the number of reference rows to create. Default100.- prob_deletion_event
Numeric in
[0, 1]. Per-month probability of triggering a substitution event affectingn_rows_to_deleterows. (Parameter name retained for backward compatibility; the operation is substitution, not literal deletion. See Details.) Default0.- n_rows_to_delete
Integer. Number of rows affected when a substitution event fires (see
prob_deletion_event). Default0.- seed
Integer or
NULL. Random seed for reproducibility. DefaultNULL.
Value
An object of class "viralSim", a named list:
- Simulation_Output
Data frame of all sequences, one column per site (named
"1", ...,as.character(L)) plusVariant(per-row strain label),Phase("R"or"RV"for reference rows depending on whether the row matches the wild-type sequence;"M1","M2", ... in mutation-phase months — note that"Mn"is a month label and applies to every row in that month regardless of strain),Date,Period(1-indexed month), andDelet("Yes"for rows affected by substitution events,"No"otherwise).- Variant_Details
List of per-variant metadata. Each element has
em(emergence-month index),pos(mutated site indices),flags(logical vector of "deleterious" flags per mutated site),vseq(full mutated reference sequence as character vector),mult(growth multiplier),last(count in the most recent month),cum(cumulative count), andrepl(logical flag set when the variant has been displaced by its successor).- Simulation_Dates
Date vector of all simulated months.
- Baseline_Ref_Sequence
Character. The wild-type reference string.
- Delet_Records
Named list of substitution events keyed by date (
"YYYY-MM-DD"). Each entry holdsperiod,date,site,old_aa,new_aa, androws(absolute row indices inSimulation_Outputthat were modified).- Pool
Fill-pool data frame from the final multi-variant competition period (includes
._weightcolumn);NULLif at most one variant was ever active.
Details
Simulates the trajectory of a viral population over discrete monthly time
steps. All stochastic components are reproducible under a fixed
seed and are designed to expose a known emergence pattern that
downstream detection methods can be evaluated against.
Simulation Phases:
Reference Phase: A stable period in which only the reference strain exists. Reference months are resampled with replacement to assemble
n_ref_monthsmonthly batches.Variant Emergence: New variants are introduced at the monthly intervals specified by
variant_intervals.Pairwise Growth: Each month the newest variant grows alongside its immediate predecessor. Both grow stochastically: each variant's next-month count is its previous-month count times
Uniform(mult*(1-variability), mult*(1+variability)). The two are not coupled by a shared frequency constraint — only by the hard monthly quotaceiling(n_sequences_total / (n_periods - n_ref_months))that trims any overshoot. Older, non-paired variants enter the fill pool with sampling weights2^(i+1).Capped "Deleterious" Sites: A Bernoulli-random subset of each variant's mutated positions is flagged with probability
deleterious_rate. At each post-emergence month the count of variant rows still carrying the flagged allele is capped atn_deleterious_limitby reverting overflow sequences to the reference allele at the flagged site only. This is a per-site cap, not a genotype-level fitness penalty.Random Substitution Events: Each month independently fires a Bernoulli coin flip with probability
prob_deletion_event. If it fires, the lastn_rows_to_deleterows of that month's batch receive a single non-reference amino-acid substitution at a randomly chosen site (sites 1 and 2 excluded), and those rows are flagged withDelet = "Yes". Parameter names retain "deletion" / "delete" for backward compatibility; the operation itself is a substitution.
Examples
ref_seq <- "MKTIIALSYIFCLVFADYKDDDDK"
sim <- simulate_variant_evolution(
ref_sequences = ref_seq,
n_ref_months = 3,
start_date = "2021-01-01",
end_date = "2021-12-01",
variants_config = c(3, 5),
variant_intervals = c(4),
n_sequences_total = 50,
mutation_rate = 1.5,
seed = 123
)
head(sim$Simulation_Output)
#> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 Variant
#> 1 M K T I I A L S Y I F C L V F A D Y K D D D D K Reference
#> 2 M K T I I A L S Y I F C L V F A D Y K D D D D K Reference
#> 3 M K T I I A L S Y I F C L V F A D Y K D D D D K Reference
#> 4 M K T I I A L S Y I F C L V F A D Y K D D D D K Reference
#> 5 M K T I I A L S Y I F C L V F A D Y K D D D D K Reference
#> 6 M K T I I A L S Y I F C L V F A D Y K D D D D K Reference
#> Phase Date Period Delet
#> 1 R 2021-01-01 1 No
#> 2 R 2021-01-01 1 No
#> 3 R 2021-01-01 1 No
#> 4 R 2021-01-01 1 No
#> 5 R 2021-01-01 1 No
#> 6 R 2021-01-01 1 No