A compressed FASTA file containing a random sample of 100 SARS-CoV-2
surface glycoprotein (Spike protein) amino acid sequences, drawn from a
dataset of 137,132 complete sequences downloaded from the NCBI SARS-CoV-2
Data Hub on October 12, 2021. Intended to demonstrate the full
ViralEntropR pipeline on real-world surveillance data.
Format
A gzip-compressed FASTA file (.fasta.gz) readable by
readAAStringSet. Each record contains:
- Header
NCBI Virus format: accession, country, and collection date metadata separated by |.
- Sequence
Amino acid sequence using the standard IUPAC one-letter code. Ambiguous residues (B, X, Z) and gaps (-) may be present and can be removed with
filter_ambiguous_sequences.
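The record layout described above can be illustrated with a minimal base-R sketch. It does not use the package helpers or Biostrings. The accessions and sequences below are made-up placeholders rather than real Spike records, and the variable names (records, fields, parsed) are hypothetical.

```r
# Hypothetical records that mimic the "accession |country|date" header layout.
# Sequences are short invented fragments, not real Spike sequences.
records <- c(
  "DEMO0001.1 |USA|2021-02-12"   = "MFVFLVLLPLVSSQCVNLT",
  "DEMO0002.1 |Chile|2020-04-03" = "MFVFLXLLPLVSSQCVNLT",
  "DEMO0003.1 |USA|2021-03-05"   = "MFVFLVLLPLVSS-CVNLT"
)

# Split each header on the literal "|" separator.
fields <- strsplit(names(records), "|", fixed = TRUE)

# Pick the i-th field of every header and trim surrounding whitespace.
get_field <- function(i) trimws(vapply(fields, `[`, character(1), i))

parsed <- data.frame(
  accession = get_field(1),
  country   = get_field(2),
  date      = as.Date(get_field(3)),
  stringsAsFactors = FALSE
)

# Flag sequences containing an ambiguous residue (B, X, Z) or a gap (-).
parsed$ambiguous <- grepl("[BXZ-]", unname(records))

print(parsed)
```

In this sketch the second and third records are flagged, because they contain X and a gap respectively. Within the package, filter_ambiguous_sequences is the documented way to remove such sequences.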
Details
The file is accessed at runtime via:
path <- system.file("extdata", "sarscov2_sample.fasta.gz",
                    package = "ViralEntropR")
fasta <- Biostrings::readAAStringSet(path)
Sample file contents:
Sample size: 100 random sequences
Protein: Surface glycoprotein (Spike, S protein)
Organism: Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), NCBI Taxonomy ID: 2697049
Completeness: Complete sequences only
Format: Gzip-compressed FASTA (.fasta.gz), readable directly by readAAStringSet
Full dataset:
The complete 137,132-sequence dataset (~181.5 MB, uncompressed FASTA) is
archived on Zenodo (doi:10.5281/zenodo.19040165); readAAStringSet reads it
directly.
Header format: Sequence headers follow the NCBI Virus export format. Dates and countries can be extracted directly using the package helper functions:
dates <- extract_fasta_dates(fasta, option = 1)
countries <- extract_fasta_countries(fasta, position = 2)
Subsampling:
To keep the package within CRAN size limits, the bundled sample was
generated by random sampling of 100 sequences. The full provenance and
reproducible sampling script are in data-raw/sarscov2_ncbi.R in
the package source.
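As a rough illustration of reproducible random subsampling, the base-R sketch below draws 100 items without replacement under a fixed seed. It is not the script in data-raw/sarscov2_ncbi.R. The identifiers, the population size of 1,000, and the seed value are all assumptions made for the example.

```r
# Hypothetical stand-in for the full dataset: 1,000 record identifiers.
all_ids <- sprintf("record_%04d", seq_len(1000))

# Fixing the seed makes the draw repeatable. The value 2021 is arbitrary
# and is not claimed to be the seed used by the package.
set.seed(2021)
sample_ids <- sample(all_ids, size = 100, replace = FALSE)

# Re-running with the same seed reproduces exactly the same sample.
set.seed(2021)
sample_again <- sample(all_ids, size = 100, replace = FALSE)

length(sample_ids)                   # number of sampled records
identical(sample_ids, sample_again)  # TRUE when the draw is reproducible
anyDuplicated(sample_ids)            # 0 when no record was drawn twice
```

The same pattern applies to a sequence set: sample the indices once under a recorded seed, subset the full set by those indices, and write the subset to a compressed FASTA file.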
Provenance
Downloaded from the NCBI SARS-CoV-2 Data Hub (https://www.ncbi.nlm.nih.gov/labs/virus/vssi/) on October 12, 2021 with the following filters:
Virus: Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), taxid: 2697049
Nucleotide completeness: complete
Protein: surface glycoprotein
Result: n = 137,132 sequences, file size ~181.5 MB (uncompressed).
License
NCBI sequence data is a US Government work and is in the public domain within the United States. Data from international contributors is subject to the INSDC open-access policy (Karsch-Mizrachi et al., 2025; doi:10.1093/nar/gkae1058). The compiled dataset is released under CC0 1.0 Universal. Individual GenBank accession numbers in the FASTA headers provide full traceability to original submissions.
References
Clark K, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW (2016). GenBank. Nucleic Acids Research, 44(D1), D67–D72. doi:10.1093/nar/gkv1276
Sayers EW, Bolton EE, Brister JR, et al. (2022). Database resources of the National Center for Biotechnology Information. Nucleic Acids Research, 50(D1), D20–D26. doi:10.1093/nar/gkab1112
NCBI Virus [Internet]. Bethesda (MD): National Library of Medicine (US), National Center for Biotechnology Information; [2020] – [cited 2021 Oct 12]. Available from: https://www.ncbi.nlm.nih.gov/labs/virus/vssi/
Examples
# \donttest{
path <- system.file("extdata", "sarscov2_sample.fasta.gz",
                    package = "ViralEntropR")
fasta <- Biostrings::readAAStringSet(path)
# Number of sequences in the demo sample
length(fasta)
#> [1] 100
# Inspect first 3 headers
names(fasta)[1:3]
#> [1] "QSG79861.1 |USA|2021-02-12" "QPV04597.1 |Chile|2020-04-03"
#> [3] "QUA36626.1 |USA|2021-03-05"
# Extract collection dates
dates <- extract_fasta_dates(fasta, option = 1)
head(dates$corrected_dates)
#> [1] NA NA NA NA NA NA
# Extract countries
countries <- extract_fasta_countries(fasta, position = 2)
table(countries$countries)
#>
#> Australia Bangladesh Chile Germany India New Zealand
#> 7 1 1 1 3 2
#> Saudi Arabia USA
#> 1 84
# Convert to character matrix for pipeline entry
char_mat <- fasta_to_char_matrix(fasta)
dim(char_mat)
#> [1] 100 1273
# The full 137,132-sequence dataset (~181.5 MB) is available on Zenodo:
# https://zenodo.org/records/19040165
# Download sequences.fasta manually and load with:
# fasta_full <- Biostrings::readAAStringSet("path/to/sequences.fasta")
# }