SARS-CoV-2 Surface Glycoprotein Sequences – NCBI Demo Sample

A gzip-compressed FASTA file containing a random sample of 100 SARS-CoV-2 surface glycoprotein (Spike protein) amino acid sequences, drawn from a dataset of 137,132 complete sequences downloaded from the NCBI SARS-CoV-2 Data Hub on October 12, 2021. It is intended to demonstrate the full ViralEntropR pipeline on real-world surveillance data.

Format

A gzip-compressed FASTA file (.fasta.gz) readable by readAAStringSet. Each record contains:

Header: NCBI Virus format, with accession, country, and collection date separated by |.
Sequence: Amino acid sequence using the standard IUPAC one-letter code. Ambiguous residues (B, X, Z) and gaps (-) may be present and can be removed with filter_ambiguous_sequences.
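
To make the ambiguity check concrete, the following base-R sketch flags sequences containing B, X, Z, or a gap character. It is an illustration only: the toy sequences are hypothetical, and the sketch does not claim to reproduce how filter_ambiguous_sequences is implemented.

```r
# Toy sequences (hypothetical, not taken from the bundled file)
seqs <- c(seq1 = "MFVFLVLLPLVSSQ",
          seq2 = "MFVXLVLLPLVSSQ",   # contains the ambiguous residue X
          seq3 = "MFVFLV-LPLVSSQ")   # contains a gap

# TRUE where a sequence contains B, X, Z, or "-"
# (a trailing "-" inside the bracket expression is matched literally)
has_ambiguous <- grepl("[BXZ-]", seqs)

# Keep only the sequences without ambiguous residues or gaps
clean <- seqs[!has_ambiguous]
names(clean)
```

Dropping whole sequences is the simplest policy; whether residues should instead be masked or removed position by position depends on the downstream analysis.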

Source

https://www.ncbi.nlm.nih.gov/labs/virus/vssi/

Details

The file is accessed at runtime via:


path  <- system.file("extdata", "sarscov2_sample.fasta.gz",
                     package = "ViralEntropR")
fasta <- Biostrings::readAAStringSet(path)

Sample file contents:

Sample size: 100 random sequences
Protein: Surface glycoprotein (Spike, S protein)
Organism: Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), NCBI Taxonomy ID: 2697049
Completeness: Complete sequences only
Format: Gzip-compressed FASTA (.fasta.gz), readable directly by readAAStringSet

Full dataset: The complete 137,132-sequence dataset (~181.5 MB, uncompressed FASTA) is archived on Zenodo (doi:10.5281/zenodo.19040165); readAAStringSet reads it directly.

Header format: Sequence headers follow the NCBI Virus export format. Dates and countries can be extracted directly using the package helper functions:


dates     <- extract_fasta_dates(fasta, option = 1)
countries <- extract_fasta_countries(fasta, position = 2)
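
For readers who want to see what the header layout implies, here is a minimal base-R sketch that splits three headers from the demo sample on the | separator. It is not the implementation of extract_fasta_dates or extract_fasta_countries; the variable names (headers, fields, parsed) are illustrative assumptions.

```r
# Three headers as they appear in the bundled sample
headers <- c("QSG79861.1 |USA|2021-02-12",
             "QPV04597.1 |Chile|2020-04-03",
             "QUA36626.1 |USA|2021-03-05")

# Split on the literal pipe; fixed = TRUE stops "|" being read as regex alternation
fields <- strsplit(headers, "|", fixed = TRUE)

# Pick the n-th field of each header and trim stray whitespace
nth_field <- function(n) trimws(vapply(fields, `[`, character(1), n))

parsed <- data.frame(
  accession = nth_field(1),
  country   = nth_field(2),
  # An explicit format yields NA (not an error) for partial dates such as "2020-04"
  date      = as.Date(nth_field(3), format = "%Y-%m-%d"),
  stringsAsFactors = FALSE
)
parsed
```

The trimws() call matters here because the accession field in these headers is followed by a space before the first separator.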

Subsampling: To keep the package within CRAN size limits, the bundled sample was generated by random sampling of 100 sequences. The full provenance and reproducible sampling script are in data-raw/sarscov2_ncbi.R in the package source.
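
The subsampling step can be sketched as follows. This is a hedged illustration, not a copy of data-raw/sarscov2_ncbi.R: the seed value and the object name fasta_full are assumptions, and the actual script may differ.

```r
set.seed(42)  # hypothetical seed; the seed used in data-raw/ is not stated here

n_total  <- 137132  # number of sequences in the full dataset
n_sample <- 100     # number of sequences in the bundled sample

# Draw 100 distinct record indices without replacement
idx <- sort(sample.int(n_total, n_sample))

# With the full dataset loaded as an AAStringSet (fasta_full, assumed),
# the sample could be subset and written back out gzip-compressed:
# fasta_sample <- fasta_full[idx]
# Biostrings::writeXStringSet(fasta_sample, "sarscov2_sample.fasta.gz",
#                             compress = TRUE)
```

Fixing the seed before sampling is what makes the draw reproducible; rerunning the script with the same seed and the same input order yields the same 100 records.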

Provenance

Downloaded from the NCBI SARS-CoV-2 Data Hub (https://www.ncbi.nlm.nih.gov/labs/virus/vssi/) on October 12, 2021 with the following filters:

Virus: Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), taxid: 2697049
Nucleotide completeness: complete
Protein: surface glycoprotein

Result: n = 137,132 sequences, file size ~181.5 MB (uncompressed).

License

NCBI sequence data is a US Government work and is in the public domain within the United States. Data from international contributors is subject to the INSDC open-access policy (Karsch-Mizrachi et al., 2025; doi:10.1093/nar/gkae1058). The compiled dataset is released under CC0 1.0 Universal. Individual GenBank accession numbers in the FASTA headers provide full traceability to original submissions.

References

Clark K, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW (2016). GenBank. Nucleic Acids Research, 44(D1), D67–D72. doi:10.1093/nar/gkv1276

Sayers EW, Bolton EE, Brister JR, et al. (2022). Database resources of the National Center for Biotechnology Information. Nucleic Acids Research, 50(D1), D20–D26. doi:10.1093/nar/gkab1112

NCBI Virus [Internet]. Bethesda (MD): National Library of Medicine (US), National Center for Biotechnology Information; [2020] – [cited 2021 Oct 12]. Available from: https://www.ncbi.nlm.nih.gov/labs/virus/vssi/

Examples

# \donttest{
path  <- system.file("extdata", "sarscov2_sample.fasta.gz",
                     package = "ViralEntropR")
fasta <- Biostrings::readAAStringSet(path)

# Number of sequences in the demo sample
length(fasta)
#> [1] 100

# Inspect first 3 headers
names(fasta)[1:3]
#> [1] "QSG79861.1 |USA|2021-02-12"   "QPV04597.1 |Chile|2020-04-03"
#> [3] "QUA36626.1 |USA|2021-03-05"  

# Extract collection dates
dates <- extract_fasta_dates(fasta, option = 1)
head(dates$corrected_dates)
#> [1] NA NA NA NA NA NA

# Extract countries
countries <- extract_fasta_countries(fasta, position = 2)
table(countries$countries)
#> 
#>    Australia   Bangladesh        Chile      Germany        India  New Zealand 
#>            7            1            1            1            3            2 
#> Saudi Arabia          USA 
#>            1           84 

# Convert to character matrix for pipeline entry
char_mat <- fasta_to_char_matrix(fasta)
dim(char_mat)
#> [1]  100 1273

# The full 137,132-sequence dataset (~181.5 MB) is available on Zenodo:
# https://zenodo.org/records/19040165
# Download sequences.fasta manually and load with:
# fasta_full <- Biostrings::readAAStringSet("path/to/sequences.fasta")
# }