Extracts date strings from the sequence name strings of an
AAStringSet object loaded via
readAAStringSet. Several built-in date patterns
are provided for the common date conventions used in NCBI and GISAID
exports; a fully custom regex can also be supplied.
Arguments
- sequence
An AAStringSet object.
- option
Integer (1, 2, 3, or 4). Selects the built-in pattern when custom_pattern is not supplied. See Details.
- date_format
Character. strptime-style format string used to coerce extracted strings to Date. Default is "%Y-%m-%d". Use "%Y-%m" together with option = 3.
- custom_pattern
Character or NULL. A custom regex passed directly to str_extract. When supplied, option is ignored. Default is NULL.
Value
A named list of six elements. The first four are vectors aligned with the
input sequence; message and missing_id summarise the extraction:
- raw_date_strings
Character vector of extracted date strings before any correction. NA where extraction failed.
- corrected_date_strings
Character vector with -00 substrings replaced by -01. NA where extraction failed.
- raw_dates
Date vector coerced from raw_date_strings. NA for unparseable or missing strings (including any record where day = 00).
- corrected_dates
Date vector coerced from corrected_date_strings. NA for unparseable or missing strings.
- message
Character string summarising extraction success.
- missing_id
Integer vector of indices where extraction failed, or NA if all extractions succeeded.
Details
Date strings of the form yyyy-mm-dd are matched between pipe
characters (|...|) by default. Day value 00 (a common
GISAID convention indicating unknown collection day) is accepted in the
raw string and corrected to 01 before coercion to Date.
Both raw and corrected versions are returned, so the caller can decide
how to treat unknown-day records downstream.
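The difference between the raw and corrected outputs can be illustrated with a minimal base-R sketch. It only mirrors the behaviour described above (replace -00 with -01, then coerce); it is not the package's internal code, and the input strings are made up for illustration.

```r
# Made-up extracted strings: one unknown day, one complete date, one failed extraction.
raw <- c("2021-03-00", "2020-11-17", NA)

# Replace the literal substring "-00" with "-01" (assumed to mirror the documented correction).
corrected <- gsub("-00", "-01", raw, fixed = TRUE)

# Coerce both versions; day 00 is not a valid %d value, so the raw version becomes NA.
raw_dates       <- as.Date(raw,       format = "%Y-%m-%d")
corrected_dates <- as.Date(corrected, format = "%Y-%m-%d")

data.frame(raw, corrected, raw_dates, corrected_dates)
```

Comparing raw_dates with corrected_dates shows which records had their day imputed, which is useful if unknown-day records should be dropped rather than pinned to the first of the month.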
Choosing a built-in pattern. The four options correspond to the four most common date conventions in viral sequence repositories:
- option = 1: yyyy-mm-dd between pipes. GISAID export format, where the date is followed by additional pipe-delimited fields.
- option = 2: yyyy-dd-mm between pipes. Some European data sources reverse day and month.
- option = 3: yyyy-mm between pipes. Month-level resolution, useful when the source omits or hides the day. Pair with date_format = "%Y-%m".
- option = 4: yyyy-mm-dd at end of header. NCBI Virus export format, where the collection date is the final field with no trailing delimiter. This is the format of the bundled sarscov2_sample.
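To see how header layout determines which option fits, the following base-R sketch applies three illustrative regexes to made-up headers. The patterns and the helper extract_first are assumptions written for this illustration; the function's actual built-in patterns may differ. Option 2 is omitted because its shape matches option 1, and the difference presumably lies in how the digits are interpreted.

```r
# Hypothetical helper: first regex match per string, NA where nothing matches.
extract_first <- function(x, pattern) {
  m <- regexpr(pattern, x, perl = TRUE)
  hit <- !is.na(m) & m > 0
  out <- rep(NA_character_, length(x))
  out[hit] <- regmatches(x, m)
  out
}

# Illustrative patterns (assumed, not taken from the package source).
pat_between_pipes_ymd <- "(?<=\\|)[0-9]{4}-[0-9]{2}-[0-9]{2}(?=\\|)"  # like option 1
pat_between_pipes_ym  <- "(?<=\\|)[0-9]{4}-[0-9]{2}(?=\\|)"           # like option 3
pat_end_of_header     <- "[0-9]{4}-[0-9]{2}-[0-9]{2}$"                # like option 4

# Made-up headers; the accession-style identifiers are placeholders.
headers <- c(
  "hCoV-19/USA/CA-1/2021|EPI_ISL_000001|2021-03-05|NorthAmerica",
  "hCoV-19/USA/CA-2/2021|EPI_ISL_000002|2021-03|NorthAmerica",
  "QXT18620.1 |USA|2021-01-21"
)

between_ymd <- extract_first(headers, pat_between_pipes_ymd)
between_ym  <- extract_first(headers, pat_between_pipes_ym)
at_end      <- extract_first(headers, pat_end_of_header)
```

Each pattern matches only the header written in its own convention, which is why an end-of-header date needs a different option from a date followed by further pipe-delimited fields.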
For datasets where the date does not lie between pipes, supply a
custom_pattern matching whatever surrounding context the headers
provide.
Coercion to Date. When date_format = "%Y-%m" the
function uses as.yearmon for coercion so that
year-month strings are handled correctly (base as.Date cannot
parse "2021-05" alone). For all other formats, base
as.Date with the supplied date_format is used.
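The reason for the separate year-month path can be checked directly. The sketch below uses made-up strings and shows base as.Date failing on year-month input, a base-R workaround (an assumption for comparison, not the function's method), and the zoo route, which runs only if zoo is installed.

```r
ym <- c("2021-05", "2020-11")

# Base R: a month without a day cannot be resolved to a date, so every element is NA.
base_try <- as.Date(ym, format = "%Y-%m")

# Base-R workaround for comparison: append a day before parsing.
base_fix <- as.Date(paste0(ym, "-01"), format = "%Y-%m-%d")

# zoo route: parse as yearmon, then convert to Date (first day of each month by default).
if (requireNamespace("zoo", quietly = TRUE)) {
  zoo_fix <- zoo::as.Date(zoo::as.yearmon(ym, format = "%Y-%m"))
  print(zoo_fix)
}
```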
Output alignment. The four vector elements of the return list
(raw_date_strings, corrected_date_strings, raw_dates, corrected_dates) are the
same length as the input sequence. Where extraction fails for a
record, the corresponding entries are NA; missing_id
lists the affected indices.
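Because the vectors stay aligned with the input, missing_id can be used to index straight back into the headers. The base-R sketch below imitates that contract on made-up headers with a stand-in regex; it does not call extract_fasta_dates, and the names are local to the example.

```r
# Made-up headers; the second one has no parseable date.
headers <- c("A1 |USA|2021-01-21", "A2 |USA|unknown", "A3 |GBR|2020-07-18")

# Stand-in extraction step: end-of-header yyyy-mm-dd, NA where nothing matches.
m <- regexpr("[0-9]{4}-[0-9]{2}-[0-9]{2}$", headers)
raw <- rep(NA_character_, length(headers))
raw[m > 0] <- regmatches(headers, m)

# Indices of failed records, or NA when there are none (mirrors the documented value).
missing_id <- which(is.na(raw))
if (length(missing_id) == 0) missing_id <- NA

# Guard against the NA case before indexing back into the input.
failed_headers <- if (anyNA(missing_id)) character(0) else headers[missing_id]
failed_headers
```

With the real return value, the same guard applies: check for NA in missing_id before using it as an index into names(sequence).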
See also
extract_fasta_countries for the country-extraction
companion; readAAStringSet for loading the
input AAStringSet; as.yearmon for the
year-month coercion path.
Examples
# \donttest{
path_sample <- system.file("extdata", "sarscov2_sample.fasta.gz",
                           package = "ViralEntropR")
fasta_sample <- Biostrings::readAAStringSet(path_sample)
# Inspect header structure to confirm date field position.
sample(names(fasta_sample), 1)
#> [1] "QXT18620.1 |USA|2021-01-21"
# The bundled sample uses NCBI Virus format (date at end of header),
# so use option = 4.
dates <- extract_fasta_dates(fasta_sample, option = 4)
dates$message
#> [1] "There are date strings that have not been recognized"
head(dates$corrected_dates)
#> [1] "2021-02-12" "2020-04-03" "2021-03-05" "2021-02-09" "2020-07-18"
#> [6] "2020-03-23"
range(dates$corrected_dates, na.rm = TRUE)
#> [1] "2020-03-09" "2021-09-10"
# Custom regex for non-standard headers:
dates_custom <- extract_fasta_dates(
  fasta_sample,
  # Two-digit day alternatives come first so that "21" is not cut short to "2".
  custom_pattern = "[0-9]{4}-(0?[1-9]|1[0-2])-([12][0-9]|3[01]|0?[1-9]|00)"
)
# }