Remove Sequences Containing Ambiguous Residues
Source: R/filter_ambiguous_sequences.R
Removes rows (sequences) that contain at least one ambiguous amino acid residue (B, J, X, or Z) and, under integer-encoded input, any unrecognised character, from a sequence matrix. Accepts both integer-encoded matrices and character matrices.
Arguments
- NumMatrix
A matrix. Rows are sequences, columns are sites. Either integer-encoded (option = 1) or, despite the argument name, character (option = 2).
- option
Integer. 1 (default) for integer-encoded matrices produced by encode_aa_sequence; 2 for character matrices produced by fasta_to_char_matrix.
Value
A named list:
- OriginalDim
Character string reporting the number of input sequences.
- NewDim
Character string reporting the number of sequences remaining after filtering.
- NumberAmbiguous
Character string reporting the number of sequences that contained at least one ambiguous residue.
- RangeAmbiguous
Character string reporting the min and max count of ambiguous residues per removed sequence, or "No ambiguous sequences found" when none were removed.
- DeletedSeqId
Integer vector of row indices that were removed. Empty integer vector if nothing was removed.
- FilteredMatrix
The filtered matrix with ambiguous rows removed, preserving the original column structure and storage mode.
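Because DeletedSeqId can be an empty integer vector, some care is needed when using it to keep companion data in sync with FilteredMatrix. The sketch below does not call the package; `meta`, `deleted_some`, `deleted_none`, and `keep_rows` are hypothetical stand-ins used only to illustrate the subsetting pattern.

```r
# Hypothetical per-sequence metadata, one row per input sequence.
meta <- data.frame(id = paste0("seq", 1:5), stringsAsFactors = FALSE)

# Stand-ins for result$DeletedSeqId in two scenarios (assumed values).
deleted_some <- c(2L, 4L)
deleted_none <- integer(0)

# Hypothetical helper: indices of rows that survive filtering.
keep_rows <- function(n, deleted) setdiff(seq_len(n), deleted)

meta_some <- meta[keep_rows(nrow(meta), deleted_some), , drop = FALSE]
meta_none <- meta[keep_rows(nrow(meta), deleted_none), , drop = FALSE]

# Pitfall: negative indexing with an empty vector selects zero rows,
# so meta[-deleted, ] is unsafe when nothing was removed.
meta_wrong <- meta[-deleted_none, , drop = FALSE]
```

Using `setdiff()` on the full index range behaves the same whether or not any rows were removed, which is why it is preferred here over negative indexing.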
Details
What is removed. Sequences are flagged for removal if any of their residue positions contain one of the four IUPAC ambiguous codes:
- B: Aspartate / Asparagine.
- J: Leucine / Isoleucine.
- X: any residue.
- Z: Glutamate / Glutamine.
What is NOT removed. Standard alignment gaps (-, integer code 25) are retained: gaps represent known absences rather than uncertain identities and are typically positionally meaningful in aligned data. Sequences containing only the 20 canonical amino acids and gaps are kept.
How input mode is handled.
- option = 1 (integer-encoded input from encode_aa_sequence): rows are removed if any cell equals 0 (unrecognised, including J, NA, empty, lowercase mismatches, byte-order marks, and other characters that fell outside the encoding alphabet), 21 (B), 22 (Z), or 23 (X). The 0 sentinel acts as a catch-all for anything not in the 25-symbol alphabet.
- option = 2 (character input from fasta_to_char_matrix): rows are removed if any cell is exactly "B", "J", "X", or "Z". Unrecognised characters in character input (e.g. lowercase letters, NA, empty strings) are NOT caught at this stage; encode first if you need that catch-all behaviour.
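The difference between the two modes can be illustrated with a minimal sketch. This is not the package's source: `sketch_ambiguous_mask` is a hypothetical helper that only mirrors the codes and letters listed above, and the small matrices are invented for illustration.

```r
# Hypothetical helper (not part of the package): logical mask of cells that
# would count as ambiguous under each input mode described above.
sketch_ambiguous_mask <- function(m, option = 1) {
  if (option == 1) {
    # Integer codes: 0 = unrecognised, 21 = B, 22 = Z, 23 = X.
    mask <- m %in% c(0L, 21L, 22L, 23L)
  } else {
    # Character mode: exact matches only.
    mask <- m %in% c("B", "J", "X", "Z")
  }
  dim(mask) <- dim(m)  # %in% returns a plain vector, so restore the shape
  mask
}

# Integer mode: row 1 holds a gap (25) and is kept; rows 2 and 3 are flagged.
int_m <- matrix(c(1L,  5L, 25L,
                  2L, 21L,  3L,
                  0L,  4L,  6L), nrow = 3, byrow = TRUE)
int_flagged <- which(rowSums(sketch_ambiguous_mask(int_m, option = 1)) > 0)

# Character mode: lowercase "x" in row 2 is not caught; "J" in row 3 is.
chr_m <- matrix(c("M", "-", "K",
                  "x", "T", "I",
                  "J", "T", "I"), nrow = 3, byrow = TRUE)
chr_flagged <- which(rowSums(sketch_ambiguous_mask(chr_m, option = 2)) > 0)
```

Here `int_flagged` contains rows 2 and 3, while `chr_flagged` contains only row 3, which shows why encoding first is the safer route when the input may hold unexpected characters.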
Performance. Detection is fully vectorised: a single logical
matrix comparison followed by rowSums counts
ambiguous residues per sequence in one C-level call, replacing the
original row-by-row loop for a substantial speed improvement on
large matrices (100k+ rows).
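As a rough illustration of the vectorised approach, the sketch below counts ambiguous codes per row in two ways and checks that they agree. It is an assumption-laden stand-in rather than the package's code: the random matrix, the `ambiguous_codes` vector, and the use of `%in%` for the comparison are choices made here for illustration.

```r
set.seed(42)
# Hypothetical input: 1000 sequences, 50 sites, integer codes 0..25.
m <- matrix(sample(0:25, 1000 * 50, replace = TRUE), nrow = 1000)
ambiguous_codes <- c(0L, 21L, 22L, 23L)

# Row-by-row counting, one iteration per sequence.
loop_counts <- integer(nrow(m))
for (i in seq_len(nrow(m))) {
  loop_counts[i] <- sum(m[i, ] %in% ambiguous_codes)
}

# Vectorised counting: one logical matrix, then a single rowSums() call.
# matrix() refills column-major, matching the order of the original cells.
mask <- matrix(m %in% ambiguous_codes, nrow = nrow(m))
vec_counts <- rowSums(mask)

counts_agree <- all(loop_counts == vec_counts)
rows_to_drop <- which(vec_counts > 0)
```

Wrapping each approach in `system.time()` on a matrix of realistic size is a simple way to check the speed difference on your own data.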
See also
encode_aa_sequence and
fasta_to_char_matrix for producing the typical input;
decode_aa_sequence for inspecting the surviving
sequences in character form.
Examples
# Synthetic example: 50 sequences, 10 sites, drawn from canonical residues.
set.seed(1)
m <- matrix(sample(1:20, 500, replace = TRUE), nrow = 50, ncol = 10)
# Inject ambiguous codes into rows 3, 17 and 42 at the same 3 random sites:
# row 3 gets 21 (B), row 17 gets 23 (X), row 42 gets 0 (unrecognised).
m[c(3, 17, 42), sample(1:10, 3)] <- c(21, 23, 0)
result <- filter_ambiguous_sequences(m, option = 1)
cat(result$NumberAmbiguous, "\n")
#> Number of sequences containing at least one of B, X, Z or J characters is 3
cat(result$RangeAmbiguous, "\n")
#> Number of ambiguous protein characters per sequence varies between 3 and 3
dim(result$FilteredMatrix)
#> [1] 47 10
# Character-mode example.
chr <- matrix(c("M", "K", "T", "I", "I", "X", "K", "T", "I", "I"),
              nrow = 2, byrow = TRUE)
filter_ambiguous_sequences(chr, option = 2)$DeletedSeqId
#> [1] 2