What is discontiguous Mega BLAST?
This version of Mega BLAST is designed specifically for comparing diverged
sequences, especially sequences from different organisms, whose alignments
have a low degree of identity and for which the original Mega BLAST is not very
effective. The major difference is the use of the 'discontiguous word'
approach to finding the initial offset pairs from which the gapped extension is
then performed.
Both Mega BLAST and all previous versions of nucleotide-nucleotide BLAST look
for exact matches of a certain length as the starting points for gapped
alignments. When comparing less conserved sequences, for example when the
expected identity between them is 80% or below, this traditional approach
becomes much less productive than it is at higher degrees of conservation.
Depending on the length of the exact match used to start the alignments, it
either misses many statistically significant alignments or, on the contrary,
finds too many short random alignments.
According to [1], as well as our own probability simulations, if initial
'words' are based not on an exact match but on a match at a certain set of
nonconsecutive positions within longer segments of the sequences, the
productivity of the word finding algorithm is much higher. Fewer words are
found overall, but more of them end up producing statistically significant
alignments than with contiguous words whose length equals, or is even shorter
than, the number of matched positions in the discontiguous word.
As an example, we can define a pattern (template) of 0s and 1s of length 21:
100101100101100101101. For each pair of offsets in the query and subject
sequences being compared, we compare the 21-nucleotide segments of these
sequences ending at these offsets, and require a match only at those positions
of the segments that correspond to the 1s in the above template.
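The comparison described above can be sketched as follows. This is only an
illustration, not the BLAST implementation: the function name 'template_hit'
and the convention that offsets are 0-based indices of the last nucleotide of
each segment are assumptions made for this example.

```python
TEMPLATE = "100101100101100101101"  # the 21-position example template from the text


def template_hit(query, subject, q_end, s_end, template=TEMPLATE):
    """Return True if the segments ending at q_end and s_end (0-based,
    inclusive) agree at every position marked 1 in the template."""
    length = len(template)
    q_start = q_end - length + 1
    s_start = s_end - length + 1
    if q_start < 0 or s_start < 0 or q_end >= len(query) or s_end >= len(subject):
        return False  # the segment would run off one of the sequences
    return all(
        query[q_start + i] == subject[s_start + i]
        for i, flag in enumerate(template)
        if flag == "1"
    )


query = "ACGTACGTACGTACGTACGTA"
# Differs from the query only at indices 1 and 2, which carry 0s in the template.
subject = "ATCTACGTACGTACGTACGTA"
print(template_hit(query, subject, 20, 20))
```

The two segments are not identical, so no contiguous 21-mer match exists here,
yet the pair still qualifies as a word hit because the differences fall on
positions the template ignores.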
There are several advantages to this approach. First, the conditional
probabilities of finding word hits satisfying discontiguous templates, given the
expected identity percentage in the alignments between two sequences, are higher
than for contiguous words with the same number of positions required to match.
If two word hits are required to initiate a gapped extension, the effect
of the discontiguous word approach is even larger. In both cases higher
sensitivity is achieved because there is less correlation between successive
words as the database sequence is scanned across the query sequence.
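The sensitivity claim can be explored with a small Monte Carlo estimate. The
sketch below assumes an ungapped alignment in which every position matches
independently with probability equal to the expected identity, which is a
simplification; the alignment length of 64, the identity of 70%, the number of
trials, and the function names are illustrative choices, not values taken from
BLAST.

```python
import random


def has_hit(matches, template):
    """True if some window of the alignment matches at every 1 of the template."""
    ones = [i for i, flag in enumerate(template) if flag == "1"]
    span = len(template)
    return any(
        all(matches[start + i] for i in ones)
        for start in range(len(matches) - span + 1)
    )


def hit_rate(template, identity=0.7, length=64, trials=2000, seed=0):
    """Estimate the chance that a random ungapped alignment contains a word hit."""
    rng = random.Random(seed)  # fixed seed: both templates see the same alignments
    hits = 0
    for _ in range(trials):
        matches = [rng.random() < identity for _ in range(length)]
        hits += has_hit(matches, template)
    return hits / trials


contiguous = "1" * 11                 # exact 11-mer
discontiguous = "111010010110010111"  # W = 11, t = 18, non-coding
print(hit_rate(contiguous), hit_rate(discontiguous))
```

Both words require 11 matching positions, so any single window is equally
likely to match; the difference in the estimated rates comes from how strongly
neighbouring windows overlap.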
Second, when comparing coding sequences, conservation of the third nucleotide
of each codon is not essential, so there is no need to require it when matching
initial words. This favors templates based on the '110' pattern, which are
called 'coding' templates.
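To see which codon positions a template actually constrains, one can count its
1s by position modulo 3. The helper below is a hypothetical illustration and
assumes the template happens to start on the first nucleotide of a codon; in a
real search the template is applied at every offset, so this holds only for
in-frame placements.

```python
def ones_per_codon_phase(template):
    """Count the 1s of a template at each of the three codon positions,
    assuming the template starts on the first nucleotide of a codon."""
    counts = [0, 0, 0]
    for i, flag in enumerate(template):
        if flag == "1":
            counts[i % 3] += 1
    return counts


# W = 11, t = 16 coding template: no match is required at third codon positions.
print(ones_per_codon_phase("1101101101101101"))  # [6, 5, 0]
```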
Finally, to achieve even higher sensitivity, one might combine two different
discontiguous word templates and require either of them to match at a given
position for it to qualify as an initial word hit.
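Combining templates amounts to a logical OR over the individual template
tests. The sketch below pairs the W = 11, t = 16 coding and non-coding
templates purely as an example; the text does not say which templates are
combined in practice, and the function names and 0-based end offsets are
assumptions.

```python
def matches_template(query, subject, q_end, s_end, template):
    """True if the segments ending at q_end and s_end (0-based, inclusive)
    agree at every position marked 1 in the template."""
    q_start = q_end - len(template) + 1
    s_start = s_end - len(template) + 1
    if q_start < 0 or s_start < 0 or q_end >= len(query) or s_end >= len(subject):
        return False
    return all(
        query[q_start + i] == subject[s_start + i]
        for i, flag in enumerate(template)
        if flag == "1"
    )


def any_template_hit(query, subject, q_end, s_end, templates):
    """A position is a word hit if at least one of the templates matches there."""
    return any(matches_template(query, subject, q_end, s_end, t) for t in templates)


CODING = "1101101101101101"
NON_CODING = "1110010110110111"

query = "ACGTACGTACGTACGT"
subject = "ACGAACGTACGTACGT"  # differs at index 3: a 1 in CODING, a 0 in NON_CODING
print(matches_template(query, subject, 15, 15, CODING))            # False
print(any_template_hit(query, subject, 15, 15, (CODING, NON_CODING)))  # True
```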
The following options specific to this approach are supported:
Template length: 16, 18, 21.
Word size (i.e. number of 1s in the template): 11, 12.
Template type: coding, non-coding.
Require two words for extension: yes/no.
The 'coding' templates are based on the 110 pattern, although most of them
require more 0s, so some of the 110 triplets become 010 or 100. These templates
are the most effective for comparison of coding regions.
The non-coding templates attempt to minimize the correlation between successive
words when the database sequence is shifted by 4 positions against the query
sequence. As a result, more 1s are concentrated at the ends of the template (at
least 3 on each side).
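One plausible way to look at this correlation is to count how many required
positions a template shares with a copy of itself shifted by a given amount:
the fewer shared positions, the less two successive words depend on the same
nucleotides. This measure and both helper names are assumptions made for
illustration; the text does not state how the templates were actually scored.

```python
def shared_positions(template, shift):
    """Number of 1s that coincide when the template is overlaid on a copy
    of itself shifted by `shift` positions."""
    return sum(
        a == "1" and b == "1"
        for a, b in zip(template, template[shift:])
    )


def end_runs(template):
    """Lengths of the runs of 1s at the start and at the end of the template."""
    head = len(template) - len(template.lstrip("1"))
    tail = len(template) - len(template.rstrip("1"))
    return head, tail


non_coding = "1110010110110111"  # W = 11, t = 16, non-coding
contiguous = "1" * 11            # exact 11-mer, for comparison
print(shared_positions(non_coding, 4), shared_positions(contiguous, 4))  # 4 7
print(end_runs(non_coding))  # (3, 3)
```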
When the option to require two words for extension is chosen, two word hits
matching the template must be found within a distance of 50 nucleotides of one
another.
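The two-word requirement can be sketched as a simple distance test. The
function below is hypothetical: it assumes the offsets passed in are word hits
that are already known to be comparable (for example, lying on the same
diagonal) and that "within 50 nucleotides" means a difference of at most 50,
neither of which is spelled out in the text.

```python
def has_two_hits(hit_offsets, max_distance=50):
    """True if any two distinct word hits lie within max_distance nucleotides
    of one another."""
    ordered = sorted(set(hit_offsets))
    # After sorting, the closest pair is always a pair of neighbours.
    return any(b - a <= max_distance for a, b in zip(ordered, ordered[1:]))


print(has_two_hits([10, 55]))   # True: the hits are 45 nucleotides apart
print(has_two_hits([10, 61]))   # False: the hits are 51 nucleotides apart
```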
Below are the exact discontiguous word template patterns for each combination
of word size (W), template length (t), and template type:
W = 11, t = 16, coding: 1101101101101101
W = 11, t = 16, non-coding: 1110010110110111
W = 12, t = 16, coding: 1111101101101101
W = 12, t = 16, non-coding: 1110110110110111
W = 11, t = 18, coding: 101101100101101101
W = 11, t = 18, non-coding: 111010010110010111
W = 12, t = 18, coding: 101101101101101101
W = 12, t = 18, non-coding: 111010110010110111
W = 11, t = 21, coding: 100101100101100101101
W = 11, t = 21, non-coding: 111010010100010010111
W = 12, t = 21, coding: 100101101101100101101
W = 12, t = 21, non-coding: 111010010110010010111
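The patterns above can be checked mechanically: each should have exactly t
positions and exactly W ones. The dictionary layout and the function name in
this sketch are assumptions chosen for the example; the patterns themselves
are copied from the list above.

```python
# (word size W, template length t, type) -> pattern, as listed above
TEMPLATES = {
    (11, 16, "coding"): "1101101101101101",
    (11, 16, "non-coding"): "1110010110110111",
    (12, 16, "coding"): "1111101101101101",
    (12, 16, "non-coding"): "1110110110110111",
    (11, 18, "coding"): "101101100101101101",
    (11, 18, "non-coding"): "111010010110010111",
    (12, 18, "coding"): "101101101101101101",
    (12, 18, "non-coding"): "111010110010110111",
    (11, 21, "coding"): "100101100101100101101",
    (11, 21, "non-coding"): "111010010100010010111",
    (12, 21, "coding"): "100101101101100101101",
    (12, 21, "non-coding"): "111010010110010010111",
}


def inconsistent_templates(templates):
    """Return the keys whose pattern disagrees with its stated W or t."""
    return [
        key
        for key, pattern in templates.items()
        if len(pattern) != key[1]
        or pattern.count("1") != key[0]
        or set(pattern) - {"0", "1"}
    ]


print(inconsistent_templates(TEMPLATES))  # []
```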
[1] Ma, B., Tromp, J., Li, M., "PatternHunter: faster and more sensitive
homology search", Bioinformatics 2002 Mar;18(3):440-5