Methodology
Description of method
A detailed description of the methodology is available as Methodolgy.pdf.
Flowcharts and pseudocode are available as flowchart.gif and Pseudocodes.pdf.
The selection of the wordsize is described below, together with results on
how MGAlign performance correlates with genomic sequence length.
Selection of wordsize
MGAlign’s performance was first tested on a random subset
using a range of wordsizes (Z), as shown in the table below. Wordsizes
from 10 to 30 were used to assess the relationship of the wordsize
to the sensitivity and specificity of the first search step. The
genome sequence used was 45 Mbp long, and a total of 50 mRNA sequences
were aligned (a randomly selected subset of the data used
for the comparison). The total number of correctly identified hits
should therefore be 100, two for each mRNA sequence. The results are
in accord with what is expected, i.e. as Z increases, the specificity
increases. Sensitivity is 100% in all cases, so the default Z value
was chosen to give high specificity at the lowest computational time.
A very large wordsize, however, is likely to render the algorithm
incapable of correctly locating matches when the sequences contain
errors. In view of these considerations, Z = 20 has been selected as
the default, to balance the time requirement against specificity.
Experienced users can modify the default wordsize if necessary.
Genome Sequence (Mbp) | Wordsize, Z (bp) | Time Taken (s) | Identified Hits | Correct Hits | Sensitivity | Specificity
45 | 10 | 6798.62 | 2037 | 150 | 100.0% | 7.36%
45 | 12 | 761.69 | 857 | 107 | 100.0% | 12.49%
45 | 14 | 648.09 | 302 | 101 | 100.0% | 33.44%
45 | 16 | 616.58 | 179 | 101 | 100.0% | 56.42%
45 | 18 | 631.79 | 156 | 100 | 100.0% | 64.10%
45 | 20 | 598.22 | 126 | 100 | 100.0% | 79.37%
45 | 22 | 595.83 | 121 | 100 | 100.0% | 82.64%
45 | 24 | 595.06 | 121 | 100 | 100.0% | 82.64%
45 | 26 | 595.26 | 119 | 100 | 100.0% | 84.03%
45 | 28 | 595.00 | 117 | 100 | 100.0% | 85.47%
45 | 30 | 595.64 | 117 | 100 | 100.0% | 85.47%
A random set of 50 mRNA sequences was aligned against
a 45 Mbp fragment of human chromosome 22 genomic sequence. For each
Z value (column 2), the table reports the computational time in seconds
(column 3), the number of successfully extended hits identified, which
define the alignment windows (column 4), the number of correct hits based
on annotation information (column 5), the sensitivity (column 6; the ratio
of the number of correctly identified hits to the number of hits provided
by the annotations) and the specificity (column 7; the ratio of the number
of correctly identified hits to the number of identified hits).
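As a quick check on the specificity column, the definition given above (correct hits divided by identified hits) can be recomputed from the tabulated counts. The following is a minimal sketch, not part of MGAlign; the names `ROWS`, `specificity` and `specificity_by_wordsize` are hypothetical and introduced only for illustration.

```python
# Rows copied from the table above:
# (wordsize Z in bp, time taken in s, identified hits, correct hits).
ROWS = [
    (10, 6798.62, 2037, 150),
    (12, 761.69, 857, 107),
    (14, 648.09, 302, 101),
    (16, 616.58, 179, 101),
    (18, 631.79, 156, 100),
    (20, 598.22, 126, 100),
    (22, 595.83, 121, 100),
    (24, 595.06, 121, 100),
    (26, 595.26, 119, 100),
    (28, 595.00, 117, 100),
    (30, 595.64, 117, 100),
]


def specificity(correct_hits, identified_hits):
    """Specificity as defined in the caption: correct hits / identified hits, in percent."""
    return 100.0 * correct_hits / identified_hits


def specificity_by_wordsize(rows):
    """Map each wordsize Z to its specificity, rounded to two decimals."""
    return {z: round(specificity(correct, identified), 2)
            for z, _time, identified, correct in rows}


if __name__ == "__main__":
    for z, spec in specificity_by_wordsize(ROWS).items():
        print(f"Z = {z:2d}  specificity = {spec:.2f}%")
```

Run as a script, this prints one line per wordsize; the values can be compared directly with column 7 of the table.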
MGAlign performance with genomic length correlation
We ran a separate test with the 50 randomly selected mRNA sequences
(used above in the selection of wordsize), using genomic sequences
of different lengths. For this limited set, the following figure
indicates that the computational savings are directly proportional to
the length of the genomic sequence. All three programs show a linear
relationship between time required and genomic sequence length. The
saving in computational time achieved by MGAlign increases
substantially with the length of the genomic sequence (GS) in relation
to sim4, with modest gains in comparison to Spidey. The speed
enhancements gained by MGAlign shown below may seem small (2.3-2.4
times faster than sim4 or Spidey); however, when large numbers of these
alignments are performed, even a small performance increase is
amplified.
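One way to read the proportionality claim, offered as a sketch under stated assumptions rather than as the authors' analysis: suppose the average time of each program is a linear function of the genomic sequence length L. The intercepts a and slopes b below are hypothetical symbols introduced for illustration; the source gives no fitted values for them.

```latex
T_{p}(L) = a_{p} + b_{p}\,L, \qquad p \in \{\text{MGAlign}, \text{sim4}, \text{Spidey}\}

\Delta T(L) = T_{\text{sim4}}(L) - T_{\text{MGAlign}}(L)
            = (a_{\text{sim4}} - a_{\text{MGAlign}}) + (b_{\text{sim4}} - b_{\text{MGAlign}})\,L
```

If the intercepts are similar, the saving per alignment grows in proportion to L, and over N alignments of comparable length the total saving is roughly N times that amount, which is the amplification referred to above.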
Dependence of computational time required by MGAlign, sim4 and Spidey
on the length of genomic sequence. The y-axis of the plot shows the
average time required (in sec) by the programs while the x-axis shows
the length of the genomic sequence used, with the data tabulated
below. A total of 50 randomly selected mRNA sequences from the dataset
used in the comparison were used for this plot.
|