Monday, April 12, 2010

Incorporating sequence quality data into alignment improves DNA read mapping

Pubmed Direct Link

This paper tries to answer a fundamental question to which every researcher intuitively answers "yes": does incorporating base qualities into alignment improve the alignment?  The strength of the paper lies in its model, while its weakness lies in its empirical and simulated evaluations.

First, the methods section shows a nice Bayesian derivation for updating the log-odds ratios when performing local alignment (dynamic programming, Smith-Waterman, Needleman-Wunsch, Viterbi, whatever).  The assumption is that quality scores are in fact log-scaled and accurate; given the many chemistry/optics/software updates we have seen, this may not be the case.  This method could easily be incorporated into any alignment software (short read etc.) to use quality scores, but most likely won't be.
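
To make the idea concrete, here is a minimal Python sketch of a quality-aware substitution score.  It is not the paper's exact derivation: I am assuming a Phred-scaled quality, sequencing errors spread uniformly over the other three bases, and a toy match/mismatch scoring scheme.  The function names and the `match`, `mismatch` and `scale` parameters (and their defaults) are made up for illustration.

```python
import math

BASES = "ACGT"


def phred_to_error(q):
    """Convert a Phred quality score to an error probability (assumes Q = -10 log10 e)."""
    return 10 ** (-q / 10)


def quality_aware_score(ref_base, read_base, q, match=1.0, mismatch=-1.0, scale=1.0):
    """Log-odds score for aligning read_base (with quality q) to ref_base.

    Assumption: the observed base is the true base with probability 1 - e,
    and each of the other three bases with probability e / 3.
    """
    e = phred_to_error(q)

    def likelihood_ratio(x, z):
        # Turn the plain (quality-free) score back into a likelihood ratio.
        s = match if x == z else mismatch
        return math.exp(s / scale)

    # Likelihood ratio if the base call is correct.
    r_called = likelihood_ratio(ref_base, read_base)
    # Summed likelihood ratios over the three bases it could really have been.
    r_others = sum(likelihood_ratio(ref_base, z) for z in BASES if z != read_base)

    mixed = (1 - e) * r_called + (e / 3) * r_others
    return scale * math.log(mixed)


if __name__ == "__main__":
    for q in (40, 20, 10, 3):
        print(q, quality_aware_score("A", "A", q), quality_aware_score("A", "C", q))
```

With a high quality value the scores stay close to the plain match/mismatch scores; as quality drops, a match is worth less and a mismatch is penalized less, which is the qualitative behavior one would expect from this kind of model.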

Figure 1 shows the error rates from two different Illumina sequencing runs.  They picked the first 100,000 reads, which is not a random sample, and these show very high error rates, at least compared with what I have seen from GAII machines.  Equation 15 and its explanation derive a probability similar to that found in MAQ's supplemental materials.  The authors criticize some of the underlying assumptions, but that is why mapping quality (as defined in the MAQ model) is used, since the MAQ model either incorporates those concerns or at least reasonably explains why it ignores some of them.  I would love to see an empirical estimate of contamination that could be incorporated into mapping quality.
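
For readers who have not seen mapping quality before, here is a rough Python sketch of the general idea: a Phred-scaled probability that the best alignment is wrong.  This is not MAQ's actual computation (MAQ works from the base qualities of mismatches and makes its own approximations); I am assuming we already have a natural-log likelihood for each candidate location, and the function name and the `cap` parameter are hypothetical.

```python
import math


def mapping_quality(log_likelihoods, cap=60):
    """Phred-scaled probability that the best-scoring alignment is wrong.

    log_likelihoods: natural-log likelihoods, one per candidate location
    (an assumed input; real aligners only see a subset of candidates).
    """
    if not log_likelihoods:
        raise ValueError("need at least one candidate alignment")

    best = max(log_likelihoods)
    best_index = log_likelihoods.index(best)

    # Subtract the maximum before exponentiating for numerical stability.
    weights = [math.exp(s - best) for s in log_likelihoods]
    total = sum(weights)

    # Posterior probability that the read really came from somewhere else.
    p_wrong = sum(w for i, w in enumerate(weights) if i != best_index) / total
    if p_wrong == 0:
        return cap
    return min(cap, round(-10 * math.log10(p_wrong)))


if __name__ == "__main__":
    print(mapping_quality([0.0]))         # unique hit
    print(mapping_quality([0.0, 0.0]))    # two equally good hits
    print(mapping_quality([0.0, -10.0]))  # clear best hit
```

An empirical contamination estimate, if one existed, could in principle enter as an extra term in `p_wrong`, but that is speculation on my part.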

The next step is to show, empirically and with simulated data, improved accuracy or power when using quality values (MAQ's results are ignored, but I'll get to this later).  They use their own alignment software rather than adapting established software like BFAST/BWA/BOWTIE/MAQ/SHRIMP.  Basically I want to know what any of those programs would achieve when mapping their dataset.  It may be the case that LAST (the name of the authors' aligner) is not that sensitive or accurate, so incorporating quality scores mitigates this problem, while better known/used aligners do not have this problem and thus quality scores may not help at all, or not as much.  I just can't tell from their results.  They try to assuage this criticism (that they don't use other mappers) by saying they ran their software with a guarantee of up to 2 mismatches (that's one SNP in color space, folks).  Needless to say, I am not impressed, as other software is sensitive up to 10% error regardless of read length (2 mismatches in a 150bp read is not good).

More work is required to truly answer the question of whether quality values significantly improve alignment, rather than just adding the nth decimal place.
