Thursday, April 15, 2010

High quality SNP calling using Illumina data at shallow coverage

Pubmed Direct Link

Here's an interesting paper on an integrated approach to SNP calling in NGS data.  It is integrated in that they present both an aligner and a SNP caller co-optimized to reduce false positive SNP calls.  It is designed for the Illumina platform and uses the four-channel (base) probability scores to better inform alignment.  Where have we seen that before?  There are two steps: alignment, then SNP calling.

The idea behind the alignment is to use the first 31 bases of the read to seed the alignment, based on the observation that the first bases are the most reliable in any sequencing run.  The problem is that if an insertion (or deletion) occurs within these first 31 bases, the read becomes considerably more difficult to map.  If an indel occurs within the first 31 bases, then the rest of the read (not the first 31) could be used to map the read.  If indels of some length were tolerated in the "seed", as in BWA, this would not be a problem.  Unfortunately Slider does not support indels, so you will have to re-align the data anyway if you are interested in this common source of variation.  Nevertheless, their merge sort technique is unique and novel among short read aligners; see the sketch below, and I'll let you read the original paper for the details.
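To make the merge sort idea concrete, here is a minimal Python sketch of the general approach: sort the read seeds and the reference k-mers, then find candidate placements in a single merge pass. This is my own illustration of the technique, not Slider's actual code; the data structures are assumptions.

```python
# Minimal sketch of merge-join seed matching (illustrative, not Slider's code).
# Both seed lists are sorted, then scanned together in a single merge pass.

SEED_LEN = 31  # seed on the first 31 bases of each read

def reference_seeds(ref):
    """Yield (kmer, position) for every SEED_LEN-mer in the reference."""
    for pos in range(len(ref) - SEED_LEN + 1):
        yield ref[pos:pos + SEED_LEN], pos

def merge_join(reads, ref):
    """Find candidate alignment positions by merging two sorted seed lists."""
    read_seeds = sorted((read[:SEED_LEN], i) for i, read in enumerate(reads))
    ref_seeds = sorted(reference_seeds(ref))
    hits = []  # (read_index, reference_position)
    i = j = 0
    while i < len(read_seeds) and j < len(ref_seeds):
        if read_seeds[i][0] < ref_seeds[j][0]:
            i += 1
        elif read_seeds[i][0] > ref_seeds[j][0]:
            j += 1
        else:
            # Matching seeds: pair this read seed with every equal reference seed.
            k = j
            while k < len(ref_seeds) and ref_seeds[k][0] == read_seeds[i][0]:
                hits.append((read_seeds[i][1], ref_seeds[k][1]))
                k += 1
            i += 1
    return hits
```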

Their SNP caller is more interesting.  It allows for priors in the form of SNP annotation databases, identifies suspect SNPs that could result from paralogs or SVs, uses the per-base probability for each base in the read (so much for base calling), and models the expected polymorphism rate (or homology between the reference and the current genome).  I would have liked to see other tools besides MAQ being compared, as many other aligners (BFAST/BWA/SHRiMP/Novoalign etc.) are highly sensitive, and there is also a breadth of SNP callers (SAMtools/SOAP/GATK etc.).  It is also unclear whether the improvement demonstrated in the paper is due to the better SNP calling algorithm/model, the aligner, or both.  An easy way to test this is to substitute another aligner or SNP caller.  After filtering out low coverage regions, they also show how a SNP caller can improve concordance by using known SNPs.  Be careful which version of dbSNP you use, since dbSNP 130 includes a lot of crud from NGS data projects and cancer (won't the whole genome and all alleles be present soon in dbSNP if we sequence enough cancer?).
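As a toy illustration of how per-base probabilities and a prior polymorphism rate can be combined, here is a naive Bayesian call at a single site. The two-hypothesis genotype model and the default prior are my assumptions, not the paper's exact model.

```python
import math

def snp_posterior(bases, quals, ref_base, alt_base, prior=0.001):
    """Naive posterior that a site is a (homozygous) SNP, given observed
    bases and Phred qualities.  Illustrative only; a real caller also
    models heterozygotes, mapping quality, strand, etc."""
    log_ref = math.log(1.0 - prior)   # prior for the reference allele
    log_alt = math.log(prior)         # prior polymorphism rate
    for base, q in zip(bases, quals):
        p_err = 10.0 ** (-q / 10.0)   # Phred: P(base call is wrong)
        # P(observed base | true allele), spreading errors over 3 other bases
        log_ref += math.log(1.0 - p_err if base == ref_base else p_err / 3.0)
        log_alt += math.log(1.0 - p_err if base == alt_base else p_err / 3.0)
    # Normalize over the two hypotheses
    m = max(log_ref, log_alt)
    return math.exp(log_alt - m) / (math.exp(log_ref - m) + math.exp(log_alt - m))

# Example: five reads cover the site, four support the alternate allele
print(snp_posterior("TTTAT", [30, 30, 20, 25, 30], ref_base="A", alt_base="T"))
```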

Figure 4 is interesting since it shows that SNPs called towards the ends of reads are less trustworthy.  Also, if there is high coverage at a SNP and the supporting bases still occur towards the ends of reads, the concordance is extremely low (not likely anyway).  Using the SNP's position within the reads therefore seems warranted.  Nevertheless, alignment that is not sensitive to indels will cause false-positive SNPs around where the indel should be placed, which fits the hypothesis that aligning without indel support produces false SNPs at the ends of reads.  I am inclined to agree with this hypothesis from my own observations.  As for the false positive reduction claim, some anecdotal evidence is given by Figures 6 and 7, but simulations would bolster this claim significantly.  Not only does simulation provide an easy check of the claims, it also helps debug the algorithm, so perform simulations when possible.
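A crude version of such a positional filter is easy to write; the sketch below is my own, with arbitrary thresholds, and simply drops calls whose support sits almost entirely near read ends.

```python
def fraction_near_ends(offsets, read_len, margin=5):
    """Fraction of supporting bases within `margin` bp of either read end.
    offsets: position of the SNP base within each supporting read."""
    near = sum(1 for o in offsets if o < margin or o >= read_len - margin)
    return near / len(offsets)

def keep_snp(offsets, read_len, margin=5, max_fraction=0.8):
    """Drop a call if nearly all its support comes from read ends.
    The 5 bp margin and 0.8 cutoff are arbitrary; illustration only."""
    return fraction_near_ends(offsets, read_len, margin) <= max_fraction
```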

Now they need to apply this to SOLiD, add support for indels, and modularize the aligner and SNP caller to allow input/output to/from other tools; then they would have an intriguing piece of software that could be extremely useful for NGS data.

Wednesday, April 14, 2010

SeqAnswers.com

Here's a plug for SeqAnswers.com.  It is a great site for beginner and expert NGS users and developers to meet, help out, and discuss all issues related to next-generation sequencing technology.  It has been around for a few years, so search the threads with Google by adding "site:seqanswers.com" to your query to search only SeqAnswers.com.

Monday, April 12, 2010

Incorporating sequence quality data into alignment improves DNA read mapping

Pubmed Direct Link

This paper tries to answer a fundamental question to which every researcher intuitively says "yes": does incorporating base qualities into alignment improve the alignment?  The strength of the paper lies in its model, while its weakness lies in its empirical and simulated evaluations.

Firstly, the methods section shows a nice Bayesian derivation for updating the log-odds ratios when performing local alignment (dynamic programming, Smith-Waterman, Needleman-Wunsch, Viterbi, whatever).   The assumption is that quality scores are in fact log-scaled and accurate, which, as we have seen with the many chemistry/optics/software updates, may not be the case.  This method could be easily incorporated into any alignment software (short read etc.) to use quality scores, but most likely won't be.
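The core of the idea can be reduced to one line: instead of a fixed match/mismatch score, the substitution score becomes a log-odds ratio that folds in the probability that the base call is wrong. Below is a minimal sketch in my own notation, assuming a simple substitution prior; it captures the flavor, not the paper's exact derivation.

```python
import math

def quality_adjusted_score(read_base, ref_base, qual, p_same=0.99, bg=0.25):
    """Log-odds substitution score folding in the Phred base quality.
    p_same: prior prob the sampled base equals the reference base (assumed);
    p_err = 10^(-q/10): prob the base call is wrong (uniform over 3 bases)."""
    p_err = 10.0 ** (-qual / 10.0)
    if read_base == ref_base:
        # Called base matches the reference: a correct call of a matching base,
        # or a miscall of one of the 3 non-matching bases back to the reference.
        p_obs = p_same * (1.0 - p_err) + (1.0 - p_same) * (p_err / 3.0)
    else:
        # Called base differs: a true substitution called correctly, a matching
        # base miscalled, or a different substitution miscalled to this base.
        p_sub = (1.0 - p_same) / 3.0
        p_obs = (p_sub * (1.0 - p_err)
                 + p_same * (p_err / 3.0)
                 + 2.0 * p_sub * (p_err / 3.0))
    return math.log(p_obs / bg)  # odds against a uniform background model
```

A low-quality mismatch then costs far less than a high-quality one, which is exactly the behavior the paper argues for.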

Figure 1 shows the error rates from two different Illumina sequencing runs.  They picked the first 100,000 reads, which is not a random sampling, and these show very high error rates, at least relative to what I have seen from GAII machines.  Equation 15 and its explanation derive a probability similar to that found in MAQ's supplementary materials.   The authors criticize some of the underlying assumptions, but that is why mapping quality (as defined in the MAQ model) is used, since it incorporates, or at least reasonably explains why it ignores, some of those assumptions.  I would love to see an empirical estimation of contamination that could be incorporated into mapping quality.
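For readers unfamiliar with the MAQ model, mapping quality is roughly the Phred-scaled posterior probability that the reported position is wrong, computed over all candidate positions. A sketch of that flavor (the exact model is in MAQ's supplement; this is a simplification):

```python
import math

def mapping_quality(alignment_log_likelihoods):
    """MAQ-style mapping quality: -10*log10 P(reported position is wrong),
    where P(correct) = best likelihood / sum over all candidate positions."""
    m = max(alignment_log_likelihoods)
    probs = [math.exp(ll - m) for ll in alignment_log_likelihoods]
    p_best = max(probs) / sum(probs)
    p_wrong = max(1.0 - p_best, 1e-30)  # cap to avoid log10(0)
    return min(-10.0 * math.log10(p_wrong), 255)
```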

Next, they try to show, empirically and with simulated data, improved accuracy or power when using quality values (MAQ's results are ignored, but I'll get to this later).  They use their own alignment software rather than adapting evolved software like BFAST/BWA/Bowtie/MAQ/SHRiMP.  Basically I want to know, given their dataset, what any of those tools would achieve when mapping the data.  It may be the case that LAST (the name of the authors' aligner) is not that sensitive or accurate, so that incorporating quality scores mitigates this problem, while better known/used aligners do not have this problem in the first place and thus quality scores may not help at all, or as much.  I just don't know given their results.  They try to assuage this criticism (that they don't use other mappers) by saying their software guarantees finding hits with up to 2 mismatches (that's one SNP in color space, folks).  Needless to say, I am not impressed, as other software is sensitive up to 10% error regardless of read length (2 mismatches in a 150bp read is not good).

More work is required to truly answer the question of whether quality values significantly improve alignment, rather than just adding the nth decimal place.

Sunday, April 11, 2010

Structural Variation Analysis with Strobe Reads.

A very interesting paper on SV detection with "strobe reads", which are (or will be) produced by single molecule sequencers (SMS).

Links:
Pubmed Direct Link

The traditional way to detect structural variants (SVs) in next-generation sequencing data is to either search for discordant mate pairs or to explicitly look for SV breakpoints by performing a split-read analysis. In the former, DNA fragments are sequenced from both ends with some gap (insert) in between. If, after mapping, the two ends fall outside the expected insert size distribution and/or in an unexpected orientation, they are called discordant. In the latter, a longer-ish read is split into two parts, with each part mapped independently and then tested to see whether the two ends map to different locations (a poor man's discordant read). The benefit of the former is that, given sufficient coverage, the clone coverage (the coverage implied by spanning events using the insert) is quite high and gives good power, although the exact breakpoint is difficult to resolve. The benefit of the latter is that if a breakpoint is detected, it is detected by directly sequencing through it, although the complexity of the genome makes this type of analysis quite difficult.
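The discordance test in the first approach amounts to a couple of comparisons per pair. A toy sketch, assuming a standard forward/reverse library and thresholds of my choosing:

```python
def is_discordant(pos1, strand1, pos2, strand2, mean_insert, sd_insert, n_sd=3):
    """Flag a mate pair as discordant: wrong relative orientation, or an
    implied insert size outside mean +/- n_sd standard deviations.
    Toy sketch; a real caller also handles chromosomes, clipping, etc."""
    if strand1 == strand2:            # mates should map to opposite strands
        return True
    insert = abs(pos2 - pos1)         # implied insert size
    return abs(insert - mean_insert) > n_sd * sd_insert
```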

A strobe read is produced by a single molecule sequencer (SMS), for example those from Pacific Biosciences (PacBio) or Life Technologies (Starlight), neither of which has been released. Strobe sequencing from PacBio, in a proof of principle, gives 2-4 sub-reads of 50-200 bases over a range of 10 kb from a contiguous fragment. The method discussed here seeks to minimize the number of potential SVs implied by aligning the 2-4 sub-reads. This is solved by formulating the problem as a graph optimization problem and using integer linear programming (as I dust off CLR to refresh my linear programming).

The beauty of the method is that, while typically each read is aligned independently, here the multiple candidate alignments for each sub-read forming a read, as well as those across all reads, are examined together to minimize the expected number of SVs. As SVs have a small prior, this is a good idea. Not going into the specifics of the various simplifications that yield performance improvements, previous computer science methods are readily applied to this problem to yield a method that sensitively finds SVs with strobe reads; a toy version of the formulation is sketched below.
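To give the flavor of the ILP (this is not the paper's exact objective or variables), here is a toy formulation in Python with PuLP: pick exactly one candidate alignment per sub-read while minimizing the number of SV events the chosen alignments imply.

```python
# Toy ILP in the spirit of the paper's formulation (not their exact model).
# Each sub-read has candidate alignments; a candidate is either concordant
# or implies one putative SV event.  Choose one alignment per sub-read and
# minimize the number of SV events switched on.
import pulp

# candidates[r] = list of (name, implied_sv_event_or_None); toy data
candidates = {
    "r1": [("r1_a", None), ("r1_b", "del_chr1_100k")],
    "r2": [("r2_a", "del_chr1_100k"), ("r2_b", "inv_chr2_5k")],
}
events = {"del_chr1_100k", "inv_chr2_5k"}

prob = pulp.LpProblem("min_sv", pulp.LpMinimize)
x = {name: pulp.LpVariable(name, cat="Binary")
     for cands in candidates.values() for name, _ in cands}
y = {e: pulp.LpVariable(e, cat="Binary") for e in events}

prob += pulp.lpSum(y.values())  # objective: fewest SV events

for r, cands in candidates.items():
    prob += pulp.lpSum(x[name] for name, _ in cands) == 1  # one alignment each
    for name, event in cands:
        if event is not None:
            prob += x[name] <= y[event]  # picking this alignment implies the event

prob.solve()
print([name for name in x if x[name].value() == 1],
      [e for e in y if y[e].value() == 1])
```

Here the solver is free to route both reads through the same deletion, so a single shared SV event explains all the data; that is exactly the "small prior on SVs" intuition at work.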

Always a sucker for ROC curves, I note that strobe SV detection fares quite well, in fact better than using paired-end reads or mixed read libraries. They do acknowledge some lack of power when searching for SVs in cancer data, since there all bets are off. I can't wait for real SMS data!