Thursday, April 15, 2010

High quality SNP calling using Illumina data at shallow coverage

Pubmed Direct Link

Here's an interesting paper on an integrated approach to SNP calling in NGS data.  It is integrated in that the authors present both an aligner and a SNP caller, co-optimized to reduce false-positive SNP calls.  It is designed for the Illumina platform and uses the four-channel (per-base) probability scores to better inform alignment.  Where have we seen that before?  There are two steps: alignment, then SNP calling.

The idea behind the alignment is to use the first 31 bases of the read to seed the alignment.  This is based on the observation that the first 31 bases are the most reliable in any sequencing run.  The only problem is that if an insertion or deletion occurs within these first 31 bases, the read becomes considerably more difficult to map.  In that case, the rest of the read (everything past the first 31 bases) could be used to map it instead.  If indels (of what length?) are tolerated in the "seed", as in BWA, then there is no problem.  Unfortunately Slider does not support indels, so you will have to re-align the data anyway if you are interested in this common source of variation.  Nevertheless, their merge sort technique is novel among short read aligners.  I'll let you read the original paper for the details.
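
To make the seeding idea concrete, here is a toy sketch of sort-then-merge seed matching.  It is not Slider's implementation: it only does exact matching on called bases, ignores the probability scores entirely, and the names (`reference_kmers`, `map_seeds`) and the tiny `k` in the usage lines are my own assumptions for illustration.

```python
def reference_kmers(reference, k):
    """Return a sorted list of (kmer, position) for every k-mer in the reference."""
    return sorted((reference[i:i + k], i) for i in range(len(reference) - k + 1))


def map_seeds(reads, reference, k=31):
    """Map each read by exact match of its first k bases, using a sorted merge."""
    ref = reference_kmers(reference, k)
    # Sort the read seeds too, so both lists can be walked once in lockstep.
    seeds = sorted((read[:k], idx) for idx, read in enumerate(reads) if len(read) >= k)
    hits = {idx: [] for _, idx in seeds}
    i = 0
    for seed, idx in seeds:
        # Advance the reference pointer past k-mers that sort before this seed.
        while i < len(ref) and ref[i][0] < seed:
            i += 1
        # Collect every reference position whose k-mer equals the seed.
        j = i
        while j < len(ref) and ref[j][0] == seed:
            hits[idx].append(ref[j][1])
            j += 1
    return hits


# Toy usage with k=5 instead of 31 so the example fits on a line.
reference = "ACGTTGCAAGGCTTACGATC"
reads = [
    "TGCAAGGCT",  # exact copy of the reference starting at position 4
    "TGAAGGCTT",  # same locus, but with a 1-base deletion inside the seed
]
hits = map_seeds(reads, reference, k=5)
print(hits[0])  # [4]
print(hits[1])  # []  (the indel in the seed makes the read unmappable)
```

The second read shows the problem described above: a single deletion inside the seed leaves the read with no hit at all, even though the rest of it matches the reference.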

Their SNP caller is more interesting.  It allows for priors in the form of SNP annotation databases, identifies suspect SNPs that could result from paralogs or SVs, uses the per-base probability for each base in the read (so much for base calling), and accounts for the expected polymorphism rate (or homology between the reference and the current genome).  I would have liked to see tools besides MAQ compared, as many other aligners (BFAST/BWA/SHRiMP/Novoalign etc.) are highly sensitive, and there is also a breadth of SNP callers (SAMTools/SOAP/GATK etc.).  It is also unclear whether the improvement demonstrated in the paper is due to the better SNP calling algorithm/model, the aligner, or both.  An easy way to test this is to substitute another aligner or SNP caller.  After filtering out low coverage regions, they also show how a SNP caller can improve concordance by using known SNPs.  Be careful which version of dbSNP you use, since dbSNP 130 includes a lot of crud from NGS data projects and cancer (won't the whole genome and all alleles be present in dbSNP soon if we sequence enough cancer?).
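
As a rough illustration of how a database prior and per-base probabilities can be combined, here is a minimal Bayesian genotype sketch.  This is a generic textbook-style model, not the model from the paper: the diploid likelihood, the function name `genotype_posteriors`, and the rates `snp_rate` and `known_rate` are all assumptions I picked for the example.

```python
from itertools import combinations_with_replacement
from math import exp, log

BASES = "ACGT"


def genotype_posteriors(ref_base, read_probs, snp_rate=0.001,
                        known_snp=False, known_rate=0.1):
    """Posterior over the 10 diploid genotypes at one site.

    read_probs: one dict per read, mapping each base to its probability
    (the four-channel scores), e.g. {"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01}.
    snp_rate / known_rate are illustrative priors, not values from the paper.
    """
    # A site listed in a SNP database gets a much higher prior of being variant.
    rate = known_rate if known_snp else snp_rate
    genotypes = list(combinations_with_replacement(BASES, 2))
    n_variant = len(genotypes) - 1
    log_post = {}
    for a1, a2 in genotypes:
        is_hom_ref = a1 == ref_base and a2 == ref_base
        prior = (1.0 - rate) if is_hom_ref else rate / n_variant
        score = log(prior)
        for probs in read_probs:
            # Each read is drawn from either allele with equal chance.
            score += log(max(0.5 * probs[a1] + 0.5 * probs[a2], 1e-12))
        log_post[a1 + a2] = score
    # Normalise in log space to avoid underflow.
    top = max(log_post.values())
    total = sum(exp(v - top) for v in log_post.values())
    return {g: exp(v - top) / total for g, v in log_post.items()}


# Six reads at shallow coverage: three look like A, three look like G.
read_a = {"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01}
read_g = {"A": 0.01, "C": 0.01, "G": 0.97, "T": 0.01}
reads = [read_a] * 3 + [read_g] * 3

novel = genotype_posteriors("A", reads)
known = genotype_posteriors("A", reads, known_snp=True)
print(max(novel, key=novel.get), max(known, key=known.get))
```

At this depth the heterozygous call only barely beats homozygous reference without the database prior, and becomes far more confident with it, which is the sense in which known SNPs help at shallow coverage.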

Figure 4 is interesting since it shows that SNPs called towards the ends of reads are less trustworthy.  Also, if there is high coverage at a SNP and the SNP still occurs only towards the ends of reads (an unlikely situation anyway), the concordance is extremely low.  Using the position of a SNP within the reads therefore seems warranted.  Nevertheless, aligning without sensitivity to indels will cause false-positive SNPs around where the indel should be placed.  Using read position somewhat mitigates this, under the hypothesis that aligning without indels produces false SNPs at the ends of reads.  I am inclined to agree with this hypothesis from my own observations.  As for the false positive reduction claim, some anecdotal evidence is offered in Figures 6 and 7, but simulations would bolster this claim significantly.  Not only do simulations provide an easy check of the claims, they also help debug the algorithm, so perform simulations when possible.
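
Here is the kind of simulation I have in mind, as a bare-bones sketch.  Everything in it is an assumption for illustration: a haploid random genome, uniform sequencing error, reads placed at their true positions (so no aligner is involved), and a naive majority-vote caller standing in for whatever caller you actually want to test.

```python
import random


def simulate(genome_len=2000, n_snps=20, n_reads=400, read_len=50,
             error_rate=0.01, min_depth=3, seed=0):
    """Plant SNPs, sample noisy reads, call SNPs naively, score against the truth."""
    rng = random.Random(seed)
    reference = [rng.choice("ACGT") for _ in range(genome_len)]

    # Plant known SNPs in the sample genome; this set is the ground truth.
    sample = list(reference)
    truth = set(rng.sample(range(genome_len), n_snps))
    for pos in truth:
        sample[pos] = rng.choice([b for b in "ACGT" if b != reference[pos]])

    # Sample reads at known positions, adding substitution errors.
    pileup = [[] for _ in range(genome_len)]
    for _ in range(n_reads):
        start = rng.randrange(genome_len - read_len + 1)
        for offset in range(read_len):
            base = sample[start + offset]
            if rng.random() < error_rate:
                base = rng.choice([b for b in "ACGT" if b != base])
            pileup[start + offset].append(base)

    # Naive caller: the majority base differs from the reference at enough depth.
    called = set()
    for pos, bases in enumerate(pileup):
        if len(bases) >= min_depth:
            majority = max(sorted(set(bases)), key=bases.count)
            if majority != reference[pos]:
                called.add(pos)

    # Because the truth is known, false positives and negatives are exact counts.
    return {
        "tp": len(called & truth),
        "fp": len(called - truth),
        "fn": len(truth - called),
    }


clean = simulate(error_rate=0.0)
noisy = simulate(error_rate=0.05)
print(clean, noisy)
```

With the error rate set to zero the caller cannot produce a false positive, which makes a handy sanity check before turning the error rate up or swapping in a real aligner and caller.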

Now they need to apply this to SOLiD, add support for indels, and modularize the aligner and SNP caller so that each can take input from and write output to other tools.  Then they will have an intriguing piece of software that could be extremely useful for NGS data.
