Saturday, April 10, 2010

Correction of sequencing errors in a mixed set of reads.

Links:
Pubmed DOI

This paper builds on a tool to correct sequencing errors in Illumina reads. What happens if SOLiD reads, with their powerful two-base encoding, are combined with base-space data, especially for de novo assembly?  Instead of spectral alignment, as performed by assemblers (EULER-SR etc.), the authors use a suffix tree/array representation to remove erroneous suffixes.  The nice part is their observation that correction can be performed in color space: base-space read errors and true SNPs both appear as valid adjacent color differences in color space, while SOLiD sequencing errors appear as single color differences.
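To make that concrete, here is a minimal sketch of the standard SOLiD two-base encoding (each color is the XOR of the 2-bit codes of two adjacent bases); the class name and example sequences are my own illustration, not from the paper:

```java
public class ColorSpaceDemo {
    // Map each base to a 2-bit code: A=0, C=1, G=2, T=3.
    static int code(char b) {
        switch (b) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
            default: throw new IllegalArgumentException("bad base: " + b);
        }
    }

    // Two-base encoding: the color between adjacent bases is the XOR
    // of their 2-bit codes (the standard SOLiD di-base mapping).
    static String toColors(String bases) {
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i < bases.length(); i++) {
            sb.append(code(bases.charAt(i - 1)) ^ code(bases.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String ref = "ACGTACGT";
        String snp = "ACGAACGT"; // single base change: T->A at position 3
        System.out.println(toColors(ref)); // 1313131
        System.out.println(toColors(snp)); // 1320131
        // A true SNP (or a base-space read error translated into color
        // space) changes TWO adjacent colors, while a SOLiD color miscall
        // changes exactly one. That asymmetry is the signal exploited here.
    }
}
```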

The problem with this procedure is that if the SOLiD data were used alone, two valid adjacent color differences would be extremely unlikely to arise by chance.  This is why two-base encoding is so powerful: it is difficult to call false-positive SNPs.  This method now introduces false SNPs from the Illumina reads, undermining that assumption as well as some of the power of two-base encoding.

Furthermore, are true SNPs and indels corrected away?  In a normal (i.e. diploid) individual they should occur in roughly 50% or 100% of the reads, but what about cancer samples, where the cell population may be heterogeneous, not to mention contaminated with normal cells?  There is no analysis describing how this correction affects variant identification.
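A back-of-envelope sketch of why this matters; the purity/clonality model and all names here are illustrative assumptions of mine, not from the paper:

```java
public class AlleleFractionSketch {
    // Expected fraction of reads carrying a heterozygous variant in an
    // impure, heterogeneous sample. purity = fraction of cells that are
    // tumor; clonality = fraction of tumor cells carrying the variant.
    static double expectedVariantFraction(double purity, double clonality) {
        return 0.5 * purity * clonality; // heterozygous: half the alleles
    }

    public static void main(String[] args) {
        // A clean diploid germline het sits at ~50% of reads...
        System.out.println(expectedVariantFraction(1.0, 1.0)); // 0.5
        // ...but a subclonal variant in a 60%-pure tumor can fall to ~9%,
        // low enough that an error-correction step might "fix" it away.
        System.out.println(expectedVariantFraction(0.6, 0.3)); // 0.09
    }
}
```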

The human genome is 3.2 billion bases long, so a sequence must be roughly 18 bases or longer to have a 50% probability of being unique.  Suffix arrays cannot fit on modest-memory machines (4-8GB), as the best suffix array representation implemented requires 5.25 bytes per character (more on that in later posts).  So when a billion reads or more are produced for human genome resequencing, will this method work (it is written in Java, by the way)?
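A rough sketch of the arithmetic behind both claims; the 50 bp read length is an assumption of mine for illustration, and the uniqueness estimate treats the genome as random:

```java
public class SuffixArrayFootprint {
    public static void main(String[] args) {
        double G = 3.2e9; // human genome length in bases

        // Uniqueness: for a random genome, a fixed k-mer is unique with
        // probability ~exp(-G / 4^k); setting that to 50% gives k ~ 16.
        // Repeat content in the real genome pushes the needed length up,
        // toward the 18 bases cited above.
        double k50 = Math.log(G / Math.log(2)) / Math.log(4);
        System.out.printf("k for 50%% uniqueness (random genome): %.1f%n", k50);

        // Memory: at 5.25 bytes per indexed character, a billion 50 bp
        // reads hold 5e10 characters -- far beyond a 4-8 GB machine.
        double chars = 1e9 * 50;
        System.out.printf("suffix array: ~%.0f GB%n", chars * 5.25 / 1e9); // ~262 GB
    }
}
```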

If a reference exists, it should be used, since for humans at least the reference can be assumed to be >99% identical to the genome being sequenced.  Furthermore, alignment algorithms are robust to SNPs and indels, and can correctly call/assemble most of the genome.
I really like the idea of performing an error-correction step before de novo assembly, as this is generally required. I wish there were more discussion of whether this error-correction method corrects away the variants: that is why the sequencing happened in the first place, right?

Verdict:
A good method that requires more evidence and discussion.