Wednesday, May 12, 2010

Detection and characterization of novel sequence insertions using paired-end next-generation sequencing

Pubmed Direct Link

This paper presents a novel pipeline, called NovelSeq, that aims to find novel insertions using paired-end sequencing. The motivation is to accurately discover sequences not present in the human genome reference; an estimated 10-40Mb of sequence is missing from hg18.

One caveat of this method is that the novel inserted sequences it detects cannot be similar (homologous) to sequences already present in the reference. One of my own thoughts has been that many mapping errors found in whole-genome re-sequencing data can be attributed to an incomplete reference, where similar sequences are missing from the reference. This method would not help solve that potential problem.

The idea behind this method is to use unpaired reads, or as they call them, one-end anchored (OEA) reads. These paired-end reads have the property that one end maps while the other does not (at least not confidently). Finding a cluster of these unpaired reads suggests there may be a novel sequence insertion nearby, which can be reconstructed using your favorite de novo assembler (Velvet, Abyss, Euler-SR, etc.). This is something everyone knows should be done, but here is an analysis that actually did it (kudos). A rough sketch of collecting and clustering OEA reads follows below.
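To make the OEA idea concrete, here is a minimal sketch (my own, not the NovelSeq code) of pulling OEA anchors out of a BAM file with pysam and binning them into candidate clusters. The file name, window size, and support threshold are all assumptions for illustration.

```python
# Sketch: collect one-end anchored (OEA) pairs and bin them into
# candidate insertion sites. Not the NovelSeq implementation.
import pysam
from collections import defaultdict

WINDOW = 500  # assumed cluster window; tune to your library insert size

def collect_oea(bam_path):
    """Yield (chrom, position, read name) for the mapped end of OEA pairs."""
    bam = pysam.AlignmentFile(bam_path, "rb")
    for read in bam:
        # OEA: this end is mapped with confidence, its mate is not.
        if read.is_paired and not read.is_unmapped and read.mate_is_unmapped:
            yield read.reference_name, read.reference_start, read.query_name
    bam.close()

def cluster_oea(anchors):
    """Bucket anchors into WINDOW-sized bins; each bin is a candidate site."""
    clusters = defaultdict(list)
    for chrom, pos, name in anchors:
        clusters[(chrom, pos // WINDOW)].append(name)
    return clusters

if __name__ == "__main__":
    clusters = cluster_oea(collect_oea("sample.bam"))  # hypothetical file
    for (chrom, bucket), names in sorted(clusters.items()):
        if len(names) >= 4:  # assumed minimum read support
            print(chrom, bucket * WINDOW, len(names))
```

In practice you would then fetch the unmapped mates' sequences for each cluster and hand those reads to the assembler to build the inserted contig.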

They align with mrFast, cluster unpaired reads with mrCar, locally assemble these clusters into contigs with mrSAAB (a Swede?), and finally merge multiple contigs with mrBIG. Whatever happened to the Mrs? Nothing in the method strikes me as particularly worrisome, and it seems like a reasonable workflow. It is interesting, though, that some of the assembled contigs had >99% similarity to the human reference. This would indicate their mapping was not sensitive enough to sequencing error (mrFast is not mrSensitive?). A quick filter for such reference-like contigs is sketched below.
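As a sanity check on that last point, here is a hedged sketch (not from the paper) of flagging assembled contigs that align back to the reference at high identity, using standard BLAST tabular output (-outfmt 6). The thresholds and file names are assumptions.

```python
# Sketch: flag contigs that are likely reference sequence, not novel.
# Default -outfmt 6 columns: qseqid sseqid pident length mismatch
#   gapopen qstart qend sstart send evalue bitscore

MIN_IDENTITY = 99.0   # flag near-perfect matches to the reference
MIN_ALN_FRAC = 0.9    # assumed: alignment must cover most of the contig

def flag_reference_like(blast_tab, contig_lengths):
    """Return IDs of contigs aligning to the reference at high identity."""
    flagged = set()
    with open(blast_tab) as fh:
        for line in fh:
            fields = line.rstrip("\n").split("\t")
            qseqid = fields[0]
            pident, aln_len = float(fields[2]), int(fields[3])
            if (pident >= MIN_IDENTITY
                    and aln_len >= MIN_ALN_FRAC * contig_lengths[qseqid]):
                flagged.add(qseqid)
    return flagged
```

Contigs that survive this filter are the ones worth calling novel; the rest point back at mapping sensitivity problems.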

They compared to previously published data (fosmid sequencing from Kidd et al.), but due to the small insert size of their library (<250bp), they could not anchor all of their contigs because of repeat elements. Nevertheless, they found that the human genome reference contains some rare variants (deletions in this case). This is to be expected, but it will cause havoc when performing re-sequencing experiments.

One thing that was left out of this paper was simulations. There is no accounting for the false-negative rate or other important factors. The authors tried to use previous data to validate their method, but the imprecision of all these methods leaves me unsatisfied about the true power and accuracy of their approach. Even a simple simulation, like the sketch below, would go a long way.
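For what it's worth, here is a rough sketch (my suggestion, not the authors') of planting random novel insertions into a reference and scoring the false-negative rate. All parameters and names are illustrative.

```python
# Sketch: plant random insertions, then score how many the pipeline missed.
import random

random.seed(42)
BASES = "ACGT"

def plant_insertions(ref_seq, n=100, ins_len=500):
    """Insert n random sequences into ref_seq; return new seq and truth set.

    Inserting at positions in descending order keeps each recorded
    position in the original reference's coordinates.
    """
    positions = sorted(random.sample(range(len(ref_seq)), n), reverse=True)
    truth, seq = [], ref_seq
    for pos in positions:
        novel = "".join(random.choice(BASES) for _ in range(ins_len))
        seq = seq[:pos] + novel + seq[pos:]
        truth.append((pos, novel))
    return seq, truth

def false_negative_rate(truth, calls, tol=100):
    """Fraction of planted insertions with no call within tol bp."""
    missed = sum(1 for pos, _ in truth
                 if not any(abs(pos - c) <= tol for c in calls))
    return missed / len(truth)

# Usage: simulate reads from the modified sequence (e.g. with a read
# simulator such as wgsim), run the pipeline against the original
# reference, then: fnr = false_negative_rate(truth, pipeline_calls)
```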

All in all, a nice application of a number of methods and a good pipeline for directly discovering novel insertions.
