Thursday, May 13, 2010

A survey of sequence alignment algorithms for next-generation sequencing

A survey of sequence alignment algorithms for next-generation sequencing.

Pubmed Direct Link

Here's a good review of current (short-read) sequence alignment algorithms applied to next-generation sequencing technology. It describes some of the fundamental design choices made during sequence alignment. Both hash-based and BWT-based algorithms are covered. There is also an interesting discussion of the need for gapped alignment (a shot over the bow of bowtie?). There are also brief discussions of color space reads, long reads, bisulfite-treated reads, spliced reads, re-alignment, and a general overview of popular alignment software.
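
To make the hash-based vs. BWT-based distinction concrete, here is a toy Python sketch (mine, not from the review) of the hash-based seed-and-extend idea: index every k-mer of the reference, then look up read k-mers to get candidate positions. Real aligners add spaced seeds, gapped extension, base qualities, and far better engineering; BWT-based tools like bowtie replace the hash with an FM-index to keep memory down.

    from collections import defaultdict

    def build_kmer_index(reference, k=11):
        """Map every k-mer of the reference to the positions where it occurs."""
        index = defaultdict(list)
        for i in range(len(reference) - k + 1):
            index[reference[i:i + k]].append(i)
        return index

    def seed_candidates(read, index, k=11):
        """Look up each read k-mer and shift hits back to candidate read start positions."""
        candidates = set()
        for offset in range(len(read) - k + 1):
            for pos in index.get(read[offset:offset + k], []):
                candidates.add(pos - offset)
        return sorted(candidates)

    # Tiny made-up example: each candidate would then be verified ("extended")
    # with a banded alignment, which is where gapped alignment comes in.
    reference = "ACGTACGTTAGCCGATTACAGGCATTACGGAT"
    index = build_kmer_index(reference, k=5)
    print(seed_candidates("GCCGATTACA", index, k=5))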

Anyhow, it is a good review so check it out.

Wednesday, May 12, 2010

Detection and characterization of novel sequence insertions using paired-end next-generation sequencing

Detection and characterization of novel sequence insertions using paired-end next-generation sequencing.

Pubmed Direct Link

This paper presents a novel pipeline, called NovelSeq, that aims to find novel insertions using paired-end technology. The motivation is to accurately discover the sequences not present in the human genome reference; an estimated 10-40Mb of sequence is missing from hg18 (etc.).

One caveat of this method is that the novel inserted sequences it detects cannot be similar (homologous) to sequences already present in the reference. One of my own thoughts has been that a lot of mapping errors found in whole-genome re-sequencing data can be attributed to an incomplete reference, where similar sequences are missing from the reference. This method would not help solve that potential problem.

The idea behind this method is to use unpaired reads, or as they call them, one-end anchored (OEA) reads. These paired-end reads have the property that one end maps while the other does not (at least not confidently). Finding a cluster of these unpaired reads may suggest there exists a novel sequence insertion, which can be reconstructed using your favorite de novo assembler (Velvet, Abyss, Euler-SR, etc.). This is something everyone knows should be done, but here is an analysis that actually did it (kudos).
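
For the curious, here is a rough sketch (my own, not the NovelSeq code; the function names and thresholds are made up) of the OEA bookkeeping: take the mapped coordinate of each read whose mate failed to map, and look for piles of such anchors within roughly one insert size of each other. The unmapped mates from a dense cluster are what you would hand to the assembler.

    def cluster_oea_anchors(anchors, max_gap=250, min_support=5):
        """anchors: list of (chrom, position) for the mapped end of each OEA pair.
        Groups anchors on the same chromosome that lie within max_gap of each other."""
        clusters = []
        current = []
        for chrom, pos in sorted(anchors):
            if current and (chrom != current[-1][0] or pos - current[-1][1] > max_gap):
                if len(current) >= min_support:
                    clusters.append(current)
                current = []
            current.append((chrom, pos))
        if len(current) >= min_support:
            clusters.append(current)
        return clusters

    # Hypothetical anchors; in practice these come from the aligner's output.
    anchors = [("chr1", 1000 + 10 * i) for i in range(8)] + [("chr1", 50000)]
    for c in cluster_oea_anchors(anchors):
        print(c[0][0], c[0][1], c[-1][1], len(c), "supporting reads")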

They align with mrFast, cluster unpaired reads with mrCar, locally assemble these clusters into contigs with mrSAAB (a Swede?), and finally merge multiple contigs with mrBIG. Whatever happened to the Mrs? Nothing in the method strikes me as particularly worrisome, and it seems like a reasonable workflow. It is interesting, though, that some of the assembled contigs had >99% similarity to the human reference. This would indicate their mapping was not sensitive enough to sequencing error (mrFast is not mrSensitive?).

They compared to previously published data (fosmid sequencing from Kidd et al.), but due to the small insert size of their library (<250bp) and surrounding repeat elements, they could not anchor all of their contigs. Nevertheless, they found that the human genome reference itself contains some rare variants (deletions in this case). This is to be expected, but it will cause havoc when performing re-sequencing experiments.

One thing that was left out of this paper was simulations. There is no accounting for the false-negative rate or other important factors. The authors tried to use previous data to validate their method, but the imprecision of all these methods leaves me unsatisfied about the true power and accuracy of their approach.

All in all, a nice application of a number of methods and a good pipeline for directly discovering novel insertions.

Tuesday, May 11, 2010

The case for cloud computing in genome informatics

The case for cloud computing in genome informatics.

Pubmed Direct Link

Here's an opinion piece by a "Giant" in the field, Lincoln Stein, arguing for cloud computing. This is a great read for anyone thinking about staying in NGS for the long haul. Quote of the month: "The cost of genome sequencing is now decreasing several times faster than the cost of storage, promising that at some time in the not too distant future it will cost less to sequence a base of DNA than to store it on a hard disk."

I will discuss some of my criticisms and ideas related to the author's points.

Firstly, the author defines the standard model as a hub-and-spoke model, with archival sites (UCSC, SRA, GenBank, etc.) as the hub, and sequencing centers and bioinformaticians at the ends of the spokes. We all have to submit our NGS data to the SRA per NIH policy, but the idea that we submit to the SRA and then re-download the data is ludicrous. All genome centers have their own compute farms that analyze the data, which is obtained directly from the sequencers. Whole groups of bioinformaticians are co-located around these centers for easy access to the data. While other institutions must access the data from the SRA, the investigators who generated the data do not. Presumably the original investigators will distill their results into a paper and a resource, and they have the most incentive to analyze the data. Anyhow, my main point is that sequencing centers have massive clusters of computers and bioinformaticians to analyze their data locally.

The author's solution for giving easy access to the data to everyone (not just those who generated it) is to use the cloud. This requires everyone to transfer their data to the cloud via fiber (the internet) or mail (FedEx-ing hard drives). Bandwidth costs money and is certainly limited. I will be impressed to see how one genome per day per institution will be feasible; FedEx would love us to send hard drives on a daily basis. A quick back-of-the-envelope sketch of the bandwidth problem is below, and then two other solutions come to mind.
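
The sizes and link speeds here are my own ballpark assumptions, not figures from the piece; the point is just how quickly "a genome per day" eats a shared campus link.

    def transfer_hours(dataset_gb, link_mbps, efficiency=0.5):
        """Hours to move dataset_gb over a link_mbps connection at a given
        sustained efficiency (shared links rarely run near line rate)."""
        bits = dataset_gb * 8e9
        return bits / (link_mbps * 1e6 * efficiency) / 3600

    # Made-up but ballpark sizes: a compressed 30x genome vs. rawer run output,
    # over a shared 100 Mbps link vs. a dedicated gigabit link.
    for gb in (100, 500):
        for mbps in (100, 1000):
            print("%d GB over %d Mbps: %.1f hours" % (gb, mbps, transfer_hours(gb, mbps)))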

The first solution is to house the major genome centers near the cloud, so that a direct pipe could be installed. Nonetheless, if the West Coast center uploads its data to its West Coast cloud, then the data needs to be migrated to the other clouds so the East Coast can access it on their cloud (presumably everyone could also access the West Coast cloud). Anyhow, this doesn't solve the problem, as the migration of data still needs to occur.

The second solution is to use a tried-and-true technology: BitTorrent. BitTorrent is great, although somewhat frowned upon. Having even a modest number of sites hosting this content via BitTorrent would allow fast and efficient distribution, even between clouds. Sites like http://www.biotorrents.net/ already have BitTorrent trackers for bio-datasets.

Cloud computing seems like it will become a major tool for NGS analysis. Nevertheless, the clouds need to be more configurable. Does the cloud have high-memory (1TB) compute nodes for de novo assembly? Another option would be to pool our money together to get an "NIH" part of the cloud, rather than using for-profit clouds, with the cost of using this part of the cloud subsidized.

The data deluge is an opportunity to tackle some large data processing questions. Given the size and importance of this data, will academia be squeezed out with industry taking over (like drug development research)?

Monday, May 10, 2010

Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation

Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation.

Pubmed Direct Link

Similar to another paper I blogged about, this paper describes a novel method to analyze NGS data wrapped in a biology coating (mouse genome analysis). I really like this way of presenting a method, since it highlights the motivation for the method's creation: to answer a question in biology.

This paper describes a method called Cufflinks, which uses TopHat to find splice junctions, which uses bowtie for mapping. Here you can see the recursive dependency tree on which this method is built. You must trust the upstream methods first.

Off to the methods section. Just like any Nature paper, the methods are hidden in the supplemental materials. I am glad to see the supplementary methods written in LaTeX using the hyperref package. This makes it easy to move around the document quickly; plus, LaTeX is much easier to write and format in than Word or other programs.

The main contribution is the transcript abundance calculation. I would strongly recommend reading the supplementary methods, as this is a great Bayesian discussion of transcript assembly and abundance estimation. It also covers some nice computer science theorems. The basic idea is to find the minimal set of transcripts that explain the fragments, then quantify each transcript. Seems simple, right?
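
As a toy illustration of the abundance side (this is not Cufflinks' actual statistical model, which also handles fragment lengths, bias, and confidence intervals), here is a bare-bones EM sketch: fragments compatible with several transcripts are assigned fractionally according to the current abundance estimates, and the estimates are then updated from those assignments.

    def em_abundance(compat, n_transcripts, iters=100):
        """compat: for each fragment, the list of transcript ids it is compatible with."""
        abundance = [1.0 / n_transcripts] * n_transcripts
        for _ in range(iters):
            counts = [0.0] * n_transcripts
            for transcripts in compat:
                total = sum(abundance[t] for t in transcripts)
                for t in transcripts:                       # E-step: fractional assignment
                    counts[t] += abundance[t] / total
            total_counts = sum(counts)
            abundance = [c / total_counts for c in counts]  # M-step: re-estimate abundances
        return abundance

    # Hypothetical compatibility lists (fragment -> transcripts it maps to);
    # fragment length effects and sequencing bias are ignored in this toy.
    fragments = [[0], [0], [0, 1], [0, 1], [1, 2], [2]]
    print(em_abundance(fragments, n_transcripts=3))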

Anyhow, take a look at this paper, as Cufflinks (with TopHat (with bowtie)) is a very popular pipeline for RNA-Seq. You should probably understand what it is doing before you use it. Kudos to the authors.

Sunday, May 9, 2010

BS Seeker: precise mapping for bisulfite sequencing

BS Seeker: precise mapping for bisulfite sequencing.

Pubmed Direct Link

Yet another alignment algorithm (YAAA), this time for bisulfite sequencing. This algorithm, developed in the same lab as BS-SEQ (Jacobsen at UCLA) and called BS-SEEKER, purports to be more versatile and faster. I am of the philosophy that an aligner should be "as fast as it needs to be" to achieve the desired level of sensitivity, so the latter claim is not as interesting as the former. Basically, it needs to find what you are looking for first; then let's worry about speed. The abstract claims sensitivity and accuracy, so let's see if it delivers.

Surprisingly, the authors do not implement their own aligner, but instead use bowtie. I do not recommend bowtie, since it does not handle insertions or deletions, which can significantly impact mapping accuracy (see this partial discussion). Nevertheless, bowtie could be swapped out for other algorithms, since the novel contribution of BS-SEEKER is the separation of the four types of BS reads.
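
For readers new to BS data, here is a sketch of the standard in-silico conversion trick that BS-SEEKER (as I understand it) and similar tools build on: convert the reads and the reference in silico, map with an ordinary aligner such as bowtie, then call methylation by comparing the original read back to the original reference. The function names and the tiny example are mine, and the four-read-type bookkeeping for Cokus-style libraries is glossed over here.

    def bisulfite_convert(seq, strand="+"):
        """C->T for reads from the original top strand, G->A for the complementary strand."""
        return seq.replace("C", "T") if strand == "+" else seq.replace("G", "A")

    def call_methylation(read, ref_window):
        """Cytosines in the reference that the read still reports as C resisted
        conversion, i.e. they were methylated."""
        return [i for i, (r, g) in enumerate(zip(read, ref_window)) if g == "C" and r == "C"]

    # Made-up example: unmethylated Cs read out as T; the CpG cytosine stayed a C.
    ref_window = "ACGTCGATCC"
    read       = "ATGTCGATTT"
    print(bisulfite_convert(read))             # the converted sequence that actually gets aligned
    print(call_methylation(read, ref_window))  # position(s) called methylated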

The evaluations were very light on information, mapping reads only to human chromosome 21 and not accounting for higher sequencing error rates or indels. Furthermore, BS-SEEKER is not the best algorithm in every situation according to their own results, with MAQ or RMAP performing better in terms of accuracy/sensitivity/timing in some cases. Their novel contribution is the application of bowtie, but beyond that I am not seeing the value of this method, except if you use the Cokus et al. library construction. More evaluation of this method is necessary to convince others to use it.