Thursday, June 17, 2010

Journal/Author Name Estimator

Although I review papers here to force myself to keep up with current bioinformatics research, I sometimes author manuscripts of my own.  One of the many decisions that must be made when writing a paper is which journal to submit it to.  Sometimes it is easy, since Science and Nature take so many bioinformatics papers, but most of the time there is a trade-off between "time to publication" and "impact factor" (impact factor on my CV, that is).  Anyhow, a fun little tool is Jane, the Journal/Author Name Estimator.   Enter your manuscript title, and it will suggest journals to which you could send your paper.  You can also see authors who have written something similar, which may be a good source of reviewers.  Try it out!

http://biosemantics.org/jane/

Wednesday, June 2, 2010

Direct detection of DNA methylation during single-molecule, real-time sequencing

Direct detection of DNA methylation during single-molecule, real-time sequencing.

Pubmed Direct Link

Here's a paper based on a new DNA sequencing technology: single-molecule, real-time (SMRT) sequencing. Briefly, single-molecule sequencing (SMS) tries to observe DNA one molecule at a time, without the need for (biased) amplification such as PCR; other SMS technologies exist, such as Helicos. Real-time sequencing observes the single DNA molecule in "real time", capturing kinetic information about the polymerase/enzyme as it incorporates nucleotides. What does this mean? These polymerase kinetics (for example, how long the enzyme pauses between incorporations) correlate well with properties of the template DNA, such as methylation.
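To make the kinetic idea concrete, here is a minimal Python sketch (my own illustration, not PacBio's algorithm, and every number is invented): compare per-position interpulse durations (IPDs) observed on the native template against an unmethylated control (e.g., a whole-genome-amplified sample), and flag positions whose IPD ratio is elevated.

# Toy illustration of kinetic detection of base modifications.
# The threshold and all values are assumptions for the sake of example.
from statistics import mean

def ipd_ratios(native_ipds, control_ipds):
    """Per-position mean IPD of native reads divided by that of control reads."""
    return [mean(n) / mean(c) for n, c in zip(native_ipds, control_ipds)]

def flag_modified_positions(native_ipds, control_ipds, threshold=2.0):
    """Positions whose kinetic signal exceeds the (assumed) ratio threshold."""
    return [i for i, r in enumerate(ipd_ratios(native_ipds, control_ipds)) if r >= threshold]

# Toy data: position 2 shows slowed incorporation on the native template.
native  = [[0.4, 0.5], [0.5, 0.6], [1.6, 1.9], [0.5, 0.4]]
control = [[0.5, 0.4], [0.5, 0.5], [0.5, 0.6], [0.5, 0.5]]
print(flag_modified_positions(native, control))  # [2]

The point is simply that, with real-time observation, methylation calling reduces to comparing kinetic signals against an amplified (and therefore unmethylated) control.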

This type of data certainly opens up a new area of Bioinformatics, but is also already being patented (see http://www.wipo.int with publication number "WO/2010/059235"). I'll let you read the brief description of the Bioinformatics used in this paper and the lengthy patent application.

Thursday, May 13, 2010

A survey of sequence alignment algorithms for next-generation sequencing

A survey of sequence alignment algorithms for next-generation sequencing.

Pubmed Direct Link

Here's a good review of current (short-read) sequence alignment algorithms applied to next-generation sequencing technology. It describes some of the fundamental design choices made during sequence alignment. Both hash-based and BWT-based algorithms are covered. There is also an interesting discussion of the need for gapped alignment (a shot over the bow of Bowtie?). There are also brief discussions of color-space reads, long reads, bisulfite-treated reads, spliced reads, re-alignment, and a general overview of popular alignment software.
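To give a feel for the BWT-based side, here is a toy Python sketch (purely illustrative, not any aligner's actual code) of the FM-index backward search that tools like BWA and Bowtie are built on, using a naive suffix array and exact matching only:

# Toy FM-index: naive suffix array, BWT, and exact backward search.

def build_bwt(text):
    """Build the BWT of text (with a '$' sentinel) via a naive suffix array."""
    text = text + "$"
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # text[-1] handles the sa[i] == 0 case
    return bwt

def build_fm_index(bwt):
    """Precompute C (count of smaller characters) and Occ (prefix occurrence counts)."""
    chars = sorted(set(bwt))
    c_table, total = {}, 0
    for ch in chars:
        c_table[ch] = total
        total += bwt.count(ch)
    occ = {ch: [0] * (len(bwt) + 1) for ch in chars}
    for i, ch in enumerate(bwt):
        for c in chars:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return c_table, occ

def backward_search(pattern, c_table, occ, n):
    """Count exact occurrences of pattern in the indexed text."""
    lo, hi = 0, n
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo

bwt = build_bwt("ACGTACGTTACG")
c_table, occ = build_fm_index(bwt)
print(backward_search("ACG", c_table, occ, len(bwt)))  # 3 exact hits

Real aligners add compressed, sampled indexes, inexact matching, and seeding heuristics on top of this core, while the hash-based tools instead index k-mer seeds and extend candidate hits with dynamic programming.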

Anyhow, it is a good review so check it out.

Wednesday, May 12, 2010

Detection and characterization of novel sequence insertions using paired-end next-generation sequencing

Detection and characterization of novel sequence insertions using paired-end next-generation sequencing.

Pubmed Direct Link

This paper presents a novel pipeline, called NovelSeq, that aims to find novel insertions using paired-end technology. The motivation is to accurately discover sequences not present in the human genome reference; an estimated 10-40 Mb of sequence is missing from hg18 (and other reference builds).

One caveat of this method is that the novel inserted sequences it detects cannot be similar (homologous) to sequences already present in the reference. One of my own thoughts has been that many of the mapping errors found in whole-genome re-sequencing data can be attributed to an incomplete reference, where similar sequences are missing from the reference. This method would not help solve that potential problem.

The idea behind this method is to use unpaired reads, or as they call them, one-end anchored (OEA) reads. These paired-end reads have the property that one end maps while the other does not (at least not confidently). Finding a cluster of these unpaired reads suggests there may be a novel sequence insertion nearby, which can then be reconstructed using your favorite de novo assembler (Velvet, Abyss, Euler-SR, etc.). This is something everyone knows should be done, but here is an analysis that actually did it (kudos).
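As a rough sketch of the OEA idea (this is not the NovelSeq code; the file name, mapping-quality cutoff, and clustering parameters are my own assumptions), one could collect the anchored mates with pysam and group them greedily by position:

# Collect one-end-anchored (OEA) anchors and cluster them by position.
import pysam

def collect_oea_anchors(bam_path, min_mapq=20):
    """Yield (chrom, pos) for confidently mapped reads whose mate is unmapped."""
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam:
            if (read.is_paired and not read.is_unmapped
                    and read.mate_is_unmapped and read.mapping_quality >= min_mapq):
                yield read.reference_name, read.reference_start

def cluster_anchors(anchors, max_gap=500, min_support=5):
    """Greedily group anchors on the same chromosome that lie within max_gap of each other."""
    clusters, current = [], []
    for chrom, pos in sorted(anchors):
        if current and (chrom != current[-1][0] or pos - current[-1][1] > max_gap):
            if len(current) >= min_support:
                clusters.append(current)
            current = []
        current.append((chrom, pos))
    if len(current) >= min_support:
        clusters.append(current)
    return clusters

# clusters = cluster_anchors(collect_oea_anchors("sample.bam"))

The unmapped mates belonging to each cluster are what would then be handed to the de novo assembler.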

They align with mrFast, cluster unpaired reads with mrCar, locally assemble these clusters into contigs with mrSAAB (a Swede?), and finally merge multiple contigs with mrBIG. Whatever happened to the Mrs? Nothing in the method strikes me as particularly worrisome, and it seems like a reasonable workflow. It is interesting, though, that some of the assembled contigs had >99% similarity to the human reference, which would indicate their mapping was not sensitive enough to sequencing error (mrFast is not mrSensitive?).

They compared to previously published data (fosmid sequencing from Kidd et al.), but because of the small insert size of their library (<250 bp), repeat elements prevented them from anchoring all of their contigs. Nevertheless, they found that the human genome reference itself contains some rare variants (deletions in this case). This is to be expected, but it will cause havoc when performing re-sequencing experiments.

One thing that was left out of this paper was simulations. There is no accounting for the false-negative rate or other important factors. The authors tried to use previous data to validate their method, but the imprecision of all these methods leaves me unsatisfied as to the true power and accuracy of their approach.

All in all, a nice application of a number of methods and a good pipeline for directly discovering novel insertions.

Tuesday, May 11, 2010

The case for cloud computing in genome informatics

The case for cloud computing in genome informatics.

Pubmed Direct Link

Here's an opinion piece by a "Giant" in the field, Lincoln Stein, arguing for the use of cloud computing. This is a great read for anyone thinking about staying in NGS for the long haul. Quote of the month: "The cost of genome sequencing is now decreasing several times faster than the cost of storage, promising that at some time in the not too distant future it will cost less to sequence a base of DNA than to store it on a hard disk."

I will discuss some of my criticisms and ideas on the topic, related to the author's points.

Firstly, the author describes the standard model as a hub-and-spoke model, with archival sites (UCSC, SRA, GenBank, etc.) at the hub and sequencing centers and bioinformaticians at the ends of the spokes. We all have to submit our NGS data to SRA per NIH policy, but the idea that we submit to SRA and then re-download the data is ludicrous. All genome centers have their own compute farms that analyze the data, which is obtained directly from the sequencers. Whole groups of Bioinformaticians are co-located around these centers for easy access to the data. While other institutions must access the data from SRA, the investigators who generated the data do not; presumably the original investigators will distill their results into a paper and a resource, and they have the most incentive to analyze the data. Anyhow, my main point is that sequencing centers have massive compute clusters and Bioinformaticians to analyze their data locally.

The author's solution for giving everyone (not just those who generated the data) easy access is to use the cloud. This requires everyone to transfer their data to the cloud, either over fiber (the internet) or by mail (FedEx-ing hard drives). Bandwidth costs money and is certainly limited: I will be impressed if transferring one genome per day per institution proves feasible, and FedEx would love us to ship hard drives on a daily basis.
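As a rough back-of-the-envelope check (every number below is my own assumption for illustration, not a figure from the article), here is how long it takes to move one genome's worth of raw reads at a few link speeds:

# Back-of-the-envelope transfer times; all figures are assumptions.
def transfer_hours(data_gb, link_mbps, efficiency=0.5):
    """Hours to move data_gb gigabytes over a link of link_mbps megabits per second."""
    bits = data_gb * 8e9
    return bits / (link_mbps * 1e6 * efficiency) / 3600

# Assume one genome's reads plus quality values occupy roughly 300 GB,
# and that protocol overhead and contention halve the nominal bandwidth.
for mbps in (100, 1000, 10000):
    print(f"{mbps:>5} Mbit/s link: {transfer_hours(300, mbps):5.1f} hours per genome")

On a shared 100 Mbit/s campus uplink that works out to more than half a day per genome, and even a dedicated gigabit link spends over an hour per genome. With numbers like these, two other solutions come to mind.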

The first solution is to house the major genome centers near the cloud, so that a direct pipe could be installed. Nonetheless, if a west-coast center uploads its data to a west-coast cloud, the data still needs to be migrated to the other clouds so that the east coast can access it on theirs (presumably everyone could also just access the west-coast cloud). Anyhow, this doesn't solve the problem, as the migration of data still needs to occur.

The second solution is to use a tried-and-true technology: BitTorrent. BitTorrent is great, although somewhat frowned upon. Having even a modest number of sites hosting this content via BitTorrent would allow fast and efficient distribution, even between clouds. Sites like http://www.biotorrents.net/ already run BitTorrent trackers for biological datasets.

Cloud computing seems like it will become a major tool for NGS analysis. Nevertheless, the clouds need to be more configurable: does the cloud have high-memory (1 TB) compute nodes for de novo assembly? Another option would be to pool our money to build a subsidized "NIH" portion of the cloud, rather than using for-profit clouds.

The data deluge is an opportunity to tackle some large data-processing questions. Given the size and importance of this data, will academia be squeezed out, with industry taking over (as in drug development research)?