Tuesday, May 11, 2010

The case for cloud computing in genome informatics

The case for cloud computing in genome informatics.

Pubmed
Direct Link

Here's an opinion piece by a giant in the field, Lincoln Stein, arguing for the use of cloud computing. It is a great read for anyone thinking about staying in NGS for the long haul. Quote of the month: "The cost of genome sequencing is now decreasing several times faster than the cost of storage, promising that at some time in the not too distant future it will cost less to sequence a base of DNA than to store it on a hard disk."

I will discuss some of my criticisms and ideas related to the author's points.

First, the author describes the standard model as a hub-and-spoke model, with the archival sites (UCSC, SRA, GenBank, etc.) as the hub and the sequencing centers and bioinformaticians at the ends of the spokes. We all have to submit our NGS data to SRA per NIH policy, but the idea that we submit to SRA and then re-download our own data is ludicrous. All genome centers have their own compute farms that analyze the data, which is obtained directly from the sequencers. Whole groups of bioinformaticians are co-located around these centers for easy access to the data. While other institutions must access the data from SRA, the investigators who generated the data do not. Presumably the original investigators have the most incentive to analyze the data, and will distill their results into a paper and resource. Anyhow, my main point is that sequencing centers have massive clusters of computers and bioinformaticians to analyze their data locally.

The author's solution for giving everyone (not just those who generated the data) easy access to the data is to use the cloud. This requires everyone to transfer their data to the cloud via fiber (the internet) or mail (FedEx-ing hard drives). Bandwidth costs money and is certainly limited. I would be impressed to see one genome per day per institution turn out to be feasible. FedEx would love for us to send hard drives on a daily basis. Two other solutions come to mind.
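To put a rough number on the bandwidth worry, here is a back-of-the-envelope sketch of how long an upload would take. The dataset sizes, link speeds, and the 80% link efficiency are all hypothetical values picked for illustration, not figures from the article, and `transfer_hours` is just a helper name made up for this post.

```python
def transfer_hours(dataset_gb: float, link_mbps: float, efficiency: float = 0.8) -> float:
    """Estimate the hours needed to move dataset_gb gigabytes over a link.

    link_mbps is the nominal link speed in megabits per second.
    efficiency is the assumed fraction of that speed actually achieved.
    """
    if dataset_gb < 0 or link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("need dataset_gb >= 0, link_mbps > 0, and 0 < efficiency <= 1")
    bits = dataset_gb * 8e9                      # decimal gigabytes to bits
    usable_bits_per_second = link_mbps * 1e6 * efficiency
    return bits / usable_bits_per_second / 3600  # seconds to hours


if __name__ == "__main__":
    # Hypothetical dataset sizes and link speeds, for illustration only.
    for size_gb in (100, 1000):
        for mbps in (100, 1000):
            hours = transfer_hours(size_gb, mbps)
            print(f"{size_gb} GB over {mbps} Mbit/s: {hours:.1f} h")
```

If the estimate for one day's worth of data comes out close to or above 24 hours on the link you actually have, the upload can never catch up with the sequencers, which is when shipping drives starts to look attractive.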

The first solution is to house the major genome centers near the cloud, so that a direct pipe could be installed. Even so, if a west-coast center uploads its data to a west-coast cloud, the data still needs to be migrated to the other clouds so that east-coast users can access it on their own cloud (presumably everyone could also access the west-coast cloud directly). Anyhow, this doesn't solve the problem, as migration of data still needs to occur.

The second solution is to use a tried-and-true technology: BitTorrent. BitTorrent is great, although somewhat frowned upon. Having even a modest number of sites hosting this content via BitTorrent would allow fast and efficient distribution of the data, even between clouds. Sites like http://www.biotorrents.net/ already have BitTorrent trackers for bio-datasets.

Cloud computing seems likely to become a major tool for NGS analysis. Nevertheless, the clouds need to be more configurable. Does the cloud have high-memory (1 TB) compute nodes for de novo assembly? Another option would be to pool our money together to get an "NIH" part of the cloud, where the cost of use is subsidized, rather than using for-profit clouds.

The data deluge is an opportunity to tackle some large data processing questions. Given the size and importance of this data, will academia be squeezed out, with industry taking over (as happened with drug development research)?
