Thursday, June 17, 2010

Journal/Author Name Estimator

Although I review papers here to force myself to stay familiar with current bioinformatics research, I sometimes author manuscripts of my own.  One of the many decisions that need to be made when writing a paper is which journal to submit to.  Sometimes it is easy, since Science and Nature take so many bioinformatics papers, but most of the time there is a trade-off between "time to publication" and "impact factor" (impact factor on my CV, that is).  Anyhow, a fun little tool is Jane, the Journal/Author Name Estimator.  Enter your manuscript title, and it will advise you on which journal you should send your paper to.  You can also see authors who have written something similar, which may be a good source of reviewers.  Try it out!

http://biosemantics.org/jane/

Wednesday, June 2, 2010

Direct detection of DNA methylation during single-molecule, real-time sequencing

Direct detection of DNA methylation during single-molecule, real-time sequencing.

Pubmed Direct Link

Here's a paper based on a new DNA sequencing technology: single-molecule, real-time (SMRT) sequencing. Briefly, single-molecule sequencing (SMS) tries to observe DNA one molecule at a time, without the need for (biased) amplification like PCR; Helicos is one existing example. Real-time sequencing observes that single DNA molecule in "real time", capturing kinetic information about the polymerase/enzyme as it incorporates nucleotides. What does this mean? The polymerase kinetics correlate well with properties of the DNA template, such as methylation, which is how this technology can detect methylation directly.
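
To make that concrete, here is a minimal sketch of the kind of downstream analysis this enables: compare interpulse durations (IPDs) at each reference position between a native sample and an unmodified control, and flag positions where the polymerase slows down. The data layout, coverage cutoff, and simple z-score test are my own illustrative assumptions, not the authors' pipeline.

```python
# Sketch: flag positions where polymerase kinetics deviate from an unmodified
# control -- the signal SMRT sequencing uses to detect methylation.
# The input layout and the z-score cutoff are illustrative assumptions.
import math
from statistics import mean, stdev

def flag_modified_positions(native_ipds, control_ipds, z_cutoff=3.0, min_cov=5):
    """native_ipds/control_ipds: dict mapping position -> list of interpulse durations."""
    hits = []
    for pos, native in native_ipds.items():
        control = control_ipds.get(pos, [])
        if len(native) < min_cov or len(control) < min_cov:
            continue  # too little coverage to say anything
        pooled_sd = stdev(native + control)
        if pooled_sd == 0:
            continue
        z = (mean(native) - mean(control)) / (pooled_sd / math.sqrt(len(native)))
        if z > z_cutoff:  # polymerase pauses longer at modified template bases
            hits.append((pos, mean(native) / mean(control), z))
    return hits
```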

This type of data certainly opens up a new area of Bioinformatics, but is also already being patented (see http://www.wipo.int with publication number "WO/2010/059235"). I'll let you read the brief description of the Bioinformatics used in this paper and the lengthy patent application.

Thursday, May 13, 2010

A survey of sequence alignment algorithms for next-generation sequencing

A survey of sequence alignment algorithms for next-generation sequencing.

Pubmed Direct Link

Here's a good review of current (short-read) sequence alignment algorithms applied to next-generation sequencing technology. It describes some of the fundamental design choices made during sequence alignment. Both hash-based and BWT-based algorithms are covered. There is also an interesting discussion of the need for gapped alignment (a shot over the bow of bowtie?), along with brief discussions of color-space reads, long reads, bisulfite-treated reads, spliced reads, re-alignment, and a general overview of popular alignment software.
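
For readers new to the area, here is a toy version of the hash-based design the review describes: index every k-mer of the reference, look up seeds from the read, and collect candidate positions to extend. The k-mer size and the absence of gapped extension are simplifications of mine, not recommendations from the review.

```python
# Toy hash-based seed lookup: real aligners add spaced seeds, gapped
# (Smith-Waterman) extension, and quality-aware scoring on top of this.
from collections import defaultdict

def build_index(reference, k=11):
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def candidate_positions(read, index, k=11):
    candidates = set()
    for offset in range(len(read) - k + 1):
        for ref_pos in index.get(read[offset:offset + k], []):
            candidates.add(ref_pos - offset)  # implied start of the read
    return sorted(p for p in candidates if p >= 0)
```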

Anyhow, it is a good review so check it out.

Wednesday, May 12, 2010

Detection and characterization of novel sequence insertions using paired-end next-generation sequencing

Detection and characterization of novel sequence insertions using paired-end next-generation sequencing.

Pubmed Direct Link

This paper presents a novel pipeline, called NovelSeq, that aims to find novel insertions using paired-end technology. The motivation is to accurately discover sequences not present in the human genome reference; an estimated 10-40 Mb of sequence is missing from hg18 (etc.).

One caveat of this method is that the detected novel insertions cannot be similar (homologous) to sequences already present in the reference. One of my own thoughts has been that a lot of mapping errors in whole-genome re-sequencing data can be attributed to an incomplete reference, where similar sequences are missing from the reference. This method would not help solve that particular problem.

The idea behind this method is to use unpaired reads, or, as they call them, one-end anchored (OEA) reads. These paired-end reads have the property that one end maps while the other does not (at least not confidently). Finding a cluster of these unpaired reads suggests there may be a novel sequence insertion, which can then be reconstructed using your favorite de novo assembler (Velvet, ABySS, EULER-SR, etc.). This is something everyone knows should be done, but here is an analysis that actually did it (kudos).
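
Here is a rough sketch of how one might pull OEA pairs out of a BAM file and group their anchored ends into clusters, assuming pysam is available; the window size and minimum support are illustrative values of mine, not NovelSeq's parameters. The unmapped mates collected this way would then be handed to the assembler.

```python
# Sketch: collect one-end-anchored (OEA) pairs and cluster the anchored
# coordinates; a well-supported cluster hints at a nearby novel insertion.
import pysam
from collections import defaultdict

def oea_clusters(bam_path, window=500, min_support=5):
    anchors = defaultdict(list)  # (chrom, bin) -> anchored start positions
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam:
            if read.is_paired and not read.is_unmapped and read.mate_is_unmapped:
                key = (read.reference_name, read.reference_start // window)
                anchors[key].append(read.reference_start)
    return [(chrom, min(positions), max(positions), len(positions))
            for (chrom, _), positions in anchors.items()
            if len(positions) >= min_support]
```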

They align with mrFAST, cluster unpaired reads with mrCar, locally assemble these clusters into contigs with mrSAAB (a Swede?), and finally merge multiple contigs with mrBIG. Whatever happened to the Mrs? Nothing in the method strikes me as particularly worrisome, and it seems like a reasonable workflow. It is interesting, though, that some of the assembled contigs had >99% similarity to the human reference. This would indicate their mapping was not sensitive enough to sequencing error (mrFAST is not mrSensitive?).

They compared to previously published data (fosmid sequencing from Kidd et al.), but because of the small insert size of their library (<250bp) they could not anchor all of their contigs, owing to repeat elements. Nevertheless, they found that some rare variants (deletions, in this case) are present in the human genome reference. This is to be expected, but it will cause havoc when performing re-sequencing experiments.

One thing left out of this paper is simulation. There is no accounting for the false-negative rate or other important factors. The authors tried to use previous data to validate their method, but the imprecision of all these methods leaves me unsatisfied about its true power and accuracy.

All in all, a nice application of a number of methods and a good pipeline for directly discovering novel insertions.

Tuesday, May 11, 2010

The case for cloud computing in genome informatics

The case for cloud computing in genome informatics.

Pubmed Direct Link

Here's an opinion piece arguing for cloud computing, written by a "Giant" in the field: Lincoln Stein. This is a great read for anyone thinking about staying in NGS for the long haul. Quote of the month: "The cost of genome sequencing is now decreasing several times faster than the cost of storage, promising that at some time in the not too distant future it will cost less to sequence a base of DNA than to store it on a hard disk."

I will discuss some of my criticisms and ideas on the topic, related to the author's points.

Firstly, the author describes the standard model as a hub-and-spoke model, with archival sites (UCSC, SRA, GenBank, etc.) at the hub and sequencing centers and bioinformaticians at the ends of the spokes. We all have to submit our NGS data to the SRA per NIH policy, but the idea that we submit to the SRA and then re-download the data is ludicrous. All genome centers have their own compute farms that analyze the data, which is obtained directly from the sequencers. Whole groups of Bioinformaticians are co-located around these centers for easy access to the data. While other institutions must access the data from the SRA, the investigators who generated the data do not. Presumably the original investigators will distill their results into a paper and a resource, and they have the most incentive to analyze the data. Anyhow, my main point is that sequencing centers have massive clusters of computers and Bioinformaticians to analyze their data locally.

The author's solution for giving easy access to the data to all (not just those who generated it) is to use the cloud. This requires everyone to transfer their data to the cloud via fiber (the internet) or mail (FedEx-ing hard drives). Bandwidth costs money and is certainly limited; I will be impressed to see how one genome per day per institution will be feasible, and FedEx would love us to send hard drives on a daily basis.
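
A back-of-envelope calculation makes the bandwidth concern concrete; the file size, link speed, and utilization below are assumptions of mine, not figures from the article.

```python
# Back-of-envelope: time to push one genome's worth of aligned data to the
# cloud. All numbers are illustrative assumptions.
genome_bam_gb = 200     # roughly a 30x human BAM
link_gbps = 0.1         # an effective 100 Mbit/s path to the cloud
utilization = 0.5       # real-world throughput is rarely the line rate

hours = (genome_bam_gb * 8) / (link_gbps * utilization) / 3600
print(f"{hours:.1f} hours per genome")  # ~8.9 hours at these settings
```

One genome per day squeaks by at these numbers; a center producing dozens of genomes per day (or shipping raw intensity data) does not.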

Two other solutions come to mind. The first is to house the major genome centers near the cloud, so that a direct pipe could be installed. Nonetheless, if the west-coast center uploads its data to its west-coast cloud, the data still needs to be migrated to the other clouds so that the east coast can access it on their cloud (presumably everyone could also just access the west-coast cloud). Anyhow, this doesn't solve the problem, as migration of the data still needs to occur.

The second solution is to use a tried-and-true technology: BitTorrent. BitTorrent is great, although somewhat frowned upon. Having even a modest number of sites host this content via BitTorrent would allow fast and efficient distribution, even between clouds. Sites like http://www.biotorrents.net/ already have BitTorrent trackers for bio-datasets.

Cloud computing seems like it will become a major tool for NGS analysis. Nevertheless, the clouds need to be more configurable. Does the cloud have high-memory (1TB) compute nodes for de novo assembly? Another option would be to pool our money together to get an "NIH" part of the cloud, rather than using for-profit clouds, where the cost of using that part of the cloud is subsidized.

The data deluge is an opportunity to tackle some large data processing questions. Given the size and importance of this data, will academia be squeezed out with industry taking over (like drug development research)?

Monday, May 10, 2010

Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation

Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation.

Pubmed Direct Link

Similar to another paper I blogged about, this paper describes a novel method to analyze NGS data wrapped in a biology coating (mouse genome analysis). I really like this way of presenting a method, since it highlights the motivation for the method's creation: to answer a question in biology.

This paper describes a method called Cufflinks, which uses TopHat to find splice junctions, which in turn uses bowtie for mapping. Here you can see the recursive dependency tree on which this method is built: you must trust the upstream methods first.

Off to the methods section. Just like any Nature paper, the methods are hidden in the supplementary materials. I am glad to see the supplementary methods written in LaTeX using the hyperref package. This makes it easy to move around the document quickly, plus LaTeX is much easier to write and format in than Word or other programs.

The main contribution is the transcript abundance calculation. I would strongly recommend reading the supplementary methods, as they are a great Bayesian discussion of transcript assembly and abundance estimation, and they also cover some nice computer science theorems. The basic idea is to find the minimal set of transcripts that explains the fragments, then quantify each transcript. Seems simple, right?
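
Cufflinks solves this as a minimum path cover over a graph of compatible fragments; the greedy set-cover sketch below is only meant to convey the parsimony objective, and is emphatically not the actual algorithm.

```python
# Crude parsimony sketch: choose as few candidate transcripts as possible such
# that every observed fragment is compatible with at least one chosen
# transcript. This greedy set cover only illustrates the objective.
def minimal_transcript_set(fragments, candidates):
    """fragments: set of fragment ids; candidates: dict mapping a candidate
    transcript to the set of fragment ids compatible with it."""
    uncovered, chosen = set(fragments), []
    while uncovered:
        best = max(candidates, key=lambda t: len(candidates[t] & uncovered))
        gained = candidates[best] & uncovered
        if not gained:
            break  # remaining fragments fit no candidate transcript
        chosen.append(best)
        uncovered -= gained
    return chosen
```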

Anyhow, take a look at this paper, as Cufflinks (with TopHat (with bowtie)) is a very popular pipeline for RNA-Seq. You should probably understand what it is doing before you use it. Kudos to the authors.

Sunday, May 9, 2010

BS Seeker: precise mapping for bisulfite sequencing

BS Seeker: precise mapping for bisulfite sequencing

Pubmed Direct Link

Yet another alignment algorithm (YAAA), this time for bisulfite sequencing. This algorithm, developed in the same lab as BS-Seq (Jacobsen at UCLA) and called BS-SEEKER, purports to be more versatile and faster. I am of the philosophy that an aligner should be "as fast as it needs to be" to achieve the desired level of sensitivity, so the latter claim is not as interesting as the former. Basically, it needs to find what you are looking for first; then let's worry about speed. The abstract claims sensitivity and accuracy, so let's see if it delivers.

Surprisingly, the authors do not implement their own aligner but instead use bowtie. I do not recommend bowtie, since it does not handle insertions or deletions, which can significantly impact mapping accuracy (see this partial discussion). Nevertheless, bowtie should be replaceable by other algorithms, since the novel contribution of BS-SEEKER is the separation of the four types of BS reads.
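
For context, the usual trick for mapping bisulfite reads with a standard aligner is to collapse the C/T ambiguity: convert Cs to Ts in both reads and reference (handling the reverse strand analogously), align, then compare the original read back to the reference to call methylation. Below is a minimal sketch of that general trick only; it is not BS-SEEKER's code, and it ignores the reverse strand and the four read types.

```python
# Sketch of in-silico bisulfite conversion plus per-cytosine methylation calls.
# A reference C read as C was protected (methylated); read as T, it was
# converted (unmethylated). Reverse-strand handling is omitted for brevity.
def ct_convert(seq):
    return seq.upper().replace("C", "T")

def methylation_calls(original_read, ref_segment):
    calls = []
    for i, (r, g) in enumerate(zip(original_read.upper(), ref_segment.upper())):
        if g == "C":
            if r == "C":
                calls.append((i, "methylated"))
            elif r == "T":
                calls.append((i, "unmethylated"))
            else:
                calls.append((i, "mismatch"))
    return calls
```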

The evaluations were very light on information, mapping reads only to human chromosome 21 and not accounting for higher sequencing error rates or indels. Furthermore, BS-SEEKER is not the best algorithm in every situation according to their own results, with MAQ or RMAP performing better in some cases in terms of accuracy/sensitivity/timing.  Their novel contribution is the application of bowtie, but beyond that I am not seeing the value of this method, except perhaps if you use the Cokus et al. library construction. More evaluation of this method is necessary to convince others to use it.

Saturday, May 8, 2010

Genome-wide mapping and assembly of structural variant breakpoints in the mouse genome

Genome-wide mapping and assembly of structural variant breakpoints in the mouse genome.

Pubmed Direct Link

This paper presents a new method, called HYDRA, to detect structural variant (SV) breakpoints.  Since it is a Genome Research (GR) paper, it focuses on answering a question in biology, namely detecting SVs in the mouse genome; I will focus on the method.

SVs are detected by clustering discordant matepairs.  Matepairs are pairs of reads from the same DNA fragment with some base distance (the insert size) between the ends, and discordant pairs are those for which the mapped insert size falls outside the expected distribution.  HYDRA uses multiple mappings per read, where each read may map to several locations, albeit with possibly different quality/probability, and then selects the correct mapping for each read jointly across all reads, as parsimoniously as possible.  This allows SVs to be detected in regions with segmental duplications, copy number changes, or other repetitive or difficult-to-map regions, which is an extremely important feature.  A minor point is how to store multiple mappings, as a whole-genome human re-sequencing BAM file may be hundreds of gigabytes even with only one mapping per read.

Nevertheless, HYDRA does not report what type of SV occurs, only that an SV exists.  Further analysis needs to be performed to determine the type (insertion/deletion/CNV/inversion/etc.); the authors suggest using BEDTools (their own software) to do this.  It remains to be seen whether software will emerge that automates the whole process.
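
For reference, here is a minimal sketch of the discordant-pair flagging that such methods start from; the cutoffs are arbitrary choices of mine, and HYDRA's actual contribution (jointly choosing among multiple mappings per read) is not shown.

```python
# Sketch: flag mate pairs whose insert size or orientation is unexpected.
# Cutoffs (median +/- n MADs) are illustrative, not HYDRA's parameters.
from statistics import median

def discordant_pairs(pairs, n_mads=10):
    """pairs: list of (insert_size, proper_orientation) for mapped mate pairs."""
    sizes = [size for size, _ in pairs]
    med = median(sizes)
    mad = median(abs(size - med) for size in sizes) or 1
    lo, hi = med - n_mads * mad, med + n_mads * mad
    return [(size, ok) for size, ok in pairs
            if not ok or not (lo <= size <= hi)]
```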

Wednesday, April 21, 2010

Next-Generation Instrument Google Map

Next-Generation Instrument Google Map

http://map.seqanswers.com

A map of NGS instruments from around the world can be found by following the link above.  This map is far from complete, so if you know of a sequencing facility that is not included, please add it to the map.  It is amazing how many sequencers are in the field compared to only a few years ago.  It is also surprising that, given this global sequencing capacity, fewer than 20 whole human genomes have been published.  There are countless other fun projects that do not involve human re-sequencing, but would it not be fun to truly have the 1000 Genomes Project completed to a high depth (more than 60X)?

I would say it is a pretty poor showing from Africa and Antarctica, although BGI and BIG in China are not well reported (128 Illumina HiSeqs, anyone?).  The sequencer farthest from any other seems to be the one located at the "Institute for Chemical Biology and Fundamental Medicine".

Thursday, April 15, 2010

High quality SNP calling using Illumina data at shallow coverage

High quality SNP calling using Illumina data at shallow coverage

Pubmed Direct Link

Here's an interesting paper on an integrated approach to SNP calling in NGS data.  It is integrated in that they present both an aligner and a SNP caller co-optimized for fewer false-positive SNP calls.  It is designed for the Illumina platform and uses the four-channel (base) probability scores to better inform alignment.  Where have we seen that before?  There are two steps: alignment, then SNP calling.

The idea behind the alignment is to use the first 31 bases of the read to seed the alignment, based on the observation that the first 31 bases are the most reliable in any sequencing run.  The only problem is that if an insertion or deletion occurs within those first 31 bases, the read becomes considerably more difficult to map; in that case, the rest of the read (not the first 31 bases) could be used to map it instead.  If indels were tolerated in the "seed" (and at what length?), as in BWA, there would be no problem.  Unfortunately Slider does not support indels, so you will have to re-align the data anyway if you are interested in this common source of variation.  Nevertheless, their merge sort technique is unique among short-read aligners.  I'll let you read the original paper.
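
To make the merge sort idea concrete, here is a toy merge-join between sorted reference 31-mers and sorted read seeds. This is my own simplification; as I understand it, Slider actually works from the per-base probability values rather than called bases, which is not reflected here.

```python
# Toy merge-join seed matching: sort reference k-mers and read seeds once,
# then walk both sorted lists in a single pass, avoiding a large hash table.
def merge_join_seeds(reference, reads, k=31):
    ref_kmers = sorted((reference[i:i + k], i) for i in range(len(reference) - k + 1))
    seeds = sorted((read[:k], rid) for rid, read in enumerate(reads) if len(read) >= k)
    i = j = 0
    hits = []
    while i < len(ref_kmers) and j < len(seeds):
        if ref_kmers[i][0] < seeds[j][0]:
            i += 1
        elif ref_kmers[i][0] > seeds[j][0]:
            j += 1
        else:
            kmer = seeds[j][0]
            positions = [pos for s, pos in ref_kmers[i:] if s == kmer]
            while j < len(seeds) and seeds[j][0] == kmer:
                hits.append((seeds[j][1], positions))  # (read id, ref positions)
                j += 1
            while i < len(ref_kmers) and ref_kmers[i][0] == kmer:
                i += 1
    return hits
```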

Their SNP caller is more interesting.  It allows for priors in the form of SNP annotation databases, identifies suspect SNPs that could result from paralogs or SVs, uses the per-base probability for each base in the read (so much for base calling), and accounts for the expected polymorphism rate (or homology between the reference and the current genome).  I would have liked to see tools besides MAQ being compared, as many other aligners (BFAST/BWA/SHRiMP/Novoalign, etc.) are highly sensitive, and there is also a breadth of SNP callers (SAMtools/SOAP/GATK, etc.).  It is also unclear whether the improvement demonstrated in the paper is due to the better SNP-calling algorithm/model, the aligner, or both; an easy way to test this would be to substitute another aligner or SNP caller.  After filtering out low-coverage regions, they also show how a SNP caller can improve concordance by using known SNPs.  Be careful which version of dbSNP you use, since dbSNP 130 includes a lot of crud from NGS data projects and cancer (won't the whole genome and all alleles be present in dbSNP soon if we sequence enough cancer?).
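
As a generic illustration (and not the model in the paper) of how per-base error probabilities and a prior polymorphism rate combine at a single site:

```python
# Sketch: posterior probability of a homozygous alternate allele at one site,
# given observed bases, their Phred qualities, and a prior SNP rate.
# Heterozygous genotypes are omitted for brevity; the priors and error model
# are generic illustrations, not the paper's.
import math

def snp_posterior(observed_bases, quals, ref_base, alt_base, prior_snp=1e-3):
    def loglik(genotype_base):
        ll = 0.0
        for base, q in zip(observed_bases, quals):
            err = min(10 ** (-q / 10.0), 0.75)
            ll += math.log(1 - err if base == genotype_base else err / 3)
        return ll
    log_ref = math.log(1 - prior_snp) + loglik(ref_base)
    log_alt = math.log(prior_snp) + loglik(alt_base)
    m = max(log_ref, log_alt)
    return math.exp(log_alt - m) / (math.exp(log_ref - m) + math.exp(log_alt - m))
```

A known-SNP prior from dbSNP would simply replace prior_snp at annotated sites.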

Figure 4 is interesting, since it shows that SNPs called towards the ends of reads are less trustworthy.  Also, if there is high coverage at a SNP and the SNP calls still occur towards the ends of reads, the concordance is extremely low (the SNP is not likely real anyway), so using position within the read seems warranted.  Alignment that is not sensitive to indels will cause false-positive SNPs around where the indel should have been placed, which fits the hypothesis that not aligning with indels causes false SNPs at the ends of reads; I am inclined to agree with this hypothesis from my own observations.  As for the false-positive-reduction claim, some anecdotal evidence is given in Figures 6 and 7, but simulations would bolster this claim significantly.  Not only does simulation provide an easy check of the claims, it also helps debug the algorithm, so perform simulations when possible.

Now they need to apply this to SOLiD, add support for indels, and modularize the aligner and SNP caller to allow input/output to and from other tools, and they would have intriguing software that could be extremely useful for NGS data.

Wednesday, April 14, 2010

SeqAnswers.com

Here's a plug for SeqAnswers.com.  It is a great site for beginner and expert NGS users and developers to meet, help each other out, and discuss all issues related to next-generation sequencing technology.  It has been around for a few years, so search the threads with Google by adding "site:seqanswers.com" to your query to search only SeqAnswers.com.

Monday, April 12, 2010

Incorporating sequence quality data into alignment improves DNA read mapping

Incorporating sequence quality data into alignment improves DNA read mapping

Pubmed Direct Link

This paper tries to answer a fundamental question to which every researcher intuitively answers "yes": does incorporating base qualities into alignment improve the alignment?  The strength of the paper lies in its model, while its weakness lies in its empirical and simulated evaluations.

Firstly, the methods section shows a nice Bayesian derivation for updating the log-odds ratios when performing local alignment (dynamic programming, Smith-Waterman, Needleman-Wunsch, Viterbi, whatever).  The assumption is that quality scores are in fact log-scaled and accurate, and we have seen from the many chemistry/optics/software updates that this may not be the case.  This method could easily be incorporated into any alignment software (short read, etc.) to use quality scores, but most likely won't be.
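
To give a flavor of what a quality-aware score looks like (a generic construction, not the paper's exact derivation): fold the Phred quality into the probability of the observed base and take the log-odds against a uniform background. Low-quality mismatches are then penalized less than high-quality ones.

```python
# Sketch: a per-base substitution score that depends on the Phred quality.
# Divergence between the reference and the sequenced genome is ignored here;
# the uniform 0.25 background is an assumption.
import math

def quality_aware_score(read_base, ref_base, phred_q, background=0.25):
    err = min(10 ** (-phred_q / 10.0), 0.75)
    p_obs = (1 - err) if read_base == ref_base else err / 3
    return math.log(p_obs / background)

# A mismatch at Q40 scores about -8.9, but only about -2.0 at Q10.
```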

Figure 1 shows the error rates from two different Illumina sequencing runs.  They picked the first 100,000 reads, which is not a random sampling, and the figure shows very high error rates, at least compared to what I have seen from GAII machines.  Equation 15 and its explanation derive a probability similar to that found in MAQ's supplementary materials.  The authors criticize some of the underlying assumptions, but that is why mapping quality (as defined in the MAQ model) is used: it incorporates, or at least reasonably explains why it ignores, some of those assumptions.  I would love to see an empirical estimate of contamination that could be incorporated into mapping quality.

Next is to show, empirically and with simulated data, improved accuracy or power when using quality values (MAQ's results are ignored, but I'll get to this later).  They use their own alignment software rather than adapting established software like BFAST/BWA/BOWTIE/MAQ/SHRiMP.  Basically I want to know, given their dataset, what any of those programs would achieve when mapping the data.  It may be the case that LAST (the name of the authors' aligner) is not that sensitive or accurate, and thus incorporating quality scores mitigates this problem, while better-known and more widely used aligners do not have this problem, and thus quality scores may not help at all, or not as much.  I just don't know, given their results.  They try to assuage this criticism (that they don't use other mappers) by saying they used their software guaranteeing up to 2 mismatches (that's one SNP in color space, folks).  Needless to say, I am not impressed, as other software is sensitive up to 10% error regardless of read length (2 mismatches in a 150bp read is not good).

More work is required to truly answer the question of whether quality values significantly improve alignment, rather than just adding the nth decimal place.

Sunday, April 11, 2010

Structural Variation Analysis with Strobe Reads.

A very interesting paper on SV detection with "strobe reads", which are/will be produced by single-molecule sequencers (SMS).

Links:
Pubmed Direct Link

The traditional way to detect structural variants (SVs) in next-generation sequencing data is either to search for discordant mate pairs or to explicitly look for SV breakpoints by performing a split-read analysis. In the former, DNA fragments are sequenced from both ends with some gap (insert) in between; if, after mapping, the two ends fall outside the expected insert size distribution and/or on opposite strands, they are called discordant. In the latter, a longer-ish read is split into two parts, each mapped independently and then tested to see whether the two parts map to different locations (a poor man's discordant pair). The benefit of the former is that, given sufficient coverage, the clone coverage (the coverage implied by spanning events using the insert) is quite high and gives good power, although the exact breakpoint is difficult to resolve. The benefit of the latter is that if a breakpoint is detected, it is detected by directly sequencing through it, although the complexity of the genome makes this type of analysis quite difficult.

A strobe read is produced by a single-molecule sequencer (SMS), for example from Pacific Biosciences (PacBio) or Life Technologies (Starlight), neither of which has been released. In a proof of principle, strobe sequencing from PacBio gives 2-4 sub-reads of 50-200 bases over a range of 10Kb from a contiguous fragment. The method discussed here seeks to minimize the number of SVs implied when aligning those 2-4 sub-reads. This is solved by formulating the problem as a graph optimization problem and using integer linear programming (as I dust off CLR to refresh my linear programming).

The beauty of the method is that, whereas typically each read is aligned independently, here the multiple alignments for each sub-read within a read, as well as across all reads, are examined together to minimize the expected number of SVs. As SVs have a small prior, this is a good idea. Without going into the various simplifications that yield performance improvements, previous computer science methods are readily applied to this problem, yielding a method that can sensitively find SVs with strobe reads.
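
As a toy illustration of the objective only (the paper solves it jointly over all reads with an ILP on a graph), here is an exhaustive version for a single strobe read, with a made-up "implied SV" count based on the spacing of consecutive sub-read mappings.

```python
# Brute-force toy: pick one mapping per sub-read so the chosen mappings imply
# as few SVs as possible. Only feasible for tiny instances; the ILP in the
# paper handles the real problem.
from itertools import product

def min_sv_assignment(subread_mappings, max_gap=10000):
    """subread_mappings: list (one per sub-read) of candidate reference positions."""
    best, best_cost = None, float("inf")
    for choice in product(*subread_mappings):
        cost = sum(1 for a, b in zip(choice, choice[1:])
                   if not (0 < b - a <= max_gap))  # out of order or too far => SV
        if cost < best_cost:
            best, best_cost = choice, cost
    return best, best_cost
```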

Always a sucker for ROC curves, I note that strobe-read SV detection fares quite well, in fact better than using paired-end reads or mixed read libraries. They do acknowledge some lack of power when searching for SVs in cancer data, since there all bets are off. I can't wait for real SMS data!

Saturday, April 10, 2010

Correction of sequencing errors in a mixed set of reads.

Correction of sequencing errors in a mixed set of reads.

Links:
Pubmed DOI

This paper builds off of a tool to correct sequencing errors in Illumina reads. What happens if SOLiD reads, with their powerful two-base encoding, are combined with base-space data, especially for de novo assembly?  Instead of spectral alignment, as performed by assemblers (EULER-SR, etc.), they prefer to use a suffix tree/array representation to remove erroneous suffixes.  The nice part is the observation that correction can be performed in color space, since original base-space read errors or SNPs also look like SNPs (valid adjacent differences) in color space, while SOLiD sequencing errors look like single color differences.
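
The color-space property they rely on is easy to see in code. Below is a minimal sketch of the two-base encoding (using the conventional A=0, C=1, G=2, T=3 labeling) and the "valid adjacent" check; it illustrates the property only and is not the paper's correction algorithm.

```python
# Each color is the XOR of adjacent base codes, so a true base change alters
# two adjacent colors while preserving their XOR ("valid adjacent"), whereas a
# sequencing error alters a single color.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def to_colors(seq):
    s = seq.upper()
    return [CODE[a] ^ CODE[b] for a, b in zip(s, s[1:])]

def classify_difference(read_colors, ref_colors):
    diffs = [i for i, (r, g) in enumerate(zip(read_colors, ref_colors)) if r != g]
    if len(diffs) == 1:
        return "isolated color difference (likely sequencing error)"
    if len(diffs) == 2 and diffs[1] == diffs[0] + 1:
        i = diffs[0]
        if read_colors[i] ^ read_colors[i + 1] == ref_colors[i] ^ ref_colors[i + 1]:
            return "valid adjacent differences (consistent with a SNP)"
    return "other"
```

For example, to_colors("ACGT") is [1, 3, 1] and to_colors("ATGT") is [3, 1, 1]: the SNP changes two adjacent colors whose XOR (2) is unchanged.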

Anyhow, the problem with this procedure is that if the SOLiD data were used alone, it would be extremely difficult to randomly produce two "valid adjacent" color differences.  This is why two-base encoding is so powerful: it is difficult to call false-positive SNPs.  This method now adds false SNPs from the Illumina reads, removing this assumption, as well as some of the power of two-base encoding.

Furthermore, are true SNPs and indels corrected away?  Although for a normal (i.e. diploid) individual they should occur in 50% or 100% of the reads, what about cancer, where the composition of cancer cells may be heterogeneous, not to mention contamination?  There is no analysis describing how this affects variant identification.

The human genome is 3.2 billion bases in length, requiring an 18-base sequence or longer to give a 50% probability of being unique.  Suffix arrays cannot fit in modest-memory machines (4-8GB), as the best implemented suffix array representation requires 5.25 bytes per base (more on that in later posts).  So when a billion reads or more are produced for human genome re-sequencing, will this method work (it is written in Java, by the way)?

If there exists a reference, it should be used, since for humans at least the reference can be assumed to be >99% identical to the genome being sequenced.  Furthermore, alignment algorithms are sensitive to SNPs and indels, and can correctly call/assemble most of the genome.

I really like the idea of performing an error-correction step before de novo assembly, as this is generally required. I wish there were more discussion of whether this error-correction method corrects away the variants: that is why the sequencing happened in the first place, right?

Verdict:
A good method which requires more evidence and discussion.