Sunday, April 11, 2010

Structural Variation Analysis with Strobe Reads.

A very interesting paper on SV detection with "strobe reads", which are (or will be) produced by single molecule sequencers (SMS).

Links:
Pubmed Direct Link

The traditional way to detect structural variants (SVs) in next-generation sequencing data is either to search for discordant mate pairs or to look explicitly for SV breakpoints by performing a split-read analysis. In the former, DNA fragments are sequenced from both ends with some gap (the insert) in between. If, after mapping, the two ends fall outside the expected insert size distribution and/or map with an unexpected strand orientation, they are called discordant. In the latter, a longer-ish read is split into two parts, each part is mapped independently, and the two mappings are tested to see whether they fall at different locations (a poor man's discordant read). The benefit of the former is that, given sufficient coverage, the clone coverage (the coverage implied by spanning events using the insert) is quite high and gives good power, although the exact breakpoint is difficult to resolve. The benefit of the latter is that a detected breakpoint has been sequenced through directly, although the complexity of the genome makes this type of analysis quite difficult.
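The discordant mate-pair test described above can be sketched in a few lines of Python. This is only an illustration, not code from the paper: the `MatePair` record, the `is_discordant` name, the three-standard-deviation cutoff, and the default expected orientation are all assumptions, and the expected orientation in particular depends on the library protocol.

```python
from dataclasses import dataclass


@dataclass
class MatePair:
    """Mapped positions of the two ends of one fragment (hypothetical record)."""
    chrom1: str
    pos1: int
    strand1: str  # "+" or "-"
    chrom2: str
    pos2: int
    strand2: str


def is_discordant(pair, mean_insert, sd_insert,
                  expected_orientation=("+", "-"), n_sd=3.0):
    """Return True if the pair looks discordant.

    A pair is called discordant when its ends map to different
    chromosomes, when the strands (ordered left to right) differ from
    the expected orientation, or when the mapped distance is more than
    n_sd standard deviations from the mean insert size.
    """
    if pair.chrom1 != pair.chrom2:
        return True

    # Order the two ends by position so orientation is read left to right.
    ends = sorted([(pair.pos1, pair.strand1), (pair.pos2, pair.strand2)])
    orientation = (ends[0][1], ends[1][1])
    if orientation != tuple(expected_orientation):
        return True

    span = ends[1][0] - ends[0][0]
    return abs(span - mean_insert) > n_sd * sd_insert


if __name__ == "__main__":
    ok = MatePair("chr1", 1000, "+", "chr1", 1300, "-")
    far = MatePair("chr1", 1000, "+", "chr1", 6000, "-")
    print(is_discordant(ok, mean_insert=300, sd_insert=30))
    print(is_discordant(far, mean_insert=300, sd_insert=30))
```

A real caller would estimate `mean_insert` and `sd_insert` from the bulk of the mapped pairs and would cluster discordant pairs before calling an SV, since a single discordant pair is weak evidence.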

A strobe read is produced by a single molecule sequencer (SMS), for example from Pacific Biosciences (Pac Bio) or Life Technologies (Starlight), neither of which has been released. In a proof of principle, strobe sequencing from Pac Bio gives 2-4 sub-reads of 50-200 bases each over a range of 10Kb from a contiguous fragment. The method discussed here seeks to minimize the number of SVs implied by aligning the 2-4 sub-reads. It does so by formulating the problem as a graph optimization problem and solving it with integer linear programming (as I dust off CLR to refresh my linear programming).

Typically each read is aligned independently. The beauty of this method is that the multiple candidate alignments for each sub-read of a read, across all reads, are examined together to minimize the number of SVs implied. As SVs have a small prior, this is a good idea. Without going into the specifics of the various simplifications that yield performance improvements, existing computer science methods apply readily to this problem and give a method that finds SVs sensitively with strobe reads.
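To make the idea of choosing alignments jointly a bit more concrete, here is a toy sketch. It is not the paper's formulation: it uses brute-force enumeration instead of integer linear programming, and the function name `min_sv_assignment`, the data layout, and the SV labels are all made up for illustration. Each read offers several candidate alignments, each candidate implies a set of SVs (an empty set meaning a fully concordant alignment), and the goal is to pick one candidate per read so that the total number of distinct SVs is as small as possible.

```python
from itertools import product


def min_sv_assignment(candidates):
    """Pick one candidate alignment per read to minimize distinct SVs.

    candidates: list with one entry per read; each entry is a list of
    frozensets, where each frozenset holds the SV labels implied by one
    candidate alignment of that read.

    Returns (choice, svs): the chosen candidate for each read and the
    union of SVs they imply. Brute force, so only usable on tiny inputs.
    """
    best_choice, best_svs = None, None
    for choice in product(*candidates):
        svs = set().union(*choice)
        if best_svs is None or len(svs) < len(best_svs):
            best_choice, best_svs = choice, svs
    return best_choice, best_svs


if __name__ == "__main__":
    reads = [
        [frozenset({"del_A"}), frozenset({"inv_B"})],
        [frozenset({"del_A"}), frozenset({"dup_C"})],
        [frozenset(), frozenset({"del_A"})],
    ]
    choice, svs = min_sv_assignment(reads)
    print(choice, svs)
```

The enumeration grows exponentially with the number of reads, which is exactly why a real implementation needs something like the graph and ILP machinery the authors use.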

I am always a sucker for ROC curves, and here strobe SV detection fares quite well, in fact better than paired-end reads or mixed read libraries. The authors do acknowledge some lack of power when searching for SVs in cancer data, since then all bets are off. I can't wait for real SMS data!
