1 nicholas mancuso department of computer science georgia state university joint work with bassam...
Post on 20-Dec-2015
217 views
TRANSCRIPT
1
Nicholas MancusoDepartment of Computer Science
Georgia State University
Joint work with Bassam Tork, GSUPavel Skums, CDCIon Mӑndoiu, UConn Alex Zelikovsky, GSU
Viral Quasispecies Reconstruction from Amplicon 454 Pyrosequencing
Reads
CAME 2011, Atlanta Georgia
2
Viral Quasispecies and NGS• RNA Viruses
—HIV, HCV, SARS, Influenza—Higher (than DNA) mutation rates — quasispecies
—set of closely related variants rather than a single species
• Knowing quasispecies can help—Interferon HCV therapy effectiveness (Skums et al 2011)
• NGS allows to find individual quasispecies sequences—454 Life Sciences : 400-600 Mb with reads 300-800 bp long
• Sequencing is challenging—multiple quasispecies —qsps sequences are very similar
—different qsps may be indistinguishable for > 1kb (longer than reads)
CAME 2011, Atlanta Georgia
3
Outline
• Shotgun vs Amplicon Sequencing• Viral Quasispecies Reconstruction Problem• Challenges and Approaches• Data Structure for Reads: Read Graph• Novel Methods for Solving QSR Problem• Observed vs True Read Frequencies• True Frequency Reconstruction• Simulations and Results
CAME 2011, Atlanta Georgia
4
Shotgun versus Amplicon Sequencing
• Shotgun reads—starting positions
distributed uniformly
• Amplicon—each read has
predefined start/endcovering fixed overlappingwindows
CAME 2011, Atlanta Georgia
5
Viral Quasispecies Spectrum Reconstruction Problem
• Given—collection of amplicon reads from a quasispecies
population with unknown variants and distribution
• Find—viral quasispecies sequences and their frequencies
CAME 2011, Atlanta Georgia
6
Amplicon Sequencing Challenges• Collapse of quasispecies in amplicon
—distinct quasispecies may be indistinguishable in window• Collapse of quasispecies in overlap
—match reads from consecutive windows coming from the same qsp
• First approachProsperi et al (2011)—Guide Distribution
—choose a column—go right/left matching the
the closest in order neighbor
CAME 2011, Atlanta Georgia
220 200 140 160 150200 140 130 150 14070 130 120 140 13010 20 110 130 1200 10 100 20 60
7
Approaches to QSP Reconstruction • Shotgun approaches
—estimates probability of consecutive reads coming from the same qsp (ViSpA, Astrovskaya et al 2011)
—parsimony (minimum number of distinct sequences covering all reads) (ShoRAH, Zagordi et al 2010)
• Why not use shotgun approaches for amplicons?—estimating probability in ViSpA relies on uniform distribution of reads—amplicon reads have fixed beginnings and ends
• Optimization approach—most parsimonious solution
— minimize number of distinct sequences covering all reads — too coarse: many different optimal solutions
—minimum information entropy (Shannon, 1948)— takes in account also frequency— fractional relaxation of pure parsimony
CAME 2011, Atlanta Georgia
Min Entropy vs Parsimony• Parsimony and Min Entropy selects AC and BD if a = c, and b = d
8 CAME 2011, Atlanta Georgia
9
Data Structure for Reads: Read GraphK amplicons → K-staged read graph
—vertices → distinct reads—edges → reads with consistent overlap—vertices, edges have a count function
CAME 2011, Atlanta Georgia
10
Read Graph• May transform graph into a 'forked' graph
—overlap is represented by fork vertex
CAME 2011, Atlanta Georgia
11
Fork Resolving Problem• Minimum Entropy is NP-hard
—can solve it optimally for each small fork separately (future work)
• Greedy heuristic— ≤ a+b-1 are sufficient when resolving fork with a distinct reads on the
left and b on the right— that can be done greedily matching largest (greedy heuristic)— this does not guarantee minimum number of distinct qsps
• Better way = globally match the most frequent reads (max bandwidth)— find s-t path maximizing minimum read count— subtract the minimum count from each read in the path
— exhausts at least one read in the path
CAME 2011, Atlanta Georgia
12
Greedy Method
CAME 2011, Atlanta Georgia
13
Greedy Method
CAME 2011, Atlanta Georgia
14
Greedy Method
CAME 2011, Atlanta Georgia
15
Greedy Method
CAME 2011, Atlanta Georgia
16
Greedy Method
CAME 2011, Atlanta Georgia
17
Greedy Method
CAME 2011, Atlanta Georgia
18
Greedy Method
CAME 2011, Atlanta Georgia
19
Greedy Method
CAME 2011, Atlanta Georgia
20
Greedy Method
CAME 2011, Atlanta Georgia
21
Maximum Bandwidth Method
CAME 2011, Atlanta Georgia
22
Maximum Bandwidth Method
CAME 2011, Atlanta Georgia
23
Maximum Bandwidth Method
CAME 2011, Atlanta Georgia
24
Maximum Bandwidth Method
CAME 2011, Atlanta Georgia
25
Maximum Bandwidth Method
CAME 2011, Atlanta Georgia
26
Maximum Bandwidth Method
CAME 2011, Atlanta Georgia
27
Maximum Bandwidth Method
CAME 2011, Atlanta Georgia
28
Maximum Bandwidth Method
CAME 2011, Atlanta Georgia
29
Maximum Bandwidth Method
CAME 2011, Atlanta Georgia
30
Maximum Bandwidth Method
CAME 2011, Atlanta Georgia
31
Maximum Bandwidth Method
CAME 2011, Atlanta Georgia
32
Maximum Bandwidth Method
CAME 2011, Atlanta Georgia
33
Maximum Bandwidth Method
CAME 2011, Atlanta Georgia
34
Maximum Bandwidth Method
CAME 2011, Atlanta Georgia
35
Maximum Bandwidth Method
CAME 2011, Atlanta Georgia
36
Maximum Bandwidth Method
CAME 2011, Atlanta Georgia
37
Maximum Bandwidth Method
CAME 2011, Atlanta Georgia
38
Observed vs Ideal Read Frequencies• Ideal frequency
—consistent frequency across forks
• Observed frequency (count)—inconsistent frequency across forks
• All methods perform better under ideal frequencies
CAME 2011, Atlanta Georgia
39
Fork Balancing Problem
• Given—set of reads and respective frequencies
• Find—minimal frequency offsets balancing all forks
Simplest approach is to scale frequencies from left to right
CAME 2011, Atlanta Georgia
40
Least Squares Approach
• Quadratic Program for read offsets• q – fork, oi – observed frequency, xi – frequency offset
CAME 2011, Atlanta Georgia
41
Flowchart
CAME 2011, Atlanta Georgia
42
Data Sets and Metrics• Simulated error-free HCV (1734 long fragment)– quasispecies from uniform, geometric, and skewed distribution– shift → delta of starting position
• Sensitivity– percentage of correctly assembled true quasispecies
• PPV– percentage of true quasispecies among all assembled
• Jensen-Shannon Divergence
43
Sensitivity Results
CAME 2011, Atlanta Georgia
44
PPV Results
CAME 2011, Atlanta Georgia
45
Divergence Results
CAME 2011, Atlanta Georgia
46
ViSpA Comparison
CAME 2011, Atlanta Georgia
47
Conclusion
• Two novel methods for solving QSR problem—Outperform Prosperi et al. on average—Outperform ViSpA approach on average
• Maximum Bandwidth approach worked best
• Future work: exact local solution for minimum entropy
CAME 2011, Atlanta Georgia
48
Thanks
CAME 2011, Atlanta Georgia