
Nicholas Mancuso
Department of Computer Science
Georgia State University

Joint work with Bassam Tork, GSU; Pavel Skums, CDC; Ion Măndoiu, UConn; Alex Zelikovsky, GSU

Viral Quasispecies Reconstruction from Amplicon 454 Pyrosequencing Reads

CAME 2011, Atlanta, Georgia


Viral Quasispecies and NGS

• RNA viruses
  – HIV, HCV, SARS, influenza
  – higher mutation rates than DNA viruses give rise to quasispecies: a set of closely related variants rather than a single species
• Knowing the quasispecies can help
  – e.g., effectiveness of interferon therapy for HCV (Skums et al. 2011)
• NGS makes it possible to find individual quasispecies sequences
  – 454 Life Sciences: 400-600 Mb, with reads 300-800 bp long
• Sequencing is challenging
  – multiple quasispecies
  – quasispecies (qsps) sequences are very similar
  – different qsps may be indistinguishable for > 1 kb (longer than the reads)


Outline

• Shotgun vs Amplicon Sequencing
• Viral Quasispecies Reconstruction Problem
• Challenges and Approaches
• Data Structure for Reads: Read Graph
• Novel Methods for Solving QSR Problem
• Observed vs True Read Frequencies
• True Frequency Reconstruction
• Simulations and Results


Shotgun versus Amplicon Sequencing

• Shotgun reads
  – starting positions distributed uniformly
• Amplicon reads
  – each read has a predefined start/end, covering fixed overlapping windows
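
To make the contrast concrete, here is a minimal Python sketch of the two sampling schemes. It is only an illustration under stated assumptions: the function names, the random toy genome, the window length of 300, and the shift of 100 are all hypothetical and are not taken from the talk.

```python
import random

def shotgun_reads(genome, read_len, n_reads, rng):
    """Sample reads whose start positions are uniform over the genome."""
    last_start = len(genome) - read_len
    starts = [rng.randint(0, last_start) for _ in range(n_reads)]
    return [(s, genome[s:s + read_len]) for s in starts]

def amplicon_windows(genome_len, window_len, shift):
    """Fixed (start, end) windows; they overlap whenever shift < window_len."""
    windows = []
    start = 0
    while start + window_len <= genome_len:
        windows.append((start, start + window_len))
        start += shift
    return windows

def amplicon_reads(genome, windows, n_reads, rng):
    """Every read covers exactly one of the predefined windows."""
    reads = []
    for _ in range(n_reads):
        start, end = rng.choice(windows)
        reads.append((start, genome[start:end]))
    return reads

# Toy data (assumed values, for illustration only).
rng = random.Random(0)
genome = "".join(rng.choice("ACGT") for _ in range(1000))
windows = amplicon_windows(len(genome), window_len=300, shift=100)
shot = shotgun_reads(genome, 300, 50, rng)
ampl = amplicon_reads(genome, windows, 50, rng)
```

In this sketch the shotgun starts can fall anywhere, while every amplicon read starts at one of a handful of window positions, which is why methods that rely on uniformly distributed starts do not transfer directly.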


Viral Quasispecies Spectrum Reconstruction Problem

• Given
  – a collection of amplicon reads from a quasispecies population with unknown variants and distribution
• Find
  – the viral quasispecies sequences and their frequencies


Amplicon Sequencing Challenges

• Collapse of quasispecies in an amplicon
  – distinct quasispecies may be indistinguishable within a window
• Collapse of quasispecies in an overlap
  – reads from consecutive windows that come from the same qsp must be matched
• First approach: Prosperi et al. (2011), Guide Distribution
  – choose a column
  – go right/left, matching the closest neighbor in order


[Table from slide figure]

220  200  140  160  150
200  140  130  150  140
 70  130  120  140  130
 10   20  110  130  120
  0   10  100   20   60


Approaches to QSP Reconstruction

• Shotgun approaches
  – estimate the probability that consecutive reads come from the same qsp (ViSpA, Astrovskaya et al. 2011)
  – parsimony: minimum number of distinct sequences covering all reads (ShoRAH, Zagordi et al. 2010)
• Why not use shotgun approaches for amplicons?
  – probability estimation in ViSpA relies on a uniform distribution of reads
  – amplicon reads have fixed beginnings and ends
• Optimization approach
  – most parsimonious solution
    – minimize the number of distinct sequences covering all reads
    – too coarse: many different optimal solutions
  – minimum information entropy (Shannon, 1948)
    – also takes frequency into account
    – fractional relaxation of pure parsimony


Min Entropy vs Parsimony

• Parsimony and Min Entropy both select AC and BD if a = c and b = d
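
The comparison can be checked numerically with a small Python sketch. The counts below (a = c = 3, b = d = 1) and the two candidate solutions are hypothetical values chosen for illustration; the slide's own figure is not preserved in this transcript.

```python
import math

def shannon_entropy(counts):
    """Entropy (in bits) of the frequency distribution implied by the counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Hypothetical fork: left reads A (a = 3) and B (b = 1),
# right reads C (c = 3) and D (d = 1).
solution_1 = {"AC": 3, "BD": 1}           # two sequences
solution_2 = {"AC": 2, "AD": 1, "BC": 1}  # three sequences, same read counts

h1 = shannon_entropy(solution_1.values())
h2 = shannon_entropy(solution_2.values())
```

Both candidates cover every read with the right multiplicity, but the first uses fewer sequences and also has lower entropy, so the two criteria agree on AC and BD here.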


Data Structure for Reads: Read Graph

K amplicons → K-staged read graph
  – vertices → distinct reads
  – edges → reads with consistent overlap
  – vertices and edges have a count function
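
A minimal Python sketch of such a staged graph follows. It assumes a constant, known overlap length between consecutive amplicons and models only the vertex counts; the function name, the toy reads, and the representation (one `Counter` per stage, one edge list per pair of consecutive stages) are illustrative choices, not the authors' implementation.

```python
from collections import Counter

def build_read_graph(reads_by_amplicon, overlap):
    """Build a K-staged read graph.

    reads_by_amplicon: list of K lists of read strings, one list per amplicon.
    overlap: number of bases shared by consecutive amplicons (assumed
             constant and greater than zero).
    Returns (vertices, edges): vertices[k] maps each distinct read of stage k
    to its count; edges[k] lists (left_read, right_read) pairs whose
    overlapping bases agree.
    """
    vertices = [Counter(reads) for reads in reads_by_amplicon]
    edges = []
    for k in range(len(vertices) - 1):
        stage_edges = [
            (left, right)
            for left in vertices[k]
            for right in vertices[k + 1]
            if left[-overlap:] == right[:overlap]
        ]
        edges.append(stage_edges)
    return vertices, edges

# Toy example with two amplicons overlapping by two bases.
toy_reads = [["ACGT", "ACGT", "ACGA"], ["GTTT", "GATT", "GTTT"]]
toy_vertices, toy_edges = build_read_graph(toy_reads, overlap=2)
```

Duplicate reads collapse into one vertex with a count, and an edge exists only where the suffix of the left read equals the prefix of the right read.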


Read Graph

• The graph may be transformed into a 'forked' graph
  – each overlap is represented by a fork vertex


Fork Resolving Problem

• Minimum Entropy is NP-hard
  – can be solved optimally for each small fork separately (future work)
• Greedy heuristic
  – ≤ a+b-1 matches are sufficient to resolve a fork, where a and b are the numbers of distinct reads on the left and on the right
  – this can be achieved by greedily matching the largest reads
  – this does not guarantee the minimum number of distinct qsps
• Better way: globally match the most frequent reads (max bandwidth)
  – find the s-t path maximizing the minimum read count
  – subtract the minimum count from each read in the path
    – this exhausts at least one read in the path


Greedy Method
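
The step-by-step figures for this method are not preserved in the transcript. As a stand-in, here is a hedged Python sketch of greedy fork resolution as described on the Fork Resolving Problem slide: repeatedly pair the largest remaining left read with the largest remaining right read. The function name and the toy counts are hypothetical, and the sketch assumes positive counts and equal totals on both sides of the fork.

```python
def resolve_fork_greedy(left, right):
    """Greedily match reads across one fork.

    left, right: dicts mapping read -> positive count, with equal totals.
    Returns a list of (left_read, right_read, count) matches.
    If the totals differ, the surplus on one side is simply left unmatched.
    """
    left = dict(left)    # work on copies so the inputs are not modified
    right = dict(right)
    matches = []
    while left and right:
        l = max(left, key=left.get)
        r = max(right, key=right.get)
        m = min(left[l], right[r])
        matches.append((l, r, m))
        left[l] -= m
        right[r] -= m
        if left[l] == 0:
            del left[l]
        if right[r] == 0:
            del right[r]
    return matches

# Toy fork with a = 2 distinct left reads and b = 2 distinct right reads.
fork_matches = resolve_fork_greedy({"A": 5, "B": 3}, {"C": 4, "D": 4})
```

Every iteration exhausts at least one read and the last iteration exhausts two, which is where the a+b-1 bound comes from when the totals are equal.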


Maximum Bandwidth Method
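
The animation frames for this method are likewise missing from the transcript. The following Python sketch shows one way the described loop could work on a staged read graph: find the path whose smallest read count is largest, subtract that count along the path, and repeat. The layer-by-layer dynamic program, the function name, and the toy graph are assumptions made for illustration, not the authors' code.

```python
def max_bandwidth_paths(vertices, edges):
    """Repeatedly extract the path with the largest bottleneck read count.

    vertices: list of K dicts mapping read -> count, one per stage.
    edges: list of K-1 lists of (left_read, right_read) pairs.
    Returns a list of (path, width) tuples in extraction order.
    """
    counts = [dict(stage) for stage in vertices]
    results = []
    while True:
        # best[k][v] = (bottleneck of the widest path ending at v, predecessor)
        best = [{v: (c, None) for v, c in counts[0].items() if c > 0}]
        for k in range(1, len(counts)):
            layer = {}
            for u, v in edges[k - 1]:
                if u in best[k - 1] and counts[k].get(v, 0) > 0:
                    width = min(best[k - 1][u][0], counts[k][v])
                    if v not in layer or width > layer[v][0]:
                        layer[v] = (width, u)
            best.append(layer)
        if not best[-1]:
            break  # no remaining path reaches the last stage
        end = max(best[-1], key=lambda v: best[-1][v][0])
        width = best[-1][end][0]
        # Walk the predecessors back to the first stage.
        path = [end]
        for k in range(len(counts) - 1, 0, -1):
            path.append(best[k][path[-1]][1])
        path.reverse()
        # Subtracting the bottleneck exhausts at least one read on the path.
        for k, v in enumerate(path):
            counts[k][v] -= width
        results.append((path, width))
    return results

# Toy three-stage graph in which every consecutive pair of reads is compatible.
stage_counts = [{"A": 5, "B": 3}, {"C": 4, "D": 4}, {"E": 6, "F": 2}]
stage_edges = [
    [("A", "C"), ("A", "D"), ("B", "C"), ("B", "D")],
    [("C", "E"), ("C", "F"), ("D", "E"), ("D", "F")],
]
paths = max_bandwidth_paths(stage_counts, stage_edges)
```

Because each extraction zeroes at least one read, the loop terminates after at most as many iterations as there are distinct reads.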


Observed vs Ideal Read Frequencies

• Ideal frequency
  – consistent frequency across forks
• Observed frequency (count)
  – inconsistent frequency across forks
• All methods perform better under ideal frequencies


Fork Balancing Problem

• Given
  – a set of reads and their respective frequencies
• Find
  – minimal frequency offsets balancing all forks

The simplest approach is to scale frequencies from left to right.
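
One possible reading of left-to-right scaling is sketched below in Python: stage by stage, the reads on the right side of each fork are rescaled so that their total equals the (already balanced) total on the left side. The grouping of reads into forks by a fixed-length overlap, the function name, and the toy counts are assumptions for illustration only.

```python
from collections import defaultdict

def scale_left_to_right(stages, overlap):
    """Balance forks by rescaling each stage against the stage to its left.

    stages: list of K dicts mapping read -> observed (positive) count.
    overlap: number of bases shared by consecutive amplicons (assumed constant).
    Returns a list of K dicts mapping read -> balanced frequency.
    """
    balanced = [dict(stages[0])]  # the first stage is taken as the reference
    for k in range(1, len(stages)):
        # Total balanced frequency entering each fork from the left.
        left_total = defaultdict(float)
        for read, freq in balanced[k - 1].items():
            left_total[read[-overlap:]] += freq
        # Total observed count leaving each fork to the right.
        right_total = defaultdict(float)
        for read, count in stages[k].items():
            right_total[read[:overlap]] += count
        scaled = {}
        for read, count in stages[k].items():
            key = read[:overlap]
            scaled[read] = count * left_total[key] / right_total[key]
        balanced.append(scaled)
    return balanced

# Toy example: two forks (overlaps "GT" and "GA") with inconsistent counts.
observed = [{"ACGT": 6, "ACGA": 4}, {"GTTT": 2, "GTAA": 2, "GATT": 5}]
balanced = scale_left_to_right(observed, overlap=2)
```

This keeps the relative proportions of the reads within each fork but lets errors in the leftmost stage propagate rightward, which is one reason to prefer a formulation that spreads the adjustment over all reads.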


Least Squares Approach

• Quadratic program for read offsets
• q: fork, o_i: observed frequency, x_i: frequency offset
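
The program itself is not preserved in this transcript. One plausible formulation consistent with the symbols above is given here as an assumption, not as the authors' exact program. The sets L(q) and R(q), denoting the reads on the left and right sides of fork q, are notation introduced only for this sketch.

```latex
\begin{aligned}
\min_{x}\quad & \sum_{i} x_i^{2} \\
\text{subject to}\quad
  & \sum_{i \in L(q)} (o_i + x_i) \;=\; \sum_{j \in R(q)} (o_j + x_j)
    && \text{for every fork } q, \\
  & o_i + x_i \ge 0 && \text{for every read } i.
\end{aligned}
```

Under this reading, the objective keeps the offsets as small as possible in the least squares sense, while the constraints force every fork to be balanced and every corrected frequency to stay non-negative.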


Flowchart


Data Sets and Metrics

• Simulated error-free HCV data (1734 bp long fragment)
  – quasispecies drawn from uniform, geometric, and skewed distributions
  – shift → delta of starting position
• Sensitivity
  – percentage of true quasispecies that are correctly assembled
• PPV (positive predictive value)
  – percentage of true quasispecies among all assembled

• Jensen-Shannon Divergence
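
The three metrics can be sketched in Python as follows. The functions return fractions rather than percentages, and the assumption that Jensen-Shannon divergence is measured between the true and the estimated frequency distributions is ours; the sequence labels and frequencies are hypothetical.

```python
import math

def sensitivity(true_seqs, assembled_seqs):
    """Fraction of true quasispecies that were assembled exactly."""
    true_set, assembled = set(true_seqs), set(assembled_seqs)
    return len(true_set & assembled) / len(true_set)

def ppv(true_seqs, assembled_seqs):
    """Fraction of assembled sequences that are true quasispecies."""
    true_set, assembled = set(true_seqs), set(assembled_seqs)
    return len(true_set & assembled) / len(assembled)

def jensen_shannon_divergence(p, q):
    """JSD in bits between two dicts mapping sequence -> frequency (sum 1)."""
    support = set(p) | set(q)
    mix = {s: 0.5 * (p.get(s, 0.0) + q.get(s, 0.0)) for s in support}

    def kl(a):
        # Kullback-Leibler divergence of a from the mixture, skipping zeros.
        return sum(a[s] * math.log2(a[s] / mix[s]) for s in a if a[s] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)

# Hypothetical true and estimated populations.
true_pop = {"S1": 0.5, "S2": 0.3, "S3": 0.2}
estimated = {"S1": 0.5, "S2": 0.3, "S4": 0.2}

sens = sensitivity(true_pop, estimated)
prec = ppv(true_pop, estimated)
jsd = jensen_shannon_divergence(true_pop, estimated)
```

With base-2 logarithms the divergence lies between 0 (identical distributions) and 1 (disjoint supports), which makes results comparable across data sets.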


Sensitivity Results


PPV Results


Divergence Results


ViSpA Comparison


Conclusion

• Two novel methods for solving the QSR problem
  – outperform Prosperi et al. on average
  – outperform the ViSpA approach on average

• Maximum Bandwidth approach worked best

• Future work: exact local solution for minimum entropy


Thanks
