sequence assembly and protein docking algorithms

Sequence Assembly and Protein Docking Algorithms Vicky Choi Department of Computer Science Duke University


Post on 13-Jan-2016

Sequence Assembly and Protein Docking Algorithms

Vicky Choi
Department of Computer Science
Duke University

2/73

Outline

• Sequence Assembly Algorithm (Joint work with Martin Farach-Colton @Rutgers)

• Local Search Algorithm for Rigid Protein Docking (Joint work with Pankaj K. Agarwal, Herbert Edelsbrunner, Johannes Rudolph @Duke)

3/73

Outline: Sequence Assembly

• Biological Background

• Human Genome Project and the Sequence Assembly Problem

• The BARNACLE Assembler

4/73

DNA

A DNA molecule consists of two strands that are tied together in a helical structure.

Each strand is represented by a string over the alphabet {A,C,G,T}, called a DNA sequence.

Example: AAGCTTCAGTTCCTGACCTTCCAATCGCAA

Each letter of {A,C,G,T} is a nucleotide (base); lengths are measured in base pairs (bp).

Image Credit: US Department of Energy Human Genome Program http://www.ornl.gov/hgmis

5/73

Two Strands: Reverse Complementary

Orientation: 5' → 3'

Complement: A ↔ T, C ↔ G

Example

(5') ACCATGGTGCACCTGACTCCTGAGGAG (3')
(3') TGGTACCACGTGGACTGAGGACTCCTC (5')

One strand ⇒ the other strand

Image Credit: US Department of Energy Human Genome Program http://www.ornl.gov/hgmis
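The complement rule can be checked on the example strands with a short Python sketch. The function names are illustrative and not part of the talk.

```python
# Complement map for the four DNA bases (A<->T, C<->G).
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def complement(seq: str) -> str:
    """Base-by-base complement, in the same left-to-right order."""
    return seq.translate(COMPLEMENT)

def reverse_complement(seq: str) -> str:
    """The opposite strand, read in its own 5' -> 3' direction."""
    return complement(seq)[::-1]

top = "ACCATGGTGCACCTGACTCCTGAGGAG"        # read 5' -> 3'
bottom_3_to_5 = complement(top)            # as drawn under the top strand
bottom_5_to_3 = reverse_complement(top)    # same strand, read 5' -> 3'
print(bottom_3_to_5)
print(bottom_5_to_3)
```

Applying `reverse_complement` twice returns the original strand, which is why one strand determines the other.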

6/73

A genome is the complete set of DNA sequences of an organism.

Human Genome ~ $3 \times 10^9$ bp

Human Chromosomes

Image Credit: Sanger Center http://www.sanger.ac.uk/

7/73

DNA Sequencing

DNA sequencing is the process of determining the sequence of nucleotides in a region of DNA.

Current technology: ~500bp per read

Question: How can a longer stretch of DNA be sequenced?

C G A A T C G T C G A T G C T A A T G

8/73

Shotgun Sequencing

[Figure: shotgun sequencing pipeline]

Target DNA → (Cloning) → Copies of Target DNA → (Shotgun) → (DNA Sequencing) → Sequence Reads → (Assembly) → Consensus → (Directed Read) → Final

Consensus: ACGTAAGAGTACCGATTGGCCA

Final: …ACGTAGTCTTAGATGATAGTAGA…
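The assembly step can be illustrated with a toy greedy overlap-merge in Python. This is only a sketch on error-free reads with no repeats; it is not the BARNACLE algorithm, and the function names, the read positions, and the `min_len` threshold are assumptions made for the example.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a equal to a prefix of b (0 if below min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of contigs with the largest suffix/prefix overlap."""
    contigs = list(reads)
    while len(contigs) > 1:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no remaining overlaps: leave the contigs separate
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)
    return contigs

target = "ACGTAAGAGTACCGATTGGCCA"                     # consensus string from the slide
reads = [target[0:10], target[6:16], target[12:22]]   # three overlapping toy reads
consensus = greedy_assemble(reads)
print(consensus)
```

On real data, repeats create false overlaps, which is exactly the difficulty the later slides address.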

9/73

Shotgun Sequencing History

• 1980s: 5 to 10 Kbp
• 1990: 40 Kbp
• 1995: 1.8 Mbp (H. influenzae)
• 2000: draft Drosophila (120 Mbp)
• 2001: draft Human Genome ($3 \times 10^9$ bp) (attempted by Celera)

10/73

Outline: Sequence Assembly

• Biological Background

• Human Genome Project and the Sequence Assembly Problem

• The BARNACLE Assembler

11/73

Human Genome Project (HGP)

• 1988: “Mapping and Sequencing the Human Genome”

• 1990: HGP started in US
• 2001: A “working draft” version
• 2003: Completed by HGP Consortium standard

12/73

Hierarchical Shotgun Sequencing (BAC-by-BAC)

• Map First, Then Sequence

Human Genome → BAC library (100-200Kb) → Physical Map → Tiling Path (TP) of BACs → Shotgun sequencing and assembly of each TP BAC → Final Sequence

A BAC is a segment of DNA from a chromosome. Each BAC is ~100-200Kb.

13/73

Hierarchical Shotgun Sequencing (BAC-by-BAC)

• Map First??

Human Genome → BAC library (100-200Kb) → Physical Map → ?

The Physical Map is difficult to build! (original expected time: 5 years)

14/73

BAC-by-BAC → BAC-Based

New Idea: Map + Sequencing concurrently

Randomly pick BACs (do not wait for the Physical Map) and shotgun sequence them.

BAC → Sequence Reads →
• Phase 1: Draft (Fragments)
• Phase 2: Draft (Ordered Fragments)
• Phase 3: Finished

15/73

BAC-Based

Human Genome → BAC library (100-200Kb) → Finished + Draft BACs → Sequence Assembly Problem → the working draft of the human genome

16/73

Outline: Sequence Assembly

• Biological Background

• Human Genome Project and the Sequence Assembly Problem

• The BARNACLE Assembler
  – Details of Input
  – Difficulties
  – Basic Idea

17/73

Details of Input

• Sequence Information:
  – BACs

• Overlap Information:
  – Local Alignments
  – NT-pairs

• Orientation Information:
  – Plasmid, EST, mRNA

18/73

Input: Sequence Information

Recall: A BAC is a contiguous stretch of DNA from a chromosome. Each comes as a set of fragments.

Accession     Phase   Chrm   # frags
AC002092.1    1       17     4

Frag acc.         length
AC002092.1~1      888
AC002092.1~2      45312
AC002092.1~3      38725
AC002092.1~4.1    10245

• Phase 1,2 = Draft
• Phase 3 = Finished
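One way to hold this input in memory is a pair of small record types. The following Python sketch is an assumption about a convenient representation, not the format used by BARNACLE; the class names, field names, and the example accession and lengths are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Fragment:
    accession: str   # fragment accession, e.g. "<BAC accession>~<index>"
    length: int      # fragment length in bp

@dataclass
class BAC:
    accession: str
    phase: int       # 1 or 2 = draft, 3 = finished
    chromosome: str
    fragments: List[Fragment] = field(default_factory=list)

    @property
    def is_finished(self) -> bool:
        return self.phase == 3

    def total_length(self) -> int:
        return sum(f.length for f in self.fragments)

# Hypothetical draft BAC with two fragments (values invented for illustration).
bac = BAC("EXAMPLE0001.1", phase=1, chromosome="17",
          fragments=[Fragment("EXAMPLE0001.1~1", 900),
                     Fragment("EXAMPLE0001.1~2", 45000)])
print(bac.is_finished, bac.total_length())
```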

19/73

Input: Overlap Information

• Preprocessing:

– Local alignments of all fragment pairs

– NT-pairs: Generated from GenBank annotation submitted from genome centers

20/73

Example: Input of Dec 2001 freeze

Sequence Information:

Phase   BACs    Fragments   Total Length (Gbp)   Average Number of Fragments
1       15298   246424      2.5                  16.11
2       2154    8161        3.3                  3.79
3       17624   17624       2                    1
Total   35076   272209      4.9                  7.76

Chromosome Assignments: 31543 by STS; 2450 by GenBank; 1083 unknown

Overlap Information: 403,466 fragment pairs, 12,656 NT-pairs

Orientation Information: 321,751 fragment pairs
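The count columns of the freeze table are internally consistent, as this short Python check shows. It recomputes only the BAC totals, fragment totals, and per-BAC fragment averages from the numbers on the slide; the variable names are illustrative.

```python
# (BACs, fragments) per phase, copied from the Dec 2001 freeze table.
freeze = {1: (15298, 246424), 2: (2154, 8161), 3: (17624, 17624)}

averages = {phase: round(frags / bacs, 2) for phase, (bacs, frags) in freeze.items()}
total_bacs = sum(bacs for bacs, _ in freeze.values())
total_frags = sum(frags for _, frags in freeze.values())
overall_average = round(total_frags / total_bacs, 2)

# Chromosome assignments should account for every BAC.
assigned = 31543 + 2450 + 1083

print(averages, total_bacs, total_frags, overall_average, assigned)
```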

21/73

True vs Repeat-induced Overlap

[Figure: a true overlap compared with a repeat-induced overlap]

22/73

Repeats of the Human Genome

Low-copy repeats (segmental duplication)
• Large block (>200Kb)
• Highly similar (>97%)

High-copy repeats
• e.g. ALU, L1

23/73

Noise

• Chimeric BAC (CB)

• False positives (FP): due to repeats

• False negatives (FN): polymorphism, draft quality

24/73

The Basic Idea

1. “Conservatively” assemble fragments

25/73

Necessary Condition for True Overlaps

Idea: assemble non-conflicting overlaps first

A overlaps with B, and A overlaps with C. Does B overlap with C?

[Figure: two placements of B and C against A; in one B and C overlap (Yes), in the other they do not (No)]

26/73

The Basic Idea

1. “Conservatively” assemble fragments into subcontigs

[Figure: BAC Graph]

27/73

Interval Graph

Definition: A graph G is called an interval graph if there is a one-to-one correspondence between its vertices and a set of intervals on the real line such that two vertices are adjacent in G iff their corresponding intervals overlap.

The BAC graph is an interval graph!
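The definition can be made concrete with a small Python sketch that builds the graph from a set of intervals. The BAC names and coordinates below are hypothetical placements on a chromosome, chosen only for illustration.

```python
from itertools import combinations

def interval_graph(intervals):
    """intervals: dict name -> (start, end). Returns the edge set as frozensets."""
    edges = set()
    for (u, (s1, e1)), (v, (s2, e2)) in combinations(intervals.items(), 2):
        if s1 <= e2 and s2 <= e1:        # closed intervals share at least one point
            edges.add(frozenset((u, v)))
    return edges

# Hypothetical BAC placements (coordinates in Kb).
placements = {"A": (0, 150), "B": (100, 250), "C": (120, 300), "D": (280, 400)}
edges = interval_graph(placements)
print(sorted(tuple(sorted(e)) for e in edges))
```

Because true BACs are intervals on a chromosome, a correct overlap graph must admit such a realization; the assembler works in the opposite direction, from graph to intervals.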

28/73

Necessary… But Not Sufficient

Long Repeats

Under-represented

29/73

Non-interval Graph

Collapsing Repeats:

Chimeric BAC

30/73

Forbidden Subgraphs

Theorem (Lekkerkerker & Boland 1962): A graph is an interval graph iff it does not contain any of the following as an induced subgraph:

[Figure: the forbidden induced subgraphs]

31/73

Resolving Non-interval Graphs

Definition: A vertex $u \in V$ is I-critical if $G|_{V \setminus \{u\}}$ is interval.

Given a non-interval graph G, identify a forbidden subgraph.

If at least one of the vertices of the forbidden subgraph is I-critical, we say G is fixable.

Based on the structure of the forbidden subgraph, a fixable graph G is resolved by

1. adding an FN edge; or
2. removing FP edges; or
3. removing a vertex.
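For very small graphs, I-critical vertices can be found by brute force. The Python sketch below is not the method from the talk: it swaps in a standard vertex-ordering characterization (a graph is interval iff its vertices can be ordered so that, for u < v < w, an edge uw implies an edge uv) and tries all orderings, so it is only usable on a handful of vertices. All names are illustrative.

```python
from itertools import permutations

def is_interval(vertices, edges):
    """Brute-force interval-graph test via the vertex-ordering characterization."""
    adj = {frozenset(e) for e in edges}

    def valid(order):
        n = len(order)
        for i in range(n):
            for k in range(i + 2, n):
                if frozenset((order[i], order[k])) in adj:
                    # every vertex between positions i and k must be adjacent to order[i]
                    for j in range(i + 1, k):
                        if frozenset((order[i], order[j])) not in adj:
                            return False
        return True

    return any(valid(p) for p in permutations(vertices))

def i_critical(vertices, edges):
    """Vertices u such that deleting u leaves an interval graph."""
    result = []
    for u in vertices:
        rest = [v for v in vertices if v != u]
        sub = [e for e in edges if u not in e]
        if is_interval(rest, sub):
            result.append(u)
    return result

# A chordless 4-cycle is one of the forbidden subgraphs.
c4_vertices = ["a", "b", "c", "d"]
c4_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]
print(is_interval(c4_vertices, c4_edges), i_critical(c4_vertices, c4_edges))
```

Deleting any vertex of the 4-cycle leaves a path, which is interval, so every vertex is I-critical and the graph is fixable in the sense defined above.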

32/73

Divide and Conquer Method

For the non-fixable graphs, we employ a divide-and-conquer method: the graph is divided at some articulation points such that each subcomponent is fixable.
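An articulation point is a vertex whose removal disconnects the graph. A minimal Python sketch, assuming a connected input graph and using a simple remove-and-test loop rather than the linear-time DFS algorithm, could look like this; the example graph is hypothetical.

```python
def connected(vertices, edges):
    """True if the subgraph induced on `vertices` is connected."""
    vertices = set(vertices)
    if not vertices:
        return True
    adj = {v: set() for v in vertices}
    for u, v in edges:
        if u in vertices and v in vertices:
            adj[u].add(v)
            adj[v].add(u)
    start = next(iter(vertices))
    seen, stack = {start}, [start]
    while stack:
        x = stack.pop()
        for y in adj[x] - seen:
            seen.add(y)
            stack.append(y)
    return seen == vertices

def articulation_points(vertices, edges):
    """Vertices whose removal disconnects an (assumed connected) graph."""
    return [u for u in vertices
            if not connected([v for v in vertices if v != u], edges)]

# Two triangles sharing the vertex "x".
vs = ["a", "b", "x", "c", "d"]
es = [("a", "b"), ("b", "x"), ("x", "a"), ("x", "c"), ("c", "d"), ("d", "x")]
print(articulation_points(vs, es))
```

Splitting at "x" yields two components that can then be examined independently.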

33/73

The Basic Idea

1. “Conservatively” assemble fragments into subcontigs

2. Resolve Non-interval Graph and Find an Interval Realization of the BAC Graph

3. Orient and order subcontigs

[Figure: BAC Graph]

34/73

Error Detection

1. “Conservatively” assemble fragments into subcontigs

2. Resolve Non-interval Graph and Find an Interval Realization of the BAC Graph

3. Orient and order subcontigs

Errors detected:
• wrong NT-pairs (annotation from genome centers)
• chromosome misassignments
• chimerics
• fragment misassemblies

This is the only algorithm available that does Error Detection.

35/73

Output: Contigs

36/73

Other Two Assemblers

• GigAssembler by Jim Kent and David Haussler (stopped after the April 2001 freeze)

• NCBI’s assembler: top-down approach
  – builds a physical map using sequence overlaps as fingerprint overlaps;
  – uses some scoring functions to resolve conflicts.

37/73

BARNACLE’s assembly

NCBI’s assembly

38/73

Comparison with NCBI’s Assembly (Dec 2001)

Assembled BAC Length     Barnacle   NCBI
≤ 250K (good BACs)       33921      29952
250K-300K                434        461
300K-500K                549        1328
500K-800K                33         798
800K-1M                  0          248
1M-2M                    0          496
2M-3M                    0          129
3M-10M                   0          259
10M-20M                  0          67
Total (>250K)            1016       3786

39/73

How was the human genome “finished”?

• Hand-curate tiling path of BACs (by Genome Centers)

• Finish sequencing the tiling path of BACs only

• Assemble by NCBI’s assembler based on the hand-curated tiling paths

40/73

Incorporating segmental duplication database

BARNACLE’s assembly suggested that at least 89 repeat-containing BACs were dropped from the tiling path.
– 69 were added to HGP’s final tiling path
– 20 were declared unnecessary
  • Due to disagreement about the repeat structure of the genome

Collaboration with Evan E. Eichler (Department of Genetics, Case Western Reserve University)

The Sequence and Assembly of Highly Duplicated Regions in the Human Genome.

V. Choi, J. Bailey, G. Schuler, Z. Gu, P. Li, M. Farach-Colton and E. Eichler.

Genome Sequencing & Biology meeting at Cold Spring Harbor Laboratory 2002.

41/73

Conclusions

• Better assembly
  – Error detection
  – Measured by the assembled BAC length

• Efficient (3 minutes on a Pentium III)

• To do large scale sequencing:

–Handle repeats

–Design in data acquisition that will permit error detection & correction

Reference: V. Choi, M. Farach-Colton. BARNACLE: An Assembly Algorithm for Clone-based Sequences of Whole Genomes. Gene, 320, 165-176, 2003.

42/73

Acknowledgement

Wojciech Makalowski (NCBI/Penn State University)
David Lipman (NCBI)
Greg Schuler (NCBI)
Evan E. Eichler (Case Western Reserve University)
Granger Sutton (Celera)
JinSheng Lai (Waksman Institute, Rutgers University)

• NCBI/NIH pre-doctoral visiting fellowship

• Program in Mathematics and Molecular Biology (PMMB), Burroughs Wellcome Fund Interface Program fellowship

43/73

Outline : Protein Docking

1. Protein-Protein Docking
2. Local Search Algorithm
3. Test Results

44/73

Protein-Protein Docking

Barnase

Barstar

1BRS : Barnase + Barstar

45/73

Protein Re-Docking Problem (Bound Protein Docking)

Given a known protein-protein complex A-B (native configuration), randomly separate the two proteins.

Fix A; find a rigid motion $\mu$ such that $\mu(B)$ is near-native.

Rigid Body Assumption

46/73

Formulation of Rigid Protein Docking

• A scoring function that can discriminate correct docking configurations from incorrect ones;

• A search algorithm that finds the docking configuration that is best as measured by the scoring function.

47/73

Protein

A protein molecule consists of a set of atoms. Each atom is represented by a ball in $\mathbb{R}^3$.

Notation: $A = \{(a_1, r_1), \ldots, (a_n, r_n)\}$, where $a_i \in \mathbb{R}^3$ is the center of the $i$th atom and $r_i$ is its (van der Waals) radius.

Atom Type            C       N     O       P      S
Radius (Angstrom)    1.548   1.4   1.348   1.88   1.808
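The ball representation maps directly onto a small data type. In the Python sketch below, the radii come from the table on this slide; the `Atom` class, the `balls_intersect` helper, and the coordinates are illustrative assumptions and do not reproduce the talk's Score or Bump functions.

```python
import math
from dataclasses import dataclass
from typing import Tuple

# Van der Waals radii in Angstrom, as listed on the slide.
VDW_RADIUS = {"C": 1.548, "N": 1.4, "O": 1.348, "P": 1.88, "S": 1.808}

@dataclass(frozen=True)
class Atom:
    center: Tuple[float, float, float]   # atom center a_i in R^3
    element: str                         # one of the keys of VDW_RADIUS

    @property
    def radius(self) -> float:
        return VDW_RADIUS[self.element]

def balls_intersect(p: Atom, q: Atom) -> bool:
    """Two balls intersect when their centers are closer than the sum of radii."""
    return math.dist(p.center, q.center) < p.radius + q.radius

carbon = Atom((0.0, 0.0, 0.0), "C")
nitrogen = Atom((2.5, 0.0, 0.0), "N")
oxygen = Atom((4.0, 0.0, 0.0), "O")
print(balls_intersect(carbon, nitrogen), balls_intersect(carbon, oxygen))
```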

48/73

Our Scoring Function

49/73

Exhaustive Search

• Sample the rigid motion space (6-dimensional)
• Evaluate each motion using the scoring function

Rigid Motion = Rotation + Translation

A rotation in $\mathbb{R}^3$ can be specified by a rotation angle about a rotation axis $u$, represented by a unit quaternion.

Sampling Rotation Space ⇒ sampling $S^3$ (unit sphere in $\mathbb{R}^4$)

A translation in $\mathbb{R}^3$ is a 3-dimensional vector $(x, y, z) \in \mathbb{R}^3$.

Sampling Translation Space: a 3-dimensional grid
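A rigid motion built from an axis-angle rotation and a translation can be sketched in plain Python as follows. The quaternion-to-matrix formula is the standard one; the function names are illustrative, and the sampling scheme itself is not shown.

```python
import math

def quaternion(axis, angle_deg):
    """Unit quaternion (w, x, y, z) for a rotation by angle_deg about `axis`."""
    x, y, z = axis
    norm = math.sqrt(x * x + y * y + z * z)
    half = math.radians(angle_deg) / 2.0
    s = math.sin(half) / norm            # normalizes the axis
    return (math.cos(half), x * s, y * s, z * s)

def rotate(q, p):
    """Rotate point p by unit quaternion q using the equivalent rotation matrix."""
    w, x, y, z = q
    px, py, pz = p
    return (
        (1 - 2 * (y * y + z * z)) * px + 2 * (x * y - w * z) * py + 2 * (x * z + w * y) * pz,
        2 * (x * y + w * z) * px + (1 - 2 * (x * x + z * z)) * py + 2 * (y * z - w * x) * pz,
        2 * (x * z - w * y) * px + 2 * (y * z + w * x) * py + (1 - 2 * (x * x + y * y)) * pz,
    )

def rigid_motion(q, t, p):
    """Apply rotation q followed by translation t to point p."""
    r = rotate(q, p)
    return (r[0] + t[0], r[1] + t[1], r[2] + t[2])

q = quaternion((0.0, 0.0, 1.0), 90.0)    # quarter turn about the z-axis
moved = rigid_motion(q, (0.0, 0.0, 2.0), (1.0, 0.0, 0.0))
print(moved)
```

A unit quaternion has four coordinates with unit norm, which is why sampling rotations amounts to sampling the sphere $S^3$.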

50/73

Protein Re-Docking Without False Positives

Empirical Results: the configuration $\mu(B)$, for a rigid motion $\mu$, for which
• $\mathrm{Score}(A, \mu(B))$ is maximized;
• $\mathrm{Bump}(A, \mu(B)) \le 7$;
is near-native: $\mathrm{RMSD}(B_{\mathrm{native}}, \mu(B)) \le 3$.

Other prior works (e.g. FFT-based, geometric hashing) generate multiple possible docking configurations (i.e. near-native + false positives).
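The near-native test relies on RMSD between two placements of the same protein. A minimal Python sketch, assuming the atoms of both placements are listed in the same order and using hypothetical coordinates, is:

```python
import math

def rmsd(xs, ys):
    """Root-mean-square deviation between two equally long lists of 3D points."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("point lists must be non-empty and of equal length")
    total = sum(math.dist(p, q) ** 2 for p, q in zip(xs, ys))
    return math.sqrt(total / len(xs))

def is_near_native(native, docked, threshold=3.0):
    """Near-native criterion from the slide: RMSD of at most 3."""
    return rmsd(native, docked) <= threshold

native = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
docked = [(0.0, 2.0, 0.0), (1.0, 2.0, 0.0)]    # native shifted by 2 along y
print(rmsd(native, docked), is_near_native(native, docked))
```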

51/73

Sampling Rigid Motion Space : High Resolution

Rotations: 12,036 (~5 degrees)
Translations: 0.4 grid step size ($10^6$ to $10^7$)

Running time: 13 hours to two days on 50 machines

Duke Internet Systems and Storage Group Cluster (~200 machines)

Diverse test set (25 protein-protein complexes): 1A22, 1A4Y, 1BI8, 1BUH, 1BXI, 1CHO, 1CSE, 1DFJ, 1F47, 1FC2, 1FIN, 1FS1, 3HLA, 1JAT, 1JLT, 1MCT, 1MEE, 2PTC, 3SGB, 4SGB, 1STF, 1TEC, 1TGS, 1TX4, 3YGS

52/73

Why high resolution?

For two close configurations (i.e. small RMSD), Score and Bump fluctuate greatly.

Example:

[Score, Bump] = [309, 2], [467, 39], [158, 2]

53/73

Protein Re-Docking: Local Search Approach

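Given $A = \{(a_j, r_j) : 1 \le j \le n\}$ and $B = \{(b_i, s_i) : 1 \le i \le m\}$, find a “local” rigid motion $\mu$ such that

• $\mathrm{Score}(A, \mu(B))$ is maximized; and
• $\mathrm{Bump}(A, \mu(B)) \le 7$.

[Figure: protein B positioned near protein A]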

54/73

Outline : Protein Docking

1. Protein-Protein Docking
2. Local Search Algorithm
3. Test Results

55/73

Weighted Least Squares Rigid Motion

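$\mu = \mathrm{WLSM}(w, B, C)$: the rigid motion $\mu$ for which $\sum_i w_i \|\mu(b_i) - c_i\|^2$ is minimized, where $c_i$ is the tentative goal of $b_i$.

This is the absolute orientation problem in computer vision.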
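For reference, this weighted least-squares problem has a well-known closed-form solution (the Kabsch/Horn construction). The sketch below writes $\mu(x) = Rx + t$; the symbols $R$, $t$, $\bar b$, $\bar c$, $H$, $U$, $\Sigma$, $V$ are notation introduced here and do not appear on the slides, and the talk may compute the motion differently.

```latex
\begin{align*}
\bar b &= \frac{\sum_i w_i b_i}{\sum_i w_i}, \qquad
\bar c = \frac{\sum_i w_i c_i}{\sum_i w_i}, \\
H &= \sum_i w_i\,(b_i - \bar b)(c_i - \bar c)^{T} = U \Sigma V^{T} \quad \text{(singular value decomposition)}, \\
R &= V\,\mathrm{diag}\bigl(1,\,1,\,\det(V U^{T})\bigr)\,U^{T}, \qquad
t = \bar c - R\,\bar b, \\
\mu(x) &= R\,x + t \quad \text{minimizes} \quad \sum_i w_i\,\|\mu(b_i) - c_i\|^2 .
\end{align*}
```

The determinant factor guards against a reflection, so that $R$ is a proper rotation.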

56/73

Local Search Algorithm

57/73

Preprocessing: Candidate Positions

mid-spheres $= \{(a, r + s + 0.75) : (a, r) \in A\}$

Vertex set $= \{v : v$ is a vertex of the arrangement of mid-spheres, $\mathrm{Bump}((v, s), A) = 0\}$

$\mathrm{Sc}(v) = \mathrm{Score}((v, s), A)$

58/73

Example: Candidate Positions

59/73

Outer Loop : Increasing Score

Local search neighborhood distance $D$ ($\le 4.5$).

The tentative goal $c_i$ is the largest-score vertex within the local neighborhood of $b_i$.
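Choosing a tentative goal is a nearest-neighborhood maximum. The Python sketch below assumes the candidate positions and their scores Sc(v) have already been computed in preprocessing; the function name, the linear scan, and the sample data are illustrative only (a spatial index would be the natural choice at scale).

```python
import math

def tentative_goal(b, candidates, D=4.5):
    """Return the highest-scoring candidate position within distance D of b.

    candidates: list of (position, score) pairs; returns None if none is in range.
    """
    best = None
    for position, score in candidates:
        if math.dist(b, position) <= D and (best is None or score > best[1]):
            best = (position, score)
    return best[0] if best is not None else None

# Hypothetical candidate positions with precomputed scores.
candidates = [((1.0, 0.0, 0.0), 3), ((4.0, 0.0, 0.0), 7), ((6.0, 0.0, 0.0), 9)]
goal = tentative_goal((0.0, 0.0, 0.0), candidates)
print(goal)
```

The candidate at distance 6 has the best score but lies outside the neighborhood, so it is ignored.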

60/73

Apply Least Squares Rigid Motion

61/73

Inner Loop: Collision Resolution

$F = \{(b, s) \in B : \mathrm{Bump}((b, s), A) = 0,\ \mathrm{Score}((b, s), A) > 0\}$

$Y = \{(b, s) \in B : \mathrm{Bump}((b, s), A) \neq 0\}$

62/73

Collision Resolution

For $(b_i, s) \in F$: $c_i = \mu(b_i)$, $w_i = 1$

For $(b_i, s) \in Y$: $c_i$ = the nearest vertex within distance 2, $w_i = W() / \|\mu(b_i) - c_i\|^2$

63/73

Example: 1BRS

Notation: [Score, Bump] (RMSD)

Native: [309, 2] (0)
Input: [91, 5] (3.78)

Increasing Score    Resolving Collisions
[297, 59]           [236, 34], [215, 28], …, [132, 2] (5.45)
[326, 59]           [298, 43], [282, 30], …, [119, 4] (4.67)
[246, 13]           [247, 10], [200, 9], [174, 4] (2.67)
[351, 30]           [332, 16], [323, 7] (1.98)
[386, 18]           [377, 7] (0.53)

Running Time: 30 seconds to 2 minutes for preprocessing; 1 to 3 seconds per local search

64/73

Outline : Protein Docking

1. Protein-Protein Docking
2. Local Search Algorithm
3. Test Results

65/73

Perturbations

Perturb protein B locally from its native position: rotation $= (u, \theta)$ followed by translation $= (v, t)$

Sampling:

$u, v \in \{$32 uniformly distributed unit vectors in $\mathbb{R}^3\}$

$\theta \in \{0, 3, 6, \ldots, 27\}$ (degrees)

$t \in \{0, 0.5, 1.0, \ldots, 4.5\}$ (Angstrom)

Total:

$(32 \times 9 + 1)$ (rotations) $\times$ $(32 \times 9 + 1)$ (translations) $= 83{,}521$
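The total follows from counting each zero magnitude once (the direction is irrelevant for a zero rotation or zero translation) and each nonzero magnitude once per direction. A short Python check, with illustrative names:

```python
N_DIRECTIONS = 32                      # unit vectors used for both u and v
angles = list(range(0, 28, 3))         # 0, 3, ..., 27 degrees
steps = list(range(10))                # translation distances 0.5 * k, k = 0..9

def count_motions(magnitudes, n_directions=N_DIRECTIONS):
    """A zero magnitude gives one motion; each nonzero one gives n_directions."""
    return sum(1 if m == 0 else n_directions for m in magnitudes)

n_rotations = count_motions(angles)
n_translations = count_motions(steps)
total = n_rotations * n_translations
print(n_rotations, n_translations, total)
```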

66/73

Test Results

Success: Score > 90% of Native_Score, $\mathrm{Bump} \le 7$, $\mathrm{RMSD} \le 2$

Example: $\theta = 18$, $t = 2.5$: 829/1024 = 81%

67/73

Test Results

$\{(\theta \le 12, t \le 3.5), (\theta \le 15, t \le 3.0), (\theta \le 18, t \le 2.5), (\theta \le 21, t \le 2.0)\}$: 40,903/44,481 = 92%

68/73

10 Different Protein-Protein Complexes

$\theta \le 18$ degrees, $t \le 3.5$ Angstrom (43,425 perturbations)
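The perturbation counts quoted in these test results can be reproduced from the sampling scheme (32 directions, 3-degree angle steps, 0.5 Angstrom translation steps). The Python sketch below measures translation in half-Angstrom steps to stay in integers; the helper names are illustrative.

```python
N_DIRECTIONS = 32

def weight(magnitude):
    """Number of motions for one magnitude: 1 if zero, else one per direction."""
    return 1 if magnitude == 0 else N_DIRECTIONS

def count_region(regions):
    """Count perturbations in a union of boxes (theta_max degrees, k_max half-Angstrom steps)."""
    pairs = {(theta, k)
             for theta_max, k_max in regions
             for theta in range(0, theta_max + 1, 3)
             for k in range(k_max + 1)}
    return sum(weight(theta) * weight(k) for theta, k in pairs)

# Union of the four boxes: t <= 3.5, 3.0, 2.5, 2.0 Angstrom is k <= 7, 6, 5, 4.
staircase = count_region([(12, 7), (15, 6), (18, 5), (21, 4)])
# Single box: theta <= 18 degrees, t <= 3.5 Angstrom.
box = count_region([(18, 7)])

print(staircase, box)
print(round(100 * 40903 / staircase), round(100 * 829 / 1024))
```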

69/73

Conclusions

Works well in neighborhood:
- Rotation angle $\le 18$ degrees
- Translation distance $\le 3.5$ Angstrom

Global Search

Incorporate conformation flexibility

Reference: V. Choi, H. Edelsbrunner, P.K. Agarwal and J. Rudolph. Local Search Heuristic for Rigid Protein Docking. To be submitted.

70/73

Acknowledgement

Biogeometry Group @ Duke:
Tammy Bailey
Andrew Ban
Sergei Bespamiatnykh
Abhijit Guria
Vijay Natarajan
Alper Ungor
Yusu Wang

Navin Goyal (Rutgers University)

Raimund Seidel (Univ. des Saarlandes)

Stefan Leopoldseder (Vienna Univ. of Technology)

VMD – Visual Molecular Dynamics: http://www.ks.uiuc.edu/Research/vmd

71/73

Future Work

Protein Docking Problem : Unbound case

Repeats in the Human Genome:

72/73

Repeats: Junk DNA?

Not Junk at All!

Aberrant recombination

Human disease or structural polymorphism

73/73

Future Work

Protein Docking Problem : Unbound case

Repeats in the Human Genome: Characterization and distribution of repeats (both high copy and low copy) in the human genome

Thank You!