multiple global alignment and phylogenetic tree

Michael Schroeder BioTechnological Center TU Dresden [email protected] http://biotec.tu-dresden.de Biotec Multiple Global Alignment and Phylogenetic tree

Upload: garvey

Post on 30-Jan-2016


Page 1: Multiple Global Alignment  and Phylogenetic tree

Michael Schroeder, BioTechnological Center, TU Dresden, [email protected], http://biotec.tu-dresden.de, Biotec

Multiple Global Alignment and Phylogenetic tree

Outline

Multiple sequence alignment (MSA)
- Motivation
- The sum of pairs method (SP)

Phylogenetic tree
- Clustering
- Neighbour joining

Clustalw

What is a Multiple Sequence Alignment

MSA is the alignment of more than two sequences

VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWWSNG--
    *               *

An example of MSA alignment

Dynamic Programming in 3D

QUESTION: Which alignment would be generated for DQLF, DNVQ, QGL?

Dynamic Programming in 3D

D--Q-LF
DNVQ---
---QGL-

How many cases do we need to consider?

In standard dynamic programming we considered 3 cases, namely match/mismatch, insert, and delete

For three sequences s1, s2, s3 there are 7 possibilities for the next alignment column:

s1:  s1_i  -     s1_i  s1_i  -     -     s1_i
s2:  s2_j  s2_j  -     s2_j  -     s2_j  -
s3:  s3_k  s3_k  s3_k  -     s3_k  -     -

For m sequences there are 2^m - 1 possibilities

QUESTION: Why is it “2”?
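The count of 2^m - 1 can be checked by enumeration. The following is a minimal sketch, not part of the original slides: the function names `moves` and `column` are made up for illustration. Each sequence either contributes its next residue (1) or a gap (0) to the next column, and the column consisting only of gaps is excluded.

```python
from itertools import product


def moves(m):
    """All ways to extend an alignment of m sequences by one column.

    A 1 means "take the next residue of that sequence", a 0 means "gap".
    The all-zero vector (a column of gaps only) is not a valid column.
    """
    return [d for d in product((0, 1), repeat=m) if any(d)]


def column(residues, move):
    """Render one alignment column for the given next residues."""
    return "".join(r if step else "-" for r, step in zip(residues, move))


# The 7 possible first columns for DQLF, DNVQ, QGL (next residues D, D, Q).
for move in moves(3):
    print(move, column(("D", "D", "Q"), move))

print(len(moves(3)), "possibilities for m = 3")
```

Each of the m sequences has two options per column, which gives 2^m combinations, minus the one combination in which every sequence takes a gap.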

Complexity

For m sequences each of length n the matrix has n^m cells and for each we must check 2^m - 1 possibilities: That’s prohibitive!

Solution: Use pruning techniques (cut-offs) and heuristics to guide the search for the best solution

A little excursion to Romania: A* Search

Further reading: Russell/Norvig, Artificial Intelligence, Chapter 4, Prentice-Hall

Problem: Find the shortest path from Arad to Bucharest

[Figure: road map of Romania with the towns as nodes and road distances in miles as edge labels; the route Arad, Sibiu, Rimnicu, Pitesti, Bucharest is highlighted.]

Optimal route is (140+80+97+101) = 418 miles

Straight Line Distances to Bucharest

Town        SLD
Arad        366
Bucharest     0
Craiova     160
Dobreta     242
Eforie      161
Fagaras     178
Giurgiu      77
Hirsova     151
Iasi        226
Lugoj       244
Mehadia     241
Neamt       234
Oradea      380
Pitesti      98
Rimnicu     193
Sibiu       253
Timisoara   329
Urziceni     80
Vaslui      199
Zerind      374

Greedy search

[Figure: the Romania road map shown next to the table of straight line distances (SLD) to Bucharest.]

Go to the neighboring city v which minimizes the distance Fv to the goal

QUESTION: Any problems? Why is it called “greedy” search?
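Greedy search can be tried out on a small part of the map. The sketch below is an illustration only: the `ROADS` dictionary contains just the roads whose lengths can be read off the A* walkthrough in this deck, not the full map, and the helper names are invented here. The straight line distances are taken from the SLD table.

```python
# Partial road map (miles); only the roads used in the A* walkthrough.
ROADS = {
    ("Arad", "Zerind"): 75, ("Arad", "Sibiu"): 140, ("Arad", "Timisoara"): 118,
    ("Sibiu", "Oradea"): 151, ("Sibiu", "Fagaras"): 99, ("Sibiu", "Rimnicu"): 80,
    ("Rimnicu", "Pitesti"): 97, ("Rimnicu", "Craiova"): 146,
    ("Pitesti", "Bucharest"): 101, ("Fagaras", "Bucharest"): 211,
}
# Straight line distance to Bucharest (the estimate F).
SLD = {"Arad": 366, "Bucharest": 0, "Craiova": 160, "Fagaras": 178,
       "Oradea": 380, "Pitesti": 98, "Rimnicu": 193, "Sibiu": 253,
       "Timisoara": 329, "Zerind": 374}


def neighbours(town):
    """Map each town directly connected to `town` to the road length."""
    result = {}
    for (a, b), miles in ROADS.items():
        if a == town:
            result[b] = miles
        elif b == town:
            result[a] = miles
    return result


def greedy(start, goal):
    """Always move to the unvisited neighbour with the smallest F."""
    path, cost, town = [start], 0, start
    while town != goal:
        options = {t: d for t, d in neighbours(town).items() if t not in path}
        town = min(options, key=SLD.get)  # ignores the distance covered so far
        cost += options[town]
        path.append(town)
    return path, cost


print(greedy("Arad", "Bucharest"))
```

On this partial map the greedy route runs via Fagaras and costs 450 miles, compared with the 418 miles of the optimal route. On a general graph this simple version could also run into a dead end, which the sketch does not handle.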

Problems of greedy search

Not optimal: greedy search goes from Arad to Bucharest via Fagaras, while the optimum is via Rimnicu

Problem: The greedy algorithm does not include the distance already covered

A*: Pursue the best node first, with a scoring function of the distance so far plus an under-estimate of the distance to the goal (e.g. the straight line distance)
- v is a node
- Sv: best score to go from start to node v
- Fv: estimate for going from v to goal
- Tv = Sv + Fv: total score

Organize the nodes to be visited sorted by total score (TODO list in the next slides)
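The TODO list sorted by total score maps naturally onto a priority queue. The following sketch assumes the same partial road map as the walkthrough in the next slides (only roads whose lengths appear there) and uses Python's `heapq` module for the TODO list; the function and variable names are illustrative, not taken from the slides.

```python
import heapq

# Partial road map (miles); only the roads used in the A* walkthrough.
ROADS = {
    ("Arad", "Zerind"): 75, ("Arad", "Sibiu"): 140, ("Arad", "Timisoara"): 118,
    ("Sibiu", "Oradea"): 151, ("Sibiu", "Fagaras"): 99, ("Sibiu", "Rimnicu"): 80,
    ("Rimnicu", "Pitesti"): 97, ("Rimnicu", "Craiova"): 146,
    ("Pitesti", "Bucharest"): 101, ("Fagaras", "Bucharest"): 211,
}
# Straight line distance to Bucharest (the under-estimate F).
SLD = {"Arad": 366, "Bucharest": 0, "Craiova": 160, "Fagaras": 178,
       "Oradea": 380, "Pitesti": 98, "Rimnicu": 193, "Sibiu": 253,
       "Timisoara": 329, "Zerind": 374}


def neighbours(town):
    """Map each town directly connected to `town` to the road length."""
    result = {}
    for (a, b), miles in ROADS.items():
        if a == town:
            result[b] = miles
        elif b == town:
            result[a] = miles
    return result


def a_star(start, goal):
    """Return (path, cost, done) where done lists the expanded towns."""
    todo = [(SLD[start], 0, start, [start])]  # entries are (T, S, town, path)
    done = []
    while todo:
        t, s, town, path = heapq.heappop(todo)  # node with the lowest T
        if town == goal:
            return path, s, done  # goal has the lowest T: stop
        if town in done:
            continue
        done.append(town)
        for nxt, miles in neighbours(town).items():
            if nxt not in done:
                s_next = s + miles
                heapq.heappush(todo, (s_next + SLD[nxt], s_next, nxt, path + [nxt]))
    return None


path, cost, done = a_star("Arad", "Bucharest")
print(path, cost)
print("DONE =", done)
```

The search only stops when the goal itself is the cheapest entry on the TODO list, which is why Fagaras (T = 417) is still expanded after Bucharest has been reached with T = 418.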

A* search of the Romanian map featured in the previous slide. Note: Nodes are labelled with Tv = Sv + Fv. However, we will be using the abbreviations T, S and F to make the notation simpler.

[Figure: A* search tree with root Arad, T = 0 + 366 = 366.]

We begin with the initial state of Arad. The cost of reaching Arad from Arad (or S value) is 0 miles. The straight line distance from Arad to Bucharest (or F value) is 366 miles. This gives us a total value of ( T = S + F ) 366 miles. Expand the initial state of Arad.

[Figure: search tree with root Arad, T = 0 + 366 = 366.]

DONE = []

TODO = [Arad/366]

Once Arad is expanded we look for the node with the lowest cost. Sibiu has the lowest value for T. (The cost to reach Sibiu from Arad is 140 miles, and the straight line distance from Sibiu to the goal state is 253 miles. This gives a total of 393 miles).

[Figure: search tree after expanding Arad. Zerind: T = 75 + 374 = 449; Sibiu: T = 140 + 253 = 393; Timisoara: T = 118 + 329 = 447.]

DONE = [Arad]

TODO = [Sibiu/393, Timisoara/447, Zerind/449]

We now expand Sibiu (that is, we expand the node with the lowest value of T).

[Figure: search tree after expanding Sibiu. Fagaras: T = 239 + 178 = 417; Oradea: T = 291 + 380 = 671; Rimnicu: T = 220 + 193 = 413.]

DONE = [Arad, Sibiu]

TODO = [Rimnicu/413, Fagaras/417, Timisoara/447, Zerind/449, Oradea/671]

We now expand Rimnicu (that is, we expand the node with the lowest value of T).

[Figure: search tree as on the previous slide, with Rimnicu (T = 413) selected.]

DONE = [Arad, Sibiu]

TODO = [Rimnicu/413, Fagaras/417, Timisoara/447, Zerind/449, Oradea/671]

Once Rimnicu is expanded we look for the node with the lowest cost. As you can see, Pitesti has the lowest value for T. (The cost to reach Pitesti from Arad is 317 miles, and the straight line distance from Pitesti to the goal state is 98 miles. This gives a total of 415 miles.)

[Figure: search tree after expanding Rimnicu. Pitesti: T = 317 + 98 = 415; Craiova: T = 366 + 160 = 526.]

DONE = [Arad, Sibiu, Rimnicu]

TODO = [Pitesti/415, Fagaras/417, Timisoara/447, Zerind/449, Craiova/526, Oradea/671]

We now expand Pitesti (that is, we expand the node with the lowest value of T).

[Figure: search tree after expanding Pitesti. Bucharest: T = 418 + 0 = 418.]

DONE = [Arad, Sibiu, Rimnicu, Pitesti]

TODO = [Fagaras/417, Bucharest/418, Timisoara/447, Zerind/449, Craiova/526, Oradea/671]

In actual fact, the algorithm will not really recognise that we have found Bucharest. It just keeps expanding the lowest cost nodes (based on T) until it finds a goal state AND it has the lowest value of T. So, we must now move to Fagaras and expand it.

[Figure: search tree as on the previous slide, with Bucharest at T = 418 + 0 = 418.]

DONE = [Arad, Sibiu, Rimnicu, Pitesti]

TODO = [Fagaras/417, Bucharest/418, Timisoara/447, Zerind/449, Craiova/526, Oradea/671]

We have just expanded a node (Pitesti) that revealed Bucharest, but it has a cost of 418. If there is any other lower cost node (and in this case there is one cheaper node, Fagaras, with a cost of 417) then we need to expand it in case it leads to a better solution to Bucharest than the 418 solution we have already found.

[Figure: search tree as on the previous slide.]

DONE = [Arad, Sibiu, Rimnicu, Pitesti]

TODO = [Fagaras/417, Bucharest/418, Timisoara/447, Zerind/449, Craiova/526, Oradea/671]

We now expand Fagaras (that is, we expand the node with the lowest value of T).

[Figure: search tree after expanding Fagaras. Bucharest(2): T = 450 + 0 = 450.]

DONE = [Arad, Sibiu, Rimnicu, Pitesti]

TODO = [Fagaras/417, Bucharest/418, Timisoara/447, Zerind/449, Craiova/526, Oradea/671]

Once Fagaras is expanded we look for the lowest cost node. As you can see, we now have two Bucharest nodes. One of these nodes ( Arad – Sibiu – Rimnicu – Pitesti – Bucharest ) has a T value of 418. The other node ( Arad – Sibiu – Fagaras – Bucharest(2) ) has a T value of 450. We therefore move to the first Bucharest node and expand it.

[Figure: search tree with Bucharest: T = 418 + 0 = 418 and Bucharest(2): T = 450 + 0 = 450.]

DONE = [Arad, Sibiu, Rimnicu, Pitesti, Fagaras]

TODO = [Bucharest/418, Timisoara/447, Zerind/449, Bucharest/450, Craiova/526, Oradea/671]

We have now arrived at Bucharest. As this is the lowest cost node AND the goal state we can terminate the search. If you look back over the slides you will see that the solution returned by the A* search pattern ( Arad – Sibiu – Rimnicu – Pitesti – Bucharest ), is in fact the optimal solution.

[Figure: final search tree with the path Arad, Sibiu, Rimnicu, Pitesti, Bucharest (T = 418).]

DONE = [Arad, Sibiu, Rimnicu, Pitesti, Fagaras]

TODO = [Bucharest/418, Timisoara/447, Zerind/449, Bucharest/450, Craiova/526, Oradea/671]

Additional optimization

Let's assume we have an (over-)estimate K for the best solution, i.e. the optimal solution will be better than K

Do not consider any node with total score Tv worse than K

If Tv > K then remove v from TODO list

Additional optimization

[Figure: the final A* search tree from the walkthrough.]

Assume K = 430, then we can remove the nodes Zerind, Oradea, Timisoara, Craiova

QUESTION: What if K is equal to the optimum? What if K is poorly chosen? What if the rule is “If Tv >= K then remove v”? Problem?
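The effect of the cut-off can be tried on the final TODO list of the walkthrough. This is a small sketch with invented names (`prune`, `strict`); the T values are the ones shown in the slides, and "Bucharest(2)" is the label the figure uses for the route via Fagaras.

```python
# Final TODO list of the A* walkthrough as (node, T) pairs.
TODO = [("Bucharest", 418), ("Timisoara", 447), ("Zerind", 449),
        ("Bucharest(2)", 450), ("Craiova", 526), ("Oradea", 671)]


def prune(todo, K, strict=True):
    """Split todo into (kept, removed) for the over-estimate K.

    strict=True  applies the rule "if Tv >  K then remove v".
    strict=False applies the rule "if Tv >= K then remove v".
    """
    def too_expensive(t):
        return t > K if strict else t >= K

    kept = [(node, t) for node, t in todo if not too_expensive(t)]
    removed = [node for node, t in todo if too_expensive(t)]
    return kept, removed


print(prune(TODO, K=430))                 # only Bucharest/418 survives
print(prune(TODO, K=418, strict=False))   # the optimum itself is removed
```

With K = 430 the nodes named on the slide are removed, and so is the more expensive Bucharest(2) node. With K equal to the optimum and the ">=" rule, the optimal goal node itself would be discarded, which is the problem the last question hints at.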

F must be under-estimate

For algorithm to work F must be an under-estimate

Example: Direct distance is always shorter than road

QUESTION: What happens if F is not under-estimate?

Then it cannot be guaranteed that the optimal solution is found. E.g. suppose F_Rimnicu = 10,000 in the example.

Then T_Rimnicu = 220 + 10,000 = 10,220 > K = 450, so Rimnicu would be removed, and the optimal solution would not be found

From Romania to Dresden

So, what does that mean for multiple sequence alignment?

QUESTIONS:
- What does a node (city) correspond to?
- What does an edge between nodes correspond to?
- What does the cost between two nodes correspond to?
- How could we define S?
- How could we define F?
- How could we define K?

The Sum of Pairs Method

As in the pairwise case, not all MSAs are equally good. We need a scoring method to determine when one MSA is better than another one

The Sum of Pairs (SP) method: For each column in the alignment, sum up the score of each pair of residues.
- M: an MSA of the sequences (s1, s2, ..., sm)
- s'i is the projection of si, i.e. the sequence si with gaps
- S(s'i,s'j): the score of the projections
- The final score is

SP(M) = \sum_{i=1}^{m-1} \sum_{j=i+1}^{m} S(s'_i, s'_j)

An Example of Using the SP Method

Example

s1 = AVP      s'1: A-VP-
s2 = AVT      s'2: A-V-T
s3 = PSVPT    s'3: PSVPT

Scores:
- Match = 1
- Mismatch, insertion, deletion = -1
- S(-, -) = 0 to prevent the double counting of gaps.

QUESTION: What is the score of the alignment?

Then the SP score is

S(s'1,s'2) + S(s'1,s'3) + S(s'2,s'3)

= 0 + (-1) + (-1)

= -2
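The SP score of this example can be recomputed with a few lines of code. This is a minimal sketch under the scoring scheme of the slide (match = 1, everything else = -1, S(-, -) = 0); the function names `pair_score` and `sp_score` are chosen here for illustration.

```python
from itertools import combinations


def pair_score(a, b, match=1, other=-1):
    """Score one pair of symbols from the same alignment column."""
    if a == "-" and b == "-":
        return 0  # S(-, -) = 0, so shared gaps are not counted twice
    return match if a == b else other  # mismatch, insertion, deletion


def sp_score(rows, match=1, other=-1):
    """Sum of pairs score of an alignment given as equally long rows."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all rows of an alignment must have the same length")
    return sum(pair_score(a, b, match, other)
               for x, y in combinations(rows, 2)   # every pair of rows
               for a, b in zip(x, y))              # every column


print(sp_score(["A-VP-", "A-V-T", "PSVPT"]))  # the example above
```

Summing over row pairs first and columns second gives the same total as summing column by column, since the order of the two sums does not matter.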

1 MSA vs. n SA

What is the difference between making one multiple sequence alignment and making many pairwise sequence comparisons?

The score S(s'i,s'j) for the alignment of s'i and s'j within a multiple sequence alignment is less than or equal to the score S(si,sj) for aligning si and sj directly

S(s'i,s'j) <= S(si,sj)

Pruning the search space

Computing all cells in the dynamic programming solution is expensive, therefore we want to avoid computing as many cells as possible

Can we rule out any cells?
- Let us assume that we already know an alignment of score K
- Let v = (i1,i2,…,im) be a cell of the DP matrix for which we want to determine whether we need to consider it (and its neighbours) or not
- Let Sv be the score of the best path from the start cell to cell v
- Let Fv be an upper bound for the highest-scoring alignment from v to the end of the DP matrix, i.e. every path from v to the end scores at most Fv

Then we know the following: If Sv + Fv < K, then v cannot lie on the path of the best alignment

[Figure: a path through the DP matrix with score Sv from the start cell to v and score <= Fv from v to the end cell.]

Dynamic Pruning with Forward Recursion

D(v,w) is the score to be added when moving from v to its forward (east, southeast, south) neighbor w.

I.e. the overall score Sv+D(v,w) is sent to w.

The value of Sw is the maximum of all values sent to w from its backward (west, north, northwest) neighbor cells.

From cell v values are sent to all its neighbor cells

[Figure: two-sequence case; cell v sends Sv - g along the two gap moves and Sv + R(s1_i, s2_j) along the diagonal move.]

One more thing: A queue

We need a data structure before we list the algorithm

A queue is a list of elements with two special operators
- Push: to add an element at the end of the queue
- Pop: to remove an element from the top of a queue

Algorithm: Forward-recursion with pruning

const
  h0        the start cell of the DP matrix (H0,0…0)
  hN        the end cell of the DP matrix (Hn1,n2…nm)
  K         a lower bound for the score of the whole alignment
var
  u, v, w   denote cells
  S(u)      the best score of an alignment from h0 to u
  P(u)      the score of the best alignment from h0 to u found so far
  D(u, v)   the score for extending the alignment from cell u to cell v
  Q         a queue of the cells u for which a value for P(u) is found but u is not visited yet
proc
  F(v, hN)  a procedure which finds an upper bound of the score of the alignment from a cell v to the end-cell hN
begin
  v = h0; P(v) = 0; push(v,Q)            push start cell on queue
  while Q is not empty do
    pop(v,Q); S(v) = P(v)                v has got all values from its neighbours
    if S(v) + F(v, hN) >= K then
      for all forward neighbours w of v do
        if w doesn't belong to Q then
          push(w,Q); P(w) = S(v) + D(v,w)
        else
          P(w) = max( P(w), S(v)+D(v,w) )
      end for
  end while
end
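The procedure can be turned into a short runnable sketch. Several choices below are assumptions and not taken from the slides: the queue is replaced by a priority queue (`heapq`) that pops cells in lexicographic order, which guarantees that every backward neighbour of a cell is processed before the cell itself; the upper bound F is computed as the sum of optimal pairwise scores of the remaining suffixes; scoring uses match = 3 and -1 otherwise, with S(-, -) = 0. All function names are illustrative.

```python
import heapq
from itertools import combinations, product

MATCH, OTHER = 3, -1


def pair(a, b):
    """Score of two symbols in one column; a pair of gaps scores 0."""
    if a == "-" and b == "-":
        return 0
    return MATCH if a == b else OTHER


def nw(a, b):
    """Optimal global pairwise alignment score (linear gap cost)."""
    prev = [OTHER * j for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [OTHER * i]
        for j in range(1, len(b) + 1):
            cur.append(max(prev[j - 1] + pair(a[i - 1], b[j - 1]),
                           prev[j] + OTHER, cur[j - 1] + OTHER))
        prev = cur
    return prev[-1]


def upper_bound(seqs, v):
    """F(v, hN): sum of optimal pairwise scores of the remaining suffixes."""
    return sum(nw(seqs[k][v[k]:], seqs[l][v[l]:])
               for k, l in combinations(range(len(seqs)), 2))


def msa_score(seqs, K):
    """Return (optimal SP score or None, number of cells visited)."""
    m = len(seqs)
    start, end = (0,) * m, tuple(len(s) for s in seqs)
    steps = [d for d in product((0, 1), repeat=m) if any(d)]
    P, S, queue = {start: 0}, {}, [start]
    while queue:
        v = heapq.heappop(queue)          # lexicographically smallest cell
        S[v] = P[v]                       # v has got all incoming values
        if S[v] + upper_bound(seqs, v) < K:
            continue                      # pruned: v cannot be on a best path
        for d in steps:
            w = tuple(i + s for i, s in zip(v, d))
            if any(wi > n for wi, n in zip(w, end)):
                continue                  # w lies outside the DP matrix
            col = [seqs[j][v[j]] if d[j] else "-" for j in range(m)]
            value = S[v] + sum(pair(x, y) for x, y in combinations(col, 2))
            if w not in P:
                P[w] = value
                heapq.heappush(queue, w)
            else:
                P[w] = max(P[w], value)
    return S.get(end), len(S)


score, visited = msa_score(("DQLF", "DNVQ", "QGL"), K=-2)
print(score, visited)
```

Calling `msa_score` with `K=float("-inf")` switches the pruning off, which makes it easy to compare how many cells are visited with and without the bound.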

Finding upper limits for scores

For any multiple sequence alignment M of sequences {s1,s2,…,sm} we know that the score S(M) of the multiple sequence alignment is at most the sum of the scores of the pairwise alignments of the sequences {s1,s2,…,sm}:

S(M) \le \sum_{k=1}^{m-1} \sum_{l=k+1}^{m} S(s_k, s_l)

The procedure F should find an upper bound for the alignment of the subsequences s^1_{i_1+1…n_1}, s^2_{i_2+1…n_2}, …, s^m_{i_m+1…n_m}. This can be done as follows:

F = \sum_{k=1}^{m-1} \sum_{l=k+1}^{m} S(s^k_{i_k+1 \ldots n_k}, s^l_{i_l+1 \ldots n_l})    (4.6)
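Equation (4.6) can be evaluated by aligning every pair of remaining suffixes optimally. The sketch below assumes a plain global alignment with linear gap cost (match = 3, mismatch and gap = -1, the scheme used in the worked example of this deck); the names `nw_score` and `F` are chosen for illustration, and `F` takes the sequences as an extra argument instead of the end cell.

```python
from itertools import combinations


def nw_score(a, b, match=3, other=-1):
    """Optimal global alignment score of a and b (linear gap cost)."""
    prev = [other * j for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [other * i]
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else other)
            cur.append(max(diagonal, prev[j] + other, cur[j - 1] + other))
        prev = cur
    return prev[-1]


def F(seqs, v):
    """Upper bound (4.6) for the alignment from cell v to the end cell.

    v[k] is the number of residues of sequence k already aligned, so the
    remaining suffix of sequence k is seqs[k][v[k]:].
    """
    return sum(nw_score(seqs[k][v[k]:], seqs[l][v[l]:])
               for k, l in combinations(range(len(seqs)), 2))


seqs = ("DQLF", "DNVQ", "QGL")
print(F(seqs, (0, 0, 0)))  # bound for the whole alignment
print(F(seqs, (0, 0, 1)))  # bound after Q of the third sequence is used
```

The bound is valid because projecting a multiple alignment onto two of its rows gives a pairwise alignment, which cannot score better than the optimal pairwise alignment of those two sequences.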

Questions

QUESTION: What is the score of the multiple sequence alignment when the algorithm is done?

QUESTION: How can we get alignment from algorithm?

Answers

The score for the multiple sequence alignment is S(hN)

How can we get an alignment from the algorithm? We need another variable Dir to store the direction from which we were coming

Let's assume we are at node v and its neighbour w is not pruned
- If w is new in queue then Dir(w) = {v}
- If w is already in queue and S(v)+D(v,w) > P(w) then P(w) = S(v)+D(v,w) and Dir(w) = {v}
- If w is already in queue and S(v)+D(v,w) = P(w) then add v to Dir(w)

Algorithm: Forward-recursion with pruning

const
  h0        the start cell of the DP matrix (H0,0…0)
  hN        the end cell of the DP matrix (Hn1,n2…nm)
  K         a lower bound for the score of the whole alignment
var
  u, v, w   denote cells
  S(u)      the best score of an alignment from h0 to u
  P(u)      the score of the best alignment from h0 to u found so far
  D(u, v)   the score for extending the alignment from cell u to cell v
  Q         a queue of the cells u for which a value for P(u) is found but u is not visited yet
  Dir(w)    stores nodes v from which best scores were obtained
proc
  F(v, hN)  a procedure which finds an upper bound of the score of the alignment from a cell v to the end-cell hN
begin
  v = h0; P(v) = 0; push(v,Q)            push start cell on queue
  while Q is not empty do
    pop(v,Q); S(v) = P(v)                v has got all values from its neighbours
    if S(v) + F(v, hN) >= K then
      for all forward neighbours w of v do
        if w doesn't belong to Q then
          push(w,Q); P(w) = S(v) + D(v,w); Dir(w) = {v}
        else if S(v)+D(v,w) > P(w) then
          P(w) = S(v)+D(v,w); Dir(w) = {v}
        else if S(v)+D(v,w) = P(w) then
          add v to Dir(w)
      end for
  end while
end

Printing the alignment: printMSA(hN,0)

printMSA is a recursive function, which takes a node v and a position k in the alignment to be generated as input

B is a matrix, which contains the alignment

printMSA(v,k):
  If v = h0 then print B
  Else
    Let i1,…,im be the indices of v
    For all u in Dir(v) do
      Let i'1,…,i'm be the indices of u
      For j from 1 to m do
        If ij = i'j then Bk,j = "-"
        Else Bk,j = sequence j at position ij
      printMSA(u,k+1)
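The rule that turns two neighbouring cells into one alignment column can be illustrated in isolation. The sketch below is a simplification, not the recursive printMSA itself: it assumes a single path of cells is already known and walks it forwards from h0 to hN, so the columns come out in reading order. The function name and the example path are chosen for illustration.

```python
def alignment_from_path(seqs, path):
    """Build alignment rows from a path of DP cells (start cell first).

    For each step u -> v: if index j did not change, sequence j gets a gap,
    otherwise it contributes its residue at position v[j] (1-based).
    """
    rows = [[] for _ in seqs]
    for u, v in zip(path, path[1:]):
        for j, seq in enumerate(seqs):
            if v[j] == u[j]:
                rows[j].append("-")
            else:
                rows[j].append(seq[v[j] - 1])  # position v[j], 0-based index
    return ["".join(row) for row in rows]


seqs = ("DQLF", "DNVQ", "QGL")
# One path through the DP cube, listed from the start cell to the end cell.
path = [(0, 0, 0), (1, 1, 0), (1, 2, 0), (1, 3, 0),
        (2, 4, 1), (2, 4, 2), (3, 4, 3), (4, 4, 3)]
for row in alignment_from_path(seqs, path):
    print(row)
```

Because Dir(v) is a set, the recursive procedure may branch at a cell and then follows every best predecessor in turn, whereas this sketch only handles one fixed path.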

Questions

QUESTION: Why is Dir a set and not a single node?

QUESTION: Does printMSA print one multiple sequence alignment or all possible ones?

Example

Let’s align DQLF, DNVQ, QGL with match = 3 and insertion, deletion, mismatch = -1

[Figure: the DP cube with start cell <0,0,0> and end cell <3,3,2>.]

Example

We need a lower bound for the overall result. Let’s assume we have got already the following alignment

DQ-LF
DNVQ-
-QGL-

What is K, the sum of pairs for this alignment?

K = -1 - 4 + 3 = -2

Example

Upper bound for the score from <0,0,0> to <3,3,2> (match = 3 and insertion, deletion, mismatch = -1)

D--QLF    DQ-LF    DNVQ--
DNVQ--    -QGL-    ---QGL
  +2        +3       -2

F( <0,0,0>, <3,3,2> ) = +2 +3 -2 = +3

Example

Q: <0,0,0>
P( <0,0,0> ) = 0
S( <0,0,0> ) = 0

Example

S( <0,0,0> ) + F( <0,0,0>, <3,3,2> ) = 0 + 3 >= -2

Q: <0,0,1>, <0,1,0>, <0,1,1>, … , <1,1,1>

Cell       P              Column
<0,0,1>    0 + -2 = -2    --Q
<0,1,0>    0 + -2 = -2    -D-
<0,1,1>    0 + -3 = -3    -DQ
<1,0,0>    0 + -2 = -2    D--
<1,0,1>    0 + -3 = -3    D-Q
<1,1,0>    0 + 1 = 1      DD-
<1,1,1>    0 + 1 = 1      DDQ

Example

v = <0,0,1>, Q: <0,1,0>, <0,1,1>, … , <1,1,1>

S( <0,0,1> ) = P( <0,0,1> ) = -2

Example

Upper bound for the score from <0,0,1> to <3,3,2> (match = 3 and insertion, deletion, mismatch = -1)

D--QLF    DQLF    DNVQ
DNVQ--    -GL-    GL--
  +2       +0      -4

F( <0,0,1>, <3,3,2> ) = +2 +0 -4 = -2

Example

v = <0,0,1>

S( <0,0,1> ) = -2

S( <0,0,1> ) + F( <0,0,1>, <3,3,2> ) = -2 - 2 = -4, which is not >= -2

Q: <0,1,0>, <0,1,1>, … , <1,1,1>

v = <0,0,1> is not pursued further, as the pruning rule determines that it cannot be part of the best alignment

From MSA to phylogenetic trees

AR-L
ARTL
ARSI
ARSL
AWTL
AWT-

[Figure: the aligned sequences are successively partitioned into groups of similar sequences (steps 1, 2, 3), which yields a tree.]

Phylogenetic tree

Introduction

Definition

Tree construction method
– Clustering (UPGMA)
– Neighbour Joining

Darwin: “On the Origin of Species”

Find the evolutionary history of species existing today and how they are related.

Unrooted and Rooted Trees

[Figure: the unrooted tree for three sequences A, B, C and the rooted trees obtained from it by choosing where the root is placed.]

Unrooted and Rooted Trees

[Figure: all the topologies for four original sequences A, B, C, D: (a) unrooted and (b) rooted.]

How many different trees are there?

The number of unrooted topologies for m≥3 original sequences is

T_{unroot}(m) = \frac{(2m-5)!}{2^{m-3} (m-3)!}    (4.7)

The number of rooted topologies for m≥2 original sequences is

T_{root}(m) = \frac{(2m-3)!}{2^{m-2} (m-2)!}    (4.8)

Example: For m=10 there are 2,027,025 unrooted trees and 34,459,425 rooted trees
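The two counts are easy to evaluate with exact integer arithmetic. A minimal sketch (the function names are chosen here for illustration):

```python
from math import factorial


def unrooted_topologies(m):
    """Number of unrooted tree topologies for m >= 3 sequences."""
    return factorial(2 * m - 5) // (2 ** (m - 3) * factorial(m - 3))


def rooted_topologies(m):
    """Number of rooted tree topologies for m >= 2 sequences."""
    return factorial(2 * m - 3) // (2 ** (m - 2) * factorial(m - 2))


for m in (3, 4, 5, 10):
    print(m, unrooted_topologies(m), rooted_topologies(m))
```

The rooted count for m sequences equals the unrooted count for m + 1 sequences, which matches the picture of obtaining a rooted tree by attaching a root to one edge of an unrooted tree.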

Distances between Nodes

Degree of sequence similarity should be reflected in the distances between nodes

Additive tree: The distance between any two nodes is the sum of the distances over the edges connecting the nodes

Additive Trees

A tree is additive if and only if the distance between any two nodes is the sum of the distances over the edges connecting the nodes

[Figure (a): an additive tree with leaves A to F, labelled edge lengths and root position r.]

(b)
     B   C   D   E   F
A   27  24  22  31  30
B       11  21  12  11
C           18  15  14
D               25  24
E                    5

(a) An additive tree constructed from the sequences with the distances in (b). r shows where a root is placed.

Additive Trees

If, for every four nodes i, j, k, l (suitably labelled), the distances satisfy the condition below, then an additive tree can be constructed

D_{i,j} + D_{k,l} = D_{i,k} + D_{j,l} ≥ D_{i,l} + D_{j,k}

This means that there are often distance matrices for which we cannot compute an additive tree

[Figure: a tree connecting the four nodes i, l, j, k.]
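The condition can be tested mechanically: for every four nodes, the two largest of the three pairwise sums must be equal. The sketch below checks this for the distance table of the additive tree example and for the distance matrix used in the UPGMA example of this deck; the helper names and the tolerance parameter are assumptions made for illustration.

```python
from itertools import combinations


def from_rows(labels, rows):
    """Build a symmetric distance lookup from upper-triangular rows."""
    D = {}
    for i, row in enumerate(rows):
        for offset, value in enumerate(row):
            D[frozenset((labels[i], labels[i + 1 + offset]))] = value
    return D


def is_additive(labels, D, tol=1e-9):
    """Four-point condition: the two largest of the three sums coincide."""
    def d(a, b):
        return D[frozenset((a, b))]

    for i, j, k, l in combinations(labels, 4):
        sums = sorted([d(i, j) + d(k, l), d(i, k) + d(j, l), d(i, l) + d(j, k)])
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True


# Distances of the additive tree example (A to F).
additive = from_rows("ABCDEF", [[27, 24, 22, 31, 30], [11, 21, 12, 11],
                                [18, 15, 14], [25, 24], [5]])
# Distances of the UPGMA example (A to E).
upgma_example = from_rows("ABCDE", [[3, 7, 8, 10], [6, 8, 7], [4, 5], [6]])

print(is_additive("ABCDEF", additive))
print(is_additive("ABCDE", upgma_example))
```

For the second matrix the quadruple A, B, C, D already fails, since the three sums are 7, 15 and 14.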

Distance-based Approach

Single alignment:

atgctctggccacggcacttgcggatcccagggtgatctgtgcacctgcgata
|||||||||||||||  |||| ||||||||   |||| |||||||||||||||
atgctctggccacggatcttgtggatccca---tgatatgtgcacctgcgata

Score: 46 matches, 3 mismatches, 1 gap, 3 gap extensions, e.g. Score = 46x1 - 3x1 - 1x2 - 3x1 = 38

Approach:
- Define distance between two sequences, e.g. percentage of mismatches in their alignment
- Construct tree, which groups sequences with minimal distances iteratively together

Distance based

[Figure: workflow from the alignment to a distance matrix to a tree with nodes numbered 1 to 7.]

Hierarchical Clustering (Single linkage)

     1   2   3   4   5
1    0   2   6  10   9
2        0   5   9   8
3            0   4   5
4                0   3
5                    0

        (1,2)   3   4   5
(1,2)     0     5   9   8
3               0   4   5
4                   0   3
5                       0

        (1,2)   3   (4,5)
(1,2)     0     5     8
3               0     4
(4,5)                 0

            (1,2)   (3,(4,5))
(1,2)         0         5
(3,(4,5))               0

[Figure: dendrogram over the leaves 1 to 5 with a height axis from 0 to 5.]

Hierarchical clustering

const
  m   number of original sequences
var
  U   a set of current trees, initially one tree for each original sequence
  D   the distance between the trees in U
begin
  U = the set of one tree (each of one node) for each original sequence
  while |U| > 1 do
    (u,v) = the roots of two trees in U with the least distance in D
    Make a new tree with root w and with u and v as children
    Calculate the length of the edges (v, w) and (u, w)
    for each root x of the trees in U - {u, v} do
      D(x, w) = calculate the distance between x and the new node (w)
    end
    U = (U - {u,v}) ∪ {w}                update U
  end
end
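The loop can be sketched in a few lines. In this illustration (names and data layout are assumptions, not from the slides) trees are nested tuples, distances are stored under `frozenset` keys, edge lengths are left out, and the way the distance to the new node is calculated is passed in as a function, here `min` for single linkage. The data is the five-element single linkage example from this deck.

```python
def cluster(labels, D, linkage=min):
    """Agglomerative clustering; returns the merges as (tree, distance)."""
    D = dict(D)            # work on a copy, new nodes are added to it
    trees = list(labels)   # U: one single-node tree per original sequence
    merges = []
    while len(trees) > 1:
        # (u, v) = the two roots with the least distance in D
        u, v = min(((a, b) for i, a in enumerate(trees) for b in trees[i + 1:]),
                   key=lambda p: D[frozenset(p)])
        w = (u, v)         # new tree with u and v as children
        merges.append((w, D[frozenset((u, v))]))
        for x in trees:
            if x not in (u, v):
                D[frozenset((x, w))] = linkage(D[frozenset((x, u))],
                                               D[frozenset((x, v))])
        trees = [x for x in trees if x not in (u, v)] + [w]   # update U
    return merges


rows = [[2, 6, 10, 9], [5, 9, 8], [4, 5], [3]]   # upper triangle for 1..5
D = {frozenset((i + 1, i + 2 + k)): value
     for i, row in enumerate(rows) for k, value in enumerate(row)}

for tree, distance in cluster([1, 2, 3, 4, 5], D, linkage=min):
    print(tree, distance)
```

Passing `linkage=max` instead gives complete linkage on the same data; a size-weighted average needs the subtree sizes as well and therefore does not fit this two-argument function.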

Hierarchical Clustering

How to define distance between clusters? Distance to the new cluster w = (u,v):

Single linkage: D(x,w) = min { D(x,u), D(x,v) }
Example: Distance (A,B) to C is 1

Complete linkage: D(x,w) = max { D(x,u), D(x,v) }
Example: Distance (A,B) to C is 2

Average linkage (also called WPGMA, weighted pair group method with arithmetic mean): D(x,w) = ( D(x,u) + D(x,v) ) / 2
Example: Distance (A,B) to C is 1.5

More general (also called UPGMA, unweighted pair group method using arithmetic mean): D(x,w) = ( mu D(x,u) + mv D(x,v) ) / (mu + mv), where mu is the number of nodes in the subtree u

Question: Are dendrograms always the same independent of the method?

Question: What’s the difference between UPGMA and WPGMA?

Note: “weighted” because u and v may have different numbers of nodes, hence they are weighted.

Page 68: Multiple Global Alignment  and Phylogenetic tree

Hierarchical Clustering

     A   B   C
A    0   1   2
B        0   1
C            0

[Figure: two dendrograms over A, B, C, one for single linkage and one for complete linkage]

Average linkage: D(x,w) = ( D(x,u) + D(x,v) ) / 2 Example: Distance (A,B) to C is 1.5

More general: D(x,w) = ( mu D(x,u) + mv D(x,v) ) / (mu + mv ), where mu is the number of leaves in the subtree u

Consider that subtree D contains 100 leaves (mD = 100) and E only 1 (mE = 1)

     D   E   F
D    0   1   2
E        0  10
F            0

Average linkage: D( (D,E), F ) = (2+10)/2 = 6

Size-weighted average: D( (D,E), F ) = (100*2 + 1*10)/(100+1) = 2.08
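The four update rules can be checked numerically on the A, B, C and D, E, F examples. A minimal sketch; the function names are chosen here for illustration and are not from the slides.

```python
def single_linkage(d_xu, d_xv):
    return min(d_xu, d_xv)


def complete_linkage(d_xu, d_xv):
    return max(d_xu, d_xv)


def wpgma(d_xu, d_xv):
    # plain average of the two distances
    return (d_xu + d_xv) / 2


def upgma(d_xu, d_xv, m_u, m_v):
    # average weighted by the sizes of the subtrees u and v
    return (m_u * d_xu + m_v * d_xv) / (m_u + m_v)


# Distance from C to the new cluster (A,B): D(C,A) = 2, D(C,B) = 1
d_CA, d_CB = 2, 1
abc = (single_linkage(d_CA, d_CB),
       complete_linkage(d_CA, d_CB),
       wpgma(d_CA, d_CB))

# Distance from F to (D,E): D(F,D) = 2, D(F,E) = 10, mD = 100, mE = 1
d_FD, d_FE = 2, 10
def_plain = wpgma(d_FD, d_FE)
def_sized = upgma(d_FD, d_FE, 100, 1)

print(abc)                  # prints (1, 2, 1.5)
print(def_plain)            # prints 6.0
print(round(def_sized, 2))  # prints 2.08
```

With equal subtree sizes the two averages agree; they differ only when one subtree holds many more sequences than the other, as in the D, E, F case.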

Page 69: Multiple Global Alignment  and Phylogenetic tree

UPGMA-example

(a) Original distances

      B     C     D     E
A     3     7     8    10
B           6     8     7
C                 4     5
D                       6

(b) After joining A and B

          C     D     E
(A,B)    6.5    8    8.5
C               4     5
D                     6

(c) After joining C and D

         (C,D)    E
(A,B)    7.25    8.5
(C,D)            5.5

(d) After joining (C,D) and E

         ((C,D),E)
(A,B)      7.67
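The steps (a) to (d) can be reproduced with a short Python sketch of UPGMA. The function name and data layout are assumptions made for illustration; trees are nested tuples and sizes count the original sequences in each subtree.

```python
def upgma_merges(labels, dist):
    """Run UPGMA; dist maps frozenset({a, b}) to a distance.

    Returns (merges, D): the merges as (u, v, distance) triples and
    the distance table including all intermediate clusters.
    """
    D = dict(dist)
    size = {a: 1 for a in labels}
    roots = list(labels)
    merges = []
    while len(roots) > 1:
        u, v = min(
            ((a, b) for i, a in enumerate(roots) for b in roots[i + 1:]),
            key=lambda pair: D[frozenset(pair)],
        )
        merges.append((u, v, D[frozenset((u, v))]))
        w = (u, v)
        size[w] = size[u] + size[v]
        roots = [x for x in roots if x not in (u, v)]
        for x in roots:
            # size-weighted average of the distances to u and v
            D[frozenset((x, w))] = (
                size[u] * D[frozenset((x, u))]
                + size[v] * D[frozenset((x, v))]
            ) / size[w]
        roots.append(w)
    return merges, D


original = {
    ("A", "B"): 3, ("A", "C"): 7, ("A", "D"): 8, ("A", "E"): 10,
    ("B", "C"): 6, ("B", "D"): 8, ("B", "E"): 7,
    ("C", "D"): 4, ("C", "E"): 5,
    ("D", "E"): 6,
}
dist = {frozenset(pair): d for pair, d in original.items()}
merges, D = upgma_merges(["A", "B", "C", "D", "E"], dist)
heights = [round(d, 2) for _, _, d in merges]
print(heights)  # prints [3, 4, 5.5, 7.67]
```

The intermediate entries 6.5, 8, 8.5 and 7.25 from tables (b) and (c) can be read from D, for example D[frozenset((("A", "B"), ("C", "D")))].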

Page 70: Multiple Global Alignment  and Phylogenetic tree

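Constructing the Edges of the Tree

Let’s assume we want to join the subtrees u and v under the new root w

Then the edge from v to w has to have the following length:

Lv,w = 0.5 Du,v – Lv,yv

where yv is a leaf in the subtree v, so Lv,yv is the distance from v down to its leaves (0 if v is itself a leaf)

Example: Joining C and D: LC,(C,D) = 0.5x4 – 0 = 2

Joining (C,D) and E: L(C,D),((C,D),E) = 0.5x5.5 – 2 = 0.75

[Figure: subtrees u and v joined under the new root w; edge Lv,w from v to w and path Lv,yv from v down to the leaf yv]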
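The edge-length rule can be applied to the whole UPGMA example. A minimal sketch, assuming the merges are given as (u, v, Du,v) triples in the order they were made; the function name and the tuple representation of nodes are illustrative.

```python
def edge_lengths(merges):
    """Edge lengths from Lv,w = 0.5 * Du,v - Lv,yv for every merge."""
    height = {}  # distance from a node down to its leaves; leaves are 0
    edges = {}
    for u, v, d_uv in merges:
        w = (u, v)
        height[w] = 0.5 * d_uv
        for child in (u, v):
            edges[(child, w)] = height[w] - height.get(child, 0.0)
    return edges


AB = ("A", "B")
CD = ("C", "D")
CDE = (CD, "E")
merges = [
    ("A", "B", 3),
    ("C", "D", 4),
    (CD, "E", 5.5),
    (AB, CDE, 23 / 3),  # 7.67 on the slide
]
edges = edge_lengths(merges)
for (child, parent), length in edges.items():
    print(child, "->", parent, round(length, 2))
```

This gives 1.5 for A and B, 2 for C and D, 0.75 for (C,D), 2.75 for E, 2.33 for (A,B) and 1.08 for ((C,D),E).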

Page 71: Multiple Global Alignment  and Phylogenetic tree

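UPGMA-Tree

[Figure: UPGMA tree over A, B, C, D, E with internal nodes (A,B), (C,D) and ((C,D),E). Edge lengths: A and B to (A,B) 1.5 each; C and D to (C,D) 2 each; (C,D) to ((C,D),E) 0.75; E to ((C,D),E) 2.75; (A,B) to the root 2.33; ((C,D),E) to the root 1.08]

Distances in tree

      B      C      D      E
A     3    7.66   7.66   7.66
B          7.66   7.66   7.66
C                  4     5.5
D                        5.5

Original distances

      B     C     D     E
A     3     7     8    10
B           6     8     7
C                 4     5
D                       6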

Page 72: Multiple Global Alignment  and Phylogenetic tree

Neighbour Joining (NJ)

Does not assume a constant molecular clock

Starts with a star tree where all nodes are linked to a central node:

[Figure: star tree with central node x and leaves A, B, C, D, E, F]

Page 73: Multiple Global Alignment  and Phylogenetic tree

Neighbour Joining (NJ)

Each pair of nodes is evaluated for being clustered together

For each pair the sum of all branch lengths in the resulting tree is calculated

The pair giving the lowest sum is chosen; from then on the pair is treated as a single node

This is repeated until all nodes have been joined
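A possible Python sketch of these steps is given below. It uses the commonly taught Q-criterion form of neighbour joining, Q(a,b) = (n-2) D(a,b) - r(a) - r(b) with r(a) the sum of distances from a to all other nodes, as a stand-in for "lowest sum of branch lengths"; that formulation is an assumption here, since the slides do not spell out the formula. The function name and the five-taxon matrix are illustrative and not taken from the slides.

```python
def neighbour_joining(names, dist):
    """Return the edges (child, parent, length) of an unrooted NJ tree.

    dist is a symmetric dict of dicts: dist[a][b] is the distance.
    """
    D = {a: dict(dist[a]) for a in names}
    nodes = list(names)
    edges = []
    while len(nodes) > 2:
        n = len(nodes)
        r = {a: sum(D[a][b] for b in nodes if b != a) for a in nodes}
        best = None
        for i in range(n):
            for j in range(i + 1, n):
                a, b = nodes[i], nodes[j]
                q = (n - 2) * D[a][b] - r[a] - r[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        w = "(" + a + "," + b + ")"  # label of the new internal node
        len_a = 0.5 * D[a][b] + (r[a] - r[b]) / (2 * (n - 2))
        len_b = D[a][b] - len_a
        edges.append((a, w, len_a))
        edges.append((b, w, len_b))
        D[w] = {}
        for x in nodes:
            if x != a and x != b:
                d = 0.5 * (D[a][x] + D[b][x] - D[a][b])
                D[w][x] = d
                D[x][w] = d
        nodes = [x for x in nodes if x != a and x != b] + [w]
    a, b = nodes
    edges.append((a, b, D[a][b]))  # last edge joins the two remaining nodes
    return edges


# Illustrative additive distance matrix (not from the slides)
taxa = ["a", "b", "c", "d", "e"]
rows = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
dist = {x: {y: rows[i][j] for j, y in enumerate(taxa)}
        for i, x in enumerate(taxa)}
edges = neighbour_joining(taxa, dist)
for child, parent, length in edges:
    print(child, parent, length)
```

In this example a and b are joined first, with branch lengths 2 and 3. The two branches differ in length, which is how NJ accommodates different rates on different lineages.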

Page 74: Multiple Global Alignment  and Phylogenetic tree

Neighbour Joining (NJ)

[Figure: panels (a) to (d) showing the tree with nodes x and Y and leaves A, B, C, D, E, F at successive joining steps]

Page 75: Multiple Global Alignment  and Phylogenetic tree

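Rooting an Unrooted Tree

Choose the mid-point of the tree, i.e. the middle of the longest path between two leaves, and introduce a new root node there

[Figure: unrooted tree with internal nodes x and Y and leaves A, B, C, D, E, F, and the same tree redrawn with the root at the mid-point; label: Mid-point = root]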

Page 76: Multiple Global Alignment  and Phylogenetic tree

Rooting an Unrooted Tree

Alternative: Use an outgroup, which has a large distance to all other nodes

Example: Let’s assume D is the outgroup, then the root is added to the edge leading to D

[Figure: unrooted tree over A, B, C, D and its rooted version; label: D = outgroup, so root goes here]

Page 77: Multiple Global Alignment  and Phylogenetic tree

NJ vs Hierarchical clustering

In Neighbour Joining the pair of nodes is chosen that gives the lowest sum of branch lengths in the resulting tree.

In Hierarchical clustering the pair of closest nodes is chosen, without taking the rest of the tree into account.

Hierarchical clustering does not allow for rate variation among branches.

Page 78: Multiple Global Alignment  and Phylogenetic tree

Assessing Quality: Bootstrapping

Given a tree obtained from one of the methods above

Generate Multiple Alignment

For a number of iterations
  Generate new sequences by selecting columns (possibly the same column more than once) from the multiple alignment
  Generate tree for the new sequences
  Compare this new tree with the given tree
  For each cluster in the given tree which also appears in the new tree, the bootstrap value is increased

Bootstrap-Value = Percentage of trees containing the same cluster
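The bootstrap procedure can be sketched in Python as below. This is a sketch under assumptions, not a full implementation: clusters are represented as frozensets of row indices, and the tree-building step is passed in as a callable. The closest_pair_cluster function is a hypothetical stand-in for a real tree method; it only reports the two rows with the fewest mismatches.

```python
import random
from itertools import combinations


def resample_columns(alignment, rng):
    """Draw as many columns as the alignment has, with replacement."""
    length = len(alignment[0])
    picks = [rng.randrange(length) for _ in range(length)]
    return ["".join(row[i] for i in picks) for row in alignment]


def bootstrap_support(alignment, reference_clusters, build_clusters,
                      iterations=100, seed=0):
    """Percentage of resampled trees that contain each reference cluster."""
    rng = random.Random(seed)
    counts = {c: 0 for c in reference_clusters}
    for _ in range(iterations):
        replicate = build_clusters(resample_columns(alignment, rng))
        for c in reference_clusters:
            if c in replicate:
                counts[c] += 1
    return {c: 100.0 * k / iterations for c, k in counts.items()}


def closest_pair_cluster(alignment):
    """Hypothetical stand-in for tree building: the closest pair of rows."""
    def mismatches(i, j):
        return sum(x != y for x, y in zip(alignment[i], alignment[j]))
    pair = min(combinations(range(len(alignment)), 2),
               key=lambda p: mismatches(*p))
    return {frozenset(pair)}


alignment = ["ACGT", "ACGT", "TTAA"]
reference = [frozenset({0, 1}), frozenset({1, 2})]
support = bootstrap_support(alignment, reference, closest_pair_cluster,
                            iterations=20)
print(support)
```

Rows 0 and 1 are identical and row 2 differs in every column, so the cluster {0, 1} is recovered in every replicate and {1, 2} in none.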

Page 79: Multiple Global Alignment  and Phylogenetic tree

From Phylogenetic Trees to MSA

Use a phylogenetic tree to guide the construction of the multiple sequence alignment

Page 80: Multiple Global Alignment  and Phylogenetic tree

From Phylogenetic Trees to MSA

[Figure: dendrogram over sequences 1, 2, 3, 4, 5 (height axis 0 to 5) next to the MSA it guides, with rows grouped as 1 2, 4 5 and 3]

Page 81: Multiple Global Alignment  and Phylogenetic tree

Progressive Alignment

Algorithm: Progressive alignment of the sequences {s1, s2, ……sm}
var C   current set of alignments
begin
  C = { }
  for i = 1 to m do
    C = C ∪ {{ si }}        one alignment of each sequence
  end
  for i = 1 to m-1 do
    choose two alignments Ap, Aq from C
    C = C - { Ap, Aq }
    Ar = align ( Ap, Aq )
    C = C ∪ { Ar }
  end
  C now contains the (single) final alignment
end
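The control flow of the algorithm can be sketched in Python with the two open choices passed in as functions. Everything named here is hypothetical: choose_pair stands for the guide-tree decision and align for a real profile alignment. The pad_align helper below is NOT an aligner; it only pads rows with gaps so that the loop can be run.

```python
def progressive_alignment(sequences, choose_pair, align):
    """Merge alignments pairwise until a single alignment remains."""
    current = [[s] for s in sequences]   # one alignment per sequence
    while len(current) > 1:              # m - 1 merges in total
        p, q = choose_pair(current)      # indices of Ap and Aq in current
        ap, aq = current[p], current[q]
        current = [a for i, a in enumerate(current) if i not in (p, q)]
        current.append(align(ap, aq))
    return current[0]


def first_two(alignments):
    """Hypothetical choice: always take the first two alignments."""
    return 0, 1


def pad_align(ap, aq):
    """Placeholder for align(Ap, Aq): pad all rows to equal length."""
    rows = ap + aq
    width = max(len(row) for row in rows)
    return [row.ljust(width, "-") for row in rows]


result = progressive_alignment(["DQLF", "DNVQ", "QGL"], first_two, pad_align)
print(result)  # prints ['QGL-', 'DQLF', 'DNVQ']
```

Replacing first_two by a function that follows a guide tree, and pad_align by a real pairwise or profile aligner, gives the structure of a tree-guided progressive alignment.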

Page 82: Multiple Global Alignment  and Phylogenetic tree

Aligning two subset alignments

Two subset alignments Ap, Aq with the sequences {sp1 ….spm } and {sq1 ….sqn }

Complete alignment method for aligning pairs of subset alignments

The SP score will be

$$ S_{r,t} = \frac{1}{nm} \sum_{s_j \in \{s_{p_1} \dots s_{p_m}\}} \; \sum_{s_k \in \{s_{q_1} \dots s_{q_n}\}} w_j \, w_k \, R(s'_{jr}, s'_{kt}) $$

Page 83: Multiple Global Alignment  and Phylogenetic tree

Clustering

The progressive alignment should be guided by a true phylogenetic tree

Methods
  Average linkage
  Maximum (single) linkage
  Minimum (complete) linkage

Page 84: Multiple Global Alignment  and Phylogenetic tree

Clustering--example

Three alignments: A1 = {s1, s2}, A2 = {s3, s4} and A3 = {s5}, with pairwise scores:

      s2   s3   s4   s5
s1    -    7    5    3
s2         6    4    8
s3              -    7
s4                   6

Average linkage
  S(A1,A2) = (7+5+6+4)/4 = 5.5
  S(A1,A3) = 5.5
  S(A2,A3) = 6.5   best

Maximum linkage
  S(A1,A2) = max (7,5,6,4) = 7
  S(A1,A3) = 8   best
  S(A2,A3) = 7

Minimum linkage
  S(A1,A2) = min (7,5,6,4) = 4
  S(A1,A3) = 3
  S(A2,A3) = 6   best
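The example can be recomputed with a few lines of Python. A minimal sketch; the function names and the dictionary layout are chosen here for illustration. Note that these are similarity scores, so the best pair is the one with the highest value.

```python
from itertools import combinations, product

# Pairwise scores between sequences of different alignments
scores = {
    ("s1", "s3"): 7, ("s1", "s4"): 5, ("s1", "s5"): 3,
    ("s2", "s3"): 6, ("s2", "s4"): 4, ("s2", "s5"): 8,
    ("s3", "s5"): 7, ("s4", "s5"): 6,
}
alignments = {"A1": ["s1", "s2"], "A2": ["s3", "s4"], "A3": ["s5"]}


def pair_scores(a, b):
    """All scores between a sequence of a and a sequence of b."""
    return [scores[(x, y)] for x, y in product(alignments[a], alignments[b])]


def average_linkage(a, b):
    values = pair_scores(a, b)
    return sum(values) / len(values)


def maximum_linkage(a, b):
    return max(pair_scores(a, b))


def minimum_linkage(a, b):
    return min(pair_scores(a, b))


def best_pair(linkage):
    """The pair of alignments with the highest linkage score."""
    return max(combinations(sorted(alignments), 2),
               key=lambda pair: linkage(*pair))


for linkage in (average_linkage, maximum_linkage, minimum_linkage):
    print(linkage.__name__, best_pair(linkage))
```

Average and minimum linkage both pick (A2, A3) as the next pair to align, while maximum linkage picks (A1, A3).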

Page 85: Multiple Global Alignment  and Phylogenetic tree

Linear clustering

Algorithm: Basic linear clustering for aligning the sequences {s1, s2, ……sn}
var U   the set of sequences not aligned
    A   the current alignment
begin
  U = {s1, s2, ……sn}
  choose two sequences (the most similar) (s, t) from U
  A = Align(s, t)
  U = U – {s, t}
  for i = 1 to n-2 do
    choose a sequence s from U
    U = U – {s}
    A = Align(A, s)
  end
end

Page 86: Multiple Global Alignment  and Phylogenetic tree

The CLUSTALW Algorithm

CLUSTALW: one of the most popular global MSA programs

1. Calculate the (static) pairwise similarity scores for the sequences
2. Construct a guide tree by use of the pairwise scores (NJ method)
3. Calculate sequence weights, using the guide tree
4. Perform a progressive alignment, guided by the tree