
CS262 Lecture 12, Win06, Batzoglou

RNA Secondary Structure

aagacuucggaucuggcgacaccc
uacacuucggaugacaccaaagug
aggucuucggcacgggcaccauuc
ccaacuucggauuuugcuaccaua
aagccuucggagcgggcguaacuc

Decoding: the CYK algorithm

Given x = x1…xN, and an SCFG G,

Find the most likely parse of x

(the most likely alignment of G to x)

Dynamic programming variable:

γ(i, j, V): likelihood of the most likely parse of xi…xj,

rooted at nonterminal V

Then,

γ(1, N, S): likelihood of the most likely parse of x by the grammar

The CYK algorithm (Cocke-Younger-Kasami)

Initialization: For i = 1 to N, any nonterminal V,

γ(i, i, V) = log P(V → xi)

Iteration: For i = 1 to N – 1, for j = i+1 to N, for any nonterminal V,

γ(i, j, V) = maxX maxY maxi≤k<j γ(i, k, X) + γ(k+1, j, Y) + log P(V → XY)

Termination: log P(x | θ, π*) = γ(1, N, S)

where π* is the optimal parse tree (if traced back appropriately from above)

[Figure: a parse of xi…xj rooted at V, split at position k into X (covering xi…xk) and Y (covering xk+1…xj)]
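As a concrete illustration, here is a minimal Python sketch of this CYK recurrence for a toy SCFG in Chomsky normal form. The grammar (S → S S with probability 0.5, S → a with probability 0.5) and all names are illustrative assumptions, not the RNA grammar of these slides:

```python
import math

# Toy SCFG in Chomsky normal form (hypothetical example grammar):
binary = {("S", ("S", "S")): 0.5}    # P(V -> X Y)
terminal = {("S", "a"): 0.5}         # P(V -> a)
nonterms = ["S"]

def cyk(x):
    """gamma[i, j, V] = log-likelihood of the best parse of x[i..j] rooted at V."""
    N = len(x)
    g = {}
    # Initialization: gamma(i, i, V) = log P(V -> x_i)
    for i in range(N):
        for V in nonterms:
            p = terminal.get((V, x[i]), 0.0)
            g[i, i, V] = math.log(p) if p > 0 else float("-inf")
    # Iteration over increasing span lengths
    for span in range(2, N + 1):
        for i in range(N - span + 1):
            j = i + span - 1
            for V in nonterms:
                best = float("-inf")
                for (W, (X, Y)), p in binary.items():
                    if W != V:
                        continue
                    for k in range(i, j):   # split point
                        best = max(best, g[i, k, X] + g[k + 1, j, Y] + math.log(p))
                g[i, j, V] = best
    return g[0, N - 1, "S"]   # log P(x | theta, pi*)
```

For "aa" the only parse is S → S S with both children emitting a, so the result is 3 log 0.5.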

A SCFG for predicting RNA structure

S → a S | c S | g S | u S | S a | S c | S g | S u

S → a S u | c S g | g S u | u S g | g S c | u S a

S → S S | ε

• Adjust the probability parameters to reflect bond strength etc.

• No distinction between non-paired bases, bulges, loops

• Can modify to model these events:

L: loop nonterminal, H: hairpin nonterminal, B: bulge nonterminal, etc.

CYK for RNA folding

Initialization:

γ(i, i–1) = log P(S → ε)

Iteration:

For i = 1 to N

For j = i to N

γ(i, j) = max of:

γ(i+1, j–1) + log P(S → xi S xj)

γ(i+1, j) + log P(S → xi S)

γ(i, j–1) + log P(S → S xj)

maxi<k<j γ(i, k) + γ(k+1, j) + log P(S → S S)
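If each base pair is scored +1 instead of a log-probability, the same recurrence reduces to the classic Nussinov base-pair maximization. A minimal sketch; the scoring and the allowed pairs (including the wobble g·u) are illustrative assumptions:

```python
# Score-based analogue of the CYK recurrence above: each complementary
# pair contributes +1 (Nussinov-style scoring, not the slides' SCFG).
PAIRS = {("a", "u"), ("u", "a"), ("c", "g"), ("g", "c"), ("g", "u"), ("u", "g")}

def max_pairs(seq):
    N = len(seq)
    # g[i][j] = max number of pairs in seq[i..j]; empty/one-base spans score 0
    g = [[0] * N for _ in range(N)]
    for span in range(2, N + 1):
        for i in range(N - span + 1):
            j = i + span - 1
            best = max(g[i + 1][j], g[i][j - 1])    # x_i or x_j unpaired
            if (seq[i], seq[j]) in PAIRS:           # x_i pairs with x_j
                inner = g[i + 1][j - 1] if j - i >= 2 else 0
                best = max(best, inner + 1)
            for k in range(i + 1, j):               # bifurcation
                best = max(best, g[i][k] + g[k + 1][j])
            g[i][j] = best
    return g[0][N - 1]
```

For "gggaaaccc" the three g·c pairs stack into a hairpin stem, giving 3.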

Evaluation

Recall HMMs:

Forward: fl(i) = P(x1…xi, πi = l)

Backward: bk(i) = P(xi+1…xN | πi = k)

Then,

P(x) = Σk fk(N) ak0 = Σl a0l el(x1) bl(1)

Analogue in SCFGs:

Inside: α(i, j, V) = P(xi…xj is generated by nonterminal V)

Outside: β(i, j, V) = P(x, excluding xi…xj, is generated by S and the excluded part is rooted at V)

The Inside Algorithm

To compute

α(i, j, V) = P(xi…xj is produced by V)

α(i, j, V) = ΣX ΣY Σi≤k<j α(i, k, X) α(k+1, j, Y) P(V → XY)

[Figure: V spans xi…xj; X covers xi…xk, Y covers xk+1…xj]

Algorithm: Inside

Initialization: For i = 1 to N, V a nonterminal,

α(i, i, V) = P(V → xi)

Iteration:

For i = 1 to N – 1, for j = i+1 to N, for V a nonterminal,

α(i, j, V) = ΣX ΣY Σi≤k<j α(i, k, X) α(k+1, j, Y) P(V → XY)

Termination:

P(x | θ) = α(1, N, S)
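The Inside algorithm is CYK with max replaced by a sum. A minimal sketch on a toy grammar (S → S S with probability 0.5, S → a with probability 0.5; the grammar is illustrative, not from the slides):

```python
# Inside algorithm for a toy SCFG in Chomsky normal form.
binary = {("S", ("S", "S")): 0.5}    # P(V -> X Y)
terminal = {("S", "a"): 0.5}         # P(V -> a)
nonterms = ["S"]

def inside(x):
    """a[i, j, V] = P(x[i..j] is generated by nonterminal V)."""
    N = len(x)
    a = {}
    for i in range(N):
        for V in nonterms:
            a[i, i, V] = terminal.get((V, x[i]), 0.0)
    for span in range(2, N + 1):
        for i in range(N - span + 1):
            j = i + span - 1
            for V in nonterms:
                total = 0.0
                for (W, (X, Y)), p in binary.items():
                    if W != V:
                        continue
                    for k in range(i, j):   # sum over split points
                        total += a[i, k, X] * a[k + 1, j, Y] * p
                a[i, j, V] = total
    return a[0, N - 1, "S"]   # P(x | theta)
```

For "aaa" the two split points each contribute 0.5³·0.5² · ... more directly: both parses have probability 0.5⁵·2⁰ = 0.03125, summing to 0.0625.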

The Outside Algorithm

β(i, j, V) = P(x1…xi–1, xj+1…xN, where the “gap” xi…xj is rooted at V)

In the case where V is the right-hand-side nonterminal of a production Y → XV,

β(i, j, V) = ΣX ΣY Σk<i α(k, i–1, X) β(k, j, Y) P(Y → XV)

[Figure: Y spans xk…xj; X covers xk…xi–1 and V roots the gap xi…xj]

Algorithm: Outside

Initialization: β(1, N, S) = 1; for any other V, β(1, N, V) = 0

Iteration:

For i = 1 to N, for j = N down to i, for V a nonterminal,

β(i, j, V) = ΣX ΣY Σk<i α(k, i–1, X) β(k, j, Y) P(Y → XV) +

             ΣX ΣY Σk>j α(j+1, k, X) β(i, k, Y) P(Y → VX)

Termination: It is true for any i that

P(x | θ) = ΣX β(i, i, X) P(X → xi)
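The termination identity gives a handy consistency check: for every position i, the outside values must recover the same P(x | θ) that Inside computes. A minimal sketch on the same toy grammar used above (S → S S, 0.5; S → a, 0.5; all names mine):

```python
# Inside and Outside for a toy SCFG (illustrative grammar, not the slides').
binary = {("S", ("S", "S")): 0.5}
terminal = {("S", "a"): 0.5}
nonterms = ["S"]

def inside_table(x):
    N = len(x)
    a = {}
    for i in range(N):
        for V in nonterms:
            a[i, i, V] = terminal.get((V, x[i]), 0.0)
    for span in range(2, N + 1):
        for i in range(N - span + 1):
            j = i + span - 1
            for V in nonterms:
                a[i, j, V] = sum(
                    a[i, k, X] * a[k + 1, j, Y] * p
                    for (W, (X, Y)), p in binary.items() if W == V
                    for k in range(i, j))
    return a

def outside_table(x):
    N = len(x)
    a = inside_table(x)
    b = {(i, j, V): 0.0 for i in range(N) for j in range(N) for V in nonterms}
    b[0, N - 1, "S"] = 1.0                 # the root spans everything
    for span in range(N - 1, 0, -1):       # shorter spans depend on longer ones
        for i in range(N - span + 1):
            j = i + span - 1
            for V in nonterms:
                total = 0.0
                for (Y, (X, Z)), p in binary.items():
                    if Z == V:             # Y -> X V, X covers x[k..i-1]
                        for k in range(i):
                            total += a[k, i - 1, X] * b[k, j, Y] * p
                    if X == V:             # Y -> V Z, Z covers x[j+1..k]
                        for k in range(j + 1, N):
                            total += a[j + 1, k, Z] * b[i, k, Y] * p
                b[i, j, V] = total
    return a, b
```

For x = "aaa", Σ_V β(i, i, V) P(V → xi) equals α(1, N, S) = 0.0625 at every i.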

Learning for SCFGs

We can now estimate

c(V) = expected number of times V is used in the parse of x1…xN

c(V) = (1 / P(x | θ)) Σ1≤i≤N Σi≤j≤N α(i, j, V) β(i, j, V)

c(V → XY) = (1 / P(x | θ)) Σ1≤i≤N Σi<j≤N Σi≤k<j β(i, j, V) α(i, k, X) α(k+1, j, Y) P(V → XY)

Learning for SCFGs

Then, we can re-estimate the parameters with EM, by:

Pnew(V → XY) = c(V → XY) / c(V)

Pnew(V → a) = c(V → a) / c(V) = Σi: xi = a β(i, i, V) P(V → a) / Σ1≤i≤N Σi≤j≤N α(i, j, V) β(i, j, V)

Summary: SCFG and HMM algorithms

GOAL              HMM algorithm       SCFG algorithm

Optimal parse     Viterbi             CYK

Estimation        Forward             Inside
                  Backward            Outside

Learning          EM: Fw/Bck          EM: Ins/Outs

Memory complexity O(N K)              O(N² K)
Time complexity   O(N K²)             O(N³ K³)

where K = # of states in the HMM, or # of nonterminals in the SCFG

The Zuker algorithm – main ideas

Models energy of a fold in terms of specific features:

1. Pairs of base pairs (stacked pairs)

2. Bulges

3. Loops (size, composition)

4. Interactions between stem and loop

[Figures: a stack of base pairs at positions i, j; a bulge of length l between positions i and j; a loop between positions i, j, and j']

Phylogeny Tree Reconstruction

[Figure: an example tree over leaves 1–5]

Inferring Phylogenies

Trees can be inferred by several criteria:

• Morphology of the organisms (can lead to mistakes!)

• Sequence comparison

Example:

Orc:    ACAGTGACGCCCCAAACGT
Elf:    ACAGTGACGCTACAAACGT
Dwarf:  CCTGTGACGTAACAAACGA
Hobbit: CCTGTGACGTAGCAAACGA
Human:  CCTGTGACGTAGCAAACGA

Modeling Evolution

During infinitesimal time Δt, there is not enough time for two substitutions to happen on the same nucleotide

So we can estimate P(x | y, t), for x, y ∈ {A, C, G, T}

Then let

       P(A|A, t) …… P(A|T, t)
S(t) =     …     ……     …
       P(T|A, t) …… P(T|T, t)

[Figure: nucleotide x evolves into y over time t]

Modeling Evolution

Reasonable assumption: multiplicative

(implying a stationary Markov process)

S(t+t’) = S(t) S(t’)

That is, P(x | y, t+t’) = Σz P(x | z, t) P(z | y, t’)

Jukes-Cantor: constant rate of evolution

For short time ε,

              1 – 3αε    αε        αε        αε
S(ε) = I+Rε =   αε     1 – 3αε     αε        αε
                αε       αε      1 – 3αε     αε
                αε       αε        αε      1 – 3αε

[Figure: substitutions among the four nucleotides A, C, G, T, each at rate α]

Modeling Evolution

Jukes-Cantor:

For longer times,

       r(t) s(t) s(t) s(t)
S(t) = s(t) r(t) s(t) s(t)
       s(t) s(t) r(t) s(t)
       s(t) s(t) s(t) r(t)

To derive r(t) and s(t), write

S(t+ε) = S(t) S(ε) = S(t)(I + Rε)

Therefore, (S(t+ε) – S(t))/ε = S(t) R

At the limit of ε → 0, S’(t) = S(t) R

Equivalently, r’ = –3αr + 3αs and s’ = –αs + αr

Those diff. equations lead to:

r(t) = ¼ (1 + 3 e–4αt)

s(t) = ¼ (1 – e–4αt)
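The closed forms for r(t) and s(t) can be sanity-checked numerically: each row of S(t) must sum to 1, and multiplicativity S(t+t') = S(t)S(t') translates entrywise into identities on r and s. A short check (the function name jc is mine):

```python
import math

def jc(t, alpha=1.0):
    """Jukes-Cantor stay/change probabilities r(t), s(t)."""
    r = 0.25 * (1 + 3 * math.exp(-4 * alpha * t))
    s = 0.25 * (1 - math.exp(-4 * alpha * t))
    return r, s

# Rows of S(t) sum to 1: r + 3s = 1
r, s = jc(0.7)
assert abs(r + 3 * s - 1) < 1e-12

# Multiplicativity S(t+t') = S(t) S(t'), entrywise:
#   r(t+t') = r(t) r(t') + 3 s(t) s(t')            (stay via any intermediate)
#   s(t+t') = r(t) s(t') + s(t) r(t') + 2 s(t) s(t')
r1, s1 = jc(0.3)
r2, s2 = jc(0.4)
assert abs(r - (r1 * r2 + 3 * s1 * s2)) < 1e-12
assert abs(s - (r1 * s2 + s1 * r2 + 2 * s1 * s2)) < 1e-12
```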

Modeling Evolution

Kimura:

Transitions: A/G, C/T

Transversions: A/T, A/C, G/T, C/G

Transitions (rate α) are much more likely than transversions (rate β)

       r(t) s(t) u(t) s(t)
S(t) = s(t) r(t) s(t) u(t)
       u(t) s(t) r(t) s(t)
       s(t) u(t) s(t) r(t)

Where

s(t) = ¼ (1 – e–4βt)

u(t) = ¼ (1 + e–4βt – 2 e–2(α+β)t)

r(t) = 1 – 2s(t) – u(t)

Phylogeny and sequence comparison

Basic principles:

• Degree of sequence difference is proportional to length of independent sequence evolution

• Only use positions where alignment is pretty certain – avoid areas with (too many) gaps

Distance between two sequences

Given sequences xi, xj,

Define

dij = distance between the two sequences

One possible definition:

dij = fraction f of sites u where xi[u] ≠ xj[u]

Better model (Jukes-Cantor):

f = 3 s(t) = ¾ (1 – e–4αt)

⇒ ¾ e–4αt = ¾ – f

⇒ –4αt = log(1 – (4/3) f)

⇒ dij = t = –(1/(4α)) log(1 – (4/3) f)
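The derivation above is an inversion of f = ¾(1 – e^(–4αt)), which a round trip verifies: simulate the expected fraction of differing sites after time t, then recover t from it (the function name jc_distance is mine):

```python
import math

def jc_distance(f, alpha=1.0):
    """Jukes-Cantor distance: invert f = (3/4)(1 - e^{-4 alpha t}) for t."""
    return -math.log(1 - (4.0 / 3.0) * f) / (4.0 * alpha)

# Round trip: t -> expected fraction f of differing sites -> t
alpha, t = 1.0, 0.3
f = 0.75 * (1 - math.exp(-4 * alpha * t))   # f = 3 s(t)
assert abs(jc_distance(f, alpha) - t) < 1e-12
```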

A simple clustering method for building tree

UPGMA (unweighted pair group method using arithmetic averages)

Or the Average Linkage Method

Given two disjoint clusters Ci, Cj of sequences,

dij = (1 / (|Ci| |Cj|)) Σ{p ∈ Ci, q ∈ Cj} dpq

Claim: if Ck = Ci ∪ Cj, then the distance to another cluster Cl is:

dkl = (dil |Ci| + djl |Cj|) / (|Ci| + |Cj|)

Proof:

dkl = (ΣCi,Cl dpq + ΣCj,Cl dpq) / ((|Ci| + |Cj|) |Cl|)

    = (|Ci| (1/(|Ci||Cl|)) ΣCi,Cl dpq + |Cj| (1/(|Cj||Cl|)) ΣCj,Cl dpq) / (|Ci| + |Cj|)

    = (|Ci| dil + |Cj| djl) / (|Ci| + |Cj|)

Algorithm: Average Linkage

Initialization:

Assign each xi into its own cluster Ci

Define one leaf per sequence, at height 0

Iteration:

Find two clusters Ci, Cj s.t. dij is minimal

Let Ck = Ci ∪ Cj

Define a node k connecting Ci, Cj, and place it at height dij/2

Delete Ci, Cj; add Ck

Termination:

When two clusters i, j remain, place the root at height dij/2

[Figure: example tree over leaves 1–5 built this way]

Example

Initial distances:

     v   w   x   y   z
v    0   6   8   8   8
w        0   8   8   8
x            0   4   4
y                0   2
z                    0

Merge y, z at height 1:

     v   w   x   yz
v    0   6   8   8
w        0   8   8
x            0   4
yz               0

Merge x, yz at height 2:

     v   w   xyz
v    0   6   8
w        0   8
xyz          0

Merge v, w at height 3; finally join vw, xyz at height 4:

     vw   xyz
vw   0    8
xyz       0

[Figure: resulting tree over leaves v, w, x, y, z, with merges at heights 1, 2, 3, 4]
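The merge order and heights in this example can be reproduced with a short average-linkage sketch (the function names are mine; the matrix is the one above):

```python
from itertools import combinations

def avg_dist(a, b, d):
    """Average-linkage distance between clusters a and b (lists of leaves)."""
    return sum(d[frozenset({p, q})] for p in a for q in b) / (len(a) * len(b))

def upgma(names, d):
    """Return the heights at which successive merges happen."""
    clusters = {n: [n] for n in names}      # cluster label -> member leaves
    heights = []
    while len(clusters) > 1:
        ci, cj = min(combinations(clusters, 2),
                     key=lambda p: avg_dist(clusters[p[0]], clusters[p[1]], d))
        heights.append(avg_dist(clusters[ci], clusters[cj], d) / 2)  # d_ij / 2
        clusters[ci + cj] = clusters.pop(ci) + clusters.pop(cj)
    return heights

# The example matrix from the slides
pairs = {("v", "w"): 6, ("v", "x"): 8, ("v", "y"): 8, ("v", "z"): 8,
         ("w", "x"): 8, ("w", "y"): 8, ("w", "z"): 8,
         ("x", "y"): 4, ("x", "z"): 4, ("y", "z"): 2}
d = {frozenset(k): v for k, v in pairs.items()}
heights = upgma("vwxyz", d)   # merges y+z, x+yz, v+w, then the root
```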

Ultrametric Distances and Molecular Clock

Definition:

A distance function d(.,.) is ultrametric if for any three distances dij ≤ dik ≤ djk, it is true that dik = djk

The Molecular Clock:

The evolutionary distance between species x and y is 2× the Earth time to reach the nearest common ancestor

That is, the molecular clock has constant rate in all species

The molecular clock results in ultrametric distances

[Figure: tree over leaves 1–5 with a uniform time axis in years]

Ultrametric Distances & Average Linkage

Average Linkage is guaranteed to reconstruct correctly a binary tree with ultrametric distances

Proof: Exercise

Weakness of Average Linkage

Molecular clock: all species evolve at the same rate (Earth time)

However, certain species (e.g., mouse, rat) evolve much faster

Example where UPGMA messes up:

[Figure: the correct tree vs. the average-linkage (AL) tree for taxa 1–4]

Additive Distances

Given a tree, a distance measure is additive if the distance between any pair of leaves is the sum of lengths of edges connecting them

Given a tree T & additive distances dij, can uniquely reconstruct edge lengths:

• Find two neighboring leaves i, j, with common parent k

• Place parent node k at distance dkm = ½ (dim + djm – dij) from any node m

[Figure: tree with numbered nodes illustrating the path distance d1,4 as a sum of edge lengths]

Reconstructing Additive Distances Given T

[Figure: tree T over leaves v, w, x, y, z with edge lengths 5, 4, 7, 3, 3, 4, 6]

D:

     v   w   x   y   z
v    0  10  17  16  16
w        0  15  14  14
x            0   9  15
y                0  14
z                    0

If we know T and D, but do not know the length of each edge, we can reconstruct those lengths

Reconstructing Additive Distances Given T

[Figure: tree T over leaves v, w, x, y, z; neighboring leaves v, w are replaced by their parent a]

D:

     v   w   x   y   z
v    0  10  17  16  16
w        0  15  14  14
x            0   9  15
y                0  14
z                    0

D1:

     a   x   y   z
a    0  11  10  10
x        0   9  15
y            0  14
z                0

dax = ½ (dvx + dwx – dvw)

day = ½ (dvy + dwy – dvw)

daz = ½ (dvz + dwz – dvw)
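The collapse of the neighboring leaves v, w into their parent a can be checked numerically (collapse is my helper name; the matrix is the slides' D):

```python
# Collapse neighboring leaves i, j into their parent using
# d_am = (d_im + d_jm - d_ij) / 2
D = {("v", "w"): 10, ("v", "x"): 17, ("v", "y"): 16, ("v", "z"): 16,
     ("w", "x"): 15, ("w", "y"): 14, ("w", "z"): 14,
     ("x", "y"): 9,  ("x", "z"): 15, ("y", "z"): 14}

def collapse(D, i, j, others):
    d = lambda a, b: D[(a, b)] if (a, b) in D else D[(b, a)]
    return {m: 0.5 * (d(i, m) + d(j, m) - d(i, j)) for m in others}

D1 = collapse(D, "v", "w", ["x", "y", "z"])
# matches the slides: d_ax = 11, d_ay = 10, d_az = 10
```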

Reconstructing Additive Distances Given T

[Figure: tree T; after collapsing v, w into a, leaves x, y are collapsed into b, then a, b into c]

D1:

     a   x   y   z
a    0  11  10  10
x        0   9  15
y            0  14
z                0

D2:

     a   b   z
a    0   6  10
b        0  10
z            0

D3:

     a   c
a    0   3
c        0

d(a, c) = 3
d(b, c) = d(a, b) – d(a, c) = 3
d(c, z) = d(a, z) – d(a, c) = 7
d(b, x) = d(a, x) – d(a, b) = 5
d(b, y) = d(a, y) – d(a, b) = 4
d(a, w) = d(z, w) – d(a, z) = 4
d(a, v) = d(z, v) – d(a, z) = 6

Correct!!!

[Figure: the recovered edge lengths 5, 4, 7, 3, 3, 4, 6 on T]

Neighbor-Joining

• Guaranteed to produce the correct tree if distance is additive

• May produce a good tree even when distance is not additive

Step 1: Finding neighboring leaves

Define

Dij = dij – (ri + rj)

where ri = (1 / (|L| – 2)) Σk dik

Claim: The above “magic trick” ensures that Dij is minimal iff i, j are neighbors

Proof: Very technical, please read Durbin et al.!

[Figure: tree over leaves 1–4 with edges of lengths 0.1, 0.1, 0.1, 0.4, 0.4, where the closest leaves are not neighbors]
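The claim can be checked numerically. The tree below is an assumed reconstruction of the figure: leaves 1 and 4 sit on short edges (0.1) at opposite ends of a short internal edge (0.1), while leaves 2 and 3 hang on long edges (0.4), so the closest pair (1, 4) is not a neighboring pair, but minimizing Dij picks a true cherry:

```python
from itertools import combinations

# Pairwise path distances on the assumed tree described above
d = {(1, 2): 0.5, (1, 3): 0.6, (1, 4): 0.3,
     (2, 3): 0.9, (2, 4): 0.6, (3, 4): 0.5}
leaves = [1, 2, 3, 4]
get = lambda a, b: d[(a, b)] if (a, b) in d else d[(b, a)]
r = {i: sum(get(i, k) for k in leaves if k != i) / (len(leaves) - 2) for i in leaves}
D = {p: get(*p) - r[p[0]] - r[p[1]] for p in combinations(leaves, 2)}

assert min(d, key=d.get) == (1, 4)              # naive closest pair: not neighbors
assert min(D, key=D.get) in {(1, 2), (3, 4)}    # corrected: true neighbor pairs
```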

Algorithm: Neighbor-joining

Initialization:

Define T to be the set of leaf nodes, one per sequence

Let L = T

Iteration:

Pick i, j s.t. Dij is minimal

Define a new node k, and set dkm = ½ (dim + djm – dij) for all m ∈ L

Add k to T, with edges of lengths dik = ½ (dij + ri – rj), djk = dij – dik

Remove i, j from L; add k to L

Termination:

When L consists of two nodes i, j, add the edge between them, of length dij
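A minimal sketch of the whole procedure (all function names are mine). On the additive matrix from the earlier slides, the path lengths in the reconstructed tree reproduce every input distance exactly, as the additivity guarantee promises:

```python
from itertools import combinations
import collections

def neighbor_joining(labels, dist):
    """dist: {(a, b): d} for a != b (one order suffices). Returns {(u, v): length}."""
    d = dict(dist)
    get = lambda a, b: d[(a, b)] if (a, b) in d else d[(b, a)]
    L = list(labels)
    edges = {}
    nid = 0
    while len(L) > 2:
        r = {i: sum(get(i, k) for k in L if k != i) / (len(L) - 2) for i in L}
        i, j = min(combinations(L, 2), key=lambda p: get(*p) - r[p[0]] - r[p[1]])
        k = "k%d" % nid; nid += 1                  # new internal node
        dik = 0.5 * (get(i, j) + r[i] - r[j])
        edges[(i, k)] = dik
        edges[(j, k)] = get(i, j) - dik
        for m in L:
            if m != i and m != j:
                d[(k, m)] = 0.5 * (get(i, m) + get(j, m) - get(i, j))
        L = [m for m in L if m != i and m != j] + [k]
    edges[(L[0], L[1])] = get(L[0], L[1])          # final edge
    return edges

def path_length(edges, a, b):
    """Distance between nodes a, b along the tree given by `edges`."""
    adj = collections.defaultdict(list)
    for (u, v), w in edges.items():
        adj[u].append((v, w))
        adj[v].append((u, w))
    stack = [(a, 0.0, None)]
    while stack:
        node, acc, prev = stack.pop()
        if node == b:
            return acc
        stack.extend((nxt, acc + w, node) for nxt, w in adj[node] if nxt != prev)

# Additive matrix from the earlier slides; NJ reproduces every entry.
D = {("v", "w"): 10, ("v", "x"): 17, ("v", "y"): 16, ("v", "z"): 16,
     ("w", "x"): 15, ("w", "y"): 14, ("w", "z"): 14,
     ("x", "y"): 9,  ("x", "z"): 15, ("y", "z"): 14}
edges = neighbor_joining("vwxyz", D)
```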

Parsimony – What if we don’t have distances?

• One of the most popular methods: GIVEN a multiple alignment, FIND the tree & history of substitutions explaining the alignment

Idea:

Find the tree that explains the observed sequences with a minimal number of substitutions

Two computational subproblems:

1. Find the parsimony cost of a given tree (easy)

2. Search through all tree topologies (hard)

Example

Leaves: A, B, A, A → sets {A}, {B}, {A}, {A}

Parent of (A, B): {A} ∩ {B} = ∅, so set {A, B} and cost C += 1

Parent of (A, A): {A}

Root: {A, B} ∩ {A} = {A}; final cost C = 1

Parsimony Scoring

Given a tree, and an alignment column u

Label internal nodes to minimize the number of required substitutions

Initialization:

Set cost C = 0; k = 2N – 1

Iteration:

If k is a leaf, set Rk = { xk[u] }

If k is not a leaf,

Let i, j be the daughter nodes;

Set Rk = Ri ∩ Rj if the intersection is nonempty

Set Rk = Ri ∪ Rj, and C += 1, if the intersection is empty

Termination:

Minimal cost of tree for column u = C
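This bottom-up pass (Fitch's algorithm) is a few lines of code. A sketch in which a tree is a nested tuple (a leaf is a character, an internal node a pair; the representation is my choice):

```python
# Fitch's bottom-up pass for the parsimony cost of one alignment column.
def parsimony_cost(tree):
    cost = 0
    def sets(node):
        nonlocal cost
        if isinstance(node, str):      # leaf: R_k = {x_k[u]}
            return {node}
        Ri, Rj = sets(node[0]), sets(node[1])
        if Ri & Rj:                    # nonempty intersection
            return Ri & Rj
        cost += 1                      # empty intersection: union, +1 substitution
        return Ri | Rj
    sets(tree)
    return cost
```

For the tree over the column A, B, A, A from the example above, the cost is 1.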

Example

Column A A A B: leaf sets {A}, {A}, {A}, {B}; parents {A} and {A, B} (C += 1); root {A} ∩ {A, B} = {A}. Cost 1.

Column B A B A: leaf sets {B}, {A}, {B}, {A}; parents {A, B} (C += 1) and {A, B} (C += 1); root {A, B}. Cost 2.

Probabilistic Methods

A more refined measure of evolution along a tree than parsimony

P(x1, x2, xroot | t1, t2) = P(xroot) P(x1 | t1, xroot) P(x2 | t2, xroot)

If we use Jukes-Cantor, for example, and x1 = xroot = A, x2 = C, t1 = t2 = 1:

= pA ¼(1 + 3e–4α) ¼(1 – e–4α) = (¼)³ (1 + 3e–4α)(1 – e–4α)

[Figure: root xroot with branches of lengths t1, t2 to leaves x1, x2]

Probabilistic Methods

• If we know all internal labels xu,

P(x1, x2, …, xN, xN+1, …, x2N–1 | T, t) = P(xroot) Πj≠root P(xj | xparent(j), tj,parent(j))

• Usually we don’t know the internal labels, therefore

P(x1, x2, …, xN | T, t) = ΣxN+1 ΣxN+2 … Σx2N–1 P(x1, x2, …, x2N–1 | T, t)

[Figure: tree with root xroot, internal node xu, and leaves x1, x2, …, xN]

Felsenstein’s Likelihood Algorithm

To calculate P(x1, x2, …, xN | T, t)

Let P(Lk | a) denote the probability of all the leaves below node k, given that the residue at k is a

Initialization: Set k = 2N – 1 (the root)

Iteration: Compute P(Lk | a) for all a:

If k is a leaf node:

Set P(Lk | a) = 1(a = xk)

If k is not a leaf node:

1. Compute P(Li | b), P(Lj | b) for all b, for daughter nodes i, j

2. Set P(Lk | a) = Σb,c P(b | a, ti) P(Li | b) P(c | a, tj) P(Lj | c)

Termination:

Likelihood at this column = P(x1, x2, …, xN | T, t) = Σa P(L2N–1 | a) P(a)
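For a root with two leaf children, the recursion collapses to P(Lroot | a) = P(x1 | a, t1) P(x2 | a, t2). A minimal sketch under Jukes-Cantor (tree shape, times, and the uniform root prior are illustrative assumptions); summing the likelihood over all possible leaf assignments must give 1:

```python
import math

BASES = "ACGT"

def jc_prob(b, a, t, alpha=1.0):
    """P(b | a, t) under Jukes-Cantor."""
    e = math.exp(-4 * alpha * t)
    return 0.25 * (1 + 3 * e) if b == a else 0.25 * (1 - e)

def likelihood(x1, x2, t1, t2):
    """P(x1, x2 | T, t) for a root with two leaf children, uniform root prior.
    P(L_root | a) = sum_b P(b|a,t1) 1(b=x1) * sum_c P(c|a,t2) 1(c=x2)
                  = P(x1|a,t1) P(x2|a,t2)."""
    return sum(0.25 * jc_prob(x1, a, t1) * jc_prob(x2, a, t2) for a in BASES)

# Summing over all 16 possible leaf assignments must give 1
total = sum(likelihood(x1, x2, 1.0, 1.0) for x1 in BASES for x2 in BASES)
assert abs(total - 1.0) < 1e-12
```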

Probabilistic Methods

Given M (ungapped) alignment columns of N sequences,

• Define likelihood of a tree:

L(T, t) = P(Data | T, t) = Πm=1…M P(x1m, …, xNm | T, t)

Maximum Likelihood Reconstruction:

• Given data X = (xij), find a topology T and length vector t that maximize likelihood L(T, t)

Current popular methods

HUNDREDS of programs available!

http://evolution.genetics.washington.edu/phylip/software.html#methods

Some recommended programs:

• Discrete / parsimony-based: Rec-1-DCM3

http://www.cs.utexas.edu/users/tandy/mp.html

Tandy Warnow and colleagues

• Probabilistic: SEMPHY

http://www.cs.huji.ac.il/labs/compbio/semphy/

Nir Friedman and colleagues