CS262 Lecture 12, Win06, Batzoglou
RNA Secondary Structure
aagacuucggaucuggcgacacccuacacuucggaugacaccaaagugaggucuucggcacgggcaccauucccaacuucggauuuugcuaccauaaagccuucggagcgggcguaacuc
Decoding: the CYK algorithm
Given x = x1....xN, and a SCFG G,
Find the most likely parse of x
(the most likely alignment of G to x)
Dynamic programming variable:
γ(i, j, V): likelihood of the most likely parse of xi…xj,
rooted at nonterminal V
Then,
γ(1, N, S): likelihood of the most likely parse of x by the grammar
The CYK algorithm (Cocke-Younger-Kasami)
Initialization: For i = 1 to N, any nonterminal V,
γ(i, i, V) = log P(V → xi)
Iteration: For i = 1 to N – 1, for j = i+1 to N, for any nonterminal V,
γ(i, j, V) = maxX maxY maxi≤k<j γ(i, k, X) + γ(k+1, j, Y) + log P(V → X Y)
Termination: log P(x | θ, π*) = γ(1, N, S)
Where π* is the optimal parse tree (if traced back appropriately from above)
(Figure: V spans xi…xj, splitting into X over xi…xk and Y over xk+1…xj)
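To make the recurrence concrete, here is a minimal Python sketch of CYK decoding for an SCFG in Chomsky normal form. The function name and the dictionary-based grammar encoding are illustrative choices, not from the lecture:

```python
import math

def cyk_decode(x, nonterminals, start, log_emit, log_rule):
    """Score of the most likely parse of x under an SCFG in Chomsky
    normal form (hypothetical dict-based grammar encoding):
      log_emit[(V, a)]    = log P(V -> a)    (terminal productions)
      log_rule[(V, X, Y)] = log P(V -> X Y)  (binary productions)
    Returns gamma(1, N, start)."""
    N = len(x)
    NEG = float("-inf")
    # g[i][j][V] = best log-prob of a parse of x[i..j] rooted at V
    g = [[{V: NEG for V in nonterminals} for _ in range(N)] for _ in range(N)]
    for i, a in enumerate(x):                    # initialization
        for V in nonterminals:
            g[i][i][V] = log_emit.get((V, a), NEG)
    for span in range(2, N + 1):                 # iteration: short to long spans
        for i in range(N - span + 1):
            j = i + span - 1
            for (V, X, Y), lp in log_rule.items():
                for k in range(i, j):            # split point
                    s = g[i][k][X] + g[k + 1][j][Y] + lp
                    if s > g[i][j][V]:
                        g[i][j][V] = s
    return g[0][N - 1][start]
```

For a toy grammar with P(S → S S) = 0.5 and P(S → a) = 0.5, the best parse of "aa" scores log(0.5·0.5·0.5) = log 0.125.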
A SCFG for predicting RNA structure
S → a S | c S | g S | u S | S a | S c | S g | S u |
    a S u | c S g | g S u | u S g | g S c | u S a |
    S S | ε
• Adjust the probability parameters to reflect bond strength, etc.
• No distinction between non-paired bases, bulges, loops
• Can modify to model these events:
  L: loop nonterminal, H: hairpin nonterminal, B: bulge nonterminal, etc.
CYK for RNA folding
Initialization:
γ(i, i – 1) = log P(ε)
Iteration:
For i = 1 to N
  For j = i to N
    γ(i, j) = max of:
      γ(i+1, j – 1) + log P(xi S xj)
      γ(i, j – 1) + log P(S xj)
      γ(i+1, j) + log P(xi S)
      maxi<k<j γ(i, k) + γ(k+1, j) + log P(S S)
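The four cases above can be sketched directly in Python, assuming hypothetical log-probability parameters for each production type (the parameter names and the single-nonterminal encoding are illustrative):

```python
import math

def rna_cyk(x, log_pair, log_left, log_right, log_bif, log_eps):
    """CYK fold of RNA string x under the one-nonterminal grammar.
    All parameters are hypothetical log-probabilities (illustrative names):
      log_pair[(a, b)] ~ log P(a S b)   log_left ~ log P(a S)
      log_right ~ log P(S a)            log_bif  ~ log P(S S)
      log_eps   ~ log P(eps)
    Returns gamma(1, N), the score of the best fold."""
    N = len(x)
    NEG = float("-inf")
    g = {(i, i - 1): log_eps for i in range(N + 1)}      # empty subsequences
    for span in range(1, N + 1):
        for i in range(N - span + 1):
            j = i + span - 1
            best = NEG
            if span >= 2:                                # x_i S x_j (base pair)
                best = g[(i + 1, j - 1)] + log_pair.get((x[i], x[j]), NEG)
            best = max(best, g[(i + 1, j)] + log_left)   # x_i S
            best = max(best, g[(i, j - 1)] + log_right)  # S x_j
            for k in range(i, j):                        # bifurcation S S
                best = max(best, g[(i, k)] + g[(k + 1, j)] + log_bif)
            g[(i, j)] = best
    return g[(0, N - 1)]
```

For x = "au" with, say, log P(a S u) = log 0.2 and log P(ε) = log 0.5, the best fold pairs the two bases, scoring log(0.5 · 0.2) = log 0.1.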
Evaluation
Recall HMMs:
Forward: fl(i) = P(x1…xi, πi = l)
Backward: bk(i) = P(xi+1…xN | πi = k)
Then,
P(x) = Σk fk(N) ak0 = Σl a0l el(x1) bl(1)
Analogue in SCFGs:
Inside: a(i, j, V) = P(xi…xj is generated by nonterminal V)
Outside: b(i, j, V) = P(x, excluding xi…xj, is generated by S and the excluded part is rooted at V)
The Inside Algorithm
To compute
a(i, j, V) = P(xi…xj, produced by V)
a(i, j, V) = ΣX ΣY Σk a(i, k, X) a(k+1, j, Y) P(V → X Y)
(Figure: V spans xi…xj; X spans xi…xk, Y spans xk+1…xj)
Algorithm: Inside
Initialization: For i = 1 to N, V a nonterminal,
a(i, i, V) = P(V → xi)
Iteration:
For i = 1 to N – 1, for j = i+1 to N, for V a nonterminal,
a(i, j, V) = ΣX ΣY Σk a(i, k, X) a(k+1, j, Y) P(V → X Y)
Termination:
P(x | θ) = a(1, N, S)
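The inside recursion can be sketched in Python, reusing the dictionary-based CNF grammar encoding (an illustrative choice, not the lecture's code):

```python
def inside(x, nonterminals, start, p_emit, p_rule):
    """Inside probabilities a(i, j, V) for a hypothetical SCFG in CNF.
      p_emit[(V, c)]    = P(V -> c)
      p_rule[(V, X, Y)] = P(V -> X Y)
    Returns P(x | theta) = a(1, N, start)."""
    N = len(x)
    # a[i][j][V] = probability that V generates x[i..j]
    a = [[{V: 0.0 for V in nonterminals} for _ in range(N)] for _ in range(N)]
    for i, c in enumerate(x):                    # initialization
        for V in nonterminals:
            a[i][i][V] = p_emit.get((V, c), 0.0)
    for span in range(2, N + 1):                 # iteration: short to long spans
        for i in range(N - span + 1):
            j = i + span - 1
            for (V, X, Y), p in p_rule.items():
                tot = 0.0
                for k in range(i, j):            # sum over split points
                    tot += a[i][k][X] * a[k + 1][j][Y]
                a[i][j][V] += tot * p
    return a[0][N - 1][start]
```

For the toy grammar P(S → S S) = 0.5, P(S → a) = 0.5, the total probability of "aa" is 0.5³ = 0.125 (a single parse).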
The Outside Algorithm
b(i, j, V) = Prob(x1…xi-1, xj+1…xN, where the “gap” is rooted at V)
Given that V is the right-hand-side nonterminal of a production,
b(i, j, V) = ΣX ΣY Σk<i a(k, i – 1, X) b(k, j, Y) P(Y → X V)
(Figure: Y spans xk…xj and splits into X over xk…xi-1 and V over xi…xj)
Algorithm: Outside
Initialization: b(1, N, S) = 1; for any other V, b(1, N, V) = 0
Iteration:
For i = 1 to N – 1, for j = N down to i, for V a nonterminal,
b(i, j, V) = ΣX ΣY Σk<i a(k, i – 1, X) b(k, j, Y) P(Y → X V) +
             ΣX ΣY Σk>j a(j+1, k, X) b(i, k, Y) P(Y → V X)
Termination: It is true for any i that:
P(x | θ) = ΣX b(i, i, X) P(X → xi)
Learning for SCFGs
We can now estimate
c(V) = expected number of times V is used in the parse of x1…xN

            1
c(V) = ––––––––– Σ1≤i≤N Σi≤j≤N a(i, j, V) b(i, j, V)
         P(x | θ)

                  1
c(V → X Y) = ––––––––– Σ1≤i≤N Σi<j≤N Σi≤k<j b(i, j, V) a(i, k, X) a(k+1, j, Y) P(V → X Y)
               P(x | θ)
Learning for SCFGs
Then, we can re-estimate the parameters with EM, by:

                  c(V → X Y)
Pnew(V → X Y) = ––––––––––––
                     c(V)

                c(V → a)     Σi: xi = a b(i, i, V) P(V → a)
Pnew(V → a) = ––––––––––– = –––––––––––––––––––––––––––––––––––
                  c(V)       Σ1≤i≤N Σi≤j≤N a(i, j, V) b(i, j, V)
Summary: SCFG and HMM algorithms
GOAL               HMM algorithm    SCFG algorithm
Optimal parse      Viterbi          CYK
Estimation         Forward          Inside
                   Backward         Outside
Learning           EM: Fw/Bck       EM: Ins/Outs
Memory complexity  O(N K)           O(N² K)
Time complexity    O(N K²)          O(N³ K³)

Where K: # of states in the HMM; # of nonterminals in the SCFG
The Zuker algorithm – main ideas
Models energy of a fold in terms of specific features:
1. Pairs of base pairs (stacked pairs)
2. Bulges
3. Loops (size, composition)
4. Interactions between stem and loop
(Figures, each drawn on a 5′→3′ strand: a stacked pair at positions i, j; a loop of length l; a bulge between positions i, j, j′)
Inferring Phylogenies
Trees can be inferred by several criteria:
• Morphology of the organisms
  Can lead to mistakes!
• Sequence comparison
Example:
Orc:    ACAGTGACGCCCCAAACGT
Elf:    ACAGTGACGCTACAAACGT
Dwarf:  CCTGTGACGTAACAAACGA
Hobbit: CCTGTGACGTAGCAAACGA
Human:  CCTGTGACGTAGCAAACGA
Modeling Evolution
During an infinitesimal time Δt, there is not enough time for two substitutions to happen on the same nucleotide
So we can estimate P(x | y, t), for x, y ∈ {A, C, G, T}
Then let

         P(A|A, t) …… P(A|T, t)
S(t) =      …      ……    …
         P(T|A, t) …… P(T|T, t)

(Figure: nucleotide y evolving into x over time t)
Modeling Evolution
Reasonable assumption: multiplicative
(implying a stationary Markov process)
S(t+t′) = S(t) S(t′)
That is, P(x | y, t+t′) = Σz P(x | z, t) P(z | y, t′)
Jukes-Cantor: constant rate of evolution

                             1 – 3αε    αε        αε        αε
For short time ε,            αε         1 – 3αε   αε        αε
S(ε) = I + Rε =              αε         αε        1 – 3αε   αε
                             αε         αε        αε        1 – 3αε

(Figure: all substitutions among A, C, G, T occur at the same rate α)
Modeling Evolution
Jukes-Cantor:
For longer times,

         r(t) s(t) s(t) s(t)
S(t) =   s(t) r(t) s(t) s(t)
         s(t) s(t) r(t) s(t)
         s(t) s(t) s(t) r(t)

Where we can derive:
r(t) = ¼ (1 + 3 e–4αt)
s(t) = ¼ (1 – e–4αt)

Derivation:
S(t+ε) = S(t) S(ε) = S(t)(I + Rε)
Therefore, (S(t+ε) – S(t)) / ε = S(t) R
At the limit ε → 0, S′(t) = S(t) R
Equivalently,
r′ = –3αr + 3αs
s′ = –αs + αr
These differential equations lead to:
r(t) = ¼ (1 + 3 e–4αt)
s(t) = ¼ (1 – e–4αt)
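As a sanity check, the closed forms can be verified numerically by Euler-integrating the two differential equations (α and t here are arbitrary assumed values, not from the lecture):

```python
import math

# Integrate r' = -3*alpha*r + 3*alpha*s, s' = alpha*r - alpha*s with small
# Euler steps, starting from S(0) = I (r = 1, s = 0), and compare the
# result with the closed-form Jukes-Cantor solution derived above.
alpha, t, n = 0.3, 1.7, 100_000       # assumed rate, time, step count
dt = t / n
r, s = 1.0, 0.0
for _ in range(n):
    r, s = (r + dt * (-3 * alpha * r + 3 * alpha * s),
            s + dt * (alpha * r - alpha * s))

r_exact = 0.25 * (1 + 3 * math.exp(-4 * alpha * t))
s_exact = 0.25 * (1 - math.exp(-4 * alpha * t))
```

Note that r + 3s = 1 is conserved exactly by the dynamics, as each row of S(t) must sum to 1.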
Modeling Evolution
Kimura:
Transitions: A/G, C/T
Transversions: A/T, A/C, G/T, C/G
Transitions (rate α) are much more likely than transversions (rate β)

         r(t) s(t) u(t) s(t)
S(t) =   s(t) r(t) s(t) u(t)
         u(t) s(t) r(t) s(t)
         s(t) u(t) s(t) r(t)

Where
s(t) = ¼ (1 – e–4βt)
u(t) = ¼ (1 + e–4βt – 2e–2(α+β)t)
r(t) = 1 – 2s(t) – u(t)
Phylogeny and sequence comparison
Basic principles:
• Degree of sequence difference is proportional to the length of independent sequence evolution
• Only use positions where the alignment is fairly certain – avoid areas with (too many) gaps
Distance between two sequences
Given sequences xi, xj,
Define
dij = distance between the two sequences
One possible definition:
dij = fraction f of sites u where xi[u] ≠ xj[u]
Better model (Jukes-Cantor):
f = 3 s(t) = ¾ (1 – e–4αt)
¾ e–4αt = ¾ – f
–4αt = log(1 – 4/3 f)
dij = t = – (4α)–1 log(1 – 4/3 f)
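This distance is straightforward to compute; here is a small Python sketch (the function name and the default α = 1 are illustrative assumptions):

```python
import math

def jukes_cantor_distance(seq1, seq2, alpha=1.0):
    """Jukes-Cantor distance d = -(1/(4*alpha)) * log(1 - 4f/3),
    where f is the observed fraction of mismatching sites.
    alpha (substitution rate) is an assumed parameter."""
    assert len(seq1) == len(seq2)
    f = sum(a != b for a, b in zip(seq1, seq2)) / len(seq1)
    if f >= 0.75:                       # saturation: the formula diverges
        return float("inf")
    return -math.log(1 - 4 * f / 3) / (4 * alpha)
```

For f = 0.25 and α = 1 this gives d = log(3/2)/4 ≈ 0.101, slightly larger than the raw mismatch fraction, as the correction accounts for multiple hits.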
A simple clustering method for building trees
UPGMA (unweighted pair group method using arithmetic averages),
or the Average Linkage Method
Given two disjoint clusters Ci, Cj of sequences,

            1
dij = ––––––––– Σ{p ∈ Ci, q ∈ Cj} dpq
       |Ci| |Cj|

Claim: if Ck = Ci ∪ Cj, then the distance to another cluster Cl is:

        dil |Ci| + djl |Cj|
dkl = ––––––––––––––––––––
           |Ci| + |Cj|

Proof:

        ΣCi,Cl dpq + ΣCj,Cl dpq
dkl = ––––––––––––––––––––––––
          (|Ci| + |Cj|) |Cl|

        |Ci| (1/(|Ci||Cl|)) ΣCi,Cl dpq + |Cj| (1/(|Cj||Cl|)) ΣCj,Cl dpq
    = ––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
                              |Ci| + |Cj|

        |Ci| dil + |Cj| djl
    = –––––––––––––––––––––
           |Ci| + |Cj|
Algorithm: Average Linkage
Initialization:
Assign each xi into its own cluster Ci
Define one leaf per sequence, height 0
Iteration:
Find two clusters Ci, Cj s.t. dij is min
Let Ck = Ci ∪ Cj
Define node connecting Ci, Cj,
& place it at height dij/2
Delete Ci, Cj
Termination:
When two clusters i, j remain,
place root at height dij/2
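A compact Python sketch of the algorithm above (naming merged clusters by concatenation and returning the tree as nested (subtree, height) tuples are illustrative choices):

```python
def upgma(names, d):
    """Average-linkage clustering. d[(a, b)] = distance between leaves
    a < b in `names`. Returns a nested tuple ((left, right), height)."""
    dist = {frozenset(p): v for p, v in d.items()}
    clusters = {n: (1, n) for n in names}        # name -> (size, subtree)
    def D(a, b):
        return dist[frozenset((a, b))]
    while len(clusters) > 1:
        # find the pair of clusters at minimal average-linkage distance
        ci, cj = min(((a, b) for a in clusters for b in clusters if a < b),
                     key=lambda p: D(*p))
        ni, ti = clusters.pop(ci)
        nj, tj = clusters.pop(cj)
        ck = ci + "+" + cj
        # update rule from the claim: d_kl = (ni*d_il + nj*d_jl)/(ni + nj)
        for cl in clusters:
            dist[frozenset((ck, cl))] = (ni * D(ci, cl) + nj * D(cj, cl)) / (ni + nj)
        clusters[ck] = (ni + nj, ((ti, tj), D(ci, cj) / 2))   # height d_ij/2
    return next(iter(clusters.values()))[1]
```

On the five-sequence example that follows (v, w, x, y, z), this merges y+z at height 1, then x at height 2, then v+w at height 3, with the root at height 4.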
Example
     v  w  x  y  z
v    0  6  8  8  8
w       0  8  8  8
x          0  4  4
y             0  2
z                0

     v  w  x  yz
v    0  6  8  8
w       0  8  8
x          0  4
yz            0

     v  w  xyz
v    0  6  8
w       0  8
xyz        0

     vw  xyz
vw    0   8
xyz       0

(Figure: the resulting tree over leaves y, z, x, w, v; y and z join at height 1, x joins them at height 2, v and w join at height 3, and the root is at height 4)
Ultrametric Distances and Molecular Clock
Definition:
A distance function d(.,.) is ultrametric if for any three indices i, j, k with dij ≤ dik ≤ djk, it is true that
dik = djk
The Molecular Clock:
The evolutionary distance between species x and y is 2× the Earth time to reach the nearest common ancestor
That is, the molecular clock has constant rate in all species
(Figure: an ultrametric tree over leaves 1, 4, 2, 3, 5, with the vertical axis in years)
The molecular clock results in ultrametric distances
Ultrametric Distances & Average Linkage
Average Linkage is guaranteed to reconstruct correctly a binary tree with ultrametric distances
Proof: Exercise
Weakness of Average Linkage
Molecular clock: all species evolve at the same rate (Earth time)
However, certain species (e.g., mouse, rat) evolve much faster
Example where UPGMA messes up:
(Figure: the correct tree over leaves 1–4 versus the different tree reconstructed by average linkage (AL), which is misled by the unequal rates)
Additive Distances
Given a tree, a distance measure is additive if the distance between any pair of leaves is the sum of lengths of edges connecting them
Given a tree T & additive distances dij, can uniquely reconstruct edge lengths:
• Find two neighboring leaves i, j, with common parent k
• Place parent node k at distance dkm = ½ (dim + djm – dij) from any node m
(Figure: a tree with leaves 1–8 and internal nodes 9–13; d1,4 is the sum of the edge lengths on the path between leaves 1 and 4)
Reconstructing Additive Distances Given T
(Figure: tree T over leaves v, w, x, y, z with edge lengths 5, 4, 7, 3, 3, 4, 6)

D:
     v   w   x   y   z
v    0  10  17  16  16
w        0  15  14  14
x            0   9  15
y                0  14
z                    0

If we know T and D, but do not know the individual edge lengths, we can reconstruct them
Merging the neighboring leaves v and w into their parent a gives a reduced matrix D1:

D1:
     a   x   y   z
a    0  11  10  10
x        0   9  15
y            0  14
z                0

dax = ½ (dvx + dwx – dvw)
day = ½ (dvy + dwy – dvw)
daz = ½ (dvz + dwz – dvw)
Continuing, merging x and y into their parent b gives D2, and then b and z into their parent c gives D3:

D2:
     a   b   z
a    0   6  10
b        0  10
z            0

D3:
     a   c
a    0   3
c        0

d(a, c) = 3
d(b, c) = d(a, b) – d(a, c) = 3
d(c, z) = d(a, z) – d(a, c) = 7
d(b, x) = d(a, x) – d(a, b) = 5
d(b, y) = d(a, y) – d(a, b) = 4
d(a, w) = d(z, w) – d(a, z) = 4
d(a, v) = d(z, v) – d(a, z) = 6
Correct!!!
Neighbor-Joining
• Guaranteed to produce the correct tree if distance is additive
• May produce a good tree even when distance is not additive
Step 1: Finding neighboring leaves
Define
Dij = dij – (ri + rj)
Where
          1
ri = ––––––– Σk dik
      |L| – 2
Claim: The above “magic trick” ensures that Dij is minimal iff i, j are neighbors
Proof: Very technical, please read Durbin et al.!
(Figure: a four-leaf tree with edge lengths 0.1 and 0.4 in which the closest pair of leaves are not neighbors, so minimizing plain dij would fail)
Algorithm: Neighbor-joining
Initialization: Define T to be the set of leaf nodes, one per sequence; let L = T
Iteration:
Pick i, j s.t. Dij is minimal
Define a new node k, and set dkm = ½ (dim + djm – dij) for all m ∈ L
Add k to T, with edges of lengths dik = ½ (dij + ri – rj), djk = dij – dik
Remove i, j from L; add k to L
Termination:
When L consists of two nodes i, j, add the edge between them, of length dij
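A sketch of the full procedure in Python (the internal-node labels n1, n2, … and the edge-list output format are illustrative assumptions):

```python
def neighbor_joining(names, d):
    """Neighbor-joining. d[(a, b)] = distance between leaves a, b.
    Returns a list of edges (node, node, length); internal nodes get
    hypothetical labels n1, n2, ..."""
    dist = {frozenset(p): float(v) for p, v in d.items()}
    L = list(names)
    edges = []
    counter = 0
    def D(a, b):
        return dist[frozenset((a, b))]
    while len(L) > 2:
        r = {i: sum(D(i, k) for k in L if k != i) / (len(L) - 2) for i in L}
        # pick the pair minimizing D_ij = d_ij - (r_i + r_j)
        i, j = min(((a, b) for a in L for b in L if a < b),
                   key=lambda p: D(*p) - r[p[0]] - r[p[1]])
        counter += 1
        k = f"n{counter}"
        dik = 0.5 * (D(i, j) + r[i] - r[j])
        edges.append((i, k, dik))
        edges.append((j, k, D(i, j) - dik))
        for m in L:
            if m not in (i, j):
                dist[frozenset((k, m))] = 0.5 * (D(i, m) + D(j, m) - D(i, j))
        L.remove(i); L.remove(j); L.append(k)
    i, j = L
    edges.append((i, j, D(i, j)))
    return edges
```

On the additive five-leaf matrix from the earlier slides, this recovers exactly the seven true edge lengths (3, 3, 4, 4, 5, 6, 7).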
Parsimony – what if we don’t have distances?
• One of the most popular methods:
GIVEN a multiple alignment
FIND a tree & history of substitutions explaining the alignment
Idea:
Find the tree that explains the observed sequences with a minimal number of substitutions
Two computational subproblems:
1. Find the parsimony cost of a given tree (easy)
2. Search through all tree topologies (hard)
Example
(Figure: a four-leaf tree over column A B A A; the leaf sets are {A}, {B}, {A}, {A}; merging {A} and {B} gives {A, B} and cost C += 1; the remaining internal nodes get {A}; final cost C = 1)
Parsimony Scoring
Given a tree, and an alignment column u
Label internal nodes to minimize the number of required substitutions
Initialization:
Set cost C = 0; k = 2N – 1
Iteration:
If k is a leaf, set Rk = { xk[u] }
If k is not a leaf,
  Let i, j be the daughter nodes;
  Set Rk = Ri ∩ Rj if intersection is nonempty
  Set Rk = Ri ∪ Rj, and C += 1, if intersection is empty
Termination:
Minimal cost of tree for column u = C
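The scoring rules above can be sketched as a short recursion (the nested-tuple tree encoding, with leaves as indices into the column, is an illustrative assumption):

```python
def fitch_cost(tree, column):
    """Parsimony cost of one alignment column on a given binary tree.
    tree: nested tuples whose leaves are indices into `column`.
    Returns (R_root, cost) following the intersection/union rules above."""
    if not isinstance(tree, tuple):          # leaf: R_k = { x_k[u] }
        return {column[tree]}, 0
    li, lj = tree
    Ri, ci = fitch_cost(li, column)
    Rj, cj = fitch_cost(lj, column)
    if Ri & Rj:                              # nonempty intersection: no cost
        return Ri & Rj, ci + cj
    return Ri | Rj, ci + cj + 1              # empty intersection: C += 1
```

On the tree ((1,2),(3,4)) with column A B A A this reproduces the example above: the {A}, {B} pair contributes one substitution, for a final cost of 1.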
Example
(Figure: two example columns on four-leaf trees. Leaves A A A B give internal sets {A} and {A, B} and cost 1; leaves B A B A give internal sets {A, B} throughout and cost 2)
Probabilistic Methods
A more refined measure of evolution along a tree than parsimony
P(x1, x2, xroot | t1, t2) = P(xroot) P(x1 | t1, xroot) P(x2 | t2, xroot)
If we use Jukes-Cantor, for example, and x1 = xroot = A, x2 = C, t1 = t2 = 1,
= pA ¼(1 + 3e–4α) ¼(1 – e–4α) = (¼)³ (1 + 3e–4α)(1 – e–4α)
(Figure: root xroot with children x1 at distance t1 and x2 at distance t2)
Probabilistic Methods
• If we know all internal labels xu,
P(x1, x2, …, xN, xN+1, …, x2N-1 | T, t) = P(xroot) Πj≠root P(xj | xparent(j), tj, parent(j))
• Usually we don’t know the internal labels, therefore
P(x1, x2, …, xN | T, t) = ΣxN+1 ΣxN+2 … Σx2N-1 P(x1, x2, …, x2N-1 | T, t)
(Figure: a tree with root xroot, internal nodes xu, and leaves x1, x2, …, xN)
Felsenstein’s Likelihood Algorithm
To calculate P(x1, x2, …, xN | T, t)
Let P(Lk | a) denote the probability of all the leaves below node k, given that the residue at k is a
Initialization: Set k = 2N – 1
Iteration: Compute P(Lk | a) for all a
If k is a leaf node:
  Set P(Lk | a) = 1(a = xk)
If k is not a leaf node:
  1. Compute P(Li | b), P(Lj | b) for all b, for daughter nodes i, j
  2. Set P(Lk | a) = Σb,c P(b | a, ti) P(Li | b) P(c | a, tj) P(Lj | c)
Termination:
Likelihood at this column = P(x1, x2, …, xN | T, t) = Σa P(L2N-1 | a) P(a)
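The recursion can be sketched in Python using, for instance, Jukes-Cantor substitution probabilities for P(b | a, t). Equal branch lengths and a uniform root prior are simplifying assumptions made here, not part of the algorithm itself:

```python
import math

ALPHABET = "ACGT"

def jc_prob(b, a, t, alpha=1.0):
    """Jukes-Cantor substitution probability P(b | a, t)."""
    e = math.exp(-4 * alpha * t)
    return 0.25 * (1 + 3 * e) if a == b else 0.25 * (1 - e)

def likelihood(tree, t=0.1, prior=None):
    """One-column likelihood by Felsenstein's algorithm.
    tree: nested tuples whose leaves are residues, e.g. (("A", "C"), "G").
    Equal branch lengths t and a uniform prior are simplifying assumptions."""
    prior = prior or {a: 1.0 / len(ALPHABET) for a in ALPHABET}
    def P_L(node):
        if not isinstance(node, tuple):          # leaf: P(L_k | a) = 1(a = x_k)
            return {a: float(a == node) for a in ALPHABET}
        Li, Lj = P_L(node[0]), P_L(node[1])
        # P(L_k | a) = [sum_b P(b|a,t) P(L_i|b)] * [sum_c P(c|a,t) P(L_j|c)]
        return {a: sum(jc_prob(b, a, t) * Li[b] for b in ALPHABET)
                   * sum(jc_prob(c, a, t) * Lj[c] for c in ALPHABET)
                for a in ALPHABET}
    root = P_L(tree)
    return sum(prior[a] * root[a] for a in ALPHABET)
```

Summing the likelihood over all possible leaf labelings gives 1, and identical sister leaves are more likely than different ones at short branch lengths, as expected.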
Probabilistic Methods
Given M (ungapped) alignment columns of N sequences,
• Define the likelihood of a tree:
L(T, t) = P(Data | T, t) = Πm=1…M P(x1m, …, xNm | T, t)
Maximum Likelihood Reconstruction:
• Given data X = (xij), find a topology T and length vector t that maximize the likelihood L(T, t)
Current popular methods
HUNDREDS of programs available!
http://evolution.genetics.washington.edu/phylip/software.html#methods
Some recommended programs:
• Discrete / parsimony-based: Rec-1-DCM3
http://www.cs.utexas.edu/users/tandy/mp.html
Tandy Warnow and colleagues
• Probabilistic: SEMPHY
http://www.cs.huji.ac.il/labs/compbio/semphy/
Nir Friedman and colleagues