CS5263 Bioinformatics
RNA Secondary Structure Prediction
Outline
• Biological roles for RNA
• RNA secondary structure
  – What’s “secondary structure”?
  – How is it represented?
  – Why is it important?
• How to predict?
Central dogma
The flow of genetic information
DNA → RNA → Protein (transcription, then translation)
DNA → DNA (replication)
Classical Roles for RNA
• mRNA
• tRNA
• rRNA
Ribosome
“Semi-classical” RNA
• snRNA - small nuclear RNA (60-300 nt), involved in splicing (removing introns), etc.
• RNaseP - tRNA processing (~300 nt)
• SRP RNA - signal recognition particle RNA: membrane targeting (~100-300 nt)
• tmRNA - resetting stalled ribosomes, destroying aberrant mRNA
• Telomerase - (200-400 nt)
• snoRNA - small nucleolar RNA (many varieties; 80-200 nt)
Non-coding RNAs
Dramatic discoveries in the last 10 years
• 100s of new families
• Many roles: regulation, transport, stability, catalysis, …
• siRNA: small interfering RNA (Nobel Prize 2006) and miRNAs: both are ~21-23 nt
  – Regulating gene expression
  – Evidence of disease association
• 1% of DNA codes for protein, but 30% of it is copied into RNA, i.e. ncRNA >> mRNA
Take-home message
• RNAs play many important roles in the cell beyond the classical roles
  – Many of which are yet to be discovered
• RNA functions are determined by structures
Example: Riboswitch
• Riboswitch: an mRNA regulates its own activity
RNA structure
• Primary: sequence
• Secondary: base-pairing
• Tertiary: 3D shape
RNA base-pairing
• Watson-Crick Pairing
  – C-G ~3 kcal/mol
  – A-U ~2 kcal/mol
• “Wobble Pair” G-U ~1 kcal/mol
• Non-canonical Pairs
tRNA structure
Secondary structure prediction
• Given: CAUUUGUGUACCU…
• Goal: [figure: the folded secondary structure of the sequence]
• How can we compute that?
Terminology
• Hairpin loops
• Stems
• Bulge loops
• Interior loops
• Multi-branched loops
Pseudoknot
• Makes structure prediction hard. Not considered in most algorithms.
[Figure: pseudoknot example, positions 1-49 of the sequence below]
5’-ucgacuguaaaaaagcgggcgacuuucagucgcucuuuuugucgcgcgc-3’
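The "nested" condition that most algorithms rely on can be checked mechanically. The sketch below is only an illustration: the function name `is_nested` and the representation of a structure as a list of `(i, j)` position pairs with `i < j` are assumptions, not something the slides define. Two pairs cross, and so form a pseudoknot, when one starts inside the other and ends outside it.

```python
from itertools import combinations

def is_nested(pairs):
    """Return True if no two base pairs cross (i.e. there is no pseudoknot).

    `pairs` is a list of (i, j) positions with i < j (hypothetical format).
    """
    for (i, j), (k, l) in combinations(sorted(pairs), 2):
        # After sorting, i <= k. The pairs cross when i < k < j < l.
        if i < k < j < l:
            return False
    return True

stem = [(1, 9), (2, 8), (3, 7)]   # pairs stacked inside one another
knot = [(1, 10), (5, 15)]         # second pair starts inside, ends outside

print(is_nested(stem), is_nested(knot))
```

A structure that fails this check cannot be produced by the dynamic programming recurrences in the following slides, which only combine sub-intervals that sit side by side or inside one another.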
The Nussinov algorithm
• Goal: maximizing the number of base-pairs
• Idea: Dynamic programming
  – Loop matching
  – Nussinov, Pieczenik, Griggs, Kleitman ’78
• Too simple for accurate prediction, but stepping-stone for later algorithms
The Nussinov algorithm
Problem:
Find the RNA structure with the maximum (weighted) number of nested pairings
Nested: no pseudoknot
[Figure: nested secondary structure drawn for the sequence below]
ACCACGCUUAAGACACCUAGCUUGUGUCCUGGAGGUCUAUAAGUCAGACCGCGAGAGGGAAGACUCGUAUAAGCG
The Nussinov algorithm
• Given sequence X = x1…xN,
• Define DP matrix: F(i, j) = maximum number of base-pairs if xi…xj folds optimally
  – Matrix is symmetric, so let i < j
The Nussinov algorithm
• Can be summarized into two cases:
  – (i, j) paired: optimal score is 1 + F(i+1, j-1)
  – (i, j) unpaired: optimal score is max_k F(i, k) + F(k+1, j), for k = i..j-1
The Nussinov algorithm
• F(i, i) = 0
• \[ F(i, j) = \max \begin{cases} F(i+1, j-1) + S(x_i, x_j) \\ \max_{k = i..j-1} F(i, k) + F(k+1, j) \end{cases} \]
• S(xi, xj) = 1 if xi, xj can form a base-pair, and 0 otherwise
  – Generalize: S(A, U) = 2, S(C, G) = 3, S(G, U) = 1
  – Or other types of scores (later)
• F(1, N) gives the optimal score for the whole seq
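The recurrence can be transcribed almost literally as a memoized recursion. This is a minimal sketch, not the lecture's implementation: the name `nussinov_score`, the 0-based indexing, and the use of `frozenset` keys are assumptions made for illustration. The weights are the generalized scores from the slide, and an empty interval (j < i) is treated as scoring 0, which is what the recurrence needs when two neighbouring bases are tested as a pair.

```python
from functools import lru_cache

# Pair weights from the slide: S(A,U) = 2, S(C,G) = 3, S(G,U) = 1.
WEIGHTS = {frozenset("AU"): 2, frozenset("CG"): 3, frozenset("GU"): 1}

def nussinov_score(seq, weights=WEIGHTS):
    """Evaluate F(1, N) top-down, using 0-based indices internally."""

    @lru_cache(maxsize=None)
    def F(i, j):
        if j <= i:  # single base or empty interval: no pairs possible
            return 0
        # Case 1: x_i pairs with x_j (contributes 0 if they cannot pair).
        paired = F(i + 1, j - 1) + weights.get(frozenset((seq[i], seq[j])), 0)
        # Case 2: split the interval at k = i..j-1.
        split = max(F(i, k) + F(k + 1, j) for k in range(i, j))
        return max(paired, split)

    return F(0, len(seq) - 1)

print(nussinov_score("GGGAAAUCC"))
```

Passing a `weights` table with all three entries set to 1 gives the plain "count the base pairs" version. Recursion depth grows with sequence length, so this form is only suitable for short sequences.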
How to fill in the DP matrix?
\[ F(i, j) = \max \begin{cases} F(i+1, j-1) + S(x_i, x_j) \\ \max_{k = i..j-1} F(i, k) + F(k+1, j) \end{cases} \]
[Figure: upper-triangular DP matrix with a zero diagonal, highlighting cell (i, j) together with rows i, i+1 and columns j–1, j]
• Fill the matrix one diagonal at a time: j – i = 1, then j – i = 2, j – i = 3, …, up to j – i = N – 1
Minimum Loop length
• Sharp turns unlikely
• Let minimum length of hairpin loop be 1 (3 in real predictions)
• F(i, i+1) = 0
[Figure: DP matrix with the diagonal and the first off-diagonal set to 0, next to a small hairpin example]
Algorithm
Initialization:
  F(i, i) = 0, for i = 1 to N
  F(i, i+1) = 0, for i = 1 to N-1
Iteration:
  For L = 2 to N-1 (L = 1 is covered by the initialization)
    For i = 1 to N – L
      j = i + L
      F(i, j) = max { F(i+1, j-1) + s(xi, xj),
                      max over i ≤ k < j of F(i, k) + F(k+1, j) }
Termination:
  Best score is given by F(1, N)
  (For trace back, refer to the Durbin book)
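The pseudocode above, together with a traceback, might look roughly like the following. Treat it as a sketch under stated assumptions: the function names, the `min_loop` parameter, the 0-based matrix, and the particular tie-breaking in `traceback` (prefer pairing the two ends, otherwise take the first split that matches) are illustrative choices, not taken from the slides or from the Durbin book. Only the number of base pairs is counted.

```python
PAIRS = {frozenset("AU"), frozenset("CG"), frozenset("GU")}

def can_pair(a, b):
    return frozenset((a, b)) in PAIRS

def nussinov_fill(seq, min_loop=1):
    """Fill F bottom-up; F[i][j] = max number of pairs in seq[i..j] (0-based)."""
    n = len(seq)
    F = [[0] * n for _ in range(n)]          # covers F(i,i) = F(i,i+1) = 0
    for L in range(min_loop + 1, n):         # L = j - i
        for i in range(n - L):
            j = i + L
            best = F[i + 1][j - 1] + (1 if can_pair(seq[i], seq[j]) else 0)
            for k in range(i, j):
                best = max(best, F[i][k] + F[k + 1][j])
            F[i][j] = best
    return F

def traceback(F, seq, min_loop=1):
    """Recover one optimal set of pairs, reported as 1-based positions."""
    pairs, stack = [], [(0, len(seq) - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= min_loop:                # too short to contain a pair
            continue
        if can_pair(seq[i], seq[j]) and F[i][j] == F[i + 1][j - 1] + 1:
            pairs.append((i + 1, j + 1))
            stack.append((i + 1, j - 1))
            continue
        for k in range(i, j):                # otherwise follow a matching split
            if F[i][j] == F[i][k] + F[k + 1][j]:
                stack.append((i, k))
                stack.append((k + 1, j))
                break
    return sorted(pairs)

seq = "GGGAAAUCC"
F = nussinov_fill(seq)
print(F[0][len(seq) - 1], traceback(F, seq))
```

Because several structures can share the optimal score, a different tie-breaking rule in `traceback` would return a different, equally good structure.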
Complexity
  For L = 2 to N-1
    For i = 1 to N – L
      j = i + L
      F(i, j) = max { F(i+1, j-1) + s(xi, xj),
                      max over i ≤ k < j of F(i, k) + F(k+1, j) }
• Time complexity: O(N^3)
• Memory: O(N^2)
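A short count makes the cubic bound concrete. For a given span L = j – i there are N – L cells, and each cell tries L split points k, so summing over all spans (taking L = 1..N-1 as an upper bound) gives:

```latex
\[
\sum_{L=1}^{N-1} (N - L)\,L \;=\; \frac{N(N-1)(N+1)}{6} \;=\; \frac{N^3 - N}{6} \;=\; O(N^3)
\]
```

The memory bound follows from storing one value per cell (i, j) of the N × N matrix.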
Example
• RNA sequence: GGGAAAUCC
• Only count # of base-pairs
  – A-U = 1
  – G-C = 1
  – G-U = 1
• Minimum hairpin loop length = 1
Initialization (F(i, i) = 0 and F(i, i+1) = 0), then fill by increasing j – i:
• j – i = 2: F(5, 7) = 1; all other entries are 0
• j – i = 3: F(4, 7) = 1, F(5, 8) = 1; all other entries are 0
• j – i = 4: F(3, 7) = 1, F(4, 8) = 1, F(5, 9) = 1; all other entries are 0

Completed matrix (row i, column j; entries below the diagonal omitted):

       G  G  G  A  A  A  U  C  C
   G   0  0  0  0  0  0  1  2  3
   G      0  0  0  0  0  1  2  3
   G         0  0  0  0  1  2  2
   A            0  0  0  1  1  1
   A               0  0  1  1  1
   A                  0  0  0  0
   U                     0  0  0
   C                        0  0
   C                           0

Traceback from F(1, 9) = 3 yields several optimal structures; three are shown:
• G2-C9, G3-C8, A4-U7 (G1 unpaired, hairpin loop AA)
• G1-C9, G2-C8, G3-U7 (hairpin loop AAA)
• G1-C9, G2-C8, A4-U7 (G3 bulged, hairpin loop AA)
Energy minimization
  For L = 2 to N-1
    For i = 1 to N – L
      j = i + L
      E(i, j) = min { E(i+1, j-1) + e(xi, xj),
                      min over i ≤ k < j of E(i, k) + E(k+1, j) }
e(xi, xj) represents the energy of xi base-pairing with xj
• Energies are negative values, so the score is minimized rather than maximized.
• More complex energy rules: energy depends on neighboring bases
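Switching from counting pairs to minimizing energy only changes the sign convention and replaces max with min. The sketch below assumes per-pair energies of -3, -2 and -1 kcal/mol for C-G, A-U and G-U, which are simply the approximate magnitudes from the base-pairing slide with a negative sign; the names `ENERGY` and `min_energy` are hypothetical, and neighbour-dependent rules are not modelled.

```python
# Assumed per-pair energies in kcal/mol (illustrative values only).
ENERGY = {frozenset("CG"): -3.0, frozenset("AU"): -2.0, frozenset("GU"): -1.0}

def min_energy(seq, min_loop=1):
    """Return E(1, N): the lowest total pair energy of a nested structure."""
    n = len(seq)
    if n == 0:
        return 0.0
    E = [[0.0] * n for _ in range(n)]
    for L in range(min_loop + 1, n):
        for i in range(n - L):
            j = i + L
            # Bases that cannot pair contribute 0, so pairing them never helps.
            best = E[i + 1][j - 1] + ENERGY.get(frozenset((seq[i], seq[j])), 0.0)
            for k in range(i, j):
                best = min(best, E[i][k] + E[k + 1][j])
            E[i][j] = best
    return E[0][n - 1]

print(min_energy("GGGAAAUCC"))
```

With these assumed values the weaker G-U pair is avoided when an A-U pair is available, which already shows how energies can favour one structure among several with the same number of pairs.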
More realistic energy rules
[Figure: hairpin with a 4 nt loop, a stem containing a 1 nt bulge, and a 5’ dangling end, annotated with the free-energy contributions below]

Contribution                           Energy (kcal/mol)
4 nt hairpin                           +5.9
Terminal mismatch of hairpin           -1.1
Stacking                               -2.9
Stacking (special for 1 nt bulge)      -2.9
Stack                                  -1.8
Stack                                  -0.9
Stack                                  -1.8
Stack                                  -2.1
1 nt bulge                             +3.3
5’-dangle                              -0.3
Unstructured                            0
Overall ΔG                             -4.6
Complete energy rules at http://www.bioinfo.rpi.edu/zukerm/cgi-bin/efiles.cgi
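As a check on the figure's annotations, adding the individual contributions listed on the slide reproduces the overall value:

```latex
\[
\Delta G = 5.9 - 1.1 - 2.9 - 2.9 - 1.8 - 0.9 - 1.8 - 2.1 + 3.3 - 0.3 + 0
         = 9.2 - 13.8 = -4.6\ \text{kcal/mol}
\]
```

The destabilizing loop terms (+5.9 for the hairpin, +3.3 for the bulge) are outweighed by the stacking terms, which is why the structure is predicted to be stable.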
The Zuker algorithm – main ideas
1. Instead of base pairs, pairs of base pairs (more accurate)
2. Separate score for bulges
3. Separate score for different-size & composition of loops
4. Separate score for interactions between stem & beginning of loop
5. Use additional matrix to remember current state, e.g., to model stacking energy:
  • W(i, j): energy of the best structure on i, j
  • V(i, j): energy of the best structure on i, j given that i, j are paired
  • Similar to affine-gap alignment.
Two popular implementations
• mfold (Zuker): http://mfold.bioinfo.rpi.edu/
• RNAfold in the Vienna package (Hofacker): http://www.tbi.univie.ac.at/~ivo/RNA/
Accuracy
• 50-70% for sequences up to 300 nt
• Not perfect, but useful
• Possible reasons:
  – Energy rule not perfect: 5-10% error
  – Many alternative structures within this error range
  – Alternative structures do exist
  – Structure may change in presence of other molecules
Comparative structure prediction
To maintain structure, two nucleotides that form a base-pair tend to mutate together
Given K homologous aligned RNA sequences:
Human  aagacuucggaucuggcgacaccc
Mouse  uacacuucggaugacaccaaagug
Worm   aggucuucggcacgggcaccauuc
Fly    ccaacuucggauuuugcuaccaua
Orc    aagccuucggagcgggcguaacuc
If the ith and jth positions always hold complementary bases and covary across the sequences, then they are likely to be paired
Mutual information
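f_ab(i, j): probability (observed frequency) that bases a and b occur at positions i and j
f_a(i): probability (observed frequency) that base a occurs at position i
\[ M(i, j) = \sum_{a, b \in \{A, C, G, U\}} f_{ab}(i, j) \log_2 \frac{f_{ab}(i, j)}{f_a(i)\, f_b(j)} = H(i) - H(i \mid j) \]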
aagacuucggaucuggcgacaccc
uacacuucggaugacaccaaagug
aggucuucggcacgggcaccauuc
ccaacuucggauuuugcuaccaua
aagccuucggagcgggcguaacuc

f_gc(3,13) = 3/5   f_cg(3,13) = 1/5   f_au(3,13) = 1/5
f_g(3) = 3/5       f_c(3) = 1/5       f_a(3) = 1/5
f_c(13) = 3/5      f_g(13) = 1/5      f_u(13) = 1/5

\[ M(3, 13) = 0.6 \log_2 \frac{0.6}{0.6 \times 0.6} + 0.2 \log_2 \frac{0.2}{0.2 \times 0.2} + 0.2 \log_2 \frac{0.2}{0.2 \times 0.2} = 1.37 \]
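The same calculation can be reproduced directly from the alignment. In this sketch the function name `mutual_information` and the 1-based column arguments are assumptions chosen to match the slide's notation; frequencies are plain counts divided by the number of sequences, with no pseudocounts.

```python
from collections import Counter
from math import log2

ALIGNMENT = [
    "aagacuucggaucuggcgacaccc",  # Human
    "uacacuucggaugacaccaaagug",  # Mouse
    "aggucuucggcacgggcaccauuc",  # Worm
    "ccaacuucggauuuugcuaccaua",  # Fly
    "aagccuucggagcgggcguaacuc",  # Orc
]

def mutual_information(seqs, i, j):
    """M(i, j) in bits for 1-based alignment columns i and j."""
    n = len(seqs)
    col_i = [s[i - 1] for s in seqs]
    col_j = [s[j - 1] for s in seqs]
    f_i, f_j = Counter(col_i), Counter(col_j)
    f_ij = Counter(zip(col_i, col_j))
    return sum(
        (c / n) * log2((c / n) / ((f_i[a] / n) * (f_j[b] / n)))
        for (a, b), c in f_ij.items()
    )

print(round(mutual_information(ALIGNMENT, 3, 13), 2))  # covarying columns
print(mutual_information(ALIGNMENT, 6, 7))             # fully conserved columns
```

Columns 6 and 7 are identical in all five sequences, so their score is zero even though nothing rules out a pair there; only columns that vary together score highly.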
Mutual information
• Also called covariance score
• M is high if base a in position i is always followed by base b in position j
  – Does not require a to base-pair with b
  – Advantage: can detect non-canonical base-pairs
• However, M = 0 if there is no mutation at all, even if the positions form perfect base-pairs
\[ M(i, j) = \sum_{a, b \in \{A, C, G, U\}} f_{ab}(i, j) \log_2 \frac{f_{ab}(i, j)}{f_a(i)\, f_b(j)} \]
One way around this limitation (M = 0 for perfectly conserved pairs) is to combine covariance and energy scores
Comparative structure prediction
• Given a multiple alignment, can infer structure that maximizes the sum of mutual information, by DP
• However, alignment is hard, since structure is often more important than sequence
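One way to read "maximizes the sum of mutual information, by DP" is to reuse the Nussinov recurrence with the pair score replaced by M(i, j) computed from alignment columns. The sketch below makes that assumption explicit; the function names, the 0-based columns, and the default `min_loop=3` are illustrative choices, and only the score matrix is computed (a traceback would follow the same pattern as for Nussinov).

```python
from collections import Counter
from math import log2

def mutual_information(seqs, i, j):
    """M(i, j) in bits for 0-based alignment columns i and j."""
    n = len(seqs)
    joint = Counter((s[i], s[j]) for s in seqs)
    f_i = Counter(s[i] for s in seqs)
    f_j = Counter(s[j] for s in seqs)
    # (c/n) / ((f_i/n) * (f_j/n)) simplifies to c*n / (f_i * f_j).
    return sum(c / n * log2(c * n / (f_i[a] * f_j[b]))
               for (a, b), c in joint.items())

def max_mi_matrix(seqs, min_loop=3):
    """Nussinov-style DP over columns, scoring a pair (i, j) by M(i, j)."""
    n = len(seqs[0])
    F = [[0.0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = F[i + 1][j - 1] + mutual_information(seqs, i, j)
            for k in range(i, j):
                best = max(best, F[i][k] + F[k + 1][j])
            F[i][j] = best
    return F

alignment = [
    "aagacuucggaucuggcgacaccc",
    "uacacuucggaugacaccaaagug",
    "aggucuucggcacgggcaccauuc",
    "ccaacuucggauuuugcuaccaua",
    "aagccuucggagcgggcguaacuc",
]
F = max_mi_matrix(alignment)
print(round(F[0][len(alignment[0]) - 1], 2))
```

With only five sequences many column pairs covary by chance, so the total score is noisy; this is one reason the alignment and the structure are usually refined together.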
Comparative structure prediction
In practice:
1. Get multiple alignment
2. Find covarying bases – deduce structure
3. Improve multiple alignment (by hand)
4. Go to 2
A manual EM process!!
Comparative structure prediction
• Align then fold
• Fold then align
• Align and fold
Context-free Grammar for RNA Secondary Structure
• S = SS | aSu | cSg | uSa | gSc | L
• L = aL | cL | gL | uL | ε
[Figure: secondary structure and parse tree for the sequence a c g g a g u g c c c g u]
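A short derivation shows how the grammar generates a stem and its loop. It assumes that the last alternative of the L rule is the empty string (written ε here), and the example string gcaaagc is made up for illustration:

```latex
\[
S \Rightarrow gSc \Rightarrow gcSgc \Rightarrow gcLgc \Rightarrow gcaLgc
  \Rightarrow gcaaLgc \Rightarrow gcaaaLgc \Rightarrow gcaaagc
\]
```

The two pairing rules (gSc, then cSg) emit the stem from the outside in, so each paired base is produced together with its partner, and the L rules then emit the unpaired loop aaa one base at a time.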
Stochastic Context-free Grammar (SCFG)
• Probabilistic context-free grammar
• Probabilities can be converted into weights
• CFG vs SCFG is similar to regular grammar (RG) vs HMM

Rule                           Weight
S = SS                         0
S = aSu | uSa                  2
S = cSg | gSc                  3
S = uSg | gSu                  1
S = L                          0
L = aL | cL | gL | uL | ε      0

\[ S(i, j) = \max \begin{cases} e(x_i, x_j) + S(i+1, j-1) \\ L(i, j) \\ \max_k \big( S(i, k) + S(k+1, j) \big) \end{cases} \qquad L(i, j) = 0 \]
SCFG Decoding
• Decoding: given a grammar (SCFG/HMM) and a sequence, find the best parse (highest probability or score)
  – Cocke-Younger-Kasami (CYK) algorithm (analogous to Viterbi in HMM)
  – The Nussinov and Zuker algorithms are essentially special cases of CYK
  – CYK and SCFG are also used in other domains (NLP, compilers, etc.)
SCFG Evaluation
• Given a sequence and a SCFG model
  – Estimate P(seq is generated by model), summing over all possible paths (analogous to forward-algorithm in HMM)
• Inside-outside algorithm
  – Analogous to forward-backward
  – Inside: bottom-up parsing (P(xi..xj))
  – Outside: top-down parsing (P(x1..xi-1 xj+1..xN))
• Can calculate base-pairing probability
  – Analogous to posterior decoding
  – Essentially the same idea implemented in the Vienna RNAfold package
SCFG Learning
• Covariance model: similar to profile HMMs
  – Given a set of sequences with common structures, simultaneously learn SCFG parameters and optimally parse sequences into states
  – EM on SCFG
  – Inside-outside algorithm
  – Efficiency is a bottleneck
• Has been successfully applied to predict tRNA genes and structures
  – tRNAScan
Summary: SCFG and HMM algorithms
GOAL                 HMM algorithm        SCFG algorithm
Optimal parse        Viterbi              CYK
Estimation           Forward, Backward    Inside, Outside
Learning             EM: Fw/Bck           EM: Ins/Outs
Memory complexity    O(N K)               O(N^2 K)
Time complexity      O(N K^2)             O(N^3 K^3)

Where K: # of states in the HMM; # of nonterminal symbols in the SCFG
Open research problems
• ncRNA gene prediction
• ncRNA regulatory networks
• Structure prediction
  – Secondary, including pseudoknots
  – Tertiary
• Structural comparison tools
  – Structural alignment
• Structure search tools
  – “RNA-BLAST”
• Structural motif finding
  – “RNA-MEME”