the rna world, fitness landscapes, and all thatthe rna world, fitness landscapes, and all that peter...
TRANSCRIPT
The RNA World, Fitness Landscapes, and all that
Peter F. Stadler
Bioinformatics Group, Dept. of Computer Science & Interdisciplinary Center forBioinformatics, University of Leipzig
RNomics Group, Fraunhofer Institute IZI, LeipzigInstitute for Theoretical Chemistry, Univ. of Vienna (external faculty)
The Santa Fe Institute (external faculty)
FOGA07, Cd. Mexico, Jan 2007
Why RNA?
until relatively recently:Central Dogma of Molecular BiologyDNA → RNA → ProteinDNA = “genetic memory”, RNA = working copy, proteins dothe work
around 1980: discovery of catalytic RNAs (Nobelprize for TomCech and Sidney Altman)nevertheless long considered “exotic” remnants from theancient RNA world=⇒ RNA World Hypothesis for the Origin of Life
around 2000: structure of the ribosome showns that theribosome is an “RNA enzyme”
around 2000: microRNAs are discovered as a large class ofregulatory RNAs that inhibit translation of proteins
2006: the ENCODE project shows that human geneexpression is quite different from textbook knowledge
RNA Bioinformatics
RNA Secondary Structures are an appropriate level of description
explain the thermodynamics of RNA Structures
often highly conserved in evolution
can be computed efficiently
Many Functional RNAs are Structured
(a) Group I intron P4-P6 domain(b) Hammerhead ribozyme(c) HDV ribozyme
(d) Yeast tRNAphe
(e) L1 domain of 23S rRNA
Hermann & Patel, JMB 294, 1999
The RNA Model
GCGGGAAU
AGCUC
AGUUGG U A
G A G CA
CGA
CC
UU
GC C
AAGGUCGGGGU
CG C G A G
U U CGA
GUCUCGU
UUCCCGC
UC
CA
GCGGGAAUAGCUCAGUUGGUAGAGCACGACCUUGCCAAGGUCGGGGUCGCGAGUUCGAGUCUCGUUUCCCGCUCCA
Formal Definition
A secondary structure on a sequence s is a collection of pairs (i , j)with i < j such that
Base pairing rules are respected, i.e., (i , j) ∈ Ω implies (si , sj)form an allowed pair (GC, CG, AU, UA, GU, UG)
Each base is involved in at most one pair, i.e., Ω is amatching, (i , j), (i , k) ∈ Ω implies j = k and (i , k), (j , k) ∈ Ωin implies i = j .
(i , j)Ω implies |j − i | > 3 (sterical constraint)
No-crossing rule: (i , j), (k, l) ∈ Ω and i < k implies eitheri < k < l < j or i < j < k < l .This excludes so-called pseudoknots
Let’s count the structures . . .
Counting secondary structures. Given a sequence of length n.Πkl = 1 if sequence positions k, l can form a pair GC, CG, AU,
UA, GU, UG and Πkl = 0 otherwise.Nkl = number of structures of the subsequence from k to l .Basic recursion:
• xxxxxxx +∑
(xxxx)xxxx
Nkl = Nk+1,l +l
∑
j=k+m
ΠkjNk+1,j−1Nj+1,l
RNA Folding in a nutshell
i jj i i+1 j i i+1 k−1 k k+1|=
Nij = Ni+1,j +∑
k(i,k)pair
Ni+1,k−1Nk+1,j
Eij = min
Ei+1,j + mink
(i,k)pair
(Ei+1,k−1 + Ek+1,j + εik)
Zij = Zi+1,j +∑
k(i,k)pair
Zi+1,k−1Zk+1,j exp(−εik/RT )
Partition function: Z =∑
Ω exp(−E (Ω)/RT )
A word on the Partition Function
The partition function is the link between the combinatorics of thestructures (in general: states in an ensemble) and thethermodynamic properties of the physical ensemble, e.g.:
Free energy G = −RT lnZ
Expected Energy 〈E 〉 = RT 2 ∂ lnZ∂T
Heat Capacity Cp = −T ∂2G∂T 2
0 20 37 100
0
1
2
3
4
5
CUGUAUUGUUGUAUAGCCCGUGUGGUAAUAUGG
T [C]
C(T
) [k
cal•(
mol
•K)-1
]
Realistic Energy Model
GC
U
A
C
G
closing base pair
A
5
3
3
5
A
5
3
closing base pair
interior base pairsclosing base pair
3
C
U
A
A
U G
C
closing base pair
multi-loop
interior base pair
A U
interior base pair
interior base pair
hairpin loop
bulge
C
G
C U
A
closing base pair
stacking pair
interior loop
G
G
G
3
5A
G
C
CA
5
Parameters from large number of melting experiments by Douglas
Turner, David Matthews, John Santa Lucia, and others
Recursions for Linear RNAs
M1 M1
M1
i u u+1|
MC
= |
= | |
FC
i j i+1 j i
hairpin Cinterior
i j i i k l j
k k+1 j
=
C
F F
i j
M
=i ij j−1
j
j| C
i j
M
ui+1 u+1i j−1 j
j i
M
j−1 ji u u+1|C
Folding Kinetics
RNA molecules may have kinetic traps which prevent them from reachingequilibrium within the lifetime of the molecule. Long molecules are oftentrapped in such meta-stable states during transcription.Possible solutions are
Stochastic folding simulations can predict folding pathways and finalstructures. Computationally expensive, few programs available.
Predicting structures for growing fragments of the sequence canshow whether large scale re-folding will occur during transcription.Cheap but inaccurate.
Analysis of the energy landscape based on complete suboptimalfolding can identify possible traps (local minima).
Kinetic Folding Algorithm
Simulate folding kinetics by a Monte-Carlo type algorithm:Generate all neighbors using the move-setAssign rates to each move, e.g.
Pi = min
1, exp
(
−∆E
kT
)
Select a move with probability proportional toits rateAdvance clock 1/
∑
i Pi .
P4
P3
P5P6
P7
P8
P1 P2
Characterization of Landscapes
A landscape consists of a configuration space V , a move set within thatconfiguration space and an energy function f : V → R.Simplest move set for secondary structure: opening and closing of basepairs.Speed of optimization depends on the roughness of the Landscape.Measures of roughness suggested in the literature:
Number of local optima
Correlation lengths (e.g. along a random walk)
Lengths of adaptive walks
Folding temperature vs. glass temperature Tf /Tg
Energy barriers between the local optima. Especially, themaximum barrier height (“depth” in SA literature)
Ruggedness
Bryce Canyon, UT Capulin Volcano, NMrugged “smooth”
Measures of Ruggedness
Number of Local Minima and Maxima
Length of Adaptive Walks
Correlation Functions:Random walk on V as defined through X : x0, x1, x2, . . .(Markov process on V ).=⇒ “Time Series” f (x0), f (x1), f (x2), . . .=⇒ Autocorrelation Function
r(s) =
⟨
(
f (xt+s ) − 〈f 〉) (
f (xt) − 〈f 〉)
⟩
t⟨
(
f (xt) − 〈f 〉)2
⟩
t
For simplicity:
f (x) → f (x) − f f =1
|V |
∑
x∈V
f (x)
(Subtract the landscape average)Random Walk = Markov process with transition matrix T
Theorem. r(s) = 〈f ,Ts f 〉/
〈f , f 〉.Theorem. r(s) is an exponential function iff f is an eigenfunctionof T. We say (V , f ,T) is an elementary landscape.For d-regular graphs: Relation with graph Laplacians
−∆ = A − D = d(T − I )
Elementary landscapes are eigenfunctions of the Graph Laplacian
Amplitude Spectra
Let ϕk be an orthonormal basis of eigenvectors of T. We will beinterested in the “Fourier Transform” ak where
f (x) =∑
k
akϕk(x)
Theorem. var[f ] := 1|V |
∑
x f (x)2 =∑
k 6=0 |ak |2.
Collect terms that belong to the same eigenspace:
Bp =1
var[f ]
∑
k:−∆ϕk=Λpϕk
|ak |2
“Amplitude Spectrum of f ”.
Quantities derived from Amplitude Spectra
Autocorrelation Function:
r(s) =∑
p 6=0
Bp(1 − Λp/D)s
Correlation length
ℓ =
∞∑
s=0
r(s) = D∑
p 6=0
Bp/Λp
Elementary landscapes:f is elementary if and only if Bp = 1 for a single p, and 0otherwise
Some Elementary Landscapes
Problem Move Set D Λ ℓ/n
NAES Hamming n 4 1/4p-spin Hamming n 2p 1/(2p)WP Hamming n 4 1/4GC Hamming (α − 1)n 2α (1 − 1/α)/2XY-Hamiltonian Hamming (α − 1)n 2α (1 − 1/α)/2
cyclic 2n 8 sin2(π/α) 1/[4 sin2(π/α)]
GBP Exchange n2/4 2(n − 1) 1/8 · n/(n − 1)symmetric TSP Transposition n(n − 1)/2 2(n − 1) 1/4
Inversions n(n − 1)/2 n 1 − 1/n)/4GMP Transposition n(n − 1)/2 2(n − 1) n/4
NAES = Non-All-Equal-Satisfiability, WP = Weight Partition, GC = Graph Coloring with α colors, GBP = Graph
Bipartitioning, TSP = Traveling Salesman Problem, GMP = Graph Matching Problem XY-Hamiltonian:
P
i<j Jij cos( 2π
α(xi − xj ) )
Amplitude Spectra of RNA Landscapes
5 10 15 20 25Interaction Order p
0
0.1
0.2
0.3
0.4
0.5
Am
plitu
de S
pect
rum
Bp
Ensemble Free Energies
12345678910111213141516171819202122232425
5 10 15 20 250
0.1
0.2
0.3
0.4
0.5A
mpl
itude
Spe
ctru
m B
p
Ground State Energies
5 10 15 20 25Interaction Order p
Distance to Maximal Hairpin5 10 15 20 25
Distance to Open Structure
Amplitude Spectra of RNA Landscapes
6 8 10 12
Interaction Order p
0
0.1
0.2
0.3
0.4
0.5
Am
plitu
de S
pect
rum
B(p
)
Ensemble Free Energies6 8 10 12
0
0.1
0.2
0.3
0.4
0.5A
mpl
itude
Spe
ctru
m
B(p
)
Groundstate Energies
6 8 10 12
Interaction Order p
123456789101112Distance to Maximal Hairpin
6 8 10 12
Distance to Open Structure
Neutrality
Degree of neutrality E[νx ]is the expected number of neutral neighbors of x .Suppose µ0 > 0. Then
E[νx ] =∑
y :y∈∂x
µcx (y)
wherecx(y) =
∣
∣j |θj (x) 6= θj(y)∣
∣ y ∈ ∂x
Similar equations can be obtained for E[νxνy ].=⇒ Correlation length ℓ and degree of neutrality νx can be tunedindependently of each other in short range p-spin models.
Energy barriers
E [s,w ] = min
max[
f (z)∣
∣z ∈ p]
∣
∣
∣
∣
p : path from s to w
,
B(s) = min
E [s,w ] − f (s)∣
∣w : f (w) < f (s)
Depth and Difficulty(borrowed from simulated annealing theory)
D = max
B(s)∣
∣s is not a global minimum
ψ = max
B(s)
f (s) − f (min)
∣
∣
∣
∣
s is not a global minimum
Energy Barriers and Barrier Trees
Some topological definitions:A structure is a
local minimum if its energy is lowerthan the energy of all neighbors
local maximum if its energy ishigher than the energy of all
neighbors
saddle point if there are at leasttwo local minima that can bereached by a downhill walk startingat this point
Calculating barrier trees
M1
M2
M3
S12
S23
M3
M3
M1
M1
M2
M2
S12
S12
S23
S23
M3
M1
M2
M3
M1
M2
S23
M3
M1
M2
S12
S23
The flooding algorithm:Read conformations in energy sortedorder.For each confirmation x we havethree cases:
(a) x is a local minimum if it hasno neighbors we’ve alreadyseen
(b) x belongs to basin B(s), if allknown neighbors belong toB(s)
(c) if x has neighbors in severalbasins B(s1) . . .B(sk) then it’sa saddle point that merges
these basins.Basins B(s1), . . . ,B(sk) arethen united and are assigned tothe deepest of local minimum.
Information from the Barrier Trees
Local minima Saddle points Barrier heights Gradient basins Partition functions and free energies of (gradient) basins Depth and Difficulty of the landscape
N.B.: A gradient basin is the set of all initial points from which a
gradient walk (steepest descent) ends in the same local minimum.
Energy Landscape of a Toy Sequence
G C U A U U A
GC
GC
G
UG
A
CG
UG
CG
U
UU
A
G C U A U UCG
UG
CG
U
UU
A
C
C
G
G
G
UG
A
A G C U A U UCG
UG
CG
U
UU
A
C C G G A
G
UG
A
G C U A U UCG
UG
CG
U
UU
A
C
C
A
G
UGGGA
G C U A G UCG
UG
CG
U
UU
A
G G G
AC
CU
U
A
G C G U G GCG
UG
CG
U
UU
A
G A
AU
C
CU
U
A
G U G G G ACG
UG
CG
U
UU
A
GC
AU
C
CU
U
A
G U
UG
CG
U
UU
A GC
AU
C
CU
U
A
C G G GG
G U
CG
U
UU
A
GC
AU
C
CU
U
A
C G
G GUGG
G U
GC
AU
C
CU
U
A
C G
GU
C GUUUAGGG
8 9 10
1 2 3
4567
A A
G U
GC
AU
C
CU
U
A
C G
GU
C G
UUUAGGG
11
U AA
Steps [arbitrary]
−8
−6
−4
−2
0
2
Ene
rgy
[kca
l/mol
]
1
2
3
4
5
6
7
8
9
10
11
2.45[5]
[9]3.80 1 [11]
1.60 3 [7]
13
141.30 9
1.40 192.10 11
3.80 2 [1][3] 12
1.20 8 [4]
173.60 4
2.20 73.61 5
1.40 161.90 10
1.50 152.00 205.02 6
3.90 18E
Folding Kinetics
Transition rates from x to y :
ryx = r0e−
E6=yx−E(x)
RT for x 6= y
rxx = −∑
y 6=x
ryx
Kinetics as a Markov process:
dpx
dt=
∑
y∈X
rxypy (t) .
Transition states:E 6=
yx = maxE (x),E (y)
or more complex models (Tacker et al 1994, Schmitz et al 1996)
Reduced Description of the Folding Dynamics
Macrostates = Classes of a partition of the state space.Partition function for a macro state:
Zα =∑
x∈α
exp(−E (x)/RT )
Free energy of a macro state:
G(α) = −RT ln Zα
rβα =∑
y∈β
∑
x∈α
ryxProb[x |α] for α 6= β
=1
Zα
∑
y∈β
∑
x∈α
ryxe−E(x)/RT
rβα “on flight” while executing the barriers program.Transition state free energy:
G6=βα = −RT ln
∑
y∈β
∑
x∈α
e−E6=xy
RT
12
3
4
5
6
7
8
910
11
12
13
14
0.0
2.0
4.0
6.0
1.2
1.5
2.1
2.8
1.1
0.9
2.4
2.8
0.5 1.3
4.7
0.9
1.2
0.8
2.0
lillyA simple model sequence
10-1
100
101
102
103
time
0
0.2
0.4
0.6
0.8
1
popu
latio
n pr
obab
ility
mfe23456
10-1
100
101
102
103
time
0
0.2
0.4
0.6
0.8
1
popu
latio
n pr
obab
ility
mfe23456
10-1
100
101
102
103
time
0
0.2
0.4
0.6
0.8
1
popu
latio
n pr
obab
ility
mfe23456
103
104
105
106
107
108
109
time
0
0.2
0.4
0.6
0.8
1
popu
latio
n pr
obab
ility
101
102
103
104
105
106
107
108
time
0
0.2
0.4
0.6
0.8
1
popu
latio
n pr
obab
ility
mfe255680
Refolding of a tRNA molecule.
Summary I:
RNA structures can be computed efficiently by means ofdynamic programming
Computations are based on a set of carefully measures energyparameters and an additive energy model
Algorithms exist for ground state energy and structure, fullpartition functions, density of states, interacting structures,. . .
The folding kinetics of a given RNA Sequence can also beinvestigated as the level of secondary structures
VIENNA RNA PACKAGE
PART II: How Do RNAs Evolve
Basic Assumption
Selection Acts on Secondary Structures, Mutations acts on theunderlying sequences⇒ We need to understand the sequence-to-structure map of RNAs(hang on, we’ll discuss the empirical evidence for that a bit later)
Sewall Wright’s Fitness Landscapes
Fitn
ess
Phenotype
How do realistic fitness landscapes look like?
Biological Landscapes
The RNA case is a special case of a very general paradigm:
genotype 7→ phenotype 7→ fitness
What is the relationship between Genotyp and Phenotype?
Central topic in any theory of evolutionbecause:* Selection acts on the Phenotype* Mutation/Recombination acts on the GenotypeBiopolymers as the simplest model:The molecule is both genotype (sequence) and phenotype (structure).
The map from genotype to genotype is determined by physical chemistry:
⇐⇒ folding problem
Computational Analysis of the RNA Map
There are many more sequences than structures.(.)-string: 3-letters (with constraints)
=⇒ less than 3n structures
but 4n sequences.
=⇒ Redundancy
How are sequences folding into the same structure distributed insequence space?Neutral Set S(ψ) = x ∈ Qn
α|f (x) = ψ
Sensitivity and Neutrality
GCGGGAAU
AGCUC
AGUUGG U A
G A G CA
CGA
CC
UU
GC C
AAGGUCGGGGU
CG C G A G
U U CGA
GUCUCGU
UUCCCGC
UC
CA
GCGGGUAUA
GCUCAGU
UGG U A
G A G CA C G
A C CUU G CC A A
G GU
C G G G GU CG C G A G
U U CGA
GUCUCGU
UUCCCGCUCC
A
Effect of a single
point mutation0 100 200 300
Structure Distance
10-4
10-3
10-2
10-1
100
Fre
quen
cy
Distribution of structure distances
The Random Graph Model
Approach:Model S(ψ) as a random induced subgraph Γ with a given value
λ =〈#neutral neighbors〉
(α− 1)n
Threshold value:
λ∗ = 1 −
(
1
α
)1
α−1
Theorem. [Reidys, Stadler, Schuster]If λ > λ∗ then Γ is a.s. dense and connected,if λ < λ∗ then Γ is a.s. neither dense nor connected
A complication: Base Pairing Rules
Unpaired bases:Alphabet A = A,U,G,C
Paired bases: 5’ and 3’ side correlated:Alphabet: B = AU,UA,GC,CG,GU,UG, .
Thus consider only the set of compatible sequences C (ψ):S(ψ) ⊆ C (ψ) ≡ Qnu
4 ×Qnp
6 .=⇒ Two neutrality parameters λu and λp
Connected Components of Neutral Networks
0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0λu
0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0
λp
gray many small components red 1 connected componentgreen 2 equal sized components yellow 3 components size 2:1blue 4 equal sized components
Explanation: for this deviation from the random graph model in terms of the energy model. Some structures can
be made only with a significant bias in the G/C ratio.
x??P (d)
Length of Neutral Path: d !0.0 10.0 20.0 30.0 40.0 50.0 60.0 70.0 80.0 90.0 100.0
0.250.200.150.100.050.00
0 20 40 60 80 100 120
Chain lenght n
0
5
10
15
20
25
30
Cov
erin
g ra
dius
from enumeration from inverse folding lower bound
Distance to Target structure Covering radiuslength neutral paths
Closest Approach
Intersection Theorem. For any two secondary structures φ, andψ holds
C (φ) ∩ C (ψ) 6= ∅
What is the distance of neutral networks
δ(φ,ψ) = mind(x , y)|f (x) = φ and f (y) = ψ
Random graph Theory: If λ > λ∗ then δ(φ,ψ) ≈ 2.Computer simulations: upper bounds on δ(φ,ψ):
n GC AU AUGC
50 5.6 2.6 2.170 9.3 4.6 3.4100 13.0 7.8 5.6
AccessibilityFontana & Schuster 1998
Idea: The “interface” between two structures is large is they are“similar”.More precisely: Structure ψ is accessible for φ if x ∈ S(φ) is like tohave neighbor (mutant) x ′ ∈ S(ψ).Structural characterization of “easy” (continuous) transitions:
Shorteningof stacks
Elongationof stacks
Opening ofconstrained stacks
Closing ofconstrained stacks
Summary II: Sequence-Structure Map of RNA
1. Redundancy: Many more sequences than structures
2. Sensitivity: Small changes in the sequences may lead to largechanges in the structure
3. Neutrality: A substantial fraction of mutations does not alter thestructure.
4. Isotropy: S(ψ) is “randomly” embedded in C (ψ).
Implications:
1. Neutral Networks: S(ψ) forms a connected “percolating” network insequence space for all “common” structures.
2. Shape Space Covering: Almost all structures can be found in arelatively small neighborhood of almost every sequence.
3. Mutual Accessibility: The neutral networks of any two structuresalmost touch each other somewhere in sequence space.
Simulated Trajectories
0.5 1.0 1.5time (arbitrary units)
5
15
25
35
45
stru
ctur
e di
stan
ce
0 510
40
proj
ectio
n co
ordi
nate
Punctuated equilibra = diffusion of neutral networks +constant rate of innovation +exponential selection of rare mutants
Proc.Natl.Acad.Sci. 93: 397-401 (1996)
Diffusion Constant
. . . can be deduced from Moran model:
D = λ6Anp
3 + 4Np(1 + 1/N) ∼
(3/2)A(n/N) p ≫ 0 orN ≫ 12Anp p ≪ 1
A . . . replication raten . . . sequence lengthN . . . population sizep . . . mutation rateλ . . . neutrality of network
Dynamics of Interacting Replicators
Ik + Ij −→ Il + Ik + Ij
With mutation:
xk = xk
∑
j
Akjxj −∑
i ,j
Aijxixj
+∑
l ,j
(QklAljxjxl − QlkAkjxkxj)
where
Qkl = (1 − p)n−d(k,l)
(
p
α− 1
)d(k,l)
How does this behave in sequence space?
Simplest case: Simplest case: Akl = A0(1 − d(Ik , Il )/n):
0 1000 2000 3000 4000 5000time
0.0
0.2
0.4
0.6
0.8di
vers
ity
0 2×105
4×105
6×105
8×105
time lag τ
0
10
20
30
40
50
60
g(τ)
g(τ) =1
T2 − T1 + 1
T2∑
t=T1
‖p(t + τ) − p(t)‖2
B.M.R. Stadler, Adv. Complex Syst. (2003)
10-6
10-4
10-2
100
p
10-6
10-4
10-2
100
D
100
101
N/n
10-3
10-2
10-1
100
D
Left: Diffusion coefficient D as a function of the mutation rate for N = 10, 20, 30, 40, 80 and
n = 10, 20, 30, 40, 80 such that N/n = 1 after equilibration for 105 timesteps. Right: Dependence of the ratio
D/p on N/n.
An RNA-Based Model in the Plane
Target hypercycle with 8 mem-bers.
Model:Hypercyclically coupledspecies, each sequence hasa function that depends onits structure.
Spatial Extension: CA Model
Possible Catalysts Actual CatalystsPossible Replicators
Rules of replication. For each of the neighbors (•) of the empty cell (marked by a bold outline) the replication rate
ρz is computed taking into account their neighbors in the direction of the replication () as potential catalysts.
The neighbor with the largest values of ρz invades the empty position. In this example, for the chosen replicator,
only three of its neighbours are catalysts according to the hypercycle topology.
Spirals formed after 3000 generations in an evolution experimentstarted with 300 random sequences in the absence of parasites.see also Borlijst & Hogeweg (1993)
Diffusion in Sequence Space
2000 4000 6000 8000 10000time lag (tau)
2
4
6
8
10
12
g(ta
u)
0.0005 0.0010 0.0015 0.0020Mutation rate
0.002
0.003
0.004
Diff
usio
n co
nsta
nt
Summary III: Dynamics of RNA Evolution
Neutrality of the Sequence-Structure Map impliesdiffusion/drift-like motion in sequence independent of detailsof the selection/mutation mechanisms and whether spatialextension is taken into account or not.
=⇒ The basic assumption of molecular phylogenetics, namelya dominating influence of drift in sequence evolution, holdstrue even when phenotypic evolution is dominated byinteractions(co-evolution).
TODO Development of a rigorous mathematical theorydescribing the motion in sequence space of a population withstrong interactions.
Acknowledgments: It’s not my fault . . .
Peter Schuster, Walter Fontana, Wolfgang SchnablChristoph Flamm, Ivo L. HofackerChristian M. ReidysCamille Stephan-Otto Attolini, Barbel Stadler