the birth of smooth biological codes in a rough evolutionary world shalev itzkovitz, guy shinar, uri...

The Birth of Smooth Biological Codes in a Rough Evolutionary World

Shalev Itzkovitz, Guy Shinar, Uri Alon

T T

o Biological codes are information channels or maps with a natural ‘fitness’ measure.

o Codes are evolved and selected according to their fitness or ‘smoothness’.

o The emergence of a code is a phase transition in an information channel.

o Topology of errors (noise) governs the emergent code.

Biological codes are (often) maps

• A biological code is a mapping between two sets of molecules:

– Transcription net: Proteins → DNA binding sites

– Protein-protein recognition: immune system…

– Protein synthesis: DNA → Proteins

[Figure: DNA → Proteins, labeled “The genetic code”]

Information flows from DNA to RNA to proteins through the

genetic code

• The 20 letters are the amino acids.

• Proteins are amino acid polymers.

DNA      ACGGAGGTACCC      4 letters

RNA      ACGGAGGUACCC      4 letters

Protein  Thr Glu Val Pro   20 letters

Each of the 20 amino acids has specific chemistry

• Amino acid = backbone + specific side group.

• Some amino acids are hydrophilic, hydrophobic, basic, acidic…

• The diversity of amino acids allows proteins to perform a wide variety of functions efficiently.

Each of the 20 amino acids is encoded by a triplet of RNA letters

• Genetic Code = mapping triplets to amino acids.

• 64 = 4^3 triplet codons encode only 20 amino acids (degeneracy).

• Only 48 discernible codons due to U-C “wobble” at 3rd base.

ACG → Thr, GAG → Glu, GUA → Val, CCC → Pro
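As a rough illustration of the triplet-to-amino-acid mapping, here is a minimal Python sketch. It uses only the four codon assignments shown on the slide; the function names `transcribe` and `translate` are illustrative choices, not part of the original talk.

```python
# Partial codon table: only the four assignments that appear on the slide.
CODON_TO_AMINO_ACID = {"ACG": "Thr", "GAG": "Glu", "GUA": "Val", "CCC": "Pro"}


def transcribe(dna: str) -> str:
    """DNA -> RNA: same 4-letter alphabet, with T replaced by U."""
    return dna.replace("T", "U")


def translate(rna: str) -> list[str]:
    """Read the RNA in non-overlapping triplets and look up each codon."""
    usable = len(rna) - len(rna) % 3  # ignore a trailing partial codon
    codons = [rna[i:i + 3] for i in range(0, usable, 3)]
    return [CODON_TO_AMINO_ACID[codon] for codon in codons]


rna = transcribe("ACGGAGGTACCC")
protein = translate(rna)
print(rna, protein)
```

A full table would hold all 64 codons; the partial dictionary raises `KeyError` for any codon not listed here.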

The genetic code is smooth, degenerate and compact

• Redundancy – only 20 amino acids for 48 codons.

• Degeneracy – mostly in the 3rd base.

• Close codons are separated by a single letter (Hamming distance = 1).

• Smoothness – close codons encode chemically similar amino acids (hydrophobic xUx, hydrophilic xAx).

• Compactness – a single contiguous domain for each amino acid.

• The code is highly nonrandom (“one in a million” [Haig & Hurst]).

Figure legend: Shades: lighter (darker) – low (high) polarity. Letters: black (white) – hydrophobic (hydrophilic), yellow – medium. [Knight, Freeland, Landweber]

Biological codes evolve(d) to cope with inherent noise

• Messages are written in molecular words that are read and interpreted by other molecules, which calculate the response, etc.

• Typical energy scale ~ a few k_B T.

• Thermal noise → errors.

• Information channels adapt to errors through evolution by selection and mutation.

• Some errors (mutations) are essential to evolution …

The code is an information channel with an average

distortion

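[Diagram: α → encoding (U) → i → misreading (W) → j → decoding (V) → β, with distortion D_αβ between α and β]

$H_{UV} = \sum_{\text{paths}} P_{\alpha i j \beta}\, D_{\alpha\beta} = \sum_{\alpha,i,j,\beta} P_\alpha\, U_{\alpha i}\, W_{ij}\, V_{j\beta}\, D_{\alpha\beta}$

• U and V are binary matrices that determine the code

• W is the misreading (noise) stochastic matrix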
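The error-load sum can be evaluated directly. The sketch below is a toy illustration only: the two amino acids, the four codons arranged on a ring, the misreading probability `eps` and the two candidate codes are all assumptions made for the example, not data from the talk.

```python
import numpy as np


def error_load(P, U, W, V, D):
    """H = sum over alpha, i, j, beta of P[a] U[a,i] W[i,j] V[j,b] D[a,b]."""
    return float(np.einsum("a,ai,ij,jb,ab->", P, U, W, V, D))


# Hypothetical toy model: 2 amino acids, 4 codons on a ring.
P = np.array([0.5, 0.5])                    # amino-acid usage (assumed)
D = np.array([[0.0, 1.0], [1.0, 0.0]])      # chemical distance (assumed)
eps = 0.1                                   # total misreading probability (assumed)
shift = np.roll(np.eye(4), 1, axis=1)       # codon i is misread as i+1 (mod 4)
W = (1 - eps) * np.eye(4) + (eps / 2) * (shift + shift.T)

# Decoding matrices V[j, beta]; encoding U is taken as the transpose (binary).
V_smooth = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)  # neighbors share an amino acid
V_rough = np.array([[1, 0], [0, 1], [1, 0], [0, 1]], dtype=float)   # neighbors always differ

H_smooth = error_load(P, V_smooth.T, W, V_smooth, D)
H_rough = error_load(P, V_rough.T, W, V_rough, D)
print(H_smooth, H_rough)
```

In this toy setting the code that places synonymous codons next to each other has the lower error-load, which is the sense in which a smooth code is fitter.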

Fitter code is one with less distortion

• The ‘error-load’ H measures the difference between the desired and the reproduced amino acids.

• H is a natural measure for the fitness of the code.

• For better codes the encoding U and the decoding

V are optimized with respect to the reading W.

• The decoded amino-acids must be diverse enough

to map diverse chemical properties.

• However, to minimize the impact of errors it is

preferable to decode fewer amino-acids.

Theories on the origin of the code: Frozen accident or optimization?

Frozen accident hypothesis: Any change in the code affects all the proteins in the cell and will therefore be too harmful. Life began with very few amino acids. New amino acids were added until eventually the code became frozen in its present form. [Crick 1968]

Load minimization hypothesis: Darwinian dynamics optimize the code to minimize errors in information flow (due to mutations, misreading). [Sonneborn, Zuckerkandl & Pauling… 1965]

Variant codes - evidence for ongoing optimization of the code

• Variants of the “universal” genetic code in many organisms [Osawa, Jukes 1992].

• All variants use the same twenty amino acids (universal invariant?)

• Continuity - Most changes are to a neighboring amino acid (‘hydrodynamic’ flow?)

                 2nd base
1st base   U         C         A         G
G          GUG Val   GCG Ala   GAG Glu   GGG Gly
           GUA Val   GCA Ala   GAA Glu   GGA Gly
           GUC Val   GCC Ala   GAC Asp   GGC Gly
           GUU Val   GCU Ala   GAU Asp   GGU Gly
A          AUG Met   ACG Thr   AAG Lys   AGG Arg
           AUA Ile   ACA Thr   AAA Lys   AGA Arg
           AUC Ile   ACC Thr   AAC Asn   AGC Ser
           AUU Ile   ACU Thr   AAU Asn   AGU Ser
C          CUG Leu   CCG Pro   CAG Gln   CGG Arg
           CUA Leu   CCA Pro   CAA Gln   CGA Arg
           CUC Leu   CCC Pro   CAC His   CGC Arg
           CUU Leu   CCU Pro   CAU His   CGU Arg
U          UUG Leu   UCG Ser   UAG TER   UGG Trp
           UUA Leu   UCA Ser   UAA TER   UGA TER
           UUC Phe   UCC Ser   UAC Tyr   UGC Cys
           UUU Phe   UCU Ser   UAU Tyr   UGU Cys






Codes compete by their error-load

• One letter change in DNA can change one amino

acid in one protein. If the new amino acid is

similar to the original the upset is minimal.

• The organism with the smallest error-load takes

over the population.

• A relatively small population and high noise levels in protein synthesis: weak selection forces ≪ random drift.

Code’s evolution reaches steady-state

• Small effective population and strong drift.

• Population is in detailed balance and therefore P(fitness) ~ exp(fitness/T) [Lassig, Sella & Hirsh]

• Smaller population is hotter: T ~ 1/N_eff.

• The Boltzmann probability $P_{UV} \sim \exp(-H_{UV}/T)$ minimizes a ‘free energy’

$F = \langle H \rangle - TS = \sum_{UV} H_{UV} P_{UV} + T \sum_{UV} P_{UV} \log P_{UV}$

• F is used to optimize information channels …

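$\delta F \sim \sum_{\alpha,\beta,i,j} \left( T\, \delta_{\alpha\beta}\, \delta_{ij} - (w^2)_{ij}\, d_{\alpha\beta} \right) u_{\alpha i}\, u_{\beta j}$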
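The Boltzmann weighting of codes and the free energy it minimizes can be checked numerically. In this sketch the four error-loads in `H` are made-up values for four hypothetical codes, and the function names are illustrative.

```python
import numpy as np


def boltzmann(H, T):
    """P ~ exp(-H / T), normalized over the listed codes."""
    weights = np.exp(-(H - H.min()) / T)  # subtracting min(H) only improves stability
    return weights / weights.sum()


def free_energy(H, P, T):
    """F = <H> - T S = sum(H * P) + T * sum(P * log P)."""
    return float(np.sum(H * P) + T * np.sum(P * np.log(P)))


H = np.array([0.1, 0.2, 0.2, 0.4])  # hypothetical error-loads of four codes
uniform = np.full(H.size, 1.0 / H.size)

for T in (0.05, 0.1, 10.0):
    P = boltzmann(H, T)
    # The Boltzmann distribution should never have a higher F than the uniform one.
    print(T, P.round(3), free_energy(H, P, T), free_energy(H, uniform, T))
```

At high T (small population) the distribution approaches the uniform one, so no code is preferred; at low T the weight concentrates on the code with the smallest error-load.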

At high T no code is chosen

• At high T (small populations) Boltzmann implies that all codes are equally probable: $\langle U_{\alpha i} \rangle = 1/N_C$

• The natural order parameter is $u_{\alpha i} = \langle U_{\alpha i} \rangle - 1/N_C$

• At high T the state is random, ‘non-coding’: $u_{\alpha i} = 0$

• Stability of F is determined by $\delta F \sim u^{t}\,(T\, I_\delta \times I_w - w^2 \times d)\,u$, where

– w is the preference of the reading, $w = W - 1/N_C$

– d is the normalized chemical distance matrix










Code emerges at a phase transition

• When T is decreased below T_c an inhomogeneous coding state appears:

$\delta F \sim u^{t}\,(T\, I_\delta \times I_w - w^2 \times d)\,u$

• Critical temperature $T_c = \lambda_{w^2} \times \lambda_d$, the product of the maximal eigenvalues of w² and d.

• The code is the mode $u_{\alpha i}$ of F that corresponds to these maximal eigenvalues.

• T_c increases with the accuracy of reading w.

• The phase transition is continuous (2nd order).

• Analogous phase transition in information channels
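A small numerical sketch of the stability criterion: the quadratic form T·I − w² × d loses positivity exactly when T drops below the product of the maximal eigenvalues. The ring of four codons, the misreading probability `eps` and the 3 × 3 distance matrix `d` are assumptions chosen for illustration, and w is assumed symmetric.

```python
import numpy as np


def critical_temperature(w, d):
    """T_c = (largest eigenvalue of w^2) * (largest eigenvalue of d)."""
    return np.linalg.eigvalsh(w @ w).max() * np.linalg.eigvalsh(d).max()


def min_curvature(T, w, d):
    """Smallest eigenvalue of T*I - kron(w^2, d); a negative value means
    the non-coding state u = 0 is unstable."""
    M = np.kron(w @ w, d)
    return np.linalg.eigvalsh(T * np.eye(M.shape[0]) - M).min()


# Hypothetical toy model: 4 codons on a ring, 3 amino acids.
n_codons, eps = 4, 0.2
shift = np.roll(np.eye(n_codons), 1, axis=1)
W = (1 - eps) * np.eye(n_codons) + (eps / 2) * (shift + shift.T)
w = W - 1.0 / n_codons                      # preference of the reading
d = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 1.0],
              [2.0, 1.0, 0.0]]) / 2.0       # assumed normalized distances

Tc = critical_temperature(w, d)
print(Tc, min_curvature(1.1 * Tc, w, d), min_curvature(0.9 * Tc, w, d))
```

Lowering `eps` (more accurate reading) raises the top eigenvalue of w² and therefore T_c, in line with the bullet above.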

Why twenty amino-acids?

• The code is the mode $u_{\alpha i}$ that minimizes the free energy.

• This mode corresponds to the maximal eigenvalue of w.

• Knowledge of w at the phase transition yields the code.

• What can we say without such knowledge? (Why 20?)

• More amino acids: more sensitivity to errors.

• Fewer amino acids: reduced functionality of proteins.

• Historical mechanisms: freezing, biosynthetic, etc.

• Twenty as a topological feature of a generic evolutionary phase transition?










The probable errors define the graph and the topology of the genetic code

• Graph = codon vertices + one-letter-difference edges (Hamming distance = 1)

[Figure: codon AAA and its eight one-letter neighbors (ACA, AAU, CAA, AAG, AGA, AUA, GAA, UAA), surrounded by more distant codons (UUA, CCA, CAG, GAU, GUA, UGA, AGG, ACU); the code graph is labeled K4 × K4 × K3, with bases U, A, G, C along the axes]

Topology and genus of a simpler code

UU AU CU

UA AA CA

UC AC CC

V = vertices, E = edges, F = faces

Euler’s characteristic χ = V – E + F

Euler Genus (# holes) γ = 1 - (1/2) χ

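Doublet code with 3 bases is embedded on a torus

Each codon has d = 4 neighbors

Faces are quadrilateral mutation cycles: F = V (d/4) = 9; E = V (d/2) = 18, so χ = 9 – 18 + 9 = 0 and γ = 1

[Figure: the product (×) of two triangles, each with vertices A, C, U]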

The genetic code graph is holey

• The 48-codon graph:

– Each codon has degree d = 3 + 3 + 2 = 8, therefore

• E = 48 (d/2) = 192 edges

• F = 48 (d/4) = 96 faces

– The Euler characteristic is χ = V – E + F = –48, and Euler’s genus is γ = 1 – (1/2) χ = 25 (24 holes + Klein)

– Embedding by group automorphism analysis

• Can one hear the shape of the code?

[Figure: the code graph K4 × K4 × K3]
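The counting argument for the doublet code and for the 48-codon graph can be reproduced in a few lines. This sketch assumes, as the slides do, that every face is a quadrilateral mutation cycle; the function name `product_graph_topology` is illustrative.

```python
def product_graph_topology(letters):
    """Counts for the product of complete graphs K_n1 x K_n2 x ...,
    where letters[k] is the number of discernible letters at position k.
    Assumes an embedding in which all faces are quadrilaterals."""
    V = 1
    for n in letters:
        V *= n
    d = sum(n - 1 for n in letters)      # one-letter neighbors per codon
    if (V * d) % 4:
        raise ValueError("V*d must be divisible by 4 for quadrilateral faces")
    E = V * d // 2                       # each edge is shared by 2 vertices
    F = V * d // 4                       # each face has 4 corners
    chi = V - E + F                      # Euler characteristic
    gamma = 1 - chi / 2                  # Euler genus as defined on the slide
    return {"V": V, "d": d, "E": E, "F": F, "chi": chi, "gamma": gamma}


doublet = product_graph_topology((3, 3))      # doublet code with 3 bases
triplet = product_graph_topology((4, 4, 3))   # 48-codon graph with wobble
print(doublet)
print(triplet)
```

The same function gives γ = 13 for (4, 4, 2) and γ = 41 for (4, 4, 4), that is, for two and four discernible letters at the third position.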

The genetic code has a spectrum

• $u_{\alpha i}$ is the average preference of codon i to encode α.

• Every mode corresponds to an amino acid → number of modes = number of amino acids.

• Misreading w is actually the graph Laplacian: $w = -(\Delta - \Delta_{\text{random}})$, where $\Delta_{ij} = -W_{ij}$ and $\Delta_{ii} = \sum_{j \neq i} W_{ij}$.

• Δ measures the difference between codons and their neighbors, a natural measure for error load.

• Maximal mode of w is the 2nd eigenmode of Δ.

• Courant’s theorem: $u_{\alpha i}$ have a single maximum → single contiguous domain for each amino acid.
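The relation between the misreading preference w and the graph Laplacian can be verified on the 48-codon graph. In the sketch below the misreading matrix W is an assumption: each codon is read correctly with probability 1 − eps and misread as any one-letter neighbor with equal probability. The helper names are illustrative.

```python
import numpy as np


def product_adjacency(sizes):
    """Adjacency of K_n1 x K_n2 x ...: codons are adjacent if they differ
    in exactly one position."""
    total = int(np.prod(sizes))
    A = np.zeros((total, total))
    for k, n in enumerate(sizes):
        factors = [np.eye(m) for m in sizes]
        factors[k] = np.ones((n, n)) - np.eye(n)   # complete graph K_n
        term = factors[0]
        for f in factors[1:]:
            term = np.kron(term, f)
        A += term
    return A


def laplacian(W):
    """Delta_ij = -W_ij for i != j, Delta_ii = sum over j != i of W_ij."""
    off = W - np.diag(np.diag(W))
    return np.diag(off.sum(axis=1)) - off


A = product_adjacency((4, 4, 3))            # 48 codons, wobble at 3rd base
n_codons = A.shape[0]
degree = A.sum(axis=1)
eps = 0.1                                   # assumed misreading probability
W = (1 - eps) * np.eye(n_codons) + eps * A / degree[:, None]
W_random = np.full((n_codons, n_codons), 1.0 / n_codons)

w = W - 1.0 / n_codons
w_from_laplacian = -(laplacian(W) - laplacian(W_random))

lam = np.linalg.eigvalsh(laplacian(W))      # ascending; lam[0] is the constant mode
top_w = np.linalg.eigvalsh(w).max()
print(np.allclose(w, w_from_laplacian), lam[1], top_w)
```

Under these assumptions the largest eigenvalue of w equals 1 minus the second-smallest eigenvalue of Δ, which is the sense in which the maximal mode of w is the 2nd eigenmode of Δ.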

Topology optimizes: amino-acid assignment is in compact domains

• $u_{\alpha i}$ have single compact domains with one maximum and one minimum (Courant’s theorem).

• Compact organization reduces impact of errors.

• Single domain in any direction (linearity): $\sum_\alpha n_\alpha u_{\alpha i}$

Embedding in $R^{N-1}$ is tight

→ The code graph contains the complete graph $K_N$

[Banchoff 1965, Colin de Verdière 1987]

Number of amino acids = N = chr(γ)

Coloring number of the code graph is an upper limit for the number of amino acids

• What is the minimal number of colors required in a map so that no two adjacent regions have the same color?

• The coloring number is a topological invariant and therefore a function of the genus solely.

• Heawood’s conjecture [Ringel & Youngs, Appel & Haken]:

$\mathrm{chr}(\gamma) = \left\lfloor \frac{7 + \sqrt{1 + 48\gamma}}{2} \right\rfloor$

chr(γ) for γ = 0, 1, …, 49 (ten values per row):

 4  7  8  9 10 11 12 12 13 13
14 15 15 16 16 16 17 17 18 18
19 19 19 20 20 20 21 21 21 22
22 22 23 23 23 24 24 24 24 25
25 25 25 26 26 26 27 27 27 27

chr(γ) = max N such that $K_N$ embeds in a surface of genus γ, so $N \le \mathrm{chr}(\gamma_{\text{code}})$
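The tabulated values follow from the Heawood formula with the factor 48 under the square root. A short sketch, using integer arithmetic to avoid floating-point rounding (the function name `heawood_number` is illustrative):

```python
from math import isqrt


def heawood_number(genus: int) -> int:
    """floor((7 + sqrt(1 + 48 * genus)) / 2), computed with exact integers."""
    return (7 + isqrt(1 + 48 * genus)) // 2


table = [heawood_number(g) for g in range(50)]
for row_start in range(0, 50, 10):
    print(" ".join(f"{value:2d}" for value in table[row_start:row_start + 10]))

print(heawood_number(25))  # genus quoted for the 48-codon graph
```

Using `isqrt` is a design choice for exactness: for integer x, floor((7 + √x)/2) equals (7 + isqrt(x)) // 2.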

The genetic code coevolves with increasing accuracy of translation

• A path for evolution of codes: from early codes with higher codon degeneracy and fewer amino acids to lower degeneracy codes with more amino acids.

• Preliminary simulations

• Twenty amino acids is invariant even in variant codes. 21st and 22nd amino acids are context dependent.

Discernible letters at each codon position, genus γ and coloring number:

1st   2nd   3rd    γ    chr #
 1     4     1     0     4
 2     4     1     1     6
 4     4     1     5    11
 4     4     2    13    16
 4     4     3    25    20
 4     4     4    41    25

[Figure label: K4 × K4]

Summary

• The 64-codon triplet code is patterned and degenerate; it maps only 20 amino acids.

• The governing evolutionary dynamics is an interplay between protein diversity and error penalty, described by a stochastic diffusion equation.

• The 1st excited state of this diffusive mapping dynamics on the high-genus surface of the code yields an ordered pattern of 20 amino acids (20 = the coloring number of the graph).

• Topology + dynamics → Coloring (?)

Transcription network is a code that relates DNA sites and binding proteins

• Reading DNA to synthesize proteins is controlled by a system of protein-DNA interactions (transcription net).

• Presence/absence of a transcription factor may repress/enhance synthesis of protein from a nearby gene.

• The transcription network is actually a code that relates proteins with their DNA targets.

• Like the genetic code, transcription is subject to evolutionary forces and adapts to minimize errors.

[Figure: TF and Pol bound to DNA]

Probable recognition errors define the binding sequence space

[Figure panels: sphere packing (Shannon); overlap and continuity]

• Typical binding site: 6 base pairs = 12 bits

• Hamming distance = 1: the graph is (K4)^6 → 4096 ‘codons’

[Figure labels: TF, AA; codon, binding site]

Probable recognition errors define the binding sequence space

• Coloring number estimate:

v = 4^L (L = 6)

e ~ 4^L (3/2) L

f ~ 4^L (3/4) L

→ γ ~ 4^L (3/8) L

• The coloring # chr(γ) ~ 300

[Figure: log-log plot with axis ticks from 10^0 to 10^4; axis label: number of genes; series: n-domain C2H2, winged helix]
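The order-of-magnitude estimate for binding sites can be reproduced with the same counting used for the codon graph. This sketch assumes quadrilateral faces and the Heawood formula with the factor 48, as elsewhere in the talk; it keeps the exact counts rather than only the leading terms.

```python
from math import isqrt

L = 6                        # binding-site length in base pairs
v = 4 ** L                   # number of sequences ('codons')
d = 3 * L                    # one-letter neighbors of each sequence
e = v * d // 2               # edges: 4^L * (3/2) * L
f = v * d // 4               # quadrilateral faces: 4^L * (3/4) * L
chi = v - e + f              # Euler characteristic
gamma = 1 - chi // 2         # Euler genus (chi is even here)
coloring_number = (7 + isqrt(1 + 48 * gamma)) // 2

print(v, e, f, gamma, coloring_number)
```

The exact genus under these assumptions is somewhat below the leading-order term 4^L (3/8) L, but the resulting coloring number is still close to the quoted value of about 300.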

????

• Why does the code exhaust the coloring limit?

• Other population dynamics models (‘quasi-species’)

• Glassy ‘almost-frozen’ dynamics?

• The necessity of the wobble (64/48)? 25 acids?

• Generic phase transition scenario that does not depend finely on missing details of the evolutionary pathway.

• Although not much is known about the primordial environment, minimal assumptions about the topology of probable errors can yield characteristics of biological codes.

• In particular, the number of twenty amino acids in the present picture is reminiscent of a ‘shell magic number’.

Shalev Itzkovitz

Guy Shinar

Uri Alon

Guy Sella

J.-P. Eckmann

Elisha Moses
