Genome Evolution. Amos Tanay 2012
Genome evolution
Lecture 9: Mutations and variational inference
Sources of mutations
• Mistakes
  – Replication errors (point mutations, tandem duplications/deletions)
  – Recombination errors (mainly indels)
• Endogenous DNA damage
  – Spontaneous base damage: deaminations, depurinations
  – Byproducts of metabolism: oxygen radicals that damage DNA
• Exogenous DNA damage
  – UV
  – Chemicals
All of these mechanisms interact with the surrounding sequence context
DNA polymerases
• Replicating DNA
• A good polymerase domain has a misincorporation rate of 10^-5 (1/100,000)
• Misincorporated bases are clipped off with 99% efficiency by the “proofreading” activity of the polymerase
• Further mismatch repair, which works in 99.9% of cases, brings the fidelity of the main polymerases to 10^-10
• Some dedicated polymerases are not as accurate!
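The fidelity figure above is the product of the three stages: the raw misincorporation rate times the fraction of errors that escapes each correction layer. A minimal sketch of that arithmetic, using the slide's numbers as defaults (the function and parameter names are illustrative, not from the lecture):

```python
def replication_error_rate(misincorporation=1e-5,
                           proofreading_efficiency=0.99,
                           mismatch_repair_efficiency=0.999):
    """Per-base error rate left after each correction layer removes its share."""
    # Errors that survive proofreading: 1% of the misincorporations.
    escaped_proofreading = misincorporation * (1.0 - proofreading_efficiency)
    # Errors that also survive mismatch repair: 0.1% of those.
    return escaped_proofreading * (1.0 - mismatch_repair_efficiency)


rate = replication_error_rate()
print(f"overall error rate per base: {rate:.1e}")
```

Lowering either efficiency in the call shows how quickly the overall rate degrades, which is one way to read the remark about less accurate dedicated polymerases.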
Recombination errors
• A consequence of partial homology between different chromosomal loci
• Can introduce translocations if the matching sequences are on different chromosomes
• Can introduce inversion or deletion if the matching sequences are on the same chromosome
• Can generate duplication or deletions if the matching sequences are in tandem
Replication slippage
• While processing a strand, the polymerase disconnects and reconnects at the wrong place
CACACACACACACACACA CGACAGCGACAGTTACAAA
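Slippage on a tandem repeat can be pictured as gaining or losing whole copies of the repeat unit. The sketch below is only a string-level illustration of that outcome, not a model of the polymerase; the function `slip` and its arguments are hypothetical names introduced here.

```python
def slip(sequence, unit, start, change):
    """Return `sequence` with the tandem run of `unit` at `start` changed by `change` copies.

    A positive `change` adds copies of the unit, a negative one removes them.
    """
    # Count how many copies of the unit follow each other from `start`.
    copies = 0
    pos = start
    while sequence.startswith(unit, pos):
        copies += 1
        pos += len(unit)
    new_copies = copies + change
    if copies == 0 or new_copies < 0:
        raise ValueError("no tandem run at this position, or too many copies removed")
    return sequence[:start] + unit * new_copies + sequence[pos:]


ca_repeat = "CACACACACACACACACA"            # CA repeat from the slide
other = "CGACAGCGACAGTTACAAA"               # CGACAG appears twice in tandem

print(slip(ca_repeat, "CA", 0, +1))         # one CA copy gained
print(slip(ca_repeat, "CA", 0, -1))         # one CA copy lost
print(slip(other, "CGACAG", 0, -1))         # one CGACAG copy lost
```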
Endogenous DNA damage: Deamination of Cytosines
[Figure: deamination (loss of the NH2 group) converts cytosine to uracil. *Thymine has a CH3 group at the marked ring position.]
Deamination of cytosine creates a G-U mismatch
• Easy to tell that U is wrong
Deamination of 5-methylcytosine creates a G-T mismatch
• Not easy to tell which base is the mutation
• About 50% of the time the G is “corrected” to A, resulting in a mutation
Exogenous DNA damage
Chemicals
• Food
• Benzopyrene (smoke)
UV radiation (sunlight)
• UV irradiation generates primarily thymine dimers
Ionizing radiation
• Radon
• Cosmic rays
• X-rays
Repairing DNA damage
Direct repair
Thymine dimers can be corrected by a direct repair mechanism
[Figure: a photon drives the direct reversal of a thymine dimer]
Deaminated bases are repaired by a base excision repair (BER) mechanism.
Spontaneously occurring abasic sites are repaired by the same mechanism (BER)
Dimeric bases and bulky lesions, e.g., large chemical adducts, are repaired by nucleotide excision repair (NER)
Evolutionary consequences of the rich mutational process
Cannot ignore dependencies among adjacent sites
Mechanisms are evolutionarily variable
Lifestyle -> Environmental exposure
Germline and male/female ratio
Mechanisms are variable on the genomic scale – late vs. early replication
Dynamic Bayesian Networks
Synchronous discrete time process
[Figure: a network over four variables (1 to 4), unrolled into time slices T=1 to T=5; consecutive slices are linked by conditional probabilities]
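A synchronous discrete-time process of this kind can be sampled slice by slice: every variable at time T+1 is drawn from a conditional probability given its parents at time T. The sketch below assumes a structure and a probability table purely for illustration (each variable depends on itself and on one neighbor); the actual edges of the slide's figure are not recoverable from the transcript.

```python
import random

# Hypothetical structure: variable i at time t+1 depends on itself and on
# variable i-1 (cyclically) at time t.
PARENTS = {0: (0, 3), 1: (1, 0), 2: (2, 1), 3: (3, 2)}

# Hypothetical conditional probability table, shared by all variables:
# P(x_i(t+1) = 1 | own value at t, neighbor value at t).
CPT = {(0, 0): 0.05, (0, 1): 0.30, (1, 0): 0.70, (1, 1): 0.95}


def step(state, rng):
    """Sample the next time slice; all variables update synchronously from `state`."""
    return tuple(
        1 if rng.random() < CPT[tuple(state[p] for p in PARENTS[i])] else 0
        for i in range(len(state))
    )


def simulate(initial, n_slices, seed=0):
    """Return a list of `n_slices` states, starting from `initial`."""
    rng = random.Random(seed)
    trajectory = [tuple(initial)]
    for _ in range(n_slices - 1):
        trajectory.append(step(trajectory[-1], rng))
    return trajectory


for t, state in enumerate(simulate((1, 0, 0, 0), n_slices=5), start=1):
    print(f"T={t}", state)
```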
Context dependent Markov Processes
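[Figure: a site flanked by different neighbors (A_A, C_A, G_A), each context with its own rate matrix $Q^{AA}$, $Q^{CA}$, $Q^{GA}$]
Context determines a Markov process rate matrix
Any dependency structure makes sense, including loops
[Figure: when a neighbor changes (A to C) during the process, which rate matrix Q applies?]
When the context is changing, computing probabilities is difficult. Think of the hidden variables as the trajectories.
Continuous time Bayesian Networks (Koller-Nodelman 2002)
[Figure: variables 1 2 3 4, each evolving with a rate matrix $Q_i^{pa(i)}$ determined by the current state of its parents]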
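One way to see why a changing context complicates things is to simulate trajectories directly: after every substitution, the rates of the neighboring sites have to be re-read from the new context. The sketch below uses a Gillespie-style simulation with made-up rates (uniform substitutions, with C to T boosted when the next base is G); `site_rates`, `base_rate` and `cpg_boost` are assumptions for illustration, not parameters from the lecture.

```python
import random

BASES = "ACGT"


def site_rates(left, current, right, base_rate=1.0, cpg_boost=10.0):
    """Hypothetical context-dependent rates for one site.

    `left` is accepted for generality but unused in this toy rate function.
    """
    rates = {b: base_rate for b in BASES if b != current}
    if current == "C" and right == "G":
        rates["T"] *= cpg_boost          # assumed elevated C->T rate in a CG context
    return rates


def simulate(seq, total_time, seed=0):
    """Gillespie simulation; returns the final sequence and the number of events."""
    rng = random.Random(seed)
    seq = list(seq)
    t, n_events = 0.0, 0
    while True:
        # Rebuild the event list because the last change may have altered contexts.
        events = []
        for i, cur in enumerate(seq):
            left = seq[i - 1] if i > 0 else None
            right = seq[i + 1] if i + 1 < len(seq) else None
            for new, rate in site_rates(left, cur, right).items():
                events.append((i, new, rate))
        total = sum(rate for _, _, rate in events)
        t += rng.expovariate(total)      # waiting time to the next substitution
        if t > total_time:
            return "".join(seq), n_events
        pick = rng.random() * total      # choose an event proportionally to its rate
        for i, new, rate in events:
            pick -= rate
            if pick <= 0:
                break
        seq[i] = new
        n_events += 1


print(simulate("ACGTACGTCG", total_time=0.5))
```

Recomputing every rate after each event is wasteful but keeps the dependence on the current context explicit; only the changed site and its neighbors actually need updating.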
Modeling simple context in the tree: PhyloHMM
Siepel-Haussler 2003
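[Figure: graphical model over the nodes $h^{pa_i}_{j-1}, h^{pa_i}_j, h^{pa_i}_{j+1}$ (parent), $h^i_{j-1}, h^i_j, h^i_{j+1}$ and $h^k_{j-1}, h^k_j, h^k_{j+1}$; the superscript indexes the tree node and the subscript the sequence position]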
Heuristically approximating the Markov process?
Where exactly does it fail?
Log-likelihood to Free Energy
• We have so far worked on computing the likelihood:
$$\log P(s|\theta) = \log \sum_h P(h,s|\theta)$$
• Computing the likelihood is hard. We can reformulate the problem by adding parameters and transforming it into an optimization problem. Given a trial function q, define the free energy of the model as:
$$F(q) = \sum_h q(h) \log \frac{q(h)}{\Pr(h,s|\theta)}$$
• The free energy is exactly the negative log-likelihood when q is the posterior, $q(h) = \Pr(h|s,\theta)$:
$$F(q) = \sum_h \Pr(h|s,\theta) \log \frac{\Pr(h|s,\theta)}{\Pr(h,s|\theta)} = -\sum_h \Pr(h|s,\theta) \log \Pr(s|\theta) = -\log \Pr(s|\theta)$$
• Better: when q is a distribution, the free energy bounds the likelihood:
$$F(q) = \underbrace{\sum_h q(h) \log \frac{q(h)}{\Pr(h|s,\theta)}}_{D(q \,\|\, p(h|s))} - \underbrace{\sum_h q(h) \log \Pr(s|\theta)}_{\text{log-likelihood}} \ge -\log \Pr(s|\theta)$$
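The bound can be checked numerically on a toy model. The sketch below takes the free energy as F(q) = sum_h q(h) log(q(h) / Pr(h,s)), so that F(q) is never below -log Pr(s) and equals it when q is the posterior. The joint table is invented for illustration.

```python
import math

# Hypothetical joint table Pr(h, s) for the observed s, over three hidden states.
joint = {"h1": 0.10, "h2": 0.05, "h3": 0.25}


def free_energy(q, joint):
    """F(q) = sum_h q(h) * log(q(h) / Pr(h, s)); terms with q(h) = 0 contribute 0."""
    return sum(qh * math.log(qh / joint[h]) for h, qh in q.items() if qh > 0)


p_s = sum(joint.values())                              # Pr(s), the likelihood
posterior = {h: p / p_s for h, p in joint.items()}     # Pr(h | s)
uniform = {h: 1.0 / len(joint) for h in joint}         # an arbitrary trial q

neg_log_lik = -math.log(p_s)
print("F(posterior) =", free_energy(posterior, joint), " -log Pr(s) =", neg_log_lik)
print("F(uniform)   =", free_energy(uniform, joint))
```

The gap between F(uniform) and -log Pr(s) is the divergence D(q || p(h|s)) of the trial distribution from the posterior.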
Energy?? What energy?
• In statistical mechanics, a system at temperature T with states x and an energy function E(x) is characterized by Boltzmann's law:
$$p(x) = \frac{1}{Z(T)} e^{-E(x)/T}$$
• Z is the partition function:
$$Z(T) = \int e^{-E(x)/T} \, dx$$
• Given a model $p(h,s|\theta)$ (a BN), we can define the energy using Boltzmann's law:
$$-\frac{1}{T} E(h,s|\theta) = \log p(h,s|\theta)$$
• If we think of $P(h|s,\theta)$:
$$E(h) = -\log p(h,s|\theta), \qquad Z = p(s)$$
Free Energy and Variational Free Energy
$$p(x) = \frac{1}{Z(T)} e^{-E(x)/T}$$
• The Helmholtz free energy is defined in physics as:
$$F_H = -\log Z$$
• This free energy is important in statistical mechanics, but it is as difficult to compute as our probabilistic Z (= p(s))
• The variational transformation introduces trial functions q(h), and sets the variational free energy (or Gibbs free energy) to:
$$F(q) = U(q) - H(q)$$
• The average energy is:
$$U(q) = \sum_h q(h) E(h)$$
• The variational entropy is:
$$H(q) = -\sum_h q(h) \log q(h)$$
• And as before:
$$F(q) = F_H + D(q \,\|\, p)$$
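The pieces can be verified against each other on a small table. The sketch below assumes T = 1, E(h) = -log p(h,s) and Z = p(s), and checks both that exp(-E(h))/Z is the posterior and that U(q) - H(q) equals F_H + D(q || p). The joint table and the trial distribution q are invented for illustration.

```python
import math

joint = {"a": 0.10, "b": 0.05, "c": 0.25}    # hypothetical p(h, s) for the fixed s
q = {"a": 0.5, "b": 0.2, "c": 0.3}           # arbitrary trial distribution

Z = sum(joint.values())                                  # partition function, Z = p(s)
energy = {h: -math.log(p) for h, p in joint.items()}     # E(h) = -log p(h, s), T = 1
boltzmann = {h: math.exp(-energy[h]) / Z for h in joint}  # should equal p(h | s)

U = sum(q[h] * energy[h] for h in q)                     # average energy under q
H = -sum(qh * math.log(qh) for qh in q.values())         # variational entropy
F_helmholtz = -math.log(Z)                               # F_H = -log Z
kl = sum(q[h] * math.log(q[h] / boltzmann[h]) for h in q)  # D(q || p(h | s))

print("U - H    =", U - H)
print("F_H + KL =", F_helmholtz + kl)
```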
Solving the variational optimization problem
• So instead of computing p(s), we can search for the q that optimizes the free energy:
$$F(q) = U(q) - H(q), \qquad U(q) = -\sum_h q(h) \log p(h,s), \qquad H(q) = -\sum_h q(h) \log q(h)$$
• Minimizing U focuses q on the most probable configurations; maximizing H spreads out the distribution
• This is still as hard as before, but we can simplify the problem by restricting q (this is where the additional degrees of freedom become important)
Simplest variational approximation: Mean Field
• Let's assume complete independence among the r.v.'s posteriors:
$$q(h) = \prod_i q_i(h_i)$$
• Under this assumption we can try optimizing the $q_i$ (looking for minimal energy!):
$$F_{MF} = \min_{q_1,\dots,q_n} F\Big(\prod_i q_i\Big) = \min_{q_1,\dots,q_n} \Big[ -\sum_h \prod_i q_i(h_i) \log p(h,s) + \sum_i \sum_{h_i} q_i(h_i) \log q_i(h_i) \Big]$$
Mean Field Inference
• We optimize iteratively:
  – Select i (sequentially, or using any method)
  – Optimize $q_i$ to minimize $F_{MF}(q_1,\dots,q_i,\dots,q_n)$ while fixing all other q's
  – Terminate when $F_{MF}$ cannot be improved further
$$q(h) = \prod_i q_i(h_i), \qquad F_{MF} = \min_{q_1,\dots,q_n} \Big[ -\sum_h \prod_i q_i(h_i) \log p(h,s) + \sum_i \sum_{h_i} q_i(h_i) \log q_i(h_i) \Big]$$
• Remember: $F_{MF}$ always bounds the likelihood
• $q_i$ optimization can usually be done efficiently
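The iteration above can be sketched for two binary hidden variables. With all other factors fixed, the factor q_i that minimizes the mean field free energy is proportional to the exponential of the expected log joint under the other factors, which is the standard coordinate update used here. The joint table, the starting point and the tolerance are assumptions for illustration.

```python
import math

# Hypothetical joint p(h1, h2, s) for the observed s; h1 and h2 are binary.
JOINT = {(0, 0): 0.30, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.20}


def free_energy(q1, q2):
    """F_MF = -sum_h q1(h1) q2(h2) log p(h, s) + sum_i sum_hi qi(hi) log qi(hi)."""
    u = -sum(q1[a] * q2[b] * math.log(JOINT[a, b]) for a in (0, 1) for b in (0, 1))
    neg_entropy = sum(x * math.log(x) for q in (q1, q2) for x in q if x > 0)
    return u + neg_entropy


def update(other, axis):
    """Optimal factor for variable `axis` given the other variable's factor."""
    logits = []
    for v in (0, 1):
        expected = 0.0
        for w in (0, 1):
            h = (v, w) if axis == 0 else (w, v)
            expected += other[w] * math.log(JOINT[h])   # E_other[log p(h, s)]
        logits.append(expected)
    top = max(logits)                                   # subtract max for stability
    weights = [math.exp(x - top) for x in logits]
    total = sum(weights)
    return [w / total for w in weights]


def mean_field(q1, q2, tol=1e-10, max_sweeps=1000):
    """Sweep over the factors until the free energy stops improving."""
    history = [free_energy(q1, q2)]
    for _ in range(max_sweeps):
        q1 = update(q2, axis=0)
        q2 = update(q1, axis=1)
        history.append(free_energy(q1, q2))
        if history[-2] - history[-1] < tol:
            break
    return q1, q2, history


q1, q2, history = mean_field([0.6, 0.4], [0.5, 0.5])
neg_log_lik = -math.log(sum(JOINT.values()))
print("F_MF =", history[-1], " -log p(s) =", neg_log_lik)
```

The recorded history never increases, and its final value stays above -log p(s); the remaining gap reflects the dependence between h1 and h2 that a fully factorized q cannot represent.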
Adaptive mutations: Cairns et al. 88
Experimental system: lacZ frameshift
Luria-Delbrück's observation
The experiment suggests adaptive mutations
The “Mutator” paradigm:
The ability to switch to the mutator phenotype depends on particular DNA repair mechanisms (double-strand break repair in E. coli)
Mutator phenotype is suggested to be important in pathogenesis, antibiotic resistance, and in cancer
Species occasionally change (adaptively or even by drift) their repair policy/efficiency
The resulting substitution landscape must be very complex