![Page 1: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/1.jpg)
1
Towards optimal distance functions for stochastic substitution models
Ilan Gronau, Shlomo Moran, Irad Yavneh
Technion, Israel
![Page 2: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/2.jpg)
2
Preview: The Phylogenetic Reconstruction Problem
![Page 3: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/3.jpg)
3
[Figure: an evolutionary tree in which every node is labeled with a short DNA sequence, e.g. AATCCTG, GAACGTA, ACGGTCA]
Evolution is modeled by a Tree
(All our sequences are DNA sequences, consisting of {A,G,C,T})
![Page 4: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/4.jpg)
4
[Figure: only the sequences at the leaves of the tree are given: AATCCTG, ATAGCTG, AATGGGC, GAACGTA, AAACCGA, ACCGTTG, TCTGGGA, TCCGGAA, AGCCGTG, GGGGATT]
Phylogenetic Reconstruction
![Page 5: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/5.jpg)
5
A : AATGGGC
B : AATCCTG
C : ATAGCTG
D : GAACGTA
E : AAACCGA
F : GGGGATT
G : TCTGGGA
H : TCCGGAA
I : AGCCGTG
J : ACCGTTG

Goal: reconstruct the ‘true’ tree as accurately as possible

[Figure: a reconstruction procedure maps the sequences at the leaves A-J to a rooted tree over A-J]
Phylogenetic Reconstruction
![Page 6: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/6.jpg)
7
Road Map
• Distance based reconstruction algorithms
• The Kimura 2 Parameter (K2P) Model
• Performance of distance methods in the K2P model
• Substitution models and substitution rate functions
• Properties of SR functions
• Unified Substitutions Models
• Optimizing Distances in the K2P model
• Simulation results
![Page 7: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/7.jpg)
8
Distance Based Phylogenetic Reconstruction: Exact vs. Noisy distances

[Figure: an edge-weighted ‘true’ tree over the leaves A-G, and the tree reconstructed from estimated distances]

• Exact (additive) distances between species: $D = \{d(u,v)\}_{u,v \in S}$
• Distance estimation using finite sampling
• Estimated distances: $\hat D = \{\hat d(u,v)\}_{u,v \in S}$
• Reconstruction: the tree is reconstructed from $\hat D$

Challenge: minimize the effect of noise introduced by the sampling
![Page 8: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/8.jpg)
9
Road Map (next section)
• The Kimura 2 Parameter (K2P) Model
![Page 9: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/9.jpg)
10
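The Kimura 2 Parameter (K2P) model [Kimura80]: each edge corresponds to a “Rate Matrix”

K2P generic rate matrix for an edge (u, v):

$$R_{uv} = \begin{array}{c|cccc} & A & G & C & T \\ \hline A & - & \alpha & \beta & \beta \\ G & \alpha & - & \beta & \beta \\ C & \beta & \beta & - & \alpha \\ T & \beta & \beta & \alpha & - \end{array}$$

Transitions (rate α): substitutions inside {A, G} or inside {C, T}.
Transversions (rate β): substitutions between {A, G} and {C, T}.
Transitions/transversions ratio: $R = \alpha / 2\beta$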
![Page 10: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/10.jpg)
11
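K2P standard distance: Δtotal = Total substitution rate

The total substitution rate of a K2P rate matrix $R_{uv}$ is

$$\Delta_{total}(R_{uv}) = \alpha + 2\beta = \tfrac{1}{4}\,(\text{sum of off-diagonal entries of } R_{uv})$$

This is the expected number of mutations per site. It is an additive distance: if the edge (u, v) has total rate α + 2β and the edge (v, w) has total rate α’ + 2β’, then the path from u to w has total rate (α+α’) + 2(β+β’).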
![Page 11: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/11.jpg)
12
Estimation of Δtotal(Ruv) = dK2P(u,v) is a noisy stochastic process

u A A C A … G T C T T C G A G G C C C
v A G C A … G C C T A T G C G A C C T

The K2P total rate “distance correction” procedure maps the two aligned sequences to the estimate

$$\hat d_{K2P}(u,v) = \hat\alpha + 2\hat\beta$$
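The distance correction step can be sketched in Python. This is a minimal sketch, not the authors' code: it assumes the classical closed form of Kimura's correction, written through the eigenvalue estimates λ̂ = 1 - 2V and μ̂ = 1 - 2S - V, where S and V are the observed fractions of transition and transversion sites. The function name `k2p_distance` is hypothetical.

```python
import math

PURINES = {"A", "G"}

def k2p_distance(u: str, v: str) -> float:
    """Estimate alpha + 2*beta from two aligned DNA sequences."""
    if len(u) != len(v) or not u:
        raise ValueError("sequences must be aligned and non-empty")
    transitions = transversions = 0
    for a, b in zip(u, v):
        if a == b:
            continue
        # A substitution inside {A,G} or inside {C,T} is a transition.
        if (a in PURINES) == (b in PURINES):
            transitions += 1
        else:
            transversions += 1
    s = transitions / len(u)
    tv = transversions / len(u)
    lam_hat = 1 - 2 * tv          # estimate of e^{-4 beta}
    mu_hat = 1 - 2 * s - tv       # estimate of e^{-2 alpha - 2 beta}
    if lam_hat <= 0 or mu_hat <= 0:
        return math.inf           # saturated pair: the logarithm is undefined
    return -(math.log(lam_hat) + 2 * math.log(mu_hat)) / 4
```

For two identical sequences the estimate is 0; it grows as the observed fractions of substituted sites grow, and it is undefined (reported here as infinity) once a pair is saturated.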
![Page 12: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/12.jpg)
13
Road Map (next section)
• Performance of distance methods in the K2P model
![Page 13: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/13.jpg)
14
Check performance of K2P “standard” distances in resolving quartet-splits

There are 3 possible quartet topologies: AB|CD, AC|BD and AD|BC.

• Distance methods reconstruct the true split by the 4-point condition. For exact distances and true split AB|CD, where $w_{sep}$ is the weight of the edge separating {A, B} from {C, D}:

$$d_{K2P}(A,B) + d_{K2P}(C,D) + 2w_{sep} = d_{K2P}(A,C) + d_{K2P}(B,D) = d_{K2P}(A,D) + d_{K2P}(B,C)$$

The 4-point condition for noisy distances is:

$$d_{K2P}(A,B) + d_{K2P}(C,D) < \min\{\, d_{K2P}(A,C) + d_{K2P}(B,D),\; d_{K2P}(A,D) + d_{K2P}(B,C) \,\}$$
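As an illustration, the 4-point condition for noisy distances amounts to picking the pairing with the smallest sum. The sketch below assumes the six pairwise distances are given in a dictionary keyed by unordered leaf pairs; the function name `resolve_quartet` and this data layout are hypothetical choices.

```python
def resolve_quartet(d):
    """Return the split whose pair-sum is smallest.

    d maps frozenset({x, y}) to the (possibly noisy) distance between
    leaves x and y, for x, y in {"A", "B", "C", "D"}.
    """
    sums = {
        "AB|CD": d[frozenset("AB")] + d[frozenset("CD")],
        "AC|BD": d[frozenset("AC")] + d[frozenset("BD")],
        "AD|BC": d[frozenset("AD")] + d[frozenset("BC")],
    }
    return min(sums, key=sums.get)

# Exact distances on a quartet AB|CD with pendant edges 1 and w_sep = 0.5.
exact = {
    frozenset("AB"): 2.0, frozenset("CD"): 2.0,
    frozenset("AC"): 2.5, frozenset("BD"): 2.5,
    frozenset("AD"): 2.5, frozenset("BC"): 2.5,
}
split = resolve_quartet(exact)
```

With exact distances the two larger sums exceed the smallest one by exactly 2·w_sep, so the test fails on noisy distances only when the noise exceeds that gap.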
![Page 14: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/14.jpg)
15
We evaluate the accuracy of the K2P distance estimation by a Split Resolution Test:

[Figure: the model quartet. The root is joined by two edges of length t to two internal nodes, and each internal node is joined by two edges of length 10t to leaves (A, B, C, D in total). Every edge carries a K2P rate matrix.]

t is “evolutionary time”
The diameter of the quartet is 22t
![Page 15: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/15.jpg)
16
Phase A: simulate evolution
[Figure: a root sequence (CCCGGAGCTTCTG…ACAA) is evolved along the edges of the model quartet, producing a sequence at every node, including the leaves A, B, C, D]
![Page 16: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/16.jpg)
17
Phase B: reconstruct the split by the 4p condition

1. Estimate the distances between the sequences $s_A, s_B, s_C, s_D$ at the leaves: $\hat D(i,j) = \hat d_{K2P}(s_i, s_j)$
2. Apply the 4p condition. Was the correct split found?

Repeat this process 10,000 times and count the number of failures.
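The two phases can be put together in a small simulation. This is a sketch under stated assumptions, not the authors' experiment: the leaves A, B are assumed to hang under one internal node and C, D under the other (so the true split is AB|CD), sites evolve independently, the root sequence is uniform, and the sequence length and trial count are arbitrary illustrative defaults. All function names are hypothetical.

```python
import math
import random

TRANSITION = {"A": "G", "G": "A", "C": "T", "T": "C"}
TRANSVERSIONS = {"A": "CT", "G": "CT", "C": "AG", "T": "AG"}

def evolve(seq, alpha, beta, rng):
    """Evolve seq along one edge with K2P rate parameters alpha, beta."""
    lam = math.exp(-4 * beta)
    mu = math.exp(-2 * alpha - 2 * beta)
    p_ts = (1 + lam - 2 * mu) / 4   # probability of the transition
    p_tv = (1 - lam) / 2            # probability of either transversion
    out = []
    for x in seq:
        r = rng.random()
        if r < p_ts:
            out.append(TRANSITION[x])
        elif r < p_ts + p_tv:
            out.append(rng.choice(TRANSVERSIONS[x]))
        else:
            out.append(x)
    return "".join(out)

def k2p_distance(u, v):
    """Estimated total rate between two aligned sequences."""
    n = len(u)
    ts = sum(1 for a, b in zip(u, v) if a != b and TRANSITION[a] == b) / n
    tv = sum(1 for a, b in zip(u, v) if a != b and TRANSITION[a] != b) / n
    lam, mu = 1 - 2 * tv, 1 - 2 * ts - tv
    if lam <= 0 or mu <= 0:
        return math.inf
    return -(math.log(lam) + 2 * math.log(mu)) / 4

def split_found(t, ratio, length, rng):
    """One trial: simulate the quartet, then test the 4p condition for AB|CD."""
    beta = t / (2 * ratio + 2)      # alpha + 2*beta = t with alpha = 2*ratio*beta
    alpha = 2 * ratio * beta
    root = "".join(rng.choice("AGCT") for _ in range(length))
    left = evolve(root, alpha, beta, rng)
    right = evolve(root, alpha, beta, rng)
    s = {
        "A": evolve(left, 10 * alpha, 10 * beta, rng),
        "B": evolve(left, 10 * alpha, 10 * beta, rng),
        "C": evolve(right, 10 * alpha, 10 * beta, rng),
        "D": evolve(right, 10 * alpha, 10 * beta, rng),
    }

    def d(x, y):
        return k2p_distance(s[x], s[y])

    return d("A", "B") + d("C", "D") < min(d("A", "C") + d("B", "D"),
                                           d("A", "D") + d("B", "C"))

def failure_fraction(diameter, ratio=10, length=500, trials=100, seed=0):
    """Fraction of trials in which the 4p condition misses the true split."""
    rng = random.Random(seed)
    t = diameter / 22               # the quartet diameter is 22t
    failures = sum(not split_found(t, ratio, length, rng) for _ in range(trials))
    return failures / trials
```

Calling `failure_fraction` over a range of diameters gives a failure curve of the kind plotted on the following slides; a saturated pair (infinite distance) is counted as a failure in this sketch.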
![Page 17: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/17.jpg)
18
The split resolution test was applied to the model quartet with various diameters.
For each diameter, mark the fraction (percentage) of the simulations in which the 4p condition failed (next slide).

[Figure: copies of the model quartet, scaled to different diameters]
![Page 18: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/18.jpg)
19
Performance of K2P distances in resolving quartets, small diameters: 0.01-0.2

[Plot: performance of the K2P standard distance method in resolving quartets, R=10. x-axis: quartet diameter (total rate between furthest leaves), 0 to 0.2; y-axis: fraction of failures out of 10000 experiments. Inset: the template quartet.]
![Page 19: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/19.jpg)
20
Performance for larger diameters

[Plot: performance of the K2P standard distance method in resolving quartets, for quartet ratio 0.1, R=10. x-axis: quartet diameter (= mutation rate between furthest leaves), 0.2 to 2; y-axis: fraction of failures out of 10000 simulations. An annotation marks “site saturation”.]
![Page 20: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/20.jpg)
21
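When β < α, we can postpone the “site saturation” effect. For this, use another distance function for the same model, Δtv, which counts only transversions:

[Figure: the purines {A, G} are merged into a single state {0} and the pyrimidines {C, T} into a single state {1}. Transitions (rate α) stay inside a state; only transversions (rate β) move between the two states.]

This is actually the CFN model [Cavender78, Farris73, Neyman71]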
![Page 21: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/21.jpg)
22
Apply the same split resolution test on the transversions only distance:

u A A C A … G T C T T C G A G G C C C
v A G C A … G C C T A T G C G A C C T

The transversions only “distance correction” procedure maps the two aligned sequences to the estimate

$$\hat d_{tr}(u,v) = \hat\beta$$
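A sketch of the transversions only correction, assuming the estimate β̂ = -ln(1 - 2V)/4, where V is the observed fraction of transversion sites (this matches λ = e^{-4β} with λ̂ = 1 - 2V). The function name `transversion_distance` is hypothetical.

```python
import math

PURINES = {"A", "G"}

def transversion_distance(u: str, v: str) -> float:
    """Estimate beta from two aligned DNA sequences, ignoring transitions."""
    if len(u) != len(v) or not u:
        raise ValueError("sequences must be aligned and non-empty")
    # A site is a transversion when one base is a purine and the other is not.
    tv = sum(1 for a, b in zip(u, v)
             if (a in PURINES) != (b in PURINES)) / len(u)
    lam_hat = 1 - 2 * tv
    if lam_hat <= 0:
        return math.inf           # saturated: the logarithm is undefined
    return -math.log(lam_hat) / 4
```

Because transitions are ignored, a pair that differs only by A↔G or C↔T substitutions gets distance 0 under this function.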
![Page 22: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/22.jpg)
23
transversions only performs better on large, worse on small rates
[Plot: performance of distance methods in resolving quartets, R=10. x-axis: quartet diameter, 0 to 2; y-axis: fraction of failures out of 10000 experiments. Curves: transversions only, total K2P rate.]
![Page 23: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/23.jpg)
Conclusion: Distance based reconstruction methods should be adaptive:
Find a distance function d which is good for the input, and use it to build the distance matrix $\hat D(u,v) = \hat d(u,v)$.
We do a small step in this direction:
Input: An alignment of the sequences at u, v.
Output: a (near-)optimal distance function, which minimizes the expected noise in the estimation procedure.
![Page 24: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/24.jpg)
25
Example: An adaptive distance method (max-optimal)
based on this talk:
[Plot: performance of distance methods in resolving quartets, R=10. x-axis: quartet diameter, 0 to 2; y-axis: fraction of failures out of 10000 experiments. Curves: max-optimal, standard K2P, transversions only.]
![Page 25: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/25.jpg)
26
Road Map (next section)
• Substitution models and Substitution Rate functions
![Page 26: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/26.jpg)
27
Steps in finding optimal distance functions:
1. Define substitution model.
2. Characterize the available distance functions.
3. Select a function which is optimal (least sensitive to stochastic noise) for the input sequences.
![Page 27: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/27.jpg)
28
From Rate matrices to Substitution matrices
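Evolution of a finite sequence by unknown model parameters α, β:

u A A C A … G T C T T C G A G G C C C
v A G C A … G C C T A T G C G A C C T

Rate matrices imply stochastic substitution matrices. The K2P rate matrix (rows and columns ordered A, G, C, T)

$$R_{uv} = \begin{pmatrix} - & \alpha & \beta & \beta \\ \alpha & - & \beta & \beta \\ \beta & \beta & - & \alpha \\ \beta & \beta & \alpha & - \end{pmatrix}$$

implies a stochastic substitution matrix $P_{uv}$ with the same structure:

$$P_{uv} = \begin{pmatrix} 1-p_\alpha-2p_\beta & p_\alpha & p_\beta & p_\beta \\ p_\alpha & 1-p_\alpha-2p_\beta & p_\beta & p_\beta \\ p_\beta & p_\beta & 1-p_\alpha-2p_\beta & p_\alpha \\ p_\beta & p_\beta & p_\alpha & 1-p_\alpha-2p_\beta \end{pmatrix}$$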
![Page 28: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/28.jpg)
29
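A substitution model M: A set of stochastic substitution matrices, closed under matrix product:
P, Q ∈ M ⇒ PQ ∈ M
Motivation for the definition: if $P_{uv}$ and $P_{vw}$ are the substitution matrices of the consecutive edges (u, v) and (v, w), then the substitution matrix of the path from u to w is $P_{uw} = P_{uv} P_{vw}$.
Also required: P > 0 and 0 < det(P) < 1 for all P ∈ M.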
![Page 29: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/29.jpg)
30
Model tree over M = <Tree Topology> + <DNA distribution at the root> (here: uniform distribution) + <M-substitution matrices at the edges>

[Figure: a tree rooted at r in which every edge carries a substitution matrix from M, e.g. $P_{rv}$ on the edge (r, v)]
![Page 30: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/30.jpg)
31
Distances for a given model are defined by Substitution Rate (SR) functions:

Δ: M → ℝ is an SR function for M iff for all P, Q in M:
1. Δ(PQ) = Δ(P) + Δ(Q) (additivity)
2. Δ(P) > 0 (positivity)

[Figure: a path u, v, w whose edges carry the matrices $P_{uv}$ and $P_{vw}$]
![Page 31: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/31.jpg)
32
Road Map (next section)
• Properties of SR functions
![Page 32: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/32.jpg)
33
1st question: Given a model M, what are its SR functions?
We start from the additive functions of M: SR functions are additive functions which are strictly positive.
![Page 33: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/33.jpg)
34
Example 1: The logdet function [Lake94, Steel93] is an SR function for the most general model, Muniv:
Muniv = {P: P is a stochastic 4×4 matrix, 0 < det(P) < 1}.
The function $\Delta_{logdet}(P) = \ln(\det(P))$ is an additive function for Muniv.
The function $d_{logdet}(u,v) = -\ln(\det(P_{uv}))$ is an SR function for Muniv.
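Additivity of logdet follows from det(PQ) = det(P)·det(Q). The sketch below checks it numerically on two hand-picked stochastic matrices; the matrices, the helper names `matmul`, `det` and `logdet_distance`, and the use of -ln(det) as the positive distance are illustrative assumptions.

```python
import math

def matmul(X, Y):
    """Product of two square matrices given as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def det(M):
    """Determinant by Gaussian elimination with partial pivoting."""
    A = [row[:] for row in M]
    n = len(A)
    result = 1.0
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(A[r][c]))
        if abs(A[pivot][c]) < 1e-15:
            return 0.0
        if pivot != c:
            A[c], A[pivot] = A[pivot], A[c]
            result = -result          # a row swap flips the sign
        result *= A[c][c]
        for r in range(c + 1, n):
            f = A[r][c] / A[c][c]
            for k in range(c, n):
                A[r][k] -= f * A[c][k]
    return result

def logdet_distance(P):
    """-ln(det(P)); positive whenever 0 < det(P) < 1."""
    return -math.log(det(P))

# Two row-stochastic matrices (each row sums to 1), chosen arbitrarily.
P = [[0.85, 0.05, 0.06, 0.04],
     [0.10, 0.80, 0.05, 0.05],
     [0.02, 0.03, 0.90, 0.05],
     [0.05, 0.05, 0.10, 0.80]]
Q = [[0.90, 0.04, 0.03, 0.03],
     [0.05, 0.85, 0.05, 0.05],
     [0.10, 0.05, 0.80, 0.05],
     [0.01, 0.02, 0.07, 0.90]]

gap = logdet_distance(matmul(P, Q)) - logdet_distance(P) - logdet_distance(Q)
```

Up to floating-point rounding, `gap` is zero, which is the additivity property on the path u, v, w.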
![Page 34: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/34.jpg)
35
Example 2: The log eigenvalue function
Assume a model M with the following property: there is a vector $v \in R^4$ which is an eigenvector of each P ∈ M, i.e., $Pv = \lambda_v(P)\,v$.
The function $\Delta_v(P) = \ln(|\lambda_v(P)|)$ is an additive function for M. [e.g. Gu&Li98]
![Page 35: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/35.jpg)
36
Both “logdet” and the “log eigenvalue” functions are special cases of a general technique, Generalized logdet, which is given below:

Definition: Let P be a 4 by 4 matrix. A subspace H of $R^4$ is P-invariant if $PH \subseteq H$.
If H is P-invariant, then P defines a linear transformation on H. det(P|H) is the determinant of this linear transformation.

Lemma GLD (Generalized LogDet): If H is P-invariant for all P ∈ M, then

$$\Delta_H(P) = \ln(|\det(P|_H)|)$$

is an additive function for M.
![Page 36: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/36.jpg)
37
Linearity of additive functions:
If Δ1 and Δ2 are additive functions for M, so is c1 Δ1 + c2 Δ2.
The set of additive functions for M forms a vector space, to be denoted ADM.
Dimension(ADM) is the dimension of this vector space. Large dimension implies more “independent” distance functions.
If dimension(ADM ) = 1, then M admits a single distance function (up to product by scalar). Selecting best SR function in such a model is trivial. Thus, the adaptive approach is useful only when dimension(ADM ) > 1.
![Page 37: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/37.jpg)
38
Road Map (next section)
• Unified Substitutions Models: models for which the adaptive approach is potentially useful.
![Page 38: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/38.jpg)
39
Unified Substitution Models:

Def: A model M is unified if there is a matrix U s.t. for each P ∈ M it holds that:

$$U^{-1} P U = \begin{pmatrix} 1 & ? & ? & ? \\ 0 & \lambda_1(P) & ? & ? \\ 0 & 0 & \lambda_2(P) & ? \\ 0 & 0 & 0 & \lambda_3(P) \end{pmatrix}$$

Using Lemma GLD, we have:

Thm: if M is unified, then for each 3 constants $c_1, c_2, c_3$, the function

$$\Delta(P) = \sum_{i=1}^{3} c_i \ln(|\lambda_i(P)|)$$

is an additive function for M.
![Page 39: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/39.jpg)
40
Strongly Unified Substitution Models

Def: A model M is strongly unified if there is a matrix U s.t. for each P ∈ M it holds that:

$$U^{-1} P U = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & \lambda_1(P) & 0 & 0 \\ 0 & 0 & \lambda_2(P) & 0 \\ 0 & 0 & 0 & \lambda_3(P) \end{pmatrix}$$

Thm: if M is strongly unified, then all the additive functions of M are of the form

$$\Delta(P) = \sum_{i=1}^{3} c_i \ln(\lambda_i(P))$$
![Page 40: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/40.jpg)
41
A simple strongly unified model: The Jukes Cantor model [1969]

$$M_{JC} = \left\{ \begin{pmatrix} 1-3p & p & p & p \\ p & 1-3p & p & p \\ p & p & 1-3p & p \\ p & p & p & 1-3p \end{pmatrix} : 0 < p < 0.25 \right\}$$

MJC is strongly unified by a fixed matrix U. For all P ∈ MJC,

$$U^{-1} P U = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & \lambda_p & 0 & 0 \\ 0 & 0 & \lambda_p & 0 \\ 0 & 0 & 0 & \lambda_p \end{pmatrix}, \qquad \lambda_p = 1 - 4p$$

Claim: dimension(ADMJC) = 1
Hence the adaptive approach is irrelevant to this model.
![Page 41: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/41.jpg)
42
Another model M for which dimension(ADM)=1
Recall: Muniv consists of all DNA transition matrices.
Claim 2: dimension(ADMuniv) = 1
This means that all the additive functions of Muniv are
proportional to logdet.
Hence the adaptive approach is irrelevant also to this model.
Luckily, the spaces of additive functions of “intermediate” unified models have dimension > 1, hence the adaptive approach is useful for them.
Next we return to the Kimura 2 parameter model.
![Page 42: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/42.jpg)
43
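Back to K2P: For every K2P Substitution Matrix P:

$$P = \begin{pmatrix} 1-p_\alpha-2p_\beta & p_\alpha & p_\beta & p_\beta \\ p_\alpha & 1-p_\alpha-2p_\beta & p_\beta & p_\beta \\ p_\beta & p_\beta & 1-p_\alpha-2p_\beta & p_\alpha \\ p_\beta & p_\beta & p_\alpha & 1-p_\alpha-2p_\beta \end{pmatrix}$$

With U the matrix of the JC model:

$$U^{-1} P U = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & \lambda_P & 0 & 0 \\ 0 & 0 & \mu_P & 0 \\ 0 & 0 & 0 & \mu_P \end{pmatrix}$$

Where:
$\lambda_P = 1 - 4p_\beta = e^{-4\beta}$, with $0 < \lambda_P < 1$
$\mu_P = 1 - 2p_\beta - 2p_\alpha = e^{-2\alpha-2\beta}$, with $0 < \mu_P < 1$

Conclusion: dimension(ADMK2P) = 2.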
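The diagonalization can be checked numerically. The sketch below assumes one particular choice of U, the normalized 4×4 Hadamard matrix (which is its own inverse), with the states ordered so that the first two are one transition pair and the last two the other; the U shown on the slide may differ. The helper names are hypothetical.

```python
import math

# Normalized Hadamard matrix: symmetric, orthogonal, hence U^{-1} = U.
U = [[0.5 * s for s in row] for row in (
    (1, 1, 1, 1),
    (1, 1, -1, -1),
    (1, -1, 1, -1),
    (1, -1, -1, 1),
)]

def matmul(X, Y):
    """Product of two square matrices given as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def k2p_matrix(alpha, beta):
    """K2P substitution matrix for rate parameters alpha, beta."""
    lam = math.exp(-4 * beta)
    mu = math.exp(-2 * alpha - 2 * beta)
    p_b = (1 - lam) / 4               # each specific transversion
    p_a = (1 + lam - 2 * mu) / 4      # the transition
    d = 1 - p_a - 2 * p_b
    return [[d, p_a, p_b, p_b],
            [p_a, d, p_b, p_b],
            [p_b, p_b, d, p_a],
            [p_b, p_b, p_a, d]]

alpha, beta = 0.3, 0.05
D = matmul(matmul(U, k2p_matrix(alpha, beta)), U)   # U^{-1} P U
diagonal = [D[i][i] for i in range(4)]
```

The diagonal comes out as (1, e^{-4β}, e^{-2α-2β}, e^{-2α-2β}) and the off-diagonal entries vanish up to rounding, in line with the two distinct non-trivial eigenvalues behind dimension(ADMK2P) = 2.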
![Page 43: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/43.jpg)
44
The functions Δλ(P) = -ln(λP) and Δμ(P) = -ln(μP) form a basis of ADK2P.

Each positive function of the form

$$\Delta(P) = -c_1 \ln(\lambda_P) - c_2 \ln(\mu_P)$$

is an SR function for the K2P model.
The standard “total rate” distance is:
ΔK2P(P)=-(ln(λP)+2ln(μP))/4=-Δlogdet(P)/4.
The “transversion only” distance is:
Δtr(P)=-ln(λP )/4.
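A short numeric check of the two named distances, assuming λP = e^{-4β} and μP = e^{-2α-2β} as given earlier; the function name `sr_distance` and the parameter values are hypothetical.

```python
import math

def sr_distance(alpha, beta, c1, c2):
    """-c1*ln(lambda_P) - c2*ln(mu_P) for the K2P matrix with rates alpha, beta."""
    lam = math.exp(-4 * beta)
    mu = math.exp(-2 * alpha - 2 * beta)
    return -c1 * math.log(lam) - c2 * math.log(mu)

alpha, beta = 0.3, 0.05
total_rate = sr_distance(alpha, beta, 0.25, 0.5)   # -(ln(lam) + 2 ln(mu)) / 4
tv_only = sr_distance(alpha, beta, 0.25, 0.0)      # -ln(lam) / 4
```

Here `total_rate` recovers α + 2β and `tv_only` recovers β, so both known distances are members of the same two-parameter family.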
![Page 44: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/44.jpg)
46
Road Map (next section)
• Optimizing Distances in the K2P model
![Page 45: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/45.jpg)
47
K2P distance estimation: where the noise comes from

u A A C A … G T C T T C G A G G C C C
v A G C A … G C C T A T G C G A C C T

1. Compute $\hat P_{uv}$, an estimation of $P_{uv}$ (inherent noise).
2. Compute $\hat\lambda(\hat P_{uv}), \hat\mu(\hat P_{uv})$, estimations of $\lambda(P_{uv}), \mu(P_{uv})$ (implied noise propagation).
3. Compute $\hat\Delta(\hat P_{uv}) = -c_1 \ln(\hat\lambda) - c_2 \ln(\hat\mu)$, an estimation of $\Delta(P_{uv}) = -c_1 \ln(\lambda) - c_2 \ln(\mu)$ (“user controlled” noise propagation).
![Page 46: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/46.jpg)
48
Selection of c1, c2

Given $P_{uv}$, we look for $c_1, c_2$ such that

$$d(u,v) = \Delta(P_{uv}) = -c_1 \ln(\lambda_{P_{uv}}) - c_2 \ln(\mu_{P_{uv}})$$

has a small expected relative error.

True distance + Expected error = Estimated distance
![Page 47: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/47.jpg)
49
Expected Relative Error = Expected error / True distance
![Page 48: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/48.jpg)
50
Minimizing the expected relative error
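Let $d = d(u,v) = \Delta(P_{uv})$ be the exact distance.
$\hat d = \hat d(u,v) = \hat\Delta(\hat P_{uv})$ is the estimated (stochastic) distance.
We would like to minimize the "Normalized Mean Square Error":

$$NMSE(\hat d) = E\left[\left(\frac{\hat d - d}{d}\right)^2\right]$$

In the plots we use

$$NRMSE = \sqrt{E\left[\left(\frac{\hat d - d}{d}\right)^2\right]}$$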
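In a simulation the expectation is replaced by an average over repeated estimates. A minimal sketch, with hypothetical function names `nmse` and `nrmse`:

```python
import math

def nmse(estimates, d):
    """Empirical normalized mean square error of estimates of the true distance d."""
    return sum(((x - d) / d) ** 2 for x in estimates) / len(estimates)

def nrmse(estimates, d):
    """Square root of the empirical NMSE."""
    return math.sqrt(nmse(estimates, d))

example = nrmse([1.1, 0.9], 1.0)
```

Because the error is divided by d, rescaling both the estimates and the true distance by the same positive constant leaves the value unchanged.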
![Page 49: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/49.jpg)
51
A basic property of Normalized Mean Square Error:
The NMSE of a distance function

$$\hat\Delta(\hat P_{uv}) = -c_1 \ln(\hat\lambda) - c_2 \ln(\hat\mu)$$

depends only on the ratio $c = c_1 / c_2$.
This means that equivalent SR functions (those that differ by a positive scalar factor) have the same NMSE.
![Page 50: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/50.jpg)
52
A Proper Disclosure on our optimal functions:

Since ln(·) is non-linear, we only find the c which minimizes the NMSE of a linear approximation of $\hat d$ (using the "delta method"), i.e. of the 1st term in the Taylor expansion of $(\hat d - d)/d$.

[Equation: closed-form expression for the optimal c of a K2P matrix, $c_{opt}$]

Hence, our approximation is imprecise when some of the (true) eigenvalues are very small.
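The delta-method optimization can be sketched numerically. This is an illustration under assumptions that are not taken from the slides: sites are independent, so the observed fractions of transition sites (S) and transversion sites (V) follow a multinomial distribution, and the estimates are λ̂ = 1 - 2V and μ̂ = 1 - 2S - V. It is not the talk's closed-form expression, and the function name `delta_method_c_opt` is hypothetical.

```python
import math

def delta_method_c_opt(alpha, beta):
    """Ratio c = c1/c2 minimizing the linearized NMSE of -c1 ln(lam) - c2 ln(mu)."""
    lam = math.exp(-4 * beta)
    mu = math.exp(-2 * alpha - 2 * beta)
    V = (1 - lam) / 2                 # expected fraction of transversion sites
    S = (1 + lam - 2 * mu) / 4        # expected fraction of transition sites
    # Per-site multinomial (co)variances of the observed fractions (S, V).
    var_s, var_v, cov_sv = S * (1 - S), V * (1 - V), -S * V
    # Gradients (d/dS, d/dV) of x1 = -ln(1 - 2V) and x2 = -ln(1 - 2S - V).
    g1 = (0.0, 2 / lam)
    g2 = (2 / mu, 1 / mu)

    def quad(g, h):
        # Linearized covariance of the two transformed estimates.
        return (g[0] * h[0] * var_s + g[1] * h[1] * var_v
                + (g[0] * h[1] + g[1] * h[0]) * cov_sv)

    q11, q12, q22 = quad(g1, g1), quad(g1, g2), quad(g2, g2)
    L, M = 4 * beta, 2 * alpha + 2 * beta      # true values of x1 and x2
    # Minimizing (w' Q w) / (w' l)^2 over w = (c1, c2) gives w proportional
    # to Q^{-1} l with l = (L, M); the ratio of its components is c.
    return (q22 * L - q12 * M) / (q11 * M - q12 * L)

c_small = delta_method_c_opt(2e-4, 1e-5)   # very small rates, alpha = 20*beta
c_large = delta_method_c_opt(1.0, 0.05)    # larger rates, alpha = 20*beta
```

Under these assumptions c approaches 1/2 (the total rate function) as the rates go to zero and grows with the rate, shifting weight toward the transversions only function.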
![Page 51: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/51.jpg)
53
Relation between c and SR functions:

[Equation: closed-form expression for $c_{opt}$, repeated from the previous slide]

| Function name | Function | c | c/(1+c) |
|---|---|---|---|
| Total rate (logdet) | -ln(λP) - 2ln(μP) | 1/2 | 1/3 |
| Transversions only | -ln(λP) | ∞ | 1 |

As $c_{opt}/(1+c_{opt})$ grows from 1/3 to 1, the optimal rate function is gradually changed from total rate to transversions only.
![Page 52: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/52.jpg)
54
[Plot: optimal values of copt/(1+copt) for ti/tv ratio = 10 (α=20β). x-axis: total substitution rate, 0 to 3; y-axis: C1/(C1+C2), 0 to 1.]
As the rate grows, the relative weight of the “transversion” coefficient increases.
![Page 53: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/53.jpg)
55
[Plot: optimal values of c1/(c1+c2) for various transition/transversion rates. x-axis: total substitution rate, 0 to 3; y-axis: C1/(C1+C2), 0 to 1. Curves: α=β, α=2β, α=4β, α=20β, α=200β. An annotation marks the region α>>β, rate>2.]
![Page 54: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/54.jpg)
56
Expected Relative error of various distance functions: theoretical prediction

[Plot: R = 2. x-axis: total substitution rate, 0 to 2.5; y-axis: predicted NRMSE, 0 to 0.6. Curves: total rate, transversions, optimal.]
![Page 55: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/55.jpg)
57
Road Map (next section)
• Simulation results
![Page 56: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/56.jpg)
58
Expected Relative error of various distance functions: simulations

[Plot: R = 2. x-axis: total substitution rate, 0 to 2; y-axis: NRMSE, 0 to 0.5. Legend: standard formula (C = 0.5); 'transversions only' (C = ∞); actually used SR functions; predicted error for standard formula; predicted error for 'transversions only'; predicted error for optimal SR function. The curves are labeled total rate, transversions only and optimal; an annotation marks the “small eigenvalue distortion”.]
![Page 57: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/57.jpg)
59
Back to the K2P quartet resolution
A heuristic distance method (max-optimal) based on this talk:
Select a distance function which is optimal w.r.t. the largest of the six observed distances of the quartet (i.e., largest copt).
Recall the performance of the two known distance functions on the “template quartet”:

[Plot: performance of distance methods in resolving quartets, R=10. x-axis: quartet diameter, 0 to 2; y-axis: fraction of failures out of 10000 experiments.]
![Page 58: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/58.jpg)
60
When α≠β, the suggested heuristic performs better than both known methods.
[Figure: performance of distance methods in resolving quartets, R = 10. x-axis: quartet diameter (0 to 2); y-axis: fraction of failures out of 10000 experiments (0 to 0.7). Curves: max-optimal, standard K2P, transversions only.]
![Page 59: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/59.jpg)
61
Summary
• Adaptive approach to distance based reconstruction: adjust the distance function to the input sequences.
• Distance functions for stochastic evolutionary models are defined by SR functions.
• SR functions can be constructed by Generalized Logdet.
• When the dimension of the space of SR functions is greater than 1, the adaptive approach is applicable.
• The adaptive approach is applicable to non-trivial unified models.
• Most common models are unified.
• An analysis of the simplest non-trivial unified model, K2P, shows a significant improvement in accuracy from the adaptive approach.
![Page 60: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/60.jpg)
62
Further Research
• Prove/Disprove: For any substitution model M, all the additive functions of M are GLD functions.
• In the K2P model:
  • Define and find optimal SR functions for: two distances, quartets, general trees.
  • Find optimal SR functions for non-homogeneous model trees.
  • Find optimal SR functions for variable rates across sites.
• Find optimal SR functions for more general evolutionary models (Tamura Nei) (analytic/heuristic methods).
• Empirical/analytical study of “plugging” adaptive distances into common reconstruction algorithms (e.g., NJ).
• Study improvement in performance on real biological data.
• Devise algorithms which use distance-vectors
![Page 61: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/61.jpg)
63
![Page 62: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/62.jpg)
64
![Page 63: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/63.jpg)
65
Further research questions
• We have infinitely many additive distance functions for the K2P model.
• Which one should we use for reconstructing the tree?
• If we have the exact substitution matrices for all pairs of taxa, then all functions are equally good.
• But we have only finite sequences, whose alignments provide only estimations of the true substitution matrices.
![Page 64: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/64.jpg)
66
Distances are defined by Substitution Rate functions
For each tree path u - v - w it holds that D(u,v) + D(v,w) = D(u,w).
[Diagram: path u, v, w with edge lengths D(u,v) and D(v,w), and D(u,w) = D(u,v) + D(v,w)]
![Page 65: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/65.jpg)
67
Part 3.1: From substitution models to additive distances
![Page 66: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/66.jpg)
68
From aligned sequences to joint distribution matrices
For each pair of DNA letters, say A and G, the aligned sequences show how many times A was replaced by G. These frequencies define a joint distribution matrix F:

F =
      A     G     T     C
A    0.2   0.05  0.01  0.02
G    0.02  0.25  0.01  0.01
T    0.02  0.01  0.16  0.02
C    0.01  0.01  0.01  0.2

For example, A is aligned with G in 5% of the pairs.
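A minimal sketch of how such a joint distribution matrix could be tabulated from two aligned sequences. The function name `joint_distribution` and the toy alignment are illustrative assumptions, not part of the talk; the row/column order A, G, T, C follows the slide.

```python
from collections import Counter

ALPHABET = "AGTC"  # row/column order used on the slide

def joint_distribution(seq_u, seq_v):
    """F[i][j] = fraction of aligned sites with ALPHABET[i] in seq_u
    and ALPHABET[j] in seq_v."""
    if len(seq_u) != len(seq_v):
        raise ValueError("aligned sequences must have equal length")
    counts = Counter(zip(seq_u, seq_v))
    n = len(seq_u)
    return [[counts[(x, y)] / n for y in ALPHABET] for x in ALPHABET]

# Hypothetical toy alignment, not taken from the talk.
u = "AACAGTCTTC"
v = "AGCAGCCTAT"
F = joint_distribution(u, v)
for letter, row in zip(ALPHABET, F):
    print(letter, row)
```

Each entry is a relative frequency, so the entries of F sum to 1; with longer sequences the matrix becomes a better estimate of the true joint distribution.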
![Page 67: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/67.jpg)
69
Joint distribution matrices are converted to distances by substitution models. These models describe how DNA sequences change during evolution. The underlying tool is the Markov process, which we sketch in the following slides. Additional reading is recommended.
![Page 68: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/68.jpg)
70
species C1 C2 C3 C4 … Cm
u A A C A … G T C T T C G A G G C C C
v A G C A … G C C T A T G C G A C C T
Different biological models impose restrictions on the substitution matrices.
Our model is the Kimura 2 Parameter (K2P) model.
K2P distinguishes between two mutation types:
Transitions: A↔G, C↔T
and
Transversions: {A,G}↔{C,T}
![Page 69: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/69.jpg)
71
K2P rate matrices have the following shape:
     A   G   T   C
A    -   α   β   β
G    α   -   β   β
T    β   β   -   α
C    β   β   α   -
All transitions have rate α.
All transversions have rate β.
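The shape above can be sketched in a few lines of Python. This is only an illustration: the helper name `k2p_rate_matrix` is hypothetical, and the diagonal is filled with minus the row sum, which is the usual convention for rate matrices and is assumed here to be what the "-" entries stand for.

```python
PURINES = {"A", "G"}  # A and G are purines; C and T are pyrimidines

def k2p_rate_matrix(alpha, beta, order="AGTC"):
    """Build a K2P rate matrix: rate alpha for transitions (within purines
    or within pyrimidines), rate beta for transversions."""
    n = len(order)
    S = [[0.0] * n for _ in range(n)]
    for i, x in enumerate(order):
        for j, y in enumerate(order):
            if i == j:
                continue
            is_transition = (x in PURINES) == (y in PURINES)
            S[i][j] = alpha if is_transition else beta
        # Assumed convention: each row of a rate matrix sums to zero.
        S[i][i] = -sum(S[i])
    return S

S = k2p_rate_matrix(alpha=2.0, beta=0.5)
for row in S:
    print(row)
```

With this convention every diagonal entry equals -(α + 2β), so the total substitution rate out of any nucleotide is α + 2β.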
![Page 70: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/70.jpg)
72
Part 3.2:Distance functions for K2P
( Linear Algebra in the service of Biology)
![Page 71: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/71.jpg)
73
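Let P, Q be two matrices in K2P. Then:
$$U^{-1} P U = \operatorname{diag}(\mu_P,\ \mu_P,\ \lambda_P,\ 1)$$
$$U^{-1} Q U = \operatorname{diag}(\mu_Q,\ \mu_Q,\ \lambda_Q,\ 1)$$
$$U^{-1} PQ\, U = \operatorname{diag}(\mu_P \mu_Q,\ \mu_P \mu_Q,\ \lambda_P \lambda_Q,\ 1)$$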
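The multiplicativity of the eigenvalues can be checked numerically. The sketch below is an illustration under assumptions: the particular matrix U (columns chosen as common eigenvectors of all K2P matrices), the helper names, and the example probabilities are not from the talk.

```python
def matmul(X, Y):
    """Plain 4x4 matrix product."""
    return [[sum(X[i][k] * Y[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def k2p_matrix(p_ts, p_tv):
    """K2P substitution matrix (order A, G, T, C): p_ts is the transition
    probability, p_tv the probability of each of the two transversions."""
    d = 1.0 - p_ts - 2.0 * p_tv
    return [[d, p_ts, p_tv, p_tv],
            [p_ts, d, p_tv, p_tv],
            [p_tv, p_tv, d, p_ts],
            [p_tv, p_tv, p_ts, d]]

# Assumed choice of U: its columns are orthogonal common eigenvectors,
# ordered so that the diagonal comes out as (mu, mu, lambda, 1).
EIGVECS = [(1, -1, 1, -1), (1, -1, -1, 1), (1, 1, -1, -1), (1, 1, 1, 1)]
U = [[EIGVECS[j][i] for j in range(4)] for i in range(4)]
U_INV = [[EIGVECS[i][j] / 4 for j in range(4)] for i in range(4)]  # U^T / 4

def diagonalize(P):
    return matmul(matmul(U_INV, P), U)

P = k2p_matrix(0.10, 0.05)   # mu_P = 0.7,  lambda_P = 0.8
Q = k2p_matrix(0.20, 0.02)   # mu_Q = 0.56, lambda_Q = 0.92
D_PQ = diagonalize(matmul(P, Q))
print([round(D_PQ[i][i], 6) for i in range(4)])
```

In this parametrization μ = 1 - 2·p_ts - 2·p_tv and λ = 1 - 4·p_tv, so the diagonal of the product should consist of the products μ_P μ_Q and λ_P λ_Q, followed by 1.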
![Page 72: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/72.jpg)
74
$$U^{-1} P U = \operatorname{diag}(\lambda_2(P),\ \lambda_3(P),\ \lambda_1(P),\ 1)$$
![Page 73: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/73.jpg)
75
$$U^{-1} P U = \operatorname{diag}(\lambda_P,\ \lambda_P,\ \lambda_P,\ 1)$$
![Page 74: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/74.jpg)
76
The joint distribution of each pair of vertices provides an approximation of the substitution matrices.
[Diagram: path u, v, w with sequences ACGGTCA, ACGGATA, GGGGATT at its vertices and substitution matrices P_uv and P_vw on its edges]
The common theme of all projects: Start with input sequences for two or more taxa. Find a distance function which minimizes the inaccuracy (noise) introduced by the sampling process.
![Page 75: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/75.jpg)
79
     A   G   C   T
A    -   α   β   β
G    α   -   β   β
C    β   β   -   α
T    β   β   α   -
![Page 76: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/76.jpg)
80
     A    G    C    T
A    -    α'   β'   β'
G    α'   -    β'   β'
C    β'   β'   -    α'
T    β'   β'   α'   -
![Page 77: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/77.jpg)
81
K2P model tree = <Tree topology> + <Uniform distribution of DNA at all vertices> + <K2P instantaneous rate matrices at edges>
[Diagram: tree with root r, a vertex v, an edge labeled R_uv, the sequence ACGGATA, and the label 25%]
![Page 78: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/78.jpg)
82
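K2P substitution matrix (rows and columns ordered A, G, T, C):
$$P = \begin{pmatrix}
1 - p_\alpha - 2p_\beta & p_\alpha & p_\beta & p_\beta \\
p_\alpha & 1 - p_\alpha - 2p_\beta & p_\beta & p_\beta \\
p_\beta & p_\beta & 1 - p_\alpha - 2p_\beta & p_\alpha \\
p_\beta & p_\beta & p_\alpha & 1 - p_\alpha - 2p_\beta
\end{pmatrix}$$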
![Page 79: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/79.jpg)
83
      A      G      T      C
A    1-3p    p      p      p
G     p     1-3p    p      p
T     p      p     1-3p    p
C     p      p      p     1-3p
![Page 80: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/80.jpg)
84
![Page 81: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/81.jpg)
85
K2P model tree = <Tree topology> + <Uniform distribution of DNA at all vertices> + <K2P instantaneous rate matrices at edges>
Uniform distribution: A, G, C, T each with probability 0.25.
![Page 82: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/82.jpg)
86
K2P rate matrices have the following shape:
     A   G   T   C
A    -   α   β   β
G    α   -   β   β
T    β   β   -   α
C    β   β   α   -
All transitions have rate α.
All transversions have rate β.
![Page 83: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/83.jpg)
87
Given sequences at two adjacent vertices, we define the edge length in two steps:
[Diagram: adjacent vertices u and v with sequences …TCTGGGA… and …GGGGATT…]
First, align the sequences:
vertices C1 C2 C3 C4 … Cm
u A A C A … G T C T T C G A G G C C C
v A G C A … G C C T A T G C G A C C T
![Page 84: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/84.jpg)
88
Natural evolutionary distance: Total substitution rate
[Diagram: path u, v, w; one edge carries a K2P rate matrix with rates (α, β), the other a K2P rate matrix with rates (α', β')]
Rate matrix with rates (α, β); the matrix with rates (α', β') has the same shape:
     A   G   C   T
A    -   α   β   β
G    α   -   β   β
C    β   β   -   α
T    β   β   α   -
Each edge is associated with a time t and a K2P rate matrix S.
The total substitution rate along an edge of length t is t(α + 2β).
Total substitution rate between species = sum of the rates over the path connecting them.
Total substitution rates are exact distances, which we try to reconstruct from observing the joint distribution of sequences at u and v.
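A small numerical check of this additivity, as a sketch only. It assumes the standard closed-form K2P substitution probabilities and the classical K2P distance formula, -¼ ln(1 - 4·p_tv) - ½ ln(1 - 2·p_ts - 2·p_tv), where p_ts is the transition probability and p_tv the probability of each single transversion. The function names and the example rates are illustrative.

```python
import math

def k2p_probs(alpha, beta, t):
    """Closed-form K2P probabilities after time t:
    (transition probability, probability of each single transversion)."""
    lam = math.exp(-4.0 * beta * t)
    mu = math.exp(-2.0 * (alpha + beta) * t)
    return (1.0 + lam - 2.0 * mu) / 4.0, (1.0 - lam) / 4.0

def compose(edge1, edge2):
    """Probabilities of the matrix product P_uv * P_vw of two K2P matrices."""
    (a1, b1), (a2, b2) = edge1, edge2
    d1, d2 = 1 - a1 - 2 * b1, 1 - a2 - 2 * b2
    p_ts = d1 * a2 + a1 * d2 + 2 * b1 * b2
    p_tv = (d1 + a1) * b2 + b1 * (d2 + a2)
    return p_ts, p_tv

def total_rate(p_ts, p_tv):
    """Classical K2P distance: recovers t * (alpha + 2 * beta)."""
    lam = 1 - 4 * p_tv
    mu = 1 - 2 * p_ts - 2 * p_tv
    return -0.25 * math.log(lam) - 0.5 * math.log(mu)

# Hypothetical edges: (alpha, beta, t) differ between the two edges.
uv = k2p_probs(2.0, 0.5, 0.1)    # total rate 0.1 * (2.0 + 1.0)
vw = k2p_probs(1.0, 0.25, 0.3)   # total rate 0.3 * (1.0 + 0.5)
uw = compose(uv, vw)
print(total_rate(*uv), total_rate(*vw), total_rate(*uw))
```

With exact substitution matrices the distance for the path u, v, w is the sum of the two edge distances; the rest of the talk is about what happens when only noisy estimates of these matrices are available.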
![Page 85: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/85.jpg)
89
How do we estimate D_K2P(u,v)?
vertices C1 C2 C3 C4 … Cm
u A A C A … G T C T T C G A G G C C C
v A G C A … G C C T A T G C G A C C T
Our input is a pair of aligned sequences at u and v. They can be used to estimate the probability that a nucleotide X in u will be replaced by a nucleotide Y in v.
![Page 86: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/86.jpg)
90
vertices C1 C2 C3 C4 … Cm
u A A C A … G T C T T C G A G G C C C
v A G C A … G C C T A T G C G A C C T
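First step in distance estimation:
Estimate P_uv from the joint distributions (Maximum Likelihood):
$$\hat P_{uv} = \begin{pmatrix}
1 - p_\alpha - 2p_\beta & p_\alpha & p_\beta & p_\beta \\
p_\alpha & 1 - p_\alpha - 2p_\beta & p_\beta & p_\beta \\
p_\beta & p_\beta & 1 - p_\alpha - 2p_\beta & p_\alpha \\
p_\beta & p_\beta & p_\alpha & 1 - p_\alpha - 2p_\beta
\end{pmatrix}$$
(rows and columns ordered A, G, T, C)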
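One way this estimation step might look in code is sketched below. It assumes that, under K2P with a uniform base distribution, the two parameters of the substitution matrix are estimated by the observed proportions of transition and transversion sites; the function name and the toy alignment are illustrative assumptions.

```python
PURINES = {"A", "G"}

def estimate_k2p(seq_u, seq_v):
    """Estimate (p_ts, p_tv) from an alignment: p_ts is the proportion of
    transition sites, p_tv the proportion of transversion sites divided by
    two (there are two possible transversions from each nucleotide)."""
    if len(seq_u) != len(seq_v):
        raise ValueError("aligned sequences must have equal length")
    n = len(seq_u)
    transitions = transversions = 0
    for x, y in zip(seq_u, seq_v):
        if x == y:
            continue
        if (x in PURINES) == (y in PURINES):
            transitions += 1
        else:
            transversions += 1
    return transitions / n, transversions / (2 * n)

# Hypothetical toy alignment, not taken from the talk.
p_ts, p_tv = estimate_k2p("AACAGTCTTC", "AGCAGCCTAT")
print(p_ts, p_tv)
```

The two numbers fill the off-diagonal entries of the estimated matrix, and the diagonal follows as 1 - p_ts - 2·p_tv.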
![Page 87: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/87.jpg)
91
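$$\hat P_{uv} = \begin{pmatrix}
1 - p_\alpha - 2p_\beta & p_\alpha & p_\beta & p_\beta \\
p_\alpha & 1 - p_\alpha - 2p_\beta & p_\beta & p_\beta \\
p_\beta & p_\beta & 1 - p_\alpha - 2p_\beta & p_\alpha \\
p_\beta & p_\beta & p_\alpha & 1 - p_\alpha - 2p_\beta
\end{pmatrix}$$
(rows and columns ordered A, G, T, C)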
![Page 88: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/88.jpg)
92
Substitution matrix is estimated by the observed difference between the sequences.
[Diagram: tree with sequences ACCGTTG, TCTGGGA, ACGGGTA, ACCCGTG, TCTGGTA at its vertices and observed difference counts on its edges]
• Errors in distance estimations are amplified when:
  • The rate is small: the signal is too weak (in extreme cases, there are no substitutions whatsoever).
  • The rate is large: recent substitutions overwrite older ones.
![Page 89: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/89.jpg)
93
K2P model tree = <Tree topology> + <Uniform distribution of DNA at all vertices> + <K2P instantaneous rate matrices at edges>
[Diagram: tree with root r, a vertex v, an edge labeled R_uv, the sequence ACGGATA, and the label 25%]
![Page 90: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/90.jpg)
94
How reliable
Consider “balanced” quartets. Define the “quartet ratio” to be the ratio between the middle edge and two external edges.
![Page 91: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/91.jpg)
95
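The rate matrix S_uv implies a stochastic substitution matrix P_uv:
$$P_{uv} = \exp(S_{uv})$$
$$P_{uv} = \begin{pmatrix}
1 - p_\alpha - 2p_\beta & p_\alpha & p_\beta & p_\beta \\
p_\alpha & 1 - p_\alpha - 2p_\beta & p_\beta & p_\beta \\
p_\beta & p_\beta & 1 - p_\alpha - 2p_\beta & p_\alpha \\
p_\beta & p_\beta & p_\alpha & 1 - p_\alpha - 2p_\beta
\end{pmatrix}$$
(rows and columns ordered A, G, T, C)
[Diagram: edge (u, v) labeled with S_uv and P_uv]
P_uv defines the joint distribution of the sequences at u, v.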
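The relation between the rate matrix and the substitution matrix can be illustrated with a plain power-series matrix exponential. This is a sketch: the truncated series is adequate only for small matrices with moderate entries, the example matrix is hypothetical, and the time t is assumed to be already absorbed into S (so the entries are αt = 0.3 and βt = 0.1).

```python
import math

def matmul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def expm_series(S, terms=30):
    """Truncated power series exp(S) = I + S + S^2/2! + ..."""
    n = len(S)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[x / k for x in row] for row in matmul(term, S)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    return result

# Hypothetical K2P rate matrix (order A, G, T, C), time already absorbed.
a, b = 0.3, 0.1
S = [[-(a + 2 * b), a, b, b],
     [a, -(a + 2 * b), b, b],
     [b, b, -(a + 2 * b), a],
     [b, b, a, -(a + 2 * b)]]
P = expm_series(S)

# Standard closed form for comparison.
lam, mu = math.exp(-4 * b), math.exp(-2 * (a + b))
print(P[0][1], (1 + lam - 2 * mu) / 4)   # transition probability
print(P[0][2], (1 - lam) / 4)            # single transversion probability
```

The result has the K2P shape shown above: one value for all transitions, one for all transversions, and rows summing to 1.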
![Page 92: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/92.jpg)
97
Performance of the standard distance method in reconstructing the split from estimated distances
For a quartet tree T with split AB|CD, separation w_sep and diameter diam:
$$d_T(A,B) + d_T(C,D) + 2 w_{sep} = d_T(A,C) + d_T(B,D) = d_T(A,D) + d_T(B,C)$$
• Distance based 4-point method (FPM): choose the split AB|CD when
$$\hat d(A,B) + \hat d(C,D) < \min\{\hat d(A,C) + \hat d(B,D),\ \hat d(A,D) + \hat d(B,C)\}$$
Reconstruction will fail if this inequality is violated.
[Diagram: the three possible quartet topologies on A, B, C, D, with w_sep, diam and margins of ½ w_sep marked]
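The four-point rule can be sketched as follows. The function name, the dictionary layout of the distances, and the numeric estimates are illustrative assumptions.

```python
def four_point_method(dist):
    """Pick the quartet split whose two within-side distances have the
    smallest sum. `dist` maps frozenset({x, y}) to an estimated distance."""
    def d(x, y):
        return dist[frozenset((x, y))]
    sums = {
        "AB|CD": d("A", "B") + d("C", "D"),
        "AC|BD": d("A", "C") + d("B", "D"),
        "AD|BC": d("A", "D") + d("B", "C"),
    }
    return min(sums, key=sums.get)

# Hypothetical noisy estimates for a quartet whose true split is AB|CD.
estimates = {
    frozenset("AB"): 2.1, frozenset("CD"): 1.9,
    frozenset("AC"): 2.6, frozenset("BD"): 2.4,
    frozenset("AD"): 2.5, frozenset("BC"): 2.7,
}
split = four_point_method(estimates)
print(split)
```

The rule recovers the true split as long as the noise in the estimated sums stays small relative to the separation w_sep; larger noise can make one of the competing sums the smallest.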
![Page 93: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/93.jpg)
98
[Diagram: rooted quartet on taxa A, B, C, D with edge lengths t and 10t]
![Page 94: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/94.jpg)
99
Minimizing the expected relative error
Since ln(·) is non-linear, we only find the c which minimizes the NMSE of a linear approximation of the estimated distance (using the "delta method").
[Equations: the NMSE of the linearized estimate, followed by a closed-form expression for the optimal value c_opt]
![Page 95: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/95.jpg)
Distance based methods: The general scheme
- Compute distances between all taxon-pairs (this talk)
- Find a tree (edge-weighted) best-describing the distances

D =
 0
 3  0
 9  8  0
15 14 18  0
17 16 20 22  0
16 15 19 21  9  0

[Diagram: edge-weighted tree fitted to D]
![Page 96: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/96.jpg)
101
Phylogenetic Reconstruction
[Diagram: input sequences AATCCTG, ATAGCTG, AATGGGC, GAACGTA, AAACCGA, ACCGTTG, TCTGGGA, TCCGGAA, AGCCGTG, GGGGATT]
![Page 97: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/97.jpg)
Adaptive distance based algorithm for the K2P model
Find constants {c_1, c_2} s.t. the SR function
$$\Delta_c(P) = -c_1 \ln(\lambda_P) - c_2 \ln(\mu_P)$$
is best for the input, and compute the distance matrix D with
$$D(i,j) = \Delta_c(s_i, s_j)$$
![Page 98: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/98.jpg)
Distance based methods: The general scheme
- Compute distances between all taxon-pairs (this talk)
- Find a tree (edge-weighted) best-describing the distances

D =
 0
 3  0
 9  8  0
15 14 18  0
17 16 20 22  0
16 15 19 21  9  0

[Diagram: edge-weighted tree fitted to D]
![Page 99: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/99.jpg)
Distance based methods: An adaptive scheme
- Find a distance function d which is good for the input (this work)
- Compute distances between all taxon-pairs: D(i,j) = d(s_i, s_j)
- Find a tree (edge-weighted) best-describing the distances
[Diagram: distance matrix D and the edge-weighted tree fitted to it]
![Page 100: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/100.jpg)
Promotion: Make distance based methods adaptive
D(i,j) = d(s_i, s_j)
![Page 101: 1 Towards optimal distance functions for stochastic substitution models Ilan Gronau, Shlomo Moran, Irad Yavneh Technion, Israel](https://reader037.vdocuments.site/reader037/viewer/2022110116/55172538550346f5558b5a21/html5/thumbnails/101.jpg)
106
Summary of previous slides:
SR functions for K2P are of the form:
$$\Delta_c(P) = -c_1 \ln(\lambda_P) - c_2 \ln(\mu_P)$$
$c = c_1 / c_2$ gives the weight the function puts on the transversions.
Next we show how this weight is affected by the total substitution rate and the transition/transversion ratio.
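A sketch of this family of distance functions, assuming the form -c1·ln(λ_P) - c2·ln(μ_P) with λ_P = 1 - 4·p_tv (the eigenvalue that depends only on transversions) and μ_P = 1 - 2·p_ts - 2·p_tv. The function names and the example probabilities are illustrative; the choice c1 = ¼, c2 = ½ (so c = c1/c2 = 0.5) is assumed to correspond to the standard K2P formula, and c2 = 0 to a 'transversions only' distance.

```python
import math

def k2p_eigenvalues(p_ts, p_tv):
    """Return (lambda, mu) for a K2P substitution matrix with transition
    probability p_ts and single-transversion probability p_tv."""
    return 1 - 4 * p_tv, 1 - 2 * p_ts - 2 * p_tv

def sr_distance(p_ts, p_tv, c1, c2):
    """Assumed SR-function form: -c1 * ln(lambda) - c2 * ln(mu)."""
    lam, mu = k2p_eigenvalues(p_ts, p_tv)
    return -c1 * math.log(lam) - c2 * math.log(mu)

p_ts, p_tv = 0.10, 0.05                         # hypothetical estimates
standard = sr_distance(p_ts, p_tv, 0.25, 0.5)   # c = 0.5: total rate
tv_only = sr_distance(p_ts, p_tv, 0.25, 0.0)    # ignores mu entirely
print(standard, tv_only)
```

Every choice of (c1, c2) gives an additive function on exact substitution matrices; the choices differ in how strongly sampling noise in the two eigenvalues is amplified, which is what the adaptive approach exploits.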