Sampling distributions of alleles under models of neutral evolution
Plan
1. Genetic drift and mutation
2. Coalescent
3. Pairwise differences and numbers of segregating sites
4. Population with time-varying size
Mathematical model for sampling distributions of alleles
Genetic drift, mutation
Genetic drift
Alleles: A1, A2
Replication = sampling with replacement
A1 becomes fixed, A2 becomes lost
[Figure: allele composition of successive generations G1, G2, ..., Gn]
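The rule "replication = sampling with replacement" can be sketched in a few lines of Python. This is a minimal illustration and not code from the lecture: the function name, the number of gene copies (20), the starting count (10) and the random seed are all assumptions chosen for the example.

```python
import random

def wright_fisher(n_copies, start_a1, rng, max_generations=100_000):
    """Follow the count of allele A1 among n_copies gene copies until fixation or loss."""
    count = start_a1
    generations = 0
    while 0 < count < n_copies and generations < max_generations:
        freq = count / n_copies
        # Each copy in the next generation picks its parent at random, with replacement,
        # so it carries A1 with probability equal to the current A1 frequency.
        count = sum(1 for _ in range(n_copies) if rng.random() < freq)
        generations += 1
    return count, generations

rng = random.Random(1)  # hypothetical seed, only for reproducibility
final_count, generations = wright_fisher(n_copies=20, start_a1=10, rng=rng)
outcome = "fixed" if final_count == 20 else "lost"
print(f"A1 {outcome} after {generations} generations")
```

With no mutation, every run ends with A1 either fixed or lost; only the waiting time and the outcome vary from run to run.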
Mutation
[Figure: generations Gk and Gk+1]
Mutation introduces genetic variability to the evolution process
Mutation
Mutation follows a Poisson process with intensity μ, measured per locus (or per site) per generation. A mutation model further specifies where mutations occur and what effect they have. The most commonly applied models are:
infinite sites model: each mutation occurs at a DNA site that has never mutated before;
infinite alleles model: each mutation produces an allele never before present in the population;
recurrent mutation model: the nucleotide at a site can change more than once;
stepwise mutation model: mutation acts in both directions, increasing or reducing the number of repeats of a fixed DNA motif.
Infinite sites model
Mutation configuration in the infinite sites model is fully described by a map between sequence numbers and mutation numbers, that is, a record of which sequences carry which mutations.
[Figure: sequences 1 to 5 (rows) against mutations 1 to 6 (columns)]
Statistics of mutations (segregating sites)
Number of segregating sites
[Figure: sequences 1 to 5 against mutations 1 to 6]
S = 6
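As a small sketch of how S is counted, the code below uses a hypothetical 0/1 matrix (rows are sequences, columns are mutations). The matrix is an assumption made up for illustration; it is not the data from the slide figure, although it was chosen so that it also gives S = 6.

```python
# Hypothetical 0/1 matrix: rows are sequences 1..5, columns are mutations 1..6.
# An entry of 1 means the sequence carries the mutation. Not the slide's data.
SEQUENCES = [
    [1, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1, 1],
]

def segregating_sites(matrix):
    """Count the columns in which both states (0 and 1) occur in the sample."""
    n = len(matrix)
    return sum(1 for column in zip(*matrix) if 0 < sum(column) < n)

S = segregating_sites(SEQUENCES)
print("S =", S)
```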
Pairwise differences
[Figure: sequences 1 to 5 against mutations 1 to 6, with sequences 2 and 3 compared]
Number of differences d23 = 3
Average number of pairwise differences = 3
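Pairwise differences can be computed by comparing every pair of sequences site by site. The sketch below uses a hypothetical 0/1 matrix, an assumption made up for illustration and not the data of the slide figure; it was chosen so that d23 and the average agree with the values quoted above.

```python
from collections import Counter
from itertools import combinations

# Hypothetical 0/1 matrix: rows are sequences 1..5, columns are mutations 1..6.
SEQUENCES = [
    [1, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1, 1],
]

def pairwise_differences(matrix):
    """Return {(i, j): d_ij} for all pairs i < j, using 1-based sequence labels."""
    result = {}
    for (i, a), (j, b) in combinations(enumerate(matrix, start=1), 2):
        result[(i, j)] = sum(x != y for x, y in zip(a, b))
    return result

d = pairwise_differences(SEQUENCES)
average = sum(d.values()) / len(d)
histogram = Counter(d.values())  # number of pairs for each number of differences

print("d23 =", d[(2, 3)])
print("average =", average)
print("histogram:", sorted(histogram.items()))
```

The `histogram` counter is the tabulated form of a histogram of pairwise differences: for each number of differences it stores how many of the 10 pairs show it.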
Histogram of pairwise differences
[Figure: histogram, number of pairs against number of differences (0 to 6)]
Classes of mutations
[Figure: sequences 1 to 5 against mutations 1 to 6, with a mutation of class 2 marked]
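A short sketch of how mutation classes can be tabulated, assuming that the class of a mutation is the number of sequences in the sample that carry it (the slides do not spell the definition out). The 0/1 matrix is hypothetical and is not the data of the slide figure.

```python
from collections import Counter

# Hypothetical 0/1 matrix: rows are sequences 1..5, columns are mutations 1..6.
SEQUENCES = [
    [1, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 1, 1],
]

def mutation_classes(matrix):
    """Return the class of each mutation: the number of sequences that carry it."""
    return [sum(column) for column in zip(*matrix)]

classes = mutation_classes(SEQUENCES)
counts = Counter(classes)
# Relative frequency of each class among all mutations in the sample.
frequencies = {b: counts[b] / len(classes) for b in sorted(counts)}
print(classes)
print(frequencies)
```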
Histogram of classes of mutations
[Figure: histogram, frequency against class of mutation (classes 1 and 2)]
Coalescence method
One looks into the past of an n-sample of sequences taken at present. The events that can happen in the past are coalescences, which lead to common ancestors of sequences, and mutations along the branches of the ancestral tree.
Coalescence method
[Figure: an n-sample traced from the present into the past through generation 1 (τ=1), generation 2 (τ=2), ..., generation k (τ=k); the population size is 2N in every generation]
Coalescence – pairwise statistics
Take two sequences. For each sequence draw a parent at random in generation 1 (τ=1), then for each parent draw a (grand)parent at random in generation 2 (τ=2), and so on, until the two lineages meet in a common ancestor.
P_2(i): probability that the common ancestor of the two sequences lived in generation i (τ=i)
$P_2(1) = \frac{1}{2N}$
$P_2(2) = \left(1 - \frac{1}{2N}\right)\frac{1}{2N}$
$P_2(k) = \left(1 - \frac{1}{2N}\right)^{k-1}\frac{1}{2N}$
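The parent-drawing procedure can be simulated directly. The sketch below is an illustration only: the value 2N = 50, the seed and the function names are assumptions. It traces two lineages back until they pick the same parent, which happens with probability 1/(2N) in each generation, so the waiting time is geometric with mean 2N.

```python
import random

def coalescence_generation(two_n, rng):
    """Trace two lineages back; return the generation of their common ancestor."""
    generation = 1
    # In every generation both lineages draw a parent uniformly among two_n gene copies.
    while rng.randrange(two_n) != rng.randrange(two_n):
        generation += 1
    return generation

def geometric_pmf(k, two_n):
    """P_2(k) = (1 - 1/(2N))**(k - 1) * 1/(2N)."""
    return (1 - 1 / two_n) ** (k - 1) / two_n

rng = random.Random(7)  # hypothetical seed
two_n = 50              # hypothetical population size, in gene copies
samples = [coalescence_generation(two_n, rng) for _ in range(10_000)]
mean_generation = sum(samples) / len(samples)
print("simulated mean:", mean_generation, "theoretical mean:", two_n)
```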
Coalescence – continuous time approximations
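Population time scale: 1 unit = 2N generations
$t = \frac{\tau}{2N}, \quad p_2(t) = e^{-t}$
Mutational time scale: 1 unit = 1/(2μ) generations
$t = 2\mu\tau, \quad \theta = 4\mu N, \quad p_2(t) = \frac{1}{\theta} e^{-t/\theta}$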
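The passage from the discrete generations to the continuous densities can be spelled out as follows. This is a sketch assuming large N, with τ the pairwise coalescence time in generations, μ the mutation intensity and θ = 4μN.

```latex
\Pr[\tau > k] = \left(1 - \frac{1}{2N}\right)^{k}
\quad\Longrightarrow\quad
\Pr\!\left[\frac{\tau}{2N} > t\right]
  = \left(1 - \frac{1}{2N}\right)^{\lfloor 2Nt \rfloor}
  \;\xrightarrow[N \to \infty]{}\; e^{-t},
\qquad p_2(t) = e^{-t}.

% Mutational time scale: t = 2\mu\tau = \theta \cdot \frac{\tau}{2N}
\Pr[\,2\mu\tau > t\,] = \Pr\!\left[\frac{\tau}{2N} > \frac{t}{\theta}\right]
  \;\longrightarrow\; e^{-t/\theta},
\qquad p_2(t) = \frac{1}{\theta}\, e^{-t/\theta}.
```

On the population scale the mean coalescence time of a pair is 1 unit (2N generations); on the mutational scale the same mean equals θ.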
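Coalescence: n-sample
s_k: independent, exponentially distributed random variables
μ: mutation intensity
N: population's effective size
θ = 4μN: product parameter
t = 2μτ: mutational time scale (τ is time in number of generations)
$p(s_2, \ldots, s_n) = \prod_{k=2}^{n} \frac{\binom{k}{2}}{\theta} \exp\left(-\frac{\binom{k}{2}}{\theta} s_k\right)$
$p(s_k) = \frac{\binom{k}{2}}{\theta} \exp\left(-\frac{\binom{k}{2}}{\theta} s_k\right)$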
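A minimal sketch of drawing the times of an n-sample, reading s_k as the length of the period during which the sample has k ancestors, exponential with rate C(k,2)/θ on the mutational time scale. The values n = 5 and θ = 2, the seed and the function names are assumptions for illustration.

```python
import random
from math import comb

def sample_intervals(n, theta, rng):
    """Draw s_n, ..., s_2 independently; s_k is the time spent with k ancestors."""
    # s_k is exponential with rate comb(k, 2) / theta, i.e. mean theta / comb(k, 2).
    return {k: rng.expovariate(comb(k, 2) / theta) for k in range(n, 1, -1)}

def expected_height(n, theta):
    """Sum of the means theta / comb(k, 2) over k = 2..n; equals 2 * theta * (1 - 1/n)."""
    return sum(theta / comb(k, 2) for k in range(2, n + 1))

rng = random.Random(3)  # hypothetical seed
n, theta = 5, 2.0       # hypothetical sample size and product parameter
heights = [sum(sample_intervals(n, theta, rng).values()) for _ in range(20_000)]
mean_height = sum(heights) / len(heights)
print("simulated mean tree height:", mean_height)
print("expected tree height:", expected_height(n, theta))
```

The tree height is the sum of all s_k, that is, the time back to the most recent common ancestor of the whole sample.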
Coalescence method
The use of coalescence theory allows efficient formulation of appropriate models and gives a good basis for approaching model analysis problems, such as hypothesis testing or parameter estimation.
[Figure: ancestral tree of sequences 1 to 5, with coalescence times t2, ..., t5 and intervals between coalescences s2, ..., s5]
Independence of metrics (coalescence times) and topology
Topologies of trees (with ordered branches) are all equally probable.
Metrics (distributions of branch lengths) of trees are determined by the coalescence process, which in turn depends on population parameters.
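One way to see why topology carries no information about population parameters: in the standard coalescent, at each coalescence event the pair of lineages that merges is chosen uniformly at random, whatever the times are. The sketch below draws such a merge order; the sample size, the seed and the function name are assumptions for illustration.

```python
import random

def random_topology(n, rng):
    """Return the list of merges; each merge joins two uniformly chosen lineages."""
    lineages = [(i,) for i in range(1, n + 1)]  # each lineage is a tuple of leaf labels
    merges = []
    while len(lineages) > 1:
        a, b = rng.sample(lineages, 2)  # every pair is equally likely
        lineages.remove(a)
        lineages.remove(b)
        lineages.append(tuple(sorted(a + b)))
        merges.append((a, b))
    return merges

rng = random.Random(11)  # hypothetical seed
merges = random_topology(5, rng)
for a, b in merges:
    print(a, "+", b)
```

No population parameter appears in this function. Branch lengths would be drawn separately, which is the independence stated above.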
Coalescence – statistics of pairwise differences
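Assume the mutational time scale. Then mutations occur with intensity 1/2 along each lineage. Let A2 denote a random variable with values in Z+ (the nonnegative integers), defined as the number of segregating sites between sequence 1 and sequence 2, and let T be the random variable given by their coalescence time t. The two lineages have total length 2t, so conditionally on T = t, A2 is Poisson with λ = t:
$P[A_2 = n \mid T = t] = e^{-t}\frac{t^n}{n!}$
Averaging over T, which has density $\frac{1}{\theta}e^{-t/\theta}$ on this time scale, gives a geometric distribution:
$P[A_2 = n] = \frac{1}{1+\theta}\left(\frac{\theta}{1+\theta}\right)^n$
The probability generating function of A2 is
$\varphi_2(s) = \sum_{n=0}^{\infty} s^n P[A_2 = n]$
$\varphi_2(s \mid T = t) = e^{t(s-1)}$
$\varphi_2(s) = \frac{1}{1+\theta(1-s)} = \frac{1}{1+\theta}\cdot\frac{1}{1 - \frac{\theta}{1+\theta}s}$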
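The geometric law can be checked by simulation: draw T from the exponential density with mean θ, then draw the number of differences as Poisson with mean T. The sketch below uses only the standard library; θ = 3, the seed and the function names are assumptions for illustration.

```python
import random

def sample_differences(theta, rng):
    """Draw A2: first T ~ exponential with mean theta, then A2 | T = t ~ Poisson(t)."""
    t = rng.expovariate(1 / theta)
    # A Poisson(t) variable is the number of arrivals of a unit-rate process in [0, t].
    count, clock = 0, rng.expovariate(1.0)
    while clock <= t:
        count += 1
        clock += rng.expovariate(1.0)
    return count

def geometric_pmf(n, theta):
    """P[A2 = n] = 1/(1 + theta) * (theta/(1 + theta))**n."""
    return (theta / (1 + theta)) ** n / (1 + theta)

rng = random.Random(5)  # hypothetical seed
theta = 3.0             # hypothetical product parameter
draws = [sample_differences(theta, rng) for _ in range(50_000)]
mean_differences = sum(draws) / len(draws)
for n in range(4):
    print(n, draws.count(n) / len(draws), geometric_pmf(n, theta))
print("mean number of differences:", mean_differences)
```

The mean of this geometric distribution is θ, so the average number of pairwise differences is a natural estimator of θ under constant population size.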
Coalescence – population with time varying size
Population with time-varying size
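The population's effective size N(t) changes in time, so the product parameter is also a function of time: θ(t) = 4μN(t).
Joint probability density function of the coalescence times:
$p(t_2, \ldots, t_n) = \prod_{k=2}^{n} \frac{\binom{k}{2}}{\theta(t_k)} \exp\left(-\int_{t_{k+1}}^{t_k} \frac{\binom{k}{2}}{\theta(\sigma)}\, d\sigma\right), \quad 0 = t_{n+1} \le t_n \le \cdots \le t_3 \le t_2.$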
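A sketch of how this joint density could be evaluated numerically for a given history θ(t). It assumes that t_k is the time, measured from the present into the past, at which k lineages merge into k - 1, with t_{n+1} = 0. The trapezoidal integration, the example times and the two θ(t) functions are hypothetical choices made for illustration.

```python
from math import comb, exp, log

def log_density(times, theta, steps=2000):
    """Log of p(t_2, ..., t_n) for coalescence times given as {k: t_k}.

    t_k is the time from the present at which k lineages merge into k - 1,
    so t_n <= ... <= t_2; theta is a function of time.
    """
    n = max(times)
    total = 0.0
    lower = 0.0  # t_{n+1} = 0 is the present
    for k in range(n, 1, -1):
        upper = times[k]
        # Trapezoidal approximation of the integral of 1/theta over [lower, upper].
        h = (upper - lower) / steps
        integral = sum(
            0.5 * h * (1 / theta(lower + i * h) + 1 / theta(lower + (i + 1) * h))
            for i in range(steps)
        )
        total += log(comb(k, 2) / theta(upper)) - comb(k, 2) * integral
        lower = upper
    return total

times = {4: 0.3, 3: 0.9, 2: 2.5}  # hypothetical coalescence times for n = 4
constant = log_density(times, lambda t: 2.0)
# Hypothetical history in which theta gets smaller further into the past.
shrinking_into_past = log_density(times, lambda t: 2.0 * exp(-0.5 * t))
print(constant, shrinking_into_past)
```

For constant θ the integral reduces to (t_k - t_{k+1})/θ, and the density becomes the product of exponential densities of the intervals between coalescences.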
How is the history of population size N(t), or equivalently θ(t), encoded in histograms of pairwise differences and of mutation classes?
Pairwise differences I, II, III
[Figures: three population-size scenarios. Each slide shows the history θ(t) against time t and the corresponding histogram of pairwise differences (frequency against number of differences).]
Mutation classes
Frequencies are computed under the assumption that mutation intensity is low.
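For the constant-size case there is a well-known reference point: under the neutral infinite sites model, the expected relative frequency of mutations of class b (carried by b of the n sampled sequences, with the ancestral state known) is proportional to 1/b. The sketch below tabulates these values. It is not necessarily how the frequencies on the slides were computed, and n = 11 is a hypothetical sample size.

```python
def constant_size_class_frequencies(n):
    """Expected relative frequencies of classes 1..n-1 under constant size: proportional to 1/b."""
    weights = [1 / b for b in range(1, n)]
    total = sum(weights)
    return [w / total for w in weights]

freqs = constant_size_class_frequencies(11)  # hypothetical sample of 11 sequences
for b, q in enumerate(freqs, start=1):
    print(b, round(q, 3))
```

Departures from this 1/b shape, such as an excess of class 1 mutations, are what make the class histogram informative about changes in N(t).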
Mutation classes I, II, III
[Figures: three population-size scenarios. Each slide shows the history N(t) against time t and the corresponding frequencies of SNP types 1 to 10. Labels on the slides: N(t)=const; N(t)=N0exp(rt), N0r=10.]
Conclusions
Different histories of population sizes lead to different sampling distributions of alleles
Parametric models of different form (exponential, stepwise, logistic) can lead to similar (difficult to distinguish) distributions of alleles
Estimation of population size history from DNA data can be unstable
Models versus data
Parametric and nonparametric estimation of population size histories from DNA samples
Testing hypotheses on values of parameters under parametric models; testing hypotheses of time-constant versus time-varying scenarios
Models versus data
[Figure: histogram of pairwise differences and estimated history of population size]
Data on worldwide distribution of mtDNA pairwise differences, R. Cann et al. 1987
Estimation of history of human population size
Models versus data II
[Figure: histogram of classes of mutations]
Histogram of classes of mutations. Data on worldwide distribution of mtDNA pairwise differences, R. Cann et al. 1987
Models versus data III
Data on types of 44 SNPs randomly located in the genome, Picoult, Newberg 2000
[Figures: two plots]
Parametric estimates of N(t) based on the above data