genome evolution
Genome Evolution © Amos Tanay, The Weizmann Institute
Genome evolution
Lecture 2:
population genetics I: models and drift
Studying Populations
Models:
A set of individuals, genomes
Ancestry relations or hierarchies
Experiments:
Field studies, diversity/genotyping
Experimental evolution
Åland Islands, Glanville fritillary population
mtDNA human migration patterns
Human population
Growth:
Year | -10,000 | 0 | 1750 | 1950 | 2010
Estimate (Millions) | 1 | 252 | 771 | 2526 | 6055
The Data: the HapMap project
1 million SNPs (single nucleotide polymorphisms)
4 populations:
30 trios (parents/child) from Nigeria (Yoruba - YRI)
30 trios (parents/child) from Utah (CEU)
45 Han Chinese (Beijing)
44 Japanese (Tokyo)
Haplotyping: each SNP/individual.
Not just determining heterozygosity/homozygosity: haplotyping completely resolves the genotypes (phasing)
Because of linkage, the partial SNP map largely determines all other SNPs!
The idea is that a group of “tag SNPs” can be used to represent all genetic variation in the human population.
This is extremely important in association studies that look for the genetic cause of disease.
Correlation on SNPs between populations
Modeling population: the Wright-Fisher model
[Figure: haploid model, 2N alleles (1, 2, 3, 4, ..., 2N) in generation t and again in generation t+1; diploid model, N_f females and N_m males in generation t and again in generation t+1]
The Hardy-Weinberg Model
• Diploid organisms: two copies of each allele/gene/base; homozygous / heterozygous
• Sexual reproduction: mating haplotypes
• Large population, no migration: fixed size, closed system
• Non-overlapping generations: synchronous process; not as bad as it may look
• Random mating: the new generation is selected from the existing haplotypes with replacement
• No mutations, no selection (will add these later)
The Hardy-Weinberg Model
Hardy-Weinberg equilibrium:
$$P(A) = p, \quad P(a) = q$$
$$P(AA) = p^2, \quad P(Aa) = 2pq, \quad P(aa) = q^2$$
[Figure: genotypes AA, Aa, aa contribute alleles A and a, which pair up into AA, Aa, aa in the next generation. Labels: random mating, non-overlapping generations]
With the model assumptions, equilibrium is reached within one generation
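As a quick numeric check of the one-generation claim, here is a minimal Python sketch. The function names and the starting genotype frequencies are illustrative assumptions, not part of the lecture.

```python
def genotype_freqs(p):
    """Hardy-Weinberg genotype frequencies (AA, Aa, aa) for allele frequency p."""
    q = 1.0 - p
    return (p * p, 2 * p * q, q * q)


def allele_freq(f_AA, f_Aa, f_aa):
    """Frequency of allele A: all of AA plus half of the heterozygotes."""
    return f_AA + f_Aa / 2.0


def next_generation(genotypes):
    """Random mating: pool the alleles, then pair them independently."""
    return genotype_freqs(allele_freq(*genotypes))


# Hypothetical starting population that is NOT in equilibrium.
start = (0.5, 0.2, 0.3)
gen1 = next_generation(start)
gen2 = next_generation(gen1)
print(gen1)
print(gen2)
```

Under these assumptions the genotype frequencies stop changing after the first round of random mating, while the allele frequency p never changes at all.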
Frequency estimates
We will be dealing with estimation of allele frequencies.
To remind you, when sampling n times from a population with an allele of frequency p, the number of times the allele is observed is distributed as a binomial variable:
$$B(n;p)(i) = \binom{n}{i} p^i (1-p)^{n-i}$$
This can be further approximated using a normal distribution:
$$B(n;p) \approx N(np,\; np(1-p))$$
When estimating the frequency out of the number of successes we therefore have an error that looks like:
$$s = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
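The error formula can be tried out with a few lines of Python. This is only a sketch; the function name and the sample counts (30 copies of the allele in 100 draws) are made up for illustration.

```python
import math


def allele_freq_estimate(successes, n):
    """Return the estimated frequency p_hat and its standard error.

    The standard error is sqrt(p_hat * (1 - p_hat) / n), following the
    normal approximation to the binomial.
    """
    p_hat = successes / n
    std_err = math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat, std_err


# Hypothetical sample: the allele was seen 30 times in 100 draws.
p_hat, std_err = allele_freq_estimate(30, 100)
print(p_hat, std_err)
```

Because the error shrinks like one over the square root of n, quadrupling the sample size only halves it.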
Testing Hardy-Weinberg using chi-square statistics
HW oversimplifies everything, but it can be used as a baseline to test whether interesting evolution is going on for some allele
A classical example is the blood group genotypes M/N (Sanger 1975). This genotype determines the expression of a polysaccharide on red blood cell surfaces, so it was quantifiable before the genomic era:
Genotype | Observed | HW
MM | 298 | 294.3
MN | 489 | 496
NN | 213 | 209.3
The HW column is the expectation under:
$$P(AA) = p^2, \quad P(Aa) = 2pq, \quad P(aa) = q^2$$
$$\chi^2 = \sum \frac{(obs - exp)^2}{exp} = 0.22$$
Chi-square significance can be computed from the chi-square distribution with df degrees of freedom.
Here: df = #classes - #parameters - 1 = 3 (MM/MN/NN) - 1 (p) - 1 = 1
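The M/N numbers can be reproduced with a short Python sketch. The function name is an assumption of this example. The p-value uses the identity that a chi-square variable with one degree of freedom is a squared standard normal, so only the standard library is needed.

```python
import math


def hw_chi_square(n_mm, n_mn, n_nn):
    """Chi-square test of Hardy-Weinberg proportions for one biallelic locus."""
    n = n_mm + n_mn + n_nn
    p = (2 * n_mm + n_mn) / (2 * n)   # frequency of allele M, estimated from the data
    q = 1.0 - p
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_mm, n_mn, n_nn)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of chi-square with df = 1: P(Z^2 > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return expected, chi2, p_value


expected, chi2, p_value = hw_chi_square(298, 489, 213)
print([round(e, 1) for e in expected], round(chi2, 2), round(p_value, 2))
```

The expected counts and the statistic agree with the values quoted above, and the p-value is far from any usual significance threshold, so these data give no reason to reject HW.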
Wright-Fisher model for genetic drift
We follow the frequency of an allele in the population, until fixation (f=2N) or loss (f=0)
We can model the frequency as a Markov process on a variable X (the number of A alleles) with transition probabilities:
$$T_{ij} = \binom{2N}{j} \left(\frac{i}{2N}\right)^{j} \left(1 - \frac{i}{2N}\right)^{2N-j}$$
Sampling j alleles from a population of 2N alleles, i of which are A.
In a larger population the frequency would change more slowly (the variance of the sampled frequency is pq/2N, so sampling wouldn't change it that much)
States: 0 (loss), 1, ..., 2N-1, 2N (fixation)
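The transition probabilities can be tabulated directly. The sketch below is illustrative: the function name and the tiny population of 2N = 6 alleles are assumptions chosen to keep the matrix readable.

```python
from math import comb


def transition_matrix(two_n):
    """Wright-Fisher transition matrix T[i][j] for a population of two_n alleles.

    Row i is the binomial distribution of the number of A alleles in the next
    generation, given i copies of A in the current one.
    """
    rows = []
    for i in range(two_n + 1):
        p = i / two_n
        rows.append([comb(two_n, j) * p ** j * (1.0 - p) ** (two_n - j)
                     for j in range(two_n + 1)])
    return rows


T = transition_matrix(6)
for row in T:
    print(" ".join(f"{x:.3f}" for x in row))
```

The first and last rows put all their mass on the diagonal, which is what makes loss and fixation absorbing states.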
Coalescent and fixation
Drift and fixation probability
Theorem (fixation in drift): In the Wright-Fisher model, the probability that allele A becomes fixed, given a population of 2N alleles out of which i are A, is:
$$P_i(X_\infty = 2N) = \frac{i}{2N}$$
Proof: The mean of the binomial sample in the n’th step is np:
$$E(X_{n+1} \mid X_n = i) = 2N \cdot \frac{i}{2N} = i = X_n$$
Which means that the expected number of A’s is constant in time. Intuitively:
$$i = E_i(X_\infty) = 2N \cdot P_i(X_\infty = 2N)$$
Since 0 and 2N are absorbing states, given sufficient time, the Wright-Fisher process will converge to either 0 or 2N. Define:
$$\tau = \min\{n : X_n = 0 \text{ or } X_n = 2N\}$$
More formally:
$$i = E_i(X_n) = E_i(X_\tau;\, \tau \le n) + E_i(X_n;\, \tau > n) = E_i(X_\tau) + o(1)$$
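The theorem can be checked numerically by pushing a starting distribution through the Wright-Fisher chain until almost all the mass sits in the absorbing states. This is a sketch under assumed parameters (2N = 10, 500 generations); the function name is hypothetical.

```python
from math import comb


def fixation_probability(i, two_n, generations=500):
    """Probability mass at state two_n after many Wright-Fisher generations,
    starting from exactly i copies of allele A."""
    states = range(two_n + 1)
    # Binomial transition probabilities T[k][j].
    T = [[comb(two_n, j) * (k / two_n) ** j * (1.0 - k / two_n) ** (two_n - j)
          for j in states] for k in states]
    dist = [0.0] * (two_n + 1)
    dist[i] = 1.0
    for _ in range(generations):
        # One generation: new_dist[j] = sum_k dist[k] * T[k][j].
        dist = [sum(dist[k] * T[k][j] for k in states) for j in states]
    return dist[two_n]


for i in (1, 3, 5):
    print(i, fixation_probability(i, 10))
```

For this small population the unfixed mass decays geometrically, so a few hundred generations are more than enough for the result to match i/2N closely.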
Figure 7.4
Drift
Experiments with drifting fly populations: 107 Drosophila melanogaster populations. Each consisted originally of 16 brown eye (bw) heterozygotes. At each generation, 8 males and 8 females were selected at random from the progeny of the previous generation. The bars show the distribution of allele frequencies in the 107 populations
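The geometric distribution: reminder
Rolling a die, and recording the time until the first appearance of k (waiting time)
$$P(T = j) = (1-p)^{j-1} p$$
Lack of memory:
$$P(T > t_2 \mid T > t_1) = P(T > t_2 - t_1)$$
Moments:
$$E(T) = 1/p, \quad Var(T) = \frac{1-p}{p^2}$$
“Intersection” (S is geometric with parameter p', independent of T):
$$\min(S,T) \sim geo(p + p' - pp')$$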
Coalescence
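Coalescence at time -1?
$$P = \frac{1}{2N}$$
Coalescence at time -t?
$$P = \left(1 - \frac{1}{2N}\right)^{t-1} \frac{1}{2N}$$
No coalescence for k samples?
$$P = \frac{2N-1}{2N} \cdot \frac{2N-2}{2N} \cdots \frac{2N-(k-1)}{2N} = \prod_{i=1}^{k-1}\left(1 - \frac{i}{2N}\right) = 1 - \sum_{i=1}^{k-1}\frac{i}{2N} + O\left(\frac{1}{N^2}\right) = 1 - \binom{k}{2}\frac{1}{2N} + O\left(\frac{1}{N^2}\right)$$
Distribution of time from k to k-1:
$$P(T_k = t) = \left(1 - \binom{k}{2}\frac{1}{2N}\right)^{t-1} \binom{k}{2}\frac{1}{2N}$$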
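To see how good the first-order approximation is, the exact product can be compared with 1 - C(k,2)/2N. The sketch below uses hypothetical values (k = 5 samples, 2N = 20,000 alleles), and the function names are assumptions of this example.

```python
def no_coalescence_exact(k, two_n):
    """Exact probability that k sampled alleles pick k distinct parents."""
    prob = 1.0
    for i in range(1, k):
        prob *= 1.0 - i / two_n
    return prob


def no_coalescence_approx(k, two_n):
    """First-order approximation: 1 - C(k, 2) / 2N."""
    return 1.0 - (k * (k - 1) / 2) / two_n


exact = no_coalescence_exact(5, 20000)
approx = no_coalescence_approx(5, 20000)
print(exact, approx, exact - approx)
```

The gap between the two values is of order 1/N^2, so the approximation is safe as long as the sample is small compared with the population.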
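The exponential distribution: reminder
The limit of the geometric distribution when the time step goes to 0:
$$P(U \le t) = 1 - e^{-at}$$
Density:
$$\frac{d\,P(U \le t)}{dt} = a e^{-at}$$
Moments:
$$E(U) = 1/a, \quad Var(U) = \frac{1}{a^2}$$
“Intersection” (V is exponential with rate b, independent of U):
$$\min(U,V) \sim Exp(a+b), \quad P(U < V) = \frac{a}{a+b}$$
Memoryless! In a short time step $\delta t$ the event occurs with probability $P = a\,\delta t$.
With time step 1/M and probability p = a/M per step:
$$P(T > tM) = \sum_{j > tM} (1-p)^{j-1} p = \left(1 - \frac{a}{M}\right)^{tM} \to e^{-at}$$
[Figure: probability per time step, shown for M=2 and M=4]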
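The limit from the geometric to the exponential distribution can be watched numerically. In this sketch the rate a = 0.5 and the time t = 2 are arbitrary illustrative choices, and the function name is hypothetical.

```python
import math


def geometric_tail(a, t, m):
    """P(no event up to time t) when time is cut into m steps per unit,
    each step carrying event probability a / m."""
    return (1.0 - a / m) ** (t * m)


a, t = 0.5, 2.0
for m in (2, 4, 100, 10 ** 6):
    print(m, geometric_tail(a, t, m))
print("limit", math.exp(-a * t))
```

As M grows, the geometric tail climbs toward the exponential value, which is the sense in which the exponential is the continuous-time version of the geometric waiting time.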
The continuous time coalescent
When looking at k individuals, we can trace their coalescent backwards and ask when they had k-1, k-2, ..., or one common ancestor.
When sampling k new individuals, the chance of picking the same parent twice is roughly:
$$\binom{k}{2}\frac{1}{2N} + O\left(\frac{1}{N^2}\right) = \frac{k(k-1)}{2} \cdot \frac{1}{2N} + O\left(\frac{1}{N^2}\right)$$
Theorem: The amount of time during which there are k lineages, $T_k$, has approximately an exponential distribution with mean $2N \cdot \frac{2}{k(k-1)}$
Proof: the probability of not merging k lineages in n generations is:
$$\left(1 - \frac{k(k-1)}{2} \cdot \frac{1}{2N}\right)^{n} \approx \exp\left(-\frac{k(k-1)}{2} \cdot \frac{n}{2N}\right)$$
Which is like an exponential $e^{-\lambda t}$ with rate $\lambda = k(k-1)/4N$
The expected value is:
$$E(T_k) = \frac{1}{\lambda} = \frac{4N}{k(k-1)}$$
This is correct for any k, so going backward from the present time, we can estimate the time to coalescence at each step
[Figure: coalescent tree of samples 1 to 5, from present to past, with $E(T_5) = 2N/10$, $E(T_4) = 2N/6$, $E(T_3) = 2N/3$, $E(T_2) = 2N$]
The coalescent
The expected time to the common ancestor of a sample of n individuals:
$$E(T) = \sum_{k=2}^{n} \frac{4N}{k(k-1)} = 4N \sum_{k=2}^{n} \left(\frac{1}{k-1} - \frac{1}{k}\right) = 4N\left(1 - \frac{1}{n}\right)$$
4N is the magic number
Theorem: The probability that the most recent common ancestor of a sample of size n is the same as that of the population converges to (n-1)/(n+1) as the population size increases.
[Figure: coalescent tree of samples 1 to 5, from present to past]
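The telescoping sum is easy to confirm numerically. The sketch below assumes a hypothetical population with N = 1000 and a few sample sizes; the function name is not from the lecture.

```python
def expected_tmrca(n, pop_n):
    """Expected time (in generations) to the most recent common ancestor of
    n sampled alleles: the sum of the expected waiting times 4N / (k(k-1))."""
    return sum(4.0 * pop_n / (k * (k - 1)) for k in range(2, n + 1))


N = 1000
for n in (2, 5, 50):
    print(n, expected_tmrca(n, N), 4 * N * (1 - 1 / n))
```

Note that the last step, from two lineages to one, takes 2N generations on average, which is at least half of the total no matter how large the sample is.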
Diffusion approximation and Kimura’s solution
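Fisher, and then Kimura, approximated the drift process using a diffusion equation (heat equation):
$\phi(x,t)$: the density of populations with frequency x..x+dx at time t
$J(x,t)$: the flux of probability at time t and frequency x
The change in the density equals the difference between the fluxes J(x,t) and J(x+dx,t); taking dx to the limit we have:
$$\frac{\partial \phi(x,t)}{\partial t} = -\frac{\partial}{\partial x} J(x,t)$$
Then, if M(x) is the mean change in allele frequency when the frequency is x, and V(x) is the variance of that change, the probability flux equals:
$$J(x,t) = M(x)\phi(x,t) - \frac{1}{2}\frac{\partial}{\partial x}\left[V(x)\phi(x,t)\right]$$
$$\frac{\partial \phi(x,t)}{\partial t} = -\frac{\partial}{\partial x}\left[M(x)\phi(x,t)\right] + \frac{1}{2}\frac{\partial^2}{\partial x^2}\left[V(x)\phi(x,t)\right]$$
(Heat diffusion, Fokker-Planck, Kolmogorov forward equation)
For drift without selection:
$$M = 0, \quad V(x) = \frac{x(1-x)}{2N}$$
$$\frac{\partial \phi(x,t)}{\partial t} = \frac{1}{4N}\frac{\partial^2}{\partial x^2}\left[x(1-x)\phi(x,t)\right]$$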
Diffusion approximation and Kimura’s solution
Fisher, and then Kimura, approximated the drift process using a diffusion equation (heat equation). We start by working with a time step dt and a frequency step dx
$\phi(x,t)$: the probability that the population has allele frequency x at time t
$M(x)$: the probability that the frequency increased from x by dx, due to mutation/selection
$V(x)/2$: the probability of a dx increase, or of a dx decrease, due to drift
We limit changes to those from t to t+dt and from x to x±dx. The population can be at x at time t+dt if:
It was at x and stayed there: $\phi(x,t)\,(1 - M(x) - V(x))$
It was at x-dx and moved to x: $\phi(x-dx,t)\,(M(x-dx) + V(x-dx)/2)$
It was at x+dx and moved to x: $\phi(x+dx,t)\,V(x+dx)/2$
$$\phi(x,t+dt) - \phi(x,t) = -\left[M(x)\phi(x,t) - M(x-dx)\phi(x-dx,t)\right] + \frac{1}{2}\left[V(x-dx)\phi(x-dx,t) - V(x)\phi(x,t)\right] + \frac{1}{2}\left[V(x+dx)\phi(x+dx,t) - V(x)\phi(x,t)\right]$$
Taking the limit:
$$\frac{\partial \phi(x,t)}{\partial t} = -\frac{\partial}{\partial x}\left[M(x)\phi(x,t)\right] + \frac{1}{2}\frac{\partial^2}{\partial x^2}\left[V(x)\phi(x,t)\right]$$
Diffusion approximation and Kimura’s solution
For drift the variance is binomial:
$$V(x) = \frac{x(1-x)}{2N}$$
And we assume no selection:
$$M(x) = 0$$
Still not easy to solve analytically…
Changes in allele frequencies, Wright-Fisher model
After about 4N generations, just 10% of the cases are not fixed and the distribution becomes flat.
Absorption time and Time to fixation
According to Kimura’s solution, the mean time for allele fixation, assuming initial frequency p and assuming it was not lost, is:
$$\hat{t}_1(p) = -\frac{1}{p}\, 4N (1-p) \log(1-p)$$
The mean time for allele loss is (the fixation time of the complement event):
$$\hat{t}_0(p) = -4N \frac{p}{1-p} \log(p)$$
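Both formulas are simple to evaluate. The following Python sketch uses hypothetical function names and an assumed N = 1000; it shows the new-mutation limit (p close to 0) and the symmetry between loss and fixation.

```python
import math


def mean_fixation_time(p, pop_n):
    """Mean generations to fixation, conditional on fixation, from frequency p."""
    # log1p(-p) is log(1 - p), computed accurately for small p.
    return -4.0 * pop_n * (1.0 - p) * math.log1p(-p) / p


def mean_loss_time(p, pop_n):
    """Mean generations to loss, conditional on loss, from frequency p."""
    return -4.0 * pop_n * p * math.log(p) / (1.0 - p)


N = 1000
print(mean_fixation_time(1e-6, N))   # a rare allele that does fix needs about 4N generations
print(mean_fixation_time(0.5, N), mean_loss_time(0.5, N))
```

The loss time from frequency p equals the fixation time from frequency 1-p, which is the "complement event" remark made above.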
Effective population size
4N generations looks like a huge number (in a population of billions!)
But in fact, the Wright-Fisher model (like the Hardy-Weinberg model) is based on many unrealistic assumptions, including random mating: any two individuals can mate
The effective population size is defined as the size of an idealized population for which the predicted dynamics of changes in allele frequency are similar to the observed ones
For each measurable statistic of population dynamics, a different effective population size can be computed
For example, the expected variance in allele frequency is expressed as:
$$V(p_{t+1}) = \frac{p_t(1-p_t)}{2N}$$
But we can use the same formula to define the effective population size given the variance:
$$V(p_{t+1}) = \frac{p_t(1-p_t)}{2N_e}$$
Effective population size: changing populations
If the population is changing over time, the dynamics will be affected by the harmonic mean of the sizes:
$$\frac{1}{N_e} = \frac{1}{t}\left(\frac{1}{N_0} + \frac{1}{N_1} + \cdots + \frac{1}{N_{t-1}}\right)$$
So the effective population size is dominated by the size of the smallest bottleneck
Bottlenecks can occur during migration, environmental stress, isolation
Such effects greatly decrease heterozygosity (founder effect, for example Tay-Sachs in “ashkenazim”)
Bottlenecks can accelerate fixation of neutral or even deleterious mutations as we shall see later.
Human effective population size in the recent 2My is estimated around 10,000 (due to bottlenecks). (so when was our T1?)
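A small numeric example makes the bottleneck effect concrete. The population sizes below are invented for illustration, and the function name is an assumption of this sketch.

```python
def effective_size(sizes):
    """Effective population size over time: the harmonic mean of per-generation sizes."""
    return len(sizes) / sum(1.0 / n for n in sizes)


# Hypothetical history: three generations at 1000 and one bottleneck at 10.
history = [1000, 1000, 10, 1000]
print(effective_size(history))           # harmonic mean, pulled toward the bottleneck
print(sum(history) / len(history))       # arithmetic mean, for comparison
```

In this made-up history a single generation at size 10 drags the effective size below 40, although the arithmetic mean is above 750.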
Effective population size: unequal sex ratio, and sex chromosomes
If there are more females than males, or there are fewer males participating in reproduction, then the effective population size will be smaller:
$$N_e = \frac{4 N_m N_f}{N_m + N_f}$$
($N_m N_f$: any combination of alleles from a male and a female)
So if there are 10 times more females in the population, the effective population size is 4*x*10x/(11x) = 40x/11 ≈ 3.6x, much less than the size of the population (11x).
Another example is the X chromosome, which is present in only one copy in males.
$$N_e = \frac{9 N_m N_f}{4 N_m + 2 N_f}$$
$$p = \frac{1}{3}p_m + \frac{2}{3}p_f, \quad Var(p) = \frac{1}{9}\cdot\frac{p_m q_m}{N_m} + \frac{4}{9}\cdot\frac{p_f q_f}{2N_f}$$
$$p_m = p_f = p: \quad Var(p) = pq\left(\frac{1}{9N_m} + \frac{4}{18N_f}\right) = \frac{pq}{2} \cdot \frac{4N_m + 2N_f}{9 N_m N_f}$$
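The two formulas can be compared side by side in a few lines of Python. The function names and the head counts are illustrative assumptions only.

```python
def ne_autosomal(n_m, n_f):
    """Effective size for autosomes with n_m breeding males and n_f breeding females."""
    return 4.0 * n_m * n_f / (n_m + n_f)


def ne_x_chromosome(n_m, n_f):
    """Effective size for the X chromosome (one copy in males, two in females)."""
    return 9.0 * n_m * n_f / (4.0 * n_m + 2.0 * n_f)


# Equal sex ratio: autosomal Ne equals the census size, X-linked Ne is 3/4 of it.
print(ne_autosomal(500, 500), ne_x_chromosome(500, 500))
# Ten times more females than males (x = 100): Ne is about 3.6x, far below 11x.
print(ne_autosomal(100, 1000))
```

With equal numbers of males and females the X-linked effective size is three quarters of the autosomal one, matching the 3:4 ratio of X chromosomes to autosomes in such a population.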
Population genetics
Drift: The process by which allele frequencies are changing through generations
Mutation: The process by which new alleles are being introduced
Recombination: the process by which multi-allelic genomes are mixed
Selection: the effect of fitness on the dynamics of allele drift
Epistasis: the drift effects of fitness dependencies among different alleles
“Organismal” effects: Ecology, Geography, Behavior