Population Genetics II (Selection + Haplotype analyses)
Gurinder Singh “Mickey” Atwal, Center for Quantitative Biology
26th Oct 2015
Natural Selection Model (Molecular Evolution)
[Figure: one generation of selection. Embryos have allele frequency p; after selection, adults have allele frequency p'.]
Genotype of C57BL/6J mice   LIF         Implantation sites   Number of recovered blastocysts   n
Male      Female            injection   (Average±SE)         (Average±SE)
+/+       +/+               -           8.4±0.5              0                                 5
-/-       -/-               -           2.7±0.8              3.2±0.6                           6
-/-       -/-               +           7±0.8                0.6±0.6                           3
[Figure: implantation sites on day 5 after fertilization of egg, for p53+/+, p53-/-, and p53-/- + LIF injection.]
Example of natural selection in mice
Hu et al (2007)
Hardy-Weinberg Law
• Consider 2 alleles (A, a)
• Allele frequency of A = p
• Allele frequency of a = q = 1-p
• Randomly-mating large diploid population with no mutation, migration, selection, or drift
Genotype                     AA       Aa       aa
Hardy-Weinberg Frequency     $p^2$    $2pq$    $q^2$
Fitness
Genotype                     AA                      Aa                           aa
Newborn frequency            $p^2$                   $2pq$                        $q^2$
Fitness                      $w_{AA}$                $w_{Aa}$                     $w_{aa}$
Relative fitness             $w_{AA}/w_{AA} = 1$     $w_{Aa}/w_{AA} = 1 - hs$     $w_{aa}/w_{AA} = 1 - s$
Frequency after selection    $p^2 (1/\bar{w})$       $2pq(1 - hs)(1/\bar{w})$     $q^2(1 - s)(1/\bar{w})$
s = selection coefficient (relative viability of AA over aa)
h = heterozygous effect
$\bar{w}$ = mean relative fitness
Mean Relative Fitness of Population
mean fitness = $w = p^2 w_{AA} + 2pq\,w_{Aa} + q^2 w_{aa}$
mean relative fitness = $\bar{w} = w / w_{AA}$
$\bar{w} = 1 - 2pqhs - q^2 s$
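The table and the mean relative fitness formula can be checked numerically. The following is a minimal sketch, assuming the relative fitnesses 1, 1 - hs, and 1 - s for AA, Aa, and aa; the function name `selection_step` is hypothetical and not from the slides.

```python
def selection_step(p, s, h):
    """Apply one generation of selection to the frequency p of allele A.

    Assumed relative fitnesses: AA = 1, Aa = 1 - h*s, aa = 1 - s.
    Returns (frequency of A after selection, mean relative fitness).
    """
    q = 1.0 - p
    # Mean relative fitness: p^2 * 1 + 2pq * (1 - hs) + q^2 * (1 - s)
    w_bar = 1.0 - 2.0 * p * q * h * s - q * q * s
    # Genotype frequencies after selection
    freq_AA = p * p / w_bar
    freq_Aa = 2.0 * p * q * (1.0 - h * s) / w_bar
    # Each Aa individual carries one copy of A
    p_next = freq_AA + 0.5 * freq_Aa
    return p_next, w_bar


p_next, w_bar = selection_step(p=0.5, s=0.1, h=0.5)
print(p_next, w_bar)
```

With s = 0 the function returns p unchanged and a mean relative fitness of 1, which is the Hardy-Weinberg case.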
$\bar{w} \le 1$
Genetic Load = $L = 1 - \bar{w}$
$0 \le L \le 1$
Heterozygous advantage
h=0: A dominant, a recessive
h=1: A recessive, a dominant
0<h<1: incomplete dominance
h<0: overdominance
h>1: underdominance
h determines the equilibrium allele frequency p
s determines how fast the equilibrium is achieved
Fundamental Theorem of Natural Selection
Change of mean fitness is proportional to additive genetic variance
R. Fisher, 1958
$\bar{w}' - \bar{w} = \frac{p q s^2}{2\bar{w}}$ (additive case, h = 1/2)
$\bar{w}'$ = mean relative fitness in next generation
Types of selection
• Directional selection (0<h<1)
  – causes p to go to 1
  – conventional Darwinian natural selection
• Balancing selection (h<0)
  – causes p to go to some equilibrium value $p_e$
  – e.g. heterozygous variant of HBB gene confers resistance to malaria pathogen (Plasmodium falciparum)
• Disruptive selection (h>1)
  – if $p < p_e$ then p goes to 0
  – if $p > p_e$ then p goes to 1
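The three regimes can be explored by iterating the one-generation update. This is a rough sketch, assuming relative fitnesses 1, 1 - hs, and 1 - s for AA, Aa, and aa. The interior equilibrium $p_e = (1 - h)/(1 - 2h)$ used below is obtained by setting the change in p to zero in that model; it is derived here for illustration and is not stated on the slides. All function names are hypothetical.

```python
def next_p(p, s, h):
    """One generation of selection on the frequency p of allele A."""
    q = 1.0 - p
    w_bar = 1.0 - 2.0 * p * q * h * s - q * q * s
    return (p * p + p * q * (1.0 - h * s)) / w_bar


def iterate(p0, s, h, generations):
    """Apply next_p repeatedly, starting from frequency p0."""
    p = p0
    for _ in range(generations):
        p = next_p(p, s, h)
    return p


def interior_equilibrium(h):
    """Interior equilibrium p_e; lies between 0 and 1 only for h < 0 or h > 1."""
    return (1.0 - h) / (1.0 - 2.0 * h)


# Directional selection: p goes to 1
print(iterate(p0=0.01, s=0.1, h=0.5, generations=2000))
# Balancing selection: p goes to p_e regardless of the start
print(iterate(p0=0.1, s=0.1, h=-1.0, generations=2000), interior_equilibrium(-1.0))
# Disruptive selection: outcome depends on which side of p_e the start lies
print(iterate(p0=0.3, s=0.1, h=2.0, generations=2000))
print(iterate(p0=0.4, s=0.1, h=2.0, generations=2000))
```

Note that $p_e$ depends only on h, while s only sets the speed of approach, in line with the statement that h determines the equilibrium and s determines how fast it is reached.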
Example of human directional selection
The FY*O allele, a variant in the promoter of the Duffy antigen gene that confers resistance to Plasmodium vivax malaria, is prevalent and even fixed in many African populations
P C Sabeti et al. Science 2006;312:1614-1620
What about drift?
• Very important in small populations.
• Depends on the relative sizes of s and 1/(2N)
e.g. allele A has a selective advantage over allele a with selection coefficient s
$w_{aa}/w_{AA} = 1 - s$
In an initial population consisting entirely of aa genotypes, probability of new mutant A fixing
$= \frac{1 - e^{-s}}{1 - e^{-2Ns}}$
In an initial population consisting entirely of AA genotypes, probability of new mutant a fixing
$= \frac{e^{s} - 1}{e^{2Ns} - 1} > 0$
Therefore, even deleterious alleles can become fixed in a small population!
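The two fixation probabilities are the same expression evaluated at +s and -s, so one function covers both. A minimal sketch follows; the function name and the sign convention (negative s for a deleterious mutant) are assumptions made for illustration. `math.expm1` is used to keep precision when s is small.

```python
import math


def fixation_probability(s, N):
    """Probability that a single new mutant fixes in a population of size N.

    s > 0: mutant is advantageous; s < 0: mutant is deleterious.
    Evaluates (1 - exp(-s)) / (1 - exp(-2*N*s)).
    """
    if s == 0.0:
        return 1.0 / (2.0 * N)  # neutral limit of the expression as s -> 0
    try:
        # expm1(x) = exp(x) - 1; the two minus signs cancel in the ratio
        return math.expm1(-s) / math.expm1(-2.0 * N * s)
    except OverflowError:
        # Strongly deleterious mutant in a large population: effectively zero
        return 0.0


for N in (10, 100, 1000):
    print(N, fixation_probability(0.01, N), fixation_probability(-0.01, N))
```

The deleterious case stays positive for every N but shrinks rapidly as N grows, which is the sense in which drift matters most in small populations.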
Detecting Natural Selection in the Human Genome
Choice of selection test depends on the time scale of evolution
e.g. McDonald-Kreitman test
e.g. Tajima's D test
P C Sabeti et al. Science 2006;312:1614-1620
HAPLOTYPE STUDIES
Haplotype
• Sequence of contiguous SNP alleles on a chromatid
• Hard to determine directly across whole genome
• Usually only the genotypes are provided, giving ambiguous haplotypes
• Haplotypes usually inferred (“phased”) by statistical computation
• Newer experimental methods can directly phase haplotypes, but are costly
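To see why genotypes alone give ambiguous haplotypes, one can enumerate every haplotype pair consistent with an unphased genotype. The sketch below is a brute-force illustration only, not a phasing method; the function name is hypothetical, and it assumes each genotype is written as two alleles separated by "/" with no failed calls.

```python
from itertools import product


def compatible_haplotype_pairs(genotype):
    """List the unordered haplotype pairs consistent with unphased genotypes.

    genotype: list of strings such as "C/T", one entry per SNP.
    """
    sites = [tuple(g.split("/")) for g in genotype]
    pairs = set()
    # At each site, choose which allele goes on the first chromatid
    for choice in product((0, 1), repeat=len(sites)):
        hap1 = "".join(site[c] for site, c in zip(sites, choice))
        hap2 = "".join(site[1 - c] for site, c in zip(sites, choice))
        pairs.add(tuple(sorted((hap1, hap2))))
    return sorted(pairs)


# Three heterozygous SNPs: four possible haplotype pairs
print(compatible_haplotype_pairs(["C/T", "A/G", "A/C"]))
# All homozygous: the haplotypes are unambiguous
print(compatible_haplotype_pairs(["T/T", "G/G", "A/A"]))
```

An individual with k heterozygous SNPs has 2^(k-1) compatible pairs, so the ambiguity grows quickly with the number of sites, which is why statistical phasing is used in practice.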
                         SNPS
Cell Lines / Patients    1       2       3       4       5       6       7       8       9       10
6023                     T/T     G/G     A/A     C/C     A/A     A/A     C/C     G/G     C/C     G/G
6031                     T/T     G/G     A/A     C/C     A/A     A/A     C/G     G/G     C/C     G/G
6032                     C/C     A/A     C/C     T/T     C/C     C/C     C/G     A/A     G/G     A/A
6033                     C/T     A/G     A/C     C/T     A/C     A/C     C/G     A/G     C/G     A/G
6034                     T/T     G/G     A/A     C/C     A/A     A/A     C/G     G/G     C/C     G/G
6046                     T/T     G/G     A/A     C/C     A/A     A/A     C/G     G/G     C/C     G/G
6047                     C/T     A/G     A/C     C/T     A/C     A/C     C/C     A/G     C/G     A/G
6048                     C/T     A/G     A/C     C/T     A/C     A/C     C/G     A/G     C/G     A/G
6053                     C/C     A/A     A/A     T/T     C/C     C/C     C/G     A/A     G/G     A/A
6054                     T/T     G/G     A/A     C/C     A/A     A/A     C/G     G/G     C/C     G/G
6055                     C/T     A/G     A/C     failed  A/C     A/C     C/G     A/G     C/G     A/G
6056                     C/T     A/G     A/C     C/T     A/C     A/C     C/G     A/G     C/G     A/G
6057                     C/T     A/G     A/C     C/T     A/C     A/C     C/G     A/G     C/G     A/G
6060                     C/T     A/G     A/C     C/T     A/C     A/C     failed  A/G     C/G     A/G
6061                     C/C     A/A     C/C     T/T     C/C     C/C     C/G     A/A     G/G     A/A
6067                     T/T     G/G     A/A     C/C     A/A     A/A     C/C     G/G     C/C     G/G
6077                     T/T     G/G     A/A     C/C     A/A     A/A     C/C     G/G     C/C     G/G
6078                     T/T     G/G     A/A     C/C     A/A     A/A     C/C     G/G     C/C     G/G
6079                     C/T     A/G     A/C     C/T     A/C     A/C     C/G     A/G     C/G     A/G
6080                     C/T     A/G     A/A     C/T     A/C     A/A     C/C     A/G     C/G     A/G
6081                     T/T     G/G     A/A     C/C     A/A     A/A     C/G     G/G     C/C     G/G
6089                     T/T     G/G     A/A     C/C     A/A     A/A     G/G     G/G     C/C     G/G
6090                     T/T     G/G     A/A     C/C     A/A     A/A     C/G     G/G     C/C     G/G
6097                     C/T     A/G     A/C     C/T     A/C     A/C     C/C     A/G     C/G     A/G
Typical Results of Genotype Assays
Linkage Disequilibrium
• Linkage Disequilibrium (LD) = correlation of nucleotide alleles at different loci across the population
  – On average, there is strong LD between nearby alleles on the same chromosome
• Linkage Equilibrium = random association (independence) of alleles at different loci across the population
• LD reflects many factors of population history
• LD permits us to use proxy SNPs as diagnostic biomarkers for disease-causing mutations
Population history and SNP correlations
[Figure: present-day chromosomes on a time axis running from past to present, with mutations (neutral mutation, disease mutation) occurring at various times of population history.]
New haplotypes generated by mutations and …
                                               Locus 1   Locus 2
Ancestral chromosome with two loci shown       C         T
Mutation at locus 1                            A         T
Mutation at locus 2 on ancestral chromosome    C         G
…intra-chromosomal recombination
Before recombination
Haplotype 1    C T
Haplotype 2    A T
Haplotype 3    C G
After recombination
C T
A T
C G
A G
Recombination between haplotypes 2 and 3 generates a new haplotype from existing mutations
Quantifying linkage disequilibrium
• From the population haplotype frequencies we can calculate the correlations between SNPs.
• Commonly used LD summaries
  – D
  – Lewontin’s D’
  – $r^2$
Haplotype frequencies
Haplotype with 2 SNPs: A/a B/b
                        LOCUS 2
                        Allele B    Allele b    Totals
LOCUS 1   Allele A      $p_{AB}$    $p_{Ab}$    $p_A$
          Allele a      $p_{aB}$    $p_{ab}$    $p_a$
          Totals        $p_B$       $p_b$       1.0
Linkage Equilibrium definition
$p_{AB} = p_A p_B$
$p_{Ab} = p_A p_b \equiv p_A (1 - p_B)$
$p_{aB} = p_a p_B \equiv (1 - p_A)\, p_B$
$p_{ab} = p_a p_b \equiv (1 - p_A)(1 - p_B)$
• Random association of alleles
• Expected for SNPs at distant loci
Linkage Disequilibrium definition
$p_{AB} \ne p_A p_B$
$p_{Ab} \ne p_A p_b \equiv p_A (1 - p_B)$
$p_{aB} \ne p_a p_B \equiv (1 - p_A)\, p_B$
$p_{ab} \ne p_a p_b \equiv (1 - p_A)(1 - p_B)$
• Non-random association of alleles
• Expected for SNPs at nearby loci
LD measure : D
Deviation from linkage equilibrium
$D = p_{AB} - p_A p_B$
i.e.,
$p_{AB} = p_A p_B + D$
$p_{Ab} = p_A (1 - p_B) - D$
$p_{aB} = (1 - p_A)\, p_B - D$
$p_{ab} = (1 - p_A)(1 - p_B) + D$
Thus it can be shown that all 4 of the 2-SNP haplotype frequencies can be expressed in terms of D, $p_A$ and $p_B$ only.
Note also, $D = p_{AB}\, p_{ab} - p_{Ab}\, p_{aB}$
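The relations for D can be verified on a small numeric example. This is a minimal sketch with made-up haplotype frequencies; the function names are hypothetical. It computes D from the four haplotype frequencies and then rebuilds the frequencies from D, $p_A$, and $p_B$ alone.

```python
def ld_D(p_AB, p_Ab, p_aB, p_ab):
    """D = p_AB - p_A * p_B, with allele frequencies taken from the margins."""
    p_A = p_AB + p_Ab
    p_B = p_AB + p_aB
    return p_AB - p_A * p_B


def haplotype_frequencies(p_A, p_B, D):
    """Rebuild (p_AB, p_Ab, p_aB, p_ab) from D and the allele frequencies."""
    return (
        p_A * p_B + D,
        p_A * (1.0 - p_B) - D,
        (1.0 - p_A) * p_B - D,
        (1.0 - p_A) * (1.0 - p_B) + D,
    )


# Illustrative frequencies (not from the slides); they sum to 1
freqs = (0.5, 0.1, 0.1, 0.3)
D = ld_D(*freqs)
print(D)                                  # deviation from linkage equilibrium
print(freqs[0] * freqs[3] - freqs[1] * freqs[2])  # same value via p_AB*p_ab - p_Ab*p_aB
print(haplotype_frequencies(0.6, 0.6, D))  # recovers the four input frequencies
```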
LD measure : Lewontin’s D’
Normalized version of D: $D' = D / D_{max}$
where $D_{max}$ is given by
$D_{max} = \min[p_A p_b,\; p_a p_B]$ if D>0
$D_{max} = \min[p_A p_B,\; p_a p_b]$ if D<0
• D’ ranges between -1 and 1
• directly related to recombination fraction
• D=0 if linkage equilibrium
• |D’|=1 if only 2 or 3 haplotypes are present out of the possible 4
• |D’| upwardly biased in small samples
LD measure : $r^2$
Square of the correlation coefficient: $r^2 = \frac{D^2}{p_A\, p_a\, p_B\, p_b}$
• ranges between 0 and 1
• useful in association mapping
• $r^2$=0 if linkage equilibrium
• $r^2$=1 if only 2 haplotypes are present
• proportional to mutual information between 2 loci when D small
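The properties listed for D’ and $r^2$ can be checked directly. The sketch below assumes the four haplotype frequencies are known and that both loci are polymorphic; the function name and the example frequencies are hypothetical.

```python
def ld_summaries(p_AB, p_Ab, p_aB, p_ab):
    """Return (D, D', r^2) for two biallelic loci from haplotype frequencies."""
    p_A, p_a = p_AB + p_Ab, p_aB + p_ab
    p_B, p_b = p_AB + p_aB, p_Ab + p_ab
    if min(p_A, p_a, p_B, p_b) <= 0.0:
        raise ValueError("both loci must be polymorphic")
    D = p_AB - p_A * p_B
    if D > 0:
        D_max = min(p_A * p_b, p_a * p_B)
    elif D < 0:
        D_max = min(p_A * p_B, p_a * p_b)
    else:
        D_max = 1.0  # D = 0, so D' = 0 whatever the normalization
    D_prime = D / D_max
    r2 = D * D / (p_A * p_a * p_B * p_b)
    return D, D_prime, r2


print(ld_summaries(0.5, 0.1, 0.1, 0.3))  # all four haplotypes present
print(ld_summaries(0.5, 0.2, 0.0, 0.3))  # three haplotypes: |D'| = 1, r^2 < 1
print(ld_summaries(0.7, 0.0, 0.0, 0.3))  # two haplotypes: |D'| = 1, r^2 = 1
```

The three-haplotype case shows the practical difference between the two measures: D’ reaches 1 as soon as one haplotype is missing, while $r^2$ reaches 1 only when each allele at one locus perfectly predicts the allele at the other.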
Factors affecting Linkage Disequilibrium
Increases LD: decreases number (or variability) of haplotypes
• Finite Sampling (Drift)
• Demographic bottleneck
• Selection
• Emigration
Decreases LD: increases number (or variability) of haplotypes
• Immigration
• Recombination
How does LD decay over time?
• Recombination reduces correlation between SNPs
Haplotype frequencies at time t
A B    $P_{AB}$
A b    $P_{Ab}$
a B    $P_{aB}$
a b    $P_{ab}$
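Decay of linkage disequilibrium in large population
• The frequency of AB in the new generation (time t+1) will depend on the frequencies of AB, aB, and Ab in the old generation (time t) and also the recombination rate, c
$p_{AB}^{t+1} = (1 - c)\, p_{AB}^{t} + c\, p_A^{t}\, p_B^{t}$
$\quad = (1 - c)\, p_{AB}^{t} + c\, (p_{AB}^{t} - D^{t})$
$\quad = p_{AB}^{t} - c\, D^{t}$
Therefore, since recombination leaves the allele frequencies $p_A$ and $p_B$ unchanged,
$D^{t+1} = (1 - c)\, D^{t}$
$D^{t+n} = (1 - c)^{n}\, D^{t} \approx \exp(-cn)\, D^{t}$ (at large times)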
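The recursion and its closed form can be compared numerically. This is a minimal sketch assuming an effectively infinite population, so that drift is ignored and allele frequencies stay fixed; the function names and starting values are illustrative.

```python
import math


def ld_by_recursion(p_AB, p_A, p_B, c, generations):
    """Iterate p_AB(t+1) = (1 - c) * p_AB(t) + c * p_A * p_B and return D."""
    for _ in range(generations):
        p_AB = (1.0 - c) * p_AB + c * p_A * p_B
    return p_AB - p_A * p_B


def ld_closed_form(D0, c, n):
    """D after n generations: (1 - c)^n * D0."""
    return (1.0 - c) ** n * D0


D0 = 0.5 - 0.6 * 0.6  # starting disequilibrium for p_AB = 0.5, p_A = p_B = 0.6
c, n = 0.01, 100
print(ld_by_recursion(0.5, 0.6, 0.6, c, n))  # step-by-step recursion
print(ld_closed_form(D0, c, n))              # geometric decay
print(math.exp(-c * n) * D0)                 # exponential approximation
```

For small c the geometric and exponential forms agree closely, so LD falls by a factor of about e every 1/c generations.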
[Figure: mean |D'| (0 to 1) versus distance (0 to 150,000 bp) for Caucasian, African-American, Asian, and Yoruban samples.]
Different populations exhibit characteristic LD decay across the genome
Gabriel et al, 2002
Finite population size : Recombination-Drift Equilibrium
• Rate of decay of LD by recombination is cancelled out by rate of increase of LD by drift
$r^2 \approx \frac{1}{1 + 4 N_e c d}$
$N_e$ = effective population size (~10,000 for humans)
c = recombination rate (per base-pair)
d = distance across genome (base-pairs)
$\frac{1}{N_e} = \frac{1}{T}\left(\frac{1}{N_1} + \frac{1}{N_2} + \dots + \frac{1}{N_T}\right)$
Note that $N_e$ will be dominated by the times when population sizes are reduced (population bottleneck)
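Both formulas are easy to evaluate. The sketch below uses hypothetical function names and made-up population sizes; the recombination rate of 1e-8 per base pair is an assumed round number for illustration, not a value given in the slides.

```python
def effective_population_size(sizes):
    """Harmonic mean of per-generation population sizes N_1..N_T."""
    T = len(sizes)
    return T / sum(1.0 / N for N in sizes)


def expected_r2(N_e, c, d):
    """Recombination-drift equilibrium value r^2 ~ 1 / (1 + 4 * N_e * c * d)."""
    return 1.0 / (1.0 + 4.0 * N_e * c * d)


# Nine generations at 10,000 and a single bottleneck generation at 100
sizes = [10000] * 9 + [100]
print(sum(sizes) / len(sizes))           # arithmetic mean, for comparison
print(effective_population_size(sizes))  # harmonic mean, pulled down by the bottleneck

# Expected r^2 at 10 kb, assuming c = 1e-8 per base pair (illustrative value)
print(expected_r2(N_e=10000, c=1e-8, d=10000))
```

A single bottleneck generation reduces $N_e$ roughly tenfold in this example, even though the arithmetic mean barely moves.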