population genetic data analysis · 2014-08-31 · elaborate mathematical theories constructed by...
TRANSCRIPT
Population Genetic Data Analysis
Summer Institute in Statistical Genetics
September 1-3, 2014
University of Lausanne
Jerome Goudet: [email protected]
and
Bruce Weir: [email protected]
1
Contents
Topic Slide
Genetic Data 3Allele Frequencies 46Allelic Association 90Population Structure 150Inbreeding and Relatedness 192Association Mapping 235
Lectures on these topics by Bruce Weir will alternate with R
exercises led by Jerome Goudet. Some of the software is at
http://www2.unil.ch/popgen/teaching/SISG14/ .
2
GENETIC DATA
3
Genetic Data
“Genes are perpetuated as sequences of nucleic acid, but func-
tion by being expressed as proteins.” (Lewin, GENES).
DNA
(transcription)
↓RNA
(translation)
↓Protein
↓Phenotype
4
Sources of Data
Phenotype Mendel’s peasBlood groups
Protein AllozymesAmino acid sequences
DNA Restriction sites, RFLPsLength variants, VNTRs, STRsSNPsNucleotide sequences
5
Mendel’s Data
Dominant Form Recessive Form
Seed characters5474 Round 1850 Wrinkled6022 Yellow 2001 Green
Plant characters705 Grey-brown 224 White882 Simply inflated 299 Constricted428 Green 152 Yellow651 Axial 207 Terminal787 Long 277 Short
6
Genetic Data
Human ABO blood groups discovered in 1900.
Elaborate mathematical theories constructed by Sewall Wright,
R.A. Fisher, J.B.S. Haldane and others. This theory was chal-
lenged by data from new data from electrophoretic methods in
the 1960’s:
“For many years population genetics was an immensely rich and
powerful theory with virtually no suitable facts on which to oper-
ate. . . . Quite suddenly the situation has changed. The mother-
lode has been tapped and facts in profusion have been pored
into the hoppers of this theory machine. . . . The entire relation-
ship between the theory and the facts needs to be reconsidered.“
Lewontin (1974)
7
Electrophoretic Detection
Charge differences among allozymes lead to separation on elec-
trophoretic gels.
Length differences among Restriction Fragment Length Poly-
morphisms, or Variable Number of Tandem Repeat polymor-
phisms also detected by electrophoresis.
8
AmpliTypeTM Data
PCR based-DNA typing systems for forensic science. The loci in-
volved are all sequence polymorphisms that are detected/delineated
by hybridization to allele-specific oligonucleotide probes. Colo-
metric detection used for the hybridized product.
9
AmpliTypeTM Data - Caucasian
LDLR GYPA HBGG D7S8 Gca a B B A B A B C CA a b b A B A A A BA a B B A A A A A Aa a B B A A A B B BA a B b A B B B A CA A B b A A A B B Ca a B b A B A B A CA a b b A A A B A Ca a b b A A A A A BA a B B A B A A C CA a B B A B A B C Ca a B B A A A B A AA A B b A A A B A BA a B B A A B B C CA A B B A A A A B BA a b b A A A A C CA a B b A B A A B Ba a B b A B A A A CA a B B A B A B A CA a b b B B A B B B
10
AmpliTypeTM Data - Afr. Amer.
LDLR GYPA HBGG D7S8 Gca a b b B C A A A Ba a B B A A A B B Ca a B b A C A A B BA a B b A C A A B CA a B b A C A B A BA a B b A A A B A Ba a B b B C B B B CA a B B A A A B B BA a B b A C A B B Ba a b b A B A A B BA a B b B C A B B CA a B b A C A B B Ca a b b A A A A A Ca a b b A C A B B Ba a B B A A A A A BA a B B A B A A B Ba a B b A B A B B CA a B b B C A A B BA a B B A B A B B BA a B B B C A B B B
11
AmpliTypeTM Data - Hispanic
LDLR GYPA HBGG D7S8 Gca a B b B B B B B CA a B b A A A A A CA A b b A B A A A CA a B B A A A A C Ca a B b A B B B A Ca a B B B B A B C CA a B B B B A B C CA a B B A B A A A BA a B B A B A B C CA A B b A A A B B BA a b b A B A A A CA a B B A B A B C Ca a B b A C A A B BA A b b B B A A C CA a B b A A A A C Ca a B B B B A B C Ca a B b A B A B B CA a B B B B A A A BA A B B A B B B C CA a B B A B B B A C
12
STR markers: CTT set
(http://www.cstl.nist.gov/biotech/strbase/seq info.htm)
Usual No.Locus Structure of repeats
CSF1PO [AGAT]n 6–16TPOX [AATG]n 5–14TH01∗ [AATG]n 3–14
∗ “9.3” is [AATG]6ATG[AATG]3
Length variants detected by capillary electrophoresis.
13
“CTT” Data - Caucasian
CSF1P0 TPOX TH0111 12 8 11 7 811 13 8 8 6 711 12 8 11 6 710 12 8 8 6 911 12 8 12 9 9.310 12 9 11 6 710 13 8 11 6 611 12 8 8 6 9.39 10 8 9 7 9.311 12 8 8 6 811 13 8 11 7 911 12 8 11 6 9.310 11 8 8 7 9.310 10 8 11 7 9.39 10 8 8 6 9.311 12 9 11 9 9.39 11 9 11 9 9.311 12 8 8 6 710 10 9 11 6 9.310 13 8 8 8 9.3
14
“CTT” Data - Afr. Amer.
CSF1P0 TPOX TH0110 11 8 8 7 810 10 8 11 6 78 10 8 10 7 812 12 7 8 6 811 12 9 11 7 9.310 10 11 11 7 911 12 8 9 7 910 12 9 9 7 711 12 8 12 7 910 11 8 9 7 97 11 8 8 6 710 11 9 10 8 912 12 11 11 9 910 12 9 11 6 610 10 8 9 7 9.311 13 9 12 8 810 12 7 10 7 811 11 8 9 9 98 11 10 11 8 910 12 9 9 6 9
15
“CTT” Data - Hispanic
CSF1P0 TPOX TH0111 12 8 8 10 1011 11 8 8 6 711 12 8 11 7 9.310 10 11 11 6 9.311 11 8 12 9.3 9.312 13 12 12 6 811 12 8 10 9.3 9.310 11 11 11 6 912 12 7 8 6 9.310 12 8 11 7 9.311 12 8 8 6 712 13 11 12 6 711 12 8 11 7 9.311 14 8 10 6 611 12 8 11 7 711 12 8 11 6 610 11 8 11 6 911 12 8 8 6 611 11 8 11 6 912 12 11 11 7 8
16
STR Typing Methodology
“Short tandem repeat (STR) typing methods are widely used
today for human identity testing applications including foren-
sic DNA analysis. Following multiplex PCR amplification, DNA
samples containing the length-variant STR alleles are typically
separated by capillary electrophoresis and genotyped by compar-
ison to an allelic ladder supplied with a commercial kit. This
article offers a brief perspective on the technologies and issues
involved in STR typing.”
Butler JM. Short tandem repeat typing technologies used in hu-
man identity testing. BioTechniques 43:Sii-Sv (October 2007)
doi 10.2144/000112582
17
Single Nucleotide Polymorphisms (SNPs)
“Single nucleotide polymorphisms (SNPs) are the most frequently
occurring genetic variation in the human genome, with the total
number of SNPs reported in public SNP databases currently ex-
ceeding 9 million. SNPs are important markers in many studies
that link sequence variations to phenotypic changes; such studies
are expected to advance the understanding of human physiology
and elucidate the molecular bases of diseases. For this reason,
over the past several years a great deal of effort has been devoted
to developing accurate, rapid, and cost-effective technologies for
SNP analysis, yielding a large number of distinct approaches. ”
Kim S. Misra A. 2007. SNP genotyping: technologies and
biomedical applications. Annu Rev Biomed Eng. 2007;9:289-
320.
18
Current 1000Genomes Data
Tuesday June 24, 2014
The Initial Phase 3 variant list and phased genotypes.
The initial call set from the 1000 Genomes Project Phase 3
analysis is now available on our ftp site in the directory re-
lease/20130502/.
These release contains more than 79 million variant sites and
includes not just biallelic snps but also indels, deletions, com-
plex short substitutions and other structural variant classes. It
is based on data from 2535 individuals from 26 different popu-
lations around the world.
www.1000genomes.org
19
Sampling
Statistical sampling: The variation among repeated samples
from the same population is analogous to “fixed” sampling. In-
ferences can be made about that particular population.
Genetic sampling: The variation among replicate (conceptual)
populations is analogous to “random” sampling. Inferences are
made to all populations with the same history.
20
Classical Model
Sample ofsize n · · · Sample of
size n
��
��
���=
HHHHHHHHHHj
Time tPopulationof size N · · · Population
of size N
↓ ↓
... ...
↓ ↓
Time 2Populationof size N · · · Population
of size N
↓ ↓
Time 1 Populationof size N · · · Population
of size N
↓ ↓
Reference population(Usually assumed infinite and in equilibrium)
21
Coalescent Theory
An alternative framework works with genealogical history of a
sample of alleles. There is a tree linking all alleles in a current
sample to the “most recent common ancestral allele.” Allelic
variation due to mutations since that ancestral allele.
The coalescent approach requires mutation and may be more
appropriate for long-term evolution and analyses involving more
than one species. The classical approach allows mutation but
does not require it: within one species variation among popula-
tions may be due primarily to drift.
22
Probability
Probability provides the language of data analysis.
Equiprobable outcomes definition:
Probability of event E is number of outcomes favorable to E
divided by the total number of outcomes. e.g. Probability of a
head = 1/2.
Long-run frequency definition:
If event E occurs n times in N identical experiments, the prob-
ability of E is the limit of n/N as N goes to infinity.
Subjective probability:
Probability is a measure of belief.
23
First Law of Probability
Law says that probability can take values only in the range zero
to one and that an event which is certain has probability one.
0 ≤ Pr(E) ≤ 1
Pr(E|E) = 1 for any E
i.e. If event E is true, then it has a probability of 1. For example:
Pr(Seed is Round|Seed is Round) = 1
24
Second Law of Probability
If G and H are mutually exclusive events, then:
Pr(G or H) = Pr(G) + Pr(H)
For example,
Pr(Round or Wrinkled) = Pr(Round) + Pr(Wrinkled)
More generally, if Ei, i = 1, . . . r, are mutually exclusive then
Pr(E1 or . . . or Er) = Pr(E1) + . . .+ Pr(Er)
=∑
i
Pr(Ei)
25
Complementary Probability
If Pr(E) is the probability that E is true then Pr(E) denotes the
probability that E is false. Because these two events are mutually
exclusive
Pr(E or E) = Pr(E) + Pr(E)
and they are also exhaustive in that between them they cover all
possibilities – one or other of them must be true. So,
Pr(E) + Pr(E) = 1
Pr(E) = 1 − Pr(E)
The probability that E is false is one minus the probability it is
true.
26
Third Law of Probability
For any two events, G and H, the third law can be written:
Pr(G and H) = Pr(G) Pr(H|G)
There is no reason why G should precede H and the law can also
be written:
Pr(G and H) = Pr(H) Pr(G|H)
For example
Pr(Seed is round and is type AA) = Pr(Seed is round|Seed is type AA)
× Pr(Seed is type AA)
27
Independent Events
If the information that H is true does nothing to change uncer-
tainty about G, then
Pr(G|H) = Pr(G)
and
Pr(H and G) = Pr(H)Pr(G)
Events G,H are independent.
28
Law of Total Probability
If G,H are two mutually exclusive and exhaustive events (so that
H = G = not −G), then for any other event E, the law of total
probability states that
Pr(E) = Pr(E|G)Pr(G) + Pr(E|H) Pr(H)
This generalizes to any set of mutually exclusive and exhaustive
events {Si}:
Pr(E) =∑
i
Pr(E|Si)Pr(Si)
For example
Pr(Seed is round) = Pr(Round|Type AA)Pr(Type AA)
+ Pr(Round|Type Aa)Pr(Type Aa)
+ Pr(Round|Type aa)Pr(Type aa)
29
Bayes’ Theorem
Bayes’ theorem relates Pr(G|H) to Pr(H|G):
Pr(G|H) =Pr(GH)
Pr(H), from third law
=Pr(H|G) Pr(G)
Pr(H), from third law
If {Gi} are exhaustive and mutually exclusive, Bayes’ theorem
can be written as
Pr(Gi|H) =Pr(H|Gi)Pr(Gi)∑iPr(H|Gi)Pr(Gi)
30
Bayes’ Theorem Example
Suppose G is event that a man has genotype A1A2 and H is the
event that he transmits allele A1 to his child. Then Pr(H|G) =
0.5.
Now what is the probability that a man has genotype A1A2 given
that he transmits allele A1 to his child?
Pr(G|H) =Pr(H|G) Pr(G)
Pr(H)
=0.5 × 2p1p2
p1= p2
31
Mendel’s Data
Model: seed shape governed by gene A with alleles A, a:
Genotype Phenotype
AA RoundAa Roundaa Wrinkled
Cross two inbred lines: AA and aa. All offspring (F1 generation)
are Aa, and so have round seeds.
32
F2 generation
Self an F1 plant: each allele it transmits is equally likely to be A
or a, and alleles are independent, so for F2 generation:
Pr(AA) = Pr(A)Pr(A) = 0.25
Pr(Aa) = Pr(A)Pr(a) + Pr(a)Pr(A) = 0.5
Pr(aa) = Pr(a)Pr(a) = 0.25
Probability that an F2 seed (observed on F1 parental plant) is
round:
Pr(Round) = Pr(Round|AA)Pr(AA)
+ Pr(Round|Aa)Pr(Aa)
+ Pr(Round|aa)Pr(aa)
= 1 × 0.25 + 1 × 0.5 + 0 × 0.25
= 0.75
33
F2 generation
What are the proportions of AA and Aa among F2 plants with
round seeds? From Bayes’ Theorem:
Pr(F2 = AA|F2 Round) =Pr(F2 Round|AA)Pr(F2 AA)
Pr(F2 round)
=1 × 1
434
=1
3
34
Seed Characters
As an experimental check on this last result, and therefore on
Mendel’s theory, Mendel selfed a round-seeded F2 plant and
noted the F3 seed shape (observed on the F2 parental plant).
If all the F3 seeds are round, the F2 must have been AA. If some
F3 seed are round and some are wrinkled, the F2 must have been
Aa. Possible to observe many F3 seeds for an F2 parental plant,
so no doubt that all seeds were round. Data supported theory:
one-third of F2 plants gave only round seeds and so must have
had genotype AA.
35
Plant Characters
Model for stem length is
Genotype Phenotype
GG LongGg Longgg Short
To check this model it is necessary to grow the F3 seed to observe
the F3 stem length.
36
F2 Plant Character
Mendel grew only 10 F3 seeds per F2 parent. If all 10 seeds gave
long stems, he concluded they were all GG, and F2 parent was
GG. This could be wrong. The probability of a Gg F2 plant
giving 10 long-stemmed F3 offspring (GG or Gg), and therefore
wrongly declared to be homozygous GG is (3/4)10 = 0.0563.
37
Fisher’s 1936 Criticism
The probability that a long-stemmed F2 plant is declared to be
homozygous (event V ) is
Pr(V ) = Pr(V |U)Pr(U) + Pr(V |U)Pr(U)
= 1 × (1/3) + 0.0563 × (2/3)
= 0.3709 6= 0.3333
where U is the event that a long-stemmed F2 is actually homozy-
gous and U is the event that it is actually heterozygous.
Fisher claimed Mendel’s data closer to the 0.3333 probability
appropriate for seed shape than to the correct 0.3709 value.
38
Weldon’s 1902 Doubts
In Biometrika, Weldon said:
“Here are seven determinations of a frequency which is said to
obey the law of Chance. Only one determination has a deviation
from the hypothetical frequency greater than the probable error
of the determination, and one has a deviation sensible equal to
the probable error; so that a discrepancy between the hypothesis
and the observations which is equal to or greater than the prob-
able error occurs twice out of seven times, and deviations much
greater than the probable error do not occur at all. These results
then accord so remarkably with Mendel’s summary of them that
if they were repeated a second time, under similar conditions
and on a similar scale, the chance that the agreement between
observation and hypothesis would be worse than that actually
obtained is about 16 to 1.”
39
Edwards’ 1986 Criticism
Mendel had 69 comparisons where the expected ratios were cor-
rect. Each set of data can be tested with a chi-square test:
Category 1 Category 2 Total
Observed (o) a n-a nExpected (e) b n-b n
X2 =(a− b)2
b+
[(n− a)− (n− b)]2
(n− b)
=n(a− b)2
b(n− b)
40
Edwards’ Criticism
If the hypothesis giving the expected values is true, the X2 val-
ues follow a chi-square distribution, and the X values follow a
normal distribution. Edwards claimed Mendel’s values were too
small – not as many large values as would be expected by chance.
−3.0−2.0−1.0 0.0 1.0 2.0 3.0
0.0
5.0
10.0
41
Novitski’s 2004 Response
For six experiments on plant characters, Mendel planned to have
10 F3 plants for each of 100 dominant phenotype F2 plants.
With a 2% error rate, the probability that a family of 10 do not
all survive, i.e. at least one error or one minus the chance of no
error, is 1 − (0.98)10 = 0.18. The expected number of failing
families is 6 × 100 × 0.18 = 110. If such a family still showed at
least one recessive phenotype it would probably still have been
included in the results. Otherwise the family would have been
discarded, or augmented and this can result in an under-counting
of dominant homozygotes.
This may account for the under-counting of homozygotes noted
by Fisher.
42
Hartl and Fairbanks 2007 Rebuttals
In “Mud sticks: On the alleged falsification of Mendel’s data”
(Genetics 175:975-979) Hartl and Fairbanks criticized Fisher and
Novitski. They did not comment on Edwards.
Against Novitski, they claimed that Mendel did indeed collect
data from exactly 10 plants.
Against Fisher, they noted that one experiment was repeated
and the combined data agreed with Fisher’s expectations.
43
Pires and Branco, 2010
“A statistical model to explain the Mendel-Fisher controversy.”
Statistical Science 15:545–565, 2010.
This paper concentrates on the 84 p-values of Mendel’s experi-
ments and suggests an unconscious bias in Mendel’s work.
Refers to the 2008 book:
Franklin, A., Edwards AWF, Fairbanks DJ, Hartl DL, Seiden-
feld T. “Ending the Mendel-Fisher Controversy.” University of
Pittsburgh Press, Pittsburgh.
44
Continuing Debate
“This research was carried out in order to verify by simulation
Mendel’s laws and seek for the clarification, from the author”s
point of view, the Mendel–Fisher controversy. ... It was con-
cluded, that Mendel had no reason to manipulate his data in
order to make them to coincide with his beliefs. Therefore, in
experiment with a single trait, and in experiments with two traits
assuming complete dominance, segregation ratios are 3:1; and
9:3:3:1, respectively. Consequently, Mendel’s laws, under the
conditions as were described are absolutely valid and universal.”
REVISTA CIENTIFICA-FACULTAD DE CIENCIAS VETERINAR-
IAS 24:38-46, 2014.
45
ALLELE FREQUENCIES
46
Properties of Estimators
Consistency Increasing accuracyas sample size increases
Unbiasedness Expected value is the parameter
Efficiency Smallest variance
Sufficiency Contains all the informationin the data about parameter
47
Binomial Distribution
Most population genetic data consists of numbers of observa-
tions in some categories. The values and frequencies of these
counts form a distribution.
Toss a coin n times, and note the number of heads. There
are (n+1) outcomes, and the number of times each outcome is
observed in many sets of n tosses gives the sampling distribution.
Or: sample n alleles from a population and observe x copies of
type A.
48
Binomial distribution
If every toss has the same chance p of giving a head:
Probability of x heads in a row is
p× p× . . .× p = px
Probability of n− x tails in a row is
(1 − p) × (1 − p) × . . .× (1 − p) = (1 − p)n−x
The number of ways of ordering x heads and n − x tails among
n outcomes is n!/[x!(n − x)!].
The binomial probability of x successes in n trials is
Pr(x|p) =n!
x!(n − x)!px(1 − p)n−x
49
Binomial Likelihood
The quantity Pr(x|p) is the probability of the data, x successes
in n trials, when each trial has probability p of success.
The same quantity, written as L(p|x), is the likelihood of the
parameter, p, when the value x has been observed. The terms
that do not involve p are not needed, so
L(p|x) ∝ px(1 − p)(n−x)
Each value of x gives a different likelihood curve, and each curve
points to a p value with maximum likelihood. This leads to
maximum likelihood estimation.
50
Likelihood L(p|x, n = 4)
0.0 0.2 0.4 0.6 0.8 1.0
p
L(p)
x = 0
x = 1x = 2
x = 3
x = 4
..
..
.
..
..
.
..
..
.
..
..
..
.
..
..
.
..
..
..
.
..
..
.
..
..
.
..
..
..
.
..
..
.
..
...
.
..
..
..
.
..
..
.
..
..
.
..
..
.
..
..
.
..
..
..
.
..
..
.
..
..
.
..
..
.
..
..
...
.
..
..
..
.
..
..
..
.
..
..
..
.
..
..
..
.
..
..
..
.
..
..
..
.
..
..
..
.
...
..
..
..
.
..
..
..
..
.
..
..
..
.
..
..
..
..
.
..
..
..
..
.
..
..
..
.
...
..
..
..
.
..
..
..
.
..
..
..
.
..
..
..
..
.
..
..
..
.
..
..
..
.
...
..
..
..
..
..
..
..
..
.
..
..
..
..
..
..
..
..
..
.
..
..
..
...
..
..
..
..
..
..
..
..
..
.
..
..
..
..
..
..
..
..
..
..
...
..
..
..
..
..
..
..
..
..
..
..
.
..
..
..
..
..
..
..
...
..
..
..
..
..
..
...
..
..
..
..
..
..
..
..
..
..
...
..
..
...
..
..
...
..
..
..
...............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
..
.
..
.
..
.
..
..
.
..
..
..
.
..
..
..
.
..
..
.
..
..
..
.
..
..
..
.
..
..
.
..
...
..
.
..
..
..
.
..
..
.
..
..
..
.
..
..
..
.
..
..
..
.
..
..
..
.
..
..
.
..
...
..
..
.
..
..
..
..
.
..
..
..
..
.
..
..
..
..
.
..
..
..
..
.
...
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
..
...
..
.
..
..
..
..
..
..
..
..
..
..
.
..
..
..
..
..
...
.
..
.
..
..
.
..
.
..
.
..
.
..
.
..
..
.
..
.
............................................................................................................................................................................................................................................................................................................................................
......................................................
...............................................
....................................................
...............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
..................................................................................................................................................................................................................................................................................................................................................................................................................................
.....................................
...........................
................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
.....................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
.................................................................
.............................................
.................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
51
Binomial Mean
If there are n trials, each of which has probability p of giving a
success, the mean or the expected number of successes is np.
The sample proportion of successes is
p =x
n
(This is also the maximum likelihood estimate of p.)
The expected, or mean, value of p is p.
E(p) = p
52
Binomial Variance
The expected value of the squared difference between the num-
ber of successes and its mean, (x − np)2, is np(1 − p). This is
the variance of the number of successes in n trials, and indicates
the spread of the distribution.
The variance of the sample proportion p is
Var(p) =p(1 − p)
n
53
Normal Approximation
Provided np is not too small (e.g. not less than 5), the binomial
distribution can be approximated by the normal distribution with
the same mean and variance. In particular:
p ∼ N
(p,
p(1 − p)
n
)
To use the normal distribution in practice, change to the standard
normal variable z with a mean of 0, and a variance of 1:
z =p− p√
p(1 − p)/n
For a standard normal, 95% of the values lie between ±1.96.
The normal approximation to the binomial therefore implies that
95% of the values of p lie in the range
p ± 1.96√p(1 − p)/n
54
Confidence Intervals
A 95% confidence interval is a variable quantity. It has end-
points which vary with the sample. Expect that 95% of samples
will lead to an interval that includes the unknown true value pc.
The standard normal variable z has 95% of its values between
−1.96 and +1.96. This suggests that a 95% confidence interval
for the binomial parameter p is
p ± 1.96
√p(1 − p)
n
55
Confidence Intervals
For samples of size 10, the 11 possible confidence intervals are:
pc Confidence Interval
0.0 0.0 ± 0.00 0.00,0.00
0.1 0.1 ± 2√
0.009 0.00,0.29
0.2 0.2 ± 2√
0.016 0.00,0.45
0.3 0.3 ± 2√
0.021 0.02,0.58
0.4 0.4 ± 2√
0.024 0.10,0.70
0.5 0.5 ± 2√
0.025 0.19,0.81
0.6 0.6 ± 2√
0.024 0.30,0.90
0.7 0.7 ± 2√
0.021 0.42,0.98
0.8 0.8 ± 2√
0.016 0.55,1.00
0.9 0.9 ± 2√
0.009 0.71,1.001.0 1.0 ± 0.00 1.00,1.00
Can modify interval a little by extending it by the “continuity
correction” ±1/2n in each direction.
56
Confidence Intervals
To be 95% sure that the estimate is no more than 0.01 from
the true value, 1.96√p(1 − p)/n should be less than 0.01. The
widest confidence interval is when p = 0.5, and then need
0.01 ≥ 1.96√
0.5 × 0.5/n
which means that
n ≥ 10,000
If the true value of p was about 0.05, however,
0.01 ≥ 2√
0.05 × 0.95/n
n ≥ 1,900 ≈ 2,000
57
Exact Confidence Intervals: One-sided
The normal-based confidence intervals are constructed to be
symmetric about the sample value, unless the interval goes out-
side the interval from 0 to 1. They are therefore less satisfactory
the closer the true value is to 0 or 1.
More accurate confidence limits follow from the binomial distri-
bution exactly. For events with low probabilities p, how large
could p be for there to be at least a 5% chance of seeing at
most as many as x (i.e. 0,1,2, . . . x) occurrences of that event
among n events. If this upper bound is pU ,
x∑
k=0
Pr(k) ≥ 0.05
x∑
k=0
(n
k
)pkU(1 − pU)n−k ≥ 0.05
If x = 0, then (1 − pU)n ≥ 0.05 or pU ≤ 1 − 0.051/n and this is
0.0295 if n = 100. More generally pU ≈ 3/n when x = 0.
58
Exact Confidence Intervals: Two-sided
Now want to know how large p could be for there to be at least
a 2.5% chance of seeing at most as many as x (i.e. 0,1,2 . . . x)
occurrences, and in knowing how small p could be for there to
be at least a 2.5% chance of seeing at least as many as x (i.e.
x, x+ 1, x+ 2, . . . n) occurrences then we need
x∑
k=0
(n
k
)pkU(1 − pU)n−k ≥ 0.025
n∑
k=x
(n
k
)pkL(1 − pL)
n−k ≥ 0.025
The second of these equations may be written as
x−1∑
k=0
(n
k
)pkL(1 − pL)
n−k ≤ 0.975
If x = n, then pnL ≤ 0.975 or pL ≤ 0.9751/n and this is 0.9997 if
n = 100. Interval is not symmetric if p 6= 0.5.
59
Bootstrapping
An alternative method for constructing confidence intervals uses
numerical resampling. A set of samples is drawn, with replace-
ment, from the original sample to mimic the variation among
samples from the original population. Each new sample is the
same size as the original sample, and is called a bootstrap sam-
ple.
The middle 95% of the sample values p from a large number of
bootstrap samples provides a 95% confidence interval.
60
Multinomial Distribution
Toss two coins n times. For each double toss, the probabilities
of the three outcomes are:
2 heads pHH = 1/41 head, 1 tail pHT = 1/22 tails pTT = 1/4
The probability of x lots of 2 heads is (pHH)x, etc.
The numbers of ways of ordering x, y, z occurrences of the three
outcomes is n!/[x!y!z!] where n = x+ y+ z.
The multinomial probability for x of HH, and y of HT or TH
and z of TT in n trials is:
Pr(x, y, z) =n!
x!y!z!(pHH)x(pHT )y(pTT )z
61
Multinomial Variances and Covariances
If {pi} are the probabilities for a series of categories, the sam-
ple proportions pi from a sample of n observations have these
properties:
E(pi) = pi
Var(pi) =1
npi(1 − pi)
Cov(pi, pj) = −1
npipj, i 6= j
The covariance is defined as E[(pi − pi)(pj − pj)].
For the sample counts:
E(ni) = npi
Var(ni) = npi(1 − pi
Cov(ni, nj) = −npipj, i 6= j
62
Allele Frequency Sampling Distribution
If a locus has alleles A and a, in a sample of size n the allele
counts are sums of genotype counts:
n = nAA + nAa + naa
nA = 2nAA + nAa
na = 2naa + nAa
2n = nA + na
Genotype counts in a random sample are multinomially distributed.
What about allele counts? Approach this question by calculating
variance of nA.
63
Within-population Variance
Var(nA) = Var(2nAA + nAa)
= Var(2nAA) + 2Cov(2nAA, nAa)
+ Var(nAa)
= 2npA(1 − pA) + 2n(PAA − p2A)
This is not the same as the binomial variance 2npA(1−pA) unless
PAA = p2A. In general, the allele frequency distribution is not
binomial.
64
Within-population Variance
The variance of the sample allele frequency can be written as
Var(pA) =pA(1 − pA)
2n+PAA− p2A
2n
It is convenient to reparameterize genotype frequencies with the
(within-population) inbreeding coefficient f :
PAA = p2A + fpA(1 − pA)
PAa = 2pApa − 2fpApa
Paa = p2a + fpa(1 − pa)
Then the variance can be written as
Var(pA) =pA(1 − pA)(1 + f)
2n
This variance is different from the binomial variance of pA(1 −pA)/2n.
65
Bounds on f
Since
pA ≥ PAA = p2A + fpA(1 − pA) ≥ 0
pa ≥ Paa = p2a + fpa(1 − pa) ≥ 0
there are bounds on f :
−pA/(1 − pA) ≤ f ≤ 1
−pa/(1 − pa) ≤ f ≤ 1
or
max
(−pApa,−papA
)≤ f ≤ 1
This range of values is [-1,1] when pA = pa.
66
Indicator Variables
A very convenient way to derive many statistical genetic results
is to define an indicator variable xij for allele j in individual i:
xij =
{1 if allele is A0 if allele is not A
Then
E(xij) = pA
E(x2ij) = pA
E(xijxij′) = PAA
If there is random sampling, individuals are independent, and
E(xijxi′j′) = E(xij)E(xi′j′) = p2A
67
Intraclass Correlation
The inbreeding coefficient is the correlation of the indicator vari-
ables for the two alleles at a locus carried by an individual. This
is because:
Var(xij) = E(x2ij) − [E(xij)]2
= pA(1 − pA)
= Var(xij′), j 6= j′
and
Cov(xij , xij′) = E(xijxij′)− [E(xij)][E(xij′)], j 6= j′
= PAA− p2A= fpA(1 − pA)
so
Corr(xij, xij′) =Cov(xij , xij′)√
Var(xij)Var(xij′)= f
68
Maximum Likelihood Estimation: Binomial
For binomial sample of size n, the likelihood of pA for nA alleles
of type A is
L(pA|nA) = C(pA)nA(1 − pA)n−nA
and is maximized when
∂L(pA|nA)
∂pA= 0 or when
∂ lnL(pA|na)∂pA
= 0
Now
lnL(pA|nA) = lnC + nA ln(pA) + (n− nA) ln(1 − pA)
so
∂ lnL(pA|nA)
∂pA=
nApA
− n− nA1 − pA
and this is zero when pA = pA = nA/n.
69
Maximum Likelihood Estimation: Multinomial
If {ni} are multinomial with parameters n and {Qi}, then the
MLE’s of Qi are ni/n. This will always hold for genotype pro-
portions, but not always for allele proportions.
For two alleles, the MLE’s for genotype proportions are:
PAA = nAA/n
PAa = nAa/n
Paa = naa/n
Does this lead to estimates of allele proportions and the within-
population inbreeding coefficient?
PAA = p2A + fpA(1 − pA)
PAa = 2pA(1 − pA)− 2fpA(1 − pA)
Paa = (1 − pA)2 + fpA(1 − pA)
70
Maximum Likelihood Estimation
The likelihood function for pA, f is
L(pA, f) =n!
nAA!nAa!naa![p2A + pA(1 − pA)f ]nAA
×[2pA(1 − pA)f ]nAa[(1 − pA)2 + pA(1 − pA)f ]naa
and it is difficult to find, analytically, the values of pA and f that
maximize this function or its logarithm.
There is an alternative way of finding maximum likelihood esti-
mates in this case: equating the observed and expected values
of the genotype frequencies.
71
Bailey’s Method
Because the number of parameters (2) equals the number of
degrees of freedom in this case, we can just equate observed and
expected (using the estimates of pA and f) genotype proportions
nAA/n = p2A + f pA(1 − pA)
nAa/n = 2pA(1 − pA) − 2f pA(1 − pA)
naa/n = (1 − pA)2 + f pA(1 − pA)
so that, solving for pA and f :
pA =2nAA + nAa
2n= pA
f =4nAAnaa − n2
Aa
(2nAA + nAa)(2naa + nAa)= 1 − PAa
2pApa
72
Three-allele Case
With three alleles, there are six genotypes and 5 df. To use
Bailey’s method, would need five parameters: 2 allele frequencies
and 3 inbreeding coefficients:
P11 = p21 + f12p1p2 + f13p1p3
P12 = 2p1p2 − 2f12p1p2
P22 = p22 + f12p1p2 + f23p2p3
P13 = 2p1p3 − 2f13p1p3
P23 = 2p2p3 − 2f23p2p3
P33 = p23 + f13p1p3 + f23p2p3
73
Three-allele Case
This formulation preserves the relation between allele and geno-
type frequencies:
p1 = P11 +1
2P12 +
1
2P13
p2 = P22 +1
2P12 +
1
2P23
p3 = P33 +1
2P13 +
1
2P23
74
Three-allele Case
The maximum likelihood estimates in the fully-parameterized
case are
p1 = P11 +1
2P12 +
1
2P13
p2 = P22 +1
2P12 +
1
2P23
p3 = P33 +1
2P13 +
1
2P23
and
f12 = 1 − P12/p1p2
f13 = 1 − P13/p1p3
f23 = 1 − P23/p2p3
75
Three-allele Likelihood
For neutral genes, no reason to expect differences among f ’s, so
an alternative parameterization would be
P11 = p21 + fp1(1 − p1)
P12 = 2p1p2 − 2fp1p2
P22 = p22 + fp2(1 − p2)
P13 = 2p1p3 − 2fp1p3
P23 = 2p2p3 − 2fp2p3
P33 = p23 + fp3(1 − p3)
76
Three-allele Likelihood
No explicit analytic expressions for the MLEs of p’s and f . Nu-
merical methods needed to maximize the log-likelihood
lnL = n11[ln(p1) + ln(p1 + f − fp1)]
+ n12[ln(p1) + ln(p2) + ln(1 − f)]
+ n22[ln(p2) + ln(p2 + f − fp2)]
+ n13[ln(p1) + ln(p3) + ln(1 − f)]
+ n23[ln(p2) + ln(p3) + ln(1 − f)]
+ n33[ln(p3) + ln(p3 + f − fp3)]
77
Method of Moments
An alternative to maximum likelihood estimation is the method
of moments (MoM) where observed values of statistics are set
equal to their expected values. In general, this does not lead to
unique estimates or to estimates with variances as small as those
for maximum likelihood. (Bailey’s method is for the special case
where the MLEs are also MoM estimates.)
78
Method of Moments
For the inbreeding coefficient at loci with m alleles, two different
MoM estimates are
fW =
∑mu=1(Puu − p2u) + 1
2n
∑mu=1(pu − Puu)
∑mu=1 pu(1 − pu)− 1
2n
∑mu=1(pu − Puu)
≈∑mu=1(Puu − p2u)∑mu=1 pu(1 − pu)
fH =1
m− 1
m∑
u=1
(Puu − p2u
pu
)
For loci with two alleles, m = 2, the two moment estimates are
equal to each other and to the maximum likelihood estimate:
fW = fH = 1 − PAa2pApa
79
Expectations of Moment Estimates
The expected values of the estimated inbreeding coefficients can
be found by using the results
E(Puu) = Puu = p2u + fpu(1 − pu)
E(pu) = pu
E(p2u) = p2u +1
2npu(1 − pu)(1 + f)
Then, approximating the expectation of a ratio by the ratio of
expectations:
E(fW ) ≈f n−1
n (1 −∑mu=1 p
2u)
n−1n (1 −∑m
u=1 p2u)
= f
E(fH) ≈ 1
m− 1
m∑
u=1
pu(1 − pu)(f − 1+f
2n )
pu
= f − 1 + f
2n≈ f
80
MLE for Recessive Alleles
Suppose allele a is recessive to allele A. If there is Hardy-
Weinberg equilibrium, the likelihood for the two phenotypes is
L(pa) = (1 − p2a)n−naa(p2a)
naa
ln(L(pa) = (n− naa) ln(1 − p2a) + 2naa ln(pa)
where there are naa individuals of type aa and n− naa of type A.
Differentiating wrt pa:
∂ lnL(pa)
∂pa= −2pa(n− naa)
1 − p2a+
2naa
pa
Setting this to zero leads to an equation that can be solved
explicitly: p2a = naa/n. No need for iteration.
81
EM Algorithm for Recessive Alleles
An alternative way of finding maximum likelihood estimates when
there are “missing data” involves Estimation of the missing data
and then Maximization of the likelihood. For a locus with allele A
dominant to a the missing information is the frequencies (1−pa)2of AA, and 2pa(1−pa) of Aa genotypes. Only the joint frequency
(1 − p2a) of AA+Aa can be observed.
Estimate the missing genotype counts (assuming independence
of alleles):
nAA =(1 − pa)2
1 − p2a(n− naa) =
(1 − pa)(n − naa)
(1 + pa)
nAa =2pa(1 − pa)
1 − p2a(n− naa) =
2pa(n− naa)
(1 + pa)
82
EM Algorithm for Recessive Alleles
Maximize the likelihood (using Bailey’s method):
pa =nAa + 2naa
2n
=1
2n
(2pa(n− naa)
(1 + pa)+ 2naa
)
= =2(npa + naa)
2n(1 + pa)
An initial estimate pa is put into the right hand side to give an
updated estimated pa on the left hand side. This is then put
back into the right hand side to give an iterative equation for pa.
This procedure also has explicit solution pa =√
(naa/n).
83
EM Algorithm for Two Loci
For two loci with two alleles each, the ten two-locus frequencies are:
Genotype Actual Expected Genotype Actual Expected
AB/AB PABAB p2AB AB/Ab PAB
Ab 2pABpAb
AB/aB PABaB 2pABpaB AB/ab PAB
ab 2pABpab
Ab/Ab PAbAb p2Ab Ab/aB PAb
aB 2pAbpaB
Ab/ab PAbab 2pAbpab aB/aB P aB
aB p2aB
aB/ab P aBab 2paBpab ab/ab P ab
ab p2ab
84
EM Algorithm for Two Loci
Gamete frequencies are marginal sums:
pAB = PABAB +1
2(PABAb + PABaB + PABab )
pAb = PAbAb +1
2(PAbAB + PAbab + PAbaB)
paB = P aBaB +1
2(P aBAB + P aBab + P aBAb )
pab = P abab +1
2(P abAb + P abaB + P abAB)
Can arrange gamete frequencies as two-way table to show that
only one of them is unknown when the allele frequencies are
known:
pAB pAb pApaB pab papB pb 1
85
EM Algorithm for Two Loci
The two double heterozygote frequencies PABab , PAbaB are “missing
data.”
Assume initial value of pAB and Estimate the missing counts:
nABab =pABpab
pABpab + pAbpaBnAaBb
nAbaB =pAbpaB
pABpab + pAbpaBnAaBb
and then Maximize the likelihood by setting
pAB =1
2n
(2nABAB + nABAb + nABaB + nABab
)
86
Example
For loci LDLR and GYPA in the Hispanic Sample, the 9 observed
genotype counts are:
BB Bb bb Total
AA nAABB = 1 nAABb = 1 nAAbb = 2 nAA = 4Aa nAaBB = 7 nAaBb = 2 nAabb = 1 nAa = 10aa naaBB = 2 naaBb = 4 naabb = 0 naa = 6
Total nBB = 10 nBb = 7 nbb = 3 n = 20
There is one unknown gamete count x = nAB = 2npAB for AB:
B b Total
A nAB = x nAb = 18 − x nA = 18a naB = 27 − x nab = x− 5 na = 22
Total nB = 27 nb = 13 2n = 40
18 ≥ x ≥ 5
87
Example
EM iterative equation:
x′ = 2nAABB + nAABb + nAaBB + nAB/ab
= 2nAABB + nAABb + nAaBB +2pABpab
2pABpab + 2pAbpaBnAaBb
= 2 + 1 + 7 +2x(x− 5)
2x(x− 5) + 2(18 − x)(27 − x)2
= 10 +2x(x− 5)
x(x− 5) + (18 − x)(27 − x)
88
Example
A good starting value would assume independence of A and B
alleles: x = 2n ∗ pA ∗ pB = (18 × 27/40) = 12.15.
Successive iterates are:
Iterate x value
0 12.15001 12.00002 11.93103 11.89944 11.88495 11.87836 11.87537 11.87398 11.87339 11.873010 11.8728
89
ALLELIC ASSOCIATION
90
Hardy-Weinberg Law
For a random mating population, expect that genotype frequen-
cies are products of allele frequencies.
For a locus with two alleles, A, a:
PAA = (pA)2
PAa = 2pApa
Paa = (pa)2
These are also the results of setting the inbreeding coefficient f
to zero.
For a locus with several alleles Ai:
PAiAi = (pAi)2
PAiAj = 2pAipAj
91
Inference about HWE
Departures from HWE can be described by the within-population
inbreeding coefficient f . This has an MLE that can be written
as
f =4nAAnaa − n2
Aa
(2nAA + nAa)(2naa + nAa)
and we can use “Delta method” to find
E(f) = f
Var(f) ≈ 1
2npApa(1 − f)[2pApa(1 − f)(1 − 2f) + f(2 − f)]
If f is assumed to be normally distributed then, (f−f)/√
Var(f) ∼N(0,1). Since Var(f) = 1/n when f = 0:
X2 =f2
Var(f)= nf2
is appropriate for testing H0 : f = 0. When H0 is true, X2 ∼ χ2(1)
.
Reject HWE if X2 > 3.84.
92
Significance level of HWE test
0 1 2 3 4 5
0.0
0.2
0.4
0.6
0.8
1.0
Chi−square with 1 df
X^2
f(X
^2)
Probability=0.05
X^2=3.84
The area under the chi-square curve to the right of X2 = 3.84
is the probability of rejecting HWE when HWE is true. This is
the significance level of the test.
93
Goodness-of-fit Test
An alternative, but equivalent, test is the goodness-of-fit test.
Genotype Observed Expected (Obs.−Exp.)2
Exp.
AA nAA np2A np2af2
Aa nAa 2npApa 2npApaf2
aa naa np2a np2Af2
The test statistic is
X2 =∑ (Obs.− Exp)2
Exp.= nf2
94
Goodness-of-fit Test
Does a sample of 6 AA, 3 Aa, 1 aa support Hardy-Weinberg?
First need to estimate allele frequencies:
pA = PAA +1
2PAa = 0.75
pa = Paa +1
2PAa = 0.25
Then form “expected” counts:
nAA = n(pA)2 = 5.625
nAa = 2npApa = 2.750
naa = n(pa)2 = 0.625
95
Goodness-of-fit Test
Perform the chi-square test:
Genotype Observed Expected (Obs.− Exp.)2/Exp.
AA 6 5.625 0.025
Aa 3 2.750 0.150
aa 1 0.625 0.225
Total 10 10 0.400
Note that f = 1 − 0.3/(2 × 0.75 × 0.25) = 0.2 and X2 = nf2.
96
Sample size determination
Although Fisher’s exact test (below) is generally preferred for
small samples, the normal or chi-square test has the advantage
of simplifying power calculations.
Assuming that f is normally distributed, form the test statistic
z =f − f√Var(f)
Under the null hypothesis H0 : f = 0 this is z0 =√nf . For a two-
sided test, reject at the α% level if z0 ≤ zα/2 or z0 ≥ z1−α/2 =
−zα/2. For a 5% test, reject if z0 ≤ −1.96 or z0 ≥ 1.96.
97
Sample size determination
If the hypothesis is false, the normal test statistic is
z =f − f√Var(f)
≈√n(f − f) = z0 −
√nf
(using the null-hypothesis value of the variance in the denomina-
tor). Suppose f > 0 so rejection occurs when z0 ≥ −zα/2. With
this rejection region, the probability of rejecting is ≥ (1 − β) if
the rejection region amounts to z = z0 −√nf ≥ zβ. i.e.
−zα/2 −√nf = zβ
nf2 = (zα/2 + zβ)2
For 5% significance level −zα/2 = 1.96, and for 90% power zβ =
−1.28 so we need nf2 ≥ (−1.96 − 1.28)2 = 10.5. i.e. n has to
be over 100,000 when f = 0.01.
98
Sample size determination
More directly, when the Hardy-Weinberg hypothesis is not true,
the test statistic nf2 has a non-central chi-square distribution
with one degree of freedom (df) and non-centrality parameter
λ = nf2. To reach 90% power with a 5% significance level, for
example, it is necessary that λ ≥ 10.5.
In this one-df case, the non-centrality value follows from per-
centiles of the standard normal distribution. If zx is the xth
percentile of the standard normal, than for significance level α
and power 1 − β, λ = (zα/2 + zβ)2.
99
Power of HWE test
0 5 10 15 20 25 30
0.0
00.0
20.0
40.0
60.0
8
Non−central Chi−square with 1 df, ncp=10.5
X^2
f(X
^2)
Probability=0.90
X^2=3.84
The area under the non-central chi-square curve to the right
of X2 = 3.84 is the probability of rejecting HWE when HWE
is false. This is the power of the test. In this plot, the non-
centrality parameter is λ = 10.5.
100
Significance Levels and p-values
The significance level α of a test is the probability of a false
rejection. It is specified by the user, and along with the null
hypothesis, it determines the rejection region. The specified, or
“nominal” value may not be achieved for an actual test.
Once the test has been conducted on a data set, the probability
of the observed test statistic, or a more extreme value, if the
null hypothesis is true is the p-value. The chi-square and normal
tests shown above give approximate p-values because they use a
continuous distribution for discrete data.
An alternative class of tests, “exact tests,” use a discrete distri-
bution for discrete data and provide accurate p-values. It may
be difficult to construct an exact test with a particular nominal
significance level.
101
Exact HWE Test
The preferred test for HWE is an exact one. The test rests
on the assumption that individuals are sampled randomly from
a population so that genotype counts have a multinomial distri-
bution:
Pr(nAA, nAa, naa) =n!
nAA!nAa!naa!(PAA)nAA(PAa)
nAa(Paa)naa
This equation is always true, and when there is HWE (PAA = p2Aetc.) there is the additional result that the allele counts have a
binomial distribution:
Pr(nA, na) =(2n)!
nA!na!(pA)nA(pa)
na
102
Exact HWE Test
Putting these together gives the conditional probability
Pr(nAA, nAa, naa|nA, na) =
n!nAA!nAa!naa!
(p2A)nAA(2pApa)nAa(p2a)
naa
(2n)!nA!na!
(pA)nA(pa)na
=n!
nAA!nAa!naa!
2nAanA!na!
(2n)!
Reject the Hardy-Weinberg hypothesis if this quantity, the prob-
ability of the genotypic array conditional on the allelic array, is
among the smallest of its possible values.
103
Exact HWE Test
For convenience, write the probability of the genotypic array,
conditional on the allelic array and HWE, as Pr(nAa|n, nA). Re-
ject the HWE hypothesis for a data set if this value is among
the smallest probabilities.
As an example, consider (nAA = 1, nAa = 0, naa = 49). The allele
counts are (nA = 2, na = 98) and there are only two possible
genotype arrays:
AA Aa aa Pr(nAa|n, nA)
1 0 49 50!1!0!49!
202!98!100! = 1
99
0 2 48 50!0!2!48!
222!98!100! = 98
99
104
Exact HWE Test
The probability of the data (nAA = 1, nAa = 0, naa = 49), con-
ditional on the allele frequencies and on HWE, is 1/99 = 0.01
which is not nearly as small as the value suggested by the chi-
square statistic of 50. (As PAa = 0, f = 1,X2 = n.)
Traditionally, the p-value is the (conditional) probability of the
data plus the probabilities of all the less-probable datasets. The
probabilities are all calculated assuming HWE is true. More re-
cently (Graffelman and Moreno, Statistical Applications in Ge-
netics and Molecular Biology 2-013; 12:433-448) it has been
shown that the test has a significance value closer to the nomi-
nal value if the p-value s half the probability of the data plus the
probabilities of all datasets that are less probably under the null
hypothesis. For the previous slide then, the p-value is 1/198.
105
Example
For a sample of size n = 100 with minor allele frequency of 0.07,
there are 8 sets of possible genotype counts:
Exact Chi-square
nAA nAa naa Prob. Mid p-value X2 p-value
93 0 7 0.0000 0.0000∗ 100.00 0.0000∗92 2 6 0.0000 0.0000∗ 71.64 0.0000∗91 4 5 0.0000 0.0000∗ 47.99 0.0000∗90 6 4 0.0002 0.0002∗ 29.07 0.0000∗89 8 3 0.0051 0.0028∗ 14.87 0.0001∗88 10 2 0.0602 0.0353∗ 5.38 0.0204∗87 12 1 0.3209 0.2262 0.61 0.434886 14 0 0.6136 0.6832 0.57 0.4503
So, for a nominal 5% significance level, the actual significance
level is 0.0204 for a chi-square test that rejects when nAa ≤ 10
and is 0.0353 for an exact test that also rejects when nAa ≤ 10.
106
Effect of Minor Allele Frequency
The minor allele frequency (MAF) in the previous example was
14/200 = 0.07. How does the exact test behave with other MAF
values?
In particular, what is the size of the rejection region for a nominal
value of α = 0.05? In other words, we decide to reject HWE
for any sample with a p-value of 0.05 or less, and we find the
total probability of all such datasets. We would hope that this
empirical significance level would be close to the nominal value,
but we find that it may not be.
107
na = 16 minor alleles
When the minor allele frequency is 0.08, for a nominal 5% signif-
icance level, the actual significance level is 0.0070 for an exact
test that rejects when nAa ≤ 10.
nAA nAa naa Pr(nAa|na) p −value
92 0 8 .0000 .000091 2 7 .0000 .000090 4 6 .0000 .000089 6 5 .0000 .000088 8 4 .0008 .000487 10 3 .0123 .007086 12 2 .0974 .061885 14 1 .3681 .294684 16 0 .5215 .7382
108
na = 15 minor alleles
When the minor allele frequency is 0.075, for a nominal 5%
significance level, the actual significance level is 0.0474 for an
exact test that rejects when nAa ≤ 11.
nAA nAa naa Pr(nAa|na) p −value
92 1 7 .0000 .000091 3 6 .0000 .000090 5 5 .0000 .000089 7 4 .0004 .000288 9 3 .0081 .004587 11 2 .0776 .047486 13 1 .3464 .259485 15 0 .5675 .7163
109
na = 13 minor alleles
When the minor allele frequency is 0.065, for a nominal 5%
significance level, the actual significance level is 0.0483 for an
exact test that rejects when nAa ≤ 9.
nAA nAa naa Pr(nAa|na) p −value
93 1 6 .0000 .000092 3 5 .0000 .000091 5 4 .0001 .000190 7 3 .0030 .003189 9 2 .0452 .048388 11 1 .2923 .340587 13 0 .6595 1.0000
110
na = 12 minor alleles
When the minor allele frequency is 0.06, for a nominal 5% signif-
icance level, the actual significance level is 0.0344 for an exact
test that rejects when nAa ≤ 8.
nAA nAa naa Pr(nAa|na) p −value
94 0 6 .0000 .000093 2 5 .0000 .000092 4 4 .0000 .000091 6 3 .0017 .001790 8 2 .0327 .018189 10 1 .2612 .165088 12 0 .7045 .2955
111
Rohlfs and Weir, 2008
0.0 0.1 0.2 0.3 0.4 0.5
0.0
00
.04
0.0
80
.12
n=100, θ=4.0
p~A
po
we
r
0.0 0.1 0.2 0.3 0.4 0.5
0.0
0.2
0.4
n=100, θ=2.0
p~A
po
we
r
0.0 0.1 0.2 0.3 0.4 0.5
0.0
0.4
0.8
n=100, θ=0.1
p~A
po
we
r
0.0 0.1 0.2 0.3 0.4 0.5
0.0
00
.05
0.1
00
.15
n=1000, θ=4.0
p~A
po
we
r
0.0 0.1 0.2 0.3 0.4 0.5
0.0
0.4
0.8
n=1000, θ=2.0
p~A
po
we
r
0.0 0.1 0.2 0.3 0.4 0.5
0.0
0.4
0.8
n=1000, θ=0.1
p~A
po
we
r
This figure is not for the mid p values. It includes all the proba-
bility of the data in p-values.
112
Graffelman and Moreno, 2013
113
Power of Exact Test
If there is not HWE:
Pr(nAa|nA, na) =n!
nAA!nAa!naa!(PAA)nAA(PAa)
nAa(Paa)naa
=n!
nAA!nAa!naa!(PAA)
nA−nAa2 (PAa)
nAa(Paa)na−nAa
2
=n!
nAA!nAa!naa!(√PAA)nA(
√Paa)
na
(PAa√PAAPaa
)nAa
=CψnAa
nAA!nAa!naa!
where ψ = PAa/(√PAAPaa) measures the departure from HWE.
The constant C makes the probabilities sum to one over all
possible nAa values: C−1 =∑nAa
ψnAa/(nAA!nAa!naa!).
114
Power of Exact Test
Once the rejection region has been determined, the power of
the test (the probability of rejecting) can be found by adding
these probabilities for all sets of genotype counts in the re-
gion. HWE corresponds to ψ = 2. What is the power to
detect HWE when ψ = 1, the sample size is n = 10 and the
sample allele frequencies are pA = 0.75, pa = 0.25? Note that
1/C = 1/(5!5!0!) + 1/(6!3!1!) + 1/(7!1!2!).
Pr(nAa|nA, n)nAA nAa naa ψ = 2 ψ = 1
5 5 0 0.520 0.2626 3 1 0.433 0.3647 1 2 0.047 0.374
The ψ = 2 column shows that the rejection region is nAa = 1.
The ψ = 1 column shows that the power (the probability nAa = 1
when ψ = 1) is 37.4%.
115
Power Examples
For given values of n, na, the rejection region is determined from
null hypothesis and the power is determined from the multinomial
distribution.
Pr(nAa|na = 16)ψ .250 .500 1.000 2.000 4.000 8.000 16.000
nAa f .631 .398 .157 .000 −.062 −.081 −.085
0 .0042 .0000 .0000 .0000 .0000 .0000 .00002 .0956 .0026 .0000 .0000 .0000 .0000 .00004 .3172 .0349 .0003 .0000 .0000 .0000 .00006 .3568 .1569 .0056 .0000 .0000 .0000 .00008 .1772 .3116 .0441 .0008 .0000 .0000 .0000
10 .0433 .3047 .1725 .0123 .0003 .0000 .000012 .0054 .1506 .3411 .0974 .0098 .0007 .000014 .0003 .0356 .3223 .3681 .1485 .0422 .010916 .0000 .0032 .1142 .5214 .8414 .9571 .9890
Power .9943 .8107 .2225 .0131 .0003 .0000 .0000
116
na = 15 Probabilities
Pr(nAa|na = 15)ψ .250 .500 1.000 2.000 4.000 8.000 16.000
nAa f .622 .389 .150 .000 −.058 −.075 −.080
1 .0338 .0006 .0000 .0000 .0000 .0000 .00003 .2269 .0150 .0001 .0000 .0000 .0000 .00005 .3871 .1027 .0026 .0000 .0000 .0000 .00007 .2592 .2750 .0273 .0004 .0000 .0000 .00009 .0801 .3400 .1352 .0081 .0002 .0000 .0000
11 .0120 .2040 .3245 .0776 .0074 .0005 .000013 .0008 .0569 .3620 .3464 .1314 .0367 .009415 .0000 .0058 .1482 .5674 .8610 .9627 .9905
Power .9871 .7333 .1652 .0085 .0002 .0000 .0000
117
na = 14 Probabilities
Pr(nAa|na = 14)ψ .250 .500 1.000 2.000 4.000 8.000 16.000
nAa f .613 .378 .143 .000 −.054 −.070 −.074
0 .0062 .0001 .0000 .0000 .0000 .0000 .00002 .1256 .0051 .0000 .0000 .0000 .0000 .00004 .3610 .0582 .0010 .0000 .0000 .0000 .00006 .3422 .2207 .0156 .0002 .0000 .0000 .00008 .1375 .3547 .1002 .0051 .0001 .0000 .0000
10 .0255 .2631 .2973 .0602 .0054 .0004 .000012 .0021 .0877 .3964 .3209 .1150 .0316 .008114 .0001 .0105 .1895 .6136 .8795 .9680 .9919
Power .9723 .6387 .1168 .0053 .0001 .0000 .0000
118
na = 13 Probabilities
Pr(nAa|na = 13)ψ .250 .500 1.000 2.000 4.000 8.000 16.000
nAa f .603 .366 .136 .000 −.050 −.065 −.068
1 .0479 .0012 .0000 .0000 .0000 .0000 .00003 .2786 .0275 .0003 .0000 .0000 .0000 .00005 .4004 .1583 .0080 .0001 .0000 .0000 .00007 .2169 .3430 .0696 .0030 .0001 .0000 .00009 .0508 .3216 .2611 .0452 .0038 .0003 .0000
11 .0051 .1301 .4225 .2923 .0994 .0269 .006913 .0002 .0183 .2383 .6595 .8967 .9728 .9931
Power .9947 .8516 .3391 .0483 .0039 .0003 .0000
119
na = 12 Probabilities
Pr(nAa|na = 12)ψ .250 .500 1.000 2.000 4.000 8.000 16.000
nAa f .592 .353 .128 .000 −.046 −.059 −.063
0 .0095 .0001 .0000 .0000 .0000 .0000 .00002 .1674 .0102 .0001 .0000 .0000 .0000 .00004 .4053 .0991 .0037 .0000 .0000 .0000 .00006 .3108 .3039 .0449 .0017 .0000 .0000 .00008 .0947 .3703 .2188 .0326 .0026 .0002 .0000
10 .0118 .1852 .4376 .2612 .0846 .0226 .005812 .0005 .0312 .2950 .7044 .9127 .9772 .9942
Power .9877 .7836 .2674 .0344 .0027 .0002 .0000
120
Rohlfs and Weir, 2008
0.0 0.1 0.2 0.3 0.4 0.5
0.0
00
.04
0.0
80
.12
n=100, θ=4.0
p~A
po
we
r
0.0 0.1 0.2 0.3 0.4 0.5
0.0
0.2
0.4
n=100, θ=2.0
p~A
po
we
r
0.0 0.1 0.2 0.3 0.4 0.5
0.0
0.4
0.8
n=100, θ=0.1
p~A
po
we
r
0.0 0.1 0.2 0.3 0.4 0.5
0.0
00
.05
0.1
00
.15
n=1000, θ=4.0
p~A
po
we
r
0.0 0.1 0.2 0.3 0.4 0.5
0.0
0.4
0.8
n=1000, θ=2.0
p~A
po
we
r
0.0 0.1 0.2 0.3 0.4 0.5
0.0
0.4
0.8
n=1000, θ=0.1
p~A
po
we
r
121
Graffelman and Moreno, 2013
122
Permutation Test
For large sample sizes and many alleles per locus, there are too
many genotypic arrays for a complete enumeration and a deter-
mination of which are the least probable 5% arrays.
A large number of the possible arrays is generated by permuting
the alleles among genotypes, and calculating the proportion of
these permuted genotypic arrays that have a smaller conditional
probability than the original data. If this proportion is small, the
Hardy-Weinberg hypothesis is rejected.
123
Permutation Test
Mark a set of five index cards to represent five genotypes:
Card 1: A A
Card 2: A A
Card 3: A A
Card 4: a a
Card 5: a a
Tear the cards in half to give a deck of 10 cards, each with
one allele. Shuffle the deck and deal into 5 pairs, to give five
genotypes.
124
Permutation Test
The permuted set of genotypes fall into one of four types:
AA Aa aa Number of times
3 0 2
2 2 1
1 4 0
125
Permutation Test
Find the theoretical values for the proportions of each of the
three types, from the expression:
n!
nAA!nAa!naa!× 2nAanA!na!
(2n)!
AA Aa aa Conditional Probability
3 0 2
2 2 1
1 4 0
These should match the proportions found by repeating shuf-
flings of the deck of 10 allele cards.
126
Multiple Testing
When multiple tests are performed, each at significance level α,
a proportion α of the tests are expected to cause rejection even
if all the hypotheses are true.
Bonferroni correction makes the overall (experimentwise) signif-
icance level equal to α by adjusting the level for each individual
test to α′. If α is the probability that at least one of the L tests
causes rejection, it is also 1 minus the probability that none of
the tests causes rejection:
α = 1 − (1 − α′)L
≈ Lα′
provided the L tests are independent.
If L = 15, need α′ = 0.0033 in order for α = 0.05.
127
Combining p-values
There is also the issue that if the same hypothesis is tested L
times and just fails to cause rejection each time, there is some
overall evidence against the hypothesis.
Suppose that tests have been conducted for each of L hypotheses
Hi, i = 1,2, . . . , L. For each test the p-value pi is calculated: if
Hi is true, this is the probability of observing a test statistic
as extreme as or more extreme than the observed value in the
direction of rejection.
128
Combining p-values
Methods for combining p-values rest on p having a uniform dis-
tribution when the hypothesis is true. Suppose the test statistic
X has the continuous distribution f(x) when the hypothesis is
true. The p-value when X = x is
p =
∫ ∞
xf(y)dy
(A one-tailed test when large values of X cause rejection.) The
distribution function F (x) for X is
F (x) =
∫ x−∞
f(y)dy = 1 − p
The density function for U = F (x) is
f(u) = f(x)dx
du= f(x)/
du
dx= f(x)/f(x) = 1
Therefore U is uniform on [0,1], and so is p.
129
Fisher’s Method
If U has a uniform distribution, transform to
V = −2 ln(U)
U = e−V/2
f(v) = f(u)du
dv
= (1/2)e−v/2
and this is the density function of a chi-square distribution with
2 d.f.
130
Fisher’s Method
If a hypothesis is true, and the significance level is p, then X2 =
−2 ln(p) is distributed as χ2(2)
. (If p = 0.05 then X2 = 5.99.)
Therefore
t = −2L∑
i=1
ln pi = −2 ln(L∏
i=1
pi)
has a chi-square distribution with 2L df when all L hypotheses
are true.
Therefore, the p-value for the hypothesis HT that all Hi are true
is the probability of a χ2(2L)
variable being greater or equal to the
observed value t.
131
Example
HW-test p-values for previous data.
Locus Cauc. Afr.Am. Hisp. −2∑
ln pLDLR 1.00 0.14 0.69 4.67 nsGYPA 0.22 1.00 0.42 4.76 nsHBGG 1.00 0.44 0.21 4.76 nsD7S8 1.00 0.64 0.69 1.63 nsGc 0.01 1.00 0.59 10.27 ns
−2∑
ln p 12.24 6.47 7.40 26.10ns ns ns ns
Do not reject overall hypothesis of no departures from HW over
populations for each locus, or over loci for each population, or
over all locus-population pairs.
132
QQ-Plots
An alternative approach to considering multiple-testing issues is
to use QQ-plots. If all the hypotheses being tested are true then
the resulting p-values are uniformly distributed between 0 and 1.
For a set of n tests, we would expect to see p values at 1/(n+
1),2/n, . . . , n/(n+ 1),1. We plot the observed p-values against
these expected values: the smallest against 1/(n + 1) and the
largest against 1. It is more convenient to transform to − log10(p)
to accentuate the extremely small p values. The point at which
the observed values start departing from the expected values is
an indication of “significant” values in a way that takes into
account the number of tests.
133
QQ-Plots
0 1 2 3 4
02
46
81
01
2
HWE Test: No SNP Filtering
−log10(p):Expected
−lo
g1
0(p
):O
bse
rve
d
The results for 9208 SNPs on human chromosome 1. Bonferroni
would suggest rejecting HWE when p ≤ 0.05/9205 = 5.4 × 10−6
or − log10(p) ≥ 5.3.
134
QQ-Plots
0 1 2 3 4
02
46
81
01
2
HWE Test: SNPs Filtered on Missingness
−log10(p):Expected
−lo
g1
0(p
):O
bse
rve
d
The same set of results as on the previous slide except now that
any SNP with any missing data was excluded. Now 7446 SNPs
and Bonferroni would reject if − log10(p) ≥ 5.2. All five outliers
had zero counts for the minor allele homozygote and at least 32
heterozygotes in a sample of size 50.
135
Linkage Disequilibrium
This term reserved for association between pairs of alleles – one
at each of two loci.
When gametic data are available, could refer to gametic disequi-
librium.
When genotypic data are available, but gametes can be inferred,
can make inferences about gametic and non-gametic pairs of
alleles.
When genotypic data are available, but gametes cannot be in-
ferred, can work with composite measures of disequilibrium.
136
Linkage Disequilibrium
For alleles A and B are two loci, the usual measure of linkage
disequilibrium is
DAB = PAB − pApB
Whether or not this is zero does not provide a direct state-
ment about linkage between the two loci. For example, consider
marker YFM and disease DTD:
A N Total
+ 1 24 25YFM
− 0 75 75
Total 1 99 100
DA+ =1
100− 1
100
25
100= 0.0075, (maximum possible value)
137
Gametic Linkage Disequilibrium
For loci A, B define indicator variables x, y that take the value
1 for allele A,B and 0 for any other alleles. If gametes within
individuals are indexed by j, j = 1,2 then for expectations over
samples from the same population
E(xj) = pA, j = 1,2 , E(yj) = pB j = 1,2
E(x2j ) = pA, j = 1,2 , E(y2j ) = pB j = 1,2
E(x1x2) = PAA , E(y1y2) = PBB
E(x1y1) = PAB , E(x2y2) = PAB
The variances of xj, yj are pA(1− pA), pB(1− pB) for j = 1,2 and
the covariance and correlation coefficients for x and y are
Cov(x1, y1) = Cov(x2, y2) = PAB − pApB = DAB
Corr(x1, y1) = Corr(x2, y2) = DAB/[pA(1 − pA)pB(1 − pB)] = ρAB
138
Estimation of LD
With random sampling of gametes, gamete counts have a multi-
nomial distribution:
Pr(nAB, nAb, naB, nab) =n!(PAB)nAB(PAb)
nAb(PaB)naB(Pab)nab
nAB!nAb!naB!nab!
=n!(pApB +DAB)nAB(pApb −DAB)nAb
nAB!nAb!naB!nab!
× (papB −DAB)naB(papb +DAB)nab
and this provides the maximum likelihood estimates of DAB and
ρAB:
DAB =nABn
− nAB + nAbn
× nAB + naBn
= PAB − pApB
ρAB = rAB =DAB√pApapBpb
139
Testing LD
Write MLE of DAB as
DAB =nABnab − nAbnaB
(nAB + nAb)(naB + nab)(nAB + naB)(nAb + nab)
and use “Delta method” to find
Var(DAB) ≈ 1
n[pA(1 − pA)pB(1 − pB)
+ (1 − 2pA)(1 − 2pB)DAB −D2AB]
When DAB = 0, Var(DAB) = pA(1 − pA)pB(1 − pB)/n.
If DAB is assumed to be normally distributed then
X2AB =
D2AB
Var(DAB)= nρ2AB = nr2AB
is appropriate for testing H0 : DAB = 0. When H0 is true,
X2AB ∼ χ2
(1).
140
Goodness-of-fit Test
The test statistic for the 2 × 2 table
nAB nAb nAnaB nab nanB nb n
has the value
X2 =n(nABnab − nAbnaB)2
nAnanBnb
For DTD/YFM example, X2 = 3.03. This is not statistically
significant, even though disequilibrium was maximal.
141
Composite Disequilibrium
When genotypes are scored, it is often not possible to distinguish
between the two double heterozygotes AB/ab and Ab/aB, so that
gametic frequencies cannot be inferred.
Under the assumption of random mating, in which genotypic fre-
quencies are assumed to be the products of gametic frequencies,
it is possible to estimate gametic frequencies with the EM algo-
rithm. To avoid making the random-mating assumption, how-
ever, it is possible to work with a set of composite disequilibrium
coefficients.
142
Composite Disequilibrium
Although the separate digenic frequencies pAB (one gamete) and
pA,B (two gametes) cannot be observed, their sum can be since
pAB = PABAB +1
2PABAb +
1
2PABaB +
1
2PABab
pA,B = PABAB +1
2PABAb +
1
2PABaB +
1
2PAbaB
pAB + pA,B = 2PABAB + PABAb + PABaB +PABab + PAbaB
2
Digenic disequilibrium is measured with a composite measure
∆AB defined as
∆AB = pAB + pA,B − 2pApB
= DAB +DA,B
which is the sum of the gametic (DAB = pAB−pApB) and nonga-
metic (DA,B = pA,B − pApB) coefficients.
143
Composite Disequilibrium
If the counts of the nine genotypic classes are
BB Bb bbAA n1 n2 n3Aa n4 n5 n6aa n7 n8 n9
the count for pairs of alleles in an individual being A and B,
whether received from the same or different parents, is
nAB = 2n1 + n2 + n4 +1
2n5
and the MLE for ∆ is
∆AB =1
nnAB − 2pApB
144
Composite Linkage Disequilibrium
For loci A, B define indicator variables x, y that take the value
1 for allele A,B and 0 for any other alleles. If gametes within
individuals are indexed by j, j = 1,2 then for expectations over
samples from the same population
E(xj) = pA, j = 1,2 , E(yj) = pB j = 1,2
E(x2j ) = pA, j = 1,2 , E(yj) = pB j = 1,2
E(x1x2) = PAA , E(y1y2) = PBB
E(x1y1) = PAB , E(x2y2) = PAB
E(x1y2) = PA,B , E(x2y1) = PA,B
Write
DA = PAA − p2A , DB = PBB − p2B
DAB = PAB − pApB , DA,B = PA,B − pApB
∆AB = DAB +DA,B
145
Composite Linkage Disequilibrium
Now set X = x1 + x2, Y = y1 + y2 to get
E(X) = 2pA , E(Y ) = 2pB
E(X2) = 2(pA + PAA) , E(Y 2) = 2(pB + PBB)
Var(X) = 2pA(1 − pA)(1 + fA) , Var(Y ) = 2pB(1 − pB)(1 + fB)
and
E(XY ) = 2(PAB + PA,B)
Cov(X,Y ) = 2(PAB − pApB) + 2(PA,B − pApB)
= 2(DAB +DA,B) = 2∆AB
Corr(X,Y ) =∆AB√
pA(1 − pA)(1 + fA)pB(1 − pB)(1 + fB)
146
Composite Linkage Disequilibrium
∆AB = nAB/n− 2pApB
where
nAB = 2nAABB + nAABb + nAaBB +1
2nAaBb
This does not require phased data.
By analogy to the gametic linkage disequilibrium result, a test
statistic for ∆AB = 0 is
X2AB =
n∆2AB
pA(1 − pA)(1 + fA)pB(1 − pB)(1 + fB)
This is assumed to be approximately χ2(1)
under the null hypoth-
esis.
147
Example
For loci LDLR and GYPA in the Hispanic Sample:
BB Bb bb Total
AA 1 1 2 4Aa 7 2 1 10aa 2 4 0 6
Total 10 7 3 20
nAB = 2 × 1 + 1 + 7 +1
2(2) = 11
nA = 18 , pA = 0.450
nB = 27 , pB = 0.675
fA =4
20−(18
40
)2
= − 1
400= −0.0025
fB =10
20−(27
40
)2
=71
1600= 0.0444
148
Example
∆AB =11
20− 2
18
4
27
40= − 46
800= −0.0575
X2AB =
20(−0.0575)2
(0.450)(0.550)(0.9975)(0.675)(0.325)(1.0444)= 1.11
Previous work on EM algorithm estimated pAB as 11.88/40=0.2970
so
DAB = 0.2970− (0.450)(0.675) = −0.008
X2AB =
40(−0.0068)2
(0.450)(0.550)(0.675)(0.325)= 0.03
149
LD vs Composite LD Estimates
� � � � �
� � � � �
� � � �
� � � �
� � � �
� � � � � � � � � � � � � � � � � � � � � �� � �
�
� � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � �
A comparison of gametic LD estimates from the EM algorithm
assuming HWE vs composite LD with no HWE assumption.
150
LD vs Composite LD Tests
��� !
" �" �" �" " !
� �
� # " � " # � �$ % & ' ( ) ) * + , - . / 0 1
23 45678 9:: ;<= 6>
?@A
B C D E F G H I J C K L M D G N C O G P E C Q C R F C E S T G O T O T M T C O T C U O V W M E U M O C M K X
A comparison of gametic LD tests from the EM algorithm as-
suming HWE vs composite LD with no HWE assumption.
151
POPULATION STRUCTURE
152
Population Data
Individuals from several populations are scored at a series of
marker loci. At each locus, an individual has two alleles, one
from each parent, and these can be identified. For example, at
locus D3S1358:
Allele AFC NSC QLC SAC TAC VIA WAB11 .000 .001 .002 .001 .000 .000 .00012 .004 .003 .001 .001 .000 .000 .01013 .008 .003 .002 .002 .000 .000 .00114 .123 .098 .159 .125 .152 .008 .07515 .261 .264 .365 .252 .244 .385 .35316 .250 .270 .250 .265 .241 .277 .24217 .187 .198 .123 .202 .197 .246 .19018 .154 .152 .091 .144 .157 .077 .12219 .012 .011 .006 .007 .010 .008 .00720 .002 .000 .000 .000 .000 .000 .000
153
Questions of Interest
• How much genetic variation is there? (animal conservation)
• How much migration (gene flow) is there between popula-
tions? (molecular ecology)
• How does the genetic structure of populations affect tests for
linkage between genetic markers and human disease genes?
(human genetics)
• How should the evidence of matching marker profiles be
quantified? (forensic science)
• What is the evolutionary history of the populations sampled?
(evolutionary genetics)
154
Statistical Analysis
Possible to approach these data from purely statistical viewpoint.
Could test for differences in allele frequencies among populations.
Could use various multivariate techniques to cluster populations.
These analyses may not answer the biological questions.
155
Frequencies of Allele Au
Population 1 Population r
p1u . . .
pu
pru
Among samples from a population: piu ∼ Binomial(n, piu).
Among replicates of a population: piu ∼ Beta((1−θi)pu
θi, (1−θi)(1−pu)θi
).
156
Beta distribution: Theoretical
The beta distribution can take a variety of shapes.
0.0 0.4 0.80
.61
.01
.4
v=1,w=1
0.0 0.4 0.8
0.0
1.0
v=2,w=2
0.0 0.4 0.8
0.0
1.0
2.0
v=4,w=4
0.0 0.4 0.8
0.0
1.0
2.0
v=2,w=1
0.0 0.4 0.80
12
34
v=4,w=1
0.0 0.4 0.8
01
23
4
v=1,w=4
0.0 0.4 0.8
1.0
1.3
1.6
v=0.9,w=0.9
0.0 0.4 0.8
24
68
u=0.5,w=0.5
0.0 0.4 0.80
10
25
u=0.5,w=4
157
Beta distribution: Experimental
The beta distribution is suggested by a Drosophila cage experi-
ment by P. Buri (Evolution 10:367, 1956).
158
Model Details
It is useful to attach indicator variables to each allele. For the
jth allele sampled from the ith population
xij =
{1 allele is of type Au0 otherwise
The model specifies expectations for these indicators. Over sam-
ples from one population:
E(xij) = piu
E(xijxij′) = p2iu j 6= j′, large ni
Over samples from each population and over replicates of each
population:
E(xij) = pu
E(xijxij′) = θipu + (1 − θi)p2u, j 6= j′
E(xijxi′j′) = θii′pu + (1 − θii′)p2u, i 6= i′
159
Inbreeding Coefficients
f1 . . .
F
fr
Within a population: Puui = p2iu + fipiu(1 − piu).
Over replicates of a population: E(Puui) = p2u + Fipu(1 − pu).
160
Coancestry Coefficients
BB
BB
BB
BBBM
����������
θi
. . .
HHHHHHHHHHHHY
������������*
θii′B
BBB
BB
BBBM
����������
θi′
For a population i, using average allele frequencies:
E(Puui) = p2u + θipu(1 − pu), if Fi = θi.
For two populations i, i′: E(Pu,uii′) = p2u + θii′pu(1 − pu).
161
Genotypic Frequencies
Within a population:
Puui = p2iu + fipiu(1 − piu)
Take expected values over replicates of a population:
E(piu) = pu
E(p2iu) = p2u + θipu(1 − pu)
E(Puui) = [p2u + θipu(1 − pu)] + fi[pu − p2u − θipu(1 − pu)]
= p2u + [θi + fi(1 − θi)]pu(1 − pu)
= p2u + Fipu(1 − pu)
since fi = (Fi − θi)/(1 − θi) or FIS = (FIT − FST)/(1 − FST) and
this is zero for HWE within a population.
162
Sample Allele Frequencies: Weir & Hill, 2002
The sample frequency for allele Au for population i can be written
as an average of indicator variables:
piu =1
ni
ni∑
j=1
xij
and this leads to variances and covariances of frequencies over
samples from populations and over replicates of populations:
Var(piu) = pu(1 − pu)
(θi +
1 − θini
)
Cov(piu, piu′) = −pupu′(θi +
1 − θini
), u 6= u′
Cov(piu, pi′u) = pu(1 − pu)θii′, i 6= i′
Cov(piu, pi′u′) = −pupu′θii′, u 6= u′, i 6= i′
163
Weir & Cockerham 1984 Restricted Model
If populations have equal evolutionary histories (θi = θ, all i) and
are independent (θii′ = 0, all i 6= i′)
Var(piu) = pu(1 − pu)
(θ+
1 − θ
ni
)
Cov(piu, piu′) = −pupu′(θ+
1 − θ
ni
), u 6= u′
Cov(piu, pi′u) = 0, i 6= i′
Cov(piu, pi′u′) = 0, u 6= u′, i 6= i′
The overall allele frequencies were weighted by sample sizes
pA =1
∑i ni
∑
i
nipiu
If θ = 0, these weighted means have minimum variance.
164
Weir & Cockerham 1984 Restricted Model
Under this model, the sampled populations can be regarded as
evolutionary realizations of the process that led to any one of
them. The sampled populations are equivalent to the population
replicates that have been discussed in this treatment.
165
Weir & Cockerham 1984 Restricted Model
Two mean squares were constructed for each allele:
MSBu =1
r − 1
r∑
i=1
ni(piu − pu)2
MSWu =1
∑i(ni − 1)
∑
i
nipiu(1 − piu)
These have expected values
E(MSBu) = pu(1 − pu)[(1 − θ) + ncθ]
E(MSWu) = pu(1 − pu)(1 − θ)
where nc = (∑i ni−
∑i n
2i /∑i ni)/(r− 1). The Weir & Cockeram
estimator of θ (or FST) is
θ =
∑u(MSBu − MSWu)
MSBu + (nc − 1)MSWu
166
Unweighted Model
Bhatia et al., Genome Research 23:1514-1521, 2013, returned
to early work by Nei. They used heterozygosities instead of mean
squares:
Hi =ni
ni − 1
∑
upiu(1 − piu) =
nini − 1
(1 −∑
up2iu)
Hii′ = 1 −∑
upiupi′u
Unweighted averages over populations are
HW =1
r
∑
i
Hi , θW =1
r
∑
i
θi
HB =1
r(r − 1)
∑
i 6=i′Hii′ , θB =
1
r(r − 1)
∑
i 6=i′θii′
167
Unweighted Model Expectations
Under the general model of Weir & Hill, these within- and between-
population heterozygosities have expectations
E(Hi) = H(1 − θi)
E(Hii′) = H(1 − θii′)
E(HW ) = H(1 − θW)
E(HB) = H(1 − θB)
where H =∑u pu(1 − pu) = 1 −∑
u p2u.
These equations suggest estimation by the method of moments.
168
Unweighted Model Estimates
Can only estimate θi and θii′ “relative to” θB:
βi = 1 − HiHB
, E(βi) =θi − θB1 − θB
βW = 1 − HWHB
, E(βW ) =θW − θB1 − θB
βii′ = 1 − Hii′
HB, E(βii′) =
θii′ − θB1 − θB
Estimation of θB is not possible unless the whole set of popu-
lations is replicated. Different values of βi or βii′ imply different
values of θi or θii′.
169
Weir & Cockerham Estimator
Under the Weir and Hill model, the Weir and Cockerham esti-
mator has expectation
E(θWC) =θcW − θcB +Q
1 − θcB +Qinstead of
θW − θB1 − θB
where
θcW =
∑i nciθi∑
i nci
, θcB =
∑i 6=i′ nini′θii′∑i 6=i′ nini′
nci = ni −n2i∑i ni
, nc =1
r − 1
∑
i
nci
Q =1
(r − 1)nc
∑
i
(nin
− 1
)θi
If the Weir and Cockerham model holds (θi = θ), or if ni = n, or
if nc is large, then Q = 0.
170
HapMap III Data
Sample Number ofCode Population Description size SNPs typedASW African ancestry in Southwest USA 142 61566CEU Utah residents with Northern and Western 324 56218
European ancestry from CEPH collectionCHB Han Chinese in Beijing, China 160 51946CHD Chinese in Metropolitan Denver, Colorado 140 50517GIH Gujarati Indians in Houston, Texas 166 55710JPT Japanese in Tokyo, Japan 168 53020LWK Luhya in Webuye, Kenya 166 60026MXL Mexican ancestry in Los Angeles, California 142 57937MKK Maasai in Kinyawa, Kenya 342 61057TSI Toscani in Italia 154 55881YRI Yoruba in Ibadan, Nigeria 326 58559
171
Weir & Cockerham vs Unweighted Estimator
0.00 0.05 0.10 0.15 0.20 0.25
0.0
00.0
50.1
00.1
50.2
00.2
5
HapMap Fst estimates: no SNP filtering
Bhatia et al
Weir &
Cockerh
am
FST estimates for HapMap III, using all 87,592 SNPs on chromosome 1.
172
Weir & Cockerham vs Unweighted Estimator
0.00 0.05 0.10 0.15 0.20 0.25
0.0
00.0
50.1
00.1
50.2
00.2
5
HapMap Fst estimates: SNP filtering
Bhatia et al
Weir &
Cockerh
am
FST estimates for HapMap III, using the 42,463 SNPs on chromosome 1
that have at least five copies of the minor allele in samples from all 11
populations.
173
Multiple Loci: Weir & Cockerham
The Weir & Cockerham estimators for locus l are of the form
Estimatorl =
∑uNumeratorlu∑
uDenominatorlu
With several loci, these can be extended to
Estimator =
∑l,uNumeratorlu∑
l,uDenominatorlu
If every locus has the same value of θ, this multi-locus estimator
gives an estimate of the common value. Otherwise it estimates
a weighted average of the different θ values, where the weights
are functions of the allele frequencies at the loci in the sum.
174
Multiple Loci: Unweighted
The unweighted estimators for locus l are of the form
Estimatorl =HBl − Hxl
HBl, x = i,W, ii′
With several loci, these can be extended to
Estimator =
∑l(HBl − Hxl)∑
l HBlx = i, ii′,W
and these estimate (θx− θB)/(1− θB) if each locus has the same
value of the θ’s. Otherwise thet estimate a weighted average
of the different θ values, where the weights are functions of the
allele frequencies at the loci in the sum.
175
βW in LCT Region: 3 Populations
100 120 140 160 180
0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
HapMap III Chromosome 2
Position
Be
ta−
W
LCT
CEU
CHB
YRI
176
βW in LCT Region: 11 Populations
100 120 140 160 180
0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
HapMap III Chromosome 2
Position
Be
ta−
W
LCT
CEU
MKK
Other
177
MKK Population
“The Maasai are a pastoral people in Kenya and Tanzania, whose
traditional diet of milk, blood and meat is rich in lactose, fat
and cholesterol. In spite of this, they have low levels of blood
cholesterol, and seldom suffer from gallstones or cardiac diseases.
Analysis of HapMap 3 data using Fixation Index (Fst) identified
genomic regions and single nucleotide polymorphisms (SNPs)
as strong candidates for recent selection for lactase persistence
and cholesterol regulation in 143156 founder individuals from the
Maasai population in Kinyawa, Kenya (MKK). The strongest
signal identified by all three metrics was a 1.7 Mb region on
Chr2q21. This region contains the gene LCT (Lactase) involved
in lactase persistence.”
Wagh et al., PLoS One 7: e44751, 2012
178
Mean and Variance of βW in LCT Region
100 120 140 160 180
0.0
00
.05
0.1
00
.15
0.2
00
.25
0.3
0
HapMap III − Chromosome 2
Position (Mb)
Be
ta−
W
Mean
StdDev
LCT
179
Pure Drift Model
Within a species, divergence among populations is driven primar-
ily by genetic drift – the random choice of one of the two parental
alleles for transmission from an individual to an offspring. Mu-
tation is likely to be of less importance in the short term, and
natural selection may not affect genetic markers.
Suppose a parental population has nu of 2N alleles of type Au.
Conditional on pu = nu/2N , the number of alleles of this type in
the offspring generation of N individuals is B(2N,pu).
180
Drift Model Variance
Write initial allele frequency as p0.
Conditional on Total overGenera Allele previous gen. all generations-tion freq. Mean Variance Mean Variance
0 p0 p0 0 p0 0
1 p1 p0p0(1−p0)
2Np0 p0(1 − p0)
[1 −
(1 − 1
2N
)]
2 p2 p1p1(1−p1)
2Np0 p0(1 − p0)
[1 −
(1 − 1
2N
)2]
3 p3 p2p2(1−p2)
2Np0 p0(1 − p0)
[1 −
(1 − 1
2N
)3]
· · · · · · · · · · · · · · · · · ·
t pt pt−1pt−1(1−pt−1)
2Np0 p0(1 − p0)
[1 −
(1 − 1
2N
)t]
181
Drift Model IBD
Could also define θ as
θt = 1 −(1 − 1
2N
)t
so that
Var(pt) = p0(1 − p0)θt
We can see that
θt =1
2N+
(1 − 1
2N
)θt−1
and θ can be interpreted as the probability that two alleles are
identical by descent. This probability interpretation is not neces-
sary. With this drift model, θ serves as a measure of the number
of generations since the founding population (θ = 0).
182
Drift Model: Two Populations
?
t1
θ12�
��
��
@@
@@@θ1 θ2
θ1 = 1 − (1 − θ12)Xt11 , θ2 = 1 − (1 − θ12)X
t12
where Xi = (2Ni− 1)/2Ni and Ni is the population size for pop-
ulation i. Therefore
βi =θi − θ12
1 − θ12= 1 −X
t1i ≈ t1
2Ni
and this is the quantity estimated by the estimate βi. Note that
the times from each population to their most recent common
ancestral population must be the same.
183
Drift Model: Two Populations
The unweighted within-population average estimator βW has ex-
pectation
βW =θW − θB1 − θB
=
θ1+θ22 − θ12
1 − θ12
= 1 − Xt11 +X
t12
2
≈ 1
2
(1
2N1+
1
2N2
)t1 =
t12Nh
so time is being estimated in terms of the harmonic mean of the
two population sizes.
184
Drift Model: Two Populations
The Weir & Cockerham estimator has an expectation of
E(θWC) =
θ1+θ22 − θ12 + (n1−n2)(θ1−θ2)
2n1n2
1 − θ12 + (n1−n2)(θ1−θ2)2n1n2
which reduces to
E(θWC) =
θ1+θ22 − θ12
1 − θ12
if n1 = n2 or θ1 = θ2.
185
Within-populations Example
FBI: African American (AA), Caucasian (CA) and Hispanic (HI) data.
βiLocus AA CA HI Average βWC
D3S135 .010 −.030 .069 .017 .019vWA .000 −.002 .050 .017 .019FGA .007 .012 −.008 .003 .006D8S117 .026 .003 .009 .012 .015D21S11 −.012 −.003 .047 .012 .014D18S51 .011 .008 .010 .010 .012D5S818 −.018 .061 .012 .019 .021D13S31 .132 .028 −.042 .036 .040D7S820 .014 −.016 .026 .008 .011CSF1PO −.048 .015 .051 .006 .008TPOX −.118 .090 .112 .027 .030THO1 .078 .008 .041 .043 .045D16S53 −.011 .028 .024 .014 .016All loci .010 .017 .032 .020 .020
186
Between-populations Example
FBI: African American (AA), Caucasian (CA) and Hispanic (HI) data.
βii′Locus AA&CA AA&HI CA&HID3S135 −.018 .026 −.009vWA −.018 .006 .010FGA .006 −.002 −.004D8S117 −.012 .002 .009D21S11 −.008 −.008 .015D18S51 −.004 −.008 .011D5S818 .003 −.039 .033D13S31 .058 −.021 −.032D7S820 .001 .004 −.006CSF1PO −.024 −.009 .034TPOX −.043 −.053 .097THO1 −.034 .028 .006D16S53 −.009 −.009 .018Total −.008 −.006 .014
187
Allele Frequency Distribution
The moment estimators of θ’s and β’s made use only of the
mean and variance of the distribution of allele frequencies over
populations. The advantages of maximum likelihood estimation
are available if the whole distribution is known.
A convenient distribution is the normal, although it can be only
an approximation. Under a class of evolutionary models (allelic
exchangeability and stationarity) the Beta or Dirichlet distribu-
tion may be used.
188
Normal Distribution
If allele frequencies are not too extreme, it may be satisfactory
to assume they have a normal distribution over populations.
In the case of equal values of θi and zero values of θii′, the sample
frequencies piu for alleles Au in sample i have these properties:
E(piu) = pu
Var(piu) = pu(1 − pu)
[θ+
1 − θ
2n
]
Cov(piu, piu′) = −pupu′[θ+
1 − θ
2n
], u 6= u′
Cov(piu, pi′u′) = 0, i 6= i′
Multiple alleles suggest a multivariate normal approach.
189
Multivariate Normality
Q =r∑
i=1
(pi − p)′V−1(pi − p)
=r∑
i=1
m∑
u=1
(piu − pu)2
pu
(pu =∑ri=1 piu/r) has distribution
Q ∼ θχ2(r−1)(m−1)
suggesting that
(r − 1)(m− 1)θ = Q
or
θN ≈ 1
(m− 1)(r − 1)
r∑
i=1
m∑
u=1
(piu − pu)2
pu
190
Properties of Normal Estimate
From the chi-distribution of Q:
E(Q) = (r − 1)(m− 1)θ
Var(Q) = 2(r − 1)(m− 1)θ2
so that
E(θN) = θ
Var(θN) =2[1 + (2n− 1)θ)]2
(r − 1)(m− 1)(2n− 1)2
≈ 2θ2
(r − 1)(m− 1), (large n)
191
Confidence Intervals
For large sample sizes,
(r − 1)(m− 1)θNθ
∼ χ2(r−1)(m−1)
so a 95% confidence interval for θ is
((r − 1)(m− 1)θN
χ20.975
,(r − 1)(m− 1)θN
χ20.025
)
An estimated θ of 0.01 from data from five populations for a
locus with five alleles, for example, leads to a 95% confidence
interval of (0.001, 0.036).
Analytical approach more convenient than bootstrapping.
192
Effect of Number of Loci
FST
Fre
qu
en
cy
0.0 0.2 0.4 0.6 0.8
05
00
01
00
00
15
00
0
HapMap
FST
0.0 0.2 0.4 0.6 0.8
01
00
00
20
00
03
00
00
40
00
0
Perlegen
5Mb window
Individual SNP
193
INBREEDING AND RELATEDNESS
194
Inbreeding
Two alleles that descend from the same allele are said to be
identical by descent (ibd).
The probability that two parents transmit ibd alleles to a child
is the inbreeding coefficient F of the child.
Use small letters for alleles, and capital letters for their particular
type: i.e. parents might transmit alleles a and b to their child,
and these alleles might both be type A.
Mother Father(ac) (bd)
↘ ↙a b
↘ ↙child(ab)
195
Relatedness
Two individuals that have ibd alleles are said to be related.
The probability that an allele taken at random from one individual
is ibd to an allele taken at random from another individual is the
coancestry coefficient θ of those two individuals.
The inbreeding coefficient of an individual is the coancestry of
its parents.
196
Relatedness
X Y(ac) (bd)
↘ ↙a b
↘ ↙I
(ab)
FI = θXY
197
Path Counting
A↙ ↘
... ...↘ ↓ ↓ ↙
X Y↘ ↙
I
Identify the path linking the parents of I to their common an-
cestor(s).
198
Path Counting
If the parents X,Y of an individual I have ancestor A in common,
and if there are n individuals (including X,Y, I) in the path linking
the parents through A, then the inbreeding coefficient of I, or
the coancestry of X and Y , is
FI = θXY =
(1
2
)n(1 + FA)
If there are several ancestors, this expression is summed over all
the ancestors.
199
Parent-Child
X↘
Y
The common ancestor of parent X and child Y is X. The path
linking X,Y to their common ancestor is Y X and this has n = 2
individuals. Therefore
θXY =
(1
2
)2
=1
4
200
Grandparent-grandchild
X↘
V↘
Y
The common ancestor of grandparent X and grandchild Y is X.
The path linking X,Y to their common ancestor is Y V X and this
has n = 3 individuals. Therefore
θXY =
(1
2
)3
=1
8
201
Half sibs
U V W↘ ↙ ↘ ↙
X Y↓ ↓
The common ancestor of half sibs X and Y is V . The path
linking X,Y to their common ancestor is XV Y and this has n = 3
individuals. Therefore
θXY =
(1
2
)3
=1
8
202
Full sibs
U V↓ ↘ ↙ ↓
↓ ↙ ↘ ↓X Y
The common ancestors of full sibs X and Y are U and V . The
paths linking X,Y to their common ancestors are XUY and XV Y
and these each have n = 3 individuals. Therefore
θXY =
(1
2
)3
+
(1
2
)3
=1
4
203
First cousins
U V↓ ↘ ↙ ↓
↓ ↙ ↘ ↓G C D H
↘ ↙ ↘ ↙X Y
The common ancestors of cousins X and Y are U and V . The
paths linking X,Y to their common ancestors are XCUDY and
XCV DY and these each have n = 5 individuals. Therefore
θXY =
(1
2
)5
+
(1
2
)5
=1
16
204
Genotype frequencies
Consider an individual with inbreeding coefficient F . The proba-
bility it has two ibd alleles is F , and the probability of two non-ibd
alleles is 1 − F . The probability that any allele is type A is pA,
the population allele frequency. So, the probability the individual
is homozygous is
PAA = Pr(AA|ibd)Pr(ibd) + Pr(AA|not ibd)Pr(not ibd)
= pA × F + p2A × (1 − F )
There is an evolutionary perspective here: F reflects the history
that leads to alleles being ibd. Identity by descent has meaning
only with reference to previous generations. The allele frequency
pA is the expected value over evolutionary histories.
205
Genotype frequencies
To emphasize the difference from Hardy-Weinberg:
PAA = p2A + FpA(1 − pA)
Because heterozygous individuals must have non-ibd alleles:
PAa = 2(1 − F )pApa
= 2pApa − 2FpApa
206
First Cousin Example
For individuals with parents who were first cousins, F = 1/16 =
0.0625. For a locus with allele frequencies {pi} that were all
0.10:
Pii = (0.1)2 + 0.0625(0.1)(0.10) = 0.015625
Pij = 2(0.1)(0.1) − 2(0.0625)(0.1)(0.1) = 0.018750
207
Individual Inbreeding Coefficients
Earlier slides considered the estimation of the within-population
inbreeding coefficient f that made use of sample allele frequen-
cies from the particular population under consideration. This f
is a description of the population.
Individual-specific (total) inbreeding coefficients F can be esti-
mated under the assumption that all loci have the same coeffi-
cient, now interpreted as the probability of identity by descent
(ibd). Many loci are needed.
At locus l, write pl for the frequency of Al and code the genotypes
AlAl, Alal, alal as Xl = 2,1,0. These new variables have the
properties E(Xl) = 2pl,Var(Xl) = 2pl(1 − pl)(1 + F ).
208
Individual Inbreeding Coefficients
Then one moment estimator is formed by summing over loci
l, l = 1,2, . . . L:
F =1
L
L∑
l=1
(Xl − 2pl)2
2pl(1 − pl)− 1
but an estimate with smaller variance is
F =
∑Ll=1(Xl − 2pl)
2
∑Ll=1 2pl(1 − pl)
− 1
In practice, we need to use sample allele frequencies.
209
MLE for Individual Inbreeding Coefficients
To avoid having to choose among MoM estimates can set up
an MLE although there is more numerical work needed. An
iterative method makes use of Bayes’ theorem. If F represents
the probability the individual in question has two ibd alleles at a
locus, i.e. is inbred at that locus,
Pr(AlAl|inbred) = pl , Pr(AlAl|Not inbred) = p2lPr(Alal|inbred) = 0 , Pr(Alal|Not inbred) = 2pl(1 − pl)
Pr(alal|inbred) = 1 − pl , Pr(alal|Not inbred) = (1 − pl)2
From Bayes’ theorem then
Pr(inbred|AlAl) =Pr(AlAl|inbred)Pr(inbred)
Pr(AlAl)
= FF+pl(1−F)
Pr(inbred|Alal) = 0
Pr(inbred|alal) = FF+(1−pl)(1−F)
210
MLE for Individual Inbreeding Coefficients
This suggests an iterative scheme: assign an initial value to
F , and then average the updated values over loci. If Gl is the
genotype at locus l, the updated value F ′ is
F ′ =1
L
L∑
l=1
Pr(inbred|Gl)
This value is then substituted into the right hand side and the
process continues until convergence.
211
Descent measures for four alleles
Much of population and quantitative genetics theory rests on
the comparison of pairs of individuals - either on their genotypes
or on their trait values, or both. For single-locus analyses this
requires a discussion of four-allele descent measures.
One-locus descent measures δ for any four alleles a, b, c, d iden-
tify which subsets of the four are ibd. The measures give the
probabilities of the specified alleles being ibd, and other alleles
of the four being not-ibd.
212
Complete Set of IBD Measures
A complete description of the ibd status among four alleles
a, b, c, d carried by two individuals requires 15 measures (as op-
posed to the two, F , 1 − F , for one individual):
Alleles ibd∗ Probability Alleles ibd∗ Probability
a, b, c, d δabcd a, b δaba, b, c δabc a, c δaca, b, d δabd a, d δada, c, d δacd b, c δbcb, c, d δbcd b, d δbd
a, b and c, d δab.cd c, d δcda, c and b, d δac.bd none δ0a, d and b, c δad.bc∗Alleles not listed are not ibd to those listed
213
Nine-parameter IBD Set
In most applications there is no need to distinguish between maternal andpaternal alleles and the 15 ibd states can be collapsed into nine:
NotationAlleles ibd∗ Cockerham,1971 Jacquard,1970a, b, c, d δabcd ∆1
a, b, c or a, b, d δabc + δabd ∆3
a, c, d or b, c, d δacd + δbcd ∆5
a, b and c, d δab.cd ∆2
(a, c and b, d) or(a, d and b, c) δac.bd + δad.bc ∆7
a, b δab ∆4
c, d δcd ∆6
a, c or a, c orb, c or b, d δac + δad + δbc + δbd ∆8
none δ0 ∆9∗Alleles not listed are not ibd to those listed
214
Nine-parameter IBD Set
Solid lines join pairs of ibd alleles: top row is the pair of alleles
for X, bottom row the pair of alleles for Y .
X
Y
a
c
b
d�
��
�@@
@@
S1
a
c
c
d
S2
a
c
b
d�
��
�
S3
a
c
b
d
S4
a
c
b
d
@@
@@
S5
X
Y
a
c
b
d
S6
a
c
b
d
S7
a
c
b
d
S8
a
c
b
d
S9
215
Coancestry Coefficient
The coancestry coefficient θXY is the probability that a random
allele from X(ab) is ibd to a random allele from Y (cd):
θXY =1
4[Pr(a ≡ c) + Pr(a ≡ d) + Pr(b ≡ c) + Pr(b ≡ d)]
= δabcd +1
2(δabc + δabd + δacd + δbcd) +
1
2(δac.bd + δad.bc)
+1
4(δac + δad + δbc + δbd)
= ∆1 +1
2(∆3 + ∆5 + ∆7) +
1
4∆8
216
Siblings whose parents are first cousins
G(i,j) H(k,l)
? ?
HHHHHHHHHHHHHHHj
����������������C D E F@
@@
@@
@@@R
��
��
��
��
@@
@@
@@
@@R
��
��
��
��
A B
??
�����������������������������9
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXz
a d
b c
X(a,b) Y(c,d)
217
Siblings whose parents are first cousins
δabcd ( 164) δab.cd (0)
X : a—b X : a—b∆1 | | ∆2
( 164) Y : c—d (0) Y : c—d
δabc ( 164) δabd ( 1
64)
X : a—b X : a—b∆3 | or |( 264) Y : c d Y : c d
δab ( 164) δcd ( 1
64)
X : a—b X : a b∆4 ∆6
( 164) Y : c d ( 1
64) Y : c—d
218
Siblings whose parents are first cousins
δacd ( 164) δbcd ( 1
64)
X : a b X : a b∆5 | or |( 264) Y : c—d Y : c— d
δac.bd (1564) δad.bc (0)
X : a b X : a b∆7 | | or X
(1564) Y : c d Y : c d
219
Siblings whose parents are first cousins
δac (1464
) δad ( 164
)∆8 or
(3064
) X : a b X : a b| ↘
Y : c d Y : c d
δbc ( 164
) δbd (1464
)X : a b X : a b
or ↙ or |Y : c d Y : c d
δ0(1264
)X : a b
∆9
(1264
) Y : c d
220
Siblings whose parents are first cousins
θ =1
4(δabcd + δabc + δacd + δac.bd + δac)
+1
4(δabcd + δabd + δacd+ δad.bc + δad)
+1
4(δabcd + δabc + δbcd + δad.bc + δbc)
+1
4(δabcd + δabd + δbcd + δac.bd + δbd)
= ∆1 +1
2(∆3 + ∆5 + ∆7) +
1
4∆8
=9
32
221
Non-inbred Relatives
There is a final reduction when neither individual is inbred, as
then neither a, b nor c, d are ibd. There are only three states and
the three probabilities are often written as k2 = ∆7, k1 = ∆8 or
k0 = ∆9 to indicate the number of pairs of ibd alleles carried by
the two individuals. Values of these three probabilities for some
common relationships are:
Relationship k2 k1 k0 θ = 12k2 + 1
4k1Identical twins 1 0 0 1/2Full sibs 1/4 1/2 1/4 1/4Parent-child 0 1 0 1/4Double first cousins 1/16 3/8 9/16 1/8Half sibs∗ 0 1/2 1/2 1/8First cousins 0 1/4 34 1/16Unrelated 0 0 1 0∗ Also grandparent-grandchild and avuncular (e.g. uncle-niece).
222
Joint genotypic probabilities
Genotypes General Non − inbred
ii, ii ∆1pi + (∆2 + ∆3 + ∆5 + ∆7)p2i k2p2i + k1p3i + k0p4i+ (∆4 + ∆6 + ∆8)p3i + ∆9p4i
ii, jj ∆2pipj + ∆4pip2j k0p2i p2j
+ ∆6p2i pj + ∆9p2i p2j
ii, ij ∆3pipj + (2∆4 + ∆8)p2i pj k1p2i pj + 2k0p3i pj+ 2∆9p3i pj
ii, jk 2∆4pipjpk + 2∆9p2i pjpk 2k0p2i pjpk
ij, ij 2∆7pipj + ∆8pipj(pi + pj) 2k2pipj + k1pipj(pi + pj)+ 4∆9p2i p
2j + 4k0p2i p
2j
ij, ik ∆8pipjpk + 4∆9p2i pjpk k1pipjpk + 4k0p2i pjpk
ij, kl 4∆9pipjpkpl 4k0pipjpkpl
223
Example: Non-inbred full sibs
Genotypes Probability
ii, ii p2i (1 + pi)2/4
ii, jj p2i p2j /4
ii, ij pipj(pi + pj)/2
ii, jk p2i pjpk/2
ij, ij pipj(1 + pi + pj + 2pipj)/2
ij, ik pipjpk(1 + 2pi)/2
ij, kl pipjpkpl
224
Method of Moments for Relatedness Coefficients
PLINK uses MoM to estimate three ibd coefficients k0, k1, k2 for
non-inbred relatives.
Two individuals are scored as being in ibs states 0,1,2
ibs state Genotypes Probability
2 (AA,AA), (aa, aa), (Aa,Aa) (p2 + q2)2k0 + k1(p3 + pq+ q3) + k2
1 (AA,Aa), (Aa,AA), (aa, Aa), (Aa, aa) 4pq(p2 + q2)k0 + 2pqk1
0 (AA, aa), (aa,AA) 2p2q2k0
225
MoM Approach: k0
Count the number of loci in ibs state i; i = 0,1,2. These numbers
are N0, N1, N2. The previous table gives the probabilities of ibs
state i given ibd state j. From
Pr(ibs = 0) = Pr(ibs = 0|ibd = 0)Pr(ibd = 0)
sum over loci to get
N0 = Pr(ibd = 0)∑
loci
2p2q2
This gives a moment estimate
Pr(ibd = 0) =N0∑
loci 2p2q2
226
MoM Approach: k1
From
Pr(ibd = 1) = Pr(ibs = 1|ibd = 0)Pr(ibd = 0)
+ Pr(ibs = 1|ibd = 1)Pr(ibd = 1)
sum over loci to get
N1 = Pr(ibd = 0)∑
loci
4pq(p2 + q2) + Pr(ibd = 1)∑
loci
2pq
but we already have an estimate of Pr(ibd = 0). Therefore
Pr(ibd = 1) =N1 −∑
loci 4pq(p2 + q2)Pr(ibd = 0)
∑loci 2pq
227
MoM Approach: k2
From
Pr(ibd = 2) = Pr(ibs = 2|ibd = 0)Pr(ibd = 0)
+ Pr(ibs = 2|ibd = 1)Pr(ibd = 1)
+ Pr(ibs = 2|ibd = 2)Pr(ibd = 2)
sum over loci to get
N2 = Pr(ibd = 0)∑
loci
(p2 + q2)2 + Pr(ibd = 1)∑
loci
(p3 + pq+ q3)
+ Pr(ibd = 2)
but we already have estimates of Pr(ibd = 0) and Pr(ibd = 1.Therefore
Pr(ibd = 1) =N2 −
∑loci(p
2 + q2)2 Pr(ibd = 0) −∑
loci(p3 + pq + q3)Pr(ibd = 1)∑
loci 1
228
Example
229
MoM for Coancestry Coefficient
One moment estimator for the coancestry between individuals G
and H is formed by summing over loci l, l = 1,2, . . . L:
θGH =1
L
L∑
l=1
(Xl − 2pl)(Yl − 2p)
2pl(1 − pl)
where Xl, Yl are 2,1,0 if G,H are (AA,Aa, aa), respectively, at
locus l. An estimate with smaller variance is
θGH =
∑Ll=1(xl − 2pl)(yl − 2p)∑Ll=1 2pl(1 − pl)
In practice, sample allele frequencies will need to be used.
230
MLE for Relatedness Coefficients
For any SNP there are six distinct pairs of genotypes with prob-
abilities depending on allele frequencies for that SNP and on a
set of three k parameters that are assumed to be the same for
all SNPs.
If S is the observed pair of genotypes, we know the conditional
probabilities Pr(S|Di) where the Di represent the identity states
(the relationship). The probability of ibd state Di is ki.
For example, if S is the state (AA,AA); Pr(S|D0) = p4A, Pr(S|D1) =
p3A, and Pr(S|D2) = p2A and
Pr(S) =∑
i
Pr(S|Di)Pr(Di) = p4Ak0 + p3Ak1 + p2Ak2
231
MLE for Relatedness Coefficients
An iterative algorithm for estimating the k’s from observed geno-
types Sl at SNP l is based on Bayes’ theorem for the probability
of descent state Di, i = 0,1,2:
Pr(Di|Sl) =Pr(Sl|Di)Pr (Di)
Pr(Sl)
The procedure begins with initial estimates of the ki = Pr(Di)’s.
The denominator is calculated from the law of total probability
by adding over the three descent states:
Pr(Sl) =∑
i
Pr(Sl|Di)Pr(Di) =∑
i
Pr(Sl|Di)ki
232
MLE for Relatedness Coefficients
The updated estimates are obtained by averaging over m loci:
k′i =1
m
m∑
l=1
(Pr(Sl|Di)ki∑j Pr(Sl|Dj)kj
), i = 0,1,2
These updated values are then substituted into the right hand
side and the process continued until the likelihood L no longer
changes (or changes by less than some specified small amount)
where
L =m∏
l=1
∑
i
Pr(Sl|Di)ki
It will be better to monitor changes in the log-likelihood.
233
“RELPAIR” calculations
This approach compares the probabilities of two genotypes un-
der alternative hypotheses; H0: the individuals have a specified
relationship, versus H1: the individuals are unrelated. The alter-
native is that k0 = 1, k1 = k2 = 0 so the likelihood ratios for the
two hypotheses are:
L(AA,AA) = k0 + k1/pA + k2/p2A
L(BB,BB) = k0 + k1/pB + k2/p2B
L(AB,AB) = k0 + k1/(4pApB) + k2/(2pApB)
L(AA,AB) = k0 + k1/(2pA)
L(BB,AB) = k0 + k1/(2pB)
L(AA,BB) = k0
234
Testing relationship
Hypotheses about alternative pairs of relationships may be tested
with likelihood ratio test statistics. These ratios are the prob-
ability of the observed pair of genotypes under one hypothesis
divided by the probability under the alternative hypothesis. These
ratios are multiplied over (independent) loci.
Each hypothesis is described by a set of k’s and there are three
hypotheses likely to be of interest:
1. individuals are unrelated (k0 = 1)
2. individuals have a specified (or annotated) relationship (k’s
specified)
or 3. individuals are related to an extent measured by the esti-
mated k’s.
235
Testing relationship
The three likelihoods may be written as L0, LS, LE, respectively
and each of the three ratios of these may be of interest. It may
be satisfactory to regard the ratios that involve estimated k’s as
having chi-square distributions, but this is not justified for the
unrelated versus specified situation:
H0 H1 LR Null distribution
Unrelated Specified −2 ln(L0/LS) −−Unrelated Estimated −2 ln(L0/LE) χ2
(2)
Specified Estimated −2 ln(LS/LE) χ2(2)
236
ASSOCIATION MAPPING
237
Association Mapping
Association methods use random samples from a population and
are alternatives to methods based on pedigrees or crosses be-
tween inbred lines. The associations depend on linkage disequi-
librium between marker and trait loci instead of depending on
linkage between those loci as in pedigree or line cross methods.
A quantitative trait locus T contributes to a trait of interest.
The QTL genotype cannot be observed but maybe it can be
inferred, and the location of the QTL be estimated, from ob-
servations on the trait and the genotype at a genetic marker
M.
238
Marker-Trait Genotype Frequencies
Each marker genotypic class MiMj is composed of a mixture of
elements from each of the QTL classes, TrTs, where the propor-
tion of QTL class TrTs contained within marker class MiMj is
Pr(TrTs|MiMj). With random mating, genotype frequencies are
products of gamete frequencies. For example
Pr(TrTr,MiMi) = Pr(TrMi)2
Pr(TrTr|MiMi) = Pr(TrMi)2/Pr(Mi)
2
and gamete frequencies involve allele frequencies and linkage dis-
equilibria:
Pr(TrMi) = prpi +Dri
239
Two-allele Genotypes
TT Tt tt
MM P2MT 2PMTPMt P2
Mt
Mm 2PMTPmT 2PMTPmt + 2PMtPmT 2PMtPmt
mm P2mT 2PmTPmt P2
mt
240
Two-allele Gametes
T t
M PMT = pMpT +DMT PMt = pMpt −DMT
m PmT = pmpT −DMT Pmt = pmpt +DMT
ρMT =DMT√
pMpmpTpt
ρ2MT =D2MT
pMpmpTpt
241
Marker and Trait Variables
Introduce variables X and G for loci M and T. The values of X
will be assigned for the marker whereas the values G represent
the genetic contributions to measured trait variables or to disease
status. In either case, the Hardy-Weinberg assumption provides
the following expressions for the means and variances:
E(X) = µX = p2MXMM + 2pMpmXMm + p2mXmm
E(G) = µG = p2TGTT + 2pTptGT t + p2tGtt
Var(X) = σ2AM
+ σ2DM
Var(G) = σ2AT
+ σ2DT
242
Components of Variance
The “additive” and “dominance” components of variance are
σ2AM
= 2pMpm[pM(XMM −XMm) + pm(XMm −Xmm)]2
σ2AT
= 2pTpt[pT (GTT −GT t) + pt(GT t −Gtt)]2
σ2DM
= p2Mp2m(XMM − 2XMm +Xmm)2
σ2DT
= p2Tp2t (GTT − 2GT t +Gtt)
2
and these lead to the following expression for the covariance of
X and G:
Cov(G,X) = ρMTσATσAM + ρ2MTσDTσDM
243
Correlation of Trait and Marker Variables
Cov(G,X) = ρMTσATσAM + ρ2MTσDTσDM
If either X or G are purely additive, then their covariance is
Cov(G,X) = ρMTσATσAM
If both X and G are purely additive, then their correlation is
ρGX = ρMT
If either X or G are purely non-additive, then their covariance is
Cov(G,X) = ρ2MTσDTσDM
If both X and G are purely non-additive, then their correlation is
ρG,X = ρ2MT
244
Measured Traits
Suppose Y = G + E where G is the genetic effect of locus T
and E are all other effects. These other effects are supposed to
have mean zero and to be independent of both G and the marker
variable X. Then
E(Y ) = E(G)
Cov(X,Y ) = Cov(X,G)
Var(Y ) = σ2AT
+ σ2DT
+ VE
Trait values Y may be regressed on marker variables X. The
regression coefficient is
βY X =Cov(X,Y )
Var(X)=ρMTσATσAM + ρ2MTσDTσDM
σ2AM
+ σ2DM
245
Marker Variable
Variable X may be chosen to be additive, e.g XMM = 2, XMm =
1, Xmm = 0 so that σ2AM
= 2pMpm, σ2DM
= 0, and then
βY X = ρMTσATσAM
The marker variable can also be made to have a zero additive
variance, e.g. XMM = pm, XMm = 0,Xmm = pM so that σ2AM
=
0, σ2DM
= p2Mp2m, and
βY X = ρ2MT
σDTσDM
A significant regression coefficient implies a significant linkage
disequilibrium measure ρMT between marker and disease loci.
The signal is expected to be stronger with an additive marker as
ρMT ≥ ρ2MT and it is usual that σ2AT
≥ σ2DT
.
246
Analysis of Variance
Instead of regressing trait on marker, the trait means could be
compared among marker classes. The expected trait means fol-
low as
E(Y |MiMj) =∑
r,sGrsPr(TrTs|MiMj)
=∑
r,sGrsPr(TrMi, TsMj)/Pr(MiMj)
in general.
For a trait locus with only two alleles, T, t, for marker homozygote
MM :
E(Y |MM) = (GTTP2MT + 2GT tPMTPMt +GttP
2Mt)/p
2M
247
Trait Means in Marker Classes
This last expression can be manipulated to show the effects of
linkage disequilibrium.
Trait means among the three marker genotype classes:
E(Y |MM) = µG + 2ρMTA/pM + ρ2MTD/p2ME(Y |Mm) = µG + ρMTA(1/pM − 1/pm) − ρ2MTD/(pMpm)
E(Y |mm) = µG − 2ρMTA/pm + ρ2MTD/p2mwhere A = σAT
√(pMpm),D = σDT (pMpm), so that an analysis of
variance will also test that ρMT = 0 and the test will be affected
by both additive and dominance effects at the trait locus.
248
Dichotomous Traits: Case Only
The case-control approach starts with independent samples of
people who are either affected or not affected with a disease and
compares marker frequencies between the two groups. The MM
marker frequency among cases is
Pr(MM |Case) = p2M +1
µG
[pMρMTA+ ρ2MTσDTD
]
Pr(Mm|Case) = 2pMpm +1
µG
[(pm − pM)ρMTA− 2ρ2MTD
]
Pr(mm|Case) = p2m +1
µG
[−pmρMTA + ρ2MTD
]
Note that these three probabilities sum to one.
249
Case Allele Frequencies
Combining the genotypic frequencies to give allele frequencies:
Pr(M |Case) = pM +ρMTσAT
2µG
√2pMpm
Pr(m|Case) = pm −ρMTσAT
2µG
√2pMpm
and these two sum to one.
The inbreeding coefficient at the marker locus in the case pop-
ulation follows from
Pr(MM |Case) = Pr(M |Case)2 + fCase Pr(M |Case)[1 − Pr(M |Case)]
or
fCase =ρ2MT (2µGσDT − σ2
AT)
(µG
√2pM/pm + ρMTσAT )(µG
√2pm/pM − ρMTσAT )
250
Case-only HWE Testing
The power of this test depends on nf2Case which is proportional
to ρ4MT so the power will decrease quickly as ρMT deceases.
It is common for investigators to assume a multiplicative disease
model (i.e. additive on a log scale):
GTT = αβ2, GT t = αβ,Gtt = α
so that
µG = α(pTβ + pt)2
σ2AT
= 2pTptα2(1 − β)2(pTβ + pt)
2
σ2DT
= p2Tp2t α
2(1 − β)4
This leads to Hardy-Weinberg equilibrium at marker loci among
cases since
2µGσDT = σ2AT
251
Dichotomous Traits: Case-Control
An argument similar to that above provides the marker genotype
frequencies among controls:
Pr(MM |Control) = p2M − 1
1 − µG
[pMρMTA + ρ2MTD
]
Pr(Mm|Control) = 2pMpm − 1
1 − µG
[(pm − pM)ρMTA− 2ρ2MTD
]
Pr(mm|Control) = p2m − 1
1 − µG
[−pmρMTA + ρ2MTD
]
Pr(M |Control) = pM − ρMTA2(1 − µG)
Pr(m|Control) = pm +ρMTA
2(1 − µG)
252
Case-control Test
The simplest case-control test compares marker allele frequen-
cies between the two samples and it is clearly equivalent to test-
ing that ρMT = 0 since
Pr(M |Case)− Pr(M |Control) ∝ ρMTσAT
√2pMpm
The test is not affected by non-additivity at the disease locus.
If the allelic counts for M,m in cases and controls are laid out in
a 2× 2 table, the contingency-table chi-square test statistic has
1 df. An alternative is to work with the 3 × 2 table of marker
genotype counts in cases and controls and calculate a 2 df chi-
square test statistic. This test is affected by both additivity and
non-additivity at the disease locus but it is sensitive to errors in
genotype calls for rare alleles.
253
Allelic Case Control Test
Write the marker genotype counts in random samples of cases
and controls as
Genotype MM Mm mm Total
Case counts r0 r1 r2 RControl counts s0 s1 s2 STotal counts n0 n1 n2 N
The allelic test statistic uses the allele counts
Observed M m Total
Case counts 2r0 + r1 2r2 + r1 2RControl counts 2s0 + s1 2s2 + s1 2STotal counts 2n0 + n1 2n2 + n1 2N
254
Allelic Case Control Test
A contingency table test for independence of marker allele and
disease status compares the observed allelic counts with the
products of the marginal totals divided by the overall total:
Expected M m Total
Case counts 2R(2n0 + n1)/2N 2R(2n2 + n1)/2N 2RControl counts 2S(2n0 + n1)/2N 2S(2n2 + n1)/2N 2STotal counts 2n0 + n1 2n2 + n1 2N
The test statistic is
X2A =
∑ (Obs.−Exp.)2
Exp.
=2N [N(r1 + 2r2) −R(n1 + 2n2)]
2
SR[2N(n1 + 2n2) − (n1 + 2n2)2]
255
Allelic Case Control Test
Approximating the expectation of X2A by the ratio of the expec-
tations of the numerator and denominator:
E(X2A) ≈ (1 + f)
showing an inflation factor of (1 + f) when there is inbreeding.
The expected value is 1 when f = 0 and the test statistic has a
chi-square distribution with 1 df.
256
AMD Example: Case-control test statistics onchromosome 1
0 50 100 150 200 250
02
46
8
AMD Case Control Test
Position (Mb)
−lo
g10(p
)
257
AMD Example: Case-control test statistics QQplot
0 2 4 6 8
02
46
8
AMD Case Control Test
Expected
Ob
se
rve
d
258
AMD Example: HWE test statistics on chromo-some 1
0 50 100 150 200 250
05
10
15
Control HWE Test: With SNP Filtering
Position
−lo
g1
0(p
)
259
Trend Test
i = 0 i = 1 i = 2
Marker Genotype MM Mm mm TotalMarker Variable X0 X1 X2Case counts Y = 1 r0 r1 r2 RControl counts Y = 0 s0 s1 s2 STotal counts n0 n1 n2 N
The Armitage trend test is based on a score statistic U:
U =2∑
i=0
Xi
(S
Nri −
R
Nsi
)
260
Trend Test for Additivity
Assuming normality for U, the test statistic
X2T =
U2
Var(U)=
N(N∑i riXi −R
∑i niXi)
2
SR[N∑i niX
2i − (
∑i niXi)
2]
is distributed as χ2(1)
under the hypothesis H0 : ρMT = 0.
Usual to consider a linear trend test, with X0 = 0, X1 = 1, X2 =
2, so that σ2DM
= 0 and
X2T =
N [N(r1 + 2r2) −R(n1 + 2n2)]2
SR[N(n1 + 4n2) − (n1 + 2n2)2]
This will provide a test for additive effects at the disease locus.
261
Case-control vs Trend Tests
The allelic case-control test statistic is
X2A =
2N [N(r1 + 2r2) −R(n1 + 2n2)]2
SR[N(2n1 + 4n2) − (n1 + 2n2)2]
and the linear trend test statistic is
X2T =
N [N(r1 + 2r2) −R(n1 + 2n2)]2
SR[N(n1 + 4n2) − (n1 + 2n2)2]
In both cases, σ2DM
= 0 and the test is for linkage disequilibrium
ρMT between trait and marker alleles, and is affected only by
additive trait effects.
Unlike the allelic case-control test, the trend statistic has an
expected value of 1 even when there are departures from Hardy-
Weinberg equilibrium.
262
Trend Test for Non-Additivity
Setting X0 = pm, X1 = 0,X2 = pM gives σ2AM
= 0 and a test for
non-additive effects. There is not an obvious simplification of
the equation for the test statistic.
263