population genetic data analysis · 2014-08-31 · elaborate mathematical theories constructed by...

Population Genetic Data Analysis

Summer Institute in Statistical Genetics

September 1-3, 2014

University of Lausanne

Jerome Goudet: [email protected]

and

Bruce Weir: [email protected]

1

Contents

Topic Slide

Genetic Data 3Allele Frequencies 46Allelic Association 90Population Structure 150Inbreeding and Relatedness 192Association Mapping 235

Lectures on these topics by Bruce Weir will alternate with R

exercises led by Jerome Goudet. Some of the software is at

http://www2.unil.ch/popgen/teaching/SISG14/ .

2

GENETIC DATA

3

Genetic Data

“Genes are perpetuated as sequences of nucleic acid, but func-

tion by being expressed as proteins.” (Lewin, GENES).

DNA

(transcription)

↓RNA

(translation)

↓Protein

↓Phenotype

4

Sources of Data

Phenotype Mendel’s peasBlood groups

Protein AllozymesAmino acid sequences

DNA Restriction sites, RFLPsLength variants, VNTRs, STRsSNPsNucleotide sequences

5

Mendel’s Data

Dominant Form Recessive Form

Seed characters5474 Round 1850 Wrinkled6022 Yellow 2001 Green

Plant characters705 Grey-brown 224 White882 Simply inflated 299 Constricted428 Green 152 Yellow651 Axial 207 Terminal787 Long 277 Short

6

Genetic Data

Human ABO blood groups discovered in 1900.

Elaborate mathematical theories constructed by Sewall Wright,

R.A. Fisher, J.B.S. Haldane and others. This theory was chal-

lenged by data from new data from electrophoretic methods in

the 1960’s:

“For many years population genetics was an immensely rich and

powerful theory with virtually no suitable facts on which to oper-

ate. . . . Quite suddenly the situation has changed. The mother-

lode has been tapped and facts in profusion have been pored

into the hoppers of this theory machine. . . . The entire relation-

ship between the theory and the facts needs to be reconsidered.“

Lewontin (1974)

7

Electrophoretic Detection

Charge differences among allozymes lead to separation on elec-

trophoretic gels.

Length differences among Restriction Fragment Length Poly-

morphisms, or Variable Number of Tandem Repeat polymor-

phisms also detected by electrophoresis.

8

AmpliTypeTM Data

PCR based-DNA typing systems for forensic science. The loci in-

volved are all sequence polymorphisms that are detected/delineated

by hybridization to allele-specific oligonucleotide probes. Colo-

metric detection used for the hybridized product.

9

AmpliTypeTM Data - Caucasian

LDLR GYPA HBGG D7S8 Gca a B B A B A B C CA a b b A B A A A BA a B B A A A A A Aa a B B A A A B B BA a B b A B B B A CA A B b A A A B B Ca a B b A B A B A CA a b b A A A B A Ca a b b A A A A A BA a B B A B A A C CA a B B A B A B C Ca a B B A A A B A AA A B b A A A B A BA a B B A A B B C CA A B B A A A A B BA a b b A A A A C CA a B b A B A A B Ba a B b A B A A A CA a B B A B A B A CA a b b B B A B B B

10

AmpliTypeTM Data - Afr. Amer.

LDLR GYPA HBGG D7S8 Gca a b b B C A A A Ba a B B A A A B B Ca a B b A C A A B BA a B b A C A A B CA a B b A C A B A BA a B b A A A B A Ba a B b B C B B B CA a B B A A A B B BA a B b A C A B B Ba a b b A B A A B BA a B b B C A B B CA a B b A C A B B Ca a b b A A A A A Ca a b b A C A B B Ba a B B A A A A A BA a B B A B A A B Ba a B b A B A B B CA a B b B C A A B BA a B B A B A B B BA a B B B C A B B B

11

AmpliTypeTM Data - Hispanic

LDLR GYPA HBGG D7S8 Gca a B b B B B B B CA a B b A A A A A CA A b b A B A A A CA a B B A A A A C Ca a B b A B B B A Ca a B B B B A B C CA a B B B B A B C CA a B B A B A A A BA a B B A B A B C CA A B b A A A B B BA a b b A B A A A CA a B B A B A B C Ca a B b A C A A B BA A b b B B A A C CA a B b A A A A C Ca a B B B B A B C Ca a B b A B A B B CA a B B B B A A A BA A B B A B B B C CA a B B A B B B A C

12

STR markers: CTT set

(http://www.cstl.nist.gov/biotech/strbase/seq info.htm)

Usual No.Locus Structure of repeats

CSF1PO [AGAT]n 6–16TPOX [AATG]n 5–14TH01∗ [AATG]n 3–14

∗ “9.3” is [AATG]6ATG[AATG]3

Length variants detected by capillary electrophoresis.

13

“CTT” Data - Caucasian

CSF1P0 TPOX TH0111 12 8 11 7 811 13 8 8 6 711 12 8 11 6 710 12 8 8 6 911 12 8 12 9 9.310 12 9 11 6 710 13 8 11 6 611 12 8 8 6 9.39 10 8 9 7 9.311 12 8 8 6 811 13 8 11 7 911 12 8 11 6 9.310 11 8 8 7 9.310 10 8 11 7 9.39 10 8 8 6 9.311 12 9 11 9 9.39 11 9 11 9 9.311 12 8 8 6 710 10 9 11 6 9.310 13 8 8 8 9.3

14

“CTT” Data - Afr. Amer.

CSF1P0 TPOX TH0110 11 8 8 7 810 10 8 11 6 78 10 8 10 7 812 12 7 8 6 811 12 9 11 7 9.310 10 11 11 7 911 12 8 9 7 910 12 9 9 7 711 12 8 12 7 910 11 8 9 7 97 11 8 8 6 710 11 9 10 8 912 12 11 11 9 910 12 9 11 6 610 10 8 9 7 9.311 13 9 12 8 810 12 7 10 7 811 11 8 9 9 98 11 10 11 8 910 12 9 9 6 9

15

“CTT” Data - Hispanic

CSF1P0 TPOX TH0111 12 8 8 10 1011 11 8 8 6 711 12 8 11 7 9.310 10 11 11 6 9.311 11 8 12 9.3 9.312 13 12 12 6 811 12 8 10 9.3 9.310 11 11 11 6 912 12 7 8 6 9.310 12 8 11 7 9.311 12 8 8 6 712 13 11 12 6 711 12 8 11 7 9.311 14 8 10 6 611 12 8 11 7 711 12 8 11 6 610 11 8 11 6 911 12 8 8 6 611 11 8 11 6 912 12 11 11 7 8

16

STR Typing Methodology

“Short tandem repeat (STR) typing methods are widely used

today for human identity testing applications including foren-

sic DNA analysis. Following multiplex PCR amplification, DNA

samples containing the length-variant STR alleles are typically

separated by capillary electrophoresis and genotyped by compar-

ison to an allelic ladder supplied with a commercial kit. This

article offers a brief perspective on the technologies and issues

involved in STR typing.”

Butler JM. Short tandem repeat typing technologies used in hu-

man identity testing. BioTechniques 43:Sii-Sv (October 2007)

doi 10.2144/000112582

17

Single Nucleotide Polymorphisms (SNPs)

“Single nucleotide polymorphisms (SNPs) are the most frequently

occurring genetic variation in the human genome, with the total

number of SNPs reported in public SNP databases currently ex-

ceeding 9 million. SNPs are important markers in many studies

that link sequence variations to phenotypic changes; such studies

are expected to advance the understanding of human physiology

and elucidate the molecular bases of diseases. For this reason,

over the past several years a great deal of effort has been devoted

to developing accurate, rapid, and cost-effective technologies for

SNP analysis, yielding a large number of distinct approaches. ”

Kim S. Misra A. 2007. SNP genotyping: technologies and

biomedical applications. Annu Rev Biomed Eng. 2007;9:289-

320.

18

Current 1000Genomes Data

Tuesday June 24, 2014

The Initial Phase 3 variant list and phased genotypes.

The initial call set from the 1000 Genomes Project Phase 3

analysis is now available on our ftp site in the directory re-

lease/20130502/.

These release contains more than 79 million variant sites and

includes not just biallelic snps but also indels, deletions, com-

plex short substitutions and other structural variant classes. It

is based on data from 2535 individuals from 26 different popu-

lations around the world.

www.1000genomes.org

19

Sampling

Statistical sampling: The variation among repeated samples

from the same population is analogous to “fixed” sampling. In-

ferences can be made about that particular population.

Genetic sampling: The variation among replicate (conceptual)

populations is analogous to “random” sampling. Inferences are

made to all populations with the same history.

20

Classical Model

Sample ofsize n · · · Sample of

size n

��

��

��=

HHHHHHHHHHj

Time tPopulationof size N · · · Population

of size N

↓ ↓

... ...

↓ ↓

Time 2Populationof size N · · · Population

of size N

↓ ↓

Time 1 Populationof size N · · · Population

of size N

↓ ↓

Reference population(Usually assumed infinite and in equilibrium)

21

Coalescent Theory

An alternative framework works with genealogical history of a

sample of alleles. There is a tree linking all alleles in a current

sample to the “most recent common ancestral allele.” Allelic

variation due to mutations since that ancestral allele.

The coalescent approach requires mutation and may be more

appropriate for long-term evolution and analyses involving more

than one species. The classical approach allows mutation but

does not require it: within one species variation among popula-

tions may be due primarily to drift.

22

Probability

Probability provides the language of data analysis.

Equiprobable outcomes definition:

Probability of event E is number of outcomes favorable to E

divided by the total number of outcomes. e.g. Probability of a

head = 1/2.

Long-run frequency definition:

If event E occurs n times in N identical experiments, the prob-

ability of E is the limit of n/N as N goes to infinity.

Subjective probability:

Probability is a measure of belief.

23

First Law of Probability

Law says that probability can take values only in the range zero

to one and that an event which is certain has probability one.

0 ≤ Pr(E) ≤ 1

Pr(E|E) = 1 for any E

i.e. If event E is true, then it has a probability of 1. For example:

Pr(Seed is Round|Seed is Round) = 1

24

Second Law of Probability

If G and H are mutually exclusive events, then:

Pr(G or H) = Pr(G) + Pr(H)

For example,

Pr(Round or Wrinkled) = Pr(Round) + Pr(Wrinkled)

More generally, if Ei, i = 1, . . . r, are mutually exclusive then

Pr(E1 or . . . or Er) = Pr(E1) + . . .+ Pr(Er)

=∑

i

Pr(Ei)

25

Complementary Probability

If Pr(E) is the probability that E is true then Pr(E) denotes the

probability that E is false. Because these two events are mutually

exclusive

Pr(E or E) = Pr(E) + Pr(E)

and they are also exhaustive in that between them they cover all

possibilities – one or other of them must be true. So,

Pr(E) + Pr(E) = 1

Pr(E) = 1 − Pr(E)

The probability that E is false is one minus the probability it is

true.

26

Third Law of Probability

For any two events, G and H, the third law can be written:

Pr(G and H) = Pr(G) Pr(H|G)

There is no reason why G should precede H and the law can also

be written:

Pr(G and H) = Pr(H) Pr(G|H)

For example

Pr(Seed is round and is type AA) = Pr(Seed is round|Seed is type AA)

× Pr(Seed is type AA)

27

Independent Events

If the information that H is true does nothing to change uncer-

tainty about G, then

Pr(G|H) = Pr(G)

and

Pr(H and G) = Pr(H)Pr(G)

Events G,H are independent.

28

Law of Total Probability

If G,H are two mutually exclusive and exhaustive events (so that

H = G = not −G), then for any other event E, the law of total

probability states that

Pr(E) = Pr(E|G)Pr(G) + Pr(E|H) Pr(H)

This generalizes to any set of mutually exclusive and exhaustive

events {Si}:

Pr(E) =∑

i

Pr(E|Si)Pr(Si)

For example

Pr(Seed is round) = Pr(Round|Type AA)Pr(Type AA)

+ Pr(Round|Type Aa)Pr(Type Aa)

+ Pr(Round|Type aa)Pr(Type aa)

29

Bayes’ Theorem

Bayes’ theorem relates Pr(G|H) to Pr(H|G):

Pr(G|H) =Pr(GH)

Pr(H), from third law

=Pr(H|G) Pr(G)

Pr(H), from third law

If {Gi} are exhaustive and mutually exclusive, Bayes’ theorem

can be written as

Pr(Gi|H) =Pr(H|Gi)Pr(Gi)∑iPr(H|Gi)Pr(Gi)

30

Bayes’ Theorem Example

Suppose G is event that a man has genotype A1A2 and H is the

event that he transmits allele A1 to his child. Then Pr(H|G) =

0.5.

Now what is the probability that a man has genotype A1A2 given

that he transmits allele A1 to his child?

Pr(G|H) =Pr(H|G) Pr(G)

Pr(H)

=0.5 × 2p1p2

p1= p2

31

Mendel’s Data

Model: seed shape governed by gene A with alleles A, a:

Genotype Phenotype

AA RoundAa Roundaa Wrinkled

Cross two inbred lines: AA and aa. All offspring (F1 generation)

are Aa, and so have round seeds.

32

F2 generation

Self an F1 plant: each allele it transmits is equally likely to be A

or a, and alleles are independent, so for F2 generation:

Pr(AA) = Pr(A)Pr(A) = 0.25

Pr(Aa) = Pr(A)Pr(a) + Pr(a)Pr(A) = 0.5

Pr(aa) = Pr(a)Pr(a) = 0.25

Probability that an F2 seed (observed on F1 parental plant) is

round:

Pr(Round) = Pr(Round|AA)Pr(AA)

+ Pr(Round|Aa)Pr(Aa)

+ Pr(Round|aa)Pr(aa)

= 1 × 0.25 + 1 × 0.5 + 0 × 0.25

= 0.75

33

F2 generation

What are the proportions of AA and Aa among F2 plants with

round seeds? From Bayes’ Theorem:

Pr(F2 = AA|F2 Round) =Pr(F2 Round|AA)Pr(F2 AA)

Pr(F2 round)

=1 × 1

434

=1

3

34

Seed Characters

As an experimental check on this last result, and therefore on

Mendel’s theory, Mendel selfed a round-seeded F2 plant and

noted the F3 seed shape (observed on the F2 parental plant).

If all the F3 seeds are round, the F2 must have been AA. If some

F3 seed are round and some are wrinkled, the F2 must have been

Aa. Possible to observe many F3 seeds for an F2 parental plant,

so no doubt that all seeds were round. Data supported theory:

one-third of F2 plants gave only round seeds and so must have

had genotype AA.

35

Plant Characters

Model for stem length is

Genotype Phenotype

GG LongGg Longgg Short

To check this model it is necessary to grow the F3 seed to observe

the F3 stem length.

36

F2 Plant Character

Mendel grew only 10 F3 seeds per F2 parent. If all 10 seeds gave

long stems, he concluded they were all GG, and F2 parent was

GG. This could be wrong. The probability of a Gg F2 plant

giving 10 long-stemmed F3 offspring (GG or Gg), and therefore

wrongly declared to be homozygous GG is (3/4)10 = 0.0563.

37

Fisher’s 1936 Criticism

The probability that a long-stemmed F2 plant is declared to be

homozygous (event V ) is

Pr(V ) = Pr(V |U)Pr(U) + Pr(V |U)Pr(U)

= 1 × (1/3) + 0.0563 × (2/3)

= 0.3709 6= 0.3333

where U is the event that a long-stemmed F2 is actually homozy-

gous and U is the event that it is actually heterozygous.

Fisher claimed Mendel’s data closer to the 0.3333 probability

appropriate for seed shape than to the correct 0.3709 value.

38

Weldon’s 1902 Doubts

In Biometrika, Weldon said:

“Here are seven determinations of a frequency which is said to

obey the law of Chance. Only one determination has a deviation

from the hypothetical frequency greater than the probable error

of the determination, and one has a deviation sensible equal to

the probable error; so that a discrepancy between the hypothesis

and the observations which is equal to or greater than the prob-

able error occurs twice out of seven times, and deviations much

greater than the probable error do not occur at all. These results

then accord so remarkably with Mendel’s summary of them that

if they were repeated a second time, under similar conditions

and on a similar scale, the chance that the agreement between

observation and hypothesis would be worse than that actually

obtained is about 16 to 1.”

39

Edwards’ 1986 Criticism

Mendel had 69 comparisons where the expected ratios were cor-

rect. Each set of data can be tested with a chi-square test:

Category 1 Category 2 Total

Observed (o) a n-a nExpected (e) b n-b n

X2 =(a− b)2

b+

[(n− a)− (n− b)]2

(n− b)

=n(a− b)2

b(n− b)

40

Edwards’ Criticism

If the hypothesis giving the expected values is true, the X2 val-

ues follow a chi-square distribution, and the X values follow a

normal distribution. Edwards claimed Mendel’s values were too

small – not as many large values as would be expected by chance.

−3.0−2.0−1.0 0.0 1.0 2.0 3.0

0.0

5.0

10.0

41

Novitski’s 2004 Response

For six experiments on plant characters, Mendel planned to have

10 F3 plants for each of 100 dominant phenotype F2 plants.

With a 2% error rate, the probability that a family of 10 do not

all survive, i.e. at least one error or one minus the chance of no

error, is 1 − (0.98)10 = 0.18. The expected number of failing

families is 6 × 100 × 0.18 = 110. If such a family still showed at

least one recessive phenotype it would probably still have been

included in the results. Otherwise the family would have been

discarded, or augmented and this can result in an under-counting

of dominant homozygotes.

This may account for the under-counting of homozygotes noted

by Fisher.

42

Hartl and Fairbanks 2007 Rebuttals

In “Mud sticks: On the alleged falsification of Mendel’s data”

(Genetics 175:975-979) Hartl and Fairbanks criticized Fisher and

Novitski. They did not comment on Edwards.

Against Novitski, they claimed that Mendel did indeed collect

data from exactly 10 plants.

Against Fisher, they noted that one experiment was repeated

and the combined data agreed with Fisher’s expectations.

43

Pires and Branco, 2010

“A statistical model to explain the Mendel-Fisher controversy.”

Statistical Science 15:545–565, 2010.

This paper concentrates on the 84 p-values of Mendel’s experi-

ments and suggests an unconscious bias in Mendel’s work.

Refers to the 2008 book:

Franklin, A., Edwards AWF, Fairbanks DJ, Hartl DL, Seiden-

feld T. “Ending the Mendel-Fisher Controversy.” University of

Pittsburgh Press, Pittsburgh.

44

Continuing Debate

“This research was carried out in order to verify by simulation

Mendel’s laws and seek for the clarification, from the author”s

point of view, the Mendel–Fisher controversy. ... It was con-

cluded, that Mendel had no reason to manipulate his data in

order to make them to coincide with his beliefs. Therefore, in

experiment with a single trait, and in experiments with two traits

assuming complete dominance, segregation ratios are 3:1; and

9:3:3:1, respectively. Consequently, Mendel’s laws, under the

conditions as were described are absolutely valid and universal.”

REVISTA CIENTIFICA-FACULTAD DE CIENCIAS VETERINAR-

IAS 24:38-46, 2014.

45

ALLELE FREQUENCIES

46

Properties of Estimators

Consistency Increasing accuracyas sample size increases

Unbiasedness Expected value is the parameter

Efficiency Smallest variance

Sufficiency Contains all the informationin the data about parameter

47

Binomial Distribution

Most population genetic data consists of numbers of observa-

tions in some categories. The values and frequencies of these

counts form a distribution.

Toss a coin n times, and note the number of heads. There

are (n+1) outcomes, and the number of times each outcome is

observed in many sets of n tosses gives the sampling distribution.

Or: sample n alleles from a population and observe x copies of

type A.

48

Binomial distribution

If every toss has the same chance p of giving a head:

Probability of x heads in a row is

p× p× . . .× p = px

Probability of n− x tails in a row is

(1 − p) × (1 − p) × . . .× (1 − p) = (1 − p)n−x

The number of ways of ordering x heads and n − x tails among

n outcomes is n!/[x!(n − x)!].

The binomial probability of x successes in n trials is

Pr(x|p) =n!

x!(n − x)!px(1 − p)n−x

49

Binomial Likelihood

The quantity Pr(x|p) is the probability of the data, x successes

in n trials, when each trial has probability p of success.

The same quantity, written as L(p|x), is the likelihood of the

parameter, p, when the value x has been observed. The terms

that do not involve p are not needed, so

L(p|x) ∝ px(1 − p)(n−x)

Each value of x gives a different likelihood curve, and each curve

points to a p value with maximum likelihood. This leads to

maximum likelihood estimation.

50

Likelihood L(p|x, n = 4)

0.0 0.2 0.4 0.6 0.8 1.0

p

L(p)

x = 0

x = 1x = 2

x = 3

x = 4

..

..

.

..

..

.

..

..

.

..

..

..

.

..

..

.

..

..

..

.

..

..

.

..

..

.

..

..

..

.

..

..

.

..

...

.

..

..

..

.

..

..

.

..

..

.

..

..

.

..

..

.

..

..

..

.

..

..

.

..

..

.

..

..

.

..

..

...

.

..

..

..

.

..

..

..

.

..

..

..

.

..

..

..

.

..

..

..

.

..

..

..

.

..

..

..

.

...

..

..

..

.

..

..

..

..

.

..

..

..

.

..

..

..

..

.

..

..

..

..

.

..

..

..

.

...

..

..

..

.

..

..

..

.

..

..

..

.

..

..

..

..

.

..

..

..

.

..

..

..

.

...

..

..

..

..

..

..

..

..

.

..

..

..

..

..

..

..

..

..

.

..

..

..

...

..

..

..

..

..

..

..

..

..

.

..

..

..

..

..

..

..

..

..

..

...

..

..

..

..

..

..

..

..

..

..

..

.

..

..

..

..

..

..

..

...

..

..

..

..

..

..

...

..

..

..

..

..

..

..

..

..

..

...

..

..

...

..

..

...

..

..

..

...............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

..

.

..

.

..

.

..

..

.

..

..

..

.

..

..

..

.

..

..

.

..

..

..

.

..

..

..

.

..

..

.

..

...

..

.

..

..

..

.

..

..

.

..

..

..

.

..

..

..

.

..

..

..

.

..

..

..

.

..

..

.

..

...

..

..

.

..

..

..

..

.

..

..

..

..

.

..

..

..

..

.

..

..

..

..

.

...

..

..

..

..

..

..

..

..

..

..

..

..

..

..

..

..

..

..

..

...

..

.

..

..

..

..

..

..

..

..

..

..

.

..

..

..

..

..

...

.

..

.

..

..

.

..

.

..

.

..

.

..

.

..

..

.

..

.

............................................................................................................................................................................................................................................................................................................................................

......................................................

...............................................

....................................................

...............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

..................................................................................................................................................................................................................................................................................................................................................................................................................................

.....................................

...........................

................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

.....................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

.................................................................

.............................................

.................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

51

Binomial Mean

If there are n trials, each of which has probability p of giving a

success, the mean or the expected number of successes is np.

The sample proportion of successes is

p =x

n

(This is also the maximum likelihood estimate of p.)

The expected, or mean, value of p is p.

E(p) = p

52

Binomial Variance

The expected value of the squared difference between the num-

ber of successes and its mean, (x − np)2, is np(1 − p). This is

the variance of the number of successes in n trials, and indicates

the spread of the distribution.

The variance of the sample proportion p is

Var(p) =p(1 − p)

n

53

Normal Approximation

Provided np is not too small (e.g. not less than 5), the binomial

distribution can be approximated by the normal distribution with

the same mean and variance. In particular:

p ∼ N

(p,

p(1 − p)

n

)

To use the normal distribution in practice, change to the standard

normal variable z with a mean of 0, and a variance of 1:

z =p− p√

p(1 − p)/n

For a standard normal, 95% of the values lie between ±1.96.

The normal approximation to the binomial therefore implies that

95% of the values of p lie in the range

p ± 1.96√p(1 − p)/n

54

Confidence Intervals

A 95% confidence interval is a variable quantity. It has end-

points which vary with the sample. Expect that 95% of samples

will lead to an interval that includes the unknown true value pc.

The standard normal variable z has 95% of its values between

−1.96 and +1.96. This suggests that a 95% confidence interval

for the binomial parameter p is

p ± 1.96

√p(1 − p)

n

55


For samples of size 10, the 11 possible confidence intervals are:

pc Confidence Interval

0.0 0.0 ± 0.00 0.00,0.00

0.1 0.1 ± 2√

0.009 0.00,0.29

0.2 0.2 ± 2√

0.016 0.00,0.45

0.3 0.3 ± 2√

0.021 0.02,0.58

0.4 0.4 ± 2√

0.024 0.10,0.70

0.5 0.5 ± 2√

0.025 0.19,0.81

0.6 0.6 ± 2√

0.024 0.30,0.90

0.7 0.7 ± 2√

0.021 0.42,0.98

0.8 0.8 ± 2√

0.016 0.55,1.00

0.9 0.9 ± 2√

0.009 0.71,1.001.0 1.0 ± 0.00 1.00,1.00

Can modify interval a little by extending it by the “continuity

correction” ±1/2n in each direction.

56


To be 95% sure that the estimate is no more than 0.01 from

the true value, 1.96√p(1 − p)/n should be less than 0.01. The

widest confidence interval is when p = 0.5, and then need

0.01 ≥ 1.96√

0.5 × 0.5/n

which means that

n ≥ 10,000

If the true value of p was about 0.05, however,

0.01 ≥ 2√

0.05 × 0.95/n

n ≥ 1,900 ≈ 2,000

57

Exact Confidence Intervals: One-sided

The normal-based confidence intervals are constructed to be

symmetric about the sample value, unless the interval goes out-

side the interval from 0 to 1. They are therefore less satisfactory

the closer the true value is to 0 or 1.

More accurate confidence limits follow from the binomial distri-

bution exactly. For events with low probabilities p, how large

could p be for there to be at least a 5% chance of seeing at

most as many as x (i.e. 0,1,2, . . . x) occurrences of that event

among n events. If this upper bound is pU ,

x∑

k=0

Pr(k) ≥ 0.05

x∑

k=0

(n

k

)pkU(1 − pU)n−k ≥ 0.05

If x = 0, then (1 − pU)n ≥ 0.05 or pU ≤ 1 − 0.051/n and this is

0.0295 if n = 100. More generally pU ≈ 3/n when x = 0.

58

Exact Confidence Intervals: Two-sided

Now want to know how large p could be for there to be at least

a 2.5% chance of seeing at most as many as x (i.e. 0,1,2 . . . x)

occurrences, and in knowing how small p could be for there to

be at least a 2.5% chance of seeing at least as many as x (i.e.

x, x+ 1, x+ 2, . . . n) occurrences then we need

x∑

k=0

(n

k

)pkU(1 − pU)n−k ≥ 0.025

n∑

k=x

(n

k

)pkL(1 − pL)

n−k ≥ 0.025

The second of these equations may be written as

x−1∑

k=0

(n

k

)pkL(1 − pL)

n−k ≤ 0.975

If x = n, then pnL ≤ 0.975 or pL ≤ 0.9751/n and this is 0.9997 if

n = 100. Interval is not symmetric if p 6= 0.5.

59

Bootstrapping

An alternative method for constructing confidence intervals uses

numerical resampling. A set of samples is drawn, with replace-

ment, from the original sample to mimic the variation among

samples from the original population. Each new sample is the

same size as the original sample, and is called a bootstrap sam-

ple.

The middle 95% of the sample values p from a large number of

bootstrap samples provides a 95% confidence interval.

60

Multinomial Distribution

Toss two coins n times. For each double toss, the probabilities

of the three outcomes are:

2 heads pHH = 1/41 head, 1 tail pHT = 1/22 tails pTT = 1/4

The probability of x lots of 2 heads is (pHH)x, etc.

The numbers of ways of ordering x, y, z occurrences of the three

outcomes is n!/[x!y!z!] where n = x+ y+ z.

The multinomial probability for x of HH, and y of HT or TH

and z of TT in n trials is:

Pr(x, y, z) =n!

x!y!z!(pHH)x(pHT )y(pTT )z

61

Multinomial Variances and Covariances

If {pi} are the probabilities for a series of categories, the sam-

ple proportions pi from a sample of n observations have these

properties:

E(pi) = pi

Var(pi) =1

npi(1 − pi)

Cov(pi, pj) = −1

npipj, i 6= j

The covariance is defined as E[(pi − pi)(pj − pj)].

For the sample counts:

E(ni) = npi

Var(ni) = npi(1 − pi

Cov(ni, nj) = −npipj, i 6= j

62

Allele Frequency Sampling Distribution

If a locus has alleles A and a, in a sample of size n the allele

counts are sums of genotype counts:

n = nAA + nAa + naa

nA = 2nAA + nAa

na = 2naa + nAa

2n = nA + na

Genotype counts in a random sample are multinomially distributed.

What about allele counts? Approach this question by calculating

variance of nA.

63

Within-population Variance

Var(nA) = Var(2nAA + nAa)

= Var(2nAA) + 2Cov(2nAA, nAa)

+ Var(nAa)

= 2npA(1 − pA) + 2n(PAA − p2A)

This is not the same as the binomial variance 2npA(1−pA) unless

PAA = p2A. In general, the allele frequency distribution is not

binomial.

64

Within-population Variance

The variance of the sample allele frequency can be written as

Var(pA) =pA(1 − pA)

2n+PAA− p2A

2n

It is convenient to reparameterize genotype frequencies with the

(within-population) inbreeding coefficient f :

PAA = p2A + fpA(1 − pA)

PAa = 2pApa − 2fpApa

Paa = p2a + fpa(1 − pa)

Then the variance can be written as

Var(pA) =pA(1 − pA)(1 + f)

2n

This variance is different from the binomial variance of pA(1 −pA)/2n.

65

Bounds on f

Since

pA ≥ PAA = p2A + fpA(1 − pA) ≥ 0

pa ≥ Paa = p2a + fpa(1 − pa) ≥ 0

there are bounds on f :

−pA/(1 − pA) ≤ f ≤ 1

−pa/(1 − pa) ≤ f ≤ 1

or

max

(−pApa,−papA

)≤ f ≤ 1

This range of values is [-1,1] when pA = pa.

66

Indicator Variables

A very convenient way to derive many statistical genetic results

is to define an indicator variable xij for allele j in individual i:

xij =

{1 if allele is A0 if allele is not A

Then

E(xij) = pA

E(x2ij) = pA

E(xijxij′) = PAA

If there is random sampling, individuals are independent, and

E(xijxi′j′) = E(xij)E(xi′j′) = p2A

67

Intraclass Correlation

The inbreeding coefficient is the correlation of the indicator vari-

ables for the two alleles at a locus carried by an individual. This

is because:

Var(xij) = E(x2ij) − [E(xij)]2

= pA(1 − pA)

= Var(xij′), j 6= j′

and

Cov(xij , xij′) = E(xijxij′)− [E(xij)][E(xij′)], j 6= j′

= PAA− p2A= fpA(1 − pA)

so

Corr(xij, xij′) =Cov(xij , xij′)√

Var(xij)Var(xij′)= f

68

Maximum Likelihood Estimation: Binomial

For binomial sample of size n, the likelihood of pA for nA alleles

of type A is

L(pA|nA) = C(pA)nA(1 − pA)n−nA

and is maximized when

∂L(pA|nA)

∂pA= 0 or when

∂ lnL(pA|na)∂pA

= 0

Now

lnL(pA|nA) = lnC + nA ln(pA) + (n− nA) ln(1 − pA)

so

∂ lnL(pA|nA)

∂pA=

nApA

− n− nA1 − pA

and this is zero when pA = pA = nA/n.

69

Maximum Likelihood Estimation: Multinomial

If {ni} are multinomial with parameters n and {Qi}, then the

MLE’s of Qi are ni/n. This will always hold for genotype pro-

portions, but not always for allele proportions.

For two alleles, the MLE’s for genotype proportions are:

PAA = nAA/n

PAa = nAa/n

Paa = naa/n

Does this lead to estimates of allele proportions and the within-

population inbreeding coefficient?

PAA = p2A + fpA(1 − pA)

PAa = 2pA(1 − pA)− 2fpA(1 − pA)

Paa = (1 − pA)2 + fpA(1 − pA)

70

Maximum Likelihood Estimation

The likelihood function for pA, f is

L(pA, f) =n!

nAA!nAa!naa![p2A + pA(1 − pA)f ]nAA

×[2pA(1 − pA)f ]nAa[(1 − pA)2 + pA(1 − pA)f ]naa

and it is difficult to find, analytically, the values of pA and f that

maximize this function or its logarithm.

There is an alternative way of finding maximum likelihood esti-

mates in this case: equating the observed and expected values

of the genotype frequencies.

71

Bailey’s Method

Because the number of parameters (2) equals the number of

degrees of freedom in this case, we can just equate observed and

expected (using the estimates of pA and f) genotype proportions

nAA/n = p2A + f pA(1 − pA)

nAa/n = 2pA(1 − pA) − 2f pA(1 − pA)

naa/n = (1 − pA)2 + f pA(1 − pA)

so that, solving for pA and f :

pA =2nAA + nAa

2n= pA

f =4nAAnaa − n2

Aa

(2nAA + nAa)(2naa + nAa)= 1 − PAa

2pApa

72

Three-allele Case

With three alleles, there are six genotypes and 5 df. To use

Bailey’s method, would need five parameters: 2 allele frequencies

and 3 inbreeding coefficients:

P11 = p21 + f12p1p2 + f13p1p3

P12 = 2p1p2 − 2f12p1p2

P22 = p22 + f12p1p2 + f23p2p3

P13 = 2p1p3 − 2f13p1p3

P23 = 2p2p3 − 2f23p2p3

P33 = p23 + f13p1p3 + f23p2p3

73

Three-allele Case

This formulation preserves the relation between allele and geno-

type frequencies:

p1 = P11 +1

2P12 +

1

2P13

p2 = P22 +1

2P12 +

1

2P23

p3 = P33 +1

2P13 +

1

2P23

74

Three-allele Case

The maximum likelihood estimates in the fully-parameterized

case are

p1 = P11 +1

2P12 +

1

2P13

p2 = P22 +1

2P12 +

1

2P23

p3 = P33 +1

2P13 +

1

2P23

and

f12 = 1 − P12/p1p2

f13 = 1 − P13/p1p3

f23 = 1 − P23/p2p3

75

Three-allele Likelihood

For neutral genes, no reason to expect differences among f ’s, so

an alternative parameterization would be

P11 = p21 + fp1(1 − p1)

P12 = 2p1p2 − 2fp1p2

P22 = p22 + fp2(1 − p2)

P13 = 2p1p3 − 2fp1p3

P23 = 2p2p3 − 2fp2p3

P33 = p23 + fp3(1 − p3)

76

Three-allele Likelihood

No explicit analytic expressions for the MLEs of p’s and f . Nu-

merical methods needed to maximize the log-likelihood

lnL = n11[ln(p1) + ln(p1 + f − fp1)]

+ n12[ln(p1) + ln(p2) + ln(1 − f)]

+ n22[ln(p2) + ln(p2 + f − fp2)]

+ n13[ln(p1) + ln(p3) + ln(1 − f)]

+ n23[ln(p2) + ln(p3) + ln(1 − f)]

+ n33[ln(p3) + ln(p3 + f − fp3)]

77

Method of Moments

An alternative to maximum likelihood estimation is the method

of moments (MoM) where observed values of statistics are set

equal to their expected values. In general, this does not lead to

unique estimates or to estimates with variances as small as those

for maximum likelihood. (Bailey’s method is for the special case

where the MLEs are also MoM estimates.)

78

Method of Moments

For the inbreeding coefficient at loci with m alleles, two different

MoM estimates are

fW =

∑mu=1(Puu − p2u) + 1

2n

∑mu=1(pu − Puu)

∑mu=1 pu(1 − pu)− 1

2n

∑mu=1(pu − Puu)

≈∑mu=1(Puu − p2u)∑mu=1 pu(1 − pu)

fH =1

m− 1

m∑

u=1

(Puu − p2u

pu

)

For loci with two alleles, m = 2, the two moment estimates are

equal to each other and to the maximum likelihood estimate:

fW = fH = 1 − PAa2pApa

79

Expectations of Moment Estimates

The expected values of the estimated inbreeding coefficients can

be found by using the results

E(Puu) = Puu = p2u + fpu(1 − pu)

E(pu) = pu

E(p2u) = p2u +1

2npu(1 − pu)(1 + f)

Then, approximating the expectation of a ratio by the ratio of

expectations:

E(fW ) ≈f n−1

n (1 −∑mu=1 p

2u)

n−1n (1 −∑m

u=1 p2u)

= f

E(fH) ≈ 1

m− 1

m∑

u=1

pu(1 − pu)(f − 1+f

2n )

pu

= f − 1 + f

2n≈ f

80

MLE for Recessive Alleles

Suppose allele a is recessive to allele A. If there is Hardy-

Weinberg equilibrium, the likelihood for the two phenotypes is

L(pa) = (1 − p2a)n−naa(p2a)

naa

ln(L(pa) = (n− naa) ln(1 − p2a) + 2naa ln(pa)

where there are naa individuals of type aa and n− naa of type A.

Differentiating wrt pa:

∂ lnL(pa)

∂pa= −2pa(n− naa)

1 − p2a+

2naa

pa

Setting this to zero leads to an equation that can be solved

explicitly: p2a = naa/n. No need for iteration.

81

EM Algorithm for Recessive Alleles

An alternative way of finding maximum likelihood estimates when

there are “missing data” involves Estimation of the missing data

and then Maximization of the likelihood. For a locus with allele A

dominant to a the missing information is the frequencies (1−pa)2of AA, and 2pa(1−pa) of Aa genotypes. Only the joint frequency

(1 − p2a) of AA+Aa can be observed.

Estimate the missing genotype counts (assuming independence

of alleles):

nAA =(1 − pa)2

1 − p2a(n− naa) =

(1 − pa)(n − naa)

(1 + pa)

nAa =2pa(1 − pa)

1 − p2a(n− naa) =

2pa(n− naa)

(1 + pa)

82

EM Algorithm for Recessive Alleles

Maximize the likelihood (using Bailey’s method):

pa =nAa + 2naa

2n

=1

2n

(2pa(n− naa)

(1 + pa)+ 2naa

)

= =2(npa + naa)

2n(1 + pa)

An initial estimate pa is put into the right hand side to give an

updated estimated pa on the left hand side. This is then put

back into the right hand side to give an iterative equation for pa.

This procedure also has explicit solution pa =√

(naa/n).

83

EM Algorithm for Two Loci

For two loci with two alleles each, the ten two-locus frequencies are:

Genotype Actual Expected Genotype Actual Expected

AB/AB PABAB p2AB AB/Ab PAB

Ab 2pABpAb

AB/aB PABaB 2pABpaB AB/ab PAB

ab 2pABpab

Ab/Ab PAbAb p2Ab Ab/aB PAb

aB 2pAbpaB

Ab/ab PAbab 2pAbpab aB/aB P aB

aB p2aB

aB/ab P aBab 2paBpab ab/ab P ab

ab p2ab

84


Gamete frequencies are marginal sums:

pAB = PABAB +1

2(PABAb + PABaB + PABab )

pAb = PAbAb +1

2(PAbAB + PAbab + PAbaB)

paB = P aBaB +1

2(P aBAB + P aBab + P aBAb )

pab = P abab +1

2(P abAb + P abaB + P abAB)

Can arrange gamete frequencies as two-way table to show that

only one of them is unknown when the allele frequencies are

known:

pAB pAb pApaB pab papB pb 1

85


The two double heterozygote frequencies PABab , PAbaB are “missing

data.”

Assume initial value of pAB and Estimate the missing counts:

nABab =pABpab

pABpab + pAbpaBnAaBb

nAbaB =pAbpaB

pABpab + pAbpaBnAaBb

and then Maximize the likelihood by setting

pAB =1

2n

(2nABAB + nABAb + nABaB + nABab

)

86

Example

For loci LDLR and GYPA in the Hispanic Sample, the 9 observed

genotype counts are:

BB Bb bb Total

AA nAABB = 1 nAABb = 1 nAAbb = 2 nAA = 4Aa nAaBB = 7 nAaBb = 2 nAabb = 1 nAa = 10aa naaBB = 2 naaBb = 4 naabb = 0 naa = 6

Total nBB = 10 nBb = 7 nbb = 3 n = 20

There is one unknown gamete count x = nAB = 2npAB for AB:

B b Total

A nAB = x nAb = 18 − x nA = 18a naB = 27 − x nab = x− 5 na = 22

Total nB = 27 nb = 13 2n = 40

18 ≥ x ≥ 5

87

Example

EM iterative equation:

x′ = 2nAABB + nAABb + nAaBB + nAB/ab

= 2nAABB + nAABb + nAaBB +2pABpab

2pABpab + 2pAbpaBnAaBb

= 2 + 1 + 7 +2x(x− 5)

2x(x− 5) + 2(18 − x)(27 − x)2

= 10 +2x(x− 5)

x(x− 5) + (18 − x)(27 − x)

88

Example

A good starting value would assume independence of A and B

alleles: x = 2n ∗ pA ∗ pB = (18 × 27/40) = 12.15.

Successive iterates are:

Iterate x value

0 12.15001 12.00002 11.93103 11.89944 11.88495 11.87836 11.87537 11.87398 11.87339 11.873010 11.8728

89

ALLELIC ASSOCIATION

90

Hardy-Weinberg Law

For a random mating population, expect that genotype frequen-

cies are products of allele frequencies.

For a locus with two alleles, A, a:

PAA = (pA)2

PAa = 2pApa

Paa = (pa)2

These are also the results of setting the inbreeding coefficient f

to zero.

For a locus with several alleles Ai:

PAiAi = (pAi)2

PAiAj = 2pAipAj

91

Inference about HWE

Departures from HWE can be described by the within-population

inbreeding coefficient f . This has an MLE that can be written

as

f =4nAAnaa − n2

Aa

(2nAA + nAa)(2naa + nAa)

and we can use “Delta method” to find

E(f) = f

Var(f) ≈ 1

2npApa(1 − f)[2pApa(1 − f)(1 − 2f) + f(2 − f)]

If f is assumed to be normally distributed then, (f−f)/√

Var(f) ∼N(0,1). Since Var(f) = 1/n when f = 0:

X2 =f2

Var(f)= nf2

is appropriate for testing H0 : f = 0. When H0 is true, X2 ∼ χ2(1)

.

Reject HWE if X2 > 3.84.

92

Significance level of HWE test

0 1 2 3 4 5

0.0

0.2

0.4

0.6

0.8

1.0

Chi−square with 1 df

X^2

f(X

^2)

Probability=0.05

X^2=3.84

The area under the chi-square curve to the right of X2 = 3.84

is the probability of rejecting HWE when HWE is true. This is

the significance level of the test.

93

Goodness-of-fit Test

An alternative, but equivalent, test is the goodness-of-fit test.

Genotype Observed Expected (Obs.−Exp.)2

Exp.

AA nAA np2A np2af2

Aa nAa 2npApa 2npApaf2

aa naa np2a np2Af2

The test statistic is

X2 =∑ (Obs.− Exp)2

Exp.= nf2

94


Does a sample of 6 AA, 3 Aa, 1 aa support Hardy-Weinberg?

First need to estimate allele frequencies:

pA = PAA +1

2PAa = 0.75

pa = Paa +1

2PAa = 0.25

Then form “expected” counts:

nAA = n(pA)2 = 5.625

nAa = 2npApa = 2.750

naa = n(pa)2 = 0.625

95


Perform the chi-square test:

Genotype Observed Expected (Obs.− Exp.)2/Exp.

AA 6 5.625 0.025

Aa 3 2.750 0.150

aa 1 0.625 0.225

Total 10 10 0.400

Note that f = 1 − 0.3/(2 × 0.75 × 0.25) = 0.2 and X2 = nf2.

96

Sample size determination

Although Fisher’s exact test (below) is generally preferred for

small samples, the normal or chi-square test has the advantage

of simplifying power calculations.

Assuming that f is normally distributed, form the test statistic

z =f − f√Var(f)

Under the null hypothesis H0 : f = 0 this is z0 =√nf . For a two-

sided test, reject at the α% level if z0 ≤ zα/2 or z0 ≥ z1−α/2 =

−zα/2. For a 5% test, reject if z0 ≤ −1.96 or z0 ≥ 1.96.

97


If the hypothesis is false, the normal test statistic is

z =f − f√Var(f)

≈√n(f − f) = z0 −

√nf

(using the null-hypothesis value of the variance in the denomina-

tor). Suppose f > 0 so rejection occurs when z0 ≥ −zα/2. With

this rejection region, the probability of rejecting is ≥ (1 − β) if

the rejection region amounts to z = z0 −√nf ≥ zβ. i.e.

−zα/2 −√nf = zβ

nf2 = (zα/2 + zβ)2

For 5% significance level −zα/2 = 1.96, and for 90% power zβ =

−1.28 so we need nf2 ≥ (−1.96 − 1.28)2 = 10.5. i.e. n has to

be over 100,000 when f = 0.01.

98


More directly, when the Hardy-Weinberg hypothesis is not true,

the test statistic nf2 has a non-central chi-square distribution

with one degree of freedom (df) and non-centrality parameter

λ = nf2. To reach 90% power with a 5% significance level, for

example, it is necessary that λ ≥ 10.5.

In this one-df case, the non-centrality value follows from per-

centiles of the standard normal distribution. If zx is the xth

percentile of the standard normal, than for significance level α

and power 1 − β, λ = (zα/2 + zβ)2.

99

Power of HWE test

0 5 10 15 20 25 30

0.0

00.0

20.0

40.0

60.0

8

Non−central Chi−square with 1 df, ncp=10.5

X^2

f(X

^2)

Probability=0.90

X^2=3.84

The area under the non-central chi-square curve to the right

of X2 = 3.84 is the probability of rejecting HWE when HWE

is false. This is the power of the test. In this plot, the non-

centrality parameter is λ = 10.5.

100

Significance Levels and p-values

The significance level α of a test is the probability of a false

rejection. It is specified by the user, and along with the null

hypothesis, it determines the rejection region. The specified, or

“nominal” value may not be achieved for an actual test.

Once the test has been conducted on a data set, the probability

of the observed test statistic, or a more extreme value, if the

null hypothesis is true is the p-value. The chi-square and normal

tests shown above give approximate p-values because they use a

continuous distribution for discrete data.

An alternative class of tests, “exact tests,” use a discrete distri-

bution for discrete data and provide accurate p-values. It may

be difficult to construct an exact test with a particular nominal

significance level.

101

Exact HWE Test

The preferred test for HWE is an exact one. The test rests

on the assumption that individuals are sampled randomly from

a population so that genotype counts have a multinomial distri-

bution:

Pr(nAA, nAa, naa) =n!

nAA!nAa!naa!(PAA)nAA(PAa)

nAa(Paa)naa

This equation is always true, and when there is HWE (PAA = p2Aetc.) there is the additional result that the allele counts have a

binomial distribution:

Pr(nA, na) =(2n)!

nA!na!(pA)nA(pa)

na

102

Exact HWE Test

Putting these together gives the conditional probability

Pr(nAA, nAa, naa|nA, na) =

n!nAA!nAa!naa!

(p2A)nAA(2pApa)nAa(p2a)

naa

(2n)!nA!na!

(pA)nA(pa)na

=n!

nAA!nAa!naa!

2nAanA!na!

(2n)!

Reject the Hardy-Weinberg hypothesis if this quantity, the prob-

ability of the genotypic array conditional on the allelic array, is

among the smallest of its possible values.

103

Exact HWE Test

For convenience, write the probability of the genotypic array,

conditional on the allelic array and HWE, as Pr(nAa|n, nA). Re-

ject the HWE hypothesis for a data set if this value is among

the smallest probabilities.

As an example, consider (nAA = 1, nAa = 0, naa = 49). The allele

counts are (nA = 2, na = 98) and there are only two possible

genotype arrays:

AA Aa aa Pr(nAa|n, nA)

1 0 49 50!1!0!49!

202!98!100! = 1

99

0 2 48 50!0!2!48!

222!98!100! = 98

99

104

Exact HWE Test

The probability of the data (nAA = 1, nAa = 0, naa = 49), con-

ditional on the allele frequencies and on HWE, is 1/99 = 0.01

which is not nearly as small as the value suggested by the chi-

square statistic of 50. (As PAa = 0, f = 1,X2 = n.)

Traditionally, the p-value is the (conditional) probability of the

data plus the probabilities of all the less-probable datasets. The

probabilities are all calculated assuming HWE is true. More re-

cently (Graffelman and Moreno, Statistical Applications in Ge-

netics and Molecular Biology 2-013; 12:433-448) it has been

shown that the test has a significance value closer to the nomi-

nal value if the p-value s half the probability of the data plus the

probabilities of all datasets that are less probably under the null

hypothesis. For the previous slide then, the p-value is 1/198.

105

Example

For a sample of size n = 100 with minor allele frequency of 0.07,

there are 8 sets of possible genotype counts:

Exact Chi-square

nAA nAa naa Prob. Mid p-value X2 p-value

93 0 7 0.0000 0.0000∗ 100.00 0.0000∗92 2 6 0.0000 0.0000∗ 71.64 0.0000∗91 4 5 0.0000 0.0000∗ 47.99 0.0000∗90 6 4 0.0002 0.0002∗ 29.07 0.0000∗89 8 3 0.0051 0.0028∗ 14.87 0.0001∗88 10 2 0.0602 0.0353∗ 5.38 0.0204∗87 12 1 0.3209 0.2262 0.61 0.434886 14 0 0.6136 0.6832 0.57 0.4503

So, for a nominal 5% significance level, the actual significance

level is 0.0204 for a chi-square test that rejects when nAa ≤ 10

and is 0.0353 for an exact test that also rejects when nAa ≤ 10.

106

Effect of Minor Allele Frequency

The minor allele frequency (MAF) in the previous example was

14/200 = 0.07. How does the exact test behave with other MAF

values?

In particular, what is the size of the rejection region for a nominal

value of α = 0.05? In other words, we decide to reject HWE

for any sample with a p-value of 0.05 or less, and we find the

total probability of all such datasets. We would hope that this

empirical significance level would be close to the nominal value,

but we find that it may not be.

107

na = 16 minor alleles

When the minor allele frequency is 0.08, for a nominal 5% signif-

icance level, the actual significance level is 0.0070 for an exact

test that rejects when nAa ≤ 10.

nAA nAa naa Pr(nAa|na) p −value

92 0 8 .0000 .000091 2 7 .0000 .000090 4 6 .0000 .000089 6 5 .0000 .000088 8 4 .0008 .000487 10 3 .0123 .007086 12 2 .0974 .061885 14 1 .3681 .294684 16 0 .5215 .7382

108


When the minor allele frequency is 0.075, for a nominal 5%

significance level, the actual significance level is 0.0474 for an

exact test that rejects when nAa ≤ 11.


92 1 7 .0000 .000091 3 6 .0000 .000090 5 5 .0000 .000089 7 4 .0004 .000288 9 3 .0081 .004587 11 2 .0776 .047486 13 1 .3464 .259485 15 0 .5675 .7163

109


When the minor allele frequency is 0.065, for a nominal 5%

significance level, the actual significance level is 0.0483 for an

exact test that rejects when nAa ≤ 9.


93 1 6 .0000 .000092 3 5 .0000 .000091 5 4 .0001 .000190 7 3 .0030 .003189 9 2 .0452 .048388 11 1 .2923 .340587 13 0 .6595 1.0000

110


When the minor allele frequency is 0.06, for a nominal 5% signif-

icance level, the actual significance level is 0.0344 for an exact

test that rejects when nAa ≤ 8.


94 0 6 .0000 .000093 2 5 .0000 .000092 4 4 .0000 .000091 6 3 .0017 .001790 8 2 .0327 .018189 10 1 .2612 .165088 12 0 .7045 .2955

111

Rohlfs and Weir, 2008

0.0 0.1 0.2 0.3 0.4 0.5

0.0

00

.04

0.0

80

.12

n=100, θ=4.0

p~A

po

we

r

0.0 0.1 0.2 0.3 0.4 0.5

0.0

0.2

0.4

n=100, θ=2.0

p~A

po

we

r

0.0 0.1 0.2 0.3 0.4 0.5

0.0

0.4

0.8

n=100, θ=0.1

p~A

po

we

r

0.0 0.1 0.2 0.3 0.4 0.5

0.0

00

.05

0.1

00

.15

n=1000, θ=4.0

p~A

po

we

r

0.0 0.1 0.2 0.3 0.4 0.5

0.0

0.4

0.8

n=1000, θ=2.0

p~A

po

we

r

0.0 0.1 0.2 0.3 0.4 0.5

0.0

0.4

0.8

n=1000, θ=0.1

p~A

po

we

r

This figure is not for the mid p values. It includes all the proba-

bility of the data in p-values.

112

Graffelman and Moreno, 2013

113

Power of Exact Test

If there is not HWE:

Pr(nAa|nA, na) =n!

nAA!nAa!naa!(PAA)nAA(PAa)

nAa(Paa)naa

=n!

nAA!nAa!naa!(PAA)

nA−nAa2 (PAa)

nAa(Paa)na−nAa

2

=n!

nAA!nAa!naa!(√PAA)nA(

√Paa)

na

(PAa√PAAPaa

)nAa

=CψnAa

nAA!nAa!naa!

where ψ = PAa/(√PAAPaa) measures the departure from HWE.

The constant C makes the probabilities sum to one over all

possible nAa values: C−1 =∑nAa

ψnAa/(nAA!nAa!naa!).

114

Power of Exact Test

Once the rejection region has been determined, the power of

the test (the probability of rejecting) can be found by adding

these probabilities for all sets of genotype counts in the re-

gion. HWE corresponds to ψ = 2. What is the power to

detect HWE when ψ = 1, the sample size is n = 10 and the

sample allele frequencies are pA = 0.75, pa = 0.25? Note that

1/C = 1/(5!5!0!) + 1/(6!3!1!) + 1/(7!1!2!).

Pr(nAa|nA, n)nAA nAa naa ψ = 2 ψ = 1

5 5 0 0.520 0.2626 3 1 0.433 0.3647 1 2 0.047 0.374

The ψ = 2 column shows that the rejection region is nAa = 1.

The ψ = 1 column shows that the power (the probability nAa = 1

when ψ = 1) is 37.4%.

115

Power Examples

For given values of n, na, the rejection region is determined from

null hypothesis and the power is determined from the multinomial

distribution.

Pr(nAa|na = 16)ψ .250 .500 1.000 2.000 4.000 8.000 16.000

nAa f .631 .398 .157 .000 −.062 −.081 −.085

0 .0042 .0000 .0000 .0000 .0000 .0000 .00002 .0956 .0026 .0000 .0000 .0000 .0000 .00004 .3172 .0349 .0003 .0000 .0000 .0000 .00006 .3568 .1569 .0056 .0000 .0000 .0000 .00008 .1772 .3116 .0441 .0008 .0000 .0000 .0000

10 .0433 .3047 .1725 .0123 .0003 .0000 .000012 .0054 .1506 .3411 .0974 .0098 .0007 .000014 .0003 .0356 .3223 .3681 .1485 .0422 .010916 .0000 .0032 .1142 .5214 .8414 .9571 .9890

Power .9943 .8107 .2225 .0131 .0003 .0000 .0000

116

na = 15 Probabilities

Pr(nAa|na = 15)ψ .250 .500 1.000 2.000 4.000 8.000 16.000

nAa f .622 .389 .150 .000 −.058 −.075 −.080

1 .0338 .0006 .0000 .0000 .0000 .0000 .00003 .2269 .0150 .0001 .0000 .0000 .0000 .00005 .3871 .1027 .0026 .0000 .0000 .0000 .00007 .2592 .2750 .0273 .0004 .0000 .0000 .00009 .0801 .3400 .1352 .0081 .0002 .0000 .0000

11 .0120 .2040 .3245 .0776 .0074 .0005 .000013 .0008 .0569 .3620 .3464 .1314 .0367 .009415 .0000 .0058 .1482 .5674 .8610 .9627 .9905

Power .9871 .7333 .1652 .0085 .0002 .0000 .0000

117


Pr(nAa|na = 14)ψ .250 .500 1.000 2.000 4.000 8.000 16.000

nAa f .613 .378 .143 .000 −.054 −.070 −.074

0 .0062 .0001 .0000 .0000 .0000 .0000 .00002 .1256 .0051 .0000 .0000 .0000 .0000 .00004 .3610 .0582 .0010 .0000 .0000 .0000 .00006 .3422 .2207 .0156 .0002 .0000 .0000 .00008 .1375 .3547 .1002 .0051 .0001 .0000 .0000

10 .0255 .2631 .2973 .0602 .0054 .0004 .000012 .0021 .0877 .3964 .3209 .1150 .0316 .008114 .0001 .0105 .1895 .6136 .8795 .9680 .9919

Power .9723 .6387 .1168 .0053 .0001 .0000 .0000

118


Pr(nAa|na = 13)ψ .250 .500 1.000 2.000 4.000 8.000 16.000

nAa f .603 .366 .136 .000 −.050 −.065 −.068

1 .0479 .0012 .0000 .0000 .0000 .0000 .00003 .2786 .0275 .0003 .0000 .0000 .0000 .00005 .4004 .1583 .0080 .0001 .0000 .0000 .00007 .2169 .3430 .0696 .0030 .0001 .0000 .00009 .0508 .3216 .2611 .0452 .0038 .0003 .0000

11 .0051 .1301 .4225 .2923 .0994 .0269 .006913 .0002 .0183 .2383 .6595 .8967 .9728 .9931

Power .9947 .8516 .3391 .0483 .0039 .0003 .0000

119


Pr(nAa|na = 12)ψ .250 .500 1.000 2.000 4.000 8.000 16.000

nAa f .592 .353 .128 .000 −.046 −.059 −.063

0 .0095 .0001 .0000 .0000 .0000 .0000 .00002 .1674 .0102 .0001 .0000 .0000 .0000 .00004 .4053 .0991 .0037 .0000 .0000 .0000 .00006 .3108 .3039 .0449 .0017 .0000 .0000 .00008 .0947 .3703 .2188 .0326 .0026 .0002 .0000

10 .0118 .1852 .4376 .2612 .0846 .0226 .005812 .0005 .0312 .2950 .7044 .9127 .9772 .9942

Power .9877 .7836 .2674 .0344 .0027 .0002 .0000

120

Rohlfs and Weir, 2008

0.0 0.1 0.2 0.3 0.4 0.5

0.0

00

.04

0.0

80

.12

n=100, θ=4.0

p~A

po

we

r

0.0 0.1 0.2 0.3 0.4 0.5

0.0

0.2

0.4

n=100, θ=2.0

p~A

po

we

r

0.0 0.1 0.2 0.3 0.4 0.5

0.0

0.4

0.8

n=100, θ=0.1

p~A

po

we

r

0.0 0.1 0.2 0.3 0.4 0.5

0.0

00

.05

0.1

00

.15

n=1000, θ=4.0

p~A

po

we

r

0.0 0.1 0.2 0.3 0.4 0.5

0.0

0.4

0.8

n=1000, θ=2.0

p~A

po

we

r

0.0 0.1 0.2 0.3 0.4 0.5

0.0

0.4

0.8

n=1000, θ=0.1

p~A

po

we

r

121

Graffelman and Moreno, 2013

122

Permutation Test

For large sample sizes and many alleles per locus, there are too

many genotypic arrays for a complete enumeration and a deter-

mination of which are the least probable 5% arrays.

A large number of the possible arrays is generated by permuting

the alleles among genotypes, and calculating the proportion of

these permuted genotypic arrays that have a smaller conditional

probability than the original data. If this proportion is small, the

Hardy-Weinberg hypothesis is rejected.

123

Permutation Test

Mark a set of five index cards to represent five genotypes:

Card 1: A A

Card 2: A A

Card 3: A A

Card 4: a a

Card 5: a a

Tear the cards in half to give a deck of 10 cards, each with

one allele. Shuffle the deck and deal into 5 pairs, to give five

genotypes.

124

Permutation Test

The permuted set of genotypes fall into one of four types:

AA Aa aa Number of times

3 0 2

2 2 1

1 4 0

125

Permutation Test

Find the theoretical values for the proportions of each of the

three types, from the expression:

n!

nAA!nAa!naa!× 2nAanA!na!

(2n)!

AA Aa aa Conditional Probability

3 0 2

2 2 1

1 4 0

These should match the proportions found by repeating shuf-

flings of the deck of 10 allele cards.

126

Multiple Testing

When multiple tests are performed, each at significance level α,

a proportion α of the tests are expected to cause rejection even

if all the hypotheses are true.

Bonferroni correction makes the overall (experimentwise) signif-

icance level equal to α by adjusting the level for each individual

test to α′. If α is the probability that at least one of the L tests

causes rejection, it is also 1 minus the probability that none of

the tests causes rejection:

α = 1 − (1 − α′)L

≈ Lα′

provided the L tests are independent.

If L = 15, need α′ = 0.0033 in order for α = 0.05.

127

Combining p-values

There is also the issue that if the same hypothesis is tested L

times and just fails to cause rejection each time, there is some

overall evidence against the hypothesis.

Suppose that tests have been conducted for each of L hypotheses

Hi, i = 1,2, . . . , L. For each test the p-value pi is calculated: if

Hi is true, this is the probability of observing a test statistic

as extreme as or more extreme than the observed value in the

direction of rejection.

128

Combining p-values

Methods for combining p-values rest on p having a uniform dis-

tribution when the hypothesis is true. Suppose the test statistic

X has the continuous distribution f(x) when the hypothesis is

true. The p-value when X = x is

p =

∫ ∞

xf(y)dy

(A one-tailed test when large values of X cause rejection.) The

distribution function F (x) for X is

F (x) =

∫ x−∞

f(y)dy = 1 − p

The density function for U = F (x) is

f(u) = f(x)dx

du= f(x)/

du

dx= f(x)/f(x) = 1

Therefore U is uniform on [0,1], and so is p.

129

Fisher’s Method

If U has a uniform distribution, transform to

V = −2 ln(U)

U = e−V/2

f(v) = f(u)du

dv

= (1/2)e−v/2

and this is the density function of a chi-square distribution with

2 d.f.

130

Fisher’s Method

If a hypothesis is true, and the significance level is p, then X2 =

−2 ln(p) is distributed as χ2(2)

. (If p = 0.05 then X2 = 5.99.)

Therefore

t = −2L∑

i=1

ln pi = −2 ln(L∏

i=1

pi)

has a chi-square distribution with 2L df when all L hypotheses

are true.

Therefore, the p-value for the hypothesis HT that all Hi are true

is the probability of a χ2(2L)

variable being greater or equal to the

observed value t.

131

Example

HW-test p-values for previous data.

Locus Cauc. Afr.Am. Hisp. −2∑

ln pLDLR 1.00 0.14 0.69 4.67 nsGYPA 0.22 1.00 0.42 4.76 nsHBGG 1.00 0.44 0.21 4.76 nsD7S8 1.00 0.64 0.69 1.63 nsGc 0.01 1.00 0.59 10.27 ns

−2∑

ln p 12.24 6.47 7.40 26.10ns ns ns ns

Do not reject overall hypothesis of no departures from HW over

populations for each locus, or over loci for each population, or

over all locus-population pairs.

132

QQ-Plots

An alternative approach to considering multiple-testing issues is

to use QQ-plots. If all the hypotheses being tested are true then

the resulting p-values are uniformly distributed between 0 and 1.

For a set of n tests, we would expect to see p values at 1/(n+

1),2/n, . . . , n/(n+ 1),1. We plot the observed p-values against

these expected values: the smallest against 1/(n + 1) and the

largest against 1. It is more convenient to transform to − log10(p)

to accentuate the extremely small p values. The point at which

the observed values start departing from the expected values is

an indication of “significant” values in a way that takes into

account the number of tests.

133

QQ-Plots

0 1 2 3 4

02

46

81

01

2

HWE Test: No SNP Filtering

−log10(p):Expected

−lo

g1

0(p

):O

bse

rve

d

The results for 9208 SNPs on human chromosome 1. Bonferroni

would suggest rejecting HWE when p ≤ 0.05/9205 = 5.4 × 10−6

or − log10(p) ≥ 5.3.

134

QQ-Plots

0 1 2 3 4

02

46

81

01

2

HWE Test: SNPs Filtered on Missingness

−log10(p):Expected

−lo

g1

0(p

):O

bse

rve

d

The same set of results as on the previous slide except now that

any SNP with any missing data was excluded. Now 7446 SNPs

and Bonferroni would reject if − log10(p) ≥ 5.2. All five outliers

had zero counts for the minor allele homozygote and at least 32

heterozygotes in a sample of size 50.

135

Linkage Disequilibrium

This term reserved for association between pairs of alleles – one

at each of two loci.

When gametic data are available, could refer to gametic disequi-

librium.

When genotypic data are available, but gametes can be inferred,

can make inferences about gametic and non-gametic pairs of

alleles.

When genotypic data are available, but gametes cannot be in-

ferred, can work with composite measures of disequilibrium.

136

Linkage Disequilibrium

For alleles A and B are two loci, the usual measure of linkage

disequilibrium is

DAB = PAB − pApB

Whether or not this is zero does not provide a direct state-

ment about linkage between the two loci. For example, consider

marker YFM and disease DTD:

A N Total

+ 1 24 25YFM

− 0 75 75

Total 1 99 100

DA+ =1

100− 1

100

25

100= 0.0075, (maximum possible value)

137

Gametic Linkage Disequilibrium

For loci A, B define indicator variables x, y that take the value

1 for allele A,B and 0 for any other alleles. If gametes within

individuals are indexed by j, j = 1,2 then for expectations over

samples from the same population

E(xj) = pA, j = 1,2 , E(yj) = pB j = 1,2

E(x2j ) = pA, j = 1,2 , E(y2j ) = pB j = 1,2

E(x1x2) = PAA , E(y1y2) = PBB

E(x1y1) = PAB , E(x2y2) = PAB

The variances of xj, yj are pA(1− pA), pB(1− pB) for j = 1,2 and

the covariance and correlation coefficients for x and y are

Cov(x1, y1) = Cov(x2, y2) = PAB − pApB = DAB

Corr(x1, y1) = Corr(x2, y2) = DAB/[pA(1 − pA)pB(1 − pB)] = ρAB

138

Estimation of LD

With random sampling of gametes, gamete counts have a multi-

nomial distribution:

Pr(nAB, nAb, naB, nab) =n!(PAB)nAB(PAb)

nAb(PaB)naB(Pab)nab

nAB!nAb!naB!nab!

=n!(pApB +DAB)nAB(pApb −DAB)nAb

nAB!nAb!naB!nab!

× (papB −DAB)naB(papb +DAB)nab

and this provides the maximum likelihood estimates of DAB and

ρAB:

DAB =nABn

− nAB + nAbn

× nAB + naBn

= PAB − pApB

ρAB = rAB =DAB√pApapBpb

139

Testing LD

Write MLE of DAB as

DAB =nABnab − nAbnaB

(nAB + nAb)(naB + nab)(nAB + naB)(nAb + nab)

and use “Delta method” to find

Var(DAB) ≈ 1

n[pA(1 − pA)pB(1 − pB)

+ (1 − 2pA)(1 − 2pB)DAB −D2AB]

When DAB = 0, Var(DAB) = pA(1 − pA)pB(1 − pB)/n.

If DAB is assumed to be normally distributed then

X2AB =

D2AB

Var(DAB)= nρ2AB = nr2AB

is appropriate for testing H0 : DAB = 0. When H0 is true,

X2AB ∼ χ2

(1).

140


The test statistic for the 2 × 2 table

nAB nAb nAnaB nab nanB nb n

has the value

X2 =n(nABnab − nAbnaB)2

nAnanBnb

For DTD/YFM example, X2 = 3.03. This is not statistically

significant, even though disequilibrium was maximal.

141

Composite Disequilibrium

When genotypes are scored, it is often not possible to distinguish

between the two double heterozygotes AB/ab and Ab/aB, so that

gametic frequencies cannot be inferred.

Under the assumption of random mating, in which genotypic fre-

quencies are assumed to be the products of gametic frequencies,

it is possible to estimate gametic frequencies with the EM algo-

rithm. To avoid making the random-mating assumption, how-

ever, it is possible to work with a set of composite disequilibrium

coefficients.

142


Although the separate digenic frequencies pAB (one gamete) and

pA,B (two gametes) cannot be observed, their sum can be since

pAB = PABAB +1

2PABAb +

1

2PABaB +

1

2PABab

pA,B = PABAB +1

2PABAb +

1

2PABaB +

1

2PAbaB

pAB + pA,B = 2PABAB + PABAb + PABaB +PABab + PAbaB

2

Digenic disequilibrium is measured with a composite measure

∆AB defined as

∆AB = pAB + pA,B − 2pApB

= DAB +DA,B

which is the sum of the gametic (DAB = pAB−pApB) and nonga-

metic (DA,B = pA,B − pApB) coefficients.

143


If the counts of the nine genotypic classes are

BB Bb bbAA n1 n2 n3Aa n4 n5 n6aa n7 n8 n9

the count for pairs of alleles in an individual being A and B,

whether received from the same or different parents, is

nAB = 2n1 + n2 + n4 +1

2n5

and the MLE for ∆ is

∆AB =1

nnAB − 2pApB

144

Composite Linkage Disequilibrium

For loci A, B define indicator variables x, y that take the value

1 for allele A,B and 0 for any other alleles. If gametes within

individuals are indexed by j, j = 1,2 then for expectations over

samples from the same population

E(xj) = pA, j = 1,2 , E(yj) = pB j = 1,2

E(x2j ) = pA, j = 1,2 , E(yj) = pB j = 1,2

E(x1x2) = PAA , E(y1y2) = PBB

E(x1y1) = PAB , E(x2y2) = PAB

E(x1y2) = PA,B , E(x2y1) = PA,B

Write

DA = PAA − p2A , DB = PBB − p2B

DAB = PAB − pApB , DA,B = PA,B − pApB

∆AB = DAB +DA,B

145


Now set X = x1 + x2, Y = y1 + y2 to get

E(X) = 2pA , E(Y ) = 2pB

E(X2) = 2(pA + PAA) , E(Y 2) = 2(pB + PBB)

Var(X) = 2pA(1 − pA)(1 + fA) , Var(Y ) = 2pB(1 − pB)(1 + fB)

and

E(XY ) = 2(PAB + PA,B)

Cov(X,Y ) = 2(PAB − pApB) + 2(PA,B − pApB)

= 2(DAB +DA,B) = 2∆AB

Corr(X,Y ) =∆AB√

pA(1 − pA)(1 + fA)pB(1 − pB)(1 + fB)

146


∆AB = nAB/n− 2pApB

where

nAB = 2nAABB + nAABb + nAaBB +1

2nAaBb

This does not require phased data.

By analogy to the gametic linkage disequilibrium result, a test

statistic for ∆AB = 0 is

X2AB =

n∆2AB

pA(1 − pA)(1 + fA)pB(1 − pB)(1 + fB)

This is assumed to be approximately χ2(1)

under the null hypoth-

esis.

147

Example

For loci LDLR and GYPA in the Hispanic Sample:

BB Bb bb Total

AA 1 1 2 4Aa 7 2 1 10aa 2 4 0 6

Total 10 7 3 20

nAB = 2 × 1 + 1 + 7 +1

2(2) = 11

nA = 18 , pA = 0.450

nB = 27 , pB = 0.675

fA =4

20−(18

40

)2

= − 1

400= −0.0025

fB =10

20−(27

40

)2

=71

1600= 0.0444

148

Example

∆AB =11

20− 2

18

4

27

40= − 46

800= −0.0575

X2AB =

20(−0.0575)2

(0.450)(0.550)(0.9975)(0.675)(0.325)(1.0444)= 1.11

Previous work on EM algorithm estimated pAB as 11.88/40=0.2970

so

DAB = 0.2970− (0.450)(0.675) = −0.008

X2AB =

40(−0.0068)2

(0.450)(0.550)(0.675)(0.325)= 0.03

149

LD vs Composite LD Estimates

� � � � �

� � � � �

� � � �

� � � �

� � � �

� � � � � � � � � � � � � � � � � � � � � ��

�

� � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � �

A comparison of gametic LD estimates from the EM algorithm

assuming HWE vs composite LD with no HWE assumption.

150

LD vs Composite LD Tests

�� !

" �" �" �" " !

� �

� # " � " # � �$ % & ' ( ) ) * + , - . / 0 1

23 45678 9:: ;<= 6>

?@A

B C D E F G H I J C K L M D G N C O G P E C Q C R F C E S T G O T O T M T C O T C U O V W M E U M O C M K X

A comparison of gametic LD tests from the EM algorithm as-

suming HWE vs composite LD with no HWE assumption.

151

POPULATION STRUCTURE

152

Population Data

Individuals from several populations are scored at a series of

marker loci. At each locus, an individual has two alleles, one

from each parent, and these can be identified. For example, at

locus D3S1358:

Allele AFC NSC QLC SAC TAC VIA WAB11 .000 .001 .002 .001 .000 .000 .00012 .004 .003 .001 .001 .000 .000 .01013 .008 .003 .002 .002 .000 .000 .00114 .123 .098 .159 .125 .152 .008 .07515 .261 .264 .365 .252 .244 .385 .35316 .250 .270 .250 .265 .241 .277 .24217 .187 .198 .123 .202 .197 .246 .19018 .154 .152 .091 .144 .157 .077 .12219 .012 .011 .006 .007 .010 .008 .00720 .002 .000 .000 .000 .000 .000 .000

153

Questions of Interest

• How much genetic variation is there? (animal conservation)

• How much migration (gene flow) is there between popula-

tions? (molecular ecology)

• How does the genetic structure of populations affect tests for

linkage between genetic markers and human disease genes?

(human genetics)

• How should the evidence of matching marker profiles be

quantified? (forensic science)

• What is the evolutionary history of the populations sampled?

(evolutionary genetics)

154

Statistical Analysis

Possible to approach these data from purely statistical viewpoint.

Could test for differences in allele frequencies among populations.

Could use various multivariate techniques to cluster populations.

These analyses may not answer the biological questions.

155

Frequencies of Allele Au

Population 1 Population r

p1u . . .

pu

pru

Among samples from a population: piu ∼ Binomial(n, piu).

Among replicates of a population: piu ∼ Beta((1−θi)pu

θi, (1−θi)(1−pu)θi

).

156

Beta distribution: Theoretical

The beta distribution can take a variety of shapes.

0.0 0.4 0.80

.61

.01

.4

v=1,w=1

0.0 0.4 0.8

0.0

1.0

v=2,w=2

0.0 0.4 0.8

0.0

1.0

2.0

v=4,w=4

0.0 0.4 0.8

0.0

1.0

2.0

v=2,w=1

0.0 0.4 0.80

12

34

v=4,w=1

0.0 0.4 0.8

01

23

4

v=1,w=4

0.0 0.4 0.8

1.0

1.3

1.6

v=0.9,w=0.9

0.0 0.4 0.8

24

68

u=0.5,w=0.5

0.0 0.4 0.80

10

25

u=0.5,w=4

157

Beta distribution: Experimental

The beta distribution is suggested by a Drosophila cage experi-

ment by P. Buri (Evolution 10:367, 1956).

158

Model Details

It is useful to attach indicator variables to each allele. For the

jth allele sampled from the ith population

xij =

{1 allele is of type Au0 otherwise

The model specifies expectations for these indicators. Over sam-

ples from one population:

E(xij) = piu

E(xijxij′) = p2iu j 6= j′, large ni

Over samples from each population and over replicates of each

population:

E(xij) = pu

E(xijxij′) = θipu + (1 − θi)p2u, j 6= j′

E(xijxi′j′) = θii′pu + (1 − θii′)p2u, i 6= i′

159

Inbreeding Coefficients

f1 . . .

F

fr

Within a population: Puui = p2iu + fipiu(1 − piu).

Over replicates of a population: E(Puui) = p2u + Fipu(1 − pu).

160

Coancestry Coefficients

BB

BB

BB

BBBM

��

θi

. . .

HHHHHHHHHHHHY

��*

θii′B

BBB

BB

BBBM

��

θi′

For a population i, using average allele frequencies:

E(Puui) = p2u + θipu(1 − pu), if Fi = θi.

For two populations i, i′: E(Pu,uii′) = p2u + θii′pu(1 − pu).

161

Genotypic Frequencies

Within a population:

Puui = p2iu + fipiu(1 − piu)

Take expected values over replicates of a population:

E(piu) = pu

E(p2iu) = p2u + θipu(1 − pu)

E(Puui) = [p2u + θipu(1 − pu)] + fi[pu − p2u − θipu(1 − pu)]

= p2u + [θi + fi(1 − θi)]pu(1 − pu)

= p2u + Fipu(1 − pu)

since fi = (Fi − θi)/(1 − θi) or FIS = (FIT − FST)/(1 − FST) and

this is zero for HWE within a population.

162

Sample Allele Frequencies: Weir & Hill, 2002

The sample frequency for allele Au for population i can be written

as an average of indicator variables:

piu =1

ni

ni∑

j=1

xij

and this leads to variances and covariances of frequencies over

samples from populations and over replicates of populations:

Var(piu) = pu(1 − pu)

(θi +

1 − θini

)

Cov(piu, piu′) = −pupu′(θi +

1 − θini

), u 6= u′

Cov(piu, pi′u) = pu(1 − pu)θii′, i 6= i′

Cov(piu, pi′u′) = −pupu′θii′, u 6= u′, i 6= i′

163

Weir & Cockerham 1984 Restricted Model

If populations have equal evolutionary histories (θi = θ, all i) and

are independent (θii′ = 0, all i 6= i′)


(θ+

1 − θ

ni

)

Cov(piu, piu′) = −pupu′(θ+

1 − θ

ni

), u 6= u′

Cov(piu, pi′u) = 0, i 6= i′

Cov(piu, pi′u′) = 0, u 6= u′, i 6= i′

The overall allele frequencies were weighted by sample sizes

pA =1

∑i ni

∑

i

nipiu

If θ = 0, these weighted means have minimum variance.

164


Under this model, the sampled populations can be regarded as

evolutionary realizations of the process that led to any one of

them. The sampled populations are equivalent to the population

replicates that have been discussed in this treatment.

165


Two mean squares were constructed for each allele:

MSBu =1

r − 1

r∑

i=1

ni(piu − pu)2

MSWu =1

∑i(ni − 1)

∑

i

nipiu(1 − piu)

These have expected values

E(MSBu) = pu(1 − pu)[(1 − θ) + ncθ]

E(MSWu) = pu(1 − pu)(1 − θ)

where nc = (∑i ni−

∑i n

2i /∑i ni)/(r− 1). The Weir & Cockeram

estimator of θ (or FST) is

θ =

∑u(MSBu − MSWu)

MSBu + (nc − 1)MSWu

166

Unweighted Model

Bhatia et al., Genome Research 23:1514-1521, 2013, returned

to early work by Nei. They used heterozygosities instead of mean

squares:

Hi =ni

ni − 1

∑

upiu(1 − piu) =

nini − 1

(1 −∑

up2iu)

Hii′ = 1 −∑

upiupi′u

Unweighted averages over populations are

HW =1

r

∑

i

Hi , θW =1

r

∑

i

θi

HB =1

r(r − 1)

∑

i 6=i′Hii′ , θB =

1

r(r − 1)

∑

i 6=i′θii′

167

Unweighted Model Expectations

Under the general model of Weir & Hill, these within- and between-

population heterozygosities have expectations

E(Hi) = H(1 − θi)

E(Hii′) = H(1 − θii′)

E(HW ) = H(1 − θW)

E(HB) = H(1 − θB)

where H =∑u pu(1 − pu) = 1 −∑

u p2u.

These equations suggest estimation by the method of moments.

168

Unweighted Model Estimates

Can only estimate θi and θii′ “relative to” θB:

βi = 1 − HiHB

, E(βi) =θi − θB1 − θB

βW = 1 − HWHB

, E(βW ) =θW − θB1 − θB

βii′ = 1 − Hii′

HB, E(βii′) =

θii′ − θB1 − θB

Estimation of θB is not possible unless the whole set of popu-

lations is replicated. Different values of βi or βii′ imply different

values of θi or θii′.

169

Weir & Cockerham Estimator

Under the Weir and Hill model, the Weir and Cockerham esti-

mator has expectation

E(θWC) =θcW − θcB +Q

1 − θcB +Qinstead of

θW − θB1 − θB

where

θcW =

∑i nciθi∑

i nci

, θcB =

∑i 6=i′ nini′θii′∑i 6=i′ nini′

nci = ni −n2i∑i ni

, nc =1

r − 1

∑

i

nci

Q =1

(r − 1)nc

∑

i

(nin

− 1

)θi

If the Weir and Cockerham model holds (θi = θ), or if ni = n, or

if nc is large, then Q = 0.

170

HapMap III Data

Sample Number ofCode Population Description size SNPs typedASW African ancestry in Southwest USA 142 61566CEU Utah residents with Northern and Western 324 56218

European ancestry from CEPH collectionCHB Han Chinese in Beijing, China 160 51946CHD Chinese in Metropolitan Denver, Colorado 140 50517GIH Gujarati Indians in Houston, Texas 166 55710JPT Japanese in Tokyo, Japan 168 53020LWK Luhya in Webuye, Kenya 166 60026MXL Mexican ancestry in Los Angeles, California 142 57937MKK Maasai in Kinyawa, Kenya 342 61057TSI Toscani in Italia 154 55881YRI Yoruba in Ibadan, Nigeria 326 58559

171

Weir & Cockerham vs Unweighted Estimator

0.00 0.05 0.10 0.15 0.20 0.25

0.0

00.0

50.1

00.1

50.2

00.2

5

HapMap Fst estimates: no SNP filtering

Bhatia et al

Weir &

Cockerh

am

FST estimates for HapMap III, using all 87,592 SNPs on chromosome 1.

172

Weir & Cockerham vs Unweighted Estimator

0.00 0.05 0.10 0.15 0.20 0.25

0.0

00.0

50.1

00.1

50.2

00.2

5

HapMap Fst estimates: SNP filtering

Bhatia et al

Weir &

Cockerh

am

FST estimates for HapMap III, using the 42,463 SNPs on chromosome 1

that have at least five copies of the minor allele in samples from all 11

populations.

173

Multiple Loci: Weir & Cockerham

The Weir & Cockerham estimators for locus l are of the form

Estimatorl =

∑uNumeratorlu∑

uDenominatorlu

With several loci, these can be extended to

Estimator =

∑l,uNumeratorlu∑

l,uDenominatorlu

If every locus has the same value of θ, this multi-locus estimator

gives an estimate of the common value. Otherwise it estimates

a weighted average of the different θ values, where the weights

are functions of the allele frequencies at the loci in the sum.

174

Multiple Loci: Unweighted

The unweighted estimators for locus l are of the form

Estimatorl =HBl − Hxl

HBl, x = i,W, ii′

With several loci, these can be extended to

Estimator =

∑l(HBl − Hxl)∑

l HBlx = i, ii′,W

and these estimate (θx− θB)/(1− θB) if each locus has the same

value of the θ’s. Otherwise thet estimate a weighted average

of the different θ values, where the weights are functions of the

allele frequencies at the loci in the sum.

175

βW in LCT Region: 3 Populations

100 120 140 160 180

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

HapMap III Chromosome 2

Position

Be

ta−

W

LCT

CEU

CHB

YRI

176

βW in LCT Region: 11 Populations

100 120 140 160 180

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

HapMap III Chromosome 2

Position

Be

ta−

W

LCT

CEU

MKK

Other

177

MKK Population

“The Maasai are a pastoral people in Kenya and Tanzania, whose

traditional diet of milk, blood and meat is rich in lactose, fat

and cholesterol. In spite of this, they have low levels of blood

cholesterol, and seldom suffer from gallstones or cardiac diseases.

Analysis of HapMap 3 data using Fixation Index (Fst) identified

genomic regions and single nucleotide polymorphisms (SNPs)

as strong candidates for recent selection for lactase persistence

and cholesterol regulation in 143156 founder individuals from the

Maasai population in Kinyawa, Kenya (MKK). The strongest

signal identified by all three metrics was a 1.7 Mb region on

Chr2q21. This region contains the gene LCT (Lactase) involved

in lactase persistence.”

Wagh et al., PLoS One 7: e44751, 2012

178

Mean and Variance of βW in LCT Region

100 120 140 160 180

0.0

00

.05

0.1

00

.15

0.2

00

.25

0.3

0

HapMap III − Chromosome 2

Position (Mb)

Be

ta−

W

Mean

StdDev

LCT

179

Pure Drift Model

Within a species, divergence among populations is driven primar-

ily by genetic drift – the random choice of one of the two parental

alleles for transmission from an individual to an offspring. Mu-

tation is likely to be of less importance in the short term, and

natural selection may not affect genetic markers.

Suppose a parental population has nu of 2N alleles of type Au.

Conditional on pu = nu/2N , the number of alleles of this type in

the offspring generation of N individuals is B(2N,pu).

180

Drift Model Variance

Write initial allele frequency as p0.

Conditional on Total overGenera Allele previous gen. all generations-tion freq. Mean Variance Mean Variance

0 p0 p0 0 p0 0

1 p1 p0p0(1−p0)

2Np0 p0(1 − p0)

[1 −

(1 − 1

2N

)]

2 p2 p1p1(1−p1)

2Np0 p0(1 − p0)

[1 −

(1 − 1

2N

)2]

3 p3 p2p2(1−p2)

2Np0 p0(1 − p0)

[1 −

(1 − 1

2N

)3]

· · · · · · · · · · · · · · · · · ·

t pt pt−1pt−1(1−pt−1)

2Np0 p0(1 − p0)

[1 −

(1 − 1

2N

)t]

181

Drift Model IBD

Could also define θ as

θt = 1 −(1 − 1

2N

)t

so that

Var(pt) = p0(1 − p0)θt

We can see that

θt =1

2N+

(1 − 1

2N

)θt−1

and θ can be interpreted as the probability that two alleles are

identical by descent. This probability interpretation is not neces-

sary. With this drift model, θ serves as a measure of the number

of generations since the founding population (θ = 0).

182

Drift Model: Two Populations

?

t1

θ12�

��

��

@@

@@@θ1 θ2

θ1 = 1 − (1 − θ12)Xt11 , θ2 = 1 − (1 − θ12)X

t12

where Xi = (2Ni− 1)/2Ni and Ni is the population size for pop-

ulation i. Therefore

βi =θi − θ12

1 − θ12= 1 −X

t1i ≈ t1

2Ni

and this is the quantity estimated by the estimate βi. Note that

the times from each population to their most recent common

ancestral population must be the same.

183


The unweighted within-population average estimator βW has ex-

pectation

βW =θW − θB1 − θB

=

θ1+θ22 − θ12

1 − θ12

= 1 − Xt11 +X

t12

2

≈ 1

2

(1

2N1+

1

2N2

)t1 =

t12Nh

so time is being estimated in terms of the harmonic mean of the

two population sizes.

184


The Weir & Cockerham estimator has an expectation of

E(θWC) =

θ1+θ22 − θ12 + (n1−n2)(θ1−θ2)

2n1n2

1 − θ12 + (n1−n2)(θ1−θ2)2n1n2

which reduces to

E(θWC) =

θ1+θ22 − θ12

1 − θ12

if n1 = n2 or θ1 = θ2.

185

Within-populations Example

FBI: African American (AA), Caucasian (CA) and Hispanic (HI) data.

βiLocus AA CA HI Average βWC

D3S135 .010 −.030 .069 .017 .019vWA .000 −.002 .050 .017 .019FGA .007 .012 −.008 .003 .006D8S117 .026 .003 .009 .012 .015D21S11 −.012 −.003 .047 .012 .014D18S51 .011 .008 .010 .010 .012D5S818 −.018 .061 .012 .019 .021D13S31 .132 .028 −.042 .036 .040D7S820 .014 −.016 .026 .008 .011CSF1PO −.048 .015 .051 .006 .008TPOX −.118 .090 .112 .027 .030THO1 .078 .008 .041 .043 .045D16S53 −.011 .028 .024 .014 .016All loci .010 .017 .032 .020 .020

186

Between-populations Example

FBI: African American (AA), Caucasian (CA) and Hispanic (HI) data.

βii′Locus AA&CA AA&HI CA&HID3S135 −.018 .026 −.009vWA −.018 .006 .010FGA .006 −.002 −.004D8S117 −.012 .002 .009D21S11 −.008 −.008 .015D18S51 −.004 −.008 .011D5S818 .003 −.039 .033D13S31 .058 −.021 −.032D7S820 .001 .004 −.006CSF1PO −.024 −.009 .034TPOX −.043 −.053 .097THO1 −.034 .028 .006D16S53 −.009 −.009 .018Total −.008 −.006 .014

187

Allele Frequency Distribution

The moment estimators of θ’s and β’s made use only of the

mean and variance of the distribution of allele frequencies over

populations. The advantages of maximum likelihood estimation

are available if the whole distribution is known.

A convenient distribution is the normal, although it can be only

an approximation. Under a class of evolutionary models (allelic

exchangeability and stationarity) the Beta or Dirichlet distribu-

tion may be used.

188

Normal Distribution

If allele frequencies are not too extreme, it may be satisfactory

to assume they have a normal distribution over populations.

In the case of equal values of θi and zero values of θii′, the sample

frequencies piu for alleles Au in sample i have these properties:

E(piu) = pu


[θ+

1 − θ

2n

]

Cov(piu, piu′) = −pupu′[θ+

1 − θ

2n

], u 6= u′

Cov(piu, pi′u′) = 0, i 6= i′

Multiple alleles suggest a multivariate normal approach.

189

Multivariate Normality

Q =r∑

i=1

(pi − p)′V−1(pi − p)

=r∑

i=1

m∑

u=1

(piu − pu)2

pu

(pu =∑ri=1 piu/r) has distribution

Q ∼ θχ2(r−1)(m−1)

suggesting that

(r − 1)(m− 1)θ = Q

or

θN ≈ 1

(m− 1)(r − 1)

r∑

i=1

m∑

u=1

(piu − pu)2

pu

190

Properties of Normal Estimate

From the chi-distribution of Q:

E(Q) = (r − 1)(m− 1)θ

Var(Q) = 2(r − 1)(m− 1)θ2

so that

E(θN) = θ

Var(θN) =2[1 + (2n− 1)θ)]2

(r − 1)(m− 1)(2n− 1)2

≈ 2θ2

(r − 1)(m− 1), (large n)

191


For large sample sizes,

(r − 1)(m− 1)θNθ

∼ χ2(r−1)(m−1)

so a 95% confidence interval for θ is

((r − 1)(m− 1)θN

χ20.975

,(r − 1)(m− 1)θN

χ20.025

)

An estimated θ of 0.01 from data from five populations for a

locus with five alleles, for example, leads to a 95% confidence

interval of (0.001, 0.036).

Analytical approach more convenient than bootstrapping.

192

Effect of Number of Loci

FST

Fre

qu

en

cy

0.0 0.2 0.4 0.6 0.8

05

00

01

00

00

15

00

0

HapMap

FST

0.0 0.2 0.4 0.6 0.8

01

00

00

20

00

03

00

00

40

00

0

Perlegen

5Mb window

Individual SNP

193

INBREEDING AND RELATEDNESS

194

Inbreeding

Two alleles that descend from the same allele are said to be

identical by descent (ibd).

The probability that two parents transmit ibd alleles to a child

is the inbreeding coefficient F of the child.

Use small letters for alleles, and capital letters for their particular

type: i.e. parents might transmit alleles a and b to their child,

and these alleles might both be type A.

Mother Father(ac) (bd)

↘ ↙a b

↘ ↙child(ab)

195

Relatedness

Two individuals that have ibd alleles are said to be related.

The probability that an allele taken at random from one individual

is ibd to an allele taken at random from another individual is the

coancestry coefficient θ of those two individuals.

The inbreeding coefficient of an individual is the coancestry of

its parents.

196

Relatedness

X Y(ac) (bd)

↘ ↙a b

↘ ↙I

(ab)

FI = θXY

197

Path Counting

A↙ ↘

... ...↘ ↓ ↓ ↙

X Y↘ ↙

I

Identify the path linking the parents of I to their common an-

cestor(s).

198

Path Counting

If the parents X,Y of an individual I have ancestor A in common,

and if there are n individuals (including X,Y, I) in the path linking

the parents through A, then the inbreeding coefficient of I, or

the coancestry of X and Y , is

FI = θXY =

(1

2

)n(1 + FA)

If there are several ancestors, this expression is summed over all

the ancestors.

199

Parent-Child

X↘

Y

The common ancestor of parent X and child Y is X. The path

linking X,Y to their common ancestor is Y X and this has n = 2

individuals. Therefore

θXY =

(1

2

)2

=1

4

200

Grandparent-grandchild

X↘

V↘

Y

The common ancestor of grandparent X and grandchild Y is X.

The path linking X,Y to their common ancestor is Y V X and this

has n = 3 individuals. Therefore

θXY =

(1

2

)3

=1

8

201

Half sibs

U V W↘ ↙ ↘ ↙

X Y↓ ↓

The common ancestor of half sibs X and Y is V . The path

linking X,Y to their common ancestor is XV Y and this has n = 3

individuals. Therefore

θXY =

(1

2

)3

=1

8

202

Full sibs

U V↓ ↘ ↙ ↓

↓ ↙ ↘ ↓X Y

The common ancestors of full sibs X and Y are U and V . The

paths linking X,Y to their common ancestors are XUY and XV Y

and these each have n = 3 individuals. Therefore

θXY =

(1

2

)3

+

(1

2

)3

=1

4

203

First cousins

U V↓ ↘ ↙ ↓

↓ ↙ ↘ ↓G C D H

↘ ↙ ↘ ↙X Y

The common ancestors of cousins X and Y are U and V . The

paths linking X,Y to their common ancestors are XCUDY and

XCV DY and these each have n = 5 individuals. Therefore

θXY =

(1

2

)5

+

(1

2

)5

=1

16

204

Genotype frequencies

Consider an individual with inbreeding coefficient F . The proba-

bility it has two ibd alleles is F , and the probability of two non-ibd

alleles is 1 − F . The probability that any allele is type A is pA,

the population allele frequency. So, the probability the individual

is homozygous is

PAA = Pr(AA|ibd)Pr(ibd) + Pr(AA|not ibd)Pr(not ibd)

= pA × F + p2A × (1 − F )

There is an evolutionary perspective here: F reflects the history

that leads to alleles being ibd. Identity by descent has meaning

only with reference to previous generations. The allele frequency

pA is the expected value over evolutionary histories.

205

Genotype frequencies

To emphasize the difference from Hardy-Weinberg:

PAA = p2A + FpA(1 − pA)

Because heterozygous individuals must have non-ibd alleles:

PAa = 2(1 − F )pApa

= 2pApa − 2FpApa

206

First Cousin Example

For individuals with parents who were first cousins, F = 1/16 =

0.0625. For a locus with allele frequencies {pi} that were all

0.10:

Pii = (0.1)2 + 0.0625(0.1)(0.10) = 0.015625

Pij = 2(0.1)(0.1) − 2(0.0625)(0.1)(0.1) = 0.018750

207

Individual Inbreeding Coefficients

Earlier slides considered the estimation of the within-population

inbreeding coefficient f that made use of sample allele frequen-

cies from the particular population under consideration. This f

is a description of the population.

Individual-specific (total) inbreeding coefficients F can be esti-

mated under the assumption that all loci have the same coeffi-

cient, now interpreted as the probability of identity by descent

(ibd). Many loci are needed.

At locus l, write pl for the frequency of Al and code the genotypes

AlAl, Alal, alal as Xl = 2,1,0. These new variables have the

properties E(Xl) = 2pl,Var(Xl) = 2pl(1 − pl)(1 + F ).

208

Individual Inbreeding Coefficients

Then one moment estimator is formed by summing over loci

l, l = 1,2, . . . L:

F =1

L

L∑

l=1

(Xl − 2pl)2

2pl(1 − pl)− 1

but an estimate with smaller variance is

F =

∑Ll=1(Xl − 2pl)

2

∑Ll=1 2pl(1 − pl)

− 1

In practice, we need to use sample allele frequencies.

209

MLE for Individual Inbreeding Coefficients

To avoid having to choose among MoM estimates can set up

an MLE although there is more numerical work needed. An

iterative method makes use of Bayes’ theorem. If F represents

the probability the individual in question has two ibd alleles at a

locus, i.e. is inbred at that locus,

Pr(AlAl|inbred) = pl , Pr(AlAl|Not inbred) = p2lPr(Alal|inbred) = 0 , Pr(Alal|Not inbred) = 2pl(1 − pl)

Pr(alal|inbred) = 1 − pl , Pr(alal|Not inbred) = (1 − pl)2

From Bayes’ theorem then

Pr(inbred|AlAl) =Pr(AlAl|inbred)Pr(inbred)

Pr(AlAl)

= FF+pl(1−F)

Pr(inbred|Alal) = 0

Pr(inbred|alal) = FF+(1−pl)(1−F)

210

MLE for Individual Inbreeding Coefficients

This suggests an iterative scheme: assign an initial value to

F , and then average the updated values over loci. If Gl is the

genotype at locus l, the updated value F ′ is

F ′ =1

L

L∑

l=1

Pr(inbred|Gl)

This value is then substituted into the right hand side and the

process continues until convergence.

211

Descent measures for four alleles

Much of population and quantitative genetics theory rests on

the comparison of pairs of individuals - either on their genotypes

or on their trait values, or both. For single-locus analyses this

requires a discussion of four-allele descent measures.

One-locus descent measures δ for any four alleles a, b, c, d iden-

tify which subsets of the four are ibd. The measures give the

probabilities of the specified alleles being ibd, and other alleles

of the four being not-ibd.

212

Complete Set of IBD Measures

A complete description of the ibd status among four alleles

a, b, c, d carried by two individuals requires 15 measures (as op-

posed to the two, F , 1 − F , for one individual):

Alleles ibd∗ Probability Alleles ibd∗ Probability

a, b, c, d δabcd a, b δaba, b, c δabc a, c δaca, b, d δabd a, d δada, c, d δacd b, c δbcb, c, d δbcd b, d δbd

a, b and c, d δab.cd c, d δcda, c and b, d δac.bd none δ0a, d and b, c δad.bc∗Alleles not listed are not ibd to those listed

213

Nine-parameter IBD Set

In most applications there is no need to distinguish between maternal andpaternal alleles and the 15 ibd states can be collapsed into nine:

NotationAlleles ibd∗ Cockerham,1971 Jacquard,1970a, b, c, d δabcd ∆1

a, b, c or a, b, d δabc + δabd ∆3

a, c, d or b, c, d δacd + δbcd ∆5

a, b and c, d δab.cd ∆2

(a, c and b, d) or(a, d and b, c) δac.bd + δad.bc ∆7

a, b δab ∆4

c, d δcd ∆6

a, c or a, c orb, c or b, d δac + δad + δbc + δbd ∆8

none δ0 ∆9∗Alleles not listed are not ibd to those listed

214

Nine-parameter IBD Set

Solid lines join pairs of ibd alleles: top row is the pair of alleles

for X, bottom row the pair of alleles for Y .

X

Y

a

c

b

d�

��

�@@

@@

S1

a

c

c

d

S2

a

c

b

d�

��

�

S3

a

c

b

d

S4

a

c

b

d

@@

@@

S5

X

Y

a

c

b

d

S6

a

c

b

d

S7

a

c

b

d

S8

a

c

b

d

S9

215

Coancestry Coefficient

The coancestry coefficient θXY is the probability that a random

allele from X(ab) is ibd to a random allele from Y (cd):

θXY =1

4[Pr(a ≡ c) + Pr(a ≡ d) + Pr(b ≡ c) + Pr(b ≡ d)]

= δabcd +1

2(δabc + δabd + δacd + δbcd) +

1

2(δac.bd + δad.bc)

+1

4(δac + δad + δbc + δbd)

= ∆1 +1

2(∆3 + ∆5 + ∆7) +

1

4∆8

216

Siblings whose parents are first cousins

G(i,j) H(k,l)

? ?

HHHHHHHHHHHHHHHj

��C D E F@

@@

@@

@@@R

��

��

��

��

@@

@@

@@

@@R

��

��

��

��

A B

??

��9

XXXXXXXXXXXXXXXXXXXXXXXXXXXXXz

a d

b c

X(a,b) Y(c,d)

217


δabcd ( 164) δab.cd (0)

X : a—b X : a—b∆1 | | ∆2

( 164) Y : c—d (0) Y : c—d

δabc ( 164) δabd ( 1

64)

X : a—b X : a—b∆3 | or |( 264) Y : c d Y : c d

δab ( 164) δcd ( 1

64)

X : a—b X : a b∆4 ∆6

( 164) Y : c d ( 1

64) Y : c—d

218


δacd ( 164) δbcd ( 1

64)

X : a b X : a b∆5 | or |( 264) Y : c—d Y : c— d

δac.bd (1564) δad.bc (0)

X : a b X : a b∆7 | | or X

(1564) Y : c d Y : c d

219


δac (1464

) δad ( 164

)∆8 or

(3064

) X : a b X : a b| ↘

Y : c d Y : c d

δbc ( 164

) δbd (1464

)X : a b X : a b

or ↙ or |Y : c d Y : c d

δ0(1264

)X : a b

∆9

(1264

) Y : c d

220


θ =1

4(δabcd + δabc + δacd + δac.bd + δac)

+1

4(δabcd + δabd + δacd+ δad.bc + δad)

+1

4(δabcd + δabc + δbcd + δad.bc + δbc)

+1

4(δabcd + δabd + δbcd + δac.bd + δbd)

= ∆1 +1

2(∆3 + ∆5 + ∆7) +

1

4∆8

=9

32

221

Non-inbred Relatives

There is a final reduction when neither individual is inbred, as

then neither a, b nor c, d are ibd. There are only three states and

the three probabilities are often written as k2 = ∆7, k1 = ∆8 or

k0 = ∆9 to indicate the number of pairs of ibd alleles carried by

the two individuals. Values of these three probabilities for some

common relationships are:

Relationship k2 k1 k0 θ = 12k2 + 1

4k1Identical twins 1 0 0 1/2Full sibs 1/4 1/2 1/4 1/4Parent-child 0 1 0 1/4Double first cousins 1/16 3/8 9/16 1/8Half sibs∗ 0 1/2 1/2 1/8First cousins 0 1/4 34 1/16Unrelated 0 0 1 0∗ Also grandparent-grandchild and avuncular (e.g. uncle-niece).

222

Joint genotypic probabilities

Genotypes General Non − inbred

ii, ii ∆1pi + (∆2 + ∆3 + ∆5 + ∆7)p2i k2p2i + k1p3i + k0p4i+ (∆4 + ∆6 + ∆8)p3i + ∆9p4i

ii, jj ∆2pipj + ∆4pip2j k0p2i p2j

+ ∆6p2i pj + ∆9p2i p2j

ii, ij ∆3pipj + (2∆4 + ∆8)p2i pj k1p2i pj + 2k0p3i pj+ 2∆9p3i pj

ii, jk 2∆4pipjpk + 2∆9p2i pjpk 2k0p2i pjpk

ij, ij 2∆7pipj + ∆8pipj(pi + pj) 2k2pipj + k1pipj(pi + pj)+ 4∆9p2i p

2j + 4k0p2i p

2j

ij, ik ∆8pipjpk + 4∆9p2i pjpk k1pipjpk + 4k0p2i pjpk

ij, kl 4∆9pipjpkpl 4k0pipjpkpl

223

Example: Non-inbred full sibs

Genotypes Probability

ii, ii p2i (1 + pi)2/4

ii, jj p2i p2j /4

ii, ij pipj(pi + pj)/2

ii, jk p2i pjpk/2

ij, ij pipj(1 + pi + pj + 2pipj)/2

ij, ik pipjpk(1 + 2pi)/2

ij, kl pipjpkpl

224

Method of Moments for Relatedness Coefficients

PLINK uses MoM to estimate three ibd coefficients k0, k1, k2 for

non-inbred relatives.

Two individuals are scored as being in ibs states 0,1,2

ibs state Genotypes Probability

2 (AA,AA), (aa, aa), (Aa,Aa) (p2 + q2)2k0 + k1(p3 + pq+ q3) + k2

1 (AA,Aa), (Aa,AA), (aa, Aa), (Aa, aa) 4pq(p2 + q2)k0 + 2pqk1

0 (AA, aa), (aa,AA) 2p2q2k0

225

MoM Approach: k0

Count the number of loci in ibs state i; i = 0,1,2. These numbers

are N0, N1, N2. The previous table gives the probabilities of ibs

state i given ibd state j. From

Pr(ibs = 0) = Pr(ibs = 0|ibd = 0)Pr(ibd = 0)

sum over loci to get

N0 = Pr(ibd = 0)∑

loci

2p2q2

This gives a moment estimate

Pr(ibd = 0) =N0∑

loci 2p2q2

226

MoM Approach: k1

From

Pr(ibd = 1) = Pr(ibs = 1|ibd = 0)Pr(ibd = 0)

+ Pr(ibs = 1|ibd = 1)Pr(ibd = 1)


N1 = Pr(ibd = 0)∑

loci

4pq(p2 + q2) + Pr(ibd = 1)∑

loci

2pq

but we already have an estimate of Pr(ibd = 0). Therefore

Pr(ibd = 1) =N1 −∑

loci 4pq(p2 + q2)Pr(ibd = 0)

∑loci 2pq

227

MoM Approach: k2

From

Pr(ibd = 2) = Pr(ibs = 2|ibd = 0)Pr(ibd = 0)




N2 = Pr(ibd = 0)∑

loci

(p2 + q2)2 + Pr(ibd = 1)∑

loci

(p3 + pq+ q3)

+ Pr(ibd = 2)

but we already have estimates of Pr(ibd = 0) and Pr(ibd = 1.Therefore

Pr(ibd = 1) =N2 −

∑loci(p

2 + q2)2 Pr(ibd = 0) −∑

loci(p3 + pq + q3)Pr(ibd = 1)∑

loci 1

228

Example

229

MoM for Coancestry Coefficient

One moment estimator for the coancestry between individuals G

and H is formed by summing over loci l, l = 1,2, . . . L:

θGH =1

L

L∑

l=1

(Xl − 2pl)(Yl − 2p)

2pl(1 − pl)

where Xl, Yl are 2,1,0 if G,H are (AA,Aa, aa), respectively, at

locus l. An estimate with smaller variance is

θGH =

∑Ll=1(xl − 2pl)(yl − 2p)∑Ll=1 2pl(1 − pl)

In practice, sample allele frequencies will need to be used.

230

MLE for Relatedness Coefficients

For any SNP there are six distinct pairs of genotypes with prob-

abilities depending on allele frequencies for that SNP and on a

set of three k parameters that are assumed to be the same for

all SNPs.

If S is the observed pair of genotypes, we know the conditional

probabilities Pr(S|Di) where the Di represent the identity states

(the relationship). The probability of ibd state Di is ki.

For example, if S is the state (AA,AA); Pr(S|D0) = p4A, Pr(S|D1) =

p3A, and Pr(S|D2) = p2A and

Pr(S) =∑

i

Pr(S|Di)Pr(Di) = p4Ak0 + p3Ak1 + p2Ak2

231


An iterative algorithm for estimating the k’s from observed geno-

types Sl at SNP l is based on Bayes’ theorem for the probability

of descent state Di, i = 0,1,2:

Pr(Di|Sl) =Pr(Sl|Di)Pr (Di)

Pr(Sl)

The procedure begins with initial estimates of the ki = Pr(Di)’s.

The denominator is calculated from the law of total probability

by adding over the three descent states:

Pr(Sl) =∑

i

Pr(Sl|Di)Pr(Di) =∑

i

Pr(Sl|Di)ki

232


The updated estimates are obtained by averaging over m loci:

k′i =1

m

m∑

l=1

(Pr(Sl|Di)ki∑j Pr(Sl|Dj)kj

), i = 0,1,2

These updated values are then substituted into the right hand

side and the process continued until the likelihood L no longer

changes (or changes by less than some specified small amount)

where

L =m∏

l=1

∑

i

Pr(Sl|Di)ki

It will be better to monitor changes in the log-likelihood.

233

“RELPAIR” calculations

This approach compares the probabilities of two genotypes un-

der alternative hypotheses; H0: the individuals have a specified

relationship, versus H1: the individuals are unrelated. The alter-

native is that k0 = 1, k1 = k2 = 0 so the likelihood ratios for the

two hypotheses are:

L(AA,AA) = k0 + k1/pA + k2/p2A

L(BB,BB) = k0 + k1/pB + k2/p2B

L(AB,AB) = k0 + k1/(4pApB) + k2/(2pApB)

L(AA,AB) = k0 + k1/(2pA)

L(BB,AB) = k0 + k1/(2pB)

L(AA,BB) = k0

234

Testing relationship

Hypotheses about alternative pairs of relationships may be tested

with likelihood ratio test statistics. These ratios are the prob-

ability of the observed pair of genotypes under one hypothesis

divided by the probability under the alternative hypothesis. These

ratios are multiplied over (independent) loci.

Each hypothesis is described by a set of k’s and there are three

hypotheses likely to be of interest:

1. individuals are unrelated (k0 = 1)

2. individuals have a specified (or annotated) relationship (k’s

specified)

or 3. individuals are related to an extent measured by the esti-

mated k’s.

235

Testing relationship

The three likelihoods may be written as L0, LS, LE, respectively

and each of the three ratios of these may be of interest. It may

be satisfactory to regard the ratios that involve estimated k’s as

having chi-square distributions, but this is not justified for the

unrelated versus specified situation:

H0 H1 LR Null distribution

Unrelated Specified −2 ln(L0/LS) −−Unrelated Estimated −2 ln(L0/LE) χ2

(2)

Specified Estimated −2 ln(LS/LE) χ2(2)

236

ASSOCIATION MAPPING

237

Association Mapping

Association methods use random samples from a population and

are alternatives to methods based on pedigrees or crosses be-

tween inbred lines. The associations depend on linkage disequi-

librium between marker and trait loci instead of depending on

linkage between those loci as in pedigree or line cross methods.

A quantitative trait locus T contributes to a trait of interest.

The QTL genotype cannot be observed but maybe it can be

inferred, and the location of the QTL be estimated, from ob-

servations on the trait and the genotype at a genetic marker

M.

238

Marker-Trait Genotype Frequencies

Each marker genotypic class MiMj is composed of a mixture of

elements from each of the QTL classes, TrTs, where the propor-

tion of QTL class TrTs contained within marker class MiMj is

Pr(TrTs|MiMj). With random mating, genotype frequencies are

products of gamete frequencies. For example

Pr(TrTr,MiMi) = Pr(TrMi)2

Pr(TrTr|MiMi) = Pr(TrMi)2/Pr(Mi)

2

and gamete frequencies involve allele frequencies and linkage dis-

equilibria:

Pr(TrMi) = prpi +Dri

239

Two-allele Genotypes

TT Tt tt

MM P2MT 2PMTPMt P2

Mt

Mm 2PMTPmT 2PMTPmt + 2PMtPmT 2PMtPmt

mm P2mT 2PmTPmt P2

mt

240

Two-allele Gametes

T t

M PMT = pMpT +DMT PMt = pMpt −DMT

m PmT = pmpT −DMT Pmt = pmpt +DMT

ρMT =DMT√

pMpmpTpt

ρ2MT =D2MT

pMpmpTpt

241

Marker and Trait Variables

Introduce variables X and G for loci M and T. The values of X

will be assigned for the marker whereas the values G represent

the genetic contributions to measured trait variables or to disease

status. In either case, the Hardy-Weinberg assumption provides

the following expressions for the means and variances:

E(X) = µX = p2MXMM + 2pMpmXMm + p2mXmm

E(G) = µG = p2TGTT + 2pTptGT t + p2tGtt

Var(X) = σ2AM

+ σ2DM

Var(G) = σ2AT

+ σ2DT

242

Components of Variance

The “additive” and “dominance” components of variance are

σ2AM

= 2pMpm[pM(XMM −XMm) + pm(XMm −Xmm)]2

σ2AT

= 2pTpt[pT (GTT −GT t) + pt(GT t −Gtt)]2

σ2DM

= p2Mp2m(XMM − 2XMm +Xmm)2

σ2DT

= p2Tp2t (GTT − 2GT t +Gtt)

2

and these lead to the following expression for the covariance of

X and G:

Cov(G,X) = ρMTσATσAM + ρ2MTσDTσDM

243

Correlation of Trait and Marker Variables

Cov(G,X) = ρMTσATσAM + ρ2MTσDTσDM

If either X or G are purely additive, then their covariance is

Cov(G,X) = ρMTσATσAM

If both X and G are purely additive, then their correlation is

ρGX = ρMT

If either X or G are purely non-additive, then their covariance is

Cov(G,X) = ρ2MTσDTσDM

If both X and G are purely non-additive, then their correlation is

ρG,X = ρ2MT

244

Measured Traits

Suppose Y = G + E where G is the genetic effect of locus T

and E are all other effects. These other effects are supposed to

have mean zero and to be independent of both G and the marker

variable X. Then

E(Y ) = E(G)

Cov(X,Y ) = Cov(X,G)

Var(Y ) = σ2AT

+ σ2DT

+ VE

Trait values Y may be regressed on marker variables X. The

regression coefficient is

βY X =Cov(X,Y )

Var(X)=ρMTσATσAM + ρ2MTσDTσDM

σ2AM

+ σ2DM

245

Marker Variable

Variable X may be chosen to be additive, e.g XMM = 2, XMm =

1, Xmm = 0 so that σ2AM

= 2pMpm, σ2DM

= 0, and then

βY X = ρMTσATσAM

The marker variable can also be made to have a zero additive

variance, e.g. XMM = pm, XMm = 0,Xmm = pM so that σ2AM

=

0, σ2DM

= p2Mp2m, and

βY X = ρ2MT

σDTσDM

A significant regression coefficient implies a significant linkage

disequilibrium measure ρMT between marker and disease loci.

The signal is expected to be stronger with an additive marker as

ρMT ≥ ρ2MT and it is usual that σ2AT

≥ σ2DT

.

246

Analysis of Variance

Instead of regressing trait on marker, the trait means could be

compared among marker classes. The expected trait means fol-

low as

E(Y |MiMj) =∑

r,sGrsPr(TrTs|MiMj)

=∑

r,sGrsPr(TrMi, TsMj)/Pr(MiMj)

in general.

For a trait locus with only two alleles, T, t, for marker homozygote

MM :

E(Y |MM) = (GTTP2MT + 2GT tPMTPMt +GttP

2Mt)/p

2M

247

Trait Means in Marker Classes

This last expression can be manipulated to show the effects of

linkage disequilibrium.

Trait means among the three marker genotype classes:

E(Y |MM) = µG + 2ρMTA/pM + ρ2MTD/p2ME(Y |Mm) = µG + ρMTA(1/pM − 1/pm) − ρ2MTD/(pMpm)

E(Y |mm) = µG − 2ρMTA/pm + ρ2MTD/p2mwhere A = σAT

√(pMpm),D = σDT (pMpm), so that an analysis of

variance will also test that ρMT = 0 and the test will be affected

by both additive and dominance effects at the trait locus.

248

Dichotomous Traits: Case Only

The case-control approach starts with independent samples of

people who are either affected or not affected with a disease and

compares marker frequencies between the two groups. The MM

marker frequency among cases is

Pr(MM |Case) = p2M +1

µG

[pMρMTA+ ρ2MTσDTD

]

Pr(Mm|Case) = 2pMpm +1

µG

[(pm − pM)ρMTA− 2ρ2MTD

]

Pr(mm|Case) = p2m +1

µG

[−pmρMTA + ρ2MTD

]

Note that these three probabilities sum to one.

249

Case Allele Frequencies

Combining the genotypic frequencies to give allele frequencies:

Pr(M |Case) = pM +ρMTσAT

2µG

√2pMpm

Pr(m|Case) = pm −ρMTσAT

2µG

√2pMpm

and these two sum to one.

The inbreeding coefficient at the marker locus in the case pop-

ulation follows from

Pr(MM |Case) = Pr(M |Case)2 + fCase Pr(M |Case)[1 − Pr(M |Case)]

or

fCase =ρ2MT (2µGσDT − σ2

AT)

(µG

√2pM/pm + ρMTσAT )(µG

√2pm/pM − ρMTσAT )

250

Case-only HWE Testing

The power of this test depends on nf2Case which is proportional

to ρ4MT so the power will decrease quickly as ρMT deceases.

It is common for investigators to assume a multiplicative disease

model (i.e. additive on a log scale):

GTT = αβ2, GT t = αβ,Gtt = α

so that

µG = α(pTβ + pt)2

σ2AT

= 2pTptα2(1 − β)2(pTβ + pt)

2

σ2DT

= p2Tp2t α

2(1 − β)4

This leads to Hardy-Weinberg equilibrium at marker loci among

cases since

2µGσDT = σ2AT

251

Dichotomous Traits: Case-Control

An argument similar to that above provides the marker genotype

frequencies among controls:

Pr(MM |Control) = p2M − 1

1 − µG

[pMρMTA + ρ2MTD

]

Pr(Mm|Control) = 2pMpm − 1

1 − µG

[(pm − pM)ρMTA− 2ρ2MTD

]

Pr(mm|Control) = p2m − 1

1 − µG

[−pmρMTA + ρ2MTD

]

Pr(M |Control) = pM − ρMTA2(1 − µG)

Pr(m|Control) = pm +ρMTA

2(1 − µG)

252

Case-control Test

The simplest case-control test compares marker allele frequen-

cies between the two samples and it is clearly equivalent to test-

ing that ρMT = 0 since

Pr(M |Case)− Pr(M |Control) ∝ ρMTσAT

√2pMpm

The test is not affected by non-additivity at the disease locus.

If the allelic counts for M,m in cases and controls are laid out in

a 2× 2 table, the contingency-table chi-square test statistic has

1 df. An alternative is to work with the 3 × 2 table of marker

genotype counts in cases and controls and calculate a 2 df chi-

square test statistic. This test is affected by both additivity and

non-additivity at the disease locus but it is sensitive to errors in

genotype calls for rare alleles.

253

Allelic Case Control Test

Write the marker genotype counts in random samples of cases

and controls as

Genotype MM Mm mm Total

Case counts r0 r1 r2 RControl counts s0 s1 s2 STotal counts n0 n1 n2 N

The allelic test statistic uses the allele counts

Observed M m Total

Case counts 2r0 + r1 2r2 + r1 2RControl counts 2s0 + s1 2s2 + s1 2STotal counts 2n0 + n1 2n2 + n1 2N

254


A contingency table test for independence of marker allele and

disease status compares the observed allelic counts with the

products of the marginal totals divided by the overall total:

Expected M m Total

Case counts 2R(2n0 + n1)/2N 2R(2n2 + n1)/2N 2RControl counts 2S(2n0 + n1)/2N 2S(2n2 + n1)/2N 2STotal counts 2n0 + n1 2n2 + n1 2N

The test statistic is

X2A =

∑ (Obs.−Exp.)2

Exp.

=2N [N(r1 + 2r2) −R(n1 + 2n2)]

2

SR[2N(n1 + 2n2) − (n1 + 2n2)2]

255


Approximating the expectation of X2A by the ratio of the expec-

tations of the numerator and denominator:

E(X2A) ≈ (1 + f)

showing an inflation factor of (1 + f) when there is inbreeding.

The expected value is 1 when f = 0 and the test statistic has a

chi-square distribution with 1 df.

256

AMD Example: Case-control test statistics onchromosome 1

0 50 100 150 200 250

02

46

8

AMD Case Control Test

Position (Mb)

−lo

g10(p

)

257

AMD Example: Case-control test statistics QQplot

0 2 4 6 8

02

46

8

AMD Case Control Test

Expected

Ob

se

rve

d

258

AMD Example: HWE test statistics on chromo-some 1

0 50 100 150 200 250

05

10

15

Control HWE Test: With SNP Filtering

Position

−lo

g1

0(p

)

259

Trend Test

i = 0 i = 1 i = 2

Marker Genotype MM Mm mm TotalMarker Variable X0 X1 X2Case counts Y = 1 r0 r1 r2 RControl counts Y = 0 s0 s1 s2 STotal counts n0 n1 n2 N

The Armitage trend test is based on a score statistic U:

U =2∑

i=0

Xi

(S

Nri −

R

Nsi

)

260

Trend Test for Additivity

Assuming normality for U, the test statistic

X2T =

U2

Var(U)=

N(N∑i riXi −R

∑i niXi)

2

SR[N∑i niX

2i − (

∑i niXi)

2]

is distributed as χ2(1)

under the hypothesis H0 : ρMT = 0.

Usual to consider a linear trend test, with X0 = 0, X1 = 1, X2 =

2, so that σ2DM

= 0 and

X2T =

N [N(r1 + 2r2) −R(n1 + 2n2)]2

SR[N(n1 + 4n2) − (n1 + 2n2)2]

This will provide a test for additive effects at the disease locus.

261

Case-control vs Trend Tests

The allelic case-control test statistic is

X2A =

2N [N(r1 + 2r2) −R(n1 + 2n2)]2

SR[N(2n1 + 4n2) − (n1 + 2n2)2]

and the linear trend test statistic is

X2T =

N [N(r1 + 2r2) −R(n1 + 2n2)]2

SR[N(n1 + 4n2) − (n1 + 2n2)2]

In both cases, σ2DM

= 0 and the test is for linkage disequilibrium

ρMT between trait and marker alleles, and is affected only by

additive trait effects.

Unlike the allelic case-control test, the trend statistic has an

expected value of 1 even when there are departures from Hardy-

Weinberg equilibrium.

262

Trend Test for Non-Additivity

Setting X0 = pm, X1 = 0,X2 = pM gives σ2AM

= 0 and a test for

non-additive effects. There is not an obvious simplification of

the equation for the test statistic.

263

population genetic data analysis · 2014-08-31 · elaborate mathematical theories constructed by...

Documents