qtl mapping in mice - biostatistics and medical informaticskbroman/talks/qtl_nz.pdf ·...

Post on 05-Jun-2020

4 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

QTL mapping in mice

Karl W Broman

Department of BiostatisticsJohns Hopkins University

Baltimore, Maryland, USA

www.biostat.jhsph.edu/˜kbroman

Outline

� Experiments, data, and goals

� Models

� ANOVA at marker loci

� Interval mapping

� LOD scores, LOD thresholds

� Mapping multiple QTLs

� Simulations

Backcross experiment

P1

A A

P2

B B

F1

A B

BCBC BC BC

Intercross experiment

P1

A A

P2

B B

F1

A B

F2F2 F2 F2

Trait distributions

30

40

50

60

70

A B F1 BC

Strain

Phe

noty

pe

Data and Goals

Phenotypes: ��� = trait value for mouse

�Genotypes: � �� = 1/0 if mouse

is BB/AB at marker

(for a backcross)Genetic map: Locations of markers

Goals:

� Identify the (or at least one) genomic regions(QTLs) that contribute to variation in the trait.

� Form confidence intervals for QTL locations.

� Estimate QTL effects.

Note: QTL = “quantitative trait locus”

Why?

Mice: Find gene

� Drug targets, biochemical basis

Agronomy: Selection for improvement

Flies: Genetic architecture

� Evolution

80

60

40

20

0

Chromosome

Loca

tion

(cM

)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 X

Genetic map

20 40 60 80 100 120

20

40

60

80

100

120

Markers

Indi

vidu

als

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 X

Genotype data

Statistical structure

QTL

Markers Phenotype

Covariates

The missing data problem:

Markers QTL

The model selection problem:

QTL, covariates � phenotype

Models: Recombination

We assume no crossover interference.

� � Points of exchange (crossovers) are accordingto a Poisson process.

� � The

��� �

(marker genotypes) form a Markovchain

Example

A B − B A A

B A A − B A

A A A A A B

?

Models: Genotype Phenotype

Let � = phenotype� = whole genome genotype

Imagine a small number of QTLs with genotypes ������ � � � ��� .(

� �

distinct genotypes)

E

� � � � � � ��� "!# # # ! �%$ var� � � � � � & '� !# # # ! �$

Models: Genotype Phenotype

Homoscedasticity (constant variance): & '� ( & '

Normally distributed residual variation: � � �) * � � � � & ' �

.

Additivity: � � !# # # ! �$ � � + ���, � -� �� ( �� � .

or

/

)

Epistasis: Any deviations from additivity.

The simplest method: ANOVA

� Split mice into groupsaccording to genotypeat a marker.

� Do a t-test / ANOVA.

� Repeat for each marker.

40

50

60

70

80

Phe

noty

pe

BB AB

Genotype at D1M30

BB AB

Genotype at D2M99

ANOVA at marker loci

Advantages

� Simple.

� Easily incorporatecovariates.

� Easily extended to morecomplex models.

� Doesn’t require a geneticmap.

Disadvantages

� Must exclude individualswith missing genotype data.

� Imperfect information aboutQTL location.

� Suffers in low density scans.� Only considers one QTL at a

time.

Interval mapping (IM)

Lander & Botstein (1989)

� Take account of missing genotype data

� Interpolate between markers

� Maximum likelihood under a mixture model

0 20 40 60

0

2

4

6

8

Map position (cM)

LOD

sco

re

Interval mapping (IM)

Lander & Botstein (1989)

� Assume a single QTL model.

� Each position in the genome, one at a time, is posited as theputative QTL.

� Let 0 � 1/0 if the (unobserved) QTL genotype is BB/AB.Assume �) * � ��1 � & �

� Given genotypes at linked markers, �) mixture of normal dist’nswith mixing proportion

243 � 0 � . �marker data

:

QTL genotype576 578 BB ABBB BB

9: ; <>= ? 9: ; <>@ ?A 9: ; < ? <= <>@ A 9 : ; < ?

BB AB9: ; <= ? <>@ A < <= 9: ; <>@ ?A <

AB BB <= 9: ; <>@ ?A < 9: ; <= ? <>@ A <

AB AB <= <@ A 9: ; < ? 9 : ; <= ? 9: ; <@ ?A 9: ; < ?

The normal mixtures

576 578B

7 cM 13 cM

C Two markers separated by 20 cM,with the QTL closer to the leftmarker.

C The figure at right show the dis-tributions of the phenotype condi-tional on the genotypes at the twomarkers.

C The dashed curves correspond tothe components of the mixtures.

20 40 60 80 100

Phenotype

BB/BB

BB/AB

AB/BB

AB/AB

µ0

µ0

µ0

µ0

µ1

µ1

µ1

µ1

Interval mapping (continued)

Let D � � 243 � 0�� � . �

marker data

�� � 0�� ) * � �1E � & ' �

243 � ��� �

marker data� ��F � � ��� & � � D � G � ��� H � �� & � + � . � D � � G � �� H �F � & �

where

G � � H �� & � � density of normal distribution

Log likelihood:

I � ��F � � ��� & � � � JLKM 243 � ��� �

marker data� ��F � � ��� & �

Maximum likelihood estimates (MLEs) of � F , � � , &:EM algorithm.

LOD scores

The LOD score is a measure of the strength of evidence for thepresence of a QTL at a particular location.

LOD

� 0 � � JLKM � F likelihood ratio comparing the hypothesis of aQTL at position 0 versus that of no QTL

� JKM . /243 � � �QTL at 0� N ��F 1 � N � � 1 � N &1 �

243 � � �no QTL� N �� N & �

N �F 1 � N � � 1 � N &1 are the MLEs, assuming a single QTL at position 0.

No QTL model: The phenotypes are independent and identicallydistributed (iid)

* � �� & ' �

.

An example LOD curve

0 20 40 60 80 100

0

2

4

6

8

10

12

Chromosome position (cM)

LOD

0 500 1000 1500 2000 2500

0.0

0.5

1.0

1.5

2.0

2.5

3.0

Map position (cM)

lod

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19LOD curves

Interval mapping

Advantages

� Takes proper account ofmissing data.

� Allows examination ofpositions between markers.

� Gives improved estimatesof QTL effects.

� Provides pretty graphs.

Disadvantages

� Increased computationtime.

� Requires specializedsoftware.

� Difficult to generalize.� Only considers one QTL at

a time.

LOD thresholds

Large LOD scores indicate evidence for the presence of a QTL.

Q: How large is large?

We consider the distribution of the LOD score under the nullhypothesis of no QTL.

Key point: We must make some adjustment for our examination ofmultiple putative QTL locations.

We seek the distribution of the maximum LOD score, genome-wide. The 95th %ile of this distribution serves as a genome-wideLOD threshold.

Estimating the threshold: simulations, analytical calculations, per-mutation (randomization) tests.

Null distribution of the LOD score

C Null distribution derived bycomputer simulation of backcrosswith genome of typical size.

C Solid curve: distribution of LODscore at any one point.

C Dashed curve: distribution ofmaximum LOD score,genome-wide.

0 1 2 3 4

LOD score

Permutation tests

mice

markers

genotypedata

phenotypes

O LOD

9QP ?(a set of curves)

O 5SRT UVXW LOD

9 P ?

C Permute/shuffle the phenotypes; keep the genotype data intact.

C Calculate LOD

Y 9QP ? ; Z 5 YR T UVXW LODY 9 P ?

C We wish to compare the observed5

to the distribution of

5 Y

.

C [�\ 9 5 Y] 5 ?

is a genome-wide P-value.

C The 95th %ile of

5 Y

is a genome-wide LOD threshold.

C We can’t look at all ^ _ possible permutations, but a random set of 1000 is feasi-ble and provides reasonable estimates of P-values and thresholds.

C Value: conditions on observed phenotypes, marker density, and pattern of miss-ing data; doesn’t rely on normality assumptions or asymptotics.

Permutation distribution

maximum LOD score

0 1 2 3 4 5 6 7

95th percentile

Multiple QTL methods

Why consider multiple QTLs at once?

� Reduce residual variation.

� Separate linked QTLs.

� Investigate interactions between QTLs (epistasis).

Epistasis in a backcross

Additive QTLs

Interacting QTLs

0

20

40

60

80

100

Ave

. phe

noty

pe

A HQTL 1

A

H

QTL 2

0

20

40

60

80

100

Ave

. phe

noty

pe

A HQTL 1

A

H

QTL 2

Epistasis in an intercross

Additive QTLs

Interacting QTLs

0

20

40

60

80

100

Ave

. phe

noty

pe

A H BQTL 1

AH

BQTL 2

0

20

40

60

80

100

Ave

. phe

noty

pe

A H BQTL 1

A

H

BQTL 2

Abstractions / simplifications

� Complete marker data

� QTLs are at the marker loci

� QTLs act additively

The problem

n backcross mice; M markers

�� � = genotype (1/0) of mouse � at marker �

�� = phenotype (trait value) of mouse �

�� � � +5

� R :-� ��� +�` � Which

-� a � /?

b Model selection in regression

How is this problem different?

� Relationship among the x’s

� Find a good model vs. minimize prediction error

Model selection

� Select class of models– Additive models

– Add’ve plus pairwise interactions

– Regression trees

� Compare models

– BIC c 9Qd ? R e�f g RSS

9Qd ?ih j d j 9k e�f g ^^ ?

– Sequential permutation tests

– Estimate of prediction error

� Search model space– Forward selection (FS)

– Backward elimination (BE)

– FS followed by BE

– MCMC� Assess performance

– Maximize no. QTLs found;control false positive rate

Why BIC l?

� For a fixed no. markers, letting m n, BIC o is consistent.

� There exists a prior (on models + coefficients) for whichBIC o is the –log posterior.

� BIC o is essentially equivalent to use of a threshold on theconditional LOD score

� It performs well.

Choice of l

Smaller

l

: include more loci; higher false positive rate

Larger

l

: include fewer loci; lower false positive rate

Let L = 95% genome-wide LOD threshold(compare single-QTL models to the null model)

Choose

l

= 2 L / JLKM :p mWith this choice of

l, in the absence of QTLs, we’ll

include at least one extraneous locus, 5% of thetime.

Simulations

� Backcross with n=250

� No crossover interference

� 9 chr, each 100 cM

� Markers at 10 cM spacing;complete genotype data

� 7 QTLs

– One pair in coupling– One pair in repulsion– Three unlinked QTLs

� Heritability = 50%

� 2000 simulation replicates

1

2

3

4

5

6

7

8

9

Methods

� ANOVA at marker loci

� Composite interval mapping (CIM)

� Forward selection with permutation tests

� Forward selection with BICk

� Backward elimination with BICk

� FS followed by BE with BICk

� MCMC with BICk

� A selected marker is deemed correct if it is within10 cM of a QTL (i.e., correct or adjacent)

0 1 2 3 4 5 6 7

Correct

Ave no. chosen

ANOVA

35

79

11

CIM

fs, perm

fs

be

fs/be

mcmc

BIC

0.0

0.5

1.0

1.5

2.0

QTLs linked in couplingA

ve n

o. c

hose

n

AN

OV

A 3 5 7 9 11

CIM fs, p

erm fs be

fs/b

e

mcm

c

BIC

0.0

0.5

1.0

1.5

2.0

QTLs linked in repulsionA

ve n

o. c

hose

n

AN

OV

A 3 5 7 9 11

CIM fs, p

erm fs be

fs/b

e

mcm

c

BIC

0.0

0.5

1.0

1.5

2.0

2.5

3.0

Other QTLsA

ve n

o. c

hose

n

AN

OV

A 3 5 7 9 11

CIM fs, p

erm fs be

fs/b

e

mcm

c

BIC

0.0

0.1

0.2

0.3

0.4

Extraneous linkedA

ve n

o. c

hose

n

AN

OV

A 3 5 7 9 11

CIM fs, p

erm fs be

fs/b

e

mcm

c

BIC

0.00

0.01

0.02

0.03

0.04

0.05

0.06

Extraneous unlinkedA

ve n

o. c

hose

n

AN

OV

A 3 5 7 9 11

CIM fs, p

erm fs be

fs/b

e

mcm

c

BIC

Summary

� QTL mapping is a model selection problem.

� Key issue: the comparison of models.

� Large-scale simulations are important.

� More refined procedures do not necessarily giveimproved results.

� BICk with forward selection followed by backwardelimination works quite well (in the case of additiveQTLs).

Acknowledgements

Terry Speed, University of California, Berkeley, and WEHI

Gary Churchill, The Jackson Laboratory

Saunak Sen, University of California, San Francisco

References

C Broman KW (2001) Review of statistical methods for QTL mapping in experi-mental crosses. Lab Animal 30(7):44–52

Review for non-statisticians

C Broman KW, Speed TP (1999) A review of methods for identifying QTLs inexperimental crosses. In: Seillier-Moiseiwitch F (ed) Statistics in MolecularBiology. IMS Lecture Notes—Monograph Series. Vol 33, pp. 114–142

Older, more statistical review.

C Lander ES, Botstein D (1989) Mapping Mendelian factors underlying quantita-tive traits using RFLP linkage maps. Genetics 121:185–199

The seminal paper.

C Churchill GA, Doerge RW (1994) Empirical threshold values for quantitative traitmapping. Genetics 138:963–971

LOD thresholds by permutation tests.

C Miller AJ (2002) Subset selection in regression, 2nd edition. Chapman & Hall,New York.

A reasonably good book on model selection in regression.

C Strickberger MW (1985) Genetics, 3rd edition. Macmillan, New York, chapter11.

An old but excellent general genetics textbook with a very interesting discussionof epistasis.

C Broman KW, Speed TP (2002) A model selection approach for the identificationof quantitative trait loci in experimental crosses (with discussion). J Roy StatSoc B 64:641–656, 737–775

Contains the simulation study described above.

top related