qtl mapping in mice - biostatistics and medical informaticskbroman/talks/qtl_nz.pdf ·...

QTL mapping in mice

Karl W Broman

Department of Biostatistics, Johns Hopkins University

Baltimore, Maryland, USA

www.biostat.jhsph.edu/~kbroman

Outline

• Experiments, data, and goals
• Models
• ANOVA at marker loci
• Interval mapping
• LOD scores, LOD thresholds
• Mapping multiple QTLs
• Simulations

Backcross experiment

[Figure: backcross (BC) individuals]

Intercross experiment

[Figure: intercross (F2) individuals]

Trait distributions

[Figure: trait distributions by strain (A, B, F1, BC)]

Data and Goals

Phenotypes: $y_i$ = trait value for mouse $i$

Genotypes: $x_{ij}$ = 1/0 if mouse $i$ is BB/AB at marker $j$ (for a backcross)

Genetic map: Locations of markers

Goals:

• Identify the genomic regions (QTLs), or at least one of them, that contribute to variation in the trait.
• Form confidence intervals for QTL locations.
• Estimate QTL effects.

Note: QTL = “quantitative trait locus”

Mice: Find gene → drug targets, biochemical basis

Agronomy: Selection for improvement

Flies: Genetic architecture → evolution

Genetic map

[Figure: marker locations (cM, axis 20 to 120) by chromosome (1 to 19, X)]

Genotype data

[Figure: genotypes by marker, with chromosomes 1 to 19 and X along the marker axis]

Statistical structure

[Diagram: markers, QTL, covariates, phenotype]

The missing data problem:

Markers ↔ QTL

The model selection problem:

QTL, covariates → phenotype

Models: Recombination

We assume no crossover interference.

⟹ Points of exchange (crossovers) occur according to a Poisson process.

⟹ The $\{x_{ij}\}$ (marker genotypes) form a Markov chain.
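A minimal sketch of what the Markov chain property means for a backcross. It assumes the Haldane map function (the one implied by no interference) to convert map distance to a recombination fraction. The function names and marker positions are illustrative, not from the talk.

```python
import math
import random

def haldane_r(d_cM):
    """Recombination fraction for a map distance in cM (Haldane map function)."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_cM / 100.0))

def simulate_backcross_chromosome(marker_pos_cM, rng):
    """Simulate one backcross mouse along a chromosome (1 = BB, 0 = AB)."""
    # First marker: BB or AB, each with probability 1/2.
    genotypes = [rng.randint(0, 1)]
    for left, right in zip(marker_pos_cM, marker_pos_cM[1:]):
        r = haldane_r(right - left)
        prev = genotypes[-1]
        # Markov step: the genotype switches with probability r,
        # depending only on the genotype at the previous marker.
        genotypes.append(1 - prev if rng.random() < r else prev)
    return genotypes

rng = random.Random(1)
markers = [0, 10, 20, 30, 40, 50]   # hypothetical marker positions (cM)
mouse = simulate_backcross_chromosome(markers, rng)
```

Each genotype depends on the rest of the chromosome only through its neighbor, which is what makes the calculations in interval mapping tractable.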

Example

A B − B A A

B A A − B A

A A A A A B

Models: Genotype → Phenotype

Let $y$ = phenotype and $g$ = whole-genome genotype.

Imagine a small number of QTLs with genotypes $g_1, \ldots, g_p$ ($2^p$ distinct genotypes).

$E(y \mid g) = \mu_{g_1, \ldots, g_p}$, $\mathrm{var}(y \mid g) = \sigma^2_{g_1, \ldots, g_p}$

Models: Genotype → Phenotype (continued)

Homoscedasticity (constant variance): $\sigma^2_g \equiv \sigma^2$

Normally distributed residual variation: $y \mid g \sim N(\mu_g, \sigma^2)$

Additivity: $\mu_{g_1, \ldots, g_p} = \mu + \sum_{j=1}^{p} \beta_j g_j$ ($g_j$ = 1 or 0)

Epistasis: Any deviations from additivity.

The simplest method: ANOVA

• Split mice into groups according to genotype at a marker.
• Do a t-test / ANOVA.
• Repeat for each marker.

[Figure: phenotype by genotype at marker D1M30 and at marker D2M99]
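A minimal sketch of the single-marker test for a backcross, assuming the pooled-variance two-sample t statistic (equivalent to one-way ANOVA with two groups). Genotypes are coded 1 = BB, 0 = AB, and `None` marks a missing genotype; the function name and the toy data are illustrative.

```python
import math

def marker_t_statistic(phenotypes, genotypes):
    """Pooled-variance t statistic comparing BB (1) and AB (0) mice at one marker.

    Mice with a missing genotype (None) are dropped from the comparison.
    """
    groups = {0: [], 1: []}
    for y, g in zip(phenotypes, genotypes):
        if g is not None:
            groups[g].append(y)
    n0, n1 = len(groups[0]), len(groups[1])
    mean0 = sum(groups[0]) / n0
    mean1 = sum(groups[1]) / n1
    ss0 = sum((y - mean0) ** 2 for y in groups[0])
    ss1 = sum((y - mean1) ** 2 for y in groups[1])
    pooled_var = (ss0 + ss1) / (n0 + n1 - 2)
    se = math.sqrt(pooled_var * (1.0 / n0 + 1.0 / n1))
    return (mean1 - mean0) / se

y = [10.0, 12.0, 11.0, 15.0, 16.0, 14.0, 13.0]   # toy phenotypes
g = [0, 0, 0, 1, 1, 1, None]                     # last mouse is untyped
t = marker_t_statistic(y, g)
```

Repeating this for every marker gives one statistic per marker; the last mouse contributes nothing here because its genotype is missing.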

ANOVA at marker loci

Advantages

• Simple.
• Easily incorporates covariates.
• Easily extended to more complex models.
• Doesn't require a genetic map.

Disadvantages

• Must exclude individuals with missing genotype data.
• Imperfect information about QTL location.
• Suffers in low density scans.
• Only considers one QTL at a time.

Interval mapping (IM)

Lander & Botstein (1989)

• Take account of missing genotype data
• Interpolate between markers
• Maximum likelihood under a mixture model

[Figure: x-axis: map position (cM), 0 to 60]

Interval mapping (IM)

Lander & Botstein (1989)

• Assume a single QTL model.
• Each position in the genome, one at a time, is posited as the putative QTL.
• Let $z$ = 1/0 if the (unobserved) QTL genotype is BB/AB. Assume $y \mid z \sim N(\mu_z, \sigma^2)$.
• Given genotypes at linked markers, $y \sim$ mixture of normal distributions with mixing proportion $\Pr(z = 1 \mid \text{marker data})$.

M1   M2   | QTL genotype BB                QTL genotype AB
BB   BB   | (1 - r_L)(1 - r_R)/(1 - r)     r_L r_R/(1 - r)
BB   AB   | (1 - r_L) r_R/r                r_L (1 - r_R)/r
AB   BB   | r_L (1 - r_R)/r                (1 - r_L) r_R/r
AB   AB   | r_L r_R/(1 - r)                (1 - r_L)(1 - r_R)/(1 - r)
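A short sketch of how the conditional QTL genotype probabilities can be computed. It reads r_L, r_R and r as the recombination fractions between the left marker and the QTL, the QTL and the right marker, and the two markers. That reading and the use of the Haldane map function are assumptions consistent with the no-interference model; the function names are illustrative.

```python
import math

def haldane_r(d_cM):
    """Recombination fraction for a map distance in cM (Haldane map function)."""
    return 0.5 * (1.0 - math.exp(-2.0 * d_cM / 100.0))

def prob_qtl_bb(left, right, d_left_cM, d_right_cM):
    """Pr(QTL genotype = BB | flanking marker genotypes) in a backcross.

    left, right: marker genotypes (1 = BB, 0 = AB).
    d_left_cM, d_right_cM: distances from the QTL to the left and right markers.
    """
    r_l = haldane_r(d_left_cM)
    r_r = haldane_r(d_right_cM)
    r = haldane_r(d_left_cM + d_right_cM)
    # No recombination is needed next to a BB marker, one next to an AB marker.
    p_left = (1.0 - r_l) if left == 1 else r_l
    p_right = (1.0 - r_r) if right == 1 else r_r
    # Probability of the marker pair itself (the common factor 1/2 cancels).
    p_markers = (1.0 - r) if left == right else r
    return p_left * p_right / p_markers

# QTL 7 cM from the left marker and 13 cM from the right marker.
p_bb_bb = prob_qtl_bb(1, 1, 7.0, 13.0)   # both markers BB
p_bb_ab = prob_qtl_bb(1, 0, 7.0, 13.0)   # left BB, right AB
```

With the markers disagreeing, the QTL is more likely to match the nearer (left) marker, so `p_bb_ab` exceeds 1/2.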

The normal mixtures

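[Diagram: marker $M_1$, QTL $Q$, marker $M_2$; $M_1$ to $Q$ is 7 cM, $Q$ to $M_2$ is 13 cM]

• Two markers separated by 20 cM, with the QTL closer to the left marker.
• The figure at right shows the distributions of the phenotype conditional on the genotypes at the two markers.
• The dashed curves correspond to the components of the mixtures.

[Figure: phenotype distributions for each pair of flanking marker genotypes; x-axis: phenotype, 20 to 100]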

Interval mapping (continued)

Let $p_i = \Pr(z_i = 1 \mid \text{marker data})$

$y_i \mid z_i \sim N(\mu_{z_i}, \sigma^2)$

$\Pr(y_i \mid \text{marker data}, \mu_0, \mu_1, \sigma) = p_i \, f(y_i; \mu_1, \sigma) + (1 - p_i) \, f(y_i; \mu_0, \sigma)$

where $f(y; \mu, \sigma)$ = density of the normal distribution

Log likelihood:

$l(\mu_0, \mu_1, \sigma) = \sum_i \log \Pr(y_i \mid \text{marker data}, \mu_0, \mu_1, \sigma)$

Maximum likelihood estimates (MLEs) of $\mu_0$, $\mu_1$, $\sigma$: EM algorithm.
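A minimal sketch of the EM iteration at a single putative QTL position, assuming the mixing proportions p_i have already been computed from the marker data. The function names, starting values and stopping rule are illustrative choices, and both genotype groups are assumed to be represented.

```python
import math

def normal_density(y, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at y."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def em_single_position(y, p, max_iter=500, tol=1e-10):
    """EM for the two-component normal mixture with known mixing proportions.

    y: phenotypes; p: p_i = Pr(z_i = 1 | marker data).
    Returns (mu0, mu1, sigma, log likelihood in natural logs).
    """
    n = len(y)
    w = list(p)                      # initial weights: the prior probabilities
    loglik_old = -math.inf
    for _ in range(max_iter):
        # M-step: weighted group means and a common standard deviation.
        sw1 = sum(w)
        sw0 = n - sw1
        mu1 = sum(wi * yi for wi, yi in zip(w, y)) / sw1
        mu0 = sum((1.0 - wi) * yi for wi, yi in zip(w, y)) / sw0
        ss = sum(wi * (yi - mu1) ** 2 + (1.0 - wi) * (yi - mu0) ** 2
                 for wi, yi in zip(w, y))
        sigma = math.sqrt(ss / n)
        # E-step: posterior Pr(z_i = 1 | y_i, marker data) and log likelihood.
        loglik = 0.0
        w = []
        for yi, pi in zip(y, p):
            a = pi * normal_density(yi, mu1, sigma)
            b = (1.0 - pi) * normal_density(yi, mu0, sigma)
            loglik += math.log(a + b)
            w.append(a / (a + b))
        if loglik - loglik_old < tol:
            break
        loglik_old = loglik
    return mu0, mu1, sigma, loglik

# Toy data: when every p_i is 0 or 1 the fit reduces to the two group means.
y = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
mu0, mu1, sigma, loglik = em_single_position(y, [0, 0, 0, 1, 1, 1])
```

At a fully typed marker the weights never change, so the result is the same as splitting the mice by genotype; between markers the weights are fractional and are updated at each E-step.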

LOD scores

The LOD score is a measure of the strength of evidence for the presence of a QTL at a particular location.

$\mathrm{LOD}(z)$ = $\log_{10}$ likelihood ratio comparing the hypothesis of a QTL at position $z$ versus that of no QTL

$= \log_{10} \left\{ \dfrac{\Pr(y \mid \text{QTL at } z, \hat\mu_{0z}, \hat\mu_{1z}, \hat\sigma_z)}{\Pr(y \mid \text{no QTL}, \hat\mu, \hat\sigma)} \right\}$

$\hat\mu_{0z}$, $\hat\mu_{1z}$, $\hat\sigma_z$ are the MLEs, assuming a single QTL at position $z$.

No QTL model: The phenotypes are independent and identically distributed (iid) $N(\mu, \sigma^2)$.
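A small worked sketch of the LOD score in the simplest case: a putative QTL at a fully typed marker. Under the normal model with MLEs plugged in, the likelihood ratio reduces to a ratio of residual sums of squares, so LOD = (n/2) log10(RSS0/RSS1). The function name and the toy data are illustrative.

```python
import math

def lod_at_marker(phenotypes, genotypes):
    """LOD score for a QTL at a fully typed marker (genotypes coded 1/0)."""
    n = len(phenotypes)
    mean_all = sum(phenotypes) / n
    # No-QTL model: one common mean.
    rss0 = sum((y - mean_all) ** 2 for y in phenotypes)
    # Single-QTL model: a separate mean for each genotype group.
    rss1 = 0.0
    for g in (0, 1):
        group = [y for y, gi in zip(phenotypes, genotypes) if gi == g]
        group_mean = sum(group) / len(group)
        rss1 += sum((y - group_mean) ** 2 for y in group)
    return (n / 2.0) * math.log10(rss0 / rss1)

y = [10.0, 12.0, 11.0, 15.0, 16.0, 14.0]   # toy phenotypes
g = [0, 0, 0, 1, 1, 1]
lod = lod_at_marker(y, g)   # RSS0 = 28, RSS1 = 4, so LOD = 3 * log10(7)
```

Between markers the same ratio is formed from the mixture likelihood, with the MLEs obtained by EM.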

An example LOD curve

[Figure: LOD curve; x-axis: chromosome position (cM), 0 to 100]

LOD curves

[Figure: LOD curves for chromosomes 1 to 19; x-axis: map position (cM), 0 to 2500]

Interval mapping

Advantages

• Takes proper account of missing data.
• Allows examination of positions between markers.
• Gives improved estimates of QTL effects.
• Provides pretty graphs.

Disadvantages

• Increased computation time.
• Requires specialized software.
• Difficult to generalize.
• Only considers one QTL at a time.

LOD thresholds

Large LOD scores indicate evidence for the presence of a QTL.

Q: How large is large?

We consider the distribution of the LOD score under the null hypothesis of no QTL.

Key point: We must make some adjustment for our examination of multiple putative QTL locations.

We seek the distribution of the maximum LOD score, genome-wide. The 95th percentile of this distribution serves as a genome-wide LOD threshold.

Estimating the threshold: simulations, analytical calculations, permutation (randomization) tests.

Null distribution of the LOD score

• Null distribution derived by computer simulation of a backcross with a genome of typical size.
• Solid curve: distribution of the LOD score at any one point.
• Dashed curve: distribution of the maximum LOD score, genome-wide.

[Figure: null distributions of the LOD score; x-axis: LOD score, 0 to 4]

Permutation tests

[Diagram: genotype data (mice by markers) and phenotypes → $\mathrm{LOD}(z)$ (a set of curves) → $M = \max_z \mathrm{LOD}(z)$]

• Permute/shuffle the phenotypes; keep the genotype data intact.
• Calculate $\mathrm{LOD}^\star(z)$ → $M^\star = \max_z \mathrm{LOD}^\star(z)$.
• We wish to compare the observed $M$ to the distribution of $M^\star$.
• $\Pr(M^\star \ge M)$ is a genome-wide P-value.
• The 95th percentile of $M^\star$ is a genome-wide LOD threshold.
• We can't look at all $n!$ possible permutations, but a random set of 1000 is feasible and provides reasonable estimates of P-values and thresholds.
• Value: conditions on observed phenotypes, marker density, and pattern of missing data; doesn't rely on normality assumptions or asymptotics.
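A minimal sketch of the permutation procedure. For brevity it assumes LOD scores are computed only at fully typed markers, using LOD = (n/2) log10(RSS0/RSS1), rather than by interval mapping. The function names, the toy data and the seed are illustrative.

```python
import math
import random

def lod_at_marker(y, g):
    """LOD score at a fully typed marker (genotypes coded 1/0)."""
    n = len(y)
    mean_all = sum(y) / n
    rss0 = sum((v - mean_all) ** 2 for v in y)
    rss1 = 0.0
    for code in (0, 1):
        group = [v for v, gi in zip(y, g) if gi == code]
        m = sum(group) / len(group)
        rss1 += sum((v - m) ** 2 for v in group)
    return (n / 2.0) * math.log10(rss0 / rss1)

def max_lod(y, genotype_columns):
    """Maximum LOD over markers; genotype_columns[j][i] is mouse i at marker j."""
    return max(lod_at_marker(y, col) for col in genotype_columns)

def permutation_test(y, genotype_columns, n_perm=1000, seed=0):
    """Return (observed M, genome-wide P-value, 95th percentile of M*)."""
    rng = random.Random(seed)
    observed = max_lod(y, genotype_columns)
    shuffled = list(y)
    null_max = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)            # permute phenotypes, genotypes intact
        null_max.append(max_lod(shuffled, genotype_columns))
    null_max.sort()
    p_value = sum(m >= observed for m in null_max) / n_perm
    threshold = null_max[math.ceil(0.95 * n_perm) - 1]
    return observed, p_value, threshold

# Toy data: 20 mice, 3 markers, a large effect at the first marker.
geno = [[0] * 10 + [1] * 10, [0, 1] * 10, [0, 0, 1, 1] * 5]
pheno = [0.1 * i for i in range(10)] + [5.0 + 0.1 * i for i in range(10)]
observed, p_value, threshold = permutation_test(pheno, geno, n_perm=200)
```

Because whole phenotype values are shuffled against intact genotype rows, the correlation among markers and the marker density are preserved in every permutation.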

Permutation distribution

[Figure: permutation distribution of the maximum LOD score (x-axis 0 to 7), with the 95th percentile marked]

Multiple QTL methods

Why consider multiple QTLs at once?

• Reduce residual variation.
• Separate linked QTLs.
• Investigate interactions between QTLs (epistasis).

Epistasis in a backcross

[Figure: two panels, "Additive QTLs" and "Interacting QTLs"; x-axis: QTL 1 genotype (A, H)]

Epistasis in an intercross

[Figure: two panels, "Additive QTLs" and "Interacting QTLs"; axis labels: QTL 1 genotype (A, H, B) and QTL 2]

Abstractions / simplifications

• Complete marker data
• QTLs are at the marker loci
• QTLs act additively

The problem

$n$ backcross mice; $M$ markers

$x_{ij}$ = genotype (1/0) of mouse $i$ at marker $j$

$y_i$ = phenotype (trait value) of mouse $i$

$y_i = \mu + \sum_{j=1}^{M} \beta_j x_{ij} + \epsilon_i$

Which $\beta_j \neq 0$?

⟹ Model selection in regression

How is this problem different?

• Relationship among the $x$'s
• Find a good model vs. minimize prediction error

Model selection

• Select class of models
  – Additive models
  – Additive plus pairwise interactions
  – Regression trees

• Compare models
  – $\mathrm{BIC}_\delta(\gamma) = \log \mathrm{RSS}(\gamma) + |\gamma| \left( \delta \, \frac{\log n}{n} \right)$
  – Sequential permutation tests
  – Estimate of prediction error

• Search model space
  – Forward selection (FS)
  – Backward elimination (BE)
  – FS followed by BE
  – MCMC

• Assess performance
  – Maximize no. QTLs found; control false positive rate
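A rough sketch of forward selection with the BIC-type criterion log RSS + (number of loci) × δ log(n)/n, for additive models with complete marker data. As a simplification, and an assumption of this sketch rather than the procedure used in the talk, the search stops the first time the criterion fails to improve. Least squares is solved through the normal equations; all names and the toy data are illustrative.

```python
import math

def rss_for_model(y, columns):
    """Residual sum of squares for y regressed on an intercept plus columns."""
    n = len(y)
    X = [[1.0] + [float(col[i]) for col in columns] for i in range(n)]
    k = len(X[0])
    # Augmented normal equations [X'X | X'y].
    A = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(k)]
         + [sum(X[i][a] * y[i] for i in range(n))] for a in range(k)]
    for c in range(k):                    # Gauss-Jordan with partial pivoting
        piv = max(range(c, k), key=lambda r: abs(A[r][c]))
        if abs(A[piv][c]) < 1e-12:
            return None                   # singular: collinear markers
        A[c], A[piv] = A[piv], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [v - f * w for v, w in zip(A[r], A[c])]
    beta = [A[a][k] / A[a][a] for a in range(k)]
    return sum((y[i] - sum(b * x for b, x in zip(beta, X[i]))) ** 2
               for i in range(n))

def bic_delta(y, columns, delta):
    """log RSS + |model| * delta * log(n) / n (natural logs)."""
    rss = rss_for_model(y, columns)
    if rss is None:
        return math.inf
    n = len(y)
    return math.log(rss) + len(columns) * delta * math.log(n) / n

def forward_selection(y, genotype_columns, delta):
    """Add the best marker while the criterion improves; return marker indices."""
    chosen = []
    best = bic_delta(y, [], delta)
    while len(chosen) < len(genotype_columns):
        scored = [(bic_delta(y, [genotype_columns[k] for k in chosen + [j]], delta), j)
                  for j in range(len(genotype_columns)) if j not in chosen]
        score, j = min(scored)
        if score >= best:
            break
        chosen.append(j)
        best = score
    return chosen

# Toy data: 16 mice, 4 markers; markers 0 and 2 affect the trait.
geno = [[0, 1] * 8, [0, 0, 1, 1] * 4, [0, 0, 0, 0, 1, 1, 1, 1] * 2, [0] * 8 + [1] * 8]
noise = [0.1 * ((7 * i) % 5 - 2) for i in range(16)]
pheno = [10.0 + 3.0 * geno[0][i] + 2.0 * geno[2][i] + noise[i] for i in range(16)]
selected = forward_selection(pheno, geno, delta=5.0)
```

Backward elimination, FS followed by BE, and MCMC explore the same criterion with different search strategies.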

Why $\mathrm{BIC}_\delta$?

• For a fixed no. markers, letting $n \to \infty$, $\mathrm{BIC}_\delta$ is consistent.
• There exists a prior (on models + coefficients) for which $\mathrm{BIC}_\delta$ is the $-\log$ posterior.
• $\mathrm{BIC}_\delta$ is essentially equivalent to use of a threshold on the conditional LOD score.
• It performs well.

Choice of $\delta$

Smaller $\delta$: include more loci; higher false positive rate

Larger $\delta$: include fewer loci; lower false positive rate

Let $L$ = 95% genome-wide LOD threshold (compare single-QTL models to the null model).

Choose $\delta = 2L / \log_{10} n$.

With this choice of $\delta$, in the absence of QTLs, we'll include at least one extraneous locus 5% of the time.
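A short derivation of why this choice ties the penalty to the LOD threshold. It assumes complete marker data, natural logarithms in the criterion, and a model $\gamma'$ formed by adding one locus to $\gamma$, so that the conditional LOD score for the added locus is $(n/2) \log_{10}$ of the RSS ratio. The value $L = 3$ in the numerical line is hypothetical, chosen only for illustration.

```latex
\begin{align*}
\mathrm{BIC}_\delta(\gamma') < \mathrm{BIC}_\delta(\gamma)
  &\iff \log \frac{\mathrm{RSS}(\gamma)}{\mathrm{RSS}(\gamma')}
        > \delta \, \frac{\log n}{n} \\
  &\iff \underbrace{\frac{n}{2} \log_{10}
        \frac{\mathrm{RSS}(\gamma)}{\mathrm{RSS}(\gamma')}}_{\text{conditional LOD}}
        > \frac{\delta}{2} \log_{10} n .
\end{align*}
% Setting (\delta / 2) \log_{10} n = L gives \delta = 2L / \log_{10} n.
% Illustration with n = 250 and a hypothetical L = 3:
% \delta = 6 / \log_{10} 250 \approx 6 / 2.398 \approx 2.50.
```

Starting from the null model, the first locus enters only if its LOD score exceeds $L$, which happens about 5% of the time when there are no QTLs.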

Simulations

• Backcross with n = 250
• No crossover interference
• 9 chr, each 100 cM
• Markers at 10 cM spacing; complete genotype data
• 7 QTLs
  – One pair in coupling
  – One pair in repulsion
  – Three unlinked QTLs
• Heritability = 50%
• 2000 simulation replicates

Methods

• ANOVA at marker loci
• Composite interval mapping (CIM)
• Forward selection with permutation tests
• Forward selection with $\mathrm{BIC}_\delta$
• Backward elimination with $\mathrm{BIC}_\delta$
• FS followed by BE with $\mathrm{BIC}_\delta$
• MCMC with $\mathrm{BIC}_\delta$
• A selected marker is deemed correct if it is within 10 cM of a QTL (i.e., correct or adjacent)
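A small sketch of how the "within 10 cM" rule might be applied when scoring one simulation replicate. How several selected markers near the same QTL are counted is an assumption here: each selected marker is credited to its nearest QTL, and distinct QTLs are counted once. The function name and the positions are hypothetical.

```python
def score_selected_markers(selected, qtls, window_cM=10.0):
    """Score a set of selected markers against the true QTL positions.

    selected, qtls: lists of (chromosome, position_cM) pairs.
    Returns (number of distinct QTLs detected, number of extraneous markers).
    """
    detected = set()
    extraneous = 0
    for chrom, pos in selected:
        # QTLs on the same chromosome within the window, nearest first.
        near = sorted((abs(q_pos - pos), k)
                      for k, (q_chrom, q_pos) in enumerate(qtls)
                      if q_chrom == chrom and abs(q_pos - pos) <= window_cM)
        if near:
            detected.add(near[0][1])
        else:
            extraneous += 1
    return len(detected), extraneous

# Hypothetical positions, for illustration only.
true_qtls = [(1, 30.0), (4, 60.0), (8, 90.0)]
chosen = [(1, 30.0), (1, 40.0), (4, 50.0), (7, 90.0)]
n_detected, n_extraneous = score_selected_markers(chosen, true_qtls)
```

This sketch lumps all extraneous markers together; separating those linked to a QTL from unlinked ones would need one more check on the chromosome.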

[Figure: simulation results, average number of loci chosen by each method (axis labels: A; CIM with 3, 5, 7, 9, 11; fs, perm; fs; be). Panels: Correct (scale 0 to 7); QTLs linked in coupling; QTLs linked in repulsion; Other QTLs; Extraneous linked; Extraneous unlinked]

Summary

• QTL mapping is a model selection problem.
• Key issue: the comparison of models.
• Large-scale simulations are important.
• More refined procedures do not necessarily give improved results.
• $\mathrm{BIC}_\delta$ with forward selection followed by backward elimination works quite well (in the case of additive QTLs).

Acknowledgements

Terry Speed, University of California, Berkeley, and WEHI

Gary Churchill, The Jackson Laboratory

Saunak Sen, University of California, San Francisco

References

• Broman KW (2001) Review of statistical methods for QTL mapping in experimental crosses. Lab Animal 30(7):44–52
  Review for non-statisticians.

• Broman KW, Speed TP (1999) A review of methods for identifying QTLs in experimental crosses. In: Seillier-Moiseiwitsch F (ed) Statistics in Molecular Biology. IMS Lecture Notes–Monograph Series, Vol 33, pp. 114–142
  Older, more statistical review.

• Lander ES, Botstein D (1989) Mapping Mendelian factors underlying quantitative traits using RFLP linkage maps. Genetics 121:185–199
  The seminal paper.

• Churchill GA, Doerge RW (1994) Empirical threshold values for quantitative trait mapping. Genetics 138:963–971
  LOD thresholds by permutation tests.

• Miller AJ (2002) Subset selection in regression, 2nd edition. Chapman & Hall, New York
  A reasonably good book on model selection in regression.

• Strickberger MW (1985) Genetics, 3rd edition. Macmillan, New York, chapter 11
  An old but excellent general genetics textbook with a very interesting discussion of epistasis.

• Broman KW, Speed TP (2002) A model selection approach for the identification of quantitative trait loci in experimental crosses (with discussion). J Roy Stat Soc B 64:641–656, 737–775
  Contains the simulation study described above.
