QTL mapping in mice
Karl W Broman
Department of BiostatisticsJohns Hopkins University
Baltimore, Maryland, USA
www.biostat.jhsph.edu/˜kbroman
Outline
� Experiments, data, and goals
� Models
� ANOVA at marker loci
� Interval mapping
� LOD scores, LOD thresholds
� Mapping multiple QTLs
� Simulations
Backcross experiment
P1
A A
P2
B B
F1
A B
BCBC BC BC
Intercross experiment
P1
A A
P2
B B
F1
A B
F2F2 F2 F2
Trait distributions
30
40
50
60
70
A B F1 BC
Strain
Phe
noty
pe
Data and Goals
Phenotypes: ��� = trait value for mouse
�Genotypes: � �� = 1/0 if mouse
�
is BB/AB at marker
�
(for a backcross)Genetic map: Locations of markers
Goals:
� Identify the (or at least one) genomic regions(QTLs) that contribute to variation in the trait.
� Form confidence intervals for QTL locations.
� Estimate QTL effects.
Note: QTL = “quantitative trait locus”
Why?
Mice: Find gene
� Drug targets, biochemical basis
Agronomy: Selection for improvement
Flies: Genetic architecture
� Evolution
80
60
40
20
0
Chromosome
Loca
tion
(cM
)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 X
Genetic map
20 40 60 80 100 120
20
40
60
80
100
120
Markers
Indi
vidu
als
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 X
Genotype data
Statistical structure
QTL
Markers Phenotype
Covariates
The missing data problem:
Markers QTL
The model selection problem:
QTL, covariates � phenotype
Models: Recombination
We assume no crossover interference.
� � Points of exchange (crossovers) are accordingto a Poisson process.
� � The
��� �
(marker genotypes) form a Markovchain
Example
A B − B A A
B A A − B A
A A A A A B
?
Models: Genotype Phenotype
Let � = phenotype� = whole genome genotype
Imagine a small number of QTLs with genotypes ������ � � � ��� .(
� �
distinct genotypes)
E
� � � � � � ��� "!# # # ! �%$ var� � � � � � & '� !# # # ! �$
Models: Genotype Phenotype
Homoscedasticity (constant variance): & '� ( & '
Normally distributed residual variation: � � �) * � � � � & ' �
.
Additivity: � � !# # # ! �$ � � + ���, � -� �� ( �� � .
or
/
)
Epistasis: Any deviations from additivity.
The simplest method: ANOVA
� Split mice into groupsaccording to genotypeat a marker.
� Do a t-test / ANOVA.
� Repeat for each marker.
40
50
60
70
80
Phe
noty
pe
BB AB
Genotype at D1M30
BB AB
Genotype at D2M99
ANOVA at marker loci
Advantages
� Simple.
� Easily incorporatecovariates.
� Easily extended to morecomplex models.
� Doesn’t require a geneticmap.
Disadvantages
� Must exclude individualswith missing genotype data.
� Imperfect information aboutQTL location.
� Suffers in low density scans.� Only considers one QTL at a
time.
Interval mapping (IM)
Lander & Botstein (1989)
� Take account of missing genotype data
� Interpolate between markers
� Maximum likelihood under a mixture model
0 20 40 60
0
2
4
6
8
Map position (cM)
LOD
sco
re
Interval mapping (IM)
Lander & Botstein (1989)
� Assume a single QTL model.
� Each position in the genome, one at a time, is posited as theputative QTL.
� Let 0 � 1/0 if the (unobserved) QTL genotype is BB/AB.Assume �) * � ��1 � & �
� Given genotypes at linked markers, �) mixture of normal dist’nswith mixing proportion
243 � 0 � . �marker data
�
:
QTL genotype576 578 BB ABBB BB
9: ; <>= ? 9: ; <>@ ?A 9: ; < ? <= <>@ A 9 : ; < ?
BB AB9: ; <= ? <>@ A < <= 9: ; <>@ ?A <
AB BB <= 9: ; <>@ ?A < 9: ; <= ? <>@ A <
AB AB <= <@ A 9: ; < ? 9 : ; <= ? 9: ; <@ ?A 9: ; < ?
The normal mixtures
576 578B
7 cM 13 cM
C Two markers separated by 20 cM,with the QTL closer to the leftmarker.
C The figure at right show the dis-tributions of the phenotype condi-tional on the genotypes at the twomarkers.
C The dashed curves correspond tothe components of the mixtures.
20 40 60 80 100
Phenotype
BB/BB
BB/AB
AB/BB
AB/AB
µ0
µ0
µ0
µ0
µ1
µ1
µ1
µ1
Interval mapping (continued)
Let D � � 243 � 0�� � . �
marker data
�
�� � 0�� ) * � �1E � & ' �
243 � ��� �
marker data� ��F � � ��� & � � D � G � ��� H � �� & � + � . � D � � G � �� H �F � & �
where
G � � H �� & � � density of normal distribution
Log likelihood:
I � ��F � � ��� & � � � JLKM 243 � ��� �
marker data� ��F � � ��� & �
Maximum likelihood estimates (MLEs) of � F , � � , &:EM algorithm.
LOD scores
The LOD score is a measure of the strength of evidence for thepresence of a QTL at a particular location.
LOD
� 0 � � JLKM � F likelihood ratio comparing the hypothesis of aQTL at position 0 versus that of no QTL
� JKM . /243 � � �QTL at 0� N ��F 1 � N � � 1 � N &1 �
243 � � �no QTL� N �� N & �
N �F 1 � N � � 1 � N &1 are the MLEs, assuming a single QTL at position 0.
No QTL model: The phenotypes are independent and identicallydistributed (iid)
* � �� & ' �
.
An example LOD curve
0 20 40 60 80 100
0
2
4
6
8
10
12
Chromosome position (cM)
LOD
0 500 1000 1500 2000 2500
0.0
0.5
1.0
1.5
2.0
2.5
3.0
Map position (cM)
lod
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19LOD curves
Interval mapping
Advantages
� Takes proper account ofmissing data.
� Allows examination ofpositions between markers.
� Gives improved estimatesof QTL effects.
� Provides pretty graphs.
Disadvantages
� Increased computationtime.
� Requires specializedsoftware.
� Difficult to generalize.� Only considers one QTL at
a time.
LOD thresholds
Large LOD scores indicate evidence for the presence of a QTL.
Q: How large is large?
We consider the distribution of the LOD score under the nullhypothesis of no QTL.
Key point: We must make some adjustment for our examination ofmultiple putative QTL locations.
We seek the distribution of the maximum LOD score, genome-wide. The 95th %ile of this distribution serves as a genome-wideLOD threshold.
Estimating the threshold: simulations, analytical calculations, per-mutation (randomization) tests.
Null distribution of the LOD score
C Null distribution derived bycomputer simulation of backcrosswith genome of typical size.
C Solid curve: distribution of LODscore at any one point.
C Dashed curve: distribution ofmaximum LOD score,genome-wide.
0 1 2 3 4
LOD score
Permutation tests
mice
markers
genotypedata
phenotypes
O LOD
9QP ?(a set of curves)
O 5SRT UVXW LOD
9 P ?
C Permute/shuffle the phenotypes; keep the genotype data intact.
C Calculate LOD
Y 9QP ? ; Z 5 YR T UVXW LODY 9 P ?
C We wish to compare the observed5
to the distribution of
5 Y
.
C [�\ 9 5 Y] 5 ?
is a genome-wide P-value.
C The 95th %ile of
5 Y
is a genome-wide LOD threshold.
C We can’t look at all ^ _ possible permutations, but a random set of 1000 is feasi-ble and provides reasonable estimates of P-values and thresholds.
C Value: conditions on observed phenotypes, marker density, and pattern of miss-ing data; doesn’t rely on normality assumptions or asymptotics.
Permutation distribution
maximum LOD score
0 1 2 3 4 5 6 7
95th percentile
Multiple QTL methods
Why consider multiple QTLs at once?
� Reduce residual variation.
� Separate linked QTLs.
� Investigate interactions between QTLs (epistasis).
Epistasis in a backcross
Additive QTLs
Interacting QTLs
0
20
40
60
80
100
Ave
. phe
noty
pe
A HQTL 1
A
H
QTL 2
0
20
40
60
80
100
Ave
. phe
noty
pe
A HQTL 1
A
H
QTL 2
Epistasis in an intercross
Additive QTLs
Interacting QTLs
0
20
40
60
80
100
Ave
. phe
noty
pe
A H BQTL 1
AH
BQTL 2
0
20
40
60
80
100
Ave
. phe
noty
pe
A H BQTL 1
A
H
BQTL 2
Abstractions / simplifications
� Complete marker data
� QTLs are at the marker loci
� QTLs act additively
The problem
n backcross mice; M markers
�� � = genotype (1/0) of mouse � at marker �
�� = phenotype (trait value) of mouse �
�� � � +5
� R :-� ��� +�` � Which
-� a � /?
b Model selection in regression
How is this problem different?
� Relationship among the x’s
� Find a good model vs. minimize prediction error
Model selection
� Select class of models– Additive models
– Add’ve plus pairwise interactions
– Regression trees
� Compare models
– BIC c 9Qd ? R e�f g RSS
9Qd ?ih j d j 9k e�f g ^^ ?
– Sequential permutation tests
– Estimate of prediction error
� Search model space– Forward selection (FS)
– Backward elimination (BE)
– FS followed by BE
– MCMC� Assess performance
– Maximize no. QTLs found;control false positive rate
Why BIC l?
� For a fixed no. markers, letting m n, BIC o is consistent.
� There exists a prior (on models + coefficients) for whichBIC o is the –log posterior.
� BIC o is essentially equivalent to use of a threshold on theconditional LOD score
� It performs well.
Choice of l
Smaller
l
: include more loci; higher false positive rate
Larger
l
: include fewer loci; lower false positive rate
Let L = 95% genome-wide LOD threshold(compare single-QTL models to the null model)
Choose
l
= 2 L / JLKM :p mWith this choice of
l, in the absence of QTLs, we’ll
include at least one extraneous locus, 5% of thetime.
Simulations
� Backcross with n=250
� No crossover interference
� 9 chr, each 100 cM
� Markers at 10 cM spacing;complete genotype data
� 7 QTLs
– One pair in coupling– One pair in repulsion– Three unlinked QTLs
� Heritability = 50%
� 2000 simulation replicates
1
2
3
4
5
6
7
8
9
Methods
� ANOVA at marker loci
� Composite interval mapping (CIM)
� Forward selection with permutation tests
� Forward selection with BICk
� Backward elimination with BICk
� FS followed by BE with BICk
� MCMC with BICk
� A selected marker is deemed correct if it is within10 cM of a QTL (i.e., correct or adjacent)
0 1 2 3 4 5 6 7
Correct
Ave no. chosen
ANOVA
35
79
11
CIM
fs, perm
fs
be
fs/be
mcmc
BIC
0.0
0.5
1.0
1.5
2.0
QTLs linked in couplingA
ve n
o. c
hose
n
AN
OV
A 3 5 7 9 11
CIM fs, p
erm fs be
fs/b
e
mcm
c
BIC
0.0
0.5
1.0
1.5
2.0
QTLs linked in repulsionA
ve n
o. c
hose
n
AN
OV
A 3 5 7 9 11
CIM fs, p
erm fs be
fs/b
e
mcm
c
BIC
0.0
0.5
1.0
1.5
2.0
2.5
3.0
Other QTLsA
ve n
o. c
hose
n
AN
OV
A 3 5 7 9 11
CIM fs, p
erm fs be
fs/b
e
mcm
c
BIC
0.0
0.1
0.2
0.3
0.4
Extraneous linkedA
ve n
o. c
hose
n
AN
OV
A 3 5 7 9 11
CIM fs, p
erm fs be
fs/b
e
mcm
c
BIC
0.00
0.01
0.02
0.03
0.04
0.05
0.06
Extraneous unlinkedA
ve n
o. c
hose
n
AN
OV
A 3 5 7 9 11
CIM fs, p
erm fs be
fs/b
e
mcm
c
BIC
Summary
� QTL mapping is a model selection problem.
� Key issue: the comparison of models.
� Large-scale simulations are important.
� More refined procedures do not necessarily giveimproved results.
� BICk with forward selection followed by backwardelimination works quite well (in the case of additiveQTLs).
Acknowledgements
Terry Speed, University of California, Berkeley, and WEHI
Gary Churchill, The Jackson Laboratory
Saunak Sen, University of California, San Francisco
References
C Broman KW (2001) Review of statistical methods for QTL mapping in experi-mental crosses. Lab Animal 30(7):44–52
Review for non-statisticians
C Broman KW, Speed TP (1999) A review of methods for identifying QTLs inexperimental crosses. In: Seillier-Moiseiwitch F (ed) Statistics in MolecularBiology. IMS Lecture Notes—Monograph Series. Vol 33, pp. 114–142
Older, more statistical review.
C Lander ES, Botstein D (1989) Mapping Mendelian factors underlying quantita-tive traits using RFLP linkage maps. Genetics 121:185–199
The seminal paper.
C Churchill GA, Doerge RW (1994) Empirical threshold values for quantitative traitmapping. Genetics 138:963–971
LOD thresholds by permutation tests.
C Miller AJ (2002) Subset selection in regression, 2nd edition. Chapman & Hall,New York.
A reasonably good book on model selection in regression.
C Strickberger MW (1985) Genetics, 3rd edition. Macmillan, New York, chapter11.
An old but excellent general genetics textbook with a very interesting discussionof epistasis.
C Broman KW, Speed TP (2002) A model selection approach for the identificationof quantitative trait loci in experimental crosses (with discussion). J Roy StatSoc B 64:641–656, 737–775
Contains the simulation study described above.