quantgenet 3 - school of medicine and public health – uw

24
Introduction to QTL mapping in model organisms Karl W Broman Department of Biostatistics Johns Hopkins University [email protected] www.biostat.jhsph.edu/˜kbroman Outline Experiments and data Models ANOVA at marker loci Interval mapping Epistasis LOD thresholds CIs for QTL location Selection bias The X chromosome Selective genotyping Covariates Non-normal traits The need for good data

Upload: others

Post on 14-Mar-2022

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: QuantGenet 3 - School of Medicine and Public Health – UW

Introduction to QTL mapping

in model organisms

Karl W Broman

Department of BiostatisticsJohns Hopkins University

[email protected]

www.biostat.jhsph.edu/˜kbroman

Outline

� Experiments and data� Models� ANOVA at marker loci� Interval mapping� Epistasis� LOD thresholds� CIs for QTL location

� Selection bias� The X chromosome� Selective genotyping� Covariates� Non-normal traits� The need for good data

Page 2: QuantGenet 3 - School of Medicine and Public Health – UW

Backcross experiment

P1

A A

P2

B B

F1

A B

BCBC BC BC

Intercross experiment

P1

A A

P2

B B

F1

A B

F2F2 F2 F2

Page 3: QuantGenet 3 - School of Medicine and Public Health – UW

Phenotype distributions

� Within each of the parentaland F � strains, individuals aregenetically identical.

� Environmental variation mayor may not be constant withgenotype.

� The backcross generation ex-hibits genetic as well as envi-ronmental variation.

A B

Parental strains

0 20 40 60 80 100

0 20 40 60 80 100

F1 generation

0 20 40 60 80 100

Phenotype

Backcross generation

Data

Phenotypes: ��� = trait value for individual �

Genotypes: ���� = 0/1 if mouse � is BB/AB at marker (or 0/1/2, in an intercross)

Genetic map: Locations of markers

Page 4: QuantGenet 3 - School of Medicine and Public Health – UW

80

60

40

20

0

Chromosome

Loca

tion

(cM

)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 X

Genetic map

20 40 60 80 100 120

20

40

60

80

100

120

Markers

Indi

vidu

als

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 X

Genotype data

Page 5: QuantGenet 3 - School of Medicine and Public Health – UW

Goals

� Detect QTLs (and interactions between QTLs)

� Confidence intervals for QTL location

� Estimate QTL effects (effects of allelic substitution)

Statistical structure

QTL

Markers Phenotype

Covariates

The missing data problem:

Markers ��� QTL

The model selection problem:

QTL, covariates ��� phenotype

Page 6: QuantGenet 3 - School of Medicine and Public Health – UW

Models: Recombination

We assume � Mendel’s rules� no crossover interference

1. Pr( ���� = 0) = Pr( ���� = 1) = 1/2

2. Pr( ����� ����� = 1� � � � = 0) = Pr( ����� ����� = 0

� ���� = 1) = � �where � � is the recombination fraction for the interval

3. The genotype at a position depends only on thegenotypes at the nearest flanking typed markers

Example

A B − B A A

B A A − B A

A A A A A B

?

Page 7: QuantGenet 3 - School of Medicine and Public Health – UW

Models: Genotype ��� Phenotype

Let � = phenotype� = whole genome genotype

Imagine a small number of QTLs with genotypes � ����������� ��� .( � distinct genotypes)

E � � ���� ����� ������� � ��� var � � ���� ������ ������� � ���

Models: Genotype ��� Phenotype

Homoscedasticity (constant variance): ������ � �

Normally distributed residual variation: � � �"! # �$� � � �%� .

Additivity: ��� � ������� � � � �'& ( ���)��+* � � � ( � � , or - )

Epistasis: Any deviations from additivity.

Page 8: QuantGenet 3 - School of Medicine and Public Health – UW

The simplest method: ANOVA

� Also known as markerregression.

� Split mice into groupsaccording to genotypeat a marker.

� Do a t-test / ANOVA.� Repeat for each marker.

40

50

60

70

80

Phe

noty

pe

BB AB

Genotype at D1M30

BB AB

Genotype at D2M99

ANOVA at marker loci

Advantages� Simple.� Easily incorporates

covariates.� Easily extended to more

complex models.� Doesn’t require a genetic

map.

Disadvantages� Must exclude individuals

with missing genotype data.� Imperfect information about

QTL location.� Suffers in low density scans.� Only considers one QTL at a

time.

Page 9: QuantGenet 3 - School of Medicine and Public Health – UW

Interval mapping (IM)

Lander & Botstein (1989)� Take account of missing genotype data� Interpolate between markers� Maximum likelihood under a mixture model

0 20 40 60

0

2

4

6

8

Map position (cM)

LOD

sco

re

Interval mapping (IM)

Lander & Botstein (1989)� Assume a single QTL model.� Each position in the genome, one at a time, is posited as the

putative QTL.� Let � 1/0 if the (unobserved) QTL genotype is BB/AB.

Assume � ! # ��� � �$�� Given genotypes at linked markers, � ! mixture of normal dist’ns

with mixing proportion��� � , �marker data � :

QTL genotype� � ��� BB ABBB BB � ��������� ����������� ����� ����������� �����BB AB � ��������������� ����� �!���������AB BB � � � ��� � ����� �"��� � ��� � ���AB AB � � � � ��� ����� � ��� � ��� �!� � ����� �����

Page 10: QuantGenet 3 - School of Medicine and Public Health – UW

The normal mixtures

� � � ��7 cM 13 cM

� Two markers separated by 20 cM,with the QTL closer to the leftmarker.

� The figure at right show the dis-tributions of the phenotype condi-tional on the genotypes at the twomarkers.

� The dashed curves correspond tothe components of the mixtures.

20 40 60 80 100

Phenotype

BB/BB

BB/AB

AB/BB

AB/AB

µ0

µ0

µ0

µ0

µ1

µ1

µ1

µ1

Interval mapping (continued)

Let ��� ��� � � , �marker data �

� � � � � ! # ����� � � � ���� � � �marker data � ��� � � �%� � � � ��� � �� � �%� � ��& , �� � � � � �� ��� � � �

where � ��� � � � � = density of normal distribution

Log likelihood: � � � � � �%� � �� ( ������� ��� � � �marker data � ��� � � �%� � �

Maximum likelihood estimates (MLEs) of � � , � � , � :

values for which � � � � � �%� � � is maximized.

Page 11: QuantGenet 3 - School of Medicine and Public Health – UW

EM algorithm

Dempster et al. (1977)

E step:

Let � ��� ����� ��� � � , � � � � marker data ���� ��� �� ���� ��� �� ��� ��� � �

� �� � � ������������� � �� ����� �� � � � � ����������� � �� ����� � � � ��� � � � � � � ����������� � �� ����� �M step:

Let �� ��� ������

( � � � � ��� ������

� ( � � ��� �������� ��� ������ ( � � � , � � ��� �����

� � � ( � , � � ��� ������ �

�� ��� ����� [not worth writing down]

The algorithm:

Start with � � ���� � � ; iterate the E & M steps until convergence.

LOD scores

The LOD score is a measure of the strength of evidence for thepresence of a QTL at a particular location.

LOD � � � � � � � likelihood ratio comparing the hypothesis of aQTL at position � versus that of no QTL

� � � , - ��� � �

QTL at � �!���� � �!�� � � �"�� � ���� � �no QTL �!�� �#�� � $

�� � � �%�� � � ��� � are the MLEs, assuming a single QTL at position � .

No QTL model: The phenotypes are independent and identicallydistributed (iid) # � � ���%� .

Page 12: QuantGenet 3 - School of Medicine and Public Health – UW

An example LOD curve

0 20 40 60 80 100

0

2

4

6

8

10

12

Chromosome position (cM)

LOD

0 500 1000 1500 2000 2500

0.0

0.5

1.0

1.5

2.0

2.5

3.0

Map position (cM)

lod

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19LOD curves

Page 13: QuantGenet 3 - School of Medicine and Public Health – UW

Interval mapping

Advantages� Takes proper account of

missing data.� Allows examination of

positions between markers.� Gives improved estimates

of QTL effects.� Provides pretty graphs.

Disadvantages� Increased computation

time.� Requires specialized

software.� Difficult to generalize.� Only considers one QTL at

a time.

Multiple QTL methods

Why consider multiple QTLs at once?

� Reduce residual variation.� Separate linked QTLs.� Investigate interactions between QTLs (epistasis).

Page 14: QuantGenet 3 - School of Medicine and Public Health – UW

Epistasis in a backcross

Additive QTLs

Interacting QTLs

0

20

40

60

80

100

Ave

. phe

noty

pe

A HQTL 1

A

H

QTL 2

0

20

40

60

80

100

Ave

. phe

noty

pe

A HQTL 1

A

H

QTL 2

Epistasis in an intercross

Additive QTLs

Interacting QTLs

0

20

40

60

80

100

Ave

. phe

noty

pe

A H BQTL 1

AH

BQTL 2

0

20

40

60

80

100

Ave

. phe

noty

pe

A H BQTL 1

A

H

BQTL 2

Page 15: QuantGenet 3 - School of Medicine and Public Health – UW

LOD thresholds

Large LOD scores indicate evidence for the presence of a QTL.

Q: How large is large?

� We consider the distribution of the LOD score under the nullhypothesis of no QTL.

Key point: We must make some adjustment for our examination ofmultiple putative QTL locations.

� We seek the distribution of the maximum LOD score, genome-wide. The 95th %ile of this distribution serves as a genome-wideLOD threshold.

Estimating the threshold: simulations, analytical calculations, per-mutation (randomization) tests.

Null distribution of the LOD score

� Null distribution derived bycomputer simulation of backcrosswith genome of typical size.

� Solid curve: distribution of LODscore at any one point.

� Dashed curve: distribution ofmaximum LOD score,genome-wide.

0 1 2 3 4

LOD score

Page 16: QuantGenet 3 - School of Medicine and Public Health – UW

Permutation tests

mice

markers

genotypedata

phenotypes

� LOD � �(a set of curves)

�� �

������ LOD � �

� Permute/shuffle the phenotypes; keep the genotype data intact.� Calculate LOD

� � ��� � � ������ LOD � �

� We wish to compare the observed�

to the distribution of�

.�� �� � �� � � is a genome-wide P-value.� The 95th %ile of

� is a genome-wide LOD threshold.

� We can’t look at all ��� possible permutations, but a random set of 1000 is feasi-ble and provides reasonable estimates of P-values and thresholds.

� Value: conditions on observed phenotypes, marker density, and pattern of miss-ing data; doesn’t rely on normality assumptions or asymptotics.

Permutation distribution

maximum LOD score

0 1 2 3 4 5 6 7

95th percentile

Page 17: QuantGenet 3 - School of Medicine and Public Health – UW

Confidence intervals for QTL location

Confidence interval indicates plausible location of a QTL.

Methods:� LOD support intervals

(I prefer 1.5-LOD intervals.)

� Bootstrap(Resample individuals with replacement)

1.5-LOD support interval

0 5 10 15 20 25 30 35

0

1

2

3

4

5

6

Map position (cM)

lod

1.5

1.5−LOD support interval

Page 18: QuantGenet 3 - School of Medicine and Public Health – UW

Bootstrap-based confidence intervals

� Consider a set of n individuals (e.g., n = 124)

� Sample, with replacement, n individuals

– Some individuals duplicated– Some individuals missing

� Using the sampled individuals, re-calculate the LOD curve forthe chromosome of interest

� Identify the location of the maximum LOD score

� Repeat many times (e.g., 250)

� Bootstrap CI = (2.5 %ile to 97.5 %ile) of locations of maximumLOD.

Bootstrap confidence interval

Bootstrap CI1.5−LOD support interval

0 5 10 15 20 25 30 35

Page 19: QuantGenet 3 - School of Medicine and Public Health – UW

Selection bias

� The estimated effect of a QTL willvary somewhat from its true effect.

� Only when the estimated effect islarge will the QTL be detected.

� Among those experiments in whichthe QTL is detected, the estimatedQTL effect will be, on average,larger than its true effect.

� This is selection bias.� Selection bias is largest in QTLs

with small or moderate effects.� The true effects of QTLs that we

identify are likely smaller than wasobserved.

Estimated QTL effect

0 5 10 15

QTL effect = 5Bias = 79%

Estimated QTL effect

0 5 10 15

QTL effect = 8Bias = 18%

Estimated QTL effect

0 5 10 15

QTL effect = 11Bias = 1%

Implications of selection bias

� Estimated % variance explained by identified QTLs

� Repeating an experiment

� Congenics

� Marker-assisted selection

Page 20: QuantGenet 3 - School of Medicine and Public Health – UW

The X chromosome

In a backcross, the X chromosome may or may not besegregating.

(A � B) � A

Females: XA � B XA

Males: XA � B YA

A � (A � B)

Females: XA XA

Males: XA YB

The X chromosome

In an intercross, one must pay attention to the paternalgrandmother’s genotype.

(A � B) � (A � B) or (B � A) � (A � B)

Females: XA � B XA

Males: XA � B YB

(A � B) � (B � A) or (B � A) � (B � A)

Females: XA � B XB

Males: XA � B YA

Page 21: QuantGenet 3 - School of Medicine and Public Health – UW

Selective genotyping

� Save effort by onlytyping the mostinformative individuals(say, top & bottom 10%).

� Useful in context of asingle, inexpensive trait.

� Tricky to estimate theeffects of QTLs: use IMwith all phenotypes.

� Can’t get at interactions.� Likely better to also

genotype some randomportion of the rest of theindividuals.

40

50

60

70

80

Phe

noty

pe

BB AB

All genotypes

BB AB

Top and botton 10%

Covariates

� Examples: treatment, sex,litter, lab, age.

� Control residual variation.� Avoid confounding.� Look for QTL � environ’t

interactions� Adjust before interval

mapping (IM) versus adjustwithin IM.

0 20 40 60 80 100 120

0

1

2

3

4

Map position (cM)

lod

7

Intercross, split by sex

0 20 40 60 80 100 120

0

1

2

3

4

Map position (cM)

lod

7

Reverse intercross

Page 22: QuantGenet 3 - School of Medicine and Public Health – UW

Non-normal traits

� Standard interval mapping assumes normally distributedresidual variation. (Thus the phenotype distribution is amixture of normals.)

� In reality: we see dichotomous traits, counts, skeweddistributions, outliers, and all sorts of odd things.

� Interval mapping, with LOD thresholds derived frompermutation tests, generally performs just fine anyway.

� Alternatives to consider:– Nonparametric approaches (Kruglyak & Lander 1995)

– Transformations (e.g., log, square root)

– Specially-tailored models (e.g., a generalized linear model, the Coxproportional hazard model, and the model in Broman et al. 2000)

Check data integrity

The success of QTL mapping depends crucially on the integrity ofthe data.

� Segregation distortion

� Genetic maps / marker positions

� Genotyping errors (tight double crossovers)

� Phenotype distribution / outliers

� Residual analysis

Page 23: QuantGenet 3 - School of Medicine and Public Health – UW

Summary I

� ANOVA at marker loci (aka marker regression) is simple and easily extendedto include covariates or accommodate complex models.

� Interval mapping improves on ANOVA by allowing inference of QTLs topositions between markers and taking proper account of missing genotypedata.

� ANOVA and IM consider only single-QTL models. Multiple QTL methods allowthe better separation of linked QTLs and are necessary for the investigation ofepistasis.

� Statistical significance of LOD peaks requires consideration of the maximumLOD score, genome-wide, under the null hypothesis of no QTLs. Permutationtests are extremely useful for this.

� 1.5-LOD support intervals indicate the plausible location of a QTL.Alternatively, perform a bootstrap.

� Estimates of QTL effects are subject to selection bias. Such estimated effectsare often too large.

Summary II

� The X chromosome must be dealt with specially, and can be tricky.� Study your data. Look for errors in the genetic map, genotyping errors and

phenotype outliers. But don’t worry about them too much.� Selective genotyping can save you time and money, but proceed with caution.� Study your data. The consideration of covariates may reveal extremely

interesting phenomena.� Interval mapping works reasonably well even with non-normal traits. But

consider transformations or specially-tailored models. If interval mappingsoftware is not available for your preferred model, start with some version ofANOVA.

Page 24: QuantGenet 3 - School of Medicine and Public Health – UW

References

� Broman KW (2001) Review of statistical methods for QTL mapping in experi-mental crosses. Lab Animal 30(7):44–52

A review for non-statisticians.� Doerge RW (2002) Mapping and analysis of quantitative trait loci in experimen-

tal populations. Nat Rev Genet 3:43–52

A very recent review.� Doerge RW, Zeng Z-B, Weir BS (1997) Statistical issues in the search for

genes affecting quantitative traits in experimental populations. Statistical Sci-ence 12:195–219

Review paper.� Jansen RC (2001) Quantitative trait loci in inbred lines. In Balding DJ et al.,

Handbook of statistical genetics, John Wiley & Sons, New York, chapter 21

Review in an expensive but rather comprehensive and likely useful book.� Lynch M, Walsh B (1998) Genetics and analysis of quantitative traits. Sinauer

Associates, Sunderland, MA, chapter 15

Chapter on QTL mapping.

� Lander ES, Botstein D (1989) Mapping Mendelian factors underlying quantita-tive traits using RFLP linkage maps. Genetics 121:185–199

The seminal paper.� Churchill GA, Doerge RW (1994) Empirical threshold values for quantitative trait

mapping. Genetics 138:963–971

LOD thresholds by permutation tests.� Kruglyak L, Lander ES (1995) A nonparametric approach for mapping quanti-

tative trait loci. Genetics 139:1421–1428

Non-parameteric interval mapping.� Boyartchuk VL, Broman KW, Mosher RE, D’Orazio S, Starnbach M, Dietrich

WF (2001) Multigenic control of Listeria monocytogenes susceptibility in mice.Nat Genet 27:259–260

Broman KW, Boyartchuk VL, Dietrich WF (2000) Mapping time-to-death quan-titative trait loci in a mouse cross with high survival rates. Technical ReportMS00-04, Department of Biostatistics, Johns Hopkins University, Baltimore, MD

QTL mapping with a special model for a non-normal phenotype.� Visscher PM, Thompson R, Haley CS (1996) Confidence intervals in QTL map-

ping by bootstrapping. Genetics 143:1013–1020

Bootstrap confidence intervals.