
How can quantitative traits be mapped?

CH927 Quantitative Genomics, Lecture 2

By the end of this lecture you should be able to explain:

• What the main steps in QTL mapping are

• What the different methods for QTL analysis are:

- Single marker, Interval/Flanking Mapping

- Composite Interval Mapping (CIM), Multiple Interval Mapping (MIM)

and under which experimental conditions they should be used

• What the different statistical methods for QTL analysis are:

- t-test, ANOVA, multiple regression, linear regression and what they can predict about QTLs

Lecture objectives

Basis for QTL mapping known for over 70 years, but lack of genetic markers prevented widespread use until the mid-1980s

• With DNA sequencing, the number & density of markers have grown

• Also, more statistically-sophisticated mapping methods have been developed

1. Score a population for (i) a trait, and (ii) distribution of genome markers

2. Identify regions of the genome containing QTLs based on occurrence of a phenotype-marker association that is significantly more likely than chance

• Results from marker A/a: suggests that the gene is very close to the marker

Association of phenotypes with markers

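[Figure: individuals with genotypes such as aGB, aGb, AgB and Agb, sorted by marker A/a. Every individual carrying a has phenotype G and every individual carrying A has phenotype g]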

• Results from marker B/b: suggests that the gene is not linked to the marker

[Figure: individuals sorted by marker B/b. Phenotypes G and g both occur in each marker class]

A/a and B/b = molecular scores

G/g = phenotypic score

This is a generalisation of the principle, but for only one gene. We need to consider Quantitative Trait Loci (multiple)

[Figure: individuals scored at several marker loci (A/a, B/b, C/c, D/d, H/h, J/j, K/k) and for phenotype G/g, with multi-locus genotypes such as aGcHBJkD, AgcHbjkd and AgcHBJKD]

3. Estimate the effects of the QTLs on the quantitative trait:

- many genes with small effect each or few genes with large effect each?

- their effects on the trait: is gene action additive or dominant?

- their positions in the genome: linkage and association, epistasis

- their interaction with the environment

4. Identify candidate genes underlying the QTL and thus the trait

Objectives of QTL analysis


QTL analysis can be classified by the type of progeny used

• All of the different progenies are derived from the same reference population

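P1 (MMQQ) x P2 (mmqq) gives F1 (MmQq)

M = marker genotype; Q = QTL genotype

• From this reference population different progenies can be produced

[Figure: progenies derived from the F1 (MmQq): the F2 by selfing, F7 recombinant inbred lines (RILs) by five further generations of selfing, SI lines by selfing, and progenies TC1 to TC4 from crosses to P3 and P4]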

Backcrosses and Near Isogenic Lines (NILs)

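[Figure: backcross scheme starting from a cross to P2, showing the generations F2, BC1 F1, BC1 F2, BC1 F3, BC2 F1, BC2 F2 and BC2 F3 produced by crossing (x) and selfing]

BC1 (Backcross 1) F1: use for QTL mapping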

Near Isogenic Lines

Isolate part of genome A of interest

Rapid generation of material for QTL analysis

To map a quantitative trait:

1. Make a cross and generate marker data

- Type of mapping population (e.g. RIL)

2. Generate linkage maps

- Genome size, genome coverage

3. Collect phenotypic measurements

- Evaluate in uniform environment

- Evaluate in multiple environments

- Data transformation (approach normal distribution)

Total variance: VT = VG + VE (genetic variance + environmental variance)

Sources of environmental variance: heterogeneous environment, stochastic events, measurement error

[Figure: frequency distributions of trait value for genotypes A1/A1, A1/A2 and A2/A2]

Assumes genes act additively (i.e. no epistasis) and that their effects are not conditional on environment, otherwise VT = VG + VE + VGxG + VGxE


4. The statistical machinery for QTL mapping

Several analysis frameworks for marker-QTL associations:

- Single marker tests (t-test, F-test or linear regression)

- Interval/Flanking Mapping (IM) (pair of markers simultaneously)

- Composite Interval Mapping (CIM) (analysis of a marker interval, flanked by adjacent markers, ML-based)

- Multiple Interval Mapping (MIM)

Four main analysis techniques:

ANOVA (marker regression): detects marker differences when there are more than two marker genotypes. Produces a ranking of genotypes, in order of phenotypic effect for the trait of interest, and tests for significant differences between each genotype

Simple t-test: use to evaluate presence of a QTL through statistical differences between two marker genotypes

Linear regression: most complex point analysis method, allowing different characteristics of the QTL to be investigated, including dominance effects, additive effects, genotype-environment interactions and epistasis

Multiple regression: simple remodelling of the ANOVA technique in regression terms, with the same ranking and testing for differences

Probabilities and t-tests

Basic mapping format: conditional probabilities

• The conditional probability that the QTL genotype is Qk, given that the marker genotype is Mj:

Pr(Qk | Mj) = Pr(QkMj) / Pr(Mj)

• Calculate this in an F2 from:

- gamete frequencies

- marker genotype probabilities

• Consider a QTL linked to a marker (recombination fraction = c)

• In the F2, the gamete frequencies are freq(MQ) = freq(mq) = (1-c)/2 and freq(mQ) = freq(Mq) = c/2

[Figure: P1 (MMQQ) x P2 (mmqq) gives the F1 (MmQq), which is selfed to give the F2]

QTL genotypes = missing

Marker genotypes = observed

• Hence,

Pr(MMQQ) = Pr(MQ)Pr(MQ) = (1-c)^2 / 4

Pr(MMQq) = 2Pr(MQ)Pr(Mq) = 2c(1-c) / 4

Pr(MMqq) = Pr(Mq)Pr(Mq) = c^2 / 4

• Since Pr(MM) = 1/4, the conditional probabilities become:

Pr(QQ | MM) = Pr(MMQQ)/Pr(MM) = (1-c)^2

Pr(Qq | MM) = Pr(MMQq)/Pr(MM) = 2c(1-c)

Pr(qq | MM) = Pr(MMqq)/Pr(MM) = c^2
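The conditional probabilities above can be checked numerically. The following is a minimal sketch, assuming an F1 of type MQ/mq and a single recombination fraction c; the function name `f2_conditional_qtl_probs` is hypothetical and is not part of the lecture material. It enumerates all pairs of F1 gametes, sums the joint probabilities and divides by the marker genotype probability.

```python
from fractions import Fraction


def f2_conditional_qtl_probs(c):
    """Return Pr(QTL genotype | marker genotype) in an F2, keyed by (marker, qtl)."""
    # Gametes produced by the F1 (MQ/mq): parental types (1-c)/2, recombinants c/2
    gametes = {
        ("M", "Q"): (1 - c) / 2,
        ("m", "q"): (1 - c) / 2,
        ("M", "q"): c / 2,
        ("m", "Q"): c / 2,
    }
    # Joint probability of each marker/QTL genotype from random union of gametes
    joint = {}
    for (m1, q1), p1 in gametes.items():
        for (m2, q2), p2 in gametes.items():
            marker = "".join(sorted(m1 + m2))  # "MM", "Mm" or "mm"
            qtl = "".join(sorted(q1 + q2))     # "QQ", "Qq" or "qq"
            joint[(marker, qtl)] = joint.get((marker, qtl), 0) + p1 * p2
    # Marginal probability of each marker genotype
    marker_totals = {}
    for (marker, _), p in joint.items():
        marker_totals[marker] = marker_totals.get(marker, 0) + p
    # Conditional probability = joint / marginal
    return {key: p / marker_totals[key[0]] for key, p in joint.items()}


probs = f2_conditional_qtl_probs(Fraction(1, 10))
for qtl in ("QQ", "Qq", "qq"):
    print("Pr(%s | MM) =" % qtl, probs[("MM", qtl)])
```

With c = 1/10 the three values for marker genotype MM are (1-c)^2, 2c(1-c) and c^2, as in the formulas above. Exact fractions are used only so that the results can be compared with the algebra without rounding.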

Using a t-test to probe a QTL

• e.g. backcross with two genes: marker (alleles M, m) and QTL (alleles Q, q)

• These two genes are linked with a recombination fraction of c

Genotype      MmQq      Mmqq   mmQq   mmqq
Frequency     (1-c)/2   c/2    c/2    (1-c)/2
Mean effect   m+a       m      m+a    m

• Mean of marker genotype Mm (frequency 1/2):

m1 = [(1-c)/2 (m+a) + c/2 m] / (1/2) = m + (1-c)a

• Mean of marker genotype mm (frequency 1/2):

m0 = [c/2 (m+a) + (1-c)/2 m] / (1/2) = m + ca

If the trait mean is significantly different for the genotypes at a marker locus, the marker is linked to a QTL

• A small MM-mm difference can arise from either:

- a QTL of small effect in tight linkage

- a QTL of large effect in loose linkage
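As an illustration of the single marker t-test, here is a minimal sketch that simulates a backcross and compares the two marker classes. The simulation settings (population size, residual standard deviation, seed) and the function names are assumptions made for the example. Under the backcross table above, the expected difference between marker class means is (1-2c)a, so the same difference can come from different combinations of a and c.

```python
import math
import random
import statistics


def simulate_backcross(n, c, a, m=10.0, sd=1.0, seed=1):
    """Simulate (marker genotype, trait value) pairs for a backcross population."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        marker = rng.choice(["Mm", "mm"])
        recombinant = rng.random() < c
        # Without recombination Mm carries Qq and mm carries qq
        has_q_allele = (marker == "Mm") != recombinant
        trait = m + (a if has_q_allele else 0.0) + rng.gauss(0.0, sd)
        data.append((marker, trait))
    return data


def two_sample_t(x, y):
    """Pooled-variance two-sample t statistic with len(x) + len(y) - 2 d.f."""
    nx, ny = len(x), len(y)
    pooled = ((nx - 1) * statistics.variance(x)
              + (ny - 1) * statistics.variance(y)) / (nx + ny - 2)
    return (statistics.mean(x) - statistics.mean(y)) / math.sqrt(pooled * (1 / nx + 1 / ny))


data = simulate_backcross(n=200, c=0.1, a=2.0)
mm_het = [y for g, y in data if g == "Mm"]
mm_hom = [y for g, y in data if g == "mm"]
print("difference in means:", statistics.mean(mm_het) - statistics.mean(mm_hom))
print("t statistic:", two_sample_t(mm_het, mm_hom))
```

The t statistic would then be compared with the critical value for the chosen significance level; that lookup is left out of the sketch.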

ANOVA and single marker regression

• Partition variance: genetically-determined and environmental components

• Model (there is a QTL linked to a marker) is tested against the null hypothesis of no QTL

[Figure: trait value plotted for genotypes A1/A1, A1/A2 and A2/A2, with the grand mean marked]

Partitioning of variance: a simple ANOVA model

• Total sum of squares, SST: calculate the grand mean and the deviation of each individual from it, square each deviation and sum the squared deviations over the population

Partitioning of variance methodology

SST degrees of freedom = n - 1 (n = 23 in this example)

• Total mean square, MST = SST / (n - 1) = total variance

Partitioning of variance: fitting the model

• Calculate the mean for each genotype group

• SSR = residual sum of squares = sum of (deviation of each individual from its genotype mean)^2

SSR degrees of freedom = (n - 1) - (#genotypes - 1) = n - #genotypes

• Residual mean square, MSR = SSR / (n - #genotypes) = variance not explained by the model (i.e. not explained by this QTL)

Genetic variance and testing the model

• Model sum of squares, SSM = sum over genotypes of:

(grand mean - genotype mean)^2 x (# individuals with that genotype)

SSM degrees of freedom = #genotypes - 1 = 2

• Genetic variance, MSM = SSM / 2

• The sums of squares are additive: SST = SSM + SSR

• So it is easier to calculate SSM = SST - SSR

• To test whether the QTL explains a significant amount of the variation, calculate the ratio of model to residual variance:

F-ratio = MSM / MSR

• Look up the minimum value of F that is unlikely to have occurred by chance, given 2 d.f. for MSM and 20 for MSR (F ≥ 3.49 for p ≤ 0.05 in this case)

• If F exceeds this value, we can reject the null hypothesis of no QTL

Proportion of variance explained by the QTL = SSM / SST, where SSM = SST - SSR
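The partitioning described above can be written out in a few lines. This is a minimal sketch of a one-way ANOVA on marker genotype classes; the function name `one_way_anova` and the nine trait values are invented for illustration. It relies on the sums of squares being additive (SST = SSM + SSR), with the mean squares obtained by dividing each sum of squares by its own degrees of freedom.

```python
def one_way_anova(groups):
    """One-way ANOVA for a dict mapping genotype label -> list of trait values."""
    values = [v for g in groups.values() for v in g]
    n = len(values)
    grand_mean = sum(values) / n
    # Total sum of squares: deviations of every individual from the grand mean
    sst = sum((v - grand_mean) ** 2 for v in values)
    ssm = 0.0
    ssr = 0.0
    for g in groups.values():
        group_mean = sum(g) / len(g)
        # Model: genotype mean vs grand mean, weighted by group size
        ssm += len(g) * (group_mean - grand_mean) ** 2
        # Residual: individuals vs their own genotype mean
        ssr += sum((v - group_mean) ** 2 for v in g)
    df_model = len(groups) - 1
    df_resid = n - len(groups)
    msm = ssm / df_model
    msr = ssr / df_resid
    return {"SST": sst, "SSM": ssm, "SSR": ssr,
            "df_model": df_model, "df_resid": df_resid,
            "F": msm / msr, "explained": ssm / sst}


# Illustrative data only: three genotype classes, three individuals each
example = {"A1/A1": [4.0, 5.0, 6.0], "A1/A2": [6.0, 7.0, 8.0], "A2/A2": [8.0, 9.0, 10.0]}
result = one_way_anova(example)
print(result)
```

For these values SST = 30, SSM = 24 and SSR = 6, giving F = 12 on 2 and 6 d.f. With n = 23 individuals and three genotypes, as in the lecture example, the residual d.f. would be 20.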

• Incorporate terms into the model to estimate:

The additive effect of the alleles, a = half the difference between the averages for the two homozygotes. It can be positive or negative, depending on which allele is being considered

The dominance deviation, d = the difference between the average for the heterozygotes and the mid-point of the homozygotes. It can also be positive or negative

This is essentially a least-squares regression

If |d| > |a|, one allele shows over-dominance

If d = ±a, one allele is completely dominant
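A short sketch of these two definitions, computed from the three genotype class means. The function names and the example means are hypothetical; the "partial dominance" and "additive" labels are standard terms added here for completeness rather than taken from the slide.

```python
def additive_dominance(mean_hom1, mean_het, mean_hom2):
    """Return (a, d) from the mean trait values of the three genotype classes."""
    midpoint = (mean_hom1 + mean_hom2) / 2
    a = (mean_hom2 - mean_hom1) / 2   # half the difference between homozygotes
    d = mean_het - midpoint           # heterozygote deviation from the mid-point
    return a, d


def gene_action(a, d):
    """Classify gene action by comparing |d| with |a|."""
    if d == 0:
        return "additive"
    if abs(d) < abs(a):
        return "partial dominance"
    if abs(d) == abs(a):
        return "complete dominance"
    return "over-dominance"


for means in [(10.0, 13.0, 16.0), (10.0, 14.0, 16.0), (10.0, 16.0, 16.0), (10.0, 18.0, 16.0)]:
    a, d = additive_dominance(*means)
    print(means, "a =", a, "d =", d, gene_action(a, d))
```

In practice a and d are estimated with error, so the exact equality test for complete dominance is only meaningful for idealised means like those in the example.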

Estimation of additive and dominance effects

Additive effects (a):

(m1–m0)/2 = a(1-2c) = a*

• a* = estimated additive effects

• d* = estimated dominance effects

Dominance effects (d):

m2 - (m1–m0)/2 = d(1-2c) = d*

Genotype      MmQq      Mmqq   mmQq   mmqq
Frequency     (1-c)/2   c/2    c/2    (1-c)/2
Mean effect   m+a       m      m+a    m

• Mean of marker genotype Mm: m1

• Mean of marker genotype mm: m0

Linear Models for QTL Detection

y_mk = μ + b_m + e_mk

- y_mk = value of trait in kth individual of marker genotype m

- μ = overall mean

- b_m = effect of marker genotype m on trait value

- e_mk = residual error

• Detection: a QTL is linked to the marker if at least one of the b_m is significantly different from zero

• Estimation (QTL effect and position): have to relate the b_m to the QTL effects and map position

• Uses the linear relationship between the apparent effects of a marker on a quantitative character and the actual effects of all QTLs that are linked to that marker

• Differences in the distance between the QTL and the markers alter factors in this relationship

Detecting epistasis

• One major advantage of linear models is their flexibility

• Test for epistasis between two QTLs: use an ANOVA with an interaction term:

y = μ + a_i + b_k + d_ik + e

- a_i = effect from marker genotype at first marker set (can be > 1 locus)

- b_k = effect from marker genotype at second marker set

- d_ik = interaction between marker genotypes i in 1st marker set and k in 2nd marker set

• At least one of the a_i significantly different from 0: QTL linked to first marker set

• At least one of the b_k significantly different from 0: QTL linked to second marker set

• At least one of the d_ik significantly different from 0: interactions between QTL in sets 1 and 2
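The terms of the interaction model can be illustrated with a table of cell means. This is a minimal sketch, assuming a balanced design with one mean per combination of genotypes at two markers; the function name `decompose_two_way` and the cell means are invented for the example. It only estimates the effects; testing whether a d_ik differs from zero would still need the ANOVA with replicate individuals in each cell.

```python
def decompose_two_way(cell_means):
    """Split cell means keyed by (i, k) into mu, a_i, b_k and interaction d_ik."""
    rows = sorted({i for i, _ in cell_means})
    cols = sorted({k for _, k in cell_means})
    mu = sum(cell_means.values()) / len(cell_means)
    # Main effects: row or column mean minus the overall mean
    a = {i: sum(cell_means[(i, k)] for k in cols) / len(cols) - mu for i in rows}
    b = {k: sum(cell_means[(i, k)] for i in rows) / len(rows) - mu for k in cols}
    # Interaction: what is left after removing the overall mean and main effects
    d = {(i, k): cell_means[(i, k)] - mu - a[i] - b[k] for i in rows for k in cols}
    return mu, a, b, d


# Illustrative cell means for two backcross markers (hypothetical values)
epistatic = {("Aa", "Bb"): 14.0, ("Aa", "bb"): 11.0, ("aa", "Bb"): 11.0, ("aa", "bb"): 10.0}
additive = {("Aa", "Bb"): 12.0, ("Aa", "bb"): 11.0, ("aa", "Bb"): 11.0, ("aa", "bb"): 10.0}

for label, cells in (("epistatic", epistatic), ("additive", additive)):
    mu, a, b, d = decompose_two_way(cells)
    print(label, "mu =", mu, "a =", a, "b =", b, "d =", d)
```

In the first table the interaction terms are ±0.5, while in the second they are all zero because the two marker effects simply add.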

Interval mapping and marker regression

• If marker density is high, ANOVA with individual marker genotypes is effective: “single marker analysis” or “single marker regression”

Problems with single marker mapping using ANOVA

Three important weaknesses:

• Cannot obtain separate estimates of QTL location and QTL effect

• Must discard individuals whose genotypes are missing at the marker

• When markers are sparse, the QTL may be quite far from all markers, and so the power for QTL detection will decrease

Interval mapping

• Can use probability estimates for the genotypes in intervals between markers

• Move the QTL position every 2cM from M1 to M2 and draw the profile of the F value. The peak of the profile corresponds to the best estimate of the QTL position

[Figure: profile of F-value against testing position along a chromosome carrying markers M1 to M5]
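The probability estimates used between markers can be sketched as follows. This is a minimal example, assuming a backcross, no crossover interference and the Haldane map function to convert cM to recombination fractions; none of these choices is specified in the lecture, and the function names are hypothetical. For a putative QTL position inside an interval, it returns the probability that an individual is Qq given its genotypes at the two flanking markers.

```python
import math
from fractions import Fraction


def haldane_r(d_cm):
    """Recombination fraction for a map distance in cM (Haldane map function)."""
    return 0.5 * (1 - math.exp(-2 * d_cm / 100))


def prob_qtl_het(m1_het, m2_het, r1, r2):
    """Pr(QTL is Qq | flanking marker genotypes) in a backcross.

    m1_het, m2_het: True if the flanking marker is heterozygous (Mm).
    r1, r2: recombination fractions marker 1-QTL and QTL-marker 2.
    """
    def weight(q_het):
        # A change of state between adjacent loci requires a recombination
        w1 = r1 if q_het != m1_het else 1 - r1
        w2 = r2 if q_het != m2_het else 1 - r2
        return w1 * w2

    return weight(True) / (weight(True) + weight(False))


# Scan a 10 cM interval every 2 cM for an individual that is Mm at M1 and mm at M2
interval_cm = 10
for pos in range(0, interval_cm + 1, 2):
    r1, r2 = haldane_r(pos), haldane_r(interval_cm - pos)
    print(pos, "cM:", round(prob_qtl_het(True, False, r1, r2), 3))
```

In a regression-based scan these probabilities would stand in for the unobserved QTL genotype at each testing position, and the F value would be recomputed at every step.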

Interval mapping implementation

[Figure: interval mapping by regression (QTL Express): F-ratio profile, with the F-ratio axis running from 0 to 15]

• Carry out a QTL scan step-wise: once a significant QTL has been identified, other markers are tested for their ability to explain the residual variation

• Known QTL are said to be “fixed” or “co-factors” in the regression

Interval mapping with regression approach

• Consider a marker interval M1-M2. We assume that a QTL is located at a particular position between the two markers (r1 and θ are fixed)

• The phenotypic value for individual i affected by a QTL can be expressed as:

y_i = μ + a* x*_i + e_i, i = 1, …, n (latent model)

- μ is the overall mean

- x*_i is the indicator variable for QTL genotypes: x*_i = 1 for Qq; 0 for qq

- a* is the additive effect of the putative QTL on the trait

- e_i is the residual error, e_i ~ N(0, σ^2)

• With response variable y_i and explanatory variable x_i, a regression model is constructed

Advantages and disadvantages of interval mapping

• Advantages:

- the position of the QTL can be inferred by a support interval

- the estimated position and effects of the QTL tend to be asymptotically unbiased if there is only one segregating QTL on a chromosome

- method requires fewer individuals

• Disadvantages:

- this is not an interval test

- even when there is no QTL within an interval, the likelihood profile on the interval can still exceed the threshold if there is a QTL nearby

- if there is more than one QTL on a chromosome, the test statistic at the position being tested will be affected by all of them, and the estimated positions and effects are likely to be biased

- not efficient to use only two markers at a time for testing

Flanking methods and Maximum likelihood

Flanking marker methods have been the most popular analysis techniques over recent years

• Due to their accuracy and level of characterisation of the putative QTL: they combine both detection and estimation of QTL effects and position

• Two basic techniques:

- Maximum likelihood

- Maximum likelihood estimation through regression

• Three methods for estimating likelihood:

- Single marker maximum likelihood (least power)

- Flanking marker maximum likelihood (most versatile)

- Order restricted interval mapping (most power)

Estimating the QTL position (θ): Likelihood maps

In each method a likelihood map is produced:

• View θ as a variable being estimated (derive log-likelihood equation for MLE of θ)

• (LA / LO) = ratio of the likelihood of the alternative hypothesis (QTL present) to the likelihood of the null hypothesis (no QTL in the marker interval)

LOD (Log of the Odds) = log10 (LA / LO)

[Figure: LOD score (0 to 8) against chromosome position, showing the significance threshold, the estimated QTL location and the support interval]
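For a regression-based analysis the LOD score can be obtained directly from residual sums of squares. The sketch below assumes normally distributed residuals with the variance estimated by maximum likelihood under each hypothesis, and uses the convention that a larger LOD favours the QTL model; the function names are hypothetical. Under those assumptions the likelihood ratio statistic is n ln(RSS0 / RSS1), and dividing by 2 ln(10) gives the LOD.

```python
import math


def lod_from_rss(rss_null, rss_qtl, n):
    """LOD score from residual sums of squares without and with the QTL term."""
    return (n / 2) * math.log10(rss_null / rss_qtl)


def lr_from_lod(lod):
    """Convert a LOD score to the likelihood ratio (LR) test statistic."""
    return 2 * math.log(10) * lod


# Illustrative numbers only: 9 individuals, RSS of 30 with no QTL and 6 with a QTL
lod = lod_from_rss(rss_null=30.0, rss_qtl=6.0, n=9)
print("LOD =", round(lod, 3), "LR =", round(lr_from_lod(lod), 3))
```

Evaluating this at each testing position along the chromosome gives the likelihood map, and the peak is the estimated QTL location.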

Composite interval mapping (CIM)

• View θ as a fixed parameter, assume the QTL is located at a particular position

• Uses multiple markers as additional factors (marker cofactors)

[Figure: markers i-1, i, i+1 and i+2 along a chromosome, with the interval being mapped marked]

Method:

• Predict QTL marker genotype every x cM

• Carry out a likelihood ratio (LR) test for QTL effect every x cM

• Combines MLE and multiple regression methods

• Five different types of markers are considered for the regression model, depending on the characteristics of the chromosome region:

- markers surrounding the QTL of interest

- linked & unlinked markers within the QTL region

- linked & unlinked markers outside the QTL region

Permutation testing to determine experiment-wide significance thresholds

• Multiple testing problem: how often are random QTL effects of a certain magnitude detected in similar datasets?

• Method:

- create a large number of ‘random empirical’ datasets

- take your marker data and randomly reassign the phenotypes back to the marker genotypes

- repeat the QTL detection process

- record the highest LR produced for a ‘random QTL’ anywhere in the map

- repeat the whole process > 500 times

- record the magnitude of the lowest ‘random QTL’ observed in the top 5% of LR results = threshold

[Figure: distribution of the highest LR from the random datasets, divided into the lower 95% and the top 5%]
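The permutation procedure can be sketched in a few lines. In this minimal example the test statistic is the largest absolute difference in mean phenotype between the two genotype classes over all markers; this is a simple stand-in for the LR statistic of the lecture, chosen only to keep the code short. The function names, the coding of genotypes as 0/1 and the simulated data are all assumptions.

```python
import random
import statistics


def max_marker_stat(markers, phenotypes):
    """Largest absolute difference in class means over all markers (0/1 genotypes)."""
    best = 0.0
    for genotypes in markers:
        g1 = [y for g, y in zip(genotypes, phenotypes) if g == 1]
        g0 = [y for g, y in zip(genotypes, phenotypes) if g == 0]
        if g1 and g0:  # skip markers where one class is empty
            best = max(best, abs(statistics.mean(g1) - statistics.mean(g0)))
    return best


def permutation_threshold(markers, phenotypes, n_perm=1000, alpha=0.05, seed=0):
    """Lowest value in the top alpha fraction of per-permutation maximum statistics."""
    rng = random.Random(seed)
    shuffled = list(phenotypes)
    maxima = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # reassign phenotypes to marker genotypes at random
        maxima.append(max_marker_stat(markers, shuffled))
    maxima.sort()
    n_top = max(1, round(alpha * n_perm))
    return maxima[-n_top]


# Simulated data: 5 markers scored on 40 individuals, phenotype unrelated to markers
rng = random.Random(42)
markers = [[rng.randint(0, 1) for _ in range(40)] for _ in range(5)]
phenotypes = [rng.gauss(10.0, 1.0) for _ in range(40)]
threshold = permutation_threshold(markers, phenotypes)
observed = max_marker_stat(markers, phenotypes)
print("observed:", round(observed, 3), "threshold:", round(threshold, 3))
```

Shuffling the phenotypes keeps the marker data, and therefore the linkage structure of the map, intact while breaking any real marker-trait association, which is why the maximum over the whole map gives an experiment-wide threshold.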

Multiple interval mapping

Method:

• Build regression models which include all QTLs (detected first by CIM)

• Use information content (IC) theory to evaluate alternative models

• Allows simultaneous detection and estimation of additive, dominance & epistatic effects

• Uses multiple marker intervals simultaneously

• Aims to map multiple QTLs in a single step

Some examples of the final output

Genetics and genomics of post harvest senescence in broccoli

Vicky Buchanan-Wollaston and Dave Pink (Warwick HRI)

2010 broccoli linkage map (JoinMap): 232 plant lines, 211 loci (189 SSR, 22 AFLP)

[Figure: linkage map of chromosomes 1 to 9]

• Two major QTL for ‘time to yellowing’ confirmed on 2010 broccoli map

[Figure: QTL positions on Chr 1 and Chr 9 (MapQTL, permutation test with 10,000 iterations). Labels on the figure: 30.6% of variation; 22.4% of variation; 3.8 LOD, p >0.01; 3.8 LOD, p >0.01; 4.8 LOD, p >0.001; 7 cM; 1.7 cM]

• REML calculated: 64.4 % of line mean variation is genetic

QTLs for senescence traits in broccoli