How can quantitative traits be mapped?
CH927 Quantitative Genomics, Lecture 2
Lecture objectives
By the end of this lecture you should be able to explain:
• What the main steps in QTL mapping are
• What the different methods for QTL analysis are, and under which experimental conditions they should be used:
- Single marker
- Interval/Flanking Mapping
- Composite Interval Mapping (CIM)
- Multiple Interval Mapping (MIM)
• What the different statistical methods for QTL analysis are, and what they can predict about QTLs:
- t-test
- ANOVA
- multiple regression
- linear regression
• The basis for QTL mapping has been known for over 70 years, but a lack of genetic markers prevented widespread use until the mid-1980s
• With DNA sequencing, the number & density of markers have grown
• Also, more statistically-sophisticated mapping methods have been developed
1. Score a population for (i) a trait, and (ii) distribution of genome markers
2. Identify regions of the genome containing QTLs based on occurrence of a phenotype-marker association that is significantly more likely than chance
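Association of phenotypes with markers
• Results from marker A/a: suggests that the gene is very close to the marker
• Results from marker B/b: suggests that the gene is not linked to the marker
[Figure: nine individuals scored at markers A/a and B/b and for phenotype G/g, written in the order A/a, G/g, B/b: aGB, aGb, AgB, aGb, Agb, Agb, aGB, aGb, AgB. Allele a always occurs with G and allele A with g; B and b each occur with both G and g.]
A/a and B/b = molecular scores
G/g = phenotypic score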
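The reasoning on this slide can be sketched as a simple count. The example below is only an illustration: it takes the nine three-letter genotypes from the figure (read here as marker A/a, phenotype G/g, marker B/b, which is an assumption about the letter order) and tabulates how often each marker allele occurs with each phenotype. The function name `cooccurrence` is hypothetical.

```python
from collections import Counter

# Genotype strings read off the slide figure, assumed order: A/a, G/g, B/b.
individuals = ["aGB", "aGb", "AgB", "aGb", "Agb", "Agb", "aGB", "aGb", "AgB"]


def cooccurrence(samples, marker_index, phenotype_index=1):
    """Count (marker allele, phenotype) pairs across individuals."""
    return Counter((s[marker_index], s[phenotype_index]) for s in samples)


table_a = cooccurrence(individuals, marker_index=0)  # marker A/a against G/g
table_b = cooccurrence(individuals, marker_index=2)  # marker B/b against G/g

print("A/a vs G/g:", dict(table_a))
print("B/b vs G/g:", dict(table_b))
```

Marker A/a shows only two of the four possible allele-phenotype combinations (complete association), whereas marker B/b shows all four, which is the pattern expected when the marker is not linked to the gene.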
This is a generalisation of the principle... but for only one gene. We need to consider Quantitative Trait Loci (multiple).
[Figure: individuals scored at eight loci (A/a, G/g, C/c, H/h, B/b, J/j, K/k, D/d), giving multi-locus genotypes such as aGcHBJkD, AgcHbjkd and aGcHBJKd.]
Objectives of QTL analysis
1. Score a population for (i) a trait, and (ii) distribution of genome markers
2. Identify regions of the genome containing QTLs based on occurrence of a phenotype-marker association that is significantly more likely than chance
3. Estimate the effects of the QTLs on the quantitative trait:
- many genes with small effect each or few genes with large effect each?
- their effects on the trait: is gene action additive or dominant?
- their positions in the genome: linkage and association, epistasis
- their interaction with the environment
4. Identify candidate genes underlying the QTL and thus the trait
QTL analysis can be classified by the type of progeny used
• All of the different progenies are derived from the same reference population
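[Figure: P1 (MMQQ) x P2 (mmqq) gives the F1 (MmQq). M = marker genotype; Q = QTL genotype.]
• From this reference population different progenies can be produced
[Figure: progenies derived from the F1 (MmQq). Labels in the figure: self; F2; self x 5; F7 (RILs); SI lines; TC1, TC2, TC3, TC4 (from crosses x P3 and x P4).]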
Backcrosses and Near Isogenic Lines (NILs)
[Figure: backcrossing scheme. Labels in the figure: P2; x (cross); self; F2; BC1 (Backcross 1) F1: use for QTL mapping; BC1 F2; BC1 F3; BC2 F1; BC2 F2; BC2 F3.]
Near Isogenic Lines
Isolate part of genome A of interest
Rapid generation of material for QTL analysis
To map a quantitative trait:
1. Make a cross and generate marker data
- Type of mapping population (e.g. RIL)
2. Generate linkage maps
- Genome size, genome coverage
3. Collect phenotypic measurements
- Evaluate in a uniform environment
- Evaluate in multiple environments
- Data transformation (to approach a normal distribution)
Total variance = genetic variance + environmental variance:
$V_T = V_G + V_E$
Environmental variance includes: heterogeneous environment, stochastic events, measurement error
[Figure: frequency distributions of trait value for genotypes A1/A1, A1/A2 and A2/A2.]
Assumes genes act additively (i.e. no epistasis) and that their effects are not conditional on environment; otherwise $V_T = V_G + V_E + V_{G \times G} + V_{G \times E}$
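As a rough illustration of the additive partition of variance, the sketch below splits the variance of a small set of trait values into a between-genotype part and a within-genotype part. The numbers are invented for illustration, and treating the between-genotype variance at a single locus as the genetic variance and the within-genotype variance as the environmental variance is a simplifying assumption of this example, not something stated on the slide.

```python
from statistics import fmean, pvariance

# Hypothetical trait values grouped by genotype (illustrative numbers only).
trait_by_genotype = {
    "A1/A1": [10.0, 11.0, 9.0, 10.0],
    "A1/A2": [13.0, 12.0, 14.0, 13.0],
    "A2/A2": [16.0, 15.0, 17.0, 16.0],
}

all_values = [v for values in trait_by_genotype.values() for v in values]
n = len(all_values)
grand_mean = fmean(all_values)

# Total variance: spread of all trait values around the grand mean.
v_total = pvariance(all_values)

# Between-genotype variance: genotype means around the grand mean,
# weighted by the number of individuals in each genotype class.
v_genetic = sum(
    len(values) * (fmean(values) - grand_mean) ** 2
    for values in trait_by_genotype.values()
) / n

# Within-genotype variance: individuals around their own genotype mean.
v_environment = sum(
    (v - fmean(values)) ** 2
    for values in trait_by_genotype.values()
    for v in values
) / n

print(v_total, v_genetic, v_environment)
```

With population (divide-by-n) variances the two components add up exactly to the total, which is the identity the slide relies on.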
4. The statistical machinery for QTL mapping
Several analysis frameworks for marker-QTL associations:
- Single marker tests (t-test, F-test or linear regression)
- Interval/Flanking Mapping (IM) (pair of markers simultaneously)
- Composite Interval Mapping (CIM) (analysis of a marker interval, flanked by adjacent markers, ML-based)
- Multiple Interval Mapping (MIM)
Four main analysis techniques:
• Simple t-test: used to evaluate the presence of a QTL through statistical differences between two marker genotypes
• ANOVA (marker regression): detects marker differences when there are more than two marker genotypes. Produces a ranking of genotypes, in order of phenotypic effect for the trait of interest, and tests for significant differences between genotypes
• Multiple regression: simple remodelling of the ANOVA technique in regression terms, with the same ranking and testing for differences
• Linear regression: most complex point analysis method, allowing different characteristics of the QTL to be investigated, including dominance effects, additive effects, genotype-environment interactions and epistasis
Probabilities and t-tests
Basic mapping format: conditional probabilities
• The conditional probability that the QTL genotype is $Q_k$, given that the marker genotype is $M_j$:
$\Pr(Q_k \mid M_j) = \Pr(Q_k M_j) / \Pr(M_j)$
• Calculate this in an F2 from:
- gamete frequencies
- marker genotype probabilities
• Consider a QTL linked to a marker (recombination fraction = c)
• In the F2, $\mathrm{freq}(MQ) = \mathrm{freq}(mq) = (1-c)/2$ and $\mathrm{freq}(mQ) = \mathrm{freq}(Mq) = c/2$
[Figure: P1 (MMQQ) x P2 (mmqq) gives the F1 (MmQq), which is selfed to give the F2.]
Marker genotypes = observed
QTL genotypes = missing
• Hence, for marker genotype MM the joint probabilities are:
$\Pr(MMQQ) = \Pr(MQ)\Pr(MQ) = (1-c)^2/4$
$\Pr(MMQq) = 2\Pr(MQ)\Pr(Mq) = 2c(1-c)/4$
$\Pr(MMqq) = \Pr(Mq)\Pr(Mq) = c^2/4$
• Since $\Pr(MM) = 1/4$, the conditional probabilities become:
$\Pr(QQ \mid MM) = \Pr(MMQQ)/\Pr(MM) = (1-c)^2$
$\Pr(Qq \mid MM) = \Pr(MMQq)/\Pr(MM) = 2c(1-c)$
$\Pr(qq \mid MM) = \Pr(MMqq)/\Pr(MM) = c^2$
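The conditional probabilities for marker genotype MM can be checked numerically. The short sketch below follows the same steps as the slides (gamete frequencies, then joint probabilities, then division by Pr(MM)); the function name `qtl_probs_given_marker_MM` is hypothetical and the value c = 0.1 is just an example.

```python
def qtl_probs_given_marker_MM(c):
    """Pr(QTL genotype | marker genotype MM) in an F2, recombination fraction c."""
    # Frequencies of the two M-carrying gametes produced by the MmQq F1.
    freq_MQ = (1 - c) / 2
    freq_Mq = c / 2
    # Joint probabilities of marker genotype MM with each QTL genotype.
    joint = {
        "QQ": freq_MQ * freq_MQ,
        "Qq": 2 * freq_MQ * freq_Mq,
        "qq": freq_Mq * freq_Mq,
    }
    pr_MM = sum(joint.values())  # equals 1/4 whatever the value of c
    return {genotype: p / pr_MM for genotype, p in joint.items()}


probs = qtl_probs_given_marker_MM(0.1)
print(probs)
```

For c = 0.1 the three values are (1-c)^2 = 0.81, 2c(1-c) = 0.18 and c^2 = 0.01, up to floating-point rounding.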
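Using a t-test to probe a QTL
• e.g. backcross with two genes: marker (alleles M, m) and QTL (alleles Q, q)
• These two genes are linked with a recombination fraction of c

Genotype      MmQq      Mmqq   mmQq   mmqq
Frequency     (1-c)/2   c/2    c/2    (1-c)/2
Mean effect   m+a       m      m+a    m

• Mean of marker genotype Mm (each marker class has total frequency 1/2; in the formulas m is the trait mean of qq individuals, not the marker allele):
$m_1 = \frac{\frac{1-c}{2}(m+a) + \frac{c}{2}m}{1/2} = m + (1-c)a$
• Mean of marker genotype mm:
$m_0 = \frac{\frac{c}{2}(m+a) + \frac{1-c}{2}m}{1/2} = m + ca$
• The expected difference between the two marker classes is therefore $m_1 - m_0 = (1-2c)a$
• If the trait mean is significantly different for the genotypes at a marker locus, the marker is linked to a QTL
• A small difference between marker genotype means (e.g. MM-mm) can reflect either:
- small effect, tight linkage
- large effect, loose linkage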
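A small simulation can make the backcross argument concrete. The sketch below is an illustration under stated assumptions: normally distributed residuals, invented parameter values (c = 0.1, m = 10, a = 2, residual standard deviation 1) and a pooled-variance two-sample t statistic. The function names `simulate_backcross` and `pooled_t` are hypothetical.

```python
import math
import random
from statistics import fmean, variance


def simulate_backcross(n, c, m, a, sigma, rng):
    """Simulate n backcross progeny; returns (marker, trait) pairs.

    marker is 1 for Mm and 0 for mm. The QTL genotype matches the marker
    unless a recombination (probability c) has occurred between them.
    """
    progeny = []
    for _ in range(n):
        marker = rng.randint(0, 1)
        qtl = 1 - marker if rng.random() < c else marker  # 1 = Qq, 0 = qq
        progeny.append((marker, m + a * qtl + rng.gauss(0.0, sigma)))
    return progeny


def pooled_t(group1, group0):
    """Pooled-variance t statistic for the difference between two group means."""
    n1, n0 = len(group1), len(group0)
    pooled_var = ((n1 - 1) * variance(group1) + (n0 - 1) * variance(group0)) / (n1 + n0 - 2)
    return (fmean(group1) - fmean(group0)) / math.sqrt(pooled_var * (1 / n1 + 1 / n0))


rng = random.Random(1)  # fixed seed so the sketch is repeatable
c, m, a = 0.1, 10.0, 2.0
progeny = simulate_backcross(n=2000, c=c, m=m, a=a, sigma=1.0, rng=rng)
trait_Mm = [y for marker, y in progeny if marker == 1]
trait_mm = [y for marker, y in progeny if marker == 0]

observed_difference = fmean(trait_Mm) - fmean(trait_mm)
expected_difference = (1 - 2 * c) * a  # difference between the two marker class means
t_stat = pooled_t(trait_Mm, trait_mm)
print(observed_difference, expected_difference, t_stat)
```

Because the marker difference only estimates (1-2c)a, the same observed difference would arise from a = 2 at c = 0.1 and from a = 4 at c = 0.3, which is why a single marker cannot separate QTL effect from QTL position.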
ANOVA and single marker regression
• Partition variance: genetically-determined and environmental components
• Model (there is a QTL linked to a marker) is tested against the null hypothesis of no QTL
[Figure: trait value plotted against genotype (A1/A1, A1/A2, A2/A2).]
Partitioning of variance: a simple ANOVA model
[Figure: trait value against genotype (A1/A1, A1/A2, A2/A2), with the grand mean marked.]
• Total sum of squares, SST: calculate the grand mean and the deviation of each individual from it, square each deviation, and sum the squared deviations over the population:
$SST = \sum_i (y_i - \bar{y})^2$
Partitioning of variance methodology
• SST degrees of freedom = n - 1 (n = 23 in the example)
• Total mean square, $MST = SST/(n-1)$ = total variance
Partitioning of variance: fitting the model
• Calculate the mean for each genotype group
• SSR = residual sum of squares = sum of (deviation of each individual from its genotype mean)^2
• SSR degrees of freedom = (n - 1) - (#genotypes - 1) = n - #genotypes (20 for n = 23 and three genotypes)
• Residual mean square, MSR = SSR / (n - #genotypes) = variance not explained by the model (i.e. not explained by this QTL)
Genetic variance and testing the model
• Model sum of squares, SSM = sum over genotypes of (grand mean - genotype mean)^2 x (# individuals with that genotype)
• SSM degrees of freedom = 2
• Genetic variance, MSM = SSM / 2
• The sums of squares (not the mean squares) are additive: SST = SSM + SSR
• So it is easier to calculate SSM = SST - SSR
• To test whether the QTL explains a significant amount of the variation, calculate the ratio of model to residual variance: F-ratio = MSM / MSR
• Look up the minimum value of F that is unlikely to have occurred by chance, given 2 d.f. for MSM and 20 for MSR (F ≥ 3.49 for p ≤ 0.05 in this case)
• If F exceeds this value, we can reject the null hypothesis of no QTL
• Proportion of variance explained by the QTL = SSM / SST
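The ANOVA steps can be followed through on a small data set. The sketch below uses invented trait values for 23 individuals in three genotype classes (chosen only so that the degrees of freedom match the 2 and 20 quoted on the slide) and compares the F-ratio with the 3.49 threshold given there. It partitions sums of squares, using the identity SST = SSM + SSR, before dividing by degrees of freedom.

```python
from statistics import fmean

# Hypothetical trait values for n = 23 individuals (illustrative numbers only).
groups = {
    "A1/A1": [10, 11, 9, 10, 12, 11, 10, 9],
    "A1/A2": [12, 13, 12, 14, 13, 11, 12, 13],
    "A2/A2": [15, 14, 16, 15, 13, 15, 14],
}

all_values = [y for values in groups.values() for y in values]
n = len(all_values)
grand_mean = fmean(all_values)

# Total, model (between-genotype) and residual (within-genotype) sums of squares.
sst = sum((y - grand_mean) ** 2 for y in all_values)
ssm = sum(len(values) * (fmean(values) - grand_mean) ** 2 for values in groups.values())
ssr = sum((y - fmean(values)) ** 2 for values in groups.values() for y in values)

df_model = len(groups) - 1     # 2 for three genotypes
df_residual = n - len(groups)  # 20 for n = 23

msm = ssm / df_model
msr = ssr / df_residual
f_ratio = msm / msr
variance_explained = ssm / sst

F_CRITICAL = 3.49  # threshold quoted on the slide for 2 and 20 d.f., p <= 0.05
print(f_ratio, f_ratio >= F_CRITICAL, variance_explained)
```

Looking up the critical value is left to a table here; a statistics library would normally supply the exact p-value for the F-ratio.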
• Incorporate terms into the model to estimate:
- The additive effect of the alleles, a = half the difference between the averages for the two homozygotes. It can be positive or negative, depending on which allele is being considered
- The dominance deviation, d = the difference between the average of the heterozygotes and the mid-point of the homozygotes. It can also be positive or negative
• If d = ±a, one allele is completely dominant
• If |d| > |a|, one allele shows over-dominance
• This is essentially a least-squares regression
Estimation of additive and dominance effects
• Mean of marker genotype Mm: $m_1$
• Mean of marker genotype mm: $m_0$

Genotype      MmQq      Mmqq   mmQq   mmqq
Frequency     (1-c)/2   c/2    c/2    (1-c)/2
Mean effect   m+a       m      m+a    m

Additive effects (a):
$(m_1 - m_0)/2 = a(1-2c) = a^*$
Dominance effects (d):
$m_2 - (m_1 - m_0)/2 = d(1-2c) = d^*$
• $a^*$ = estimated additive effects
• $d^*$ = estimated dominance effects
Linear Models for QTL Detection
$y_{mk} = \mu + b_m + e_{mk}$
where:
- $y_{mk}$ = value of the trait in the kth individual of marker genotype m
- $b_m$ = effect of marker genotype m on trait value
• Detection: a QTL is linked to the marker if at least one of the $b_m$ is significantly different from zero
• Estimation (QTL effect and position): have to relate the $b_m$ to the QTL effects and map position
• Uses the linear relationship between the apparent effects of a marker on a quantitative character and the substantial effects of all related QTLs that are linked to that marker
• Differences in the distance between the QTL and the markers alter the factors in this relationship
Detecting epistasis
• One major advantage of linear models is their flexibility
• Test for epistasis between two QTLs: use an ANOVA with an interaction term:
$y = \mu + a_i + b_k + d_{ik} + e$
where:
- $a_i$ = effect from marker genotype at first marker set (can be > 1 locus)
- $b_k$ = effect from marker genotype at second marker set
- $d_{ik}$ = interaction between marker genotypes i in 1st marker set and k in 2nd marker set
• At least one of the $a_i$ significantly different from 0: QTL linked to first marker set
• At least one of the $b_k$ significantly different from 0: QTL linked to second marker set
• At least one of the $d_{ik}$ significantly different from 0: interactions between QTL in sets 1 and 2
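To show what the interaction term measures, the sketch below estimates the terms of a two-factor model from cell means in a balanced two-by-two layout (two marker genotypes at each of two marker sets). The data are invented, the variable names are hypothetical, and only the estimation step is shown: a full analysis would also test each term for significance.

```python
from statistics import fmean

# Hypothetical trait values indexed by (genotype at marker set 1, genotype at marker set 2).
cells = {
    (0, 0): [10.0, 10.2, 9.8],
    (0, 1): [11.0, 11.1, 10.9],
    (1, 0): [12.0, 11.9, 12.1],
    (1, 1): [16.0, 15.8, 16.2],
}

cell_mean = {key: fmean(values) for key, values in cells.items()}
mu = fmean(list(cell_mean.values()))  # overall mean (balanced design)

# Main effects: deviation of each marker genotype's average from the overall mean.
a = {i: fmean([cell_mean[(i, k)] for k in (0, 1)]) - mu for i in (0, 1)}
b = {k: fmean([cell_mean[(i, k)] for i in (0, 1)]) - mu for k in (0, 1)}

# Interaction: what is left of each cell mean after removing both main effects.
d = {(i, k): cell_mean[(i, k)] - mu - a[i] - b[k] for (i, k) in cells}

print(a, b, d)
```

If the two marker sets acted purely additively, every interaction term would be close to zero; here the (1, 1) cell lies above the additive prediction, so the interaction terms are non-zero.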
Interval mapping and marker regression
• If marker density is high, ANOVA with individual marker genotypes is effective: “single marker analysis” or “single marker regression”
Problems with single marker mapping using ANOVA
Three important weaknesses:
• It does not give separate estimates of QTL location and QTL effect
• Must discard individuals whose genotypes are missing at the marker
• When markers are sparse, the QTL may be quite far from all markers, and so the power for QTL detection will decrease
Interval mapping
• Can use probability estimates for the genotypes in intervals between markers
• Move the QTL position every 2 cM from M1 to M2 and draw the profile of the F value. The peak of the profile corresponds to the best estimate of the QTL position
[Figure: F-value plotted against testing position along markers M1 to M5.]
Interval mapping implementation
[Figure: interval mapping by regression (QTL Express): F-ratio profile, y-axis from 0 to 15.]
• Carry out a QTL scan step-wise: once a significant QTL has been identified, other markers are tested for their ability to explain the residual variation
• Known QTL are said to be “fixed” or “co-factors” in the regression
Interval mapping with regression approach
• Consider a marker interval M1-M2. We assume that a QTL is located at a particular position between the two markers ($r_1$ and θ are fixed)
• With response variable $y_i$ and explanatory variable $x_i$, a regression model is constructed
• The phenotypic value for individual i affected by a QTL can be expressed as:
$y_i = \mu + a^* x_i^* + e_i, \quad i = 1, \ldots, n$ (latent model)
where:
- $\mu$ is the overall mean
- $x_i^*$ is the indicator variable for QTL genotypes: $x_i^* = 1$ for Qq; 0 for qq
- $a^*$ is the additive effect of the putative QTL on the trait
- $e_i$ is the residual error, $e_i \sim N(0, \sigma^2)$
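Because the QTL genotype is not observed, one common way to fit a model of this kind is to replace the indicator by its expected value given the two flanking markers and then regress the trait on that expectation. This is regression interval mapping in the Haley-Knott style, named here as an assumption about the approach rather than taken from the slide. The sketch below does this for a backcross at one fixed test position, assuming no crossover interference; the data and the function names are hypothetical.

```python
from statistics import fmean


def prob_het_given_flanking(m1, m2, r1, r2):
    """Pr(QTL genotype is Qq | flanking marker genotypes) in a backcross.

    m1 and m2 are 1 for the heterozygous marker genotype and 0 for the
    homozygote; r1 and r2 are the recombination fractions between the QTL
    and the left and right marker. Assumes no crossover interference.
    """
    def weight(qtl):
        pr_qtl_given_m1 = (1 - r1) if qtl == m1 else r1
        pr_m2_given_qtl = (1 - r2) if qtl == m2 else r2
        return pr_qtl_given_m1 * pr_m2_given_qtl

    return weight(1) / (weight(1) + weight(0))


def least_squares(x, y):
    """Return (intercept, slope) of the simple regression of y on x."""
    mean_x, mean_y = fmean(x), fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical records: ((left marker, right marker), trait value).
records = [
    ((1, 1), 12.1), ((1, 1), 11.8), ((1, 0), 11.0), ((0, 1), 10.9),
    ((0, 0), 10.1), ((0, 0), 9.9), ((1, 0), 11.5), ((0, 1), 10.2),
]
r1, r2 = 0.1, 0.1  # the test position is fixed, so r1 and r2 are known

x = [prob_het_given_flanking(m1, m2, r1, r2) for (m1, m2), _ in records]
y = [trait for _, trait in records]
mu_hat, a_hat = least_squares(x, y)
print(mu_hat, a_hat)
```

Repeating the fit at successive positions along the interval, and plotting the test statistic from each fit, gives the profile whose peak is taken as the estimated QTL position.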
Advantages and disadvantages of interval mapping
• Advantages:
- the position of the QTL can be inferred by a support interval
- the estimated position and effects of the QTL tend to be asymptotically unbiased if there is only one segregating QTL on a chromosome
- the method requires fewer individuals
• Disadvantages:
- this is not an interval test
- even when there is no QTL within an interval, the likelihood profile on the interval can still exceed the threshold if there is a QTL nearby
- if there is more than one QTL on a chromosome, the test statistic at the position being tested will be affected by all of those QTL, as will the estimated positions
- it is not efficient to use only two markers at a time for testing
Flanking methods and Maximum likelihood
Flanking marker methods have been the most popular analysis techniques over recent years
• Due to their accuracy and level of characterisation of the putative QTL: they combine both detection and estimation of QTL effects and position
• Two basic techniques:
- Maximum likelihood
- Maximum likelihood estimation through regression
• Three methods for estimating likelihood:
- Single marker maximum likelihood (least power)
- Flanking marker maximum likelihood (most versatile)
- Order restricted interval mapping (most power)
Estimating the QTL position (θ): Likelihood maps
• View θ as a variable being estimated (derive log-likelihood equation for MLE of θ)
• View θ as a fixed parameter, assume the QTL is located at a particular position
In each method a likelihood map is produced:
• $L_O / L_A$ = ratio of the likelihood of the null hypothesis (no QTL in the marker interval) to the likelihood of the alternative hypothesis (QTL present)
• LOD (Log of the Odds) $= -\log_{10}(L_O / L_A) = \log_{10}(L_A / L_O)$, so that stronger evidence for a QTL gives a larger positive LOD
[Figure: LOD score (0 to 8) against chromosome position, showing the significance threshold, the estimated QTL location at the peak of the profile and the support interval.]
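A LOD score can be computed directly in the simplest case. The sketch below assumes a backcross, a QTL located exactly at the marker and normally distributed residuals with a common variance, so it is a single-marker simplification rather than the full interval likelihood. Under those assumptions the maximised likelihoods depend only on the residual sums of squares under the null (one mean) and the alternative (one mean per marker class). The function name `lod_two_classes` and the data are hypothetical.

```python
import math
from statistics import fmean


def lod_two_classes(group_het, group_hom):
    """LOD for a two-class normal model (QTL present) against a single mean (no QTL)."""
    pooled = group_het + group_hom
    n = len(pooled)
    grand_mean = fmean(pooled)
    rss_null = sum((y - grand_mean) ** 2 for y in pooled)
    rss_alt = (
        sum((y - fmean(group_het)) ** 2 for y in group_het)
        + sum((y - fmean(group_hom)) ** 2 for y in group_hom)
    )
    # With maximum-likelihood variance estimates, L_A / L_O = (RSS_null / RSS_alt) ** (n / 2).
    return (n / 2) * math.log10(rss_null / rss_alt)


# Hypothetical trait values for the two marker classes.
lod = lod_two_classes([12.0, 13.0, 11.0, 12.0], [10.0, 9.0, 11.0, 10.0])
print(lod)
```

Evaluating a likelihood ratio of this kind at each test position along the chromosome gives the likelihood map, and the result is positive whenever the alternative fits better than the null.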
Composite interval mapping (CIM)
• Uses multiple markers as additional factors (marker cofactors)
[Figure: markers i-1, i, i+1 and i+2 along a chromosome, with the interval being mapped indicated.]
Method:
• Predict QTL marker genotype every x cM
• Carry out an LR test for QTL effect every x cM
• Combines MLE and multiple regression methods
• Five different types of markers are considered for the regression model, depending on the characteristics of the chromosome region:
- markers surrounding the QTL of interest
- linked & unlinked markers within the QTL region
- linked & unlinked markers outside the QTL region
Permutation testing to determine experiment-wide significance thresholds
• Multiple testing problem: how often are random QTL effects of a certain magnitude detected in similar datasets?
• Method:
- create a large number of ‘random empirical’ datasets: take your marker data and randomly reassign the phenotypes back to the marker genotypes
- repeat the QTL detection process
- record the highest LR produced for a ‘random QTL’ anywhere in the map
- repeat the whole process > 500 times
- record the magnitude of the lowest ‘random QTL’ observed in the top 5% of LR results = threshold
[Figure: distribution of the highest ‘random QTL’ LR values, divided into the lower 95% and the top 5%.]
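The permutation procedure can be sketched in a few lines. In the example below the test statistic is the absolute difference in mean phenotype between two marker classes, used as a simple stand-in for the LR statistic named above; the data are simulated, and the function names and parameter values (500 permutations, 10 markers, 100 individuals) are assumptions made for the illustration.

```python
import random
from statistics import fmean


def mean_difference(genotypes, phenotypes):
    """Absolute difference in mean phenotype between genotype classes 1 and 0."""
    ones = [y for g, y in zip(genotypes, phenotypes) if g == 1]
    zeros = [y for g, y in zip(genotypes, phenotypes) if g == 0]
    return abs(fmean(ones) - fmean(zeros))


def highest_statistic(markers, phenotypes):
    """Largest test statistic found at any marker in the map."""
    return max(mean_difference(genotypes, phenotypes) for genotypes in markers)


def permutation_threshold(markers, phenotypes, n_perm=500, alpha=0.05, seed=0):
    """Experiment-wide threshold from the permutation distribution of the maximum."""
    rng = random.Random(seed)
    shuffled = list(phenotypes)
    maxima = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # reassign phenotypes to marker genotypes at random
        maxima.append(highest_statistic(markers, shuffled))
    maxima.sort()
    n_top = max(1, int(alpha * n_perm))
    return maxima[-n_top]  # lowest value within the top alpha fraction


# Hypothetical data: 100 backcross individuals, 10 independent markers,
# with a QTL of effect 2.0 sitting exactly at the first marker.
data_rng = random.Random(42)
markers = [[data_rng.randint(0, 1) for _ in range(100)] for _ in range(10)]
phenotypes = [2.0 * g + data_rng.gauss(0.0, 1.0) for g in markers[0]]

threshold = permutation_threshold(markers, phenotypes)
observed = mean_difference(markers[0], phenotypes)
print(threshold, observed, observed > threshold)
```

Shuffling only the phenotypes keeps the marker data, and therefore the linkage structure of the map, intact, so the threshold reflects the number and correlation of tests actually carried out.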
Multiple interval mapping
• Uses multiple marker intervals simultaneously
• Aims to map multiple QTLs in a single step
• Allows simultaneous detection and estimation of additive, dominance & epistatic effects
Method:
• Build regression models which include all QTLs (detected first by CIM)
• Use information content (IC) theory to evaluate alternative models
Some examples of the final output
Genetics and genomics of post harvest senescence in broccoli
Vicky Buchanan-Wollaston and Dave Pink (Warwick HRI)
2010 broccoli linkage map (JoinMap): 232 plant lines, 211 loci (189 SSR, 22 AFLP)
[Figure: linkage map of chromosomes 1 to 9.]
QTLs for senescence traits in broccoli
• Two major QTL for ‘time to yellowing’ confirmed on 2010 broccoli map
• MapQTL, permutation test with 10,000 iterations
[Figure: QTL positions on Chr 1 and Chr 9. Labels in the figure: 30.6% of variation; 22.4% of variation; 7 cM; 1.7 cM; 3.8 LOD, p >0.01 (twice); 4.8 LOD, p >0.001.]
• REML calculated: 64.4% of line mean variation is genetic