Lecture 23: Quantitative Traits III
Date: 11/12/02
Single locus backcross regression
Single locus backcross likelihood
F2 – regression, likelihood, etc.
Backcross Model
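Marker Genotype | Count | Marg. Freq. | P(QQ | marker) | P(Qq | marker) | Expected Trait Value
AA | n1 | 0.5 | 1-θ | θ | (1-θ)μ1 + θμ2
Aa | n2 | 0.5 | θ | 1-θ | θμ1 + (1-θ)μ2
μ1 is the genotypic value of QQ.
μ2 is the genotypic value of Qq.
θ is the recombination fraction between the marker and the QTL.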
Backcross – t-test
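$$ t_M = \frac{\hat\mu_{AA} - \hat\mu_{Aa}}{\sqrt{\hat s^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)}} \sim t(\mathrm{df} = N - 2) $$
where μ̂_AA and μ̂_Aa are the observed means in each marker class,
where ŝ² is the pooled variance,
where N = n1 + n2 is the total number of individuals sampled.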
Gen. Freq. Phen.
AA n1 X11,X12,...,X1n1
Aa n2 X21,X22,..., X2n2
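The t-test on the two marker classes can be sketched in a few lines. This is a minimal illustration, not the lecture's own code: the function name `backcross_t` is hypothetical, and the pooled variance is assumed to use N - 2 degrees of freedom, matching the degrees of freedom of the test.

```python
import math

def backcross_t(aa_values, het_values):
    """Pooled two-sample t statistic for marker classes AA and Aa.

    Assumption: the pooled variance divides by N - 2 (not stated on the slide).
    Returns (t, degrees_of_freedom).
    """
    n1, n2 = len(aa_values), len(het_values)
    mean_aa = sum(aa_values) / n1
    mean_het = sum(het_values) / n2
    # Within-class sums of squares, pooled over both marker classes.
    ss = sum((x - mean_aa) ** 2 for x in aa_values)
    ss += sum((x - mean_het) ** 2 for x in het_values)
    s2 = ss / (n1 + n2 - 2)
    t = (mean_aa - mean_het) / math.sqrt(s2 * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

t, df = backcross_t([10, 12, 14], [7, 9, 11])
```

The resulting t would be compared with a t distribution on the returned degrees of freedom.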
Backcross – Linear Regression (BLG)
One may also test the data using a simple linear regression model:
$$ y_j = \beta_0 + \beta_1 x_j + \varepsilon_j $$
where y_j is the trait value for the jth individual and x_j is a dummy variable indicating marker genotype (AA or Aa).
You know that estimates of the coefficients (intercept â, slope b̂) are given by:
$$ \hat b = \frac{\mathrm{Cov}(x, y)}{\mathrm{Var}(x)}, \qquad \hat a = \bar y - \hat b \, \bar x $$
We seek the expectation of these coefficients under a genetic model.
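As a small illustration of the estimators, the sketch below fits the marker regression by hand. The function name `fit_marker_regression` and the coding x = 1 for AA, x = -1 for Aa are assumptions made for the example.

```python
def fit_marker_regression(x, y):
    """Least-squares intercept and slope: b = Cov(x, y) / Var(x), a = ybar - b * xbar."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    cov_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)
    var_x = sum((xi - xbar) ** 2 for xi in x) / (n - 1)
    b = cov_xy / var_x
    a = ybar - b * xbar
    return a, b

# Assumed coding: x = 1 for marker genotype AA, x = -1 for Aa.
x = [1, 1, 1, -1, -1, -1]
y = [10, 12, 14, 7, 9, 11]
a_hat, b_hat = fit_marker_regression(x, y)
```

With equal class sizes and this coding, the slope is half the difference between the two marker-class means and the intercept is the overall mean.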
BLG – Expected Sample Statistics
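To find the expected values under the genetic model, we need the expectation of the sample means and variances. With x_j = 1 for AA and x_j = -1 for Aa:
$$ E(\bar x) = \tfrac12 (1) + \tfrac12 (-1) = 0 $$
$$ E(\hat s_x^2) = \tfrac12 (1)^2 + \tfrac12 (-1)^2 - 0^2 = 1 $$
$$ E(\bar y) = \tfrac12 \left[ (1-\theta)\mu_1 + \theta\mu_2 \right] + \tfrac12 \left[ \theta\mu_1 + (1-\theta)\mu_2 \right] = \tfrac12 (\mu_1 + \mu_2) $$
$$ E(\hat s_{xy}) = \tfrac12 (1) \left[ (1-\theta)\mu_1 + \theta\mu_2 \right] + \tfrac12 (-1) \left[ \theta\mu_1 + (1-\theta)\mu_2 \right] - 0 = \tfrac12 (1-2\theta)(\mu_1 - \mu_2) $$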
BLG – Expected Coefficients
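Recalling the coefficient estimators:
$$ E(\hat b) = \frac{E(\hat s_{xy})}{E(\hat s_x^2)} = \tfrac12 (1-2\theta)(\mu_1 - \mu_2) $$
$$ E(\hat a) = E(\bar y) = \tfrac12 (\mu_1 + \mu_2) $$
Finally, recalling our genetic models:
QTL Genotype | Model 1 | Model 2
QQ | a | 2a
Qq | d | (1+k)a
qq | -a | 0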
BLG – Hypothesis Testing
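We conclude that the expected regression coefficient is:
$$ E(\hat b) = \tfrac12 (1-2\theta)(\mu_1 - \mu_2) = \tfrac12 (1-2\theta)(a - d) = \tfrac12 (1-2\theta)(1-k)a $$
So, again, H0: β1 = 0 holds when any one of the following is true, and rejecting H0 rules out all three:
θ = 0.5 (NO LINKAGE)
a = 0 or a = d = 0 (NO VARIATION)
k = 1 or a = d (COMPLETE DOMINANCE)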
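The confounding of linkage and genetic effect in the expected slope is easy to see numerically. A minimal sketch, assuming the first genetic model (QQ = a, Qq = d); the function name `expected_slope` is illustrative only.

```python
def expected_slope(theta, a, d):
    """Expected backcross regression slope: 0.5 * (1 - 2*theta) * (a - d)."""
    return 0.5 * (1 - 2 * theta) * (a - d)

# Each of the three conditions drives the expected slope to zero.
no_linkage = expected_slope(0.5, a=1.0, d=0.3)          # theta = 0.5
no_variation = expected_slope(0.1, a=0.0, d=0.0)        # a = d = 0
complete_dominance = expected_slope(0.1, a=1.0, d=1.0)  # a = d
linked_additive = expected_slope(0.1, a=1.0, d=0.0)     # nonzero slope
```

A tightly linked marker with a small effect and a loosely linked marker with a large effect can give the same expected slope, which is why a single-marker regression cannot separate the two.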
Backcross – Likelihood (BL)
One may also set up a likelihood function for backcross progeny.
Trait values are assumed approximately normal (lots of little effects added together).
The distribution of trait values for each marker class are assumed to be a mixture of two normals, one for each possible genotype at the QTL.
The mixing proportions are determined by the recombination fraction.
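BL – Distributions Class AA
Suppose θ = 0.2.
[Figure: trait distribution for marker class AA plotted against genotypic value; a mixture of two normals with 80% weight on the QQ component (mean μ1) and 20% on the Qq component (mean μ2).]
BL – Distributions Class Aa
[Figure: trait distribution for marker class Aa plotted against genotypic value; a mixture of two normals with 20% weight on the QQ component (mean μ1) and 80% on the Qq component (mean μ2).]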
BL – Assumptions
Assume the trait variances for the two QTL genotypes in the backcross are equal.
Assume the traits are normally distributed.
Assume there is no marker / trait interaction, so the distributions remain unchanged in both marker classes (i.e. same variances).
BL – Likelihood
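The likelihood function for the backcross is then:
$$ L = \frac{1}{(2\pi\sigma^2)^{N/2}} \prod_{i=1}^{N} \sum_{j=1}^{2} P(Q_j \mid M_i) \exp\left[ -\frac{(y_i - \mu_j)^2}{2\sigma^2} \right] $$
where Q_j is one of the two possible (unknown) genotypes at the QTL and M_i is the marker genotype of individual i.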
BL – Log Likelihood
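Take the log of the likelihood to obtain:
$$ l = \sum_{i=1}^{N} \log \left\{ \sum_{j=1}^{2} P(Q_j \mid M_i) \exp\left[ -\frac{(y_i - \mu_j)^2}{2\sigma^2} \right] \right\} - \frac{N}{2} \log(2\pi\sigma^2) $$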
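The backcross log likelihood can be evaluated directly from trait values and marker genotypes. The following is a minimal sketch under the mixture model described in this lecture; the function name `backcross_loglik` and the string labels "AA" and "Aa" are choices made for the example.

```python
import math

def backcross_loglik(y, marker, mu1, mu2, sigma2, theta):
    """Log likelihood of a two-component normal mixture per marker class.

    marker[i] is "AA" or "Aa"; P(QQ | AA) = 1 - theta and P(QQ | Aa) = theta.
    """
    total = 0.0
    for yi, mi in zip(y, marker):
        p_qq = 1 - theta if mi == "AA" else theta
        mix = p_qq * math.exp(-(yi - mu1) ** 2 / (2 * sigma2))
        mix += (1 - p_qq) * math.exp(-(yi - mu2) ** 2 / (2 * sigma2))
        total += math.log(mix)
    # Normalizing constant shared by both mixture components.
    return total - len(y) / 2 * math.log(2 * math.pi * sigma2)

value = backcross_loglik([9.8, 7.1], ["AA", "Aa"], mu1=10.0, mu2=7.0,
                         sigma2=1.0, theta=0.2)
```

When mu1 equals mu2, the mixing proportions cancel and the value no longer depends on theta, which is the situation under the first null hypothesis.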
BL – Null Hypothesis A
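One null hypothesis of interest is that the mean genotypic values for the two distributions are not in fact different, so
H0: μ1 = μ2 = μ.
In this case, the log likelihood becomes:
$$ l = \sum_{i=1}^{N} \log \exp\left[ -\frac{(y_i - \mu)^2}{2\sigma^2} \right] - \frac{N}{2} \log(2\pi\sigma^2) = -\frac{1}{2\sigma^2} \sum_{i=1}^{N} (y_i - \mu)^2 - \frac{N}{2} \log(2\pi\sigma^2) $$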
BL – Null Hypothesis B
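Another, perhaps more interesting null hypothesis, is that there is no linkage, so
H0: θ = 0.5
Under this assumption, the log likelihood becomes
$$ l = \sum_{i=1}^{N} \log \left\{ \sum_{j=1}^{2} \frac12 \exp\left[ -\frac{(y_i - \mu_j)^2}{2\sigma^2} \right] \right\} - \frac{N}{2} \log(2\pi\sigma^2) $$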
BL – Statistical Test
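The G statistic that is commonly calculated to test for linkage is:
$$ G = 2 \left[ l(\hat\mu_1, \hat\mu_2, \hat\sigma^2, \hat\theta) - l(\hat\mu_1, \hat\mu_2, \hat\sigma^2, \theta = 0.5) \right] \sim \chi^2_{\mathrm{df} = 1} $$
However, this test is less powerful than the t test introduced earlier.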
BL – LOD Scores
Again, LOD scores are commonly used for QTL detection:
$$ \mathrm{lod} = \log_{10} L(\hat\mu_1, \hat\mu_2, \hat\sigma^2, \hat\theta) - \log_{10} L(\hat\mu_1, \hat\mu_2, \hat\sigma^2, \theta = 0.5) $$
where we interpret, as usual, that a lod score of l means the alternative hypothesis is 10^l times as likely as the null hypothesis.
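The G statistic and the lod score are two scalings of the same log-likelihood difference, G = 2 ln(10) × lod. A minimal sketch, assuming natural-log likelihoods as inputs; the function names are illustrative, and the p-value uses the standard identity for a chi-square with one degree of freedom.

```python
import math

def g_statistic(loglik_alt, loglik_null):
    """G = 2 * (l_alt - l_null), with natural-log likelihoods."""
    return 2.0 * (loglik_alt - loglik_null)

def lod_score(loglik_alt, loglik_null):
    """Base-10 log of the likelihood ratio."""
    return (loglik_alt - loglik_null) / math.log(10)

def chi2_df1_pvalue(g):
    """Upper tail of chi-square(1): P(Z^2 > g) = erfc(sqrt(g / 2))."""
    return math.erfc(math.sqrt(g / 2.0))

g = g_statistic(-100.0, -105.0)
lod = lod_score(-100.0, -105.0)
p = chi2_df1_pvalue(g)
```

Because of the fixed factor 2 ln(10), roughly 4.6, a lod threshold can always be restated as a G threshold and vice versa.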
BL – Likelihood Maximization
Analytic solutions are difficult to achieve.
Iterative approaches are generally used (EM, NR).
Combinations of methods are also used. For example, the variance is commonly estimated with the pooled variance:
$$ \hat\sigma^2 = \frac{\sum_j (y_{1j} - \bar y_1)^2 + \sum_j (y_{2j} - \bar y_2)^2}{n_1 + n_2} $$
BL – Likelihood Maximization
To facilitate calculations even more, a grid of θ values with maximization on μ1 and μ2 can be used.
So suppose you have multiple markers with known map position. Then, evaluate a G statistic or lod score for 3 possible locations of the QTL:
Marker | 0 | 0.25m | 0.5m
1 | θ = 0 | θ = f(0.25 m12) | θ = f(0.5 m12)
2 | | |
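The grid idea can be sketched as follows: fix θ at each grid value, maximize over μ1 and μ2, and report the lod score against θ = 0.5. This is only an illustration under stated assumptions: σ² is supplied by the caller (for example a pooled variance), the means are updated with a plain EM iteration started from the marker-class means, and all function names are hypothetical.

```python
import math

def loglik(y, marker, mu1, mu2, sigma2, theta):
    total = 0.0
    for yi, m in zip(y, marker):
        p1 = 1 - theta if m == "AA" else theta  # P(QQ | marker)
        total += math.log(p1 * math.exp(-(yi - mu1) ** 2 / (2 * sigma2))
                          + (1 - p1) * math.exp(-(yi - mu2) ** 2 / (2 * sigma2)))
    return total - len(y) / 2 * math.log(2 * math.pi * sigma2)

def em_means(y, marker, sigma2, theta, iters=100):
    """EM updates for mu1, mu2 with theta and sigma2 held fixed."""
    aa = [yi for yi, m in zip(y, marker) if m == "AA"]
    het = [yi for yi, m in zip(y, marker) if m == "Aa"]
    mu1, mu2 = sum(aa) / len(aa), sum(het) / len(het)
    for _ in range(iters):
        w = []  # posterior probability that each individual is QQ
        for yi, m in zip(y, marker):
            p1 = 1 - theta if m == "AA" else theta
            f1 = p1 * math.exp(-(yi - mu1) ** 2 / (2 * sigma2))
            f2 = (1 - p1) * math.exp(-(yi - mu2) ** 2 / (2 * sigma2))
            w.append(f1 / (f1 + f2))
        mu1 = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
        mu2 = sum((1 - wi) * yi for wi, yi in zip(w, y)) / (len(y) - sum(w))
    return mu1, mu2

def lod_profile(y, marker, sigma2, thetas):
    """lod score at each theta on the grid, relative to theta = 0.5."""
    m1, m2 = em_means(y, marker, sigma2, 0.5)
    null = loglik(y, marker, m1, m2, sigma2, 0.5)
    profile = {}
    for th in thetas:
        mu1, mu2 = em_means(y, marker, sigma2, th)
        profile[th] = (loglik(y, marker, mu1, mu2, sigma2, th) - null) / math.log(10)
    return profile

y = [10, 11, 12, 13, 5, 6, 7, 8]
marker = ["AA"] * 4 + ["Aa"] * 4
profile = lod_profile(y, marker, sigma2=1.25, thetas=[0.0, 0.25, 0.5])
```

At θ = 0 the EM weights are all 0 or 1, so the estimates reduce to the marker-class means. Each θ on the grid would then be mapped back to a chromosome position through the map function f.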
BL – Sample Results
[Figure: LOD score (vertical axis, 0 to 7) plotted against chromosome location (horizontal axis, 0 to 2).]
BL – Caveats
When there is more than one QTL in the same vicinity, the peaks in the LOD score plot may not correspond to QTLs.
Recall that these results are still based on single-locus analysis for which we cannot separate genetic effect from linkage. Thus, there is little good information about QTL location in such a plot, even though it looks like there should be.
BL – Comments
Note that if marker density is high, then there is no need to evaluate at multiple levels of θ for each marker.
However, when marker density is low, information is gained when multiple QTL locations are considered.
When θ = 0 is assumed, the estimates of μ1 and μ2 are simple means.
Single Marker F2 (F2)
There are now three possible genotypes to consider for both the marker and the QTL locus.
Marker Genotype | ni | Marg. Freq. | P(QQ | Mi) | P(Qq | Mi) | P(qq | Mi)
AA | n1 | 0.25 | (1-θ)² | 2θ(1-θ) | θ²
Aa | n2 | 0.50 | θ(1-θ) | (1-θ)² + θ² | θ(1-θ)
aa | n3 | 0.25 | θ² | 2θ(1-θ) | (1-θ)²
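The conditional probability table translates directly into a lookup. A minimal sketch; the function name `f2_qtl_given_marker` is illustrative, and each tuple lists P(QQ), P(Qq), P(qq) for the given marker genotype.

```python
def f2_qtl_given_marker(theta):
    """P(QTL genotype | marker genotype) in an F2, as (QQ, Qq, qq) tuples."""
    t, u = theta, 1.0 - theta
    return {
        "AA": (u * u, 2 * t * u, t * t),
        "Aa": (t * u, u * u + t * t, t * u),
        "aa": (t * t, 2 * t * u, u * u),
    }

table = f2_qtl_given_marker(0.2)
row_sums = {m: sum(p) for m, p in table.items()}
```

Each row sums to one, and at θ = 0.5 every row collapses to the unlinked F2 proportions 0.25, 0.5, 0.25.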
F2 – Expected Trait Values
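Marker Genotype | ni | Marg. Freq. | Expected Trait Value
AA | n1 | 0.25 | μ_AA
Aa | n2 | 0.50 | μ_Aa
aa | n3 | 0.25 | μ_aa
With genotypic values QQ = a, Qq = d, qq = -a:
$$ \mu_{AA} = (1-\theta)^2 a + 2\theta(1-\theta) d - \theta^2 a = (1-2\theta) a + 2\theta(1-\theta) d $$
$$ \mu_{Aa} = \theta(1-\theta) a + \left[ (1-\theta)^2 + \theta^2 \right] d - \theta(1-\theta) a = \left[ (1-\theta)^2 + \theta^2 \right] d $$
$$ \mu_{aa} = \theta^2 a + 2\theta(1-\theta) d - (1-\theta)^2 a = -(1-2\theta) a + 2\theta(1-\theta) d $$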
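The simplified expected trait values can be checked against the raw mixing sums. A minimal sketch, assuming genotypic values QQ = a, Qq = d, qq = -a; both function names are illustrative.

```python
def f2_class_means(theta, a, d):
    """Closed-form expected trait value for each marker genotype."""
    t, u = theta, 1.0 - theta
    return {
        "AA": (1 - 2 * t) * a + 2 * t * u * d,
        "Aa": (u * u + t * t) * d,
        "aa": -(1 - 2 * t) * a + 2 * t * u * d,
    }

def f2_class_means_by_mixing(theta, a, d):
    """Same quantities as probability-weighted sums over QTL genotypes."""
    t, u = theta, 1.0 - theta
    probs = {
        "AA": (u * u, 2 * t * u, t * t),
        "Aa": (t * u, u * u + t * t, t * u),
        "aa": (t * t, 2 * t * u, u * u),
    }
    values = (a, d, -a)  # genotypic values of QQ, Qq, qq
    return {m: sum(p * v for p, v in zip(ps, values)) for m, ps in probs.items()}

closed = f2_class_means(0.1, a=2.0, d=1.0)
mixed = f2_class_means_by_mixing(0.1, a=2.0, d=1.0)
```

At θ = 0 the marker classes recover a, d and -a exactly; as θ grows toward 0.5 the additive contrast between AA and aa shrinks by the factor (1 - 2θ).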
F2 – Dominant Marker
Similar tables can be derived for the case of a dominant marker.
In general, the procedure is as follows:
Derive the QTL genotype probabilities conditional on the marker phenotype.
Using the conditional probabilities, derive the expected trait value for each marker phenotype class.
F2 – Regression (F2R)
The regression model is
$$ y_j = \beta_0 + \beta_1 x_{1j} + \beta_2 x_{2j} + \varepsilon_j $$
where y_j is the trait value of the jth individual in the population,
where x_1j is the dummy variable for marker additive effect, taking on value 1 for AA, 0 for Aa, and –1 for aa,
where x_2j is the dummy variable for marker dominance effect, taking on value 1 for AA, –1 for Aa, and 1 for aa.
F2R – Matrix Notation
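$$ \hat\beta = (X'X)^{-1} X'Y $$
$$ \frac{1}{N} E(X'X) = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0.5 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \qquad \frac{1}{N} E(X'Y) = \begin{pmatrix} 0.5\, d \\ 0.5 (1-2\theta)\, a \\ -0.5 (1-2\theta)^2 d \end{pmatrix} $$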
F2R – Expected Coefficients
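The coefficient estimates have expectation:
$$ E \begin{pmatrix} \hat\beta_0 \\ \hat\beta_1 \\ \hat\beta_2 \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} 0.5\, d \\ 0.5 (1-2\theta)\, a \\ -0.5 (1-2\theta)^2 d \end{pmatrix} = \begin{pmatrix} 0.5\, d \\ (1-2\theta)\, a \\ -0.5 (1-2\theta)^2 d \end{pmatrix} $$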
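The expected coefficients can be recomputed from population moments as a check. This sketch assumes the dummy coding given for x1 (1, 0, -1) and x2 (1, -1, 1), marker frequencies 0.25, 0.5, 0.25, and genotypic values a, d, -a. Under these assumptions the three regressors are orthogonal, so each coefficient is a ratio of two moments. The function name is hypothetical.

```python
def expected_f2_coefficients(theta, a, d):
    """Expected (beta0, beta1, beta2) for the F2 marker regression."""
    t, u = theta, 1.0 - theta
    freq = (0.25, 0.5, 0.25)            # marker classes AA, Aa, aa
    means = ((1 - 2 * t) * a + 2 * t * u * d,
             (u * u + t * t) * d,
             -(1 - 2 * t) * a + 2 * t * u * d)
    ones = (1, 1, 1)
    x1 = (1, 0, -1)                     # additive dummy variable
    x2 = (1, -1, 1)                     # dominance dummy variable

    def moment(p, q):
        return sum(f * pi * qi for f, pi, qi in zip(freq, p, q))

    # The cross moments E(x1), E(x2) and E(x1 * x2) are all zero here,
    # so X'X is diagonal and the normal equations decouple.
    assert moment(ones, x1) == 0 and moment(ones, x2) == 0 and moment(x1, x2) == 0
    return (moment(ones, means) / moment(ones, ones),
            moment(x1, means) / moment(x1, x1),
            moment(x2, means) / moment(x2, x2))

b0, b1, b2 = expected_f2_coefficients(0.1, a=2.0, d=1.0)
```

With this coding the additive coefficient carries (1 - 2θ)a and the dominance coefficient carries (1 - 2θ)²d with a negative sign, so dominance information decays faster with recombination than additive information.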
F2R – F Statistics
The F statistic compares the residual sum of squares of a reduced model with that of the full model.
The full model has residual sum of squares:
$$ S^2_{\mathrm{full}} = (Y - X\hat\beta)'(Y - X\hat\beta) $$
F2R – Reduced Models
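Reduced models of interest are:
β1 = 0: y_j = β0 + β2 x_2j + ε_j, with residual sum of squares S²(β1=0)
β2 = 0: y_j = β0 + β1 x_1j + ε_j, with residual sum of squares S²(β2=0)
β1 = 0, β2 = 0: y_j = β0 + ε_j, with residual sum of squares S²(β1=0, β2=0)
And the F statistics are:
$$ F_{\beta_1 = 0} = \frac{S^2_{\beta_1 = 0} - S^2_{\mathrm{full}}}{S^2_{\mathrm{full}} / (N - 3)} \sim F(\mathrm{df} = 1, N - 3) $$
$$ F_{\beta_2 = 0} = \frac{S^2_{\beta_2 = 0} - S^2_{\mathrm{full}}}{S^2_{\mathrm{full}} / (N - 3)} \sim F(\mathrm{df} = 1, N - 3) $$
$$ F_{\beta_1 = 0, \beta_2 = 0} = \frac{\left( S^2_{\beta_1 = 0, \beta_2 = 0} - S^2_{\mathrm{full}} \right) / 2}{S^2_{\mathrm{full}} / (N - 3)} \sim F(\mathrm{df} = 2, N - 3) $$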
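Given residual sums of squares from the reduced and full fits, each F statistic is a one-line computation. The sketch below assumes the usual extra-sum-of-squares form of the test, with the full model using three parameters; the function name `f_statistic` is illustrative.

```python
def f_statistic(rss_reduced, rss_full, n, dropped):
    """Extra-sum-of-squares F test against a 3-parameter full model.

    dropped is the number of coefficients set to zero in the reduced model,
    giving numerator df = dropped and denominator df = n - 3.
    """
    numerator = (rss_reduced - rss_full) / dropped
    denominator = rss_full / (n - 3)
    return numerator / denominator

# Hypothetical sums of squares for N = 53 individuals.
f_additive = f_statistic(rss_reduced=120.0, rss_full=100.0, n=53, dropped=1)
f_joint = f_statistic(rss_reduced=140.0, rss_full=100.0, n=53, dropped=2)
```

The single-coefficient tests are referred to F(1, N - 3) and the joint test to F(2, N - 3).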
F2R – Dominant Marker
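If the marker locus segregates as a dominant trait, then the regression model is
$$ y_j = \beta_0 + \beta_1 x_j + \varepsilon_j $$
and, taking x_j = 1 for marker phenotype A_ and x_j = -1 for aa,
$$ E(\hat\beta_1) = \frac{1}{3} \left[ 2 (1-2\theta)\, a + (1-2\theta)^2 d \right] $$
Thus, a significant regression coefficient tests for a confounded additive effect, dominance effect, and linkage.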
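The dominant-marker slope can be recovered from the co-dominant class means by pooling AA and Aa. A minimal sketch, assuming the coding x = 1 for marker phenotype A_ and x = -1 for aa, and genotypic values a, d, -a; the function names are hypothetical.

```python
def dominant_marker_slope(theta, a, d):
    """Expected slope from population moments with x = +1 (A_) and x = -1 (aa)."""
    t, u = theta, 1.0 - theta
    mean_aa_hom = (1 - 2 * t) * a + 2 * t * u * d    # marker genotype AA
    mean_het = (u * u + t * t) * d                   # marker genotype Aa
    mean_rec = -(1 - 2 * t) * a + 2 * t * u * d      # marker genotype aa
    # Phenotype A_ pools AA (freq 0.25) and Aa (freq 0.5).
    mean_dom = (0.25 * mean_aa_hom + 0.5 * mean_het) / 0.75
    ex = 0.75 * 1 + 0.25 * (-1)
    ey = 0.75 * mean_dom + 0.25 * mean_rec
    cov = 0.75 * mean_dom - 0.25 * mean_rec - ex * ey
    var = 1.0 - ex ** 2
    return cov / var

def dominant_marker_slope_closed(theta, a, d):
    """Closed form: (1/3) * [2(1 - 2 theta) a + (1 - 2 theta)^2 d]."""
    c = 1 - 2 * theta
    return (2 * c * a + c * c * d) / 3

slope = dominant_marker_slope(0.1, a=2.0, d=1.0)
```

The single slope mixes a and d, so with a dominant marker the additive and dominance effects cannot be estimated separately from this regression.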
F2 – Likelihood Approach (F2L)
Assume trait variances for the three QTL genotypes are equal.
For each marker class, the trait value is a mixture of three normal distributions with different means, equal variances, and expected proportions based on degree of linkage.
The expected proportions are given in slide #23.
F2L – Log Likelihood
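The likelihood then becomes a sum over three normals:
$$ L_{F2} = \frac{1}{(2\pi\sigma^2)^{N/2}} \prod_{i=1}^{N} \sum_{j=1}^{3} P(Q_j \mid M_i) \exp\left[ -\frac{(y_i - \mu_j)^2}{2\sigma^2} \right] $$
$$ l_{F2} = \sum_{i=1}^{N} \log \left\{ \sum_{j=1}^{3} P(Q_j \mid M_i) \exp\left[ -\frac{(y_i - \mu_j)^2}{2\sigma^2} \right] \right\} - \frac{N}{2} \log(2\pi\sigma^2) $$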
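The F2 log likelihood has the same shape as the backcross one, with three mixture components per marker class. A minimal sketch; the function names and the string labels "AA", "Aa", "aa" are choices made for the example, and `mus` holds the means for QQ, Qq, qq in that order.

```python
import math

def f2_qtl_probs(marker, theta):
    """P(QQ), P(Qq), P(qq) given the marker genotype of one F2 individual."""
    t, u = theta, 1.0 - theta
    if marker == "AA":
        return (u * u, 2 * t * u, t * t)
    if marker == "Aa":
        return (t * u, u * u + t * t, t * u)
    return (t * t, 2 * t * u, u * u)  # marker genotype aa

def f2_loglik(y, marker, mus, sigma2, theta):
    """Log likelihood of the three-component mixture; mus = (mu1, mu2, mu3)."""
    total = 0.0
    for yi, mi in zip(y, marker):
        probs = f2_qtl_probs(mi, theta)
        mix = sum(p * math.exp(-(yi - mu) ** 2 / (2 * sigma2))
                  for p, mu in zip(probs, mus))
        total += math.log(mix)
    return total - len(y) / 2 * math.log(2 * math.pi * sigma2)

value = f2_loglik([4.1, 3.0, 1.8], ["AA", "Aa", "aa"],
                  mus=(4.0, 3.0, 2.0), sigma2=1.0, theta=0.1)
```

Each null hypothesis corresponds to evaluating this function under a constraint: equal means for QQ and qq (a = 0), a heterozygote mean at the midpoint (d = 0), all means equal (a = d = 0), or θ fixed at 0.5.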
F2L – Null Hypothesis A
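If the null hypothesis is
H0: a = 0
then QQ and qq share the same mean (μ1 = μ3) and the log likelihood becomes
$$ l_{F2}(a = 0) = \sum_{i=1}^{N} \log \left\{ \left[ P(Q_1 \mid M_i) + P(Q_3 \mid M_i) \right] \exp\left[ -\frac{(y_i - \mu_1)^2}{2\sigma^2} \right] + P(Q_2 \mid M_i) \exp\left[ -\frac{(y_i - \mu_2)^2}{2\sigma^2} \right] \right\} - \frac{N}{2} \log(2\pi\sigma^2) $$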
F2L – Null Hypothesis B
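Suppose instead that the null hypothesis is
H0: d = 0
Then the heterozygote mean is the midpoint of the two homozygote means and
$$ l_{F2}(d = 0) = \sum_{i=1}^{N} \log \left\{ P(Q_1 \mid M_i) \exp\left[ -\frac{(y_i - \mu_1)^2}{2\sigma^2} \right] + P(Q_2 \mid M_i) \exp\left[ -\frac{\left( y_i - \frac12 (\mu_1 + \mu_3) \right)^2}{2\sigma^2} \right] + P(Q_3 \mid M_i) \exp\left[ -\frac{(y_i - \mu_3)^2}{2\sigma^2} \right] \right\} - \frac{N}{2} \log(2\pi\sigma^2) $$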
F2L – Null Hypothesis C
Suppose instead that the null hypothesis is
H0: a = 0, d = 0
$$ l_{F2}(a = 0, d = 0) = -\frac{1}{2\sigma^2} \sum_{i=1}^{N} (y_i - \mu)^2 - \frac{N}{2} \log(2\pi\sigma^2) $$
F2L – Null Hypothesis D
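When the null hypothesis is
H0: θ = 0.5
$$ l_{F2}(\theta = 0.5) = \sum_{i=1}^{N} \log \left\{ \frac14 \exp\left[ -\frac{(y_i - \mu_1)^2}{2\sigma^2} \right] + \frac12 \exp\left[ -\frac{(y_i - \mu_2)^2}{2\sigma^2} \right] + \frac14 \exp\left[ -\frac{(y_i - \mu_3)^2}{2\sigma^2} \right] \right\} - \frac{N}{2} \log(2\pi\sigma^2) $$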
F2L – Statistical Test
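The G statistic
$$ G = 2 \left[ l_{F2}(\hat\mu_1, \hat\mu_2, \hat\mu_3, \hat\sigma^2, \hat\theta) - l_{F2}(\hat\mu_1, \hat\mu_2, \hat\mu_3, \hat\sigma^2, \theta = 0.5) \right] \sim \chi^2_{\mathrm{df} = 1} $$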
F2L – Maximization
Iterative methods are required to find the maximum likelihood estimates.
Other approaches have been suggested, such as combining moment estimation with a maximum likelihood approach. The resulting system of equations to solve for the estimators is given on the next slide.
F2L – Finding MLEs
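$$ \hat\mu_{AA} = (1-\theta)^2 \mu_1 + 2\theta(1-\theta) \mu_2 + \theta^2 \mu_3 $$
$$ \hat\mu_{Aa} = \theta(1-\theta) \mu_1 + \left[ (1-\theta)^2 + \theta^2 \right] \mu_2 + \theta(1-\theta) \mu_3 $$
$$ \hat\mu_{aa} = \theta^2 \mu_1 + 2\theta(1-\theta) \mu_2 + (1-\theta)^2 \mu_3 $$
$$ S^2_{AA} = \sigma^2 + (1-\theta)^2 (\mu_1 - \hat\mu_{AA})^2 + 2\theta(1-\theta) (\mu_2 - \hat\mu_{AA})^2 + \theta^2 (\mu_3 - \hat\mu_{AA})^2 $$
$$ S^2_{Aa} = \sigma^2 + \theta(1-\theta) (\mu_1 - \hat\mu_{Aa})^2 + \left[ (1-\theta)^2 + \theta^2 \right] (\mu_2 - \hat\mu_{Aa})^2 + \theta(1-\theta) (\mu_3 - \hat\mu_{Aa})^2 $$
$$ S^2_{aa} = \sigma^2 + \theta^2 (\mu_1 - \hat\mu_{aa})^2 + 2\theta(1-\theta) (\mu_2 - \hat\mu_{aa})^2 + (1-\theta)^2 (\mu_3 - \hat\mu_{aa})^2 $$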
F2L – Dominant Marker Model
Modify the likelihood equations with QTL genotype probabilities conditional on the marker phenotype for a dominant marker.
Modify the expected trait values for each marker genotype.
Done.