Lecture 23: Quantitative Traits III

Date: 11/12/02

Single locus backcross regression
Single locus backcross likelihood
F2 – regression, likelihood, etc.

Backcross Model

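Marker Genotype   Count   Marg. Freq.   QTL Genotype QQ   QTL Genotype Qq   Trait Value
AA                n1      0.5           $1-\theta$        $\theta$          $(1-\theta)\mu_1 + \theta\mu_2$
Aa                n2      0.5           $\theta$          $1-\theta$        $\theta\mu_1 + (1-\theta)\mu_2$

The QTL genotype columns give the probability of each QTL genotype given the marker genotype, where $\theta$ is the recombination fraction between marker and QTL.

$\mu_1$ is the genotypic value of QQ; $\mu_2$ is the genotypic value of Qq.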

Backcross – t-test

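$$t_M = \frac{\hat{\mu}_{AA} - \hat{\mu}_{Aa}}{\sqrt{\hat{s}^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} \sim t(\mathrm{df} = N - 2)$$

where $\hat{\mu}_{AA}$ and $\hat{\mu}_{Aa}$ are the observed means in each marker class,

where $\hat{s}^2$ is the pooled variance,

where $N = n_1 + n_2$ is the total number of individuals sampled.

Gen.   Freq.   Phen.
AA     n1      $X_{11}, X_{12}, \ldots, X_{1 n_1}$
Aa     n2      $X_{21}, X_{22}, \ldots, X_{2 n_2}$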
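As a minimal sketch of the t statistic for the backcross, assuming the trait values of each marker class are held in plain Python lists (the function name `backcross_t` and the sample numbers are illustrative, not from the lecture):

```python
import math

def backcross_t(aa_values, het_values):
    """Two-sample t statistic with pooled variance; returns (t, degrees of freedom)."""
    n1, n2 = len(aa_values), len(het_values)
    mean_aa = sum(aa_values) / n1
    mean_het = sum(het_values) / n2
    # Pooled variance: within-class sums of squares over N - 2.
    ss = (sum((x - mean_aa) ** 2 for x in aa_values)
          + sum((x - mean_het) ** 2 for x in het_values))
    pooled = ss / (n1 + n2 - 2)
    t = (mean_aa - mean_het) / math.sqrt(pooled * (1.0 / n1 + 1.0 / n2))
    return t, n1 + n2 - 2

t, df = backcross_t([10.0, 12.0, 14.0], [7.0, 9.0, 11.0])
```

The returned statistic would then be compared with a t distribution on `df` degrees of freedom.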

Backcross – Linear Regression (BLG)

One may also test the data using a simple linear regression model:

$$y_j = \alpha + \beta x_j + \varepsilon_j$$

where $y_j$ is the trait value for the $j$th individual and $x_j$ is a dummy variable indicating marker genotype (AA or Aa).

Recall that the estimates of the coefficients are given by:

$$\hat{b} = \frac{\mathrm{Cov}(x, y)}{\mathrm{Var}(x)}, \qquad \hat{a} = \bar{y} - \hat{b}\,\bar{x}$$

We seek the expectation of these coefficients under a genetic model.

BLG – Expected Sample Statistics

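To find the expected values under the genetic model, we need the expectations of the sample means and variances. With the dummy variable coded $x_j = 1$ for AA and $x_j = -1$ for Aa:

$$\begin{aligned}
E(\bar{x}) &= \tfrac{1}{2}(1) + \tfrac{1}{2}(-1) = 0 \\
E(\hat{s}_x^2) &= \tfrac{1}{2}(1) + \tfrac{1}{2}(1) = 1 \\
E(\bar{y}) &= \tfrac{1}{2}\left[(1-\theta)\mu_1 + \theta\mu_2\right] + \tfrac{1}{2}\left[\theta\mu_1 + (1-\theta)\mu_2\right] = \tfrac{1}{2}(\mu_1 + \mu_2) \\
E(\hat{s}_{xy}) &= \tfrac{1}{2}\left[(1-\theta)\mu_1 + \theta\mu_2\right] - \tfrac{1}{2}\left[\theta\mu_1 + (1-\theta)\mu_2\right] = \tfrac{1}{2}(1-2\theta)(\mu_1 - \mu_2)
\end{aligned}$$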

BLG – Expected Coefficients

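Recalling the coefficient estimators:

$$E(\hat{b}) = \frac{E(\hat{s}_{xy})}{E(\hat{s}_x^2)} = \tfrac{1}{2}(1-2\theta)(\mu_1 - \mu_2)$$

$$E(\hat{a}) = E(\bar{y}) = \tfrac{1}{2}(\mu_1 + \mu_2)$$

Finally, recalling our genetic models:

Genotype   Model 1   Model 2
QQ         a         2a
Qq         d         (1+k)a
qq         -a        0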

BLG – Hypothesis Testing

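We conclude that the expected regression coefficient is:

$$E(\hat{b}) = \tfrac{1}{2}(1-2\theta)(\mu_1 - \mu_2) = \tfrac{1}{2}(1-2\theta)(a - d) = \tfrac{1}{2}(1-2\theta)(1-k)a$$

So, again, rejecting $H_0: \beta = 0$ (zero regression slope) means rejecting all of the following:

$\theta = 0.5$ (NO LINKAGE)
$a = 0$ or $a = d = 0$ (NO VARIATION)
$k = 1$ or $a = d$ (COMPLETE DOMINANCE)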
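A small sketch can make the confounding concrete. It assumes the marker dummy is coded +1 for AA and -1 for Aa and uses the first genetic model (QQ = a, Qq = d); the function name `expected_slope` is hypothetical:

```python
def expected_slope(theta, a, d):
    """Expected backcross regression slope under QQ = a, Qq = d."""
    mu1, mu2 = a, d  # genotypic values of QQ and Qq
    mean_AA = (1 - theta) * mu1 + theta * mu2
    mean_Aa = theta * mu1 + (1 - theta) * mu2
    # With x = +1 (AA) and x = -1 (Aa), the slope is half the difference in class means.
    return 0.5 * (mean_AA - mean_Aa)

no_linkage = expected_slope(0.5, 2.0, 1.0)          # theta = 0.5
complete_dominance = expected_slope(0.1, 1.0, 1.0)  # a = d
linked_effect = expected_slope(0.0, 2.0, 0.5)       # tight linkage, a != d
```

The first two calls give a slope of zero for different reasons, which is why a single-marker regression cannot separate linkage from genetic effect.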

Backcross – Likelihood (BL)

One may also set up a likelihood function for backcross progeny.

Trait values are assumed approximately normal (lots of little effects added together).

The distribution of trait values for each marker class is assumed to be a mixture of two normals, one for each possible genotype at the QTL.

The mixing proportions are determined by the recombination fraction.

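BL – Distributions Class AA

Suppose $\theta = 0.2$.

[Figure: distribution of trait values in marker class AA, a mixture of two normals: 80% QQ (mean $\mu_1$) and 20% Qq (mean $\mu_2$). Horizontal axis: genotypic value.]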

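BL – Distributions Class Aa

[Figure: distribution of trait values in marker class Aa, a mixture of two normals: 20% QQ (mean $\mu_1$) and 80% Qq (mean $\mu_2$). Horizontal axis: genotypic value.]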

BL – Assumptions

Assume the trait variances for the two QTL genotypes in the backcross are equal.

Assume the traits are normally distributed.

Assume there is no marker / trait interaction, so the distributions remain unchanged in both marker classes (i.e. same variances).

BL – Likelihood

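The likelihood function for the backcross is then:

$$L = \frac{1}{(2\pi\sigma^2)^{N/2}} \prod_{i=1}^{N} \sum_{j=1}^{2} P(Q_j \mid M_i)\, \exp\left[-\frac{(y_i - \mu_j)^2}{2\sigma^2}\right]$$

where $Q_j$ is one of the two possible (unknown) genotypes at the QTL and $M_i$ is the observed marker genotype of individual $i$.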

BL – Log Likelihood

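Take the log of the likelihood to obtain:

$$l = \sum_{i=1}^{N} \log\left\{\sum_{j=1}^{2} P(Q_j \mid M_i)\, \exp\left[-\frac{(y_i - \mu_j)^2}{2\sigma^2}\right]\right\} - \frac{N}{2}\log\left(2\pi\sigma^2\right)$$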
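The backcross log likelihood can be evaluated directly. The sketch below assumes markers are given as the strings "AA" and "Aa" and that P(QQ | AA) = 1 - theta and P(QQ | Aa) = theta; the function name `backcross_loglik` and the sample data are illustrative:

```python
import math

def backcross_loglik(y, markers, mu1, mu2, sigma2, theta):
    """Mixture log likelihood for backcross progeny; markers[i] is 'AA' or 'Aa'."""
    total = -0.5 * len(y) * math.log(2 * math.pi * sigma2)
    for yi, mi in zip(y, markers):
        p_qq = 1 - theta if mi == "AA" else theta  # P(QQ | marker genotype)
        mix = (p_qq * math.exp(-(yi - mu1) ** 2 / (2 * sigma2))
               + (1 - p_qq) * math.exp(-(yi - mu2) ** 2 / (2 * sigma2)))
        total += math.log(mix)
    return total

y = [10.0, 9.5, 7.0, 7.5]
markers = ["AA", "AA", "Aa", "Aa"]
l_linked = backcross_loglik(y, markers, 9.75, 7.25, 1.5, 0.1)
l_unlinked = backcross_loglik(y, markers, 9.75, 7.25, 1.5, 0.5)
```

For widely separated trait values the exponentials can underflow, so a production version would work with log-sum-exp rather than raw exponentials.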

BL – Null Hypothesis A

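One null hypothesis of interest is that the mean genotypic values for the two distributions are not in fact different, so

$H_0: \mu_1 = \mu_2 = \mu$.

In this case, the log likelihood becomes:

$$l = \sum_{i=1}^{N} \log\left\{\exp\left[-\frac{(y_i - \mu)^2}{2\sigma^2}\right]\right\} - \frac{N}{2}\log\left(2\pi\sigma^2\right) = -\frac{1}{2\sigma^2}\sum_{i=1}^{N}(y_i - \mu)^2 - \frac{N}{2}\log\left(2\pi\sigma^2\right)$$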

BL – Null Hypothesis B

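Another, perhaps more interesting, null hypothesis is that there is no linkage, so

$H_0: \theta = 0.5$

Under this assumption, the log likelihood becomes

$$l = \sum_{i=1}^{N} \log\left\{\frac{1}{2}\sum_{j=1}^{2} \exp\left[-\frac{(y_i - \mu_j)^2}{2\sigma^2}\right]\right\} - \frac{N}{2}\log\left(2\pi\sigma^2\right)$$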

BL – Statistical Test

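The G statistic that is commonly calculated to test for linkage is:

$$G = 2\left[\,l\left(\hat{\mu}_1, \hat{\mu}_2, \hat{\sigma}^2, \hat{\theta}\right) - l\left(\hat{\mu}_1, \hat{\mu}_2, \hat{\sigma}^2, \theta = 0.5\right)\right] \sim \chi^2_{\mathrm{df}=1}$$

However, this test is less powerful than the t test introduced earlier.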

BL – LOD Scores

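Again, LOD scores are commonly used for QTL detection:

$$\mathrm{lod} = \log_{10} L\left(\hat{\mu}_1, \hat{\mu}_2, \hat{\sigma}^2, \hat{\theta}\right) - \log_{10} L\left(\hat{\mu}_1, \hat{\mu}_2, \hat{\sigma}^2, \theta = 0.5\right)$$

where we interpret, as usual, that a lod score of $l$ means the data are $10^l$ times as likely under the alternative hypothesis as under the null hypothesis.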
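Since both statistics are built from the same pair of maximized log likelihoods, they differ only by a constant factor. A short sketch, assuming the log likelihoods are natural logs (the helper names are illustrative):

```python
import math

def g_statistic(loglik_alt, loglik_null):
    """Likelihood ratio statistic from natural-log likelihoods."""
    return 2.0 * (loglik_alt - loglik_null)

def lod_score(loglik_alt, loglik_null):
    """Base-10 log of the likelihood ratio, from natural-log likelihoods."""
    return (loglik_alt - loglik_null) / math.log(10)

# The two are related by G = 2 * ln(10) * lod, roughly 4.6 * lod.
alt, null = -100.0, -100.0 - 3 * math.log(10)
lod = lod_score(alt, null)
g = g_statistic(alt, null)
```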

BL – Likelihood Maximization

Analytic solutions are difficult to achieve.

Iterative approaches are generally used (EM, Newton-Raphson).

Combinations of methods are also used. For example, the variance is commonly estimated with the pooled variance:

$$\hat{\sigma}^2 = \frac{\sum_{j}\left(y_{1j} - \bar{y}_1\right)^2 + \sum_{j}\left(y_{2j} - \bar{y}_2\right)^2}{n_1 + n_2}$$

To facilitate calculations even more, a grid of $\theta$ values with maximization on $\mu_1$ and $\mu_2$ can be used.
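The grid idea can be sketched as follows: hold the variance at a pooled estimate, step through candidate recombination fractions, and at each one update the two means with EM. This is one possible arrangement under stated assumptions (pooled variance divided by N, EM started from the marker class means); all function names are hypothetical:

```python
import math

def mixture_loglik(y, p_qq, mu1, mu2, sigma2):
    """Backcross mixture log likelihood; p_qq[i] = P(QQ | marker of individual i)."""
    total = -0.5 * len(y) * math.log(2 * math.pi * sigma2)
    for yi, p in zip(y, p_qq):
        total += math.log(p * math.exp(-(yi - mu1) ** 2 / (2 * sigma2))
                          + (1 - p) * math.exp(-(yi - mu2) ** 2 / (2 * sigma2)))
    return total

def em_means(y, p_qq, mu1, mu2, sigma2, iters=100):
    """EM updates for mu1 and mu2 with the mixing proportions and variance held fixed."""
    for _ in range(iters):
        # E step: posterior probability that each individual is QQ.
        w = []
        for yi, p in zip(y, p_qq):
            f1 = p * math.exp(-(yi - mu1) ** 2 / (2 * sigma2))
            f2 = (1 - p) * math.exp(-(yi - mu2) ** 2 / (2 * sigma2))
            w.append(f1 / (f1 + f2))
        # M step: posterior-weighted means.
        mu1 = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
        mu2 = sum((1 - wi) * yi for wi, yi in zip(w, y)) / sum(1 - wi for wi in w)
    return mu1, mu2

def grid_search(y, markers, thetas):
    """Return (log likelihood, theta, mu1, mu2) for the best theta on the grid."""
    aa = [yi for yi, m in zip(y, markers) if m == "AA"]
    het = [yi for yi, m in zip(y, markers) if m == "Aa"]
    m_aa, m_het = sum(aa) / len(aa), sum(het) / len(het)
    # Pooled variance, held fixed throughout (assumption: divisor N).
    sigma2 = (sum((v - m_aa) ** 2 for v in aa)
              + sum((v - m_het) ** 2 for v in het)) / len(y)
    best = None
    for theta in thetas:
        p_qq = [1 - theta if m == "AA" else theta for m in markers]
        mu1, mu2 = em_means(y, p_qq, m_aa, m_het, sigma2)
        ll = mixture_loglik(y, p_qq, mu1, mu2, sigma2)
        if best is None or ll > best[0]:
            best = (ll, theta, mu1, mu2)
    return best

y = [10.1, 9.8, 10.4, 9.9, 7.2, 6.9, 7.1, 6.8]
markers = ["AA"] * 4 + ["Aa"] * 4
best = grid_search(y, markers, [0.0, 0.1, 0.2, 0.3, 0.4, 0.5])
```

EM only guarantees a local maximum, so in practice several starting values per grid point may be worth trying.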

So suppose you have multiple markers with known map position. Then, evaluate a G statistic or lod score for 3 possible locations of the QTL:

BL – Likelihood Maximization

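Marker   0              0.25m                        0.5m
1        $\theta = 0$   $\theta = f(0.25\,m_{12})$   $\theta = f(0.5\,m_{12})$
2

Here $m_{12}$ is the map distance between markers 1 and 2, and $f$ converts map distance to recombination fraction.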
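The lecture does not say which map function $f$ is intended. As an illustration only, the sketch below assumes the Haldane map function, $\theta = \tfrac{1}{2}(1 - e^{-2m})$ with $m$ in Morgans, and computes the candidate recombination fractions for the three QTL positions (the function names are hypothetical):

```python
import math

def haldane_theta(map_distance_morgans):
    """Recombination fraction from map distance (Haldane map function, an assumption)."""
    return 0.5 * (1.0 - math.exp(-2.0 * map_distance_morgans))

def candidate_thetas(m12, fractions=(0.0, 0.25, 0.5)):
    """Recombination fractions between marker 1 and QTL positions at fractions of m12."""
    return [haldane_theta(f * m12) for f in fractions]

thetas = candidate_thetas(0.2)  # markers 0.2 Morgans (20 cM) apart
```

Any other map function (for example Kosambi) could be substituted without changing the rest of the procedure.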

BL – Sample Results

[Figure: LOD score (vertical axis, 0 to 7) plotted against chromosome location (horizontal axis, 0 to 2).]

BL – Caveats

When there is more than one QTL in the same vicinity, the peaks in the LOD score plot may not correspond to QTLs.

Recall that these results are still based on single-locus analysis for which we cannot separate genetic effect from linkage. Thus, there is little good information about QTL location in such a plot, even though it looks like there should be.

BL – Comments

Note that if marker density is high, then there is no need to evaluate at multiple levels of $\theta$ for each marker.

However, when marker density is low, information is gained when multiple QTL locations are considered.

When $\theta = 0$ is assumed, the estimates of $\mu_1$ and $\mu_2$ are simple means.

Single Marker F2 (F2)

There are now three possible genotypes to consider for both the marker and the QTL locus.

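Marker   $n_i$   Marg. Freq.   P(QQ | $M_i$)        P(Qq | $M_i$)               P(qq | $M_i$)
AA       n1      0.25          $(1-\theta)^2$       $2\theta(1-\theta)$         $\theta^2$
Aa       n2      0.50          $\theta(1-\theta)$   $(1-\theta)^2 + \theta^2$   $\theta(1-\theta)$
aa       n3      0.25          $\theta^2$           $2\theta(1-\theta)$         $(1-\theta)^2$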
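The conditional probability table is easy to encode and sanity-check. A minimal sketch (the function name `f2_qtl_probs` is illustrative); each row is ordered (QQ, Qq, qq):

```python
def f2_qtl_probs(theta):
    """P(QTL genotype | marker genotype) in an F2, rows ordered (QQ, Qq, qq)."""
    t, s = theta, 1.0 - theta
    return {
        "AA": (s * s, 2 * t * s, t * t),
        "Aa": (t * s, s * s + t * t, t * s),
        "aa": (t * t, 2 * t * s, s * s),
    }

probs = f2_qtl_probs(0.2)
# Each row sums to 1, and weighting the rows by the marker frequencies
# (0.25, 0.50, 0.25) recovers the F2 QTL frequencies (0.25, 0.50, 0.25).
marginal = [0.25 * probs["AA"][k] + 0.5 * probs["Aa"][k] + 0.25 * probs["aa"][k]
            for k in range(3)]
```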

F2 – Expected Trait Values

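Genotypic values: QQ = $a$, Qq = $d$, qq = $-a$.

Marker   $n_i$   Marg. Freq.   Expected Trait Value
AA       n1      0.25          $\mu_{AA}$
Aa       n2      0.50          $\mu_{Aa}$
aa       n3      0.25          $\mu_{aa}$

$$\begin{aligned}
\mu_{AA} &= (1-\theta)^2 a + 2\theta(1-\theta)d - \theta^2 a = (1-2\theta)a + 2\theta(1-\theta)d \\
\mu_{Aa} &= \theta(1-\theta)a + \left[(1-\theta)^2 + \theta^2\right]d - \theta(1-\theta)a = \left[(1-\theta)^2 + \theta^2\right]d \\
\mu_{aa} &= \theta^2 a + 2\theta(1-\theta)d - (1-\theta)^2 a = -(1-2\theta)a + 2\theta(1-\theta)d
\end{aligned}$$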
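The expected trait values follow by weighting the genotypic values (a, d, -a) with the conditional QTL genotype probabilities. A brief sketch under that model (the function name `f2_expected_means` is hypothetical):

```python
def f2_expected_means(theta, a, d):
    """Expected trait value in each F2 marker class, with QQ = a, Qq = d, qq = -a."""
    t, s = theta, 1.0 - theta
    values = (a, d, -a)
    probs = {
        "AA": (s * s, 2 * t * s, t * t),
        "Aa": (t * s, s * s + t * t, t * s),
        "aa": (t * t, 2 * t * s, s * s),
    }
    return {marker: sum(p * g for p, g in zip(row, values))
            for marker, row in probs.items()}

means = f2_expected_means(0.2, 2.0, 1.0)
# The homozygote contrast isolates the additive effect:
# means["AA"] - means["aa"] equals 2 * (1 - 2*theta) * a.
contrast = means["AA"] - means["aa"]
```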

F2 – Dominant Marker

Similar tables can be derived for the case of a dominant marker.

In general, the procedure is as follows:

Derive the QTL genotype probabilities conditional on the marker phenotype.

Using the conditional probabilities, derive the expected trait value for each marker phenotype class.
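The two steps can be sketched for the case where allele A is dominant, so that AA and Aa form one marker phenotype class (written `A_` here) with F2 frequencies 0.25 and 0.50. The function names and the genotypic values QQ = a, Qq = d, qq = -a are assumptions of this illustration:

```python
def dominant_marker_probs(theta):
    """P(QTL genotype | marker phenotype), rows ordered (QQ, Qq, qq)."""
    t, s = theta, 1.0 - theta
    p_AA = (s * s, 2 * t * s, t * t)
    p_Aa = (t * s, s * s + t * t, t * s)
    p_aa = (t * t, 2 * t * s, s * s)
    # Step 1: pool AA and Aa, weighting by their F2 frequencies (0.25 and 0.50).
    p_dom = tuple((0.25 * x + 0.5 * y) / 0.75 for x, y in zip(p_AA, p_Aa))
    return {"A_": p_dom, "aa": p_aa}

def dominant_marker_means(theta, a, d):
    """Step 2: expected trait value for each marker phenotype class."""
    values = (a, d, -a)
    return {cls: sum(p * g for p, g in zip(row, values))
            for cls, row in dominant_marker_probs(theta).items()}

means = dominant_marker_means(0.2, 2.0, 1.0)
```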

F2 – Regression (F2R)

The regression model is

$$y_j = \beta_0 + \beta_1 x_{1j} + \beta_2 x_{2j} + \varepsilon_j$$

where $y_j$ is the trait value of the $j$th individual in the population,

where $x_{1j}$ is the dummy variable for the marker additive effect, taking on value 1 for AA, 0 for Aa, and –1 for aa,

where $x_{2j}$ is the dummy variable for the marker dominance effect, taking on value 1 for AA, –1 for Aa, and 1 for aa.

F2R – Matrix Notation

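$$X'X\hat{\beta} = X'Y \quad\Rightarrow\quad \hat{\beta} = (X'X)^{-1}X'Y$$

Per individual, the expected cross products are

$$\frac{1}{N}E(X'X) = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0.5 & 0 \\ 0 & 0 & 1 \end{pmatrix}, \qquad \frac{1}{N}E(X'Y) = \begin{pmatrix} 0.25\,(\mu_1 + 2\mu_2 + \mu_3) \\ 0.25\,(1-2\theta)(\mu_1 - \mu_3) \\ 0.25\,(1-2\theta)^2(\mu_1 - 2\mu_2 + \mu_3) \end{pmatrix}$$

where $\mu_1$, $\mu_2$, $\mu_3$ are the genotypic values of QQ, Qq, and qq.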

F2R – Expected Coefficients

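The coefficient estimates have expectation:

$$E\begin{pmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \\ \hat{\beta}_2 \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} 0.25\,(\mu_1 + 2\mu_2 + \mu_3) \\ 0.25\,(1-2\theta)(\mu_1 - \mu_3) \\ 0.25\,(1-2\theta)^2(\mu_1 - 2\mu_2 + \mu_3) \end{pmatrix} = \begin{pmatrix} 0.25\,(\mu_1 + 2\mu_2 + \mu_3) \\ 0.5\,(1-2\theta)(\mu_1 - \mu_3) \\ 0.25\,(1-2\theta)^2(\mu_1 - 2\mu_2 + \mu_3) \end{pmatrix}$$

where $\mu_1$, $\mu_2$, $\mu_3$ are the genotypic values of QQ, Qq, and qq. With $\mu_1 = a$, $\mu_2 = d$, $\mu_3 = -a$, this gives $E(\hat{\beta}_0) = 0.5\,d$, $E(\hat{\beta}_1) = (1-2\theta)\,a$, and $E(\hat{\beta}_2) = -0.5\,(1-2\theta)^2 d$ under the dummy coding $x_2 = (1, -1, 1)$.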
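The expected coefficients can be checked numerically from the population cross products. The sketch below assumes the dummy codings x1 = (1, 0, -1) and x2 = (1, -1, 1) for (AA, Aa, aa) and marker frequencies (0.25, 0.50, 0.25); under those codings the expected X'X is diagonal, so each coefficient is a simple ratio. The function name is hypothetical:

```python
def expected_f2_coefficients(theta, mu1, mu2, mu3):
    """Population regression coefficients (beta0, beta1, beta2) for the F2 marker model."""
    t, s = theta, 1.0 - theta
    freq = {"AA": 0.25, "Aa": 0.50, "aa": 0.25}
    columns = [
        {"AA": 1, "Aa": 1, "aa": 1},    # intercept
        {"AA": 1, "Aa": 0, "aa": -1},   # x1, additive dummy
        {"AA": 1, "Aa": -1, "aa": 1},   # x2, dominance dummy
    ]
    # Expected trait value in each marker class (mixture over QTL genotypes).
    mean = {
        "AA": s * s * mu1 + 2 * t * s * mu2 + t * t * mu3,
        "Aa": t * s * mu1 + (s * s + t * t) * mu2 + t * s * mu3,
        "aa": t * t * mu1 + 2 * t * s * mu2 + s * s * mu3,
    }
    # Off-diagonal terms of E(X'X)/N vanish for these codings, so solve termwise.
    xtx = [sum(freq[m] * col[m] ** 2 for m in freq) for col in columns]
    xty = [sum(freq[m] * col[m] * mean[m] for m in freq) for col in columns]
    return tuple(b / a for a, b in zip(xtx, xty))

b0, b1, b2 = expected_f2_coefficients(0.2, 2.0, 1.0, -2.0)  # a = 2, d = 1
```

With a = 2, d = 1, and theta = 0.2, the additive coefficient is (1 - 2*theta) * a and the dominance coefficient is proportional to (1 - 2*theta)^2 * d, so both shrink as linkage loosens.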

F2R – F Statistics

The F statistic is the ratio between the residual mean squares for the reduced model and the full model.

The full model has residual mean square:

$$S^2_{\text{full}} = \left(Y - X\hat{\beta}\right)'\left(Y - X\hat{\beta}\right)$$

F2R – Reduced Models

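Reduced models of interest are:

$$\begin{aligned}
\beta_1 = 0: &\quad y_j = \beta_0 + \beta_2 x_{2j} + \varepsilon_j, &\quad S^2_{\beta_1 = 0} \\
\beta_2 = 0: &\quad y_j = \beta_0 + \beta_1 x_{1j} + \varepsilon_j, &\quad S^2_{\beta_2 = 0} \\
\beta_1 = 0,\ \beta_2 = 0: &\quad y_j = \beta_0 + \varepsilon_j, &\quad S^2_{\beta_1 = 0,\,\beta_2 = 0}
\end{aligned}$$

And the F statistics are:

$$\begin{aligned}
F_{\beta_1 = 0} &= \frac{S^2_{\beta_1 = 0}}{S^2_{\text{full}}} \sim F(\mathrm{df} = 1,\ N-3) \\
F_{\beta_2 = 0} &= \frac{S^2_{\beta_2 = 0}}{S^2_{\text{full}}} \sim F(\mathrm{df} = 1,\ N-3) \\
F_{\beta_1 = 0,\,\beta_2 = 0} &= \frac{S^2_{\beta_1 = 0,\,\beta_2 = 0}}{S^2_{\text{full}}} \sim F(\mathrm{df} = 2,\ N-3)
\end{aligned}$$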
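One standard way to compare nested regression models is the extra-sum-of-squares F test, $F = \frac{(RSS_{\text{reduced}} - RSS_{\text{full}})/q}{RSS_{\text{full}}/(N-p)}$. The sketch below uses that textbook form for the joint test of both marker coefficients; it is an assumption of this illustration and may differ in scaling from the $S^2$ quantities on the slide. Because the full model has three parameters and there are three marker classes, its fitted values are just the class means:

```python
def joint_f_statistic(groups):
    """Extra-sum-of-squares F for beta1 = beta2 = 0.

    groups maps marker genotype ('AA', 'Aa', 'aa') to a list of trait values.
    Returns (F, (numerator df, denominator df)).
    """
    all_y = [y for ys in groups.values() for y in ys]
    n = len(all_y)
    grand_mean = sum(all_y) / n
    # Reduced model (intercept only): residuals about the grand mean.
    rss_reduced = sum((y - grand_mean) ** 2 for y in all_y)
    # Full model: residuals about each marker class mean.
    rss_full = sum((y - sum(ys) / len(ys)) ** 2
                   for ys in groups.values() for y in ys)
    f = ((rss_reduced - rss_full) / 2) / (rss_full / (n - 3))
    return f, (2, n - 3)

f, df = joint_f_statistic({"AA": [5.0, 6.0, 7.0],
                           "Aa": [3.0, 4.0, 5.0],
                           "aa": [1.0, 2.0, 3.0]})
```

The single-coefficient tests would need an explicit least-squares fit of each reduced model, which is omitted here to keep the example short.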

F2R – Dominant Marker

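If the marker locus segregates as a dominant trait, then:

$$y_j = \beta_0 + \beta_1 x_j + \varepsilon_j$$

where $x_j = 1$ for the dominant marker phenotype (AA or Aa) and $x_j = -1$ for aa, and

$$E(\hat{\beta}_1) = \tfrac{1}{3}\left[2(1-2\theta)\,a + (1-2\theta)^2 d\right]$$

Thus, a significant regression coefficient tests for a confounded additive effect, dominance effect, and linkage.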

F2 – Likelihood Approach (F2L)

Assume trait variances for the three QTL genotypes are equal.

For each marker class, the trait value is a mixture of three normal distributions with different means, equal variances, and expected proportions based on degree of linkage.

The expected proportions are given in slide #23.

F2L – Log Likelihood

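The likelihood then becomes a sum over three normals:

$$L_{F2} = \frac{1}{(2\pi\sigma^2)^{N/2}} \prod_{i=1}^{N} \sum_{j=1}^{3} P(Q_j \mid M_i)\, \exp\left[-\frac{(y_i - \mu_j)^2}{2\sigma^2}\right]$$

$$l_{F2} = \sum_{i=1}^{N} \log\left\{\sum_{j=1}^{3} P(Q_j \mid M_i)\, \exp\left[-\frac{(y_i - \mu_j)^2}{2\sigma^2}\right]\right\} - \frac{N}{2}\log\left(2\pi\sigma^2\right)$$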
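A minimal sketch of the F2 log likelihood, assuming markers are given as the strings "AA", "Aa", "aa" and the three genotype means are passed in the order (QQ, Qq, qq); the function name `f2_loglik` and the sample data are illustrative:

```python
import math

def f2_loglik(y, markers, mus, sigma2, theta):
    """F2 mixture log likelihood; mus = (mean of QQ, mean of Qq, mean of qq)."""
    t, s = theta, 1.0 - theta
    probs = {
        "AA": (s * s, 2 * t * s, t * t),
        "Aa": (t * s, s * s + t * t, t * s),
        "aa": (t * t, 2 * t * s, s * s),
    }
    total = -0.5 * len(y) * math.log(2 * math.pi * sigma2)
    for yi, mi in zip(y, markers):
        mix = sum(p * math.exp(-(yi - mu) ** 2 / (2 * sigma2))
                  for p, mu in zip(probs[mi], mus))
        total += math.log(mix)
    return total

y = [5.1, 4.8, 3.2, 2.9, 3.1, 1.0]
markers = ["AA", "AA", "Aa", "Aa", "Aa", "aa"]
l_alt = f2_loglik(y, markers, (5.0, 3.0, 1.0), 0.5, 0.1)
l_null = f2_loglik(y, markers, (5.0, 3.0, 1.0), 0.5, 0.5)
```

At theta = 0.5 every marker class has the same mixing proportions (0.25, 0.50, 0.25), so the marker labels no longer affect the likelihood.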

F2L – Null Hypothesis A

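If the null hypothesis is

$H_0: a = 0$

then the two homozygotes share a common mean ($\mu_1 = \mu_3$) and the log likelihood becomes

$$l_{F2}(a = 0) = \sum_{i=1}^{N} \log\left\{\left[P(Q_1 \mid M_i) + P(Q_3 \mid M_i)\right] \exp\left[-\frac{(y_i - \mu_1)^2}{2\sigma^2}\right] + P(Q_2 \mid M_i)\, \exp\left[-\frac{(y_i - \mu_2)^2}{2\sigma^2}\right]\right\} - \frac{N}{2}\log\left(2\pi\sigma^2\right)$$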

F2L – Null Hypothesis B

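Suppose instead that the null hypothesis is

$H_0: d = 0$

so that the heterozygote mean is the midpoint of the homozygote means. Then

$$l_{F2}(d = 0) = \sum_{i=1}^{N} \log\left\{P(Q_1 \mid M_i)\, \exp\left[-\frac{(y_i - \mu_1)^2}{2\sigma^2}\right] + P(Q_2 \mid M_i)\, \exp\left[-\frac{\left(y_i - \tfrac{1}{2}(\mu_1 + \mu_3)\right)^2}{2\sigma^2}\right] + P(Q_3 \mid M_i)\, \exp\left[-\frac{(y_i - \mu_3)^2}{2\sigma^2}\right]\right\} - \frac{N}{2}\log\left(2\pi\sigma^2\right)$$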

F2L – Null Hypothesis C

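Suppose instead that the null hypothesis is

$H_0: a = 0,\ d = 0$

so that all three QTL genotypes share a common mean $\mu$. Then

$$l_{F2}(a = 0, d = 0) = -\frac{1}{2\sigma^2}\sum_{i=1}^{N}(y_i - \mu)^2 - \frac{N}{2}\log\left(2\pi\sigma^2\right)$$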

F2L – Null Hypothesis D

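When the null hypothesis is

$H_0: \theta = 0.5$

$$l_{F2}(\theta = 0.5) = \sum_{i=1}^{N} \log\left\{\frac{1}{4}\exp\left[-\frac{(y_i - \mu_1)^2}{2\sigma^2}\right] + \frac{1}{2}\exp\left[-\frac{(y_i - \mu_2)^2}{2\sigma^2}\right] + \frac{1}{4}\exp\left[-\frac{(y_i - \mu_3)^2}{2\sigma^2}\right]\right\} - \frac{N}{2}\log\left(2\pi\sigma^2\right)$$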

F2L – Statistical Test

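The G statistic

$$G = 2\left[\,l_{F2}\left(\hat{\mu}_1, \hat{\mu}_2, \hat{\mu}_3, \hat{\sigma}^2, \hat{\theta}\right) - l_{F2}\left(\hat{\mu}_1, \hat{\mu}_2, \hat{\mu}_3, \hat{\sigma}^2, \theta = 0.5\right)\right] \sim \chi^2_{\mathrm{df}=1}$$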

F2L – Maximization

Iterative methods are required to find the maximum likelihood estimates.

Other approaches have been suggested, such as combining moment estimation with maximum likelihood approach. The resulting system of equations to solve for the estimators is given on the next slide.

F2L – Finding MLEs

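With $m_{AA}$, $m_{Aa}$, $m_{aa}$ the observed trait means and $S^2_{AA}$, $S^2_{Aa}$, $S^2_{aa}$ the observed trait variances in each marker class:

$$\begin{aligned}
m_{AA} &= (1-\theta)^2\mu_1 + 2\theta(1-\theta)\mu_2 + \theta^2\mu_3 \\
m_{Aa} &= \theta(1-\theta)\mu_1 + \left[(1-\theta)^2 + \theta^2\right]\mu_2 + \theta(1-\theta)\mu_3 \\
m_{aa} &= \theta^2\mu_1 + 2\theta(1-\theta)\mu_2 + (1-\theta)^2\mu_3 \\
S^2_{AA} &= \sigma^2 + (1-\theta)^2(\mu_1 - m_{AA})^2 + 2\theta(1-\theta)(\mu_2 - m_{AA})^2 + \theta^2(\mu_3 - m_{AA})^2 \\
S^2_{Aa} &= \sigma^2 + \theta(1-\theta)(\mu_1 - m_{Aa})^2 + \left[(1-\theta)^2 + \theta^2\right](\mu_2 - m_{Aa})^2 + \theta(1-\theta)(\mu_3 - m_{Aa})^2 \\
S^2_{aa} &= \sigma^2 + \theta^2(\mu_1 - m_{aa})^2 + 2\theta(1-\theta)(\mu_2 - m_{aa})^2 + (1-\theta)^2(\mu_3 - m_{aa})^2
\end{aligned}$$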

F2L – Dominant Marker Model

Modify the likelihood equations with QTL genotype probabilities conditional on the marker phenotype for a dominant marker.

Modify the expected trait values for each marker phenotype.

Done.
