Analysis of Variance (Marden, 2003)



    Notes on Analysis of Variance: Old School

    John I. Marden

    Copyright 2003


    Chapter 1

    Introduction to Linear Models

These notes are based on a course I taught using the text Plane Answers to Complex Questions by Ronald Christensen (Third edition, 2002, Springer-Verlag). Hence, everything throughout these pages implicitly uses that book as a reference. So keep a copy handy! But everything here is my own interpretation.

    1.1 Dependent and explanatory variables

How is height related to weight? How are sex and age related to heart disease? What factors influence crime rate? Questions such as these have one dependent variable of interest, and one or more explanatory variables. The goal is to assess the relationship of the explanatory variables to the dependent variable. Examples:

Dependent Variable    Explanatory Variables
Weight                Height
Cholesterol level     Fat intake
Heart function        Age, sex
Crime rate            Population density, Average income, Educational level
Bacterial count       Drug

Linear models model the relationship by writing the mean of the dependent variable as a linear combination of the explanatory variables, or some representations of the explanatory variables. For example, a linear model relating cholesterol level to the percentage of fat in the diet would be

cholesterol = β0 + β1(fat) + residual. (1.1)

The intercept β0 and slope β1 are parameters, usually unknown and to be estimated. One does not expect the cholesterol level to be an exact function of fat. Rather, there will be random variation: two people with the same fat intake will likely have different cholesterol levels, just as two people of the same height will have different weights. The residual is the part of the dependent variable not explained by the linear function of the explanatory variables. As we go along, we will make other assumptions about the residuals, but the key one at this point is that they have mean 0. That is, the dependent variable is on average equal to the linear function of the explanatory variables.

    It is easy to think of more complicated models that are still linear, e.g.,

cholesterol = β0 + β1(fat) + β2(exercise) + residual, (1.2)

    or

cholesterol = β0 + β1(fat) + β2(fat)² + residual. (1.3)

Wait! you might say. That last equation is not linear, it is quadratic: the mean of cholesterol is a parabolic function of fat intake. Here is one of the strengths of linear models: the linearity is in the parameters, so that one or more representations of the explanatory variables can appear (e.g., here represented by fat and fat²), as long as they are combined linearly. An example of a non-linear model:

cholesterol = β0 e^{β1(fat)} + residual. (1.4)

    This model is perfectly fine, just not a linear model.

A particular type of linear model, used when the explanatory variables are categorical, is the analysis of variance model, which is the main focus of this course. A categorical variable is one whose values are not necessarily numerical. One study measured the bacterial count of leprosy patients, where each patient was given one of three treatments: Drug A, Drug D, or a placebo. The explanatory variable is the treatment, but it is not numerical. One way to represent treatment is with 0-1 variables, say a, d, and p:

a = { 1 if treatment is Drug A
    { 0 if treatment is not Drug A ,

d = { 1 if treatment is Drug D
    { 0 if treatment is not Drug D ,   (1.5)

and

p = { 1 if treatment is the placebo
    { 0 if treatment is not the placebo .   (1.6)

If someone received Drug A, that patient's values for the representations would be a = 1, d = 0, p = 0. Similarly, one receiving the placebo would have a = 0, d = 0, p = 1. (Note that exactly one of a, d, p will be 1, the others 0.) One can go the other way: knowing a, d, p, it is easy to figure out what the treatment is. In fact, you only need to know two of them, e.g., a and d. A linear model constructed from these representations is then

bacterial count = β1 a + β2 d + β3 p + residual. (1.7)

If the residual has mean 0, then one can see that β1 is the mean of patients who receive Drug A, β2 is the mean of those who receive Drug D, and β3 is the mean for the control group.
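To see the 0-1 coding concretely, here is a small sketch (numpy assumed; the treatment labels and variable names are mine, not the notes'):

import numpy as np

# Hypothetical treatment assignments for six patients
treatment = np.array(["A", "A", "D", "D", "placebo", "placebo"])

# 0-1 representations a, d, p as in (1.5)-(1.6)
a = (treatment == "A").astype(float)
d = (treatment == "D").astype(float)
p = (treatment == "placebo").astype(float)

# Exactly one of a, d, p is 1 for each patient
assert np.all(a + d + p == 1)

# Design matrix whose columns are a, d, p, as in (1.7)
X = np.column_stack([a, d, p])
print(X)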


    1.2 Matrix notation

The values for the dependent variable will be denoted using y's. The representations of the explanatory variables will usually be denoted using x's, although other letters may show up. The y's and x's will have subscripts to indicate individual and variable. These subscripts may be single or may be multiple, whichever seems most useful at the time. The residuals use e's, and the parameters are usually β's, but can be any other Greek letter as well.

The next step is to write the model in a universal matrix notation. The dependent variable is an n × 1 vector y, where n is the number of observations. The representations of the explanatory variables are in the n × p matrix X, where the jth column of X contains the values for the n observations on the jth representation. The β is the p × 1 vector of coefficients. Finally, e is the n × 1 vector of residuals. Putting these together, we have the model

y = Xβ + e. (1.8)

(Generally, column vectors will be denoted by underlining, and matrices will be bold.) The following examples show how to set up this matrix notation.

Simple linear regression. Simple linear regression has one x, as in (1.1). If there are n observations, then the linear model would be written

yi = β0 + β1 xi + ei,   i = 1, . . . , n. (1.9)

The y, e, and β are easy to construct:

y = (y1, y2, . . . , yn)′,   e = (e1, e2, . . . , en)′,   and   β = (β0, β1)′. (1.10)

For X, we need a vector for the xi's, but also a vector of 1's, which are surreptitiously multiplying the β0:

X =
  [ 1  x1 ]
  [ 1  x2 ]
  [ ⋮   ⋮ ]   (1.11)
  [ 1  xn ]

Check to see that putting (1.10) and (1.11) in (1.8) yields the model (1.9). Note that p = 2, that is, X has two columns even though there is only one x.

Another useful way to look at the model is to let 1n be the n × 1 vector of 1's, and x be the vector of the xi's, so that X = (1n, x), and

y = β0 1n + β1 x + e. (1.12)

    (The text uses J to denote 1n, which is fine.)
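As a concrete check of (1.10)-(1.12), the following sketch (numpy assumed, with made-up xi values) builds X = (1n, x) and confirms that Xβ equals β0 1n + β1 x:

import numpy as np

n = 5
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical explanatory values
ones = np.ones(n)                         # the 1_n vector

X = np.column_stack([ones, x])            # X = (1_n, x), so p = 2
beta = np.array([0.5, 2.0])               # beta = (beta0, beta1)'

# The two ways of writing the mean agree: X beta = beta0*1_n + beta1*x
assert np.allclose(X @ beta, beta[0] * ones + beta[1] * x)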


Multiple linear regression. When there is more than one explanatory variable, as in (1.2), we need an extra subscript for x, so that xi1 is the fat value and xi2 is the exercise level for person i:

yi = β0 + β1 xi1 + β2 xi2 + ei,   i = 1, . . . , n. (1.13)

    With q variables, the model would be

yi = β0 + β1 xi1 + · · · + βq xiq + ei,   i = 1, . . . , n. (1.14)

Notice that the quadratic model (1.3) is of this form with xi1 = xi and xi2 = xi²:

yi = β0 + β1 xi + β2 xi² + ei,   i = 1, . . . , n. (1.15)

    The general model (1.14) has the form (1.8) with a longer and wider X:

y = Xβ + e =
  [ 1  x11  x12  · · ·  x1q ]   [ β0 ]
  [ 1  x21  x22  · · ·  x2q ]   [ β1 ]
  [ ⋮   ⋮    ⋮           ⋮  ]   [ ⋮  ]   + e. (1.16)
  [ 1  xn1  xn2  · · ·  xnq ]   [ βq ]

Here, p = q + 1, again there being an extra column in X for the 1n vector. Analogous to (1.12), if we let xj be the vector of the xij's, which is the (j + 1)st column of X, so that

X = (1n, x1, x2, . . . , xq), (1.17)

we have that

y = β0 1n + β1 x1 + · · · + βq xq + e. (1.18)

Analysis of variance. In analysis of variance, or ANOVA, explanatory variables are categorical. A one-way ANOVA has one categorical variable, as in the leprosy example (1.7). Suppose in that example, there are two observations for each treatment, so that n = 6. (The actual experiment had ten observations in each group.) The layout is

Drug A      Drug D      Control
y11, y12    y21, y22    y31, y32      (1.19)

where now the dependent variable is denoted yij, where i indicates the treatment (1 = Drug A, 2 = Drug D, 3 = Control), and j indicates the individual within the treatment. The linear model (1.7) translates to

yij = βi + eij ,   i = 1, 2, 3,  j = 1, 2. (1.20)

To write the model in the matrix form (1.8), we first have to vectorize the yij's (and eij's), even though notationally they look like elements of a matrix. Any way you string them out is fine, as long as you are consistent. We will do it systematically by grouping the observations by treatment, that is,

y = (y11, y12, y21, y22, y31, y32)′,   and   e = (e11, e12, e21, e22, e31, e32)′. (1.21)

    Writing out the model in matrix form, we have

  [ y11 ]   [ 1  0  0 ]
  [ y12 ]   [ 1  0  0 ]   [ β1 ]
  [ y21 ] = [ 0  1  0 ]   [ β2 ]   + e. (1.22)
  [ y22 ]   [ 0  1  0 ]   [ β3 ]
  [ y31 ]   [ 0  0  1 ]
  [ y32 ]   [ 0  0  1 ]

Two-way ANOVA has two categorical explanatory variables. For example, the table below contains the leaf area/dry weight for some citrus trees, categorized by type of citrus fruit and amount of shade:

             Orange   Grapefruit   Mandarin
Sun            112        90          123
Half shade      86        73           89      (1.23)
Shade           80        62           81

(From Table 11.2.1 in Statistical Methods by Snedecor and Cochran.) Each variable has 3 categories, which means there are 9 categories taking the two variables together. The dependent variable again has two subscripts, yij, where now the i indicates the row variable (sun/shade) and the j represents the column variable (type of fruit). That is,

             Orange   Grapefruit   Mandarin
Sun           y11        y12          y13
Half shade    y21        y22          y23      (1.24)
Shade         y31        y32          y33

One linear model for such data is the additive model, in which the mean for yij is the sum of an effect of the ith row variable and an effect for the jth column. That is, suppose α1, α2, and α3 are the effects attached to Sun, Half-shade, and Shade, respectively, and β1, β2, and β3 are the effects attached to Orange, Grapefruit, and Mandarin, respectively. Then the additive model is

yij = αi + βj + eij. (1.25)

The idea is that the two variables act separately. E.g., the effect of sun on y is the same for each fruit. The additive model places a restriction on the means of the cells, that is,

μij ≡ E(Yij) = αi + βj. (1.26)


For the X, we use the vectors in (1.29) and (1.30), making sure that the 0's and 1's in these vectors are correctly lined up with the observations in the y vector. That is, X = (x1, x2, x3, x4, x5, x6), and the model is

y = (y11, y12, y13, y21, y22, y23, y31, y32, y33)′ = Xβ + e,   β = (α1, α2, α3, β1, β2, β3)′, with

       Rows       Columns
  [ 1  0  0     1  0  0 ]
  [ 1  0  0     0  1  0 ]
  [ 1  0  0     0  0  1 ]
  [ 0  1  0     1  0  0 ]
X = [ 0  1  0     0  1  0 ]   (1.32)
  [ 0  1  0     0  0  1 ]
  [ 0  0  1     1  0  0 ]
  [ 0  0  1     0  1  0 ]
  [ 0  0  1     0  0  1 ]

or

y = α1 x1 + α2 x2 + α3 x3 + β1 x4 + β2 x5 + β3 x6 + e. (1.33)

If the additivity restriction (1.26), that μij = αi + βj, is violated, then the model is said to have interaction. Specifically, the interaction term for each cell is defined by the difference

γij = μij − αi − βj, (1.34)

and the model with interaction is

yij = αi + βj + γij + eij. (1.35)

To write out the model completely, we need to add the γij's to β, and corresponding vectors to the X:

y = (y11, y12, y13, y21, y22, y23, y31, y32, y33)′ = Xβ + e, where β = (α1, α2, α3, β1, β2, β3, γ11, γ12, γ13, γ21, γ22, γ23, γ31, γ32, γ33)′ and

       Rows       Columns       Interactions
  [ 1 0 0     1 0 0     1 0 0 0 0 0 0 0 0 ]
  [ 1 0 0     0 1 0     0 1 0 0 0 0 0 0 0 ]
  [ 1 0 0     0 0 1     0 0 1 0 0 0 0 0 0 ]
  [ 0 1 0     1 0 0     0 0 0 1 0 0 0 0 0 ]
X = [ 0 1 0     0 1 0     0 0 0 0 1 0 0 0 0 ].   (1.36)
  [ 0 1 0     0 0 1     0 0 0 0 0 1 0 0 0 ]
  [ 0 0 1     1 0 0     0 0 0 0 0 0 1 0 0 ]
  [ 0 0 1     0 1 0     0 0 0 0 0 0 0 1 0 ]
  [ 0 0 1     0 0 1     0 0 0 0 0 0 0 0 1 ]


We will see later that we do not need that many vectors in X, as there are many redundancies the way it is written now.
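One compact way to build the matrix in (1.36) is with Kronecker products. The sketch below (numpy assumed; the construction is mine, not from the text) also confirms the redundancy just mentioned: the 15 columns span a space of rank only 9.

import numpy as np

# The 9 x 15 design matrix of (1.36), for a 3 x 3 layout with one
# observation per cell, ordered (1,1), (1,2), ..., (3,3).
I3, I9 = np.eye(3), np.eye(9)
rows = np.kron(I3, np.ones((3, 1)))    # row-effect indicators (the alphas)
cols = np.kron(np.ones((3, 1)), I3)    # column-effect indicators (the betas)
inter = I9                             # one indicator per cell (the gammas)

X = np.hstack([rows, cols, inter])
print(X.shape)                   # (9, 15)
print(np.linalg.matrix_rank(X))  # 9: many of the 15 columns are redundant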

Analysis of covariance. It may be that the main interest is in comparing the means of groups as in analysis of variance, but there are other variables that may be affecting the y. For example, in the study comparing three drugs' effectiveness in treating leprosy, there were bacterial measurements before and after treatment. The yij's are the after measurement, and one would expect the before measurement, in addition to the drugs, to affect the after measurement. Letting zij represent the before measurements, the model modifies (1.20) to

yij = βi + γ zij + eij, (1.37)

    or in matrix form, modifying (1.22),

  [ y11 ]   [ 1  0  0  z11 ]
  [ y12 ]   [ 1  0  0  z12 ]   [ β1 ]
  [ y21 ] = [ 0  1  0  z21 ]   [ β2 ]   + e. (1.38)
  [ y22 ]   [ 0  1  0  z22 ]   [ β3 ]
  [ y31 ]   [ 0  0  1  z31 ]   [ γ  ]
  [ y32 ]   [ 0  0  1  z32 ]

1.3 Vector spaces: definition

The model (1.8), y = Xβ + e, is very general. The X matrix can contain any numbers. The previous section gives some ideas of the scope of the model. In this section we look at the model slightly more abstractly. Letting μ be the vector of means of the elements of y, and X = (x1, x2, . . . , xp), the model states that

μ = β1 x1 + β2 x2 + · · · + βp xp, (1.39)

where the βj's can be any real numbers. Now μ is a vector in R^n, and (1.39) shows that μ is actually in a subset of R^n:

M ≡ {c1 x1 + c2 x2 + · · · + cp xp | c1 ∈ R, . . . , cp ∈ R}. (1.40)

Such a space of linear combinations of a set of vectors is called a span.

Definition 1 The span of the set of vectors {x1, . . . , xp} ⊂ R^n is

span{x1, . . . , xp} = {c1 x1 + · · · + cp xp | ci ∈ R, i = 1, . . . , p}. (1.41)

Because the matrix notation (1.8) is heavily used in this course, we have notation for connecting the X to the M.


Definition 2 For an n × p matrix X, C(X) denotes the column space of X, which is the span of the columns of X. That is, if X = (x1, . . . , xp), then

C(X) = span{x1, . . . , xp}. (1.42)

Spans have nice properties. In fact, a span is a vector space. The formal definition of a vector space, at least for those that are subsets of R^n, follows. [Vector spaces are more general than those that are subsets of R^n, but since those are the only ones we need, we will stick with this restricted definition.]

Definition 3 A subset M ⊂ R^n is a vector space if

x, y ∈ M ⟹ x + y ∈ M, and (1.43)

c ∈ R, x ∈ M ⟹ cx ∈ M. (1.44)

Thus any linear combination of vectors in M is also in M. Note that R^n is itself a vector space, as is the set {0n}. [The n × 1 vector of all 0's is 0n.] Because c in (1.44) can be 0, any subspace must contain 0n. Any line through 0n, or plane through 0n, is a subspace. It is not hard to show that any span is a vector space. Take M in (1.40). First, if x, y ∈ M, then there are ai's and bi's such that

x = a1 x1 + a2 x2 + · · · + ap xp   and   y = b1 x1 + b2 x2 + · · · + bp xp, (1.45)

so that

x + y = c1 x1 + c2 x2 + · · · + cp xp, where ci = ai + bi, (1.46)

hence x + y ∈ M. Second, for x ∈ M as in (1.45) and real c,

cx = c1 x1 + c2 x2 + · · · + cp xp, where ci = c·ai, (1.47)

hence cx ∈ M.

Not only is any span a subspace, but any subspace is a span of some vectors. Thus a linear model (1.8) can equivalently be defined as one for which

μ ∈ M   (μ = E[Y]) (1.48)

for some vector space M.

Specifying a vector space through a span is quite convenient, but not the only convenient way. Another is to give the form of the elements directly. For example, the vector space of all vectors with equal elements can be given in the following two ways:

{(a, a, . . . , a)′ ∈ R^n | a ∈ R} = span{1n}. (1.49)


    When n = 3, the x/y plane can be represented as

{(a, b, 0)′ | a ∈ R, b ∈ R} = span{(1, 0, 0)′, (0, 1, 0)′}. (1.50)

A different plane is

{(a, a + b, b)′ | a ∈ R, b ∈ R} = span{(1, 1, 0)′, (0, 1, 1)′}. (1.51)

    1.4 Linear independence and bases

Any subspace of R^n can be written as a span of at most n vectors, although not in a unique way. For example,

span{(1, 0, 0)′, (0, 1, 0)′} = span{(1, 0, 0)′, (0, 1, 0)′, (1, 1, 0)′}
                            = span{(1, 0, 0)′, (1, 1, 0)′}
                            = span{(2, 0, 0)′, (0, 7, 0)′, (33, 2, 0)′}. (1.52)

Note that the space in (1.52) can be a span of two or three vectors, or a span of any number more than three as well. It cannot be written as a span of only one vector. The minimum number of vectors is called the rank of the space, which in this example is 2. Any set of two vectors which does span that space is called a basis. Notice that in the two sets of three vectors, there is a redundancy, that is, one of the vectors can be written as a linear combination of the other two: (1, 1, 0)′ = (1, 0, 0)′ + (0, 1, 0)′ and (2, 0, 0)′ = −(4/(33 · 7))(0, 7, 0)′ + (2/33)(33, 2, 0)′. Such sets are called linearly dependent.
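These rank and dependency claims are easy to check numerically; a small sketch, assuming numpy:

import numpy as np

# Columns are the vectors from the last set in (1.52)
V = np.array([[2.0, 0.0, 33.0],
              [0.0, 7.0,  2.0],
              [0.0, 0.0,  0.0]])
print(np.linalg.matrix_rank(V))   # 2: the three vectors are linearly dependent

# The dependency noted above: (2,0,0)' = -(4/(33*7))*(0,7,0)' + (2/33)*(33,2,0)'
lhs = V[:, 0]
rhs = -(4 / (33 * 7)) * V[:, 1] + (2 / 33) * V[:, 2]
print(np.allclose(lhs, rhs))      # True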

    To formally define basis, we need to first define linear independence.

Definition 4 The vectors x1, . . . , xp in R^n are linearly independent if

a1 x1 + · · · + ap xp = 0n   ⟹   a1 = · · · = ap = 0. (1.53)

Equivalently, the vectors are linearly independent if no one of them (as long as it is not 0n) can be written as a linear combination of the others. That is, they are linearly dependent if there is an xi ≠ 0n and a set of coefficients ai such that

xi = a1 x1 + · · · + ai−1 xi−1 + ai+1 xi+1 + · · · + ap xp. (1.54)


They are not linearly dependent if and only if they are linearly independent. In (1.52), the sets with three vectors are linearly dependent, and those with two vectors are linearly independent. To see that latter fact for {(1, 0, 0)′, (1, 1, 0)′}, suppose that a1(1, 0, 0)′ + a2(1, 1, 0)′ = (0, 0, 0)′. Then

a1 + a2 = 0 and a2 = 0   ⟹   a1 = a2 = 0, (1.55)

which verifies (1.53). If a set of vectors is linearly dependent, then one can remove one of the redundant vectors (1.54), and still have the same span. A basis is a set of vectors that has the same span but no dependencies.

Definition 5 The set of vectors {z1, . . . , zd} is a basis for the subspace M if the vectors are linearly independent and

M = span{z1, . . . , zd}.

For estimating β, the following lemma is useful.

Lemma 1 If {x1, . . . , xp} is a basis for M, then for x ∈ M, there is a unique set of coefficients a1, . . . , ap such that x = a1 x1 + · · · + ap xp.

Although a (nontrivial) subspace has many bases, each basis has the same number of elements, which is the rank.

    Definition 6 The rank of a subspace is the number of vectors in any of its bases.

A couple of useful facts about a vector space M with rank d:

1. Any set of more than d vectors from M is linearly dependent;

2. Any set of d linearly independent vectors from M forms a basis of M.

For example, consider the one-way ANOVA model in (1.22). The three vectors in X are clearly linearly independent, hence the space C(X) has rank 3, and those vectors constitute a basis. On the other hand, the columns of X in the two-way additive ANOVA model in (1.32) are not linearly independent: the first three add to 19, as do the last three, hence

x1 + x2 + x3 − x4 − x5 − x6 = 09. (1.56)

Removing any one of the vectors does leave a basis. The model (1.36) has many redundancies. For one thing, n = 9, and there are 15 columns in X. One basis consists of the 9 interaction vectors (i.e., the last 9 vectors). Another consists of the columns of the following matrix,


obtained by dropping a judicious set of vectors from X:

    Rows       Columns    Interactions
  [ 1 0 0      1 0      1 0 0 0 ]
  [ 1 0 0      0 1      0 1 0 0 ]
  [ 1 0 0      0 0      0 0 0 0 ]
  [ 0 1 0      1 0      0 0 1 0 ]
  [ 0 1 0      0 1      0 0 0 1 ]   (1.57)
  [ 0 1 0      0 0      0 0 0 0 ]
  [ 0 0 1      1 0      0 0 0 0 ]
  [ 0 0 1      0 1      0 0 0 0 ]
  [ 0 0 1      0 0      0 0 0 0 ]

    Not just any 9 vectors of X will be a basis, though. For example, the first 9 are not linearlyindependent, as in (1.56).
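Reusing the Kronecker-product construction of X from the sketch in Section 1.2 (again my own check, not from the notes), one can confirm which column subsets form a basis:

import numpy as np

I3, I9 = np.eye(3), np.eye(9)
rows = np.kron(I3, np.ones((3, 1)))
cols = np.kron(np.ones((3, 1)), I3)
X = np.hstack([rows, cols, I9])          # the 15-column X of (1.36)

# The 9 columns kept in (1.57): all three row indicators, the first two
# column indicators, and the interaction indicators for cells (1,1), (1,2),
# (2,1), (2,2), which sit in positions 6, 7, 9, 10 of X.
keep = [0, 1, 2, 3, 4, 6, 7, 9, 10]
print(np.linalg.matrix_rank(X[:, keep])) # 9: these columns are a basis of C(X)

# The first 9 columns of X are not linearly independent
print(np.linalg.matrix_rank(X[:, :9]))   # less than 9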

    1.5 Summary

This chapter introduced linear models, with some examples, and showed that any linear model can be expressed in a number of equivalent ways:

1. Each yi is written as a linear combination of the xij's, plus a residual (e.g., equation 1.13);

2. The vector y is written as a linear combination of the xj vectors, plus a vector of residuals (e.g., equation 1.18);

3. The vector y is written as Xβ, plus a vector of residuals (as in equation 1.8);

4. The mean vector μ is restricted to a vector space, as in (1.48).

Each representation is useful in different situations, and it is important to be able to go from one to the others.

In the next chapter, we consider estimation of the mean E(Y) and the parameters β. What can be estimated, and how, depends on the vectors in X, and whether they form a basis.


    Chapter 2

    Estimation and Projections

This chapter considers estimating E[Y] and β. There are many estimation techniques, and which are best depends on the distribution assumptions made on the residuals. We start with minimal assumptions on the residuals, and look at unbiased estimation using least squares.

The basic model is

y = Xβ + e, where E[e] = 0n. (2.1)

The expected value of a vector is just the vector of expected values, so that

E[e] = E[(e1, e2, . . . , en)′] = (E[e1], E[e2], . . . , E[en])′. (2.2)

Then, as in (1.39), with μ = E[Y], taking expected values gives E[Y] = Xβ + E[e], so that

μ = Xβ. (2.3)

    2.1 Projections and estimation of the mean

We first divorce ourselves from β, and just estimate μ ∈ M (2.3) for a vector space M. The idea in estimating μ is to pick an estimate μ̂ ∈ M that is close to the observed y. The least squares principle is to find the estimate that has the smallest sum of squares from y. That is, μ̂ is the vector in M such that

Σ_{i=1}^n (yi − μ̂i)² ≤ Σ_{i=1}^n (yi − ai)² for any a ∈ M. (2.4)

The length of a vector x ∈ R^n is √(Σ_{i=1}^n xi²), which is denoted by the norm ‖x‖, so that

‖x‖² = Σ_{i=1}^n xi² = x′x. (2.5)


    2.1.1 Some simple examples

Suppose M = span{1n} = {(a, a, . . . , a)′ | a ∈ R}. Then the projection of y onto M is a vector (b, b, . . . , b)′ (∈ M) such that (y − (b, b, . . . , b)′) ⊥ M, i.e.,

(y − (b, b, . . . , b)′)′ (a, a, . . . , a)′ = 0 for all a ∈ R. (2.12)

Now (2.12) means that

Σ_{i=1}^n (yi − b) a = a (Σ_{i=1}^n yi − nb) = 0 for all a ∈ R. (2.13)

The only way that last equality can hold for all a is if

Σ_{i=1}^n yi − nb = 0, (2.14)

or

b = (Σ_{i=1}^n yi)/n = ȳ. (2.15)

Thus the projection is

ŷ = (ȳ, ȳ, . . . , ȳ)′. (2.16)

Extend this example to M = span{x} for any fixed nonzero vector x ∈ R^n. Because it is an element of M, ŷ = cx for some c, and for y − ŷ to be orthogonal to M, it must be orthogonal to x, that is,

x′(y − cx) = 0. (2.17)

Solve for c:

x′y = c x′x ⟹ c = x′y / x′x, (2.18)

so that

ŷ = (x′y / x′x) x. (2.19)

Next, consider the one-way ANOVA model (1.22),

  [ y11 ]   [ 1  0  0 ]
  [ y12 ]   [ 1  0  0 ]   [ β1 ]
  [ y21 ] = [ 0  1  0 ]   [ β2 ]   + e, (2.20)
  [ y22 ]   [ 0  1  0 ]   [ β3 ]
  [ y31 ]   [ 0  0  1 ]
  [ y32 ]   [ 0  0  1 ]


so that

M = span{ (1, 1, 0, 0, 0, 0)′, (0, 0, 1, 1, 0, 0)′, (0, 0, 0, 0, 1, 1)′ }. (2.21)

Now ŷ = (a, a, b, b, c, c)′ for some a, b, c. For y − ŷ to be orthogonal to M, it is enough that it be orthogonal to the spanning vectors of M.

Proposition 2 If M = span{x1, x2, . . . , xp}, then a ⊥ M if and only if a ⊥ xi for i = 1, . . . , p.

Proof. If a ⊥ M, then a ⊥ xi for each i because a is orthogonal to all vectors in M. So suppose a ⊥ xi for i = 1, . . . , p, and take any x ∈ M. By definition of span, x = c1 x1 + · · · + cp xp, so that

x′a = (c1 x1 + · · · + cp xp)′a = c1 x1′a + · · · + cp xp′a = 0, (2.22)

because each xi′a = 0. □

Writing down the equations resulting from (y − (a, a, b, b, c, c)′)′x = 0 for x being each of the spanning vectors in (2.21) yields

y11 − a + y12 − a = 0
y21 − b + y22 − b = 0
y31 − c + y32 − c = 0. (2.23)

It is easy to solve for a, b, c:

a = (y11 + y12)/2 ≡ ȳ1· ;   b = (y21 + y22)/2 ≡ ȳ2· ;   c = (y31 + y32)/2 ≡ ȳ3· . (2.24)

These equations introduce the dot notation: when a variable has multiple subscripts, replacing a subscript with a dot (·), and placing a bar over the variable, denotes the average of the variable over that subscript.

    2.1.2 The projection matrix

Rather than figuring out the projection for every y, one can find a matrix M that gives the projection.

Definition 9 For a vector space M, the matrix M such that ŷ = My for any y ∈ R^n is called the projection matrix.


Proposition 4 Suppose M = C(X), where X = (x1, . . . , xp). If {x1, . . . , xp} is a basis for M, then

M = X(X′X)⁻¹X′. (2.35)

The proposition uses that X′X is invertible if its columns are linearly independent. We will show that later. We do note that even if the columns are not linearly independent, (X′X)⁻¹ can be replaced by any generalized inverse, which we will mention later as well.

Proof of proposition. For any given x, let x̂ be its projection onto M, so that x̂ = Xb for some vector b. Because x − x̂ ⊥ M, x − x̂ ⊥ xj for each j, so that X′(x − x̂) = 0p, hence

X′(x − Xb) = 0p, which ⟹ X′x = X′Xb ⟹ b = (X′X)⁻¹X′x, (2.36)

and

x̂ = X(X′X)⁻¹X′x. (2.37)

Thus (2.35) holds. □

Compare (2.28) for p = 1 to (2.35). Also note that it is easy to see that this M is symmetric and idempotent. Even though the basis is not unique for a given vector space, the projection matrix is unique, hence any basis will yield the same X(X′X)⁻¹X′.

It is interesting that any symmetric idempotent matrix M is a projection matrix for some vector space, that vector space being

M = {Mx | x ∈ R^n}. (2.38)
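Formula (2.35) is easy to try out; the sketch below (numpy assumed, using the one-way ANOVA X from (2.20) and an arbitrary y) also checks the symmetry and idempotence just mentioned:

import numpy as np

# One-way ANOVA design matrix from (2.20): two observations per group
X = np.array([[1, 0, 0],
              [1, 0, 0],
              [0, 1, 0],
              [0, 1, 0],
              [0, 0, 1],
              [0, 0, 1]], dtype=float)

M = X @ np.linalg.inv(X.T @ X) @ X.T   # projection matrix, (2.35)

print(np.allclose(M, M.T))             # symmetric
print(np.allclose(M @ M, M))           # idempotent

y = np.array([6.0, 0.0, 0.0, 2.0, 13.0, 10.0])   # arbitrary y values
print(M @ y)                           # each entry is its group's mean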

    2.2 Estimating coefficients

    2.2.1 Coefficients

    Return to the linear model (2.1),

y = Xβ + e, where E[e] = 0n. (2.39)

In this section we consider estimating β, or linear functions of β. In some sense, this task is less fundamental than estimating μ = E[Y], since the meaning of any βj depends not only on its corresponding column in X, but also what other columns happen to be in X. For example, consider these five equivalent models for μ in the one-way ANOVA (1.22):

μ = X1β1 =
  [ 1 0 0 ]
  [ 1 0 0 ]   [ β1 ]
  [ 0 1 0 ]   [ β2 ]
  [ 0 1 0 ]   [ β3 ]
  [ 0 0 1 ]
  [ 0 0 1 ]

= X2β2 =
  [ 1 1 0 0 ]
  [ 1 1 0 0 ]   [ μ  ]
  [ 1 0 1 0 ]   [ α1 ]
  [ 1 0 1 0 ]   [ α2 ]
  [ 1 0 0 1 ]   [ α3 ]
  [ 1 0 0 1 ]

= X3β3 =
  [ 1 1 0 ]
  [ 1 1 0 ]   [ μ  ]
  [ 1 0 1 ]   [ α1 ]
  [ 1 0 1 ]   [ α2 ]
  [ 1 0 0 ]
  [ 1 0 0 ]

= X4β4 =
  [ 1  1  0 ]
  [ 1  1  0 ]   [ μ  ]
  [ 1  0  1 ]   [ α1 ]
  [ 1  0  1 ]   [ α2 ]
  [ 1 −1 −1 ]
  [ 1 −1 −1 ]

= X5β5 =
  [ 1  1  1 ]
  [ 1  1  1 ]   [ δ  ]
  [ 1  1 −1 ]   [ γ1 ]
  [ 1  1 −1 ]   [ γ2 ]
  [ 1 −2  0 ]
  [ 1 −2  0 ]   (2.40)

The β1, β2, β3 in β1 are the means of the three groups, i.e., βj = E[Yij]. Comparing to β2, we see that βj = μ + αj, but we still do not have a good interpretation for μ and the α's. For example, if μ = 0, then αj = βj, the mean of the jth group. But if μ = (β1 + β2 + β3)/3, the overall average, then αj = βj − β̄, the effect of group j. Thus the presence of the 16 vector in X2 changes the meaning of the coefficients of the other vectors.

Now in β2, one may make the restriction that α3 = 0. Then one has the third model, and μ = β3, α1 = β1 − β3, α2 = β2 − β3. Alternatively, a common restriction is that α1 + α2 + α3 = 0, so that the second formulation becomes the fourth:

μ = X2β2 =
  [ 1 1 0 0 ]
  [ 1 1 0 0 ]   [ μ        ]
  [ 1 0 1 0 ]   [ α1       ]
  [ 1 0 1 0 ]   [ α2       ]
  [ 1 0 0 1 ]   [ −α1 − α2 ]
  [ 1 0 0 1 ]

=
  [ 1  1  0 ]
  [ 1  1  0 ]   [ μ  ]
  [ 1  0  1 ]   [ α1 ]
  [ 1  0  1 ]   [ α2 ]   . (2.41)
  [ 1 −1 −1 ]
  [ 1 −1 −1 ]

Now μ = β̄ and αj = βj − β̄, the effect.


The final expression has

β1 = δ + γ1 + γ2,   β2 = δ + γ1 − γ2,   β3 = δ − 2γ1, (2.42)

from which can be derived

δ = β̄,   γ1 = (1/3)( (1/2)(β1 + β2) − β3 ),   γ2 = (1/2)(β1 − β2). (2.43)

Then, e.g., for the leprosy example (1.19), δ is the overall bacterial level, γ1 contrasts the average of the two drugs and the placebo, and γ2 contrasts the two drugs.

    2.2.2 Least squares estimation of the coefficients

We know that the least squares estimate of the mean μ = E[Y] is μ̂ = ŷ, the projection of y onto M = C(X) in (2.39). It exists and is unique, because the projection is. A least squares estimate of β is one that yields the projection.

Definition 10 In the model y = Xβ + e, a least squares estimate of β is any vector β̂ for which

ŷ = Xβ̂, (2.44)

where ŷ is the projection of y onto C(X).

A least squares estimate of a linear combination λ′β, where λ ∈ R^p, is λ′β̂ for any least squares estimate β̂ of β.

A least squares estimate of β always exists, but it may not be unique. The condition for uniqueness is direct.

Proposition 5 The least squares estimate of β is unique if and only if the columns of X are linearly independent.

The proposition follows from Lemma 1, because if the columns of X are linearly independent, they form a basis for C(X), hence there is a unique set of β̂j's that will solve

ŷ = β̂1 x1 + · · · + β̂p xp. (2.45)

And if the columns are not linearly independent, there are many sets of coefficients that will yield the ŷ.

If the columns of X are linearly independent, then X′X is invertible. In that case, as in the proof of Proposition 4, equation (2.36), we have that

ŷ = Xb for b = (X′X)⁻¹X′y, (2.46)

which means that the unique least squares estimate of β is

β̂ = (X′X)⁻¹X′y. (2.47)
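In practice one usually lets a least squares routine solve (2.44) rather than inverting X′X by hand; a sketch (numpy assumed, simple linear regression with made-up data) showing the two agree:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
X = np.column_stack([np.ones_like(x), x])      # (1_n, x), columns linearly independent

beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y    # (2.47)
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_hat, beta_ls))          # True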


With the above example, μ + α1 has λ = (1, 1, 0, 0)′, and taking a = (1/2, 1/2, 0, 0, 0, 0)′, so that a′y = ȳ1·, we have that

E[a′Y] = (1/2)(E[Y11] + E[Y12]) = (1/2)(μ + α1 + μ + α1) = μ + α1 (2.53)

no matter what μ and the αi's are. That is not the unique estimator. Note that Y11 alone works, also.

On the other hand, consider λ′β = μ with λ = (1, 0, 0, 0)′. Can we find a so that

E[a′Y] = E[(a11, a12, a21, a22, a31, a32)Y]
       = (a11 + a12)(μ + α1) + (a21 + a22)(μ + α2) + (a31 + a32)(μ + α3)
       = μ? (2.54)

    For that to occur, we need

    a11 + a12 + a21 + a22 + a31 + a32 = 1, a11 + a12 = 0, a21 + a22 = 0, a31 + a32 = 0. (2.55)

Those equations are impossible to solve, since the last three imply that the sum of all the aij's is 0, not 1. Thus μ is not estimable.

    The next proposition systematizes how to check for estimability.

Proposition 6 In the model y = Xβ + e, with E[e] = 0n, λ′β is estimable if and only if there exists an n × 1 vector a such that

a′X = λ′. (2.56)

Proof. By Definition 11, since E[Y] = Xβ in (2.52), λ′β is estimable if and only if there exists a such that a′Xβ = λ′β for all β ∈ R^p. But that equality is equivalent to a′X = λ′. □

Note that the condition (2.56) means that λ′ is a linear combination of the rows of X, i.e., λ ∈ C(X′). (Or we could introduce the notation R(X) to denote the span of the rows.)

    To see how that works in the example (2.50), look at

C(X′) = span{ (1, 1, 0, 0)′, (1, 0, 1, 0)′, (1, 0, 0, 1)′ }. (2.57)

(The rows of X have some duplicate vectors.) Then μ + α1 has λ = (1, 1, 0, 0)′, which is clearly in C(X′), since it is one of the basis vectors. (Similarly for μ + α2 and μ + α3.) The contrast α1 − 2α2 + α3 has λ = (0, 1, −2, 1)′. That vector is also in C(X′), since λ = (1, 1, 0, 0)′ − 2(1, 0, 1, 0)′ + (1, 0, 0, 1)′. On the other hand, consider estimating μ, which


    2.2.3 Example: Leprosy

Below are data on leprosy patients (from Snedecor and Cochran, Statistical Methods). There were 30 patients, randomly allocated to three groups of 10. The first group received drug A, the second drug D, and the third group received a placebo. Each person had the bacterial count taken before and after receiving the treatment.

   Drug A          Drug D          Placebo
Before  After   Before  After   Before  After
  11      6        6      0       16     13
   8      0        6      2       13     10
   5      2        7      3       11     18
  14      8        8      1        9      5
  19     11       18     18       21     23
   6      4        8      4       16     12
  10     13       19     14       12      5
   6      1        8      9       12     16
  11      8        5      1        7      1
   3      0       15      9       12     20

First, consider the one-way ANOVA, with the after measurements as the y's, ignoring the before measurements. The model is

  [ y11   ]   [ 1 1 0 0 ]
  [ y12   ]   [ 1 1 0 0 ]
  [  ⋮    ]   [ ⋮ ⋮ ⋮ ⋮ ]
  [ y1,10 ]   [ 1 1 0 0 ]   [ μ  ]
  [ y21   ]   [ 1 0 1 0 ]   [ α1 ]
  [ y22   ] = [ 1 0 1 0 ]   [ α2 ]   + e. (2.59)
  [  ⋮    ]   [ ⋮ ⋮ ⋮ ⋮ ]   [ α3 ]
  [ y2,10 ]   [ 1 0 1 0 ]
  [ y31   ]   [ 1 0 0 1 ]
  [ y32   ]   [ 1 0 0 1 ]
  [  ⋮    ]   [ ⋮ ⋮ ⋮ ⋮ ]
  [ y3,10 ]   [ 1 0 0 1 ]

The sample means are ȳ1· = 5.3, ȳ2· = 6.1, ȳ3· = 12.3. Suppose we are interested in the two contrasts α1 − α2, comparing the two drugs, and (α1 + α2)/2 − α3, comparing the placebo to the average of the two drugs. The least squares estimates are found by taking the same contrasts of the sample means:

α̂1 − α̂2 = 5.3 − 6.1 = −0.8,   (α̂1 + α̂2)/2 − α̂3 = (5.3 + 6.1)/2 − 12.3 = −6.6. (2.60)
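These numbers are easy to reproduce; a sketch (numpy assumed) using the after counts from the table above:

import numpy as np

after = {
    "A":       [6, 0, 2, 8, 11, 4, 13, 1, 8, 0],
    "D":       [0, 2, 3, 1, 18, 4, 14, 9, 1, 9],
    "placebo": [13, 10, 18, 5, 23, 12, 5, 16, 1, 20],
}
ybar = {g: np.mean(v) for g, v in after.items()}
print(ybar)                                           # {'A': 5.3, 'D': 6.1, 'placebo': 12.3}

print(ybar["A"] - ybar["D"])                          # approximately -0.8
print((ybar["A"] + ybar["D"]) / 2 - ybar["placebo"])  # approximately -6.6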


There doesn't appear to be much difference between the two drugs, but their average seems better than the placebo. Is it significantly different? That question will be addressed in the next chapter. We need standard errors for the estimates.

What about the before measurements? The sample means of the before measurements are 9.3, 10, and 12.9, respectively. Thus, by chance, the placebo group happened to get people who were slightly worse off already, so it might be important to make some adjustments. A simple one would be to take the y's as after − before, which is very reasonable in this case. Instead, we will look at the analysis of covariance model (1.38), which adds in the before measurements (the covariates) as the zij's:

  [ y11   ]   [ 1 1 0 0 z11   ]
  [ y12   ]   [ 1 1 0 0 z12   ]
  [  ⋮    ]   [ ⋮ ⋮ ⋮ ⋮  ⋮    ]
  [ y1,10 ]   [ 1 1 0 0 z1,10 ]   [ μ  ]
  [ y21   ]   [ 1 0 1 0 z21   ]   [ α1 ]
  [ y22   ] = [ 1 0 1 0 z22   ]   [ α2 ]   + e, (2.61)
  [  ⋮    ]   [ ⋮ ⋮ ⋮ ⋮  ⋮    ]   [ α3 ]
  [ y2,10 ]   [ 1 0 1 0 z2,10 ]   [ γ  ]
  [ y31   ]   [ 1 0 0 1 z31   ]
  [ y32   ]   [ 1 0 0 1 z32   ]
  [  ⋮    ]   [ ⋮ ⋮ ⋮ ⋮  ⋮    ]
  [ y3,10 ]   [ 1 0 0 1 z3,10 ]

where the last column of X holds the observed before measurements: 11, 8, . . . , 3 for Drug A, 6, 6, . . . , 15 for Drug D, and 16, 13, . . . , 12 for the placebo group.

We want to estimate the same contrasts. Start with α1 − α2. Is it estimable? We need to show that there is an a such that a′X = (0, 1, −1, 0, 0). We can do it with just three rows of X, the first two and the eleventh. That is, we want to find a, b, c so that

a(1, 1, 0, 0, 11) + b(1, 1, 0, 0, 8) + c(1, 0, 1, 0, 6) = (0, 1, −1, 0, 0), (2.62)

or

a + b + c = 0
a + b = 1
c = −1
11a + 8b + 6c = 0. (2.63)

The second and third equations imply the first, and the third is that c = −1, hence using b = 1 − a, the fourth equation yields a = −2/3, hence b = 5/3. Thus

−(2/3) y11 + (5/3) y12 − y21 is an unbiased estimate of α1 − α2. (2.64)


The least squares estimate replaces the yij's with their hats, the ŷij's. We know in principle the projection ŷ from Question 4 of HW #2 (where now we have 10 instead of 2 observations in each group). That is, the projection vector has elements of the form

ŷ1j = a + d z1j,   ŷ2j = b + d z2j,   ŷ3j = c + d z3j. (2.65)

The constants are

a = ȳ1· − d z̄1·,   b = ȳ2· − d z̄2·,   c = ȳ3· − d z̄3·, (2.66)

hence

ŷij = ȳi· + d (zij − z̄i·). (2.67)

The d is

d = [ Σ_{i=1}^3 Σ_{j=1}^{10} (yij − ȳi·) zij ] / [ Σ_{i=1}^3 Σ_{j=1}^{10} (zij − z̄i·) zij ]. (2.68)

Plugging in the data, we obtain that d = 585.4/593 = 0.987. Back to estimating α1 − α2, substitute the ŷij's of (2.67) for the yij's in (2.64) to get the least squares estimate

−(2/3) ŷ11 + (5/3) ŷ12 − ŷ21
  = −(2/3)(ȳ1· + d (z11 − z̄1·)) + (5/3)(ȳ1· + d (z12 − z̄1·)) − (ȳ2· + d (z21 − z̄2·))
  = (ȳ1· − d z̄1·) − (ȳ2· − d z̄2·) + d (−(2/3) z11 + (5/3) z12 − z21)
  = (ȳ1· − d z̄1·) − (ȳ2· − d z̄2·)
  = 5.3 − 0.987(9.3) − (6.1 − 0.987(10))
  = −0.109. (2.69)

The third line comes from the second line since −(2/3)z11 + (5/3)z12 − z21 = −(2/3)(11) + (5/3)(8) − 6 = 0. Notice that the least squares estimate is the same as that without covariates, but using adjusted (ȳi· − d z̄i·)'s instead of plain ȳi·'s.

The unadjusted estimate (not using the covariate) was −0.8, so the adjusted estimate is even smaller in magnitude.

A similar procedure will show that the least squares estimate of (α1 + α2)/2 − α3 is

[ (ȳ1· − d z̄1·) + (ȳ2· − d z̄2·) ] / 2 − (ȳ3· − d z̄3·) = −3.392. (2.70)

This value is somewhat less (in absolute value) than the unadjusted estimate −6.6.
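The covariate-adjusted estimates can be reproduced from the table as well; a sketch (numpy assumed) following (2.66)-(2.70):

import numpy as np

before = {
    "A":       [11, 8, 5, 14, 19, 6, 10, 6, 11, 3],
    "D":       [6, 6, 7, 8, 18, 8, 19, 8, 5, 15],
    "placebo": [16, 13, 11, 9, 21, 16, 12, 12, 7, 12],
}
after = {
    "A":       [6, 0, 2, 8, 11, 4, 13, 1, 8, 0],
    "D":       [0, 2, 3, 1, 18, 4, 14, 9, 1, 9],
    "placebo": [13, 10, 18, 5, 23, 12, 5, 16, 1, 20],
}
groups = ["A", "D", "placebo"]
y = {g: np.array(after[g], float) for g in groups}
z = {g: np.array(before[g], float) for g in groups}

# The common slope d from (2.68)
num = sum(((y[g] - y[g].mean()) * z[g]).sum() for g in groups)
den = sum(((z[g] - z[g].mean()) * z[g]).sum() for g in groups)
d = num / den
print(round(d, 3))                                            # 0.987

# Covariate-adjusted group levels (2.66), and the two contrasts
adj = {g: y[g].mean() - d * z[g].mean() for g in groups}
print(round(adj["A"] - adj["D"], 3))                          # -0.109
print(round((adj["A"] + adj["D"]) / 2 - adj["placebo"], 3))   # -3.392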


    Chapter 3

    Variances and Covariances

The previous chapter focussed on finding unbiased estimators for μ, β, and λ′β. In this chapter we tackle standard errors and variances, and in particular look for estimators with small variance. The variance of a random variable Z is Var(Z) = E[(Z − μZ)²], where μZ = E[Z]. [Note: these are true if the expectations exist.] With two variables, Y1 and Y2, there is the covariance:

Cov(Y1, Y2) = E[(Y1 − μ1)(Y2 − μ2)], where μ1 = E(Y1) and μ2 = E(Y2). (3.1)

    The covariance of a variable with itself is the variance.

3.1 Covariance matrices for affine transformations

Defining the mean of a vector or of a matrix is straightforward: it is just the vector or matrix of means. That is, as in (2.2), for the vector Y = (Y1, . . . , Yn)′,

E(Y) = (E(Y1), E(Y2), . . . , E(Yn))′, (3.2)

and for an n × p matrix W,

         [ W11  W12  · · ·  W1p ]   [ E(W11)  E(W12)  · · ·  E(W1p) ]
E[W] = E [ W21  W22  · · ·  W2p ] = [ E(W21)  E(W22)  · · ·  E(W2p) ]
         [  ⋮    ⋮    ⋱      ⋮  ]   [   ⋮       ⋮      ⋱       ⋮   ]   (3.3)
         [ Wn1  Wn2  · · ·  Wnp ]   [ E(Wn1)  E(Wn2)  · · ·  E(Wnp) ]

Turning to variances, an n × 1 vector Y = (Y1, . . . , Yn)′ has n variances (the Var(Yi)'s) and several covariances Cov(Yi, Yj). These are all conveniently arranged in the covariance matrix, defined for the vector Y to be the n × n matrix Cov(Y) whose ijth element is Cov(Yi, Yj). It is often denoted Σ:

Cov(Y) = Σ =
  [ Var(Y1)       Cov(Y1, Y2)  · · ·  Cov(Y1, Yn) ]
  [ Cov(Y2, Y1)   Var(Y2)      · · ·  Cov(Y2, Yn) ]
  [     ⋮              ⋮        ⋱          ⋮      ]   (3.4)
  [ Cov(Yn, Y1)   Cov(Yn, Y2)  · · ·  Var(Yn)     ]

The diagonals are the variances, and the matrix is symmetric because Cov(Yi, Yj) = Cov(Yj, Yi). Analogous to the definition of variance, an equivalent definition of the covariance matrix is

Cov(Y) = E[(Y − μ)(Y − μ)′], where μ = E(Y). (3.5)

The variances and covariances of linear combinations are often needed in this course, e.g., the least squares estimates are linear combinations of the yi's. Fortunately, the means and (co)variances of linear combinations are easy to obtain from those of the originals. With just one variable Z, we know that for any a and b,

E[a + b Z] = a + b E[Z] and Var[a + b Z] = b² Var[Z]. (3.6)

Note that the constant a does not affect the variation. Turn to an n × 1 vector Z, and consider the affine transformation

    W = a + BZ (3.7)

for some m × 1 vector a and m × n matrix B, so that W is m × 1. (A linear transformation would be BZ. The word affine pops up because of the additional constant vector a.) Expected values are linear, so

    E[W] = E[a + B Z] = a + B E[Z]. (3.8)

For the covariance, start by noting that

W − E[W] = (a + B Z) − (a + B E[Z]) = B(Z − E[Z]), (3.9)

so that

Cov[W] = Cov[a + BZ] = E[(W − E[W])(W − E[W])′]
       = E[B(Z − E[Z])(Z − E[Z])′B′]
       = B E[(Z − E[Z])(Z − E[Z])′] B′
       = B Cov[Z] B′. (3.10)

    Compare these formulas to the univariate ones, (3.6).
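A quick Monte Carlo sanity check of (3.8) and (3.10), assuming numpy; the choices of a, B, and the distribution of Z are arbitrary:

import numpy as np

rng = np.random.default_rng(0)
a = np.array([1.0, -2.0])
B = np.array([[1.0, 0.5, 0.0],
              [0.0, 2.0, 1.0]])         # m = 2, n = 3

Z = rng.normal(size=(100_000, 3))       # rows are independent draws, Cov[Z] = I_3
W = a + Z @ B.T                         # W = a + B Z for each draw

print(np.allclose(W.mean(axis=0), a, atol=0.05))                   # E[W] = a + B E[Z] = a
print(np.allclose(np.cov(W, rowvar=False), B @ B.T, atol=0.1))     # Cov[W] = B Cov[Z] B'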


because â ⊥ (a − â), hence

‖a‖² > ‖â‖² unless a = â. (3.22)

Thus if a′y is an unbiased estimator of λ′β that is not the least squares estimate, then a ≠ â, â′y is the least squares estimate by Proposition 9, and

Var(a′y) = σ²e ‖a‖² > σ²e ‖â‖² = Var(â′y). (3.23)

That is, any unbiased linear estimate has larger variance than the least squares estimate. □

Thus we have established that the least squares estimate is best in terms of variance. The next section develops estimates of the variance.

    3.3 Estimating the variance

In the model (3.11), σ²e is typically a parameter to be estimated. Because the residuals have mean zero, E(ei) = 0, we have that

σ²e = Var(ei) = E(ei²), so that E[ (Σ_{i=1}^n ei²) / n ] = σ²e. (3.24)

Unfortunately, we do not observe the actual ei's, because e = y − Xβ, and β is not observed. Thus we have to estimate e, which we can do by plugging in the (or a) least squares estimate of β:

ê = y − Xβ̂ = y − ŷ = (In − M)y, (3.25)

where M is the projection matrix for C(X). Note that ê is the projection of y onto C(X)⊥. See Proposition 3. Because E[ŷ] = E[y] = Xβ, E[ê] = 0n, hence as for the ei's, Var[êi] = E[êi²], but unlike the ei's, the êi's do not typically have variance σ²e. Rather, the variance of êi is σ²e times the ith diagonal of (In − M). Then

E[ Σ_{i=1}^n êi² ] = Σ_{i=1}^n Var[êi] = σ²e [sum of diagonals of (In − M)] = σ²e trace(In − M). (3.26)

Thus an unbiased estimator of σ²e is

σ̂²e = (Σ_{i=1}^n êi²) / trace(In − M) = ‖ê‖² / trace(In − M). (3.27)

It is easy enough to calculate the trace of a matrix, but the trace of a projection matrix is actually the rank of the corresponding vector space.

Proposition 10 The rank of a vector space M is trace(M), where M is the projection matrix for M.


Proof. Suppose rank(M) = p, and let {x1, . . . , xp} be a basis for M. Then with X = (x1, . . . , xp) (so that M = C(X)), the projection matrix is M = X(X′X)⁻¹X′. The X′X is invertible because the columns of X are linearly independent. But then using the fact that trace(AB) = trace(BA),

trace(M) = trace(X(X′X)⁻¹X′) = trace((X′X)⁻¹(X′X)) = trace(Ip) = p, (3.28)

since (X′X) is p × p. □

Thus we have from (3.27) that

σ̂²e = ‖ê‖² / (n − p),   p = rank(C(X)). (3.29)

Now everything is set to estimate the standard error of the estimates.

    3.4 Example: Leprosy

    3.4.1 Without covariate

Continue the example from Section 2.2.3. Start with the model without the covariate, (2.59), and consider the first contrast α1 − α2, which compares the two drugs. From (2.60), the least squares estimate is

α̂1 − α̂2 = ȳ1· − ȳ2· = −0.8. (3.30)

This estimate is a′y with a = (1/10, . . . , 1/10, −1/10, . . . , −1/10, 0, . . . , 0)′, where each value is repeated ten times. Then

Var[α̂1 − α̂2] = σ²e ‖a‖² = σ²e (10(1/10)² + 10(1/10)²) = (1/5) σ²e. (3.31)

To estimate σ²e, we need ‖ê‖² and p, the rank of C(X). The columns of X in (2.59) are not linearly independent, because the first, 1n, is the sum of the other three. Thus we can eliminate the first, and the remaining are linearly independent, so p = 3. The projection is ŷ = (ȳ1·, . . . , ȳ1·, ȳ2·, . . . , ȳ2·, ȳ3·, . . . , ȳ3·)′, so that

σ̂²e = ‖ê‖²/(n − p) = ‖y − ŷ‖²/(n − p) = [ Σ_{i=1}^3 Σ_{j=1}^{10} (yij − ȳi·)² ] / (n − p) = 995.1/(30 − 3) = 36.856. (3.32)

Now the estimate of the standard error of the estimate α̂1 − α̂2 is √((1/5) σ̂²e) = √(36.856/5) = 2.715. Thus, with the estimate given in (3.30), we have an approximate 95% confidence interval for α1 − α2 being

(α̂1 − α̂2 ± 2 se) = (−0.8 ± 2(2.715)) = (−6.23, 4.63). (3.33)


This interval is fairly wide, containing 0, which suggests that there is no evidence of a difference between the two drugs. Equivalently, we could look at the approximate z-statistic,

(α̂1 − α̂2)/se = −0.8/2.715 = −0.295, (3.34)

and note that it is well less than 2 in absolute value. (Later, we will refine these inferences by replacing the 2 with a t-value, at least when assuming normality of the residuals.)

For the contrast comparing the average of the two drugs to the control, (α1 + α2)/2 − α3, we again have from (2.60) that

(α̂1 + α̂2)/2 − α̂3 = (ȳ1· + ȳ2·)/2 − ȳ3· = (5.3 + 6.1)/2 − 12.3 = −6.6. (3.35)

    1/10, . . . ,

    1/10), where there are 20 1/20s

    and 10 1/10s. Thus a2 = 20/202 + 10/102 = 0.15, and se = 36.856 0.15 = 2.35, andthe approximate confidence interval is

    ( (1 + 2) 3 2 se) = (6.6 2(2.35)) = (11.3, 1.9). (3.36)This interval is entirely below 0, which suggests that the drugs are effective relative to theplacebo. (Or look at z = 6.6/2.35 = 2.81.)

    3.4.2 With covariate

The hope is that by using the before measurements, the parameters can be estimated more accurately. In this section we use model (2.61), yij = μ + αi + γ zij + eij. Because the patients were randomly allocated to the three groups, and the zij's were measured before treatment, the μ and αi's have the same interpretation in both models (with and without covariates). However, the σ²e is not the same.

The projections now are, from (2.67), ŷij = ȳi· + d (zij − z̄i·), where d = γ̂ = 0.987. The dimension of X is now p = 4, because, after removing the 1n vector, the remaining four are linearly independent. (They would not be linearly independent if the zij's were the same within each group, but clearly they are not.) Then

σ̂²e = [ Σ_{i=1}^3 Σ_{j=1}^{10} (yij − ȳi· − 0.987 (zij − z̄i·))² ] / (30 − 4) = 16.046. (3.37)

Note how much smaller this estimate is than the 36.856 in (3.32).

If we were to follow the procedure in the previous section, we would need to find the a's for the estimates, then their lengths, in order to find the standard errors. Instead, we will go the other route, using (3.16) to find the variances: σ²e λ′(X′X)⁻¹λ. In order to proceed, we need X′X to be invertible, which at present is not true. We need to place a restriction on the parameters so that the matrix is invertible, but also in such a way that the meaning of the contrasts is the same. One way is to simply set μ = 0, so that

y = Xβ + e =
  [ 1 0 0 11 ]
  [ 1 0 0  8 ]
  [ ⋮ ⋮ ⋮  ⋮ ]
  [ 1 0 0  3 ]   [ α1 ]
  [ 0 1 0  6 ]   [ α2 ]
  [ 0 1 0  6 ]   [ α3 ]   + e. (3.38)
  [ ⋮ ⋮ ⋮  ⋮ ]   [ γ  ]
  [ 0 1 0 15 ]
  [ 0 0 1 16 ]
  [ 0 0 1 13 ]
  [ ⋮ ⋮ ⋮  ⋮ ]
  [ 0 0 1 12 ]

Now

X′X =
  [ 10    0    0    93 ]
  [  0   10    0   100 ]
  [  0    0   10   129 ]
  [ 93  100  129  4122 ]

and

(X′X)⁻¹ =
  [  0.2459   0.1568   0.2023  −0.0157 ]
  [  0.1568   0.2686   0.2175  −0.0169 ]
  [  0.2023   0.2175   0.3806  −0.0218 ]   (3.39)
  [ −0.0157  −0.0169  −0.0218   0.0017 ]

For α1 − α2, we have from (2.69) the estimate −0.109. For this contrast, λ = (1, −1, 0, 0)′, hence

Var(α̂1 − α̂2) = σ²e λ′(X′X)⁻¹λ = σ²e (0.2459 − 2(0.1568) + 0.2686) = σ²e (0.201). (3.40)

Using the estimate in (3.37), we have that

se(α̂1 − α̂2) = √(16.046 × 0.201) = 1.796, (3.41)

hence

z = −0.109/1.796 = −0.061, (3.42)

which is again quite small, showing no evidence of a difference between the drugs.

For the drug versus control contrast, from (2.70), we have the estimate (α̂1 + α̂2)/2 − α̂3 = −3.392. Now λ = (1/2, 1/2, −1, 0)′, hence

se = √( 16.046 × (1/2, 1/2, −1, 0)(X′X)⁻¹(1/2, 1/2, −1, 0)′ ) = √(16.046 × 0.1678) = 1.641. (3.43)

Compare this se to that without the covariate, 2.35. It is substantially smaller, showing that the covariate does help improve accuracy in this example.


Now

z = −3.392/1.641 = −2.07. (3.44)

This is marginally significant, suggesting there may be a drug effect. However, it is a bit smaller (in absolute value) than the z = −2.81 calculated without covariates. Thus the covariate is also important in adjusting for the fact that the control group had somewhat less healthy patients initially.
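The whole with-covariate calculation can be reproduced directly (numpy assumed; data as in the table of Section 2.2.3):

import numpy as np

before = [11, 8, 5, 14, 19, 6, 10, 6, 11, 3,
          6, 6, 7, 8, 18, 8, 19, 8, 5, 15,
          16, 13, 11, 9, 21, 16, 12, 12, 7, 12]
after = [6, 0, 2, 8, 11, 4, 13, 1, 8, 0,
         0, 2, 3, 1, 18, 4, 14, 9, 1, 9,
         13, 10, 18, 5, 23, 12, 5, 16, 1, 20]
z = np.array(before, float)
y = np.array(after, float)

# Design matrix of (3.38): group indicators (mu set to 0) plus the covariate z
G = np.kron(np.eye(3), np.ones((10, 1)))
X = np.column_stack([G, z])

XtX_inv = np.linalg.inv(X.T @ X)                   # as in (3.39)
beta_hat = XtX_inv @ X.T @ y                       # (alpha1, alpha2, alpha3, gamma)
resid = y - X @ beta_hat
sigma2_hat = (resid @ resid) / (30 - 4)            # 16.046, as in (3.37)

lam = np.array([0.5, 0.5, -1.0, 0.0])              # (alpha1 + alpha2)/2 - alpha3
est = lam @ beta_hat                               # -3.392
se = np.sqrt(sigma2_hat * (lam @ XtX_inv @ lam))   # 1.641
print(round(est, 3), round(se, 3), round(est / se, 2))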


    Chapter 4

Distributions: Normal, χ², and t

The previous chapter presented the basic estimates of parameters in the linear models. In this chapter we add the assumption of normality to the residuals, which allows us to provide more formal confidence intervals and hypothesis tests. The central distribution is the multivariate normal, from which the χ², t, and F are derived.

    4.1 Multivariate Normal

The standard normal distribution for a random variable Z is the familiar bell-shaped curve. The density is

f(z) = (1/√(2π)) e^{−z²/2}. (4.1)

The mean is 0 and the variance is 1. The more general normal distribution, with arbitrary mean μ and variance σ², is written X ~ N(μ, σ²), and has density

f(x; μ, σ²) = (1/(√(2π) σ)) e^{−(x − μ)²/(2σ²)} (4.2)

when σ² > 0. If σ² = 0, then X equals μ with probability 1, that is, X is essentially a constant.

The normal works nicely with affine transformations. That is,

X ~ N(μ, σ²)   ⟹   a + bX ~ N(a + bμ, b²σ²). (4.3)

We already know that E[a + bX] = a + bμ and Var[a + bX] = b²σ², but the added property in (4.3) is that if X is normal, so is a + bX. It is not hard to show using the change-of-variable formula.

We will assume that the residuals ei are independent N(0, σ²e) random variables, which will imply that the yi's are independent normals as well. We also need the distributions of vectors such as ŷ, ê, and β̂. It will turn out that under the assumptions, the individual components of these vectors are normal, but they are typically not independent (because their covariance matrices are not diagonal). Thus we need a distribution for the entire vector. This distribution is the multivariate normal, which we now define as an affine transformation of independent standard normals.

Definition 13 An n × 1 vector W has a multivariate normal distribution if for some n × 1 vector a and n × q matrix B,

W = a + BZ, (4.4)

where Z = (Z1, . . . , Zq)′, the Zi's being independent standard normal random variables.

If (4.4) holds, then W is said to be multivariate normal with mean μ and covariance Σ, where μ = a and Σ = BB′, written

W ~ Nn(μ, Σ). (4.5)

The elements of the vector Z in the definition all have E[Zi] = 0 and Var[Zi] = 1, and they are independent, hence E[Z] = 0q and Cov[Z] = Iq. Thus, as in (4.3), that μ = E[W] = E[a + BZ] = a and Σ = Cov[W] = Cov[a + BZ] = BB′ follows from (3.8) and (3.10). The added fillip is the multivariate normality. Note that by taking a = 0q and B = Iq, we have that

Z ~ Nq(0q, Iq). (4.6)

The definition presumes that the distribution is well-defined. That is, two different B's could yield the same Σ, so how can one be sure the distributions are the same? For example, suppose n = 2, and consider the two matrices

B1 = [ 1  0  1 ]        B2 = [ √(3/2)   1/√2 ]
     [ 0  1  1 ]   and       [   0       √2  ] .   (4.7)

Certainly

B1B1′ = B2B2′ = [ 2  1 ]
                [ 1  2 ]   = Σ, (4.8)

    = , (4.8)

    but is it clear that

    a + B1

    Z1Z2Z3

    and a + B2

    Z1Z2

    (4.9)

have the same distribution? Not only are they different linear combinations, but they are linear combinations of different numbers of standard normals. So it certainly is not obvious, but they do have the same distribution. This result depends on the normality of the Zi's. It can be proved using moment generating functions. Similar results do not hold for other Zi's, e.g., Cauchy or exponential.

The next question is, What μ's and Σ's are valid? Any μ ∈ R^n is possible, since a is arbitrary. But the possible Σ matrices are restricted. For one, BB′ is symmetric for any B, so Σ must be symmetric, but we already knew that because all covariance matrices are symmetric. We also need Σ to be nonnegative definite, which we deal with next.


Definition 14 A symmetric n × n matrix A is nonnegative definite if

x′Ax ≥ 0 for all x ∈ R^n. (4.10)

The matrix is positive definite if

x′Ax > 0 for all x ∈ R^n, x ≠ 0n. (4.11)

All covariance matrices are nonnegative definite. To see this fact, suppose Cov(Y) = Σ. Then for any vector x (of the right dimension),

x′Σx = Var(x′Y) ≥ 0, (4.12)

because variances are always nonnegative. Not all covariance matrices are positive definite, though.

For example, we know that for our models, Cov(ŷ) = σ²e M, where M is the projection matrix onto M. Now for any x, because M is symmetric and idempotent,

x′Mx = x′MMx = x̂′x̂ = ‖x̂‖². (4.13)

Certainly ‖x̂‖² ≥ 0, but is it always strictly positive? No. If x ⊥ M, then x̂ = 0n. Thus if there are any vectors besides 0n that are orthogonal to M, then M is not positive definite. There always are such vectors, unless M = R^n, in which case M = In.

If Cov(Y) = Σ is not positive definite, then there is a linear combination of the Yi's, x′Y, that has variance 0. That is, x′Y is essentially a constant.

The nonnegative definiteness of covariance matrices implies the covariance inequality that follows, which in turn implies that the correlation between any two random variables is in the range [−1, 1].

Lemma 2 Cauchy-Schwarz Inequality. For any two random variables Y1 and Y2 with finite variances,

Cov(Y1, Y2)² ≤ Var(Y1)Var(Y2). (4.14)

Thus, if the variances are positive,

−1 ≤ Corr(Y1, Y2) ≤ 1, where Corr(Y1, Y2) = Cov(Y1, Y2) / √(Var(Y1)Var(Y2)) (4.15)

is the correlation between Y1 and Y2.

Proof. Let Σ = Cov((Y1, Y2)′). Because Σ is nonnegative definite, x′Σx ≥ 0 for any x. Two such x's are (σ22, −σ12)′ and (−σ12, σ11)′, which yield the inequalities

σ22(σ11σ22 − σ12²) ≥ 0 and σ11(σ11σ22 − σ12²) ≥ 0, (4.16)


respectively. If either σ11 or σ22 is positive, then at least one of the equations shows that σ12² ≤ σ11σ22, which implies (4.14). If σ11 = σ22 = 0, then it is easy to see that σ12 = 0, e.g., by looking at (1, 1)Σ(1, 1)′ = 2σ12 ≥ 0 and (1, −1)Σ(1, −1)′ = −2σ12 ≥ 0, which imply that σ12 = 0. □

Back to the matrices BB′. All such matrices are nonnegative definite, because x′(BB′)x = (B′x)′B′x = ‖B′x‖² ≥ 0. Thus the Σ in Definition 13 must be nonnegative definite. But again, all covariance matrices are nonnegative definite. Thus the question is, Are all nonnegative definite symmetric matrices equal to BB′ for some B? The answer is, Yes. There are many possibilities, but the next subsection shows that there always exists a lower-triangular matrix L with Σ = LL′. Note that if L works, so does LΓ for any n × n orthogonal matrix Γ.

4.1.1 Cholesky decomposition

There are thousands of matrix decompositions. One that exhibits a B as above is the Cholesky decomposition, for which the matrix B is lower triangular.

Definition 15 An n × n matrix L is lower triangular if lij = 0 for i < j:

L =
  [ l11   0    0   · · ·   0  ]
  [ l21  l22   0   · · ·   0  ]
  [ l31  l32  l33  · · ·   0  ]   (4.17)
  [  ⋮    ⋮    ⋮    ⋱      ⋮  ]
  [ ln1  ln2  ln3  · · ·  lnn ]

Some properties:

1. The product of two lower triangular matrices is also lower triangular.

2. The lower triangular matrix L is invertible if and only if the diagonals are nonzero, lii ≠ 0. If it exists, the inverse is also lower triangular, and its diagonals are 1/lii.

    The main property we need is the following.

Proposition 11 If Σ is symmetric and nonnegative definite, then there exists a lower triangular matrix L with diagonals lii ≥ 0 such that

Σ = LL′. (4.18)

The L is unique if Σ is positive definite. In addition, Σ is positive definite if and only if the L in (4.18) has all diagonals lii > 0.


Proof. We will use induction on n. The first step is to prove it works for n = 1. In that case Σ = σ² and L = l, so the equation (4.18) is σ² = l², which is solved by taking l = +√(σ²). This l is nonnegative, and positive if and only if σ² > 0.

Now assume the decomposition works for any (n − 1) × (n − 1) symmetric nonnegative definite matrix, and write the n × n matrix Σ as

Σ = [ σ11   σ12′ ]
    [ σ12   Σ22  ] , (4.19)

where Σ22 is (n − 1) × (n − 1), and σ12 is (n − 1) × 1. Partition the lower-triangular matrix L similarly, that is,

L = [ l11   0′(n−1) ]
    [ l12   L22     ] , (4.20)

where L22 is an (n − 1) × (n − 1) lower-triangular matrix, and l12 is (n − 1) × 1. We want to find such an L that satisfies (4.18), which translates to the equations

    11 = l211

    12 = l11l1222 = L22L

    22 + l12l

    12. (4.21)

    It is easy to see that l11 = +

    11. To solve for l12, we have to know whether 11, hence l11,is positive. So there are two cases.

    11 > 0: Then l11 > 0, and using the second equation in (4.21), the unique solution is

    l12 = (1/l11)12 = (1/11)12. 11 = 0: By the covariance inequality in Lemma 2, the 11 = 0 implies that the

    covariances between the first variable and the others are all 0, that is, 12 = 0n1.Thus in this case we can take l11 = 0 and l12 = 0n1 in the second equation of (4.21),although any l12 will work.

    Now to solve for L22. If 11 = 0, then as above, 12 = 0n1, so that by the inductionhypothesis we have that there does exist a lower-triangular L22 with 22 = L22L

    22. If

    11 > 0, then the third line in (4.21) becomes

    22 111

    1212 = L22L22. (4.22)

    By the induction hypothesis, that equation can be solved if the left-hand side is nonnegativedefinite. For any mn matrix B, BB is also symmetric and nonnegative definite. (Why?)Consider the (n 1) n matrix

    B = ( 111

    12, In1). (4.23)


Multiplying out shows that

    BΣB' = Σ22 − σ11⁻¹ σ12 σ12'.   (4.24)

Therefore (4.22) can be solved with a lower-triangular L22, which by induction proves that any nonnegative definite matrix can be written as in (4.18) with a lower-triangular L.

There are a couple of other parts to this proposition. We won't give all the details, but suppose Σ is positive definite. Then σ11 > 0, and the above proof shows that l11 is positive and l12 is uniquely determined. Also, Σ22 − σ11⁻¹σ12σ12' will be positive definite, so that by induction the diagonals l22, . . . , lnn will also be positive, and the off-diagonals of L22 will be unique. □
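Here is a small numerical sketch of the construction in the proof (not part of the notes; it assumes Python with numpy, and it uses a tolerance to decide when a diagonal entry should be treated as zero):

```python
import numpy as np

def cholesky_nnd(Sigma, tol=1e-12):
    """Lower-triangular L with Sigma = L L', for symmetric nonnegative
    definite Sigma, following the induction in the proof: peel off the
    first row/column, then recurse on the (n-1) x (n-1) block."""
    Sigma = np.asarray(Sigma, dtype=float)
    n = Sigma.shape[0]
    L = np.zeros((n, n))
    s11 = Sigma[0, 0]
    L[0, 0] = np.sqrt(max(s11, 0.0))          # l11 = +sqrt(sigma11)
    if n == 1:
        return L
    s12 = Sigma[1:, 0]
    if s11 > tol:
        L[1:, 0] = s12 / L[0, 0]              # l12 = sigma12 / sqrt(sigma11)
        S22 = Sigma[1:, 1:] - np.outer(s12, s12) / s11   # left side of (4.22)
    else:
        L[1:, 0] = 0.0                        # sigma11 = 0 forces sigma12 = 0
        S22 = Sigma[1:, 1:]
    L[1:, 1:] = cholesky_nnd(S22, tol)        # induction step
    return L

# Check on a singular (rank 2) covariance matrix, where numpy's
# np.linalg.cholesky would fail because it requires positive definiteness:
A = np.array([[1.0, 2.0], [0.0, 1.0], [1.0, 1.0]])
Sigma = A @ A.T                               # nonnegative definite, rank 2
L = cholesky_nnd(Sigma)
print(np.allclose(L @ L.T, Sigma))            # True
print(np.allclose(L, np.tril(L)))             # lower triangular
```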

    4.2 Some properties of the multivariate normal

The multivariate normal has many properties useful for analyzing linear models. Three of them follow.

Proposition 12

1. Affine transformations. If W ~ Nn(μ, Σ), c is m × 1 and D is m × n, then

       c + DW ~ Nm(c + Dμ, DΣD').   (4.25)

2. Marginals. Suppose W ~ Nn(μ, Σ) is partitioned as

       W = ( W1 )
           ( W2 ),   (4.26)

   where W1 is n1 × 1 and W2 is n2 × 1, and the parameters are similarly partitioned:

       μ = ( μ1 )   and   Σ = ( Σ11  Σ12 )
           ( μ2 )             ( Σ21  Σ22 ),   (4.27)

   where μi is ni × 1 and Σij is ni × nj. Then

       W1 ~ Nn1(μ1, Σ11),   (4.28)

   and similarly for W2. In particular, the individual components have Wi ~ N(μi, σii), where σii is the i-th diagonal of Σ.

3. Independence. Partitioning W as in part 2, W1 and W2 are independent if and only if Σ12 = 0. In particular, Wi and Wj are independent if and only if σij = 0.

Part 1 follows from the Definition 13 of multivariate normality. That is, if W ~ Nn(μ, Σ), then W = μ + BZ where BB' = Σ and Z is a vector of independent standard normals. Thus

    c + DW = (c + Dμ) + (DB)Z,   (4.29)
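As a quick sanity check of part 1, one can simulate W = μ + BZ and an affine transformation of it (a sketch, not from the notes; Python with numpy is assumed, and the μ, Σ, c, and D below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical mean and covariance, chosen only for illustration.
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[4.0, 1.0, 0.0],
                  [1.0, 2.0, 0.5],
                  [0.0, 0.5, 1.0]])
B = np.linalg.cholesky(Sigma)            # one choice of B with B B' = Sigma

nrep = 200_000
Z = rng.standard_normal((3, nrep))       # independent N(0,1) entries
W = mu[:, None] + B @ Z                  # each column ~ N_3(mu, Sigma)

# The affine transformation c + D W should be N_2(c + D mu, D Sigma D').
c = np.array([0.0, 1.0])
D = np.array([[1.0, -1.0, 0.0],
              [0.5,  0.5, 1.0]])
V = c[:, None] + D @ W

print(np.abs(V.mean(axis=1) - (c + D @ mu)).max())   # near 0
print(np.abs(np.cov(V) - D @ Sigma @ D.T).max())      # near 0
```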


Then

    Cov[ ( ŷ ) ] = [   M    ] (σ²e In) [ M'  (In − M)' ]
         ( ê )     [ In − M ]

                 = σ²e [ MM'            M(In − M)'         ]
                       [ (In − M)M'     (In − M)(In − M)'   ]

                 = σ²e [ M    0      ]
                       [ 0    In − M ],   (4.37)

because M is idempotent and symmetric (so, e.g., M(In − M) = M − MM = M − M = 0). Thus the covariance between ŷ and ê is 0, which means that ŷ and ê are independent by part 3 of Proposition 12. (If the residuals are not normal, then these projections will not be independent in general, but just uncorrelated.)
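A small numerical check of these projection facts (a sketch assuming Python with numpy; the design matrix below is made up):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical design matrix with full column rank, n = 12, p = 3.
n, p = 12, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])

M = X @ np.linalg.solve(X.T @ X, X.T)     # projection matrix onto C(X)
I = np.eye(n)

print(np.abs(M @ M - M).max())            # idempotent: M M = M (near 0)
print(np.abs(M - M.T).max())              # symmetric
print(np.abs(M @ (I - M)).max())          # M(I - M) = 0, so Cov(yhat, ehat) = 0

# The fitted values and residuals are orthogonal for any y:
y = rng.standard_normal(n)
yhat, ehat = M @ y, (I - M) @ y
print(abs(yhat @ ehat))                   # inner product is (numerically) 0
```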

    We will need the next fact for confidence intervals and confidence regions.

Proposition 13 Under model (4.32), if X'X is invertible, β̂ and ê are independent.

The proof is easy once you realize that β̂ is a function of ŷ, which follows either by recalling that β̂ is found by satisfying ŷ = Xβ̂, or by using the formula β̂ = (X'X)⁻¹X'y, and noting that X'M = X', hence (X'X)⁻¹X'ŷ = (X'X)⁻¹X'My = (X'X)⁻¹X'y, or just writing it out:

    (X'X)⁻¹X'ŷ = (X'X)⁻¹X'(X(X'X)⁻¹X')y = (X'X)⁻¹X'y = β̂.   (4.38)

4.4 Chi-squares

Under the normality assumption, if a'y is an unbiased estimate of γ,

    (a'y − γ) / (σe ‖a‖) ~ N(0, 1).   (4.39)

To derive an exact confidence interval for γ, start with

    P[ −z(α/2) < (a'y − γ)/(σe ‖a‖) < z(α/2) ] = 1 − α,   (4.40)

where z(α/2) is the upper (α/2)th cutoff point for the N(0, 1), i.e., P[−z(α/2) < N(0, 1) < z(α/2)] = 1 − α. Then rewriting the inequalities in (4.40) so that γ is in the center shows that an exact 100 × (1 − α)% confidence interval for γ is

    a'y ± z(α/2) σe ‖a‖.   (4.41)

Unfortunately, the σe is still unknown, so we must estimate it, which then destroys the exact normality in (4.39). It turns out that Student's t is the correct way to adjust for this estimation, but first we need to obtain the distribution of σ̂²e. Which brings us to the χ² (chi-squared) distribution.


Now

    Z = X'W ~ Np(0p, X'MX) = Np(0p, X'XX'X) = Np(0p, Ip)   ⟹   ‖Z‖² ~ χ²_p.   (4.49)

Equation (4.48), and the fact that M is idempotent, shows that

    ‖Z‖² = ‖X'W‖² = W'XX'W = W'MW = (MW)'(MW).   (4.50)

Finally, MW ~ Nn(0n, M), so MW and W have the same distribution, and (4.44) holds, because p = trace(M). □

4.4.1 Distribution of σ̂²e

In the model (4.32), we have that ê ~ Nn(0n, σ²e(In − M)), hence (1/σe)ê ~ Nn(0n, In − M). Thus by (4.44), since trace(In − M) = n − p,

    (1/σ²e) ‖ê‖² ~ χ²_{n−p},   (4.51)

hence

    ‖ê‖² ~ σ²e χ²_{n−p},   (4.52)

and

    σ̂²e = (1/(n − p)) ‖ê‖² ~ (σ²e/(n − p)) χ²_{n−p}.   (4.53)
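A Monte Carlo sketch of (4.51)–(4.53) (not from the notes; Python with numpy is assumed, and the design matrix, coefficients, and σe are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical design with n = 20, p = 4, and error SD 1.5.
n, p, sigma_e = 20, 4, 1.5
X = np.column_stack([np.ones(n), rng.standard_normal((n, 3))])
M = X @ np.linalg.solve(X.T @ X, X.T)      # projection onto C(X)
beta = np.array([1.0, 2.0, -1.0, 0.5])

nrep = 100_000
Y = X @ beta + sigma_e * rng.standard_normal((nrep, n))  # each row is one data set
Ehat = Y - Y @ M                            # residuals (M is symmetric)
u = (Ehat**2).sum(axis=1) / sigma_e**2      # ||ehat||^2 / sigma_e^2

print(u.mean(), n - p)        # chi^2_{n-p} has mean n - p = 16
print(u.var(), 2 * (n - p))   # and variance 2(n - p) = 32
```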

4.5 Exact confidence intervals: Student's t

We now obtain the distribution of (4.39) with σ²e replaced by its estimate. First, we need to define Student's t.

Definition 17 If Z ~ N(0, 1) and U ~ χ²_ν, and Z and U are independent, then

    T ≡ Z / √(U/ν)   (4.54)

has a Student's t distribution on ν degrees of freedom, written

    T ~ t_ν.   (4.55)

Proposition 15 Under the model (4.32) with σ²e > 0, if γ is estimable and a'y is the least squares estimate, then

    (a'y − γ) / (σ̂e ‖a‖) ~ t_{n−p}.   (4.56)


Now is a good time to remind you that if X'X is invertible, then a'y = λ'β̂ and ‖a‖² = λ'(X'X)⁻¹λ.

Proof. We know that a'y = a'ŷ ~ N(γ, σ²e‖a‖²), hence as in (4.39),

    Z ≡ (a'y − γ) / (σe ‖a‖) ~ N(0, 1).   (4.57)

From (4.51),

    U ≡ (1/σ²e) ‖ê‖² ~ χ²_{n−p}.   (4.58)

Furthermore, ŷ and ê are independent by (4.37), hence the Z in (4.57) and U in (4.58) are independent. Plugging them into the formula for T in (4.54) yields

    T = [ (a'y − γ)/(σe ‖a‖) ] / √[ (1/σ²e)(1/(n − p)) ‖ê‖² ] = (a'y − γ) / (σ̂e ‖a‖).   (4.59)

This statistic is that in (4.56), hence by definition (4.55), it is t with ν = n − p. □

To obtain a confidence interval for γ, proceed as in (4.40) and (4.41), but use the Student's t instead of the Normal; that is, an exact 100 × (1 − α)% confidence interval is

    a'y ± t(ν, α/2) σ̂e ‖a‖,   (4.60)

where the t(ν, α/2) is found in a t table so that

    P[ −t(ν, α/2) < t_ν < t(ν, α/2) ] = 1 − α.   (4.61)

Example. Consider the contrast (α1 + α2)/2 − α3 from the Leprosy example, using the covariate, in Section 3.4.2. The least squares estimate is −3.392, and se = √(σ̂²e λ'(X'X)⁻¹λ) = 1.641. With the covariate, p = 4, and n = 30, hence ν = n − p = 26. Finding a t-table, t(26, 0.025) = 2.056, hence the 95% confidence interval is

    (−3.392 ± 2.056 × 1.641) = (−6.77, −0.02).   (4.62)

This interval just barely misses 0, so the effectiveness of the drugs is marginally significant.

Note. This confidence interval is exact if the assumptions are exact. Because we do not really believe that e or y are multivariate normal, in reality even the t interval is approximate. But in general, if the data are not too skewed and do not have large outliers, the approximation is fairly good.
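For reference, the interval in (4.62) can be reproduced numerically from the summary numbers above (a sketch assuming Python with scipy is available):

```python
from scipy.stats import t

# Summary numbers from the example: estimate of (alpha1 + alpha2)/2 - alpha3,
# its standard error, and nu = n - p = 30 - 4 = 26.
estimate, se, nu = -3.392, 1.641, 26

tcut = t.ppf(1 - 0.025, nu)                 # t_{26, 0.025} = 2.056
lo, hi = estimate - tcut * se, estimate + tcut * se
print(round(tcut, 3), (round(lo, 2), round(hi, 2)))   # 2.056, (-6.77, -0.02)
```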


    Chapter 5

    Nested Models

    5.1 Introduction

The previous chapters introduced inference on single parameters or linear combinations λ'β. Analysis of variance is often concerned with combined effects, e.g., the treatment effect in the leprosy example (1.26), or the sun/shade effect, fruit effect, or interaction effect in the two-way ANOVA example (1.23). Such effects are often not representable using one parameter, but rather by several parameters, or more generally, by nested vector spaces.

For example, consider the one-way ANOVA model yij = μ + αi + eij, i = 1, 2, 3, j = 1, 2:

    [ y11 ]             [ 1 1 0 0 ]
    [ y12 ]             [ 1 1 0 0 ]  [ μ  ]
    [ y21 ]  = Xβ + e = [ 1 0 1 0 ]  [ α1 ]  + e.   (5.1)
    [ y22 ]             [ 1 0 1 0 ]  [ α2 ]
    [ y31 ]             [ 1 0 0 1 ]  [ α3 ]
    [ y32 ]             [ 1 0 0 1 ]

We now know how to assess single contrasts, e.g., α1 − α2 or (α1 + α2)/2 − α3, but one may wish to determine whether there is any difference among the groups at all. If there is no difference, then the six observations are as from one large group, in which case the model would be

    y = μ 1_6 + e.   (5.2)

Letting MA = C(X) be the vector space for model (5.1) and M0 = span{1_6} be that for model (5.2), we have that M0 ⊂ MA. Such spaces are said to be nested. Note that model (5.2) can be obtained from model (5.1) by setting some parameters to zero, α1 = α2 = α3 = 0. It is not necessary that that be the case, e.g., we could have represented (5.1) without the


1_6 vector,

    [ y11 ]              [ 1 0 0 ]
    [ y12 ]              [ 1 0 0 ]  [ α1 ]
    [ y21 ]  = X∗α + e = [ 0 1 0 ]  [ α2 ]  + e,   (5.3)
    [ y22 ]              [ 0 1 0 ]  [ α3 ]
    [ y31 ]              [ 0 0 1 ]
    [ y32 ]              [ 0 0 1 ]

in which case we still have M0 ⊂ MA = C(X∗), but setting any of the αi's to zero would not yield M0. We could do it by setting α1 = α2 = α3, though.

Using the hypothesis testing formulation, we are interested in testing the smaller model as the null hypothesis, and the larger model as the alternative. Thus with μ = E[y], we are testing

    H0 : μ ∈ M0   versus   HA : μ ∈ MA.   (5.4)

(Formally, we should not let the two hypotheses overlap, so that HA should be μ ∈ MA − M0.)

The ANOVA approach to comparing two nested models is to consider the squared lengths of the projections onto M0 and MA, the idea being that the length of a projection represents the variation in the data y that is explained by the vector space. The basic decomposition of the squared length of y based on vector space M0 is

    ‖y‖² = ‖ŷ0‖² + ‖y − ŷ0‖²,   (5.5)

where ŷ0 is the projection of y onto M0. (Recall that this decomposition is due to ŷ0 and y − ŷ0 being orthogonal. It is the Pythagorean Theorem.) The equation in (5.5) is expressed as

    Total variation = Variation due to M0 + Variation unexplained by M0.   (5.6)

A similar decomposition using the projection onto MA, ŷA, is

    ‖y‖² = ‖ŷA‖² + ‖y − ŷA‖²,
    Total variation = Variation due to MA + Variation unexplained by MA.   (5.7)

The explanatory power of the alternative model MA over the null model M0 can be measured in a number of ways, e.g., by comparing the variation due to the two models, or by comparing the variation unexplained by the two models. The most common measures start with the variation unexplained by the null model, and look at how much of that is explained by the alternative. That is, we subtract the variation due to the null model from the equations (5.5) and (5.7):

    ‖y‖² − ‖ŷ0‖² = ‖y − ŷ0‖²   and   ‖y‖² − ‖ŷ0‖² = ‖ŷA‖² − ‖ŷ0‖² + ‖y − ŷA‖²,   (5.8)


yielding

    Variation unexplained by M0 = Variation explained by MA but not by M0
                                  + Variation unexplained by MA.   (5.9)

The larger the Variation explained by MA but not by M0, and the smaller the Variation unexplained by MA, the more evidence there is that the more complicated model MA is better than the simpler model M0. These quantities need to be normalized somehow. One popular way is to take the ratio

    R² ≡ (Variation explained by MA but not by M0) / (Variation unexplained by M0) = (‖ŷA‖² − ‖ŷ0‖²) / ‖y − ŷ0‖².   (5.10)

This quantity is sometimes called the coefficient of determination or the square of the multiple correlation coefficient. Usually it is called R-squared. The squaredness suggests that R² must be nonnegative, and the correlation in the term suggests it must be no larger than 1. Both suggestions are true. The next section looks more closely at these sums of squares.

    5.1.1 Note on calculating sums of squares

The sums of squares as in (5.5) for a generic model y = Xβ + e can be obtained by finding the ŷ explicitly, then squaring the elements and summing. When X'X is invertible, there are more efficient ways, although they may not be as stable numerically. That is, once one has β̂ and X'X calculated, it is simple to use

    ‖ŷ‖² = ŷ'ŷ = (Xβ̂)'(Xβ̂) = β̂'X'Xβ̂.   (5.11)

Then

    ‖y − ŷ‖² = ‖y‖² − β̂'X'Xβ̂.   (5.12)

These formulas are especially useful if p, the dimension of the β vector, is small relative to n.

As a special case, suppose X = 1n. Then β̂ = ȳ and X'X = n, so that (5.12) is

    ‖y − ŷ‖² = ‖y‖² − ȳ(n)ȳ,   (5.13)

i.e.,

    Σ_{i=1}^n (yi − ȳ)² = Σ_{i=1}^n yi² − n ȳ²,   (5.14)

the familiar machine formula used for calculating the sample standard deviation. These days, typical statistical programs use efficient and accurate routines for calculating linear model quantities, so that the efficiency of formula (5.12) is of minor importance to us. Conceptually, it comes in handy, though.
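A quick numerical check of (5.11)–(5.14) (a sketch assuming Python with numpy; the regression data are simulated, not real):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical simple regression data, just to verify the formulas.
n = 50
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
y = X @ np.array([2.0, 0.7]) + rng.standard_normal(n)

betahat = np.linalg.solve(X.T @ X, X.T @ y)
yhat = X @ betahat

# (5.11): ||yhat||^2 equals the quadratic form betahat' X'X betahat
print(yhat @ yhat, betahat @ (X.T @ X) @ betahat)

# (5.12): residual sum of squares without forming yhat explicitly
print((y - yhat) @ (y - yhat), y @ y - betahat @ (X.T @ X) @ betahat)

# (5.14): the machine formula, the special case X = 1_n
print(((y - y.mean())**2).sum(), (y**2).sum() - n * y.mean()**2)
```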


    5.2.1 Example

    Consider the Leprosy example, with the covariate, so that the model is as in (2.61),

    [ y11   ]                  [ 1 1 0 0 z11   ]
    [ y12   ]                  [ 1 1 0 0 z12   ]
    [  ...  ]                  [ ...           ]   [ μ  ]
    [ y1,10 ]                  [ 1 1 0 0 z1,10 ]   [ α1 ]
    [ y21   ]                  [ 1 0 1 0 z21   ]   [ α2 ]
    [ y22   ]  = XA βA + e  =  [ 1 0 1 0 z22   ]   [ α3 ]  + e.   (5.19)
    [  ...  ]                  [ ...           ]   [ γ  ]
    [ y2,10 ]                  [ 1 0 1 0 z2,10 ]
    [ y31   ]                  [ 1 0 0 1 z31   ]
    [ y32   ]                  [ 1 0 0 1 z32   ]
    [  ...  ]                  [ ...           ]
    [ y3,10 ]                  [ 1 0 0 1 z3,10 ]

The large model is then MA = C(XA). Consider the smaller model to be that without treatment effect. It can be obtained by setting α1 = α2 = α3 = 0 (or equal to any constant), so that M0 = span{1_30, z}. From (2.67) we know that ŷA has elements

    ŷA,ij = ȳi + 0.987 (zij − z̄i).   (5.20)

Notice that model M0 is just a simple linear regression model, yij = β0 + β1 zij + eij, so we know how to estimate the coefficients. They turn out to be β̂0 = −3.886 and β̂1 = 1.098, so

    ŷ0,ij = −3.886 + 1.098 zij.   (5.21)

To find the decompositions, we first calculate ‖y‖² = 3161. For model MA, we would like to use formula (5.12), but need X'X invertible, which we do by dropping the 1_30 vector from XA, so that we have the model (3.38). From (2.65) and (2.68), we can calculate the β̂ (without the μ̂) to be

    β̂ = ( −3.881 )
         ( −3.772 )
         ( −0.435 )
         (  0.987 ).   (5.22)

The X'X matrix is given in (3.39), hence

    ‖ŷA‖² = ( −3.881 )'  [ 10    0    0    93 ]  ( −3.881 )
            ( −3.772 )   [  0   10    0   100 ]  ( −3.772 )
            ( −0.435 )   [  0    0   10   129 ]  ( −0.435 )
            (  0.987 )   [ 93  100  129  4122 ]  (  0.987 )   = 2743.80.   (5.23)


Then ‖y − ŷA‖² is the difference, 3161 − 2743.80:

    ‖y‖² = ‖ŷA‖² + ‖y − ŷA‖²;
    3161 = 2743.80 + 417.20.   (5.24)

The decomposition for M0 is similar. We can find X0'X0, where X0 = (1_30, z), from X'X by adding the first three diagonals, and adding the first three elements of the last column:

    X0'X0 = [ 30    322  ]
            [ 322   4122 ].   (5.25)

Then

    ‖ŷ0‖² = ( −3.886  1.098 )  [ 30    322  ]  ( −3.886 )
                               [ 322   4122 ]  (  1.098 )   = 2675.24,   (5.26)

hence

    ‖y‖² = ‖ŷ0‖² + ‖y − ŷ0‖²;
    3161 = 2675.24 + 485.76.   (5.27)

    The decomposition of interest then follows easily by subtraction:

    ‖y − ŷ0‖² = ‖ŷA − ŷ0‖² + ‖y − ŷA‖²;
    485.76 = 68.56 + 417.20.   (5.28)

    The R2 is then

    R² = ‖ŷA − ŷ0‖² / ‖y − ŷ0‖² = 68.56 / 485.76 = 0.141.   (5.29)

That says that about 14% of the variation that the before measurements fail to explain is explained by the treatments. It is fairly small, which means there is still a lot of variation in the data not explained by the difference between the treatments. It may be that there are other variables that would be relevant, such as age, sex, weight, etc., or that bacterial counts are inherently variable.
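The whole decomposition can be carried out by projection. Here is a sketch (assuming Python with numpy); the sketch uses simulated data with the same layout (three groups of ten, one covariate) rather than the actual leprosy data, and the helper function name is invented:

```python
import numpy as np

rng = np.random.default_rng(4)

def proj(X):
    """Projection matrix onto C(X) (X need not be full rank)."""
    return X @ np.linalg.pinv(X.T @ X) @ X.T

# Simulated stand-in: 3 groups of 10, one covariate z.
g, N = 3, 10
n = g * N
group = np.repeat(np.arange(g), N)
z = rng.normal(10, 3, n)
y = 5 + np.array([-2.0, -1.0, 2.0])[group] + 1.0 * z + rng.normal(0, 2, n)

XA = np.column_stack([np.ones(n), (group[:, None] == np.arange(g)).astype(float), z])
X0 = np.column_stack([np.ones(n), z])          # null model: no treatment effect

yA, y0 = proj(XA) @ y, proj(X0) @ y
ss_a0 = ((yA - y0)**2).sum()                   # ||yhat_A - yhat_0||^2
sse = ((y - yA)**2).sum()                      # ||y - yhat_A||^2
print(((y - y0)**2).sum(), ss_a0 + sse)        # decomposition as in (5.28)
print("R^2 =", ss_a0 / ((y - y0)**2).sum())    # as in (5.29)
```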

    5.3 Mean squares

Another popular ratio is motivated by looking at the expected values of the sums of squares under the two models. We will concentrate on the two difference vectors that decompose y − ŷ0 in (5.17), ŷA − ŷ0 and y − ŷA. There are two models under consideration, the null M0 and the alternative MA. Thus there are two possible distributions for y:

    M0 : y ~ Nn(μ0, σ²e In) for some μ0 ∈ M0;
    MA : y ~ Nn(μA, σ²e In) for some μA ∈ MA.   (5.30)

The two difference vectors are independent under either model by the next proposition, because they are the projections on two orthogonal spaces, MA·0 and MA⊥.


For the covariance matrices in the table (5.33), we know trace(MA) = pA and trace(M0) = p0, where pA and p0 are the ranks of the respective vector spaces. Thus we have

    Expected sum of squares          M0               MA
    ESSA·0 ≡ E[‖ŷA − ŷ0‖²]           σ²e(pA − p0)     σ²e(pA − p0) + μA'(MA − M0)μA
    ESSE  ≡ E[‖y − ŷA‖²]             σ²e(n − pA)      σ²e(n − pA)
                                                                             (5.35)

The ESS means "expected sum of squares," and the ESSE means "expected sum of squares of errors."

Now for the question: If the null hypothesis is not true, then E[‖ŷA − ŷ0‖²] will be relatively large. How large is large? One approach is the R² idea from the previous section. Another is to notice that if the null hypothesis is true, how large E[‖ŷA − ŷ0‖²] is depends on σ²e (and pA − p0). Thus we could try comparing those. The key is to look at expected mean squares, which are obtained from table (5.35) by dividing by the degrees of freedom:

    Expected mean squares                   M0      MA
    EMSA·0 ≡ E[‖ŷA − ŷ0‖²]/(pA − p0)        σ²e     σ²e + μA'(MA − M0)μA/(pA − p0)
    EMSE  ≡ E[‖y − ŷA‖²]/(n − pA)           σ²e     σ²e
                                                                             (5.36)

The EMSE means "expected mean square error." One further step simplifies even more: take the ratio of the expected mean squares:

    Ratio of expected mean squares      M0      MA
    EMSA·0 / EMSE                       1       1 + μA'(MA − M0)μA / (σ²e(pA − p0))
                                                                             (5.37)

Now we have (sort of) answered the question: How large is large? Larger than 1. That is, this ratio of expected mean squares is 1 if the null hypothesis is true, and larger than 1 if the null hypothesis is not true. How much larger is semi-complicated to say, but it depends on μA and σ²e.

That is fine, but we need to estimate this ratio. We will use the analogous ratio of mean squares, that is, just remove the E[ ]'s. This ratio is called the F ratio, named after R. A. Fisher:

Definition 19 Given the above set up, the F ratio is

    F = MSA·0 / MSE = [ ‖ŷA − ŷ0‖²/(pA − p0) ] / [ ‖y − ŷA‖²/(n − pA) ].   (5.38)

Notice that EMSE is actually σ²e for the model MA.

The larger F, the more evidence we have for rejecting the null hypothesis in favor of the alternative. The next section will deal with exactly how large is large, again. But first we continue with the example in Section 5.2.1. From (5.28) we obtain the sums of squares. Then n = 30, p0 = 2 and pA = 4, hence

    MSA·0 = 68.56/(4 − 2) = 34.28,   MSE = 417.20/(30 − 4) = 16.05,   F = 34.28/16.05 = 2.14.   (5.39)


That F is not much larger than 1, so there does not seem to be a very significant treatment effect. The next section shows how to calculate the significance level.

    For both measures R2 and F, the larger, the more one favors the alternative model. These

    measures are in fact equivalent in the sense of being monotone functions of each other:

    R² = ‖ŷA − ŷ0‖² / ‖y − ŷ0‖² = ‖ŷA − ŷ0‖² / ( ‖ŷA − ŷ0‖² + ‖y − ŷA‖² ),   (5.40)

so that

    R² / (1 − R²) = ‖ŷA − ŷ0‖² / ‖y − ŷA‖²,   (5.41)

and

    [ (n − pA)/(pA − p0) ] × R²/(1 − R²) = [ ‖ŷA − ŷ0‖²/(pA − p0) ] / [ ‖y − ŷA‖²/(n − pA) ] = F.   (5.42)

5.4 The F distribution

Consider the F statistic in (5.42). We know that the numerator and denominator are independent, by (5.31). Furthermore, under the null model M0 in (5.33), from Proposition 14, Equation (4.44), we have as in Section 4.4.1 that

    ‖ŷA − ŷ0‖² ~ σ²e χ²_{pA−p0}   and   ‖y − ŷA‖² ~ σ²e χ²_{n−pA},   (5.43)

so that the distribution of F can be given as

    F ~ [ σ²e χ²_{pA−p0}/(pA − p0) ] / [ σ²e χ²_{n−pA}/(n − pA) ] = [ χ²_{pA−p0}/(pA − p0) ] / [ χ²_{n−pA}/(n − pA) ],   (5.44)

where the χ²'s are independent. In fact, that is the definition of the F distribution.

Definition 20 If U1 ~ χ²_{ν1} and U2 ~ χ²_{ν2}, and U1 and U2 are independent, then

    F ≡ (U1/ν1) / (U2/ν2)   (5.45)

has an F distribution with degrees of freedom ν1 and ν2. It is written

    F ~ F_{ν1,ν2}.   (5.46)

Then, according to the definition, when M0 is the true model,

    F = MSA·0 / MSE = [ ‖ŷA − ŷ0‖²/(pA − p0) ] / [ ‖y − ŷA‖²/(n − pA) ] ~ F_{pA−p0, n−pA}.   (5.47)


Note. When MA is true, then the numerator and denominator are still independent, and the denominator is still χ²_{n−pA}/(n − pA), but the SSA·0 is no longer χ². In fact, it is noncentral chi-squared, and the F is noncentral F. We will not deal with these distributions, except to say that they are larger than their regular (central) cousins.

    We can now formally test the hypotheses

    H0 : μ ∈ M0   versus   HA : μ ∈ MA,   (5.48)

based on y ~ Nn(μ, σ²e In). (We assume σ²e > 0. Otherwise, y = μ, so it is easy to test the hypotheses with no error.) For level α, reject the null hypothesis when

    F > F_{ν1,ν2,α},   (5.49)

where F is in (5.47), and F_{ν1,ν2,α} is the upper α cutoff point of the F distribution:

    P[ F_{pA−p0, n−pA} > F_{ν1,ν2,α} ] = α.   (5.50)

    There are tables of these cutoff points, and most statistical software will produce them.
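For example, in Python with scipy (assuming it is available), the cutoff and the p-value for the leprosy F statistic can be found as follows; the numbers are those from (5.39):

```python
from scipy.stats import f

# Degrees of freedom and F statistic from the leprosy example.
df1, df2, F = 2, 26, 2.14

print(f.ppf(0.95, df1, df2))    # upper 0.05 cutoff F_{2,26,0.05}, about 3.369
print(f.sf(F, df1, df2))        # p-value of the observed F, about 0.14
```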

Example. Continuing the leprosy example, from (5.39) we have that F = 2.14. Also, pA − p0 = 2 and n − pA = 26, hence we reject the null hypothesis (M0) of no treatment effect at the α = 0.05 level if

    F > F_{2,26,0.05} = 3.369.   (5.51)

Because 2.14 is less than 3.369, we cannot reject the null hypothesis, which means there is not enough evidence to say that there is a treatment effect.

Does this conclusion contradict that from the confidence interval in (4.62) for (α1 + α2)/2 − α3, which shows a significant difference between the average of the two drugs and the placebo? Yes and no. The F test is a less focussed test, in that it is looking for any difference among the three treatments. Thus it is combining inferences for the drug versus placebo contrast, which is barely significant on its own, and the drug A versus drug D contrast, which is very insignificant. The combining drowns out the first contrast, so that overall there does not appear to be anything significant. More on this phenomenon when we get to simultaneous inference.

    5.5 The ANOVA table

The ANOVA table is based on a systematic method for arranging the important quantities in comparing two nested models, taking off from the decomposition of the sums of squares and degrees of freedom. That is, write

    ‖y − ŷ0‖² = ‖ŷA − ŷ0‖² + ‖y − ŷA‖²
    n − p0   =  (pA − p0)  +  (n − pA)   (5.52)

    in table form. The more generic form is


    Source        Sum of squares    Degrees of freedom    Mean square       F
    Regression         63925                 1                 63925      11.65
    Error             713076               130                  5485
    Total             777001               131

    R² = 0.082

The F_{1,130,0.05} = 3.91, so that the regression is very significant. On the other hand, R² is quite small, suggesting there is substantial variation in the data. It is partly because this model does not take into account the important factor that there are actually two sports represented in the data.


    Chapter 6

    One-way ANOVA

This chapter will look more closely at the one-way ANOVA model. The model has g groups, and Ni observations in group i, so that there are n = N1 + ··· + Ng observations overall. Formally, the model is

    yij = μ + αi + eij,   i = 1, . . . , g;  j = 1, . . . , Ni,   (6.1)

where the eij's are independent N(0, σ²e)'s. Written in matrix form, we have

    y = Xβ + e = [ 1_N1  1_N1  0_N1  ...  0_N1 ]  [ μ  ]
                 [ 1_N2  0_N2  1_N2  ...  0_N2 ]  [ α1 ]
                 [  ...   ...   ...   .    ... ]  [ α2 ]  + e,   e ~ Nn(0n, σ²e In).   (6.2)
                 [ 1_Ng  0_Ng  0_Ng  ...  1_Ng ]  [ ...]
                                                  [ αg ]

The ANOVA is called balanced if there is the same number of observations in each group, that is, Ni = N, so that n = Ng. Otherwise, it is unbalanced. The balanced case is somewhat easier to analyze than the unbalanced one, but the difference is more evident in higher-way ANOVAs. See the next chapter.

The next section gives the ANOVA table. Section 6.2 shows how to further decompose the group sum of squares into components based on orthogonal contrasts. Section 6.3 looks at effects and gives constraints on the parameters to make them estimable. Later, in Chapter 8, we deal with the thorny problem of multiple comparisons: A single confidence interval may have a 5% chance of missing the parameter, but with many confidence intervals, each at 95%, the chance that at least one misses its parameter can be quite high. E.g., with 100 95% confidence intervals, you'd expect about 5 to miss. How can you adjust so that the chance is 95% that they are all ok?

    6.1 The ANOVA table

From all the work so far, it is easy to find the ANOVA table for testing whether there are any group effects, that is, testing whether the group means are equal. Here, MA = C(X)


for the X in (6.2), and M0 = span{1n}. The ranks of these spaces are, respectively, pA = g and p0 = 1. The projections have

    ŷA,ij = ȳi   and   ŷ0,ij = ȳ,   (6.3)

where

    ȳi = (1/Ni) Σ_{j=1}^{Ni} yij   (6.4)

is the sample mean of the observations in group i. Then the sums of squares are immediate:

    SSA·0 = ‖ŷA − ŷ0‖² = Σ_{i=1}^g Σ_{j=1}^{Ni} (ȳi − ȳ)² = Σ_{i=1}^g Ni (ȳi − ȳ)²

    SSE = ‖y − ŷA‖² = Σ_{i=1}^g Σ_{j=1}^{Ni} (yij − ȳi)²

    SST = ‖y − ŷ0‖² = Σ_{i=1}^g Σ_{j=1}^{Ni} (yij − ȳ)².   (6.5)

(The SST means "sum of squares total.") Often, the SSA·0 is called the between sum of squares because it measures the differences between the group means and the overall mean, and the SSE is called the within sum of squares, because it adds up the sums of squares of the deviations from each group's mean. The table is then

    Source     Sum of squares                           Degrees of freedom    Mean square    F
    Between    Σ_{i=1}^g Ni(ȳi − ȳ)²                     g − 1                 MSB            MSB/MSW
    Within     Σ_{i=1}^g Σ_{j=1}^{Ni} (yij − ȳi)²        n − g                 MSW
    Total      Σ_{i=1}^g Σ_{j=1}^{Ni} (yij − ȳ)²         n − 1

    R² = Σ_{i=1}^g Ni(ȳi − ȳ)² / Σ_{i=1}^g Σ_{j=1}^{Ni} (yij − ȳ)².
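A sketch of the table computed directly from these formulas (assuming Python with numpy; the small data set at the bottom is made up, and the helper function name is invented):

```python
import numpy as np

def oneway_anova(y, group):
    """Between/within sums of squares, F, and R^2 for a one-way layout."""
    y, group = np.asarray(y, float), np.asarray(group)
    labels = np.unique(group)
    n, g = len(y), len(labels)
    ybar = y.mean()
    ss_between = sum((group == k).sum() * (y[group == k].mean() - ybar)**2
                     for k in labels)
    ss_within = sum(((y[group == k] - y[group == k].mean())**2).sum()
                    for k in labels)
    msb, msw = ss_between / (g - 1), ss_within / (n - g)
    return {"SSB": ss_between, "SSW": ss_within, "SST": ss_between + ss_within,
            "F": msb / msw, "R2": ss_between / (ss_between + ss_within)}

# Tiny made-up example: 3 unbalanced groups.
y = [8, 9, 7, 6, 5, 6, 4, 12, 11, 13]
group = [1, 1, 1, 2, 2, 2, 2, 3, 3, 3]
print(oneway_anova(y, group))
```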

    6.2 Decomposing the between sum of squares

When there are more than two groups, the between sum of squares measures a combination of all possible differences among the groups. It is usually informative to be more specific about differences. One approach is to further decompose the between sum of squares using orthogonal contrasts. A contrast of a vector of parameters is a linear combination in which the coefficients sum to zero. For example, in the leprosy data we looked at contrasts of the group means (μ + α1, μ + α2, μ + α3)', or, equivalently, of the (α1, α2, α3)':

    γ1 = (1/2)(α1 + α2) − α3 = (1/2, 1/2, −1)(α1, α2, α3)',   and
    γ2 = α1 − α2 = (1, −1, 0)(α1, α2, α3)'.   (6.6)


    6.2.2 Limitations of the decomposition

Every subspace has an orthogonal basis, many essentially different such bases if the rank is more than one. Unfortunately, the resulting subspaces Mk need not be ones of interest. In the leprosy example, the two subspaces nicely corresponded to two interesting contrasts. Such nice results will occur in the balanced one-way ANOVA if one is interested in a set of orthogonal contrasts of the αi's. A contrast of α = (α1, . . . , αg)' is a linear combination c'α, where c is any nonzero g × 1 vector such that c1 + ··· + cg = 0. Two contrasts c1'α and c2'α are orthogonal if their vectors c1 and c2 are orthogonal. For example, the γi's in (6.6) are orthogonal contrasts of α with

    c1 = (1/2, 1/2, −1)'   and   c2 = (1, −1, 0)'.   (6.25)

It is easy to see that these two vectors are orthogonal, and their components sum to 0. If the contrasts of interest are not orthogonal, then their respective sums of squares do not sum to the between sum of squares. E.g., we may be interested in comparing each drug to the placebo, so that the contrast vectors are (1, 0, −1)' and (0, 1, −1)'. Although these are contrasts, the two vectors are not orthogonal. Worse, if the model is unbalanced, even orthogonal contrasts will not translate back to an orthogonal basis for MA·0. Also, the model with covariates will not allow the decomposition, even with a balanced design.
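A two-line numerical check of these claims (a sketch assuming Python with numpy):

```python
import numpy as np

# The contrast vectors in (6.25) and the drug-vs-placebo pair discussed above.
c1 = np.array([0.5, 0.5, -1.0])
c2 = np.array([1.0, -1.0, 0.0])
d1 = np.array([1.0, 0.0, -1.0])    # drug A vs placebo
d2 = np.array([0.0, 1.0, -1.0])    # drug D vs placebo

print(c1.sum(), c2.sum(), c1 @ c2)   # both sum to 0 (contrasts) and are orthogonal (dot 0)
print(d1.sum(), d2.sum(), d1 @ d2)   # contrasts, but not orthogonal (dot = 1)
```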

    On the positive side, balanced higher-way ANOVA models do allow nice decompositions.

    6.3 Effects

We know that the general one-way ANOVA model (6.1), (6.2), is often parametrized in such a way that the parameters are not estimable. That usually will not present a problem, because interest is in contrasts of the αi's, which are estimable, or (equivalently) in testing whether there are any differences among the groups. Alternatively, one can place constraints on the parameters so that they are estimable. E.g., one could set μ = 0