IOWA STATE UNIVERSITY
Department of Animal Science

Use of Proc GLM to Analyze Experimental Data

Animal Science 500

Lecture No.

October , 2010

PROC GLM

- The GLM procedure uses the method of least squares to fit general linear models.

- Among the statistical methods available in PROC GLM are:
  - regression,
  - analysis of variance,
  - analysis of covariance,
  - multivariate analysis of variance (MANOVA), and
  - partial correlation.

SAS/STAT(R) 9.22 User's Guide

PROC GLM

- PROC GLM analyzes data within the framework of general linear models.

- PROC GLM handles models relating one or several continuous dependent variables to one or several independent variables.
  - The independent variables can be either classification variables, which divide the observations into discrete groups, or continuous variables.

SAS/STAT(R) 9.22 User's Guide

PROC GLM

- Thus, the GLM procedure can be used for many different analyses, including the following:
  - simple regression
  - multiple regression
  - analysis of variance (ANOVA), especially for unbalanced data
  - analysis of covariance
  - response surface models
  - weighted regression
  - polynomial regression
  - partial correlation
  - multivariate analysis of variance (MANOVA)
  - repeated measures analysis of variance

SAS/STAT(R) 9.22 User's Guide

PROC GLM

- PROC GLM enables you to specify any degree of interaction (crossed effects) and nested effects.
  - It also provides for polynomial, continuous-by-class, and continuous-nesting-class effects.

- Through the concept of estimability, the GLM procedure can provide tests of hypotheses for the effects of a linear model regardless of the number of missing cells or the extent of confounding.

- PROC GLM displays the sum of squares (SS) associated with each hypothesis tested and, upon request, the form of the estimable functions employed in the test. PROC GLM can produce the general form of all estimable functions.

SAS/STAT(R) 9.22 User's Guide

PROC GLM

- The REPEATED statement enables you to specify effects in the model that represent repeated measurements on the same experimental unit for the same response, providing both univariate and multivariate tests of hypotheses.

- The RANDOM statement enables you to specify random effects in the model; expected mean squares are produced for each Type I, Type II, Type III, Type IV, and contrast mean square used in the analysis. Upon request, tests that use appropriate mean squares or linear combinations of mean squares as error terms are performed.

SAS/STAT(R) 9.22 User's Guide

PROC GLM

- The ESTIMATE statement enables you to specify an L vector for estimating a linear function of the parameters, Lβ.

- The CONTRAST statement enables you to specify a contrast vector or matrix for testing the hypothesis that Lβ = 0. When specified, the contrasts are also incorporated into analyses that use the MANOVA and REPEATED statements.

- The MANOVA statement enables you to specify both the hypothesis effects and the error effect to use for a multivariate analysis of variance.

SAS/STAT(R) 9.22 User's Guide

PROC GLM

- PROC GLM can create an output data set containing the input data set in addition to predicted values, residuals, and other diagnostic measures.

- PROC GLM can be used interactively. After you specify and fit a model, you can execute a variety of statements without recomputing the model parameters or sums of squares.

SAS/STAT(R) 9.22 User's Guide

PROC GLM

- For analysis involving multiple dependent variables but not the MANOVA or REPEATED statements, a missing value in one dependent variable does not eliminate the observation from the analysis for other dependent variables. PROC GLM automatically groups together those variables that have the same pattern of missing values within the data set or within a BY group. This ensures that the analysis for each dependent variable brings into use all possible observations.

SAS/STAT(R) 9.22 User's Guide

Estimable Function

- SAS output often reports "Non-est" in place of an estimate.

- What does this mean?

Estimability

- Generalized inverses are used to obtain solutions for effects in general linear models.
  - There are many generalized inverses.
  - Many different sets of solutions are possible.

- Estimable functions are unique and do not depend on the generalized inverse used to obtain solutions.

- To analyze data properly, that is, to answer the hypothesis being tested, the scientist should know which functions of the model parameters are being estimated.

Estimability

- The hypothesis being tested is NOT about the absolute value of a level of a factor in the model.

- Usually we ask or hypothesize that two means differ, or that some treatment differs from a control.

- Hence the differences are the estimable functions, NOT the values (solutions) for the individual levels.
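
A small worked illustration may help here. It is only a sketch, assuming a one-way model with two treatment levels (this model is not taken from the lecture): the difference between the two treatment means has an expected value that involves only the difference of the effects.

```latex
\begin{align*}
E(Y_{1j}) &= \mu + \alpha_1, \qquad E(Y_{2j}) = \mu + \alpha_2 \\
E(\bar{Y}_{1\cdot} - \bar{Y}_{2\cdot}) &= (\mu + \alpha_1) - (\mu + \alpha_2) = \alpha_1 - \alpha_2
\end{align*}
```

Adding any constant c to µ and subtracting it from both α1 and α2 leaves every expected value unchanged, so the data cannot pin down µ or α1 separately. The difference α1 - α2 is unaffected by c, which is why it is estimable while the individual solutions are not.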

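The General Linear Model

- The main effects general linear model can be parameterized as

  $Y_{ij} = \mu + \alpha_i + b_j + \varepsilon_{ij}$

  where

  $Y_{ij}$ is the observation for the $i$th level of $\alpha$ and the $j$th level of $b$,

  $\mu$ is the overall mean (unknown fixed parameter),

  $\alpha_i$ is the effect of the $i$th level of $\alpha$ (the deviation of that level's mean from $\mu$),

  $b_j$ is the effect of the $j$th level of $b$ (the deviation of that level's mean from $\mu$), and

  $\varepsilon_{ij}$ is the experimental error, $\varepsilon_{ij} \sim N(0, \sigma^2)$.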

The General Linear Model

- In matrix terminology, the general linear model may be expressed as

  $Y = X\beta + \varepsilon$

  where

  $Y$ is the observed data vector,

  $X$ is the design matrix,

  $\beta$ is a vector of unknown fixed effect parameters, and

  $\varepsilon$ is the vector of errors.
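
To make the matrix form concrete, here is a sketch for the main effects model with two levels of each factor and one observation per cell. The 2 x 2 layout is an assumption chosen for illustration, not an example from the lecture.

```latex
\[
\underbrace{\begin{bmatrix} Y_{11} \\ Y_{12} \\ Y_{21} \\ Y_{22} \end{bmatrix}}_{Y}
=
\underbrace{\begin{bmatrix}
1 & 1 & 0 & 1 & 0 \\
1 & 1 & 0 & 0 & 1 \\
1 & 0 & 1 & 1 & 0 \\
1 & 0 & 1 & 0 & 1
\end{bmatrix}}_{X}
\underbrace{\begin{bmatrix} \mu \\ \alpha_1 \\ \alpha_2 \\ b_1 \\ b_2 \end{bmatrix}}_{\beta}
+
\underbrace{\begin{bmatrix} \varepsilon_{11} \\ \varepsilon_{12} \\ \varepsilon_{21} \\ \varepsilon_{22} \end{bmatrix}}_{\varepsilon}
\]
```

Columns 2 and 3 of X sum to column 1, and so do columns 4 and 5, so X has five columns but rank 3. X'X is therefore singular, which is why a generalized inverse is needed to obtain a solution.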

Programming the General Linear Model

- In the GLM procedure, an OUTPUT statement saves the input data plus the residuals, predicted values, studentized residuals, and other diagnostics in a data set, here called resdat.

  proc glm;
    class machine operator;
    model yield = machine|operator;
    /* r= residuals, p= predicted values, student= studentized residuals,     */
    /* rstudent= deleted studentized residuals, cookd= Cook's D, h= leverage   */
    output out=resdat r=resid p=pred
           student=stdres rstudent=rstud
           cookd=cksd h=lev;
  run;
  quit;

Assumptions of the general linear model

- $E(\varepsilon) = 0$

- $\mathrm{var}(\varepsilon) = \sigma^2 I$

- $\mathrm{var}(Y) = \sigma^2 I$

- $E(Y) = X\beta$

Assumptions of the Linear Regression Model

1. Linear functional form

2. Fixed independent variables

3. Independent observations

4. Representative sample and proper specification of the model (no omitted variables)

5. Normality of the residuals or errors

6. Equality of variance of the errors (homogeneity of residual variance)

7. No multicollinearity

8. No autocorrelation of the errors

9. No outlier distortion

Explanation of the Assumptions

1. Linear functional form
   - A linear model does not detect curvilinear relationships.

2. Independent observations
   - The observations are a representative sample from some larger population.
   - If the observations are not independent, the resulting autocorrelation inflates the t, r, and F statistics, which in turn distorts the significance tests.

3. Normality of the residuals
   - Permits proper significance testing, as in ANOVA and other statistical procedures.

4. Equal variance (no heterogeneous variance)
   - Heteroskedasticity precludes generalization and external validity.
   - It also distorts the significance tests being used.

5. No multicollinearity (many of the traits exhibit collinearity)
   - Multicollinearity distorts parameter estimation: the estimates become unstable and their standard errors are inflated.
   - It can prevent the analysis from running or converging (getting your answers).

6. No outlier distortion
   - Severe or several outliers will distort the results and may bias them.
   - If outliers have high influence and the sample is not large enough, they may seriously bias the parameter estimates.

SAS test for residual normality

  proc univariate data=resdat normal plot;
    var resid;
  run;
  quit;

Graphically examining residuals for homogeneity

  proc gplot data=resdat;
    plot resid * pred;
  run;
  quit;

Examine the plot for a lack of pattern.

Testing for outliers

  proc freq data=resdat;
    tables stdres cksd;
  run;
  quit;

1. Look for standardized residuals greater than 3.5 or less than -3.5.

2. Look for a high Cook's D (greater than 4*p/(n-p-1)).
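
The two screening rules can also be applied in a DATA step instead of reading them off the frequency tables. This is a minimal sketch under several assumptions: resdat was created by the OUTPUT statement shown earlier (so it holds stdres and cksd), n is taken to be the number of observations and p the number of model parameters (the slide does not define them), and the macro variable values and the data set name flagged are purely illustrative.

```sas
/* Hypothetical values: replace with the n and p of your own model */
%let n = 24;
%let p = 4;

data flagged;
  set resdat;
  cookcut = 4 * &p / (&n - &p - 1);        /* Cook's D cutoff quoted above */
  /* keep only observations that break either rule */
  if abs(stdres) > 3.5 or cksd > cookcut;
run;

proc print data=flagged;
  var resid pred stdres cksd cookcut;
run;
```

An empty listing would suggest that no observation exceeds either threshold.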

Class Statement

- Variables included in the CLASS statement are referred to as class variables.

- The CLASS statement specifies the variables whose values define the subgroup combinations for the analysis.
  - They represent the various levels of some factors or effects:
    - Treatment (1, ..., n)
    - Season (spring, summer, fall, and winter coded 1 through 4)
    - Breed
    - Color
    - Sex
    - Line
    - Day
    - Laboratory

Evaluating outliers

1. Check the coding to spot typos.

2. Correct any typos.

3. If the outlying observation is correct:
   - Examine the DFFITS statistic to determine how much influence the outlier has on the fit. It shows the standardized influence of the observation on the fit.
   - If the outlier's influence is severe, consider removing it by making it a missing observation ( . ).
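
One way this step might look in practice is sketched below. The DFFITS= keyword of the OUTPUT statement in PROC GLM saves the statistic for each observation; the data set names mydata, infl, and cleaned, the variable name dffit, and the observation number 17 are all hypothetical.

```sas
/* Sketch only: mydata, infl, dffit, cleaned and observation 17 are hypothetical */
proc glm data=mydata;
  class machine operator;
  model yield = machine|operator;
  output out=infl dffits=dffit;   /* standardized influence on the fitted value */
run;
quit;

/* If a verified outlier is judged too influential, set its response to missing */
data cleaned;
  set mydata;
  if _n_ = 17 then yield = .;
run;
```

Setting the response to missing, rather than deleting the row, keeps a record that the observation existed.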

Getting started with GLM

PROC GLM Syntax

  PROC GLM <options> ;
    CLASS variables </ option> ;
    MODEL dependent-variables=independent-effects </ options> ;

Positional Requirements for PROC GLM Statements

  Statement   Must Precede...                                    Must Follow...
  ---------   ------------------------------------------------   ----------------------------------
  ABSORB      First RUN statement
  BY          First RUN statement
  CLASS       MODEL statement
  CONTRAST    MANOVA, REPEATED, or RANDOM statement              MODEL statement
  ESTIMATE                                                       MODEL statement
  FREQ        First RUN statement
  ID          First RUN statement
  LSMEANS                                                        MODEL statement
  MANOVA                                                         CONTRAST or MODEL statement
  MEANS                                                          MODEL statement
  MODEL       CONTRAST, ESTIMATE, LSMEANS, or MEANS statement    CLASS statement
  OUTPUT                                                         MODEL statement
  RANDOM                                                         CONTRAST or MODEL statement
  REPEATED                                                       CONTRAST, MODEL, or TEST statement
  TEST        MANOVA or REPEATED statement                       MODEL statement
  WEIGHT      First RUN statement

Statements in the GLM Procedure

  Statement   Description
  ---------   -----------
  ABSORB      Absorbs classification effects in a model
  BY          Specifies variables to define subgroups for the analysis
  CLASS       Declares classification variables
  CONTRAST    Constructs and tests linear functions of the parameters
  ESTIMATE    Estimates linear functions of the parameters
  FREQ        Specifies a frequency variable
  ID          Identifies observations on output
  LSMEANS     Computes least squares (marginal) means
  MANOVA      Performs a multivariate analysis of variance
  MEANS       Computes and optionally compares arithmetic means
  MODEL       Defines the model to be fit
  OUTPUT      Requests an output data set containing diagnostics for each observation
  RANDOM      Declares certain effects to be random and computes expected mean squares
  REPEATED    Performs multivariate and univariate repeated measures analysis of variance
  STORE       Requests that the procedure save the context and results of the statistical analysis into an item store
  TEST        Constructs tests that use the sums of squares for effects and the error term you specify
  WEIGHT      Specifies a variable for weighting observations

Class Variables

- Are usually things you would like to account for in your model

- Can be numeric or character

- Can be continuous values

- Are generally not used in regression analyses
  - What meaning would they have?

Class Statement Options

- ASCENDING sorts the class variable in ascending order

- DESCENDING sorts the class variable in descending order

Other CLASS statement options generally depend on the procedure (PROC) being used, so they are not all covered here.

Discrete Variables

- A discrete variable is one that cannot take on all values within the limits of the variable.
  - Limited to whole numbers
  - For example, responses to a five-point rating scale can only take on the values 1, 2, 3, 4, and 5.
  - The variable cannot have the value 1.7. By contrast, a variable such as a person's height can take on any value.

Discrete variables are of two types:

1. unorderable (also called nominal variables)

2. orderable (also called ordinal)

Discrete Variables

- Such data are sometimes called categorical, because the observations fall into one of a number of categories. For example:
  - Any trait where you score the value
    - Lameness scores
    - Body condition scores
    - Soundness scoring
      - Reproductive
      - Feet and leg
    - Behavioral traits
      - Fear test
      - Back test
      - Vocal scores
    - Body lesion scores

Discrete Variables

- When do discrete variables become continuous, or do they?

- Is a trait like number born alive considered discrete or continuous?

Example Variables

Data:

The dependent variable (what is being measured) is aerial biomass, and there are five substrate measurements (these are the independent variables):

1. Salinity,

2. Acidity,

3. Potassium,

4. Sodium, and

5. Zinc.

Covariates

- A covariate is an independent variable that contributes variation to the dependent variable of interest.

- The researcher wants to account for the covariate differences that occur for each observation.

- A covariate may be of direct interest, or it may be a confounding or interacting variable.

Covariates

- Examples

Weight of animal at measurement

Age of animal at measurement

Age of animal at weaning

Parity of sow for number born alive and weaning weight

Days of lactation for milk weight

Covariates

- A covariate may influence the dependent variable in the following ways:
  - Linear covariate
  - Quadratic covariate
  - Cubic covariate

Covariates

- Check to be sure your covariate is significant.

- If the linear term is significant, test the quadratic.

- If the linear and quadratic terms are significant sources of variation, test the cubic.

- How do you do that?

Covariates

- How do you do that?
  - Linear: include the variable name in the MODEL statement without listing it in the CLASS statement.
    - Example: weight
  - Quadratic: include the variable as weight*weight
  - Cubic: include the variable as weight*weight*weight
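
Put together, a covariate with linear, quadratic, and cubic terms might be coded as in the sketch below. The data set name pigs, the class variable treatment, and the response gain are hypothetical; only weight comes from the slide. The SS1 option is added here as one possible choice, because sequential sums of squares test each higher-order term after the lower-order terms already in the model.

```sas
/* Hypothetical names: data set pigs, class variable treatment, response gain */
proc glm data=pigs;
  class treatment;                      /* weight is NOT listed here, so it is a covariate */
  model gain = treatment
               weight                   /* linear    */
               weight*weight            /* quadratic */
               weight*weight*weight     /* cubic     */
               / ss1 solution;
run;
quit;
```

If the cubic term is not significant, it can be dropped and the model refit with the linear and quadratic terms only.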

Covariates

- A covariate may influence the dependent variable in the following ways:
  - Linear covariate
    - The covariate affects the dependent variable in a linear manner.
  - Quadratic covariate
    - The covariate affects the dependent variable in a linear and quadratic manner.
    - Indicates there is one (and only one) turning point.
  - Cubic covariate
    - The covariate affects the dependent variable in a linear, quadratic, and cubic manner.
    - Indicates there can be two turning points.

Covariates

- A covariate may influence the dependent variable in the following ways:
  - Linear covariate
    - The covariate affects the dependent variable in a linear manner.
    - The dependent variable increases or decreases at a constant rate.

Covariates

- A covariate may influence the dependent variable in the following ways:
  - Quadratic covariate
    - The covariate affects the dependent variable in a linear and quadratic manner.
    - Indicates there is one (and only one) turning point.
    - The dependent variable increases (or decreases) to some point and then either increases at an increasing rate (decreases at an increasing rate) or increases at a decreasing rate (or decreases at a decreasing rate).
    - Or there could be a directional change: up to some point the dependent variable increases and after that point it decreases, or vice versa.

Covariates

- Cubic covariate
  - The covariate affects the dependent variable in a linear, quadratic, and cubic manner.
  - Indicates there can be two turning points.
  - Essentially the same as the quadratic, but the changes can occur at an additional point.
  - The dependent variable increases (or decreases) to some point and then either increases at an increasing rate (decreases at an increasing rate) or increases at a decreasing rate (or decreases at a decreasing rate).
  - Or there could be a directional change: up to some point the dependent variable increases and after another point it decreases, or vice versa.

Model Development and Selection of Variables

Example:

The general problem addressed is to identify important soil characteristics influencing aerial biomass production of marsh grass, Spartina alterniflora.

Example Data Origination (Dr. P. J. Berger)

Data: The data were published as an exercise by Rawlings (1988) and originally appeared as a study by Dr. Rick Linthurst, North Carolina State University (1979). The purpose of his research was to identify the important soil characteristics influencing aerial biomass production of the marsh grass, Spartina alterniflora in the Cape Fear Estuary of North Carolina. The design for collecting data was such that there were three types of Spartina vegetation, in each of three locations, and five random sites within each location vegetation type.

Example Data

- Objective:

- Find the substrate variable, or combination of variables, showing the strongest relationship to biomass.

Or,

- From the list of five independent variables of salinity, acidity, potassium, sodium, and zinc, find the combination of one or more variables that has the strongest relationship with aerial biomass.

- Find the independent variables that can be used to predict aerial biomass.

Example Data

- CLASS vegetative_type location site
  - Recall that 3 vegetative types were evaluated
  - Recall that tests occurred at 3 locations
  - Recall that there were 5 sites within each location

- MODEL

  biomass = vegetative_type location site(location) vegetative_type*location salinity acidity potassium sodium zinc;

Example Data

- MODEL

  biomass = vegetative_type location site(location) vegetative_type*location salinity acidity potassium sodium zinc;

- Assuming each linear effect was significant, we would then need to examine:
  - salinity*salinity
  - salinity*salinity*salinity
  - acidity*acidity
  - acidity*acidity*acidity
  - etc.
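
As a sketch of how this model could be submitted, the step below combines the CLASS and MODEL statements from the slides and adds two of the quadratic terms as an example. The data set name marsh is hypothetical, and which higher-order terms to include is an assumption that depends on which linear effects turn out to be significant.

```sas
/* marsh is a hypothetical data set name; variable names follow the slides */
proc glm data=marsh;
  class vegetative_type location site;
  model biomass = vegetative_type location site(location)
                  vegetative_type*location
                  salinity acidity potassium sodium zinc
                  salinity*salinity acidity*acidity;   /* example quadratic terms */
run;
quit;
```

The five substrate variables are left out of the CLASS statement, so PROC GLM treats them as continuous covariates.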

PROC GLM Example

- Example: Strawberry yield is modeled as a function of strawberry variety, type of fertilizer, and their interaction.

  PROC GLM DATA=berry;
    CLASS fertiliz variety;
    MODEL yield = fertiliz variety fertiliz*variety / SOLUTION;
    LSMEANS fertiliz variety;
  RUN;
  QUIT;

- The SOLUTION option is useful for showing the relative effect sizes.

PROC GLM Example Output

  General Linear Models Procedure

  Class Level Information

  Class      Levels   Values
  FERTILIZ   2        K N
  VARIETY    2        Red Sweet

  Number of observations in data set = 24

This section lets us verify that we have two fertilizers and two varieties of interest, and that there are 24 observations in the data. Information about missing observations is also printed here, if applicable.

PROC GLM Example Output

  Dependent Variable: YIELD

  Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
  Model              3       0.87166667    0.29055556      2.59   0.0816
  Error             20       2.24666667    0.11233333
  Corrected Total   23       3.11833333

  R-Square   C.V.       Root MSE    YIELD Mean
  0.279530   3.790707   0.3351617   8.8416667

This section shows the ANOVA table, with degrees of freedom (DF), sums of squares, and an F value which tests whether any of the terms in the model are significant. The C.V. (coefficient of variation) is (root MSE/mean yield)(100%). R-Square is the model sum of squares divided by total sum of squares. This is commonly used to evaluate how well the model fits the data, but it should not be the only criterion of fit that you examine.
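
The summary statistics can be reproduced by hand from the ANOVA table, which is a useful check on how they are defined. The arithmetic below uses only the values printed in the output above, rounded for display.

```latex
\begin{align*}
F &= \frac{MS_{\text{Model}}}{MS_{\text{Error}}} = \frac{0.29055556}{0.11233333} \approx 2.59 \\
R^2 &= \frac{SS_{\text{Model}}}{SS_{\text{Total}}} = \frac{0.87166667}{3.11833333} \approx 0.2795 \\
\text{Root MSE} &= \sqrt{MS_{\text{Error}}} = \sqrt{0.11233333} \approx 0.3352 \\
\text{C.V.} &= 100 \times \frac{\text{Root MSE}}{\text{YIELD Mean}} = 100 \times \frac{0.3352}{8.8417} \approx 3.79
\end{align*}
```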

PROC GLM Example Output

  Source     DF   Type I SS     Mean Square   F Value   Pr > F
  FERTILIZ    1   0.37500000    0.37500000       3.34   0.0826
  VARIETY     1   0.48166667    0.48166667       4.29   0.0515
  FERT*VAR    1   0.01500000    0.01500000       0.13   0.7186

  Source     DF   Type III SS   Mean Square   F Value   Pr > F
  FERTILIZ    1   0.37500000    0.37500000       3.34   0.0826
  VARIETY     1   0.48166667    0.48166667       4.29   0.0515
  FERT*VAR    1   0.01500000    0.01500000       0.13   0.7186

SAS presents Type I and Type III sums of squares and F statistics for their significance under a particular set of assumptions; namely, that fertilizer and variety should be modeled with fixed effects, and that the random error terms satisfy their requirements.

The F test statistics shown here are not always the proper results to interpret! This depends on the design of the experiment.

PROC GLM Example Output

- The Type I sums of squares are also called sequential sums of squares. Here, they test:
  1. Whether fertilizer is a significant predictor
  2. Whether variety is significant when considered in addition to fertilizer
  3. Whether the interaction is significant when considered in addition to both fertilizer and variety.

- The Type III sums of squares are also called partial sums of squares. Here, they test:
  1. Assuming that the combinations of fertilizers and varieties are different from each other, do they show consistent trends for fertilizers to be different from each other?
  2. Assuming that the combinations of fertilizers and varieties are different from each other, do they show consistent trends for varieties to be different from each other?
  3. Knowing that fertilizers and varieties could be different from each other, is the difference between fertilizers the same for both varieties?

PROC GLM Example Output

- Because the experiment is balanced, the Type I and Type III sums of squares are identical.

- Usually, the Type III sums of squares are used for inference, although the Type I sums of squares are used in specific situations.

- SAS can calculate Type II and Type IV sums of squares as well.
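
All four types can be requested with options on the MODEL statement. The sketch below reuses the berry data set and variable names from the strawberry example; requesting every type at once is done here only for illustration.

```sas
proc glm data=berry;
  class fertiliz variety;
  /* SS1-SS4 request Type I, II, III, and IV sums of squares, respectively */
  model yield = fertiliz variety fertiliz*variety / ss1 ss2 ss3 ss4;
run;
quit;
```

With balanced data such as this example, the tables for the different types are expected to agree.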

PROC GLM Example Output

- The SOLUTION option is specified after the slash in the MODEL statement (i.e., / solution;)

  Parameter             Estimate   T for H0: Parameter=0   Prob > |T|   Std. Error of Estimate
  INTERCEPT              9.13 B                    66.75        0.001                    0.137
  FERTILIZ   K          -0.30 B                    -1.55        0.137                    0.194
             N           0.00 B                        .            .                        .
  VARIETY    Red        -0.33 B
             Sweet       0.00 B                        .            .                        .
  FERT*VAR   K Red       0.10 B                     0.37        0.719                    0.274
             K Sweet     0.00 B                        .            .                        .
             N Red       0.00 B                        .            .                        .
             N Sweet     0.00 B                        .            .                        .

PROC GLM Example Output

- There are many ways to estimate effects in a linear model with categorical predictors (fixed effects).

- SAS does so by alphabetizing the levels of each factor, then assigning an effect of zero to the last alphabetically ordered level of each factor and to the interaction terms that involve it.

- To predict the response for, say, Fertilizer K for the Red variety, use the equation (Intercept) + (K effect) + (Red effect) + (K*Red interaction effect), or 9.13 - 0.30 - 0.33 + 0.10 = 8.60.

- The t-test values listed on the right can be used to test whether certain parameters are significantly different from zero;
  - in this case, they compare the levels of each factor to the last alphabetically ordered level (which is forced to be zero).

- The SOLUTION option is useful for determining how treatment effects can be contrasted or estimated within PROC GLM.
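
The hand calculation for Fertilizer K with the Red variety can also be requested from PROC GLM directly. The sketch below uses an ESTIMATE statement whose coefficients pick out the intercept, the K effect, the Red effect, and the K*Red interaction; it assumes the berry data set from the example and the alphabetical level ordering described above (K, N and Red, Sweet). The label text is arbitrary.

```sas
proc glm data=berry;
  class fertiliz variety;
  model yield = fertiliz variety fertiliz*variety / solution;
  /* Interaction cells are ordered K Red, K Sweet, N Red, N Sweet */
  estimate 'K Red predicted yield'
           intercept 1 fertiliz 1 0 variety 1 0 fertiliz*variety 1 0 0 0;
  lsmeans fertiliz*variety;   /* least squares means for all four combinations */
run;
quit;
```

The ESTIMATE result should match the sum of the four solution values, apart from rounding in the printed table.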

PROC GLM Example Examining the Error values

- An analysis of a general linear model should include a check of the assumptions about the random error terms.

- To do this in PROC GLM, you must use an OUTPUT statement.

- The following statements show how to produce a residual plot for the model above.

PROC GLM Example Examining the Error values

  PROC GLM DATA=berry;
    CLASS fertiliz variety;
    MODEL yield = fertiliz variety fertiliz*variety / SOLUTION;
    OUTPUT OUT=results P=pred R=resid;
  RUN;
  QUIT;

  PROC GPLOT DATA=results;
    PLOT resid*pred;
  RUN;
  QUIT;