Selecting Variables and Avoiding Pitfalls

Chapters 6 and 7

Let’s start with the pitfalls

What do you think they are?

Young children who sleep with the light on are much more likely to develop myopia in later life.

This result of a study at the University of Pennsylvania Medical Center was published in the May 13, 1999, issue of Nature. However, a later study at Ohio State University found no link between infants sleeping with the light on and developing myopia. It did find a strong link between parental myopia and child myopia, and it noted that myopic parents were more likely to leave a light on in their children's bedroom.

http://researchnews.osu.edu/archive/nitelite.htm

What’s going on here?

Remember… Correlation does not imply Causation! A statistically significant relationship between a response y and predictor x does not necessarily imply a cause-and-effect relationship.
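A small simulation can make the point concrete. The sketch below uses entirely synthetic data (the variable names z, x, and y and all probabilities are made up for illustration, not taken from either study): a hidden variable z drives both x and y, so x and y are correlated overall even though neither affects the other.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 20_000

# z is a made-up binary lurking variable; x and y each depend on z only.
z = rng.random(n) < 0.5
x = rng.random(n) < np.where(z, 0.8, 0.2)
y = rng.random(n) < np.where(z, 0.6, 0.1)

def corr(a, b):
    """Pearson correlation between two boolean or numeric arrays."""
    return float(np.corrcoef(a.astype(float), b.astype(float))[0, 1])

overall = corr(x, y)             # x and y look related when z is ignored
within_z1 = corr(x[z], y[z])     # relationship among cases with z true
within_z0 = corr(x[~z], y[~z])   # relationship among cases with z false

print(f"overall corr(x, y):      {overall:.3f}")
print(f"corr(x, y) when z true:  {within_z1:.3f}")
print(f"corr(x, y) when z false: {within_z0:.3f}")
```

Once z is held fixed, the apparent relationship between x and y should largely disappear, which is the pattern a lurking variable produces.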

Caution: Lack of variability or small n

The number of levels of a quantitative variable must be at least one more than the order of the polynomial in x that you want to fit. To fit a straight line, you need at least two different x values; how many do you need to fit a curve?

Sample size n must be large enough so that the degrees of freedom, n - (k + 1), for estimating σ² exceed 0.
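Both checks are simple arithmetic, so they are easy to script before fitting anything. This is a minimal sketch; the helper names `can_fit_polynomial` and `error_df` and the x values are hypothetical.

```python
import numpy as np

def can_fit_polynomial(x, order):
    """True if x has at least (order + 1) distinct levels."""
    return len(np.unique(x)) >= order + 1

def error_df(n, k):
    """Degrees of freedom for estimating sigma^2 with k predictors plus an intercept."""
    return n - (k + 1)

x = np.array([1.0, 1.0, 2.0, 2.0, 3.0, 3.0])  # made-up data: 3 distinct levels

print(can_fit_polynomial(x, 1))   # straight line
print(can_fit_polynomial(x, 2))   # second-order curve
print(can_fit_polynomial(x, 3))   # third-order curve
print(error_df(n=len(x), k=2))    # must be greater than 0
```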

Caution: Interpreting the magnitude of a βi coefficient as determining the importance of xi

With complex models, not all βs have a practical interpretation.

Unless coefficients are standardized, we cannot compare the magnitudes of β values, because each one depends on the units of its x. To standardize in Minitab: Stat > Regression > Storage > Standardized Coefficients
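To see why raw β values are not comparable, here is a sketch using one common definition of a standardized coefficient, βi multiplied by (sd of xi / sd of y). Whether Minitab's stored values use exactly this scaling is an assumption here, and the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(0, 1, n)      # synthetic predictor measured on a small scale
x2 = rng.normal(0, 100, n)    # synthetic predictor measured on a large scale
y = 2.0 * x1 + 0.05 * x2 + rng.normal(0, 1, n)

# Ordinary least squares with an intercept.
X = np.column_stack([np.ones(n), x1, x2])
b = np.linalg.lstsq(X, y, rcond=None)[0]

# Standardize each slope: b_i * (sd of x_i) / (sd of y).
sx = np.array([x1.std(ddof=1), x2.std(ddof=1)])
sy = y.std(ddof=1)
b_std = b[1:] * sx / sy

def z(v):
    """Convert a variable to z-scores."""
    return (v - v.mean()) / v.std(ddof=1)

# Equivalent route: regress z-scored y on z-scored predictors.
b_z = np.linalg.lstsq(np.column_stack([z(x1), z(x2)]), z(y), rcond=None)[0]

print("raw slopes:         ", np.round(b[1:], 3))
print("standardized slopes:", np.round(b_std, 3))
```

The raw slope for x1 is far larger than the one for x2, yet after standardizing, x2 has the larger coefficient because it varies over a much wider range.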

Caution: Multicollinearity

Multicollinearity exists when two or more independent variables are moderately to highly correlated with each other

The best regression models are those in which the predictor variables each correlate highly with the dependent (outcome) variable but correlate only minimally, at most, with each other

How do I know if multicollinearity is present?

Correlation matrix: Stat > Basic Statistics > Correlation (select all variables of interest)

Look for non-significant t tests for individual β parameters when the F test is significant

Look for β estimates whose signs are the opposite of what you expected

VIF

The variance inflation factor (VIF) measures how much the variance of an estimated regression coefficient increases when your predictors are correlated (multicollinear). VIF = 1 indicates no relation between a predictor and the other predictors; VIF > 1 indicates some correlation. When VIF is greater than 10, the regression coefficients are poorly estimated.

In the Regression window, select the Options button, then select Variance Inflation Factors under Display.

Using the VIF value, you can calculate an R² that relates one of the independent variables to the remaining independent variables (p. 349).
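The link between VIF and R² is VIF_i = 1 / (1 - R_i²), where R_i² comes from regressing x_i on the other predictors, so R_i² = 1 - 1 / VIF_i. The sketch below computes both on synthetic data; the function names and the data are illustrative only.

```python
import numpy as np

def r_squared(X, y):
    """R^2 from a least-squares fit of y on the columns of X plus an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    resid = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return 1.0 - float(resid @ resid) / float(np.sum((y - y.mean()) ** 2))

def vif(X):
    """VIF for each column: regress it on the remaining columns."""
    X = np.asarray(X, dtype=float)
    values = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        values.append(1.0 / (1.0 - r_squared(others, X[:, j])))
    return np.array(values)

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)   # nearly a copy of x1
x3 = rng.normal(size=n)                   # unrelated to x1 and x2

v = vif(np.column_stack([x1, x2, x3]))
r2_from_vif = 1.0 - 1.0 / v               # recover each R_i^2 from its VIF

print("VIF:        ", np.round(v, 2))
print("R^2 from VIF:", np.round(r2_from_vif, 3))
```

Here x1 and x2 should show VIF values well above 10, while x3 should stay close to 1.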

Caution 2: Violating the Assumptions

What are the assumptions about ε?

The mean value of ε for any given set of values of x1, x2, …, xk is E(ε) = 0

ε has a normal probability distribution with mean equal to 0 and variance equal to σ²

Random errors are independent

If data violate the assumptions, derived inferences are suspect… so methodology must be modified

We will dive into this in chapter 8!

Selecting Variables

Last class and journal:

Compared complex models to reduced models

Tested a portion of complete model parameters with a nested model F test

Today:

Methods for selecting which independent variables to include from many possible variables

Paring down

We start with a comprehensive model that includes all conceivable, testable influences on the phenomenon under investigation. We want to end up with the simplest model possible.

In addition to literature, theory and plotting data, stepwise regression can help in selecting variables.

Parsimony: the smaller the number of βs, the better (Simpler models are easier to understand and appreciate, and therefore have a "beauty" that their more complicated counterparts often lack.)

What we’ve done: R² Criterion and Adjusted R² (MSE Criterion)

Explore different variables in the regression equation… but always keep common sense and the literature in mind!

Add enough variables that the R² is sufficiently large

You can also look at the adjusted R², which is penalized as the number of βs increases

Select the simplest model with the highest R² (or adjusted R²)

You can also search for the model with the minimum MSE (included in the ANOVA output)

Remember: parsimony
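The three criteria are closely related: R² = 1 - SSE/SSyy, adjusted R² = 1 - [SSE/(n - (k + 1))] / [SSyy/(n - 1)], and MSE = SSE/(n - (k + 1)). The sketch below computes all three for a sequence of growing models on synthetic data in which only the first predictor matters; `fit_summary` is a hypothetical helper name.

```python
import numpy as np

def fit_summary(X, y):
    """Return (R^2, adjusted R^2, MSE) for a least-squares fit with an intercept."""
    n, k = X.shape
    design = np.column_stack([np.ones(n), X])
    resid = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    sse = float(resid @ resid)
    ss_yy = float(np.sum((y - y.mean()) ** 2))
    df_error = n - (k + 1)
    r2 = 1.0 - sse / ss_yy
    r2_adj = 1.0 - (sse / df_error) / (ss_yy / (n - 1))
    return r2, r2_adj, sse / df_error

rng = np.random.default_rng(2)
n = 40
X = rng.normal(size=(n, 4))                     # synthetic predictors
y = 1.0 + 2.0 * X[:, 0] + rng.normal(size=n)    # only column 0 affects y

results = [fit_summary(X[:, :k], y) for k in range(1, 5)]
for k, (r2, r2_adj, mse) in enumerate(results, start=1):
    print(f"k={k}  R2={r2:.4f}  adjR2={r2_adj:.4f}  MSE={mse:.4f}")
```

R² can never decrease when a predictor is added, so it cannot flag a useless variable by itself. Adjusted R² and MSE can move in the wrong direction when a useless variable is added, and choosing the model with the largest adjusted R² is the same as choosing the one with the smallest MSE.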

Another option: Stepwise Regression

(1) You identify an initial model with lots of potential variables

(2) The software repeatedly alters the model from the previous step by adding a predictor variable in accordance with the "stepping criteria."

(3) The search is terminated when stepping is no longer possible given the stepping criteria, or when a specified maximum number of steps has been reached.

Today’s Data Set: Pulse

Each student in a class recorded his or her height, weight, gender, smoking preference, usual activity level, and resting pulse. They all flipped coins, and those whose coins came up heads ran in place for one minute. Afterward, the entire class recorded their pulses once more. We now want to find the best predictors for the second pulse rate.

Stepwise Regression

• Starts with no predictors. Each of the available predictors is evaluated with respect to how much R² would be increased by adding it to the model.

• The one which will most increase R² will be added if it meets the statistical criterion for entry.

• This procedure is repeated until there remain no more predictors that are eligible for entry.
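The three steps above can be sketched in a few lines. This is not Minitab's implementation: the entry rule used here (a partial F statistic of at least 4.0) is an assumed stand-in for the software's stepping criterion, the function names are hypothetical, and the data are synthetic rather than the Pulse data.

```python
import numpy as np

def sse(X, y, cols):
    """SSE for a model with an intercept and the listed columns of X."""
    design = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    resid = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return float(resid @ resid)

def forward_select(X, y, f_to_enter=4.0):
    """Add one predictor at a time while the best candidate meets the entry rule."""
    n, k = X.shape
    selected = []
    current = sse(X, y, selected)
    while len(selected) < k:
        best = None
        for j in range(k):
            if j in selected:
                continue
            new = sse(X, y, selected + [j])
            df_error = n - (len(selected) + 2)        # intercept + old terms + new term
            f_stat = (current - new) / (new / df_error)
            if best is None or f_stat > best[1]:      # largest F = largest gain in R^2
                best = (j, f_stat, new)
        if best[1] < f_to_enter:
            break                                     # no candidate is eligible for entry
        selected.append(best[0])
        current = best[2]
    return selected

rng = np.random.default_rng(3)
n = 100
X = rng.normal(size=(n, 5))                               # five synthetic predictors
y = 3.0 * X[:, 0] + 1.5 * X[:, 3] + rng.normal(size=n)    # only columns 0 and 3 matter

selected = forward_select(X, y)
print("entered, in order:", selected)
```

At each step the candidate with the largest partial F is also the one that raises R² the most, so the loop follows the description above.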

Open the Pulse data set.

1) Choose Stat > Regression > Stepwise.

2) In Response, enter Pulse2.

3) In Predictors, enter Pulse1 Ran-Weight.

4) Click OK

Which variables were selected?

Backward Elimination

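Fits a model with terms for all potential variables, then drops the variable with the smallest absolute t statistic if it is not significant, refits, and repeats until every remaining variable is significant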

1) Choose Stat > Regression > Stepwise.

2) In Response, enter Pulse2.

3) In Predictors, enter Pulse1 Ran-Weight.

4) Click Methods

5) Select Backward elimination

6) Click OK (twice)

In light of today’s cautions, what could be a disadvantage to this procedure?
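For comparison with the Minitab steps, here is a rough sketch of backward elimination. The removal rule (drop the weakest term while its |t| is below 2.0) is an assumed threshold, not Minitab's default criterion, and the data are synthetic.

```python
import numpy as np

def abs_t_stats(X, y, cols):
    """|t| for each listed predictor in a least-squares fit with an intercept."""
    n = len(y)
    design = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    b = np.linalg.lstsq(design, y, rcond=None)[0]
    resid = y - design @ b
    mse = float(resid @ resid) / (n - design.shape[1])
    se = np.sqrt(mse * np.diag(np.linalg.inv(design.T @ design)))
    return np.abs(b / se)[1:]          # skip the intercept

def backward_eliminate(X, y, t_to_stay=2.0):
    """Start with every predictor; repeatedly drop the one with the smallest |t|."""
    cols = list(range(X.shape[1]))
    while cols:
        t = abs_t_stats(X, y, cols)
        worst = int(np.argmin(t))
        if t[worst] >= t_to_stay:
            break                      # every remaining term meets the rule
        cols.pop(worst)
    return cols

rng = np.random.default_rng(3)
n = 100
X = rng.normal(size=(n, 5))                               # five synthetic predictors
y = 3.0 * X[:, 0] + 1.5 * X[:, 3] + rng.normal(size=n)    # only columns 0 and 3 matter

kept = backward_eliminate(X, y)
print("kept columns:", kept)
```

Because the first fit includes every predictor at once, its t statistics are exactly the ones that multicollinearity makes unreliable.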

Best Subsets Regression

Stat > Regression > Best Subsets

Cp Criterion

Used to compare the full model to a model with a subset of predictors.

Look for models where Mallows' Cp is small and close to p, where p is the number of parameters in the model, including the constant.

A small Cp value indicates that the model is relatively precise (has small variance) in estimating the true regression coefficients and predicting future responses.

Models with considerable lack-of-fit and bias have values of Cp larger than p.
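One standard form of the statistic is Cp = SSE_p / MSE_full - (n - 2p), where SSE_p comes from the subset model and MSE_full from the model with every candidate predictor. The sketch below evaluates it for all subsets of a small synthetic data set; the function names are hypothetical and this is not Minitab's Best Subsets routine.

```python
import itertools
import numpy as np

def sse(X, y, cols):
    """SSE for a model with an intercept and the listed columns of X."""
    design = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    resid = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return float(resid @ resid)

def mallows_cp(X, y):
    """Cp for every non-empty subset of predictors, keyed by column tuple."""
    n, k = X.shape
    mse_full = sse(X, y, range(k)) / (n - (k + 1))
    table = {}
    for size in range(1, k + 1):
        for cols in itertools.combinations(range(k), size):
            p = size + 1                               # parameters, including the constant
            table[cols] = sse(X, y, cols) / mse_full - (n - 2 * p)
    return table

rng = np.random.default_rng(4)
n = 100
X = rng.normal(size=(n, 4))                               # four synthetic predictors
y = 3.0 * X[:, 0] + 1.5 * X[:, 3] + rng.normal(size=n)    # only columns 0 and 3 matter

cp = mallows_cp(X, y)
for cols in sorted(cp, key=cp.get)[:5]:
    print(cols, f"p={len(cols) + 1}", f"Cp={cp[cols]:.2f}")
```

The full model always has Cp equal to its own p, so it is the subsets with fewer terms and Cp near p that are of interest.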

PRESS Criterion

Prediction Sum of Squares

$$\mathrm{PRESS} = \sum_{i=1}^{n} \left[ y_i - \hat{y}_{(i)} \right]^2$$

ŷ(i) is the predicted value for the ith observation obtained when the regression model is fit with the data point for the ith observation deleted from the sample.

Small differences yi - ŷ(i) indicate that the model is predicting well

Minitab: Stat > Regression > Regression > Options (select PRESS)
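PRESS can be computed directly from its definition by refitting the model n times, each time with one observation left out. The sketch below does that on synthetic data and checks the result against the usual one-fit shortcut, which uses the identity that the deleted residual equals e_i / (1 - h_ii), where h_ii is the ith diagonal element of the hat matrix. Function names are hypothetical.

```python
import numpy as np

def press_loo(X, y):
    """PRESS by definition: refit n times, each time deleting one observation."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    total = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        b = np.linalg.lstsq(design[keep], y[keep], rcond=None)[0]
        total += (y[i] - design[i] @ b) ** 2      # [y_i - yhat_(i)]^2
    return float(total)

def press_hat(X, y):
    """Same quantity from a single fit, using deleted residual = e_i / (1 - h_ii)."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    hat = design @ np.linalg.inv(design.T @ design) @ design.T
    resid = y - hat @ y
    return float(np.sum((resid / (1.0 - np.diag(hat))) ** 2))

rng = np.random.default_rng(5)
n = 30
X = rng.normal(size=(n, 2))                                   # synthetic predictors
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=n)

design = np.column_stack([np.ones(n), X])
resid = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
sse = float(resid @ resid)

print(f"SSE:              {sse:.4f}")
print(f"PRESS (refit):    {press_loo(X, y):.4f}")
print(f"PRESS (shortcut): {press_hat(X, y):.4f}")
```

PRESS is never smaller than SSE, because each observation is predicted by a model that did not see it.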

Cautions

These are screening methods, not your final decision-maker

Interactions

Non-linear relationships
