multiple regression iii 4/16/12 more on categorical variables missing data variable selection...

44
Multiple Regression III 4/16/12 • More on categorical variables • Missing data • Variable Selection • Stepwise Regression • Confounding variables Not in book Professor Kari Lock Morgan Duke University

Upload: irene-harrell

Post on 22-Dec-2015

224 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Multiple Regression III 4/16/12 More on categorical variables Missing data Variable Selection Stepwise Regression Confounding variables Not in book Professor

Multiple Regression III4/16/12

• More on categorical variables• Missing data • Variable Selection• Stepwise Regression• Confounding variables

Not in book Professor Kari Lock MorganDuke University

Page 2: Multiple Regression III 4/16/12 More on categorical variables Missing data Variable Selection Stepwise Regression Confounding variables Not in book Professor

• Project 2 Presentation (Thursday, 4/19)

• Project 2 Paper (Wednesday, 4/25)

To Do

Page 3: Multiple Regression III 4/16/12 More on categorical variables Missing data Variable Selection Stepwise Regression Confounding variables Not in book Professor

Categorical Variables• Models are estimated better with fewer coefficients to estimate

• A categorical variables requires estimating a coefficient for each category (minus 1)

• Unless you have a very large sample size, refrain from including categorical variables with lots of categories (or else group the categories)

Page 4: Multiple Regression III 4/16/12 More on categorical variables Missing data Variable Selection Stepwise Regression Confounding variables Not in book Professor

Categorical Variables

• Sometimes, categorical variables are coded with numbers

• If this is the case, R will interpret it as a quantitative variable, not a categorical variable

• To make sure it is treated like a categorical variable, BEFORE attaching the data use

dataname$variablename = as.factor(dataname$variablename)

Page 5: Multiple Regression III 4/16/12 More on categorical variables Missing data Variable Selection Stepwise Regression Confounding variables Not in book Professor

Missing Data• If a cases has missing data for any of the explanatory variables, it will be left out of the regression

• If there is lots of missing data, your sample size could be greatly reduced

• Consider leaving out variables with lots of missing data, especially if your sample size is small to begin with

Page 6: Multiple Regression III 4/16/12 More on categorical variables Missing data Variable Selection Stepwise Regression Confounding variables Not in book Professor

Missing Data• If the proportion of cases with missing data is small, then you do not have much to worry about

• If there are lots of missing values, you could get a completely wrong answer by just leaving those cases out

• Simply ignoring missing data can be very dangerous, but dealing with it appropriately requires another course in statistics

• In life after STAT 101, if you want to analyze a dataset with lots of missing data, consult with a statistician

Page 7: Multiple Regression III 4/16/12 More on categorical variables Missing data Variable Selection Stepwise Regression Confounding variables Not in book Professor

Smokers

Page 8: Multiple Regression III 4/16/12 More on categorical variables Missing data Variable Selection Stepwise Regression Confounding variables Not in book Professor

Smokers• If smoking was banned in a state, the percentage of smokers would most likely decrease.

• In that case, the percentage voting Democratic would…

(a) increase(b) decrease(c) impossible to tell

Page 9: Multiple Regression III 4/16/12 More on categorical variables Missing data Variable Selection Stepwise Regression Confounding variables Not in book Professor

Causation

• A significant explanatory variable in a regression model indicates association, but not necessarily causation

CAUSALITY CAN ONLY BE INFERRED FROM A

RANDOMIZED EXPERIMENT!!!!

Page 10: Multiple Regression III 4/16/12 More on categorical variables Missing data Variable Selection Stepwise Regression Confounding variables Not in book Professor

Causation

http://www.dilbert.com/strips/comic/2011-11-28/

Page 11: Multiple Regression III 4/16/12 More on categorical variables Missing data Variable Selection Stepwise Regression Confounding variables Not in book Professor

Variable Selection

• The p-value for an explanatory variable can be taken as a rough measure for how helpful that explanatory variable is to the model

• Insignificant variables may be pruned from the model

• You can also look at relationships between explanatory variables; if two are strongly associated, perhaps both are not necessary

Page 12: Multiple Regression III 4/16/12 More on categorical variables Missing data Variable Selection Stepwise Regression Confounding variables Not in book Professor
Page 13: Multiple Regression III 4/16/12 More on categorical variables Missing data Variable Selection Stepwise Regression Confounding variables Not in book Professor
Page 14: Multiple Regression III 4/16/12 More on categorical variables Missing data Variable Selection Stepwise Regression Confounding variables Not in book Professor

Variable Selection(Some) ways of deciding whether a variable should be included in the model or not:

1. Does it improve adjusted R2?

2. Does it have a low p-value?

3. Is it associated with the response by itself?

4. Is it strongly associated with another explanatory variables? (If yes, then including both may be redundant)

5. Does common sense say it should contribute to the model?

What would you eliminate from the model? (handout)

Page 15: Multiple Regression III 4/16/12 More on categorical variables Missing data Variable Selection Stepwise Regression Confounding variables Not in book Professor

Stepwise Regression• We could go through and think hard about which variables to include, or we could automate the process

• Stepwise regression drops insignificant variables one by one

• This is particularly useful if you have many potential explanatory variables

Page 16: Multiple Regression III 4/16/12 More on categorical variables Missing data Variable Selection Stepwise Regression Confounding variables Not in book Professor

Full Model

Page 17: Multiple Regression III 4/16/12 More on categorical variables Missing data Variable Selection Stepwise Regression Confounding variables Not in book Professor

Pruned Model 1

Page 18: Multiple Regression III 4/16/12 More on categorical variables Missing data Variable Selection Stepwise Regression Confounding variables Not in book Professor

Pruned Model 2

Page 19: Multiple Regression III 4/16/12 More on categorical variables Missing data Variable Selection Stepwise Regression Confounding variables Not in book Professor

Pruned Model 3

Page 20: Multiple Regression III 4/16/12 More on categorical variables Missing data Variable Selection Stepwise Regression Confounding variables Not in book Professor

Pruned Model 4

Page 21: Multiple Regression III 4/16/12 More on categorical variables Missing data Variable Selection Stepwise Regression Confounding variables Not in book Professor

Pruned Model 5

Page 22: Multiple Regression III 4/16/12 More on categorical variables Missing data Variable Selection Stepwise Regression Confounding variables Not in book Professor

Pruned Model 5FINAL STEPWISE MODEL

Page 23: Multiple Regression III 4/16/12 More on categorical variables Missing data Variable Selection Stepwise Regression Confounding variables Not in book Professor

Project 2

• For project 2, I recommend doing variable selection both ways:

1) manually choose which variables to include based on p-values, pairwise relationships, and common sense

2) use stepwise regression

and then choose between these two models

Page 24: Multiple Regression III 4/16/12 More on categorical variables Missing data Variable Selection Stepwise Regression Confounding variables Not in book Professor

Electricity and Life Expectancy

• Cases: countries of the world

• Response variable: life expectancy

• Explanatory variable: electricity use (kWh per capita)

• Is a country’s electricity use helpful in predicting life expectancy?

Page 25: Multiple Regression III 4/16/12 More on categorical variables Missing data Variable Selection Stepwise Regression Confounding variables Not in book Professor

Electricity and Life Expectancy

Page 26: Multiple Regression III 4/16/12 More on categorical variables Missing data Variable Selection Stepwise Regression Confounding variables Not in book Professor

Electricity and Life Expectancy

Outlier: Iceland

Page 27: Multiple Regression III 4/16/12 More on categorical variables Missing data Variable Selection Stepwise Regression Confounding variables Not in book Professor

Electricity and Life Expectancy

Page 28: Multiple Regression III 4/16/12 More on categorical variables Missing data Variable Selection Stepwise Regression Confounding variables Not in book Professor

Electricity and Life Expectancy

Page 29: Multiple Regression III 4/16/12 More on categorical variables Missing data Variable Selection Stepwise Regression Confounding variables Not in book Professor

Electricity and Life Expectancy

Is this a good model for predicting life expectancy based on electricity use?

(a) Yes(b) No

Page 30: Multiple Regression III 4/16/12 More on categorical variables Missing data Variable Selection Stepwise Regression Confounding variables Not in book Professor

Electricity and Life Expectancy

Is a country’s electricity use helpful in predicting life expectancy?

(a) Yes(b) No

Page 31: Multiple Regression III 4/16/12 More on categorical variables Missing data Variable Selection Stepwise Regression Confounding variables Not in book Professor

Electricity and Life Expectancy

If we increased electricity use in a country, would life expectancy increase?

(a) Yes(b) No(c) Impossible to tell

Page 32: Multiple Regression III 4/16/12 More on categorical variables Missing data Variable Selection Stepwise Regression Confounding variables Not in book Professor

Confounding Variables

• Wealth is an obvious confounding variable that could explain the relationship between electricity use and life expectancy

• Multiple regression is a powerful tool that allows us to account for confounding variables

• We can see whether an explanatory variable is still significant, even after including potential confounding variables in the model

Page 33: Multiple Regression III 4/16/12 More on categorical variables Missing data Variable Selection Stepwise Regression Confounding variables Not in book Professor

Electricity and Life ExpectancyIs a country’s electricity use helpful in predicting life expectancy, even after including GDP in the model?

(a) Yes (b) No

Once GDP is accounted for, electricity use is no longer a significant predictor of life expectancy.

Page 34: Multiple Regression III 4/16/12 More on categorical variables Missing data Variable Selection Stepwise Regression Confounding variables Not in book Professor

Which is the “best” model?

(a)

(b)

(c)

Page 35: Multiple Regression III 4/16/12 More on categorical variables Missing data Variable Selection Stepwise Regression Confounding variables Not in book Professor

• Cases: countries of the world

• Response variable: life expectancy

• Explanatory variable: number of mobile cellular subscriptions per 100 people

• Is a country’s cell phone subscription rate helpful in predicting life expectancy?

Cell Phones and Life Expectancy

Page 36: Multiple Regression III 4/16/12 More on categorical variables Missing data Variable Selection Stepwise Regression Confounding variables Not in book Professor

Cell Phones and Life Expectancy

Page 37: Multiple Regression III 4/16/12 More on categorical variables Missing data Variable Selection Stepwise Regression Confounding variables Not in book Professor

Cell Phones and Life Expectancy

Page 38: Multiple Regression III 4/16/12 More on categorical variables Missing data Variable Selection Stepwise Regression Confounding variables Not in book Professor

Cell Phones and Life Expectancy

Page 39: Multiple Regression III 4/16/12 More on categorical variables Missing data Variable Selection Stepwise Regression Confounding variables Not in book Professor

Is this a good model for predicting life expectancy based on cell phone subscriptions?

(a) Yes(b) No

Cell Phones and Life Expectancy

Page 40: Multiple Regression III 4/16/12 More on categorical variables Missing data Variable Selection Stepwise Regression Confounding variables Not in book Professor

Is a country’s number of cell phone subscriptions per capita helpful in predicting life expectancy?

(a) Yes(b) No

Cell Phones and Life Expectancy

Page 41: Multiple Regression III 4/16/12 More on categorical variables Missing data Variable Selection Stepwise Regression Confounding variables Not in book Professor

If we gave everyone in a country a cell phone and a cell phone subscription, would life expectancy in that country increase?

(a) Yes(b) No(c) Impossible to tell

Cell Phones and Life Expectancy

Page 42: Multiple Regression III 4/16/12 More on categorical variables Missing data Variable Selection Stepwise Regression Confounding variables Not in book Professor

Is a country’s cell phone subscription rate helpful in predicting life expectancy, even after including GDP in the model?

(a) Yes (b) No

Cell Phones and Life Expectancy

Page 43: Multiple Regression III 4/16/12 More on categorical variables Missing data Variable Selection Stepwise Regression Confounding variables Not in book Professor

Cell Phones and Life Expectancy• This says that wealth alone can not explain the association between cell phone subscriptions and life expectancy

• This suggests that either cell phones actually do something to increase life expectancy (causal) OR there is another confounding variable besides wealth of the country

Page 44: Multiple Regression III 4/16/12 More on categorical variables Missing data Variable Selection Stepwise Regression Confounding variables Not in book Professor

Confounding Variables• Multiple regression is one potential way to account for confounding variables

• This is most commonly used in practice across a wide variety of fields, but is quite sensitive to the conditions for the linear model (particularly linearity)

• You can only “rule out” confounding variables that you have data on, so it is still very hard to make true causal conclusions without a randomized experiment