TRANSCRIPT
-
Regression in R
Seth Margolis
GradQuant
May 31, 2018
1
-
What is Regression Good For?
• Assessing relationships between variables
This probably covers most of what you do
Example:
What is the relationship between intelligence and GPA?
Intelligence is the independent variable, which we will call X
GPA is the dependent variable, which we will call Y
2
[Scatterplot: GPA (y-axis, 2.0–4.0) vs. Intelligence (x-axis, 60–130)]
Person Intelligence GPA
1 70 2.7
2 75 2.5
3 80 3.4
4 85 2.6
5 90 2.7
6 95 2.7
7 100 3.2
8 105 3.5
9 110 3.2
10 115 3.8
11 120 3.6
-
What is Regression?
• Predicts an outcome (Y) from 1+ predictors (Xs)
• Fitting a line to data
Y = mX + b + error
Y = intercept + slope*X + error
• Y = b0 + b1X1 + b2X2 + … + bkXk + e
• Y is one column of data
• Each X is a column of data
• Each bk is the coefficient/weight for Xk
• b0 is the intercept
Prediction for Y when all Xs are 0
• e is the error/residual
What is left over after prediction
Yactual - Ypredicted
1 value for each case (row of data)
3
-
How does R select the best bs?
• Y = b0 + b1X1 + b2X2 + e
• What would the best bs do?
They would lead to predictions of Y that are closest to the actual Y values
• How close is one prediction to the actual value for that case (row in data)?
The residual/error
Yactual - Ypredicted
Better bs lead to smaller residuals/errors
• Proposed solution: find bs that lead to the lowest average residual/error
• Problem: the average residual is always 0
• Solution: find bs that lead to the lowest average squared residual/error
This is ordinary least squares (OLS) regression
• R finds the bs that minimize the squared residuals/errors using matrix algebra
4
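A minimal sketch (simulated toy data, not the workshop's dataset) of what "minimize the squared residuals using matrix algebra" means: the closed-form OLS solution b = (X'X)^(-1) X'y matches what `lm()` computes.

```r
# Toy data: one predictor plus an intercept column; names are illustrative.
set.seed(1)
X <- cbind(1, runif(50, 60, 120))        # column of 1s (intercept) + predictor
y <- 2 + 0.01 * X[, 2] + rnorm(50, sd = 0.2)

# Closed-form OLS: solve (X'X) b = X'y for b.
b_manual <- solve(t(X) %*% X, t(X) %*% y)

# lm() minimizes the same squared residuals (internally via QR decomposition).
fit <- lm(y ~ X[, 2])

all.equal(as.numeric(b_manual), as.numeric(coef(fit)))  # TRUE
```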
-
Assumptions
1. Homoscedasticity
Variance of residuals does not change across levels of X
When violated, bootstrap
2. Residuals/errors normally distributed
Can check with a histogram or P-P / Q-Q plot
When violated, bootstrap
3. Independence of residuals/errors
When violated, we can use multilevel modeling
4. Linearity
When violated, transform Xs to meet this assumption
• Xs can have any distribution
5
-
R and RStudio
• Script
• Console
• Environment
• Object assignment
• Plotting
• Packages
• Help
• Commenting
• Data frames
• Functions and arguments
6
-
Load Data
• Usually you will use
dataname = read.csv("filepath")
• We will use a dataset in the corrgram package about 322 Major League Baseball regular and substitute hitters in 1986
7
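A self-contained sketch of the usual loading pattern. The file written here is a tiny made-up stand-in so the example runs anywhere; the workshop's actual data ships with the corrgram package (if installed, `data(baseball, package = "corrgram")` loads it).

```r
# Write a tiny stand-in CSV so the example is self-contained;
# normally you would just call read.csv("filepath") on your own file.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(Player = c("A", "B"), Hits = c(101, 92)),
          tmp, row.names = FALSE)

BaseballData <- read.csv(tmp)   # usual pattern: dataname = read.csv("filepath")
str(BaseballData)               # check columns and types after loading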
-
Some Data Transformations – Not Focus of this Workshop
8
-
1 Continuous Predictor – Scatterplot
• Y = b0 + b1X1 + e
• Scatterplot can be quite revealing
• Want to make sure the relationship looks linear
• Otherwise you will probably want to transform your predictor
Later in workshop
[Scatterplot: "Salary and Season Batting Average" — Season Batting Average (0.20–0.35) vs. Salary (thousands of $ per year, 0–2500)]
9
-
1 Continuous Predictor – Model
• Estimate = b = regression coefficient
• Standard error = standard deviation of the sampling distribution
Provides confidence intervals
Most affected by N (sample size)
• t = b / se
• t and df -> p
df = N - k - 1
N = sample size
k = number of predictors
• Multiple R2 = squared correlation of predicted and actual values
Its p-value is the p from the F test (lower right of the output)
• Adjusted R2 adjusts R2 based on the number of predictors
For each 1-unit increase in SeasonSalary, the expected SeasonBattingAvg increases by .000022
10
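A sketch of this one-predictor model. The variable names (SeasonSalary, SeasonBattingAvg) follow the slides, but the data here are simulated, so the numbers will differ from the workshop output.

```r
# Simulated stand-ins for the slides' variables.
set.seed(2)
SeasonSalary <- runif(100, 60, 2500)                 # thousands of $ per year
SeasonBattingAvg <- 0.25 + 0.00002 * SeasonSalary + rnorm(100, sd = 0.03)

fit <- lm(SeasonBattingAvg ~ SeasonSalary)
summary(fit)                 # Estimate, Std. Error, t value, Pr(>|t|), R-squared
coef(fit)["SeasonSalary"]    # expected change in Y per 1-unit increase in X
confint(fit)                 # confidence intervals built from the standard errors
```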
-
Standardized Coefficients
• What if all variables were standardized before entering the model?
• Z = (X – Mean) / SD
• The resulting βs can be interpreted like correlations
Range from -1 to 1
Called standardized regression coefficients
For each 1 SD increase in SeasonSalary, SeasonBattingAvg increases by .33 SDs
11
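A quick sketch of standardization with `scale()` on simulated data: with a single predictor, the standardized coefficient equals the correlation.

```r
# Simulated variables; scale() applies Z = (X - Mean) / SD.
set.seed(3)
x <- rnorm(200)
y <- 0.5 * x + rnorm(200)

beta <- coef(lm(scale(y) ~ scale(x)))[2]   # standardized coefficient
all.equal(as.numeric(beta), cor(x, y))     # TRUE: one-predictor beta = r
```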
-
Multiple Continuous Predictors
• Coefficients are partial coefficients
Contribution of that variable holding other variables constant
Unique contribution of that variable over and above other variables
[Venn diagram: overlap of Y with X1 and X2]
12
-
Adding a Predictor
For each 1-unit increase in CareerYears, the expected SeasonBattingAvg changes by -.00099, holding other variables constant
13
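A simulated sketch of adding a predictor. With two predictors, each coefficient is a partial coefficient: the effect of that variable holding the other constant. Variable names follow the slides; the data and numbers are made up.

```r
# Simulated data where salary and career length are correlated.
set.seed(4)
CareerYears <- rpois(150, 6)
SeasonSalary <- 100 + 150 * CareerYears + rnorm(150, sd = 200)
SeasonBattingAvg <- 0.25 + 0.00002 * SeasonSalary - 0.001 * CareerYears +
  rnorm(150, sd = 0.02)

fit2 <- lm(SeasonBattingAvg ~ SeasonSalary + CareerYears)
coef(fit2)   # partial (unique) contribution of each predictor
```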
-
Adding a Predictor
14
-
Categorical Predictors: Dichotomous
[Boxplot: Season Batting Average (0.20–0.35) by League (A, N)]
The NL has a batting average .0036 lower than the AL
15
-
Categorical Predictors: 3+ Levels
• Dummy Coding
     1  2  3  4  5  6
1B   1  0  0  0  0  0
2B   0  1  0  0  0  0
3B   0  0  1  0  0  0
C    0  0  0  1  0  0
DH   0  0  0  0  1  0
OF   0  0  0  0  0  0
SS   0  0  0  0  0  1
• Effects Coding
     1  2  3  4  5  6
1B   1  0  0  0  0  0
2B   0  1  0  0  0  0
3B   0  0  1  0  0  0
C    0  0  0  1  0  0
DH   0  0  0  0  1  0
OF  -1 -1 -1 -1 -1 -1
SS   0  0  0  0  0  1
• Dummy Coding
Pick a reference group
Estimate that group's level of Y as the intercept
For all other groups, estimate how different their level of Y is from the reference group; these differences are the bs
I usually prefer this over effects coding
• Effects Coding
Estimate the grand mean (mean of all groups) as the intercept
For all other groups (except 1), estimate how different their level of Y is from the grand mean; these differences are the bs
16
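A sketch of how R sets up these coding schemes for a factor. The position labels come from the slides; note that with `contr.sum` R puts the -1 row on the last factor level (here SS after releveling), whereas the slides put it on OF.

```r
# A factor with the 7 position labels from the slides.
Position <- factor(c("1B", "2B", "3B", "C", "DH", "OF", "SS"))

contrasts(Position)                        # default: dummy (treatment) coding
Position <- relevel(Position, ref = "OF")  # make OF the reference group
contrasts(Position)                        # OF row is now all zeros

contrasts(Position) <- contr.sum(7)        # effects coding: last level gets -1s
contrasts(Position)
```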
-
Categorical Predictors: 3+ Levels
17
-
Categorical Predictors: 3+ Levels
[Boxplot: Season Batting Average (0.20–0.35) by Position (1B, 2B, 3B, C, DH, OF, SS)]
Catchers have a batting average .019 lower than outfielders
18
-
Comparing to ANOVA
Between-Group Variability
Within-Group Variability
19
-
ANCOVA – Categorical and Continuous Predictors
Catchers have a batting average .010 lower
than outfielders, holding other variables
constant
20
-
Non-linear Relationships
[Scatterplot: Season Batting Average (0.20–0.35) vs. Career Years (5–20)]
21
-
Interactions
• Interaction = the effect of X1 depends on the level of X2
If that is true, the opposite is true too
• Y = b0 + b1X1 + b2X2 + b3X1X2
• Y = b0 + b1X1 + (b2 + b3X1)X2
b3 = for each 1-unit increase in X1, how much b2 (the effect of X2) changes
And vice-versa
• Can have 3-way, 4-way, etc. interactions
22
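A sketch of fitting an interaction on simulated data: in R's formula language, `x1 * x2` expands to the main effects plus the product term.

```r
# Simulated data with a true interaction.
set.seed(5)
x1 <- rnorm(200)
x2 <- rnorm(200)
y <- 1 + 0.5 * x1 + 0.3 * x2 + 0.4 * x1 * x2 + rnorm(200)

fit_int <- lm(y ~ x1 * x2)     # same as y ~ x1 + x2 + x1:x2
names(coef(fit_int))           # "(Intercept)" "x1" "x2" "x1:x2"
```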
-
Interaction Plot
23
-
Spline Regression – Making Data
24
-
Spline Regression – Model
bs are slopes before and after knot
25
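One way to code a linear spline by hand, sketched on simulated data: add a hinge term that is zero before the knot, so the slope is b1 before the knot and b1 + b2 after it. The knot at 10 CareerYears is an assumption for illustration, not taken from the slides.

```r
# Simulated data whose slope changes at a knot (placed at 10 for illustration).
set.seed(6)
CareerYears <- runif(200, 1, 20)
knot <- 10
y <- 0.26 + 0.004 * CareerYears - 0.006 * pmax(CareerYears - knot, 0) +
  rnorm(200, sd = 0.02)

# I(pmax(x - knot, 0)) is the hinge term: 0 before the knot, x - knot after.
fit_sp <- lm(y ~ CareerYears + I(pmax(CareerYears - knot, 0)))
coef(fit_sp)   # slope before the knot; change in slope after the knot
```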
-
Count Outcome
26
-
Count Outcome –Negative Binomial Model
27
-
Count Outcome – Negative Binomial Model
For each 1-unit increase in log(CareerYears), the expected log(CareerAtBats) increases by 1.12
28
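A sketch of a negative binomial model with `MASS::glm.nb` (MASS ships with R). The variable names follow the slides; the data are simulated stand-ins, so coefficients will differ from the workshop output.

```r
# Negative binomial regression for an overdispersed count outcome.
library(MASS)
set.seed(7)
CareerYears <- rpois(300, 6) + 1
CareerAtBats <- rnbinom(300, mu = exp(3 + 1.1 * log(CareerYears)), size = 2)

fit_nb <- glm.nb(CareerAtBats ~ log(CareerYears))
coef(fit_nb)   # coefficients are on the log scale, as on the slide
```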
-
Count Outcome – Zero-Inflated Model
[Scatterplot: BaseballData$SeasonHitsZeroInflated (0–140) vs. BaseballData$SeasonSalary (0–2500)]
For each 1-unit increase in SeasonSalary, the expected log(SeasonHits) increases by .0004
Use when many observations have a 0 on the Y
29
-
Proportional Outcome – Beta Regression
For each 1-unit increase in SeasonSalary, the log odds of SeasonBattingAvg increases by .00011
Log odds = log(probability / (1 - probability))
Can do algebra to get the probability at each X
30
-
Quantile (Percentile) Regression
[Scatterplot: Season Batting Average (0.20–0.35) vs. Season Salary (0–2500), with fitted lines for the 20th, 50th, and 80th percentiles]
Predict a certain percentile at each X
31
-
Dichotomous Outcome – Logistic Regression
[Plot: Probability of Outfielder (0–1) vs. Season Batting Average (0.20–0.35)]
For each 1-unit increase in SeasonBattingAvg, the log odds of being an outfielder increases by 65
odds ratio = 2 -> probability of .66
32
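A sketch of logistic regression with base R's `glm()`. The outcome here is a simulated 0/1 outfielder indicator built to roughly echo the slide's slope of 65; the data are not the workshop's.

```r
# Simulated dichotomous outcome; plogis() is the inverse logit.
set.seed(8)
SeasonBattingAvg <- runif(300, 0.20, 0.35)
Outfielder <- rbinom(300, 1, plogis(-18 + 65 * SeasonBattingAvg))

fit_log <- glm(Outfielder ~ SeasonBattingAvg, family = binomial)
coef(fit_log)        # slope = change in log odds per 1-unit increase in X
exp(coef(fit_log))   # exponentiating gives odds ratios
```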
-
Ordinal Outcome - Ordered Logistic Regression
33
-
Ordinal Outcome - Ordered Logistic Regression
34
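A sketch of ordered logistic regression with `MASS::polr` (MASS ships with R). The ordered hit categories and cutpoints here are invented for illustration; only the SeasonWalks predictor name comes from the slides.

```r
# Ordered outcome built from a simulated latent variable.
library(MASS)
set.seed(9)
SeasonWalks <- runif(300, 20, 100)
latent <- 0.05 * SeasonWalks + rlogis(300)
HitCategory <- cut(latent, breaks = c(-Inf, 2, 4, Inf),
                   labels = c("low", "medium", "high"), ordered_result = TRUE)

fit_ord <- polr(HitCategory ~ SeasonWalks, Hess = TRUE)
summary(fit_ord)   # one slope plus cutpoints between adjacent categories
```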
-
Ordinal Outcome –Ordered Logistic Regression
[Plot: Probability (0–1) vs. Season Walks (20–100), with curves for 30, 40, 50, 60, 70, 80, 90, 100, and 200 Hits]
35
-
Categorical Outcome – Multinomial Logistic Regression
36
-
Categorical Outcome – Multinomial Logistic Regression
37
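A sketch of multinomial logistic regression with `nnet::multinom` (nnet ships with R). Three position labels are used here for brevity; the data are simulated, so the coefficients are not meaningful.

```r
# Unordered categorical outcome with 3 levels.
library(nnet)
set.seed(10)
SeasonBattingAvg <- runif(300, 0.20, 0.35)
Position <- sample(c("C", "OF", "SS"), 300, replace = TRUE)

fit_mn <- multinom(Position ~ SeasonBattingAvg, trace = FALSE)
coef(fit_mn)                            # one row per non-reference level
head(predict(fit_mn, type = "probs"))   # predicted probability of each position
```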
-
Categorical Outcome– Multinomial Logistic Regression
[Plot: Probability (0–1) vs. Season Batting Average (0.20–0.35), with curves for Catcher, 1st Baseman, 2nd Baseman, 3rd Baseman, Shortstop, and Outfielder]
38
-
Nested Data – Multilevel Modeling
• Assumption of regression: independence of residuals/errors
When violated, we can use multilevel modeling
• Dependence of residuals/errors usually results from grouping/nesting in the measured DV
Students nested in classrooms/teachers
Observations nested within participants (i.e., repeated measures)
Participants nested in countries
• Conceptually, it’s like running separate regression models for each classroom and then aggregating them
• Can have student-level and classroom-level predictors
• Can have more than 2 levels
39
-
Multilevel Equations
• Example with level-1 and level-2 predictors:
Level-1 Model:
Yij = β0j + β1jXij + rij
Level-2 Model:
β0j = γ00 + γ01Wj + u0j
β1j = γ10 + γ11Wj + u1j
Combined Model:
Yij = γ00 + γ01Wj + γ10Xij + γ11WjXij + u0j + u1jXij + rij
• Var(rij) = σ2
• Var(u0j) = τ00
• Var(u1j) = τ11
• Cov(u0j, u1j) = τ01
40
-
Nested Data – Multilevel Modeling
• Level-1 Model:
Salaryij = β0j + β1jSeasonHomeRunsij + rij
• Level-2 Model:
β0j = 313.9 + u0j
β1j = 19.3 + u1j
• Var(rij) = σ2 = 150331.9
• Var(u0j) = τ00 = 1560.0
• Var(u1j) = τ11 = 168.7
• Cor(u0j, u1j) = -.71
Within teams, for each 1-unit increase in SeasonHomeRuns, the expected Salary increases by 19.3 units
41
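A sketch of this model with `nlme::lme` (nlme ships with R): each Team gets its own intercept and SeasonHomeRuns slope. The data are simulated to resemble the slide's structure, so the estimates will not match the slide's numbers.

```r
# Simulated nested data: players within 20 teams, with team-varying
# intercepts and slopes.
library(nlme)
set.seed(11)
d <- data.frame(Team = rep(paste0("T", 1:20), each = 15))
team <- as.integer(factor(d$Team))
d$SeasonHomeRuns <- rpois(nrow(d), 10)
d$Salary <- rnorm(20, 300, 40)[team] +
  rnorm(20, 19, 4)[team] * d$SeasonHomeRuns +
  rnorm(nrow(d), sd = 100)

fit_ml <- lme(Salary ~ SeasonHomeRuns, random = ~ SeasonHomeRuns | Team,
              data = d)
fixef(fit_ml)     # gamma00 (intercept) and gamma10 (within-team slope)
VarCorr(fit_ml)   # tau00, tau11 (random-effect variances) and residual sigma^2
```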
-
Nested Data – Multilevel Modeling
42
-
Nested Data – Multilevel Modeling
43
-
Nested Data – Multilevel Modeling
44
-
Regularization
• To prevent overfitting, take parsimony into account
• Ridge (L2)
Causes regression coefficients to shrink
• Lasso (L1)
Causes some regression coefficients to become 0
• Elastic Net
Hybrid of the other two
45
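Ridge regression has a closed form, which makes the shrinkage idea concrete: b_ridge = (X'X + lambda*I)^(-1) X'y. This hand-rolled sketch on simulated data is for intuition only; in practice the glmnet package fits ridge, lasso, and elastic net.

```r
# Simulated design matrix with standardized columns (ridge assumes comparable
# predictor scales).
set.seed(12)
X <- scale(matrix(rnorm(100 * 5), 100, 5))
y <- X %*% c(1, 0.5, 0, 0, -0.5) + rnorm(100)

# Closed-form ridge solution; lambda = 0 recovers plain OLS.
ridge <- function(lambda) solve(t(X) %*% X + lambda * diag(5), t(X) %*% y)

cbind(ols = ridge(0), shrunk = ridge(50))  # larger lambda shrinks the bs
```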
-
Overfit Multiple Regression Model
[Scatterplot: BaseballData$SeasonBattingAvg (0.20–0.35) vs. MultipleRegressionModel$fitted.values (0.20–0.40)]
46
-
Multicollinearity
47
-
Ridge Regression
[Cross-validation plot: Mean-Squared Error (0.4–1.0) vs. log(Lambda) (-2 to 6)]
48
-
Ridge Regression
[Scatterplot: Actual Season Batting Average vs. Predicted Season Batting Average (both standardized, -2 to 2)]
49
-
Lasso Regression
50
-
Elastic Net Regression
51
-
Bootstrapping
• Two assumptions of regression:
1. Homoscedasticity
Variance of residuals does not change across levels of X
2. Residuals/errors normally distributed
Can use histogram or P-P / Q-Q plot
• Can address both with bootstrapping
Imagine a dataset with N rows
Sample rows with replacement N times
Run the model on that new dataset
Record the results
Repeat 10,000 times
This provides a distribution of results with a mean and standard deviation/error
52
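A sketch of the procedure above on simulated data, using a plain loop (the boot package wraps the same idea). 2,000 resamples are used here for speed; the slides suggest 10,000.

```r
# Simulated data with a known slope of 0.5.
set.seed(13)
d <- data.frame(x = rnorm(100))
d$y <- 2 + 0.5 * d$x + rnorm(100)

# Resample rows with replacement, refit, and keep the slope each time.
boot_b1 <- replicate(2000, {
  i <- sample(nrow(d), replace = TRUE)
  coef(lm(y ~ x, data = d[i, ]))["x"]
})

mean(boot_b1); sd(boot_b1)           # bootstrap estimate and standard error
quantile(boot_b1, c(0.025, 0.975))   # percentile 95% confidence interval
```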
-
Bootstrapping Three-Predictor Model
53
-
Causal Claims
• Correctly specified model
Regression coefficients can be interpreted as causal effects if the model is "correctly specified"
A model is correctly specified when all causes of Y that are correlated with any X in the model are in the model
Models that are not correctly specified are still valid prediction models
Rare, except for…
• Random assignment
Creates a correctly specified model
If randomly assigned condition is a predictor, nothing is correlated with it (assuming large enough N)
So the model is correctly specified
If you add covariates, you can still interpret the effects of condition causally
Must be able to manipulate the IV
If you cannot, try to make the model as "correct" as possible
54
-
Future Directions
• Structural Equation Modeling (SEM)
Path analysis
Mediation
Latent variables – model measurement error
• Factor Analysis
• Longitudinal data analysis
Regressed change
Predict t2 from t1 and other variables
Difference scores
Outcome is t2 - t1
MLM
SEM
Latent growth models
Cross-lagged models
Time series analyses
• Machine Learning
55