Regression in R
Seth Margolis
GradQuant
May 31, 2018



  • What is Regression Good For?

    • Assessing relationships between variables (this probably covers most of what you do)

    Example:

    What is the relationship between intelligence and GPA?

    Intelligence is the independent variable, which we will call X

    GPA is the dependent variable, which we will call Y

    [Scatterplot: GPA (y-axis, 2.0 to 4.0) vs. Intelligence (x-axis, 60 to 130)]

    Person  Intelligence  GPA
    1       70            2.7
    2       75            2.5
    3       80            3.4
    4       85            2.6
    5       90            2.7
    6       95            2.7
    7       100           3.2
    8       105           3.5
    9       110           3.2
    10      115           3.8
    11      120           3.6

  • What is Regression?

    • Predicts an outcome (Y) from 1+ predictors (Xs)

    • Fitting a line to data: Y = mX + b + error
      Y = intercept + slope*X + error

    • Y = b0 + b1X1 + b2X2 + … + e

    • Y is one column of data

    • Each X is a column of data

    • Each bx is the coefficient/weight for that X

    • b0 is the intercept: the prediction for Y when all Xs are 0

    • e is the error/residual: what is left over after prediction
      Yactual - Ypredicted, one value for each case (row of data)


  • How does R select the best bs?

    • Y = b0 + b1X1 + b2X2 + e

    • What would the best bs do? They would lead to predictions of Y that are closest to the actual Y values

    • How close is one prediction to the actual value for that case (row in data)? The residual/error: Yactual - Ypredicted. Better bs lead to smaller residuals/errors

    • Proposed solution: find the bs that give the lowest average residual/error

    • Problem: the average residual is always 0

    • Solution: find the bs that give the lowest average squared residual/error. This is ordinary least squares (OLS) regression

    • R finds the bs that minimize the squared residuals/errors using matrix algebra (see the sketch below)
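    As a concrete illustration of that last point, here is a minimal sketch (not from the slides, using made-up data) showing that the matrix-algebra solution matches what lm() finds:

      # OLS by matrix algebra vs. lm(), on simulated data
      set.seed(1)
      X1 <- rnorm(100); X2 <- rnorm(100)
      Y  <- 1 + 2 * X1 - 0.5 * X2 + rnorm(100)

      X <- cbind(1, X1, X2)                   # design matrix with intercept column
      b <- solve(t(X) %*% X) %*% t(X) %*% Y   # bs that minimize squared residuals

      cbind(b, coef(lm(Y ~ X1 + X2)))         # the two columns match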

  • Assumptions

    1. Homoscedasticity: the variance of the residuals does not change across levels of X
       When violated, bootstrap

    2. Residuals/errors are normally distributed: check with a histogram or P-P / Q-Q plot
       When violated, bootstrap

    3. Independence of residuals/errors
       When violated, we can use multilevel modeling

    4. Linearity
       When violated, transform the Xs to meet this assumption

    • The Xs themselves can have any distribution (a quick diagnostic sketch follows this list)
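    To eyeball assumptions 1, 2, and 4, base R's plot() method for fitted lm objects is a quick check; a minimal sketch on a built-in dataset:

      # Residual diagnostics for a fitted model (mtcars used just for illustration)
      ExampleModel <- lm(mpg ~ wt, data = mtcars)
      par(mfrow = c(1, 2))
      plot(ExampleModel, which = 1)  # residuals vs. fitted: constant spread = homoscedasticity, no curve = linearity
      plot(ExampleModel, which = 2)  # normal Q-Q plot of the residuals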

  • R and RStudio

    • Script

    • Console

    • Environment

    • Object assignment

    • Plotting

    • Packages

    • Help

    • Commenting

    • Data frames

    • Functions and arguments


  • Load Data

    • Usually you will use:
      dataname <- read.csv("filepath")

    • We will use a dataset in the corrgram package about 322 Major League Baseball regular and substitute hitters in 1986

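    For this workshop the data come from a package instead; a minimal sketch, assuming the corrgram data frame is named baseball (the slides' BaseballData frame with renamed variables is built from it, though the exact transformations are not shown):

      # install.packages("corrgram")  # once, if needed
      library(corrgram)
      data(baseball)  # 322 MLB regular and substitute hitters in 1986
      str(baseball)   # inspect before transforming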

  • Some Data Transformations – Not Focus of this Workshop


  • 1 Continuous Predictor – Scatterplot

    • Y = b0 + b1X1 + e

    • Scatterplot can be quite revealing

    • Want to make sure the relationship looks linear

    • Otherwise you will probably want to transform your predictor (covered later in the workshop)

    [Scatterplot: Salary and Season Batting Average. X-axis: Salary (thousands of $ per year), 0 to 2500; y-axis: Season Batting Average, 0.20 to 0.35]
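    A minimal sketch of drawing this scatterplot, using the BaseballData variable names that appear later in the slides:

      plot(BaseballData$SeasonSalary, BaseballData$SeasonBattingAvg,
           xlab = "Salary (thousands of $ per year)",
           ylab = "Season Batting Average",
           main = "Salary and Season Batting Average")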

  • 1 Continuous Predictor – Model

    • Estimate = b = regression coefficient

    • Standard error = the standard deviation of the sampling distribution
      Provides confidence intervals; most affected by N (sample size)

    • t = b / se

    • t and df -> p, where df = N - k - 1
      N = sample size, k = number of predictors

    N = sample size

    k = number of predictors

    • Multiple R2 = the squared correlation of the predicted and actual values
      Its p-value is the p from the F test (lower right of the summary output)

    • Adjusted R2 adjusts R2 based on the number of predictors

    For each 1-unit increase in SeasonSalary, the expected SeasonBattingAvg increases by .000022
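    A minimal sketch of the fit behind this kind of summary (the model name is a placeholder):

      Model1 <- lm(SeasonBattingAvg ~ SeasonSalary, data = BaseballData)
      summary(Model1)  # Estimate, Std. Error, t value, p, R-squared, F test
      confint(Model1)  # confidence intervals built from the standard errors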

  • Standardized Coefficients

    • What if all values were standardized before entering model?

    • Z = (X – Mean) / SD

    • The resulting βs can be interpreted like correlations: they range from -1 to 1
      These are called standardized regression coefficients

    For each 1 SD increase in SeasonSalary, SeasonBattingAvg increases by .33 SDs
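    One simple way to get standardized coefficients in R is to scale() both variables before fitting; a minimal sketch:

      ModelZ <- lm(scale(SeasonBattingAvg) ~ scale(SeasonSalary), data = BaseballData)
      coef(ModelZ)  # the slope is the standardized coefficient (beta)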

  • Multiple Continuous Predictors

    • Coefficients are partial coefficients

    Contribution of that variable holding other variables constant

    Unique contribution of that variable over and above other variables

    [Venn diagram: Y overlapping with X1 and X2, illustrating unique vs. shared variance]

  • Adding a Predictor

    For each 1-unit increase in CareerYears, the expected SeasonBattingAvg changes by -.00099, holding other variables constant
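    A minimal sketch of the two-predictor model (names follow the slides):

      Model2 <- lm(SeasonBattingAvg ~ SeasonSalary + CareerYears, data = BaseballData)
      summary(Model2)  # each b is now a partial (unique) contribution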


  • Categorical Predictors: Dichotomous

    [Boxplot: Season Batting Average (y-axis, 0.20 to 0.35) by League (x-axis: A, N)]

    The NL has a batting average .0036 lower than the AL
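    A minimal sketch; if League is a factor with levels A and N, lm() dummy-codes it automatically, so the League coefficient is the N-vs-A difference:

      ModelLeague <- lm(SeasonBattingAvg ~ League, data = BaseballData)
      coef(ModelLeague)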

  • Categorical Predictors: 3+ Levels

    • Dummy Coding

          1  2  3  4  5  6
      1B  1  0  0  0  0  0
      2B  0  1  0  0  0  0
      3B  0  0  1  0  0  0
      C   0  0  0  1  0  0
      DH  0  0  0  0  1  0
      OF  0  0  0  0  0  0
      SS  0  0  0  0  0  1

    • Effects Coding

          1  2  3  4  5  6
      1B  1  0  0  0  0  0
      2B  0  1  0  0  0  0
      3B  0  0  1  0  0  0
      C   0  0  0  1  0  0
      DH  0  0  0  0  1  0
      OF -1 -1 -1 -1 -1 -1
      SS  0  0  0  0  0  1

    • Dummy Coding
      Pick a reference group
      Estimate that group's level of Y as the intercept
      For every other group, a b estimates how different its level of Y is from the reference group
      I usually prefer this over effects coding

    • Effects Coding
      Estimate the grand mean (mean of all groups) as the intercept
      For every group except one, a b estimates how different its level of Y is from the grand mean

    A coding sketch in R follows.
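    In R, factors get dummy (treatment) coding by default, and effects coding is available as sum contrasts; a minimal sketch, assuming Position is a factor and OF is chosen as the reference:

      BaseballData$Position <- relevel(factor(BaseballData$Position), ref = "OF")
      ModelDummy <- lm(SeasonBattingAvg ~ Position, data = BaseballData)

      # Effects coding: switch the factor to sum contrasts
      contrasts(BaseballData$Position) <- contr.sum(nlevels(BaseballData$Position))
      ModelEffects <- lm(SeasonBattingAvg ~ Position, data = BaseballData)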


  • Categorical Predictors: 3+ Levels

    [Boxplot: Season Batting Average (y-axis, 0.20 to 0.35) by Position (x-axis: 1B, 2B, 3B, C, DH, OF, SS)]

    Catchers have a batting average .019 lower than outfielders

  • Comparing to ANOVA

    Between-Group Variability
    Within-Group Variability

  • ANCOVA – Categorical and Continuous Predictors

    Catchers have a batting average .010 lower than outfielders, holding other variables constant
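    A minimal sketch of a model mixing a categorical and a continuous predictor; using CareerYears as the covariate is an assumption, since the slide does not show the exact call:

      ModelANCOVA <- lm(SeasonBattingAvg ~ Position + CareerYears, data = BaseballData)
      summary(ModelANCOVA)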

  • Non-linear Relationships

    [Scatterplot: Season Batting Average (y-axis, 0.20 to 0.35) vs. Career Years (x-axis, 5 to 20), showing a non-linear relationship]

  • Interactions

    • Interaction = the effect of X1 depends on the level of X2 (and if that is true, the reverse is true too)

    • Y = b0 + b1X1 + b2X2 + b3X1X2 + e

    • Rearranged: Y = b0 + b1X1 + (b2 + b3X1)X2
      b3 = for each 1-unit increase in X1, how much b2 changes (and vice versa)

    • Can have 3-way, 4-way, etc. interactions (a fitting sketch follows)
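    A minimal sketch; which two predictors interact is an assumption here, since the slide shows only the generic equation:

      # In an R formula, X1 * X2 expands to X1 + X2 + X1:X2
      ModelInt <- lm(SeasonBattingAvg ~ SeasonSalary * CareerYears, data = BaseballData)
      summary(ModelInt)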

  • Interaction Plot


  • Spline Regression – Making Data


  • Spline Regression – Model

    The bs are the slopes before and after the knot (a data-making sketch follows)
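    A minimal sketch of the data-making step under one common parameterization; the knot location (10 career years) and variable names are assumptions:

      # Truncated-line basis: 0 before the knot, (x - knot) after it
      knot <- 10
      BaseballData$AfterKnot <- pmax(BaseballData$CareerYears - knot, 0)
      ModelSpline <- lm(SeasonBattingAvg ~ CareerYears + AfterKnot, data = BaseballData)
      # Here the CareerYears b is the pre-knot slope; the post-knot slope is the
      # sum of the two bs (an alternative basis codes the two segment slopes directly)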

  • Count Outcome


  • Count Outcome – Negative Binomial Model

    For each 1-unit increase in log(CareerYears), the expected log(CareerAtBats) increases by 1.12
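    A minimal sketch; MASS::glm.nb is a standard fitter for this, though the slide's exact call is not shown:

      library(MASS)
      ModelNB <- glm.nb(CareerAtBats ~ log(CareerYears), data = BaseballData)
      summary(ModelNB)  # coefficients are on the log-count scale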

  • Count Outcome – Zero-Inflated Model

    [Scatterplot: BaseballData$SeasonHitsZeroInflated (y-axis, 0 to 140) vs. BaseballData$SeasonSalary (x-axis, 0 to 2500), with many observations at 0]

    Use when many observations have a 0 on the Y

    For each 1-unit increase in SeasonSalary, the expected log(SeasonHits) increases by .0004
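    A minimal sketch; pscl::zeroinfl is one common implementation (an assumption, as the slide's code is not shown):

      library(pscl)
      # By default the count part and the zero part share the same predictor
      ModelZI <- zeroinfl(SeasonHitsZeroInflated ~ SeasonSalary, data = BaseballData)
      summary(ModelZI)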

  • Proportional Outcome – Beta Regression

    For each 1-unit increase in SeasonSalary, the log odds of SeasonBattingAvg increases by .00011

    Log odds = log(probability / (1 - probability)); you can do algebra to recover the probability at each X
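    A minimal sketch; betareg::betareg is the usual tool for outcomes strictly between 0 and 1 (assumed here):

      library(betareg)
      ModelBeta <- betareg(SeasonBattingAvg ~ SeasonSalary, data = BaseballData)
      summary(ModelBeta)  # mean-model coefficients are on the log-odds (logit) scale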

  • Quantile (Percentile) Regression

    [Scatterplot: Season Batting Average (y-axis, 0.20 to 0.35) vs. Season Salary (x-axis, 0 to 2500), with fitted lines for the 20th, 50th, and 80th percentiles]

    Predict a certain percentile at each X
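    A minimal sketch using quantreg::rq (a standard choice, assumed here), fitting the three percentiles in the plot:

      library(quantreg)
      ModelQ <- rq(SeasonBattingAvg ~ SeasonSalary, tau = c(0.2, 0.5, 0.8),
                   data = BaseballData)
      summary(ModelQ)  # one set of coefficients per requested percentile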

  • Dichotomous Outcome – Logistic Regression

    [Plot: Probability of Outfielder (y-axis, 0 to 1) vs. Season Batting Average (x-axis, 0.20 to 0.35), showing the fitted logistic curve]

    For each 1-unit increase in SeasonBattingAvg, the log odds of being an outfielder increases by 65

    An odds ratio of 2 corresponds to a probability of .66
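    A minimal sketch with glm(); the 0/1 outcome name Outfielder is an assumption:

      ModelLogit <- glm(Outfielder ~ SeasonBattingAvg, family = binomial,
                        data = BaseballData)
      summary(ModelLogit)    # coefficients are log odds
      exp(coef(ModelLogit))  # exponentiate to get odds ratios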


  • Ordinal Outcome – Ordered Logistic Regression

    [Plot: Probability (y-axis, 0 to 1) vs. Season Walks (x-axis, 20 to 100), with one curve per hits category: 30, 40, 50, 60, 70, 80, 90, 100, and 200 hits]
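    A minimal sketch; MASS::polr is a standard fitter, and the variable names here are assumptions based on the plot labels:

      library(MASS)
      # The outcome must be an ordered factor (e.g., binned season hits)
      ModelOrd <- polr(HitsCategory ~ SeasonWalks, data = BaseballData, Hess = TRUE)
      summary(ModelOrd)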


  • Categorical Outcome – Multinomial Logistic Regression

    [Plot: Probability (y-axis, 0 to 1) vs. Season Batting Average (x-axis, 0.20 to 0.35), with one curve per position: Catcher, 1st Baseman, 2nd Baseman, 3rd Baseman, Shortstop, Outfielder]
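    A minimal sketch; nnet::multinom is a common choice (assumed), with one position serving as the reference category:

      library(nnet)
      ModelMulti <- multinom(Position ~ SeasonBattingAvg, data = BaseballData)
      summary(ModelMulti)  # one row of coefficients per non-reference position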

  • Nested Data – Multilevel Modeling

    • Assumption of regression: independence of residuals/errors

    When violated, we can use multilevel modeling

    • Dependence of residuals/errors usually results from grouping/nesting in the measured DV

    Students nested in classrooms/teachers

    Observations nested within participants (i.e., repeated measures)

    Participants nested in countries

    • Conceptually, it’s like running separate regression models for each classroom and then aggregating them

    • Can have student-level and classroom-level predictors

    • Can have more than 2 levels


  • Multilevel Equations

    • Example with level-1 and level-2 predictors:

    Level-1 Model:

    Yij = β0j + β1jXij + rij

    Level-2 Model:

    β0j = γ00 + γ01Wj + u0j

    β1j = γ10 + γ11Wj + u1j

    Combined Model:

    Yij = γ00 + γ01Wj + γ10Xij + γ11WjXij + u0j + u1jXij + rij

    • Var(rij) = σ2

    • Var(u0j) = τ00

    • Var(u1j) = τ11

    • Cov(u0j, u1j) = τ01


  • Nested Data – Multilevel Modeling

    • Level-1 Model:

    Salaryij = β0j + β1jSeasonHomeRunsij + rij

    • Level-2 Model:

    β0j = 313.9 + u0j

    β1j = 19.3 + u1j

    • Var(rij) = σ2 = 150331.9

    • Var(u0j) = τ00 = 1560.0

    • Var(u1j) = τ11 = 168.7

    • Cor(u0j, u1j) = -.71

    Within teams, for each 1-unit increase in SeasonHomeRuns, the expected Salary increases by 19.3 units
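    A minimal sketch of a model with this structure; lme4::lmer and the grouping variable Team are assumptions, since the slides do not show the call:

      library(lme4)
      # Random intercept and random SeasonHomeRuns slope across teams
      ModelMLM <- lmer(Salary ~ SeasonHomeRuns + (1 + SeasonHomeRuns | Team),
                       data = BaseballData)
      summary(ModelMLM)  # fixed effects = gammas; variance components = taus and sigma^2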


  • Regularization

    • To prevent overfitting, take parsimony into account

    • Ridge (L2)

    Causes regression coefficients to shrink

    • Lasso (L1)

    Causes some regression coefficients to become 0

    • Elastic Net

    A hybrid of the other two

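    The ridge cross-validation plot later in the deck looks like glmnet output, so here is a minimal sketch with that package (an assumption); alpha selects the penalty:

      library(glmnet)
      X <- model.matrix(SeasonBattingAvg ~ ., data = BaseballData)[, -1]  # drop intercept column
      y <- BaseballData$SeasonBattingAvg
      cvRidge <- cv.glmnet(X, y, alpha = 0)    # ridge (L2): shrinks coefficients
      cvLasso <- cv.glmnet(X, y, alpha = 1)    # lasso (L1): zeroes some out
      cvEnet  <- cv.glmnet(X, y, alpha = 0.5)  # elastic net: a hybrid
      plot(cvRidge)                            # MSE vs. log(lambda)
      coef(cvLasso, s = "lambda.min")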

  • Overfit Multiple Regression Model

    [Scatterplot: BaseballData$SeasonBattingAvg (y-axis, 0.20 to 0.35) vs. MultipleRegressionModel$fitted.values (x-axis, 0.20 to 0.40)]

  • Multicollinearity


  • Ridge Regression

    [Plot: cross-validated Mean-Squared Error (y-axis, 0.4 to 1.0) vs. log(Lambda) (x-axis, -2 to 6); the top axis shows the number of nonzero coefficients, which stays at 10 for ridge]

  • Ridge Regression

    [Scatterplot: Actual Season Batting Average (y-axis) vs. Predicted Season Batting Average (x-axis), both standardized, -2 to 2]

  • Lasso Regression

  • Elastic Net Regression

  • Bootstrapping

    • Two assumptions of regression:

    1. Homoscedasticity

    Variance of residuals does not change at levels of X

    2. Residuals/errors normally distributed

    Can use histogram or P-P / Q-Q plot

    • Bootstrapping can address both. Imagine a dataset with N rows:
      1. Sample rows with replacement N times
      2. Run the model on that new dataset
      3. Record the results
      4. Repeat 10,000 times
      This provides a distribution of results, with a mean and standard deviation/error (a sketch follows)
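    A minimal sketch with the boot package; bootstrapping the one-predictor model here is an assumption:

      library(boot)
      # Statistic: refit the model on a resample of the rows, return the bs
      bootCoefs <- function(data, indices) {
        coef(lm(SeasonBattingAvg ~ SeasonSalary, data = data[indices, ]))
      }
      set.seed(123)
      bootOut <- boot(BaseballData, bootCoefs, R = 10000)
      boot.ci(bootOut, type = "perc", index = 2)  # percentile CI for the slope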

  • Bootstrapping Three-Predictor Model


  • Causal Claims

    • Correctly specified model
      Regression coefficients can be interpreted as causal effects if the model is "correctly specified"
      Other models are still valid prediction models
      Correct specification means all causes of Y that are correlated with any X in the model are also in the model
      This is rare, except for...

    • Random assignment
      Creates a correctly specified model
      If a randomly assigned condition is a predictor, nothing is correlated with it (assuming a large enough N), so the model is correctly specified
      If you add covariates, you can still interpret the effects of condition causally
      You must be able to manipulate the IV; if you cannot, try to make the model as "correct" as possible

  • Future Directions

    • Structural Equation Modeling (SEM)
      Path analysis

    Mediation

    Latent variables – model measurement error

    • Factor Analysis

    • Longitudinal data analysis
      Regressed change: predict t2 from t1 and other variables
      Difference scores: the outcome is t2 - t1
      MLM
      SEM: latent growth models, cross-lagged models
      Time series analyses

    • Machine Learning