Linear Probability Models and Big Data: Kosher or Not?
Slides from Galit Shmueli's talk at the 10th Statistical Challenges in eCommerce Research (SCECR) symposium, Tel Aviv, Israel. http://scecr.org/scecr2014/
![Page 1: Linear Probability Models and Big Data: Kosher or Not?](https://reader036.vdocuments.site/reader036/viewer/2022070301/546b52a6af7959604f8b5c4e/html5/thumbnails/1.jpg)
Linear Probability Models and Big Data: Kosher or Not?
Galit Shmueli & Suneel Chatla
![Page 2: Linear Probability Models and Big Data: Kosher or Not?](https://reader036.vdocuments.site/reader036/viewer/2022070301/546b52a6af7959604f8b5c4e/html5/thumbnails/2.jpg)
What is a Linear Probability Model (LPM)?
Linear regression on Y, where Y ∈ {0,1}:
Y = β0 + β1X1 + … + βkXk + ε
ε ~ N(0, σ²)
Used for:
• Explaining: estimating/testing β
• Predicting: class probabilities
Popular in some fields but not in Information Systems
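As a minimal sketch of the definition above, an LPM is ordinary least squares applied to a 0/1 outcome, with the fitted values read as class probabilities. The function name `fit_lpm`, the slope of 0.6, the sample size and the seed are illustrative assumptions, not values from the talk.

```python
import random

def fit_lpm(x, y):
    """OLS of a 0/1 outcome y on one covariate x; returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical data: P(Y=1 | x) = 0.5 + 0.6 * x, with x ~ U(-0.5, 0.5).
rng = random.Random(0)
x = [rng.uniform(-0.5, 0.5) for _ in range(20000)]
y = [1 if rng.random() < 0.5 + 0.6 * xi else 0 for xi in x]

b0, b1 = fit_lpm(x, y)

# Fitted values are read as class probabilities; nothing in OLS
# confines them to the interval [0, 1].
p_hat = [b0 + b1 * xi for xi in (-0.5, 0.0, 0.5)]
print(f"intercept={b0:.3f} slope={b1:.3f}")
print("predicted P(Y=1) at x=-0.5, 0, 0.5:", [round(p, 3) for p in p_hat])
```

The slope is interpreted directly as the change in P(Y=1) per unit change in x, which is the "easy coefficient interpretation" often cited in favor of the LPM.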
![Page 3: Linear Probability Models and Big Data: Kosher or Not?](https://reader036.vdocuments.site/reader036/viewer/2022070301/546b52a6af7959604f8b5c4e/html5/thumbnails/3.jpg)
Criticism in the Literature
ε ~ N(0, σ²)
Common advice: use logistic/probit model
![Page 4: Linear Probability Models and Big Data: Kosher or Not?](https://reader036.vdocuments.site/reader036/viewer/2022070301/546b52a6af7959604f8b5c4e/html5/thumbnails/4.jpg)
Why do researchers still use LPM?
Compared to logit/probit:
• Easy coefficient interpretation
• Same statistical significance
• Works under quasi/full-separation
• Cheap computation
Relevant for Inference
Relevant for Prediction
LPM is rare in IS
![Page 5: Linear Probability Models and Big Data: Kosher or Not?](https://reader036.vdocuments.site/reader036/viewer/2022070301/546b52a6af7959604f8b5c4e/html5/thumbnails/5.jpg)
Should we use LPM?
![Page 6: Linear Probability Models and Big Data: Kosher or Not?](https://reader036.vdocuments.site/reader036/viewer/2022070301/546b52a6af7959604f8b5c4e/html5/thumbnails/6.jpg)
Our Approach: Extensive Simulation
Evaluation
• Explanatory: estimate β
• Predictive: predict new records
Big Data
• Very large sample
• Many variables
Models
• Correctly specified
• Over-specified
• Under-specified
Simulated Data
• Sample sizes: 50, 500, 2M
• Signal-to-noise: high, low
• Outcome Y: binary (Yes/No), dichotomized (High/Low)
![Page 7: Linear Probability Models and Big Data: Kosher or Not?](https://reader036.vdocuments.site/reader036/viewer/2022070301/546b52a6af7959604f8b5c4e/html5/thumbnails/7.jpg)
Study Design
Covariates: X ~ U(-0.5, 0.5), ε ~ N(0, σ²)
Simulation Models:
y = 0.5 + β1x1 + ε
y = 0.5 + ε
y = 0.5 + β1x1 + β2x2 + ε
Signal-to-noise:
• High: σ = 0.01, β1 = 1, (β2 = 0.01)
• Low: σ = 0.10, β1 = 0.10, (β2 = 0.45)
Outcome Origin:
• Binary: yb ~ Bernoulli(y)
• Dichotomized: yd = I(y ≥ median(y))
Estimated Models:
y = 0.5 + β1x1 + ε
y = 0.5 + β1x1 + β2x2 + ε
Prediction:
• n = 500 holdout sample
• Logit and Probit models
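The data-generating process in this study design can be sketched roughly as follows for the single-covariate model. The function name `simulate`, the seed and the clipping of y to [0, 1] before the Bernoulli draw are assumptions added for illustration; the slides do not say how values of y outside [0, 1] were handled.

```python
import random
import statistics

def simulate(n, beta1, sigma, seed=0):
    """Draw x1, latent y, binary yb and dichotomized yd for y = 0.5 + beta1*x1 + eps."""
    rng = random.Random(seed)
    x1 = [rng.uniform(-0.5, 0.5) for _ in range(n)]
    y = [0.5 + beta1 * xi + rng.gauss(0.0, sigma) for xi in x1]
    # Binary outcome: yb ~ Bernoulli(y).
    # Assumption: y is clipped to [0, 1] so it is a valid probability.
    yb = [1 if rng.random() < min(1.0, max(0.0, yi)) else 0 for yi in y]
    # Dichotomized outcome: yd = I(y >= median(y)).
    med = statistics.median(y)
    yd = [1 if yi >= med else 0 for yi in y]
    return x1, y, yb, yd

# The two signal-to-noise settings listed on the slide.
high = simulate(500, beta1=1.0, sigma=0.01)
low = simulate(500, beta1=0.10, sigma=0.10)

print("share of yb = 1 (high signal):", sum(high[2]) / 500)
print("share of yd = 1 (high signal):", sum(high[3]) / 500)
```

By construction the dichotomized outcome is balanced, while the binary outcome is balanced only in expectation.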
![Page 8: Linear Probability Models and Big Data: Kosher or Not?](https://reader036.vdocuments.site/reader036/viewer/2022070301/546b52a6af7959604f8b5c4e/html5/thumbnails/8.jpg)
Binary Y
Simulated: yb ~ Bernoulli(0.5 + β1x1 + ε)
Fitted: Correctly specified model
Goal: Estimate slope (β1)
[Figure: panels by signal-to-noise (high, low) and sample size (n = 50, 500, 2M). Legend: true model; LPM y = 0.5 + β1x1 + ε; LPM using WLS]
Binary Y: With large sample, LPM is fine for estimation, even with low signal
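The slides do not spell out how the WLS variant was fitted. A common textbook recipe, sketched here as an assumption, is two-step: fit OLS, estimate Var(Y|x) = p(1 - p) from the fitted probabilities, then refit with weights 1/(p(1 - p)). The helper names, the clipping bound `eps` and the sample size are hypothetical.

```python
import random

def weighted_fit(x, y, w):
    """Weighted least squares of y on one covariate x; returns (intercept, slope)."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def lpm_wls(x, y, eps=0.01):
    """Two-step WLS for an LPM (textbook approach, assumed rather than taken from the talk)."""
    # Step 1: plain OLS, i.e. all weights equal.
    b0, b1 = weighted_fit(x, y, [1.0] * len(x))
    # Step 2: weights 1 / (p * (1 - p)).
    # Assumption: fitted probabilities are clipped to [eps, 1 - eps] so weights stay finite.
    p = [min(1.0 - eps, max(eps, b0 + b1 * xi)) for xi in x]
    w = [1.0 / (pi * (1.0 - pi)) for pi in p]
    return weighted_fit(x, y, w)

# Low-signal slope from the study design (beta1 = 0.10); the epsilon term is omitted here.
rng = random.Random(1)
x = [rng.uniform(-0.5, 0.5) for _ in range(20000)]
y = [1 if rng.random() < 0.5 + 0.10 * xi else 0 for xi in x]

ols_b0, ols_b1 = weighted_fit(x, y, [1.0] * len(x))
wls_b0, wls_b1 = lpm_wls(x, y)
print(f"OLS slope={ols_b1:.4f}  WLS slope={wls_b1:.4f}")
```

When the fitted probabilities stay close to 0.5, as in this low-signal setting, the weights are nearly constant and the two fits almost coincide.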
![Page 9: Linear Probability Models and Big Data: Kosher or Not?](https://reader036.vdocuments.site/reader036/viewer/2022070301/546b52a6af7959604f8b5c4e/html5/thumbnails/9.jpg)
Binary Y
Goal: Predict 500 new records
[Figure: predictions for Y=0 and Y=1, by signal-to-noise (high, low) and sample size (n = 50, 500, 2M). Models: Logit, Probit, LPM, LPM using WLS]
Binary Y: LPM predictive power same as logit/probit; depends on signal (not n)
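One way to see why LPM and logit classify similarly is a small comparison on a 500-record holdout. This is only a sketch: the training size of 5,000, the 0.5 cutoff, the seed and the hand-written Newton-Raphson logit fit are assumptions, and the ε term of the simulation model is omitted.

```python
import math
import random

def fit_ols(x, y):
    """OLS with one covariate; returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - slope * mx, slope

def fit_logit(x, y, iters=25):
    """Logistic regression with one covariate, fitted by Newton-Raphson."""
    a0 = a1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(a0 + a1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p              # gradient, intercept
            g1 += (yi - p) * xi       # gradient, slope
            h00 += w                  # entries of the negative Hessian
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        a0 += (h11 * g0 - h01 * g1) / det
        a1 += (h00 * g1 - h01 * g0) / det
    return a0, a1

def accuracy(scores, y, cutoff=0.5):
    """Share of records classified correctly at the given probability cutoff."""
    return sum((s >= cutoff) == (yi == 1) for s, yi in zip(scores, y)) / len(y)

def draw(rng, n, beta1=1.0):
    x = [rng.uniform(-0.5, 0.5) for _ in range(n)]
    y = [1 if rng.random() < 0.5 + beta1 * xi else 0 for xi in x]
    return x, y

rng = random.Random(2)
x_train, y_train = draw(rng, 5000)
x_test, y_test = draw(rng, 500)       # 500 new records, as on the slide

b0, b1 = fit_ols(x_train, y_train)
a0, a1 = fit_logit(x_train, y_train)

acc_lpm = accuracy([b0 + b1 * xi for xi in x_test], y_test)
acc_logit = accuracy([1.0 / (1.0 + math.exp(-(a0 + a1 * xi))) for xi in x_test], y_test)
print(f"LPM accuracy={acc_lpm:.3f}  logit accuracy={acc_logit:.3f}")
```

With a single covariate both scores are monotone in x, so at a fixed cutoff the two models can disagree only on records lying between their two implied thresholds on x.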
![Page 10: Linear Probability Models and Big Data: Kosher or Not?](https://reader036.vdocuments.site/reader036/viewer/2022070301/546b52a6af7959604f8b5c4e/html5/thumbnails/10.jpg)
Dichotomized Y
Simulated: y = 0.5 + β1x1 + ε, yd = I(y ≥ median(y))
Fitted: Correctly specified model
Goal: Estimate slope (β1)
[Figure: panels by signal-to-noise (high, low) and sample size (n = 50, 500, 2M). Legend: OLS (numerical Y); LPM (yd); LPM using WLS]
Dichotomized Y: LPM gives biased coefficients
• WLS makes it worse
• Can correct bias if σy can be estimated
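The slides state that the bias can be corrected when σy can be estimated, but they do not give the formula. The sketch below uses one standard normal-theory rescaling as an assumption: when y is roughly normal and the cut is at the median, the LPM slope is approximately the OLS slope times φ(0)/σy, so multiplying by σy/φ(0) undoes it. This may differ from the authors' correction, and it becomes less accurate when y is far from normal. Here σy is computed from the simulated continuous y, which would not be observed in practice.

```python
import math
import random
import statistics

def ols(x, y):
    """OLS with one covariate; returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - slope * mx, slope

# Low-signal setting from the study design: beta1 = 0.10, sigma = 0.10.
rng = random.Random(3)
n, beta1, sigma = 20000, 0.10, 0.10
x = [rng.uniform(-0.5, 0.5) for _ in range(n)]
y = [0.5 + beta1 * xi + rng.gauss(0.0, sigma) for xi in x]
med = statistics.median(y)
yd = [1 if yi >= med else 0 for yi in y]

_, slope_ols = ols(x, y)     # regression on the numerical y
_, slope_lpm = ols(x, yd)    # LPM on the dichotomized y

# Assumed correction: rescale by sigma_y / phi(0), phi being the standard normal density.
phi0 = 1.0 / math.sqrt(2.0 * math.pi)
sigma_y = statistics.pstdev(y)
slope_corrected = slope_lpm * sigma_y / phi0

print(f"OLS on y: {slope_ols:.3f}")
print(f"LPM on yd: {slope_lpm:.3f}")
print(f"LPM rescaled: {slope_corrected:.3f}")
```

The uncorrected LPM slope lands far from β1 because dichotomizing changes the scale of the outcome, not because of sampling noise, which is why a larger n does not remove the bias.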
![Page 11: Linear Probability Models and Big Data: Kosher or Not?](https://reader036.vdocuments.site/reader036/viewer/2022070301/546b52a6af7959604f8b5c4e/html5/thumbnails/11.jpg)
Dichotomized Y
Goal: Predict 500 new records
[Figure: predictions for Y=0 and Y=1, by signal-to-noise (high, low) and sample size (n = 50, 500, 2M). Models: Logit, Probit, LPM, LPM+WLS]
Dichotomized Y: LPM predictive power similar to logit/probit; depends on signal (not n)
LPM+WLS is best
![Page 12: Linear Probability Models and Big Data: Kosher or Not?](https://reader036.vdocuments.site/reader036/viewer/2022070301/546b52a6af7959604f8b5c4e/html5/thumbnails/12.jpg)
Quick Summary: Correctly specified model
Binary Y:
• With large n, LPM is fine for estimation, even with low signal
• LPM predictive power same as logit/probit; depends on signal (not n)
Dichotomized Y:
• LPM gives biased coefficients; WLS makes it worse; can correct bias with estimate of σy
• Predictive power similar to logit/probit; depends on signal (not n); WLS improves predictive power
![Page 13: Linear Probability Models and Big Data: Kosher or Not?](https://reader036.vdocuments.site/reader036/viewer/2022070301/546b52a6af7959604f8b5c4e/html5/thumbnails/13.jpg)
Over-specified models (β1 is of interest)
Simulated: y = 0.5 + ε
Estimated: y = 0.5 + β1x1 + ε
Binary Y:
• β1 coefficient insignificant at all sample sizes
• Prediction same as logit/probit; WLS doesn’t help
Dichotomized Y:
• β1 coefficient insignificant at all sample sizes
• Prediction same as logit/probit; WLS improves prediction
Simulated: y = 0.5 + β1x1 + ε
Estimated: y = 0.5 + β1x1 + β2x2 + ε
Binary Y:
• β1 (and β2) coefficients unbiased; for n = 2M, identical to OLS
• Prediction same as logit/probit; WLS doesn’t help
Dichotomized Y:
• β1 coefficient biased; worse with WLS; can correct bias
• Prediction same as logit/probit; WLS improves prediction
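The intercept-only scenario can be illustrated with a small check: simulate a binary outcome from y = 0.5 + ε, fit an LPM that includes an irrelevant x1, and look at the slope's t-statistic. The helper name `slope_t_stat`, the seed, the clipping to [0, 1] and the use of n = 20,000 in place of 2M (to keep the pure-Python run short) are assumptions.

```python
import math
import random

def slope_t_stat(x, y):
    """OLS slope of y on x and its classical t-statistic."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    intercept = my - slope * mx
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    return slope, slope / se

rng = random.Random(4)
results = {}
for n in (50, 500, 20000):
    x1 = [rng.uniform(-0.5, 0.5) for _ in range(n)]
    # True model has no x1 term: y = 0.5 + eps, with the low-signal sigma = 0.10.
    # Assumption: y is clipped to [0, 1] before the Bernoulli draw.
    p = [min(1.0, max(0.0, 0.5 + rng.gauss(0.0, 0.10))) for _ in range(n)]
    yb = [1 if rng.random() < pi else 0 for pi in p]
    results[n] = slope_t_stat(x1, yb)

for n, (slope, t) in results.items():
    print(f"n={n}: slope={slope:.4f}  t={t:.2f}")
```

Because x1 plays no role in generating the outcome, the estimated slope should hover around zero at every sample size, in line with the "insignificant at all sample sizes" finding.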
![Page 14: Linear Probability Models and Big Data: Kosher or Not?](https://reader036.vdocuments.site/reader036/viewer/2022070301/546b52a6af7959604f8b5c4e/html5/thumbnails/14.jpg)
Modeling Auction Price
300,000 eBay auctions (Aug 2007- Jan 2008)
Price = f(min_bid, duration, seller_feedback, reserve)
1. Estimation/inference: determinants of price
2. Prediction: holdout sample (n = 5,000)
Dichotomized Price
![Page 15: Linear Probability Models and Big Data: Kosher or Not?](https://reader036.vdocuments.site/reader036/viewer/2022070301/546b52a6af7959604f8b5c4e/html5/thumbnails/15.jpg)
Inference/Estimation
Sample so large: all coefficients significant!
Bias due to dichotomization: corrected
![Page 16: Linear Probability Models and Big Data: Kosher or Not?](https://reader036.vdocuments.site/reader036/viewer/2022070301/546b52a6af7959604f8b5c4e/html5/thumbnails/16.jpg)
Prediction
Removal of outliers gives identical ROC curves
![Page 17: Linear Probability Models and Big Data: Kosher or Not?](https://reader036.vdocuments.site/reader036/viewer/2022070301/546b52a6af7959604f8b5c4e/html5/thumbnails/17.jpg)
Study Conclusions
• Explanatory modeling with a binary outcome – large sample needed to reduce bias.
• Explanatory modeling with dichotomous outcome requires σy to correct bias.
• Predicting a binary outcome (without WLS) or dichotomous outcome (with WLS) – sample size irrelevant
• Robust to over- or under-specified models
LPM is rare in IS
![Page 18: Linear Probability Models and Big Data: Kosher or Not?](https://reader036.vdocuments.site/reader036/viewer/2022070301/546b52a6af7959604f8b5c4e/html5/thumbnails/18.jpg)