May 2, 2023
© 2006 ChoicePoint Asset Company. All Rights Reserved.
Count Data Models in SAS
WenSui Liu, Statistical Project Manager
Jimmy Cela, AVP
Mar 2008, ChoicePoint Precision Marketing
Introduction
A comprehensive survey of models for count data in SAS
Why? Gaining popularity since 1980
=> Insurance: # of auto/medical insurance claims
=> Banking: # of delinquencies / missed payments
=> Marketing: # of responses / purchases
5 models to be covered:
- Poisson regression
- negative binomial (NB) regression
- hurdle Poisson regression
- zero-inflated Poisson (ZIP) regression
- finite mixture (latent class, LC) Poisson regression
SAS Capability
Procedure   Poisson   NB   Hurdle   ZIP   LC Poisson
GENMOD      ✔         ✔    -        -     -
GLIMMIX     ✔         ✔    -        -     -
NLIN        ✔         ✔    ✔        ✔     ✔
NLMIXED     ✔         ✔    ✔        ✔     ✔
COUNTREG    ✔         ✔    -        ✔     -
MODEL       ✔         ✔    ✔        ✔     ✔
Count Data
Nature of count data:
- nonnegative, discrete, skewed distribution
- in empirical data, a high proportion of zero outcomes
- potential problems: over-dispersion (variance >> mean), excess zeroes
Why won't OLS work?
- counts are heteroskedastic (variance depends on the mean)
- predictions have to be nonnegative (a log transformation won't work because log(0) is undefined)
A case study: modeling the # of hospital stays
Data Summary
Classical data for count models:
- 4406 elderly respondents sampled from National Medical Expenditure Survey (NMES) in 1987
- Information included: 7 health, demo, and socio-econ variables
Starting Point
[Figure: observed probability vs. univariate Poisson probability for counts 0 to 8; vertical axis 0% to 100%]
Observations:
1) 80% zeroes ==> excess zeroes
2) Variance = 2 * Mean ==> possible over-dispersion
3) Poor fit with univariate Poisson
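The three observations above can be checked with a few descriptive statistics. The following is a minimal sketch, assuming the input data set is named data and the count outcome is y (as in the code on the later slides); the data set names summary, stats, and poisson_fit and the variables zero, y_mean, y_var, and zero_pct are hypothetical names introduced here for illustration.

```sas
/* Flag zero outcomes so that their share is the mean of the flag. */
data summary;
  set data;
  zero = (y = 0);
run;

/* Mean and variance of the count, plus the proportion of zeroes. */
proc means data = summary noprint;
  var y zero;
  output out = stats mean = y_mean zero_pct var(y) = y_var;
run;

/* Univariate Poisson probabilities implied by the sample mean. */
data poisson_fit;
  set stats;
  do k = 0 to 8;
    poisson_prob = pdf('poisson', k, y_mean);
    output;
  end;
  keep k poisson_prob y_mean y_var zero_pct;
run;

proc print data = poisson_fit noobs;
run;
```

Comparing poisson_prob at k = 0 with zero_pct shows how far the univariate Poisson falls short at zero, and y_var / y_mean gives the variance-to-mean ratio.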
Baseline Model
Probability Function of Poisson Regression
$$f(Y_i \mid X_i) = \frac{\exp(-u_i)\, u_i^{Y_i}}{Y_i!}, \qquad u_i = \exp(X_i \beta)$$
proc nlmixed data = data;
  parms b0 = 0 b1 = 0 b2 = 0 ... ...;
  mu = exp(b0 + b1 * x1 + b2 * x2...);
  p = exp(-mu) * mu ** y / fact(y);   /* identical to the prob. function */
  ll = log(p);
  model y ~ general(ll);
run;
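For the baseline model, the same Poisson regression can also be fit without writing the likelihood by hand. This is a hedged sketch using PROC GENMOD, assuming just two covariates x1 and x2 for illustration (the full covariate list is elided on the slide); the output data set poi_out and the variable yhat follow the names used in the over-dispersion test.

```sas
/* Poisson regression with a log link; x1 and x2 stand in for the covariates. */
proc genmod data = data;
  model y = x1 x2 / dist = poisson link = log;
  /* Save the fitted mean as yhat for later diagnostics. */
  output out = poi_out pred = yhat;
run;
```

Writing the likelihood in NLMIXED is more verbose, but it carries over directly to the hurdle, ZIP, and latent class models, which GENMOD does not cover in the capability table.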
Result of Poisson Model
[Figure: observed probability vs. predicted probability of Poisson regression for counts 0 to 8; vertical axis 0% to 100%]
Observations:
1) Improvement by including observed heterogeneity
2) Significant under-fit at zeroes
What's wrong? ==> Over-Dispersion
Test for Over-Dispersion
Auxiliary OLS regression (Cameron, 1996):
$$\frac{(y_i - u_i)^2 - y_i}{u_i} = \alpha\, u_i + e_i$$
data ols_tmp;
  set poi_out;
  dep = ((y - yhat) ** 2 - y) / yhat;
run;
proc reg data = ols_tmp;
  model dep = yhat / noint;
run;
A significant coefficient on yhat indicates over-dispersion.
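Why a significant yhat signals over-dispersion can be sketched in two lines. The notation is an assumption made here for illustration: u_i is the Poisson mean (yhat in the code) and alpha is the dispersion parameter of a negative binomial alternative.

```latex
% Null (Poisson):              Var(y_i | X_i) = u_i
% Alternative (NB-type):       Var(y_i | X_i) = u_i + \alpha u_i^2
\begin{aligned}
E\left[(y_i - u_i)^2 - y_i \mid X_i\right]
  &= \left(u_i + \alpha u_i^2\right) - u_i = \alpha u_i^2 \\
E\left[\frac{(y_i - u_i)^2 - y_i}{u_i} \,\middle|\, X_i\right]
  &= \alpha u_i
\end{aligned}
```

The no-intercept regression of dep on yhat therefore estimates alpha, and its t-test is a test of alpha = 0.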
Alternative I
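Most common alternative: Negative Binomial Regression
NB can be considered a generalized Poisson that adds a dispersion parameter through an unobserved error term e_i in the mean:
$$\tilde{u}_i = \exp(X_i \beta + e_i) = \exp(X_i \beta)\,\exp(e_i) = u_i \exp(e_i)$$
$$\text{where } \exp(e_i) \sim \mathrm{Gamma}\left(\tfrac{1}{\alpha}, \tfrac{1}{\alpha}\right) \text{ s.t. } E[\exp(e_i)] = 1 \text{ and } V[\exp(e_i)] = \alpha$$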
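The role of the dispersion parameter can be made explicit with a short derivation. This is a sketch in notation assumed here: v_i = exp(e_i) is the gamma-distributed multiplier with mean 1 and variance alpha, u_i = exp(X_i beta), and Y_i given v_i is Poisson with mean u_i v_i.

```latex
\begin{aligned}
E[Y_i \mid X_i] &= E\left[\,E[Y_i \mid v_i]\,\right] = E[u_i v_i] = u_i \\
V[Y_i \mid X_i] &= E\left[\,V[Y_i \mid v_i]\,\right] + V\left[\,E[Y_i \mid v_i]\,\right] \\
                &= E[u_i v_i] + V[u_i v_i] = u_i + \alpha\, u_i^2
\end{aligned}
```

The mean is unchanged relative to Poisson, while the variance exceeds the mean whenever alpha > 0, which is how NB absorbs unobserved heterogeneity.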
Alternative I
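Probability Function of Negative Binomial Regression
$$f(Y_i \mid X_i) = \frac{\Gamma\left(Y_i + \frac{1}{\alpha}\right)}{\Gamma(Y_i + 1)\,\Gamma\left(\frac{1}{\alpha}\right)} \left(\frac{\frac{1}{\alpha}}{\frac{1}{\alpha} + u_i}\right)^{\frac{1}{\alpha}} \left(\frac{u_i}{\frac{1}{\alpha} + u_i}\right)^{Y_i}$$
proc nlmixed data = data;
  parms b0 = 0 b1 = 0 b2 = 0 ... ...;
  mu = exp(b0 + b1 * x1 + b2 * x2 ... ...);
  p = gamma(y + 1/alpha) / (gamma(y + 1) * gamma(1/alpha)) * ((1/alpha) / (1/alpha + mu)) ** (1/alpha) * (mu / (1/alpha + mu)) ** y;
  ll = log(p);
  model y ~ general(ll);
run;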
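The gamma() calls in the NB probability function can overflow for large arguments, so the same likelihood is often written on the log scale. The following is a hedged sketch of that variant, not the authors' code: it assumes two covariates x1 and x2 for illustration, declares alpha explicitly with a starting value of 1, and introduces the helper r = 1 / alpha.

```sas
proc nlmixed data = data;
  /* Two covariates only, for illustration; alpha is the dispersion parameter. */
  parms b0 = 0 b1 = 0 b2 = 0 alpha = 1;
  bounds alpha > 0;
  mu = exp(b0 + b1 * x1 + b2 * x2);
  r = 1 / alpha;
  /* Same NB probability function, written directly as a log-likelihood. */
  ll = lgamma(y + r) - lgamma(y + 1) - lgamma(r)
     + r * log(r / (r + mu)) + y * log(mu / (r + mu));
  model y ~ general(ll);
run;
```

The BOUNDS statement keeps the optimizer away from non-positive values of alpha, where 1 / alpha and lgamma(r) are not defined.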
Result of NB Model
[Figure: observed probability vs. predicted probability of NB regression for counts 0 to 8; vertical axis 0% to 100%]
Observations:
1) Significant improvement by including unobserved heterogeneity
Comparison with Poisson model: Likelihood Ratio = 2 * (LL_nb - LL_poi) = 2 * (-2857 - (-3048)) = 382 from the rounded log-likelihoods (378 as originally reported)
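The likelihood ratio comparison can be turned into a p-value in a short DATA step. This is a minimal sketch using the rounded log-likelihoods quoted above; the data set name lr_test and the variable names are hypothetical.

```sas
data lr_test;
  ll_poi = -3048;   /* Poisson log-likelihood reported above (rounded) */
  ll_nb  = -2857;   /* NB log-likelihood reported above (rounded) */
  lr = 2 * (ll_nb - ll_poi);
  /* NB adds one parameter (the dispersion), so 1 degree of freedom. */
  p_value = 1 - probchi(lr, 1);
  put lr= p_value=;
run;
```

Because alpha = 0 lies on the boundary of the parameter space, the chi-square p-value with 1 degree of freedom is conservative; with a statistic this large the conclusion does not depend on that refinement.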
Alternative II
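Hurdle Regression (Mullahy, 1986)
Two parts:
- zero outcomes: Logistic regression
- positive outcomes: Truncated Poisson regression
Probability Function of Hurdle Regression, where \theta_i is the logistic probability of a zero outcome:
$$f(Y_i \mid X_i) = \begin{cases} \theta_i & \text{for } Y_i = 0 \\ \dfrac{(1 - \theta_i)\,\exp(-u_i)\, u_i^{Y_i}}{\left(1 - \exp(-u_i)\right) Y_i!} & \text{for } Y_i > 0 \end{cases}$$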
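The factor 1 - exp(-u_i) in the positive part is what makes the hurdle probabilities sum to one. A short check, writing theta_i for the probability of a zero outcome and u_i for the Poisson mean (symbols assumed here for illustration):

```latex
\begin{aligned}
\sum_{y=1}^{\infty} \frac{\exp(-u_i)\, u_i^{y}}{y!}
  &= 1 - \exp(-u_i) \\
\sum_{y=0}^{\infty} f(y \mid X_i)
  &= \theta_i + \frac{1 - \theta_i}{1 - \exp(-u_i)} \left(1 - \exp(-u_i)\right) = 1
\end{aligned}
```

All zeroes come from the logistic part, and the truncated Poisson spreads the remaining mass 1 - theta_i over the positive counts only.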
Alternative II
proc nlmixed data = data;
  parms b0 = 0 b1 = 0 ... a0 = 0 a1 = 0 ...;
  xb = b0 + b1 * x1 + b2 * x2 ... ...;
  mu = exp(xb);
  xa = a0 + a1 * x1 + a2 * x2 ... ...;
  if y = 0 then p = exp(xa) / (1 + exp(xa));   /* prob. function for zeroes */
  else p = (1 - exp(xa) / (1 + exp(xa))) / (1 - exp(-mu)) * (exp(-mu) * mu ** y / fact(y));   /* prob. function for positive outcomes */
  ll = log(p);
  model y ~ general(ll);
run;
Result of Hurdle Model
[Figure: observed probability vs. predicted probability of Hurdle regression for counts 0 to 8; vertical axis 0% to 100%]
Observations:
1) Significant improvement by modeling zeroes separately
How to compare with Poisson model? AIC, BIC, & Vuong statistic
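The Vuong statistic mentioned above can be computed from per-observation predicted probabilities. The sketch below is an uncorrected version under stated assumptions: a hypothetical data set probs holds, for each case, the probability of the observed y under the Poisson model (p_poi) and under the hurdle model (p_hdl); all other names are also introduced here for illustration.

```sas
/* m is the per-observation log ratio of the two predicted probabilities. */
data vuong_tmp;
  set probs;
  m = log(p_hdl / p_poi);
run;

proc means data = vuong_tmp noprint;
  var m;
  output out = vuong_stat n = n mean = m_mean std = m_std;
run;

data vuong;
  set vuong_stat;
  /* Vuong statistic: sqrt(n) * mean(m) / std(m). */
  v = sqrt(n) * m_mean / m_std;
  /* Two-sided p-value from the standard normal distribution. */
  p_value = 2 * (1 - probnorm(abs(v)));
run;

proc print data = vuong noobs;
  var n m_mean m_std v p_value;
run;
```

A large positive v favors the hurdle model, a large negative v favors Poisson, and values near zero mean the test cannot separate the two.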
Alternative III
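Zero-inflated Poisson Regression (Lambert, 1992)
Two sources of zeroes:
- a point mass of zeroes
- zeroes from a standard Poisson distribution
Probability Function of ZIP Regression, where \theta_i is the logistic probability of the point mass at zero:
$$f(Y_i \mid X_i) = \begin{cases} \theta_i + (1 - \theta_i)\,\exp(-u_i) & \text{for } Y_i = 0 \\ \dfrac{(1 - \theta_i)\,\exp(-u_i)\, u_i^{Y_i}}{Y_i!} & \text{for } Y_i > 0 \end{cases}$$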
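The mean and variance implied by the ZIP probability function show how the point mass at zero produces over-dispersion. A sketch, writing theta_i for the probability of the point mass and u_i for the Poisson mean (symbols assumed here for illustration):

```latex
\begin{aligned}
E[Y_i \mid X_i] &= (1 - \theta_i)\, u_i \\
V[Y_i \mid X_i] &= (1 - \theta_i)\, u_i \left(1 + \theta_i u_i\right) \\
\frac{V[Y_i \mid X_i]}{E[Y_i \mid X_i]} &= 1 + \theta_i u_i \;>\; 1 \quad \text{whenever } \theta_i > 0
\end{aligned}
```

With theta_i = 0 the model collapses to standard Poisson, where the variance equals the mean.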
Alternative III
proc nlmixed data = data;
  parms b0 = 0 b1 = 0 ... a0 = 0 a1 = 0 ...;
  xb = b0 + b1 * x1 + b2 * x2 ... ...;
  mu = exp(xb);
  xa = a0 + a1 * x1 + a2 * x2 ... ...;
  if y = 0 then p = exp(xa) / (1 + exp(xa)) + (1 - exp(xa) / (1 + exp(xa))) * exp(-mu);   /* prob. function for zeroes */
  else p = (1 - exp(xa) / (1 + exp(xa))) * (exp(-mu) * mu ** y / fact(y));   /* prob. function for positive outcomes */
  ll = log(p);
  model y ~ general(ll);
run;
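Since the capability table lists COUNTREG as supporting ZIP, the same model can be sketched without a hand-written likelihood. This is a hedged example, assuming SAS/ETS is licensed, that the installed release supports the ZEROMODEL statement, and that x1 and x2 stand in for the elided covariate list.

```sas
proc countreg data = data;
  /* Poisson part of the model. */
  model y = x1 x2 / dist = zip;
  /* Logistic model for the point mass of zeroes. */
  zeromodel y ~ x1 x2 / link = logistic;
run;
```

The two covariate lists do not have to match; the NLMIXED version above gives the same freedom through separate a and b coefficients.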
Result of ZIP Model
[Figure: observed probability vs. predicted probability of ZIP regression for counts 0 to 8; vertical axis 0% to 100%]
Observations:
1) Significant improvement by assuming 2 sources of zeroes
How to compare with other models? AIC, BIC, & Vuong statistic
Alternative IV
Latent Class Poisson Regression (Wedel, 1993):
- Existence of S >= 2 classes of latent segments in the data
- Each latent segment is Poisson with a different set of parameters
- Each case is drawn from one of the latent segments with a certain probability
- Interesting in marketing: segment and model at the same time
Probability Function of LC Poisson Regression
$$f(Y_i \mid X_i) = \sum_{s=1}^{S} p_s\, \frac{\exp(-u_{i|s})\, u_{i|s}^{Y_i}}{Y_i!}$$
Alternative IV
proc nlmixed data = data;
  parms a0 = 0 ... b0 = 1 ... c0 = 2 ...
        prior1 = 0 to 1 by 0.1 prior2 = 0 to 1 by 0.1;
  xa = a0 + a1 * x1 + a2 * x2 ... ...; ma = exp(xa);
  pa = exp(-ma) * ma ** y / fact(y);   /* Poisson probability in segment 1 */
  xb = b0 + b1 * x1 + b2 * x2 ... ...; mb = exp(xb);
  pb = exp(-mb) * mb ** y / fact(y);   /* Poisson probability in segment 2 */
  xc = c0 + c1 * x1 + c2 * x2 ... ...; mc = exp(xc);
  pc = exp(-mc) * mc ** y / fact(y);   /* Poisson probability in segment 3 */
  p = prior1 * pa + prior2 * pb + (1 - prior1 - prior2) * pc;   /* mixture over the 3 segments */
  ll = log(p);
  ... ...
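Because the latent class model segments and predicts at the same time, each case can be assigned to a segment once the parameters are estimated. A sketch via Bayes' rule, in notation assumed here: p_s is the prior probability of segment s (prior1, prior2, and 1 - prior1 - prior2 in the code) and f_s is the Poisson probability within segment s (pa, pb, pc in the code).

```latex
P(s \mid Y_i, X_i)
  = \frac{p_s\, f_s(Y_i \mid X_i)}{\sum_{t=1}^{S} p_t\, f_t(Y_i \mid X_i)},
\qquad
f_s(Y_i \mid X_i) = \frac{\exp(-u_{i|s})\, u_{i|s}^{Y_i}}{Y_i!}
```

Assigning each case to the segment with the largest posterior probability gives the segmentation; the denominator is the mixture probability p already computed for the likelihood.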
Result of LC Poisson
[Figure: observed probability vs. predicted probability of LC Poisson regression for counts 0 to 8; vertical axis 0% to 100%]
Observations:
1) Significant improvement by assuming 3 latent classes with different sets of parameters
How to compare with other models? AIC, BIC, & Vuong statistic
Models Prediction
1) Poisson cannot give an adequate fit for the data.
2) Hurdle and ZIP are better suited to modeling excess zeroes.
3) NB and LC are better suited to handling heterogeneity.
Models Comparison
1) AIC & BIC are convenient and easy to compute for model comparison, and good enough for practitioners. BIC tends to select a more parsimonious model.
2) The Vuong test is good but computationally tedious (code available in the paper); recommended for researchers.
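AIC and BIC need only the log-likelihood, the number of estimated parameters, and the sample size. The sketch below uses the sample size and the rounded Poisson and NB log-likelihoods quoted in this deck; the parameter counts k are assumptions made purely for illustration and should be replaced with the counts from the fitted models.

```sas
data ic;
  n = 4406;                      /* sample size from the data summary */
  length model $ 8;
  /* k values below are assumed parameter counts, for illustration only. */
  model = 'Poisson'; ll = -3048; k = 8; output;
  model = 'NB';      ll = -2857; k = 9; output;
run;

data ic;
  set ic;
  aic = -2 * ll + 2 * k;         /* penalty of 2 per parameter */
  bic = -2 * ll + k * log(n);    /* penalty of log(n) per parameter */
run;

proc print data = ic noobs;
run;
```

Smaller values are better for both criteria. Since log(4406) is well above 2, BIC penalizes each extra parameter more heavily than AIC, which is why it tends to select the more parsimonious model.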
Conclusion
- In practice, the Poisson model is usually not sufficient for over-dispersed data but is useful as a baseline model. (Rule of thumb for over-dispersion: Variance ≥ 2 * Mean)
- It is important to identify the reason for over-dispersion: long tail, excess zeroes, or … … ? (Excess zeroes might be the most common reason.)
- Statistics shouldn't be the only consideration for model selection. Examples:
  1) Both Hurdle and ZIP suggest a positive effect of private insurance on hospital stays, which makes perfect sense.
  2) LC provides a possibility to segment the population, which is invaluable in marketing, insurance, and credit risk.