Count Models 1
Sociology 8811 Lecture 12
Copyright © 2007 by Evan Schofer
Do not copy or distribute without permission
Count Variables
• Many dependent variables are counts: Non-negative integers
• # Crimes a person has committed in lifetime
• # Children living in a household
• # new companies founded in a year (in an industry)
• # of social protests per month in a city
– Can you think of others?
Count Variables
• Count variables can be modeled with OLS regression… but:
– 1. Linear models can yield negative predicted values… whereas counts are never negative
• Similar to the problem of the Linear Probability Model
– 2. Count variables are often highly skewed
• Ex: # crimes committed this year… most people are zero or very low; a few people are very high
• Extreme skew violates the normality assumption of OLS regression.
Count Models
• Two most common count models:
• Poisson Regression Model
• Negative Binomial Regression Model
• Both based on the Poisson distribution:
$$P(y \mid \mu) = \frac{e^{-\mu}\mu^{y}}{y!}$$
• $\mu$ = expected count (and variance)
– Called lambda ($\lambda$) in some texts; I rely on Long & Freese 2006
• y = observed count
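The Poisson formula above can be checked numerically. The sketch below is only an illustration: the value mu = 2.5 is a hypothetical expected count, not taken from the lecture data, and the function name is made up for this example. It shows that the probabilities sum to 1 and that the mean and variance both equal mu.

```python
import math

def poisson_pmf(y: int, mu: float) -> float:
    """Poisson probability of observing count y when the expected count is mu."""
    return math.exp(-mu) * mu ** y / math.factorial(y)

mu = 2.5  # hypothetical expected count (assumption for illustration)

# Counts above 59 have negligible probability when mu = 2.5
probs = [poisson_pmf(y, mu) for y in range(60)]

total = sum(probs)
mean = sum(y * p for y, p in enumerate(probs))
variance = sum((y - mean) ** 2 * p for y, p in enumerate(probs))

print(f"sum of probabilities = {total:.6f}")
print(f"mean = {mean:.4f}, variance = {variance:.4f}")
```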
Poisson Regression
• Strategy: Model log of $\mu$ as a function of Xs
• Quite similar to modeling log odds in logit
• Again, the log form avoids negative values
$$\ln \mu_i = \sum_{j=1}^{K} \beta_j X_{ij}$$
• Which can be written as:
$$\mu_i = e^{\sum_{j=1}^{K} \beta_j X_{ij}}$$
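A small sketch of why the log form avoids negative values. The intercept and slope here are hypothetical numbers chosen so that the linear predictor goes below zero; they are not estimates from any model in this lecture.

```python
import math

# Hypothetical coefficients (assumptions for illustration only)
beta0, beta1 = 0.5, -0.8
xs = [0, 1, 2, 3, 4, 5]

# Linear predictor: this is what a linear model would "predict" directly
linear = [beta0 + beta1 * x for x in xs]

# Poisson regression exponentiates the linear predictor to get the expected count
mu = [math.exp(eta) for eta in linear]

for x, eta, m in zip(xs, linear, mu):
    print(f"x = {x}: linear predictor = {eta:5.2f}, expected count = {m:.3f}")
```

The linear predictor turns negative once x reaches 1, but the expected count stays positive and simply shrinks toward zero.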
Poisson Regression: Example
• Hours per week spent on web
[Figure: histogram of www hours per week (x-axis: 0 to 50 hours; y-axis: density, 0 to .2)]
Poisson Regression: Web Use
• Output = similar to logistic regression
. poisson wwwhr male age educ lowincome babies
Poisson regression                                Number of obs   =       1552
                                                  LR chi2(5)      =     525.66
                                                  Prob > chi2     =     0.0000
Log likelihood = -8598.488                        Pseudo R2       =     0.0297
------------------------------------------------------------------------------
       wwwhr |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   .3595968   .0210578    17.08   0.000     .3183242    .4008694
         age |  -.0097401   .0007891   -12.34   0.000    -.0112867   -.0081934
        educ |   .0205217    .004046     5.07   0.000     .0125917    .0284516
   lowincome |  -.1168778   .0236503    -4.94   0.000    -.1632316   -.0705241
      babies |  -.1436266   .0224814    -6.39   0.000    -.1876892   -.0995639
       _cons |   1.806489   .0641575    28.16   0.000     1.680743    1.932236
------------------------------------------------------------------------------
Men spend more time on the web than women
Number of young children in household reduces web use
Poisson Regression: Stata Output
• Stata output yields familiar statistics:
– Standard errors, z/t-values, and p-values for coefficient hypothesis tests
– Pseudo R-square for model fit
• Not a great measure… but gives a crude explained variance
– MLE log likelihood
– Likelihood ratio test: Chi-square and p-value
• Comparing to null model (constant only)
• Tests can also be conducted on nested models with Stata command “lrtest”.
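The fit statistics in the web-use output are tied together by simple arithmetic. The sketch below assumes the usual definitions: the LR chi-square is twice the difference between the model and null log likelihoods, and the pseudo R-square is McFadden's, 1 - LL(model) / LL(null). The null log likelihood is not printed in the output, so it is backed out here from the two numbers that are.

```python
# Values copied from the Poisson output for web use
ll_model = -8598.488   # Log likelihood
lr_chi2 = 525.66       # LR chi2(5)

# LR chi2 = 2 * (ll_model - ll_null), so solve for the null log likelihood
ll_null = ll_model - lr_chi2 / 2

# McFadden's pseudo R-square (assumed to be the formula behind "Pseudo R2")
pseudo_r2 = 1 - ll_model / ll_null

print(f"null log likelihood = {ll_null:.3f}")
print(f"pseudo R2 = {pseudo_r2:.4f}")
```

The result agrees with the reported Pseudo R2 of 0.0297 to rounding, which is a useful reminder of how little of the null deviance this model removes.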
Interpreting Coefficients
• In Poisson Regression, Y is typically conceptualized as a rate…
• Positive coefficients indicate higher rate; negative = lower rate
• Like logit, Poisson models are non-linear
• Coefficients don’t have a simple linear interpretation
• Like logit, model has a log form; exponentiation aids interpretation
• Exponentiated coefficients are multiplicative
• Analogous to odds ratios… but called “incidence rate ratios”.
Interpreting Coefficients
• Exponentiated coefficients: indicate effect of unit change of X on rate
• In Stata: “incidence rate ratios”: “poisson … , irr”
• $e^b$ = 2.0 indicates that the rate doubles for each unit change in X
• $e^b$ = .5 indicates that the rate drops by half for each unit change in X
• Recall: Exponentiated coefs are multiplicative
• If $e^b$ = 5.0, a 2-point change in X doesn’t multiply the rate by 10; it multiplies it by 5 * 5 = 25
– Also: you must invert to see opposite effects
• If $e^b$ = 5.0, a 1-point decrease in X doesn’t multiply the rate by -5; it multiplies it by 1/5 = .2
Interpreting Coefficients
• Again, exponentiated coefficients (rate ratios) can be converted to % change
• Formula: ($e^b$ - 1) * 100%
• Ex: if $e^b$ = .5, then (.5 - 1) * 100% = -50%, a 50% decrease in rate.
Interpreting Coefficients
• Exponentiated coefficients yield multiplier:
. poisson wwwhr male age educ lowincome babies
Poisson regression                                Number of obs   =       1552
                                                  LR chi2(5)      =     525.66
                                                  Prob > chi2     =     0.0000
Log likelihood = -8598.488                        Pseudo R2       =     0.0297
------------------------------------------------------------------------------
       wwwhr |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   .3595968   .0210578    17.08   0.000     .3183242    .4008694
         age |  -.0097401   .0007891   -12.34   0.000    -.0112867   -.0081934
        educ |   .0205217    .004046     5.07   0.000     .0125917    .0284516
   lowincome |  -.1168778   .0236503    -4.94   0.000    -.1632316   -.0705241
      babies |  -.1436266   .0224814    -6.39   0.000    -.1876892   -.0995639
       _cons |   1.806489   .0641575    28.16   0.000     1.680743    1.932236
------------------------------------------------------------------------------
Exponentiation of .359 = 1.43; Rate is 1.43 times higher for men
(1.43-1) * 100 = 43% more
Exp(-.14) = .87. Each baby reduces rate by factor of .87
(.87-1) * 100 = 13% less
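The hand calculations above can be reproduced for every coefficient at once. This is a sketch using the coefficients from the Poisson output; the dictionary names are just labels chosen for the example. It also illustrates the multiplicative rule: the effect of a 10-year age difference is the one-year rate ratio raised to the 10th power, not 10 times the one-year change.

```python
import math

# Coefficients copied from the Poisson output for web use
coefs = {
    "male": 0.3595968,
    "age": -0.0097401,
    "educ": 0.0205217,
    "lowincome": -0.1168778,
    "babies": -0.1436266,
}

# Incidence rate ratio: exponentiated coefficient
irr = {name: math.exp(b) for name, b in coefs.items()}

# Percent change in the rate per unit change in X: (e^b - 1) * 100
pct_change = {name: (ratio - 1) * 100 for name, ratio in irr.items()}

for name in coefs:
    print(f"{name:10s} IRR = {irr[name]:.3f}   % change = {pct_change[name]:+.1f}%")

# Multiplicative effects: a 10-year age difference multiplies the rate by IRR**10
ten_year_age_ratio = irr["age"] ** 10
print(f"rate ratio for +10 years of age = {ten_year_age_ratio:.3f}")
```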
Predicted Counts
• Stata “predict varname, n” computes predicted value for each case
. predict predwww if e(sample), n
. list wwwhr predwww if e(sample)
     +------------------+
     | wwwhr    predwww |
     |------------------|
  1. |     1   5.659943 |
  2. |     3   7.090338 |
  3. |     2   5.281404 |
 12. |     5    6.09473 |
 13. |     4   6.968055 |
 15. |     3   5.815624 |
 16. |     0   5.539187 |
 19. |     0   7.207257 |
 20. |     8    8.03906 |
 21. |     5   4.400002 |
 23. |     1    6.77004 |
 24. |     1   4.806245 |
 25. |     8   5.710855 |
 27. |    12   3.687142 |
 33. |    40   4.997193 |
     +------------------+
Some of the predictions are close to the observed values…
Many of the predictions are quite bad…
Recall that the model fit was VERY poor!
Predicted Probabilities
• Stata extension “prcount” can compute probabilities for each possible count outcome
• For all cases, or for particular groups
• It plugs values (m), Xs, & bs into formula:
$$P(y = m \mid X) = \frac{e^{-\hat{\mu}}\hat{\mu}^{m}}{m!}, \quad \hat{\mu} = e^{\sum_{j=1}^{K} \hat{\beta}_j X_j}$$
Rate:        5.7446   [ 5.6238,  5.8655]
Pr(y=0|x):   0.0032   [ 0.0028,  0.0036]
Pr(y=1|x):   0.0184   [ 0.0165,  0.0202]
Pr(y=2|x):   0.0528   [ 0.0486,  0.0570]
Pr(y=3|x):   0.1011   [ 0.0953,  0.1069]
Pr(y=4|x):   0.1452   [ 0.1399,  0.1505]
Pr(y=5|x):   0.1668   [ 0.1642,  0.1694]
Pr(y=6|x):   0.1597   [ 0.1589,  0.1606]
Pr(y=7|x):   0.1311   [ 0.1276,  0.1345]
Pr(y=8|x):   0.0941   [ 0.0897,  0.0986]
Pr(y=9|x):   0.0601   [ 0.0560,  0.0642]
         male        age       educ  lowincome     babies
x=   .4503866  40.992912  14.345361   .7371134  .20296392
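The rate and probabilities in this output can be reproduced by hand from the Poisson coefficients and the X values listed in the last row. The sketch below does only the point estimates; the confidence intervals in brackets are not reproduced.

```python
import math

# Coefficients from the Poisson output for web use
coefs = {"male": 0.3595968, "age": -0.0097401, "educ": 0.0205217,
         "lowincome": -0.1168778, "babies": -0.1436266}
constant = 1.806489

# X values at which the probabilities were computed (the "x=" row of the output)
x_values = {"male": 0.4503866, "age": 40.992912, "educ": 14.345361,
            "lowincome": 0.7371134, "babies": 0.20296392}

# Linear predictor, then exponentiate to get the predicted rate
xb = constant + sum(coefs[name] * x_values[name] for name in coefs)
rate = math.exp(xb)

# Poisson probability of each count m = 0..9 at that rate
probs = [math.exp(-rate) * rate ** m / math.factorial(m) for m in range(10)]

print(f"Rate: {rate:.4f}")
for m, p in enumerate(probs):
    print(f"Pr(y={m}|x): {p:.4f}")
```

Note that the model assigns a probability of well under 1% to zero hours, which is one way to see how poorly it handles the many low values in the observed data.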
Issue: Exposure
• Poisson outcome variables are typically conceptualized as rates
• Web hours per week
• Number of crimes committed in past year
• Issue: Cases may vary in exposure to “risk” of a given outcome
• To properly model rates, we must account for the fact that some cases have greater exposure than others
• Ex: # crimes committed in lifetime
– Older people have greater opportunity to have higher counts
• Alternately, exposure may vary due to research design
– Ex: Some cases followed for longer time than others…
Issue: Exposure
• Poisson (and other count models) can address varying exposure:
$$\mu_i t_i = e^{\sum_{j=1}^{K} \beta_j X_{ij} + \ln(t_i)}$$
• Where $t_i$ = exposure time for case i
• It is easy to incorporate into Stata, too:
• Ex: poisson NumCrimes SES income, exposure(age)
• Note: Also works with other “count” models.
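A sketch of what the exposure term does. Adding ln(t) to the linear predictor with its coefficient fixed at 1 is the same as multiplying the rate by t, so two cases with identical Xs but different exposure get proportionally different expected counts. The linear predictor value and the ages below are hypothetical.

```python
import math

def expected_count(xb: float, exposure: float) -> float:
    """Expected count with exposure entered as ln(exposure), coefficient fixed at 1."""
    return math.exp(xb + math.log(exposure))

xb = -3.0  # hypothetical linear predictor (log of the per-year rate)

count_age_20 = expected_count(xb, exposure=20)  # 20 years at risk
count_age_40 = expected_count(xb, exposure=40)  # 40 years at risk

print(f"per-year rate            = {math.exp(xb):.4f}")
print(f"expected count at age 20 = {count_age_20:.4f}")
print(f"expected count at age 40 = {count_age_40:.4f}")
```

Doubling the exposure doubles the expected count while the underlying per-year rate stays the same.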
Poisson Model Assumptions
• Poisson regression makes a big assumption: that the variance of y equals $\mu$ (“equidispersion”)
• In other words, the mean and variance are the same
• This assumption is often not met in real data
• Dispersion is often greater than $\mu$: overdispersion
– Consequence of overdispersion: Standard errors will be underestimated
• Potential for overconfidence in results; rejecting H0 when you shouldn’t!
• Note: overdispersion doesn’t necessarily affect predicted counts (compared to alternative models).
Poisson Model Assumptions
• Overdispersion is most often caused by highly skewed dependent variables
– Often due to variables with high numbers of zeros
• Ex: Number of traffic tickets per year
• Most people have zero, some can have 50!
• Mean of variable is low, but SD is high
– Other examples of skewed outcomes
• # of scholarly publications
• # cigarettes smoked per day
• # riots per year (for sample of cities in US).
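A crude first look at overdispersion is to compare the mean and variance of the outcome. The sketch below uses only the 15 web-hours values shown in the earlier listing of observed and predicted counts, so it is an illustration rather than a result for the full sample. It is also an unconditional comparison; the Poisson assumption concerns the variance given the Xs, so treat this as a warning sign, not a formal test.

```python
import statistics

# The 15 observed wwwhr values from the listing of observed vs. predicted counts
wwwhr = [1, 3, 2, 5, 4, 3, 0, 0, 8, 5, 1, 1, 8, 12, 40]

mean = statistics.mean(wwwhr)
variance = statistics.variance(wwwhr)  # sample variance (n - 1 denominator)
ratio = variance / mean

print(f"mean = {mean:.2f}")
print(f"variance = {variance:.2f}")
print(f"variance / mean = {ratio:.1f}")
```

Under equidispersion the ratio would be near 1. A single case reporting 40 hours is enough to push the variance far above the mean.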
Negative Binomial Regression
• Strategy: Modify the Poisson model to address overdispersion
• Add an “error” term to the basic model:
$$\tilde{\mu}_i = e^{\sum_{j=1}^{K} \beta_j X_{ij} + \varepsilon_i}$$
• Additional model assumptions:
• Expected value of exponentiated error = 1 ($E[e^{\varepsilon_i}] = 1$)
• Exponentiated error is Gamma distributed
• We hope that these assumptions are more plausible than the equidispersion assumption!
Negative Binomial Regression
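• Full negative binomial model:
$$P(y \mid X) = \frac{\Gamma(y + \alpha^{-1})}{y!\,\Gamma(\alpha^{-1})} \left(\frac{\alpha^{-1}}{\alpha^{-1} + \mu}\right)^{\alpha^{-1}} \left(\frac{\mu}{\alpha^{-1} + \mu}\right)^{y}$$
• Note that the model incorporates a new parameter: $\alpha$
• Alpha represents the extent of overdispersion
• If $\alpha$ = 0 the model reduces to simple Poisson regression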
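The negative binomial formula is easier to trust after trying it numerically. The sketch below codes it with log-gamma functions for numerical stability. The values mu = 5.0 and alpha = 1.35 are hypothetical. It checks three things: the probabilities sum to 1, the variance equals mu + alpha * mu^2 (the standard result for this parameterization, which is how alpha captures overdispersion), and a near-zero alpha gives essentially the Poisson probabilities.

```python
import math

def negbin_pmf(y: int, mu: float, alpha: float) -> float:
    """Negative binomial probability of count y, with mean mu and dispersion alpha > 0."""
    inv = 1.0 / alpha
    log_p = (math.lgamma(y + inv) - math.lgamma(y + 1) - math.lgamma(inv)
             + inv * math.log(inv / (inv + mu))
             + y * math.log(mu / (inv + mu)))
    return math.exp(log_p)

def poisson_pmf(y: int, mu: float) -> float:
    """Poisson probability of count y with expected count mu."""
    return math.exp(-mu) * mu ** y / math.factorial(y)

mu, alpha = 5.0, 1.35  # hypothetical values (assumptions for illustration)

# 2000 terms is far more than needed for the tail to be negligible here
probs = [negbin_pmf(y, mu, alpha) for y in range(2000)]
total = sum(probs)
mean = sum(y * p for y, p in enumerate(probs))
variance = sum((y - mean) ** 2 * p for y, p in enumerate(probs))

print(f"sum = {total:.6f}, mean = {mean:.4f}, variance = {variance:.4f}")
print(f"mu + alpha * mu^2 = {mu + alpha * mu ** 2:.4f}")

# With alpha close to zero the probabilities approach the Poisson ones
gap = max(abs(negbin_pmf(y, mu, 1e-6) - poisson_pmf(y, mu)) for y in range(30))
print(f"largest gap from Poisson at alpha = 1e-6: {gap:.2e}")
```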
Negative Binomial Regression
• Question: Is alpha ($\alpha$) = 0?
• If so, we can use Poisson regression
• If not, overdispersion is present; Poisson is inadequate
• Strategy: conduct a statistical test of the hypothesis: H0: $\alpha$ = 0; H1: $\alpha$ > 0
• Stata provides this information when you run a negative binomial model:
• Likelihood ratio test (G2) for alpha
• P-value < .05 indicates that overdispersion is present; negative binomial is preferred
• If P>.05, just use Poisson regression
– So you don’t have to make assumptions about gamma dist….
Negative Binomial Regression
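• Interpreting coefficients: Identical to Poisson regression
• Predicted probabilities: Can be done. You must use big Neg Binomial formula
• Plugging in observed Xs, estimates of $\alpha$, $\beta$s…
$$\hat{P}(y \mid X) = \frac{\Gamma(y + \hat{\alpha}^{-1})}{y!\,\Gamma(\hat{\alpha}^{-1})} \left(\frac{\hat{\alpha}^{-1}}{\hat{\alpha}^{-1} + \hat{\mu}}\right)^{\hat{\alpha}^{-1}} \left(\frac{\hat{\mu}}{\hat{\alpha}^{-1} + \hat{\mu}}\right)^{y}$$
• Probably best to get Stata to do this one…
• Long & Freese created command: prvalue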
Negative Binomial Example: Web Use
• Note: Bs are similar but SEs change a lot!
Negative binomial regression                      Number of obs   =       1552
                                                  LR chi2(5)      =      57.80
                                                  Prob > chi2     =     0.0000
Log likelihood = -4368.6846                       Pseudo R2       =     0.0066
------------------------------------------------------------------------------
       wwwhr |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   .3617049   .0634391     5.70   0.000     .2373666    .4860433
         age |  -.0109788   .0024167    -4.54   0.000    -.0157155    -.006242
        educ |   .0171875   .0120853     1.42   0.155    -.0064992    .0408742
   lowincome |  -.0916297   .0724074    -1.27   0.206    -.2335457    .0502862
      babies |  -.1238295   .0624742    -1.98   0.047    -.2462767   -.0013824
       _cons |   1.881168   .1966654     9.57   0.000     1.495711    2.266625
-------------+----------------------------------------------------------------
    /lnalpha |   .2979718   .0408267                       .217953    .3779907
-------------+----------------------------------------------------------------
       alpha |   1.347124   .0549986                      1.243529    1.459349
------------------------------------------------------------------------------
Likelihood-ratio test of alpha=0:  chibar2(01) = 8459.61 Prob>=chibar2 = 0.000
Note: Standard Error for education increased from .004 to .012! Effect is no longer statistically significant.
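The change in significance for education follows directly from the z-statistic, which is the coefficient divided by its standard error. The sketch below recomputes z and the two-sided p-value for educ under both models, using the numbers from the two outputs. The helper function name is made up for this example.

```python
import math

def z_and_p(coef: float, se: float) -> tuple[float, float]:
    """z-statistic and two-sided normal p-value for H0: coefficient = 0."""
    z = coef / se
    p = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p

# educ row from the Poisson output
z_poisson, p_poisson = z_and_p(0.0205217, 0.004046)

# educ row from the negative binomial output
z_negbin, p_negbin = z_and_p(0.0171875, 0.0120853)

print(f"Poisson:           z = {z_poisson:.2f}, p = {p_poisson:.3f}")
print(f"Negative binomial: z = {z_negbin:.2f}, p = {p_negbin:.3f}")
```

The coefficient barely moves, but the standard error roughly triples, so z falls from about 5 to about 1.4.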
Negative Binomial Example: Web Use
• Note: Info on overdispersion is provided
Negative binomial regression                      Number of obs   =       1552
                                                  LR chi2(5)      =      57.80
                                                  Prob > chi2     =     0.0000
Log likelihood = -4368.6846                       Pseudo R2       =     0.0066
------------------------------------------------------------------------------
       wwwhr |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        male |   .3617049   .0634391     5.70   0.000     .2373666    .4860433
         age |  -.0109788   .0024167    -4.54   0.000    -.0157155    -.006242
        educ |   .0171875   .0120853     1.42   0.155    -.0064992    .0408742
   lowincome |  -.0916297   .0724074    -1.27   0.206    -.2335457    .0502862
      babies |  -.1238295   .0624742    -1.98   0.047    -.2462767   -.0013824
       _cons |   1.881168   .1966654     9.57   0.000     1.495711    2.266625
-------------+----------------------------------------------------------------
    /lnalpha |   .2979718   .0408267                       .217953    .3779907
-------------+----------------------------------------------------------------
       alpha |   1.347124   .0549986                      1.243529    1.459349
------------------------------------------------------------------------------
Likelihood-ratio test of alpha=0:  chibar2(01) = 8459.61 Prob>=chibar2 = 0.000
Alpha is clearly > 0! Overdispersion is evident; LR test p<.05
You should not use Poisson Regression in this case
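The chibar2 statistic in the last line of the output is just twice the gap between the two log likelihoods reported earlier. The sketch below recomputes it. For the p-value it assumes the usual boundary-test convention behind "chibar2(01)": because alpha cannot be negative, the reference distribution is a 50:50 mix of a point mass at zero and a chi-square with 1 degree of freedom, so the chi-square(1) tail probability is halved.

```python
import math

# Log likelihoods copied from the two outputs for web use
ll_poisson = -8598.488
ll_negbin = -4368.6846

# Likelihood ratio statistic for H0: alpha = 0
g2 = 2 * (ll_negbin - ll_poisson)

# Upper tail of chi-square(1) is erfc(sqrt(x / 2)); halve it for the boundary test
p_value = 0.5 * math.erfc(math.sqrt(g2 / 2))

print(f"G2 = {g2:.2f}")
print(f"p-value = {p_value:.3g}")
```

The statistic is so large that the p-value underflows to zero in floating point, which matches the 0.000 Stata reports.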
General Remarks
• Poisson & Negative binomial models suffer all the same basic issues as “normal” regression
• Model specification / omitted variable bias
• Multicollinearity
• Outliers/influential cases
– Also, both use Maximum Likelihood estimation
• N > 500 = fine; N < 100 can be worrisome
– Results aren’t necessarily wrong if N<100;
– But it is a possibility; and hard to know when problems crop up
• Plus ~10 cases per independent variable.
General Remarks
• It is often useful to try both Poisson and Negative Binomial models
• The latter allows you to test for overdispersion
• Use LRtest on alpha ($\alpha$) to guide model choice
– If you don’t suspect overdispersion and alpha appears to be zero, use Poisson Regression
• It makes fewer assumptions
– Such as gamma-distributed error.