Modeling Big Count Data: An IRLS Framework for COM-Poisson Regression and GAM
Suneel Chatla
Galit Shmueli
November 12, 2016
Institute of Service Science
National Tsing Hua University, Taiwan (R.O.C.)
Table of contents
1. Speed Dating Experiment: Count data models
2. Motivation
3. An IRLS framework
4. Simulation Study: Comparison of IRLS with MLE
5. A CMP Generalized Additive Model
6. Results & Conclusions
Speed Dating Experiment: Count data models
Speed dating experiment
Fisman et al. (2006) conducted a speed dating experiment to evaluate gender differences in mate selection.¹
Total sessions: 14
Decision: 1 or 0
Attractiveness: 1-10
Intelligence: 1-10
Ambition: 1-10
...
Control variables
¹ https://www.kaggle.com/annavictoria/speed-dating-experiment
Outcome/Count variables
Matches: when both persons decide Yes
Tot.Yes: total number of Yes decisions for each subject in a particular session
Summary Statistics
Statistic        N     Mean     St. Dev.   Min       Max
matches          531   2.524    2.304      0         14
Tot.Yes          531   6.433    4.361      0         21
Tot.partner      531   15.311   4.967      5         22
age              531   26.303   3.735      18        55
perc.samerace    531   0.391    0.242      0.000     0.833
avg.intcor       531   0.190    0.167      −0.298    0.569
attr             531   6.195    1.122      1.818     10.000
sinc             531   7.205    1.108      2.773     10.000
intel            531   7.381    0.988      3.409     10.000
func             531   6.438    1.103      2.682     10.000
amb              531   6.812    1.133      3.091     10.000
shar             531   5.511    1.333      1.409     10.000
like             531   6.157    1.072      1.682     10.000
prob             531   5.234    1.525      0.778     10.000
mean.agep        531   26.314   1.674      20.444    31.667
attr_o           531   6.200    1.186      2.333     8.688
sinc_o           531   7.224    0.690      4.167     9.000
intel_o          531   7.410    0.614      4.875     9.150
fun_o            531   6.438    1.015      2.625     8.615
amb_o            531   6.827    0.756      4.600     8.842
shar_o           531   5.498    0.942      1.375     7.700
like_o           531   6.161    0.873      2.333     8.300
prob_o           531   5.256    0.736      3.200     7.200
Tot.part.Yes     531   6.420    4.128      0         20
Tools:
• Poisson Regression
• Negative Binomial Regression
• Conway-Maxwell Poisson (CMP) Regression
The CMP distribution
From Shmueli et al. (2005),
Y ∼ CMP(λ, ν)
implies
\[ P(Y = y) = \frac{\lambda^{y}}{(y!)^{\nu}\, Z(\lambda, \nu)}, \quad y = 0, 1, 2, \ldots \]
\[ Z(\lambda, \nu) = \sum_{s=0}^{\infty} \frac{\lambda^{s}}{(s!)^{\nu}} \]
for λ > 0, ν ≥ 0.
The CMP distribution includes three well-known distributions as special cases:
• Poisson (ν = 1)
• Geometric (ν = 0, λ < 1)
• Bernoulli (ν → ∞, with success probability λ/(1 + λ))
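As a rough illustration of the probability function above, the following Python sketch evaluates the CMP pmf by truncating the infinite sum Z(λ, ν). The function names (`cmp_log_z`, `cmp_pmf`), the relative tolerance and the truncation rule are assumptions made for this example; they are not taken from the authors' implementation.

```python
import math


def cmp_log_z(lam, nu, tol=1e-10, max_terms=100_000):
    """Approximate log Z(lam, nu) with a truncated sum (illustrative rule).

    Terms are handled on the log scale, s*log(lam) - nu*log(s!), to avoid
    overflow. Summation stops once the terms are decreasing and the current
    term is below tol times the largest term seen so far.
    """
    if lam <= 0 or nu < 0:
        raise ValueError("need lam > 0 and nu >= 0")
    if nu == 0 and lam >= 1:
        raise ValueError("Z diverges for nu = 0 and lam >= 1")
    log_lam = math.log(lam)
    log_terms = [0.0]  # s = 0 term: lam^0 / (0!)^nu = 1
    peak = 0.0
    for s in range(1, max_terms):
        lt = s * log_lam - nu * math.lgamma(s + 1)
        log_terms.append(lt)
        peak = max(peak, lt)
        if lt < log_terms[-2] and lt < peak + math.log(tol):
            break
    # log-sum-exp of the collected terms
    return peak + math.log(sum(math.exp(t - peak) for t in log_terms))


def cmp_pmf(y, lam, nu):
    """P(Y = y) for Y ~ CMP(lam, nu), using the truncated normalizing constant."""
    log_p = y * math.log(lam) - nu * math.lgamma(y + 1) - cmp_log_z(lam, nu)
    return math.exp(log_p)
```

With ν = 1 the function should agree with the Poisson pmf, and with ν = 0 and λ < 1 with the geometric pmf (1 − λ)λ^y, which gives a quick sanity check of the special cases listed above.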
CMP distribution for different (λ, ν) combinations
[Figure: CMP densities in a grid of panels, one for each combination of λ ∈ {2, 8, 15} and ν ∈ {0.5, 0.75, 1, 3}; vertical axis: Density.]
CMP Regression
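CMP regression models can be formulated as follows:
\[ \log(\lambda) = X\beta \tag{1} \]
\[ \log(\nu) = Z\gamma \tag{2} \]
Maximizing the log-likelihood with respect to the parameters β and γ yields the following normal equations (Sellers and Shmueli, 2010):
\[ U = \frac{\partial \log L}{\partial \beta} = X^{T}\,(y - E(y)) \tag{3} \]
\[ V = \frac{\partial \log L}{\partial \gamma} = \nu Z^{T}\,(-\log(y!) + E(\log(y!))) \tag{4} \]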
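To make equations (3) and (4) concrete, here is a minimal Python sketch that evaluates U and V for given β and γ. It assumes that the expectations E(y) and E(log(y!)) are approximated by a sum truncated at a hypothetical `max_y`, and that ν in equation (4) is applied observation by observation (ν_i = exp(z_i γ)); the function names are illustrative only.

```python
import math


def cmp_moments(lam, nu, max_y=200):
    """Return (E[y], E[log y!]) under CMP(lam, nu) from a truncated sum.

    max_y is an assumed truncation point; it must comfortably exceed the
    bulk of the distribution (large lam with small nu needs a larger value).
    """
    log_w = [y * math.log(lam) - nu * math.lgamma(y + 1) for y in range(max_y + 1)]
    peak = max(log_w)
    w = [math.exp(v - peak) for v in log_w]  # unnormalized probabilities
    total = sum(w)
    mean_y = sum(y * wi for y, wi in enumerate(w)) / total
    mean_logfact = sum(math.lgamma(y + 1) * wi for y, wi in enumerate(w)) / total
    return mean_y, mean_logfact


def cmp_scores(X, Z, y, beta, gamma):
    """Evaluate U = X^T (y - E y) and V = sum_i nu_i z_i (E log y_i! - log y_i!).

    X and Z are lists of rows (lists of floats); y is a list of counts.
    """
    U = [0.0] * len(beta)
    V = [0.0] * len(gamma)
    for xi, zi, yi in zip(X, Z, y):
        lam = math.exp(sum(b * x for b, x in zip(beta, xi)))
        nu = math.exp(sum(g * z for g, z in zip(gamma, zi)))
        mean_y, mean_logfact = cmp_moments(lam, nu)
        for j, x in enumerate(xi):
            U[j] += x * (yi - mean_y)
        for j, z in enumerate(zi):
            V[j] += nu * z * (mean_logfact - math.lgamma(yi + 1))
    return U, V
```

For an intercept-only model with ν = 1 (the Poisson case), U should vanish when exp(β) equals the sample mean, which is a convenient check on the sketch.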
Motivation
Exploration of Speed Dating data
[Figure: four scatter plots of Tot.Yes (log) against Sincerity (Others), Intelligence (Others), Sincerity, and Fun seeking.]
More flexibility?
Generalized Additive Models
• Smoothing splines
• Penalized splines
Both implementations depend on the Iteratively Reweighted Least Squares (IRLS) estimation framework.
At present, there is no IRLS framework available for CMP!
An IRLS framework
Update for each iteration
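\[ I \begin{bmatrix} \beta \\ \gamma \end{bmatrix}^{(m)} = I \begin{bmatrix} \beta \\ \gamma \end{bmatrix}^{(m-1)} + \begin{bmatrix} U \\ V \end{bmatrix} \]
which implies the following equations
\[ X^{T}\Sigma_{y}X\beta^{(m)} - X^{T}\Sigma_{y,\log(y!)}\,\nu Z\gamma^{(m)} = X^{T}\Sigma_{y}X\beta^{(m-1)} - X^{T}\Sigma_{y,\log(y!)}\,\nu Z\gamma^{(m-1)} + X^{T}(y - E(y)) \]
and
\[ -\nu Z^{T}\Sigma_{y,\log(y!)}X\beta^{(m)} + \nu^{2}Z^{T}\Sigma_{\log(y!)}Z\gamma^{(m)} = -\nu Z^{T}\Sigma_{y,\log(y!)}X\beta^{(m-1)} + \nu^{2}Z^{T}\Sigma_{\log(y!)}Z\gamma^{(m-1)} + \nu Z^{T}(-\log(y!) + E(\log(y!))) \]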
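If γ is held fixed while β is updated (γ^(m) = γ^(m−1)), and β is held fixed while γ is updated (β^(m) = β^(m−1)), the cross terms cancel and the equations reduce to
\[ X^{T}\Sigma_{y}X\beta^{(m)} = X^{T}\Sigma_{y}X\beta^{(m-1)} + X^{T}(y - E(y)) \tag{5} \]
\[ \nu^{2}Z^{T}\Sigma_{\log(y!)}Z\gamma^{(m)} = \nu^{2}Z^{T}\Sigma_{\log(y!)}Z\gamma^{(m-1)} + \nu Z^{T}(-\log(y!) + E(\log(y!))) \tag{6} \]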
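Equation (5) can be read as β^(m) = β^(m−1) + (XᵀΣ_y X)⁻¹ Xᵀ(y − E(y)). The Python sketch below performs one such update for β with ν held fixed at a common scalar value. It is an illustration under stated assumptions, not the authors' implementation: Σ_y is taken to be the diagonal matrix of Var(y_i), the moments are computed from a sum truncated at a hypothetical `max_y`, and the small `solve` helper is a plain Gaussian elimination written for this example.

```python
import math


def cmp_mean_var(lam, nu, max_y=200):
    """Mean and variance of CMP(lam, nu) from a truncated sum (assumed max_y)."""
    log_w = [y * math.log(lam) - nu * math.lgamma(y + 1) for y in range(max_y + 1)]
    peak = max(log_w)
    w = [math.exp(v - peak) for v in log_w]
    total = sum(w)
    mean = sum(y * wi for y, wi in enumerate(w)) / total
    var = sum((y - mean) ** 2 * wi for y, wi in enumerate(w)) / total
    return mean, var


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def irls_beta_step(X, y, beta, nu):
    """One update of beta following equation (5), with nu held fixed."""
    p = len(beta)
    A = [[0.0] * p for _ in range(p)]  # X^T Sigma_y X
    g = [0.0] * p                      # X^T (y - E(y))
    for xi, yi in zip(X, y):
        lam = math.exp(sum(b * x for b, x in zip(beta, xi)))
        mean, var = cmp_mean_var(lam, nu)
        for j in range(p):
            g[j] += xi[j] * (yi - mean)
            for k in range(p):
                A[j][k] += xi[j] * var * xi[k]
    delta = solve(A, g)
    return [b + d for b, d in zip(beta, delta)]
```

Repeating `irls_beta_step` until β stops changing gives the β half of the alternating scheme; with ν = 1 it coincides with the usual Poisson IRLS update.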
Algorithm
Source: https://arxiv.org/abs/1610.08244
Practical issues
Initial Values
• For λ: (y + 0.1)^ν
• For ν: 0.2
Calculation of Cumulants
• Bounding error 10⁻⁸ or 10⁻¹⁰
• Asymptotic expressions
Stopping Criterion
• Based on −2 Σ_i l(y_i; λ̂_i, ν̂_i)
Step size
• Step halving
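The initial values and the step-halving rule listed above can be sketched in a few lines of Python. The helper names and the generic `objective` argument are assumptions for illustration; in the CMP setting the objective would be the quantity −2 Σ l(y_i; λ̂_i, ν̂_i) used for the stopping criterion.

```python
def initial_lambda(y, nu=0.2):
    """Starting value for lambda: (y + 0.1)^nu, with the starting nu = 0.2."""
    return (y + 0.1) ** nu


def step_halving(objective, current, proposed, max_halvings=10):
    """Shrink the step from current toward proposed until the objective
    is no larger than at the current value.

    objective: hypothetical callable mapping a parameter list to a float
               that should not increase (for example -2 * log-likelihood).
    Returns the accepted parameter list, or current if no step is accepted.
    """
    base = objective(current)
    step = [p - c for p, c in zip(proposed, current)]
    scale = 1.0
    for _ in range(max_halvings):
        trial = [c + scale * s for c, s in zip(current, step)]
        if objective(trial) <= base:
            return trial
        scale /= 2.0  # halve the step and try again
    return current
```

For example, with the objective (v − 1)² and a proposed move from 0 to 3, the full step is rejected and the halved step to 1.5 is accepted.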
Simulation Study: Comparison of IRLS with MLE
Study design
We compare our IRLS algorithm with the existing implementation, which maximizes the likelihood function directly (through optim in R).
(a) Set sample size n = 100
(b) Generate x1 ∼ U(0, 1) and x2 ∼ N(0, 1)
(c) Calculate x3 = 0.2x1 + U(0, 0.3) and x4 = 0.3x2 + N(0, 0.1) (to create correlated variables)
(d) Generate y ∼ CMP(λ, ν) with log(λ) = 0.05 + 0.5x1 − 0.5x2 + 0.25x3 − 0.25x4, where ν ∈ {0.5, 2, 5}
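The data-generating steps (a) to (d) can be sketched in Python as follows. Several details are assumptions of this example rather than facts from the study: the second argument of N(0, 0.1) is treated as a standard deviation, CMP draws use inverse-CDF sampling on a pmf truncated at a hypothetical `max_y`, and the function names are illustrative.

```python
import math
import random


def sample_cmp(lam, nu, rng, max_y=2000):
    """Draw one value from CMP(lam, nu) by inverse-CDF sampling.

    The pmf is truncated at max_y (an assumed bound chosen to be far above
    the counts this design produces).
    """
    log_w = [y * math.log(lam) - nu * math.lgamma(y + 1) for y in range(max_y + 1)]
    peak = max(log_w)
    w = [math.exp(v - peak) for v in log_w]  # unnormalized probabilities
    u = rng.random() * sum(w)
    acc = 0.0
    for y, wi in enumerate(w):
        acc += wi
        if u <= acc:
            return y
    return max_y


def simulate(n=100, nu=2.0, seed=0):
    """Generate n rows (x1, x2, x3, x4, y) following steps (a) to (d)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x1 = rng.uniform(0.0, 1.0)
        x2 = rng.gauss(0.0, 1.0)
        x3 = 0.2 * x1 + rng.uniform(0.0, 0.3)  # correlated with x1
        x4 = 0.3 * x2 + rng.gauss(0.0, 0.1)    # correlated with x2; 0.1 taken as sd
        log_lam = 0.05 + 0.5 * x1 - 0.5 * x2 + 0.25 * x3 - 0.25 * x4
        y = sample_cmp(math.exp(log_lam), nu, rng)
        rows.append((x1, x2, x3, x4, y))
    return rows
```

Calling `simulate(nu=0.5)`, `simulate(nu=2.0)` and `simulate(nu=5.0)` would give one data set for each dispersion level in the design.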
Results
[Figure: IRLS (IR) versus MLE estimates for x1, x2, x3, x4 and log(ν), one panel each, grouped by ν = 0.5, ν = 2 and ν = 5.]
A CMP Generalized Additive Model
Additive Model
\[ \log(\lambda) = \alpha + \sum_{j=1}^{p} f_j(X_j) \]
\[ \log(\nu) = Z\gamma \]
where f_j (j = 1, 2, ..., p) are smooth functions of the p variables.
Backfitting
Based on Hastie and Tibshirani (1990) and Wood (2006), the algorithm is as follows:
1. Initialize: f_j = f_j^(0), j = 1, ..., p
2. Cycle: j = 1, ..., p, 1, ..., p, ...
\[ f_j = S_j\Big( y - \sum_{k \neq j} f_k \;\Big|\; x_j \Big) \]
3. Continue (2) until the individual functions don't change.
One more nested loop inside the IRLS framework!
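The backfitting cycle above can be sketched in Python for a plain additive model with a continuous response. This is a simplified illustration, not the CMP version: in the CMP setting the loop would run on the working response inside each IRLS iteration. The nearest-neighbour running-mean smoother stands in for the spline smoothers S_j, and the intercept α and the centring of each f_j are assumptions added here for identifiability.

```python
def knn_smoother(x, r, k=5):
    """Smooth r against x with a running mean over k neighbours in x-order."""
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])
    fitted = [0.0] * n
    for pos, i in enumerate(order):
        lo = max(0, min(pos - k // 2, n - k))  # keep the window inside the data
        window = order[lo:lo + k]
        fitted[i] = sum(r[j] for j in window) / len(window)
    return fitted


def backfit(columns, y, smoother, max_iter=50, tol=1e-8):
    """Backfitting: cycle f_j = S_j(y - alpha - sum_{k != j} f_k | x_j).

    columns: list of covariate columns (each a list of floats).
    Returns the intercept alpha and the fitted functions evaluated at the data.
    """
    n = len(y)
    alpha = sum(y) / n
    f = [[0.0] * n for _ in columns]  # step 1: initialize every f_j at zero
    for _ in range(max_iter):
        change = 0.0
        for j, xj in enumerate(columns):  # step 2: cycle over j
            partial = [
                y[i] - alpha - sum(f[k][i] for k in range(len(f)) if k != j)
                for i in range(n)
            ]
            new = smoother(xj, partial)
            centre = sum(new) / n
            new = [v - centre for v in new]  # centre each f_j
            change = max(change, max(abs(a - b) for a, b in zip(new, f[j])))
            f[j] = new
        if change < tol:  # step 3: stop when the functions no longer change
            break
    return alpha, f
```

A call such as `backfit([x1, x2], y, knn_smoother)` returns the intercept and one fitted function per covariate; swapping in a different `smoother` changes the S_j without touching the loop.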
Results & Conclusions
Comparison of Regression models on Tot.Yes
                  Poisson      Negative Binomial   CMP
(Intercept)       0.49         0.59                0.14
                  (0.43)       (0.55)              (0.33)
GenderMale        0.05         0.05                0.03
                  (0.04)       (0.06)              (0.03)
age               −0.01        −0.01               −0.004
                  (0.01)       (0.01)              (0.004)
Tot.partner       0.07***      0.07***             0.04***
                  (0.00)       (0.01)              (0.003)
avg.intcor        −0.04        −0.04               −0.02
                  (0.11)       (0.15)              (0.09)
attr              0.19***      0.18***             0.11***
                  (0.03)       (0.04)              (0.02)
sinc              −0.06        −0.05               −0.04
                  (0.03)       (0.04)              (0.02)
intel             0.05         0.06                0.03
                  (0.04)       (0.05)              (0.03)
func              0.03         0.04                0.02
                  (0.04)       (0.05)              (0.03)
amb               −0.12***     −0.13**             −0.07**
                  (0.03)       (0.04)              (0.02)
shar              0.10***      0.10***             0.06***
                  (0.02)       (0.03)              (0.02)
mean.agep         −0.01        −0.01               −0.007
                  (0.01)       (0.02)              (0.009)
attr_o            −0.10***     −0.10***            −0.06***
                  (0.02)       (0.03)              (0.02)
sinc_o            0.02         0.02                0.01
                  (0.04)       (0.05)              (0.03)
intel_o           0.08         0.08                0.05
                  (0.05)       (0.07)              (0.04)
fun_o             −0.01        −0.01               −0.003
                  (0.03)       (0.04)              (0.02)
amb_o             −0.00        −0.01               0.0005
                  (0.04)       (0.05)              (0.03)
shar_o            0.02         0.03                0.01
                  (0.03)       (0.04)              (0.02)
ν                                                  0.53***
AIC               2844.92      2777.24             2751.7
BIC               3011.64      2948.23             2922.66
Log Likelihood    -1383.46     -1348.62            -1335.33
Deviance          970.04       637.25
Num. obs.         531          531                 531
***p < 0.001, **p < 0.01, *p < 0.05
Comparison of Additive Models on Tot.Yes
Dependent variable: Tot.Yes
              CMP (Chi.Sq)   Poisson (Chi.Sq)
s(sinc)       7.16           11.53**
s(func)       7.51           11.40**
s(sinc_o)     13.96**        29.30***
s(intel_o)    14.06**        13.26***
ν             0.56
AIC           2737.03        2804.77
Note: *p<0.1; **p<0.05; ***p<0.01
It is mostly the behavior of the other person that guides us to select her/him.
Summary
• The IRLS framework is far more efficient than the existing likelihood-based method and provides more flexibility.
• Since CMP is computationally heavier than other GLMs, some matrix computations could be parallelized to increase speed.
• The IRLS framework opens CMP to other modeling extensions, such as the LASSO.
Full paper: https://arxiv.org/abs/1610.08244
Source code: https://github.com/SuneelChatla/cmp
Suggestions and Questions?
References
Fisman, R., Iyengar, S. S., Kamenica, E., and Simonson, I. (2006). Gender differences in mate selection: Evidence from a speed dating experiment. The Quarterly Journal of Economics, pages 673–697.
Hastie, T. J. and Tibshirani, R. J. (1990). Generalized additive models, volume 43. CRC Press.
Sellers, K. F. and Shmueli, G. (2010). A flexible regression model for count data. Annals of Applied Statistics, 4(2):943–961.
Shmueli, G., Minka, T. P., Kadane, J. B., Borle, S., and Boatwright, P. (2005). A useful distribution for fitting discrete data: revival of the Conway–Maxwell–Poisson distribution. Journal of the Royal Statistical Society: Series C (Applied Statistics), 54(1):127–142.
Wood, S. (2006). Generalized additive models: an introduction with R. CRC Press.