Comparing MLE, MUE and Firth Estimates
for Logistic Regression
Nitin R Patel, Chairman & Co-founder, Cytel Inc.
Research Affiliate, MIT
Copyright © Cytel Inc. 2013. All rights reserved.
Acknowledgements
This presentation is based on joint work with:
• Pralay Senchaudhuri, Cytel Inc.
• Hrishikesh Kulkarni, Cytel Inc.
JSM, Montreal, Aug 2013
Outline
• Separation and Maximum Likelihood Estimates
• Firth’s Method of Maximum Penalized Likelihood Estimation
• Numerical experiments comparing MUE with FirthE when there is separation
• Near separation and problems with MLE
• Numerical experiments comparing MLE with FirthE when there is near separation
• Conclusions
Maximum Likelihood Estimation
• Almost universally used method for logistic regression models.
• ML estimates are asymptotically unbiased and asymptotically attain minimum variance, but neither property is guaranteed in finite samples.
• MLEs can have serious shortcomings when applied to datasets with the following characteristics:
  – Small/moderate in size
  – Unbalanced responses (rare outcomes)
  – Unequally spaced covariate values
  – Many parameters relative to the number of observations
Separation
• Example 1
seq#  x1  x2  y
   1  10  10  1
   2  11  11  1
   3  12  12  1
   4  13  13  1
   5  14  14  1
   6  15  15  1
   7  16  16  1
   8  17  17  1
   9  19  19  1
  10  10  16  0
  11  11  17  0
  12  12  18  0
  13  13  19  0
  14  14  20  0
  15  15  21  0
  16  16  22  0
  17  17  23  0
  18  18  18  0
  19  18  24  0
  20  19  25  0
[Figure: covariate plot of data, x2 (vertical axis) vs. x1 (horizontal axis)]
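The separation in Example 1 can be checked by hand along one candidate direction. The sketch below is an illustration, not part of the original slides: the function name `separation_type` and the choice of the direction x1 − x2 are assumptions made here. It classifies how a given linear score splits the two response groups. A result of "none" only says that this particular direction fails, not that no separating direction exists.

```python
# Example 1 rows as (x1, x2, y).
DATA = [
    (10, 10, 1), (11, 11, 1), (12, 12, 1), (13, 13, 1), (14, 14, 1),
    (15, 15, 1), (16, 16, 1), (17, 17, 1), (19, 19, 1),
    (10, 16, 0), (11, 17, 0), (12, 18, 0), (13, 19, 0), (14, 20, 0),
    (15, 21, 0), (16, 22, 0), (17, 23, 0), (18, 18, 0), (18, 24, 0),
    (19, 25, 0),
]


def separation_type(data, b0, b1, b2):
    """Classify how the score b0 + b1*x1 + b2*x2 splits the responses.

    "complete":       score > 0 for every y = 1 and < 0 for every y = 0.
    "quasi-complete": the same inequalities hold only weakly, with at
                      least one observation strictly off the line.
    "none":           this direction does not separate the responses.
    """
    # Flip the sign for y = 0 so that "correct side" always means >= 0.
    signed = []
    for x1, x2, y in data:
        score = b0 + b1 * x1 + b2 * x2
        signed.append(score if y == 1 else -score)
    if all(v > 0 for v in signed):
        return "complete"
    if all(v >= 0 for v in signed) and any(v > 0 for v in signed):
        return "quasi-complete"
    return "none"


# Hypothetical direction chosen for illustration: score = x1 - x2.
print(separation_type(DATA, 0, 1, -1))
```

Along x1 − x2 every y = 1 row scores exactly 0 and every y = 0 row scores 0 or below, so the inequalities hold only weakly for the data as transcribed.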
MLEs and Separation
• When separation occurs, one or more MLEs do not exist.
• In other words, one or more ML estimates are unbounded (and so are their standard errors).
• This means that the maximum likelihood method fails to provide either point or interval estimates.
A useful characterization of separation
• Separation occurs if and only if the observed vector of sufficient statistics is on the boundary of the convex hull of the (finite) set of possible sufficient statistics vectors.
Example 2: Simple Logistic Regression (one covariate, two parameters)
Response $Y_i$, covariate $x_i$ for observation $i$
Model: $\operatorname{logit}(\pi_i) = \beta_0 + \beta_1 x_i$, where $\pi_i = P(Y_i = 1)$
Sufficient statistics vector is $(T_0, T_1)$, where $T_0 = \sum_i Y_i$ and $T_1 = \sum_i x_i Y_i$
Example 2: Simple Logistic Regression (contd.)
x: 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100
[Figure: space of sufficient statistics; horizontal axis t0 (sufficient stat. for beta0), vertical axis t1 (sufficient stat. for beta1)]
Sufficient statistics vector is $(T_0, T_1)$, where $T_0 = \sum_i Y_i$ and $T_1 = \sum_i x_i Y_i$
Example 2 (contd.)
• If we observe:
  yi = 0 for xi = 5, 10, 15, 20, 25, 30, 35, 40, 45
  yi = 1 for xi = 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100.
• The observed sufficient statistics vector is (t0 = 11, t1 = 825).
• The MLE for β1 does not exist since (11, 825) is on the boundary of (T0, T1) space.
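The arithmetic behind (t0 = 11, t1 = 825) can be reproduced in a few lines. The sketch below is illustrative: the helper names `sufficient_stats` and `t1_range` are not from the slides. For this equally spaced design it treats a point as lying on the boundary when t1 equals the smallest or largest value attainable for the observed t0.

```python
X = list(range(5, 101, 5))             # covariate values 5, 10, ..., 100
Y = [0 if x <= 45 else 1 for x in X]   # observed responses from Example 2


def sufficient_stats(x, y):
    """Return (t0, t1) = (sum of y_i, sum of x_i * y_i)."""
    t0 = sum(y)
    t1 = sum(xi * yi for xi, yi in zip(x, y))
    return t0, t1


def t1_range(x, t0):
    """Smallest and largest t1 attainable when exactly t0 responses are 1."""
    s = sorted(x)
    lowest = sum(s[:t0])              # the t0 smallest covariate values
    highest = sum(s[len(s) - t0:])    # the t0 largest covariate values
    return lowest, highest


t0, t1 = sufficient_stats(X, Y)
lowest, highest = t1_range(X, t0)
on_boundary = t1 in (lowest, highest)
print(t0, t1, lowest, highest, on_boundary)  # 11 825 330 825 True
```

Here t1 = 825 is the largest value possible with eleven successes, which is why the observed vector sits on the boundary.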
Firth’s Penalized Likelihood Method
• The MLE is the root obtained when the score function (derivative of the loglikelihood) is equated to zero.
• Firth’s method removes the $O(n^{-1})$ term from the bias of the MLE by modifying the score function with a penalty term.
• The root obtained when the modified score function is set to zero is Firth’s Penalized Likelihood Estimate (FirthE).
Logistic Regression
• The loglikelihood has the form
  $l(\beta) = t^{\top}\beta - K(\beta)$
  where $t$ is the observed sufficient statistic vector
• The score function is therefore
  $U(\beta) = l'(\beta) = t - K'(\beta)$
• Firth’s modified score function is
  $U_j^{*}(\beta) = U_j(\beta) + \frac{1}{2}\,\operatorname{trace}\left[ I(\beta)^{-1}\, \frac{\partial I(\beta)}{\partial \beta_j} \right]$
  where $I(\beta)$ is Fisher’s information matrix
• Firth’s modification shrinks the ML estimate towards zero
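To make the modified score concrete, here is a minimal sketch for the one-covariate model of Example 2. It is not the LogXact implementation, and the function names are illustrative. For logistic regression the trace term reduces to the known form $U_j^{*} = \sum_i \{y_i - \pi_i + h_i(\tfrac{1}{2} - \pi_i)\}\,x_{ij}$, with $h_i$ the diagonal of the hat matrix. The sketch solves $U^{*} = 0$ by Fisher scoring, with step-halving on the penalized loglikelihood $l(\beta) + \tfrac{1}{2}\log\det I(\beta)$; the step-halving, tolerance and iteration cap are choices made here.

```python
import math

X = [float(v) for v in range(5, 101, 5)]   # Example 2 covariate values
Y = [0 if v <= 45 else 1 for v in X]       # separated responses from Example 2


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def info(x, p):
    # Entries of the 2x2 Fisher information X'WX, W = diag(p_i (1 - p_i)).
    w = [pi * (1.0 - pi) for pi in p]
    a = sum(w)
    b = sum(wi * xi for wi, xi in zip(w, x))
    c = sum(wi * xi * xi for wi, xi in zip(w, x))
    return w, a, b, c, a * c - b * b


def penalized_loglik(x, y, b0, b1):
    # l(beta) + 0.5 * log det I(beta); -inf if I(beta) is numerically singular.
    eta = [b0 + b1 * xi for xi in x]
    det = info(x, [sigmoid(e) for e in eta])[4]
    if det <= 0.0:
        return -math.inf
    ll = sum(yi * e - max(e, 0.0) - math.log1p(math.exp(-abs(e)))
             for yi, e in zip(y, eta))
    return ll + 0.5 * math.log(det)


def modified_score(x, y, b0, b1):
    p = [sigmoid(b0 + b1 * xi) for xi in x]
    w, a, b, c, det = info(x, p)
    # h_i = w_i * (1, x_i) I^{-1} (1, x_i)', the hat-matrix diagonal.
    h = [wi * (c - 2.0 * b * xi + a * xi * xi) / det for wi, xi in zip(w, x)]
    r = [yi - pi + hi * (0.5 - pi) for yi, pi, hi in zip(y, p, h)]
    u0 = sum(r)
    u1 = sum(ri * xi for ri, xi in zip(r, x))
    return u0, u1, a, b, c, det


def firth_fit(x, y, max_iter=500, tol=1e-9):
    b0 = b1 = 0.0
    for _ in range(max_iter):
        u0, u1, a, b, c, det = modified_score(x, y, b0, b1)
        d0 = (c * u0 - b * u1) / det       # direction I^{-1} U*
        d1 = (a * u1 - b * u0) / det
        if max(abs(d0), abs(d1)) < tol:
            break
        current = penalized_loglik(x, y, b0, b1)
        step = 1.0
        # Halve the step while it would lower the penalized loglikelihood.
        while step > 1e-6 and penalized_loglik(
                x, y, b0 + step * d0, b1 + step * d1) < current - 1e-12:
            step /= 2.0
        b0 += step * d0
        b1 += step * d1
    return b0, b1


b0, b1 = firth_fit(X, Y)
print(b0, b1)
```

On these separated responses the ML slope is unbounded, whereas the penalized fit returns finite values for both coefficients.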
Boundary points of Sufficient Statistics space
[Figure: boundary points in the space of sufficient statistics; horizontal axis t0, vertical axis t1]
t_0  lowest t_1  highest t_1
  0           0            0
  1           5          100
  2          15          195
  3          30          285
  4          50          370
  5          75          450
  6         105          525
  7         140          595
  8         180          660
  9         225          720
 10         275          775
 11         330          825
 12         390          870
 13         455          910
 14         525          945
 15         600          975
 16         680         1000
 17         765         1020
 18         855         1035
 19         950         1045
 20        1050         1050
There are 40 points on the boundary of the set of possible values of (t0, t1); for t0 = 0 and t0 = 20 the lowest and highest values coincide.
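The count of 40 boundary points can be reproduced by taking, for each t0, the sum of the t0 smallest and the t0 largest covariate values. The sketch below is a check under that assumption, with the illustrative helper name `boundary_points`, for the equally spaced design of Example 2.

```python
X = list(range(5, 101, 5))   # covariate values 5, 10, ..., 100


def boundary_points(x):
    """Lower and upper envelope of the attainable (t0, t1) pairs."""
    s = sorted(x)
    n = len(s)
    points = set()
    for t0 in range(n + 1):
        points.add((t0, sum(s[:t0])))        # smallest attainable t1
        points.add((t0, sum(s[n - t0:])))    # largest attainable t1
    return sorted(points)


pts = boundary_points(X)
print(len(pts))  # 40
```

The two envelopes meet at (0, 0) and (20, 1050), which is why 21 values of t0 give 40 points rather than 42.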
Comparison of MUE with FirthE when MLE does not exist
• Several numerical experiments with one covariate models and a limited number with two covariate models.
• Used exhaustive enumeration of t-vectors as well as Monte Carlo simulations with sample sizes of 1000.
• We will illustrate with Example 2 data
[Figure: Bias Comparison for MUE with FirthE for ED50 = 52.5; based on complete enumeration]
[Figure: MSE Comparison for MUE with FirthE for ED50 = 52.5; based on complete enumeration]
Findings from numerical experiments
• Our numerical experiments, several with one covariate and some with two covariates, suggest that Firth’s method gives better estimates than MUE in terms of both bias and Mean Square Error when there is complete separation.
• Additional advantages of Firth’s method are:
  – Unlike MUE, it does not depend on the conditional distribution of the sufficient statistic, so it does not have problems associated with having few support points (e.g. with continuous covariates).
  – It is much faster to compute.
A real dataset
• Two hundred rats were treated with a toxic agent at four dose levels; the binary response examined was development of an intestinal tumor. The covariates were the dose levels (as factor variables) and a binary survival variable to control for death. (Data from US Toxicology Program Tech Report 405, 1991; the LogXact manual gives details.)
• There was separation in this dataset.
• Output is from the current beta version of LogXact, which provides Firth’s method as an option.
LogXact Results
            Point Estimate               95% Conf. Interval               2*1-sided
Model Term  Type    Beta     SE(Beta)    Type        Lower    Upper       P-Value
%Const      FirthE  -3.861   2.108       Asymptotic  -7.993   0.2713      0.0671
dose_0      FirthE  -2.873   1.937       Asymptotic  -6.67    0.9241      0.1381
            MUE     -1.053   NA          Exact       -INF     1.909       0.4824
dose_150    FirthE  -1.24    1.438       Asymptotic  -4.057   1.578       0.3886
            CMLE    -1.444   1.667       Exact       -6.437   2.471       0.9367
dose_300    FirthE  -2.733   1.656       Asymptotic  -5.978   0.5116      0.0988
            MUE     -1.677   NA          Exact       -INF     0.869       0.2068
survival    FirthE  0.09387  0.1402      Asymptotic  -0.1808  0.3686      0.5030
            CMLE    0.1246   0.174       Exact       -0.2128  0.5058      0.5345
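The rows labelled "Asymptotic" appear consistent with ordinary Wald intervals, Beta ± 1.96 × SE(Beta), and a two-sided normal p-value. That reading is an inference from the printed numbers, not something the slide states. A minimal check for the %Const row, using the illustrative helper `wald_summary`:

```python
import math


def wald_summary(beta, se, z_crit=1.959964):
    """Wald 95% interval and two-sided normal p-value for one coefficient."""
    lower = beta - z_crit * se
    upper = beta + z_crit * se
    z = beta / se
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)).
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))
    return lower, upper, p_two_sided


# %Const row of the FirthE output: Beta = -3.861, SE(Beta) = 2.108.
lower, upper, p = wald_summary(-3.861, 2.108)
print(round(lower, 3), round(upper, 3), round(p, 4))
```

The printed values agree with the table entries (-7.993, 0.2713, 0.0671) up to rounding of the reported Beta and SE.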
Near Separation
• MLE is unstable: a small shift in the data leads to a huge change in the ML estimates of the coefficients
Example 1, with x2 of observation 18 set to k:
seq#  x1  x2  y
   1  10  10  1
   2  11  11  1
   3  12  12  1
   4  13  13  1
   5  14  14  1
   6  15  15  1
   7  16  16  1
   8  17  17  1
   9  19  19  1
  10  10  16  0
  11  11  17  0
  12  12  18  0
  13  13  19  0
  14  14  20  0
  15  15  21  0
  16  16  22  0
  17  17  23  0
  18  18   k  0
  19  18  24  0
  20  19  25  0
[Figure: covariate plot of data, x2 (vertical axis) vs. x1 (horizontal axis), with the point (18, k) marked]
MLE and Near separation: Example 1 (contd.)
[Figure: MLE coefficients vs k; beta1 and beta2 (vertical axis, beta) plotted against k (horizontal axis)]
Interior Points grouped into Layers by closeness to the boundary
[Figure: interior point layers in the (t0, t1) space; Layers 1, 5, 10, 20, 40 and 50 shown; horizontal axis t0, vertical axis t1]
[Figure: Bias Comparison of MLE to FirthE, ED50 = 52.5; based on complete enumeration]
[Figure: Bias Comparison of MLE to FirthE, ED50 = 5; based on complete enumeration]
[Figure: Bias Comparison of MLE to FirthE, ED50 = 100; based on complete enumeration]
[Figure: Significant Models (pval < 0.05), Bias Comparison of MLE to FirthE, ED50 = 52.5; based on complete enumeration]
[Figure: MSE Comparison of MLE to FirthE, ED50 = 52.5; based on complete enumeration]
[Figure: MSE Comparison of MLE to FirthE, ED50 = 5; based on complete enumeration]
[Figure: MSE Comparison of MLE to FirthE, ED50 = 100; based on complete enumeration]
[Figure: Significant Models (pval < 0.05), MSE Comparison of MLE to FirthE, ED50 = 52.5; based on complete enumeration]
Conclusions from Experiments
• Our numerical experiments and simulations suggest that FirthE reduces bias as well as Mean Square Error in comparison to MLE when the maximum slope of the logistic curve is not very high.
• However, when the maximum slope is high, the FirthE correction for bias produces excessive shrinkage and the MLE is superior.
• In many data sets that arise in practice we don’t expect large changes in response for small changes in the covariate values, so FirthE will be superior.
• We conjecture that this conclusion will also hold when we compare conditional MLE and conditional FirthE.
Detecting near separation in data sets
• We have a research project to create an index that signals near separation in data sets, to alert LogXact users about the bias in MLE.
• Please let us know if you have datasets you can share which seem to exhibit near separation.
• Experiments suggest that we can use Confidence Intervals based on the Firth Profile Likelihood to detect near separation.
• The ratio of the Upper CI width to the Lower CI width appears to have promise as an index of near separation.
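As a sketch of how such an index might be computed, the function below assumes that the "upper width" is the distance from the point estimate to the upper confidence limit and the "lower width" is the distance from the lower limit to the point estimate. The slides do not spell out this definition, and the numbers used are hypothetical, not results from the experiments.

```python
def ci_width_ratio(estimate, lower, upper):
    """Ratio of upper to lower confidence-interval half-widths.

    Assumed definition: (upper - estimate) / (estimate - lower).
    """
    upper_width = upper - estimate
    lower_width = estimate - lower
    if upper_width <= 0 or lower_width <= 0:
        raise ValueError("estimate must lie strictly inside the interval")
    return upper_width / lower_width


# Hypothetical intervals for illustration only.
print(ci_width_ratio(1.0, 0.5, 1.5))   # symmetric interval -> 1.0
print(ci_width_ratio(2.0, 0.5, 8.0))   # asymmetric interval -> 4.0
```

A symmetric interval gives a ratio of 1, so under this definition the index measures how asymmetric the profile-likelihood interval is around the estimate.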
Ratio of Firth Profile Likelihood 95% CI widths
Ratio = UCIwidth/LCIwidth
• Ratios were calculated for each interior point
[Figure: Ratio (vertical axis) vs. number of layers from the boundary (horizontal axis), with a fitted polynomial]
Thank you! [email protected]