Bias corrected estimates for logistic regression models
for complex surveys with application to the United
States’ Nationwide Inpatient Sample
Kevin A. Rader
Harvard School of Public Health, Boston, MA, U.S.A.
Stuart R. Lipsitz
Brigham and Women’s Hospital, Boston, MA, U.S.A.
Garrett M. Fitzmaurice
Harvard Medical School, Boston, MA, U.S.A.
David P. Harrington
Harvard School of Public Health, Boston, MA, U.S.A.
Michael Parzen
Statistics Department, Harvard University, Cambridge, MA, U.S.A.
Debajyoti Sinha
Florida State University, Tallahassee, FL, U.S.A.
September 18, 2014
Abstract
For complex surveys with a binary outcome, logistic regression is widely used to
model the outcome as a function of covariates. Complex survey sampling designs are
typically stratified cluster samples, but consistent and asymptotically unbiased estimates
of the logistic regression parameters can be obtained using weighted estimating equations
(WEE) under the naive assumption that subjects within a cluster are independent (Zeger,
1983). Despite the relatively large samples typical of many complex surveys, with rare
outcomes, many interaction terms, or analysis of subgroups, the logistic regression parameter
estimates from WEE can be severely biased, just as with independent samples. In
this paper, we propose bias-corrected weighted estimating equations for complex survey
data. The proposed method is motivated by a study of post-operative complications in
laparoscopic cystectomy, using data from the 2009 United States’ Nationwide Inpatient
Sample complex survey of hospitals.
Key words: Binary responses; Bladder cancer; Population survey; Stratified cluster
sampling; Weighted estimating equations.
1 Introduction
Binary responses are commonplace in studies in the medical, behavioral and social sciences.
For example, a practitioner may be interested in determining whether or not a
patient contracts a disease or complication based on a measurable set of predictors,
e.g., age, sex, or environmental exposure factors. The logistic regression model is the
most commonly used model for predicting a binary outcome from a set of measurable
covariates. For independent observations, maximum likelihood is the method of choice
for estimating the logistic regression model parameters.
However, for independent observations, when the sample size is relatively small or
when the binary outcome is either rare or very prevalent (even in large samples), maximum
likelihood can yield biased estimates of the logistic regression parameters. In certain cases,
when the data have complete or quasi-complete separation, the likelihood may not have a
unique solution1. Techniques have been developed to remove the first-order term in the
Taylor series expansion of the asymptotic bias of the maximum likelihood estimator2−4.
This approach is straightforward to implement when observations are sampled indepen-
dently. For the case of logistic regression with independent subjects, there have been
numerous methods proposed for handling these issues, such as exact logistic regression or
the bias-correcting approach discussed above3; however, such approaches have not been
well-studied for binary data from complex sampling schemes. The focus of this paper is on
bias-corrected estimates of the parameters for the logistic regression model when the data
arise from complex surveys with stratified and clustered designs. Consistent and asymp-
totically unbiased estimates of the logistic regression parameters can be obtained using
weighted estimating equations (WEE), which incorporate the complex survey weights but
naively assume independence among observations within a cluster5,6. A consistent vari-
ance estimate can be obtained using a robust sandwich variance estimate that accounts for
the stratification, clustering, and weighting5,6. Although the WEE estimates are consistent, even with the large
samples typical of many complex surveys, with rare outcomes, many interaction terms, or
analysis focusing on subgroups, the logistic regression parameter estimates from WEE
can be severely biased, just as with independent samples.
Our proposed method is motivated by a study from the 2009 National Inpatient
Sample (NIS) that investigated laparoscopic cystectomies to treat bladder cancer 7; here
we use more recent (2010) NIS bladder cancer data. Subjects were identified from the US
Healthcare Cost and Utilization Project (HCUP) Nationwide Inpatient Sample (NIS),
sponsored by the Agency for Healthcare Research and Quality8. It is the largest all-
payer inpatient care observational cohort in the United States and is representative of
approximately 90% of all hospital discharges. The NIS is a 20% stratified probability
sample that encompasses approximately 8 million acute hospital stays per year from
approximately 1000 hospitals (clusters) in 45 states. In the NIS, there are 60 strata
based on five key hospital characteristics. Because approximately 20% of the universe
of hospitals are sampled, the weight (or inverse probability of being sampled) for each
patient is usually close to five.
In this paper, we analyze patients from the first 6 months of the 2010 NIS who received
laparoscopic cystectomies to treat bladder cancer (n = 385). The primary objective of the
study was to compare robot-assisted laparoscopic radical cystectomy (RARC) and open
radical cystectomy (ORC) for treatment of bladder cancer. We focus on the primary
binary endpoint of whether or not the patient contracted a wound infection (yes or no)
after surgery. The target of inference is the difference in the probability of a patient
experiencing an infection of the wound area comparing RARC to ORC. There are three
a priori potential confounding factors associated with wound infection: age, sex, and
whether the subject had one or more comorbidities, which are summarized for the two
groups in Table 1. In our sample from the NIS there were 17 (5.0%) wound complications
in the 343 patients who received standard ORC; none of the 42 patients who received
robot-assisted treatment, RARC, experienced a wound complication. This result leads
to the classic issue of separation in the response for two treatment groups, and motivated
us to explore a new analytic approach to handle this issue in the complex survey setting.
In Section 2, we briefly describe the complex sampling design, the standard weighted
estimating equations (WEE) for the logistic regression model for complex surveys and our
bias-corrected version of WEE. In Section 3, we apply this approach to logistic regression
analyses of the data from the study of post-operative complications in laparoscopic
cystectomy7. In Section 4, we present results of a small-scale simulation study of our bias
correction for the logistic regression model. In the example and simulations, we compare
our approach to the standard WEE for complex surveys without bias correction.
2 Methods
2.1 Notation for Complex Surveys
The most common type of complex survey design is a stratified cluster design. Further,
more complex multi-stage designs can be approximated as a stratified cluster design (Kish,
1965). Thus here we use notation for stratified cluster designs. We let $y_{hij}$ represent the
Bernoulli outcome for the $j$th subject ($j = 1, \ldots, m_{hi}$) in the $i$th cluster ($i = 1, \ldots, n_h$)
within the $h$th stratum ($h = 1, \ldots, H$). Note that we assume there are $H$ strata, $n_h$
clusters in stratum $h$, and $m_{hi}$ subjects in cluster $i$ of stratum $h$. Let the indicator
variable $\delta_{hij}$ equal 1 if subject $hij$ is selected into the sample and equal 0 otherwise. The
probability of being selected into the survey, $P(\delta_{hij} = 1) = p_{hij}$, is fixed by the study
design and may depend on the outcome of interest, the covariates, or additional variables
(screening variables, for example) not in the logistic regression model for the outcome of
interest. Thus, each subject in the sample has a known ‘weight’ $w_{hij} = \delta_{hij}/p_{hij}$. We let
$\pi_{hij}$ be the probability that $Y_{hij} = 1$, which follows the standard logistic regression model:

$$\pi_{hij} = P(Y_{hij} = 1 \mid x_{hij}, \beta) = \frac{\exp(\beta' x_{hij})}{1 + \exp(\beta' x_{hij})},$$

where $x_{hij}$ is a $(k + 1) \times 1$ vector of covariates including the constant term for the $hij$th
observation, and $\beta$ is a $(k + 1) \times 1$ parameter vector including the intercept term.
To obtain consistent estimates of $\beta$ in complex surveys, one needs to incorporate the
subject-specific sampling weights, $w_{hij}$, into the logistic regression estimating equations.
Weighted estimating equations (WEE), which naively assume subjects are independent,
have been shown to give consistent estimates10, and are of the form $U(\hat\beta) = 0$, where:

$$U(\beta) = \sum_{h=1}^{H} \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} w_{hij}\, x_{hij} \left(y_{hij} - \pi_{hij}\right). \qquad (1)$$
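To make equation (1) concrete, the following is a minimal sketch, not the authors' implementation, that evaluates the weighted score and solves U(β) = 0 by Newton-Raphson for a model with an intercept and one covariate. The function names, the restriction to two parameters, and the toy data are illustrative assumptions; strata and clusters play no role in the point estimate, so they are omitted here.

```python
import math

def expit(eta):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-eta))

def wee_score(beta, rows):
    """U(beta) = sum of w * x * (y - pi) over rows of (weight, x_vector, y)."""
    score = [0.0] * len(beta)
    for w, x, y in rows:
        pi = expit(sum(b * xk for b, xk in zip(beta, x)))
        for k, xk in enumerate(x):
            score[k] += w * xk * (y - pi)
    return score

def solve_wee_2param(rows, n_iter=50, tol=1e-10):
    """Newton-Raphson for a two-parameter model, x = (1, x1)."""
    beta = [0.0, 0.0]
    for _ in range(n_iter):
        u = wee_score(beta, rows)
        # Weighted information matrix: sum of w * pi * (1 - pi) * x x'
        a = b = d = 0.0
        for w, x, _ in rows:
            pi = expit(beta[0] * x[0] + beta[1] * x[1])
            v = w * pi * (1.0 - pi)
            a += v * x[0] * x[0]
            b += v * x[0] * x[1]
            d += v * x[1] * x[1]
        det = a * d - b * b
        step = [(d * u[0] - b * u[1]) / det, (a * u[1] - b * u[0]) / det]
        beta = [beta[0] + step[0], beta[1] + step[1]]
        if max(abs(s) for s in step) < tol:
            break
    return beta

# Hypothetical toy data: (sampling weight, (1, x1), y)
rows = [
    (5.0, (1.0, 0.0), 1), (5.0, (1.0, 0.0), 0), (10.0, (1.0, 0.0), 0),
    (4.0, (1.0, 1.0), 1), (6.0, (1.0, 1.0), 1), (10.0, (1.0, 1.0), 0),
]
beta_hat = solve_wee_2param(rows)
```

With a single binary covariate the solution can be checked by hand: the fitted probabilities equal the weighted proportions in each group (0.25 and 0.5 in the toy data), so the intercept is logit(0.25) = -ln 3 and the slope is ln 3.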
The standard multivariable estimating equations can be modified to correct for the
first-order bias2. This can be done by replacing the responses, $y_{hij}$, with ’pseudo-responses’,
$y^*_{hij}$:

$$y^*_{hij} = y_{hij} + a_{hij},$$

where $a_{hij}$ represents an adjustment to the observed response, $y_{hij}$, and is defined as:

$$a_{hij} = 0.5\,\mathrm{tr}\left[\mathrm{Var}(\hat\beta)\, D''_{hij}\right], \qquad (2)$$

where $D''_{hij}$ is the second derivative matrix of the logistic function, $\pi_{hij}$, with respect to
$\beta$. Specifically, for the logistic regression model, $D''_{hij}$ reduces to
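
$$
\begin{aligned}
D''_{hij} = \frac{\partial^2}{\partial\beta^2}\left[\pi_{hij}\right]
&= \frac{\partial^2}{\partial\beta^2} \left( \left[1 + \exp(-x'_{hij}\beta)\right]^{-1} \right) \\
&= \frac{\partial}{\partial\beta} \left( x_{hij} \left[1 + \exp(-x'_{hij}\beta)\right]^{-2} \exp(-x'_{hij}\beta) \right) \\
&= \frac{\partial}{\partial\beta} \left( x_{hij}\, \pi_{hij} \left[1 - \pi_{hij}\right] \right) \\
&= \frac{\partial \pi_{hij}}{\partial\beta}\, \frac{\partial}{\partial\pi_{hij}} \left( x_{hij}\, \pi_{hij} \left[1 - \pi_{hij}\right] \right) \\
&= \left( x_{hij}\, \pi_{hij} \left[1 - \pi_{hij}\right] \right) \left( \left[1 - 2\pi_{hij}\right] x'_{hij} \right) \\
&= x_{hij} x'_{hij}\, \pi_{hij} (1 - \pi_{hij}) (1 - 2\pi_{hij}),
\end{aligned}
$$
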
and the adjustment factor, $a_{hij}$, simplifies to:

$$
\begin{aligned}
a_{hij} &= 0.5\,\mathrm{tr}\left[\mathrm{Var}(\hat\beta)\, D''_{hij}\right] \\
&= 0.5\,\mathrm{tr}\left[\mathrm{Var}(\hat\beta)\, x_{hij} x'_{hij}\, \pi_{hij}(1-\pi_{hij})(1-2\pi_{hij})\right] \\
&= 0.5\,\pi_{hij}(1-\pi_{hij})(1-2\pi_{hij})\, \mathrm{Var}(x'_{hij}\hat\beta),
\end{aligned}
$$

since $\mathrm{Var}(x'_{hij}\hat\beta) = x'_{hij}\mathrm{Var}(\hat\beta)\,x_{hij}$ is a scalar. Note, in generalized linear model terminology, $\mathrm{logit}(\pi_{hij}) =
x'_{hij}\beta$ is referred to as the ’linear predictor’. Thus, the adjustment term is a simple function
of $\pi_{hij}$ and the variance of the estimated linear predictor. When there are no sampling
weights involved, the adjustment term $a_{hij}$ is equivalent to the adjustment for ordinary
logistic regression3. Replacing $y_{hij}$ in (1) with $y^*_{hij}$, the bias-reduced estimating equations
become:
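
$$
\begin{aligned}
U(\beta)^* &= \sum_{h=1}^{H} \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} w_{hij}\, x_{hij} \left( y^*_{hij} - \pi_{hij} \right) \\
&= \sum_{h=1}^{H} \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} w_{hij}\, x_{hij} \left( y_{hij} + a_{hij} - \pi_{hij} \right) \\
&= \sum_{h=1}^{H} \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} w_{hij}\, x_{hij} \left\{ y_{hij} + 0.5\,\pi_{hij}(1-\pi_{hij})(1-2\pi_{hij})\, \mathrm{Var}(x'_{hij}\hat\beta) - \pi_{hij} \right\} \\
&= \sum_{h=1}^{H} \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} w_{hij}\, x_{hij} \left\{ y_{hij} - \pi_{hij} + \left[ \pi_{hij}(1-\pi_{hij})\, \mathrm{Var}(x'_{hij}\hat\beta) \right] (0.5 - \pi_{hij}) \right\} \\
&= \sum_{h=1}^{H} \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} w_{hij}\, x_{hij} \left\{ y_{hij} - \pi_{hij} + q_{hij}(0.5 - \pi_{hij}) \right\}, \qquad (3)
\end{aligned}
$$

where $q_{hij} = \pi_{hij}(1 - \pi_{hij})\,\mathrm{Var}(x'_{hij}\hat\beta)$. For standard logistic regression, the term $q_{hij}$
reduces to the leverage for observation $hij$.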
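As a small numerical illustration of the quantities in (2) and (3), the sketch below computes the variance of the linear predictor as a quadratic form and then the terms q and a for one observation. The covariance matrix, covariate vector, and fitted probability are hypothetical values chosen only to show the arithmetic; the function names are likewise illustrative.

```python
def var_linear_predictor(x, cov):
    """Quadratic form x' Cov x, i.e. the variance of the estimated linear predictor."""
    p = len(x)
    return sum(x[j] * cov[j][k] * x[k] for j in range(p) for k in range(p))

def adjustment_terms(pi, var_lp):
    """Return (q, a) with q = pi(1 - pi) Var(x'beta-hat) and a = q (0.5 - pi)."""
    q = pi * (1.0 - pi) * var_lp
    a = q * (0.5 - pi)
    return q, a

# Hypothetical inputs (not taken from the paper)
cov = [[0.5, -0.1], [-0.1, 0.3]]   # assumed Var(beta-hat)
x = [1.0, 1.0]                     # intercept and one covariate
pi = 0.2                           # assumed fitted probability

var_lp = var_linear_predictor(x, cov)   # 0.5 - 0.1 - 0.1 + 0.3, about 0.6
q, a = adjustment_terms(pi, var_lp)     # q about 0.096, a about 0.0288
```

Note that a = q(0.5 - π) is the same number as 0.5 π(1 - π)(1 - 2π) Var(x'β̂), which is the form given before equation (3); the adjustment is positive when π < 0.5 and negative when π > 0.5, so it pulls the pseudo-response toward 0.5.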
Bias-corrected estimates can also be calculated by splitting each of the original ob-
servations into two new observations: one taking value yhij and the other taking value
1 − yhij but with weights 1 + qhij/2 and qhij/2, respectively11. Extending these results
to complex surveys with weights whij, we propose the use of the weights whij(1 + qhij/2)
and whij(qhij/2) for yhij and 1 − yhij, respectively. Thus each individual contributes
{(yhij − πhij)whij(1 + qhij/2) + (1 − yhij − πhij)whij(qhij/2)}xhij to the score function, which
can be shown to be mathematically equivalent to the following weighted estimating equations:
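
$$
\begin{aligned}
U(\beta)^* &= \sum_{h=1}^{H} \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} \left\{ (y_{hij} - \pi_{hij})\, w_{hij} (1 + q_{hij}/2) + (1 - y_{hij} - \pi_{hij})\, w_{hij} (q_{hij}/2) \right\} x_{hij} \\
&= \sum_{h=1}^{H} \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} \left\{ y_{hij} - \pi_{hij} - \pi_{hij} q_{hij}/2 + q_{hij}/2 - \pi_{hij} q_{hij}/2 \right\} w_{hij}\, x_{hij} \\
&= \sum_{h=1}^{H} \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} \left\{ y_{hij} - \pi_{hij} - \pi_{hij} q_{hij} + q_{hij}/2 \right\} w_{hij}\, x_{hij} \\
&= \sum_{h=1}^{H} \sum_{i=1}^{n_h} \sum_{j=1}^{m_{hi}} \left\{ (y_{hij} - \pi_{hij}) + q_{hij}(1/2 - \pi_{hij}) \right\} w_{hij}\, x_{hij}. \qquad (4)
\end{aligned}
$$
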
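The algebra in (4) can be checked numerically. The sketch below, with illustrative function names, compares the contribution of a split observation (the response y with weight w(1 + q/2) plus the pseudo-observation 1 - y with weight wq/2) against the adjusted-score form on the last line of (4), over a small grid of hypothetical values.

```python
def split_contribution(y, pi, w, q):
    """Score contribution (per unit of x) from the two split observations."""
    return (y - pi) * w * (1.0 + q / 2.0) + (1.0 - y - pi) * w * (q / 2.0)

def adjusted_contribution(y, pi, w, q):
    """Score contribution (per unit of x) in the form of the last line of (4)."""
    return w * ((y - pi) + q * (0.5 - pi))

# Largest absolute discrepancy over a grid of hypothetical values
max_diff = max(
    abs(split_contribution(y, pi, w, q) - adjusted_contribution(y, pi, w, q))
    for y in (0, 1)
    for pi in (0.01, 0.2, 0.5, 0.9)
    for w in (1.0, 5.0)
    for q in (0.0, 0.05, 0.3)
)
```

The discrepancy is zero up to floating-point rounding, which is why the correction can be carried out by refitting a weighted logistic regression on the doubled data set.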
Even though complex surveys commonly have large sample sizes, the issue of separation
can arise when domains are small or subgroup analyses are performed. The bladder
cancer study mentioned in the Introduction is an example where this has occurred, as the
number of wound infections in the robotic treatment arm is zero. In the bias-corrected
WEE in (4), yhij = 1 and yhij = 0 have positive weights, which is equivalent to there being
a non-zero number of successes (yhij = 1) and failures (yhij = 0) at each value of xhij.
Because of this property, the adjusted weighted estimating equations (equivalent to standard
logistic regression with weights) we propose have a unique, finite solution (assuming
the design matrix is full rank)12. Splitting each observation into two eliminates
the problem of separation when the response variable is composed of all successes or all
failures for a specific combination of the covariates, and allows for the use of standard
complex survey software, e.g., svyglm in R, to perform the analysis.
Theory suggests that using a consistent estimate of the true Var(xhijβ̂) in qhij should
reduce the bias2. Two approaches for consistently estimating Var(β̂) in qhij, and thus
Var(xhijβ̂) in qhij, are the standard sandwich estimator based on a Taylor series and the
small-sample bias-corrected estimator of variance developed by Morel, et al5,13. Because
the focus of this paper is on bias-reduction, and the standard sandwich variance estimator
can be highly variable for rare events, or a small number of large clusters, we use the small-
sample bias-corrected variance estimator proposed by Morel, et al to estimate Var(β̂)
instead of the standard sandwich variance estimator13. In particular, in simulations the
small-sample bias-corrected variance estimator has been shown to be much less variable
and more stable than the standard sandwich variance estimator 13. Thus, to calculate qhij,
we consider two approaches: (a) naively assuming independence among the observations
and (b) using the bias-corrected sandwich variance estimator13.
2.2 Algorithm for Bias-Corrected Estimates
To obtain the first-order bias-corrected estimates of β, one can iterate between updating
qhij given a current estimate of β and V̂ar(xhijβ̂), and then re-estimating β and V̂ar(xhijβ̂)
given the updated qhij by solving (4), until the estimates of β converge.
In particular, we start by initializing $q_{hij} = k / \left(\sum_{h=1}^{H} \sum_{i=1}^{n_h} m_{hi}\right)$, which is the average
value of the $q_{hij}$ when the observations are independent. This initial estimate is
equivalent to the ad hoc bias-reduction approach proposed by Clogg, et al with independent
observations14. Note that $\sum_{h=1}^{H} \sum_{i=1}^{n_h} m_{hi}$ is the total sample size. We then iterate
between the following two steps until convergence of $\hat\beta$ is obtained:
1. Calculate the complex survey based estimates of β, but with modified survey weights
of whij(1 + qhij/2) for the original yhij and weights of whij(qhij/2) for the pseudo-
observations 1 − yhij, where whij is the original sampling weight (this is straight-
forward to implement using svyglm in R or a similar procedure in another software
program).
2. Recalculate qhij based on the estimates, π̂hij and V̂ar(xhijβ̂), from the logistic re-
gression model estimates in the previous step.
Note that in this iterative procedure, Var(β̂) and thus qhij is calculated either under
independence or using the Morel, et al bias-corrected sandwich variance estimator13. Fur-
ther, after convergence, the Morel, et al sandwich variance estimator is used to estimate
the variance of β̂ in making inferences from both approaches13. That is, for the setting
where our proposed approach is most applicable (rare events, small domains, many in-
teraction terms), the bias-corrected variance estimator of β̂ proposed by Morel, et al is
needed13.
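The two-step iteration described in this section can be sketched as follows. This is a simplified illustration under several assumptions that are not taken from the paper: the model has only an intercept and one covariate, strata and clusters are ignored, the "independence" variance is computed as the sandwich A⁻¹BA⁻¹ with A = Σ wπ(1 - π)xx' and B = Σ w²π(1 - π)xx' on the original (unsplit) data, and the Morel, et al estimator is not implemented. All function names are hypothetical.

```python
import math

def expit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def inv2(m):
    """Inverse of a 2 x 2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def mul2(p, r):
    """Product of two 2 x 2 matrices."""
    return [[sum(p[i][k] * r[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def moments(beta, rows, power):
    """Sum of w**power * pi * (1 - pi) * x x' over rows of (w, x1, y), x = (1, x1)."""
    m = [[0.0, 0.0], [0.0, 0.0]]
    for w, x1, _ in rows:
        pi = expit(beta[0] + beta[1] * x1)
        v = (w ** power) * pi * (1.0 - pi)
        x = (1.0, x1)
        for i in range(2):
            for j in range(2):
                m[i][j] += v * x[i] * x[j]
    return m

def fit_weighted_logit(rows, n_iter=100, tol=1e-12):
    """Weighted logistic fit by Newton-Raphson (step 1 of the algorithm)."""
    beta = [0.0, 0.0]
    for _ in range(n_iter):
        u = [0.0, 0.0]
        for w, x1, y in rows:
            r = w * (y - expit(beta[0] + beta[1] * x1))
            u[0] += r
            u[1] += r * x1
        a_inv = inv2(moments(beta, rows, 1))
        step = [a_inv[0][0] * u[0] + a_inv[0][1] * u[1],
                a_inv[1][0] * u[0] + a_inv[1][1] * u[1]]
        beta = [beta[0] + step[0], beta[1] + step[1]]
        if max(abs(s) for s in step) < tol:
            break
    return beta

def bias_corrected_fit(rows, n_outer=100, tol=1e-8):
    n, k = len(rows), 1              # k = number of covariates besides the intercept
    q = [k / n] * n                  # initial value k / (total sample size)
    beta = [0.0, 0.0]
    for _ in range(n_outer):
        # Step 1: split each observation and refit with the modified weights.
        split = []
        for (w, x1, y), qi in zip(rows, q):
            split.append((w * (1.0 + qi / 2.0), x1, y))
            split.append((w * qi / 2.0, x1, 1 - y))
        new_beta = fit_weighted_logit(split)
        # Step 2: recompute q from the fitted pi and the assumed independence variance.
        a_inv = inv2(moments(new_beta, rows, 1))
        var = mul2(mul2(a_inv, moments(new_beta, rows, 2)), a_inv)
        q = []
        for w, x1, _ in rows:
            pi = expit(new_beta[0] + new_beta[1] * x1)
            var_lp = var[0][0] + 2.0 * x1 * var[0][1] + x1 * x1 * var[1][1]
            q.append(pi * (1.0 - pi) * var_lp)
        done = max(abs(nb - ob) for nb, ob in zip(new_beta, beta)) < tol
        beta = new_beta
        if done:
            break
    return beta

# Hypothetical unit-weight data echoing the counts quoted in the Introduction:
# 17 events among 343 with x1 = 0, and 0 events among 42 with x1 = 1.
rows = [(1.0, 0, 1)] * 17 + [(1.0, 0, 0)] * 326 + [(1.0, 1, 0)] * 42
beta_bc = bias_corrected_fit(rows)
```

In this toy setting the ordinary fit diverges because the second group has no events, whereas the corrected fit is finite. With unit weights and a single binary covariate, the q values equal the leverages (1/343 and 1/42), so the fixed point coincides with adding one half to each cell of the 2 x 2 table. Because the example ignores survey weights, the design, and the other covariates, its output is not comparable to Table 2.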
3 Application to Bladder Cancer Study
In this section, we apply the proposed methods to the analysis of the radical cystectomy
data from the National Inpatient Sample (NIS) described in the Introduction. This anal-
ysis of the NIS includes 385 patients (using the weights, representing 1976 patients in the
population) undergoing radical cystectomy to treat bladder cancer throughout the United
States. The outcome of interest is binary: whether or not the patient experienced wound
infections post-surgery (1=infection, 0 = no infection) while staying at the hospital. Our
main comparison of interest is to determine whether the probability of wound infections
differs between the two types of cystectomy: standard open radical cystectomy (ORC)
and robot-assisted radical cystectomy (RARC). Based on an earlier study, a priori, we
conjectured that robot-assisted surgeries would have a lower rate of infection7. There are
a total of 16 wound infections, all in the ORC group. With only 16 complications, the
number of predictors that can be included in a logistic regression model is limited; follow-
ing the results of Vittinghoff15, the model should be limited to no more than 3 covariates.
Based on the earlier study, the covariates most predictive of wound infection are surgery
type (ORC and RARC), age, and sex7.
To examine the relationship between post-surgery wound infection and these 3 covariates,
we fit the logistic regression model,

$$\mathrm{logit}[\pi_i] = \mathrm{logit}\{P[Y_i = 1 \mid x_i, \beta]\} = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i}, \qquad (5)$$

where x1i is the surgery type (x1i = 1 for robot-assisted and 0 for standard open radical
cystectomy), x2i is the age of the patient, in years, and x3i is the sex of the ith patient
(x3i = 1 for females and 0 for males).
Table 2 gives the estimates of β obtained using the two bias-corrected approaches we
propose, in addition to the Clogg, et al approach with $q_{hij} = k / \left(\sum_{h=1}^{H} \sum_{i=1}^{n_h} m_{hi}\right)$ and the
standard WEE estimates (the latter were obtained using R svyglm)14. Of note, there were
no convergence problems with the three bias-corrected methods. However, because there
were no complications in the robotic arm, the coefficient for β1 was converging to −∞ for
WEE; the results for WEE reported in Table 2 are the estimates at the 25th iteration (the
default maximum number of iterations in R’s svyglm function in the package survey).
Using the independence variance when calculating the adjustment term, the estimated
odds ratio (OR) for surgery type, controlling for age and sex, is $e^{-2.774} = 0.062$; when the
small-sample bias-corrected variance is used the OR is estimated to be $e^{-2.917} = 0.054$. For
the Clogg estimator, the OR is estimated to be $e^{-2.443} = 0.087$. For all of the methods,
the standard errors reported in Table 2 are based on the variance estimator of Morel,
et al13. When comparing the estimates of β to their standard errors, we see that the
standard WEE (the 25th iteration in R svyglm) produces a much more significant result
than the bias-corrected approaches. The three bias-corrected approaches produce very
similar estimates, and all lead to the same conclusion if a test of the null hypothesis were
conducted at the α = 0.05 level.
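The odds ratios quoted above follow directly from exponentiating the surgery-type coefficients in Table 2. The short sketch below reproduces that arithmetic and, as an added illustration, forms normal-based Wald intervals on the odds-ratio scale; the use of the 1.96 multiplier is an assumption made here, and the paper may have used a different reference distribution.

```python
import math

# Surgery-type estimates and standard errors as reported in Table 2
robot_effect = {
    "Independence": (-2.774, 1.309),
    "Morel": (-2.917, 1.168),
    "Clogg": (-2.443, 0.993),
}

odds_ratios = {name: math.exp(est) for name, (est, se) in robot_effect.items()}

# Assumed normal-based 95% Wald limits, transformed to the odds-ratio scale
wald_limits = {
    name: (math.exp(est - 1.96 * se), math.exp(est + 1.96 * se))
    for name, (est, se) in robot_effect.items()
}

rounded = {name: round(value, 3) for name, value in odds_ratios.items()}
# rounded == {"Independence": 0.062, "Morel": 0.054, "Clogg": 0.087}
```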
For the other covariates in the model, age and sex, the estimates of their effects on the
probability of wound infection are relatively stable. The estimated odds ratio is between
$e^{-10(0.0325)} = 0.72$ and $e^{-10(0.0193)} = 0.82$ for every 10-year increase in age, and the OR is
estimated to be between $e^{0.662} = 1.94$ and $e^{0.808} = 2.24$ when comparing females to males.
Whereas age but not sex is significant using the standard WEE and the Clogg method,
neither is significant using the two proposed bias-corrected approaches.
In summary, the results of analyses of the bladder cancer data highlight how the
standard WEE approach and the bias-corrected methods can produce somewhat different
estimates of effects. To examine the finite sample bias of these approaches, we conducted
a simulation study; the results of the simulation study are reported in the next section.
4 Simulation Study
In this section, we study the empirical relative bias in estimating β using standard logis-
tic regression models incorporating the complex survey structure (WEE) and our bias-
reduced approach using 3 different variance estimators when calculating the multiplicative
weighting factor, qhij: variance under independence (Independence), variance using the
bias-corrected sandwich estimator of Morel, et al. (Morel), along with the Clogg, et al.
approach (Clogg)13,14. For simplicity, in the simulation study, we used a cluster design
without stratification and weighting, where sampling of clusters was performed without
replacement from a finite population of clusters.
For the simulations, the true marginal logistic model for any subject in the population
is

$$\mathrm{logit}(P[Y_{ij} = 1 \mid x_{ij}]) = \beta_0 + \sum_{k=1}^{10} \beta_k x_{ijk}, \qquad (6)$$

where the ten $x_{ijk}$'s are independent Bern($p_x$) variables. The intercept $\beta_0$ was chosen so
that the average $P[Y_{ij} = 1]$ equals 0.20. This marginal model is similar to that used in a
simulation study performed by Heinze and Schemper11. For simplicity, we set all ten $\beta_k$
equal to the same value,

$$\mathrm{logit}(P[Y_{ij} = 1 \mid x_{ij}]) = \beta_0 + \beta \sum_{k=1}^{10} x_{ijk}.$$
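One way to choose the intercept so that the average event probability equals 0.20 is sketched below. Because all ten coefficients are equal, the linear predictor depends only on the number of covariates equal to 1, which is Binomial(10, p_x); the average probability is then a finite sum and the intercept can be found by bisection. The value p_x = 0.5 is a hypothetical choice for illustration, since the text does not state it, and the function names are not from the paper.

```python
import math

def mean_event_prob(beta0, beta, px, n_cov=10):
    """Average P[Y = 1] over the Binomial(n_cov, px) distribution of sum_k x_k."""
    total = 0.0
    for s in range(n_cov + 1):
        pmf = math.comb(n_cov, s) * px ** s * (1.0 - px) ** (n_cov - s)
        total += pmf / (1.0 + math.exp(-(beta0 + beta * s)))
    return total

def solve_intercept(beta, px, target=0.20, lo=-50.0, hi=50.0):
    """Bisection on beta0; the mean probability is increasing in beta0."""
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if mean_event_prob(mid, beta, px) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Hypothetical covariate prevalence p_x = 0.5 with the three effect sizes used later
intercepts = {b: solve_intercept(math.log(b), 0.5) for b in (2, 4, 16)}
```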
To simulate the clustered data, we use the random intercept logistic regression model
proposed by Wang and Louis and further developed by Parzen, et al.16,17. In particular,
the conditional subject-specific logistic regression model is

$$\mathrm{logit}(P[Y_{ij} = 1 \mid x_{ij}, b_i]) = b_i + \left( \beta_0 + \beta \sum_{k=1}^{10} x_{ijk} \right) / \phi, \qquad (7)$$

where, given the cluster-specific random effect $b_i$, the $Y_{ij}$'s from the same cluster are
independent Bernoulli random variables. When $b_i$ follows a ‘bridge’ distribution, the
marginal logistic regression equals that given in (6)16. The bridge random variable has
mean 0 and $\phi$ is the rescaling parameter. In particular,

$$\mathrm{Var}(b_i) = \frac{\pi^2}{3} \left( \frac{1}{\phi^2} - 1 \right),$$

so that the larger the value of $\phi$ ($0 < \phi < 1$), the smaller the variance (and the lower the
correlation between pairs of random variables in the same cluster).
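The data-generating step in (7) can be sketched as below. The bridge random effect is drawn by inverting its distribution function, using the quantile function b = (1/φ) log[sin(φπu) / sin(φπ(1 - u))] for u uniform on (0, 1), which is the form associated with the bridge density of Wang and Louis; the function names, the seed, and the parameter values in the usage lines are illustrative assumptions rather than the authors' settings.

```python
import math
import random

def expit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def rbridge(phi, rng):
    """One draw from the bridge distribution with rescaling parameter 0 < phi < 1."""
    u = rng.random()
    while u == 0.0:          # avoid log(0) in the quantile function
        u = rng.random()
    return math.log(math.sin(phi * math.pi * u)
                    / math.sin(phi * math.pi * (1.0 - u))) / phi

def simulate_cluster(m, beta0, beta, phi, px, rng, n_cov=10):
    """Simulate one cluster of size m from the conditional model (7)."""
    b = rbridge(phi, rng)    # shared random intercept for the cluster
    cluster = []
    for _ in range(m):
        x = [1 if rng.random() < px else 0 for _ in range(n_cov)]
        eta = b + (beta0 + beta * sum(x)) / phi
        y = 1 if rng.random() < expit(eta) else 0
        cluster.append((x, y))
    return cluster

# Illustrative usage with hypothetical settings
rng = random.Random(12345)
phi = 0.7
draws = [rbridge(phi, rng) for _ in range(100000)]
mean_b = sum(draws) / len(draws)
var_b = sum((d - mean_b) ** 2 for d in draws) / (len(draws) - 1)
theoretical_var = (math.pi ** 2 / 3.0) * (1.0 / phi ** 2 - 1.0)
cluster = simulate_cluster(5, -4.0, math.log(2), phi, 0.5, rng)
```

Two checks are useful when adapting this sketch: the sample variance of the draws should be close to the variance formula above, and averaging expit(b + η/φ) over the draws should return approximately expit(η), which is the property that makes the marginal model logistic.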
We denote the population number of clusters by N , which we set to N = 400, the
number of sampled clusters by n, and the cluster size by mi (we assume all clusters are
of the same size, and all members of the cluster are sampled).
We conducted 24 simulation configurations varying the following conditions: the
effect of the covariates, $\beta_k = \beta \in \{\ln(2), \ln(4), \ln(16)\}$ (recall, we set all ten $\beta_k$ to the
same value); cluster sizes, $m_i \in \{5, 10\}$; the bridge distribution's scaling parameter,
$\phi \in \{0.7, 0.9\}$; and the number of clusters sampled, $n \in \{40, 80\}$. For each
simulation configuration, 2000 simulation replications were performed. The convergence
criterion for WEE is that the relative change in the log-likelihood between successive
iterations is less than 0.000001; we report the percentage of simulation replications in
which this convergence criterion was not met. When the standard WEE failed to converge,
we used the estimates from the 25th iteration (the default maximum number of iterations
in R’s svyglm function in the package survey).
Tables 3, 4, and 5 present the relative biases for $\beta_2$, defined as $100(\hat\beta_2 - \beta_2)/\beta_2$, the
mean square error of the estimates, and the empirical coverage probabilities of 95% Wald
confidence intervals for the simulation specifications with $\beta = \ln(2)$, $\ln(4)$, and $\ln(16)$, respectively. Without loss
of generality we report results for β2 only; any of the βk could have been selected for bias
reporting since the model is symmetric across covariates (all covariates are independent
and have the same Bernoulli distribution with all ten βk = β). The results indicate that
the relative bias is greatly reduced, by an order of magnitude, when using any of the three
bias-reduced approaches in comparison to standard WEE. The standard WEE approach
gave average estimated values for β close to zero, suggesting no effect when there truly
was an effect of at least β = 0.69. As a result, the average relative bias for the standard
WEE method ranges from about -76% to -94% across the simulation configurations.
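The three summaries reported in Tables 3 to 5 can be computed from the replication-level output as in the sketch below. The function name and the toy inputs are illustrative, and the interval is the normal-based Wald interval, estimate ± 1.96 standard errors, which is an assumption about the exact multiplier used.

```python
def summarize_replications(estimates, std_errors, true_beta, z=1.96):
    """Return (average relative bias in %, mean square error, empirical coverage)."""
    n = len(estimates)
    mean_est = sum(estimates) / n
    relative_bias = 100.0 * (mean_est - true_beta) / true_beta
    mse = sum((e - true_beta) ** 2 for e in estimates) / n
    covered = sum(
        1 for e, s in zip(estimates, std_errors)
        if e - z * s <= true_beta <= e + z * s
    )
    return relative_bias, mse, covered / n

# Toy example with three hypothetical replications and true value 0.7
rel_bias, mse, coverage = summarize_replications(
    estimates=[0.5, 0.7, 0.9],
    std_errors=[0.1, 0.1, 0.05],
    true_beta=0.7,
)
```

In the toy example the estimates average to the true value, so the relative bias is essentially zero, the mean square error is (0.04 + 0 + 0.04)/3, and only the middle interval covers 0.7.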
In previous simulations for logistic regression with independent observations, it was
found that the Clogg et al. approach was typically less biased than standard logistic
regression using maximum likelihood, but more biased than the Box and Firth bias-correction
approach2,3,11. We have found analogous results for the Clogg type estimator
for complex survey data: the estimator is more biased than the proposed bias-corrected
approaches (generalization of the Firth estimators), but much less biased than WEE
(generalization of maximum likelihood). Overall, these results suggest that the bias-reduced
approach using either the independent variance estimate or the Morel variance
estimate to calculate qhij is the preferred method for performing the analysis.
Although Wald confidence intervals are known to be conservative with large β’s, we
found in nearly all sets of simulations with β2 = 2.77, that the coverage probabilities agree
with the nominal 95% level provided the sample size is large11,18,19. However, we caution
against generalizing this result based on the results of a single simulation study.
5 Discussion
In this paper we have described a simple implementation of bias correction in the logis-
tic regression model for complex surveys. By incorporating an adjustment term to the
weighted estimating equations, we derived a bias correction based on univariate Bernoulli
distributions. This bias correction splits each of the original observations into two new
observations: the original response, yhij, with the original sampling weight, whij, times a
multiplicative factor, 1 + qhij/2, and a pseudo-response, 1 − yhij, with the original weight
times the multiplicative factor minus one, or qhij/2. Since both the response and pseudo-response
have weights that are guaranteed to be positive, the problematic issue of separation
is eliminated12. These pseudo-responses and weights are relatively easy to calculate,
and this approach leads to an iterative algorithm that is straightforward to implement.
Because WEE is the most widely used estimation approach for logistic regression models
in complex surveys, the approach to correct for bias described in this paper should be
useful in applications where there are rare outcomes, many interaction terms, or the focus
of analysis is on particular subgroups of interest.
Although not specifically discussed in this paper, the proposed method can also be
used for any regression model for binary outcomes in complex surveys, including models
that adopt non-canonical link functions, such as probit or complementary log-log links.
Kosmidis described bias-corrected estimating equations for non-canonical links for binary
data with independent observations, where the original observations are split into yhij and
1 − yhij with weights for each that are a function of ahij in (2)20. The proposed approach
could be used to extend these results by incorporating the complex survey sampling
weights into the weights for yhij and 1 − yhij. Further, the approach can be extended to other
generalized linear models using weighted estimating equations for complex surveys. The
bias-corrected approach for complex surveys would be similar to that given in Kosmidis
and Firth for other generalized linear models based on specific link functions21. This can
be implemented by creating a pseudo-response (a function of the outcome and ahij) to
correct for the first-order bias.
Finally, the results of the simulations demonstrate that the proposed method can
greatly reduce the finite sample bias of WEE for estimating logistic regression parameters
for binary data in the complex survey setting. WEE estimates can be biased due to the
issue of separation or quasi-separation, a problem that can occur in large complex surveys
when the outcomes are rare or subgroup analyses are performed. The bias-corrected
methods perform discernibly better than the standard WEE approach for binary data,
suggesting that they could be adopted as the method of choice in regression analyses of
binary outcomes in complex surveys. The standard
bias-corrected logistic regression approach of Firth, where qhij is estimated under the naive
assumption of independence, appears to greatly reduce the bias. Because of its computational
simplicity, this approach may also be somewhat easier to implement using standard statistical
software for logistic regression with complex survey data3.
Acknowledgments
We thank Edward Giovannucci and Caprice Greenberg for advice on the analysis of the
NIS bladder cancer data. We are grateful for the support provided by grant CA 160679
from the U.S. National Institutes of Health.
Table 1: Baseline characteristics (means/percentages and 95% confidence intervals) of bladder cancer patients treated with radical cystectomy in the National Inpatient Sample (NIS).

Characteristic                  Open Radical Cystectomy    Robot-assisted Radical Cystectomy
                                (ORC), n = 343             (RARC), n = 42
Age, years                      68.6 (67.6, 69.6)          67.2 (63.3, 71.1)
Female, %                       15.2 (12.6, 18.1)          11.9 (5.8, 22.9)
One or more comorbidities, %    22.9 (19.2, 27.0)          21.7 (12.1, 36.0)

Note: Results are reported as population estimates using survey weights, strata, and cluster variables.
Table 2: Comparison of WEE logistic regression parameter estimates for the bladder cancer data from the National Inpatient Survey (NIS), n = 343.

Effect      Approach                          Estimate    SE        Z-statistic    P-value
Intercept   Standard WEE                      -1.548      0.868     -1.784         0.079
            Bias-Reduced: Independent Var.    -1.106      1.404     10.817         0.414
            Bias-Reduced: Morel Var.          -1.458      0.909     -1.603         0.109
            Bias-Reduced: Clogg Var.          -1.096      0.849     -1.290         0.201
Robot       Standard WEE                      -15.61      0.527     -29.61         <0.001
            Bias-Reduced: Independent Var.    -2.774      1.309     -2.119         0.034
            Bias-Reduced: Morel Var.          -2.917      1.168     -2.499         0.013
            Bias-Reduced: Clogg Var.          -2.443      0.993     -2.460         0.015
Age         Standard WEE                      -0.0325     0.0160    -2.037         0.045
            Bias-Reduced: Independent Var.    -0.0321     0.0191    -1.552         0.121
            Bias-Reduced: Morel Var.          -0.0193     0.0177    -1.092         0.278
            Bias-Reduced: Clogg Var.          -0.0280     0.0122    -2.275         0.026
Female      Standard WEE                      0.689       0.430     1.601          0.113
            Bias-Reduced: Independent Var.    0.697       0.564     1.160          0.247
            Bias-Reduced: Morel Var.          0.808       0.925     0.874          0.385
            Bias-Reduced: Clogg Var.          0.662       0.430     1.538          0.128
Table 3: Average relative bias and mean square error of β̂2 and empirical coverage probabilities of confidence intervals for each simulation specification where the true parameter values are β2 = ln(2) ≈ 0.69.

Configuration                  Method          Average          Mean            Empirical
                                               Relative Bias    Square Error    Coverage Prob.
mi = 5,  φ = 0.7, n = 40       WEE             -75.79           0.073           0.548
                               Independence    2.54             0.129           0.947
                               Morel           2.48             0.129           0.989
                               Clogg           4.38             0.129           0.948
mi = 5,  φ = 0.7, n = 80       WEE             -75.86           0.071           0.380
                               Independence    0.99             0.056           0.950
                               Morel           0.98             0.056           0.954
                               Clogg           -3.20            0.056           0.943
mi = 5,  φ = 0.9, n = 40       WEE             -85.2            0.421           0.154
                               Independence    -2.89            0.261           0.960
                               Morel           -2.89            0.261           0.982
                               Clogg           -17.78           0.232           0.978
mi = 5,  φ = 0.9, n = 80       WEE             -84.92           0.416           0.053
                               Independence    0.63             0.111           0.959
                               Morel           0.64             0.110           0.952
                               Clogg           -7.59            0.104           0.944
mi = 10, φ = 0.7, n = 40       WEE             -92.2            0.837           0.206
                               Independence    -1.76            0.165           0.971
                               Morel           -1.78            0.166           0.979
                               Clogg           -13.39           0.152           0.978
mi = 10, φ = 0.7, n = 80       WEE             -89.25           0.764           0.074
                               Independence    1.23             0.078           0.948
                               Morel           1.22             0.079           0.966
                               Clogg           -7.44            0.076           0.964
mi = 10, φ = 0.9, n = 40       WEE             -90.1            0.497           0.118
                               Independence    0.72             0.146           0.958
                               Morel           0.71             0.146           0.962
                               Clogg           -8.72            0.135           0.972
mi = 10, φ = 0.9, n = 80       WEE             -89.78           0.496           0.032
                               Independence    -0.18            0.061           0.959
                               Morel           -0.18            0.062           0.982
                               Clogg           -4.81            0.059           0.972

Based on 2000 replications for each simulation for varying levels of the number of observations in each cluster mi, levels of φ for the bridge distribution, and number of clusters sampled, n.
Table 4: Average relative bias and mean square error of β̂2 and empirical coverage prob-abilties of confidence intervals for each simulation specification where the true parametervalues are β2 = ln(4) ≈ 1.39.
                               Average         Mean           Empirical
Configuration   Method         Relative Bias   Square Error   Coverage Prob.
mi = 5
φ = 0.7
n = 40
                WEE            -86.04          2.404          0.192
                Independence   -2.22           0.215          0.982
                Morel          -2.24           0.266          0.970
                Clogg          -9.34           0.202          0.968
n = 80
                WEE            -86.21          2.407          0.196
                Independence   -2.14           0.091          0.960
                Morel          -2.14           0.091          0.959
                Clogg          -5.82           0.090          0.957
φ = 0.9
n = 40
                WEE            -85.16          1.807          0.108
                Independence   -0.69           0.241          0.981
                Morel          -0.72           0.242          0.978
                Clogg          -8.43           0.215          0.971
n = 80
                WEE            -84.99          1.788          0.031
                Independence   -1.56           0.099          0.994
                Morel          -1.55           0.098          0.963
                Clogg          -5.41           0.096          0.961
mi = 10
φ = 0.7
n = 40
                WEE            -86.55          3.202          0.143
                Independence   0.88            0.150          0.969
                Morel          0.82            0.151          0.956
                Clogg          -6.25           0.138          0.954
n = 80
                WEE            -86.43          2.671          0.072
                Independence   0.13            0.064          0.922
                Morel          0.12            0.064          0.962
                Clogg          -3.48           0.062          0.988
φ = 0.9
n = 40
                WEE            -84.95          1.650          0.271
                Independence   0.69            0.108          0.918
                Morel          0.69            0.107          0.983
                Clogg          -3.48           0.100          0.964
n = 80
                WEE            -85.21          1.633          0.043
                Independence   0.44            0.046          0.945
                Morel          0.45            0.047          0.978
                Clogg          -1.51           0.045          0.981
Based on 2000 replications for each simulation for varying levels of the number of observations in each cluster mi, levels of φ for the bridge distribution, and number of clusters sampled, n.
Table 5: Average relative bias and mean square error of β̂2 and empirical coverage probabilities of confidence intervals for each simulation specification where the true parameter value is β2 = ln(16) ≈ 2.77.
                               Average         Mean           Empirical
Configuration   Method         Relative Bias   Square Error   Coverage Prob.
mi = 5
φ = 0.7
n = 40
                WEE            -92.5           2.338          0.073
                Independence   2.12            0.314          0.979
                Morel          2.05            0.313          0.912
                Clogg          -6.47           0.253          0.921
n = 80
                WEE            -92.47          2.339          0.032
                Independence   -0.55           0.118          0.958
                Morel          -0.56           0.117          0.985
                Clogg          -4.60           0.111          0.981
φ = 0.9
n = 40
                WEE            -92.67          6.958          0.092
                Independence   -0.51           0.524          0.948
                Morel          -0.45           0.521          0.917
                Clogg          -16.30          0.512          0.832
n = 80
                WEE            -92.67          6.914          0.011
                Independence   -1.25           0.202          0.965
                Morel          -1.26           0.201          0.973
                Clogg          -9.75           0.226          0.889
mi = 10
φ = 0.7
n = 40
                WEE            -93.41          14.303         0.231
                Independence   0.47            0.478          0.943
                Morel          0.46            0.479          0.940
                Clogg          -15.06          0.604          0.741
n = 80
                WEE            -93.52          14.333         0.119
                Independence   -0.39           0.205          0.939
                Morel          -0.40           0.204          0.986
                Clogg          -8.29           0.257          0.874
φ = 0.9
n = 40
                WEE            -91.99          8.156          0.327
                Independence   0.38            0.274          0.955
                Morel          0.36            0.273          0.942
                Clogg          -9.17           0.265          0.898
n = 80
                WEE            -92.09          8.146          0.013
                Independence   -1.61           0.109          0.947
                Morel          -1.62           0.108          0.979
                Clogg          -6.26           0.124          0.931
Based on 2000 replications for each simulation for varying levels of the number of observations
in each cluster mi, levels of φ for the bridge distribution, and number of clusters sampled, n.
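The three summary columns in these tables can be computed from the replicate-level output of a simulation along the following lines. This is a minimal sketch and not the authors' code. It assumes that relative bias is expressed as a percentage of the true value and that coverage refers to Wald-type intervals (estimate ± z × standard error) at the nominal 95% level; neither convention is stated in the table notes. The function name `summarise_replications` and the toy inputs are hypothetical.

```python
import math
import random


def summarise_replications(estimates, std_errors, true_value, z=1.96):
    """Summarise one simulation configuration.

    Returns (relative bias in percent, mean square error, empirical coverage).
    Assumption: coverage is based on Wald intervals estimate +/- z * se.
    """
    n = len(estimates)
    mean_est = sum(estimates) / n
    # Relative bias: average deviation from the truth, as a percentage of the truth.
    rel_bias_pct = 100.0 * (mean_est - true_value) / true_value
    # Mean square error: average squared distance of the estimates from the truth.
    mse = sum((b - true_value) ** 2 for b in estimates) / n
    # Coverage: share of replications whose interval contains the true value.
    covered = sum(
        1
        for b, se in zip(estimates, std_errors)
        if b - z * se <= true_value <= b + z * se
    )
    return rel_bias_pct, mse, covered / n


# Toy illustration with simulated, unbiased estimates of beta_2 = ln(4).
# These numbers are made up and are not the paper's simulation output.
random.seed(0)
true_beta2 = math.log(4)
toy_estimates = [random.gauss(true_beta2, 0.3) for _ in range(2000)]
toy_std_errors = [0.3] * 2000
toy_summary = summarise_replications(toy_estimates, toy_std_errors, true_beta2)
```

With an unbiased estimator and correctly specified standard errors, as in the toy input, the relative bias should be close to zero and the coverage close to 0.95, which is the pattern the tables report for the Independence and Morel rows.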
References
1. Albert A and Anderson JA. On the existence of maximum likelihood estimates in
logistic regression models. Biometrika 1984; 71: 1-10.
2. Box MJ. Bias in nonlinear estimation. Journal of the Royal Statistical Society.
Series B (Methodological) 1971; 33: 171-201.
3. Firth D. Bias reduction of maximum likelihood estimates. Biometrika 1993; 80:
27-38.
4. Kosmidis I and Firth D. Bias reduction in exponential family nonlinear models.
Biometrika 2009; 96: 793-804.
5. Binder D. On the variance of asymptotically normal estimators from complex surveys.
International Statistical Review 1983; 51: 279-292.
6. Liang KY and Zeger SL. Longitudinal data analysis using generalized linear models.
Biometrika 1986; 73: 13-22.
7. Yu H et al. Comparative analysis of outcomes and costs following open radical
cystectomy versus robot-assisted laparoscopic radical cystectomy: results from the
US Nationwide Inpatient Sample. European Urology 2012; 61: 1239-1244.
8. HCUP, NIS. Database Documentation. Healthcare Cost and Utilization Project
(HCUP). SID Database Documentation. Rockville, MD: Agency for Healthcare
Research and Quality, 2007.
9. Kish L. Survey Sampling. New York, NY: Wiley, 1965.
10. Shah BV, Barnwell BG and Bieler GS. SUDAAN, Software for the Statistical Analysis
of Correlated Data: User’s Manual, Release 7.0. Research Triangle Institute,
1996.
11. Heinze G and Schemper M. Comparing the importance of prognostic factors in Cox
and logistic regression using SAS. Computer Methods and Programs in Biomedicine
2003; 71: 155-163.
12. Wedderburn RWM. On the existence and uniqueness of the maximum likelihood
estimates for certain generalized linear models. Biometrika 1976; 63: 27-32.
13. Morel JG, Bokossa MC and Neerchal NK. Small sample correction for the variance
of GEE estimators. Biometrical Journal 2003; 45: 395-409.
14. Clogg CC, Rubin DB, Schenker N, Schultz B and Weidman L. Multiple imputation
of industry and occupation codes in census public-use samples using Bayesian logistic
regression. Journal of the American Statistical Association 1991; 86: 68-78.
15. Vittinghoff E and McCulloch CE. Relaxing the rule of ten events per variable in
logistic and Cox regression. American Journal of Epidemiology 2007; 165: 710-718.
16. Wang Z and Louis TA. Matching conditional and marginal shapes in binary random
intercept models using a bridge distribution function. Biometrika 2003; 90: 765-775.
17. Parzen M et al. A generalized linear mixed model for longitudinal binary data with
a marginal logit link function. The Annals of Applied Statistics 2011; 5: 449-467.
18. Hauck WW and Donner A. Wald’s test as applied to hypotheses in logit analysis.
Journal of the American Statistical Association 1977; 72: 851-853.
19. Bull SB, Lewinger JP and Lee S. Confidence intervals for multinomial logistic regression
in sparse data. Statistics in Medicine 2007; 26: 903-918.
20. Kosmidis I. On iterative adjustment of responses for the reduction of bias in binary
regression models. Technical Report 09-36. CRiSM working paper series, 2009.
21. Kosmidis I and Firth D. Multinomial logit bias reduction via the Poisson log-linear
model. Biometrika 2011; 98: 755-759.