Practical Bayesian Quantile Regression
Keming Yu
University of Plymouth, UK
A brief summary of some recent work by
Keming Yu, Rana Moyeed and Julian Stander.
Summary
We develop a Bayesian framework for quantile regression, including Tobit quantile regression. We discuss the selection of the likelihood, and families of prior distributions on the quantile regression parameter vector that lead to proper posterior distributions with finite moments. We show how the posterior distribution can be sampled and summarized by Markov chain Monte Carlo methods. A method for quantile regression model choice is also developed. In an empirical comparison, our approach outperformed some common classical estimators.
1. Background
Linear regression models:
Aim: Estimate E(Y |X);
Method: Least-squares minimization;
Suitability: Gaussian errors,
symmetric distributions;
Weakness: Conditional skew distribution,
outliers, tail behavior.
Quantile regression models (Koenker and
Bassett 1978):
Aim: Estimate the pth quantile of Y given X
(0 < p < 1), and explore a complete relation-
ship between Y and X;
Specific cases: median regression (p = 0.5);
Method: “Check function” minimization;
Check function: ρp(u) = u(p− I(u < 0));
Applications: Reference charts in medicine (Cole
and Green, 1992), Survival analysis (Koenker
and Geling, 2001), Value at Risk (Bassett and
Chen, 2001), Labor economics (Buchinsky, 1995),
Flood return period (Yu et al., 2003).
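As a concrete illustration (not part of the original slides), the check function can be sketched in a few lines of Python; the function name `check_loss` is an illustrative choice:

```python
import numpy as np

def check_loss(u, p):
    """Koenker-Bassett check function: rho_p(u) = u * (p - I(u < 0))."""
    u = np.asarray(u, dtype=float)
    # (u < 0) is the indicator I(u < 0); broadcasting handles arrays or scalars
    return u * (p - (u < 0))
```

For p = 0.5 this is half the absolute loss, so check-function minimization reduces to median (least absolute deviations) regression.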
2. Basic setting
Consider the following standard linear model
yi = µ(xi) + εi,
Typically,
µ(xi) = x′i β
for a vector of coefficients β.
The pth (0 < p < 1) quantile of εi is the value,
qp, for which P (εi < qp) = p.
The pth conditional quantile of yi given xi is
denoted as
qp(yi|xi) = x′i β(p). (1)
To make inference about the parameter β(p), given p and observations on (X, Y), we use the posterior distribution of β(p),
π(β|y) ∝ L(y|β) π(β), (2)
where π(β) is the prior distribution of β and L(y|β) is the likelihood function.
3. Selecting likelihood function
There are several ways to select L(y|β) in the set-up above.
For example, one may model the error distribution by a mixture distribution together with a Dirichlet process or Pólya tree prior (Walker et al., 1999; Kottas and Gelfand, 2001).
However, it has been commented (Richardson, 1999) that this brings extra parameters under inference, complicates the associated computations, and makes the choice of the partition of the prior space difficult.
Another example is to use a substitute likelihood (Dunson et al., 2003, Biometrics), but this is not a proper likelihood.
A simple and natural likelihood is based on the
asymmetric Laplace distribution with probabil-
ity density
fp(u) = p(1− p) exp{−ρp(u)}, (3)
for 0 < p < 1;
Link: minimizing the check function loss is exactly equivalent to maximizing the likelihood function below;
Likelihood function:
L(y|β) = p^n (1 − p)^n exp{−Σ_i ρp(yi − x′i β)} (4)
Feature: apart from the parameter β, there are no extra parameters under inference, so we just need to set a prior for β.
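The working likelihood (4) is straightforward to code. Below is a minimal Python sketch (the authors distribute S-PLUS/R code; `ald_loglik` is a hypothetical name):

```python
import numpy as np

def ald_loglik(beta, y, X, p):
    """Log of likelihood (4): n*log(p*(1-p)) - sum_i rho_p(y_i - x_i'beta)."""
    u = y - X @ beta
    rho = u * (p - (u < 0))          # check function applied elementwise
    return len(y) * np.log(p * (1 - p)) - rho.sum()
```

Maximizing this over β is exactly check-function minimization, which is the link stated above.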
Once we have a prior for β(p) and data set
available, we could make posterior inference
for the parameters.
How? As there is no conjugate prior, we use popular MCMC techniques to sample from the posterior.
Why Bayes? Classical methods rely on asymptotics, requiring either a large sample or the bootstrap.
In contrast, MCMC sampling enables us to
make exact inference for any sample size with-
out resorting to asymptotic calculations.
For example, the asymptotic covariance matrices of the parameter estimators in these classical approaches to quantile regression depend on the error density of ε, and are therefore difficult to estimate reliably. In the Bayesian framework, variance estimates, as well as any other posterior summaries, come out as by-products of the MCMC sampler, and are therefore trivial to obtain once samples from the posterior distribution are available.
Moreover, the Bayesian paradigm also enables
us to incorporate prior information in a natu-
ral way, whereas the frequentist paradigm does
not.
It also takes the uncertainty of the parameters into account.
4. Bayesian posterior computation via MCMC
An MCMC scheme constructs a Markov chain whose equilibrium distribution is the posterior π(β|y). After running the Markov chain for a certain burn-in period so that it reaches equilibrium, one obtains samples from π(β|y).
One popular method for constructing such a Markov chain is the Metropolis-Hastings (MH) algorithm, in which a candidate is generated from an auxiliary distribution and then accepted or rejected with some probability. The candidate-generating density q(β′|βc) can depend on the current state βc of the Markov chain.
A candidate β′ is accepted with acceptance probability, also depending on the current state βc, given by
α(β′, βc) = min[ {π(β′) L(y|β′) q(βc|β′)} / {π(βc) L(y|βc) q(β′|βc)}, 1 ].
If, for example, a simple (symmetric) random walk is used to generate β′ from βc, then the ratio q(βc|β′)/q(β′|βc) = 1.
In general the steps of the MH algorithm are
therefore:
Step 0: Start with an arbitrary value β(0).
For n from 1 to N:
Step n: Generate β′ from q(·|β(n−1)) and u from U(0, 1).
If u ≤ α(β′, β(n−1)), set β(n) = β′ (acceptance);
if u > α(β′, β(n−1)), set β(n) = β(n−1) (rejection).
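The steps above can be sketched as a random-walk MH sampler under the flat prior π(β) ∝ 1 (so the acceptance ratio involves only the likelihood). This is a Python sketch, not the authors' S-PLUS/R implementation; the function name and step size are illustrative choices:

```python
import numpy as np

def rw_mh_quantreg(y, X, p, n_iter=5000, step=0.1, seed=0):
    """Random-walk Metropolis sampler for the quantile-regression posterior
    with a flat prior pi(beta) ∝ 1 (all posterior moments exist by Theorem 2)."""
    rng = np.random.default_rng(seed)
    k = X.shape[1]
    beta = np.zeros(k)

    def logpost(b):
        u = y - X @ b
        return -(u * (p - (u < 0))).sum()     # log-likelihood up to a constant

    lp = logpost(beta)
    draws = np.empty((n_iter, k))
    for n in range(n_iter):
        cand = beta + step * rng.standard_normal(k)  # symmetric proposal: q ratio = 1
        lp_cand = logpost(cand)
        if np.log(rng.uniform()) <= lp_cand - lp:    # accept with probability alpha
            beta, lp = cand, lp_cand
        draws[n] = beta                              # rejection keeps current state
    return draws
```

Discarding an initial burn-in block of `draws` and summarizing the rest gives posterior means, standard deviations and credible intervals, as described above.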
As mentioned above, after running this procedure for a certain burn-in period, the samples obtained may be thought of as coming from the posterior distribution. Hence, we can estimate posterior moments, standard deviations and credible intervals from this posterior sample. We found that convergence was very rapid.
Remark: We may use the set-up for Tobit
quantile regression: suppose that y∗ and y are
random variables connected by the censoring
relationship
y = max{y0, y∗},
where y0 is a known censoring point.
In this case, we have found it simplifies the algorithm to take zero as the fixed censoring point; any non-zero censoring point can be reduced to this case by a simple transformation.
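The transformation referred to is just a shift: since y = max{y0, y∗} implies y − y0 = max{0, y∗ − y0}, subtracting y0 from the responses gives a model censored at zero (and the fitted intercept is shifted back by y0 afterwards). A sketch:

```python
import numpy as np

def shift_to_zero_censoring(y, y0):
    """Map data censored at y0 to data censored at 0:
    y = max(y0, y*)  <=>  y - y0 = max(0, y* - y0)."""
    return np.asarray(y, dtype=float) - y0
```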
S-PLUS or R code to implement the algorithms is freely available.
5. Some theoretic results, including prior
selection
As Richardson (1999) mentioned, popular forms of priors tend to be those whose parameters can be set straightforwardly and which lead to posteriors with a relatively immediate form.
Although a standard conjugate prior distribu-
tion is not available for the quantile regression
formulation, MCMC methods may be used to
draw samples from the posterior distributions.
This, in principle, allows us to use virtually any prior distribution. However, we should select priors that yield proper posteriors.
Choose the prior π(β) from a class of known
distributions, in order to get proper posteriors.
First, the posterior is proper if and only if
0 < ∫_{R^(T+1)} π(β|y) dβ < ∞, (5)
or, equivalently, if and only if
0 < ∫_{R^(T+1)} L(y|β) π(β) dβ < ∞.
Moreover, we require that all posterior moments exist. That is,
E[ Π_{j=0}^{T} |βj|^{rj} | y ] < ∞, (6)
where (r0, . . . , rT) denotes the orders of the moments of β = (β0, . . . , βT).
We now establish a bound for the integral
∫_{R^(T+1)} Π_{j=0}^{T} |βj|^{rj} L(y|β) π(β) dβ
that allows us to obtain proper posterior moments.
Theorem 1 (basic lemma): Let g(t) = exp(−|t|), h1(p) = min{p, 1 − p} and h2(p) = max{p, 1 − p}. Then all posterior moments exist if and only if
∫_{R^(T+1)} Π_{j=0}^{T} |βj|^{rj} Π_{i=1}^{n} g{hk(p)(yi − x′i β)} π(β) dβ
is finite for both k = 1 and k = 2.
Theorem 2 establishes that in the absence of
any realistic prior information we could legiti-
mately use an improper uniform prior distribu-
tion for all the components of β. This choice
may be appealing as the resulting posterior dis-
tribution is proportional to the likelihood sur-
face.
Theorem 2: Assume that the prior for β is improper and uniform, that is, π(β) ∝ 1; then all posterior moments of β exist.
Theorem 3: When the elements of β are assumed a priori independent, with each
π(βi) ∝ exp(−|βi − µi|/λi),
a double-exponential with fixed µi and λi > 0, all posterior moments of β exist.
Theorem 4: Assume that the prior for β is
multivariate normal N(µ,Σ) with fixed µ and
Σ, then all posterior moments of β exist.
In particular, when the elements of β are assumed a priori independent and univariate normal, all posterior moments of β exist.
6. Empirical comparison
Buchinsky and Hahn (1998) performed Monte Carlo experiments to compare their estimator with the one proposed by Powell (1986) for Tobit quantile regression estimation. One of the models they used was given by
y = max{−0.75, y∗} and
y∗ = 1 + x1 + 0.5 x2 + ε,
where the regressors x1 and x2 were each drawn from a standard normal distribution, and the error term has multiplicative heteroskedasticity obtained by taking ε = ξ v(x) with ξ ∼ N(0, 25) and
v(x) = 1 + 0.5(x1 + x1^2 + x2 + x2^2).
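For reference, this simulation design can be reproduced as follows (a Python sketch; we read ξ ∼ N(0, 25) as variance 25, i.e. standard deviation 5):

```python
import numpy as np

def simulate_bh(n, seed=0):
    """Draw n observations from the Buchinsky-Hahn (1998) Tobit design:
    y* = 1 + x1 + 0.5*x2 + xi*v(x), censored below at -0.75."""
    rng = np.random.default_rng(seed)
    x1, x2 = rng.standard_normal(n), rng.standard_normal(n)
    v = 1 + 0.5 * (x1 + x1**2 + x2 + x2**2)   # heteroskedasticity factor v(x)
    xi = rng.normal(0.0, 5.0, size=n)          # N(0, 25): standard deviation 5
    y_star = 1 + x1 + 0.5 * x2 + xi * v
    y = np.maximum(-0.75, y_star)              # censoring at -0.75
    return np.column_stack([np.ones(n), x1, x2]), y
```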
For estimating the median regression of this model, Table 1 summarizes the biases, root mean square errors (RMSE) and 95% credible intervals for β0 and β1 obtained from three approaches: the BH method (Buchinsky and Hahn, 1998), Powell's estimator (Powell, 1986) and the proposed Bayesian method with uniform prior. The values relating to BH and Powell were also reported in Table 1 of Buchinsky and Hahn (1998). In particular, the BH method used log-likelihood cross-validated bandwidth selection for kernel estimation of the censoring probability, and the 95% confidence intervals for both the BH and Powell estimators are based on their asymptotic normality theory. The results from Bayesian inference are based on a burn-in of 1000 iterations followed by 2000 sample values (see Figure).
Table 1. Bias, root mean square errors (RMSE)
and 95% credible intervals for the parameters
β0 and β1 of the median regression. Sam-
ples were generated from the model consid-
ered by Buchinsky and Hahn (1998). Three
approaches were used: BH method (Buchin-
sky and Hahn, 1998), Powell estimator (Pow-
ell, 1986) and the proposed Bayesian method
with uniform prior.
                      β0                          β1
Size          BH    Powell   Bayes      BH    Powell   Bayes
100   Bias    0.14   -0.08   -0.08      0.31   0.33    0.12
      RMSE    2.88    4.11    0.18      2.16   2.85    0.25
      2.5%   -4.49   -6.00    0.57     -3.13  -4.55    0.73
      97.5%   6.76    9.40    1.20      5.65   7.41    1.65
400   Bias    0.20    0.19   -0.04     -0.06  -0.45   -0.01
      RMSE    0.58    0.68    0.08      0.61   0.66    0.08
      2.5%   -0.85   -0.83    0.82     -0.82  -1.12    0.83
      97.5%   4.41    4.31    1.13      2.24   2.36    1.17
600   Bias    0.18    0.20   -0.01     -0.06  -0.47   -0.05
      RMSE    0.48    0.49    0.05      0.50   0.57    0.08
      2.5%   -1.33   -0.14    0.91     -0.37  -0.89    0.83
      97.5%   4.67    3.42    1.09      2.09   1.89    1.08
Clearly, the proposed Bayesian method outperformed the BH and Powell methods. It yields considerably lower biases, lower mean square errors and much more precise credible intervals. S-PLUS code to implement the method for this comparison is available.
7. Reference chart for Immunoglobulin-G
This data set refers to the serum concentration (grams per litre) of immunoglobulin-G (IgG) in 298 children aged from 6 months to 6 years (Isaacs et al., 1983). The relationship of IgG with age is quite weak, with some visual evidence of positive skewness.
We took the response variable Y to be the IgG concentration and used a quadratic model in age, x, to fit the quantile regression:
qp(y|x) = β0(p) + β1(p)x + β2(p)x2,
for 0 < p < 1. The Figure shows a plot of the data along with the quantile regression curves. Each point on the curves is the mean of the posterior predictive distribution.
We could also obtain credible intervals around these curves using the MCMC samples of β(p).
8. Model choice: using marginal likelihood and Bayes factors
Consider the problem of comparing a collection of models {M1, . . . , ML} that reflect competing hypotheses about the regression form. The issue of model choice can be dealt with by calculating Bayes factors. Under model Mk, suppose that the pth quantile model is given by
y|Mk = x′(k) βp(k) + εp(k),
then the marginal likelihood arising from estimating βp(k) is defined as
m(y|Mk) = ∫ L(y|Mk, βp(k)) π(βp(k)|Mk) dβp(k),
which is the normalizing constant of the posterior density.
The calculation of the marginal likelihood has attracted considerable interest in the recent MCMC literature. In particular, Chib (1995) and Chib and Jeliazkov (2001) have developed a simple approach for estimating the marginal likelihood using the output of the Gibbs sampler and the MH algorithm respectively. Here, we have
log m(y|Mk) = log L(y|Mk, β∗p(k)) + log π(β∗p(k)|Mk) − log π(β∗p(k)|y, Mk),
from which the marginal likelihood can be estimated by finding an estimate of the posterior ordinate π(β∗p(k)|y, Mk). We denote this estimate as π̂(β∗p(k)|y, Mk). For estimation efficiency, β∗p(k) is generally taken to be a point of high density in the support of the posterior. On substituting this estimate into log m(y|Mk), we get
log m(y|Mk) = n log(p(1 − p)) − Σ ρp(y − x′(k) β∗p(k)) + log π(β∗p(k)|Mk) − log π̂(β∗p(k)|y, Mk),
in which the first term n log(p(1 − p)) is constant and the sum is over all data points.
Once the posterior ordinate is estimated, we can estimate the Bayes factor of any two models Mk and Ml by
Bkl = exp{log m(y|Mk) − log m(y|Ml)}.
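Working on the log scale until the final step avoids numerical under- or overflow when the marginal likelihoods are very small; a trivial sketch:

```python
import math

def bayes_factor(log_m_k, log_m_l):
    """Bayes factor B_kl computed from two estimated log marginal likelihoods."""
    return math.exp(log_m_k - log_m_l)
```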
A simulation-consistent estimate of π(β∗p(k)|y, Mk) can be given as follows.
For an improper prior,
π̂(β∗|y) = { G^(−1) Σ_{g=1}^{G} α(β(g), β∗) q(β(g), β∗) } / { J^(−1) Σ_{j=1}^{J} α(β∗, β(j)) },
in which
α(β, β∗) = min{ 1, (π(β∗)/π(β)) L∗(β∗, β) }
and
L∗(β∗, β) = exp{−Σ( ρp(y − x′(k) β∗p(k)) − ρp(y − x′(k) βp(k)) )},
where {β(j)} are samples drawn from q(β∗, β) and {β(g)} are samples drawn from the posterior distribution.
For a proper prior,
π(β∗|y) = ∫ α(β, β∗) π(β|y) dβ / ∫ α∗(β∗, β) π(β) dβ,
in which
α∗(β∗, β) = min{ 1/π(β), L∗(β, β∗)/π(β∗) }.
This implies that a simulation-consistent estimate of the posterior ordinate is given by
π̂(β∗|y) = { G^(−1) Σ_{g=1}^{G} α(β(g), β∗) } / { J^(−1) Σ_{j=1}^{J} α∗(β∗, β(j)) },
where {β(j)} are samples drawn from the proper prior distribution and {β(g)} are samples drawn from the posterior distribution.
9. Inference with scale parameter
One may be interested in introducing a scale parameter into the likelihood function L(y|β) for the proposed Bayesian inference. Suppose σ > 0 is the scale parameter; then
L(y|β, σ) = {p^n (1 − p)^n / σ^n} exp{−Σ_{i=1}^{n} ρp((yi − x′i β)/σ)}.
The corresponding posterior distribution π(β, σ|y) can be written as
π(β, σ|y) ∝ L(y|β, σ) π(β, σ),
where π(β, σ) is the prior distribution of (β, σ) for a particular p. As what interests us is the regression parameter β, and σ is a nuisance parameter, we may integrate out σ and investigate the marginal posterior π(β|y) only. For example, we have considered a "reference" prior π(β, σ) ∝ 1/σ, which gives
π(β|y) ∝ ( Σ_{i=1}^{n} ρp(yi − x′i β) )^(−n),
or
log π(β|y) = −n log Σ_{i=1}^{n} ρp(yi − x′i β) + constant.
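This marginal posterior needs only the sum of check-function residuals, so it plugs directly into the same MH sampler; a Python sketch (the function name is an illustrative choice):

```python
import numpy as np

def log_marginal_post(beta, y, X, p):
    """log pi(beta|y) up to a constant, after integrating out sigma
    under the reference prior pi(beta, sigma) ∝ 1/sigma:
    -n * log( sum_i rho_p(y_i - x_i'beta) )."""
    u = y - X @ beta
    s = (u * (p - (u < 0))).sum()   # sum of check-function residuals
    return -len(y) * np.log(s)
```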
Implementing an MCMC algorithm on this posterior form, we have found that the simulation results are more or less the same as those obtained using the likelihood (4).
References
Bilias, Y., Chen, S. and Ying, Z. (2000), "Simple Resampling Methods for Censored Regression Quantiles," Journal of Econometrics, 99, 373–386.
Buchinsky, M. (1998), "Recent advances in quantile regression models," Journal of Human Resources, 33, 88–126.
Chib, S. (1992), "Bayes Inference in the Tobit Censored Regression Model," Journal of Econometrics, 51, 79–99.
Isaacs, D., Altman, D.G., Tidmarsh, C.E., Valman, H.B. and Webster, A.D.B. (1983), "Serum immunoglobulin concentrations in preschool children measured by laser nephelometry: reference ranges for IgG, IgA, IgM," Journal of Clinical Pathology, 36, 1193–1196.
Koenker, R. and Bassett, G.S. (1978), "Regression quantiles," Econometrica, 46, 33–50.
Hahn, J. (1995), "Bootstrapping Quantile Regression Estimators," Econometric Theory, 11, 105–121.
Huang, H. (2001), "Bayesian Analysis of the SUR Tobit Model," Applied Economics Letters, 8, 617–622.
Kottas, A. and Gelfand, A.E. (2001), "Bayesian Semiparametric Median Regression Model," Journal of the American Statistical Association, 91, 689–698.
Powell, J. (1986a), "Censored Regression Quantiles," Journal of Econometrics, 32, 143–155.
Powell, J. (1986b), "Symmetrically Trimmed Least Squares Estimation for Tobit Models," Econometrica, 54, 1435–1460.
Richardson, S. (1999), "Contribution to the discussion of Walker et al., Bayesian Nonparametric Inference for Random Distributions and Related Functions," J. R. Statist. Soc. B, 61, 485–527.
Walker, S.G., Damien, P., Laud, P.W. and Smith, A.F.M. (1999), "Bayesian Nonparametric Inference for Random Distributions and Related Functions," J. R. Statist. Soc. B, 61, 485–527.
Powell, J.L. (2001), "Semiparametric estimation of censored selection models," in Hsiao, C., Morimune, K. and Powell, J.L. (eds), Nonlinear Statistical Modelling, Cambridge University Press.
Yu, K. and Moyeed, R.A. (2001), "Bayesian quantile regression," Statistics and Probability Letters, 54, 437–447.
Yu, K., Lu, Z. and Stander, J. (2003), "Quantile regression: applications and current research areas," The Statistician, 52, 331–350.