Practical Bayesian Quantile Regression
Keming Yu
University of Plymouth, UK
A brief summary of some recent work by
Keming Yu, Rana Moyeed and Julian Stander.
Summary
We develop a Bayesian framework for quantile regression, including Tobit quantile regression. We discuss the selection of the likelihood, and families of prior distributions on the quantile regression parameter vector that lead to proper posterior distributions with finite moments. We show how the posterior distribution can be sampled and summarized by Markov chain Monte Carlo methods. A method for quantile regression model choice is also developed. In an empirical comparison, our approach outperformed some common classical estimators.
1. Background
Linear regression models:
Aim: Estimate E(Y |X);
Method: Least-squares minimization;
Suitability: Gaussian errors,
symmetric distributions;
Weakness: Conditional skew distribution,
outliers, tail behavior.
Quantile regression models (Koenker and
Bassett 1978):
Aim: Estimate the pth quantile of Y given X
(0 < p < 1), and explore a complete relation-
ship between Y and X;
Specific cases: median regression (p = 0.5);
Method: “Check function” minimization;
Check function: ρp(u) = u(p− I(u < 0));
Applications: Reference charts in medicine (Cole
and Green, 1992), Survival analysis (Koenker
and Geling, 2001), Value at Risk (Bassett and
Chen, 2001), Labor economics (Buchinsky, 1995),
Flood return period (Yu et al., 2003).
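As a concrete illustration (not part of the original slides), the check function can be sketched in a few lines of Python; the function name `check_loss` is an illustrative choice:

```python
import numpy as np

def check_loss(u, p):
    """Koenker-Bassett check function: rho_p(u) = u * (p - I(u < 0))."""
    u = np.asarray(u, dtype=float)
    # (u < 0) is the indicator I(u < 0); broadcasting handles arrays or scalars
    return u * (p - (u < 0))
```

For p = 0.5 this is half the absolute loss, so check-function minimization reduces to median (least absolute deviations) regression.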
2. Basic setting
Consider the following standard linear model
yi = µ(xi) + εi,
Typically,
µ(xi) = x′i β
for a vector of coefficients β.
The pth (0 < p < 1) quantile of εi is the value,
qp, for which P (εi < qp) = p.
The pth conditional quantile of yi given xi is
denoted as
qp(yi|xi) = x′i β(p). (1)
To make inference about the parameter β(p), given p and observations on (X, Y), we use the posterior distribution of β(p),
π(β|y) ∝ L(y|β) π(β), (2)
where π(β) is the prior distribution of β and L(y|β) is the likelihood function.
3. Selecting likelihood function
There are several ways to select L(y|β) in the set-up above.
For example, one may model the error distribution by a mixture distribution together with a Dirichlet process or Pólya tree prior (Walker et al., 1999; Kottas and Gelfand, 2001).
However, it has been commented (Richardson, 1999) that this brings extra parameters under inference, complicates the associated computations, and makes the choice of the partition of the prior space difficult.
Another example is to use a substitute likelihood (Dunson et al., 2003, Biometrics), but this is not a proper likelihood.
A simple and natural likelihood is based on the
asymmetric Laplace distribution with probabil-
ity density
fp(u) = p(1− p) exp{−ρp(u)}, (3)
for 0 < p < 1;
Link: minimizing the check function loss is exactly equivalent to maximizing the likelihood function below;
Likelihood function:
L(y|β) = p^n (1 − p)^n exp{−Σ_i ρp(yi − x′i β)} (4)
Feature: apart from the parameter β, there are no extra parameters under inference, so we just need to set a prior for β.
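The working likelihood (4) is straightforward to code. Below is a minimal Python sketch (the authors distribute S-PLUS/R code; `ald_loglik` is a hypothetical name):

```python
import numpy as np

def ald_loglik(beta, y, X, p):
    """Log of likelihood (4): n*log(p*(1-p)) - sum_i rho_p(y_i - x_i'beta)."""
    u = y - X @ beta
    rho = u * (p - (u < 0))          # check function applied elementwise
    return len(y) * np.log(p * (1 - p)) - rho.sum()
```

Maximizing this over β is exactly check-function minimization, which is the link stated above.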
Once we have a prior for β(p) and data set
available, we could make posterior inference
for the parameters.
How? As there is no conjugate prior, we use popular MCMC techniques to sample from the posterior.
Why Bayes? Classical methods rely on asymptotics, requiring either a large sample or the bootstrap.
In contrast, MCMC sampling enables us to
make exact inference for any sample size with-
out resorting to asymptotic calculations.
For example, the asymptotic covariance matrices of the parameter estimators in these classical approaches to quantile regression depend on the error density of ε, and are therefore difficult to estimate reliably. In the Bayesian framework, variance estimates, as well as any other posterior summaries, come out as by-products of the MCMC sampler, and are therefore trivial to obtain once samples from the posterior distribution are available.
Moreover, the Bayesian paradigm also enables
us to incorporate prior information in a natu-
ral way, whereas the frequentist paradigm does
not.
It also takes the uncertainty of the parameters into account.
4. Bayesian posterior computation via MCMC
An MCMC scheme constructs a Markov chain whose equilibrium distribution is the posterior π(β|y). After running the Markov chain for a certain burn-in period so that it reaches equilibrium, one obtains samples from π(β|y).
One popular method for constructing such a Markov chain is the Metropolis-Hastings (MH) algorithm, in which a candidate is generated from an auxiliary distribution and then accepted or rejected with some probability. The candidate-generating density q(β′|βc) can depend on the current state βc of the Markov chain.
A candidate β′ is accepted with acceptance probability, also depending on the current state βc, given by
α(β′, βc) = min[ {π(β′) L(y|β′) q(βc|β′)} / {π(βc) L(y|βc) q(β′|βc)}, 1 ].
If, for example, a simple (symmetric) random walk is used to generate β′ from βc, then the ratio q(βc|β′)/q(β′|βc) = 1.
In general the steps of the MH algorithm are
therefore:
Step 0: Start with an arbitrary value β(0).
For n from 1 to N:
Step n: Generate β′ from q(·|β(n−1)) and u from U(0, 1).
If u ≤ α(β′, β(n−1)), set β(n) = β′ (acceptance);
if u > α(β′, β(n−1)), set β(n) = β(n−1) (rejection).
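The steps above can be sketched as a random-walk MH sampler under the flat prior π(β) ∝ 1 (so the acceptance ratio involves only the likelihood). This is a Python sketch, not the authors' S-PLUS/R implementation; the function name and step size are illustrative choices:

```python
import numpy as np

def rw_mh_quantreg(y, X, p, n_iter=5000, step=0.1, seed=0):
    """Random-walk Metropolis sampler for the quantile-regression posterior
    with a flat prior pi(beta) ∝ 1 (all posterior moments exist by Theorem 2)."""
    rng = np.random.default_rng(seed)
    k = X.shape[1]
    beta = np.zeros(k)

    def logpost(b):
        u = y - X @ b
        return -(u * (p - (u < 0))).sum()     # log-likelihood up to a constant

    lp = logpost(beta)
    draws = np.empty((n_iter, k))
    for n in range(n_iter):
        cand = beta + step * rng.standard_normal(k)  # symmetric proposal: q ratio = 1
        lp_cand = logpost(cand)
        if np.log(rng.uniform()) <= lp_cand - lp:    # accept with probability alpha
            beta, lp = cand, lp_cand
        draws[n] = beta                              # rejection keeps current state
    return draws
```

Discarding an initial burn-in block of `draws` and summarizing the rest gives posterior means, standard deviations and credible intervals, as described above.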
As mentioned above, after running this procedure for a certain burn-in period, the samples obtained may be thought of as coming from the posterior distribution. Hence, we can estimate posterior moments, standard deviations and credible intervals from this posterior sample. We found that convergence was very rapid.
Remark: We may use the set-up for Tobit
quantile regression: suppose that y∗ and y are
random variables connected by the censoring
relationship
y = max{y0, y∗},
where y0 is a known censoring point.
In this case, we have found it simplifies the algorithm to take zero as the fixed censoring point; any non-zero censoring point can be reduced to this case by a simple transformation.
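The transformation referred to is just a shift: since y = max{y0, y∗} implies y − y0 = max{0, y∗ − y0}, subtracting y0 from the responses gives a model censored at zero (and the fitted intercept is shifted back by y0 afterwards). A sketch:

```python
import numpy as np

def shift_to_zero_censoring(y, y0):
    """Map data censored at y0 to data censored at 0:
    y = max(y0, y*)  <=>  y - y0 = max(0, y* - y0)."""
    return np.asarray(y, dtype=float) - y0
```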
S-PLUS or R code to implement the algorithms is freely available.
5. Some theoretic results, including prior
selection
As Richardson (1999) mentioned, popular forms of priors tend to be those whose parameters can be set straightforwardly and which lead to posteriors with a relatively immediate form.
Although a standard conjugate prior distribu-
tion is not available for the quantile regression
formulation, MCMC methods may be used to
draw samples from the posterior distributions.
This, in principle, allows us to use virtually any prior distribution. However, we should select priors that yield proper posteriors.
Choose the prior π(β) from a class of known
distributions, in order to get proper posteriors.
First, the posterior is proper if and only if
0 < ∫_{R^(T+1)} π(β|y) dβ < ∞, (5)
or, equivalently, if and only if
0 < ∫_{R^(T+1)} L(y|β) π(β) dβ < ∞.
Moreover, we require that all posterior moments exist. That is,
E[ Π_{j=0}^{T} |βj|^{rj} | y ] < ∞, (6)
where (r0, . . . , rT) denotes the orders of the moments of β = (β0, . . . , βT).
We now establish a bound for the integral
∫_{R^(T+1)} Π_{j=0}^{T} |βj|^{rj} L(y|β) π(β) dβ
that allows us to obtain proper posterior moments.
Theorem 1 (basic lemma): Let g(t) = exp(−|t|), h1(p) = min{p, 1 − p} and h2(p) = max{p, 1 − p}. Then all posterior moments exist if and only if
∫_{R^(T+1)} Π_{j=0}^{T} |βj|^{rj} Π_{i=1}^{n} g{hk(p)(yi − x′i β)} π(β) dβ
is finite for both k = 1 and k = 2.
Theorem 2 establishes that in the absence of
any realistic prior information we could legiti-
mately use an improper uniform prior distribu-
tion for all the components of β. This choice
may be appealing as the resulting posterior dis-
tribution is proportional to the likelihood sur-
face.
Theorem 2: Assume that the prior for β is improper and uniform, that is, π(β) ∝ 1; then all posterior moments of β exist.
Theorem 3: When the elements of β are assumed a priori independent, with each
π(βi) ∝ exp(−|βi − µi|/λi),
a double-exponential with fixed µi and λi > 0, all posterior moments of β exist.
Theorem 4: Assume that the prior for β is
multivariate normal N(µ,Σ) with fixed µ and
Σ, then all posterior moments of β exist.
In particular, when the elements of β are assumed a priori independent and univariate normal, all posterior moments of β exist.
6. Empirical comparison
Buchinsky and Hahn (1998) performed Monte Carlo experiments to compare their estimator with the one proposed by Powell (1986) for Tobit quantile regression estimation. One of the models they used was given by
y = max{−0.75, y∗} and
y∗ = 1 + x1 + 0.5 x2 + ε,
where the regressors x1 and x2 were each drawn from a standard normal distribution, and the error term has multiplicative heteroskedasticity obtained by taking ε = ξ v(x) with ξ ∼ N(0, 25) and
v(x) = 1 + 0.5(x1 + x1^2 + x2 + x2^2).
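For reference, this simulation design can be reproduced as follows (a Python sketch; we read ξ ∼ N(0, 25) as variance 25, i.e. standard deviation 5):

```python
import numpy as np

def simulate_bh(n, seed=0):
    """Draw n observations from the Buchinsky-Hahn (1998) Tobit design:
    y* = 1 + x1 + 0.5*x2 + xi*v(x), censored below at -0.75."""
    rng = np.random.default_rng(seed)
    x1, x2 = rng.standard_normal(n), rng.standard_normal(n)
    v = 1 + 0.5 * (x1 + x1**2 + x2 + x2**2)   # heteroskedasticity factor v(x)
    xi = rng.normal(0.0, 5.0, size=n)          # N(0, 25): standard deviation 5
    y_star = 1 + x1 + 0.5 * x2 + xi * v
    y = np.maximum(-0.75, y_star)              # censoring at -0.75
    return np.column_stack([np.ones(n), x1, x2]), y
```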
For estimating the median regression of this model, Table 1 summarizes the biases, root mean square errors (RMSE) and 95% credible intervals for β0 and β1 obtained from three approaches: the BH method (Buchinsky and Hahn, 1998), Powell's estimator (Powell, 1986) and the proposed Bayesian method with uniform prior. The values relating to BH and Powell were also reported in Table 1 of Buchinsky and Hahn (1998). In particular, the BH method used log-likelihood cross-validated bandwidth selection for kernel estimation of the censoring probability, and the 95% confidence intervals for both the BH and Powell estimators are based on their asymptotic normality theory. The results from Bayesian inference are based on a burn-in of 1000 iterations followed by 2000 sample values (see Figure).
Table 1. Bias, root mean square errors (RMSE)
and 95% credible intervals for the parameters
β0 and β1 of the median regression. Sam-
ples were generated from the model consid-
ered by Buchinsky and Hahn (1998). Three
approaches were used: BH method (Buchin-
sky and Hahn, 1998), Powell estimator (Pow-
ell, 1986) and the proposed Bayesian method
with uniform prior.
                      β0                          β1
Size          BH    Powell   Bayes      BH    Powell   Bayes
100   Bias    0.14   -0.08   -0.08      0.31   0.33    0.12
      RMSE    2.88    4.11    0.18      2.16   2.85    0.25
      2.5%   -4.49   -6.00    0.57     -3.13  -4.55    0.73
      97.5%   6.76    9.40    1.20      5.65   7.41    1.65
400   Bias    0.20    0.19   -0.04     -0.06  -0.45   -0.01
      RMSE    0.58    0.68    0.08      0.61   0.66    0.08
      2.5%   -0.85   -0.83    0.82     -0.82  -1.12    0.83
      97.5%   4.41    4.31    1.13      2.24   2.36    1.17
600   Bias    0.18    0.20   -0.01     -0.06  -0.47   -0.05
      RMSE    0.48    0.49    0.05      0.50   0.57    0.08
      2.5%   -1.33   -0.14    0.91     -0.37  -0.89    0.83
      97.5%   4.67    3.42    1.09      2.09   1.89    1.08
Clearly, the proposed Bayesian method outperformed the BH and Powell methods. It yields considerably lower biases, lower mean square errors and much more precise credible intervals. S-PLUS code to implement the method for this comparison is available.
7. Reference chart for Immunoglobulin-G
This data set refers to the serum concentration (grams per litre) of immunoglobulin-G (IgG) in 298 children aged from 6 months to 6 years (Isaacs et al., 1983). The relationship of IgG with age is quite weak, with some visual evidence of positive skewness.
We took the response variable Y to be the IgG concentration and used a quadratic model in age, x, to fit the quantile regression:
qp(y|x) = β0(p) + β1(p)x + β2(p)x2,
for 0 < p < 1. The Figure shows a plot of the data along with the quantile regression curves. Each point on the curves is the mean of the posterior predictive distribution.
We could also obtain credible intervals around these curves using the MCMC samples of β(p).
8. Model choice: using marginal likelihood and Bayes factors
Consider the problem of comparing a collection of models {M1, . . . , ML} that reflect competing hypotheses about the regression form. The issue of model choice can be dealt with by calculating Bayes factors. Under model Mk, suppose that the pth quantile model is given by
y|Mk = x′(k) βp(k) + εp(k),
then the marginal likelihood arising from estimating βp(k) is defined as
m(y|Mk) = ∫ L(y|Mk, βp(k)) π(βp(k)|Mk) dβp(k),
which is the normalizing constant of the posterior density.
The calculation of the marginal likelihood has attracted considerable interest in the recent MCMC literature. In particular, Chib (1995) and Chib and Jeliazkov (2001) have developed a simple approach for estimating the marginal likelihood using the output of the Gibbs sampler and the MH algorithm respectively. Here, we have
log m(y|Mk) = log L(y|Mk, β∗p(k)) + log π(β∗p(k)|Mk) − log π(β∗p(k)|y, Mk),
from which the marginal likelihood can be estimated by finding an estimate of the posterior ordinate π(β∗p(k)|y, Mk). We denote this estimate as π̂(β∗p(k)|y, Mk). For estimation efficiency, β∗p(k) is generally taken to be a point of high density in the support of the posterior. On substituting this estimate into log m(y|Mk), we get
log m(y|Mk) = n log(p(1 − p)) − Σ ρp(y − x′(k) β∗p(k)) + log π(β∗p(k)|Mk) − log π̂(β∗p(k)|y, Mk),
in which the first term n log(p(1 − p)) is constant and the sum is over all data points.
Once the posterior ordinate is estimated, we can estimate the Bayes factor of any two models Mk and Ml by
Bkl = exp{log m(y|Mk) − log m(y|Ml)}.
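Working on the log scale until the final step avoids numerical under- or overflow when the marginal likelihoods are very small; a trivial sketch:

```python
import math

def bayes_factor(log_m_k, log_m_l):
    """Bayes factor B_kl computed from two estimated log marginal likelihoods."""
    return math.exp(log_m_k - log_m_l)
```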
A simulation-consistent estimate of π(β∗p(k)|y, Mk) can be given as follows.
For an improper prior,
π̂(β∗|y) = { G^(−1) Σ_{g=1}^{G} α(β(g), β∗) q(β(g), β∗) } / { J^(−1) Σ_{j=1}^{J} α(β∗, β(j)) },
in which
α(β, β∗) = min{ 1, (π(β∗)/π(β)) L∗(β∗, β) }
and
L∗(β∗, β) = exp{−Σ( ρp(y − x′(k) β∗p(k)) − ρp(y − x′(k) βp(k)) )},
where {β(j)} are samples drawn from q(β∗, β) and {β(g)} are samples drawn from the posterior distribution.
For a proper prior,
π(β∗|y) = ∫ α(β, β∗) π(β|y) dβ / ∫ α∗(β∗, β) π(β) dβ,
in which
α∗(β∗, β) = min{ 1/π(β), L∗(β, β∗)/π(β∗) }.
This implies that a simulation-consistent estimate of the posterior ordinate is given by
π̂(β∗|y) = { G^(−1) Σ_{g=1}^{G} α(β(g), β∗) } / { J^(−1) Σ_{j=1}^{J} α∗(β∗, β(j)) },
where {β(j)} are samples drawn from the proper prior distribution and {β(g)} are samples drawn from the posterior distribution.
9. Inference with scale parameter
One may be interested in introducing a scale parameter into the likelihood function L(y|β) for the proposed Bayesian inference. Suppose σ > 0 is the scale parameter; then
L(y|β, σ) = {p^n (1 − p)^n / σ^n} exp{−Σ_{i=1}^{n} ρp((yi − x′i β)/σ)}.
The corresponding posterior distribution π(β, σ|y) can be written as
π(β, σ|y) ∝ L(y|β, σ) π(β, σ),
where π(β, σ) is the prior distribution of (β, σ) for a particular p. As what interests us is the regression parameter β, and σ is a nuisance parameter, we may integrate out σ and investigate the marginal posterior π(β|y) only. For example, we have considered a "reference" prior π(β, σ) ∝ 1/σ, which gives
π(β|y) ∝ ( Σ_{i=1}^{n} ρp(yi − x′i β) )^(−n),
or
log π(β|y) = −n log Σ_{i=1}^{n} ρp(yi − x′i β) + constant.
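This marginal posterior needs only the sum of check-function residuals, so it plugs directly into the same MH sampler; a Python sketch (the function name is an illustrative choice):

```python
import numpy as np

def log_marginal_post(beta, y, X, p):
    """log pi(beta|y) up to a constant, after integrating out sigma
    under the reference prior pi(beta, sigma) ∝ 1/sigma:
    -n * log( sum_i rho_p(y_i - x_i'beta) )."""
    u = y - X @ beta
    s = (u * (p - (u < 0))).sum()   # sum of check-function residuals
    return -len(y) * np.log(s)
```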
Implementing an MCMC algorithm on this posterior form, we have found that the simulation results are more or less the same as those obtained using the likelihood (4).
References
Bilias, Y., Chen, S. and Ying, Z. (2000), "Simple Resampling Methods for Censored Regression Quantiles," Journal of Econometrics, 99, 373–386.
Buchinsky, M. (1998), "Recent advances in quantile regression models," Journal of Human Resources, 33, 88–126.
Chib, S. (1992), "Bayes Inference in the Tobit Censored Regression Model," Journal of Econometrics, 51, 79–99.
Isaacs, D., Altman, D.G., Tidmarsh, C.E., Valman, H.B. and Webster, A.D.B. (1983), "Serum immunoglobulin concentrations in preschool children measured by laser nephelometry: reference ranges for IgG, IgA, IgM," Journal of Clinical Pathology, 36, 1193–1196.
Koenker, R. and Bassett, G.S. (1978), "Regression quantiles," Econometrica, 46, 33–50.
Hahn, J. (1995), "Bootstrapping Quantile Regression Estimators," Econometric Theory, 11, 105–121.
Huang, H. (2001), "Bayesian Analysis of the SUR Tobit Model," Applied Economics Letters, 8, 617–622.
Kottas, A. and Gelfand, A.E. (2001), "Bayesian Semiparametric Median Regression Model," Journal of the American Statistical Association, 91, 689–698.
Powell, J. (1986a), "Censored Regression Quantiles," Journal of Econometrics, 32, 143–155.
Powell, J. (1986b), "Symmetrically Trimmed Least Squares Estimation for Tobit Models," Econometrica, 54, 1435–1460.
Richardson, S. (1999), "Contribution to the discussion of Walker et al., Bayesian Nonparametric Inference for Random Distributions and Related Functions," J. R. Statist. Soc. B, 61, 485–527.
Walker, S.G., Damien, P., Laud, P.W. and Smith, A.F.M. (1999), "Bayesian Nonparametric Inference for Random Distributions and Related Functions," J. R. Statist. Soc. B, 61, 485–527.
Powell, J.L. (2001), "Semiparametric estimation of censored selection models," in Hsiao, C., Morimune, K. and Powell, J.L. (eds), Nonlinear Statistical Modelling, Cambridge University Press.
Yu, K. and Moyeed, R.A. (2001), "Bayesian quantile regression," Statistics and Probability Letters, 54, 437–447.
Yu, K., Lu, Z. and Stander, J. (2003), "Quantile regression: applications and current research areas," The Statistician, 52, 331–350.