

Journal of Statistical Planning and Inference 55 (1996) 361-369

Nonparametric Bayes estimation of a distribution function with truncated data

Mauro Gasparini *

Purdue University, USA

Received 23 March 1992; revised 29 March 1995

Abstract

A truncation bias affects the observation of a pair of variables (X, Y), so that data are available only if Y ≤ X. In such a situation, the nonparametric maximum likelihood estimator (NPMLE) of the distribution function of Y may have unpleasant features (Woodroofe, Ann. Statist. 13 (1985) 163-177). As a possible alternative, a nonparametric Bayes estimator is obtained using a Dirichlet prior (Ferguson, Ann. Statist. 1 (1973) 209-230). Its frequentist asymptotic behavior is investigated and found to be the same as the asymptotic behavior of the NPMLE. The results are illustrated by an example, with astronomical data, where the NPMLE is clearly unacceptable.

AMS classifications: Primary 62G05; secondary 62P99

Keywords: Dirichlet priors; Truncated data; Luminosity; Quasar

1. Introduction

Consider sampling from a bivariate population in the presence of truncation bias. Specifically, let Y be a random variable of interest and X an independent truncating covariate, and imagine one can observe only those realizations of the pair (X, Y) for which Y ≤ X. If Y exceeds X, any information about that potential observation is lost. This feature differentiates truncation from censoring, where the smaller of X and Y is observed. Examples abound in Astronomy (Lynden-Bell, 1971), in Survival Analysis (Lagakos et al., 1988) and in Economics (Tsui et al., 1983).

Given a truncated sample (X_1, Y_1), ..., (X_n, Y_n), i.i.d. with distribution

$$H_*(x, y) = \frac{\int_0^x G(y \wedge z)\,\mathrm{d}F(z)}{\int_0^\infty G(z)\,\mathrm{d}F(z)}, \qquad (1)$$

* Correspondence address: Biometrics K-490.3.54, CIBA-Geigy AG, CH-4002 Basel, Switzerland. Tel.: 41-61-6967579; Fax: 41-61-6968477; email: [email protected]. Research partially supported by US Army Research Office under DAAL-03-88-0122.

0378-3758/96/$15.00 © 1996 Elsevier Science B.V. All rights reserved. SSDI 0378-3758(95)00195-6


where G is the distribution function of Y, F is the distribution function of X and F(0) = G(0) = 0, the nonparametric maximum likelihood estimator (NPMLE) of G(u), u > 0, given by

$$\hat G(u) = \prod^*_{j:\,y_j > u} \frac{G_n(y_j^-) - F_n(y_j^-)}{G_n(y_j) - F_n(y_j^-)}, \qquad (2)$$

where F_n(z) := #{i ≤ n : x_i ≤ z}/n and G_n(z) := #{i ≤ n : y_i ≤ z}/n are the marginal empirical distribution functions, the superscript − indicates left limits and ∏* indicates that the product extends over distinct y values, has been studied by several authors. Earlier references can be found in Woodroofe (1985), which also studies its asymptotic properties. The NPMLE behaves nicely as the sample size n → ∞, but for a moderate sample size it may have very undesirable features. In particular, if F_n(u) = G_n(u) for some u, then Ĝ(u′) = 0 for all u′ < u. The probability of this occurrence goes to 0 as n → ∞ if F and G are continuous (see Woodroofe, 1985), but when it actually happens the researcher is faced with an NPMLE that basically ignores a fraction, possibly a big one, of the data. An example is included in Section 4.

This paper introduces a Nonparametric Bayes Estimator (NPBE) of G, formula (4), as an alternative to the NPMLE. The Bayes estimator reduces the effects of the unpleasant property of the NPMLE described above and may therefore be preferable to it even from a frequentist point of view. In cases when substantial prior information for G is available, a Bayes estimator becomes a natural candidate.

The Dirichlet-Ferguson process has been extensively applied to nonparametric statistical problems (for a review, see Ferguson et al., 1992). The NPBE from a Dirichlet-Ferguson prior (Ferguson, 1973) and truncated data is derived in Section 2, formula (4). In Section 3, NPMLE and NPBE are shown to have the same frequentist asymptotic behavior. Section 4 deals with an example that illustrates the problems the NPMLE may create and the use of alternative NPBEs.

Estimator (4) was obtained in Gasparini (1990). Tiwari and Zalkikar (1993) obtain similar results and give a different application from the one in Section 4.

2. The Bayes estimator

The following lemma gives a computational formula for an integral to be used in the proof of Theorem 1. It can be easily proved by calculus.

Lemma 1. For a positive integer m and 0 < Σ_{j=1}^i c_j < Σ_{j=1}^i a_j for all i = 1, ..., m,

$$\int_S \prod_{i=1}^{m} (s_i - s_{i-1})^{a_i - 1}\, s_i^{-c_i}\, (1 - s_m)^{a_{m+1} - 1}\, \mathrm{d}s_1 \cdots \mathrm{d}s_m = \prod_{i=1}^{m} \beta\Big(\sum_{j=1}^{i}(a_j - c_j),\; a_{i+1}\Big), \qquad (3)$$


where S = {(s_1, ..., s_m) : 0 = s_0 < s_1 < s_2 < ⋯ < s_m < 1} and β(·,·) indicates Euler's beta function.
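For m = 1 the lemma reduces to ∫₀¹ s^{a₁−c₁−1}(1−s)^{a₂−1} ds = β(a₁−c₁, a₂), which is easy to confirm numerically. The quadrature below is a crude midpoint rule, included only as a sanity check; the parameter values are illustrative.

```python
import math

def beta_fn(p, q):
    """Euler's beta function via the gamma function."""
    return math.gamma(p) * math.gamma(q) / math.gamma(p + q)

# check the m = 1 case of Lemma 1 with a1 = 4, c1 = 1.5, a2 = 3 (so 0 < c1 < a1)
a1, c1, a2 = 4.0, 1.5, 3.0
N = 100_000                          # midpoint rule with N subintervals on (0, 1)
h = 1.0 / N
integral = h * sum(
    ((i + 0.5) * h) ** (a1 - c1 - 1) * (1 - (i + 0.5) * h) ** (a2 - 1)
    for i in range(N)
)
```

The midpoint sum agrees with β(2.5, 3) to several decimal places.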

In Theorem 1, an NPBE of the distribution function G of Y is obtained in the form of the posterior expectation of G, for the case where the truncating values are fixed constants.

Theorem 1. Let G be the distribution function associated with a Dirichlet process on ℝ with parameter measure α finite and positive over any interval. G, viewed as a random distribution function, is defined on some probability space (Ω, ℱ, ℙ). Let x_1, x_2, ..., x_n be real numbers and (Y_1, Y_2, ..., Y_n) random variables, conditionally independent given G and such that Y_k has distribution function ℙ{Y_k ≤ y | G} = min{G(y)/G(x_k), 1}. Then the posterior expectation of the parameter G(u), u ∈ ℝ, given a sample (x, y) = (x_1, Y_1 = y_1), ..., (x_n, Y_n = y_n) is

$$\tilde G(u) = \frac{\alpha(u) + n(G_n(u) - F_n(u))}{\alpha(\infty)} \prod^*_{k:\,x_k > u} \frac{\alpha(x_k) + n(G_n(x_k) - F_n(x_k))}{\alpha(x_k) + n(G_n(x_k) - F_n(x_k^-))}, \qquad (4)$$

where α(z) := α((−∞, z]), F_n(z^-) is the left limit of the empirical distribution function F_n at z and ∏* indicates that the product extends over distinct x values.
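Like (2), formula (4) is a finite product and can be evaluated directly. The sketch below assumes a user-supplied function alpha(z) = α((−∞, z]) with total mass alpha_inf; all names are illustrative, not from the paper.

```python
def npbe(xs, ys, u, alpha, alpha_inf):
    """Nonparametric Bayes estimator of formula (4) at the point u.

    alpha: z -> alpha((-inf, z]) for the Dirichlet parameter measure;
    alpha_inf: total mass alpha((-inf, +inf)).
    """
    n = len(ys)
    Fn = lambda z: sum(x <= z for x in xs) / n        # empirical cdf of the x's
    Gn = lambda z: sum(y <= z for y in ys) / n        # empirical cdf of the y's
    Fn_left = lambda z: sum(x < z for x in xs) / n    # left limit F_n(z-)
    val = (alpha(u) + n * (Gn(u) - Fn(u))) / alpha_inf
    for xk in sorted(set(xs)):                        # product over distinct x values
        if xk > u:
            val *= (alpha(xk) + n * (Gn(xk) - Fn(xk))) / (
                alpha(xk) + n * (Gn(xk) - Fn_left(xk))
            )
    return val
```

Unlike (2), every factor stays strictly positive whenever α charges every interval, so the estimator never collapses to zero on a portion of the data.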

Proof. Let y_1, ..., y_n be the observed values of Y_1, ..., Y_n and choose ε > 0 small enough that the interval (y_k − ε, y_k], for every k, contains the point y_k and its ties and no other x or y point. Define D := ∩_{k=1}^n {y_k − ε < Y_k ≤ y_k ≤ x_k}. Then, for u ∉ (y_k − ε, y_k] and u ≠ x_k for all k,

$$E(G(u) \mid D) = \frac{\int G(u) \prod_{k=1}^{n} \dfrac{G(y_k) - G(y_k - \varepsilon)}{G(x_k)}\, \mathrm{d}\mathbb{P}}{\int \prod_{k=1}^{n} \dfrac{G(y_k) - G(y_k - \varepsilon)}{G(x_k)}\, \mathrm{d}\mathbb{P}}. \qquad (5)$$

Numerator and denominator of the above equation can be computed using the distribution of the random distribution function G. To obtain the numerator, which will be denoted by NUM, partition the real line into m + 1 intervals with endpoints given by u and the distinct values among y_k − ε, y_k, x_k, k = 1, ..., n. Now let α_i be the measure of the generic i-th interval, i = 1, ..., m + 1, and let γ_i (resp. φ_i) be the number of y (resp. x) points less than or equal to the right endpoint of the i-th interval; note that in this way the quantity γ_i − γ_{i−1} (resp. φ_i − φ_{i−1}) is the y (resp. x) multiplicity of the right endpoint of the i-th interval. Finally, let i_u be the index corresponding to the interval with right endpoint u. Since the increments of G over any partition of the real line have a Dirichlet distribution, let Γ(·) denote the gamma function, apply Lemma 1 and obtain

$$\mathrm{NUM} = \frac{\Gamma\big(\sum_{i=1}^{m+1}\alpha_i\big)}{\prod_{i=1}^{m+1}\Gamma(\alpha_i)} \int_{t_i > 0,\; \sum_j t_j < 1} \Big(\sum_{j=1}^{i_u} t_j\Big) \prod_{i=1}^{m} t_i^{\alpha_i + (\gamma_i - \gamma_{i-1}) - 1} \Big(\sum_{j=1}^{i} t_j\Big)^{-(\phi_i - \phi_{i-1})} \Big(1 - \sum_{j=1}^{m} t_j\Big)^{\alpha_{m+1} - 1} \mathrm{d}t_1 \cdots \mathrm{d}t_m \qquad (6)-(7)$$

which, by Lemma 1, is proportional, with a proportionality constant common to numerator and denominator of (5), to

$$\frac{\alpha(u) + n(G_n(u) - F_n(u))}{\alpha((-\infty, +\infty))} \cdot \frac{\prod^*_j \big(\alpha((y_j - \varepsilon, y_j])\big)^{[g_j]}}{\prod^*_k \big(\alpha(x_k) + n(G_n(x_k) - F_n(x_k)) + 1_{\{x_k \ge u\}}\big)^{[f_k]}}, \qquad (8)$$

where a^{[k]} := a(a + 1) ⋯ (a + k − 1) is the ascending factorial, the product at the numerator (resp. denominator) extends over distinct y (resp. x) values and g_j (resp. f_k) is the multiplicity of y_j (resp. x_k), i.e. g_j = n(G_n(y_j) − G_n(y_j^-)) and f_k = n(F_n(x_k) − F_n(x_k^-)).

After calculations similar to the previous ones for the denominator, we obtain

$$E(G(u) \mid D) = \frac{\alpha(u) + n(G_n(u) - F_n(u))}{\alpha(\infty)} \prod^*_{k:\,x_k > u} \frac{\big(\alpha(x_k) + n(G_n(x_k) - F_n(x_k))\big)^{[f_k]}}{\big(\alpha(x_k) + n(G_n(x_k) - F_n(x_k)) + 1\big)^{[f_k]}}.$$

The right-hand side of the above expression gives, after a series of short telescopic simplifications, the right-hand side of (4), and since it does not depend on ε, by letting ε → 0 we obtain the result. □
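Telescopic simplifications of this kind rest on the ascending-factorial identity a^{[k]}/(a + 1)^{[k]} = a/(a + k), which collapses each ratio of ascending factorials into a single fraction. A quick numerical check (values illustrative):

```python
def ascending_factorial(a, k):
    """a^[k] = a (a + 1) ... (a + k - 1), with a^[0] = 1."""
    out = 1.0
    for i in range(k):
        out *= a + i
    return out

# identity behind the telescoping: a^[k] / (a + 1)^[k] = a / (a + k)
a, k = 2.7, 5
lhs = ascending_factorial(a, k) / ascending_factorial(a + 1, k)
rhs = a / (a + k)
```

The identity holds because the two ascending factorials share all factors except the first of the numerator and the last of the denominator.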

The theorem can be extended to consider the case of random X under specific assumptions. At the price of tedious computations, similar techniques could be used to obtain the distribution of G.

The following lemma, which bounds the NPBE by functions with jumps in y points only, is instrumental in the proofs of Theorems 2 and 3 (proof omitted).

Lemma 2. If α(·) is continuous, the following bounds for the NPBE hold:

$$\frac{\alpha(y_n)}{\alpha(\infty)} \prod^*_{j:\,y_j > u} \frac{\alpha(y_j) + n(G_n(y_j^-) - F_n(y_j^-))}{\alpha(y_j) + n(G_n(y_j) - F_n(y_j^-))} \;\le\; \tilde G(u) \;\le\; \prod^*_{j:\,y_j > u} \frac{\alpha(\infty) + n(G_n(y_j^-) - F_n(y_j^-))}{\alpha(\infty) + n(G_n(y_j) - F_n(y_j^-))}. \qquad (9)$$

The relationship between NPBE and NPMLE is investigated in the following.

Theorem 2. If α(ℝ) → 0 in such a way that

$$\phi(u) := \lim_{\alpha(\mathbb{R}) \to 0} E(G(u)) = \lim_{\alpha(\mathbb{R}) \to 0} \alpha(u)/\alpha(\mathbb{R})$$

exists for every u ∈ ℝ and is a distribution function, then

$$\lim_{\alpha(\mathbb{R}) \to 0} \tilde G(u) = \begin{cases} \phi(x_{(n)})\, \hat G(u) & \text{if } u \le x_{(n)}, \\ \phi(u) & \text{if } u > x_{(n)}, \end{cases} \qquad (10)$$

where Ĝ(u) is given by (2).


The limit of the Bayes estimator behaves like a rescaled NPMLE up to the last observed value. Beyond that value the sample does not convey any information and the estimate equals the limit of the prior guess.
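This limiting behavior can be illustrated numerically on a toy truncated sample by letting the total mass of α shrink. Below, phi is an illustrative prior-guess cdf (uniform on [0, 10], not from the paper) and the two estimators are the products (2) and (4), re-implemented inline for self-containment.

```python
xs, ys = [2.0, 3.0, 4.0], [1.0, 2.0, 3.0]     # toy truncated sample, y_i <= x_i
n = len(ys)
phi = lambda z: min(max(z / 10.0, 0.0), 1.0)  # illustrative prior-guess cdf

Fn = lambda z: sum(x <= z for x in xs) / n
Gn = lambda z: sum(y <= z for y in ys) / n
Fn_left = lambda z: sum(x < z for x in xs) / n
Gn_left = lambda z: sum(y < z for y in ys) / n

def npmle(u):
    """NPMLE of formula (2)."""
    p = 1.0
    for yj in sorted(set(ys)):
        if yj > u:
            p *= (Gn_left(yj) - Fn_left(yj)) / (Gn(yj) - Fn_left(yj))
    return p

def npbe(u, eps):
    """NPBE of formula (4) with the vanishing prior measure alpha = eps * phi."""
    alpha = lambda z: eps * phi(z)
    val = (alpha(u) + n * (Gn(u) - Fn(u))) / eps      # alpha(R) = eps here
    for xk in sorted(set(xs)):
        if xk > u:
            val *= (alpha(xk) + n * (Gn(xk) - Fn(xk))) / (
                alpha(xk) + n * (Gn(xk) - Fn_left(xk))
            )
    return val
```

With eps = 1e-9 one finds npbe(2.5, eps) close to φ(x_(n)) · Ĝ(2.5) = 0.4 × 0.5 = 0.2 below the largest x value x_(n) = 4, and npbe(5.0, eps) = φ(5.0) = 0.5 beyond it, in accordance with (10).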

3. Asymptotic behavior of the Bayes estimator

Consider F and G to be unknown but fixed and suppose, to restrict attention to the regular case, that they are continuous and belong to

$$\mathcal{M}_0 := \{(F, G) : F(0) = G(0) = 0,\; P(Y \le X) > 0,\; a_G \le a_F,\; b_G \le b_F\},$$

where, for any distribution function M, a_M = inf{z > 0 : M(z) > 0} and b_M = sup{z > 0 : M(z) < 1}. The marginal distributions of X and Y conditional on Y ≤ X are

$$F_*(x) := \int_0^x G\,\mathrm{d}F \Big/ P(Y \le X), \quad x \ge 0,$$

$$G_*(y) := \int_0^y (1 - F)\,\mathrm{d}G \Big/ P(Y \le X), \quad y \ge 0,$$

where P(Y ≤ X) = ∫_0^∞ G dF. As in Woodroofe (1985), note that

$$C(u) := G_*(u) - F_*(u^-) = G(u)\,[1 - F(u^-)]/P(Y \le X). \qquad (11)$$
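Identity (11) can be checked by simulation. The sketch below uses the illustrative choice Y ~ Exp(1), X ~ Uniform(0, 2), which is not from the paper; both distributions are continuous, so left limits are immaterial.

```python
import math
import random

random.seed(0)
u = 1.0
kept = []                                   # pairs surviving the truncation Y <= X
for _ in range(200_000):
    x = random.uniform(0.0, 2.0)            # X ~ F = Uniform(0, 2)
    y = random.expovariate(1.0)             # Y ~ G = Exp(1), independent of X
    if y <= x:
        kept.append((x, y))

n = len(kept)
G_star = sum(y <= u for _, y in kept) / n   # empirical version of G_*(u)
F_star = sum(x <= u for x, _ in kept) / n   # empirical version of F_*(u)

p_obs = 1.0 - 0.5 * (1.0 - math.exp(-2.0))            # P(Y <= X) in closed form here
rhs = (1.0 - math.exp(-u)) * (1.0 - u / 2.0) / p_obs  # G(u)[1 - F(u)] / P(Y <= X)
```

For u = 1, both G_star − F_star and rhs come out near 0.557, up to Monte Carlo error.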

In this section the notation will be different from the previous sections: all x and y symbols will denote the ordered values in the sample, that is y_1 < y_2 < ⋯ < y_n and x_1 < x_2 < ⋯ < x_n with probability 1. Define also c_j := n(G_n(y_j) − F_n(y_j^-)) and ε_j := α(y_j) − α(y_{j−1}), where the dependence on n is understood.

The following technical lemma is used in Theorem 3 to establish the convergence of the NPBE over intervals not including the origin.

Lemma 3. If F and G are continuous, (F, G) ∈ ℳ₀ and ∫(1 − F)^{−1} dG < ∞, then for every s such that G(s) > 0, the following assertions hold:

$$\sqrt{n}\,\Big[\prod_{j:\,y_j > s} \Big(1 + \frac{\alpha(\infty)}{c_j(c_j - 1)}\Big) - 1\Big] \to_P 0 \quad \text{as } n \to \infty \qquad (12)$$

and

$$\sqrt{n} \sum_{j:\,y_j > s} \frac{\varepsilon_j}{c_j} \to_P 0 \quad \text{as } n \to \infty. \qquad (13)$$

Proof. For the first assertion, it suffices to prove that

$$\sqrt{n}\,\log \prod_{j:\,y_j > s} \Big(1 + \frac{\alpha(\infty)}{c_j(c_j - 1)}\Big) \to_P 0. \qquad (14)$$


Now, it is easily seen that

$$0 \le \sqrt{n}\,\log \prod_{j:\,y_j > s} \Big(1 + \frac{\alpha(\infty)}{c_j(c_j - 1)}\Big) \le \sqrt{n} \sum_{j:\,y_j > s} \frac{\alpha(\infty)}{c_j(c_j - 1)} \le 3\alpha(\infty)\sqrt{n} \sum_{j=1}^{n} \frac{1_{\{y_j > s\}}}{c_j(c_j + 1)},$$

since 1/(c_j(c_j − 1)) ≤ 3/(c_j(c_j + 1)) over the set A_n := {c_j > 1 for all j = 1, ..., n}, to which attention is restricted, without loss of generality, since P(A_n) → 1 by Corollary 5 of Woodroofe (1985). By an argument used in the same reference, the conditional distribution of c_j − 1 given y_j is binomial with parameters (n − 1, C(y_j)) for each j = 1, ..., n. It can then be seen by an elementary computation that E(1/(c_j(c_j + 1)) | y_j) ≤ 1/(n²C²(y_j)), which implies

$$E\Big(3\alpha(\infty)\sqrt{n} \sum_{j=1}^{n} \frac{1_{\{y_j > s\}}}{c_j(c_j + 1)}\Big) \le \frac{3\alpha(\infty)}{\sqrt{n}}\, E\Big(\frac{1_{\{y_1 > s\}}}{C^2(y_1)}\Big) \le \frac{3\alpha(\infty)}{\sqrt{n}\,G^2(s)} \int \frac{\mathrm{d}G}{1 - F} \to 0.$$

By Markov's inequality, (14) holds. The second assertion is proved by a similar reasoning. □

Theorem 3 establishes the asymptotic equivalence of the NPBE and the NPMLE:

Theorem 3. If F and G are continuous, (F, G) ∈ ℳ₀, ∫(1 − F)^{−1} dG < ∞, α(·) is continuous and sup{z > 0 : α(z) < α(∞)} ≤ b_G, then

$$\sup_u \sqrt{n}\,|\tilde G - \hat G|(u) \to_P 0 \quad \text{as } n \to \infty. \qquad (15)$$

Proof. Since the NPMLE, given by Eq. (2), can now be rewritten as Ĝ(u) = ∏_{j:y_j>u}(c_j − 1)/c_j, then over the same set A_n defined in Lemma 3 the bounds given in Lemma 2 imply

$$\frac{\alpha(y_n)}{\alpha(\infty)}\, \hat G(u) \prod_{j:\,y_j > u} \Big(1 - \frac{\varepsilon_j}{c_j}\Big) \;\le\; \tilde G(u) \;\le\; \hat G(u) \prod_{j:\,y_j > u} \Big(1 + \frac{\alpha(\infty)}{c_j(c_j - 1)}\Big),$$

and to prove the theorem it is enough to show that

$$Q_n := \sup_u \sqrt{n}\,\Big|\prod_{j:\,y_j > u} \Big(1 + \frac{\alpha(\infty)}{c_j(c_j - 1)}\Big) - 1\Big| \to_P 0 \qquad (16)$$

and

$$R_n := \sup_u \sqrt{n}\,\Big|\frac{\alpha(y_n)}{\alpha(\infty)} \prod_{j:\,y_j > u} \Big(1 - \frac{\varepsilon_j}{c_j}\Big) - 1\Big| \to_P 0 \qquad (17)$$

as n → ∞. Convergence of expressions like Q_n and R_n if the supremum is taken over intervals where G is bounded away from 0 is ensured by Lemma 3. For small values of u instead, we have to identify a set having high probability and over which the quantities c_j are


big enough to bound the products appearing in Eqs. (16) and (17). It is possible to identify such a set using in-probability linear bounds provided, for example, by Shorack and Wellner (1986, p. 418). Details are omitted. □

4. An example

Lynden-Bell (1971) reports a set of astronomical data where the problem of truncation arises naturally from physical constraints. In his notation P, a certain measure of luminosity of a quasar, cannot be observed unless it is above a limit P_m, the realization of another variable for the same quasar. In case P ≥ P_m, both variables are observed; otherwise, the quasar does not enter the sample. The problem is estimating the distribution function of P, given a sample of quasars for which P and P_m are available. To use the previous results, put Y = −P and X = −P_m; the problem is now estimating the distribution function G of Y, given data subject to the familiar constraint Y ≤ X. Observed values of (−X, −Y) in a sample of 40 quasars are (5.11, 5.33), (5.05, 5.14), (4.88, 5.13), (4.75, 4.75), (4.74, 4.86), (4.63, 4.79), (4.61, 4.92), (4.59, 4.59), (4.55, 4.77), (4.54, 4.59), (4.53, 4.55), (4.52, 4.54), (4.52, 4.59), (4.47, 4.66), (4.40, 4.45), (4.39, 4.74), (4.39, 4.71), (4.36, 5.18), (4.35, 4.46), (4.31, 4.68), (4.29, 4.43), (4.28, 5.09), (4.25, 4.43), (4.25, 4.27), (4.22, 4.47), (4.21, 4.33), (4.20, 4.38), (4.06, 4.34), (3.98, 4.67), (3.87, 4.67), (3.86, 4.58), (3.85, 3.85), (3.84, 4.06), (3.77, 3.93), (3.76, 3.81), (3.75, 3.80), (3.57, 3.82), (3.35, 3.46), (3.30, 3.39), (2.49, 3.36).

Now observe that to the left of −3.57 the empirical distribution functions of X and Y are the same, therefore Ĝ degenerates to 0 for all points less than −3.57. The NPMLE in this situation does not use 37 of the 40 observations. Lynden-Bell is aware of this problem and, instead of using the three points left, he uses the other 37, labeling those three points as 'affected by numerical fluctuations'. Both Ĝ, a step function with only three jumps, and the NPMLE based on the reduced sample of 37 observations, used by Lynden-Bell and here labeled MLE37, are shown in Fig. 1.

This is a typical situation where a Bayes estimator is of great help, since it makes effective use of all the observations and offers a compromise between Ĝ and MLE37.
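The degeneracy is easy to reproduce: transcribing the 40 pairs above as (P_m, P), setting (x, y) = (−P_m, −P) and evaluating the product (2) (re-implemented inline here for self-containment) gives an estimator supported on only the three largest y values.

```python
# (P_m, P) pairs transcribed from the sample of 40 quasars above
data = [
    (5.11, 5.33), (5.05, 5.14), (4.88, 5.13), (4.75, 4.75), (4.74, 4.86),
    (4.63, 4.79), (4.61, 4.92), (4.59, 4.59), (4.55, 4.77), (4.54, 4.59),
    (4.53, 4.55), (4.52, 4.54), (4.52, 4.59), (4.47, 4.66), (4.40, 4.45),
    (4.39, 4.74), (4.39, 4.71), (4.36, 5.18), (4.35, 4.46), (4.31, 4.68),
    (4.29, 4.43), (4.28, 5.09), (4.25, 4.43), (4.25, 4.27), (4.22, 4.47),
    (4.21, 4.33), (4.20, 4.38), (4.06, 4.34), (3.98, 4.67), (3.87, 4.67),
    (3.86, 4.58), (3.85, 3.85), (3.84, 4.06), (3.77, 3.93), (3.76, 3.81),
    (3.75, 3.80), (3.57, 3.82), (3.35, 3.46), (3.30, 3.39), (2.49, 3.36),
]
xs = [-pm for pm, p in data]                 # X = -P_m
ys = [-p for pm, p in data]                  # Y = -P, so Y <= X
n = len(ys)

def npmle(u):
    """NPMLE of formula (2) evaluated at u."""
    Gn = lambda z: sum(y <= z for y in ys) / n
    Gn_left = lambda z: sum(y < z for y in ys) / n
    Fn_left = lambda z: sum(x < z for x in xs) / n
    prod = 1.0
    for yj in sorted(set(ys)):
        if yj > u:
            prod *= (Gn_left(yj) - Fn_left(yj)) / (Gn(yj) - Fn_left(yj))
    return prod
```

The empirical cdfs of x and y coincide at −3.57 (both equal 37/40), and the resulting Ĝ vanishes below its first jump and climbs to 1 in just three steps, at y = −3.46, −3.39 and −3.36.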

Estimators (4) corresponding to the prior measure given by

$$\alpha(u) = \begin{cases} C \exp\{-A(-u)^B\} & \text{if } u \le 0, \\ C & \text{otherwise}, \end{cases} \qquad (18)$$

where A = 0.000488 and B = 5.887, are plotted in Fig. 2 for C = 5 and 40. The discussion that follows motivates the choice of such a specific α.
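The prior guess for G implied by (18) is the Weibull-type curve E G(u) = α(u)/α(∞) = exp{−A(−u)^B} for u ≤ 0, which does not depend on the scale C. A small sketch (names illustrative):

```python
import math

A, B = 0.000488, 5.887        # fitted values quoted in the text

def alpha(u, C):
    """The measure (18): alpha((-inf, u]) with total mass C."""
    return C * math.exp(-A * (-u) ** B) if u <= 0 else C

def prior_guess(u, C=40.0):
    """Prior expectation of G(u); the scale C cancels in the ratio."""
    return alpha(u, C) / alpha(float("inf"), C)
```

prior_guess equals 1 on the positive axis and decreases toward 0 as u moves left, falling below 0.05 around u = −4.5, roughly the range of the quasar data.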

We assume first that α belongs to a specific parametric family of measures like (18), so that the problem reduces to fixing the values of A, B and C. In a proper Bayesian approach these parameters would be specified a priori through elicitation of prior information and beliefs, and a viable method to do that seems to be the specification of certain percentiles of the prior expectation of G.


Fig. 1. Maximum likelihood estimators: the NPMLE (labeled MLE(y)) and MLE37(y), plotted for y between −5.0 and −2.5.

Fig. 2. Bayes estimators: NPBE(y) for C = 40 and C = 5, plotted for y between −5.0 and −2.5.

In the absence of substantial prior information on G, another solution is adopted here: the values of A and B are chosen in such a way that the prior expectation of G fits MLE37. In detail, a least-squares line is fitted through the empirical scatter plot of log(−y) and log(−log(MLE37(y))), which looks roughly linear except for the last three (of the 37) points, to be ignored. The numerical values of A and B are then


obtained through obvious transformations of the regression coefficients. The reason for ignoring the last three points of MLE37 is that the latter is an unreliable estimator of the right tail of the distribution, most likely overestimating it. The procedure is heuristic and makes double use of the data, but having the prior expectation of G fit MLE37 does not lack intuitive appeal.

The parameter C may roughly be interpreted as the weight the researcher is willing to give to the prior guess, when a weight equivalent to the sample size is given to the NPMLE. Since the prior guess is here an interpolation of MLE37, the case C = 40 represents, roughly, an equal compromise between NPMLE and MLE37.

Acknowledgements

Without the indispensable help and advice of Professor Michael Woodroofe this research would not have been possible.

References

Ferguson, T.S. (1973). A Bayesian analysis of some nonparametric problems. Ann. Statist. 1, 209-230.

Ferguson, T.S., E.G. Phadia and R.C. Tiwari (1992). Bayesian nonparametric inference. In: M. Ghosh and P.K. Patak, Eds., Current Issues in Statistical Inference: Essays in Honor of D. Basu, IMS Lecture Notes-Monograph Series, Vol. 17, 127-150.

Gasparini, M. (1990). Nonparametric Bayes estimation of a distribution function with truncated data. Technical Report #182, Dept. of Statistics, The University of Michigan.

Lagakos, S.W., L.M. Barraj and V. De Gruttola (1988). Nonparametric analysis of truncated survival data, with application to AIDS. Biometrika 75, 515-523.

Lynden-Bell, D. (1971). A method of allowing for known observational selection in small samples applied to 3CR quasars. Monthly Notices Roy. Astronom. Soc. 155, 95-118.

Shorack, G.R. and J.A. Wellner (1986). Empirical Processes with Applications to Statistics. Wiley, New York.

Tiwari, R.C. and J.N. Zalkikar (1993). Nonparametric Bayesian estimation of survival function under random left truncation. J. Statist. Plann. Inference 35, 31-45.

Tsui, K.-L., N.P. Jewell and C.F.J. Wu (1983). A nonparametric approach to the truncated regression problem. J. Amer. Statist. Assoc. 78, 785-792.

Woodroofe, M. (1985). Estimating a distribution function with truncated data. Ann. Statist. 13, 163-177.