
Shape constrained estimation:

a brief introduction

Jon A. Wellner

University of Washington, Seattle

Cowles Foundation, Yale

November 7, 2016

Econometrics lunch seminar, Yale

Based on joint work with:

• Fadoua Balabdaoui

• Piet Groeneboom

• Geurt Jongbloed

• Kaspar Rufibach

• Arseni Seregin

• Charles Doss

Outline

• 1. Some shape constrained models:
  ▷ A: Monotone (and unimodal).
  ▷ B: Convex (or concave).
  ▷ C: k-monotone, completely monotone.
  ▷ D: log-concave; log-convex; s-concave; bi-logconcave.
  ▷ E: hyperbolically monotone, hyperbolically completely monotone.
  ▷ F: higher dimensional monotone, convex, log-concave.

• 2. Estimation methods:
  ▷ A. maximum likelihood
  ▷ B. least squares
  ▷ C. minimum Renyi entropy
  ▷ D. penalized likelihoods


• 3. Types of results?
  ▷ A. Pointwise consistency; global consistency?
  ▷ B. Global consistency; rates of global consistency?
  ▷ C. Global rates of convergence: Hellinger / L1?
  ▷ D. Local distribution theory / limiting distributions?
  ▷ E. Risk bounds: lower; upper; adaptation to boundary?

• 4. Testing and confidence intervals

• 5. Algorithms
  ▷ A. active set
  ▷ B. interior point


1. Some shape constrained models:

A. Monotone (density):

Fmon ≡ {f a density on R+, f(x) ≤ f(y) if x ≥ y}.

f on R+ = [0,∞) is decreasing (i.e. non-increasing) if and only

if it is a scale mixture of uniform densities: for x > 0,

$$f(x) = \int_0^\infty \frac{1}{y}\, 1_{[0,y)}(x)\, dG(y) = \int_{\{y > x\}} \frac{1}{y}\, dG(y), \tag{1}$$

for some distribution function G on (0,∞). [Schoenberg, 1941;

Williamson, 1956; Feller, 1971].

Thus if X ∼ f ∈ Fmon, there is a random variable Z ∼ G on R+

such that

$$X \stackrel{d}{=} UZ$$

where U ∼ Uniform[0,1].


The relationship (1) implies that the corresponding distribution

function F is given by

$$F(x) = \int_0^\infty \frac{x}{y}\, 1_{[0,y)}(x)\, dG(y) + \int_0^\infty 1_{[y,\infty)}(x)\, dG(y) = x f(x) + G(x),$$

and this can be “inverted” to yield

$$G(x) = F(x) - x f(x). \tag{2}$$

From Figure 1 we see that the function on the right side of (2) is

non-negative and non-decreasing: the shaded area gives exactly

the difference F (x)− xf(x).
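The inversion formula (2) can be checked numerically. The sketch below is an illustration, not part of the original slides: it assumes the Exp(1) density as f (an arbitrary choice), forms G = F − x f, and rebuilds f from the mixture representation (1) with a Riemann-Stieltjes sum. The helper names (`f0`, `F0`, `G`, `f_from_mixture`) and the truncation point `upper` are hypothetical.

```python
import math


def f0(x):
    """Exp(1) density, a decreasing density on [0, inf) (illustrative choice)."""
    return math.exp(-x)


def F0(x):
    """Distribution function of Exp(1)."""
    return 1.0 - math.exp(-x)


def G(x):
    """Mixing distribution obtained from the inversion formula (2)."""
    return F0(x) - x * f0(x)


def f_from_mixture(x, upper=50.0, steps=20_000):
    """Approximate the integral of (1/y) dG(y) over y > x, as in (1).

    Uses increments of G on a fine grid, with 1/y evaluated at midpoints;
    the integral is truncated at `upper` (an assumption of this sketch).
    """
    h = (upper - x) / steps
    total = 0.0
    for i in range(steps):
        left = x + i * h
        right = left + h
        total += (G(right) - G(left)) / (0.5 * (left + right))
    return total


grid = [0.1 * i for i in range(1, 60)]
# G should be non-decreasing, as the slide states for the right side of (2).
g_is_monotone = all(G(a) <= G(b) for a, b in zip(grid, grid[1:]))
# Rebuilding f from G via (1) should return the Exp(1) density.
max_error = max(abs(f_from_mixture(x) - f0(x)) for x in (0.5, 1.0, 2.0))
print(g_is_monotone, max_error)
```

For Exp(1) the mixing distribution works out to G(x) = 1 − (1 + x)e^{−x}, the Gamma(2, 1) distribution function, which matches the representation X = UZ with Z ∼ Gamma(2, 1).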


[Figure 1: Inverting the monotone density mixture representation.]


Alternatively, if G is a distribution function with finite mean $\mu_G \equiv \int_0^\infty (1 - G(y))\,dy$, then

$$f(x; G) \equiv \frac{1}{\mu_G}\,(1 - G(x)) \tag{3}$$

is a monotone decreasing density on R+. This mixture representation yields the sub-class of monotone densities with f(0+) < ∞. This is the way that monotone densities often arise in applications, since the right side in (3) is the limiting distribution of both the forward and backward recurrence time distributions (or the distributions of “excess life” $\gamma_t$ and “current life” $\delta_t$) in renewal theory; see e.g. Karlin-Taylor (1975).
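A few closed-form instances of (3) may help fix ideas. These are illustrative choices of G, worked out here rather than taken from the slides; in each case f(0+) = 1/µ_G is finite.

```latex
% G = Exp(\lambda): 1 - G(x) = e^{-\lambda x}, \quad \mu_G = 1/\lambda
f(x; G) = \lambda e^{-\lambda x}, \qquad x \ge 0.

% G = Uniform(0,1): 1 - G(x) = (1 - x)_+ \text{ for } x \ge 0, \quad \mu_G = 1/2
f(x; G) = 2\,(1 - x)_+ .

% G = \delta_c \text{ (point mass at } c > 0\text{)}: 1 - G(x) = 1_{[0,c)}(x), \quad \mu_G = c
f(x; G) = c^{-1}\, 1_{[0,c)}(x).
```

The first case reflects the memoryless property: for exponential inter-arrival times the limiting recurrence-time density is again exponential.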


If we change the uniform density to a triangular density, the mixture representation yields convex decreasing densities $f \in \mathcal{F}_{conv}$:

$$f(x) = \int_0^\infty 2 y^{-1} \left(1 - \frac{x}{y}\right)_+ dG(y).$$

C. k-monotone; completely monotone

Since the triangular density $2(1 - x)1_{[0,1]}(x)$ is just the Beta(1,2) density, we easily find:

Proposition. (Williamson, 1956). A density f is a k-monotone (completely monotone) density if and only if it can be represented as a scale mixture of Beta(1, k) (exponential) densities; with $x_+ \equiv x 1\{x \ge 0\}$,

$$f(x) = \begin{cases} \displaystyle\int_0^\infty y^{-1} \left(1 - \frac{x}{ky}\right)_+^{k-1} dG(y), & k \in \{1, 2, \ldots\}, \\[2ex] \displaystyle\int_0^\infty y^{-1} \exp(-x/y)\, dG(y), & k = \infty, \end{cases} \tag{4}$$

for some distribution function G on (0,∞).


The inversion formulas corresponding to these mixture representations are given in the following proposition.

Proposition 2. Suppose that f is a k-monotone density with distribution function F (so $F(x) = \int_0^x f(t)\,dt$). Then the distribution function $G = G_k$ of (4) is given at continuity points of $G_k$ by

$$G_k(t) = \sum_{j=0}^{k} \frac{(-1)^j}{j!}\,(kt)^j F^{(j)}(kt), \tag{5}$$

$$G_\infty(t) = \lim_{k \to \infty} G_k(t). \tag{6}$$
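As a quick consistency check (worked here, not part of the original slides), the cases k = 1 and k = 2 of (5) can be written out. The k = 1 case recovers the inversion formula (2); the point-mass mixing distribution in the k = 2 check is an illustrative choice.

```latex
% k = 1: formula (5) reduces to the inversion formula (2)
G_1(t) = F(t) - t\,F'(t) = F(t) - t f(t).

% k = 2:
G_2(t) = F(2t) - 2t\,f(2t) + 2t^2 f'(2t).

% Check with G = \delta_1 (point mass at y = 1) in (4), k = 2:
%   f(x) = (1 - x/2)_+, \quad F(x) = x - x^2/4, \quad f'(x) = -1/2 \quad (0 < x < 2).
% For 0 < t < 1:
G_2(t) = (2t - t^2) - 2t(1 - t) + 2t^2 \left(-\tfrac12\right) = 0,
% and for t > 1, where F(2t) = 1 and f(2t) = f'(2t) = 0:
G_2(t) = 1,
% so G_2 = 1_{[1,\infty)} at continuity points, as it should be.
```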


[Figure: Graphical view of the inversion formula, convex decreasing density.]


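Note that for 3 ≤ k < ∞,

$$\mathcal{F}_\infty \subset \mathcal{F}_k \subset \mathcal{F}_2 \subset \mathcal{F}_{mon}.$$

D. Log-concave; log-convex; s-concave; . . .

• p is log-concave on $\mathbb{R}^d$ if $\log p \equiv \varphi$ is concave (on $\mathbb{R}^d$).
• p is log-convex on $\mathbb{R}^d$ if $\log p \equiv \varphi$ is convex (on $\mathbb{R}^d$).
• for s > 0, p is s-concave if $p = \varphi_+^{1/s}$ for some φ concave; for s < 0, p is s-concave if $p = \varphi_+^{1/s}$ for some φ convex.

Then, for −∞ < s < 0 < r < ∞,

$$\mathcal{P}_{-\infty} \supset \mathcal{P}_s \supset \mathcal{P}_0 \supset \mathcal{P}_r \supset \mathcal{P}_\infty.$$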


E. Hyperbolically k−monotone and completely monotone

A non-negative function f(x) on (0,∞) is said to be hyperbolically monotone of order k (or $HM_k$) if for each fixed u the function

$$H(w) \equiv f(uv)\,f(u/v), \qquad w \equiv 2^{-1}(v + 1/v),$$

satisfies $(-1)^j H^{(j)}(w) \ge 0$, j = 0, 1, . . . , k − 1, and $(-1)^{k-1} H^{(k-1)}(w)$ is right-continuous and decreasing. If f is $HM_k$ for every k, then f is hyperbolically completely monotone, or $HM_\infty$.

• Any uniform density is $HM_1$.
• The half-normal densities f(x) = 2φ(x/σ)/σ for x ≥ 0 are $HM_1$, but not $HM_2$.
• The log-normal distribution is $HM_\infty$.


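Proposition 1. (Bondesson, 1991) $X \sim HM_1$ if and only if log X has a log-concave density on R.

Proposition 2. (Bondesson, 1991) If $X \sim HM_k$ and $Y \sim HM_k$ are independent then $XY \sim HM_k$ and $X/Y \sim HM_k$.

Thus, since $HM_\infty \subset HM_k \subset HM_1$ by definition:

$$\text{log-concave} = \log HM_1 \supset \log HM_k \supset \log HM_\infty.$$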
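Proposition 1 can be illustrated with the half-normal and uniform examples listed earlier. The change-of-variables computation below is a worked check added here, not taken from the slides; g denotes the density of Y = log X.

```latex
% Half-normal: f(x) = \frac{2}{\sigma}\,\phi(x/\sigma), \; x \ge 0, \quad Y = \log X
g(y) = f(e^y)\,e^y, \qquad
\log g(y) = \log\frac{2}{\sigma\sqrt{2\pi}} - \frac{e^{2y}}{2\sigma^2} + y,
\qquad
\frac{d^2}{dy^2}\log g(y) = -\frac{2 e^{2y}}{\sigma^2} < 0 .

% Uniform(0,\theta): f(x) = \theta^{-1} 1_{(0,\theta)}(x)
\log g(y) = y - \log\theta \quad (y < \log\theta),
% which is linear, hence concave, on the support.
```

In both cases log X has a log-concave density, in line with these densities being HM1.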


2. Estimation methods:

• Maximum likelihood:

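• Suppose that $X_1, \ldots, X_n$ are i.i.d. $p_0 \in \mathcal{P}$.

• Then with $\mathbb{P}_n \equiv n^{-1}\sum_{i=1}^n \delta_{X_i}$, the log-likelihood is defined by

$$L_n(p) \equiv \sum_{i=1}^n \log p(X_i) \equiv n\,\mathbb{P}_n(\log p).$$

• If $\hat p_n$ satisfies

$$L_n(\hat p_n) = \max_{p \in \mathcal{P}} L_n(p),$$

then $\hat p_n$ is the Maximum Likelihood Estimator of $p_0 \in \mathcal{P}$.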

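• The “adjusted log-likelihood” $\Psi_n$ is defined, on the per-observation scale, by

$$\Psi_n(p) \equiv n^{-1} L_n(p) - \int_{\mathbb{R}^d} p(x)\,dx = \mathbb{P}_n(\log p) - \int_{\mathbb{R}^d} p(x)\,dx.$$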
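One way to see why the integral is subtracted is a standard scaling argument, sketched here and not taken from the slides. It works on the per-observation scale (the log-likelihood divided by n) and takes p to be a density and c > 0 a scaling factor.

```latex
% Per-observation adjusted criterion: \mathbb{P}_n(\log p) - \int p\,dx.
% Evaluate it at the rescaled function c\,p, where \int p\,dx = 1:
\mathbb{P}_n(\log (c\,p)) - \int c\,p(x)\,dx
   = \mathbb{P}_n(\log p) + \log c - c .
% Since \log c - c is maximized at c = 1, maximizing the adjusted criterion over
% non-negative integrable functions of the given shape returns a function that
% integrates to 1, so the normalization constraint need not be imposed explicitly.
```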


• Least squares:

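• $X_1, \ldots, X_n$ i.i.d. $p_0 \in \mathcal{P}$.

• The least squares contrast function is defined by

$$\Psi_n(p) = \frac12 \int p^2(x)\,dx - \int p(x)\,d\mathbb{P}_n(x) \approx \frac12 \int (p(x) - p_n(x))^2\,dx - \frac12 \int p_n^2(x)\,dx$$

if $\mathbb{P}_n$ had density $p_n$; the last term depends only on the data, not on p, so it can be neglected.

• If $\hat p_n$ satisfies

$$\Psi_n(\hat p_n) = \min_{p \in \mathcal{P}} \Psi_n(p),$$

then $\hat p_n$ is the least squares estimator of $p_0 \in \mathcal{P}$.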
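The approximation in the least squares contrast comes from expanding the square. The short derivation below is added for illustration and rests on the heuristic assumption, stated on the slide, that the empirical measure had a density p_n.

```latex
\frac12 \int (p - p_n)^2\,dx - \frac12 \int p_n^2\,dx
   = \frac12 \int p^2\,dx - \int p\,p_n\,dx
   = \frac12 \int p^2\,dx - \int p\,d\mathbb{P}_n ,
% using d\mathbb{P}_n = p_n\,dx in the last step. The dependence of the contrast
% on p is therefore exactly that of the squared L_2 distance
% \frac12 \int (p - p_n)^2\,dx.
```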


• Other methods:

• Minimum Renyi entropy (or divergence).

• Penalized versions of maximum likelihood, LS, . . ..

• Least squares for regression models ...

• L1 for regression models?


Example 1. Monotone densities.

Grenander (1956) showed that the maximum likelihood estimator $\hat f_n$ of f is the (left-) derivative of the least concave majorant of the empirical distribution function $\mathbb{F}_n$ of $X_1, \ldots, X_n$ i.i.d. F.

For the class of monotone densities on [0,∞), the MLE and LSE are identical.

This is not true in general.
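The least concave majorant characterization translates directly into a short computation. The following is a minimal sketch, not the author's implementation: it assumes strictly positive observations, builds the upper convex hull of the points (0, 0), (X_(i), i/n) with a stack, and returns the slopes of the hull segments. The function name `grenander` and the toy sample are hypothetical.

```python
def grenander(sample):
    """Grenander estimator from a sample of positive observations.

    Returns (knots, slopes): the estimate equals slopes[j] on the interval
    (knots[j], knots[j + 1]] and is 0 beyond the largest observation.
    """
    xs = sorted(float(v) for v in sample)
    n = len(xs)
    # Vertices of the empirical distribution function, starting at the origin.
    px = [0.0] + xs
    py = [i / n for i in range(n + 1)]

    hull = [0]  # indices of the vertices of the least concave majorant
    for i in range(1, n + 1):
        while len(hull) >= 2:
            a, b = hull[-2], hull[-1]
            # slope(a, b) <= slope(b, i) means b lies on or below the chord
            # from a to i, so b is not a vertex of the concave majorant.
            lhs = (py[b] - py[a]) * (px[i] - px[b])
            rhs = (py[i] - py[b]) * (px[b] - px[a])
            if lhs <= rhs:
                hull.pop()
            else:
                break
        hull.append(i)

    knots = [px[j] for j in hull]
    slopes = [(py[b] - py[a]) / (px[b] - px[a]) for a, b in zip(hull, hull[1:])]
    return knots, slopes


knots, slopes = grenander([1.0, 3.0, 4.0, 8.0])
print(knots)
print(slopes)
```

On this toy sample the majorant has vertices at 0, 1, 4, and 8, so the estimate is a decreasing step function with three steps that integrates to one. The cross-multiplied slope comparison avoids divisions inside the loop and also handles tied observations.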


[Figure: Least Concave Majorant and Empirical, n = 10]

[Figure: Grenander Estimator and Exp(1) density, n = 10]

[Figure: Least Concave Majorant and Empirical, n = 40]

[Figure: Grenander Estimator and Exp(1) density, n = 40]


Properties of Grenander’s estimator:

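What do we know about $\hat f_n$?

• Consistency:
  ▷ $\hat f_n(x) \to_{a.s.} f_0(x)$ for each x > 0.
  ▷ If $f_0$ is continuous on (0,∞), $\sup_{x \ge c} |\hat f_n(x) - f_0(x)| \to_{a.s.} 0$ for each c > 0.
  ▷ If $f_0(0) < \infty$, then $\hat f_n(0) \to_d f_0(0)\,U^{-1}$ where U ∼ Uniform(0,1).
  ▷ (Groeneboom, Birgé, van de Geer) Let

$$H^2(f, g) \equiv 2^{-1} \int \{\sqrt{f(x)} - \sqrt{g(x)}\}^2\,dx$$

define the Hellinger distance H between f and g. If $\int_{[f_0 < 1]} f_0^{1-\alpha}(x)\,dx < \infty$ and $\int_{[f_0 > 1]} f_0^{1+\alpha}(x)\,dx < \infty$, then

$$H(\hat f_n, f_0) = O_p(n^{-1/3}).$$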


• Limit distribution theory:

  ▷ If $f_0'(x) < 0$, $f_0(x) > 0$, and $f_0'$ is continuous in a neighborhood of x, then with $C \equiv (2^{-1} f_0(x)|f_0'(x)|)^{1/3}$,

$$n^{1/3}(\hat f_n(x) - f_0(x)) \to_d C\,S(0) \stackrel{d}{=} C \cdot 2Z$$

where

S(0) ≡ slope at zero of the least concave majorant of $W(h) - h^2$,

$Z \equiv \operatorname{argmax}_h \{W(h) - h^2\}$;

here W is two-sided Brownian motion started at 0.
  ▷ The distribution of Z is called Chernoff’s distribution.
  ▷ From Groeneboom (1989), much is known about its density $f_Z$.


[Figure: Chernoff’s density $f_Z(x) = (1/2)\,g(x)\,g(-x)$.]


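• Marshall’s lemma: $\|\hat F_n - F_0\|_\infty \le \|\mathbb{F}_n - F_0\|_\infty$, where $\hat F_n$ is the least concave majorant of the empirical distribution function $\mathbb{F}_n$.

• Kiefer-Wolfowitz theorem:
  ▷ If $\alpha_1(F) \equiv \inf\{t : F_0(t) = 1\} < \infty$, $\beta_1(F) \equiv \sup_{t < \alpha_1}(-f_0'(t)/f_0^2(t)) > 0$, and $\gamma_1(F) \equiv \sup_{t < \alpha_1}(-f'(t)) / \inf_{t < \alpha_1} f^2(t) < \infty$,
  ▷ . . . then

$$\|\hat F_n - \mathbb{F}_n\|_\infty = O((n^{-1}\log n)^{2/3}) \quad \text{a.s.}$$

and hence

$$\sqrt{n}\,\|\hat F_n - \mathbb{F}_n\|_\infty = O(n^{-1/6}(\log n)^{2/3}) \quad \text{a.s.}$$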


• Model misspecification: If f is not monotone (or even if F does not have a density), then

$$\int |\hat f_n(x) - f_0^*(x)|\,dx \to_{a.s.} 0$$

where $f_0^*$ satisfies

$$K(f, f_0^*) = \inf\{K(f, g) : g \in \mathcal{F}_{mon}\}.$$

(Pfanzagl (1990), Patilea (2001), Jankowski (2014), Dümbgen, Samworth, and Schuhmacher (2011)).


A. Log-concave densities on R1

Suppose that

$$f(x) \equiv f_\varphi(x) = \exp(\varphi(x)) = \exp(-(-\varphi(x)))$$

where φ is concave (and −φ is convex). The class of all densities f on R of this form is called the class of log-concave densities, $\mathcal{P}_{log\text{-}concave} \equiv \mathcal{P}_0$.

Properties of log-concave densities:

• A density f on R is log-concave if and only if its convolution with any unimodal density is again unimodal (Ibragimov, 1956).
• Every log-concave density f is unimodal (but need not be symmetric).
• $\mathcal{P}_0$ is closed under convolution.


A. Log-concave densities on R1

• Many parametric families are log-concave, for example:
  ▷ Normal(µ, σ²); Uniform(a, b);
  ▷ Gamma(r, λ) for r ≥ 1; Beta(a, b) for a, b ≥ 1.
• $t_r$ densities with r > 0 are not log-concave.
• Tails of log-concave densities are necessarily sub-exponential.
• $\mathcal{P}_{log\text{-}concave}$ = the class of “Polya frequency functions of order 2”, $PF_2$, in the terminology of Schoenberg (1951) and Karlin (1968). See Marshall and Olkin (1979), chapter 18, and Dharmadhikari and Joag-Dev (1988), page 150, for nice introductions.
• Thus:

log-concave = $PF_2$ = strongly unimodal
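The membership claims in the list above can be probed numerically. The sketch below is a heuristic grid check, not a proof and not part of the original slides: it tests concavity of a log-density through second differences on an equally spaced grid. The function names, the grids, and the parameter choices (Gamma shape 2, t with 3 degrees of freedom) are illustrative assumptions.

```python
import math


def is_log_concave(log_density, grid, tol=1e-9):
    """Return True if all second differences of log_density on the grid are <= tol.

    The grid is assumed to be equally spaced; this only checks the grid points.
    """
    vals = [log_density(x) for x in grid]
    return all(
        vals[i - 1] - 2.0 * vals[i] + vals[i + 1] <= tol
        for i in range(1, len(vals) - 1)
    )


def normal_logpdf(x):
    """Log-density of N(0, 1), up to an additive constant."""
    return -0.5 * x * x


def gamma_logpdf(x, r=2.0):
    """Log-density of Gamma(r, 1) on x > 0, up to an additive constant."""
    return (r - 1.0) * math.log(x) - x


def student_t_logpdf(x, r=3.0):
    """Log-density of the t distribution with r degrees of freedom, up to a constant."""
    return -0.5 * (r + 1.0) * math.log(1.0 + x * x / r)


real_grid = [-10.0 + 0.05 * i for i in range(401)]      # covers [-10, 10]
positive_grid = [0.05 * i for i in range(1, 401)]        # covers (0, 20]

print(is_log_concave(normal_logpdf, real_grid))
print(is_log_concave(gamma_logpdf, positive_grid))
print(is_log_concave(student_t_logpdf, real_grid))
```

The t log-density fails the check in the tails: its second derivative becomes positive once |x| exceeds the square root of r, which is the polynomial-tail behaviour that log-concavity rules out.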


B. Nonparametric estimation, log-concave on R

• The (nonparametric) MLE $\hat f_n$ exists (Rufibach, Dümbgen and Rufibach).

• $\hat f_n$ can be computed: R-package “logcondens” (Dümbgen and Rufibach).

• In contrast, the (nonparametric) MLE for the class of unimodal densities on R does not exist. Birgé (1997) and Bickel and Fan (1996) consider alternatives to maximum likelihood for the class of unimodal densities.

• Consistency and rates of convergence for $\hat f_n$: Dümbgen and Rufibach (2009); Pal, Woodroofe and Meyer (2007).

• Pointwise limit theory? Yes! Balabdaoui, Rufibach, and W (2009).



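MLE of f and φ: Let $\mathcal{C}$ denote the class of all concave functions φ : R → [−∞,∞). The estimator $\hat\varphi_n$ based on $X_1, \ldots, X_n$ i.i.d. as $f_0$ is the maximizer of the “adjusted criterion function”

$$\ell_n(\varphi) = \int \log f_\varphi(x)\,d\mathbb{F}_n(x) - \int f_\varphi(x)\,dx = \int \varphi(x)\,d\mathbb{F}_n(x) - \int e^{\varphi(x)}\,dx$$

over $\varphi \in \mathcal{C}$.

Properties of $\hat f_n$, $\hat\varphi_n$: (Dümbgen & Rufibach, 2009)

• $\hat\varphi_n$ is piecewise linear.
• $\hat\varphi_n = -\infty$ on $\mathbb{R} \setminus [X_{(1)}, X_{(n)}]$.
• The knots (or kinks) of $\hat\varphi_n$ occur at a subset of the order statistics $X_{(1)} < X_{(2)} < \cdots < X_{(n)}$.
• Characterized by ...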



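... $\hat\varphi_n$ is the MLE of $\log f_0 = \varphi_0$ if and only if

$$\hat H_n(x) \begin{cases} \le \mathbb{H}_n(x), & \text{for all } x > X_{(1)}, \\ = \mathbb{H}_n(x), & \text{if } x \text{ is a knot,} \end{cases}$$

where

$$\hat F_n(x) = \int_{X_{(1)}}^x \hat f_n(y)\,dy, \qquad \hat H_n(x) = \int_{X_{(1)}}^x \hat F_n(y)\,dy, \qquad \mathbb{H}_n(x) = \int_{-\infty}^x \mathbb{F}_n(y)\,dy.$$

Furthermore, for every function Δ such that $\hat\varphi_n + t\Delta$ is concave for t small enough,

$$\int_{\mathbb{R}} \Delta(x)\,d\mathbb{F}_n(x) \le \int_{\mathbb{R}} \Delta(x)\,d\hat F_n(x).$$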



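Consistency of $\hat f_n$ and $\hat\varphi_n$:

• (Pal, Woodroofe, & Meyer, 2007): If $f_0 \in \mathcal{P}_0$, then $H(\hat f_n, f_0) \to_{a.s.} 0$.

• (Dümbgen & Rufibach, 2009): Suppose $f_0 \in \mathcal{P}_0$ and $\varphi_0 \in \mathcal{H}^{\beta,L}(T)$ for some compact $T = [A, B] \subset \{x : f_0(x) > 0\}^\circ$, L < ∞, and 1 ≤ β ≤ 2. Then

$$\sup_{t \in T}\,(\hat\varphi_n(t) - \varphi_0(t)) = O_p\!\left(\left(\frac{\log n}{n}\right)^{\beta/(2\beta+1)}\right), \text{ and}$$

$$\sup_{t \in T_n}\,(\varphi_0(t) - \hat\varphi_n(t)) = O_p\!\left(\left(\frac{\log n}{n}\right)^{\beta/(2\beta+1)}\right)$$

where $T_n \equiv [A + (\log n/n)^{\beta/(2\beta+1)},\; B - (\log n/n)^{\beta/(2\beta+1)}]$ and β/(2β + 1) ∈ [1/3, 2/5] for 1 ≤ β ≤ 2.

• The same remains true if $\hat\varphi_n$, $\varphi_0$ are replaced by $\hat f_n$, $f_0$.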



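• If $\varphi_0 \in \mathcal{H}^{\beta,L}(T)$ as above and, with $\varphi_0' = \varphi_0'(\cdot-)$ or $\varphi_0'(\cdot+)$, $\varphi_0'(x) - \varphi_0'(y) \ge C(y - x)$ for some C > 0 and all A ≤ x < y ≤ B, then

$$\sup_{t \in T_n} |\hat F_n(t) - \mathbb{F}_n(t)| = O_p\!\left(\left(\frac{\log n}{n}\right)^{3\beta/(4\beta+2)}\right),$$

where 3β/(4β + 2) ∈ [1/2, 3/5] = [.5, .6] for 1 ≤ β ≤ 2.

• If β > 1, this implies $\sup_{t \in T_n} |\hat F_n(t) - \mathbb{F}_n(t)| = o_p(n^{-1/2})$.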
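The exponent ranges quoted in these rate statements are easy to tabulate exactly. The snippet below is a small arithmetic check added for illustration; the function names are hypothetical, and the two formulas are the exponents β/(2β + 1) and 3β/(4β + 2) from the displays above.

```python
from fractions import Fraction


def sup_norm_exponent(beta):
    """Exponent beta / (2 beta + 1) in the sup-norm rate for the log-density."""
    beta = Fraction(beta)
    return beta / (2 * beta + 1)


def cdf_gap_exponent(beta):
    """Exponent 3 beta / (4 beta + 2) in the rate for the gap between the two CDFs."""
    beta = Fraction(beta)
    return 3 * beta / (4 * beta + 2)


for beta in (1, Fraction(3, 2), 2):
    print(beta, sup_norm_exponent(beta), cdf_gap_exponent(beta))
```

The second exponent exceeds 1/2 exactly when β > 1, which is why the difference between the two distribution function estimates is of smaller order than n^{-1/2} in that case.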



[Figure: two sets of four panels each, labelled density functions, log density functions, CDFs, and hazard functions; the horizontal axis runs from 0 to 5 in one set and from 0 to 10 in the other.]

Top panel: Estimated log-densities. Bottom panel: Empirical processes


C: Limit theory at a fixed point in R

Assumptions:

• $f_0$ is log-concave, $f_0(x_0) > 0$.
• If $\varphi_0''(x_0) \ne 0$, then k = 2; otherwise, k is the smallest integer such that $\varphi_0^{(j)}(x_0) = 0$, j = 2, . . . , k − 1, and $\varphi_0^{(k)}(x_0) \ne 0$.
• $\varphi_0^{(k)}$ is continuous in a neighborhood of $x_0$.

Example: $f_0(x) = C\exp(-x^4)$ with $C = \sqrt{2}\,\Gamma(3/4)/\pi$: k = 4.

Driving process: $Y_k(t) = \int_0^t W(s)\,ds - t^{k+2}$, W standard 2-sided Brownian motion.

Invelope process: $H_k$ determined by limit Fenchel relations:

• $H_k(t) \le Y_k(t)$ for all t ∈ R
• $\int_{\mathbb{R}} (H_k(t) - Y_k(t))\,dH_k^{(3)}(t) = 0$.
• $H_k^{(2)}$ is concave.
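The normalizing constant in the example can be verified numerically. This is a check added for illustration, not part of the slides: it compares the stated closed form with the value implied by the integral of exp(−x⁴) over the real line, which equals 2Γ(5/4), and with a midpoint-rule integral on [−4, 4] (the truncation and step size are assumptions of the sketch). The order k = 4 presumably refers to x0 = 0, where the log-density log C − x⁴ has vanishing second and third derivatives and fourth derivative −24.

```python
import math

# Closed form quoted in the example: C = sqrt(2) * Gamma(3/4) / pi.
closed_form = math.sqrt(2.0) * math.gamma(0.75) / math.pi

# The integral of exp(-x^4) over the real line is 2 * Gamma(5/4),
# so the normalizing constant is its reciprocal.
direct = 1.0 / (2.0 * math.gamma(1.25))

# Independent check: midpoint rule on [-4, 4]; the tails beyond are negligible.
h = 1e-3
integral = sum(math.exp(-((-4.0 + (i + 0.5) * h) ** 4)) * h for i in range(8000))

print(closed_form, direct, closed_form * integral)
```

The two closed forms agree by the reflection formula Γ(1/4)Γ(3/4) = π√2.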



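Theorem. (Balabdaoui, Rufibach, & W, 2009)

• Pointwise limit theorem for $\hat f_n(x_0)$:

$$\begin{pmatrix} n^{k/(2k+1)}(\hat f_n(x_0) - f_0(x_0)) \\ n^{(k-1)/(2k+1)}(\hat f_n'(x_0) - f_0'(x_0)) \end{pmatrix} \to_d \begin{pmatrix} c_k H_k^{(2)}(0) \\ d_k H_k^{(3)}(0) \end{pmatrix}$$

where

$$c_k \equiv \left(\frac{f_0(x_0)^{k+1}\,|\varphi_0^{(k)}(x_0)|}{(k+2)!}\right)^{1/(2k+1)}, \qquad d_k \equiv \left(\frac{f_0(x_0)^{k+2}\,|\varphi_0^{(k)}(x_0)|^3}{[(k+2)!]^3}\right)^{1/(2k+1)}.$$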



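• Pointwise limit theorem for $\hat\varphi_n(x_0)$:

$$\begin{pmatrix} n^{k/(2k+1)}(\hat\varphi_n(x_0) - \varphi_0(x_0)) \\ n^{(k-1)/(2k+1)}(\hat\varphi_n'(x_0) - \varphi_0'(x_0)) \end{pmatrix} \to_d \begin{pmatrix} C_k H_k^{(2)}(0) \\ D_k H_k^{(3)}(0) \end{pmatrix}$$

where

$$C_k \equiv \left(\frac{|\varphi_0^{(k)}(x_0)|}{f_0(x_0)^{k}\,(k+2)!}\right)^{1/(2k+1)}, \qquad D_k \equiv \left(\frac{|\varphi_0^{(k)}(x_0)|^3}{f_0(x_0)^{k-1}\,[(k+2)!]^3}\right)^{1/(2k+1)}.$$

• Proof: Use the same perturbation as for the convex-decreasing density proof, with the perturbation version of the characterization.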
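To make the constants in the two pointwise limit theorems concrete, the sketch below evaluates them for one case. It is an illustration under stated assumptions, not part of the slides: the density is taken to be standard normal and x0 = 0, so k = 2, f0(x0) = 1/√(2π), and the second derivative of the log-density is −1. The function name `limit_constants` is hypothetical.

```python
import math


def limit_constants(f0_value, phi_k_value, k):
    """Evaluate (c_k, d_k, C_k, D_k) from the displayed formulas.

    f0_value is f0(x0); phi_k_value is the k-th derivative of log f0 at x0.
    """
    fact = math.factorial(k + 2)
    power = 1.0 / (2 * k + 1)
    a = abs(phi_k_value)
    c_k = (f0_value ** (k + 1) * a / fact) ** power
    d_k = (f0_value ** (k + 2) * a ** 3 / fact ** 3) ** power
    C_k = (a / (f0_value ** k * fact)) ** power
    D_k = (a ** 3 / (f0_value ** (k - 1) * fact ** 3)) ** power
    return c_k, d_k, C_k, D_k


# Standard normal at x0 = 0 (illustrative choice): k = 2, phi''(0) = -1.
f0_at_zero = 1.0 / math.sqrt(2.0 * math.pi)
c2, d2, C2, D2 = limit_constants(f0_at_zero, phi_k_value=-1.0, k=2)
print(c2, d2, C2, D2)
```

The constants for the log-density are those for the density divided by f0(x0), as a delta-method heuristic for φ = log f suggests; the rates in this case are n^{2/5} for the function value and n^{1/5} for the derivative.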


D: Mode estimation, log-concave density on R

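Let $x_0 = M(f_0)$ be the mode of the log-concave density $f_0$, recalling that $\mathcal{P}_0 \subset \mathcal{P}_{unimodal}$. Lower bound calculations using Jongbloed’s perturbation $\varphi_\varepsilon$ of $\varphi_0$ yield:

Proposition. If $f_0 \in \mathcal{P}_0$ satisfies $f_0(x_0) > 0$, $f_0''(x_0) < 0$, and $f_0''$ is continuous in a neighborhood of $x_0$, and $T_n$ is any estimator of the mode $x_0 \equiv M(f_0)$, then with $f_n \equiv \exp(\varphi_{\varepsilon_n})$, $\varepsilon_n \equiv \nu n^{-1/5}$, and $\nu \equiv 2 f_0''(x_0)^2/(5 f_0(x_0))$,

$$\liminf_{n \to \infty}\, n^{1/5} \inf_{T_n} \max\{E_n|T_n - M(f_n)|,\; E_0|T_n - M(f_0)|\} \ge \frac14 \left(\frac{5/2}{10e}\right)^{1/5} \left(\frac{f_0(x_0)}{f_0''(x_0)^2}\right)^{1/5}.$$

Does the MLE $M(\hat f_n)$ achieve this?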






Proposition. (Balabdaoui, Rufibach, & W, 2009) Suppose that $f_0 \in \mathcal{P}_0$ satisfies:

• $\varphi_0^{(j)}(x_0) = 0$, j = 2, . . . , k − 1,
• $\varphi_0^{(k)}(x_0) \ne 0$, and
• $\varphi_0^{(k)}$ is continuous in a neighborhood of $x_0$.

Then $\hat M_n \equiv M(\hat f_n) \equiv \min\{u : \hat f_n(u) = \sup_t \hat f_n(t)\}$ satisfies

$$n^{1/(2k+1)}(\hat M_n - M(f_0)) \to_d \left(\frac{((k+2)!)^2 f_0(x_0)}{f_0^{(k)}(x_0)^2}\right)^{1/(2k+1)} M(H_k^{(2)})$$

where $M(H_k^{(2)}) = \operatorname{argmax}(H_k^{(2)})$.

Note that when k = 2 this agrees with the lower bound calculation, at least up to absolute constants.





E: Generalizations of log-concave to R and Rd:

Three generalizations:

• log-concave densities on $\mathbb{R}^d$ (Cule, Samworth, and Stewart, 2010)

• s-concave and h-transformed convex densities on $\mathbb{R}^d$ (Seregin, 2010)

• Hyperbolically k-monotone and completely monotone densities on R (Bondesson, 1981, 1992)



Log-concave densities on Rd:

• A density f on $\mathbb{R}^d$ is log-concave if f(x) = exp(φ(x)) with φ concave.

• Some properties:
  ▷ Any log-concave f is unimodal.
  ▷ The level sets of f are closed convex sets.
  ▷ Convolutions of log-concave distributions are log-concave.
  ▷ Marginals of log-concave distributions are log-concave.



MLE of $f \in \mathcal{P}_0(\mathbb{R}^d)$: (Cule, Samworth, Stewart, 2010)

• MLE $\hat f_n = \operatorname{argmax}_{f \in \mathcal{P}_0(\mathbb{R}^d)} \mathbb{P}_n \log f$ exists and is unique if n ≥ d + 1.

• The estimator $\hat\varphi_n$ of $\varphi_0$ is a “taut tent” stretched over “tent poles” of certain heights at a subset of the observations.

• Computable via non-differentiable convex optimization methods: Shor’s (1985) r-algorithm: R-package LogConcDEAD (Cule, Samworth, Stewart, 2008).



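• If $f_0$ is any density on $\mathbb{R}^d$ with $\int_{\mathbb{R}^d} \|x\| f_0(x)\,dx < \infty$, $\int_{\mathbb{R}^d} f_0(x) \log f_0(x)\,dx < \infty$, and $\{x \in \mathbb{R}^d : f_0(x) > 0\}^\circ = \operatorname{int}(\operatorname{supp}(f_0)) \ne \emptyset$, then $\hat f_n$ satisfies:

$$\int_{\mathbb{R}^d} |\hat f_n(x) - f^*(x)|\,dx \to_{a.s.} 0$$

where, for the Kullback-Leibler divergence

$$K(f_0, f) = \int f_0 \log(f_0/f)\,d\mu,$$

$$f^* = \operatorname{argmin}_{f \in \mathcal{P}_0(\mathbb{R}^d)} K(f_0, f)$$

is the “pseudo-true” density in $\mathcal{P}_0(\mathbb{R}^d)$ corresponding to $f_0$. In fact:

$$\int_{\mathbb{R}^d} e^{a\|x\|}\,|\hat f_n(x) - f^*(x)|\,dx \to_{a.s.} 0$$

for any $a < a_0$ where $f^*(x) \le \exp(-a_0\|x\| + b_0)$.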


Questions and Problems:

Questions:

• Rates of convergence? Limiting distribution(s) for d > 1? ($n^r$ with r = 2/(4 + d)?)

• MLE (rate-) inefficient for d ≥ 4? How to penalize to get efficient rates?

• Multivariate classes with nice preservation/closure properties and smoother than log-concave?

• Can we treat $\hat f_n \in \mathcal{P}_h$ with misspecification: $f_0 \notin \mathcal{P}_h$?

• Algorithms for computing $\hat f_n \in \mathcal{P}_h$?

• Related results for convex regression on $\mathbb{R}^d$: Seijo and Sen, Ann. Statist. (2011).


Further information:

• Other Jon W talks :

www.stat.washington.edu/jaw/RESEARCH/TALKS/talks.html

• Jon W Probability Seminar:

www.stat.washington.edu/jaw/RESEARCH/TALKS/UW-Prob-

October2016.pdf

• Jon W Frejus lectures:

  ▷ jaw/RESEARCH/TALKS/FrejusDay1-small.pdf






• Groeneboom & Jongbloed (2014). Nonparametric Estimation under Shape Constraints. Cambridge U. Press.

• Kim, A. K., Guntuboyina, A., and Samworth (2016). Adaptation in log-concave density estimation. arXiv:1609.00861.

• Chatterjee, S., Guntuboyina, A., and Sen, B. (2015). On risk bounds in isotonic and other shape restricted regression problems. Ann. Statist. 43, 1774-1800.

• Guntuboyina, A. and Sen, B. (2015). Global risk bounds and adaptation in univariate convex regression. Prob. Theor. Rel. Fields 163, 379-411.


Thanks!

