
Beating Monte Carlo Integration: a Nonasymptotic Study of Kernel Smoothing Methods

Stephan Clémençon, François Portier
Télécom ParisTech, LTCI, Université Paris-Saclay

Abstract

Evaluating integrals is a ubiquitous issue, and Monte Carlo methods, exploiting advances in random number generation over the last decades, offer a popular and powerful alternative to deterministic integration techniques, which are unsuited in particular when the domain of integration is complex. This paper is devoted to the study, from a nonasymptotic perspective, of a kernel smoothing based competitor built from a sequence of n ≥ 1 i.i.d. random vectors with arbitrary continuous probability distribution f(x)dx, originally proposed in [7]. We establish a probability bound showing that the method under study, though biased, produces an estimate approximating the target integral ∫_{x∈R^d} ϕ(x)dx with an error bound of order o(1/√n) uniformly over a class Φ of functions ϕ, under weak complexity/smoothness assumptions related to the class Φ, thereby outperforming Monte Carlo procedures. This striking result is shown to derive from an appropriate decomposition of the maximal deviation between the target integrals and their estimates, highlighting the remarkable benefit of averaging strongly dependent terms regarding statistical accuracy in this situation. The theoretical analysis then rests on sharp probability inequalities for degenerate U-statistics. It is illustrated by numerical results in the context of covariate shift regression, providing empirical evidence of the relevance of the approach.

Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS) 2018, Lanzarote, Spain. JMLR: W&CP volume 7X. Copyright 2018 by the author(s).

1 INTRODUCTION

For over two thousand years, numerical integration has been the subject of intense research activity, starting with Babylonian mathematics and the elaboration of quadrature rules for measuring areas and volumes. It led to the development of a very wide variety of algorithms for calculating approximately the numerical value of a given (well-defined) integral with a controlled error, ranging from (possibly adaptive) methods based on interpolating functions to (quasi/advanced) Monte Carlo techniques. One may refer to e.g. [6] for an excellent account of deterministic techniques for numerical integration and to [13] for an introduction to Monte Carlo integration. With the advent of computer technology and significant advances in pseudo-random number generation, probabilistic approaches have proved quite useful in high-dimensional cases to circumvent the curse of dimensionality phenomenon. Error bounds achieved by Monte Carlo integration methods based on a simulated sample of size n ≥ 1 are typically of order 1/√n, the rate of the classical CLT. Recently, a competitor based on kernel smoothing has been proposed in [7]. The resulting integral estimates can be interpreted as biased importance sampling (IS, in abbreviated form) estimates, where the (true) importance function is replaced by leave-one-out kernel estimators. Provided that the instrumental density used in this integral estimation procedure is smooth enough, it has been proved that the asymptotic rate of convergence can be faster than 1/√n for an appropriate choice of the kernel bandwidth (see also [2] for a similar study in the Markovian context). It is the goal of this paper to investigate this striking phenomenon much further, from both a nonasymptotic and functional perspective, and to establish confidence upper bounds holding true for finite samples, uniformly over classes of functions of controlled complexity. The main argument relies on an adequate decomposition of the integral estimates obtained by means of this method, in which degenerate U-statistics appear in particular, and on recent concentration inequalities for such functionals, generally used in the context of the asymptotic study of (variable bandwidth) kernel density estimation methods, see e.g. [10]. Incidentally, attention should be paid to the fact that the analysis carried out in this article sheds light on a striking phenomenon: whereas it has been shown that the dependence structure among averaged identically distributed r.v.'s may significantly deteriorate the convergence rates in a wide variety of situations (e.g. for long-memory processes or weakly dependent sequences with non geometrically decaying mixing coefficients, see [8] for instance, or in cross-validation procedures), it is proved here that the dependence between the components averaged to produce the kernel smoothing-based integral estimates is, in contrast, of great benefit to statistical accuracy.

The article is structured as follows. Basics in Monte Carlo integral approximation are briefly recalled in section 2, together with the alternative method originally proposed in [7]. The main results of the paper are then stated in section 3 and an illustrative application is presented in section 4. Finally, some concluding remarks are collected in section 5. Technical details and additional remarks are deferred to the Supplementary Material.

2 BACKGROUND

Here and throughout, (X_n)_{n≥1} is a sequence of continuous independent and identically distributed random vectors, taking their values in R^d, d ≥ 1, with density f(x) w.r.t. the Lebesgue measure µ_Leb, Φ is a given class of Borelian functions ϕ: R^d → R, and K: R^d → R is a symmetric kernel function of order l ≥ 1, i.e. a Borelian function, integrable w.r.t. the Lebesgue measure, such that ∫K(x)dx = 1 and K(x) = K(−x) for all x ∈ R^d. We set ||z||_Φ := sup_{ϕ∈Φ} |z(ϕ)| for any real-valued collection z = (z(ϕ))_{ϕ∈Φ}. Denote by I_E the indicator variable of any event E. For any Borelian function g: R^d → R, the closure of the set {x ∈ R^d : g(x) ≠ 0} is denoted by Supp(g), and by ||g||_∞ is meant the essential supremum of g when it is bounded almost everywhere. For any h > 0 and x ∈ R^d, we set K_h(x) = K(h^{−1}x)/h^d. When well-defined, the convolution product between two real-valued Borelian functions g(x) and w(x) is denoted by

    g ∗ w(x) = ∫_{x′∈R^d} g(x − x′)w(x′)dx′.

For any β > 0, we set ⌊β⌋ = max{n ∈ N : n < β}. For α = (α_1, ..., α_d) ∈ N^d, we set |α| = ∑_{i=1}^d α_i and mean by ∂^α the differential operator ∂^{|α|}/∂x_1^{α_1} ··· ∂x_d^{α_d}. For m ∈ N, whenever Ω is an open subset of R^d, the space of real-valued functions on Ω that are differentiable up to order m is denoted by C^m(Ω) and, for any β > 0, L > 0, we denote by H_{β,L}(Ω) the space of functions g in C^{⌊β⌋}(Ω) with all derivatives up to order ⌊β⌋ bounded by L and such that, for any multi-index α ∈ N^d with |α| ≤ ⌊β⌋:

    ∀(x, y) ∈ Ω², |∂^α g(x) − ∂^α g(y)| ≤ L||x − y||^{β−|α|},

denoting by ||·|| the usual Euclidean norm on R^d.

2.1 Integral(s) Approximation

It is the goal of this paper to analyze the performance of statistical techniques to approximate accurately the integral

    I(ϕ) = ∫_{x∈R^d} ϕ(x)dx,    (1)

based on the observation of the i.i.d. sample X_1, ..., X_n, n ≥ 1. When the support K of ϕ, i.e. the domain of integration related to (1), is compact, a basic Monte Carlo method consists in generating independent random vectors U_1, ..., U_n, uniformly distributed over a domain H ⊃ K (a union of hypercubes typically, for computational simplicity; we take µ_Leb(H) = 1 in what follows, up to a rescaling of ϕ, so that the estimate below is indeed unbiased), and computing the natural (unbiased) Monte Carlo estimate:

    Ĩ_n(ϕ) = (1/n) ∑_{i=1}^n ϕ(U_i).    (2)

Beyond classic limit theorems (SLLN, CLT, LIL, etc.), the accuracy of the estimate (2) can be evaluated for a fixed sample size n ≥ 1. For simplicity, suppose that ϕ is bounded almost everywhere. In the absence of any smoothness assumption on the integrand ϕ, a straightforward application of Hoeffding's inequality (see [12]) shows that, for all n ≥ 1 and any δ ∈ (0, 1), we have with probability at least 1 − δ:

    |Ĩ_n(ϕ) − I(ϕ)| ≤ ||ϕ||_∞ √(2 log(2/δ)/n).

Maximal deviations over a class Φ of functions ϕ s.t. ||ϕ||_∞ ≤ M_Φ < +∞ can be obtained by means of concentration inequalities under appropriate complexity assumptions on Φ. Indeed, by virtue of McDiarmid's inequality (see [18]) combined with classical symmetrization and randomization arguments, for all n ≥ 1 and any δ ∈ (0, 1), we have with probability larger than 1 − δ:

    ||Ĩ_n − I||_Φ ≤ 2E[R_n(Φ)] + M_Φ √(2 log(2/δ)/n),    (3)

where the Rademacher average associated to the set {(ϕ(U_1), ..., ϕ(U_n)) : ϕ ∈ Φ} is denoted by

    R_n(Φ) = E_{ε_1, ..., ε_n}[ sup_{ϕ∈Φ} (1/n) |∑_{i=1}^n ε_i ϕ(U_i)| ]

and ε_1, ..., ε_n are independent Rademacher variables, independent from the U_i's. The expected Rademacher average measures the richness of the class Φ, see e.g. [15]. Classically [25], if Φ is a Vapnik-Chervonenkis (VC) major class of functions of finite VC dimension V < +∞ (i.e. if the collection of level sets {{x ∈ R^d : ϕ(x) ≥ t} : (ϕ, t) ∈ Φ × R} is of finite VC dimension V < +∞), we have E[R_n(Φ)] ≤ C√(V/n), where C < +∞ is a universal constant, and the basic Monte Carlo procedure permits to approximate the integrals (1) uniformly over the class Φ at the rate 1/√n in this case. Except in pathological situations, a basic CLT argument can be used to prove that this rate bound cannot be improved. Whereas many refinements of the bounds stated above (involving the variance of the ϕ(U_i)'s or other measures of complexity for classes of functions, such as metric entropies) can be considered, focus is here on an alternative method, significantly improving upon Monte Carlo integration in terms of the order of the (nonasymptotic) rate bound achieved.
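For concreteness, a minimal Python sketch of the basic estimate (2), together with its Hoeffding confidence half-width, is given below. The function names and the test integrand are ours, and the empirical maximum of |ϕ| over the sample is used as a crude stand-in for ||ϕ||_∞; this is an illustrative sketch, not part of the paper's method.

```python
import numpy as np

def mc_integral(phi, n, d, rng, delta=0.05):
    """Basic Monte Carlo estimate (2) of I(phi) over H = [0, 1]^d
    (so that mu_Leb(H) = 1), with the Hoeffding half-width."""
    U = rng.uniform(0.0, 1.0, size=(n, d))          # U_1, ..., U_n uniform on H
    vals = phi(U)
    estimate = vals.mean()                          # tilde I_n(phi), Eq. (2)
    # Hoeffding half-width ||phi||_inf * sqrt(2 log(2/delta) / n); the
    # empirical maximum of |phi| is a crude proxy for ||phi||_inf.
    half_width = np.abs(vals).max() * np.sqrt(2.0 * np.log(2.0 / delta) / n)
    return estimate, half_width

rng = np.random.default_rng(0)
# Test integrand: phi(x) = prod_i cos(pi x_i / 2); its integral over
# [0, 1]^2 equals (2 / pi)^2, which the estimate should approach.
phi = lambda x: np.cos(np.pi * x / 2.0).prod(axis=1)
est, hw = mc_integral(phi, n=10_000, d=2, rng=rng)
print(est, "+/-", hw, "target:", (2.0 / np.pi) ** 2)
```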

2.2 A Kernel Smoothing Alternative

We now describe at length the integral estimation procedure promoted in this paper. As an alternative to (2), it is proposed in [7] to consider the estimate

    Î_n(ϕ) = (1/n) ∑_{i=1}^n ϕ(X_i)/f̂_{n,i}(X_i),    (4)

denoting by

    f̂_{n,i}(x) = (1/(n−1)) ∑_{1≤j≤n, j≠i} K_h(x − X_j)    (5)

the smoothed leave-one-out estimator of f(x) based on kernel K and bandwidth 0 < h ≤ h_0, computed with all the X_j's except X_i, for 1 ≤ i ≤ n. The expectation of the estimate (5) is equal to the convolution product K_h ∗ f(x). Assume in addition that the kernel function K is of order ⌊β⌋ with β > 0, meaning that x ∈ R^d ↦ ||x||^l |K(x)| is integrable for all l ≤ ⌊β⌋ and

    ∫_{x∈R^d} ∏_{i=1}^d x_i^{α_i} K(x)dx = 0

for all α ∈ N^d such that 1 ≤ |α| ≤ ⌊β⌋. Provided that f belongs to the Hölder space H_{β,L}(R^d) for some L > 0, the deviation |K_h ∗ f(x) − f(x)| is of order O(h^β), see Lemma 5 in the supplementary file. As shall be seen at length in the next subsection, though biased and complex (the quantities involved in the average (4) exhibit a strong dependence structure), the estimator (4) is significantly more accurate, under specific hypotheses (on the decay rate of h as n → +∞ and the smoothness of f in particular), than the (unbiased) IS Monte Carlo estimate

    Ǐ_n(ϕ) = (1/n) ∑_{i=1}^n ϕ(X_i)/f(X_i),

obtained when replacing the leave-one-out estimators f̂_{n,i}(X_i) by the true values f(X_i) of the instrumental density in (4). Although the smoothing stage induces a bias in the estimation of (1), it may drastically accelerate the convergence, as claimed in the limit results proved in [7]. Before investigating the accuracy of (4), uniformly over a class of functions of controlled complexity (in a sense that shall be specified later) and from a nonasymptotic angle, a few remarks are in order.
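A minimal Python sketch of the estimator (4)-(5), using a Gaussian product kernel and the O(n²) pairwise evaluation discussed in Remark 2 below, may look as follows. The names, the instrumental density and the bandwidth choice are ours, for illustration only.

```python
import numpy as np

def kde_loo_integral(phi, X, h):
    """Leave-one-out kernel smoothing estimate (4)-(5) of I(phi),
    with a Gaussian product kernel K_h(x) = K(x / h) / h^d."""
    n, d = X.shape
    diff = (X[:, None, :] - X[None, :, :]) / h              # (X_i - X_j) / h
    K = np.exp(-0.5 * (diff ** 2).sum(axis=-1)) / (2.0 * np.pi) ** (d / 2.0)
    K /= h ** d                                             # K_h(X_i - X_j)
    np.fill_diagonal(K, 0.0)                                # leave X_i out
    f_loo = K.sum(axis=1) / (n - 1)                         # hat f_{n,i}(X_i)
    return np.mean(phi(X) / f_loo)                          # hat I_n(phi)

rng = np.random.default_rng(1)
n, d = 1000, 2
X = rng.normal(size=(n, d))                     # instrumental density f: N(0, I_2)
phi = lambda x: np.exp(-(x ** 2).sum(axis=1))   # I(phi) = pi^(d/2) = pi for d = 2
print(kde_loo_integral(phi, X, h=n ** (-1.0 / (d + 2))), "target:", np.pi)
```

Note that ϕ/f is bounded in this toy example, in the spirit of assumptions A4-A5 below; heavier-tailed integrands would make the ratio ϕ/f̂ unstable.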

Remark 1. (Multivariate kernel) Many univariate kernels have been proposed in the literature: u ↦ (1/2)I{−1 ≤ u ≤ 1} (rectangular), u ↦ (1 − |u|)I{−1 ≤ u ≤ 1} (triangular), u ↦ (1/√(2π)) exp(−u²/2) (Gaussian) or u ↦ (3/4)(1 − u²)I{−1 ≤ u ≤ 1} (Epanechnikov). The extension to the multivariate framework is straightforward by tensorization: for any univariate kernel K(u) of order m ∈ N, the product kernel defined below is a multivariate kernel of the same order: ∀d ≥ 1, u = (u_1, ..., u_d) ∈ R^d ↦ ∏_{i=1}^d K(u_i).
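For instance, the following short Python sketch (our own naming) builds such a product kernel from the univariate Epanechnikov kernel:

```python
import numpy as np

def epanechnikov(u):
    """Univariate Epanechnikov kernel (3/4)(1 - u^2) on [-1, 1]."""
    return 0.75 * (1.0 - u ** 2) * (np.abs(u) <= 1.0)

def product_kernel(x):
    """d-variate kernel obtained by tensorization, prod_i K(x_i);
    it integrates to one and inherits the order of the univariate kernel."""
    return np.prod(epanechnikov(x), axis=-1)

# Example: evaluate the bivariate product kernel on two points;
# the second lies outside the support in its second coordinate.
print(product_kernel(np.array([[0.0, 0.0], [0.5, 2.0]])))
```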

Remark 2. (On computational complexity) The fact that the theoretical results stated in [7] and in the next section (see Theorem 1 therein) are valid whatever the dimension d ≥ 1 makes the method described above very attractive. Truth be told, the latter is appropriate in low/moderate dimensional settings only. In high dimensions, kernel smoothing methods behave poorly as they face the dimensionality curse, see [21]. In addition, the computational budget of Î_n (in n²) is larger than that of Ǐ_n (in n). This makes Î_n particularly appropriate in the following situations: (i) real datasets where f is unknown (see [2]) and (ii) numerical integration when ϕ is computationally expensive (see [19]).

3 NON-ASYMPTOTIC BOUNDS

It is the purpose of this section to establish nonasymptotic upper bounds for the maximal deviation

    ||Î_n − I||_Φ = sup_{ϕ∈Φ} |Î_n(ϕ) − I(ϕ)|    (6)

of the estimated integrals based on kernel smoothing from their true values. As previously mentioned, the variables ϕ(X_i)/f̂_{n,i}(X_i) averaged in (4) are identically distributed and "close" to the ϕ(X_i)/f(X_i)'s but are, in contrast, highly dependent: the same subset of n − 2 original observations is involved in the computation of any pair of such r.v.'s. However, it is well known in Statistics that averaging dependent (identically distributed) random variables may considerably refine accuracy: a U-statistic, say U_n, is a typical example of a statistic obtained by averaging strongly dependent terms while providing an estimate of the mean θ = E[U_n] with minimum variance among all unbiased estimates of θ, see e.g. [16].¹ By means of an appropriate decomposition of (4) (where, incidentally, the appearance of degenerate U-statistics plays a crucial role), we shall show that the uniform deviation (6) may be much smaller than the bound (3) in certain situations (for a proper choice of the bandwidth h = h_n in particular). The following assumptions are involved in the subsequent analysis.

A1 Let β > 0 and L > 0. The density f belongs to the Hölder class H_{β,L}(R^d).

A2 The class Φ is countable, uniformly bounded, i.e. M_Φ := sup_{ϕ∈Φ} ||ϕ||_∞ < +∞, and of VC type [9] (w.r.t. the constant envelope M_Φ), meaning that there exist nonnegative constants A and v s.t. for all probability measures Q on R^d and any 0 < ε < 1: N(Φ, L²(Q), ε) ≤ (A M_Φ/ε)^v, where N(Φ, L²(Q), ε) denotes the smallest number of L²(Q)-balls of radius less than ε required to cover the class Φ (covering number).

A3 The set D_Φ := ∪_{ϕ∈Φ} Supp(ϕ) is compact.

A4 The density of the X_i's is bounded from below on the domain D_Φ: λ := inf_{x∈D_Φ} f(x) > 0.

A5 For all ϕ ∈ Φ, the function ϕ/f belongs to the Hölder class H_{β,L}(R^d).

The result stated below reveals that, under these hypotheses, the integral approximation method recalled in subsection 2.2 achieves a rate bound faster than 1/√n for an appropriate choice of the bandwidth h_n > 0.

Theorem 1. (Probability rate bounds) Suppose that assumptions A1-A5 are fulfilled. For all δ ∈ (0, 1), there exists a set 𝒞_δ ⊂ N × R, depending on δ, Φ, K, (β, L) and f, such that, for all (n, h) ∈ 𝒞_δ, with probability at least 1 − δ, we have:

    sup_{ϕ∈Φ} |Î_n(ϕ) − I(ϕ)| ≤ C_δ ( h^β + |log(h^{d/2})|/(n h^d) ),

where C_δ is a constant depending on δ, Φ, K, (β, L) and f. In particular, choosing h = h_n so that h_n = o(1/n^{1/(2β)}) and 1/n^{1/(2d)} = o(h_n) as n → +∞, which guarantees that (n, h_n) ∈ 𝒞_δ and is always possible as soon as β > d, yields a rate bound of order o_P(1/√n).
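To make the bandwidth trade-off in Theorem 1 explicit, consider the following worked instance (the numerical exponents are our illustrative choice, not taken from the paper). Take d = 1 and β = 2 (so that β > d) and set h_n = n^{−γ} with γ ∈ (1/(2β), 1/(2d)) = (1/4, 1/2), say γ = 1/3. Then

    h_n^β = n^{−2/3} = o(n^{−1/2})   and   |log(h_n^{d/2})|/(n h_n^d) = (log n)/(6 n^{2/3}) = o(n^{−1/2}),

so that both terms of the bound are o(1/√n), whereas the Monte Carlo benchmark (3) remains at the 1/√n rate.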

Before sketching the argument of the theorem above,a few comments are in order.

¹ Let (S, 𝒮) be a measurable space. Recall that the U-statistic of kernel ω: S × S → R based on the i.i.d. observations Z_1, ..., Z_n valued in S is the quantity (1/(n(n−1))) ∑_{i≠j} ω(Z_i, Z_j). One says that it is degenerate when E[ω(Z, z)] = E[ω(z, Z)] = 0 for all z ∈ S.

Remark 3. (On complexity/smoothness assumptions) It is supposed here that the class Φ is of VC type, cf. assumption A2, meaning that its uniform entropy numbers grow at a polynomial rate. We recall for completeness that a uniformly bounded VC major class of functions of finite VC dimension V < +∞ is of course of VC type (the constants A and v can be expressed as functions of V, see e.g. Theorem 2.6.7 in [25]). The hypothesis that Φ is countable can be weakened, using the notion of countable approximability, see the definition in [17] on p. 492. In addition, observe that we assumed here that f and the ϕ/f's belong to the same Hölder class for the sake of simplicity only. The analysis can be straightforwardly extended to more general smoothness assumptions, at the price of more complex formulas for the rate bounds.

Proof. The argument is based on the following decomposition of the estimator (4), for an arbitrary element ϕ of the class Φ:

    Î_n(ϕ) = (2/n) ∑_{i=1}^n ϕ(X_i)/f(X_i)
             − (1/(n(n−1))) ∑_{1≤i≠j≤n} ϕ(X_i)K_h(X_i − X_j)/f²(X_i)
             + (1/n) ∑_{i=1}^n ϕ(X_i)(f(X_i) − f̂_{n,i}(X_i))² / (f²(X_i)f̂_{n,i}(X_i)).

The first term is an i.i.d. sample mean providing an unbiased estimate of 2I(ϕ), while the second one involves a U-statistic U_n(ϕ) of degree two, with kernel given by H(x, x′) = ϕ(x)K_h(x − x′)/f²(x) for x, x′ in R^d, that can be considered (once the minus sign is included) as a biased estimate of −I(ϕ). One may classically write the Hoeffding decomposition (i.e. Hájek projection) of U_n(ϕ) [16]:

    U_n(ϕ) = T_n(ϕ) + S_n(ϕ) + W_n(ϕ) − E[U_n(ϕ)],

where W_n(ϕ) is a degenerate U-statistic with zero mean and kernel

    Q_n(x, x′) = H(x, x′) − E[H(x, X)] − E[H(X, x′)] + E[U_n(ϕ)],

and

    T_n(ϕ) = (1/n) ∑_{i=1}^n E[H(X_i, X) | X_i] = (1/n) ∑_{i=1}^n ϕ(X_i)(K_h ∗ f)(X_i)/f²(X_i),

    S_n(ϕ) = (1/n) ∑_{i=1}^n E[H(X, X_i) | X_i] = (1/n) ∑_{i=1}^n (K_h ∗ (ϕ/f))(X_i),

denoting by X a random vector independent from the X_i's, distributed according to f(x)dx. Observe incidentally that: ∀n ≥ 1, ∀ϕ ∈ Φ,

    E[U_n(ϕ)] = E[T_n(ϕ)] = E[S_n(ϕ)] = ∫_{x∈R^d} (ϕ(x)/f(x))(K_h ∗ f)(x)dx.

Hence, the deviation between the estimate (4) and the target integral (1) can be decomposed as the sum of four terms:

    Î_n(ϕ) − I(ϕ) = M_n(ϕ) + W_n(ϕ) + B_h(ϕ) + R_n(ϕ),

where

    B_h(ϕ) = I(ϕ) − E[U_n(ϕ)] = ∫ ϕ(x)(1 − (K_h ∗ f)(x)/f(x))dx    (7)

is a deterministic term vanishing as h > 0 tends to zero under adequate conditions (see Lemma 1),

    M_n(ϕ) = (2/n) ∑_{i=1}^n ϕ(X_i)/f(X_i) − T_n(ϕ) − S_n(ϕ) − 2B_h(ϕ)

is a centered sum of i.i.d. random variables, and

    R_n(ϕ) = (1/n) ∑_{i=1}^n ϕ(X_i)(f(X_i) − f̂_{n,i}(X_i))² / (f²(X_i)f̂_{n,i}(X_i)).    (8)

The proof consists in establishing bounds showing that each of these four terms is of order o_P(n^{−1/2}) uniformly over Φ. In contrast to the maximal deviation results generally used to investigate the accuracy of Empirical Risk Minimization in statistical learning (see e.g. [3]), one should pay attention to the fact that sharp inequalities (involving bounds for the maximal variance) are considered in the present analysis, in order to deal properly with the dependence on n (through the bandwidth h_n) of the classes of functions/kernels considered, as more commonly needed in the asymptotic study of kernel density estimators, see e.g. [10]. Constants involved in the intermediary results below are not necessarily the same at each appearance.
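Incidentally, the initial decomposition of Î_n(ϕ) above is an exact algebraic identity: it follows from expanding 1/f̂ = 1/f + (f − f̂)/f² + (f − f̂)²/(f²f̂) at each sample point. The short Python check below (our own toy code, not from the paper, with a Gaussian instrumental density) verifies it numerically:

```python
import numpy as np

rng = np.random.default_rng(2)
n, h = 200, 0.3
X = rng.normal(size=n)                                  # f: standard normal
f = np.exp(-X ** 2 / 2.0) / np.sqrt(2.0 * np.pi)        # f(X_i)
phi = np.exp(-X ** 2)                                   # a bounded test function

# Pairwise Gaussian kernel matrix K_h(X_i - X_j), diagonal left out.
Kh = np.exp(-0.5 * ((X[:, None] - X[None, :]) / h) ** 2) / (np.sqrt(2.0 * np.pi) * h)
np.fill_diagonal(Kh, 0.0)
f_loo = Kh.sum(axis=1) / (n - 1)                        # hat f_{n,i}(X_i), Eq. (5)

lhs = np.mean(phi / f_loo)                              # hat I_n(phi), Eq. (4)
term1 = 2.0 * np.mean(phi / f)                          # i.i.d. sample mean part
U_n = ((phi / f ** 2)[:, None] * Kh).sum() / (n * (n - 1))   # U-statistic part
R_n = np.mean(phi * (f - f_loo) ** 2 / (f ** 2 * f_loo))     # residual part
assert np.isclose(lhs, term1 - U_n + R_n)               # the identity holds exactly
```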

Bias. As can be seen by examining the proof of the lemma below, a bound for the deterministic term (7) can be obtained using well-known approximation-theoretic arguments, under the smoothness hypotheses stipulated for the elements of the class Φ.

Lemma 1. Under assumptions A1, A2, A3 and A4, we have the uniform bound: ∀h > 0,

    ||B_h||_Φ = sup_{ϕ∈Φ} |B_h(ϕ)| ≤ C µ_Leb(D_Φ) M_Φ h^β / λ,

where C = (L/⌊β⌋!) ∑_{α∈N^d: |α|=⌊β⌋} ∫_{z∈R^d} |K(z)| ∏_{i=1}^d |z_i|^{α_i} dz.

Its technical proof is deferred to the SupplementaryMaterial.

Empirical process. As shown in the Supplementary Material, the maximal deviation result related to the empirical process {M_n(ϕ)}_{ϕ∈Φ} stated below is proved by means of a specific version of an exponential inequality of [23].

Lemma 2. Suppose that assumptions A1-A5 are fulfilled. For all δ ∈ (0, 1), there exists n_δ ≥ 1 (defined in (13), in the supplementary file), depending on δ, K, λ and Φ only, such that for all n ≥ n_δ, with probability at least 1 − δ, we have:

    ||M_n||_Φ ≤ (c_Φ h^β/√n) √(2 max{log(C_2/δ)/C_3, C_1² log 2}),

where c_Φ, C_1, C_2 and C_3 are constants depending on Φ, K, (β, L) and λ.

Degenerate U-process. Whereas concentration inequalities for degenerate U-processes (i.e. collections of U-statistics indexed by classes of functions) have been established in various articles, such as [1] in the context of VC classes (see also [5] for more general results for instance), the major difficulty here arises from the fact that the class of kernels considered depends on n ≥ 1 (namely, through the bandwidth h_n). As shown in the Supplementary Material, the following bound can be proved by means of an exponential inequality for degenerate U-processes indexed by classes of kernels of VC type, involving a bound for the maximal variance, established in [17].

Lemma 3. Suppose that assumptions A2, A3 and A4 are fulfilled. Then, for all δ ∈ (0, 1), there exists 𝒞_{δ,1} ⊂ N × R (defined in (15), in the supplementary file), depending on δ, K, Φ and λ only, such that, for all (n, h) ∈ 𝒞_{δ,1}, with probability greater than 1 − δ, we have:

    ||W_n||_Φ ≤ (γ_Φ/((n−1)h^{d/2})) max{log(C_2/δ)/C_3, C_1 log(2G_Φ/(γ_Φ h^{d/2}))},

where γ_Φ, G_Φ, C_1, C_2 and C_3 are constants depending on Φ, λ and K.

Residuals. We now turn to the residual term (8). The lemma below is established in the Supplementary Material. Its proof is based in particular on the control of the probability that the f̂_{n,i}(X_i)'s get close to zero.

Lemma 4. Suppose that assumptions A1-A5 are fulfilled. For all δ ∈ (0, 1), there exists 𝒞_{δ,2} ⊂ N × R (defined in (16) and (17), in the supplementary file), depending on δ, K, Φ and λ only, such that, for all (n, h) ∈ 𝒞_{δ,2}, with probability at least 1 − δ, we have:

    ||R_n||_Φ ≤ γ_Φ ( max{log(C_2/δ)/C_3, C_1 log(2||K||_∞/(c_{K,f}h^{d/2}))}/(n h^d) + h^{2β} ),

where γ_Φ, c_{K,f}, C_1, C_2 and C_3 are constants depending on Φ, K, (β, L) and λ.

Derivation of the stated bound. The bound inTheorem 1 results from those stated in Lemmas 1-4 bytaking Cδ as the intersection of Cδ,1 (in Lemma 3), Cδ,2(Lemma 4), n ≥ nδ (in Lemma 2), and also values of(n, h ≤ h0) such that the bound for the bias in Lemma1 (resp. for the residuals in Lemma 4) is larger thanthe bound for the empirical process term in Lemma 2(resp. the U -process term in Lemma 3).

4 APPLICATION TO COVARIATE SHIFT IN REGRESSION

We now present an application of the method analysed in the previous section, in order to illustrate its performance in practice. After a brief presentation of the framework of covariate shift regression, a diagnostic tool for evaluating the quality of the prediction in a given covariate region is introduced (see e.g. [4]) and implemented on toy data by means of the promoted method.

Covariate shift regression. Let (x_i, y_i)_{i=1,...,n} denote a training dataset of size n ≥ 1 where, for each i ∈ {1, ..., n}, y_i ∈ 𝒴 stands for the output and x_i ∈ 𝒳 is the covariate/input vector. The regression task consists in (i) learning a predictor g from the training data in order to (ii) predict an unobservable y_te by g(x_te) for a so-called test covariate x_te. Classical regression is concerned with a test covariate x_te ∈ 𝒳 that is distributed similarly to the training covariates. In contrast, covariate shift regression considers situations where x_te ∈ 𝒳 is not distributed in the same way as the training covariates. That is, when learning g, the main risk is to focus too much on regions containing the x_i's but far away from x_te. Under covariate shift and misspecification, it is known [20] that standard regression techniques such as maximum likelihood estimation do not provide accurate estimates. The most popular approach to the covariate shift regression problem is based on a re-weighting strategy (see [22], [14] and the references therein). Suppose for simplicity that the training dataset forms an i.i.d. sequence distributed according to (Y, X). The conditional risk of the predictor g given X = x is denoted by R(g|x) = E[(Y − g(x))² | X = x]. The marginal distribution of the x_i's is denoted by f_X^tr. If f_X^te denotes the test distribution, i.e. the distribution of x_te, then the underlying risk can be expressed as R_te(g) = ∫ R(g|x)f_X^te(x)dx. A natural estimate of this risk is then given by

    R̂_te(g) = n^{−1} ∑_{i=1}^n (y_i − g(x_i))² w_i,    (9)

with w_i = f_X^te(x_i)/f_X^tr(x_i). As the weights w_i are unknown in practice, one should estimate them based on the training sample and, when available, the test sample. The naive strategy (subject to the curse of dimensionality) is to estimate f_X^te and f_X^tr by means of kernel smoothing estimates and then to replace the unknown weights in (9) by the resulting estimates. More sophisticated methods, relying on the Kullback-Leibler divergence and on the least-squares distance, are proposed in [22] and [14], respectively.

Diagnostic tool for prediction quality in covariate regions. When no test covariate x_te is observed (making an estimation of the importance weights w_i impossible), an interesting issue is to know whether or not a given region in the covariate space 𝒳 benefits from a prediction of good quality. In the following, a region (µ, Γ) is represented by the Gaussian distribution with center µ ∈ 𝒳 ⊂ R^p and dispersion matrix Γ ∈ R^{p×p}. The risk related to the region (µ, Γ) is given by R_{µ,Γ}(g) = ∫ R(g|x)φ_{µ,Γ}(x)dx, where φ_{µ,Γ} stands for the density of N(µ, Γ). The empirical "oracle" counterpart (so called because it requires knowing f_X^tr) is

    R̂^{(or)}_{µ,Γ}(g) = n^{−1} ∑_{i=1}^n (φ_{µ,Γ}(x_i)/f_X^tr(x_i))(y_i − g(x_i))²,

and the estimator based on the kernel smoothing approach is

    R̂_{µ,Γ}(g) = n^{−1} ∑_{i=1}^n (φ_{µ,Γ}(x_i)/f̂_{n,i}(x_i))(y_i − g(x_i))²,    (10)

where f̂_{n,i} is the leave-one-out estimator, defined in (5), associated to x_i. The estimation error associated to R̂_{µ,Γ}(g) has two components: one related to the error between g(x) and E[Y|X = x], and one associated to the noise Y − E[Y|X = x]. Theorem 1 can be used to handle the first component in the error decomposition, i.e. the function ϕ in Theorem 1 is taken equal to x ↦ φ_{µ,Γ}(x)(g(x) − E[Y|X = x])², which in many cases verifies each of our assumptions except the compact support assumption on ϕ stated in A3 and (consequently) the lower bound assumption on f_X stated in A4. This problem can be solved in practice by considering a trimmed version (as proposed for instance in [11]), i.e. by ignoring the terms with a too small f̂_{n,i}(x_i) in (10). Addressing these technicalities is beyond the scope of the paper.


Ordinary least squares with misspecification. To illustrate our proposal, we consider a toy model according to which we generate the training dataset (x_i, y_i)_{i=1,...,n} for n = 500. It is given by y_i = x_{i,1}² I{x_{i,1} > 0} + ε_i, where x_i = (x_{i,1}, ..., x_{i,p}) ∼ N((a, 0, ..., 0)^T, I_p), ε_i ∼ N(0, s²), s is chosen such that the signal-to-noise ratio is 0.5, and p = 10. The parameter a is either −1 or 2, in order to highlight different situations. Estimation of g is made through ordinary least squares, with ĝ(x_i) = α̂ + β̂^T x_i, where (α̂, β̂) ∈ argmin_{α,β} ∑_{i=1}^n (y_i − α − β^T x_i)². To avoid the dimensionality curse, we consider regions associated to one specific covariate, x_{i,1}; that is, the distribution φ_{µ,Γ} is a univariate Gaussian, and we set Γ = 1/2. We are interested in the performance of R̂^{(or)}_{µ,Γ}(ĝ) (ORACLE) and R̂_{µ,Γ}(ĝ) when µ varies in the range of x_{i,1}, both estimating R_{µ,Γ}(ĝ) (TRUE). For R̂_{µ,Γ}(ĝ), f̂_{n,i} is either the classical kernel density estimator based on the x_{i,1}'s (KDE) or the leave-one-out estimator based on the x_{i,1}'s (KDE-LOO). The parameter h for the density estimate in (10) is picked via the "rule of thumb" of [21], giving h = 1.06 σ̂ n^{−1/5}, where σ̂² is the empirical variance of the x_{i,1}'s, i = 1, ..., n. Fig. 1 provides an illustration for one particular dataset. The estimation accuracy (reflected by small values of R_{µ,Γ}(ĝ)) is not homogeneous. When a = −1, ĝ is not sharp in the right tail of x_{i,1}, whereas when a = 2, ĝ performs poorly in both the left and the right tails. For each value of a, KDE-LOO recovers this trend pretty well. Fig. 2 confirms that estimating R_{µ,Γ}(ĝ) is more difficult when only few points x_{i,1} are lying around µ. Notice that KDE-LOO outperforms KDE for any value of µ. The ORACLE presents less bias, but a larger variance, than KDE.
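A condensed Python sketch of this experiment (our own implementation; the exact calibration of s to reach a signal-to-noise ratio of 0.5, and other minor details, are our assumptions) reads:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, a = 500, 10, 2.0
x = rng.normal(size=(n, p))
x[:, 0] += a                                    # x_i ~ N((a, 0, ..., 0)^T, I_p)
signal = x[:, 0] ** 2 * (x[:, 0] > 0)
s = np.sqrt(signal.var() / 0.5)                 # our calibration of the 0.5 ratio
y = signal + s * rng.normal(size=n)

# Misspecified OLS fit: g(x) = alpha + beta^T x.
A = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
resid2 = (y - A @ coef) ** 2

# Leave-one-out Gaussian KDE of x_{i,1}, rule-of-thumb bandwidth.
h = 1.06 * x[:, 0].std() * n ** (-0.2)
Kh = np.exp(-0.5 * ((x[:, 0][:, None] - x[:, 0][None, :]) / h) ** 2)
Kh /= np.sqrt(2.0 * np.pi) * h
np.fill_diagonal(Kh, 0.0)
f_loo = Kh.sum(axis=1) / (n - 1)

def region_risk(mu, gamma=0.5):
    """KDE-LOO estimate (10) of the risk over the region (mu, Gamma)."""
    w = np.exp(-0.5 * (x[:, 0] - mu) ** 2 / gamma) / np.sqrt(2.0 * np.pi * gamma)
    return np.mean(w / f_loo * resid2)

for mu in (-1.0, 1.0, 3.0):                     # scan regions along x_{i,1}
    print(mu, region_risk(mu))
```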

5 CONCLUSION

We provided a sound nonasymptotic analysis of the performance of a kernel smoothing integral estimation method that can be used as an alternative to the Monte Carlo technique and compares favourably with it under certain assumptions. Precisely, though biased and involving highly dependent averaged components, the integral estimates thus produced achieve rate bounds that surpass those attained by traditional Monte Carlo methods (of order O_P(1/√n)), provided the instrumental density is sufficiently smooth and the kernel/bandwidth used are picked appropriately, uniformly over a class of functions of controlled complexity. The main tools exploited for establishing this striking result are an appropriate decomposition of the deviation between the target integrals and their estimates, plus sharp concentration inequalities involving the variance of the functionals thus considered. Beyond the theoretical results, a numerical example illustrates the practical performance of the method promoted here.

Figure 1: Top/middle: graph of R̂_{µ,Γ}(ĝ) for TRUE, KDE-LOO, KDE and ORACLE, when µ lives in the whole range of x_{i,1} (top) and zooming around the mean of x_{i,1} (middle). Bottom: outputs y_i and predicted values ĝ(x_i) versus x_{i,1}. Red and grey colors reflect large and small values of KDE-LOO. On the right, a = −1; on the left, a = 2. The signal-to-noise ratio is 0.5.

Figure 2: Boxplots (based on 100 replications) of the error for KDE-LOO, KDE and ORACLE when estimating R_{µ,Γ}(ĝ) for different values of µ in the range of x_{i,1}. On the right, a = −1; on the left, a = 2. The signal-to-noise ratio is 0.5.


Acknowledgements

We thank the industrial Chair "Machine Learning for Big Data" of Télécom ParisTech for partly funding this research.

A PROOF OF LEMMA 1

The lemma mainly derives from the following classical approximation bound. Its proof is given here only for the sake of completeness, insofar as the literature generally documents solely a univariate version of this result, see e.g. Proposition 1.2 in [24].

Lemma 5. Suppose that the smoothness condition A1 holds. For any h > 0, we have:

    sup_{x∈R^d} |(K_h ∗ f)(x) − f(x)| ≤ C h^β,    (11)

where C is the constant given in Lemma 1.

Proof. Observe that, as the kernel K is assumed to be of order ⌊β⌋, a straightforward change of variable combined with a Taylor expansion permits to write the deviation as follows: ∀h > 0,

    K_h ∗ f(x) − f(x) = ∫_{z∈R^d} K(z)(f(x + hz) − f(x))dz
    = ∫_{z∈R^d} K(z) (h^{⌊β⌋}/⌊β⌋!) ∑_{α∈N^d: |α|=⌊β⌋} ∏_{i=1}^d z_i^{α_i} ∂^α f(x + θhz)dz
    = (h^{⌊β⌋}/⌊β⌋!) ∫_{z∈R^d} K(z) ∑_{α: |α|=⌊β⌋} ∏_{i=1}^d z_i^{α_i} (∂^α f(x + θhz) − ∂^α f(x))dz,

where θ ∈ [0, 1]. Since f is assumed to belong to H_{β,L}(R^d), the Hölder condition applies and leads to (11).

Lemma 5, combined with assumptions A3 and A4, entails that: ∀ϕ ∈ Φ, ∀h > 0,

    |B_h(ϕ)| ≤ (C h^β/λ) ∫ |ϕ(x)|dx ≤ C µ_Leb(D_Φ) M_Φ h^β / λ.

B PROOF OF LEMMA 2

We first recall a version of the concentration inequality for empirical processes established in [23] (see also [9]), as stated in [10].

Lemma 6. Let ξ_1, ξ_2, ... be i.i.d. r.v.'s valued in a measurable space (S, 𝒮) and let 𝒰 be a class of functions on S, uniformly bounded and of VC type with constants (v, A) and envelope U: S → R. Set σ²(u) = var(u(ξ_1)) for all u ∈ 𝒰. There exist positive constants C_1, C_2, C_3 (depending on v and A) such that, for any σ² satisfying sup_{u∈𝒰} σ²(u) ≤ σ² ≤ ||U||²_∞ and all t > 0 satisfying

    C_1 √n σ √(log(2||U||_∞/σ)) ≤ t ≤ n σ²/||U||_∞,    (12)

we have

    P( ||∑_{i=1}^n (u(ξ_i) − E[u(ξ_i)])||_𝒰 > t ) ≤ C_2 exp(−C_3 t²/(nσ²)).

We shall prove that the desired bound can be deduced from the application of the lemma above to the X_i's and to the class of functions 𝒰_{Φ,h} = {u_ϕ}_{ϕ∈Φ}, where: ∀ϕ ∈ Φ,

    u_ϕ(x)/h^d = 2ϕ(x)/f(x) − ϕ(x)(K_h ∗ f)(x)/f²(x) − (K_h ∗ (ϕ/f))(x)
               = (ϕ(x)/f²(x))(f(x) − (K_h ∗ f)(x)) + (ϕ(x)/f(x) − (K_h ∗ (ϕ/f))(x)).

Using Lemma 5 twice (for the term involving f − K_h ∗ f, we rely on A1 and A4; for the term ϕ/f − K_h ∗ (ϕ/f), on A5), notice that ||u_ϕ||_∞ ≤ C(1 + M_Φ/λ²)h^{d+β} for all ϕ ∈ Φ. In addition, since the functional classes Φ, {∫_{x′} K((x − x′)/h′)f(x′)dx′}_{h′>0} and {∫_{x′} K((x − x′)/h′)(ϕ(x′)/f(x′))dx′}_{h′>0, ϕ∈Φ} are of VC type, the class 𝒰_{Φ,h} is still of VC type, with constants (v′, A′) independent from h ≤ h_0, by virtue of permanence properties of VC type classes of functions regarding summation and multiplication (cf. section 2.6.5 in [25]). Using the decomposition above and again Lemma 5 twice, we also have: ∀ϕ ∈ Φ,

    σ²(u_ϕ) ≤ 2C²(1 + M²_Φ/λ⁴)h^{2(d+β)},

where we used that, for any real-valued r.v.'s X and Y, var(X + Y) ≤ E[(X + Y)²] ≤ 2(E[X²] + E[Y²]). We apply Lemma 6 to 𝒰_{Φ,h} and the X_i's with constant envelope U_Φ = c_Φ h^{d+β}, where c²_Φ = 2C²(1 + M²_Φ/λ⁴) ≥ C²(1 + M_Φ/λ²)², with σ² = c²_Φ h^{2(d+β)} and

    t = √n σ √(max{log(C_2/δ)/C_3, C_1² log 2}).

Observe that σ² = U_Φ². Let δ ∈ (0, 1) and suppose that

    n ≥ max{log(C_2/δ)/C_3, C_1² log 2},    (13)

so that (12) is fulfilled. With probability larger than 1 − δ, we then have:

    ||M_n||_Φ = (1/(nh^d)) ||∑_{i=1}^n (u_ϕ(X_i) − E[u_ϕ(X_1)])||_Φ
              ≤ (c_Φ h^β/√n) √(2 max{log(C_2/δ)/C_3, C_1² log 2}).

C PROOF OF LEMMA 3

The proof is based on the exponential bound for degenerate U-processes indexed by classes of kernels of VC type recalled in the lemma below, which can be viewed as a counterpart of Lemma 2 for degenerate U-processes. The following bound is stated in [10] (see section 4 therein) and can be viewed as a reformulation of the bounds proved in [9] and [17].

Lemma 7. Let ξ_1, ξ_2, ... be an i.i.d. sequence of random variables taking their values in a measurable space (S, 𝒮) and distributed according to a probability measure P. Let 𝒢 be a uniformly bounded class of functions on S × S, of VC type with constants (v, A) and envelope G(x, x′). Set σ²(g) = var(g(ξ_1, ξ_2)) for any g ∈ 𝒢. Then, there exist positive constants C_1, C_2, C_3 (depending on v and A) such that, for any σ² satisfying sup_{g∈𝒢} σ²(g) ≤ σ² ≤ ||G||²_∞ and all t > 0 satisfying

    C_1 n σ log(2||G||_∞/σ) ≤ t ≤ n²σ³/||G||²_∞,    (14)

we have

    P( ||∑_{1≤i≠j≤n} π_P(g)(ξ_i, ξ_j)||_𝒢 > t ) ≤ C_2 exp(−C_3 t/(nσ)),

where, for all (x, x′) ∈ S² and any g ∈ 𝒢,

    π_P(g)(x, x′) = g(x, x′) − ∫ g(ξ, x′)dP(ξ) − ∫ g(x, ξ)dP(ξ) + ∫∫ g(ξ_1, ξ_2)dP(ξ_1)dP(ξ_2).

The proof is based on the application of the lemma above to the class

    𝒢_{Φ,h} := {g_ϕ(x, x′) := ϕ(x)K((x − x′)/h)/f²(x)}_{ϕ∈Φ}.

The class Φ being of VC type by assumption, just like the class {K((x − x′)/h′)/f²(x)}_{h′>0}, observe first that the class 𝒢_{Φ,h} is still of VC type, with constants (v″, A″) independent from h, by virtue of well-known permanence properties of VC type classes, see e.g. Lemma 2.6.20 in [25]. In addition, one may check that: ∀ϕ ∈ Φ, σ²(g_ϕ) ≤ γ²_Φ h^d, where γ²_Φ = M²_Φ ∫ K²(x)dx / λ³. Apply Lemma 7 to the class 𝒢_{Φ,h} with constant envelope G_Φ = max(M_Φ||K||_∞/λ², γ_Φ h_0^{d/2}), where we recall that h ≤ h_0, with σ² = γ²_Φ h^d and

    t = n σ max{log(C_2/δ)/C_3, C_1 log(2G_Φ/(γ_Φ h^{d/2}))}.

We have σ² ≤ G²_Φ. Let δ ∈ (0, 1) and let 𝒞_{δ,1} be the set of pairs (n, h) such that n ∈ N, h > 0, and

    n h^d ≥ (G²_Φ/γ²_Φ) max{log(C_2/δ)/C_3, C_1 log(2G_Φ/(γ_Φ h^{d/2}))}.    (15)

This guarantees that (14) is satisfied. We then have, with probability larger than 1 − δ,

    ||W_n||_Φ = (1/(n(n−1)h^d)) ||∑_{1≤i≠j≤n} π_{f·µ_Leb}(g_ϕ)(X_i, X_j)||_{𝒢_{Φ,h}}
              ≤ (γ_Φ/((n−1)h^{d/2})) max{log(C_2/δ)/C_3, C_1 log(2G_Φ/(γ_Φ h^{d/2}))}.

D PROOF OF LEMMA 4

Let

    f̂_n(x) = (1/n) ∑_{1≤j≤n} K_h(x − X_j).

We start off by controlling the probability that the f̂_{n,i}(X_i)'s take small values. Observe that

    min_{1≤i≤n: X_i∈D_Φ} f̂_{n,i}(X_i) = (n/(n−1)) min_{1≤i≤n: X_i∈D_Φ} f̂_n(X_i) − K(0)/((n−1)h^d)
    ≥ inf_{x∈D_Φ} f̂_n(x) − K(0)/((n−1)h^d)
    ≥ λ − sup_{x∈D_Φ} |f̂_n(x) − f(x)| − K(0)/((n−1)h^d)
    ≥ λ − sup_{x∈D_Φ} |f̂_n(x) − (K_h ∗ f)(x)| − C h^β − K(0)/((n−1)h^d).

Apply Lemma 6 to the class 𝒰 = {K((y − ·)/h)}_{y∈R^d}, which has constant envelope ||K||_∞ and whose VC constants do not depend on h. Let c²_{K,f} = ||f||_∞ ∫ K²(x)dx. We consider the constant envelope U = max(||K||_∞, c_{K,f} h_0^{d/2}) and define

    σ² = c²_{K,f} h^d,    t = √n σ m_δ,
    m_δ = √(max{log(C_2/δ)/C_3, C_1 log(2||K||_∞/σ)}).

Let δ ∈ (0, 1) and define

    a_{n,h} = c_{K,f} m_δ/√(n h^d) + C h^β + K(0)/((n−1)h^d).

Let 𝒞_{δ,2} be the set of pairs (n, h) such that n ∈ N, h > 0, and

    n h^d ≥ (U m_δ)²/c²_{K,f},    (16)
    a_{n,h} ≤ λ/2.    (17)

Equation (16) ensures that (12) holds and that, with probability at least 1 − δ,

    sup_{x∈D_Φ} |f̂_n(x) − (K_h ∗ f)(x)| ≤ c_{K,f} m_δ/√(n h^d).

Placing ourselves on this event, equation (17) permits to obtain

    sup_{ϕ∈Φ} |R_n(ϕ)| ≤ (M_Φ/((λ − a_{n,h})λ²)) n^{−1} ∑_{i=1}^n (f(X_i) − f̂_{n,i}(X_i))²
                       ≤ (2M_Φ/λ³) (c_{K,f} m_δ/√(n h^d) + C h^β)².

E ADDITIONAL FIGURES

To support the stability of our estimation method R̂_{µ,Γ}(g), estimating R_{µ,Γ}(g), we provide additional figures considering the same model as the one introduced in the paper, but with different values of the signal-to-noise ratio.

References

[1] M. A. Arcones and E. Giné. U-processes indexed by VC classes of functions with applications to asymptotics and bootstrap of U-statistics with estimated parameters. Stochastic Processes and their Applications, 52:17–38, 1994.

[2] R. Azaïs, B. Delyon, and F. Portier. Integral estimation based on Markovian design. Submitted, available at https://arxiv.org/abs/1609.01165, 2017.

[3] S. Boucheron, O. Bousquet, and G. Lugosi. Theory of Classification: A Survey of Some Recent Advances. ESAIM: Probability and Statistics, 9:323–375, 2005.

[4] X. Chen, M. Monfort, A. Liu, and B. Ziebart. Robust covariate shift regression. In Proceedings of AISTATS, 2016.

[5] S. Clémençon, G. Lugosi, and N. Vayatis. Ranking and empirical minimization of U-statistics. Ann. Statist., 36(2):844–874, 2008.

Figure 3: Top/middle: graph of R̂_{µ,Γ}(ĝ) for TRUE, KDE-LOO, KDE and ORACLE, when µ lives in the whole range of x_{i,1} (top) and zooming around the mean of x_{i,1} (middle). Bottom: outputs y_i and predicted values ĝ(x_i) versus x_{i,1}. Red and grey colors reflect large and small values of KDE-LOO. On the right, a = −1; on the left, a = 2. The signal-to-noise ratio is 0.1.

Figure 4: Boxplots (based on 100 replications) of the error for KDE-LOO, KDE and ORACLE when estimating R_{µ,Γ}(ĝ) for different values of µ in the range of x_{i,1}. On the right, a = −1; on the left, a = 2. The signal-to-noise ratio is 0.1.

Figure 5: Top/middle: graph of R̂_{µ,Γ}(ĝ) for TRUE, KDE-LOO, KDE and ORACLE, when µ lives in the whole range of x_{i,1} (top) and zooming around the mean of x_{i,1} (middle). Bottom: outputs y_i and predicted values ĝ(x_i) versus x_{i,1}. Red and grey colors reflect large and small values of KDE-LOO. On the right, a = −1; on the left, a = 2. The signal-to-noise ratio is 1.

Figure 6: Boxplots (based on 100 replications) of the error for KDE-LOO, KDE and ORACLE when estimating R_{µ,Γ}(ĝ) for different values of µ in the range of x_{i,1}. On the right, a = −1; on the left, a = 2. The signal-to-noise ratio is 1.


[6] P. J. Davis and P. Rabinowitz. Methods of Numerical Integration. Second edition. Dover, 2007.

[7] B. Delyon and F. Portier. Integral approximation by kernel smoothing. Bernoulli, 22(4):2177–2208, 2016.

[8] P. Doukhan. Mixing: Properties and Examples. Springer-Verlag, New York, 1994.

[9] E. Giné and A. Guillou. On consistency of kernel density estimators for randomly censored data: rates holding uniformly over adaptive intervals. Ann. Inst. H. Poincaré Probab. Statist., 37(4):503–522, 2001.

[10] E. Giné and H. Sang. Uniform asymptotics for kernel density estimators with variable bandwidths. J. Nonparametr. Stat., 22(5-6):773–795, 2010.

[11] W. Härdle and T. M. Stoker. Investigating smooth multiple regression by the method of average derivatives. J. Amer. Statist. Assoc., 84(408):986–995, 1989.

[12] W. Hoeffding. Probability inequalities for sums of bounded random variables. J. Amer. Statist. Assoc., 58:13–30, 1963.

[13] M. H. Kalos and P. A. Whitlock. Monte Carlo Methods. Wiley-Blackwell, 2008.

[14] T. Kanamori, S. Hido, and M. Sugiyama. A least-squares approach to direct importance estimation. J. Mach. Learn. Res., 10:1391–1445, 2009.

[15] V. Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization (with discussion). The Annals of Statistics, 34:2593–2706, 2006.

[16] A. J. Lee. U-statistics: Theory and Practice. CRC Press, 1990.

[17] P. Major. An estimate on the supremum of a nice class of stochastic integrals and U-statistics. Probab. Theory Related Fields, 134(3):489–537, 2006.

[18] C. McDiarmid. Concentration. In M. Habib, C. McDiarmid, J. Ramírez-Alfonsín, and B. Reed, editors, Probabilistic Methods for Algorithmic Discrete Mathematics, volume 16 of Algorithms and Combinatorics, pages 195–248. Springer Berlin Heidelberg, 1998.

[19] C. Oates, M. Girolami, and N. Chopin. Control functionals for Monte Carlo integration. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):695–718, 2017.

[20] H. Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. J. Statist. Plann. Inference, 90(2):227–244, 2000.

[21] B. W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman and Hall, 1986.

[22] M. Sugiyama, T. Suzuki, S. Nakajima, H. Kashima, P. von Bünau, and M. Kawanabe. Direct importance estimation for covariate shift adaptation. Ann. Inst. Statist. Math., 60(4):699–746, 2008.

[23] M. Talagrand. New concentration inequalities in product spaces. Invent. Math., 126(3):505–563, 1996.

[24] A. B. Tsybakov. Introduction to Nonparametric Estimation. Springer, 2009.

[25] A. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes. Springer Series in Statistics. Springer, New York, 1996.