moment-based uniform deviation bounds for k...

Moment-based Uniform Deviation Bounds fork-means and Friends

Matus Telgarsky Sanjoy DasguptaComputer Science and Engineering, UC San Diego

{mtelgars,dasgupta}@cs.ucsd.edu

Abstract

Suppose k centers are fit to m points by heuristically minimizing the k-meanscost; what is the corresponding fit over the source distribution? This question isresolved here for distributions with p � 4 bounded moments; in particular, thedifference between the sample cost and distribution cost decays with m and p asmmin{�1/4,�1/2+2/p}. The essential technical contribution is a mechanism to uni-formly control deviations in the face of unbounded parameter sets, cost functions,and source distributions. To further demonstrate this mechanism, a soft clusteringvariant of k-means cost is also considered, namely the log likelihood of a Gaus-sian mixture, subject to the constraint that all covariance matrices have boundedspectrum. Lastly, a rate with refined constants is provided for k-means instancespossessing some cluster structure.

1 Introduction

Suppose a set of k centers {pi

}ki=1

is selected by approximate minimization of k-means cost; howdoes the fit over the sample compare with the fit over the distribution? Concretely: given m pointssampled from a source distribution ⇢, what can be said about the quantities

�

�

�

�

�

1

m

m

X

j=1

min

i

kxj

� pi

k22

�Z

min

i

kx� pi

k22

d⇢(x)

�

�

�

�

�

(k-means), (1.1)

�

�

�

�

�

1

m

m

X

j=1

ln

k

X

i=1

↵i

p✓i(xj

)

!

�Z

ln

k

X

i=1

↵i

p✓i(x)

!

d⇢(x)

�

�

�

�

�

(soft k-means), (1.2)

where each p✓i denotes the density of a Gaussian with a covariance matrix whose eigenvalues lie in

some closed positive interval.

The literature offers a wealth of information related to this question. For k-means, there is firstly aconsistency result: under some identifiability conditions, the global minimizer over the sample willconverge to the global minimizer over the distribution as the sample size m increases [1]. Further-more, if the distribution is bounded, standard tools can provide deviation inequalities [2, 3, 4]. Forthe second problem, which is maximum likelihood of a Gaussian mixture (thus amenable to EM[5]), classical results regarding the consistency of maximum likelihood again provide that, undersome identifiability conditions, the optimal solutions over the sample converge to the optimum overthe distribution [6].

The task here is thus: to provide finite sample guarantees for these problems, but eschewing bound-edness, subgaussianity, and similar assumptions in favor of moment assumptions.

1

1.1 Contribution

The results here are of the following form: given m examples from a distribution with a few boundedmoments, and any set of parameters beating some fixed cost c, the corresponding deviations in cost(as in eq. (1.1) and eq. (1.2)) approach O(m�1/2

) with the availability of higher moments.

• In the case of k-means (cf. Corollary 3.1), p � 4 moments suffice, and the rate isO(mmin{�1/4,�1/2+2/p}

). For Gaussian mixtures (cf. Theorem 5.1), p � 8 momentssuffice, and the rate is O(m�1/2+3/p

).• The parameter c allows these guarantees to hold for heuristics. For instance, suppose k

centers are output by Lloyd’s method. While Lloyd’s method carries no optimality guar-antees, the results here hold for the output of Lloyd’s method simply by setting c to be thevariance of the data, equivalently the k-means cost with a single center placed at the mean.

• The k-means and Gaussian mixture costs are only well-defined when the source distribu-tion has p � 2 moments. The condition of p � 4 moments, meaning the variance has avariance, allows consideration of many heavy-tailed distributions, which are ruled out byboundedness and subgaussianity assumptions.

The main technical byproduct of the proof is a mechanism to deal with the unboundedness of thecost function; this technique will be detailed in Section 3, but the difficulty and its resolution can beeasily sketched here.

For a single set of centers P , the deviations in eq. (1.1) may be controlled with an application ofChebyshev’s inequality. But this does not immediately grant deviation bounds on another set ofcenters P 0, even if P and P 0 are very close: for instance, the difference between the two costs willgrow as successively farther and farther away points are considered.

The resolution is to simply note that there is so little probability mass in those far reaches that thecost there is irrelevant. Consider a single center p (and assume x 7! kx � pk2

2

is integrable); thedominated convergence theorem grantsZ

Bi

kx� pk22

d⇢(x) !Z

kx� pk22

d⇢(x), where Bi

:= {x 2 Rd

: kx� pk2

i}.

In other words, a ball Bi

may be chosen so thatR

B

cikx� pk2

2

d⇢(x) 1/1024. Now consider somep0 with kp� p0k

2

i. ThenZ

B

ci

kx� p0k22

d⇢(x) Z

B

ci

(kx� pk2

+ kp� p0k2

)

2d⇢(x) 4

Z

B

ci

kx� pk22

d⇢(x) 1

256

.

In this way, a single center may control the outer deviations of whole swaths of other centers. Indeed,those choices outperforming the reference score c will provide a suitable swath. Of course, it wouldbe nice to get a sense of the size of B

i

; this however is provided by the moment assumptions.

The general strategy is thus to split consideration into outer deviations, and local deviations. Thelocal deviations may be controlled by standard techniques. To control outer deviations, a single pairof dominating costs — a lower bound and an upper bound — is controlled.

This technique can be found in the proof of the consistency of k-means due to Pollard [1]. Thepresent work shows it can also provide finite sample guarantees, and moreover be applied outsidehard clustering.

The content here is organized as follows. The remainder of the introduction surveys related work,and subsequently Section 2 establishes some basic notation. The core deviation technique, termedouter bracketing (to connect it to the bracketing technique from empirical process theory), is pre-sented along with the deviations of k-means in Section 3. The technique is then applied in Section 5to a soft clustering variant, namely log likelihood of Gaussian mixtures having bounded spectra. Asa reprieve between these two heavier bracketing sections, Section 4 provides a simple refinement fork-means which can adapt to cluster structure.

All proofs are deferred to the appendices, however the construction and application of outer bracketsis sketched in the text.

2

1.2 Related Work

As referenced earlier, Pollard’s work deserves special mention, both since it can be seen as the originof the outer bracketing technique, and since it handled k-means under similarly slight assumptions(just two moments, rather than the four here) [1, 7]. The present work hopes to be a spiritualsuccessor, providing finite sample guarantees, and adapting technique to a soft clustering problem.

In the machine learning community, statistical guarantees for clustering have been extensively stud-ied under the topic of clustering stability [4, 8, 9, 10]. One formulation of stability is: if param-eters are learned over two samples, how close are they? The technical component of these worksfrequently involves finite sample guarantees, which in the works listed here make a boundednessassumption, or something similar (for instance, the work of Shamir and Tishby [9] requires the costfunction to satisfy a bounded differences condition). Amongst these finite sample guarantees, thefinite sample guarantees due to Rakhlin and Caponnetto [4] are similar to the development here afterthe invocation of the outer bracket: namely, a covering argument controls deviations over a boundedset. The results of Shamir and Tishby [10] do not make a boundedness assumption, but the mainresults are not finite sample guarantees; in particular, they rely on asymptotic results due to Pollard[7].

There are many standard tools which may be applied to the problems here, particularly if a bound-edness assumption is made [11, 12]; for instance, Lugosi and Zeger [2] use tools from VC theory tohandle k-means in the bounded case. Another interesting work, by Ben-david [3], develops special-ized tools to measure the complexity of certain clustering problems; when applied to the problemsof the type considered here, a boundedness assumption is made.

A few of the above works provide some negative results and related commentary on the topic ofuniform deviations for distributions with unbounded support [10, Theorem 3 and subsequent discus-sion] [3, Page 5 above Definition 2]. The primary “loophole” here is to constrain consideration tothose solutions beating some reference score c. It is reasonable to guess that such a condition en-tails that a few centers must lie near the bulk of the distribution’s mass; making this guess rigorousis the first step here both for k-means and for Gaussian mixtures, and moreover the same conse-quence was used by Pollard for the consistency of k-means [1]. In Pollard’s work, only optimalchoices were considered, but the same argument relaxes to arbitrary c, which can thus encapsulateheuristic schemes, and not just nearly optimal ones. (The secondary loophole is to make momentassumptions; these sufficiently constrain the structure of the distribution to provide rates.)

In recent years, the empirical process theory community has produced a large body of work on thetopic of maximum likelihood (see for instance the excellent overviews and recent work of Wellner[13], van der Vaart and Wellner [14], Gao and Wellner [15]). As stated previously, the choice of theterm “bracket” is to connect to empirical process theory. Loosely stated, a bracket is simply a pairof functions which sandwich some set of functions; the bracketing entropy is then (the logarithm of)the number of brackets needed to control a particular set of functions. In the present work, bracketsare paired with sets which identify the far away regions they are meant to control; furthermore,while there is potential for the use of many outer brackets, the approach here is able to make use ofjust a single outer bracket. The name bracket is suitable, as opposed to cover, since the bracketingelements need not be members of the function class being dominated. (By contrast, Pollard’s use inthe proof of the consistency of k-means was more akin to covering, in that remote fluctuations werecompared to that of a a single center placed at the origin [1].)

2 Notation

The ambient space will always be the Euclidean space Rd, though a few results will be stated for ageneral domain X . The source probability measure will be ⇢, and when a finite sample of size mis available, ⇢ is the corresponding empirical measure. Occasionally, the variable ⌫ will refer to anarbitrary probability measure (where ⇢ and ⇢ will serve as relevant instantiations). Both integral andexpectation notation will be used; for example, E(f(X)) = E

⇢

(f(X) =

R

f(x)d⇢(x); for integrals,R

B

f(x)d⇢(x) =R

f(x)1[x 2 B]d⇢(x), where 1 is the indicator function. The moments of ⇢ aredefined as follows.Definition 2.1. Probability measure ⇢ has order-p moment bound M with respect to norm k ·k whenE⇢

kX � E⇢

(X)kl M for 1 l p.

3

For example, the typical setting of k-means uses norm k·k2

, and at least two moments are needed forthe cost over ⇢ to be finite; the condition here of needing 4 moments can be seen as naturally arisingvia Chebyshev’s inequality. Of course, the availability of higher moments is beneficial, dropping therates here from m�1/4 down to m�1/2. Note that the basic controls derived from moments, whichare primarily elaborations of Chebyshev’s inequality, can be found in Appendix A.

The k-means analysis will generalize slightly beyond the single-center cost x 7! kx � pk22

viaBregman divergences [16, 17].Definition 2.2. Given a convex differentiable function f : X ! R, the corresponding Bregmandivergence is B

f

(x, y) := f(x)� f(y)� hrf(y), x� yi.

Not all Bregman divergences are handled; rather, the following regularity conditions will be placedon the convex function.Definition 2.3. A convex differentiable function f is strongly convex with modulus r

1

and has Lip-schitz gradients with constant r

2

, both respect to some norm k · k, when f (respectively) satisfies

f(↵x+ (1� ↵)y) ↵f(x) + (1� ↵)f(y)� r1

↵(1� ↵)

2

kx� yk2,

krf(x)�rf(y)k⇤ r2

kx� yk,where x, y 2 X , ↵ 2 [0, 1], and k · k⇤ is the dual of k · k. (The Lipschitz gradient condition issometimes called strong smoothness.)

These conditions are a fancy way of saying the corresponding Bregman divergence is sandwichedbetween two quadratics (cf. Lemma B.1).Definition 2.4. Given a convex differentiable function f : Rd ! R which is strongly convex andhas Lipschitz gradients with respective constants r

1

, r2

with respect to norm k · k, the hard k-meanscost of a single point x according to a set of centers P is

�f

(x;P ) := min

p2P

Bf

(x, p).

The corresponding k-means cost of a set of points (or distribution) is thus computed asE⌫

(�f

(X;P )), and let Hf

(⌫; c, k) denote all sets of at most k centers beating cost c, meaning

Hf

(⌫; c, k) := {P : |P | k,E⌫

(�f

(X;P )) c}.

For example, choosing norm k · k2

and convex function f(x) = kxk22

(which has r1

= r2

= 2), thecorresponding Bregman divergence is B

f

(x, y) = kx� yk22

, and E⇢

(�f

(X;P )) denotes the vanillak-means cost of some finite point set encoded in the empirical measure ⇢.

The hard clustering guarantees will work with Hf

(⌫; c, k), where ⌫ can be either the source distri-bution ⇢, or its empirical counterpart ⇢. As discussed previously, it is reasonable to set c to simplythe sample variance of the data, or a related estimate of the true variance (cf. Appendix A).

Lastly, the class of Gaussian mixture penalties is as follows.Definition 2.5. Given Gaussian parameters ✓ := (µ,⌃), let p

✓

denote Gaussian density

p✓

(x) =1

p

(2⇡)d|⌃i

|exp

✓

�1

2

(x� µi

)

T⌃�1

i

(x� µi

)

◆

.

Given Gaussian mixture parameters (↵,⇥) = ({↵i

}ki=1

, {✓i

}ki=1

) with ↵ � 0 andP

i

↵i

= 1

(written ↵ 2 �), the Gaussian mixture cost at a point x is

�g(x; (↵,⇥)) := �g(x; {(↵i

, ✓i

) = (↵i

, µi

,⌃i

)}ki=1

) := ln

k

X

i=1

↵i

p✓i(x)

!

,

Lastly, given a measure ⌫, bound k on the number of mixture parameters, and spectrum bounds0 < �

1

�2

, let Smog(⌫; c, k,�1

,�2

) denote those mixture parameters beating cost c, meaning

Smog(⌫; c, k,�1

,�2

) := {(↵,⇥) : �1

I � ⌃i

� �2

I, |↵| k,↵ 2 �,E⌫

(�g(X; (↵,⇥))) c} .

While a condition of the form ⌃ ⌫ �1

I is typically enforced in practice (say, with a Bayesian prior,or by ignoring updates which shrink the covariance beyond this point), the condition ⌃ � �

2

I ispotentially violated. These conditions will be discussed further in Section 5.

4

3 Controlling k-means with an Outer Bracket

First consider the special case of k-means cost.Corollary 3.1. Set f(x) := kxk2

2

, whereby �f

is the k-means cost. Let real c � 0 and probabilitymeasure ⇢ be given with order-p moment bound M with respect to k · k

2

, where p � 4 is a positivemultiple of 4. Define the quantities

c1

:= (2M)

1/p

+

p2c, M

1

:= M1/(p�2)

+M2/p, N1

:= 2 + 576d(c1

+ c21

+M1

+M2

1

).

Then with probability at least 1 � 3� over the draw of a sample of size m �max{(p/(2p/4+2e))2, 9 ln(1/�)}, every set of centers P 2 H

f

(⇢; c, k) [Hf

(⇢; c, k) satisfies�

�

�

�

Z

�f

(x;P )d⇢(x)�Z

�f

(x;P )d⇢(x)

�

�

�

�

m�1/2+min{1/4,2/p}

4 + (72c21

+ 32M2

1

)

s

1

2

ln

✓

(mN1

)

dk

�

◆

+

r

2

p/4ep

8m1/2

✓

2

�

◆

4/p

!

.

One artifact of the moment approach (cf. Appendix A), heretofore ignored, is the term (2/�)4/p.While this may seem inferior to ln(2/�), note that the choice p = 4 ln(2/�)/ ln(ln(2/�)) suffices tomake the two equal.

Next consider a general bound for Bregman divergences. This bound has a few more parametersthan Corollary 3.1. In particular, the term ✏, which is instantiated to m�1/2+1/p in the proof ofCorollary 3.1, catches the mass of points discarded due to the outer bracket, as well as the resolutionof the (inner) cover. The parameter p0, which controls the tradeoff between m and 1/�, is set to p/4in the proof of Corollary 3.1.Theorem 3.2. Fix a reference norm k · k throughout the following. Let probability measure ⇢ begiven with order-p moment bound M where p � 4, a convex function f with corresponding constantsr1

and r2

, reals c and ✏ > 0, and integer 1 p0 p/2� 1 be given. Define the quantities

RB

:= max

⇢

(2M)

1/p

+

p

4c/r1

,max

i2[p

0]

(M/✏)1/(p�2i)

�

,

RC

:=

p

r2

/r1

⇣

(2M)

1/p

+

p

4c/r1

+RB

⌘

+RB

,

B :=

�

x 2 Rd

: kx� E(X)k RB

,

C :=

�

x 2 Rd

: kx� E(X)k RC

,

⌧ := min

⇢

r

✏

2r2

,✏

2(RB

+RC

)r2

�

,

and let N be a cover of C by k · k-balls with radius ⌧ ; in the case that k · k is an lp

norm, the size ofthis cover has bound

|N | ✓

1 +

2RC

d

⌧

◆

d

.

Then with probability at least 1 � 3� over the draw of a sample of size m �max{p0/(e2p0

✏), 9 ln(1/�)}, every set of centers P 2 Hf

(⇢; c, k) [Hf

(⇢; c, k) satisfies

�

�

�

�

Z

�f

(x;P )d⇢(x)�Z

�f

(x;P )d⇢(x)

�

�

�

�

4✏+4r2

R2

C

s

1

2mln

✓

2|N |k�

◆

+

r

e2p0✏p0

2m

✓

2

�

◆

1/p

0

.

3.1 Compactification via Outer Brackets

The outer bracket is defined as follows.Definition 3.3. An outer bracket for probability measure ⌫ at scale ✏ consists of two triples, oneeach for lower and upper bounds.

5

1. The function `, function class Z`

, and set B`

satisfy two conditions: if x 2 Bc

`

and � 2 Z`

,then `(x) �(x), and secondly |

R

B

c``(x)d⌫(x)| ✏.

2. Similarly, function u, function class Zu

, and set Bu

satisfy: if x 2 Bc

u

and � 2 Zu

, thenu(x) � �(x), and secondly |

R

B

cuu(x)d⌫(x)| ✏.

Direct from the definition, given bracketing functions (`, u), a bracketed function �f

(·;P ), and thebracketing set B := B

u

[B`

,

�✏ Z

B

c

`(x)d⌫(x) Z

B

c

�f

(x;P )d⌫(x) Z

B

c

u(x)d⌫(x) ✏; (3.4)

in other words, as intended, this mechanism allows deviations on Bc to be discarded. Thus touniformly control the deviations of the dominated functions Z := Z

u

[ Z`

over the set Bc, itsuffices to simply control the deviations of the pair (`, u).

The following lemma shows that a bracket exists for {�f

(·;P ) : P 2 Hf

(⌫; c, k)} and compact B,and moreover that this allows sampled points and candidate centers in far reaches to be deleted.Lemma 3.5. Consider the setting and definitions in Theorem 3.2, but additionally define

M 0:= 2

p

0✏, `(x) := 0, u(x) := 4r

2

kx� E(X)k2, ✏⇢

:= ✏+

r

M 0ep0

2m

✓

2

�

◆

1/p

0

.

The following statements hold with probability at least 1 � 2� over a draw of size m �max{p0/(M 0e), 9 ln(1/�)}.

1. (u, `) is an outer bracket for ⇢ at scale ✏⇢

:= ✏ with sets B`

= Bu

= B and Z`

= Zu

=

{�f

(·;P ) : P 2 Hf

(⇢; c, k)[Hf

(⇢; c, k)}, and furthermore the pair (u, `) is also an outerbracket for ⇢ at scale ✏

⇢

with the same sets.

2. For every P 2 Hf

(⇢; c, k) [Hf

(⇢; c, k),�

�

�

�

Z

�f

(x;P )d⇢(x)�Z

B

�f

(x;P \ C)d⇢(x)

�

�

�

�

✏⇢

= ✏.

and�

�

�

�

Z

�f

(x;P )d⇢(x)�Z

B

�f

(x;P \ C)d⇢(x)

�

�

�

�

✏⇢

.

The proof of Lemma 3.5 has roughly the following outline.

1. Pick some ball B0

which has probability mass at least 1/4. It is not possible for an elementof H

f

(⇢; c, k) [ Hf

(⇢; c, k) to have all centers far from B0

, since otherwise the cost islarger than c. (Concretely, “far from” means at least

p

4c/r1

away; note that this termappears in the definitions of B and C in Theorem 3.2.) Consequently, at least one centerlies near to B

0

; this reasoning was also the first step in the k-means consistency proof dueto k-means Pollard [1].

2. It is now easy to dominate P 2 Hf

(⇢; c, k) [Hf

(⇢; c, k) far away from B0

. In particular,choose any p

0

2 B0

\ P , which was guaranteed to exist in the preceding point; sincemin

p2P

Bf

(x, p) Bf

(x, p0

) holds for all x, it suffices to dominate p0

. This dominationproceeds exactly as discussed in the introduction; in fact, the factor 4 appeared there, andagain appears in the u here, for exactly the same reason. Once again, similar reasoning canbe found in the proof by Pollard [1].

3. Satisfying the integral conditions over ⇢ is easy: it suffices to make B huge. To control thesize of B

0

, as well as the size of B, and moreover the deviations of the bracket over B, themoment tools from Appendix A are used.

Now turning consideration back to the proof of Theorem 3.2, the above bracketing allows the re-moval of points and centers outside of a compact set (in particular, the pair of compact sets B andC, respectively). On the remaining truncated data and set of centers, any standard tool suffices; formathematical convenience, and also to fit well with the use of norms in the definition of momentsas well as the conditions on the convex function f providing the divergence B

f

, norm structureused throughout the other properties, covering arguments are used here. (For details, please seeAppendix B.)

6

4 Interlude: Refined Estimates via Clamping

So far, rates have been given that guarantee uniform convergence when the distribution has a fewmoments, and these rates improve with the availability of higher moments. These moment condi-tions, however, do not necessarily reflect any natural cluster structure in the source distribution. Thepurpose of this section is to propose and analyze another distributional property which is intendedto capture cluster structure. To this end, consider the following definition.Definition 4.1. Real number R and compact set C are a clamp for probability measure ⌫ and familyof centers Z and cost �

f

at scale ✏ > 0 if every P 2 Z satisfies|E

⌫

(�f

(X;P ))� E⌫

(min {�f

(X;P \ C) , R})| ✏.

Note that this definition is similar to the second part of the outer bracket guarantee in Lemma 3.5,and, predictably enough, will soon lead to another deviation bound.Example 4.2. If the distribution has bounded support, then choosing a clamping value R and clamp-ing set C respectively slightly larger than the support size and set is sufficient: as was reasoned inthe construction of outer brackets, if no centers are close to the support, then the cost is bad. Corre-spondingly, the clamped set of functions Z should again be choices of centers whose cost is not toohigh.

For a more interesting example, suppose ⇢ is supported on k small balls of radius R1

, where thedistance between their respective centers is some R

2

� R1

. Then by reasoning similar to thebounded case, all choices of centers achieving a good cost will place centers near to each ball, andthus the clamping value can be taken closer to R

1

. ⌅

Of course, the above gave the existence of clamps under favorable conditions. The following showsthat outer brackets can be used to show the existence of clamps in general. In fact, the proof is veryshort, and follows the scheme laid out in the bounded example above: outer bracketing allows therestriction of consideration to a bounded set, and some algebra from there gives a conservative upperbound for the clamping value.Proposition 4.3. Suppose the setting and definitions of Lemma 3.5, and additionally define

R := 2((2M)

2/p

+R2

B

).

Then (C,R) is a clamp for measure ⇢ and center Hf

(⇢; c, k) at scale ✏, and with probability at least1 � 3� over a draw of size m � max{p0/(M 0e), 9 ln(1/�)}, it is also a clamp for ⇢ and centersH

f

(⇢; c, k) at scale ✏⇢

.

The general guarantee using clamps is as follows. The proof is almost the same as for Theorem 3.2,but note that this statement is not used quite as readily, since it first requires the construction ofclamps.Theorem 4.4. Fix a norm k · k. Let (R,C) be a clamp for probability measure ⇢ and empiricalcounterpart ⇢ over some center class Z and cost �

f

at respective scales ✏⇢

and ✏⇢

, where f hascorresponding convexity constants r

1

and r2

. Suppose C is contained within a ball of radius RC

,let ✏ > 0 be given, define scale parameter

⌧ := min

⇢

r

✏

2r2

,r1

✏

2r2

R3

�

,

and let N be a cover of C by k · k-balls of radius ⌧ (as per lemma B.4, if k · k is an lp

norm, then|N | (1+ (2R

C

d)/⌧)d suffices). Then with probability at least 1� � over the draw of a sample ofsize m � p0/(M 0e), every set of centers P 2 Z satisfies

�

�

�

�

Z

�f

(x;P )d⇢(x)�Z

�f

(x;P )d⇢(x)

�

�

�

�

2✏+ ✏⇢

+ ✏⇢

+R2

s

1

2mln

✓

2|N |k�

◆

.

Before adjourning this section, note that clamps and outer brackets disagree on the treatment of theouter regions: the former replaces the cost there with the fixed value R, whereas the latter uses thevalue 0. On the technical side, this is necessitated by the covering argument used to produce thefinal theorem: if the clamping operation instead truncated beyond a ball of radius R centered at eachp 2 P , then the deviations would be wild as these balls moved and suddenly switched the value at apoint from 0 to something large. This is not a problem with outer bracketing, since the same points(namely Bc) are ignored by every set of centers.

7

5 Mixtures of Gaussians

Before turning to the deviation bound, it is a good place to discuss the condition �1

I � ⌃ � �2

I ,which must be met by every covariance matrix of every constituent Gaussian in a mixture.

The lower bound �1

I � ⌃, as discussed previously, is fairly common in practice, arising eithervia a Bayesian prior, or by implementing EM with an explicit condition that covariance updates arediscarded when the eigenvalues fall below some threshold. In the analysis here, this lower bound isused to rule out two kinds of bad behavior.

1. Given a budget of at least 2 Gaussians, and a sample of at least 2 distinct points, arbitrarilylarge likelihood may be achieved by devoting one Gaussian to one point, and shrinking itscovariance. This issue destroys convergence properties of maximum likelihood, since thelikelihood score may be arbitrarily large over every sample, but is finite for well-behaveddistributions. The condition �

1

I � ⌃ rules this out.2. Another phenomenon is a “flat” Gaussian, meaning a Gaussian whose density is high along

a lower dimensional manifold, but small elsewhere. Concretely, consider a Gaussian overR2 with covariance ⌃ = diag(�,��1

); as � decreases, the Gaussian has large density ona line, but low density elsewhere. This phenomenon is distinct from the preceding in thatit does not produce arbitrarily large likelihood scores over finite samples. The condition�1

I � ⌃ rules this situation out as well.

In both the hard and soft clustering analyses here, a crucial early step allows the assertion that goodscores in some region mean the relevant parameter is nearby. For the case of Gaussians, the condition�1

I � ⌃ makes this problem manageable, but there is still the possibility that some far away, fairlyuniform Gaussian has reasonable density. This case is ruled out here via �

2

I ⌫ ⌃.Theorem 5.1. Let probability measure ⇢ be given with order-p moment bound M according to normk · k

2

where p � 8 is a positive multiple of 4, covariance bounds 0 < �1

�2

with �1

1 forsimplicity, and real c 1/2 be given. Then with probability at least 1 � 5� over the draw of asample of size m � max

�

(p/(2p/4+2e))2, 8 ln(1/�), d2 ln(⇡�2

)

2

ln(1/�)

, every set of Gaussianmixture parameters (↵,⇥) 2 Smog(⇢; c, k,�1

,�2

) [ Smog(⇢; c, k,�1

,�2

) satisfies�

�

�

�

Z

�g(x; (↵,⇥))d⇢(x)�Z

�g(x; (↵,⇥))d⇢(x)

�

�

�

�

= O⇣

m�1/2+3/p

⇣

1 +

p

ln(m) + ln(1/�) + (1/�)4/p⌘⌘

,

where the O(·) drops numerical constants, polynomial terms depending on c, M , d, and k, �2

/�1

,and ln(�

2

/�1

), but in particular has no sample-dependent quantities.

The proof follows the scheme of the hard clustering analysis. One distinction is that the outer bracketnow uses both components; the upper component is the log of the largest possible density — indeed,it is ln((2⇡�

1

)

�d/2

) — whereas the lower component is a function mimicking the log density ofthe steepest possible Gaussian — concretely, the lower bracket’s definition contains the expressionln((2⇡�

2

)

�d/2

) � 2kx � E⇢

(X)k22

/�1

, which lacks the normalization of a proper Gaussian, high-lighting the fact that bracketing elements need not be elements of the class. Superficially, a seconddistinction with the hard clustering case is that far away Gaussians can not be entirely ignored onlocal regions; the influence is limited, however, and the analysis proceeds similarly in each case.

Acknowledgments

The authors thank the NSF for supporting this work under grant IIS-1162581.

8

References[1] David Pollard. Strong consistency of k-means clustering. The Annals of Statistics, 9(1):135–

140, 1981.[2] Gbor Lugosi and Kenneth Zeger. Rates of convergence in the source coding theorem, in em-

pirical quantizer design, and in universal lossy source coding. IEEE Trans. Inform. Theory, 40:1728–1740, 1994.

[3] Shai Ben-david. A framework for statistical clustering with a constant time approximationalgorithms for k-median clustering. In COLT, pages 415–426. Springer, 2004.

[4] Alexander Rakhlin and Andrea Caponnetto. Stability of k-means clustering. In NIPS, pages1121–1128, 2006.

[5] Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern Classification. Wiley, 2 edition,2001.

[6] Thomas S. Ferguson. A course in large sample theory. Chapman & Hall, 1996.[7] David Pollard. A central limit theorem for k-means clustering. The Annals of Probability, 10

(4):919–926, 1982.[8] Shai Ben-david, Ulrike Von Luxburg, and David Pal. A sober look at clustering stability. In In

COLT, pages 5–19. Springer, 2006.[9] Ohad Shamir and Naftali Tishby. Cluster stability for finite samples. In Annals of Probability,

10(4), pages 919–926, 1982.[10] Ohad Shamir and Naftali Tishby. Model selection and stability in k-means clustering. In

COLT, 2008.[11] Stephane Boucheron, Olivier Bousquet, and Gabor Lugosi. Theory of classification: a survey

of recent advances. ESAIM: Probability and Statistics, 9:323–375, 2005.[12] Stephane Boucheron, Gabor Lugosi, and Pascal Massart. Concentration Inequalities: A

Nonasymptotic Theory of Independence. Oxford, 2013.[13] Jon Wellner. Consistency and rates of convergence for maximum likelihood estimators via

empirical process theory. 2005.[14] Aad van der Vaart and Jon Wellner. Weak Convergence and Empirical Processes. Springer,

1996.[15] FuChang Gao and Jon A. Wellner. On the rate of convergence of the maximum likelihood

estimator of a k-monotone density. Science in China Series A: Mathematics, 52(7):1525–1538,2009.

[16] Yair Al Censor and Stavros A. Zenios. Parallel Optimization: Theory, Algorithms and Appli-cations. Oxford University Press, 1997.

[17] Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon, and Joydeep Ghosh. Clustering withBregman divergences. Journal of Machine Learning Research, 6:1705–1749, 2005.

[18] Terence Tao. 254a notes 1: Concentration of measure, January2010. URL http://terrytao.wordpress.com/2010/01/03/

254a-notes-1-concentration-of-measure/.[19] I. F. Pinelis and S. A. Utev. Estimates of the moments of sums of independent random vari-

ables. Teor. Veroyatnost. i Primenen., 29(3):554–557, 1984. Translation to English by BernardSeckler.

[20] Shai Shalev-Shwartz. Online Learning: Theory, Algorithms, and Applications. PhD thesis,The Hebrew University of Jerusalem, July 2007.

[21] Jean-Baptiste Hiriart-Urruty and Claude Lemarechal. Fundamentals of Convex Analysis.Springer Publishing Company, Incorporated, 2001.

9

http://terrytao.wordpress.com/2010/01/03/254a-notes-1-concentration-of-measure/

http://terrytao.wordpress.com/2010/01/03/254a-notes-1-concentration-of-measure/

A Moment Bounds

This section provides the basic probability controls resulting from moments. The material deals withthe following slight generalization of the bounded moment definition from Section 2.Definition A.1. A function ⌧ : X ! Rd has order-p moment bound M for probability measure ⇢with respect to norm k · k if E

⇢

(k⌧(X)kl) M for all 1 l p. (For convenience, measure ⇢ andnorm k · k will be often be implicit.)

To connect this to the earlier definition, simply choose the map ⌧(x) := x � E⇢

(X). As was thecase in Section 2, this definition requires a uniform bound across all lth moments for 1 l p. Ofcourse, working with a probability measure implies these moments are all finite when just the pth

moment is finite. The significance of working with a bound across all moments will be discussedagain in the context of Lemma A.3 below.

The first result controls the measures of balls thanks to moments. This result is only stated for thesource distribution ⇢, but Hoeffding’s inequality suffices to control ⇢.Lemma A.2. Suppose ⌧ has order-p moment bound M . Then for any ✏ > 0,

Pr

h

k⌧(X)k (M/✏)1/pi

� 1� ✏.

Proof. If M = 0, the result is immediate. Otherwise, when M > 0, for any R > 0, by Chebyshev’sinequality,

Pr [k⌧(X)k < R] = 1� Pr [k⌧(X)k � R] � 1� Ek⌧(X)kpRp

� 1� M

Rp

;

the result follows by choosing R := (M/✏)1/p.

The following fact will be the basic tool for controlling empirical averages via moments. Both thestatement and proof are close to one by Tao [18, Equation 7], which rather than bounded momentsuses boundedness (almost surely). As discussed previously, the term 1/�1/l overtakes ln(1/�) whenl = ln(1/�)/ ln(ln(1/�)).

For simplicity, this result is stated in terms of univariate random variables; to connect with the earlierdevelopment, the random variable X will be substituted with the map x 7! k⌧(x)k.Lemma A.3. (Cf. Tao [18, Equation 7].) Let m i.i.d. copies {X

i

}mi=1

of a random variable X ,even integer p � 2, real M > 0 with E(|X � E(X)|l) M for 2 l p, and ✏ > 0 be given. Ifm � p/(Me), then

Pr

�

�

�

�

�

1

n

X

i

Xi

� E(X)

�

�

�

�

�

� ✏

!

2

(✏pm)

p

✓

Mpe

2

◆

p/2

.

In other words, with probability at least 1� � over a draw of size m � p/(Me),�

�

�

�

�

1

n

X

i

Xi

� E(X)

�

�

�

�

�

r

Mpe

2m

✓

2

�

◆

1/p

.

Proof. Without loss of generality, suppose E(X1

) = 0 (i.e., given Y1

with E(Y1

) 6= 0, work withX

i

:= Yi

� E(Y1

)). By Chebyshev’s inequality,

Pr

�

�

�

�

�

1

m

X

i

Xi

�

�

�

�

�

� ✏

!

E�

�

1

m

P

i

Xi

�

�

p

✏p=

E |P

i

Xi

|p

(m✏)p. (A.4)

Recalling p is even, consider the term

E�

�

�

�

�

X

i

Xi

�

�

�

�

�

p

= E

X

i

Xi

!

p

=

X

i1,i2,...,ip2[m]

E

0

@

p

Y

j=1

Xij

1

A .

10

If some ij

is equal to none of the others, then, by independence, a term E(Xij ) = 0 is introduced

and the product vanishes; thus the product is nonzero when each ij

has some copy ij

= ij

0 , andthus there are at most p/2 distinct values amongst {i

j

}pj=1

. Each distinct value contributes a termE(X l

) E(|X|l) M for some 2 l p, and thus

E�

�

�

�

�

X

i

Xi

�

�

�

�

�

p

p/2

X

r=1

MrNr

, (A.5)

where Nr

is the number of ways to choose a multiset of size p from [m], subject to the constraint thateach number appears at least twice, and at most r distinct numbers appear. One way to over-countthis is to first choose a subset of size r from m, and then draw from it (with repetition) p times:

Nr

✓

m

r

◆

rp mrrp

r! mrrp

(r/e)r= (me)rrp�r.

Plugging this into eq. (A.5), and thereafter re-indexing with r := p/2� j,

E�

�

�

�

�

X

i

Xi

�

�

�

�

�

p

p/2

X

r=1

(Mme)rrp�r p/2

X

r=1

(Mme)r(p/2)p�r

p/2

X

j=0

(Mme)p/2�j

(p/2)p/2+j ✓

Mmpe

2

◆

p/2

p/2

X

j=0

⇣ p

2Mme

⌘

j

.

Since p Mme,

E�

�

�

�

�

X

i

Xi

�

�

�

�

�

p

2

✓

Mmpe

2

◆

p/2

,

and the result follows by plugging this into eq. (A.4).

Thanks to Chebyshev’s inequality, proving Lemma A.3 boils down to controlling E|P

i

Xi

|p, whichhere relied on a combinatorial scheme by Tao [18, Equation 7]. There is, however, another approachto controlling this quantity, namely Rosenthal inequalities, which write this pth moment of the sumin terms of the 2nd and pth moments of individual random variables (general material on these boundscan be found in the book of Boucheron et al. [12, Section 15.4], however the specific form providedhere is most easily presented by Pinelis and Utev [19]). While Rosenthal inequalities may seem amore elegant approach, they involve different constants, and thus the approach and bound here arefollowed instead to suggest further work on how to best control E|

P

i

Xi

|p.

Returning to task, as was stated in the introduction, the dominated convergence theorem providesthat

R

Bikxk2

2

d⇢(x) !R

kxk22

d⇢(x) (assuming integrability of x 7! kxk22

), where the sequence ofballs {B

i

}1i=1

grow in radius without bound; moment bounds allow the rate of this process to bequantified as follows.Lemma A.6. Suppose ⌧ has order-p moment bound M , and let 0 < k < p be given. Then for any✏ > 0, the ball

B :=

n

x 2 X : k⌧(X)k (M/✏)1/(p�k)

o

satisfiesZ

B

c

k⌧(x)kkd⇢(x) ✏.

Proof. Let the B be given as specified; an application of Lemma A.2 with ✏0 := (✏p/Mk

)

1/(p�k)

yieldsZ

1[x 2 Bc

]d⇢(x) = Pr[k⌧(x)k > (M/✏)1/(p�k)

] = Pr[k⌧(x)k > (M/✏0)1/p] ✏0.

11

By Holder’s inequality with conjugate exponents p/k and p/(p�k) (where the condition 0 < k < pmeans each lies within (1,1)),

Z

B

c

k⌧(x)kkd⇢(x) =Z

k⌧(x)kk1[x 2 Bc

]d⇢(x)

✓

Z

k⌧(x)kk(p/k)d⇢(x)◆

k/p

✓

Z

1[x 2 Bc

]

p/(p�k)d⇢(x)

◆

(p�k)/p

(M)

k/p

✓

✏p/(p�k)

Mk/(p�k)

◆

(p�k)/p

= ✏

as desired.

Lastly, thanks to the moment-based deviation inequality in Lemma A.3, the deviations on this outerregion may be controlled. Note that in order to control the k-means cost (i.e., an exponent k = 2),at least 4 moments are necessary (p � 4).

Lemma A.7. Let integers k � 1 and p0 � 1 be given, and set p := k(p0+1). Suppose ⌧ has order-pmoment bound M , and let ✏ > 0 be arbitrary. Define the radius R and ball B as

R := max{(M/✏)1/(p�ik)

: 1 i < p/k} and B := {x 2 X : k⌧(x)k R} ,

and set M 0:= 2

p

0✏. With probability at least 1�� over the draw of a sample of size m � p0/(M 0e),

�

�

�

�

Z

B

c

k⌧(x)kkd⇢(x)�Z

B

c

k⌧(x)kkd⇢(x)�

�

�

�

r

M 0ep0

2m

✓

2

�

◆

1/p

0

.

Proof. Consider a fixed 1 i < p/k = p0 + 1, and set l = ik. Let Bl

be the ball provided byLemma A.6 for exponent l. Since B ◆ B

l

,Z

B

c

k⌧(x)kld⇢(x) Z

B

cl

k⌧(x)kld⇢(x) ✏.

As such, by Minkowski’s inequality, since z 7! zl is convex for l � 1,

Z

�

�

�

�

k⌧(x)k1[x 2 Bc

]�Z

B

c

k⌧(x)kd⇢(x)�

�

�

�

l

d⇢(x)

!

1/l

✓

Z

B

c

k⌧(x)kld⇢(x)◆

1/l

+

✓

Z

B

c

k⌧(x)kd⇢(x)◆

l/l

2

✓

Z

B

c

k⌧(x)kld⇢(x)◆

1/l

,

meaningZ

�

�

�

�

k⌧(x)k1[x 2 Bc

]�Z

B

c

k⌧(x)kd⇢(x)�

�

�

�

l

d⇢(x) 2

l

Z

B

c

k⌧(x)kl 2

l

Z

B

cl

k⌧(x)kl 2

l✏.

Since l = ik had 1 i < p/k = p0 + 1 arbitrary, it follows that the map x 7! k⌧(x)kk1[x 2 Bc

]

has its first p0 moments bounded by 2

p

0✏.

The finite sample bounds will now proceed with an application of Lemma A.3, where the randomvariable X will be the map x 7! k⌧(x)kk1[x 2 Bc

]. Plugging the above moment bounds for thisrandom variable into Lemma A.3, the result follows.

12

B Deferred Material from Section 3

Before proceeding with the main proofs, note that Bregman divergences in the setting here aresandwiched between quadratics.Lemma B.1. If differentiable f is r

1

strongly convex with respect to k · k, then Bf

(x, y) �r1

kx � yk2. If differentiable f has Lipschitz gradients with parameter r2

with respect to k · k,then B

f

(x, y) r2

kx� yk2.

Proof. The first part (strong convexity) is standard (see for instance the proof by Shalev-Shwartz [20,Lemma 13], or a similar proof by Hiriart-Urruty and Lemarechal [21, Theorem B.4.1.4]). For thesecond part, by the fundamental theorem of calculus, properties of norm duality, and the Lipschitzgradient property,

f(x) = f(y) + hrf(y), x� yi+Z

1

0

hrf(y + t(x� y))�rf(y), x� yi dt

f(y) + hrf(y), x� yi+Z

1

0

krf(y + t(x� y))�rf(y)k⇤kx� ykdt

f(y) + hrf(y), x� yi+ r2

2

kx� yk2.(The preceding is also standard; see for instance the beginning of a proof by Hiriart-Urruty andLemarechal [21, Theorem E.4.2.2], which only differs by fixing the norm k · k

2

.)

B.1 Proof of Lemma 3.5

The first step is the following characterization of Hf

(⌫; c, k): at least one center must fall withinsome compact set. (The lemma works more naturally with the contrapositive.) The proof by Pollard[1] also started by controlling a single center.Lemma B.2. Consider the setting of Lemma 3.5, and additionally define the two balls

B0

:=

n

x 2 Rd

: kx� E⇢

(X)k (2M)

1/p

o

,

C0

:=

n

x 2 Rd

: kx� E⇢

(X)k (2M)

1/p

+

p

4c/r1

o

,

Then ⇢(B0

) � 1/2, and for any center set P , if P \C0

= ; then E⇢

(�f

(X;P )) � 2c. Furthermore,with probability at least 1� � over a draw from ⇢ of size at least

m � 9 ln

✓

1

�

◆

.

then ⇢(B0

) > 1/4 and P \ C0

= ; implies E⇢

(�f

(X;P )) > c.

Proof. The guarantee ⇢(B0

) � 1/2 is direct from Lemma A.2 with moment map ⌧(x) := x �E⇢

(X). By Hoeffding’s inequality and the lower bound on m, with probability at least 1� �,

⇢(B0

) � ⇢(B0

)�s

1

2mln

✓

1

�

◆

>1

4

.

By the definition of C0

, every p 2 Cc

0

and x 2 B0

satisfiesBf

(x, p) � r1

kx� pk2 � 4c.

Now let ⌫ denote either ⇢ or ⇢; then for any set of centers P with P \ C0

= ; (meaning P ✓ Cc

0

),Z

�f

(x;P )d⌫(x) =

Z

min

p2P

Bf

(x, p)d⌫(x)

�Z

B0

min

p2P

Bf

(x, p)d⌫(x)

�Z

B0

min

p2P

4cd⌫(x)

= 4c⌫(B0

).

Instantiating ⌫ with ⇢ or ⇢, the results follow.

13

With this tiny handle on the structure of a set of centers P satisfying �f

(x;P ) c, the proof ofLemma 3.5 follows.

Proof of Lemma 3.5. Throughout both sections, let B0

and C0

be as defined in Lemma B.2; it fol-lows by Lemma B.2, with probability at least 1 � �, that P 2 H

f

(⇢; c, k) [ Hf

(⇢; c, k) impliesP \ C

0

6= ;. Henceforth discard this failure event, and fix any P 2 Hf

(⇢; c, k) [Hf

(⇢; c, k).

1. Since P \C0

6= ;, fix some p0

2 P \C0

. Since B ◆ C0

by definition, it follows, for everyx 2 Bc that

�f

(x;P ) = min

p2P

Bf

(x, p) r2

kx� p0

k2 r2

(kx� E⇢

(X)k+ kp0

� E⇢

(X)k)2

4r2

kx� E⇢

(X)k2 = u(x).

Additionally,`(x) = 0 min

p2P

r1

kx� pk2 �f

(x;P ),

meaning u and ` properly bracket Z`

= Zu

over Bc; what remains is to control their massover Bc.

Since ` = 0,�

�

�

�

Z

B

c

`(x)d⇢(x)

�

�

�

�

=

�

�

�

�

Z

B

c

`(x)d⇢(x)

�

�

�

�

= 0 < ✏.

Next, for u with respect to ⇢, the result follows from the definition of u together withLemma A.6 (using the map ⌧(x) = x� E

⇢

(X) together with exponent 2).

Lastly, to control u with respect to ⇢, note that p0 p/2�1 means p := 2(p0+1) p, andthus the map ⌧(x) := kx�E

⇢

(X)k2 has order-p moment bound M . Thus, by Lemma A.7and the triangle inequality,

�

�

�

�

Z

B

c

u(x)d⇢(x)

�

�

�

�

✏+

r

M 0ep0

2m

✓

2

�

◆

1/p

0

= ✏⇢

.

2. Throughout this proof, let ⌫ denote either ⇢ or ⇢; the above established�

�

�

�

Z

B

c

u(x)d⌫(x)

�

�

�

�

✏⌫

,

where in the case of ⌫ = ⇢, this statement holds with probability 1� �; henceforth discardthis failure event, and thus the statement holds for both cases.

By definition of C, for any p 2 Cc and x 2 B,

Bf

(x, p) � r1

kx� pk2 � r1

⇣

p

r2

/r1

⇣

(2M)

1/p

+

p

4c/r1

+RB

⌘⌘

2

= r2

⇣

(2M)

1/p

+

p

4c/r1

+RB

⌘

2

.

On the other hand, fixing any p0

2 P \ C0

(which was guaranteed to exist at the start ofthis proof), since C

0

✓ C,

sup

x2B

�f

(x;P \ C) sup

x2B

r2

kx� p0

k2 r2

⇣

(2M)

1/p

+

p

4c/r1

+RB

⌘

2

.

Consequently, no element of B is closer to an element of P \ C than to any element ofP \ C. As such,Z

�f

(x;P )d⌫(x) �Z

B

�f

(x;P )d⌫(x) +

Z

B

c

`(x)d⌫(x) =

Z

B

�f

(x;P \ C)d⌫(x).

(Note here that `(x) = 0 was used directly, rather than the ✏ provided by outer covering;in the case of Gaussian mixtures, both bracket elements are nonzero, and ✏ will be used.)This establishes one direction of the bound.

14

For the other direction, note that adding centers back in only decreases cost (becausemin

p2P\C

is replaced with min

p2P

), and thus recalling the properties of the outer bracketelement u established above,

Z

B

�f

(x;P \ C)d⌫(x) =

Z

�f

(x;P \ C)d⌫(x)�Z

B

c

�f

(x;P \ C)d⌫(x)

�Z

�f

(x;P \ C)d⌫(x)�Z

B

c

u(x)d⌫(x)

�Z

�f

(x;P )d⌫(x)� ✏⌫

,

which gives the result(s).

B.2 Covering Properties

The next step is to control the deviations over the bounded portion; this is achieved via uniformcovers, as developed in this subsection.

First, another basic fact about Bregman divergences.Lemma B.3. Let differentiable convex function f be given with Lipschitz gradient constant r

2

withrespect to norm k · k, and let B

f

be the corresponding Bregman divergence. For any {x, y, z} ✓ X ,

Bf

(x, z) Bf

(x, y) + Bf

(y, z) + r2

kx� ykky � zk.Similarly, given finite sets Y ✓ X and Z ✓ X , and letting Y (p) and Z(p) respectively select (any)closest point in Y and Z to p according to B

f

, meaning

Y (p) := argmin

y2Y

Bf

(y, p) and Z(p) := argmin

z2Z

Bf

(z, p),

then

min

z2Z

Bf

(x, z) min

y2Y

Bf

(x, y) + Bf

(Y (x), Z(Y (x))) + r2

kx� Y (x)kkY (x)� Z(Y (x))k.

Proof. By definition of Bf

, properties of dual norms, and the Lipschitz gradient property,

Bf

(x, z)� Bf

(x, y)� B(y, z) = f(x)� f(z)� f(x) + f(y)� f(y) + f(z)

� hrf(z), x� zi+ hrf(y), x� yi+ hrf(z), y � zi= hrf(y)�rf(z), x� yi krf(y)�rf(z)k⇤kx� yk r

2

ky � zkkx� yk;rearranging this inequality gives the first statement.

The second statement follows the first instantiated with y = Y (x) and z = Z(Y (x)), since

min

z2Z

Bf

(x, z) Bf

(x, Z(Y (x)))

Bf

(x, Y (x)) + Bf

(Y (x), Z(Y (x))) + r2

kx� Y (x)kkY (x)� Z(Y (x))k,and using B

f

(x, Y (x)) = min

y2Y

Bf

(x, y).

The covers will be based on norm balls; the following estimate is useful.Lemma B.4. If k · k is an l

p

norm over Rd, then the ball of radius R admits a cover N with size

|N | ✓

1 +

2Rd

⌧

◆

d

.

Proof. It suffices to grid the B with l1 balls centered at grid points at scale ⌧/d; the result followssince the l1 balls of radius ⌧/d are contained in l

p

balls of radius ⌧ for all p � 1.

15

The uniform covering result is as follows.

Lemma B.5. Let scale ✏ > 0, ball B := {x 2 Rd

: kx � E(X)k R}, parameter set Z := {x 2Rd

: kx � E(X)k R2

}, and differentiable convex function f with Lipschitz gradient parameterr2

with respect to norm k · k be given. Define resolution parameter

⌧ := min

⇢

r

✏

2r2

,✏

2(R2

+R)r2

�

,

and let N be set of centers for a cover of Z by k · k-balls of radius ⌧ (see Lemma B.4 for an estimatewhen k · k is an l

p

norm). It follows that there exists a uniform cover F at scale ✏ with cardinality|N |k, meaning for any collection P = {p

i

}li=1

with pi

2 Z and l k, there is a cover element Qwith

sup

x2B

�

�

�

�

min

p2P

Bf

(x, p)�min

q2Q

Bf

(x, q)

�

�

�

�

✏.

Proof. Given a collection P as specified, choose Q so that for every p 2 P , there is q 2 Q withkp� qk ⌧ , and vice versa. By Lemma B.3 (and using the notation therein), for any x 2 Bc,

min

p2P

Bf

(x, p) min

q2Q

Bf

(x, q) + Bf

(Q(x), P (Q(x))) + r2

kx�Q(x)kkQ(x)� P (Q(x))k

min

q2Q

Bf

(x, q) + r2

⌧2 + r2

⌧(R+R2

)

min

q2Q

Bf

(x, q) + ✏;

the reverse inequality holds for the same reason, and the result follows.

B.3 Proof of Theorem 3.2 and Corollary 3.1

First, the proof of the general rate for Hf

(⌫; c, k).

Proof of Theorem 3.2. For convenience, define M 0= 2

p

0✏. By Lemma B.5, let N be a cover of the

set C, whereby every set of centers P ✓ C with |P | k has a cover element Q 2 N k with

sup

x2B

�

�

�

�

min

p2P

Bf

(x, p)�min

q2Q

Bf

(x, q)

�

�

�

�

✏; (B.6)

when k · k is an lp

norm, Lemma B.4 provides the stated estimate of its size. Since B ✓ C and

sup

x2B

sup

p2C

Bf

(x, p) r2

sup

x2B

sup

p2C

kx� pk2 4r2

R2

C

,

it follows by Hoeffding’s inequality and a union bound over N k that with probability at least 1� �,

sup

Q2Nk

�

�

�

�

Z

B

�(x;Q)d⇢(x)�Z

B

�(x;Q)d⇢(x)

�

�

�

�

4r2

R2

C

s

1

2mln

✓

2|N |k�

◆

. (B.7)

For the remainder of this proof, discard the corresponding failure event.

Now let any P 2 Hf

(⇢; c, k) [Hf

(⇢; c, k) be given, and let Q 2 N k be a cover element satisfyingeq. (B.6) for P \ C. By eq. (B.6), eq. (B.7), and Lemma 3.5 (and thus discarding an additional

16

failure event having probability 2�),�

�

�

�

Z

�f

(x;P )d⇢(x)�Z

�f

(x;P )d⇢(x)

�

�

�

�

�

�

�

�

Z

�f

(x;P )d⇢(x)�Z

B

�f

(x;P \ C)d⇢(x)

�

�

�

�

+

�

�

�

�

Z

B

�f

(x;P \ C)d⇢(x)�Z

B

�f

(x;Q)d⇢(x)

�

�

�

�

+

�

�

�

�

Z

B

�f

(x;Q)d⇢(x)�Z

B

�f

(x;Q)d⇢(x)

�

�

�

�

+

�

�

�

�

Z

B

�f

(x;Q)d⇢(x)�Z

B

�f

(x;P \ C)d⇢(x)

�

�

�

�

+

�

�

�

�

Z

B

�f

(x;P \ C)d⇢(x)�Z

�f

(x;P )d⇢(x)

�

�

�

�

2✏+ 4r2

R2

C

s

1

2mln

✓

2|N |k�

◆

+ ✏⇢

+ ✏⇢

,

and the result follows by unwrapping the definitions of ✏⇢

and ✏⇢

from Lemma 3.5, and M 0= 2

p

0✏

as above.

The more concrete bound for the k-means cost is proved as follows.

Proof of Corollary 3.1. Set

✏ := m�1/2+1/p, p0 := p/4, M 0:= 2

p

0✏ = 2

p/4m�1/2+1/p,

and recall f(x) := kxk22

has convexity constants r1

= r2

= 2. Since

m =

pmpm � p

pm

2

p/4+2e� p0m1/2�1/p

2

p

0e=

p0

M 0e

and p0 = p/2 � p/4 p/2 � 1, the conditions for Theorem 3.2 are met, and thus with probabilityat least 1� �,�

�

�

�

Z

�f

(x;P )d⇢(x)�Z

�f

(x;P )d⇢(x)

�

�

�

�

4✏+ 4R2

C

s

1

2mln

✓

2|N |k�

◆

+

r

2

p/4ep✏

8m

✓

2

�

◆

4/p

,

where

RC

:= (2M)

1/p

+

p2c+ 2R

B

,

RB

:= max

⇢

(2M)

1/p

+

p2c,max

i2[p

0]

(M/✏)1/(p�2i)

�

,

|N | ✓

1 +

2RC

d

⌧

◆

d

,

⌧ := min

⇢

r

✏

4

,✏

4(RB

+RC

)

�

.

To simplify these quantities, since ✏ 1, the term 1/✏1/(p�2i), as i ranges between 1 and p� 2p0, ismaximized at i = 1/(p� 2p0) = 2/p. Therefore, by choice of M

1

and ✏,

RB

c1

+ (M/✏)1/(p�2)

+ (M/✏)1/(p�2p

0) c

1

+ (M1/(p�2)

+M1/(p�2p

0)

)/✏2/p

= c1

+M1

m1/p�2/p

2

.

Consequently,

RC

= c1

+ 2RB

3c1

+ 2M1

m1/p�2/p

2

and R2

C

18c21

+ 8M2

1

m2/p�4/p

2

.

17

This entails2R

C

d

⌧ 2R

C

d⇣

2m1/4�1/(2p)

+ 4(RB

+RC

)m1/2�1/p

⌘

8d⇣

(3c1

+ 2M1

m1/p�2/p

2

)m1/4�1/(2p)

+ (36c21

+ 16M2

1

m2/p�4/p

2

)m1/2�1/p

⌘

288dm(c1

+ c21

+M1

+M2

1

).

Secondly,

R2

Cpm

(18c21

+ 8M2

1

m2/p�4/p

2

)m�1/2 mmin{1/4,�1/2+2/p}(18c2

1

+ 8M2

1

).

The last term is direct, sincep

✏/m = m�1/4+1/(2p)�1/2

= m�1/2+1/(2p)m�1/4.

Combining these pieces, the result follows.

C Deferred Material from Section 4

First, the deferred proof that outer brackets give rise to clamps.

Proof of Proposition 4.3. Throughout this proof, let ⌫ refer to either ⇢ or ⇢, with ✏⌫

similarly refer-ring to either ✏

⇢

or ✏⇢

. Let P 2 Hf

(⇢; c, k) [Hf

(⇢; c, k) be given.

One direction is direct:Z

�f

(x;P )d⌫(x) �Z

�f

(x;P \ C)d⌫(x)

�Z

min{�f

(x;P \ C), R}d⌫(x).

For the second direction, with probability at least 1 � �, Lemma B.2 grants the existence of p0 2P \ C

0

✓ P \ C. Consequently, for any x 2 B,

min

p2P

Bf

(x, p) min

p2P\C

Bf

(x, p) Bf

(x, p0)

r2

kx� p0k2 2r2

�

kx� E⇢

(X)k2 + kp0 � E⇢

(X)k2�

R;

in other words, if x 2 B, then min{�f

(x;P \ C), R} = �f

(x;P \ C). Combining this with thelast part of Lemma 3.5.

Z

min{�f

(x;P \ C), R}d⌫(x) �Z

B

min{�f

(x;P \ C), R}d⌫(x)

�Z

B

�f

(x;P \ C)d⌫(x)

�Z

�f

(x;P )d⌫(x)� ✏⌫

.

The proof of Theorem 4.4 will depend on the following uniform covering property of the clampedcost (which mirrors Lemma B.5 for the unclamped cost).Lemma C.1. Let scale ✏ > 0, clamping value R

3

, parameter set C contained within a k · k-ballof some radius R

2

, and differentiable convex function f with Lipschitz gradient parameter r2

andstrong convexity modulus r

1

with respect to norm k · k be given. Define resolution parameter

⌧ := min

⇢

r

✏

2r2

,r1

✏

2r2

R3

�

,

18

and let N be set of centers for a cover of C by k ·k-balls of radius ⌧ (see Lemma B.4 for an estimatewhen k · k is an l

p

norm). It follows that there exists a uniform cover F at scale ✏ with cardinality|N |k, meaning for any collection P = {p

i

}li=1

with pi

2 C and l k, there is a cover element Qwith

sup

x

�

�

�

�

min

⇢

R3

,min

p2P

Bf

(x, p)

�

�min

⇢

R3

,min

q2Q

Bf

(x, q)

�

�

�

�

�

✏.

Proof. Given a collection P as specified, choose Q so that for every p 2 P , there is q 2 Q withkp� qk ⌧ , and vice versa.

First suppose min

q2Q

Bf

(x, q) � R3

; then

min

⇢

R3

,min

p2P

Bf

(x, p)

�

R3

= min

⇢

R3

,min

q2Q

Bf

(x, q)

�

as desired.

Otherwise, min

q2Q

Bf

(x, q) < R3

, which by the sandwiching property (cf. Lemma B.1) means

r1

kx�Q(x)k Bf

(x,Q(x)) < R3

.

By Lemma B.3,

min

⇢

R3

,min

p2P

Bf

(x, p)

�

min

⇢

R3

,min

q2Q

Bf

(x, q) + Bf

(Q(x), P (Q(x))) + r2

kx�Q(x)kkQ(x)� P (Q(x))k�

min

⇢

R3

,min

q2Q

Bf

(x, q) + r2

⌧2 + r2

⌧kx�Q(x)k�

min

⇢

R3

,min

q2Q

Bf

(x, q) + r2

⌧2 +r2

R3

r1

⌧

�

min

⇢

R3

,min

q2Q

Bf

(x, q)

�

+ ✏.

The reverse inequality is analogous.

The proof of Theorem 4.4 follows.

Proof of Theorem 4.4. This proof is a minor alteration of the proof of Theorem 3.2.

By Lemma C.1, let N be a cover of the set C, whereby every set of centers P ✓ C with |P | khas a cover element Q 2 N k with

sup

x

|min{�f

(x;P ), R}�min{�f

(x;Q), R}| ✏; (C.2)

when k · k is an lp

norm, Lemma B.4 provides the stated estimate of its size. Sincemin{�

f

(x;Q), R} 2 [0, R], it follows by Hoeffding’s inequality and a union bound over N k thatwith probability at least 1� �,

sup

Q2Nk

�

�

�

�

Z

B

�f

(x;Q)d⇢(x)�Z

B

�f

(x;Q)d⇢(x)

�

�

�

�

R

s

1

2mln

✓

2|N |k�

◆

. (C.3)


19

Now let any P 2 Z be given, and let Q 2 N k be a cover element satisfying eq. (C.2) for P \C. Byeq. (C.2), eq. (C.3), and lastly by the definition of clamp,�

�

�

�

Z

�f

(x;P )d⇢(x)�Z

�f

(x;P )d⇢(x)

�

�

�

�

�

�

�

�

Z

�f

(x;P )d⇢(x)�Z

min{�f

(x;P \ C), R}d⇢(x)�

�

�

�

+

�

�

�

�

Z

min{�f

(x;P \ C), R}d⇢(x)�Z

min{�f

(x;Q), R}d⇢(x)�

�

�

�

+

�

�

�

�

Z

min{�f

(x;Q), R}d⇢(x)�Z

min{�f

(x;Q), R}d⇢(x)�

�

�

�

+

�

�

�

�

Z

min{�f

(x;Q), R}d⇢(x)�Z

min{�f

(x;P \ C), R}d⇢(x)�

�

�

�

+

�

�

�

�

Z

min{�f

(x;P \ C), R}d⇢(x)�Z

�f

(x;P )d⇢(x)

�

�

�

�

2✏+ ✏⇢

+ ✏⇢

+R2

s

1

2mln

✓

2|N |k�

◆

.

D Deferred Material from Section 5

The following notation for restricting a Gaussian mixture to a certain set of means will be convenientthroughout this section.Definition D.1. Given a Gaussian mixture with parameters (↵,⇥) (where ↵ = {↵

i

}ki=1

and ⇥ =

{✓i

}ki=1

= {(µi

,⌃i

)}ki=1

), and a set of means B ✓ Rd, define

(↵,⇥) uB := {({↵i

}i2I

, {(µi

,⌃i

)}i2I

) : I = {1 i k : µi

2 B}} .

(Note that potentiallyP

i2I

↵i

< 1, and thus the terminology partial Gaussian mixture is sometimesemployed.)

D.1 Constructing an Outer Bracket

The first step is to show that pushing a mean far away from a region will rapidly decrease its densitythere, which is immediate from the condition �

1

I � ⌃ � �2

I .Lemma D.2. Let probability measure ⇢, accuracy ✏ > 0, covariance lower bound 0 < �

1

�2

,and radius R with corresponding l

2

ball B := {x 2 Rd

: kx� E⇢

(X)k2

R} be given. Define

R1

:=

s

2�2

ln

✓

1

(2⇡�1

)

d/2✏2

◆

R2

:= R+R1

,

B2

:= {µ 2 Rd

: kµ� E⇢

(X)k2

R2

}.

If ✓ = (µ,⌃) is the parameterization of a Gaussian density p✓

with �1

I � ⌃ � �2

I but µ 62 B2

,then p

✓

(x) < ✏ for every x 2 B.

Proof. Let Gaussian parameters ✓ = (µ,⌃) be given with �1

I � ⌃ � �2

I , but µ 62 B2

. By thedefinition of B

2

, for any x 2 B1

,

p✓

(x) < (2⇡�1

)

�d/2

exp(�R2

1

/(2�2

)) = ✏.

The upper component of the outer bracket will be constructed first (and indeed used in the construc-tion of the lower component).

20

Lemma D.3. Let probability measure ⇢ with order-p moment bound with respect to k · k2

, targetaccuracy ✏ > 0, and covariance lower bound 0 < �

1

be given. Define

pmax

:= (2⇡�1

)

�d/2,

u(x) := ln(pmax

),

Ru

:= (M | ln(pmax

)|/✏)1/p,B

u

:=

�

x 2 Rd

: kx� E⇢

(X)k2

Ru

.

If p✓

denotes a Gaussian density with parameters ✓ = (µ,⌃) satisfying ⌃ ⌫ �1

I , then p✓

ueverywhere. Additionally,

�

�

�

�

�

Z

B

cu

u(x)d⇢(x)

�

�

�

�

�

Z

B

cu

|u(x)|d⇢(x) ✏,

and with probability at least 1� � over the draw of m points from ⇢,�

�

�

�

�

Z

B

cu

u(x)d⇢(x)

�

�

�

�

�

Z

B

cu

|u(x)|d⇢(x) ✏+ | ln(pmax

)|s

1

2mln

✓

1

�

◆

.

(That is to say, u is the upper part of an outer bracket for all Gaussians (and mixtures thereof) whereeach covariance ⌃ satisfies ⌃ ⌫ �

1

I .)

Proof. Let p✓

with ✓ = (µ,⌃) satisfying ⌃ ⌫ �1

I be given. Then

p✓

(x) 1

p

(2⇡)d�d

1

exp( 0 ) = pmax

.

Next, given the form of Bu

, if ln(pmax

) = 0, the result is immediate, thus suppose ln(pmax

) 6= 0;Lemma A.2 provides that ⇢(B

u

) � 1� ✏/| ln(pmax

)|, whereby�

�

�

�

�

Z

B

cu

u(x)d⇢(x)

�

�

�

�

�

Z

B

cu

|u(x)|d⇢(x) = | ln(pmax

)|⇢(Bc

u

) ✏.

For the finite sample guarantee, by Hoeffding’s inequality,

⇢(Bc

u

) ⇢(Bc

u

) +

s

1

2mln

✓

1

�

◆

✏

| ln(pmax

)| +s

1

2mln

✓

1

�

◆

,

which gives the result similarly to the case for ⇢.

From, here, a tiny control on Smog(⌫; c, k,�1

,�2

) emerges, analogous to Lemma B.2 for Hf

(⌫; c, k).Lemma D.4. Let covariance bounds 0 < �

1

�2

, cost c 1/2, and probability measure ⇢ withorder-p moment bound M with respect to k · k

2

be given. Define

pmax

:= (2⇡�1

)

�d/2,

R3

:= (2M | ln(pmax

)|)1/p,R

4

:= (2M)

1/p,

R5

:=

s

2�2

✓

ln

✓

8e

(2⇡�1

)

d/2

◆

� 4c

◆

.

R6

:= max{R3

, R4

}+R5

.

B6

:= {x 2 Rd

: kx� E⇢

(X)k2

R6

}.Suppose

m � 2 ln(1/�)max{4, | ln(pmax

)|2}.With probability at least 1 � 2�, given any (↵,⇥) 2 Smog(⇢; c, k,�1

,�2

) [ Smog(⇢; c, k,�1

,�2

),the restriction (↵0,⇥0

) = (↵,⇥) u B6

is nonempty, and moreover satisfiesP

↵i2↵

0 ↵i

�exp(4c)/(8ep

max

).

21

Proof. Define

B3

:=

�

x 2 Rd

: kx� E⇢

(X)k2

max{R3

, R4

}

.

Since B3

has radius at least R4

, Lemma A.2 provides

⇢(B3

) � 1/2,

and Hoeffding’s inequality and the lower bound on m provide (with probability at least 1� �)

⇢(B3

) � 1

2

�s

2

mln

✓

1

�

◆

>1

4

.

Additionally, since B3

also has radius at least R3

, by Lemma D.3, the choice of B3

, and the lowerbound on m, and letting B

4

denote the ball of radius R3

,�

�

�

�

�

Z

B

c3

ud⇢

�

�

�

�

�

Z

B

c4

|u|d⇢ Z

B

c4

|u|d⇢ 1/2 and

�

�

�

�

�

Z

B

c3

ud⇢

�

�

�

�

�

< 1,

where the statement for ⇢ is with probability at least 1��. For the remainder of the proof, let ⌫ referto either ⇢ or ⇢, and discard the 2� failure probability of either of the above two events.

For convenience, define p0

:= exp(4c)/(8e), whereby

R5

=

s

2�2

ln

✓

1

p0

(2⇡�1

)

d/2

◆

.

By Lemma D.2, any Gaussian parameters ✓ = (µ,⌃) with �1

I � ⌃ � �2

I and µ 62 B6

havep✓

(x) < p0

everywhere on B3

. As such, a mixture (↵,⇥) where each ✓i

2 ⇥ satisfies theseconditions also satisfiesZ

ln

X

i

↵i

p✓i

!

d⌫ Z

B3

ln

0

@

X

(↵i,✓i)2(↵,⇥)uB6

↵i

p✓i +

X

(↵i,✓i) 62(↵,⇥)uB6

↵i

p✓i

1

A d⌫ +

Z

B

c3

ud⌫

< ln

0

@

X

(↵i,✓i)2(↵,⇥)uB6

↵i

pmax

+

X

↵i,✓i) 62(↵,⇥)uB6

↵i

p0

1

A ⌫(B3

) + 1

Suppose contradictorily that (↵,⇥) u B6

= ; orP

(↵i,✓i)2(↵,⇥)uB6↵i

< p0

/pmax

. But c 1/2

implies p0

1/2 and so ln(2p0

) 0, thus ln(2p0

)⌫(B3

) ln(2p0

)/4 which together with p0

exp(4c)/(8e) and the above display gives

Z

ln

X

i

↵i

p✓i

!

d⌫ < ln(2p0

)/4 + 1 c,

which contradicts E⌫

(�g(X; (↵,⇥))) � c.

Now that significant weight can be shown to reside in some restricted region, the outer bracket andits basic properties follow (i.e., the analog to Lemma 3.5).Lemma D.5. Let target accuracy 0 < ✏ 1, covariance bounds 0 < �

1

�2

with �1

1, targetcost c, confidence parameter � 2 (0, 1], probability measure ⇢ with order-p moment bound M withrespect to k · k

2

with p � 4, and integer 1 p0 p/2� 1. Define first the basic quantities

M 0:= 2

p

0✏,

pmax

:= (2⇡�1

)

�d/2,

R6

:= (2M | ln(pmax

)|)1/p + (2M)

1/p

+

s

2�2

✓

ln

✓

8e

(2⇡�1

)

d/2

◆

� 4c

◆

,

B6

:= {x 2 Rd

: kx� E⇢

(X)k2

R6

k}.

22

Additionally define the outer bracket elements

Z`

:=

(

(↵,⇥) : 8(↵i

, (µi

, ✓i

)) 2 (↵,⇥) ⇧ µi

2 B6

,�1

I � ⌃ � �2

I,X

i

↵i

� exp(4c)/(8epmax

)

)

,

c`

:= 4c� ln(8epmax

)� d

2

ln(2⇡�2

),

`(x) := c`

� 2

�1

kx� E⇢

(X)k22

,

u(x) := ln(pmax

),

✏⇢

:= ✏+ (|c`

|+ | ln(pmax

)|)s

1

2mln

✓

1

�

◆

+

r

M 0ep0

2m

✓

2

�

◆

1/p

0

,

M1

:= (2M |c`

|)1/p + (4M�1

)

1/(p�2)

+ max

1ip

0M1/(p�2i)

+ (M | ln(pmax

))

1/p,

RB

= R6

+M1

/✏1/(p�2p

0),

B := {x 2 Rd

: kx� E⇢

(X)k2

RB

}.The following statements hold with probability at least 1� 4� over a draw of size

m � max

�

p0/(M 0e), 8 ln(1/�), 2| ln(pmax

)|2 ln(1/�)

.

1. (u, `) is an outer bracket for ⇢ at scale ✏⇢

:= ✏ with sets B`

:= Bu

:= B, center set classZ`

as above, and Zu

= Smog(⇢;1, k,�1

,�2

). Additionally, (u, `) is also an outer bracketfor ⇢ at scale ✏

⇢

with the same sets.

2. Define

RC

:= 1 +RB

(1 +

p

8�2

/�1

) +

p

4�2

ln(1/✏) +

s

2�2

✓

ln

✓

64e2(2⇡�2

)

d

(2⇡)dp4max

◆

� 8c

◆

,

C := {µ 2 Rd

: kx� E⇢

(X)k2

RC

}.Every (↵,⇥) 2 Smog(⇢; c, k,�1

,�2

)[Smog(⇢; c, k,�1

,�2

) satisfiesP

(↵i,✓i)2(↵,⇥)uC

↵i

�exp(4c)/(8ep

max

), and�

�

�

�

Z

�g(x; (↵,⇥))d⇢(x)�Z

B

�g(x; (↵,⇥) u C)d⇢(x)

�

�

�

�

✏⇢

= 2✏

and�

�

�

�

Z

�g(x; (↵,⇥))d⇢(x)�Z

B

�g(x; (↵,⇥) u C)d⇢(x)

�

�

�

�

✏+ ✏⇢

.

Proof of Lemma D.5. It is useful to first expand the choice of RB

, which was chosen large enough tocarry a collection of other radii. In particular, since ✏ 1, then 1/✏ � 1, and therefore 1/✏a 1/✏b

when a b. As such, since p0 p/2� 1,

RB

= R6

+M1

/✏1/(p�2p

0)

= R6

+

✓

(2M |c`

|)1/p + (4M�1

)

1/(p�2)

+ max

1ip

0M1/(p�2i)

+ (M | ln(pmax

))

1/p

◆

/✏1/(p�2p

0)

� R6

+

✓

(2M |c`

|/✏)1/p + (4M�1

/✏)1/(p�2)

+ max

1ip

0(M/✏)1/(p�2i)

+ (M | ln(pmax

)|/✏)1/p◆

.

Since every term is nonnegative, RB

dominates each individual term.

1. The upper bracket and its guarantees were provided by Lemma D.3; note that ✏⇢

is definedlarge enough to include the deviations there, and similarly R

B

� (M | ln(pmax

)|/✏)1/pmeans the B here is defined large enough to contain the B

u

there; correspondingly, discarda failure event with probability mass at most �.

23

Let the lower bracket be defined as in the statement; note that its properties are much moreconservative as compared with the upper bracket. Let (↵,⇥) 2 Z

`

be given. For every✓i

= (µi

,⌃i

), kµi

� E⇢

(X)k2

R6

, whereas RB

� R6

meaning x 2 Bc implieskx� E

⇢

(X)k2

� R6

, so

kx� µi

k2

kx� E⇢

(X)k2

+ kµi

� E⇢

(X)k2

2kx� E⇢

(X)k2

,

which combined with �1

I � ⌃i

� �2

I gives

ln

X

i

↵i

p✓i(x)

!

� ln

X

i

↵i

1

(2⇡�2

)

d/2

exp

✓

� 1

2�1

kx� µi

k22

◆

!

� ln(p0

/pmax

)� d

2

ln(2⇡�2

)� 2

�1

kx� E⇢

(X)k22

= `(x),

which is the dominance property.

Next come the integral properties of `. By Lemma A.2 and since RB

� (2M |c`

|/✏)1/p,�

�

�

�

Z

B

c

c`

d⇢

�

�

�

�

Z

B

c

|c`

|d⇢ Z

B

c

|c`

|d⇢ = ⇢(Bc

)|c`

| ✏/2.

Similarly, by Hoeffding’s inequality, with probability at least 1� �,�

�

�

�

�

Z

B

c`

c`

d⇢

�

�

�

�

�

✏/2 + |c`

|s

1

2mln

✓

1

�

◆

.

Now define

`1

(x) := � 2

�1

kx� E⇢

(X)k22

= `(x)� c`

.

By Lemma A.6 and since RB

� (4�1

M/✏)1/(p�2),�

�

�

�

Z

B

c

`1

d⇢

�

�

�

�

Z

B

c

|`1

|d⇢ =

2

�1

Z

B

c

kx� E⇢

(X)k22

d⇢(x) ✏/2.

Furthermore by Lemma A.7 and the above estimate, and since RB

�max

1ip

0(M/✏)1/(p�2i) (where the maximum is attained at one of the endpoints),

then with probability at least 1� �

�

�

�

�

Z

B

c

`1

d⇢

�

�

�

�

✏

2

+

r

M 0ep0

2m

✓

2

�

◆

1/p

0

.

Unioning together the above failure probabilities, the general controls for ` = c`

+`1

followby the triangle inequality and definition of ✏

⇢

.

2. Throughout the following, let ⌫ denote either ⇢ or ⇢, and correspondingly let ✏⌫

respec-tively refer to ✏

⇢

or ✏⇢

; let the above bracketing properties hold throughout (with eventsappropriately discarded for ⇢). Furthermore, for convenience, define

p0

:= exp(4c)/(8e).

Let any (↵,⇥) be given with (↵,⇥) 2 Smog(⇢; c, k,�1

,�2

) [ Smog(⇢; c, k,�1

,�2

). Definethe two index sets

IC

:= {i 2 [k] : (↵i

, ✓i

) 2 (↵,⇥) u C},I6

:= {i 2 [k] : (↵i

, ✓i

) 2 (↵,⇥) uB6

}.

By Lemma D.4, with probability at least 1 � �,P

i2I6↵i

� p0

/pmax

; henceforth discardthe corresponding failure event, bringing the total discarded probability mass to 4�.

24

To start, since ln(·) is concave and thus ln(a+ b) ln(a) + b/a for any positive a, b,Z

ln

X

i

↵i

p✓i(x)

!

d⌫(x) Z

B

ln

X

i

↵i

p✓i(x)

!

d⌫(x) +

Z

B

c

u(x)d⌫(x)

Z

B

ln

X

i2IC

↵i

p✓i(x)

!

d⌫(x) +

Z

B

P

i 62IC↵i

p✓i(x)

P

i2IC↵i

p✓i(x)

d⌫(x) + ✏⌫

.

In order to control the fraction, both the numerator and denominator will be uniformlycontrolled for every x 2 B, whereby the result follows since ⌫ is a probability measure(i.e., the integral is upper bounded with an upper bound on the numerator times ⌫(B) 1

divided by a lower bound on the denominator).

For the purposes of controlling this fraction, define

p1

:=

1

(2⇡�2

)

d/2

exp

✓

�R2

B

+R2

6

�

◆

,

p2 := ✏p1p0/pmax

,

Observe, by choice of RC

and since �1

1, that

RB

+

v

u

u

t

2�2

ln

1

p22(2⇡)d�d�1

1

!

RB

+

s

2�2

ln

✓

64e2p2max

(2⇡�2

)

d

exp(2(R2

B

+R2

6

))

✏2 exp(8c)(2⇡)d�d

1

◆

RB

+

s

2�2

✓

ln

✓

64e2(2⇡�2

)

d

✏2(2⇡)dp4max

◆

� 8c� 4R2

B

/�

◆

RB

+

s

2�2

✓

ln

✓

64e2(2⇡�2

)

d

(2⇡)dp4max

◆

� 8c

◆

+

p

4�2

ln(1/✏) +RB

p

8�2

/�1

RC

.

For the denominator, first note for every x 2 B and parameters ✓ = (µ,⌃) with �1

I �⌃ � �

2

I and µ 2 B6

that

p✓

(x) � 1

(2⇡�2

)

d/2

exp

✓

� 1

2�1

kx� µk22

◆

� 1

(2⇡�2

)

d/2

exp

✓

� 1

2�1

(kx� E⇢

(X)k2

+ kE⇢

(X)� µi

k2

)

2

◆

� p1.

Consequently, for x 2 B,X

i2IC

↵i

pi

(x) �X

i2I6

↵i

pi

(x) � p1X

i2I6

↵i

� p1p0/pmax

.

For the numerator, by choice of C (as developed above with the definitions of p1

and p2

)and an application of Lemma D.2, for p

i

corresponding to i 62 IC

,

pi

(x) ✏p1

p0

/pmax

= p2.

It follows that the fractional term is at most ✏, which gives the first direction of the desiredinequality.

To get the other direction, sinceP

i2I6↵i

� p0

/pmax

due to Lemma D.4 as discussedabove, it follows that (↵,⇥) u B

6

2 Z`

, meaning the corresponding partial Gaussianmixture can be controlled by `. As such, since R

6

RB

thus I6

✓ IC

, and since ln is

25

nondecreasing,Z

B

ln

X

i2IC

↵i

pi

!

d⌫ =

Z

ln

X

i2IC

↵i

pi

!

d⌫ �Z

B

c

ln

X

i2IC

↵i

pi

!

d⌫

Z

ln

X

i2IC

↵i

pi

!

d⌫ �Z

B

c

ln

X

i2I6

↵i

pi

!

d⌫

Z

ln

X

i2IC

↵i

pi

!

d⌫ �Z

B

c

`d⌫

Z

ln

X

i2IC

↵i

pi

!

d⌫ + ✏⌫

Z

ln

X

i

↵i

pi

!

d⌫ + ✏⌫

.

D.2 Uniform Covering of Gaussian Mixtures

First, a helper lemma for covering covariance matrices.Lemma D.6. Let scale ✏ > 0 and eigenvalue bounds 0 < �

1

�2

be given. There exists a subsetM of the positive definite matrices satisfying �

1

I � M � �2

I so that

|M| (1 + 32�2

/✏)d2

✓

1 +

�2

� �1

✏/2

◆

d

+

✓

ln(�2

/�1

)

✏/d

◆

d

!

,

and for any A with �1

I � A � �2

I , there exists B 2 M with

exp(�✏) |A||B| exp(✏) and kA�Bk

2

✏.

Proof. The mechanism of the proof is to separately cover the set of orthogonal matrices and the setof possible eigenvalues; this directly leads to the determinant control, and after some algebra, thespectral norm control follows as well.

With foresight, set the scales

⌧ := ✏/(8�2

),

⌧ 0 := ✏/2,

⌧ 00 := exp(✏/d).

First, a cover of the orthogonal d⇥d matrices at scale ⌧ is constructed as follows. The entries of theseorthogonal matrices are within [�1,+1], thus first construct a cover Q0 of all matrices [�1,+1]

d⇥d

at scale ⌧/2 according to the maximum-norm, which simply measures the max among entrywisedifferences; this cover can be constructed by gridding each coordinate at scale ⌧/2, and thus

|Q0| (1 + 4/⌧)d2

.

Now, to produce a cover of the orthogonal matrices, for each M 0 2 Q0, if it is within max-normdistance ⌧/2 of some orthogonal matrix M , include M in the new cover Q; otherwise, ignore M 0.Since Q0 was a max-norm cover of [�1,+1]

d⇥d at scale ⌧/2, then Q must be a max-norm cover ofthe orthogonal matrices at scale ⌧ (by the triangle inequality), and it still holds that

|Q| (1 + 4/⌧)d2

.

Since the max-norm is dominated by the spectral norm, for any orthogonal matrix O, there existsQ 2 M with kO �Qk

2

⌧ .

26

Second, a cover of the set of possible eigenvalues is constructed as follows; since both a mul-tiplicative and an additive guarantee are needed for the eigenvalues, two covers will be unionedtogether. First, produce a cover L

1

of the set [�1

,�2

]

d at scale ⌧ 0 entrywise as usual, which means|L

1

| (1 + (�2

� �1

)/⌧ 0)d. Second, the cover L2

will cover each coordinate multiplicatively,meaning each coordinate cover consists of �

1

,�1

⌧ 00,�1

(⌧ 00)2, and so on; consequently, this coverhas size |L

2

| ln(�2

/�1

)/ ln(⌧ 00). Together, the cover L := L1

[ L2

has size

|L| ✓

1 +

�2

� �1

⌧ 0

◆

d

+

✓

ln(�2

/�1

)

ln(⌧ 00)

◆

d

,

and for any vector ⇤ 2 [�1

,�2

]

d, there exists ⇤0 2 L with

1

⌧ 00 max

i

⇤

0i

/⇤i

⌧ 00 and max

i

|⇤0i

� ⇤

i

| ⌧.

Note there was redundancy in this construction: L need only contain nondecreasing sequences.

The final cover M is thus the cross product of Q and L, and correspondingly its size is the productof their sizes. Given any A with �

1

I � A � �2

I with spectral decomposition O>1

⇤

1

O1

, pick acorresponding O

2

2 Q which is closest to O1

in spectral norm, and ⇤

2

2 L which is closest to ⇤

1

in max-norm, and set B = O>2

⇤

2

O2

. By the multiplicative guarantee on L, it follows that✓

1

⌧ 00

◆

d

|⇤2

||⇤

1

| =|B||A| (⌧ 00)d;

by the choice of ⌧ 00, the determinant guarantee follows. Secondly, relying on a few properties ofspectral norms (kXY k

2

kXk2

kY k2

for square matrices, and kZk2

= 1 for orthogonal matrices,and of course the triangle inequality),

kA�Bk2

=

�

�

(O1

�O2

+O2

)

>⇤

1

(O1

�O2

+O2

)

> �O>2

⇤

2

O2

�

�

2

kO>2

⇤

1

O2

�O>2

⇤

2

O2

k2

+ 2kO>2

⇤

1

(O1

�O2

)k2

+ k(O1

�O2

)

>⇤

1

(O1

�O2

)k2

k⇤1

� ⇤

2

k2

+ 2kO1

�O2

k2

k⇤1

k2

+ kO1

�O2

k2

k⇤1

k2

(kO1

k2

+ kO2

k2

)

⌧ 0 + 4⌧�2

,

and the second guarantee follows by choice of ⌧ and ⌧ 0.

The covering lemma is as follows.Lemma D.7. Let scale ✏ > 0, ball B := {x 2 Rd

: kx� E(X)k R}, mean set X := {x 2 Rd

:

kx � E(X)k R2

}, covariance eigenvalue bounds 0 < �1

�2

, mass lower bound c1

> 0, andnumber of mixtures k > 0 be given. Then there exists a cover set N (where (µ,⌃) 2 N has µ 2 Xand �

1

I � ⌃ � �2

I) of size

|N |

✓

ln(1/↵0

)

ln(⌧0

)

+

1� ↵0

⌧4

◆

·✓

1 +

2R2

d

⌧1

◆

d

· (1 + 32/(�1

⌧2

))

d

2

✓

1 +

��1

1

� ��1

2

⌧2

/2

◆

d

+

✓

ln(�2

/�1

)

⌧2

/d

◆

d

!!

k

where

⌧0

:= exp(✏/4),

⌧1

:= min

⇢

✏�1

16(R+R2

)

,

r

✏�1

8

�

,

⌧2

:=

✏

4max{1, (R+R2

)

2} ,

pmin

:=

1

(2⇡�2

)

d/2

exp(�(R+R2

)

2/(2�1

)),

pmax

:= (2⇡�1

)

�d/2,

↵0

:=

✏c1

pmin

4k(pmax

+ ✏pmin

/2),

⌧4

:= ↵0

,

27

(whereby pmin

p✓

(x) pmax

for x 2 B and ✓ = (µ,⌃) satisfies µ 2 X and �1

I � ⌃ � �2

I ,)so that for every partial Gaussian mixture (↵,⇥) = {(↵

i

, µi

,⌃i

)} with ↵i

� 0, c1

P

i

↵i

1,µi

2 X , and �1

I � ⌃i

� �2

I there is an element (↵0,⇥0) 2 N with weights c

1

�k↵0

P

i

↵0i

1

so that, for every x 2 B,| ln(p

↵,⇥

(x))� ln(p↵

0,⇥

0(x))| ✏.

Proof. The proof controls components in two different ways. For those where the weight ↵i

isnot too small, both ↵

i

and p✓i are closely (multiplicatively) approximated. When ↵

i

is small, itscontribution can be discarded. Between these two cases, the bound follows.

Note briefly that for any ✓ = (µ,⌃) with µ 2 X and �1

I � ⌃ � �2

I ,

p✓

(x) 1

(2⇡�1

)

d/2

exp( 0 ) = pmax

,

p✓

(x) � 1

(2⇡�2

)

d/2

exp(�kx� µk22

/(2�1

))

� 1

(2⇡�2

)

d/2

exp(�(kx� E⇢

(X)k2

+ kµ� E⇢

(X)k)2/(2�1

))

= pmin

.

Next, the covers of each element of the Gaussian mixture are as follows.

1. Union together a multiplicatively grid of [↵0

, 1] at scale ⌧0

(meaning produce a sequenceof the form ↵

0

,↵0

⌧0

,↵0

⌧20

, and so on), and an additive grid of [↵0

, 1] at scale ⌧4

; together,the grid has a size of at most

ln(1/↵0

)

ln(⌧0

)

+

1� ↵0

⌧4

.

2. Grid the candidate center set X at scale ⌧1

, which by Lemma B.5 can be done with size atmost

✓

1 +

2R2

d

⌧1

◆

d

.

3. Lastly, grid the inverse of covariance matrices (sometimes called precision matrices), mean-ing ��1

2

I � ⌃�1 � ��1

1

, whereby Lemma D.6 grants that a cover of size

(1 + 32/(�1

⌧2

))

d

2

✓

��1

1

� ��1

2

⌧2

/2

◆

d

+

✓

ln(�1

/�2

)

⌧2

/d

◆

d

!

suffices to provide that for any permissible ⌃�1, there exists a cover element A with

exp(�⌧2

) |⌃�1||A| exp(⌧

2

) and k⌃�1 �Ak2

⌧2

.

Producting the size of these various covers and raising to the power k (to handle at most k compo-nents), the cover size in the statement is met.

Now consider a component (↵i

, µi

,⌃i

) with ↵i

� ↵0

; a relevant cover element (ai

, ci

, Bi

) is chosenas follows.

1. Choose the largest ai

↵i

in the gridding of [↵0

, 1], whereby it follows thatP

i

ai

P

i

↵i

1, and also

⌧�1

0

ai

/↵i

⌧0

and ai

� ↵i

� ⌧4

.

Thanks to the second property,

X

↵i�↵0

ai

�

0

@

X

↵i�↵0

↵i

1

A� k⌧4

.

28

2. Choose ci

in the grid on X so that kµi

� ci

k ⌧1

.

3. Choose covariance Bi

so that

exp(�⌧2

) |Bi

||⌃

i

| exp(⌧2

) and k⌃�1 �B�1

i

k2

⌧2

.

The first property directly controls for the determinant term in the Gaussian density. Tocontrol the Mahalanobis term, note that the above display, combined with kµ

i

� ci

k ⌧1

,gives, for every x 2 B,

�

�

(x� µi

)

>⌃�1

i

(x� µi

)� (x� ci

)

>B�1

i

(x� ci

)

�

�

=

�

�

(x� µi

)

>⌃�1

i

(x� µi

)� (x� ci

)

>(B�1

i

�⌃�1

i

+ ⌃

�1

i

)(x� ci

)

�

�

�

�

(x� µi

)

>⌃�1

i

(x� µi

)� (x� ci

)

>⌃�1

i

(x� ci

)

�

�

+ kx� ci

k22

kB�1

i

�⌃�1

i

k2

�

�

(x� µi

)

>⌃�1

i

(x� µi

)� (x� ci

)

>⌃�1

i

(x� ci

)

�

�

+ (R+R2

)

2⌧2

�

�

(x� µi

)

>⌃�1

i

(x� µi

)� (x� ci

)

>⌃�1

i

(x� ci

)

�

�

+ ✏/4.

Continuing with the (still uncontrolled) first term,�

�

(x� µi

)

>⌃�1

i

(x� µi

)� (x� ci

)

>⌃�1

i

(x� ci

)

�

�

=

�

�

(x� µi

)

>⌃�1

i

(x� µi

)� (x� µi

+ µi

� ci

)

>⌃�1

i

(x� µi

+ µi

� ci

)

�

�

2kx� µi

k2

kµi

� ci

k2

k⌃�1k2

+ kµi

� ci

k22

k⌃�1k2

2(R+R2

)⌧1

/�1

+ ⌧21

/�1

✏/4.

Combining these various controls with the choices of scale parameters, for some provided probabil-ity ↵

i

pi

and cover element probability ai

p0i

, it follows for x 2 B that

exp(�3✏/4) ↵i

pi

(x)

ai

p0i

(x) exp(3✏/4).

Lastly, when ↵i

< ↵0

, simply do not bother to exhibit a cover element.

To show | ln(p↵,⇥

(x))� ln(p↵

0,⇥

0(x))| ✏, consider the two directions separately as follows.

1. Given the various constructions above, since ln is nondecreasing,

ln

X

i

ai

p✓

0i(x)

!

ln

0

@

X

↵i�↵0

↵i

p✓i(x) exp(3✏/4) +

X

↵i<↵0

↵i

p✓i(x)

1

A

ln

X

i

↵i

p✓i(x)

!

+

3✏

4

.

2. On the other hand,

ln

X

i

↵i

p✓i(x)

!

= ln

0

@

X

↵i�↵0

↵i

p✓i(x) +

X

↵i<↵0

↵i

p✓i(x)

1

A

ln

0

@

X

↵i�↵0

ai

p✓

0i(x) exp(3✏/4) + k↵

0

pmax

1

A

= ln

(1 + ✏/4)X

↵i�↵0

ai

p✓

0i(x) exp(3✏/4)

� ✏/4X

↵i�↵0

ai

p✓

0i(x) exp(3✏/4) + k↵

0

pmax

!

.

29

But sinceP

i

ai

� c1

� k(⌧4

+ ↵0

),

�✏/4X

↵i�↵0

ai

p✓

0i(x) exp(3✏/4) �(✏/4)(c

1

� k(⌧4

+ ↵0

))pmin

+ k↵0

pmax

0.

As such, since (1 + ✏/4) exp(✏/4), the result follows in this case as well.

D.3 Proof of Theorem 5.1

Proof of Theorem 5.1. This proof is based on the proof of Theorem 3.2. Let the various quantitiesin Lemma D.5 be given; in particular, let balls B,C and their radii R

B

, RC

be as provided there.Additionally, define p

0

:= exp(4c)/8e for convenience. Near the end of the proof, the choicesp0 = p/4 and ✏ := m�1/2+1/p will be made.

By Lemma D.7, let N be a cover of the set C, with all parameters having the same names as thosehere, except the R there is the radius R

B

here, and R2

there is radius RC

here, the lower boundc1

is p0

/pmax

. By the construction of the cover there, every set of partial Gaussian parameters(↵,⇥) 2 C with

P

↵i↵i

� c1

= p0

/pmax

and cardinality at most k has a cover element Q 2 Nwith

sup

x2B

|�g(x; (↵,⇥))� �g(x;Q)| ✏; (D.8)

note that Lemma D.7 also provides the stated estimate of the size. Next, note for x 2 B and everycover element Q 2 N that Lemma D.7 provides

ln((c1

� k↵0

)pmin

) pQ

(x) ln(pmax

)

where c1

= p0

/pmax

as above and

↵0

=

✏c1

pmin

4k(pmax

+ ✏pmin

/2) ✏c

1

pmin

4kpmax

,

which combined with ✏ 2 and pmin

pmax

gives

c1

� k↵0

� c1

✓

1� ✏pmin

4pmax

◆

� c1

2

.

Thus, by Hoeffding’s inequality,

sup

Q2N

�

�

�

�

Z

B

�g(x;Q)d⇢(x)�Z

B

�g(x;Q)d⇢(x)

�

�

�

�

ln

✓

pmax

pmin

(c1

� k↵0

)

◆

s

1

2mln

✓

2|N |�

◆

ln

✓

2p2max

pmin

p0

◆

s

1

2mln

✓

2|N |�

◆

. (D.9)


To further simplify eq. (D.9), note firstly that

ln

✓

1

pmin

◆

= ln

⇣

(2⇡�2

)

d/2

exp((RB

+RC

)

2/(2�1

))

⌘

= ln((2⇡�2

)

d/2

) + 2R2

C

/�1

,

where

R2

C

3R2

B

(1 +

p

8�2

/�1

)

2

+ 12�2

ln(1/✏) + 6�2

✓

ln

✓

64e2(2⇡�2

)

d

(2⇡)dp4max

◆

� 8c

◆

andR2

B

2R2

6

+M2

1

/✏2/(p�2p

0).

30

Next, to control |N |, the scale term ⌧ = min{⌧1

, ⌧2

} must first be controlled. Since ✏ 1 and�1

1 and RC

� 1,

⌧ � ✏�1

16(RB

+RC

)

2

� ✏�1

64R2

C

,

and thus

ln

⇣ ✏

⌧

⌘

ln(64R2

C

/�1

).

Together with ⌧0

= exp(✏/4) and ↵0

� ✏c1

pmin

/(8kpmax

) = p0

pmin

/(8kp2max

), and letting O(·)swallow terms depending only on numerical constants, c, �

1

, and �2

, but in particular not touchingterms depending on ✏, d, k or m or �,

ln(|N |) ln

0

@

✓

5

✏

✓

8kp2max

✏p0

pmin

◆◆✓

3RC

d

⌧

◆

d

✓

33

�1

⌧

◆

d

2 ✓

��1

1

⌧/2

◆

d

+

✓

ln(�2

/�1

)

⌧/2

◆

d

!!

k

1

A

3d2k (5 ln(1/✏) + ln(1/pmin

) + 3 ln(✏/⌧) + ln(RC

) +O(1))

3d2k�

5 ln(1/✏) + 2R2

C

/�1

+ 3 ln(✏/⌧) + 4 ln(RC

) +O(1)

�

= O⇣

d2k(ln(1/✏) + ✏�2/(p�2p

0)

⌘

.

Together, the full expression in eq. (D.9) may be simplified down to

sup

Q2N

�

�

�

�

Z

B


B

�g(x;Q)d⇢(x)

�

�

�

�

O

poly(d, k)✓

1

✏

◆

2/(p�2p

0)

r

(ln(1/✏) + (1/✏)2/(p�2p

0)

+ ln(1/�))

m

!

O

poly(d, k)

✏�3/(p�2p

0)

pm

+

r

(ln(1/✏) + ln(1/�))

m

!!

O

poly(d, k)

m�1/2+3/p

+

r

(ln(m) + ln(1/�))

m

!!

(D.10)

where the final step used the choice p0 = p/4 and ✏ := m�1/2+1/p.

Now let any (↵,⇥) 2 Smog(⇢; c, k,�1

,�2

) [ Smog(⇢; c, k,�1

,�2

) be given, and let Q 2 N be acover element satisfying eq. (D.8) for (↵, ✓)uC. By eq. (D.8), eq. (D.9), and Lemma D.5 (and thusdiscarding an additional failure event having probability 4�),�

�

�

�

Z

�g(x; (↵,⇥))d⇢(x)�Z

�g(x; (↵,⇥))d⇢(x)

�

�

�

�

�

�

�

�

Z

�g(x; (↵,⇥))d⇢(x)�Z

B

�g(x; (↵,⇥) u C)d⇢(x)

�

�

�

�

+

�

�

�

�

Z

B

�g(x; (↵,⇥) u C)d⇢(x)�Z

B

�g(x;Q)d⇢(x)

�

�

�

�

+

�

�

�

�

Z

B


B

�g(x;Q)d⇢(x)

�

�

�

�

+

�

�

�

�

Z

B


B

�g(x; (↵,⇥) u C)d⇢(x)

�

�

�

�

+

�

�

�

�

Z

B

�g(x; (↵,⇥) u C)d⇢(x)�Z

�g(x; (↵,⇥))d⇢(x)

�

�

�

�

4✏+ ln

✓

2p2max

pmin

p0

◆

s

1

2mln

✓

2|N |�

◆

+ ✏⇢

+ ✏⇢

,

= poly(d,k)O(m�1/2+3/p

⇣

1 +

p

ln(m) + ln(1/�) + (1/�)4/p⌘

,

where the final step uses the above simplification of the cover term, the choices ✏ = m�1/2+1/p andp0 = p/4, and additionally unwrapping the forms of ✏

⇢

and ✏⇢

from Lemma D.5.

31

moment-based uniform deviation bounds for k...

Documents