9 VC Sets
(source: Yale University, pollard/books/mini/vcsets.pdf · 2016. 4. 3.)

9.1 An introductory example .................... 2
9.2 VC-dimension ............................... 5
9.3 Other ways to think about VC-classes ....... 9
9.4 A combinatorial inequality ................. 13
9.5 *Patterns as subgraphs of the discrete cube  15
9.6 *Haussler's improvement of the packing bound 18
9.7 Problems ................................... 22
9.8 Notes ...................................... 25

Printed: 31 March 2016; version: 26july13; Mini-empirical © David Pollard



Chapter 9

VC Sets

Section 9.1 presents a simple concrete example to illustrate the use of the combinatorial method for deriving bounds on packing numbers.

Section 9.2 defines the concept of a VC-class of sets, then presents the basic polynomial bound for the number of subsets that can be picked out by a VC-class. It also establishes Dudley's inequality, which bounds L1 packing numbers for VC-classes of sets.

Section 9.3 explains how the subsets picked out by a VC-class can also be thought of as a subset of the vertices of the discrete unit cube, or as the columns of a binary matrix.

Section 9.4 derives a general combinatorial bound for the shatter dimension of a binary matrix, a result often called the VC Lemma or Sauer's Lemma. The proof uses the important technique known as downshifting.

*Section 9.5 interprets subsets of the discrete unit cube as graphs. The shatter dimension gives a surprising upper bound for the number of edges of the graph. The method of proof again relies on downshifting.

*Section 9.6 presents Haussler's refinement of the calculation from Section 9.2, leading to a sharper bound for packing numbers.



9.1 An introductory example

For my purposes, the combinatorial argument often referred to as the VC method—in honor of the important contributions of Vapnik and Červonenkis (1971, 1981)—is of use mainly as a step towards calculation of covering numbers for classes of sets or functions equipped with various Lp metrics. The method is elegant and leads to results not easily obtained in other ways. The basic calculations occupy only a few pages. Nevertheless, the ideas are subtle enough to appear beyond the comfortable reach of many would-be users. With that fact in mind, I offer a more concrete preliminary example, in the hope that the combinatorial ideas might seem less mysterious.

Consider the set H of all closed half-spaces in R². Let F = {x1, . . . , xn} be a set of n points in R². How many distinct subsets can H pick out from F? That is, how large can the cardinality of HF := {F ∩ H : H ∈ H} be? Certainly it can be no larger than 2^n because F has only that many subsets. In fact a simple argument shows that

#HF := cardinality of HF ≤ p(n) := 1 + 4n(n − 1).

[Figure: a half-space H0 with boundary line L0, and a smaller half-space H1 whose boundary L1 runs parallel to L0.]

Indeed, consider a particular nonempty subset F0 of F picked out by a particular half-space H0. There is no loss of generality in assuming that at least one point of F0 lies on L0, the boundary of H0: otherwise we could replace H0 by a smaller H1 whose boundary, L1, runs parallel to L0 through the point x0 of F0 that is closest to L0.

As seen from x0, the other n − 1 points of F all lie on a set L(x0) of at most n − 1 lines through x0. Augment L(x0) by another set L′(x0) of at most n − 1 lines through x0, one in each angle between two lines from L(x0). The lines in H(x0) := L(x0) ∪ L′(x0) define a collection of at most 4(n − 1) closed half-spaces, each with x0 on its boundary. The collection ∪_{x∈F} H(x) accounts for all possible nonempty subsets of F picked out by closed half-spaces. The extra 1 takes care of the empty set.

Draft: 26july13 © David Pollard


Remark. Apparently the bound can be reduced to n² − n + 2, which is sharp in the sense that it is achieved by some configurations of n points in R². See the discussion by Dudley (1978, page 921) and the original sources that he cited. For my purposes the sharp bound is not needed. Indeed any polynomial would suffice for the consequences described below.
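The counting argument above can be checked numerically. The following sketch (the helper name `halfspace_subsets` is mine, not the text's) collects subsets picked out by closed half-spaces {x : u·x ≤ c} by sweeping a grid of directions u and taking every prefix of the projection order; a finite grid can only undercount, so the comparison with p(n) = 1 + 4n(n − 1) remains valid.

```python
import math
import random

def halfspace_subsets(points, n_angles=20000):
    """Subsets of `points` of the form {i : u . x_i <= c}, collected by
    sweeping a grid of directions u; every prefix of the sorted
    projections is such a subset.  A grid sweep gives a lower bound on
    the true number of subsets picked out by closed half-spaces."""
    n = len(points)
    seen = {frozenset()}          # the empty set is always picked out
    for k in range(n_angles):
        theta = 2 * math.pi * k / n_angles
        u = (math.cos(theta), math.sin(theta))
        order = sorted(range(n),
                       key=lambda i: u[0] * points[i][0] + u[1] * points[i][1])
        for m in range(1, n + 1):
            seen.add(frozenset(order[:m]))
    return seen

random.seed(0)
n = 8
pts = [(random.random(), random.random()) for _ in range(n)]
count = len(halfspace_subsets(pts))
p = 1 + 4 * n * (n - 1)           # p(8) = 225, far below 2^8 = 256
assert count <= p
```

The count stays well below p(8) = 225 for random configurations, consistent with the sharp bound n² − n + 2 mentioned in the Remark.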

The slow increase in #HF, at an O(n²) rate rather than a rapid 2^n rate, has a useful consequence for the packing numbers of H when it is equipped with an L1(P) metric. The L1(P) distance between two Borel sets B and B′ is defined as P|B − B′| = P(B∆B′), the probability measure of the symmetric difference. The two sets are said to be ε-separated (in L1(P)) if P(B∆B′) > ε. As in Chapter 4, the packing number pack(ε, H, P)—or pack(ε, H, L1(P)) if there is any ambiguity about the norm being used—is defined as the largest N for which there exists a collection of N closed half-spaces, each pair ε-separated.

Example. Here is an argument, due to Dudley (1978, Lemma 7.13), to show that the packing numbers pack(ε, H, P) are bounded uniformly over P by a polynomial in 1/ε, for 0 < ε ≤ 1. The result is surprising because it makes no regularity assumptions about the probability measure P.

Suppose the half-spaces H1, H2, . . . , HN are ε-separated in L1(P). By means of a cunningly chosen F, the polynomial bound for #HF will lead to an upper bound for N. The trick is to find a set F of m = ⌈2 log N/ε⌉ points from which each Hi picks out a different subset. Then H will pick out at least N subsets from the m points, implying that

N ≤ p(m) ≤ 1 + 4(1 + 2 log N/ε)(2 log N/ε) ≤ 9(2 log N/ε)².

Bounding the log N by a constant multiple of N^{1/4} and solving the resulting inequality for N, we get an upper bound N ≤ O(1/ε)⁴. With a smaller power in the bound for log N we would bring the power of 1/ε arbitrarily close to 2.

Remark. As shown in a later Lemma, the bound for the packing numbers for H can be sharpened to C(ε⁻¹ log(1/ε))², with C a universal constant, at least for ε bounded away from 1. With a lot more work, even the log term can be removed—see Section 9.6. At this stage there is little point in struggling to get the best bound in ε. For many applications, the qualitative consequences of a polynomial bound in 1/ε are the same, no matter what the degree of the polynomial.


How do we find a set F0 = {x1, . . . , xm} of points in R² from which each Hi picks out a different subset? We need to place at least one point of F0 in each of the \binom{N}{2} symmetric differences Hi∆Hj. It might seem we are faced with a delicate task involving consideration of all possible configurations of the symmetric differences, but here probability theory comes to the rescue.

As described in the Preface of the wonderful little book by Alon and Spencer (2000), the Probabilistic Method proves existence by artificially introducing a probability measure into a problem:

In order to prove the existence of a combinatorial structure with certain properties, we construct an appropriate probability space and show that a randomly chosen element in this space has the desired properties with positive probability. This method was initiated by Paul Erdős, who contributed so much to its development over the last fifty years, that it seems appropriate to call it "The Erdős Method."

Generate F0 as a random sample of size m from P. If m ≥ 2 log N/ε, then there is a strictly positive probability that the sample has the desired property. Indeed, for fixed i ≠ j,

P{Hi and Hj pick out same points from F0}
  = P{no points of sample in Hi∆Hj}
  = (1 − P(Hi∆Hj))^m
  ≤ (1 − ε)^m
  ≤ exp(−mε).

Add up \binom{N}{2} such probability bounds to get a conservative estimate,

P{some pair Hi, Hj pick same subset from F0} ≤ \binom{N}{2} exp(−mε).

When m ≥ 2 log N/ε the last bound is strictly less than 1, as desired. □
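The final step is easy to confirm numerically. A minimal sketch (the helper name `pair_collision_bound` is mine) computes the union bound \binom{N}{2} exp(−mε) with m = ⌈2 log N/ε⌉ and checks that it is below 1, so a good sample configuration must exist:

```python
import math

def pair_collision_bound(N, eps):
    """Return (m, bound) where m = ceil(2 log N / eps) and bound is the
    union bound C(N,2) * exp(-m * eps) on the probability that some pair
    of eps-separated sets picks out the same subset of the sample."""
    m = math.ceil(2 * math.log(N) / eps)
    return m, math.comb(N, 2) * math.exp(-m * eps)

# m * eps >= 2 log N, so the bound is at most N^2/2 * N^{-2} = 1/2 < 1
for N in (10, 100, 10_000):
    for eps in (0.5, 0.1, 0.01):
        m, bound = pair_collision_bound(N, eps)
        assert bound < 1
```

Since mε ≥ 2 log N, the bound never exceeds (N²/2)·N⁻² = 1/2, which is the "strictly less than 1" conclusion of the proof.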

Remark. The argument in the second paragraph of the Example implicitly assumed that the xi's sampled from P are all distinct, which need not be true if P has atoms. To be more precise I could have written N ≤ p(#F0) ≤ p(m) ≤ . . . . The added rigor is hardly worth the trouble for that Example. In general, ties can cause awkward notational difficulties. Section 9.4 introduces a way to avoid such difficulties.


Notice how probability theory has been used to prove an existence result, which gives a bound for a packing number, which will be used to derive probabilistic consequences—all based ultimately on the existence of the polynomial bound for #HF.

    For small F , the class H might pick out all subsets. The establishedterm for this property is shattering. For example, if F consists of 3 points,not all on the same straight line, then it can be shattered by H. Howeverno set of 8 points can be shattered, because there are 28 = 256 possiblesubsets—the empty set included—whereas the half-spaces can pick out atmost p(8) = 225 subsets. More generally, 2n > p(n) for all n ≥ 8, so that (ofcourse) no set of more than 8 points can be shattered by H. You should findit is easy to improve on the 8, by arguing directly that no set of 4 points canbe shattered by H. Indeed, if H picks out both F1 and F2 = F

    c1 , then the

    convex hulls of F1 and F2 must be disjoint. You have only to demonstratethat from every F with at least 4 points, you can find such F1 and F2 whoseconvex hulls overlap. See Problem [1] for details.In summary: The size of the largest set shattered by H is 3. Note wellthat the assertion is not that all sets of 3 points can be shattered, but merelythat there is some set of 3 points that is shattered, while no set of 4 pointscan be shattered. In the terminology of the next Section, the class H wouldbe said to have VC dimension equal to 3.

9.2 VC-dimension

The argument in the previous Section had little to do with the choice of H as a class of half-spaces in a particular Euclidean space. It would apply to any class D of sets for which #DF (the analog of #HF) is bounded by a fixed polynomial in #F. Unfortunately, for more complicated D's it can sometimes be difficult to derive such a polynomial bound directly but easier to prove existence of a finite d such that no set of more than d points can be shattered by D. Such a D is called a VC-class of sets in honor of Vapnik & Červonenkis.

Definition. A class D of subsets of a set X is said to have VC-dimension d (written VCdim(D) = d) if both the following conditions hold.

(i) There exists at least one subset F0 of X for which

#{F0 ∩ D : D ∈ D} = 2^d.


(ii) For every finite subset F of X with #F > d,

#{F ∩ D : D ∈ D} < 2^{#F}.

Remark. Some authors use the term shatter dimension instead of VC-dimension. Motivated by the extension of the concept to classes of functions, and beyond, I feel a better name would be surround dimension. In fact that is the term I use in the next Chapter.

Miraculously, #DF is bounded above by a polynomial in #F for each VC-class D. The key result is often called the VC Lemma or Sauer's Lemma, although credit should be spread more widely. (See the Notes in Section 9.8.)

Theorem. If D is a VC-class of subsets of X with VCdim(D) ≤ d then, for each finite subset F of X, the cardinality of the set DF := {F ∩ D : D ∈ D} is at most

β(m, d) := \binom{m}{0} + \binom{m}{1} + · · · + \binom{m}{d} where m = #F.

This bound follows immediately from the Theorem in Section 9.4, which treats an analogous (more streamlined) version of the result for matrices of zeros and ones.

Remark. When applied to the H from Section 9.1, the Theorem gives a bound of order O(m³) rather than the O(m²) derived by direct arguments. The β(m, d) bound is not sharp. Nevertheless it does establish a remarkable general fact: the failure to shatter for every subset F0 with cardinality d forces a rate of growth for #DF that increases as a polynomial in m = #F, a rate much slower than the worst case 2^m. A rate between polynomial and 2^m, such as 2^{√m} or even 2^{m/2}, is not possible.

Clearly β(m, d) is a polynomial of degree d in m. More precisely, it equals 2^m if m ≤ d and, if m ≥ d,

β(m, d) = \sum_{k=0}^{d} m(m − 1) · · · (m − k + 1)/k!
        ≤ \sum_{k=0}^{d} (m^k/d^k)(d^k/k!)
        ≤ (em/d)^d,

because (m/d)^k ≤ (m/d)^d when m ≥ d and \sum_{k=0}^{d} d^k/k! ≤ e^d.

The constant (e/d)^d in the last display can be improved slightly:

β(m, d) ≤ 1.5 m^d/d!  if m ≥ d + 2,
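Both polynomial bounds on β(m, d) are easy to verify by direct computation; a minimal sketch (the helper name `beta` mirrors the text's β):

```python
import math

def beta(m, d):
    """beta(m, d) = C(m, 0) + C(m, 1) + ... + C(m, d)."""
    return sum(math.comb(m, k) for k in range(d + 1))

# beta(m, d) equals 2^m when m <= d
assert beta(3, 5) == 2 ** 3

# the (em/d)^d bound, valid for m >= d
for d in (1, 2, 5, 10):
    for m in range(d, 50):
        assert beta(m, d) <= (math.e * m / d) ** d

# the Vapnik-Cervonenkis refinement, valid for m >= d + 2
for d in (1, 2, 5, 10):
    for m in range(d + 2, 50):
        assert beta(m, d) <= 1.5 * m ** d / math.factorial(d)
```

The second loop confirms that 1.5 m^d/d! improves on (em/d)^d in its leading constant while keeping the same degree d.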


a bound due to Vapnik and Červonenkis—see Dudley (1999, Proposition 4.15). As Dudley noted, the bound is "not far from optimal" because the \binom{m}{d} contributes an m^d/d! as the leading term in the polynomial β(m, d).

Remark. By direct calculation, em/d < 2^{m/d} when m/d > x0 ≈ 3.06. If we were relying on the (em/d)^d bound to detect impossibility of shattering we would overestimate the VC-dimension by a factor of about 3. For many purposes, such an overestimate would be unimportant.

It is quite easy to construct VC-classes of sets. The next Example gives a useful starting point.

Example. (Dudley, 1978, Theorem 7.2) Suppose L is a d-dimensional vector space of real functions on a set X. Write D for the class of all subsets of X of the form {ℓ ≥ 0}, with ℓ in L. Then D is a VC-class of sets, with VCdim(D) ≤ d + 1.

Suppose F = {x1, . . . , xm} ⊆ X. The set LF of all vectors in R^m of the form [ℓ(x1), . . . , ℓ(xm)], for some ℓ in L, is a vector subspace of R^m of dimension at most d. If m > d there must exist some nonzero vector α orthogonal to LF. Express the orthogonality as

\sum_{i=1}^{m} {αi ≥ 0} αi ℓ(xi) = \sum_{i=1}^{m} {αi < 0} (−αi) ℓ(xi)  for each ℓ in L.

Without loss of generality suppose the set F0 := {xi : αi < 0} is nonempty. (Replace α by −α if F0 = ∅.) No member of D can pick out the subset F0ᶜ from F: if ℓ(xi) ≥ 0 when αi ≥ 0 and ℓ(xi) < 0 when αi < 0 then the right-hand side of the displayed equality for that ℓ would be strictly negative, while the left-hand side would be nonnegative.

Despite the simplicity of the definition of VC-dimension, it can be surprisingly tricky to check the no-shatter property directly. Instead one can rely on the polynomial bound for the number of subsets picked out and the fact that the sum or product of two polynomials is also a polynomial.

Example. Suppose D1 and D2 are VC-classes on X, with VC-dimensions d1 and d2. Define d = max(d1, d2).

Suppose F is a finite subset with #F = m. Then D1 ∪ D2 has VC-dimension less than 5d, because D1 ∪ D2 can pick out at most β(m, d1) + β(m, d2) ≤ 2(em/d)^d subsets from an F with #F = m, and 2^{1/d}(em/d) < 2^{m/d} when m ≥ 5d because 2ex < 2^x for x ≥ 4.67.

Similarly, {D1 ∩ D2 : Di ∈ Di} has VC-dimension at most 10d. First note that there are at most (em/d)^d subsets picked out by D1 and from each of those sets D2 picks out at most (em/d)^d subsets. Then note that (em/d)^{2d} < 2^m when m ≥ 10d because ex < 2^{x/2} for x ≥ 9.33.

A similar argument applies to other classes formed by taking the pairwise unions, or complements. The idea can be iterated to generate very fancy classes with finite VC-dimension—see Problem [2].
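The two calculus facts quoted in the Example, with the thresholds 4.67 and 9.33 stated in the text, can be confirmed numerically; a small check:

```python
import math

# 2ex < 2^x once x exceeds roughly 4.67 (used for the union class):
# the inequality still fails just below the threshold ...
assert 2 * math.e * 4.6 > 2 ** 4.6
# ... and holds above it
assert all(2 * math.e * x < 2 ** x for x in (4.7, 5, 10, 50))

# ex < 2^(x/2) once x exceeds roughly 9.33 (used for intersections)
assert math.e * 9.3 > 2 ** (9.3 / 2)
assert all(math.e * x < 2 ** (x / 2) for x in (9.4, 10, 20, 100))
```

Both inequalities only improve as x grows, which is why m ≥ 5d and m ≥ 10d suffice in the Example.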

The connection between shatter dimension and covering numbers introduced in Section 9.1 extends to more general classes D of measurable subsets of a space X on which a probability measure P is defined. This time I pay more careful attention to the inequality relating N and log N, for the sake of comparison with the results in Section 9.6.

Remember that the packing number pack(ε, D, P) is defined as the largest size of an ε-separated (in L1(P)) subset of D.

Lemma. (Dudley, 1978, Lemma 7.13) Let D be a class of sets with VC-dimension at most D. Then, for each probability measure P,

pack(ε, D, P) ≤ ((5e/ε) log(3e/ε))^D ≤ (15e²/ε²)^D  for 0 < ε ≤ 1.

Proof. Suppose D1, . . . , DN are sets in D with P(Di∆Dj) > ε for i ≠ j. Let F0 = {x1, . . . , xm} be an independent sample of size m = ⌈2(log N)/ε⌉ from P. As in Section 9.1, for fixed i and j,

P{Di and Dj pick out same points from F0} ≤ exp(−mε).

The sum of \binom{N}{2} such probability bounds is strictly less than 1. For some configuration F0 the class D picks out N distinct subsets, which gives the inequality N ≤ β(m, D). Problem [4] with ρ = 2/ε and λ ≤ 3e/ε then shows

N^{1/D} ≤ c0 (3e/ε) log(3e/ε)  where c0 := (1 − e⁻¹)⁻¹.

The stated bound merely tidies up some constants. □

Remark. The Lemma also gives bounds for Lp packing numbers, via

pack(ε, D, Lp(P)) = pack(ε^p, D, L1(P)),

because ‖D − D′‖p = ‖D − D′‖₁^{1/p} for each p ≥ 1.


9.3 Other ways to think about VC-classes

According to the Definition in Section 9.2, a VC-class is a set D of subsets of some set X. The defining property involves subsets D ∩ F, with D ∈ D, picked out from a finite subset F of X. In classical empirical process applications, F is often generated as a sample from some probability distribution P on X.

If P has atoms then a random sample from P can involve ties, points of X that are selected more than once in the sample. As you saw in the Example in Section 9.1, ties cause awkward (but minor) notational difficulties. Those difficulties can be avoided by thinking of x1, . . . , xn as the coordinates of a single vector x in X^n, rather than as members of a subset F of X. Instead of subsets picked out from F by D we have a set of "patterns"

Dx := {Dx : D ∈ D} ⊆ {0, 1}^n ⊂ R^n  where Dx := [{x1 ∈ D}, . . . , {xn ∈ D}].

That is, for each D in D the indicators {xi ∈ D} define the coordinates of an n-vector Dx of 0's and 1's.

Remark. If #Dx = N then D picks out exactly N distinct patterns from x. If the vector x contains repeated coordinates then N cannot equal 2^n. For example, if x1 = x2 then, for each D, the first two coordinates of Dx are the same: x1 ∈ D if and only if x2 ∈ D.

The sets Dx for various x capture all the useful properties of VC-classes. Subpatterns are given by projections: if I is a subset of {1, . . . , n}, the pattern picked out by D from the subsequence (xi : i ∈ I) is equal to the projection πI Dx of Dx onto the coordinate subspace R^I.

Definition. Say that a subset V of {0, 1}^n shatters a subset I of {1, . . . , n} if the coordinate projection πI V onto R^I equals {0, 1}^I. Define the shatter dimension sdim(V) of V to be the largest #I for which V shatters I.

    As a small exercise in notation, you should convince yourself that

VCdim(D) = max{sdim(Dx) : x ∈ X^n, n ∈ N}.

In fact it is sometimes more helpful to think of the shatter dimension as a quantity that depends on the sequence x = (x1, . . . , xn) rather than working with an upper bound that does not depend on the sample size or sample configuration. For example, if x represents a random sample from some probability distribution P then (modulo measurability quibbles) sdim(Dx)


is a random variable. As Vapnik and Červonenkis (1971, Theorem 4) showed, the uniform (over D) law of large numbers holds for that P if and only if n⁻¹ log sdim(Dx) converges in P^n probability to zero.

The sdim approach also fits well with the stochastic process setting where {Xt : t ∈ T} is an R^n-valued stochastic process. As you saw in Chapter 8, the random set

Xω := {X(ω, t) : t ∈ T} ⊂ R^n  for each fixed ω

indexes an important stochastic process, {s · z : z ∈ Xω}, with s a vector of independent Rademacher variables. The process has subgaussian increments controlled by Euclidean (ℓ2) distance. If each coordinate Xi(ω, t) takes values in {0, 1} then sdim(Xω) can be used to bound the ℓ2 packing numbers for Xω. See Chapter 10 for the more interesting case where the range is not restricted to {0, 1}.

You might be wondering why I even bothered discussing VC-classes of sets when the most important properties involve the shatter dimension of subsets of {0, 1}^n. In fact, the shatter dimension is actually a special case of VC-dimension. In a formal mathematical sense, if V ⊆ {0, 1}^n then each v in V is a function from the set X* = {1, . . . , n} to {0, 1}; and each such function can be identified with a subset of X*. That is, we could think of V as a set of subsets of X*. As another small exercise, you should now convince yourself that VCdim(V), with V regarded as a set of subsets of {1, . . . , n}, is the same as sdim(V), with V regarded as a subset of {0, 1}^n.

For some arguments I find it is easier to visualize what is going on when dealing with finite subsets V (of size N) of {0, 1}^n if I write each vector in V as a column of an n × N binary matrix V, a matrix with distinct columns whose elements all belong to the "binary" set {0, 1}.

For dealing with submatrices there is a convenient notation that should be familiar to anyone used to working with programming languages that treat vectors and matrices as basic objects.

(i) For each pair of integers p, q with p ≤ q write [p : q] for the set of integers {i ∈ Z : p ≤ i ≤ q}. For example, [1 : n] = {1, . . . , n}.

(ii) If V is an n × N matrix, and if I ⊆ [1 : n] with #I = r and J ⊆ [1 : N] with #J = c, define V[I, J] as the r × c matrix obtained by discarding from V all rows except those in I and all columns except those in J. More precisely, if I is enumerated as (i1, . . . , ir) and J as (j1, . . . , jc) then the (α, β)th element of V[I, J] is V[iα, jβ], for 1 ≤ α ≤ r and 1 ≤ β ≤ c.


(iii) Abbreviate V[I, [1 : N]] to V[I, · ], the submatrix with only the rows in [1 : n]\I removed. Interpret V[·, J] similarly.

(iv) For each I ⊂ [1 : n] write V[−I, ·] for V[[1 : n]\I, ·]. In particular write V[−i, ·] for V[[1 : n]\{i}, ·]. And so on.

Remark. The submatrix notation also makes sense if I = (i1, . . . , ir) and J = (j1, . . . , jc) are sequences with possible repeats, or if the elements are not enumerated in increasing order. The matrix V[I, J] will then have some rows or columns repeated and rows or columns in orders different from V. This extension will prove convenient when I is chosen as a random sample with repeats from [1 : n]. For example:

if V =
[  1  2  3  4  5  6
   7  8  9 10 11 12
  13 14 15 16 17 18
  19 20 21 22 23 24 ]

then

V[(1, 1, 3), (2, 1, 4)] =
[  2  1  4
   2  1  4
  14 13 16 ]

and V[−2, −(1, 3, 5)] =
[  2  4  6
  14 16 18
  20 22 24 ].

Notice the subtle difference between the 3 × 3 matrix V[(1, 1, 3), (2, 1, 4)] and—if we follow the usual mathematical convention that {1, 1, 3} is a set with only two members—the 2 × 3 matrix V[{1, 1, 3}, {2, 1, 4}]. To avoid such mathematical quicksand, and to be explicit about the row or column order, I typically use (. . .) when extracting submatrices, even for cases where {. . .} would have the same effect.
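The Remark's extraction rules map directly onto, for example, NumPy's indexing (0-based, so the 1-based indices from the text are shifted down by one); a sketch:

```python
import numpy as np

# the 4 x 6 matrix from the Remark
V = np.arange(1, 25).reshape(4, 6)

# V[(1,1,3), (2,1,4)]: rows 1,1,3 and columns 2,1,4 (1-based, repeats kept)
rows = np.array([1, 1, 3]) - 1
cols = np.array([2, 1, 4]) - 1
sub = V[np.ix_(rows, cols)]
assert sub.tolist() == [[2, 1, 4], [2, 1, 4], [14, 13, 16]]

# V[-2, -(1,3,5)]: drop row 2 and columns 1, 3, 5 (1-based)
sub2 = np.delete(np.delete(V, [1], axis=0), [0, 2, 4], axis=1)
assert sub2.tolist() == [[2, 4, 6], [14, 16, 18], [20, 22, 24]]
```

Note that `np.ix_` happily accepts repeated indices, matching the text's convention that (1, 1, 3) repeats a row, while Python sets would collapse the repeat, which is exactly the "mathematical quicksand" the Remark warns about.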

Definition. If V is an n × N binary matrix and I ⊆ [1 : n], say that V shatters the set of rows I if the submatrix V[I, · ] contains all 2^{#I} elements of {0, 1}^{#I} amongst its columns. Define the shatter dimension of V, denoted by sdim(V), as the largest #I for which V shatters I.

Remark. For a binary matrix V with each column corresponding to a unique member of a set V ⊆ {0, 1}^n, the shatter dimension is also equal to the largest #I for which πI V = {0, 1}^I. Thus sdim(V) = sdim(V). The columns of V are unique but the columns of the submatrix V[I, · ] need not be unique; some vectors in V might project onto the same vector in {0, 1}^I.


Example. The matrix

V =
[ 1 0 1 0 0 1
  0 1 0 1 0 0
  0 1 0 0 1 1
  0 0 1 1 1 1 ]

does not shatter rows {1, 2} because the column vector [1, 1] does not appear as a column of the 2 × 6 submatrix V[(1, 2), · ], the first two rows of V. Each of the other five subsets of two rows is shattered. For example,

V[(2, 3), (1, 2, 4, 5)] =
[ 0 1 1 0
  0 1 0 1 ]

has distinct columns, the four elements of {0, 1}². No subset of three (or four) rows can be shattered because V has fewer than 2³ columns. Each singleton {1}, {2}, {3}, and {4} is shattered, because each row contains at least one 0 and one 1. (Easier: every nonempty subset of a shattered I is also shattered.)

The matrix V has shatter dimension 2. Note that the number of columns in V is strictly larger than \binom{4}{0} + \binom{4}{1}, which the Theorem in Section 9.4 shows is enough to force sdim(V) > 1. □
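For matrices this small, the shatter dimension can be computed by brute force over row subsets; a sketch (the function name `sdim` mirrors the text's notation) that confirms the Example:

```python
from itertools import combinations

def sdim(columns):
    """Shatter dimension of a binary matrix given as a list of its
    columns: the largest r such that some set I of r rows has all 2^r
    patterns among the projected columns."""
    n = len(columns[0])
    best = 0
    for r in range(1, n + 1):
        for I in combinations(range(n), r):
            projections = {tuple(col[i] for i in I) for col in columns}
            if len(projections) == 2 ** r:
                best = r
                break          # some I of size r is shattered; try r + 1
    return best

# columns of the 4 x 6 matrix V from the Example
V_cols = [(1,0,0,0), (0,1,1,0), (1,0,0,1), (0,1,0,1), (0,0,1,1), (1,0,1,1)]
assert sdim(V_cols) == 2

# rows {1, 2} (indices 0, 1) are not shattered: the pattern (1, 1) is missing
assert len({(c[0], c[1]) for c in V_cols}) < 4
```

The brute-force search is exponential in the number of rows, so it is only a sanity check for small examples, not a practical algorithm.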

Example. Let Q be the set of all closed quadrants with a north-east vertex in R², that is, sets of the form {(x, y) ∈ R² : x ≤ a, y ≤ b} with a, b ∈ R. You should check that VCdim(Q) = 2. (Hint: If a quadrant Q contains the northern-most point and the eastern-most point of a finite set F, why must Q contain all of F?)

For the three points x1, x2, x3 of R² shown in the picture, the set Qx consists of five vertices of the discrete cube {0, 1}³ and sdim(Vx) = 2. The columns of V[(1, 2), · ] include all four elements of {0, 1}², with one duplicate because (1, 1, 1) and (1, 1, 0) both project to (1, 1).

[Figure: three points x1, x2, x3 in R², with quadrants labelled Q000, Q010, Q100, Q110, Q111 picking out the patterns (0,0,0), (0,1,0), (1,0,0), (1,1,0), (1,1,1).]

Vx =
[ 0 1 0 1 1
  0 0 1 1 1
  0 0 0 0 1 ]


If point x3 were removed, leaving x = (x1, x2), then

Vx =
[ 0 1 0 1 1
  0 0 1 1 1 ]

with sdim(Vx) = 2. If instead of the three xi's in the picture the point x2 were moved to coincide with x3 then

Vx =
[ 0 1 1
  0 0 1
  0 0 1 ]

with sdim(Vx) = 1. Note that the second and third rows of this Vx are the same because x2 = x3.

9.4 A combinatorial inequality

The following Lemma—sometimes called the VC Lemma and sometimes called Sauer's Lemma—plays a key role in the VC theory. See the Notes for a brief account of its complicated history.

Theorem. Let V be an m × N binary matrix with all columns distinct. If

N > β(m, d) := \binom{m}{0} + \binom{m}{1} + · · · + \binom{m}{d}

then sdim(V) > d.

Corollary. If sdim(V) ≤ d then N ≤ β(m, d).

Proof (of the Theorem). Define the downshift for the ith row of the matrix as the operation:

for j ∈ [1 : N], if V[i, j] = 1, change it to a 0 unless the resulting column already appears in V.

Remark. The binary matrix V defines a subset V of the discrete hypercube {0, 1}^n. The downshifting of V corresponds to a movement of the points of V along the edges of the hypercube.


For example, the downshift for the 1st row of the matrix

V =
[ 1 0 1 0 0 1
  0 1 0 1 0 0
  0 1 0 0 1 1
  0 0 1 1 1 1 ]

generates

V1 =
[ 0 0 0 0 0 1
  0 1 0 1 0 0
  0 1 0 0 1 1
  0 0 1 1 1 1 ].

The 1 in the last column was blocked (prevented from changing to a 0) by the fifth column of V; if it had been changed to a 0 the column [0 0 1 1] would have appeared twice in V1.

Continuing from V1, downshift on any other row to generate a new matrix V2. And so on. The order in which the rows are selected is unimportant. Stop when no more changes can be made by downshifting.

Remark. It is plausible that the downshift of a particular 1 in row i that is initially blocked by some column might succeed at a later stage, because the blocking column might itself be changed between the first and second downshift operations on row i. In fact, as Problem [5] shows, unblocking cannot occur. The whole process actually requires only a single downshift attempt for each row of the matrix. (I learned that fact from Dana Yang and Sören Künzel.)

Such possibilities have no effect on the argument that follows. It does no harm to imagine that the process requires some arbitrary, but finite, number of downshift steps.

For example, a downshift on the 2nd row of

V1 =
[ 0 0 0 0 0 1
  0 1 0 1 0 0
  0 1 0 0 1 1
  0 0 1 1 1 1 ]

generates

V2 =
[ 0 0 0 0 0 1
  0 0 0 1 0 0
  0 1 0 0 1 1
  0 0 1 1 1 1 ],

the fourth column being blocked from changing by the third column of V1. A downshift on the 4th row followed by a downshift on the 3rd row then complete the process:

V3 =
[ 0 0 0 0 0 1
  0 0 0 1 0 0
  0 1 0 0 1 1
  0 0 1 0 1 0 ]

and

V4 =
[ 0 0 0 0 0 1
  0 0 0 1 0 0
  0 1 0 0 1 0
  0 0 1 0 1 0 ].

No more changes can be generated by downshifts; every possible change of a 1 to a 0 in V4 is blocked by some column already present. Notice that rows {3, 4} are shattered by V4.
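The downshift operation can be implemented literally from its definition; the following sketch (columns stored as tuples, rows 0-indexed) reproduces the worked sequence V → V1 → V2 → V3 → V4:

```python
def downshift(cols, i):
    """One downshift pass on row i: each 1 in row i becomes a 0 unless
    the resulting column already appears, which would create a duplicate."""
    cols = [list(c) for c in cols]
    for j in range(len(cols)):
        if cols[j][i] == 1:
            candidate = cols[j][:]
            candidate[i] = 0
            if candidate not in cols:     # blocked if already present
                cols[j] = candidate
    return [tuple(c) for c in cols]

# columns of the matrix V from the worked example
V = [(1,0,0,0), (0,1,1,0), (1,0,0,1), (0,1,0,1), (0,0,1,1), (1,0,1,1)]

V1 = downshift(V, 0)      # row 1: last column blocked by the fifth
assert V1 == [(0,0,0,0), (0,1,1,0), (0,0,0,1), (0,1,0,1), (0,0,1,1), (1,0,1,1)]

V2 = downshift(V1, 1)     # row 2: fourth column blocked by the third
V3 = downshift(V2, 3)     # row 4
V4 = downshift(V3, 2)     # row 3: the process is now complete
assert V4 == [(0,0,0,0), (0,0,1,0), (0,0,0,1), (0,1,0,0), (0,0,1,1), (1,0,0,0)]
```

A further downshift on any row leaves V4 unchanged, matching the claim that every remaining change of a 1 to a 0 is blocked.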

    In general, the downshifting operation has two important properties:


(i) No new shattered sets of rows can be created by a downshift.

(ii) When no more downshifting is possible, the final matrix, V*, is hereditary. That is, if v1 is a column of V* and if v2 is obtained from v1 by changing some of its 1's to 0's then v2 is also a column of V*.

With these two properties in hand the rest of the proof then becomes a simple pigeonhole argument. Let me first establish (i) and then (ii), and then deal with the pigeons.

For simplicity of notation, consider a downshift for the 1st row of a matrix V, which generates a matrix V1. Suppose V1 shatters a set of rows I. If 1 ∉ I the submatrix V1[I, ·] is the same as V[I, ·], so V also shatters I. For the other case, suppose (again for notational simplicity) that 1 ∈ I = [1 : k]. For each x in {0, 1}^{k−1} the vector [1, x] appears as a column of V1[I, ·] so that, for some w ∈ {0, 1}^{m−k}, the vector v = [1, x, w] is a column of V1. As the downshift creates no new 1's, the vector v must also have been a column of V and have been blocked from changing by a column [0, x, w] of V. Thus both [0, x] and [1, x] are columns of V[I, ·]. The matrix V must also have shattered I.

For (ii), suppose v1 is a column of V*. If v1[i] = 1 the vector v2 obtained by changing that ith element to a 0 must also be a column of V*, for otherwise a downshift on row i would change the v1 column to v2. And so on.

The rest of the proof for the Theorem is easy. Remember that V* must have all N of its columns distinct. At most \binom{m}{k} of those columns can contain exactly k ones and m − k zeros. Consequently, at most

\binom{m}{0} + \binom{m}{1} + · · · + \binom{m}{d} = β(m, d)

of the columns can contain d or fewer ones. As N > β(m, d), there must be some column v0 of V* containing at least d + 1 ones, in rows I0 for some I0 ⊆ [1 : m] with #I0 ≥ d + 1. By the hereditary property, the matrix V* must shatter I0. By (i), the original matrix V must also shatter I0, so that sdim(V) ≥ #I0 > d. □
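The Theorem can also be spot-checked by brute force on small cases. The sketch below (helper names `beta` and `sdim` are mine) draws random sets of β(m, d) + 1 distinct columns with m = 5, d = 2, and confirms that sdim always exceeds d:

```python
import math
import random
from itertools import combinations

def sdim(cols):
    """Brute-force shatter dimension of a list of binary column tuples."""
    n = len(cols[0])
    best = 0
    for r in range(1, n + 1):
        for I in combinations(range(n), r):
            if len({tuple(c[i] for i in I) for c in cols}) == 2 ** r:
                best = r
                break
    return best

def beta(m, d):
    return sum(math.comb(m, k) for k in range(d + 1))

random.seed(1)
m, d = 5, 2
cube = [tuple((v >> i) & 1 for i in range(m)) for v in range(2 ** m)]
for _ in range(200):
    N = beta(m, d) + 1            # one column more than beta(m, d) allows
    cols = random.sample(cube, N) # N distinct columns of length m
    assert sdim(cols) > d         # the Theorem's conclusion
```

Here β(5, 2) = 16, so every choice of 17 distinct columns from {0, 1}⁵ must shatter some set of 3 rows.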

9.5 *Patterns as subgraphs of the discrete cube

The results from this Section are needed in Section 9.6 to establish an elegant refinement of Lemma . I feel it is helpful to separate out the downshifting part of the argument because it reveals yet another way of thinking about VC-classes.

Draft: 26july13 ©David Pollard

Every nonempty subset V of the discrete hypercube {0, 1}^n has a natural interpretation as the vertex set of a graph with edge set E defined by pairs at Hamming distance equal to one: the pair of vertices {v1, v2} forms an (undirected) edge if and only if

    hamm(v1, v2) := ∑_{i≤n} {v1[i] ≠ v2[i]} = ∑_{i≤n} |v1[i] − v2[i]| = 1.

That is, the endpoints of an edge must differ in exactly one coordinate position.
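In code the graph is immediate. The helpers below are a sketch of mine (not the author's notation): they compute Hamming distance and build the edge set of an arbitrary vertex set V ⊆ {0, 1}^n.

```python
from itertools import combinations

def hamm(v1, v2):
    """Hamming distance between two 0/1 tuples."""
    return sum(a != b for a, b in zip(v1, v2))

def edge_set(V):
    """Undirected edges: pairs of vertices at Hamming distance exactly 1."""
    return {frozenset(p) for p in combinations(V, 2) if hamm(*p) == 1}

# three vertices of the square {0,1}^2 form a path with two edges
assert len(edge_set([(0, 0), (0, 1), (1, 1)])) == 2
```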

Theorem. For a graph with vertex set V ⊆ {0, 1}^n and edge set E defined by the display above, #E/#V ≤ sdim(V).

Proof Write V for the n × N binary matrix that corresponds to the set V. Identify the N vertices with the column numbers: vertex 1 is V[·, 1], and so on. Also decompose the set of edges E = E(V) into disjoint subsets according to the single coordinate at which the endpoints disagree: that is, {j1, j2} ∈ Ei(V) if and only if

    hamm(V[·, j1], V[·, j2]) = 1 = |V[i, j1] − V[i, j2]|.

Repeat the downshifting procedure used for proving Theorem , leading to an hereditary binary matrix V∗ when no more downshifting changes are possible. The important new idea is that a downshift cannot decrease the number of edges. To understand why, consider a downshift of row 1 (for example) of V that generates a new binary matrix V1. To prove that #E(V) ≤ #E(V1) it suffices to show that #Ei(V) ≤ #Ei(V1) for every i.

For i = 1 the inequality is actually an equality because E1(V) = E1(V1). Indeed, if {j1, j2} ∈ E1(V) then, without loss of generality,

    V[·, (j1, j2)] = B :=
        1 0
        w w
    for some w ∈ {0, 1}^{n−1}.

Column V[·, j1] cannot change because it is blocked by column V[·, j2]. Conversely, if V1[·, (j1, j2)] = B then [1, w] must have been a column of V blocked by [0, w].

The argument gets more exciting for i ≠ 1. It suffices to construct a one-to-one map ψi from oldi := Ei(V)\Ei(V1) into newi := Ei(V1)\Ei(V) for each i ≥ 2. Consider an edge e = {j1, j2} in oldi. That is, v1 = V[·, j1] and v2 = V[·, j2] are the endpoints of e in oldi.


Write K for [1 : n]\{1, i}. I assert that the downshift must change exactly one of v1 and v2, for otherwise e would also be an edge in Ei(V1). For concreteness, suppose v1 is changed. We must have V[1, j1] = V[1, j2] = 1 = V1[1, j2] and V1[1, j1] = 0. The possible change to v2 must have been blocked by some column V[·, j] that agrees with V[·, j2] except for a zero in the first position. The relevant part of V must look like

    V[(1, i, K), (j1, j2, j)] =
        1 1 0
        a b b
        w w w
    with a ≠ b and w ∈ {0, 1}^K,

which is downshifted to

    V1[(1, i, K), (j1, j2, j)] =
        0 1 0
        a b b
        w w w .

The edge e = {j1, j2} in oldi has been destroyed but has been replaced by an edge {j1, j} in newi. The new edge is uniquely determined by v1 and v2. Define ψi({j1, j2}) = {j1, j}.

After enough downshifts we reach the conclusion that #E(V) ≤ #E(V∗). By definition of edges, for each e = {j1, j2} ∈ E(V∗) we must have

    ∑_{i≤n} ( V∗[i, j1] − V∗[i, j2] ) = ±1.

Orient e from the vertex with fewer 1's to the vertex with more 1's. That is, e becomes the directed edge (j1, j2) if the last sum equals −1 and (j2, j1) if it equals +1.

As you saw in the proof of Theorem , for every k at most D = sdim(V) of the coordinates of v = V∗[·, k] can equal 1. The vertices w = V∗[·, k′] for which (w, v) is a directed edge of the V∗ graph are all obtained by changing exactly one of those 1's to a 0. The indegree of v must be ≤ D. The number of edges in the (directed) graph is obtained by summing the indegrees for all vertices, which gives the bound ND ≥ #E(V∗) ≥ #E(V).
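For small examples the inequality #E/#V ≤ sdim(V) can be verified by brute force. The sketch below is mine; `sdim` examines every coordinate set, so it is only feasible for small n.

```python
from itertools import combinations

def hamm(v1, v2):
    return sum(a != b for a, b in zip(v1, v2))

def sdim(V, n):
    """Shattering dimension: the largest k for which some set I of k
    coordinates is shattered, i.e. the restrictions of the vectors in V
    to the coordinates in I realize all 2^k binary patterns."""
    for k in range(n, 0, -1):
        for I in combinations(range(n), k):
            if len({tuple(v[i] for i in I) for v in V}) == 2 ** k:
                return k
    return 0

V = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0), (1, 1, 1)]
E = [p for p in combinations(V, 2) if hamm(*p) == 1]
assert len(E) <= len(V) * sdim(V, 3)
```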

In the previous proof, the graph with vertex set determined by the columns of the hereditary V∗ was oriented to produce a directed graph with the indegree of every vertex bounded by sdim(V). The proof in the next Section requires a similar orientation for the graph on the original vertex set, which can be achieved by an appeal to the classical Marriage Lemma (UGMTP Problem 10.5). Recall the statement of that Lemma. For each σ in a finite set S we are given a nonempty subset K(σ) of some finite set K. We seek a one-to-one function ψ : S → K for which ψ(σ) ∈ K(σ) for every σ in S. The Lemma asserts that such a function exists if and only if

    #( ∪_{σ∈A} K(σ) ) ≥ #A   for every nonempty subset A of S.
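For tiny instances the Marriage Lemma can be checked exhaustively. The sketch below is purely illustrative (the function names `hall` and `sdr` are mine): `hall` tests the displayed condition and `sdr` searches for a one-to-one choice function by brute force.

```python
from itertools import combinations, permutations

def hall(sets):
    """Check #(union of K(sigma) over sigma in A) >= #A for every nonempty A."""
    return all(len(set().union(*A)) >= len(A)
               for r in range(1, len(sets) + 1)
               for A in combinations(sets, r))

def sdr(sets):
    """Brute-force search for an injective psi with psi(sigma) in K(sigma);
    feasible only when the universe of elements is tiny."""
    universe = sorted(set().union(*sets))
    for choice in permutations(universe, len(sets)):
        if all(c in s for c, s in zip(choice, sets)):
            return choice
    return None

K = [{1, 2}, {2, 3}, {1, 3}]
assert hall(K) and sdr(K) is not None
bad = [{1}, {1}]                  # the two sets share their only element
assert not hall(bad) and sdr(bad) is None
```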

Corollary. (Haussler, 1995, following Alon and Tarsi, 1992) For V and E as in Theorem , there exists an orientation of the edges for which the indegree of every vertex is at most sdim(V).

Proof Once again identify the members of V with the columns of an n × N binary matrix V with distinct columns and label the vertices in V with the corresponding column numbers of V. For each nonempty subset J of [1 : N] the subset VJ of {0, 1}^n corresponding to the columns of V[·, J] has sdim(VJ) ≤ D := sdim(V). By the Theorem, the corresponding edge set EJ := E(VJ) satisfies #EJ/#VJ ≤ D.

Define S = E(V) and K = [1 : D] × V. (That is, K consists of D "copies" of each vertex in V.) For each e = {j1, j2} in S define

    K(e) := [1 : D] × {j1, j2}.

For a nonempty subset A of S define J to be the set of vertices that appear as an endpoint of at least one edge from A. Note that A ⊆ EJ. Then

    ∪_{e∈A} K(e) = [1 : D] × J,

a set with cardinality D × #J = D × #VJ ≥ #EJ ≥ #A.

The Marriage Lemma gives a map ψ : E(V) → K for which ψ(e) = (i, j) with 1 ≤ i ≤ D and j one of the two vertices that define edge e. If e = {j′, j} then orient it as (j′, j). For each vertex j there are at most D edges e for which ψ(e) = (i, j) with 1 ≤ i ≤ D. Vertex j has indegree at most D.

If the edge e = {j′, j} is oriented as (j′, j), write head(e) for j and tail(e) for j′. If the orientation were indicated by an arrow, its head would be at j and its tail at j′.

9.6 *Haussler's improvement of the packing bound

By means of a more subtle randomization argument, it is possible to eliminate the log(3e/ε) factor from the bound in Theorem . The improvement is due to Haussler (1995).


Throughout the Section, ‖·‖1 denotes the ℓ1 norm on R^n, that is, ‖v‖1 = ∑_{i≤n} |v[i]|. For vectors v1 and v2 in {0, 1}^n the ℓ1 distance ‖v1 − v2‖1 coincides with Hamming distance.

Theorem. Let V = {v1, . . . , vN} be a subset of {0, 1}^n for which

    ‖vj − vj′‖1 ≥ nε   for all 1 ≤ j < j′ ≤ N,

for some 0 < ε ≤ 1. If sdim(V) ≤ D then N ≤ e(1 + D)(2e/ε)^D.

Proof Let X be a random vector distributed uniformly on V. As usual, write X1, . . . , Xn for the coordinates, each of which takes values in {0, 1}. The proof depends on two inequalities involving conditional variances, for nonempty, proper subsets I and K of [1 : n]:

    ∑_{i∈I} P var(Xi | X_{I\{i}}) ≤ D

and

    ∑_{i∉K} P var(Xi | X_K) ≥ ½nε (1 − #V_K/N)   with V_K := π_K V
                             ≥ ½nε (1 − β(#K, D)/N),

the final inequality by Corollary . The first inequality depends on Corollary ; the second depends on the assumed separation of the vectors in V. The proofs of both are given as Corollaries to Lemma  at the end of this Section.

Let M equal ⌈2(D + 1)/ε⌉ − 1. (The reason for this choice becomes evident at the end of the proof.) Without loss of generality assume that M < n. (Soon n is replaced by ℓn for an arbitrarily large integer ℓ.) Sum the first inequality over all possible subsets I of [1 : n] with #I = M + 1 to get

    ∑_{#I=M+1} ∑_{i∈I} P var(Xi | X_{I\{i}}) ≤ C(n, M+1) D.

Equivalently,

    ∑_{#K=M} ∑_{i∉K} P var(Xi | X_K) ≤ C(n, M+1) D.

(To see the equivalence, write K for I\{i}. Every (i, K) pair with i ∉ K and #K = M appears exactly once in the first double sum. As a check, note that both double sums involve (M + 1) C(n, M+1) = (n − M) C(n, M) terms.)
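The counting identity in the parenthetical check is standard; a short numerical confirmation (the ranges are my choice):

```python
from math import comb

# (M+1) * C(n, M+1) = (n-M) * C(n, M): both sides count the pairs (i, K)
# with #K = M and i not in K.
for n in range(1, 15):
    for M in range(n):
        assert (M + 1) * comb(n, M + 1) == (n - M) * comb(n, M)
```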

Similarly, a summation over all subsets K of [1 : n] of size M in the second inequality gives

    ∑_{#K=M} ∑_{i∉K} P var(Xi | X_K) ≥ C(n, M) ½nε (1 − β(M, D)/N).

Together the last two displays imply

    β(M, D)/N ≥ 1 − 2(n − M)D/(nε(M + 1)),

or

    N ≤ β(M, D) ( 1 − 2(n − M)D/(nε(M + 1)) )^{−1}.

I could try to optimize over M immediately, as Haussler (1995, page 225) did, to bound N by a function of ε, D, and n, then take a supremum over n. However, it is simpler to note that all the conditions of the Theorem apply if n is replaced by ℓn, for some positive integer ℓ, and each vj in V is replaced by the concatenation of ℓ copies of vj. (That is, if V is represented by the n × N binary matrix V, the new vectors are the columns of the (ℓn) × N binary matrix obtained by stacking ℓ copies of V.) Letting ℓ tend to infinity (with M fixed) eliminates n from the bound, leaving

    N ≤ β(M, D) ( 1 − 2D/(ε(M + 1)) )^{−1}.

The first factor is an increasing function of M, the second a decreasing function. If M is chosen so that M + 1 ≥ 2(D + 1)/ε ≥ M then M ≥ D and the upper bound for N is less than

    (eM/D)^D · ε(M + 1)/(ε(M + 1) − 2D)
        ≤ ( 2e(D + 1)/(εD) )^D · 2(D + 1)/(2(D + 1) − 2D)
        ≤ (2e/ε)^D (1 + D^{−1})^D (D + 1),

which is smaller than the bound stated in the Theorem.
□

Remark. My M corresponds to m − 1 in Haussler's paper. To see the reason for the choice of M, note that the function x^D (1 − 2D/(εx))^{−1} is minimized over x > 2D/ε by x = 2(D + 1)/ε.
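A quick numerical sanity check of the Remark (my own; the values D = 3 and ε = 0.5 are arbitrary): the function f(x) = x^D (1 − 2D/(εx))^{−1} should bottom out near x = 2(D + 1)/ε.

```python
D, eps = 3, 0.5
f = lambda x: x ** D / (1 - 2 * D / (eps * x))   # defined for x > 2D/eps = 12
x_star = 2 * (D + 1) / eps                        # claimed minimizer: 16.0
grid = [12.5 + 0.01 * k for k in range(1000)]     # 12.5, 12.51, ..., 22.49
best = min(grid, key=f)
assert abs(best - x_star) < 0.02
```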


It remains only to justify the two conditional-variance inequalities. To keep the notation simple, I first prove a general result and then deduce the inequalities as Corollaries.

Lemma. Suppose W is a subset of {0, 1}^R of size L. Suppose also that Y is a random R-vector distributed according to some distribution P on W. Then

(i) ∑_{i≤R} P var(Yi | Y_{−i}) ≤ sdim(W).

(ii) If P is the uniform distribution on W and the elements are δ-separated, ‖wj − wj′‖1 ≥ δ for 1 ≤ j < j′ ≤ L, for some δ > 0, then ∑_{i≤R} var(Yi) ≥ ½δ (1 − L^{−1}).

Proof For (i) let E denote the edge set, where e = {w, w′} is an edge if and only if ‖w − w′‖1 = 1. Write D for sdim(W). By Lemma , the edges can be directed in such a way that every vertex in W has indegree at most D. Put another way,

    ∑_{e∈E} {head(e) = w} ≤ D   for each w ∈ W.

Consider the case i = 1. The set W_{−1} := π_{−1}W can be partitioned into two subsets: the set W^1_{−1} of those x's for which there is a unique w ∈ W that projects onto x, and the set W^2_{−1} of those x's for which there is an edge e = {[0, x], [1, x]} in E1(W). If x ∈ W^1_{−1} then the distribution of Y given Y_{−1} = x is degenerate at a single point and var(Y1 | Y_{−1} = x) = 0. If x ∈ W^2_{−1}, the conditional distribution of Y1 given Y_{−1} = x is Ber(p1(x)) where p1(x) := P{[1, x]}/(Pe) and

    var(Y1 | Y_{−1} = x) = p1(x)[1 − p1(x)] = P{tail(e)}P{head(e)}/(Pe)².

The analysis for other values of i in [1 : R] is similar. Thus

    ∑_{i=1}^{R} P var(Yi | Y_{−i}) = ∑_{i=1}^{R} ∑_{e∈Ei(W)} Pe · P{tail(e)}P{head(e)}/(Pe)²
        ≤ ∑_{e∈E} P{head(e)} = ∑_{w∈W} ∑_{e∈E} {head(e) = w} P{w}.

As e ranges over all edges, head(e) visits each vertex at most D times. The last sum is at most D ∑_{w∈W} P{w} = D.


For (ii) let Y′ = (Y′_i : 1 ≤ i ≤ R) be chosen independently of Y, with the same uniform distribution. Note that the difference Yi − Y′_i takes the values ±1 when Yi ≠ Y′_i and is otherwise zero. Thus (Yi − Y′_i)² = |Yi − Y′_i| and

    ∑_{i≤R} var(Yi) = ∑_i ½ P|Yi − Y′_i|²   by independence
                    = ½ P‖Y − Y′‖1
                    ≥ ½δ P{Y ≠ Y′} = ½δ (1 − L^{−1}).
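Both lines of the last display can be confirmed exactly for a small W. The snippet below is mine; the particular W is an arbitrary 2-separated subset of {0,1}^3. It computes ∑ var(Yi) and ½ E‖Y − Y′‖1 under the uniform distribution and checks the lower bound with δ = 2.

```python
W = [(0, 0, 1), (0, 1, 0), (1, 0, 0), (1, 1, 1)]   # pairwise l1-distance 2
L, R = len(W), len(W[0])

# sum of coordinate variances under the uniform distribution on W
means = [sum(w[i] for w in W) / L for i in range(R)]
var_sum = sum(sum((w[i] - means[i]) ** 2 for w in W) / L for i in range(R))

# (1/2) E||Y - Y'||_1 for independent uniform copies Y, Y'
half_mean_dist = 0.5 * sum(sum(abs(a - b) for a, b in zip(y, y2))
                           for y in W for y2 in W) / L ** 2

delta = 2
assert abs(var_sum - half_mean_dist) < 1e-12
assert var_sum >= 0.5 * delta * (1 - 1 / L) - 1e-12
```

For this W the two quantities both equal 3/4, and the lower bound ½δ(1 − L^{−1}) = 3/4 is attained.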

Corollary. (inequality )  ∑_{i∈I} P var(Xi | X_{I\{i}}) ≤ D.

Proof Apply part (i) of the Lemma with W = π_I V and Y = X_I, whose distribution need not be uniform on W; note that sdim(π_I V) ≤ sdim(V) ≤ D.

Corollary. (inequality )

    ∑_{i∉K} P var(Xi | X_K) ≥ ½nε (1 − #V_K/N)   with V_K := π_K V.

Proof For each z in V_K, the distribution of X_{−K} given X_K = z is uniform on the set W_z := π_K^{−1}{z} = {v ∈ V : π_K v = z}. Part (ii) of the Lemma gives

    ∑_{i∉K} var(Xi | X_K = z) ≥ ½nε (1 − N(K, z)^{−1}),

where N(K, z) = #W_z. Also P{X_K = z} = N(K, z)/N. Thus

    ∑_{i∉K} P var(Xi | X_K) ≥ ½nε ∑_{z∈V_K} (N(K, z)/N) (1 − 1/N(K, z)),

which reduces to the stated expression.
□

9.7 Problems

[1] Let Hd denote the set of all closed half-spaces in R^d. Show that VCdim(Hd) = d + 1 by the following arguments.

(i) Show that Hd shatters the set F = {0, e1, . . . , ed}, where ei is the unit vector with +1 in the ith position. Hint: Consider closed halfspaces of the form {x : θ · x ≥ r} with θ ∈ {−1, +1}^d.


(ii) Suppose F = {xi : i = 0, 1, . . . , d + 1} is a set of d + 2 distinct points in R^d. If H picks out a subset F_K := {xi : i ∈ K} from F, show that the convex hull of F_K is a subset of H, which is disjoint from the convex hull of F_{K^c}.

(iii) Show that ∑_{1≤i≤d+1} αi(xi − x0) = 0 for some constants αi that are not all zero. Define α0 = −∑_{1≤i≤d+1} αi. Show that

    ∑_{0≤i≤d+1} αi xi = 0   and   ∑_{0≤i≤d+1} αi = 0.

Define J = {i ∈ [0 : d + 1] : αi > 0}. Show that the convex hulls of the subsets F_J = {xi : i ∈ J} and F_{J^c} have a nonempty intersection. Thus Hd cannot pick out the subset F_J from F.

[2] Suppose D, D1, . . . , Dk are VC-classes of subsets of X.

(i) (Dudley, 1978, Proposition 7.12) Write A(B1, . . . , Bk) for the algebra of subsets of X generated by subsets B1, . . . , Bk of X. For each finite k, show that ∪{A(D1, . . . , Dk) : D1, . . . , Dk ∈ D} is also a VC-class.

(ii) Suppose VCdim(Di) ≤ D for each i. Let Bk denote the class of all sets expressible by means of at most k union, intersection, or complement symbols. Find an increasing, integer-valued function β(k) such that the VC-dimension of Bk is at most β(k)D.

[3] Define G(x) = x/(log x) for x ≥ 1, with G(1) = +∞.

(i) Show that G achieves its minimum value of e at x = e and that G is an increasing function on [e, ∞).

(ii) Suppose y ≥ G(x) for some x ≥ e. Show that

    log y ≥ (1 − 1/G(log x)) log x ≥ (1 − e^{−1}) log x.

Deduce that x = G(x) log x ≤ c0 y log y, where c0 := (1 − e^{−1})^{−1} ≈ 1.58 < 2.
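A numerical spot check of the deduction in (ii) (my own; the sample values of x are arbitrary, and the boundary case y = G(x) is the tight one):

```python
from math import e, log

def G(x):
    return x / log(x)

c0 = 1 / (1 - 1 / e)              # approximately 1.582
for x in [e, 3.0, 10.0, 1e3, 1e6]:
    y = G(x)
    assert log(y) >= (1 - 1 / e) * log(x) - 1e-9
    assert x <= c0 * y * log(y) + 1e-9
```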

[4] Suppose N and d are positive integers for which

    N ≤ β(m, d) := ∑_{k=0}^{d} C(m, k)   with m = ⌈ρ log N⌉ for some ρ ≥ 1.

Show that, for the constant c0 defined in Problem [3],

    N^{1/d} ≤ c0 λ log λ ≤ c0 λ²   where λ := (1 + ρ)e,

by the following steps.


(i) First check that the inequality is trivially true if N ≤ e^d.

(ii) For N > e^d check that m ≥ d, which implies β(m, d) ≤ (em/d)^d. Deduce that

    N^{1/d} ≤ em/d ≤ e(1 + ρ log N)/d ≤ e(1 + ρ log N^{1/d}),

so that G(N^{1/d}) ≤ (1 + ρ)e = λ.
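The whole chain in Problem [4] can be spot-checked numerically (the choices ρ = 2, d = 3 and the range of N are mine):

```python
from math import ceil, comb, e, log

def beta(m, d):
    return sum(comb(m, k) for k in range(d + 1))

rho, d = 2, 3
lam = (1 + rho) * e
c0 = 1 / (1 - 1 / e)
for N in range(2, 3000):
    m = ceil(rho * log(N))
    if N > e ** d:                         # the step checked in part (ii)
        assert m >= d and beta(m, d) <= (e * m / d) ** d
    if N <= beta(m, d):                    # hypothesis of Problem [4]
        assert N ** (1 / d) <= c0 * lam * log(lam) <= c0 * lam ** 2
```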

[5] For the downshift argument described in Section 9.4, show that every element that is blocked from changing to 0 during at least one downshift must be blocked for every downshift; it must stay equal to 1 in the final V∗. Generalize the following argument.

Consider the form of V after the first attempted downshift on row 1 at step m. If k of the 1's were blocked, the new Vm must contain an n × (2k) submatrix like

    Wm =
        1  0  1  0  . . .  1  0
        a1 a1 a2 a2 . . .  ak ak
        w1 w1 w2 w2 . . .  wk wk

for distinct column vectors [ai, wi] ∈ {0, 1}^{n−1}. Suppose row 2 is the target for the next downshift. Consider the effect on the first two columns of Wm.

If a1 = 0 then neither column can change; the 1 will still be blocked on any subsequent attempt to downshift row 1. A similar conclusion holds if a1 = 1 and both downshifts of the a1 are blocked. Only if a1 = 1 and exactly one of the downshifts succeeds will the future of the 1 be in any doubt. In that case the first two columns of Wm will be changed to either

        1  0          1  0
        1  0    or    0  1
        w1 w1         w1 w1

In the first case the matrix Wm must already have had a column [1, 0, w1], which must have been blocked by a column [0, 0, w1] when Vm was created. But that same [0, 0, w1] should have blocked the change in the second column of Wm. Thus the first case cannot occur.

In the second case the matrix Wm must already have had a column [0, 0, w1], which then will block future attempts to change the 1.

In short, the Vm+1 will again contain a submatrix like Wm. And so on.


9.8 Notes

The bound stated in Theorem  appeared in the paper by Sauer (1972). In an online blog (http://leon.bottou.org/news/vapnik-chervonenkis sauer), Bottou made a case that the credit should go to Vapnik and Červonenkis, who first published a short summary (Vapnik and Červonenkis, 1968) of their result and then a longer version (Vapnik and Červonenkis, 1971). Dudley (1999, page 167) (see also Dudley 1978, Section 7) noted that the 1971 bound was a little weaker than the Sauer result. Inequality  is proved on page 154 of the Wapnik and Tscherwonenkis (1979) book, a translation into German of a 1974 Russian edition, which I have not seen.

Steele (1975, 1978) generalized the result described in Theorem  to matrices whose entries come from a finite alphabet. Haussler and Long (1995) generalized further.

I learned about the downshift method for proving Theorem  from Michel Talagrand. Ledoux and Talagrand (1991, pp. 411–412) used the set-theoretic version of the technique, attributing it (page 420) to Frankl (1983). (See also Talagrand 1988, Proposition 8, for another interesting application of downshifting.) Independently, both Alon (1983) and Frankl (1983) used the shifting method to prove a result that implies Theorem . Frankl (1995, page 1298) noted that the shifting technique was introduced by Erdős et al. (1961).

The results in Section 9.5 come from Haussler (1995), who acknowledged (page 220) Linial for the downshifting method of proof for Theorem , with the comment that the result had already been proved by Haussler, Littlestone, and Warmuth (1994). For an illuminating Bayesian interpretation of the probability method used to prove Theorem  see Haussler (1995, Section 3).

Stengle and Yukich (1989) showed how to obtain VC-classes of subsets of a Euclidean space by taking cross-sections of semi-algebraic subsets (defined by finitely many polynomial inequalities) of higher-dimensional Euclidean spaces. See van den Dries (1998, Chapter 5) for more general results about cross-sections of definable sets in o-minimal structures.

References

Alon, N. (1983). On the density of sets of vectors. Discrete Mathematics 46, 199–202.

Alon, N. and J. H. Spencer (2000). The Probabilistic Method (second ed.). Wiley.

Alon, N. and M. Tarsi (1992). Colorings and orientations of graphs. Combinatorica 12(2), 125–134.

Dudley, R. M. (1978). Central limit theorems for empirical measures. Annals of Probability 6, 899–929.

Dudley, R. M. (1999). Uniform Central Limit Theorems. Cambridge University Press.

Erdős, P., C. Ko, and R. Rado (1961). Intersection theorems for systems of finite sets. Quart. J. Math. Oxford 12(2), 313–320.

Frankl, P. (1983). On the trace of finite sets. Journal of Combinatorial Theory, Series A 34, 41–45.

Frankl, P. (1995). Extremal set systems. In R. L. Graham, M. Grötschel, and L. Lovász (Eds.), Handbook of Combinatorics, Volume 2, Chapter 24, pp. 1293–1329. Elsevier.

Haussler, D. (1995). Sphere packing numbers for subsets of the Boolean n-cube with bounded Vapnik-Chervonenkis dimension. Journal of Combinatorial Theory 69, 217–232.

Haussler, D., N. Littlestone, and M. K. Warmuth (1994). Predicting {0, 1}-functions on randomly drawn points. Information and Computation 115, 248–292.

Haussler, D. and P. M. Long (1995). A generalization of Sauer's lemma. Journal of Combinatorial Theory 71, 219–240.

Ledoux, M. and M. Talagrand (1991). Probability in Banach Spaces: Isoperimetry and Processes. New York: Springer.

Sauer, N. (1972). On the density of families of sets. Journal of Combinatorial Theory 13, 145–147.

Steele, J. M. (1975). Combinatorial Entropy and Uniform Limit Laws. Ph. D. thesis, Stanford University.

Steele, J. M. (1978). Existence of submatrices with all possible columns. Journal of Combinatorial Theory, Series A 24, 84–88.

Stengle, G. and J. Yukich (1989). Some new Vapnik-Chervonenkis classes. Annals of Statistics 17, 1441–1446.

Talagrand, M. (1988). Donsker classes of sets. Probab. Theory Relat. Fields 78, 169–191.

van den Dries, L. (1998). Tame Topology and O-minimal Structures. Cambridge University Press.

Vapnik, V. N. and A. J. Červonenkis (1968). The uniform convergence of frequencies of the appearance of events to their probabilities. Dokl. Akad. Nauk SSSR 181, 781–783. In Russian. English translation: Soviet Math. Dokl. 9 (1968), 915–918. See Mathematical Review MR0231431 by R. M. Dudley.

Vapnik, V. N. and A. Y. Červonenkis (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and Its Applications 16, 264–280.

Vapnik, V. N. and A. Y. Červonenkis (1981). Necessary and sufficient conditions for the uniform convergence of means to their expectations. Theory of Probability and Its Applications 26, 532–553.

Wapnik, W. N. and A. J. Tscherwonenkis (1979). Theorie der Zeichenerkennung. Berlin: Akademie-Verlag. German translation of a 1974 Russian edition.