geometrical and statistical properties of systems of

Upload: farhad-javedani

Post on 06-Apr-2018

216 views

Category:

Documents


0 download

TRANSCRIPT

  • 8/3/2019 Geometrical and Statistical Properties of Systems Of

    1/9

    IEEE TRANSACTIONS ON ELECTRONIC COMPUTERS

    Geometrical and Statistical Properties of Systems ofLinear Inequalities with Applications in PatternRecognitionTHOMAS M. COVER

    Abstract-This paper develops the separating capacities of fami-lies of nonlinear decision surfaces by a direct application of a theoremin classical combinatorial geometry. It is shown that a family of sur-faces having d degrees of freedom has a natural separating capacityof 2d pattern vectors, thus extending and unifying results of Winderand others on the pattern-separating capacity of hyperplanes. Apply-ing these ideas to the vertices of a binary n-cube yields bounds onthe number of spherically, quadratically, and, in general, nonlinearlyseparable Boolean functions of n variables.It is shown that the set of all surfaces which separate a dichotomyof an infinite, random, separable set of pattern vectors can be charac-terized, on th e average, by a subset of only 2d extreme pattern vec-tors. In addition, the problem of generalizing the classifications on alabeled set of pattern points to the classification of a new point isdefined, and it is found that the probability of ambiguous generaliza-tion is large unless the number of training patterns exceeds thecapacity of the set of separating surfaces.

    I. DEFINITIONS AND HISTORY OF FUNCTION-COUNTING THEOREMSC ONSIDER a se t of patterns represented by a se tof vectors in a d-dimensional Euclidean spaceEd . A homogeneous linear threshold functionfw: Ed-- { -1, 0, 1 } is defined in terms of a parameter orweight vector w for every vector x in this space:

    f 1,f (x) = q 0,1-1,

    w-x> 0w.x = 0w x

  • 8/3/2019 Geometrical and Statistical Properties of Systems Of

    2/9

    Cover: Properties of Systems of Linear InequalitiesCameron independently applied Theorem 1 to thevertices of a binary n-cube in order to find an upperbound on the number of linearly separable truth func-tions of n variables. These authors [1]-[5] have allused a variant of a proof, which seems to have firstappeared in Schlafli [6], of Theorem 2 or its dual state-ment Theorem 2' .Theorem 2: N hyperplanes in general position passingthrough the origin of d-space divide the space intoC(N, d) regions.Theorem 2': A d-dimensional subspace in generalposition in N-space intersects C(N, d) orthants.Proofs of Theorem 2 appear in Schlafli [6], Winder[1], [2], Cameron [3], and Wendel [7]. Proofs in termsof the dual statement, Theorem 2' , can be found inSchlafli [6] and in Joseph [4]. It should be noted thatmost of these references are relatively obscure. A varia-tion of the known proofs of Theorems 2 and 2' willtherefore be provided in the first portion of the proof ofTheorem 3.

    II. SEPARABILITY BY ARBITRARY SURFACESMany authors [8]-[12] have been concerned withseparating sets of points with parametric families of sur-faces. Cooper [8], [9] has been primarily concerned withthe characterization of the natural class of decision sur-faces for a given decision theoretic pattern-recognitionproblem, as well as with the complementary questionof characterizing the natural class of problems for whicha given family of decision surfaces is minimal complete.Bishop [13], Wong and Eisenberg [14], and Cooper[15] have been concerned with the problem of using rth-order polynomial surfaces to implement truth functions(separating a dichotomy of the vertices of a binary n-cube). Koford [16] and Aizerman et al . [10] have ob-served that standard training algorithms will convergeto a separating surface if one exists. In this section, thenumber of dichotomies of a set of points which may beseparated by a faomily of decision surfaces will be found.This number foll ws directly from the function-count-in g theorem when the family of separating surfaces andthe set of points to be separated are carefully defined.Consider a family of surfaces, each of which naturallydivides a given space into two regions, and a collectionof N points in this space, each of which is assigned toone of tw o classes X' or X-. This dichotomy of thepoints is said to be separable relative to the family ofsurfaces if there exists a surface in the family thatseparates th e points in X+ from the points in X-. Con-sider the se t of N objects X= {X1, * , XN}. The ele-ments of X will be referred to as patterns for intuitivereasons. On each pattern xCEX, a set of real-valuedmeasurement functions ly, 4)2, * *, 4d comprises thevector of measurements

    4: X -> Ed (7)where 4)(x) = [4)1(x), +52(X), * . , Od(X))], XCX.

    A dichotomy (binary partition) { X+, X-} of X is-separable if there exists a vector w such thatW.*(X) > 0,w.q5(x)

  • 8/3/2019 Geometrical and Statistical Properties of Systems Of

    3/9

    IEEE TRANSACTIONS ON ELECTRONIC COMPUTERSand only if there exists a separating hyperplane throughthe new point which separates the ol d dichotomy. Thisis reasonable because, if such a hyperplane exists, smalldisplacements of the hyperplane will allow arbitraryclassification of the new point without affecting theseparation of the ol d dichotomy. The proof makes thesedisplacements explicit.Proof: The se t W of separating vectors for { X+, X-is given by W= { w: w x>0, xGX+; w x 0; and the dichotomy {X+, X-J{ y} }is homogeneously linearly separable if and only if thereexists a w in W such that w y 0 suchthat w* +Ey and w*-ey are in W. Hence, {X+U y}, X- }and { X+, X-U { y } } are homogeneously linearly sepa-rable by w*+ ey and w* - Ey, respectively.Theorem 3 generalizes the function-counting theoremto certain classes of nonlinear functions under con-straints. In particular, it states that k independent con-straints on the class of separating surfaces reduce thenumber of degrees of freedom of the class by k.Theorem 3: If a 4-surface {x : w - (X) =0} is con-strained to contain the set of points Y= { yi, Y2 , Yk I,where 4)(y,), O(y2), - , 4)(yk) are linearly independent,and where the projection of 4)(xl), 4)(x2), * * , 4)(XN)onto the orthogonal subspace to the space spanned by4)(yI), 4)(y2), , 4)(yk) is in general position, then thereare C(N, d - k) -separable dichotomies of X.Proof: In the special case k=0 (no constraints) and+5(x) =x (linear separating surfaces), this theorem re-duces to the statement of Theorem 1. We shall firstprove this special case by induction on N and d. LetC(N, d) be the number of homogeneously linearly sepa-rable dichotomies of X= {Xl, X2, * * *, XN}. Consider anew point XN+l such that XU {XN+ } is in general posi-tion, and consider the C(N, d) homogeneously linearlyseparable dichotomies of X. If a dichotomy {X+, X-}is separable, then either {X+U { XN+1 }, X- } or{ X+, X-U { XN+1 } } must be separable. However, bothdichotomies are separable, by Lemma 1, if and only ifthere exists a separating vector w for { X+, X-} lyingin the (d-1)-dimensional subspace orthogonal toXN+1. A dichotomy of X is separable by such a w if andonly if the projection of the set X onto the (d- 1)-dimensional orthogonal subspace to XNi is separable.By the induction hypothesis there are C(N, d -1) suchseparable dichotomies. Hence,

    Repeated application of (10) to the terms on the rightyields(11)

    from which the theorem follows immediately on notingC(1, m) = { m> 1m < 1. (12)

    Generalizing the proof to arbitrary 4 and k, we firstobserve that the condition that a 4-surf ace contains these t {Iy1, y2 , * , yk} is that the parameter vector wwhich characterizes the surface must li e in the (d-k)-dimensional subspace L, whereL = w: w 4)(yi) = 0, i = 1,2, * , k} .

    Let + be the orthogonal projection of 4) onto L. Then,since w-4=w. for all w in L, it is seen that a se t ofvectors 0} is separable by a parameter vector in L ifand only if the corresponding set of projections { } isseparable. Since, by the second assumption, 0(xi),(X2), - , +(XN) are in +-general position in L, thereare C(N, d-k) homogeneously linearly separable di-chotomies of {4(Xl), +(X2), * * *, O(XN) } by a vector win L.In defining a linear threshold function, it is difficultto decide whether to classify the points lying on theseparating hyperplane into the (+1) c lass o r the (-1)class. This difficulty can be resolved by assigning thesepoints to a third class and proving Theorem 4.Theorem 4: Let { XI , X2, * * - , XN } be a se t of N vectorsin general position in Ed . Let F be the class of functionsf: {Xl, X2, XN}--> I1, 0, -11 defined for each winEd by

    ( 1, xi.w>0f (xi) = t o, xi w < 0-1) xi-w

  • 8/3/2019 Geometrical and Statistical Properties of Systems Of

    4/9

    Cover: Properties of Systems of Linear Inequalities

    A AA0

    0 0(a)

    A

    A A

    (b)

    A

    00

    0 A

    (c)Fig. 1. Examples of -separable dichotomies of five points in two dimensions. (a) Linearly separable dichotomy.(b) Spherically separable dichotomy. (c) Quadrically separable dichotomy.

    lr n Q(N, d)N-Xc C(N, d)III. EXAMPLES OF SEPARATING SURFACES

    A natural generalization of linear separability is poly-nomial separability. For the ensuing discussion, considerthe patterns to be vectors in an m-dimensional space.The measurement function 0 then maps points in m-space into points in d-space.Consider a natural class of mappings obtained byadjoining r-wise products of the pattern vector coordi-nates. The natural separating surfaces corresponding tosuch mappings are known as rth-order rational varieties.A rational variety of order r in a space of m dimensionsis represented by an rth-degree homogeneous equationin the coordinates (x),

    aii2. -ir(X)il(X)i2 . .(X)i) = 0, (16)0 < il < i2 s < .

  • 8/3/2019 Geometrical and Statistical Properties of Systems Of

    5/9

    IEEE TRANSACTIONS ON ELECTRONIC COMPUTERSTABLE I

    EXAMPLES OF SEPARATING SURFACES WITH THE CORRESPONDING NUMBER OFSEPARABLE DICHOTOMIES OF N POINTS IN m DIMENSIONSMapping Se ti Degrees of Number of 4>-Sepa- SeparatingDefined on Separatng Freedom of General Position rable Dichotomies Capacityx in Em Surace -Surface of N Points

    (x) =x hyperplane through m no m points on any subspace C(N, m) 2morigin+(x)=(1, x) hyperplane m+1 no m+1 points on any hyperplane C(N, m+1) 2(m+1)+(X) =(1, x, Ix 12) hypersphere m+2 no m+2 points on any hypersphere C(N, m+2) 2(m+2)+(x)=(x, llxll) hypercone m+1 no m+1 points on any hypercone C(N, m+1) 2(m+1)

    /m+r Zm+r m+r' /m+r\4(x)=all r-wise products rational rth-order no points on any rth- C N, 2of components of x variety \ r/ r order surface C(\(r>) 2(m+r)where 0(mr lo g m) is a remainder term which, for someK, is asymptotically bounded by Kmr lo g m.In general, for rth-order polynomial separating sur-faces, the number Lm(r) of separable truth functions ofm variables is bounded above by

    Lm(r) < C Q2m, ( ) = 2mr+lIr!+O(mr log ) (19)Koford [16] has observed that augmenting the vectorxC Ed to yield a vector 4)(x), as in (17), is particularlyeasy to implement when the coefficients are binary. Inaddition, Koford notes as do Aizerman [10] andGreenberg and Konheim [12] that, if the augmentedvector +(x) is used as an input to a linear threshold de-vice, then the standard fixed increment training pro-cedure will converge [17] (by the Perceptron con-vergence theorem) in a finite number of steps to a

    separating 4)-surface if one exists.Table I lists several examples of families of separatingsurfaces. Al l patterns x should be considered as vectorsin an m-dimensional space. The function +(x) = (1 , x)is an (m+1)-dimensional vector. The final column ofTable I lists the separating capacities of the 4-surfaces,a measure of the expected number of random patternswhich can be separated. The separating capacity will bemade plausible as a useful idea in Section IV.IV. SEPARABILITY OF RANDOM PATTERNSTwo kinds of randomness are considered in the pat-

    tern dichotomization problem:1) The patterns are fixed in position but are classifiedindependently with equal probability into one oftwo categories.2) The patterns themselves are randomly distributedin space, and the desired dichotomization may berandom or fixed.Under these conditions the separability of the se t ofpattern vectors becomes a random event depending onthe dichotomy chosen and the configuration of the pat-terns. The probability of this random event and the

    maximum number of random patterns that can beseparated by a given family of decision surfaces are tobe determined.Suppose that the patterns x1, X2, , XN are chosenindependently according to a probability measure ,u onthe pattern space. The necessary and sufficient conditionon u such that, with probability 1, Xi , X2, - * , XNare in general position is d-space is that the probabilitybe zero that any point will fall on an y given (d-1)-dimensional subspace. In terms of 4)-surfaces, a se t ofvectors chosen independently according to a probabilitymeasure / is in +)-general position with probability 1if and only if every 4)-surface {xEEd:w - (x) =0} hast,u measure zero.Suppose that a dichotomy of X= {X1, X2, , XN iSchosen at random with equal probability from the 2Nequiprobable possible dichotomies of X. Let X be in+-general position with probability 1, and let P(N, d)be the probability that the random dichotomy is 4) -separable, where the class of 4)-surfaces has d degrees offreedom. Then with probability 1 there are C(N, d)4)-separable dichotomies, and

    P(NT, d) = (1)NC(N, d) = (1)N-1 Ed(AT 1) (20)which is just the cumulative binomial distribution cor-responding to the probability that N-1 flips of a faircoin result in d - 1 or fewer heads.One of the first applications of the function-countingtheorems to random pattern vectors was by Wendel[7], who found the probability that N random pointsli e in some hemisphere. Let the se t of vectorsX= Xi , X2, * * *, XN} be in general position on thesurface of a d-sphere with probability 1. In addition, letthe joint distribution of { X1, X2, - * * , XN be unchangedby the reflection of any subset through the origin. Underthese restrictions Wendel proves that the probabilitythat a set of N vectors randomly distributed on thesurface of a d-sphere is contained in some hemisphereis P(N, d).The proof of this result follows immediately fromSchlafli's theorem and the reflection invariance of the

    330 June

  • 8/3/2019 Geometrical and Statistical Properties of Systems Of

    6/9

    Cover: Properties of Systems of Linear Inequalitiesjoint probability distribution of X. This invariance im-plies that the probability (conditioned on X) that arandom dichotomy of X be separable is equal to the un-conditional probability that a particular dichotomy ofX (all N points in one category) be separable.

    V. SEPARATING CAPACITY OF A SURFACEIt will be shown that the expected maximum numberof randomly assigned vectors that are linearly separablein d dimensions is equal to 2d. It is thus possible to con-clude that a linear threshold device has an informationstorage capacity-relative to learning random dichoto-mies of a se t of patterns-of two patterns per variableweight. This result was originally conjectured byWidrow and Koford and experimentally as reported byWidrow [18] for the case of pattern vectors chosen atrandom from the set of vertices of a binary d-cube.Brown [19] found experimentally that the conjectureheld for patterns distributed at random in the unitd-sphere. This conjecture wa s supported theoretically

    by Winder [20], by Efron and Cover [21]-[23], andsubsequently by Brown [24].Let {x1, x2, . } be a sequence of random patternsas previously shown, and define the random variable Nto be the largest integer such that {X1, X2, - * *, XN } is4-separable, where 4) has d degrees of freedom. Then,from (20),Pr{N = n} = P(n, d) - P(n + 1, d)

    = (2)n g_ n == 0, 1, 2, (21)which is just the negative binomial distribution (shiftedd units right with parameters d and 2) . Thus N cor-responds to the waiting time for the dth failure in aseries of tosses of a fair coin, and

    E(N) = 2dMedian (N) = 2d. (22)

    The asymptotic probability that N patterns areseparable in d (N/2) + (a/2)V/N dimensions is/ N a

    P N, -+--Ny -(a) (23)where 4((a) is the cumulative normal distribution

    1 ra4(a)= e-x22dx. (24)-\2 0r_OIn addition, fo r E>0,

    lim P(2d(1 + e) , d) = 0P(2d, d) =

    lim P(2d(1 - E) , d) = 1d+oo

    as was shown by Winder [20]. Thus the probability ofseparability shows a pronounced threshold effect whenthe number of patterns is equal to twice the number ofdimensions. These results [21] confirm Koford's con-jecture and suggest that 2d is a natural d ef in it io n o fthe separating capacity of a family of decision surfaceshaving d degrees of freedom.If the number of patterns is fixed at N and the dimen-sionality of the space in which the patterns lie is allowedto increase, it follows that the probability that the di-mension d* at which the se t of patterns first becomesseparable is given by

    Pr {d * = d} = P(N, d) - P(N, d -1)/N-1\=(1)Nf-d 1)d = 1, 2, , 1\N. (26)Thus d* is binomially distributed (shifted one unit rightwith parameters N and 2) , and

    (27)+1

    E(d*)= N 2Equation (26) is partial justification for past statementsthat linear threshold devices are self-healing or can ad-just around their defects [18] because the separatingprobability of a linear threshold device is relatively in-sensitive to the number of parameters d fo r d> (N+ 1)/2.The fact that 2d is indeed a critical number for a systemof linear inequalities in d unknowns will be furtherestablished in Sections VI and VII.

    VI. GENERALIZATION AND LEARNINGLet X be a se t of N pattern vectors in general positionin d-space. This se t of pattern vectors, together with adichotomy of the set into two categories X+ and X-,will constitute a training set. On what basis can a newpoint be categorized into one of the two training cate-gories? This is the problem of generalization.Consider the problem of generalizing from the train-ing se t with respect to a given admissible family of de-cision surfaces (that family of surfaces that can be im-plemented by linear threshold devices). By some process,a decision surface from the admissible class will beselected which correctly separates the training se t intothe desired categories. Then the new pattern will be

    assigned to the category lying on the same side of thedecision surface. Clearly, for some dichotomies of these t of training patterns, the assignment of category willnot be unique. However, it is generally believed that,after a "large number" of training patterns, the state ofa linear threshold device is sufficiently constrained toyield a unique response to a new pattern. It will beshown that the number of training patterns must exceedthe statistical capacity of the linear threshold devicebefore unique generalization becomes probable.The classification of a pattern y with respect to thetraining se t { X+, X- } is said to be ambiguous relative toa given class of 4-surfaces, if there exists one +-surface

    1965 331

  • 8/3/2019 Geometrical and Statistical Properties of Systems Of

    7/9

    IEEE TRANSACTIONS ON ELECTRONIC COMPUTERSin the class that induces the dichotomy { X+J { y }, X- Iand another 4)-surface in the class that induces thedichotomy {X+, X-J {y} }. That is, there exist two-surfaces, both correctly separating the training set,but yielding different classifications of the new patterny. Thus, if w, and w2 ar e the parameter weight vectorsfor the two -surfaces, then

    w, +(x) > 0 and w2 +(x) > 0 for x C X+w, O(x) < 0 and w2 4(X) < 0 for x E X-and either

    w, (y) > 0 and w2 +(Y) < 0 (28)or

    w, +(y) < 0 and w2 - (y) > 0.In addition, y is said to be ambiguous with respect tothe training se t if the training set is not separable.In Fig. 2, for example, points yi and Y3 are unambigu-ous and point Y2 is ambiguous with respect to the train-in g se t { X+, X- } relative to the class of al l lines in theplane (not necessarily through the origin). Points y, andy3 are uniquely classified into sets X+ and X-, respec-tively, by any line separating X' and X-, while Y2 isclassified into X- by 11, and into X+ by 12.

    + 0Ox~~_;iY2i

    X 3QJY3Fig. 2. Ambiguous generalization.

    Theorem 6 establishes the probability that a newpattern is ambiguous with respect to a random dichot-omy of the training set. This probability is independentof the configuration of the pattern vectors.Theorem 6: Let XUJ{y} = xXl, x2 , , Xv, y} be in4-general position in d-space, where 4 = (X)1, 42 , * -, 4d) )Then y is ambiguous with respect to C(N, d -1) di-chotomies of X relative to the class of all 4-surfaces.Hence, if each of the 4-separable dichotomies of X hasequal probability, then the probability A (N, d) that yis ambiguous with respect to a random 4-separabledichotomy of X is

    exists a 4-surface containing y which separatesf X+, X- }. The proposition then follows from Theorem3 on noting that the separating vector w obeys the linearconstraintw.4)(y) = 0. (30)

    Applying the proposition to the example in Fig. 2,where X has ten points, it can be seen that each of theyi's is ambiguous with respect to C(10, 2) = 20 dichoto-mies of X relative to the class of all lines in the plane.Now C(10, 3) = 92 diclhotomies of X are separable by theclass of al l lines in the plane. Thus, a new pattern isambiguous with respect to a random, linearly separabledichotomy of X with probability

    C(10, 2) 5A(10, 3) = 3C(10, 3) 23 (31)The behavior of A (N, d) is indicated by examinationof its asymptotic form. Consider Badahur's expansion[251] of the cumulative binomial distribution, for N>2d,dQT) (32)

    where F is the hypergeometric functionF N + 1, 1; NT- d + 1;-J

    N+ 1 1 (N + 1)(N + 2) 1k+ 1 2 (k+ l)(k+ 2) 4

    andk = N- d.

    (33)

    (34)Let [x] denote the greatest integer less than x, and letN= [/d], /3>2. Then the terms of the expansion ofF( [3d] + 1, 1; [Od -d + 1; 2) are uniformly bounded andpositive, and the limit, as d increases, of the jth term is(,B/2 (,- 1))J. Thus, by the M-test of Weierstrass,

    d ATVi=0Vi tim dNN= [#d] tvX 2 1 -/2(- 1)d- o d

    ,3 - 1= ,~d - 2 (35)and

    A*(A) = lim A(N, d) = 1,d o (d- 1

    d 2 t - 10QNl, d - 1) k-O kJA(N, d) = =-C(N, d) d(N - 1) (29)Proof: From Lemma 1 of Section II, the point y isambiguous with respect to { X+, X- } if and only if there

    The graph of A* (/) is shown in Fig. 3. Note the relativelylarge number of training patterns required for un-ambiguous generalization. If it is recalled that thecapacity of a linear threshold device is /=2 patternsper variable weight, it will again be seen that the capac-

    (36)/3> 2

    332 June

    I N, 1F N+1,1;N-d+l;-2 d 2

    2

  • 8/3/2019 Geometrical and Statistical Properties of Systems Of

    8/9

    Cover: Properties of Systems of Linear Inequalitiest)D

    LL WIz>0-wm WIx~~~~~~~~~~~~~~~~~O 2 34 5

    N PATTERNSd DIMENSIONFig. 3. Asymptotic probability of ambiguous generalization.

    ity is a critical number in the description of the behaviorof a linear threshold device.If the patterns themselves are randomly distributed,the comments of Section IV concerning randomly dis-tributed patterns and random dichotomies of the pat-tern se t apply in full here. The crucial condition is thatthe pattern se t be in general position with probability 1.Thus, if a linear threshold device is trained on a se t ofN points chosen at random according to a uniform dis-tribution on the surface of a unit sphere in d-space, andthese points are classified independently with equalprobability into one of tw o categories, then it is readilyseen that the probability of error on a new patternsimilarly chosen, conditioned on the separability of theentire set, is just 'A (N, d).VII. EXTREME PATTERNS

    For any dichotomy of a se t of points, there exists aminimal sufficient subset of extreme points such that anyhyperplane correctly separating the subset must sepa-rate the entire se t correctly. Thus for any dichotomyX+, X-} of a se t of points in Ed , there exists a minimalse t Z contained in X+UX- such that w satisfies

    w*x>O, xCEX+w x 0, x X+TZwx < 0, xrC X-CZ. (38)

    The set Z will be called the set of extreme points of thedichotomy. The vectors in Z form the boundary matrixinvestigated by Mays [26].From this definition it follows that a point is an ex-treme point of the dichotomy {X+, X-} if an d only ifit is ambiguous (in the sense of Section VI) with respectto { X+, X- }. Thus, fo r a set of N points in general posi-tion in Ed, each of the N points is ambiguous with re-spect to precisely C(N- 1, d- 1) dichotomies of the re-maining N-1 points. Hence, each of these C(N- 1, d -1)dichotomies is the restriction of two homogeneouslylinearly separable dichotomies of the original set of N

    points-the two dichotomies which differed only in theclassification of the remaining point. Since there wereC(N, d) homogeneously linearly separable dichotomiesof N points, it is clear that if one of the C(N, d) separabledichotomies is selected at random according to an equi-probable distribution over the class, then a given pointwill be an extreme point with probability2C(N - 1, d - 1)/C(N, d).Then the expected number of extreme points R(N, d)will be equal to the sum of the N probabilities that eachpoint is an extreme point. Since these probabilities areequal,

    E{R(N, d) } = 2NC(N - 1, d - 1)C(N, d) (39)Utilizing 35, we may show that

    lim E {R(lv;, d)}N= [d] 2dd- o4=2'1,

    0 2

    See Cover [21] and [23] for related geometrical results.Note that the limiting average number of extremevectors is independent of ,B for /32. The capacity hasplayed a role again. Also observe that, since R(N, d) isa positive random variable, the probability is less than1/t that R exceeds its mean value by a factor t.The conclusion may be drawn that the averageamount of necessary and sufficient information for thecomplete characterization of the se t of separating sur-faces for a random, separable dichotomy of N pointsgrows slowly with N and asymptotically approaches 2d(twice the number of degrees of freedom of the class ofseparating surfaces). The implication for pattern-recognition devices is that the essential information inan infinite training se t can be expected to be storedin a computer of finite storage capacity.

    VIII. CONCLUSIONSThe original work of Winder, Joseph, Cameron,Schlafli, and others on counting the number of linearlyseparable dichotomies of a se t of points has been de -veloped as a unified whole and extended to counting1) the number of nonlinearly separable dichotomies ofa se t of points, and 2) the separable dichotomies ofrandom c ol le ct io ns o f p oi nt s. Application of this workto the vertices of a binary n-cube has yielded upperbounds on the number of polynomially separableBoolean functions. It has been shown that the naturalseparating capacity of a family of separating surfaces istwo pattern points for each degree of freedom, thus ex-tending the work done by Koford and Winder. Theseparating capacity was seen to arise naturally as acritical parameter in defining the probability of un-ambiguous generalization, and in defining the numberof extreme points characterizing the se t of patternpoints.

    1965 333

  • 8/3/2019 Geometrical and Statistical Properties of Systems Of

    9/9

    IEEE TRANSACTIONS ON ELECTRONIC COMPUTERSIt was shown that, fo r a random set of linear inequali-ties in d unknowns, the expected number of extremeinequalities, which are necessary and sufficient to implythe entire set, tends to 2d as the number of consistentinequalities tends to infinity, thus bounding the ex-pected necessary storage capacity fo r li ne ar decision

    algorithms in separable problems. The results, even thosedealing with randomly positioned points, have beencombinatorial in nature, and have been essentially inde-pendent of the configuration of the set of points in thespace.

    ACKNOWLEDGMENTThe author wishes to express his gratitude to N. Ab-

    ramson, B. Efron, N. Nilsson, and L. Zadeh for theircomments and suggestions. He also wishes to thankB. Brown, J. Koford, and B. Widrow for the formulationof several related problems which le d to the studies inthis report, and B. Elspas of Stanford Research Insti-tute for the limiting argument used in Section VI.A paper on related questions is in preparation withB. Efron.

    REFERENCES[1 ] Winder, R. O., Single stage threshold logic, Switching CircuitTheory and Logical Design, AIEE Special Publications S-134,Sep 1961, pp 321-332.[2 ] , Threshold logic, Ph.D. dissertation, Princeton University,Princeton, N. J., 1962.[3 ] Cameron, S. H., Tech Rept 60-600, Proceedings of the BionicsSymposium, Wright Air Development Division, Dayton, Ohio,1960, pp 197-212.[4] Joseph, R. D. , The number of orthants in n-space intersectedby an s-dimensional subspace, Tech Memo 8, Project PARA,Cornell Aeronautical Lab., Buffalo, N. Y. , 1960.[5 ] Whitmore, E. A. , an d D. G. Willis, Division of space by con-current hyperplanes, unpublished Internal Rept, LockheedMissiles & Space Co., Sunnyvale, Calif., 1960.[6] Schlifli, L., Gesammelte Mathematische Abhandlungen I. Basel,Switzerland: Verlag Birkhauser, 1950, pp 209-212.[7] Wendel, J. G., A problem in geometric probability, MathematicaScandinavica, vol 11, 1962, pp 109-111.[8 ] Cooper, P. W., The hyperplane in pattern recognition, Cy-bernetica, no 4, 1962, pp 215-238.

    [91 -, The hypersphere in pattern recognition, Information andControl, vo l 5, Dec 1962.[10] Aizerman, M. A. , E. M. Braverman, and L. I. Rozonoer, Theo-retical foundations of the potential function method in patternrecognition learning, Automatika i Telemekhanika, vol 25, Jun1964; translation puplished Jan 1965, pp 821-837.[11] Kaszerman, P., A nonlinear-summation threshold device,IEEE Trans. on Electronic Computers (Correspondence), vo lEC-12, Dec 1963, pp 914-915.[12] Greenberg, H. J., and A. G. Konheim, Linear and nonlinearmethods in pattern classification, IBM J. Res. Develop., vol 8,Jul 1964, pp 299-307.[13] Bishop, A. B., Adaptive pattern recognition, 1963 WESCONRept of Session 1.5, unpublished.[14] Wong, E., and E. Eisenberg, Iterative synthesis of thresholdfunctions. To appear in J. Mathematical Analysis and Applica-tions.[15] Cooper, J. A., Orthogonal expansion applied to the design ofthreshold element networks, Rept SEL-63-123, TR 6204-1,Stanford Electronics Labs., Stanford University, Stanford,Calif., Dec 1963.[16] Koford, J., Adaptive network organization, Rept SEL-63-009,Stanford Electronics Laboratories Quarterly Research Review,no 3, 1962, 111-6.[17] Novikoff, A., On convergence proofs for perceptrons, Symposiumon Mathematical Theory of Antomata. Brooklyn, N. Y.: Poly-technic Press, 1963, pp. 615-622.[18] Widrow, B., Generalization and information storage in network:of adaline "neurons," Self Organizing Systems. Washington:Spartan Books, 1962, pp 442, 459.[19] Brown, R., Logical properties of adaptive networks, ReptSEP-62-109, Stanford Electronics Laboratories Quarterly Re-search Review, no 1, 1962, pp 87-88.[20] Winder, R. O., Bounds on threshold gate realizability, IEEETrans. on Electronic Computers (Correspondence), vol EC-12,Oct 1963, pp 561-564.[21] Efron, B., and T. Cover, Linear separability of random vectors,unpublished Internal Rept, Stanford Research Institute, MenloPark, Calif., Mar 1963.[22] Cover, T., Classification and generalization capabilities of linearthreshold units, Documentary Rept RADC-TDR-64-32, RomeAir Development Center, Griffiss AFB, N. Y. , Feb 1964.[23] Cover, T. M., Geometrical and statistical properties of linearthreshold devices, Rept SEL-64-052, TR 6107-1, Stanford Elec-tronics Labs., Stanford University, Stanford, Calif., May 1964.[24] Brown, R. J., Adaptive multiple-output threshold systems andtheir storage capacities, TR 6771-1, Stanford Electronics Labs.,Stanford University, Stanford, Calif., Ju n 1964.[25] Bahadur, R. R., Some approximations to the binomial distribution function, Annals of Mathematical Statistics, vol 31,Mar 1960, pp 43-54.[26] Mays, C. H. , Adaptive threshold logic, Rept SEL-63-927, TR1557-1, Stanford Electronics Labs., Stanford University, Stan-ford, Calif., Apr 1963.

    334