
Efficient Active Learning of Halfspaces: an Aggressive Approach

Alon Gonen [email protected]

Benin School of CSE, The Hebrew University

Sivan Sabato [email protected]

Microsoft Research New England

Shai Shalev-Shwartz [email protected]

Benin School of CSE, The Hebrew University

Abstract

We study pool-based active learning of half-spaces. We revisit the aggressive approach for active learning in the realizable case, and show that it can be made efficient and practical, while also having theoretical guarantees under reasonable assumptions. We further show, both theoretically and experimentally, that it can be preferable to mellow approaches. Our efficient aggressive active learner of half-spaces has formal approximation guarantees that hold when the pool is separable with a margin. While our analysis is focused on the realizable setting, we show that a simple heuristic allows using the same algorithm successfully for pools with low error as well. We further compare the aggressive approach to the mellow approach, and prove that there are cases in which the aggressive approach results in significantly better label complexity compared to the mellow approach. Experiments demonstrate that substantial improvements in label complexity can be achieved using the aggressive approach, in realizable and low-error settings.

1. Introduction

We consider pool-based active learning (McCallum & Nigam, 1998), in which a learner receives a pool of unlabeled examples, and can iteratively query a teacher for the labels of examples from the pool. The goal of the learner is to return a low-error prediction rule for the labels of the examples, using a small number of queries. The number of queries used by the learner is termed its label complexity. This setting is most useful when unlabeled data is abundant but labeling is expensive, a common case in many data-laden applications. A pool-based algorithm can be used to learn a classifier in the standard PAC model, while querying fewer labels. This can be done by first drawing a random unlabeled sample to be used as the pool, then using pool-based active learning to identify its labels with few queries, and then using the resulting labeled sample as input to a regular "passive" PAC-learner.

Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA, 2013. JMLR: W&CP volume 28. Copyright 2013 by the author(s).

Most active learning approaches can be loosely described as more 'aggressive' or more 'mellow'. A more aggressive approach is one in which only 'highly informative' queries are requested (Tong & Koller, 2002; Balcan et al., 2007; Dasgupta et al., 2005), while the mellow approach, first proposed in the CAL algorithm (Cohn et al., 1994), is one in which the learner essentially queries all the labels it has not inferred yet.

In recent years a significant advancement has been made for active learning in the PAC model. In particular, it has been shown that when the data is realizable (that is, incurs zero error under some assumed hypothesis class), the mellow approach can guarantee an exponential improvement in label complexity, compared to passive learning (Balcan et al., 2006). This exponential improvement depends on the properties of the distribution, as quantified by the Disagreement Coefficient proposed by Hanneke (2007). Specifically, when learning half-spaces in Euclidean space, the disagreement coefficient implies a low label complexity when the data distribution is uniform or close to uniform. El-Yaniv & Wiener (2012) have shown guarantees if the data distribution is a finite mixture of Gaussians.

An advantage of the mellow approach is its ability to obtain label complexity improvements in the agnostic setting, which allows an arbitrary and large labeling error (Balcan et al., 2006; Dasgupta et al., 2007). Nonetheless, in the realizable case the mellow approach is not always optimal, even for the uniform distribution (Balcan et al., 2007). In this work we revisit the aggressive approach for the realizable case, and in particular for active learning of half-spaces in Euclidean space. We show that it can be made efficient and practical, while also having theoretical guarantees under reasonable assumptions. We further show, both theoretically and experimentally, that it can sometimes be preferable to mellow approaches.

In the first part of this work we construct an efficient aggressive active learner for half-spaces in Euclidean space, which is approximately optimal if the pool is separable with a margin. While our analysis is focused on the realizable setting, we show that a simple heuristic allows using the same algorithm successfully for pools with low error as well. Our algorithm for halfspaces is based on a greedy query selection approach as proposed in Tong & Koller (2002); Dasgupta (2005). We obtain improved target-dependent approximation guarantees for greedy selection in a general active learning setting. These guarantees allow us to prove meaningful approximation guarantees for halfspaces based on a margin assumption.

In the second part of this work we compare the greedy approach to the mellow approach. We prove that there are cases in which this highly aggressive greedy approach results in significantly better label complexity compared to the mellow approach. We demonstrate experimentally that substantial improvements in label complexity can be achieved compared to mellow approaches, for both realizable and low-error settings.

The first greedy query selection algorithm for learning halfspaces in Euclidean space was proposed by Tong & Koller (2002). The greedy algorithm is based on the notion of a version space: the set of all hypotheses in the hypothesis class that are consistent with the labels currently known to the learner. In the case of halfspaces, each version space is a convex body in Euclidean space. Each possible query thus splits the current version space into two parts: the version space that would result if the query received a positive label, and the one resulting from a negative label. Tong and Koller proposed to query the example from the pool that splits the version space as evenly as possible. To implement this policy, one would need to calculate the volume of a convex body in Euclidean space, a problem which is known to be computationally intractable (Brightwell & Winkler, 1991). Tong and Koller thus implemented several heuristics that attempt to follow their proposed selection principle using an efficient algorithm. For instance, they suggest choosing the example which is closest to the max-margin solution of the data labeled so far. However, none of their heuristics provably follow this greedy selection policy.
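Their "closest to the separator" heuristic is easy to picture: fit a separator to the labels queried so far, then query the unlabeled pool point nearest its hyperplane. A minimal Python sketch follows; the function name is ours, and a crude class-mean direction stands in for the max-margin solution their heuristic actually uses:

```python
import numpy as np

def query_near_boundary(pool, labeled_idx, labels):
    """Sketch of a 'closest to the separator' query rule: estimate a
    separating direction from the queried labels, then return the
    index of the unqueried pool point nearest the hyperplane."""
    X, y = pool[labeled_idx], np.asarray(labels, dtype=float)
    # crude stand-in for the max-margin solution: class-mean difference
    w = X[y > 0].mean(axis=0) - X[y < 0].mean(axis=0)
    dist = np.abs(pool @ w) / np.linalg.norm(w)
    dist[labeled_idx] = np.inf  # never re-query an already-labeled point
    return int(np.argmin(dist))

# toy pool: after labeling the two extreme points, the near-boundary
# point (index 2) is selected next
pool = np.array([[-2.0, 0.0], [2.0, 0.0], [0.1, 1.0], [3.0, 3.0]])
print(query_near_boundary(pool, [0, 1], [-1, 1]))  # → 2
```

As the text notes, such heuristics are efficient but do not provably track the volume-splitting greedy policy.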

The label complexity of a greedy pool-based active learning algorithm can be analyzed by comparing it to the best possible label complexity of any pool-based active learner on the same pool. The worst-case label complexity of an active learner is the maximal number of queries it would make on the given pool, where the maximum is over all the possible classification rules that can be consistent with the pool according to the given hypothesis class. The average-case label complexity of an active learner is the average number of queries it would make on the given pool, where the average is taken with respect to some fixed probability distribution P over the possible classifiers in the hypothesis class. For each of these definitions, the optimal label complexity is the lowest label complexity that can be achieved by an active learner on the given pool. Since implementing the optimal label complexity is usually computationally intractable, an alternative is to implement an efficient algorithm, and to guarantee a bounded factor of approximation on its label complexity, compared to the optimal label complexity.

Dasgupta (2005) showed that if a greedy algorithm splits the probability mass of the version space as evenly as possible, as defined by the fixed probability distribution P over the hypothesis class, then the approximation factor for its average label complexity, with respect to the same distribution, is bounded by O(log(1/pmin)), where pmin is the minimal probability of any possible labeling of the pool when the classifier is drawn according to the fixed distribution. Golovin & Krause (2010) extended Dasgupta's result and showed that a similar bound holds for an approximate greedy rule. They also showed that the approximation factor for the worst-case label complexity of an approximate greedy rule is also bounded by O(log(1/pmin)), thus extending a result of Arkin et al. (1993). Note that in the worst-case analysis, the fixed distribution is only an analysis tool, and does not represent any assumption on the true probability of the possible labelings.

Returning to greedy selection of halfspaces in Euclidean space, we can see that the fixed distribution over hypotheses that matches the volume-splitting strategy is the distribution that draws a halfspace uniformly from the unit ball. The analysis presented above thus can result in poor approximation factors, since if there are instances in the pool that are very close to each other, then pmin might be very small.

We first show that mild conditions suffice to guarantee that pmin is bounded from below. By proving a variant of a result due to Muroga et al. (1961), we show that if the examples in the pool are stored using numbers of a finite accuracy 1/c, then pmin ≥ (c/d)^{d^2}, where d is the dimensionality of the space. It follows that the approximation factor for the worst-case label complexity of our algorithm is at most O(d^2 log(d/c)).

While this result provides us with a uniform lower bound on pmin, in many real-world situations the probability of the target hypothesis (i.e., one that is consistent with the true labeling) could be much larger than pmin. A noteworthy example is when the target hypothesis separates the pool with a margin of γ. In this case, it can be shown that the probability of the target hypothesis is at least γ^d, which can be significantly larger than pmin. An immediate question is therefore: can we obtain a target-dependent label complexity approximation factor that would depend on the probability of the target hypothesis, P(h), instead of the minimal probability of any labeling?

We prove that such a target-dependent bound does not hold for a general approximate-greedy algorithm. To overcome this, we introduce an algorithmic change to the approximate greedy policy, which allows us to obtain a label complexity approximation factor of log(1/P(h)). This can be achieved by running the approximate-greedy procedure, but stopping the procedure early, before reaching a pure version space that exactly matches the labeling of the pool. Then, an approximate majority vote over the version space can be used to determine the labels of the pool. This result is general and holds for any hypothesis class and distribution. For halfspaces, it implies an approximation-factor guarantee of O(d log(1/γ)).

We use this result to provide an efficient approximately-optimal active learner for half-spaces, called ALuMA, which relies on randomized approximation of the volume of the version space (Kannan et al., 1997). This allows us to prove a margin-dependent approximation factor guarantee for ALuMA. The assumption of separation with a margin can be relaxed if a small upper bound on the total hinge-loss of the best separator for the pool can be assumed. We show that under such an assumption a simple transformation on the data allows running ALuMA as if the data was separable with a margin. This results in approximately optimal label complexity with respect to the new representation.

We also derive lower bounds, showing that the dependence of our label-complexity guarantee on the accuracy c, or the margin parameter γ, is indeed necessary and is not an artifact of our analysis. We do not know if the dependence of our bounds on d is tight. It should be noted that some of the most popular learning algorithms (e.g. SVM, Perceptron, and AdaBoost) rely on a large-margin assumption to derive dimension-independent sample complexity guarantees. In contrast, here we use the margin for computational reasons. Our approximation guarantee depends logarithmically on the margin parameter, while the sample complexities of SVM, Perceptron, and AdaBoost depend polynomially on the margin. Hence, we require a much smaller margin than these algorithms do. In a related work, Balcan et al. (2007) proposed an active learning algorithm with dimension-independent guarantees under a margin assumption. These guarantees hold for a restricted class of data distributions.

In the second part of this work, we compare the greedy approach to the mellow approach of CAL in the realizable case, both theoretically and experimentally. Our theoretical results show the following: (1) In the simple learning setting of thresholds on the line, our margin-based approach is preferable to the mellow approach when the true margin of the target hypothesis is large. (2) There exists a distribution in Euclidean space such that the mellow approach cannot achieve a significant improvement in label complexity over passive learning for halfspaces, while the greedy approach achieves such an improvement using more unlabeled examples. (3) There exists a pool in Euclidean space such that the mellow approach requires exponentially more labels than the greedy approach.

We further compare the two approaches experimentally, both on separable data and on data with small error. The empirical evaluation indicates that our algorithm, which can be implemented in practice, achieves state-of-the-art results. It further suggests that aggressive approaches can be significantly better than mellow approaches in some practical settings.

We show main results for the greedy algorithm in Section 2. The ALuMA algorithm is described in Section 3. Results and experiments for the aggressive and the mellow approaches are presented in Section 4. Full proofs and additional experiments are provided in the full version of this paper (Gonen et al., 2012).

2. Results for Greedy Active Learning

Let us first introduce some notation. Given a pool X = {x1, . . . , xm}, where each instance xi is associated with an unknown label L(i) ∈ {±1}, the goal of the learner is to find the values L(1), . . . , L(m) using as few calls to the oracle L as possible. We assume that L is determined by a function h taken from a predefined hypothesis class H. That is, ∃h ∈ H such that for all i, L(i) = h(xi). We denote this by L ∼ h. An active learning algorithm A obtains (X, L, T) as input, where T is the label budget of A. We denote by N(A, h) the number of calls to L that A makes before outputting (L(x1), . . . , L(xm)), under the assumption that L ∼ h. The worst-case label complexity of A is defined to be cwc(A) := max_{h∈H} N(A, h). We denote the optimal worst-case label complexity for the given pool by OPT. Formally, we define OPT = min_A cwc(A), where the minimum is taken over all possible active learners for the given pool.

For a given active learner, we denote by Vt ⊆ H the version space of the active learner after t queries. Formally, suppose that the active learner queried instances i1, . . . , it in the first t iterations. Then

Vt = {h ∈ H | ∀j ∈ [t], h(x_{ij}) = L(ij)}.

For a given pool example x ∈ X, denote by V^j_{t,x} the version space that would result if the algorithm now queried x and received the label j. Formally,

V^j_{t,x} = Vt ∩ {h ∈ H | h(x) = j}.

Given X and H, we define, for each h ∈ H, the equivalence class of h over H, [h] = {h′ ∈ H | ∀x ∈ X, h(x) = h′(x)}. We consider a probability distribution P over H such that P([h]) is defined for all h ∈ H. For brevity, we denote P(h) = P([h]). Let pmin = min_{h∈H} P(h).

Consider an active learning algorithm A. For any α ≥ 1, A is α-approximately greedy with respect to a probability distribution P over H if, at any iteration t of the algorithm, the query x ∈ X chosen by A satisfies

E_{h∼P}[1 − P(V^{h(x)}_{t,x})] ≥ (1/α) · max_{x′∈X} E_{h∼P}[1 − P(V^{h(x′)}_{t,x′})],

and the algorithm returns a labeling consistent with VT upon termination. Golovin & Krause (2010) have shown that any α-approximately greedy algorithm finds the correct labeling after at most O(α · OPT · log(1/pmin)) queries.
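For a finite hypothesis class under a uniform P, the exact (α = 1) greedy rule is simple to spell out: each candidate query splits the current version space into two masses, and the expected remaining mass is minimized by the most even split. A minimal sketch, with names of our own choosing, purely for illustration:

```python
def greedy_query(pool, version_space, predict):
    """Exact greedy selection (sketch) for a finite hypothesis class
    under a uniform prior: return the index of the pool point whose
    two possible answers split the version space most evenly, which
    maximizes the expected reduction in version-space mass."""
    def split_score(x):
        pos = sum(1 for h in version_space if predict(h, x) == 1)
        return pos * (len(version_space) - pos)  # largest when even
    return max(range(len(pool)), key=lambda i: split_score(pool[i]))

# toy instance: thresholds on the line, five candidate thresholds
pool = [0.1, 0.3, 0.5, 0.9]
thresholds = [0.0, 0.2, 0.4, 0.6, 0.8]
predict = lambda c, x: 1 if x > c else -1
print(greedy_query(pool, thresholds, predict))  # → 1 (0.3 splits 2 vs 3)
```

For halfspaces the version space is continuous, so the counts above become volumes, which is where the computational difficulty discussed later enters.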

We start by showing that by slightly changing the policy of an approximately-greedy algorithm, we can achieve a better approximation factor whenever the true target hypothesis has a larger probability than pmin. This can be done by allowing the algorithm to stop before it reaches a pure version space, and requiring that in this case, it would output the labeling which is most likely based on the current version space and the fixed probability distribution P.

We say that A outputs an approximate majority vote if, whenever VT is pure enough, the algorithm outputs the majority vote on VT. Formally, we define this as follows.

Definition 1. An algorithm A outputs a β-approximate majority vote for β ∈ (1/2, 1) if, whenever there exists a labeling Z : X → {±1} such that P_{h∼P}[Z ∼ h | h ∈ VT] ≥ β, A outputs Z.

In the following theorem we provide the target-dependent label complexity bound, which holds for any approximate greedy algorithm that outputs an approximate majority vote. We give here a sketch of the proof idea. The full proof is provided in Gonen et al. (2012).

Theorem 1. Let X = {x1, . . . , xm}. Let H be a hypothesis class, and let P be a distribution over H. Suppose that A is α-approximately greedy with respect to P. Further suppose that it outputs a β-approximate majority vote. If A is executed with input (X, L, T) where L ∼ h ∈ H and T ≥ α(2 ln(1/P(h)) + ln(β/(1−β))) · OPT, then A outputs L(1), . . . , L(m).

Sketch. Fix a pool X. For any algorithm alg, denote by Vt(alg, h) the version space induced by the first t labels it queries if the true labeling of the pool is consistent with h. Denote the average version-space reduction of alg after t queries by

favg(alg, t) = 1 − E_{h∼P}[P(Vt(alg, h))].

Golovin & Krause (2010) prove that since A is α-approximately greedy, for any pool-based algorithm alg and for every k, t ∈ N,

favg(A, t) ≥ favg(alg, k)(1 − exp(−t/(αk))). (1)

Let opt be an algorithm that achieves OPT. We show that for any hypothesis h ∈ H and any active learner alg, favg(opt, OPT) − favg(alg, t) ≥ P(h)(P(Vt(alg, h)) − P(h)). Combining this with Equation (1) we conclude that if A is α-approximately greedy then

P(h)/P(Vt(A, h)) ≥ P(h)^2 / (exp(−t/(α·OPT)) + P(h)^2).

This means that if P(h) is large enough and we run an approximate greedy algorithm, then after a sufficient number of iterations, most of the remaining version space induces the correct labeling of the sample. Specifically, if t ≥ α(2 ln(1/P(h)) + ln(β/(1−β))) · OPT, then P(h)/P(Vt(A, h)) ≥ β. Since A outputs a β-approximate majority labeling from Vt(A, h), A returns the correct labeling.

When P(h) ≫ pmin, the bound in Theorem 1 is stronger than the guarantee obtained by Golovin & Krause (2010). Importantly, such an improved approximation factor cannot be obtained for a general approximate-greedy algorithm, even in a very simple setting. Thus, we can conclude that some algorithmic change is necessary. To show this, consider the setting of thresholds on the line. In this setting, the domain of examples is [0, 1], and the hypothesis class Hline includes all the hypotheses which are positive if the example is larger than some threshold in [0, 1].

Theorem 2. Consider pool-based active learning on Hline, and assume that P on Hline selects hc by drawing the value c uniformly from [0, 1]. For any α > 1 there exists an α-approximately greedy algorithm A such that for any m > 0 there exists a pool X ⊆ [0, 1] of size m and a threshold c with P(hc) = 1/2, such that the label complexity of A for L ∼ hc is (m/⌈log(m)⌉) · OPT.

Interestingly, this theorem does not hold for α = 1, that is, for the exact greedy algorithm. This follows from Theorem 4, which we state and prove in Section 4.
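For intuition about Hline, the exact greedy learner repeatedly halves the set of candidate boundary positions, i.e., it performs binary search over the sorted pool. A minimal sketch, assuming a sorted pool and a realizable labeling (helper name is ours):

```python
def label_by_binary_search(xs, oracle):
    """Sketch: for thresholds on the line, greedy querying reduces to
    binary search for the boundary index b with xs[i] labeled -1 for
    i < b and +1 for i >= b (xs sorted, labeling realizable)."""
    lo, hi, queries = 0, len(xs), 0
    while lo < hi:
        mid = (lo + hi) // 2
        queries += 1
        if oracle(mid) == 1:
            hi = mid        # boundary is at mid or to its left
        else:
            lo = mid + 1    # boundary is strictly to the right of mid
    return [-1] * lo + [1] * (len(xs) - lo), queries

xs = [i / 16 for i in range(16)]
labels, q = label_by_binary_search(xs, lambda i: 1 if xs[i] > 0.27 else -1)
print(q)  # all 16 labels recovered with about log2(16) queries
```

Theorem 2 shows that this efficiency can be destroyed by an adversarially chosen α-approximately greedy learner, even for α arbitrarily close to 1.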

So far we have considered a general hypothesis class. We now discuss the class of halfspaces in Rd, W = {x ↦ sgn(⟨w, x⟩) : w ∈ B^d_1}, where B^d_1 is the unit ball in Rd. For simplicity, we will slightly overload notation and sometimes use w to denote the halfspace it determines. We fix the distribution P to be the one that selects a vector w uniformly from B^d_1. The following lemma shows that our active learning algorithm for halfspaces has the desired properties described above with high probability.

Lemma 1. If ALuMA is executed with confidence δ, then with probability 1 − δ over its internal randomization, ALuMA is 4-approximately greedy and outputs a 2/3-approximate majority vote. ALuMA is polynomial in the pool size, the dimension, and log(1/δ).

Combining the above lemma with Theorem 1 we immediately obtain that ALuMA's label complexity is O(log(1/P(h)) · OPT). We can upper-bound log(1/P(h)) using the familiar notion of margin: for any hypothesis h ∈ W defined by some w ∈ Rd, let γ(h) be the maximal margin of the labeling of X by h, namely γ(h) = max_{v:‖v‖=1} min_{i∈[m]} h(xi)⟨v, xi⟩/‖xi‖.

We show that for all h ∈ W, P(h) ≥ (γ(h)/2)^d. From this and Lemma 1, we obtain the following corollary, which provides a guarantee for ALuMA that depends on the margin of the target hypothesis.

Corollary 1. Let X = {x1, . . . , xm} ⊆ B^d_1, where B^d_1 is the unit Euclidean ball of Rd. Let δ ∈ (0, 1) be a confidence parameter. Suppose that ALuMA is executed with input (X, L, T, δ), where L ∼ h ∈ W and T ≥ 4(2d ln(2/γ(h)) + ln(2)) · OPT. Then, with probability at least 1 − δ over ALuMA's own randomization, it outputs L(1), . . . , L(m).

Our result for ALuMA provides a target-dependent approximation factor guarantee, depending on the margin of the target hypothesis.¹ We can also consider the minimal possible margin, γ = min_{h∈W} γ(h), and deduce from Theorem 1, or from the results of Golovin & Krause (2010), a uniform approximation factor of O(d log(1/γ)). How small can γ be? The following result bounds this minimal margin from below under the reasonable assumption that the examples are represented by numbers of a finite accuracy.

Lemma 2. Let c > 0 be such that 1/c is an integer, and suppose that X ⊂ {−1, −1 + c, . . . , 1 − c, 1}^d. Then min_{h∈W} γ(h) ≥ (c/√d)^{d+2}.

The proof is an adaptation of a classic result due to Muroga et al. (1961). We conclude that under this assumption, for halfspaces, pmin = Ω((c/d)^{d^2}), and deduce an approximation factor of O(d^2 log(d/c)) for the worst-case label complexity of ALuMA. The exponential dependence of the minimal margin on d here is necessary; as shown in Hastad (1994), the minimal margin can indeed be exponentially small, even if the points are taken only from {±1}^d.

We also derive a lower bound, showing that the dependence of our bounds on γ or on c is necessary. Whether the dependence on d is also necessary is an open question for future work.

Theorem 3. For any γ ∈ (0, 1/8), there exists a pool X ⊆ B^2_1 ∩ {−1, −1 + c, . . . , 1 − c, 1}^2 for c = Θ(γ), and a target hypothesis h* ∈ W with γ(h*) = Ω(γ), such that there exists an exact greedy algorithm that requires Ω(ln(1/γ)) = Ω(ln(1/c)) labels in order to output a correct majority vote, while the optimal algorithm requires only O(log(log(1/γ))) queries.

3. The ALuMA algorithm

We now describe our algorithm, listed below as Alg. 1, and explain why Lemma 1 holds. We name the algorithm Active Learning under a Margin Assumption, or ALuMA. Its inputs are the unlabeled sample X, the labeling oracle L, the maximal allowed number of label queries T, and the desired confidence δ ∈ (0, 1). It returns the labels of all the examples in X.

¹ One may suggest that the approximation factor we achieve for ALuMA in Lemma 1 is due to its internal randomization. We show in Gonen et al. (2012) that this is not the case.

Recall that we take P to be uniform over W, the class of homogeneous half-spaces in Rd. In each iteration we wish to choose, among the instances in the pool, the instance whose label would lead to the maximal expected reduction in the version space. Denote by It the set of indices corresponding to the elements in the pool whose label was not queried yet (I0 = [m]). This is equivalent to finding, on round t,

k := argmax_{i∈It} P(V^1_{t,xi}) · P(V^{−1}_{t,xi}).

In order to be able to find k, we need to calculate the volumes of the sets V^1_{t,x} and V^{−1}_{t,x} for every element x in the pool. Both of these sets are convex sets obtained by intersecting the unit ball with halfspaces. The problem of calculating the volume of such convex sets in Rd is #P-hard if d is not fixed (Brightwell & Winkler, 1991). Moreover, deterministically approximating the volume is NP-hard in the general case (Matousek, 2002). Luckily, it is possible to approximate this volume using randomization. Specifically, in Kannan et al. (1997) a randomized algorithm with the following guarantees is provided:

Lemma 3. Let K ⊆ Rd be a convex body with an efficient separation oracle. There exists a randomized algorithm, such that given ε, δ > 0, with probability at least 1 − δ the algorithm returns a number Γ ≥ 0 such that (1 − ε)Γ < P(K) < (1 + ε)Γ. The running time of the algorithm is polynomial in d, 1/ε, ln(1/δ).
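To see what such a guarantee buys, here is a deliberately naive Monte Carlo stand-in for the volume-estimation step (the function name is ours): it estimates P(K) by rejection over uniform ball samples. Unlike the algorithm behind Lemma 3, its cost is not polynomial in d, since the acceptance probability can be exponentially small, which is exactly why the lemma matters:

```python
import numpy as np

rng = np.random.default_rng(0)

def vol_est_mc(A, d, n=200_000):
    """Naive Monte Carlo stand-in for a volume estimator: estimate
    P(K) for K = {w in the unit ball : Aw > 0} as the fraction of
    uniform ball samples satisfying every halfspace constraint."""
    g = rng.normal(size=(n, d))
    g /= np.linalg.norm(g, axis=1, keepdims=True)  # uniform directions
    w = g * rng.uniform(size=(n, 1)) ** (1.0 / d)  # uniform in the ball
    return np.all(w @ A.T > 0, axis=1).mean()

# sanity check: one halfspace through the origin halves the ball
print(round(vol_est_mc(np.array([[1.0, 0.0, 0.0]]), d=3), 2))  # ≈ 0.5
```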

We denote an execution of this algorithm on a convex body K by Γ ← VolEst(K, ε, δ). ALuMA uses this algorithm to estimate P(V^1_{t,x}) and P(V^{−1}_{t,x}) with sufficient accuracy. We denote these approximations by vx,1 and vx,−1 respectively. Using Lemma 3, we show that with probability at least 1 − δ/2, ALuMA is 4-approximately greedy.

After T iterations, ALuMA needs to output the majority vote of a version space V that has a high enough purity level. To do this, we would like to uniformly draw several hypotheses from V and label X according to a majority vote over these hypotheses. This can be approximated using the hit-and-run algorithm (Lovasz, 1999), which efficiently draws a random sample from a convex body K ⊆ Rd according to a distribution which is λ-close in total variation distance to the uniform distribution over K, in time O(d^3/λ^2). In the final step of ALuMA, we produce a majority vote classification from a distribution which is (1/12)-close to uniform. This allows proving that ALuMA outputs a 2/3-approximate majority vote with probability at least 1 − δ/2.
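Hit-and-run is short to state for the bodies ALuMA works with (the unit ball cut by homogeneous halfspaces). The sketch below is a plain, unoptimized walk of our own, for illustration rather than the analyzed variant of Lovasz (1999), followed by a majority vote over the drawn hypotheses:

```python
import numpy as np

rng = np.random.default_rng(0)

def hit_and_run(A, w, n_steps=200):
    """Sketch of a hit-and-run walk in K = {w : ||w|| < 1, Aw > 0},
    the unit ball cut by halfspace constraints accumulated from the
    queried labels (row i of A is y_i * x_i).  The start point w must
    lie strictly inside K; the walk approaches the uniform law."""
    for _ in range(n_steps):
        u = rng.normal(size=w.shape)
        u /= np.linalg.norm(u)                  # random direction
        b = w @ u                               # chord inside the ball:
        disc = np.sqrt(b * b - w @ w + 1.0)     # roots of ||w + t u|| = 1
        t_lo, t_hi = -b - disc, -b + disc
        for a in A:                             # tighten: a.(w + t u) > 0
            au = a @ u
            if abs(au) > 1e-12:
                t = -(a @ w) / au
                if au > 0:
                    t_lo = max(t_lo, t)
                else:
                    t_hi = min(t_hi, t)
        w = w + rng.uniform(t_lo, t_hi) * u     # uniform point on the chord
    return w

# majority vote on x = (1, 0) after its own label +1 was queried:
A = np.array([[1.0, 0.0]])                      # constraint w_1 > 0
samples = [hit_and_run(A, np.array([0.5, 0.0])) for _ in range(25)]
votes = sum(np.sign(s @ np.array([1.0, 0.0])) for s in samples)
print(int(np.sign(votes)))  # → 1
```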

Algorithm 1 The ALuMA algorithm

1: Input: X = {x1, . . . , xm}, L : [m] → {−1, 1}, T, δ
2: I1 ← [m], V1 ← B^d_1
3: for t = 1 to T do
4:   ∀i ∈ It, j ∈ {±1}: vxi,j ← VolEst(V^j_{t,xi}, 1/3, δ/(4mT))
5:   Select it ∈ argmax_{i∈It}(vxi,1 · vxi,−1)
6:   It+1 ← It \ {it}
7:   Request y = L(it)
8:   Vt+1 ← Vt ∩ {w : y⟨w, xit⟩ > 0}
9: end for
10: M ← ⌈72 ln(2/δ)⌉
11: Draw w1, . . . , wM (1/12)-uniformly from VT+1
12: For each xi, return yi = sgn(Σ_{j=1}^{M} sgn(⟨wj, xi⟩))
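The steps above can be sketched end to end. The toy below keeps the listing's structure (lines 4-5, 7, 8, 10-12) but, for illustration only, represents the version space by uniform Monte Carlo samples from the ball instead of calling VolEst and hit-and-run; this is fine in dimension 2 but lacks the polynomial-time guarantees of the real algorithm:

```python
import numpy as np

rng = np.random.default_rng(1)

def ball_sample(d, n):
    """Uniform samples from the unit ball B^d_1."""
    g = rng.normal(size=(n, d))
    g /= np.linalg.norm(g, axis=1, keepdims=True)
    return g * rng.uniform(size=(n, 1)) ** (1.0 / d)

def mini_aluma(X, oracle, T, n_mc=4000):
    """Toy rendering of the listing: the retained Monte Carlo samples
    stand in for the version space, for both volume estimation and
    the final (approximate) majority vote."""
    m, d = X.shape
    W = ball_sample(d, n_mc)                 # proxy for V_1 = B^d_1
    I = set(range(m))
    for _ in range(T):
        pos = ((W @ X.T) > 0).mean(axis=0)   # mass of each positive split
        i = max(I, key=lambda i: pos[i] * (1 - pos[i]))   # lines 4-5
        y = oracle(i)                        # line 7: query the label
        I.discard(i)
        W = W[y * (W @ X[i]) > 0]            # line 8: cut the version space
    # lines 10-12: majority vote of the surviving sampled hypotheses
    return np.sign(np.sign(W @ X.T).sum(axis=0)).astype(int)

X = np.array([[1.0, 0.1], [0.9, -0.5], [-1.0, 0.2], [-0.7, -0.7]])
w_star = np.array([1.0, 0.0])                # hidden target halfspace
print(mini_aluma(X, lambda i: int(np.sign(X[i] @ w_star)), T=3).tolist())
```

With three queries the unqueried fourth label is recovered by the majority vote, illustrating the early-stopping idea behind Theorem 1.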

3.1. Non-Separable Data and Kernels

If the data pool X is not separable, but a small upper bound on the total hinge-loss of the best separator can be assumed, then ALuMA can be applied after a preprocessing step. This preprocessing step maps the points in X to a set of points in a higher dimension, which are separable using the original labels of X. The dimensionality depends on the margin and on the bound on the total hinge-loss of the original representation. The preprocessing step can also be enhanced to support kernel representations, so that the original X can be represented by a kernel matrix as well. Applying ALuMA after this preprocessing step results in an approximately optimal label complexity; however, OPT here is measured with respect to the new representation. See the full details in Gonen et al. (2012). We demonstrate that in practice, this procedure provides good results on real data sets. Investigating the relationship between OPT in the new representation and OPT in the original representation is an important question for future work.

4. Other Approaches: A Theoretical and Empirical Comparison

We now compare the effectiveness of the approach of ALuMA to other active learning strategies. ALuMA can be characterized by two properties: (1) its “objective” is to reduce the volume of the version space; (2) at each iteration, it aggressively selects an example from the pool so as to (approximately) minimize its objective as much as possible (in a greedy sense). We discuss the implications of these properties by comparing to other strategies. Property (1) is contrasted with strategies that focus on increasing the number of examples whose label is known. Property (2) is contrasted with strategies which are “mellow”, in that their criterion for querying examples is softer.

Much research has been devoted to the challenge of obtaining a substantial guaranteed improvement of label complexity over regular “passive” learning for halfspaces in R^d. Examples (for the realizable case) include QBC (Seung et al., 1992; Freund et al., 1997), CAL (Cohn et al., 1994), and Active Perceptron (Dasgupta et al., 2005). These algorithms are not “pool-based” but rather use “selective sampling”: they sample one example at each iteration, and immediately decide whether to ask for its label. Out of these algorithms, CAL is the most mellow, since it queries any example whose label is yet undetermined by the version space. Its “objective” can be described as reducing the number of examples which are labeled incorrectly, since it has been shown to do so in many cases (Hanneke, 2007; 2011; Friedman, 2009). QBC and the Active Perceptron are less mellow. Their “objective” is similar to that of ALuMA, since they decide on examples to query based on geometric considerations. In Section 4.1 we discuss the theoretical advantages and disadvantages of different strategies, by considering some interesting cases from a theoretical perspective. In Section 4.2 we report an empirical comparison of several algorithms and discuss our conclusions.

4.1. Theoretical Comparison

The label complexity of the algorithms mentioned above is usually analyzed in the PAC setting, thus we translate our guarantees into the PAC setting as well for the sake of comparison. We define the (ε, m, D)-label complexity of an active learning algorithm to be the number of label queries that are required in order to guarantee that given a sample of m unlabeled examples drawn from D, the error of the learned classifier will be at most ε (with probability of at least 1 − δ over the choice of sample). A pool-based active learner can be used to learn a classifier in the PAC model by sampling a pool of m unlabeled examples from D, applying the pool-based active learner to this pool, and running a passive learner on the labeled pool to obtain a classifier. For the class of halfspaces, if we sample an unlabeled pool of m = Ω(d/ε) examples, the learned classifier will have an error of at most ε (with high probability).

To demonstrate the effect of property (1) presented above, consider again the case of thresholds on the line defined in Section 2. Compare two greedy pool-based active learners for Hline: the first follows a binary search procedure, greedily selecting the example that increases the number of known labels the most.

This requires ⌈log(m)⌉ queries to identify the correct labeling of the pool. The second algorithm queries the example that splits the version space as evenly as possible. Theorem 1 implies a label complexity of O(log(m) log(1/γ(h))) for such an algorithm, since OPTmax = O(log(m)). However, a better result holds:

Theorem 4. In the problem of thresholds on the line, for any pool with labeling L, the exact greedy algorithm, and any approximate greedy algorithm that outputs a majority vote, require at most O(log(1/γ(h))) labels.

Comparing the ⌈log(m)⌉ guarantee of the first algorithm to the log(1/γ(h)) guarantee of the second, we reach the (unsurprising) conclusion that the first algorithm is preferable when the true labeling has a small margin, while the second is preferable when the true labeling has a large margin. This simple example accentuates the implications of selecting the volume of the version space as an objective. A similar implication can be derived in the PAC setting, when comparing CAL to ALuMA, with m = Θ(1/ε). When d = 1, CAL achieves a label complexity of O(log(1/ε)) = O(log(m)), similarly to the binary search strategy. Thus when ε is large compared to γ(h), CAL is better than being greedy on the volume, and the opposite holds when the condition is reversed. QBC will behave similarly to ALuMA in this setting.
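The two greedy strategies for thresholds on the line can be sketched as follows (a toy illustration with our own naming; labels follow h_θ(x) = +1 iff x ≥ θ, and the pool is sorted):

```python
import numpy as np

def binary_search_queries(xs, label):
    """Greedy on the number of known labels: binary search over the
    sorted pool xs; always about log2(m) queries."""
    lo, hi, q = 0, len(xs), 0
    while hi - lo > 0:
        mid = (lo + hi) // 2
        q += 1
        if label(xs[mid]) > 0:
            hi = mid          # threshold is at or before xs[mid]
        else:
            lo = mid + 1      # threshold is after xs[mid]
    return q                  # xs[lo] is the first +1 point

def volume_greedy_queries(xs, label):
    """Greedy on version-space volume: query the pool point nearest
    the midpoint of the interval (a, b) of consistent thresholds;
    about log2(1/margin) queries on well-separated pools."""
    a, b = xs.min() - 1.0, xs.max() + 1.0
    q = 0
    while True:
        unknown = [x for x in xs if a < x < b]  # labels not yet implied
        if not unknown:
            return q
        x = min(unknown, key=lambda v: abs(v - (a + b) / 2.0))
        q += 1
        if label(x) > 0:
            b = x             # threshold <= x
        else:
            a = x             # threshold > x
```

On a pool with a wide gap around the true threshold, the second strategy stops once the interval of consistent thresholds shrinks below the gap, so it queries far fewer points than binary search over a large pool.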

To demonstrate the effect of property (2) described above (being aggressive versus being mellow), we consider an example adapted from Dasgupta (2006). In this example the distribution is supported by two circles in R^3, one around the origin and one slightly above it. Dasgupta has demonstrated via this example that active learning can gain in label complexity from having significantly more unlabeled data. The following theorem shows that the aggressive strategy employed by ALuMA indeed achieves an exponential improvement when there are more unlabeled samples. In many applications, unlabeled examples are virtually free to sample, thus it can be worthwhile to allow the active learner to sample more examples than the passive sample complexity and use an aggressive strategy. In contrast, the mellow strategy of CAL does not significantly improve over passive learning in this case. We note that these results hold for any selective-sampling method that guarantees an error rate similar to passive ERM given the same sample size. This falls in line with the observation of Balcan et al. (2007) that in some cases a more aggressive approach is preferable.

Theorem 5. For all small enough ε ∈ (0, 1) there is a distribution Dε of points in R^3, such that

1. For m = O(1/ε), the (ε, m, Dε)-label complexity of any active learner is Ω(1/ε).


2. For m = Ω(log^2(1/ε)/ε^2), the (ε, m, Dε)-label complexity of ALuMA is O(log^2(1/ε)).

3. For any value of m, the (ε, m, Dε)-label complexity of CAL is Ω(1/ε).

This theorem demonstrates a case where more unlabeled examples can help ALuMA use fewer labels, whereas they do not help CAL. In fact, in some cases the label complexity of CAL can be significantly worse than that of the optimal algorithm, even when both CAL and the optimal algorithm have access to all the points in the support of the distribution. An example is provided in Gonen et al. (2012).

4.2. Empirical Comparison

We now report an empirical comparison between aggressive and mellow algorithms. Our goal is twofold: first, to evaluate ALuMA in practice, and second, to compare the performance of aggressive strategies and mellow strategies. The aggressive strategy is represented by ALuMA and by Tong & Koller (2002). The mellow strategy is represented by CAL. QBC represents a middle ground between aggressive and mellow. We also compare to a passive ERM algorithm, one that uses random labeled examples.

Our implementation of ALuMA uses hit-and-run samples instead of full-blown volume estimation. See Gonen et al. (2012) for full details. QBC was also implemented using hit-and-run (Gilad-Bachrach et al., 2005). CAL and QBC were provided with a random ordering of the pool. The algorithm TK is the first heuristic proposed in Tong & Koller (2002), in which at each iteration the example closest to the max-margin solution of the labeled examples is selected. Since the active learners operate by reducing the training error, the graphs below compare the achieved training errors.
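The TK selection rule itself is a one-liner; here is a sketch (computing the max-margin solution w on the labeled examples, e.g. with an SVM solver, is left outside the snippet):

```python
import numpy as np

def tk_select(w, X, unlabeled):
    """TK heuristic (Tong & Koller, 2002): query the unlabeled pool
    point closest to the max-margin hyperplane w fitted on the
    labeled examples, i.e. the point with the smallest distance
    |<w, x>| / ||w|| to the hyperplane."""
    dists = np.abs(X[unlabeled] @ w) / np.linalg.norm(w)
    return unlabeled[int(np.argmin(dists))]
```

Points near the current hyperplane are the ones whose labels the current solution is least certain about, which is why this simple rule tracks the volume-splitting objective so closely in the separable experiments below.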

In all algorithms, we classify the training examples using the version space defined by the queried labels. The theory for CAL and ERM allows selecting an arbitrary predictor from the version space. In QBC, the predictor should be drawn uniformly at random from the version space. We found that all the algorithms perform significantly better if the majority vote classification proposed for ALuMA is used for them. Therefore, we used this methodology for all algorithms.

First, we conducted a synthetic experiment on a sample from the uniform distribution on a sphere. Figure 1 depicts the training error as a function of the label budget when learning a random halfspace over the uniform distribution in R^100. ALuMA and TK outperform CAL and QBC here. In Gonen et al. (2012),

Figure 1. Uniform distribution (d = 100).

we show the behavior in R^10 as well. The difference between the performance of the different algorithms is less marked in the lower dimension, suggesting that the difference might grow with the dimension. These results indicate that ALuMA might have a better guarantee than the general relative analysis in the case of the uniform distribution. Achieving such an analysis is an open question which is left for future work.

We next tested MNIST,2 which depicts gray-scale images of digits in dimension 784. Figure 2(left) shows the training error as a function of the label budget for learning to distinguish between the digits 3 and 5. Strikingly, while CAL provides no improvement over passive ERM in the first 1000 examples, ALuMA and TK reach zero training error. We also tested PCMAC,3

after a random projection from the original dimension of 7511 to dimension 300. The results are shown in Figure 2(right). Again, CAL does not improve over passive ERM, unlike the aggressive approaches.

Figure 2. MNIST (3 vs. 5) (left) and PCMAC (right).

In the experiments reported so far, TK and ALuMA perform about the same, showing that the TK heuristic is very successful. However, there are cases where TK performs much worse than ALuMA. We report a relevant synthetic experiment in Gonen et al. (2012). We also present experiments on non-separable data sets, showing that when the error is low and an upper bound on the hinge-loss can be assumed, ALuMA can improve performance over mellow approaches.

Acknowledgement: This work was supported by the German-Israeli Foundation, grant number 0378590.

2 http://yann.lecun.com/exdb/mnist/
3 http://vikas.sindhwani.org/datasets/lskm/matlab/


References

Arkin, E.M., Meijer, H., Mitchell, J.S.B., Rappaport, D., and Skiena, S.S. Decision trees for geometric models. In Proceedings of the ninth annual symposium on Computational geometry, pp. 369–378. ACM, 1993.

Balcan, M.F., Beygelzimer, A., and Langford, J. Agnostic active learning. In Proceedings of the 23rd international conference on Machine learning, pp. 65–72. ACM, 2006.

Balcan, M.F., Broder, A., and Zhang, T. Margin based active learning. Learning Theory, pp. 35–50, 2007.

Brightwell, G. and Winkler, P. Counting linear extensions is #P-complete. In Proceedings of the twenty-third annual ACM symposium on Theory of computing, STOC '91, pp. 175–181, 1991.

Cohn, D., Atlas, L., and Ladner, R. Improving generalization with active learning. Machine Learning, 15(2):201–221, 1994.

Dasgupta, S. Analysis of a greedy active learning strategy. Advances in neural information processing systems, 17:337–344, 2005.

Dasgupta, S. Coarse sample complexity bounds for active learning. Advances in neural information processing systems, 18:235, 2006.

Dasgupta, S., Kalai, A., and Monteleoni, C. Analysis of perceptron-based active learning. Learning Theory, pp. 889–905, 2005.

Dasgupta, S., Hsu, D., and Monteleoni, C. A general agnostic active learning algorithm. Advances in neural information processing systems, 20:353–360, 2007.

El-Yaniv, R. and Wiener, Y. Active learning via perfect selective classification. The Journal of Machine Learning Research, 13:255–279, 2012.

Freund, Y., Seung, H.S., Shamir, E., and Tishby, N. Selective sampling using the query by committee algorithm. Machine Learning, 28(2):133–168, 1997.

Friedman, E. Active learning for smooth problems. In Proceedings of the 22nd Conference on Learning Theory, volume 1, pp. 3–2, 2009.

Gilad-Bachrach, R., Navot, A., and Tishby, N. Query by committee made real. Advances in Neural Information Processing Systems (NIPS), 19, 2005.

Golovin, D. and Krause, A. Adaptive submodularity: A new approach to active learning and stochastic optimization. In Proceedings of International Conference on Learning Theory (COLT), 2010.

Gonen, A., Sabato, S., and Shalev-Shwartz, S. Efficient pool-based active learning of halfspaces. CoRR, abs/1208.3561, 2012.

Hanneke, S. A bound on the label complexity of agnostic active learning. In ICML, 2007.

Hanneke, S. Rates of convergence in active learning. The Annals of Statistics, 39(1):333–361, 2011.

Hastad, J. On the size of weights for threshold gates. SIAM Journal on Discrete Mathematics, 7:484, 1994.

Kannan, R., Lovasz, L., and Simonovits, M. Random walks and an O*(n^5) volume algorithm for convex bodies. Random structures and algorithms, 11(1):1–50, 1997.

Lovasz, L. Hit-and-run mixes fast. Mathematical Programming, 86(3):443–461, 1999.

Matousek, J. Lectures on discrete geometry, volume 212. Springer Verlag, 2002.

McCallum, A. and Nigam, K. Employing EM in pool-based active learning for text classification. In Proceedings of ICML-98, 15th International Conference on Machine Learning, pp. 350–358, 1998.

Muroga, S., Toda, I., and Takasu, S. Theory of majority decision elements. Journal of the Franklin Institute, 271(5):376–418, 1961.

Seung, H.S., Opper, M., and Sompolinsky, H. Query by committee. In Proceedings of the fifth annual workshop on Computational learning theory, pp. 287–294. ACM, 1992.

Tong, S. and Koller, D. Support vector machine active learning with applications to text classification. The Journal of Machine Learning Research, 2:45–66, 2002.