Efficient classification for metric data
Lee-Ad Gottlieb Hebrew U.
Aryeh Kontorovich Ben Gurion U.
Robert Krauthgamer Weizmann Institute
Classification problem
A fundamental problem in learning:
- Point space X
- Probability distribution P on X × {-1,1}
- Learner observes a sample S of n points (x,y) drawn i.i.d. from P
- Wants to predict the labels of other points in X
- Produces a hypothesis h: X → {-1,1} with
  empirical error err_S(h) = (1/n) |{(x,y) in S : h(x) ≠ y}|
  and true error err(h) = P{(x,y) : h(x) ≠ y}
Goal: err(h) ≤ err_S(h) + ε uniformly over h, in probability
Generalization bounds
How do we upper bound the true error? Use a generalization bound. Roughly speaking (and with high probability),
true error ≤ empirical error + (complexity of h)/n
A more complex classifier is "easier" to fit to arbitrary data.
VC-dimension: the largest point set that can be shattered by h.
Popular approach for classification
Assume the points are in Euclidean space!
Pros:
- Existence of an inner product
- Efficient algorithms (SVM)
- Good generalization bounds (max margin)
Cons:
- Many natural settings are non-Euclidean
- Euclidean structure is a strong assumption
Recent popular focus: metric space data
Metric space
(X,d) is a metric space if
- X = set of points
- d(·,·) = distance function: nonnegative, symmetric, satisfying the triangle inequality
An inner product gives a norm, and a norm gives a metric, but the reverse implications do not hold.
[Figure: road distances between Haifa, Tel Aviv and Beer Sheva (95km, 113km, 208km) as an example of a metric.]
Classification for metric data?
Advantage: often much more natural, and a much weaker assumption
- strings (see the edit distance sketch below)
- images (earthmover distance)
Problem: no vector representation, no notion of dot product (and no kernel). What to do?
- Invent a kernel (e.g. embed into Euclidean space)? Possible high distortion!
- Use some NN heuristic? The NN classifier has infinite VC-dimension!
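For instance, the edit distance on strings is a metric with no obvious inner product. A minimal sketch (a standard Levenshtein implementation added here purely for illustration; it is not part of the original slides):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: a metric on strings (nonnegative, symmetric,
    and satisfying the triangle inequality), with no natural inner product."""
    # Classic dynamic program over prefixes of a and b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete from a
                            curr[j - 1] + 1,             # insert into a
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]

# Example: pairwise distances among a few strings.
words = ["kitten", "sitting", "mitten"]
for u in words:
    for v in words:
        print(u, v, edit_distance(u, v))
```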
Preliminaries: Lipschitz constant
The Lipschitz constant of a function f: X → R measures its smoothness. It is the smallest value L that satisfies
|f(xi) − f(xj)| ≤ L · d(xi, xj) for all points xi, xj in X.
Denoted by ||f||Lip.
Suppose the hypothesis h: S → {-1,1} is consistent with the sample S. The Lipschitz constant of h is determined by the closest pair of differently labeled points, or equivalently ||h||Lip ≥ 2/d(S+,S−), where d(S+,S−) is the minimum distance between oppositely labeled sample points (see the sketch below).
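A minimal sketch of this quantity (the sample representation and the metric dist are placeholder assumptions):

```python
from itertools import product

def lipschitz_lower_bound(sample, dist):
    """Given a labeled sample [(x, y), ...] with y in {-1, +1} and a metric dist,
    return 2 / d(S+, S-): the lower bound on the Lipschitz constant of any
    classifier consistent with the sample."""
    positives = [x for x, y in sample if y == +1]
    negatives = [x for x, y in sample if y == -1]
    d_plus_minus = min(dist(p, q) for p, q in product(positives, negatives))
    return 2.0 / d_plus_minus
```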
Preliminaries: Lipschitz extension
Lipschitz extension: a classic problem in analysis. Given a function f: S → R for S ⊆ X, extend f to all of X without increasing the Lipschitz constant.
Example: points on the real line, f(1) = 1, f(-1) = -1. (Figure credit: A. Oberman)
Classification for metric data
A powerful framework for metric classification was introduced by von Luxburg & Bousquet (vLB, JMLR '04).
Construction of h on S: the natural hypotheses (classifiers) to consider are maximally smooth Lipschitz functions.
Estimation of h on X: evaluating h at new points of X reduces to finding a Lipschitz function consistent with h, a Lipschitz extension problem. For example,
f(x) = mini [f(xi) + 2 d(x, xi)/d(S+,S−)] over all (xi, yi) in S.
Evaluation of h thus reduces to exact nearest neighbor search: strong theoretical motivation for the NNS classification heuristic (see the sketch below).
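A minimal sketch of evaluating this Lipschitz extension classifier by brute force over the sample (the sample format and metric are placeholder assumptions, not the authors' code; a nearest neighbor structure would replace the linear scan):

```python
def make_lipschitz_classifier(sample, dist):
    """sample: list of (x, y) with y in {-1, +1}; dist: a metric on the points.
    Returns h(x) = sign(f(x)) where
    f(x) = min_i [ y_i + 2 * dist(x, x_i) / d(S+, S-) ]."""
    positives = [x for x, y in sample if y == +1]
    negatives = [x for x, y in sample if y == -1]
    d_pm = min(dist(p, q) for p in positives for q in negatives)

    def f(x):
        return min(y + 2.0 * dist(x, xi) / d_pm for xi, y in sample)

    def h(x):
        return 1 if f(x) >= 0 else -1

    return h

# Usage on points of the real line with dist(a, b) = |a - b|:
h = make_lipschitz_classifier([(-1.0, -1), (1.0, +1)], lambda a, b: abs(a - b))
print(h(0.3), h(-0.7))   # -> 1 -1
```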
Two new directions
The framework of [vLB '04] leaves open two further questions:
- Constructing h: handling noise (bias-variance tradeoff). Which sample points in S should h ignore?
- Evaluating h on X: in an arbitrary metric space, exact NNS requires Θ(n) time. Can we do better?
Doubling dimension
Definition: ball B(x,r) = all points within distance r from x.
The doubling constant λ(M) of a metric M is the minimum value λ > 0 such that every ball can be covered by λ balls of half the radius. First used by [Assouad '83], algorithmically by [Clarkson '97].
The doubling dimension is ddim(M) = log2 λ(M). A metric is doubling if its doubling dimension is constant. Euclidean: ddim(R^d) = O(d).
Packing property of doubling spaces: a set with diameter diam and minimum inter-point distance a contains at most (diam/a)^O(ddim) points (see the sketch below).
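Such packings are typically exhibited via a greedy r-net; a minimal sketch under the assumption of an explicit point list and metric (illustration only, not from the slides):

```python
def greedy_r_net(points, dist, r):
    """Greedily select a subset N such that (i) every point is within distance r
    of some net point and (ii) net points are pairwise more than r apart.
    By the packing property, |N| <= (diam/r)^O(ddim) in a doubling space."""
    net = []
    for p in points:
        if all(dist(p, q) > r for q in net):
            net.append(p)
    return net

# Usage on the real line: points in [0, 10] at spacing 0.5, net radius 1.
pts = [0.5 * i for i in range(21)]
net = greedy_r_net(pts, lambda a, b: abs(a - b), 1.0)
print(len(net), net)  # at most about diam/r + 1 net points on the line
```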
[Figure: covering a ball by balls of half the radius; here λ ≥ 7.]
Applications of doubling dimension
Major application to databases: recall that exact NNS requires Θ(n) time in an arbitrary metric space. There exists a linear-size structure that supports approximate nearest neighbor search in time 2^O(ddim) log n.
Database/network structures and tasks analyzed via the doubling dimension:
- Nearest neighbor search structures [KL '04, HM '06, BKL '06, CG '06]
- Image recognition (vision) [KG --]
- Spanner construction [GGN '06, CG '06, DPP '06, GR '08a, GR '08b]
- Distance oracles [Tal '04, Sli '05, HM '06, BGRKL '11]
- Clustering [Tal '04, ABS '08, FM '10]
- Routing [KSW '04, Sli '05, AGGM '06, KRXY '07, KRX '08]
Further applications:
- Travelling Salesperson [Tal '04]
- Embeddings [Ass '84, ABN '08, BRS '07, GK '11]
- Machine learning [BLL '09, KKL '10, KKL --]
Note: Above algorithms can be extended to nearly-doubling spaces [GK ‘10]
Message: This is an active line of research…
Our dual use of doubling dimension
Interestingly, the doubling dimension contributes in two different areas:
- Statistical: function complexity. We bound the complexity of the hypothesis in terms of the doubling dimension of X and the Lipschitz constant of the classifier h.
- Computational: efficient approximate NNS.
Statistical contribution
We provide generalization bounds for Lipschitz functions on spaces with low doubling dimension. vLB provided similar bounds using covering numbers and Rademacher averages.
Fat-shattering analysis: if L-Lipschitz functions shatter a set, its inter-point distance is at least 2/L; by the packing property, the set has at most (diam · L)^O(ddim) points. This is the fat-shattering dimension of the classifier on the space, and is a good measure of its complexity.
Statistical contribution
[BST '99]: For any f that classifies a sample of size n correctly, we have with probability at least 1 − δ
P{(x, y) : sgn(f(x)) ≠ y} ≤ (2/n) (d log(34en/d) log2(578n) + log(4/δ)).
Likewise, if f is correct on all but k examples, we have with probability at least 1 − δ
P{(x, y) : sgn(f(x)) ≠ y} ≤ k/n + [(2/n) (d ln(34en/d) log2(578n) + ln(4/δ))]^(1/2).
In both cases, d is bounded by the fat-shattering dimension, d ≤ (diam · L)^ddim + 1.
Done with the statistical contribution … On to the computational contribution.
Computational contribution
Evaluation of h for new points in X: the Lipschitz extension function
f(x) = mini [yi + 2 d(x, xi)/d(S+,S−)]
requires exact nearest neighbor search, which can be expensive!
New tool: (1+ε)-approximate nearest neighbor search in time 2^O(ddim) log n + ε^−O(ddim) [KL '04, HM '06, BKL '06, CG '06].
If we evaluate f(x) using an approximate NNS, we can show that the result agrees with (the sign of) at least one of
g(x) = (1+ε) f(x) + ε and e(x) = (1+ε) f(x) − ε. Note that g(x) ≥ f(x) ≥ e(x).
g(x) and e(x) have Lipschitz constant (1+ε)L, so they, and hence the approximate function, generalize well.
Final problem: bias-variance tradeoff
Which sample points in S should h ignore?
If f is correct on all but k examples, we have with probability at least 1 − δ
P{(x, y) : sgn(f(x)) ≠ y} ≤ k/n + [(2/n) (d ln(34en/d) log2(578n) + ln(4/δ))]^(1/2),
where d ≤ (diam · L)^ddim + 1.
Structural Risk Minimization
Algorithm:
- Fix a target Lipschitz constant L (O(n^2) possibilities).
- Locate all pairs of points from S+ and S− whose distance is less than 2/L. At least one point of each such pair has to be taken as an error.
- Goal: remove as few points as possible.
This is minimum vertex cover: NP-complete in general, but it admits a 2-approximation in O(|E|) time.
Here the conflict graph is bipartite, so minimum vertex cover is equivalent to maximum matching (Kőnig's theorem) and admits an exact solution in O(n^2.376) randomized time [MS '04] (see the sketch below).
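A minimal sketch of this step using networkx's bipartite matching utilities (Hopcroft-Karp plus Kőnig's construction, rather than the randomized O(n^2.376) algorithm cited above; the point lists, metric, and threshold are placeholder assumptions):

```python
import networkx as nx
from networkx.algorithms import bipartite

def conflicting_points(positives, negatives, dist, lip_target):
    """Build the bipartite conflict graph: an edge (p, q) whenever a positive p
    and a negative q are closer than 2 / lip_target, then return a minimum
    vertex cover -- a smallest set of sample points to discard as errors."""
    g = nx.Graph()
    g.add_nodes_from(positives, bipartite=0)
    g.add_nodes_from(negatives, bipartite=1)
    for p in positives:
        for q in negatives:
            if dist(p, q) < 2.0 / lip_target:
                g.add_edge(p, q)
    matching = bipartite.maximum_matching(g, top_nodes=positives)
    cover = bipartite.to_vertex_cover(g, matching, top_nodes=positives)
    return cover  # points to treat as errors for this value of L

# Usage on the real line (labels implied by which list a point belongs to):
pos, neg = [0.0, 2.0, 5.0], [0.1, 3.0]
print(conflicting_points(pos, neg, lambda a, b: abs(a - b), lip_target=1.0))
```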
Efficient SRM
Algorithm:
- For each of the O(n^2) values of L:
  - Run the matching algorithm to find the minimum error.
  - Evaluate the generalization bound for this value of L.
- O(n^4.376) randomized time.
Better algorithm (see the sketch below):
- Binary search over the O(n^2) values of L. For each value:
  - Run the greedy 2-approximation to get an approximate minimum error, for O(n^2 log n) time overall.
  - Evaluate the approximate generalization bound for this value of L.
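A minimal sketch of the approximate variant (illustration only: the generalization bound is abstracted into a callback 'bound', and a simple scan over candidate values of L stands in for the binary search described above):

```python
def greedy_vertex_cover(edges):
    """Classic 2-approximation: repeatedly pick an uncovered edge and
    take both of its endpoints into the cover."""
    cover = set()
    for u, v in edges:
        if u not in cover and v not in cover:
            cover.update((u, v))
    return cover

def efficient_srm(positives, negatives, dist, bound, candidate_Ls):
    """For each candidate Lipschitz constant L, approximate the minimum number
    of sample points to discard (greedy vertex cover on the conflict graph),
    and keep the L minimizing the generalization bound bound(k, n, L)."""
    n = len(positives) + len(negatives)
    best = None
    for L in sorted(candidate_Ls):
        conflicts = [(p, q) for p in positives for q in negatives
                     if dist(p, q) < 2.0 / L]
        k = len(greedy_vertex_cover(conflicts))   # approximate error count
        score = bound(k, n, L)
        if best is None or score < best[0]:
            best = (score, L, k)
    return best  # (bound value, chosen L, number of discarded points)
```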
Conclusion
Results:
- Generalization bounds for Lipschitz classifiers in doubling spaces
- Efficient evaluation of the Lipschitz extension hypothesis using approximate NNS
- Efficient structural risk minimization
Continuing research: continuous labels
- Risk bound via the doubling dimension
- Classifier h determined via an LP
- Faster LP: low-hop low-stretch spanners [GR '08a, GR '08b] → fewer constraints, and each variable appears in a bounded number of constraints