
Efficient classification for metric data

Lee-Ad Gottlieb Hebrew U.

Aryeh Kontorovich Ben Gurion U.

Robert Krauthgamer Weizmann Institute

Classification problem
A fundamental problem in learning:

Point space X
Probability distribution P on X × {-1,1}
Learner observes a sample S of n points (x,y) drawn i.i.d. from P
Wants to predict the labels of other points in X

Produces a hypothesis h: X → {-1,1} with
empirical error  err̂(h) = (1/n) |{(x,y) in S : h(x) ≠ y}|
and true error  err(h) = P{(x,y) : h(x) ≠ y}

Goal: err̂(h) → err(h), uniformly over h, in probability



Generalization bounds
How do we upper bound the true error?

Use a generalization bound. Roughly speaking (and with high probability),

true error ≤ empirical error + (complexity of h)/n

A more complex classifier is “easier” to fit to arbitrary data.
VC-dimension: the size of the largest point set that can be shattered by the hypothesis class.


Popular approach for classification
Assume the points are in Euclidean space!

Pros:
Existence of an inner product
Efficient algorithms (SVM)
Good generalization bounds (max margin)

Cons:
Many natural settings are non-Euclidean
Euclidean structure is a strong assumption

Recent popular focus: metric space data


Metric space
(X,d) is a metric space if
X = set of points
d(·,·) = distance function that is nonnegative, symmetric, and satisfies the triangle inequality

An inner product induces a norm, and a norm induces a metric, but the converse implications do not hold.

[Figure: map of three cities (Haifa, Tel Aviv, Beer Sheva) with pairwise distances 95 km, 113 km, and 208 km, illustrating a metric]

Classification for metric data?
Advantage: often much more natural, and a much weaker assumption
Examples: strings (e.g. edit distance), images (earthmover distance)

Problem: no vector representation, hence no notion of dot-product (and no kernel). What to do?
Invent a kernel (e.g. embed into Euclidean space)? Possibly high distortion!
Use some nearest-neighbor heuristic? The NN classifier has infinite VC-dimension!
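To make the string example concrete, here is a minimal sketch (the function name is ours) of the edit (Levenshtein) distance, a metric on strings that comes with no inner-product structure:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: a metric on strings (nonnegative, symmetric,
    satisfies the triangle inequality), with no accompanying inner product."""
    # Dynamic programming over prefixes of a and b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]

# Example: d("kitten", "sitting") = 3
print(edit_distance("kitten", "sitting"))
```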


Preliminaries: Lipschitz constant
The Lipschitz constant L of a function f: X → R measures its smoothness.
It is the smallest value L satisfying |f(xi) − f(xj)| ≤ L · d(xi, xj) for all points xi, xj in X.
It is denoted by ‖f‖Lip.

Suppose a hypothesis h: S → {-1,1} is consistent with the sample S.
The Lipschitz constant of h is determined by the closest pair of differently labeled points;
equivalently, ‖h‖Lip ≥ 2/d(S+,S−), where d(S+,S−) is the minimum distance between a positive and a negative sample point.
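A minimal sketch of this quantity for a finite labeled sample under an arbitrary metric d (the function name is ours):

```python
from itertools import product

def sample_lipschitz_constant(points, labels, d):
    """Smallest Lipschitz constant of a consistent {-1,+1} labeling of the sample:
    2 divided by the minimum distance between oppositely labeled points."""
    positives = [p for p, y in zip(points, labels) if y == +1]
    negatives = [p for p, y in zip(points, labels) if y == -1]
    d_plus_minus = min(d(p, q) for p, q in product(positives, negatives))
    return 2.0 / d_plus_minus

# Example on the real line with d(x, y) = |x - y|:
pts, ys = [0.0, 1.0, 3.0, 4.0], [-1, -1, +1, +1]
print(sample_lipschitz_constant(pts, ys, lambda x, y: abs(x - y)))  # 2/2 = 1.0
```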


Preliminaries: Lipschitz extension
A classic problem in analysis: given a function f: S → R for S ⊆ X, extend f to all of X without increasing the Lipschitz constant.

Example: points on the real line with f(1) = 1 and f(-1) = -1. (Figure credit: A. Oberman)
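One classical construction achieving this is the McShane extension; applied to the example above (where ‖f‖Lip = 1), it gives the following worked instance:

```latex
% McShane extension of f: S -> R with Lipschitz constant L = ||f||_Lip:
\tilde f(x) \;=\; \min_{s \in S}\bigl[f(s) + L\, d(x,s)\bigr],
\qquad \|\tilde f\|_{\mathrm{Lip}} = L .
% For S = \{-1, 1\},\ f(\pm 1) = \pm 1,\ L = 1 on the real line:
\tilde f(x) \;=\; \min\bigl(1 + |x-1|,\; -1 + |x+1|\bigr),
\qquad \text{e.g. } \tilde f(0) = \min(2, 0) = 0 .
```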


Classification for metric data
A powerful framework for metric classification was introduced by von Luxburg & Bousquet (vLB, JMLR ‘04).

Construction of h on S: the natural hypotheses (classifiers) to consider are maximally smooth Lipschitz functions.

Evaluation of h on X: evaluating h at new points of X reduces to finding a Lipschitz function consistent with h, i.e., a Lipschitz extension problem. For example,

f(x) = mini [f(xi) + 2 d(x, xi)/d(S+,S−)]  over all (xi, yi) in S, with f(xi) = yi

Evaluation of h thus reduces to exact nearest neighbor search, a strong theoretical motivation for the NNS classification heuristic.
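A minimal sketch of this classifier with brute-force (exact) nearest-neighbor search, assuming the metric d is given as a Python function (the helper name is ours):

```python
def make_lipschitz_classifier(points, labels, d):
    """Return f(x) = min_i [y_i + L * d(x, x_i)] with L = 2/d(S+, S-),
    evaluated via exact nearest-neighbor distances to S+ and S-."""
    positives = [p for p, y in zip(points, labels) if y == +1]
    negatives = [p for p, y in zip(points, labels) if y == -1]
    L = 2.0 / min(d(p, q) for p in positives for q in negatives)

    def f(x):
        # The minimum over positive points equals 1 + L*d(x, S+),
        # and over negative points equals -1 + L*d(x, S-).
        d_pos = min(d(x, p) for p in positives)
        d_neg = min(d(x, q) for q in negatives)
        return min(1.0 + L * d_pos, -1.0 + L * d_neg)

    return f

# Example on the real line: predict the label of a new point by sign(f(x)).
f = make_lipschitz_classifier([0.0, 1.0, 3.0, 4.0], [-1, -1, +1, +1],
                              lambda x, y: abs(x - y))
print(f(3.5) > 0, f(0.5) > 0)  # True, False
```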


Two new directions
The framework of [vLB ‘04] leaves open two further questions:

Constructing h: handling noise (a bias–variance tradeoff). Which sample points in S should h ignore?

Evaluating h on X: in an arbitrary metric space, exact NNS requires Θ(n) time. Can we do better?


Doubling dimension
Definition: Ball B(x,r) = all points within distance r from x.

The doubling constant λ(M) of a metric M is the minimum value λ > 0 such that every ball can be covered by λ balls of half the radius.
First used by [Assouad ‘83], algorithmically by [Clarkson ‘97].
The doubling dimension is ddim(M) = log2 λ(M).
A metric is doubling if its doubling dimension is constant.
Euclidean: ddim(Rd) = O(d).

Packing property of doubling spaces: a set with diameter diam and minimum inter-point distance a contains at most (diam/a)O(ddim) points.

[Figure: a ball covered by balls of half its radius; here λ ≥ 7.]
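As an illustration, the following sketch computes a rough upper estimate of the doubling constant of a finite point set (greedy covering with centers restricted to the data points; the function name and the choice of radii are ours):

```python
def doubling_constant_estimate(points, d, radii):
    """Rough upper estimate of the doubling constant of a finite metric:
    for each ball B(x, r), greedily cover its points with balls of radius r/2
    centered at data points, and take the largest cover produced."""
    worst = 1
    for x in points:
        for r in radii:
            ball = [p for p in points if d(x, p) <= r]
            uncovered, cover_size = list(ball), 0
            while uncovered:
                c = uncovered[0]   # pick an uncovered point as a new center
                uncovered = [p for p in uncovered if d(c, p) > r / 2.0]
                cover_size += 1
            worst = max(worst, cover_size)
    return worst

# Example: points on a line have a small doubling constant.
pts = [float(i) for i in range(16)]
print(doubling_constant_estimate(pts, lambda a, b: abs(a - b), radii=[2.0, 4.0, 8.0]))
```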

Applications of doubling dimension
Major application to databases:
Recall that exact NNS requires Θ(n) time in an arbitrary metric space.
There exists a linear-size structure that supports approximate nearest neighbor search in time 2O(ddim) log n.

Database/network structures and tasks analyzed via the doubling dimension:
Nearest neighbor search structures [KL ‘04, HM ’06, BKL ’06, CG ‘06]
Image recognition (vision) [KG --]
Spanner construction [GGN ‘06, CG ’06, DPP ‘06, GR ‘08a, GR ‘08b]
Distance oracles [Tal ’04, Sli ’05, HM ’06, BGRKL ‘11]
Clustering [Tal ‘04, ABS ‘08, FM ‘10]
Routing [KSW ‘04, Sli ‘05, AGGM ‘06, KRXY ‘07, KRX ‘08]

Further applications:
Travelling Salesperson [Tal ‘04]
Embeddings [Ass ‘84, ABN ‘08, BRS ‘07, GK ‘11]
Machine learning [BLL ‘09, KKL ‘10, KKL --]

Note: the above algorithms can be extended to nearly-doubling spaces [GK ‘10].

Message: this is an active line of research…


Our dual use of doubling dimension
Interestingly, the doubling dimension contributes in two different areas.

Statistical: function complexity. We bound the complexity of the hypothesis in terms of the doubling dimension of X and the Lipschitz constant of the classifier h.

Computational: efficient approximate NNS.


Statistical contribution
We provide generalization bounds for Lipschitz functions on spaces with low doubling dimension.
vLB provided similar bounds using covering numbers and Rademacher averages.

Fat-shattering analysis (a compact version appears below):
If L-Lipschitz functions shatter a set, its minimum inter-point distance is at least 2/L.
By the packing property, such a set has at most (diam · L)O(ddim) points.
This is the fat-shattering dimension of the classifier on the space, and is a good measure of its complexity.
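A compact version of this argument, at shattering margin 1 (a sketch consistent with the slide, not the paper's exact statement):

```latex
% If {x_1,...,x_k} is shattered by L-Lipschitz functions at margin 1, then for any i != j
% some dichotomy is realized by an f with f(x_i) >= 1 and f(x_j) <= -1, hence
d(x_i, x_j) \;\ge\; \frac{|f(x_i) - f(x_j)|}{L} \;\ge\; \frac{2}{L}.
% The packing property of doubling spaces then bounds the size of the shattered set:
k \;\le\; \Bigl(\frac{\operatorname{diam}}{2/L}\Bigr)^{O(\mathrm{ddim})}
  \;=\; (\operatorname{diam}\cdot L)^{O(\mathrm{ddim})}.
```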


Statistical contribution
[BST ‘99]: For any f that classifies a sample of size n correctly, we have with probability at least 1−δ
P{(x, y) : sgn(f(x)) ≠ y} ≤ (2/n) (d log(34en/d) log2(578n) + log(4/δ)).

Likewise, if f is correct on all but k examples, we have with probability at least 1−δ
P{(x, y) : sgn(f(x)) ≠ y} ≤ k/n + [(2/n) (d ln(34en/d) log2(578n) + ln(4/δ))]1/2.

In both cases, d is bounded by the fat-shattering dimension:
d ≤ (diam L)ddim + 1

Done with the statistical contribution … On to the computational contribution.


Computational contribution
Evaluation of h for new points in X:

Lipschitz extension function f(x) = mini [yi + 2 d(x, xi)/d(S+,S−)]
Requires exact nearest neighbor search, which can be expensive!

New tool: (1+ε)-approximate nearest neighbor search, in 2O(ddim) log n + ε−O(ddim) time [KL ‘04, HM ‘06, BKL ‘06, CG ‘06]

If we evaluate f(x) using approximate NNS, we can show that the result agrees with (the sign of) at least one of
g(x) = (1+ε) f(x) + ε  and  e(x) = (1+ε) f(x) − ε
Note that g(x) ≥ f(x) ≥ e(x), since f takes values in [-1,1].

g(x) and e(x) have Lipschitz constant (1+ε)L, so they (and hence the approximately evaluated classifier) generalize well.



Final problem: bias–variance tradeoff
Which sample points in S should h ignore?

If f is correct on all but k examples, we have with probability at least 1−δ
P{(x, y) : sgn(f(x)) ≠ y} ≤ k/n + [(2/n) (d ln(34en/d) log2(578n) + ln(4/δ))]1/2, where d ≤ (diam L)ddim + 1.




Structural Risk Minimization
Algorithm:
Fix a target Lipschitz constant L (O(n2) possibilities).
Locate all pairs of points from S+ and S− whose distance is less than 2/L; at least one point of each such pair must be counted as an error.
Goal: remove as few points as possible.

Minimum vertex cover is NP-complete, but admits a 2-approximation in O(E) time.
Minimum vertex cover on a bipartite graph is equivalent to maximum matching (König’s theorem) and admits an exact solution in O(n2.376) randomized time [MS ‘04].
A sketch of the bipartite-matching step appears below.

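A minimal sketch of this step using networkx for the maximum matching / vertex cover (an illustrative helper, not the paper's implementation):

```python
import networkx as nx
from networkx.algorithms import bipartite

def points_to_discard(points, labels, d, L):
    """For a target Lipschitz constant L, build the bipartite 'conflict' graph on
    pairs of oppositely labeled points at distance < 2/L, and return a minimum
    set of points to discard (minimum vertex cover, via Konig's theorem)."""
    pos = [i for i, y in enumerate(labels) if y == +1]
    neg = [i for i, y in enumerate(labels) if y == -1]
    G = nx.Graph()
    for i in pos:
        for j in neg:
            if d(points[i], points[j]) < 2.0 / L:
                G.add_edge(i, j)          # conflicting pair: one endpoint must go
    if G.number_of_edges() == 0:
        return set()
    top = {i for i in pos if i in G}
    matching = bipartite.maximum_matching(G, top_nodes=top)
    return bipartite.to_vertex_cover(G, matching, top_nodes=top)

# Example on the real line: with L = 1, oppositely labeled pairs closer than 2 conflict.
pts, ys = [0.0, 1.5, 2.0, 4.0], [-1, -1, +1, +1]
print(points_to_discard(pts, ys, lambda a, b: abs(a - b), L=1.0))  # a set of size 1
```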


Efficient SRM
Algorithm:
For each of the O(n2) candidate values of L:
Run the matching algorithm to find the minimum error
Evaluate the generalization bound for this value of L
Total: O(n4.376) randomized time

Better algorithm (a greedy sketch appears below):
Binary search over the O(n2) candidate values of L; for each value:
Run the greedy 2-approximation
Approximate minimum error in O(n2 log n) time
Evaluate the approximate generalization bound for this value of L
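A minimal sketch of the greedy 2-approximation for a fixed L (the helper name is ours):

```python
def greedy_conflict_cover(points, labels, d, L):
    """Greedy 2-approximation to minimum vertex cover of the conflict graph:
    repeatedly pick a remaining conflicting pair (oppositely labeled points at
    distance < 2/L) and discard both of its endpoints."""
    pos = [i for i, y in enumerate(labels) if y == +1]
    neg = [i for i, y in enumerate(labels) if y == -1]
    edges = [(i, j) for i in pos for j in neg if d(points[i], points[j]) < 2.0 / L]
    discarded = set()
    for i, j in edges:
        if i not in discarded and j not in discarded:
            discarded.update((i, j))   # cover this edge with both endpoints
    return discarded                   # at most 2x the minimum number of errors

# Same example as before; the greedy cover may discard both points of a pair.
pts, ys = [0.0, 1.5, 2.0, 4.0], [-1, -1, +1, +1]
print(greedy_conflict_cover(pts, ys, lambda a, b: abs(a - b), L=1.0))  # {1, 2}
```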


Conclusion
Results:
Generalization bounds for Lipschitz classifiers in doubling spaces
Efficient evaluation of the Lipschitz extension hypothesis using approximate NNS
Efficient Structural Risk Minimization

Continuing research:
Continuous labels
Risk bound via the doubling dimension
Classifier h determined via an LP
Faster LP: low-hop, low-stretch spanners [GR ’08a, GR ’08b] → fewer constraints, and each variable appears in a bounded number of constraints
Application: earthmover distance
