Efficient classification for metric data
Lee-Ad Gottlieb Hebrew U.
Aryeh Kontorovich Ben Gurion U.
Robert Krauthgamer Weizmann Institute
Classification problem
A fundamental problem in learning:
- Point space X
- Probability distribution P on X × {-1,1}
- Learner observes a sample S of n points (x,y) drawn i.i.d. from P
- Wants to predict the labels of other points in X
- Produces a hypothesis h: X → {-1,1} with
  empirical error err_S(h) = (1/n) |{(x,y) in S : h(x) ≠ y}|
  and true error err(h) = P{(x,y) : h(x) ≠ y}
Goal: err(h) ≤ err_S(h) + ε uniformly over h, in probability
Generalization bounds
How do we upper bound the true error? Use a generalization bound. Roughly speaking (and with high probability),
true error ≤ empirical error + (complexity of h)/n
A more complex classifier is "easier" to fit to arbitrary data.
VC-dimension: the largest point set that can be shattered by h.
Popular approach for classification
Assume the points are in Euclidean space!
Pros:
- Existence of an inner product
- Efficient algorithms (SVM)
- Good generalization bounds (max margin)
Cons:
- Many natural settings are non-Euclidean
- Euclidean structure is a strong assumption
Recent popular focus: metric space data
Metric space
(X,d) is a metric space if
- X = set of points
- d(·,·) = distance function: nonnegative, symmetric, satisfying the triangle inequality
An inner product gives a norm, and a norm gives a metric, but the reverse implications do not hold.
[Figure: road distances between Haifa, Tel Aviv and Beer Sheva (95km, 113km, 208km) as an example of a metric.]
Classification for metric data?
Advantage: often much more natural, and a much weaker assumption
- strings (see the edit distance sketch below)
- images (earthmover distance)
Problem: no vector representation, no notion of dot product (and no kernel). What to do?
- Invent a kernel (e.g. embed into Euclidean space)? Possible high distortion!
- Use some NN heuristic? The NN classifier has infinite VC-dimension!
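For instance, the edit distance on strings is a metric with no obvious inner product. A minimal sketch (a standard Levenshtein implementation added here purely for illustration; it is not part of the original slides):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: a metric on strings (nonnegative, symmetric,
    and satisfying the triangle inequality), with no natural inner product."""
    # Classic dynamic program over prefixes of a and b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete from a
                            curr[j - 1] + 1,             # insert into a
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]

# Example: pairwise distances among a few strings.
words = ["kitten", "sitting", "mitten"]
for u in words:
    for v in words:
        print(u, v, edit_distance(u, v))
```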
Preliminaries: Lipschitz constant
The Lipschitz constant of a function f: X → R measures its smoothness. It is the smallest value L that satisfies
|f(xi) − f(xj)| ≤ L · d(xi, xj) for all points xi, xj in X.
Denoted by ||f||Lip.
Suppose the hypothesis h: S → {-1,1} is consistent with the sample S. The Lipschitz constant of h is determined by the closest pair of differently labeled points, or equivalently ||h||Lip ≥ 2/d(S+,S−), where d(S+,S−) is the minimum distance between oppositely labeled sample points (see the sketch below).
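A minimal sketch of this quantity (the sample representation and the metric dist are placeholder assumptions):

```python
from itertools import product

def lipschitz_lower_bound(sample, dist):
    """Given a labeled sample [(x, y), ...] with y in {-1, +1} and a metric dist,
    return 2 / d(S+, S-): the lower bound on the Lipschitz constant of any
    classifier consistent with the sample."""
    positives = [x for x, y in sample if y == +1]
    negatives = [x for x, y in sample if y == -1]
    d_plus_minus = min(dist(p, q) for p, q in product(positives, negatives))
    return 2.0 / d_plus_minus
```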
Preliminaries: Lipschitz extension
Lipschitz extension: a classic problem in analysis. Given a function f: S → R for S ⊆ X, extend f to all of X without increasing the Lipschitz constant.
Example: points on the real line, f(1) = 1, f(-1) = -1. (Figure credit: A. Oberman)
Classification for metric data
A powerful framework for metric classification was introduced by von Luxburg & Bousquet (vLB, JMLR '04).
Construction of h on S: the natural hypotheses (classifiers) to consider are maximally smooth Lipschitz functions.
Estimation of h on X: evaluating h at new points of X reduces to finding a Lipschitz function consistent with h, a Lipschitz extension problem. For example,
f(x) = mini [f(xi) + 2 d(x, xi)/d(S+,S−)] over all (xi, yi) in S.
Evaluation of h thus reduces to exact nearest neighbor search: strong theoretical motivation for the NNS classification heuristic (see the sketch below).
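A minimal sketch of evaluating this Lipschitz extension classifier by brute force over the sample (the sample format and metric are placeholder assumptions, not the authors' code; a nearest neighbor structure would replace the linear scan):

```python
def make_lipschitz_classifier(sample, dist):
    """sample: list of (x, y) with y in {-1, +1}; dist: a metric on the points.
    Returns h(x) = sign(f(x)) where
    f(x) = min_i [ y_i + 2 * dist(x, x_i) / d(S+, S-) ]."""
    positives = [x for x, y in sample if y == +1]
    negatives = [x for x, y in sample if y == -1]
    d_pm = min(dist(p, q) for p in positives for q in negatives)

    def f(x):
        return min(y + 2.0 * dist(x, xi) / d_pm for xi, y in sample)

    def h(x):
        return 1 if f(x) >= 0 else -1

    return h

# Usage on points of the real line with dist(a, b) = |a - b|:
h = make_lipschitz_classifier([(-1.0, -1), (1.0, +1)], lambda a, b: abs(a - b))
print(h(0.3), h(-0.7))   # -> 1 -1
```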
Two new directions
The framework of [vLB '04] leaves open two further questions:
- Constructing h: handling noise (bias-variance tradeoff). Which sample points in S should h ignore?
- Evaluating h on X: in an arbitrary metric space, exact NNS requires Θ(n) time. Can we do better?
Doubling dimension
Definition: ball B(x,r) = all points within distance r from x.
The doubling constant λ(M) of a metric M is the minimum value λ > 0 such that every ball can be covered by λ balls of half the radius. First used by [Assouad '83], algorithmically by [Clarkson '97].
The doubling dimension is ddim(M) = log2 λ(M). A metric is doubling if its doubling dimension is constant. Euclidean: ddim(R^d) = O(d).
Packing property of doubling spaces: a set with diameter diam and minimum inter-point distance a contains at most (diam/a)^O(ddim) points (see the sketch below).
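Such packings are typically exhibited via a greedy r-net; a minimal sketch under the assumption of an explicit point list and metric (illustration only, not from the slides):

```python
def greedy_r_net(points, dist, r):
    """Greedily select a subset N such that (i) every point is within distance r
    of some net point and (ii) net points are pairwise more than r apart.
    By the packing property, |N| <= (diam/r)^O(ddim) in a doubling space."""
    net = []
    for p in points:
        if all(dist(p, q) > r for q in net):
            net.append(p)
    return net

# Usage on the real line: points in [0, 10] at spacing 0.5, net radius 1.
pts = [0.5 * i for i in range(21)]
net = greedy_r_net(pts, lambda a, b: abs(a - b), 1.0)
print(len(net), net)  # at most about diam/r + 1 net points on the line
```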
[Figure: covering a ball by balls of half the radius; here λ ≥ 7.]
Applications of doubling dimension
Major application to databases: recall that exact NNS requires Θ(n) time in an arbitrary metric space. There exists a linear-size structure that supports approximate nearest neighbor search in time 2^O(ddim) log n.
Database/network structures and tasks analyzed via the doubling dimension:
- Nearest neighbor search structures [KL '04, HM '06, BKL '06, CG '06]
- Image recognition (vision) [KG --]
- Spanner construction [GGN '06, CG '06, DPP '06, GR '08a, GR '08b]
- Distance oracles [Tal '04, Sli '05, HM '06, BGRKL '11]
- Clustering [Tal '04, ABS '08, FM '10]
- Routing [KSW '04, Sli '05, AGGM '06, KRXY '07, KRX '08]
Further applications:
- Travelling Salesperson [Tal '04]
- Embeddings [Ass '84, ABN '08, BRS '07, GK '11]
- Machine learning [BLL '09, KKL '10, KKL --]
Note: Above algorithms can be extended to nearly-doubling spaces [GK ‘10]
Message: This is an active line of research…
Our dual use of doubling dimension
Interestingly, the doubling dimension contributes in two different areas:
- Statistical: function complexity. We bound the complexity of the hypothesis in terms of the doubling dimension of X and the Lipschitz constant of the classifier h.
- Computational: efficient approximate NNS.
Statistical contribution
We provide generalization bounds for Lipschitz functions on spaces with low doubling dimension. vLB provided similar bounds using covering numbers and Rademacher averages.
Fat-shattering analysis: if L-Lipschitz functions shatter a set, its inter-point distance is at least 2/L; by the packing property, the set has at most (diam · L)^O(ddim) points. This is the fat-shattering dimension of the classifier on the space, and is a good measure of its complexity.
Statistical contribution
[BST '99]: For any f that classifies a sample of size n correctly, we have with probability at least 1 − δ
P{(x, y) : sgn(f(x)) ≠ y} ≤ (2/n) (d log(34en/d) log2(578n) + log(4/δ)).
Likewise, if f is correct on all but k examples, we have with probability at least 1 − δ
P{(x, y) : sgn(f(x)) ≠ y} ≤ k/n + [(2/n) (d ln(34en/d) log2(578n) + ln(4/δ))]^(1/2).
In both cases, d is bounded by the fat-shattering dimension, d ≤ (diam · L)^ddim + 1.
Done with the statistical contribution … On to the computational contribution.
Computational contribution
Evaluation of h for new points in X: the Lipschitz extension function
f(x) = mini [yi + 2 d(x, xi)/d(S+,S−)]
requires exact nearest neighbor search, which can be expensive!
New tool: (1+ε)-approximate nearest neighbor search in time 2^O(ddim) log n + ε^−O(ddim) [KL '04, HM '06, BKL '06, CG '06].
If we evaluate f(x) using an approximate NNS, we can show that the result agrees with (the sign of) at least one of
g(x) = (1+ε) f(x) + ε and e(x) = (1+ε) f(x) − ε. Note that g(x) ≥ f(x) ≥ e(x).
g(x) and e(x) have Lipschitz constant (1+ε)L, so they, and hence the approximate function, generalize well.
Final problem: bias-variance tradeoff
Which sample points in S should h ignore?
If f is correct on all but k examples, we have with probability at least 1 − δ
P{(x, y) : sgn(f(x)) ≠ y} ≤ k/n + [(2/n) (d ln(34en/d) log2(578n) + ln(4/δ))]^(1/2),
where d ≤ (diam · L)^ddim + 1.
Structural Risk Minimization
Algorithm:
- Fix a target Lipschitz constant L (O(n^2) possibilities).
- Locate all pairs of points from S+ and S− whose distance is less than 2/L. At least one point of each such pair has to be taken as an error.
- Goal: remove as few points as possible.
This is minimum vertex cover: NP-complete in general, but it admits a 2-approximation in O(|E|) time.
Here the conflict graph is bipartite, so minimum vertex cover is equivalent to maximum matching (Kőnig's theorem) and admits an exact solution in O(n^2.376) randomized time [MS '04] (see the sketch below).
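A minimal sketch of this step using networkx's bipartite matching utilities (Hopcroft-Karp plus Kőnig's construction, rather than the randomized O(n^2.376) algorithm cited above; the point lists, metric, and threshold are placeholder assumptions):

```python
import networkx as nx
from networkx.algorithms import bipartite

def conflicting_points(positives, negatives, dist, lip_target):
    """Build the bipartite conflict graph: an edge (p, q) whenever a positive p
    and a negative q are closer than 2 / lip_target, then return a minimum
    vertex cover -- a smallest set of sample points to discard as errors."""
    g = nx.Graph()
    g.add_nodes_from(positives, bipartite=0)
    g.add_nodes_from(negatives, bipartite=1)
    for p in positives:
        for q in negatives:
            if dist(p, q) < 2.0 / lip_target:
                g.add_edge(p, q)
    matching = bipartite.maximum_matching(g, top_nodes=positives)
    cover = bipartite.to_vertex_cover(g, matching, top_nodes=positives)
    return cover  # points to treat as errors for this value of L

# Usage on the real line (labels implied by which list a point belongs to):
pos, neg = [0.0, 2.0, 5.0], [0.1, 3.0]
print(conflicting_points(pos, neg, lambda a, b: abs(a - b), lip_target=1.0))
```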
Efficient SRM
Algorithm:
- For each of the O(n^2) values of L:
  - Run the matching algorithm to find the minimum error.
  - Evaluate the generalization bound for this value of L.
- O(n^4.376) randomized time.
Better algorithm (see the sketch below):
- Binary search over the O(n^2) values of L. For each value:
  - Run the greedy 2-approximation to get an approximate minimum error, for O(n^2 log n) time overall.
  - Evaluate the approximate generalization bound for this value of L.
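A minimal sketch of the approximate variant (illustration only: the generalization bound is abstracted into a callback 'bound', and a simple scan over candidate values of L stands in for the binary search described above):

```python
def greedy_vertex_cover(edges):
    """Classic 2-approximation: repeatedly pick an uncovered edge and
    take both of its endpoints into the cover."""
    cover = set()
    for u, v in edges:
        if u not in cover and v not in cover:
            cover.update((u, v))
    return cover

def efficient_srm(positives, negatives, dist, bound, candidate_Ls):
    """For each candidate Lipschitz constant L, approximate the minimum number
    of sample points to discard (greedy vertex cover on the conflict graph),
    and keep the L minimizing the generalization bound bound(k, n, L)."""
    n = len(positives) + len(negatives)
    best = None
    for L in sorted(candidate_Ls):
        conflicts = [(p, q) for p in positives for q in negatives
                     if dist(p, q) < 2.0 / L]
        k = len(greedy_vertex_cover(conflicts))   # approximate error count
        score = bound(k, n, L)
        if best is None or score < best[0]:
            best = (score, L, k)
    return best  # (bound value, chosen L, number of discarded points)
```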
Conclusion
Results:
- Generalization bounds for Lipschitz classifiers in doubling spaces
- Efficient evaluation of the Lipschitz extension hypothesis using approximate NNS
- Efficient structural risk minimization
Continuing research: continuous labels
- Risk bound via the doubling dimension
- Classifier h determined via an LP
- Faster LP: low-hop low-stretch spanners [GR '08a, GR '08b] → fewer constraints, and each variable appears in a bounded number of constraints