Machine Learning - Columbia University (jebara/4771/notes/class5x.pdf)
TRANSCRIPT
Tony Jebara, Columbia University
Machine Learning 4771
Instructor: Tony Jebara
Topic 5
• Generalization Guarantees
• VC-Dimension
• Nearest Neighbor Classification (infinite VC dimension)
• Structural Risk Minimization
• Support Vector Machines
Empirical Risk Minimization
• Example: non-pdf (non-probabilistic) linear classifiers $f(x;\theta) = \mathrm{sign}(\theta^T x + \theta_0) \in \{-1,1\}$
• Recall ERM: $R_{emp}(\theta) = \frac{1}{N}\sum_{i=1}^{N} L\left(y_i, f(x_i;\theta)\right) \in [0,1]$
• Have loss function:
  quadratic: $L(y,x,\theta) = \frac{1}{2}\left(y - f(x;\theta)\right)^2$
  linear: $L(y,x,\theta) = \left|y - f(x;\theta)\right|$
  binary: $L(y,x,\theta) = \mathrm{step}\left(-y\, f(x;\theta)\right)$
• The empirical risk approximates the true risk (expected error):
  $R(\theta) = E_P\left\{L(x,y,\theta)\right\} = \int_{X\times Y} P(x,y)\, L(x,y,\theta)\, dx\, dy \in [0,1]$
• But, we don't know the true $P(x,y)$!
• If infinite data, the law of large numbers says: $\lim_{N\to\infty} \min_\theta R_{emp}(\theta) = \min_\theta R(\theta)$
• But, in general, can't make guarantees for the ERM solution: $\arg\min_\theta R_{emp}(\theta) \neq \arg\min_\theta R(\theta)$
  (see the short code sketch below)
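To make $R_{emp}(\theta)$ concrete, here is a minimal Python sketch with made-up toy data of the empirical risk under the binary (0/1) loss for a linear classifier; the helper name empirical_risk is mine, not from the lecture.

```python
# Minimal sketch (toy data assumed): empirical risk
# R_emp(theta) = (1/N) sum_i L(y_i, f(x_i; theta)) with the binary loss
# L = step(-y f(x; theta)) and f(x; theta) = sign(theta^T x + theta_0).
import numpy as np

def empirical_risk(theta, theta0, X, y):
    """Fraction of training points misclassified by sign(theta^T x + theta0)."""
    preds = np.sign(X @ theta + theta0)
    return np.mean(preds != y)          # average 0/1 loss, lies in [0,1]

# toy example (made-up numbers, just to show the call)
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
print(empirical_risk(np.array([1.0, 1.0]), 0.0, X, y))   # 0.0 on this separable toy set
```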
Bounding the True Risk
• ERM is inconsistent / not guaranteed: $R(\hat\theta) \geq R_{emp}(\hat\theta)$, so we may do better on training than on test!
• Idea: add a prior or regularizer to $R_{emp}(\theta)$
• Define a capacity or confidence term $C(\theta)$, which favors simpler classifiers: $J(\theta) = R_{emp}(\theta) + C(\theta)$
• If $R(\theta) \leq J(\theta)$, we have a bound and $J(\theta)$ is a guaranteed risk
• After training, we can guarantee the future error rate is $\leq \min_\theta J(\theta)$
Bound the True Risk with VC
• But, how to find a guarantee? Difficult, but there is one…
• Theorem (Vapnik): with probability $1-\eta$, where $\eta$ is a number in $[0,1]$, the following bound holds:
  $R(\theta) \leq J(\theta) = R_{emp}(\theta) + \frac{2h\log\frac{2eN}{h} + 2\log\frac{4}{\eta}}{N}\left(1 + \sqrt{1 + \frac{N\, R_{emp}(\theta)}{h\log\frac{2eN}{h} + \log\frac{4}{\eta}}}\right)$
  N = number of data points
  h = Vapnik-Chervonenkis (VC) dimension (1970's) = capacity of the classifier class $f(\cdot;\theta)$
• Note, the above is independent of the true $P(x,y)$
• A worst-case scenario bound, guaranteed for all $P(x,y)$
• VC dimension is not just the # of parameters a classifier has
• VC measures the # of different datasets it can classify perfectly
• Structural Risk Minimization: minimize the risk bound $J(\theta)$ (a numeric sketch of the bound follows)
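As an illustration of how the bound behaves, here is a small sketch that evaluates the VC confidence term and the guaranteed risk $J(\theta)$; the values of N, h, $\eta$ and $R_{emp}$ are assumed, and the helper name guaranteed_risk is mine.

```python
# Sketch of the Vapnik bound above: J(theta) = R_emp + C(theta), where C is the
# VC confidence computed from N (sample size), h (VC dimension) and eta.
import numpy as np

def guaranteed_risk(r_emp, N, h, eta=0.05):
    a = h * np.log(2 * np.e * N / h) + np.log(4 / eta)
    c = (2 * a / N) * (1 + np.sqrt(1 + N * r_emp / a))   # VC confidence C(theta)
    return r_emp + c                                      # upper bound J(theta) on R(theta)

# the bound is only non-trivial (< 1) when h/N is small enough
for h in [10, 100, 1000]:
    print(h, guaranteed_risk(r_emp=0.1, N=10000, h=h))
```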
VC Dimension & Shattering
• How to compute h or VC for a family of functions $f(\cdot;\theta)$?
  h = # of training points that can be shattered
• Recall, the classifier maps an input to an output: $f(x;\theta) \to y \in \{-1,1\}$
• Shattering: I pick h points & place them at $x_1,\ldots,x_h$
  You challenge me with any of the $2^h$ possible labelings $y_1,\ldots,y_h \in \{\pm 1\}^h$
  The VC dimension is the maximum # of points I can place such that some $f(x;\theta)$ can correctly classify them for an arbitrary labeling $y_1,\ldots,y_h$
• Example: for the 2d linear classifier $f(x;\theta) = x_1\theta_1 + x_2\theta_2 + \theta_0$, h = 3 (see the check below)
  can't ever shatter 4 points! or 3 points on a straight line…
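A quick sketch (with my own choice of three non-collinear points) that checks the shattering part of the h = 3 claim: for each of the $2^3$ labelings, a perceptron finds a linear separator, and its convergence certifies that the labeling is realizable.

```python
# Sketch: a 2d linear classifier shatters 3 non-collinear points. For every
# labeling in {-1,+1}^3 the perceptron finds theta with y_i (theta^T x_i + theta_0) > 0.
import itertools
import numpy as np

pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # 3 points, not on a line
X = np.hstack([pts, np.ones((3, 1))])                   # append 1 to absorb the bias theta_0

for labels in itertools.product([-1, 1], repeat=3):
    y = np.array(labels)
    theta = np.zeros(3)
    for _ in range(1000):                               # perceptron updates
        bad = np.where(y * (X @ theta) <= 0)[0]
        if len(bad) == 0:
            break
        theta += y[bad[0]] * X[bad[0]]
    print(labels, "shattered:", bool(np.all(y * (X @ theta) > 0)))
```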
VC Dimension & Shattering
• More generally, for higher dimensional linear classifiers: a hyperplane in $\mathbb{R}^d$ shatters any set of linearly independent points. Can choose d+1 linearly indep. points, so h = d+1
• Note: VC is not necessarily proportional to the # of parameters
• Example: the sinusoidal 1d classifier $f(x;\theta) = \mathrm{sign}\left(\sin(\theta x)\right)$
  number of parameters = 1 …but… h = infinity!
  since I can choose $x_i = 10^{-i},\ i = 1,\ldots,h$, and no matter what labeling $y_1,\ldots,y_h \in \{\pm 1\}^h$ you challenge me with,
  using $\theta = \pi\left(1 + \sum_{i=1}^{h} \frac{1}{2}(1-y_i)\,10^{i}\right)$ shatters perfectly (verified numerically below)
• But, as a side note, if I choose 4 equally spaced x's then I cannot shatter them
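A small numeric check of this construction (h = 4 is an arbitrary choice that keeps the arguments of sin well inside floating-point range): every labeling of the points $x_i = 10^{-i}$ is reproduced by the single parameter $\theta$.

```python
# Sketch verifying the sine-classifier shattering argument: with x_i = 10^{-i},
# theta = pi * (1 + sum_i (1 - y_i)/2 * 10^i) reproduces any labeling y under
# f(x; theta) = sign(sin(theta * x)).
import itertools
import numpy as np

h = 4
x = np.array([10.0 ** (-i) for i in range(1, h + 1)])

ok = True
for labels in itertools.product([-1, 1], repeat=h):
    y = np.array(labels)
    theta = np.pi * (1 + np.sum(0.5 * (1 - y) * 10.0 ** np.arange(1, h + 1)))
    ok &= bool(np.all(np.sign(np.sin(theta * x)) == y))
print("all 2^h labelings shattered:", ok)
```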
VC Dimension & Shattering
• Recall that the VC argument gives an upper bound on the true risk
• We want to minimize h since that minimizes $C(\theta)$ & $J(\theta)$
• If we can't compute h exactly but can compute an upper bound $h^+ \geq h$, we can plug $h^+$ into the bound & still get a guarantee
• Also, sometimes the bound is trivial: need $h/N \approx 0.3$ or less before $C(\theta) < 1$ (recall $R(\theta) \in [0,1]$)
• Note: h = low ⇒ good performance, h = ∞ ⇒ poor performance
Nearest Neighbors & VC
• Consider the Nearest Neighbors classification algorithm (sketched below):
  Input a query example x
  Find the training example $x_i$ in $\{x_1,\ldots,x_N\}$ closest to x
  Predict the label for x as the $y_i$ of that neighbor
• Often use the Euclidean distance $\|x - x_i\|$ to measure closeness
• Nearest neighbors shatters any set of points!
• So VC = infinity, $C(\theta)$ = infinity, guaranteed risk = infinity
  (h = ∞ ⇒ poor performance, h = low ⇒ good performance)
• But it still works well in practice
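A minimal 1-nearest-neighbor sketch matching the algorithm above; the training data here is made up and the helper name nn_predict is mine.

```python
# 1-nearest-neighbor sketch: return the label y_i of the training point x_i
# closest (in Euclidean distance) to the query x.
import numpy as np

def nn_predict(x, X_train, y_train):
    dists = np.linalg.norm(X_train - x, axis=1)   # ||x - x_i|| for every training point
    return y_train[np.argmin(dists)]

X_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
y_train = np.array([-1, 1, 1])
print(nn_predict(np.array([0.2, 0.1]), X_train, y_train))   # -> -1
```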
VC Dimension & Large Margins
• Linear classifiers are too big a function class since h = d+1
• Can reduce the VC dimension if we restrict them
• Constrain linear classifiers to data living inside a sphere
• Gap-Tolerant classifiers: a linear classifier whose activity is constrained to a sphere & outside a margin
  (figure: M = margin, D = diameter, d = dimensionality; only count errors in the shaded region, elsewhere $L(x,y,\theta) = 0$)
• If M is small relative to D, we can still shatter 3 points:
VC Dimension & Large Margins
• But, as M grows relative to D, we can only shatter 2 points!
  (figure: can't shatter 3, can shatter 2)
• For hyperplanes, as M grows vs. D, we shatter fewer points!
• The VC dimension h goes down if the gap-tolerant classifier has a larger margin; the general formula is:
  $h \leq \min\left\{\mathrm{ceil}\left(\frac{D^2}{M^2}\right),\, d\right\} + 1$
• Before, we just had h = d+1. Now we have a smaller h (see the sketch below)
• If the data can be anywhere, D is infinite and we are back to h = d+1
• Typically real data is bounded (by a sphere), so D is fixed
• Maximizing M reduces h, improving the guaranteed risk $J(\theta)$
• Note: $R(\theta)$ doesn't count errors in the margin or outside the sphere
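A tiny sketch of this capacity formula with assumed values of D, M and d, showing how growing the margin drives the bound on h down from d + 1.

```python
# Sketch of h <= min(ceil(D^2 / M^2), d) + 1 for gap-tolerant classifiers.
import math

def gap_tolerant_vc_bound(D, M, d):
    return min(math.ceil(D ** 2 / M ** 2), d) + 1

for M in [0.1, 1.0, 5.0, 10.0]:
    print(M, gap_tolerant_vc_bound(D=10.0, M=M, d=100))
# M = 0.1 -> 101 (= d+1),  M = 1 -> 101,  M = 5 -> 5,  M = 10 -> 2
```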
Structural Risk Minimization
• Structural Risk Minimization: minimize the risk bound $J(\theta)$ by reducing the empirical error & reducing the VC dimension h
• For each model i in a list of hypothesis classes (with capacities $h_1, h_2, h_3, \ldots$ over the space of different classifiers or hypotheses):
  1) compute its VC dimension $h = h_i$
  2) compute $\theta^* = \arg\min_\theta R_{emp}(\theta)$
  3) compute $J(\theta^*, h_i)$
  then choose the model with the lowest $J(\theta^*, h_i)$ (a schematic loop follows this slide)
• Or, directly optimize over both: $(\theta^*, h) = \arg\min_{\theta,h} J(\theta,h)$ where
  $R(\theta) \leq J(\theta) = R_{emp}(\theta) + \frac{2h\log\frac{2eN}{h} + 2\log\frac{4}{\eta}}{N}\left(1 + \sqrt{1 + \frac{N\, R_{emp}(\theta)}{h\log\frac{2eN}{h} + \log\frac{4}{\eta}}}\right)$
• If possible, minimize the empirical error while also minimizing the VC dimension
• For gap-tolerant linear classifiers, minimize $R_{emp}(\theta)$ while maximizing the margin; support vector machines do just that!
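A schematic of the SRM loop above. The model interface here is hypothetical: each entry supplies its VC dimension $h_i$ and an ERM fitting routine; guaranteed_risk is the same bound helper sketched earlier, repeated so the snippet stands alone.

```python
# Schematic SRM: fit each hypothesis class by ERM, score it with the VC bound,
# and keep the class with the lowest guaranteed risk J(theta*, h_i).
import numpy as np

def guaranteed_risk(r_emp, N, h, eta=0.05):
    a = h * np.log(2 * np.e * N / h) + np.log(4 / eta)
    return r_emp + (2 * a / N) * (1 + np.sqrt(1 + N * r_emp / a))

def srm_select(models, X, y, eta=0.05):
    """models: list of (h_i, fit_erm) pairs, where fit_erm(X, y) -> (theta*, R_emp)."""
    best = None
    for h_i, fit_erm in models:
        theta_star, r_emp = fit_erm(X, y)              # step 2: ERM within the class
        J = guaranteed_risk(r_emp, len(y), h_i, eta)   # step 3: guaranteed risk
        if best is None or J < best[0]:
            best = (J, h_i, theta_star)                # keep the lowest J
    return best
```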
Support Vector Machines
• Support vector machines are (in the simplest case) linear classifiers that do structural risk minimization (SRM)
• Directly maximize the margin to reduce the guaranteed risk $J(\theta)$
• Assume first the 2-class data is linearly separable:
  have $\{(x_1,y_1),\ldots,(x_N,y_N)\}$ where $x_i \in \mathbb{R}^D$ and $y_i \in \{-1,1\}$
• The classifier is $f(x;\theta) = \mathrm{sign}(w^T x + b)$
• The decision boundary or hyperplane is given by $w^T x + b = 0$
• Note: we can scale w & b while keeping the same boundary
• Many solutions exist which have empirical error $R_{emp}(\theta) = 0$
• Want the widest or thickest one (max margin); it is also unique!
Side Note: Constraints
• How to minimize a function subject to equality constraints? Take a quadratic cost:
  $\min_{x_1,x_2} f(x) = \min_{x_1,x_2}\ b_1 x_1 + b_2 x_2 + \tfrac{1}{2}H_{11}x_1^2 + H_{12}x_1 x_2 + \tfrac{1}{2}H_{22}x_2^2 = \min_x\ b^T x + \tfrac{1}{2}x^T H x$
  Unconstrained: $\frac{\partial f}{\partial x} = b + Hx = 0 \;\Rightarrow\; x = -H^{-1}b$
• Only walk on $x_1 = 2x_2$ or… $x_1 - 2x_2 = 0$…
• Use Lagrange Multipliers: for each constraint, subtract it times a lambda variable. Lambda blows up the minimization if we don't satisfy the constraint:
  $\min_{x_1,x_2}\max_\lambda\ f(x) - \lambda\,(\text{equality condition} = 0)$
  $= \min_{x_1,x_2}\max_\lambda\ b_1 x_1 + b_2 x_2 + \tfrac{1}{2}H_{11}x_1^2 + H_{12}x_1 x_2 + \tfrac{1}{2}H_{22}x_2^2 - \lambda\left(x_1 - 2x_2\right)$
Side Note: Constraints
• Cost minimization with equality constraints (note the constraint can be written $x_1 - 2x_2 = x^T\begin{bmatrix}1\\-2\end{bmatrix}$):
  1) Subtract each constraint times an extra variable (a Lagrange multiplier $\lambda$, like an adversary variable):
     $\min_x\max_\lambda\ f(x) - \lambda\,(\text{equality condition} = 0) = \min_x\max_\lambda\ b^T x + \tfrac{1}{2}x^T H x - \lambda\left(x_1 - 2x_2\right)$
  2) Take partials with respect to x and set to zero:
     $\frac{\partial f}{\partial x} = b + Hx - \lambda\begin{bmatrix}1\\-2\end{bmatrix} = 0 \;\Rightarrow\; x = H^{-1}\lambda\begin{bmatrix}1\\-2\end{bmatrix} - H^{-1}b$
  3) Plug the solution into the constraint to find lambda:
     $\left(H^{-1}\lambda\begin{bmatrix}1\\-2\end{bmatrix} - H^{-1}b\right)^T\begin{bmatrix}1\\-2\end{bmatrix} = 0 \;\Rightarrow\; \lambda = \frac{b^T H^{-1}\begin{bmatrix}1\\-2\end{bmatrix}}{\begin{bmatrix}1\\-2\end{bmatrix}^T H^{-1}\begin{bmatrix}1\\-2\end{bmatrix}}$
  (a numeric check follows below)
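A numeric check of this recipe with made-up values for H and b: it verifies that the stationarity condition and the constraint both hold at the computed x and $\lambda$.

```python
# Check the Lagrange-multiplier solution of
#   min_x b^T x + (1/2) x^T H x   subject to   x_1 - 2 x_2 = 0.
import numpy as np

H = np.array([[2.0, 0.5], [0.5, 1.0]])   # any symmetric positive-definite H (assumed values)
b = np.array([1.0, -1.0])
c = np.array([1.0, -2.0])                # constraint vector: c^T x = 0

Hinv = np.linalg.inv(H)
lam = (b @ Hinv @ c) / (c @ Hinv @ c)    # lambda from plugging x back into the constraint
x = Hinv @ (lam * c - b)                 # stationary point of the Lagrangian

print("constraint   c^T x            =", c @ x)                 # ~0, constraint satisfied
print("stationarity b + Hx - lam*c   =", b + H @ x - lam * c)   # ~0 vector
```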
Support Vector Machines
• Define: the decision hyperplane $w^T x + b = 0$,
  H+ = positive margin hyperplane, H− = negative margin hyperplane,
  q = distance from the decision plane to the origin: $q = \min_x \|x - 0\|$ subject to $w^T x + b = 0$
• Solve with a Lagrange multiplier: $\min_x \tfrac{1}{2}\|x - 0\|^2 - \lambda\left(w^T x + b\right)$
  1) gradient: $\frac{\partial}{\partial x}\left(\tfrac{1}{2}x^T x - \lambda(w^T x + b)\right) = 0 \;\Rightarrow\; x - \lambda w = 0 \;\Rightarrow\; x = \lambda w$
  2) plug into the constraint: $w^T x + b = 0 \;\Rightarrow\; w^T(\lambda w) + b = 0 \;\Rightarrow\; \lambda = -\frac{b}{w^T w}$
  3) solution: $\hat{x} = -\frac{b}{w^T w}\, w$
  4) distance: $q = \|\hat{x} - 0\| = \frac{|b|}{w^T w}\sqrt{w^T w} = \frac{|b|}{\|w\|}$ (verified numerically below)
  5) define, without loss of generality since we can scale b & w:
     $H \to w^T x + b = 0,\qquad H^+ \to w^T x + b = +1,\qquad H^- \to w^T x + b = -1$
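A quick numeric check, with assumed values of w and b, of the closest-point formula $\hat{x} = -\frac{b}{w^T w}w$ and the distance $q = |b|/\|w\|$.

```python
# Verify the distance-to-origin derivation for the hyperplane w^T x + b = 0.
import numpy as np

w = np.array([3.0, 4.0])      # assumed example values
b = 2.0
x_hat = -(b / (w @ w)) * w    # closest point on the hyperplane to the origin

print(w @ x_hat + b)                                       # ~0: x_hat lies on the hyperplane
print(np.linalg.norm(x_hat), abs(b) / np.linalg.norm(w))   # both 0.4 = |b| / ||w||
```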
Support Vector Machines
• The constraints on the SVM for $R_{emp}(\theta) = 0$ are thus:
  $w^T x_i + b \geq +1 \;\;\forall\, y_i = +1$
  $w^T x_i + b \leq -1 \;\;\forall\, y_i = -1$
• Or more simply: $y_i\left(w^T x_i + b\right) - 1 \geq 0$
• The margin of the SVM is: $m = d^+ + d^-$
• Distances to the origin: $H \to q = \frac{|b|}{\|w\|},\quad H^+ \to q^+ = \frac{|b-1|}{\|w\|},\quad H^- \to q^- = \frac{|-1-b|}{\|w\|}$
• Therefore: $d^+ = d^- = \frac{1}{\|w\|}$ and the margin is $m = \frac{2}{\|w\|}$
• Want to maximize the margin, or equivalently minimize $\|w\|$ or $\|w\|^2$
• SVM Problem: $\min\ \tfrac{1}{2}\|w\|^2$ subject to $y_i\left(w^T x_i + b\right) - 1 \geq 0$
• This is a quadratic program! Can plug this into a matlab function called "qp()", done! (a sketch with a generic solver follows below)
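A sketch of this primal QP on made-up separable toy data. The slide's matlab qp() is not assumed here; scipy's SLSQP solver stands in as a generic constrained optimizer, started from a feasible point.

```python
# Primal SVM sketch: minimize (1/2)||w||^2 s.t. y_i (w^T x_i + b) - 1 >= 0.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])  # toy separable data
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(p):                      # p = [w_1, ..., w_D, b]
    w = p[:-1]
    return 0.5 * w @ w

constraints = [{'type': 'ineq',
                'fun': lambda p, i=i: y[i] * (X[i] @ p[:-1] + p[-1]) - 1.0}
               for i in range(len(y))]

res = minimize(objective, x0=np.array([1.0, 1.0, 0.0]), constraints=constraints)
w, b = res.x[:-1], res.x[-1]
print("w =", w, " b =", b, " margin =", 2.0 / np.linalg.norm(w))
```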
Side Note: Optimization Tools
• A hierarchy of Matlab optimization packages to use:
  Linear Programming < Quadratic Programming < Quadratically Constrained Quadratic Programming < Semidefinite Programming < Convex Programming < Polynomial Time Algorithms
  (figure: nested classes LP ⊂ QP ⊂ QCQP ⊂ SDP ⊂ CP ⊂ P)
• Linear program (LP): $\min_x\ b^T x$ s.t. $c_i^T x \geq \alpha_i \;\forall i$
• Quadratic program (QP): $\min_x\ \tfrac{1}{2}x^T H x + b^T x$ s.t. $c_i^T x \geq \alpha_i \;\forall i$
Side Note: Optimization Tools
• Each data point adds a linear inequality $y_i\left(w^T x_i + b\right) - 1 \geq 0$ to the QP minimizing $\tfrac{1}{2}w^T w$
• Each point cuts away a half plane of allowable SVMs and reduces the (green) feasible region
• The SVM is the closest point to the origin that is still in the green region
• The perceptron algorithm just puts us randomly in the green region
• QP runs in cubic polynomial time
• There are D values in the w vector, so it needs $O(D^3)$ run time… But, there is a DUAL SVM in $O(N^3)$!
SVM in Dual Form
• We can also solve the problem via convex duality
• Primal SVM problem: $\min\ \tfrac{1}{2}\|w\|^2$ subject to $y_i\left(w^T x_i + b\right) - 1 \geq 0$
• This is a quadratic program: a quadratic cost function with multiple linear inequalities (these carve out a convex feasible region)
• Subtract from the cost each inequality times an $\alpha$ Lagrange multiplier, then take derivatives w.r.t. w & b:
  $L_P = \min_{w,b}\max_{\alpha \geq 0}\ \tfrac{1}{2}\|w\|^2 - \sum_i \alpha_i\left(y_i\left(w^T x_i + b\right) - 1\right)$
  $\frac{\partial L_P}{\partial w} = w - \sum_i \alpha_i y_i x_i = 0 \;\rightarrow\; w = \sum_i \alpha_i y_i x_i \qquad \frac{\partial L_P}{\partial b} = -\sum_i \alpha_i y_i = 0$
• Plug back in to get the dual: $L_D = \sum_i \alpha_i - \tfrac{1}{2}\sum_i\sum_j \alpha_i \alpha_j y_i y_j\, x_i^T x_j$
• Also have constraints: $\sum_i \alpha_i y_i = 0$ and $\alpha_i \geq 0$
• The above $L_D$ must be maximized! convex duality… also solvable with qp() (a sketch follows below)
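A matching sketch of the dual on the same toy data as the primal sketch above, minimizing $-L_D$ subject to $\sum_i \alpha_i y_i = 0$ and $\alpha_i \geq 0$; SLSQP again stands in for qp().

```python
# Dual SVM sketch: maximize L_D = sum_i a_i - (1/2) sum_ij a_i a_j y_i y_j x_i^T x_j.
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
K = (y[:, None] * X) @ (y[:, None] * X).T          # K_ij = y_i y_j x_i^T x_j

def neg_dual(a):
    return 0.5 * a @ K @ a - a.sum()               # minimize -L_D

cons = [{'type': 'eq', 'fun': lambda a: a @ y}]    # sum_i alpha_i y_i = 0
bnds = [(0.0, None)] * len(y)                      # alpha_i >= 0

res = minimize(neg_dual, x0=np.full(len(y), 0.1), bounds=bnds, constraints=cons)
alpha = res.x
w = (alpha * y) @ X                                # recover w = sum_i alpha_i y_i x_i
print("alpha =", alpha.round(3), " w =", w)
```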