Machine Learning 4771 (Tony Jebara, Columbia University): Topic 5 lecture slides, jebara/4771/notes/class5x.pdf

Page 1:

Machine Learning 4771

Instructor: Tony Jebara

Page 2:

Topic 5
• Generalization Guarantees

• VC-Dimension

• Nearest Neighbor Classification (infinite VC dimension)

• Structural Risk Minimization

• Support Vector Machines

Page 3:

Empirical Risk Minimization
• Example: non-pdf linear classifiers  f(x;θ) = sign(θᵀx + θ_0) ∈ {−1,+1}
• Recall ERM:  R_emp(θ) = (1/N) Σ_{i=1}^N L(y_i, f(x_i;θ)) ∈ [0,1]
• Have loss function:
  quadratic:  L(y,x,θ) = ½ (y − f(x;θ))²
  linear:     L(y,x,θ) = |y − f(x;θ)|
  binary:     L(y,x,θ) = step(−y f(x;θ))
• Empirical risk approximates the true risk (expected error):
  R(θ) = E_P{ L(x,y,θ) } = ∫_{X×Y} P(x,y) L(x,y,θ) dx dy ∈ [0,1]
• But, we don't know the true P(x,y)!
• If infinite data, the law of large numbers says:
  lim_{N→∞} min_θ R_emp(θ) = min_θ R(θ)
• But, in general, can't make guarantees for the ERM solution:
  argmin_θ R_emp(θ) ≠ argmin_θ R(θ)
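Below is a minimal Python sketch (assuming NumPy; the toy data and function name are my own) of the binary-loss empirical risk R_emp(θ) for the linear classifier defined above:

```python
import numpy as np

def empirical_risk(theta, theta0, X, y):
    """R_emp(theta) = (1/N) sum_i step(-y_i f(x_i; theta)) for a linear classifier."""
    preds = np.sign(X @ theta + theta0)      # f(x; theta) = sign(theta^T x + theta_0)
    losses = (preds != y).astype(float)      # binary (0/1) loss: step(-y f(x))
    return losses.mean()                     # value lies in [0, 1]

# hypothetical 2-D toy data
X = np.array([[0.0, 1.0], [1.0, 2.0], [2.0, 0.5], [3.0, -1.0]])
y = np.array([1, 1, -1, -1])
print(empirical_risk(np.array([-1.0, 1.0]), 0.5, X, y))   # 0.0 on this toy set
```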

Page 4:

Bounding the True Risk
• ERM is inconsistent / not guaranteed: R(θ̂) ≥ R_emp(θ̂), i.e., it may do better on training than on test!
• Idea: add a prior or regularizer to R_emp(θ)
• Define a capacity or confidence term C(θ) which favors simpler classifiers: J(θ) = R_emp(θ) + C(θ)
• If R(θ) ≤ J(θ), we have a bound: J(θ) is a guaranteed risk
• After training, can guarantee the future error rate is ≤ min_θ J(θ)

Page 5:

Bound the True Risk with VC
• But, how to find a guarantee? Difficult, but there is one…
• Theorem (Vapnik): with probability 1−η, where η is a number in [0,1], the following bound holds:

  R(θ) ≤ J(θ) = R_emp(θ) + [2h log(2eN/h) + 2 log(4/η)]/N × (1 + √(1 + N R_emp(θ) / [h log(2eN/h) + log(4/η)]))

  N = number of data points
  h = Vapnik-Chervonenkis (VC) dimension (1970s) = capacity of the classifier class f(·;θ)
• Note, the above is independent of the true P(x,y)
• A worst-case scenario bound, guaranteed for all P(x,y)
• VC dimension is not just the # of parameters a classifier has
• VC measures the # of different datasets it can classify perfectly
• Structural Risk Minimization: minimize the risk bound J(θ)
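A small Python sketch of the guaranteed risk J(θ) as written above (the function name, the use of the natural log, and η = 0.05 are my assumptions):

```python
import math

def vc_guaranteed_risk(r_emp, N, h, eta=0.05):
    """Vapnik-style guaranteed risk J(theta) = R_emp + confidence term C(theta),
    following the bound form on this slide (natural log assumed)."""
    a = h * math.log(2.0 * math.e * N / h) + math.log(4.0 / eta)
    confidence = (2.0 * a / N) * (1.0 + math.sqrt(1.0 + N * r_emp / a))
    return r_emp + confidence

# e.g. 10,000 points, VC dimension 100, 5% training error
print(vc_guaranteed_risk(0.05, N=10000, h=100))   # roughly 0.35
```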

Page 6:

VC Dimension & Shattering
• How to compute h or VC for a family of functions f(·;θ):
  h = # of training points that can be shattered
• Recall, a classifier maps input to output: f(x;θ) → y ∈ {−1,+1}
• Shattering: I pick h points & place them at x_1,…,x_h.
  You challenge me with any of the 2^h possible labelings y_1,…,y_h ∈ {±1}^h.
  The VC dimension is the maximum # of points I can place which an f(x;θ)
  can correctly classify for an arbitrary labeling y_1,…,y_h.
• Example: for a 2d linear classifier f(x;θ) = sign(θ_1 x_1 + θ_2 x_2 + θ_0), h = 3
  (can't ever shatter 4 points! or 3 points on a straight line…)
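A brute-force Python check of the 2d shattering claim (assuming NumPy/SciPy; the LP-feasibility test for linear separability is my own device, not from the slides):

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def separable(X, y):
    """Check whether labels y are realizable by sign(w.x + b) via LP feasibility
    of y_i (w.x_i + b) >= 1 (scale invariance lets us demand margin 1)."""
    n, d = X.shape
    # variables z = [w (d values), b]; constraint -y_i (w.x_i + b) <= -1
    A_ub = -y[:, None] * np.hstack([X, np.ones((n, 1))])
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=-np.ones(n),
                  bounds=[(None, None)] * (d + 1))
    return res.success

def shattered(X):
    """True if every one of the 2^n labelings of the points in X is separable."""
    n = X.shape[0]
    return all(separable(X, np.array(y))
               for y in itertools.product([-1, 1], repeat=n))

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])              # general position
four = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # XOR corners
print(shattered(three))   # expected True  -> consistent with h = 3 in 2d
print(shattered(four))    # expected False -> 4 points cannot be shattered
```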

Page 7:

VC Dimension & Shattering
• More generally, for higher-dimensional linear classifiers: a hyperplane in ℝ^d shatters any set of linearly independent points. Can choose d+1 linearly indep. points, so h = d+1
• Note: VC is not necessarily proportional to the # of parameters
• Example: sinusoidal 1d classifier f(x;θ) = sign(sin(θx))
  number of parameters = 1 …but… h = infinity!
  since I can choose x_i = 10^{−i}, i = 1,…,h:
  no matter what labeling y_1,…,y_h ∈ {±1}^h you challenge,
  using θ = π(1 + Σ_{i=1}^h ((1−y_i)/2)·10^i), f(x;θ) shatters perfectly
  (But, as a side note, if I choose 4 equally spaced x's then f(x;θ) cannot shatter them.)
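A numerical Python check of the sinusoid construction as reconstructed above, for a small h (the helper name is mine):

```python
import itertools
import numpy as np

def shatter_theta(y):
    """theta = pi * (1 + sum_i ((1 - y_i)/2) * 10^i), the slide's construction."""
    return np.pi * (1.0 + sum((1 - yi) / 2.0 * 10.0 ** (i + 1)
                              for i, yi in enumerate(y)))

h = 6
x = np.array([10.0 ** -(i + 1) for i in range(h)])       # x_i = 10^-i, i = 1..h
ok = all(np.array_equal(np.sign(np.sin(shatter_theta(y) * x)), np.array(y))
         for y in itertools.product([-1, 1], repeat=h))
print(ok)   # expected True: sign(sin(theta x)) realizes every labeling of these points
```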

Page 8:

VC Dimension & Shattering
• Recall that the VC dimension gives an upper bound
• We want to minimize h since that minimizes C(θ) & J(θ)
• If we can't compute h exactly but can compute an upper bound h⁺ ≥ h, we can plug h⁺ into the bound & still guarantee R(θ) ≤ J(θ)
• Also, sometimes the bound is trivial: need h/N below roughly 0.3 before C(θ) < 1 (recall R(θ) ∈ [0,1])
• Note: h = low ⇒ good performance, h = ∞ ⇒ poor performance

Page 9:

Nearest Neighbors & VC
• Consider the Nearest Neighbors classification algorithm:
  Input a query example x
  Find the training example x_i in {x_1,…,x_N} closest to x
  Predict the label for x as the y_i of that neighbor
• Often use Euclidean distance ||x − x_i|| to measure closeness
• Nearest neighbors shatters any set of points!
• So VC = infinity, C(θ) = infinity, guaranteed risk = infinity
• But still works well in practice
  (h = ∞ ⇒ poor performance, h = low ⇒ good performance)
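A minimal 1-nearest-neighbor sketch in Python (assuming NumPy; the toy data is hypothetical):

```python
import numpy as np

def nearest_neighbor_predict(x, X_train, y_train):
    """1-NN: return the label of the training point closest (Euclidean) to query x."""
    dists = np.linalg.norm(X_train - x, axis=1)   # ||x - x_i|| for every training point
    return y_train[np.argmin(dists)]

# hypothetical toy data
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0]])
y_train = np.array([-1, -1, 1])
print(nearest_neighbor_predict(np.array([2.6, 2.9]), X_train, y_train))   # -> 1
```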

Page 10:

VC Dimension & Large Margins
• Linear classifiers are too big a function class since h = d+1
• Can reduce the VC dimension if we restrict them
• Constrain linear classifiers to data living inside a sphere
• Gap-Tolerant classifiers: a linear classifier whose activity is constrained to a sphere & outside a margin
• If M is small relative to D, can still shatter 3 points:

  [Figure: gap-tolerant classifier. M = margin, D = sphere diameter, d = dimensionality. Only count errors in the shaded region; elsewhere L(x,y,θ) = 0.]

Page 11:

VC Dimension & Large Margins
• But, as M grows relative to D, can only shatter 2 points!
  (can't shatter 3, can shatter 2)
• For hyperplanes, as M grows vs. D, we shatter fewer points!
• The VC dimension h goes down if the gap-tolerant classifier has a larger margin; the general formula is:

  h ≤ min{ ceil(D²/M²), d } + 1

• Before, we just had h = d+1. Now we have a smaller h
• If data can be anywhere, D is infinite and we are back to h = d+1
• Typically real data is bounded (by a sphere), so D is fixed
• Maximizing M reduces h, improving the guaranteed risk J(θ)
• Note: R(θ) doesn't count errors in the margin or outside the sphere
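A one-line Python version of the gap-tolerant VC bound above (the function name is mine):

```python
import math

def gap_tolerant_vc_bound(D, M, d):
    """h <= min(ceil(D^2 / M^2), d) + 1 for a gap-tolerant linear classifier."""
    return min(math.ceil(D ** 2 / M ** 2), d) + 1

print(gap_tolerant_vc_bound(D=1.0, M=0.1, d=1000))   # 101: the margin, not d, limits capacity
print(gap_tolerant_vc_bound(D=1.0, M=0.8, d=1000))   # 3: a large margin gives a tiny h
```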

Page 12:

Structural Risk Minimization
• Structural Risk Minimization: minimize the risk bound J(θ) by reducing the empirical error & reducing the VC dimension h

  R(θ) ≤ J(θ) = R_emp(θ) + [2h log(2eN/h) + 2 log(4/η)]/N × (1 + √(1 + N R_emp(θ) / [h log(2eN/h) + log(4/η)]))

  for each model i in a list of hypothesis classes:
  1) compute its h = h_i
  2) compute θ* = argmin_θ R_emp(θ)
  3) compute J(θ*, h_i)
  choose the model with the lowest J(θ*, h_i)

  [Figure: hypothesis classes h_1, h_2, h_3 in the space of different classifiers or hypotheses]

• Or, directly optimize over both: (θ*, h) = argmin_{θ,h} J(θ,h)
• If possible, minimize the empirical error while also minimizing VC
• For gap-tolerant linear classifiers, minimize R_emp(θ) while maximizing the margin; support vector machines do just that!
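A Python sketch of the SRM model-selection loop described above, reusing the guaranteed-risk bound from earlier (the model list and numbers are hypothetical):

```python
import math

def guaranteed_risk(r_emp, N, h, eta=0.05):
    """J(theta*, h): empirical risk plus the VC confidence term (bound from earlier)."""
    a = h * math.log(2.0 * math.e * N / h) + math.log(4.0 / eta)
    return r_emp + (2.0 * a / N) * (1.0 + math.sqrt(1.0 + N * r_emp / a))

# hypothetical model list: (name, VC dimension h_i, training error R_emp(theta*_i))
models = [("small", 10, 0.20), ("medium", 100, 0.08), ("large", 1000, 0.02)]
N = 5000
for name, h, r in models:
    print(name, round(guaranteed_risk(r, N, h), 3))
best = min(models, key=lambda m: guaranteed_risk(m[2], N, m[1]))
print("SRM picks:", best[0])   # the model with the lowest guaranteed risk J
```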

Page 13:

Support Vector Machines
• Support vector machines are (in the simplest case) linear classifiers that do structural risk minimization (SRM)
• Directly maximize the margin to reduce the guaranteed risk J(θ)
• Assume first the 2-class data is linearly separable:
  have {(x_1,y_1),…,(x_N,y_N)} where x_i ∈ ℝ^D and y_i ∈ {−1,1}
  classifier: f(x;θ) = sign(wᵀx + b)
• Decision boundary or hyperplane given by wᵀx + b = 0
• Note: can scale w & b while keeping the same boundary
• Many solutions exist which have empirical error R_emp(θ) = 0
• Want the widest or thickest one (max margin); it's also unique!

Page 14:

Side Note: Constraints
• How to minimize a function subject to equality constraints?

  min_{x_1,x_2} f(x) = min_{x_1,x_2} b_1 x_1 + b_2 x_2 + ½H_11 x_1² + H_12 x_1 x_2 + ½H_22 x_2²
                     = min_x bᵀx + ½xᵀHx
  ⇒ ∂f/∂x = b + Hx = 0 ⇒ x = −H⁻¹b

• Only walk on x_1 = 2x_2, or… x_1 − 2x_2 = 0…
• Use Lagrange multipliers: for each constraint, subtract it times a lambda variable. Lambda blows up the minimization if we don't satisfy the constraint:

  min_{x_1,x_2} max_λ f(x) − λ(equality condition = 0)
  = min_{x_1,x_2} max_λ b_1 x_1 + b_2 x_2 + ½H_11 x_1² + H_12 x_1 x_2 + ½H_22 x_2² − λ(x_1 − 2x_2)

Page 15:

Side Note: Constraints
• Cost minimization with equality constraints:
  1) Subtract each constraint times an extra variable (a Lagrange multiplier λ, like an adversary variable)
  2) Take partials with respect to x and set to zero
  3) Plug the solution into the constraint to find lambda

  min_x max_λ f(x) − λ(equality condition = 0)
  = min_x max_λ bᵀx + ½xᵀHx − λ(x_1 − 2x_2)     where x_1 − 2x_2 = xᵀ[1, −2]ᵀ

  ⇒ ∂f/∂x = b + Hx − λ[1, −2]ᵀ = 0
  ⇒ x = H⁻¹λ[1, −2]ᵀ − H⁻¹b

  ⇒ (H⁻¹λ[1, −2]ᵀ − H⁻¹b)ᵀ [1, −2]ᵀ = 0
  ⇒ λ = bᵀH⁻¹[1, −2]ᵀ / ([1, −2] H⁻¹ [1, −2]ᵀ)
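A NumPy sketch of this Lagrange-multiplier recipe on a hypothetical H, b and the constraint x_1 − 2x_2 = 0 (the numbers are my own toy values):

```python
import numpy as np

# Hypothetical positive-definite H and linear term b for f(x) = b.x + 0.5 x.H.x,
# minimized subject to x1 - 2*x2 = 0 (constraint vector c = [1, -2]).
H = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
c = np.array([1.0, -2.0])

Hinv = np.linalg.inv(H)
lam = (b @ Hinv @ c) / (c @ Hinv @ c)       # lambda = b'H^-1 c / (c'H^-1 c)
x = Hinv @ (lam * c) - Hinv @ b             # x = H^-1 (lambda c) - H^-1 b

print("x =", x)
print("constraint x1 - 2*x2 =", x @ c)                          # ~0: constraint satisfied
print("stationarity b + Hx - lambda*c =", b + H @ x - lam * c)  # ~0: gradient vanishes
```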

Page 16:

Support Vector Machines
• Define:
  H:  wᵀx + b = 0, the decision hyperplane
  H+: positive margin hyperplane
  H−: negative margin hyperplane
  q:  distance from the decision plane to the origin

  q = min_x ||x − 0|| subject to wᵀx + b = 0
  equivalently min_x ½||x − 0||² − λ(wᵀx + b)

  1) gradient:  ∂/∂x ( ½xᵀx − λ(wᵀx + b) ) = 0 ⇒ x − λw = 0 ⇒ x = λw
  2) plug into constraint:  wᵀx + b = 0 ⇒ wᵀ(λw) + b = 0 ⇒ λ = −b/(wᵀw)
  3) solution:  x̂ = −(b/(wᵀw)) w
  4) distance:  q = ||x̂ − 0|| = |b|/(wᵀw) √(wᵀw) = |b| / ||w||
  5) Define, without loss of generality (since we can scale b & w):
     H:  wᵀx + b = 0
     H+: wᵀx + b = +1
     H−: wᵀx + b = −1
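A quick NumPy check of q = |b| / ||w|| with a hypothetical w and b:

```python
import numpy as np

# The closest point on the hyperplane w.x + b = 0 to the origin is
# x_hat = -(b / (w.w)) w, and its norm should equal |b| / ||w||.
w = np.array([3.0, 4.0])
b = 2.0
x_hat = -(b / (w @ w)) * w
print(np.isclose(w @ x_hat + b, 0.0))                     # x_hat lies on the hyperplane
print(np.linalg.norm(x_hat), abs(b) / np.linalg.norm(w))  # both 0.4 = |b| / ||w||
```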

Page 17:

Support Vector Machines
• The constraints on the SVM for R_emp(θ) = 0 are thus:
  wᵀx_i + b ≥ +1  ∀ y_i = +1
  wᵀx_i + b ≤ −1  ∀ y_i = −1
• Or more simply:  y_i(wᵀx_i + b) − 1 ≥ 0
• The margin of the SVM is:  m = d₊ + d₋
• Distance to origin:
  H:  q  = |b| / ||w||
  H+: q₊ = |b − 1| / ||w||
  H−: q₋ = |−1 − b| / ||w||
• Therefore: d₊ = d₋ = 1/||w|| and margin m = 2/||w||
• Want to max the margin, or equivalently minimize ||w|| or ||w||²
• SVM Problem:  min ½||w||²  subject to  y_i(wᵀx_i + b) − 1 ≥ 0
• This is a quadratic program!
• Can plug this into a matlab function called "qp()", done!
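A Python sketch of the primal SVM QP (the slide refers to Matlab's qp(); SciPy's SLSQP is used here only as a stand-in generic solver, and the toy data is hypothetical):

```python
import numpy as np
from scipy.optimize import minimize

def svm_primal(X, y):
    """Solve min 0.5*||w||^2 s.t. y_i (w.x_i + b) - 1 >= 0 with a generic solver."""
    n, d = X.shape
    def objective(z):                       # z = [w, b]
        return 0.5 * np.dot(z[:d], z[:d])
    constraints = [{'type': 'ineq',
                    'fun': lambda z, i=i: y[i] * (X[i] @ z[:d] + z[d]) - 1.0}
                   for i in range(n)]
    res = minimize(objective, np.zeros(d + 1), constraints=constraints, method='SLSQP')
    return res.x[:d], res.x[d]

# hypothetical linearly separable toy data
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = svm_primal(X, y)
print(w, b, np.sign(X @ w + b))   # max-margin hyperplane; predictions match y
```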

Page 18:

Side Note: Optimization Tools
• A hierarchy of (Matlab) optimization packages to use:

  Linear Programming < Quadratic Programming < Quadratically Constrained Quadratic Programming < Semidefinite Programming < Convex Programming < Polynomial Time Algorithms

  [Figure: nested problem classes LP ⊂ QP ⊂ QCQP ⊂ SDP ⊂ CP ⊂ P]

  LP:  min_x bᵀx           s.t.  c_iᵀx ≥ α_i  ∀i
  QP:  min_x ½xᵀHx + bᵀx   s.t.  c_iᵀx ≥ α_i  ∀i

Page 19:

Side Note: Optimization Tools
• Each data point adds a linear inequality y_i(wᵀx_i + b) − 1 ≥ 0 to the QP with cost ½wᵀw
• Each point cuts away a half-plane of allowable SVMs and reduces the green (feasible) region
• The SVM is the closest point to the origin that is still in the green region
• The perceptron algorithm just puts us randomly in the green region
• QP runs in cubic polynomial time
• There are D values in the w vector
• Needs O(D³) run time… But, there is a DUAL SVM in O(N³)!

Page 20:

SVM in Dual Form
• We can also solve the problem via convex duality
• Primal SVM problem:  min ½||w||²  subject to  y_i(wᵀx_i + b) − 1 ≥ 0
• This is a quadratic program: a quadratic cost function with multiple linear inequalities (these carve out a convex feasible region)
• Subtract from the cost each inequality times an α Lagrange multiplier, then take derivatives with respect to w & b:

  L_P = min_{w,b} max_{α≥0}  ½||w||² − Σ_i α_i ( y_i(wᵀx_i + b) − 1 )

  ∂L_P/∂w = w − Σ_i α_i y_i x_i = 0  →  w = Σ_i α_i y_i x_i
  ∂L_P/∂b = −Σ_i α_i y_i = 0

• Plug back in to get the dual:
  L_D = Σ_i α_i − ½ Σ_i Σ_j α_i α_j y_i y_j x_iᵀx_j
• Also have constraints:  Σ_i α_i y_i = 0  and  α_i ≥ 0
• The above L_D must be maximized! convex duality… also qp()
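A Python sketch of the dual: maximize L_D subject to Σ_i α_i y_i = 0 and α_i ≥ 0, then recover w = Σ_i α_i y_i x_i (again a generic SciPy solver as a stand-in for qp(); the toy data is hypothetical):

```python
import numpy as np
from scipy.optimize import minimize

def svm_dual(X, y):
    """Maximize L_D = sum_i a_i - 0.5 sum_ij a_i a_j y_i y_j x_i.x_j
    s.t. sum_i a_i y_i = 0, a_i >= 0, then recover w = sum_i a_i y_i x_i."""
    n = X.shape[0]
    Q = (y[:, None] * X) @ (y[:, None] * X).T        # Q_ij = y_i y_j x_i.x_j
    neg_dual = lambda a: 0.5 * a @ Q @ a - a.sum()   # minimize -L_D
    res = minimize(neg_dual, np.zeros(n), method='SLSQP',
                   bounds=[(0.0, None)] * n,
                   constraints=[{'type': 'eq', 'fun': lambda a: a @ y}])
    alpha = res.x
    w = (alpha * y) @ X                  # w = sum_i alpha_i y_i x_i
    sv = np.argmax(alpha)                # a support vector satisfies y_i(w.x_i + b) = 1
    b = y[sv] - X[sv] @ w
    return w, b, alpha

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b, alpha = svm_dual(X, y)
print(w, b)   # same max-margin hyperplane as the primal solution
```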