
Forgetron Slide 1

Online Learning with a Memory Harness using

the Forgetron

Shai Shalev-Shwartz joint work with

Ofer Dekel and Yoram Singer

Large Scale Kernel Machine NIPS’05, Whistler

The Hebrew University Jerusalem, Israel

Forgetron Slide 2

Overview

• Online learning with kernels

• Goal: strict limit on the number of

“support vectors”

• The Forgetron algorithm

• Analysis

• Experiments

Forgetron Slide 3

current classifier: f_t(x) = Σ_{i ∈ I} y_i K(x_i, x)

Kernel-based Perceptron for Online Learning

[Figure: the online learner maintains an active set of mistake rounds, e.g. I = {1,3}. On round t it receives x_t, predicts sign(f_t(x_t)), and then observes the true label y_t. On a mistake, t joins the active set (I = {1,3,4}) and the mistake counter M is incremented.]

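The protocol above can be sketched in a few lines of Python. This is a minimal illustration only; the RBF kernel and the toy stream are my own choices, not from the slides:

```python
import math

def rbf_kernel(a, b, gamma=1.0):
    """Gaussian RBF kernel; any PSD kernel with K(x, x) <= 1 would do."""
    return math.exp(-gamma * sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def kernel_perceptron(stream, kernel):
    """Online kernel Perceptron: the active set holds every example
    on which the learner erred (its "support vectors")."""
    active = []       # list of (x_i, y_i) pairs, the active set I
    mistakes = 0      # the mistake counter M
    for x_t, y_t in stream:
        # current classifier: f_t(x) = sum over i in I of y_i K(x_i, x)
        f_t = sum(y_i * kernel(x_i, x_t) for x_i, y_i in active)
        if y_t * f_t <= 0:             # predicted sign was wrong
            active.append((x_t, y_t))  # add round t to the active set
            mistakes += 1
    return active, mistakes

stream = [((0.0, 0.0), -1), ((2.0, 2.0), 1)] * 5
active, mistakes = kernel_perceptron(stream, rbf_kernel)
```

Note that nothing bounds the size of `active`: it grows by one on every mistake, which is exactly the problem the next slides address.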

Forgetron Slide 5

Learning on a Budget

• |I| = number of mistakes until round t
• Memory + time inefficient
• |I| might grow unboundedly

• Goal: Construct a kernel-based online algorithm for which:
  • |I| ≤ B for each t
  • it still performs “well”, i.e. comes with a performance guarantee

Forgetron Slide 6

Mistake Bound for Perceptron

• {(x_1,y_1),…,(x_T,y_T)} : a sequence of examples

• A kernel K s.t. K(x_t, x_t) ≤ 1

• g : a fixed competitor classifier in the RKHS

• Define ℓ_t(g) = max(0, 1 − y_t g(x_t))

• Then, the number of prediction mistakes M satisfies

  M ≤ ||g||² + 2 Σ_t ℓ_t(g)
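This bound can be sanity-checked numerically. Below is a minimal sketch with the linear kernel K(a,b) = a·b on a toy 1-D stream (so K(x,x) ≤ 1 holds); the stream and the competitor weight are illustrative assumptions of mine:

```python
# Check M <= ||g||^2 + 2 * sum_t hinge_t(g) for the kernel Perceptron.
stream = [(-1.0, -1), (1.0, 1), (-0.8, -1), (0.8, 1)] * 3

w = 2.0  # competitor g(x) = w * x, with ||g|| = |w|
hinge = sum(max(0.0, 1.0 - y * w * x) for x, y in stream)

active, mistakes = [], 0
for x_t, y_t in stream:
    f_t = sum(y * x * x_t for x, y in active)  # f_t(x) = sum_i y_i K(x_i, x)
    if y_t * f_t <= 0:                         # mistake: add t to active set
        active.append((x_t, y_t))
        mistakes += 1

assert mistakes <= w * w + 2 * hinge
```

Here the competitor attains zero hinge loss, so the bound reduces to M ≤ ||g||² = 4, and the Perceptron indeed errs only on the first round.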

Forgetron Slide 7

Previous Work

• Crammer, Kandola, Singer (2003)

• Kivinen, Smola, Williamson (2004)

• Weston, Bordes, Bottou (2005)

Previous online budget algorithms do not provide a mistake bound.

Is our goal attainable?

Forgetron Slide 8

Mission Impossible

• Input space: {e_1,…,e_{B+1}}

• Linear kernel: K(e_i, e_j) = e_i · e_j = δ_{i,j}

• Budget constraint: |I| ≤ B. Therefore, there exists j s.t. Σ_{i ∈ I} α_i K(e_i, e_j) = 0

• We might always err on this e_j

• But, the competitor g = Σ_i e_i never errs!

• (The unbudgeted Perceptron makes only B+1 mistakes)
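The pigeonhole argument behind this counterexample can be made concrete in a few lines (B = 3 is an arbitrary illustrative choice):

```python
B = 3
dim = B + 1
# Standard basis e_1 ... e_{B+1}; linear kernel K(e_i, e_j) = delta_{ij}.
basis = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]

# Any budget-B active set must miss some index j (pigeonhole), and there
# the classifier outputs exactly 0 -- whatever its coefficients, since
# K(e_i, e_j) = 0 for every i in I.  The adversary presents e_j next.
active = {0, 1, 2}                                  # |I| <= B
j = next(k for k in range(dim) if k not in active)
f_on_ej = sum(basis[i][j] for i in active)          # sum_{i in I} K(e_i, e_j)

# Meanwhile the competitor g = sum_i e_i scores every e_k with margin 1.
g = [1.0] * dim
margins = [sum(gk * ek for gk, ek in zip(g, basis[k])) for k in range(dim)]
```

So the budgeted learner can be forced to err forever while a competitor of norm √(B+1) never errs, which is why the goal has to be redefined.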

Forgetron Slide 9

Redefine the Goal

• We must restrict the competitor g somehow. One way: restrict ||g||

• The counterexample implies that we cannot compete with ||g|| ≥ (B+1)^{1/2}

• Main result: the Forgetron algorithm can compete with any classifier g s.t.

  ||g|| ≤ (1/4) √( (B+1) / log(B+1) )

Forgetron Slide 10

The Forgetron

current classifier: f_t(x) = Σ_{i ∈ I} α_i y_i K(x_i, x)

Step (1) – Perceptron: on a mistake, add the new example: I' = I ∪ {t}

Step (2) – Shrinking: scale all coefficients: α_i ← φ_t α_i

Step (3) – Remove Oldest: if |I'| > B, remove r = min I'

[Figure: a timeline 1, 2, 3, …, t−1, t, showing the three update steps applied to the active set in turn.]
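The three-step update can be sketched in code. This is a minimal sketch in which `phi` stands in for the shrinking coefficient φ_t that the paper's analysis actually prescribes; the dictionary data structure and the fixed φ are my own assumptions:

```python
def forgetron_update(active, x_t, y_t, budget, phi):
    """One Forgetron update on a mistake round.

    `active` maps round index -> (x_i, y_i, alpha_i); `phi` is a
    placeholder for the analysis-derived shrinking coefficient phi_t.
    """
    t = max(active, default=-1) + 1
    active[t] = (x_t, y_t, 1.0)                # Step (1) Perceptron: I' = I u {t}
    for i, (x, y, a) in list(active.items()):  # Step (2) Shrinking: alpha_i <- phi*alpha_i
        active[i] = (x, y, phi * a)
    if len(active) > budget:                   # Step (3) Remove oldest: r = min I'
        del active[min(active)]
    return active

active = {}
for x, y in [((1.0,), 1), ((2.0,), -1), ((3.0,), 1)]:
    forgetron_update(active, x, y, budget=2, phi=0.9)
```

After three mistake rounds with budget B = 2, the oldest example has been evicted and the remaining coefficients carry their accumulated shrinking.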

Forgetron Slide 12

Quantifying Deviation

• “Progress” measure: Δ_t = ||f_t − g||² − ||f_{t+1} − g||²

• “Progress” is split across the three update steps; “deviation” is measured by negative progress:

Δ_t = ( ||f_t − g||² − ||f' − g||² )       after the Perceptron step
    + ( ||f' − g||² − ||f'' − g||² )       after shrinking
    + ( ||f'' − g||² − ||f_{t+1} − g||² )  after removal
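Summing Δ_t over all rounds telescopes, which is what makes this progress measure useful; assuming the standard initialization f_1 ≡ 0:

```latex
\sum_{t=1}^{T} \Delta_t
  = \sum_{t=1}^{T} \left( \|f_t - g\|^2 - \|f_{t+1} - g\|^2 \right)
  = \|f_1 - g\|^2 - \|f_{T+1} - g\|^2
  \le \|g\|^2 .
```

Lower-bounding each Δ_t (the gain from the Perceptron step minus the damage from shrinking and removal) then turns this inequality into a mistake bound.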

Forgetron Slide 13

Quantifying Deviation

The Forgetron sets:

• Gain from the Perceptron step:
• Damage from shrinking:
• Damage from removal:

Forgetron Slide 14

Resulting Mistake Bound

For any g s.t.

the number of prediction mistakes the Forgetron makes is at most

Forgetron Slide 22

Experiment I: MNIST dataset

[Figure: average error vs. budget size B (1000–3000) on MNIST, comparing the Forgetron with CKS (Crammer, Kandola & Singer).]

Forgetron Slide 23

Experiment II: Census-income (adult)

[Figure: average error vs. budget size B (1000–6000), comparing the Forgetron with CKS.]

(The Perceptron makes 16,000 mistakes on this dataset.)

Forgetron Slide 24

Experiment III: Synthetic Data with Label Noise

[Figure: average error vs. budget size B (0–2000) under label noise, comparing the Forgetron with CKS.]

Forgetron Slide 25

Summary

• No budget algorithm can compete with arbitrary hypotheses

• The Forgetron can compete with norm-bounded hypotheses

• Works well in practice

• Does not require parameters

Future work: the Forgetron for batch learning