Multiclass Classification
aritter.github.io/courses/5525_slides_v2/lec3-mcc.pdf

Page 1:

Multiclass Classification

Alan Ritter (many slides from Greg Durrett, Vivek Srikumar, Stanford CS231n)

Page 2:

This Lecture

‣ Multiclass fundamentals

‣ Multiclass logistic regression

‣ Multiclass SVM

‣ Feature extraction

‣ Optimization

Page 3:

Multiclass Fundamentals

Page 4:

Text Classification

‣ ~20 classes, e.g. Sports, Health

Page 5:

Image Classification

‣ Thousands of classes (ImageNet), e.g. Car, Dog

Page 6:

Entity Linking

‣ 4,500,000 classes (all articles in Wikipedia)

"Although he originally won the event, the United States Anti-Doping Agency announced in August 2012 that they had disqualified Armstrong from his seven consecutive Tour de France wins from 1999–2005."

Which article does this mention of Armstrong refer to?

Lance Edward Armstrong is an American former professional road cyclist…

Armstrong County is a county in Pennsylvania…

Page 7:

Reading Comprehension

‣ Multiple choice questions, 4 classes (but classes change per example)

Richardson (2013)

Page 8:

Binary Classification

‣ Binary classification: one weight vector defines positive and negative classes

[Figure: positive (+) and negative (-) points in 2D separated by a linear decision boundary]

Page 9:

Multiclass Classification

[Figure: 2D points drawn from three classes, labeled 1, 2, 3]

‣ Can we just use binary classifiers here?

Page 10:

Multiclass Classification

‣ One-vs-all: train k classifiers, one to distinguish each class from all the rest

[Figure: the three-class data from the previous slide, shown with one class-vs-rest boundary at a time]

‣ How do we reconcile multiple positive predictions? Highest score? (see the sketch below)
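
A minimal one-vs-all sketch in numpy, assuming a perceptron as the binary learner (any binary classifier works; the function names and epoch count are illustrative, not from the slides):

import numpy as np

def train_one_vs_all(X, y, num_classes, epochs=10):
    # One weight vector per class, each trained to separate that class
    # (target +1) from all the rest (target -1) with a simple perceptron.
    W = np.zeros((num_classes, X.shape[1]))
    for k in range(num_classes):
        target = np.where(y == k, 1, -1)
        for _ in range(epochs):
            for xi, ti in zip(X, target):
                if ti * (W[k] @ xi) <= 0:   # misclassified: perceptron update
                    W[k] = W[k] + ti * xi
    return W

def predict_one_vs_all(W, x):
    # Reconcile multiple positive predictions by taking the highest score.
    return int(np.argmax(W @ x))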

Page 11:

Multiclass Classification

‣ Not all classes may even be separable using this approach

[Figure: three classes where class 3 cannot be linearly separated from classes 1 and 2]

‣ Can separate 1 from 2+3 and 2 from 1+3, but not 3 from the others (with these features)

Page 12:

Multiclass Classification

‣ All-vs-all: train n(n-1)/2 classifiers to differentiate each pair of classes

[Figure: the three-class data, shown once per pair of classes]

‣ Again, how to reconcile? (one option: pairwise voting, sketched below)
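
A matching all-vs-all sketch (again with perceptrons as the pairwise learners; majority voting as the reconciliation rule is one standard choice, not specified on the slide):

import numpy as np
from itertools import combinations

def train_all_vs_all(X, y, num_classes, epochs=10):
    # One binary classifier per pair of classes (a, b), trained only on
    # examples of those two classes: n(n-1)/2 classifiers total.
    W = {}
    for a, b in combinations(range(num_classes), 2):
        mask = (y == a) | (y == b)
        Xab, tab = X[mask], np.where(y[mask] == a, 1, -1)
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            for xi, ti in zip(Xab, tab):
                if ti * (w @ xi) <= 0:
                    w = w + ti * xi
        W[(a, b)] = w
    return W

def predict_all_vs_all(W, x, num_classes):
    # Each pairwise classifier votes; the class with the most votes wins.
    votes = np.zeros(num_classes)
    for (a, b), w in W.items():
        votes[a if w @ x > 0 else b] += 1
    return int(np.argmax(votes))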

Page 13:

Multiclass Classification

[Figure: the binary +/- data next to the three-class 1/2/3 data]

‣ Binary classification: one weight vector defines both classes

‣ Multiclass classification: different weights and/or features per class

Page 14:

Multiclass Classification

‣ Formally: instead of two labels, we have an output space $\mathcal{Y}$ containing a number of possible classes

‣ Decision rule: $\mathrm{argmax}_{y \in \mathcal{Y}} \, w^\top f(x, y)$

‣ Multiple feature vectors, one weight vector: features depend on the choice of label now! (note: this isn't the gold label)

‣ Can also have one weight vector per class: $\mathrm{argmax}_{y \in \mathcal{Y}} \, w_y^\top f(x)$

‣ Same machinery that we'll use later for exponentially large output spaces, including sequences and trees

‣ The single weight vector approach will generalize to structured output spaces, whereas per-class weight vectors won't

Page 15:

Feature Extraction

Page 16:

Block Feature Vectors

‣ Decision rule: $\mathrm{argmax}_{y \in \mathcal{Y}} \, w^\top f(x, y)$

Example: "too many drug trials, too few patients" with labels Health, Sports, Science

‣ Base feature function: f(x) = [I[contains drug], I[contains patients], I[contains baseball]] = [1, 1, 0]

‣ f(x, y) conjoins each base feature with the label, e.g. I[contains drug & label=Health], giving feature vector blocks for each label:

f(x, y = Health) = [1, 1, 0, 0, 0, 0, 0, 0, 0]
f(x, y = Sports) = [0, 0, 0, 1, 1, 0, 0, 0, 0]

‣ Equivalent to having three weight vectors in this case
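
A small sketch of the block construction (label set and keyword features taken from the slide; the helper names are illustrative):

import numpy as np

LABELS = ["Health", "Sports", "Science"]
KEYWORDS = ["drug", "patients", "baseball"]

def f_base(text):
    # f(x) = [I[contains drug], I[contains patients], I[contains baseball]]
    return np.array([1.0 if k in text else 0.0 for k in KEYWORDS])

def f(text, y):
    # f(x, y): the base features copied into the block for label y,
    # zeros everywhere else.
    vec = np.zeros(len(KEYWORDS) * len(LABELS))
    i = LABELS.index(y)
    vec[i * len(KEYWORDS):(i + 1) * len(KEYWORDS)] = f_base(text)
    return vec

x = "too many drug trials, too few patients"
assert f(x, "Health").tolist() == [1, 1, 0, 0, 0, 0, 0, 0, 0]
assert f(x, "Sports").tolist() == [0, 0, 0, 1, 1, 0, 0, 0, 0]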

Page 17:

Making Decisions

"too many drug trials, too few patients"

f(x) = [I[contains drug], I[contains patients], I[contains baseball]]

f(x, y = Health) = [1, 1, 0, 0, 0, 0, 0, 0, 0]
f(x, y = Sports) = [0, 0, 0, 1, 1, 0, 0, 0, 0]

w = [+2.1, +2.3, -5.0, -2.1, -3.8, 0, +1.1, -1.7, -1.3]

("word drug in Science article" = +1.1)

$w^\top f(x, y)$ = Health: +4.4, Sports: -5.9, Science: -0.6

argmax = Health
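
Continuing the previous sketch (reusing its f, x, and LABELS), the slide's scores fall out of a dot product:

import numpy as np

w = np.array([2.1, 2.3, -5.0, -2.1, -3.8, 0.0, 1.1, -1.7, -1.3])
scores = {y: float(w @ f(x, y)) for y in LABELS}
# {'Health': 4.4, 'Sports': -5.9, 'Science': -0.6}
prediction = max(scores, key=scores.get)   # argmax_y w^T f(x, y) -> 'Health'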

Page 18:

Another example: POS tagging

‣ Classify blocks as one of 36 POS tags: NNS? VBZ? NN? DT? …

‣ Example x: a sentence with one word (in this case, blocks) highlighted: "the router blocks the packets"

‣ Extract features with respect to this word:

f(x, y=VBZ) = [I[curr_word=blocks & tag=VBZ], I[prev_word=router & tag=VBZ], I[next_word=the & tag=VBZ], I[curr_suffix=s & tag=VBZ]]

(not saying that the is tagged as VBZ! saying that the follows the VBZ word)

‣ Next two lectures: sequence labeling!
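
A sketch of this extractor; the feature strings stand in for indicator dimensions (hash or index them to get vector positions), and the function name is illustrative:

def pos_features(words, i, tag):
    # Feature templates from the slide, each conjoined with the candidate tag.
    feats = [f"curr_word={words[i]}&tag={tag}",
             f"curr_suffix={words[i][-1:]}&tag={tag}"]
    if i > 0:
        feats.append(f"prev_word={words[i-1]}&tag={tag}")
    if i + 1 < len(words):
        feats.append(f"next_word={words[i+1]}&tag={tag}")
    return feats

print(pos_features("the router blocks the packets".split(), 2, "VBZ"))
# ['curr_word=blocks&tag=VBZ', 'curr_suffix=s&tag=VBZ',
#  'prev_word=router&tag=VBZ', 'next_word=the&tag=VBZ']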

Page 19:

Multiclass Logistic Regression

Page 20:

Multiclass Logistic Regression

$$P_w(y \mid x) = \frac{\exp\left(w^\top f(x, y)\right)}{\sum_{y' \in \mathcal{Y}} \exp\left(w^\top f(x, y')\right)}$$

(sum over the output space to normalize)

‣ Compare to binary:

$$P(y = 1 \mid x) = \frac{\exp\left(w^\top f(x)\right)}{1 + \exp\left(w^\top f(x)\right)}$$

(the negative class implicitly had f(x, y=0) = the zero vector)

Page 21:

Multiclass Logistic Regression

$$P_w(y \mid x) = \frac{\exp\left(w^\top f(x, y)\right)}{\sum_{y' \in \mathcal{Y}} \exp\left(w^\top f(x, y')\right)}$$

(sum over the output space to normalize)

Why? Interpret raw classifier scores as probabilities. For "too many drug trials, too few patients":

           score $w^\top f(x,y)$   exp (unnormalized, must be >= 0)   normalize (must sum to 1)
Health     +1.8                     6.05                                0.21
Sports     +3.1                     22.2                                0.77
Science    -0.6                     0.55                                0.02

This is the softmax function. Compare to the correct (gold) probabilities [1.00, 0.00, 0.00]: log(0.21) = -1.56

$$\mathcal{L}(x, y) = \sum_{j=1}^{n} \log P(y_j^* \mid x_j), \qquad \mathcal{L}(x_j, y_j^*) = w^\top f(x_j, y_j^*) - \log \sum_{y} \exp\left(w^\top f(x_j, y)\right)$$
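
A numpy check of the table above (a minimal sketch; the max-subtraction is a standard numerical-stability trick, not on the slide):

import numpy as np

def softmax(scores):
    z = np.exp(scores - scores.max())   # subtract max for numerical stability
    return z / z.sum()

scores = np.array([1.8, 3.1, -0.6])     # Health, Sports, Science
probs = softmax(scores)                 # ~[0.21, 0.77, 0.02]
nll = -np.log(probs[0])                 # gold = Health: ~1.56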

Page 22:

Multiclass Logistic Regression

$$P_w(y \mid x) = \frac{\exp\left(w^\top f(x, y)\right)}{\sum_{y' \in \mathcal{Y}} \exp\left(w^\top f(x, y')\right)}$$

(sum over the output space to normalize)

‣ Training: maximize

$$\mathcal{L}(x, y) = \sum_{j=1}^{n} \log P(y_j^* \mid x_j) = \sum_{j=1}^{n} \left( w^\top f(x_j, y_j^*) - \log \sum_{y} \exp\left(w^\top f(x_j, y)\right) \right)$$

Page 23:

Training

‣ Multiclass logistic regression:

$$P_w(y \mid x) = \frac{\exp\left(w^\top f(x, y)\right)}{\sum_{y' \in \mathcal{Y}} \exp\left(w^\top f(x, y')\right)}$$

‣ Likelihood: $\mathcal{L}(x_j, y_j^*) = w^\top f(x_j, y_j^*) - \log \sum_{y} \exp\left(w^\top f(x_j, y)\right)$

$$\frac{\partial}{\partial w_i} \mathcal{L}(x_j, y_j^*) = f_i(x_j, y_j^*) - \frac{\sum_{y} f_i(x_j, y) \exp\left(w^\top f(x_j, y)\right)}{\sum_{y} \exp\left(w^\top f(x_j, y)\right)}$$

$$= f_i(x_j, y_j^*) - \sum_{y} f_i(x_j, y) \, P_w(y \mid x_j) = f_i(x_j, y_j^*) - \mathbb{E}_y[f_i(x_j, y)]$$

(gold feature value minus the model's expectation of the feature value)
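
A sketch of this gradient for one example, where the rows of F are f(x, y) for each label (names are illustrative):

import numpy as np

def loglik_gradient(w, F, gold):
    # F: |Y| x d matrix whose rows are f(x, y); gold: index of y*.
    scores = F @ w
    p = np.exp(scores - scores.max())
    p /= p.sum()                 # P_w(y | x), the softmax of the scores
    return F[gold] - p @ F       # f(x, y*) - E_y[f(x, y)]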

Page 24:

Training

"too many drug trials, too few patients", y* = Health

f(x, y = Health) = [1, 1, 0, 0, 0, 0, 0, 0, 0]
f(x, y = Sports) = [0, 0, 0, 1, 1, 0, 0, 0, 0]

P_w(y|x) = [0.21, 0.77, 0.02]

$$\frac{\partial}{\partial w_i} \mathcal{L}(x_j, y_j^*) = f_i(x_j, y_j^*) - \sum_{y} f_i(x_j, y) \, P_w(y \mid x_j)$$

gradient: [1, 1, 0, 0, 0, 0, 0, 0, 0]
- 0.21 [1, 1, 0, 0, 0, 0, 0, 0, 0]
- 0.77 [0, 0, 0, 1, 1, 0, 0, 0, 0]
- 0.02 [0, 0, 0, 0, 0, 0, 1, 1, 0]
= [0.79, 0.79, 0, -0.77, -0.77, 0, -0.02, -0.02, 0]

update: [1.3, 0.9, -5, 3.2, -0.1, 0, 1.1, -1.7, -1.3] + [0.79, 0.79, 0, -0.77, -0.77, 0, -0.02, -0.02, 0]
= [2.09, 1.69, -5, 2.43, -0.87, 0, 1.08, -1.72, -1.3]

new P_w(y|x) = [0.89, 0.10, 0.01]
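
Checking the update with numpy (step size 1 assumed; the slide doesn't state one):

import numpy as np

F = np.array([[1, 1, 0, 0, 0, 0, 0, 0, 0],    # f(x, Health)
              [0, 0, 0, 1, 1, 0, 0, 0, 0],    # f(x, Sports)
              [0, 0, 0, 0, 0, 0, 1, 1, 0]],   # f(x, Science)
             dtype=float)
p = np.array([0.21, 0.77, 0.02])               # current P_w(y | x)
g = F[0] - p @ F                               # gradient; gold = Health
# g = [0.79, 0.79, 0, -0.77, -0.77, 0, -0.02, -0.02, 0]

w = np.array([1.3, 0.9, -5.0, 3.2, -0.1, 0.0, 1.1, -1.7, -1.3])
w += g                                         # gradient ascent step
scores = F @ w
new_p = np.exp(scores) / np.exp(scores).sum()  # ~[0.89, 0.10, 0.01]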

Page 25:

Logistic Regression: Summary

‣ Model: $P_w(y \mid x) = \frac{\exp\left(w^\top f(x, y)\right)}{\sum_{y' \in \mathcal{Y}} \exp\left(w^\top f(x, y')\right)}$

‣ Inference: $\mathrm{argmax}_y \, P_w(y \mid x)$

‣ Learning: gradient ascent on the discriminative log-likelihood

$$f(x, y^*) - \mathbb{E}_y[f(x, y)] = f(x, y^*) - \sum_{y} P_w(y \mid x) \, f(x, y)$$

("towards the gold feature value, away from the expectation of the feature value")

Page 26:

Multiclass SVM

Page 27:

Soft Margin SVM

Minimize $\lambda \lVert w \rVert_2^2 + \sum_{j=1}^{m} \xi_j$

s.t. $\forall j \;\; (2y_j - 1)(w^\top x_j) \geq 1 - \xi_j$ and $\forall j \;\; \xi_j \geq 0$

(slack variables: $\xi_j > 0$ iff the example is a support vector)

Image credit: Lang Van Tran

Page 28:

Multiclass SVM

Binary: minimize $\lambda \lVert w \rVert_2^2 + \sum_{j=1}^{m} \xi_j$ s.t. $\forall j \;\; (2y_j - 1)(w^\top x_j) \geq 1 - \xi_j$ and $\forall j \;\; \xi_j \geq 0$

Multiclass: same objective, but now

$$\forall j \; \forall y \in \mathcal{Y} \quad w^\top f(x_j, y_j^*) \geq w^\top f(x_j, y) + \ell(y, y_j^*) - \xi_j$$

‣ The correct prediction now has to beat every other class

‣ The 1 that was here is replaced by a loss function

‣ The score comparison is more explicit now

(slack variables: $\xi_j > 0$ iff the example is a support vector)

Page 29:

Training (loss-augmented)

‣ Are all decisions equally costly?

"too many drug trials, too few patients" (gold label: Health)

Predicted Science: not so bad
Predicted Sports: bad error

‣ We can define a loss function $\ell(y, y^*)$:

$\ell(\text{Sports}, \text{Health}) = 3$
$\ell(\text{Science}, \text{Health}) = 1$

Page 30:

Multiclass SVM

$$\forall j \; \forall y \in \mathcal{Y} \quad w^\top f(x_j, y_j^*) \geq w^\top f(x_j, y) + \ell(y, y_j^*) - \xi_j$$

$w^\top f(x, y) + \ell(y, y^*)$: Health: 2.4 + 0, Sports: 1.3 + 3, Science: 1.8 + 1

‣ Does gold beat every label + loss? No!

‣ Perceptron would make no update here (gold already has the highest raw score)

‣ Most violated constraint is Sports; what is $\xi_j$?

‣ $\xi_j$ = 4.3 - 2.4 = 1.9

Page 31:

Multiclass SVM

Minimize $\lambda \lVert w \rVert_2^2 + \sum_{j=1}^{m} \xi_j$

s.t. $\forall j \;\; \xi_j \geq 0$ and $\forall j \; \forall y \in \mathcal{Y} \;\; w^\top f(x_j, y_j^*) \geq w^\top f(x_j, y) + \ell(y, y_j^*) - \xi_j$

‣ One slack variable per example, so it's set to whatever the most violated constraint is for that example:

$$\xi_j = \max_{y \in \mathcal{Y}} \left[ w^\top f(x_j, y) + \ell(y, y_j^*) \right] - w^\top f(x_j, y_j^*)$$

‣ Plug in the gold y and you get 0, so slack is always nonnegative!

Page 32:

Computing the Subgradient

Minimize $\lambda \lVert w \rVert_2^2 + \sum_{j=1}^{m} \xi_j$ s.t. $\forall j \;\; \xi_j \geq 0$ and $\forall j \; \forall y \in \mathcal{Y} \;\; w^\top f(x_j, y_j^*) \geq w^\top f(x_j, y) + \ell(y, y_j^*) - \xi_j$

$$\xi_j = \max_{y \in \mathcal{Y}} \left[ w^\top f(x_j, y) + \ell(y, y_j^*) \right] - w^\top f(x_j, y_j^*)$$

‣ If $\xi_j = 0$, the example is not a support vector, and the gradient is zero

‣ Otherwise,

$$\frac{\partial}{\partial w_i} \xi_j = f_i(x_j, y_{\max}) - f_i(x_j, y_j^*)$$

(the update looks backwards: we're minimizing here!)

‣ Perceptron-like, but we update away from the *loss-augmented* prediction
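
A sketch of this subgradient for one example (rows of F are f(x, y); loss is the vector of l(y, y*) values with 0 at the gold index; names are illustrative):

import numpy as np

def svm_subgradient(w, F, gold, loss):
    # Loss-augmented prediction: argmax_y [w^T f(x, y) + l(y, y*)].
    aug = F @ w + loss
    y_max = int(np.argmax(aug))
    xi = aug[y_max] - F[gold] @ w      # slack for this example
    if xi == 0.0:                      # not a support vector
        return np.zeros_like(w)
    return F[y_max] - F[gold]          # d(xi)/dw; minimize via w -= alpha * this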

Page 33:

Putting it Together

Minimize $\lambda \lVert w \rVert_2^2 + \sum_{j=1}^{m} \xi_j$ s.t. $\forall j \;\; \xi_j \geq 0$ and $\forall j \; \forall y \in \mathcal{Y} \;\; w^\top f(x_j, y_j^*) \geq w^\top f(x_j, y) + \ell(y, y_j^*) - \xi_j$

‣ (Unregularized) gradients:

‣ SVM: $f(x, y^*) - f(x, y_{\max})$ (loss-augmented max)

‣ Logreg: $f(x, y^*) - \mathbb{E}_y[f(x, y)] = f(x, y^*) - \sum_{y} P_w(y \mid x) \, f(x, y)$

‣ SVM: a max over the $w^\top f(x, y) + \ell(y, y^*)$ terms to compute the gradient; LR: need to sum over them

Page 34:

Optimization

Page 35:

Recap

‣ Four elements of a machine learning method:

‣ Model: probabilistic, max-margin, deep neural network

‣ Inference: just maxes and simple expectations so far, but will get harder

‣ Training: gradient descent?

‣ Objective: e.g. log-likelihood (logistic regression), slack plus regularizer (SVM)

Page 36:

Optimization

‣ Stochastic gradient *ascent*

‣ Very simple to code up:

$$w \leftarrow w + \alpha g, \quad g = \frac{\partial}{\partial w} \mathcal{L}$$

[Figure: loss surface over weights W_1 and W_2, from Fei-Fei Li, Justin Johnson & Serena Yeung, CS231n Lecture 7, April 24, 2018]
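
The update as code, a minimal sketch (grad_fn computes the per-example gradient g; the shuffling and the fixed step size are common choices, not prescribed by the slide):

import numpy as np

def sgd_ascent(w, examples, grad_fn, alpha=0.1, epochs=10):
    # w <- w + alpha * g for each example, with g = dL/dw on that example.
    for _ in range(epochs):
        np.random.shuffle(examples)      # visit examples in random order
        for ex in examples:
            w = w + alpha * grad_fn(w, ex)
    return w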

Page 37:

Optimization

‣ Stochastic gradient *ascent*: $w \leftarrow w + \alpha g$, $g = \frac{\partial}{\partial w} \mathcal{L}$; very simple to code up

‣ What if the loss changes quickly in one direction and slowly in another direction? What does gradient descent do?

The loss function has a high condition number: the ratio of the largest to smallest singular value of the Hessian matrix is large

[Figure: elongated elliptical loss contours, from CS231n Lecture 7]

Page 38:

Optimization

‣ Stochastic gradient *ascent*: $w \leftarrow w + \alpha g$, $g = \frac{\partial}{\partial w} \mathcal{L}$; very simple to code up

‣ What if the loss changes quickly in one direction and slowly in another direction? What does gradient descent do? Very slow progress along the shallow dimension, jitter along the steep direction

The loss function has a high condition number: the ratio of the largest to smallest singular value of the Hessian matrix is large

[Figure: zigzagging gradient descent path across the elongated contours, from CS231n Lecture 7]

Page 39:

Optimization

‣ Stochastic gradient *ascent*: $w \leftarrow w + \alpha g$, $g = \frac{\partial}{\partial w} \mathcal{L}$; very simple to code up

‣ What if the loss function has local minima or saddle points?

Saddle points are much more common in high dimensions

Dauphin et al., "Identifying and attacking the saddle point problem in high-dimensional non-convex optimization", NIPS 2014

Page 40:

Optimization

‣ Stochastic gradient *ascent*: $w \leftarrow w + \alpha g$, $g = \frac{\partial}{\partial w} \mathcal{L}$; very simple to code up

‣ "First-order" technique: only relies on having the gradient

First-order optimization: (1) use the gradient to form a linear approximation of the loss; (2) step to minimize the approximation

Second-order optimization: (1) use the gradient and Hessian to form a quadratic approximation; (2) step to the minimum of the approximation

[Figures: loss vs. w1 with the linear and quadratic approximations, from CS231n Lecture 7]

Page 41:

Optimization

‣ Stochastic gradient *ascent*: $w \leftarrow w + \alpha g$, $g = \frac{\partial}{\partial w} \mathcal{L}$; very simple to code up

‣ "First-order" technique: only relies on having the gradient

‣ Setting the step size is hard (decrease it when held-out performance worsens?)

‣ Newton's method, a second-order technique:

$$w \leftarrow w + \left( \frac{\partial^2}{\partial w^2} \mathcal{L} \right)^{-1} g$$

‣ Optimizes a quadratic instantly

‣ Inverse Hessian: an n x n matrix, expensive to compute!

‣ Quasi-Newton methods (L-BFGS, etc.) approximate the inverse Hessian

Page 42:

AdaGrad

Duchi et al. (2011)

‣ Optimized for problems with sparse features

‣ Per-parameter learning rate: smaller updates are made to parameters that get updated frequently

‣ Element-wise scaling of the gradient based on the historical sum of squares in each dimension: "per-parameter learning rates" or "adaptive learning rates"

Duchi et al., "Adaptive subgradient methods for online learning and stochastic optimization", JMLR 2011

Page 43:

AdaGrad

Duchi et al. (2011)

‣ Optimized for problems with sparse features

‣ Per-parameter learning rate: smaller updates are made to parameters that get updated frequently

$$w_i \leftarrow w_i + \alpha \, \frac{1}{\sqrt{\epsilon + \sum_{\tau=1}^{t} g_{\tau,i}^2}} \, g_{t,i}$$

((smoothed) sum of squared gradients from all updates)

‣ Generally more robust than SGD, requires less tuning of the learning rate

‣ Other techniques for optimizing deep models: more later!
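
The update as code, a minimal sketch (the step size and epsilon defaults are illustrative):

import numpy as np

def adagrad_step(w, g, sq_hist, alpha=0.1, eps=1e-8):
    # sq_hist accumulates the per-parameter sum of squared gradients, so
    # frequently-updated parameters see smaller effective step sizes.
    sq_hist = sq_hist + g * g
    w = w + alpha * g / np.sqrt(eps + sq_hist)
    return w, sq_hist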

Page 44:

Summary

‣ Design tradeoffs need to reflect interactions:

‣ Model and objective are coupled: probabilistic model <-> maximize likelihood

‣ … but not always: a linear model or neural network can be trained to minimize any differentiable loss function

‣ Inference governs what learning is feasible: you need to be able to compute expectations to use logistic regression