Multiclass Classification (GitHub Pages: aritter.github.io/courses/5525_slides_v2/lec3-mcc.pdf)


Multiclass Classification

Alan Ritter (many slides from Greg Durrett, Vivek Srikumar, Stanford CS231n)

This Lecture

‣ Multiclass fundamentals

‣ Multiclass logistic regression

‣ Multiclass SVM

‣ Feature extraction

‣ Optimization

Multiclass Fundamentals

Text Classification

‣ ~20 classes (e.g., Sports, Health)

Image Classification

‣ Thousands of classes (ImageNet), e.g., Car, Dog

Entity Linking

‣ 4,500,000 classes (all articles in Wikipedia)

Although he originally won the event, the United States Anti-Doping Agency announced in August 2012 that they had disqualified Armstrong from his seven consecutive Tour de France wins from 1999–2005.

Which Wikipedia article does "Armstrong" refer to?

‣ Lance Edward Armstrong is an American former professional road cyclist

‣ Armstrong County is a county in Pennsylvania…

Reading Comprehension

‣ Multiple choice questions, 4 classes (but classes change per example)

Richardson (2013)

Binary Classification

‣ Binary classification: one weight vector defines positive and negative classes

[figure: positive (+) and negative (-) points separated by a single linear boundary]

Multiclass Classification

[figure: data points from three classes, labeled 1, 2, 3]

‣ Can we just use binary classifiers here?

Multiclass Classification

‣ One-vs-all: train k classifiers, one to distinguish each class from all the rest

[figure: the same three-class data, shown once per one-vs-all classifier]

‣ How do we reconcile multiple positive predictions? Highest score? (see the sketch below)
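A minimal one-vs-all sketch in numpy to make the scheme concrete: train k binary separators and reconcile multiple (or zero) positive predictions by taking the highest score. The simple perceptron here stands in for whatever binary learner you prefer; the function names, the toy data, and the lack of a bias term are illustrative choices, not part of the lecture.

```python
import numpy as np

def train_binary(X, z, epochs=20, lr=0.1):
    # Stand-in binary learner (perceptron, no bias): returns a weight vector
    # scoring the positive class (z = +1) above the rest (z = -1).
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, zi in zip(X, z):
            if zi * (w @ x) <= 0:          # misclassified (or on the boundary): update
                w += lr * zi * x
    return w

def train_one_vs_all(X, y, num_classes):
    # One classifier per class, each trained to distinguish that class from all the rest
    return [train_binary(X, np.where(y == k, 1.0, -1.0)) for k in range(num_classes)]

def predict_one_vs_all(ws, x):
    # Reconcile multiple (or zero) positive predictions by taking the highest score
    return int(np.argmax([w @ x for w in ws]))

# Toy, linearly separable 2D data with three classes (illustrative only)
X = np.array([[2.0, 0.1], [1.8, -0.2], [-1.5, 1.9], [-1.2, 2.1], [-0.1, -2.0], [0.2, -1.7]])
y = np.array([0, 0, 1, 1, 2, 2])
ws = train_one_vs_all(X, y, num_classes=3)
print([predict_one_vs_all(ws, x) for x in X])
```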

Multiclass Classification

‣ Not all classes may even be separable using this approach

[figure: three classes arranged so that class 3 cannot be linearly separated from classes 1 and 2]

‣ Can separate 1 from 2+3 and 2 from 1+3, but not 3 from the others (with these features)

Multiclass Classification

‣ All-vs-all: train n(n-1)/2 classifiers to differentiate each pair of classes

[figure: the three-class data restricted to one pair of classes at a time]

‣ Again, how to reconcile? (see the sketch below)
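A companion sketch for all-vs-all: given the n(n-1)/2 pairwise weight vectors (hypothetical values below), one common way to reconcile their decisions is majority voting; the lecture leaves the reconciliation rule open, so this is one possible answer rather than the prescribed one.

```python
import numpy as np

def predict_all_vs_all(pair_ws, x, num_classes):
    # pair_ws maps each class pair (a, b), a < b, to a weight vector whose score
    # is positive for class a and negative for class b (assumed already trained).
    votes = np.zeros(num_classes)
    for (a, b), w in pair_ws.items():
        votes[a if w @ x > 0 else b] += 1      # each pairwise classifier casts one vote
    return int(np.argmax(votes))               # reconcile by majority vote (one option)

# Hypothetical pairwise weight vectors for 3 classes in 2D
pair_ws = {(0, 1): np.array([1.0, -1.0]),
           (0, 2): np.array([1.0, 1.0]),
           (1, 2): np.array([-1.0, 1.0])}
print(predict_all_vs_all(pair_ws, np.array([2.0, 0.1]), num_classes=3))   # -> 0
```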

Multiclass Classification

[figure: binary data with one +/- boundary, next to the three-class data]

‣ Binary classification: one weight vector defines both classes

‣ Multiclass classification: different weights and/or features per class

Multiclass Classification

‣ Formally: instead of two labels, we have an output space $\mathcal{Y}$ containing a number of possible classes

‣ Same machinery that we'll use later for exponentially large output spaces, including sequences and trees

‣ Decision rule: $\arg\max_{y \in \mathcal{Y}} w^\top f(x, y)$

‣ Multiple feature vectors, one weight vector: features depend on the choice of label now! (note: this isn't the gold label)

‣ Can also have one weight vector per class: $\arg\max_{y \in \mathcal{Y}} w_y^\top f(x)$

‣ The single weight vector approach will generalize to structured output spaces, whereas per-class weight vectors won't

Feature Extraction

Block Feature Vectors

‣ Decision rule: $\arg\max_{y \in \mathcal{Y}} w^\top f(x, y)$

Example: "too many drug trials, too few patients"; labels: Health, Sports, Science

f(x) = [I[contains drug], I[contains patients], I[contains baseball]] = [1, 1, 0]

f(x, y = Health) = [1, 1, 0, 0, 0, 0, 0, 0, 0]

f(x, y = Sports) = [0, 0, 0, 1, 1, 0, 0, 0, 0]

‣ Feature vector blocks for each label; equivalent to having three weight vectors in this case

‣ Base feature function I[contains drug] becomes, e.g., I[contains drug & label = Health]
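A small sketch of the block feature construction above, using the slide's base features and label set; the helper names (base_features, block_features) and the word-matching logic are illustrative.

```python
import numpy as np

LABELS = ["Health", "Sports", "Science"]
BASE_FEATURES = ["contains drug", "contains patients", "contains baseball"]

def base_features(text):
    # f(x): indicator features over the input only
    words = text.lower().split()
    return np.array([1.0 if "drug" in words else 0.0,
                     1.0 if "patients" in words else 0.0,
                     1.0 if "baseball" in words else 0.0])

def block_features(text, label):
    # f(x, y): copy f(x) into the block owned by label y, zeros elsewhere, so
    # "contains drug & label = Health" becomes one coordinate of a 9-dim vector
    fx = base_features(text)
    f = np.zeros(len(BASE_FEATURES) * len(LABELS))
    start = LABELS.index(label) * len(BASE_FEATURES)
    f[start:start + len(fx)] = fx
    return f

x = "too many drug trials, too few patients"
print(block_features(x, "Health"))   # [1. 1. 0. 0. 0. 0. 0. 0. 0.]
print(block_features(x, "Sports"))   # [0. 0. 0. 1. 1. 0. 0. 0. 0.]
```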

Making Decisions

Example: "too many drug trials, too few patients"; labels: Health, Sports, Science

f(x) = [I[contains drug], I[contains patients], I[contains baseball]]

f(x, y = Health) = [1, 1, 0, 0, 0, 0, 0, 0, 0]

f(x, y = Sports) = [0, 0, 0, 1, 1, 0, 0, 0, 0]

w = [+2.1, +2.3, -5, -2.1, -3.8, 0, +1.1, -1.7, -1.3]   ("word drug in Science article" = +1.1)

$w^\top f(x, y)$:   Health: +4.4   Sports: -5.9   Science: -0.6   →   argmax = Health
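Scoring each label with the slide's weight vector reproduces the decision above; the feature vectors are copied from the slide, and the dictionary layout is just for readability.

```python
import numpy as np

# Weight vector and block feature vectors copied from the slide
w = np.array([+2.1, +2.3, -5.0, -2.1, -3.8, 0.0, +1.1, -1.7, -1.3])
f = {"Health":  np.array([1., 1., 0., 0., 0., 0., 0., 0., 0.]),
     "Sports":  np.array([0., 0., 0., 1., 1., 0., 0., 0., 0.]),
     "Science": np.array([0., 0., 0., 0., 0., 0., 1., 1., 0.])}

scores = {y: round(float(w @ fy), 2) for y, fy in f.items()}
print(scores)                         # {'Health': 4.4, 'Sports': -5.9, 'Science': -0.6}
print(max(scores, key=scores.get))    # argmax_y w^T f(x, y)  ->  Health
```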

Another example: POS tagging

‣ Sentence: "the router blocks the packets", with tags DT NN VBZ DT NNS

‣ Classify blocks as one of 36 POS tags

‣ Example x: sentence with a word (in this case, blocks) highlighted

‣ Extract features with respect to this word:

f(x, y = VBZ) = [I[curr_word = blocks & tag = VBZ], I[prev_word = router & tag = VBZ], I[next_word = the & tag = VBZ], I[curr_suffix = s & tag = VBZ]]

(not saying that "the" is tagged as VBZ! saying that "the" follows the VBZ word)

‣ Next two lectures: sequence labeling!

Multiclass Logistic Regression

Multiclass Logistic Regression

$$P_w(y|x) = \frac{\exp\left(w^\top f(x, y)\right)}{\sum_{y' \in \mathcal{Y}} \exp\left(w^\top f(x, y')\right)}$$

(sum over output space to normalize)

‣ Compare to binary: $P(y = 1|x) = \dfrac{\exp(w^\top f(x))}{1 + \exp(w^\top f(x))}$; the negative class implicitly had f(x, y = 0) = the zero vector

Multiclass Logistic Regression

$$P_w(y|x) = \frac{\exp\left(w^\top f(x, y)\right)}{\sum_{y' \in \mathcal{Y}} \exp\left(w^\top f(x, y')\right)}$$

(sum over output space to normalize)

Why? Interpret raw classifier scores as probabilities. For "too many drug trials, too few patients":

              score w^T f(x,y)   exp (unnormalized,   softmax (normalized,   correct (gold)
                                 must be >= 0)        must sum to 1)         probabilities
  Health           +2.2               6.05                 0.21                  1.00
  Sports           +3.1              22.2                  0.77                  0.00
  Science          -0.6               0.55                 0.02                  0.00

compare: training maximizes $\mathcal{L}(x, y) = \sum_{j=1}^{n} \log P(y^*_j | x_j)$, where
$\mathcal{L}(x_j, y^*_j) = w^\top f(x_j, y^*_j) - \log \sum_{y} \exp\left(w^\top f(x_j, y)\right)$; here $\log(0.21) = -1.56$
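A minimal sketch of the softmax and the per-example log-likelihood term, assuming the scores $w^\top f(x, y)$ are already computed; the score vector below is a made-up stand-in rather than the slide's numbers.

```python
import numpy as np

def softmax(scores):
    # exponentiate (probabilities must be >= 0), then normalize (must sum to 1);
    # subtracting the max first is a standard numerical-stability trick
    z = np.exp(scores - np.max(scores))
    return z / z.sum()

def example_log_likelihood(scores, gold_index):
    # w^T f(x, y*) - log sum_y exp(w^T f(x, y))  ==  log P_w(y* | x)
    return scores[gold_index] - np.log(np.sum(np.exp(scores)))

scores = np.array([2.0, 1.0, -0.5])      # hypothetical w^T f(x, y) for Health, Sports, Science
probs = softmax(scores)
print(np.round(probs, 2))                                    # [0.69 0.25 0.06]
print(round(float(example_log_likelihood(scores, 0)), 2),
      round(float(np.log(probs[0])), 2))                     # -0.37 -0.37 (equal)
```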

Multiclass Logistic Regression

$$P_w(y|x) = \frac{\exp\left(w^\top f(x, y)\right)}{\sum_{y' \in \mathcal{Y}} \exp\left(w^\top f(x, y')\right)}$$

(sum over output space to normalize)

‣ Training: maximize

$$\mathcal{L}(x, y) = \sum_{j=1}^{n} \log P(y^*_j | x_j) = \sum_{j=1}^{n} \left( w^\top f(x_j, y^*_j) - \log \sum_{y} \exp\left(w^\top f(x_j, y)\right) \right)$$

Training

‣ Multiclass logistic regression: $P_w(y|x) = \dfrac{\exp(w^\top f(x, y))}{\sum_{y' \in \mathcal{Y}} \exp(w^\top f(x, y'))}$

‣ Likelihood: $\mathcal{L}(x_j, y^*_j) = w^\top f(x_j, y^*_j) - \log \sum_{y} \exp(w^\top f(x_j, y))$

$$\frac{\partial}{\partial w_i}\mathcal{L}(x_j, y^*_j) = f_i(x_j, y^*_j) - \frac{\sum_y f_i(x_j, y) \exp(w^\top f(x_j, y))}{\sum_y \exp(w^\top f(x_j, y))}$$

$$= f_i(x_j, y^*_j) - \sum_y f_i(x_j, y)\, P_w(y|x_j) \;=\; f_i(x_j, y^*_j) - \mathbb{E}_y[f_i(x_j, y)]$$

(gold feature value minus the model's expectation of the feature value)
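A sketch of this gradient for a single example, written directly from the last formula (gold feature vector minus the model's expected feature vector); the dictionary representation of f(x, y) is an illustrative choice.

```python
import numpy as np

def softmax(scores):
    z = np.exp(scores - np.max(scores))    # stable softmax
    return z / z.sum()

def lr_gradient(w, feats, gold):
    # feats: dict mapping each label y to its feature vector f(x, y).
    # Returns f(x, y*) - sum_y P_w(y|x) f(x, y), the gradient for one example.
    labels = list(feats)
    probs = softmax(np.array([w @ feats[y] for y in labels]))
    expected = sum(p * feats[y] for p, y in zip(probs, labels))
    return feats[gold] - expected
```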

Training

Example: "too many drug trials, too few patients", with gold label y* = Health

f(x, y = Health) = [1, 1, 0, 0, 0, 0, 0, 0, 0]

f(x, y = Sports) = [0, 0, 0, 1, 1, 0, 0, 0, 0]

P_w(y|x) = [0.21, 0.77, 0.02]

$$\frac{\partial}{\partial w_i}\mathcal{L}(x_j, y^*_j) = f_i(x_j, y^*_j) - \sum_y f_i(x_j, y)\, P_w(y|x_j)$$

gradient: [1, 1, 0, 0, 0, 0, 0, 0, 0] - 0.21 [1, 1, 0, 0, 0, 0, 0, 0, 0] - 0.77 [0, 0, 0, 1, 1, 0, 0, 0, 0] - 0.02 [0, 0, 0, 0, 0, 0, 1, 1, 0]
= [0.79, 0.79, 0, -0.77, -0.77, 0, -0.02, -0.02, 0]

update: [1.3, 0.9, -5, 3.2, -0.1, 0, 1.1, -1.7, -1.3] + [0.79, 0.79, 0, -0.77, -0.77, 0, -0.02, -0.02, 0] = [2.09, 1.69, -5, 2.43, -0.87, 0, 1.08, -1.72, -1.3]

new P_w(y|x) = [0.89, 0.10, 0.01]
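A self-contained check of the worked update above, taking the slide's P_w(y|x) = [0.21, 0.77, 0.02] as given rather than recomputing it, and using a step size of 1 as the slide implicitly does. The printed values match the slide's gradient, updated weights, and new distribution up to rounding.

```python
import numpy as np

feats = {"Health":  np.array([1., 1., 0., 0., 0., 0., 0., 0., 0.]),
         "Sports":  np.array([0., 0., 0., 1., 1., 0., 0., 0., 0.]),
         "Science": np.array([0., 0., 0., 0., 0., 0., 1., 1., 0.])}
w = np.array([1.3, 0.9, -5.0, 3.2, -0.1, 0.0, 1.1, -1.7, -1.3])
labels = ["Health", "Sports", "Science"]
probs = np.array([0.21, 0.77, 0.02])            # P_w(y|x) as given on the slide
gold = "Health"

# Gradient: gold feature vector minus the model's expected feature vector
grad = feats[gold] - sum(p * feats[y] for p, y in zip(probs, labels))
print(np.round(grad, 2))        # ~ [ 0.79  0.79  0   -0.77 -0.77  0   -0.02 -0.02  0  ]

# Gradient ascent step with step size 1, then re-normalize the new scores
w_new = w + grad
new_scores = np.array([w_new @ feats[y] for y in labels])
new_probs = np.exp(new_scores) / np.exp(new_scores).sum()
print(np.round(w_new, 2))       # ~ [ 2.09  1.69 -5    2.43 -0.87  0    1.08 -1.72 -1.3]
print(np.round(new_probs, 2))   # ~ [ 0.89  0.1   0.01]
```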

Logistic Regression: Summary

‣ Model: $P_w(y|x) = \dfrac{\exp(w^\top f(x, y))}{\sum_{y' \in \mathcal{Y}} \exp(w^\top f(x, y'))}$

‣ Inference: $\arg\max_y P_w(y|x)$

‣ Learning: gradient ascent on the discriminative log-likelihood

$$f(x, y^*) - \mathbb{E}_y[f(x, y)] = f(x, y^*) - \sum_y P_w(y|x)\, f(x, y)$$

("towards gold feature value, away from expectation of feature value")

Multiclass SVM

Soft Margin SVM

Minimize $\lambda \|w\|_2^2 + \sum_{j=1}^{m} \xi_j$

s.t. $\forall j \;\; (2y_j - 1)(w^\top x_j) \geq 1 - \xi_j$ and $\forall j \;\; \xi_j \geq 0$

(slack variables $\xi_j > 0$ iff the example is a support vector)

[figure: soft margin SVM; image credit: Lang Van Tran]

Multiclass SVM

‣ Correct prediction now has to beat every other class

Minimize $\lambda \|w\|_2^2 + \sum_{j=1}^{m} \xi_j$

s.t. $\forall j \;\; \forall y \in \mathcal{Y} \;\; w^\top f(x_j, y^*_j) \geq w^\top f(x_j, y) + \ell(y, y^*_j) - \xi_j$ and $\forall j \;\; \xi_j \geq 0$

‣ The 1 that was in the binary constraint $(2y_j - 1)(w^\top x_j) \geq 1 - \xi_j$ is replaced by a loss function

‣ The score comparison is more explicit now

(slack variables $\xi_j > 0$ iff the example is a support vector)

Training (loss-augmented)

‣ Are all decisions equally costly?

Example: "too many drug trials, too few patients", gold label Health

Predicted Science: not so bad. Predicted Sports: bad error.

‣ We can define a loss function $\ell(y, y^*)$:   $\ell(\text{Science}, \text{Health}) = 1$,   $\ell(\text{Sports}, \text{Health}) = 3$

Multiclass SVM

Loss-augmented scores $w^\top f(x, y) + \ell(y, y^*)$ for the example above (gold = Health):

  Health:  2.4 + 0 = 2.4
  Science: 1.8 + 1 = 2.8
  Sports:  1.3 + 3 = 4.3

‣ Constraint: $\forall j \;\; \forall y \in \mathcal{Y} \;\; w^\top f(x_j, y^*_j) \geq w^\top f(x_j, y) + \ell(y, y^*_j) - \xi_j$

‣ Does gold beat every label + loss? No!

‣ Perceptron would make no update here (gold's raw score 2.4 already beats the other raw scores)

‣ Most violated constraint is Sports; what is $\xi_j$?   $\xi_j$ = 4.3 - 2.4 = 1.9

Multiclass SVM

Minimize $\lambda \|w\|_2^2 + \sum_{j=1}^{m} \xi_j$

s.t. $\forall j \;\; \forall y \in \mathcal{Y} \;\; w^\top f(x_j, y^*_j) \geq w^\top f(x_j, y) + \ell(y, y^*_j) - \xi_j$ and $\forall j \;\; \xi_j \geq 0$

‣ One slack variable per example, so it's set to be whatever the most violated constraint is for that example:

$$\xi_j = \max_{y \in \mathcal{Y}} \left[ w^\top f(x_j, y) + \ell(y, y^*_j) \right] - w^\top f(x_j, y^*_j)$$

‣ Plug in the gold y and you get 0, so slack is always nonnegative!
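A tiny sketch of this slack computation, using the scores and losses from the worked example above (gold = Health); the dictionary layout is illustrative.

```python
def slack(scores, losses, gold):
    # xi_j = max_y [ w^T f(x_j, y) + loss(y, y*_j) ] - w^T f(x_j, y*_j);
    # plugging in y = gold contributes 0, so the max (and hence the slack) is never negative
    return max(scores[y] + losses[y] for y in scores) - scores[gold]

scores = {"Health": 2.4, "Science": 1.8, "Sports": 1.3}   # w^T f(x, y), as in the example above
losses = {"Health": 0.0, "Science": 1.0, "Sports": 3.0}   # loss(y, y* = Health)
print(slack(scores, losses, "Health"))                    # 4.3 - 2.4 = 1.9 (up to float rounding)
```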

Computing the Subgradient

Minimize $\lambda \|w\|_2^2 + \sum_{j=1}^{m} \xi_j$ s.t. $\forall j \;\; \forall y \in \mathcal{Y} \;\; w^\top f(x_j, y^*_j) \geq w^\top f(x_j, y) + \ell(y, y^*_j) - \xi_j$, $\xi_j \geq 0$, where

$$\xi_j = \max_{y \in \mathcal{Y}} \left[ w^\top f(x_j, y) + \ell(y, y^*_j) \right] - w^\top f(x_j, y^*_j)$$

‣ If $\xi_j = 0$, the example is not a support vector and the gradient is zero

‣ Otherwise, $\dfrac{\partial}{\partial w_i} \xi_j = f_i(x_j, y_{\max}) - f_i(x_j, y^*_j)$   (the update looks backwards; we're minimizing here!)

‣ Perceptron-like, but we update away from the *loss-augmented* prediction
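A sketch of one subgradient step for the multiclass SVM under the same block-feature representation as before: do a loss-augmented decode, and if the constraint is violated, step toward the gold features and away from f(x, y_max). The function name and fixed step size are illustrative; regularization is omitted.

```python
import numpy as np

def svm_subgradient_step(w, feats, losses, gold, lr=1.0):
    # feats: dict y -> f(x, y) (numpy vectors); losses: dict y -> loss(y, gold).
    # Loss-augmented decode: y_max = argmax_y [ w^T f(x, y) + loss(y, y*) ]
    y_max = max(feats, key=lambda y: w @ feats[y] + losses[y])
    xi = (w @ feats[y_max] + losses[y_max]) - w @ feats[gold]
    if xi <= 0:                 # not a support vector: subgradient is zero
        return w
    # d(xi)/dw = f(x, y_max) - f(x, y*); we are *minimizing* xi, so step the other way
    return w + lr * (feats[gold] - feats[y_max])
```

Note the contrast with logistic regression sketched earlier: this update only needs an argmax over the label set, not a sum over it.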

Putting it Together

Minimize $\lambda \|w\|_2^2 + \sum_{j=1}^{m} \xi_j$ s.t. $\forall j \;\; \forall y \in \mathcal{Y} \;\; w^\top f(x_j, y^*_j) \geq w^\top f(x_j, y) + \ell(y, y^*_j) - \xi_j$, $\xi_j \geq 0$

‣ (Unregularized) gradients:

‣ SVM: $f(x, y^*) - f(x, y_{\max})$   (loss-augmented max)

‣ Log reg: $f(x, y^*) - \mathbb{E}_y[f(x, y)] = f(x, y^*) - \sum_y P_w(y|x)\, f(x, y)$

‣ SVM: a max over y's to compute the gradient. LR: need to sum over y's

Optimization

Recap

‣ Four elements of a machine learning method:

‣ Model: probabilistic, max-margin, deep neural network

‣ Inference: just maxes and simple expectations so far, but will get harder

‣ Training: gradient descent?

‣ Objective:

Optimization

‣ Stochastic gradient *ascent*

‣ Very simple to code up

$$w \leftarrow w + \alpha g, \qquad g = \frac{\partial}{\partial w}\mathcal{L}$$

[figure from CS231n: loss contours over two weights W_1, W_2]
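A minimal stochastic gradient ascent loop, assuming some per-example gradient function such as the logistic-regression gradient sketched earlier; the fixed step size and epoch count are illustrative.

```python
import numpy as np

def sgd_ascent(w, examples, grad_fn, alpha=0.1, epochs=5):
    # examples: list of (x, gold) pairs; grad_fn(w, x, gold) returns dL/dw for one example.
    for _ in range(epochs):
        np.random.shuffle(examples)               # visit examples in a random order each epoch
        for x, gold in examples:
            w = w + alpha * grad_fn(w, x, gold)   # w <- w + alpha * g
    return w
```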

Optimization

‣ Stochastic gradient *ascent*: $w \leftarrow w + \alpha g$, $g = \frac{\partial}{\partial w}\mathcal{L}$; very simple to code up

Problems with SGD (CS231n):

‣ What if the loss changes quickly in one direction and slowly in another? What does gradient descent do? Very slow progress along the shallow dimension, jitter along the steep direction

‣ Such a loss function has a high condition number: the ratio of the largest to smallest singular value of the Hessian matrix is large

Optimization

Problems with SGD (CS231n):

‣ What if the loss function has local minima or saddle points?

‣ Saddle points are much more common in high dimensions

Dauphin et al., "Identifying and attacking the saddle point problem in high-dimensional non-convex optimization", NIPS 2014

Optimization

‣ Stochastic gradient *ascent* ($w \leftarrow w + \alpha g$, $g = \frac{\partial}{\partial w}\mathcal{L}$) is a "first-order" technique: it only relies on having the gradient

First-Order Optimization (CS231n) [figure: loss as a function of w1]: (1) use the gradient to form a linear approximation; (2) step to minimize the approximation

Second-Order Optimization (CS231n) [figure: loss as a function of w1]: (1) use the gradient and the Hessian to form a quadratic approximation; (2) step to the minimum of the approximation

Optimization

‣ Stochastic gradient *ascent*: $w \leftarrow w + \alpha g$, $g = \frac{\partial}{\partial w}\mathcal{L}$. Setting the step size is hard (decrease it when held-out performance worsens?)

‣ "First-order" technique: only relies on having the gradient

‣ Newton's method, a second-order technique:

$$w \leftarrow w + \left( \frac{\partial^2}{\partial w^2}\mathcal{L} \right)^{-1} g$$

‣ Optimizes a quadratic instantly, but the inverse Hessian is an n x n matrix: expensive!

‣ Quasi-Newton methods (L-BFGS, etc.) approximate the inverse Hessian

AdaGrad

Duchi et al. (2011), "Adaptive subgradient methods for online learning and stochastic optimization", JMLR 2011

‣ Optimized for problems with sparse features

‣ Per-parameter learning rate: smaller updates are made to parameters that get updated frequently

‣ Element-wise scaling of the gradient based on the historical sum of squares in each dimension ("per-parameter learning rates" or "adaptive learning rates")

AdaGrad

Duchi et al. (2011)

$$w_i \leftarrow w_i + \alpha \, \frac{1}{\sqrt{\epsilon + \sum_{\tau=1}^{t} g_{\tau,i}^2}} \, g_{t,i}$$

(the denominator uses the (smoothed) sum of squared gradients from all updates)

‣ Generally more robust than SGD, requires less tuning of the learning rate

‣ Other techniques for optimizing deep models: more later!
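A sketch of the AdaGrad update above in ascent form; the step size, smoothing constant, and toy gradient sequence are illustrative.

```python
import numpy as np

def adagrad_step(w, g, sum_sq, alpha=0.1, eps=1e-8):
    # One AdaGrad ascent update; sum_sq accumulates squared gradients per parameter.
    sum_sq = sum_sq + g ** 2                       # historical sum of squared gradients
    w = w + alpha * g / np.sqrt(eps + sum_sq)      # per-parameter (adaptive) learning rate
    return w, sum_sq

# Usage: coordinates that get updated frequently receive smaller effective steps over time
w, sum_sq = np.zeros(3), np.zeros(3)
for g in [np.array([1.0, 0.0, 0.1]), np.array([1.0, 0.0, 0.0]), np.array([1.0, 0.5, 0.0])]:
    w, sum_sq = adagrad_step(w, g, sum_sq)
print(np.round(w, 3))
```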

Summary

‣ Design tradeoffs need to reflect interactions:

‣ Model and objective are coupled: probabilistic model <-> maximize likelihood

‣ … but not always: a linear model or neural network can be trained to minimize any differentiable loss function

‣ Inference governs learning: you need to be able to compute expectations to use logistic regression

top related