
Multiclass Classification

Wei Xu (many slides from Greg Durrett, Vivek Srikumar, Stanford CS231n)

Administrivia

‣ Problem Set 1 graded (on Gradescope)

‣ Programming Project 1 is released (due 9/20)

‣ Reading: Eisenstein 2.0-2.5, 4.1, 4.3-4.5

‣ Optional readings related to Project 1 were posted by the TA on Piazza

This Lecture

‣ Multiclass fundamentals

‣ Multiclass logistic regression

‣ Feature extraction

‣ Optimization

Multiclass Fundamentals

Text Classification

‣ ~20 classes, e.g.:

Sports

Health

Image Classification

‣ Thousands of classes (ImageNet)

Car

Dog

Entity Linking

‣ 4,500,000 classes (all articles in Wikipedia)

Although he originally won the event, the United States Anti-Doping Agency announced in August 2012 that they had disqualified Armstrong from his seven consecutive Tour de France wins from 1999–2005.

Lance Edward Armstrong is an American former professional road cyclist

Armstrong County is a county in Pennsylvania…

??


Binary Classification

‣ Binary classification: one weight vector defines positive and negative classes

[figure: positive (+) and negative (-) training points separated by a single linear decision boundary]

Multiclass Classification

[figure: data points from three classes (1, 2, 3)]

‣ Can we just use binary classifiers here?

Multiclass Classification

‣ One-vs-all: train k classifiers, one to distinguish each class from all the rest

[figure: the three-class data with one-vs-all decision boundaries]

‣ How do we reconcile multiple positive predictions? Highest score?
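Not on the original slides: a minimal numpy sketch of one-vs-all prediction, assuming k already-trained binary weight vectors stacked in a matrix W (names and numbers are illustrative); the k binary decisions are reconciled by taking the highest raw score.

```python
import numpy as np

def one_vs_all_predict(W, x):
    """One-vs-all prediction: W has one row of binary-classifier weights per
    class; pick the class whose classifier gives the highest raw score."""
    scores = W @ x            # shape (k,): one score per class
    return int(np.argmax(scores))

# Toy example with k=3 classes and 2 features (made-up numbers).
W = np.array([[ 1.0, -0.5],
              [-0.2,  0.8],
              [ 0.1,  0.1]])
x = np.array([0.9, 0.3])
print(one_vs_all_predict(W, x))  # -> 0
```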

Multiclass Classification

‣ Not all classes may even be separable using this approach

[figure: three-class data for which no single one-vs-all boundary separates class 3 from the rest]

‣ Can separate 1 from 2+3 and 2 from 1+3, but not 3 from the others (with these features)

Multiclass Classification

‣ All-vs-all: train n(n-1)/2 classifiers to differentiate each pair of classes

[figure: pairwise training sets, e.g., class 1 vs. class 3 and class 1 vs. class 2]

‣ Again, how to reconcile?

Multiclass Classification

[figure: binary (+/-) data with one boundary alongside three-class (1/2/3) data]

‣ Binary classification: one weight vector defines both classes

‣ Multiclass classification: different weights and/or features per class

Multiclass Classification

‣ Formally: instead of two labels, we have an output space Y containing a number of possible classes

‣ Same machinery that we'll use later for exponentially large output spaces, including sequences and trees

‣ Decision rule:   argmax_{y \in Y} w^\top f(x, y)

‣ Multiple feature vectors, one weight vector: features depend on the choice of label now! (note: this isn't the gold label)

‣ Can also have one weight vector per class:   argmax_{y \in Y} w_y^\top f(x)

‣ The single weight vector approach will generalize to structured output spaces, whereas per-class weight vectors won't

Feature Extraction

Block Feature Vectors

‣ Decision rule:   argmax_{y \in Y} w^\top f(x, y)

Example x: "too many drug trials, too few patients"    Labels: Health, Sports, Science

f(x) = [I[contains drug], I[contains patients], I[contains baseball]] = [1, 1, 0]

f(x, y = Health) = [1, 1, 0, 0, 0, 0, 0, 0, 0]

f(x, y = Sports) = [0, 0, 0, 1, 1, 0, 0, 0, 0]

‣ The feature vector has one block for each label; equivalent to having three weight vectors in this case

‣ Base feature function, e.g.:   I[contains drug & label = Health]
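Not on the original slides: a minimal sketch of the block construction above, assuming the base features are a small fixed vocabulary of indicators and labels are indexed in a list; names are illustrative.

```python
import numpy as np

LABELS = ["Health", "Sports", "Science"]

def base_features(text):
    """Base features f(x): indicators for a tiny hand-picked vocabulary."""
    words = text.lower().split()
    return np.array([float("drug" in words),
                     float("patients" in words),
                     float("baseball" in words)])

def block_features(text, label):
    """f(x, y): copy the base features into the block for label y,
    leaving the other blocks at zero."""
    fx = base_features(text)
    out = np.zeros(len(LABELS) * len(fx))
    start = LABELS.index(label) * len(fx)
    out[start:start + len(fx)] = fx
    return out

x = "too many drug trials , too few patients"
print(block_features(x, "Health"))  # [1. 1. 0. 0. 0. 0. 0. 0. 0.]
print(block_features(x, "Sports"))  # [0. 0. 0. 1. 1. 0. 0. 0. 0.]
```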

Making Decisions

Example x: "too many drug trials, too few patients"    Labels: Health, Sports, Science

f(x) = [I[contains drug], I[contains patients], I[contains baseball]]

f(x, y = Health) = [1, 1, 0, 0, 0, 0, 0, 0, 0]

f(x, y = Sports) = [0, 0, 0, 1, 1, 0, 0, 0, 0]

w = [+2.1, +2.3, -5, -2.1, -3.8, 0, +1.1, -1.7, -1.3]      e.g., "word drug in Science article" = +1.1

w^\top f(x, y):   Health: +4.4    Sports: -5.9    Science: -0.6      argmax = Health
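A quick check of the scores on this slide, as a minimal numpy sketch; the Science block vector follows the same pattern as the two shown above.

```python
import numpy as np

# Block feature vectors for x = "too many drug trials, too few patients"
# (one 3-dimensional block per label: Health, Sports, Science).
f = {
    "Health":  np.array([1, 1, 0, 0, 0, 0, 0, 0, 0], dtype=float),
    "Sports":  np.array([0, 0, 0, 1, 1, 0, 0, 0, 0], dtype=float),
    "Science": np.array([0, 0, 0, 0, 0, 0, 1, 1, 0], dtype=float),
}
w = np.array([+2.1, +2.3, -5, -2.1, -3.8, 0, +1.1, -1.7, -1.3])

# Score every label with the single weight vector and take the argmax.
scores = {y: round(float(w @ fy), 2) for y, fy in f.items()}
print(scores)                       # {'Health': 4.4, 'Sports': -5.9, 'Science': -0.6}
print(max(scores, key=scores.get))  # Health
```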

Another example: POS tagging

the router blocks the packets
DT   NN     VBZ    DT   NNS

‣ Classify blocks as one of 36 POS tags

‣ Example x: sentence with a word (in this case, blocks) highlighted

‣ Extract features with respect to this word:

f(x, y = VBZ) = I[curr_word = blocks & tag = VBZ], I[prev_word = router & tag = VBZ], I[next_word = the & tag = VBZ], I[curr_suffix = s & tag = VBZ]

(not saying that the is tagged as VBZ! saying that the follows the VBZ word)

‣ Next two lectures: sequence labeling!
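Not on the original slides: a minimal sketch of this kind of feature extraction, emitting string-valued indicator features for a (sentence, position, candidate tag) triple; the feature names are illustrative.

```python
def pos_features(words, i, tag):
    """Indicator features for tagging words[i] with `tag`, conjoined with the
    candidate tag so each tag gets its own block of the weight vector."""
    feats = [f"curr_word={words[i]}&tag={tag}",
             f"curr_suffix={words[i][-1:]}&tag={tag}"]
    if i > 0:
        feats.append(f"prev_word={words[i-1]}&tag={tag}")
    if i + 1 < len(words):
        feats.append(f"next_word={words[i+1]}&tag={tag}")
    return feats

words = "the router blocks the packets".split()
print(pos_features(words, 2, "VBZ"))
# ['curr_word=blocks&tag=VBZ', 'curr_suffix=s&tag=VBZ',
#  'prev_word=router&tag=VBZ', 'next_word=the&tag=VBZ']
```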

Multiclass Logistic Regression

Multiclass Logistic Regression

‣ Compare to binary:   P(y = 1 | x) = \frac{\exp(w^\top f(x))}{1 + \exp(w^\top f(x))}     (the negative class implicitly had f(x, y = 0) = the zero vector)

‣ Multiclass:   P_w(y | x) = \frac{\exp(w^\top f(x, y))}{\sum_{y' \in Y} \exp(w^\top f(x, y'))}     (sum over output space to normalize)

Softmax function
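Not on the original slides: a minimal numpy softmax sketch over per-label scores w^T f(x, y), with the usual max-subtraction for numerical stability; the example scores are made up.

```python
import numpy as np

def softmax(scores):
    """Turn unnormalized scores w^T f(x, y) into probabilities P_w(y | x)."""
    z = scores - np.max(scores)      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

print(softmax(np.array([1.0, 2.0, -1.0])).round(2))  # ~[0.26 0.71 0.04]
```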

Multiclass Logistic Regression

P_w(y | x) = \frac{\exp(w^\top f(x, y))}{\sum_{y' \in Y} \exp(w^\top f(x, y'))}     (sum over output space to normalize)

‣ Why? Interpret raw classifier scores as probabilities.

Example x: "too many drug trials, too few patients"

w^\top f(x, y):                                      Health: +2.2    Sports: +3.1    Science: -0.6
exp (unnormalized probabilities, must be >= 0):      6.05            22.2            0.55
normalize (probabilities must sum to 1):             0.21            0.77            0.02

Multiclass Logistic Regression

P_w(y | x) = \frac{\exp(w^\top f(x, y))}{\sum_{y' \in Y} \exp(w^\top f(x, y'))}     (sum over output space to normalize)

‣ Training: maximize the log likelihood (i.e., minimize negative log likelihood, or cross-entropy loss), where j indexes data points:

\mathcal{L}(x, y) = \sum_{j=1}^{n} \log P(y^*_j | x_j) = \sum_{j=1}^{n} \left( w^\top f(x_j, y^*_j) - \log \sum_{y} \exp(w^\top f(x_j, y)) \right)

Multiclass Logistic Regression

P_w(y | x) = \frac{\exp(w^\top f(x, y))}{\sum_{y' \in Y} \exp(w^\top f(x, y'))}     (sum over output space to normalize)

Example x: "too many drug trials, too few patients"

w^\top f(x, y):                                      Health: +2.2    Sports: +3.1    Science: -0.6
exp (unnormalized probabilities, must be >= 0):      6.05            22.2            0.55
normalize (probabilities must sum to 1):             0.21            0.77            0.02
compare to correct (gold) probabilities:             1.00            0.00            0.00

\mathcal{L}(x, y) = \sum_{j=1}^{n} \log P(y^*_j | x_j),   with   \mathcal{L}(x_j, y^*_j) = w^\top f(x_j, y^*_j) - \log \sum_{y} \exp(w^\top f(x_j, y))

log(0.21) = -1.56      Q: max/min of log prob.?
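Not on the original slides: a minimal numpy sketch of the per-example negative log likelihood computed from per-label scores w^T f(x_j, y); the scores in the example call are made up.

```python
import numpy as np

def nll(scores, gold):
    """-log P_w(y* | x) = log sum_y exp(score_y) - score_{y*}."""
    return float(np.log(np.sum(np.exp(scores))) - scores[gold])

# Three labels, gold label at index 0 (toy numbers).
print(round(nll(np.array([1.0, 2.0, -1.0]), gold=0), 2))  # 1.35, i.e. -log(~0.26)
```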

Training

‣ Multiclass logistic regression

‣ Likelihood:   \mathcal{L}(x_j, y^*_j) = w^\top f(x_j, y^*_j) - \log \sum_{y} \exp(w^\top f(x_j, y)),   where   P_w(y | x) = \frac{\exp(w^\top f(x, y))}{\sum_{y' \in Y} \exp(w^\top f(x, y'))}

\frac{\partial}{\partial w_i} \mathcal{L}(x_j, y^*_j) = f_i(x_j, y^*_j) - \frac{\sum_{y} f_i(x_j, y) \exp(w^\top f(x_j, y))}{\sum_{y} \exp(w^\top f(x_j, y))}

= f_i(x_j, y^*_j) - \sum_{y} f_i(x_j, y) P_w(y | x_j)

= f_i(x_j, y^*_j) - \mathbb{E}_y[f_i(x_j, y)]

(gold feature value minus the model's expectation of the feature value)

Training

Example x: "too many drug trials, too few patients",   y* = Health

f(x, y = Health) = [1, 1, 0, 0, 0, 0, 0, 0, 0]

f(x, y = Sports) = [0, 0, 0, 1, 1, 0, 0, 0, 0]

P_w(y | x) = [0.21, 0.77, 0.02]

\frac{\partial}{\partial w_i} \mathcal{L}(x_j, y^*_j) = f_i(x_j, y^*_j) - \sum_{y} f_i(x_j, y) P_w(y | x_j)

gradient:  [1, 1, 0, 0, 0, 0, 0, 0, 0] - 0.21 [1, 1, 0, 0, 0, 0, 0, 0, 0] - 0.77 [0, 0, 0, 1, 1, 0, 0, 0, 0] - 0.02 [0, 0, 0, 0, 0, 0, 1, 1, 0]
        =  [0.79, 0.79, 0, -0.77, -0.77, 0, -0.02, -0.02, 0]

update:  [1.3, 0.9, -5, 3.2, -0.1, 0, 1.1, -1.7, -1.3] + [0.79, 0.79, 0, -0.77, -0.77, 0, -0.02, -0.02, 0] = [2.09, 1.69, -5, 2.43, -0.87, 0, 1.08, -1.72, -1.3]

new P_w(y | x) = [0.89, 0.10, 0.01]
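Not on the original slides: a minimal numpy check of the gradient and update above, taking the slide's P_w(y | x) as given and using a step size of 1 with no regularization.

```python
import numpy as np

# Block feature vectors for Health, Sports, Science; gold label = Health.
F = np.array([[1, 1, 0, 0, 0, 0, 0, 0, 0],
              [0, 0, 0, 1, 1, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 0, 1, 1, 0]], dtype=float)
gold = 0
p = np.array([0.21, 0.77, 0.02])   # P_w(y | x) as given on the slide

grad = F[gold] - p @ F             # f(x, y*) - E_y[f(x, y)]
print(grad.round(2))               # [ 0.79  0.79  0. -0.77 -0.77  0. -0.02 -0.02  0.]

w = np.array([1.3, 0.9, -5, 3.2, -0.1, 0, 1.1, -1.7, -1.3])
w_new = w + 1.0 * grad             # gradient ascent step (step size 1)
print(w_new.round(2))
```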

Multiclass Logistic Regression: Summary

‣ Model:   P_w(y | x) = \frac{\exp(w^\top f(x, y))}{\sum_{y' \in Y} \exp(w^\top f(x, y'))}

‣ Learning: gradient ascent on the discriminative log-likelihood

f(x, y^*) - \mathbb{E}_y[f(x, y)] = f(x, y^*) - \sum_{y} P_w(y | x) f(x, y)

("towards gold feature value, away from expectation of feature value")

‣ Inference:   argmax_y P_w(y | x)

Multiclass SVM

Soft Margin SVM

Minimize   \lambda \|w\|_2^2 + \sum_{j=1}^{m} \xi_j

s.t.   \forall j \; (2y_j - 1)(w^\top x_j) \ge 1 - \xi_j,   \forall j \; \xi_j \ge 0

(slack variables \xi_j > 0 iff the example is a support vector)

Image credit: Lang Van Tran

Multiclass SVM

‣ Correct prediction now has to beat every other class

Minimize   \lambda \|w\|_2^2 + \sum_{j=1}^{m} \xi_j

s.t.   \forall j \; \xi_j \ge 0

The binary constraint \forall j \; (2y_j - 1)(w^\top x_j) \ge 1 - \xi_j becomes:

\forall j \; \forall y \in Y \;\; w^\top f(x_j, y^*_j) \ge w^\top f(x_j, y) + \ell(y, y^*_j) - \xi_j

‣ The 1 that was here is replaced by a loss function

‣ Score comparison is more explicit now

(slack variables \xi_j > 0 iff the example is a support vector)

Training (loss-augmented)

‣ Are all decisions equally costly?

‣ We can define a loss function \ell(y, y^*)

Example x: "too many drug trials, too few patients",   gold label: Health

Predicted Science: not so bad      \ell(Science, Health) = 1

Predicted Sports: bad error        \ell(Sports, Health) = 3

Loss-Augmented Decoding

w^\top f(x, y) + \ell(y, y^*):   Health: 2.4 + 0    Sports: 1.3 + 3    Science: 1.8 + 1

Constraint:   \forall j \; \forall y \in Y \;\; w^\top f(x_j, y^*_j) \ge w^\top f(x_j, y) + \ell(y, y^*_j) - \xi_j

‣ Does gold beat every label + loss? No!

‣ Most violated constraint is Sports; what is \xi_j?

‣ \xi_j = 4.3 - 2.4 = 1.9

‣ Perceptron would make no update here

Loss-Augmented Decoding

Example x: "too many drug trials, too few patients",   gold label: Health

                       Health   Sports   Science
w^\top f(x, y)           +2.4     +1.3     +1.8
Loss \ell(y, y^*)           0        3        1
Total                     2.4      4.3      2.8

\xi_j = \max_{y \in Y} \left[ w^\top f(x_j, y) + \ell(y, y^*_j) \right] - w^\top f(x_j, y^*_j)

‣ Sports is the most violated constraint (the loss-augmented argmax); slack = 4.3 - 2.4 = 1.9

‣ Perceptron would make no update; a regular SVM would pick Science
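Not on the original slides: a minimal sketch of loss-augmented decoding for this example, reproducing the totals, the most violated constraint, and the slack.

```python
import numpy as np

labels = ["Health", "Sports", "Science"]
scores = np.array([2.4, 1.3, 1.8])   # w^T f(x, y) for each label
loss   = np.array([0.0, 3.0, 1.0])   # ell(y, y*) with gold y* = Health
gold   = 0

augmented = scores + loss            # loss-augmented scores
y_max = int(np.argmax(augmented))    # most violated constraint
slack = augmented[y_max] - scores[gold]

print(labels[y_max], augmented.tolist(), round(slack, 1))
# Sports [2.4, 4.3, 2.8] 1.9
```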

Multiclass SVM

Minimize   \lambda \|w\|_2^2 + \sum_{j=1}^{m} \xi_j

s.t.   \forall j \; \xi_j \ge 0,   \forall j \; \forall y \in Y \;\; w^\top f(x_j, y^*_j) \ge w^\top f(x_j, y) + \ell(y, y^*_j) - \xi_j

‣ One slack variable per example, so it's set to be whatever the most violated constraint is for that example:

\xi_j = \max_{y \in Y} \left[ w^\top f(x_j, y) + \ell(y, y^*_j) \right] - w^\top f(x_j, y^*_j)

‣ Plug in the gold y and you get 0, so the slack is always nonnegative!

Computing the Subgradient

Minimize   \lambda \|w\|_2^2 + \sum_{j=1}^{m} \xi_j

s.t.   \forall j \; \xi_j \ge 0,   \forall j \; \forall y \in Y \;\; w^\top f(x_j, y^*_j) \ge w^\top f(x_j, y) + \ell(y, y^*_j) - \xi_j

\xi_j = \max_{y \in Y} \left[ w^\top f(x_j, y) + \ell(y, y^*_j) \right] - w^\top f(x_j, y^*_j)

‣ If \xi_j = 0, the example is not a support vector and the gradient is zero

‣ Otherwise,   \frac{\partial}{\partial w_i} \xi_j = f_i(x_j, y_{max}) - f_i(x_j, y^*_j)

(the update looks backwards because we're minimizing here!)

‣ Perceptron-like, but we update away from the *loss-augmented* prediction
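Not on the original slides: a minimal sketch of one (unregularized) subgradient step on the multiclass hinge loss, reusing block feature vectors; the weight vector is chosen so the scores match the slide's example (2.4, 1.3, 1.8), and the step size is an assumption.

```python
import numpy as np

def svm_subgradient_step(w, F, loss, gold, step=0.1):
    """One unregularized subgradient descent step on the multiclass hinge loss.
    F[y] is f(x, y) for each label; loss[y] is ell(y, y*)."""
    augmented = F @ w + loss                 # loss-augmented scores
    y_max = int(np.argmax(augmented))
    slack = augmented[y_max] - F[gold] @ w
    if slack <= 0:                           # not a support vector: no update
        return w
    # d(xi_j)/dw = f(x, y_max) - f(x, y*); step downhill since we minimize
    return w - step * (F[y_max] - F[gold])

F = np.array([[1, 1, 0, 0, 0, 0, 0, 0, 0],
              [0, 0, 0, 1, 1, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 0, 1, 1, 0]], dtype=float)
loss = np.array([0.0, 3.0, 1.0])             # gold label: Health (index 0)
w = np.array([1.3, 1.1, 0, 1.0, 0.3, 0, 0.9, 0.9, 0])
print(svm_subgradient_step(w, F, loss, gold=0))
```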

Putting it Together

Minimize   \lambda \|w\|_2^2 + \sum_{j=1}^{m} \xi_j

s.t.   \forall j \; \xi_j \ge 0,   \forall j \; \forall y \in Y \;\; w^\top f(x_j, y^*_j) \ge w^\top f(x_j, y) + \ell(y, y^*_j) - \xi_j

‣ (Unregularized) gradients:

‣ SVM:   f(x, y^*) - f(x, y_{max})    (loss-augmented max)

‣ Logreg:   f(x, y^*) - \mathbb{E}_y[f(x, y)] = f(x, y^*) - \sum_{y} P_w(y | x) f(x, y)

‣ SVM: max over y's (with \ell(y, y^*)) to compute the gradient. LR: need to sum over y's

Softmax Margin

‣ Can we include a loss function in logistic regression?

P(y | x) = \frac{\exp(w^\top f(x, y) + \ell(y, y^*))}{\sum_{y'} \exp(w^\top f(x, y') + \ell(y', y^*))}

‣ Likelihood is artificially higher for things with high loss; training needs to work even harder to maximize the likelihood of the right thing!

[figure: right answer shown alongside high-loss and low-loss alternatives]

‣ Biased estimator for the original likelihood, but a better loss    Gimpel and Smith (2010)

Entity Linking

Although he originally won the event, the United States Anti-Doping Agency announced in August 2012 that they had disqualified Armstrong from his seven consecutive Tour de France wins from 1999–2005.

Lance Edward Armstrong is an American former professional road cyclist

Armstrong County is a county in Pennsylvania…

??

‣ 4.5M classes, not enough data to learn features like "Tour de France <-> en/wiki/Lance_Armstrong"

‣ Instead, features f(x, y) look at the actual article associated with y

Entity Linking

Although he originally won the event, the United States Anti-Doping Agency announced in August 2012 that they had disqualified Armstrong from his seven consecutive Tour de France wins from 1999–2005.

Candidates: Lance Edward Armstrong,  Armstrong County

‣ tf-idf(doc, w) = freq of w in doc * log(4.5M / # Wiki articles w occurs in)

‣ the: occurs in every article, tf-idf = 0

‣ cyclist: occurs in 1% of articles, tf-idf = # occurrences * log10(100)

‣ tf-idf(doc) = vector of tf-idf(doc, w) for all words in the vocabulary (50,000)

‣ f(x, y) = [cos(tf-idf(x), tf-idf(y)), … other features]
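Not on the original slides: a minimal sketch of the tf-idf cosine feature, with a three-document toy collection standing in for Wikipedia; the vocabulary and counts are made up.

```python
import numpy as np
from collections import Counter

docs = {
    "query":            "armstrong tour de france cyclist doping",
    "Lance_Armstrong":  "armstrong american professional road cyclist tour de france",
    "Armstrong_County": "armstrong county pennsylvania county seat kittanning",
}
N = len(docs)
# document frequency of each word across the toy collection
df = Counter(w for text in docs.values() for w in set(text.split()))
vocab = sorted(df)

def tfidf(text):
    """tf-idf(doc, w) = freq of w in doc * log(N / # docs containing w)."""
    tf = Counter(text.split())
    return np.array([tf[w] * np.log(N / df[w]) for w in vocab])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

q = tfidf(docs["query"])
for name in ["Lance_Armstrong", "Armstrong_County"]:
    print(name, round(cosine(q, tfidf(docs[name])), 3))
```

With these toy documents, the cyclist article shares the informative words (tour, de, france, cyclist) with the query, so its cosine is much higher than the county article's.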

Optimization

Recap

‣ Four elements of a machine learning method:

‣ Model: probabilistic, max-margin, deep neural network

‣ Inference: just maxes and simple expectations so far, but will get harder

‣ Training: gradient descent?

‣ Objective:

Optimization

‣ Gradient descent

‣ Batch update for logistic regression

‣ Each update is based on a computation over the entire dataset

[figure: loss L(w) as a function of w, descending to L_min; contour plot over parameters W_1, W_2]

‣ Very simple to code up

Optimization

‣ Stochastic gradient descent:   w \leftarrow w + \alpha g,   g = \frac{\partial}{\partial w} \mathcal{L}

‣ Approx. gradient is computed on a single instance

‣ Q: What if the loss changes quickly in one direction and slowly in another direction? What does gradient descent do?

‣ The loss function has a high condition number: the ratio of the largest to smallest singular value of the Hessian matrix is large

‣ Very slow progress along the shallow dimension, jitter along the steep direction

[figure (CS231n): contour plot of such a loss, with the gradient descent path zigzagging across the valley]
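Not on the original slides: a minimal sketch of the SGD loop for the multiclass logistic regression gradient, one example per update; the data shapes, step size, and number of passes are assumptions.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - np.max(s))
    return e / e.sum()

def sgd_epoch(w, feats, golds, alpha=0.1):
    """One pass of stochastic gradient ascent on the log likelihood.
    feats[j] has shape (num_labels, num_feats): the block vectors f(x_j, y)."""
    for F, gold in zip(feats, golds):
        p = softmax(F @ w)              # P_w(y | x_j) for every label
        g = F[gold] - p @ F             # f(x_j, y*) - E_y[f(x_j, y)]
        w = w + alpha * g               # ascent: single-example update
    return w

# Tiny toy dataset: one example with 3 labels, 9 block features, gold = 0.
F = np.array([[1, 1, 0, 0, 0, 0, 0, 0, 0],
              [0, 0, 0, 1, 1, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 0, 1, 1, 0]], dtype=float)
w = np.zeros(9)
for _ in range(50):
    w = sgd_epoch(w, [F], [0])
print(softmax(F @ w).round(2))          # probability of the gold label rises
```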

Optimization

‣ Stochastic gradient descent:   w \leftarrow w + \alpha g,   g = \frac{\partial}{\partial w} \mathcal{L}

‣ Very simple to code up

‣ What if the loss function has a local minimum or saddle point?

"Identifying and attacking the saddle point problem in high-dimensional non-convex optimization", Dauphin et al. (2014)

Optimization

‣ Stochastic gradient descent:   w \leftarrow w + \alpha g,   g = \frac{\partial}{\partial w} \mathcal{L}

‣ Very simple to code up

‣ "First-order" technique: only relies on having the gradient

First-Order Optimization (CS231n): (1) use the gradient to form a linear approximation of the loss; (2) step to minimize the approximation

Second-Order Optimization (CS231n): (1) use the gradient and Hessian to form a quadratic approximation of the loss; (2) step to the minimum of the approximation

[figure: loss vs. w1 with the linear and quadratic approximations drawn]

Optimization (extracurricular)

‣ Stochastic gradient descent:   w \leftarrow w + \alpha g,   g = \frac{\partial}{\partial w} \mathcal{L}

‣ Very simple to code up

‣ "First-order" technique: only relies on having the gradient

‣ Setting the step size is hard (decrease it when held-out performance worsens?)

‣ Newton's method:   w \leftarrow w + \left( \frac{\partial^2}{\partial w^2} \mathcal{L} \right)^{-1} g

‣ Second-order technique; optimizes a quadratic instantly

‣ Inverse Hessian: n x n matrix, expensive!

‣ Quasi-Newton methods: L-BFGS, etc. approximate the inverse Hessian

AdaGrad (extracurricular)

Duchi et al. (2011)

‣ Optimized for problems with sparse features

‣ Per-parameter learning rate: smaller updates are made to parameters that get updated frequently

‣ Adds element-wise scaling of the gradient based on the historical sum of squares in each dimension ("per-parameter learning rates" or "adaptive learning rates")

Duchi et al., "Adaptive subgradient methods for online learning and stochastic optimization", JMLR 2011

AdaGrad (extracurricular)

Duchi et al. (2011)

‣ Optimized for problems with sparse features

‣ Per-parameter learning rate: smaller updates are made to parameters that get updated frequently

w_i \leftarrow w_i + \alpha \frac{1}{\sqrt{\epsilon + \sum_{\tau=1}^{t} g_{\tau,i}^2}} g_{t,i}

((smoothed) sum of squared gradients from all updates)

‣ Generally more robust than SGD, requires less tuning of the learning rate

‣ Other techniques for optimizing deep models: more later!
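Not on the original slides: a minimal sketch of the AdaGrad update above, accumulating squared gradients per parameter; the epsilon, step size, and toy gradients are assumptions, and the ascent sign follows the slide.

```python
import numpy as np

class AdaGrad:
    """Per-parameter learning rates: scale each update by the inverse square
    root of the (smoothed) historical sum of squared gradients."""
    def __init__(self, dim, alpha=0.1, eps=1e-8):
        self.alpha, self.eps = alpha, eps
        self.sum_sq = np.zeros(dim)

    def step(self, w, g):
        self.sum_sq += g ** 2
        return w + self.alpha * g / np.sqrt(self.eps + self.sum_sq)

# Frequently-updated coordinates get smaller and smaller steps.
opt = AdaGrad(dim=2)
w = np.zeros(2)
for _ in range(3):
    w = opt.step(w, np.array([1.0, 0.0]))   # only coordinate 0 gets gradient
print(w.round(3))   # coordinate 0 grows by shrinking increments
```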

Summary

‣ Design tradeoffs need to reflect interactions:

‣ Model and objective are coupled: probabilistic model <-> maximize likelihood

‣ … but not always: a linear model or neural network can be trained to minimize any differentiable loss function

‣ Inference governs what learning you can do: need to be able to compute expectations to use logistic regression

Next Up

‣ You've now seen everything you need to implement multi-class classification models

‣ Next time: Neural Network Basics!

‣ In 2 weeks: Sequential Models (HMM, CRF, …) for POS tagging, NER