Multiclass Classification (GitHub Pages: aritter.github.io/courses/5525_slides_v2/lec3-mcc.pdf)


Multiclass Classification

Alan Ritter (many slides from Greg Durrett, Vivek Srikumar, Stanford CS231n)

This Lecture

‣ Multiclass fundamentals

‣ Multiclass logistic regression

‣ Multiclass SVM

‣ Feature extraction

‣ Optimization

Multiclass Fundamentals

Text Classification

‣ ~20 classes (e.g., Sports, Health)

Image Classification

‣ Thousands of classes (ImageNet), e.g., Car, Dog

Entity Linking

‣ 4,500,000 classes (all articles in Wikipedia)

Although he originally won the event, the United States Anti-Doping Agency announced in August 2012 that they had disqualified Armstrong from his seven consecutive Tour de France wins from 1999–2005.

Which Wikipedia article does "Armstrong" refer to?

‣ Lance Edward Armstrong is an American former professional road cyclist

‣ Armstrong County is a county in Pennsylvania…

Reading Comprehension

‣ Multiple choice questions, 4 classes (but classes change per example)

Richardson (2013)

Binary Classification

‣ Binary classification: one weight vector defines positive and negative classes

[figure: positive (+) and negative (-) points separated by a single linear boundary]

Multiclass Classification

[figure: data points from three classes, labeled 1, 2, 3]

‣ Can we just use binary classifiers here?

Multiclass Classification

‣ One-vs-all: train k classifiers, one to distinguish each class from all the rest

[figure: the same three-class data, shown once per one-vs-all classifier]

‣ How do we reconcile multiple positive predictions? Highest score? (see the sketch below)
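A minimal one-vs-all sketch in numpy to make the scheme concrete: train k binary separators and reconcile multiple (or zero) positive predictions by taking the highest score. The simple perceptron here stands in for whatever binary learner you prefer; the function names, the toy data, and the lack of a bias term are illustrative choices, not part of the lecture.

```python
import numpy as np

def train_binary(X, z, epochs=20, lr=0.1):
    # Stand-in binary learner (perceptron, no bias): returns a weight vector
    # scoring the positive class (z = +1) above the rest (z = -1).
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, zi in zip(X, z):
            if zi * (w @ x) <= 0:          # misclassified (or on the boundary): update
                w += lr * zi * x
    return w

def train_one_vs_all(X, y, num_classes):
    # One classifier per class, each trained to distinguish that class from all the rest
    return [train_binary(X, np.where(y == k, 1.0, -1.0)) for k in range(num_classes)]

def predict_one_vs_all(ws, x):
    # Reconcile multiple (or zero) positive predictions by taking the highest score
    return int(np.argmax([w @ x for w in ws]))

# Toy, linearly separable 2D data with three classes (illustrative only)
X = np.array([[2.0, 0.1], [1.8, -0.2], [-1.5, 1.9], [-1.2, 2.1], [-0.1, -2.0], [0.2, -1.7]])
y = np.array([0, 0, 1, 1, 2, 2])
ws = train_one_vs_all(X, y, num_classes=3)
print([predict_one_vs_all(ws, x) for x in X])
```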

Multiclass Classification

‣ Not all classes may even be separable using this approach

[figure: three classes arranged so that class 3 cannot be linearly separated from classes 1 and 2]

‣ Can separate 1 from 2+3 and 2 from 1+3, but not 3 from the others (with these features)

Multiclass Classification

‣ All-vs-all: train n(n-1)/2 classifiers to differentiate each pair of classes

[figure: the three-class data restricted to one pair of classes at a time]

‣ Again, how to reconcile? (see the sketch below)
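A companion sketch for all-vs-all: given the n(n-1)/2 pairwise weight vectors (hypothetical values below), one common way to reconcile their decisions is majority voting; the lecture leaves the reconciliation rule open, so this is one possible answer rather than the prescribed one.

```python
import numpy as np

def predict_all_vs_all(pair_ws, x, num_classes):
    # pair_ws maps each class pair (a, b), a < b, to a weight vector whose score
    # is positive for class a and negative for class b (assumed already trained).
    votes = np.zeros(num_classes)
    for (a, b), w in pair_ws.items():
        votes[a if w @ x > 0 else b] += 1      # each pairwise classifier casts one vote
    return int(np.argmax(votes))               # reconcile by majority vote (one option)

# Hypothetical pairwise weight vectors for 3 classes in 2D
pair_ws = {(0, 1): np.array([1.0, -1.0]),
           (0, 2): np.array([1.0, 1.0]),
           (1, 2): np.array([-1.0, 1.0])}
print(predict_all_vs_all(pair_ws, np.array([2.0, 0.1]), num_classes=3))   # -> 0
```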

Multiclass Classification

[figure: binary data with one +/- boundary, next to the three-class data]

‣ Binary classification: one weight vector defines both classes

‣ Multiclass classification: different weights and/or features per class

Multiclass Classification

‣ Formally: instead of two labels, we have an output space $\mathcal{Y}$ containing a number of possible classes

‣ Same machinery that we'll use later for exponentially large output spaces, including sequences and trees

‣ Decision rule: $\arg\max_{y \in \mathcal{Y}} w^\top f(x, y)$

‣ Multiple feature vectors, one weight vector: features depend on the choice of label now! (note: this isn't the gold label)

‣ Can also have one weight vector per class: $\arg\max_{y \in \mathcal{Y}} w_y^\top f(x)$

‣ The single weight vector approach will generalize to structured output spaces, whereas per-class weight vectors won't

Feature Extraction

Block Feature Vectors

‣ Decision rule: $\arg\max_{y \in \mathcal{Y}} w^\top f(x, y)$

Example: "too many drug trials, too few patients"; labels: Health, Sports, Science

f(x) = [I[contains drug], I[contains patients], I[contains baseball]] = [1, 1, 0]

f(x, y = Health) = [1, 1, 0, 0, 0, 0, 0, 0, 0]

f(x, y = Sports) = [0, 0, 0, 1, 1, 0, 0, 0, 0]

‣ Feature vector blocks for each label; equivalent to having three weight vectors in this case

‣ Base feature function I[contains drug] becomes, e.g., I[contains drug & label = Health]
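A small sketch of the block feature construction above, using the slide's base features and label set; the helper names (base_features, block_features) and the word-matching logic are illustrative.

```python
import numpy as np

LABELS = ["Health", "Sports", "Science"]
BASE_FEATURES = ["contains drug", "contains patients", "contains baseball"]

def base_features(text):
    # f(x): indicator features over the input only
    words = text.lower().split()
    return np.array([1.0 if "drug" in words else 0.0,
                     1.0 if "patients" in words else 0.0,
                     1.0 if "baseball" in words else 0.0])

def block_features(text, label):
    # f(x, y): copy f(x) into the block owned by label y, zeros elsewhere, so
    # "contains drug & label = Health" becomes one coordinate of a 9-dim vector
    fx = base_features(text)
    f = np.zeros(len(BASE_FEATURES) * len(LABELS))
    start = LABELS.index(label) * len(BASE_FEATURES)
    f[start:start + len(fx)] = fx
    return f

x = "too many drug trials, too few patients"
print(block_features(x, "Health"))   # [1. 1. 0. 0. 0. 0. 0. 0. 0.]
print(block_features(x, "Sports"))   # [0. 0. 0. 1. 1. 0. 0. 0. 0.]
```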

Making Decisions

Example: "too many drug trials, too few patients"; labels: Health, Sports, Science

f(x) = [I[contains drug], I[contains patients], I[contains baseball]]

f(x, y = Health) = [1, 1, 0, 0, 0, 0, 0, 0, 0]

f(x, y = Sports) = [0, 0, 0, 1, 1, 0, 0, 0, 0]

w = [+2.1, +2.3, -5, -2.1, -3.8, 0, +1.1, -1.7, -1.3]   ("word drug in Science article" = +1.1)

$w^\top f(x, y)$:   Health: +4.4   Sports: -5.9   Science: -0.6   →   argmax = Health
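Scoring each label with the slide's weight vector reproduces the decision above; the feature vectors are copied from the slide, and the dictionary layout is just for readability.

```python
import numpy as np

# Weight vector and block feature vectors copied from the slide
w = np.array([+2.1, +2.3, -5.0, -2.1, -3.8, 0.0, +1.1, -1.7, -1.3])
f = {"Health":  np.array([1., 1., 0., 0., 0., 0., 0., 0., 0.]),
     "Sports":  np.array([0., 0., 0., 1., 1., 0., 0., 0., 0.]),
     "Science": np.array([0., 0., 0., 0., 0., 0., 1., 1., 0.])}

scores = {y: round(float(w @ fy), 2) for y, fy in f.items()}
print(scores)                         # {'Health': 4.4, 'Sports': -5.9, 'Science': -0.6}
print(max(scores, key=scores.get))    # argmax_y w^T f(x, y)  ->  Health
```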

Another example: POS tagging

‣ Sentence: "the router blocks the packets", with tags DT NN VBZ DT NNS

‣ Classify blocks as one of 36 POS tags

‣ Example x: sentence with a word (in this case, blocks) highlighted

‣ Extract features with respect to this word:

f(x, y = VBZ) = [I[curr_word = blocks & tag = VBZ], I[prev_word = router & tag = VBZ], I[next_word = the & tag = VBZ], I[curr_suffix = s & tag = VBZ]]

(not saying that "the" is tagged as VBZ! saying that "the" follows the VBZ word)

‣ Next two lectures: sequence labeling!

Multiclass Logistic Regression

Multiclass Logistic Regression

$$P_w(y|x) = \frac{\exp\left(w^\top f(x, y)\right)}{\sum_{y' \in \mathcal{Y}} \exp\left(w^\top f(x, y')\right)}$$

(sum over output space to normalize)

‣ Compare to binary: $P(y = 1|x) = \dfrac{\exp(w^\top f(x))}{1 + \exp(w^\top f(x))}$; the negative class implicitly had f(x, y = 0) = the zero vector

Multiclass Logistic Regression

$$P_w(y|x) = \frac{\exp\left(w^\top f(x, y)\right)}{\sum_{y' \in \mathcal{Y}} \exp\left(w^\top f(x, y')\right)}$$

(sum over output space to normalize)

Why? Interpret raw classifier scores as probabilities. For "too many drug trials, too few patients":

              score w^T f(x,y)   exp (unnormalized,   softmax (normalized,   correct (gold)
                                 must be >= 0)        must sum to 1)         probabilities
  Health           +2.2               6.05                 0.21                  1.00
  Sports           +3.1              22.2                  0.77                  0.00
  Science          -0.6               0.55                 0.02                  0.00

compare: training maximizes $\mathcal{L}(x, y) = \sum_{j=1}^{n} \log P(y^*_j | x_j)$, where
$\mathcal{L}(x_j, y^*_j) = w^\top f(x_j, y^*_j) - \log \sum_{y} \exp\left(w^\top f(x_j, y)\right)$; here $\log(0.21) = -1.56$
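A minimal sketch of the softmax and the per-example log-likelihood term, assuming the scores $w^\top f(x, y)$ are already computed; the score vector below is a made-up stand-in rather than the slide's numbers.

```python
import numpy as np

def softmax(scores):
    # exponentiate (probabilities must be >= 0), then normalize (must sum to 1);
    # subtracting the max first is a standard numerical-stability trick
    z = np.exp(scores - np.max(scores))
    return z / z.sum()

def example_log_likelihood(scores, gold_index):
    # w^T f(x, y*) - log sum_y exp(w^T f(x, y))  ==  log P_w(y* | x)
    return scores[gold_index] - np.log(np.sum(np.exp(scores)))

scores = np.array([2.0, 1.0, -0.5])      # hypothetical w^T f(x, y) for Health, Sports, Science
probs = softmax(scores)
print(np.round(probs, 2))                                    # [0.69 0.25 0.06]
print(round(float(example_log_likelihood(scores, 0)), 2),
      round(float(np.log(probs[0])), 2))                     # -0.37 -0.37 (equal)
```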

Multiclass Logistic Regression

$$P_w(y|x) = \frac{\exp\left(w^\top f(x, y)\right)}{\sum_{y' \in \mathcal{Y}} \exp\left(w^\top f(x, y')\right)}$$

(sum over output space to normalize)

‣ Training: maximize

$$\mathcal{L}(x, y) = \sum_{j=1}^{n} \log P(y^*_j | x_j) = \sum_{j=1}^{n} \left( w^\top f(x_j, y^*_j) - \log \sum_{y} \exp\left(w^\top f(x_j, y)\right) \right)$$

Training

‣ Multiclass logistic regression: $P_w(y|x) = \dfrac{\exp(w^\top f(x, y))}{\sum_{y' \in \mathcal{Y}} \exp(w^\top f(x, y'))}$

‣ Likelihood: $\mathcal{L}(x_j, y^*_j) = w^\top f(x_j, y^*_j) - \log \sum_{y} \exp(w^\top f(x_j, y))$

$$\frac{\partial}{\partial w_i}\mathcal{L}(x_j, y^*_j) = f_i(x_j, y^*_j) - \frac{\sum_y f_i(x_j, y) \exp(w^\top f(x_j, y))}{\sum_y \exp(w^\top f(x_j, y))}$$

$$= f_i(x_j, y^*_j) - \sum_y f_i(x_j, y)\, P_w(y|x_j) \;=\; f_i(x_j, y^*_j) - \mathbb{E}_y[f_i(x_j, y)]$$

(gold feature value minus the model's expectation of the feature value)
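A sketch of this gradient for a single example, written directly from the last formula (gold feature vector minus the model's expected feature vector); the dictionary representation of f(x, y) is an illustrative choice.

```python
import numpy as np

def softmax(scores):
    z = np.exp(scores - np.max(scores))    # stable softmax
    return z / z.sum()

def lr_gradient(w, feats, gold):
    # feats: dict mapping each label y to its feature vector f(x, y).
    # Returns f(x, y*) - sum_y P_w(y|x) f(x, y), the gradient for one example.
    labels = list(feats)
    probs = softmax(np.array([w @ feats[y] for y in labels]))
    expected = sum(p * feats[y] for p, y in zip(probs, labels))
    return feats[gold] - expected
```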

Training

Example: "too many drug trials, too few patients", with gold label y* = Health

f(x, y = Health) = [1, 1, 0, 0, 0, 0, 0, 0, 0]

f(x, y = Sports) = [0, 0, 0, 1, 1, 0, 0, 0, 0]

P_w(y|x) = [0.21, 0.77, 0.02]

$$\frac{\partial}{\partial w_i}\mathcal{L}(x_j, y^*_j) = f_i(x_j, y^*_j) - \sum_y f_i(x_j, y)\, P_w(y|x_j)$$

gradient: [1, 1, 0, 0, 0, 0, 0, 0, 0] - 0.21 [1, 1, 0, 0, 0, 0, 0, 0, 0] - 0.77 [0, 0, 0, 1, 1, 0, 0, 0, 0] - 0.02 [0, 0, 0, 0, 0, 0, 1, 1, 0]
= [0.79, 0.79, 0, -0.77, -0.77, 0, -0.02, -0.02, 0]

update: [1.3, 0.9, -5, 3.2, -0.1, 0, 1.1, -1.7, -1.3] + [0.79, 0.79, 0, -0.77, -0.77, 0, -0.02, -0.02, 0] = [2.09, 1.69, -5, 2.43, -0.87, 0, 1.08, -1.72, -1.3]

new P_w(y|x) = [0.89, 0.10, 0.01]
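A self-contained check of the worked update above, taking the slide's P_w(y|x) = [0.21, 0.77, 0.02] as given rather than recomputing it, and using a step size of 1 as the slide implicitly does. The printed values match the slide's gradient, updated weights, and new distribution up to rounding.

```python
import numpy as np

feats = {"Health":  np.array([1., 1., 0., 0., 0., 0., 0., 0., 0.]),
         "Sports":  np.array([0., 0., 0., 1., 1., 0., 0., 0., 0.]),
         "Science": np.array([0., 0., 0., 0., 0., 0., 1., 1., 0.])}
w = np.array([1.3, 0.9, -5.0, 3.2, -0.1, 0.0, 1.1, -1.7, -1.3])
labels = ["Health", "Sports", "Science"]
probs = np.array([0.21, 0.77, 0.02])            # P_w(y|x) as given on the slide
gold = "Health"

# Gradient: gold feature vector minus the model's expected feature vector
grad = feats[gold] - sum(p * feats[y] for p, y in zip(probs, labels))
print(np.round(grad, 2))        # ~ [ 0.79  0.79  0   -0.77 -0.77  0   -0.02 -0.02  0  ]

# Gradient ascent step with step size 1, then re-normalize the new scores
w_new = w + grad
new_scores = np.array([w_new @ feats[y] for y in labels])
new_probs = np.exp(new_scores) / np.exp(new_scores).sum()
print(np.round(w_new, 2))       # ~ [ 2.09  1.69 -5    2.43 -0.87  0    1.08 -1.72 -1.3]
print(np.round(new_probs, 2))   # ~ [ 0.89  0.1   0.01]
```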

Logistic Regression: Summary

‣ Model: $P_w(y|x) = \dfrac{\exp(w^\top f(x, y))}{\sum_{y' \in \mathcal{Y}} \exp(w^\top f(x, y'))}$

‣ Inference: $\arg\max_y P_w(y|x)$

‣ Learning: gradient ascent on the discriminative log-likelihood

$$f(x, y^*) - \mathbb{E}_y[f(x, y)] = f(x, y^*) - \sum_y P_w(y|x)\, f(x, y)$$

("towards gold feature value, away from expectation of feature value")

Multiclass SVM

Soft Margin SVM

Minimize $\lambda \|w\|_2^2 + \sum_{j=1}^{m} \xi_j$

s.t. $\forall j \;\; (2y_j - 1)(w^\top x_j) \geq 1 - \xi_j$ and $\forall j \;\; \xi_j \geq 0$

(slack variables $\xi_j > 0$ iff the example is a support vector)

[figure: soft margin SVM; image credit: Lang Van Tran]

Multiclass SVM

‣ Correct prediction now has to beat every other class

Minimize $\lambda \|w\|_2^2 + \sum_{j=1}^{m} \xi_j$

s.t. $\forall j \;\; \forall y \in \mathcal{Y} \;\; w^\top f(x_j, y^*_j) \geq w^\top f(x_j, y) + \ell(y, y^*_j) - \xi_j$ and $\forall j \;\; \xi_j \geq 0$

‣ The 1 that was in the binary constraint $(2y_j - 1)(w^\top x_j) \geq 1 - \xi_j$ is replaced by a loss function

‣ The score comparison is more explicit now

(slack variables $\xi_j > 0$ iff the example is a support vector)

Training (loss-augmented)

‣ Are all decisions equally costly?

Example: "too many drug trials, too few patients", gold label Health

Predicted Science: not so bad. Predicted Sports: bad error.

‣ We can define a loss function $\ell(y, y^*)$:   $\ell(\text{Science}, \text{Health}) = 1$,   $\ell(\text{Sports}, \text{Health}) = 3$

Multiclass SVM

Loss-augmented scores $w^\top f(x, y) + \ell(y, y^*)$ for the example above (gold = Health):

  Health:  2.4 + 0 = 2.4
  Science: 1.8 + 1 = 2.8
  Sports:  1.3 + 3 = 4.3

‣ Constraint: $\forall j \;\; \forall y \in \mathcal{Y} \;\; w^\top f(x_j, y^*_j) \geq w^\top f(x_j, y) + \ell(y, y^*_j) - \xi_j$

‣ Does gold beat every label + loss? No!

‣ Perceptron would make no update here (gold's raw score 2.4 already beats the other raw scores)

‣ Most violated constraint is Sports; what is $\xi_j$?   $\xi_j$ = 4.3 - 2.4 = 1.9

Multiclass SVM

Minimize $\lambda \|w\|_2^2 + \sum_{j=1}^{m} \xi_j$

s.t. $\forall j \;\; \forall y \in \mathcal{Y} \;\; w^\top f(x_j, y^*_j) \geq w^\top f(x_j, y) + \ell(y, y^*_j) - \xi_j$ and $\forall j \;\; \xi_j \geq 0$

‣ One slack variable per example, so it's set to be whatever the most violated constraint is for that example:

$$\xi_j = \max_{y \in \mathcal{Y}} \left[ w^\top f(x_j, y) + \ell(y, y^*_j) \right] - w^\top f(x_j, y^*_j)$$

‣ Plug in the gold y and you get 0, so slack is always nonnegative!
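A tiny sketch of this slack computation, using the scores and losses from the worked example above (gold = Health); the dictionary layout is illustrative.

```python
def slack(scores, losses, gold):
    # xi_j = max_y [ w^T f(x_j, y) + loss(y, y*_j) ] - w^T f(x_j, y*_j);
    # plugging in y = gold contributes 0, so the max (and hence the slack) is never negative
    return max(scores[y] + losses[y] for y in scores) - scores[gold]

scores = {"Health": 2.4, "Science": 1.8, "Sports": 1.3}   # w^T f(x, y), as in the example above
losses = {"Health": 0.0, "Science": 1.0, "Sports": 3.0}   # loss(y, y* = Health)
print(slack(scores, losses, "Health"))                    # 4.3 - 2.4 = 1.9 (up to float rounding)
```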

Computing the Subgradient

Minimize $\lambda \|w\|_2^2 + \sum_{j=1}^{m} \xi_j$ s.t. $\forall j \;\; \forall y \in \mathcal{Y} \;\; w^\top f(x_j, y^*_j) \geq w^\top f(x_j, y) + \ell(y, y^*_j) - \xi_j$, $\xi_j \geq 0$, where

$$\xi_j = \max_{y \in \mathcal{Y}} \left[ w^\top f(x_j, y) + \ell(y, y^*_j) \right] - w^\top f(x_j, y^*_j)$$

‣ If $\xi_j = 0$, the example is not a support vector and the gradient is zero

‣ Otherwise, $\dfrac{\partial}{\partial w_i} \xi_j = f_i(x_j, y_{\max}) - f_i(x_j, y^*_j)$   (the update looks backwards; we're minimizing here!)

‣ Perceptron-like, but we update away from the *loss-augmented* prediction
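A sketch of one subgradient step for the multiclass SVM under the same block-feature representation as before: do a loss-augmented decode, and if the constraint is violated, step toward the gold features and away from f(x, y_max). The function name and fixed step size are illustrative; regularization is omitted.

```python
import numpy as np

def svm_subgradient_step(w, feats, losses, gold, lr=1.0):
    # feats: dict y -> f(x, y) (numpy vectors); losses: dict y -> loss(y, gold).
    # Loss-augmented decode: y_max = argmax_y [ w^T f(x, y) + loss(y, y*) ]
    y_max = max(feats, key=lambda y: w @ feats[y] + losses[y])
    xi = (w @ feats[y_max] + losses[y_max]) - w @ feats[gold]
    if xi <= 0:                 # not a support vector: subgradient is zero
        return w
    # d(xi)/dw = f(x, y_max) - f(x, y*); we are *minimizing* xi, so step the other way
    return w + lr * (feats[gold] - feats[y_max])
```

Note the contrast with logistic regression sketched earlier: this update only needs an argmax over the label set, not a sum over it.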

Putting it Together

Minimize $\lambda \|w\|_2^2 + \sum_{j=1}^{m} \xi_j$ s.t. $\forall j \;\; \forall y \in \mathcal{Y} \;\; w^\top f(x_j, y^*_j) \geq w^\top f(x_j, y) + \ell(y, y^*_j) - \xi_j$, $\xi_j \geq 0$

‣ (Unregularized) gradients:

‣ SVM: $f(x, y^*) - f(x, y_{\max})$   (loss-augmented max)

‣ Log reg: $f(x, y^*) - \mathbb{E}_y[f(x, y)] = f(x, y^*) - \sum_y P_w(y|x)\, f(x, y)$

‣ SVM: a max over y's to compute the gradient. LR: need to sum over y's

Optimization

Recap

‣ Four elements of a machine learning method:

‣ Model: probabilistic, max-margin, deep neural network

‣ Inference: just maxes and simple expectations so far, but will get harder

‣ Training: gradient descent?

‣ Objective:

Optimization

‣ Stochastic gradient *ascent*

‣ Very simple to code up

$$w \leftarrow w + \alpha g, \qquad g = \frac{\partial}{\partial w}\mathcal{L}$$

[figure from CS231n: loss contours over two weights W_1, W_2]
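A minimal stochastic gradient ascent loop, assuming some per-example gradient function such as the logistic-regression gradient sketched earlier; the fixed step size and epoch count are illustrative.

```python
import numpy as np

def sgd_ascent(w, examples, grad_fn, alpha=0.1, epochs=5):
    # examples: list of (x, gold) pairs; grad_fn(w, x, gold) returns dL/dw for one example.
    for _ in range(epochs):
        np.random.shuffle(examples)               # visit examples in a random order each epoch
        for x, gold in examples:
            w = w + alpha * grad_fn(w, x, gold)   # w <- w + alpha * g
    return w
```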

Optimization

‣ Stochastic gradient *ascent*: $w \leftarrow w + \alpha g$, $g = \frac{\partial}{\partial w}\mathcal{L}$; very simple to code up

Problems with SGD (CS231n):

‣ What if the loss changes quickly in one direction and slowly in another? What does gradient descent do? Very slow progress along the shallow dimension, jitter along the steep direction

‣ Such a loss function has a high condition number: the ratio of the largest to smallest singular value of the Hessian matrix is large

Optimization

Problems with SGD (CS231n):

‣ What if the loss function has local minima or saddle points?

‣ Saddle points are much more common in high dimensions

Dauphin et al., "Identifying and attacking the saddle point problem in high-dimensional non-convex optimization", NIPS 2014

Optimization

‣ Stochastic gradient *ascent* ($w \leftarrow w + \alpha g$, $g = \frac{\partial}{\partial w}\mathcal{L}$) is a "first-order" technique: it only relies on having the gradient

First-Order Optimization (CS231n) [figure: loss as a function of w1]: (1) use the gradient to form a linear approximation; (2) step to minimize the approximation

Second-Order Optimization (CS231n) [figure: loss as a function of w1]: (1) use the gradient and the Hessian to form a quadratic approximation; (2) step to the minimum of the approximation

Optimization

‣ Stochastic gradient *ascent*: $w \leftarrow w + \alpha g$, $g = \frac{\partial}{\partial w}\mathcal{L}$. Setting the step size is hard (decrease it when held-out performance worsens?)

‣ "First-order" technique: only relies on having the gradient

‣ Newton's method, a second-order technique:

$$w \leftarrow w + \left( \frac{\partial^2}{\partial w^2}\mathcal{L} \right)^{-1} g$$

‣ Optimizes a quadratic instantly, but the inverse Hessian is an n x n matrix: expensive!

‣ Quasi-Newton methods (L-BFGS, etc.) approximate the inverse Hessian

AdaGrad

Duchi et al. (2011), "Adaptive subgradient methods for online learning and stochastic optimization", JMLR 2011

‣ Optimized for problems with sparse features

‣ Per-parameter learning rate: smaller updates are made to parameters that get updated frequently

‣ Element-wise scaling of the gradient based on the historical sum of squares in each dimension ("per-parameter learning rates" or "adaptive learning rates")

AdaGrad

Duchi et al. (2011)

$$w_i \leftarrow w_i + \alpha \, \frac{1}{\sqrt{\epsilon + \sum_{\tau=1}^{t} g_{\tau,i}^2}} \, g_{t,i}$$

(the denominator uses the (smoothed) sum of squared gradients from all updates)

‣ Generally more robust than SGD, requires less tuning of the learning rate

‣ Other techniques for optimizing deep models: more later!
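A sketch of the AdaGrad update above in ascent form; the step size, smoothing constant, and toy gradient sequence are illustrative.

```python
import numpy as np

def adagrad_step(w, g, sum_sq, alpha=0.1, eps=1e-8):
    # One AdaGrad ascent update; sum_sq accumulates squared gradients per parameter.
    sum_sq = sum_sq + g ** 2                       # historical sum of squared gradients
    w = w + alpha * g / np.sqrt(eps + sum_sq)      # per-parameter (adaptive) learning rate
    return w, sum_sq

# Usage: coordinates that get updated frequently receive smaller effective steps over time
w, sum_sq = np.zeros(3), np.zeros(3)
for g in [np.array([1.0, 0.0, 0.1]), np.array([1.0, 0.0, 0.0]), np.array([1.0, 0.5, 0.0])]:
    w, sum_sq = adagrad_step(w, g, sum_sq)
print(np.round(w, 3))
```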

Summary

‣ Design tradeoffs need to reflect interactions:

‣ Model and objective are coupled: probabilistic model <-> maximize likelihood

‣ … but not always: a linear model or neural network can be trained to minimize any differentiable loss function

‣ Inference governs learning: you need to be able to compute expectations to use logistic regression

top related