TRANSCRIPT
Introduction to Machine Learning Summer School
June 18, 2018 - June 29, 2018, Chicago
Instructor: Suriya Gunasekar, TTI Chicago
21 June 2018
Day 4: Classification, support vector machines
Topics so far
• Supervised learning, linear regression
• Linear regression
  o Overfitting, bias-variance trade-off
  o Ridge and lasso regression, gradient descent
• Yesterday
  o Classification, logistic regression
  o Regularization for logistic regression
  o Multi-class classification
• Today
  o Maximum margin classifiers
  o Kernel trick
Classification
• Supervised learning: estimate a mapping $f$ from input $\mathbf{x} \in \mathcal{X}$ to output $y \in \mathcal{Y}$
  o Regression: $\mathcal{Y} = \mathbb{R}$ or other continuous variables
  o Classification: $\mathcal{Y}$ takes a discrete set of values
    § Examples: $\mathcal{Y} = \{\text{spam}, \text{no spam}\}$; digit recognition, $\mathcal{Y} = \{0, 1, 2, \ldots, 9\}$ (categories, not numeric values)
• Many successful applications of ML in vision, speech, NLP, healthcare
Parametric classifiers
• $\mathcal{H} = \{\mathbf{x} \mapsto \mathbf{w}\cdot\mathbf{x} + w_0 : \mathbf{w} \in \mathbb{R}^d, w_0 \in \mathbb{R}\}$
• $\hat{y}(\mathbf{x}) = \mathrm{sign}(\hat{\mathbf{w}}\cdot\mathbf{x} + \hat{w}_0)$
• $\hat{\mathbf{w}}\cdot\mathbf{x} + \hat{w}_0 = 0$ is the (linear) decision boundary or separating hyperplane; it separates $\mathbb{R}^d$ into two halfspaces (regions): $\hat{\mathbf{w}}\cdot\mathbf{x} + \hat{w}_0 > 0$ gets label $1$ and $\hat{\mathbf{w}}\cdot\mathbf{x} + \hat{w}_0 < 0$ gets label $-1$
• More generally, $\hat{y}(\mathbf{x}) = \mathrm{sign}(\hat{f}(\mathbf{x}))$ → decision boundary is $\hat{f}(\mathbf{x}) = 0$
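A minimal numpy sketch of such a parametric linear classifier; the weight values below are arbitrary, chosen only for illustration:

```python
import numpy as np

# Hypothetical weights and offset for a linear classifier in R^2.
w = np.array([2.0, -1.0])
w0 = 0.5

def predict(X, w, w0):
    """Return sign(w.x + w0) in {-1, +1} for each row of X."""
    scores = X @ w + w0
    return np.where(scores >= 0, 1, -1)

X = np.array([[1.0, 1.0], [-1.0, 3.0], [0.0, 0.0]])
print(predict(X, w, w0))   # [ 1 -1  1]
```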
Surrogate Losses
• The correct loss to use is the 0-1 loss after thresholding:
  $\ell_{01}(f(x), y) = \mathbf{1}[\mathrm{sign}(f(x)) \neq y] = \mathbf{1}[f(x)\,y < 0]$
• Linear regression uses $\ell_{\mathrm{sq}}(f(x), y) = (f(x) - y)^2$
• Hard to optimize over $\ell_{01}$, so find another loss $\ell(f(x), y)$ that is
  o Convex (for any fixed $y$) → easier to minimize
  o An upper bound of $\ell_{01}$ → small $\ell$ $\Rightarrow$ small $\ell_{01}$
• Satisfied by the squared loss → but it has "large" loss even when $\ell_{01}(f(x), y) = 0$
• Two more surrogate losses in this course
  o Logistic loss: $\ell_{\log}(f(x), y) = \log(1 + \exp(-f(x)\,y))$
  o Hinge loss: $\ell_{\mathrm{hinge}}(f(x), y) = \max(0, 1 - f(x)\,y)$
[Figure: the losses $\ell(f(x), y)$ plotted as functions of $f(x)\,y$]
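A small numeric sketch comparing these losses as functions of the margin value $f(x)\,y$ (the margin values are chosen arbitrarily for illustration):

```python
import numpy as np

def zero_one_loss(m):   # m = f(x) * y
    return (m < 0).astype(float)

def squared_loss(m):    # (f(x) - y)^2 = (1 - m)^2 when y is in {-1, +1}
    return (1 - m) ** 2

def logistic_loss(m):
    return np.log(1 + np.exp(-m))

def hinge_loss(m):
    return np.maximum(0.0, 1 - m)

margins = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for name, fn in [("0-1", zero_one_loss), ("squared", squared_loss),
                 ("logistic", logistic_loss), ("hinge", hinge_loss)]:
    print(name, fn(margins))
```

Note how the squared loss keeps growing even for large positive margins, while the logistic and hinge losses upper-bound the 0-1 loss and decay (or vanish) once the example is correctly classified with margin.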
Logistic regression: ERM on surrogate loss
• $S = \{(\mathbf{x}^{(i)}, y^{(i)}) : i = 1, 2, \ldots, N\}$, $\mathcal{X} = \mathbb{R}^d$, $\mathcal{Y} = \{-1, 1\}$
• Linear model $f(\mathbf{x}) = f_{\mathbf{w}}(\mathbf{x}) = \mathbf{w}\cdot\mathbf{x} + w_0$
• Logistic loss $\ell(f(x), y) = \log(1 + \exp(-f(x)\,y))$
• Minimize training loss
  $$\hat{\mathbf{w}}, \hat{w}_0 = \arg\min_{\mathbf{w}, w_0} \sum_i \log\left(1 + \exp\left(-\left(\mathbf{w}\cdot\mathbf{x}^{(i)} + w_0\right)y^{(i)}\right)\right)$$
• Output classifier $\hat{y}(\mathbf{x}) = \mathrm{sign}(\mathbf{w}\cdot\mathbf{x} + w_0)$
Logistic Regression
$$\hat{\mathbf{w}}, \hat{w}_0 = \arg\min_{\mathbf{w}, w_0} \sum_i \log\left(1 + \exp\left(-\left(\mathbf{w}\cdot\mathbf{x}^{(i)} + w_0\right)y^{(i)}\right)\right)$$
• Convex optimization problem
• Can solve using gradient descent
• Can also add the usual regularization: $\ell_2$, $\ell_1$
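A minimal sketch of this optimization with plain gradient descent in numpy (the toy data and step size are made up, purely for illustration):

```python
import numpy as np

def logistic_regression_gd(X, y, lr=0.1, n_iters=1000):
    """Minimize (1/N) sum_i log(1 + exp(-(w.x_i + w0) y_i)) by gradient descent.
    X: (N, d) inputs, y: (N,) labels in {-1, +1}."""
    N, d = X.shape
    w, w0 = np.zeros(d), 0.0
    for _ in range(n_iters):
        scores = X @ w + w0                      # f(x_i)
        # d/ds log(1 + exp(-s*y)) = -y / (1 + exp(y*s))
        g = -y / (1 + np.exp(y * scores))        # per-example gradient factor
        w -= lr * (X.T @ g) / N
        w0 -= lr * g.mean()
        # (an l2 penalty could be added by including + lam * w in the w-gradient)
    return w, w0

# Toy linearly separable data.
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
w, w0 = logistic_regression_gd(X, y)
print(np.sign(X @ w + w0))   # should recover y
```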
Linear decision boundaries
$\hat{y}(\mathbf{x}) = \mathrm{sign}(\mathbf{w}\cdot\mathbf{x} + w_0)$
• $\{\mathbf{x} : \mathbf{w}\cdot\mathbf{x} + w_0 = 0\}$ is a hyperplane in $\mathbb{R}^d$
  o decision boundary
  o $\mathbf{w}$ is the direction of the normal
  o $w_0$ is the offset
• $\{\mathbf{x} : \mathbf{w}\cdot\mathbf{x} + w_0 = 0\}$ divides $\mathbb{R}^d$ into two halfspaces (regions)
  o $\{\mathbf{x} : \mathbf{w}\cdot\mathbf{x} + w_0 \geq 0\}$ gets label $+1$ and $\{\mathbf{x} : \mathbf{w}\cdot\mathbf{x} + w_0 < 0\}$ gets label $-1$
• Maps $\mathbf{x}$ to a 1D coordinate: $x' = \dfrac{\mathbf{w}\cdot\mathbf{x} + w_0}{\|\mathbf{w}\|}$
[Figure: a hyperplane in the $(x_1, x_2)$ plane with normal direction $\mathbf{w}$, and a point $\mathbf{x}$ with its signed distance $x'$]
Linear separators in 2D
[Figures: linear separators for a 2D dataset]
(Slide credit: Nati Srebro)
Margin of a classifier
• Margin: distance of the closest instance point to the separating hyperplane
• Large margins are more stable
  o small perturbations of the data do not change the prediction
[Figure: distances of the closest points $\mathbf{x}^{(1)}$, $\mathbf{x}^{(2)}$ to a separating hyperplane in 2D]
Maximum margin classifier
• $S = \{(\mathbf{x}^{(i)}, y^{(i)}) : i = 1, 2, \ldots, N\}$, binary classes $\mathcal{Y} = \{-1, 1\}$
• Assume data is "linearly separable"
  o $\exists\, \mathbf{w}, w_0$ such that for all $i = 1, 2, \ldots, N$, $y^{(i)} = \mathrm{sign}(\mathbf{w}\cdot\mathbf{x}^{(i)} + w_0) \;\Rightarrow\; y^{(i)}(\mathbf{w}\cdot\mathbf{x}^{(i)} + w_0) > 0$
• Maximum margin separator given by
  $$\hat{\mathbf{w}}, \hat{w}_0 = \arg\max_{\mathbf{w}\in\mathbb{R}^d,\, w_0} \; \min_i \frac{y^{(i)}(\mathbf{w}\cdot\mathbf{x}^{(i)} + w_0)}{\|\mathbf{w}\|}$$
  o $\dfrac{y^{(i)}(\mathbf{w}\cdot\mathbf{x}^{(i)} + w_0)}{\|\mathbf{w}\|}$ is the margin of sample $i$; the $\min_i$ picks out the smallest margin over the sample
Maximum margin classifier
$$\hat{\mathbf{w}}, \hat{w}_0 = \arg\max_{\mathbf{w}\in\mathbb{R}^d,\, w_0\in\mathbb{R}} \; \min_i \frac{y^{(i)}(\mathbf{w}\cdot\mathbf{x}^{(i)} + w_0)}{\|\mathbf{w}\|}$$
• Claim 1: If $(\hat{\mathbf{w}}, \hat{w}_0)$ is a solution, then for any $\gamma > 0$, $(\gamma\hat{\mathbf{w}}, \gamma\hat{w}_0)$ is also a solution
• Option 1: we can fix $\|\mathbf{w}\| = 1$ to get
  $$\hat{\mathbf{w}}, \hat{w}_0 = \arg\max_{\|\mathbf{w}\| = 1,\, w_0} \; \min_i\, y^{(i)}(\mathbf{w}\cdot\mathbf{x}^{(i)} + w_0)$$
• Option 2: we can instead fix $\min_i y^{(i)}(\mathbf{w}\cdot\mathbf{x}^{(i)} + w_0) = 1$
  o the margin is now $\dfrac{1}{\|\mathbf{w}\|}$
  o instead of "increasing the margin" we can "reduce the norm"
Max-margin classifier: equivalent formulation
• Solve:
  $$\tilde{\mathbf{w}}, \tilde{w}_0 = \arg\min_{\mathbf{w}, w_0} \|\mathbf{w}\|^2 \quad \text{s.t. } \forall i,\; y^{(i)}(\mathbf{w}\cdot\mathbf{x}^{(i)} + w_0) \geq 1$$
• Claim 2: equivalent to the previous slide →
  $\left(\dfrac{\tilde{\mathbf{w}}}{\|\tilde{\mathbf{w}}\|}, \dfrac{\tilde{w}_0}{\|\tilde{\mathbf{w}}\|}\right)$ is a solution of $\;\hat{\mathbf{w}}, \hat{w}_0 = \arg\max_{\|\mathbf{w}\|=1,\, w_0} \min_i y^{(i)}(\mathbf{w}\cdot\mathbf{x}^{(i)} + w_0)$
• Proof:
  1. Let $\min_i y^{(i)}(\hat{\mathbf{w}}\cdot\mathbf{x}^{(i)} + \hat{w}_0) = \hat{\gamma}$; then $\min_i y^{(i)}\!\left(\dfrac{\hat{\mathbf{w}}}{\hat{\gamma}}\cdot\mathbf{x}^{(i)} + \dfrac{\hat{w}_0}{\hat{\gamma}}\right) \geq 1$, so $(\hat{\mathbf{w}}/\hat{\gamma}, \hat{w}_0/\hat{\gamma})$ is feasible for the minimization above
  2. $\Rightarrow \|\tilde{\mathbf{w}}\| \leq \left\|\dfrac{\hat{\mathbf{w}}}{\hat{\gamma}}\right\| = \dfrac{1}{\hat{\gamma}}$ (using $\|\hat{\mathbf{w}}\| = 1$)
  3. $\min_i y^{(i)}\!\left(\dfrac{\tilde{\mathbf{w}}}{\|\tilde{\mathbf{w}}\|}\cdot\mathbf{x}^{(i)} + \dfrac{\tilde{w}_0}{\|\tilde{\mathbf{w}}\|}\right) = \min_i \dfrac{y^{(i)}(\tilde{\mathbf{w}}\cdot\mathbf{x}^{(i)} + \tilde{w}_0)}{\|\tilde{\mathbf{w}}\|} \geq \dfrac{1}{\|\tilde{\mathbf{w}}\|} \geq \hat{\gamma}$,
     so the normalized $(\tilde{\mathbf{w}}, \tilde{w}_0)$ achieves at least the optimal margin $\hat{\gamma}$ and is therefore also a maximizer
• This formulation is called the hard margin Support Vector Machine (SVM)
Maximum margin classifier formulations
• Original formulation
  $$\hat{\mathbf{w}}, \hat{w}_0 = \arg\max_{\mathbf{w}\in\mathbb{R}^d,\, w_0\in\mathbb{R}} \; \min_i \frac{y^{(i)}(\mathbf{w}\cdot\mathbf{x}^{(i)} + w_0)}{\|\mathbf{w}\|}$$
• Fixing $\|\mathbf{w}\| = 1$:
  $$\hat{\mathbf{w}}, \hat{w}_0 = \arg\max_{\mathbf{w}, w_0} \; \min_i\, y^{(i)}(\mathbf{w}\cdot\mathbf{x}^{(i)} + w_0) \quad \text{s.t. } \|\mathbf{w}\| = 1$$
• Fixing $\min_i y^{(i)}(\mathbf{w}\cdot\mathbf{x}^{(i)} + w_0) = 1$:
  $$\tilde{\mathbf{w}}, \tilde{w}_0 = \arg\min_{\mathbf{w}, w_0} \|\mathbf{w}\|^2 \quad \text{s.t. } \forall i,\; y^{(i)}(\mathbf{w}\cdot\mathbf{x}^{(i)} + w_0) \geq 1$$
• Henceforth, $w_0$ will be absorbed into $\mathbf{w}$ by adding an additional feature of '1' to $\mathbf{x}$
Margin and norm
• $\mathrm{margin}(\mathbf{w}) = \min_i \dfrac{y^{(i)}\, \mathbf{w}\cdot\mathbf{x}^{(i)}}{\|\mathbf{w}\|}$
• Remember, in regression: small norm solutions have low complexity!
  o Is this true for maximum margin classifiers?
  o What about classification with the logistic loss $\sum_i \log(1 + \exp(-y^{(i)}\, \mathbf{w}\cdot\mathbf{x}^{(i)}))$?
  o How to do capacity control in maximum margin classifier learning?
• In some places $\min_i y^{(i)}\, \mathbf{w}\cdot\mathbf{x}^{(i)}$ is referred to as the margin → this implicitly assumes normalization
  o $\min_i y^{(i)}\, \mathbf{w}\cdot\mathbf{x}^{(i)}$ is meaningless without knowing what $\|\mathbf{w}\|$ is!
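A small sketch of the last point: rescaling $\mathbf{w}$ inflates $\min_i y^{(i)}\,\mathbf{w}\cdot\mathbf{x}^{(i)}$ arbitrarily, while the normalized margin is unchanged (toy numbers, chosen only for illustration):

```python
import numpy as np

X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0]])
y = np.array([1, 1, -1])
w = np.array([1.0, 0.5])

for scale in [1.0, 10.0, 1000.0]:
    ws = scale * w
    unnormalized = np.min(y * (X @ ws))                  # grows with the scale
    margin = np.min(y * (X @ ws)) / np.linalg.norm(ws)   # scale-invariant
    print(f"scale={scale:7.1f}  min_i y_i w.x_i = {unnormalized:9.1f}  margin = {margin:.3f}")
```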
Solutions of hard margin SVM
$$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \|\mathbf{w}\|^2 \quad \text{s.t. } y^{(i)}\, \mathbf{w}\cdot\mathbf{x}^{(i)} \geq 1 \;\forall i$$
• Theorem: $\hat{\mathbf{w}} \in \mathrm{span}\{\mathbf{x}^{(i)} : i = 1, 2, \ldots, N\}$, i.e., $\exists\{\hat{\beta}_i : i = 1, 2, \ldots, N\}$ such that $\hat{\mathbf{w}} = \sum_i \hat{\beta}_i \mathbf{x}^{(i)}$
• Denote $\mathcal{S} = \mathrm{span}\{\mathbf{x}^{(i)} : i = 1, 2, \ldots, N\}$ and $\mathcal{S}^{\perp} = \{\mathbf{z} \in \mathbb{R}^d : \forall i,\; \mathbf{z}\cdot\mathbf{x}^{(i)} = 0\}$
  o For any $\mathbf{z} \in \mathbb{R}^d$, $\mathbf{z} = \mathbf{z}_{\mathcal{S}} + \mathbf{z}_{\mathcal{S}^{\perp}}$ with $\mathbf{z}_{\mathcal{S}} \in \mathcal{S}$ and $\mathbf{z}_{\mathcal{S}^{\perp}} \in \mathcal{S}^{\perp}$
  o $\|\mathbf{z}\|^2 = \|\mathbf{z}_{\mathcal{S}}\|^2 + \|\mathbf{z}_{\mathcal{S}^{\perp}}\|^2$
• Three step proof:
  1. Decompose $\hat{\mathbf{w}} = \hat{\mathbf{w}}_{\mathcal{S}} + \hat{\mathbf{w}}_{\mathcal{S}^{\perp}}$.
  2. $\min_i y^{(i)}\, \hat{\mathbf{w}}\cdot\mathbf{x}^{(i)} \geq 1 \Rightarrow \min_i y^{(i)}\, \hat{\mathbf{w}}_{\mathcal{S}}\cdot\mathbf{x}^{(i)} \geq 1$ (because $\hat{\mathbf{w}}_{\mathcal{S}^{\perp}}\cdot\mathbf{x}^{(i)} = 0 \;\forall i$)
  3. If $\hat{\mathbf{w}}_{\mathcal{S}^{\perp}} \neq 0$, then $\|\hat{\mathbf{w}}_{\mathcal{S}}\| < \|\hat{\mathbf{w}}\|$, contradicting the optimality of $\hat{\mathbf{w}}$; hence $\hat{\mathbf{w}} = \hat{\mathbf{w}}_{\mathcal{S}} \in \mathcal{S}$.
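A numerical sketch of the decomposition argument on toy data (the projection onto $\mathcal{S}$ is computed with least squares): dropping the component of $\mathbf{w}$ orthogonal to $\mathcal{S}$ leaves every constraint value unchanged but strictly reduces the norm.

```python
import numpy as np

# Toy data in R^3 whose span S is only 2-dimensional.
X = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
y = np.array([1, -1])

# A feasible (but not optimal) w with a component outside span{x^(i)}.
w = np.array([2.0, -3.0, 5.0])
print(y * (X @ w))                              # constraint values, all >= 1

# Project w onto S = span of the rows of X (least-squares coefficients).
coef, *_ = np.linalg.lstsq(X.T, w, rcond=None)
w_S = X.T @ coef                                # component in S
w_perp = w - w_S                                # component in S^perp

print(y * (X @ w_S))                            # identical constraint values
print(np.linalg.norm(w_S), np.linalg.norm(w))   # strictly smaller norm
```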
Representer Theorem
$$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \|\mathbf{w}\|^2 \quad \text{s.t. } y^{(i)}\, \mathbf{w}\cdot\mathbf{x}^{(i)} \geq 1 \;\forall i$$
• Theorem: $\hat{\mathbf{w}} \in \mathrm{span}\{\mathbf{x}^{(i)} : i = 1, 2, \ldots, N\}$, i.e., $\exists\{\hat{\beta}_i : i = 1, 2, \ldots, N\}$ such that $\hat{\mathbf{w}} = \sum_i \hat{\beta}_i \mathbf{x}^{(i)}$
  o Special case of the representer theorem
• Theorem (ext): additionally, $\{\hat{\beta}_i\}$ also satisfies $\hat{\beta}_i = 0$ for all $i$ such that $y^{(i)}\, \hat{\mathbf{w}}\cdot\mathbf{x}^{(i)} > 1$
• Proof?: (animation next slide)
[Illustration of $\hat{\mathbf{w}}$ and the constraints $y^{(i)}\hat{\mathbf{w}}\cdot\mathbf{x}^{(i)} \geq 1$]
(Illustration credit: Nati Srebro)
Representer Theorem
$$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \|\mathbf{w}\|^2 \quad \text{s.t. } y^{(i)}\, \mathbf{w}\cdot\mathbf{x}^{(i)} \geq 1 \;\forall i$$
• Theorem: $\exists\{\hat{\beta}_i : i = 1, 2, \ldots, N\}$ such that $\hat{\mathbf{w}} = \sum_i \hat{\beta}_i \mathbf{x}^{(i)}$; $\{\hat{\beta}_i\}$ also satisfies $\hat{\beta}_i = 0$ for all $i$ such that $y^{(i)}\, \hat{\mathbf{w}}\cdot\mathbf{x}^{(i)} > 1$
• $SV(\hat{\mathbf{w}}) = \{i : y^{(i)}\, \hat{\mathbf{w}}\cdot\mathbf{x}^{(i)} = 1\}$, the data points closest to the decision boundary of $\hat{\mathbf{w}}$
  o called support vectors
  o hence "support vector machine"
  $$\hat{\mathbf{w}} = \sum_{i \in SV(\hat{\mathbf{w}})} \hat{\beta}_i \mathbf{x}^{(i)}$$
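For intuition, these quantities can be inspected with an off-the-shelf solver. A sketch using scikit-learn's SVC with a linear kernel and a large C to approximate the hard-margin problem (assuming scikit-learn is available; the lecture's own formulation absorbs $w_0$ and works with $\beta$, so this is only an illustration of support vectors):

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data.
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 4.0],
              [-2.0, -2.0], [-3.0, -1.0], [-2.5, -3.0]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ hard margin

print(clf.support_)           # indices of the support vectors
print(clf.support_vectors_)   # the points lying (approximately) on the margin
# w is a combination of support vectors only:
w_rebuilt = clf.dual_coef_ @ clf.support_vectors_
print(w_rebuilt, clf.coef_)   # the two should match
```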
How do we get $\hat{\mathbf{w}}$?
Optimizing the SVM problem
$$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \|\mathbf{w}\|^2 \quad \text{s.t. } y^{(i)}\, \mathbf{w}\cdot\mathbf{x}^{(i)} \geq 1 \;\forall i$$
1. Can do sub-gradient descent (next class)
2. Special case of a quadratic program
   $$\min_{\mathbf{z}} \; \tfrac{1}{2}\mathbf{z}^{\top}\mathbf{P}\mathbf{z} + \mathbf{q}^{\top}\mathbf{z} \quad \text{s.t. } \mathbf{G}\mathbf{z} \leq \mathbf{h},\; \mathbf{A}\mathbf{z} = \mathbf{b}$$
   o Change of variables $\hat{\mathbf{w}} = \sum_{i \in SV(\hat{\mathbf{w}})} \hat{\beta}_i \mathbf{x}^{(i)}$?
   o Change of variables $\hat{\mathbf{w}} = \sum_{i=1}^{N} \hat{\beta}_i \mathbf{x}^{(i)}$!
Optimizing the SVM problem
$$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \|\mathbf{w}\|^2 \quad \text{s.t. } y^{(i)}\, \mathbf{w}\cdot\mathbf{x}^{(i)} \geq 1 \;\forall i$$
• Change of variables $\mathbf{w} = \sum_{i=1}^{N} \beta_i \mathbf{x}^{(i)}$:
  $$\equiv \min_{\{\beta_i\}} \; \sum_{i=1}^{N}\sum_{j=1}^{N} \beta_i \beta_j\, \mathbf{x}^{(i)}\cdot\mathbf{x}^{(j)} \quad \text{s.t. } \sum_{j=1}^{N} \beta_j\, y^{(i)}\, \mathbf{x}^{(i)}\cdot\mathbf{x}^{(j)} \geq 1 \;\forall i$$
  $$= \min_{\boldsymbol{\beta}\in\mathbb{R}^N} \; \boldsymbol{\beta}^{\top}\mathbf{G}\boldsymbol{\beta} \quad \text{s.t. } y^{(i)}(\mathbf{G}\boldsymbol{\beta})_i \geq 1 \;\forall i$$
• $\mathbf{G} \in \mathbb{R}^{N\times N}$ with $G_{ij} = \mathbf{x}^{(i)}\cdot\mathbf{x}^{(j)}$ is called the Gram matrix
• Convex program: quadratic programming
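A sketch of this quadratic program written directly in the $\boldsymbol{\beta}$ variables with cvxpy (assuming cvxpy and a default QP solver are installed; toy data, with $w_0$ absorbed by appending a constant feature as on the earlier slide):

```python
import numpy as np
import cvxpy as cp

# Toy separable data; append a constant '1' feature to absorb w0.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])
X = np.hstack([X, np.ones((X.shape[0], 1))])
y = np.array([1.0, 1.0, -1.0, -1.0])
N = X.shape[0]

G = X @ X.T                                   # Gram matrix, G_ij = x^(i).x^(j)
G = 0.5 * (G + G.T) + 1e-8 * np.eye(N)        # symmetrize + tiny ridge so quad_form accepts it

beta = cp.Variable(N)
objective = cp.Minimize(cp.quad_form(beta, G))          # beta' G beta = ||w||^2
constraints = [cp.multiply(y, G @ beta) >= 1]           # y_i (G beta)_i >= 1
cp.Problem(objective, constraints).solve()

w = X.T @ beta.value                          # recover w = sum_i beta_i x^(i)
print(w, np.sign(X @ w))                      # predictions should match y
```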
The Kernel
$$\min_{\mathbf{w}} \|\mathbf{w}\|^2 \;\text{ s.t. } y^{(i)}\, \mathbf{w}\cdot\mathbf{x}^{(i)} \geq 1 \;\forall i \quad\equiv\quad \min_{\boldsymbol{\beta}\in\mathbb{R}^N} \boldsymbol{\beta}^{\top}\mathbf{G}\boldsymbol{\beta} \;\text{ s.t. } y^{(i)}(\mathbf{G}\boldsymbol{\beta})_i \geq 1 \;\forall i$$
• The optimization problem depends on $\mathbf{x}^{(i)}$ only through the values of $G_{ij} = \mathbf{x}^{(i)}\cdot\mathbf{x}^{(j)}$ for $i, j \in [N]$.
• What about prediction?
  $$\hat{\mathbf{w}}\cdot\mathbf{x} = \sum_i \beta_i\, \mathbf{x}^{(i)}\cdot\mathbf{x}$$
• The function $K(\mathbf{x}, \mathbf{x}') = \mathbf{x}\cdot\mathbf{x}'$ is called the kernel
• Learning non-linear classifiers using feature transformations, i.e., $f_{\mathbf{w}}(\mathbf{x}) = \mathbf{w}\cdot\phi(\mathbf{x})$ for some $\phi(\mathbf{x})$
  o the only thing we need to know is $K_{\phi}(\mathbf{x}, \mathbf{x}') = K(\phi(\mathbf{x}), \phi(\mathbf{x}')) = \phi(\mathbf{x})\cdot\phi(\mathbf{x}')$
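The same sketch carries over with a non-linear kernel: build the matrix from kernel evaluations instead of raw inner products, and predict via $\sum_i \beta_i K(\mathbf{x}^{(i)}, \mathbf{x})$ without ever forming $\phi(\mathbf{x})$ explicitly (again assuming cvxpy; the degree-2 polynomial kernel and XOR-style data below are illustrative choices):

```python
import numpy as np
import cvxpy as cp

def poly_kernel(A, B, degree=2):
    """K(x, x') = (1 + x.x')^degree, computed for all pairs of rows of A and B."""
    return (1.0 + A @ B.T) ** degree

# XOR-like data: not linearly separable, but separable with a degree-2 kernel.
X = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
N = len(y)

K = poly_kernel(X, X)
K = 0.5 * (K + K.T) + 1e-8 * np.eye(N)    # keep quad_form happy

beta = cp.Variable(N)
prob = cp.Problem(cp.Minimize(cp.quad_form(beta, K)),
                  [cp.multiply(y, K @ beta) >= 1])
prob.solve()

def predict(X_new):
    # f(x) = sum_i beta_i K(x^(i), x), thresholded
    return np.sign(poly_kernel(X_new, X) @ beta.value)

print(predict(X))   # recovers y even though the data are not linearly separable
```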
Kernels As Prior Knowledge
• If we think that positive examples can (almost) be separated by some ellipse, then we should use polynomials of degree 2
• A kernel encodes a measure of similarity between objects. A bit like NN, except that it must be a valid inner product function.
(Slide credit: Nati Srebro)
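A small sketch of that correspondence: an explicit degree-2 feature map has exactly the inner products of the kernel $K(\mathbf{x}, \mathbf{x}') = (1 + \mathbf{x}\cdot\mathbf{x}')^2$, so a linear separator over $\phi(\mathbf{x})$ (an ellipse in the original space) can be learned from kernel evaluations alone.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for x in R^2 (matches the (1 + x.x')^2 kernel)."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

def K(x, z):
    return (1.0 + x @ z) ** 2

x = np.array([0.5, -1.0])
z = np.array([2.0, 3.0])
print(phi(x) @ phi(z), K(x, z))   # identical: the kernel computes phi(x).phi(z) implicitly
```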