TRANSCRIPT
Introduction to Machine Learning Summer School
June 18, 2018 - June 29, 2018, Chicago
Instructor: Suriya Gunasekar, TTI Chicago
21 June 2018
Day 4: Classification, support vector machines
Topics so far
• Supervised learning, linear regression
• Linear regression
  o Overfitting, bias-variance trade-off
  o Ridge and lasso regression, gradient descent
• Yesterday
  o Classification, logistic regression
  o Regularization for logistic regression
  o Multi-class classification
• Today
  o Maximum margin classifiers
  o Kernel trick
Classification
• Supervised learning: estimate a mapping $f$ from input $\mathbf{x} \in \mathcal{X}$ to output $y \in \mathcal{Y}$
  o Regression: $\mathcal{Y} = \mathbb{R}$ or other continuous variables
  o Classification: $\mathcal{Y}$ takes a discrete set of values
    § Examples: $\mathcal{Y} = \{\text{spam}, \text{no spam}\}$; digit recognition, $\mathcal{Y} = \{0, 1, 2, \ldots, 9\}$ (categories, not numeric values)
• Many successful applications of ML in vision, speech, NLP, healthcare
Parametric classifiers
• $\mathcal{H} = \{\mathbf{x} \mapsto \mathbf{w}\cdot\mathbf{x} + w_0 : \mathbf{w} \in \mathbb{R}^d, w_0 \in \mathbb{R}\}$
• $\hat{y}(\mathbf{x}) = \mathrm{sign}(\hat{\mathbf{w}}\cdot\mathbf{x} + \hat{w}_0)$
• $\hat{\mathbf{w}}\cdot\mathbf{x} + \hat{w}_0 = 0$ is the (linear) decision boundary or separating hyperplane; it separates $\mathbb{R}^d$ into two halfspaces (regions): $\hat{\mathbf{w}}\cdot\mathbf{x} + \hat{w}_0 > 0$ gets label $1$ and $\hat{\mathbf{w}}\cdot\mathbf{x} + \hat{w}_0 < 0$ gets label $-1$
• More generally, $\hat{y}(\mathbf{x}) = \mathrm{sign}(\hat{f}(\mathbf{x}))$ → decision boundary is $\hat{f}(\mathbf{x}) = 0$
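A minimal numpy sketch of such a parametric linear classifier; the weight values below are arbitrary, chosen only for illustration:

```python
import numpy as np

# Hypothetical weights and offset for a linear classifier in R^2.
w = np.array([2.0, -1.0])
w0 = 0.5

def predict(X, w, w0):
    """Return sign(w.x + w0) in {-1, +1} for each row of X."""
    scores = X @ w + w0
    return np.where(scores >= 0, 1, -1)

X = np.array([[1.0, 1.0], [-1.0, 3.0], [0.0, 0.0]])
print(predict(X, w, w0))   # [ 1 -1  1]
```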
Surrogate Losses
• The correct loss to use is the 0-1 loss after thresholding:
  $\ell_{01}(f(x), y) = \mathbf{1}[\mathrm{sign}(f(x)) \neq y] = \mathbf{1}[f(x)\,y < 0]$
• Linear regression uses $\ell_{\mathrm{sq}}(f(x), y) = (f(x) - y)^2$
• Hard to optimize over $\ell_{01}$, so find another loss $\ell(f(x), y)$ that is
  o Convex (for any fixed $y$) → easier to minimize
  o An upper bound of $\ell_{01}$ → small $\ell$ $\Rightarrow$ small $\ell_{01}$
• Satisfied by the squared loss → but it has "large" loss even when $\ell_{01}(f(x), y) = 0$
• Two more surrogate losses in this course
  o Logistic loss: $\ell_{\log}(f(x), y) = \log(1 + \exp(-f(x)\,y))$
  o Hinge loss: $\ell_{\mathrm{hinge}}(f(x), y) = \max(0, 1 - f(x)\,y)$
[Figure: the losses $\ell(f(x), y)$ plotted as functions of $f(x)\,y$]
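A small numeric sketch comparing these losses as functions of the margin value $f(x)\,y$ (the margin values are chosen arbitrarily for illustration):

```python
import numpy as np

def zero_one_loss(m):   # m = f(x) * y
    return (m < 0).astype(float)

def squared_loss(m):    # (f(x) - y)^2 = (1 - m)^2 when y is in {-1, +1}
    return (1 - m) ** 2

def logistic_loss(m):
    return np.log(1 + np.exp(-m))

def hinge_loss(m):
    return np.maximum(0.0, 1 - m)

margins = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for name, fn in [("0-1", zero_one_loss), ("squared", squared_loss),
                 ("logistic", logistic_loss), ("hinge", hinge_loss)]:
    print(name, fn(margins))
```

Note how the squared loss keeps growing even for large positive margins, while the logistic and hinge losses upper-bound the 0-1 loss and decay (or vanish) once the example is correctly classified with margin.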
Logistic regression: ERM on surrogate loss
• $S = \{(\mathbf{x}^{(i)}, y^{(i)}) : i = 1, 2, \ldots, N\}$, $\mathcal{X} = \mathbb{R}^d$, $\mathcal{Y} = \{-1, 1\}$
• Linear model $f(\mathbf{x}) = f_{\mathbf{w}}(\mathbf{x}) = \mathbf{w}\cdot\mathbf{x} + w_0$
• Logistic loss $\ell(f(x), y) = \log(1 + \exp(-f(x)\,y))$
• Minimize training loss
  $$\hat{\mathbf{w}}, \hat{w}_0 = \arg\min_{\mathbf{w}, w_0} \sum_i \log\left(1 + \exp\left(-\left(\mathbf{w}\cdot\mathbf{x}^{(i)} + w_0\right)y^{(i)}\right)\right)$$
• Output classifier $\hat{y}(\mathbf{x}) = \mathrm{sign}(\mathbf{w}\cdot\mathbf{x} + w_0)$
Logistic Regression
$$\hat{\mathbf{w}}, \hat{w}_0 = \arg\min_{\mathbf{w}, w_0} \sum_i \log\left(1 + \exp\left(-\left(\mathbf{w}\cdot\mathbf{x}^{(i)} + w_0\right)y^{(i)}\right)\right)$$
• Convex optimization problem
• Can solve using gradient descent
• Can also add the usual regularization: $\ell_2$, $\ell_1$
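A minimal sketch of this optimization with plain gradient descent in numpy (the toy data and step size are made up, purely for illustration):

```python
import numpy as np

def logistic_regression_gd(X, y, lr=0.1, n_iters=1000):
    """Minimize (1/N) sum_i log(1 + exp(-(w.x_i + w0) y_i)) by gradient descent.
    X: (N, d) inputs, y: (N,) labels in {-1, +1}."""
    N, d = X.shape
    w, w0 = np.zeros(d), 0.0
    for _ in range(n_iters):
        scores = X @ w + w0                      # f(x_i)
        # d/ds log(1 + exp(-s*y)) = -y / (1 + exp(y*s))
        g = -y / (1 + np.exp(y * scores))        # per-example gradient factor
        w -= lr * (X.T @ g) / N
        w0 -= lr * g.mean()
        # (an l2 penalty could be added by including + lam * w in the w-gradient)
    return w, w0

# Toy linearly separable data.
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
w, w0 = logistic_regression_gd(X, y)
print(np.sign(X @ w + w0))   # should recover y
```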
Linear decision boundaries
$\hat{y}(\mathbf{x}) = \mathrm{sign}(\mathbf{w}\cdot\mathbf{x} + w_0)$
• $\{\mathbf{x} : \mathbf{w}\cdot\mathbf{x} + w_0 = 0\}$ is a hyperplane in $\mathbb{R}^d$
  o decision boundary
  o $\mathbf{w}$ is the direction of the normal
  o $w_0$ is the offset
• $\{\mathbf{x} : \mathbf{w}\cdot\mathbf{x} + w_0 = 0\}$ divides $\mathbb{R}^d$ into two halfspaces (regions)
  o $\{\mathbf{x} : \mathbf{w}\cdot\mathbf{x} + w_0 \geq 0\}$ gets label $+1$ and $\{\mathbf{x} : \mathbf{w}\cdot\mathbf{x} + w_0 < 0\}$ gets label $-1$
• Maps $\mathbf{x}$ to a 1D coordinate: $x' = \dfrac{\mathbf{w}\cdot\mathbf{x} + w_0}{\|\mathbf{w}\|}$
[Figure: a hyperplane in the $(x_1, x_2)$ plane with normal direction $\mathbf{w}$, and a point $\mathbf{x}$ with its signed distance $x'$]
Linear separators in 2D
[Figures: linear separators for a 2D dataset]
(Slide credit: Nati Srebro)
Margin of a classifier
• Margin: distance of the closest instance point to the separating hyperplane
• Large margins are more stable
  o small perturbations of the data do not change the prediction
[Figure: distances of the closest points $\mathbf{x}^{(1)}$, $\mathbf{x}^{(2)}$ to a separating hyperplane in 2D]
Maximum margin classifier
• $S = \{(\mathbf{x}^{(i)}, y^{(i)}) : i = 1, 2, \ldots, N\}$, binary classes $\mathcal{Y} = \{-1, 1\}$
• Assume data is "linearly separable"
  o $\exists\, \mathbf{w}, w_0$ such that for all $i = 1, 2, \ldots, N$, $y^{(i)} = \mathrm{sign}(\mathbf{w}\cdot\mathbf{x}^{(i)} + w_0) \;\Rightarrow\; y^{(i)}(\mathbf{w}\cdot\mathbf{x}^{(i)} + w_0) > 0$
• Maximum margin separator given by
  $$\hat{\mathbf{w}}, \hat{w}_0 = \arg\max_{\mathbf{w}\in\mathbb{R}^d,\, w_0} \; \min_i \frac{y^{(i)}(\mathbf{w}\cdot\mathbf{x}^{(i)} + w_0)}{\|\mathbf{w}\|}$$
  o $\dfrac{y^{(i)}(\mathbf{w}\cdot\mathbf{x}^{(i)} + w_0)}{\|\mathbf{w}\|}$ is the margin of sample $i$; the $\min_i$ picks out the smallest margin over the sample
Maximum margin classifier
$$\hat{\mathbf{w}}, \hat{w}_0 = \arg\max_{\mathbf{w}\in\mathbb{R}^d,\, w_0\in\mathbb{R}} \; \min_i \frac{y^{(i)}(\mathbf{w}\cdot\mathbf{x}^{(i)} + w_0)}{\|\mathbf{w}\|}$$
• Claim 1: If $(\hat{\mathbf{w}}, \hat{w}_0)$ is a solution, then for any $\gamma > 0$, $(\gamma\hat{\mathbf{w}}, \gamma\hat{w}_0)$ is also a solution
• Option 1: we can fix $\|\mathbf{w}\| = 1$ to get
  $$\hat{\mathbf{w}}, \hat{w}_0 = \arg\max_{\|\mathbf{w}\| = 1,\, w_0} \; \min_i\, y^{(i)}(\mathbf{w}\cdot\mathbf{x}^{(i)} + w_0)$$
• Option 2: we can instead fix $\min_i y^{(i)}(\mathbf{w}\cdot\mathbf{x}^{(i)} + w_0) = 1$
  o the margin is now $\dfrac{1}{\|\mathbf{w}\|}$
  o instead of "increasing the margin" we can "reduce the norm"
Max-margin classifier: equivalent formulation
• Solve:
  $$\tilde{\mathbf{w}}, \tilde{w}_0 = \arg\min_{\mathbf{w}, w_0} \|\mathbf{w}\|^2 \quad \text{s.t. } \forall i,\; y^{(i)}(\mathbf{w}\cdot\mathbf{x}^{(i)} + w_0) \geq 1$$
• Claim 2: equivalent to the previous slide →
  $\left(\dfrac{\tilde{\mathbf{w}}}{\|\tilde{\mathbf{w}}\|}, \dfrac{\tilde{w}_0}{\|\tilde{\mathbf{w}}\|}\right)$ is a solution of $\;\hat{\mathbf{w}}, \hat{w}_0 = \arg\max_{\|\mathbf{w}\|=1,\, w_0} \min_i y^{(i)}(\mathbf{w}\cdot\mathbf{x}^{(i)} + w_0)$
• Proof:
  1. Let $\min_i y^{(i)}(\hat{\mathbf{w}}\cdot\mathbf{x}^{(i)} + \hat{w}_0) = \hat{\gamma}$; then $\min_i y^{(i)}\!\left(\dfrac{\hat{\mathbf{w}}}{\hat{\gamma}}\cdot\mathbf{x}^{(i)} + \dfrac{\hat{w}_0}{\hat{\gamma}}\right) \geq 1$, so $(\hat{\mathbf{w}}/\hat{\gamma}, \hat{w}_0/\hat{\gamma})$ is feasible for the minimization above
  2. $\Rightarrow \|\tilde{\mathbf{w}}\| \leq \left\|\dfrac{\hat{\mathbf{w}}}{\hat{\gamma}}\right\| = \dfrac{1}{\hat{\gamma}}$ (using $\|\hat{\mathbf{w}}\| = 1$)
  3. $\min_i y^{(i)}\!\left(\dfrac{\tilde{\mathbf{w}}}{\|\tilde{\mathbf{w}}\|}\cdot\mathbf{x}^{(i)} + \dfrac{\tilde{w}_0}{\|\tilde{\mathbf{w}}\|}\right) = \min_i \dfrac{y^{(i)}(\tilde{\mathbf{w}}\cdot\mathbf{x}^{(i)} + \tilde{w}_0)}{\|\tilde{\mathbf{w}}\|} \geq \dfrac{1}{\|\tilde{\mathbf{w}}\|} \geq \hat{\gamma}$,
     so the normalized $(\tilde{\mathbf{w}}, \tilde{w}_0)$ achieves at least the optimal margin $\hat{\gamma}$ and is therefore also a maximizer
• This formulation is called the hard margin Support Vector Machine (SVM)
Maximum margin classifier formulations
• Original formulation
  $$\hat{\mathbf{w}}, \hat{w}_0 = \arg\max_{\mathbf{w}\in\mathbb{R}^d,\, w_0\in\mathbb{R}} \; \min_i \frac{y^{(i)}(\mathbf{w}\cdot\mathbf{x}^{(i)} + w_0)}{\|\mathbf{w}\|}$$
• Fixing $\|\mathbf{w}\| = 1$:
  $$\hat{\mathbf{w}}, \hat{w}_0 = \arg\max_{\mathbf{w}, w_0} \; \min_i\, y^{(i)}(\mathbf{w}\cdot\mathbf{x}^{(i)} + w_0) \quad \text{s.t. } \|\mathbf{w}\| = 1$$
• Fixing $\min_i y^{(i)}(\mathbf{w}\cdot\mathbf{x}^{(i)} + w_0) = 1$:
  $$\tilde{\mathbf{w}}, \tilde{w}_0 = \arg\min_{\mathbf{w}, w_0} \|\mathbf{w}\|^2 \quad \text{s.t. } \forall i,\; y^{(i)}(\mathbf{w}\cdot\mathbf{x}^{(i)} + w_0) \geq 1$$
• Henceforth, $w_0$ will be absorbed into $\mathbf{w}$ by adding an additional feature of '1' to $\mathbf{x}$
Margin and norm
• $\mathrm{margin}(\mathbf{w}) = \min_i \dfrac{y^{(i)}\, \mathbf{w}\cdot\mathbf{x}^{(i)}}{\|\mathbf{w}\|}$
• Remember, in regression: small norm solutions have low complexity!
  o Is this true for maximum margin classifiers?
  o What about classification with the logistic loss $\sum_i \log(1 + \exp(-y^{(i)}\, \mathbf{w}\cdot\mathbf{x}^{(i)}))$?
  o How to do capacity control in maximum margin classifier learning?
• In some places $\min_i y^{(i)}\, \mathbf{w}\cdot\mathbf{x}^{(i)}$ is referred to as the margin → this implicitly assumes normalization
  o $\min_i y^{(i)}\, \mathbf{w}\cdot\mathbf{x}^{(i)}$ is meaningless without knowing what $\|\mathbf{w}\|$ is!
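A small sketch of the last point: rescaling $\mathbf{w}$ inflates $\min_i y^{(i)}\,\mathbf{w}\cdot\mathbf{x}^{(i)}$ arbitrarily, while the normalized margin is unchanged (toy numbers, chosen only for illustration):

```python
import numpy as np

X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0]])
y = np.array([1, 1, -1])
w = np.array([1.0, 0.5])

for scale in [1.0, 10.0, 1000.0]:
    ws = scale * w
    unnormalized = np.min(y * (X @ ws))                  # grows with the scale
    margin = np.min(y * (X @ ws)) / np.linalg.norm(ws)   # scale-invariant
    print(f"scale={scale:7.1f}  min_i y_i w.x_i = {unnormalized:9.1f}  margin = {margin:.3f}")
```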
Solutions of hard margin SVM
$$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \|\mathbf{w}\|^2 \quad \text{s.t. } y^{(i)}\, \mathbf{w}\cdot\mathbf{x}^{(i)} \geq 1 \;\forall i$$
• Theorem: $\hat{\mathbf{w}} \in \mathrm{span}\{\mathbf{x}^{(i)} : i = 1, 2, \ldots, N\}$, i.e., $\exists\{\hat{\beta}_i : i = 1, 2, \ldots, N\}$ such that $\hat{\mathbf{w}} = \sum_i \hat{\beta}_i \mathbf{x}^{(i)}$
• Denote $\mathcal{S} = \mathrm{span}\{\mathbf{x}^{(i)} : i = 1, 2, \ldots, N\}$ and $\mathcal{S}^{\perp} = \{\mathbf{z} \in \mathbb{R}^d : \forall i,\; \mathbf{z}\cdot\mathbf{x}^{(i)} = 0\}$
  o For any $\mathbf{z} \in \mathbb{R}^d$, $\mathbf{z} = \mathbf{z}_{\mathcal{S}} + \mathbf{z}_{\mathcal{S}^{\perp}}$ with $\mathbf{z}_{\mathcal{S}} \in \mathcal{S}$ and $\mathbf{z}_{\mathcal{S}^{\perp}} \in \mathcal{S}^{\perp}$
  o $\|\mathbf{z}\|^2 = \|\mathbf{z}_{\mathcal{S}}\|^2 + \|\mathbf{z}_{\mathcal{S}^{\perp}}\|^2$
• Three step proof:
  1. Decompose $\hat{\mathbf{w}} = \hat{\mathbf{w}}_{\mathcal{S}} + \hat{\mathbf{w}}_{\mathcal{S}^{\perp}}$.
  2. $\min_i y^{(i)}\, \hat{\mathbf{w}}\cdot\mathbf{x}^{(i)} \geq 1 \Rightarrow \min_i y^{(i)}\, \hat{\mathbf{w}}_{\mathcal{S}}\cdot\mathbf{x}^{(i)} \geq 1$ (because $\hat{\mathbf{w}}_{\mathcal{S}^{\perp}}\cdot\mathbf{x}^{(i)} = 0 \;\forall i$)
  3. If $\hat{\mathbf{w}}_{\mathcal{S}^{\perp}} \neq 0$, then $\|\hat{\mathbf{w}}_{\mathcal{S}}\| < \|\hat{\mathbf{w}}\|$, contradicting the optimality of $\hat{\mathbf{w}}$; hence $\hat{\mathbf{w}} = \hat{\mathbf{w}}_{\mathcal{S}} \in \mathcal{S}$.
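A numerical sketch of the decomposition argument on toy data (the projection onto $\mathcal{S}$ is computed with least squares): dropping the component of $\mathbf{w}$ orthogonal to $\mathcal{S}$ leaves every constraint value unchanged but strictly reduces the norm.

```python
import numpy as np

# Toy data in R^3 whose span S is only 2-dimensional.
X = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
y = np.array([1, -1])

# A feasible (but not optimal) w with a component outside span{x^(i)}.
w = np.array([2.0, -3.0, 5.0])
print(y * (X @ w))                              # constraint values, all >= 1

# Project w onto S = span of the rows of X (least-squares coefficients).
coef, *_ = np.linalg.lstsq(X.T, w, rcond=None)
w_S = X.T @ coef                                # component in S
w_perp = w - w_S                                # component in S^perp

print(y * (X @ w_S))                            # identical constraint values
print(np.linalg.norm(w_S), np.linalg.norm(w))   # strictly smaller norm
```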
Representer Theorem
$$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \|\mathbf{w}\|^2 \quad \text{s.t. } y^{(i)}\, \mathbf{w}\cdot\mathbf{x}^{(i)} \geq 1 \;\forall i$$
• Theorem: $\hat{\mathbf{w}} \in \mathrm{span}\{\mathbf{x}^{(i)} : i = 1, 2, \ldots, N\}$, i.e., $\exists\{\hat{\beta}_i : i = 1, 2, \ldots, N\}$ such that $\hat{\mathbf{w}} = \sum_i \hat{\beta}_i \mathbf{x}^{(i)}$
  o Special case of the representer theorem
• Theorem (ext): additionally, $\{\hat{\beta}_i\}$ also satisfies $\hat{\beta}_i = 0$ for all $i$ such that $y^{(i)}\, \hat{\mathbf{w}}\cdot\mathbf{x}^{(i)} > 1$
• Proof?: (animation next slide)
[Illustration of $\hat{\mathbf{w}}$ and the constraints $y^{(i)}\hat{\mathbf{w}}\cdot\mathbf{x}^{(i)} \geq 1$]
(Illustration credit: Nati Srebro)
Representer Theorem
$$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \|\mathbf{w}\|^2 \quad \text{s.t. } y^{(i)}\, \mathbf{w}\cdot\mathbf{x}^{(i)} \geq 1 \;\forall i$$
• Theorem: $\exists\{\hat{\beta}_i : i = 1, 2, \ldots, N\}$ such that $\hat{\mathbf{w}} = \sum_i \hat{\beta}_i \mathbf{x}^{(i)}$; $\{\hat{\beta}_i\}$ also satisfies $\hat{\beta}_i = 0$ for all $i$ such that $y^{(i)}\, \hat{\mathbf{w}}\cdot\mathbf{x}^{(i)} > 1$
• $SV(\hat{\mathbf{w}}) = \{i : y^{(i)}\, \hat{\mathbf{w}}\cdot\mathbf{x}^{(i)} = 1\}$, the data points closest to the decision boundary of $\hat{\mathbf{w}}$
  o called support vectors
  o hence "support vector machine"
  $$\hat{\mathbf{w}} = \sum_{i \in SV(\hat{\mathbf{w}})} \hat{\beta}_i \mathbf{x}^{(i)}$$
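For intuition, these quantities can be inspected with an off-the-shelf solver. A sketch using scikit-learn's SVC with a linear kernel and a large C to approximate the hard-margin problem (assuming scikit-learn is available; the lecture's own formulation absorbs $w_0$ and works with $\beta$, so this is only an illustration of support vectors):

```python
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data.
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 4.0],
              [-2.0, -2.0], [-3.0, -1.0], [-2.5, -3.0]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ hard margin

print(clf.support_)           # indices of the support vectors
print(clf.support_vectors_)   # the points lying (approximately) on the margin
# w is a combination of support vectors only:
w_rebuilt = clf.dual_coef_ @ clf.support_vectors_
print(w_rebuilt, clf.coef_)   # the two should match
```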
How do we get $\hat{\mathbf{w}}$?
Optimizing the SVM problem
$$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \|\mathbf{w}\|^2 \quad \text{s.t. } y^{(i)}\, \mathbf{w}\cdot\mathbf{x}^{(i)} \geq 1 \;\forall i$$
1. Can do sub-gradient descent (next class)
2. Special case of a quadratic program
   $$\min_{\mathbf{z}} \; \tfrac{1}{2}\mathbf{z}^{\top}\mathbf{P}\mathbf{z} + \mathbf{q}^{\top}\mathbf{z} \quad \text{s.t. } \mathbf{G}\mathbf{z} \leq \mathbf{h},\; \mathbf{A}\mathbf{z} = \mathbf{b}$$
   o Change of variables $\hat{\mathbf{w}} = \sum_{i \in SV(\hat{\mathbf{w}})} \hat{\beta}_i \mathbf{x}^{(i)}$?
   o Change of variables $\hat{\mathbf{w}} = \sum_{i=1}^{N} \hat{\beta}_i \mathbf{x}^{(i)}$!
Optimizing the SVM problem
$$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \|\mathbf{w}\|^2 \quad \text{s.t. } y^{(i)}\, \mathbf{w}\cdot\mathbf{x}^{(i)} \geq 1 \;\forall i$$
• Change of variables $\mathbf{w} = \sum_{i=1}^{N} \beta_i \mathbf{x}^{(i)}$:
  $$\equiv \min_{\{\beta_i\}} \; \sum_{i=1}^{N}\sum_{j=1}^{N} \beta_i \beta_j\, \mathbf{x}^{(i)}\cdot\mathbf{x}^{(j)} \quad \text{s.t. } \sum_{j=1}^{N} \beta_j\, y^{(i)}\, \mathbf{x}^{(i)}\cdot\mathbf{x}^{(j)} \geq 1 \;\forall i$$
  $$= \min_{\boldsymbol{\beta}\in\mathbb{R}^N} \; \boldsymbol{\beta}^{\top}\mathbf{G}\boldsymbol{\beta} \quad \text{s.t. } y^{(i)}(\mathbf{G}\boldsymbol{\beta})_i \geq 1 \;\forall i$$
• $\mathbf{G} \in \mathbb{R}^{N\times N}$ with $G_{ij} = \mathbf{x}^{(i)}\cdot\mathbf{x}^{(j)}$ is called the Gram matrix
• Convex program: quadratic programming
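A sketch of this quadratic program written directly in the $\boldsymbol{\beta}$ variables with cvxpy (assuming cvxpy and a default QP solver are installed; toy data, with $w_0$ absorbed by appending a constant feature as on the earlier slide):

```python
import numpy as np
import cvxpy as cp

# Toy separable data; append a constant '1' feature to absorb w0.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])
X = np.hstack([X, np.ones((X.shape[0], 1))])
y = np.array([1.0, 1.0, -1.0, -1.0])
N = X.shape[0]

G = X @ X.T                                   # Gram matrix, G_ij = x^(i).x^(j)
G = 0.5 * (G + G.T) + 1e-8 * np.eye(N)        # symmetrize + tiny ridge so quad_form accepts it

beta = cp.Variable(N)
objective = cp.Minimize(cp.quad_form(beta, G))          # beta' G beta = ||w||^2
constraints = [cp.multiply(y, G @ beta) >= 1]           # y_i (G beta)_i >= 1
cp.Problem(objective, constraints).solve()

w = X.T @ beta.value                          # recover w = sum_i beta_i x^(i)
print(w, np.sign(X @ w))                      # predictions should match y
```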
The Kernel
$$\min_{\mathbf{w}} \|\mathbf{w}\|^2 \;\text{ s.t. } y^{(i)}\, \mathbf{w}\cdot\mathbf{x}^{(i)} \geq 1 \;\forall i \quad\equiv\quad \min_{\boldsymbol{\beta}\in\mathbb{R}^N} \boldsymbol{\beta}^{\top}\mathbf{G}\boldsymbol{\beta} \;\text{ s.t. } y^{(i)}(\mathbf{G}\boldsymbol{\beta})_i \geq 1 \;\forall i$$
• The optimization problem depends on $\mathbf{x}^{(i)}$ only through the values of $G_{ij} = \mathbf{x}^{(i)}\cdot\mathbf{x}^{(j)}$ for $i, j \in [N]$.
• What about prediction?
  $$\hat{\mathbf{w}}\cdot\mathbf{x} = \sum_i \beta_i\, \mathbf{x}^{(i)}\cdot\mathbf{x}$$
• The function $K(\mathbf{x}, \mathbf{x}') = \mathbf{x}\cdot\mathbf{x}'$ is called the kernel
• Learning non-linear classifiers using feature transformations, i.e., $f_{\mathbf{w}}(\mathbf{x}) = \mathbf{w}\cdot\phi(\mathbf{x})$ for some $\phi(\mathbf{x})$
  o the only thing we need to know is $K_{\phi}(\mathbf{x}, \mathbf{x}') = K(\phi(\mathbf{x}), \phi(\mathbf{x}')) = \phi(\mathbf{x})\cdot\phi(\mathbf{x}')$
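The same sketch carries over with a non-linear kernel: build the matrix from kernel evaluations instead of raw inner products, and predict via $\sum_i \beta_i K(\mathbf{x}^{(i)}, \mathbf{x})$ without ever forming $\phi(\mathbf{x})$ explicitly (again assuming cvxpy; the degree-2 polynomial kernel and XOR-style data below are illustrative choices):

```python
import numpy as np
import cvxpy as cp

def poly_kernel(A, B, degree=2):
    """K(x, x') = (1 + x.x')^degree, computed for all pairs of rows of A and B."""
    return (1.0 + A @ B.T) ** degree

# XOR-like data: not linearly separable, but separable with a degree-2 kernel.
X = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
N = len(y)

K = poly_kernel(X, X)
K = 0.5 * (K + K.T) + 1e-8 * np.eye(N)    # keep quad_form happy

beta = cp.Variable(N)
prob = cp.Problem(cp.Minimize(cp.quad_form(beta, K)),
                  [cp.multiply(y, K @ beta) >= 1])
prob.solve()

def predict(X_new):
    # f(x) = sum_i beta_i K(x^(i), x), thresholded
    return np.sign(poly_kernel(X_new, X) @ beta.value)

print(predict(X))   # recovers y even though the data are not linearly separable
```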
Kernels As Prior Knowledge
• If we think that positive examples can (almost) be separated by some ellipse, then we should use polynomials of degree 2
• A kernel encodes a measure of similarity between objects. A bit like NN, except that it must be a valid inner product function.
(Slide credit: Nati Srebro)
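A small sketch of that correspondence: an explicit degree-2 feature map has exactly the inner products of the kernel $K(\mathbf{x}, \mathbf{x}') = (1 + \mathbf{x}\cdot\mathbf{x}')^2$, so a linear separator over $\phi(\mathbf{x})$ (an ellipse in the original space) can be learned from kernel evaluations alone.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for x in R^2 (matches the (1 + x.x')^2 kernel)."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

def K(x, z):
    return (1.0 + x @ z) ** 2

x = np.array([0.5, -1.0])
z = np.array([2.0, 3.0])
print(phi(x) @ phi(z), K(x, z))   # identical: the kernel computes phi(x).phi(z) implicitly
```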