CPSC 340: Machine Learning and Data Mining Linear Classifiers Spring 2019


Page 1:

CPSC 340: Machine Learning and Data Mining

Linear Classifiers, Spring 2019

Page 2:

Last Time: L1-Regularization
• We discussed L1-regularization:
  – Also known as "LASSO" and "basis pursuit denoising".
  – Regularizes 'w' so we decrease our test error (like L2-regularization).
  – Yields sparse 'w', so it selects features (like L0-regularization).

• Properties:
  – It's convex and fast to minimize (with "proximal-gradient" methods; a minimal sketch follows below).
  – The solution is not unique (sometimes people use both L2- and L1-regularization).
  – Usually includes the "correct" variables but tends to yield false positives.
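
To make the "proximal-gradient" point concrete, here is a minimal sketch of the iterative soft-thresholding (ISTA) update for L1-regularized least squares; the step size, the λ value, and the random data are illustrative assumptions, not values from the lecture.

```python
import numpy as np

def soft_threshold(z, t):
    # Proximal operator of t*||.||_1: shrink each entry toward 0 by t.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, step=None, iters=500):
    # Proximal-gradient (ISTA) for f(w) = 0.5*||Xw - y||^2 + lam*||w||_1.
    n, d = X.shape
    if step is None:
        # 1/L, where L is the Lipschitz constant of the smooth part's gradient.
        step = 1.0 / np.linalg.norm(X, 2) ** 2
    w = np.zeros(d)
    for _ in range(iters):
        grad = X.T @ (X @ w - y)                         # gradient of the least-squares part
        w = soft_threshold(w - step * grad, step * lam)  # proximal step on the L1 part
    return w

# Tiny illustrative example: only the first 3 of 10 features are relevant.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
w_true = np.zeros(10)
w_true[:3] = [2.0, -1.5, 1.0]
y = X @ w_true + 0.1 * rng.standard_normal(100)
print(np.round(lasso_ista(X, y, lam=5.0), 2))  # many entries come out exactly 0 (sparse 'w')
```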

Page 3:

L*-Regularization
• L0-regularization (AIC, BIC, Mallows' Cp, adjusted R², ANOVA):
  – Adds a penalty on the number of non-zeros to select features:
      f(w) = (1/2)||Xw - y||² + λ||w||_0.

• L2-regularization (ridge regression):
  – Adds a penalty on the L2-norm of 'w' to decrease overfitting:
      f(w) = (1/2)||Xw - y||² + (λ/2)||w||².

• L1-regularization (LASSO):
  – Adds a penalty on the L1-norm, which decreases overfitting and selects features:
      f(w) = (1/2)||Xw - y||² + λ||w||_1.

Page 4:

L0- vs. L1- vs. L2-Regularization

                    | Sparse 'w' (Selects Features) | Speed | Unique 'w' | Coding Effort | Irrelevant Features
L0-Regularization   | Yes                           | Slow  | No         | Few lines     | Not sensitive
L1-Regularization   | Yes*                          | Fast* | No         | 1 line*       | Not sensitive
L2-Regularization   | No                            | Fast  | Yes        | 1 line        | A bit sensitive

• L1-regularization isn't as sparse as L0-regularization.
  – L1-regularization tends to give more false positives (selects too many).
  – And it's only "fast" and "1 line" with specialized solvers.

• Cost of L2-regularized least squares is O(nd² + d³).
  – Changes to O(ndt) for 't' iterations of gradient descent (same for L1).

• "Elastic net" (L1- and L2-regularization) is sparse, fast, and unique.
• Using L0+L2 does not give a unique solution.

Page 5:

Ensemble Feature Selection
• We can also use ensemble methods for feature selection.
  – Usually designed to reduce false positives or reduce false negatives.

• In the case of L1-regularization, we want to reduce false positives.
  – Unlike L0-regularization, the non-zero wj are still "shrunk".
    • "Irrelevant" variables get included before the "relevant" wj reach their best values.

• A bootstrap approach to reducing false positives (a minimal sketch follows below):
  – Apply the method to bootstrap samples of the training data.
  – Only take the features selected in all bootstrap samples.
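
A minimal sketch of that bootstrap idea; it assumes scikit-learn's Lasso as the base feature selector, which is a tooling choice for illustration rather than something specified in the lecture.

```python
import numpy as np
from sklearn.linear_model import Lasso  # assumed base selector; any sparse method works

def bootstrap_feature_selection(X, y, lam=0.1, n_bootstraps=20, seed=0):
    # Keep only the features whose coefficient is non-zero in *every* bootstrap fit.
    rng = np.random.default_rng(seed)
    n, d = X.shape
    selected = np.ones(d, dtype=bool)
    for _ in range(n_bootstraps):
        idx = rng.integers(0, n, size=n)               # sample n rows with replacement
        w = Lasso(alpha=lam).fit(X[idx], y[idx]).coef_
        selected &= (w != 0)                           # intersect the selected supports
    return np.flatnonzero(selected)                    # indices of the surviving features
```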

Page 6:

Ensemble Feature Selection

• Example: bootstrapping plus L1-regularization ("BoLASSO").
  – Reduces false positives.
  – It's possible to show it recovers the "correct" variables under weaker conditions.

Page 7:

(pause)

Page 8:

Motivation: Identifying Important E-mails
• How can we automatically identify 'important' e-mails?

• A binary classification problem ("important" vs. "not important").
  – Labels are approximated by whether you took an "action" based on the e-mail.
  – High-dimensional feature set (that we'll discuss later).

• Gmail uses regression for this binary classification problem.

Page 9:

Binary Classification Using Regression?
• Can we apply linear models for binary classification?
  – Set yi = +1 for one class ("important").
  – Set yi = -1 for the other class ("not important").

• At training time, fit a linear regression model by minimizing the squared error:
      f(w) = (1/2) Σ_i (wᵀxi - yi)².

• The model will try to make wᵀxi = +1 for "important" e-mails, and wᵀxi = -1 for "not important" e-mails.

Page 10:

Binary Classification Using Regression?
• Can we apply linear models for binary classification?
  – Set yi = +1 for one class ("important").
  – Set yi = -1 for the other class ("not important").

• The linear model gives real numbers like 0.9, -1.1, and so on.
• So to predict, we look at whether wᵀxi is closer to +1 or to -1:
  – If wᵀxi = 0.9, predict ŷi = +1.
  – If wᵀxi = -1.1, predict ŷi = -1.
  – If wᵀxi = 0.1, predict ŷi = +1.
  – If wᵀxi = -100, predict ŷi = -1.
  – We write this operation (rounding to +1 or -1) as ŷi = sign(wᵀxi). (A small code sketch follows below.)
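
A small, hypothetical example of this recipe: fit ordinary least squares to ±1 labels, then threshold with sign at prediction time. The data here is made up for illustration.

```python
import numpy as np

# Toy data: yi is +1 ("important") or -1 ("not important").
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 5))
y = np.sign(X @ np.array([1.0, -2.0, 0.5, 0.0, 0.0]) + 0.1 * rng.standard_normal(200))

# Training: plain least squares on the +1/-1 labels.
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# Prediction: round w^T xi to the nearest of +1/-1, i.e. take the sign.
y_hat = np.sign(X @ w)
print("training error rate:", np.mean(y_hat != y))
```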

Page 11:

Decision Boundary in 1D

Page 12:

Decision Boundary in 1D
• We can interpret 'w' as a hyperplane separating the space of x into two sets:
  – The set where wᵀxi > 0 and the set where wᵀxi < 0.

Page 13:

Decision Boundary in 2D

[Figure: example decision boundaries of a decision tree, KNN, and a linear classifier.]

• A linear classifier would be the linear function ŷi = w0 + w1 xi1 + w2 xi2, coming "out of the page" (the boundary is at ŷi = 0). (A small sketch of this boundary follows below.)
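
A small sketch of that 2D picture; the weights below are made-up numbers, not values from the slides.

```python
import numpy as np

w0, w1, w2 = -1.0, 2.0, 3.0            # illustrative weights

def predict(x1, x2):
    # Linear classifier: sign of the plane w0 + w1*x1 + w2*x2 "coming out of the page".
    return np.sign(w0 + w1 * x1 + w2 * x2)

# Points on the decision boundary satisfy w0 + w1*x1 + w2*x2 = 0,
# i.e. x2 = -(w0 + w1*x1) / w2 (a line in the 2D feature space).
x1 = np.linspace(-2, 2, 5)
boundary_x2 = -(w0 + w1 * x1) / w2
print(predict(0.0, 1.0), predict(0.0, -1.0))   # points on opposite sides of the line
```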

Page 14:

Should we use least squares for classification?
• Consider training by minimizing the squared error (wᵀxi - yi)² with yi that are +1 or -1:

• If we predict wᵀxi = +0.9 and yi = +1, the error is small: (0.9 - 1)² = 0.01.
• If we predict wᵀxi = -0.8 and yi = +1, the error is bigger: (-0.8 - 1)² = 3.24.
• If we predict wᵀxi = +100 and yi = +1, the error is huge: (100 - 1)² = 9801.
  – But it shouldn't be: the prediction is correct.

• Least squares penalizes you for being "too right".
  – +100 has the right sign, so the error should be zero.

Page 15:

Should we use least squares for classification?
• Least squares can behave weirdly when applied to classification:

• Why? The squared error of the green line is huge!
  – Make sure you understand why the green line achieves 0 training error.

Page 16:

"0-1 Loss" Function: Minimizing Classification Errors

• Could we instead minimize the number of classification errors?
  – This is called the 0-1 loss function (a small code sketch follows below):
      f(w) = Σ_i [sign(wᵀxi) ≠ yi].
    • You either get the classification wrong (1) or right (0).
  – We can write this using the L0-norm as ||ŷ - y||_0.
    • Unlike regression, in classification it's reasonable to expect ŷi = yi exactly (it's either +1 or -1).

• Important special case: "linearly separable" data.
  – Classes can be "separated" by a hyper-plane.
  – So a perfect linear classifier exists.
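
A minimal sketch of computing the 0-1 loss for a linear classifier with ±1 labels; the function names are illustrative.

```python
import numpy as np

def zero_one_loss(w, X, y):
    # Number of training examples where sign(w^T xi) disagrees with yi.
    return np.sum(np.sign(X @ w) != y)

def zero_one_loss_l0(w, X, y):
    # Equivalent "L0-norm" view: count the non-zero entries of (y_hat - y).
    return np.count_nonzero(np.sign(X @ w) - y)
```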

Page 17:

Perceptron Algorithm for Linearly-Separable Data
• One of the first "learning" algorithms was the "perceptron" (1957).
  – Searches for a 'w' such that sign(wᵀxi) = yi for all i.

• Perceptron algorithm (a minimal implementation follows below):
  – Start with w0 = 0.
  – Go through the examples in any order until you make a mistake predicting yi.
    • Set wt+1 = wt + yi xi.
  – Keep going through the examples until you make no errors on the training data.

• If a perfect classifier exists, this algorithm finds one in a finite number of steps.

• Intuition for the step: if yi = +1, "add more of xi to w" so that wᵀxi is larger:
      (wt + yi xi)ᵀxi = wtᵀxi + yi ||xi||².
  – If yi = -1, you would instead be subtracting the squared norm.
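
A minimal implementation of the algorithm described above. It assumes the data really is linearly separable; a maximum pass count is added as a safeguard (an assumption, not part of the slide's algorithm).

```python
import numpy as np

def perceptron(X, y, max_passes=1000):
    # Find w with sign(w^T xi) = yi for all i, assuming linearly separable data.
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(max_passes):
        mistakes = 0
        for i in range(n):
            if y[i] * (X[i] @ w) <= 0:   # mistake (or on the boundary): update
                w = w + y[i] * X[i]
                mistakes += 1
        if mistakes == 0:                # perfect on the training data: stop
            return w
    return w  # may not be perfect if the data is not separable within max_passes
```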

Page 18:

https://en.wikipedia.org/wiki/Perceptron

Page 19:

Geometry of why we want the 0-1 loss

Page 20:

Thoughts on the previous (and next) slide
• We are now plotting the loss vs. the predicted wᵀxi.
  – This is "loss space", which is different from parameter space or data space.

• We're plotting the individual loss for a particular training example.
  – In the figure the label is yi = -1 (so the loss is centered at -1).
    • It will be centered at +1 when yi = +1.

• (The next slide is the same as the previous one.)

Page 21:

Geometry of why we want the 0-1 loss

Page 22:

Geometry of why we want the 0-1 loss

Page 23:

Geometry of why we want the 0-1 loss

Page 24:

0-1 Loss Function
• Unfortunately, the 0-1 loss is non-convex in 'w'.
  – It's easy to minimize if a perfect classifier exists (perceptron).
  – Otherwise, finding the 'w' minimizing the 0-1 loss is a hard problem.
    • The gradient is zero everywhere: you don't even know "which way to go".
  – This is NOT the same type of problem we had with the squared loss.
    • We can minimize the squared error, but it might give a bad model for classification.

• This motivates convex approximations to the 0-1 loss...

Page 25:

Degenerate Convex Approximation to 0-1 Loss
• If yi = +1, we get the label right if wᵀxi > 0.
• If yi = -1, we get the label right if wᵀxi < 0, or equivalently -wᵀxi > 0.
• So "classifying 'i' correctly" is equivalent to having yi wᵀxi > 0.

• One possible convex approximation to the 0-1 loss:
  – Minimize how much this constraint is violated.

Page 26:

Degenerate Convex Approximation to 0-1 Loss
• Our convex approximation of the error for one example is:
      max(0, -yi wᵀxi),
  which is 0 when the constraint yi wᵀxi > 0 is satisfied and grows with the violation otherwise.

• We could train by minimizing the sum over all examples:
      f(w) = Σ_i max(0, -yi wᵀxi).

• But this has a degenerate solution:
  – We have f(0) = 0, and this is the lowest possible value of 'f' (see the small check below).

• There are two standard fixes: the hinge loss and the logistic loss.
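
A tiny check of the degenerate behaviour described above, on made-up data: the approximation sums max(0, -yi wᵀxi), and w = 0 already achieves the minimum value of 0.

```python
import numpy as np

def degenerate_loss(w, X, y):
    # Sum over examples of max(0, -yi * w^T xi): zero iff no constraint is violated.
    return np.sum(np.maximum(0.0, -y * (X @ w)))

rng = np.random.default_rng(2)
X = rng.standard_normal((50, 3))
y = np.sign(rng.standard_normal(50))
print(degenerate_loss(np.zeros(3), X, y))   # 0.0: the trivial solution already minimizes f
```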

Page 27:

Summary
• Ensemble feature selection reduces false positives or false negatives.
• Binary classification using regression:
  – Encode the labels using yi in {-1, +1}.
  – Use sign(wᵀxi) as the prediction.
  – "Linear classifier" (a hyperplane splitting the space in half).

• Least squares is a weird error for classification.
• Perceptron algorithm: finds a perfect classifier (if one exists).
• The 0-1 loss is the ideal loss, but it is non-smooth and non-convex.

• Next time: one of the best "out of the box" classifiers.

Page 28:

L1-Regularization as a Feature Selection Method
• Advantages:
  – Deals with conditional independence (if the model is linear).
  – Sort of deals with collinearity:
    • Picks at least one of "mom" and "mom2".
  – Very fast with specialized algorithms.

• Disadvantages:
  – Tends to give false positives (selects too many variables).

• Neither good nor bad:
  – Does not take small effects into account.
  – Says "gender" is relevant if we know "baby".
  – Good for prediction if we want fast training and don't care about having some irrelevant variables included.

Page 29:

"Elastic Net": L2- and L1-Regularization
• To address non-uniqueness, some authors use both L2- and L1-regularization:
      f(w) = (1/2)||Xw - y||² + λ1||w||_1 + (λ2/2)||w||².

• This is called "elastic net" regularization.
  – The solution is sparse and unique.
  – Slightly better with feature dependence:
    • Selects both "mom" and "mom2".

• Optimization is easier, though still non-differentiable.

Page 30:

L1-Regularization Debiasing and Filtering
• To remove false positives, some authors add a debiasing step (a minimal sketch follows below):
  – Fit 'w' using L1-regularization.
  – Grab the non-zero values of 'w' as the "relevant" variables.
  – Re-fit the relevant 'w' using least squares or L2-regularized least squares.

• A related use of L1-regularization is as a filtering method:
  – Fit 'w' using L1-regularization.
  – Grab the non-zero values of 'w' as the "relevant" variables.
  – Run standard (slow) variable selection restricted to the relevant variables.
    • Forward selection, exhaustive search, stochastic local search, etc.
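
A minimal sketch of the debiasing step, again assuming scikit-learn's Lasso for the L1-regularized fit (an illustrative choice, not one made in the lecture).

```python
import numpy as np
from sklearn.linear_model import Lasso

def l1_then_debias(X, y, lam=0.1):
    # Step 1: L1-regularized fit to pick the "relevant" variables.
    support = np.flatnonzero(Lasso(alpha=lam).fit(X, y).coef_)
    # Step 2: re-fit only those variables with (unregularized) least squares.
    w = np.zeros(X.shape[1])
    if support.size > 0:
        w[support], *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
    return w, support
```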

Page 31:

Non-Convex Regularizers
• Regularizing |wj|² selects all features.
• Regularizing |wj| selects fewer, but still gives many false positives.
• What if we regularize |wj|^(1/2) instead?

• Minimizing this objective would lead to fewer false positives.
  – Less need for debiasing, but it's not convex and is hard to minimize.

• There are many non-convex regularizers with similar properties.
  – L1-regularization is (basically) the "most sparse" convex regularizer.

Page 32:

Can we just use least squares??
• What went wrong?
  – "Good" errors vs. "bad" errors.

Page 33:

Can we just use least squares??
• What went wrong?
  – "Good" errors vs. "bad" errors.

Page 34:

Online Classification with Perceptron
• Perceptron for online linear binary classification [Rosenblatt, 1957]:
  – Start with w0 = 0.
  – At time 't' we receive features xt.
  – We predict ŷt = sign(wtᵀxt).
  – If ŷt ≠ yt, then set wt+1 = wt + yt xt.
    • Otherwise, set wt+1 = wt.

(These slides are old, so above I'm using subscripts of 't' instead of superscripts.)

• Perceptron mistake bound [Novikoff, 1962]:
  – Assume the data is linearly separable with a "margin" γ:
    • There exists a w* with ||w*|| = 1 such that sign(xtᵀw*) = sign(yt) for all 't' and |xtᵀw*| ≥ γ.
  – Then the total number of mistakes is bounded.
    • There is no requirement that the data is IID.

Page 35:

Perceptron Mistake Bound
• Let's normalize each xt so that ||xt|| = 1.
  – The length doesn't change the label.

• Whenever we make a mistake, we have sign(yt) ≠ sign(wtᵀxt), and the update increases ||w||² by at most 1 (see the derivation below).

• So after 'k' errors we have ||wt||² ≤ k.
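
Writing this step out (the standard argument): on a mistake we have $y_t w_t^T x_t \le 0$, so with $\|x_t\| = 1$ the update $w_{t+1} = w_t + y_t x_t$ gives

$$\|w_{t+1}\|^2 = \|w_t + y_t x_t\|^2 = \|w_t\|^2 + 2\, y_t\, w_t^T x_t + \|x_t\|^2 \le \|w_t\|^2 + 1.$$

Starting from $w_0 = 0$ and applying this once per mistake gives $\|w_t\|^2 \le k$ after $k$ errors.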

Page 36:

Perceptron Mistake Bound
• Let's consider a solution w*, so sign(yt) = sign(xtᵀw*).
  – And let's choose a w* with ||w*|| = 1.
• Whenever we make a mistake, wtᵀw* grows by at least γ (see the derivation below).
  – Note: wtᵀw* ≥ 0 by induction (it starts at 0, then is at least as big as the old value plus γ).

• So after 'k' mistakes we have ||wt|| ≥ γk.
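
Writing this step out (again the standard argument): on a mistake,

$$w_{t+1}^T w^* = (w_t + y_t x_t)^T w^* = w_t^T w^* + y_t\, x_t^T w^* \ge w_t^T w^* + \gamma,$$

since $\mathrm{sign}(x_t^T w^*) = \mathrm{sign}(y_t)$ and $|x_t^T w^*| \ge \gamma$ imply $y_t x_t^T w^* \ge \gamma$. After $k$ mistakes this gives $w_t^T w^* \ge \gamma k$, and by the Cauchy-Schwarz inequality with $\|w^*\| = 1$,

$$\|w_t\| = \|w_t\|\,\|w^*\| \ge w_t^T w^* \ge \gamma k.$$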

Page 37:

Perceptron Mistake Bound
• So our two bounds are ||wt|| ≤ sqrt(k) and ||wt|| ≥ γk.

• This gives γk ≤ sqrt(k), or a maximum of 1/γ² mistakes.
  – Note that γ > 0 by assumption and is upper-bounded by one since ||xt|| ≤ 1.
  – After this 'k', under our assumptions we're guaranteed to have a perfect classifier.