CPSC 340: Machine Learning and Data Mining
Linear Classifiers (Spring 2019)
Last Time: L1-Regularization
• We discussed L1-regularization:
– Also known as "LASSO" and "basis pursuit denoising".
– Regularizes 'w' so we decrease our test error (like L2-regularization).
– Yields a sparse 'w', so it selects features (like L0-regularization).
• Properties:
– It's convex and fast to minimize (with "proximal-gradient" methods).
– The solution is not unique (sometimes people do L2- and L1-regularization).
– Usually includes the "correct" variables but tends to yield false positives.
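As a rough sketch of why proximal-gradient methods make this fast (assuming numpy; the function name is illustrative, and the step size must be at most 1/||X^T X|| for convergence): the whole solver is a gradient step on the smooth part followed by soft-thresholding.

```python
import numpy as np

def prox_gradient_lasso(X, y, lam, step, n_iters=500):
    """Minimize f(w) = 0.5*||Xw - y||^2 + lam*||w||_1 by proximal gradient:
    take a gradient step on the smooth least-squares term, then apply
    soft-thresholding (the proximal operator of the L1-norm)."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        w_half = w - step * (X.T @ (X @ w - y))  # gradient step on 0.5*||Xw - y||^2
        w = np.sign(w_half) * np.maximum(np.abs(w_half) - step * lam, 0.0)  # prox step
    return w  # many entries end up exactly zero, so features are selected
```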
L*-Regularization
• L0-regularization (AIC, BIC, Mallow's Cp, adjusted R^2, ANOVA):
– Adds a penalty on the number of non-zeros to select features:
f(w) = (1/2)||Xw - y||^2 + λ||w||_0
• L2-regularization (ridge regression):
– Adds a penalty on the L2-norm of 'w' to decrease overfitting:
f(w) = (1/2)||Xw - y||^2 + (λ/2)||w||^2
• L1-regularization (LASSO):
– Adds a penalty on the L1-norm to decrease overfitting and select features:
f(w) = (1/2)||Xw - y||^2 + λ||w||_1
L0- vs. L1- vs. L2-Regularization

                    Sparse 'w' (selects features)   Speed   Unique 'w'   Coding effort   Irrelevant features
L0-regularization   Yes                             Slow    No           Few lines       Not sensitive
L1-regularization   Yes*                            Fast*   No           1 line*         Not sensitive
L2-regularization   No                              Fast    Yes          1 line          A bit sensitive
• L1-Regularizationisn’tassparseasL0-regularization.– L1-regularizationtendstogivemorefalsepositives(selectstoomany).– Andit’sonly“fast”and“1line”withspecializedsolvers.
• CostofL2-regularizedleastsquaresisO(nd2 +d3).– ChangestoO(ndt)for‘t’iterationsofgradientdescent(sameforL1).
• “Elasticnet”(L1- andL2-regularization)issparse,fast,andunique.• UsingL0+L2doesnotgiveauniquesolution.
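A sketch of where those costs come from (assuming numpy; the step size and iteration count are illustrative):

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    # Forming X^T X costs O(n d^2); solving the d x d system costs O(d^3).
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def ridge_gradient_descent(X, y, lam, step=1e-3, t=1000):
    # Each iteration costs O(nd) for the matrix-vector products, so O(ndt) total.
    w = np.zeros(X.shape[1])
    for _ in range(t):
        w -= step * (X.T @ (X @ w - y) + lam * w)
    return w
```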
Ensemble Feature Selection
• We can also use ensemble methods for feature selection.
– Usually designed to reduce false positives or reduce false negatives.
• In the case of L1-regularization, we want to reduce false positives.
– Unlike L0-regularization, the non-zero wj are still "shrunk".
• "Irrelevant" variables are included before the "relevant" wj reach their best values.
• A bootstrap approach to reducing false positives:
– Apply the method to bootstrap samples of the training data.
– Only take the features selected in all bootstrap samples.
• Example: bootstrapping plus L1-regularization ("BoLASSO").
– Reduces false positives.
– It's possible to show it recovers the "correct" variables under weaker conditions.
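A minimal sketch of this bootstrap approach (assuming numpy and scikit-learn's Lasso; alpha and the number of bootstrap samples are made up):

```python
import numpy as np
from sklearn.linear_model import Lasso

def bolasso_features(X, y, n_boot=50, alpha=0.1, seed=0):
    """Bootstrap plus L1-regularization: keep only the features whose
    coefficient is non-zero in every bootstrap sample."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    keep = np.ones(d, dtype=bool)
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)      # bootstrap sample (with replacement)
        coef = Lasso(alpha=alpha).fit(X[idx], y[idx]).coef_
        keep &= (coef != 0)                   # intersect the selected features
    return np.flatnonzero(keep)
```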
Motivation: Identifying Important E-mails
• How can we automatically identify 'important' e-mails?
• A binary classification problem ("important" vs. "not important").
– Labels are approximated by whether you took an "action" based on the e-mail.
– High-dimensional feature set (that we'll discuss later).
• Gmail uses regression for this binary classification problem.
Binary Classification Using Regression?
• Can we apply linear models for binary classification?
– Set yi = +1 for one class ("important").
– Set yi = -1 for the other class ("not important").
• At training time, fit a linear regression model:
f(w) = (1/2)||Xw - y||^2
• The model will try to make w^T x_i = +1 for "important" e-mails, and w^T x_i = -1 for "not important" e-mails.
• The linear model gives real numbers like 0.9, -1.1, and so on.
• So to predict, we look at whether w^T x_i is closer to +1 or to -1.
– If w^T x_i = 0.9, predict ŷi = +1.
– If w^T x_i = -1.1, predict ŷi = -1.
– If w^T x_i = 0.1, predict ŷi = +1.
– If w^T x_i = -100, predict ŷi = -1.
– We write this operation (rounding to +1 or -1) as ŷi = sign(w^T x_i).
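A minimal sketch of this "classification by regression" recipe (assuming numpy and that X^T X is invertible):

```python
import numpy as np

def fit_least_squares(X, y):
    # Ordinary least squares on labels y in {-1, +1}: solve X^T X w = X^T y.
    return np.linalg.solve(X.T @ X, X.T @ y)

def predict(w, X):
    # Round each w^T x_i to the nearest label: yhat_i = sign(w^T x_i).
    return np.sign(X @ w)
```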
Decision Boundary in 1D
• We can interpret 'w' as a hyperplane separating x into two sets:
– The set where w^T x_i > 0 and the set where w^T x_i < 0.
Decision Boundary in 2D
[Figure: decision boundaries of a decision tree, KNN, and a linear classifier.]
• A linear classifier would be a linear function ŷi = w0 + w1 xi1 + w2 xi2 coming out of the page (the boundary is at ŷi = 0).
Should we use least squares for classification?
• Consider training by minimizing the squared error with yi values that are +1 or -1:
• If we predict w^T x_i = +0.9 and yi = +1, the error is small: (0.9 - 1)^2 = 0.01.
• If we predict w^T x_i = -0.8 and yi = +1, the error is bigger: (-0.8 - 1)^2 = 3.24.
• If we predict w^T x_i = +100 and yi = +1, the error is huge: (100 - 1)^2 = 9801.
– But it shouldn't be: the prediction is correct.
• Least squares penalizes us for being "too right".
– +100 has the right sign, so the error should be zero.
Should we use least squares for classification?
• Least squares can behave weirdly when applied to classification:
• Why? The squared error of the green line is huge!
– Make sure you understand why the green line achieves 0 training error.
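A small numeric sketch of this effect (assuming numpy; the data is made up): one example that is far on the *correct* side drags the least-squares boundary enough to misclassify a training example.

```python
import numpy as np

# 1D data with a bias feature: columns of X are [1, x_i].
x = np.array([-3., -2., -1., 1., 2., 3.])
y = np.array([-1., -1., -1., 1., 1., 1.])
X = np.column_stack([np.ones_like(x), x])
w = np.linalg.solve(X.T @ X, X.T @ y)
print(np.all(np.sign(X @ w) == y))   # True: everything classified correctly

# Add one example that is "too right" (far on the correct side).
x2, y2 = np.append(x, 100.), np.append(y, 1.)
X2 = np.column_stack([np.ones_like(x2), x2])
w2 = np.linalg.solve(X2.T @ X2, X2.T @ y2)
print(np.sign(X2 @ w2)[3])           # -1.0: the example at x = 1 is now misclassified
```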
"0-1 Loss" Function: Minimizing Classification Errors
• Could we instead minimize the number of classification errors?
– This is called the 0-1 loss function:
f(w) = Σ_i I[sign(w^T x_i) ≠ yi]
• You either get the classification wrong (1) or right (0).
– We can write this using the L0-norm as ||ŷ - y||_0.
• Unlike regression, in classification it's reasonable to expect ŷi = yi exactly (each is either +1 or -1).
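In code, the 0-1 loss is one line (a numpy sketch):

```python
import numpy as np

def zero_one_loss(w, X, y):
    # Number of classification errors: ||yhat - y||_0 with yhat_i = sign(w^T x_i).
    return np.sum(np.sign(X @ w) != y)
```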
• Important special case: "linearly separable" data.
– The classes can be "separated" by a hyperplane.
– So a perfect linear classifier exists.
Perceptron Algorithm for Linearly-Separable Data
• One of the first "learning" algorithms was the "perceptron" (1957).
– It searches for a 'w' such that sign(w^T x_i) = yi for all i.
• Perceptron algorithm:
– Start with w^0 = 0.
– Go through the examples in any order until you make a mistake predicting yi.
• Set w^{t+1} = w^t + yi xi.
– Keep going through the examples until you make no errors on the training data.
• If a perfect classifier exists, this algorithm finds one in a finite number of steps.
• Intuition for the step: if yi = +1, we "add more of xi to w" so that w^T xi is larger:
(w^t + xi)^T xi = (w^t)^T xi + ||xi||^2
– If yi = -1, you would instead be subtracting the squared norm, making w^T xi smaller.
https://en.wikipedia.org/wiki/Perceptron
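A minimal numpy sketch of the algorithm (max_passes is a safeguard I've added; on linearly-separable data the loop exits once a pass makes no mistakes):

```python
import numpy as np

def perceptron(X, y, max_passes=1000):
    """Find a 'w' with sign(w^T x_i) = y_i for all i, assuming the data
    (with labels y_i in {-1, +1}) is linearly separable."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(max_passes):
        mistakes = 0
        for i in range(n):
            if y[i] * (w @ X[i]) <= 0:   # mistake (w^T x_i has the wrong sign)
                w = w + y[i] * X[i]      # update: w^{t+1} = w^t + y_i x_i
                mistakes += 1
        if mistakes == 0:                # no errors on the training data: done
            return w
    return w
```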
Geometry of why we want the 0-1 loss
Thoughts on the previous (and next) slide
• We are now plotting the loss vs. the predicted w^T x_i.
– This is "loss space", which is different from parameter space or data space.
• We're plotting the individual loss for a particular training example.
– In the figure the label is yi = -1 (so the loss is centered at -1).
• It will be centered at +1 when yi = +1.
0-1 Loss Function
• Unfortunately, the 0-1 loss is non-convex in 'w'.
– It's easy to minimize if a perfect classifier exists (perceptron).
– Otherwise, finding the 'w' minimizing the 0-1 loss is a hard problem.
– The gradient is zero everywhere: we don't even know "which way to go".
– This is NOT the same type of problem we had with the squared loss.
• We can minimize the squared error, but it might give a bad model for classification.
• This motivates convex approximations to the 0-1 loss…
Degenerate Convex Approximation to 0-1 Loss
• If yi = +1, we get the label right if w^T x_i > 0.
• If yi = -1, we get the label right if w^T x_i < 0, or equivalently -w^T x_i > 0.
• So "classifying 'i' correctly" is equivalent to having yi w^T x_i > 0.
• One possible convex approximation to the 0-1 loss:
– Minimize how much this constraint is violated.
Degenerate Convex Approximation to 0-1 Loss
• Our convex approximation of the error for one example is:
max(0, -yi w^T x_i)
• We could train by minimizing the sum over all examples:
f(w) = Σ_i max(0, -yi w^T x_i)
• But this has a degenerate solution:
– We have f(0) = 0, and this is the lowest possible value of 'f'.
• There are two standard fixes: the hinge loss and the logistic loss.
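A numpy sketch of this approximation, showing the degeneracy:

```python
import numpy as np

def degenerate_approx(w, X, y):
    # Sum over examples of max(0, -y_i * w^T x_i): zero when y_i w^T x_i > 0
    # (correct side), and linear in the amount of violation otherwise.
    return np.sum(np.maximum(0, -y * (X @ w)))

# The degenerate solution: w = 0 gives f(0) = 0, the lowest possible value,
# regardless of the data, so minimizing this on its own learns nothing.
```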
Summary
• Ensemble feature selection reduces false positives or false negatives.
• Binary classification using regression:
– Encode using yi in {-1, +1}.
– Use sign(w^T x_i) as the prediction.
– "Linear classifier" (a hyperplane splitting the space in half).
• Least squares is a weird error for classification.
• Perceptron algorithm: finds a perfect classifier (if one exists).
• The 0-1 loss is the ideal loss, but it is non-smooth and non-convex.
• Next time: one of the best "out of the box" classifiers.
L1-Regularization as a Feature Selection Method
• Advantages:
– Deals with conditional independence (if linear).
– Sort of deals with collinearity:
• Picks at least one of "mom" and "mom2".
– Very fast with specialized algorithms.
• Disadvantages:
– Tends to give false positives (selects too many variables).
• Neither good nor bad:
– Does not pick up small effects.
– Says "gender" is relevant if we know "baby".
– Good for prediction if we want fast training and don't care about having some irrelevant variables included.
"Elastic Net": L2- and L1-Regularization
• To address non-uniqueness, some authors use both L2- and L1-regularization:
f(w) = (1/2)||Xw - y||^2 + λ1||w||_1 + (λ2/2)||w||^2
• This is called "elastic net" regularization.
– The solution is sparse and unique.
– Slightly better with feature dependence:
• Selects both "mom" and "mom2".
• Optimization is easier, though still non-differentiable.
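A sketch with scikit-learn's ElasticNet (the data and regularization strengths are made up; sklearn parameterizes the two penalties through alpha and l1_ratio):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.standard_normal(100)  # 2 relevant features

# alpha scales the total penalty; l1_ratio splits it between L1 and L2.
model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(np.flatnonzero(model.coef_))   # sparse: most coefficients are exactly zero
```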
L1-Regularization Debiasing and Filtering
• To remove false positives, some authors add a debiasing step:
– Fit 'w' using L1-regularization.
– Grab the non-zero values of 'w' as the "relevant" variables.
– Re-fit the relevant 'w' using least squares or L2-regularized least squares.
• A related use of L1-regularization is as a filtering method:
– Fit 'w' using L1-regularization.
– Grab the non-zero values of 'w' as the "relevant" variables.
– Run a standard (slow) variable selection method restricted to the relevant variables.
• Forward selection, exhaustive search, stochastic local search, etc.
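A minimal sketch of the debiasing step (assuming scikit-learn; the data and alpha are made up):

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.standard_normal(100)

# Step 1: fit 'w' with L1-regularization, keep the non-zero ("relevant") variables.
relevant = np.flatnonzero(Lasso(alpha=0.1).fit(X, y).coef_)

# Step 2 (debiasing): re-fit the relevant 'w' with least squares,
# undoing the shrinkage that L1-regularization applies to non-zero coefficients.
w_debiased = LinearRegression().fit(X[:, relevant], y).coef_
```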
Non-Convex Regularizers
• Regularizing |wj|^2 selects all features.
• Regularizing |wj| selects fewer, but still has many false positives.
• What if we regularize |wj|^{1/2} instead?
• Minimizing this objective would lead to fewer false positives.
– Less need for debiasing, but it's not convex and is hard to minimize.
• There are many non-convex regularizers with similar properties.
– L1-regularization is (basically) the "most sparse" convex regularizer.
Can we just use least squares??
• What went wrong?
– "Good" errors vs. "bad" errors.
Online Classification with Perceptron
• Perceptron for online linear binary classification [Rosenblatt, 1957]:
– Start with w0 = 0.
– At time 't' we receive features xt.
– We predict ŷt = sign(wt^T xt).
– If ŷt ≠ yt, then set wt+1 = wt + yt xt.
• Otherwise, set wt+1 = wt.
(Slides are old, so above I'm using subscripts of 't' instead of superscripts.)
• Perceptron mistake bound [Novikoff, 1962]:
– Assume the data is linearly separable with a "margin":
• There exists a w* with ||w*|| = 1 such that sign(xt^T w*) = sign(yt) for all 't' and |xt^T w*| ≥ γ.
– Then the total number of mistakes is bounded.
• There is no requirement that the data is IID.
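A sketch that checks the bound empirically (assuming numpy; w*, γ, and the stream are made up to satisfy the margin assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
w_star, gamma = np.array([0.6, 0.8]), 0.1   # ||w*|| = 1, margin gamma

# Build a stream of unit-norm examples that all satisfy the margin condition.
X, y = [], []
while len(X) < 1000:
    x = rng.standard_normal(2)
    x /= np.linalg.norm(x)                  # normalize so ||x_t|| = 1
    if abs(x @ w_star) >= gamma:            # keep only examples with the margin
        X.append(x)
        y.append(np.sign(x @ w_star))

w, mistakes = np.zeros(2), 0
for xt, yt in zip(X, y):
    if np.sign(w @ xt) != yt:               # online prediction was wrong
        w = w + yt * xt                     # perceptron update
        mistakes += 1

print(mistakes, "<=", 1 / gamma**2)         # Novikoff: at most 1/gamma^2 mistakes
```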
Perceptron Mistake Bound
• Let's normalize each xt so that ||xt|| = 1.
– Length doesn't change the label.
• Whenever we make a mistake, we have sign(yt) ≠ sign(wt^T xt) and
||wt+1||^2 = ||wt + yt xt||^2 = ||wt||^2 + 2 yt (wt^T xt) + ||xt||^2 ≤ ||wt||^2 + 1,
since yt (wt^T xt) ≤ 0 on a mistake and ||xt|| = 1.
• So after 'k' errors we have ||wt||^2 ≤ k.
Perceptron Mistake Bound
• Let's consider a solution w*, so sign(yt) = sign(xt^T w*).
– And let's choose a w* with ||w*|| = 1.
• Whenever we make a mistake, we have:
wt+1^T w* = (wt + yt xt)^T w* = wt^T w* + yt (xt^T w*) ≥ wt^T w* + γ,
since yt (xt^T w*) = |xt^T w*| ≥ γ.
– Note: wt^T w* ≥ 0 by induction (starts at 0, then is at least as big as the old value plus γ).
• So after 'k' mistakes we have ||wt|| ≥ wt^T w* ≥ γk (using ||w*|| = 1).
Perceptron Mistake Bound
• So our two bounds are ||wt|| ≤ sqrt(k) and ||wt|| ≥ γk.
• This gives γk ≤ sqrt(k), or a maximum of 1/γ^2 mistakes.
– Note that γ > 0 by assumption, and γ is upper-bounded by one because ||xt|| ≤ 1.
– After this 'k', under our assumptions we're guaranteed to have a perfect classifier.