CPSC 340: Machine Learning and Data Mining
Linear Classifiers (Spring 2019)
Last Time: L1-Regularization
• We discussed L1-regularization:
– Also known as "LASSO" and "basis pursuit denoising".
– Regularizes 'w' so we decrease our test error (like L2-regularization).
– Yields a sparse 'w', so it selects features (like L0-regularization).
• Properties:
– It's convex and fast to minimize (with "proximal-gradient" methods).
– The solution is not unique (sometimes people do L2- and L1-regularization).
– Usually includes the "correct" variables but tends to yield false positives.
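As a rough sketch of why proximal-gradient methods make this fast (assuming numpy; the function name is illustrative, and the step size must be at most 1/||X^T X|| for convergence): the whole solver is a gradient step on the smooth part followed by soft-thresholding.

```python
import numpy as np

def prox_gradient_lasso(X, y, lam, step, n_iters=500):
    """Minimize f(w) = 0.5*||Xw - y||^2 + lam*||w||_1 by proximal gradient:
    take a gradient step on the smooth least-squares term, then apply
    soft-thresholding (the proximal operator of the L1-norm)."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        w_half = w - step * (X.T @ (X @ w - y))  # gradient step on 0.5*||Xw - y||^2
        w = np.sign(w_half) * np.maximum(np.abs(w_half) - step * lam, 0.0)  # prox step
    return w  # many entries end up exactly zero, so features are selected
```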
L*-Regularization
• L0-regularization (AIC, BIC, Mallow's Cp, adjusted R^2, ANOVA):
– Adds a penalty on the number of non-zeros to select features:
f(w) = (1/2)||Xw - y||^2 + λ||w||_0
• L2-regularization (ridge regression):
– Adds a penalty on the L2-norm of 'w' to decrease overfitting:
f(w) = (1/2)||Xw - y||^2 + (λ/2)||w||^2
• L1-regularization (LASSO):
– Adds a penalty on the L1-norm to decrease overfitting and select features:
f(w) = (1/2)||Xw - y||^2 + λ||w||_1
L0- vs. L1- vs. L2-Regularization

                    Sparse 'w' (selects features)   Speed   Unique 'w'   Coding effort   Irrelevant features
L0-regularization   Yes                             Slow    No           Few lines       Not sensitive
L1-regularization   Yes*                            Fast*   No           1 line*         Not sensitive
L2-regularization   No                              Fast    Yes          1 line          A bit sensitive
• L1-Regularizationisn’tassparseasL0-regularization.– L1-regularizationtendstogivemorefalsepositives(selectstoomany).– Andit’sonly“fast”and“1line”withspecializedsolvers.
• CostofL2-regularizedleastsquaresisO(nd2 +d3).– ChangestoO(ndt)for‘t’iterationsofgradientdescent(sameforL1).
• “Elasticnet”(L1- andL2-regularization)issparse,fast,andunique.• UsingL0+L2doesnotgiveauniquesolution.
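A sketch of where those costs come from (assuming numpy; the step size and iteration count are illustrative):

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    # Forming X^T X costs O(n d^2); solving the d x d system costs O(d^3).
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def ridge_gradient_descent(X, y, lam, step=1e-3, t=1000):
    # Each iteration costs O(nd) for the matrix-vector products, so O(ndt) total.
    w = np.zeros(X.shape[1])
    for _ in range(t):
        w -= step * (X.T @ (X @ w - y) + lam * w)
    return w
```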
Ensemble Feature Selection
• We can also use ensemble methods for feature selection.
– Usually designed to reduce false positives or reduce false negatives.
• In the case of L1-regularization, we want to reduce false positives.
– Unlike L0-regularization, the non-zero wj are still "shrunk".
• "Irrelevant" variables are included before the "relevant" wj reach their best values.
• A bootstrap approach to reducing false positives:
– Apply the method to bootstrap samples of the training data.
– Only take the features selected in all bootstrap samples.
• Example: bootstrapping plus L1-regularization ("BoLASSO").
– Reduces false positives.
– It's possible to show it recovers the "correct" variables under weaker conditions.
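A minimal sketch of this bootstrap approach (assuming numpy and scikit-learn's Lasso; alpha and the number of bootstrap samples are made up):

```python
import numpy as np
from sklearn.linear_model import Lasso

def bolasso_features(X, y, n_boot=50, alpha=0.1, seed=0):
    """Bootstrap plus L1-regularization: keep only the features whose
    coefficient is non-zero in every bootstrap sample."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    keep = np.ones(d, dtype=bool)
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)      # bootstrap sample (with replacement)
        coef = Lasso(alpha=alpha).fit(X[idx], y[idx]).coef_
        keep &= (coef != 0)                   # intersect the selected features
    return np.flatnonzero(keep)
```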
Motivation: Identifying Important E-mails
• How can we automatically identify 'important' e-mails?
• A binary classification problem ("important" vs. "not important").
– Labels are approximated by whether you took an "action" based on the e-mail.
– High-dimensional feature set (that we'll discuss later).
• Gmail uses regression for this binary classification problem.
Binary Classification Using Regression?
• Can we apply linear models for binary classification?
– Set yi = +1 for one class ("important").
– Set yi = -1 for the other class ("not important").
• At training time, fit a linear regression model:
f(w) = (1/2)||Xw - y||^2
• The model will try to make w^T x_i = +1 for "important" e-mails, and w^T x_i = -1 for "not important" e-mails.
• The linear model gives real numbers like 0.9, -1.1, and so on.
• So to predict, we look at whether w^T x_i is closer to +1 or to -1.
– If w^T x_i = 0.9, predict ŷi = +1.
– If w^T x_i = -1.1, predict ŷi = -1.
– If w^T x_i = 0.1, predict ŷi = +1.
– If w^T x_i = -100, predict ŷi = -1.
– We write this operation (rounding to +1 or -1) as ŷi = sign(w^T x_i).
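A minimal sketch of this "classification by regression" recipe (assuming numpy and that X^T X is invertible):

```python
import numpy as np

def fit_least_squares(X, y):
    # Ordinary least squares on labels y in {-1, +1}: solve X^T X w = X^T y.
    return np.linalg.solve(X.T @ X, X.T @ y)

def predict(w, X):
    # Round each w^T x_i to the nearest label: yhat_i = sign(w^T x_i).
    return np.sign(X @ w)
```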
Decision Boundary in 1D
• We can interpret 'w' as a hyperplane separating x into two sets:
– The set where w^T x_i > 0 and the set where w^T x_i < 0.
Decision Boundary in 2D
[Figure: decision boundaries of a decision tree, KNN, and a linear classifier.]
• A linear classifier would be a linear function ŷi = w0 + w1 xi1 + w2 xi2 coming out of the page (the boundary is at ŷi = 0).
Should we use least squares for classification?
• Consider training by minimizing the squared error with yi values that are +1 or -1:
• If we predict w^T x_i = +0.9 and yi = +1, the error is small: (0.9 - 1)^2 = 0.01.
• If we predict w^T x_i = -0.8 and yi = +1, the error is bigger: (-0.8 - 1)^2 = 3.24.
• If we predict w^T x_i = +100 and yi = +1, the error is huge: (100 - 1)^2 = 9801.
– But it shouldn't be: the prediction is correct.
• Least squares penalizes us for being "too right".
– +100 has the right sign, so the error should be zero.
Should we use least squares for classification?
• Least squares can behave weirdly when applied to classification:
• Why? The squared error of the green line is huge!
– Make sure you understand why the green line achieves 0 training error.
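A small numeric sketch of this effect (assuming numpy; the data is made up): one example that is far on the *correct* side drags the least-squares boundary enough to misclassify a training example.

```python
import numpy as np

# 1D data with a bias feature: columns of X are [1, x_i].
x = np.array([-3., -2., -1., 1., 2., 3.])
y = np.array([-1., -1., -1., 1., 1., 1.])
X = np.column_stack([np.ones_like(x), x])
w = np.linalg.solve(X.T @ X, X.T @ y)
print(np.all(np.sign(X @ w) == y))   # True: everything classified correctly

# Add one example that is "too right" (far on the correct side).
x2, y2 = np.append(x, 100.), np.append(y, 1.)
X2 = np.column_stack([np.ones_like(x2), x2])
w2 = np.linalg.solve(X2.T @ X2, X2.T @ y2)
print(np.sign(X2 @ w2)[3])           # -1.0: the example at x = 1 is now misclassified
```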
"0-1 Loss" Function: Minimizing Classification Errors
• Could we instead minimize the number of classification errors?
– This is called the 0-1 loss function:
f(w) = Σ_i I[sign(w^T x_i) ≠ yi]
• You either get the classification wrong (1) or right (0).
– We can write this using the L0-norm as ||ŷ - y||_0.
• Unlike regression, in classification it's reasonable to expect ŷi = yi exactly (each is either +1 or -1).
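In code, the 0-1 loss is one line (a numpy sketch):

```python
import numpy as np

def zero_one_loss(w, X, y):
    # Number of classification errors: ||yhat - y||_0 with yhat_i = sign(w^T x_i).
    return np.sum(np.sign(X @ w) != y)
```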
• Important special case: "linearly separable" data.
– The classes can be "separated" by a hyperplane.
– So a perfect linear classifier exists.
Perceptron Algorithm for Linearly-Separable Data
• One of the first "learning" algorithms was the "perceptron" (1957).
– It searches for a 'w' such that sign(w^T x_i) = yi for all i.
• Perceptron algorithm:
– Start with w^0 = 0.
– Go through the examples in any order until you make a mistake predicting yi.
• Set w^{t+1} = w^t + yi xi.
– Keep going through the examples until you make no errors on the training data.
• If a perfect classifier exists, this algorithm finds one in a finite number of steps.
• Intuition for the step: if yi = +1, we "add more of xi to w" so that w^T xi is larger:
(w^t + xi)^T xi = (w^t)^T xi + ||xi||^2
– If yi = -1, you would instead be subtracting the squared norm, making w^T xi smaller.
https://en.wikipedia.org/wiki/Perceptron
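A minimal numpy sketch of the algorithm (max_passes is a safeguard I've added; on linearly-separable data the loop exits once a pass makes no mistakes):

```python
import numpy as np

def perceptron(X, y, max_passes=1000):
    """Find a 'w' with sign(w^T x_i) = y_i for all i, assuming the data
    (with labels y_i in {-1, +1}) is linearly separable."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(max_passes):
        mistakes = 0
        for i in range(n):
            if y[i] * (w @ X[i]) <= 0:   # mistake (w^T x_i has the wrong sign)
                w = w + y[i] * X[i]      # update: w^{t+1} = w^t + y_i x_i
                mistakes += 1
        if mistakes == 0:                # no errors on the training data: done
            return w
    return w
```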
Geometry of why we want the 0-1 loss
Thoughts on the previous (and next) slide
• We are now plotting the loss vs. the predicted w^T x_i.
– This is "loss space", which is different from parameter space or data space.
• We're plotting the individual loss for a particular training example.
– In the figure the label is yi = -1 (so the loss is centered at -1).
• It will be centered at +1 when yi = +1.
0-1 Loss Function
• Unfortunately, the 0-1 loss is non-convex in 'w'.
– It's easy to minimize if a perfect classifier exists (perceptron).
– Otherwise, finding the 'w' minimizing the 0-1 loss is a hard problem.
– The gradient is zero everywhere: we don't even know "which way to go".
– This is NOT the same type of problem we had with the squared loss.
• We can minimize the squared error, but it might give a bad model for classification.
• This motivates convex approximations to the 0-1 loss…
Degenerate Convex Approximation to 0-1 Loss
• If yi = +1, we get the label right if w^T x_i > 0.
• If yi = -1, we get the label right if w^T x_i < 0, or equivalently -w^T x_i > 0.
• So "classifying 'i' correctly" is equivalent to having yi w^T x_i > 0.
• One possible convex approximation to the 0-1 loss:
– Minimize how much this constraint is violated.
Degenerate Convex Approximation to 0-1 Loss
• Our convex approximation of the error for one example is:
max(0, -yi w^T x_i)
• We could train by minimizing the sum over all examples:
f(w) = Σ_i max(0, -yi w^T x_i)
• But this has a degenerate solution:
– We have f(0) = 0, and this is the lowest possible value of 'f'.
• There are two standard fixes: the hinge loss and the logistic loss.
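A numpy sketch of this approximation, showing the degeneracy:

```python
import numpy as np

def degenerate_approx(w, X, y):
    # Sum over examples of max(0, -y_i * w^T x_i): zero when y_i w^T x_i > 0
    # (correct side), and linear in the amount of violation otherwise.
    return np.sum(np.maximum(0, -y * (X @ w)))

# The degenerate solution: w = 0 gives f(0) = 0, the lowest possible value,
# regardless of the data, so minimizing this on its own learns nothing.
```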
Summary
• Ensemble feature selection reduces false positives or false negatives.
• Binary classification using regression:
– Encode using yi in {-1, +1}.
– Use sign(w^T x_i) as the prediction.
– "Linear classifier" (a hyperplane splitting the space in half).
• Least squares is a weird error for classification.
• Perceptron algorithm: finds a perfect classifier (if one exists).
• The 0-1 loss is the ideal loss, but it is non-smooth and non-convex.
• Next time: one of the best "out of the box" classifiers.
L1-Regularization as a Feature Selection Method
• Advantages:
– Deals with conditional independence (if linear).
– Sort of deals with collinearity:
• Picks at least one of "mom" and "mom2".
– Very fast with specialized algorithms.
• Disadvantages:
– Tends to give false positives (selects too many variables).
• Neither good nor bad:
– Does not pick up small effects.
– Says "gender" is relevant if we know "baby".
– Good for prediction if we want fast training and don't care about having some irrelevant variables included.
"Elastic Net": L2- and L1-Regularization
• To address non-uniqueness, some authors use both L2- and L1-regularization:
f(w) = (1/2)||Xw - y||^2 + λ1||w||_1 + (λ2/2)||w||^2
• This is called "elastic net" regularization.
– The solution is sparse and unique.
– Slightly better with feature dependence:
• Selects both "mom" and "mom2".
• Optimization is easier, though still non-differentiable.
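A sketch with scikit-learn's ElasticNet (the data and regularization strengths are made up; sklearn parameterizes the two penalties through alpha and l1_ratio):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.standard_normal(100)  # 2 relevant features

# alpha scales the total penalty; l1_ratio splits it between L1 and L2.
model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(np.flatnonzero(model.coef_))   # sparse: most coefficients are exactly zero
```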
L1-Regularization Debiasing and Filtering
• To remove false positives, some authors add a debiasing step:
– Fit 'w' using L1-regularization.
– Grab the non-zero values of 'w' as the "relevant" variables.
– Re-fit the relevant 'w' using least squares or L2-regularized least squares.
• A related use of L1-regularization is as a filtering method:
– Fit 'w' using L1-regularization.
– Grab the non-zero values of 'w' as the "relevant" variables.
– Run a standard (slow) variable selection method restricted to the relevant variables.
• Forward selection, exhaustive search, stochastic local search, etc.
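A minimal sketch of the debiasing step (assuming scikit-learn; the data and alpha are made up):

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.standard_normal(100)

# Step 1: fit 'w' with L1-regularization, keep the non-zero ("relevant") variables.
relevant = np.flatnonzero(Lasso(alpha=0.1).fit(X, y).coef_)

# Step 2 (debiasing): re-fit the relevant 'w' with least squares,
# undoing the shrinkage that L1-regularization applies to non-zero coefficients.
w_debiased = LinearRegression().fit(X[:, relevant], y).coef_
```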
Non-Convex Regularizers
• Regularizing |wj|^2 selects all features.
• Regularizing |wj| selects fewer, but still has many false positives.
• What if we regularize |wj|^{1/2} instead?
• Minimizing this objective would lead to fewer false positives.
– Less need for debiasing, but it's not convex and is hard to minimize.
• There are many non-convex regularizers with similar properties.
– L1-regularization is (basically) the "most sparse" convex regularizer.
Can we just use least squares??
• What went wrong?
– "Good" errors vs. "bad" errors.
Online Classification with Perceptron
• Perceptron for online linear binary classification [Rosenblatt, 1957]:
– Start with w0 = 0.
– At time 't' we receive features xt.
– We predict ŷt = sign(wt^T xt).
– If ŷt ≠ yt, then set wt+1 = wt + yt xt.
• Otherwise, set wt+1 = wt.
(Slides are old, so above I'm using subscripts of 't' instead of superscripts.)
• Perceptron mistake bound [Novikoff, 1962]:
– Assume the data is linearly separable with a "margin":
• There exists a w* with ||w*|| = 1 such that sign(xt^T w*) = sign(yt) for all 't' and |xt^T w*| ≥ γ.
– Then the total number of mistakes is bounded.
• There is no requirement that the data is IID.
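A sketch that checks the bound empirically (assuming numpy; w*, γ, and the stream are made up to satisfy the margin assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
w_star, gamma = np.array([0.6, 0.8]), 0.1   # ||w*|| = 1, margin gamma

# Build a stream of unit-norm examples that all satisfy the margin condition.
X, y = [], []
while len(X) < 1000:
    x = rng.standard_normal(2)
    x /= np.linalg.norm(x)                  # normalize so ||x_t|| = 1
    if abs(x @ w_star) >= gamma:            # keep only examples with the margin
        X.append(x)
        y.append(np.sign(x @ w_star))

w, mistakes = np.zeros(2), 0
for xt, yt in zip(X, y):
    if np.sign(w @ xt) != yt:               # online prediction was wrong
        w = w + yt * xt                     # perceptron update
        mistakes += 1

print(mistakes, "<=", 1 / gamma**2)         # Novikoff: at most 1/gamma^2 mistakes
```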
Perceptron Mistake Bound
• Let's normalize each xt so that ||xt|| = 1.
– Length doesn't change the label.
• Whenever we make a mistake, we have sign(yt) ≠ sign(wt^T xt) and
||wt+1||^2 = ||wt + yt xt||^2 = ||wt||^2 + 2 yt (wt^T xt) + ||xt||^2 ≤ ||wt||^2 + 1,
since yt (wt^T xt) ≤ 0 on a mistake and ||xt|| = 1.
• So after 'k' errors we have ||wt||^2 ≤ k.
Perceptron Mistake Bound
• Let's consider a solution w*, so sign(yt) = sign(xt^T w*).
– And let's choose a w* with ||w*|| = 1.
• Whenever we make a mistake, we have:
wt+1^T w* = (wt + yt xt)^T w* = wt^T w* + yt (xt^T w*) ≥ wt^T w* + γ,
since yt (xt^T w*) = |xt^T w*| ≥ γ.
– Note: wt^T w* ≥ 0 by induction (starts at 0, then is at least as big as the old value plus γ).
• So after 'k' mistakes we have ||wt|| ≥ wt^T w* ≥ γk (using ||w*|| = 1).
Perceptron Mistake Bound
• So our two bounds are ||wt|| ≤ sqrt(k) and ||wt|| ≥ γk.
• This gives γk ≤ sqrt(k), or a maximum of 1/γ^2 mistakes.
– Note that γ > 0 by assumption, and γ is upper-bounded by one because ||xt|| ≤ 1.
– After this 'k', under our assumptions we're guaranteed to have a perfect classifier.