Gradient boosting - Wikipedia, the free encyclopedia
8/28/2015 — Gradient boosting - Wikipedia, the free encyclopedia
https://en.wikipedia.org/wiki/Gradient_boosting
Gradient boosting
From Wikipedia, the free encyclopedia
Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion like other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function.
The idea of gradient boosting originated in the observation by Leo Breiman[1] that boosting can be interpreted as an optimization algorithm on a suitable cost function. Explicit regression gradient boosting algorithms were subsequently developed by Jerome H. Friedman,[2][3] simultaneously with the more general functional gradient boosting perspective of Llew Mason, Jonathan Baxter, Peter Bartlett and Marcus Frean.[4][5] The latter two papers introduced the abstract view of boosting algorithms as iterative functional gradient descent algorithms: that is, algorithms that optimize a cost functional over function space by iteratively choosing a function (weak hypothesis) that points in the negative gradient direction. This functional gradient view of boosting has led to the development of boosting algorithms in many areas of machine learning and statistics beyond regression and classification.
Contents

1 Informal introduction
2 Algorithm
3 Gradient tree boosting
    3.1 Size of trees
4 Regularization
    4.1 Shrinkage
    4.2 Stochastic gradient boosting
    4.3 Number of observations in leaves
    4.4 Penalize Complexity of Tree
5 Usage
6 Names
7 See also
8 References
Informal introduction

(This section follows the exposition of gradient boosting by Li.[6])
Like other boosting methods, gradient boosting combines weak learners into a single strong learner, in an iterative fashion. It is easiest to explain in the least-squares regression setting, where the goal is to learn a model F that predicts values ŷ = F(x), minimizing the mean squared error (ŷ - y)^2 to the true values y (averaged over some training set).
At each stage m (1 ≤ m ≤ M) of gradient boosting, it may be assumed that there is some imperfect model F_m (at the outset, a very weak model that just predicts the mean y in the training set could be used). The gradient boosting algorithm does not change F_m in any way; instead, it improves on it by constructing a new model that adds an estimator h to provide a better model F_{m+1}(x) = F_m(x) + h(x). The question is now: how to find h? The gradient boosting solution starts with the observation that a perfect h would imply

    F_{m+1}(x) = F_m(x) + h(x) = y

or, equivalently,

    h(x) = y - F_m(x).
Therefore, gradient boosting will fit h to the residual y - F_m(x). Like in other boosting variants, each F_{m+1} learns to correct its predecessor F_m. A generalization of this idea to loss functions other than squared error (and to classification and ranking problems) follows from the observation that the residuals y - F(x) are the negative gradients of the squared error loss function \frac{1}{2}(y - F(x))^2. So, gradient boosting is a gradient descent algorithm, and generalizing it entails "plugging in" a different loss and its gradient.
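The residual-fitting idea can be made concrete with a minimal numeric sketch (pure Python; the data and the deliberately crude "weak learner", which just averages residuals on each side of a fixed split, are invented for illustration):

```python
# Minimal sketch of the residual-fitting idea under squared error loss.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.2, 1.9, 3.1, 4.2]

# Stage 0: a very weak model that just predicts the mean of y.
F = [sum(ys) / len(ys)] * len(xs)

for stage in range(3):
    # h is fit to the current residuals y - F_m(x) ...
    residuals = [y - f for y, f in zip(ys, F)]
    left = [r for x, r in zip(xs, residuals) if x < 2.5]
    right = [r for x, r in zip(xs, residuals) if x >= 2.5]
    h = [sum(left) / len(left) if x < 2.5 else sum(right) / len(right)
         for x in xs]
    # ... and added to the model: F_{m+1}(x) = F_m(x) + h(x).
    F = [f + hx for f, hx in zip(F, h)]

mse = sum((y - f) ** 2 for y, f in zip(ys, F)) / len(ys)
```

Because the split is fixed, this toy converges after one stage; with a real learner that is refit at every stage, each correction keeps shrinking the residuals.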
Algorithm
In many supervised learning problems one has an output variable y and a vector of input variables x, connected together via a joint probability distribution P(x, y). Using a training set \{(x_1, y_1), \ldots, (x_n, y_n)\} of known values of x and corresponding values of y, the goal is to find an approximation \hat{F}(x) to a function F^*(x) that minimizes the expected value of some specified loss function L(y, F(x)):

    \hat{F} = \arg\min_F \, \mathbb{E}_{x,y}[L(y, F(x))].
The gradient boosting method assumes a real-valued y and seeks an approximation \hat{F}(x) in the form of a weighted sum of functions h_i(x) from some class \mathcal{H}, called base (or weak) learners:

    F(x) = \sum_{i=1}^M \gamma_i h_i(x) + \mathrm{const}.
In accordance with the empirical risk minimization principle, the method tries to find an approximation \hat{F}(x) that minimizes the average value of the loss function on the training set. It does so by starting with a model consisting of a constant function F_0(x), and incrementally expanding it in a greedy fashion:

    F_0(x) = \arg\min_\gamma \sum_{i=1}^n L(y_i, \gamma),

    F_m(x) = F_{m-1}(x) + \arg\min_{f \in \mathcal{H}} \sum_{i=1}^n L\big(y_i, F_{m-1}(x_i) + f(x_i)\big),

where f is restricted to be a function from the class \mathcal{H} of base learner functions.
However, the problem of choosing at each step the best f for an arbitrary loss function L is a hard optimization problem in general, and so we'll "cheat" by solving a much easier problem instead.
The idea is to apply a steepest descent step to this minimization problem. If we only cared about predictions at the points of the training set, and f were unrestricted, we'd update the model per the following equations, where we view L(y, f) not as a functional of f, but as a function of the vector of values f(x_1), \ldots, f(x_n):

    F_m(x) = F_{m-1}(x) - \gamma_m \sum_{i=1}^n \nabla_{F_{m-1}} L\big(y_i, F_{m-1}(x_i)\big),

    \gamma_m = \arg\min_\gamma \sum_{i=1}^n L\Big(y_i, F_{m-1}(x_i) - \gamma \nabla_{F_{m-1}} L\big(y_i, F_{m-1}(x_i)\big)\Big).

But as f must come from a restricted class of functions (that's what allows us to generalize), we'll just choose the one that most closely approximates the gradient of L. Having chosen f, the multiplier γ is then selected using line search just as shown in the second equation above.
In pseudocode, the generic gradient boosting method is:[2][7]
Input: training set \{(x_i, y_i)\}_{i=1}^n, a differentiable loss function L(y, F(x)), number of iterations M.

Algorithm:

1. Initialize model with a constant value:

       F_0(x) = \arg\min_\gamma \sum_{i=1}^n L(y_i, \gamma)

2. For m = 1 to M:

   1. Compute so-called pseudo-residuals:

          r_{im} = -\left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F(x) = F_{m-1}(x)}  for i = 1, \ldots, n

   2. Fit a base learner h_m(x) to the pseudo-residuals, i.e. train it using the training set \{(x_i, r_{im})\}_{i=1}^n.
   3. Compute the multiplier \gamma_m by solving the following one-dimensional optimization problem:

          \gamma_m = \arg\min_\gamma \sum_{i=1}^n L\big(y_i, F_{m-1}(x_i) + \gamma h_m(x_i)\big)

   4. Update the model:

          F_m(x) = F_{m-1}(x) + \gamma_m h_m(x)

3. Output F_M(x).
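The pseudocode above can be sketched in plain Python (an illustrative toy, not an efficient implementation; `fit_stump`, the coarse grid line search, and the data are all invented for this example). Note how the loss enters only through the pointwise loss and its negative gradient:

```python
def fit_stump(xs, targets):
    """Least-squares one-split regression stump on 1-D inputs."""
    best = None
    for t in sorted(set(xs))[:-1]:  # last threshold would leave one side empty
        left = [r for x, r in zip(xs, targets) if x <= t]
        right = [r for x, r in zip(xs, targets) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - (lm if x <= t else rm)) ** 2
                  for x, r in zip(xs, targets))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def gradient_boost(xs, ys, loss, neg_gradient, M=10):
    grid = [g / 100.0 for g in range(-500, 501)]  # crude 1-D search space
    # Step 1: initialize with the constant minimizing total loss.
    F = [min(grid, key=lambda g: sum(loss(y, g) for y in ys))] * len(xs)
    for m in range(M):
        # Step 2.1: pseudo-residuals, the negative gradient at F_{m-1}.
        r = [neg_gradient(y, f) for y, f in zip(ys, F)]
        # Step 2.2: fit the base learner h_m to the pseudo-residuals.
        h = fit_stump(xs, r)
        # Step 2.3: line search for the multiplier gamma_m.
        gamma = min(grid, key=lambda g: sum(
            loss(y, f + g * h(x)) for x, y, f in zip(xs, ys, F)))
        # Step 2.4: update the model F_m = F_{m-1} + gamma_m * h_m.
        F = [f + gamma * h(x) for f, x in zip(F, xs)]
    return F  # predictions of F_M on the training points

# Squared error: L = (y - F)^2 / 2, so the negative gradient is y - F.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.1, 1.9, 3.2, 3.8, 5.1]
F = gradient_boost(xs, ys, lambda y, f: (y - f) ** 2 / 2, lambda y, f: y - f)
```

Swapping in, say, absolute loss and its negative gradient (the sign of the residual) would turn the same loop into least-absolute-deviation boosting.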
Gradient tree boosting
Gradient boosting is typically used with decision trees (especially CART trees) of a fixed size as base learners. For this special case, Friedman proposes a modification to the gradient boosting method which improves the quality of fit of each base learner.
Generic gradient boosting at the m-th step would fit a decision tree h_m(x) to pseudo-residuals. Let J_m be the number of its leaves. The tree partitions the input space into J_m disjoint regions R_{1m}, \ldots, R_{J_m m} and predicts a constant value in each region. Using the indicator notation, the output of h_m(x) for input x can be written as the sum:

    h_m(x) = \sum_{j=1}^{J_m} b_{jm} \mathbf{1}_{R_{jm}}(x),

where b_{jm} is the value predicted in the region R_{jm}.[8]
Then the coefficients b_{jm} are multiplied by some value \gamma_m, chosen using line search so as to minimize the loss function, and the model is updated as follows:

    F_m(x) = F_{m-1}(x) + \gamma_m h_m(x), \qquad \gamma_m = \arg\min_\gamma \sum_{i=1}^n L\big(y_i, F_{m-1}(x_i) + \gamma h_m(x_i)\big).
Friedman proposes to modify this algorithm so that it chooses a separate optimal value \gamma_{jm} for each of the tree's regions, instead of a single \gamma_m for the whole tree. He calls the modified algorithm "TreeBoost". The coefficients b_{jm} from the tree-fitting procedure can then be simply discarded, and the model update rule becomes:

    F_m(x) = F_{m-1}(x) + \sum_{j=1}^{J_m} \gamma_{jm} \mathbf{1}_{R_{jm}}(x), \qquad \gamma_{jm} = \arg\min_\gamma \sum_{x_i \in R_{jm}} L\big(y_i, F_{m-1}(x_i) + \gamma\big).
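The per-region line search that distinguishes TreeBoost can be sketched as follows (illustrative pure Python; `treeboost_update`, the grid search, and the data are invented, and the tree's leaf regions are assumed to be given as lists of sample indices):

```python
def treeboost_update(F, ys, regions, loss):
    """One TreeBoost-style update: a separate step gamma_jm per leaf region.

    F       -- current predictions F_{m-1}(x_i)
    ys      -- target values y_i
    regions -- disjoint leaf regions R_jm, each a list of sample indices
    loss    -- pointwise loss L(y, F)
    """
    grid = [g / 100.0 for g in range(-1500, 1501)]
    new_F = list(F)
    for region in regions:
        # gamma_jm = argmin_g  sum over x_i in R_jm of L(y_i, F_{m-1}(x_i) + g)
        gamma = min(grid,
                    key=lambda g: sum(loss(ys[i], F[i] + g) for i in region))
        for i in region:
            new_F[i] = F[i] + gamma
    return new_F

# With absolute loss, the optimal per-region step is a median of the
# region's residuals: here any value in [1, 2] (resp. [10, 11]) is optimal,
# and the ascending grid returns the left endpoint of each interval.
ys = [1.0, 2.0, 10.0, 11.0]
F0 = [0.0, 0.0, 0.0, 0.0]
F1 = treeboost_update(F0, ys, [[0, 1], [2, 3]], lambda y, f: abs(y - f))
```

A single whole-tree \gamma_m could not move both regions by different amounts; the per-region steps are what make the modification worthwhile for losses like absolute error.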
Size of trees
J, the number of terminal nodes in trees, is the method's parameter which can be adjusted for a data set at hand. It controls the maximum allowed level of interaction between variables in the model. With J = 2 (decision stumps), no interaction between variables is allowed. With J = 3 the model may include effects of the interaction between up to two variables, and so on.
Hastie et al.[7] comment that typically 4 ≤ J ≤ 8 works well for boosting and results are fairly insensitive to the choice of J in this range; J = 2 is insufficient for many applications, and J > 10 is unlikely to be required.
Regularization
Fitting the training set too closely can lead to degradation of the model's generalization ability. Several so-called regularization techniques reduce this overfitting effect by constraining the fitting procedure.

One natural regularization parameter is the number of gradient boosting iterations M (i.e. the number of trees in the model when the base learner is a decision tree). Increasing M reduces the error on the training set, but setting it too high may lead to overfitting. An optimal value of M is often selected by monitoring prediction error on a separate validation data set. Besides controlling M, several other regularization techniques are used.
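Selecting M by monitoring a validation set can be sketched like this (a toy pure-Python illustration; the data, the fixed-split stage, and all names are invented):

```python
def boost_stage(preds, xs, ys):
    """One squared-error boosting stage: averages the residuals on each
    side of a fixed split (a deliberately crude base learner)."""
    res = [y - p for y, p in zip(ys, preds)]
    left = [r for x, r in zip(xs, res) if x <= 2.5]
    right = [r for x, r in zip(xs, res) if x > 2.5]
    lm, rm = sum(left) / len(left), sum(right) / len(right)
    return lambda x: lm if x <= 2.5 else rm

train_x, train_y = [1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0]
val_x, val_y = [1.5, 3.5], [1.4, 3.6]

stages, train_pred, val_errors = [], [0.0] * len(train_x), []
for m in range(10):
    h = boost_stage(train_pred, train_x, train_y)
    stages.append(h)
    train_pred = [p + h(x) for p, x in zip(train_pred, train_x)]
    # Validation error of the model after m+1 iterations.
    val_pred = [sum(s(x) for s in stages) for x in val_x]
    val_errors.append(sum((y - p) ** 2 for y, p in zip(val_y, val_pred)))

# Choose the iteration count with the lowest validation error.
best_M = 1 + min(range(len(val_errors)), key=val_errors.__getitem__)
```

In practice the validation curve typically decreases and then rises again as the model starts to overfit; best_M is taken at the minimum.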
Shrinkage
An important part of the gradient boosting method is regularization by shrinkage, which consists in modifying the update rule as follows:

    F_m(x) = F_{m-1}(x) + \nu \cdot \gamma_m h_m(x), \qquad 0 < \nu \le 1,

where the parameter \nu is called the "learning rate".
Empirically it has been found that using small learning rates (such as \nu < 0.1) yields dramatic improvements in a model's generalization ability over gradient boosting without shrinking (\nu = 1).[7] However, it comes at the price of increased computational time both during training and querying: a lower learning rate requires more iterations.
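The shrinkage modification touches only the update rule. The following pure-Python sketch (invented names and data) shows why a small ν needs more iterations: each step removes only a fraction ν of the remaining residual, so the residual shrinks geometrically rather than in one jump.

```python
def shrunken_update(F, h_values, gamma, nu):
    """Shrinkage update: F_m(x) = F_{m-1}(x) + nu * gamma_m * h_m(x)."""
    return [f + nu * gamma * h for f, h in zip(F, h_values)]

ys = [2.0, 4.0]
F = [0.0, 0.0]
for _ in range(50):
    # An idealized base learner that reproduces the residuals exactly;
    # with nu = 1 a single iteration would fit the training data.
    h = [y - f for y, f in zip(ys, F)]
    F = shrunken_update(F, h, gamma=1.0, nu=0.1)

# After 50 iterations, a fraction 0.9**50 (about 0.5%) of the original
# residual remains.
```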
Stochastic gradient boosting
Soon after the introduction of gradient boosting, Friedman proposed a minor modification to the algorithm, motivated by Breiman's bagging method.[3] Specifically, he proposed that at each iteration of the algorithm, a base learner should be fit on a subsample of the training set drawn at random without replacement.[9] Friedman observed a substantial improvement in gradient boosting's accuracy with this modification.
The subsample size is some constant fraction f of the size of the training set. When f = 1, the algorithm is deterministic and identical to the one described above. Smaller values of f introduce randomness into the algorithm and help prevent overfitting, acting as a kind of regularization. The algorithm also becomes faster, because regression trees have to be fit to smaller data sets at each iteration. Friedman[3] obtained that 0.5 ≤ f ≤ 0.8 leads to good results for small and moderate sized training sets. Therefore, f is typically set to 0.5, meaning that one half of the training set is used to build each base learner.
Also, like in bagging, subsampling allows one to define an out-of-bag estimate of the prediction performance improvement by evaluating predictions on those observations which were not used in the building of the next base learner. Out-of-bag estimates help avoid the need for an independent validation data set, but often underestimate actual performance improvement and the optimal number of iterations.[10]
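The per-iteration subsampling step, and the out-of-bag indices it leaves behind, can be sketched as (pure Python; the function name is invented):

```python
import random

def subsample_indices(n, f, rng):
    """Draw a fraction f of {0, ..., n-1} without replacement; everything
    not drawn is 'out-of-bag' for this iteration."""
    k = max(1, int(round(f * n)))
    in_bag = rng.sample(range(n), k)
    chosen = set(in_bag)
    oob = [i for i in range(n) if i not in chosen]
    return in_bag, oob

rng = random.Random(0)  # fixed seed so the illustration is reproducible
in_bag, oob = subsample_indices(10, 0.5, rng)
# This iteration's base learner is fit on in_bag only; evaluating the
# update on oob gives the out-of-bag estimate of the improvement.
```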
Number of observations in leaves
Gradient tree boosting implementations often also use regularization by limiting the minimum number of observations in trees' terminal nodes (this parameter is called n.minobsinnode in the R gbm package[10]). It is used in the tree-building process by ignoring any splits that lead to nodes containing fewer than this number of training set instances.
Imposing this limit helps to reduce the variance in predictions at leaves.
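The split-rejection rule can be sketched inside a one-split tree builder (illustrative pure Python; `best_split` and the data are invented):

```python
def best_split(xs, ys, min_obs):
    """Least-squares stump threshold, ignoring any split that would leave
    a node with fewer than min_obs training instances."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if len(left) < min_obs or len(right) < min_obs:
            continue  # rejected: a resulting node would be too small
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((y - (lm if x <= t else rm)) ** 2 for x, y in zip(xs, ys))
        if best is None or err < best[0]:
            best = (err, t)
    return None if best is None else best[1]

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0, 1.0, 1.0, 1.0, 9.0]
# With min_obs = 1 the stump isolates the outlier in a single-point leaf
# (split at 4.0); with min_obs = 2 that high-variance leaf is disallowed
# and the split moves to 3.0.
```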
Penalize Complexity of Tree
Another useful regularization technique for gradient boosted trees is to penalize the complexity of the learned model.[11] The model complexity can be defined as proportional to the number of leaves in the learned trees. The joint optimization of loss and model complexity corresponds to a post-pruning algorithm that removes branches which fail to reduce the loss by a threshold. Other kinds of regularization, such as an l2 penalty on the leaf values, can also be added to avoid overfitting.
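The penalized objective can be sketched as (pure Python; the specific penalty weights `alpha` and `lam` and the data are invented for illustration):

```python
def penalized_objective(ys, preds, leaf_values, alpha=1.0, lam=0.1):
    """Training loss plus complexity: a cost of alpha per leaf and an
    l2 penalty of lam on the leaf values."""
    loss = sum((y - p) ** 2 for y, p in zip(ys, preds))
    return (loss + alpha * len(leaf_values)
            + lam * sum(v * v for v in leaf_values))

# Post-pruning view: a split survives only if the loss reduction it buys
# exceeds the complexity cost it adds.
ys = [1.0, 1.0, 1.1, 1.1]
one_leaf = penalized_objective(ys, [1.05] * 4, [1.05])
two_leaves = penalized_objective(ys, [1.0, 1.0, 1.1, 1.1], [1.0, 1.1])
# Here the tiny loss reduction does not justify a second leaf, so the
# penalized objective prefers the pruned, single-leaf tree.
```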
Usage
Recently, gradient boosting has gained some popularity in the field of learning to rank. The commercial web search engines Yahoo[12] and Yandex[13] use variants of gradient boosting in their machine-learned ranking engines.
Names
The method goes by a variety of names. Friedman introduced his regression technique as a "Gradient Boosting Machine" (GBM).[2] Mason, Baxter et al. described the generalized abstract class of algorithms as "functional gradient boosting".[4][5]
A popular open-source implementation[10] for R calls it "Generalized Boosting Model". Commercial implementations from Salford Systems use the names "Multiple Additive Regression Trees" (MART) and TreeNet, both trademarked.
See also

AdaBoost
Random forest
References

1. Breiman, L. "Arcing The Edge" (http://statistics.berkeley.edu/sites/default/files/techreports/486.pdf) (June 1997)
2. Friedman, J. H. "Greedy Function Approximation: A Gradient Boosting Machine" (http://www-stat.stanford.edu/~jhf/ftp/trebst.pdf) (February 1999)
3. Friedman, J. H. "Stochastic Gradient Boosting" (https://statweb.stanford.edu/~jhf/ftp/stobst.pdf) (March 1999)
4. Mason, L.; Baxter, J.; Bartlett, P. L.; Frean, Marcus (1999). "Boosting Algorithms as Gradient Descent" (http://papers.nips.cc/paper/1766-boosting-algorithms-as-gradient-descent.pdf) (PDF). In S. A. Solla, T. K. Leen and K. Müller. Advances in Neural Information Processing Systems 12. MIT Press. pp. 512–518.
5. Mason, L.; Baxter, J.; Bartlett, P. L.; Frean, Marcus (May 1999). Boosting Algorithms as Gradient Descent in Function Space (http://maths.dur.ac.uk/~dma6kp/pdf/face_recognition/Boosting/Mason99AnyboostLong.pdf) (PDF).
6. Cheng Li. "A Gentle Introduction to Gradient Boosting" (http://www.ccs.neu.edu/home/vip/teach/MLcourse/4_boosting/slides/gradient_boosting.pdf) (PDF). Northeastern University. Retrieved 19 August 2014.
7. Hastie, T.; Tibshirani, R.; Friedman, J. H. (2009). "10. Boosting and Additive Trees". The Elements of Statistical Learning (http://www-stat.stanford.edu/~tibs/ElemStatLearn/) (2nd ed.). New York: Springer. pp. 337–384. ISBN 0387848576.
8. Note: in the case of usual CART trees, the trees are fitted using least-squares loss, and so the coefficient b_{jm} for the region R_{jm} is equal to just the value of the output variable, averaged over all training instances in R_{jm}.
9. Note that this is different from bagging, which samples with replacement because it uses samples of the same size as the training set.
10. Ridgeway, Greg (2007). Generalized Boosted Models: A guide to the gbm package (http://cran.r-project.org/web/packages/gbm/gbm.pdf).
11. Tianqi Chen. Introduction to Boosted Trees (http://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf).
12. Cossock, David and Zhang, Tong (2008). Statistical Analysis of Bayes Optimal Subset Ranking (http://www.stat.rutgers.edu/~tzhang/papers/it08ranking.pdf), page 14.
13. Yandex corporate blog entry about new ranking model "Snezhinsk" (http://webmaster.ya.ru/replies.xml?item_no=5707&ncrnd=5118) (in Russian)
Retrieved from "https://en.wikipedia.org/w/index.php?title=Gradient_boosting&oldid=678013581"

Categories: Decision trees | Ensemble learning
This page was last modified on 26 August 2015, at 22:37. Text is available under the Creative Commons Attribution-ShareAlike License; additional terms may apply. By using this site, you agree to the Terms of Use and Privacy Policy. Wikipedia® is a registered trademark of the Wikimedia Foundation, Inc., a non-profit organization.