Gradient boosting - Wikipedia, the free encyclopedia
8/28/2015 — Gradient boosting - Wikipedia, the free encyclopedia
https://en.wikipedia.org/wiki/Gradient_boosting
Gradient boosting
From Wikipedia, the free encyclopedia
Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. It builds the model in a stage-wise fashion like other boosting methods do, and it generalizes them by allowing optimization of an arbitrary differentiable loss function.
The idea of gradient boosting originated in the observation by Leo Breiman[1] that boosting can be interpreted as an optimization algorithm on a suitable cost function. Explicit regression gradient boosting algorithms were subsequently developed by Jerome H. Friedman,[2][3] simultaneously with the more general functional gradient boosting perspective of Llew Mason, Jonathan Baxter, Peter Bartlett and Marcus Frean.[4][5] The latter two papers introduced the abstract view of boosting algorithms as iterative functional gradient descent algorithms: that is, algorithms that optimize a cost functional over function space by iteratively choosing a function (weak hypothesis) that points in the negative gradient direction. This functional gradient view of boosting has led to the development of boosting algorithms in many areas of machine learning and statistics beyond regression and classification.
Contents

1 Informal introduction
2 Algorithm
3 Gradient tree boosting
    3.1 Size of trees
4 Regularization
    4.1 Shrinkage
    4.2 Stochastic gradient boosting
    4.3 Number of observations in leaves
    4.4 Penalize Complexity of Tree
5 Usage
6 Names
7 See also
8 References
Informal introduction

(This section follows the exposition of gradient boosting by Li.[6])
Like other boosting methods, gradient boosting combines weak learners into a single strong learner, in an iterative fashion. It is easiest to explain in the least-squares regression setting, where the goal is to learn a model F that predicts values ŷ = F(x), minimizing the mean squared error (ŷ - y)^2 to the true values y (averaged over some training set).
At each stage m (1 ≤ m ≤ M) of gradient boosting, it may be assumed that there is some imperfect model F_m (at the outset, a very weak model that just predicts the mean y in the training set could be used). The gradient boosting algorithm does not change F_m in any way; instead, it improves on it by constructing a new model that adds an estimator h to provide a better model F_{m+1}(x) = F_m(x) + h(x). The question is now: how to find h? The gradient boosting solution starts with the observation that a perfect h would imply

    F_{m+1}(x) = F_m(x) + h(x) = y

or, equivalently,

    h(x) = y - F_m(x).
Therefore, gradient boosting will fit h to the residual y - F_m(x). Like in other boosting variants, each F_{m+1} learns to correct its predecessor F_m. A generalization of this idea to loss functions other than squared error (and to classification and ranking problems) follows from the observation that the residuals y - F(x) are the negative gradients of the squared error loss function \frac{1}{2}(y - F(x))^2. So, gradient boosting is a gradient descent algorithm, and generalizing it entails "plugging in" a different loss and its gradient.
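The residual-fitting idea can be made concrete with a minimal numeric sketch (pure Python; the data and the deliberately crude "weak learner", which just averages residuals on each side of a fixed split, are invented for illustration):

```python
# Minimal sketch of the residual-fitting idea under squared error loss.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.2, 1.9, 3.1, 4.2]

# Stage 0: a very weak model that just predicts the mean of y.
F = [sum(ys) / len(ys)] * len(xs)

for stage in range(3):
    # h is fit to the current residuals y - F_m(x) ...
    residuals = [y - f for y, f in zip(ys, F)]
    left = [r for x, r in zip(xs, residuals) if x < 2.5]
    right = [r for x, r in zip(xs, residuals) if x >= 2.5]
    h = [sum(left) / len(left) if x < 2.5 else sum(right) / len(right)
         for x in xs]
    # ... and added to the model: F_{m+1}(x) = F_m(x) + h(x).
    F = [f + hx for f, hx in zip(F, h)]

mse = sum((y - f) ** 2 for y, f in zip(ys, F)) / len(ys)
```

Because the split is fixed, this toy converges after one stage; with a real learner that is refit at every stage, each correction keeps shrinking the residuals.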
Algorithm
In many supervised learning problems one has an output variable y and a vector of input variables x, connected together via a joint probability distribution P(x, y). Using a training set \{(x_1, y_1), \ldots, (x_n, y_n)\} of known values of x and corresponding values of y, the goal is to find an approximation \hat{F}(x) to a function F^*(x) that minimizes the expected value of some specified loss function L(y, F(x)):

    \hat{F} = \arg\min_F \, \mathbb{E}_{x,y}[L(y, F(x))].
The gradient boosting method assumes a real-valued y and seeks an approximation \hat{F}(x) in the form of a weighted sum of functions h_i(x) from some class \mathcal{H}, called base (or weak) learners:

    F(x) = \sum_{i=1}^M \gamma_i h_i(x) + \mathrm{const}.
In accordance with the empirical risk minimization principle, the method tries to find an approximation \hat{F}(x) that minimizes the average value of the loss function on the training set. It does so by starting with a model consisting of a constant function F_0(x), and incrementally expanding it in a greedy fashion:

    F_0(x) = \arg\min_\gamma \sum_{i=1}^n L(y_i, \gamma),

    F_m(x) = F_{m-1}(x) + \arg\min_{f \in \mathcal{H}} \sum_{i=1}^n L\big(y_i, F_{m-1}(x_i) + f(x_i)\big),

where f is restricted to be a function from the class \mathcal{H} of base learner functions.
However, the problem of choosing at each step the best f for an arbitrary loss function L is a hard optimization problem in general, and so we'll "cheat" by solving a much easier problem instead.
The idea is to apply a steepest descent step to this minimization problem. If we only cared about predictions at the points of the training set, and f were unrestricted, we'd update the model per the following equations, where we view L(y, f) not as a functional of f, but as a function of the vector of values f(x_1), \ldots, f(x_n):

    F_m(x) = F_{m-1}(x) - \gamma_m \sum_{i=1}^n \nabla_{F_{m-1}} L\big(y_i, F_{m-1}(x_i)\big),

    \gamma_m = \arg\min_\gamma \sum_{i=1}^n L\Big(y_i, F_{m-1}(x_i) - \gamma \nabla_{F_{m-1}} L\big(y_i, F_{m-1}(x_i)\big)\Big).

But as f must come from a restricted class of functions (that's what allows us to generalize), we'll just choose the one that most closely approximates the gradient of L. Having chosen f, the multiplier γ is then selected using line search just as shown in the second equation above.
In pseudocode, the generic gradient boosting method is:[2][7]
Input: training set \{(x_i, y_i)\}_{i=1}^n, a differentiable loss function L(y, F(x)), number of iterations M.

Algorithm:

1. Initialize model with a constant value:

       F_0(x) = \arg\min_\gamma \sum_{i=1}^n L(y_i, \gamma)

2. For m = 1 to M:

   1. Compute so-called pseudo-residuals:

          r_{im} = -\left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F(x) = F_{m-1}(x)}  for i = 1, \ldots, n

   2. Fit a base learner h_m(x) to the pseudo-residuals, i.e. train it using the training set \{(x_i, r_{im})\}_{i=1}^n.
   3. Compute the multiplier \gamma_m by solving the following one-dimensional optimization problem:

          \gamma_m = \arg\min_\gamma \sum_{i=1}^n L\big(y_i, F_{m-1}(x_i) + \gamma h_m(x_i)\big)

   4. Update the model:

          F_m(x) = F_{m-1}(x) + \gamma_m h_m(x)

3. Output F_M(x).
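The pseudocode above can be sketched in plain Python (an illustrative toy, not an efficient implementation; `fit_stump`, the coarse grid line search, and the data are all invented for this example). Note how the loss enters only through the pointwise loss and its negative gradient:

```python
def fit_stump(xs, targets):
    """Least-squares one-split regression stump on 1-D inputs."""
    best = None
    for t in sorted(set(xs))[:-1]:  # last threshold would leave one side empty
        left = [r for x, r in zip(xs, targets) if x <= t]
        right = [r for x, r in zip(xs, targets) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - (lm if x <= t else rm)) ** 2
                  for x, r in zip(xs, targets))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def gradient_boost(xs, ys, loss, neg_gradient, M=10):
    grid = [g / 100.0 for g in range(-500, 501)]  # crude 1-D search space
    # Step 1: initialize with the constant minimizing total loss.
    F = [min(grid, key=lambda g: sum(loss(y, g) for y in ys))] * len(xs)
    for m in range(M):
        # Step 2.1: pseudo-residuals, the negative gradient at F_{m-1}.
        r = [neg_gradient(y, f) for y, f in zip(ys, F)]
        # Step 2.2: fit the base learner h_m to the pseudo-residuals.
        h = fit_stump(xs, r)
        # Step 2.3: line search for the multiplier gamma_m.
        gamma = min(grid, key=lambda g: sum(
            loss(y, f + g * h(x)) for x, y, f in zip(xs, ys, F)))
        # Step 2.4: update the model F_m = F_{m-1} + gamma_m * h_m.
        F = [f + gamma * h(x) for f, x in zip(F, xs)]
    return F  # predictions of F_M on the training points

# Squared error: L = (y - F)^2 / 2, so the negative gradient is y - F.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.1, 1.9, 3.2, 3.8, 5.1]
F = gradient_boost(xs, ys, lambda y, f: (y - f) ** 2 / 2, lambda y, f: y - f)
```

Swapping in, say, absolute loss and its negative gradient (the sign of the residual) would turn the same loop into least-absolute-deviation boosting.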
Gradient tree boosting
Gradient boosting is typically used with decision trees (especially CART trees) of a fixed size as base learners. For this special case, Friedman proposes a modification to the gradient boosting method which improves the quality of fit of each base learner.
Generic gradient boosting at the m-th step would fit a decision tree h_m(x) to pseudo-residuals. Let J_m be the number of its leaves. The tree partitions the input space into J_m disjoint regions R_{1m}, \ldots, R_{J_m m} and predicts a constant value in each region. Using the indicator notation, the output of h_m(x) for input x can be written as the sum:

    h_m(x) = \sum_{j=1}^{J_m} b_{jm} \mathbf{1}_{R_{jm}}(x),

where b_{jm} is the value predicted in the region R_{jm}.[8]
Then the coefficients b_{jm} are multiplied by some value \gamma_m, chosen using line search so as to minimize the loss function, and the model is updated as follows:

    F_m(x) = F_{m-1}(x) + \gamma_m h_m(x), \qquad \gamma_m = \arg\min_\gamma \sum_{i=1}^n L\big(y_i, F_{m-1}(x_i) + \gamma h_m(x_i)\big).
Friedman proposes to modify this algorithm so that it chooses a separate optimal value \gamma_{jm} for each of the tree's regions, instead of a single \gamma_m for the whole tree. He calls the modified algorithm "TreeBoost". The coefficients b_{jm} from the tree-fitting procedure can then be simply discarded, and the model update rule becomes:

    F_m(x) = F_{m-1}(x) + \sum_{j=1}^{J_m} \gamma_{jm} \mathbf{1}_{R_{jm}}(x), \qquad \gamma_{jm} = \arg\min_\gamma \sum_{x_i \in R_{jm}} L\big(y_i, F_{m-1}(x_i) + \gamma\big).
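The per-region line search that distinguishes TreeBoost can be sketched as follows (illustrative pure Python; `treeboost_update`, the grid search, and the data are invented, and the tree's leaf regions are assumed to be given as lists of sample indices):

```python
def treeboost_update(F, ys, regions, loss):
    """One TreeBoost-style update: a separate step gamma_jm per leaf region.

    F       -- current predictions F_{m-1}(x_i)
    ys      -- target values y_i
    regions -- disjoint leaf regions R_jm, each a list of sample indices
    loss    -- pointwise loss L(y, F)
    """
    grid = [g / 100.0 for g in range(-1500, 1501)]
    new_F = list(F)
    for region in regions:
        # gamma_jm = argmin_g  sum over x_i in R_jm of L(y_i, F_{m-1}(x_i) + g)
        gamma = min(grid,
                    key=lambda g: sum(loss(ys[i], F[i] + g) for i in region))
        for i in region:
            new_F[i] = F[i] + gamma
    return new_F

# With absolute loss, the optimal per-region step is a median of the
# region's residuals: here any value in [1, 2] (resp. [10, 11]) is optimal,
# and the ascending grid returns the left endpoint of each interval.
ys = [1.0, 2.0, 10.0, 11.0]
F0 = [0.0, 0.0, 0.0, 0.0]
F1 = treeboost_update(F0, ys, [[0, 1], [2, 3]], lambda y, f: abs(y - f))
```

A single whole-tree \gamma_m could not move both regions by different amounts; the per-region steps are what make the modification worthwhile for losses like absolute error.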
Size of trees
J, the number of terminal nodes in trees, is the method's parameter which can be adjusted for a data set at hand. It controls the maximum allowed level of interaction between variables in the model. With J = 2 (decision stumps), no interaction between variables is allowed. With J = 3 the model may include effects of the interaction between up to two variables, and so on.
Hastie et al.[7] comment that typically 4 ≤ J ≤ 8 works well for boosting and results are fairly insensitive to the choice of J in this range; J = 2 is insufficient for many applications, and J > 10 is unlikely to be required.
Regularization
Fitting the training set too closely can lead to degradation of the model's generalization ability. Several so-called regularization techniques reduce this overfitting effect by constraining the fitting procedure.

One natural regularization parameter is the number of gradient boosting iterations M (i.e. the number of trees in the model when the base learner is a decision tree). Increasing M reduces the error on the training set, but setting it too high may lead to overfitting. An optimal value of M is often selected by monitoring prediction error on a separate validation data set. Besides controlling M, several other regularization techniques are used.
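Selecting M by monitoring a validation set can be sketched like this (a toy pure-Python illustration; the data, the fixed-split stage, and all names are invented):

```python
def boost_stage(preds, xs, ys):
    """One squared-error boosting stage: averages the residuals on each
    side of a fixed split (a deliberately crude base learner)."""
    res = [y - p for y, p in zip(ys, preds)]
    left = [r for x, r in zip(xs, res) if x <= 2.5]
    right = [r for x, r in zip(xs, res) if x > 2.5]
    lm, rm = sum(left) / len(left), sum(right) / len(right)
    return lambda x: lm if x <= 2.5 else rm

train_x, train_y = [1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0]
val_x, val_y = [1.5, 3.5], [1.4, 3.6]

stages, train_pred, val_errors = [], [0.0] * len(train_x), []
for m in range(10):
    h = boost_stage(train_pred, train_x, train_y)
    stages.append(h)
    train_pred = [p + h(x) for p, x in zip(train_pred, train_x)]
    # Validation error of the model after m+1 iterations.
    val_pred = [sum(s(x) for s in stages) for x in val_x]
    val_errors.append(sum((y - p) ** 2 for y, p in zip(val_y, val_pred)))

# Choose the iteration count with the lowest validation error.
best_M = 1 + min(range(len(val_errors)), key=val_errors.__getitem__)
```

In practice the validation curve typically decreases and then rises again as the model starts to overfit; best_M is taken at the minimum.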
Shrinkage
An important part of the gradient boosting method is regularization by shrinkage, which consists in modifying the update rule as follows:

    F_m(x) = F_{m-1}(x) + \nu \cdot \gamma_m h_m(x), \qquad 0 < \nu \le 1,

where the parameter \nu is called the "learning rate".
Empirically it has been found that using small learning rates (such as \nu < 0.1) yields dramatic improvements in a model's generalization ability over gradient boosting without shrinking (\nu = 1).[7] However, it comes at the price of increased computational time both during training and querying: a lower learning rate requires more iterations.
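The shrinkage modification touches only the update rule. The following pure-Python sketch (invented names and data) shows why a small ν needs more iterations: each step removes only a fraction ν of the remaining residual, so the residual shrinks geometrically rather than in one jump.

```python
def shrunken_update(F, h_values, gamma, nu):
    """Shrinkage update: F_m(x) = F_{m-1}(x) + nu * gamma_m * h_m(x)."""
    return [f + nu * gamma * h for f, h in zip(F, h_values)]

ys = [2.0, 4.0]
F = [0.0, 0.0]
for _ in range(50):
    # An idealized base learner that reproduces the residuals exactly;
    # with nu = 1 a single iteration would fit the training data.
    h = [y - f for y, f in zip(ys, F)]
    F = shrunken_update(F, h, gamma=1.0, nu=0.1)

# After 50 iterations, a fraction 0.9**50 (about 0.5%) of the original
# residual remains.
```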
Stochastic gradient boosting
Soon after the introduction of gradient boosting, Friedman proposed a minor modification to the algorithm, motivated by Breiman's bagging method.[3] Specifically, he proposed that at each iteration of the algorithm, a base learner should be fit on a subsample of the training set drawn at random without replacement.[9] Friedman observed a substantial improvement in gradient boosting's accuracy with this modification.
The subsample size is some constant fraction f of the size of the training set. When f = 1, the algorithm is deterministic and identical to the one described above. Smaller values of f introduce randomness into the algorithm and help prevent overfitting, acting as a kind of regularization. The algorithm also becomes faster, because regression trees have to be fit to smaller data sets at each iteration. Friedman[3] obtained that 0.5 ≤ f ≤ 0.8 leads to good results for small and moderate sized training sets. Therefore, f is typically set to 0.5, meaning that one half of the training set is used to build each base learner.
Also, like in bagging, subsampling allows one to define an out-of-bag estimate of the prediction performance improvement by evaluating predictions on those observations which were not used in the building of the next base learner. Out-of-bag estimates help avoid the need for an independent validation data set, but often underestimate actual performance improvement and the optimal number of iterations.[10]
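The per-iteration subsampling step, and the out-of-bag indices it leaves behind, can be sketched as (pure Python; the function name is invented):

```python
import random

def subsample_indices(n, f, rng):
    """Draw a fraction f of {0, ..., n-1} without replacement; everything
    not drawn is 'out-of-bag' for this iteration."""
    k = max(1, int(round(f * n)))
    in_bag = rng.sample(range(n), k)
    chosen = set(in_bag)
    oob = [i for i in range(n) if i not in chosen]
    return in_bag, oob

rng = random.Random(0)  # fixed seed so the illustration is reproducible
in_bag, oob = subsample_indices(10, 0.5, rng)
# This iteration's base learner is fit on in_bag only; evaluating the
# update on oob gives the out-of-bag estimate of the improvement.
```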
Number of observations in leaves
Gradient tree boosting implementations often also use regularization by limiting the minimum number of observations in trees' terminal nodes (this parameter is called n.minobsinnode in the R gbm package[10]). It is used in the tree-building process by ignoring any splits that lead to nodes containing fewer than this number of training set instances.
Imposing this limit helps to reduce the variance in predictions at leaves.
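The split-rejection rule can be sketched inside a one-split tree builder (illustrative pure Python; `best_split` and the data are invented):

```python
def best_split(xs, ys, min_obs):
    """Least-squares stump threshold, ignoring any split that would leave
    a node with fewer than min_obs training instances."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if len(left) < min_obs or len(right) < min_obs:
            continue  # rejected: a resulting node would be too small
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((y - (lm if x <= t else rm)) ** 2 for x, y in zip(xs, ys))
        if best is None or err < best[0]:
            best = (err, t)
    return None if best is None else best[1]

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0, 1.0, 1.0, 1.0, 9.0]
# With min_obs = 1 the stump isolates the outlier in a single-point leaf
# (split at 4.0); with min_obs = 2 that high-variance leaf is disallowed
# and the split moves to 3.0.
```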
Penalize Complexity of Tree
Another useful regularization technique for gradient boosted trees is to penalize the complexity of the learned model.[11] The model complexity can be defined as proportional to the number of leaves in the learned trees. The joint optimization of loss and model complexity corresponds to a post-pruning algorithm that removes branches which fail to reduce the loss by a threshold. Other kinds of regularization, such as an l2 penalty on the leaf values, can also be added to avoid overfitting.
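The penalized objective can be sketched as (pure Python; the specific penalty weights `alpha` and `lam` and the data are invented for illustration):

```python
def penalized_objective(ys, preds, leaf_values, alpha=1.0, lam=0.1):
    """Training loss plus complexity: a cost of alpha per leaf and an
    l2 penalty of lam on the leaf values."""
    loss = sum((y - p) ** 2 for y, p in zip(ys, preds))
    return (loss + alpha * len(leaf_values)
            + lam * sum(v * v for v in leaf_values))

# Post-pruning view: a split survives only if the loss reduction it buys
# exceeds the complexity cost it adds.
ys = [1.0, 1.0, 1.1, 1.1]
one_leaf = penalized_objective(ys, [1.05] * 4, [1.05])
two_leaves = penalized_objective(ys, [1.0, 1.0, 1.1, 1.1], [1.0, 1.1])
# Here the tiny loss reduction does not justify a second leaf, so the
# penalized objective prefers the pruned, single-leaf tree.
```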
Usage
Recently, gradient boosting has gained some popularity in the field of learning to rank. The commercial web search engines Yahoo[12] and Yandex[13] use variants of gradient boosting in their machine-learned ranking engines.
Names
The method goes by a variety of names. Friedman introduced his regression technique as a "Gradient Boosting Machine" (GBM).[2] Mason, Baxter et al. described the generalized abstract class of algorithms as "functional gradient boosting".[4][5]
A popular open-source implementation[10] for R calls it "Generalized Boosting Model". Commercial implementations from Salford Systems use the names "Multiple Additive Regression Trees" (MART) and TreeNet, both trademarked.
See also

AdaBoost
Random forest
References

1. Breiman, L. "Arcing The Edge" (http://statistics.berkeley.edu/sites/default/files/techreports/486.pdf) (June 1997)
2. Friedman, J. H. "Greedy Function Approximation: A Gradient Boosting Machine" (http://www-stat.stanford.edu/~jhf/ftp/trebst.pdf) (February 1999)
3. Friedman, J. H. "Stochastic Gradient Boosting" (https://statweb.stanford.edu/~jhf/ftp/stobst.pdf) (March 1999)
4. Mason, L.; Baxter, J.; Bartlett, P. L.; Frean, Marcus (1999). "Boosting Algorithms as Gradient Descent" (http://papers.nips.cc/paper/1766-boosting-algorithms-as-gradient-descent.pdf) (PDF). In S. A. Solla, T. K. Leen and K. Müller. Advances in Neural Information Processing Systems 12. MIT Press. pp. 512–518.
5. Mason, L.; Baxter, J.; Bartlett, P. L.; Frean, Marcus (May 1999). Boosting Algorithms as Gradient Descent in Function Space (http://maths.dur.ac.uk/~dma6kp/pdf/face_recognition/Boosting/Mason99AnyboostLong.pdf) (PDF).
6. Cheng Li. "A Gentle Introduction to Gradient Boosting" (http://www.ccs.neu.edu/home/vip/teach/MLcourse/4_boosting/slides/gradient_boosting.pdf) (PDF). Northeastern University. Retrieved 19 August 2014.
7. Hastie, T.; Tibshirani, R.; Friedman, J. H. (2009). "10. Boosting and Additive Trees". The Elements of Statistical Learning (http://www-stat.stanford.edu/~tibs/ElemStatLearn/) (2nd ed.). New York: Springer. pp. 337–384. ISBN 0387848576.
8. Note: in the case of usual CART trees, the trees are fitted using least-squares loss, and so the coefficient b_{jm} for the region R_{jm} is equal to just the value of the output variable, averaged over all training instances in R_{jm}.
9. Note that this is different from bagging, which samples with replacement because it uses samples of the same size as the training set.
10. Ridgeway, Greg (2007). Generalized Boosted Models: A guide to the gbm package (http://cran.r-project.org/web/packages/gbm/gbm.pdf).
11. Tianqi Chen. Introduction to Boosted Trees (http://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf).
12. Cossock, David and Zhang, Tong (2008). Statistical Analysis of Bayes Optimal Subset Ranking (http://www.stat.rutgers.edu/~tzhang/papers/it08ranking.pdf), page 14.
13. Yandex corporate blog entry about new ranking model "Snezhinsk" (http://webmaster.ya.ru/replies.xml?item_no=5707&ncrnd=5118) (in Russian)
Retrieved from "https://en.wikipedia.org/w/index.php?title=Gradient_boosting&oldid=678013581"

Categories: Decision trees | Ensemble learning
This page was last modified on 26 August 2015, at 22:37. Text is available under the Creative Commons Attribution-ShareAlike License; additional terms may apply. By using this site, you agree to the Terms of Use and Privacy Policy. Wikipedia® is a registered trademark of the Wikimedia Foundation, Inc., a non-profit organization.