Hands-On Machine Learning with Scikit-Learn and TensorFlow
Concepts, Tools, and Techniques to Build Intelligent Systems
Aurélien Géron
Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron
Copyright © 2017 Aurélien Géron. All rights reserved.
Printed in the United States of America.
Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.
Editor: Nicole Tache
Production Editor: Nicholas Adams
Copyeditor: Rachel Monaghan
Proofreader: Charles Roumeliotis
Indexer: Wendy Catalano
Interior Designer: David Futato
Cover Designer: Randy Comer
Illustrator: Rebecca Demarest
March 2017: First Edition
Revision History for the First Edition
2017-03-10: First Release
2017-06-09: Second Release
See http://oreilly.com/catalog/errata.csp?isbn=9781491962299 for release details.
The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Hands-On Machine Learning with Scikit-Learn and TensorFlow, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc.
While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.
978-1-491-96229-9
[LSI]
Preface
The Machine Learning Tsunami
In 2006, Geoffrey Hinton et al. published a paper¹ showing how to train a deep neural network capable of recognizing handwritten digits with state-of-the-art precision (>98%). They branded this technique "Deep Learning." Training a deep neural net was widely considered impossible at the time,² and most researchers had abandoned the idea since the 1990s. This paper revived the interest of the scientific community and before long many new papers demonstrated that Deep Learning was not only possible, but capable of mind-blowing achievements that no other Machine Learning (ML) technique could hope to match (with the help of tremendous computing power and great amounts of data). This enthusiasm soon extended to many other areas of Machine Learning.
Fast-forward 10 years and Machine Learning has conquered the industry: it is now at the heart of much of the magic in today's high-tech products, ranking your web search results, powering your smartphone's speech recognition, and recommending videos, beating the world champion at the game of Go. Before you know it, it will be driving your car.
Machine Learning in Your Projects
So naturally you are excited about Machine Learning and you would love to join the party!
Perhaps you would like to give your homemade robot a brain of its own? Make it recognize faces? Or learn to walk around?
Or maybe your company has tons of data (user logs, financial data, production data, machine sensor data, hotline stats, HR reports, etc.), and more than likely you could unearth some hidden gems if you just knew where to look; for example:
Segment customers and find the best marketing strategy for each group
Recommend products for each client based on what similar clients bought
Detect which transactions are likely to be fraudulent
Predict next year's revenue
And more
Whatever the reason, you have decided to learn Machine Learning and implement it in your projects. Great idea!
Objective and Approach
This book assumes that you know close to nothing about Machine Learning. Its goal is to give you the concepts, the intuitions, and the tools you need to actually implement programs capable of learning from data.
We will cover a large number of techniques, from the simplest and most commonly used (such as linear regression) to some of the Deep Learning techniques that regularly win competitions.
Rather than implementing our own toy versions of each algorithm, we will be using actual production-ready Python frameworks:
Scikit-Learn is very easy to use, yet it implements many Machine Learning algorithms efficiently, so it makes for a great entry point to learn Machine Learning.
TensorFlow is a more complex library for distributed numerical computation using data flow graphs. It makes it possible to train and run very large neural networks efficiently by distributing the computations across potentially thousands of multi-GPU servers. TensorFlow was created at Google and supports many of their large-scale Machine Learning applications. It was open-sourced in November 2015.
The book favors a hands-on approach, growing an intuitive understanding of Machine Learning through concrete working examples and just a little bit of theory. While you can read this book without picking up your laptop, we highly recommend you experiment with the code examples available online as Jupyter notebooks at https://github.com/ageron/handson-ml.
Prerequisites
This book assumes that you have some Python programming experience and that you are familiar with Python's main scientific libraries, in particular NumPy, Pandas, and Matplotlib.
Also, if you care about what's under the hood you should have a reasonable understanding of college-level math as well (calculus, linear algebra, probabilities, and statistics).
If you don't know Python yet, http://learnpython.org/ is a great place to start. The official tutorial on python.org is also quite good.
If you have never used Jupyter, Chapter 2 will guide you through installation and the basics: it is a great tool to have in your toolbox.
If you are not familiar with Python's scientific libraries, the provided Jupyter notebooks include a few tutorials. There is also a quick math tutorial for linear algebra.
Roadmap
This book is organized in two parts. Part I, The Fundamentals of Machine Learning, covers the following topics:
What is Machine Learning? What problems does it try to solve? What are the main categories and fundamental concepts of Machine Learning systems?
The main steps in a typical Machine Learning project.
Learning by fitting a model to data.
Optimizing a cost function.
Handling, cleaning, and preparing data.
Selecting and engineering features.
Selecting a model and tuning hyperparameters using cross-validation.
The main challenges of Machine Learning, in particular underfitting and overfitting (the bias/variance tradeoff).
Reducing the dimensionality of the training data to fight the curse of dimensionality.
The most common learning algorithms: Linear and Polynomial Regression, Logistic Regression, k-Nearest Neighbors, Support Vector Machines, Decision Trees, Random Forests, and Ensemble methods.
Part II, Neural Networks and Deep Learning, covers the following topics:
What are neural nets? What are they good for?
Building and training neural nets using TensorFlow.
The most important neural net architectures: feedforward neural nets, convolutional nets, recurrent nets, long short-term memory (LSTM) nets, and autoencoders.
Techniques for training deep neural nets.
Scaling neural networks for huge datasets.
Reinforcement learning.
The first part is based mostly on Scikit-Learn while the second part uses TensorFlow.
CAUTION
Don't jump into deep waters too hastily: while Deep Learning is no doubt one of the most exciting areas in Machine Learning, you should master the fundamentals first. Moreover, most problems can be solved quite well using simpler techniques such as Random Forests and Ensemble methods (discussed in Part I). Deep Learning is best suited for complex problems such as image recognition, speech recognition, or natural language processing, provided you have enough data, computing power, and patience.
Other Resources
Many resources are available to learn about Machine Learning. Andrew Ng's ML course on Coursera and Geoffrey Hinton's course on neural networks and Deep Learning are amazing, although they both require a significant time investment (think months).
There are also many interesting websites about Machine Learning, including of course Scikit-Learn's exceptional User Guide. You may also enjoy Dataquest, which provides very nice interactive tutorials, and ML blogs such as those listed on Quora. Finally, the Deep Learning website has a good list of resources to learn more.
Of course there are also many other introductory books about Machine Learning, in particular:
Joel Grus, Data Science from Scratch (O'Reilly). This book presents the fundamentals of Machine Learning, and implements some of the main algorithms in pure Python (from scratch, as the name suggests).
Stephen Marsland, Machine Learning: An Algorithmic Perspective (Chapman and Hall). This book is a great introduction to Machine Learning, covering a wide range of topics in depth, with code examples in Python (also from scratch, but using NumPy).
Sebastian Raschka, Python Machine Learning (Packt Publishing). Also a great introduction to Machine Learning, this book leverages Python open source libraries (Pylearn2 and Theano).
Yaser S. Abu-Mostafa, Malik Magdon-Ismail, and Hsuan-Tien Lin, Learning from Data (AMLBook). A rather theoretical approach to ML, this book provides deep insights, in particular on the bias/variance tradeoff (see Chapter 4).
Stuart Russell and Peter Norvig, Artificial Intelligence: A Modern Approach, 3rd Edition (Pearson). This is a great (and huge) book covering an incredible amount of topics, including Machine Learning. It helps put ML into perspective.
Finally, a great way to learn is to join ML competition websites such as Kaggle.com: this will allow you to practice your skills on real-world problems, with help and insights from some of the best ML professionals out there.
Conventions Used in This Book
The following typographical conventions are used in this book:
Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.
Constant width
Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements and keywords.
Constant width bold
Shows commands or other text that should be typed literally by the user.
Constant width italic
Shows text that should be replaced with user-supplied values or by values determined by context.
TIP
This element signifies a tip or suggestion.
NOTE
This element signifies a general note.
WARNING
This element indicates a warning or caution.
Using Code Examples
Supplemental material (code examples, exercises, etc.) is available for download at https://github.com/ageron/handson-ml.
This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you're reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O'Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product's documentation does require permission.
We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: "Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron (O'Reilly). Copyright 2017 Aurélien Géron, 978-1-491-96229-9."
If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com.
O'Reilly Safari
NOTE
Safari (formerly Safari Books Online) is a membership-based training and reference platform for enterprise, government, educators, and individuals.
Members have access to thousands of books, training videos, Learning Paths, interactive tutorials, and curated playlists from over 250 publishers, including O'Reilly Media, Harvard Business Review, Prentice Hall Professional, Addison-Wesley Professional, Microsoft Press, Sams, Que, Peachpit Press, Adobe, Focal Press, Cisco Press, John Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt, Adobe Press, FT Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett, and Course Technology, among others.
For more information, please visit http://oreilly.com/safari.
How to Contact Us
Please address comments and questions concerning this book to the publisher:
O'Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)
We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at http://bit.ly/hands-on-machine-learning-with-scikit-learn-and-tensorflow.
To comment or ask technical questions about this book, send email to bookquestions@oreilly.com.
For more information about our books, courses, conferences, and news, see our website at http://www.oreilly.com.
Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://www.youtube.com/oreillymedia
Acknowledgments
I would like to thank my Google colleagues, in particular the YouTube video classification team, for teaching me so much about Machine Learning. I could never have started this project without them. Special thanks to my personal ML gurus: Clément Courbet, Julien Dubois, Mathias Kende, Daniel Kitachewsky, James Pack, Alexander Pak, Anosh Raj, Vitor Sessak, Wiktor Tomczak, Ingrid von Glehn, Rich Washington, and everyone at YouTube Paris.
I am incredibly grateful to all the amazing people who took time out of their busy lives to review my book in so much detail. Thanks to Pete Warden for answering all my TensorFlow questions, reviewing Part II, providing many interesting insights, and of course for being part of the core TensorFlow team. You should definitely check out his blog! Many thanks to Lukas Biewald for his very thorough review of Part II: he left no stone unturned, tested all the code (and caught a few errors), made many great suggestions, and his enthusiasm was contagious. You should check out his blog and his cool robots! Thanks to Justin Francis, who also reviewed Part II very thoroughly, catching errors and providing great insights, in particular in Chapter 16. Check out his posts on TensorFlow!
Huge thanks as well to David Andrzejewski, who reviewed Part I and provided incredibly useful feedback, identifying unclear sections and suggesting how to improve them. Check out his website! Thanks to Grégoire Mesnil, who reviewed Part II and contributed very interesting practical advice on training neural networks. Thanks as well to Eddy Hung, Salim Sémaoune, Karim Matrah, Ingrid von Glehn, Iain Smears, and Vincent Guilbeau for reviewing Part I and making many useful suggestions. And I also wish to thank my father-in-law, Michel Tessier, former mathematics teacher and now a great translator of Anton Chekhov, for helping me iron out some of the mathematics and notations in this book and reviewing the linear algebra Jupyter notebook.
And of course, a gigantic "thank you" to my dear brother Sylvain, who reviewed every single chapter, tested every line of code, provided feedback on virtually every section, and encouraged me from the first line to the last. Love you, bro!
Many thanks as well to O'Reilly's fantastic staff, in particular Nicole Tache, who gave me insightful feedback and was always cheerful, encouraging, and helpful. Thanks as well to Marie Beaugureau, Ben Lorica, Mike Loukides, and Laurel Ruma for believing in this project and helping me define its scope. Thanks to Matt Hacker and all of the Atlas team for answering all my technical questions regarding formatting, asciidoc, and LaTeX, and thanks to Rachel Monaghan, Nick Adams, and all of the production team for their final review and their hundreds of corrections.
Last but not least, I am infinitely grateful to my beloved wife, Emmanuelle, and to our three wonderful kids, Alexandre, Rémi, and Gabrielle, for encouraging me to work hard on this book, asking many questions (who said you can't teach neural networks to a seven-year-old?), and even bringing me cookies and coffee. What more can one dream of?
1. Available on Hinton's home page at http://www.cs.toronto.edu/~hinton/.
2. Despite the fact that Yann LeCun's deep convolutional neural networks had worked well for image recognition since the 1990s, although they were not as general purpose.
Part I. The Fundamentals of Machine Learning
Chapter 1. The Machine Learning Landscape
When most people hear "Machine Learning," they picture a robot: a dependable butler or a deadly Terminator depending on who you ask. But Machine Learning is not just a futuristic fantasy, it's already here. In fact, it has been around for decades in some specialized applications, such as Optical Character Recognition (OCR). But the first ML application that really became mainstream, improving the lives of hundreds of millions of people, took over the world back in the 1990s: it was the spam filter. Not exactly a self-aware Skynet, but it does technically qualify as Machine Learning (it has actually learned so well that you seldom need to flag an email as spam anymore). It was followed by hundreds of ML applications that now quietly power hundreds of products and features that you use regularly, from better recommendations to voice search.
Where does Machine Learning start and where does it end? What exactly does it mean for a machine to learn something? If I download a copy of Wikipedia, has my computer really "learned" something? Is it suddenly smarter? In this chapter we will start by clarifying what Machine Learning is and why you may want to use it.
Then, before we set out to explore the Machine Learning continent, we will take a look at the map and learn about the main regions and the most notable landmarks: supervised versus unsupervised learning, online versus batch learning, instance-based versus model-based learning. Then we will look at the workflow of a typical ML project, discuss the main challenges you may face, and cover how to evaluate and fine-tune a Machine Learning system.
This chapter introduces a lot of fundamental concepts (and jargon) that every data scientist should know by heart. It will be a high-level overview (the only chapter without much code), all rather simple, but you should make sure everything is crystal clear to you before continuing to the rest of the book. So grab a coffee and let's get started!
TIP
If you already know all the Machine Learning basics, you may want to skip directly to Chapter 2. If you are not sure, try to answer all the questions listed at the end of the chapter before moving on.
What Is Machine Learning?
Machine Learning is the science (and art) of programming computers so they can learn from data.
Here is a slightly more general definition:
[Machine Learning is the] field of study that gives computers the ability to learn without being explicitly programmed.
Arthur Samuel, 1959
And a more engineering-oriented one:
A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.
Tom Mitchell, 1997
For example, your spam filter is a Machine Learning program that can learn to flag spam given examples of spam emails (e.g., flagged by users) and examples of regular (nonspam, also called "ham") emails. The examples that the system uses to learn are called the training set. Each training example is called a training instance (or sample). In this case, the task T is to flag spam for new emails, the experience E is the training data, and the performance measure P needs to be defined; for example, you can use the ratio of correctly classified emails. This particular performance measure is called accuracy and it is often used in classification tasks.
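For illustration only (this snippet is not from the book), here is a minimal sketch of the accuracy measure just described: the ratio of correctly classified emails. The label arrays are made up.

import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])  # 1 = spam, 0 = ham (ground truth)
y_pred = np.array([1, 0, 0, 1, 0, 0, 1, 1])  # the filter's predictions

accuracy = np.mean(y_true == y_pred)  # fraction of correct predictions
print(accuracy)                        # 0.75 in this made-up example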
If you just download a copy of Wikipedia, your computer has a lot more data, but it is not suddenly better at any task. Thus, it is not Machine Learning.
Why Use Machine Learning?
Consider how you would write a spam filter using traditional programming techniques (Figure 1-1):
1. First you would look at what spam typically looks like. You might notice that some words or phrases (such as "4U," "credit card," "free," and "amazing") tend to come up a lot in the subject. Perhaps you would also notice a few other patterns in the sender's name, the email's body, and so on.
2. You would write a detection algorithm for each of the patterns that you noticed, and your program would flag emails as spam if a number of these patterns are detected.
3. You would test your program, and repeat steps 1 and 2 until it is good enough.
Figure 1-1. The traditional approach
Since the problem is not trivial, your program will likely become a long list of complex rules, which is pretty hard to maintain.
In contrast, a spam filter based on Machine Learning techniques automatically learns which words and phrases are good predictors of spam by detecting unusually frequent patterns of words in the spam examples compared to the ham examples (Figure 1-2). The program is much shorter, easier to maintain, and most likely more accurate.
Figure 1-2. Machine Learning approach
Moreover, if spammers notice that all their emails containing "4U" are blocked, they might start writing "For U" instead. A spam filter using traditional programming techniques would need to be updated to flag "For U" emails. If spammers keep working around your spam filter, you will need to keep writing new rules forever.
In contrast, a spam filter based on Machine Learning techniques automatically notices that "For U" has become unusually frequent in spam flagged by users, and it starts flagging them without your intervention (Figure 1-3).
Figure 1-3. Automatically adapting to change
Another area where Machine Learning shines is for problems that either are too complex for traditional approaches or have no known algorithm. For example, consider speech recognition: say you want to start simple and write a program capable of distinguishing the words "one" and "two." You might notice that the word "two" starts with a high-pitch sound ("T"), so you could hardcode an algorithm that measures high-pitch sound intensity and use that to distinguish ones and twos. Obviously this technique will not scale to thousands of words spoken by millions of very different people in noisy environments and in dozens of languages. The best solution (at least today) is to write an algorithm that learns by itself, given many example recordings for each word.
Finally, Machine Learning can help humans learn (Figure 1-4): ML algorithms can be inspected to see what they have learned (although for some algorithms this can be tricky). For instance, once the spam filter has been trained on enough spam, it can easily be inspected to reveal the list of words and combinations of words that it believes are the best predictors of spam. Sometimes this will reveal unsuspected correlations or new trends, and thereby lead to a better understanding of the problem.
Applying ML techniques to dig into large amounts of data can help discover patterns that were not immediately apparent. This is called data mining.
Figure 1-4. Machine Learning can help humans learn
To summarize, Machine Learning is great for:
Problems for which existing solutions require a lot of hand-tuning or long lists of rules: one Machine Learning algorithm can often simplify code and perform better.
Complex problems for which there is no good solution at all using a traditional approach: the best Machine Learning techniques can find a solution.
Fluctuating environments: a Machine Learning system can adapt to new data.
Getting insights about complex problems and large amounts of data.
Types of Machine Learning Systems
There are so many different types of Machine Learning systems that it is useful to classify them in broad categories based on:
Whether or not they are trained with human supervision (supervised, unsupervised, semisupervised, and Reinforcement Learning)
Whether or not they can learn incrementally on the fly (online versus batch learning)
Whether they work by simply comparing new data points to known data points, or instead detect patterns in the training data and build a predictive model, much like scientists do (instance-based versus model-based learning)
These criteria are not exclusive; you can combine them in any way you like. For example, a state-of-the-art spam filter may learn on the fly using a deep neural network model trained using examples of spam and ham; this makes it an online, model-based, supervised learning system.
Let's look at each of these criteria a bit more closely.
Supervised/Unsupervised Learning
Machine Learning systems can be classified according to the amount and type of supervision they get during training. There are four major categories: supervised learning, unsupervised learning, semisupervised learning, and Reinforcement Learning.
Supervised learning
In supervised learning, the training data you feed to the algorithm includes the desired solutions, called labels (Figure 1-5).
Figure 1-5. A labeled training set for supervised learning (e.g., spam classification)
A typical supervised learning task is classification. The spam filter is a good example of this: it is trained with many example emails along with their class (spam or ham), and it must learn how to classify new emails.
Another typical task is to predict a target numeric value, such as the price of a car, given a set of features (mileage, age, brand, etc.) called predictors. This sort of task is called regression (Figure 1-6).¹ To train the system, you need to give it many examples of cars, including both their predictors and their labels (i.e., their prices).
NOTE
In Machine Learning an attribute is a data type (e.g., "Mileage"), while a feature has several meanings depending on the context, but generally means an attribute plus its value (e.g., "Mileage = 15,000"). Many people use the words attribute and feature interchangeably, though.
Figure 1-6. Regression
Note that some regression algorithms can be used for classification as well, and vice versa. For example, Logistic Regression is commonly used for classification, as it can output a value that corresponds to the probability of belonging to a given class (e.g., 20% chance of being spam).
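As a small illustrative sketch (not from the book), here is Logistic Regression used as a classifier, outputting a class probability via predict_proba(). The tiny single-feature dataset is entirely made up.

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])  # one illustrative feature
y = np.array([0, 0, 0, 1, 1, 1])                          # e.g., 0 = ham, 1 = spam

clf = LogisticRegression()
clf.fit(X, y)
print(clf.predict_proba([[2.0]]))  # estimated probability of each class
print(clf.predict([[2.0]]))        # the predicted class (0 or 1)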
Here are some of the most important supervised learning algorithms (covered in this book):
k-Nearest Neighbors
Linear Regression
Logistic Regression
Support Vector Machines (SVMs)
Decision Trees and Random Forests
Neural networks²
Unsupervised learning
In unsupervised learning, as you might guess, the training data is unlabeled (Figure 1-7). The system tries to learn without a teacher.
Figure 1-7. An unlabeled training set for unsupervised learning
Here are some of the most important unsupervised learning algorithms (we will cover dimensionality reduction in Chapter 8):
Clustering
k-Means
Hierarchical Cluster Analysis (HCA)
Expectation Maximization
Visualization and dimensionality reduction
Principal Component Analysis (PCA)
Kernel PCA
Locally-Linear Embedding (LLE)
t-distributed Stochastic Neighbor Embedding (t-SNE)
Association rule learning
Apriori
Eclat
For example, say you have a lot of data about your blog's visitors. You may want to run a clustering algorithm to try to detect groups of similar visitors (Figure 1-8). At no point do you tell the algorithm which group a visitor belongs to: it finds those connections without your help. For example, it might notice that 40% of your visitors are males who love comic books and generally read your blog in the evening, while 20% are young sci-fi lovers who visit during the weekends, and so on. If you use a hierarchical clustering algorithm, it may also subdivide each group into smaller groups. This may help you target your posts for each group.
Figure 1-8. Clustering
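A hedged sketch of the clustering idea above (not from the book), using k-Means on made-up visitor features (say, age and visits per week); the groups it finds are not labeled by you.

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[25, 5], [27, 6], [24, 4],    # one group of similar visitors
              [45, 1], [47, 2], [50, 1]])   # another group

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
kmeans.fit(X)
print(kmeans.labels_)           # cluster assigned to each visitor, e.g. [1 1 1 0 0 0]
print(kmeans.cluster_centers_)  # the center of each discovered group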
Visualization algorithms are also good examples of unsupervised learning algorithms: you feed them a lot of complex and unlabeled data, and they output a 2D or 3D representation of your data that can easily be plotted (Figure 1-9). These algorithms try to preserve as much structure as they can (e.g., trying to keep separate clusters in the input space from overlapping in the visualization), so you can understand how the data is organized and perhaps identify unsuspected patterns.
Figure 1-9. Example of a t-SNE visualization highlighting semantic clusters³
A related task is dimensionality reduction, in which the goal is to simplify the data without losing too much information. One way to do this is to merge several correlated features into one. For example, a car's mileage may be very correlated with its age, so the dimensionality reduction algorithm will merge them into one feature that represents the car's wear and tear. This is called feature extraction.
TIP
It is often a good idea to try to reduce the dimension of your training data using a dimensionality reduction algorithm before you feed it to another Machine Learning algorithm (such as a supervised learning algorithm). It will run much faster, the data will take up less disk and memory space, and in some cases it may also perform better.
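A minimal sketch of the tip above (not from the book): reducing dimensionality with PCA before handing the data to another algorithm. The random data is illustrative only.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(42)
X = rng.rand(100, 10)                  # 100 instances, 10 features

pca = PCA(n_components=3)              # keep 3 components
X_reduced = pca.fit_transform(X)       # the lower-dimensional training data
print(X_reduced.shape)                 # (100, 3)
print(pca.explained_variance_ratio_)   # how much variance each component preserves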
Yet another important unsupervised task is anomaly detection, such as detecting unusual credit card transactions to prevent fraud, catching manufacturing defects, or automatically removing outliers from a dataset before feeding it to another learning algorithm. The system is trained with normal instances, and when it sees a new instance it can tell whether it looks like a normal one or whether it is likely an anomaly (see Figure 1-10).
Figure 1-10. Anomaly detection
Finally, another common unsupervised task is association rule learning, in which the goal is to dig into large amounts of data and discover interesting relations between attributes. For example, suppose you own a supermarket. Running an association rule on your sales logs may reveal that people who purchase barbecue sauce and potato chips also tend to buy steak. Thus, you may want to place these items close to each other.
Semisupervised learning
Some algorithms can deal with partially labeled training data, usually a lot of unlabeled data and a little bit of labeled data. This is called semisupervised learning (Figure 1-11).
Some photo-hosting services, such as Google Photos, are good examples of this. Once you upload all your family photos to the service, it automatically recognizes that the same person A shows up in photos 1, 5, and 11, while another person B shows up in photos 2, 5, and 7. This is the unsupervised part of the algorithm (clustering). Now all the system needs is for you to tell it who these people are. Just one label per person,⁴ and it is able to name everyone in every photo, which is useful for searching photos.
Figure 1-11. Semisupervised learning
Most semisupervised learning algorithms are combinations of unsupervised and supervised algorithms. For example, deep belief networks (DBNs) are based on unsupervised components called restricted Boltzmann machines (RBMs) stacked on top of one another. RBMs are trained sequentially in an unsupervised manner, and then the whole system is fine-tuned using supervised learning techniques.
Reinforcement Learning
Reinforcement Learning is a very different beast. The learning system, called an agent in this context, can observe the environment, select and perform actions, and get rewards in return (or penalties in the form of negative rewards, as in Figure 1-12). It must then learn by itself what is the best strategy, called a policy, to get the most reward over time. A policy defines what action the agent should choose when it is in a given situation.
Figure 1-12. Reinforcement Learning
For example, many robots implement Reinforcement Learning algorithms to learn how to walk. DeepMind's AlphaGo program is also a good example of Reinforcement Learning: it made the headlines in March 2016 when it beat the world champion Lee Sedol at the game of Go. It learned its winning policy by analyzing millions of games, and then playing many games against itself. Note that learning was turned off during the games against the champion; AlphaGo was just applying the policy it had learned.
Batch and Online Learning
Another criterion used to classify Machine Learning systems is whether or not the system can learn incrementally from a stream of incoming data.
Batch learning
In batch learning, the system is incapable of learning incrementally: it must be trained using all the available data. This will generally take a lot of time and computing resources, so it is typically done offline. First the system is trained, and then it is launched into production and runs without learning anymore; it just applies what it has learned. This is called offline learning.
If you want a batch learning system to know about new data (such as a new type of spam), you need to train a new version of the system from scratch on the full dataset (not just the new data, but also the old data), then stop the old system and replace it with the new one.
Fortunately, the whole process of training, evaluating, and launching a Machine Learning system can be automated fairly easily (as shown in Figure 1-3), so even a batch learning system can adapt to change. Simply update the data and train a new version of the system from scratch as often as needed.
This solution is simple and often works fine, but training using the full set of data can take many hours, so you would typically train a new system only every 24 hours or even just weekly. If your system needs to adapt to rapidly changing data (e.g., to predict stock prices), then you need a more reactive solution.
Also, training on the full set of data requires a lot of computing resources (CPU, memory space, disk space, disk I/O, network I/O, etc.). If you have a lot of data and you automate your system to train from scratch every day, it will end up costing you a lot of money. If the amount of data is huge, it may even be impossible to use a batch learning algorithm.
Finally, if your system needs to be able to learn autonomously and it has limited resources (e.g., a smartphone application or a rover on Mars), then carrying around large amounts of training data and taking up a lot of resources to train for hours every day is a showstopper.
Fortunately, a better option in all these cases is to use algorithms that are capable of learning incrementally.
Online learning
In online learning, you train the system incrementally by feeding it data instances sequentially, either individually or by small groups called mini-batches. Each learning step is fast and cheap, so the system can learn about new data on the fly, as it arrives (see Figure 1-13).
Figure 1-13. Online learning
Online learning is great for systems that receive data as a continuous flow (e.g., stock prices) and need to adapt to change rapidly or autonomously. It is also a good option if you have limited computing resources: once an online learning system has learned about new data instances, it does not need them anymore, so you can discard them (unless you want to be able to roll back to a previous state and "replay" the data). This can save a huge amount of space.
Online learning algorithms can also be used to train systems on huge datasets that cannot fit in one machine's main memory (this is called out-of-core learning). The algorithm loads part of the data, runs a training step on that data, and repeats the process until it has run on all of the data (see Figure 1-14).
WARNING
This whole process is usually done offline (i.e., not on the live system), so online learning can be a confusing name. Think of it as incremental learning.
Figure 1-14. Using online learning to handle huge datasets
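An illustrative sketch of incremental learning (not from the book): an SGD-based regressor fed synthetic mini-batches one at a time via partial_fit(), so each batch can be discarded after its learning step.

import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.RandomState(42)
model = SGDRegressor(learning_rate="constant", eta0=0.01)

for step in range(100):                      # pretend each step brings a new mini-batch
    X_batch = rng.rand(32, 3)                # 32 instances, 3 features
    y_batch = X_batch @ np.array([2.0, -1.0, 0.5]) + rng.randn(32) * 0.1
    model.partial_fit(X_batch, y_batch)      # one fast, cheap learning step

print(model.coef_)  # should end up close to [2.0, -1.0, 0.5]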
One important parameter of online learning systems is how fast they should adapt to changing data: this is called the learning rate. If you set a high learning rate, then your system will rapidly adapt to new data, but it will also tend to quickly forget the old data (you don't want a spam filter to flag only the latest kinds of spam it was shown). Conversely, if you set a low learning rate, the system will have more inertia; that is, it will learn more slowly, but it will also be less sensitive to noise in the new data or to sequences of nonrepresentative data points.
A big challenge with online learning is that if bad data is fed to the system, the system's performance will gradually decline. If we are talking about a live system, your clients will notice. For example, bad data could come from a malfunctioning sensor on a robot, or from someone spamming a search engine to try to rank high in search results. To reduce this risk, you need to monitor your system closely and promptly switch learning off (and possibly revert to a previously working state) if you detect a drop in performance. You may also want to monitor the input data and react to abnormal data (e.g., using an anomaly detection algorithm).
Instance-Based Versus Model-Based Learning
One more way to categorize Machine Learning systems is by how they generalize. Most Machine Learning tasks are about making predictions. This means that given a number of training examples, the system needs to be able to generalize to examples it has never seen before. Having a good performance measure on the training data is good, but insufficient; the true goal is to perform well on new instances.
There are two main approaches to generalization: instance-based learning and model-based learning.
Instance-based learning
Possibly the most trivial form of learning is simply to learn by heart. If you were to create a spam filter this way, it would just flag all emails that are identical to emails that have already been flagged by users; not the worst solution, but certainly not the best.
Instead of just flagging emails that are identical to known spam emails, your spam filter could be programmed to also flag emails that are very similar to known spam emails. This requires a measure of similarity between two emails. A (very basic) similarity measure between two emails could be to count the number of words they have in common. The system would flag an email as spam if it has many words in common with a known spam email.
This is called instance-based learning: the system learns the examples by heart, then generalizes to new cases using a similarity measure (Figure 1-15).
Figure 1-15. Instance-based learning
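A very basic sketch (not from the book) of the similarity measure described above: counting the words two emails have in common. The example emails are made up.

def similarity(email_a, email_b):
    """Number of distinct words shared by the two emails."""
    words_a = set(email_a.lower().split())
    words_b = set(email_b.lower().split())
    return len(words_a & words_b)

known_spam = "win a free credit card now"
new_email = "claim your free credit card"
print(similarity(known_spam, new_email))  # 3 ("free", "credit", "card")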
Model-based learning
Another way to generalize from a set of examples is to build a model of these examples, then use that model to make predictions. This is called model-based learning (Figure 1-16).
Figure 1-16. Model-based learning
For example, suppose you want to know if money makes people happy, so you download the Better Life Index data from the OECD's website as well as stats about GDP per capita from the IMF's website. Then you join the tables and sort by GDP per capita. Table 1-1 shows an excerpt of what you get.
Table 1-1. Does money make people happier?
Country          GDP per capita (USD)   Life satisfaction
Hungary          12,240                  4.9
Korea            27,195                  5.8
France           37,675                  6.5
Australia        50,962                  7.3
United States    55,805                  7.2
Let's plot the data for a few random countries (Figure 1-17).
Figure 1-17. Do you see a trend here?
There does seem to be a trend here! Although the data is noisy (i.e., partly random), it looks like life satisfaction goes up more or less linearly as the country's GDP per capita increases. So you decide to model life satisfaction as a linear function of GDP per capita. This step is called model selection: you selected a linear model of life satisfaction with just one attribute, GDP per capita (Equation 1-1).
Equation 1-1. A simple linear model
life_satisfaction = θ₀ + θ₁ × GDP_per_capita
This model has two model parameters, θ₀ and θ₁.⁵ By tweaking these parameters, you can make your model represent any linear function, as shown in Figure 1-18.
Figure 1-18. A few possible linear models
Before you can use your model, you need to define the parameter values θ₀ and θ₁. How can you know which values will make your model perform best? To answer this question, you need to specify a performance measure. You can either define a utility function (or fitness function) that measures how good your model is, or you can define a cost function that measures how bad it is. For linear regression problems, people typically use a cost function that measures the distance between the linear model's predictions and the training examples; the objective is to minimize this distance.
This is where the Linear Regression algorithm comes in: you feed it your training examples and it finds the parameters that make the linear model fit best to your data. This is called training the model. In our case the algorithm finds that the optimal parameter values are θ₀ = 4.85 and θ₁ = 4.91 × 10⁻⁵.
Now the model fits the training data as closely as possible (for a linear model), as you can see in Figure 1-19.
Figure 1-19. The linear model that fits the training data best
You are finally ready to run the model to make predictions. For example, say you want to know how happy Cypriots are, and the OECD data does not have the answer. Fortunately, you can use your model to make a good prediction: you look up Cyprus's GDP per capita, find $22,587, and then apply your model and find that life satisfaction is likely to be somewhere around 4.85 + 22,587 × 4.91 × 10⁻⁵ = 5.96.
To whet your appetite, Example 1-1 shows the Python code that loads the data, prepares it,⁶ creates a scatterplot for visualization, and then trains a linear model and makes a prediction.⁷
Example 1-1. Training and running a linear model using Scikit-Learn
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn.linear_model

# Load the data
oecd_bli = pd.read_csv("oecd_bli_2015.csv", thousands=',')
gdp_per_capita = pd.read_csv("gdp_per_capita.csv", thousands=',', delimiter='\t',
                             encoding='latin1', na_values="n/a")

# Prepare the data
country_stats = prepare_country_stats(oecd_bli, gdp_per_capita)
X = np.c_[country_stats["GDP per capita"]]
y = np.c_[country_stats["Life satisfaction"]]

# Visualize the data
country_stats.plot(kind='scatter', x="GDP per capita", y='Life satisfaction')
plt.show()

# Select a linear model
model = sklearn.linear_model.LinearRegression()

# Train the model
model.fit(X, y)

# Make a prediction for Cyprus
X_new = [[22587]]  # Cyprus' GDP per capita
print(model.predict(X_new))  # outputs [[ 5.96242338]]
NOTE
If you had used an instance-based learning algorithm instead, you would have found that Slovenia has the closest GDP per capita to that of Cyprus ($20,732), and since the OECD data tells us that Slovenians' life satisfaction is 5.7, you would have predicted a life satisfaction of 5.7 for Cyprus. If you zoom out a bit and look at the two next closest countries, you will find Portugal and Spain with life satisfactions of 5.1 and 6.5, respectively. Averaging these three values, you get 5.77, which is pretty close to your model-based prediction. This simple algorithm is called k-Nearest Neighbors regression (in this example, k = 3).
Replacing the Linear Regression model with k-Nearest Neighbors regression in the previous code is as simple as replacing this line:
model = sklearn.linear_model.LinearRegression()
with this one (after importing sklearn.neighbors):
model = sklearn.neighbors.KNeighborsRegressor(n_neighbors=3)
If all went well, your model will make good predictions. If not, you may need to use more attributes (employment rate, health, air pollution, etc.), get more or better quality training data, or perhaps select a more powerful model (e.g., a Polynomial Regression model).
In summary:
You studied the data.
You selected a model.
You trained it on the training data (i.e., the learning algorithm searched for the model parameter values that minimize a cost function).
Finally, you applied the model to make predictions on new cases (this is called inference), hoping that this model will generalize well.
This is what a typical Machine Learning project looks like. In Chapter 2 you will experience this first-hand by going through an end-to-end project.
We have covered a lot of ground so far: you now know what Machine Learning is really about, why it is useful, what some of the most common categories of ML systems are, and what a typical project workflow looks like. Now let's look at what can go wrong in learning and prevent you from making accurate predictions.
Main Challenges of Machine Learning
In short, since your main task is to select a learning algorithm and train it on some data, the two things that can go wrong are "bad algorithm" and "bad data." Let's start with examples of bad data.
Insufficient Quantity of Training Data
For a toddler to learn what an apple is, all it takes is for you to point to an apple and say "apple" (possibly repeating this procedure a few times). Now the child is able to recognize apples in all sorts of colors and shapes. Genius.
Machine Learning is not quite there yet; it takes a lot of data for most Machine Learning algorithms to work properly. Even for very simple problems you typically need thousands of examples, and for complex problems such as image or speech recognition you may need millions of examples (unless you can reuse parts of an existing model).
THE UNREASONABLE EFFECTIVENESS OF DATA
In a famous paper published in 2001, Microsoft researchers Michele Banko and Eric Brill showed that very different Machine Learning algorithms, including fairly simple ones, performed almost identically well on a complex problem of natural language disambiguation⁸ once they were given enough data (as you can see in Figure 1-20).
Figure 1-20. The importance of data versus algorithms⁹
As the authors put it: "these results suggest that we may want to reconsider the trade-off between spending time and money on algorithm development versus spending it on corpus development."
The idea that data matters more than algorithms for complex problems was further popularized by Peter Norvig et al. in a paper titled "The Unreasonable Effectiveness of Data" published in 2009.¹⁰ It should be noted, however, that small- and medium-sized datasets are still very common, and it is not always easy or cheap to get extra training data, so don't abandon algorithms just yet.
Nonrepresentative Training Data
In order to generalize well, it is crucial that your training data be representative of the new cases you want to generalize to. This is true whether you use instance-based learning or model-based learning.
For example, the set of countries we used earlier for training the linear model was not perfectly representative; a few countries were missing. Figure 1-21 shows what the data looks like when you add the missing countries.
Figure 1-21. A more representative training sample
If you train a linear model on this data, you get the solid line, while the old model is represented by the dotted line. As you can see, not only does adding a few missing countries significantly alter the model, but it makes it clear that such a simple linear model is probably never going to work well. It seems that very rich countries are not happier than moderately rich countries (in fact they seem unhappier), and conversely some poor countries seem happier than many rich countries.
By using a nonrepresentative training set, we trained a model that is unlikely to make accurate predictions, especially for very poor and very rich countries.
It is crucial to use a training set that is representative of the cases you want to generalize to. This is often harder than it sounds: if the sample is too small, you will have sampling noise (i.e., nonrepresentative data as a result of chance), but even very large samples can be nonrepresentative if the sampling method is flawed. This is called sampling bias.
A FAMOUS EXAMPLE OF SAMPLING BIAS
Perhaps the most famous example of sampling bias happened during the US presidential election in 1936, which pitted Landon against Roosevelt: the Literary Digest conducted a very large poll, sending mail to about 10 million people. It got 2.4 million answers, and predicted with high confidence that Landon would get 57% of the votes. Instead, Roosevelt won with 62% of the votes. The flaw was in the Literary Digest's sampling method:
First, to obtain the addresses to send the polls to, the Literary Digest used telephone directories, lists of magazine subscribers, club membership lists, and the like. All of these lists tend to favor wealthier people, who are more likely to vote Republican (hence Landon).
Second, less than 25% of the people who received the poll answered. Again, this introduces a sampling bias, by ruling out people who don't care much about politics, people who don't like the Literary Digest, and other key groups. This is a special type of sampling bias called nonresponse bias.
Here is another example: say you want to build a system to recognize funk music videos. One way to build your training set is to search "funk music" on YouTube and use the resulting videos. But this assumes that YouTube's search engine returns a set of videos that are representative of all the funk music videos on YouTube. In reality, the search results are likely to be biased toward popular artists (and if you live in Brazil you will get a lot of "funk carioca" videos, which sound nothing like James Brown). On the other hand, how else can you get a large training set?
Poor-Quality Data
Obviously, if your training data is full of errors, outliers, and noise (e.g., due to poor-quality measurements), it will make it harder for the system to detect the underlying patterns, so your system is less likely to perform well. It is often well worth the effort to spend time cleaning up your training data. The truth is, most data scientists spend a significant part of their time doing just that. For example:
If some instances are clearly outliers, it may help to simply discard them or try to fix the errors manually.
If some instances are missing a few features (e.g., 5% of your customers did not specify their age), you must decide whether you want to ignore this attribute altogether, ignore these instances, fill in the missing values (e.g., with the median age, as sketched after this list), or train one model with the feature and one model without it, and so on.
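Here is a hedged sketch (not from the book) of one of the options above: filling in missing values with the median using pandas. The DataFrame is entirely made up.

import numpy as np
import pandas as pd

customers = pd.DataFrame({"age": [25, 32, np.nan, 41, np.nan],
                          "income": [30000, 45000, 38000, 52000, 61000]})

median_age = customers["age"].median()                   # 32.0 for this toy data
customers["age"] = customers["age"].fillna(median_age)   # fill the missing ages
print(customers)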
Irrelevant Features
As the saying goes: garbage in, garbage out. Your system will only be capable of learning if the training data contains enough relevant features and not too many irrelevant ones. A critical part of the success of a Machine Learning project is coming up with a good set of features to train on. This process, called feature engineering, involves:
Feature selection: selecting the most useful features to train on among existing features.
Feature extraction: combining existing features to produce a more useful one (as we saw earlier, dimensionality reduction algorithms can help).
Creating new features by gathering new data.
Now that we have looked at many examples of bad data, let's look at a couple of examples of bad algorithms.
Overfitting the Training Data
Say you are visiting a foreign country and the taxi driver rips you off. You might be tempted to say that all taxi drivers in that country are thieves. Overgeneralizing is something that we humans do all too often, and unfortunately machines can fall into the same trap if we are not careful. In Machine Learning this is called overfitting: it means that the model performs well on the training data, but it does not generalize well.
Figure 1-22 shows an example of a high-degree polynomial life satisfaction model that strongly overfits the training data. Even though it performs much better on the training data than the simple linear model, would you really trust its predictions?
Figure 1-22. Overfitting the training data
Complex models such as deep neural networks can detect subtle patterns in the data, but if the training set is noisy, or if it is too small (which introduces sampling noise), then the model is likely to detect patterns in the noise itself. Obviously these patterns will not generalize to new instances. For example, say you feed your life satisfaction model many more attributes, including uninformative ones such as the country's name. In that case, a complex model may detect patterns like the fact that all countries in the training data with a w in their name have a life satisfaction greater than 7: New Zealand (7.3), Norway (7.4), Sweden (7.2), and Switzerland (7.5). How confident are you that the W-satisfaction rule generalizes to Rwanda or Zimbabwe? Obviously this pattern occurred in the training data by pure chance, but the model has no way to tell whether a pattern is real or simply the result of noise in the data.
WARNING
Overfitting happens when the model is too complex relative to the amount and noisiness of the training data. The possible solutions are:
To simplify the model by selecting one with fewer parameters (e.g., a linear model rather than a high-degree polynomial model), by reducing the number of attributes in the training data, or by constraining the model
To gather more training data
To reduce the noise in the training data (e.g., fix data errors and remove outliers)
Constraining a model to make it simpler and reduce the risk of overfitting is called regularization. For example, the linear model we defined earlier has two parameters, θ₀ and θ₁. This gives the learning algorithm two degrees of freedom to adapt the model to the training data: it can tweak both the height (θ₀) and the slope (θ₁) of the line. If we forced θ₁ = 0, the algorithm would have only one degree of freedom and would have a much harder time fitting the data properly: all it could do is move the line up or down to get as close as possible to the training instances, so it would end up around the mean. A very simple model indeed! If we allow the algorithm to modify θ₁ but we force it to keep it small, then the learning algorithm will effectively have somewhere in between one and two degrees of freedom. It will produce a simpler model than with two degrees of freedom, but more complex than with just one. You want to find the right balance between fitting the data perfectly and keeping the model simple enough to ensure that it will generalize well.
Figure 1-23 shows three models: the dotted line represents the original model that was trained with a few countries missing, the dashed line is our second model trained with all countries, and the solid line is a linear model trained with the same data as the first model but with a regularization constraint. You can see that regularization forced the model to have a smaller slope, which fits the training data it was trained on a bit less well, but actually allows it to generalize better to new examples.
Figure 1-23. Regularization reduces the risk of overfitting
The amount of regularization to apply during learning can be controlled by a hyperparameter. A hyperparameter is a parameter of a learning algorithm (not of the model). As such, it is not affected by the learning algorithm itself; it must be set prior to training and remains constant during training. If you set the regularization hyperparameter to a very large value, you will get an almost flat model (a slope close to zero); the learning algorithm will almost certainly not overfit the training data, but it will be less likely to find a good solution. Tuning hyperparameters is an important part of building a Machine Learning system (you will see a detailed example in the next chapter).
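As a minimal sketch (not from the book) of controlling regularization with a hyperparameter, here is Ridge regression, whose alpha parameter constrains the model's weights; the data is synthetic and only meant to show that a very large alpha yields an almost flat model.

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(42)
X = rng.rand(20, 1) * 10               # a single made-up feature
y = 3 * X[:, 0] + rng.randn(20)        # noisy linear target

for alpha in (0.001, 1e6):             # almost no regularization vs. a huge amount
    model = Ridge(alpha=alpha)
    model.fit(X, y)
    print(alpha, model.coef_)          # a huge alpha pushes the slope toward zero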
Underfitting the Training Data
As you might guess, underfitting is the opposite of overfitting: it occurs when your model is too simple to learn the underlying structure of the data. For example, a linear model of life satisfaction is prone to underfit; reality is just more complex than the model, so its predictions are bound to be inaccurate, even on the training examples.
The main options to fix this problem are:
Selecting a more powerful model, with more parameters
Feeding better features to the learning algorithm (feature engineering)
Reducing the constraints on the model (e.g., reducing the regularization hyperparameter)
Stepping Back
By now you already know a lot about Machine Learning. However, we went through so many concepts that you may be feeling a little lost, so let's step back and look at the big picture:
Machine Learning is about making machines get better at some task by learning from data, instead of having to explicitly code rules.
There are many different types of ML systems: supervised or not, batch or online, instance-based or model-based, and so on.
In an ML project you gather data in a training set, and you feed the training set to a learning algorithm. If the algorithm is model-based it tunes some parameters to fit the model to the training set (i.e., to make good predictions on the training set itself), and then hopefully it will be able to make good predictions on new cases as well. If the algorithm is instance-based, it just learns the examples by heart and uses a similarity measure to generalize to new instances.
The system will not perform well if your training set is too small, or if the data is not representative, noisy, or polluted with irrelevant features (garbage in, garbage out). Lastly, your model needs to be neither too simple (in which case it will underfit) nor too complex (in which case it will overfit).
There's just one last important topic to cover: once you have trained a model, you don't want to just "hope" it generalizes to new cases. You want to evaluate it, and fine-tune it if necessary. Let's see how.
Testing and Validating
The only way to know how well a model will generalize to new cases is to actually try it out on new cases. One way to do that is to put your model in production and monitor how well it performs. This works well, but if your model is horribly bad, your users will complain (not the best idea).
A better option is to split your data into two sets: the training set and the test set. As these names imply, you train your model using the training set, and you test it using the test set. The error rate on new cases is called the generalization error (or out-of-sample error), and by evaluating your model on the test set, you get an estimation of this error. This value tells you how well your model will perform on instances it has never seen before.
If the training error is low (i.e., your model makes few mistakes on the training set) but the generalization error is high, it means that your model is overfitting the training data.
TIP
It is common to use 80% of the data for training and hold out 20% for testing.
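A hedged sketch (not from the book) of the 80/20 split mentioned in the tip, using Scikit-Learn's train_test_split on made-up data.

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)   # 100 instances, 1 feature
y = np.arange(100)                  # matching labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)   # hold out 20% for testing
print(len(X_train), len(X_test))             # 80 20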
So evaluating a model is simple enough: just use a test set. Now suppose you are hesitating between two models (say a linear model and a polynomial model): how can you decide? One option is to train both and compare how well they generalize using the test set.
Now suppose that the linear model generalizes better, but you want to apply some regularization to avoid overfitting. The question is: how do you choose the value of the regularization hyperparameter? One option is to train 100 different models using 100 different values for this hyperparameter. Suppose you find the best hyperparameter value that produces a model with the lowest generalization error, say just 5% error.
So you launch this model into production, but unfortunately it does not perform as well as expected and produces 15% errors. What just happened?
The problem is that you measured the generalization error multiple times on the test set, and you adapted the model and hyperparameters to produce the best model for that set. This means that the model is unlikely to perform as well on new data.
A common solution to this problem is to have a second holdout set called the validation set. You train multiple models with various hyperparameters using the training set, you select the model and hyperparameters that perform best on the validation set, and when you're happy with your model you run a single final test against the test set to get an estimate of the generalization error.
To avoid "wasting" too much training data in validation sets, a common technique is to use cross-validation: the training set is split into complementary subsets, and each model is trained against a different combination of these subsets and validated against the remaining parts. Once the model type and hyperparameters have been selected, a final model is trained using these hyperparameters on the full training set, and the generalization error is measured on the test set.
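A minimal sketch (not from the book) of cross-validation with Scikit-Learn: the training data is split into folds, and the model is trained and validated on different combinations of them. The data is synthetic.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(42)
X = rng.rand(100, 3)
y = X @ np.array([1.0, 2.0, 3.0]) + rng.randn(100) * 0.1

scores = cross_val_score(LinearRegression(), X, y, cv=5,
                         scoring="neg_mean_squared_error")
print(-scores)         # one validation error estimate per fold
print(-scores.mean())  # averaged estimate, useful to compare models or hyperparameters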
NO FREE LUNCH THEOREM
A model is a simplified version of the observations. The simplifications are meant to discard the superfluous details that are unlikely to generalize to new instances. However, to decide what data to discard and what data to keep, you must make assumptions. For example, a linear model makes the assumption that the data is fundamentally linear and that the distance between the instances and the straight line is just noise, which can safely be ignored.
In a famous 1996 paper,¹¹ David Wolpert demonstrated that if you make absolutely no assumption about the data, then there is no reason to prefer one model over any other. This is called the No Free Lunch (NFL) theorem. For some datasets the best model is a linear model, while for other datasets it is a neural network. There is no model that is a priori guaranteed to work better (hence the name of the theorem). The only way to know for sure which model is best is to evaluate them all. Since this is not possible, in practice you make some reasonable assumptions about the data and you evaluate only a few reasonable models. For example, for simple tasks you may evaluate linear models with various levels of regularization, and for a complex problem you may evaluate various neural networks.
Exercises
In this chapter we have covered some of the most important concepts in Machine Learning. In the next chapters we will dive deeper and write more code, but before we do, make sure you know how to answer the following questions:
1. How would you define Machine Learning?
2. Can you name four types of problems where it shines?
3. What is a labeled training set?
4. What are the two most common supervised tasks?
5. Can you name four common unsupervised tasks?
6. What type of Machine Learning algorithm would you use to allow a robot to walk in various unknown terrains?
7. What type of algorithm would you use to segment your customers into multiple groups?
8. Would you frame the problem of spam detection as a supervised learning problem or an unsupervised learning problem?
9. What is an online learning system?
10. What is out-of-core learning?
11. What type of learning algorithm relies on a similarity measure to make predictions?
12. What is the difference between a model parameter and a learning algorithm's hyperparameter?
13. What do model-based learning algorithms search for? What is the most common strategy they use to succeed? How do they make predictions?
14. Can you name four of the main challenges in Machine Learning?
15. If your model performs great on the training data but generalizes poorly to new instances, what is happening? Can you name three possible solutions?
16. What is a test set and why would you want to use it?
17. What is the purpose of a validation set?
18. What can go wrong if you tune hyperparameters using the test set?
19. What is cross-validation and why would you prefer it to a validation set?
Solutions to these exercises are available in Appendix A.
1. Fun fact: this odd-sounding name is a statistics term introduced by Francis Galton while he was studying the fact that the children of tall people tend to be shorter than their parents. Since children were shorter, he called this regression to the mean. This name was then applied to the methods he used to analyze correlations between variables.
2. Some neural network architectures can be unsupervised, such as autoencoders and restricted Boltzmann machines. They can also be semisupervised, such as in deep belief networks and unsupervised pretraining.
3. Notice how animals are rather well separated from vehicles, how horses are close to deer but far from birds, and so on. Figure reproduced with permission from Socher, Ganjoo, Manning, and Ng (2013), "T-SNE visualization of the semantic word space."
4. That's when the system works perfectly. In practice it often creates a few clusters per person, and sometimes mixes up two people who look alike, so you need to provide a few labels per person and manually clean up some clusters.
5. By convention, the Greek letter θ (theta) is frequently used to represent model parameters.
6. The code assumes that prepare_country_stats() is already defined: it merges the GDP and life satisfaction data into a single Pandas dataframe.
7. It's okay if you don't understand all the code yet; we will present Scikit-Learn in the following chapters.
8. For example, knowing whether to write "to," "two," or "too" depending on the context.
9. Figure reproduced with permission from Banko and Brill (2001), "Learning Curves for Confusion Set Disambiguation."
10. "The Unreasonable Effectiveness of Data," Peter Norvig et al. (2009).
11. "The Lack of A Priori Distinctions Between Learning Algorithms," D. Wolpert (1996).
Chapter 2. End-to-End Machine Learning Project
In this chapter, you will go through an example project end to end, pretending to be a recently hired data scientist in a real estate company.¹ Here are the main steps you will go through:
1. Look at the big picture.
2. Get the data.
3. Discover and visualize the data to gain insights.
4. Prepare the data for Machine Learning algorithms.
5. Select a model and train it.
6. Fine-tune your model.
7. Present your solution.
8. Launch, monitor, and maintain your system.
Working with Real Data
When you are learning about Machine Learning it is best to actually experiment with real-world data, not just artificial datasets. Fortunately, there are thousands of open datasets to choose from, ranging across all sorts of domains. Here are a few places you can look to get data:
Popular open data repositories:
UC Irvine Machine Learning Repository
Kaggle datasets
Amazon's AWS datasets
Meta portals (they list open data repositories):
http://dataportals.org/
http://opendatamonitor.eu/
http://quandl.com/
Other pages listing many popular open data repositories:
Wikipedia's list of Machine Learning datasets
Quora.com question
Datasets subreddit
In this chapter we chose the California Housing Prices dataset from the StatLib repository2 (see Figure 2-1). This dataset was based on data from the 1990 California census. It is not exactly recent (you could still afford a nice house in the Bay Area at the time), but it has many qualities for learning, so we will pretend it is recent data. We also added a categorical attribute and removed a few features for teaching purposes.
Figure 2-1. California housing prices
Look at the Big Picture
Welcome to Machine Learning Housing Corporation! The first task you are asked to perform is to build a model of housing prices in California using the California census data. This data has metrics such as the population, median income, median housing price, and so on for each block group in California. Block groups are the smallest geographical unit for which the US Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people). We will just call them "districts" for short.
Your model should learn from this data and be able to predict the median housing price in any district, given all the other metrics.
TIP
Since you are a well-organized data scientist, the first thing you do is pull out your Machine Learning project checklist. You can start with the one in Appendix B; it should work reasonably well for most Machine Learning projects, but make sure to adapt it to your needs. In this chapter we will go through many checklist items, but we will also skip a few, either because they are self-explanatory or because they will be discussed in later chapters.
Frame the Problem
The first question to ask your boss is what exactly the business objective is; building a model is probably not the end goal. How does the company expect to use and benefit from this model? This is important because it will determine how you frame the problem, what algorithms you will select, what performance measure you will use to evaluate your model, and how much effort you should spend tweaking it.
Your boss answers that your model's output (a prediction of a district's median housing price) will be fed to another Machine Learning system (see Figure 2-2), along with many other signals.3 This downstream system will determine whether it is worth investing in a given area or not. Getting this right is critical, as it directly affects revenue.
Figure 2-2. A Machine Learning pipeline for real estate investments
PIPELINES
A sequence of data processing components is called a data pipeline. Pipelines are very common in Machine Learning systems, since there is a lot of data to manipulate and many data transformations to apply.
Components typically run asynchronously. Each component pulls in a large amount of data, processes it, and spits out the result in another data store, and then some time later the next component in the pipeline pulls this data and spits out its own output, and so on. Each component is fairly self-contained: the interface between components is simply the data store. This makes the system quite simple to grasp (with the help of a data flow graph), and different teams can focus on different components. Moreover, if a component breaks down, the downstream components can often continue to run normally (at least for a while) by just using the last output from the broken component. This makes the architecture quite robust.
On the other hand, a broken component can go unnoticed for some time if proper monitoring is not implemented. The data gets stale and the overall system's performance drops.
The next question to ask is what the current solution looks like (if any). It will often give you a reference performance, as well as insights on how to solve the problem. Your boss answers that the district housing prices are currently estimated manually by experts: a team gathers up-to-date information about a district, and when they cannot get the median housing price, they estimate it using complex rules.
This is costly and time-consuming, and their estimates are not great; in cases where they manage to find out the actual median housing price, they often realize that their estimates were off by more than 10%. This is why the company thinks that it would be useful to train a model to predict a district's median housing price given other data about that district. The census data looks like a great dataset to exploit for this purpose, since it includes the median housing prices of thousands of districts, as well as other data.
Okay, with all this information you are now ready to start designing your system. First, you need to frame the problem: is it supervised, unsupervised, or Reinforcement Learning? Is it a classification task, a regression task, or something else? Should you use batch learning or online learning techniques? Before you read on, pause and try to answer these questions for yourself.
Have you found the answers? Let's see: it is clearly a typical supervised learning task since you are given labeled training examples (each instance comes with the expected output, i.e., the district's median housing price). Moreover, it is also a typical regression task, since you are asked to predict a value. More specifically, this is a multivariate regression problem since the system will use multiple features to make a prediction (it will use the district's population, the median income, etc.). In the first chapter, you predicted life satisfaction based on just one feature, the GDP per capita, so it was a univariate regression problem. Finally, there is no continuous flow of data coming into the system, there is no particular need to adjust to changing data rapidly, and the data is small enough to fit in memory, so plain batch learning should do just fine.
TIP
If the data was huge, you could either split your batch learning work across multiple servers (using the MapReduce technique, as we will see later), or you could use an online learning technique instead.
Select a Performance Measure
Your next step is to select a performance measure. A typical performance measure for regression problems is the Root Mean Square Error (RMSE). It gives an idea of how much error the system typically makes in its predictions, with a higher weight for large errors. Equation 2-1 shows the mathematical formula to compute the RMSE.
Equation 2-1. Root Mean Square Error (RMSE)

    RMSE(X, h) = \sqrt{ \frac{1}{m} \sum_{i=1}^{m} \left( h(x^{(i)}) - y^{(i)} \right)^2 }

NOTATIONS
This equation introduces several very common Machine Learning notations that we will use throughout this book:
m is the number of instances in the dataset you are measuring the RMSE on.
For example, if you are evaluating the RMSE on a validation set of 2,000 districts, then m = 2,000.
x(i) is a vector of all the feature values (excluding the label) of the ith instance in the dataset, and y(i) is its label (the desired output value for that instance).
For example, if the first district in the dataset is located at longitude –118.29°, latitude 33.91°, and it has 1,416 inhabitants with a median income of $38,372, and the median house value is $156,400 (ignoring the other features for now), then:

    x^{(1)} = \begin{pmatrix} -118.29 \\ 33.91 \\ 1{,}416 \\ 38{,}372 \end{pmatrix}

and:

    y^{(1)} = 156{,}400

X is a matrix containing all the feature values (excluding labels) of all instances in the dataset. There is one row per instance and the ith row is equal to the transpose of x(i), noted (x(i))T.4
For example, if the first district is as just described, then the matrix X looks like this:

    X = \begin{pmatrix} (x^{(1)})^T \\ (x^{(2)})^T \\ \vdots \\ (x^{(2000)})^T \end{pmatrix} = \begin{pmatrix} -118.29 & 33.91 & 1{,}416 & 38{,}372 \\ \vdots & \vdots & \vdots & \vdots \end{pmatrix}

h is your system's prediction function, also called a hypothesis. When your system is given an instance's feature vector x(i), it outputs a predicted value ŷ(i) = h(x(i)) for that instance (ŷ is pronounced "y-hat").
For example, if your system predicts that the median housing price in the first district is $158,400, then ŷ(1) = h(x(1)) = 158,400. The prediction error for this district is ŷ(1) – y(1) = 2,000.
RMSE(X, h) is the cost function measured on the set of examples using your hypothesis h.
We use lowercase italic font for scalar values (such as m or y(i)) and function names (such as h), lowercase bold font for vectors (such as x(i)), and uppercase bold font for matrices (such as X).
Even though the RMSE is generally the preferred performance measure for regression tasks, in some contexts you may prefer to use another function. For example, suppose that there are many outlier districts. In that case, you may consider using the Mean Absolute Error (also called the Average Absolute Deviation; see Equation 2-2):
Equation 2-2. Mean Absolute Error

    MAE(X, h) = \frac{1}{m} \sum_{i=1}^{m} \left| h(x^{(i)}) - y^{(i)} \right|

Both the RMSE and the MAE are ways to measure the distance between two vectors: the vector of predictions and the vector of target values. Various distance measures, or norms, are possible:
Computing the root of a sum of squares (RMSE) corresponds to the Euclidean norm: it is the notion of distance you are familiar with. It is also called the ℓ2 norm, noted ∥·∥2 (or just ∥·∥).
Computing the sum of absolutes (MAE) corresponds to the ℓ1 norm, noted ∥·∥1. It is sometimes called the Manhattan norm because it measures the distance between two points in a city if you can only travel along orthogonal city blocks.
More generally, the ℓk norm of a vector v containing n elements is defined as ∥v∥k = (|v1|^k + |v2|^k + ⋯ + |vn|^k)^{1/k}. ℓ0 just gives the number of non-zero elements in the vector, and ℓ∞ gives the maximum absolute value in the vector.
The higher the norm index, the more it focuses on large values and neglects small ones. This is why the RMSE is more sensitive to outliers than the MAE. But when outliers are exponentially rare (like in a bell-shaped curve), the RMSE performs very well and is generally preferred.
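To make these two measures concrete, here is a minimal sketch (using NumPy, with made-up prediction and label values that are purely illustrative) of how the RMSE and MAE from Equations 2-1 and 2-2 can be computed directly:
import numpy as np

# Hypothetical predictions and labels for five districts (illustrative values only)
predictions = np.array([158400., 310000., 205000., 52000., 240000.])
labels      = np.array([156400., 340600., 196900., 46300., 254500.])

errors = predictions - labels
rmse = np.sqrt((errors ** 2).mean())   # Equation 2-1: root of the mean of squared errors
mae  = np.abs(errors).mean()           # Equation 2-2: mean of absolute errors

print(rmse, mae)   # the RMSE is always >= the MAE, and the gap grows when there are outliers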
Check the Assumptions
Lastly, it is good practice to list and verify the assumptions that were made so far (by you or others); this can catch serious issues early on. For example, the district prices that your system outputs are going to be fed into a downstream Machine Learning system, and we assume that these prices are going to be used as such. But what if the downstream system actually converts the prices into categories (e.g., "cheap," "medium," or "expensive") and then uses those categories instead of the prices themselves? In this case, getting the price perfectly right is not important at all; your system just needs to get the category right. If that's so, then the problem should have been framed as a classification task, not a regression task. You don't want to find this out after working on a regression system for months.
Fortunately, after talking with the team in charge of the downstream system, you are confident that they do indeed need the actual prices, not just categories. Great! You're all set, the lights are green, and you can start coding now!
Get the Data
It's time to get your hands dirty. Don't hesitate to pick up your laptop and walk through the following code examples in a Jupyter notebook. The full Jupyter notebook is available at https://github.com/ageron/handson-ml.
Create the Workspace
First you will need to have Python installed. It is probably already installed on your system. If not, you can get it at https://www.python.org/.5
Next you need to create a workspace directory for your Machine Learning code and datasets. Open a terminal and type the following commands (after the $ prompts):
$ export ML_PATH="$HOME/ml"     # You can change the path if you prefer
$ mkdir -p $ML_PATH
You will need a number of Python modules: Jupyter, NumPy, Pandas, Matplotlib, and Scikit-Learn. If you already have Jupyter running with all these modules installed, you can safely skip to "Download the Data". If you don't have them yet, there are many ways to install them (and their dependencies). You can use your system's packaging system (e.g., apt-get on Ubuntu, or MacPorts or HomeBrew on macOS), install a Scientific Python distribution such as Anaconda and use its packaging system, or just use Python's own packaging system, pip, which is included by default with the Python binary installers (since Python 2.7.9).6 You can check to see if pip is installed by typing the following command:
$ pip3 --version
pip 9.0.1 from [...]/lib/python3.5/site-packages (python 3.5)
You should make sure you have a recent version of pip installed, at the very least >1.4 to support binary module installation (a.k.a. wheels). To upgrade the pip module, type:7
$ pip3 install --upgrade pip
Collecting pip
[...]
Successfully installed pip-9.0.1
CREATING AN ISOLATED ENVIRONMENT
If you would like to work in an isolated environment (which is strongly recommended so you can work on different projects without having conflicting library versions), install virtualenv by running the following pip command:
$ pip3 install --user --upgrade virtualenv
Collecting virtualenv
[...]
Successfully installed virtualenv
Now you can create an isolated Python environment by typing:
$ cd $ML_PATH
$ virtualenv env
Using base prefix '[...]'
New python executable in [...]/ml/env/bin/python3.5
Also creating executable in [...]/ml/env/bin/python
Installing setuptools, pip, wheel...done.
Now every time you want to activate this environment, just open a terminal and type:
$ cd $ML_PATH
$ source env/bin/activate
While the environment is active, any package you install using pip will be installed in this isolated environment, and Python will only have access to these packages (if you also want access to the system's site packages, you should create the environment using virtualenv's --system-site-packages option). Check out virtualenv's documentation for more information.
Now you can install all the required modules and their dependencies using this simple pip command:
$ pip3 install --upgrade jupyter matplotlib numpy pandas scipy scikit-learn
Collecting jupyter
Downloading jupyter-1.0.0-py2.py3-none-any.whl
Collecting matplotlib
[...]
To check your installation, try to import every module like this:
$ python3 -c "import jupyter, matplotlib, numpy, pandas, scipy, sklearn"
There should be no output and no error. Now you can fire up Jupyter by typing:
$ jupyter notebook
[I 15:24 NotebookApp] Serving notebooks from local directory: [...]/ml
[I 15:24 NotebookApp] 0 active kernels
[I 15:24 NotebookApp] The Jupyter Notebook is running at: http://localhost:8888/
[I 15:24 NotebookApp] Use Control-C to stop this server and shut down all
kernels (twice to skip confirmation).
A Jupyter server is now running in your terminal, listening to port 8888. You can visit this server by opening your web browser to http://localhost:8888/ (this usually happens automatically when the server starts). You should see your empty workspace directory (containing only the env directory if you followed the preceding virtualenv instructions).
Now create a new Python notebook by clicking on the New button and selecting the appropriate Python version8 (see Figure 2-3).
This does three things: first, it creates a new notebook file called Untitled.ipynb in your workspace; second, it starts a Jupyter Python kernel to run this notebook; and third, it opens this notebook in a new tab. You should start by renaming this notebook to "Housing" (this will automatically rename the file to Housing.ipynb) by clicking Untitled and typing the new name.
Figure 2-3. Your workspace in Jupyter
A notebook contains a list of cells. Each cell can contain executable code or formatted text. Right now the notebook contains only one empty code cell, labeled "In [1]:". Try typing print("Hello world!") in the cell, and click on the play button (see Figure 2-4) or press Shift-Enter. This sends the current cell to this notebook's Python kernel, which runs it and returns the output. The result is displayed below the cell, and since we reached the end of the notebook, a new cell is automatically created. Go through the User Interface Tour from Jupyter's Help menu to learn the basics.
Figure 2-4. Hello world Python notebook
Download the Data
In typical environments your data would be available in a relational database (or some other common datastore) and spread across multiple tables/documents/files. To access it, you would first need to get your credentials and access authorizations,9 and familiarize yourself with the data schema. In this project, however, things are much simpler: you will just download a single compressed file, housing.tgz, which contains a comma-separated value (CSV) file called housing.csv with all the data.
You could use your web browser to download it, and run tar xzf housing.tgz to decompress the file and extract the CSV file, but it is preferable to create a small function to do that. It is useful in particular if data changes regularly, as it allows you to write a small script that you can run whenever you need to fetch the latest data (or you can set up a scheduled job to do that automatically at regular intervals). Automating the process of fetching the data is also useful if you need to install the dataset on multiple machines.
Here is the function to fetch the data:10
import os
import tarfile
from six.moves import urllib

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    if not os.path.isdir(housing_path):
        os.makedirs(housing_path)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path=housing_path)
    housing_tgz.close()
Now when you call fetch_housing_data(), it creates a datasets/housing directory in your workspace, downloads the housing.tgz file, and extracts the housing.csv from it in this directory.
Now let's load the data using Pandas. Once again you should write a small function to load the data:
import pandas as pd

def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

This function returns a Pandas DataFrame object containing all the data.
Take a Quick Look at the Data Structure
Let's take a look at the top five rows using the DataFrame's head() method (see Figure 2-5).
Figure 2-5. Top five rows in the dataset
Each row represents one district. There are 10 attributes (you can see the first 6 in the screenshot): longitude, latitude, housing_median_age, total_rooms, total_bedrooms, population, households, median_income, median_house_value, and ocean_proximity.
The info() method is useful to get a quick description of the data, in particular the total number of rows, and each attribute's type and number of non-null values (see Figure 2-6).
Figure 2-6. Housing info
There are 20,640 instances in the dataset, which means that it is fairly small by Machine Learning standards, but it's perfect to get started. Notice that the total_bedrooms attribute has only 20,433 non-null values, meaning that 207 districts are missing this feature. We will need to take care of this later.
All attributes are numerical, except the ocean_proximity field. Its type is object, so it could hold any kind of Python object, but since you loaded this data from a CSV file you know that it must be a text attribute. When you looked at the top five rows, you probably noticed that the values in the ocean_proximity column were repetitive, which means that it is probably a categorical attribute. You can find out what categories exist and how many districts belong to each category by using the value_counts() method:
>>> housing["ocean_proximity"].value_counts()
<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: ocean_proximity, dtype: int64
Let’slookattheotherfields.Thedescribe()methodshowsasummaryofthenumericalattributes(Figure2-7).
Figure2-7.Summaryofeachnumericalattribute
Thecount,mean,min,andmaxrowsareself-explanatory.Notethatthenullvaluesareignored(so,forexample,countoftotal_bedroomsis20,433,not20,640).Thestdrowshowsthestandarddeviation,whichmeasureshowdispersedthevaluesare.11The25%,50%,and75%rowsshowthecorrespondingpercentiles:apercentileindicatesthevaluebelowwhichagivenpercentageofobservationsinagroupofobservationsfalls.Forexample,25%ofthedistrictshaveahousing_median_agelowerthan18,while50%arelowerthan29and75%arelowerthan37.Theseareoftencalledthe25thpercentile(or1stquartile),themedian,andthe75thpercentile(or3rdquartile).
Another quick way to get a feel of the type of data you are dealing with is to plot a histogram for each numerical attribute. A histogram shows the number of instances (on the vertical axis) that have a given value range (on the horizontal axis). You can either plot this one attribute at a time, or you can call the hist() method on the whole dataset, and it will plot a histogram for each numerical attribute (see Figure 2-8). For example, you can see that slightly over 800 districts have a median_house_value equal to about $100,000.
%matplotlib inline   # only in a Jupyter notebook
import matplotlib.pyplot as plt
housing.hist(bins=50, figsize=(20,15))
plt.show()
NOTE
The hist() method relies on Matplotlib, which in turn relies on a user-specified graphical backend to draw on your screen. So before you can plot anything, you need to specify which backend Matplotlib should use. The simplest option is to use Jupyter's magic command %matplotlib inline. This tells Jupyter to set up Matplotlib so it uses Jupyter's own backend. Plots are then rendered within the notebook itself. Note that calling show() is optional in a Jupyter notebook, as Jupyter will automatically display plots when a cell is executed.
Figure 2-8. A histogram for each numerical attribute
Notice a few things in these histograms:
1. First, the median income attribute does not look like it is expressed in US dollars (USD). After checking with the team that collected the data, you are told that the data has been scaled and capped at 15 (actually 15.0001) for higher median incomes, and at 0.5 (actually 0.4999) for lower median incomes. Working with preprocessed attributes is common in Machine Learning, and it is not necessarily a problem, but you should try to understand how the data was computed.
2. The housing median age and the median house value were also capped. The latter may be a serious problem since it is your target attribute (your labels). Your Machine Learning algorithms may learn that prices never go beyond that limit. You need to check with your client team (the team that will use your system's output) to see if this is a problem or not. If they tell you that they need precise predictions even beyond $500,000, then you have mainly two options:
a. Collect proper labels for the districts whose labels were capped.
b. Remove those districts from the training set (and also from the test set, since your system should not be evaluated poorly if it predicts values beyond $500,000).
3. These attributes have very different scales. We will discuss this later in this chapter when we explore feature scaling.
4. Finally, many histograms are tail heavy: they extend much farther to the right of the median than to the left. This may make it a bit harder for some Machine Learning algorithms to detect patterns. We will try transforming these attributes later on to have more bell-shaped distributions (a quick sketch of such a transformation follows this list).
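As an illustration of point 4, here is a minimal sketch (not taken from the book's code) of how a tail-heavy attribute such as population could be log-transformed to obtain a more bell-shaped distribution; the new column name and the bin count are just illustrative choices:
import numpy as np

# Compress the long right tail into a more symmetric distribution.
# np.log1p computes log(1 + x), which also handles zero values safely.
housing["log_population"] = np.log1p(housing["population"])

# Compare the original and transformed distributions side by side
housing[["population", "log_population"]].hist(bins=50, figsize=(12, 4))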
Hopefully you now have a better understanding of the kind of data you are dealing with.
WARNING
Wait! Before you look at the data any further, you need to create a test set, put it aside, and never look at it.
Create a Test Set
It may sound strange to voluntarily set aside part of the data at this stage. After all, you have only taken a quick glance at the data, and surely you should learn a whole lot more about it before you decide what algorithms to use, right? This is true, but your brain is an amazing pattern detection system, which means that it is highly prone to overfitting: if you look at the test set, you may stumble upon some seemingly interesting pattern in the test data that leads you to select a particular kind of Machine Learning model. When you estimate the generalization error using the test set, your estimate will be too optimistic and you will launch a system that will not perform as well as expected. This is called data snooping bias.
Creating a test set is theoretically quite simple: just pick some instances randomly, typically 20% of the dataset, and set them aside:
import numpy as np

def split_train_test(data, test_ratio):
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]

You can then use this function like this:
>>> train_set, test_set = split_train_test(housing, 0.2)
>>> print(len(train_set), "train +", len(test_set), "test")
16512 train + 4128 test
Well, this works, but it is not perfect: if you run the program again, it will generate a different test set! Over time, you (or your Machine Learning algorithms) will get to see the whole dataset, which is what you want to avoid.
One solution is to save the test set on the first run and then load it in subsequent runs. Another option is to set the random number generator's seed (e.g., np.random.seed(42))12 before calling np.random.permutation(), so that it always generates the same shuffled indices.
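For instance, a minimal sketch of the second option (the seed value 42 is arbitrary; any fixed integer works):
import numpy as np

np.random.seed(42)   # fix the random number generator's seed...
train_set, test_set = split_train_test(housing, 0.2)   # ...so this split is identical on every run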
But both these solutions will break next time you fetch an updated dataset. A common solution is to use each instance's identifier to decide whether or not it should go in the test set (assuming instances have a unique and immutable identifier). For example, you could compute a hash of each instance's identifier, keep only the last byte of the hash, and put the instance in the test set if this value is lower than or equal to 51 (~20% of 256). This ensures that the test set will remain consistent across multiple runs, even if you refresh the dataset. The new test set will contain 20% of the new instances, but it will not contain any instance that was previously in the training set. Here is a possible implementation:
import hashlib

def test_set_check(identifier, test_ratio, hash):
    return hash(np.int64(identifier)).digest()[-1] < 256 * test_ratio

def split_train_test_by_id(data, test_ratio, id_column, hash=hashlib.md5):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio, hash))
    return data.loc[~in_test_set], data.loc[in_test_set]

Unfortunately, the housing dataset does not have an identifier column. The simplest solution is to use the row index as the ID:
housing_with_id = housing.reset_index()   # adds an `index` column
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "index")
If you use the row index as a unique identifier, you need to make sure that new data gets appended to the end of the dataset, and no row ever gets deleted. If this is not possible, then you can try to use the most stable features to build a unique identifier. For example, a district's latitude and longitude are guaranteed to be stable for a few million years, so you could combine them into an ID like so:13
housing_with_id["id"] = housing["longitude"] * 1000 + housing["latitude"]
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "id")
Scikit-Learn provides a few functions to split datasets into multiple subsets in various ways. The simplest function is train_test_split, which does pretty much the same thing as the function split_train_test defined earlier, with a couple of additional features. First there is a random_state parameter that allows you to set the random generator seed as explained previously, and second you can pass it multiple datasets with an identical number of rows, and it will split them on the same indices (this is very useful, for example, if you have a separate DataFrame for labels):
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
So far we have considered purely random sampling methods. This is generally fine if your dataset is large enough (especially relative to the number of attributes), but if it is not, you run the risk of introducing a significant sampling bias. When a survey company decides to call 1,000 people to ask them a few questions, they don't just pick 1,000 people randomly in a phone book. They try to ensure that these 1,000 people are representative of the whole population. For example, the US population is composed of 51.3% female and 48.7% male, so a well-conducted survey in the US would try to maintain this ratio in the sample: 513 female and 487 male. This is called stratified sampling: the population is divided into homogeneous subgroups called strata, and the right number of instances is sampled from each stratum to guarantee that the test set is representative of the overall population. If they used purely random sampling, there would be about a 12% chance of sampling a skewed test set with either less than 49% female or more than 54% female. Either way, the survey results would be significantly biased.
Suppose you chatted with experts who told you that the median income is a very important attribute to predict median housing prices. You may want to ensure that the test set is representative of the various categories of incomes in the whole dataset. Since the median income is a continuous numerical attribute, you first need to create an income category attribute. Let's look at the median income histogram more closely (see Figure 2-8): most median income values are clustered around $20,000–$50,000, but some median incomes go far beyond $60,000. It is important to have a sufficient number of instances in your dataset for each stratum, or else the estimate of the stratum's importance may be biased. This means that you should not have too many strata, and each stratum should be large enough. The following code creates an income category attribute by dividing the median income by 1.5 (to limit the number of income categories), rounding up using ceil (to have discrete categories), and then merging all the categories greater than 5 into category 5:
housing["income_cat"] = np.ceil(housing["median_income"] / 1.5)
housing["income_cat"].where(housing["income_cat"] < 5, 5.0, inplace=True)
These income categories are represented in Figure 2-9:
Figure 2-9. Histogram of income categories
Now you are ready to do stratified sampling based on the income category. For this you can use Scikit-Learn's StratifiedShuffleSplit class:
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

Let's see if this worked as expected. You can start by looking at the income category proportions in the full housing dataset:
>>> housing["income_cat"].value_counts() / len(housing)
3.0    0.350581
2.0    0.318847
4.0    0.176308
5.0    0.114438
1.0    0.039826
Name: income_cat, dtype: float64
With similar code you can measure the income category proportions in the test set. Figure 2-10 compares the income category proportions in the overall dataset, in the test set generated with stratified sampling, and in a test set generated using purely random sampling. As you can see, the test set generated using stratified sampling has income category proportions almost identical to those in the full dataset, whereas the test set generated using purely random sampling is quite skewed.
Figure 2-10. Sampling bias comparison of stratified versus purely random sampling
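As a minimal sketch of that "similar code" (the income_cat_proportions helper and the comparison table are not from the book; they are just one way to build Figure 2-10's numbers):
def income_cat_proportions(data):
    # fraction of districts falling into each income category
    return data["income_cat"].value_counts() / len(data)

# fresh purely random split, taken after income_cat was added
rand_train_set, rand_test_set = train_test_split(housing, test_size=0.2, random_state=42)

compare_props = pd.DataFrame({
    "Overall": income_cat_proportions(housing),
    "Stratified": income_cat_proportions(strat_test_set),
    "Random": income_cat_proportions(rand_test_set),
}).sort_index()
print(compare_props)   # the Stratified column should track Overall much more closely than Random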
Now you should remove the income_cat attribute so the data is back to its original state:
for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)
We spent quite a bit of time on test set generation for a good reason: this is an often neglected but critical part of a Machine Learning project. Moreover, many of these ideas will be useful later when we discuss cross-validation. Now it's time to move on to the next stage: exploring the data.
Discover and Visualize the Data to Gain Insights
So far you have only taken a quick glance at the data to get a general understanding of the kind of data you are manipulating. Now the goal is to go a little bit more in depth.
First, make sure you have put the test set aside and you are only exploring the training set. Also, if the training set is very large, you may want to sample an exploration set, to make manipulations easy and fast. In our case, the set is quite small so you can just work directly on the full set. Let's create a copy so you can play with it without harming the training set:
housing = strat_train_set.copy()
Visualizing Geographical Data
Since there is geographical information (latitude and longitude), it is a good idea to create a scatterplot of all districts to visualize the data (Figure 2-11):
housing.plot(kind="scatter", x="longitude", y="latitude")
Figure 2-11. A geographical scatterplot of the data
This looks like California all right, but other than that it is hard to see any particular pattern. Setting the alpha option to 0.1 makes it much easier to visualize the places where there is a high density of data points (Figure 2-12):
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1)
Figure 2-12. A better visualization highlighting high-density areas
Now that's much better: you can clearly see the high-density areas, namely the Bay Area and around Los Angeles and San Diego, plus a long line of fairly high density in the Central Valley, in particular around Sacramento and Fresno.
More generally, our brains are very good at spotting patterns in pictures, but you may need to play around with visualization parameters to make the patterns stand out.
Nowlet’slookatthehousingprices(Figure2-13).Theradiusofeachcirclerepresentsthedistrict’spopulation(options),andthecolorrepresentstheprice(optionc).Wewilluseapredefinedcolormap(optioncmap)calledjet,whichrangesfromblue(lowvalues)tored(highprices):14
housing.plot(kind="scatter",x="longitude",y="latitude",alpha=0.4,
s=housing["population"]/100,label="population",figsize=(10,7),
c="median_house_value",cmap=plt.get_cmap("jet"),colorbar=True,
)
plt.legend()
Figure2-13.Californiahousingprices
Thisimagetellsyouthatthehousingpricesareverymuchrelatedtothelocation(e.g.,closetotheocean)andtothepopulationdensity,asyouprobablyknewalready.Itwillprobablybeusefultouseaclusteringalgorithmtodetectthemainclusters,andaddnewfeaturesthatmeasuretheproximitytotheclustercenters.Theoceanproximityattributemaybeusefulaswell,althoughinNorthernCaliforniathehousingpricesincoastaldistrictsarenottoohigh,soitisnotasimplerule.
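The book does not show code for that clustering idea at this point, but a minimal sketch of it might look like the following (the choice of KMeans, 10 clusters, and the new feature name are illustrative assumptions, not the author's method):
from sklearn.cluster import KMeans

coords = housing[["longitude", "latitude"]]
kmeans = KMeans(n_clusters=10, random_state=42).fit(coords)   # find 10 geographic clusters

# For each district, the distance to the nearest cluster center, as a new feature
distances = kmeans.transform(coords)                 # shape: (n_districts, 10)
housing["dist_to_nearest_cluster"] = distances.min(axis=1)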
Looking for Correlations
Since the dataset is not too large, you can easily compute the standard correlation coefficient (also called Pearson's r) between every pair of attributes using the corr() method:
corr_matrix = housing.corr()
Now let's look at how much each attribute correlates with the median house value:
>>> corr_matrix["median_house_value"].sort_values(ascending=False)
median_house_value    1.000000
median_income         0.687170
total_rooms           0.135231
housing_median_age    0.114220
households            0.064702
total_bedrooms        0.047865
population           -0.026699
longitude            -0.047279
latitude             -0.142826
Name: median_house_value, dtype: float64
The correlation coefficient ranges from –1 to 1. When it is close to 1, it means that there is a strong positive correlation; for example, the median house value tends to go up when the median income goes up. When the coefficient is close to –1, it means that there is a strong negative correlation; you can see a small negative correlation between the latitude and the median house value (i.e., prices have a slight tendency to go down when you go north). Finally, coefficients close to zero mean that there is no linear correlation. Figure 2-14 shows various plots along with the correlation coefficient between their horizontal and vertical axes.
Figure 2-14. Standard correlation coefficient of various datasets (source: Wikipedia; public domain image)
WARNING
The correlation coefficient only measures linear correlations ("if x goes up, then y generally goes up/down"). It may completely miss out on nonlinear relationships (e.g., "if x is close to zero then y generally goes up"). Note how all the plots of the bottom row have a correlation coefficient equal to zero despite the fact that their axes are clearly not independent: these are examples of nonlinear relationships. Also, the second row shows examples where the correlation coefficient is equal to 1 or –1; notice that this has nothing to do with the slope. For example, your height in inches has a correlation coefficient of 1 with your height in feet or in nanometers.
Another way to check for correlation between attributes is to use Pandas' scatter_matrix function, which plots every numerical attribute against every other numerical attribute. Since there are now 11 numerical attributes, you would get 11² = 121 plots, which would not fit on a page, so let's just focus on a few promising attributes that seem most correlated with the median housing value (Figure 2-15):
from pandas.tools.plotting import scatter_matrix

attributes = ["median_house_value", "median_income", "total_rooms",
              "housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))
Figure 2-15. Scatter matrix
The main diagonal (top left to bottom right) would be full of straight lines if Pandas plotted each variable against itself, which would not be very useful. So instead Pandas displays a histogram of each attribute (other options are available; see Pandas' documentation for more details).
The most promising attribute to predict the median house value is the median income, so let's zoom in on their correlation scatterplot (Figure 2-16):
housing.plot(kind="scatter", x="median_income", y="median_house_value",
             alpha=0.1)
This plot reveals a few things. First, the correlation is indeed very strong; you can clearly see the upward trend and the points are not too dispersed. Second, the price cap that we noticed earlier is clearly visible as a horizontal line at $500,000. But this plot reveals other less obvious straight lines: a horizontal line around $450,000, another around $350,000, perhaps one around $280,000, and a few more below that. You may want to try removing the corresponding districts to prevent your algorithms from learning to reproduce these data quirks (a quick sketch follows the figure).
Figure 2-16. Median income versus median house value
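The book does not show that cleanup step here, but a minimal sketch of removing the price-capped districts could look like this (the $500,000 threshold and the decision to drop only the top cap are illustrative assumptions, and whether to drop them at all is the client team's call, as discussed earlier):
# Districts whose label sits at or above the cap are suspect; drop them from the exploration copy
capped = housing["median_house_value"] >= 500000
print(capped.sum(), "districts at or above the $500,000 cap")
housing_uncapped = housing[~capped].copy()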
Experimenting with Attribute Combinations
Hopefully the previous sections gave you an idea of a few ways you can explore the data and gain insights. You identified a few data quirks that you may want to clean up before feeding the data to a Machine Learning algorithm, and you found interesting correlations between attributes, in particular with the target attribute. You also noticed that some attributes have a tail-heavy distribution, so you may want to transform them (e.g., by computing their logarithm). Of course, your mileage will vary considerably with each project, but the general ideas are similar.
One last thing you may want to do before actually preparing the data for Machine Learning algorithms is to try out various attribute combinations. For example, the total number of rooms in a district is not very useful if you don't know how many households there are. What you really want is the number of rooms per household. Similarly, the total number of bedrooms by itself is not very useful: you probably want to compare it to the number of rooms. And the population per household also seems like an interesting attribute combination to look at. Let's create these new attributes:
housing["rooms_per_household"] = housing["total_rooms"] / housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"] / housing["total_rooms"]
housing["population_per_household"] = housing["population"] / housing["households"]
Andnowlet’slookatthecorrelationmatrixagain:
>>>corr_matrix=housing.corr()
>>>corr_matrix["median_house_value"].sort_values(ascending=False)
median_house_value1.000000
median_income0.687160
rooms_per_household0.146285
total_rooms0.135097
housing_median_age0.114110
households0.064506
total_bedrooms0.047689
population_per_household-0.021985
population-0.026920
longitude-0.047432
latitude-0.142724
bedrooms_per_room-0.259984
Name:median_house_value,dtype:float64
Hey,notbad!Thenewbedrooms_per_roomattributeismuchmorecorrelatedwiththemedianhousevaluethanthetotalnumberofroomsorbedrooms.Apparentlyhouseswithalowerbedroom/roomratiotendtobemoreexpensive.Thenumberofroomsperhouseholdisalsomoreinformativethanthetotalnumberofroomsinadistrict—obviouslythelargerthehouses,themoreexpensivetheyare.
Thisroundofexplorationdoesnothavetobeabsolutelythorough;thepointistostartoffontherightfootandquicklygaininsightsthatwillhelpyougetafirstreasonablygoodprototype.Butthisisaniterativeprocess:onceyougetaprototypeupandrunning,youcananalyzeitsoutputtogainmoreinsightsandcomebacktothisexplorationstep.
Prepare the Data for Machine Learning Algorithms
It's time to prepare the data for your Machine Learning algorithms. Instead of just doing this manually, you should write functions to do that, for several good reasons:
This will allow you to reproduce these transformations easily on any dataset (e.g., the next time you get a fresh dataset).
You will gradually build a library of transformation functions that you can reuse in future projects.
You can use these functions in your live system to transform the new data before feeding it to your algorithms.
This will make it possible for you to easily try various transformations and see which combination of transformations works best.
But first let's revert to a clean training set (by copying strat_train_set once again), and let's separate the predictors and the labels since we don't necessarily want to apply the same transformations to the predictors and the target values (note that drop() creates a copy of the data and does not affect strat_train_set):
housing = strat_train_set.drop("median_house_value", axis=1)
housing_labels = strat_train_set["median_house_value"].copy()
Data Cleaning
Most Machine Learning algorithms cannot work with missing features, so let's create a few functions to take care of them. You noticed earlier that the total_bedrooms attribute has some missing values, so let's fix this. You have three options:
Get rid of the corresponding districts.
Get rid of the whole attribute.
Set the values to some value (zero, the mean, the median, etc.).
You can accomplish these easily using DataFrame's dropna(), drop(), and fillna() methods:
housing.dropna(subset=["total_bedrooms"])    # option 1
housing.drop("total_bedrooms", axis=1)       # option 2
median = housing["total_bedrooms"].median()  # option 3
housing["total_bedrooms"].fillna(median, inplace=True)
If you choose option 3, you should compute the median value on the training set, and use it to fill the missing values in the training set, but also don't forget to save the median value that you have computed. You will need it later to replace missing values in the test set when you want to evaluate your system, and also once the system goes live to replace missing values in new data (a quick sketch of this follows).
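A minimal sketch of what that implies for option 3 (the variable names are just illustrative; the Imputer introduced next does this bookkeeping for you):
# Learn the statistic on the training set only...
median = housing["total_bedrooms"].median()
housing["total_bedrooms"].fillna(median, inplace=True)

# ...and reuse the *same* saved value later, e.g. on the test set or on fresh data,
# without ever recomputing it on that data
test_copy = strat_test_set.copy()
test_copy["total_bedrooms"].fillna(median, inplace=True)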
Scikit-Learn provides a handy class to take care of missing values: Imputer. Here is how to use it. First, you need to create an Imputer instance, specifying that you want to replace each attribute's missing values with the median of that attribute:
from sklearn.preprocessing import Imputer

imputer = Imputer(strategy="median")
Since the median can only be computed on numerical attributes, we need to create a copy of the data without the text attribute ocean_proximity:
housing_num = housing.drop("ocean_proximity", axis=1)
Now you can fit the imputer instance to the training data using the fit() method:
imputer.fit(housing_num)
The imputer has simply computed the median of each attribute and stored the result in its statistics_ instance variable. Only the total_bedrooms attribute had missing values, but we cannot be sure that there won't be any missing values in new data after the system goes live, so it is safer to apply the imputer to all the numerical attributes:
>>> imputer.statistics_
array([ -118.51  ,    34.26  ,    29.    ,  2119.5   ,   433.    ,  1164.    ,   408.    ,     3.5409])
>>> housing_num.median().values
array([ -118.51  ,    34.26  ,    29.    ,  2119.5   ,   433.    ,  1164.    ,   408.    ,     3.5409])
Now you can use this "trained" imputer to transform the training set by replacing missing values with the learned medians:
X = imputer.transform(housing_num)
The result is a plain NumPy array containing the transformed features. If you want to put it back into a Pandas DataFrame, it's simple:
housing_tr = pd.DataFrame(X, columns=housing_num.columns)
SCIKIT-LEARN DESIGN
Scikit-Learn's API is remarkably well designed. The main design principles are:15
Consistency. All objects share a consistent and simple interface:
Estimators. Any object that can estimate some parameters based on a dataset is called an estimator (e.g., an imputer is an estimator). The estimation itself is performed by the fit() method, and it takes only a dataset as a parameter (or two for supervised learning algorithms; the second dataset contains the labels). Any other parameter needed to guide the estimation process is considered a hyperparameter (such as an imputer's strategy), and it must be set as an instance variable (generally via a constructor parameter).
Transformers. Some estimators (such as an imputer) can also transform a dataset; these are called transformers. Once again, the API is quite simple: the transformation is performed by the transform() method with the dataset to transform as a parameter. It returns the transformed dataset. This transformation generally relies on the learned parameters, as is the case for an imputer. All transformers also have a convenience method called fit_transform() that is equivalent to calling fit() and then transform() (but sometimes fit_transform() is optimized and runs much faster).
Predictors. Finally, some estimators are capable of making predictions given a dataset; they are called predictors. For example, the LinearRegression model in the previous chapter was a predictor: it predicted life satisfaction given a country's GDP per capita. A predictor has a predict() method that takes a dataset of new instances and returns a dataset of corresponding predictions. It also has a score() method that measures the quality of the predictions given a test set (and the corresponding labels in the case of supervised learning algorithms).16
Inspection. All the estimator's hyperparameters are accessible directly via public instance variables (e.g., imputer.strategy), and all the estimator's learned parameters are also accessible via public instance variables with an underscore suffix (e.g., imputer.statistics_).
Nonproliferation of classes. Datasets are represented as NumPy arrays or SciPy sparse matrices, instead of homemade classes. Hyperparameters are just regular Python strings or numbers.
Composition. Existing building blocks are reused as much as possible. For example, it is easy to create a Pipeline estimator from an arbitrary sequence of transformers followed by a final estimator, as we will see.
Sensible defaults. Scikit-Learn provides reasonable default values for most parameters, making it easy to create a baseline working system quickly.
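To tie these principles back to the imputer we just used, here is a minimal sketch of the estimator/transformer conventions in action (it simply gathers the calls from this section in one place):
imputer = Imputer(strategy="median")       # hyperparameter, set via the constructor
imputer.fit(housing_num)                   # estimator: learns parameters from the data
print(imputer.strategy)                    # inspection: hyperparameters are public attributes
print(imputer.statistics_)                 # inspection: learned parameters end with an underscore
X = imputer.transform(housing_num)         # transformer: applies what was learned
X = imputer.fit_transform(housing_num)     # convenience: fit() then transform() in one call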
Handling Text and Categorical Attributes
Earlier we left out the categorical attribute ocean_proximity because it is a text attribute so we cannot compute its median. Most Machine Learning algorithms prefer to work with numbers anyway, so let's convert these text labels to numbers.
Scikit-Learn provides a transformer for this task called LabelEncoder:
>>> from sklearn.preprocessing import LabelEncoder
>>> encoder = LabelEncoder()
>>> housing_cat = housing["ocean_proximity"]
>>> housing_cat_encoded = encoder.fit_transform(housing_cat)
>>> housing_cat_encoded
array([0, 0, 4, ..., 1, 0, 3])
This is better: now we can use this numerical data in any ML algorithm. You can look at the mapping that this encoder has learned using the classes_ attribute ("<1H OCEAN" is mapped to 0, "INLAND" is mapped to 1, etc.):
>>> print(encoder.classes_)
['<1H OCEAN' 'INLAND' 'ISLAND' 'NEAR BAY' 'NEAR OCEAN']
One issue with this representation is that ML algorithms will assume that two nearby values are more similar than two distant values. Obviously this is not the case (for example, categories 0 and 4 are more similar than categories 0 and 1). To fix this issue, a common solution is to create one binary attribute per category: one attribute equal to 1 when the category is "<1H OCEAN" (and 0 otherwise), another attribute equal to 1 when the category is "INLAND" (and 0 otherwise), and so on. This is called one-hot encoding, because only one attribute will be equal to 1 (hot), while the others will be 0 (cold).
Scikit-Learn provides a OneHotEncoder to convert integer categorical values into one-hot vectors. Let's encode the categories as one-hot vectors. Note that fit_transform() expects a 2D array, but housing_cat_encoded is a 1D array, so we need to reshape it:17
>>> from sklearn.preprocessing import OneHotEncoder
>>> encoder = OneHotEncoder()
>>> housing_cat_1hot = encoder.fit_transform(housing_cat_encoded.reshape(-1, 1))
>>> housing_cat_1hot
<16512x5 sparse matrix of type '<class 'numpy.float64'>'
	with 16512 stored elements in Compressed Sparse Row format>
Notice that the output is a SciPy sparse matrix, instead of a NumPy array. This is very useful when you have categorical attributes with thousands of categories. After one-hot encoding we get a matrix with thousands of columns, and the matrix is full of zeros except for one 1 per row. Using up tons of memory mostly to store zeros would be very wasteful, so instead a sparse matrix only stores the location of the nonzero elements. You can use it mostly like a normal 2D array,18 but if you really want to convert it to a (dense) NumPy array, just call the toarray() method:
>>> housing_cat_1hot.toarray()
array([[ 1.,  0.,  0.,  0.,  0.],
       [ 1.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  1.],
       ...,
       [ 0.,  1.,  0.,  0.,  0.],
       [ 1.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.]])
We can apply both transformations (from text categories to integer categories, then from integer categories to one-hot vectors) in one shot using the LabelBinarizer class:
>>> from sklearn.preprocessing import LabelBinarizer
>>> encoder = LabelBinarizer()
>>> housing_cat_1hot = encoder.fit_transform(housing_cat)
>>> housing_cat_1hot
array([[1, 0, 0, 0, 0],
       [1, 0, 0, 0, 0],
       [0, 0, 0, 0, 1],
       ...,
       [0, 1, 0, 0, 0],
       [1, 0, 0, 0, 0],
       [0, 0, 0, 1, 0]])
Note that this returns a dense NumPy array by default. You can get a sparse matrix instead by passing sparse_output=True to the LabelBinarizer constructor.
Custom Transformers
Although Scikit-Learn provides many useful transformers, you will need to write your own for tasks such as custom cleanup operations or combining specific attributes. You will want your transformer to work seamlessly with Scikit-Learn functionalities (such as pipelines), and since Scikit-Learn relies on duck typing (not inheritance), all you need is to create a class and implement three methods: fit() (returning self), transform(), and fit_transform(). You can get the last one for free by simply adding TransformerMixin as a base class. Also, if you add BaseEstimator as a base class (and avoid *args and **kargs in your constructor) you will get two extra methods (get_params() and set_params()) that will be useful for automatic hyperparameter tuning. For example, here is a small transformer class that adds the combined attributes we discussed earlier:
from sklearn.base import BaseEstimator, TransformerMixin

rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True):  # no *args or **kargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)
In this example the transformer has one hyperparameter, add_bedrooms_per_room, set to True by default (it is often helpful to provide sensible defaults). This hyperparameter will allow you to easily find out whether adding this attribute helps the Machine Learning algorithms or not. More generally, you can add a hyperparameter to gate any data preparation step that you are not 100% sure about. The more you automate these data preparation steps, the more combinations you can automatically try out, making it much more likely that you will find a great combination (and saving you a lot of time).
Feature Scaling
One of the most important transformations you need to apply to your data is feature scaling. With few exceptions, Machine Learning algorithms don't perform well when the input numerical attributes have very different scales. This is the case for the housing data: the total number of rooms ranges from about 6 to 39,320, while the median incomes only range from 0 to 15. Note that scaling the target values is generally not required.
There are two common ways to get all attributes to have the same scale: min-max scaling and standardization.
Min-max scaling (many people call this normalization) is quite simple: values are shifted and rescaled so that they end up ranging from 0 to 1. We do this by subtracting the min value and dividing by the max minus the min. Scikit-Learn provides a transformer called MinMaxScaler for this. It has a feature_range hyperparameter that lets you change the range if you don't want 0–1 for some reason.
Standardization is quite different: first it subtracts the mean value (so standardized values always have a zero mean), and then it divides by the standard deviation so that the resulting distribution has unit variance. Unlike min-max scaling, standardization does not bound values to a specific range, which may be a problem for some algorithms (e.g., neural networks often expect an input value ranging from 0 to 1). However, standardization is much less affected by outliers. For example, suppose a district had a median income equal to 100 (by mistake). Min-max scaling would then crush all the other values from 0–15 down to 0–0.15, whereas standardization would not be much affected. Scikit-Learn provides a transformer called StandardScaler for standardization.
WARNING
As with all the transformations, it is important to fit the scalers to the training data only, not to the full dataset (including the test set). Only then can you use them to transform the training set and the test set (and new data).
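For instance, here is a minimal sketch of both scalers applied to the numerical attributes, following the warning above by fitting on the training data only (the variable names are illustrative):
from sklearn.preprocessing import MinMaxScaler, StandardScaler

min_max_scaler = MinMaxScaler()                                    # default feature_range is (0, 1)
housing_num_minmax = min_max_scaler.fit_transform(housing_num)     # fit on the training data only

std_scaler = StandardScaler()                                      # zero mean, unit variance
housing_num_std = std_scaler.fit_transform(housing_num)

# Later, transform the test set (or new data) with the already-fitted scaler:
# test_num_scaled = std_scaler.transform(test_num)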
Transformation Pipelines
As you can see, there are many data transformation steps that need to be executed in the right order. Fortunately, Scikit-Learn provides the Pipeline class to help with such sequences of transformations. Here is a small pipeline for the numerical attributes:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
        ('imputer', Imputer(strategy="median")),
        ('attribs_adder', CombinedAttributesAdder()),
        ('std_scaler', StandardScaler()),
    ])

housing_num_tr = num_pipeline.fit_transform(housing_num)
The Pipeline constructor takes a list of name/estimator pairs defining a sequence of steps. All but the last estimator must be transformers (i.e., they must have a fit_transform() method). The names can be anything you like (as long as they don't contain double underscores, "__").
When you call the pipeline's fit() method, it calls fit_transform() sequentially on all transformers, passing the output of each call as the parameter to the next call, until it reaches the final estimator, for which it just calls the fit() method.
The pipeline exposes the same methods as the final estimator. In this example, the last estimator is a StandardScaler, which is a transformer, so the pipeline has a transform() method that applies all the transforms to the data in sequence (it also has a fit_transform() method that we could have used instead of calling fit() and then transform()).
Now it would be nice if we could feed a Pandas DataFrame directly into our pipeline, instead of having to first manually extract the numerical columns into a NumPy array. There is nothing in Scikit-Learn to handle Pandas DataFrames,19 but we can write a custom transformer for this task:
from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

Our DataFrameSelector will transform the data by selecting the desired attributes, dropping the rest, and converting the resulting DataFrame to a NumPy array. With this, you can easily write a pipeline that will take a Pandas DataFrame and handle only the numerical values: the pipeline would just start with a DataFrameSelector to pick only the numerical attributes, followed by the other preprocessing steps we discussed earlier. And you can just as easily write another pipeline for the categorical attributes as well by simply selecting the categorical attributes using a DataFrameSelector and then applying a LabelBinarizer.
num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

num_pipeline = Pipeline([
        ('selector', DataFrameSelector(num_attribs)),
        ('imputer', Imputer(strategy="median")),
        ('attribs_adder', CombinedAttributesAdder()),
        ('std_scaler', StandardScaler()),
    ])

cat_pipeline = Pipeline([
        ('selector', DataFrameSelector(cat_attribs)),
        ('label_binarizer', LabelBinarizer()),
    ])
But how can you join these two pipelines into a single pipeline? The answer is to use Scikit-Learn's FeatureUnion class. You give it a list of transformers (which can be entire transformer pipelines); when its transform() method is called, it runs each transformer's transform() method in parallel, waits for their output, and then concatenates them and returns the result (and of course calling its fit() method calls each transformer's fit() method). A full pipeline handling both numerical and categorical attributes may look like this:
from sklearn.pipeline import FeatureUnion

full_pipeline = FeatureUnion(transformer_list=[
        ("num_pipeline", num_pipeline),
        ("cat_pipeline", cat_pipeline),
    ])

And you can run the whole pipeline simply:
>>> housing_prepared = full_pipeline.fit_transform(housing)
>>> housing_prepared
array([[-1.15604281,  0.77194962,  0.74333089, ...,  0.        ,
         0.        ,  0.        ],
       [-1.17602483,  0.6596948 , -1.1653172 , ...,  0.        ,
         0.        ,  0.        ],
       [...]
>>> housing_prepared.shape
(16512, 16)
Select and Train a Model
At last! You framed the problem, you got the data and explored it, you sampled a training set and a test set, and you wrote transformation pipelines to clean up and prepare your data for Machine Learning algorithms automatically. You are now ready to select and train a Machine Learning model.
Training and Evaluating on the Training Set
The good news is that thanks to all these previous steps, things are now going to be much simpler than you might think. Let's first train a LinearRegression model, like we did in the previous chapter:
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)
Done! You now have a working Linear Regression model. Let's try it out on a few instances from the training set:
>>> some_data = housing.iloc[:5]
>>> some_labels = housing_labels.iloc[:5]
>>> some_data_prepared = full_pipeline.transform(some_data)
>>> print("Predictions:", lin_reg.predict(some_data_prepared))
Predictions: [ 210644.6045  317768.8069  210956.4333   59218.9888  189747.5584]
>>> print("Labels:", list(some_labels))
Labels: [286600.0, 340600.0, 196900.0, 46300.0, 254500.0]
It works, although the predictions are not exactly accurate (e.g., the first prediction is off by close to 40%!). Let's measure this regression model's RMSE on the whole training set using Scikit-Learn's mean_squared_error function:
>>> from sklearn.metrics import mean_squared_error
>>> housing_predictions = lin_reg.predict(housing_prepared)
>>> lin_mse = mean_squared_error(housing_labels, housing_predictions)
>>> lin_rmse = np.sqrt(lin_mse)
>>> lin_rmse
68628.198198489219
Okay, this is better than nothing but clearly not a great score: most districts' median_housing_values range between $120,000 and $265,000, so a typical prediction error of $68,628 is not very satisfying. This is an example of a model underfitting the training data. When this happens it can mean that the features do not provide enough information to make good predictions, or that the model is not powerful enough. As we saw in the previous chapter, the main ways to fix underfitting are to select a more powerful model, to feed the training algorithm with better features, or to reduce the constraints on the model. This model is not regularized, so this rules out the last option. You could try to add more features (e.g., the log of the population), but first let's try a more complex model to see how it does.
Let’strainaDecisionTreeRegressor.Thisisapowerfulmodel,capableoffindingcomplexnonlinearrelationshipsinthedata(DecisionTreesarepresentedinmoredetailinChapter6).Thecodeshouldlookfamiliarbynow:
fromsklearn.treeimportDecisionTreeRegressor
tree_reg=DecisionTreeRegressor()
tree_reg.fit(housing_prepared,housing_labels)
Nowthatthemodelistrained,let’sevaluateitonthetrainingset:
>>>housing_predictions=tree_reg.predict(housing_prepared)
>>>tree_mse=mean_squared_error(housing_labels,housing_predictions)
>>>tree_rmse=np.sqrt(tree_mse)
>>>tree_rmse
0.0
Wait,what!?Noerroratall?Couldthismodelreallybeabsolutelyperfect?Ofcourse,itismuchmorelikelythatthemodelhasbadlyoverfitthedata.Howcanyoubesure?Aswesawearlier,youdon’twanttotouchthetestsetuntilyouarereadytolaunchamodelyouareconfidentabout,soyouneedtousepartofthetrainingsetfortraining,andpartformodelvalidation.
Better Evaluation Using Cross-Validation
One way to evaluate the Decision Tree model would be to use the train_test_split function to split the training set into a smaller training set and a validation set, then train your models against the smaller training set and evaluate them against the validation set. It's a bit of work, but nothing too difficult and it would work fairly well.
A great alternative is to use Scikit-Learn's cross-validation feature. The following code performs K-fold cross-validation: it randomly splits the training set into 10 distinct subsets called folds, then it trains and evaluates the Decision Tree model 10 times, picking a different fold for evaluation every time and training on the other 9 folds. The result is an array containing the 10 evaluation scores:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)
WARNING
Scikit-Learn's cross-validation features expect a utility function (greater is better) rather than a cost function (lower is better), so the scoring function is actually the opposite of the MSE (i.e., a negative value), which is why the preceding code computes -scores before calculating the square root.
Let’slookattheresults:
>>>defdisplay_scores(scores):
...print("Scores:",scores)
...print("Mean:",scores.mean())
...print("Standarddeviation:",scores.std())
...
>>>display_scores(tree_rmse_scores)
Scores:[70232.013648266828.4683989272444.0872100370761.50186201
71125.5269765375581.2931985770169.5928616470055.37863456
75370.4911677371222.39081244]
Mean:71379.0744771
Standarddeviation:2458.31882043
Now the Decision Tree doesn't look as good as it did earlier. In fact, it seems to perform worse than the Linear Regression model! Notice that cross-validation allows you to get not only an estimate of the performance of your model, but also a measure of how precise this estimate is (i.e., its standard deviation). The Decision Tree has a score of approximately 71,379, generally ±2,458. You would not have this information if you just used one validation set. But cross-validation comes at the cost of training the model several times, so it is not always possible.
Let's compute the same scores for the Linear Regression model just to be sure:
>>> lin_scores = cross_val_score(lin_reg, housing_prepared, housing_labels,
...                              scoring="neg_mean_squared_error", cv=10)
...
>>> lin_rmse_scores = np.sqrt(-lin_scores)
>>> display_scores(lin_rmse_scores)
Scores: [ 66760.97371572  66962.61914244  70349.94853401  74757.02629506
          68031.13388938  71193.84183426  64968.13706527  68261.95557897
          71527.64217874  67665.10082067]
Mean: 69047.8379055
Standard deviation: 2735.51074287
That’sright:theDecisionTreemodelisoverfittingsobadlythatitperformsworsethantheLinearRegressionmodel.
Let’stryonelastmodelnow:theRandomForestRegressor.AswewillseeinChapter7,RandomForestsworkbytrainingmanyDecisionTreesonrandomsubsetsofthefeatures,thenaveragingouttheirpredictions.BuildingamodelontopofmanyothermodelsiscalledEnsembleLearning,anditisoftenagreatwaytopushMLalgorithmsevenfurther.Wewillskipmostofthecodesinceitisessentiallythesameasfortheothermodels:
>>> from sklearn.ensemble import RandomForestRegressor
>>> forest_reg = RandomForestRegressor()
>>> forest_reg.fit(housing_prepared, housing_labels)
>>> [...]
>>> forest_rmse
21941.911027380233
>>> display_scores(forest_rmse_scores)
Scores: [ 51650.94405471  48920.80645498  52979.16096752  54412.74042021
  50861.29381163  56488.55699727  51866.90120786  49752.24599537
  55399.50713191  53309.74548294]
Mean: 52564.1902524
Standard deviation: 2301.87380392
Wow, this is much better: Random Forests look very promising. However, note that the score on the training set is still much lower than on the validation sets, meaning that the model is still overfitting the training set. Possible solutions for overfitting are to simplify the model, constrain it (i.e., regularize it), or get a lot more training data. However, before you dive much deeper in Random Forests, you should try out many other models from various categories of Machine Learning algorithms (several Support Vector Machines with different kernels, possibly a neural network, etc.), without spending too much time tweaking the hyperparameters. The goal is to shortlist a few (two to five) promising models.
TIP
You should save every model you experiment with, so you can come back easily to any model you want. Make sure you save both the hyperparameters and the trained parameters, as well as the cross-validation scores and perhaps the actual predictions as well. This will allow you to easily compare scores across model types, and compare the types of errors they make. You can easily save Scikit-Learn models by using Python's pickle module, or using sklearn.externals.joblib, which is more efficient at serializing large NumPy arrays:
from sklearn.externals import joblib
joblib.dump(my_model, "my_model.pkl")
# and later...
my_model_loaded = joblib.load("my_model.pkl")
Fine-Tune Your Model
Let's assume that you now have a shortlist of promising models. You now need to fine-tune them. Let's look at a few ways you can do that.
Grid Search
One way to do that would be to fiddle with the hyperparameters manually, until you find a great combination of hyperparameter values. This would be very tedious work, and you may not have time to explore many combinations.
Instead you should get Scikit-Learn's GridSearchCV to search for you. All you need to do is tell it which hyperparameters you want it to experiment with, and what values to try out, and it will evaluate all the possible combinations of hyperparameter values, using cross-validation. For example, the following code searches for the best combination of hyperparameter values for the RandomForestRegressor:
from sklearn.model_selection import GridSearchCV
param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]
forest_reg = RandomForestRegressor()
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error')
grid_search.fit(housing_prepared, housing_labels)
TIP
When you have no idea what value a hyperparameter should have, a simple approach is to try out consecutive powers of 10 (or a smaller number if you want a more fine-grained search, as shown in this example with the n_estimators hyperparameter).
This param_grid tells Scikit-Learn to first evaluate all 3 × 4 = 12 combinations of n_estimators and max_features hyperparameter values specified in the first dict (don't worry about what these hyperparameters mean for now; they will be explained in Chapter 7), then try all 2 × 3 = 6 combinations of hyperparameter values in the second dict, but this time with the bootstrap hyperparameter set to False instead of True (which is the default value for this hyperparameter).
All in all, the grid search will explore 12 + 6 = 18 combinations of RandomForestRegressor hyperparameter values, and it will train each model five times (since we are using five-fold cross-validation). In other words, all in all, there will be 18 × 5 = 90 rounds of training! It may take quite a long time, but when it is done you can get the best combination of parameters like this:
>>> grid_search.best_params_
{'max_features': 8, 'n_estimators': 30}
TIP
Since 8 and 30 are the maximum values that were evaluated, you should probably try searching again with higher values, since the score may continue to improve.
You can also get the best estimator directly:
>>> grid_search.best_estimator_
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features=8, max_leaf_nodes=None, min_impurity_split=1e-07,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=30, n_jobs=1,
           oob_score=False, random_state=42, verbose=0, warm_start=False)
NOTE
If GridSearchCV is initialized with refit=True (which is the default), then once it finds the best estimator using cross-validation, it retrains it on the whole training set. This is usually a good idea since feeding it more data will likely improve its performance.
And of course the evaluation scores are also available:
>>> cvres = grid_search.cv_results_
>>> for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
...     print(np.sqrt(-mean_score), params)
...
63825.0479302 {'max_features': 2, 'n_estimators': 3}
55643.8429091 {'max_features': 2, 'n_estimators': 10}
53380.6566859 {'max_features': 2, 'n_estimators': 30}
60959.1388585 {'max_features': 4, 'n_estimators': 3}
52740.5841667 {'max_features': 4, 'n_estimators': 10}
50374.1421461 {'max_features': 4, 'n_estimators': 30}
58661.2866462 {'max_features': 6, 'n_estimators': 3}
52009.9739798 {'max_features': 6, 'n_estimators': 10}
50154.1177737 {'max_features': 6, 'n_estimators': 30}
57865.3616801 {'max_features': 8, 'n_estimators': 3}
51730.0755087 {'max_features': 8, 'n_estimators': 10}
49694.8514333 {'max_features': 8, 'n_estimators': 30}
62874.4073931 {'max_features': 2, 'n_estimators': 3, 'bootstrap': False}
54561.9398157 {'max_features': 2, 'n_estimators': 10, 'bootstrap': False}
59416.6463145 {'max_features': 3, 'n_estimators': 3, 'bootstrap': False}
52660.245911 {'max_features': 3, 'n_estimators': 10, 'bootstrap': False}
57490.0168279 {'max_features': 4, 'n_estimators': 3, 'bootstrap': False}
51093.9059428 {'max_features': 4, 'n_estimators': 10, 'bootstrap': False}
In this example, we obtain the best solution by setting the max_features hyperparameter to 8 and the n_estimators hyperparameter to 30. The RMSE score for this combination is 49,694, which is slightly better than the score you got earlier using the default hyperparameter values (which was 52,564). Congratulations, you have successfully fine-tuned your best model!
TIP
Don't forget that you can treat some of the data preparation steps as hyperparameters. For example, the grid search will automatically find out whether or not to add a feature you were not sure about (e.g., using the add_bedrooms_per_room hyperparameter of your CombinedAttributesAdder transformer). It may similarly be used to automatically find the best way to handle outliers, missing features, feature selection, and more.
Randomized Search
The grid search approach is fine when you are exploring relatively few combinations, like in the previous example, but when the hyperparameter search space is large, it is often preferable to use RandomizedSearchCV instead. This class can be used in much the same way as the GridSearchCV class, but instead of trying out all possible combinations, it evaluates a given number of random combinations by selecting a random value for each hyperparameter at every iteration (a minimal sketch follows this list). This approach has two main benefits:
If you let the randomized search run for, say, 1,000 iterations, this approach will explore 1,000 different values for each hyperparameter (instead of just a few values per hyperparameter with the grid search approach).
You have more control over the computing budget you want to allocate to hyperparameter search, simply by setting the number of iterations.
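As a minimal sketch of what this looks like in practice (the parameter ranges and the number of iterations below are arbitrary choices, not recommendations):
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
param_distribs = {
    "n_estimators": randint(low=1, high=200),    # sample a random integer in [1, 200)
    "max_features": randint(low=1, high=8),
}
forest_reg = RandomForestRegressor(random_state=42)
rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,
                                n_iter=10, cv=5, scoring="neg_mean_squared_error",
                                random_state=42)
rnd_search.fit(housing_prepared, housing_labels)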
Ensemble Methods
Another way to fine-tune your system is to try to combine the models that perform best. The group (or "ensemble") will often perform better than the best individual model (just like Random Forests perform better than the individual Decision Trees they rely on), especially if the individual models make very different types of errors. We will cover this topic in more detail in Chapter 7.
Analyze the Best Models and Their Errors
You will often gain good insights on the problem by inspecting the best models. For example, the RandomForestRegressor can indicate the relative importance of each attribute for making accurate predictions:
>>> feature_importances = grid_search.best_estimator_.feature_importances_
>>> feature_importances
array([  7.33442355e-02,   6.29090705e-02,   4.11437985e-02,
         1.46726854e-02,   1.41064835e-02,   1.48742809e-02,
         1.42575993e-02,   3.66158981e-01,   5.64191792e-02,
         1.08792957e-01,   5.33510773e-02,   1.03114883e-02,
         1.64780994e-01,   6.02803867e-05,   1.96041560e-03,
         2.85647464e-03])
Let's display these importance scores next to their corresponding attribute names:
>>> extra_attribs = ["rooms_per_hhold", "pop_per_hhold", "bedrooms_per_room"]
>>> cat_one_hot_attribs = list(encoder.classes_)
>>> attributes = num_attribs + extra_attribs + cat_one_hot_attribs
>>> sorted(zip(feature_importances, attributes), reverse=True)
[(0.36615898061813418, 'median_income'),
 (0.16478099356159051, 'INLAND'),
 (0.10879295677551573, 'pop_per_hhold'),
 (0.073344235516012421, 'longitude'),
 (0.062909070482620302, 'latitude'),
 (0.056419179181954007, 'rooms_per_hhold'),
 (0.053351077347675809, 'bedrooms_per_room'),
 (0.041143798478729635, 'housing_median_age'),
 (0.014874280890402767, 'population'),
 (0.014672685420543237, 'total_rooms'),
 (0.014257599323407807, 'households'),
 (0.014106483453584102, 'total_bedrooms'),
 (0.010311488326303787, '<1H OCEAN'),
 (0.0028564746373201579, 'NEAR OCEAN'),
 (0.0019604155994780701, 'NEAR BAY'),
 (6.0280386727365991e-05, 'ISLAND')]
With this information, you may want to try dropping some of the less useful features (e.g., apparently only one ocean_proximity category is really useful, so you could try dropping the others).
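For example, here is one possible (and deliberately simple) way to keep only the k most important features, assuming housing_prepared is a plain NumPy array and reusing the feature_importances and attributes computed above (k = 5 is an arbitrary choice):
import numpy as np
k = 5
top_k_indices = np.argsort(feature_importances)[-k:]        # indices of the k largest importances
housing_prepared_top_k = housing_prepared[:, top_k_indices]
print(sorted(np.array(attributes)[top_k_indices]))          # which attributes were kept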
You should also look at the specific errors that your system makes, then try to understand why it makes them and what could fix the problem (adding extra features or, on the contrary, getting rid of uninformative ones, cleaning up outliers, etc.).
Evaluate Your System on the Test Set
After tweaking your models for a while, you eventually have a system that performs sufficiently well. Now is the time to evaluate the final model on the test set. There is nothing special about this process; just get the predictors and the labels from your test set, run your full_pipeline to transform the data (call transform(), not fit_transform()!), and evaluate the final model on the test set:
final_model = grid_search.best_estimator_
X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()
X_test_prepared = full_pipeline.transform(X_test)
final_predictions = final_model.predict(X_test_prepared)
final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)   # => evaluates to 47,766.0
The performance will usually be slightly worse than what you measured using cross-validation if you did a lot of hyperparameter tuning (because your system ends up fine-tuned to perform well on the validation data, and will likely not perform as well on unknown datasets). It is not the case in this example, but when this happens you must resist the temptation to tweak the hyperparameters to make the numbers look good on the test set; the improvements would be unlikely to generalize to new data.
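If you want an idea of how precise that test-set estimate is, one option (not part of this chapter's walkthrough) is to compute a confidence interval for the generalization error from the test-set squared errors, for example with a standard t-interval:
from scipy import stats
import numpy as np
confidence = 0.95
squared_errors = (final_predictions - y_test) ** 2
# 95% confidence interval for the RMSE; treat it as a rough indication only
interval = np.sqrt(stats.t.interval(confidence, len(squared_errors) - 1,
                                    loc=squared_errors.mean(),
                                    scale=stats.sem(squared_errors)))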
Nowcomestheprojectprelaunchphase:youneedtopresentyoursolution(highlightingwhatyouhavelearned,whatworkedandwhatdidnot,whatassumptionsweremade,andwhatyoursystem’slimitationsare),documenteverything,andcreatenicepresentationswithclearvisualizationsandeasy-to-rememberstatements(e.g.,“themedianincomeisthenumberonepredictorofhousingprices”).
Launch, Monitor, and Maintain Your System
Perfect, you got approval to launch! You need to get your solution ready for production, in particular by plugging the production input data sources into your system and writing tests.
You also need to write monitoring code to check your system's live performance at regular intervals and trigger alerts when it drops. This is important to catch not only sudden breakage, but also performance degradation. This is quite common because models tend to "rot" as data evolves over time, unless the models are regularly trained on fresh data.
Evaluating your system's performance will require sampling the system's predictions and evaluating them. This will generally require a human analysis. These analysts may be field experts, or workers on a crowdsourcing platform (such as Amazon Mechanical Turk or CrowdFlower). Either way, you need to plug the human evaluation pipeline into your system.
You should also make sure you evaluate the system's input data quality. Sometimes performance will degrade slightly because of a poor-quality signal (e.g., a malfunctioning sensor sending random values, or another team's output becoming stale), but it may take a while before your system's performance degrades enough to trigger an alert. If you monitor your system's inputs, you may catch this earlier. Monitoring the inputs is particularly important for online learning systems.
Finally, you will generally want to train your models on a regular basis using fresh data. You should automate this process as much as possible. If you don't, you are very likely to refresh your model only every six months (at best), and your system's performance may fluctuate severely over time. If your system is an online learning system, you should make sure you save snapshots of its state at regular intervals so you can easily roll back to a previously working state.
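What the monitoring code looks like depends entirely on your infrastructure, but as a rough, purely illustrative sketch (every name and threshold here is made up), it can be as simple as comparing the RMSE on a recent batch of human-labeled predictions against the RMSE you measured at launch:
import numpy as np
from sklearn.metrics import mean_squared_error
def check_live_performance(y_recent, y_recent_pred, baseline_rmse, tolerance=0.10):
    # Alert if the live error drifts more than 10% above the launch-time baseline
    live_rmse = np.sqrt(mean_squared_error(y_recent, y_recent_pred))
    if live_rmse > baseline_rmse * (1 + tolerance):
        print("ALERT: live RMSE", round(live_rmse), "exceeds baseline", round(baseline_rmse))
    return live_rmse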
TryItOut!HopefullythischaptergaveyouagoodideaofwhataMachineLearningprojectlookslike,andshowedyousomeofthetoolsyoucanusetotrainagreatsystem.Asyoucansee,muchoftheworkisinthedatapreparationstep,buildingmonitoringtools,settinguphumanevaluationpipelines,andautomatingregularmodeltraining.TheMachineLearningalgorithmsarealsoimportant,ofcourse,butitisprobablypreferabletobecomfortablewiththeoverallprocessandknowthreeorfouralgorithmswellratherthantospendallyourtimeexploringadvancedalgorithmsandnotenoughtimeontheoverallprocess.
So,ifyouhavenotalreadydoneso,nowisagoodtimetopickupalaptop,selectadatasetthatyouareinterestedin,andtrytogothroughthewholeprocessfromAtoZ.Agoodplacetostartisonacompetitionwebsitesuchashttp://kaggle.com/:youwillhaveadatasettoplaywith,acleargoal,andpeopletosharetheexperiencewith.
ExercisesUsingthischapter’shousingdataset:1. TryaSupportVectorMachineregressor(sklearn.svm.SVR),withvarioushyperparameterssuchas
kernel="linear"(withvariousvaluesfortheChyperparameter)orkernel="rbf"(withvariousvaluesfortheCandgammahyperparameters).Don’tworryaboutwhatthesehyperparametersmeanfornow.HowdoesthebestSVRpredictorperform?
2. TryreplacingGridSearchCVwithRandomizedSearchCV.
3. Tryaddingatransformerinthepreparationpipelinetoselectonlythemostimportantattributes.
4. Trycreatingasinglepipelinethatdoesthefulldatapreparationplusthefinalprediction.
5. AutomaticallyexploresomepreparationoptionsusingGridSearchCV.
SolutionstotheseexercisesareavailableintheonlineJupyternotebooksathttps://github.com/ageron/handson-ml.
The example project is completely fictitious; the goal is just to illustrate the main steps of a Machine Learning project, not to learn anything about the real estate business.
The original dataset appeared in R. Kelley Pace and Ronald Barry, "Sparse Spatial Autoregressions," Statistics & Probability Letters 33, no. 3 (1997): 291–297.
A piece of information fed to a Machine Learning system is often called a signal in reference to Shannon's information theory: you want a high signal/noise ratio.
Recall that the transpose operator flips a column vector into a row vector (and vice versa).
The latest version of Python 3 is recommended. Python 2.7+ should work fine too, but it is deprecated. If you use Python 2, you must add from __future__ import division, print_function, unicode_literals at the beginning of your code.
We will show the installation steps using pip in a bash shell on a Linux or macOS system. You may need to adapt these commands to your own system. On Windows, we recommend installing Anaconda instead.
You may need to have administrator rights to run this command; if so, try prefixing it with sudo.
Note that Jupyter can handle multiple versions of Python, and even many other languages such as R or Octave.
You might also need to check legal constraints, such as private fields that should never be copied to unsafe data stores.
In a real project you would save this code in a Python file, but for now you can just write it in your Jupyter notebook.
The standard deviation is generally denoted σ (the Greek letter sigma), and it is the square root of the variance, which is the average of the squared deviation from the mean. When a feature has a bell-shaped normal distribution (also called a Gaussian distribution), which is very common, the "68-95-99.7" rule applies: about 68% of the values fall within 1σ of the mean, 95% within 2σ, and 99.7% within 3σ.
You will often see people set the random seed to 42. This number has no special property, other than to be The Answer to the Ultimate Question of Life, the Universe, and Everything.
The location information is actually quite coarse, and as a result many districts will have the exact same ID, so they will end up in the same set (test or train). This introduces some unfortunate sampling bias.
If you are reading this in grayscale, grab a red pen and scribble over most of the coastline from the Bay Area down to San Diego (as you might expect). You can add a patch of yellow around Sacramento as well.
For more details on the design principles, see "API design for machine learning software: experiences from the scikit-learn project," L. Buitinck, G. Louppe, M. Blondel, F. Pedregosa, A. Müller, et al. (2013).
Some predictors also provide methods to measure the confidence of their predictions.
NumPy's reshape() function allows one dimension to be –1, which means "unspecified": the value is inferred from the length of the array and the remaining dimensions.
See SciPy's documentation for more details.
But check out Pull Request #3886, which may introduce a ColumnTransformer class making attribute-specific transformations easy. You could also run pip3 install sklearn-pandas to get a DataFrameMapper class with a similar objective.
Chapter 3. Classification
In Chapter 1 we mentioned that the most common supervised learning tasks are regression (predicting values) and classification (predicting classes). In Chapter 2 we explored a regression task, predicting housing values, using various algorithms such as Linear Regression, Decision Trees, and Random Forests (which will be explained in further detail in later chapters). Now we will turn our attention to classification systems.
MNIST
In this chapter, we will be using the MNIST dataset, which is a set of 70,000 small images of digits handwritten by high school students and employees of the US Census Bureau. Each image is labeled with the digit it represents. This set has been studied so much that it is often called the "Hello World" of Machine Learning: whenever people come up with a new classification algorithm, they are curious to see how it will perform on MNIST. Whenever someone learns Machine Learning, sooner or later they tackle MNIST.
Scikit-Learn provides many helper functions to download popular datasets. MNIST is one of them. The following code fetches the MNIST dataset:1
>>> from sklearn.datasets import fetch_mldata
>>> mnist = fetch_mldata('MNIST original')
>>> mnist
{'COL_NAMES': ['label', 'data'],
 'DESCR': 'mldata.org dataset: mnist-original',
 'data': array([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
 'target': array([ 0.,  0.,  0., ...,  9.,  9.,  9.])}
DatasetsloadedbyScikit-Learngenerallyhaveasimilardictionarystructureincluding:ADESCRkeydescribingthedataset
Adatakeycontaininganarraywithonerowperinstanceandonecolumnperfeature
Atargetkeycontaininganarraywiththelabels
Let’slookatthesearrays:
>>> X, y = mnist["data"], mnist["target"]
>>> X.shape
(70000, 784)
>>> y.shape
(70000,)
There are 70,000 images, and each image has 784 features. This is because each image is 28 × 28 pixels, and each feature simply represents one pixel's intensity, from 0 (white) to 255 (black). Let's take a peek at one digit from the dataset. All you need to do is grab an instance's feature vector, reshape it to a 28 × 28 array, and display it using Matplotlib's imshow() function:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
some_digit = X[36000]
some_digit_image = some_digit.reshape(28, 28)
plt.imshow(some_digit_image, cmap=matplotlib.cm.binary,
           interpolation="nearest")
plt.axis("off")
plt.show()
This looks like a 5, and indeed that's what the label tells us:
>>> y[36000]
5.0
Figure 3-1 shows a few more images from the MNIST dataset to give you a feel for the complexity of the classification task.
Figure 3-1. A few digits from the MNIST dataset
But wait! You should always create a test set and set it aside before inspecting the data closely. The MNIST dataset is actually already split into a training set (the first 60,000 images) and a test set (the last 10,000 images):
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
Let's also shuffle the training set; this will guarantee that all cross-validation folds will be similar (you don't want one fold to be missing some digits). Moreover, some learning algorithms are sensitive to the order of the training instances, and they perform poorly if they get many similar instances in a row. Shuffling the dataset ensures that this won't happen:2
import numpy as np
shuffle_index = np.random.permutation(60000)
X_train, y_train = X_train[shuffle_index], y_train[shuffle_index]
Training a Binary Classifier
Let's simplify the problem for now and only try to identify one digit—for example, the number 5. This "5-detector" will be an example of a binary classifier, capable of distinguishing between just two classes, 5 and not-5. Let's create the target vectors for this classification task:
y_train_5 = (y_train == 5)   # True for all 5s, False for all other digits.
y_test_5 = (y_test == 5)
Okay, now let's pick a classifier and train it. A good place to start is with a Stochastic Gradient Descent (SGD) classifier, using Scikit-Learn's SGDClassifier class. This classifier has the advantage of being capable of handling very large datasets efficiently. This is in part because SGD deals with training instances independently, one at a time (which also makes SGD well suited for online learning), as we will see later. Let's create an SGDClassifier and train it on the whole training set:
from sklearn.linear_model import SGDClassifier
sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_5)
TIPTheSGDClassifierreliesonrandomnessduringtraining(hencethename“stochastic”).Ifyouwantreproducibleresults,youshouldsettherandom_stateparameter.
Nowyoucanuseittodetectimagesofthenumber5:
>>>sgd_clf.predict([some_digit])
array([True],dtype=bool)
Theclassifierguessesthatthisimagerepresentsa5(True).Lookslikeitguessedrightinthisparticularcase!Now,let’sevaluatethismodel’sperformance.
PerformanceMeasuresEvaluatingaclassifierisoftensignificantlytrickierthanevaluatingaregressor,sowewillspendalargepartofthischapteronthistopic.Therearemanyperformancemeasuresavailable,sograbanothercoffeeandgetreadytolearnmanynewconceptsandacronyms!
MeasuringAccuracyUsingCross-ValidationAgoodwaytoevaluateamodelistousecross-validation,justasyoudidinChapter2.
IMPLEMENTINGCROSS-VALIDATION
Occasionallyyouwillneedmorecontroloverthecross-validationprocessthanwhatScikit-Learnprovidesoff-the-shelf.Inthesecases,youcanimplementcross-validationyourself;itisactuallyfairlystraightforward.ThefollowingcodedoesroughlythesamethingasScikit-Learn’scross_val_score()function,andprintsthesameresult:
from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone
skfolds = StratifiedKFold(n_splits=3, random_state=42)
for train_index, test_index in skfolds.split(X_train, y_train_5):
    clone_clf = clone(sgd_clf)
    X_train_folds = X_train[train_index]
    y_train_folds = (y_train_5[train_index])
    X_test_fold = X_train[test_index]
    y_test_fold = (y_train_5[test_index])
    clone_clf.fit(X_train_folds, y_train_folds)
    y_pred = clone_clf.predict(X_test_fold)
    n_correct = sum(y_pred == y_test_fold)
    print(n_correct / len(y_pred))   # prints 0.9502, 0.96565 and 0.96495
TheStratifiedKFoldclassperformsstratifiedsampling(asexplainedinChapter2)toproducefoldsthatcontainarepresentativeratioofeachclass.Ateachiterationthecodecreatesacloneoftheclassifier,trainsthatcloneonthetrainingfolds,andmakespredictionsonthetestfold.Thenitcountsthenumberofcorrectpredictionsandoutputstheratioofcorrectpredictions.
Let’susethecross_val_score()functiontoevaluateyourSGDClassifiermodelusingK-foldcross-validation,withthreefolds.RememberthatK-foldcross-validationmeanssplittingthetrainingsetintoK-folds(inthiscase,three),thenmakingpredictionsandevaluatingthemoneachfoldusingamodeltrainedontheremainingfolds(seeChapter2):
>>> from sklearn.model_selection import cross_val_score
>>> cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy")
array([ 0.9502 ,  0.96565,  0.96495])
Wow!Above95%accuracy(ratioofcorrectpredictions)onallcross-validationfolds?Thislooksamazing,doesn’tit?Well,beforeyougettooexcited,let’slookataverydumbclassifierthatjustclassifieseverysingleimageinthe“not-5”class:
from sklearn.base import BaseEstimator
class Never5Classifier(BaseEstimator):
    def fit(self, X, y=None):
        pass
    def predict(self, X):
        return np.zeros((len(X), 1), dtype=bool)
Canyouguessthismodel’saccuracy?Let’sfindout:
>>> never_5_clf = Never5Classifier()
>>> cross_val_score(never_5_clf, X_train, y_train_5, cv=3, scoring="accuracy")
array([ 0.909  ,  0.90715,  0.9128 ])
That’sright,ithasover90%accuracy!Thisissimplybecauseonlyabout10%oftheimagesare5s,soifyoualwaysguessthatanimageisnota5,youwillberightabout90%ofthetime.BeatsNostradamus.
Thisdemonstrateswhyaccuracyisgenerallynotthepreferredperformancemeasureforclassifiers,especiallywhenyouaredealingwithskeweddatasets(i.e.,whensomeclassesaremuchmorefrequentthanothers).
ConfusionMatrixAmuchbetterwaytoevaluatetheperformanceofaclassifieristolookattheconfusionmatrix.ThegeneralideaistocountthenumberoftimesinstancesofclassAareclassifiedasclassB.Forexample,toknowthenumberoftimestheclassifierconfusedimagesof5swith3s,youwouldlookinthe5throwand3rdcolumnoftheconfusionmatrix.
Tocomputetheconfusionmatrix,youfirstneedtohaveasetofpredictions,sotheycanbecomparedtotheactualtargets.Youcouldmakepredictionsonthetestset,butlet’skeepituntouchedfornow(rememberthatyouwanttousethetestsetonlyattheveryendofyourproject,onceyouhaveaclassifierthatyouarereadytolaunch).Instead,youcanusethecross_val_predict()function:
from sklearn.model_selection import cross_val_predict
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)
Justlikethecross_val_score()function,cross_val_predict()performsK-foldcross-validation,butinsteadofreturningtheevaluationscores,itreturnsthepredictionsmadeoneachtestfold.Thismeansthatyougetacleanpredictionforeachinstanceinthetrainingset(“clean”meaningthatthepredictionismadebyamodelthatneversawthedataduringtraining).
Nowyouarereadytogettheconfusionmatrixusingtheconfusion_matrix()function.Justpassitthetargetclasses(y_train_5)andthepredictedclasses(y_train_pred):
>>> from sklearn.metrics import confusion_matrix
>>> confusion_matrix(y_train_5, y_train_pred)
array([[53272,  1307],
       [ 1077,  4344]])
Each row in a confusion matrix represents an actual class, while each column represents a predicted class. The first row of this matrix considers non-5 images (the negative class): 53,272 of them were correctly classified as non-5s (they are called true negatives), while the remaining 1,307 were wrongly classified as 5s (false positives). The second row considers the images of 5s (the positive class): 1,077 were wrongly classified as non-5s (false negatives), while the remaining 4,344 were correctly classified as 5s (true positives). A perfect classifier would have only true positives and true negatives, so its confusion matrix would have nonzero values only on its main diagonal (top left to bottom right):
>>> confusion_matrix(y_train_5, y_train_perfect_predictions)
array([[54579,     0],
       [    0,  5421]])
The confusion matrix gives you a lot of information, but sometimes you may prefer a more concise metric. An interesting one to look at is the accuracy of the positive predictions; this is called the precision of the classifier (Equation 3-1).
Equation 3-1. Precision
precision = TP / (TP + FP)
TP is the number of true positives, and FP is the number of false positives.
A trivial way to have perfect precision is to make one single positive prediction and ensure it is correct (precision = 1/1 = 100%). This would not be very useful since the classifier would ignore all but one positive instance. So precision is typically used along with another metric named recall, also called sensitivity or true positive rate (TPR): this is the ratio of positive instances that are correctly detected by the classifier (Equation 3-2).
Equation 3-2. Recall
recall = TP / (TP + FN)
FN is of course the number of false negatives.
If you are confused about the confusion matrix, Figure 3-2 may help.
Figure 3-2. An illustrated confusion matrix
Precision and Recall
Scikit-Learn provides several functions to compute classifier metrics, including precision and recall:
>>> from sklearn.metrics import precision_score, recall_score
>>> precision_score(y_train_5, y_train_pred)   # == 4344 / (4344 + 1307)
0.76871350203503808
>>> recall_score(y_train_5, y_train_pred)   # == 4344 / (4344 + 1077)
0.80132816823464303
Now your 5-detector does not look as shiny as it did when you looked at its accuracy. When it claims an image represents a 5, it is correct only 77% of the time. Moreover, it only detects 80% of the 5s.
It is often convenient to combine precision and recall into a single metric called the F1 score, in particular if you need a simple way to compare two classifiers. The F1 score is the harmonic mean of precision and recall (Equation 3-3). Whereas the regular mean treats all values equally, the harmonic mean gives much more weight to low values. As a result, the classifier will only get a high F1 score if both recall and precision are high.
Equation 3-3. F1 score
F1 = 2 / (1/precision + 1/recall) = 2 × (precision × recall) / (precision + recall) = TP / (TP + (FN + FP)/2)
TocomputetheF1score,simplycallthef1_score()function:
>>> from sklearn.metrics import f1_score
>>> f1_score(y_train_5, y_train_pred)
0.78468208092485547
TheF1scorefavorsclassifiersthathavesimilarprecisionandrecall.Thisisnotalwayswhatyouwant:insomecontextsyoumostlycareaboutprecision,andinothercontextsyoureallycareaboutrecall.Forexample,ifyoutrainedaclassifiertodetectvideosthataresafeforkids,youwouldprobablypreferaclassifierthatrejectsmanygoodvideos(lowrecall)butkeepsonlysafeones(highprecision),ratherthanaclassifierthathasamuchhigherrecallbutletsafewreallybadvideosshowupinyourproduct(insuchcases,youmayevenwanttoaddahumanpipelinetochecktheclassifier’svideoselection).Ontheotherhand,supposeyoutrainaclassifiertodetectshopliftersonsurveillanceimages:itisprobablyfineifyourclassifierhasonly30%precisionaslongasithas99%recall(sure,thesecurityguardswillgetafewfalsealerts,butalmostallshoplifterswillgetcaught).
Unfortunately,youcan’thaveitbothways:increasingprecisionreducesrecall,andviceversa.Thisiscalledtheprecision/recalltradeoff.
Precision/RecallTradeoffTounderstandthistradeoff,let’slookathowtheSGDClassifiermakesitsclassificationdecisions.Foreachinstance,itcomputesascorebasedonadecisionfunction,andifthatscoreisgreaterthanathreshold,itassignstheinstancetothepositiveclass,orelseitassignsittothenegativeclass.Figure3-3showsafewdigitspositionedfromthelowestscoreonthelefttothehighestscoreontheright.Supposethedecisionthresholdispositionedatthecentralarrow(betweenthetwo5s):youwillfind4truepositives(actual5s)ontherightofthatthreshold,andonefalsepositive(actuallya6).Therefore,withthatthreshold,theprecisionis80%(4outof5).Butoutof6actual5s,theclassifieronlydetects4,sotherecallis67%(4outof6).Nowifyouraisethethreshold(moveittothearrowontheright),thefalsepositive(the6)becomesatruenegative,therebyincreasingprecision(upto100%inthiscase),butonetruepositivebecomesafalsenegative,decreasingrecalldownto50%.Conversely,loweringthethresholdincreasesrecallandreducesprecision.
Figure3-3.Decisionthresholdandprecision/recalltradeoff
Scikit-Learndoesnotletyousetthethresholddirectly,butitdoesgiveyouaccesstothedecisionscoresthatitusestomakepredictions.Insteadofcallingtheclassifier’spredict()method,youcancallitsdecision_function()method,whichreturnsascoreforeachinstance,andthenmakepredictionsbasedonthosescoresusinganythresholdyouwant:
>>> y_scores = sgd_clf.decision_function([some_digit])
>>> y_scores
array([ 161855.74572176])
>>> threshold = 0
>>> y_some_digit_pred = (y_scores > threshold)
>>> y_some_digit_pred
array([ True], dtype=bool)
The SGDClassifier uses a threshold equal to 0, so the previous code returns the same result as the predict() method (i.e., True). Let's raise the threshold:
>>> threshold = 200000
>>> y_some_digit_pred = (y_scores > threshold)
>>> y_some_digit_pred
array([False], dtype=bool)
Thisconfirmsthatraisingthethresholddecreasesrecall.Theimageactuallyrepresentsa5,andtheclassifierdetectsitwhenthethresholdis0,butitmissesitwhenthethresholdisincreasedto200,000.
Sohowcanyoudecidewhichthresholdtouse?Forthisyouwillfirstneedtogetthescoresofallinstancesinthetrainingsetusingthecross_val_predict()functionagain,butthistimespecifyingthatyouwantittoreturndecisionscoresinsteadofpredictions:
y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3,
                             method="decision_function")
Now with these scores you can compute precision and recall for all possible thresholds using the precision_recall_curve() function:
from sklearn.metrics import precision_recall_curve
precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)
Finally,youcanplotprecisionandrecallasfunctionsofthethresholdvalueusingMatplotlib(Figure3-4):
def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision")
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall")
    plt.xlabel("Threshold")
    plt.legend(loc="upper left")
    plt.ylim([0, 1])
plot_precision_recall_vs_threshold(precisions, recalls, thresholds)
plt.show()
Figure3-4.Precisionandrecallversusthedecisionthreshold
NOTEYoumaywonderwhytheprecisioncurveisbumpierthantherecallcurveinFigure3-4.Thereasonisthatprecisionmaysometimesgodownwhenyouraisethethreshold(althoughingeneralitwillgoup).Tounderstandwhy,lookbackatFigure3-3andnoticewhathappenswhenyoustartfromthecentralthresholdandmoveitjustonedigittotheright:precisiongoesfrom4/5(80%)downto3/4(75%).Ontheotherhand,recallcanonlygodownwhenthethresholdisincreased,whichexplainswhyitscurvelookssmooth.
Nowyoucansimplyselectthethresholdvaluethatgivesyouthebestprecision/recalltradeoffforyourtask.Anotherwaytoselectagoodprecision/recalltradeoffistoplotprecisiondirectlyagainstrecall,asshowninFigure3-5.
Figure3-5.Precisionversusrecall
Youcanseethatprecisionreallystartstofallsharplyaround80%recall.Youwillprobablywanttoselectaprecision/recalltradeoffjustbeforethatdrop—forexample,ataround60%recall.Butofcoursethechoicedependsonyourproject.
Solet’ssupposeyoudecidetoaimfor90%precision.Youlookupthefirstplot(zoominginabit)andfindthatyouneedtouseathresholdofabout70,000.Tomakepredictions(onthetrainingsetfornow),insteadofcallingtheclassifier’spredict()method,youcanjustrunthiscode:
y_train_pred_90 = (y_scores > 70000)
Let's check these predictions' precision and recall:
>>> precision_score(y_train_5, y_train_pred_90)
0.86592051164915484
>>> recall_score(y_train_5, y_train_pred_90)
0.69931746910164172
Great,youhavea90%precisionclassifier(orcloseenough)!Asyoucansee,itisfairlyeasytocreateaclassifierwithvirtuallyanyprecisionyouwant:justsetahighenoughthreshold,andyou’redone.Hmm,notsofast.Ahigh-precisionclassifierisnotveryusefulifitsrecallistoolow!
TIPIfsomeonesays“let’sreach99%precision,”youshouldask,“atwhatrecall?”
TheROCCurveThereceiveroperatingcharacteristic(ROC)curveisanothercommontoolusedwithbinaryclassifiers.Itisverysimilartotheprecision/recallcurve,butinsteadofplottingprecisionversusrecall,theROCcurveplotsthetruepositiverate(anothernameforrecall)againstthefalsepositiverate.TheFPRistheratioofnegativeinstancesthatareincorrectlyclassifiedaspositive.Itisequaltooneminusthetruenegativerate,whichistheratioofnegativeinstancesthatarecorrectlyclassifiedasnegative.TheTNRisalsocalledspecificity.HencetheROCcurveplotssensitivity(recall)versus1–specificity.
ToplottheROCcurve,youfirstneedtocomputetheTPRandFPRforvariousthresholdvalues,usingtheroc_curve()function:
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)
ThenyoucanplottheFPRagainsttheTPRusingMatplotlib.ThiscodeproducestheplotinFigure3-6:
def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'k--')
    plt.axis([0, 1, 0, 1])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
plot_roc_curve(fpr, tpr)
plt.show()
Figure3-6.ROCcurve
Onceagainthereisatradeoff:thehighertherecall(TPR),themorefalsepositives(FPR)theclassifier
produces.ThedottedlinerepresentstheROCcurveofapurelyrandomclassifier;agoodclassifierstaysasfarawayfromthatlineaspossible(towardthetop-leftcorner).
Onewaytocompareclassifiersistomeasuretheareaunderthecurve(AUC).AperfectclassifierwillhaveaROCAUCequalto1,whereasapurelyrandomclassifierwillhaveaROCAUCequalto0.5.Scikit-LearnprovidesafunctiontocomputetheROCAUC:
>>> from sklearn.metrics import roc_auc_score
>>> roc_auc_score(y_train_5, y_scores)
0.96244965559671547
TIPSincetheROCcurveissosimilartotheprecision/recall(orPR)curve,youmaywonderhowtodecidewhichonetouse.Asaruleofthumb,youshouldpreferthePRcurvewheneverthepositiveclassisrareorwhenyoucaremoreaboutthefalsepositivesthanthefalsenegatives,andtheROCcurveotherwise.Forexample,lookingatthepreviousROCcurve(andtheROCAUCscore),youmaythinkthattheclassifierisreallygood.Butthisismostlybecausetherearefewpositives(5s)comparedtothenegatives(non-5s).Incontrast,thePRcurvemakesitclearthattheclassifierhasroomforimprovement(thecurvecouldbeclosertothetop-rightcorner).
Let’strainaRandomForestClassifierandcompareitsROCcurveandROCAUCscoretotheSGDClassifier.First,youneedtogetscoresforeachinstanceinthetrainingset.Butduetothewayitworks(seeChapter7),theRandomForestClassifierclassdoesnothaveadecision_function()method.Insteadithasapredict_proba()method.Scikit-Learnclassifiersgenerallyhaveoneortheother.Thepredict_proba()methodreturnsanarraycontainingarowperinstanceandacolumnperclass,eachcontainingtheprobabilitythatthegiveninstancebelongstothegivenclass(e.g.,70%chancethattheimagerepresentsa5):
from sklearn.ensemble import RandomForestClassifier
forest_clf = RandomForestClassifier(random_state=42)
y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3,
                                    method="predict_proba")
But to plot a ROC curve, you need scores, not probabilities. A simple solution is to use the positive class's probability as the score:
y_scores_forest = y_probas_forest[:, 1]   # score = proba of positive class
fpr_forest, tpr_forest, thresholds_forest = roc_curve(y_train_5, y_scores_forest)
NowyouarereadytoplottheROCcurve.ItisusefultoplotthefirstROCcurveaswelltoseehowtheycompare(Figure3-7):
plt.plot(fpr, tpr, "b:", label="SGD")
plot_roc_curve(fpr_forest, tpr_forest, "Random Forest")
plt.legend(loc="lower right")
plt.show()
Figure3-7.ComparingROCcurves
AsyoucanseeinFigure3-7,theRandomForestClassifier’sROCcurvelooksmuchbetterthantheSGDClassifier’s:itcomesmuchclosertothetop-leftcorner.Asaresult,itsROCAUCscoreisalsosignificantlybetter:
>>> roc_auc_score(y_train_5, y_scores_forest)
0.99312433660038291
Trymeasuringtheprecisionandrecallscores:youshouldfind98.5%precisionand82.8%recall.Nottoobad!
Hopefullyyounowknowhowtotrainbinaryclassifiers,choosetheappropriatemetricforyourtask,evaluateyourclassifiersusingcross-validation,selecttheprecision/recalltradeoffthatfitsyourneeds,andcomparevariousmodelsusingROCcurvesandROCAUCscores.Nowlet’strytodetectmorethanjustthe5s.
MulticlassClassificationWhereasbinaryclassifiersdistinguishbetweentwoclasses,multiclassclassifiers(alsocalledmultinomialclassifiers)candistinguishbetweenmorethantwoclasses.
Somealgorithms(suchasRandomForestclassifiersornaiveBayesclassifiers)arecapableofhandlingmultipleclassesdirectly.Others(suchasSupportVectorMachineclassifiersorLinearclassifiers)arestrictlybinaryclassifiers.However,therearevariousstrategiesthatyoucanusetoperformmulticlassclassificationusingmultiplebinaryclassifiers.
Forexample,onewaytocreateasystemthatcanclassifythedigitimagesinto10classes(from0to9)istotrain10binaryclassifiers,oneforeachdigit(a0-detector,a1-detector,a2-detector,andsoon).Thenwhenyouwanttoclassifyanimage,yougetthedecisionscorefromeachclassifierforthatimageandyouselecttheclasswhoseclassifieroutputsthehighestscore.Thisiscalledtheone-versus-all(OvA)strategy(alsocalledone-versus-the-rest).
Anotherstrategyistotrainabinaryclassifierforeverypairofdigits:onetodistinguish0sand1s,anothertodistinguish0sand2s,anotherfor1sand2s,andsoon.Thisiscalledtheone-versus-one(OvO)strategy.IfthereareNclasses,youneedtotrainN×(N–1)/2classifiers.FortheMNISTproblem,thismeanstraining45binaryclassifiers!Whenyouwanttoclassifyanimage,youhavetoruntheimagethroughall45classifiersandseewhichclasswinsthemostduels.ThemainadvantageofOvOisthateachclassifieronlyneedstobetrainedonthepartofthetrainingsetforthetwoclassesthatitmustdistinguish.
Somealgorithms(suchasSupportVectorMachineclassifiers)scalepoorlywiththesizeofthetrainingset,soforthesealgorithmsOvOispreferredsinceitisfastertotrainmanyclassifiersonsmalltrainingsetsthantrainingfewclassifiersonlargetrainingsets.Formostbinaryclassificationalgorithms,however,OvAispreferred.
Scikit-Learndetectswhenyoutrytouseabinaryclassificationalgorithmforamulticlassclassificationtask,anditautomaticallyrunsOvA(exceptforSVMclassifiersforwhichitusesOvO).Let’strythiswiththeSGDClassifier:
>>> sgd_clf.fit(X_train, y_train)   # y_train, not y_train_5
>>> sgd_clf.predict([some_digit])
array([ 5.])
Thatwaseasy!ThiscodetrainstheSGDClassifieronthetrainingsetusingtheoriginaltargetclassesfrom0to9(y_train),insteadofthe5-versus-alltargetclasses(y_train_5).Thenitmakesaprediction(acorrectoneinthiscase).Underthehood,Scikit-Learnactuallytrained10binaryclassifiers,gottheirdecisionscoresfortheimage,andselectedtheclasswiththehighestscore.
Toseethatthisisindeedthecase,youcancallthedecision_function()method.Insteadofreturningjustonescoreperinstance,itnowreturns10scores,oneperclass:
>>> some_digit_scores = sgd_clf.decision_function([some_digit])
>>> some_digit_scores
array([[-311402.62954431, -363517.28355739, -446449.5306454 ,
        -183226.61023518, -414337.15339485,  161855.74572176,
        -452576.39616343, -471957.14962573, -518542.33997148,
        -536774.63961222]])
The highest score is indeed the one corresponding to class 5:
>>> np.argmax(some_digit_scores)
5
>>> sgd_clf.classes_
array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9.])
>>> sgd_clf.classes_[5]
5.0
WARNINGWhenaclassifieristrained,itstoresthelistoftargetclassesinitsclasses_attribute,orderedbyvalue.Inthiscase,theindexofeachclassintheclasses_arrayconvenientlymatchestheclassitself(e.g.,theclassatindex5happenstobeclass5),butingeneralyouwon’tbesolucky.
IfyouwanttoforceScikitLearntouseone-versus-oneorone-versus-all,youcanusetheOneVsOneClassifierorOneVsRestClassifierclasses.Simplycreateaninstanceandpassabinaryclassifiertoitsconstructor.Forexample,thiscodecreatesamulticlassclassifierusingtheOvOstrategy,basedonaSGDClassifier:
>>> from sklearn.multiclass import OneVsOneClassifier
>>> ovo_clf = OneVsOneClassifier(SGDClassifier(random_state=42))
>>> ovo_clf.fit(X_train, y_train)
>>> ovo_clf.predict([some_digit])
array([ 5.])
>>> len(ovo_clf.estimators_)
45
Training a RandomForestClassifier is just as easy:
>>> forest_clf.fit(X_train, y_train)
>>> forest_clf.predict([some_digit])
array([ 5.])
This time Scikit-Learn did not have to run OvA or OvO because Random Forest classifiers can directly classify instances into multiple classes. You can call predict_proba() to get the list of probabilities that the classifier assigned to each instance for each class:
>>> forest_clf.predict_proba([some_digit])
array([[ 0.1,  0. ,  0. ,  0.1,  0. ,  0.8,  0. ,  0. ,  0. ,  0. ]])
Youcanseethattheclassifierisfairlyconfidentaboutitsprediction:the0.8atthe5thindexinthearraymeansthatthemodelestimatesan80%probabilitythattheimagerepresentsa5.Italsothinksthattheimagecouldinsteadbea0ora3(10%chanceeach).
Nowofcourseyouwanttoevaluatetheseclassifiers.Asusual,youwanttousecross-validation.Let’sevaluatetheSGDClassifier’saccuracyusingthecross_val_score()function:
>>> cross_val_score(sgd_clf, X_train, y_train, cv=3, scoring="accuracy")
array([ 0.84063187,  0.84899245,  0.86652998])
Itgetsover84%onalltestfolds.Ifyouusedarandomclassifier,youwouldget10%accuracy,sothisisnotsuchabadscore,butyoucanstilldomuchbetter.Forexample,simplyscalingtheinputs(asdiscussedinChapter2)increasesaccuracyabove90%:
>>> from sklearn.preprocessing import StandardScaler
>>> scaler = StandardScaler()
>>> X_train_scaled = scaler.fit_transform(X_train.astype(np.float64))
>>> cross_val_score(sgd_clf, X_train_scaled, y_train, cv=3, scoring="accuracy")
array([ 0.91011798,  0.90874544,  0.906636  ])
ErrorAnalysisOfcourse,ifthiswerearealproject,youwouldfollowthestepsinyourMachineLearningprojectchecklist(seeAppendixB):exploringdatapreparationoptions,tryingoutmultiplemodels,shortlistingthebestonesandfine-tuningtheirhyperparametersusingGridSearchCV,andautomatingasmuchaspossible,asyoudidinthepreviouschapter.Here,wewillassumethatyouhavefoundapromisingmodelandyouwanttofindwaystoimproveit.Onewaytodothisistoanalyzethetypesoferrorsitmakes.
First,youcanlookattheconfusionmatrix.Youneedtomakepredictionsusingthecross_val_predict()function,thencalltheconfusion_matrix()function,justlikeyoudidearlier:
>>> y_train_pred = cross_val_predict(sgd_clf, X_train_scaled, y_train, cv=3)
>>> conf_mx = confusion_matrix(y_train, y_train_pred)
>>> conf_mx
array([[5725,    3,   24,    9,   10,   49,   50,   10,   39,    4],
       [   2, 6493,   43,   25,    7,   40,    5,   10,  109,    8],
       [  51,   41, 5321,  104,   89,   26,   87,   60,  166,   13],
       [  47,   46,  141, 5342,    1,  231,   40,   50,  141,   92],
       [  19,   29,   41,   10, 5366,    9,   56,   37,   86,  189],
       [  73,   45,   36,  193,   64, 4582,  111,   30,  193,   94],
       [  29,   34,   44,    2,   42,   85, 5627,   10,   45,    0],
       [  25,   24,   74,   32,   54,   12,    6, 5787,   15,  236],
       [  52,  161,   73,  156,   10,  163,   61,   25, 5027,  123],
       [  43,   35,   26,   92,  178,   28,    2,  223,   82, 5240]])
That’salotofnumbers.It’softenmoreconvenienttolookatanimagerepresentationoftheconfusionmatrix,usingMatplotlib’smatshow()function:
plt.matshow(conf_mx, cmap=plt.cm.gray)
plt.show()
Thisconfusionmatrixlooksfairlygood,sincemostimagesareonthemaindiagonal,whichmeansthattheywereclassifiedcorrectly.The5slookslightlydarkerthantheotherdigits,whichcouldmeanthattherearefewerimagesof5sinthedatasetorthattheclassifierdoesnotperformaswellon5sasonotherdigits.Infact,youcanverifythatbotharethecase.
Let’sfocustheplotontheerrors.First,youneedtodivideeachvalueintheconfusionmatrixbythenumberofimagesinthecorrespondingclass,soyoucancompareerrorratesinsteadofabsolutenumberoferrors(whichwouldmakeabundantclasseslookunfairlybad):
row_sums = conf_mx.sum(axis=1, keepdims=True)
norm_conf_mx = conf_mx / row_sums
Now let's fill the diagonal with zeros to keep only the errors, and let's plot the result:
np.fill_diagonal(norm_conf_mx, 0)
plt.matshow(norm_conf_mx, cmap=plt.cm.gray)
plt.show()
Nowyoucanclearlyseethekindsoferrorstheclassifiermakes.Rememberthatrowsrepresentactualclasses,whilecolumnsrepresentpredictedclasses.Thecolumnsforclasses8and9arequitebright,whichtellsyouthatmanyimagesgetmisclassifiedas8sor9s.Similarly,therowsforclasses8and9arealsoquitebright,tellingyouthat8sand9sareoftenconfusedwithotherdigits.Conversely,somerowsareprettydark,suchasrow1:thismeansthatmost1sareclassifiedcorrectly(afewareconfusedwith8s,butthat’saboutit).Noticethattheerrorsarenotperfectlysymmetrical;forexample,therearemore5smisclassifiedas8sthanthereverse.
Analyzingtheconfusionmatrixcanoftengiveyouinsightsonwaystoimproveyourclassifier.Lookingatthisplot,itseemsthatyoureffortsshouldbespentonimprovingclassificationof8sand9s,aswellasfixingthespecific3/5confusion.Forexample,youcouldtrytogathermoretrainingdataforthesedigits.Oryoucouldengineernewfeaturesthatwouldhelptheclassifier—forexample,writinganalgorithmtocountthenumberofclosedloops(e.g.,8hastwo,6hasone,5hasnone).Oryoucouldpreprocesstheimages(e.g.,usingScikit-Image,Pillow,orOpenCV)tomakesomepatternsstandoutmore,suchasclosedloops.
Analyzing individual errors can also be a good way to gain insights on what your classifier is doing and why it is failing, but it is more difficult and time-consuming. For example, let's plot examples of 3s and 5s (the plot_digits() function just uses Matplotlib's imshow() function; see this chapter's Jupyter notebook for details):
cl_a, cl_b = 3, 5
X_aa = X_train[(y_train == cl_a) & (y_train_pred == cl_a)]
X_ab = X_train[(y_train == cl_a) & (y_train_pred == cl_b)]
X_ba = X_train[(y_train == cl_b) & (y_train_pred == cl_a)]
X_bb = X_train[(y_train == cl_b) & (y_train_pred == cl_b)]
plt.figure(figsize=(8, 8))
plt.subplot(221); plot_digits(X_aa[:25], images_per_row=5)
plt.subplot(222); plot_digits(X_ab[:25], images_per_row=5)
plt.subplot(223); plot_digits(X_ba[:25], images_per_row=5)
plt.subplot(224); plot_digits(X_bb[:25], images_per_row=5)
plt.show()
The two 5 × 5 blocks on the left show digits classified as 3s, and the two 5 × 5 blocks on the right show images classified as 5s. Some of the digits that the classifier gets wrong (i.e., in the bottom-left and top-right blocks) are so badly written that even a human would have trouble classifying them (e.g., the 5 on the 8th row and 1st column truly looks like a 3). However, most misclassified images seem like obvious errors to us, and it's hard to understand why the classifier made the mistakes it did.3 The reason is that we used a simple SGDClassifier, which is a linear model. All it does is assign a weight per class to each pixel, and when it sees a new image it just sums up the weighted pixel intensities to get a score for each class. So since 3s and 5s differ only by a few pixels, this model will easily confuse them.
The main difference between 3s and 5s is the position of the small line that joins the top line to the bottom arc. If you draw a 3 with the junction slightly shifted to the left, the classifier might classify it as a 5, and vice versa. In other words, this classifier is quite sensitive to image shifting and rotation. So one way to reduce the 3/5 confusion would be to preprocess the images to ensure that they are well centered and not too rotated. This will probably help reduce other errors as well.
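One possible way to do that (this is just a sketch, not this chapter's code) is to shift each image so that its center of mass sits in the middle of the 28 × 28 grid, using SciPy's ndimage helpers:
import numpy as np
from scipy.ndimage import center_of_mass, shift
def recenter_digit(flat_image):
    # Move the digit's center of mass to the center of the 28x28 grid
    image = flat_image.reshape(28, 28)
    cy, cx = center_of_mass(image)              # undefined for an all-zero image
    return shift(image, [14 - cy, 14 - cx], cval=0).reshape(784)
X_train_centered = np.apply_along_axis(recenter_digit, axis=1, arr=X_train)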
MultilabelClassificationUntilnoweachinstancehasalwaysbeenassignedtojustoneclass.Insomecasesyoumaywantyourclassifiertooutputmultipleclassesforeachinstance.Forexample,consideraface-recognitionclassifier:whatshoulditdoifitrecognizesseveralpeopleonthesamepicture?Ofcourseitshouldattachonelabelperpersonitrecognizes.Saytheclassifierhasbeentrainedtorecognizethreefaces,Alice,Bob,andCharlie;thenwhenitisshownapictureofAliceandCharlie,itshouldoutput[1,0,1](meaning“Aliceyes,Bobno,Charlieyes”).Suchaclassificationsystemthatoutputsmultiplebinarylabelsiscalledamultilabelclassificationsystem.
Wewon’tgointofacerecognitionjustyet,butlet’slookatasimplerexample,justforillustrationpurposes:
from sklearn.neighbors import KNeighborsClassifier
y_train_large = (y_train >= 7)
y_train_odd = (y_train % 2 == 1)
y_multilabel = np.c_[y_train_large, y_train_odd]
knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_multilabel)
Thiscodecreatesay_multilabelarraycontainingtwotargetlabelsforeachdigitimage:thefirstindicateswhetherornotthedigitislarge(7,8,or9)andthesecondindicateswhetherornotitisodd.ThenextlinescreateaKNeighborsClassifierinstance(whichsupportsmultilabelclassification,butnotallclassifiersdo)andwetrainitusingthemultipletargetsarray.Nowyoucanmakeaprediction,andnoticethatitoutputstwolabels:
>>> knn_clf.predict([some_digit])
array([[False,  True]], dtype=bool)
Anditgetsitright!Thedigit5isindeednotlarge(False)andodd(True).
There are many ways to evaluate a multilabel classifier, and selecting the right metric really depends on your project. For example, one approach is to measure the F1 score for each individual label (or any other binary classifier metric discussed earlier), then simply compute the average score. This code computes the average F1 score across all labels (note that the cross-validation must be run against the multilabel targets, y_multilabel):
>>> y_train_knn_pred = cross_val_predict(knn_clf, X_train, y_multilabel, cv=3)
>>> f1_score(y_multilabel, y_train_knn_pred, average="macro")
0.96845540180280221
This assumes that all labels are equally important, which may not be the case. In particular, if you have many more pictures of Alice than of Bob or Charlie, you may want to give more weight to the classifier's score on pictures of Alice. One simple option is to give each label a weight equal to its support (i.e., the number of instances with that target label). To do this, simply set average="weighted" in the preceding code.4
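For instance (output not shown here, since the exact score depends on your run):
>>> f1_score(y_multilabel, y_train_knn_pred, average="weighted")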
MultioutputClassificationThelasttypeofclassificationtaskwearegoingtodiscusshereiscalledmultioutput-multiclassclassification(orsimplymultioutputclassification).Itissimplyageneralizationofmultilabelclassificationwhereeachlabelcanbemulticlass(i.e.,itcanhavemorethantwopossiblevalues).
Toillustratethis,let’sbuildasystemthatremovesnoisefromimages.Itwilltakeasinputanoisydigitimage,anditwill(hopefully)outputacleandigitimage,representedasanarrayofpixelintensities,justliketheMNISTimages.Noticethattheclassifier’soutputismultilabel(onelabelperpixel)andeachlabelcanhavemultiplevalues(pixelintensityrangesfrom0to255).Itisthusanexampleofamultioutputclassificationsystem.
NOTEThelinebetweenclassificationandregressionissometimesblurry,suchasinthisexample.Arguably,predictingpixelintensityismoreakintoregressionthantoclassification.Moreover,multioutputsystemsarenotlimitedtoclassificationtasks;youcouldevenhaveasystemthatoutputsmultiplelabelsperinstance,includingbothclasslabelsandvaluelabels.
Let’sstartbycreatingthetrainingandtestsetsbytakingtheMNISTimagesandaddingnoisetotheirpixelintensitiesusingNumPy’srandint()function.Thetargetimageswillbetheoriginalimages:
noise = np.random.randint(0, 100, (len(X_train), 784))
X_train_mod = X_train + noise
noise = np.random.randint(0, 100, (len(X_test), 784))
X_test_mod = X_test + noise
y_train_mod = X_train
y_test_mod = X_test
Let's take a peek at an image from the test set (yes, we're snooping on the test data, so you should be frowning right now). On the left is the noisy input image, and on the right is the clean target image. Now let's train the classifier and make it clean this image:
knn_clf.fit(X_train_mod, y_train_mod)
clean_digit = knn_clf.predict([X_test_mod[some_index]])
plot_digit(clean_digit)
Lookscloseenoughtothetarget!Thisconcludesourtourofclassification.Hopefullyyoushouldnowknowhowtoselectgoodmetricsforclassificationtasks,picktheappropriateprecision/recalltradeoff,compareclassifiers,andmoregenerallybuildgoodclassificationsystemsforavarietyoftasks.
Exercises1. TrytobuildaclassifierfortheMNISTdatasetthatachievesover97%accuracyonthetestset.Hint:
theKNeighborsClassifierworksquitewellforthistask;youjustneedtofindgoodhyperparametervalues(tryagridsearchontheweightsandn_neighborshyperparameters).
2. WriteafunctionthatcanshiftanMNISTimageinanydirection(left,right,up,ordown)byonepixel.5Then,foreachimageinthetrainingset,createfourshiftedcopies(oneperdirection)andaddthemtothetrainingset.Finally,trainyourbestmodelonthisexpandedtrainingsetandmeasureitsaccuracyonthetestset.Youshouldobservethatyourmodelperformsevenbetternow!Thistechniqueofartificiallygrowingthetrainingsetiscalleddataaugmentationortrainingsetexpansion.
3. TackletheTitanicdataset.AgreatplacetostartisonKaggle.
4. Buildaspamclassifier(amorechallengingexercise):DownloadexamplesofspamandhamfromApacheSpamAssassin’spublicdatasets.
Unzipthedatasetsandfamiliarizeyourselfwiththedataformat.
Splitthedatasetsintoatrainingsetandatestset.
Writeadatapreparationpipelinetoconverteachemailintoafeaturevector.Yourpreparationpipelineshouldtransformanemailintoa(sparse)vectorindicatingthepresenceorabsenceofeachpossibleword.Forexample,ifallemailsonlyevercontainfourwords,“Hello,”“how,”“are,”“you,”thentheemail“HelloyouHelloHelloyou”wouldbeconvertedintoavector[1,0,0,1](meaning[“Hello”ispresent,“how”isabsent,“are”isabsent,“you”ispresent]),or[3,0,0,2]ifyouprefertocountthenumberofoccurrencesofeachword.
Youmaywanttoaddhyperparameterstoyourpreparationpipelinetocontrolwhetherornottostripoffemailheaders,converteachemailtolowercase,removepunctuation,replaceallURLswith“URL,”replaceallnumberswith“NUMBER,”orevenperformstemming(i.e.,trimoffwordendings;therearePythonlibrariesavailabletodothis).
Thentryoutseveralclassifiersandseeifyoucanbuildagreatspamclassifier,withbothhighrecallandhighprecision.
SolutionstotheseexercisesareavailableintheonlineJupyternotebooksathttps://github.com/ageron/handson-ml.
By default Scikit-Learn caches downloaded datasets in a directory called $HOME/scikit_learn_data.
Shuffling may be a bad idea in some contexts—for example, if you are working on time series data (such as stock market prices or weather conditions). We will explore this in the next chapters.
But remember that our brain is a fantastic pattern recognition system, and our visual system does a lot of complex preprocessing before any information reaches our consciousness, so the fact that it feels simple does not mean that it is.
Scikit-Learn offers a few other averaging options and multilabel classifier metrics; see the documentation for more details.
You can use the shift() function from the scipy.ndimage.interpolation module. For example, shift(image, [2, 1], cval=0) shifts the image 2 pixels down and 1 pixel to the right.
Chapter 4. Training Models
SofarwehavetreatedMachineLearningmodelsandtheirtrainingalgorithmsmostlylikeblackboxes.Ifyouwentthroughsomeoftheexercisesinthepreviouschapters,youmayhavebeensurprisedbyhowmuchyoucangetdonewithoutknowinganythingaboutwhat’sunderthehood:youoptimizedaregressionsystem,youimprovedadigitimageclassifier,andyouevenbuiltaspamclassifierfromscratch—allthiswithoutknowinghowtheyactuallywork.Indeed,inmanysituationsyoudon’treallyneedtoknowtheimplementationdetails.
However,havingagoodunderstandingofhowthingsworkcanhelpyouquicklyhomeinontheappropriatemodel,therighttrainingalgorithmtouse,andagoodsetofhyperparametersforyourtask.Understandingwhat’sunderthehoodwillalsohelpyoudebugissuesandperformerroranalysismoreefficiently.Lastly,mostofthetopicsdiscussedinthischapterwillbeessentialinunderstanding,building,andtrainingneuralnetworks(discussedinPartIIofthisbook).
Inthischapter,wewillstartbylookingattheLinearRegressionmodel,oneofthesimplestmodelsthereis.Wewilldiscusstwoverydifferentwaystotrainit:
Usingadirect“closed-form”equationthatdirectlycomputesthemodelparametersthatbestfitthemodeltothetrainingset(i.e.,themodelparametersthatminimizethecostfunctionoverthetrainingset).
Usinganiterativeoptimizationapproach,calledGradientDescent(GD),thatgraduallytweaksthemodelparameterstominimizethecostfunctionoverthetrainingset,eventuallyconvergingtothesamesetofparametersasthefirstmethod.WewilllookatafewvariantsofGradientDescentthatwewilluseagainandagainwhenwestudyneuralnetworksinPartII:BatchGD,Mini-batchGD,andStochasticGD.
NextwewilllookatPolynomialRegression,amorecomplexmodelthatcanfitnonlineardatasets.SincethismodelhasmoreparametersthanLinearRegression,itismorepronetooverfittingthetrainingdata,sowewilllookathowtodetectwhetherornotthisisthecase,usinglearningcurves,andthenwewilllookatseveralregularizationtechniquesthatcanreducetheriskofoverfittingthetrainingset.
Finally,wewilllookattwomoremodelsthatarecommonlyusedforclassificationtasks:LogisticRegressionandSoftmaxRegression.
WARNINGTherewillbequiteafewmathequationsinthischapter,usingbasicnotionsoflinearalgebraandcalculus.Tounderstandtheseequations,youwillneedtoknowwhatvectorsandmatricesare,howtotransposethem,whatthedotproductis,whatmatrixinverseis,andwhatpartialderivativesare.Ifyouareunfamiliarwiththeseconcepts,pleasegothroughthelinearalgebraandcalculusintroductorytutorialsavailableasJupyternotebooksintheonlinesupplementalmaterial.Forthosewhoaretrulyallergictomathematics,youshouldstillgothroughthischapterandsimplyskiptheequations;hopefully,thetextwillbesufficienttohelpyouunderstandmostoftheconcepts.
Linear Regression
In Chapter 1, we looked at a simple regression model of life satisfaction: life_satisfaction = θ0 + θ1 × GDP_per_capita.
This model is just a linear function of the input feature GDP_per_capita. θ0 and θ1 are the model's parameters.
More generally, a linear model makes a prediction by simply computing a weighted sum of the input features, plus a constant called the bias term (also called the intercept term), as shown in Equation 4-1.
Equation 4-1. Linear Regression model prediction
ŷ = θ0 + θ1x1 + θ2x2 + ⋯ + θnxn
ŷ is the predicted value.
n is the number of features.
xi is the ith feature value.
θj is the jth model parameter (including the bias term θ0 and the feature weights θ1, θ2, ⋯, θn).
Thiscanbewrittenmuchmoreconciselyusingavectorizedform,asshowninEquation4-2.
Equation4-2.LinearRegressionmodelprediction(vectorizedform)
θisthemodel’sparametervector,containingthebiastermθ0andthefeatureweightsθ1toθn.
θTisthetransposeofθ(arowvectorinsteadofacolumnvector).
xistheinstance’sfeaturevector,containingx0toxn,withx0alwaysequalto1.
θT·xisthedotproductofθTandx.
hθisthehypothesisfunction,usingthemodelparametersθ.
Okay,that’stheLinearRegressionmodel,sonowhowdowetrainit?Well,recallthattrainingamodelmeanssettingitsparameterssothatthemodelbestfitsthetrainingset.Forthispurpose,wefirstneedameasureofhowwell(orpoorly)themodelfitsthetrainingdata.InChapter2wesawthatthemostcommonperformancemeasureofaregressionmodelistheRootMeanSquareError(RMSE)(Equation2-1).Therefore,totrainaLinearRegressionmodel,youneedtofindthevalueofθthatminimizesthe
RMSE.Inpractice,itissimplertominimizetheMeanSquareError(MSE)thantheRMSE,anditleadstothesameresult(becausethevaluethatminimizesafunctionalsominimizesitssquareroot).1
TheMSEofaLinearRegressionhypothesishθonatrainingsetXiscalculatedusingEquation4-3.
Equation4-3.MSEcostfunctionforaLinearRegressionmodel
MostofthesenotationswerepresentedinChapter2(see“Notations”).Theonlydifferenceisthatwewritehθinsteadofjusthinordertomakeitclearthatthemodelisparametrizedbythevectorθ.Tosimplifynotations,wewilljustwriteMSE(θ)insteadofMSE(X,hθ).
TheNormalEquationTofindthevalueofθthatminimizesthecostfunction,thereisaclosed-formsolution—inotherwords,amathematicalequationthatgivestheresultdirectly.ThisiscalledtheNormalEquation(Equation4-4).2
Equation4-4.NormalEquation
isthevalueof thatminimizesthecostfunction.
yisthevectoroftargetvaluescontainingy(1)toy(m).
Let’sgeneratesomelinear-lookingdatatotestthisequationon(Figure4-1):
importnumpyasnp
X=2*np.random.rand(100,1)
y=4+3*X+np.random.randn(100,1)
Figure4-1.Randomlygeneratedlineardataset
Nowlet’scompute usingtheNormalEquation.Wewillusetheinv()functionfromNumPy’sLinearAlgebramodule(np.linalg)tocomputetheinverseofamatrix,andthedot()methodformatrixmultiplication:
X_b=np.c_[np.ones((100,1)),X]#addx0=1toeachinstance
theta_best=np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)
Theactualfunctionthatweusedtogeneratethedataisy=4+3x0+Gaussiannoise.Let’sseewhattheequationfound:
>>>theta_best
array([[4.21509616],
[2.77011339]])
Wewouldhavehopedforθ0=4andθ1=3insteadofθ0=4.215andθ1=2.770.Closeenough,butthenoisemadeitimpossibletorecovertheexactparametersoftheoriginalfunction.
Nowyoucanmakepredictionsusing :
>>>X_new=np.array([[0],[2]])
>>>X_new_b=np.c_[np.ones((2,1)),X_new]#addx0=1toeachinstance
>>>y_predict=X_new_b.dot(theta_best)
>>>y_predict
array([[4.21509616],
[9.75532293]])
Let’splotthismodel’spredictions(Figure4-2):
plt.plot(X_new,y_predict,"r-")
plt.plot(X,y,"b.")
plt.axis([0,2,0,15])
plt.show()
Figure4-2.LinearRegressionmodelpredictions
TheequivalentcodeusingScikit-Learnlookslikethis:3
>>>fromsklearn.linear_modelimportLinearRegression
>>>lin_reg=LinearRegression()
>>>lin_reg.fit(X,y)
>>>lin_reg.intercept_,lin_reg.coef_
(array([4.21509616]),array([[2.77011339]]))
>>>lin_reg.predict(X_new)
array([[4.21509616],
[9.75532293]])
ComputationalComplexityTheNormalEquationcomputestheinverseofXT·X,whichisann×nmatrix(wherenisthenumberoffeatures).ThecomputationalcomplexityofinvertingsuchamatrixistypicallyaboutO(n2.4)toO(n3)(dependingontheimplementation).Inotherwords,ifyoudoublethenumberoffeatures,youmultiplythecomputationtimebyroughly22.4=5.3to23=8.
WARNINGTheNormalEquationgetsveryslowwhenthenumberoffeaturesgrowslarge(e.g.,100,000).
Onthepositiveside,thisequationislinearwithregardstothenumberofinstancesinthetrainingset(itisO(m)),soithandleslargetrainingsetsefficiently,providedtheycanfitinmemory.
Also,onceyouhavetrainedyourLinearRegressionmodel(usingtheNormalEquationoranyotheralgorithm),predictionsareveryfast:thecomputationalcomplexityislinearwithregardstoboththenumberofinstancesyouwanttomakepredictionsonandthenumberoffeatures.Inotherwords,makingpredictionsontwiceasmanyinstances(ortwiceasmanyfeatures)willjusttakeroughlytwiceasmuchtime.
NowwewilllookatverydifferentwaystotrainaLinearRegressionmodel,bettersuitedforcaseswheretherearealargenumberoffeatures,ortoomanytraininginstancestofitinmemory.
GradientDescentGradientDescentisaverygenericoptimizationalgorithmcapableoffindingoptimalsolutionstoawiderangeofproblems.ThegeneralideaofGradientDescentistotweakparametersiterativelyinordertominimizeacostfunction.
Supposeyouarelostinthemountainsinadensefog;youcanonlyfeeltheslopeofthegroundbelowyourfeet.Agoodstrategytogettothebottomofthevalleyquicklyistogodownhillinthedirectionofthesteepestslope.ThisisexactlywhatGradientDescentdoes:itmeasuresthelocalgradientoftheerrorfunctionwithregardstotheparametervectorθ,anditgoesinthedirectionofdescendinggradient.Oncethegradientiszero,youhavereachedaminimum!
Concretely,youstartbyfillingθwithrandomvalues(thisiscalledrandominitialization),andthenyouimproveitgradually,takingonebabystepatatime,eachstepattemptingtodecreasethecostfunction(e.g.,theMSE),untilthealgorithmconvergestoaminimum(seeFigure4-3).
Figure4-3.GradientDescent
AnimportantparameterinGradientDescentisthesizeofthesteps,determinedbythelearningratehyperparameter.Ifthelearningrateistoosmall,thenthealgorithmwillhavetogothroughmanyiterationstoconverge,whichwilltakealongtime(seeFigure4-4).
Figure4-4.Learningratetoosmall
Ontheotherhand,ifthelearningrateistoohigh,youmightjumpacrossthevalleyandendupontheotherside,possiblyevenhigherupthanyouwerebefore.Thismightmakethealgorithmdiverge,withlargerandlargervalues,failingtofindagoodsolution(seeFigure4-5).
Figure4-5.Learningratetoolarge
Finally,notallcostfunctionslooklikeniceregularbowls.Theremaybeholes,ridges,plateaus,andallsortsofirregularterrains,makingconvergencetotheminimumverydifficult.Figure4-6showsthetwomainchallengeswithGradientDescent:iftherandominitializationstartsthealgorithmontheleft,thenitwillconvergetoalocalminimum,whichisnotasgoodastheglobalminimum.Ifitstartsontheright,thenitwilltakeaverylongtimetocrosstheplateau,andifyoustoptooearlyyouwillneverreachtheglobalminimum.
Figure4-6.GradientDescentpitfalls
Fortunately,theMSEcostfunctionforaLinearRegressionmodelhappenstobeaconvexfunction,whichmeansthatifyoupickanytwopointsonthecurve,thelinesegmentjoiningthemnevercrossesthecurve.Thisimpliesthattherearenolocalminima,justoneglobalminimum.Itisalsoacontinuousfunctionwithaslopethatneverchangesabruptly.4Thesetwofactshaveagreatconsequence:GradientDescentisguaranteedtoapproacharbitrarilyclosetheglobalminimum(ifyouwaitlongenoughandifthelearningrateisnottoohigh).
Infact,thecostfunctionhastheshapeofabowl,butitcanbeanelongatedbowlifthefeatureshaveverydifferentscales.Figure4-7showsGradientDescentonatrainingsetwherefeatures1and2havethesamescale(ontheleft),andonatrainingsetwherefeature1hasmuchsmallervaluesthanfeature2(ontheright).5
Figure4-7.GradientDescentwithandwithoutfeaturescaling
Asyoucansee,onthelefttheGradientDescentalgorithmgoesstraighttowardtheminimum,therebyreachingitquickly,whereasontherightitfirstgoesinadirectionalmostorthogonaltothedirectionoftheglobalminimum,anditendswithalongmarchdownanalmostflatvalley.Itwilleventuallyreachtheminimum,butitwilltakealongtime.
WARNINGWhenusingGradientDescent,youshouldensurethatallfeatureshaveasimilarscale(e.g.,usingScikit-Learn’sStandardScalerclass),orelseitwilltakemuchlongertoconverge.
Thisdiagramalsoillustratesthefactthattrainingamodelmeanssearchingforacombinationofmodelparametersthatminimizesacostfunction(overthetrainingset).Itisasearchinthemodel’sparameterspace:themoreparametersamodelhas,themoredimensionsthisspacehas,andtheharderthesearchis:searchingforaneedleina300-dimensionalhaystackismuchtrickierthaninthreedimensions.Fortunately,sincethecostfunctionisconvexinthecaseofLinearRegression,theneedleissimplyatthebottomofthebowl.
BatchGradientDescentToimplementGradientDescent,youneedtocomputethegradientofthecostfunctionwithregardstoeachmodelparameterθj.Inotherwords,youneedtocalculatehowmuchthecostfunctionwillchangeifyouchangeθjjustalittlebit.Thisiscalledapartialderivative.Itislikeasking“whatistheslopeofthemountainundermyfeetifIfaceeast?”andthenaskingthesamequestionfacingnorth(andsoonforallotherdimensions,ifyoucanimagineauniversewithmorethanthreedimensions).Equation4-5computes
thepartialderivativeofthecostfunctionwithregardstoparameterθj,noted .
Equation4-5.Partialderivativesofthecostfunction
Insteadofcomputingthesepartialderivativesindividually,youcanuseEquation4-6tocomputethemallinonego.Thegradientvector,noted θMSE(θ),containsallthepartialderivativesofthecostfunction(oneforeachmodelparameter).
Equation4-6.Gradientvectorofthecostfunction
WARNINGNoticethatthisformulainvolvescalculationsoverthefulltrainingsetX,ateachGradientDescentstep!ThisiswhythealgorithmiscalledBatchGradientDescent:itusesthewholebatchoftrainingdataateverystep.Asaresultitisterriblyslowonverylargetrainingsets(butwewillseemuchfasterGradientDescentalgorithmsshortly).However,GradientDescentscaleswellwiththenumberoffeatures;trainingaLinearRegressionmodelwhentherearehundredsofthousandsoffeaturesismuchfasterusingGradientDescentthanusingtheNormalEquation.
Onceyouhavethegradientvector,whichpointsuphill,justgointheoppositedirectiontogodownhill.Thismeanssubtracting θMSE(θ)fromθ.Thisiswherethelearningrateηcomesintoplay:6multiplythegradientvectorbyηtodeterminethesizeofthedownhillstep(Equation4-7).
Equation4-7.GradientDescentstep
Let’slookataquickimplementationofthisalgorithm:
eta=0.1#learningrate
n_iterations=1000
m=100
theta=np.random.randn(2,1)#randominitialization
foriterationinrange(n_iterations):
gradients=2/m*X_b.T.dot(X_b.dot(theta)-y)
theta=theta-eta*gradients
Thatwasn’ttoohard!Let’slookattheresultingtheta:
>>>theta
array([[4.21509616],
[2.77011339]])
Hey,that’sexactlywhattheNormalEquationfound!GradientDescentworkedperfectly.Butwhatifyouhadusedadifferentlearningrateeta?Figure4-8showsthefirst10stepsofGradientDescentusingthreedifferentlearningrates(thedashedlinerepresentsthestartingpoint).
Figure4-8.GradientDescentwithvariouslearningrates
Ontheleft,thelearningrateistoolow:thealgorithmwilleventuallyreachthesolution,butitwilltakealongtime.Inthemiddle,thelearningratelooksprettygood:injustafewiterations,ithasalreadyconvergedtothesolution.Ontheright,thelearningrateistoohigh:thealgorithmdiverges,jumpingallovertheplaceandactuallygettingfurtherandfurtherawayfromthesolutionateverystep.
Tofindagoodlearningrate,youcanusegridsearch(seeChapter2).However,youmaywanttolimitthenumberofiterationssothatgridsearchcaneliminatemodelsthattaketoolongtoconverge.
Youmaywonderhowtosetthenumberofiterations.Ifitistoolow,youwillstillbefarawayfromtheoptimalsolutionwhenthealgorithmstops,butifitistoohigh,youwillwastetimewhilethemodelparametersdonotchangeanymore.Asimplesolutionistosetaverylargenumberofiterationsbuttointerruptthealgorithmwhenthegradientvectorbecomestiny—thatis,whenitsnormbecomessmallerthanatinynumberϵ(calledthetolerance)—becausethishappenswhenGradientDescenthas(almost)
reachedtheminimum.
CONVERGENCERATE
Whenthecostfunctionisconvexanditsslopedoesnotchangeabruptly(asisthecasefortheMSEcostfunction),itcanbeshownthat
BatchGradientDescentwithafixedlearningratehasaconvergencerateof .Inotherwords,ifyoudividethetoleranceϵby10(tohaveamoreprecisesolution),thenthealgorithmwillhavetorunabout10timesmoreiterations.
StochasticGradientDescentThemainproblemwithBatchGradientDescentisthefactthatitusesthewholetrainingsettocomputethegradientsateverystep,whichmakesitveryslowwhenthetrainingsetislarge.Attheoppositeextreme,StochasticGradientDescentjustpicksarandominstanceinthetrainingsetateverystepandcomputesthegradientsbasedonlyonthatsingleinstance.Obviouslythismakesthealgorithmmuchfastersinceithasverylittledatatomanipulateateveryiteration.Italsomakesitpossibletotrainonhugetrainingsets,sinceonlyoneinstanceneedstobeinmemoryateachiteration(SGDcanbeimplementedasanout-of-corealgorithm.7)
Ontheotherhand,duetoitsstochastic(i.e.,random)nature,thisalgorithmismuchlessregularthanBatchGradientDescent:insteadofgentlydecreasinguntilitreachestheminimum,thecostfunctionwillbounceupanddown,decreasingonlyonaverage.Overtimeitwillendupveryclosetotheminimum,butonceitgetsthereitwillcontinuetobouncearound,neversettlingdown(seeFigure4-9).Sooncethealgorithmstops,thefinalparametervaluesaregood,butnotoptimal.
Figure4-9.StochasticGradientDescent
Whenthecostfunctionisveryirregular(asinFigure4-6),thiscanactuallyhelpthealgorithmjumpoutoflocalminima,soStochasticGradientDescenthasabetterchanceoffindingtheglobalminimumthanBatchGradientDescentdoes.
Thereforerandomnessisgoodtoescapefromlocaloptima,butbadbecauseitmeansthatthealgorithmcanneversettleattheminimum.Onesolutiontothisdilemmaistograduallyreducethelearningrate.Thestepsstartoutlarge(whichhelpsmakequickprogressandescapelocalminima),thengetsmallerandsmaller,allowingthealgorithmtosettleattheglobalminimum.Thisprocessiscalledsimulated
annealing,becauseitresemblestheprocessofannealinginmetallurgywheremoltenmetalisslowlycooleddown.Thefunctionthatdeterminesthelearningrateateachiterationiscalledthelearningschedule.Ifthelearningrateisreducedtooquickly,youmaygetstuckinalocalminimum,orevenendupfrozenhalfwaytotheminimum.Ifthelearningrateisreducedtooslowly,youmayjumparoundtheminimumforalongtimeandendupwithasuboptimalsolutionifyouhalttrainingtooearly.
ThiscodeimplementsStochasticGradientDescentusingasimplelearningschedule:
n_epochs=50
t0,t1=5,50#learningschedulehyperparameters
deflearning_schedule(t):
returnt0/(t+t1)
theta=np.random.randn(2,1)#randominitialization
forepochinrange(n_epochs):
foriinrange(m):
random_index=np.random.randint(m)
xi=X_b[random_index:random_index+1]
yi=y[random_index:random_index+1]
gradients=2*xi.T.dot(xi.dot(theta)-yi)
eta=learning_schedule(epoch*m+i)
theta=theta-eta*gradients
Byconventionweiteratebyroundsofmiterations;eachroundiscalledanepoch.WhiletheBatchGradientDescentcodeiterated1,000timesthroughthewholetrainingset,thiscodegoesthroughthetrainingsetonly50timesandreachesafairlygoodsolution:
>>>theta
array([[4.21076011],
[2.74856079]])
Figure4-10showsthefirst10stepsoftraining(noticehowirregularthestepsare).
Figure4-10.StochasticGradientDescentfirst10steps
Notethatsinceinstancesarepickedrandomly,someinstancesmaybepickedseveraltimesperepochwhileothersmaynotbepickedatall.Ifyouwanttobesurethatthealgorithmgoesthrougheveryinstanceateachepoch,anotherapproachistoshufflethetrainingset,thengothroughitinstancebyinstance,thenshuffleitagain,andsoon.However,thisgenerallyconvergesmoreslowly.
ToperformLinearRegressionusingSGDwithScikit-Learn,youcanusetheSGDRegressorclass,whichdefaultstooptimizingthesquarederrorcostfunction.Thefollowingcoderuns50epochs,startingwithalearningrateof0.1(eta0=0.1),usingthedefaultlearningschedule(differentfromtheprecedingone),anditdoesnotuseanyregularization(penalty=None;moredetailsonthisshortly):
fromsklearn.linear_modelimportSGDRegressor
sgd_reg=SGDRegressor(n_iter=50,penalty=None,eta0=0.1)
sgd_reg.fit(X,y.ravel())
Onceagain,youfindasolutionveryclosetotheonereturnedbytheNormalEquation:
>>>sgd_reg.intercept_,sgd_reg.coef_
(array([4.16782089]),array([2.72603052]))
Mini-batchGradientDescentThelastGradientDescentalgorithmwewilllookatiscalledMini-batchGradientDescent.ItisquitesimpletounderstandonceyouknowBatchandStochasticGradientDescent:ateachstep,insteadofcomputingthegradientsbasedonthefulltrainingset(asinBatchGD)orbasedonjustoneinstance(asinStochasticGD),Mini-batchGDcomputesthegradientsonsmallrandomsetsofinstancescalledmini-batches.ThemainadvantageofMini-batchGDoverStochasticGDisthatyoucangetaperformanceboostfromhardwareoptimizationofmatrixoperations,especiallywhenusingGPUs.
Thealgorithm’sprogressinparameterspaceislesserraticthanwithSGD,especiallywithfairlylargemini-batches.Asaresult,Mini-batchGDwillendupwalkingaroundabitclosertotheminimumthanSGD.But,ontheotherhand,itmaybeharderforittoescapefromlocalminima(inthecaseofproblemsthatsufferfromlocalminima,unlikeLinearRegressionaswesawearlier).Figure4-11showsthepathstakenbythethreeGradientDescentalgorithmsinparameterspaceduringtraining.Theyallendupneartheminimum,butBatchGD’spathactuallystopsattheminimum,whilebothStochasticGDandMini-batchGDcontinuetowalkaround.However,don’tforgetthatBatchGDtakesalotoftimetotakeeachstep,andStochasticGDandMini-batchGDwouldalsoreachtheminimumifyouusedagoodlearningschedule.
Figure4-11.GradientDescentpathsinparameterspace
Let’scomparethealgorithmswe’vediscussedsofarforLinearRegression8(recallthatmisthenumberoftraininginstancesandnisthenumberoffeatures);seeTable4-1.
Table4-1.ComparisonofalgorithmsforLinearRegression
Algorithm Largem Out-of-coresupport Largen Hyperparams Scalingrequired Scikit-Learn
NormalEquation Fast No Slow 0 No LinearRegression
BatchGD Slow No Fast 2 Yes n/a
StochasticGD Fast Yes Fast ≥2 Yes SGDRegressor
Mini-batchGD Fast Yes Fast ≥2 Yes n/a
NOTEThereisalmostnodifferenceaftertraining:allthesealgorithmsendupwithverysimilarmodelsandmakepredictionsinexactlythesameway.
PolynomialRegressionWhatifyourdataisactuallymorecomplexthanasimplestraightline?Surprisingly,youcanactuallyusealinearmodeltofitnonlineardata.Asimplewaytodothisistoaddpowersofeachfeatureasnewfeatures,thentrainalinearmodelonthisextendedsetoffeatures.ThistechniqueiscalledPolynomialRegression.
Let’slookatanexample.First,let’sgeneratesomenonlineardata,basedonasimplequadraticequation9(plussomenoise;seeFigure4-12):
m=100
X=6*np.random.rand(m,1)-3
y=0.5*X**2+X+2+np.random.randn(m,1)
Figure4-12.Generatednonlinearandnoisydataset
Clearly,astraightlinewillneverfitthisdataproperly.Solet’suseScikit-Learn’sPolynomialFeaturesclasstotransformourtrainingdata,addingthesquare(2nd-degreepolynomial)ofeachfeatureinthetrainingsetasnewfeatures(inthiscasethereisjustonefeature):
>>>fromsklearn.preprocessingimportPolynomialFeatures
>>>poly_features=PolynomialFeatures(degree=2,include_bias=False)
>>>X_poly=poly_features.fit_transform(X)
>>>X[0]
array([-0.75275929])
>>>X_poly[0]
array([-0.75275929,0.56664654])
X_polynowcontainstheoriginalfeatureofXplusthesquareofthisfeature.NowyoucanfitaLinearRegressionmodeltothisextendedtrainingdata(Figure4-13):
>>>lin_reg=LinearRegression()
>>>lin_reg.fit(X_poly,y)
>>>lin_reg.intercept_,lin_reg.coef_
(array([1.78134581]),array([[0.93366893,0.56456263]]))
Figure4-13.PolynomialRegressionmodelpredictions
Notbad:themodelestimates wheninfacttheoriginalfunctionwas.
Notethatwhentherearemultiplefeatures,PolynomialRegressioniscapableoffindingrelationshipsbetweenfeatures(whichissomethingaplainLinearRegressionmodelcannotdo).ThisismadepossiblebythefactthatPolynomialFeaturesalsoaddsallcombinationsoffeaturesuptothegivendegree.Forexample,ifthereweretwofeaturesaandb,PolynomialFeatureswithdegree=3wouldnotonlyaddthefeaturesa2,a3,b2,andb3,butalsothecombinationsab,a2b,andab2.
WARNING
PolynomialFeatures(degree=d)transformsanarraycontainingnfeaturesintoanarraycontaining features,wheren!isthefactorialofn,equalto1×2×3×× n.Bewareofthecombinatorialexplosionofthenumberoffeatures!
LearningCurvesIfyouperformhigh-degreePolynomialRegression,youwilllikelyfitthetrainingdatamuchbetterthanwithplainLinearRegression.Forexample,Figure4-14appliesa300-degreepolynomialmodeltotheprecedingtrainingdata,andcomparestheresultwithapurelinearmodelandaquadraticmodel(2nd-degreepolynomial).Noticehowthe300-degreepolynomialmodelwigglesaroundtogetascloseaspossibletothetraininginstances.
Figure4-14.High-degreePolynomialRegression
Ofcourse,thishigh-degreePolynomialRegressionmodelisseverelyoverfittingthetrainingdata,whilethelinearmodelisunderfittingit.Themodelthatwillgeneralizebestinthiscaseisthequadraticmodel.Itmakessensesincethedatawasgeneratedusingaquadraticmodel,butingeneralyouwon’tknowwhatfunctiongeneratedthedata,sohowcanyoudecidehowcomplexyourmodelshouldbe?Howcanyoutellthatyourmodelisoverfittingorunderfittingthedata?
InChapter2youusedcross-validationtogetanestimateofamodel’sgeneralizationperformance.Ifamodelperformswellonthetrainingdatabutgeneralizespoorlyaccordingtothecross-validationmetrics,thenyourmodelisoverfitting.Ifitperformspoorlyonboth,thenitisunderfitting.Thisisonewaytotellwhenamodelistoosimpleortoocomplex.
Anotherwayistolookatthelearningcurves:theseareplotsofthemodel’sperformanceonthetrainingsetandthevalidationsetasafunctionofthetrainingsetsize.Togeneratetheplots,simplytrainthemodelseveraltimesondifferentsizedsubsetsofthetrainingset.Thefollowingcodedefinesafunctionthatplotsthelearningcurvesofamodelgivensometrainingdata:
fromsklearn.metricsimportmean_squared_error
fromsklearn.model_selectionimporttrain_test_split
defplot_learning_curves(model,X,y):
X_train,X_val,y_train,y_val=train_test_split(X,y,test_size=0.2)
train_errors,val_errors=[],[]
forminrange(1,len(X_train)):
model.fit(X_train[:m],y_train[:m])
y_train_predict=model.predict(X_train[:m])
y_val_predict=model.predict(X_val)
train_errors.append(mean_squared_error(y_train_predict,y_train[:m]))
val_errors.append(mean_squared_error(y_val_predict,y_val))
plt.plot(np.sqrt(train_errors),"r-+",linewidth=2,label="train")
plt.plot(np.sqrt(val_errors),"b-",linewidth=3,label="val")
Let’slookatthelearningcurvesoftheplainLinearRegressionmodel(astraightline;Figure4-15):
lin_reg=LinearRegression()
plot_learning_curves(lin_reg,X,y)
Figure4-15.Learningcurves
Thisdeservesabitofexplanation.First,let’slookattheperformanceonthetrainingdata:whentherearejustoneortwoinstancesinthetrainingset,themodelcanfitthemperfectly,whichiswhythecurvestartsatzero.Butasnewinstancesareaddedtothetrainingset,itbecomesimpossibleforthemodeltofitthetrainingdataperfectly,bothbecausethedataisnoisyandbecauseitisnotlinearatall.Sotheerroronthetrainingdatagoesupuntilitreachesaplateau,atwhichpointaddingnewinstancestothetrainingsetdoesn’tmaketheaverageerrormuchbetterorworse.Nowlet’slookattheperformanceofthemodelonthevalidationdata.Whenthemodelistrainedonveryfewtraininginstances,itisincapableofgeneralizingproperly,whichiswhythevalidationerrorisinitiallyquitebig.Thenasthemodelisshownmoretrainingexamples,itlearnsandthusthevalidationerrorslowlygoesdown.However,onceagainastraightlinecannotdoagoodjobmodelingthedata,sotheerrorendsupataplateau,veryclosetotheothercurve.
Theselearningcurvesaretypicalofanunderfittingmodel.Bothcurveshavereachedaplateau;theyarecloseandfairlyhigh.
TIPIfyourmodelisunderfittingthetrainingdata,addingmoretrainingexampleswillnothelp.Youneedtouseamorecomplexmodelorcomeupwithbetterfeatures.
Nowlet’slookatthelearningcurvesofa10th-degreepolynomialmodelonthesamedata(Figure4-16):
fromsklearn.pipelineimportPipeline
polynomial_regression=Pipeline((
("poly_features",PolynomialFeatures(degree=10,include_bias=False)),
("lin_reg",LinearRegression()),
))
plot_learning_curves(polynomial_regression,X,y)
Theselearningcurveslookabitlikethepreviousones,buttherearetwoveryimportantdifferences:TheerroronthetrainingdataismuchlowerthanwiththeLinearRegressionmodel.
Thereisagapbetweenthecurves.Thismeansthatthemodelperformssignificantlybetteronthetrainingdatathanonthevalidationdata,whichisthehallmarkofanoverfittingmodel.However,ifyouusedamuchlargertrainingset,thetwocurveswouldcontinuetogetcloser.
Figure4-16.Learningcurvesforthepolynomialmodel
TIPOnewaytoimproveanoverfittingmodelistofeeditmoretrainingdatauntilthevalidationerrorreachesthetrainingerror.
THEBIAS/VARIANCETRADEOFF
AnimportanttheoreticalresultofstatisticsandMachineLearningisthefactthatamodel’sgeneralizationerrorcanbeexpressedasthesumofthreeverydifferenterrors:
BiasThispartofthegeneralizationerrorisduetowrongassumptions,suchasassumingthatthedataislinearwhenitisactuallyquadratic.Ahigh-biasmodelismostlikelytounderfitthetrainingdata.10
VarianceThispartisduetothemodel’sexcessivesensitivitytosmallvariationsinthetrainingdata.Amodelwithmanydegreesoffreedom(suchasahigh-degreepolynomialmodel)islikelytohavehighvariance,andthustooverfitthetrainingdata.
IrreducibleerrorThispartisduetothenoisinessofthedataitself.Theonlywaytoreducethispartoftheerroristocleanupthedata(e.g.,fixthedatasources,suchasbrokensensors,ordetectandremoveoutliers).
Increasingamodel’scomplexitywilltypicallyincreaseitsvarianceandreduceitsbias.Conversely,reducingamodel’scomplexityincreasesitsbiasandreducesitsvariance.Thisiswhyitiscalledatradeoff.
RegularizedLinearModelsAswesawinChapters1and2,agoodwaytoreduceoverfittingistoregularizethemodel(i.e.,toconstrainit):thefewerdegreesoffreedomithas,theharderitwillbeforittooverfitthedata.Forexample,asimplewaytoregularizeapolynomialmodelistoreducethenumberofpolynomialdegrees.
Foralinearmodel,regularizationistypicallyachievedbyconstrainingtheweightsofthemodel.WewillnowlookatRidgeRegression,LassoRegression,andElasticNet,whichimplementthreedifferentwaystoconstraintheweights.
RidgeRegressionRidgeRegression(alsocalledTikhonovregularization)isaregularizedversionofLinearRegression:a
regularizationtermequalto isaddedtothecostfunction.Thisforcesthelearningalgorithmtonotonlyfitthedatabutalsokeepthemodelweightsassmallaspossible.Notethattheregularizationtermshouldonlybeaddedtothecostfunctionduringtraining.Oncethemodelistrained,youwanttoevaluatethemodel’sperformanceusingtheunregularizedperformancemeasure.
NOTEItisquitecommonforthecostfunctionusedduringtrainingtobedifferentfromtheperformancemeasureusedfortesting.Apartfromregularization,anotherreasonwhytheymightbedifferentisthatagoodtrainingcostfunctionshouldhaveoptimization-friendlyderivatives,whiletheperformancemeasureusedfortestingshouldbeascloseaspossibletothefinalobjective.Agoodexampleofthisisaclassifiertrainedusingacostfunctionsuchasthelogloss(discussedinamoment)butevaluatedusingprecision/recall.
Thehyperparameterαcontrolshowmuchyouwanttoregularizethemodel.Ifα=0thenRidgeRegressionisjustLinearRegression.Ifαisverylarge,thenallweightsendupveryclosetozeroandtheresultisaflatlinegoingthroughthedata’smean.Equation4-8presentstheRidgeRegressioncostfunction.11
Equation4-8.RidgeRegressioncostfunction
Notethatthebiastermθ0isnotregularized(thesumstartsati=1,not0).Ifwedefinewasthevectoroffeatureweights(θ1toθn),thentheregularizationtermissimplyequalto½( w2)2,where· 2representstheℓ2normoftheweightvector.12ForGradientDescent,justaddαwtotheMSEgradientvector(Equation4-6).
WARNINGItisimportanttoscalethedata(e.g.,usingaStandardScaler)beforeperformingRidgeRegression,asitissensitivetothescaleoftheinputfeatures.Thisistrueofmostregularizedmodels.
Figure4-17showsseveralRidgemodelstrainedonsomelineardatausingdifferentαvalue.Ontheleft,plainRidgemodelsareused,leadingtolinearpredictions.Ontheright,thedataisfirstexpandedusingPolynomialFeatures(degree=10),thenitisscaledusingaStandardScaler,andfinallytheRidge
modelsareappliedtotheresultingfeatures:thisisPolynomialRegressionwithRidgeregularization.Notehowincreasingαleadstoflatter(i.e.,lessextreme,morereasonable)predictions;thisreducesthemodel’svariancebutincreasesitsbias.
AswithLinearRegression,wecanperformRidgeRegressioneitherbycomputingaclosed-formequationorbyperformingGradientDescent.Theprosandconsarethesame.Equation4-9showstheclosed-formsolution(whereAisthen×nidentitymatrix13exceptwitha0inthetop-leftcell,correspondingtothebiasterm).
Figure4-17.RidgeRegression
Equation4-9.RidgeRegressionclosed-formsolution
HereishowtoperformRidgeRegressionwithScikit-Learnusingaclosed-formsolution(avariantofEquation4-9usingamatrixfactorizationtechniquebyAndré-LouisCholesky):
>>>fromsklearn.linear_modelimportRidge
>>>ridge_reg=Ridge(alpha=1,solver="cholesky")
>>>ridge_reg.fit(X,y)
>>>ridge_reg.predict([[1.5]])
array([[1.55071465]])
AndusingStochasticGradientDescent:14
>>>sgd_reg=SGDRegressor(penalty="l2")
>>>sgd_reg.fit(X,y.ravel())
>>>sgd_reg.predict([[1.5]])
array([1.13500145])
Thepenaltyhyperparametersetsthetypeofregularizationtermtouse.Specifying"l2"indicatesthatyouwantSGDtoaddaregularizationtermtothecostfunctionequaltohalfthesquareoftheℓ2normoftheweightvector:thisissimplyRidgeRegression.
LassoRegressionLeastAbsoluteShrinkageandSelectionOperatorRegression(simplycalledLassoRegression)isanotherregularizedversionofLinearRegression:justlikeRidgeRegression,itaddsaregularizationtermtothecostfunction,butitusestheℓ1normoftheweightvectorinsteadofhalfthesquareoftheℓ2norm(seeEquation4-10).
Equation4-10.LassoRegressioncostfunction
Figure4-18showsthesamethingasFigure4-17butreplacesRidgemodelswithLassomodelsandusessmallerαvalues.
Figure4-18.LassoRegression
AnimportantcharacteristicofLassoRegressionisthatittendstocompletelyeliminatetheweightsoftheleastimportantfeatures(i.e.,setthemtozero).Forexample,thedashedlineintherightplotonFigure4-18(withα=10-7)looksquadratic,almostlinear:alltheweightsforthehigh-degreepolynomialfeaturesareequaltozero.Inotherwords,LassoRegressionautomaticallyperformsfeatureselectionandoutputsasparsemodel(i.e.,withfewnonzerofeatureweights).
YoucangetasenseofwhythisisthecasebylookingatFigure4-19:onthetop-leftplot,thebackgroundcontours(ellipses)representanunregularizedMSEcostfunction(α=0),andthewhitecirclesshowtheBatchGradientDescentpathwiththatcostfunction.Theforegroundcontours(diamonds)representtheℓ1penalty,andthetrianglesshowtheBGDpathforthispenaltyonly(α→∞).Noticehowthepathfirstreachesθ1=0,thenrollsdownagutteruntilitreachesθ2=0.Onthetop-rightplot,thecontoursrepresentthesamecostfunctionplusanℓ1penaltywithα=0.5.Theglobalminimumisontheθ2=0axis.BGD
firstreachesθ2=0,thenrollsdownthegutteruntilitreachestheglobalminimum.Thetwobottomplotsshowthesamethingbutusesanℓ2penaltyinstead.Theregularizedminimumisclosertoθ=0thantheunregularizedminimum,buttheweightsdonotgetfullyeliminated.
Figure4-19.LassoversusRidgeregularization
TIPOntheLassocostfunction,theBGDpathtendstobounceacrosstheguttertowardtheend.Thisisbecausetheslopechangesabruptlyatθ2=0.Youneedtograduallyreducethelearningrateinordertoactuallyconvergetotheglobalminimum.
TheLassocostfunctionisnotdifferentiableatθi=0(fori=1,2,, n),butGradientDescentstillworksfineifyouuseasubgradientvectorg15insteadwhenanyθi=0.Equation4-11showsasubgradientvectorequationyoucanuseforGradientDescentwiththeLassocostfunction.
Equation4-11.LassoRegressionsubgradientvector
HereisasmallScikit-LearnexampleusingtheLassoclass.NotethatyoucouldinsteaduseanSGDRegressor(penalty="l1").
>>>fromsklearn.linear_modelimportLasso
>>>lasso_reg=Lasso(alpha=0.1)
>>>lasso_reg.fit(X,y)
>>>lasso_reg.predict([[1.5]])
array([1.53788174])
ElasticNetElasticNetisamiddlegroundbetweenRidgeRegressionandLassoRegression.TheregularizationtermisasimplemixofbothRidgeandLasso’sregularizationterms,andyoucancontrolthemixratior.Whenr=0,ElasticNetisequivalenttoRidgeRegression,andwhenr=1,itisequivalenttoLassoRegression(seeEquation4-12).
Equation4-12.ElasticNetcostfunction
SowhenshouldyouuseplainLinearRegression(i.e.,withoutanyregularization),Ridge,Lasso,orElasticNet?Itisalmostalwayspreferabletohaveatleastalittlebitofregularization,sogenerallyyoushouldavoidplainLinearRegression.Ridgeisagooddefault,butifyoususpectthatonlyafewfeaturesareactuallyuseful,youshouldpreferLassoorElasticNetsincetheytendtoreducetheuselessfeatures’weightsdowntozeroaswehavediscussed.Ingeneral,ElasticNetispreferredoverLassosinceLassomaybehaveerraticallywhenthenumberoffeaturesisgreaterthanthenumberoftraininginstancesorwhenseveralfeaturesarestronglycorrelated.
HereisashortexampleusingScikit-Learn’sElasticNet(l1_ratiocorrespondstothemixratior):
>>>fromsklearn.linear_modelimportElasticNet
>>>elastic_net=ElasticNet(alpha=0.1,l1_ratio=0.5)
>>>elastic_net.fit(X,y)
>>>elastic_net.predict([[1.5]])
array([1.54333232])
EarlyStoppingAverydifferentwaytoregularizeiterativelearningalgorithmssuchasGradientDescentistostoptrainingassoonasthevalidationerrorreachesaminimum.Thisiscalledearlystopping.Figure4-20showsacomplexmodel(inthiscaseahigh-degreePolynomialRegressionmodel)beingtrainedusingBatchGradientDescent.Astheepochsgoby,thealgorithmlearnsanditspredictionerror(RMSE)onthetrainingsetnaturallygoesdown,andsodoesitspredictionerroronthevalidationset.However,afterawhilethevalidationerrorstopsdecreasingandactuallystartstogobackup.Thisindicatesthatthemodelhasstartedtooverfitthetrainingdata.Withearlystoppingyoujuststoptrainingassoonasthevalidationerrorreachestheminimum.ItissuchasimpleandefficientregularizationtechniquethatGeoffreyHintoncalledita“beautifulfreelunch.”
Figure4-20.Earlystoppingregularization
TIPWithStochasticandMini-batchGradientDescent,thecurvesarenotsosmooth,anditmaybehardtoknowwhetheryouhavereachedtheminimumornot.Onesolutionistostoponlyafterthevalidationerrorhasbeenabovetheminimumforsometime(whenyouareconfidentthatthemodelwillnotdoanybetter),thenrollbackthemodelparameterstothepointwherethevalidationerrorwasataminimum.
Hereisabasicimplementationofearlystopping:
fromsklearn.baseimportclone
sgd_reg=SGDRegressor(n_iter=1,warm_start=True,penalty=None,
learning_rate="constant",eta0=0.0005)
minimum_val_error=float("inf")
best_epoch=None
best_model=None
forepochinrange(1000):
sgd_reg.fit(X_train_poly_scaled,y_train)#continueswhereitleftoff
y_val_predict=sgd_reg.predict(X_val_poly_scaled)
val_error=mean_squared_error(y_val_predict,y_val)
ifval_error<minimum_val_error:
minimum_val_error=val_error
best_epoch=epoch
best_model=clone(sgd_reg)
Notethatwithwarm_start=True,whenthefit()methodiscalled,itjustcontinuestrainingwhereitleftoffinsteadofrestartingfromscratch.
LogisticRegressionAswediscussedinChapter1,someregressionalgorithmscanbeusedforclassificationaswell(andviceversa).LogisticRegression(alsocalledLogitRegression)iscommonlyusedtoestimatetheprobabilitythataninstancebelongstoaparticularclass(e.g.,whatistheprobabilitythatthisemailisspam?).Iftheestimatedprobabilityisgreaterthan50%,thenthemodelpredictsthattheinstancebelongstothatclass(calledthepositiveclass,labeled“1”),orelseitpredictsthatitdoesnot(i.e.,itbelongstothenegativeclass,labeled“0”).Thismakesitabinaryclassifier.
EstimatingProbabilitiesSohowdoesitwork?JustlikeaLinearRegressionmodel,aLogisticRegressionmodelcomputesaweightedsumoftheinputfeatures(plusabiasterm),butinsteadofoutputtingtheresultdirectlyliketheLinearRegressionmodeldoes,itoutputsthelogisticofthisresult(seeEquation4-13).
Equation4-13.LogisticRegressionmodelestimatedprobability(vectorizedform)
Thelogistic—alsocalledthelogit,notedσ(·)—isasigmoidfunction(i.e.,S-shaped)thatoutputsanumberbetween0and1.ItisdefinedasshowninEquation4-14andFigure4-21.
Equation4-14.Logisticfunction
Figure4-21.Logisticfunction
OncetheLogisticRegressionmodelhasestimatedtheprobability =hθ(x)thataninstancexbelongstothepositiveclass,itcanmakeitspredictionŷeasily(seeEquation4-15).
Equation4-15.LogisticRegressionmodelprediction
Noticethatσ(t)<0.5whent<0,andσ(t)≥0.5whent≥0,soaLogisticRegressionmodelpredicts1if
θT·xispositive,and0ifitisnegative.
TrainingandCostFunctionGood,nowyouknowhowaLogisticRegressionmodelestimatesprobabilitiesandmakespredictions.Buthowisittrained?Theobjectiveoftrainingistosettheparametervectorθsothatthemodelestimateshighprobabilitiesforpositiveinstances(y=1)andlowprobabilitiesfornegativeinstances(y=0).ThisideaiscapturedbythecostfunctionshowninEquation4-16forasingletraininginstancex.
Equation4-16.Costfunctionofasingletraininginstance
Thiscostfunctionmakessensebecause–log(t)growsverylargewhentapproaches0,sothecostwillbelargeifthemodelestimatesaprobabilitycloseto0forapositiveinstance,anditwillalsobeverylargeifthemodelestimatesaprobabilitycloseto1foranegativeinstance.Ontheotherhand,–log(t)iscloseto0whentiscloseto1,sothecostwillbecloseto0iftheestimatedprobabilityiscloseto0foranegativeinstanceorcloseto1forapositiveinstance,whichispreciselywhatwewant.
Thecostfunctionoverthewholetrainingsetissimplytheaveragecostoveralltraininginstances.Itcanbewritteninasingleexpression(asyoucanverifyeasily),calledthelogloss,showninEquation4-17.
Equation4-17.LogisticRegressioncostfunction(logloss)
Thebadnewsisthatthereisnoknownclosed-formequationtocomputethevalueofθthatminimizesthiscostfunction(thereisnoequivalentoftheNormalEquation).Butthegoodnewsisthatthiscostfunctionisconvex,soGradientDescent(oranyotheroptimizationalgorithm)isguaranteedtofindtheglobalminimum(ifthelearningrateisnottoolargeandyouwaitlongenough).ThepartialderivativesofthecostfunctionwithregardstothejthmodelparameterθjisgivenbyEquation4-18.
Equation4-18.Logisticcostfunctionpartialderivatives
ThisequationlooksverymuchlikeEquation4-5:foreachinstanceitcomputesthepredictionerrorandmultipliesitbythejthfeaturevalue,andthenitcomputestheaverageoveralltraininginstances.OnceyouhavethegradientvectorcontainingallthepartialderivativesyoucanuseitintheBatchGradientDescentalgorithm.That’sit:younowknowhowtotrainaLogisticRegressionmodel.ForStochasticGDyou
wouldofcoursejusttakeoneinstanceatatime,andforMini-batchGDyouwoulduseamini-batchatatime.
DecisionBoundariesLet’susetheirisdatasettoillustrateLogisticRegression.Thisisafamousdatasetthatcontainsthesepalandpetallengthandwidthof150irisflowersofthreedifferentspecies:Iris-Setosa,Iris-Versicolor,andIris-Virginica(seeFigure4-22).
Figure4-22.Flowersofthreeirisplantspecies16
Let’strytobuildaclassifiertodetecttheIris-Virginicatypebasedonlyonthepetalwidthfeature.Firstlet’sloadthedata:
>>>fromsklearnimportdatasets
>>>iris=datasets.load_iris()
>>>list(iris.keys())
['data','target_names','feature_names','target','DESCR']
>>>X=iris["data"][:,3:]#petalwidth
>>>y=(iris["target"]==2).astype(np.int)#1ifIris-Virginica,else0
Nowlet’strainaLogisticRegressionmodel:
fromsklearn.linear_modelimportLogisticRegression
log_reg=LogisticRegression()
log_reg.fit(X,y)
Let’slookatthemodel’sestimatedprobabilitiesforflowerswithpetalwidthsvaryingfrom0to3cm(Figure4-23):
X_new=np.linspace(0,3,1000).reshape(-1,1)
y_proba=log_reg.predict_proba(X_new)
plt.plot(X_new,y_proba[:,1],"g-",label="Iris-Virginica")
plt.plot(X_new,y_proba[:,0],"b--",label="NotIris-Virginica")
#+moreMatplotlibcodetomaketheimagelookpretty
Figure4-23.Estimatedprobabilitiesanddecisionboundary
ThepetalwidthofIris-Virginicaflowers(representedbytriangles)rangesfrom1.4cmto2.5cm,whiletheotheririsflowers(representedbysquares)generallyhaveasmallerpetalwidth,rangingfrom0.1cmto1.8cm.Noticethatthereisabitofoverlap.Aboveabout2cmtheclassifierishighlyconfidentthattheflowerisanIris-Virginica(itoutputsahighprobabilitytothatclass),whilebelow1cmitishighlyconfidentthatitisnotanIris-Virginica(highprobabilityforthe“NotIris-Virginica”class).Inbetweentheseextremes,theclassifierisunsure.However,ifyouaskittopredicttheclass(usingthepredict()methodratherthanthepredict_proba()method),itwillreturnwhicheverclassisthemostlikely.Therefore,thereisadecisionboundaryataround1.6cmwherebothprobabilitiesareequalto50%:ifthepetalwidthishigherthan1.6cm,theclassifierwillpredictthattheflowerisanIris-Virginica,orelseitwillpredictthatitisnot(evenifitisnotveryconfident):
>>>log_reg.predict([[1.7],[1.5]])
array([1,0])
Figure4-24showsthesamedatasetbutthistimedisplayingtwofeatures:petalwidthandlength.Oncetrained,theLogisticRegressionclassifiercanestimatetheprobabilitythatanewflowerisanIris-Virginicabasedonthesetwofeatures.Thedashedlinerepresentsthepointswherethemodelestimatesa50%probability:thisisthemodel’sdecisionboundary.Notethatitisalinearboundary.17Eachparallellinerepresentsthepointswherethemodeloutputsaspecificprobability,from15%(bottomleft)to90%(topright).Alltheflowersbeyondthetop-rightlinehaveanover90%chanceofbeingIris-Virginicaaccordingtothemodel.
Figure4-24.Lineardecisionboundary
Justliketheotherlinearmodels,LogisticRegressionmodelscanberegularizedusingℓ1orℓ2penalties.Scitkit-Learnactuallyaddsanℓ2penaltybydefault.
NOTEThehyperparametercontrollingtheregularizationstrengthofaScikit-LearnLogisticRegressionmodelisnotalpha(asinotherlinearmodels),butitsinverse:C.ThehigherthevalueofC,thelessthemodelisregularized.
SoftmaxRegressionTheLogisticRegressionmodelcanbegeneralizedtosupportmultipleclassesdirectly,withouthavingtotrainandcombinemultiplebinaryclassifiers(asdiscussedinChapter3).ThisiscalledSoftmaxRegression,orMultinomialLogisticRegression.
Theideaisquitesimple:whengivenaninstancex,theSoftmaxRegressionmodelfirstcomputesascoresk(x)foreachclassk,thenestimatestheprobabilityofeachclassbyapplyingthesoftmaxfunction(alsocalledthenormalizedexponential)tothescores.Theequationtocomputesk(x)shouldlookfamiliar,asitisjustliketheequationforLinearRegressionprediction(seeEquation4-19).
Equation4-19.Softmaxscoreforclassk
Notethateachclasshasitsowndedicatedparametervectorθ(k).AllthesevectorsaretypicallystoredasrowsinaparametermatrixΘ.
Onceyouhavecomputedthescoreofeveryclassfortheinstancex,youcanestimatetheprobability kthattheinstancebelongstoclasskbyrunningthescoresthroughthesoftmaxfunction(Equation4-20):itcomputestheexponentialofeveryscore,thennormalizesthem(dividingbythesumofalltheexponentials).
Equation4-20.Softmaxfunction
Kisthenumberofclasses.
s(x)isavectorcontainingthescoresofeachclassfortheinstancex.
σ(s(x))kistheestimatedprobabilitythattheinstancexbelongstoclasskgiventhescoresofeachclassforthatinstance.
JustliketheLogisticRegressionclassifier,theSoftmaxRegressionclassifierpredictstheclasswiththehighestestimatedprobability(whichissimplytheclasswiththehighestscore),asshowninEquation4-21.
Equation4-21.SoftmaxRegressionclassifierprediction
Theargmaxoperatorreturnsthevalueofavariablethatmaximizesafunction.Inthisequation,itreturnsthevalueofkthatmaximizestheestimatedprobabilityσ(s(x))k.
TIPTheSoftmaxRegressionclassifierpredictsonlyoneclassatatime(i.e.,itismulticlass,notmultioutput)soitshouldbeusedonlywithmutuallyexclusiveclassessuchasdifferenttypesofplants.Youcannotuseittorecognizemultiplepeopleinonepicture.
Nowthatyouknowhowthemodelestimatesprobabilitiesandmakespredictions,let’stakealookattraining.Theobjectiveistohaveamodelthatestimatesahighprobabilityforthetargetclass(andconsequentlyalowprobabilityfortheotherclasses).MinimizingthecostfunctionshowninEquation4-22,calledthecrossentropy,shouldleadtothisobjectivebecauseitpenalizesthemodelwhenitestimatesalowprobabilityforatargetclass.Crossentropyisfrequentlyusedtomeasurehowwellasetofestimatedclassprobabilitiesmatchthetargetclasses(wewilluseitagainseveraltimesinthefollowingchapters).
Equation4-22.Crossentropycostfunction
isequalto1ifthetargetclassfortheithinstanceisk;otherwise,itisequalto0.
Noticethatwhentherearejusttwoclasses(K=2),thiscostfunctionisequivalenttotheLogisticRegression’scostfunction(logloss;seeEquation4-17).
CROSSENTROPY
Crossentropyoriginatedfrominformationtheory.Supposeyouwanttoefficientlytransmitinformationabouttheweathereveryday.Ifthereareeightoptions(sunny,rainy,etc.),youcouldencodeeachoptionusing3bitssince23=8.However,ifyouthinkitwillbesunnyalmosteveryday,itwouldbemuchmoreefficienttocode“sunny”onjustonebit(0)andtheothersevenoptionson4bits(startingwitha1).Crossentropymeasurestheaveragenumberofbitsyouactuallysendperoption.Ifyourassumptionabouttheweatherisperfect,crossentropywilljustbeequaltotheentropyoftheweatheritself(i.e.,itsintrinsicunpredictability).Butifyourassumptionsarewrong(e.g.,ifitrainsoften),crossentropywillbegreaterbyanamountcalledtheKullback–Leiblerdivergence.
Thecrossentropybetweentwoprobabilitydistributionspandqisdefinedas (atleastwhenthedistributionsarediscrete).
Thegradientvectorofthiscostfunctionwithregardstoθ(k)isgivenbyEquation4-23:
Equation4-23.Crossentropygradientvectorforclassk
Nowyoucancomputethegradientvectorforeveryclass,thenuseGradientDescent(oranyotheroptimizationalgorithm)tofindtheparametermatrixΘthatminimizesthecostfunction.
Let’suseSoftmaxRegressiontoclassifytheirisflowersintoallthreeclasses.Scikit-Learn’sLogisticRegressionusesone-versus-allbydefaultwhenyoutrainitonmorethantwoclasses,butyoucansetthemulti_classhyperparameterto"multinomial"toswitchittoSoftmaxRegressioninstead.YoumustalsospecifyasolverthatsupportsSoftmaxRegression,suchasthe"lbfgs"solver(seeScikit-Learn’sdocumentationformoredetails).Italsoappliesℓ2regularizationbydefault,whichyoucancontrolusingthehyperparameterC.
X=iris["data"][:,(2,3)]#petallength,petalwidth
y=iris["target"]
softmax_reg=LogisticRegression(multi_class="multinomial",solver="lbfgs",C=10)
softmax_reg.fit(X,y)
Sothenexttimeyoufindaniriswith5cmlongand2cmwidepetals,youcanaskyourmodeltotellyouwhattypeofirisitis,anditwillanswerIris-Virginica(class2)with94.2%probability(orIris-Versicolorwith5.8%probability):
>>>softmax_reg.predict([[5,2]])
array([2])
>>>softmax_reg.predict_proba([[5,2]])
array([[6.33134078e-07,5.75276067e-02,9.42471760e-01]])
Figure4-25showstheresultingdecisionboundaries,representedbythebackgroundcolors.Noticethatthedecisionboundariesbetweenanytwoclassesarelinear.ThefigurealsoshowstheprobabilitiesfortheIris-Versicolorclass,representedbythecurvedlines(e.g.,thelinelabeledwith0.450representsthe45%probabilityboundary).Noticethatthemodelcanpredictaclassthathasanestimatedprobabilitybelow50%.Forexample,atthepointwherealldecisionboundariesmeet,allclasseshaveanequalestimatedprobabilityof33%.
Figure4-25.SoftmaxRegressiondecisionboundaries
Exercises1. WhatLinearRegressiontrainingalgorithmcanyouuseifyouhaveatrainingsetwithmillionsof
features?
2. Supposethefeaturesinyourtrainingsethaveverydifferentscales.Whatalgorithmsmightsufferfromthis,andhow?Whatcanyoudoaboutit?
3. CanGradientDescentgetstuckinalocalminimumwhentrainingaLogisticRegressionmodel?
4. DoallGradientDescentalgorithmsleadtothesamemodelprovidedyouletthemrunlongenough?
5. SupposeyouuseBatchGradientDescentandyouplotthevalidationerrorateveryepoch.Ifyounoticethatthevalidationerrorconsistentlygoesup,whatislikelygoingon?Howcanyoufixthis?
6. IsitagoodideatostopMini-batchGradientDescentimmediatelywhenthevalidationerrorgoesup?
7. WhichGradientDescentalgorithm(amongthosewediscussed)willreachthevicinityoftheoptimalsolutionthefastest?Whichwillactuallyconverge?Howcanyoumaketheothersconvergeaswell?
8. SupposeyouareusingPolynomialRegression.Youplotthelearningcurvesandyounoticethatthereisalargegapbetweenthetrainingerrorandthevalidationerror.Whatishappening?Whatarethreewaystosolvethis?
9. SupposeyouareusingRidgeRegressionandyounoticethatthetrainingerrorandthevalidationerrorarealmostequalandfairlyhigh.Wouldyousaythatthemodelsuffersfromhighbiasorhighvariance?Shouldyouincreasetheregularizationhyperparameterαorreduceit?
10. Whywouldyouwanttouse:RidgeRegressioninsteadofplainLinearRegression(i.e.,withoutanyregularization)?
LassoinsteadofRidgeRegression?
ElasticNetinsteadofLasso?
11. Supposeyouwanttoclassifypicturesasoutdoor/indooranddaytime/nighttime.ShouldyouimplementtwoLogisticRegressionclassifiersoroneSoftmaxRegressionclassifier?
12. ImplementBatchGradientDescentwithearlystoppingforSoftmaxRegression(withoutusingScikit-Learn).
SolutionstotheseexercisesareavailableinAppendixA.
Itisoftenthecasethatalearningalgorithmwilltrytooptimizeadifferentfunctionthantheperformancemeasureusedtoevaluatethefinalmodel.Thisisgenerallybecausethatfunctioniseasiertocompute,becauseithasusefuldifferentiationpropertiesthattheperformancemeasurelacks,orbecausewewanttoconstrainthemodelduringtraining,aswewillseewhenwediscussregularization.
1
Thedemonstrationthatthisreturnsthevalueofθthatminimizesthecostfunctionisoutsidethescopeofthisbook.
NotethatScikit-Learnseparatesthebiasterm(intercept_)fromthefeatureweights(coef_).
Technicallyspeaking,itsderivativeisLipschitzcontinuous.
Sincefeature1issmaller,ittakesalargerchangeinθ1toaffectthecostfunction,whichiswhythebowliselongatedalongtheθ1axis.
Eta(η)isthe7
letteroftheGreekalphabet.
Out-of-corealgorithmsarediscussedinChapter1.
WhiletheNormalEquationcanonlyperformLinearRegression,theGradientDescentalgorithmscanbeusedtotrainmanyothermodels,aswewillsee.
Aquadraticequationisoftheformy=ax
+bx+c.
Thisnotionofbiasisnottobeconfusedwiththebiastermoflinearmodels.
ItiscommontousethenotationJ(θ)forcostfunctionsthatdon’thaveashortname;wewilloftenusethisnotationthroughouttherestofthisbook.Thecontextwillmakeitclearwhichcostfunctionisbeingdiscussed.
NormsarediscussedinChapter2.
Asquarematrixfullof0sexceptfor1sonthemaindiagonal(top-lefttobottom-right).
AlternativelyyoucanusetheRidgeclasswiththe"sag"solver.StochasticAverageGDisavariantofSGD.Formoredetails,seethepresentation“MinimizingFiniteSumswiththeStochasticAverageGradientAlgorithm”byMarkSchmidtetal.fromtheUniversityofBritishColumbia.
Youcanthinkofasubgradientvectoratanondifferentiablepointasanintermediatevectorbetweenthegradientvectorsaroundthatpoint.
PhotosreproducedfromthecorrespondingWikipediapages.Iris-VirginicaphotobyFrankMayfield(CreativeCommonsBY-SA2.0),Iris-VersicolorphotobyD.GordonE.Robertson(CreativeCommonsBY-SA3.0),andIris-Setosaphotoispublicdomain.
Itisthethesetofpointsxsuchthatθ0+θ1x1+θ2x2=0,whichdefinesastraightline.
2
3
4
5
6
th
7
8
9
2
10
11
12
13
14
15
16
17
Chapter5.SupportVectorMachines
ASupportVectorMachine(SVM)isaverypowerfulandversatileMachineLearningmodel,capableofperforminglinearornonlinearclassification,regression,andevenoutlierdetection.ItisoneofthemostpopularmodelsinMachineLearning,andanyoneinterestedinMachineLearningshouldhaveitintheirtoolbox.SVMsareparticularlywellsuitedforclassificationofcomplexbutsmall-ormedium-sizeddatasets.
ThischapterwillexplainthecoreconceptsofSVMs,howtousethem,andhowtheywork.
LinearSVMClassificationThefundamentalideabehindSVMsisbestexplainedwithsomepictures.Figure5-1showspartoftheirisdatasetthatwasintroducedattheendofChapter4.Thetwoclassescanclearlybeseparatedeasilywithastraightline(theyarelinearlyseparable).Theleftplotshowsthedecisionboundariesofthreepossiblelinearclassifiers.Themodelwhosedecisionboundaryisrepresentedbythedashedlineissobadthatitdoesnotevenseparatetheclassesproperly.Theothertwomodelsworkperfectlyonthistrainingset,buttheirdecisionboundariescomesoclosetotheinstancesthatthesemodelswillprobablynotperformaswellonnewinstances.Incontrast,thesolidlineintheplotontherightrepresentsthedecisionboundaryofanSVMclassifier;thislinenotonlyseparatesthetwoclassesbutalsostaysasfarawayfromtheclosesttraininginstancesaspossible.YoucanthinkofanSVMclassifierasfittingthewidestpossiblestreet(representedbytheparalleldashedlines)betweentheclasses.Thisiscalledlargemarginclassification.
Figure5-1.Largemarginclassification
Noticethataddingmoretraininginstances“offthestreet”willnotaffectthedecisionboundaryatall:itisfullydetermined(or“supported”)bytheinstanceslocatedontheedgeofthestreet.Theseinstancesarecalledthesupportvectors(theyarecircledinFigure5-1).
WARNINGSVMsaresensitivetothefeaturescales,asyoucanseeinFigure5-2:ontheleftplot,theverticalscaleismuchlargerthanthehorizontalscale,sothewidestpossiblestreetisclosetohorizontal.Afterfeaturescaling(e.g.,usingScikit-Learn’sStandardScaler),thedecisionboundarylooksmuchbetter(ontherightplot).
Figure5-2.Sensitivitytofeaturescales
SoftMarginClassificationIfwestrictlyimposethatallinstancesbeoffthestreetandontherightside,thisiscalledhardmarginclassification.Therearetwomainissueswithhardmarginclassification.First,itonlyworksifthedataislinearlyseparable,andseconditisquitesensitivetooutliers.Figure5-3showstheirisdatasetwithjustoneadditionaloutlier:ontheleft,itisimpossibletofindahardmargin,andontherightthedecisionboundaryendsupverydifferentfromtheonewesawinFigure5-1withouttheoutlier,anditwillprobablynotgeneralizeaswell.
Figure5-3.Hardmarginsensitivitytooutliers
Toavoidtheseissuesitispreferabletouseamoreflexiblemodel.Theobjectiveistofindagoodbalancebetweenkeepingthestreetaslargeaspossibleandlimitingthemarginviolations(i.e.,instancesthatendupinthemiddleofthestreetorevenonthewrongside).Thisiscalledsoftmarginclassification.
InScikit-Learn’sSVMclasses,youcancontrolthisbalanceusingtheChyperparameter:asmallerCvalueleadstoawiderstreetbutmoremarginviolations.Figure5-4showsthedecisionboundariesandmarginsoftwosoftmarginSVMclassifiersonanonlinearlyseparabledataset.Ontheleft,usingahighCvaluetheclassifiermakesfewermarginviolationsbutendsupwithasmallermargin.Ontheright,usingalowCvaluethemarginismuchlarger,butmanyinstancesenduponthestreet.However,itseemslikelythatthesecondclassifierwillgeneralizebetter:infactevenonthistrainingsetitmakesfewerpredictionerrors,sincemostofthemarginviolationsareactuallyonthecorrectsideofthedecisionboundary.
Figure5-4.Fewermarginviolationsversuslargemargin
TIPIfyourSVMmodelisoverfitting,youcantryregularizingitbyreducingC.
ThefollowingScikit-Learncodeloadstheirisdataset,scalesthefeatures,andthentrainsalinearSVMmodel(usingtheLinearSVCclasswithC=0.1andthehingelossfunction,describedshortly)todetectIris-Virginicaflowers.TheresultingmodelisrepresentedontherightofFigure5-4.
importnumpyasnp
fromsklearnimportdatasets
fromsklearn.pipelineimportPipeline
fromsklearn.preprocessingimportStandardScaler
fromsklearn.svmimportLinearSVC
iris=datasets.load_iris()
X=iris["data"][:,(2,3)]#petallength,petalwidth
y=(iris["target"]==2).astype(np.float64)#Iris-Virginica
svm_clf=Pipeline((
("scaler",StandardScaler()),
("linear_svc",LinearSVC(C=1,loss="hinge")),
))
svm_clf.fit(X,y)
Then,asusual,youcanusethemodeltomakepredictions:
>>>svm_clf.predict([[5.5,1.7]])
array([1.])
NOTEUnlikeLogisticRegressionclassifiers,SVMclassifiersdonotoutputprobabilitiesforeachclass.
Alternatively,youcouldusetheSVCclass,usingSVC(kernel="linear",C=1),butitismuchslower,especiallywithlargetrainingsets,soitisnotrecommended.AnotheroptionistousetheSGDClassifierclass,withSGDClassifier(loss="hinge",alpha=1/(m*C)).ThisappliesregularStochasticGradientDescent(seeChapter4)totrainalinearSVMclassifier.ItdoesnotconvergeasfastastheLinearSVCclass,butitcanbeusefultohandlehugedatasetsthatdonotfitinmemory(out-of-coretraining),ortohandleonlineclassificationtasks.
TIPTheLinearSVCclassregularizesthebiasterm,soyoushouldcenterthetrainingsetfirstbysubtractingitsmean.ThisisautomaticifyouscalethedatausingtheStandardScaler.Moreover,makesureyousetthelosshyperparameterto"hinge",asitisnotthedefaultvalue.Finally,forbetterperformanceyoushouldsetthedualhyperparametertoFalse,unlesstherearemorefeaturesthantraininginstances(wewilldiscussdualitylaterinthechapter).
NonlinearSVMClassificationAlthoughlinearSVMclassifiersareefficientandworksurprisinglywellinmanycases,manydatasetsarenotevenclosetobeinglinearlyseparable.Oneapproachtohandlingnonlineardatasetsistoaddmorefeatures,suchaspolynomialfeatures(asyoudidinChapter4);insomecasesthiscanresultinalinearlyseparabledataset.ConsidertheleftplotinFigure5-5:itrepresentsasimpledatasetwithjustonefeaturex1.Thisdatasetisnotlinearlyseparable,asyoucansee.Butifyouaddasecondfeaturex2=(x1)2,theresulting2Ddatasetisperfectlylinearlyseparable.
Figure5-5.Addingfeaturestomakeadatasetlinearlyseparable
ToimplementthisideausingScikit-Learn,youcancreateaPipelinecontainingaPolynomialFeaturestransformer(discussedin“PolynomialRegression”),followedbyaStandardScalerandaLinearSVC.Let’stestthisonthemoonsdataset(seeFigure5-6):
fromsklearn.datasetsimportmake_moons
fromsklearn.pipelineimportPipeline
fromsklearn.preprocessingimportPolynomialFeatures
polynomial_svm_clf=Pipeline((
("poly_features",PolynomialFeatures(degree=3)),
("scaler",StandardScaler()),
("svm_clf",LinearSVC(C=10,loss="hinge"))
))
polynomial_svm_clf.fit(X,y)
Figure5-6.LinearSVMclassifierusingpolynomialfeatures
PolynomialKernelAddingpolynomialfeaturesissimpletoimplementandcanworkgreatwithallsortsofMachineLearningalgorithms(notjustSVMs),butatalowpolynomialdegreeitcannotdealwithverycomplexdatasets,andwithahighpolynomialdegreeitcreatesahugenumberoffeatures,makingthemodeltooslow.
Fortunately,whenusingSVMsyoucanapplyanalmostmiraculousmathematicaltechniquecalledthekerneltrick(itisexplainedinamoment).Itmakesitpossibletogetthesameresultasifyouaddedmanypolynomialfeatures,evenwithveryhigh-degreepolynomials,withoutactuallyhavingtoaddthem.Sothereisnocombinatorialexplosionofthenumberoffeaturessinceyoudon’tactuallyaddanyfeatures.ThistrickisimplementedbytheSVCclass.Let’stestitonthemoonsdataset:
fromsklearn.svmimportSVC
poly_kernel_svm_clf=Pipeline((
("scaler",StandardScaler()),
("svm_clf",SVC(kernel="poly",degree=3,coef0=1,C=5))
))
poly_kernel_svm_clf.fit(X,y)
ThiscodetrainsanSVMclassifierusinga3rd-degreepolynomialkernel.ItisrepresentedontheleftofFigure5-7.OntherightisanotherSVMclassifierusinga10th-degreepolynomialkernel.Obviously,ifyourmodelisoverfitting,youmightwanttoreducethepolynomialdegree.Conversely,ifitisunderfitting,youcantryincreasingit.Thehyperparametercoef0controlshowmuchthemodelisinfluencedbyhigh-degreepolynomialsversuslow-degreepolynomials.
Figure5-7.SVMclassifierswithapolynomialkernel
TIPAcommonapproachtofindtherighthyperparametervaluesistousegridsearch(seeChapter2).Itisoftenfastertofirstdoaverycoarsegridsearch,thenafinergridsearcharoundthebestvaluesfound.Havingagoodsenseofwhateachhyperparameteractuallydoescanalsohelpyousearchintherightpartofthehyperparameterspace.
AddingSimilarityFeaturesAnothertechniquetotacklenonlinearproblemsistoaddfeaturescomputedusingasimilarityfunctionthatmeasureshowmucheachinstanceresemblesaparticularlandmark.Forexample,let’staketheone-dimensionaldatasetdiscussedearlierandaddtwolandmarkstoitatx1=–2andx1=1(seetheleftplotinFigure5-8).Next,let’sdefinethesimilarityfunctiontobetheGaussianRadialBasisFunction(RBF)withγ=0.3(seeEquation5-1).
Equation5-1.GaussianRBF
Itisabell-shapedfunctionvaryingfrom0(veryfarawayfromthelandmark)to1(atthelandmark).Nowwearereadytocomputethenewfeatures.Forexample,let’slookattheinstancex1=–1:itislocatedatadistanceof1fromthefirstlandmark,and2fromthesecondlandmark.Thereforeitsnewfeaturesarex2=exp(–0.3×12)≈0.74andx3=exp(–0.3×22)≈0.30.TheplotontherightofFigure5-8showsthetransformeddataset(droppingtheoriginalfeatures).Asyoucansee,itisnowlinearlyseparable.
Figure5-8.SimilarityfeaturesusingtheGaussianRBF
Youmaywonderhowtoselectthelandmarks.Thesimplestapproachistocreatealandmarkatthelocationofeachandeveryinstanceinthedataset.Thiscreatesmanydimensionsandthusincreasesthechancesthatthetransformedtrainingsetwillbelinearlyseparable.Thedownsideisthatatrainingsetwithminstancesandnfeaturesgetstransformedintoatrainingsetwithminstancesandmfeatures(assumingyoudroptheoriginalfeatures).Ifyourtrainingsetisverylarge,youendupwithanequallylargenumberoffeatures.
GaussianRBFKernelJustlikethepolynomialfeaturesmethod,thesimilarityfeaturesmethodcanbeusefulwithanyMachineLearningalgorithm,butitmaybecomputationallyexpensivetocomputealltheadditionalfeatures,especiallyonlargetrainingsets.However,onceagainthekerneltrickdoesitsSVMmagic:itmakesitpossibletoobtainasimilarresultasifyouhadaddedmanysimilarityfeatures,withoutactuallyhavingtoaddthem.Let’strytheGaussianRBFkernelusingtheSVCclass:
rbf_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="rbf", gamma=5, C=0.001))
])
rbf_kernel_svm_clf.fit(X, y)
ThismodelisrepresentedonthebottomleftofFigure5-9.Theotherplotsshowmodelstrainedwithdifferentvaluesofhyperparametersgamma(γ)andC.Increasinggammamakesthebell-shapecurvenarrower(seetheleftplotofFigure5-8),andasaresulteachinstance’srangeofinfluenceissmaller:thedecisionboundaryendsupbeingmoreirregular,wigglingaroundindividualinstances.Conversely,asmallgammavaluemakesthebell-shapedcurvewider,soinstanceshavealargerrangeofinfluence,andthedecisionboundaryendsupsmoother.Soγactslikearegularizationhyperparameter:ifyourmodelisoverfitting,youshouldreduceit,andifitisunderfitting,youshouldincreaseit(similartotheChyperparameter).
Figure5-9.SVMclassifiersusinganRBFkernel
Otherkernelsexistbutareusedmuchmorerarely.Forexample,somekernelsarespecializedforspecificdatastructures.StringkernelsaresometimesusedwhenclassifyingtextdocumentsorDNAsequences(e.g.,usingthestringsubsequencekernelorkernelsbasedontheLevenshteindistance).
TIPWithsomanykernelstochoosefrom,howcanyoudecidewhichonetouse?Asaruleofthumb,youshouldalwaystrythelinearkernelfirst(rememberthatLinearSVCismuchfasterthanSVC(kernel="linear")),especiallyifthetrainingsetisverylargeorifithasplentyoffeatures.Ifthetrainingsetisnottoolarge,youshouldtrytheGaussianRBFkernelaswell;itworkswellinmostcases.Thenifyouhavesparetimeandcomputingpower,youcanalsoexperimentwithafewotherkernelsusingcross-validationandgridsearch,especiallyiftherearekernelsspecializedforyourtrainingset’sdatastructure.
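As a rough illustration of that coarse-then-fine strategy (the parameter values below are arbitrary, not recommendations), you could run something like this on the moons dataset from earlier, then repeat with a finer grid around the best values found:

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# coarse grid first; a finer grid around the best values would follow
param_grid = {"gamma": [0.01, 0.1, 1, 10], "C": [0.1, 1, 10, 100]}
grid_search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3)
grid_search.fit(X, y)
print(grid_search.best_params_)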
ComputationalComplexityTheLinearSVCclassisbasedontheliblinearlibrary,whichimplementsanoptimizedalgorithmforlinearSVMs.1Itdoesnotsupportthekerneltrick,butitscalesalmostlinearlywiththenumberoftraininginstancesandthenumberoffeatures:itstrainingtimecomplexityisroughlyO(m×n).
Thealgorithmtakeslongerifyourequireaveryhighprecision.Thisiscontrolledbythetolerancehyperparameterϵ(calledtolinScikit-Learn).Inmostclassificationtasks,thedefaulttoleranceisfine.
TheSVCclassisbasedonthelibsvmlibrary,whichimplementsanalgorithmthatsupportsthekerneltrick.2ThetrainingtimecomplexityisusuallybetweenO(m2×n)andO(m3×n).Unfortunately,thismeansthatitgetsdreadfullyslowwhenthenumberoftraininginstancesgetslarge(e.g.,hundredsofthousandsofinstances).Thisalgorithmisperfectforcomplexbutsmallormediumtrainingsets.However,itscaleswellwiththenumberoffeatures,especiallywithsparsefeatures(i.e.,wheneachinstancehasfewnonzerofeatures).Inthiscase,thealgorithmscalesroughlywiththeaveragenumberofnonzerofeaturesperinstance.Table5-1comparesScikit-Learn’sSVMclassificationclasses.
Table5-1.ComparisonofScikit-LearnclassesforSVMclassification
Class Timecomplexity Out-of-coresupport Scalingrequired Kerneltrick
LinearSVC O(m×n) No Yes No
SGDClassifier O(m×n) Yes Yes No
SVC O(m²×n)toO(m³×n) No Yes Yes
SVMRegressionAswementionedearlier,theSVMalgorithmisquiteversatile:notonlydoesitsupportlinearandnonlinearclassification,butitalsosupportslinearandnonlinearregression.Thetrickistoreversetheobjective:insteadoftryingtofitthelargestpossiblestreetbetweentwoclasseswhilelimitingmarginviolations,SVMRegressiontriestofitasmanyinstancesaspossibleonthestreetwhilelimitingmarginviolations(i.e.,instancesoffthestreet).Thewidthofthestreetiscontrolledbyahyperparameterϵ.Figure5-10showstwolinearSVMRegressionmodelstrainedonsomerandomlineardata,onewithalargemargin(ϵ=1.5)andtheotherwithasmallmargin(ϵ=0.5).
Figure5-10.SVMRegression
Addingmoretraininginstanceswithinthemargindoesnotaffectthemodel’spredictions;thus,themodelissaidtobeϵ-insensitive.
YoucanuseScikit-Learn’sLinearSVRclasstoperformlinearSVMRegression.ThefollowingcodeproducesthemodelrepresentedontheleftofFigure5-10(thetrainingdatashouldbescaledandcenteredfirst):
from sklearn.svm import LinearSVR

svm_reg = LinearSVR(epsilon=1.5)
svm_reg.fit(X, y)
Totacklenonlinearregressiontasks,youcanuseakernelizedSVMmodel.Forexample,Figure5-11showsSVMRegressiononarandomquadratictrainingset,usinga2nd-degreepolynomialkernel.Thereislittleregularizationontheleftplot(i.e.,alargeCvalue),andmuchmoreregularizationontherightplot(i.e.,asmallCvalue).
Figure5-11.SVMregressionusinga2nd-degreepolynomialkernel
ThefollowingcodeproducesthemodelrepresentedontheleftofFigure5-11usingScikit-Learn’sSVRclass(whichsupportsthekerneltrick).TheSVRclassistheregressionequivalentoftheSVCclass,andtheLinearSVRclassistheregressionequivalentoftheLinearSVCclass.TheLinearSVRclassscaleslinearlywiththesizeofthetrainingset(justliketheLinearSVCclass),whiletheSVRclassgetsmuchtooslowwhenthetrainingsetgrowslarge(justliketheSVCclass).
from sklearn.svm import SVR

svm_poly_reg = SVR(kernel="poly", degree=2, C=100, epsilon=0.1)
svm_poly_reg.fit(X, y)
NOTESVMscanalsobeusedforoutlierdetection;seeScikit-Learn’sdocumentationformoredetails.
UndertheHoodThissectionexplainshowSVMsmakepredictionsandhowtheirtrainingalgorithmswork,startingwithlinearSVMclassifiers.YoucansafelyskipitandgostraighttotheexercisesattheendofthischapterifyouarejustgettingstartedwithMachineLearning,andcomebacklaterwhenyouwanttogetadeeperunderstandingofSVMs.
First,awordaboutnotations:inChapter4weusedtheconventionofputtingallthemodelparametersinonevectorθ,includingthebiastermθ0andtheinputfeatureweightsθ1toθn,andaddingabiasinputx0=1toallinstances.Inthischapter,wewilluseadifferentconvention,whichismoreconvenient(andmorecommon)whenyouaredealingwithSVMs:thebiastermwillbecalledbandthefeatureweightsvectorwillbecalledw.Nobiasfeaturewillbeaddedtotheinputfeaturevectors.
Decision Function and Predictions
The linear SVM classifier model predicts the class of a new instance x by simply computing the decision function wT·x + b = w1 x1 + ⋯ + wn xn + b: if the result is positive, the predicted class ŷ is the positive class (1), or else it is the negative class (0); see Equation 5-2.
Equation 5-2. Linear SVM classifier prediction
ŷ = 0 if wT·x + b < 0,  ŷ = 1 if wT·x + b ≥ 0
Figure5-12showsthedecisionfunctionthatcorrespondstothemodelontherightofFigure5-4:itisatwo-dimensionalplanesincethisdatasethastwofeatures(petalwidthandpetallength).Thedecisionboundaryisthesetofpointswherethedecisionfunctionisequalto0:itistheintersectionoftwoplanes,whichisastraightline(representedbythethicksolidline).3
Figure5-12.Decisionfunctionfortheirisdataset
Thedashedlinesrepresentthepointswherethedecisionfunctionisequalto1or–1:theyareparallelandatequaldistancetothedecisionboundary,formingamarginaroundit.TrainingalinearSVMclassifiermeansfindingthevalueofwandbthatmakethismarginaswideaspossiblewhileavoidingmarginviolations(hardmargin)orlimitingthem(softmargin).
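As a quick sketch of Equation 5-2 in code (assuming a LinearSVC trained on petal length and width, as in Figure 5-4, is available as svm_clf; the input values are arbitrary):

import numpy as np

w = svm_clf.coef_[0]          # feature weights vector w
b = svm_clf.intercept_[0]     # bias term b
x_new = np.array([5.5, 1.7])  # petal length, petal width

decision = w @ x_new + b
y_hat = int(decision >= 0)    # positive class (1) if the decision function is >= 0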
Training Objective
Consider the slope of the decision function: it is equal to the norm of the weight vector, ∥w∥. If we divide this slope by 2, the points where the decision function is equal to ±1 are going to be twice as far away from the decision boundary. In other words, dividing the slope by 2 will multiply the margin by 2. Perhaps this is easier to visualize in 2D in Figure 5-13. The smaller the weight vector w, the larger the margin.
Figure5-13.Asmallerweightvectorresultsinalargermargin
So we want to minimize ∥w∥ to get a large margin. However, if we also want to avoid any margin violation (hard margin), then we need the decision function to be greater than 1 for all positive training instances, and lower than –1 for negative training instances. If we define t(i) = –1 for negative instances (if y(i) = 0) and t(i) = 1 for positive instances (if y(i) = 1), then we can express this constraint as t(i)(wT·x(i) + b) ≥ 1 for all instances.
WecanthereforeexpressthehardmarginlinearSVMclassifierobjectiveastheconstrainedoptimizationprobleminEquation5-3.
Equation 5-3. Hard margin linear SVM classifier objective
minimize over (w, b):  ½ wT·w
subject to:  t(i)(wT·x(i) + b) ≥ 1  for i = 1, 2, ⋯, m
NOTE
We are minimizing ½ wT·w, which is equal to ½ ∥w∥², rather than minimizing ∥w∥. This is because it will give the same result (since the values of w and b that minimize a value also minimize half of its square), but ½ ∥w∥² has a nice and simple derivative (it is just w), while ∥w∥ is not differentiable at w = 0. Optimization algorithms work much better on differentiable functions.
To get the soft margin objective, we need to introduce a slack variable ζ(i) ≥ 0 for each instance:4 ζ(i) measures how much the ith instance is allowed to violate the margin. We now have two conflicting objectives: making the slack variables as small as possible to reduce the margin violations, and making ½ wT·w as small as possible to increase the margin. This is where the C hyperparameter comes in: it allows us to define the tradeoff between these two objectives. This gives us the constrained optimization problem in Equation 5-4.
Equation 5-4. Soft margin linear SVM classifier objective
minimize over (w, b, ζ):  ½ wT·w + C Σi=1..m ζ(i)
subject to:  t(i)(wT·x(i) + b) ≥ 1 – ζ(i)  and  ζ(i) ≥ 0  for i = 1, 2, ⋯, m
QuadraticProgrammingThehardmarginandsoftmarginproblemsarebothconvexquadraticoptimizationproblemswithlinearconstraints.SuchproblemsareknownasQuadraticProgramming(QP)problems.Manyoff-the-shelfsolversareavailabletosolveQPproblemsusingavarietyoftechniquesthatareoutsidethescopeofthisbook.5ThegeneralproblemformulationisgivenbyEquation5-5.
Equation5-5.QuadraticProgrammingproblem
Note that the expression A·p ≤ b actually defines nc constraints: pT·a(i) ≤ b(i) for i = 1, 2, ⋯, nc, where a(i) is the vector containing the elements of the ith row of A and b(i) is the ith element of b.
YoucaneasilyverifythatifyousettheQPparametersinthefollowingway,yougetthehardmarginlinearSVMclassifierobjective:
np = n + 1, where n is the number of features (the +1 is for the bias term).
nc = m, where m is the number of training instances.
H is the np × np identity matrix, except with a zero in the top-left cell (to ignore the bias term).
f = 0, an np-dimensional vector full of 0s.
b = –1, an nc-dimensional vector full of –1s (so that each constraint pT·a(i) ≤ –1 encodes t(i)(wT·x(i) + b) ≥ 1).
a(i) = –t(i) ẋ(i), where ẋ(i) is equal to x(i) with an extra bias feature ẋ0 = 1.
So one way to train a hard margin linear SVM classifier is just to use an off-the-shelf QP solver by passing it the preceding parameters. The resulting vector p will contain the bias term b = p0 and the feature weights wi = pi for i = 1, 2, ⋯, n. Similarly, you can use a QP solver to solve the soft margin problem (see the exercises at the end of the chapter).
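As a rough sketch (not from the book) of what that looks like in code, here is the hard margin problem handed to one possible off-the-shelf solver, cvxopt; it assumes the dataset is linearly separable and the classes are labeled 0 and 1:

import numpy as np
from cvxopt import matrix, solvers

def hard_margin_linear_svm(X, y):
    m, n = X.shape
    t = (y * 2 - 1).astype(float)      # class targets: -1 for y=0, +1 for y=1
    X_dot = np.c_[np.ones(m), X]       # add the bias feature x0 = 1
    H = np.eye(n + 1)
    H[0, 0] = 0                        # ignore the bias term in the objective
    f = np.zeros(n + 1)
    A = -t.reshape(-1, 1) * X_dot      # a(i) = -t(i) * x_dot(i)
    b = -np.ones(m)                    # A p <= -1 encodes t(i)(w.x(i) + b) >= 1
    sol = solvers.qp(matrix(H), matrix(f), matrix(A), matrix(b))
    p = np.array(sol["x"]).ravel()
    return p[0], p[1:]                 # bias term and feature weights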
However,tousethekerneltrickwearegoingtolookatadifferentconstrainedoptimizationproblem.
TheDualProblemGivenaconstrainedoptimizationproblem,knownastheprimalproblem,itispossibletoexpressadifferentbutcloselyrelatedproblem,calleditsdualproblem.Thesolutiontothedualproblemtypicallygivesalowerboundtothesolutionoftheprimalproblem,butundersomeconditionsitcanevenhavethesamesolutionsastheprimalproblem.Luckily,theSVMproblemhappenstomeettheseconditions,6soyoucanchoosetosolvetheprimalproblemorthedualproblem;bothwillhavethesamesolution.Equation5-6showsthedualformofthelinearSVMobjective(ifyouareinterestedinknowinghowtoderivethedualproblemfromtheprimalproblem,seeAppendixC).
Equation5-6.DualformofthelinearSVMobjective
Once you find the vector α̂ that minimizes this equation (using a QP solver), you can compute the ŵ and b̂ that minimize the primal problem by using Equation 5-7.
Equation5-7.Fromthedualsolutiontotheprimalsolution
Thedualproblemisfastertosolvethantheprimalwhenthenumberoftraininginstancesissmallerthanthenumberoffeatures.Moreimportantly,itmakesthekerneltrickpossible,whiletheprimaldoesnot.Sowhatisthiskerneltrickanyway?
KernelizedSVMSupposeyouwanttoapplya2nd-degreepolynomialtransformationtoatwo-dimensionaltrainingset(suchasthemoonstrainingset),thentrainalinearSVMclassifieronthetransformedtrainingset.Equation5-8showsthe2nd-degreepolynomialmappingfunctionϕthatyouwanttoapply.
Equation5-8.Second-degreepolynomialmapping
Noticethatthetransformedvectoristhree-dimensionalinsteadoftwo-dimensional.Nowlet’slookatwhathappenstoacoupleoftwo-dimensionalvectors,aandb,ifweapplythis2nd-degreepolynomialmappingandthencomputethedotproductofthetransformedvectors(SeeEquation5-9).
Equation5-9.Kerneltrickfora2nd-degreepolynomialmapping
Howaboutthat?Thedotproductofthetransformedvectorsisequaltothesquareofthedotproductoftheoriginalvectors:ϕ(a)T·ϕ(b)=(aT·b)2.
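You can verify this equality numerically; the small check below is just an illustration of Equations 5-8 and 5-9, with arbitrary input vectors:

import numpy as np

def phi(x):
    # the 2nd-degree polynomial mapping of Equation 5-8
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

a = np.array([2.0, 3.0])
b = np.array([4.0, 5.0])
print(phi(a) @ phi(b))   # 529.0
print((a @ b) ** 2)      # 529.0 as well: phi(a).phi(b) == (a.b)^2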
Now here is the key insight: if you apply the transformation ϕ to all training instances, then the dual problem (see Equation 5-6) will contain the dot product ϕ(x(i))T·ϕ(x(j)). But if ϕ is the 2nd-degree polynomial transformation defined in Equation 5-8, then you can replace this dot product of transformed vectors simply by (x(i)T·x(j))². So you don't actually need to transform the training instances at all: just replace the dot product by its square in Equation 5-6. The result will be strictly the same as if you went through the trouble of actually transforming the training set then fitting a linear SVM algorithm, but this trick makes the whole process much more computationally efficient. This is the essence of the kernel trick.
ThefunctionK(a,b)=(aT·b)2iscalleda2nd-degreepolynomialkernel.InMachineLearning,akernelisafunctioncapableofcomputingthedotproductϕ(a)T·ϕ(b)basedonlyontheoriginalvectorsaandb,
withouthavingtocompute(oreventoknowabout)thetransformationϕ.Equation5-10listssomeofthemostcommonlyusedkernels.
Equation5-10.Commonkernels
MERCER’STHEOREM
AccordingtoMercer’stheorem,ifafunctionK(a,b)respectsafewmathematicalconditionscalledMercer’sconditions(Kmustbecontinuous,symmetricinitsargumentssoK(a,b)=K(b,a),etc.),thenthereexistsafunctionϕthatmapsaandbintoanotherspace(possiblywithmuchhigherdimensions)suchthatK(a,b)=ϕ(a)T·ϕ(b).SoyoucanuseKasakernelsinceyouknowϕexists,evenifyoudon’tknowwhatϕis.InthecaseoftheGaussianRBFkernel,itcanbeshownthatϕactuallymapseachtraininginstancetoaninfinite-dimensionalspace,soit’sagoodthingyoudon’tneedtoactuallyperformthemapping!
Notethatsomefrequentlyusedkernels(suchastheSigmoidkernel)don’trespectallofMercer’sconditions,yettheygenerallyworkwellinpractice.
There is still one loose end we must tie. Equation 5-7 shows how to go from the dual solution to the primal solution in the case of a linear SVM classifier, but if you apply the kernel trick you end up with equations that include ϕ(x(i)). In fact, ŵ must have the same number of dimensions as ϕ(x(i)), which may be huge or even infinite, so you can't compute it. But how can you make predictions without knowing ŵ? Well, the good news is that you can plug in the formula for ŵ from Equation 5-7 into the decision function for a new instance x(n), and you get an equation with only dot products between input vectors. This makes it possible to use the kernel trick, once again (Equation 5-11).
Equation5-11.MakingpredictionswithakernelizedSVM
Note that since α(i) ≠ 0 only for support vectors, making predictions involves computing the dot product of the new input vector x(n) with only the support vectors, not all the training instances. Of course, you also need to compute the bias term b̂, using the same trick (Equation 5-12).
Equation5-12.Computingthebiastermusingthekerneltrick
If you are starting to get a headache, it's perfectly normal: it's an unfortunate side effect of the kernel trick.
OnlineSVMsBeforeconcludingthischapter,let’stakeaquicklookatonlineSVMclassifiers(recallthatonlinelearningmeanslearningincrementally,typicallyasnewinstancesarrive).
ForlinearSVMclassifiers,onemethodistouseGradientDescent(e.g.,usingSGDClassifier)tominimizethecostfunctioninEquation5-13,whichisderivedfromtheprimalproblem.UnfortunatelyitconvergesmuchmoreslowlythanthemethodsbasedonQP.
Equation 5-13. Linear SVM classifier cost function
J(w, b) = ½ wT·w + C Σi=1..m max(0, 1 – t(i)(wT·x(i) + b))
The first sum in the cost function will push the model to have a small weight vector w, leading to a larger margin. The second sum computes the total of all margin violations. An instance's margin violation is equal to 0 if it is located off the street and on the correct side, or else it is proportional to the distance to the correct side of the street. Minimizing this term ensures that the model makes the margin violations as small and as few as possible.
HINGELOSS
The function max(0, 1 – t) is called the hinge loss function (represented below). It is equal to 0 when t ≥ 1. Its derivative (slope) is equal to –1 if t < 1 and 0 if t > 1. It is not differentiable at t = 1, but just like for Lasso Regression (see "Lasso Regression") you can still use Gradient Descent using any subderivative at t = 1 (i.e., any value between –1 and 0).
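For example, here is a minimal sketch (hyperparameter values are illustrative) of minimizing this cost function incrementally on the moons dataset with SGDClassifier and the hinge loss:

import numpy as np
from sklearn.datasets import make_moons
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_moons(n_samples=1000, noise=0.15)
X = StandardScaler().fit_transform(X)    # SVMs are sensitive to feature scales

sgd_clf = SGDClassifier(loss="hinge", alpha=0.01)
for X_batch, y_batch in zip(np.array_split(X, 10), np.array_split(y, 10)):
    # each call to partial_fit() updates the linear SVM on one mini-batch
    sgd_clf.partial_fit(X_batch, y_batch, classes=np.unique(y))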
ItisalsopossibletoimplementonlinekernelizedSVMs—forexample,using“IncrementalandDecrementalSVMLearning”7or“FastKernelClassifierswithOnlineandActiveLearning.”8However,theseareimplementedinMatlabandC++.Forlarge-scalenonlinearproblems,youmaywanttoconsiderusingneuralnetworksinstead(seePartII).
Exercises1. WhatisthefundamentalideabehindSupportVectorMachines?
2. Whatisasupportvector?
3. WhyisitimportanttoscaletheinputswhenusingSVMs?
4. CananSVMclassifieroutputaconfidencescorewhenitclassifiesaninstance?Whataboutaprobability?
5. ShouldyouusetheprimalorthedualformoftheSVMproblemtotrainamodelonatrainingsetwithmillionsofinstancesandhundredsoffeatures?
6. SayyoutrainedanSVMclassifierwithanRBFkernel.Itseemstounderfitthetrainingset:shouldyouincreaseordecreaseγ(gamma)?WhataboutC?
7. HowshouldyousettheQPparameters(H,f,A,andb)tosolvethesoftmarginlinearSVMclassifierproblemusinganoff-the-shelfQPsolver?
8. TrainaLinearSVConalinearlyseparabledataset.ThentrainanSVCandaSGDClassifieronthesamedataset.Seeifyoucangetthemtoproduceroughlythesamemodel.
9. TrainanSVMclassifierontheMNISTdataset.SinceSVMclassifiersarebinaryclassifiers,youwillneedtouseone-versus-alltoclassifyall10digits.Youmaywanttotunethehyperparametersusingsmallvalidationsetstospeeduptheprocess.Whataccuracycanyoureach?
10. TrainanSVMregressorontheCaliforniahousingdataset.
SolutionstotheseexercisesareavailableinAppendixA.
“ADualCoordinateDescentMethodforLarge-scaleLinearSVM,”Linetal.(2008).
“SequentialMinimalOptimization(SMO),”J.Platt(1998).
Moregenerally,whentherearenfeatures,thedecisionfunctionisann-dimensionalhyperplane,andthedecisionboundaryisan(n–1)-dimensionalhyperplane.
Zeta (ζ) is the 8th letter of the Greek alphabet.
TolearnmoreaboutQuadraticProgramming,youcanstartbyreadingStephenBoydandLievenVandenberghe,ConvexOptimization(Cambridge,UK:CambridgeUniversityPress,2004)orwatchRichardBrown’sseriesofvideolectures.
Theobjectivefunctionisconvex,andtheinequalityconstraintsarecontinuouslydifferentiableandconvexfunctions.
“IncrementalandDecrementalSupportVectorMachineLearning,”G.Cauwenberghs,T.Poggio(2001).
“FastKernelClassifierswithOnlineandActiveLearning,“A.Bordes,S.Ertekin,J.Weston,L.Bottou(2005).
Chapter6.DecisionTrees
LikeSVMs,DecisionTreesareversatileMachineLearningalgorithmsthatcanperformbothclassificationandregressiontasks,andevenmultioutputtasks.Theyareverypowerfulalgorithms,capableoffittingcomplexdatasets.Forexample,inChapter2youtrainedaDecisionTreeRegressormodelontheCaliforniahousingdataset,fittingitperfectly(actuallyoverfittingit).
DecisionTreesarealsothefundamentalcomponentsofRandomForests(seeChapter7),whichareamongthemostpowerfulMachineLearningalgorithmsavailabletoday.
Inthischapterwewillstartbydiscussinghowtotrain,visualize,andmakepredictionswithDecisionTrees.ThenwewillgothroughtheCARTtrainingalgorithmusedbyScikit-Learn,andwewilldiscusshowtoregularizetreesandusethemforregressiontasks.Finally,wewilldiscusssomeofthelimitationsofDecisionTrees.
TrainingandVisualizingaDecisionTreeTounderstandDecisionTrees,let’sjustbuildoneandtakealookathowitmakespredictions.ThefollowingcodetrainsaDecisionTreeClassifierontheirisdataset(seeChapter4):
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data[:, 2:]  # petal length and width
y = iris.target

tree_clf = DecisionTreeClassifier(max_depth=2)
tree_clf.fit(X, y)
YoucanvisualizethetrainedDecisionTreebyfirstusingtheexport_graphviz()methodtooutputagraphdefinitionfilecallediris_tree.dot:
from sklearn.tree import export_graphviz

export_graphviz(
    tree_clf,
    out_file=image_path("iris_tree.dot"),  # image_path() builds the path where the .dot file is saved
    feature_names=iris.feature_names[2:],
    class_names=iris.target_names,
    rounded=True,
    filled=True
)
Thenyoucanconvertthis.dotfiletoavarietyofformatssuchasPDForPNGusingthedotcommand-linetoolfromthegraphvizpackage.1Thiscommandlineconvertsthe.dotfiletoa.pngimagefile:
$ dot -Tpng iris_tree.dot -o iris_tree.png
YourfirstdecisiontreelookslikeFigure6-1.
Figure6-1.IrisDecisionTree
MakingPredictionsLet’sseehowthetreerepresentedinFigure6-1makespredictions.Supposeyoufindanirisflowerandyouwanttoclassifyit.Youstartattherootnode(depth0,atthetop):thisnodeaskswhethertheflower’spetallengthissmallerthan2.45cm.Ifitis,thenyoumovedowntotheroot’sleftchildnode(depth1,left).Inthiscase,itisaleafnode(i.e.,itdoesnothaveanychildrennodes),soitdoesnotaskanyquestions:youcansimplylookatthepredictedclassforthatnodeandtheDecisionTreepredictsthatyourflowerisanIris-Setosa(class=setosa).
Nowsupposeyoufindanotherflower,butthistimethepetallengthisgreaterthan2.45cm.Youmustmovedowntotheroot’srightchildnode(depth1,right),whichisnotaleafnode,soitasksanotherquestion:isthepetalwidthsmallerthan1.75cm?Ifitis,thenyourflowerismostlikelyanIris-Versicolor(depth2,left).Ifnot,itislikelyanIris-Virginica(depth2,right).It’sreallythatsimple.
NOTEOneofthemanyqualitiesofDecisionTreesisthattheyrequireverylittledatapreparation.Inparticular,theydon’trequirefeaturescalingorcenteringatall.
Anode’ssamplesattributecountshowmanytraininginstancesitappliesto.Forexample,100traininginstanceshaveapetallengthgreaterthan2.45cm(depth1,right),amongwhich54haveapetalwidthsmallerthan1.75cm(depth2,left).Anode’svalueattributetellsyouhowmanytraininginstancesofeachclassthisnodeappliesto:forexample,thebottom-rightnodeappliesto0Iris-Setosa,1Iris-Versicolor,and45Iris-Virginica.Finally,anode’sginiattributemeasuresitsimpurity:anodeis“pure”(gini=0)ifalltraininginstancesitappliestobelongtothesameclass.Forexample,sincethedepth-1leftnodeappliesonlytoIris-Setosatraininginstances,itispureanditsginiscoreis0.Equation6-1showshowthetrainingalgorithmcomputestheginiscoreGioftheithnode.Forexample,thedepth-2leftnodehasaginiscoreequalto1–(0/54)2–(49/54)2–(5/54)2≈0.168.Anotherimpuritymeasureisdiscussedshortly.
Equation 6-1. Gini impurity
Gi = 1 – Σk=1..n (pi,k)²
pi,k is the ratio of class k instances among the training instances in the ith node.
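As a quick sanity check, here is that computation for the depth-2 left node (class counts taken from Figure 6-1):

import numpy as np

value = np.array([0, 49, 5])       # Iris-Setosa, Iris-Versicolor, Iris-Virginica counts
p = value / value.sum()            # class ratios p_i,k
gini = 1 - np.sum(p ** 2)
print(gini)                        # ~0.168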
NOTEScikit-LearnusestheCARTalgorithm,whichproducesonlybinarytrees:nonleafnodesalwayshavetwochildren(i.e.,questionsonlyhaveyes/noanswers).However,otheralgorithmssuchasID3canproduceDecisionTreeswithnodesthathavemorethantwochildren.
Figure6-2showsthisDecisionTree’sdecisionboundaries.Thethickverticallinerepresentsthedecisionboundaryoftherootnode(depth0):petallength=2.45cm.Sincetheleftareaispure(onlyIris-Setosa),itcannotbesplitanyfurther.However,therightareaisimpure,sothedepth-1rightnodesplitsitatpetalwidth=1.75cm(representedbythedashedline).Sincemax_depthwassetto2,theDecisionTreestopsrightthere.However,ifyousetmax_depthto3,thenthetwodepth-2nodeswouldeachaddanotherdecisionboundary(representedbythedottedlines).
Figure6-2.DecisionTreedecisionboundaries
MODELINTERPRETATION:WHITEBOXVERSUSBLACKBOX
AsyoucanseeDecisionTreesarefairlyintuitiveandtheirdecisionsareeasytointerpret.Suchmodelsareoftencalledwhiteboxmodels.Incontrast,aswewillsee,RandomForestsorneuralnetworksaregenerallyconsideredblackboxmodels.Theymakegreatpredictions,andyoucaneasilycheckthecalculationsthattheyperformedtomakethesepredictions;nevertheless,itisusuallyhardtoexplaininsimpletermswhythepredictionsweremade.Forexample,ifaneuralnetworksaysthataparticularpersonappearsonapicture,itishardtoknowwhatactuallycontributedtothisprediction:didthemodelrecognizethatperson’seyes?Hermouth?Hernose?Hershoes?Oreventhecouchthatshewassittingon?Conversely,DecisionTreesprovideniceandsimpleclassificationrulesthatcanevenbeappliedmanuallyifneedbe(e.g.,forflowerclassification).
EstimatingClassProbabilitiesADecisionTreecanalsoestimatetheprobabilitythataninstancebelongstoaparticularclassk:firstittraversesthetreetofindtheleafnodeforthisinstance,andthenitreturnstheratiooftraininginstancesofclasskinthisnode.Forexample,supposeyouhavefoundaflowerwhosepetalsare5cmlongand1.5cmwide.Thecorrespondingleafnodeisthedepth-2leftnode,sotheDecisionTreeshouldoutputthefollowingprobabilities:0%forIris-Setosa(0/54),90.7%forIris-Versicolor(49/54),and9.3%forIris-Virginica(5/54).Andofcourseifyouaskittopredicttheclass,itshouldoutputIris-Versicolor(class1)sinceithasthehighestprobability.Let’scheckthis:
>>> tree_clf.predict_proba([[5, 1.5]])
array([[0.        , 0.90740741, 0.09259259]])
>>> tree_clf.predict([[5, 1.5]])
array([1])
Perfect!Noticethattheestimatedprobabilitieswouldbeidenticalanywhereelseinthebottom-rightrectangleofFigure6-2—forexample,ifthepetalswere6cmlongand1.5cmwide(eventhoughitseemsobviousthatitwouldmostlikelybeanIris-Virginicainthiscase).
TheCARTTrainingAlgorithmScikit-LearnusestheClassificationAndRegressionTree(CART)algorithmtotrainDecisionTrees(alsocalled“growing”trees).Theideaisreallyquitesimple:thealgorithmfirstsplitsthetrainingsetintwosubsetsusingasinglefeaturekandathresholdtk(e.g.,“petallength≤2.45cm”).Howdoesitchoosekandtk?Itsearchesforthepair(k,tk)thatproducesthepurestsubsets(weightedbytheirsize).ThecostfunctionthatthealgorithmtriestominimizeisgivenbyEquation6-2.
Equation 6-2. CART cost function for classification
J(k, tk) = (m_left / m) G_left + (m_right / m) G_right
where G_left/right measures the impurity of the left/right subset and m_left/right is the number of instances in the left/right subset.
Onceithassuccessfullysplitthetrainingsetintwo,itsplitsthesubsetsusingthesamelogic,thenthesub-subsetsandsoon,recursively.Itstopsrecursingonceitreachesthemaximumdepth(definedbythemax_depthhyperparameter),orifitcannotfindasplitthatwillreduceimpurity.Afewotherhyperparameters(describedinamoment)controladditionalstoppingconditions(min_samples_split,min_samples_leaf,min_weight_fraction_leaf,andmax_leaf_nodes).
WARNINGAsyoucansee,theCARTalgorithmisagreedyalgorithm:itgreedilysearchesforanoptimumsplitatthetoplevel,thenrepeatstheprocessateachlevel.Itdoesnotcheckwhetherornotthesplitwillleadtothelowestpossibleimpurityseverallevelsdown.Agreedyalgorithmoftenproducesareasonablygoodsolution,butitisnotguaranteedtobetheoptimalsolution.
Unfortunately,findingtheoptimaltreeisknowntobeanNP-Completeproblem:2itrequiresO(exp(m))time,makingtheproblemintractableevenforfairlysmalltrainingsets.Thisiswhywemustsettlefora“reasonablygood”solution.
ComputationalComplexityMakingpredictionsrequirestraversingtheDecisionTreefromtheroottoaleaf.DecisionTreesaregenerallyapproximatelybalanced,sotraversingtheDecisionTreerequiresgoingthroughroughlyO(log2(m))nodes.3Sinceeachnodeonlyrequirescheckingthevalueofonefeature,theoverallpredictioncomplexityisjustO(log2(m)),independentofthenumberoffeatures.Sopredictionsareveryfast,evenwhendealingwithlargetrainingsets.
However,thetrainingalgorithmcomparesallfeatures(orlessifmax_featuresisset)onallsamplesateachnode.ThisresultsinatrainingcomplexityofO(n×mlog(m)).Forsmalltrainingsets(lessthanafewthousandinstances),Scikit-Learncanspeeduptrainingbypresortingthedata(setpresort=True),butthisslowsdowntrainingconsiderablyforlargertrainingsets.
GiniImpurityorEntropy?Bydefault,theGiniimpuritymeasureisused,butyoucanselecttheentropyimpuritymeasureinsteadbysettingthecriterionhyperparameterto"entropy".Theconceptofentropyoriginatedinthermodynamicsasameasureofmoleculardisorder:entropyapproacheszerowhenmoleculesarestillandwellordered.Itlaterspreadtoawidevarietyofdomains,includingShannon’sinformationtheory,whereitmeasurestheaverageinformationcontentofamessage:4entropyiszerowhenallmessagesareidentical.InMachineLearning,itisfrequentlyusedasanimpuritymeasure:aset’sentropyiszerowhenitcontainsinstancesofonlyoneclass.Equation6-3showsthedefinitionoftheentropyoftheithnode.
For example, the depth-2 left node in Figure 6-1 has an entropy equal to –(49/54) log(49/54) – (5/54) log(5/54) ≈ 0.31.
Equation 6-3. Entropy
Hi = – Σ over classes k with pi,k ≠ 0 of pi,k log(pi,k)
SoshouldyouuseGiniimpurityorentropy?Thetruthis,mostofthetimeitdoesnotmakeabigdifference:theyleadtosimilartrees.Giniimpurityisslightlyfastertocompute,soitisagooddefault.However,whentheydiffer,Giniimpuritytendstoisolatethemostfrequentclassinitsownbranchofthetree,whileentropytendstoproduceslightlymorebalancedtrees.5
RegularizationHyperparametersDecisionTreesmakeveryfewassumptionsaboutthetrainingdata(asopposedtolinearmodels,whichobviouslyassumethatthedataislinear,forexample).Ifleftunconstrained,thetreestructurewilladaptitselftothetrainingdata,fittingitveryclosely,andmostlikelyoverfittingit.Suchamodelisoftencalledanonparametricmodel,notbecauseitdoesnothaveanyparameters(itoftenhasalot)butbecausethenumberofparametersisnotdeterminedpriortotraining,sothemodelstructureisfreetostickcloselytothedata.Incontrast,aparametricmodelsuchasalinearmodelhasapredeterminednumberofparameters,soitsdegreeoffreedomislimited,reducingtheriskofoverfitting(butincreasingtheriskofunderfitting).
Toavoidoverfittingthetrainingdata,youneedtorestricttheDecisionTree’sfreedomduringtraining.Asyouknowbynow,thisiscalledregularization.Theregularizationhyperparametersdependonthealgorithmused,butgenerallyyoucanatleastrestrictthemaximumdepthoftheDecisionTree.InScikit-Learn,thisiscontrolledbythemax_depthhyperparameter(thedefaultvalueisNone,whichmeansunlimited).Reducingmax_depthwillregularizethemodelandthusreducetheriskofoverfitting.
TheDecisionTreeClassifierclasshasafewotherparametersthatsimilarlyrestricttheshapeoftheDecisionTree:min_samples_split(theminimumnumberofsamplesanodemusthavebeforeitcanbesplit),min_samples_leaf(theminimumnumberofsamplesaleafnodemusthave),min_weight_fraction_leaf(sameasmin_samples_leafbutexpressedasafractionofthetotalnumberofweightedinstances),max_leaf_nodes(maximumnumberofleafnodes),andmax_features(maximumnumberoffeaturesthatareevaluatedforsplittingateachnode).Increasingmin_*hyperparametersorreducingmax_*hyperparameterswillregularizethemodel.
NOTEOtheralgorithmsworkbyfirsttrainingtheDecisionTreewithoutrestrictions,thenpruning(deleting)unnecessarynodes.Anodewhosechildrenareallleafnodesisconsideredunnecessaryifthepurityimprovementitprovidesisnotstatisticallysignificant.Standardstatisticaltests,suchastheχ2test,areusedtoestimatetheprobabilitythattheimprovementispurelytheresultofchance(whichiscalledthenullhypothesis).Ifthisprobability,calledthep-value,ishigherthanagiventhreshold(typically5%,controlledbyahyperparameter),thenthenodeisconsideredunnecessaryanditschildrenaredeleted.Thepruningcontinuesuntilallunnecessarynodeshavebeenpruned.
Figure6-3showstwoDecisionTreestrainedonthemoonsdataset(introducedinChapter5).Ontheleft,theDecisionTreeistrainedwiththedefaulthyperparameters(i.e.,norestrictions),andontherighttheDecisionTreeistrainedwithmin_samples_leaf=4.Itisquiteobviousthatthemodelontheleftisoverfitting,andthemodelontherightwillprobablygeneralizebetter.
Figure6-3.Regularizationusingmin_samples_leaf
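A possible way to reproduce these two trees (the dataset parameters below are illustrative, not the book's exact settings) is:

from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

Xm, ym = make_moons(n_samples=100, noise=0.25)

tree_clf_unrestricted = DecisionTreeClassifier()                    # left plot: no restrictions
tree_clf_regularized = DecisionTreeClassifier(min_samples_leaf=4)   # right plot: regularized
tree_clf_unrestricted.fit(Xm, ym)
tree_clf_regularized.fit(Xm, ym)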
RegressionDecisionTreesarealsocapableofperformingregressiontasks.Let’sbuildaregressiontreeusingScikit-Learn’sDecisionTreeRegressorclass,trainingitonanoisyquadraticdatasetwithmax_depth=2:
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(max_depth=2)
tree_reg.fit(X, y)
TheresultingtreeisrepresentedonFigure6-4.
Figure6-4.ADecisionTreeforregression
Thistreelooksverysimilartotheclassificationtreeyoubuiltearlier.Themaindifferenceisthatinsteadofpredictingaclassineachnode,itpredictsavalue.Forexample,supposeyouwanttomakeapredictionforanewinstancewithx1=0.6.Youtraversethetreestartingattheroot,andyoueventuallyreachtheleafnodethatpredictsvalue=0.1106.Thispredictionissimplytheaveragetargetvalueofthe110traininginstancesassociatedtothisleafnode.ThispredictionresultsinaMeanSquaredError(MSE)equalto0.0151overthese110instances.
Thismodel’spredictionsarerepresentedontheleftofFigure6-5.Ifyousetmax_depth=3,yougetthepredictionsrepresentedontheright.Noticehowthepredictedvalueforeachregionisalwaystheaveragetargetvalueoftheinstancesinthatregion.Thealgorithmsplitseachregioninawaythatmakesmosttraininginstancesascloseaspossibletothatpredictedvalue.
Figure6-5.PredictionsoftwoDecisionTreeregressionmodels
TheCARTalgorithmworksmostlythesamewayasearlier,exceptthatinsteadoftryingtosplitthetrainingsetinawaythatminimizesimpurity,itnowtriestosplitthetrainingsetinawaythatminimizestheMSE.Equation6-4showsthecostfunctionthatthealgorithmtriestominimize.
Equation 6-4. CART cost function for regression
J(k, tk) = (m_left / m) MSE_left + (m_right / m) MSE_right
where MSE_node is the mean squared error within the node, using the node's mean target value ŷ_node as the prediction.
Justlikeforclassificationtasks,DecisionTreesarepronetooverfittingwhendealingwithregressiontasks.Withoutanyregularization(i.e.,usingthedefaulthyperparameters),yougetthepredictionsontheleftofFigure6-6.Itisobviouslyoverfittingthetrainingsetverybadly.Justsettingmin_samples_leaf=10resultsinamuchmorereasonablemodel,representedontherightofFigure6-6.
Figure6-6.RegularizingaDecisionTreeregressor
InstabilityHopefullybynowyouareconvincedthatDecisionTreeshavealotgoingforthem:theyaresimpletounderstandandinterpret,easytouse,versatile,andpowerful.Howevertheydohaveafewlimitations.First,asyoumayhavenoticed,DecisionTreesloveorthogonaldecisionboundaries(allsplitsareperpendiculartoanaxis),whichmakesthemsensitivetotrainingsetrotation.Forexample,Figure6-7showsasimplelinearlyseparabledataset:ontheleft,aDecisionTreecansplititeasily,whileontheright,afterthedatasetisrotatedby45°,thedecisionboundarylooksunnecessarilyconvoluted.AlthoughbothDecisionTreesfitthetrainingsetperfectly,itisverylikelythatthemodelontherightwillnotgeneralizewell.OnewaytolimitthisproblemistousePCA(seeChapter8),whichoftenresultsinabetterorientationofthetrainingdata.
Figure6-7.Sensitivitytotrainingsetrotation
Moregenerally,themainissuewithDecisionTreesisthattheyareverysensitivetosmallvariationsinthetrainingdata.Forexample,ifyoujustremovethewidestIris-Versicolorfromtheiristrainingset(theonewithpetals4.8cmlongand1.8cmwide)andtrainanewDecisionTree,youmaygetthemodelrepresentedinFigure6-8.Asyoucansee,itlooksverydifferentfromthepreviousDecisionTree(Figure6-2).Actually,sincethetrainingalgorithmusedbyScikit-Learnisstochastic6youmaygetverydifferentmodelsevenonthesametrainingdata(unlessyousettherandom_statehyperparameter).
Figure6-8.Sensitivitytotrainingsetdetails
RandomForestscanlimitthisinstabilitybyaveragingpredictionsovermanytrees,aswewillseeinthenextchapter.
Exercises1. WhatistheapproximatedepthofaDecisionTreetrained(withoutrestrictions)onatrainingsetwith
1millioninstances?
2. Isanode’sGiniimpuritygenerallylowerorgreaterthanitsparent’s?Isitgenerallylower/greater,oralwayslower/greater?
3. IfaDecisionTreeisoverfittingthetrainingset,isitagoodideatotrydecreasingmax_depth?
4. IfaDecisionTreeisunderfittingthetrainingset,isitagoodideatotryscalingtheinputfeatures?
5. IfittakesonehourtotrainaDecisionTreeonatrainingsetcontaining1millioninstances,roughlyhowmuchtimewillittaketotrainanotherDecisionTreeonatrainingsetcontaining10millioninstances?
6. Ifyourtrainingsetcontains100,000instances,willsettingpresort=Truespeeduptraining?
7. Trainandfine-tuneaDecisionTreeforthemoonsdataset.a. Generateamoonsdatasetusingmake_moons(n_samples=10000,noise=0.4).
b. Splititintoatrainingsetandatestsetusingtrain_test_split().
c. Usegridsearchwithcross-validation(withthehelpoftheGridSearchCVclass)tofindgoodhyperparametervaluesforaDecisionTreeClassifier.Hint:tryvariousvaluesformax_leaf_nodes.
d. Trainitonthefulltrainingsetusingthesehyperparameters,andmeasureyourmodel’sperformanceonthetestset.Youshouldgetroughly85%to87%accuracy.
8. Growaforest.a. Continuingthepreviousexercise,generate1,000subsetsofthetrainingset,eachcontaining100
instancesselectedrandomly.Hint:youcanuseScikit-Learn’sShuffleSplitclassforthis.
b. TrainoneDecisionTreeoneachsubset,usingthebesthyperparametervaluesfoundabove.Evaluatethese1,000DecisionTreesonthetestset.Sincetheyweretrainedonsmallersets,theseDecisionTreeswilllikelyperformworsethanthefirstDecisionTree,achievingonlyabout80%accuracy.
c. Nowcomesthemagic.Foreachtestsetinstance,generatethepredictionsofthe1,000DecisionTrees,andkeeponlythemostfrequentprediction(youcanuseSciPy’smode()functionforthis).Thisgivesyoumajority-votepredictionsoverthetestset.
d. Evaluatethesepredictionsonthetestset:youshouldobtainaslightlyhigheraccuracythanyourfirstmodel(about0.5to1.5%higher).Congratulations,youhavetrainedaRandomForestclassifier!
SolutionstotheseexercisesareavailableinAppendixA.
Graphvizisanopensourcegraphvisualizationsoftwarepackage,availableathttp://www.graphviz.org/.
Pisthesetofproblemsthatcanbesolvedinpolynomialtime.NPisthesetofproblemswhosesolutionscanbeverifiedinpolynomialtime.AnNP-HardproblemisaproblemtowhichanyNPproblemcanbereducedinpolynomialtime.AnNP-CompleteproblemisbothNPandNP-Hard.AmajoropenmathematicalquestioniswhetherornotP=NP.IfP≠NP(whichseemslikely),thennopolynomialalgorithmwilleverbefoundforanyNP-Completeproblem(exceptperhapsonaquantumcomputer).
log2isthebinarylogarithm.Itisequaltolog2(m)=log(m)/log(2).
Areductionofentropyisoftencalledaninformationgain.
SeeSebastianRaschka’sinterestinganalysisformoredetails.
Itrandomlyselectsthesetoffeaturestoevaluateateachnode.
Chapter7.EnsembleLearningandRandomForests
Supposeyouaskacomplexquestiontothousandsofrandompeople,thenaggregatetheiranswers.Inmanycasesyouwillfindthatthisaggregatedanswerisbetterthananexpert’sanswer.Thisiscalledthewisdomofthecrowd.Similarly,ifyouaggregatethepredictionsofagroupofpredictors(suchasclassifiersorregressors),youwilloftengetbetterpredictionsthanwiththebestindividualpredictor.Agroupofpredictorsiscalledanensemble;thus,thistechniqueiscalledEnsembleLearning,andanEnsembleLearningalgorithmiscalledanEnsemblemethod.
Forexample,youcantrainagroupofDecisionTreeclassifiers,eachonadifferentrandomsubsetofthetrainingset.Tomakepredictions,youjustobtainthepredictionsofallindividualtrees,thenpredicttheclassthatgetsthemostvotes(seethelastexerciseinChapter6).SuchanensembleofDecisionTreesiscalledaRandomForest,anddespiteitssimplicity,thisisoneofthemostpowerfulMachineLearningalgorithmsavailabletoday.
Moreover,aswediscussedinChapter2,youwilloftenuseEnsemblemethodsneartheendofaproject,onceyouhavealreadybuiltafewgoodpredictors,tocombinethemintoanevenbetterpredictor.Infact,thewinningsolutionsinMachineLearningcompetitionsofteninvolveseveralEnsemblemethods(mostfamouslyintheNetflixPrizecompetition).
InthischapterwewilldiscussthemostpopularEnsemblemethods,includingbagging,boosting,stacking,andafewothers.WewillalsoexploreRandomForests.
VotingClassifiersSupposeyouhavetrainedafewclassifiers,eachoneachievingabout80%accuracy.YoumayhaveaLogisticRegressionclassifier,anSVMclassifier,aRandomForestclassifier,aK-NearestNeighborsclassifier,andperhapsafewmore(seeFigure7-1).
Figure7-1.Trainingdiverseclassifiers
Averysimplewaytocreateanevenbetterclassifieristoaggregatethepredictionsofeachclassifierandpredicttheclassthatgetsthemostvotes.Thismajority-voteclassifieriscalledahardvotingclassifier(seeFigure7-2).
Figure7-2.Hardvotingclassifierpredictions
Somewhatsurprisingly,thisvotingclassifieroftenachievesahigheraccuracythanthebestclassifierintheensemble.Infact,evenifeachclassifierisaweaklearner(meaningitdoesonlyslightlybetterthanrandomguessing),theensemblecanstillbeastronglearner(achievinghighaccuracy),providedthereareasufficientnumberofweaklearnersandtheyaresufficientlydiverse.
Howisthispossible?Thefollowinganalogycanhelpshedsomelightonthismystery.Supposeyouhave
aslightlybiasedcointhathasa51%chanceofcomingupheads,and49%chanceofcominguptails.Ifyoutossit1,000times,youwillgenerallygetmoreorless510headsand490tails,andhenceamajorityofheads.Ifyoudothemath,youwillfindthattheprobabilityofobtainingamajorityofheadsafter1,000tossesiscloseto75%.Themoreyoutossthecoin,thehighertheprobability(e.g.,with10,000tosses,theprobabilityclimbsover97%).Thisisduetothelawoflargenumbers:asyoukeeptossingthecoin,theratioofheadsgetscloserandclosertotheprobabilityofheads(51%).Figure7-3shows10seriesofbiasedcointosses.Youcanseethatasthenumberoftossesincreases,theratioofheadsapproaches51%.Eventuallyall10seriesendupsocloseto51%thattheyareconsistentlyabove50%.
Figure7-3.Thelawoflargenumbers
Similarly,supposeyoubuildanensemblecontaining1,000classifiersthatareindividuallycorrectonly51%ofthetime(barelybetterthanrandomguessing).Ifyoupredictthemajorityvotedclass,youcanhopeforupto75%accuracy!However,thisisonlytrueifallclassifiersareperfectlyindependent,makinguncorrelatederrors,whichisclearlynotthecasesincetheyaretrainedonthesamedata.Theyarelikelytomakethesametypesoferrors,sotherewillbemanymajorityvotesforthewrongclass,reducingtheensemble’saccuracy.
TIPEnsemblemethodsworkbestwhenthepredictorsareasindependentfromoneanotheraspossible.Onewaytogetdiverseclassifiersistotrainthemusingverydifferentalgorithms.Thisincreasesthechancethattheywillmakeverydifferenttypesoferrors,improvingtheensemble’saccuracy.
ThefollowingcodecreatesandtrainsavotingclassifierinScikit-Learn,composedofthreediverseclassifiers(thetrainingsetisthemoonsdataset,introducedinChapter5):
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC()

voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='hard')
voting_clf.fit(X_train, y_train)
Let’slookateachclassifier’saccuracyonthetestset:
>>> from sklearn.metrics import accuracy_score
>>> for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
...     clf.fit(X_train, y_train)
...     y_pred = clf.predict(X_test)
...     print(clf.__class__.__name__, accuracy_score(y_test, y_pred))
...
LogisticRegression 0.864
RandomForestClassifier 0.872
SVC 0.888
VotingClassifier 0.896
Thereyouhaveit!Thevotingclassifierslightlyoutperformsalltheindividualclassifiers.
Ifallclassifiersareabletoestimateclassprobabilities(i.e.,theyhaveapredict_proba()method),thenyoucantellScikit-Learntopredicttheclasswiththehighestclassprobability,averagedoveralltheindividualclassifiers.Thisiscalledsoftvoting.Itoftenachieveshigherperformancethanhardvotingbecauseitgivesmoreweighttohighlyconfidentvotes.Allyouneedtodoisreplacevoting="hard"withvoting="soft"andensurethatallclassifierscanestimateclassprobabilities.ThisisnotthecaseoftheSVCclassbydefault,soyouneedtosetitsprobabilityhyperparametertoTrue(thiswillmaketheSVCclassusecross-validationtoestimateclassprobabilities,slowingdowntraining,anditwilladdapredict_proba()method).Ifyoumodifytheprecedingcodetousesoftvoting,youwillfindthatthevotingclassifierachievesover91%accuracy!
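For reference, the preceding ensemble rewritten for soft voting might look like this (the only changes are voting="soft" and probability=True):

from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC(probability=True)   # required so that SVC exposes predict_proba()

voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='soft')
voting_clf.fit(X_train, y_train)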
BaggingandPastingOnewaytogetadiversesetofclassifiersistouseverydifferenttrainingalgorithms,asjustdiscussed.Anotherapproachistousethesametrainingalgorithmforeverypredictor,buttotrainthemondifferentrandomsubsetsofthetrainingset.Whensamplingisperformedwithreplacement,thismethodiscalledbagging1(shortforbootstrapaggregating2).Whensamplingisperformedwithoutreplacement,itiscalledpasting.3
Inotherwords,bothbaggingandpastingallowtraininginstancestobesampledseveraltimesacrossmultiplepredictors,butonlybaggingallowstraininginstancestobesampledseveraltimesforthesamepredictor.ThissamplingandtrainingprocessisrepresentedinFigure7-4.
Figure7-4.Pasting/baggingtrainingsetsamplingandtraining
Onceallpredictorsaretrained,theensemblecanmakeapredictionforanewinstancebysimplyaggregatingthepredictionsofallpredictors.Theaggregationfunctionistypicallythestatisticalmode(i.e.,themostfrequentprediction,justlikeahardvotingclassifier)forclassification,ortheaverageforregression.Eachindividualpredictorhasahigherbiasthanifitweretrainedontheoriginaltrainingset,butaggregationreducesbothbiasandvariance.4Generally,thenetresultisthattheensemblehasasimilarbiasbutalowervariancethanasinglepredictortrainedontheoriginaltrainingset.
AsyoucanseeinFigure7-4,predictorscanallbetrainedinparallel,viadifferentCPUcoresorevendifferentservers.Similarly,predictionscanbemadeinparallel.Thisisoneofthereasonswhybaggingandpastingaresuchpopularmethods:theyscaleverywell.
BaggingandPastinginScikit-LearnScikit-LearnoffersasimpleAPIforbothbaggingandpastingwiththeBaggingClassifierclass(orBaggingRegressorforregression).Thefollowingcodetrainsanensembleof500DecisionTreeclassifiers,5eachtrainedon100traininginstancesrandomlysampledfromthetrainingsetwithreplacement(thisisanexampleofbagging,butifyouwanttousepastinginstead,justsetbootstrap=False).Then_jobsparametertellsScikit-LearnthenumberofCPUcorestousefortrainingandpredictions(–1tellsScikit-Learntouseallavailablecores):
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=100, bootstrap=True, n_jobs=-1)
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)
NOTETheBaggingClassifierautomaticallyperformssoftvotinginsteadofhardvotingifthebaseclassifiercanestimateclassprobabilities(i.e.,ifithasapredict_proba()method),whichisthecasewithDecisionTreesclassifiers.
Figure7-5comparesthedecisionboundaryofasingleDecisionTreewiththedecisionboundaryofabaggingensembleof500trees(fromtheprecedingcode),bothtrainedonthemoonsdataset.Asyoucansee,theensemble’spredictionswilllikelygeneralizemuchbetterthanthesingleDecisionTree’spredictions:theensemblehasacomparablebiasbutasmallervariance(itmakesroughlythesamenumberoferrorsonthetrainingset,butthedecisionboundaryislessirregular).
Figure7-5.AsingleDecisionTreeversusabaggingensembleof500trees
Bootstrappingintroducesabitmorediversityinthesubsetsthateachpredictoristrainedon,sobaggingendsupwithaslightlyhigherbiasthanpasting,butthisalsomeansthatpredictorsendupbeinglesscorrelatedsotheensemble’svarianceisreduced.Overall,baggingoftenresultsinbettermodels,whichexplainswhyitisgenerallypreferred.However,ifyouhavesparetimeandCPUpoweryoucanusecross-validationtoevaluatebothbaggingandpastingandselecttheonethatworksbest.
Out-of-BagEvaluationWithbagging,someinstancesmaybesampledseveraltimesforanygivenpredictor,whileothersmaynotbesampledatall.BydefaultaBaggingClassifiersamplesmtraininginstanceswithreplacement(bootstrap=True),wheremisthesizeofthetrainingset.Thismeansthatonlyabout63%ofthetraininginstancesaresampledonaverageforeachpredictor.6Theremaining37%ofthetraininginstancesthatarenotsampledarecalledout-of-bag(oob)instances.Notethattheyarenotthesame37%forallpredictors.
Sinceapredictorneverseestheoobinstancesduringtraining,itcanbeevaluatedontheseinstances,withouttheneedforaseparatevalidationsetorcross-validation.Youcanevaluatetheensembleitselfbyaveragingouttheoobevaluationsofeachpredictor.
InScikit-Learn,youcansetoob_score=TruewhencreatingaBaggingClassifiertorequestanautomaticoobevaluationaftertraining.Thefollowingcodedemonstratesthis.Theresultingevaluationscoreisavailablethroughtheoob_score_variable:
>>> bag_clf = BaggingClassifier(
...     DecisionTreeClassifier(), n_estimators=500,
...     bootstrap=True, n_jobs=-1, oob_score=True)
...
>>> bag_clf.fit(X_train, y_train)
>>> bag_clf.oob_score_
0.90133333333333332
Accordingtothisoobevaluation,thisBaggingClassifierislikelytoachieveabout90.1%accuracyonthetestset.Let’sverifythis:
>>> from sklearn.metrics import accuracy_score
>>> y_pred = bag_clf.predict(X_test)
>>> accuracy_score(y_test, y_pred)
0.91200000000000003
Weget91.2%accuracyonthetestset—closeenough!
The oob decision function for each training instance is also available through the oob_decision_function_ variable. In this case (since the base estimator has a predict_proba() method) the decision function returns the class probabilities for each training instance. For example, the oob evaluation estimates that the second training instance has a 60.6% probability of belonging to the positive class (and 39.4% of belonging to the negative class):
>>> bag_clf.oob_decision_function_
array([[0.31746032, 0.68253968],
       [0.34117647, 0.65882353],
       [1.        , 0.        ],
       ...
       [1.        , 0.        ],
       [0.03108808, 0.96891192],
       [0.57291667, 0.42708333]])
RandomPatchesandRandomSubspacesTheBaggingClassifierclasssupportssamplingthefeaturesaswell.Thisiscontrolledbytwohyperparameters:max_featuresandbootstrap_features.Theyworkthesamewayasmax_samplesandbootstrap,butforfeaturesamplinginsteadofinstancesampling.Thus,eachpredictorwillbetrainedonarandomsubsetoftheinputfeatures.
Thisisparticularlyusefulwhenyouaredealingwithhigh-dimensionalinputs(suchasimages).SamplingbothtraininginstancesandfeaturesiscalledtheRandomPatchesmethod.7Keepingalltraininginstances(i.e.,bootstrap=Falseandmax_samples=1.0)butsamplingfeatures(i.e.,bootstrap_features=Trueand/ormax_featuressmallerthan1.0)iscalledtheRandomSubspacesmethod.8
Samplingfeaturesresultsinevenmorepredictordiversity,tradingabitmorebiasforalowervariance.
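For example, here is one possible configuration of the Random Subspaces method (the feature fraction is an arbitrary choice): every tree sees all the training instances but only a random half of the features.

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

subspaces_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    bootstrap=False, max_samples=1.0,             # keep all training instances
    bootstrap_features=True, max_features=0.5,    # sample the features
    n_jobs=-1)
subspaces_clf.fit(X_train, y_train)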
RandomForestsAswehavediscussed,aRandomForest9isanensembleofDecisionTrees,generallytrainedviathebaggingmethod(orsometimespasting),typicallywithmax_samplessettothesizeofthetrainingset.InsteadofbuildingaBaggingClassifierandpassingitaDecisionTreeClassifier,youcaninsteadusetheRandomForestClassifierclass,whichismoreconvenientandoptimizedforDecisionTrees10
(similarly,thereisaRandomForestRegressorclassforregressiontasks).ThefollowingcodetrainsaRandomForestclassifierwith500trees(eachlimitedtomaximum16nodes),usingallavailableCPUcores:
from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(X_train, y_train)

y_pred_rf = rnd_clf.predict(X_test)
Withafewexceptions,aRandomForestClassifierhasallthehyperparametersofaDecisionTreeClassifier(tocontrolhowtreesaregrown),plusallthehyperparametersofaBaggingClassifiertocontroltheensembleitself.11
TheRandomForestalgorithmintroducesextrarandomnesswhengrowingtrees;insteadofsearchingfortheverybestfeaturewhensplittinganode(seeChapter6),itsearchesforthebestfeatureamongarandomsubsetoffeatures.Thisresultsinagreatertreediversity,which(onceagain)tradesahigherbiasforalowervariance,generallyyieldinganoverallbettermodel.ThefollowingBaggingClassifierisroughlyequivalenttothepreviousRandomForestClassifier:
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(splitter="random", max_leaf_nodes=16),
    n_estimators=500, max_samples=1.0, bootstrap=True, n_jobs=-1)
Extra-TreesWhenyouaregrowingatreeinaRandomForest,ateachnodeonlyarandomsubsetofthefeaturesisconsideredforsplitting(asdiscussedearlier).Itispossibletomaketreesevenmorerandombyalsousingrandomthresholdsforeachfeatureratherthansearchingforthebestpossiblethresholds(likeregularDecisionTreesdo).
AforestofsuchextremelyrandomtreesissimplycalledanExtremelyRandomizedTreesensemble12(orExtra-Treesforshort).Onceagain,thistradesmorebiasforalowervariance.ItalsomakesExtra-TreesmuchfastertotrainthanregularRandomForestssincefindingthebestpossiblethresholdforeachfeatureateverynodeisoneofthemosttime-consumingtasksofgrowingatree.
YoucancreateanExtra-TreesclassifierusingScikit-Learn’sExtraTreesClassifierclass.ItsAPIisidenticaltotheRandomForestClassifierclass.Similarly,theExtraTreesRegressorclasshasthesameAPIastheRandomForestRegressorclass.
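For example, the following sketch mirrors the earlier Random Forest (same illustrative hyperparameters), simply swapping in the Extra-Trees class:

from sklearn.ensemble import ExtraTreesClassifier

extra_clf = ExtraTreesClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
extra_clf.fit(X_train, y_train)
y_pred_extra = extra_clf.predict(X_test)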
TIPItishardtotellinadvancewhetheraRandomForestClassifierwillperformbetterorworsethananExtraTreesClassifier.Generally,theonlywaytoknowistotrybothandcomparethemusingcross-validation(andtuningthehyperparametersusinggridsearch).
FeatureImportanceYetanothergreatqualityofRandomForestsisthattheymakeiteasytomeasuretherelativeimportanceofeachfeature.Scikit-Learnmeasuresafeature’simportancebylookingathowmuchthetreenodesthatusethatfeaturereduceimpurityonaverage(acrossalltreesintheforest).Moreprecisely,itisaweightedaverage,whereeachnode’sweightisequaltothenumberoftrainingsamplesthatareassociatedwithit(seeChapter6).
Scikit-Learncomputesthisscoreautomaticallyforeachfeatureaftertraining,thenitscalestheresultssothatthesumofallimportancesisequalto1.Youcanaccesstheresultusingthefeature_importances_variable.Forexample,thefollowingcodetrainsaRandomForestClassifierontheirisdataset(introducedinChapter4)andoutputseachfeature’simportance.Itseemsthatthemostimportantfeaturesarethepetallength(44%)andwidth(42%),whilesepallengthandwidthareratherunimportantincomparison(11%and2%,respectively).
>>> from sklearn.datasets import load_iris
>>> iris = load_iris()
>>> rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
>>> rnd_clf.fit(iris["data"], iris["target"])
>>> for name, score in zip(iris["feature_names"], rnd_clf.feature_importances_):
...     print(name, score)
...
sepal length (cm) 0.112492250999
sepal width (cm) 0.0231192882825
petal length (cm) 0.441030464364
petal width (cm) 0.423357996355
Similarly,ifyoutrainaRandomForestclassifierontheMNISTdataset(introducedinChapter3)andploteachpixel’simportance,yougettheimagerepresentedinFigure7-6.
Figure7-6.MNISTpixelimportance(accordingtoaRandomForestclassifier)
RandomForestsareveryhandytogetaquickunderstandingofwhatfeaturesactuallymatter,inparticularifyouneedtoperformfeatureselection.
BoostingBoosting(originallycalledhypothesisboosting)referstoanyEnsemblemethodthatcancombineseveralweaklearnersintoastronglearner.Thegeneralideaofmostboostingmethodsistotrainpredictorssequentially,eachtryingtocorrectitspredecessor.Therearemanyboostingmethodsavailable,butbyfarthemostpopularareAdaBoost13(shortforAdaptiveBoosting)andGradientBoosting.Let’sstartwithAdaBoost.
AdaBoostOnewayforanewpredictortocorrectitspredecessoristopayabitmoreattentiontothetraininginstancesthatthepredecessorunderfitted.Thisresultsinnewpredictorsfocusingmoreandmoreonthehardcases.ThisisthetechniqueusedbyAdaBoost.
Forexample,tobuildanAdaBoostclassifier,afirstbaseclassifier(suchasaDecisionTree)istrainedandusedtomakepredictionsonthetrainingset.Therelativeweightofmisclassifiedtraininginstancesisthenincreased.Asecondclassifieristrainedusingtheupdatedweightsandagainitmakespredictionsonthetrainingset,weightsareupdated,andsoon(seeFigure7-7).
Figure7-7.AdaBoostsequentialtrainingwithinstanceweightupdates
Figure7-8showsthedecisionboundariesoffiveconsecutivepredictorsonthemoonsdataset(inthisexample,eachpredictorisahighlyregularizedSVMclassifierwithanRBFkernel14).Thefirstclassifiergetsmanyinstanceswrong,sotheirweightsgetboosted.Thesecondclassifierthereforedoesabetterjobontheseinstances,andsoon.Theplotontherightrepresentsthesamesequenceofpredictorsexceptthatthelearningrateishalved(i.e.,themisclassifiedinstanceweightsareboostedhalfasmuchateveryiteration).Asyoucansee,thissequentiallearningtechniquehassomesimilaritieswithGradientDescent,exceptthatinsteadoftweakingasinglepredictor’sparameterstominimizeacostfunction,AdaBoostaddspredictorstotheensemble,graduallymakingitbetter.
Figure7-8.Decisionboundariesofconsecutivepredictors
Onceallpredictorsaretrained,theensemblemakespredictionsverymuchlikebaggingorpasting,exceptthatpredictorshavedifferentweightsdependingontheiroverallaccuracyontheweightedtrainingset.
WARNINGThereisoneimportantdrawbacktothissequentiallearningtechnique:itcannotbeparallelized(oronlypartially),sinceeachpredictorcanonlybetrainedafterthepreviouspredictorhasbeentrainedandevaluated.Asaresult,itdoesnotscaleaswellasbaggingorpasting.
Let’stakeacloserlookattheAdaBoostalgorithm.Eachinstanceweightw(i)isinitiallysetto .Afirstpredictoristrainedanditsweightederrorrater1iscomputedonthetrainingset;seeEquation7-1.
Equation 7-1. Weighted error rate of the jth predictor
rj = (sum of w(i) over the instances misclassified by the jth predictor, i.e., where ŷj(i) ≠ y(i)) / (sum of w(i) over all m instances)
where ŷj(i) is the jth predictor's prediction for the ith instance.
Thepredictor’sweightαjisthencomputedusingEquation7-2,whereηisthelearningratehyperparameter(defaultsto1).15Themoreaccuratethepredictoris,thehigheritsweightwillbe.Ifitisjustguessingrandomly,thenitsweightwillbeclosetozero.However,ifitismostoftenwrong(i.e.,lessaccuratethanrandomguessing),thenitsweightwillbenegative.
Equation 7-2. Predictor weight
αj = η log((1 – rj) / rj)
NexttheinstanceweightsareupdatedusingEquation7-3:themisclassifiedinstancesareboosted.
Equation 7-3. Weight update rule
for i = 1, 2, ⋯, m:  w(i) ← w(i) if ŷj(i) = y(i),  or w(i) ← w(i) exp(αj) if ŷj(i) ≠ y(i)
Then all the instance weights are normalized (i.e., divided by Σi=1..m w(i)).
Finally,anewpredictoristrainedusingtheupdatedweights,andthewholeprocessisrepeated(thenewpredictor’sweightiscomputed,theinstanceweightsareupdated,thenanotherpredictoristrained,andsoon).Thealgorithmstopswhenthedesirednumberofpredictorsisreached,orwhenaperfectpredictorisfound.
Tomakepredictions,AdaBoostsimplycomputesthepredictionsofallthepredictorsandweighsthemusingthepredictorweightsαj.Thepredictedclassistheonethatreceivesthemajorityofweightedvotes(seeEquation7-4).
Equation 7-4. AdaBoost predictions
ŷ(x) = argmax over k of the sum of αj over the predictors j (out of N) for which ŷj(x) = k
Scikit-LearnactuallyusesamulticlassversionofAdaBoostcalledSAMME16(whichstandsforStagewiseAdditiveModelingusingaMulticlassExponentiallossfunction).Whentherearejusttwoclasses,SAMMEisequivalenttoAdaBoost.Moreover,ifthepredictorscanestimateclassprobabilities(i.e.,iftheyhaveapredict_proba()method),Scikit-LearncanuseavariantofSAMMEcalledSAMME.R(theRstandsfor“Real”),whichreliesonclassprobabilitiesratherthanpredictionsandgenerallyperformsbetter.
ThefollowingcodetrainsanAdaBoostclassifierbasedon200DecisionStumpsusingScikit-Learn’sAdaBoostClassifierclass(asyoumightexpect,thereisalsoanAdaBoostRegressorclass).ADecisionStumpisaDecisionTreewithmax_depth=1—inotherwords,atreecomposedofasingledecisionnodeplustwoleafnodes.ThisisthedefaultbaseestimatorfortheAdaBoostClassifierclass:
from sklearn.ensemble import AdaBoostClassifier

ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=200,
    algorithm="SAMME.R", learning_rate=0.5)
ada_clf.fit(X_train, y_train)
TIPIfyourAdaBoostensembleisoverfittingthetrainingset,youcantryreducingthenumberofestimatorsormorestronglyregularizingthebaseestimator.
GradientBoostingAnotherverypopularBoostingalgorithmisGradientBoosting.17JustlikeAdaBoost,GradientBoostingworksbysequentiallyaddingpredictorstoanensemble,eachonecorrectingitspredecessor.However,insteadoftweakingtheinstanceweightsateveryiterationlikeAdaBoostdoes,thismethodtriestofitthenewpredictortotheresidualerrorsmadebythepreviouspredictor.
Let’sgothroughasimpleregressionexampleusingDecisionTreesasthebasepredictors(ofcourseGradientBoostingalsoworksgreatwithregressiontasks).ThisiscalledGradientTreeBoosting,orGradientBoostedRegressionTrees(GBRT).First,let’sfitaDecisionTreeRegressortothetrainingset(forexample,anoisyquadratictrainingset):
from sklearn.tree import DecisionTreeRegressor

tree_reg1 = DecisionTreeRegressor(max_depth=2)
tree_reg1.fit(X, y)
NowtrainasecondDecisionTreeRegressorontheresidualerrorsmadebythefirstpredictor:
y2 = y - tree_reg1.predict(X)
tree_reg2 = DecisionTreeRegressor(max_depth=2)
tree_reg2.fit(X, y2)
Thenwetrainathirdregressorontheresidualerrorsmadebythesecondpredictor:
y3 = y2 - tree_reg2.predict(X)
tree_reg3 = DecisionTreeRegressor(max_depth=2)
tree_reg3.fit(X, y3)
Nowwehaveanensemblecontainingthreetrees.Itcanmakepredictionsonanewinstancesimplybyaddingupthepredictionsofallthetrees:
y_pred = sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))
Figure7-9representsthepredictionsofthesethreetreesintheleftcolumn,andtheensemble’spredictionsintherightcolumn.Inthefirstrow,theensemblehasjustonetree,soitspredictionsareexactlythesameasthefirsttree’spredictions.Inthesecondrow,anewtreeistrainedontheresidualerrorsofthefirsttree.Ontherightyoucanseethattheensemble’spredictionsareequaltothesumofthepredictionsofthefirsttwotrees.Similarly,inthethirdrowanothertreeistrainedontheresidualerrorsofthesecondtree.Youcanseethattheensemble’spredictionsgraduallygetbetterastreesareaddedtotheensemble.
AsimplerwaytotrainGBRTensemblesistouseScikit-Learn’sGradientBoostingRegressorclass.MuchliketheRandomForestRegressorclass,ithashyperparameterstocontrolthegrowthofDecisionTrees(e.g.,max_depth,min_samples_leaf,andsoon),aswellashyperparameterstocontroltheensembletraining,suchasthenumberoftrees(n_estimators).Thefollowingcodecreatesthesameensembleasthepreviousone:
from sklearn.ensemble import GradientBoostingRegressor

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=1.0)
gbrt.fit(X, y)
Figure7-9.GradientBoosting
Thelearning_ratehyperparameterscalesthecontributionofeachtree.Ifyousetittoalowvalue,suchas0.1,youwillneedmoretreesintheensembletofitthetrainingset,butthepredictionswillusuallygeneralizebetter.Thisisaregularizationtechniquecalledshrinkage.Figure7-10showstwoGBRTensemblestrainedwithalowlearningrate:theoneontheleftdoesnothaveenoughtreestofitthetrainingset,whiletheoneontherighthastoomanytreesandoverfitsthetrainingset.
Figure7-10.GBRTensembleswithnotenoughpredictors(left)andtoomany(right)
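To illustrate shrinkage in code, a low learning_rate is typically paired with more trees; the hyperparameter values below are illustrative, not tuned:

# A minimal sketch of shrinkage: a low learning_rate usually needs more trees.
gbrt_slow_small = GradientBoostingRegressor(max_depth=2, n_estimators=3,
                                            learning_rate=0.1)   # too few trees, likely underfits
gbrt_slow_large = GradientBoostingRegressor(max_depth=2, n_estimators=200,
                                            learning_rate=0.1)   # many trees, may overfit
gbrt_slow_small.fit(X, y)
gbrt_slow_large.fit(X, y)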
Inordertofindtheoptimalnumberoftrees,youcanuseearlystopping(seeChapter4).Asimplewaytoimplementthisistousethestaged_predict()method:itreturnsaniteratoroverthepredictionsmadebytheensembleateachstageoftraining(withonetree,twotrees,etc.).ThefollowingcodetrainsaGBRTensemblewith120trees,thenmeasuresthevalidationerrorateachstageoftrainingtofindtheoptimalnumberoftrees,andfinallytrainsanotherGBRTensembleusingtheoptimalnumberoftrees:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X_train, X_val, y_train, y_val = train_test_split(X, y)

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=120)
gbrt.fit(X_train, y_train)

errors = [mean_squared_error(y_val, y_pred)
          for y_pred in gbrt.staged_predict(X_val)]
bst_n_estimators = np.argmin(errors) + 1  # staged_predict starts at 1 tree

gbrt_best = GradientBoostingRegressor(max_depth=2, n_estimators=bst_n_estimators)
gbrt_best.fit(X_train, y_train)
ThevalidationerrorsarerepresentedontheleftofFigure7-11,andthebestmodel’spredictionsarerepresentedontheright.
Figure7-11.Tuningthenumberoftreesusingearlystopping
It is also possible to implement early stopping by actually stopping training early (instead of training a large number of trees first and then looking back to find the optimal number). You can do so by setting warm_start=True, which makes Scikit-Learn keep existing trees when the fit() method is called, allowing incremental training. The following code stops training when the validation error does not improve for five iterations in a row:
gbrt = GradientBoostingRegressor(max_depth=2, warm_start=True)

min_val_error = float("inf")
error_going_up = 0
for n_estimators in range(1, 120):
    gbrt.n_estimators = n_estimators
    gbrt.fit(X_train, y_train)
    y_pred = gbrt.predict(X_val)
    val_error = mean_squared_error(y_val, y_pred)
    if val_error < min_val_error:
        min_val_error = val_error
        error_going_up = 0
    else:
        error_going_up += 1
        if error_going_up == 5:
            break  # early stopping
The GradientBoostingRegressor class also supports a subsample hyperparameter, which specifies the fraction of training instances to be used for training each tree. For example, if subsample=0.25, then each tree is trained on 25% of the training instances, selected randomly. As you can probably guess by now, this trades a higher bias for a lower variance. It also speeds up training considerably. This technique is called Stochastic Gradient Boosting.
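For instance, here is a minimal sketch of Stochastic Gradient Boosting (the hyperparameter values are illustrative):

# Each tree is trained on a random 25% subsample of the training instances.
sgbrt = GradientBoostingRegressor(max_depth=2, n_estimators=100,
                                  learning_rate=0.1, subsample=0.25)
sgbrt.fit(X_train, y_train)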
NOTEItispossibletouseGradientBoostingwithothercostfunctions.Thisiscontrolledbythelosshyperparameter(seeScikit-Learn’sdocumentationformoredetails).
Stacking

The last Ensemble method we will discuss in this chapter is called stacking (short for stacked generalization).18 It is based on a simple idea: instead of using trivial functions (such as hard voting) to aggregate the predictions of all predictors in an ensemble, why don't we train a model to perform this aggregation? Figure 7-12 shows such an ensemble performing a regression task on a new instance. Each of the bottom three predictors predicts a different value (3.1, 2.7, and 2.9), and then the final predictor (called a blender, or a meta learner) takes these predictions as inputs and makes the final prediction (3.0).
Figure7-12.Aggregatingpredictionsusingablendingpredictor
Totraintheblender,acommonapproachistouseahold-outset.19Let’sseehowitworks.First,thetrainingsetissplitintwosubsets.Thefirstsubsetisusedtotrainthepredictorsinthefirstlayer(seeFigure7-13).
Figure7-13.Trainingthefirstlayer
Next,thefirstlayerpredictorsareusedtomakepredictionsonthesecond(held-out)set(seeFigure7-14).Thisensuresthatthepredictionsare“clean,”sincethepredictorsneversawtheseinstancesduringtraining.Nowforeachinstanceinthehold-outsettherearethreepredictedvalues.Wecancreateanewtrainingsetusingthesepredictedvaluesasinputfeatures(whichmakesthisnewtrainingsetthree-dimensional),andkeepingthetargetvalues.Theblenderistrainedonthisnewtrainingset,soitlearnstopredictthetargetvaluegiventhefirstlayer’spredictions.
Figure7-14.Trainingtheblender
Itisactuallypossibletotrainseveraldifferentblendersthisway(e.g.,oneusingLinearRegression,anotherusingRandomForestRegression,andsoon):wegetawholelayerofblenders.Thetrickistosplitthetrainingsetintothreesubsets:thefirstoneisusedtotrainthefirstlayer,thesecondoneisusedtocreatethetrainingsetusedtotrainthesecondlayer(usingpredictionsmadebythepredictorsofthefirstlayer),andthethirdoneisusedtocreatethetrainingsettotrainthethirdlayer(usingpredictionsmadebythepredictorsofthesecondlayer).Oncethisisdone,wecanmakeapredictionforanewinstancebygoingthrougheachlayersequentially,asshowninFigure7-15.
Figure7-15.Predictionsinamultilayerstackingensemble
Unfortunately, Scikit-Learn does not support stacking directly, but it is not too hard to roll out your own implementation (see the following exercises). Alternatively, you can use an open source implementation such as brew (available at https://github.com/viisar/brew).
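If you want to roll your own, here is a minimal sketch of the hold-out approach described above. The choice of first-layer predictors and of a Linear Regression blender is an arbitrary assumption, and X and y are assumed to be already loaded:

import numpy as np
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Split the training set: one part for the first layer, one held out for the blender
X_train_1, X_hold, y_train_1, y_hold = train_test_split(X, y, test_size=0.5)

# Train the first-layer predictors
first_layer = [RandomForestRegressor(n_estimators=50),
               ExtraTreesRegressor(n_estimators=50),
               DecisionTreeRegressor(max_depth=4)]
for predictor in first_layer:
    predictor.fit(X_train_1, y_train_1)

# Their predictions on the held-out set become the blender's training features
X_blend = np.column_stack([p.predict(X_hold) for p in first_layer])
blender = LinearRegression()
blender.fit(X_blend, y_hold)

# To predict on new instances, go through both layers
def stacking_predict(X_new):
    features = np.column_stack([p.predict(X_new) for p in first_layer])
    return blender.predict(features)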
Exercises1. Ifyouhavetrainedfivedifferentmodelsontheexactsametrainingdata,andtheyallachieve95%
precision,isthereanychancethatyoucancombinethesemodelstogetbetterresults?Ifso,how?Ifnot,why?
2. Whatisthedifferencebetweenhardandsoftvotingclassifiers?
3. Isitpossibletospeeduptrainingofabaggingensemblebydistributingitacrossmultipleservers?Whataboutpastingensembles,boostingensembles,randomforests,orstackingensembles?
4. Whatisthebenefitofout-of-bagevaluation?
5. WhatmakesExtra-TreesmorerandomthanregularRandomForests?Howcanthisextrarandomnesshelp?AreExtra-TreesslowerorfasterthanregularRandomForests?
6. IfyourAdaBoostensembleunderfitsthetrainingdata,whathyperparametersshouldyoutweakandhow?
7. IfyourGradientBoostingensembleoverfitsthetrainingset,shouldyouincreaseordecreasethelearningrate?
8. LoadtheMNISTdata(introducedinChapter3),andsplititintoatrainingset,avalidationset,andatestset(e.g.,use40,000instancesfortraining,10,000forvalidation,and10,000fortesting).Thentrainvariousclassifiers,suchasaRandomForestclassifier,anExtra-Treesclassifier,andanSVM.Next,trytocombinethemintoanensemblethatoutperformsthemallonthevalidationset,usingasoftorhardvotingclassifier.Onceyouhavefoundone,tryitonthetestset.Howmuchbetterdoesitperformcomparedtotheindividualclassifiers?
9. Runtheindividualclassifiersfromthepreviousexercisetomakepredictionsonthevalidationset,andcreateanewtrainingsetwiththeresultingpredictions:eachtraininginstanceisavectorcontainingthesetofpredictionsfromallyourclassifiersforanimage,andthetargetistheimage’sclass.Congratulations,youhavejusttrainedablender,andtogetherwiththeclassifierstheyformastackingensemble!Nowlet’sevaluatetheensembleonthetestset.Foreachimageinthetestset,makepredictionswithallyourclassifiers,thenfeedthepredictionstotheblendertogettheensemble’spredictions.Howdoesitcomparetothevotingclassifieryoutrainedearlier?
SolutionstotheseexercisesareavailableinAppendixA.
1. "Bagging Predictors," L. Breiman (1996).

2. In statistics, resampling with replacement is called bootstrapping.

3. "Pasting small votes for classification in large databases and on-line," L. Breiman (1999).

4. Bias and variance were introduced in Chapter 4.

5. max_samples can alternatively be set to a float between 0.0 and 1.0, in which case the max number of instances to sample is equal to the size of the training set times max_samples.
6. As m grows, this ratio approaches 1 – exp(–1) ≈ 63.212%.

7. "Ensembles on Random Patches," G. Louppe and P. Geurts (2012).

8. "The random subspace method for constructing decision forests," Tin Kam Ho (1998).

9. "Random Decision Forests," T. Ho (1995).

10. The BaggingClassifier class remains useful if you want a bag of something other than Decision Trees.

11. There are a few notable exceptions: splitter is absent (forced to "random"), presort is absent (forced to False), max_samples is absent (forced to 1.0), and base_estimator is absent (forced to DecisionTreeClassifier with the provided hyperparameters).

12. "Extremely randomized trees," P. Geurts, D. Ernst, L. Wehenkel (2005).

13. "A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting," Yoav Freund, Robert E. Schapire (1997).

14. This is just for illustrative purposes. SVMs are generally not good base predictors for AdaBoost, because they are slow and tend to be unstable with AdaBoost.

15. The original AdaBoost algorithm does not use a learning rate hyperparameter.

16. For more details, see "Multi-Class AdaBoost," J. Zhu et al. (2006).

17. First introduced in "Arcing the Edge," L. Breiman (1997).

18. "Stacked Generalization," D. Wolpert (1992).

19. Alternatively, it is possible to use out-of-fold predictions. In some contexts this is called stacking, while using a hold-out set is called blending. However, for many people these terms are synonymous.
Chapter 8. Dimensionality Reduction
ManyMachineLearningproblemsinvolvethousandsorevenmillionsoffeaturesforeachtraininginstance.Notonlydoesthismaketrainingextremelyslow,itcanalsomakeitmuchhardertofindagoodsolution,aswewillsee.Thisproblemisoftenreferredtoasthecurseofdimensionality.
Fortunately, in real-world problems, it is often possible to reduce the number of features considerably, turning an intractable problem into a tractable one. For example, consider the MNIST images (introduced in Chapter 3): the pixels on the image borders are almost always white, so you could completely drop these pixels from the training set without losing much information. Figure 7-6 confirms that these pixels are utterly unimportant for the classification task. Moreover, two neighboring pixels are often highly correlated: if you merge them into a single pixel (e.g., by taking the mean of the two pixel intensities), you will not lose much information.
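As a quick illustration of the first idea (assuming the MNIST features are already loaded in a NumPy array X of shape (n_samples, 784), and using an arbitrary threshold), you could drop the near-constant border pixels like this:

import numpy as np

# Keep only pixels whose intensity actually varies across the training set
pixel_std = X.std(axis=0)
informative = pixel_std > 1.0          # boolean mask of "useful" pixels
X_reduced = X[:, informative]
print(X.shape[1], "->", X_reduced.shape[1], "features")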
WARNINGReducingdimensionalitydoeslosesomeinformation(justlikecompressinganimagetoJPEGcandegradeitsquality),soeventhoughitwillspeeduptraining,itmayalsomakeyoursystemperformslightlyworse.Italsomakesyourpipelinesabitmorecomplexandthushardertomaintain.Soyoushouldfirsttrytotrainyoursystemwiththeoriginaldatabeforeconsideringusingdimensionalityreductioniftrainingistooslow.Insomecases,however,reducingthedimensionalityofthetrainingdatamayfilteroutsomenoiseandunnecessarydetailsandthusresultinhigherperformance(butingeneralitwon’t;itwilljustspeeduptraining).
Apartfromspeedinguptraining,dimensionalityreductionisalsoextremelyusefulfordatavisualization(orDataViz).Reducingthenumberofdimensionsdowntotwo(orthree)makesitpossibletoplotahigh-dimensionaltrainingsetonagraphandoftengainsomeimportantinsightsbyvisuallydetectingpatterns,suchasclusters.
Inthischapterwewilldiscussthecurseofdimensionalityandgetasenseofwhatgoesoninhigh-dimensionalspace.Then,wewillpresentthetwomainapproachestodimensionalityreduction(projectionandManifoldLearning),andwewillgothroughthreeofthemostpopulardimensionalityreductiontechniques:PCA,KernelPCA,andLLE.
TheCurseofDimensionalityWearesousedtolivinginthreedimensions1thatourintuitionfailsuswhenwetrytoimagineahigh-dimensionalspace.Evenabasic4Dhypercubeisincrediblyhardtopictureinourmind(seeFigure8-1),letalonea200-dimensionalellipsoidbentina1,000-dimensionalspace.
Figure8-1.Point,segment,square,cube,andtesseract(0Dto4Dhypercubes)2
It turns out that many things behave very differently in high-dimensional space. For example, if you pick a random point in a unit square (a 1 × 1 square), it will have only about a 0.4% chance of being located less than 0.001 from a border (in other words, it is very unlikely that a random point will be "extreme" along any dimension). But in a 10,000-dimensional unit hypercube (a 1 × 1 × ⋯ × 1 cube, with ten thousand 1s), this probability is greater than 99.999999%. Most points in a high-dimensional hypercube are very close to the border.3

Here is a more troublesome difference: if you pick two points randomly in a unit square, the distance between these two points will be, on average, roughly 0.52. If you pick two random points in a unit 3D cube, the average distance will be roughly 0.66. But what about two points picked randomly in a 1,000,000-dimensional hypercube? Well, the average distance, believe it or not, will be about 408.25 (roughly $\sqrt{1{,}000{,}000/6}$)! This is quite counterintuitive: how can two points be so far apart when they both lie within the same unit hypercube? This fact implies that high-dimensional datasets are at risk of being very sparse: most training instances are likely to be far away from each other. Of course, this also means that a new instance will likely be far away from any training instance, making predictions much less reliable than in lower dimensions, since they will be based on much larger extrapolations. In short, the more dimensions the training set has, the greater the risk of overfitting it.
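You can check these average distances with a quick Monte Carlo estimate (a rough sketch; the number of sampled pairs is arbitrary, and a smaller dimensionality is used to keep memory usage modest):

import numpy as np

def avg_distance(n_dims, n_pairs=1000):
    # Estimate the average distance between two random points in a unit hypercube
    a = np.random.rand(n_pairs, n_dims)
    b = np.random.rand(n_pairs, n_dims)
    return np.linalg.norm(a - b, axis=1).mean()

print(avg_distance(2))       # roughly 0.52
print(avg_distance(3))       # roughly 0.66
print(avg_distance(10000))   # roughly 40.8, i.e., about sqrt(10000 / 6)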
Intheory,onesolutiontothecurseofdimensionalitycouldbetoincreasethesizeofthetrainingsettoreachasufficientdensityoftraininginstances.Unfortunately,inpractice,thenumberoftraininginstancesrequiredtoreachagivendensitygrowsexponentiallywiththenumberofdimensions.Withjust100features(muchlessthanintheMNISTproblem),youwouldneedmoretraininginstancesthanatomsintheobservableuniverseinorderfortraininginstancestobewithin0.1ofeachotheronaverage,assumingtheywerespreadoutuniformlyacrossalldimensions.
MainApproachesforDimensionalityReductionBeforewediveintospecificdimensionalityreductionalgorithms,let’stakealookatthetwomainapproachestoreducingdimensionality:projectionandManifoldLearning.
ProjectionInmostreal-worldproblems,traininginstancesarenotspreadoutuniformlyacrossalldimensions.Manyfeaturesarealmostconstant,whileothersarehighlycorrelated(asdiscussedearlierforMNIST).Asaresult,alltraininginstancesactuallyliewithin(orcloseto)amuchlower-dimensionalsubspaceofthehigh-dimensionalspace.Thissoundsveryabstract,solet’slookatanexample.InFigure8-2youcanseea3Ddatasetrepresentedbythecircles.
Figure8-2.A3Ddatasetlyingclosetoa2Dsubspace
Noticethatalltraininginstanceslieclosetoaplane:thisisalower-dimensional(2D)subspaceofthehigh-dimensional(3D)space.Nowifweprojecteverytraininginstanceperpendicularlyontothissubspace(asrepresentedbytheshortlinesconnectingtheinstancestotheplane),wegetthenew2DdatasetshowninFigure8-3.Ta-da!Wehavejustreducedthedataset’sdimensionalityfrom3Dto2D.Notethattheaxescorrespondtonewfeaturesz1andz2(thecoordinatesoftheprojectionsontheplane).
Figure8-3.Thenew2Ddatasetafterprojection
However,projectionisnotalwaysthebestapproachtodimensionalityreduction.Inmanycasesthesubspacemaytwistandturn,suchasinthefamousSwissrolltoydatasetrepresentedinFigure8-4.
Figure8-4.Swissrolldataset
Simplyprojectingontoaplane(e.g.,bydroppingx3)wouldsquashdifferentlayersoftheSwissrolltogether,asshownontheleftofFigure8-5.However,whatyoureallywantistounrolltheSwissrolltoobtainthe2DdatasetontherightofFigure8-5.
Figure8-5.Squashingbyprojectingontoaplane(left)versusunrollingtheSwissroll(right)
ManifoldLearningTheSwissrollisanexampleofa2Dmanifold.Putsimply,a2Dmanifoldisa2Dshapethatcanbebentandtwistedinahigher-dimensionalspace.Moregenerally,ad-dimensionalmanifoldisapartofann-dimensionalspace(whered<n)thatlocallyresemblesad-dimensionalhyperplane.InthecaseoftheSwissroll,d=2andn=3:itlocallyresemblesa2Dplane,butitisrolledinthethirddimension.
Manydimensionalityreductionalgorithmsworkbymodelingthemanifoldonwhichthetraininginstanceslie;thisiscalledManifoldLearning.Itreliesonthemanifoldassumption,alsocalledthemanifoldhypothesis,whichholdsthatmostreal-worldhigh-dimensionaldatasetslieclosetoamuchlower-dimensionalmanifold.Thisassumptionisveryoftenempiricallyobserved.
Onceagain,thinkabouttheMNISTdataset:allhandwrittendigitimageshavesomesimilarities.Theyaremadeofconnectedlines,thebordersarewhite,theyaremoreorlesscentered,andsoon.Ifyourandomlygeneratedimages,onlyaridiculouslytinyfractionofthemwouldlooklikehandwrittendigits.Inotherwords,thedegreesoffreedomavailabletoyouifyoutrytocreateadigitimagearedramaticallylowerthanthedegreesoffreedomyouwouldhaveifyouwereallowedtogenerateanyimageyouwanted.Theseconstraintstendtosqueezethedatasetintoalower-dimensionalmanifold.
Themanifoldassumptionisoftenaccompaniedbyanotherimplicitassumption:thatthetaskathand(e.g.,classificationorregression)willbesimplerifexpressedinthelower-dimensionalspaceofthemanifold.Forexample,inthetoprowofFigure8-6theSwissrollissplitintotwoclasses:inthe3Dspace(ontheleft),thedecisionboundarywouldbefairlycomplex,butinthe2Dunrolledmanifoldspace(ontheright),thedecisionboundaryisasimplestraightline.
However,thisassumptiondoesnotalwayshold.Forexample,inthebottomrowofFigure8-6,thedecisionboundaryislocatedatx1=5.Thisdecisionboundarylooksverysimpleintheoriginal3Dspace(averticalplane),butitlooksmorecomplexintheunrolledmanifold(acollectionoffourindependentlinesegments).
Inshort,ifyoureducethedimensionalityofyourtrainingsetbeforetrainingamodel,itwilldefinitelyspeeduptraining,butitmaynotalwaysleadtoabetterorsimplersolution;italldependsonthedataset.
Hopefullyyounowhaveagoodsenseofwhatthecurseofdimensionalityisandhowdimensionalityreductionalgorithmscanfightit,especiallywhenthemanifoldassumptionholds.Therestofthischapterwillgothroughsomeofthemostpopularalgorithms.
Figure8-6.Thedecisionboundarymaynotalwaysbesimplerwithlowerdimensions
PCAPrincipalComponentAnalysis(PCA)isbyfarthemostpopulardimensionalityreductionalgorithm.Firstitidentifiesthehyperplanethatliesclosesttothedata,andthenitprojectsthedataontoit.
PreservingtheVarianceBeforeyoucanprojectthetrainingsetontoalower-dimensionalhyperplane,youfirstneedtochoosetherighthyperplane.Forexample,asimple2DdatasetisrepresentedontheleftofFigure8-7,alongwiththreedifferentaxes(i.e.,one-dimensionalhyperplanes).Ontherightistheresultoftheprojectionofthedatasetontoeachoftheseaxes.Asyoucansee,theprojectionontothesolidlinepreservesthemaximumvariance,whiletheprojectionontothedottedlinepreservesverylittlevariance,andtheprojectionontothedashedlinepreservesanintermediateamountofvariance.
Figure8-7.Selectingthesubspaceontowhichtoproject
Itseemsreasonabletoselecttheaxisthatpreservesthemaximumamountofvariance,asitwillmostlikelyloselessinformationthantheotherprojections.Anotherwaytojustifythischoiceisthatitistheaxisthatminimizesthemeansquareddistancebetweentheoriginaldatasetanditsprojectionontothataxis.ThisistherathersimpleideabehindPCA.4
PrincipalComponentsPCAidentifiestheaxisthataccountsforthelargestamountofvarianceinthetrainingset.InFigure8-7,itisthesolidline.Italsofindsasecondaxis,orthogonaltothefirstone,thataccountsforthelargestamountofremainingvariance.Inthis2Dexamplethereisnochoice:itisthedottedline.Ifitwereahigher-dimensionaldataset,PCAwouldalsofindathirdaxis,orthogonaltobothpreviousaxes,andafourth,afifth,andsoon—asmanyaxesasthenumberofdimensionsinthedataset.
Theunitvectorthatdefinestheithaxisiscalledtheithprincipalcomponent(PC).InFigure8-7,the1st
PCisc1andthe2ndPCisc2.InFigure8-2thefirsttwoPCsarerepresentedbytheorthogonalarrowsintheplane,andthethirdPCwouldbeorthogonaltotheplane(pointingupordown).
NOTEThedirectionoftheprincipalcomponentsisnotstable:ifyouperturbthetrainingsetslightlyandrunPCAagain,someofthenewPCsmaypointintheoppositedirectionoftheoriginalPCs.However,theywillgenerallystilllieonthesameaxes.Insomecases,apairofPCsmayevenrotateorswap,buttheplanetheydefinewillgenerallyremainthesame.
So how can you find the principal components of a training set? Luckily, there is a standard matrix factorization technique called Singular Value Decomposition (SVD) that can decompose the training set matrix X into the dot product of three matrices U · Σ · V^T, where V^T contains all the principal components that we are looking for, as shown in Equation 8-1.

Equation 8-1. Principal components matrix

$$ \mathbf{V}^T = \begin{pmatrix} \mid & \mid & & \mid \\ \mathbf{c}_1 & \mathbf{c}_2 & \cdots & \mathbf{c}_n \\ \mid & \mid & & \mid \end{pmatrix} $$
ThefollowingPythoncodeusesNumPy’ssvd()functiontoobtainalltheprincipalcomponentsofthetrainingset,thenextractsthefirsttwoPCs:
X_centered=X-X.mean(axis=0)
U,s,V=np.linalg.svd(X_centered)
c1=V.T[:,0]
c2=V.T[:,1]
WARNINGPCAassumesthatthedatasetiscenteredaroundtheorigin.Aswewillsee,Scikit-Learn’sPCAclassestakecareofcenteringthedataforyou.However,ifyouimplementPCAyourself(asintheprecedingexample),orifyouuseotherlibraries,don’tforgettocenterthedatafirst.
ProjectingDowntodDimensionsOnceyouhaveidentifiedalltheprincipalcomponents,youcanreducethedimensionalityofthedatasetdowntoddimensionsbyprojectingitontothehyperplanedefinedbythefirstdprincipalcomponents.Selectingthishyperplaneensuresthattheprojectionwillpreserveasmuchvarianceaspossible.Forexample,inFigure8-2the3Ddatasetisprojecteddowntothe2Dplanedefinedbythefirsttwoprincipalcomponents,preservingalargepartofthedataset’svariance.Asaresult,the2Dprojectionlooksverymuchliketheoriginal3Ddataset.
To project the training set onto the hyperplane, you can simply compute the dot product of the training set matrix X by the matrix W_d, defined as the matrix containing the first d principal components (i.e., the matrix composed of the first d columns of V^T), as shown in Equation 8-2.

Equation 8-2. Projecting the training set down to d dimensions

$$ \mathbf{X}_{d\text{-proj}} = \mathbf{X} \cdot \mathbf{W}_d $$
ThefollowingPythoncodeprojectsthetrainingsetontotheplanedefinedbythefirsttwoprincipalcomponents:
W2 = V.T[:, :2]
X2D = X_centered.dot(W2)
Thereyouhaveit!Younowknowhowtoreducethedimensionalityofanydatasetdowntoanynumberofdimensions,whilepreservingasmuchvarianceaspossible.
UsingScikit-LearnScikit-Learn’sPCAclassimplementsPCAusingSVDdecompositionjustlikewedidbefore.ThefollowingcodeappliesPCAtoreducethedimensionalityofthedatasetdowntotwodimensions(notethatitautomaticallytakescareofcenteringthedata):
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X2D = pca.fit_transform(X)
AfterfittingthePCAtransformertothedataset,youcanaccesstheprincipalcomponentsusingthecomponents_variable(notethatitcontainsthePCsashorizontalvectors,so,forexample,thefirstprincipalcomponentisequaltopca.components_.T[:,0]).
ExplainedVarianceRatioAnotherveryusefulpieceofinformationistheexplainedvarianceratioofeachprincipalcomponent,availableviatheexplained_variance_ratio_variable.Itindicatestheproportionofthedataset’svariancethatliesalongtheaxisofeachprincipalcomponent.Forexample,let’slookattheexplainedvarianceratiosofthefirsttwocomponentsofthe3DdatasetrepresentedinFigure8-2:
>>> pca.explained_variance_ratio_
array([ 0.84248607,  0.14631839])
Thistellsyouthat84.2%ofthedataset’svarianceliesalongthefirstaxis,and14.6%liesalongthesecondaxis.Thisleaveslessthan1.2%forthethirdaxis,soitisreasonabletoassumethatitprobablycarrieslittleinformation.
ChoosingtheRightNumberofDimensionsInsteadofarbitrarilychoosingthenumberofdimensionstoreducedownto,itisgenerallypreferabletochoosethenumberofdimensionsthatadduptoasufficientlylargeportionofthevariance(e.g.,95%).Unless,ofcourse,youarereducingdimensionalityfordatavisualization—inthatcaseyouwillgenerallywanttoreducethedimensionalitydownto2or3.
ThefollowingcodecomputesPCAwithoutreducingdimensionality,thencomputestheminimumnumberofdimensionsrequiredtopreserve95%ofthetrainingset’svariance:
pca = PCA()
pca.fit(X_train)
cumsum = np.cumsum(pca.explained_variance_ratio_)
d = np.argmax(cumsum >= 0.95) + 1
Youcouldthensetn_components=dandrunPCAagain.However,thereisamuchbetteroption:insteadofspecifyingthenumberofprincipalcomponentsyouwanttopreserve,youcansetn_componentstobeafloatbetween0.0and1.0,indicatingtheratioofvarianceyouwishtopreserve:
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_train)
Yetanotheroptionistoplottheexplainedvarianceasafunctionofthenumberofdimensions(simplyplotcumsum;seeFigure8-8).Therewillusuallybeanelbowinthecurve,wheretheexplainedvariancestopsgrowingfast.Youcanthinkofthisastheintrinsicdimensionalityofthedataset.Inthiscase,youcanseethatreducingthedimensionalitydowntoabout100dimensionswouldn’tlosetoomuchexplainedvariance.
Figure8-8.Explainedvarianceasafunctionofthenumberofdimensions
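A minimal sketch of such a plot, reusing the cumsum array computed above:

import matplotlib.pyplot as plt

# Cumulative explained variance versus the number of dimensions
plt.plot(np.arange(1, len(cumsum) + 1), cumsum)
plt.xlabel("Dimensions")
plt.ylabel("Explained variance")
plt.show()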
PCAforCompressionObviouslyafterdimensionalityreduction,thetrainingsettakesupmuchlessspace.Forexample,tryapplyingPCAtotheMNISTdatasetwhilepreserving95%ofitsvariance.Youshouldfindthateachinstancewillhavejustover150features,insteadoftheoriginal784features.Sowhilemostofthevarianceispreserved,thedatasetisnowlessthan20%ofitsoriginalsize!Thisisareasonablecompressionratio,andyoucanseehowthiscanspeedupaclassificationalgorithm(suchasanSVMclassifier)tremendously.
Itisalsopossibletodecompressthereduceddatasetbackto784dimensionsbyapplyingtheinversetransformationofthePCAprojection.Ofcoursethiswon’tgiveyoubacktheoriginaldata,sincetheprojectionlostabitofinformation(withinthe5%variancethatwasdropped),butitwilllikelybequiteclosetotheoriginaldata.Themeansquareddistancebetweentheoriginaldataandthereconstructeddata(compressedandthendecompressed)iscalledthereconstructionerror.Forexample,thefollowingcodecompressestheMNISTdatasetdownto154dimensions,thenusestheinverse_transform()methodtodecompressitbackto784dimensions.Figure8-9showsafewdigitsfromtheoriginaltrainingset(ontheleft),andthecorrespondingdigitsaftercompressionanddecompression.Youcanseethatthereisaslightimagequalityloss,butthedigitsarestillmostlyintact.
pca = PCA(n_components=154)
X_reduced = pca.fit_transform(X_train)
X_recovered = pca.inverse_transform(X_reduced)
Figure8-9.MNISTcompressionpreserving95%ofthevariance
The equation of the inverse transformation is shown in Equation 8-3.

Equation 8-3. PCA inverse transformation, back to the original number of dimensions

$$ \mathbf{X}_{\text{recovered}} = \mathbf{X}_{d\text{-proj}} \cdot \mathbf{W}_d^T $$
IncrementalPCAOneproblemwiththeprecedingimplementationofPCAisthatitrequiresthewholetrainingsettofitinmemoryinorderfortheSVDalgorithmtorun.Fortunately,IncrementalPCA(IPCA)algorithmshavebeendeveloped:youcansplitthetrainingsetintomini-batchesandfeedanIPCAalgorithmonemini-batchatatime.Thisisusefulforlargetrainingsets,andalsotoapplyPCAonline(i.e.,onthefly,asnewinstancesarrive).
ThefollowingcodesplitstheMNISTdatasetinto100mini-batches(usingNumPy’sarray_split()function)andfeedsthemtoScikit-Learn’sIncrementalPCAclass5toreducethedimensionalityoftheMNISTdatasetdownto154dimensions(justlikebefore).Notethatyoumustcallthepartial_fit()methodwitheachmini-batchratherthanthefit()methodwiththewholetrainingset:
from sklearn.decomposition import IncrementalPCA

n_batches = 100
inc_pca = IncrementalPCA(n_components=154)
for X_batch in np.array_split(X_train, n_batches):
    inc_pca.partial_fit(X_batch)

X_reduced = inc_pca.transform(X_train)
Alternatively,youcanuseNumPy’smemmapclass,whichallowsyoutomanipulatealargearraystoredinabinaryfileondiskasifitwereentirelyinmemory;theclassloadsonlythedataitneedsinmemory,whenitneedsit.SincetheIncrementalPCAclassusesonlyasmallpartofthearrayatanygiventime,thememoryusageremainsundercontrol.Thismakesitpossibletocalltheusualfit()method,asyoucanseeinthefollowingcode:
X_mm = np.memmap(filename, dtype="float32", mode="readonly", shape=(m, n))

batch_size = m // n_batches
inc_pca = IncrementalPCA(n_components=154, batch_size=batch_size)
inc_pca.fit(X_mm)
RandomizedPCAScikit-LearnoffersyetanotheroptiontoperformPCA,calledRandomizedPCA.Thisisastochasticalgorithmthatquicklyfindsanapproximationofthefirstdprincipalcomponents.ItscomputationalcomplexityisO(m×d2)+O(d3),insteadofO(m×n2)+O(n3),soitisdramaticallyfasterthanthepreviousalgorithmswhendismuchsmallerthann.
rnd_pca = PCA(n_components=154, svd_solver="randomized")
X_reduced = rnd_pca.fit_transform(X_train)
KernelPCAInChapter5wediscussedthekerneltrick,amathematicaltechniquethatimplicitlymapsinstancesintoaveryhigh-dimensionalspace(calledthefeaturespace),enablingnonlinearclassificationandregressionwithSupportVectorMachines.Recallthatalineardecisionboundaryinthehigh-dimensionalfeaturespacecorrespondstoacomplexnonlineardecisionboundaryintheoriginalspace.
ItturnsoutthatthesametrickcanbeappliedtoPCA,makingitpossibletoperformcomplexnonlinearprojectionsfordimensionalityreduction.ThisiscalledKernelPCA(kPCA).6Itisoftengoodatpreservingclustersofinstancesafterprojection,orsometimesevenunrollingdatasetsthatlieclosetoatwistedmanifold.
Forexample,thefollowingcodeusesScikit-Learn’sKernelPCAclasstoperformkPCAwithanRBFkernel(seeChapter5formoredetailsabouttheRBFkernelandtheotherkernels):
from sklearn.decomposition import KernelPCA

rbf_pca = KernelPCA(n_components=2, kernel="rbf", gamma=0.04)
X_reduced = rbf_pca.fit_transform(X)
Figure8-10showstheSwissroll,reducedtotwodimensionsusingalinearkernel(equivalenttosimplyusingthePCAclass),anRBFkernel,andasigmoidkernel(Logistic).
Figure8-10.Swissrollreducedto2DusingkPCAwithvariouskernels
SelectingaKernelandTuningHyperparametersAskPCAisanunsupervisedlearningalgorithm,thereisnoobviousperformancemeasuretohelpyouselectthebestkernelandhyperparametervalues.However,dimensionalityreductionisoftenapreparationstepforasupervisedlearningtask(e.g.,classification),soyoucansimplyusegridsearchtoselectthekernelandhyperparametersthatleadtothebestperformanceonthattask.Forexample,thefollowingcodecreatesatwo-steppipeline,firstreducingdimensionalitytotwodimensionsusingkPCA,thenapplyingLogisticRegressionforclassification.ThenitusesGridSearchCVtofindthebestkernelandgammavalueforkPCAinordertogetthebestclassificationaccuracyattheendofthepipeline:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

clf = Pipeline([
        ("kpca", KernelPCA(n_components=2)),
        ("log_reg", LogisticRegression())
    ])

param_grid = [{
        "kpca__gamma": np.linspace(0.03, 0.05, 10),
        "kpca__kernel": ["rbf", "sigmoid"]
    }]

grid_search = GridSearchCV(clf, param_grid, cv=3)
grid_search.fit(X, y)
Thebestkernelandhyperparametersarethenavailablethroughthebest_params_variable:
>>> print(grid_search.best_params_)
{'kpca__gamma': 0.043333333333333335, 'kpca__kernel': 'rbf'}
Anotherapproach,thistimeentirelyunsupervised,istoselectthekernelandhyperparametersthatyieldthelowestreconstructionerror.However,reconstructionisnotaseasyaswithlinearPCA.Here’swhy.Figure8-11showstheoriginalSwissroll3Ddataset(topleft),andtheresulting2DdatasetafterkPCAisappliedusinganRBFkernel(topright).Thankstothekerneltrick,thisismathematicallyequivalenttomappingthetrainingsettoaninfinite-dimensionalfeaturespace(bottomright)usingthefeaturemapφ,thenprojectingthetransformedtrainingsetdownto2DusinglinearPCA.NoticethatifwecouldinvertthelinearPCAstepforagiveninstanceinthereducedspace,thereconstructedpointwouldlieinfeaturespace,notintheoriginalspace(e.g.,liketheonerepresentedbyanxinthediagram).Sincethefeaturespaceisinfinite-dimensional,wecannotcomputethereconstructedpoint,andthereforewecannotcomputethetruereconstructionerror.Fortunately,itispossibletofindapointintheoriginalspacethatwouldmapclosetothereconstructedpoint.Thisiscalledthereconstructionpre-image.Onceyouhavethispre-image,youcanmeasureitssquareddistancetotheoriginalinstance.Youcanthenselectthekernelandhyperparametersthatminimizethisreconstructionpre-imageerror.
Figure8-11.KernelPCAandthereconstructionpre-imageerror
Youmaybewonderinghowtoperformthisreconstruction.Onesolutionistotrainasupervisedregressionmodel,withtheprojectedinstancesasthetrainingsetandtheoriginalinstancesasthetargets.Scikit-Learnwilldothisautomaticallyifyousetfit_inverse_transform=True,asshowninthefollowingcode:7
rbf_pca = KernelPCA(n_components=2, kernel="rbf", gamma=0.0433,
                    fit_inverse_transform=True)
X_reduced = rbf_pca.fit_transform(X)
X_preimage = rbf_pca.inverse_transform(X_reduced)
NOTEBydefault,fit_inverse_transform=FalseandKernelPCAhasnoinverse_transform()method.Thismethodonlygetscreatedwhenyousetfit_inverse_transform=True.
Youcanthencomputethereconstructionpre-imageerror:
>>> from sklearn.metrics import mean_squared_error
>>> mean_squared_error(X, X_preimage)
32.786308795766132
Now you can use grid search with cross-validation to find the kernel and hyperparameters that minimize this pre-image reconstruction error.
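Since GridSearchCV expects a supervised scoring signature, one simple way to apply this idea is a plain loop over candidate hyperparameters (a minimal sketch; the candidate grid is arbitrary):

best_error, best_params = float("inf"), None
for kernel in ("rbf", "sigmoid"):
    for gamma in np.linspace(0.03, 0.05, 10):
        kpca = KernelPCA(n_components=2, kernel=kernel, gamma=gamma,
                         fit_inverse_transform=True)
        # Reconstruction pre-image error for this kernel/gamma combination
        X_preimage = kpca.inverse_transform(kpca.fit_transform(X))
        error = mean_squared_error(X, X_preimage)
        if error < best_error:
            best_error, best_params = error, (kernel, gamma)

print(best_params, best_error)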
LLELocallyLinearEmbedding(LLE)8isanotherverypowerfulnonlineardimensionalityreduction(NLDR)technique.ItisaManifoldLearningtechniquethatdoesnotrelyonprojectionslikethepreviousalgorithms.Inanutshell,LLEworksbyfirstmeasuringhoweachtraininginstancelinearlyrelatestoitsclosestneighbors(c.n.),andthenlookingforalow-dimensionalrepresentationofthetrainingsetwheretheselocalrelationshipsarebestpreserved(moredetailsshortly).Thismakesitparticularlygoodatunrollingtwistedmanifolds,especiallywhenthereisnottoomuchnoise.
Forexample,thefollowingcodeusesScikit-Learn’sLocallyLinearEmbeddingclasstounrolltheSwissroll.Theresulting2DdatasetisshowninFigure8-12.Asyoucansee,theSwissrolliscompletelyunrolledandthedistancesbetweeninstancesarelocallywellpreserved.However,distancesarenotpreservedonalargerscale:theleftpartoftheunrolledSwissrollissqueezed,whiletherightpartisstretched.Nevertheless,LLEdidaprettygoodjobatmodelingthemanifold.
from sklearn.manifold import LocallyLinearEmbedding

lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10)
X_reduced = lle.fit_transform(X)
Figure8-12.UnrolledSwissrollusingLLE
Here's how LLE works: first, for each training instance x^(i), the algorithm identifies its k closest neighbors (in the preceding code k = 10), then tries to reconstruct x^(i) as a linear function of these neighbors. More specifically, it finds the weights w_{i,j} such that the squared distance between x^(i) and $\sum_{j=1}^{m} w_{i,j}\mathbf{x}^{(j)}$ is as small as possible, assuming w_{i,j} = 0 if x^(j) is not one of the k closest neighbors of x^(i). Thus the first step of LLE is the constrained optimization problem described in Equation 8-4, where W is the weight matrix containing all the weights w_{i,j}. The second constraint simply normalizes the weights for each training instance x^(i).

Equation 8-4. LLE step 1: linearly modeling local relationships

$$ \hat{\mathbf{W}} = \underset{\mathbf{W}}{\operatorname{argmin}} \sum_{i=1}^{m} \left\| \mathbf{x}^{(i)} - \sum_{j=1}^{m} w_{i,j}\mathbf{x}^{(j)} \right\|^2 \quad \text{subject to} \quad \begin{cases} w_{i,j} = 0 & \text{if } \mathbf{x}^{(j)} \text{ is not one of the } k \text{ c.n. of } \mathbf{x}^{(i)} \\ \sum_{j=1}^{m} w_{i,j} = 1 & \text{for } i = 1, 2, \dots, m \end{cases} $$

After this step, the weight matrix $\hat{\mathbf{W}}$ (containing the weights $\hat{w}_{i,j}$) encodes the local linear relationships between the training instances. Now the second step is to map the training instances into a d-dimensional space (where d < n) while preserving these local relationships as much as possible. If z^(i) is the image of x^(i) in this d-dimensional space, then we want the squared distance between z^(i) and $\sum_{j=1}^{m} \hat{w}_{i,j}\mathbf{z}^{(j)}$ to be as small as possible. This idea leads to the unconstrained optimization problem described in Equation 8-5. It looks very similar to the first step, but instead of keeping the instances fixed and finding the optimal weights, we are doing the reverse: keeping the weights fixed and finding the optimal position of the instances' images in the low-dimensional space. Note that Z is the matrix containing all z^(i).

Equation 8-5. LLE step 2: reducing dimensionality while preserving relationships

$$ \hat{\mathbf{Z}} = \underset{\mathbf{Z}}{\operatorname{argmin}} \sum_{i=1}^{m} \left\| \mathbf{z}^{(i)} - \sum_{j=1}^{m} \hat{w}_{i,j}\mathbf{z}^{(j)} \right\|^2 $$
Scikit-Learn’sLLEimplementationhasthefollowingcomputationalcomplexity:O(mlog(m)nlog(k))forfindingtheknearestneighbors,O(mnk3)foroptimizingtheweights,andO(dm2)forconstructingthelow-dimensionalrepresentations.Unfortunately,them2inthelasttermmakesthisalgorithmscalepoorlytoverylargedatasets.
Other Dimensionality Reduction Techniques

There are many other dimensionality reduction techniques, several of which are available in Scikit-Learn. Here are some of the most popular (a short usage sketch follows the list):

Multidimensional Scaling (MDS) reduces dimensionality while trying to preserve the distances between the instances (see Figure 8-13).

Isomap creates a graph by connecting each instance to its nearest neighbors, then reduces dimensionality while trying to preserve the geodesic distances9 between the instances.

t-Distributed Stochastic Neighbor Embedding (t-SNE) reduces dimensionality while trying to keep similar instances close and dissimilar instances apart. It is mostly used for visualization, in particular to visualize clusters of instances in high-dimensional space (e.g., to visualize the MNIST images in 2D).

Linear Discriminant Analysis (LDA) is actually a classification algorithm, but during training it learns the most discriminative axes between the classes, and these axes can then be used to define a hyperplane onto which to project the data. The benefit is that the projection will keep classes as far apart as possible, so LDA is a good technique to reduce dimensionality before running another classification algorithm such as an SVM classifier.
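As a quick usage sketch (hyperparameter values are illustrative, and X and y are assumed to be a labeled dataset loaded earlier; LDA is supervised and therefore needs the labels), all of these are available in Scikit-Learn:

from sklearn.manifold import MDS, Isomap, TSNE
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X_mds = MDS(n_components=2).fit_transform(X)
X_iso = Isomap(n_components=2, n_neighbors=5).fit_transform(X)
X_tsne = TSNE(n_components=2).fit_transform(X)
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)  # needs labels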
Figure8-13.ReducingtheSwissrollto2Dusingvarioustechniques
Exercises1. Whatarethemainmotivationsforreducingadataset’sdimensionality?Whatarethemain
drawbacks?
2. Whatisthecurseofdimensionality?
3. Onceadataset’sdimensionalityhasbeenreduced,isitpossibletoreversetheoperation?Ifso,how?Ifnot,why?
4. CanPCAbeusedtoreducethedimensionalityofahighlynonlineardataset?
5. SupposeyouperformPCAona1,000-dimensionaldataset,settingtheexplainedvarianceratioto95%.Howmanydimensionswilltheresultingdatasethave?
6. InwhatcaseswouldyouusevanillaPCA,IncrementalPCA,RandomizedPCA,orKernelPCA?
7. Howcanyouevaluatetheperformanceofadimensionalityreductionalgorithmonyourdataset?
8. Doesitmakeanysensetochaintwodifferentdimensionalityreductionalgorithms?
9. LoadtheMNISTdataset(introducedinChapter3)andsplititintoatrainingsetandatestset(takethefirst60,000instancesfortraining,andtheremaining10,000fortesting).TrainaRandomForestclassifieronthedatasetandtimehowlongittakes,thenevaluatetheresultingmodelonthetestset.Next,usePCAtoreducethedataset’sdimensionality,withanexplainedvarianceratioof95%.TrainanewRandomForestclassifieronthereduceddatasetandseehowlongittakes.Wastrainingmuchfaster?Nextevaluatetheclassifieronthetestset:howdoesitcomparetothepreviousclassifier?
10. Uset-SNEtoreducetheMNISTdatasetdowntotwodimensionsandplottheresultusingMatplotlib.Youcanuseascatterplotusing10differentcolorstorepresenteachimage’stargetclass.Alternatively,youcanwritecoloreddigitsatthelocationofeachinstance,orevenplotscaled-downversionsofthedigitimagesthemselves(ifyouplotalldigits,thevisualizationwillbetoocluttered,soyoushouldeitherdrawarandomsampleorplotaninstanceonlyifnootherinstancehasalreadybeenplottedataclosedistance).Youshouldgetanicevisualizationwithwell-separatedclustersofdigits.TryusingotherdimensionalityreductionalgorithmssuchasPCA,LLE,orMDSandcomparetheresultingvisualizations.
SolutionstotheseexercisesareavailableinAppendixA.
1. Well, four dimensions if you count time, and a few more if you are a string theorist.

2. Watch a rotating tesseract projected into 3D space at http://goo.gl/OM7ktJ. Image by Wikipedia user NerdBoy1392 (Creative Commons BY-SA 3.0). Reproduced from https://en.wikipedia.org/wiki/Tesseract.

3. Fun fact: anyone you know is probably an extremist in at least one dimension (e.g., how much sugar they put in their coffee), if you consider enough dimensions.

4. "On Lines and Planes of Closest Fit to Systems of Points in Space," K. Pearson (1901).

5. Scikit-Learn uses the algorithm described in "Incremental Learning for Robust Visual Tracking," D. Ross et al. (2007).
6. "Kernel Principal Component Analysis," B. Schölkopf, A. Smola, K. Müller (1999).

7. Scikit-Learn uses the algorithm based on Kernel Ridge Regression described in Gokhan H. Bakır, Jason Weston, and Bernhard Scholkopf, "Learning to Find Pre-images" (Tubingen, Germany: Max Planck Institute for Biological Cybernetics, 2004).

8. "Nonlinear Dimensionality Reduction by Locally Linear Embedding," S. Roweis, L. Saul (2000).

9. The geodesic distance between two nodes in a graph is the number of nodes on the shortest path between these nodes.
Part II. Neural Networks and Deep Learning

Chapter 9. Up and Running with TensorFlow
TensorFlowisapowerfulopensourcesoftwarelibraryfornumericalcomputation,particularlywellsuitedandfine-tunedforlarge-scaleMachineLearning.Itsbasicprincipleissimple:youfirstdefineinPythonagraphofcomputationstoperform(forexample,theoneinFigure9-1),andthenTensorFlowtakesthatgraphandrunsitefficientlyusingoptimizedC++code.
Figure9-1.Asimplecomputationgraph
Mostimportantly,itispossibletobreakupthegraphintoseveralchunksandruntheminparallelacrossmultipleCPUsorGPUs(asshowninFigure9-2).TensorFlowalsosupportsdistributedcomputing,soyoucantraincolossalneuralnetworksonhumongoustrainingsetsinareasonableamountoftimebysplittingthecomputationsacrosshundredsofservers(seeChapter12).TensorFlowcantrainanetworkwithmillionsofparametersonatrainingsetcomposedofbillionsofinstanceswithmillionsoffeatureseach.Thisshouldcomeasnosurprise,sinceTensorFlowwasdevelopedbytheGoogleBrainteamanditpowersmanyofGoogle’slarge-scaleservices,suchasGoogleCloudSpeech,GooglePhotos,andGoogleSearch.
Figure9-2.ParallelcomputationonmultipleCPUs/GPUs/servers
WhenTensorFlowwasopen-sourcedinNovember2015,therewerealreadymanypopularopensourcelibrariesforDeepLearning(Table9-1listsafew),andtobefairmostofTensorFlow’sfeaturesalreadyexistedinonelibraryoranother.Nevertheless,TensorFlow’scleandesign,scalability,flexibility,1andgreatdocumentation(nottomentionGoogle’sname)quicklyboostedittothetopofthelist.Inshort,TensorFlowwasdesignedtobeflexible,scalable,andproduction-ready,andexistingframeworksarguablyhitonlytwooutofthethreeofthese.HerearesomeofTensorFlow’shighlights:
ItrunsnotonlyonWindows,Linux,andmacOS,butalsoonmobiledevices,includingbothiOSandAndroid.
ItprovidesaverysimplePythonAPIcalledTF.Learn2(tensorflow.contrib.learn),compatiblewithScikit-Learn.Asyouwillsee,youcanuseittotrainvarioustypesofneuralnetworksinjustafewlinesofcode.ItwaspreviouslyanindependentprojectcalledScikitFlow(orskflow).
ItalsoprovidesanothersimpleAPIcalledTF-slim(tensorflow.contrib.slim)tosimplifybuilding,training,andevaluatingneuralnetworks.
Severalotherhigh-levelAPIshavebeenbuiltindependentlyontopofTensorFlow,suchasKeras(nowavailableintensorflow.contrib.keras)orPrettyTensor.
ItsmainPythonAPIoffersmuchmoreflexibility(atthecostofhighercomplexity)tocreateallsortsofcomputations,includinganyneuralnetworkarchitectureyoucanthinkof.
ItincludeshighlyefficientC++implementationsofmanyMLoperations,particularlythoseneededtobuildneuralnetworks.ThereisalsoaC++APItodefineyourownhigh-performanceoperations.
Itprovidesseveraladvancedoptimizationnodestosearchfortheparametersthatminimizeacostfunction.TheseareveryeasytousesinceTensorFlowautomaticallytakescareofcomputingthegradientsofthefunctionsyoudefine.Thisiscalledautomaticdifferentiating(orautodiff).
ItalsocomeswithagreatvisualizationtoolcalledTensorBoardthatallowsyoutobrowsethroughthecomputationgraph,viewlearningcurves,andmore.
GooglealsolaunchedacloudservicetorunTensorFlowgraphs.
Lastbutnotleast,ithasadedicatedteamofpassionateandhelpfuldevelopers,andagrowingcommunitycontributingtoimprovingit.ItisoneofthemostpopularopensourceprojectsonGitHub,andmoreandmoregreatprojectsarebeingbuiltontopofit(forexamples,checkouttheresourcespageonhttps://www.tensorflow.org/,orhttps://github.com/jtoy/awesome-tensorflow).Toasktechnicalquestions,youshouldusehttp://stackoverflow.com/andtagyourquestionwith"tensorflow".YoucanfilebugsandfeaturerequeststhroughGitHub.Forgeneraldiscussions,jointheGooglegroup.
Inthischapter,wewillgothroughthebasicsofTensorFlow,frominstallationtocreating,running,saving,andvisualizingsimplecomputationalgraphs.Masteringthesebasicsisimportantbeforeyoubuildyourfirstneuralnetwork(whichwewilldointhenextchapter).
Table 9-1. Open source Deep Learning libraries (not an exhaustive list)

Library        | API                   | Platforms                             | Started by                                 | Year
Caffe          | Python, C++, Matlab   | Linux, macOS, Windows                 | Y. Jia, UC Berkeley (BVLC)                 | 2013
Deeplearning4j | Java, Scala, Clojure  | Linux, macOS, Windows, Android        | A. Gibson, J. Patterson                    | 2014
H2O            | Python, R             | Linux, macOS, Windows                 | H2O.ai                                     | 2014
MXNet          | Python, C++, others   | Linux, macOS, Windows, iOS, Android   | DMLC                                       | 2015
TensorFlow     | Python, C++           | Linux, macOS, Windows, iOS, Android   | Google                                     | 2015
Theano         | Python                | Linux, macOS, iOS                     | University of Montreal                     | 2010
Torch          | C++, Lua              | Linux, macOS, iOS, Android            | R. Collobert, K. Kavukcuoglu, C. Farabet   | 2002
Installation

Let's get started! Assuming you installed Jupyter and Scikit-Learn by following the installation instructions in Chapter 2, you can simply use pip to install TensorFlow. If you created an isolated environment using virtualenv, you first need to activate it:
$ cd $ML_PATH               # Your ML working directory (e.g., $HOME/ml)
$ source env/bin/activate

Next, install TensorFlow:

$ pip3 install --upgrade tensorflow

NOTE For GPU support, you need to install tensorflow-gpu instead of tensorflow. See Chapter 12 for more details.

To test your installation, type the following command. It should output the version of TensorFlow you installed.

$ python3 -c 'import tensorflow; print(tensorflow.__version__)'
1.0.0
Creating Your First Graph and Running It in a Session

The following code creates the graph represented in Figure 9-1:
import tensorflow as tf

x = tf.Variable(3, name="x")
y = tf.Variable(4, name="y")
f = x*x*y + y + 2
That's all there is to it! The most important thing to understand is that this code does not actually perform any computation, even though it looks like it does (especially the last line). It just creates a computation graph. In fact, even the variables are not initialized yet. To evaluate this graph, you need to open a TensorFlow session and use it to initialize the variables and evaluate f. A TensorFlow session takes care of placing the operations onto devices such as CPUs and GPUs and running them, and it holds all the variable values.3 The following code creates a session, initializes the variables, evaluates f, and then closes the session (which frees up resources):
>>> sess = tf.Session()
>>> sess.run(x.initializer)
>>> sess.run(y.initializer)
>>> result = sess.run(f)
>>> print(result)
42
>>> sess.close()
Havingtorepeatsess.run()allthetimeisabitcumbersome,butfortunatelythereisabetterway:
with tf.Session() as sess:
    x.initializer.run()
    y.initializer.run()
    result = f.eval()
Insidethewithblock,thesessionissetasthedefaultsession.Callingx.initializer.run()isequivalenttocallingtf.get_default_session().run(x.initializer),andsimilarlyf.eval()isequivalenttocallingtf.get_default_session().run(f).Thismakesthecodeeasiertoread.Moreover,thesessionisautomaticallyclosedattheendoftheblock.
Insteadofmanuallyrunningtheinitializerforeverysinglevariable,youcanusetheglobal_variables_initializer()function.Notethatitdoesnotactuallyperformtheinitializationimmediately,butrathercreatesanodeinthegraphthatwillinitializeallvariableswhenitisrun:
init = tf.global_variables_initializer()  # prepare an init node

with tf.Session() as sess:
    init.run()  # actually initialize all the variables
    result = f.eval()
InsideJupyterorwithinaPythonshellyoumayprefertocreateanInteractiveSession.TheonlydifferencefromaregularSessionisthatwhenanInteractiveSessioniscreateditautomaticallysets
itselfasthedefaultsession,soyoudon’tneedawithblock(butyoudoneedtoclosethesessionmanuallywhenyouaredonewithit):
>>> sess = tf.InteractiveSession()
>>> init.run()
>>> result = f.eval()
>>> print(result)
42
>>> sess.close()
ATensorFlowprogramistypicallysplitintotwoparts:thefirstpartbuildsacomputationgraph(thisiscalledtheconstructionphase),andthesecondpartrunsit(thisistheexecutionphase).TheconstructionphasetypicallybuildsacomputationgraphrepresentingtheMLmodelandthecomputationsrequiredtotrainit.Theexecutionphasegenerallyrunsaloopthatevaluatesatrainingsteprepeatedly(forexample,onesteppermini-batch),graduallyimprovingthemodelparameters.Wewillgothroughanexampleshortly.
Managing Graphs

Any node you create is automatically added to the default graph:
>>> x1 = tf.Variable(1)
>>> x1.graph is tf.get_default_graph()
True
Inmostcasesthisisfine,butsometimesyoumaywanttomanagemultipleindependentgraphs.YoucandothisbycreatinganewGraphandtemporarilymakingitthedefaultgraphinsideawithblock,likeso:
>>> graph = tf.Graph()
>>> with graph.as_default():
...     x2 = tf.Variable(2)
...
>>> x2.graph is graph
True
>>> x2.graph is tf.get_default_graph()
False
TIPInJupyter(orinaPythonshell),itiscommontorunthesamecommandsmorethanoncewhileyouareexperimenting.Asaresult,youmayendupwithadefaultgraphcontainingmanyduplicatenodes.OnesolutionistorestarttheJupyterkernel(orthePythonshell),butamoreconvenientsolutionistojustresetthedefaultgraphbyrunningtf.reset_default_graph().
Lifecycle of a Node Value

When you evaluate a node, TensorFlow automatically determines the set of nodes that it depends on and it evaluates these nodes first. For example, consider the following code:
w = tf.constant(3)
x = w + 2
y = x + 5
z = x * 3

with tf.Session() as sess:
    print(y.eval())  # 10
    print(z.eval())  # 15
First,thiscodedefinesaverysimplegraph.Thenitstartsasessionandrunsthegraphtoevaluatey:TensorFlowautomaticallydetectsthatydependsonx,whichdependsonw,soitfirstevaluatesw,thenx,theny,andreturnsthevalueofy.Finally,thecoderunsthegraphtoevaluatez.Onceagain,TensorFlowdetectsthatitmustfirstevaluatewandx.Itisimportanttonotethatitwillnotreusetheresultofthepreviousevaluationofwandx.Inshort,theprecedingcodeevaluateswandxtwice.
Allnodevaluesaredroppedbetweengraphruns,exceptvariablevalues,whicharemaintainedbythesessionacrossgraphruns(queuesandreadersalsomaintainsomestate,aswewillseeinChapter12).Avariablestartsitslifewhenitsinitializerisrun,anditendswhenthesessionisclosed.
Ifyouwanttoevaluateyandzefficiently,withoutevaluatingwandxtwiceasinthepreviouscode,youmustaskTensorFlowtoevaluatebothyandzinjustonegraphrun,asshowninthefollowingcode:
with tf.Session() as sess:
    y_val, z_val = sess.run([y, z])
    print(y_val)  # 10
    print(z_val)  # 15
WARNINGInsingle-processTensorFlow,multiplesessionsdonotshareanystate,eveniftheyreusethesamegraph(eachsessionwouldhaveitsowncopyofeveryvariable).IndistributedTensorFlow(seeChapter12),variablestateisstoredontheservers,notinthesessions,somultiplesessionscansharethesamevariables.
LinearRegressionwithTensorFlowTensorFlowoperations(alsocalledopsforshort)cantakeanynumberofinputsandproduceanynumberofoutputs.Forexample,theadditionandmultiplicationopseachtaketwoinputsandproduceoneoutput.Constantsandvariablestakenoinput(theyarecalledsourceops).Theinputsandoutputsaremultidimensionalarrays,calledtensors(hencethename“tensorflow”).JustlikeNumPyarrays,tensorshaveatypeandashape.Infact,inthePythonAPItensorsaresimplyrepresentedbyNumPyndarrays.Theytypicallycontainfloats,butyoucanalsousethemtocarrystrings(arbitrarybytearrays).
In the examples so far, the tensors just contained a single scalar value, but you can of course perform computations on arrays of any shape. For example, the following code manipulates 2D arrays to perform Linear Regression on the California housing dataset (introduced in Chapter 2). It starts by fetching the dataset; then it adds an extra bias input feature (x0 = 1) to all training instances (it does so using NumPy so it runs immediately); then it creates two TensorFlow constant nodes, X and y, to hold this data and the targets,4 and it uses some of the matrix operations provided by TensorFlow to define theta. These matrix functions (transpose(), matmul(), and matrix_inverse()) are self-explanatory, but as usual they do not perform any computations immediately; instead, they create nodes in the graph that will perform them when the graph is run. You may recognize that the definition of theta corresponds to the Normal Equation ($\hat{\theta} = (\mathbf{X}^T \cdot \mathbf{X})^{-1} \cdot \mathbf{X}^T \cdot \mathbf{y}$; see Chapter 4). Finally, the code creates a session and uses it to evaluate theta.
import numpy as np
from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing()
m, n = housing.data.shape
housing_data_plus_bias = np.c_[np.ones((m, 1)), housing.data]

X = tf.constant(housing_data_plus_bias, dtype=tf.float32, name="X")
y = tf.constant(housing.target.reshape(-1, 1), dtype=tf.float32, name="y")
XT = tf.transpose(X)
theta = tf.matmul(tf.matmul(tf.matrix_inverse(tf.matmul(XT, X)), XT), y)

with tf.Session() as sess:
    theta_value = theta.eval()
ThemainbenefitofthiscodeversuscomputingtheNormalEquationdirectlyusingNumPyisthatTensorFlowwillautomaticallyrunthisonyourGPUcardifyouhaveone(providedyouinstalledTensorFlowwithGPUsupport,ofcourse;seeChapter12formoredetails).
ImplementingGradientDescentLet’stryusingBatchGradientDescent(introducedinChapter4)insteadoftheNormalEquation.Firstwewilldothisbymanuallycomputingthegradients,thenwewilluseTensorFlow’sautodifffeaturetoletTensorFlowcomputethegradientsautomatically,andfinallywewilluseacoupleofTensorFlow’sout-of-the-boxoptimizers.
WARNINGWhenusingGradientDescent,rememberthatitisimportanttofirstnormalizetheinputfeaturevectors,orelsetrainingmaybemuchslower.YoucandothisusingTensorFlow,NumPy,Scikit-Learn’sStandardScaler,oranyothersolutionyouprefer.Thefollowingcodeassumesthatthisnormalizationhasalreadybeendone.
Manually Computing the Gradients

The following code should be fairly self-explanatory, except for a few new elements:

The random_uniform() function creates a node in the graph that will generate a tensor containing random values, given its shape and value range, much like NumPy's rand() function.

The assign() function creates a node that will assign a new value to a variable. In this case, it implements the Batch Gradient Descent step $\theta^{(\text{next step})} = \theta - \eta \nabla_\theta \text{MSE}(\theta)$.

The main loop executes the training step over and over again (n_epochs times), and every 100 iterations it prints out the current Mean Squared Error (mse). You should see the MSE go down at every iteration.
n_epochs = 1000
learning_rate = 0.01

X = tf.constant(scaled_housing_data_plus_bias, dtype=tf.float32, name="X")
y = tf.constant(housing.target.reshape(-1, 1), dtype=tf.float32, name="y")
theta = tf.Variable(tf.random_uniform([n + 1, 1], -1.0, 1.0), name="theta")
y_pred = tf.matmul(X, theta, name="predictions")
error = y_pred - y
mse = tf.reduce_mean(tf.square(error), name="mse")
gradients = 2/m * tf.matmul(tf.transpose(X), error)
training_op = tf.assign(theta, theta - learning_rate * gradients)

init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)

    for epoch in range(n_epochs):
        if epoch % 100 == 0:
            print("Epoch", epoch, "MSE =", mse.eval())
        sess.run(training_op)

    best_theta = theta.eval()
Using autodiff

The preceding code works fine, but it requires mathematically deriving the gradients from the cost function (MSE). In the case of Linear Regression, it is reasonably easy, but if you had to do this with deep neural networks you would get quite a headache: it would be tedious and error-prone. You could use symbolic differentiation to automatically find the equations for the partial derivatives for you, but the resulting code would not necessarily be very efficient.

To understand why, consider the function f(x) = exp(exp(exp(x))). If you know calculus, you can figure out its derivative f′(x) = exp(x) × exp(exp(x)) × exp(exp(exp(x))). If you code f(x) and f′(x) separately and exactly as they appear, your code will not be as efficient as it could be. A more efficient solution would be to write a function that first computes exp(x), then exp(exp(x)), then exp(exp(exp(x))), and returns all three. This gives you f(x) directly (the third term), and if you need the derivative you can just multiply all three terms and you are done. With the naïve approach you would have had to call the exp function nine times to compute both f(x) and f′(x). With this approach you just need to call it three times.
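That reasoning can be sketched in a few lines of plain Python (an illustration, not TensorFlow code):

from math import exp

def f_and_derivative(x):
    # Compute exp(x), exp(exp(x)), exp(exp(exp(x))) once each and reuse them
    a = exp(x)
    b = exp(a)
    c = exp(b)            # this is f(x)
    return c, a * b * c   # f(x) and f'(x) = exp(x) * exp(exp(x)) * exp(exp(exp(x)))

fx, dfx = f_and_derivative(0.5)  # only three calls to exp in total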
Itgetsworsewhenyourfunctionisdefinedbysomearbitrarycode.Canyoufindtheequation(orthecode)tocomputethepartialderivativesofthefollowingfunction?Hint:don’teventry.
def my_func(a, b):
    z = 0
    for i in range(100):
        z = a * np.cos(z + i) + z * np.sin(b - i)
    return z
Fortunately,TensorFlow’sautodifffeaturecomestotherescue:itcanautomaticallyandefficientlycomputethegradientsforyou.Simplyreplacethegradients=...lineintheGradientDescentcodeintheprevioussectionwiththefollowingline,andthecodewillcontinuetoworkjustfine:
gradients = tf.gradients(mse, [theta])[0]
Thegradients()functiontakesanop(inthiscasemse)andalistofvariables(inthiscasejusttheta),anditcreatesalistofops(onepervariable)tocomputethegradientsoftheopwithregardstoeachvariable.SothegradientsnodewillcomputethegradientvectoroftheMSEwithregardstotheta.
There are four main approaches to computing gradients automatically. They are summarized in Table 9-2. TensorFlow uses reverse-mode autodiff, which is perfect (efficient and accurate) when there are many inputs and few outputs, as is often the case in neural networks. It computes all the partial derivatives of the outputs with regards to all the inputs in just n_outputs + 1 graph traversals.
Table 9-2. Main solutions to compute gradients automatically

Technique                 | Nb of graph traversals to compute all gradients | Accuracy | Supports arbitrary code | Comment
Numerical differentiation | n_inputs + 1                                    | Low      | Yes                     | Trivial to implement
Symbolic differentiation  | N/A                                             | High     | No                      | Builds a very different graph
Forward-mode autodiff     | n_inputs                                        | High     | Yes                     | Uses dual numbers
Reverse-mode autodiff     | n_outputs + 1                                   | High     | Yes                     | Implemented by TensorFlow
Ifyouareinterestedinhowthismagicworks,checkoutAppendixD.
UsinganOptimizerSoTensorFlowcomputesthegradientsforyou.Butitgetseveneasier:italsoprovidesanumberofoptimizersoutofthebox,includingaGradientDescentoptimizer.Youcansimplyreplacetheprecedinggradients=...andtraining_op=...lineswiththefollowingcode,andonceagaineverythingwilljustworkfine:
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(mse)

If you want to use a different type of optimizer, you just need to change one line. For example, you can use a momentum optimizer (which often converges much faster than Gradient Descent; see Chapter 11) by defining the optimizer like this:

optimizer = tf.train.MomentumOptimizer(learning_rate=learning_rate,
                                       momentum=0.9)
FeedingDatatotheTrainingAlgorithmLet’strytomodifythepreviouscodetoimplementMini-batchGradientDescent.Forthis,weneedawaytoreplaceXandyateveryiterationwiththenextmini-batch.Thesimplestwaytodothisistouseplaceholdernodes.Thesenodesarespecialbecausetheydon’tactuallyperformanycomputation,theyjustoutputthedatayoutellthemtooutputatruntime.TheyaretypicallyusedtopassthetrainingdatatoTensorFlowduringtraining.Ifyoudon’tspecifyavalueatruntimeforaplaceholder,yougetanexception.
Tocreateaplaceholdernode,youmustcalltheplaceholder()functionandspecifytheoutputtensor’sdatatype.Optionally,youcanalsospecifyitsshape,ifyouwanttoenforceit.IfyouspecifyNoneforadimension,itmeans“anysize.”Forexample,thefollowingcodecreatesaplaceholdernodeA,andalsoanodeB=A+5.WhenweevaluateB,wepassafeed_dicttotheeval()methodthatspecifiesthevalueofA.NotethatAmusthaverank2(i.e.,itmustbetwo-dimensional)andtheremustbethreecolumns(orelseanexceptionisraised),butitcanhaveanynumberofrows.
>>> A = tf.placeholder(tf.float32, shape=(None, 3))
>>> B = A + 5
>>> with tf.Session() as sess:
...     B_val_1 = B.eval(feed_dict={A: [[1, 2, 3]]})
...     B_val_2 = B.eval(feed_dict={A: [[4, 5, 6], [7, 8, 9]]})
...
>>> print(B_val_1)
[[ 6.  7.  8.]]
>>> print(B_val_2)
[[  9.  10.  11.]
 [ 12.  13.  14.]]
NOTE You can actually feed the output of any operations, not just placeholders. In this case TensorFlow does not try to evaluate these operations; it uses the values you feed it.
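For example, reusing the small graph from the "Lifecycle of a Node Value" section, you can feed a value for the intermediate node x and TensorFlow will use it instead of evaluating w (a minimal sketch):

with tf.Session() as sess:
    # x is fed directly, so w is not evaluated in this run
    print(sess.run(y, feed_dict={x: 10}))  # 15, since y = x + 5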
ToimplementMini-batchGradientDescent,weonlyneedtotweaktheexistingcodeslightly.FirstchangethedefinitionofXandyintheconstructionphasetomakethemplaceholdernodes:
X = tf.placeholder(tf.float32, shape=(None, n + 1), name="X")
y = tf.placeholder(tf.float32, shape=(None, 1), name="y")

Then define the batch size and compute the total number of batches:

batch_size = 100
n_batches = int(np.ceil(m / batch_size))
Finally,intheexecutionphase,fetchthemini-batchesonebyone,thenprovidethevalueofXandyviathefeed_dictparameterwhenevaluatinganodethatdependsoneitherofthem.
def fetch_batch(epoch, batch_index, batch_size):
    [...]  # load the data from disk
    return X_batch, y_batch

with tf.Session() as sess:
    sess.run(init)

    for epoch in range(n_epochs):
        for batch_index in range(n_batches):
            X_batch, y_batch = fetch_batch(epoch, batch_index, batch_size)
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})

    best_theta = theta.eval()
NOTEWedon’tneedtopassthevalueofXandywhenevaluatingthetasinceitdoesnotdependoneitherofthem.
Saving and Restoring Models
Once you have trained your model, you should save its parameters to disk so you can come back to it whenever you want, use it in another program, compare it to other models, and so on. Moreover, you probably want to save checkpoints at regular intervals during training so that if your computer crashes during training you can continue from the last checkpoint rather than start over from scratch.
TensorFlow makes saving and restoring a model very easy. Just create a Saver node at the end of the construction phase (after all variable nodes are created); then, in the execution phase, just call its save() method whenever you want to save the model, passing it the session and path of the checkpoint file:
[...]
theta=tf.Variable(tf.random_uniform([n+1,1],-1.0,1.0),name="theta")
[...]
init=tf.global_variables_initializer()
saver=tf.train.Saver()
with tf.Session() as sess:
    sess.run(init)

    for epoch in range(n_epochs):
        if epoch % 100 == 0:  # checkpoint every 100 epochs
            save_path = saver.save(sess, "/tmp/my_model.ckpt")

        sess.run(training_op)

    best_theta = theta.eval()
    save_path = saver.save(sess, "/tmp/my_model_final.ckpt")
Restoring a model is just as easy: you create a Saver at the end of the construction phase just like before, but then at the beginning of the execution phase, instead of initializing the variables using the init node, you call the restore() method of the Saver object:
with tf.Session() as sess:
    saver.restore(sess, "/tmp/my_model_final.ckpt")
    [...]
By default a Saver saves and restores all variables under their own name, but if you need more control, you can specify which variables to save or restore, and what names to use. For example, the following Saver will save or restore only the theta variable under the name weights:
saver=tf.train.Saver({"weights":theta})
By default, the save() method also saves the structure of the graph in a second file with the same name plus a .meta extension. You can load this graph structure using tf.train.import_meta_graph(). This adds the graph to the default graph, and returns a Saver instance that you can then use to restore the graph's state (i.e., the variable values):
saver=tf.train.import_meta_graph("/tmp/my_model_final.ckpt.meta")
with tf.Session() as sess:
    saver.restore(sess, "/tmp/my_model_final.ckpt")
    [...]
This allows you to fully restore a saved model, including both the graph structure and the variable values, without having to search for the code that built it.
Visualizing the Graph and Training Curves Using TensorBoard
So now we have a computation graph that trains a Linear Regression model using Mini-batch Gradient Descent, and we are saving checkpoints at regular intervals. Sounds sophisticated, doesn't it? However, we are still relying on the print() function to visualize progress during training. There is a better way: enter TensorBoard. If you feed it some training stats, it will display nice interactive visualizations of these stats in your web browser (e.g., learning curves). You can also provide it the graph's definition and it will give you a great interface to browse through it. This is very useful to identify errors in the graph, to find bottlenecks, and so on.
Thefirststepistotweakyourprogramabitsoitwritesthegraphdefinitionandsometrainingstats—forexample,thetrainingerror(MSE)—toalogdirectorythatTensorBoardwillreadfrom.Youneedtouseadifferentlogdirectoryeverytimeyourunyourprogram,orelseTensorBoardwillmergestatsfromdifferentruns,whichwillmessupthevisualizations.Thesimplestsolutionforthisistoincludeatimestampinthelogdirectoryname.Addthefollowingcodeatthebeginningoftheprogram:
from datetime import datetime
now=datetime.utcnow().strftime("%Y%m%d%H%M%S")
root_logdir="tf_logs"
logdir="{}/run-{}/".format(root_logdir,now)
Next, add the following code at the very end of the construction phase:
mse_summary=tf.summary.scalar('MSE',mse)
file_writer=tf.summary.FileWriter(logdir,tf.get_default_graph())
ThefirstlinecreatesanodeinthegraphthatwillevaluatetheMSEvalueandwriteittoaTensorBoard-compatiblebinarylogstringcalledasummary.ThesecondlinecreatesaFileWriterthatyouwillusetowritesummariestologfilesinthelogdirectory.Thefirstparameterindicatesthepathofthelogdirectory(inthiscasesomethingliketf_logs/run-20160906091959/,relativetothecurrentdirectory).Thesecond(optional)parameteristhegraphyouwanttovisualize.Uponcreation,theFileWritercreatesthelogdirectoryifitdoesnotalreadyexist(anditsparentdirectoriesifneeded),andwritesthegraphdefinitioninabinarylogfilecalledaneventsfile.
Nextyouneedtoupdatetheexecutionphasetoevaluatethemse_summarynoderegularlyduringtraining(e.g.,every10mini-batches).Thiswilloutputasummarythatyoucanthenwritetotheeventsfileusingthefile_writer.Hereistheupdatedcode:
[...]
for batch_index in range(n_batches):
    X_batch, y_batch = fetch_batch(epoch, batch_index, batch_size)
    if batch_index % 10 == 0:
        summary_str = mse_summary.eval(feed_dict={X: X_batch, y: y_batch})
        step = epoch * n_batches + batch_index
        file_writer.add_summary(summary_str, step)
    sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
[...]
WARNING: Avoid logging training stats at every single training step, as this would significantly slow down training.
Finally, you want to close the FileWriter at the end of the program:
file_writer.close()
Now run this program: it will create the log directory and write an events file in this directory, containing both the graph definition and the MSE values. Open up a shell and go to your working directory, then type ls -l tf_logs/run* to list the contents of the log directory:
$ cd $ML_PATH    # Your ML working directory (e.g., $HOME/ml)
$ ls -l tf_logs/run*
total 40
-rw-r--r--  1 ageron  staff  18620 Sep  6 11:10 events.out.tfevents.1472553182.mymac
If you run the program a second time, you should see a second directory in the tf_logs/ directory:
$ ls -l tf_logs/
total 0
drwxr-xr-x  3 ageron  staff  102 Sep  6 10:07 run-20160906091959
drwxr-xr-x  3 ageron  staff  102 Sep  6 10:22 run-20160906092202
Great!Nowit’stimetofireuptheTensorBoardserver.Youneedtoactivateyourvirtualenvenvironmentifyoucreatedone,thenstarttheserverbyrunningthetensorboardcommand,pointingittotherootlogdirectory.ThisstartstheTensorBoardwebserver,listeningonport6006(whichis“goog”writtenupsidedown):
$ source env/bin/activate
$ tensorboard --logdir tf_logs/
Starting TensorBoard on port 6006
(You can navigate to http://0.0.0.0:6006)
Nextopenabrowserandgotohttp://0.0.0.0:6006/(orhttp://localhost:6006/).WelcometoTensorBoard!IntheEventstabyoushouldseeMSEontheright.Ifyouclickonit,youwillseeaplotoftheMSEduringtraining,forbothruns(Figure9-3).Youcancheckorunchecktherunsyouwanttosee,zoominorout,hoveroverthecurvetogetdetails,andsoon.
Figure 9-3. Visualizing training stats using TensorBoard
Now click on the Graphs tab. You should see the graph shown in Figure 9-4.
To reduce clutter, the nodes that have many edges (i.e., connections to other nodes) are separated out to an auxiliary area on the right (you can move a node back and forth between the main graph and the auxiliary area by right-clicking on it). Some parts of the graph are also collapsed by default. For example, try hovering over the gradients node, then click on its expand (⊕) icon to expand this subgraph. Next, in this subgraph, try expanding the mse_grad subgraph.
Figure 9-4. Visualizing the graph using TensorBoard
TIPIfyouwanttotakeapeekatthegraphdirectlywithinJupyter,youcanusetheshow_graph()functionavailableinthenotebookforthischapter.ItwasoriginallywrittenbyA.Mordvintsevinhisgreatdeepdreamtutorialnotebook.AnotheroptionistoinstallE.Jang’sTensorFlowdebuggertoolwhichincludesaJupyterextensionforgraphvisualization(andmore).
Name Scopes
When dealing with more complex models such as neural networks, the graph can easily become cluttered with thousands of nodes. To avoid this, you can create name scopes to group related nodes. For example, let's modify the previous code to define the error and mse ops within a name scope called "loss":
withtf.name_scope("loss")asscope:
error=y_pred-y
mse=tf.reduce_mean(tf.square(error),name="mse")
The name of each op defined within the scope is now prefixed with "loss/":
>>> print(error.op.name)
loss/sub
>>> print(mse.op.name)
loss/mse
In TensorBoard, the mse and error nodes now appear inside the loss namespace, which appears collapsed by default (Figure 9-5).
Figure 9-5. A collapsed name scope in TensorBoard
Modularity
Suppose you want to create a graph that adds the output of two rectified linear units (ReLU). A ReLU computes a linear function of the inputs, and outputs the result if it is positive, and 0 otherwise, as shown in Equation 9-1.
Equation 9-1. Rectified linear unit

hw,b(X) = max(X · w + b, 0)
The following code does the job, but it's quite repetitive:
n_features=3
X=tf.placeholder(tf.float32,shape=(None,n_features),name="X")
w1=tf.Variable(tf.random_normal((n_features,1)),name="weights1")
w2=tf.Variable(tf.random_normal((n_features,1)),name="weights2")
b1=tf.Variable(0.0,name="bias1")
b2=tf.Variable(0.0,name="bias2")
z1=tf.add(tf.matmul(X,w1),b1,name="z1")
z2=tf.add(tf.matmul(X,w2),b2,name="z2")
relu1=tf.maximum(z1,0.,name="relu1")
relu2=tf.maximum(z1,0.,name="relu2")
output=tf.add(relu1,relu2,name="output")
Such repetitive code is hard to maintain and error-prone (in fact, this code contains a cut-and-paste error; did you spot it?). It would become even worse if you wanted to add a few more ReLUs. Fortunately, TensorFlow lets you stay DRY (Don't Repeat Yourself): simply create a function to build a ReLU. The following code creates five ReLUs and outputs their sum (note that add_n() creates an operation that will compute the sum of a list of tensors):
def relu(X):
    w_shape = (int(X.get_shape()[1]), 1)
    w = tf.Variable(tf.random_normal(w_shape), name="weights")
    b = tf.Variable(0.0, name="bias")
    z = tf.add(tf.matmul(X, w), b, name="z")
    return tf.maximum(z, 0., name="relu")

n_features = 3
X = tf.placeholder(tf.float32, shape=(None, n_features), name="X")
relus = [relu(X) for i in range(5)]
output = tf.add_n(relus, name="output")
Note that when you create a node, TensorFlow checks whether its name already exists, and if it does it appends an underscore followed by an index to make the name unique. So the first ReLU contains nodes named "weights", "bias", "z", and "relu" (plus many more nodes with their default name, such as "MatMul"); the second ReLU contains nodes named "weights_1", "bias_1", and so on; the third ReLU contains nodes named "weights_2", "bias_2", and so on. TensorBoard identifies such series and collapses them together to reduce clutter (as you can see in Figure 9-6).
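You can check this naming behavior directly; assuming the five ReLUs were just built as above, the returned tensors should be named something like this:

>>> print(relus[0].name)   # the first ReLU keeps the original name
relu:0
>>> print(relus[1].name)   # later ones get an index suffix: relu_1, relu_2, ...
relu_1:0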
Figure 9-6. Collapsed node series
Using name scopes, you can make the graph much clearer. Simply move all the content of the relu() function inside a name scope. Figure 9-7 shows the resulting graph. Notice that TensorFlow also gives the name scopes unique names by appending _1, _2, and so on.
def relu(X):
    with tf.name_scope("relu"):
        [...]
Figure 9-7. A clearer graph using name-scoped units
Sharing Variables
If you want to share a variable between various components of your graph, one simple option is to create it first, then pass it as a parameter to the functions that need it. For example, suppose you want to control the ReLU threshold (currently hardcoded to 0) using a shared threshold variable for all ReLUs. You could just create that variable first, and then pass it to the relu() function:
def relu(X, threshold):
    with tf.name_scope("relu"):
        [...]
        return tf.maximum(z, threshold, name="max")

threshold = tf.Variable(0.0, name="threshold")
X = tf.placeholder(tf.float32, shape=(None, n_features), name="X")
relus = [relu(X, threshold) for i in range(5)]
output = tf.add_n(relus, name="output")
Thisworksfine:nowyoucancontrolthethresholdforallReLUsusingthethresholdvariable.However,iftherearemanysharedparameterssuchasthisone,itwillbepainfultohavetopassthemaroundasparametersallthetime.ManypeoplecreateaPythondictionarycontainingallthevariablesintheirmodel,andpassitaroundtoeveryfunction.Otherscreateaclassforeachmodule(e.g.,aReLUclassusingclassvariablestohandlethesharedparameter).Yetanotheroptionistosetthesharedvariableasanattributeoftherelu()functionuponthefirstcall,likeso:
def relu(X):
    with tf.name_scope("relu"):
        if not hasattr(relu, "threshold"):
            relu.threshold = tf.Variable(0.0, name="threshold")
        [...]
        return tf.maximum(z, relu.threshold, name="max")
TensorFlowoffersanotheroption,whichmayleadtoslightlycleanerandmoremodularcodethantheprevioussolutions.5Thissolutionisabittrickytounderstandatfirst,butsinceitisusedalotinTensorFlowitisworthgoingintoabitofdetail.Theideaistousetheget_variable()functiontocreatethesharedvariableifitdoesnotexistyet,orreuseitifitalreadyexists.Thedesiredbehavior(creatingorreusing)iscontrolledbyanattributeofthecurrentvariable_scope().Forexample,thefollowingcodewillcreateavariablenamed"relu/threshold"(asascalar,sinceshape=(),andusing0.0astheinitialvalue):
withtf.variable_scope("relu"):
threshold=tf.get_variable("threshold",shape=(),
initializer=tf.constant_initializer(0.0))
Note that if the variable has already been created by an earlier call to get_variable(), this code will raise an exception. This behavior prevents reusing variables by mistake. If you want to reuse a variable, you need to explicitly say so by setting the variable scope's reuse attribute to True (in which case you don't have to specify the shape or the initializer):
withtf.variable_scope("relu",reuse=True):
threshold=tf.get_variable("threshold")
This code will fetch the existing "relu/threshold" variable, or raise an exception if it does not exist or if it was not created using get_variable(). Alternatively, you can set the reuse attribute to True inside the block by calling the scope's reuse_variables() method:
withtf.variable_scope("relu")asscope:
scope.reuse_variables()
threshold=tf.get_variable("threshold")
WARNING: Once reuse is set to True, it cannot be set back to False within the block. Moreover, if you define other variable scopes inside this one, they will automatically inherit reuse=True. Lastly, only variables created by get_variable() can be reused this way.
Now you have all the pieces you need to make the relu() function access the threshold variable without having to pass it as a parameter:
def relu(X):
    with tf.variable_scope("relu", reuse=True):
        threshold = tf.get_variable("threshold")  # reuse existing variable
        [...]
        return tf.maximum(z, threshold, name="max")

X = tf.placeholder(tf.float32, shape=(None, n_features), name="X")
with tf.variable_scope("relu"):  # create the variable
    threshold = tf.get_variable("threshold", shape=(),
                                initializer=tf.constant_initializer(0.0))
relus = [relu(X) for relu_index in range(5)]
output = tf.add_n(relus, name="output")
This code first defines the relu() function, then creates the relu/threshold variable (as a scalar that will later be initialized to 0.0) and builds five ReLUs by calling the relu() function. The relu() function reuses the relu/threshold variable, and creates the other ReLU nodes.
NOTEVariablescreatedusingget_variable()arealwaysnamedusingthenameoftheirvariable_scopeasaprefix(e.g.,"relu/threshold"),butforallothernodes(includingvariablescreatedwithtf.Variable())thevariablescopeactslikeanewnamescope.Inparticular,ifanamescopewithanidenticalnamewasalreadycreated,thenasuffixisaddedtomakethenameunique.Forexample,allnodescreatedintheprecedingcode(exceptthethresholdvariable)haveanameprefixedwith"relu_1/"to"relu_5/",asshowninFigure9-8.
Figure 9-8. Five ReLUs sharing the threshold variable
Itissomewhatunfortunatethatthethresholdvariablemustbedefinedoutsidetherelu()function,wherealltherestoftheReLUcoderesides.Tofixthis,thefollowingcodecreatesthethresholdvariablewithintherelu()functionuponthefirstcall,thenreusesitinsubsequentcalls.Nowtherelu()functiondoesnothavetoworryaboutnamescopesorvariablesharing:itjustcallsget_variable(),whichwillcreateorreusethethresholdvariable(itdoesnotneedtoknowwhichisthecase).Therestofthecodecallsrelu()fivetimes,makingsuretosetreuse=Falseonthefirstcall,andreuse=Truefortheothercalls.
def relu(X):
    threshold = tf.get_variable("threshold", shape=(),
                                initializer=tf.constant_initializer(0.0))
    [...]
    return tf.maximum(z, threshold, name="max")

X = tf.placeholder(tf.float32, shape=(None, n_features), name="X")
relus = []
for relu_index in range(5):
    with tf.variable_scope("relu", reuse=(relu_index >= 1)) as scope:
        relus.append(relu(X))
output = tf.add_n(relus, name="output")
The resulting graph is slightly different from before, since the shared variable lives within the first ReLU (see Figure 9-9).
Figure 9-9. Five ReLUs sharing the threshold variable
This concludes this introduction to TensorFlow. We will discuss more advanced topics as we go through the following chapters, in particular many operations related to deep neural networks, convolutional neural networks, and recurrent neural networks, as well as how to scale up with TensorFlow using multithreading, queues, multiple GPUs, and multiple servers.
Exercises
1. What are the main benefits of creating a computation graph rather than directly executing the computations? What are the main drawbacks?
2. Is the statement a_val = a.eval(session=sess) equivalent to a_val = sess.run(a)?
3. Is the statement a_val, b_val = a.eval(session=sess), b.eval(session=sess) equivalent to a_val, b_val = sess.run([a, b])?
4. Can you run two graphs in the same session?
5. If you create a graph g containing a variable w, then start two threads and open a session in each thread, both using the same graph g, will each session have its own copy of the variable w or will it be shared?
6. When is a variable initialized? When is it destroyed?
7. What is the difference between a placeholder and a variable?
8. What happens when you run the graph to evaluate an operation that depends on a placeholder but you don't feed its value? What happens if the operation does not depend on the placeholder?
9. When you run a graph, can you feed the output value of any operation, or just the value of placeholders?
10. How can you set a variable to any value you want (during the execution phase)?
11. How many times does reverse-mode autodiff need to traverse the graph in order to compute the gradients of the cost function with regard to 10 variables? What about forward-mode autodiff? And symbolic differentiation?
12. Implement Logistic Regression with Mini-batch Gradient Descent using TensorFlow. Train it and evaluate it on the moons dataset (introduced in Chapter 5). Try adding all the bells and whistles:
Define the graph within a logistic_regression() function that can be reused easily.
Save checkpoints using a Saver at regular intervals during training, and save the final model at the end of training.
Restore the last checkpoint upon startup if training was interrupted.
Define the graph using nice scopes so the graph looks good in TensorBoard.
Add summaries to visualize the learning curves in TensorBoard.
Try tweaking some hyperparameters such as the learning rate or the mini-batch size and look at the shape of the learning curve.
Solutions to these exercises are available in Appendix A.
1. TensorFlow is not limited to neural networks or even Machine Learning; you could run quantum physics simulations if you wanted.
2. Not to be confused with the TFLearn library, which is an independent project.
3. In distributed TensorFlow, variable values are stored on the servers instead of the session, as we will see in Chapter 12.
4. Note that housing.target is a 1D array, but we need to reshape it to a column vector to compute theta. Recall that NumPy's reshape() function accepts –1 (meaning "unspecified") for one of the dimensions: that dimension will be computed based on the array's length and the remaining dimensions.
5. Creating a ReLU class is arguably the cleanest option, but it is rather heavyweight.
Chapter 10. Introduction to Artificial Neural Networks
Birdsinspiredustofly,burdockplantsinspiredvelcro,andnaturehasinspiredmanyotherinventions.Itseemsonlylogical,then,tolookatthebrain’sarchitectureforinspirationonhowtobuildanintelligentmachine.Thisisthekeyideathatinspiredartificialneuralnetworks(ANNs).However,althoughplaneswereinspiredbybirds,theydon’thavetoflaptheirwings.Similarly,ANNshavegraduallybecomequitedifferentfromtheirbiologicalcousins.Someresearchersevenarguethatweshoulddropthebiologicalanalogyaltogether(e.g.,bysaying“units”ratherthan“neurons”),lestwerestrictourcreativitytobiologicallyplausiblesystems.1
ANNsareattheverycoreofDeepLearning.Theyareversatile,powerful,andscalable,makingthemidealtotacklelargeandhighlycomplexMachineLearningtasks,suchasclassifyingbillionsofimages(e.g.,GoogleImages),poweringspeechrecognitionservices(e.g.,Apple’sSiri),recommendingthebestvideostowatchtohundredsofmillionsofuserseveryday(e.g.,YouTube),orlearningtobeattheworldchampionatthegameofGobyexaminingmillionsofpastgamesandthenplayingagainstitself(DeepMind’sAlphaGo).
Inthischapter,wewillintroduceartificialneuralnetworks,startingwithaquicktouroftheveryfirstANNarchitectures.ThenwewillpresentMulti-LayerPerceptrons(MLPs)andimplementoneusingTensorFlowtotackletheMNISTdigitclassificationproblem(introducedinChapter3).
From Biological to Artificial Neurons
Surprisingly, ANNs have been around for quite a while: they were first introduced back in 1943 by the neurophysiologist Warren McCulloch and the mathematician Walter Pitts. In their landmark paper,2 "A Logical Calculus of Ideas Immanent in Nervous Activity," McCulloch and Pitts presented a simplified computational model of how biological neurons might work together in animal brains to perform complex computations using propositional logic. This was the first artificial neural network architecture. Since then many other architectures have been invented, as we will see.
TheearlysuccessesofANNsuntilthe1960sledtothewidespreadbeliefthatwewouldsoonbeconversingwithtrulyintelligentmachines.Whenitbecameclearthatthispromisewouldgounfulfilled(atleastforquiteawhile),fundingflewelsewhereandANNsenteredalongdarkera.Intheearly1980stherewasarevivalofinterestinANNsasnewnetworkarchitectureswereinventedandbettertrainingtechniquesweredeveloped.Butbythe1990s,powerfulalternativeMachineLearningtechniquessuchasSupportVectorMachines(seeChapter5)werefavoredbymostresearchers,astheyseemedtoofferbetterresultsandstrongertheoreticalfoundations.Finally,wearenowwitnessingyetanotherwaveofinterestinANNs.Willthiswavedieoutlikethepreviousonesdid?Thereareafewgoodreasonstobelievethatthisoneisdifferentandwillhaveamuchmoreprofoundimpactonourlives:
Thereisnowahugequantityofdataavailabletotrainneuralnetworks,andANNsfrequentlyoutperformotherMLtechniquesonverylargeandcomplexproblems.
Thetremendousincreaseincomputingpowersincethe1990snowmakesitpossibletotrainlargeneuralnetworksinareasonableamountoftime.ThisisinpartduetoMoore’sLaw,butalsothankstothegamingindustry,whichhasproducedpowerfulGPUcardsbythemillions.
Thetrainingalgorithmshavebeenimproved.Tobefairtheyareonlyslightlydifferentfromtheonesusedinthe1990s,buttheserelativelysmalltweakshaveahugepositiveimpact.
SometheoreticallimitationsofANNshaveturnedouttobebenigninpractice.Forexample,manypeoplethoughtthatANNtrainingalgorithmsweredoomedbecausetheywerelikelytogetstuckinlocaloptima,butitturnsoutthatthisisratherrareinpractice(orwhenitisthecase,theyareusuallyfairlyclosetotheglobaloptimum).
ANNsseemtohaveenteredavirtuouscircleoffundingandprogress.AmazingproductsbasedonANNsregularlymaketheheadlinenews,whichpullsmoreandmoreattentionandfundingtowardthem,resultinginmoreandmoreprogress,andevenmoreamazingproducts.
Biological Neurons
Before we discuss artificial neurons, let's take a quick look at a biological neuron (represented in Figure 10-1). It is an unusual-looking cell mostly found in animal cerebral cortexes (e.g., your brain), composed of a cell body containing the nucleus and most of the cell's complex components, and many branching extensions called dendrites, plus one very long extension called the axon. The axon's length may be just a few times longer than the cell body, or up to tens of thousands of times longer. Near its extremity the axon splits off into many branches called telodendria, and at the tip of these branches are minuscule structures called synaptic terminals (or simply synapses), which are connected to the dendrites (or directly to the cell body) of other neurons. Biological neurons receive short electrical impulses called signals from other neurons via these synapses. When a neuron receives a sufficient number of signals from other neurons within a few milliseconds, it fires its own signals.
Figure 10-1. Biological neuron3
Thus,individualbiologicalneuronsseemtobehaveinarathersimpleway,buttheyareorganizedinavastnetworkofbillionsofneurons,eachneurontypicallyconnectedtothousandsofotherneurons.Highlycomplexcomputationscanbeperformedbyavastnetworkoffairlysimpleneurons,muchlikeacomplexanthillcanemergefromthecombinedeffortsofsimpleants.Thearchitectureofbiologicalneuralnetworks(BNN)4isstillthesubjectofactiveresearch,butsomepartsofthebrainhavebeenmapped,anditseemsthatneuronsareoftenorganizedinconsecutivelayers,asshowninFigure10-2.
Figure 10-2. Multiple layers in a biological neural network (human cortex)5
Logical Computations with Neurons
Warren McCulloch and Walter Pitts proposed a very simple model of the biological neuron, which later became known as an artificial neuron: it has one or more binary (on/off) inputs and one binary output. The artificial neuron simply activates its output when more than a certain number of its inputs are active. McCulloch and Pitts showed that even with such a simplified model it is possible to build a network of artificial neurons that computes any logical proposition you want. For example, let's build a few ANNs that perform various logical computations (see Figure 10-3), assuming that a neuron is activated when at least two of its inputs are active.
Figure 10-3. ANNs performing simple logical computations
Thefirstnetworkontheleftissimplytheidentityfunction:ifneuronAisactivated,thenneuronCgetsactivatedaswell(sinceitreceivestwoinputsignalsfromneuronA),butifneuronAisoff,thenneuronCisoffaswell.
ThesecondnetworkperformsalogicalAND:neuronCisactivatedonlywhenbothneuronsAandBareactivated(asingleinputsignalisnotenoughtoactivateneuronC).
ThethirdnetworkperformsalogicalOR:neuronCgetsactivatedifeitherneuronAorneuronBisactivated(orboth).
Finally,ifwesupposethataninputconnectioncaninhibittheneuron’sactivity(whichisthecasewithbiologicalneurons),thenthefourthnetworkcomputesaslightlymorecomplexlogicalproposition:neuronCisactivatedonlyifneuronAisactiveandifneuronBisoff.IfneuronAisactiveallthetime,thenyougetalogicalNOT:neuronCisactivewhenneuronBisoff,andviceversa.
Youcaneasilyimaginehowthesenetworkscanbecombinedtocomputecomplexlogicalexpressions(seetheexercisesattheendofthechapter).
The Perceptron
The Perceptron is one of the simplest ANN architectures, invented in 1957 by Frank Rosenblatt. It is based on a slightly different artificial neuron (see Figure 10-4) called a linear threshold unit (LTU): the inputs and output are now numbers (instead of binary on/off values) and each input connection is associated with a weight. The LTU computes a weighted sum of its inputs (z = w1 x1 + w2 x2 + ⋯ + wn xn = wT · x), then applies a step function to that sum and outputs the result: hw(x) = step(z) = step(wT · x).
Figure 10-4. Linear threshold unit
The most common step function used in Perceptrons is the Heaviside step function (see Equation 10-1). Sometimes the sign function is used instead.
Equation 10-1. Common step functions used in Perceptrons

heaviside(z) = 0 if z < 0;  1 if z ≥ 0
sgn(z) = –1 if z < 0;  0 if z = 0;  +1 if z > 0
A single LTU can be used for simple linear binary classification. It computes a linear combination of the inputs and if the result exceeds a threshold, it outputs the positive class or else outputs the negative class (just like a Logistic Regression classifier or a linear SVM). For example, you could use a single LTU to classify iris flowers based on the petal length and width (also adding an extra bias feature x0 = 1, just like we did in previous chapters). Training an LTU means finding the right values for w0, w1, and w2 (the training algorithm is discussed shortly).
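To make the LTU concrete, here is a small NumPy sketch (not the book's code) that classifies a single iris instance with hand-picked, purely illustrative weights:

import numpy as np

def heaviside(z):
    return (z >= 0).astype(np.float64)   # step(z): 0 if z < 0, else 1

w = np.array([-1.0, 0.3, 0.8])           # illustrative weights w0 (bias), w1, w2
x = np.array([1.0, 4.5, 1.5])            # x0 = 1 (bias feature), petal length, petal width
z = w.dot(x)                             # weighted sum z = wT . x
y_pred = heaviside(z)                    # 1.0 = positive class, 0.0 = negative class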
APerceptronissimplycomposedofasinglelayerofLTUs,6witheachneuronconnectedtoalltheinputs.Theseconnectionsareoftenrepresentedusingspecialpassthroughneuronscalledinputneurons:theyjustoutputwhateverinputtheyarefed.Moreover,anextrabiasfeatureisgenerallyadded(x0=1).Thisbiasfeatureistypicallyrepresentedusingaspecialtypeofneuroncalledabiasneuron,whichjustoutputs1allthetime.
APerceptronwithtwoinputsandthreeoutputsisrepresentedinFigure10-5.ThisPerceptroncanclassifyinstancessimultaneouslyintothreedifferentbinaryclasses,whichmakesitamultioutputclassifier.
Figure 10-5. Perceptron diagram
SohowisaPerceptrontrained?ThePerceptrontrainingalgorithmproposedbyFrankRosenblattwaslargelyinspiredbyHebb’srule.InhisbookTheOrganizationofBehavior,publishedin1949,DonaldHebbsuggestedthatwhenabiologicalneuronoftentriggersanotherneuron,theconnectionbetweenthesetwoneuronsgrowsstronger.ThisideawaslatersummarizedbySiegridLöwelinthiscatchyphrase:“Cellsthatfiretogether,wiretogether.”ThisrulelaterbecameknownasHebb’srule(orHebbianlearning);thatis,theconnectionweightbetweentwoneuronsisincreasedwhenevertheyhavethesameoutput.Perceptronsaretrainedusingavariantofthisrulethattakesintoaccounttheerrormadebythenetwork;itdoesnotreinforceconnectionsthatleadtothewrongoutput.Morespecifically,thePerceptronisfedonetraininginstanceatatime,andforeachinstanceitmakesitspredictions.Foreveryoutputneuronthatproducedawrongprediction,itreinforcestheconnectionweightsfromtheinputsthatwouldhavecontributedtothecorrectprediction.TheruleisshowninEquation10-2.
Equation 10-2. Perceptron learning rule (weight update)

wi,j(next step) = wi,j + η (yj – ŷj) xi
where:
wi,j is the connection weight between the ith input neuron and the jth output neuron.
xi is the ith input value of the current training instance.
ŷj is the output of the jth output neuron for the current training instance.
yj is the target output of the jth output neuron for the current training instance.
η is the learning rate.
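As an illustration of this rule (again a sketch, not the book's code), a single training step of a Perceptron with 3 inputs (including the bias feature) and 3 output neurons could be written in NumPy as follows, reusing the heaviside() step function sketched earlier:

eta = 0.1                                   # learning rate
W = np.zeros((3, 3))                        # weights: one column per output neuron
x_instance = np.array([1.0, 4.5, 1.5])      # one training instance, bias feature first
y_target = np.array([1.0, 0.0, 0.0])        # target output of each output neuron

y_hat = heaviside(x_instance.dot(W))        # current predictions of the three output neurons
W += eta * np.outer(x_instance, y_target - y_hat)   # Equation 10-2 applied to every wi,j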
The decision boundary of each output neuron is linear, so Perceptrons are incapable of learning complex patterns (just like Logistic Regression classifiers). However, if the training instances are linearly separable, Rosenblatt demonstrated that this algorithm would converge to a solution.7 This is called the Perceptron convergence theorem.
Scikit-Learn provides a Perceptron class that implements a single LTU network. It can be used pretty much as you would expect; for example, on the iris dataset (introduced in Chapter 4):
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron

iris = load_iris()
X = iris.data[:, (2, 3)]  # petal length, petal width
y = (iris.target == 0).astype(np.int)  # Iris Setosa?

per_clf = Perceptron(random_state=42)
per_clf.fit(X, y)

y_pred = per_clf.predict([[2, 0.5]])
You may have recognized that the Perceptron learning algorithm strongly resembles Stochastic Gradient Descent. In fact, Scikit-Learn's Perceptron class is equivalent to using an SGDClassifier with the following hyperparameters: loss="perceptron", learning_rate="constant", eta0=1 (the learning rate), and penalty=None (no regularization).
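In other words, the preceding Perceptron could equivalently be trained with an SGDClassifier configured as just described (a sketch of that equivalence, not code from the book):

from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(loss="perceptron", learning_rate="constant",
                        eta0=1, penalty=None, random_state=42)
sgd_clf.fit(X, y)                        # same X and y as above
y_pred_sgd = sgd_clf.predict([[2, 0.5]])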
Note that contrary to Logistic Regression classifiers, Perceptrons do not output a class probability; rather, they just make predictions based on a hard threshold. This is one of the good reasons to prefer Logistic Regression over Perceptrons.
Intheir1969monographtitledPerceptrons,MarvinMinskyandSeymourPaperthighlightedanumberofseriousweaknessesofPerceptrons,inparticularthefactthattheyareincapableofsolvingsometrivialproblems(e.g.,theExclusiveOR(XOR)classificationproblem;seetheleftsideofFigure10-6).Ofcoursethisistrueofanyotherlinearclassificationmodelaswell(suchasLogisticRegressionclassifiers),butresearchershadexpectedmuchmorefromPerceptrons,andtheirdisappointmentwasgreat:asaresult,manyresearchersdroppedconnectionismaltogether(i.e.,thestudyofneuralnetworks)infavorofhigher-levelproblemssuchaslogic,problemsolving,andsearch.
However,itturnsoutthatsomeofthelimitationsofPerceptronscanbeeliminatedbystackingmultiplePerceptrons.TheresultingANNiscalledaMulti-LayerPerceptron(MLP).Inparticular,anMLPcansolvetheXORproblem,asyoucanverifybycomputingtheoutputoftheMLPrepresentedontherightofFigure10-6,foreachcombinationofinputs:withinputs(0,0)or(1,1)thenetworkoutputs0,andwithinputs(0,1)or(1,0)itoutputs1.
Figure 10-6. XOR classification problem and an MLP that solves it
Multi-Layer Perceptron and Backpropagation
An MLP is composed of one (passthrough) input layer, one or more layers of LTUs, called hidden layers, and one final layer of LTUs called the output layer (see Figure 10-7). Every layer except the output layer includes a bias neuron and is fully connected to the next layer. When an ANN has two or more hidden layers, it is called a deep neural network (DNN).
Figure 10-7. Multi-Layer Perceptron
For many years researchers struggled to find a way to train MLPs, without success. But in 1986, D. E. Rumelhart et al. published a groundbreaking article8 introducing the backpropagation training algorithm.9 Today we would describe it as Gradient Descent using reverse-mode autodiff (Gradient Descent was introduced in Chapter 4, and autodiff was discussed in Chapter 9).
Foreachtraininginstance,thealgorithmfeedsittothenetworkandcomputestheoutputofeveryneuronineachconsecutivelayer(thisistheforwardpass,justlikewhenmakingpredictions).Thenitmeasuresthenetwork’soutputerror(i.e.,thedifferencebetweenthedesiredoutputandtheactualoutputofthenetwork),anditcomputeshowmucheachneuroninthelasthiddenlayercontributedtoeachoutputneuron’serror.Itthenproceedstomeasurehowmuchoftheseerrorcontributionscamefromeachneuronintheprevioushiddenlayer—andsoonuntilthealgorithmreachestheinputlayer.Thisreversepassefficientlymeasurestheerrorgradientacrossalltheconnectionweightsinthenetworkbypropagatingtheerrorgradientbackwardinthenetwork(hencethenameofthealgorithm).Ifyoucheckoutthereverse-modeautodiffalgorithminAppendixD,youwillfindthattheforwardandreversepassesofbackpropagationsimplyperformreverse-modeautodiff.ThelaststepofthebackpropagationalgorithmisaGradientDescentsteponalltheconnectionweightsinthenetwork,usingtheerrorgradientsmeasuredearlier.
Let’smakethisevenshorter:foreachtraininginstancethebackpropagationalgorithmfirstmakesaprediction(forwardpass),measurestheerror,thengoesthrougheachlayerinreversetomeasuretheerrorcontributionfromeachconnection(reversepass),andfinallyslightlytweakstheconnectionweightstoreducetheerror(GradientDescentstep).
Inorderforthisalgorithmtoworkproperly,theauthorsmadeakeychangetotheMLP’sarchitecture:theyreplacedthestepfunctionwiththelogisticfunction,σ(z)=1/(1+exp(–z)).Thiswasessentialbecausethestepfunctioncontainsonlyflatsegments,sothereisnogradienttoworkwith(GradientDescentcannotmoveonaflatsurface),whilethelogisticfunctionhasawell-definednonzeroderivativeeverywhere,allowingGradientDescenttomakesomeprogressateverystep.Thebackpropagationalgorithmmaybeusedwithotheractivationfunctions,insteadofthelogisticfunction.Twootherpopularactivationfunctionsare:
Thehyperbolictangentfunctiontanh(z)=2σ(2z)–1JustlikethelogisticfunctionitisS-shaped,continuous,anddifferentiable,butitsoutputvaluerangesfrom–1to1(insteadof0to1inthecaseofthelogisticfunction),whichtendstomakeeachlayer’soutputmoreorlessnormalized(i.e.,centeredaround0)atthebeginningoftraining.Thisoftenhelpsspeedupconvergence.
TheReLUfunction(introducedinChapter9)ReLU(z)=max(0,z).Itiscontinuousbutunfortunatelynotdifferentiableatz=0(theslopechangesabruptly,whichcanmakeGradientDescentbouncearound).However,inpracticeitworksverywellandhastheadvantageofbeingfasttocompute.Mostimportantly,thefactthatitdoesnothaveamaximumoutputvaluealsohelpsreducesomeissuesduringGradientDescent(wewillcomebacktothisinChapter11).
These popular activation functions and their derivatives are represented in Figure 10-8.
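For reference, here is a small NumPy sketch (not from the book) of the three activation functions just discussed and their derivatives, enough to reproduce curves like those in Figure 10-8:

import numpy as np

def logistic(z):
    return 1 / (1 + np.exp(-z))

def d_logistic(z):
    return logistic(z) * (1 - logistic(z))    # derivative of the logistic (sigmoid) function

def d_tanh(z):
    return 1 - np.tanh(z) ** 2                # tanh itself is np.tanh; note tanh(z) = 2*logistic(2z) - 1

def relu(z):
    return np.maximum(0, z)

def d_relu(z):
    return (z > 0).astype(np.float64)         # slope 0 for z < 0, slope 1 for z > 0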
Figure 10-8. Activation functions and their derivatives
AnMLPisoftenusedforclassification,witheachoutputcorrespondingtoadifferentbinaryclass(e.g.,spam/ham,urgent/not-urgent,andsoon).Whentheclassesareexclusive(e.g.,classes0through9fordigitimageclassification),theoutputlayeristypicallymodifiedbyreplacingtheindividualactivationfunctionsbyasharedsoftmaxfunction(seeFigure10-9).ThesoftmaxfunctionwasintroducedinChapter3.Theoutputofeachneuroncorrespondstotheestimatedprobabilityofthecorrespondingclass.Notethatthesignalflowsonlyinonedirection(fromtheinputstotheoutputs),sothisarchitectureisanexampleofafeedforwardneuralnetwork(FNN).
Figure 10-9. A modern MLP (including ReLU and softmax) for classification
NOTEBiologicalneuronsseemtoimplementaroughlysigmoid(S-shaped)activationfunction,soresearchersstucktosigmoidfunctionsforaverylongtime.ButitturnsoutthattheReLUactivationfunctiongenerallyworksbetterinANNs.Thisisoneofthecaseswherethebiologicalanalogywasmisleading.
TraininganMLPwithTensorFlow’sHigh-LevelAPIThesimplestwaytotrainanMLPwithTensorFlowistousethehigh-levelAPITF.Learn,whichoffersaScikit-Learn–compatibleAPI.TheDNNClassifierclassmakesitfairlyeasytotrainadeepneuralnetworkwithanynumberofhiddenlayers,andasoftmaxoutputlayertooutputestimatedclassprobabilities.Forexample,thefollowingcodetrainsaDNNforclassificationwithtwohiddenlayers(onewith300neurons,andtheotherwith100neurons)andasoftmaxoutputlayerwith10neurons:
import tensorflow as tf

feature_cols = tf.contrib.learn.infer_real_valued_columns_from_input(X_train)
dnn_clf = tf.contrib.learn.DNNClassifier(hidden_units=[300, 100], n_classes=10,
                                         feature_columns=feature_cols)
dnn_clf = tf.contrib.learn.SKCompat(dnn_clf)  # if TensorFlow >= 1.1
dnn_clf.fit(X_train, y_train, batch_size=50, steps=40000)
The code first creates a set of real-valued columns from the training set (other types of columns, such as categorical columns, are available). Then we create the DNNClassifier, and we wrap it in a Scikit-Learn compatibility helper. Finally, we run 40,000 training iterations using batches of 50 instances.
If you run this code on the MNIST dataset (after scaling it, e.g., by using Scikit-Learn's StandardScaler), you will actually get a model that achieves around 98.2% accuracy on the test set! That's better than the best model we trained in Chapter 3:
>>> from sklearn.metrics import accuracy_score
>>> y_pred = dnn_clf.predict(X_test)
>>> accuracy_score(y_test, y_pred['classes'])
0.98250000000000004
WARNING: The tensorflow.contrib package contains many useful functions, but it is a place for experimental code that has not yet graduated to be part of the core TensorFlow API. So the DNNClassifier class (and any other contrib code) may change without notice in the future.
Under the hood, the DNNClassifier class creates all the neuron layers, based on the ReLU activation function (we can change this by setting the activation_fn hyperparameter). The output layer relies on the softmax function, and the cost function is cross entropy (introduced in Chapter 4).
Training a DNN Using Plain TensorFlow
If you want more control over the architecture of the network, you may prefer to use TensorFlow's lower-level Python API (introduced in Chapter 9). In this section we will build the same model as before using this API, and we will implement Mini-batch Gradient Descent to train it on the MNIST dataset. The first step is the construction phase, building the TensorFlow graph. The second step is the execution phase, where you actually run the graph to train the model.
ConstructionPhaseLet’sstart.Firstweneedtoimportthetensorflowlibrary.Thenwemustspecifythenumberofinputsandoutputs,andsetthenumberofhiddenneuronsineachlayer:
import numpy as np   # used below by neuron_layer() for np.sqrt()
import tensorflow as tf

n_inputs = 28*28  # MNIST
n_hidden1=300
n_hidden2=100
n_outputs=10
Next,justlikeyoudidinChapter9,youcanuseplaceholdernodestorepresentthetrainingdataandtargets.TheshapeofXisonlypartiallydefined.Weknowthatitwillbea2Dtensor(i.e.,amatrix),withinstancesalongthefirstdimensionandfeaturesalongtheseconddimension,andweknowthatthenumberoffeaturesisgoingtobe28x28(onefeatureperpixel),butwedon’tknowyethowmanyinstanceseachtrainingbatchwillcontain.SotheshapeofXis(None,n_inputs).Similarly,weknowthatywillbea1Dtensorwithoneentryperinstance,butagainwedon’tknowthesizeofthetrainingbatchatthispoint,sotheshapeis(None).
X=tf.placeholder(tf.float32,shape=(None,n_inputs),name="X")
y=tf.placeholder(tf.int64,shape=(None),name="y")
Nowlet’screatetheactualneuralnetwork.TheplaceholderXwillactastheinputlayer;duringtheexecutionphase,itwillbereplacedwithonetrainingbatchatatime(notethatalltheinstancesinatrainingbatchwillbeprocessedsimultaneouslybytheneuralnetwork).Nowyouneedtocreatethetwohiddenlayersandtheoutputlayer.Thetwohiddenlayersarealmostidentical:theydifferonlybytheinputstheyareconnectedtoandbythenumberofneuronstheycontain.Theoutputlayerisalsoverysimilar,butitusesasoftmaxactivationfunctioninsteadofaReLUactivationfunction.Solet’screateaneuron_layer()functionthatwewillusetocreateonelayeratatime.Itwillneedparameterstospecifytheinputs,thenumberofneurons,theactivationfunction,andthenameofthelayer:
def neuron_layer(X, n_neurons, name, activation=None):
    with tf.name_scope(name):
        n_inputs = int(X.get_shape()[1])
        stddev = 2 / np.sqrt(n_inputs)
        init = tf.truncated_normal((n_inputs, n_neurons), stddev=stddev)
        W = tf.Variable(init, name="kernel")
        b = tf.Variable(tf.zeros([n_neurons]), name="bias")
        Z = tf.matmul(X, W) + b
        if activation is not None:
            return activation(Z)
        else:
            return Z
Let’sgothroughthiscodelinebyline:1. Firstwecreateanamescopeusingthenameofthelayer:itwillcontainallthecomputationnodes
forthisneuronlayer.Thisisoptional,butthegraphwilllookmuchnicerinTensorBoardifitsnodesarewellorganized.
2. Next,wegetthenumberofinputsbylookinguptheinputmatrix’sshapeandgettingthesizeoftheseconddimension(thefirstdimensionisforinstances).
3. ThenextthreelinescreateaWvariablethatwillholdtheweightsmatrix(oftencalledthelayer’skernel).Itwillbea2Dtensorcontainingalltheconnectionweightsbetweeneachinputandeachneuron;hence,itsshapewillbe(n_inputs,n_neurons).Itwillbeinitializedrandomly,usinga
truncated10normal(Gaussian)distributionwithastandarddeviationof .Usingthisspecificstandarddeviationhelpsthealgorithmconvergemuchfaster(wewilldiscussthisfurtherinChapter11;itisoneofthosesmalltweakstoneuralnetworksthathavehadatremendousimpactontheirefficiency).ItisimportanttoinitializeconnectionweightsrandomlyforallhiddenlayerstoavoidanysymmetriesthattheGradientDescentalgorithmwouldbeunabletobreak.11
4. Thenextlinecreatesabvariableforbiases,initializedto0(nosymmetryissueinthiscase),withonebiasparameterperneuron.
5. ThenwecreateasubgraphtocomputeZ=X·W+b.Thisvectorizedimplementationwillefficientlycomputetheweightedsumsoftheinputsplusthebiastermforeachandeveryneuroninthelayer,foralltheinstancesinthebatchinjustoneshot.
6. Finally,ifanactivationparameterisprovided,suchastf.nn.relu(i.e.,max(0,Z)),thenthecodereturnsactivation(Z),orelseitjustreturnsZ.
Okay, so now you have a nice function to create a neuron layer. Let's use it to create the deep neural network! The first hidden layer takes X as its input. The second takes the output of the first hidden layer as its input. And finally, the output layer takes the output of the second hidden layer as its input.
withtf.name_scope("dnn"):
hidden1=neuron_layer(X,n_hidden1,name="hidden1",
activation=tf.nn.relu)
hidden2=neuron_layer(hidden1,n_hidden2,name="hidden2",
activation=tf.nn.relu)
logits=neuron_layer(hidden2,n_outputs,name="outputs")
Notice that once again we used a name scope for clarity. Also note that logits is the output of the neural network before going through the softmax activation function: for optimization reasons, we will handle the softmax computation later.
Asyoumightexpect,TensorFlowcomeswithmanyhandyfunctionstocreatestandardneuralnetworklayers,sothere’softennoneedtodefineyourownneuron_layer()functionlikewejustdid.Forexample,TensorFlow’stf.layers.dense()function(previouslycalledtf.contrib.layers.fully_connected())createsafullyconnectedlayer,wherealltheinputsareconnectedtoalltheneuronsinthelayer.Ittakescareofcreatingtheweightsandbiasesvariables,namedkernelandbiasrespectively,usingtheappropriateinitializationstrategy,andyoucansettheactivationfunctionusingtheactivationargument.AswewillseeinChapter11,italsosupportsregularizationparameters.Let’stweaktheprecedingcodetousethedense()functioninsteadofourneuron_layer()function.Simplyreplacethednnconstructionsectionwiththefollowingcode:
withtf.name_scope("dnn"):
hidden1=tf.layers.dense(X,n_hidden1,name="hidden1",
activation=tf.nn.relu)
hidden2=tf.layers.dense(hidden1,n_hidden2,name="hidden2",
activation=tf.nn.relu)
logits=tf.layers.dense(hidden2,n_outputs,name="outputs")
Nowthatwehavetheneuralnetworkmodelreadytogo,weneedtodefinethecostfunctionthatwewillusetotrainit.JustaswedidforSoftmaxRegressioninChapter4,wewillusecrossentropy.Aswediscussedearlier,crossentropywillpenalizemodelsthatestimatealowprobabilityforthetargetclass.TensorFlowprovidesseveralfunctionstocomputecrossentropy.Wewillusesparse_softmax_cross_entropy_with_logits():itcomputesthecrossentropybasedonthe“logits”(i.e.,theoutputofthenetworkbeforegoingthroughthesoftmaxactivationfunction),anditexpectslabelsintheformofintegersrangingfrom0tothenumberofclassesminus1(inourcase,from0to9).Thiswillgiveusa1Dtensorcontainingthecrossentropyforeachinstance.WecanthenuseTensorFlow’sreduce_mean()functiontocomputethemeancrossentropyoverallinstances.
withtf.name_scope("loss"):
xentropy=tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y,
logits=logits)
loss=tf.reduce_mean(xentropy,name="loss")
NOTEThesparse_softmax_cross_entropy_with_logits()functionisequivalenttoapplyingthesoftmaxactivationfunctionandthencomputingthecrossentropy,butitismoreefficient,anditproperlytakescareofcornercaseslikelogitsequalto0.Thisiswhywedidnotapplythesoftmaxactivationfunctionearlier.Thereisalsoanotherfunctioncalledsoftmax_cross_entropy_with_logits(),whichtakeslabelsintheformofone-hotvectors(insteadofintsfrom0tothenumberofclassesminus1).
We have the neural network model, we have the cost function, and now we need to define a GradientDescentOptimizer that will tweak the model parameters to minimize the cost function. Nothing new; it's just like we did in Chapter 9:
learning_rate=0.01
withtf.name_scope("train"):
optimizer=tf.train.GradientDescentOptimizer(learning_rate)
training_op=optimizer.minimize(loss)
The last important step in the construction phase is to specify how to evaluate the model. We will simply use accuracy as our performance measure. First, for each instance, determine if the neural network's prediction is correct by checking whether or not the highest logit corresponds to the target class. For this you can use the in_top_k() function. This returns a 1D tensor full of boolean values, so we need to cast these booleans to floats and then compute the average. This will give us the network's overall accuracy.
withtf.name_scope("eval"):
correct=tf.nn.in_top_k(logits,y,1)
accuracy=tf.reduce_mean(tf.cast(correct,tf.float32))
And, as usual, we need to create a node to initialize all variables, and we will also create a Saver to save our trained model parameters to disk:
init=tf.global_variables_initializer()
saver=tf.train.Saver()
Phew! This concludes the construction phase. This was fewer than 40 lines of code, but it was pretty intense: we created placeholders for the inputs and the targets, we created a function to build a neuron layer, we used it to create the DNN, we defined the cost function, we created an optimizer, and finally we defined the performance measure. Now on to the execution phase.
Execution Phase
This part is much shorter and simpler. First, let's load MNIST. We could use Scikit-Learn for that as we did in previous chapters, but TensorFlow offers its own helper that fetches the data, scales it (between 0 and 1), shuffles it, and provides a simple function to load one mini-batch at a time. So let's use it instead:
from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets("/tmp/data/")
Now we define the number of epochs that we want to run, as well as the size of the mini-batches:
n_epochs=40
batch_size=50
And now we can train the model:
with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(mnist.train.num_examples // batch_size):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        acc_train = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
        acc_test = accuracy.eval(feed_dict={X: mnist.test.images,
                                            y: mnist.test.labels})
        print(epoch, "Train accuracy:", acc_train, "Test accuracy:", acc_test)

    save_path = saver.save(sess, "./my_model_final.ckpt")
ThiscodeopensaTensorFlowsession,anditrunstheinitnodethatinitializesallthevariables.Thenitrunsthemaintrainingloop:ateachepoch,thecodeiteratesthroughanumberofmini-batchesthatcorrespondstothetrainingsetsize.Eachmini-batchisfetchedviathenext_batch()method,andthenthecodesimplyrunsthetrainingoperation,feedingitthecurrentmini-batchinputdataandtargets.Next,attheendofeachepoch,thecodeevaluatesthemodelonthelastmini-batchandonthefulltestset,anditprintsouttheresult.Finally,themodelparametersaresavedtodisk.
Using the Neural Network
Now that the neural network is trained, you can use it to make predictions. To do that, you can reuse the same construction phase, but change the execution phase like this:
with tf.Session() as sess:
    saver.restore(sess, "./my_model_final.ckpt")
    X_new_scaled = [...]  # some new images (scaled from 0 to 1)
    Z = logits.eval(feed_dict={X: X_new_scaled})
    y_pred = np.argmax(Z, axis=1)
First the code loads the model parameters from disk. Then it loads some new images that you want to classify. Remember to apply the same feature scaling as for the training data (in this case, scale it from 0 to 1). Then the code evaluates the logits node. If you wanted to know all the estimated class probabilities, you would need to apply the softmax() function to the logits, but if you just want to predict a class, you can simply pick the class that has the highest logit value (using the argmax() function does the trick).
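For instance, here is one way to turn the evaluated logits Z into class probabilities with a NumPy softmax (a sketch, not the chapter's own listing; shifting by the row maximum just avoids numerical overflow):

Z_shifted = Z - Z.max(axis=1, keepdims=True)                     # stabilize the exponentials
y_proba = np.exp(Z_shifted) / np.exp(Z_shifted).sum(axis=1, keepdims=True)
y_pred = y_proba.argmax(axis=1)   # same classes as taking argmax of the raw logits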
Fine-Tuning Neural Network Hyperparameters
The flexibility of neural networks is also one of their main drawbacks: there are many hyperparameters to tweak. Not only can you use any imaginable network topology (how neurons are interconnected), but even in a simple MLP you can change the number of layers, the number of neurons per layer, the type of activation function to use in each layer, the weight initialization logic, and much more. How do you know what combination of hyperparameters is the best for your task?
Ofcourse,youcanusegridsearchwithcross-validationtofindtherighthyperparameters,likeyoudidinpreviouschapters,butsincetherearemanyhyperparameterstotune,andsincetraininganeuralnetworkonalargedatasettakesalotoftime,youwillonlybeabletoexploreatinypartofthehyperparameterspaceinareasonableamountoftime.Itismuchbettertouserandomizedsearch,aswediscussedinChapter2.AnotheroptionistouseatoolsuchasOscar,whichimplementsmorecomplexalgorithmstohelpyoufindagoodsetofhyperparametersquickly.
It helps to have an idea of what values are reasonable for each hyperparameter, so you can restrict the search space. Let's start with the number of hidden layers.
Number of Hidden Layers
For many problems, you can just begin with a single hidden layer and you will get reasonable results. It has actually been shown that an MLP with just one hidden layer can model even the most complex functions provided it has enough neurons. For a long time, these facts convinced researchers that there was no need to investigate any deeper neural networks. But they overlooked the fact that deep networks have a much higher parameter efficiency than shallow ones: they can model complex functions using exponentially fewer neurons than shallow nets, making them much faster to train.
Tounderstandwhy,supposeyouareaskedtodrawaforestusingsomedrawingsoftware,butyouareforbiddentousecopy/paste.Youwouldhavetodraweachtreeindividually,branchperbranch,leafperleaf.Ifyoucouldinsteaddrawoneleaf,copy/pasteittodrawabranch,thencopy/pastethatbranchtocreateatree,andfinallycopy/pastethistreetomakeaforest,youwouldbefinishedinnotime.Real-worlddataisoftenstructuredinsuchahierarchicalwayandDNNsautomaticallytakeadvantageofthisfact:lowerhiddenlayersmodellow-levelstructures(e.g.,linesegmentsofvariousshapesandorientations),intermediatehiddenlayerscombinetheselow-levelstructurestomodelintermediate-levelstructures(e.g.,squares,circles),andthehighesthiddenlayersandtheoutputlayercombinetheseintermediatestructurestomodelhigh-levelstructures(e.g.,faces).
NotonlydoesthishierarchicalarchitecturehelpDNNsconvergefastertoagoodsolution,italsoimprovestheirabilitytogeneralizetonewdatasets.Forexample,ifyouhavealreadytrainedamodeltorecognizefacesinpictures,andyounowwanttotrainanewneuralnetworktorecognizehairstyles,thenyoucankickstarttrainingbyreusingthelowerlayersofthefirstnetwork.Insteadofrandomlyinitializingtheweightsandbiasesofthefirstfewlayersofthenewneuralnetwork,youcaninitializethemtothevalueoftheweightsandbiasesofthelowerlayersofthefirstnetwork.Thiswaythenetworkwillnothavetolearnfromscratchallthelow-levelstructuresthatoccurinmostpictures;itwillonlyhavetolearnthehigher-levelstructures(e.g.,hairstyles).
Insummary,formanyproblemsyoucanstartwithjustoneortwohiddenlayersanditwillworkjustfine(e.g.,youcaneasilyreachabove97%accuracyontheMNISTdatasetusingjustonehiddenlayerwithafewhundredneurons,andabove98%accuracyusingtwohiddenlayerswiththesametotalamountofneurons,inroughlythesameamountoftrainingtime).Formorecomplexproblems,youcangraduallyrampupthenumberofhiddenlayers,untilyoustartoverfittingthetrainingset.Verycomplextasks,suchaslargeimageclassificationorspeechrecognition,typicallyrequirenetworkswithdozensoflayers(orevenhundreds,butnotfullyconnectedones,aswewillseeinChapter13),andtheyneedahugeamountoftrainingdata.However,youwillrarelyhavetotrainsuchnetworksfromscratch:itismuchmorecommontoreusepartsofapretrainedstate-of-the-artnetworkthatperformsasimilartask.Trainingwillbealotfasterandrequiremuchlessdata(wewilldiscussthisinChapter11).
Number of Neurons per Hidden Layer
Obviously the number of neurons in the input and output layers is determined by the type of input and output your task requires. For example, the MNIST task requires 28 x 28 = 784 input neurons and 10 output neurons. As for the hidden layers, a common practice is to size them to form a funnel, with fewer and fewer neurons at each layer, the rationale being that many low-level features can coalesce into far fewer high-level features. For example, a typical neural network for MNIST may have two hidden layers, the first with 300 neurons and the second with 100. However, this practice is not as common now, and you may simply use the same size for all hidden layers (for example, all hidden layers with 150 neurons): that's just one hyperparameter to tune instead of one per layer. Just like for the number of layers, you can try increasing the number of neurons gradually until the network starts overfitting. In general you will get more bang for the buck by increasing the number of layers than the number of neurons per layer. Unfortunately, as you can see, finding the perfect amount of neurons is still somewhat of a black art.
Asimplerapproachistopickamodelwithmorelayersandneuronsthanyouactuallyneed,thenuseearlystoppingtopreventitfromoverfitting(andotherregularizationtechniques,especiallydropout,aswewillseeinChapter11).Thishasbeendubbedthe“stretchpants”approach:12insteadofwastingtimelookingforpantsthatperfectlymatchyoursize,justuselargestretchpantsthatwillshrinkdowntotherightsize.
Activation Functions
In most cases you can use the ReLU activation function in the hidden layers (or one of its variants, as we will see in Chapter 11). It is a bit faster to compute than other activation functions, and Gradient Descent does not get stuck as much on plateaus, thanks to the fact that it does not saturate for large input values (as opposed to the logistic function or the hyperbolic tangent function, which saturate at 1).
For the output layer, the softmax activation function is generally a good choice for classification tasks (when the classes are mutually exclusive). For regression tasks, you can simply use no activation function at all.
This concludes this introduction to artificial neural networks. In the following chapters, we will discuss techniques to train very deep nets, and distribute training across multiple servers and GPUs. Then we will explore a few other popular neural network architectures: convolutional neural networks, recurrent neural networks, and autoencoders.13
Exercises
1. Draw an ANN using the original artificial neurons (like the ones in Figure 10-3) that computes A ⊕ B (where ⊕ represents the XOR operation). Hint: A ⊕ B = (A ∧ ¬B) ∨ (¬A ∧ B).
2. Why is it generally preferable to use a Logistic Regression classifier rather than a classical Perceptron (i.e., a single layer of linear threshold units trained using the Perceptron training algorithm)? How can you tweak a Perceptron to make it equivalent to a Logistic Regression classifier?
3. Why was the logistic activation function a key ingredient in training the first MLPs?
4. Name three popular activation functions. Can you draw them?
5. Suppose you have an MLP composed of one input layer with 10 passthrough neurons, followed by one hidden layer with 50 artificial neurons, and finally one output layer with 3 artificial neurons. All artificial neurons use the ReLU activation function.
What is the shape of the input matrix X?
What about the shape of the hidden layer's weight vector Wh, and the shape of its bias vector bh?
What is the shape of the output layer's weight vector Wo, and its bias vector bo?
What is the shape of the network's output matrix Y?
Write the equation that computes the network's output matrix Y as a function of X, Wh, bh, Wo and bo.
6. How many neurons do you need in the output layer if you want to classify email into spam or ham? What activation function should you use in the output layer? If instead you want to tackle MNIST, how many neurons do you need in the output layer, using what activation function? Answer the same questions for getting your network to predict housing prices as in Chapter 2.
7. What is backpropagation and how does it work? What is the difference between backpropagation and reverse-mode autodiff?
8. Can you list all the hyperparameters you can tweak in an MLP? If the MLP overfits the training data, how could you tweak these hyperparameters to try to solve the problem?
9. Train a deep MLP on the MNIST dataset and see if you can get over 98% precision. Just like in the last exercise of Chapter 9, try adding all the bells and whistles (i.e., save checkpoints, restore the last checkpoint in case of an interruption, add summaries, plot learning curves using TensorBoard, and so on).
Solutions to these exercises are available in Appendix A.
1. You can get the best of both worlds by being open to biological inspirations without being afraid to create biologically unrealistic models, as long as they work well.
2. "A Logical Calculus of Ideas Immanent in Nervous Activity," W. McCulloch and W. Pitts (1943).
3. Image by Bruce Blaus (Creative Commons 3.0). Reproduced from https://en.wikipedia.org/wiki/Neuron.
4. In the context of Machine Learning, the phrase "neural networks" generally refers to ANNs, not BNNs.
5. Drawing of a cortical lamination by S. Ramon y Cajal (public domain). Reproduced from https://en.wikipedia.org/wiki/Cerebral_cortex.
6. The name Perceptron is sometimes used to mean a tiny network with a single LTU.
7. Note that this solution is generally not unique: in general when the data are linearly separable, there is an infinity of hyperplanes that can separate them.
8. "Learning Internal Representations by Error Propagation," D. Rumelhart, G. Hinton, R. Williams (1986).
9. This algorithm was actually invented several times by various researchers in different fields, starting with P. Werbos in 1974.
10. Using a truncated normal distribution rather than a regular normal distribution ensures that there won't be any large weights, which could slow down training.
11. For example, if you set all the weights to 0, then all neurons will output 0, and the error gradient will be the same for all neurons in a given hidden layer. The Gradient Descent step will then update all the weights in exactly the same way in each layer, so they will all remain equal. In other words, despite having hundreds of neurons per layer, your model will act as if there were only one neuron per layer. It is not going to fly.
12. By Vincent Vanhoucke in his Deep Learning class on Udacity.com.
13. A few extra ANN architectures are presented in Appendix E.
Chapter 11. Training Deep Neural Nets
In Chapter 10 we introduced artificial neural networks and trained our first deep neural network. But it was a very shallow DNN, with only two hidden layers. What if you need to tackle a very complex problem, such as detecting hundreds of types of objects in high-resolution images? You may need to train a much deeper DNN, perhaps with (say) 10 layers, each containing hundreds of neurons, connected by hundreds of thousands of connections. This would not be a walk in the park:
First, you would be faced with the tricky vanishing gradients problem (or the related exploding gradients problem) that affects deep neural networks and makes lower layers very hard to train.
Second, with such a large network, training would be extremely slow.
Third, a model with millions of parameters would severely risk overfitting the training set.
In this chapter, we will go through each of these problems in turn and present techniques to solve them. We will start by explaining the vanishing gradients problem and exploring some of the most popular solutions to this problem. Next we will look at various optimizers that can speed up training large models tremendously compared to plain Gradient Descent. Finally, we will go through a few popular regularization techniques for large neural networks.
With these tools, you will be able to train very deep nets: welcome to Deep Learning!
Vanishing/Exploding Gradients Problems
As we discussed in Chapter 10, the backpropagation algorithm works by going from the output layer to the input layer, propagating the error gradient on the way. Once the algorithm has computed the gradient of the cost function with regard to each parameter in the network, it uses these gradients to update each parameter with a Gradient Descent step.
Unfortunately,gradientsoftengetsmallerandsmallerasthealgorithmprogressesdowntothelowerlayers.Asaresult,theGradientDescentupdateleavesthelowerlayerconnectionweightsvirtuallyunchanged,andtrainingneverconvergestoagoodsolution.Thisiscalledthevanishinggradientsproblem.Insomecases,theoppositecanhappen:thegradientscangrowbiggerandbigger,somanylayersgetinsanelylargeweightupdatesandthealgorithmdiverges.Thisistheexplodinggradientsproblem,whichismostlyencounteredinrecurrentneuralnetworks(seeChapter14).Moregenerally,deepneuralnetworkssufferfromunstablegradients;differentlayersmaylearnatwidelydifferentspeeds.
Althoughthisunfortunatebehaviorhasbeenempiricallyobservedforquiteawhile(itwasoneofthereasonswhydeepneuralnetworksweremostlyabandonedforalongtime),itisonlyaround2010thatsignificantprogresswasmadeinunderstandingit.Apapertitled“UnderstandingtheDifficultyofTrainingDeepFeedforwardNeuralNetworks”byXavierGlorotandYoshuaBengio1foundafewsuspects,includingthecombinationofthepopularlogisticsigmoidactivationfunctionandtheweightinitializationtechniquethatwasmostpopularatthetime,namelyrandominitializationusinganormaldistributionwithameanof0andastandarddeviationof1.Inshort,theyshowedthatwiththisactivationfunctionandthisinitializationscheme,thevarianceoftheoutputsofeachlayerismuchgreaterthanthevarianceofitsinputs.Goingforwardinthenetwork,thevariancekeepsincreasingaftereachlayeruntiltheactivationfunctionsaturatesatthetoplayers.Thisisactuallymadeworsebythefactthatthelogisticfunctionhasameanof0.5,not0(thehyperbolictangentfunctionhasameanof0andbehavesslightlybetterthanthelogisticfunctionindeepnetworks).
Lookingatthelogisticactivationfunction(seeFigure11-1),youcanseethatwheninputsbecomelarge(negativeorpositive),thefunctionsaturatesat0or1,withaderivativeextremelycloseto0.Thuswhenbackpropagationkicksin,ithasvirtuallynogradienttopropagatebackthroughthenetwork,andwhatlittlegradientexistskeepsgettingdilutedasbackpropagationprogressesdownthroughthetoplayers,sothereisreallynothingleftforthelowerlayers.
Figure11-1.Logisticactivationfunctionsaturation
Xavier and He Initialization

In their paper, Glorot and Bengio propose a way to significantly alleviate this problem. We need the signal to flow properly in both directions: in the forward direction when making predictions, and in the reverse direction when backpropagating gradients. We don't want the signal to die out, nor do we want it to explode and saturate. For the signal to flow properly, the authors argue that we need the variance of the outputs of each layer to be equal to the variance of its inputs,2 and we also need the gradients to have equal variance before and after flowing through a layer in the reverse direction (please check out the paper if you are interested in the mathematical details). It is actually not possible to guarantee both unless the layer has an equal number of input and output connections, but they proposed a good compromise that has proven to work very well in practice: the connection weights must be initialized randomly as described in Equation 11-1, where n_inputs and n_outputs are the number of input and output connections for the layer whose weights are being initialized (also called fan-in and fan-out). This initialization strategy is often called Xavier initialization (after the author's first name), or sometimes Glorot initialization.

Equation 11-1. Xavier initialization (when using the logistic activation function)

Normal distribution with mean 0 and standard deviation σ = √(2 / (n_inputs + n_outputs))

Or a uniform distribution between −r and +r, with r = √(6 / (n_inputs + n_outputs))

When the number of input connections is roughly equal to the number of output connections, you get simpler equations (e.g., σ = 1/√n_inputs or r = √3/√n_inputs). We used this simplified strategy in Chapter 10.3
Using the Xavier initialization strategy can speed up training considerably, and it is one of the tricks that led to the current success of Deep Learning. Some recent papers4 have provided similar strategies for different activation functions, as shown in Table 11-1. The initialization strategy for the ReLU activation function (and its variants, including the ELU activation described shortly) is sometimes called He initialization (after the last name of its author).
Table 11-1. Initialization parameters for each type of activation function

Activation function       | Uniform distribution [–r, r]            | Normal distribution
Logistic                  | r = √(6 / (n_inputs + n_outputs))       | σ = √(2 / (n_inputs + n_outputs))
Hyperbolic tangent        | r = 4 √(6 / (n_inputs + n_outputs))     | σ = 4 √(2 / (n_inputs + n_outputs))
ReLU (and its variants)   | r = √2 √(6 / (n_inputs + n_outputs))    | σ = √2 √(2 / (n_inputs + n_outputs))
By default, the tf.layers.dense() function (introduced in Chapter 10) uses Xavier initialization (with a uniform distribution). You can change this to He initialization by using the variance_scaling_initializer() function like this:

he_init = tf.contrib.layers.variance_scaling_initializer()
hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu,
                          kernel_initializer=he_init, name="hidden1")
NOTE
He initialization considers only the fan-in, not the average between fan-in and fan-out like in Xavier initialization. This is also the default for the variance_scaling_initializer() function, but you can change this by setting the argument mode="FAN_AVG".
Nonsaturating Activation Functions

One of the insights in the 2010 paper by Glorot and Bengio was that the vanishing/exploding gradients problems were in part due to a poor choice of activation function. Until then most people had assumed that if Mother Nature had chosen to use roughly sigmoid activation functions in biological neurons, they must be an excellent choice. But it turns out that other activation functions behave much better in deep neural networks, in particular the ReLU activation function, mostly because it does not saturate for positive values (and also because it is quite fast to compute).

Unfortunately, the ReLU activation function is not perfect. It suffers from a problem known as the dying ReLUs: during training, some neurons effectively die, meaning they stop outputting anything other than 0. In some cases, you may find that half of your network's neurons are dead, especially if you used a large learning rate. During training, if a neuron's weights get updated such that the weighted sum of the neuron's inputs is negative, it will start outputting 0. When this happens, the neuron is unlikely to come back to life since the gradient of the ReLU function is 0 when its input is negative.

To solve this problem, you may want to use a variant of the ReLU function, such as the leaky ReLU. This function is defined as LeakyReLUα(z) = max(αz, z) (see Figure 11-2). The hyperparameter α defines how much the function "leaks": it is the slope of the function for z < 0, and is typically set to 0.01. This small slope ensures that leaky ReLUs never die; they can go into a long coma, but they have a chance to eventually wake up. A recent paper5 compared several variants of the ReLU activation function and one of its conclusions was that the leaky variants always outperformed the strict ReLU activation function. In fact, setting α = 0.2 (huge leak) seemed to result in better performance than α = 0.01 (small leak). They also evaluated the randomized leaky ReLU (RReLU), where α is picked randomly in a given range during training, and it is fixed to an average value during testing. It also performed fairly well and seemed to act as a regularizer (reducing the risk of overfitting the training set). Finally, they also evaluated the parametric leaky ReLU (PReLU), where α is authorized to be learned during training (instead of being a hyperparameter, it becomes a parameter that can be modified by backpropagation like any other parameter). This was reported to strongly outperform ReLU on large image datasets, but on smaller datasets it runs the risk of overfitting the training set.
Figure 11-2. Leaky ReLU
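TensorFlow 1.x does not ship a ready-made PReLU layer, so here is a minimal sketch of how you might implement one yourself. The helper name parametric_relu() is just illustrative, and for simplicity it learns a single scalar α per layer (the paper allows one α per channel):

def parametric_relu(z, name="prelu"):
    # alpha is a trainable variable, so backpropagation can adjust the leak
    with tf.variable_scope(name):
        alpha = tf.get_variable("alpha", shape=(), dtype=tf.float32,
                                initializer=tf.constant_initializer(0.01))
        return tf.maximum(alpha * z, z)

hidden1 = tf.layers.dense(X, n_hidden1, name="hidden1")
hidden1_act = parametric_relu(hidden1, name="prelu1")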
Last but not least, a 2015 paper by Djork-Arné Clevert et al.6 proposed a new activation function called the exponential linear unit (ELU) that outperformed all the ReLU variants in their experiments: training time was reduced and the neural network performed better on the test set. It is represented in Figure 11-3, and Equation 11-2 shows its definition.

Equation 11-2. ELU activation function

ELUα(z) = α(exp(z) − 1) if z < 0
          z              if z ≥ 0

Figure 11-3. ELU activation function
It looks a lot like the ReLU function, with a few major differences:

First, it takes on negative values when z < 0, which allows the unit to have an average output closer to 0. This helps alleviate the vanishing gradients problem, as discussed earlier. The hyperparameter α defines the value that the ELU function approaches when z is a large negative number. It is usually set to 1, but you can tweak it like any other hyperparameter if you want.

Second, it has a nonzero gradient for z < 0, which avoids the dying units issue.

Third, the function is smooth everywhere, including around z = 0, which helps speed up Gradient Descent, since it does not bounce as much left and right of z = 0.

The main drawback of the ELU activation function is that it is slower to compute than the ReLU and its variants (due to the use of the exponential function), but during training this is compensated by the faster convergence rate. However, at test time an ELU network will be slower than a ReLU network.

TIP
So which activation function should you use for the hidden layers of your deep neural networks? Although your mileage will vary, in general ELU > leaky ReLU (and its variants) > ReLU > tanh > logistic. If you care a lot about runtime performance, then you may prefer leaky ReLUs over ELUs. If you don't want to tweak yet another hyperparameter, you may just use the default α values suggested earlier (0.01 for the leaky ReLU, and 1 for ELU). If you have spare time and computing power, you can use cross-validation to evaluate other activation functions, in particular RReLU if your network is overfitting, or PReLU if you have a huge training set.
TensorFlow offers an elu() function that you can use to build your neural network. Simply set the activation argument when calling the dense() function, like this:

hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.elu, name="hidden1")

TensorFlow does not have a predefined function for leaky ReLUs, but it is easy enough to define:

def leaky_relu(z, name=None):
    return tf.maximum(0.01 * z, z, name=name)

hidden1 = tf.layers.dense(X, n_hidden1, activation=leaky_relu, name="hidden1")
Batch Normalization

Although using He initialization along with ELU (or any variant of ReLU) can significantly reduce the vanishing/exploding gradients problems at the beginning of training, it doesn't guarantee that they won't come back during training.

In a 2015 paper,7 Sergey Ioffe and Christian Szegedy proposed a technique called Batch Normalization (BN) to address the vanishing/exploding gradients problems, and more generally the problem that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change (which they call the Internal Covariate Shift problem).

The technique consists of adding an operation in the model just before the activation function of each layer, simply zero-centering and normalizing the inputs, then scaling and shifting the result using two new parameters per layer (one for scaling, the other for shifting). In other words, this operation lets the model learn the optimal scale and mean of the inputs for each layer.

In order to zero-center and normalize the inputs, the algorithm needs to estimate the inputs' mean and standard deviation. It does so by evaluating the mean and standard deviation of the inputs over the current mini-batch (hence the name "Batch Normalization"). The whole operation is summarized in Equation 11-3.
Equation 11-3. Batch Normalization algorithm

1. μB = (1/mB) Σi x(i)
2. σB² = (1/mB) Σi (x(i) − μB)²
3. x̂(i) = (x(i) − μB) / √(σB² + ϵ)
4. z(i) = γ x̂(i) + β

μB is the empirical mean, evaluated over the whole mini-batch B.

σB is the empirical standard deviation, also evaluated over the whole mini-batch.

mB is the number of instances in the mini-batch.

x̂(i) is the zero-centered and normalized input.

γ is the scaling parameter for the layer.

β is the shifting parameter (offset) for the layer.

ϵ is a tiny number to avoid division by zero (typically 10^–5). This is called a smoothing term.

z(i) is the output of the BN operation: it is a scaled and shifted version of the inputs.
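To make the algorithm concrete, here is a minimal NumPy sketch of Equation 11-3 applied to one mini-batch; the variable names mirror the symbols above, and gamma and beta would be learned parameters in a real network:

import numpy as np

def batch_norm_forward(X_batch, gamma, beta, epsilon=1e-5):
    # Steps 1-2: empirical mean and variance over the mini-batch (per feature)
    mu_B = X_batch.mean(axis=0)
    var_B = X_batch.var(axis=0)
    # Step 3: zero-center and normalize each instance
    X_hat = (X_batch - mu_B) / np.sqrt(var_B + epsilon)
    # Step 4: scale and shift using the learned parameters
    return gamma * X_hat + beta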
At test time, there is no mini-batch to compute the empirical mean and standard deviation, so instead you simply use the whole training set's mean and standard deviation. These are typically efficiently computed during training using a moving average. So, in total, four parameters are learned for each batch-normalized layer: γ (scale), β (offset), μ (mean), and σ (standard deviation).

The authors demonstrated that this technique considerably improved all the deep neural networks they experimented with. The vanishing gradients problem was strongly reduced, to the point that they could use saturating activation functions such as the tanh and even the logistic activation function. The networks were also much less sensitive to the weight initialization. They were able to use much larger learning rates, significantly speeding up the learning process. Specifically, they note that "Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. […] Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters." Finally, like a gift that keeps on giving, Batch Normalization also acts like a regularizer, reducing the need for other regularization techniques (such as dropout, described later in the chapter).

Batch Normalization does, however, add some complexity to the model (although it removes the need for normalizing the input data since the first hidden layer will take care of that, provided it is batch-normalized). Moreover, there is a runtime penalty: the neural network makes slower predictions due to the extra computations required at each layer. So if you need predictions to be lightning-fast, you may want to check how well plain ELU + He initialization perform before playing with Batch Normalization.

NOTE
You may find that training is rather slow at first while Gradient Descent is searching for the optimal scales and offsets for each layer, but it accelerates once it has found reasonably good values.
Implementing Batch Normalization with TensorFlow

TensorFlow provides a tf.nn.batch_normalization() function that simply centers and normalizes the inputs, but you must compute the mean and standard deviation yourself (based on the mini-batch data during training or on the full dataset during testing, as just discussed) and pass them as parameters to this function, and you must also handle the creation of the scaling and offset parameters (and pass them to this function). It is doable, but not the most convenient approach. Instead, you should use the tf.layers.batch_normalization() function, which handles all this for you, as in the following code:

import tensorflow as tf

n_inputs = 28 * 28
n_hidden1 = 300
n_hidden2 = 100
n_outputs = 10

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
training = tf.placeholder_with_default(False, shape=(), name='training')

hidden1 = tf.layers.dense(X, n_hidden1, name="hidden1")
bn1 = tf.layers.batch_normalization(hidden1, training=training, momentum=0.9)
bn1_act = tf.nn.elu(bn1)

hidden2 = tf.layers.dense(bn1_act, n_hidden2, name="hidden2")
bn2 = tf.layers.batch_normalization(hidden2, training=training, momentum=0.9)
bn2_act = tf.nn.elu(bn2)

logits_before_bn = tf.layers.dense(bn2_act, n_outputs, name="outputs")
logits = tf.layers.batch_normalization(logits_before_bn, training=training,
                                       momentum=0.9)
Let’swalkthroughthiscode.Thefirstlinesarefairlyself-explanatory,untilwedefinethetrainingplaceholder:wewillsetittoTrueduringtraining,butotherwiseitwilldefaulttoFalse.Thiswillbeusedtotellthetf.layers.batch_normalization()functionwhetheritshouldusethecurrentmini-batch’smeanandstandarddeviation(duringtraining)orthewholetrainingset’smeanandstandarddeviation(duringtesting).
Then,wealternatefullyconnectedlayersandbatchnormalizationlayers:thefullyconnectedlayersarecreatedusingthetf.layers.dense()function,justlikewedidinChapter10.Notethatwedon’tspecifyanyactivationfunctionforthefullyconnectedlayersbecausewewanttoapplytheactivationfunctionaftereachbatchnormalizationlayer.8Wecreatethebatchnormalizationlayersusingthetf.layers.batch_normalization()function,settingitstrainingandmomentumparameters.TheBNalgorithmusesexponentialdecaytocomputetherunningaverages,whichiswhyitrequiresthemomentumparameter:givenanewvaluev,therunningaverage isupdatedthroughtheequation:
Agoodmomentumvalueistypicallycloseto1—forexample,0.9,0.99,or0.999(youwantmore9sforlargerdatasetsandsmallermini-batches).
You may have noticed that the code is quite repetitive, with the same batch normalization parameters appearing over and over again. To avoid this repetition, you can use the partial() function from the functools module (part of Python's standard library). It creates a thin wrapper around a function and allows you to define default values for some parameters. The creation of the network layers in the preceding code can be modified like so:

from functools import partial

my_batch_norm_layer = partial(tf.layers.batch_normalization,
                              training=training, momentum=0.9)

hidden1 = tf.layers.dense(X, n_hidden1, name="hidden1")
bn1 = my_batch_norm_layer(hidden1)
bn1_act = tf.nn.elu(bn1)
hidden2 = tf.layers.dense(bn1_act, n_hidden2, name="hidden2")
bn2 = my_batch_norm_layer(hidden2)
bn2_act = tf.nn.elu(bn2)
logits_before_bn = tf.layers.dense(bn2_act, n_outputs, name="outputs")
logits = my_batch_norm_layer(logits_before_bn)

It may not look much better than before in this small example, but if you have 10 layers and want to use the same activation function, initializer, regularizer, and so on, in all layers, this trick will make your code much more readable.
The rest of the construction phase is the same as in Chapter 10: define the cost function, create an optimizer, tell it to minimize the cost function, define the evaluation operations, create a Saver, and so on.

The execution phase is also pretty much the same, with two exceptions. First, during training, whenever you run an operation that depends on the batch_normalization() layer, you need to set the training placeholder to True. Second, the batch_normalization() function creates a few operations that must be evaluated at each step during training in order to update the moving averages (recall that these moving averages are needed to evaluate the training set's mean and standard deviation). These operations are automatically added to the UPDATE_OPS collection, so all we need to do is get the list of operations in that collection and run them at each training iteration:
extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(mnist.train.num_examples // batch_size):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run([training_op, extra_update_ops],
                     feed_dict={training: True, X: X_batch, y: y_batch})
        accuracy_val = accuracy.eval(feed_dict={X: mnist.test.images,
                                                y: mnist.test.labels})
        print(epoch, "Test accuracy:", accuracy_val)

    save_path = saver.save(sess, "./my_model_final.ckpt")
That’sall!Inthistinyexamplewithjusttwolayers,it’sunlikelythatBatchNormalizationwillhaveaverypositiveimpact,butfordeepernetworksitcanmakeatremendousdifference.
Gradient Clipping

A popular technique to lessen the exploding gradients problem is to simply clip the gradients during backpropagation so that they never exceed some threshold (this is mostly useful for recurrent neural networks; see Chapter 14). This is called Gradient Clipping.9 In general people now prefer Batch Normalization, but it's still useful to know about Gradient Clipping and how to implement it.

In TensorFlow, the optimizer's minimize() function takes care of both computing the gradients and applying them, so you must instead call the optimizer's compute_gradients() method first, then create an operation to clip the gradients using the clip_by_value() function, and finally create an operation to apply the clipped gradients using the optimizer's apply_gradients() method:

threshold = 1.0

optimizer = tf.train.GradientDescentOptimizer(learning_rate)
grads_and_vars = optimizer.compute_gradients(loss)
capped_gvs = [(tf.clip_by_value(grad, -threshold, threshold), var)
              for grad, var in grads_and_vars]
training_op = optimizer.apply_gradients(capped_gvs)

You would then run this training_op at every training step, as usual. It will compute the gradients, clip them between –1.0 and 1.0, and apply them. The threshold is a hyperparameter you can tune.
Reusing Pretrained Layers

It is generally not a good idea to train a very large DNN from scratch: instead, you should always try to find an existing neural network that accomplishes a similar task to the one you are trying to tackle, then just reuse the lower layers of this network: this is called transfer learning. It will not only speed up training considerably, but will also require much less training data.

For example, suppose that you have access to a DNN that was trained to classify pictures into 100 different categories, including animals, plants, vehicles, and everyday objects. You now want to train a DNN to classify specific types of vehicles. These tasks are very similar, so you should try to reuse parts of the first network (see Figure 11-4).

Figure 11-4. Reusing pretrained layers

NOTE
If the input pictures of your new task don't have the same size as the ones used in the original task, you will have to add a preprocessing step to resize them to the size expected by the original model. More generally, transfer learning will only work well if the inputs have similar low-level features.
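For example, here is a minimal sketch of such a resizing step using TensorFlow's image operations, assuming (purely for illustration) that the original model expects 100 × 100 RGB images:

images = tf.placeholder(tf.float32, shape=(None, None, None, 3), name="images")
# Resize every incoming picture to the 100 x 100 size expected by the pretrained model
resized_images = tf.image.resize_images(images, size=[100, 100])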
Reusing a TensorFlow Model

If the original model was trained using TensorFlow, you can simply restore it and train it on the new task. As we discussed in Chapter 9, you can use the import_meta_graph() function to import the operations into the default graph. This returns a Saver that you can later use to load the model's state:

saver = tf.train.import_meta_graph("./my_model_final.ckpt.meta")

You must then get a handle on the operations and tensors you will need for training. For this, you can use the graph's get_operation_by_name() and get_tensor_by_name() methods. The name of a tensor is the name of the operation that outputs it followed by :0 (or :1 if it is the second output, :2 if it is the third, and so on):

X = tf.get_default_graph().get_tensor_by_name("X:0")
y = tf.get_default_graph().get_tensor_by_name("y:0")

accuracy = tf.get_default_graph().get_tensor_by_name("eval/accuracy:0")

training_op = tf.get_default_graph().get_operation_by_name("GradientDescent")

If the pretrained model is not well documented, then you will have to explore the graph to find the names of the operations you will need. In this case, you can either explore the graph using TensorBoard (for this you must first export the graph using a FileWriter, as discussed in Chapter 9), or you can use the graph's get_operations() method to list all the operations:

for op in tf.get_default_graph().get_operations():
    print(op.name)
If you are the author of the original model, you could make things easier for people who will reuse your model by giving operations very clear names and documenting them. Another approach is to create a collection containing all the important operations that people will want to get a handle on:

for op in (X, y, accuracy, training_op):
    tf.add_to_collection("my_important_ops", op)

This way people who reuse your model will be able to simply write:

X, y, accuracy, training_op = tf.get_collection("my_important_ops")
You can then restore the model's state using the Saver and continue training using your own data:

with tf.Session() as sess:
    saver.restore(sess, "./my_model_final.ckpt")
    [...] # train the model on your own data

Alternatively, if you have access to the Python code that built the original graph, you can use it instead of import_meta_graph().
In general, you will want to reuse only part of the original model, typically the lower layers. If you use import_meta_graph() to restore the graph, it will load the entire original graph, but nothing prevents you from just ignoring the layers you do not care about. For example, as shown in Figure 11-4, you could build new layers (e.g., one hidden layer and one output layer) on top of a pretrained layer (e.g., pretrained hidden layer 3). You would also need to compute the loss for this new output, and create an optimizer to minimize that loss.

If you have access to the pretrained graph's Python code, you can just reuse the parts you need and chop out the rest. However, in this case you need a Saver to restore the pretrained model (specifying which variables you want to restore; otherwise, TensorFlow will complain that the graphs do not match), and another Saver to save the new model. For example, the following code restores only hidden layers 1, 2, and 3:

[...] # build the new model with the same hidden layers 1-3 as before

reuse_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES,
                               scope="hidden[123]") # regular expression
reuse_vars_dict = dict([(var.op.name, var) for var in reuse_vars])
restore_saver = tf.train.Saver(reuse_vars_dict) # to restore layers 1-3

init = tf.global_variables_initializer() # to init all variables, old and new
saver = tf.train.Saver() # to save the new model

with tf.Session() as sess:
    init.run()
    restore_saver.restore(sess, "./my_model_final.ckpt")
    [...] # train the model
    save_path = saver.save(sess, "./my_new_model_final.ckpt")

First we build the new model, making sure to copy the original model's hidden layers 1 to 3. Then we get the list of all variables in hidden layers 1 to 3, using the regular expression "hidden[123]". Next, we create a dictionary that maps the name of each variable in the original model to its name in the new model (generally you want to keep the exact same names). Then we create a Saver that will restore only these variables. We also create an operation to initialize all the variables (old and new) and a second Saver to save the entire new model, not just layers 1 to 3. We then start a session and initialize all variables in the model, then restore the variable values from the original model's layers 1 to 3. Finally, we train the model on the new task and save it.
TIP
The more similar the tasks are, the more layers you want to reuse (starting with the lower layers). For very similar tasks, you can try keeping all the hidden layers and just replace the output layer.
Reusing Models from Other Frameworks

If the model was trained using another framework, you will need to load the model parameters manually (e.g., using Theano code if it was trained with Theano), then assign them to the appropriate variables. This can be quite tedious. For example, the following code shows how you would copy the weights and biases from the first hidden layer of a model trained using another framework:

original_w = [...] # Load the weights from the other framework
original_b = [...] # Load the biases from the other framework

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu, name="hidden1")
[...] # Build the rest of the model

# Get a handle on the assignment nodes for the hidden1 variables
graph = tf.get_default_graph()
assign_kernel = graph.get_operation_by_name("hidden1/kernel/Assign")
assign_bias = graph.get_operation_by_name("hidden1/bias/Assign")
init_kernel = assign_kernel.inputs[1]
init_bias = assign_bias.inputs[1]

init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init, feed_dict={init_kernel: original_w, init_bias: original_b})
    # [...] Train the model on your new task

In this implementation, we first load the pretrained model using the other framework (not shown here), and we extract from it the model parameters we want to reuse. Next, we build our TensorFlow model as usual. Then comes the tricky part: every TensorFlow variable has an associated assignment operation that is used to initialize it. We start by getting a handle on these assignment operations (they have the same name as the variable, plus "/Assign"). We also get a handle on each assignment operation's second input: in the case of an assignment operation, the second input corresponds to the value that will be assigned to the variable, so in this case it is the variable's initialization value. Once we start the session, we run the usual initialization operation, but this time we feed it the values we want for the variables we want to reuse. Alternatively, we could have created new assignment operations and placeholders, and used them to set the values of the variables after initialization. But why create new nodes in the graph when everything we need is already there?
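For completeness, here is a minimal sketch of that alternative approach, assuming the same hidden1 layer as above; it creates explicit placeholders and assignment operations instead of reusing the existing initialization nodes:

# Get handles on the hidden1 variables themselves
hidden1_vars = {var.op.name: var for var in
                tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope="hidden1")}
kernel_var = hidden1_vars["hidden1/kernel"]
bias_var = hidden1_vars["hidden1/bias"]

# New placeholders and assignment operations for the pretrained values
new_kernel = tf.placeholder(tf.float32, shape=kernel_var.get_shape())
new_bias = tf.placeholder(tf.float32, shape=bias_var.get_shape())
assign_kernel_op = tf.assign(kernel_var, new_kernel)
assign_bias_op = tf.assign(bias_var, new_bias)

with tf.Session() as sess:
    sess.run(init)  # initialize all variables first
    # then overwrite the reused layer's weights and biases
    sess.run([assign_kernel_op, assign_bias_op],
             feed_dict={new_kernel: original_w, new_bias: original_b})
    # [...] Train the model on your new task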
Freezing the Lower Layers

It is likely that the lower layers of the first DNN have learned to detect low-level features in pictures that will be useful across both image classification tasks, so you can just reuse these layers as they are. It is generally a good idea to "freeze" their weights when training the new DNN: if the lower-layer weights are fixed, then the higher-layer weights will be easier to train (because they won't have to learn a moving target). To freeze the lower layers during training, one solution is to give the optimizer the list of variables to train, excluding the variables from the lower layers:

train_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES,
                               scope="hidden[34]|outputs")
training_op = optimizer.minimize(loss, var_list=train_vars)

The first line gets the list of all trainable variables in hidden layers 3 and 4 and in the output layer. This leaves out the variables in hidden layers 1 and 2. Next we provide this restricted list of trainable variables to the optimizer's minimize() function. Ta-da! Layers 1 and 2 are now frozen: they will not budge during training (these are often called frozen layers).
Another option is to add a stop_gradient() layer in the graph. Any layer below it will be frozen:

with tf.name_scope("dnn"):
    hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu,
                              name="hidden1")              # reused, frozen
    hidden2 = tf.layers.dense(hidden1, n_hidden2, activation=tf.nn.relu,
                              name="hidden2")              # reused, frozen
    hidden2_stop = tf.stop_gradient(hidden2)
    hidden3 = tf.layers.dense(hidden2_stop, n_hidden3, activation=tf.nn.relu,
                              name="hidden3")              # reused, not frozen
    hidden4 = tf.layers.dense(hidden3, n_hidden4, activation=tf.nn.relu,
                              name="hidden4")              # new!
    logits = tf.layers.dense(hidden4, n_outputs, name="outputs")  # new!
Caching the Frozen Layers

Since the frozen layers won't change, it is possible to cache the output of the topmost frozen layer for each training instance. Since training goes through the whole dataset many times, this will give you a huge speed boost as you will only need to go through the frozen layers once per training instance (instead of once per epoch). For example, you could first run the whole training set through the lower layers (assuming you have enough RAM), then during training, instead of building batches of training instances, you would build batches of outputs from hidden layer 2 and feed them to the training operation:

import numpy as np

n_batches = mnist.train.num_examples // batch_size

with tf.Session() as sess:
    init.run()
    restore_saver.restore(sess, "./my_model_final.ckpt")

    h2_cache = sess.run(hidden2, feed_dict={X: mnist.train.images})

    for epoch in range(n_epochs):
        shuffled_idx = np.random.permutation(mnist.train.num_examples)
        hidden2_batches = np.array_split(h2_cache[shuffled_idx], n_batches)
        y_batches = np.array_split(mnist.train.labels[shuffled_idx], n_batches)
        for hidden2_batch, y_batch in zip(hidden2_batches, y_batches):
            sess.run(training_op, feed_dict={hidden2: hidden2_batch, y: y_batch})

    save_path = saver.save(sess, "./my_new_model_final.ckpt")

The last line of the training loop runs the training operation defined earlier (which does not touch layers 1 and 2), and feeds it a batch of outputs from the second hidden layer (as well as the targets for that batch). Since we give TensorFlow the output of hidden layer 2, it does not try to evaluate it (or any node it depends on).
Tweaking, Dropping, or Replacing the Upper Layers

The output layer of the original model should usually be replaced since it is most likely not useful at all for the new task, and it may not even have the right number of outputs for the new task.

Similarly, the upper hidden layers of the original model are less likely to be as useful as the lower layers, since the high-level features that are most useful for the new task may differ significantly from the ones that were most useful for the original task. You want to find the right number of layers to reuse.

Try freezing all the copied layers first, then train your model and see how it performs. Then try unfreezing one or two of the top hidden layers to let backpropagation tweak them and see if performance improves. The more training data you have, the more layers you can unfreeze.

If you still cannot get good performance, and you have little training data, try dropping the top hidden layer(s) and freeze all remaining hidden layers again. You can iterate until you find the right number of layers to reuse. If you have plenty of training data, you may try replacing the top hidden layers instead of dropping them, and even add more hidden layers.
Model Zoos

Where can you find a neural network trained for a task similar to the one you want to tackle? The first place to look is obviously in your own catalog of models. This is one good reason to save all your models and organize them so you can retrieve them later easily. Another option is to search in a model zoo. Many people train Machine Learning models for various tasks and kindly release their pretrained models to the public.

TensorFlow has its own model zoo available at https://github.com/tensorflow/models. In particular, it contains most of the state-of-the-art image classification nets such as VGG, Inception, and ResNet (see Chapter 13, and check out the models/slim directory), including the code, the pretrained models, and tools to download popular image datasets.

Another popular model zoo is Caffe's Model Zoo. It also contains many computer vision models (e.g., LeNet, AlexNet, ZFNet, GoogLeNet, VGGNet, Inception) trained on various datasets (e.g., ImageNet, Places Database, CIFAR10, etc.). Saumitro Dasgupta wrote a converter, which is available at https://github.com/ethereon/caffe-tensorflow.
Unsupervised Pretraining

Suppose you want to tackle a complex task for which you don't have much labeled training data, but unfortunately you cannot find a model trained on a similar task. Don't lose all hope! First, you should of course try to gather more labeled training data, but if this is too hard or too expensive, you may still be able to perform unsupervised pretraining (see Figure 11-5). That is, if you have plenty of unlabeled training data, you can try to train the layers one by one, starting with the lowest layer and then going up, using an unsupervised feature detector algorithm such as Restricted Boltzmann Machines (RBMs; see Appendix E) or autoencoders (see Chapter 15). Each layer is trained on the output of the previously trained layers (all layers except the one being trained are frozen). Once all layers have been trained this way, you can fine-tune the network using supervised learning (i.e., with backpropagation).

This is a rather long and tedious process, but it often works well; in fact, it is this technique that Geoffrey Hinton and his team used in 2006 and which led to the revival of neural networks and the success of Deep Learning. Until 2010, unsupervised pretraining (typically using RBMs) was the norm for deep nets, and it was only after the vanishing gradients problem was alleviated that it became much more common to train DNNs purely using backpropagation. However, unsupervised pretraining (today typically using autoencoders rather than RBMs) is still a good option when you have a complex task to solve, no similar model you can reuse, and little labeled training data but plenty of unlabeled training data.10

Figure 11-5. Unsupervised pretraining
Pretraining on an Auxiliary Task

One last option is to train a first neural network on an auxiliary task for which you can easily obtain or generate labeled training data, then reuse the lower layers of that network for your actual task. The first neural network's lower layers will learn feature detectors that will likely be reusable by the second neural network.

For example, if you want to build a system to recognize faces, you may only have a few pictures of each individual—clearly not enough to train a good classifier. Gathering hundreds of pictures of each person would not be practical. However, you could gather a lot of pictures of random people on the internet and train a first neural network to detect whether or not two different pictures feature the same person. Such a network would learn good feature detectors for faces, so reusing its lower layers would allow you to train a good face classifier using little training data.

It is often rather cheap to gather unlabeled training examples, but quite expensive to label them. In this situation, a common technique is to label all your training examples as "good," then generate many new training instances by corrupting the good ones, and label these corrupted instances as "bad." Then you can train a first neural network to classify instances as good or bad. For example, you could download millions of sentences, label them as "good," then randomly change a word in each sentence and label the resulting sentences as "bad." If a neural network can tell that "The dog sleeps" is a good sentence but "The dog they" is bad, it probably knows quite a lot about language. Reusing its lower layers will likely help in many language processing tasks.

Another approach is to train a first network to output a score for each training instance, and use a cost function that ensures that a good instance's score is greater than a bad instance's score by at least some margin. This is called max margin learning.
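As a rough illustration, here is a minimal sketch of such a margin-based cost in TensorFlow; the tensors good_scores and bad_scores are hypothetical outputs of the network for matched pairs of good and corrupted instances, and the margin and learning rate values are just examples:

margin = 1.0
# Penalize pairs where the good instance does not beat the bad one by at least the margin
margin_losses = tf.maximum(0.0, margin - (good_scores - bad_scores))
loss = tf.reduce_mean(margin_losses, name="max_margin_loss")
training_op = tf.train.GradientDescentOptimizer(learning_rate=0.01).minimize(loss)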
Faster Optimizers

Training a very large deep neural network can be painfully slow. So far we have seen four ways to speed up training (and reach a better solution): applying a good initialization strategy for the connection weights, using a good activation function, using Batch Normalization, and reusing parts of a pretrained network. Another huge speed boost comes from using a faster optimizer than the regular Gradient Descent optimizer. In this section we will present the most popular ones: Momentum optimization, Nesterov Accelerated Gradient, AdaGrad, RMSProp, and finally Adam optimization.
Momentum Optimization

Imagine a bowling ball rolling down a gentle slope on a smooth surface: it will start out slowly, but it will quickly pick up momentum until it eventually reaches terminal velocity (if there is some friction or air resistance). This is the very simple idea behind Momentum optimization, proposed by Boris Polyak in 1964.11 In contrast, regular Gradient Descent will simply take small regular steps down the slope, so it will take much more time to reach the bottom.

Recall that Gradient Descent simply updates the weights θ by directly subtracting the gradient of the cost function J(θ) with regards to the weights (∇θJ(θ)) multiplied by the learning rate η. The equation is: θ ← θ – η∇θJ(θ). It does not care about what the earlier gradients were. If the local gradient is tiny, it goes very slowly.

Momentum optimization cares a great deal about what previous gradients were: at each iteration, it subtracts the local gradient from the momentum vector m (multiplied by the learning rate η), and it updates the weights by simply adding this momentum vector (see Equation 11-4). In other words, the gradient is used as an acceleration, not as a speed. To simulate some sort of friction mechanism and prevent the momentum from growing too large, the algorithm introduces a new hyperparameter β, simply called the momentum, which must be set between 0 (high friction) and 1 (no friction). A typical momentum value is 0.9.

Equation 11-4. Momentum algorithm

1. m ← βm − η∇θJ(θ)
2. θ ← θ + m

You can easily verify that if the gradient remains constant, the terminal velocity (i.e., the maximum size of the weight updates) is equal to that gradient multiplied by the learning rate η multiplied by 1/(1 − β) (ignoring the sign). For example, if β = 0.9, then the terminal velocity is equal to 10 times the gradient times the learning rate, so Momentum optimization ends up going 10 times faster than Gradient Descent! This allows Momentum optimization to escape from plateaus much faster than Gradient Descent. In particular, we saw in Chapter 4 that when the inputs have very different scales the cost function will look like an elongated bowl (see Figure 4-7). Gradient Descent goes down the steep slope quite fast, but then it takes a very long time to go down the valley. In contrast, Momentum optimization will roll down the bottom of the valley faster and faster until it reaches the bottom (the optimum). In deep neural networks that don't use Batch Normalization, the upper layers will often end up having inputs with very different scales, so using Momentum optimization helps a lot. It can also help roll past local optima.

NOTE
Due to the momentum, the optimizer may overshoot a bit, then come back, overshoot again, and oscillate like this many times before stabilizing at the minimum. This is one of the reasons why it is good to have a bit of friction in the system: it gets rid of these oscillations and thus speeds up convergence.

Implementing Momentum optimization in TensorFlow is a no-brainer: just replace the GradientDescentOptimizer with the MomentumOptimizer, then lie back and profit!

optimizer = tf.train.MomentumOptimizer(learning_rate=learning_rate,
                                       momentum=0.9)

The one drawback of Momentum optimization is that it adds yet another hyperparameter to tune. However, the momentum value of 0.9 usually works well in practice and almost always goes faster than Gradient Descent.
Nesterov Accelerated Gradient

One small variant to Momentum optimization, proposed by Yurii Nesterov in 1983,12 is almost always faster than vanilla Momentum optimization. The idea of Nesterov Momentum optimization, or Nesterov Accelerated Gradient (NAG), is to measure the gradient of the cost function not at the local position but slightly ahead in the direction of the momentum (see Equation 11-5). The only difference from vanilla Momentum optimization is that the gradient is measured at θ + βm rather than at θ.

Equation 11-5. Nesterov Accelerated Gradient algorithm

1. m ← βm − η∇θJ(θ + βm)
2. θ ← θ + m

This small tweak works because in general the momentum vector will be pointing in the right direction (i.e., toward the optimum), so it will be slightly more accurate to use the gradient measured a bit farther in that direction rather than using the gradient at the original position, as you can see in Figure 11-6 (where ∇1 represents the gradient of the cost function measured at the starting point θ, and ∇2 represents the gradient at the point located at θ + βm). As you can see, the Nesterov update ends up slightly closer to the optimum. After a while, these small improvements add up and NAG ends up being significantly faster than regular Momentum optimization. Moreover, note that when the momentum pushes the weights across a valley, ∇1 continues to push further across the valley, while ∇2 pushes back toward the bottom of the valley. This helps reduce oscillations and thus converges faster.

NAG will almost always speed up training compared to regular Momentum optimization. To use it, simply set use_nesterov=True when creating the MomentumOptimizer:

optimizer = tf.train.MomentumOptimizer(learning_rate=learning_rate,
                                       momentum=0.9, use_nesterov=True)

Figure 11-6. Regular versus Nesterov Momentum optimization
AdaGrad

Consider the elongated bowl problem again: Gradient Descent starts by quickly going down the steepest slope, then slowly goes down the bottom of the valley. It would be nice if the algorithm could detect this early on and correct its direction to point a bit more toward the global optimum.

The AdaGrad algorithm13 achieves this by scaling down the gradient vector along the steepest dimensions (see Equation 11-6):

Equation 11-6. AdaGrad algorithm

1. s ← s + ∇θJ(θ) ⊗ ∇θJ(θ)
2. θ ← θ − η∇θJ(θ) ⊘ √(s + ϵ)

The first step accumulates the square of the gradients into the vector s (the ⊗ symbol represents the element-wise multiplication). This vectorized form is equivalent to computing si ← si + (∂J(θ)/∂θi)² for each element si of the vector s; in other words, each si accumulates the squares of the partial derivative of the cost function with regards to parameter θi. If the cost function is steep along the ith dimension, then si will get larger and larger at each iteration.

The second step is almost identical to Gradient Descent, but with one big difference: the gradient vector is scaled down by a factor of √(s + ϵ) (the ⊘ symbol represents the element-wise division, and ϵ is a smoothing term to avoid division by zero, typically set to 10^–10). This vectorized form is equivalent to computing θi ← θi − η (∂J(θ)/∂θi) / √(si + ϵ) for all parameters θi (simultaneously).

In short, this algorithm decays the learning rate, but it does so faster for steep dimensions than for dimensions with gentler slopes. This is called an adaptive learning rate. It helps point the resulting updates more directly toward the global optimum (see Figure 11-7). One additional benefit is that it requires much less tuning of the learning rate hyperparameter η.

Figure 11-7. AdaGrad versus Gradient Descent

AdaGrad often performs well for simple quadratic problems, but unfortunately it often stops too early when training neural networks. The learning rate gets scaled down so much that the algorithm ends up stopping entirely before reaching the global optimum. So even though TensorFlow has an AdagradOptimizer, you should not use it to train deep neural networks (it may be efficient for simpler tasks such as Linear Regression, though).
RMSProp

Although AdaGrad slows down a bit too fast and ends up never converging to the global optimum, the RMSProp algorithm14 fixes this by accumulating only the gradients from the most recent iterations (as opposed to all the gradients since the beginning of training). It does so by using exponential decay in the first step (see Equation 11-7).

Equation 11-7. RMSProp algorithm

1. s ← βs + (1 − β)∇θJ(θ) ⊗ ∇θJ(θ)
2. θ ← θ − η∇θJ(θ) ⊘ √(s + ϵ)

The decay rate β is typically set to 0.9. Yes, it is once again a new hyperparameter, but this default value often works well, so you may not need to tune it at all.

As you might expect, TensorFlow has an RMSPropOptimizer class:

optimizer = tf.train.RMSPropOptimizer(learning_rate=learning_rate,
                                      momentum=0.9, decay=0.9, epsilon=1e-10)

Except on very simple problems, this optimizer almost always performs much better than AdaGrad. It also generally converges faster than Momentum optimization and Nesterov Accelerated Gradients. In fact, it was the preferred optimization algorithm of many researchers until Adam optimization came around.
Adam Optimization

Adam,15 which stands for adaptive moment estimation, combines the ideas of Momentum optimization and RMSProp: just like Momentum optimization it keeps track of an exponentially decaying average of past gradients, and just like RMSProp it keeps track of an exponentially decaying average of past squared gradients (see Equation 11-8).16

Equation 11-8. Adam algorithm

1. m ← β1m − (1 − β1)∇θJ(θ)
2. s ← β2s + (1 − β2)∇θJ(θ) ⊗ ∇θJ(θ)
3. m ← m / (1 − β1^T)
4. s ← s / (1 − β2^T)
5. θ ← θ + ηm ⊘ √(s + ϵ)

T represents the iteration number (starting at 1).

If you just look at steps 1, 2, and 5, you will notice Adam's close similarity to both Momentum optimization and RMSProp. The only difference is that step 1 computes an exponentially decaying average rather than an exponentially decaying sum, but these are actually equivalent except for a constant factor (the decaying average is just 1 – β1 times the decaying sum). Steps 3 and 4 are somewhat of a technical detail: since m and s are initialized at 0, they will be biased toward 0 at the beginning of training, so these two steps will help boost m and s at the beginning of training.

The momentum decay hyperparameter β1 is typically initialized to 0.9, while the scaling decay hyperparameter β2 is often initialized to 0.999. As earlier, the smoothing term ϵ is usually initialized to a tiny number such as 10^–8. These are the default values for TensorFlow's AdamOptimizer class, so you can simply use:

optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)

In fact, since Adam is an adaptive learning rate algorithm (like AdaGrad and RMSProp), it requires less tuning of the learning rate hyperparameter η. You can often use the default value η = 0.001, making Adam even easier to use than Gradient Descent.

WARNING
This book initially recommended using Adam optimization, because it was generally considered faster and better than other methods. However, a 2017 paper17 by Ashia C. Wilson et al. showed that adaptive optimization methods (i.e., AdaGrad, RMSProp, and Adam optimization) can lead to solutions that generalize poorly on some datasets. So you may want to stick to Momentum optimization or Nesterov Accelerated Gradient for now, until researchers have a better understanding of this issue.
All the optimization techniques discussed so far only rely on the first-order partial derivatives (Jacobians). The optimization literature contains amazing algorithms based on the second-order partial derivatives (the Hessians). Unfortunately, these algorithms are very hard to apply to deep neural networks because there are n² Hessians per output (where n is the number of parameters), as opposed to just n Jacobians per output. Since DNNs typically have tens of thousands of parameters, the second-order optimization algorithms often don't even fit in memory, and even when they do, computing the Hessians is just too slow.
TRAINING SPARSE MODELS

All the optimization algorithms just presented produce dense models, meaning that most parameters will be nonzero. If you need a blazingly fast model at runtime, or if you need it to take up less memory, you may prefer to end up with a sparse model instead.

One trivial way to achieve this is to train the model as usual, then get rid of the tiny weights (set them to 0).

Another option is to apply strong ℓ1 regularization during training, as it pushes the optimizer to zero out as many weights as it can (as discussed in Chapter 4 about Lasso Regression).

However, in some cases these techniques may remain insufficient. One last option is to apply Dual Averaging, often called Follow The Regularized Leader (FTRL), a technique proposed by Yurii Nesterov.18 When used with ℓ1 regularization, this technique often leads to very sparse models. TensorFlow implements a variant of FTRL called FTRL-Proximal19 in the FtrlOptimizer class.
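For instance, here is a minimal sketch of how you might combine the FtrlOptimizer with ℓ1 regularization to push many weights to exactly zero; the learning rate and regularization strength are just illustrative values:

optimizer = tf.train.FtrlOptimizer(learning_rate=0.05,
                                   l1_regularization_strength=0.01)
training_op = optimizer.minimize(loss)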
Learning Rate Scheduling

Finding a good learning rate can be tricky. If you set it way too high, training may actually diverge (as we discussed in Chapter 4). If you set it too low, training will eventually converge to the optimum, but it will take a very long time. If you set it slightly too high, it will make progress very quickly at first, but it will end up dancing around the optimum, never settling down (unless you use an adaptive learning rate optimization algorithm such as AdaGrad, RMSProp, or Adam, but even then it may take time to settle). If you have a limited computing budget, you may have to interrupt training before it has converged properly, yielding a suboptimal solution (see Figure 11-8).

Figure 11-8. Learning curves for various learning rates η

You may be able to find a fairly good learning rate by training your network several times during just a few epochs using various learning rates and comparing the learning curves. The ideal learning rate will learn quickly and converge to a good solution.

However, you can do better than a constant learning rate: if you start with a high learning rate and then reduce it once it stops making fast progress, you can reach a good solution faster than with the optimal constant learning rate. There are many different strategies to reduce the learning rate during training. These strategies are called learning schedules (we briefly introduced this concept in Chapter 4), the most common of which are:

Predetermined piecewise constant learning rate
For example, set the learning rate to η0 = 0.1 at first, then to η1 = 0.001 after 50 epochs. Although this solution can work very well, it often requires fiddling around to figure out the right learning rates and when to use them.

Performance scheduling
Measure the validation error every N steps (just like for early stopping) and reduce the learning rate by a factor of λ when the error stops dropping.

Exponential scheduling
Set the learning rate to a function of the iteration number t: η(t) = η0 10^(–t/r). This works great, but it requires tuning η0 and r. The learning rate will drop by a factor of 10 every r steps.

Power scheduling
Set the learning rate to η(t) = η0 (1 + t/r)^(–c). The hyperparameter c is typically set to 1. This is similar to exponential scheduling, but the learning rate drops much more slowly.

A 2013 paper20 by Andrew Senior et al. compared the performance of some of the most popular learning schedules when training deep neural networks for speech recognition using Momentum optimization. The authors concluded that, in this setting, both performance scheduling and exponential scheduling performed well, but they favored exponential scheduling because it is simpler to implement, is easy to tune, and converged slightly faster to the optimal solution.
Implementing a learning schedule with TensorFlow is fairly straightforward:

initial_learning_rate = 0.1
decay_steps = 10000
decay_rate = 1/10
global_step = tf.Variable(0, trainable=False, name="global_step")
learning_rate = tf.train.exponential_decay(initial_learning_rate, global_step,
                                           decay_steps, decay_rate)
optimizer = tf.train.MomentumOptimizer(learning_rate, momentum=0.9)
training_op = optimizer.minimize(loss, global_step=global_step)

After setting the hyperparameter values, we create a nontrainable variable global_step (initialized to 0) to keep track of the current training iteration number. Then we define an exponentially decaying learning rate (with η0 = 0.1 and r = 10,000) using TensorFlow's exponential_decay() function. Next, we create an optimizer (in this example, a MomentumOptimizer) using this decaying learning rate. Finally, we create the training operation by calling the optimizer's minimize() method; since we pass it the global_step variable, it will kindly take care of incrementing it. That's it!

Since AdaGrad, RMSProp, and Adam optimization automatically reduce the learning rate during training, it is not necessary to add an extra learning schedule. For other optimization algorithms, using exponential decay or performance scheduling can considerably speed up convergence.
Avoiding Overfitting Through Regularization

With four parameters I can fit an elephant and with five I can make him wiggle his trunk.
—John von Neumann, cited by Enrico Fermi in Nature 427

Deep neural networks typically have tens of thousands of parameters, sometimes even millions. With so many parameters, the network has an incredible amount of freedom and can fit a huge variety of complex datasets. But this great flexibility also means that it is prone to overfitting the training set.

With millions of parameters you can fit the whole zoo. In this section we will present some of the most popular regularization techniques for neural networks, and how to implement them with TensorFlow: early stopping, ℓ1 and ℓ2 regularization, dropout, max-norm regularization, and data augmentation.
Early Stopping

To avoid overfitting the training set, a great solution is early stopping (introduced in Chapter 4): just interrupt training when its performance on the validation set starts dropping.

One way to implement this with TensorFlow is to evaluate the model on a validation set at regular intervals (e.g., every 50 steps), and save a "winner" snapshot if it outperforms previous "winner" snapshots. Count the number of steps since the last "winner" snapshot was saved, and interrupt training when this number reaches some limit (e.g., 2,000 steps). Then restore the last "winner" snapshot.
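Here is a minimal sketch of such a loop, assuming the training objects defined earlier in the chapter (training_op, loss, saver, init, and so on) plus a hypothetical validation set (X_valid, y_valid) and step budget n_steps; the check interval and patience limit are just example values:

best_loss = float("inf")
checks_since_last_progress = 0
max_checks_without_progress = 40  # i.e., 40 checks * 50 steps = 2,000 steps of patience

with tf.Session() as sess:
    init.run()
    for step in range(n_steps):
        X_batch, y_batch = mnist.train.next_batch(batch_size)
        sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        if step % 50 == 0:
            loss_val = loss.eval(feed_dict={X: X_valid, y: y_valid})
            if loss_val < best_loss:
                best_loss = loss_val
                checks_since_last_progress = 0
                saver.save(sess, "./my_best_model.ckpt")  # save the "winner"
            else:
                checks_since_last_progress += 1
                if checks_since_last_progress > max_checks_without_progress:
                    print("Early stopping!")
                    break
    saver.restore(sess, "./my_best_model.ckpt")  # restore the last "winner"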
Although early stopping works very well in practice, you can usually get much higher performance out of your network by combining it with other regularization techniques.
ℓ1 and ℓ2 Regularization

Just like you did in Chapter 4 for simple linear models, you can use ℓ1 and ℓ2 regularization to constrain a neural network's connection weights (but typically not its biases).

One way to do this using TensorFlow is to simply add the appropriate regularization terms to your cost function. For example, assuming you have just one hidden layer with weights W1 and one output layer with weights W2, then you can apply ℓ1 regularization like this:

[...] # construct the neural network

W1 = tf.get_default_graph().get_tensor_by_name("hidden1/kernel:0")
W2 = tf.get_default_graph().get_tensor_by_name("outputs/kernel:0")

scale = 0.001 # l1 regularization hyperparameter

with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y,
                                                              logits=logits)
    base_loss = tf.reduce_mean(xentropy, name="avg_xentropy")
    reg_losses = tf.reduce_sum(tf.abs(W1)) + tf.reduce_sum(tf.abs(W2))
    loss = tf.add(base_loss, scale * reg_losses, name="loss")
However, if there are many layers, this approach is not very convenient. Fortunately, TensorFlow provides a better option. Many functions that create variables (such as get_variable() or tf.layers.dense()) accept a *_regularizer argument for each created variable (e.g., kernel_regularizer). You can pass any function that takes weights as an argument and returns the corresponding regularization loss. The l1_regularizer(), l2_regularizer(), and l1_l2_regularizer() functions return such functions. The following code puts all this together:

my_dense_layer = partial(
    tf.layers.dense, activation=tf.nn.relu,
    kernel_regularizer=tf.contrib.layers.l1_regularizer(scale))

with tf.name_scope("dnn"):
    hidden1 = my_dense_layer(X, n_hidden1, name="hidden1")
    hidden2 = my_dense_layer(hidden1, n_hidden2, name="hidden2")
    logits = my_dense_layer(hidden2, n_outputs, activation=None,
                            name="outputs")

This code creates a neural network with two hidden layers and one output layer, and it also creates nodes in the graph to compute the ℓ1 regularization loss corresponding to each layer's weights. TensorFlow automatically adds these nodes to a special collection containing all the regularization losses. You just need to add these regularization losses to your overall loss, like this:

reg_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
loss = tf.add_n([base_loss] + reg_losses, name="loss")

WARNING
Don't forget to add the regularization losses to your overall loss, or else they will simply be ignored.
Dropout

The most popular regularization technique for deep neural networks is arguably dropout. It was proposed21 by G. E. Hinton in 2012 and further detailed in a paper22 by Nitish Srivastava et al., and it has proven to be highly successful: even the state-of-the-art neural networks got a 1–2% accuracy boost simply by adding dropout. This may not sound like a lot, but when a model already has 95% accuracy, getting a 2% accuracy boost means dropping the error rate by almost 40% (going from 5% error to roughly 3%).

It is a fairly simple algorithm: at every training step, every neuron (including the input neurons but excluding the output neurons) has a probability p of being temporarily "dropped out," meaning it will be entirely ignored during this training step, but it may be active during the next step (see Figure 11-9). The hyperparameter p is called the dropout rate, and it is typically set to 50%. After training, neurons don't get dropped anymore. And that's all (except for a technical detail we will discuss momentarily).

Figure 11-9. Dropout regularization

It is quite surprising at first that this rather brutal technique works at all. Would a company perform better if its employees were told to toss a coin every morning to decide whether or not to go to work? Well, who knows; perhaps it would! The company would obviously be forced to adapt its organization; it could not rely on any single person to fill in the coffee machine or perform any other critical tasks, so this expertise would have to be spread across several people. Employees would have to learn to cooperate with many of their coworkers, not just a handful of them. The company would become much more resilient. If one person quit, it wouldn't make much of a difference. It's unclear whether this idea would actually work for companies, but it certainly does for neural networks. Neurons trained with dropout cannot co-adapt with their neighboring neurons; they have to be as useful as possible on their own. They also cannot rely excessively on just a few input neurons; they must pay attention to each of their input neurons. They end up being less sensitive to slight changes in the inputs. In the end you get a more robust network that generalizes better.

Another way to understand the power of dropout is to realize that a unique neural network is generated at each training step. Since each neuron can be either present or absent, there is a total of 2^N possible networks (where N is the total number of droppable neurons). This is such a huge number that it is virtually impossible for the same neural network to be sampled twice. Once you have run 10,000 training steps, you have essentially trained 10,000 different neural networks (each with just one training instance). These neural networks are obviously not independent since they share many of their weights, but they are nevertheless all different. The resulting neural network can be seen as an averaging ensemble of all these smaller neural networks.

There is one small but important technical detail. Suppose p = 50%, in which case during testing a neuron will be connected to twice as many input neurons as it was (on average) during training. To compensate for this fact, we need to multiply each neuron's input connection weights by 0.5 after training. If we don't, each neuron will get a total input signal roughly twice as large as what the network was trained on, and it is unlikely to perform well. More generally, we need to multiply each input connection weight by the keep probability (1 – p) after training. Alternatively, we can divide each neuron's output by the keep probability during training (these alternatives are not perfectly equivalent, but they work equally well).
To implement dropout using TensorFlow, you can simply apply the tf.layers.dropout() function to the input layer and/or to the output of any hidden layer you want. During training, this function randomly drops some items (setting them to 0) and divides the remaining items by the keep probability. After training, this function does nothing at all. The following code applies dropout regularization to our three-layer neural network:

[...]
training = tf.placeholder_with_default(False, shape=(), name='training')

dropout_rate = 0.5 # == 1 - keep_prob
X_drop = tf.layers.dropout(X, dropout_rate, training=training)

with tf.name_scope("dnn"):
    hidden1 = tf.layers.dense(X_drop, n_hidden1, activation=tf.nn.relu,
                              name="hidden1")
    hidden1_drop = tf.layers.dropout(hidden1, dropout_rate, training=training)
    hidden2 = tf.layers.dense(hidden1_drop, n_hidden2, activation=tf.nn.relu,
                              name="hidden2")
    hidden2_drop = tf.layers.dropout(hidden2, dropout_rate, training=training)
    logits = tf.layers.dense(hidden2_drop, n_outputs, name="outputs")
WARNING
You want to use the tf.layers.dropout() function, not tf.nn.dropout(). The first one turns off (no-op) when not training, which is what you want, while the second one does not.

Of course, just like you did earlier for Batch Normalization, you need to set training to True when training, and leave the default False value when testing.

If you observe that the model is overfitting, you can increase the dropout rate. Conversely, you should try decreasing the dropout rate if the model underfits the training set. It can also help to increase the dropout rate for large layers, and reduce it for small ones.

Dropout does tend to significantly slow down convergence, but it usually results in a much better model when tuned properly. So, it is generally well worth the extra time and effort.

NOTE
Dropconnect is a variant of dropout where individual connections are dropped randomly rather than whole neurons. In general dropout performs better.
Max-Norm Regularization

Another regularization technique that is quite popular for neural networks is called max-norm regularization: for each neuron, it constrains the weights w of the incoming connections such that ∥w∥2 ≤ r, where r is the max-norm hyperparameter and ∥·∥2 is the ℓ2 norm.

We typically implement this constraint by computing ∥w∥2 after each training step and clipping w if needed (w ← w r / ∥w∥2).

Reducing r increases the amount of regularization and helps reduce overfitting. Max-norm regularization can also help alleviate the vanishing/exploding gradients problems (if you are not using Batch Normalization).
TensorFlow does not provide an off-the-shelf max-norm regularizer, but it is not too hard to implement. The following code gets a handle on the weights of the first hidden layer, then it uses the clip_by_norm() function to create an operation that will clip the weights along the second axis so that each row vector ends up with a maximum norm of 1.0. The last line creates an assignment operation that will assign the clipped weights to the weights variable:

threshold = 1.0
weights = tf.get_default_graph().get_tensor_by_name("hidden1/kernel:0")
clipped_weights = tf.clip_by_norm(weights, clip_norm=threshold, axes=1)
clip_weights = tf.assign(weights, clipped_weights)

Then you just apply this operation after each training step, like so:

sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
clip_weights.eval()
In general, you would do this for every hidden layer. Although this solution should work fine, it is a bit messy. A cleaner solution is to create a max_norm_regularizer() function and use it just like the earlier l1_regularizer() function:

def max_norm_regularizer(threshold, axes=1, name="max_norm",
                         collection="max_norm"):
    def max_norm(weights):
        clipped = tf.clip_by_norm(weights, clip_norm=threshold, axes=axes)
        clip_weights = tf.assign(weights, clipped, name=name)
        tf.add_to_collection(collection, clip_weights)
        return None  # there is no regularization loss term
    return max_norm
This function returns a parametrized max_norm() function that you can use like any other regularizer:

max_norm_reg = max_norm_regularizer(threshold=1.0)

with tf.name_scope("dnn"):
    hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu,
                              kernel_regularizer=max_norm_reg, name="hidden1")
    hidden2 = tf.layers.dense(hidden1, n_hidden2, activation=tf.nn.relu,
                              kernel_regularizer=max_norm_reg, name="hidden2")
    logits = tf.layers.dense(hidden2, n_outputs, name="outputs")
Note that max-norm regularization does not require adding a regularization loss term to your overall loss function, which is why the max_norm() function returns None. But you still need to be able to run the clip_weights operations after each training step, so you need to be able to get a handle on them. This is why the max_norm() function adds the clip_weights operation to a collection of max-norm clipping operations. You need to fetch these clipping operations and run them after each training step:

clip_all_weights = tf.get_collection("max_norm")

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(mnist.train.num_examples // batch_size):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
            sess.run(clip_all_weights)

Much cleaner code, isn't it?
Data Augmentation
One last regularization technique, data augmentation, consists of generating new training instances from existing ones, artificially boosting the size of the training set. This will reduce overfitting, making this a regularization technique. The trick is to generate realistic training instances; ideally, a human should not be able to tell which instances were generated and which ones were not. Moreover, simply adding white noise will not help; the modifications you apply should be learnable (white noise is not).

For example, if your model is meant to classify pictures of mushrooms, you can slightly shift, rotate, and resize every picture in the training set by various amounts and add the resulting pictures to the training set (see Figure 11-10). This forces the model to be more tolerant to the position, orientation, and size of the mushrooms in the picture. If you want the model to be more tolerant to lighting conditions, you can similarly generate many images with various contrasts. Assuming the mushrooms are symmetrical, you can also flip the pictures horizontally. By combining these transformations you can greatly increase the size of your training set.

Figure 11-10. Generating new training instances from existing ones

It is often preferable to generate training instances on the fly during training rather than wasting storage space and network bandwidth. TensorFlow offers several image manipulation operations such as transposing (shifting), rotating, resizing, flipping, and cropping, as well as adjusting the brightness, contrast, saturation, and hue (see the API documentation for more details). This makes it easy to implement data augmentation for image datasets.
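For example, here is a minimal sketch (not the book's code; the assumed image shape, crop size, and perturbation ranges are arbitrary) of an on-the-fly augmentation function built from these tf.image operations:

import tensorflow as tf

def augment(image):
    # image is assumed to be a float32 tensor of shape [height, width, 3]
    image = tf.image.random_flip_left_right(image)               # horizontal flip
    image = tf.image.random_brightness(image, max_delta=0.2)     # lighting changes
    image = tf.image.random_contrast(image, lower=0.8, upper=1.2)
    image = tf.random_crop(image, size=[24, 24, 3])              # random shift via cropping
    return image

You would typically apply such a function to each training image in the input pipeline, so every epoch sees slightly different versions of the same pictures.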
NOTE
Another powerful technique to train very deep neural networks is to add skip connections (a skip connection is when you add the input of a layer to the output of a higher layer). We will explore this idea in Chapter 13 when we talk about deep residual networks.
Practical Guidelines
In this chapter, we have covered a wide range of techniques and you may be wondering which ones you should use. The configuration in Table 11-2 will work fine in most cases.

Table 11-2. Default DNN configuration

Initialization:          He initialization
Activation function:     ELU
Normalization:           Batch Normalization
Regularization:          Dropout
Optimizer:               Nesterov Accelerated Gradient
Learning rate schedule:  None

Of course, you should try to reuse parts of a pretrained neural network if you can find one that solves a similar problem.
This default configuration may need to be tweaked:

If you can't find a good learning rate (convergence was too slow, so you increased the learning rate, and now convergence is fast but the network's accuracy is suboptimal), then you can try adding a learning schedule such as exponential decay (a brief sketch follows this list of tweaks).
If your training set is a bit too small, you can implement data augmentation.

If you need a sparse model, you can add some ℓ1 regularization to the mix (and optionally zero out the tiny weights after training). If you need an even sparser model, you can try using FTRL instead of Adam optimization, along with ℓ1 regularization.

If you need a lightning-fast model at runtime, you may want to drop Batch Normalization, and possibly replace the ELU activation function with the leaky ReLU. Having a sparse model will also help.
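As a reminder, a learning schedule with exponential decay can be set up along these lines (a sketch; the initial rate, decay settings, and the loss variable are placeholders you would adapt to your own model):

initial_learning_rate = 0.1
decay_steps = 10000
decay_rate = 1 / 10
global_step = tf.Variable(0, trainable=False, name="global_step")
learning_rate = tf.train.exponential_decay(initial_learning_rate, global_step,
                                           decay_steps, decay_rate)
optimizer = tf.train.MomentumOptimizer(learning_rate, momentum=0.9,
                                       use_nesterov=True)
training_op = optimizer.minimize(loss, global_step=global_step)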
With these guidelines, you are now ready to train very deep nets; well, if you are very patient, that is! If you use a single machine, you may have to wait for days or even months for training to complete. In the next chapter we will discuss how to use Distributed TensorFlow to train and run models across many servers and GPUs.
Exercises
1. Is it okay to initialize all the weights to the same value as long as that value is selected randomly using He initialization?
2. Is it okay to initialize the bias terms to 0?
3. Name three advantages of the ELU activation function over ReLU.
4. In which cases would you want to use each of the following activation functions: ELU, leaky ReLU (and its variants), ReLU, tanh, logistic, and softmax?
5. What may happen if you set the momentum hyperparameter too close to 1 (e.g., 0.99999) when using a MomentumOptimizer?
6. Name three ways you can produce a sparse model.
7. Does dropout slow down training? Does it slow down inference (i.e., making predictions on new instances)?
8. Deep Learning.
   a. Build a DNN with five hidden layers of 100 neurons each, He initialization, and the ELU activation function.
   b. Using Adam optimization and early stopping, try training it on MNIST but only on digits 0 to 4, as we will use transfer learning for digits 5 to 9 in the next exercise. You will need a softmax output layer with five neurons, and as always make sure to save checkpoints at regular intervals and save the final model so you can reuse it later.
   c. Tune the hyperparameters using cross-validation and see what precision you can achieve.
   d. Now try adding Batch Normalization and compare the learning curves: is it converging faster than before? Does it produce a better model?
   e. Is the model overfitting the training set? Try adding dropout to every layer and try again. Does it help?
9. Transfer learning.
   a. Create a new DNN that reuses all the pretrained hidden layers of the previous model, freezes them, and replaces the softmax output layer with a new one.
   b. Train this new DNN on digits 5 to 9, using only 100 images per digit, and time how long it takes. Despite this small number of examples, can you achieve high precision?
   c. Try caching the frozen layers, and train the model again: how much faster is it now?
   d. Try again reusing just four hidden layers instead of five. Can you achieve a higher precision?
   e. Now unfreeze the top two hidden layers and continue training: can you get the model to perform even better?
10. Pretraining on an auxiliary task.
   a. In this exercise you will build a DNN that compares two MNIST digit images and predicts whether they represent the same digit or not. Then you will reuse the lower layers of this network to train an MNIST classifier using very little training data. Start by building two DNNs (let's call them DNN A and DNN B), both similar to the one you built earlier but without the output layer: each DNN should have five hidden layers of 100 neurons each, He initialization, and ELU activation. Next, add one more hidden layer with 10 units on top of both DNNs. To do this, you should use TensorFlow's concat() function with axis=1 to concatenate the outputs of both DNNs for each instance, then feed the result to the hidden layer. Finally, add an output layer with a single neuron using the logistic activation function.
   b. Split the MNIST training set in two sets: split #1 should contain 55,000 images, and split #2 should contain 5,000 images. Create a function that generates a training batch where each instance is a pair of MNIST images picked from split #1. Half of the training instances should be pairs of images that belong to the same class, while the other half should be images from different classes. For each pair, the training label should be 0 if the images are from the same class, or 1 if they are from different classes.
   c. Train the DNN on this training set. For each image pair, you can simultaneously feed the first image to DNN A and the second image to DNN B. The whole network will gradually learn to tell whether two images belong to the same class or not.
   d. Now create a new DNN by reusing and freezing the hidden layers of DNN A and adding a softmax output layer on top with 10 neurons. Train this network on split #2 and see if you can achieve high performance despite having only 500 images per class.

Solutions to these exercises are available in Appendix A.
1. "Understanding the Difficulty of Training Deep Feedforward Neural Networks," X. Glorot, Y. Bengio (2010).
2. Here's an analogy: if you set a microphone amplifier's knob too close to zero, people won't hear your voice, but if you set it too close to the max, your voice will be saturated and people won't understand what you are saying. Now imagine a chain of such amplifiers: they all need to be set properly in order for your voice to come out loud and clear at the end of the chain. Your voice has to come out of each amplifier at the same amplitude as it came in.
3. This simplified strategy was actually already proposed much earlier; for example, in the 1998 book Neural Networks: Tricks of the Trade by Genevieve Orr and Klaus-Robert Müller (Springer).
4. Such as "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification," K. He et al. (2015).
5. "Empirical Evaluation of Rectified Activations in Convolution Network," B. Xu et al. (2015).
6. "Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)," D. Clevert, T. Unterthiner, S. Hochreiter (2015).
7. "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift," S. Ioffe and C. Szegedy (2015).
8. Many researchers argue that it is just as good, or even better, to place the batch normalization layers after (rather than before) the activations.
9. "On the difficulty of training recurrent neural networks," R. Pascanu et al. (2013).
10. Another option is to come up with a supervised task for which you can easily gather a lot of labeled training data, then use transfer learning, as explained earlier. For example, if you want to train a model to identify your friends in pictures, you could download millions of faces on the internet and train a classifier to detect whether two faces are identical or not, then use this classifier to compare a new picture with each picture of your friends.
11. "Some methods of speeding up the convergence of iteration methods," B. Polyak (1964).
12. "A Method for Unconstrained Convex Minimization Problem with the Rate of Convergence O(1/k²)," Yurii Nesterov (1983).
13. "Adaptive Subgradient Methods for Online Learning and Stochastic Optimization," J. Duchi et al. (2011).
14. This algorithm was created by Tijmen Tieleman and Geoffrey Hinton in 2012, and presented by Geoffrey Hinton in his Coursera class on neural networks (slides: http://goo.gl/RsQeis; video: https://goo.gl/XUbIyJ). Amusingly, since the authors have not written a paper to describe it, researchers often cite "slide 29 in lecture 6" in their papers.
15. "Adam: A Method for Stochastic Optimization," D. Kingma, J. Ba (2015).
16. These are estimations of the mean and (uncentered) variance of the gradients. The mean is often called the first moment, while the variance is often called the second moment, hence the name of the algorithm.
17. "The Marginal Value of Adaptive Gradient Methods in Machine Learning," A. C. Wilson et al. (2017).
18. "Primal-Dual Subgradient Methods for Convex Problems," Yurii Nesterov (2005).
19. "Ad Click Prediction: a View from the Trenches," H. McMahan et al. (2013).
20. "An Empirical Study of Learning Rates in Deep Neural Networks for Speech Recognition," A. Senior et al. (2013).
21. "Improving neural networks by preventing co-adaptation of feature detectors," G. Hinton et al. (2012).
22. "Dropout: A Simple Way to Prevent Neural Networks from Overfitting," N. Srivastava et al. (2014).
Chapter 12. Distributing TensorFlow Across Devices and Servers

In Chapter 11 we discussed several techniques that can considerably speed up training: better weight initialization, Batch Normalization, sophisticated optimizers, and so on. However, even with all of these techniques, training a large neural network on a single machine with a single CPU can take days or even weeks.

In this chapter we will see how to use TensorFlow to distribute computations across multiple devices (CPUs and GPUs) and run them in parallel (see Figure 12-1). First we will distribute computations across multiple devices on just one machine, then on multiple devices across multiple machines.

Figure 12-1. Executing a TensorFlow graph across multiple devices in parallel

TensorFlow's support of distributed computing is one of its main highlights compared to other neural network frameworks. It gives you full control over how to split (or replicate) your computation graph across devices and servers, and it lets you parallelize and synchronize operations in flexible ways so you can choose between all sorts of parallelization approaches.

We will look at some of the most popular approaches to parallelizing the execution and training of a neural network. Instead of waiting for weeks for a training algorithm to complete, you may end up waiting for just a few hours. Not only does this save an enormous amount of time, it also means that you can experiment with various models much more easily, and frequently retrain your models on fresh data.

Other great use cases of parallelization include exploring a much larger hyperparameter space when fine-tuning your model, and running large ensembles of neural networks efficiently.

But we must learn to walk before we can run. Let's start by parallelizing simple graphs across several GPUs on a single machine.
Multiple Devices on a Single Machine
You can often get a major performance boost simply by adding GPU cards to a single machine. In fact, in many cases this will suffice; you won't need to use multiple machines at all. For example, you can typically train a neural network just as fast using 8 GPUs on a single machine rather than 16 GPUs across multiple machines (due to the extra delay imposed by network communications in a multimachine setup).

In this section we will look at how to set up your environment so that TensorFlow can use multiple GPU cards on one machine. Then we will look at how you can distribute operations across available devices and execute them in parallel.

Installation
In order to run TensorFlow on multiple GPU cards, you first need to make sure your GPU cards have NVidia Compute Capability (greater or equal to 3.0). This includes Nvidia's Titan, Titan X, K20, and K40 cards (if you own another card, you can check its compatibility at https://developer.nvidia.com/cuda-gpus).

TIP
If you don't own any GPU cards, you can use a hosting service with GPU capability such as Amazon AWS. Detailed instructions to set up TensorFlow 0.9 with Python 3.5 on an Amazon AWS GPU instance are available in Žiga Avsec's helpful blog post. It should not be too hard to update it to the latest version of TensorFlow. Google also released a cloud service called Cloud Machine Learning to run TensorFlow graphs. In May 2016, they announced that their platform now includes servers equipped with tensor processing units (TPUs), processors specialized for Machine Learning that are much faster than GPUs for many ML tasks. Of course, another option is simply to buy your own GPU card. Tim Dettmers wrote a great blog post to help you choose, and he updates it fairly regularly.

You must then download and install the appropriate version of the CUDA and cuDNN libraries (CUDA 8.0 and cuDNN 5.1 if you are using the binary installation of TensorFlow 1.0.0), and set a few environment variables so TensorFlow knows where to find CUDA and cuDNN. The detailed installation instructions are likely to change fairly quickly, so it is best that you follow the instructions on TensorFlow's website.

Nvidia's Compute Unified Device Architecture library (CUDA) allows developers to use CUDA-enabled GPUs for all sorts of computations (not just graphics acceleration). Nvidia's CUDA Deep Neural Network library (cuDNN) is a GPU-accelerated library of primitives for DNNs. It provides optimized implementations of common DNN computations such as activation layers, normalization, forward and backward convolutions, and pooling (see Chapter 13). It is part of Nvidia's Deep Learning SDK (note that it requires creating an Nvidia developer account in order to download it). TensorFlow uses CUDA and cuDNN to control the GPU cards and accelerate computations (see Figure 12-2).

Figure 12-2. TensorFlow uses CUDA and cuDNN to control GPUs and boost DNNs

You can use the nvidia-smi command to check that CUDA is properly installed. It lists the available GPU cards, as well as processes running on each card:
$ nvidia-smi
Wed Sep 16 09:50:03 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.63     Driver Version: 352.63         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID K520           Off  | 0000:00:03.0     Off |                  N/A |
| N/A   27C    P8    17W / 125W |     11MiB / 4095MiB  |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Finally, you must install TensorFlow with GPU support. If you created an isolated environment using virtualenv, you first need to activate it:

$ cd $ML_PATH               # Your ML working directory (e.g., $HOME/ml)
$ source env/bin/activate

Then install the appropriate GPU-enabled version of TensorFlow:

$ pip3 install --upgrade tensorflow-gpu

Now you can open up a Python shell and check that TensorFlow detects and uses CUDA and cuDNN properly by importing TensorFlow and creating a session:
>>> import tensorflow as tf
I [...]/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I [...]/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I [...]/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I [...]/dso_loader.cc:108] successfully opened CUDA library libcuda.so.1 locally
I [...]/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
>>> sess = tf.Session()
[...]
I [...]/gpu_init.cc:102] Found device 0 with properties:
name: GRID K520
major: 3 minor: 0 memoryClockRate (GHz) 0.797
pciBusID 0000:00:03.0
Total memory: 4.00GiB
Free memory: 3.95GiB
I [...]/gpu_init.cc:126] DMA: 0
I [...]/gpu_init.cc:136] 0:   Y
I [...]/gpu_device.cc:839] Creating TensorFlow device
(/gpu:0) -> (device: 0, name: GRID K520, pci bus id: 0000:00:03.0)

Looks good! TensorFlow detected the CUDA and cuDNN libraries, and it used the CUDA library to detect the GPU card (in this case an Nvidia Grid K520 card).
Managing the GPU RAM
By default TensorFlow automatically grabs all the RAM in all available GPUs the first time you run a graph, so you will not be able to start a second TensorFlow program while the first one is still running. If you try, you will get the following error:

E [...]/cuda_driver.cc:965] failed to allocate 3.66G (3928915968 bytes) from
device: CUDA_ERROR_OUT_OF_MEMORY

One solution is to run each process on different GPU cards. To do this, the simplest option is to set the CUDA_VISIBLE_DEVICES environment variable so that each process only sees the appropriate GPU cards. For example, you could start two programs like this:

$ CUDA_VISIBLE_DEVICES=0,1 python3 program_1.py
# and in another terminal:
$ CUDA_VISIBLE_DEVICES=3,2 python3 program_2.py

Program #1 will only see GPU cards 0 and 1 (numbered 0 and 1, respectively), and program #2 will only see GPU cards 2 and 3 (numbered 1 and 0, respectively). Everything will work fine (see Figure 12-3).

Figure 12-3. Each program gets two GPUs for itself
Another option is to tell TensorFlow to grab only a fraction of the memory. For example, to make TensorFlow grab only 40% of each GPU's memory, you must create a ConfigProto object, set its gpu_options.per_process_gpu_memory_fraction option to 0.4, and create the session using this configuration:

config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.4
session = tf.Session(config=config)

Now two programs like this one can run in parallel using the same GPU cards (but not three, since 3 × 0.4 > 1). See Figure 12-4.

Figure 12-4. Each program gets all four GPUs, but with only 40% of the RAM each

If you run the nvidia-smi command while both programs are running, you should see that each process holds roughly 40% of the total RAM of each card:
$ nvidia-smi
[...]
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      5231    C   python                                     1677MiB    |
|    0      5262    C   python                                     1677MiB    |
|    1      5231    C   python                                     1677MiB    |
|    1      5262    C   python                                     1677MiB    |
[...]
Yet another option is to tell TensorFlow to grab memory only when it needs it. To do this you must set config.gpu_options.allow_growth to True. However, TensorFlow never releases memory once it has grabbed it (to avoid memory fragmentation) so you may still run out of memory after a while. It may be harder to guarantee a deterministic behavior using this option, so in general you probably want to stick with one of the previous options.
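For reference, this option follows the same pattern as the previous one (a short sketch):

config = tf.ConfigProto()
config.gpu_options.allow_growth = True   # grab GPU memory only as it is needed
session = tf.Session(config=config)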
Okay, now you have a working GPU-enabled TensorFlow installation. Let's see how to use it!

Placing Operations on Devices
The TensorFlow whitepaper¹ presents a friendly dynamic placer algorithm that automagically distributes operations across all available devices, taking into account things like the measured computation time in previous runs of the graph, estimations of the size of the input and output tensors to each operation, the amount of RAM available in each device, communication delay when transferring data in and out of devices, hints and constraints from the user, and more. Unfortunately, this sophisticated algorithm is internal to Google; it was not released in the open source version of TensorFlow. The reason it was left out seems to be that in practice a small set of placement rules specified by the user actually results in more efficient placement than what the dynamic placer is capable of. However, the TensorFlow team is working on improving the dynamic placer, and perhaps it will eventually be good enough to be released.

Until then TensorFlow relies on the simple placer, which (as its name suggests) is very basic.
Simple placement
Whenever you run a graph, if TensorFlow needs to evaluate a node that is not placed on a device yet, it uses the simple placer to place it, along with all other nodes that are not placed yet. The simple placer respects the following rules:

If a node was already placed on a device in a previous run of the graph, it is left on that device.

Else, if the user pinned a node to a device (described next), the placer places it on that device.

Else, it defaults to GPU #0, or the CPU if there is no GPU.

As you can see, placing operations on the appropriate device is mostly up to you. If you don't do anything, the whole graph will be placed on the default device. To pin nodes onto a device, you must create a device block using the device() function. For example, the following code pins the variable a and the constant b on the CPU, but the multiplication node c is not pinned on any device, so it will be placed on the default device:

with tf.device("/cpu:0"):
    a = tf.Variable(3.0)
    b = tf.constant(4.0)

c = a * b

NOTE
The "/cpu:0" device aggregates all CPUs on a multi-CPU system. There is currently no way to pin nodes on specific CPUs or to use just a subset of all CPUs.
Logging placements
Let's check that the simple placer respects the placement constraints we have just defined. For this you can set the log_device_placement option to True; this tells the placer to log a message whenever it places a node. For example:

>>> config = tf.ConfigProto()
>>> config.log_device_placement = True
>>> sess = tf.Session(config=config)
I [...] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GRID K520,
pci bus id: 0000:00:03.0)
[...]
>>> a.initializer.run(session=sess)
I [...] a: /job:localhost/replica:0/task:0/cpu:0
I [...] a/read: /job:localhost/replica:0/task:0/cpu:0
I [...] mul: /job:localhost/replica:0/task:0/gpu:0
I [...] a/Assign: /job:localhost/replica:0/task:0/cpu:0
I [...] b: /job:localhost/replica:0/task:0/cpu:0
I [...] a/initial_value: /job:localhost/replica:0/task:0/cpu:0
>>> sess.run(c)
12.0

The lines starting with "I" for Info are the log messages. When we create a session, TensorFlow logs a message to tell us that it has found a GPU card (in this case the Grid K520 card). Then the first time we run the graph (in this case when initializing the variable a), the simple placer is run and places each node on the device it was assigned to. As expected, the log messages show that all nodes are placed on "/cpu:0" except the multiplication node, which ends up on the default device "/gpu:0" (you can safely ignore the prefix /job:localhost/replica:0/task:0 for now; we will talk about it in a moment). Notice that the second time we run the graph (to compute c), the placer is not used since all the nodes TensorFlow needs to compute c are already placed.
Dynamic placement function
When you create a device block, you can specify a function instead of a device name. TensorFlow will call this function for each operation it needs to place in the device block, and the function must return the name of the device to pin the operation on. For example, the following code pins all the variable nodes to "/cpu:0" (in this case just the variable a) and all other nodes to "/gpu:0":

def variables_on_cpu(op):
    if op.type == "Variable":
        return "/cpu:0"
    else:
        return "/gpu:0"

with tf.device(variables_on_cpu):
    a = tf.Variable(3.0)
    b = tf.constant(4.0)
    c = a * b

You can easily implement more complex algorithms, such as pinning variables across GPUs in a round-robin fashion.
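For example, here is a sketch (not from the book) of a placement function that distributes variables across two GPUs in a round-robin fashion and leaves everything else on the CPU:

class VariablesOnGPUs(object):
    def __init__(self, n_gpus):
        self.n_gpus = n_gpus
        self.next_gpu = 0
    def __call__(self, op):
        if op.type == "Variable":
            device = "/gpu:%d" % self.next_gpu
            self.next_gpu = (self.next_gpu + 1) % self.n_gpus  # round-robin
            return device
        return "/cpu:0"

with tf.device(VariablesOnGPUs(n_gpus=2)):
    a = tf.Variable(3.0)   # placed on /gpu:0
    b = tf.Variable(4.0)   # placed on /gpu:1
    c = a * b              # placed on /cpu:0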
Operations and kernels
For a TensorFlow operation to run on a device, it needs to have an implementation for that device; this is called a kernel. Many operations have kernels for both CPUs and GPUs, but not all of them. For example, TensorFlow does not have a GPU kernel for integer variables, so the following code will fail when TensorFlow tries to place the variable i on GPU #0:

>>> with tf.device("/gpu:0"):
...     i = tf.Variable(3)
[...]
>>> sess.run(i.initializer)
Traceback (most recent call last):
[...]
tensorflow.python.framework.errors.InvalidArgumentError: Cannot assign a device
to node 'Variable': Could not satisfy explicit device specification

Note that TensorFlow infers that the variable must be of type int32 since the initialization value is an integer. If you change the initialization value to 3.0 instead of 3, or if you explicitly set dtype=tf.float32 when creating the variable, everything will work fine.
Soft placement
By default, if you try to pin an operation on a device for which the operation has no kernel, you get the exception shown earlier when TensorFlow tries to place the operation on the device. If you prefer TensorFlow to fall back to the CPU instead, you can set the allow_soft_placement configuration option to True:

with tf.device("/gpu:0"):
    i = tf.Variable(3)

config = tf.ConfigProto()
config.allow_soft_placement = True
sess = tf.Session(config=config)
sess.run(i.initializer)  # the placer runs and falls back to /cpu:0

So far we have discussed how to place nodes on different devices. Now let's see how TensorFlow will run these nodes in parallel.
Parallel Execution
When TensorFlow runs a graph, it starts by finding out the list of nodes that need to be evaluated, and it counts how many dependencies each of them has. TensorFlow then starts evaluating the nodes with zero dependencies (i.e., source nodes). If these nodes are placed on separate devices, they obviously get evaluated in parallel. If they are placed on the same device, they get evaluated in different threads, so they may run in parallel too (in separate GPU threads or CPU cores).

TensorFlow manages a thread pool on each device to parallelize operations (see Figure 12-5). These are called the inter-op thread pools. Some operations have multithreaded kernels: they can use other thread pools (one per device) called the intra-op thread pools.

Figure 12-5. Parallelized execution of a TensorFlow graph

For example, in Figure 12-5, operations A, B, and C are source ops, so they can immediately be evaluated. Operations A and B are placed on GPU #0, so they are sent to this device's inter-op thread pool, and immediately evaluated in parallel. Operation A happens to have a multithreaded kernel; its computations are split in three parts, which are executed in parallel by the intra-op thread pool. Operation C goes to GPU #1's inter-op thread pool.

As soon as operation C finishes, the dependency counters of operations D and E will be decremented and will both reach 0, so both operations will be sent to the inter-op thread pool to be executed.

TIP
You can control the number of threads per inter-op pool by setting the inter_op_parallelism_threads option. Note that the first session you start creates the inter-op thread pools. All other sessions will just reuse them unless you set the use_per_session_threads option to True. You can control the number of threads per intra-op pool by setting the intra_op_parallelism_threads option.
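Both options live on the session's ConfigProto, so a minimal sketch (the thread counts are arbitrary examples) looks like this:

config = tf.ConfigProto()
config.inter_op_parallelism_threads = 4   # threads available to run independent ops
config.intra_op_parallelism_threads = 8   # threads available to multithreaded kernels
sess = tf.Session(config=config)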
Control Dependencies
In some cases, it may be wise to postpone the evaluation of an operation even though all the operations it depends on have been executed. For example, if it uses a lot of memory but its value is needed only much further in the graph, it would be best to evaluate it at the last moment to avoid needlessly occupying RAM that other operations may need. Another example is a set of operations that depend on data located outside of the device. If they all run at the same time, they may saturate the device's communication bandwidth, and they will end up all waiting on I/O. Other operations that need to communicate data will also be blocked. It would be preferable to execute these communication-heavy operations sequentially, allowing the device to perform other operations in parallel.

To postpone evaluation of some nodes, a simple solution is to add control dependencies. For example, the following code tells TensorFlow to evaluate x and y only after a and b have been evaluated:

a = tf.constant(1.0)
b = a + 2.0

with tf.control_dependencies([a, b]):
    x = tf.constant(3.0)
    y = tf.constant(4.0)

z = x + y

Obviously, since z depends on x and y, evaluating z also implies waiting for a and b to be evaluated, even though it is not explicitly in the control_dependencies() block. Also, since b depends on a, we could simplify the preceding code by just creating a control dependency on [b] instead of [a, b], but in some cases "explicit is better than implicit."
Great! Now you know:

How to place operations on multiple devices in any way you please

How these operations get executed in parallel

How to create control dependencies to optimize parallel execution

It's time to distribute computations across multiple servers!
Multiple Devices Across Multiple Servers
To run a graph across multiple servers, you first need to define a cluster. A cluster is composed of one or more TensorFlow servers, called tasks, typically spread across several machines (see Figure 12-6). Each task belongs to a job. A job is just a named group of tasks that typically have a common role, such as keeping track of the model parameters (such a job is usually named "ps" for parameter server), or performing computations (such a job is usually named "worker").

Figure 12-6. TensorFlow cluster

The following cluster specification defines two jobs, "ps" and "worker", containing one task and two tasks, respectively. In this example, machine A hosts two TensorFlow servers (i.e., tasks), listening on different ports: one is part of the "ps" job, and the other is part of the "worker" job. Machine B just hosts one TensorFlow server, part of the "worker" job.

cluster_spec = tf.train.ClusterSpec({
    "ps": [
        "machine-a.example.com:2221",  # /job:ps/task:0
    ],
    "worker": [
        "machine-a.example.com:2222",  # /job:worker/task:0
        "machine-b.example.com:2222",  # /job:worker/task:1
    ]})
To start a TensorFlow server, you must create a Server object, passing it the cluster specification (so it can communicate with other servers) and its own job name and task number. For example, to start the first worker task, you would run the following code on machine A:

server = tf.train.Server(cluster_spec, job_name="worker", task_index=0)

It is usually simpler to just run one task per machine, but the previous example demonstrates that TensorFlow allows you to run multiple tasks on the same machine if you want.² If you have several servers on one machine, you will need to ensure that they don't all try to grab all the RAM of every GPU, as explained earlier. For example, in Figure 12-6 the "ps" task does not see the GPU devices, since presumably its process was launched with CUDA_VISIBLE_DEVICES="". Note that the CPU is shared by all tasks located on the same machine.

If you want the process to do nothing other than run the TensorFlow server, you can block the main thread by telling it to wait for the server to finish using the join() method (otherwise the server will be killed as soon as your main thread exits). Since there is currently no way to stop the server, this will actually block forever:

server.join()  # blocks until the server stops (i.e., never)
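Putting these pieces together, the script for the parameter server task on machine A might look something like this (a sketch; the file name task_ps_0.py is hypothetical, and the cluster spec is the one defined above):

# task_ps_0.py  (hypothetical file name)
import tensorflow as tf

cluster_spec = tf.train.ClusterSpec({
    "ps": ["machine-a.example.com:2221"],
    "worker": ["machine-a.example.com:2222", "machine-b.example.com:2222"]})

server = tf.train.Server(cluster_spec, job_name="ps", task_index=0)
server.join()  # just serve requests from the workers, forever

You could launch it with CUDA_VISIBLE_DEVICES="" so that it does not grab any GPU RAM, as discussed above.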
Opening a Session
Once all the tasks are up and running (doing nothing yet), you can open a session on any of the servers, from a client located in any process on any machine (even from a process running one of the tasks), and use that session like a regular local session. For example:

a = tf.constant(1.0)
b = a + 2
c = a * 3

with tf.Session("grpc://machine-b.example.com:2222") as sess:
    print(c.eval())  # 9.0

This client code first creates a simple graph, then opens a session on the TensorFlow server located on machine B (which we will call the master), and instructs it to evaluate c. The master starts by placing the operations on the appropriate devices. In this example, since we did not pin any operation on any device, the master simply places them all on its own default device; in this case, machine B's GPU device. Then it just evaluates c as instructed by the client, and it returns the result.
The Master and Worker Services
The client uses the gRPC protocol (Google Remote Procedure Call) to communicate with the server. This is an efficient open source framework to call remote functions and get their outputs across a variety of platforms and languages.³ It is based on HTTP2, which opens a connection and leaves it open during the whole session, allowing efficient bidirectional communication once the connection is established. Data is transmitted in the form of protocol buffers, another open source Google technology. This is a lightweight binary data interchange format.

WARNING
All servers in a TensorFlow cluster may communicate with any other server in the cluster, so make sure to open the appropriate ports on your firewall.

Every TensorFlow server provides two services: the master service and the worker service. The master service allows clients to open sessions and use them to run graphs. It coordinates the computations across tasks, relying on the worker service to actually execute computations on other tasks and get their results.

This architecture gives you a lot of flexibility. One client can connect to multiple servers by opening multiple sessions in different threads. One server can handle multiple sessions simultaneously from one or more clients. You can run one client per task (typically within the same process), or just one client to control all tasks. All options are open.
Pinning Operations Across Tasks
You can use device blocks to pin operations on any device managed by any task, by specifying the job name, task index, device type, and device index. For example, the following code pins a to the CPU of the first task in the "ps" job (that's the CPU on machine A), and it pins b to the second GPU managed by the first task of the "worker" job (that's GPU #1 on machine A). Finally, c is not pinned to any device, so the master places it on its own default device (machine B's GPU #0 device).

with tf.device("/job:ps/task:0/cpu:0"):
    a = tf.constant(1.0)

with tf.device("/job:worker/task:0/gpu:1"):
    b = a + 2

c = a + b

As earlier, if you omit the device type and index, TensorFlow will default to the task's default device; for example, pinning an operation to "/job:ps/task:0" will place it on the default device of the first task of the "ps" job (machine A's CPU). If you also omit the task index (e.g., "/job:ps"), TensorFlow defaults to "/task:0". If you omit the job name and the task index, TensorFlow defaults to the session's master task.
Sharding Variables Across Multiple Parameter Servers
As we will see shortly, a common pattern when training a neural network on a distributed setup is to store the model parameters on a set of parameter servers (i.e., the tasks in the "ps" job) while other tasks focus on computations (i.e., the tasks in the "worker" job). For large models with millions of parameters, it is useful to shard these parameters across multiple parameter servers, to reduce the risk of saturating a single parameter server's network card. If you were to manually pin every variable to a different parameter server, it would be quite tedious. Fortunately, TensorFlow provides the replica_device_setter() function, which distributes variables across all the "ps" tasks in a round-robin fashion. For example, the following code pins five variables to two parameter servers:

with tf.device(tf.train.replica_device_setter(ps_tasks=2)):
    v1 = tf.Variable(1.0)  # pinned to /job:ps/task:0
    v2 = tf.Variable(2.0)  # pinned to /job:ps/task:1
    v3 = tf.Variable(3.0)  # pinned to /job:ps/task:0
    v4 = tf.Variable(4.0)  # pinned to /job:ps/task:1
    v5 = tf.Variable(5.0)  # pinned to /job:ps/task:0

Instead of passing the number of ps_tasks, you can pass the cluster spec cluster=cluster_spec and TensorFlow will simply count the number of tasks in the "ps" job.
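In other words, the preceding block could also be written as follows (a minimal sketch, with cluster_spec being the specification defined earlier):

with tf.device(tf.train.replica_device_setter(cluster=cluster_spec)):
    v1 = tf.Variable(1.0)  # pinned to /job:ps/task:0
    v2 = tf.Variable(2.0)  # pinned to /job:ps/task:1
    [...]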
If you create other operations in the block, beyond just variables, TensorFlow automatically pins them to "/job:worker", which will default to the first device managed by the first task in the "worker" job. You can pin them to another device by setting the worker_device parameter, but a better approach is to use embedded device blocks. An inner device block can override the job, task, or device defined in an outer block. For example:

with tf.device(tf.train.replica_device_setter(ps_tasks=2)):
    v1 = tf.Variable(1.0)  # pinned to /job:ps/task:0 (+ defaults to /cpu:0)
    v2 = tf.Variable(2.0)  # pinned to /job:ps/task:1 (+ defaults to /cpu:0)
    v3 = tf.Variable(3.0)  # pinned to /job:ps/task:0 (+ defaults to /cpu:0)
    [...]
    s = v1 + v2            # pinned to /job:worker (+ defaults to task:0/gpu:0)
    with tf.device("/gpu:1"):
        p1 = 2 * s         # pinned to /job:worker/gpu:1 (+ defaults to /task:0)
        with tf.device("/task:1"):
            p2 = 3 * s     # pinned to /job:worker/task:1/gpu:1

NOTE
This example assumes that the parameter servers are CPU-only, which is typically the case since they only need to store and communicate parameters, not perform intensive computations.
Sharing State Across Sessions Using Resource Containers
When you are using a plain local session (not the distributed kind), each variable's state is managed by the session itself; as soon as it ends, all variable values are lost. Moreover, multiple local sessions cannot share any state, even if they both run the same graph; each session has its own copy of every variable (as we discussed in Chapter 9). In contrast, when you are using distributed sessions, variable state is managed by resource containers located on the cluster itself, not by the sessions. So if you create a variable named x using one client session, it will automatically be available to any other session on the same cluster (even if both sessions are connected to a different server). For example, consider the following client code:

# simple_client.py
import tensorflow as tf
import sys

x = tf.Variable(0.0, name="x")
increment_x = tf.assign(x, x + 1)

with tf.Session(sys.argv[1]) as sess:
    if sys.argv[2:] == ["init"]:
        sess.run(x.initializer)
    sess.run(increment_x)
    print(x.eval())
Let's suppose you have a TensorFlow cluster up and running on machines A and B, port 2222. You could launch the client, have it open a session with the server on machine A, and tell it to initialize the variable, increment it, and print its value by launching the following command:

$ python3 simple_client.py grpc://machine-a.example.com:2222 init
1.0

Now if you launch the client with the following command, it will connect to the server on machine B and magically reuse the same variable x (this time we don't ask the server to initialize the variable):

$ python3 simple_client.py grpc://machine-b.example.com:2222
2.0
This feature cuts both ways: it's great if you want to share variables across multiple sessions, but if you want to run completely independent computations on the same cluster you will have to be careful not to use the same variable names by accident. One way to ensure that you won't have name clashes is to wrap all of your construction phase inside a variable scope with a unique name for each computation, for example:

with tf.variable_scope("my_problem_1"):
    [...]  # Construction phase of problem 1

A better option is to use a container block:

with tf.container("my_problem_1"):
    [...]  # Construction phase of problem 1

This will use a container dedicated to problem #1, instead of the default one (whose name is an empty string ""). One advantage is that variable names remain nice and short. Another advantage is that you can easily reset a named container. For example, the following command will connect to the server on machine A and ask it to reset the container named "my_problem_1", which will free all the resources this container used (and also close all sessions open on the server). Any variable managed by this container must be initialized before you can use it again:

tf.Session.reset("grpc://machine-a.example.com:2222", ["my_problem_1"])

Resource containers make it easy to share variables across sessions in flexible ways. For example, Figure 12-7 shows four clients running different graphs on the same cluster, but sharing some variables. Clients A and B share the same variable x managed by the default container, while clients C and D share another variable named x managed by the container named "my_problem_1". Note that client C even uses variables from both containers.

Figure 12-7. Resource containers

Resource containers also take care of preserving the state of other stateful operations, namely queues and readers. Let's take a look at queues first.
Asynchronous Communication Using TensorFlow Queues
Queues are another great way to exchange data between multiple sessions; for example, one common use case is to have a client create a graph that loads the training data and pushes it into a queue, while another client creates a graph that pulls the data from the queue and trains a model (see Figure 12-8). This can speed up training considerably because the training operations don't have to wait for the next mini-batch at every step.

Figure 12-8. Using queues to load the training data asynchronously

TensorFlow provides various kinds of queues. The simplest kind is the first-in first-out (FIFO) queue. For example, the following code creates a FIFO queue that can store up to 10 tensors containing two float values each:

q = tf.FIFOQueue(capacity=10, dtypes=[tf.float32], shapes=[[2]],
                 name="q", shared_name="shared_q")

WARNING
To share variables across sessions, all you had to do was to specify the same name and container on both ends. With queues TensorFlow does not use the name attribute but instead uses shared_name, so it is important to specify it (even if it is the same as the name). And, of course, use the same container.
Enqueuing data
To push data to a queue, you must create an enqueue operation. For example, the following code pushes three training instances to the queue:

# training_data_loader.py
import tensorflow as tf

q = [...]
training_instance = tf.placeholder(tf.float32, shape=(2))
enqueue = q.enqueue([training_instance])

with tf.Session("grpc://machine-a.example.com:2222") as sess:
    sess.run(enqueue, feed_dict={training_instance: [1., 2.]})
    sess.run(enqueue, feed_dict={training_instance: [3., 4.]})
    sess.run(enqueue, feed_dict={training_instance: [5., 6.]})
Instead of enqueuing instances one by one, you can enqueue several at a time using an enqueue_many operation:

[...]
training_instances = tf.placeholder(tf.float32, shape=(None, 2))
enqueue_many = q.enqueue_many([training_instances])

with tf.Session("grpc://machine-a.example.com:2222") as sess:
    sess.run(enqueue_many,
             feed_dict={training_instances: [[1., 2.], [3., 4.], [5., 6.]]})

Both examples enqueue the same three tensors to the queue.
Dequeuing data
To pull the instances out of the queue, on the other end, you need to use a dequeue operation:

# trainer.py
import tensorflow as tf

q = [...]
dequeue = q.dequeue()

with tf.Session("grpc://machine-a.example.com:2222") as sess:
    print(sess.run(dequeue))  # [1., 2.]
    print(sess.run(dequeue))  # [3., 4.]
    print(sess.run(dequeue))  # [5., 6.]

In general you will want to pull a whole mini-batch at once, instead of pulling just one instance at a time. To do so, you must use a dequeue_many operation, specifying the mini-batch size:
[...]
batch_size = 2
dequeue_mini_batch = q.dequeue_many(batch_size)

with tf.Session("grpc://machine-a.example.com:2222") as sess:
    print(sess.run(dequeue_mini_batch))  # [[1., 2.], [3., 4.]]
    print(sess.run(dequeue_mini_batch))  # blocked waiting for another instance

When a queue is full, the enqueue operation will block until items are pulled out by a dequeue operation. Similarly, when a queue is empty (or you are using dequeue_many() and there are fewer items than the mini-batch size), the dequeue operation will block until enough items are pushed into the queue using an enqueue operation.
Queues of tuples
Each item in a queue can be a tuple of tensors (of various types and shapes) instead of just a single tensor. For example, the following queue stores pairs of tensors, one of type int32 and shape (), and the other of type float32 and shape [3, 2]:

q = tf.FIFOQueue(capacity=10, dtypes=[tf.int32, tf.float32], shapes=[[], [3, 2]],
                 name="q", shared_name="shared_q")

The enqueue operation must be given pairs of tensors (note that each pair represents only one item in the queue):

a = tf.placeholder(tf.int32, shape=())
b = tf.placeholder(tf.float32, shape=(3, 2))
enqueue = q.enqueue((a, b))

with tf.Session([...]) as sess:
    sess.run(enqueue, feed_dict={a: 10, b: [[1., 2.], [3., 4.], [5., 6.]]})
    sess.run(enqueue, feed_dict={a: 11, b: [[2., 4.], [6., 8.], [0., 2.]]})
    sess.run(enqueue, feed_dict={a: 12, b: [[3., 6.], [9., 2.], [5., 8.]]})
On the other end, the dequeue() function now creates a pair of dequeue operations:

dequeue_a, dequeue_b = q.dequeue()

In general, you should run these operations together:

with tf.Session([...]) as sess:
    a_val, b_val = sess.run([dequeue_a, dequeue_b])
    print(a_val)  # 10
    print(b_val)  # [[1., 2.], [3., 4.], [5., 6.]]

WARNING
If you run dequeue_a on its own, it will dequeue a pair and return only the first element; the second element will be lost (and similarly, if you run dequeue_b on its own, the first element will be lost).
The dequeue_many() function also returns a pair of operations:

batch_size = 2
dequeue_as, dequeue_bs = q.dequeue_many(batch_size)

You can use it as you would expect:

with tf.Session([...]) as sess:
    a, b = sess.run([dequeue_as, dequeue_bs])
    print(a)  # [10, 11]
    print(b)  # [[[1., 2.], [3., 4.], [5., 6.]], [[2., 4.], [6., 8.], [0., 2.]]]
    a, b = sess.run([dequeue_as, dequeue_bs])  # blocked waiting for another pair
Closing a queue
It is possible to close a queue to signal to the other sessions that no more data will be enqueued:

close_q = q.close()

with tf.Session([...]) as sess:
    [...]
    sess.run(close_q)

Subsequent executions of enqueue or enqueue_many operations will raise an exception. By default, any pending enqueue request will be honored, unless you call q.close(cancel_pending_enqueues=True).

Subsequent executions of dequeue or dequeue_many operations will continue to succeed as long as there are items in the queue, but they will fail when there are not enough items left in the queue. If you are using a dequeue_many operation and there are a few instances left in the queue, but fewer than the mini-batch size, they will be lost. You may prefer to use a dequeue_up_to operation instead; it behaves exactly like dequeue_many except when a queue is closed and there are fewer than batch_size instances left in the queue, in which case it just returns them.
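For example, assuming the queue holds the three training instances from earlier and has already been closed, a dequeue_up_to() operation would behave along these lines (a sketch):

batch_size = 2
dequeue_batch = q.dequeue_up_to(batch_size)

with tf.Session([...]) as sess:
    print(sess.run(dequeue_batch))  # [[1., 2.], [3., 4.]]
    print(sess.run(dequeue_batch))  # [[5., 6.]]  (only one instance was left)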
RandomShuffleQueue
TensorFlow also supports a couple more types of queues, including RandomShuffleQueue, which can be used just like a FIFOQueue except that items are dequeued in a random order. This can be useful to shuffle training instances at each epoch during training. First, let's create the queue:

q = tf.RandomShuffleQueue(capacity=50, min_after_dequeue=10,
                          dtypes=[tf.float32], shapes=[()],
                          name="q", shared_name="shared_q")

The min_after_dequeue specifies the minimum number of items that must remain in the queue after a dequeue operation. This ensures that there will be enough instances in the queue to have enough randomness (once the queue is closed, the min_after_dequeue limit is ignored). Now suppose that you enqueued 22 items in this queue (floats 1. to 22.). Here is how you could dequeue them:

dequeue = q.dequeue_many(5)

with tf.Session([...]) as sess:
    print(sess.run(dequeue))  # [20. 15. 11. 12. 4.] (17 items left)
    print(sess.run(dequeue))  # [5. 13. 6. 0. 17.] (12 items left)
    print(sess.run(dequeue))  # 12 - 5 < 10: blocked waiting for 3 more instances
PaddingFifoQueue
A PaddingFIFOQueue can also be used just like a FIFOQueue except that it accepts tensors of variable sizes along any dimension (but with a fixed rank). When you are dequeuing them with a dequeue_many or dequeue_up_to operation, each tensor is padded with zeros along every variable dimension to make it the same size as the largest tensor in the mini-batch. For example, you could enqueue 2D tensors (matrices) of arbitrary sizes:

q = tf.PaddingFIFOQueue(capacity=50, dtypes=[tf.float32], shapes=[(None, None)],
                        name="q", shared_name="shared_q")
v = tf.placeholder(tf.float32, shape=(None, None))
enqueue = q.enqueue([v])

with tf.Session([...]) as sess:
    sess.run(enqueue, feed_dict={v: [[1., 2.], [3., 4.], [5., 6.]]})        # 3x2
    sess.run(enqueue, feed_dict={v: [[1.]]})                                # 1x1
    sess.run(enqueue, feed_dict={v: [[7., 8., 9., 5.], [6., 7., 8., 9.]]})  # 2x4

If we just dequeue one item at a time, we get the exact same tensors that were enqueued. But if we dequeue several items at a time (using dequeue_many() or dequeue_up_to()), the queue automatically pads the tensors appropriately. For example, if we dequeue all three items at once, all tensors will be padded with zeros to become 3 × 4 tensors, since the maximum size for the first dimension is 3 (first item) and the maximum size for the second dimension is 4 (third item):
>>> q = [...]
>>> dequeue = q.dequeue_many(3)
>>> with tf.Session([...]) as sess:
...     print(sess.run(dequeue))
[[[ 1.  2.  0.  0.]
  [ 3.  4.  0.  0.]
  [ 5.  6.  0.  0.]]

 [[ 1.  0.  0.  0.]
  [ 0.  0.  0.  0.]
  [ 0.  0.  0.  0.]]

 [[ 7.  8.  9.  5.]
  [ 6.  7.  8.  9.]
  [ 0.  0.  0.  0.]]]

This type of queue can be useful when you are dealing with variable length inputs, such as sequences of words (see Chapter 14).

Okay, now let's pause for a second: so far you have learned to distribute computations across multiple devices and servers, share variables across sessions, and communicate asynchronously using queues. Before you start training neural networks, though, there's one last topic we need to discuss: how to efficiently load training data.
Loading Data Directly from the Graph
So far we have assumed that the clients would load the training data and feed it to the cluster using placeholders. This is simple and works quite well for simple setups, but it is rather inefficient since it transfers the training data several times:

1. From the filesystem to the client
2. From the client to the master task
3. Possibly from the master task to other tasks where the data is needed

It gets worse if you have several clients training various neural networks using the same training data (for example, for hyperparameter tuning): if every client loads the data simultaneously, you may end up even saturating your file server or the network's bandwidth.
Preload the data into a variable
For datasets that can fit in memory, a better option is to load the training data once and assign it to a variable, then just use that variable in your graph. This is called preloading the training set. This way the data will be transferred only once from the client to the cluster (but it may still need to be moved around from task to task depending on which operations need it). The following code shows how to load the full training set into a variable:

training_set_init = tf.placeholder(tf.float32, shape=(None, n_features))
training_set = tf.Variable(training_set_init, trainable=False, collections=[],
                           name="training_set")

with tf.Session([...]) as sess:
    data = [...]  # load the training data from the datastore
    sess.run(training_set.initializer, feed_dict={training_set_init: data})

You must set trainable=False so the optimizers don't try to tweak this variable. You should also set collections=[] to ensure that this variable won't get added to the GraphKeys.GLOBAL_VARIABLES collection, which is used for saving and restoring checkpoints.

NOTE
This example assumes that all of your training set (including the labels) consists only of float32 values. If that's not the case, you will need one variable per type.
Reading the training data directly from the graph
If the training set does not fit in memory, a good solution is to use reader operations: these are operations capable of reading data directly from the filesystem. This way the training data never needs to flow through the clients at all. TensorFlow provides readers for various file formats:

CSV

Fixed-length binary records

TensorFlow's own TFRecords format, based on protocol buffers

Let's look at a simple example reading from a CSV file (for other formats, please check out the API documentation). Suppose you have a file named my_test.csv that contains training instances, and you want to create operations to read it. Suppose it has the following content, with two float features x1 and x2 and one integer target representing a binary class:
x1, x2, target
1., 2., 0
4., 5 , 1
7.,   , 0

First, let's create a TextLineReader to read this file. A TextLineReader opens a file (once we tell it which one to open) and reads lines one by one. It is a stateful operation, like variables and queues: it preserves its state across multiple runs of the graph, keeping track of which file it is currently reading and what its current position is in this file.

reader = tf.TextLineReader(skip_header_lines=1)
Next, we create a queue that the reader will pull from to know which file to read next. We also create an enqueue operation and a placeholder to push any filename we want to the queue, and we create an operation to close the queue once we have no more files to read:

filename_queue = tf.FIFOQueue(capacity=10, dtypes=[tf.string], shapes=[()])
filename = tf.placeholder(tf.string)
enqueue_filename = filename_queue.enqueue([filename])
close_filename_queue = filename_queue.close()

Now we are ready to create a read operation that will read one record (i.e., a line) at a time and return a key/value pair. The key is the record's unique identifier (a string composed of the filename, a colon, and the line number), and the value is simply a string containing the content of the line:

key, value = reader.read(filename_queue)
We have all we need to read the file line by line! But we are not quite done yet; we need to parse this string to get the features and target:

x1, x2, target = tf.decode_csv(value, record_defaults=[[-1.], [-1.], [-1]])
features = tf.stack([x1, x2])

The first line uses TensorFlow's CSV parser to extract the values from the current line. The default values are used when a field is missing (in this example the third training instance's x2 feature), and they are also used to determine the type of each field (in this case two floats and one integer).

Finally, we can push this training instance and its target to a RandomShuffleQueue that we will share with the training graph (so it can pull mini-batches from it), and we create an operation to close that queue when we are done pushing instances to it:
instance_queue = tf.RandomShuffleQueue(
    capacity=10, min_after_dequeue=2,
    dtypes=[tf.float32, tf.int32], shapes=[[2], []],
    name="instance_q", shared_name="shared_instance_q")

enqueue_instance = instance_queue.enqueue([features, target])
close_instance_queue = instance_queue.close()

Wow! That was a lot of work just to read a file. Plus we only created the graph, so now we need to run it:
with tf.Session([...]) as sess:
    sess.run(enqueue_filename, feed_dict={filename: "my_test.csv"})
    sess.run(close_filename_queue)
    try:
        while True:
            sess.run(enqueue_instance)
    except tf.errors.OutOfRangeError as ex:
        pass  # no more records in the current file and no more files to read
    sess.run(close_instance_queue)

First we open the session, and then we enqueue the filename "my_test.csv" and immediately close that queue since we will not enqueue any more filenames. Then we run an infinite loop to enqueue instances one by one. The enqueue_instance depends on the reader reading the next line, so at every iteration a new record is read until it reaches the end of the file. At that point it tries to read the filename queue to know which file to read next, and since the queue is closed it throws an OutOfRangeError exception (if we did not close the queue, it would just remain blocked until we pushed another filename or closed the queue). Lastly, we close the instance queue so that the training operations pulling from it won't get blocked forever. Figure 12-9 summarizes what we have learned; it represents a typical graph for reading training instances from a set of CSV files.

Figure 12-9. A graph dedicated to reading training instances from CSV files
In the training graph, you need to create the shared instance queue and simply dequeue mini-batches from it:

instance_queue = tf.RandomShuffleQueue([...], shared_name="shared_instance_q")
mini_batch_instances, mini_batch_targets = instance_queue.dequeue_up_to(2)
[...]  # use the mini_batch instances and targets to build the training graph
training_op = [...]

with tf.Session([...]) as sess:
    try:
        for step in range(max_steps):
            sess.run(training_op)
    except tf.errors.OutOfRangeError as ex:
        pass  # no more training instances

In this example, the first mini-batch will contain the first two instances of the CSV file, and the second mini-batch will contain the last instance.

WARNING
TensorFlow queues don't handle sparse tensors well, so if your training instances are sparse you should parse the records after the instance queue.

This architecture will only use one thread to read records and push them to the instance queue. You can get a much higher throughput by having multiple threads read simultaneously from multiple files using multiple readers. Let's see how.
Multithreaded readers using a Coordinator and a QueueRunner
To have multiple threads read instances simultaneously, you could create Python threads (using the threading module) and manage them yourself. However, TensorFlow provides some tools to make this simpler: the Coordinator class and the QueueRunner class.

A Coordinator is a very simple object whose sole purpose is to coordinate stopping multiple threads. First you create a Coordinator:

coord = tf.train.Coordinator()

Then you give it to all threads that need to stop jointly, and their main loop looks like this:

while not coord.should_stop():
    [...]  # do something

Any thread can request that every thread stop by calling the Coordinator's request_stop() method:

coord.request_stop()

Every thread will stop as soon as it finishes its current iteration. You can wait for all of the threads to finish by calling the Coordinator's join() method, passing it the list of threads:

coord.join(list_of_threads)
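Putting these pieces together, a self-managed version might look like this minimal sketch (plain Python threads; the loop body and the stopping condition are arbitrary placeholders):

import threading
import tensorflow as tf

coord = tf.train.Coordinator()

def worker(coord, thread_id):
    step = 0
    while not coord.should_stop():
        step += 1                      # [...] do something useful here
        if thread_id == 0 and step >= 100:
            coord.request_stop()       # one thread decides that everyone should stop

threads = [threading.Thread(target=worker, args=(coord, i)) for i in range(5)]
for t in threads:
    t.start()
coord.join(threads)  # wait for all five threads to finish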
A QueueRunner runs multiple threads that each run an enqueue operation repeatedly, filling up a queue as fast as possible. As soon as the queue is closed, the next thread that tries to push an item to the queue will get an OutOfRangeError; this thread catches the error and immediately tells other threads to stop using a Coordinator. The following code shows how you can use a QueueRunner to have five threads reading instances simultaneously and pushing them to an instance queue:

[...]  # same construction phase as earlier
queue_runner = tf.train.QueueRunner(instance_queue, [enqueue_instance] * 5)

with tf.Session() as sess:
    sess.run(enqueue_filename, feed_dict={filename: "my_test.csv"})
    sess.run(close_filename_queue)
    coord = tf.train.Coordinator()
    enqueue_threads = queue_runner.create_threads(sess, coord=coord, start=True)

The first line creates the QueueRunner and tells it to run five threads, all running the same enqueue_instance operation repeatedly. Then we start a session and we enqueue the name of the files to read (in this case just "my_test.csv"). Next we create a Coordinator that the QueueRunner will use to stop gracefully, as just explained. Finally, we tell the QueueRunner to create the threads and start them. The threads will read all training instances and push them to the instance queue, and then they will all stop gracefully.

This will be a bit more efficient than earlier, but we can do better. Currently all threads are reading from the same file. We can make them read simultaneously from separate files instead (assuming the training data is sharded across multiple CSV files) by creating multiple readers (see Figure 12-10).

Figure 12-10. Reading simultaneously from multiple files
For this we need to write a small function to create a reader and the nodes that will read and push one instance to the instance queue:

def read_and_push_instance(filename_queue, instance_queue):
    reader = tf.TextLineReader(skip_header_lines=1)
    key, value = reader.read(filename_queue)
    x1, x2, target = tf.decode_csv(value, record_defaults=[[-1.], [-1.], [-1]])
    features = tf.stack([x1, x2])
    enqueue_instance = instance_queue.enqueue([features, target])
    return enqueue_instance

Next we define the queues:

filename_queue = tf.FIFOQueue(capacity=10, dtypes=[tf.string], shapes=[()])
filename = tf.placeholder(tf.string)
enqueue_filename = filename_queue.enqueue([filename])
close_filename_queue = filename_queue.close()

instance_queue = tf.RandomShuffleQueue([...])

And finally we create the QueueRunner, but this time we give it a list of different enqueue operations. Each operation will use a different reader, so the threads will simultaneously read from different files:

read_and_enqueue_ops = [
    read_and_push_instance(filename_queue, instance_queue)
    for i in range(5)]
queue_runner = tf.train.QueueRunner(instance_queue, read_and_enqueue_ops)

The execution phase is then the same as before: first push the names of the files to read, then create a Coordinator and create and start the QueueRunner threads. This time all threads will read from different files simultaneously until all files are read entirely, and then the QueueRunner will close the instance queue so that other ops pulling from it don't get blocked.
Other convenience functions
TensorFlow also offers a few convenience functions to simplify some common tasks when reading training instances. We will go over just a few (see the API documentation for the full list).

The string_input_producer() takes a 1D tensor containing a list of filenames, creates a thread that pushes one filename at a time to the filename queue, and then closes the queue. If you specify a number of epochs, it will cycle through the filenames once per epoch before closing the queue. By default, it shuffles the filenames at each epoch. It creates a QueueRunner to manage its thread, and adds it to the GraphKeys.QUEUE_RUNNERS collection. To start every QueueRunner in that collection, you can call the tf.train.start_queue_runners() function. Note that if you forget to start the QueueRunner, the filename queue will be open and empty, and your readers will be blocked forever.
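For example, the reading pipeline above might be simplified along these lines (a sketch; the CSV file names are hypothetical and the parsing code is the same as earlier):

filenames = ["my_data_1.csv", "my_data_2.csv"]     # hypothetical shard names
filename_queue = tf.train.string_input_producer(filenames)
reader = tf.TextLineReader(skip_header_lines=1)
key, value = reader.read(filename_queue)
[...]  # parse the value and enqueue the instances, as earlier

with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    [...]  # run your training or reading operations
    coord.request_stop()
    coord.join(threads)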
There are a few other producer functions that similarly create a queue and a corresponding QueueRunner for running an enqueue operation (e.g., input_producer(), range_input_producer(), and slice_input_producer()).

The shuffle_batch() function takes a list of tensors (e.g., [features, target]) and creates:

A RandomShuffleQueue

A QueueRunner to enqueue the tensors to the queue (added to the GraphKeys.QUEUE_RUNNERS collection)

A dequeue_many operation to extract a mini-batch from the queue

This makes it easy to manage in a single process a multithreaded input pipeline feeding a queue and a training pipeline reading mini-batches from that queue. Also check out the batch(), batch_join(), and shuffle_batch_join() functions that provide similar functionality.
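As an illustration, here is a sketch of shuffle_batch() wired to the reader nodes from earlier (the batch size, capacity, and min_after_dequeue values are arbitrary assumptions):

reader = tf.TextLineReader(skip_header_lines=1)
key, value = reader.read(filename_queue)
x1, x2, target = tf.decode_csv(value, record_defaults=[[-1.], [-1.], [-1]])
features = tf.stack([x1, x2])

# One call creates the RandomShuffleQueue, its QueueRunner, and a dequeue_many op
X_batch, y_batch = tf.train.shuffle_batch(
    [features, target], batch_size=100,
    capacity=1000, min_after_dequeue=500)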
Okay! You now have all the tools you need to start training and running neural networks efficiently across multiple devices and servers on a TensorFlow cluster. Let's review what you have learned:

Using multiple GPU devices

Setting up and starting a TensorFlow cluster

Distributing computations across multiple devices and servers

Sharing variables (and other stateful ops such as queues and readers) across sessions using containers

Coordinating multiple graphs working asynchronously using queues

Reading inputs efficiently using readers, queue runners, and coordinators

Now let's use all of this to parallelize neural networks!
Parallelizing Neural Networks on a TensorFlow Cluster

In this section, first we will look at how to parallelize several neural networks by simply placing each one on a different device. Then we will look at the much trickier problem of training a single neural network across multiple devices and servers.

One Neural Network per Device

The most trivial way to train and run neural networks on a TensorFlow cluster is to take the exact same code you would use for a single device on a single machine, and specify the master server's address when creating the session. That's it: you're done! Your code will be running on the server's default device. You can change the device that will run your graph simply by putting your code's construction phase within a device block.

By running several client sessions in parallel (in different threads or different processes), connecting them to different servers, and configuring them to use different devices, you can quite easily train or run many neural networks in parallel, across all devices and all machines in your cluster (see Figure 12-11). The speedup is almost linear.4 Training 100 neural networks across 50 servers with 2 GPUs each will not take much longer than training just 1 neural network on 1 GPU.

Figure 12-11. Training one neural network per device

This solution is perfect for hyperparameter tuning: each device in the cluster will train a different model with its own set of hyperparameters. The more computing power you have, the larger the hyperparameter space you can explore.

It also works perfectly if you host a web service that receives a large number of queries per second (QPS) and you need your neural network to make a prediction for each query. Simply replicate the neural network across all devices on the cluster and dispatch queries across all devices. By adding more servers you can handle an unlimited number of QPS (however, this will not reduce the time it takes to process a single request, since it will still have to wait for a neural network to make a prediction).

NOTE Another option is to serve your neural networks using TensorFlow Serving. It is an open source system, released by Google in February 2016, designed to serve a high volume of queries to Machine Learning models (typically built with TensorFlow). It handles model versioning, so you can easily deploy a new version of your network to production, or experiment with various algorithms without interrupting your service, and it can sustain a heavy load by adding more servers. For more details, check out https://tensorflow.github.io/serving/.
In-Graph Versus Between-Graph Replication

You can also parallelize the training of a large ensemble of neural networks by simply placing every neural network on a different device (ensembles were introduced in Chapter 7). However, once you want to run the ensemble, you will need to aggregate the individual predictions made by each neural network to produce the ensemble's prediction, and this requires a bit of coordination.

There are two major approaches to handling a neural network ensemble (or any other graph that contains large chunks of independent computations):

You can create one big graph, containing every neural network, each pinned to a different device, plus the computations needed to aggregate the individual predictions from all the neural networks (see Figure 12-12). Then you just create one session to any server in the cluster and let it take care of everything (including waiting for all individual predictions to be available before aggregating them). This approach is called in-graph replication.

Figure 12-12. In-graph replication

Alternatively, you can create one separate graph for each neural network and handle synchronization between these graphs yourself. This approach is called between-graph replication. One typical implementation is to coordinate the execution of these graphs using queues (see Figure 12-13). A set of clients handles one neural network each, reading from its dedicated input queue, and writing to its dedicated prediction queue. Another client is in charge of reading the inputs and pushing them to all the input queues (copying all inputs to every queue). Finally, one last client is in charge of reading one prediction from each prediction queue and aggregating them to produce the ensemble's prediction.

Figure 12-13. Between-graph replication

These solutions have their pros and cons. In-graph replication is somewhat simpler to implement since you don't have to manage multiple clients and multiple queues. However, between-graph replication is a bit easier to organize into well-bounded and easy-to-test modules. Moreover, it gives you more flexibility. For example, you could add a dequeue timeout in the aggregator client so that the ensemble would not fail even if one of the neural network clients crashes or if one neural network takes too long to produce its prediction. TensorFlow lets you specify a timeout when calling the run() function by passing a RunOptions with timeout_in_ms:
with tf.Session([...]) as sess:
    [...]
    run_options = tf.RunOptions()
    run_options.timeout_in_ms = 1000  # 1s timeout
    try:
        pred = sess.run(dequeue_prediction, options=run_options)
    except tf.errors.DeadlineExceededError as ex:
        [...]  # the dequeue operation timed out after 1s

Another way you can specify a timeout is to set the session's operation_timeout_in_ms configuration option, but in this case the run() function times out if any operation takes longer than the timeout delay:

config = tf.ConfigProto()
config.operation_timeout_in_ms = 1000  # 1s timeout for every operation

with tf.Session([...], config=config) as sess:
    [...]
    try:
        pred = sess.run(dequeue_prediction)
    except tf.errors.DeadlineExceededError as ex:
        [...]  # the dequeue operation timed out after 1s
Model Parallelism

So far we have run each neural network on a single device. What if we want to run a single neural network across multiple devices? This requires chopping your model into separate chunks and running each chunk on a different device. This is called model parallelism. Unfortunately, model parallelism turns out to be pretty tricky, and it really depends on the architecture of your neural network. For fully connected networks, there is generally not much to be gained from this approach (see Figure 12-14). Intuitively, it may seem that an easy way to split the model is to place each layer on a different device, but this does not work since each layer needs to wait for the output of the previous layer before it can do anything. So perhaps you can slice it vertically, for example with the left half of each layer on one device and the right half on another device? This is slightly better, since both halves of each layer can indeed work in parallel, but the problem is that each half of the next layer requires the output of both halves, so there will be a lot of cross-device communication (represented by the dashed arrows). This is likely to completely cancel out the benefit of the parallel computation, since cross-device communication is slow (especially if it is across separate machines).

Figure 12-14. Splitting a fully connected neural network

However, as we will see in Chapter 13, some neural network architectures, such as convolutional neural networks, contain layers that are only partially connected to the lower layers, so it is much easier to distribute chunks across devices in an efficient way.

Figure 12-15. Splitting a partially connected neural network

Moreover, as we will see in Chapter 14, some deep recurrent neural networks are composed of several layers of memory cells (see the left side of Figure 12-16). A cell's output at time t is fed back to its input at time t + 1 (as you can see more clearly on the right side of Figure 12-16). If you split such a network horizontally, placing each layer on a different device, then at the first step only one device will be active, at the second step two will be active, and by the time the signal propagates to the output layer all devices will be active simultaneously. There is still a lot of cross-device communication going on, but since each cell may be fairly complex, the benefit of running multiple cells in parallel often outweighs the communication penalty.

Figure 12-16. Splitting a deep recurrent neural network

In short, model parallelism can speed up running or training some types of neural networks, but not all, and it requires special care and tuning, such as making sure that devices that need to communicate the most run on the same machine.
Data Parallelism

Another way to parallelize the training of a neural network is to replicate it on each device, run a training step simultaneously on all replicas using a different mini-batch for each, and then aggregate the gradients to update the model parameters. This is called data parallelism (see Figure 12-17).

Figure 12-17. Data parallelism

There are two variants of this approach: synchronous updates and asynchronous updates.

Synchronous updates

With synchronous updates, the aggregator waits for all gradients to be available before computing the average and applying the result (i.e., using the aggregated gradients to update the model parameters). Once a replica has finished computing its gradients, it must wait for the parameters to be updated before it can proceed to the next mini-batch. The downside is that some devices may be slower than others, so all other devices will have to wait for them at every step. Moreover, the parameters will be copied to every device almost at the same time (immediately after the gradients are applied), which may saturate the parameter servers' bandwidth.

TIP To reduce the waiting time at each step, you could ignore the gradients from the slowest few replicas (typically ~10%). For example, you could run 20 replicas, but only aggregate the gradients from the fastest 18 replicas at each step, and just ignore the gradients from the last 2. As soon as the parameters are updated, the first 18 replicas can start working again immediately, without having to wait for the 2 slowest replicas. This setup is generally described as having 18 replicas plus 2 spare replicas.5

Asynchronous updates

With asynchronous updates, whenever a replica has finished computing the gradients, it immediately uses them to update the model parameters. There is no aggregation (remove the "mean" step in Figure 12-17), and no synchronization. Replicas just work independently of the other replicas. Since there is no waiting for the other replicas, this approach runs more training steps per minute. Moreover, although the parameters still need to be copied to every device at every step, this happens at different times for each replica, so the risk of bandwidth saturation is reduced.

Data parallelism with asynchronous updates is an attractive choice, because of its simplicity, the absence of synchronization delay, and a better use of the bandwidth. However, although it works reasonably well in practice, it is almost surprising that it works at all! Indeed, by the time a replica has finished computing the gradients based on some parameter values, these parameters will have been updated several times by other replicas (on average N – 1 times if there are N replicas) and there is no guarantee that the computed gradients will still be pointing in the right direction (see Figure 12-18). When gradients are severely out-of-date, they are called stale gradients: they can slow down convergence, introducing noise and wobble effects (the learning curve may contain temporary oscillations), or they can even make the training algorithm diverge.

Figure 12-18. Stale gradients when using asynchronous updates
There are a few ways to reduce the effect of stale gradients:

Reduce the learning rate.

Drop stale gradients or scale them down.

Adjust the mini-batch size.

Start the first few epochs using just one replica (this is called the warmup phase). Stale gradients tend to be more damaging at the beginning of training, when gradients are typically large and the parameters have not settled into a valley of the cost function yet, so different replicas may push the parameters in quite different directions.
A paper published by the Google Brain team in April 2016 benchmarked various approaches and found that data parallelism with synchronous updates using a few spare replicas was the most efficient, not only converging faster but also producing a better model. However, this is still an active area of research, so you should not rule out asynchronous updates quite yet.

Bandwidth saturation

Whether you use synchronous or asynchronous updates, data parallelism still requires communicating the model parameters from the parameter servers to every replica at the beginning of every training step, and the gradients in the other direction at the end of each training step. Unfortunately, this means that there always comes a point where adding an extra GPU will not improve performance at all, because the time spent moving the data in and out of GPU RAM (and possibly across the network) will outweigh the speedup obtained by splitting the computation load. At that point, adding more GPUs will just increase saturation and slow down training.

TIP For some models, typically relatively small and trained on a very large training set, you are often better off training the model on a single machine with a single GPU.

Saturation is more severe for large dense models, since they have a lot of parameters and gradients to transfer. It is less severe for small models (but the parallelization gain is small) and also for large sparse models, since the gradients are typically mostly zeros, so they can be communicated efficiently. Jeff Dean, initiator and lead of the Google Brain project, reported typical speedups of 25–40x when distributing computations across 50 GPUs for dense models, and a 300x speedup for sparser models trained across 500 GPUs. As you can see, sparse models really do scale better. Here are a few concrete examples:

Neural Machine Translation: 6x speedup on 8 GPUs

Inception/ImageNet: 32x speedup on 50 GPUs

RankBrain: 300x speedup on 500 GPUs

These numbers represent the state of the art in Q1 2016. Beyond a few dozen GPUs for a dense model or a few hundred GPUs for a sparse model, saturation kicks in and performance degrades. There is plenty of research going on to solve this problem (exploring peer-to-peer architectures rather than centralized parameter servers, using lossy model compression, optimizing when and what the replicas need to communicate, and so on), so there will likely be a lot of progress in parallelizing neural networks in the next few years.

In the meantime, here are a few simple steps you can take to reduce the saturation problem:

Group your GPUs on a few servers rather than scattering them across many servers. This will avoid unnecessary network hops.

Shard the parameters across multiple parameter servers (as discussed earlier).

Drop the model parameters' float precision from 32 bits (tf.float32) to 16 bits (tf.bfloat16). This will cut in half the amount of data to transfer, without much impact on the convergence rate or the model's performance.
TIP Although 16-bit precision is the minimum for training neural networks, you can actually drop down to 8-bit precision after training to reduce the size of the model and speed up computations. This is called quantizing the neural network. It is particularly useful for deploying and running pretrained models on mobile phones. See Pete Warden's great post on the subject.
TensorFlow implementation

To implement data parallelism using TensorFlow, you first need to choose whether you want in-graph replication or between-graph replication, and whether you want synchronous updates or asynchronous updates. Let's look at how you would implement each combination (see the exercises and the Jupyter notebooks for complete code examples):

With in-graph replication + synchronous updates, you build one big graph containing all the model replicas (placed on different devices), and a few nodes to aggregate all their gradients and feed them to an optimizer. Your code opens a session to the cluster and simply runs the training operation repeatedly.

With in-graph replication + asynchronous updates, you also create one big graph, but with one optimizer per replica, and you run one thread per replica, repeatedly running the replica's optimizer.

With between-graph replication + asynchronous updates, you run multiple independent clients (typically in separate processes), each training the model replica as if it were alone in the world, but the parameters are actually shared with other replicas (using a resource container).

With between-graph replication + synchronous updates, once again you run multiple clients, each training a model replica based on shared parameters, but this time you wrap the optimizer (e.g., a MomentumOptimizer) within a SyncReplicasOptimizer. Each replica uses this optimizer as it would use any other optimizer, but under the hood this optimizer sends the gradients to a set of queues (one per variable), which is read by one of the replicas' SyncReplicasOptimizer, called the chief. The chief aggregates the gradients and applies them, then writes a token to a token queue for each replica, signaling it that it can go ahead and compute the next gradients. This approach supports having spare replicas, as sketched below.
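Here is a minimal sketch of that last combination, wrapping a MomentumOptimizer in a tf.train.SyncReplicasOptimizer (the replica counts, learning rate, and the loss/global_step names are illustrative assumptions, not part of the original example):

# Assumed setup: 20 worker replicas, aggregate gradients from the fastest 18
optimizer = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9)
sync_optimizer = tf.train.SyncReplicasOptimizer(
    optimizer,
    replicas_to_aggregate=18,   # gradients needed before the chief applies an update
    total_num_replicas=20)      # so 2 spare replicas

training_op = sync_optimizer.minimize(loss, global_step=global_step)

# Only the chief replica manages the token queue; the hook takes care of it
sync_hook = sync_optimizer.make_session_run_hook(is_chief=True)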
If you go through the exercises, you will implement each of these four solutions. You will easily be able to apply what you have learned to train large deep neural networks across dozens of servers and GPUs! In the following chapters we will go through a few more important neural network architectures before we tackle Reinforcement Learning.
Exercises

1. If you get a CUDA_ERROR_OUT_OF_MEMORY when starting your TensorFlow program, what is probably going on? What can you do about it?

2. What is the difference between pinning an operation on a device and placing an operation on a device?

3. If you are running on a GPU-enabled TensorFlow installation, and you just use the default placement, will all operations be placed on the first GPU?

4. If you pin a variable to "/gpu:0", can it be used by operations placed on /gpu:1? Or by operations placed on "/cpu:0"? Or by operations pinned to devices located on other servers?

5. Can two operations placed on the same device run in parallel?

6. What is a control dependency and when would you want to use one?

7. Suppose you train a DNN for days on a TensorFlow cluster, and immediately after your training program ends you realize that you forgot to save the model using a Saver. Is your trained model lost?

8. Train several DNNs in parallel on a TensorFlow cluster, using different hyperparameter values. This could be DNNs for MNIST classification or any other task you are interested in. The simplest option is to write a single client program that trains only one DNN, then run this program in multiple processes in parallel, with different hyperparameter values for each client. The program should have command-line options to control what server and device the DNN should be placed on, and what resource container and hyperparameter values to use (make sure to use a different resource container for each DNN). Use a validation set or cross-validation to select the top three models.

9. Create an ensemble using the top three models from the previous exercise. Define it in a single graph, ensuring that each DNN runs on a different device. Evaluate it on the validation set: does the ensemble perform better than the individual DNNs?

10. Train a DNN using between-graph replication and data parallelism with asynchronous updates, timing how long it takes to reach a satisfying performance. Next, try again using synchronous updates. Do synchronous updates produce a better model? Is training faster? Split the DNN vertically and place each vertical slice on a different device, and train the model again. Is training any faster? Is the performance any different?

Solutions to these exercises are available in Appendix A.
1. "TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems," Google Research (2015).

2. You can even start multiple tasks in the same process. It may be useful for tests, but it is not recommended in production.

3. It is the next version of Google's internal Stubby service, which Google has used successfully for over a decade. See http://grpc.io/ for more details.

4. Not 100% linear if you wait for all devices to finish, since the total time will be the time taken by the slowest device.

5. This name is slightly confusing since it sounds like some replicas are special, doing nothing. In reality, all replicas are equivalent: they all work hard to be among the fastest at each training step, and the losers vary at every step (unless some devices are really slower than others).
Chapter 13. Convolutional Neural Networks

Although IBM's Deep Blue supercomputer beat the chess world champion Garry Kasparov back in 1996, until quite recently computers were unable to reliably perform seemingly trivial tasks such as detecting a puppy in a picture or recognizing spoken words. Why are these tasks so effortless to us humans? The answer lies in the fact that perception largely takes place outside the realm of our consciousness, within specialized visual, auditory, and other sensory modules in our brains. By the time sensory information reaches our consciousness, it is already adorned with high-level features; for example, when you look at a picture of a cute puppy, you cannot choose not to see the puppy, or not to notice its cuteness. Nor can you explain how you recognize a cute puppy; it's just obvious to you. Thus, we cannot trust our subjective experience: perception is not trivial at all, and to understand it we must look at how the sensory modules work.

Convolutional neural networks (CNNs) emerged from the study of the brain's visual cortex, and they have been used in image recognition since the 1980s. In the last few years, thanks to the increase in computational power, the amount of available training data, and the tricks presented in Chapter 11 for training deep nets, CNNs have managed to achieve superhuman performance on some complex visual tasks. They power image search services, self-driving cars, automatic video classification systems, and more. Moreover, CNNs are not restricted to visual perception: they are also successful at other tasks, such as voice recognition or natural language processing (NLP); however, we will focus on visual applications for now.

In this chapter we will present where CNNs came from, what their building blocks look like, and how to implement them using TensorFlow. Then we will present some of the best CNN architectures.

The Architecture of the Visual Cortex

David H. Hubel and Torsten Wiesel performed a series of experiments on cats in 19581 and 19592 (and a few years later on monkeys3), giving crucial insights on the structure of the visual cortex (the authors received the Nobel Prize in Physiology or Medicine in 1981 for their work). In particular, they showed that many neurons in the visual cortex have a small local receptive field, meaning they react only to visual stimuli located in a limited region of the visual field (see Figure 13-1, in which the local receptive fields of five neurons are represented by dashed circles). The receptive fields of different neurons may overlap, and together they tile the whole visual field. Moreover, the authors showed that some neurons react only to images of horizontal lines, while others react only to lines with different orientations (two neurons may have the same receptive field but react to different line orientations). They also noticed that some neurons have larger receptive fields, and they react to more complex patterns that are combinations of the lower-level patterns. These observations led to the idea that the higher-level neurons are based on the outputs of neighboring lower-level neurons (in Figure 13-1, notice that each neuron is connected only to a few neurons from the previous layer). This powerful architecture is able to detect all sorts of complex patterns in any area of the visual field.

Figure 13-1. Local receptive fields in the visual cortex

These studies of the visual cortex inspired the neocognitron, introduced in 1980,4 which gradually evolved into what we now call convolutional neural networks. An important milestone was a 1998 paper5 by Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner, which introduced the famous LeNet-5 architecture, widely used to recognize handwritten check numbers. This architecture has some building blocks that you already know, such as fully connected layers and sigmoid activation functions, but it also introduces two new building blocks: convolutional layers and pooling layers. Let's look at them now.

NOTE Why not simply use a regular deep neural network with fully connected layers for image recognition tasks? Unfortunately, although this works fine for small images (e.g., MNIST), it breaks down for larger images because of the huge number of parameters it requires. For example, a 100 × 100 image has 10,000 pixels, and if the first layer has just 1,000 neurons (which already severely restricts the amount of information transmitted to the next layer), this means a total of 10 million connections. And that's just the first layer. CNNs solve this problem using partially connected layers.

Convolutional Layer

The most important building block of a CNN is the convolutional layer:6 neurons in the first convolutional layer are not connected to every single pixel in the input image (like they were in previous chapters), but only to pixels in their receptive fields (see Figure 13-2). In turn, each neuron in the second convolutional layer is connected only to neurons located within a small rectangle in the first layer. This architecture allows the network to concentrate on low-level features in the first hidden layer, then assemble them into higher-level features in the next hidden layer, and so on. This hierarchical structure is common in real-world images, which is one of the reasons why CNNs work so well for image recognition.

Figure 13-2. CNN layers with rectangular local receptive fields

NOTE Until now, all multilayer neural networks we looked at had layers composed of a long line of neurons, and we had to flatten input images to 1D before feeding them to the neural network. Now each layer is represented in 2D, which makes it easier to match neurons with their corresponding inputs.

A neuron located in row i, column j of a given layer is connected to the outputs of the neurons in the previous layer located in rows i to i + fh – 1, columns j to j + fw – 1, where fh and fw are the height and width of the receptive field (see Figure 13-3). In order for a layer to have the same height and width as the previous layer, it is common to add zeros around the inputs, as shown in the diagram. This is called zero padding.

Figure 13-3. Connections between layers and zero padding
It is also possible to connect a large input layer to a much smaller layer by spacing out the receptive fields, as shown in Figure 13-4. The distance between two consecutive receptive fields is called the stride. In the diagram, a 5 × 7 input layer (plus zero padding) is connected to a 3 × 4 layer, using 3 × 3 receptive fields and a stride of 2 (in this example the stride is the same in both directions, but it does not have to be so). A neuron located in row i, column j in the upper layer is connected to the outputs of the neurons in the previous layer located in rows i × sh to i × sh + fh – 1, columns j × sw to j × sw + fw – 1, where sh and sw are the vertical and horizontal strides.
Figure 13-4. Reducing dimensionality using a stride

Filters

A neuron's weights can be represented as a small image the size of the receptive field. For example, Figure 13-5 shows two possible sets of weights, called filters (or convolution kernels). The first one is represented as a black square with a vertical white line in the middle (it is a 7 × 7 matrix full of 0s except for the central column, which is full of 1s); neurons using these weights will ignore everything in their receptive field except for the central vertical line (since all inputs will get multiplied by 0, except for the ones located in the central vertical line). The second filter is a black square with a horizontal white line in the middle. Once again, neurons using these weights will ignore everything in their receptive field except for the central horizontal line.

Now if all neurons in a layer use the same vertical line filter (and the same bias term), and you feed the network the input image shown in Figure 13-5 (bottom image), the layer will output the top-left image. Notice that the vertical white lines get enhanced while the rest gets blurred. Similarly, the upper-right image is what you get if all neurons use the horizontal line filter; notice that the horizontal white lines get enhanced while the rest is blurred out. Thus, a layer full of neurons using the same filter gives you a feature map, which highlights the areas in an image that are most similar to the filter. During training, a CNN finds the most useful filters for its task, and it learns to combine them into more complex patterns (e.g., a cross is an area in an image where both the vertical filter and the horizontal filter are active).

Figure 13-5. Applying two different filters to get two feature maps

Stacking Multiple Feature Maps

Up to now, for simplicity, we have represented each convolutional layer as a thin 2D layer, but in reality it is composed of several feature maps of equal sizes, so it is more accurately represented in 3D (see Figure 13-6). Within one feature map, all neurons share the same parameters (weights and bias term), but different feature maps may have different parameters. A neuron's receptive field is the same as described earlier, but it extends across all the previous layer's feature maps. In short, a convolutional layer simultaneously applies multiple filters to its inputs, making it capable of detecting multiple features anywhere in its inputs.

NOTE The fact that all neurons in a feature map share the same parameters dramatically reduces the number of parameters in the model, but most importantly it means that once the CNN has learned to recognize a pattern in one location, it can recognize it in any other location. In contrast, once a regular DNN has learned to recognize a pattern in one location, it can recognize it only in that particular location.

Moreover, input images are also composed of multiple sublayers: one per color channel. There are typically three: red, green, and blue (RGB). Grayscale images have just one channel, but some images may have much more, for example satellite images that capture extra light frequencies (such as infrared).

Figure 13-6. Convolution layers with multiple feature maps, and images with three channels

Specifically, a neuron located in row i, column j of the feature map k in a given convolutional layer l is connected to the outputs of the neurons in the previous layer l – 1, located in rows i × sh to i × sh + fh – 1 and columns j × sw to j × sw + fw – 1, across all feature maps (in layer l – 1). Note that all neurons located in the same row i and column j but in different feature maps are connected to the outputs of the exact same neurons in the previous layer.

Equation 13-1 summarizes the preceding explanations in one big mathematical equation: it shows how to compute the output of a given neuron in a convolutional layer. It is a bit ugly due to all the different indices, but all it does is calculate the weighted sum of all the inputs, plus the bias term.

Equation 13-1. Computing the output of a neuron in a convolutional layer
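Written out (reconstructed here from the symbol definitions below, so double-check it against the original book), the equation reads:

$$z_{i,j,k} = b_k + \sum_{u=0}^{f_h-1} \sum_{v=0}^{f_w-1} \sum_{k'=0}^{f_{n'}-1} x_{i',j',k'} \cdot w_{u,v,k',k} \quad \text{with} \quad i' = i \times s_h + u, \;\; j' = j \times s_w + v$$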
zi,j,k is the output of the neuron located in row i, column j in feature map k of the convolutional layer (layer l).

As explained earlier, sh and sw are the vertical and horizontal strides, fh and fw are the height and width of the receptive field, and fn′ is the number of feature maps in the previous layer (layer l – 1).

xi′,j′,k′ is the output of the neuron located in layer l – 1, row i′, column j′, feature map k′ (or channel k′ if the previous layer is the input layer).

bk is the bias term for feature map k (in layer l). You can think of it as a knob that tweaks the overall brightness of the feature map k.

wu,v,k′,k is the connection weight between any neuron in feature map k of the layer l and its input located at row u, column v (relative to the neuron's receptive field), and feature map k′.

TensorFlow Implementation

In TensorFlow, each input image is typically represented as a 3D tensor of shape [height, width, channels]. A mini-batch is represented as a 4D tensor of shape [mini-batch size, height, width, channels]. The weights of a convolutional layer are represented as a 4D tensor of shape [fh, fw, fn′, fn]. The bias terms of a convolutional layer are simply represented as a 1D tensor of shape [fn].

Let's look at a simple example. The following code loads two sample images, using Scikit-Learn's load_sample_image() (one image of a Chinese temple, and the other of a flower). Then it creates two 7 × 7 filters (one with a vertical white line in the middle, and the other with a horizontal white line in the middle), and applies them to both images using a convolutional layer built using TensorFlow's tf.nn.conv2d() function (with zero padding and a stride of 2). Finally, it plots one of the resulting feature maps (similar to the top-right image in Figure 13-5).
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from sklearn.datasets import load_sample_image

# Load sample images
china = load_sample_image("china.jpg")
flower = load_sample_image("flower.jpg")
dataset = np.array([china, flower], dtype=np.float32)
batch_size, height, width, channels = dataset.shape

# Create 2 filters
filters = np.zeros(shape=(7, 7, channels, 2), dtype=np.float32)
filters[:, 3, :, 0] = 1  # vertical line
filters[3, :, :, 1] = 1  # horizontal line

# Create a graph with input X plus a convolutional layer applying the 2 filters
X = tf.placeholder(tf.float32, shape=(None, height, width, channels))
convolution = tf.nn.conv2d(X, filters, strides=[1, 2, 2, 1], padding="SAME")

with tf.Session() as sess:
    output = sess.run(convolution, feed_dict={X: dataset})

plt.imshow(output[0, :, :, 1], cmap="gray")  # plot 1st image's 2nd feature map
plt.show()
Most of this code is self-explanatory, but the tf.nn.conv2d() line deserves a bit of explanation:

X is the input mini-batch (a 4D tensor, as explained earlier).

filters is the set of filters to apply (also a 4D tensor, as explained earlier).

strides is a four-element 1D array, where the two central elements are the vertical and horizontal strides (sh and sw). The first and last elements must currently be equal to 1. They may one day be used to specify a batch stride (to skip some instances) and a channel stride (to skip some of the previous layer's feature maps or channels).

padding must be either "VALID" or "SAME": If set to "VALID", the convolutional layer does not use zero padding, and may ignore some rows and columns at the bottom and right of the input image, depending on the stride, as shown in Figure 13-7 (for simplicity, only the horizontal dimension is shown here, but of course the same logic applies to the vertical dimension).

If set to "SAME", the convolutional layer uses zero padding if necessary. In this case, the number of output neurons is equal to the number of input neurons divided by the stride, rounded up (in this example, ceil(13 / 5) = 3). Then zeros are added as evenly as possible around the inputs.
Figure 13-7. Padding options (input width: 13, filter width: 6, stride: 5)

In this simple example, we manually created the filters, but in a real CNN you would let the training algorithm discover the best filters automatically. TensorFlow has a tf.layers.conv2d() function which creates the filters variable for you (called kernel), and initializes it randomly. For example, the following code creates an input placeholder followed by a convolutional layer with two 7 × 7 feature maps, using 2 × 2 strides (note that this function only expects the vertical and horizontal strides), and "SAME" padding:

X = tf.placeholder(shape=(None, height, width, channels), dtype=tf.float32)
conv = tf.layers.conv2d(X, filters=2, kernel_size=7, strides=[2, 2],
                        padding="SAME")

Unfortunately, convolutional layers have quite a few hyperparameters: you must choose the number of filters, their height and width, the strides, and the padding type. As always, you can use cross-validation to find the right hyperparameter values, but this is very time-consuming. We will discuss common CNN architectures later, to give you some idea of what hyperparameter values work best in practice.
Memory Requirements

Another problem with CNNs is that the convolutional layers require a huge amount of RAM, especially during training, because the reverse pass of backpropagation requires all the intermediate values computed during the forward pass.

For example, consider a convolutional layer with 5 × 5 filters, outputting 200 feature maps of size 150 × 100, with stride 1 and SAME padding. If the input is a 150 × 100 RGB image (three channels), then the number of parameters is (5 × 5 × 3 + 1) × 200 = 15,200 (the + 1 corresponds to the bias terms), which is fairly small compared to a fully connected layer.7 However, each of the 200 feature maps contains 150 × 100 neurons, and each of these neurons needs to compute a weighted sum of its 5 × 5 × 3 = 75 inputs: that's a total of 225 million float multiplications. Not as bad as a fully connected layer, but still quite computationally intensive. Moreover, if the feature maps are represented using 32-bit floats, then the convolutional layer's output will occupy 200 × 150 × 100 × 32 = 96 million bits (about 11.4 MB) of RAM.8 And that's just for one instance! If a training batch contains 100 instances, then this layer will use up over 1 GB of RAM!
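These figures are easy to double-check; here is a small sketch reproducing the arithmetic (the layer dimensions are the ones assumed in the paragraph above):

feature_maps, fm_height, fm_width = 200, 150, 100
kernel_h, kernel_w, channels = 5, 5, 3

params = (kernel_h * kernel_w * channels + 1) * feature_maps                # 15,200
multiplications = (kernel_h * kernel_w * channels                          # 75 inputs per neuron
                   * fm_height * fm_width * feature_maps)                  # 225,000,000
output_bits = feature_maps * fm_height * fm_width * 32                     # 96,000,000 bits
output_mb = output_bits / 8 / (1024 * 1024)                                # about 11.4 MB
print(params, multiplications, output_bits, round(output_mb, 1))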
During inference (i.e., when making a prediction for a new instance) the RAM occupied by one layer can be released as soon as the next layer has been computed, so you only need as much RAM as required by two consecutive layers. But during training everything computed during the forward pass needs to be preserved for the reverse pass, so the amount of RAM needed is (at least) the total amount of RAM required by all layers.

TIP If training crashes because of an out-of-memory error, you can try reducing the mini-batch size. Alternatively, you can try reducing dimensionality using a stride, or removing a few layers. Or you can try using 16-bit floats instead of 32-bit floats. Or you could distribute the CNN across multiple devices.

Now let's look at the second common building block of CNNs: the pooling layer.
Pooling Layer

Once you understand how convolutional layers work, the pooling layers are quite easy to grasp. Their goal is to subsample (i.e., shrink) the input image in order to reduce the computational load, the memory usage, and the number of parameters (thereby limiting the risk of overfitting). Reducing the input image size also makes the neural network tolerate a little bit of image shift (location invariance).

Just like in convolutional layers, each neuron in a pooling layer is connected to the outputs of a limited number of neurons in the previous layer, located within a small rectangular receptive field. You must define its size, the stride, and the padding type, just like before. However, a pooling neuron has no weights; all it does is aggregate the inputs using an aggregation function such as the max or mean. Figure 13-8 shows a max pooling layer, which is the most common type of pooling layer. In this example, we use a 2 × 2 pooling kernel, a stride of 2, and no padding. Note that only the max input value in each kernel makes it to the next layer. The other inputs are dropped.

Figure 13-8. Max pooling layer (2 × 2 pooling kernel, stride 2, no padding)

This is obviously a very destructive kind of layer: even with a tiny 2 × 2 kernel and a stride of 2, the output will be two times smaller in both directions (so its area will be four times smaller), simply dropping 75% of the input values.

A pooling layer typically works on every input channel independently, so the output depth is the same as the input depth. You may alternatively pool over the depth dimension, as we will see next, in which case the image's spatial dimensions (height and width) remain unchanged, but the number of channels is reduced.

Implementing a max pooling layer in TensorFlow is quite easy. The following code creates a max pooling layer using a 2 × 2 kernel, stride 2, and no padding, then applies it to all the images in the dataset:
[...]  # load the image dataset, just like above

# Create a graph with input X plus a max pooling layer
X = tf.placeholder(tf.float32, shape=(None, height, width, channels))
max_pool = tf.nn.max_pool(X, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding="VALID")

with tf.Session() as sess:
    output = sess.run(max_pool, feed_dict={X: dataset})

plt.imshow(output[0].astype(np.uint8))  # plot the output for the 1st image
plt.show()
The ksize argument contains the kernel shape along all four dimensions of the input tensor: [batch size, height, width, channels]. TensorFlow currently does not support pooling over multiple instances, so the first element of ksize must be equal to 1. Moreover, it does not support pooling over both the spatial dimensions (height and width) and the depth dimension, so either ksize[1] and ksize[2] must both be equal to 1, or ksize[3] must be equal to 1.

To create an average pooling layer, just use the avg_pool() function instead of max_pool().

Now you know all the building blocks to create a convolutional neural network. Let's see how to assemble them.
CNN Architectures

Typical CNN architectures stack a few convolutional layers (each one generally followed by a ReLU layer), then a pooling layer, then another few convolutional layers (+ReLU), then another pooling layer, and so on. The image gets smaller and smaller as it progresses through the network, but it also typically gets deeper and deeper (i.e., with more feature maps) thanks to the convolutional layers (see Figure 13-9). At the top of the stack, a regular feedforward neural network is added, composed of a few fully connected layers (+ReLUs), and the final layer outputs the prediction (e.g., a softmax layer that outputs estimated class probabilities).

Figure 13-9. Typical CNN architecture
TIP A common mistake is to use convolution kernels that are too large. You can often get the same effect as a 5 × 5 kernel by stacking two 3 × 3 kernels on top of each other, for a lot less compute.
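For illustration, here is a minimal sketch of that idea using the tf.layers.conv2d() function introduced earlier (the feature-map counts are arbitrary assumptions):

X = tf.placeholder(tf.float32, shape=(None, height, width, channels))

# Two stacked 3x3 convolutions: a 5x5 effective receptive field, with fewer parameters
conv1 = tf.layers.conv2d(X, filters=32, kernel_size=3, strides=1,
                         padding="SAME", activation=tf.nn.relu)
conv2 = tf.layers.conv2d(conv1, filters=32, kernel_size=3, strides=1,
                         padding="SAME", activation=tf.nn.relu)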
Over the years, variants of this fundamental architecture have been developed, leading to amazing advances in the field. A good measure of this progress is the error rate in competitions such as the ILSVRC ImageNet challenge. In this competition the top-5 error rate for image classification fell from over 26% to barely over 3% in just five years. The top-five error rate is the number of test images for which the system's top 5 predictions did not include the correct answer. The images are large (256 pixels high) and there are 1,000 classes, some of which are really subtle (try distinguishing 120 dog breeds). Looking at the evolution of the winning entries is a good way to understand how CNNs work.

We will first look at the classical LeNet-5 architecture (1998), then three of the winners of the ILSVRC challenge: AlexNet (2012), GoogLeNet (2014), and ResNet (2015).

OTHER VISUAL TASKS

There was stunning progress as well in other visual tasks such as object detection and localization, and image segmentation. In object detection and localization, the neural network typically outputs a sequence of bounding boxes around various objects in the image. For example, see Maxime Oquab et al.'s 2015 paper that outputs a heatmap for each object class, or Russell Stewart et al.'s 2015 paper that uses a combination of a CNN to detect faces and a recurrent neural network to output a sequence of bounding boxes around them. In image segmentation, the net outputs an image (usually of the same size as the input) where each pixel indicates the class of the object to which the corresponding input pixel belongs. For example, check out Evan Shelhamer et al.'s 2016 paper.

LeNet-5

The LeNet-5 architecture is perhaps the most widely known CNN architecture. As mentioned earlier, it was created by Yann LeCun in 1998 and widely used for handwritten digit recognition (MNIST). It is composed of the layers shown in Table 13-1.
Table 13-1. LeNet-5 architecture

Layer  Type             Maps  Size     Kernel size  Stride  Activation
Out    Fully Connected  –     10       –            –       RBF
F6     Fully Connected  –     84       –            –       tanh
C5     Convolution      120   1 × 1    5 × 5        1       tanh
S4     Avg Pooling      16    5 × 5    2 × 2        2       tanh
C3     Convolution      16    10 × 10  5 × 5        1       tanh
S2     Avg Pooling      6     14 × 14  2 × 2        2       tanh
C1     Convolution      6     28 × 28  5 × 5        1       tanh
In     Input            1     32 × 32  –            –       –
There are a few extra details to be noted:

MNIST images are 28 × 28 pixels, but they are zero-padded to 32 × 32 pixels and normalized before being fed to the network. The rest of the network does not use any padding, which is why the size keeps shrinking as the image progresses through the network.

The average pooling layers are slightly more complex than usual: each neuron computes the mean of its inputs, then multiplies the result by a learnable coefficient (one per map) and adds a learnable bias term (again, one per map), then finally applies the activation function.

Most neurons in C3 maps are connected to neurons in only three or four S2 maps (instead of all six S2 maps). See table 1 in the original paper for details.

The output layer is a bit special: instead of computing the dot product of the inputs and the weight vector, each neuron outputs the square of the Euclidian distance between its input vector and its weight vector. Each output measures how much the image belongs to a particular digit class. The cross entropy cost function is now preferred, as it penalizes bad predictions much more, producing larger gradients and thus converging faster.

Yann LeCun's website ("LENET" section) features great demos of LeNet-5 classifying digits.
AlexNet

The AlexNet CNN architecture9 won the 2012 ImageNet ILSVRC challenge by a large margin: it achieved a 17% top-5 error rate while the second best achieved only 26%! It was developed by Alex Krizhevsky (hence the name), Ilya Sutskever, and Geoffrey Hinton. It is quite similar to LeNet-5, only much larger and deeper, and it was the first to stack convolutional layers directly on top of each other, instead of stacking a pooling layer on top of each convolutional layer. Table 13-2 presents this architecture.

Table 13-2. AlexNet architecture

Layer  Type             Maps     Size       Kernel size  Stride  Padding  Activation
Out    Fully Connected  –        1,000      –            –       –        Softmax
F9     Fully Connected  –        4,096      –            –       –        ReLU
F8     Fully Connected  –        4,096      –            –       –        ReLU
C7     Convolution      256      13 × 13    3 × 3        1       SAME     ReLU
C6     Convolution      384      13 × 13    3 × 3        1       SAME     ReLU
C5     Convolution      384      13 × 13    3 × 3        1       SAME     ReLU
S4     Max Pooling      256      13 × 13    3 × 3        2       VALID    –
C3     Convolution      256      27 × 27    5 × 5        1       SAME     ReLU
S2     Max Pooling      96       27 × 27    3 × 3        2       VALID    –
C1     Convolution      96       55 × 55    11 × 11      4       SAME     ReLU
In     Input            3 (RGB)  224 × 224  –            –       –        –
To reduce overfitting, the authors used two regularization techniques we discussed in previous chapters: first they applied dropout (with a 50% dropout rate) during training to the outputs of layers F8 and F9. Second, they performed data augmentation by randomly shifting the training images by various offsets, flipping them horizontally, and changing the lighting conditions.

AlexNet also uses a competitive normalization step immediately after the ReLU step of layers C1 and C3, called local response normalization. This form of normalization makes the neurons that most strongly activate inhibit neurons at the same location but in neighboring feature maps (such competitive activation has been observed in biological neurons). This encourages different feature maps to specialize, pushing them apart and forcing them to explore a wider range of features, ultimately improving generalization. Equation 13-2 shows how to apply LRN.

Equation 13-2. Local response normalization
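Written out (reconstructed here from the variable definitions below, so double-check it against the original book), the equation reads:

$$b_i = a_i \left(k + \alpha \sum_{j=j_{\text{low}}}^{j_{\text{high}}} a_j^2\right)^{-\beta} \quad \text{with} \quad j_{\text{high}} = \min\!\left(i + \tfrac{r}{2},\, f_n - 1\right), \;\; j_{\text{low}} = \max\!\left(0,\, i - \tfrac{r}{2}\right)$$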
bi is the normalized output of the neuron located in feature map i, at some row u and column v (note that in this equation we consider only neurons located at this row and column, so u and v are not shown).

ai is the activation of that neuron after the ReLU step, but before normalization.

k, α, β, and r are hyperparameters. k is called the bias, and r is called the depth radius.

fn is the number of feature maps.

For example, if r = 2 and a neuron has a strong activation, it will inhibit the activation of the neurons located in the feature maps immediately above and below its own.

In AlexNet, the hyperparameters are set as follows: r = 2, α = 0.00002, β = 0.75, and k = 1. This step can be implemented using TensorFlow's tf.nn.local_response_normalization() operation.
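For example, a sketch of this op applied to the output of a previous layer (here a hypothetical conv1 tensor) with AlexNet's hyperparameters might look like this; the op's depth_radius argument corresponds to r:

lrn1 = tf.nn.local_response_normalization(conv1, depth_radius=2, bias=1,
                                          alpha=0.00002, beta=0.75)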
A variant of AlexNet called ZF Net was developed by Matthew Zeiler and Rob Fergus and won the 2013 ILSVRC challenge. It is essentially AlexNet with a few tweaked hyperparameters (number of feature maps, kernel size, stride, etc.).

GoogLeNet

The GoogLeNet architecture was developed by Christian Szegedy et al. from Google Research,10 and it won the ILSVRC 2014 challenge by pushing the top-5 error rate below 7%. This great performance came in large part from the fact that the network was much deeper than previous CNNs (see Figure 13-11). This was made possible by sub-networks called inception modules,11 which allow GoogLeNet to use parameters much more efficiently than previous architectures: GoogLeNet actually has 10 times fewer parameters than AlexNet (roughly 6 million instead of 60 million).

Figure 13-10 shows the architecture of an inception module. The notation "3 × 3 + 2(S)" means that the layer uses a 3 × 3 kernel, stride 2, and SAME padding. The input signal is first copied and fed to four different layers. All convolutional layers use the ReLU activation function. Note that the second set of convolutional layers uses different kernel sizes (1 × 1, 3 × 3, and 5 × 5), allowing them to capture patterns at different scales. Also note that every single layer uses a stride of 1 and SAME padding (even the max pooling layer), so their outputs all have the same height and width as their inputs. This makes it possible to concatenate all the outputs along the depth dimension in the final depth concat layer (i.e., stack the feature maps from all four top convolutional layers). This concatenation layer can be implemented in TensorFlow using the tf.concat() operation, with axis=3 (axis 3 is the depth).
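Here is a rough sketch of such a module built with tf.layers.conv2d(), tf.nn.max_pool(), and tf.concat(); the feature-map counts are illustrative assumptions, not the ones used in GoogLeNet:

def inception_module(inputs):
    # Branch 1: 1x1 convolution
    b1 = tf.layers.conv2d(inputs, filters=64, kernel_size=1, strides=1,
                          padding="SAME", activation=tf.nn.relu)
    # Branch 2: 1x1 bottleneck followed by a 3x3 convolution
    b2 = tf.layers.conv2d(inputs, filters=96, kernel_size=1, strides=1,
                          padding="SAME", activation=tf.nn.relu)
    b2 = tf.layers.conv2d(b2, filters=128, kernel_size=3, strides=1,
                          padding="SAME", activation=tf.nn.relu)
    # Branch 3: 1x1 bottleneck followed by a 5x5 convolution
    b3 = tf.layers.conv2d(inputs, filters=16, kernel_size=1, strides=1,
                          padding="SAME", activation=tf.nn.relu)
    b3 = tf.layers.conv2d(b3, filters=32, kernel_size=5, strides=1,
                          padding="SAME", activation=tf.nn.relu)
    # Branch 4: 3x3 max pooling followed by a 1x1 convolution
    b4 = tf.nn.max_pool(inputs, ksize=[1, 3, 3, 1], strides=[1, 1, 1, 1],
                        padding="SAME")
    b4 = tf.layers.conv2d(b4, filters=32, kernel_size=1, strides=1,
                          padding="SAME", activation=tf.nn.relu)
    # Depth concat layer: stack all the feature maps along the depth dimension
    return tf.concat([b1, b2, b3, b4], axis=3)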
Figure 13-10. Inception module

You may wonder why inception modules have convolutional layers with 1 × 1 kernels. Surely these layers cannot capture any features, since they look at only one pixel at a time? In fact, these layers serve two purposes:

First, they are configured to output many fewer feature maps than their inputs, so they serve as bottleneck layers, meaning they reduce dimensionality. This is particularly useful before the 3 × 3 and 5 × 5 convolutions, since these are very computationally expensive layers.

Second, each pair of convolutional layers ([1 × 1, 3 × 3] and [1 × 1, 5 × 5]) acts like a single, powerful convolutional layer, capable of capturing more complex patterns. Indeed, instead of sweeping a simple linear classifier across the image (as a single convolutional layer does), this pair of convolutional layers sweeps a two-layer neural network across the image.

In short, you can think of the whole inception module as a convolutional layer on steroids, able to output feature maps that capture complex patterns at various scales.

WARNING The number of convolutional kernels for each convolutional layer is a hyperparameter. Unfortunately, this means that you have six more hyperparameters to tweak for every inception layer you add.

Now let's look at the architecture of the GoogLeNet CNN (see Figure 13-11). It is so deep that we had to represent it in three columns, but GoogLeNet is actually one tall stack, including nine inception modules (the boxes with the spinning tops) that actually contain three layers each. The number of feature maps output by each convolutional layer and each pooling layer is shown before the kernel size. The six numbers in the inception modules represent the number of feature maps output by each convolutional layer in the module (in the same order as in Figure 13-10). Note that all the convolutional layers use the ReLU activation function.
Figure 13-11. GoogLeNet architecture

Let's go through this network:

The first two layers divide the image's height and width by 4 (so its area is divided by 16), to reduce the computational load.

Then the local response normalization layer ensures that the previous layers learn a wide variety of features (as discussed earlier).

Two convolutional layers follow, where the first acts like a bottleneck layer. As explained earlier, you can think of this pair as a single smarter convolutional layer.

Again, a local response normalization layer ensures that the previous layers capture a wide variety of patterns.

Next a max pooling layer reduces the image height and width by 2, again to speed up computations.

Then comes the tall stack of nine inception modules, interleaved with a couple of max pooling layers to reduce dimensionality and speed up the net.

Next, the average pooling layer uses a kernel the size of the feature maps with VALID padding, outputting 1 × 1 feature maps: this surprising strategy is called global average pooling. It effectively forces the previous layers to produce feature maps that are actually confidence maps for each target class (since other kinds of features would be destroyed by the averaging step). This makes it unnecessary to have several fully connected layers at the top of the CNN (like in AlexNet), considerably reducing the number of parameters in the network and limiting the risk of overfitting.

The last layers are self-explanatory: dropout for regularization, then a fully connected layer with a softmax activation function to output estimated class probabilities.

This diagram is slightly simplified: the original GoogLeNet architecture also included two auxiliary classifiers plugged on top of the third and sixth inception modules. They were both composed of one average pooling layer, one convolutional layer, two fully connected layers, and a softmax activation layer. During training, their loss (scaled down by 70%) was added to the overall loss. The goal was to fight the vanishing gradients problem and regularize the network. However, it was shown that their effect was relatively minor.
ResNet

Last but not least, the winner of the ILSVRC 2015 challenge was the Residual Network (or ResNet), developed by Kaiming He et al.,12 which delivered an astounding top-5 error rate under 3.6%, using an extremely deep CNN composed of 152 layers. The key to being able to train such a deep network is to use skip connections (also called shortcut connections): the signal feeding into a layer is also added to the output of a layer located a bit higher up the stack. Let's see why this is useful.

When training a neural network, the goal is to make it model a target function h(x). If you add the input x to the output of the network (i.e., you add a skip connection), then the network will be forced to model f(x) = h(x) – x rather than h(x). This is called residual learning (see Figure 13-12).

Figure 13-12. Residual learning

When you initialize a regular neural network, its weights are close to zero, so the network just outputs values close to zero. If you add a skip connection, the resulting network just outputs a copy of its inputs; in other words, it initially models the identity function. If the target function is fairly close to the identity function (which is often the case), this will speed up training considerably.

Moreover, if you add many skip connections, the network can start making progress even if several layers have not started learning yet (see Figure 13-13). Thanks to skip connections, the signal can easily make its way across the whole network. The deep residual network can be seen as a stack of residual units, where each residual unit is a small neural network with a skip connection.

Figure 13-13. Regular deep neural network (left) and deep residual network (right)

Now let's look at ResNet's architecture (see Figure 13-14). It is actually surprisingly simple. It starts and ends exactly like GoogLeNet (except without a dropout layer), and in between is just a very deep stack of simple residual units. Each residual unit is composed of two convolutional layers, with Batch Normalization (BN) and ReLU activation, using 3 × 3 kernels and preserving spatial dimensions (stride 1, SAME padding).
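As an illustration, a minimal sketch of such a residual unit, using tf.layers.conv2d() and tf.layers.batch_normalization(), could look like the following (the 64 feature maps and the training flag are assumptions for the example):

def residual_unit(inputs, training):
    # First convolution + batch normalization + ReLU
    out = tf.layers.conv2d(inputs, filters=64, kernel_size=3, strides=1,
                           padding="SAME", use_bias=False)
    out = tf.layers.batch_normalization(out, training=training)
    out = tf.nn.relu(out)
    # Second convolution + batch normalization
    out = tf.layers.conv2d(out, filters=64, kernel_size=3, strides=1,
                           padding="SAME", use_bias=False)
    out = tf.layers.batch_normalization(out, training=training)
    # Skip connection: add the unit's input to its output, then apply ReLU
    return tf.nn.relu(out + inputs)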
Figure 13-14. ResNet architecture

Note that the number of feature maps is doubled every few residual units, at the same time as their height and width are halved (using a convolutional layer with stride 2). When this happens the inputs cannot be added directly to the outputs of the residual unit since they don't have the same shape (for example, this problem affects the skip connection represented by the dashed arrow in Figure 13-14). To solve this problem, the inputs are passed through a 1 × 1 convolutional layer with stride 2 and the right number of output feature maps (see Figure 13-15).

Figure 13-15. Skip connection when changing feature map size and depth

ResNet-34 is the ResNet with 34 layers (only counting the convolutional layers and the fully connected layer) containing three residual units that output 64 feature maps, 4 RUs with 128 maps, 6 RUs with 256 maps, and 3 RUs with 512 maps.
ResNets deeper than that, such as ResNet-152, use slightly different residual units. Instead of two 3 × 3 convolutional layers with (say) 256 feature maps, they use three convolutional layers: first a 1 × 1 convolutional layer with just 64 feature maps (4 times less), which acts as a bottleneck layer (as discussed already), then a 3 × 3 layer with 64 feature maps, and finally another 1 × 1 convolutional layer with 256 feature maps (4 times 64) that restores the original depth. ResNet-152 contains three such RUs that output 256 maps, then 8 RUs with 512 maps, a whopping 36 RUs with 1,024 maps, and finally 3 RUs with 2,048 maps.
As you can see, the field is moving rapidly, with all sorts of architectures popping out every year. One clear trend is that CNNs keep getting deeper and deeper. They are also getting lighter, requiring fewer and fewer parameters. At present, the ResNet architecture is both the most powerful and arguably the simplest, so it is really the one you should probably use for now, but keep looking at the ILSVRC challenge every year. The 2016 winners were the Trimps-Soushen team from China with an astounding 2.99% error rate. To achieve this they trained combinations of the previous models and joined them into an ensemble. Depending on the task, the reduced error rate may or may not be worth the extra complexity.

There are a few other architectures that you may want to look at, in particular VGGNet13 (runner-up of the ILSVRC 2014 challenge) and Inception-v414 (which merges the ideas of GoogLeNet and ResNet and achieves close to a 3% top-5 error rate on ImageNet classification).

NOTE There is really nothing special about implementing the various CNN architectures we just discussed. We saw earlier how to build all the individual building blocks, so now all you need is to assemble them to create the desired architecture. We will build a complete CNN in the upcoming exercises and you will find full working code in the Jupyter notebooks.
TENSORFLOW CONVOLUTION OPERATIONS

TensorFlow also offers a few other kinds of convolutional layers:

tf.layers.conv1d() creates a convolutional layer for 1D inputs. This is useful, for example, in natural language processing, where a sentence may be represented as a 1D array of words, and the receptive field covers a few neighboring words.

tf.layers.conv3d() creates a convolutional layer for 3D inputs, such as a 3D PET scan.

tf.nn.atrous_conv2d() creates an atrous convolutional layer ("à trous" is French for "with holes"). This is equivalent to using a regular convolutional layer with a filter dilated by inserting rows and columns of zeros (i.e., holes). For example, a 1 × 3 filter equal to [[1, 2, 3]] may be dilated with a dilation rate of 4, resulting in a dilated filter [[1, 0, 0, 0, 2, 0, 0, 0, 3]]. This allows the convolutional layer to have a larger receptive field at no computational price and using no extra parameters.

tf.layers.conv2d_transpose() creates a transpose convolutional layer, sometimes called a deconvolutional layer,15 which upsamples an image. It does so by inserting zeros between the inputs, so you can think of this as a regular convolutional layer using a fractional stride. Upsampling is useful, for example, in image segmentation: in a typical CNN, feature maps get smaller and smaller as you progress through the network, so if you want to output an image of the same size as the input, you need an upsampling layer.

tf.nn.depthwise_conv2d() creates a depthwise convolutional layer that applies every filter to every individual input channel independently. Thus, if there are fn filters and fn′ input channels, then this will output fn × fn′ feature maps.

tf.layers.separable_conv2d() creates a separable convolutional layer that first acts like a depthwise convolutional layer, then applies a 1 × 1 convolutional layer to the resulting feature maps. This makes it possible to apply filters to arbitrary sets of input channels.
Exercises

1. What are the advantages of a CNN over a fully connected DNN for image classification?

2. Consider a CNN composed of three convolutional layers, each with 3 × 3 kernels, a stride of 2, and SAME padding. The lowest layer outputs 100 feature maps, the middle one outputs 200, and the top one outputs 400. The input images are RGB images of 200 × 300 pixels. What is the total number of parameters in the CNN? If we are using 32-bit floats, at least how much RAM will this network require when making a prediction for a single instance? What about when training on a mini-batch of 50 images?

3. If your GPU runs out of memory while training a CNN, what are five things you could try to solve the problem?

4. Why would you want to add a max pooling layer rather than a convolutional layer with the same stride?

5. When would you want to add a local response normalization layer?

6. Can you name the main innovations in AlexNet, compared to LeNet-5? What about the main innovations in GoogLeNet and ResNet?

7. Build your own CNN and try to achieve the highest possible accuracy on MNIST.

8. Classifying large images using Inception v3.

a. Download some images of various animals. Load them in Python, for example using the matplotlib.image.mpimg.imread() function or the scipy.misc.imread() function. Resize and/or crop them to 299 × 299 pixels, and ensure that they have just three channels (RGB), with no transparency channel.

b. Download the latest pretrained Inception v3 model: the checkpoint is available at https://goo.gl/nxSQvl.

c. Create the Inception v3 model by calling the inception_v3() function, as shown below. This must be done within an argument scope created by the inception_v3_arg_scope() function. Also, you must set is_training=False and num_classes=1001 like so:
from tensorflow.contrib.slim.nets import inception
import tensorflow.contrib.slim as slim

X = tf.placeholder(tf.float32, shape=[None, 299, 299, 3], name="X")
with slim.arg_scope(inception.inception_v3_arg_scope()):
    logits, end_points = inception.inception_v3(
        X, num_classes=1001, is_training=False)
predictions = end_points["Predictions"]
saver = tf.train.Saver()
d. Open a session and use the Saver to restore the pretrained model checkpoint you downloaded earlier.

e. Run the model to classify the images you prepared. Display the top five predictions for each image, along with the estimated probability (the list of class names is available at https://goo.gl/brXRtZ). How accurate is the model?

9. Transfer learning for large image classification.

a. Create a training set containing at least 100 images per class. For example, you could classify your own pictures based on the location (beach, mountain, city, etc.), or alternatively you can just use an existing dataset, such as the flowers dataset or MIT's places dataset (requires registration, and it is huge).

b. Write a preprocessing step that will resize and crop the image to 299 × 299, with some randomness for data augmentation.

c. Using the pretrained Inception v3 model from the previous exercise, freeze all layers up to the bottleneck layer (i.e., the last layer before the output layer), and replace the output layer with the appropriate number of outputs for your new classification task (e.g., the flowers dataset has five mutually exclusive classes, so the output layer must have five neurons and use the softmax activation function).

d. Split your dataset into a training set and a test set. Train the model on the training set and evaluate it on the test set.

10. Go through TensorFlow's DeepDream tutorial. It is a fun way to familiarize yourself with various ways of visualizing the patterns learned by a CNN, and to generate art using Deep Learning.

Solutions to these exercises are available in Appendix A.
"Single Unit Activity in Striate Cortex of Unrestrained Cats," D. Hubel and T. Wiesel (1958).
"Receptive Fields of Single Neurones in the Cat's Striate Cortex," D. Hubel and T. Wiesel (1959).
"Receptive Fields and Functional Architecture of Monkey Striate Cortex," D. Hubel and T. Wiesel (1968).
"Neocognitron: A Self-organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position," K. Fukushima (1980).
"Gradient-Based Learning Applied to Document Recognition," Y. LeCun et al. (1998).
A convolution is a mathematical operation that slides one function over another and measures the integral of their pointwise multiplication. It has deep connections with the Fourier transform and the Laplace transform, and is heavily used in signal processing. Convolutional layers actually use cross-correlations, which are very similar to convolutions (see http://goo.gl/HAfxXd for more details).
A fully connected layer with 150 × 100 neurons, each connected to all 150 × 100 × 3 inputs, would have 150 × 100 × 150 × 100 × 3 = 675 million parameters!
1 MB = 1,024 kB = 1,024 × 1,024 bytes = 1,024 × 1,024 × 8 bits.
"ImageNet Classification with Deep Convolutional Neural Networks," A. Krizhevsky et al. (2012).
"Going Deeper with Convolutions," C. Szegedy et al. (2015).
In the 2010 movie Inception, the characters keep going deeper and deeper into multiple layers of dreams, hence the name of these modules.
"Deep Residual Learning for Image Recognition," K. He et al. (2015).
"Very Deep Convolutional Networks for Large-Scale Image Recognition," K. Simonyan and A. Zisserman (2015).
"Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning," C. Szegedy et al. (2016).
This name is quite misleading since this layer does not perform a deconvolution, which is a well-defined mathematical operation (the inverse of a convolution).
Chapter 14. Recurrent Neural Networks
Thebatterhitstheball.Youimmediatelystartrunning,anticipatingtheball’strajectory.Youtrackitandadaptyourmovements,andfinallycatchit(underathunderofapplause).Predictingthefutureiswhatyoudoallthetime,whetheryouarefinishingafriend’ssentenceoranticipatingthesmellofcoffeeatbreakfast.Inthischapter,wearegoingtodiscussrecurrentneuralnetworks(RNN),aclassofnetsthatcanpredictthefuture(well,uptoapoint,ofcourse).Theycananalyzetimeseriesdatasuchasstockprices,andtellyouwhentobuyorsell.Inautonomousdrivingsystems,theycananticipatecartrajectoriesandhelpavoidaccidents.Moregenerally,theycanworkonsequencesofarbitrarylengths,ratherthanonfixed-sizedinputslikeallthenetswehavediscussedsofar.Forexample,theycantakesentences,documents,oraudiosamplesasinput,makingthemextremelyusefulfornaturallanguageprocessing(NLP)systemssuchasautomatictranslation,speech-to-text,orsentimentanalysis(e.g.,readingmoviereviewsandextractingtherater’sfeelingaboutthemovie).
Moreover,RNNs’abilitytoanticipatealsomakesthemcapableofsurprisingcreativity.Youcanaskthemtopredictwhicharethemostlikelynextnotesinamelody,thenrandomlypickoneofthesenotesandplayit.Thenaskthenetforthenextmostlikelynotes,playit,andrepeattheprocessagainandagain.Beforeyouknowit,yournetwillcomposeamelodysuchastheoneproducedbyGoogle’sMagentaproject.Similarly,RNNscangeneratesentences,imagecaptions,andmuchmore.TheresultisnotexactlyShakespeareorMozartyet,butwhoknowswhattheywillproduceafewyearsfromnow?
Inthischapter,wewilllookatthefundamentalconceptsunderlyingRNNs,themainproblemtheyface(namely,vanishing/explodinggradients,discussedinChapter11),andthesolutionswidelyusedtofightit:LSTMandGRUcells.Alongtheway,asalways,wewillshowhowtoimplementRNNsusingTensorFlow.Finally,wewilltakealookatthearchitectureofamachinetranslationsystem.
RecurrentNeuronsUptonowwehavemostlylookedatfeedforwardneuralnetworks,wheretheactivationsflowonlyinonedirection,fromtheinputlayertotheoutputlayer(exceptforafewnetworksinAppendixE).Arecurrentneuralnetworklooksverymuchlikeafeedforwardneuralnetwork,exceptitalsohasconnectionspointingbackward.Let’slookatthesimplestpossibleRNN,composedofjustoneneuronreceivinginputs,producinganoutput,andsendingthatoutputbacktoitself,asshowninFigure14-1(left).Ateachtimestept(alsocalledaframe),thisrecurrentneuronreceivestheinputsx(t)aswellasitsownoutputfromtheprevioustimestep,y(t–1).Wecanrepresentthistinynetworkagainstthetimeaxis,asshowninFigure14-1(right).Thisiscalledunrollingthenetworkthroughtime.
Figure 14-1. A recurrent neuron (left), unrolled through time (right)
Youcaneasilycreatealayerofrecurrentneurons.Ateachtimestept,everyneuronreceivesboththeinputvectorx(t)andtheoutputvectorfromtheprevioustimestepy(t–1),asshowninFigure14-2.Notethatboththeinputsandoutputsarevectorsnow(whentherewasjustasingleneuron,theoutputwasascalar).
Figure 14-2. A layer of recurrent neurons (left), unrolled through time (right)
Eachrecurrentneuronhastwosetsofweights:onefortheinputsx(t)andtheotherfortheoutputsoftheprevioustimestep,y(t–1).Let’scalltheseweightvectorswxandwy.Theoutputofarecurrentlayercanbe
computedprettymuchasyoumightexpect,asshowninEquation14-1(bisthebiastermandϕ(·)istheactivationfunction,e.g.,ReLU1).
Equation 14-1. Output of a recurrent layer for a single instance

$$\mathbf{y}_{(t)} = \phi\left(\mathbf{W}_x^T \cdot \mathbf{x}_{(t)} + \mathbf{W}_y^T \cdot \mathbf{y}_{(t-1)} + \mathbf{b}\right)$$
Justlikeforfeedforwardneuralnetworks,wecancomputearecurrentlayer’soutputinoneshotforawholemini-batchusingavectorizedformofthepreviousequation(seeEquation14-2).
Equation 14-2. Outputs of a layer of recurrent neurons for all instances in a mini-batch

$$\mathbf{Y}_{(t)} = \phi\left(\mathbf{X}_{(t)} \cdot \mathbf{W}_x + \mathbf{Y}_{(t-1)} \cdot \mathbf{W}_y + \mathbf{b}\right) = \phi\left(\left[\mathbf{X}_{(t)} \;\; \mathbf{Y}_{(t-1)}\right] \cdot \mathbf{W} + \mathbf{b}\right) \quad \text{with} \quad \mathbf{W} = \begin{bmatrix}\mathbf{W}_x \\ \mathbf{W}_y\end{bmatrix}$$
Y(t) is an m × n_neurons matrix containing the layer's outputs at time step t for each instance in the mini-batch (m is the number of instances in the mini-batch and n_neurons is the number of neurons).
X(t) is an m × n_inputs matrix containing the inputs for all instances (n_inputs is the number of input features).
Wx is an n_inputs × n_neurons matrix containing the connection weights for the inputs of the current time step.
Wy is an n_neurons × n_neurons matrix containing the connection weights for the outputs of the previous time step.
The weight matrices Wx and Wy are often concatenated into a single weight matrix W of shape (n_inputs + n_neurons) × n_neurons (see the second line of Equation 14-2).
b is a vector of size n_neurons containing each neuron's bias term.
NoticethatY(t)isafunctionofX(t)andY(t–1),whichisafunctionofX(t–1)andY(t–2),whichisafunctionofX(t–2)andY(t–3),andsoon.ThismakesY(t)afunctionofalltheinputssincetimet=0(thatis,X(0),X(1),…,X(t)).Atthefirsttimestep,t=0,therearenopreviousoutputs,sotheyaretypicallyassumedtobeallzeros.
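To make Equation 14-2 concrete, here is a minimal NumPy sketch (an illustration, not code from the book) that computes a recurrent layer's outputs for a small mini-batch over two time steps, using zeros for the "previous" outputs at t = 0:

import numpy as np

def rnn_layer_output(X_t, Y_prev, Wx, Wy, b, phi=np.tanh):
    # Equation 14-2 for one time step: Y(t) = phi(X(t).Wx + Y(t-1).Wy + b)
    return phi(X_t @ Wx + Y_prev @ Wy + b)

m, n_inputs, n_neurons = 4, 3, 5            # 4 instances, 3 input features, 5 neurons
rng = np.random.RandomState(42)
Wx = rng.randn(n_inputs, n_neurons)         # shape [n_inputs, n_neurons]
Wy = rng.randn(n_neurons, n_neurons)        # shape [n_neurons, n_neurons]
b = np.zeros(n_neurons)

X0 = rng.randn(m, n_inputs)                 # inputs at t = 0
X1 = rng.randn(m, n_inputs)                 # inputs at t = 1
Y0 = rnn_layer_output(X0, np.zeros((m, n_neurons)), Wx, Wy, b)  # previous outputs assumed to be zeros
Y1 = rnn_layer_output(X1, Y0, Wx, Wy, b)
print(Y1.shape)                             # (4, 5): one output vector per instance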
MemoryCellsSincetheoutputofarecurrentneuronattimesteptisafunctionofalltheinputsfromprevioustimesteps,youcouldsayithasaformofmemory.Apartofaneuralnetworkthatpreservessomestateacrosstimestepsiscalledamemorycell(orsimplyacell).Asinglerecurrentneuron,oralayerofrecurrentneurons,isaverybasiccell,butlaterinthischapterwewilllookatsomemorecomplexandpowerfultypesofcells.
Ingeneralacell’sstateattimestept,denotedh(t)(the“h”standsfor“hidden”),isafunctionofsomeinputsatthattimestepanditsstateattheprevioustimestep:h(t)=f(h(t–1),x(t)).Itsoutputattimestept,denotedy(t),isalsoafunctionofthepreviousstateandthecurrentinputs.Inthecaseofthebasiccellswehavediscussedsofar,theoutputissimplyequaltothestate,butinmorecomplexcellsthisisnotalwaysthecase,asshowninFigure14-3.
Figure 14-3. A cell's hidden state and its output may be different
InputandOutputSequencesAnRNNcansimultaneouslytakeasequenceofinputsandproduceasequenceofoutputs(seeFigure14-4,top-leftnetwork).Forexample,thistypeofnetworkisusefulforpredictingtimeseriessuchasstockprices:youfeeditthepricesoverthelastNdays,anditmustoutputthepricesshiftedbyonedayintothefuture(i.e.,fromN–1daysagototomorrow).
Alternatively,youcouldfeedthenetworkasequenceofinputs,andignorealloutputsexceptforthelastone(seethetop-rightnetwork).Inotherwords,thisisasequence-to-vectornetwork.Forexample,youcouldfeedthenetworkasequenceofwordscorrespondingtoamoviereview,andthenetworkwouldoutputasentimentscore(e.g.,from–1[hate]to+1[love]).
Conversely,youcouldfeedthenetworkasingleinputatthefirsttimestep(andzerosforallothertimesteps),andletitoutputasequence(seethebottom-leftnetwork).Thisisavector-to-sequencenetwork.Forexample,theinputcouldbeanimage,andtheoutputcouldbeacaptionforthatimage.
Lastly,youcouldhaveasequence-to-vectornetwork,calledanencoder,followedbyavector-to-sequencenetwork,calledadecoder(seethebottom-rightnetwork).Forexample,thiscanbeusedfortranslatingasentencefromonelanguagetoanother.Youwouldfeedthenetworkasentenceinonelanguage,theencoderwouldconvertthissentenceintoasinglevectorrepresentation,andthenthedecoderwoulddecodethisvectorintoasentenceinanotherlanguage.Thistwo-stepmodel,calledanEncoder–Decoder,worksmuchbetterthantryingtotranslateontheflywithasinglesequence-to-sequenceRNN(liketheonerepresentedonthetopleft),sincethelastwordsofasentencecanaffectthefirstwordsofthetranslation,soyouneedtowaituntilyouhaveheardthewholesentencebeforetranslatingit.
Figure 14-4. Seq to seq (top left), seq to vector (top right), vector to seq (bottom left), delayed seq to seq (bottom right)
Sounds promising, so let's start coding!
Basic RNNs in TensorFlow
First, let's implement a very simple RNN model, without using any of TensorFlow's RNN operations, to better understand what goes on under the hood. We will create an RNN composed of a layer of five recurrent neurons (like the RNN represented in Figure 14-2), using the tanh activation function. We will assume that the RNN runs over only two time steps, taking input vectors of size 3 at each time step. The following code builds this RNN, unrolled through two time steps:

n_inputs = 3
n_neurons = 5

X0 = tf.placeholder(tf.float32, [None, n_inputs])
X1 = tf.placeholder(tf.float32, [None, n_inputs])

Wx = tf.Variable(tf.random_normal(shape=[n_inputs, n_neurons], dtype=tf.float32))
Wy = tf.Variable(tf.random_normal(shape=[n_neurons, n_neurons], dtype=tf.float32))
b = tf.Variable(tf.zeros([1, n_neurons], dtype=tf.float32))

Y0 = tf.tanh(tf.matmul(X0, Wx) + b)
Y1 = tf.tanh(tf.matmul(Y0, Wy) + tf.matmul(X1, Wx) + b)

init = tf.global_variables_initializer()
This network looks much like a two-layer feedforward neural network, with a few twists: first, the same weights and bias terms are shared by both layers, and second, we feed inputs at each layer, and we get outputs from each layer. To run the model, we need to feed it the inputs at both time steps, like so:

import numpy as np

# Mini-batch:        instance 0, instance 1, instance 2, instance 3
X0_batch = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 0, 1]])  # t = 0
X1_batch = np.array([[9, 8, 7], [0, 0, 0], [6, 5, 4], [3, 2, 1]])  # t = 1

with tf.Session() as sess:
    init.run()
    Y0_val, Y1_val = sess.run([Y0, Y1], feed_dict={X0: X0_batch, X1: X1_batch})
This mini-batch contains four instances, each with an input sequence composed of exactly two inputs. At the end, Y0_val and Y1_val contain the outputs of the network at both time steps for all neurons and all instances in the mini-batch:

>>> print(Y0_val)  # output at t = 0
[[-0.0664006   0.96257669  0.68105787  0.70918542 -0.89821595]   # instance 0
 [ 0.9977755  -0.71978885 -0.99657625  0.9673925  -0.99989718]   # instance 1
 [ 0.99999774 -0.99898815 -0.99999893  0.99677622 -0.99999988]   # instance 2
 [ 1.         -1.         -1.         -0.99818915  0.99950868]]  # instance 3
>>> print(Y1_val)  # output at t = 1
[[ 1.         -1.         -1.          0.40200216 -1.        ]   # instance 0
 [-0.12210433  0.62805319  0.96718419 -0.99371207 -0.25839335]   # instance 1
 [ 0.99999827 -0.9999994  -0.9999975  -0.85943311 -0.9999879 ]   # instance 2
 [ 0.99928284 -0.99999815 -0.99990582  0.98579615 -0.92205751]]  # instance 3
That wasn't too hard, but of course if you want to be able to run an RNN over 100 time steps, the graph is going to be pretty big. Now let's look at how to create the same model using TensorFlow's RNN operations.
Static Unrolling Through Time
The static_rnn() function creates an unrolled RNN network by chaining cells. The following code creates the exact same model as the previous one:

X0 = tf.placeholder(tf.float32, [None, n_inputs])
X1 = tf.placeholder(tf.float32, [None, n_inputs])

basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
output_seqs, states = tf.contrib.rnn.static_rnn(basic_cell, [X0, X1],
                                                dtype=tf.float32)
Y0, Y1 = output_seqs
Firstwecreatetheinputplaceholders,asbefore.ThenwecreateaBasicRNNCell,whichyoucanthinkofasafactorythatcreatescopiesofthecelltobuildtheunrolledRNN(oneforeachtimestep).Thenwecallstatic_rnn(),givingitthecellfactoryandtheinputtensors,andtellingitthedatatypeoftheinputs(thisisusedtocreatetheinitialstatematrix,whichbydefaultisfullofzeros).Thestatic_rnn()functioncallsthecellfactory’s__call__()functiononceperinput,creatingtwocopiesofthecell(eachcontainingalayeroffiverecurrentneurons),withsharedweightsandbiasterms,anditchainsthemjustlikewedidearlier.Thestatic_rnn()functionreturnstwoobjects.ThefirstisaPythonlistcontainingtheoutputtensorsforeachtimestep.Thesecondisatensorcontainingthefinalstatesofthenetwork.Whenyouareusingbasiccells,thefinalstateissimplyequaltothelastoutput.
Iftherewere50timesteps,itwouldnotbeveryconvenienttohavetodefine50inputplaceholdersand50outputtensors.Moreover,atexecutiontimeyouwouldhavetofeedeachofthe50placeholdersandmanipulatethe50outputs.Let’ssimplifythis.ThefollowingcodebuildsthesameRNNagain,butthistimeittakesasingleinputplaceholderofshape[None,n_steps,n_inputs]wherethefirstdimensionisthemini-batchsize.Thenitextractsthelistofinputsequencesforeachtimestep.X_seqsisaPythonlistofn_stepstensorsofshape[None,n_inputs],whereonceagainthefirstdimensionisthemini-batchsize.Todothis,wefirstswapthefirsttwodimensionsusingthetranspose()function,sothatthetimestepsarenowthefirstdimension.ThenweextractaPythonlistoftensorsalongthefirstdimension(i.e.,onetensorpertimestep)usingtheunstack()function.Thenexttwolinesarethesameasbefore.Finally,wemergealltheoutputtensorsintoasingletensorusingthestack()function,andweswapthefirsttwodimensionstogetafinaloutputstensorofshape[None,n_steps,n_neurons](againthefirstdimensionisthemini-batchsize).
X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
X_seqs = tf.unstack(tf.transpose(X, perm=[1, 0, 2]))

basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
output_seqs, states = tf.contrib.rnn.static_rnn(basic_cell, X_seqs,
                                                dtype=tf.float32)
outputs = tf.transpose(tf.stack(output_seqs), perm=[1, 0, 2])
Now we can run the network by feeding it a single tensor that contains all the mini-batch sequences:

X_batch = np.array([
        # t = 0      t = 1
        [[0, 1, 2], [9, 8, 7]],  # instance 0
        [[3, 4, 5], [0, 0, 0]],  # instance 1
        [[6, 7, 8], [6, 5, 4]],  # instance 2
        [[9, 0, 1], [3, 2, 1]],  # instance 3
    ])

with tf.Session() as sess:
    init.run()
    outputs_val = outputs.eval(feed_dict={X: X_batch})
And we get a single outputs_val tensor for all instances, all time steps, and all neurons:

>>> print(outputs_val)
[[[-0.91279727  0.83698678 -0.89277941  0.80308062 -0.5283336 ]
  [-1.          1.         -0.99794829  0.99985468 -0.99273592]]
 [[-0.99994391  0.99951613 -0.9946925   0.99030769 -0.94413054]
  [ 0.48733309  0.93389565 -0.31362072  0.88573611  0.2424476 ]]
 [[-1.          0.99999875 -0.99975014  0.99956584 -0.99466234]
  [-0.99994856  0.99999434 -0.96058172  0.99784708 -0.9099462 ]]
 [[-0.95972425  0.99951482  0.96938795 -0.969908   -0.67668229]
  [-0.84596014  0.96288228  0.96856463 -0.14777924 -0.9119423 ]]]
However, this approach still builds a graph containing one cell per time step. If there were 50 time steps, the graph would look pretty ugly. It is a bit like writing a program without ever using loops (e.g., Y0=f(0, X0); Y1=f(Y0, X1); Y2=f(Y1, X2); ...; Y50=f(Y49, X50)). With such a large graph, you may even get out-of-memory (OOM) errors during backpropagation (especially with the limited memory of GPU cards), since it must store all tensor values during the forward pass so it can use them to compute gradients during the reverse pass.
Fortunately, there is a better solution: the dynamic_rnn() function.
Dynamic Unrolling Through Time
The dynamic_rnn() function uses a while_loop() operation to run over the cell the appropriate number of times, and you can set swap_memory=True if you want it to swap the GPU's memory to the CPU's memory during backpropagation to avoid OOM errors. Conveniently, it also accepts a single tensor for all inputs at every time step (shape [None, n_steps, n_inputs]) and it outputs a single tensor for all outputs at every time step (shape [None, n_steps, n_neurons]); there is no need to stack, unstack, or transpose. The following code creates the same RNN as earlier using the dynamic_rnn() function. It's so much nicer!

X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])

basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
outputs, states = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32)
NOTE: During backpropagation, the while_loop() operation does the appropriate magic: it stores the tensor values for each iteration during the forward pass so it can use them to compute gradients during the reverse pass.
Handling Variable-Length Input Sequences
So far we have used only fixed-size input sequences (all exactly two steps long). What if the input sequences have variable lengths (e.g., like sentences)? In this case you should set the sequence_length argument when calling the dynamic_rnn() (or static_rnn()) function; it must be a 1D tensor indicating the length of the input sequence for each instance. For example:

seq_length = tf.placeholder(tf.int32, [None])
[...]
outputs, states = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32,
                                    sequence_length=seq_length)
For example, suppose the second input sequence contains only one input instead of two. It must be padded with a zero vector in order to fit in the input tensor X (because the input tensor's second dimension is the size of the longest sequence, i.e., 2).

X_batch = np.array([
        # step 0     step 1
        [[0, 1, 2], [9, 8, 7]],  # instance 0
        [[3, 4, 5], [0, 0, 0]],  # instance 1 (padded with a zero vector)
        [[6, 7, 8], [6, 5, 4]],  # instance 2
        [[9, 0, 1], [3, 2, 1]],  # instance 3
    ])
seq_length_batch = np.array([2, 1, 2, 2])
Of course, you now need to feed values for both placeholders X and seq_length:

with tf.Session() as sess:
    init.run()
    outputs_val, states_val = sess.run(
        [outputs, states], feed_dict={X: X_batch, seq_length: seq_length_batch})
Now the RNN outputs zero vectors for every time step past the input sequence length (look at the second instance's output for the second time step):

>>> print(outputs_val)
[[[-0.68579948 -0.25901747 -0.80249101 -0.18141513 -0.37491536]
  [-0.99996698 -0.94501185  0.98072106 -0.9689762   0.99966913]]  # final state
 [[-0.99099374 -0.64768541 -0.67801034 -0.7415446   0.7719509 ]   # final state
  [ 0.          0.          0.          0.          0.        ]]  # zero vector
 [[-0.99978048 -0.85583007 -0.49696958 -0.93838578  0.98505187]
  [-0.99951065 -0.89148796  0.94170523 -0.38407657  0.97499216]]  # final state
 [[-0.02052618 -0.94588047  0.99935204  0.37283331  0.9998163 ]
  [-0.91052347  0.05769409  0.47446665 -0.44611037  0.89394671]]] # final state
Moreover, the states tensor contains the final state of each cell (excluding the zero vectors):

>>> print(states_val)
[[-0.99996698 -0.94501185  0.98072106 -0.9689762   0.99966913]  # t = 1
 [-0.99099374 -0.64768541 -0.67801034 -0.7415446   0.7719509 ]  # t = 0 !!!
 [-0.99951065 -0.89148796  0.94170523 -0.38407657  0.97499216]  # t = 1
 [-0.91052347  0.05769409  0.47446665 -0.44611037  0.89394671]] # t = 1
Handling Variable-Length Output Sequences
What if the output sequences have variable lengths as well? If you know in advance what length each sequence will have (for example if you know that it will be the same length as the input sequence), then you can set the sequence_length parameter as described above. Unfortunately, in general this will not be possible: for example, the length of a translated sentence is generally different from the length of the input sentence. In this case, the most common solution is to define a special output called an end-of-sequence token (EOS token). Any output past the EOS should be ignored (we will discuss this later in this chapter).
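One simple way to ignore outputs past a known cutoff is to mask the loss, roughly as sketched below. This is an illustration only, not code from this chapter: it assumes a regression-style setup where outputs and y both have shape [batch_size, n_steps, n_outputs], and target_lengths is a hypothetical placeholder holding the number of valid output steps per instance.

# Zero out the per-step loss past each target sequence's length (e.g., past the EOS).
target_lengths = tf.placeholder(tf.int32, [None])
mask = tf.sequence_mask(target_lengths, maxlen=n_steps, dtype=tf.float32)  # [batch, n_steps]
per_step_loss = tf.reduce_mean(tf.square(outputs - y), axis=2)             # [batch, n_steps]
masked_loss = tf.reduce_sum(per_step_loss * mask) / tf.reduce_sum(mask)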
Okay, now you know how to build an RNN network (or more precisely an RNN network unrolled through time). But how do you train it?
Training RNNs
To train an RNN, the trick is to unroll it through time (like we just did) and then simply use regular backpropagation (see Figure 14-5). This strategy is called backpropagation through time (BPTT).
Figure 14-5. Backpropagation through time
Just like in regular backpropagation, there is a first forward pass through the unrolled network (represented by the dashed arrows); then the output sequence is evaluated using a cost function $C\left(\mathbf{Y}_{(t_{\min})}, \mathbf{Y}_{(t_{\min}+1)}, \ldots, \mathbf{Y}_{(t_{\max})}\right)$ (where $t_{\min}$ and $t_{\max}$ are the first and last output time steps, not counting the ignored outputs), and the gradients of that cost function are propagated backward through the unrolled network (represented by the solid arrows); and finally the model parameters are updated using the gradients computed during BPTT. Note that the gradients flow backward through all the outputs used by the cost function, not just through the final output (for example, in Figure 14-5 the cost function is computed using the last three outputs of the network, Y(2), Y(3), and Y(4), so gradients flow through these three outputs, but not through Y(0) and Y(1)). Moreover, since the same parameters W and b are used at each time step, backpropagation will do the right thing and sum over all time steps.
Training a Sequence Classifier
Let's train an RNN to classify MNIST images. A convolutional neural network would be better suited for image classification (see Chapter 13), but this makes for a simple example that you are already familiar with. We will treat each image as a sequence of 28 rows of 28 pixels each (since each MNIST image is 28 × 28 pixels). We will use cells of 150 recurrent neurons, plus a fully connected layer containing 10 neurons (one per class) connected to the output of the last time step, followed by a softmax layer (see Figure 14-6).
Figure 14-6. Sequence classifier
The construction phase is quite straightforward; it's pretty much the same as the MNIST classifier we built in Chapter 10 except that an unrolled RNN replaces the hidden layers. Note that the fully connected layer is connected to the states tensor, which contains only the final state of the RNN (i.e., the 28th output). Also note that y is a placeholder for the target classes.
n_steps = 28
n_inputs = 28
n_neurons = 150
n_outputs = 10

learning_rate = 0.001

X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
y = tf.placeholder(tf.int32, [None])

basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
outputs, states = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32)

logits = tf.layers.dense(states, n_outputs)
xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y,
                                                          logits=logits)
loss = tf.reduce_mean(xentropy)
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(loss)
correct = tf.nn.in_top_k(logits, y, 1)
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

init = tf.global_variables_initializer()
Now let's load the MNIST data and reshape the test data to [batch_size, n_steps, n_inputs] as is expected by the network. We will take care of reshaping the training data in a moment.

from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("/tmp/data/")
X_test = mnist.test.images.reshape((-1, n_steps, n_inputs))
y_test = mnist.test.labels
Now we are ready to train the RNN. The execution phase is exactly the same as for the MNIST classifier in Chapter 10, except that we reshape each training batch before feeding it to the network.

n_epochs = 100
batch_size = 150

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(mnist.train.num_examples // batch_size):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            X_batch = X_batch.reshape((-1, n_steps, n_inputs))
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        acc_train = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
        acc_test = accuracy.eval(feed_dict={X: X_test, y: y_test})
        print(epoch, "Train accuracy:", acc_train, "Test accuracy:", acc_test)
The output should look like this:

0 Train accuracy: 0.94 Test accuracy: 0.9308
1 Train accuracy: 0.933333 Test accuracy: 0.9431
[...]
98 Train accuracy: 0.98 Test accuracy: 0.9794
99 Train accuracy: 1.0 Test accuracy: 0.9804
We get over 98% accuracy, not bad! Plus you would certainly get a better result by tuning the hyperparameters, initializing the RNN weights using He initialization, training longer, or adding a bit of regularization (e.g., dropout).
TIP: You can specify an initializer for the RNN by wrapping its construction code in a variable scope (e.g., use variable_scope("rnn", initializer=variance_scaling_initializer()) to use He initialization).
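As a concrete sketch of this tip (reusing the n_neurons and X defined above):

# Wrap the RNN construction in a variable scope with a He (variance scaling) initializer,
# so the cell's internal variables are created with that initializer.
he_init = tf.contrib.layers.variance_scaling_initializer()
with tf.variable_scope("rnn", initializer=he_init):
    basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
    outputs, states = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32)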
TrainingtoPredictTimeSeriesNowlet’stakealookathowtohandletimeseries,suchasstockprices,airtemperature,brainwavepatterns,andsoon.InthissectionwewilltrainanRNNtopredictthenextvalueinageneratedtimeseries.Eachtraininginstanceisarandomlyselectedsequenceof20consecutivevaluesfromthetimeseries,andthetargetsequenceisthesameastheinputsequence,exceptitisshiftedbyonetimestepintothefuture(seeFigure14-7).
Figure 14-7. Time series (left), and a training instance from that series (right)
First, let's create the RNN. It will contain 100 recurrent neurons and we will unroll it over 20 time steps since each training instance will be 20 inputs long. Each input will contain only one feature (the value at that time). The targets are also sequences of 20 inputs, each containing a single value. The code is almost the same as earlier:

n_steps = 20
n_inputs = 1
n_neurons = 100
n_outputs = 1

X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
y = tf.placeholder(tf.float32, [None, n_steps, n_outputs])

cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons, activation=tf.nn.relu)
outputs, states = tf.nn.dynamic_rnn(cell, X, dtype=tf.float32)
NOTE: In general you would have more than just one input feature. For example, if you were trying to predict stock prices, you would likely have many other input features at each time step, such as prices of competing stocks, ratings from analysts, or any other feature that might help the system make its predictions.
Ateachtimestepwenowhaveanoutputvectorofsize100.Butwhatweactuallywantisasingleoutputvalueateachtimestep.ThesimplestsolutionistowrapthecellinanOutputProjectionWrapper.Acellwrapperactslikeanormalcell,proxyingeverymethodcalltoanunderlyingcell,butitalsoaddssomefunctionality.TheOutputProjectionWrapperaddsafullyconnectedlayeroflinearneurons(i.e.,withoutanyactivationfunction)ontopofeachoutput(butitdoesnotaffectthecellstate).Allthesefullyconnectedlayerssharethesame(trainable)weightsandbiasterms.TheresultingRNNisrepresentedin
Figure 14-8.
Figure 14-8. RNN cells using output projections
Wrapping a cell is quite easy. Let's tweak the preceding code by wrapping the BasicRNNCell into an OutputProjectionWrapper:

cell = tf.contrib.rnn.OutputProjectionWrapper(
    tf.contrib.rnn.BasicRNNCell(num_units=n_neurons, activation=tf.nn.relu),
    output_size=n_outputs)
So far, so good. Now we need to define the cost function. We will use the Mean Squared Error (MSE), as we did in previous regression tasks. Next we will create an Adam optimizer, the training op, and the variable initialization op, as usual:

learning_rate = 0.001

loss = tf.reduce_mean(tf.square(outputs - y))  # MSE
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(loss)

init = tf.global_variables_initializer()
Now on to the execution phase:

n_iterations = 1500
batch_size = 50

with tf.Session() as sess:
    init.run()
    for iteration in range(n_iterations):
        X_batch, y_batch = [...]  # fetch the next training batch
        sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        if iteration % 100 == 0:
            mse = loss.eval(feed_dict={X: X_batch, y: y_batch})
            print(iteration, "\tMSE:", mse)
The program's output should look like this:

0       MSE: 13.6543
100     MSE: 0.538476
200     MSE: 0.168532
300     MSE: 0.0879579
400     MSE: 0.0633425
[...]
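The batch-fetching line in the training loop above is elided. One possible way to fill it in is to generate batches from a synthetic time series; the signal below is an illustrative assumption, not the book's data:

import numpy as np

def time_series(t):
    # an arbitrary synthetic signal, used only for illustration
    return t * np.sin(t) / 3 + 2 * np.sin(t * 5)

def next_batch(batch_size, n_steps):
    t0 = np.random.rand(batch_size, 1) * 30              # random starting points
    Ts = t0 + np.arange(0., n_steps + 1) * 0.1            # n_steps + 1 time points per instance
    ys = time_series(Ts)
    # inputs are the first n_steps values, targets are the same values shifted one step ahead
    return ys[:, :-1].reshape(-1, n_steps, 1), ys[:, 1:].reshape(-1, n_steps, 1)

X_batch, y_batch = next_batch(50, n_steps)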
Once the model is trained, you can make predictions:

X_new = [...]  # New sequences
y_pred = sess.run(outputs, feed_dict={X: X_new})
Figure 14-9 shows the predicted sequence for the instance we looked at earlier (in Figure 14-7), after just 1,000 training iterations.
Figure 14-9. Time series prediction
AlthoughusinganOutputProjectionWrapperisthesimplestsolutiontoreducethedimensionalityoftheRNN’soutputsequencesdowntojustonevaluepertimestep(perinstance),itisnotthemostefficient.Thereisatrickierbutmoreefficientsolution:youcanreshapetheRNNoutputsfrom[batch_size,n_steps,n_neurons]to[batch_size*n_steps,n_neurons],thenapplyasinglefullyconnectedlayerwiththeappropriateoutputsize(inourcasejust1),whichwillresultinanoutputtensorofshape[batch_size*n_steps,n_outputs],andthenreshapethistensorto[batch_size,n_steps,n_outputs].TheseoperationsarerepresentedinFigure14-10.
Figure 14-10. Stack all the outputs, apply the projection, then unstack the result
To implement this solution, we first revert to a basic cell, without the OutputProjectionWrapper:

cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons, activation=tf.nn.relu)
rnn_outputs, states = tf.nn.dynamic_rnn(cell, X, dtype=tf.float32)
Then we stack all the outputs using the reshape() operation, apply the fully connected linear layer (without using any activation function; this is just a projection), and finally unstack all the outputs, again using reshape():

stacked_rnn_outputs = tf.reshape(rnn_outputs, [-1, n_neurons])
stacked_outputs = tf.layers.dense(stacked_rnn_outputs, n_outputs)
outputs = tf.reshape(stacked_outputs, [-1, n_steps, n_outputs])
The rest of the code is the same as earlier. This can provide a significant speed boost since there is just one fully connected layer instead of one per time step.
Creative RNN
Now that we have a model that can predict the future, we can use it to generate some creative sequences, as explained at the beginning of the chapter. All we need is to provide it a seed sequence containing n_steps values (e.g., full of zeros), use the model to predict the next value, append this predicted value to the sequence, feed the last n_steps values to the model to predict the next value, and so on. This process generates a new sequence that has some resemblance to the original time series (see Figure 14-11).

sequence = [0.] * n_steps
for iteration in range(300):
    X_batch = np.array(sequence[-n_steps:]).reshape(1, n_steps, 1)
    y_pred = sess.run(outputs, feed_dict={X: X_batch})
    sequence.append(y_pred[0, -1, 0])
Figure 14-11. Creative sequences, seeded with zeros (left) or with an instance (right)
NowyoucantrytofeedallyourJohnLennonalbumstoanRNNandseeifitcangeneratethenext“Imagine.”However,youwillprobablyneedamuchmorepowerfulRNN,withmoreneurons,andalsomuchdeeper.Let’slookatdeepRNNsnow.
Deep RNNs
It is quite common to stack multiple layers of cells, as shown in Figure 14-12. This gives you a deep RNN.
To implement a deep RNN in TensorFlow, you can create several cells and stack them into a MultiRNNCell. In the following code we stack three identical cells (but you could very well use various kinds of cells with a different number of neurons):

n_neurons = 100
n_layers = 3

layers = [tf.contrib.rnn.BasicRNNCell(num_units=n_neurons,
                                      activation=tf.nn.relu)
          for layer in range(n_layers)]
multi_layer_cell = tf.contrib.rnn.MultiRNNCell(layers)
outputs, states = tf.nn.dynamic_rnn(multi_layer_cell, X, dtype=tf.float32)
Figure 14-12. Deep RNN (left), unrolled through time (right)
That's all there is to it! The states variable is a tuple containing one tensor per layer, each representing the final state of that layer's cell (with shape [batch_size, n_neurons]). If you set state_is_tuple=False when creating the MultiRNNCell, then states becomes a single tensor containing the states from every layer, concatenated along the column axis (i.e., its shape is [batch_size, n_layers * n_neurons]). Note that before TensorFlow 0.11.0, this behavior was the default.
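For example, with the default tuple form you can pick out each layer's final state individually (a small illustration, not from the book):

# `states` is a tuple with one entry per layer; each entry has shape [batch_size, n_neurons]
states_bottom_layer = states[0]    # final state of the first (bottom) layer
states_top_layer = states[-1]      # final state of the last (top) layer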
Distributing a Deep RNN Across Multiple GPUs
Chapter 12 pointed out that we can efficiently distribute deep RNNs across multiple GPUs by pinning each layer to a different GPU (see Figure 12-16). However, if you try to create each cell in a different device() block, it will not work:

with tf.device("/gpu:0"):  # BAD! This is ignored.
    layer1 = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)

with tf.device("/gpu:1"):  # BAD! Ignored again.
    layer2 = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
ThisfailsbecauseaBasicRNNCellisacellfactory,notacellperse(asmentionedearlier);nocellsgetcreatedwhenyoucreatethefactory,andthusnovariablesdoeither.Thedeviceblockissimplyignored.Thecellsactuallygetcreatedlater.Whenyoucalldynamic_rnn(),itcallstheMultiRNNCell,whichcallseachindividualBasicRNNCell,whichcreatetheactualcells(includingtheirvariables).Unfortunately,noneoftheseclassesprovideanywaytocontrolthedevicesonwhichthevariablesgetcreated.Ifyoutrytoputthedynamic_rnn()callwithinadeviceblock,thewholeRNNgetspinnedtoasingledevice.Soareyoustuck?Fortunatelynot!Thetrickistocreateyourowncellwrapper:
import tensorflow as tf

class DeviceCellWrapper(tf.contrib.rnn.RNNCell):
    def __init__(self, device, cell):
        self._cell = cell
        self._device = device

    @property
    def state_size(self):
        return self._cell.state_size

    @property
    def output_size(self):
        return self._cell.output_size

    def __call__(self, inputs, state, scope=None):
        with tf.device(self._device):
            return self._cell(inputs, state, scope)
This wrapper simply proxies every method call to another cell, except it wraps the __call__() function within a device block.2 Now you can distribute each layer on a different GPU:

devices = ["/gpu:0", "/gpu:1", "/gpu:2"]
cells = [DeviceCellWrapper(dev, tf.contrib.rnn.BasicRNNCell(num_units=n_neurons))
         for dev in devices]
multi_layer_cell = tf.contrib.rnn.MultiRNNCell(cells)
outputs, states = tf.nn.dynamic_rnn(multi_layer_cell, X, dtype=tf.float32)
WARNING: Do not set state_is_tuple=False, or the MultiRNNCell will concatenate all the cell states into a single tensor, on a single GPU.
Applying Dropout
If you build a very deep RNN, it may end up overfitting the training set. To prevent that, a common technique is to apply dropout (introduced in Chapter 11). You can simply add a dropout layer before or after the RNN as usual, but if you also want to apply dropout between the RNN layers, you need to use a DropoutWrapper. The following code applies dropout to the inputs of each layer in the RNN, dropping each input with a 50% probability:

keep_prob = 0.5

cells = [tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
         for layer in range(n_layers)]
cells_drop = [tf.contrib.rnn.DropoutWrapper(cell, input_keep_prob=keep_prob)
              for cell in cells]
multi_layer_cell = tf.contrib.rnn.MultiRNNCell(cells_drop)
rnn_outputs, states = tf.nn.dynamic_rnn(multi_layer_cell, X, dtype=tf.float32)
Note that it is also possible to apply dropout to the outputs by setting output_keep_prob.
The main problem with this code is that it will apply dropout not only during training but also during testing, which is not what you want (recall that dropout should be applied only during training). Unfortunately, the DropoutWrapper does not support a training placeholder (yet?), so you must either write your own dropout wrapper class, or have two different graphs: one for training, and the other for testing. The second option looks like this:

import sys
training = (sys.argv[-1] == "train")

X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
y = tf.placeholder(tf.float32, [None, n_steps, n_outputs])
cells = [tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
         for layer in range(n_layers)]
if training:
    cells = [tf.contrib.rnn.DropoutWrapper(cell, input_keep_prob=keep_prob)
             for cell in cells]
multi_layer_cell = tf.contrib.rnn.MultiRNNCell(cells)
rnn_outputs, states = tf.nn.dynamic_rnn(multi_layer_cell, X, dtype=tf.float32)
[...]  # build the rest of the graph

with tf.Session() as sess:
    if training:
        init.run()
        for iteration in range(n_iterations):
            [...]  # train the model
        save_path = saver.save(sess, "/tmp/my_model.ckpt")
    else:
        saver.restore(sess, "/tmp/my_model.ckpt")
        [...]  # use the model
With that you should be able to train all sorts of RNNs! Unfortunately, if you want to train an RNN on long sequences, things will get a bit harder. Let's see why and what you can do about it.
TheDifficultyofTrainingoverManyTimeStepsTotrainanRNNonlongsequences,youwillneedtorunitovermanytimesteps,makingtheunrolledRNNaverydeepnetwork.Justlikeanydeepneuralnetworkitmaysufferfromthevanishing/explodinggradientsproblem(discussedinChapter11)andtakeforevertotrain.ManyofthetrickswediscussedtoalleviatethisproblemcanbeusedfordeepunrolledRNNsaswell:goodparameterinitialization,nonsaturatingactivationfunctions(e.g.,ReLU),BatchNormalization,GradientClipping,andfasteroptimizers.However,iftheRNNneedstohandleevenmoderatelylongsequences(e.g.,100inputs),thentrainingwillstillbeveryslow.
ThesimplestandmostcommonsolutiontothisproblemistounrolltheRNNonlyoveralimitednumberoftimestepsduringtraining.Thisiscalledtruncatedbackpropagationthroughtime.InTensorFlowyoucanimplementitsimplybytruncatingtheinputsequences.Forexample,inthetimeseriespredictionproblem,youwouldsimplyreducen_stepsduringtraining.Theproblem,ofcourse,isthatthemodelwillnotbeabletolearnlong-termpatterns.Oneworkaroundcouldbetomakesurethattheseshortenedsequencescontainbotholdandrecentdata,sothatthemodelcanlearntouseboth(e.g.,thesequencecouldcontainmonthlydataforthelastfivemonths,thenweeklydataforthelastfiveweeks,thendailydataoverthelastfivedays).Butthisworkaroundhasitslimits:whatiffine-graineddatafromlastyearisactuallyuseful?Whatiftherewasabriefbutsignificanteventthatabsolutelymustbetakenintoaccount,evenyearslater(e.g.,theresultofanelection)?
Besidesthelongtrainingtime,asecondproblemfacedbylong-runningRNNsisthefactthatthememoryofthefirstinputsgraduallyfadesaway.Indeed,duetothetransformationsthatthedatagoesthroughwhentraversinganRNN,someinformationislostaftereachtimestep.Afterawhile,theRNN’sstatecontainsvirtuallynotraceofthefirstinputs.Thiscanbeashowstopper.Forexample,sayyouwanttoperformsentimentanalysisonalongreviewthatstartswiththefourwords“Ilovedthismovie,”buttherestofthereviewliststhemanythingsthatcouldhavemadethemovieevenbetter.IftheRNNgraduallyforgetsthefirstfourwords,itwillcompletelymisinterpretthereview.Tosolvethisproblem,varioustypesofcellswithlong-termmemoryhavebeenintroduced.Theyhaveprovedsosuccessfulthatthebasiccellsarenotmuchusedanymore.Let’sfirstlookatthemostpopularoftheselongmemorycells:theLSTMcell.
LSTM Cell
The Long Short-Term Memory (LSTM) cell was proposed in 1997 by Sepp Hochreiter and Jürgen Schmidhuber,3 and it was gradually improved over the years by several researchers, such as Alex Graves, Haşim Sak,4 Wojciech Zaremba,5 and many more. If you consider the LSTM cell as a black box, it can be used very much like a basic cell, except it will perform much better; training will converge faster and it will detect long-term dependencies in the data. In TensorFlow, you can simply use a BasicLSTMCell instead of a BasicRNNCell:

lstm_cell = tf.contrib.rnn.BasicLSTMCell(num_units=n_neurons)
LSTM cells manage two state vectors, and for performance reasons they are kept separate by default. You can change this default behavior by setting state_is_tuple=False when creating the BasicLSTMCell.
So how does an LSTM cell work? The architecture of a basic LSTM cell is shown in Figure 14-13.
Figure 14-13. LSTM cell
Ifyoudon’tlookatwhat’sinsidethebox,theLSTMcelllooksexactlylikearegularcell,exceptthatitsstateissplitintwovectors:h(t)andc(t)(“c”standsfor“cell”).Youcanthinkofh(t)astheshort-termstateandc(t)asthelong-termstate.
Nowlet’sopenthebox!Thekeyideaisthatthenetworkcanlearnwhattostoreinthelong-termstate,whattothrowaway,andwhattoreadfromit.Asthelong-termstatec(t–1)traversesthenetworkfromlefttoright,youcanseethatitfirstgoesthroughaforgetgate,droppingsomememories,andthenitaddssomenewmemoriesviatheadditionoperation(whichaddsthememoriesthatwereselectedbyaninputgate).Theresultc(t)issentstraightout,withoutanyfurthertransformation.So,ateachtimestep,somememoriesaredroppedandsomememoriesareadded.Moreover,aftertheadditionoperation,thelong-termstateiscopiedandpassedthroughthetanhfunction,andthentheresultisfilteredbytheoutputgate.
Thisproducestheshort-termstateh(t)(whichisequaltothecell’soutputforthistimestepy(t)).Nowlet’slookatwherenewmemoriescomefromandhowthegateswork.
First,thecurrentinputvectorx(t)andthepreviousshort-termstateh(t–1)arefedtofourdifferentfullyconnectedlayers.Theyallserveadifferentpurpose:
Themainlayeristheonethatoutputsg(t).Ithastheusualroleofanalyzingthecurrentinputsx(t)andtheprevious(short-term)stateh(t–1).Inabasiccell,thereisnothingelsethanthislayer,anditsoutputgoesstraightouttoy(t)andh(t).Incontrast,inanLSTMcellthislayer’soutputdoesnotgostraightout,butinsteaditispartiallystoredinthelong-termstate.
Thethreeotherlayersaregatecontrollers.Sincetheyusethelogisticactivationfunction,theiroutputsrangefrom0to1.Asyoucansee,theiroutputsarefedtoelement-wisemultiplicationoperations,soiftheyoutput0s,theyclosethegate,andiftheyoutput1s,theyopenit.Specifically:
Theforgetgate(controlledbyf(t))controlswhichpartsofthelong-termstateshouldbeerased.
Theinputgate(controlledbyi(t))controlswhichpartsofg(t)shouldbeaddedtothelong-termstate(thisiswhywesaiditwasonly“partiallystored”).
Finally, the output gate (controlled by o(t)) controls which parts of the long-term state should be read and output at this time step, both to h(t) and to y(t).
Inshort,anLSTMcellcanlearntorecognizeanimportantinput(that’stheroleoftheinputgate),storeitinthelong-termstate,learntopreserveitforaslongasitisneeded(that’stheroleoftheforgetgate),andlearntoextractitwheneveritisneeded.Thisexplainswhytheyhavebeenamazinglysuccessfulatcapturinglong-termpatternsintimeseries,longtexts,audiorecordings,andmore.
Equation 14-3 summarizes how to compute the cell's long-term state, its short-term state, and its output at each time step for a single instance (the equations for a whole mini-batch are very similar).

Equation 14-3. LSTM computations

$$
\begin{aligned}
\mathbf{i}_{(t)} &= \sigma\left(\mathbf{W}_{xi}^T \cdot \mathbf{x}_{(t)} + \mathbf{W}_{hi}^T \cdot \mathbf{h}_{(t-1)} + \mathbf{b}_i\right) \\
\mathbf{f}_{(t)} &= \sigma\left(\mathbf{W}_{xf}^T \cdot \mathbf{x}_{(t)} + \mathbf{W}_{hf}^T \cdot \mathbf{h}_{(t-1)} + \mathbf{b}_f\right) \\
\mathbf{o}_{(t)} &= \sigma\left(\mathbf{W}_{xo}^T \cdot \mathbf{x}_{(t)} + \mathbf{W}_{ho}^T \cdot \mathbf{h}_{(t-1)} + \mathbf{b}_o\right) \\
\mathbf{g}_{(t)} &= \tanh\left(\mathbf{W}_{xg}^T \cdot \mathbf{x}_{(t)} + \mathbf{W}_{hg}^T \cdot \mathbf{h}_{(t-1)} + \mathbf{b}_g\right) \\
\mathbf{c}_{(t)} &= \mathbf{f}_{(t)} \otimes \mathbf{c}_{(t-1)} + \mathbf{i}_{(t)} \otimes \mathbf{g}_{(t)} \\
\mathbf{y}_{(t)} &= \mathbf{h}_{(t)} = \mathbf{o}_{(t)} \otimes \tanh\left(\mathbf{c}_{(t)}\right)
\end{aligned}
$$
Wxi, Wxf, Wxo, and Wxg are the weight matrices of each of the four layers for their connection to the input vector x(t).
Whi, Whf, Who, and Whg are the weight matrices of each of the four layers for their connection to the previous short-term state h(t–1).
bi, bf, bo, and bg are the bias terms for each of the four layers. Note that TensorFlow initializes bf to a vector full of 1s instead of 0s. This prevents forgetting everything at the beginning of training.
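In TensorFlow 1.x, the BasicLSTMCell exposes this trick through its forget_bias argument (default 1.0), which adds a constant to the forget gate's logits; this has the same effect as starting bf at 1s. A short sketch:

# The 1.0 default keeps the forget gate mostly open at the start of training;
# setting it to 0.0 would make the cell start out forgetting much more aggressively.
lstm_cell = tf.contrib.rnn.BasicLSTMCell(num_units=n_neurons, forget_bias=1.0)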
PeepholeConnectionsInabasicLSTMcell,thegatecontrollerscanlookonlyattheinputx(t)andthepreviousshort-termstateh(t–1).Itmaybeagoodideatogivethemabitmorecontextbylettingthempeekatthelong-termstateaswell.ThisideawasproposedbyFelixGersandJürgenSchmidhuberin2000.6TheyproposedanLSTMvariantwithextraconnectionscalledpeepholeconnections:thepreviouslong-termstatec(t–1)isaddedasaninputtothecontrollersoftheforgetgateandtheinputgate,andthecurrentlong-termstatec(t)isaddedasinputtothecontrolleroftheoutputgate.
To implement peephole connections in TensorFlow, you must use the LSTMCell instead of the BasicLSTMCell and set use_peepholes=True:

lstm_cell = tf.contrib.rnn.LSTMCell(num_units=n_neurons, use_peepholes=True)
There are many other variants of the LSTM cell. One particularly popular variant is the GRU cell, which we will look at now.
GRU Cell
The Gated Recurrent Unit (GRU) cell (see Figure 14-14) was proposed by Kyunghyun Cho et al. in a 2014 paper7 that also introduced the Encoder–Decoder network we mentioned earlier.
Figure 14-14. GRU cell
The GRU cell is a simplified version of the LSTM cell, and it seems to perform just as well8 (which explains its growing popularity). The main simplifications are:
Both state vectors are merged into a single vector h(t).
A single gate controller controls both the forget gate and the input gate. If the gate controller outputs a 1, the input gate is open and the forget gate is closed. If it outputs a 0, the opposite happens. In other words, whenever a memory must be stored, the location where it will be stored is erased first. This is actually a frequent variant to the LSTM cell in and of itself.
There is no output gate; the full state vector is output at every time step. However, there is a new gate controller that controls which part of the previous state will be shown to the main layer.
Equation 14-4 summarizes how to compute the cell's state at each time step for a single instance.

Equation 14-4. GRU computations

$$
\begin{aligned}
\mathbf{z}_{(t)} &= \sigma\left(\mathbf{W}_{xz}^T \cdot \mathbf{x}_{(t)} + \mathbf{W}_{hz}^T \cdot \mathbf{h}_{(t-1)}\right) \\
\mathbf{r}_{(t)} &= \sigma\left(\mathbf{W}_{xr}^T \cdot \mathbf{x}_{(t)} + \mathbf{W}_{hr}^T \cdot \mathbf{h}_{(t-1)}\right) \\
\mathbf{g}_{(t)} &= \tanh\left(\mathbf{W}_{xg}^T \cdot \mathbf{x}_{(t)} + \mathbf{W}_{hg}^T \cdot \left(\mathbf{r}_{(t)} \otimes \mathbf{h}_{(t-1)}\right)\right) \\
\mathbf{h}_{(t)} &= \left(1 - \mathbf{z}_{(t)}\right) \otimes \mathbf{h}_{(t-1)} + \mathbf{z}_{(t)} \otimes \mathbf{g}_{(t)}
\end{aligned}
$$
Creating a GRU cell in TensorFlow is trivial:

gru_cell = tf.contrib.rnn.GRUCell(num_units=n_neurons)
LSTM or GRU cells are one of the main reasons behind the success of RNNs in recent years, in particular for applications in natural language processing (NLP).
Natural Language Processing
Most of the state-of-the-art NLP applications, such as machine translation, automatic summarization, parsing, sentiment analysis, and more, are now based (at least in part) on RNNs. In this last section, we will take a quick look at what a machine translation model looks like. This topic is very well covered by TensorFlow's awesome Word2Vec and Seq2Seq tutorials, so you should definitely check them out.
WordEmbeddingsBeforewestart,weneedtochooseawordrepresentation.Oneoptioncouldbetorepresenteachwordusingaone-hotvector.Supposeyourvocabularycontains50,000words,thenthenthwordwouldberepresentedasa50,000-dimensionalvector,fullof0sexceptfora1atthenthposition.However,withsuchalargevocabulary,thissparserepresentationwouldnotbeefficientatall.Ideally,youwantsimilarwordstohavesimilarrepresentations,makingiteasyforthemodeltogeneralizewhatitlearnsaboutawordtoallsimilarwords.Forexample,ifthemodelistoldthat“Idrinkmilk”isavalidsentence,andifitknowsthat“milk”iscloseto“water”butfarfrom“shoes,”thenitwillknowthat“Idrinkwater”isprobablyavalidsentenceaswell,while“Idrinkshoes”isprobablynot.Buthowcanyoucomeupwithsuchameaningfulrepresentation?
Themostcommonsolutionistorepresenteachwordinthevocabularyusingafairlysmallanddensevector(e.g.,150dimensions),calledanembedding,andjustlettheneuralnetworklearnagoodembeddingforeachwordduringtraining.Atthebeginningoftraining,embeddingsaresimplychosenrandomly,butduringtraining,backpropagationautomaticallymovestheembeddingsaroundinawaythathelpstheneuralnetworkperformitstask.Typicallythismeansthatsimilarwordswillgraduallyclusterclosetooneanother,andevenenduporganizedinarathermeaningfulway.Forexample,embeddingsmayendupplacedalongvariousaxesthatrepresentgender,singular/plural,adjective/noun,andsoon.Theresultcanbetrulyamazing.9
In TensorFlow, you first need to create the variable representing the embeddings for every word in your vocabulary (initialized randomly):

vocabulary_size = 50000
embedding_size = 150

init_embeds = tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0)
embeddings = tf.Variable(init_embeds)
Now suppose you want to feed the sentence "I drink milk" to your neural network. You should first preprocess the sentence and break it into a list of known words. For example you may remove unnecessary characters, replace unknown words by a predefined token word such as "[UNK]", replace numerical values by "[NUM]", replace URLs by "[URL]", and so on. Once you have a list of known words, you can look up each word's integer identifier (from 0 to 49999) in a dictionary, for example [72, 3335, 288]. At that point, you are ready to feed these word identifiers to TensorFlow using a placeholder, and apply the embedding_lookup() function to get the corresponding embeddings:

train_inputs = tf.placeholder(tf.int32, shape=[None])     # from ids...
embed = tf.nn.embedding_lookup(embeddings, train_inputs)  # ...to embeddings
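For example, a minimal sketch of the lookup-and-feed step might look like the following; the dictionary and the word ids are made up for illustration (only [72, 3335, 288] comes from the text above):

# Map words to integer identifiers, then fetch their embeddings from the graph.
word_to_id = {"[UNK]": 0, "i": 72, "drink": 3335, "milk": 288}  # ...up to 50,000 entries
sentence = "I drink milk"
ids = [word_to_id.get(word.lower(), word_to_id["[UNK]"]) for word in sentence.split()]

with tf.Session() as sess:
    tf.global_variables_initializer().run()
    embed_val = sess.run(embed, feed_dict={train_inputs: ids})
    print(embed_val.shape)  # (3, 150): one 150-dimensional embedding per word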
Once your model has learned good word embeddings, they can actually be reused fairly efficiently in any NLP application: after all, "milk" is still close to "water" and far from "shoes" no matter what your application is. In fact, instead of training your own word embeddings, you may want to download pretrained word embeddings. Just like when reusing pretrained layers (see Chapter 11), you can choose to freeze the pretrained embeddings (e.g., creating the embeddings variable using trainable=False) or let backpropagation tweak them for your application. The first option will speed up training, but the second may lead to slightly higher performance.
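A quick sketch of these two options (pretrained_embeddings is assumed to be a NumPy array of shape [vocabulary_size, embedding_size] loaded elsewhere; it is not defined in this chapter):

# Option 1: freeze the pretrained embeddings (faster training)
embeddings = tf.Variable(pretrained_embeddings, dtype=tf.float32, trainable=False)
# Option 2: let backpropagation fine-tune them (possibly slightly better performance)
embeddings = tf.Variable(pretrained_embeddings, dtype=tf.float32, trainable=True)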
TIP: Embeddings are also useful for representing categorical attributes that can take on a large number of different values, especially when there are complex similarities between values. For example, consider professions, hobbies, dishes, species, brands, and so on.
You now have almost all the tools you need to implement a machine translation system. Let's look at this now.
An Encoder–Decoder Network for Machine Translation
Let's take a look at a simple machine translation model10 that will translate English sentences to French (see Figure 14-15).
Figure 14-15. A simple machine translation model
TheEnglishsentencesarefedtotheencoder,andthedecoderoutputstheFrenchtranslations.NotethattheFrenchtranslationsarealsousedasinputstothedecoder,butpushedbackbyonestep.Inotherwords,thedecoderisgivenasinputthewordthatitshouldhaveoutputatthepreviousstep(regardlessofwhatitactuallyoutput).Fortheveryfirstword,itisgivenatokenthatrepresentsthebeginningofthesentence(e.g.,“<go>”).Thedecoderisexpectedtoendthesentencewithanend-of-sequence(EOS)token(e.g.,“<eos>”).
NotethattheEnglishsentencesarereversedbeforetheyarefedtotheencoder.Forexample“Idrinkmilk”isreversedto“milkdrinkI.”ThisensuresthatthebeginningoftheEnglishsentencewillbefedlasttotheencoder,whichisusefulbecausethat’sgenerallythefirstthingthatthedecoderneedstotranslate.
Eachwordisinitiallyrepresentedbyasimpleintegeridentifier(e.g.,288fortheword“milk”).Next,anembeddinglookupreturnsthewordembedding(asexplainedearlier,thisisadense,fairlylow-dimensionalvector).Thesewordembeddingsarewhatisactuallyfedtotheencoderandthedecoder.
At each step, the decoder outputs a score for each word in the output vocabulary (i.e., French), and then the Softmax layer turns these scores into probabilities. For example, at the first step the word "Je" may have a probability of 20%, "Tu" may have a probability of 1%, and so on. The word with the highest probability is output. This is very much like a regular classification task, so you can train the model using the softmax_cross_entropy_with_logits() function.
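To make the classification analogy concrete, here is a rough sketch of such a loss, using the sparse variant of the cross-entropy function since the targets are integer word ids (decoder_logits, target_ids, and learning_rate are assumed names, not code from the tutorial):

# decoder_logits: [batch_size, n_steps, vocab_size]; target_ids: [batch_size, n_steps]
xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=target_ids,
                                                          logits=decoder_logits)
loss = tf.reduce_mean(xentropy)
training_op = tf.train.AdamOptimizer(learning_rate).minimize(loss)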
Notethatatinferencetime(aftertraining),youwillnothavethetargetsentencetofeedtothedecoder.
Instead,simplyfeedthedecoderthewordthatitoutputatthepreviousstep,asshowninFigure14-16(thiswillrequireanembeddinglookupthatisnotshownonthediagram).
Figure 14-16. Feeding the previous output word as input at inference time
Okay,nowyouhavethebigpicture.However,ifyougothroughTensorFlow’ssequence-to-sequencetutorialandyoulookatthecodeinrnn/translate/seq2seq_model.py(intheTensorFlowmodels),youwillnoticeafewimportantdifferences:
First,sofarwehaveassumedthatallinputsequences(totheencoderandtothedecoder)haveaconstantlength.Butobviouslysentencelengthsmayvary.Thereareseveralwaysthatthiscanbehandled—forexample,usingthesequence_lengthargumenttothestatic_rnn()ordynamic_rnn()functionstospecifyeachsentence’slength(asdiscussedearlier).However,anotherapproachisusedinthetutorial(presumablyforperformancereasons):sentencesaregroupedintobucketsofsimilarlengths(e.g.,abucketforthe1-to6-wordsentences,anotherforthe7-to12-wordsentences,andsoon11),andtheshortersentencesarepaddedusingaspecialpaddingtoken(e.g.,“<pad>”).Forexample“Idrinkmilk”becomes“<pad><pad><pad>milkdrinkI”,anditstranslationbecomes“Jeboisdulait<eos><pad>”.Ofcourse,wewanttoignoreanyoutputpasttheEOStoken.Forthis,thetutorial’simplementationusesatarget_weightsvector.Forexample,forthetargetsentence“Jeboisdulait<eos><pad>”,theweightswouldbesetto[1.0,1.0,1.0,1.0,1.0,0.0](noticetheweight0.0thatcorrespondstothepaddingtokeninthetargetsentence).SimplymultiplyingthelossesbythetargetweightswillzerooutthelossesthatcorrespondtowordspastEOStokens.
Second,whentheoutputvocabularyislarge(whichisthecasehere),outputtingaprobabilityforeachandeverypossiblewordwouldbeterriblyslow.Ifthetargetvocabularycontains,say,50,000Frenchwords,thenthedecoderwouldoutput50,000-dimensionalvectors,andthencomputingthesoftmaxfunctionoversuchalargevectorwouldbeverycomputationallyintensive.Toavoidthis,onesolutionistoletthedecoderoutputmuchsmallervectors,suchas1,000-dimensionalvectors,thenuseasamplingtechniquetoestimatethelosswithouthavingtocomputeitovereverysinglewordinthetargetvocabulary.ThisSampledSoftmaxtechniquewasintroducedin2015bySébastienJeanetal.12InTensorFlowyoucanusethesampled_softmax_loss()function.
Third,thetutorial’simplementationusesanattentionmechanismthatletsthedecoderpeekintotheinputsequence.AttentionaugmentedRNNsarebeyondthescopeofthisbook,butifyouare
interestedtherearehelpfulpapersaboutmachinetranslation,13machinereading,14andimagecaptions15usingattention.
Finally,thetutorial’simplementationmakesuseofthetf.nn.legacy_seq2seqmodule,whichprovidestoolstobuildvariousEncoder–Decodermodelseasily.Forexample,theembedding_rnn_seq2seq()functioncreatesasimpleEncoder–Decodermodelthatautomaticallytakescareofwordembeddingsforyou,justliketheonerepresentedinFigure14-15.Thiscodewilllikelybeupdatedquicklytousethenewtf.nn.seq2seqmodule.
Younowhaveallthetoolsyouneedtounderstandthesequence-to-sequencetutorial’simplementation.CheckitoutandtrainyourownEnglish-to-Frenchtranslator!
Exercises
1. Can you think of a few applications for a sequence-to-sequence RNN? What about a sequence-to-vector RNN? And a vector-to-sequence RNN?
2. Why do people use encoder–decoder RNNs rather than plain sequence-to-sequence RNNs for automatic translation?
3. How could you combine a convolutional neural network with an RNN to classify videos?
4. What are the advantages of building an RNN using dynamic_rnn() rather than static_rnn()?
5. How can you deal with variable-length input sequences? What about variable-length output sequences?
6. What is a common way to distribute training and execution of a deep RNN across multiple GPUs?
7. Embedded Reber grammars were used by Hochreiter and Schmidhuber in their paper about LSTMs. They are artificial grammars that produce strings such as "BPBTSXXVPSEPE." Check out Jenny Orr's nice introduction to this topic. Choose a particular embedded Reber grammar (such as the one represented on Jenny Orr's page), then train an RNN to identify whether a string respects that grammar or not. You will first need to write a function capable of generating a training batch containing about 50% strings that respect the grammar, and 50% that don't.
8. Tackle the "How much did it rain? II" Kaggle competition. This is a time series prediction task: you are given snapshots of polarimetric radar values and asked to predict the hourly rain gauge total. Luis Andre Dutra e Silva's interview gives some interesting insights into the techniques he used to reach second place in the competition. In particular, he used an RNN composed of two LSTM layers.
9. Go through TensorFlow's Word2Vec tutorial to create word embeddings, and then go through the Seq2Seq tutorial to train an English-to-French translation system.
Solutions to these exercises are available in Appendix A.
Note that many researchers prefer to use the hyperbolic tangent (tanh) activation function in RNNs rather than the ReLU activation function. For example, take a look at Vu Pham et al.'s paper "Dropout Improves Recurrent Neural Networks for Handwriting Recognition". However, ReLU-based RNNs are also possible, as shown in Quoc V. Le et al.'s paper "A Simple Way to Initialize Recurrent Networks of Rectified Linear Units".
This uses the decorator design pattern.
"Long Short-Term Memory," S. Hochreiter and J. Schmidhuber (1997).
"Long Short-Term Memory Recurrent Neural Network Architectures for Large Scale Acoustic Modeling," H. Sak et al. (2014).
"Recurrent Neural Network Regularization," W. Zaremba et al. (2015).
"Recurrent Nets that Time and Count," F. Gers and J. Schmidhuber (2000).
"Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation," K. Cho et al. (2014).
A 2015 paper by Klaus Greff et al., "LSTM: A Search Space Odyssey," seems to show that all LSTM variants perform roughly the same.
For more details, check out Christopher Olah's great post, or Sebastian Ruder's series of posts.
"Sequence to Sequence Learning with Neural Networks," I. Sutskever et al. (2014).
The bucket sizes used in the tutorial are different.
"On Using Very Large Target Vocabulary for Neural Machine Translation," S. Jean et al. (2015).
"Neural Machine Translation by Jointly Learning to Align and Translate," D. Bahdanau et al. (2014).
"Long Short-Term Memory-Networks for Machine Reading," J. Cheng et al. (2016).
"Show, Attend and Tell: Neural Image Caption Generation with Visual Attention," K. Xu et al. (2015).
Chapter 15. Autoencoders
Autoencodersareartificialneuralnetworkscapableoflearningefficientrepresentationsoftheinputdata,calledcodings,withoutanysupervision(i.e.,thetrainingsetisunlabeled).Thesecodingstypicallyhaveamuchlowerdimensionalitythantheinputdata,makingautoencodersusefulfordimensionalityreduction(seeChapter8).Moreimportantly,autoencodersactaspowerfulfeaturedetectors,andtheycanbeusedforunsupervisedpretrainingofdeepneuralnetworks(aswediscussedinChapter11).Lastly,theyarecapableofrandomlygeneratingnewdatathatlooksverysimilartothetrainingdata;thisiscalledagenerativemodel.Forexample,youcouldtrainanautoencoderonpicturesoffaces,anditwouldthenbeabletogeneratenewfaces.
Surprisingly,autoencodersworkbysimplylearningtocopytheirinputstotheiroutputs.Thismaysoundlikeatrivialtask,butwewillseethatconstrainingthenetworkinvariouswayscanmakeitratherdifficult.Forexample,youcanlimitthesizeoftheinternalrepresentation,oryoucanaddnoisetotheinputsandtrainthenetworktorecovertheoriginalinputs.Theseconstraintspreventtheautoencoderfromtriviallycopyingtheinputsdirectlytotheoutputs,whichforcesittolearnefficientwaysofrepresentingthedata.Inshort,thecodingsarebyproductsoftheautoencoder’sattempttolearntheidentityfunctionundersomeconstraints.
Inthischapterwewillexplaininmoredepthhowautoencoderswork,whattypesofconstraintscanbeimposed,andhowtoimplementthemusingTensorFlow,whetheritisfordimensionalityreduction,featureextraction,unsupervisedpretraining,orasgenerativemodels.
Efficient Data Representations
Which of the following number sequences do you find the easiest to memorize?
40,27,25,36,81,57,10,73,19,68
50,25,76,38,19,58,29,88,44,22,11,34,17,52,26,13,40,20
Atfirstglance,itwouldseemthatthefirstsequenceshouldbeeasier,sinceitismuchshorter.However,ifyoulookcarefullyatthesecondsequence,youmaynoticethatitfollowstwosimplerules:evennumbersarefollowedbytheirhalf,andoddnumbersarefollowedbytheirtripleplusone(thisisafamoussequenceknownasthehailstonesequence).Onceyounoticethispattern,thesecondsequencebecomesmucheasiertomemorizethanthefirstbecauseyouonlyneedtomemorizethetworules,thefirstnumber,andthelengthofthesequence.Notethatifyoucouldquicklyandeasilymemorizeverylongsequences,youwouldnotcaremuchabouttheexistenceofapatterninthesecondsequence.Youwouldjustlearneverynumberbyheart,andthatwouldbethat.Itisthefactthatitishardtomemorizelongsequencesthatmakesitusefultorecognizepatterns,andhopefullythisclarifieswhyconstraininganautoencoderduringtrainingpushesittodiscoverandexploitpatternsinthedata.
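As a quick illustration (not from the book), the two rules behind the second sequence are easy to turn into a few lines of Python:

def hailstone(first, length):
    # even numbers are followed by their half, odd numbers by their triple plus one
    seq = [first]
    while len(seq) < length:
        n = seq[-1]
        seq.append(n // 2 if n % 2 == 0 else 3 * n + 1)
    return seq

print(hailstone(50, 18))
# [50, 25, 76, 38, 19, 58, 29, 88, 44, 22, 11, 34, 17, 52, 26, 13, 40, 20]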
Therelationshipbetweenmemory,perception,andpatternmatchingwasfamouslystudiedbyWilliamChaseandHerbertSimonintheearly1970s.1Theyobservedthatexpertchessplayerswereabletomemorizethepositionsofallthepiecesinagamebylookingattheboardforjust5seconds,ataskthatmostpeoplewouldfindimpossible.However,thiswasonlythecasewhenthepieceswereplacedinrealisticpositions(fromactualgames),notwhenthepieceswereplacedrandomly.Chessexpertsdon’thaveamuchbettermemorythanyouandI,theyjustseechesspatternsmoreeasilythankstotheirexperiencewiththegame.Noticingpatternshelpsthemstoreinformationefficiently.
Justlikethechessplayersinthismemoryexperiment,anautoencoderlooksattheinputs,convertsthemtoanefficientinternalrepresentation,andthenspitsoutsomethingthat(hopefully)looksveryclosetotheinputs.Anautoencoderisalwayscomposedoftwoparts:anencoder(orrecognitionnetwork)thatconvertstheinputstoaninternalrepresentation,followedbyadecoder(orgenerativenetwork)thatconvertstheinternalrepresentationtotheoutputs(seeFigure15-1).
Asyoucansee,anautoencodertypicallyhasthesamearchitectureasaMulti-LayerPerceptron(MLP;seeChapter10),exceptthatthenumberofneuronsintheoutputlayermustbeequaltothenumberofinputs.Inthisexample,thereisjustonehiddenlayercomposedoftwoneurons(theencoder),andoneoutputlayercomposedofthreeneurons(thedecoder).Theoutputsareoftencalledthereconstructionssincetheautoencodertriestoreconstructtheinputs,andthecostfunctioncontainsareconstructionlossthatpenalizesthemodelwhenthereconstructionsaredifferentfromtheinputs.
Figure 15-1. The chess memory experiment (left) and a simple autoencoder (right)
Becausetheinternalrepresentationhasalowerdimensionalitythantheinputdata(itis2Dinsteadof3D),theautoencoderissaidtobeundercomplete.Anundercompleteautoencodercannottriviallycopyitsinputstothecodings,yetitmustfindawaytooutputacopyofitsinputs.Itisforcedtolearnthemostimportantfeaturesintheinputdata(anddroptheunimportantones).
Let's see how to implement a very simple undercomplete autoencoder for dimensionality reduction.
Performing PCA with an Undercomplete Linear Autoencoder
If the autoencoder uses only linear activations and the cost function is the Mean Squared Error (MSE), then it can be shown that it ends up performing Principal Component Analysis (see Chapter 8).
The following code builds a simple linear autoencoder to perform PCA on a 3D dataset, projecting it to 2D:
import tensorflow as tf

n_inputs = 3   # 3D inputs
n_hidden = 2   # 2D codings
n_outputs = n_inputs

learning_rate = 0.01

X = tf.placeholder(tf.float32, shape=[None, n_inputs])
hidden = tf.layers.dense(X, n_hidden)
outputs = tf.layers.dense(hidden, n_outputs)

reconstruction_loss = tf.reduce_mean(tf.square(outputs - X))  # MSE

optimizer = tf.train.AdamOptimizer(learning_rate)
training_op = optimizer.minimize(reconstruction_loss)

init = tf.global_variables_initializer()
This code is really not very different from all the MLPs we built in past chapters. The two things to note are:
The number of outputs is equal to the number of inputs.
To perform simple PCA, we do not use any activation function (i.e., all neurons are linear) and the cost function is the MSE. We will see more complex autoencoders shortly.
Nowlet’sloadthedataset,trainthemodelonthetrainingset,anduseittoencodethetestset(i.e.,projectitto2D):
X_train, X_test = [...]  # load the dataset

n_iterations = 1000
codings = hidden  # the output of the hidden layer provides the codings

with tf.Session() as sess:
    init.run()
    for iteration in range(n_iterations):
        training_op.run(feed_dict={X: X_train})  # no labels (unsupervised)
    codings_val = codings.eval(feed_dict={X: X_test})
Figure15-2showstheoriginal3Ddataset(attheleft)andtheoutputoftheautoencoder’shiddenlayer(i.e.,thecodinglayer,attheright).Asyoucansee,theautoencoderfoundthebest2Dplanetoprojectthedataonto,preservingasmuchvarianceinthedataasitcould(justlikePCA).
Figure15-2.PCAperformedbyanundercompletelinearautoencoder
StackedAutoencodersJustlikeotherneuralnetworkswehavediscussed,autoencoderscanhavemultiplehiddenlayers.Inthiscasetheyarecalledstackedautoencoders(ordeepautoencoders).Addingmorelayershelpstheautoencoderlearnmorecomplexcodings.However,onemustbecarefulnottomaketheautoencodertoopowerful.Imagineanencodersopowerfulthatitjustlearnstomapeachinputtoasinglearbitrarynumber(andthedecoderlearnsthereversemapping).Obviouslysuchanautoencoderwillreconstructthetrainingdataperfectly,butitwillnothavelearnedanyusefuldatarepresentationintheprocess(anditisunlikelytogeneralizewelltonewinstances).
Thearchitectureofastackedautoencoderistypicallysymmetricalwithregardstothecentralhiddenlayer(thecodinglayer).Toputitsimply,itlookslikeasandwich.Forexample,anautoencoderforMNIST(introducedinChapter3)mayhave784inputs,followedbyahiddenlayerwith300neurons,thenacentralhiddenlayerof150neurons,thenanotherhiddenlayerwith300neurons,andanoutputlayerwith784neurons.ThisstackedautoencoderisrepresentedinFigure15-3.
Figure15-3.Stackedautoencoder
TensorFlow Implementation
You can implement a stacked autoencoder very much like a regular deep MLP. In particular, the same techniques we used in Chapter 11 for training deep nets can be applied. For example, the following code builds a stacked autoencoder for MNIST, using He initialization, the ELU activation function, and ℓ2 regularization. The code should look very familiar, except that there are no labels (no y):
from functools import partial

n_inputs = 28 * 28  # for MNIST
n_hidden1 = 300
n_hidden2 = 150  # codings
n_hidden3 = n_hidden1
n_outputs = n_inputs

learning_rate = 0.01
l2_reg = 0.0001

X = tf.placeholder(tf.float32, shape=[None, n_inputs])

he_init = tf.contrib.layers.variance_scaling_initializer()
l2_regularizer = tf.contrib.layers.l2_regularizer(l2_reg)
my_dense_layer = partial(tf.layers.dense,
                         activation=tf.nn.elu,
                         kernel_initializer=he_init,
                         kernel_regularizer=l2_regularizer)

hidden1 = my_dense_layer(X, n_hidden1)
hidden2 = my_dense_layer(hidden1, n_hidden2)  # codings
hidden3 = my_dense_layer(hidden2, n_hidden3)
outputs = my_dense_layer(hidden3, n_outputs, activation=None)

reconstruction_loss = tf.reduce_mean(tf.square(outputs - X))  # MSE

reg_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
loss = tf.add_n([reconstruction_loss] + reg_losses)

optimizer = tf.train.AdamOptimizer(learning_rate)
training_op = optimizer.minimize(loss)

init = tf.global_variables_initializer()
You can then train the model normally. Note that the digit labels (y_batch) are unused:

n_epochs = 5
batch_size = 150

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        n_batches = mnist.train.num_examples // batch_size
        for iteration in range(n_batches):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run(training_op, feed_dict={X: X_batch})
Tying Weights
When an autoencoder is neatly symmetrical, like the one we just built, a common technique is to tie the weights of the decoder layers to the weights of the encoder layers. This halves the number of weights in the model, speeding up training and limiting the risk of overfitting. Specifically, if the autoencoder has a total of N layers (not counting the input layer), and W_L represents the connection weights of the Lth layer (e.g., layer 1 is the first hidden layer, layer N/2 is the coding layer, and layer N is the output layer), then the decoder layer weights can be defined simply as W_{N-L+1} = W_L^T (with L = 1, 2, ..., N/2).

Unfortunately, implementing tied weights in TensorFlow using the dense() function is a bit cumbersome; it's actually easier to just define the layers manually. The code ends up significantly more verbose:
activation = tf.nn.elu
regularizer = tf.contrib.layers.l2_regularizer(l2_reg)
initializer = tf.contrib.layers.variance_scaling_initializer()

X = tf.placeholder(tf.float32, shape=[None, n_inputs])

weights1_init = initializer([n_inputs, n_hidden1])
weights2_init = initializer([n_hidden1, n_hidden2])

weights1 = tf.Variable(weights1_init, dtype=tf.float32, name="weights1")
weights2 = tf.Variable(weights2_init, dtype=tf.float32, name="weights2")
weights3 = tf.transpose(weights2, name="weights3")  # tied weights
weights4 = tf.transpose(weights1, name="weights4")  # tied weights

biases1 = tf.Variable(tf.zeros(n_hidden1), name="biases1")
biases2 = tf.Variable(tf.zeros(n_hidden2), name="biases2")
biases3 = tf.Variable(tf.zeros(n_hidden3), name="biases3")
biases4 = tf.Variable(tf.zeros(n_outputs), name="biases4")

hidden1 = activation(tf.matmul(X, weights1) + biases1)
hidden2 = activation(tf.matmul(hidden1, weights2) + biases2)
hidden3 = activation(tf.matmul(hidden2, weights3) + biases3)
outputs = tf.matmul(hidden3, weights4) + biases4

reconstruction_loss = tf.reduce_mean(tf.square(outputs - X))
reg_loss = regularizer(weights1) + regularizer(weights2)
loss = reconstruction_loss + reg_loss

optimizer = tf.train.AdamOptimizer(learning_rate)
training_op = optimizer.minimize(loss)

init = tf.global_variables_initializer()
This code is fairly straightforward, but there are a few important things to note:
First, weights3 and weights4 are not variables; they are respectively the transpose of weights2 and weights1 (they are "tied" to them).
Second, since they are not variables, there is no point regularizing them: we only regularize weights1 and weights2.
Third, biases are never tied, and never regularized.
Training One Autoencoder at a Time
Rather than training the whole stacked autoencoder in one go like we just did, it is often much faster to train one shallow autoencoder at a time, then stack all of them into a single stacked autoencoder (hence the name), as shown in Figure 15-4. This is especially useful for very deep autoencoders.

Figure 15-4. Training one autoencoder at a time

During the first phase of training, the first autoencoder learns to reconstruct the inputs. During the second phase, the second autoencoder learns to reconstruct the output of the first autoencoder's hidden layer. Finally, you just build a big sandwich using all these autoencoders, as shown in Figure 15-4 (i.e., you first stack the hidden layers of each autoencoder, then the output layers in reverse order). This gives you the final stacked autoencoder. You could easily train more autoencoders this way, building a very deep stacked autoencoder.

To implement this multiphase training algorithm, the simplest approach is to use a different TensorFlow graph for each phase. After training an autoencoder, you just run the training set through it and capture the output of the hidden layer. This output then serves as the training set for the next autoencoder. Once all autoencoders have been trained this way, you simply copy the weights and biases from each autoencoder and use them to build the stacked autoencoder. Please check out the code in the Jupyter notebooks for a complete example; a minimal sketch of the idea follows.
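For illustration only, here is a rough sketch of that idea (not the notebook's exact code): a helper trains one shallow autoencoder in its own graph and returns the output of its hidden layer, which then becomes the training set for the next autoencoder. The function name train_autoencoder() and its hyperparameters are made up for this sketch; a fuller version would also return each autoencoder's weights and biases so you can copy them into the final stacked model.

import numpy as np
import tensorflow as tf

def train_autoencoder(X_train, n_neurons, n_epochs=5, batch_size=150,
                      learning_rate=0.01):
    """Train one shallow autoencoder in its own graph and return the
    output of its hidden layer for X_train (the codings)."""
    graph = tf.Graph()
    with graph.as_default():
        n_inputs = X_train.shape[1]
        X = tf.placeholder(tf.float32, shape=[None, n_inputs])
        hidden = tf.layers.dense(X, n_neurons, activation=tf.nn.elu)
        outputs = tf.layers.dense(hidden, n_inputs)
        reconstruction_loss = tf.reduce_mean(tf.square(outputs - X))
        training_op = tf.train.AdamOptimizer(learning_rate).minimize(
            reconstruction_loss)
        init = tf.global_variables_initializer()
    with tf.Session(graph=graph) as sess:
        init.run()
        for epoch in range(n_epochs):
            rnd_idx = np.random.permutation(len(X_train))
            for batch_idx in np.array_split(rnd_idx,
                                            len(X_train) // batch_size):
                sess.run(training_op, feed_dict={X: X_train[batch_idx]})
        # the hidden layer's output becomes the training set for the
        # next autoencoder in the stack
        return hidden.eval(feed_dict={X: X_train})

# codings1 = train_autoencoder(X_train, n_neurons=300)   # first autoencoder
# codings2 = train_autoencoder(codings1, n_neurons=150)  # second autoencoder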
Anotherapproachistouseasinglegraphcontainingthewholestackedautoencoder,plussomeextraoperationstoperformeachtrainingphase,asshowninFigure15-5.
Figure15-5.Asinglegraphtotrainastackedautoencoder
Thisdeservesabitofexplanation:Thecentralcolumninthegraphisthefullstackedautoencoder.Thispartcanbeusedaftertraining.
Theleftcolumnisthesetofoperationsneededtorunthefirstphaseoftraining.Itcreatesanoutputlayerthatbypasseshiddenlayers2and3.Thisoutputlayersharesthesameweightsandbiasesasthestackedautoencoder’soutputlayer.Ontopofthatarethetrainingoperationsthatwillaimatmakingtheoutputascloseaspossibletotheinputs.Thus,thisphasewilltraintheweightsandbiasesforthehiddenlayer1andtheoutputlayer(i.e.,thefirstautoencoder).
Therightcolumninthegraphisthesetofoperationsneededtorunthesecondphaseoftraining.Itaddsthetrainingoperationthatwillaimatmakingtheoutputofhiddenlayer3ascloseaspossibletotheoutputofhiddenlayer1.Notethatwemustfreezehiddenlayer1whilerunningphase2.Thisphasewilltraintheweightsandbiasesforhiddenlayers2and3(i.e.,thesecondautoencoder).
The TensorFlow code looks like this:

[...]  # Build the whole stacked autoencoder normally.
       # In this example, the weights are not tied.

optimizer = tf.train.AdamOptimizer(learning_rate)

with tf.name_scope("phase1"):
    phase1_outputs = tf.matmul(hidden1, weights4) + biases4
    phase1_reconstruction_loss = tf.reduce_mean(tf.square(phase1_outputs - X))
    phase1_reg_loss = regularizer(weights1) + regularizer(weights4)
    phase1_loss = phase1_reconstruction_loss + phase1_reg_loss
    phase1_training_op = optimizer.minimize(phase1_loss)

with tf.name_scope("phase2"):
    phase2_reconstruction_loss = tf.reduce_mean(tf.square(hidden3 - hidden1))
    phase2_reg_loss = regularizer(weights2) + regularizer(weights3)
    phase2_loss = phase2_reconstruction_loss + phase2_reg_loss
    train_vars = [weights2, biases2, weights3, biases3]
    phase2_training_op = optimizer.minimize(phase2_loss, var_list=train_vars)
The first phase is rather straightforward: we just create an output layer that skips hidden layers 2 and 3, then build the training operations to minimize the distance between the outputs and the inputs (plus some regularization).

The second phase just adds the operations needed to minimize the distance between the output of hidden layer 3 and hidden layer 1 (also with some regularization). Most importantly, we provide the list of trainable variables to the minimize() method, making sure to leave out weights1 and biases1; this effectively freezes hidden layer 1 during phase 2.

During the execution phase, all you need to do is run the phase 1 training op for a number of epochs, then the phase 2 training op for some more epochs.

TIP
Since hidden layer 1 is frozen during phase 2, its output will always be the same for any given training instance. To avoid having to recompute the output of hidden layer 1 at every single epoch, you can compute it for the whole training set at the end of phase 1, then directly feed the cached output of hidden layer 1 during phase 2. This can give you a nice performance boost.
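Here is a rough sketch of that caching trick, reusing X, hidden1, init, mnist, and the phase training ops defined above (the epoch counts and batch size are arbitrary choices): the output of hidden layer 1 is computed once over the whole training set at the end of phase 1, then fed directly in place of hidden1 during phase 2.

import numpy as np

training_ops = [phase1_training_op, phase2_training_op]
n_epochs = [4, 4]   # epochs for phase 1 and phase 2 (arbitrary)
batch_size = 150

with tf.Session() as sess:
    init.run()
    # Phase 1: train hidden layer 1 and the output layer
    for epoch in range(n_epochs[0]):
        n_batches = mnist.train.num_examples // batch_size
        for iteration in range(n_batches):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run(training_ops[0], feed_dict={X: X_batch})
    # Cache the output of the now-frozen hidden layer 1, once
    hidden1_cache = sess.run(hidden1, feed_dict={X: mnist.train.images})
    # Phase 2: feed the cached activations instead of recomputing them
    for epoch in range(n_epochs[1]):
        n_batches = mnist.train.num_examples // batch_size
        for iteration in range(n_batches):
            indices = np.random.permutation(mnist.train.num_examples)[:batch_size]
            hidden1_batch = hidden1_cache[indices]
            sess.run(training_ops[1], feed_dict={hidden1: hidden1_batch})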
Visualizing the Reconstructions
One way to ensure that an autoencoder is properly trained is to compare the inputs and the outputs. They must be fairly similar, and the differences should be unimportant details. Let's plot two random digits and their reconstructions:

n_test_digits = 2
X_test = mnist.test.images[:n_test_digits]

with tf.Session() as sess:
    [...]  # Train the Autoencoder
    outputs_val = outputs.eval(feed_dict={X: X_test})

def plot_image(image, shape=[28, 28]):
    plt.imshow(image.reshape(shape), cmap="Greys", interpolation="nearest")
    plt.axis("off")

for digit_index in range(n_test_digits):
    plt.subplot(n_test_digits, 2, digit_index * 2 + 1)
    plot_image(X_test[digit_index])
    plt.subplot(n_test_digits, 2, digit_index * 2 + 2)
    plot_image(outputs_val[digit_index])
Figure 15-6 shows the resulting images.

Figure 15-6. Original digits (left) and their reconstructions (right)

Looks close enough. So the autoencoder has properly learned to reproduce its inputs, but has it learned useful features? Let's take a look.
Visualizing Features
Once your autoencoder has learned some features, you may want to take a look at them. There are various techniques for this. Arguably the simplest technique is to consider each neuron in every hidden layer, and find the training instances that activate it the most. This is especially useful for the top hidden layers since they often capture relatively large features that you can easily spot in a group of training instances that contain them. For example, if a neuron strongly activates when it sees a cat in a picture, it will be pretty obvious that the pictures that activate it the most all contain cats. However, for lower layers, this technique does not work so well, as the features are smaller and more abstract, so it's often hard to understand exactly what the neuron is getting all excited about.
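For instance, here is one possible way to do this for a single neuron of the coding layer (hidden2 in the stacked autoencoder built earlier); the neuron index and the number of instances to keep are arbitrary choices for this sketch:

import numpy as np

n_top = 10         # how many top instances to keep
neuron_index = 3   # which coding-layer neuron to inspect (arbitrary choice)

with tf.Session() as sess:
    init.run()
    # [...] train the stacked autoencoder here, as shown earlier
    # activations of the coding layer over the whole training set
    codings_val = sess.run(hidden2, feed_dict={X: mnist.train.images})

# indices of the training images that activate this neuron the most
top_instances = np.argsort(codings_val[:, neuron_index])[-n_top:]
for position, instance_index in enumerate(top_instances):
    plt.subplot(1, n_top, position + 1)
    plot_image(mnist.train.images[instance_index])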
Let’slookatanothertechnique.Foreachneuroninthefirsthiddenlayer,youcancreateanimagewhereapixel’sintensitycorrespondstotheweightoftheconnectiontothegivenneuron.Forexample,thefollowingcodeplotsthefeatureslearnedbyfiveneuronsinthefirsthiddenlayer:
withtf.Session()assess:
[...]#trainautoencoder
weights1_val=weights1.eval()
foriinrange(5):
plt.subplot(1,5,i+1)
plot_image(weights1_val.T[i])
You may get low-level features such as the ones shown in Figure 15-7.

Figure 15-7. Features learned by five neurons from the first hidden layer

The first four features seem to correspond to small patches, while the fifth feature seems to look for vertical strokes (note that these features come from the stacked denoising autoencoder that we will discuss later).

Another technique is to feed the autoencoder a random input image, measure the activation of the neuron you are interested in, and then perform backpropagation to tweak the image in such a way that the neuron will activate even more. If you iterate several times (performing gradient ascent), the image will gradually turn into the most exciting image (for the neuron). This is a useful technique to visualize the kinds of inputs that a neuron is looking for.
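A rough sketch of this gradient ascent might look as follows, again using the hidden2 coding layer from earlier; the neuron index, step size, and number of iterations are arbitrary, and a more careful version would normalize or regularize the image:

import numpy as np

neuron_index = 3  # which coding-layer neuron to visualize (arbitrary choice)

# gradient of that neuron's activation with respect to the input image
neuron_activation = hidden2[0, neuron_index]
input_gradient = tf.gradients(neuron_activation, X)[0]

image = np.random.rand(1, n_inputs)  # start from a random input image
step_size = 0.1

with tf.Session() as sess:
    init.run()
    # [...] train the autoencoder first, as shown earlier
    for step in range(100):  # gradient ascent on the input
        grad_val = sess.run(input_gradient, feed_dict={X: image})
        image += step_size * grad_val
        image = np.clip(image, 0.0, 1.0)  # keep pixel values in a valid range

plot_image(image[0])  # roughly the input this neuron responds to most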
Finally,ifyouareusinganautoencodertoperformunsupervisedpretraining—forexample,foraclassificationtask—asimplewaytoverifythatthefeatureslearnedbytheautoencoderareusefulistomeasuretheperformanceoftheclassifier.
UnsupervisedPretrainingUsingStackedAutoencodersAswediscussedinChapter11,ifyouaretacklingacomplexsupervisedtaskbutyoudonothavealotoflabeledtrainingdata,onesolutionistofindaneuralnetworkthatperformsasimilartask,andthenreuseitslowerlayers.Thismakesitpossibletotrainahigh-performancemodelusingonlylittletrainingdatabecauseyourneuralnetworkwon’thavetolearnallthelow-levelfeatures;itwilljustreusethefeaturedetectorslearnedbytheexistingnet.
Similarly,ifyouhavealargedatasetbutmostofitisunlabeled,youcanfirsttrainastackedautoencoderusingallthedata,thenreusethelowerlayerstocreateaneuralnetworkforyouractualtask,andtrainitusingthelabeleddata.Forexample,Figure15-8showshowtouseastackedautoencodertoperformunsupervisedpretrainingforaclassificationneuralnetwork.Thestackedautoencoderitselfistypicallytrainedoneautoencoderatatime,asdiscussedearlier.Whentrainingtheclassifier,ifyoureallydon’thavemuchlabeledtrainingdata,youmaywanttofreezethepretrainedlayers(atleastthelowerones).
Figure15-8.Unsupervisedpretrainingusingautoencoders
NOTEThissituationisactuallyquitecommon,becausebuildingalargeunlabeleddatasetisoftencheap(e.g.,asimplescriptcandownloadmillionsofimagesofftheinternet),butlabelingthemcanonlybedonereliablybyhumans(e.g.,classifyingimagesascuteornot).Labelinginstancesistime-consumingandcostly,soitisquitecommontohaveonlyafewthousandlabeledinstances.
Aswediscussedearlier,oneofthetriggersofthecurrentDeepLearningtsunamiisthediscoveryin2006byGeoffreyHintonetal.thatdeepneuralnetworkscanbepretrainedinanunsupervisedfashion.TheyusedrestrictedBoltzmannmachinesforthat(seeAppendixE),butin2007YoshuaBengioetal.showed2thatautoencodersworkedjustaswell.
ThereisnothingspecialabouttheTensorFlowimplementation:justtrainanautoencoderusingallthetrainingdata,thenreuseitsencoderlayerstocreateanewneuralnetwork(seeChapter11formoredetailsonhowtoreusepretrainedlayers,orcheckoutthecodeexamplesintheJupyternotebooks).
Uptonow,inordertoforcetheautoencodertolearninterestingfeatures,wehavelimitedthesizeofthecodinglayer,makingitundercomplete.Thereareactuallymanyotherkindsofconstraintsthatcanbeused,includingonesthatallowthecodinglayertobejustaslargeastheinputs,orevenlarger,resultinginanovercompleteautoencoder.Let’slookatsomeofthoseapproachesnow.
DenoisingAutoencodersAnotherwaytoforcetheautoencodertolearnusefulfeaturesistoaddnoisetoitsinputs,trainingittorecovertheoriginal,noise-freeinputs.Thispreventstheautoencoderfromtriviallycopyingitsinputstoitsoutputs,soitendsuphavingtofindpatternsinthedata.
Theideaofusingautoencoderstoremovenoisehasbeenaroundsincethe1980s(e.g.,itismentionedinYannLeCun’s1987master’sthesis).Ina2008paper,3PascalVincentetal.showedthatautoencoderscouldalsobeusedforfeatureextraction.Ina2010paper,4Vincentetal.introducedstackeddenoisingautoencoders.
ThenoisecanbepureGaussiannoiseaddedtotheinputs,oritcanberandomlyswitchedoffinputs,justlikeindropout(introducedinChapter11).Figure15-9showsbothoptions.
Figure15-9.Denoisingautoencoders,withGaussiannoise(left)ordropout(right)
TensorFlow Implementation
Implementing denoising autoencoders in TensorFlow is not too hard. Let's start with Gaussian noise. It's really just like training a regular autoencoder, except you add noise to the inputs, and the reconstruction loss is calculated based on the original inputs:

noise_level = 1.0

X = tf.placeholder(tf.float32, shape=[None, n_inputs])
X_noisy = X + noise_level * tf.random_normal(tf.shape(X))

hidden1 = tf.layers.dense(X_noisy, n_hidden1, activation=tf.nn.relu,
                          name="hidden1")
[...]
reconstruction_loss = tf.reduce_mean(tf.square(outputs - X))  # MSE
[...]
WARNING
Since the shape of X is only partially defined during the construction phase, we cannot know in advance the shape of the noise that we must add to X. We cannot call X.get_shape() because this would just return the partially defined shape of X ([None, n_inputs]), and random_normal() expects a fully defined shape so it would raise an exception. Instead, we call tf.shape(X), which creates an operation that will return the shape of X at runtime; the shape will be fully defined at that point.

Implementing the dropout version, which is more common, is not much harder:

dropout_rate = 0.3

training = tf.placeholder_with_default(False, shape=(), name='training')

X = tf.placeholder(tf.float32, shape=[None, n_inputs])
X_drop = tf.layers.dropout(X, dropout_rate, training=training)

hidden1 = tf.layers.dense(X_drop, n_hidden1, activation=tf.nn.relu,
                          name="hidden1")
[...]
reconstruction_loss = tf.reduce_mean(tf.square(outputs - X))  # MSE
[...]

During training we must set training to True (as explained in Chapter 11) using the feed_dict:

sess.run(training_op, feed_dict={X: X_batch, training: True})

During testing it is not necessary to set training to False, since we set that as the default in the call to the placeholder_with_default() function.
SparseAutoencodersAnotherkindofconstraintthatoftenleadstogoodfeatureextractionissparsity:byaddinganappropriatetermtothecostfunction,theautoencoderispushedtoreducethenumberofactiveneuronsinthecodinglayer.Forexample,itmaybepushedtohaveonaverageonly5%significantlyactiveneuronsinthecodinglayer.Thisforcestheautoencodertorepresenteachinputasacombinationofasmallnumberofactivations.Asaresult,eachneuroninthecodinglayertypicallyendsuprepresentingausefulfeature(ifyoucouldspeakonlyafewwordspermonth,youwouldprobablytrytomakethemworthlisteningto).
Inordertofavorsparsemodels,wemustfirstmeasuretheactualsparsityofthecodinglayerateachtrainingiteration.Wedosobycomputingtheaverageactivationofeachneuroninthecodinglayer,overthewholetrainingbatch.Thebatchsizemustnotbetoosmall,orelsethemeanwillnotbeaccurate.
Oncewehavethemeanactivationperneuron,wewanttopenalizetheneuronsthataretooactivebyaddingasparsitylosstothecostfunction.Forexample,ifwemeasurethataneuronhasanaverageactivationof0.3,butthetargetsparsityis0.1,itmustbepenalizedtoactivateless.Oneapproachcouldbesimplyaddingthesquarederror(0.3–0.1)2tothecostfunction,butinpracticeabetterapproachistousetheKullback–Leiblerdivergence(brieflydiscussedinChapter4),whichhasmuchstrongergradientsthantheMeanSquaredError,asyoucanseeinFigure15-10.
Figure15-10.Sparsityloss
Given two discrete probability distributions P and Q, the KL divergence between these distributions, noted D_KL(P ∥ Q), can be computed using Equation 15-1.

Equation 15-1. Kullback–Leibler divergence

D_KL(P ∥ Q) = Σ_i P(i) log [ P(i) / Q(i) ]

In our case, we want to measure the divergence between the target probability p that a neuron in the coding layer will activate, and the actual probability q (i.e., the mean activation over the training batch). So the KL divergence simplifies to Equation 15-2.

Equation 15-2. KL divergence between the target sparsity p and the actual sparsity q

D_KL(p ∥ q) = p log (p / q) + (1 − p) log [ (1 − p) / (1 − q) ]
Oncewehavecomputedthesparsitylossforeachneuroninthecodinglayer,wejustsumuptheselosses,andaddtheresulttothecostfunction.Inordertocontroltherelativeimportanceofthesparsitylossandthereconstructionloss,wecanmultiplythesparsitylossbyasparsityweighthyperparameter.Ifthisweightistoohigh,themodelwillstickcloselytothetargetsparsity,butitmaynotreconstructtheinputsproperly,makingthemodeluseless.Conversely,ifitistoolow,themodelwillmostlyignorethesparsityobjectiveanditwillnotlearnanyinterestingfeatures.
TensorFlow Implementation
We now have all we need to implement a sparse autoencoder using TensorFlow:

def kl_divergence(p, q):
    return p * tf.log(p / q) + (1 - p) * tf.log((1 - p) / (1 - q))

learning_rate = 0.01
sparsity_target = 0.1
sparsity_weight = 0.2

[...]  # Build a normal autoencoder (in this example the coding layer is hidden1)

hidden1_mean = tf.reduce_mean(hidden1, axis=0)  # batch mean
sparsity_loss = tf.reduce_sum(kl_divergence(sparsity_target, hidden1_mean))
reconstruction_loss = tf.reduce_mean(tf.square(outputs - X))  # MSE
loss = reconstruction_loss + sparsity_weight * sparsity_loss

optimizer = tf.train.AdamOptimizer(learning_rate)
training_op = optimizer.minimize(loss)

An important detail is the fact that the activations of the coding layer must be between 0 and 1 (but not equal to 0 or 1), or else the KL divergence will return NaN (Not a Number). A simple solution is to use the logistic activation function for the coding layer:

hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.sigmoid)
One simple trick can speed up convergence: instead of using the MSE, we can choose a reconstruction loss that will have larger gradients. Cross entropy is often a good choice. To use it, we must normalize the inputs to make them take on values from 0 to 1, and use the logistic activation function in the output layer so the outputs also take on values from 0 to 1. TensorFlow's sigmoid_cross_entropy_with_logits() function takes care of efficiently applying the logistic (sigmoid) activation function to the outputs and computing the cross entropy:

[...]
logits = tf.layers.dense(hidden1, n_outputs)
outputs = tf.nn.sigmoid(logits)

xentropy = tf.nn.sigmoid_cross_entropy_with_logits(labels=X, logits=logits)
reconstruction_loss = tf.reduce_sum(xentropy)

Note that the outputs operation is not needed during training (we use it only when we want to look at the reconstructions).
VariationalAutoencodersAnotherimportantcategoryofautoencoderswasintroducedin2014byDiederikKingmaandMaxWelling,5andhasquicklybecomeoneofthemostpopulartypesofautoencoders:variationalautoencoders.
Theyarequitedifferentfromalltheautoencoderswehavediscussedsofar,inparticular:Theyareprobabilisticautoencoders,meaningthattheiroutputsarepartlydeterminedbychance,evenaftertraining(asopposedtodenoisingautoencoders,whichuserandomnessonlyduringtraining).
Mostimportantly,theyaregenerativeautoencoders,meaningthattheycangeneratenewinstancesthatlookliketheyweresampledfromthetrainingset.
BoththesepropertiesmakethemrathersimilartoRBMs(seeAppendixE),buttheyareeasiertotrainandthesamplingprocessismuchfaster(withRBMsyouneedtowaitforthenetworktostabilizeintoa“thermalequilibrium”beforeyoucansampleanewinstance).
Let’stakealookathowtheywork.Figure15-11(left)showsavariationalautoencoder.Youcanrecognize,ofcourse,thebasicstructureofallautoencoders,withanencoderfollowedbyadecoder(inthisexample,theybothhavetwohiddenlayers),butthereisatwist:insteadofdirectlyproducingacodingforagiveninput,theencoderproducesameancodingμandastandarddeviationσ.TheactualcodingisthensampledrandomlyfromaGaussiandistributionwithmeanμandstandarddeviationσ.Afterthatthedecoderjustdecodesthesampledcodingnormally.Therightpartofthediagramshowsatraininginstancegoingthroughthisautoencoder.First,theencoderproducesμandσ,thenacodingissampledrandomly(noticethatitisnotexactlylocatedatμ),andfinallythiscodingisdecoded,andthefinaloutputresemblesthetraininginstance.
Figure15-11.Variationalautoencoder(left),andaninstancegoingthroughit(right)
Asyoucanseeonthediagram,althoughtheinputsmayhaveaveryconvoluteddistribution,avariationalautoencodertendstoproducecodingsthatlookasthoughtheyweresampledfromasimpleGaussiandistribution:6duringtraining,thecostfunction(discussednext)pushesthecodingstograduallymigratewithinthecodingspace(alsocalledthelatentspace)tooccupyaroughly(hyper)sphericalregionthatlookslikeacloudofGaussianpoints.Onegreatconsequenceisthataftertrainingavariationalautoencoder,youcanveryeasilygenerateanewinstance:justsamplearandomcodingfromtheGaussiandistribution,decodeit,andvoilà!
Solet’slookatthecostfunction.Itiscomposedoftwoparts.Thefirstistheusualreconstructionlossthatpushestheautoencodertoreproduceitsinputs(wecanusecrossentropyforthis,asdiscussedearlier).ThesecondisthelatentlossthatpushestheautoencodertohavecodingsthatlookasthoughtheyweresampledfromasimpleGaussiandistribution,forwhichweusetheKLdivergencebetweenthetargetdistribution(theGaussiandistribution)andtheactualdistributionofthecodings.Themathisabitmorecomplexthanearlier,inparticularbecauseoftheGaussiannoise,whichlimitstheamountofinformationthatcanbetransmittedtothecodinglayer(thuspushingtheautoencodertolearnusefulfeatures).Luckily,theequationssimplifytothefollowingcodeforthelatentloss:7
eps = 1e-10  # smoothing term to avoid computing log(0) which is NaN
latent_loss = 0.5 * tf.reduce_sum(
    tf.square(hidden3_sigma) + tf.square(hidden3_mean)
    - 1 - tf.log(eps + tf.square(hidden3_sigma)))

One common variant is to train the encoder to output γ = log(σ²) rather than σ. Wherever we need σ we can just compute σ = exp(γ / 2). This makes it a bit easier for the encoder to capture sigmas of different scales, and thus it helps speed up convergence. The latent loss ends up a bit simpler:

latent_loss = 0.5 * tf.reduce_sum(
    tf.exp(hidden3_gamma) + tf.square(hidden3_mean) - 1 - hidden3_gamma)
The following code builds the variational autoencoder shown in Figure 15-11 (left), using the log(σ²) variant:

from functools import partial

n_inputs = 28 * 28
n_hidden1 = 500
n_hidden2 = 500
n_hidden3 = 20  # codings
n_hidden4 = n_hidden2
n_hidden5 = n_hidden1
n_outputs = n_inputs
learning_rate = 0.001

initializer = tf.contrib.layers.variance_scaling_initializer()
my_dense_layer = partial(
    tf.layers.dense,
    activation=tf.nn.elu,
    kernel_initializer=initializer)

X = tf.placeholder(tf.float32, [None, n_inputs])
hidden1 = my_dense_layer(X, n_hidden1)
hidden2 = my_dense_layer(hidden1, n_hidden2)
hidden3_mean = my_dense_layer(hidden2, n_hidden3, activation=None)
hidden3_gamma = my_dense_layer(hidden2, n_hidden3, activation=None)
noise = tf.random_normal(tf.shape(hidden3_gamma), dtype=tf.float32)
hidden3 = hidden3_mean + tf.exp(0.5 * hidden3_gamma) * noise
hidden4 = my_dense_layer(hidden3, n_hidden4)
hidden5 = my_dense_layer(hidden4, n_hidden5)
logits = my_dense_layer(hidden5, n_outputs, activation=None)
outputs = tf.sigmoid(logits)

xentropy = tf.nn.sigmoid_cross_entropy_with_logits(labels=X, logits=logits)
reconstruction_loss = tf.reduce_sum(xentropy)
latent_loss = 0.5 * tf.reduce_sum(
    tf.exp(hidden3_gamma) + tf.square(hidden3_mean) - 1 - hidden3_gamma)
loss = reconstruction_loss + latent_loss

optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(loss)

init = tf.global_variables_initializer()
saver = tf.train.Saver()
Generating Digits
Now let's use this variational autoencoder to generate images that look like handwritten digits. All we need to do is train the model, then sample random codings from a Gaussian distribution and decode them.

import numpy as np

n_digits = 60
n_epochs = 50
batch_size = 150

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        n_batches = mnist.train.num_examples // batch_size
        for iteration in range(n_batches):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run(training_op, feed_dict={X: X_batch})

    codings_rnd = np.random.normal(size=[n_digits, n_hidden3])
    outputs_val = outputs.eval(feed_dict={hidden3: codings_rnd})

That's it. Now we can see what the "handwritten" digits produced by the autoencoder look like (see Figure 15-12):

for iteration in range(n_digits):
    plt.subplot(n_digits, 10, iteration + 1)
    plot_image(outputs_val[iteration])
Figure 15-12. Images of handwritten digits generated by the variational autoencoder

A majority of these digits look pretty convincing, while a few are rather "creative." But don't be too harsh on the autoencoder: it only started learning less than an hour ago. Give it a bit more training time, and those digits will look better and better.
OtherAutoencodersTheamazingsuccessesofsupervisedlearninginimagerecognition,speechrecognition,texttranslation,andmorehavesomewhatovershadowedunsupervisedlearning,butitisactuallybooming.Newarchitecturesforautoencodersandotherunsupervisedlearningalgorithmsareinventedregularly,somuchsothatwecannotcoverthemallinthisbook.Hereisabrief(bynomeansexhaustive)overviewofafewmoretypesofautoencodersthatyoumaywanttocheckout:
Contractiveautoencoder(CAE)8
Theautoencoderisconstrainedduringtrainingsothatthederivativesofthecodingswithregardstotheinputsaresmall.Inotherwords,twosimilarinputsmusthavesimilarcodings.
Stackedconvolutionalautoencoders9
Autoencodersthatlearntoextractvisualfeaturesbyreconstructingimagesprocessedthroughconvolutionallayers.
Generativestochasticnetwork(GSN)10
Ageneralizationofdenoisingautoencoders,withtheaddedcapabilitytogeneratedata.
Winner-take-all(WTA)autoencoder11
Duringtraining,aftercomputingtheactivationsofalltheneuronsinthecodinglayer,onlythetopk%activationsforeachneuronoverthetrainingbatcharepreserved,andtherestaresettozero.Naturallythisleadstosparsecodings.Moreover,asimilarWTAapproachcanbeusedtoproducesparseconvolutionalautoencoders.
Adversarialautoencoders12
Onenetworkistrainedtoreproduceitsinputs,andatthesametimeanotheristrainedtofindinputsthatthefirstnetworkisunabletoproperlyreconstruct.Thispushesthefirstautoencodertolearnrobustcodings.
Exercises1. Whatarethemaintasksthatautoencodersareusedfor?
2. Supposeyouwanttotrainaclassifierandyouhaveplentyofunlabeledtrainingdata,butonlyafewthousandlabeledinstances.Howcanautoencodershelp?Howwouldyouproceed?
3. Ifanautoencoderperfectlyreconstructstheinputs,isitnecessarilyagoodautoencoder?Howcanyouevaluatetheperformanceofanautoencoder?
4. Whatareundercompleteandovercompleteautoencoders?Whatisthemainriskofanexcessivelyundercompleteautoencoder?Whataboutthemainriskofanovercompleteautoencoder?
5. Howdoyoutieweightsinastackedautoencoder?Whatisthepointofdoingso?
6. Whatisacommontechniquetovisualizefeatureslearnedbythelowerlayerofastackedautoencoder?Whatabouthigherlayers?
7. Whatisagenerativemodel?Canyounameatypeofgenerativeautoencoder?
8. Let’suseadenoisingautoencodertopretrainanimageclassifier:YoucanuseMNIST(simplest),oranotherlargesetofimagessuchasCIFAR10ifyouwantabiggerchallenge.IfyouchooseCIFAR10,youneedtowritecodetoloadbatchesofimagesfortraining.Ifyouwanttoskipthispart,TensorFlow’smodelzoocontainstoolstodojustthat.
Splitthedatasetintoatrainingsetandatestset.Trainadeepdenoisingautoencoderonthefulltrainingset.
Checkthattheimagesarefairlywellreconstructed,andvisualizethelow-levelfeatures.Visualizetheimagesthatmostactivateeachneuroninthecodinglayer.
Buildaclassificationdeepneuralnetwork,reusingthelowerlayersoftheautoencoder.Trainitusingonly10%ofthetrainingset.Canyougetittoperformaswellasthesameclassifiertrainedonthefulltrainingset?
9. Semantichashing,introducedin2008byRuslanSalakhutdinovandGeoffreyHinton,13isatechniqueusedforefficientinformationretrieval:adocument(e.g.,animage)ispassedthroughasystem,typicallyaneuralnetwork,whichoutputsafairlylow-dimensionalbinaryvector(e.g.,30bits).Twosimilardocumentsarelikelytohaveidenticalorverysimilarhashes.Byindexingeachdocumentusingitshash,itispossibletoretrievemanydocumentssimilartoaparticulardocumentalmostinstantly,eveniftherearebillionsofdocuments:justcomputethehashofthedocumentandlookupalldocumentswiththatsamehash(orhashesdifferingbyjustoneortwobits).Let’simplementsemantichashingusingaslightlytweakedstackedautoencoder:
Createastackedautoencodercontainingtwohiddenlayersbelowthecodinglayer,andtrainitontheimagedatasetyouusedinthepreviousexercise.Thecodinglayershouldcontain30neuronsandusethelogisticactivationfunctiontooutputvaluesbetween0and1.Aftertraining,
toproducethehashofanimage,youcansimplyrunitthroughtheautoencoder,taketheoutputofthecodinglayer,androundeveryvaluetotheclosestinteger(0or1).
OneneattrickproposedbySalakhutdinovandHintonistoaddGaussiannoise(withzeromean)totheinputsofthecodinglayer,duringtrainingonly.Inordertopreserveahighsignal-to-noiseratio,theautoencoderwilllearntofeedlargevaluestothecodinglayer(sothatthenoisebecomesnegligible).Inturn,thismeansthatthelogisticfunctionofthecodinglayerwilllikelysaturateat0or1.Asaresult,roundingthecodingsto0or1won’tdistortthemtoomuch,andthiswillimprovethereliabilityofthehashes.
Computethehashofeveryimage,andseeifimageswithidenticalhasheslookalike.SinceMNISTandCIFAR10arelabeled,amoreobjectivewaytomeasuretheperformanceoftheautoencoderforsemantichashingistoensurethatimageswiththesamehashgenerallyhavethesameclass.OnewaytodothisistomeasuretheaverageGinipurity(introducedinChapter6)ofthesetsofimageswithidentical(orverysimilar)hashes.
Tryfine-tuningthehyperparametersusingcross-validation.
Notethatwithalabeleddataset,anotherapproachistotrainaconvolutionalneuralnetwork(seeChapter13)forclassification,thenusethelayerbelowtheoutputlayertoproducethehashes.SeeJinmaGuaandJianminLi’s2015paper.14Seeifthatperformsbetter.
10. Trainavariationalautoencoderontheimagedatasetusedinthepreviousexercises(MNISTorCIFAR10),andmakeitgenerateimages.Alternatively,youcantrytofindanunlabeleddatasetthatyouareinterestedinandseeifyoucangeneratenewsamples.
SolutionstotheseexercisesareavailableinAppendixA.
1. "Perception in chess," W. Chase and H. Simon (1973).
2. "Greedy Layer-Wise Training of Deep Networks," Y. Bengio et al. (2007).
3. "Extracting and Composing Robust Features with Denoising Autoencoders," P. Vincent et al. (2008).
4. "Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion," P. Vincent et al. (2010).
5. "Auto-Encoding Variational Bayes," D. Kingma and M. Welling (2014).
6. Variational autoencoders are actually more general; the codings are not limited to Gaussian distributions.
7. For more mathematical details, check out the original paper on variational autoencoders, or Carl Doersch's great tutorial (2016).
8. "Contractive Auto-Encoders: Explicit Invariance During Feature Extraction," S. Rifai et al. (2011).
9. "Stacked Convolutional Auto-Encoders for Hierarchical Feature Extraction," J. Masci et al. (2011).
10. "GSNs: Generative Stochastic Networks," G. Alain et al. (2015).
11. "Winner-Take-All Autoencoders," A. Makhzani and B. Frey (2015).
12. "Adversarial Autoencoders," A. Makhzani et al. (2016).
13. "Semantic Hashing," R. Salakhutdinov and G. Hinton (2008).
14. "CNN Based Hashing for Image Retrieval," J. Gua and J. Li (2015).
Chapter16.ReinforcementLearning
ReinforcementLearning(RL)isoneofthemostexcitingfieldsofMachineLearningtoday,andalsooneoftheoldest.Ithasbeenaroundsincethe1950s,producingmanyinterestingapplicationsovertheyears,1inparticularingames(e.g.,TD-Gammon,aBackgammonplayingprogram)andinmachinecontrol,butseldommakingtheheadlinenews.Butarevolutiontookplacein2013whenresearchersfromanEnglishstartupcalledDeepMinddemonstratedasystemthatcouldlearntoplayjustaboutanyAtarigamefromscratch,2eventuallyoutperforminghumans3inmostofthem,usingonlyrawpixelsasinputsandwithoutanypriorknowledgeoftherulesofthegames.4Thiswasthefirstofaseriesofamazingfeats,culminatinginMarch2016withthevictoryoftheirsystemAlphaGoagainstLeeSedol,theworldchampionofthegameofGo.Noprogramhadevercomeclosetobeatingamasterofthisgame,letalonetheworldchampion.TodaythewholefieldofRLisboilingwithnewideas,withawiderangeofapplications.DeepMindwasboughtbyGoogleforover500milliondollarsin2014.
Sohowdidtheydoit?Withhindsightitseemsrathersimple:theyappliedthepowerofDeepLearningtothefieldofReinforcementLearning,anditworkedbeyondtheirwildestdreams.InthischapterwewillfirstexplainwhatReinforcementLearningisandwhatitisgoodat,andthenwewillpresenttwoofthemostimportanttechniquesindeepReinforcementLearning:policygradientsanddeepQ-networks(DQN),includingadiscussionofMarkovdecisionprocesses(MDP).Wewillusethesetechniquestotrainamodeltobalanceapoleonamovingcart,andanothertoplayAtarigames.Thesametechniquescanbeusedforawidevarietyoftasks,fromwalkingrobotstoself-drivingcars.
LearningtoOptimizeRewardsInReinforcementLearning,asoftwareagentmakesobservationsandtakesactionswithinanenvironment,andinreturnitreceivesrewards.Itsobjectiveistolearntoactinawaythatwillmaximizeitsexpectedlong-termrewards.Ifyoudon’tmindabitofanthropomorphism,youcanthinkofpositiverewardsaspleasure,andnegativerewardsaspain(theterm“reward”isabitmisleadinginthiscase).Inshort,theagentactsintheenvironmentandlearnsbytrialanderrortomaximizeitspleasureandminimizeitspain.
Thisisquiteabroadsetting,whichcanapplytoawidevarietyoftasks.Hereareafewexamples(seeFigure16-1):1. Theagentcanbetheprogramcontrollingawalkingrobot.Inthiscase,theenvironmentisthereal
world,theagentobservestheenvironmentthroughasetofsensorssuchascamerasandtouchsensors,anditsactionsconsistofsendingsignalstoactivatemotors.Itmaybeprogrammedtogetpositiverewardswheneveritapproachesthetargetdestination,andnegativerewardswheneveritwastestime,goesinthewrongdirection,orfallsdown.
2. TheagentcanbetheprogramcontrollingMs.Pac-Man.Inthiscase,theenvironmentisasimulationoftheAtarigame,theactionsaretheninepossiblejoystickpositions(upperleft,down,center,andsoon),theobservationsarescreenshots,andtherewardsarejustthegamepoints.
3. Similarly,theagentcanbetheprogramplayingaboardgamesuchasthegameofGo.
4. Theagentdoesnothavetocontrolaphysically(orvirtually)movingthing.Forexample,itcanbeasmartthermostat,gettingrewardswheneveritisclosetothetargettemperatureandsavesenergy,andnegativerewardswhenhumansneedtotweakthetemperature,sotheagentmustlearntoanticipatehumanneeds.
5. Theagentcanobservestockmarketpricesanddecidehowmuchtobuyorselleverysecond.Rewardsareobviouslythemonetarygainsandlosses.
Figure16-1.ReinforcementLearningexamples:(a)walkingrobot,(b)Ms.Pac-Man,(c)Goplayer,(d)thermostat,(e)automatictrader5
Notethattheremaynotbeanypositiverewardsatall;forexample,theagentmaymovearoundinamaze,gettinganegativerewardateverytimestep,soitbetterfindtheexitasquicklyaspossible!TherearemanyotherexamplesoftaskswhereReinforcementLearningiswellsuited,suchasself-drivingcars,placingadsonawebpage,orcontrollingwhereanimageclassificationsystemshouldfocusitsattention.
PolicySearchThealgorithmusedbythesoftwareagenttodetermineitsactionsiscalleditspolicy.Forexample,thepolicycouldbeaneuralnetworktakingobservationsasinputsandoutputtingtheactiontotake(seeFigure16-2).
Figure16-2.ReinforcementLearningusinganeuralnetworkpolicy
Thepolicycanbeanyalgorithmyoucanthinkof,anditdoesnotevenhavetobedeterministic.Forexample,consideraroboticvacuumcleanerwhoserewardistheamountofdustitpicksupin30minutes.Itspolicycouldbetomoveforwardwithsomeprobabilitypeverysecond,orrandomlyrotateleftorrightwithprobability1–p.Therotationanglewouldbearandomanglebetween–rand+r.Sincethispolicyinvolvessomerandomness,itiscalledastochasticpolicy.Therobotwillhaveanerratictrajectory,whichguaranteesthatitwilleventuallygettoanyplaceitcanreachandpickupallthedust.Thequestionis:howmuchdustwillitpickupin30minutes?
Howwouldyoutrainsucharobot?Therearejusttwopolicyparametersyoucantweak:theprobabilitypandtheangleranger.Onepossiblelearningalgorithmcouldbetotryoutmanydifferentvaluesfortheseparameters,andpickthecombinationthatperformsbest(seeFigure16-3).Thisisanexampleofpolicysearch,inthiscaseusingabruteforceapproach.However,whenthepolicyspaceistoolarge(whichisgenerallythecase),findingagoodsetofparametersthiswayislikesearchingforaneedleinagigantichaystack.
Another way to explore the policy space is to use genetic algorithms. For example, you could randomly create a first generation of 100 policies and try them out, then "kill" the 80 worst policies6 and make the 20 survivors produce 4 offspring each. An offspring is just a copy of its parent7 plus some random variation. The surviving policies plus their offspring together constitute the second generation. You can continue to iterate through generations this way, until you find a good policy.
Figure16-3.Fourpointsinpolicyspaceandtheagent’scorrespondingbehavior
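To make the genetic approach above concrete, here is a toy sketch of such a search over the vacuum robot's two policy parameters. The evaluate_policy() fitness function is a made-up stand-in for the 30-minute simulation, and the population sizes and mutation scales are arbitrary choices for this sketch:

import numpy as np

def evaluate_policy(p, r):
    """Hypothetical fitness function. In a real setting this would run the
    simulated robot for 30 minutes with the stochastic policy (move forward
    with probability p, else rotate by a random angle in [-r, +r]) and
    return the amount of dust picked up. Here we use a dummy stand-in so
    the sketch runs end to end."""
    return p * (1.0 - abs(r - 30.0) / 180.0) + np.random.randn() * 0.01

# first generation: 100 random policies (p in [0, 1], r in [0, 180] degrees)
population = [(np.random.rand(), np.random.rand() * 180.0) for _ in range(100)]

for generation in range(20):
    # evaluate every policy, keep the 20 best ("kill" the 80 worst)
    scores = [evaluate_policy(p, r) for p, r in population]
    ranked = sorted(zip(scores, population), reverse=True)
    survivors = [policy for _, policy in ranked[:20]]
    # each survivor produces 4 offspring: a copy plus some random variation
    offspring = [(float(np.clip(p + np.random.randn() * 0.05, 0.0, 1.0)),
                  float(np.clip(r + np.random.randn() * 5.0, 0.0, 180.0)))
                 for p, r in survivors for _ in range(4)]
    population = survivors + offspring

best_p, best_r = population[0]  # best survivor from the last generation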
Yetanotherapproachistouseoptimizationtechniques,byevaluatingthegradientsoftherewardswithregardstothepolicyparameters,thentweakingtheseparametersbyfollowingthegradienttowardhigherrewards(gradientascent).Thisapproachiscalledpolicygradients(PG),whichwewilldiscussinmoredetaillaterinthischapter.Forexample,goingbacktothevacuumcleanerrobot,youcouldslightlyincreasepandevaluatewhetherthisincreasestheamountofdustpickedupbytherobotin30minutes;ifitdoes,thenincreasepsomemore,orelsereducep.WewillimplementapopularPGalgorithmusingTensorFlow,butbeforewedoweneedtocreateanenvironmentfortheagenttolivein,soit’stimetointroduceOpenAIgym.
IntroductiontoOpenAIGymOneofthechallengesofReinforcementLearningisthatinordertotrainanagent,youfirstneedtohaveaworkingenvironment.IfyouwanttoprogramanagentthatwilllearntoplayanAtarigame,youwillneedanAtarigamesimulator.Ifyouwanttoprogramawalkingrobot,thentheenvironmentistherealworldandyoucandirectlytrainyourrobotinthatenvironment,butthishasitslimits:iftherobotfallsoffacliff,youcan’tjustclick“undo.”Youcan’tspeeduptimeeither;addingmorecomputingpowerwon’tmaketherobotmoveanyfaster.Andit’sgenerallytooexpensivetotrain1,000robotsinparallel.Inshort,trainingishardandslowintherealworld,soyougenerallyneedasimulatedenvironmentatleasttobootstraptraining.
OpenAIgym8isatoolkitthatprovidesawidevarietyofsimulatedenvironments(Atarigames,boardgames,2Dand3Dphysicalsimulations,andsoon),soyoucantrainagents,comparethem,ordevelopnewRLalgorithms.
Let’sinstallOpenAIgym.ForaminimalOpenAIgyminstallation,simplyusepip:
$pip3install--upgradegym
Next open up a Python shell or a Jupyter notebook and create your first environment:

>>> import gym
>>> env = gym.make("CartPole-v0")
[2016-10-14 16:03:23,199] Making new env: CartPole-v0
>>> obs = env.reset()
>>> obs
array([-0.03799846, -0.03288115,  0.02337094,  0.00720711])
>>> env.render()
Themake()functioncreatesanenvironment,inthiscaseaCartPoleenvironment.Thisisa2Dsimulationinwhichacartcanbeacceleratedleftorrightinordertobalanceapoleplacedontopofit(seeFigure16-4).Aftertheenvironmentiscreated,wemustinitializeitusingthereset()method.Thisreturnsthefirstobservation.Observationsdependonthetypeofenvironment.FortheCartPoleenvironment,eachobservationisa1DNumPyarraycontainingfourfloats:thesefloatsrepresentthecart’shorizontalposition(0.0=center),itsvelocity,theangleofthepole(0.0=vertical),anditsangularvelocity.Finally,therender()methoddisplaystheenvironmentasshowninFigure16-4.
Figure16-4.TheCartPoleenvironment
If you want render() to return the rendered image as a NumPy array, you can set the mode parameter to rgb_array (note that other environments may support different modes):

>>> img = env.render(mode="rgb_array")
>>> img.shape  # height, width, channels (3 = RGB)
(400, 600, 3)

TIP
Unfortunately, the CartPole (and a few other environments) renders the image to the screen even if you set the mode to "rgb_array". The only way to avoid this is to use a fake X server such as Xvfb or Xdummy. For example, you can install Xvfb and start Python using the following command: xvfb-run -s "-screen 0 1400x900x24" python. Or use the xvfbwrapper package.
Let’sasktheenvironmentwhatactionsarepossible:
>>>env.action_space
Discrete(2)
Discrete(2)meansthatthepossibleactionsareintegers0and1,whichrepresentacceleratingleft(0)orright(1).Otherenvironmentsmayhavemorediscreteactions,orotherkindsofactions(e.g.,continuous).Sincethepoleisleaningtowardtheright,let’sacceleratethecarttowardtheright:
>>>action=1#accelerateright
>>>obs,reward,done,info=env.step(action)
>>>obs
array([-0.03865608,0.16189797,0.02351508,-0.27801135])
>>>reward
1.0
>>>done
False
>>>info
{}
The step() method executes the given action and returns four values:

obs
This is the new observation. The cart is now moving toward the right (obs[1] > 0). The pole is still tilted toward the right (obs[2] > 0), but its angular velocity is now negative (obs[3] < 0), so it will likely be tilted toward the left after the next step.

reward
In this environment, you get a reward of 1.0 at every step, no matter what you do, so the goal is to keep running as long as possible.

done
This value will be True when the episode is over. This will happen when the pole tilts too much. After that, the environment must be reset before it can be used again.

info
This dictionary may provide extra debug information in other environments. This data should not be used for training (it would be cheating).
Let’shardcodeasimplepolicythatacceleratesleftwhenthepoleisleaningtowardtheleftandacceleratesrightwhenthepoleisleaningtowardtheright.Wewillrunthispolicytoseetheaveragerewardsitgetsover500episodes:
defbasic_policy(obs):
angle=obs[2]
return0ifangle<0else1
totals=[]
forepisodeinrange(500):
episode_rewards=0
obs=env.reset()
forstepinrange(1000):#1000stepsmax,wedon'twanttorunforever
action=basic_policy(obs)
obs,reward,done,info=env.step(action)
episode_rewards+=reward
ifdone:
break
totals.append(episode_rewards)
This code is hopefully self-explanatory. Let's look at the result:

>>> import numpy as np
>>> np.mean(totals), np.std(totals), np.min(totals), np.max(totals)
(42.125999999999998, 9.1237121830974033, 24.0, 68.0)
Evenwith500tries,thispolicynevermanagedtokeepthepoleuprightformorethan68consecutivesteps.Notgreat.IfyoulookatthesimulationintheJupyternotebooks,youwillseethatthecartoscillatesleftandrightmoreandmorestronglyuntilthepoletiltstoomuch.Let’sseeifaneuralnetworkcancomeupwithabetterpolicy.
NeuralNetworkPoliciesLet’screateaneuralnetworkpolicy.Justlikethepolicywehardcodedearlier,thisneuralnetworkwilltakeanobservationasinput,anditwilloutputtheactiontobeexecuted.Moreprecisely,itwillestimateaprobabilityforeachaction,andthenwewillselectanactionrandomlyaccordingtotheestimatedprobabilities(seeFigure16-5).InthecaseoftheCartPoleenvironment,therearejusttwopossibleactions(leftorright),soweonlyneedoneoutputneuron.Itwilloutputtheprobabilitypofaction0(left),andofcoursetheprobabilityofaction1(right)willbe1–p.Forexample,ifitoutputs0.7,thenwewillpickaction0with70%probability,andaction1with30%probability.
Figure16-5.Neuralnetworkpolicy
Youmaywonderwhywearepickingarandomactionbasedontheprobabilitygivenbytheneuralnetwork,ratherthanjustpickingtheactionwiththehighestscore.Thisapproachletstheagentfindtherightbalancebetweenexploringnewactionsandexploitingtheactionsthatareknowntoworkwell.Here’sananalogy:supposeyougotoarestaurantforthefirsttime,andallthedisheslookequallyappealingsoyourandomlypickone.Ifitturnsouttobegood,youcanincreasetheprobabilitytoorderitnexttime,butyoushouldn’tincreasethatprobabilityupto100%,orelseyouwillnevertryouttheotherdishes,someofwhichmaybeevenbetterthantheoneyoutried.
Alsonotethatinthisparticularenvironment,thepastactionsandobservationscansafelybeignored,
sinceeachobservationcontainstheenvironment’sfullstate.Ifthereweresomehiddenstate,thenyoumayneedtoconsiderpastactionsandobservationsaswell.Forexample,iftheenvironmentonlyrevealedthepositionofthecartbutnotitsvelocity,youwouldhavetoconsidernotonlythecurrentobservationbutalsothepreviousobservationinordertoestimatethecurrentvelocity.Anotherexampleiswhentheobservationsarenoisy;inthatcase,yougenerallywanttousethepastfewobservationstoestimatethemostlikelycurrentstate.TheCartPoleproblemisthusassimpleascanbe;theobservationsarenoise-freeandtheycontaintheenvironment’sfullstate.
Here is the code to build this neural network policy using TensorFlow:

import tensorflow as tf

# 1. Specify the neural network architecture
n_inputs = 4   # == env.observation_space.shape[0]
n_hidden = 4   # it's a simple task, we don't need more hidden neurons
n_outputs = 1  # only outputs the probability of accelerating left
initializer = tf.contrib.layers.variance_scaling_initializer()

# 2. Build the neural network
X = tf.placeholder(tf.float32, shape=[None, n_inputs])
hidden = tf.layers.dense(X, n_hidden, activation=tf.nn.elu,
                         kernel_initializer=initializer)
logits = tf.layers.dense(hidden, n_outputs,
                         kernel_initializer=initializer)
outputs = tf.nn.sigmoid(logits)

# 3. Select a random action based on the estimated probabilities
p_left_and_right = tf.concat(axis=1, values=[outputs, 1 - outputs])
action = tf.multinomial(tf.log(p_left_and_right), num_samples=1)

init = tf.global_variables_initializer()
Let’sgothroughthiscode:1. Aftertheimports,wedefinetheneuralnetworkarchitecture.Thenumberofinputsisthesizeofthe
observationspace(whichinthecaseoftheCartPoleisfour),wejusthavefourhiddenunitsandnoneedformore,andwehavejustoneoutputprobability(theprobabilityofgoingleft).
2. Nextwebuildtheneuralnetwork.Inthisexample,it’savanillaMulti-LayerPerceptron,withasingleoutput.Notethattheoutputlayerusesthelogistic(sigmoid)activationfunctioninordertooutputaprobabilityfrom0.0to1.0.Ifthereweremorethantwopossibleactions,therewouldbeoneoutputneuronperaction,andyouwouldusethesoftmaxactivationfunctioninstead.
3. Lastly,wecallthemultinomial()functiontopickarandomaction.Thisfunctionindependentlysamplesone(ormore)integers,giventhelogprobabilityofeachinteger.Forexample,ifyoucallitwiththearray[np.log(0.5),np.log(0.2),np.log(0.3)]andwithnum_samples=5,thenitwilloutputfiveintegers,eachofwhichwillhavea50%probabilityofbeing0,20%ofbeing1,and30%ofbeing2.Inourcasewejustneedoneintegerrepresentingtheactiontotake.Sincetheoutputstensoronlycontainstheprobabilityofgoingleft,wemustfirstconcatenate1-outputstoittohaveatensorcontainingtheprobabilityofbothleftandrightactions.Notethatifthereweremorethantwopossibleactions,theneuralnetworkwouldhavetooutputoneprobabilityperactionsoyouwouldnotneedtheconcatenationstep.
Okay,wenowhaveaneuralnetworkpolicythatwilltakeobservationsandoutputactions.Buthowdowetrainit?
EvaluatingActions:TheCreditAssignmentProblemIfweknewwhatthebestactionwasateachstep,wecouldtraintheneuralnetworkasusual,byminimizingthecrossentropybetweentheestimatedprobabilityandthetargetprobability.Itwouldjustberegularsupervisedlearning.However,inReinforcementLearningtheonlyguidancetheagentgetsisthroughrewards,andrewardsaretypicallysparseanddelayed.Forexample,iftheagentmanagestobalancethepolefor100steps,howcanitknowwhichofthe100actionsittookweregood,andwhichofthemwerebad?Allitknowsisthatthepolefellafterthelastaction,butsurelythislastactionisnotentirelyresponsible.Thisiscalledthecreditassignmentproblem:whentheagentgetsareward,itishardforittoknowwhichactionsshouldgetcredited(orblamed)forit.Thinkofadogthatgetsrewardedhoursafteritbehavedwell;willitunderstandwhatitisrewardedfor?
Totacklethisproblem,acommonstrategyistoevaluateanactionbasedonthesumofalltherewardsthatcomeafterit,usuallyapplyingadiscountraterateachstep.Forexample(seeFigure16-6),ifanagentdecidestogorightthreetimesinarowandgets+10rewardafterthefirststep,0afterthesecondstep,andfinally–50afterthethirdstep,thenassumingweuseadiscountrater=0.8,thefirstactionwillhaveatotalscoreof10+r×0+r2×(–50)=–22.Ifthediscountrateiscloseto0,thenfuturerewardswon’tcountformuchcomparedtoimmediaterewards.Conversely,ifthediscountrateiscloseto1,thenrewardsfarintothefuturewillcountalmostasmuchasimmediaterewards.Typicaldiscountratesare0.95or0.99.Withadiscountrateof0.95,rewards13stepsintothefuturecountroughlyforhalfasmuchasimmediaterewards(since0.9513≈0.5),whilewithadiscountrateof0.99,rewards69stepsintothefuturecountforhalfasmuchasimmediaterewards.IntheCartPoleenvironment,actionshavefairlyshort-termeffects,sochoosingadiscountrateof0.95seemsreasonable.
Figure16-6.Discountedrewards
Ofcourse,agoodactionmaybefollowedbyseveralbadactionsthatcausethepoletofallquickly,resultinginthegoodactiongettingalowscore(similarly,agoodactormaysometimesstarinaterriblemovie).However,ifweplaythegameenoughtimes,onaveragegoodactionswillgetabetterscorethanbadones.So,togetfairlyreliableactionscores,wemustrunmanyepisodesandnormalizealltheaction
scores(bysubtractingthemeananddividingbythestandarddeviation).Afterthat,wecanreasonablyassumethatactionswithanegativescorewerebadwhileactionswithapositivescoreweregood.Perfect—nowthatwehaveawaytoevaluateeachaction,wearereadytotrainourfirstagentusingpolicygradients.Let’sseehow.
PolicyGradientsAsdiscussedearlier,PGalgorithmsoptimizetheparametersofapolicybyfollowingthegradientstowardhigherrewards.OnepopularclassofPGalgorithms,calledREINFORCEalgorithms,wasintroducedbackin19929byRonaldWilliams.Hereisonecommonvariant:1. First,lettheneuralnetworkpolicyplaythegameseveraltimesandateachstepcomputethe
gradientsthatwouldmakethechosenactionevenmorelikely,butdon’tapplythesegradientsyet.
2. Onceyouhaverunseveralepisodes,computeeachaction’sscore(usingthemethoddescribedinthepreviousparagraph).
3. Ifanaction’sscoreispositive,itmeansthattheactionwasgoodandyouwanttoapplythegradientscomputedearliertomaketheactionevenmorelikelytobechoseninthefuture.However,ifthescoreisnegative,itmeanstheactionwasbadandyouwanttoapplytheoppositegradientstomakethisactionslightlylesslikelyinthefuture.Thesolutionissimplytomultiplyeachgradientvectorbythecorrespondingaction’sscore.
4. Finally,computethemeanofalltheresultinggradientvectors,anduseittoperformaGradientDescentstep.
Let’simplementthisalgorithmusingTensorFlow.Wewilltraintheneuralnetworkpolicywebuiltearliersothatitlearnstobalancethepoleonthecart.Let’sstartbycompletingtheconstructionphasewecodedearliertoaddthetargetprobability,thecostfunction,andthetrainingoperation.Sinceweareactingasthoughthechosenactionisthebestpossibleaction,thetargetprobabilitymustbe1.0ifthechosenactionisaction0(left)and0.0ifitisaction1(right):
y = 1. - tf.to_float(action)

Now that we have a target probability, we can define the cost function (cross entropy) and compute the gradients:

learning_rate = 0.01

cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits(labels=y,
                                                        logits=logits)
optimizer = tf.train.AdamOptimizer(learning_rate)
grads_and_vars = optimizer.compute_gradients(cross_entropy)
Notethatwearecallingtheoptimizer’scompute_gradients()methodinsteadoftheminimize()method.Thisisbecausewewanttotweakthegradientsbeforeweapplythem.10Thecompute_gradients()methodreturnsalistofgradientvector/variablepairs(onepairpertrainablevariable).Let’sputallthegradientsinalist,tomakeitmoreconvenienttoobtaintheirvalues:
gradients = [grad for grad, variable in grads_and_vars]
Okay,nowcomesthetrickypart.Duringtheexecutionphase,thealgorithmwillrunthepolicyandateach
stepitwillevaluatethesegradienttensorsandstoretheirvalues.Afteranumberofepisodesitwilltweakthesegradientsasexplainedearlier(i.e.,multiplythembytheactionscoresandnormalizethem)andcomputethemeanofthetweakedgradients.Next,itwillneedtofeedtheresultinggradientsbacktotheoptimizersothatitcanperformanoptimizationstep.Thismeansweneedoneplaceholderpergradientvector.Moreover,wemustcreatetheoperationthatwillapplytheupdatedgradients.Forthiswewillcalltheoptimizer’sapply_gradients()function,whichtakesalistofgradientvector/variablepairs.Insteadofgivingittheoriginalgradientvectors,wewillgiveitalistcontainingtheupdatedgradients(i.e.,theonesfedthroughthegradientplaceholders):
gradient_placeholders = []
grads_and_vars_feed = []
for grad, variable in grads_and_vars:
    gradient_placeholder = tf.placeholder(tf.float32, shape=grad.get_shape())
    gradient_placeholders.append(gradient_placeholder)
    grads_and_vars_feed.append((gradient_placeholder, variable))
training_op = optimizer.apply_gradients(grads_and_vars_feed)
Let’sstepbackandtakealookatthefullconstructionphase:
n_inputs=4
n_hidden=4
n_outputs=1
initializer=tf.contrib.layers.variance_scaling_initializer()
learning_rate=0.01
X=tf.placeholder(tf.float32,shape=[None,n_inputs])
hidden=tf.layers.dense(X,n_hidden,activation=tf.nn.elu,
kernel_initializer=initializer)
logits=tf.layers.dense(hidden,n_outputs,
kernel_initializer=initializer)
outputs=tf.nn.sigmoid(logits)
p_left_and_right=tf.concat(axis=1,values=[outputs,1-outputs])
action=tf.multinomial(tf.log(p_left_and_right),num_samples=1)
y=1.-tf.to_float(action)
cross_entropy=tf.nn.sigmoid_cross_entropy_with_logits(
labels=y,logits=logits)
optimizer=tf.train.AdamOptimizer(learning_rate)
grads_and_vars=optimizer.compute_gradients(cross_entropy)
gradients=[gradforgrad,variableingrads_and_vars]
gradient_placeholders=[]
grads_and_vars_feed=[]
forgrad,variableingrads_and_vars:
gradient_placeholder=tf.placeholder(tf.float32,shape=grad.get_shape())
gradient_placeholders.append(gradient_placeholder)
grads_and_vars_feed.append((gradient_placeholder,variable))
training_op=optimizer.apply_gradients(grads_and_vars_feed)
init=tf.global_variables_initializer()
saver=tf.train.Saver()
On to the execution phase! We will need a couple of functions to compute the total discounted rewards, given the raw rewards, and to normalize the results across multiple episodes:

def discount_rewards(rewards, discount_rate):
    discounted_rewards = np.empty(len(rewards))
    cumulative_rewards = 0
    for step in reversed(range(len(rewards))):
        cumulative_rewards = rewards[step] + cumulative_rewards * discount_rate
        discounted_rewards[step] = cumulative_rewards
    return discounted_rewards

def discount_and_normalize_rewards(all_rewards, discount_rate):
    all_discounted_rewards = [discount_rewards(rewards, discount_rate)
                              for rewards in all_rewards]
    flat_rewards = np.concatenate(all_discounted_rewards)
    reward_mean = flat_rewards.mean()
    reward_std = flat_rewards.std()
    return [(discounted_rewards - reward_mean) / reward_std
            for discounted_rewards in all_discounted_rewards]
Let’scheckthatthisworks:
>>>discount_rewards([10,0,-50],discount_rate=0.8)
array([-22.,-40.,-50.])
>>>discount_and_normalize_rewards([[10,0,-50],[10,20]],discount_rate=0.8)
[array([-0.28435071,-0.86597718,-1.18910299]),
array([1.26665318,1.0727777])]
Thecalltodiscount_rewards()returnsexactlywhatweexpect(seeFigure16-6).Youcanverifythatthefunctiondiscount_and_normalize_rewards()doesindeedreturnthenormalizedscoresforeachactioninbothepisodes.Noticethatthefirstepisodewasmuchworsethanthesecond,soitsnormalizedscoresareallnegative;allactionsfromthefirstepisodewouldbeconsideredbad,andconverselyallactionsfromthesecondepisodewouldbeconsideredgood.
We now have all we need to train the policy:

n_iterations = 250       # number of training iterations
n_max_steps = 1000       # max steps per episode
n_games_per_update = 10  # train the policy every 10 episodes
save_iterations = 10     # save the model every 10 training iterations
discount_rate = 0.95

with tf.Session() as sess:
    init.run()
    for iteration in range(n_iterations):
        all_rewards = []    # all sequences of raw rewards for each episode
        all_gradients = []  # gradients saved at each step of each episode
        for game in range(n_games_per_update):
            current_rewards = []    # all raw rewards from the current episode
            current_gradients = []  # all gradients from the current episode
            obs = env.reset()
            for step in range(n_max_steps):
                action_val, gradients_val = sess.run(
                    [action, gradients],
                    feed_dict={X: obs.reshape(1, n_inputs)})  # one obs
                obs, reward, done, info = env.step(action_val[0][0])
                current_rewards.append(reward)
                current_gradients.append(gradients_val)
                if done:
                    break
            all_rewards.append(current_rewards)
            all_gradients.append(current_gradients)

        # At this point we have run the policy for 10 episodes, and we are
        # ready for a policy update using the algorithm described earlier.
        all_rewards = discount_and_normalize_rewards(all_rewards, discount_rate)
        feed_dict = {}
        for var_index, grad_placeholder in enumerate(gradient_placeholders):
            # multiply the gradients by the action scores, and compute the mean
            mean_gradients = np.mean(
                [reward * all_gradients[game_index][step][var_index]
                 for game_index, rewards in enumerate(all_rewards)
                 for step, reward in enumerate(rewards)],
                axis=0)
            feed_dict[grad_placeholder] = mean_gradients
        sess.run(training_op, feed_dict=feed_dict)
        if iteration % save_iterations == 0:
            saver.save(sess, "./my_policy_net_pg.ckpt")
Each training iteration starts by running the policy for 10 episodes (with a maximum of 1,000 steps per episode, to avoid running forever). At each step, we also compute the gradients, pretending that the chosen action was the best. After these 10 episodes have been run, we compute the action scores using the discount_and_normalize_rewards() function; we go through each trainable variable, across all episodes and all steps, to multiply each gradient vector by its corresponding action score; and we compute the mean of the resulting gradients. Finally, we run the training operation, feeding it these mean gradients (one per trainable variable). We also save the model every 10 training operations.

And we're done! This code will train the neural network policy, and it will successfully learn to balance the pole on the cart (you can try it out in the Jupyter notebooks). Note that there are actually two ways the agent can lose the game: either the pole can tilt too much, or the cart can go completely off the screen. With 250 training iterations, the policy learns to balance the pole quite well, but it is not yet good enough at avoiding going off the screen. A few hundred more training iterations will fix that.
TIP: Researchers try to find algorithms that work well even when the agent initially knows nothing about the environment. However, unless you are writing a paper, you should inject as much prior knowledge as possible into the agent, as it will speed up training dramatically. For example, you could add negative rewards proportional to the distance from the center of the screen, and to the pole's angle. Also, if you already have a reasonably good policy (e.g., hardcoded), you may want to train the neural network to imitate it before using policy gradients to improve it.
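For instance, here is a minimal reward-shaping sketch for CartPole (the helper name shape_reward() and the penalty coefficients are assumptions, not part of the original code). In the episode loop above you could replace current_rewards.append(reward) with current_rewards.append(shape_reward(obs, reward)):

def shape_reward(obs, reward, position_penalty=0.1, angle_penalty=0.5):
    """Penalize the cart's distance from the center and the pole's angle.

    CartPole-v0 observations are [position, velocity, angle, angular velocity];
    the penalty coefficients here are illustrative guesses to be tuned.
    """
    position, _, angle, _ = obs
    return reward - position_penalty * abs(position) - angle_penalty * abs(angle)

Shaping like this does not change anything in the policy gradient machinery; it only makes the reward signal more informative, which usually speeds up learning.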
Despite its relative simplicity, this algorithm is quite powerful. You can use it to tackle much harder problems than balancing a pole on a cart. In fact, AlphaGo was based on a similar PG algorithm (plus Monte Carlo Tree Search, which is beyond the scope of this book).

We will now look at another popular family of algorithms. Whereas PG algorithms directly try to optimize the policy to increase rewards, the algorithms we will look at now are less direct: the agent learns to estimate the expected sum of discounted future rewards for each state, or the expected sum of discounted future rewards for each action in each state, then uses this knowledge to decide how to act. To understand these algorithms, we must first introduce Markov decision processes (MDPs).

Markov Decision Processes

In the early 20th century, the mathematician Andrey Markov studied stochastic processes with no memory, called Markov chains. Such a process has a fixed number of states, and it randomly evolves from one state to another at each step. The probability for it to evolve from a state s to a state s′ is fixed, and it depends only on the pair (s, s′), not on past states (the system has no memory).
Figure 16-7 shows an example of a Markov chain with four states. Suppose that the process starts in state s0, and there is a 70% chance that it will remain in that state at the next step. Eventually it is bound to leave that state and never come back since no other state points back to s0. If it goes to state s1, it will then most likely go to state s2 (90% probability), then immediately back to state s1 (with 100% probability). It may alternate a number of times between these two states, but eventually it will fall into state s3 and remain there forever (this is a terminal state). Markov chains can have very different dynamics, and they are heavily used in thermodynamics, chemistry, statistics, and much more.

Figure 16-7. Example of a Markov chain

Markov decision processes were first described in the 1950s by Richard Bellman.¹¹ They resemble Markov chains but with a twist: at each step, an agent can choose one of several possible actions, and the transition probabilities depend on the chosen action. Moreover, some state transitions return some reward (positive or negative), and the agent's goal is to find a policy that will maximize rewards over time.
For example, the MDP represented in Figure 16-8 has three states and up to three possible discrete actions at each step. If it starts in state s0, the agent can choose between actions a0, a1, or a2. If it chooses action a1, it just remains in state s0 with certainty, and without any reward. It can thus decide to stay there forever if it wants. But if it chooses action a0, it has a 70% probability of gaining a reward of +10, and remaining in state s0. It can then try again and again to gain as much reward as possible. But at one point it is going to end up instead in state s1. In state s1 it has only two possible actions: a0 or a2. It can choose to stay put by repeatedly choosing action a0, or it can choose to move on to state s2 and get a negative reward of -50 (ouch). In state s2 it has no other choice than to take action a1, which will most likely lead it back to state s0, gaining a reward of +40 on the way. You get the picture. By looking at this MDP, can you guess which strategy will gain the most reward over time? In state s0 it is clear that action a0 is the best option, and in state s2 the agent has no choice but to take action a1, but in state s1 it is not obvious whether the agent should stay put (a0) or go through the fire (a2).
Figure 16-8. Example of a Markov decision process

Bellman found a way to estimate the optimal state value of any state s, noted V*(s), which is the sum of all discounted future rewards the agent can expect on average after it reaches a state s, assuming it acts optimally. He showed that if the agent acts optimally, then the Bellman Optimality Equation applies (see Equation 16-1). This recursive equation says that if the agent acts optimally, then the optimal value of the current state is equal to the reward it will get on average after taking one optimal action, plus the expected optimal value of all possible next states that this action can lead to.
Equation 16-1. Bellman Optimality Equation

V*(s) = max_a Σ_s′ T(s, a, s′) [R(s, a, s′) + γ · V*(s′)]   for all s
T(s, a, s′) is the transition probability from state s to state s′, given that the agent chose action a.

R(s, a, s′) is the reward that the agent gets when it goes from state s to state s′, given that the agent chose action a.

γ is the discount rate.
This equation leads directly to an algorithm that can precisely estimate the optimal state value of every possible state: you first initialize all the state value estimates to zero, and then you iteratively update them using the Value Iteration algorithm (see Equation 16-2). A remarkable result is that, given enough time, these estimates are guaranteed to converge to the optimal state values, corresponding to the optimal policy.
Equation 16-2. Value Iteration algorithm

V_{k+1}(s) ← max_a Σ_s′ T(s, a, s′) [R(s, a, s′) + γ · V_k(s′)]   for all s

V_k(s) is the estimated value of state s at the kth iteration of the algorithm.
NOTE: This algorithm is an example of Dynamic Programming, which breaks down a complex problem (in this case estimating a potentially infinite sum of discounted future rewards) into tractable subproblems that can be tackled iteratively (in this case finding the action that maximizes the average reward plus the discounted next state value).
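To make the iteration concrete before we move on to Q-Values, here is a minimal NumPy sketch of Value Iteration on a tiny made-up two-state MDP (the transition and reward arrays below are illustrative assumptions, not the MDP from Figure 16-8):

import numpy as np

# A tiny hypothetical MDP: 2 states, 2 actions, arrays have shape [s, a, s']
T = np.array([[[1.0, 0.0], [0.2, 0.8]],
              [[0.0, 1.0], [0.9, 0.1]]])
R = np.array([[[1.0, 0.0], [0.0, 5.0]],
              [[0.0, 0.0], [2.0, 0.0]]])
discount_rate = 0.95
n_iterations = 100

V = np.zeros(2)  # one value estimate per state, initialized to zero
for iteration in range(n_iterations):
    V_prev = V.copy()
    for s in range(2):
        # Equation 16-2: best action's expected reward plus discounted next value
        V[s] = np.max([
            np.sum([T[s, a, sp] * (R[s, a, sp] + discount_rate * V_prev[sp])
                    for sp in range(2)])
            for a in range(2)])

print(V)  # converges to the optimal state values for this toy MDP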
Knowing the optimal state values can be useful, in particular to evaluate a policy, but it does not tell the agent explicitly what to do. Luckily, Bellman found a very similar algorithm to estimate the optimal state-action values, generally called Q-Values. The optimal Q-Value of the state-action pair (s, a), noted Q*(s, a), is the sum of discounted future rewards the agent can expect on average after it reaches the state s and chooses action a, but before it sees the outcome of this action, assuming it acts optimally after that action.

Here is how it works: once again, you start by initializing all the Q-Value estimates to zero, then you update them using the Q-Value Iteration algorithm (see Equation 16-3).
Equation 16-3. Q-Value Iteration algorithm

Q_{k+1}(s, a) ← Σ_s′ T(s, a, s′) [R(s, a, s′) + γ · max_a′ Q_k(s′, a′)]   for all (s, a)
Once you have the optimal Q-Values, defining the optimal policy, noted π*(s), is trivial: when the agent is in state s, it should choose the action with the highest Q-Value for that state: π*(s) = argmax_a Q*(s, a).

Let's apply this algorithm to the MDP represented in Figure 16-8. First, we need to define the MDP:
nan = np.nan  # represents impossible actions
T = np.array([  # shape=[s, a, s']
    [[0.7, 0.3, 0.0], [1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
    [[0.0, 1.0, 0.0], [nan, nan, nan], [0.0, 0.0, 1.0]],
    [[nan, nan, nan], [0.8, 0.1, 0.1], [nan, nan, nan]],
])
R = np.array([  # shape=[s, a, s']
    [[10., 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
    [[10., 0.0, 0.0], [nan, nan, nan], [0.0, 0.0, -50.]],
    [[nan, nan, nan], [40., 0.0, 0.0], [nan, nan, nan]],
])
possible_actions = [[0, 1, 2], [0, 2], [1]]
Nowlet’sruntheQ-ValueIterationalgorithm:
Q = np.full((3, 3), -np.inf)  # -inf for impossible actions
for state, actions in enumerate(possible_actions):
    Q[state, actions] = 0.0  # Initial value = 0.0, for all possible actions

learning_rate = 0.01
discount_rate = 0.95
n_iterations = 100

for iteration in range(n_iterations):
    Q_prev = Q.copy()
    for s in range(3):
        for a in possible_actions[s]:
            Q[s, a] = np.sum([
                T[s, a, sp] * (R[s, a, sp] + discount_rate * np.max(Q_prev[sp]))
                for sp in range(3)
            ])
The resulting Q-Values look like this:

>>> Q
array([[ 21.89498982,  20.80024033,  16.86353093],
       [  1.11669335,         -inf,   1.17573546],
       [        -inf,  53.86946068,         -inf]])
>>> np.argmax(Q, axis=1)  # optimal action for each state
array([0, 2, 1])
This gives us the optimal policy for this MDP, when using a discount rate of 0.95: in state s0 choose action a0, in state s1 choose action a2 (go through the fire!), and in state s2 choose action a1 (the only possible action). Interestingly, if you reduce the discount rate to 0.9, the optimal policy changes: in state s1 the best action becomes a0 (stay put; don't go through the fire). It makes sense because if you value the present much more than the future, then the prospect of future rewards is not worth immediate pain.
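You can check this yourself by re-running the same Q-Value Iteration loop with a discount rate of 0.90 (a quick sketch, reusing the T, R and possible_actions arrays defined above):

Q90 = np.full((3, 3), -np.inf)
for state, actions in enumerate(possible_actions):
    Q90[state, actions] = 0.0

for iteration in range(100):
    Q_prev = Q90.copy()
    for s in range(3):
        for a in possible_actions[s]:
            Q90[s, a] = np.sum([
                T[s, a, sp] * (R[s, a, sp] + 0.90 * np.max(Q_prev[sp]))
                for sp in range(3)])

# With this lower discount rate the best action in s1 becomes a0,
# as described in the text above:
print(np.argmax(Q90, axis=1))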
Temporal Difference Learning and Q-Learning

Reinforcement Learning problems with discrete actions can often be modeled as Markov decision processes, but the agent initially has no idea what the transition probabilities are (it does not know T(s, a, s′)), and it does not know what the rewards are going to be either (it does not know R(s, a, s′)). It must experience each state and each transition at least once to know the rewards, and it must experience them multiple times if it is to have a reasonable estimate of the transition probabilities.

The Temporal Difference Learning (TD Learning) algorithm is very similar to the Value Iteration algorithm, but tweaked to take into account the fact that the agent has only partial knowledge of the MDP. In general we assume that the agent initially knows only the possible states and actions, and nothing more. The agent uses an exploration policy (for example, a purely random policy) to explore the MDP, and as it progresses the TD Learning algorithm updates the estimates of the state values based on the transitions and rewards that are actually observed (see Equation 16-4).
Equation 16-4. TD Learning algorithm

V_{k+1}(s) ← (1 − α) V_k(s) + α (r + γ · V_k(s′))

α is the learning rate (e.g., 0.01).
TIP: TD Learning has many similarities with Stochastic Gradient Descent, in particular the fact that it handles one sample at a time. Just like SGD, it can only truly converge if you gradually reduce the learning rate (otherwise it will keep bouncing around the optimum).

For each state s, this algorithm simply keeps track of a running average of the immediate rewards the agent gets upon leaving that state, plus the rewards it expects to get later (assuming it acts optimally).
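Here is a minimal sketch of what TD Learning could look like on this MDP, using a purely random exploration policy and reusing the T, R and possible_actions arrays defined earlier (with a random policy it estimates the values of that exploration policy; the fixed learning rate is an illustrative assumption):

import numpy as np
import numpy.random as rnd

V = np.zeros(3)       # one value estimate per state
alpha = 0.01          # learning rate
discount_rate = 0.95
s = 0                 # start in state 0

for iteration in range(20000):
    a = rnd.choice(possible_actions[s])    # random exploration policy
    sp = rnd.choice(range(3), p=T[s, a])   # sample the next state from T
    reward = R[s, a, sp]
    # TD update (Equation 16-4): nudge V(s) toward the observed target
    V[s] = (1 - alpha) * V[s] + alpha * (reward + discount_rate * V[sp])
    s = sp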
Similarly, the Q-Learning algorithm is an adaptation of the Q-Value Iteration algorithm to the situation where the transition probabilities and the rewards are initially unknown (see Equation 16-5).
Equation 16-5. Q-Learning algorithm

Q_{k+1}(s, a) ← (1 − α) Q_k(s, a) + α (r + γ · max_a′ Q_k(s′, a′))
For each state-action pair (s, a), this algorithm keeps track of a running average of the rewards r the agent gets upon leaving the state s with action a, plus the rewards it expects to get later. Since the target policy would act optimally, we take the maximum of the Q-Value estimates for the next state.

Here is how Q-Learning can be implemented:
import numpy.random as rnd

learning_rate0 = 0.05
learning_rate_decay = 0.1
n_iterations = 20000

s = 0  # start in state 0

Q = np.full((3, 3), -np.inf)  # -inf for impossible actions
for state, actions in enumerate(possible_actions):
    Q[state, actions] = 0.0  # Initial value = 0.0, for all possible actions

for iteration in range(n_iterations):
    a = rnd.choice(possible_actions[s])   # choose an action (randomly)
    sp = rnd.choice(range(3), p=T[s, a])  # pick next state using T[s, a]
    reward = R[s, a, sp]
    learning_rate = learning_rate0 / (1 + iteration * learning_rate_decay)
    # Q-Learning update (Equation 16-5): keep a running average of the targets
    Q[s, a] = ((1 - learning_rate) * Q[s, a] +
               learning_rate * (reward + discount_rate * np.max(Q[sp])))
    s = sp  # move to next state
Given enough iterations, this algorithm will converge to the optimal Q-Values. This is called an off-policy algorithm because the policy being trained is not the one being executed. It is somewhat surprising that this algorithm is capable of learning the optimal policy by just watching an agent act randomly (imagine learning to play golf when your teacher is a drunken monkey). Can we do better?

Exploration Policies

Of course Q-Learning can work only if the exploration policy explores the MDP thoroughly enough. Although a purely random policy is guaranteed to eventually visit every state and every transition many times, it may take an extremely long time to do so. Therefore, a better option is to use the ε-greedy policy: at each step it acts randomly with probability ε, or greedily (choosing the action with the highest Q-Value) with probability 1 − ε. The advantage of the ε-greedy policy (compared to a completely random policy) is that it will spend more and more time exploring the interesting parts of the environment, as the Q-Value estimates get better and better, while still spending some time visiting unknown regions of the MDP. It is quite common to start with a high value for ε (e.g., 1.0) and then gradually reduce it (e.g., down to 0.05).
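For example, replacing the random action choice in the Q-Learning loop above with an ε-greedy choice could look like the following sketch (the helper function and the epsilon schedule shown in the comments are illustrative assumptions):

import numpy as np
import numpy.random as rnd

def epsilon_greedy_action(Q, s, possible_actions, epsilon):
    """With probability epsilon explore, otherwise exploit the current Q-Values."""
    if rnd.rand() < epsilon:
        return rnd.choice(possible_actions[s])                             # explore
    return possible_actions[s][np.argmax(Q[s, possible_actions[s]])]       # exploit

# Inside the training loop, with epsilon decayed from 1.0 down to 0.05:
#     epsilon = max(0.05, 1.0 - iteration / n_iterations)
#     a = epsilon_greedy_action(Q, s, possible_actions, epsilon)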
Alternatively, rather than relying on chance for exploration, another approach is to encourage the exploration policy to try actions that it has not tried much before. This can be implemented as a bonus added to the Q-Value estimates, as shown in Equation 16-6.
Equation 16-6. Q-Learning using an exploration function

Q_{k+1}(s, a) ← (1 − α) Q_k(s, a) + α (r + γ · max_a′ f(Q_k(s′, a′), N(s′, a′)))
N(s′, a′) counts the number of times the action a′ was chosen in state s′.

f(q, n) is an exploration function, such as f(q, n) = q + K/(1 + n), where K is a curiosity hyperparameter that measures how much the agent is attracted to the unknown (a small sketch of this bonus follows).
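A sketch of such an exploration bonus for the tabular Q-Learning code above might look like this (the value of K and the helper name are assumptions):

import numpy as np

K = 20.0               # curiosity hyperparameter (illustrative value)
N = np.zeros((3, 3))   # visit counts N(s, a), incremented each time (s, a) is taken

def exploration_value(q, n):
    """f(q, n) = q + K / (1 + n): boosts actions that have rarely been tried."""
    return q + K / (1.0 + n)

# In the Q-Learning update, replace np.max(Q[sp]) with the exploration-boosted
# maximum over the next state's actions, as in Equation 16-6:
#     target_value = max(exploration_value(Q[sp, ap], N[sp, ap])
#                        for ap in possible_actions[sp])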
Approximate Q-Learning

The main problem with Q-Learning is that it does not scale well to large (or even medium) MDPs with many states and actions. Consider trying to use Q-Learning to train an agent to play Ms. Pac-Man. There are over 250 pellets that Ms. Pac-Man can eat, each of which can be present or absent (i.e., already eaten). So the number of possible states is greater than 2^250 ≈ 10^75 (and that's considering the possible states only of the pellets). This is way more than atoms in the observable universe, so there's absolutely no way you can keep track of an estimate for every single Q-Value.
The solution is to find a function that approximates the Q-Values using a manageable number of parameters. This is called Approximate Q-Learning. For years it was recommended to use linear combinations of hand-crafted features extracted from the state (e.g., distance of the closest ghosts, their directions, and so on) to estimate Q-Values, but DeepMind showed that using deep neural networks can work much better, especially for complex problems, and it does not require any feature engineering. A DNN used to estimate Q-Values is called a deep Q-network (DQN), and using a DQN for Approximate Q-Learning is called Deep Q-Learning.

In the rest of this chapter, we will use Deep Q-Learning to train an agent to play Ms. Pac-Man, much like DeepMind did in 2013. The code can easily be tweaked to learn to play the majority of Atari games quite well. It can achieve superhuman skill at most action games, but it is not so good at games with long-running storylines.
Learning to Play Ms. Pac-Man Using Deep Q-Learning

Since we will be using an Atari environment, we must first install OpenAI gym's Atari dependencies. While we're at it, we will also install dependencies for other OpenAI gym environments that you may want to play with. On macOS, assuming you have installed Homebrew, you need to run:

$ brew install cmake boost boost-python sdl2 swig wget

On Ubuntu, type the following command (replacing python3 with python if you are using Python 2):

$ apt-get install -y python3-numpy python3-dev cmake zlib1g-dev libjpeg-dev \
    xvfb libav-tools xorg-dev python3-opengl libboost-all-dev libsdl2-dev swig

Then install the extra Python modules:

$ pip3 install --upgrade 'gym[all]'

If everything went well, you should be able to create a Ms. Pac-Man environment:
>>> env = gym.make("MsPacman-v0")
>>> obs = env.reset()
>>> obs.shape  # [height, width, channels]
(210, 160, 3)
>>> env.action_space
Discrete(9)

As you can see, there are nine discrete actions available, which correspond to the nine possible positions of the joystick (left, right, up, down, center, upper left, and so on), and the observations are simply screenshots of the Atari screen (see Figure 16-9, left), represented as 3D NumPy arrays. These images are a bit large, so we will create a small preprocessing function that will crop the image and shrink it down to 88 × 80 pixels, convert it to grayscale, and improve the contrast of Ms. Pac-Man. This will reduce the amount of computations required by the DQN, and speed up training.
mspacman_color = np.array([210, 164, 74]).mean()

def preprocess_observation(obs):
    img = obs[1:176:2, ::2]         # crop and downsize
    img = img.mean(axis=2)          # to greyscale
    img[img == mspacman_color] = 0  # improve contrast
    img = (img - 128) / 128 - 1     # normalize from -1. to 1.
    return img.reshape(88, 80, 1)
The result of preprocessing is shown in Figure 16-9 (right).

Figure 16-9. Ms. Pac-Man observation, original (left) and after preprocessing (right)

Next, let's create the DQN. It could just take a state-action pair (s, a) as input, and output an estimate of the corresponding Q-Value Q(s, a), but since the actions are discrete it is more convenient to use a neural network that takes only a state s as input and outputs one Q-Value estimate per action. The DQN will be composed of three convolutional layers, followed by two fully connected layers, including the output layer (see Figure 16-10).

Figure 16-10. Deep Q-network to play Ms. Pac-Man

As we will see, the training algorithm we will use requires two DQNs with the same architecture (but different parameters): one will be used to drive Ms. Pac-Man during training (the actor), and the other will watch the actor and learn from its trials and errors (the critic). At regular intervals we will copy the critic to the actor. Since we need two identical DQNs, we will create a q_network() function to build them:
input_height = 88
input_width = 80
input_channels = 1
conv_n_maps = [32, 64, 64]
conv_kernel_sizes = [(8, 8), (4, 4), (3, 3)]
conv_strides = [4, 2, 1]
conv_paddings = ["SAME"] * 3
conv_activation = [tf.nn.relu] * 3
n_hidden_in = 64 * 11 * 10  # conv3 has 64 maps of 11x10 each
n_hidden = 512
hidden_activation = tf.nn.relu
n_outputs = env.action_space.n  # 9 discrete actions are available
initializer = tf.contrib.layers.variance_scaling_initializer()

def q_network(X_state, name):
    prev_layer = X_state
    conv_layers = []
    with tf.variable_scope(name) as scope:
        for n_maps, kernel_size, stride, padding, activation in zip(
                conv_n_maps, conv_kernel_sizes, conv_strides,
                conv_paddings, conv_activation):
            prev_layer = tf.layers.conv2d(
                prev_layer, filters=n_maps, kernel_size=kernel_size,
                strides=stride, padding=padding, activation=activation,
                kernel_initializer=initializer)
            conv_layers.append(prev_layer)
        last_conv_layer_flat = tf.reshape(prev_layer, shape=[-1, n_hidden_in])
        hidden = tf.layers.dense(last_conv_layer_flat, n_hidden,
                                 activation=hidden_activation,
                                 kernel_initializer=initializer)
        outputs = tf.layers.dense(hidden, n_outputs,
                                  kernel_initializer=initializer)
    trainable_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES,
                                       scope=scope.name)
    trainable_vars_by_name = {var.name[len(scope.name):]: var
                              for var in trainable_vars}
    return outputs, trainable_vars_by_name
The first part of this code defines the hyperparameters of the DQN architecture. Then the q_network() function creates the DQN, taking the environment's state X_state as input, and the name of the variable scope. Note that we will just use one observation to represent the environment's state since there's almost no hidden state (except for blinking objects and the ghosts' directions).

The trainable_vars_by_name dictionary gathers all the trainable variables of this DQN. It will be useful in a minute when we create operations to copy the critic DQN to the actor DQN. The keys of the dictionary are the names of the variables, stripping the part of the prefix that just corresponds to the scope's name. It looks like this:

>>> trainable_vars_by_name
{'/conv2d/bias:0': <tensorflow.python.ops.variables.Variable at 0x121cf7b50>,
 '/conv2d/kernel:0': <tensorflow.python.ops.variables.Variable...>,
 '/conv2d_1/bias:0': <tensorflow.python.ops.variables.Variable...>,
 '/conv2d_1/kernel:0': <tensorflow.python.ops.variables.Variable...>,
 '/conv2d_2/bias:0': <tensorflow.python.ops.variables.Variable...>,
 '/conv2d_2/kernel:0': <tensorflow.python.ops.variables.Variable...>,
 '/dense/bias:0': <tensorflow.python.ops.variables.Variable...>,
 '/dense/kernel:0': <tensorflow.python.ops.variables.Variable...>,
 '/dense_1/bias:0': <tensorflow.python.ops.variables.Variable...>,
 '/dense_1/kernel:0': <tensorflow.python.ops.variables.Variable...>}
Nowlet’screatetheinputplaceholder,thetwoDQNs,andtheoperationtocopythecriticDQNtotheactorDQN:
X_state=tf.placeholder(tf.float32,shape=[None,input_height,input_width,
input_channels])
actor_q_values,actor_vars=q_network(X_state,name="q_networks/actor")
critic_q_values,critic_vars=q_network(X_state,name="q_networks/critic")
copy_ops=[actor_var.assign(critic_vars[var_name])
forvar_name,actor_varinactor_vars.items()]
copy_critic_to_actor=tf.group(*copy_ops)
Let’sstepbackforasecond:wenowhavetwoDQNsthatarebothcapableoftakinganenvironmentstate(i.e.,apreprocessedobservation)asinputandoutputtinganestimatedQ-Valueforeachpossibleactioninthatstate.Pluswehaveanoperationcalledcopy_critic_to_actortocopyallthetrainablevariablesofthecriticDQNtotheactorDQN.WeuseTensorFlow’stf.group()functiontogroupalltheassignmentoperationsintoasingleconvenientoperation.
TheactorDQNcanbeusedtoplayMs.Pac-Man(initiallyverybadly).Asdiscussedearlier,youwantittoexplorethegamethoroughlyenough,soyougenerallywanttocombineitwithanε-greedypolicyoranotherexplorationstrategy.
ButwhataboutthecriticDQN?Howwillitlearntoplaythegame?TheshortansweristhatitwilltrytomakeitsQ-ValuepredictionsmatchtheQ-Valuesestimatedbytheactorthroughitsexperienceofthegame.Specifically,wewilllettheactorplayforawhile,storingallitsexperiencesinareplaymemory.Eachmemorywillbea5-tuple(state,action,nextstate,reward,continue),wherethe“continue”itemwillbeequalto0.0whenthegameisover,or1.0otherwise.Next,atregularintervalswewillsampleabatchofmemoriesfromthereplaymemory,andwewillestimatetheQ-Valuesfromthesememories.Finally,wewilltrainthecriticDQNtopredicttheseQ-Valuesusingregularsupervisedlearningtechniques.Onceeveryfewtrainingiterations,wewillcopythecriticDQNtotheactorDQN.Andthat’sit!Equation16-7showsthecostfunctionusedtotrainthecriticDQN:
Equation 16-7. Deep Q-Learning cost function

J(θ_critic) = (1/m) Σ_{i=1..m} (y^(i) − Q(s^(i), a^(i), θ_critic))²   with   y^(i) = r^(i) + γ · max_a′ Q(s′^(i), a′, θ_actor)
s^(i), a^(i), r^(i) and s′^(i) are respectively the state, action, reward, and next state of the ith memory sampled from the replay memory.

m is the size of the memory batch.

θ_critic and θ_actor are the critic and the actor's parameters.

Q(s^(i), a^(i), θ_critic) is the critic DQN's prediction of the ith memorized state-action's Q-Value.

Q(s′^(i), a′, θ_actor) is the actor DQN's prediction of the Q-Value it can expect from the next state s′^(i) if it chooses action a′.

y^(i) is the target Q-Value for the ith memory. Note that it is equal to the reward actually observed by the actor, plus the actor's prediction of what future rewards it should expect if it were to play optimally (as far as it knows).

J(θ_critic) is the cost function used to train the critic DQN. As you can see, it is just the Mean Squared Error between the target Q-Values y^(i) as estimated by the actor DQN, and the critic DQN's predictions of these Q-Values.
NOTE: The replay memory is optional, but highly recommended. Without it, you would train the critic DQN using consecutive experiences that may be very correlated. This would introduce a lot of bias and slow down the training algorithm's convergence. By using a replay memory, we ensure that the memories fed to the training algorithm can be fairly uncorrelated.

Let's add the critic DQN's training operations. First, we need to be able to compute its predicted Q-Values for each state-action in the memory batch. Since the DQN outputs one Q-Value for every possible action, we need to keep only the Q-Value that corresponds to the action that was actually chosen in this memory. For this, we will convert the action to a one-hot vector (recall that this is a vector full of 0s except for a 1 at the index of the chosen action), and multiply it by the Q-Values: this will zero out all Q-Values except for the one corresponding to the memorized action. Then we just sum over axis 1 to obtain only the desired Q-Value prediction for each memory.
X_action = tf.placeholder(tf.int32, shape=[None])
q_value = tf.reduce_sum(critic_q_values * tf.one_hot(X_action, n_outputs),
                        axis=1, keep_dims=True)
Nextlet’saddthetrainingoperations,assumingthetargetQ-Valueswillbefedthroughaplaceholder.Wealsocreateanontrainablevariablecalledglobal_step.Theoptimizer’sminimize()operationwilltakecareofincrementingit.PluswecreatetheusualinitoperationandaSaver.
y=tf.placeholder(tf.float32,shape=[None,1])
cost=tf.reduce_mean(tf.square(y-q_value))
global_step=tf.Variable(0,trainable=False,name='global_step')
optimizer=tf.train.AdamOptimizer(learning_rate)
training_op=optimizer.minimize(cost,global_step=global_step)
init=tf.global_variables_initializer()
saver=tf.train.Saver()
That’sitfortheconstructionphase.Beforewelookattheexecutionphase,wewillneedacoupleoftools.First,let’sstartbyimplementingthereplaymemory.Wewilluseadequelistsinceitisveryefficientatpushingitemstothequeueandpoppingthemoutfromtheendofthelistwhenthemaximummemorysizeisreached.Wewillalsowriteasmallfunctiontorandomlysampleabatchofexperiencesfromthereplaymemory:
fromcollectionsimportdeque
replay_memory_size=10000
replay_memory=deque([],maxlen=replay_memory_size)
defsample_memories(batch_size):
indices=rnd.permutation(len(replay_memory))[:batch_size]
cols=[[],[],[],[],[]]#state,action,reward,next_state,continue
foridxinindices:
memory=replay_memory[idx]
forcol,valueinzip(cols,memory):
col.append(value)
cols=[np.array(col)forcolincols]
return(cols[0],cols[1],cols[2].reshape(-1,1),cols[3],
cols[4].reshape(-1,1))
Next, we will need the actor to explore the game. We will use the ε-greedy policy, and gradually decrease ε from 1.0 to 0.05, in 50,000 training steps:

eps_min = 0.05
eps_max = 1.0
eps_decay_steps = 50000

def epsilon_greedy(q_values, step):
    epsilon = max(eps_min, eps_max - (eps_max - eps_min) * step / eps_decay_steps)
    if rnd.rand() < epsilon:
        return rnd.randint(n_outputs)  # random action
    else:
        return np.argmax(q_values)     # optimal action
That’sit!Wehaveallweneedtostarttraining.Theexecutionphasedoesnotcontainanythingtoocomplex,butitisabitlong,sotakeadeepbreath.Ready?Let’sgo!First,let’sinitializeafewvariables:
n_steps=100000#totalnumberoftrainingsteps
training_start=1000#starttrainingafter1,000gameiterations
training_interval=3#runatrainingstepevery3gameiterations
save_steps=50#savethemodelevery50trainingsteps
copy_steps=25#copythecritictotheactorevery25trainingsteps
discount_rate=0.95
skip_start=90#skipthestartofeverygame(it'sjustwaitingtime)
batch_size=50
iteration=0#gameiterations
checkpoint_path="./my_dqn.ckpt"
done=True#envneedstobereset
Next,let’sopenthesessionandrunthemaintrainingloop:
withtf.Session()assess:
ifos.path.isfile(checkpoint_path):
saver.restore(sess,checkpoint_path)
else:
init.run()
whileTrue:
step=global_step.eval()
ifstep>=n_steps:
break
iteration+=1
ifdone:#gameover,startagain
obs=env.reset()
forskipinrange(skip_start):#skipthestartofeachgame
obs,reward,done,info=env.step(0)
state=preprocess_observation(obs)
#Actorevaluateswhattodo
q_values=actor_q_values.eval(feed_dict={X_state:[state]})
action=epsilon_greedy(q_values,step)
#Actorplays
obs,reward,done,info=env.step(action)
next_state=preprocess_observation(obs)
#Let'smemorizewhatjusthappened
replay_memory.append((state,action,reward,next_state,1.0-done))
state=next_state
ifiteration<training_startoriteration%training_interval!=0:
continue
#Criticlearns
X_state_val,X_action_val,rewards,X_next_state_val,continues=(
sample_memories(batch_size))
next_q_values=actor_q_values.eval(
feed_dict={X_state:X_next_state_val})
max_next_q_values=np.max(next_q_values,axis=1,keepdims=True)
y_val=rewards+continues*discount_rate*max_next_q_values
training_op.run(feed_dict={X_state:X_state_val,
X_action:X_action_val,y:y_val})
#Regularlycopycritictoactor
ifstep%copy_steps==0:
copy_critic_to_actor.run()
#Andsaveregularly
ifstep%save_steps==0:
saver.save(sess,checkpoint_path)
We start by restoring the models if a checkpoint file exists, or else we just initialize the variables normally. Then the main loop starts, where iteration counts the total number of game steps we have gone through since the program started, and step counts the total number of training steps since training started (if a checkpoint is restored, the global step is restored as well). Then the code resets the game (and skips the first boring game steps, where nothing happens). Next, the actor evaluates what to do, and plays the game, and its experience is memorized in replay memory. Then, at regular intervals (after a warmup period), the critic goes through a training step. It samples a batch of memories and asks the actor to estimate the Q-Values of all actions for the next state, and it applies Equation 16-7 to compute the target Q-Value y_val. The only tricky part here is that we must multiply the next state's Q-Values by the continues vector to zero out the Q-Values corresponding to memories where the game was over. Next we run a training operation to improve the critic's ability to predict Q-Values. Finally, at regular intervals we copy the critic to the actor, and we save the model.

TIP: Unfortunately, training is very slow: if you use your laptop for training, it will take days before Ms. Pac-Man gets any good, and if you look at the learning curve, measuring the average rewards per episode, you will notice that it is extremely noisy. At some points there may be no apparent progress for a very long time until suddenly the agent learns to survive a reasonable amount of time. As mentioned earlier, one solution is to inject as much prior knowledge as possible into the model (e.g., through preprocessing, rewards, and so on), and you can also try to bootstrap the model by first training it to imitate a basic strategy. In any case, RL still requires quite a lot of patience and tweaking, but the end result is very exciting.
Exercises

1. How would you define Reinforcement Learning? How is it different from regular supervised or unsupervised learning?
2. Can you think of three possible applications of RL that were not mentioned in this chapter? For each of them, what is the environment? What is the agent? What are possible actions? What are the rewards?
3. What is the discount rate? Can the optimal policy change if you modify the discount rate?
4. How do you measure the performance of a Reinforcement Learning agent?
5. What is the credit assignment problem? When does it occur? How can you alleviate it?
6. What is the point of using a replay memory?
7. What is an off-policy RL algorithm?
8. Use Deep Q-Learning to tackle OpenAI gym's "BipedalWalker-v2." The Q-networks do not need to be very deep for this task.
9. Use policy gradients to train an agent to play Pong, the famous Atari game (Pong-v0 in the OpenAI gym). Beware: an individual observation is insufficient to tell the direction and speed of the ball. One solution is to pass two observations at a time to the neural network policy. To reduce dimensionality and speed up training, you should definitely preprocess these images (crop, resize, and convert them to black and white), and possibly merge them into a single image (e.g., by overlaying them); one possible preprocessing sketch is shown just after this exercise list.
10. If you have about $100 to spare, you can purchase a Raspberry Pi 3 plus some cheap robotics components, install TensorFlow on the Pi, and go wild! For an example, check out this fun post by Lukas Biewald, or take a look at GoPiGo or BrickPi. Why not try to build a real-life cartpole by training the robot using policy gradients? Or build a robotic spider that learns to walk; give it rewards any time it gets closer to some objective (you will need sensors to measure the distance to the objective). The only limit is your imagination.
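For exercise 9, here is one possible Pong-v0 preprocessing sketch (the crop boundaries and background color values are assumptions based on inspecting Pong frames, not something prescribed by the exercise):

import numpy as np

def preprocess_pong(obs):
    """Crop the scoreboard, downsample by 2, and keep only paddle/ball pixels."""
    img = obs[35:195:2, ::2, 0].copy()   # crop and downsample to 80x80
    img[img == 144] = 0                  # erase background (assumed color value)
    img[img == 109] = 0                  # erase second background color (assumed)
    img[img != 0] = 1                    # paddles and ball become 1
    return img.astype(np.float32).reshape(80, 80, 1)

def merge_frames(prev_frame, frame):
    """Overlay two consecutive frames so the ball's direction and speed are visible."""
    return np.maximum(prev_frame * 0.5, frame)  # fade the older frame slightly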
Solutions to these exercises are available in Appendix A.

Thank You!

Before we close the last chapter of this book, I would like to thank you for reading it up to the last paragraph. I truly hope that you had as much pleasure reading this book as I had writing it, and that it will be useful for your projects, big or small.

If you find errors, please send feedback. More generally, I would love to know what you think, so please don't hesitate to contact me via O'Reilly, or through the ageron/handson-ml GitHub project.

Going forward, my best advice to you is to practice and practice: try going through all the exercises if you have not done so already, play with the Jupyter notebooks, join Kaggle.com or some other ML community, watch ML courses, read papers, attend conferences, meet experts. You may also want to study some topics that we did not cover in this book, including recommender systems, clustering algorithms, anomaly detection algorithms, and genetic algorithms.

My greatest hope is that this book will inspire you to build a wonderful ML application that will benefit all of us! What will it be?

Aurélien Géron, November 26th, 2016
1. For more details, be sure to check out Richard Sutton and Andrew Barto's book on RL, Reinforcement Learning: An Introduction (MIT Press), or David Silver's free online RL course at University College London.
2. "Playing Atari with Deep Reinforcement Learning," V. Mnih et al. (2013).
3. "Human-level control through deep reinforcement learning," V. Mnih et al. (2015).
4. Check out the videos of DeepMind's system learning to play Space Invaders, Breakout, and more at https://goo.gl/yTsH6X.
5. Images (a), (c), and (d) are reproduced from Wikipedia. (a) and (d) are in the public domain. (c) was created by user Stevertigo and released under Creative Commons BY-SA 2.0. (b) is a screenshot from the Ms. Pac-Man game, copyright Atari (the author believes it to be fair use in this chapter). (e) was reproduced from Pixabay, released under Creative Commons CC0.
6. It is often better to give the poor performers a slight chance of survival, to preserve some diversity in the "gene pool."
7. If there is a single parent, this is called asexual reproduction. With two (or more) parents, it is called sexual reproduction. An offspring's genome (in this case a set of policy parameters) is randomly composed of parts of its parents' genomes.
8. OpenAI is a nonprofit artificial intelligence research company, funded in part by Elon Musk. Its stated goal is to promote and develop friendly AIs that will benefit humanity (rather than exterminate it).
9. "Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning," R. Williams (1992).
10. We already did something similar in Chapter 11 when we discussed Gradient Clipping: we first computed the gradients, then we clipped them, and finally we applied the clipped gradients.
11. "A Markovian Decision Process," R. Bellman (1957).
Appendix A. Exercise Solutions

NOTE: Solutions to the coding exercises are available in the online Jupyter notebooks at https://github.com/ageron/handson-ml.

Chapter 1: The Machine Learning Landscape

1. Machine Learning is about building systems that can learn from data. Learning means getting better at some task, given some performance measure.
2. Machine Learning is great for complex problems for which we have no algorithmic solution, to replace long lists of hand-tuned rules, to build systems that adapt to fluctuating environments, and finally to help humans learn (e.g., data mining).
3. A labeled training set is a training set that contains the desired solution (a.k.a. a label) for each instance.
4. The two most common supervised tasks are regression and classification.
5. Common unsupervised tasks include clustering, visualization, dimensionality reduction, and association rule learning.
6. Reinforcement Learning is likely to perform best if we want a robot to learn to walk in various unknown terrains since this is typically the type of problem that Reinforcement Learning tackles. It might be possible to express the problem as a supervised or semisupervised learning problem, but it would be less natural.
7. If you don't know how to define the groups, then you can use a clustering algorithm (unsupervised learning) to segment your customers into clusters of similar customers. However, if you know what groups you would like to have, then you can feed many examples of each group to a classification algorithm (supervised learning), and it will classify all your customers into these groups.
8. Spam detection is a typical supervised learning problem: the algorithm is fed many emails along with their label (spam or not spam).
9. An online learning system can learn incrementally, as opposed to a batch learning system. This makes it capable of adapting rapidly to both changing data and autonomous systems, and of training on very large quantities of data.
10. Out-of-core algorithms can handle vast quantities of data that cannot fit in a computer's main memory. An out-of-core learning algorithm chops the data into mini-batches and uses online learning techniques to learn from these mini-batches.
11. An instance-based learning system learns the training data by heart; then, when given a new instance, it uses a similarity measure to find the most similar learned instances and uses them to make predictions.
12. A model has one or more model parameters that determine what it will predict given a new instance (e.g., the slope of a linear model). A learning algorithm tries to find optimal values for these parameters such that the model generalizes well to new instances. A hyperparameter is a parameter of the learning algorithm itself, not of the model (e.g., the amount of regularization to apply).
13. Model-based learning algorithms search for an optimal value for the model parameters such that the model will generalize well to new instances. We usually train such systems by minimizing a cost function that measures how bad the system is at making predictions on the training data, plus a penalty for model complexity if the model is regularized. To make predictions, we feed the new instance's features into the model's prediction function, using the parameter values found by the learning algorithm.
14. Some of the main challenges in Machine Learning are the lack of data, poor data quality, nonrepresentative data, uninformative features, excessively simple models that underfit the training data, and excessively complex models that overfit the data.
15. If a model performs great on the training data but generalizes poorly to new instances, the model is likely overfitting the training data (or we got extremely lucky on the training data). Possible solutions to overfitting are getting more data, simplifying the model (selecting a simpler algorithm, reducing the number of parameters or features used, or regularizing the model), or reducing the noise in the training data.
16. A test set is used to estimate the generalization error that a model will make on new instances, before the model is launched in production.
17. A validation set is used to compare models. It makes it possible to select the best model and tune the hyperparameters.
18. If you tune hyperparameters using the test set, you risk overfitting the test set, and the generalization error you measure will be optimistic (you may launch a model that performs worse than you expect).
19. Cross-validation is a technique that makes it possible to compare models (for model selection and hyperparameter tuning) without the need for a separate validation set. This saves precious training data.
Chapter 2: End-to-End Machine Learning Project

See the Jupyter notebooks available at https://github.com/ageron/handson-ml.

Chapter 3: Classification

See the Jupyter notebooks available at https://github.com/ageron/handson-ml.

Chapter 4: Training Models

1. If you have a training set with millions of features you can use Stochastic Gradient Descent or Mini-batch Gradient Descent, and perhaps Batch Gradient Descent if the training set fits in memory. But you cannot use the Normal Equation because the computational complexity grows quickly (more than quadratically) with the number of features.
2. If the features in your training set have very different scales, the cost function will have the shape of an elongated bowl, so the Gradient Descent algorithms will take a long time to converge. To solve this you should scale the data before training the model. Note that the Normal Equation will work just fine without scaling. Moreover, regularized models may converge to a suboptimal solution if the features are not scaled: indeed, since regularization penalizes large weights, features with smaller values will tend to be ignored compared to features with larger values.
3. Gradient Descent cannot get stuck in a local minimum when training a Logistic Regression model because the cost function is convex.¹
4. If the optimization problem is convex (such as Linear Regression or Logistic Regression), and assuming the learning rate is not too high, then all Gradient Descent algorithms will approach the global optimum and end up producing fairly similar models. However, unless you gradually reduce the learning rate, Stochastic GD and Mini-batch GD will never truly converge; instead, they will keep jumping back and forth around the global optimum. This means that even if you let them run for a very long time, these Gradient Descent algorithms will produce slightly different models.
5. If the validation error consistently goes up after every epoch, then one possibility is that the learning rate is too high and the algorithm is diverging. If the training error also goes up, then this is clearly the problem and you should reduce the learning rate. However, if the training error is not going up, then your model is overfitting the training set and you should stop training.
6. Due to their random nature, neither Stochastic Gradient Descent nor Mini-batch Gradient Descent is guaranteed to make progress at every single training iteration. So if you immediately stop training when the validation error goes up, you may stop much too early, before the optimum is reached. A better option is to save the model at regular intervals, and when it has not improved for a long time (meaning it will probably never beat the record), you can revert to the best saved model.
7. Stochastic Gradient Descent has the fastest training iteration since it considers only one training instance at a time, so it is generally the first to reach the vicinity of the global optimum (or Mini-batch GD with a very small mini-batch size). However, only Batch Gradient Descent will actually converge, given enough training time. As mentioned, Stochastic GD and Mini-batch GD will bounce around the optimum, unless you gradually reduce the learning rate.
8. If the validation error is much higher than the training error, this is likely because your model is overfitting the training set. One way to try to fix this is to reduce the polynomial degree: a model with fewer degrees of freedom is less likely to overfit. Another thing you can try is to regularize the model, for example, by adding an ℓ2 penalty (Ridge) or an ℓ1 penalty (Lasso) to the cost function. This will also reduce the degrees of freedom of the model. Lastly, you can try to increase the size of the training set.
9. If both the training error and the validation error are almost equal and fairly high, the model is likely underfitting the training set, which means it has a high bias. You should try reducing the regularization hyperparameter α.
10. Let’ssee:Amodelwithsomeregularizationtypicallyperformsbetterthanamodelwithoutanyregularization,soyoushouldgenerallypreferRidgeRegressionoverplainLinearRegression.2
LassoRegressionusesanℓ1penalty,whichtendstopushtheweightsdowntoexactlyzero.Thisleadstosparsemodels,whereallweightsarezeroexceptforthemostimportantweights.Thisisawaytoperformfeatureselectionautomatically,whichisgoodifyoususpectthatonlyafewfeaturesactuallymatter.Whenyouarenotsure,youshouldpreferRidgeRegression.
ElasticNetisgenerallypreferredoverLassosinceLassomaybehaveerraticallyinsomecases(whenseveralfeaturesarestronglycorrelatedorwhentherearemorefeaturesthantraininginstances).However,itdoesaddanextrahyperparametertotune.IfyoujustwantLassowithouttheerraticbehavior,youcanjustuseElasticNetwithanl1_ratiocloseto1.
11. Ifyouwanttoclassifypicturesasoutdoor/indooranddaytime/nighttime,sincethesearenotexclusiveclasses(i.e.,allfourcombinationsarepossible)youshouldtraintwoLogisticRegressionclassifiers.
12. SeetheJupyternotebooksavailableathttps://github.com/ageron/handson-ml.
Chapter 5: Support Vector Machines

1. The fundamental idea behind Support Vector Machines is to fit the widest possible "street" between the classes. In other words, the goal is to have the largest possible margin between the decision boundary that separates the two classes and the training instances. When performing soft margin classification, the SVM searches for a compromise between perfectly separating the two classes and having the widest possible street (i.e., a few instances may end up on the street). Another key idea is to use kernels when training on nonlinear datasets.
2. After training an SVM, a support vector is any instance located on the "street" (see the previous answer), including its border. The decision boundary is entirely determined by the support vectors. Any instance that is not a support vector (i.e., off the street) has no influence whatsoever; you could remove them, add more instances, or move them around, and as long as they stay off the street they won't affect the decision boundary. Computing the predictions only involves the support vectors, not the whole training set.
3. SVMs try to fit the largest possible "street" between the classes (see the first answer), so if the training set is not scaled, the SVM will tend to neglect small features (see Figure 5-2).
4. An SVM classifier can output the distance between the test instance and the decision boundary, and you can use this as a confidence score. However, this score cannot be directly converted into an estimation of the class probability. If you set probability=True when creating an SVM in Scikit-Learn, then after training it will calibrate the probabilities using Logistic Regression on the SVM's scores (trained by an additional five-fold cross-validation on the training data). This will add the predict_proba() and predict_log_proba() methods to the SVM.
5. This question applies only to linear SVMs since kernelized SVMs can only use the dual form. The computational complexity of the primal form of the SVM problem is proportional to the number of training instances m, while the computational complexity of the dual form is proportional to a number between m² and m³. So if there are millions of instances, you should definitely use the primal form, because the dual form will be much too slow.
6. If an SVM classifier trained with an RBF kernel underfits the training set, there might be too much regularization. To decrease it, you need to increase gamma or C (or both).
7. Let's call the QP parameters for the hard-margin problem H′, f′, A′ and b′ (see "Quadratic Programming"). The QP parameters for the soft-margin problem have m additional parameters (n_p = n + 1 + m) and m additional constraints (n_c = 2m). They can be defined like so:
H is equal to H′, plus m columns of 0s on the right and m rows of 0s at the bottom.
f is equal to f′ with m additional elements, all equal to the value of the hyperparameter C.
b is equal to b′ with m additional elements, all equal to 0.
A is equal to A′, with an extra m × m identity matrix I_m appended to the right, –I_m just below it, and the rest filled with zeros.
For the solutions to exercises 8, 9, and 10, please see the Jupyter notebooks available at https://github.com/ageron/handson-ml.
Chapter 6: Decision Trees

1. The depth of a well-balanced binary tree containing m leaves is equal to log₂(m),³ rounded up. A binary Decision Tree (one that makes only binary decisions, as is the case of all trees in Scikit-Learn) will end up more or less well balanced at the end of training, with one leaf per training instance if it is trained without restrictions. Thus, if the training set contains one million instances, the Decision Tree will have a depth of log₂(10⁶) ≈ 20 (actually a bit more since the tree will generally not be perfectly well balanced).
2. A node's Gini impurity is generally lower than its parent's. This is due to the CART training algorithm's cost function, which splits each node in a way that minimizes the weighted sum of its children's Gini impurities. However, it is possible for a node to have a higher Gini impurity than its parent, as long as this increase is more than compensated for by a decrease of the other child's impurity. For example, consider a node containing four instances of class A and 1 of class B. Its Gini impurity is 1 − (1/5)² − (4/5)² = 0.32. Now suppose the dataset is one-dimensional and the instances are lined up in the following order: A, B, A, A, A. You can verify that the algorithm will split this node after the second instance, producing one child node with instances A, B, and the other child node with instances A, A, A. The first child node's Gini impurity is 1 − (1/2)² − (1/2)² = 0.5, which is higher than its parent's. This is compensated for by the fact that the other node is pure, so the overall weighted Gini impurity is (2/5) × 0.5 + (3/5) × 0 = 0.2, which is lower than the parent's Gini impurity.
3. If a Decision Tree is overfitting the training set, it may be a good idea to decrease max_depth, since this will constrain the model, regularizing it.
4. Decision Trees don't care whether or not the training data is scaled or centered; that's one of the nice things about them. So if a Decision Tree underfits the training set, scaling the input features will just be a waste of time.
5. The computational complexity of training a Decision Tree is O(n × m log(m)). So if you multiply the training set size by 10, the training time will be multiplied by K = (n × 10m × log(10m)) / (n × m × log(m)) = 10 × log(10m) / log(m). If m = 10⁶, then K ≈ 11.7, so you can expect the training time to be roughly 11.7 hours.
6. Presorting the training set speeds up training only if the dataset is smaller than a few thousand instances. If it contains 100,000 instances, setting presort=True will considerably slow down training.
For the solutions to exercises 7 and 8, please see the Jupyter notebooks available at https://github.com/ageron/handson-ml.
Chapter 7: Ensemble Learning and Random Forests

1. If you have trained five different models and they all achieve 95% precision, you can try combining them into a voting ensemble, which will often give you even better results. It works better if the models are very different (e.g., an SVM classifier, a Decision Tree classifier, a Logistic Regression classifier, and so on). It is even better if they are trained on different training instances (that's the whole point of bagging and pasting ensembles), but if not it will still work as long as the models are very different.
2. A hard voting classifier just counts the votes of each classifier in the ensemble and picks the class that gets the most votes. A soft voting classifier computes the average estimated class probability for each class and picks the class with the highest probability. This gives high-confidence votes more weight and often performs better, but it works only if every classifier is able to estimate class probabilities (e.g., for the SVM classifiers in Scikit-Learn you must set probability=True).
3. It is quite possible to speed up training of a bagging ensemble by distributing it across multiple servers, since each predictor in the ensemble is independent of the others. The same goes for pasting ensembles and Random Forests, for the same reason. However, each predictor in a boosting ensemble is built based on the previous predictor, so training is necessarily sequential, and you will not gain anything by distributing training across multiple servers. Regarding stacking ensembles, all the predictors in a given layer are independent of each other, so they can be trained in parallel on multiple servers. However, the predictors in one layer can only be trained after the predictors in the previous layer have all been trained.
4. With out-of-bag evaluation, each predictor in a bagging ensemble is evaluated using instances that it was not trained on (they were held out). This makes it possible to have a fairly unbiased evaluation of the ensemble without the need for an additional validation set. Thus, you have more instances available for training, and your ensemble can perform slightly better.
5. When you are growing a tree in a Random Forest, only a random subset of the features is considered for splitting at each node. This is true as well for Extra-Trees, but they go one step further: rather than searching for the best possible thresholds, like regular Decision Trees do, they use random thresholds for each feature. This extra randomness acts like a form of regularization: if a Random Forest overfits the training data, Extra-Trees might perform better. Moreover, since Extra-Trees don't search for the best possible thresholds, they are much faster to train than Random Forests. However, they are neither faster nor slower than Random Forests when making predictions.
6. If your AdaBoost ensemble underfits the training data, you can try increasing the number of estimators or reducing the regularization hyperparameters of the base estimator. You may also try slightly increasing the learning rate.
7. If your Gradient Boosting ensemble overfits the training set, you should try decreasing the learning rate. You could also use early stopping to find the right number of predictors (you probably have too many).
For the solutions to exercises 8 and 9, please see the Jupyter notebooks available at https://github.com/ageron/handson-ml.
Chapter 8: Dimensionality Reduction

1. Motivations and drawbacks:
The main motivations for dimensionality reduction are:
To speed up a subsequent training algorithm (in some cases it may even remove noise and redundant features, making the training algorithm perform better).
To visualize the data and gain insights on the most important features.
Simply to save space (compression).
The main drawbacks are:
Some information is lost, possibly degrading the performance of subsequent training algorithms.
It can be computationally intensive.
It adds some complexity to your Machine Learning pipelines.
Transformed features are often hard to interpret.
2. The curse of dimensionality refers to the fact that many problems that do not exist in low-dimensional space arise in high-dimensional space. In Machine Learning, one common manifestation is the fact that randomly sampled high-dimensional vectors are generally very sparse, increasing the risk of overfitting and making it very difficult to identify patterns in the data without having plenty of training data.
3. Once a dataset's dimensionality has been reduced using one of the algorithms we discussed, it is almost always impossible to perfectly reverse the operation, because some information gets lost during dimensionality reduction. Moreover, while some algorithms (such as PCA) have a simple reverse transformation procedure that can reconstruct a dataset relatively similar to the original, other algorithms (such as T-SNE) do not.
4. PCA can be used to significantly reduce the dimensionality of most datasets, even if they are highly nonlinear, because it can at least get rid of useless dimensions. However, if there are no useless dimensions (for example, the Swiss roll) then reducing dimensionality with PCA will lose too much information. You want to unroll the Swiss roll, not squash it.
5. That's a trick question: it depends on the dataset. Let's look at two extreme examples. First, suppose the dataset is composed of points that are almost perfectly aligned. In this case, PCA can reduce the dataset down to just one dimension while still preserving 95% of the variance. Now imagine that the dataset is composed of perfectly random points, scattered all around the 1,000 dimensions. In this case roughly 950 dimensions are required to preserve 95% of the variance. So the answer is, it depends on the dataset, and it could be any number between 1 and 950. Plotting the explained variance as a function of the number of dimensions is one way to get a rough idea of the dataset's intrinsic dimensionality.
6. Regular PCA is the default, but it works only if the dataset fits in memory. Incremental PCA is useful for large datasets that don't fit in memory, but it is slower than regular PCA, so if the dataset fits in memory you should prefer regular PCA. Incremental PCA is also useful for online tasks, when you need to apply PCA on the fly, every time a new instance arrives. Randomized PCA is useful when you want to considerably reduce dimensionality and the dataset fits in memory; in this case, it is much faster than regular PCA. Finally, Kernel PCA is useful for nonlinear datasets.
7. Intuitively, a dimensionality reduction algorithm performs well if it eliminates a lot of dimensions from the dataset without losing too much information. One way to measure this is to apply the reverse transformation and measure the reconstruction error. However, not all dimensionality reduction algorithms provide a reverse transformation. Alternatively, if you are using dimensionality reduction as a preprocessing step before another Machine Learning algorithm (e.g., a Random Forest classifier), then you can simply measure the performance of that second algorithm; if dimensionality reduction did not lose too much information, then the algorithm should perform just as well as when using the original dataset.
8. It can absolutely make sense to chain two different dimensionality reduction algorithms. A common example is using PCA to quickly get rid of a large number of useless dimensions, then applying another much slower dimensionality reduction algorithm, such as LLE. This two-step approach will likely yield the same performance as using LLE only, but in a fraction of the time.
For the solutions to exercises 9 and 10, please see the Jupyter notebooks available at https://github.com/ageron/handson-ml.
Chapter 9: Up and Running with TensorFlow

1. Main benefits and drawbacks of creating a computation graph rather than directly executing the computations:
Main benefits:
TensorFlow can automatically compute the gradients for you (using reverse-mode autodiff).
TensorFlow can take care of running the operations in parallel in different threads.
It makes it easier to run the same model across different devices.
It simplifies introspection, for example, to view the model in TensorBoard.
Main drawbacks:
It makes the learning curve steeper.
It makes step-by-step debugging harder.
2. Yes, the statement a_val = a.eval(session=sess) is indeed equivalent to a_val = sess.run(a).
3. No, the statement a_val, b_val = a.eval(session=sess), b.eval(session=sess) is not equivalent to a_val, b_val = sess.run([a, b]). Indeed, the first statement runs the graph twice (once to compute a, once to compute b), while the second statement runs the graph only once. If any of these operations (or the ops they depend on) have side effects (e.g., a variable is modified, an item is inserted in a queue, or a reader reads a file), then the effects will be different. If they don't have side effects, both statements will return the same result, but the second statement will be faster than the first.
4. No, you cannot run two graphs in the same session. You would have to merge the graphs into a single graph first.
5. In local TensorFlow, sessions manage variable values, so if you create a graph g containing a variable w, then start two threads and open a local session in each thread, both using the same graph g, then each session will have its own copy of the variable w. However, in distributed TensorFlow, variable values are stored in containers managed by the cluster, so if both sessions connect to the same cluster and use the same container, then they will share the same variable value for w.
6. A variable is initialized when you call its initializer, and it is destroyed when the session ends. In distributed TensorFlow, variables live in containers on the cluster, so closing a session will not destroy the variable. To destroy a variable, you need to clear its container.
7. Variables and placeholders are extremely different, but beginners often confuse them:
A variable is an operation that holds a value. If you run the variable, it returns that value. Before you can run it, you need to initialize it. You can change the variable's value (for example, by using an assignment operation). It is stateful: the variable keeps the same value upon successive runs of the graph. It is typically used to hold model parameters but also for other purposes (e.g., to count the global training step).
Placeholders technically don't do much: they just hold information about the type and shape of the tensor they represent, but they have no value. In fact, if you try to evaluate an operation that depends on a placeholder, you must feed TensorFlow the value of the placeholder (using the feed_dict argument) or else you will get an exception. Placeholders are typically used to feed training or test data to TensorFlow during the execution phase. They are also useful to pass a value to an assignment node, to change the value of a variable (e.g., model weights).
8. If you run the graph to evaluate an operation that depends on a placeholder but you don't feed its value, you get an exception. If the operation does not depend on the placeholder, then no exception is raised.
9. When you run a graph, you can feed the output value of any operation, not just the value of placeholders. In practice, however, this is rather rare (it can be useful, for example, when you are caching the output of frozen layers; see Chapter 11).
10. You can specify a variable's initial value when constructing the graph, and it will be initialized later when you run the variable's initializer during the execution phase. If you want to change that variable's value to anything you want during the execution phase, then the simplest option is to create an assignment node (during the graph construction phase) using the tf.assign() function, passing the variable and a placeholder as parameters. During the execution phase, you can run the assignment operation and feed the variable's new value using the placeholder.
import tensorflow as tf

x = tf.Variable(tf.random_uniform(shape=(), minval=0.0, maxval=1.0))
x_new_val = tf.placeholder(shape=(), dtype=tf.float32)
x_assign = tf.assign(x, x_new_val)

with tf.Session():
    x.initializer.run()  # random number is sampled *now*
    print(x.eval())      # 0.646157 (some random number)
    x_assign.eval(feed_dict={x_new_val: 5.0})
    print(x.eval())      # 5.0
11. Reverse-mode autodiff (implemented by TensorFlow) needs to traverse the graph only twice in order to compute the gradients of the cost function with regards to any number of variables. On the other hand, forward-mode autodiff would need to run once for each variable (so 10 times if we want the gradients with regards to 10 different variables). As for symbolic differentiation, it would build a different graph to compute the gradients, so it would not traverse the original graph at all (except when building the new gradients graph). A highly optimized symbolic differentiation system could potentially run the new gradients graph only once to compute the gradients with regards to all variables, but that new graph may be horribly complex and inefficient compared to the original graph.
12. See the Jupyter notebooks available at https://github.com/ageron/handson-ml.
Chapter 10: Introduction to Artificial Neural Networks

1. Here is a neural network based on the original artificial neurons that computes A ⊕ B (where ⊕ represents the exclusive OR), using the fact that A ⊕ B = (A ∧ ¬B) ∨ (¬A ∧ B). There are other solutions, for example, using the fact that A ⊕ B = (A ∨ B) ∧ ¬(A ∧ B), or the fact that A ⊕ B = (A ∨ B) ∧ (¬A ∨ ¬B), and so on.
2. AclassicalPerceptronwillconvergeonlyifthedatasetislinearlyseparable,anditwon’tbeabletoestimateclassprobabilities.Incontrast,aLogisticRegressionclassifierwillconvergetoagoodsolutionevenifthedatasetisnotlinearlyseparable,anditwilloutputclassprobabilities.IfyouchangethePerceptron’sactivationfunctiontothelogisticactivationfunction(orthesoftmaxactivationfunctioniftherearemultipleneurons),andifyoutrainitusingGradientDescent(orsomeotheroptimizationalgorithmminimizingthecostfunction,typicallycrossentropy),thenitbecomesequivalenttoaLogisticRegressionclassifier.
3. ThelogisticactivationfunctionwasakeyingredientintrainingthefirstMLPsbecauseitsderivativeisalwaysnonzero,soGradientDescentcanalwaysrolldowntheslope.Whentheactivationfunctionisastepfunction,GradientDescentcannotmove,asthereisnoslopeatall.
4. Thestepfunction,thelogisticfunction,thehyperbolictangent,therectifiedlinearunit(seeFigure10-8).SeeChapter11forotherexamples,suchasELUandvariantsoftheReLU.
5. ConsideringtheMLPdescribedinthequestion:supposeyouhaveanMLPcomposedofoneinputlayerwith10passthroughneurons,followedbyonehiddenlayerwith50artificialneurons,andfinallyoneoutputlayerwith3artificialneurons.AllartificialneuronsusetheReLUactivationfunction.
TheshapeoftheinputmatrixXism×10,wheremrepresentsthetrainingbatchsize.
Theshapeofthehiddenlayer’sweightvectorWhis10×50andthelengthofitsbiasvectorbhis50.
Theshapeoftheoutputlayer’sweightvectorWois50×3,andthelengthofitsbiasvectorbo
is3.
Theshapeofthenetwork’soutputmatrixYism×3.
Y=ReLU(ReLU(X·Wh+bh)·Wo+bo).RecallthattheReLUfunctionjustsetseverynegativenumberinthematrixtozero.Alsonotethatwhenyouareaddingabiasvectortoamatrix,itisaddedtoeverysinglerowinthematrix,whichiscalledbroadcasting.
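To make these shapes concrete, here is a minimal NumPy sketch of that forward pass (the batch size of 32 and the random initialization are arbitrary assumptions, purely for illustration):

import numpy as np

def relu(z):
    return np.maximum(z, 0)              # sets every negative value to zero

m = 32                                   # assumed training batch size
X = np.random.rand(m, 10)                # input matrix X: m x 10
W_h = np.random.randn(10, 50) * 0.1      # hidden layer weights Wh: 10 x 50
b_h = np.zeros(50)                       # hidden layer biases bh: length 50
W_o = np.random.randn(50, 3) * 0.1       # output layer weights Wo: 50 x 3
b_o = np.zeros(3)                        # output layer biases bo: length 3

# each bias vector is broadcast, i.e., added to every row of the matrix
Y = relu(relu(X.dot(W_h) + b_h).dot(W_o) + b_o)
print(Y.shape)                           # (32, 3), i.e., m x 3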
6. Toclassifyemailintospamorham,youjustneedoneneuronintheoutputlayerofaneuralnetwork—forexample,indicatingtheprobabilitythattheemailisspam.Youwouldtypicallyusethelogisticactivationfunctionintheoutputlayerwhenestimatingaprobability.IfinsteadyouwanttotackleMNIST,youneed10neuronsintheoutputlayer,andyoumustreplacethelogisticfunctionwiththesoftmaxactivationfunction,whichcanhandlemultipleclasses,outputtingoneprobabilityperclass.Now,ifyouwantyourneuralnetworktopredicthousingpriceslikeinChapter2,thenyouneedoneoutputneuron,usingnoactivationfunctionatallintheoutputlayer.4
7. Backpropagationisatechniqueusedtotrainartificialneuralnetworks.Itfirstcomputesthegradientsofthecostfunctionwithregardstoeverymodelparameter(alltheweightsandbiases),andthenitperformsaGradientDescentstepusingthesegradients.Thisbackpropagationstepistypicallyperformedthousandsormillionsoftimes,usingmanytrainingbatches,untilthemodelparametersconvergetovaluesthat(hopefully)minimizethecostfunction.Tocomputethegradients,backpropagationusesreverse-modeautodiff(althoughitwasn’tcalledthatwhenbackpropagationwasinvented,andithasbeenreinventedseveraltimes).Reverse-modeautodiffperformsaforwardpassthroughacomputationgraph,computingeverynode’svalueforthecurrenttrainingbatch,andthenitperformsareversepass,computingallthegradientsatonce(seeAppendixDformoredetails).Sowhat’sthedifference?Well,backpropagationreferstothewholeprocessoftraininganartificialneuralnetworkusingmultiplebackpropagationsteps,eachofwhichcomputesgradientsandusesthemtoperformaGradientDescentstep.Incontrast,reverse-modeautodiffisasimplyatechniquetocomputegradientsefficiently,andithappenstobeusedbybackpropagation.
8. HereisalistofallthehyperparametersyoucantweakinabasicMLP:thenumberofhiddenlayers,thenumberofneuronsineachhiddenlayer,andtheactivationfunctionusedineachhiddenlayerandintheoutputlayer.5Ingeneral,theReLUactivationfunction(oroneofitsvariants;seeChapter11)isagooddefaultforthehiddenlayers.Fortheoutputlayer,ingeneralyouwillwantthelogisticactivationfunctionforbinaryclassification,thesoftmaxactivationfunctionformulticlassclassification,ornoactivationfunctionforregression.IftheMLPoverfitsthetrainingdata,youcantryreducingthenumberofhiddenlayersandreducingthenumberofneuronsperhiddenlayer.
9. SeetheJupyternotebooksavailableathttps://github.com/ageron/handson-ml.
Chapter11:TrainingDeepNeuralNets1. No,allweightsshouldbesampledindependently;theyshouldnotallhavethesameinitialvalue.
Oneimportantgoalofsamplingweightsrandomlyistobreaksymmetries:ifalltheweightshavethesameinitialvalue,evenifthatvalueisnotzero,thensymmetryisnotbroken(i.e.,allneuronsinagivenlayerareequivalent),andbackpropagationwillbeunabletobreakit.Concretely,thismeansthatalltheneuronsinanygivenlayerwillalwayshavethesameweights.It’slikehavingjustoneneuronperlayer,andmuchslower.Itisvirtuallyimpossibleforsuchaconfigurationtoconvergetoagoodsolution.
2. Itisperfectlyfinetoinitializethebiastermstozero.Somepeopleliketoinitializethemjustlikeweights,andthat’sokaytoo;itdoesnotmakemuchdifference.
3. AfewadvantagesoftheELUfunctionovertheReLUfunctionare:Itcantakeonnegativevalues,sotheaverageoutputoftheneuronsinanygivenlayeristypicallycloserto0thanwhenusingtheReLUactivationfunction(whichneveroutputsnegativevalues).Thishelpsalleviatethevanishinggradientsproblem.
Italwayshasanonzeroderivative,whichavoidsthedyingunitsissuethatcanaffectReLUunits.
Itissmootheverywhere,whereastheReLU’sslopeabruptlyjumpsfrom0to1atz=0.SuchanabruptchangecanslowdownGradientDescentbecauseitwillbouncearoundz=0.
4. TheELUactivationfunctionisagooddefault.Ifyouneedtheneuralnetworktobeasfastaspossible,youcanuseoneoftheleakyReLUvariantsinstead(e.g.,asimpleleakyReLUusingthedefaulthyperparametervalue).ThesimplicityoftheReLUactivationfunctionmakesitmanypeople’spreferredoption,despitethefactthattheyaregenerallyoutperformedbytheELUandleakyReLU.However,theReLUactivationfunction’scapabilityofoutputtingpreciselyzerocanbeusefulinsomecases(e.g.,seeChapter15).Thehyperbolictangent(tanh)canbeusefulintheoutputlayerifyouneedtooutputanumberbetween–1and1,butnowadaysitisnotusedmuchinhiddenlayers.Thelogisticactivationfunctionisalsousefulintheoutputlayerwhenyouneedtoestimateaprobability(e.g.,forbinaryclassification),butitisalsorarelyusedinhiddenlayers(thereareexceptions—forexample,forthecodinglayerofvariationalautoencoders;seeChapter15).Finally,thesoftmaxactivationfunctionisusefulintheoutputlayertooutputprobabilitiesformutuallyexclusiveclasses,butotherthanthatitisrarely(ifever)usedinhiddenlayers.
5. Ifyousetthemomentumhyperparametertoocloseto1(e.g.,0.99999)whenusingaMomentumOptimizer,thenthealgorithmwilllikelypickupalotofspeed,hopefullyroughlytowardtheglobalminimum,butthenitwillshootrightpasttheminimum,duetoitsmomentum.Thenitwillslowdownandcomeback,accelerateagain,overshootagain,andsoon.Itmayoscillatethiswaymanytimesbeforeconverging,sooverallitwilltakemuchlongertoconvergethanwithasmallermomentumvalue.
6. Onewaytoproduceasparsemodel(i.e.,withmostweightsequaltozero)istotrainthemodelnormally,thenzeroouttinyweights.Formoresparsity,youcanapplyℓ1regularizationduring
training,whichpushestheoptimizertowardsparsity.Athirdoptionistocombineℓ1regularizationwithdualaveraging,usingTensorFlow’sFTRLOptimizerclass.
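As a rough sketch of that third option, the FTRLOptimizer can be dropped in where you would normally use another optimizer; the cost function, learning rate, and ℓ1 strength below are arbitrary stand-ins, just for illustration:

import tensorflow as tf

w = tf.Variable([[1.0], [2.0]])
loss = tf.reduce_sum(tf.square(w))   # stand-in cost function, just for illustration
optimizer = tf.train.FtrlOptimizer(learning_rate=0.01,
                                   l1_regularization_strength=0.001)  # assumed values
training_op = optimizer.minimize(loss)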
7. Yes,dropoutdoesslowdowntraining,ingeneralroughlybyafactoroftwo.However,ithasnoimpactoninferencesinceitisonlyturnedonduringtraining.
Forthesolutionstoexercises8,9,and10,pleaseseetheJupyternotebooksavailableathttps://github.com/ageron/handson-ml.
Chapter12:DistributingTensorFlowAcrossDevicesandServers1. WhenaTensorFlowprocessstarts,itgrabsalltheavailablememoryonallGPUdevicesthatare
visibletoit,soifyougetaCUDA_ERROR_OUT_OF_MEMORYwhenstartingyourTensorFlowprogram,itprobablymeansthatotherprocessesarerunningthathavealreadygrabbedallthememoryonatleastonevisibleGPUdevice(mostlikelyitisanotherTensorFlowprocess).Tofixthisproblem,atrivialsolutionistostoptheotherprocessesandtryagain.However,ifyouneedallprocessestorunsimultaneously,asimpleoptionistodedicatedifferentdevicestoeachprocess,bysettingtheCUDA_VISIBLE_DEVICESenvironmentvariableappropriatelyforeachdevice.AnotheroptionistoconfigureTensorFlowtograbonlypartoftheGPUmemory,insteadofallofit,bycreatingaConfigProto,settingitsgpu_options.per_process_gpu_memory_fractiontotheproportionofthetotalmemorythatitshouldgrab(e.g.,0.4),andusingthisConfigProtowhenopeningasession.ThelastoptionistotellTensorFlowtograbmemoryonlywhenitneedsitbysettingthegpu_options.allow_growthtoTrue.However,thislastoptionisusuallynotrecommendedbecauseanymemorythatTensorFlowgrabsisneverreleased,anditishardertoguaranteearepeatablebehavior(theremayberaceconditionsdependingonwhichprocessesstartfirst,howmuchmemorytheyneedduringtraining,andsoon).
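The two ConfigProto options mentioned above could look roughly like this (the 0.4 fraction is just an example value):

import tensorflow as tf

config = tf.ConfigProto()
# grab only 40% of each visible GPU's memory (assumed fraction)
config.gpu_options.per_process_gpu_memory_fraction = 0.4
# or: grab memory only as needed (but note that grabbed memory is never released)
# config.gpu_options.allow_growth = True

with tf.Session(config=config) as sess:
    pass  # build and run your graph here

# CUDA_VISIBLE_DEVICES is set outside the program, e.g.:
#   CUDA_VISIBLE_DEVICES=0 python program_1.py
#   CUDA_VISIBLE_DEVICES=1 python program_2.py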
2. Bypinninganoperationonadevice,youaretellingTensorFlowthatthisiswhereyouwouldlikethisoperationtobeplaced.However,someconstraintsmaypreventTensorFlowfromhonoringyourrequest.Forexample,theoperationmayhavenoimplementation(calledakernel)forthatparticulartypeofdevice.Inthiscase,TensorFlowwillraiseanexceptionbydefault,butyoucanconfigureittofallbacktotheCPUinstead(thisiscalledsoftplacement).Anotherexampleisanoperationthatcanmodifyavariable;thisoperationandthevariableneedtobecollocated.SothedifferencebetweenpinninganoperationandplacinganoperationisthatpinningiswhatyouaskTensorFlow(“PleaseplacethisoperationonGPU#1”)whileplacementiswhatTensorFlowactuallyendsupdoing(“Sorry,fallingbacktotheCPU”).
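For example, here is a minimal sketch of pinning an operation while allowing soft placement, so that TensorFlow may fall back to the CPU if the pinned device cannot host it:

import tensorflow as tf

with tf.device("/gpu:0"):
    v = tf.Variable(3.0)            # pinning: where we *ask* the variable to go

config = tf.ConfigProto()
config.allow_soft_placement = True  # let TensorFlow fall back to the CPU if needed

with tf.Session(config=config) as sess:
    sess.run(v.initializer)         # placement: where TensorFlow actually ends up putting it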
3. IfyouarerunningonaGPU-enabledTensorFlowinstallation,andyoujustusethedefaultplacement,thenifalloperationshaveaGPUkernel(i.e.,aGPUimplementation),yes,theywillallbeplacedonthefirstGPU.However,ifoneormoreoperationsdonothaveaGPUkernel,thenbydefaultTensorFlowwillraiseanexception.IfyouconfigureTensorFlowtofallbacktotheCPUinstead(softplacement),thenalloperationswillbeplacedonthefirstGPUexcepttheoneswithoutaGPUkernelandalltheoperationsthatmustbecollocatedwiththem(seetheanswertothepreviousexercise).
4. Yes,ifyoupinavariableto"/gpu:0",itcanbeusedbyoperationsplacedon/gpu:1.TensorFlowwillautomaticallytakecareofaddingtheappropriateoperationstotransferthevariable’svalueacrossdevices.Thesamegoesfordeviceslocatedondifferentservers(aslongastheyarepartofthesamecluster).
5. Yes,twooperationsplacedonthesamedevicecanruninparallel:TensorFlowautomaticallytakescareofrunningoperationsinparallel(ondifferentCPUcoresordifferentGPUthreads),aslongasnooperationdependsonanotheroperation’soutput.Moreover,youcanstartmultiplesessionsinparallelthreads(orprocesses),andevaluateoperationsineachthread.Sincesessionsareindependent,TensorFlowwillbeabletoevaluateanyoperationfromonesessioninparallelwith
anyoperationfromanothersession.
6. ControldependenciesareusedwhenyouwanttopostponetheevaluationofanoperationXuntilaftersomeotheroperationsarerun,eventhoughtheseoperationsarenotrequiredtocomputeX.ThisisusefulinparticularwhenXwouldoccupyalotofmemoryandyouonlyneeditlaterinthecomputationgraph,orifXusesupalotofI/O(forexample,itrequiresalargevariablevaluelocatedonadifferentdeviceorserver)andyoudon’twantittorunatthesametimeasotherI/O-hungryoperations,toavoidsaturatingthebandwidth.
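A minimal sketch of the mechanism (the operations themselves are trivial stand-ins for the memory- or I/O-hungry operations described above):

import tensorflow as tf

a = tf.constant(1.0)
b = tf.constant(2.0)
heavy_op = a + b                    # stand-in for an expensive or I/O-hungry operation

with tf.control_dependencies([heavy_op]):
    # x will not be evaluated before heavy_op has run,
    # even though it does not use heavy_op's output
    x = tf.constant(3.0) * 2.0

with tf.Session() as sess:
    print(sess.run(x))              # 6.0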
7. You’reinluck!IndistributedTensorFlow,thevariablevaluesliveincontainersmanagedbythecluster,soevenifyouclosethesessionandexittheclientprogram,themodelparametersarestillaliveandwellonthecluster.Yousimplyneedtoopenanewsessiontotheclusterandsavethemodel(makesureyoudon’tcallthevariableinitializersorrestoreapreviousmodel,asthiswoulddestroyyourpreciousnewmodel!).
Forthesolutionstoexercises8,9,and10,pleaseseetheJupyternotebooksavailableathttps://github.com/ageron/handson-ml.
Chapter13:ConvolutionalNeuralNetworks1. ThesearethemainadvantagesofaCNNoverafullyconnectedDNNforimageclassification:
Becauseconsecutivelayersareonlypartiallyconnectedandbecauseitheavilyreusesitsweights,aCNNhasmanyfewerparametersthanafullyconnectedDNN,whichmakesitmuchfastertotrain,reducestheriskofoverfitting,andrequiresmuchlesstrainingdata.
WhenaCNNhaslearnedakernelthatcandetectaparticularfeature,itcandetectthatfeatureanywhereontheimage.Incontrast,whenaDNNlearnsafeatureinonelocation,itcandetectitonlyinthatparticularlocation.Sinceimagestypicallyhaveveryrepetitivefeatures,CNNsareabletogeneralizemuchbetterthanDNNsforimageprocessingtaskssuchasclassification,usingfewertrainingexamples.
Finally,aDNNhasnopriorknowledgeofhowpixelsareorganized;itdoesnotknowthatnearbypixelsareclose.ACNN’sarchitectureembedsthispriorknowledge.Lowerlayerstypicallyidentifyfeaturesinsmallareasoftheimages,whilehigherlayerscombinethelower-levelfeaturesintolargerfeatures.Thisworkswellwithmostnaturalimages,givingCNNsadecisiveheadstartcomparedtoDNNs.
2. Let’scomputehowmanyparameterstheCNNhas.Sinceitsfirstconvolutionallayerhas3×3kernels,andtheinputhasthreechannels(red,green,andblue),theneachfeaturemaphas3×3×3weights,plusabiasterm.That’s28parametersperfeaturemap.Sincethisfirstconvolutionallayerhas100featuremaps,ithasatotalof2,800parameters.Thesecondconvolutionallayerhas3×3kernels,anditsinputisthesetof100featuremapsofthepreviouslayer,soeachfeaturemaphas3×3×100=900weights,plusabiasterm.Sinceithas200featuremaps,thislayerhas901×200=180,200parameters.Finally,thethirdandlastconvolutionallayeralsohas3×3kernels,anditsinputisthesetof200featuremapsofthepreviouslayers,soeachfeaturemaphas3×3×200=1,800weights,plusabiasterm.Sinceithas400featuremaps,thislayerhasatotalof1,801×400=720,400parameters.Allinall,theCNNhas2,800+180,200+720,400=903,400parameters.Nowlet’scomputehowmuchRAMthisneuralnetworkwillrequire(atleast)whenmakingapredictionforasingleinstance.Firstlet’scomputethefeaturemapsizeforeachlayer.Sinceweareusingastrideof2andSAMEpadding,thehorizontalandverticalsizeofthefeaturemapsaredividedby2ateachlayer(roundingupifnecessary),soastheinputchannelsare200×300pixels,thefirstlayer’sfeaturemapsare100×150,thesecondlayer’sfeaturemapsare50×75,andthethirdlayer’sfeaturemapsare25×38.Since32bitsis4bytesandthefirstconvolutionallayerhas100featuremaps,thisfirstlayertakesup4x100×150×100=6millionbytes(about5.7MB,consideringthat1MB=1,024KBand1KB=1,024bytes).Thesecondlayertakesup4×50×75×200=3millionbytes(about2.9MB).Finally,thethirdlayertakesup4×25×38×400=1,520,000bytes(about1.4MB).However,oncealayerhasbeencomputed,thememoryoccupiedbythepreviouslayercanbereleased,soifeverythingiswelloptimized,only6+9=15millionbytes(about14.3MB)ofRAMwillberequired(whenthesecondlayerhasjustbeencomputed,butthememoryoccupiedbythefirstlayerisnotreleasedyet).Butwait,youalsoneedtoaddthememoryoccupiedbytheCNN’sparameters.Wecomputedearlierthatithas903,400parameters,eachusingup4bytes,sothisadds3,613,600bytes(about3.4MB).ThetotalRAMrequiredis(atleast)18,613,600bytes(about17.8MB).
Lastly,let’scomputetheminimumamountofRAMrequiredwhentrainingtheCNNonamini-batchof50images.DuringtrainingTensorFlowusesbackpropagation,whichrequireskeepingallvaluescomputedduringtheforwardpassuntilthereversepassbegins.SowemustcomputethetotalRAMrequiredbyalllayersforasingleinstanceandmultiplythatby50!Atthatpointlet’sstartcountinginmegabytesratherthanbytes.Wecomputedbeforethatthethreelayersrequirerespectively5.7,2.9,and1.4MBforeachinstance.That’satotalof10.0MBperinstance.Sofor50instancesthetotalRAMis500MB.AddtothattheRAMrequiredbytheinputimages,whichis50×4×200×300×3=36millionbytes(about34.3MB),plustheRAMrequiredforthemodelparameters,whichisabout3.4MB(computedearlier),plussomeRAMforthegradients(wewillneglectthemsincetheycanbereleasedgraduallyasbackpropagationgoesdownthelayersduringthereversepass).Weareuptoatotalofroughly500.0+34.3+3.4=537.7MB.Andthat’sreallyanoptimisticbareminimum.
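These counts are easy to double-check with a few lines of Python that just mirror the arithmetic above:

# 3x3 kernels, stride 2, SAME padding, 200x300 RGB input,
# convolutional layers with 100, 200, and 400 feature maps
channels = [3, 100, 200, 400]

params = 0
for lower, upper in zip(channels[:-1], channels[1:]):
    params += (3 * 3 * lower + 1) * upper  # weights per feature map + 1 bias, times nb of maps
print(params)                              # 903400

height, width = 200, 300
for n_maps in channels[1:]:
    height, width = (height + 1) // 2, (width + 1) // 2  # stride 2, rounding up
    layer_bytes = 4 * height * width * n_maps            # 32-bit floats = 4 bytes
    print(height, width, round(layer_bytes / 2**20, 1))  # ~5.7, 2.9, and 1.4 MB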
3. IfyourGPUrunsoutofmemorywhiletrainingaCNN,herearefivethingsyoucouldtrytosolvetheproblem(otherthanpurchasingaGPUwithmoreRAM):
Reducethemini-batchsize.
Reducedimensionalityusingalargerstrideinoneormorelayers.
Removeoneormorelayers.
Use16-bitfloatsinsteadof32-bitfloats.
DistributetheCNNacrossmultipledevices.
4. Amaxpoolinglayerhasnoparametersatall,whereasaconvolutionallayerhasquiteafew(seethepreviousquestions).
5. Alocalresponsenormalizationlayermakestheneuronsthatmoststronglyactivateinhibitneuronsatthesamelocationbutinneighboringfeaturemaps,whichencouragesdifferentfeaturemapstospecializeandpushesthemapart,forcingthemtoexploreawiderrangeoffeatures.Itistypicallyusedinthelowerlayerstohavealargerpooloflow-levelfeaturesthattheupperlayerscanbuildupon.
6. ThemaininnovationsinAlexNetcomparedtoLeNet-5are(1)itismuchlargeranddeeper,and(2)itstacksconvolutionallayersdirectlyontopofeachother,insteadofstackingapoolinglayerontopofeachconvolutionallayer.ThemaininnovationinGoogLeNetistheintroductionofinceptionmodules,whichmakeitpossibletohaveamuchdeepernetthanpreviousCNNarchitectures,withfewerparameters.Finally,ResNet’smaininnovationistheintroductionofskipconnections,whichmakeitpossibletogowellbeyond100layers.Arguably,itssimplicityandconsistencyarealsoratherinnovative.
Forthesolutionstoexercises7,8,9,and10,pleaseseetheJupyternotebooksavailableathttps://github.com/ageron/handson-ml.
Chapter14:RecurrentNeuralNetworks1. HereareafewRNNapplications:
Forasequence-to-sequenceRNN:predictingtheweather(oranyothertimeseries),machinetranslation(usinganencoder–decoderarchitecture),videocaptioning,speechtotext,musicgeneration(orothersequencegeneration),identifyingthechordsofasong.
Forasequence-to-vectorRNN:classifyingmusicsamplesbymusicgenre,analyzingthesentimentofabookreview,predictingwhatwordanaphasicpatientisthinkingofbasedonreadingsfrombrainimplants,predictingtheprobabilitythatauserwillwanttowatchamoviebasedonherwatchhistory(thisisoneofmanypossibleimplementationsofcollaborativefiltering).
Foravector-to-sequenceRNN:imagecaptioning,creatingamusicplaylistbasedonanembeddingofthecurrentartist,generatingamelodybasedonasetofparameters,locatingpedestriansinapicture(e.g.,avideoframefromaself-drivingcar’scamera).
2. Ingeneral,ifyoutranslateasentenceonewordatatime,theresultwillbeterrible.Forexample,theFrenchsentence“Jevousenprie”means“Youarewelcome,”butifyoutranslateitonewordatatime,youget“Iyouinpray.”Huh?Itismuchbettertoreadthewholesentencefirstandthentranslateit.Aplainsequence-to-sequenceRNNwouldstarttranslatingasentenceimmediatelyafterreadingthefirstword,whileanencoder–decoderRNNwillfirstreadthewholesentenceandthentranslateit.Thatsaid,onecouldimagineaplainsequence-to-sequenceRNNthatwouldoutputsilencewheneveritisunsureaboutwhattosaynext(justlikehumantranslatorsdowhentheymusttranslatealivebroadcast).
3. Toclassifyvideosbasedonthevisualcontent,onepossiblearchitecturecouldbetotake(say)oneframepersecond,thenruneachframethroughaconvolutionalneuralnetwork,feedtheoutputoftheCNNtoasequence-to-vectorRNN,andfinallyrunitsoutputthroughasoftmaxlayer,givingyoualltheclassprobabilities.Fortrainingyouwouldjustusecrossentropyasthecostfunction.Ifyouwantedtousetheaudioforclassificationaswell,youcouldconverteverysecondofaudiotoaspectrograph,feedthisspectrographtoaCNN,andfeedtheoutputofthisCNNtotheRNN(alongwiththecorrespondingoutputoftheotherCNN).
4. BuildinganRNNusingdynamic_rnn()ratherthanstatic_rnn()offersseveraladvantages:Itisbasedonawhile_loop()operationthatisabletoswaptheGPU’smemorytotheCPU’smemoryduringbackpropagation,avoidingout-of-memoryerrors.
Itisarguablyeasiertouse,asitcandirectlytakeasingletensorasinputandoutput(coveringalltimesteps),ratherthanalistoftensors(onepertimestep).Noneedtostack,unstack,ortranspose.
Itgeneratesasmallergraph,easiertovisualizeinTensorBoard.
5. Tohandlevariablelengthinputsequences,thesimplestoptionistosetthesequence_lengthparameterwhencallingthestatic_rnn()ordynamic_rnn()functions.Anotheroptionistopad
thesmallerinputs(e.g.,withzeros)tomakethemthesamesizeasthelargestinput(thismaybefasterthanthefirstoptioniftheinputsequencesallhaveverysimilarlengths).Tohandlevariable-lengthoutputsequences,ifyouknowinadvancethelengthofeachoutputsequence,youcanusethesequence_lengthparameter(forexample,considerasequence-to-sequenceRNNthatlabelseveryframeinavideowithaviolencescore:theoutputsequencewillbeexactlythesamelengthastheinputsequence).Ifyoudon’tknowinadvancethelengthoftheoutputsequence,youcanusethepaddingtrick:alwaysoutputthesamesizesequence,butignoreanyoutputsthatcomeaftertheend-of-sequencetoken(byignoringthemwhencomputingthecostfunction).
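A minimal sketch of the first option (the tiny dimensions are arbitrary, and the exact cell class name varies across TensorFlow 1.x versions):

import numpy as np
import tensorflow as tf

n_steps, n_inputs, n_neurons = 2, 3, 5

X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
seq_length = tf.placeholder(tf.int32, [None])
cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
outputs, states = tf.nn.dynamic_rnn(cell, X, dtype=tf.float32,
                                    sequence_length=seq_length)

X_batch = np.array([
    [[0., 1., 2.], [9., 8., 7.]],   # full-length sequence (2 steps)
    [[3., 4., 5.], [0., 0., 0.]],   # shorter sequence, zero-padded to 2 steps
])
seq_length_batch = np.array([2, 1])

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    outputs_val = sess.run(outputs, feed_dict={X: X_batch,
                                               seq_length: seq_length_batch})
    # the outputs for the padded time steps of the short sequence are zero vectors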
6. TodistributetrainingandexecutionofadeepRNNacrossmultipleGPUs,acommontechniqueissimplytoplaceeachlayeronadifferentGPU(seeChapter12).
Forthesolutionstoexercises7,8,and9,pleaseseetheJupyternotebooksavailableathttps://github.com/ageron/handson-ml.
Chapter15:Autoencoders1. Herearesomeofthemaintasksthatautoencodersareusedfor:
Featureextraction
Unsupervisedpretraining
Dimensionalityreduction
Generativemodels
Anomalydetection(anautoencoderisgenerallybadatreconstructingoutliers)
2. Ifyouwanttotrainaclassifierandyouhaveplentyofunlabeledtrainingdata,butonlyafewthousandlabeledinstances,thenyoucouldfirsttrainadeepautoencoderonthefulldataset(labeled+unlabeled),thenreuseitslowerhalffortheclassifier(i.e.,reusethelayersuptothecodingslayer,included)andtraintheclassifierusingthelabeleddata.Ifyouhavelittlelabeleddata,youprobablywanttofreezethereusedlayerswhentrainingtheclassifier.
3. Thefactthatanautoencoderperfectlyreconstructsitsinputsdoesnotnecessarilymeanthatitisagoodautoencoder;perhapsitissimplyanovercompleteautoencoderthatlearnedtocopyitsinputstothecodingslayerandthentotheoutputs.Infact,evenifthecodingslayercontainedasingleneuron,itwouldbepossibleforaverydeepautoencodertolearntomapeachtraininginstancetoadifferentcoding(e.g.,thefirstinstancecouldbemappedto0.001,thesecondto0.002,thethirdto0.003,andsoon),anditcouldlearn“byheart”toreconstructtherighttraininginstanceforeachcoding.Itwouldperfectlyreconstructitsinputswithoutreallylearninganyusefulpatterninthedata.Inpracticesuchamappingisunlikelytohappen,butitillustratesthefactthatperfectreconstructionsarenotaguaranteethattheautoencoderlearnedanythinguseful.However,ifitproducesverybadreconstructions,thenitisalmostguaranteedtobeabadautoencoder.Toevaluatetheperformanceofanautoencoder,oneoptionistomeasurethereconstructionloss(e.g.,computetheMSE,themeansquareoftheoutputsminustheinputs).Again,ahighreconstructionlossisagoodsignthattheautoencoderisbad,butalowreconstructionlossisnotaguaranteethatitisgood.Youshouldalsoevaluatetheautoencoderaccordingtowhatitwillbeusedfor.Forexample,ifyouareusingitforunsupervisedpretrainingofaclassifier,thenyoushouldalsoevaluatetheclassifier’sperformance.
4. Anundercompleteautoencoderisonewhosecodingslayerissmallerthantheinputandoutputlayers.Ifitislarger,thenitisanovercompleteautoencoder.Themainriskofanexcessivelyundercompleteautoencoderisthatitmayfailtoreconstructtheinputs.Themainriskofanovercompleteautoencoderisthatitmayjustcopytheinputstotheoutputs,withoutlearninganyusefulfeature.
5. Totietheweightsofanencoderlayeranditscorrespondingdecoderlayer,yousimplymakethedecoderweightsequaltothetransposeoftheencoderweights.Thisreducesthenumberofparametersinthemodelbyhalf,oftenmakingtrainingconvergefasterwithlesstrainingdata,andreducingtheriskofoverfittingthetrainingset.
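For example, here is a minimal sketch of one encoder/decoder pair with tied weights (the layer sizes and sigmoid activations are assumptions for illustration):

import tensorflow as tf

n_inputs, n_hidden = 784, 150   # e.g., MNIST-sized inputs (assumed)

X = tf.placeholder(tf.float32, shape=[None, n_inputs])
weights1 = tf.Variable(tf.random_normal([n_inputs, n_hidden], stddev=0.1))
weights2 = tf.transpose(weights1)          # tied weights: the decoder reuses the encoder's weights
biases1 = tf.Variable(tf.zeros(n_hidden))  # biases are not tied
biases2 = tf.Variable(tf.zeros(n_inputs))

hidden = tf.nn.sigmoid(tf.matmul(X, weights1) + biases1)
outputs = tf.matmul(hidden, weights2) + biases2
reconstruction_loss = tf.reduce_mean(tf.square(outputs - X))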
6. Tovisualizethefeatureslearnedbythelowerlayerofastackedautoencoder,acommontechniqueissimplytoplottheweightsofeachneuron,byreshapingeachweightvectortothesizeofaninputimage(e.g.,forMNIST,reshapingaweightvectorofshape[784]to[28,28]).Tovisualizethefeatureslearnedbyhigherlayers,onetechniqueistodisplaythetraininginstancesthatmostactivateeachneuron.
7. Agenerativemodelisamodelcapableofrandomlygeneratingoutputsthatresemblethetraininginstances.Forexample,oncetrainedsuccessfullyontheMNISTdataset,agenerativemodelcanbeusedtorandomlygeneraterealisticimagesofdigits.Theoutputdistributionistypicallysimilartothetrainingdata.Forexample,sinceMNISTcontainsmanyimagesofeachdigit,thegenerativemodelwouldoutputroughlythesamenumberofimagesofeachdigit.Somegenerativemodelscanbeparametrized—forexample,togenerateonlysomekindsofoutputs.Anexampleofagenerativeautoencoderisthevariationalautoencoder.
Forthesolutionstoexercises8,9,and10,pleaseseetheJupyternotebooksavailableathttps://github.com/ageron/handson-ml.
Chapter16:ReinforcementLearning1. ReinforcementLearningisanareaofMachineLearningaimedatcreatingagentscapableoftaking
actionsinanenvironmentinawaythatmaximizesrewardsovertime.TherearemanydifferencesbetweenRLandregularsupervisedandunsupervisedlearning.Hereareafew:
Insupervisedandunsupervisedlearning,thegoalisgenerallytofindpatternsinthedata.InReinforcementLearning,thegoalistofindagoodpolicy.
Unlikeinsupervisedlearning,theagentisnotexplicitlygiventhe“right”answer.Itmustlearnbytrialanderror.
Unlikeinunsupervisedlearning,thereisaformofsupervision,throughrewards.Wedonottelltheagenthowtoperformthetask,butwedotellitwhenitismakingprogressorwhenitisfailing.
AReinforcementLearningagentneedstofindtherightbalancebetweenexploringtheenvironment,lookingfornewwaysofgettingrewards,andexploitingsourcesofrewardsthatitalreadyknows.Incontrast,supervisedandunsupervisedlearningsystemsgenerallydon’tneedtoworryaboutexploration;theyjustfeedonthetrainingdatatheyaregiven.
Insupervisedandunsupervisedlearning,traininginstancesaretypicallyindependent(infact,theyaregenerallyshuffled).InReinforcementLearning,consecutiveobservationsaregenerallynotindependent.Anagentmayremaininthesameregionoftheenvironmentforawhilebeforeitmoveson,soconsecutiveobservationswillbeverycorrelated.Insomecasesareplaymemoryisusedtoensurethatthetrainingalgorithmgetsfairlyindependentobservations.
2. HereareafewpossibleapplicationsofReinforcementLearning,otherthanthosementionedinChapter16:
MusicpersonalizationTheenvironmentisauser’spersonalizedwebradio.Theagentisthesoftwaredecidingwhatsongtoplaynextforthatuser.Itspossibleactionsaretoplayanysonginthecatalog(itmusttrytochooseasongtheuserwillenjoy)ortoplayanadvertisement(itmusttrytochooseanadthattheuserwillbeinterestedin).Itgetsasmallrewardeverytimetheuserlistenstoasong,alargerrewardeverytimetheuserlistenstoanad,anegativerewardwhentheuserskipsasongoranad,andaverynegativerewardiftheuserleaves.
MarketingTheenvironmentisyourcompany’smarketingdepartment.Theagentisthesoftwarethatdefineswhichcustomersamailingcampaignshouldbesentto,giventheirprofileandpurchasehistory(foreachcustomerithastwopossibleactions:sendordon’tsend).Itgetsanegativerewardforthecostofthemailingcampaign,andapositiverewardforestimatedrevenuegeneratedfromthiscampaign.
Productdelivery
Lettheagentcontrolafleetofdeliverytrucks,decidingwhattheyshouldpickupatthedepots,wheretheyshouldgo,whattheyshoulddropoff,andsoon.Theywouldgetpositiverewardsforeachproductdeliveredontime,andnegativerewardsforlatedeliveries.
3. Whenestimatingthevalueofanaction,ReinforcementLearningalgorithmstypicallysumalltherewardsthatthisactionledto,givingmoreweighttoimmediaterewards,andlessweighttolaterrewards(consideringthatanactionhasmoreinfluenceonthenearfuturethanonthedistantfuture).Tomodelthis,adiscountrateistypicallyappliedateachtimestep.Forexample,withadiscountrateof0.9,arewardof100thatisreceivedtwotimestepslateriscountedasonly0.92×100=81whenyouareestimatingthevalueoftheaction.Youcanthinkofthediscountrateasameasureofhowmuchthefutureisvaluedrelativetothepresent:ifitisverycloseto1,thenthefutureisvaluedalmostasmuchasthepresent.Ifitiscloseto0,thenonlyimmediaterewardsmatter.Ofcourse,thisimpactstheoptimalpolicytremendously:ifyouvaluethefuture,youmaybewillingtoputupwithalotofimmediatepainfortheprospectofeventualrewards,whileifyoudon’tvaluethefuture,youwilljustgrabanyimmediaterewardyoucanfind,neverinvestinginthefuture.
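The computation itself is just a weighted sum; a tiny sketch:

def discounted_value(rewards, discount_rate=0.9):
    # each reward is weighted by discount_rate ** (number of time steps until it is received)
    return sum(reward * discount_rate ** step
               for step, reward in enumerate(rewards))

# a reward of 100 received two time steps later counts as 0.9 ** 2 * 100
print(discounted_value([0, 0, 100]))   # ~81.0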
4. TomeasuretheperformanceofaReinforcementLearningagent,youcansimplysumuptherewardsitgets.Inasimulatedenvironment,youcanrunmanyepisodesandlookatthetotalrewardsitgetsonaverage(andpossiblylookatthemin,max,standarddeviation,andsoon).
5. ThecreditassignmentproblemisthefactthatwhenaReinforcementLearningagentreceivesareward,ithasnodirectwayofknowingwhichofitspreviousactionscontributedtothisreward.Ittypicallyoccurswhenthereisalargedelaybetweenanactionandtheresultingrewards(e.g.,duringagameofAtari’sPong,theremaybeafewdozentimestepsbetweenthemomenttheagenthitstheballandthemomentitwinsthepoint).Onewaytoalleviateitistoprovidetheagentwithshorter-termrewards,whenpossible.Thisusuallyrequirespriorknowledgeaboutthetask.Forexample,ifwewanttobuildanagentthatwilllearntoplaychess,insteadofgivingitarewardonlywhenitwinsthegame,wecouldgiveitarewardeverytimeitcapturesoneoftheopponent’spieces.
6. Anagentcanoftenremaininthesameregionofitsenvironmentforawhile,soallofitsexperienceswillbeverysimilarforthatperiodoftime.Thiscanintroducesomebiasinthelearningalgorithm.Itmaytuneitspolicyforthisregionoftheenvironment,butitwillnotperformwellassoonasitmovesoutofthisregion.Tosolvethisproblem,youcanuseareplaymemory;insteadofusingonlythemostimmediateexperiencesforlearning,theagentwilllearnbasedonabufferofitspastexperiences,recentandnotsorecent(perhapsthisiswhywedreamatnight:toreplayourexperiencesofthedayandbetterlearnfromthem?).
7. Anoff-policyRLalgorithmlearnsthevalueoftheoptimalpolicy(i.e.,thesumofdiscountedrewardsthatcanbeexpectedforeachstateiftheagentactsoptimally),independentlyofhowtheagentactuallyacts.Q-Learningisagoodexampleofsuchanalgorithm.Incontrast,anon-policyalgorithmlearnsthevalueofthepolicythattheagentactuallyexecutes,includingbothexplorationandexploitation.
Forthesolutionstoexercises8,9,and10,pleaseseetheJupyternotebooksavailableathttps://github.com/ageron/handson-ml.
1. If you draw a straight line between any two points on the curve, the line never crosses the curve.

2. Moreover, the Normal Equation requires computing the inverse of a matrix, but that matrix is not always invertible. In contrast, the matrix for Ridge Regression is always invertible.

3. log2 is the binary log, log2(m) = log(m) / log(2).

4. When the values to predict can vary by many orders of magnitude, then you may want to predict the logarithm of the target value rather than the target value directly. Simply computing the exponential of the neural network's output will give you the estimated value (since exp(log v) = v).

5. In Chapter 11 we discuss many techniques that introduce additional hyperparameters: type of weight initialization, activation function hyperparameters (e.g., amount of leak in leaky ReLU), Gradient Clipping threshold, type of optimizer and its hyperparameters (e.g., the momentum hyperparameter when using a MomentumOptimizer), type of regularization for each layer, and the regularization hyperparameters (e.g., dropout rate when using dropout), and so on.
Appendix B. Machine Learning Project Checklist
ThischecklistcanguideyouthroughyourMachineLearningprojects.Thereareeightmainsteps:1. Frametheproblemandlookatthebigpicture.
2. Getthedata.
3. Explorethedatatogaininsights.
4. PreparethedatatobetterexposetheunderlyingdatapatternstoMachineLearningalgorithms.
5. Exploremanydifferentmodelsandshort-listthebestones.
6. Fine-tuneyourmodelsandcombinethemintoagreatsolution.
7. Presentyoursolution.
8. Launch,monitor,andmaintainyoursystem.
Obviously,youshouldfeelfreetoadaptthischecklisttoyourneeds.
FrametheProblemandLookattheBigPicture1. Definetheobjectiveinbusinessterms.
2. Howwillyoursolutionbeused?
3. Whatarethecurrentsolutions/workarounds(ifany)?
4. Howshouldyouframethisproblem(supervised/unsupervised,online/offline,etc.)?
5. Howshouldperformancebemeasured?
6. Istheperformancemeasurealignedwiththebusinessobjective?
7. Whatwouldbetheminimumperformanceneededtoreachthebusinessobjective?
8. Whatarecomparableproblems?Canyoureuseexperienceortools?
9. Ishumanexpertiseavailable?
10. Howwouldyousolvetheproblemmanually?
11. Listtheassumptionsyou(orothers)havemadesofar.
12. Verifyassumptionsifpossible.
GettheDataNote:automateasmuchaspossiblesoyoucaneasilygetfreshdata.1. Listthedatayouneedandhowmuchyouneed.
2. Findanddocumentwhereyoucangetthatdata.
3. Checkhowmuchspaceitwilltake.
4. Checklegalobligations,andgetauthorizationifnecessary.
5. Getaccessauthorizations.
6. Createaworkspace(withenoughstoragespace).
7. Getthedata.
8. Convertthedatatoaformatyoucaneasilymanipulate(withoutchangingthedataitself).
9. Ensuresensitiveinformationisdeletedorprotected(e.g.,anonymized).
10. Checkthesizeandtypeofdata(timeseries,sample,geographical,etc.).
11. Sampleatestset,putitaside,andneverlookatit(nodatasnooping!).
ExploretheDataNote:trytogetinsightsfromafieldexpertforthesesteps.1. Createacopyofthedataforexploration(samplingitdowntoamanageablesizeifnecessary).
2. CreateaJupyternotebooktokeeparecordofyourdataexploration.
3. Studyeachattributeanditscharacteristics:Name
Type(categorical,int/float,bounded/unbounded,text,structured,etc.)
%ofmissingvalues
Noisinessandtypeofnoise(stochastic,outliers,roundingerrors,etc.)
Possiblyusefulforthetask?
Typeofdistribution(Gaussian,uniform,logarithmic,etc.)
4. Forsupervisedlearningtasks,identifythetargetattribute(s).
5. Visualizethedata.
6. Studythecorrelationsbetweenattributes.
7. Studyhowyouwouldsolvetheproblemmanually.
8. Identifythepromisingtransformationsyoumaywanttoapply.
9. Identifyextradatathatwouldbeuseful(gobackto“GettheData”).
10. Documentwhatyouhavelearned.
PreparetheDataNotes:
Workoncopiesofthedata(keeptheoriginaldatasetintact).
Writefunctionsforalldatatransformationsyouapply,forfivereasons:Soyoucaneasilypreparethedatathenexttimeyougetafreshdataset
Soyoucanapplythesetransformationsinfutureprojects
Tocleanandpreparethetestset
Tocleanandpreparenewdatainstancesonceyoursolutionislive
Tomakeiteasytotreatyourpreparationchoicesashyperparameters
1. Datacleaning:Fixorremoveoutliers(optional).
Fillinmissingvalues(e.g.,withzero,mean,median…)ordroptheirrows(orcolumns).
2. Featureselection(optional):Droptheattributesthatprovidenousefulinformationforthetask.
3. Featureengineering,whereappropriate:Discretizecontinuousfeatures.
Decomposefeatures(e.g.,categorical,date/time,etc.).
Addpromisingtransformationsoffeatures(e.g.,log(x),sqrt(x),x^2,etc.).
Aggregatefeaturesintopromisingnewfeatures.
4. Featurescaling:standardizeornormalizefeatures.
Short-ListPromisingModelsNotes:
Ifthedataishuge,youmaywanttosamplesmallertrainingsetssoyoucantrainmanydifferentmodelsinareasonabletime(beawarethatthispenalizescomplexmodelssuchaslargeneuralnetsorRandomForests).
Onceagain,trytoautomatethesestepsasmuchaspossible.
1. Trainmanyquickanddirtymodelsfromdifferentcategories(e.g.,linear,naiveBayes,SVM,RandomForests,neuralnet,etc.)usingstandardparameters.
2. Measureandcomparetheirperformance.Foreachmodel,useN-foldcross-validationandcomputethemeanandstandarddeviationoftheperformancemeasureontheNfolds.
3. Analyzethemostsignificantvariablesforeachalgorithm.
4. Analyzethetypesoferrorsthemodelsmake.Whatdatawouldahumanhaveusedtoavoidtheseerrors?
5. Haveaquickroundoffeatureselectionandengineering.
6. Haveoneortwomorequickiterationsofthefiveprevioussteps.
7. Short-listthetopthreetofivemostpromisingmodels,preferringmodelsthatmakedifferenttypesoferrors.
Fine-TunetheSystemNotes:
Youwillwanttouseasmuchdataaspossibleforthisstep,especiallyasyoumovetowardtheendoffine-tuning.
Asalwaysautomatewhatyoucan.
1. Fine-tune the hyperparameters using cross-validation. Treat your data transformation choices as hyperparameters, especially when you are not sure about them (e.g., should I replace missing values with zero or with the median value? Or just drop the rows?).

Unless there are very few hyperparameter values to explore, prefer random search over grid search (see the sketch at the end of this section). If training is very long, you may prefer a Bayesian optimization approach (e.g., using Gaussian process priors, as described by Jasper Snoek, Hugo Larochelle, and Ryan Adams).1
2. TryEnsemblemethods.Combiningyourbestmodelswilloftenperformbetterthanrunningthemindividually.
3. Onceyouareconfidentaboutyourfinalmodel,measureitsperformanceonthetestsettoestimatethegeneralizationerror.
WARNINGDon’ttweakyourmodelaftermeasuringthegeneralizationerror:youwouldjuststartoverfittingthetestset.
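Here is the random search sketch referred to above, using Scikit-Learn; the model, parameter ranges, and number of iterations are arbitrary assumptions, and X_train/y_train stand for your prepared training set:

from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

param_distribs = {
    "n_estimators": randint(10, 200),   # assumed search ranges
    "max_features": randint(2, 8),
}
rnd_search = RandomizedSearchCV(RandomForestRegressor(random_state=42),
                                param_distribs, n_iter=20, cv=5,
                                scoring="neg_mean_squared_error",
                                random_state=42)
# rnd_search.fit(X_train, y_train)
# rnd_search.best_params_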
PresentYourSolution1. Documentwhatyouhavedone.
2. Createanicepresentation.Makesureyouhighlightthebigpicturefirst.
3. Explainwhyyoursolutionachievesthebusinessobjective.
4. Don’tforgettopresentinterestingpointsyounoticedalongtheway.Describewhatworkedandwhatdidnot.
Listyourassumptionsandyoursystem’slimitations.
5. Ensureyourkeyfindingsarecommunicatedthroughbeautifulvisualizationsoreasy-to-rememberstatements(e.g.,“themedianincomeisthenumber-onepredictorofhousingprices”).
Launch!1. Getyoursolutionreadyforproduction(plugintoproductiondatainputs,writeunittests,etc.).
2. Writemonitoringcodetocheckyoursystem’sliveperformanceatregularintervalsandtriggeralertswhenitdrops.
Bewareofslowdegradationtoo:modelstendto“rot”asdataevolves.
Measuringperformancemayrequireahumanpipeline(e.g.,viaacrowdsourcingservice).
Alsomonitoryourinputs’quality(e.g.,amalfunctioningsensorsendingrandomvalues,oranotherteam’soutputbecomingstale).Thisisparticularlyimportantforonlinelearningsystems.
3. Retrainyourmodelsonaregularbasisonfreshdata(automateasmuchaspossible).
1. “Practical Bayesian Optimization of Machine Learning Algorithms,” J. Snoek, H. Larochelle, R. Adams (2012).
Appendix C. SVM Dual Problem
Tounderstandduality,youfirstneedtounderstandtheLagrangemultipliersmethod.Thegeneralideaistotransformaconstrainedoptimizationobjectiveintoanunconstrainedone,bymovingtheconstraintsintotheobjectivefunction.Let’slookatasimpleexample.Supposeyouwanttofindthevaluesofxandythatminimizethefunctionf(x,y)=x2+2y,subjecttoanequalityconstraint:3x+2y+1=0.UsingtheLagrangemultipliersmethod,westartbydefininganewfunctioncalledtheLagrangian(orLagrangefunction):g(x,y,α)=f(x,y)–α(3x+2y+1).Eachconstraint(inthiscasejustone)issubtractedfromtheoriginalobjective,multipliedbyanewvariablecalledaLagrangemultiplier.
Joseph-Louis Lagrange showed that if (x, y) is a solution to the constrained optimization problem, then there must exist an α such that (x, y, α) is a stationary point of the Lagrangian (a stationary point is a point where all partial derivatives are equal to zero). In other words, we can compute the partial derivatives of g(x, y, α) with regards to x, y, and α; we can find the points where these derivatives are all equal to zero; and the solutions to the constrained optimization problem (if they exist) must be among these stationary points.

In this example the partial derivatives are:

∂g/∂x = 2x − 3α
∂g/∂y = 2 − 2α
∂g/∂α = −(3x + 2y + 1)

When all these partial derivatives are equal to 0, we find that 2x − 3α = 2 − 2α = 3x + 2y + 1 = 0, from which we can easily find that x = 3/2, y = −11/4, and α = 1. This is the only stationary point, and as it respects the constraint, it must be the solution to the constrained optimization problem.
However, this method applies only to equality constraints. Fortunately, under some regularity conditions (which are respected by the SVM objectives), this method can be generalized to inequality constraints as well (e.g., 3x + 2y + 1 ≥ 0). The generalized Lagrangian for the hard margin problem is given by Equation C-1, where the α(i) variables are called the Karush–Kuhn–Tucker (KKT) multipliers, and they must be greater or equal to zero.

Equation C-1. Generalized Lagrangian for the hard margin problem

L(w, b, α) = (1/2) wT·w − Σ(i=1..m) α(i) [ t(i)(wT·x(i) + b) − 1 ],  with α(i) ≥ 0 for i = 1, 2, …, m

Just like with the Lagrange multipliers method, you can compute the partial derivatives and locate the stationary points. If there is a solution, it will necessarily be among the stationary points (w, b, α) that respect the KKT conditions:

Respect the problem's constraints: t(i)(wT·x(i) + b) ≥ 1 for i = 1, 2, …, m.

Verify α(i) ≥ 0 for i = 1, 2, …, m.

Either α(i) = 0 or the ith constraint must be an active constraint, meaning it must hold by equality: t(i)(wT·x(i) + b) = 1. This condition is called the complementary slackness condition. It implies that either α(i) = 0 or the ith instance lies on the boundary (it is a support vector).
We can compute the partial derivatives of the generalized Lagrangian with regards to w and b with Equation C-2.

Equation C-2. Partial derivatives of the generalized Lagrangian

∇w L(w, b, α) = w − Σ(i=1..m) α(i) t(i) x(i)
∂L/∂b = −Σ(i=1..m) α(i) t(i)

When these partial derivatives are equal to 0, we have Equation C-3.

Equation C-3. Properties of the stationary points

w = Σ(i=1..m) α(i) t(i) x(i)
Σ(i=1..m) α(i) t(i) = 0

If we plug these results into the definition of the generalized Lagrangian, some terms disappear and we find Equation C-4.

Equation C-4. Dual form of the SVM problem

L(w, b, α) = (1/2) Σ(i=1..m) Σ(j=1..m) α(i) α(j) t(i) t(j) x(i)T·x(j) − Σ(i=1..m) α(i),  with α(i) ≥ 0 for i = 1, 2, …, m

The goal is now to find the vector α that minimizes this function, with α(i) ≥ 0 for all instances. This constrained optimization problem is the dual problem we were looking for.

Once you find the optimal α, you can compute w using the first line of Equation C-3. To compute b, you can use the fact that a support vector verifies t(i)(wT·x(i) + b) = 1, so if the kth instance is a support vector (i.e., αk > 0), you can use it to compute b = t(k) − wT·x(k). However, it is often preferred to compute the average over all support vectors to get a more stable and precise value, as in Equation C-5.

Equation C-5. Bias term estimation using the dual form

b = (1/ns) Σ(i such that α(i) > 0) [ t(i) − wT·x(i) ],  where ns is the number of support vectors
Appendix D. Autodiff
This appendix explains how TensorFlow's autodiff feature works, and how it compares to other solutions.

Suppose you define a function f(x, y) = x²y + y + 2, and you need its partial derivatives ∂f/∂x and ∂f/∂y, typically to perform Gradient Descent (or some other optimization algorithm). Your main options are manual differentiation, symbolic differentiation, numerical differentiation, forward-mode autodiff, and finally reverse-mode autodiff. TensorFlow implements this last option. Let's go through each of these options.

Manual Differentiation

The first approach is to pick up a pencil and a piece of paper and use your calculus knowledge to derive the partial derivatives manually. For the function f(x, y) just defined, it is not too hard; you just need to use five rules:

The derivative of a constant is 0.

The derivative of λx is λ (where λ is a constant).

The derivative of x^λ is λx^(λ−1), so the derivative of x² is 2x.

The derivative of a sum of functions is the sum of these functions' derivatives.

The derivative of λ times a function is λ times its derivative.

From these rules, you can derive Equation D-1:

Equation D-1. Partial derivatives of f(x, y)

∂f/∂x = 2xy
∂f/∂y = x² + 1
Thisapproachcanbecomeverytediousformorecomplexfunctions,andyouruntheriskofmakingmistakes.Thegoodnewsisthatderivingthemathematicalequationsforthepartialderivativeslikewejustdidcanbeautomated,throughaprocesscalledsymbolicdifferentiation.
Symbolic Differentiation

Figure D-1 shows how symbolic differentiation works on an even simpler function, g(x, y) = 5 + xy. The graph for that function is represented on the left. After symbolic differentiation, we get the graph on the right, which represents the partial derivative ∂g/∂x (we could similarly obtain the partial derivative with regards to y).

Figure D-1. Symbolic differentiation

The algorithm starts by getting the partial derivative of the leaf nodes. The constant node (5) returns the constant 0, since the derivative of a constant is always 0. The variable x returns the constant 1 since ∂x/∂x = 1, and the variable y returns the constant 0 since ∂y/∂x = 0 (if we were looking for the partial derivative with regards to y, it would be the reverse).

Now we have all we need to move up the graph to the multiplication node in function g. Calculus tells us that the derivative of the product of two functions u and v is ∂(u × v)/∂x = ∂v/∂x × u + v × ∂u/∂x. We can therefore construct a large part of the graph on the right, representing 0 × x + y × 1.

Finally, we can go up to the addition node in function g. As mentioned, the derivative of a sum of functions is the sum of these functions' derivatives. So we just need to create an addition node and connect it to the parts of the graph we have already computed. We get the correct partial derivative: ∂g/∂x = 0 + (0 × x + y × 1).

However, it can be simplified (a lot). A few trivial pruning steps can be applied to this graph to get rid of all unnecessary operations, and we get a much smaller graph with just one node: ∂g/∂x = y.
Inthiscase,simplificationisfairlyeasy,butforamorecomplexfunction,symbolicdifferentiationcanproduceahugegraphthatmaybetoughtosimplifyandleadtosuboptimalperformance.Mostimportantly,symbolicdifferentiationcannotdealwithfunctionsdefinedwitharbitrarycode—forexample,the
followingfunctiondiscussedinChapter9:
defmy_func(a,b):
z=0
foriinrange(100):
z=a*np.cos(z+i)+z*np.sin(b-i)
returnz
Numerical Differentiation

The simplest solution is to compute an approximation of the derivatives, numerically. Recall that the derivative h′(x0) of a function h(x) at a point x0 is the slope of the function at that point, or more precisely Equation D-2.

Equation D-2. Derivative of a function h(x) at point x0

h′(x0) = lim(ϵ → 0) [ h(x0 + ϵ) − h(x0) ] / ϵ

So if we want to calculate the partial derivative of f(x, y) with regards to x, at x = 3 and y = 4, we can simply compute f(3 + ϵ, 4) − f(3, 4) and divide the result by ϵ, using a very small value for ϵ. That's exactly what the following code does:
def f(x, y):
    return x**2*y + y + 2

def derivative(f, x, y, x_eps, y_eps):
    return (f(x + x_eps, y + y_eps) - f(x, y)) / (x_eps + y_eps)

df_dx = derivative(f, 3, 4, 0.00001, 0)
df_dy = derivative(f, 3, 4, 0, 0.00001)

Unfortunately, the result is imprecise (and it gets worse for more complex functions). The correct results are respectively 24 and 10, but instead we get:

>>> print(df_dx)
24.000039999805264
>>> print(df_dy)
10.000000000331966
Noticethattocomputebothpartialderivatives,wehavetocallf()atleastthreetimes(wecalleditfourtimesintheprecedingcode,butitcouldbeoptimized).Iftherewere1,000parameters,wewouldneedtocallf()atleast1,001times.Whenyouaredealingwithlargeneuralnetworks,thismakesnumericaldifferentiationwaytooinefficient.
However,numericaldifferentiationissosimpletoimplementthatitisagreattooltocheckthattheothermethodsareimplementedcorrectly.Forexample,ifitdisagreeswithyourmanuallyderivedfunction,thenyourfunctionprobablycontainsamistake.
Forward-Mode Autodiff

Forward-mode autodiff is neither numerical differentiation nor symbolic differentiation, but in some ways it is their love child. It relies on dual numbers, which are (weird but fascinating) numbers of the form a + bϵ where a and b are real numbers and ϵ is an infinitesimal number such that ϵ² = 0 (but ϵ ≠ 0). You can think of the dual number 42 + 24ϵ as something akin to 42.0000…0024 with an infinite number of 0s (but of course this is simplified just to give you some idea of what dual numbers are). A dual number is represented in memory as a pair of floats. For example, 42 + 24ϵ is represented by the pair (42.0, 24.0).

Dual numbers can be added, multiplied, and so on, as shown in Equation D-3.

Equation D-3. A few operations with dual numbers

λ(a + bϵ) = λa + λbϵ
(a + bϵ) + (c + dϵ) = (a + c) + (b + d)ϵ
(a + bϵ) × (c + dϵ) = ac + (ad + bc)ϵ + (bd)ϵ² = ac + (ad + bc)ϵ

Most importantly, it can be shown that h(a + bϵ) = h(a) + b × h′(a)ϵ, so computing h(a + ϵ) gives you both h(a) and the derivative h′(a) in just one shot. Figure D-2 shows how forward-mode autodiff computes the partial derivative of f(x, y) with regards to x at x = 3 and y = 4. All we need to do is compute f(3 + ϵ, 4); this will output a dual number whose first component is equal to f(3, 4) and whose second component is equal to ∂f/∂x(3, 4).
FigureD-2.Forward-modeautodiff
To compute ∂f/∂y(3, 4) we would have to go through the graph again, but this time with x = 3 and y = 4 + ϵ.
Soforward-modeautodiffismuchmoreaccuratethannumericaldifferentiation,butitsuffersfromthesamemajorflaw:iftherewere1,000parameters,itwouldrequire1,000passesthroughthegraphtocomputeallthepartialderivatives.Thisiswherereverse-modeautodiffshines:itcancomputealloftheminjusttwopassesthroughthegraph.
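Dual numbers are easy to emulate in pure Python; here is a minimal sketch (only addition and multiplication are implemented, which is enough for f(x, y) = x²y + y + 2):

class Dual(object):
    # a dual number a + b*eps, stored as the pair (value, eps)
    def __init__(self, value, eps=0.0):
        self.value = value
        self.eps = eps
    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.eps + other.eps)
    __radd__ = __add__
    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value * other.value,
                    self.value * other.eps + self.eps * other.value)  # eps**2 = 0
    __rmul__ = __mul__

def f(x, y):
    return x * x * y + y + 2

result = f(Dual(3.0, 1.0), Dual(4.0))  # x = 3 + eps, y = 4
print(result.value, result.eps)        # 42.0 24.0, i.e., f(3, 4) and df/dx(3, 4)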
Reverse-Mode Autodiff

Reverse-mode autodiff is the solution implemented by TensorFlow. It first goes through the graph in the forward direction (i.e., from the inputs to the output) to compute the value of each node. Then it does a second pass, this time in the reverse direction (i.e., from the output to the inputs) to compute all the partial derivatives. Figure D-3 represents the second pass. During the first pass, all the node values were computed, starting from x = 3 and y = 4. You can see those values at the bottom right of each node (e.g., x × x = 9). The nodes are labeled n1 to n7 for clarity. The output node is n7: f(3, 4) = n7 = 42.
FigureD-3.Reverse-modeautodiff
The idea is to gradually go down the graph, computing the partial derivative of f(x, y) with regards to each consecutive node, until we reach the variable nodes. For this, reverse-mode autodiff relies heavily on the chain rule, shown in Equation D-4.

Equation D-4. Chain rule

∂f/∂x = ∂f/∂ni × ∂ni/∂x

Since n7 is the output node, f = n7 so trivially ∂f/∂n7 = 1.

Let's continue down the graph to n5: how much does f vary when n5 varies? The answer is ∂f/∂n5 = ∂f/∂n7 × ∂n7/∂n5. We already know that ∂f/∂n7 = 1, so all we need is ∂n7/∂n5. Since n7 simply performs the sum n5 + n6, we find that ∂n7/∂n5 = 1, so ∂f/∂n5 = 1 × 1 = 1.

Now we can proceed to node n4: how much does f vary when n4 varies? The answer is ∂f/∂n4 = ∂f/∂n5 × ∂n5/∂n4. Since n5 = n4 × n2, we find that ∂n5/∂n4 = n2, so ∂f/∂n4 = 1 × n2 = 1 × 4 = 4.

The process continues until we reach the bottom of the graph. At that point we will have calculated all the partial derivatives of f(x, y) at the point x = 3 and y = 4. In this example, we find ∂f/∂x = 24 and ∂f/∂y = 10. Sounds about right!
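Here is a minimal Python sketch of the same idea: the forward pass stores each node's value and local gradients, and the reverse pass propagates gradients back to the inputs using the chain rule. This is a toy illustration only, not TensorFlow's actual mechanism:

class Node(object):
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # list of (parent_node, local_gradient) pairs
        self.grad = 0.0

def add(a, b):
    return Node(a.value + b.value, [(a, 1.0), (b, 1.0)])

def mul(a, b):
    return Node(a.value * b.value, [(a, b.value), (b, a.value)])

def backpropagate(node, grad=1.0):
    # reverse pass: accumulate each node's gradient using the chain rule
    node.grad += grad
    for parent, local_grad in node.parents:
        backpropagate(parent, grad * local_grad)

x, y, two = Node(3.0), Node(4.0), Node(2.0)
f = add(add(mul(mul(x, x), y), y), two)   # f = x*x*y + y + 2
backpropagate(f)
print(f.value, x.grad, y.grad)            # 42.0 24.0 10.0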
Reverse-modeautodiffisaverypowerfulandaccuratetechnique,especiallywhentherearemanyinputsandfewoutputs,sinceitrequiresonlyoneforwardpassplusonereversepassperoutputtocomputeallthepartialderivativesforalloutputswithregardstoalltheinputs.Mostimportantly,itcandealwithfunctionsdefinedbyarbitrarycode.Itcanalsohandlefunctionsthatarenotentirelydifferentiable,aslongasyouaskittocomputethepartialderivativesatpointsthataredifferentiable.
TIPIfyouimplementanewtypeofoperationinTensorFlowandyouwanttomakeitcompatiblewithautodiff,thenyouneedtoprovideafunctionthatbuildsasubgraphtocomputeitspartialderivativeswithregardstoitsinputs.Forexample,supposeyouimplementafunctionthatcomputesthesquareofitsinputf(x)=x2.Inthatcaseyouwouldneedtoprovidethecorrespondingderivativefunctionf′(x)=2x.Notethatthisfunctiondoesnotcomputeanumericalresult,butinsteadbuildsasubgraphthatwill(later)computetheresult.Thisisveryusefulbecauseitmeansthatyoucancomputegradientsofgradients(tocomputesecond-orderderivatives,orevenhigher-orderderivatives).
Appendix E. Other Popular ANN Architectures
InthisappendixwewillgiveaquickoverviewofafewhistoricallyimportantneuralnetworkarchitecturesthataremuchlessusedtodaythandeepMulti-LayerPerceptrons(Chapter10),convolutionalneuralnetworks(Chapter13),recurrentneuralnetworks(Chapter14),orautoencoders(Chapter15).Theyareoftenmentionedintheliterature,andsomearestillusedinmanyapplications,soitisworthknowingaboutthem.Moreover,wewilldiscussdeepbeliefnets(DBNs),whichwerethestateoftheartinDeepLearninguntiltheearly2010s.Theyarestillthesubjectofveryactiveresearch,sotheymaywellcomebackwithavengeanceinthenearfuture.
HopfieldNetworksHopfieldnetworkswerefirstintroducedbyW.A.Littlein1974,thenpopularizedbyJ.Hopfieldin1982.Theyareassociativememorynetworks:youfirstteachthemsomepatterns,andthenwhentheyseeanewpatternthey(hopefully)outputtheclosestlearnedpattern.Thishasmadethemusefulinparticularforcharacterrecognitionbeforetheywereoutperformedbyotherapproaches.Youfirsttrainthenetworkbyshowingitexamplesofcharacterimages(eachbinarypixelmapstooneneuron),andthenwhenyoushowitanewcharacterimage,afterafewiterationsitoutputstheclosestlearnedcharacter.
Theyarefullyconnectedgraphs(seeFigureE-1);thatis,everyneuronisconnectedtoeveryotherneuron.Notethatonthediagramtheimagesare6×6pixels,sotheneuralnetworkontheleftshouldcontain36neurons(and648connections),butforvisualclarityamuchsmallernetworkisrepresented.
FigureE-1.Hopfieldnetwork
ThetrainingalgorithmworksbyusingHebb’srule:foreachtrainingimage,theweightbetweentwoneuronsisincreasedifthecorrespondingpixelsarebothonorbothoff,butdecreasedifonepixelisonandtheotherisoff.
Toshowanewimagetothenetwork,youjustactivatetheneuronsthatcorrespondtoactivepixels.Thenetworkthencomputestheoutputofeveryneuron,andthisgivesyouanewimage.Youcanthentakethisnewimageandrepeatthewholeprocess.Afterawhile,thenetworkreachesastablestate.Generally,thiscorrespondstothetrainingimagethatmostresemblestheinputimage.
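A minimal NumPy sketch of this training and recall procedure, assuming pixel values of −1/+1 (which turns Hebb's rule into a simple outer product); the update scheme and iteration count are simplifying assumptions:

import numpy as np

def train_hopfield(patterns):
    # Hebb's rule: a weight grows when the two pixels agree (+1/+1 or -1/-1), shrinks otherwise
    n_neurons = patterns.shape[1]
    W = np.zeros((n_neurons, n_neurons))
    for p in patterns:
        W += np.outer(p, p)
    np.fill_diagonal(W, 0)               # no self-connections
    return W / len(patterns)

def recall(W, state, n_iterations=10):
    for _ in range(n_iterations):
        state = np.sign(W.dot(state))    # recompute every neuron's output
        state[state == 0] = 1
    return state

In practice you would train on full-size character bitmaps and call recall() on a new or corrupted bitmap; with too many stored patterns or too much noise, the network may stabilize on a spurious pattern instead of a learned one.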
Aso-calledenergyfunctionisassociatedwithHopfieldnets.Ateachiteration,theenergydecreases,sothenetworkisguaranteedtoeventuallystabilizetoalow-energystate.Thetrainingalgorithmtweakstheweightsinawaythatdecreasestheenergylevelofthetrainingpatterns,sothenetworkislikelytostabilizeinoneoftheselow-energyconfigurations.Unfortunately,somepatternsthatwerenotinthetrainingsetalsoendupwithlowenergy,sothenetworksometimesstabilizesinaconfigurationthatwas
notlearned.Thesearecalledspuriouspatterns.
AnothermajorflawwithHopfieldnetsisthattheydon’tscaleverywell—theirmemorycapacityisroughlyequalto14%ofthenumberofneurons.Forexample,toclassify28×28images,youwouldneedaHopfieldnetwith784fullyconnectedneuronsand306,936weights.Suchanetworkwouldonlybeabletolearnabout110differentcharacters(14%of784).That’salotofparametersforsuchasmallmemory.
BoltzmannMachinesBoltzmannmachineswereinventedin1985byGeoffreyHintonandTerrenceSejnowski.JustlikeHopfieldnets,theyarefullyconnectedANNs,buttheyarebasedonstochasticneurons:insteadofusingadeterministicstepfunctiontodecidewhatvaluetooutput,theseneuronsoutput1withsomeprobability,and0otherwise.TheprobabilityfunctionthattheseANNsuseisbasedontheBoltzmanndistribution(usedinstatisticalmechanics)hencetheirname.EquationE-1givestheprobabilitythataparticularneuronwilloutputa1.
Equation E-1. Probability that the ith neuron will output 1

p( si = 1 ) = σ( ( Σ(j=1..N) wi,j sj + bi ) / T )

sj is the jth neuron's state (0 or 1).

wi,j is the connection weight between the ith and jth neurons. Note that wi,i = 0.

bi is the ith neuron's bias term. We can implement this term by adding a bias neuron to the network.

N is the number of neurons in the network.

T is a number called the network's temperature; the higher the temperature, the more random the output is (i.e., the more the probability approaches 50%).

σ is the logistic function.
NeuronsinBoltzmannmachinesareseparatedintotwogroups:visibleunitsandhiddenunits(seeFigureE-2).Allneuronsworkinthesamestochasticway,butthevisibleunitsaretheonesthatreceivetheinputsandfromwhichoutputsareread.
FigureE-2.Boltzmannmachine
Becauseofitsstochasticnature,aBoltzmannmachinewillneverstabilizeintoafixedconfiguration,butinsteaditwillkeepswitchingbetweenmanyconfigurations.Ifitisleftrunningforasufficientlylongtime,theprobabilityofobservingaparticularconfigurationwillonlybeafunctionoftheconnectionweightsandbiasterms,notoftheoriginalconfiguration(similarly,afteryoushuffleadeckofcardsforlongenough,theconfigurationofthedeckdoesnotdependontheinitialstate).Whenthenetworkreachesthisstatewheretheoriginalconfigurationis“forgotten,”itissaidtobeinthermalequilibrium(althoughitsconfigurationkeepschangingallthetime).Bysettingthenetworkparametersappropriately,lettingthenetworkreachthermalequilibrium,andthenobservingitsstate,wecansimulateawiderangeofprobabilitydistributions.Thisiscalledagenerativemodel.
TrainingaBoltzmannmachinemeansfindingtheparametersthatwillmakethenetworkapproximatethetrainingset’sprobabilitydistribution.Forexample,iftherearethreevisibleneuronsandthetrainingsetcontains75%(0,1,1)triplets,10%(0,0,1)triplets,and15%(1,1,1)triplets,thenaftertrainingaBoltzmannmachine,youcoulduseittogeneraterandombinarytripletswithaboutthesameprobabilitydistribution.Forexample,about75%ofthetimeitwouldoutputthe(0,1,1)triplet.
Suchagenerativemodelcanbeusedinavarietyofways.Forexample,ifitistrainedonimages,andyouprovideanincompleteornoisyimagetothenetwork,itwillautomatically“repair”theimageinareasonableway.Youcanalsouseagenerativemodelforclassification.Justaddafewvisibleneuronstoencodethetrainingimage’sclass(e.g.,add10visibleneuronsandturnononlythefifthneuronwhenthetrainingimagerepresentsa5).Then,whengivenanewimage,thenetworkwillautomaticallyturnonthe
appropriatevisibleneurons,indicatingtheimage’sclass(e.g.,itwillturnonthefifthvisibleneuroniftheimagerepresentsa5).
Unfortunately,thereisnoefficienttechniquetotrainBoltzmannmachines.However,fairlyefficientalgorithmshavebeendevelopedtotrainrestrictedBoltzmannmachines(RBM).
RestrictedBoltzmannMachinesAnRBMissimplyaBoltzmannmachineinwhichtherearenoconnectionsbetweenvisibleunitsorbetweenhiddenunits,onlybetweenvisibleandhiddenunits.Forexample,FigureE-3representsanRBMwiththreevisibleunitsandfourhiddenunits.
FigureE-3.RestrictedBoltzmannmachine
A very efficient training algorithm, called Contrastive Divergence, was introduced in 2005 by Miguel Á. Carreira-Perpiñán and Geoffrey Hinton.1 Here is how it works: for each training instance x, the algorithm starts by feeding it to the network by setting the state of the visible units to x1, x2, …, xn. Then you compute the state of the hidden units by applying the stochastic equation described before (Equation E-1). This gives you a hidden vector h (where hi is equal to the state of the ith unit). Next you compute the state of the visible units, by applying the same stochastic equation. This gives you a vector x′. Then once again you compute the state of the hidden units, which gives you a vector h′. Now you can update each connection weight by applying the rule in Equation E-2, where η is the learning rate.

Equation E-2. Contrastive divergence weight update

wi,j ← wi,j + η( x hT − x′ h′T )i,j

The great benefit of this algorithm is that it does not require waiting for the network to reach thermal equilibrium: it just goes forward, backward, and forward again, and that's it. This makes it incomparably more efficient than previous algorithms, and it was a key ingredient to the first success of Deep Learning based on multiple stacked RBMs.
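A rough NumPy sketch of one such update for a single training instance x; whether the sampled states or the probabilities are used in each term varies between implementations, and the learning rate is an arbitrary assumption:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def contrastive_divergence_step(W, b_visible, b_hidden, x, learning_rate=0.01):
    # forward: compute the hidden units' activation probabilities and sample their states
    h_prob = sigmoid(W.T.dot(x) + b_hidden)
    h = (np.random.rand(len(h_prob)) < h_prob).astype(np.float64)
    # backward: reconstruct the visible units, then recompute the hidden units
    x_prime = (np.random.rand(len(b_visible)) < sigmoid(W.dot(h) + b_visible)).astype(np.float64)
    h_prime_prob = sigmoid(W.T.dot(x_prime) + b_hidden)
    # update (Equation E-2): positive phase minus negative phase
    W += learning_rate * (np.outer(x, h_prob) - np.outer(x_prime, h_prime_prob))
    b_visible += learning_rate * (x - x_prime)
    b_hidden += learning_rate * (h_prob - h_prime_prob)
    return W, b_visible, b_hidden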
DeepBeliefNetsSeverallayersofRBMscanbestacked;thehiddenunitsofthefirst-levelRBMservesasthevisibleunitsforthesecond-layerRBM,andsoon.SuchanRBMstackiscalledadeepbeliefnet(DBN).
Yee-WhyeTeh,oneofGeoffreyHinton’sstudents,observedthatitwaspossibletotrainDBNsonelayeratatimeusingContrastiveDivergence,startingwiththelowerlayersandthengraduallymovinguptothetoplayers.ThisledtothegroundbreakingarticlethatkickstartedtheDeepLearningtsunamiin2006.2
JustlikeRBMs,DBNslearntoreproducetheprobabilitydistributionoftheirinputs,withoutanysupervision.However,theyaremuchbetteratit,forthesamereasonthatdeepneuralnetworksaremorepowerfulthanshallowones:real-worlddataisoftenorganizedinhierarchicalpatterns,andDBNstakeadvantageofthat.Theirlowerlayerslearnlow-levelfeaturesintheinputdata,whilehigherlayerslearnhigh-levelfeatures.
JustlikeRBMs,DBNsarefundamentallyunsupervised,butyoucanalsotraintheminasupervisedmannerbyaddingsomevisibleunitstorepresentthelabels.Moreover,onegreatfeatureofDBNsisthattheycanbetrainedinasemisupervisedfashion.FigureE-4representssuchaDBNconfiguredforsemisupervisedlearning.
FigureE-4.Adeepbeliefnetworkconfiguredforsemisupervisedlearning
First,theRBM1istrainedwithoutsupervision.Itlearnslow-levelfeaturesinthetrainingdata.ThenRBM2istrainedwithRBM1’shiddenunitsasinputs,againwithoutsupervision:itlearnshigher-levelfeatures(notethatRBM2’shiddenunitsincludeonlythethreerightmostunits,notthelabelunits).SeveralmoreRBMscouldbestackedthisway,butyougettheidea.Sofar,trainingwas100%unsupervised.Lastly,RBM3istrainedusingbothRBM2’shiddenunitsasinputs,aswellasextravisibleunitsusedtorepresentthetargetlabels(e.g.,aone-hotvectorrepresentingtheinstanceclass).Itlearnstoassociatehigh-levelfeatureswithtraininglabels.Thisisthesupervisedstep.
Attheendoftraining,ifyoufeedRBM1anewinstance,thesignalwillpropagateuptoRBM2,thenuptothetopofRBM3,andthenbackdowntothelabelunits;hopefully,theappropriatelabelwilllightup.ThisishowaDBNcanbeusedforclassification.
Onegreatbenefitofthissemisupervisedapproachisthatyoudon’tneedmuchlabeledtrainingdata.IftheunsupervisedRBMsdoagoodenoughjob,thenonlyasmallamountoflabeledtraininginstancesperclasswillbenecessary.Similarly,ababylearnstorecognizeobjectswithoutsupervision,sowhenyoupointtoachairandsay“chair,”thebabycanassociatetheword“chair”withtheclassofobjectsithasalreadylearnedtorecognizeonitsown.Youdon’tneedtopointtoeverysinglechairandsay“chair”;onlyafewexampleswillsuffice(justenoughsothebabycanbesurethatyouareindeedreferringtothechair,nottoitscolororoneofthechair’sparts).
Quiteamazingly,DBNscanalsoworkinreverse.Ifyouactivateoneofthelabelunits,thesignalwillpropagateuptothehiddenunitsofRBM3,thendowntoRBM2,andthenRBM1,andanewinstancewillbeoutputbythevisibleunitsofRBM1.Thisnewinstancewillusuallylooklikearegularinstanceoftheclasswhoselabelunityouactivated.ThisgenerativecapabilityofDBNsisquitepowerful.Forexample,ithasbeenusedtoautomaticallygeneratecaptionsforimages,andviceversa:firstaDBNistrained(withoutsupervision)tolearnfeaturesinimages,andanotherDBNistrained(againwithoutsupervision)tolearnfeaturesinsetsofcaptions(e.g.,“car”oftencomeswith“automobile”).ThenanRBMisstackedontopofbothDBNsandtrainedwithasetofimagesalongwiththeircaptions;itlearnstoassociatehigh-levelfeaturesinimageswithhigh-levelfeaturesincaptions.Next,ifyoufeedtheimageDBNanimageofacar,thesignalwillpropagatethroughthenetwork,uptothetop-levelRBM,andbackdowntothebottomofthecaptionDBN,producingacaption.DuetothestochasticnatureofRBMsandDBNs,thecaptionwillkeepchangingrandomly,butitwillgenerallybeappropriatefortheimage.Ifyougenerateafewhundredcaptions,themostfrequentlygeneratedoneswilllikelybeagooddescriptionoftheimage.3
Self-Organizing Maps
Self-organizing maps (SOM) are quite different from all the other types of neural networks we have discussed so far. They are used to produce a low-dimensional representation of a high-dimensional dataset, generally for visualization, clustering, or classification. The neurons are spread across a map (typically 2D for visualization, but it can be any number of dimensions you want), as shown in Figure E-5, and each neuron has a weighted connection to every input (note that the diagram shows just two inputs, but there are typically a very large number, since the whole point of SOMs is to reduce dimensionality).
Figure E-5. Self-organizing maps
Once the network is trained, you can feed it a new instance and this will activate only one neuron (i.e., one point on the map): the neuron whose weight vector is closest to the input vector. In general, instances that are nearby in the original input space will activate neurons that are nearby on the map. This makes SOMs useful for visualization (in particular, you can easily identify clusters on the map), but also for applications like speech recognition. For example, if each instance represents the audio recording of a person pronouncing a vowel, then different pronunciations of the vowel "a" will activate neurons in the same area of the map, while instances of the vowel "e" will activate neurons in another area, and intermediate sounds will generally activate intermediate neurons on the map.
NOTE
One important difference with the other dimensionality reduction techniques discussed in Chapter 8 is that all instances get mapped to a discrete number of points in the low-dimensional space (one point per neuron). When there are very few neurons, this technique is better described as clustering rather than dimensionality reduction.
The training algorithm is unsupervised. It works by having all the neurons compete against each other. First, all the weights are initialized randomly. Then a training instance is picked randomly and fed to the network. All neurons compute the distance between their weight vector and the input vector (this is very different from the artificial neurons we have seen so far). The neuron that measures the smallest distance wins and tweaks its weight vector to be even slightly closer to the input vector, making it more likely to win future competitions for other inputs similar to this one. It also recruits its neighboring neurons, and they too update their weight vectors to be slightly closer to the input vector (but they don't update their weights as much as the winner neuron). Then the algorithm picks another training instance and repeats the process, again and again. This algorithm tends to make nearby neurons gradually specialize in similar inputs.⁴
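The appendix gives no SOM implementation, but the competitive learning loop just described fits in a short NumPy sketch; the map size, learning-rate decay, and Gaussian neighborhood function below are illustrative assumptions (a common choice, not the only one).

import numpy as np

def train_som(X, map_rows=10, map_cols=10, n_iterations=5000,
              learning_rate=0.5, radius=3.0, random_state=42):
    """Train a self-organizing map: a (map_rows, map_cols, n_features) weight grid."""
    rnd = np.random.RandomState(random_state)
    n_features = X.shape[1]
    weights = rnd.rand(map_rows, map_cols, n_features)   # random initialization
    # Grid coordinates of every neuron, used to compute map-space distances
    grid = np.array([[i, j] for i in range(map_rows) for j in range(map_cols)])
    grid = grid.reshape(map_rows, map_cols, 2)

    for t in range(n_iterations):
        decay = np.exp(-t / n_iterations)                 # shrink updates over time
        x = X[rnd.randint(len(X))]                        # pick a random training instance
        # Competition: the neuron whose weight vector is closest to x wins
        dists = np.linalg.norm(weights - x, axis=2)
        winner = np.unravel_index(np.argmin(dists), dists.shape)
        # Cooperation: neurons near the winner (on the map) are also pulled toward x,
        # with a Gaussian neighborhood centered on the winner
        map_dists = np.linalg.norm(grid - np.array(winner), axis=2)
        neighborhood = np.exp(-(map_dists ** 2) / (2 * (radius * decay) ** 2))
        weights += learning_rate * decay * neighborhood[..., np.newaxis] * (x - weights)
    return weights

def map_instance(weights, x):
    """Return the map coordinates of the neuron activated by instance x."""
    dists = np.linalg.norm(weights - x, axis=2)
    return np.unravel_index(np.argmin(dists), dists.shape)

# Example: map 3D points onto a 10x10 grid
X = np.random.RandomState(0).rand(500, 3)
som = train_som(X)
print(map_instance(som, X[0]))

After training, map_instance() returns the winning neuron for a new instance, i.e., its point on the map; nearby instances in the original space should land on nearby neurons.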
1. "On Contrastive Divergence Learning," M. Á. Carreira-Perpiñán and G. Hinton (2005).
2. "A Fast Learning Algorithm for Deep Belief Nets," G. Hinton, S. Osindero, Y. Teh (2006).
3. See this video by Geoffrey Hinton for more details and a demo: http://goo.gl/7Z5QiS.
4. You can imagine a class of young children with roughly similar skills. One child happens to be slightly better at basketball. This motivates her to practice more, especially with her friends. After a while, this group of friends gets so good at basketball that other kids cannot compete. But that's okay, because the other kids specialize in other topics. After a while, the class is full of little specialized groups.
Index
Symbols
__call__(),StaticUnrollingThroughTime
ε-greedypolicy,ExplorationPolicies,LearningtoPlayMs.Pac-ManUsingDeepQ-Learning
ε-insensitive,SVMRegression
χ2test(seechisquaretest)
ℓ0norm,SelectaPerformanceMeasure
ℓ1andℓ2regularization,ℓ1andℓ2Regularization-ℓ1andℓ2Regularization
ℓ1norm,SelectaPerformanceMeasure,LassoRegression,DecisionBoundaries,AdamOptimization,AvoidingOverfittingThroughRegularization
ℓ2norm,SelectaPerformanceMeasure,RidgeRegression-LassoRegression,DecisionBoundaries,SoftmaxRegression,AvoidingOverfittingThroughRegularization,Max-NormRegularization
ℓknorm,SelectaPerformanceMeasure
ℓ∞norm,SelectaPerformanceMeasure
A
accuracy,WhatIsMachineLearning?,MeasuringAccuracyUsingCross-Validation-MeasuringAccuracyUsingCross-Validation
actions,evaluating,EvaluatingActions:TheCreditAssignmentProblem-EvaluatingActions:TheCreditAssignmentProblem
activationfunctions,Multi-LayerPerceptronandBackpropagation-Multi-LayerPerceptronandBackpropagation
activeconstraints,SVMDualProblem
actors,LearningtoPlayMs.Pac-ManUsingDeepQ-Learning
actualclass,ConfusionMatrix
AdaBoost,AdaBoost-AdaBoost
Adagrad,AdaGrad-AdaGrad
Adamoptimization,AdamOptimization-AdamOptimization,AdamOptimization
adaptivelearningrate,AdaGrad
adaptivemomentoptimization,AdamOptimization
agents,LearningtoOptimizeRewards
AlexNetarchitecture,AlexNet-AlexNet
algorithms
preparingdatafor,PreparetheDataforMachineLearningAlgorithms-SelectandTrainaModel
AlphaGo,ReinforcementLearning,IntroductiontoArtificialNeuralNetworks,ReinforcementLearning,PolicyGradients
Anaconda,CreatetheWorkspace
anomalydetection,Unsupervisedlearning
Apple’sSiri,IntroductiontoArtificialNeuralNetworks
apply_gradients(),GradientClipping,PolicyGradients
areaunderthecurve(AUC),TheROCCurve
array_split(),IncrementalPCA
artificialneuralnetworks(ANNs),IntroductiontoArtificialNeuralNetworks-Exercises
BoltzmannMachines,BoltzmannMachines-BoltzmannMachines
deepbeliefnetworks(DBNs),DeepBeliefNets-DeepBeliefNets
evolutionof,FromBiologicaltoArtificialNeurons
HopfieldNetworks,HopfieldNetworks-HopfieldNetworks
hyperparameterfine-tuning,Fine-TuningNeuralNetworkHyperparameters-ActivationFunctions
overview,IntroductiontoArtificialNeuralNetworks-FromBiologicaltoArtificialNeurons
Perceptrons,ThePerceptron-Multi-LayerPerceptronandBackpropagation
self-organizingmaps,Self-OrganizingMaps-Self-OrganizingMaps
trainingaDNNwithTensorFlow,TrainingaDNNUsingPlainTensorFlow-UsingtheNeuralNetwork
artificialneuron,LogicalComputationswithNeurons
(seealsoartificialneuralnetwork(ANN))
assign(),ManuallyComputingtheGradients
associationrulelearning,Unsupervisedlearning
associativememorynetworks,HopfieldNetworks
assumptions,checking,ChecktheAssumptions
asynchronousupdates,Asynchronousupdates-Asynchronousupdates
asynchrouscommunication,AsynchronousCommunicationUsingTensorFlowQueues-PaddingFifoQueue
atrous_conv2d(),ResNet
attentionmechanism,AnEncoder–DecoderNetworkforMachineTranslation
attributes,Supervisedlearning,TakeaQuickLookattheDataStructure-TakeaQuickLookattheDataStructure
(seealsodatastructure)
combinationsof,ExperimentingwithAttributeCombinations-ExperimentingwithAttributeCombinations
preprocessed,TakeaQuickLookattheDataStructure
target,TakeaQuickLookattheDataStructure
autodiff,Usingautodiff-Usingautodiff,Autodiff-Reverse-ModeAutodiff
forward-mode,Forward-ModeAutodiff-Forward-ModeAutodiff
manualdifferentiation,ManualDifferentiation
numericaldifferentiation,NumericalDifferentiation
reverse-mode,Reverse-ModeAutodiff-Reverse-ModeAutodiff
symbolicdifferentiation,SymbolicDifferentiation-NumericalDifferentiation
autoencoders,Autoencoders-Exercises
adversarial,OtherAutoencoders
contractive,OtherAutoencoders
denoising,DenoisingAutoencoders-TensorFlowImplementation
efficientdatarepresentations,EfficientDataRepresentations
generativestochasticnetwork(GSN),OtherAutoencoders
overcomplete,UnsupervisedPretrainingUsingStackedAutoencoders
PCAwithundercompletelinearautoencoder,PerformingPCAwithanUndercompleteLinearAutoencoder
reconstructions,EfficientDataRepresentations
sparse,SparseAutoencoders-TensorFlowImplementation
stacked,StackedAutoencoders-UnsupervisedPretrainingUsingStackedAutoencoders
stackedconvolutional,OtherAutoencoders
undercomplete,EfficientDataRepresentations
variational,VariationalAutoencoders-GeneratingDigits
visualizingfeatures,VisualizingFeatures-VisualizingFeatures
winner-take-all(WTA),OtherAutoencoders
automaticdifferentiating,UpandRunningwithTensorFlow
autonomousdrivingsystems,RecurrentNeuralNetworks
AverageAbsoluteDeviation,SelectaPerformanceMeasure
averagepoolinglayer,PoolingLayer
avg_pool(),PoolingLayer
B
backpropagation,Multi-LayerPerceptronandBackpropagation-Multi-LayerPerceptronandBackpropagation,Vanishing/ExplodingGradientsProblems,UnsupervisedPretraining,VisualizingFeatures
backpropagationthroughtime(BPTT),TrainingRNNs
baggingandpasting,BaggingandPasting-Out-of-BagEvaluation
out-of-bagevaluation,Out-of-BagEvaluation-Out-of-BagEvaluation
inScikit-Learn,BaggingandPastinginScikit-Learn-BaggingandPastinginScikit-Learn
bandwidthsaturation,Bandwidthsaturation-Bandwidthsaturation
BasicLSTMCell,LSTMCell
BasicRNNCell,DistributingaDeepRNNAcrossMultipleGPUs-DistributingaDeepRNNAcrossMultipleGPUs
BatchGradientDescent,BatchGradientDescent-BatchGradientDescent,LassoRegression
batchlearning,Batchlearning-Batchlearning
BatchNormalization,BatchNormalization-ImplementingBatchNormalizationwithTensorFlow,ResNet
operationsummary,BatchNormalization
withTensorFlow,ImplementingBatchNormalizationwithTensorFlow-ImplementingBatchNormalizationwithTensorFlow
batch(),Otherconveniencefunctions
batch_join(),Otherconveniencefunctions
batch_normalization(),ImplementingBatchNormalizationwithTensorFlow-ImplementingBatchNormalizationwithTensorFlow
BellmanOptimalityEquation,MarkovDecisionProcesses
between-graphreplication,In-GraphVersusBetween-GraphReplication
biasneurons,ThePerceptron
biasterm,LinearRegression
bias/variancetradeoff,LearningCurves
biases,ConstructionPhase
binaryclassifiers,TrainingaBinaryClassifier,LogisticRegression
biologicalneurons,FromBiologicaltoArtificialNeurons-BiologicalNeurons
blackboxmodels,MakingPredictions
blending,Stacking-Exercises
BoltzmannMachines,BoltzmannMachines-BoltzmannMachines
(seealsorestrictedBoltzmanmachines(RBMs))
boosting,Boosting-GradientBoosting
AdaBoost,AdaBoost-AdaBoost
GradientBoosting,GradientBoosting-GradientBoosting
bootstrapaggregation(seebagging)
bootstrapping,GridSearch,BaggingandPasting,IntroductiontoOpenAIGym,LearningtoPlayMs.Pac-ManUsingDeepQ-Learning
bottlenecklayers,GoogLeNet
brew,Stacking
C
Caffemodelzoo,ModelZoos
CART(ClassificationandRegressionTree)algorithm,MakingPredictions-TheCARTTrainingAlgorithm,Regression
categoricalattributes,HandlingTextandCategoricalAttributes-HandlingTextandCategoricalAttributes
cellwrapper,TrainingtoPredictTimeSeries
chisquaretest,RegularizationHyperparameters
classificationversusregression,Supervisedlearning,MultioutputClassification
classifiers
binary,TrainingaBinaryClassifier
erroranalysis,ErrorAnalysis-ErrorAnalysis
evaluating,MulticlassClassification
MNISTdataset,MNIST-MNIST
multiclass,MulticlassClassification-MulticlassClassification
multilabel,MultilabelClassification-MultilabelClassification
multioutput,MultioutputClassification-MultioutputClassification
performancemeasures,PerformanceMeasures-TheROCCurve
precisionof,ConfusionMatrix
voting,VotingClassifiers-VotingClassifiers
clip_by_value(),GradientClipping
closed-formequation,TrainingModels,RidgeRegression,TrainingandCostFunction
clusterspecification,MultipleDevicesAcrossMultipleServers
clusteringalgorithms,Unsupervisedlearning
clusters,MultipleDevicesAcrossMultipleServers
codingspace,VariationalAutoencoders
codings,Autoencoders
complementaryslacknesscondition,SVMDualProblem
components_,UsingScikit-Learn
computationalcomplexity,ComputationalComplexity,ComputationalComplexity,ComputationalComplexity
compute_gradients(),GradientClipping,PolicyGradients
concat(),GoogLeNet
config.gpu_options,ManagingtheGPURAM
ConfigProto,ManagingtheGPURAM
confusionmatrix,ConfusionMatrix-ConfusionMatrix,ErrorAnalysis-ErrorAnalysis
connectionism,ThePerceptron
constrainedoptimization,TrainingObjective,SVMDualProblem
ContrastiveDivergence,RestrictedBoltzmannMachines
controldependencies,ControlDependencies
conv1d(),ResNet
conv2d_transpose(),ResNet
conv3d(),ResNet
convergencerate,BatchGradientDescent
convexfunction,GradientDescent
convolutionkernels,Filters,CNNArchitectures,GoogLeNet
convolutionalneuralnetworks(CNNs),ConvolutionalNeuralNetworks-Exercises
architectures,CNNArchitectures-ResNet
AlexNet,AlexNet-AlexNet
GoogleNet,GoogLeNet-GoogLeNet
LeNet5,LeNet-5-LeNet-5
ResNet,ResNet-ResNet
convolutionallayer,ConvolutionalLayer-MemoryRequirements,GoogLeNet,ResNet
featuremaps,StackingMultipleFeatureMaps-TensorFlowImplementation
filters,Filters
memoryrequirement,MemoryRequirements-MemoryRequirements
evolutionof,TheArchitectureoftheVisualCortex
poolinglayer,PoolingLayer-PoolingLayer
TensorFlowimplementation,TensorFlowImplementation-TensorFlowImplementation
Coordinatorclass,MultithreadedreadersusingaCoordinatorandaQueueRunner-MultithreadedreadersusingaCoordinatorandaQueueRunner
correlationcoefficient,LookingforCorrelations-LookingforCorrelations
correlations,finding,LookingforCorrelations-LookingforCorrelations
costfunction,Model-basedlearning,SelectaPerformanceMeasure
inAdaBoost,AdaBoost
inadagrad,AdaGrad
inartificialneuralnetworks,TraininganMLPwithTensorFlow’sHigh-LevelAPI,ConstructionPhase-ConstructionPhase
inautodiff,Usingautodiff
inbatchnormalization,ImplementingBatchNormalizationwithTensorFlow
crossentropy,LeNet-5
deepQ-Learning,LearningtoPlayMs.Pac-ManUsingDeepQ-Learning
inElasticNet,ElasticNet
inGradientDescent,TrainingModels,GradientDescent-GradientDescent,BatchGradientDescent,BatchGradientDescent-StochasticGradientDescent,GradientBoosting,Vanishing/ExplodingGradientsProblems
inLogisticRegression,TrainingandCostFunction-TrainingandCostFunction
inPGalgorithms,PolicyGradients
invariationalautoencoders,VariationalAutoencoders
inLassoRegression,LassoRegression-LassoRegression
inLinearRegression,TheNormalEquation,GradientDescent
inMomentumoptimization,MomentumOptimization-NesterovAcceleratedGradient
inpretrainedlayersreuse,PretrainingonanAuxiliaryTask
inridgeregression,RidgeRegression-RidgeRegression
inRNNs,TrainingRNNs,TrainingtoPredictTimeSeries
stalegradientsand,Asynchronousupdates
creativesequences,CreativeRNN
creditassignmentproblem,EvaluatingActions:TheCreditAssignmentProblem-EvaluatingActions:TheCreditAssignmentProblem
critics,LearningtoPlayMs.Pac-ManUsingDeepQ-Learning
crossentropy,SoftmaxRegression-SoftmaxRegression,TraininganMLPwithTensorFlow’sHigh-LevelAPI,TensorFlowImplementation,PolicyGradients
cross-validation,TestingandValidating,BetterEvaluationUsingCross-Validation-BetterEvaluationUsingCross-Validation,MeasuringAccuracyUsingCross-Validation-MeasuringAccuracyUsingCross-Validation
CUDAlibrary,Installation
cuDNNlibrary,Installation
curseofdimensionality,DimensionalityReduction-TheCurseofDimensionality
(seealsodimensionalityreduction)
customtransformers,CustomTransformers-CustomTransformers
D
data,TestingandValidating
(seealsotestdata;trainingdata)
creatingworkspacefor,GettheData-DownloadtheData
downloading,DownloadtheData-DownloadtheData
findingcorrelationsin,LookingforCorrelations-LookingforCorrelations
makingassumptionsabout,TestingandValidating
preparingforMachineLearningalgorithms,PreparetheDataforMachineLearningAlgorithms-SelectandTrainaModel
test-setcreation,CreateaTestSet-CreateaTestSet
workingwithrealdata,WorkingwithRealData
dataaugmentation,DataAugmentation-DataAugmentation
datacleaning,DataCleaning-HandlingTextandCategoricalAttributes
datamining,WhyUseMachineLearning?
dataparallelism,DataParallelism-TensorFlowimplementation
asynchronousupdates,Asynchronousupdates-Asynchronousupdates
bandwidthsaturation,Bandwidthsaturation-Bandwidthsaturation
synchronousupdates,Synchronousupdates
TensorFlowimplementation,TensorFlowimplementation
datapipeline,FrametheProblem
datasnoopingbias,CreateaTestSet
datastructure,TakeaQuickLookattheDataStructure-TakeaQuickLookattheDataStructure
datavisualization,VisualizingGeographicalData-VisualizingGeographicalData
DataFrame,DataCleaning
dataquest,OtherResources
decisionboundaries,DecisionBoundaries-DecisionBoundaries,SoftmaxRegression,MakingPredictions
decisionfunction,Precision/RecallTradeoff,DecisionFunctionandPredictions-DecisionFunctionandPredictions
DecisionStumps,AdaBoost
decisionthreshold,Precision/RecallTradeoff
DecisionTrees,TrainingandEvaluatingontheTrainingSet-BetterEvaluationUsingCross-Validation,DecisionTrees-Exercises,EnsembleLearningandRandomForests
binarytrees,MakingPredictions
classprobabilityestimates,EstimatingClassProbabilities
computationalcomplexity,ComputationalComplexity
decisionboundaries,MakingPredictions
GINIimpurity,GiniImpurityorEntropy?
instabilitywith,Instability-Instability
numbersofchildren,MakingPredictions
predictions,MakingPredictions-EstimatingClassProbabilities
RandomForests(seeRandomForests)
regressiontasks,Regression-Regression
regularizationhyperparameters,RegularizationHyperparameters-RegularizationHyperparameters
trainingandvisualizing,TrainingandVisualizingaDecisionTree-MakingPredictions
decoder,EfficientDataRepresentations
deconvolutionallayer,ResNet
deepautoencoders(seestackedautoencoders)
deepbeliefnetworks(DBNs),Semisupervisedlearning,DeepBeliefNets-DeepBeliefNets
DeepLearning,ReinforcementLearning
(seealsoReinforcementLearning;TensorFlow)
about,TheMachineLearningTsunami,Roadmap
libraries,UpandRunningwithTensorFlow-UpandRunningwithTensorFlow
deepneuralnetworks(DNNs),Multi-LayerPerceptronandBackpropagation,TrainingDeepNeuralNets-Exercises
(seealsoMulti-LayerPerceptrons(MLP))
fasteroptimizersfor,FasterOptimizers-LearningRateScheduling
regularization,AvoidingOverfittingThroughRegularization-DataAugmentation
reusingpretrainedlayers,ReusingPretrainedLayers-PretrainingonanAuxiliaryTask
trainingguidelinesoverview,PracticalGuidelines
trainingwithTensorFlow,TrainingaDNNUsingPlainTensorFlow-UsingtheNeuralNetwork
trainingwithTF.Learn,TraininganMLPwithTensorFlow’sHigh-LevelAPI
unstablegradients,Vanishing/ExplodingGradientsProblems
vanishingandexplodinggradients,TrainingDeepNeuralNets-GradientClipping
DeepQ-Learning,ApproximateQ-Learning-LearningtoPlayMs.Pac-ManUsingDeepQ-Learning
Ms.PacManexample,LearningtoPlayMs.Pac-ManUsingDeepQ-Learning-LearningtoPlayMs.Pac-ManUsingDeepQ-Learning
deepQ-network,ApproximateQ-Learning
deepRNNs,DeepRNNs-TheDifficultyofTrainingoverManyTimeSteps
applyingdropout,ApplyingDropout
distributingacrossmultipleGPUs,DistributingaDeepRNNAcrossMultipleGPUs
longsequencedifficulties,TheDifficultyofTrainingoverManyTimeSteps
truncatedbackpropagationthroughtime,TheDifficultyofTrainingoverManyTimeSteps
DeepMind,ReinforcementLearning,IntroductiontoArtificialNeuralNetworks,ReinforcementLearning,ApproximateQ-Learning
degreesoffreedom,OverfittingtheTrainingData,LearningCurves
denoisingautoencoders,DenoisingAutoencoders-TensorFlowImplementation
dense(),ConstructionPhase,TyingWeights
depthconcatlayer,GoogLeNet
depthradius,AlexNet
depthwise_conv2d(),ResNet
dequeue(),Queuesoftuples
dequeue_many(),Queuesoftuples,PaddingFifoQueue
dequeue_up_to(),Closingaqueue-PaddingFifoQueue
dequeuingdata,Dequeuingdata
describe(),TakeaQuickLookattheDataStructure
deviceblocks,ShardingVariablesAcrossMultipleParameterServers
device(),Simpleplacement
dimensionalityreduction,Unsupervisedlearning,DimensionalityReduction-Exercises,Autoencoders
approachesto
ManifoldLearning,ManifoldLearning
projection,Projection-Projection
choosingtherightnumberofdimensions,ChoosingtheRightNumberofDimensions
curseofdimensionality,DimensionalityReduction-TheCurseofDimensionality
anddatavisualization,DimensionalityReduction
Isomap,OtherDimensionalityReductionTechniques
LLE(LocallyLinearEmbedding),LLE-LLE
MultidimensionalScaling,OtherDimensionalityReductionTechniques-OtherDimensionalityReductionTechniques
PCA(PrincipalComponentAnalysis),PCA-RandomizedPCA
t-DistributedStochasticNeighborEmbedding(t-SNE),OtherDimensionalityReductionTechniques
discountrate,EvaluatingActions:TheCreditAssignmentProblem
distributedcomputing,UpandRunningwithTensorFlow
distributedsessions,SharingStateAcrossSessionsUsingResourceContainers-SharingStateAcrossSessionsUsingResourceContainers
DNNClassifier,TraininganMLPwithTensorFlow’sHigh-LevelAPI
drop(),PreparetheDataforMachineLearningAlgorithms
dropconnect,Dropout
dropna(),DataCleaning
dropout,NumberofNeuronsperHiddenLayer,ApplyingDropout
dropoutrate,Dropout
dropout(),Dropout
DropoutWrapper,ApplyingDropout
DRY(Don’tRepeatYourself),Modularity
DualAveraging,AdamOptimization
dualnumbers,Forward-ModeAutodiff
dualproblem,TheDualProblem
duality,SVMDualProblem
dyingReLUs,NonsaturatingActivationFunctions
dynamicplacements,Dynamicplacementfunction
dynamicplacer,PlacingOperationsonDevices
DynamicProgramming,MarkovDecisionProcesses
dynamicunrollingthroughtime,DynamicUnrollingThroughTime
dynamic_rnn(),DynamicUnrollingThroughTime,DistributingaDeepRNNAcrossMultipleGPUs,AnEncoder–DecoderNetworkforMachineTranslation
E
earlystopping,EarlyStopping-EarlyStopping,GradientBoosting,NumberofNeuronsperHiddenLayer,EarlyStopping
ElasticNet,ElasticNet
embeddeddeviceblocks,ShardingVariablesAcrossMultipleParameterServers
EmbeddedRebergrammars,Exercises
embeddings,WordEmbeddings-WordEmbeddings
embedding_lookup(),WordEmbeddings
encoder,EfficientDataRepresentations
Encoder–Decoder,InputandOutputSequences
end-of-sequence(EOS)token,HandlingVariable-LengthOutputSequences
energyfunctions,HopfieldNetworks
enqueuingdata,Enqueuingdata
EnsembleLearning,BetterEvaluationUsingCross-Validation,EnsembleMethods,EnsembleLearningandRandomForests-Exercises
baggingandpasting,BaggingandPasting-Out-of-BagEvaluation
boosting,Boosting-GradientBoosting
in-graphversusbetween-graphreplication,In-GraphVersusBetween-GraphReplication-In-GraphVersusBetween-GraphReplication
RandomForests,RandomForests-FeatureImportance
(seealsoRandomForests)
randompatchesandrandomsubspaces,RandomPatchesandRandomSubspaces
stacking,Stacking-Stacking
entropyimpuritymeasure,GiniImpurityorEntropy?
environments,inreinforcementlearning,LearningtoOptimizeRewards-EvaluatingActions:TheCreditAssignmentProblem,ExplorationPolicies,LearningtoPlayMs.Pac-ManUsingDeepQ-Learning
episodes(inRL),IntroductiontoOpenAIGym,EvaluatingActions:TheCreditAssignmentProblem-PolicyGradients,PolicyGradients-PolicyGradients,LearningtoPlayMs.Pac-ManUsingDeepQ-Learning
epochs,StochasticGradientDescent
ε-insensitive,SVMRegression
equalitycontraints,SVMDualProblem
erroranalysis,ErrorAnalysis-ErrorAnalysis
estimators,DataCleaning
Euclidiannorm,SelectaPerformanceMeasure
eval(),FeedingDatatotheTrainingAlgorithm
evaluatingmodels,TestingandValidating-TestingandValidating
explainedvariance,ChoosingtheRightNumberofDimensions
explainedvarianceratio,ExplainedVarianceRatio
explodinggradients,Vanishing/ExplodingGradientsProblems
(seealsogradients,vanishingandexploding)
explorationpolicies,ExplorationPolicies
exponentialdecay,ImplementingBatchNormalizationwithTensorFlow
exponentiallinearunit(ELU),NonsaturatingActivationFunctions-NonsaturatingActivationFunctions
exponentialscheduling,LearningRateScheduling
Extra-Trees,Extra-Trees
F
F-1score,PrecisionandRecall-PrecisionandRecall
face-recognition,MultilabelClassification
fakeXserver,IntroductiontoOpenAIGym
falsepositiverate(FPR),TheROCCurve-TheROCCurve
fan-in,XavierandHeInitialization,XavierandHeInitialization
fan-out,XavierandHeInitialization,XavierandHeInitialization
featuredetection,Autoencoders
featureengineering,IrrelevantFeatures
featureextraction,Unsupervisedlearning
featureimportance,FeatureImportance-FeatureImportance
featuremaps,SelectingaKernelandTuningHyperparameters,Filters-TensorFlowImplementation,ResNet
featurescaling,FeatureScaling
featureselection,IrrelevantFeatures,GridSearch,LassoRegression,FeatureImportance,PreparetheData
featurespace,KernelPCA,SelectingaKernelandTuningHyperparameters
featurevector,SelectaPerformanceMeasure,LinearRegression,UndertheHood,ImplementingGradientDescent
features,Supervisedlearning
FeatureUnion,TransformationPipelines
feedforwardneuralnetwork(FNN),Multi-LayerPerceptronandBackpropagation
feed_dict,FeedingDatatotheTrainingAlgorithm
FIFOQueue,AsynchronousCommunicationUsingTensorFlowQueues,RandomShuffleQueue
fillna(),DataCleaning
first-infirst-out(FIFO)queues,AsynchronousCommunicationUsingTensorFlowQueues
first-orderpartialderivatives(Jacobians),AdamOptimization
fit(),DataCleaning,TransformationPipelines,IncrementalPCA
fitnessfunction,Model-basedlearning
fit_inverse_transform=,SelectingaKernelandTuningHyperparameters
fit_transform(),DataCleaning,TransformationPipelines
folds,BetterEvaluationUsingCross-Validation,MNIST,MeasuringAccuracyUsingCross-Validation-MeasuringAccuracyUsingCross-Validation
FollowTheRegularizedLeader(FTRL),AdamOptimization
forgetgate,LSTMCell
forward-modeautodiff,Forward-ModeAutodiff-Forward-ModeAutodiff
framingaproblem,FrametheProblem-FrametheProblem
frozenlayers,FreezingtheLowerLayers-CachingtheFrozenLayers
functools.partial(),ImplementingBatchNormalizationwithTensorFlow,TensorFlowImplementation,VariationalAutoencoders
G
gameplay(seereinforcementlearning)
gammavalue,GaussianRBFKernel
gatecontrollers,LSTMCell
Gaussiandistribution,VariationalAutoencoders,GeneratingDigits
GaussianRBF,AddingSimilarityFeatures
GaussianRBFkernel,GaussianRBFKernel-GaussianRBFKernel,KernelizedSVM
generalizationerror,TestingandValidating
generalizedLagrangian,SVMDualProblem-SVMDualProblem
generativeautoencoders,VariationalAutoencoders
generativemodels,Autoencoders,BoltzmannMachines
geneticalgorithms,PolicySearch
geodesicdistance,OtherDimensionalityReductionTechniques
get_variable(),SharingVariables-SharingVariables
GINIimpurity,MakingPredictions,GiniImpurityorEntropy?
globalaveragepooling,GoogLeNet
global_step,LearningtoPlayMs.Pac-ManUsingDeepQ-Learning
global_variables_initializer(),CreatingYourFirstGraphandRunningItinaSession
Glorotinitialization,Vanishing/ExplodingGradientsProblems-XavierandHeInitialization
Google,UpandRunningwithTensorFlow
GoogleImages,IntroductiontoArtificialNeuralNetworks
GooglePhotos,Semisupervisedlearning
GoogleNetarchitecture,GoogLeNet-GoogLeNet
gpu_options.per_process_gpu_memory_fraction,ManagingtheGPURAM
gradientascent,PolicySearch
GradientBoostedRegressionTrees(GBRT),GradientBoosting
GradientBoosting,GradientBoosting-GradientBoosting
GradientDescent(GD),TrainingModels,GradientDescent-Mini-batchGradientDescent,OnlineSVMs,TrainingDeepNeuralNets,MomentumOptimization,AdaGrad
algorithmcomparisons,Mini-batchGradientDescent-Mini-batchGradientDescent
automaticallycomputinggradients,Usingautodiff-Usingautodiff
BatchGD,BatchGradientDescent-BatchGradientDescent,LassoRegression
defining,GradientDescent
localminimumversusglobalminimum,GradientDescent
manuallycomputinggradients,ManuallyComputingtheGradients
Mini-batchGD,Mini-batchGradientDescent-Mini-batchGradientDescent,FeedingDatatotheTrainingAlgorithm-FeedingDatatotheTrainingAlgorithm
optimizer,UsinganOptimizer
StochasticGD,StochasticGradientDescent-StochasticGradientDescent,SoftMarginClassification
withTensorFlow,ImplementingGradientDescent-UsinganOptimizer
GradientTreeBoosting,GradientBoosting
GradientDescentOptimizer,ConstructionPhase
gradients(),Usingautodiff
gradients,vanishingandexploding,TrainingDeepNeuralNets-GradientClipping,TheDifficultyofTrainingoverManyTimeSteps
BatchNormalization,BatchNormalization-ImplementingBatchNormalizationwithTensorFlow
GlorotandHeinitialization,Vanishing/ExplodingGradientsProblems-XavierandHeInitialization
gradientclipping,GradientClipping
nonsaturatingactivationfunctions,NonsaturatingActivationFunctions-NonsaturatingActivationFunctions
graphviz,TrainingandVisualizingaDecisionTree
greedyalgorithm,TheCARTTrainingAlgorithm
gridsearch,Fine-TuneYourModel-GridSearch,PolynomialKernel
group(),LearningtoPlayMs.Pac-ManUsingDeepQ-Learning
GRU(GatedRecurrentUnit)cell,GRUCell-GRUCell
H
hailstonesequence,EfficientDataRepresentations
hardmarginclassification,SoftMarginClassification-SoftMarginClassification
hardvotingclassifiers,VotingClassifiers-VotingClassifiers
harmonicmean,PrecisionandRecall
Heinitialization,Vanishing/ExplodingGradientsProblems-XavierandHeInitialization
Heavisidestepfunction,ThePerceptron
Hebb'srule,ThePerceptron,HopfieldNetworks
Hebbianlearning,ThePerceptron
hiddenlayers,Multi-LayerPerceptronandBackpropagation
hierarchicalclustering,Unsupervisedlearning
hingelossfunction,OnlineSVMs
histograms,TakeaQuickLookattheDataStructure-TakeaQuickLookattheDataStructure
hold-outsets,Stacking
(seealsoblenders)
HopfieldNetworks,HopfieldNetworks-HopfieldNetworks
hyperbolictangent(htanactivationfunction),Multi-LayerPerceptronandBackpropagation,ActivationFunctions,Vanishing/ExplodingGradientsProblems,XavierandHeInitialization,RecurrentNeurons
hyperparameters,OverfittingtheTrainingData,CustomTransformers,GridSearch-GridSearch,EvaluateYourSystemontheTestSet,GradientDescent,PolynomialKernel,ComputationalComplexity,Fine-TuningNeuralNetworkHyperparameters
(seealsoneuralnetworkhyperparameters)
hyperplane,DecisionFunctionandPredictions,ManifoldLearning-PCA,ProjectingDowntodDimensions,OtherDimensionalityReductionTechniques
hypothesis,SelectaPerformanceMeasure
manifold,ManifoldLearning
hypothesisboosting(seeboosting)
hypothesisfunction,LinearRegression
hypothesis,null,RegularizationHyperparameters
I
identitymatrix,RidgeRegression,QuadraticProgramming
ILSVRCImageNetchallenge,CNNArchitectures
imageclassification,CNNArchitectures
impuritymeasures,MakingPredictions,GiniImpurityorEntropy?
in-graphreplication,In-GraphVersusBetween-GraphReplication
inceptionmodules,GoogLeNet
Inception-v4,ResNet
incrementallearning,Onlinelearning,IncrementalPCA
inequalityconstraints,SVMDualProblem
inference,Model-basedlearning,Exercises,MemoryRequirements,AnEncoder–DecoderNetworkforMachineTranslation
info(),TakeaQuickLookattheDataStructure
informationgain,GiniImpurityorEntropy?
informationtheory,GiniImpurityorEntropy?
initnode,SavingandRestoringModels
inputgate,LSTMCell
inputneurons,ThePerceptron
input_keep_prob,ApplyingDropout
instance-basedlearning,Instance-basedlearning,Model-basedlearning
InteractiveSession,CreatingYourFirstGraphandRunningItinaSession
interceptterm,LinearRegression
InternalCovariateShiftproblem,BatchNormalization
inter_op_parallelism_threads,ParallelExecution
intra_op_parallelism_threads,ParallelExecution
inverse_transform(),SelectingaKernelandTuningHyperparameters
in_top_k(),ConstructionPhase
irreducibleerror,LearningCurves
isolatedenvironment,CreatetheWorkspace-CreatetheWorkspace
Isomap,OtherDimensionalityReductionTechniques
J
jobs,MultipleDevicesAcrossMultipleServers
join(),MultipleDevicesAcrossMultipleServers,MultithreadedreadersusingaCoordinatorandaQueueRunner
Jupyter,CreatetheWorkspace,CreatetheWorkspace,TakeaQuickLookattheDataStructure
K
K-foldcross-validation,BetterEvaluationUsingCross-Validation-BetterEvaluationUsingCross-Validation,MeasuringAccuracyUsingCross-Validation
k-NearestNeighbors,Model-basedlearning,MultilabelClassification
Karush–Kuhn–Tucker(KKT)conditions,SVMDualProblem
keepprobability,Dropout
Keras,UpandRunningwithTensorFlow
KernelPCA(kPCA),KernelPCA-SelectingaKernelandTuningHyperparameters
kerneltrick,PolynomialKernel,GaussianRBFKernel,TheDualProblem-KernelizedSVM,KernelPCA
kernelizedSVM,KernelizedSVM-KernelizedSVM
kernels,PolynomialKernel-GaussianRBFKernel,Operationsandkernels
Kullback–Leiblerdivergence,SoftmaxRegression,SparseAutoencoders
L
l1_l2_regularizer(),ℓ1andℓ2Regularization
LabelBinarizer,TransformationPipelines
labels,Supervisedlearning,FrametheProblem
Lagrangefunction,SVMDualProblem-SVMDualProblem
Lagrangemultiplier,SVMDualProblem
landmarks,AddingSimilarityFeatures-AddingSimilarityFeatures
largemarginclassification,LinearSVMClassification-LinearSVMClassification
LassoRegression,LassoRegression-LassoRegression
latentloss,VariationalAutoencoders
latentspace,VariationalAutoencoders
lawoflargenumbers,VotingClassifiers
leakyReLU,NonsaturatingActivationFunctions
learningrate,Onlinelearning,GradientDescent,BatchGradientDescent-StochasticGradientDescent
learningratescheduling,StochasticGradientDescent,LearningRateScheduling-LearningRateScheduling
LeNet-5architecture,TheArchitectureoftheVisualCortex,LeNet-5-LeNet-5
Levenshteindistance,GaussianRBFKernel
liblinearlibrary,ComputationalComplexity
libsvmlibrary,ComputationalComplexity
LinearDiscriminantAnalysis(LDA),OtherDimensionalityReductionTechniques
linearmodels
earlystopping,EarlyStopping-EarlyStopping
ElasticNet,ElasticNet
LassoRegression,LassoRegression-LassoRegression
LinearRegression(seeLinearRegression)
regression(seeLinearRegression)
RidgeRegression,RidgeRegression-RidgeRegression,ElasticNet
SVM,LinearSVMClassification-SoftMarginClassification
LinearRegression,Model-basedlearning,TrainingandEvaluatingontheTrainingSet,TrainingModels-Mini-batchGradientDescent,ElasticNet
computationalcomplexity,ComputationalComplexity
GradientDescentin,GradientDescent-Mini-batchGradientDescent
learningcurvesin,LearningCurves-LearningCurves
NormalEquation,TheNormalEquation-ComputationalComplexity
regularizingmodels(seeregularization)
usingStochasticGradientDescent(SGD),StochasticGradientDescent
withTensorFlow,LinearRegressionwithTensorFlow-LinearRegressionwithTensorFlow
linearSVMclassification,LinearSVMClassification-SoftMarginClassification
linearthresholdunits(LTUs),ThePerceptron
Lipschitzcontinuous,GradientDescent
LLE(LocallyLinearEmbedding),LLE-LLE
load_sample_images(),TensorFlowImplementation
localreceptivefield,TheArchitectureoftheVisualCortex
localresponsenormalization,AlexNet
localsessions,SharingStateAcrossSessionsUsingResourceContainers
locationinvariance,PoolingLayer
logloss,TrainingandCostFunction
loggingplacements,Loggingplacements-Loggingplacements
logisticfunction,EstimatingProbabilities
LogisticRegression,Supervisedlearning,LogisticRegression-SoftmaxRegression
decisionboundaries,DecisionBoundaries-DecisionBoundaries
estimatingprobablities,EstimatingProbabilities-EstimatingProbabilities
SoftmaxRegressionmodel,SoftmaxRegression-SoftmaxRegression
trainingandcostfunction,TrainingandCostFunction-TrainingandCostFunction
log_device_placement,Loggingplacements
LSTM(LongShort-TermMemory)cell,LSTMCell-GRUCell
M
machinecontrol(seereinforcementlearning)
MachineLearning
large-scaleprojects(seeTensorFlow)
notations,SelectaPerformanceMeasure-SelectaPerformanceMeasure
processexample,End-to-EndMachineLearningProject-Exercises
projectchecklist,LookattheBigPicture,MachineLearningProjectChecklist-Launch!
resourceson,OtherResources-OtherResources
usesfor,MachineLearninginYourProjects-MachineLearninginYourProjects
MachineLearningbasics
attributes,Supervisedlearning
challenges,MainChallengesofMachineLearning-SteppingBack
algorithmproblems,OverfittingtheTrainingData-UnderfittingtheTrainingData
trainingdataproblems,Poor-QualityData
definition,WhatIsMachineLearning?
features,Supervisedlearning
overview,TheMachineLearningLandscape
reasonsforusing,WhyUseMachineLearning?-WhyUseMachineLearning?
spamfilterexample,WhatIsMachineLearning?-WhyUseMachineLearning?
summary,SteppingBack
testingandvalidating,TestingandValidating-TestingandValidating
typesofsystems,TypesofMachineLearningSystems-Model-basedlearning
batchandonlinelearning,BatchandOnlineLearning-Onlinelearning
instance-basedversusmodel-basedlearning,Instance-BasedVersusModel-BasedLearning-Model-basedlearning
supervised/unsupervisedlearning,Supervised/UnsupervisedLearning-ReinforcementLearning
workflowexample,Model-basedlearning-Model-basedlearning
machinetranslation(seenaturallanguageprocessing(NLP))
make(),IntroductiontoOpenAIGym
Manhattannorm,SelectaPerformanceMeasure
manifoldassumption/hypothesis,ManifoldLearning
ManifoldLearning,ManifoldLearning,LLE
(seealsoLLE(LocallyLinearEmbedding))
MapReduce,FrametheProblem
marginviolations,SoftMarginClassification
Markovchains,MarkovDecisionProcesses
Markovdecisionprocesses,MarkovDecisionProcesses-MarkovDecisionProcesses
masterservice,TheMasterandWorkerServices
Matplotlib,CreatetheWorkspace,TakeaQuickLookattheDataStructure,TheROCCurve,ErrorAnalysis
maxmarginlearning,PretrainingonanAuxiliaryTask
maxpoolinglayer,PoolingLayer
max-normregularization,Max-NormRegularization-Max-NormRegularization
max_norm(),Max-NormRegularization
max_norm_regularizer(),Max-NormRegularization
max_pool(),PoolingLayer
MeanAbsoluteError(MAE),SelectaPerformanceMeasure-SelectaPerformanceMeasure
meancoding,VariationalAutoencoders
MeanSquareError(MSE),LinearRegression,ManuallyComputingtheGradients,SparseAutoencoders
measureofsimilarity,Instance-basedlearning
memmap,IncrementalPCA
memorycells,ModelParallelism,MemoryCells
Mercer'stheorem,KernelizedSVM
metalearner(seeblending)
min-maxscaling,FeatureScaling
Mini-batchGradientDescent,Mini-batchGradientDescent-Mini-batchGradientDescent,TrainingandCostFunction,FeedingDatatotheTrainingAlgorithm-FeedingDatatotheTrainingAlgorithm
mini-batches,Onlinelearning
minimize(),GradientClipping,FreezingtheLowerLayers,PolicyGradients,LearningtoPlayMs.Pac-ManUsingDeepQ-Learning
min_after_dequeue,RandomShuffleQueue
MNISTdataset,MNIST-MNIST
modelparallelism,ModelParallelism-ModelParallelism
modelparameters,GradientDescent,BatchGradientDescent,EarlyStopping,UndertheHood,QuadraticProgramming,CreatingYourFirstGraphandRunningItinaSession,ConstructionPhase,TrainingRNNs
defining,Model-basedlearning
modelselection,Model-basedlearning
modelzoos,ModelZoos
model-basedlearning,Model-basedlearning-Model-basedlearning
models
analyzing,AnalyzetheBestModelsandTheirErrors-AnalyzetheBestModelsandTheirErrors
evaluatingontestset,EvaluateYourSystemontheTestSet-EvaluateYourSystemontheTestSet
moments,AdamOptimization
Momentumoptimization,MomentumOptimization-MomentumOptimization
MonteCarlotreesearch,PolicyGradients
Multi-LayerPerceptrons(MLP),IntroductiontoArtificialNeuralNetworks,ThePerceptron-Multi-LayerPerceptronandBackpropagation,NeuralNetworkPolicies
trainingwithTF.Learn,TraininganMLPwithTensorFlow’sHigh-LevelAPI
multiclassclassifiers,MulticlassClassification-MulticlassClassification
MultidimensionalScaling(MDS),OtherDimensionalityReductionTechniques
multilabelclassifiers,MultilabelClassification-MultilabelClassification
MultinomialLogisticRegression(seeSoftmaxRegression)
multinomial(),NeuralNetworkPolicies
multioutputclassifiers,MultioutputClassification-MultioutputClassification
MultiRNNCell,DistributingaDeepRNNAcrossMultipleGPUs
multithreadedreaders,MultithreadedreadersusingaCoordinatorandaQueueRunner-MultithreadedreadersusingaCoordinatorandaQueueRunner
multivariateregression,FrametheProblem
N
naiveBayesclassifiers,MulticlassClassification
namescopes,NameScopes
naturallanguageprocessing(NLP),RecurrentNeuralNetworks,NaturalLanguageProcessing-AnEncoder–DecoderNetworkforMachineTranslation
encoder-decodernetworkformachinetranslation,AnEncoder–DecoderNetworkforMachineTranslation-AnEncoder–DecoderNetworkforMachineTranslation
TensorFlowtutorials,NaturalLanguageProcessing,AnEncoder–DecoderNetworkforMachineTranslation
wordembeddings,WordEmbeddings-WordEmbeddings
NesterovAcceleratedGradient(NAG),NesterovAcceleratedGradient-NesterovAcceleratedGradient
Nesterovmomentumoptimization,NesterovAcceleratedGradient-NesterovAcceleratedGradient
networktopology,Fine-TuningNeuralNetworkHyperparameters
neuralnetworkhyperparameters,Fine-TuningNeuralNetworkHyperparameters-ActivationFunctions
activationfunctions,ActivationFunctions
neuronsperhiddenlayer,NumberofNeuronsperHiddenLayer
numberofhiddenlayers,NumberofHiddenLayers-NumberofHiddenLayers
neuralnetworkpolicies,NeuralNetworkPolicies-NeuralNetworkPolicies
neurons
biological,FromBiologicaltoArtificialNeurons-BiologicalNeurons
logicalcomputationswith,LogicalComputationswithNeurons
neuron_layer(),ConstructionPhase
next_batch(),ExecutionPhase
NoFreeLunchtheorem,TestingandValidating
nodeedges,VisualizingtheGraphandTrainingCurvesUsingTensorBoard
nonlineardimensionalityreduction(NLDR),LLE
(seealsoKernelPCA;LLE(LocallyLinearEmbedding))
nonlinearSVMclassification,NonlinearSVMClassification-ComputationalComplexity
computationalcomplexity,ComputationalComplexity
GaussianRBFkernel,GaussianRBFKernel-GaussianRBFKernel
withpolynomialfeatures,NonlinearSVMClassification-PolynomialKernel
polynomialkernel,PolynomialKernel-PolynomialKernel
similarityfeatures,adding,AddingSimilarityFeatures-AddingSimilarityFeatures
nonparametricmodels,RegularizationHyperparameters
nonresponsebias,NonrepresentativeTrainingData
nonsaturatingactivationfunctions,NonsaturatingActivationFunctions-NonsaturatingActivationFunctions
NormalEquation,TheNormalEquation-ComputationalComplexity
normalization,FeatureScaling
normalizedexponential,SoftmaxRegression
norms,SelectaPerformanceMeasure
notations,SelectaPerformanceMeasure-SelectaPerformanceMeasure
NP-Completeproblems,TheCARTTrainingAlgorithm
nullhypothesis,RegularizationHyperparameters
numericaldifferentiation,NumericalDifferentiation
NumPy,CreatetheWorkspace
NumPyarrays,HandlingTextandCategoricalAttributes
NVidiaComputeCapability,Installation
nvidia-smi,ManagingtheGPURAM
n_components,ChoosingtheRightNumberofDimensions
O
observationspace,NeuralNetworkPolicies
off-policyalgorithm,TemporalDifferenceLearningandQ-Learning
offlinelearning,Batchlearning
one-hotencoding,HandlingTextandCategoricalAttributes
one-versus-all(OvA)strategy,MulticlassClassification,SoftmaxRegression,Exercises
one-versus-one(OvO)strategy,MulticlassClassification
onlinelearning,Onlinelearning-Onlinelearning
onlineSVMs,OnlineSVMs-OnlineSVMs
OpenAIGym,IntroductiontoOpenAIGym-IntroductiontoOpenAIGym
operation_timeout_in_ms,In-GraphVersusBetween-GraphReplication
OpticalCharacterRecognition(OCR),TheMachineLearningLandscape
optimalstatevalue,MarkovDecisionProcesses
optimizers,FasterOptimizers-LearningRateScheduling
AdaGrad,AdaGrad-AdaGrad
Adamoptimization,AdamOptimization-AdamOptimization,AdamOptimization
GradientDescent(seeGradientDescentoptimizer)
learningratescheduling,LearningRateScheduling-LearningRateScheduling
Momentumoptimization,MomentumOptimization-MomentumOptimization
NesterovAcceleratedGradient(NAG),NesterovAcceleratedGradient-NesterovAcceleratedGradient
RMSProp,RMSProp
out-of-bagevaluation,Out-of-BagEvaluation-Out-of-BagEvaluation
out-of-corelearning,Onlinelearning
out-of-memory(OOM)errors,StaticUnrollingThroughTime
out-of-sampleerror,TestingandValidating
OutOfRangeError,Readingthetrainingdatadirectlyfromthegraph,MultithreadedreadersusingaCoordinatorandaQueueRunner
outputgate,LSTMCell
outputlayer,Multi-LayerPerceptronandBackpropagation
OutputProjectionWrapper,TrainingtoPredictTimeSeries-TrainingtoPredictTimeSeries
output_keep_prob,ApplyingDropout
overcompleteautoencoder,UnsupervisedPretrainingUsingStackedAutoencoders
overfitting,OverfittingtheTrainingData-OverfittingtheTrainingData,CreateaTestSet,SoftMarginClassification,GaussianRBFKernel,RegularizationHyperparameters,Regression,NumberofNeuronsperHiddenLayer
avoidingthroughregularization,AvoidingOverfittingThroughRegularization-DataAugmentation
P
p-value,RegularizationHyperparameters
PaddingFIFOQueue,PaddingFifoQueue
Pandas,CreatetheWorkspace,DownloadtheData
scatter_matrix,LookingforCorrelations-LookingforCorrelations
paralleldistributedcomputing,DistributingTensorFlowAcrossDevicesandServers-Exercises
dataparallelism,DataParallelism-TensorFlowimplementation
in-graphversusbetween-graphreplication,In-GraphVersusBetween-GraphReplication-ModelParallelism
modelparallelism,ModelParallelism-ModelParallelism
multipledevicesacrossmultipleservers,MultipleDevicesAcrossMultipleServers-Otherconveniencefunctions
asynchronouscommunicationusingqueues,AsynchronousCommunicationUsingTensorFlowQueues-PaddingFifoQueue
loadingtrainingdata,LoadingDataDirectlyfromtheGraph-Otherconveniencefunctions
masterandworkerservices,TheMasterandWorkerServices
openingasession,OpeningaSession
pinningoperationsacrosstasks,PinningOperationsAcrossTasks
shardingvariables,ShardingVariablesAcrossMultipleParameterServers
sharingstateacrosssessions,SharingStateAcrossSessionsUsingResourceContainers-SharingStateAcrossSessionsUsingResourceContainers
multipledevicesonasinglemachine,MultipleDevicesonaSingleMachine-ControlDependencies
controldependencies,ControlDependencies
installation,Installation-Installation
managingtheGPURAM,ManagingtheGPURAM-ManagingtheGPURAM
parallelexecution,ParallelExecution-ParallelExecution
placingoperationsondevices,PlacingOperationsonDevices-Softplacement
oneneuralnetworkperdevice,OneNeuralNetworkperDevice-OneNeuralNetworkperDevice
parameterefficiency,NumberofHiddenLayers
parametermatrix,SoftmaxRegression
parameterserver(ps),MultipleDevicesAcrossMultipleServers
parameterspace,GradientDescent
parametervector,LinearRegression,GradientDescent,TrainingandCostFunction,SoftmaxRegression
parametricmodels,RegularizationHyperparameters
partialderivative,BatchGradientDescent
partial_fit(),IncrementalPCA
Pearson'sr,LookingforCorrelations
peepholeconnections,PeepholeConnections
penalties(seerewards,inRL)
percentiles,TakeaQuickLookattheDataStructure
Perceptronconvergencetheorem,ThePerceptron
Perceptrons,ThePerceptron-Multi-LayerPerceptronandBackpropagation
versusLogisticRegression,ThePerceptron
training,ThePerceptron-ThePerceptron
performancemeasures,SelectaPerformanceMeasure-SelectaPerformanceMeasure
confusionmatrix,ConfusionMatrix-ConfusionMatrix
cross-validation,MeasuringAccuracyUsingCross-Validation-MeasuringAccuracyUsingCross-Validation
precisionandrecall,PrecisionandRecall-Precision/RecallTradeoff
ROC(receiveroperatingcharacteristic)curve,TheROCCurve-TheROCCurve
performancescheduling,LearningRateScheduling
permutation(),CreateaTestSet
PGalgorithms,PolicyGradients
photo-hostingservices,Semisupervisedlearning
pinningoperations,PinningOperationsAcrossTasks
pip,CreatetheWorkspace
Pipelineconstructor,TransformationPipelines-SelectandTrainaModel
pipelines,FrametheProblem
placeholdernodes,FeedingDatatotheTrainingAlgorithm
placers(seesimpleplacer;dynamicplacer)
policy,PolicySearch
policygradients,PolicySearch(seePGalgorithms)
policyspace,PolicySearch
polynomialfeatures,adding,NonlinearSVMClassification-PolynomialKernel
polynomialkernel,PolynomialKernel-PolynomialKernel,KernelizedSVM
PolynomialRegression,TrainingModels,PolynomialRegression-PolynomialRegression
learningcurvesin,LearningCurves-LearningCurves
poolingkernel,PoolingLayer
poolinglayer,PoolingLayer-PoolingLayer
powerscheduling,LearningRateScheduling
precision,ConfusionMatrix
precisionandrecall,PrecisionandRecall-Precision/RecallTradeoff
F-1score,PrecisionandRecall-PrecisionandRecall
precision/recall(PR)curve,TheROCCurve
precision/recalltradeoff,Precision/RecallTradeoff-Precision/RecallTradeoff
predeterminedpiecewiseconstantlearningrate,LearningRateScheduling
predict(),DataCleaning
predictedclass,ConfusionMatrix
predictions,ConfusionMatrix-ConfusionMatrix,DecisionFunctionandPredictions-DecisionFunctionandPredictions,MakingPredictions-EstimatingClassProbabilities
predictors,Supervisedlearning,DataCleaning
preloadingtrainingdata,Preloadthedataintoavariable
PReLU(parametricleakyReLU),NonsaturatingActivationFunctions
preprocessedattributes,TakeaQuickLookattheDataStructure
pretrainedlayersreuse,ReusingPretrainedLayers-PretrainingonanAuxiliaryTask
auxiliarytask,PretrainingonanAuxiliaryTask-PretrainingonanAuxiliaryTask
cachingfrozenlayers,CachingtheFrozenLayers
freezinglowerlayers,FreezingtheLowerLayers
modelzoos,ModelZoos
otherframeworks,ReusingModelsfromOtherFrameworks
TensorFlowmodel,ReusingaTensorFlowModel-ReusingaTensorFlowModel
unsupervisedpretraining,UnsupervisedPretraining-UnsupervisedPretraining
upperlayers,Tweaking,Dropping,orReplacingtheUpperLayers
PrettyTensor,UpandRunningwithTensorFlow
primalproblem,TheDualProblem
principalcomponent,PrincipalComponents
PrincipalComponentAnalysis(PCA),PCA-RandomizedPCA
explainedvarianceratios,ExplainedVarianceRatio
findingprincipalcomponents,PrincipalComponents-PrincipalComponents
forcompression,PCAforCompression-IncrementalPCA
IncrementalPCA,IncrementalPCA-RandomizedPCA
KernelPCA(kPCA),KernelPCA-SelectingaKernelandTuningHyperparameters
projectingdowntoddimensions,ProjectingDowntodDimensions
RandomizedPCA,RandomizedPCA
ScikitLearnfor,UsingScikit-Learn
variance,preserving,PreservingtheVariance-PreservingtheVariance
probabilisticautoencoders,VariationalAutoencoders
probabilities,estimating,EstimatingProbabilities-EstimatingProbabilities,EstimatingClassProbabilities
producerfunctions,Otherconveniencefunctions
projection,Projection-Projection
propositionallogic,FromBiologicaltoArtificialNeurons
pruning,RegularizationHyperparameters,SymbolicDifferentiation
Python
isolatedenvironmentin,CreatetheWorkspace-CreatetheWorkspace
notebooksin,CreatetheWorkspace-DownloadtheData
pickle,BetterEvaluationUsingCross-Validation
pip,CreatetheWorkspace
Q
Q-Learningalgorithm,TemporalDifferenceLearningandQ-Learning-LearningtoPlayMs.Pac-ManUsingDeepQ-Learning
approximateQ-Learning,ApproximateQ-Learning
deepQ-Learning,ApproximateQ-Learning-LearningtoPlayMs.Pac-ManUsingDeepQ-Learning
Q-ValueIterationAlgorithm,MarkovDecisionProcesses
Q-Values,MarkovDecisionProcesses
QuadraticProgramming(QP)Problems,QuadraticProgramming-QuadraticProgramming
quantizing,Bandwidthsaturation
queriespersecond(QPS),OneNeuralNetworkperDevice
QueueRunner,MultithreadedreadersusingaCoordinatorandaQueueRunner-MultithreadedreadersusingaCoordinatorandaQueueRunner
queues,AsynchronousCommunicationUsingTensorFlowQueues-PaddingFifoQueue
closing,Closingaqueue
dequeuingdata,Dequeuingdata
enqueuingdata,Enqueuingdata
first-infirst-out(FIFO),AsynchronousCommunicationUsingTensorFlowQueues
oftuples,Queuesoftuples
PaddingFIFOQueue,PaddingFifoQueue
RandomShuffleQueue,RandomShuffleQueue
q_network(),LearningtoPlayMs.Pac-ManUsingDeepQ-Learning
R
RadialBasisFunction(RBF),AddingSimilarityFeatures
RandomForests,BetterEvaluationUsingCross-Validation-GridSearch,MulticlassClassification,DecisionTrees,Instability,EnsembleLearningandRandomForests,RandomForests-FeatureImportance
Extra-Trees,Extra-Trees
featureimportance,FeatureImportance-FeatureImportance
randominitialization,GradientDescent,BatchGradientDescent,StochasticGradientDescent,Vanishing/ExplodingGradientsProblems
RandomPatchesandRandomSubspaces,RandomPatchesandRandomSubspaces
randomizedleakyReLU(RReLU),NonsaturatingActivationFunctions
RandomizedPCA,RandomizedPCA
randomizedsearch,RandomizedSearch,Fine-TuningNeuralNetworkHyperparameters
RandomShuffleQueue,RandomShuffleQueue,Readingthetrainingdatadirectlyfromthegraph
random_uniform(),ManuallyComputingtheGradients
readeroperations,Readingthetrainingdatadirectlyfromthegraph
recall,ConfusionMatrix
recognitionnetwork,EfficientDataRepresentations
reconstructionerror,PCAforCompression
reconstructionloss,EfficientDataRepresentations,TensorFlowImplementation,VariationalAutoencoders
reconstructionpre-image,SelectingaKernelandTuningHyperparameters
reconstructions,EfficientDataRepresentations
recurrentneuralnetworks(RNNs),RecurrentNeuralNetworks-Exercises
deepRNNs,DeepRNNs-TheDifficultyofTrainingoverManyTimeSteps
explorationpolicies,ExplorationPolicies
GRUcell,GRUCell-GRUCell
inputandoutputsequences,InputandOutputSequences-InputandOutputSequences
LSTMcell,LSTMCell-GRUCell
naturallanguageprocessing(NLP),NaturalLanguageProcessing-AnEncoder–DecoderNetworkforMachineTranslation
inTensorFlow,BasicRNNsinTensorFlow-HandlingVariable-LengthOutputSequences
dynamicunrollingthroughtime,DynamicUnrollingThroughTime
staticunrollingthroughtime,StaticUnrollingThroughTime-StaticUnrollingThroughTime
variablelengthinputsequences,HandlingVariableLengthInputSequences
variablelengthoutputsequences,HandlingVariable-LengthOutputSequences
training,TrainingRNNs-CreativeRNN
backpropagationthroughtime(BPTT),TrainingRNNs
creativesequences,CreativeRNN
sequenceclassifiers,TrainingaSequenceClassifier-TrainingaSequenceClassifier
timeseriespredictions,TrainingtoPredictTimeSeries-TrainingtoPredictTimeSeries
recurrentneurons,RecurrentNeurons-InputandOutputSequences
memorycells,MemoryCells
reduce_mean(),ConstructionPhase
reduce_sum(),TensorFlowImplementation-TensorFlowImplementation,VariationalAutoencoders,LearningtoPlayMs.Pac-ManUsingDeepQ-Learning
regression,Supervisedlearning
DecisionTrees,Regression-Regression
regressionmodels
linear,TrainingandEvaluatingontheTrainingSet
regressionversusclassification,MultioutputClassification
regularization,OverfittingtheTrainingData-OverfittingtheTrainingData,TestingandValidating,RegularizedLinearModels-EarlyStopping
dataaugmentation,DataAugmentation-DataAugmentation
DecisionTrees,RegularizationHyperparameters-RegularizationHyperparameters
dropout,Dropout-Dropout
earlystopping,EarlyStopping-EarlyStopping,EarlyStopping
ElasticNet,ElasticNet
LassoRegression,LassoRegression-LassoRegression
max-norm,Max-NormRegularization-Max-NormRegularization
RidgeRegression,RidgeRegression-RidgeRegression
shrinkage,GradientBoosting
ℓ1andℓ2regularization,ℓ1andℓ2Regularization-ℓ1andℓ2Regularization
REINFORCEalgorithms,PolicyGradients
ReinforcementLearning(RL),ReinforcementLearning-ReinforcementLearning,ReinforcementLearning-ThankYou!
actions,EvaluatingActions:TheCreditAssignmentProblem-EvaluatingActions:TheCreditAssignmentProblem
creditassignmentproblem,EvaluatingActions:TheCreditAssignmentProblem-EvaluatingActions:TheCreditAssignmentProblem
discountrate,EvaluatingActions:TheCreditAssignmentProblem
examplesof,LearningtoOptimizeRewards
Markovdecisionprocesses,MarkovDecisionProcesses-MarkovDecisionProcesses
neuralnetworkpolicies,NeuralNetworkPolicies-NeuralNetworkPolicies
OpenAIgym,IntroductiontoOpenAIGym-IntroductiontoOpenAIGym
PGalgorithms,PolicyGradients-PolicyGradients
policysearch,PolicySearch-PolicySearch
Q-Learningalgorithm,TemporalDifferenceLearningandQ-Learning-LearningtoPlayMs.Pac-ManUsingDeepQ-Learning
rewards,learningtooptimize,LearningtoOptimizeRewards-LearningtoOptimizeRewards
TemporalDifference(TD)Learning,TemporalDifferenceLearningandQ-Learning-TemporalDifferenceLearningandQ-Learning
ReLU(rectifiedlinearunits),Modularity-Modularity
ReLUactivation,ResNet
ReLUfunction,Multi-LayerPerceptronandBackpropagation,ActivationFunctions,XavierandHeInitialization-NonsaturatingActivationFunctions
relu(z),ConstructionPhase
render(),IntroductiontoOpenAIGym
replaymemory,LearningtoPlayMs.Pac-ManUsingDeepQ-Learning
replica_device_setter(),ShardingVariablesAcrossMultipleParameterServers
request_stop(),MultithreadedreadersusingaCoordinatorandaQueueRunner
reset(),IntroductiontoOpenAIGym
reset_default_graph(),ManagingGraphs
reshape(),TrainingtoPredictTimeSeries
residualerrors,GradientBoosting-GradientBoosting
residuallearning,ResNet
residualnetwork(ResNet),ModelZoos,ResNet-ResNet
residualunits,ResNet
ResNet,ResNet-ResNet
resourcecontainers,SharingStateAcrossSessionsUsingResourceContainers-SharingStateAcrossSessionsUsingResourceContainers
restore(),SavingandRestoringModels
restrictedBoltzmannmachines(RBMs),Semisupervisedlearning,UnsupervisedPretraining,BoltzmannMachines
reuse_variables(),SharingVariables
reverse-modeautodiff,Reverse-ModeAutodiff-Reverse-ModeAutodiff
rewards,inRL,LearningtoOptimizeRewards-LearningtoOptimizeRewards
rgb_array,IntroductiontoOpenAIGym
RidgeRegression,RidgeRegression-RidgeRegression,ElasticNet
RMSProp,RMSProp
ROC(receiveroperatingcharacteristic)curve,TheROCCurve-TheROCCurve
RootMeanSquareError(RMSE),SelectaPerformanceMeasure-SelectaPerformanceMeasure,LinearRegression
RReLU(randomizedleakyReLU),NonsaturatingActivationFunctions
run(),CreatingYourFirstGraphandRunningItinaSession,In-GraphVersusBetween-GraphReplication
S
SampledSoftmax,AnEncoder–DecoderNetworkforMachineTranslation
samplingbias,NonrepresentativeTrainingData-Poor-QualityData,CreateaTestSet
samplingnoise,NonrepresentativeTrainingData
save(),SavingandRestoringModels
Savernode,SavingandRestoringModels
ScikitFlow,UpandRunningwithTensorFlow
Scikit-Learn,CreatetheWorkspace
about,ObjectiveandApproach
baggingandpastingin,BaggingandPastinginScikit-Learn-BaggingandPastinginScikit-Learn
CARTalgorithm,MakingPredictions-TheCARTTrainingAlgorithm,Regression
cross-validation,BetterEvaluationUsingCross-Validation-BetterEvaluationUsingCross-Validation
designprinciples,DataCleaning-DataCleaning
imputer,DataCleaning-HandlingTextandCategoricalAttributes
LinearSVRclass,SVMRegression
MinMaxScaler,FeatureScaling
min_andmax_hyperparameters,RegularizationHyperparameters
PCAimplementation,UsingScikit-Learn
Perceptronclass,ThePerceptron
Pipelineconstructor,TransformationPipelines-SelectandTrainaModel,NonlinearSVMClassification
RandomizedPCA,RandomizedPCA
RidgeRegressionwith,RidgeRegression
SAMME,AdaBoost
SGDClassifier,TrainingaBinaryClassifier,Precision/RecallTradeoff-Precision/RecallTradeoff,MulticlassClassification
SGDRegressor,StochasticGradientDescent
sklearn.base.BaseEstimator,CustomTransformers,TransformationPipelines,MeasuringAccuracyUsingCross-Validation
sklearn.base.clone(),MeasuringAccuracyUsingCross-Validation,EarlyStopping
sklearn.base.TransformerMixin,CustomTransformers,TransformationPipelines
sklearn.datasets.fetch_california_housing(),LinearRegressionwithTensorFlow
sklearn.datasets.fetch_mldata(),MNIST
sklearn.datasets.load_iris(),DecisionBoundaries,SoftMarginClassification,TrainingandVisualizingaDecisionTree,FeatureImportance,ThePerceptron
sklearn.datasets.load_sample_images(),TensorFlowImplementation-TensorFlowImplementation
sklearn.datasets.make_moons(),NonlinearSVMClassification,Exercises
sklearn.decomposition.IncrementalPCA,IncrementalPCA
sklearn.decomposition.KernelPCA,KernelPCA-SelectingaKernelandTuningHyperparameters,SelectingaKernelandTuningHyperparameters
sklearn.decomposition.PCA,UsingScikit-Learn
sklearn.ensemble.AdaBoostClassifier,AdaBoost
sklearn.ensemble.BaggingClassifier,BaggingandPastinginScikit-Learn-RandomForests
sklearn.ensemble.GradientBoostingRegressor,GradientBoosting,GradientBoosting-GradientBoosting
sklearn.ensemble.RandomForestClassifier,TheROCCurve,MulticlassClassification,VotingClassifiers
sklearn.ensemble.RandomForestRegressor,BetterEvaluationUsingCross-Validation,GridSearch-AnalyzetheBestModelsandTheirErrors,RandomForests-Extra-Trees,GradientBoosting
sklearn.ensemble.VotingClassifier,VotingClassifiers
sklearn.externals.joblib,BetterEvaluationUsingCross-Validation
sklearn.linear_model.ElasticNet,ElasticNet
sklearn.linear_model.Lasso,LassoRegression
sklearn.linear_model.LinearRegression,Model-basedlearning-Model-basedlearning,DataCleaning,TrainingandEvaluatingontheTrainingSet,TheNormalEquation,Mini-batchGradientDescent,PolynomialRegression,LearningCurves-LearningCurves
sklearn.linear_model.LogisticRegression,DecisionBoundaries,DecisionBoundaries,SoftmaxRegression,VotingClassifiers,SelectingaKernelandTuningHyperparameters
sklearn.linear_model.Perceptron,ThePerceptron
sklearn.linear_model.Ridge,RidgeRegression
sklearn.linear_model.SGDClassifier,TrainingaBinaryClassifier
sklearn.linear_model.SGDRegressor, Stochastic Gradient Descent-Mini-batch Gradient Descent, Ridge Regression, Lasso Regression-Early Stopping
sklearn.manifold.LocallyLinearEmbedding,LLE-LLE
sklearn.metrics.accuracy_score(),VotingClassifiers,Out-of-BagEvaluation,TraininganMLPwithTensorFlow’sHigh-LevelAPI
sklearn.metrics.confusion_matrix(),ConfusionMatrix,ErrorAnalysis
sklearn.metrics.f1_score(),PrecisionandRecall,MultilabelClassification
sklearn.metrics.mean_squared_error(),TrainingandEvaluatingontheTrainingSet-TrainingandEvaluatingontheTrainingSet,EvaluateYourSystemontheTestSet,LearningCurves,EarlyStopping,GradientBoosting-GradientBoosting,SelectingaKernelandTuningHyperparameters
sklearn.metrics.precision_recall_curve(),Precision/RecallTradeoff
sklearn.metrics.precision_score(),PrecisionandRecall,Precision/RecallTradeoff
sklearn.metrics.recall_score(),PrecisionandRecall,Precision/RecallTradeoff
sklearn.metrics.roc_auc_score(),TheROCCurve-TheROCCurve
sklearn.metrics.roc_curve(),TheROCCurve-TheROCCurve
sklearn.model_selection.cross_val_predict(),ConfusionMatrix,Precision/RecallTradeoff,TheROCCurve,ErrorAnalysis,MultilabelClassification
sklearn.model_selection.cross_val_score(),BetterEvaluationUsingCross-Validation-BetterEvaluationUsingCross-Validation,MeasuringAccuracyUsingCross-Validation-ConfusionMatrix
sklearn.model_selection.GridSearchCV,GridSearch-RandomizedSearch,Exercises,ErrorAnalysis,Exercises,SelectingaKernelandTuningHyperparameters
sklearn.model_selection.StratifiedKFold,MeasuringAccuracyUsingCross-Validation
sklearn.model_selection.StratifiedShuffleSplit,CreateaTestSet
sklearn.model_selection.train_test_split(),CreateaTestSet,TrainingandEvaluatingontheTrainingSet,LearningCurves,Exercises,GradientBoosting
sklearn.multiclass.OneVsOneClassifier,MulticlassClassification
sklearn.neighbors.KNeighborsClassifier,MultilabelClassification,Exercises
sklearn.neighbors.KNeighborsRegressor,Model-basedlearning
sklearn.pipeline.FeatureUnion,TransformationPipelines
sklearn.pipeline.Pipeline,TransformationPipelines,LearningCurves,SoftMarginClassification-NonlinearSVMClassification,SelectingaKernelandTuningHyperparameters
sklearn.preprocessing.Imputer,DataCleaning,TransformationPipelines
sklearn.preprocessing.LabelBinarizer,HandlingTextandCategoricalAttributes,TransformationPipelines
sklearn.preprocessing.LabelEncoder,HandlingTextandCategoricalAttributes
sklearn.preprocessing.OneHotEncoder,HandlingTextandCategoricalAttributes
sklearn.preprocessing.PolynomialFeatures,PolynomialRegression-PolynomialRegression,LearningCurves,RidgeRegression,NonlinearSVMClassification
sklearn.preprocessing.StandardScaler,FeatureScaling-TransformationPipelines,MulticlassClassification,GradientDescent,RidgeRegression,LinearSVMClassification,SoftMarginClassification-PolynomialKernel,GaussianRBFKernel,ImplementingGradientDescent,TraininganMLPwithTensorFlow’sHigh-LevelAPI
sklearn.svm.LinearSVC,SoftMarginClassification-NonlinearSVMClassification,GaussianRBFKernel-ComputationalComplexity,SVMRegression,Exercises
sklearn.svm.LinearSVR,SVMRegression-SVMRegression
sklearn.svm.SVC,SoftMarginClassification,PolynomialKernel,GaussianRBFKernel-ComputationalComplexity,SVMRegression,Exercises,VotingClassifiers
sklearn.svm.SVR,Exercises,SVMRegression
sklearn.tree.DecisionTreeClassifier,RegularizationHyperparameters,Exercises,BaggingandPastinginScikit-Learn-Out-of-BagEvaluation,RandomForests,AdaBoost
sklearn.tree.DecisionTreeRegressor,TrainingandEvaluatingontheTrainingSet,DecisionTrees,Regression,GradientBoosting-GradientBoosting
sklearn.tree.export_graphviz(),TrainingandVisualizingaDecisionTree
StandardScaler,GradientDescent,ImplementingGradientDescent,TraininganMLPwithTensorFlow’sHigh-LevelAPI
SVMclassificationclasses,ComputationalComplexity
TF.Learn,UpandRunningwithTensorFlow
userguide,OtherResources
score(),DataCleaning
searchspace,RandomizedSearch,Fine-TuningNeuralNetworkHyperparameters
second-orderpartialderivatives(Hessians),AdamOptimization
self-organizingmaps(SOMs),Self-OrganizingMaps-Self-OrganizingMaps
semantichashing,Exercises
semisupervisedlearning,Semisupervisedlearning
sensitivity,ConfusionMatrix,TheROCCurve
sentimentanalysis,RecurrentNeuralNetworks
separable_conv2d(),ResNet
sequences,RecurrentNeuralNetworks
sequence_length,HandlingVariableLengthInputSequences-HandlingVariable-LengthOutputSequences,AnEncoder–DecoderNetworkforMachineTranslation
Shannon'sinformationtheory,GiniImpurityorEntropy?
shortcutconnections,ResNet
show(),TakeaQuickLookattheDataStructure
show_graph(),VisualizingtheGraphandTrainingCurvesUsingTensorBoard
shrinkage,GradientBoosting
shuffle_batch(),Otherconveniencefunctions
shuffle_batch_join(),Otherconveniencefunctions
sigmoidfunction,EstimatingProbabilities
sigmoid_cross_entropy_with_logits(),TensorFlowImplementation
similarityfunction,AddingSimilarityFeatures-AddingSimilarityFeatures
simulatedannealing,StochasticGradientDescent
simulated environments, Introduction to OpenAI Gym (see also OpenAI Gym)
SingularValueDecomposition(SVD),PrincipalComponents
skeweddatasets,MeasuringAccuracyUsingCross-Validation
skipconnections,DataAugmentation,ResNet
slackvariable,TrainingObjective
smoothingterms,BatchNormalization,AdaGrad,AdamOptimization,VariationalAutoencoders
softmarginclassification,SoftMarginClassification-SoftMarginClassification
softplacements,Softplacement
softvoting,VotingClassifiers
softmaxfunction,SoftmaxRegression,Multi-LayerPerceptronandBackpropagation,TraininganMLPwithTensorFlow’sHigh-LevelAPI
SoftmaxRegression,SoftmaxRegression-SoftmaxRegression
sourceops,LinearRegressionwithTensorFlow,ParallelExecution
spamfilters,TheMachineLearningLandscape-WhyUseMachineLearning?,Supervisedlearning
sparseautoencoders,SparseAutoencoders-TensorFlowImplementation
sparsematrix,HandlingTextandCategoricalAttributes
sparsemodels,LassoRegression,AdamOptimization
sparse_softmax_cross_entropy_with_logits(),ConstructionPhase
sparsityloss,SparseAutoencoders
specificity,TheROCCurve
speechrecognition,WhyUseMachineLearning?
spuriouspatterns,HopfieldNetworks
stack(),StaticUnrollingThroughTime
stackedautoencoders,StackedAutoencoders-UnsupervisedPretrainingUsingStackedAutoencoders
TensorFlowimplementation,TensorFlowImplementation
trainingone-at-a-time,TrainingOneAutoencoderataTime-TrainingOneAutoencoderataTime
tyingweights,TyingWeights-TyingWeights
unsupervisedpretrainingwith,UnsupervisedPretrainingUsingStackedAutoencoders-UnsupervisedPretrainingUsingStackedAutoencoders
visualizingthereconstructions,VisualizingtheReconstructions-VisualizingtheReconstructions
stackeddenoisingautoencoders,VisualizingFeatures,DenoisingAutoencoders
stackeddenoisingencoders,DenoisingAutoencoders
stackedgeneralization(seestacking)
stacking,Stacking-Stacking
stalegradients,Asynchronousupdates
standardcorrelationcoefficient,LookingforCorrelations
standardization,FeatureScaling
StandardScaler,TransformationPipelines,ImplementingGradientDescent,TraininganMLPwithTensorFlow’sHigh-LevelAPI
state-actionvalues,MarkovDecisionProcesses
statestensor,HandlingVariableLengthInputSequences
state_is_tuple,DistributingaDeepRNNAcrossMultipleGPUs,LSTMCell
staticunrollingthroughtime,StaticUnrollingThroughTime-StaticUnrollingThroughTime
static_rnn(),StaticUnrollingThroughTime-StaticUnrollingThroughTime,AnEncoder–DecoderNetworkforMachineTranslation
stationarypoint,SVMDualProblem-SVMDualProblem
statisticalmode,BaggingandPasting
statisticalsignificance,RegularizationHyperparameters
stemming,Exercises
stepfunctions,ThePerceptron
step(),IntroductiontoOpenAIGym
StochasticGradientBoosting,GradientBoosting
StochasticGradientDescent(SGD),StochasticGradientDescent-StochasticGradientDescent,SoftMarginClassification,ThePerceptron
training,TrainingandCostFunction
StochasticGradientDescent(SGD)classifier,TrainingaBinaryClassifier,RidgeRegression
stochasticneurons,BoltzmannMachines
stochasticpolicy,PolicySearch
stratifiedsampling,CreateaTestSet-CreateaTestSet,MeasuringAccuracyUsingCross-Validation
stride,ConvolutionalLayer
stringkernels,GaussianRBFKernel
string_input_producer(),Otherconveniencefunctions
stronglearners,VotingClassifiers
subderivatives,OnlineSVMs
subgradientvector,LassoRegression
subsample,GradientBoosting,PoolingLayer
supervisedlearning,Supervised/UnsupervisedLearning-Supervisedlearning
SupportVectorMachines(SVMs),MulticlassClassification,SupportVectorMachines-Exercises
decisionfunctionandpredictions,DecisionFunctionandPredictions-DecisionFunctionandPredictions
dualproblem,SVMDualProblem-SVMDualProblem
kernelizedSVM,KernelizedSVM-KernelizedSVM
linearclassification,LinearSVMClassification-SoftMarginClassification
mechanicsof,UndertheHood-OnlineSVMs
nonlinearclassification,NonlinearSVMClassification-ComputationalComplexity
onlineSVMs,OnlineSVMs-OnlineSVMs
QuadraticProgramming(QP)problems,QuadraticProgramming-QuadraticProgramming
SVMregression,SVMRegression-OnlineSVMs
thedualproblem,TheDualProblem
trainingobjective,TrainingObjective-TrainingObjective
supportvectors,LinearSVMClassification
svd(),PrincipalComponents
symbolicdifferentiation,Usingautodiff,SymbolicDifferentiation-NumericalDifferentiation
synchronousupdates,Synchronousupdates
T
t-DistributedStochasticNeighborEmbedding(t-SNE),OtherDimensionalityReductionTechniques
tailheavy,TakeaQuickLookattheDataStructure
targetattributes,TakeaQuickLookattheDataStructure
target_weights,AnEncoder–DecoderNetworkforMachineTranslation
tasks,MultipleDevicesAcrossMultipleServers
TemporalDifference(TD)Learning,TemporalDifferenceLearningandQ-Learning-TemporalDifferenceLearningandQ-Learning
tensorprocessingunits(TPUs),Installation
TensorBoard,UpandRunningwithTensorFlow
TensorFlow,UpandRunningwithTensorFlow-Exercises
about,ObjectiveandApproach
autodiff,Usingautodiff-Usingautodiff,Autodiff-Reverse-ModeAutodiff
BatchNormalizationwith,ImplementingBatchNormalizationwithTensorFlow-ImplementingBatchNormalizationwithTensorFlow
constructionphase,CreatingYourFirstGraphandRunningItinaSession
controldependencies,ControlDependencies
conveniencefunctions,Otherconveniencefunctions
convolutionallayers,ResNet
convolutionalneuralnetworksand,TensorFlowImplementation-TensorFlowImplementation
dataparallelismand,TensorFlowimplementation
denoisingautoencoders,TensorFlowImplementation-TensorFlowImplementation
dropoutwith,Dropout
dynamicplacer,PlacingOperationsonDevices
executionphase,CreatingYourFirstGraphandRunningItinaSession
feedingdatatothetrainingalgorithm,FeedingDatatotheTrainingAlgorithm-FeedingDatatotheTrainingAlgorithm
GradientDescentwith,ImplementingGradientDescent-UsinganOptimizer
graphs,managing,ManagingGraphs
initialgraphcreationandsessionrun,CreatingYourFirstGraphandRunningItinaSession-CreatingYourFirstGraphandRunningItinaSession
installation,Installation
l1andl2regularizationwith,ℓ1andℓ2Regularization
learningschedulesin,LearningRateScheduling
LinearRegressionwith,LinearRegressionwithTensorFlow-LinearRegressionwithTensorFlow
maxpoolinglayerin,PoolingLayer
max-normregularizationwith,Max-NormRegularization
modelzoo,ModelZoos
modularity,Modularity-Modularity
Momentumoptimizationin,MomentumOptimization
namescopes,NameScopes
neuralnetworkpolicies,NeuralNetworkPolicies
NLPtutorials,NaturalLanguageProcessing,AnEncoder–DecoderNetworkforMachineTranslation
nodevaluelifecycle,LifecycleofaNodeValue
operations(ops),LinearRegressionwithTensorFlow
optimizer,UsinganOptimizer
overview,UpandRunningwithTensorFlow-UpandRunningwithTensorFlow
paralleldistributedcomputing(seeparalleldistributedcomputingwithTensorFlow)
PythonAPI
construction,ConstructionPhase-ConstructionPhase
execution,ExecutionPhase
usingtheneuralnetwork,UsingtheNeuralNetwork
queues(seequeues)
reusingpretrainedlayers,ReusingaTensorFlowModel-ReusingaTensorFlowModel
RNNs in, Basic RNNs in TensorFlow-Handling Variable-Length Output Sequences (see also recurrent neural networks (RNNs))
savingandrestoringmodels,SavingandRestoringModels-SavingandRestoringModels
sharingvariables,SharingVariables-SharingVariables
simpleplacer,PlacingOperationsonDevices
sparseautoencoderswith,TensorFlowImplementation
andstackedautoencoders,TensorFlowImplementation
TensorBoard,VisualizingtheGraphandTrainingCurvesUsingTensorBoard-VisualizingtheGraphandTrainingCurvesUsingTensorBoard
tf.abs(),ℓ1andℓ2Regularization
tf.add(),Modularity,ℓ1andℓ2Regularization
tf.add_n(),Modularity-SharingVariables,SharingVariables-SharingVariables
tf.add_to_collection(),Max-NormRegularization
tf.assign(), Manually Computing the Gradients, Reusing Models from Other Frameworks, Max-Norm Regularization-Max-Norm Regularization, Chapter 9: Up and Running with TensorFlow
tf.bfloat16,Bandwidthsaturation
tf.bool,Dropout
tf.cast(),ConstructionPhase,TrainingaSequenceClassifier
tf.clip_by_norm(),Max-NormRegularization-Max-NormRegularization
tf.clip_by_value(),GradientClipping
tf.concat(),Exercises,GoogLeNet,NeuralNetworkPolicies,PolicyGradients
tf.ConfigProto,ManagingtheGPURAM,Loggingplacements-Softplacement,In-GraphVersusBetween-GraphReplication,Chapter12:DistributingTensorFlowAcrossDevicesandServers
tf.constant(),LifecycleofaNodeValue-ManuallyComputingtheGradients,Simpleplacement-Dynamicplacementfunction,ControlDependencies,OpeningaSession-PinningOperationsAcrossTasks
tf.constant_initializer(),SharingVariables-SharingVariables
tf.container(),SharingStateAcrossSessionsUsingResourceContainers-AsynchronousCommunicationUsingTensorFlowQueues,TensorFlowimplementation-Exercises,Chapter9:UpandRunningwithTensorFlow
tf.contrib.layers.l1_regularizer(),ℓ1andℓ2Regularization,Max-NormRegularization
tf.contrib.layers.l2_regularizer(),ℓ1andℓ2Regularization,TensorFlowImplementation-TyingWeights
tf.contrib.layers.variance_scaling_initializer(),XavierandHeInitialization-XavierandHeInitialization,TrainingaSequenceClassifier,TensorFlowImplementation-TyingWeights,VariationalAutoencoders,NeuralNetworkPolicies,PolicyGradients,LearningtoPlayMs.Pac-ManUsingDeepQ-Learning
tf.contrib.learn.DNNClassifier,TraininganMLPwithTensorFlow’sHigh-LevelAPI
tf.contrib.learn.infer_real_valued_columns_from_input(),TraininganMLPwithTensorFlow’sHigh-LevelAPI
tf.contrib.rnn.BasicLSTMCell,LSTMCell,PeepholeConnections
tf.contrib.rnn.BasicRNNCell,StaticUnrollingThroughTime-DynamicUnrollingThroughTime,TrainingaSequenceClassifier,TrainingtoPredictTimeSeries-TrainingtoPredictTimeSeries,TrainingtoPredictTimeSeries,DeepRNNs-ApplyingDropout,LSTMCell
tf.contrib.rnn.DropoutWrapper,ApplyingDropout
tf.contrib.rnn.GRUCell,GRUCell
tf.contrib.rnn.LSTMCell,PeepholeConnections
tf.contrib.rnn.MultiRNNCell,DeepRNNs-ApplyingDropout
tf.contrib.rnn.OutputProjectionWrapper,TrainingtoPredictTimeSeries-TrainingtoPredictTimeSeries
tf.contrib.rnn.RNNCell,DistributingaDeepRNNAcrossMultipleGPUs
tf.contrib.rnn.static_rnn(),BasicRNNsinTensorFlow-HandlingVariableLengthInputSequences,AnEncoder–DecoderNetworkforMachineTranslation-Exercises,Chapter14:RecurrentNeuralNetworks-Chapter14:RecurrentNeuralNetworks
tf.contrib.slimmodule,UpandRunningwithTensorFlow,Exercises
tf.contrib.slim.netsmodule(nets),Exercises
tf.control_dependencies(),ControlDependencies
tf.decode_csv(),Readingthetrainingdatadirectlyfromthegraph,MultithreadedreadersusingaCoordinatorandaQueueRunner
tf.device(),Simpleplacement-Softplacement,PinningOperationsAcrossTasks-ShardingVariablesAcrossMultipleParameterServers,DistributingaDeepRNNAcrossMultipleGPUs-DistributingaDeepRNNAcrossMultipleGPUs
tf.exp(),VariationalAutoencoders-GeneratingDigits
tf.FIFOQueue,AsynchronousCommunicationUsingTensorFlowQueues,Queuesoftuples-RandomShuffleQueue,Readingthetrainingdatadirectlyfromthegraph,MultithreadedreadersusingaCoordinatorandaQueueRunner
tf.float32,LinearRegressionwithTensorFlow,Chapter9:UpandRunningwithTensorFlow
tf.get_collection(),ReusingaTensorFlowModel-FreezingtheLowerLayers,ℓ1andℓ2Regularization,Max-NormRegularization,TensorFlowImplementation,LearningtoPlayMs.Pac-ManUsingDeepQ-Learning
tf.get_default_graph(),ManagingGraphs,VisualizingtheGraphandTrainingCurvesUsingTensorBoard
tf.get_default_session(),CreatingYourFirstGraphandRunningItinaSession
tf.get_variable(),SharingVariables-SharingVariables,ReusingModelsfromOtherFrameworks,ℓ1andℓ2Regularization
tf.global_variables(),ReusingaTensorFlowModel
tf.global_variables_initializer(),CreatingYourFirstGraphandRunningItinaSession,ManuallyComputingtheGradients
tf.gradients(),Usingautodiff
tf.Graph,CreatingYourFirstGraphandRunningItinaSession,ManagingGraphs,VisualizingtheGraphandTrainingCurvesUsingTensorBoard,LoadingDataDirectlyfromtheGraph,In-GraphVersusBetween-GraphReplication
tf.GraphKeys.GLOBAL_VARIABLES,ReusingaTensorFlowModel-FreezingtheLowerLayers
tf.GraphKeys.REGULARIZATION_LOSSES,ℓ1andℓ2Regularization,TensorFlowImplementation
tf.group(),LearningtoPlayMs.Pac-ManUsingDeepQ-Learning
tf.int32,Operationsandkernels-Queuesoftuples,Readingthetrainingdatadirectlyfromthegraph,HandlingVariableLengthInputSequences,TrainingaSequenceClassifier,WordEmbeddings,LearningtoPlayMs.Pac-ManUsingDeepQ-Learning
tf.int64,ConstructionPhase
tf.InteractiveSession,CreatingYourFirstGraphandRunningItinaSession
tf.layers.batch_normalization(), Implementing Batch Normalization with TensorFlow-Implementing Batch Normalization with TensorFlow
tf.layers.dense(),ConstructionPhase
TF.Learn,TraininganMLPwithTensorFlow’sHigh-LevelAPI
tf.log(),TensorFlowImplementation,VariationalAutoencoders,NeuralNetworkPolicies,PolicyGradients
tf.matmul(),LinearRegressionwithTensorFlow-ManuallyComputingtheGradients,Modularity,ConstructionPhase,BasicRNNsinTensorFlow,TyingWeights,TrainingOneAutoencoderataTime,TensorFlowImplementation,TensorFlowImplementation-TensorFlowImplementation
tf.matrix_inverse(),LinearRegressionwithTensorFlow
tf.maximum(),Modularity,SharingVariables-SharingVariables,NonsaturatingActivationFunctions
tf.multinomial(),NeuralNetworkPolicies,PolicyGradients
tf.name_scope(),NameScopes,Modularity-SharingVariables,ConstructionPhase,ConstructionPhase-ConstructionPhase,TrainingOneAutoencoderataTime-TrainingOneAutoencoderataTime
tf.nn.conv2d(),TensorFlowImplementation-TensorFlowImplementation
tf.nn.dynamic_rnn(),StaticUnrollingThroughTime-DynamicUnrollingThroughTime,TrainingaSequenceClassifier,TrainingtoPredictTimeSeries,TrainingtoPredictTimeSeries,DeepRNNs-ApplyingDropout,AnEncoder–DecoderNetworkforMachineTranslation-Exercises,Chapter14:RecurrentNeuralNetworks-Chapter14:RecurrentNeuralNetworks
tf.nn.elu(),NonsaturatingActivationFunctions,TensorFlowImplementation-TyingWeights,VariationalAutoencoders,NeuralNetworkPolicies,PolicyGradients
tf.nn.embedding_lookup(),WordEmbeddings
tf.nn.in_top_k(),ConstructionPhase,TrainingaSequenceClassifier
tf.nn.max_pool(),PoolingLayer-PoolingLayer
tf.nn.relu(), Construction Phase, Training to Predict Time Series-Training to Predict Time Series, Training to Predict Time Series, Learning to Play Ms. Pac-Man Using Deep Q-Learning
tf.nn.sigmoid_cross_entropy_with_logits(),TensorFlowImplementation,GeneratingDigits,PolicyGradients-PolicyGradients
tf.nn.sparse_softmax_cross_entropy_with_logits(),ConstructionPhase-ConstructionPhase,TrainingaSequenceClassifier
tf.one_hot(),LearningtoPlayMs.Pac-ManUsingDeepQ-Learning
tf.PaddingFIFOQueue,PaddingFifoQueue
tf.placeholder(),FeedingDatatotheTrainingAlgorithm-FeedingDatatotheTrainingAlgorithm,Chapter9:UpandRunningwithTensorFlow
tf.placeholder_with_default(),TensorFlowImplementation
tf.RandomShuffleQueue,RandomShuffleQueue,Readingthetrainingdatadirectlyfromthegraph-Readingthetrainingdatadirectlyfromthegraph,MultithreadedreadersusingaCoordinatorandaQueueRunner-Otherconveniencefunctions
tf.random_normal(),Modularity,BasicRNNsinTensorFlow,TensorFlowImplementation,VariationalAutoencoders
tf.random_uniform(),ManuallyComputingtheGradients,SavingandRestoringModels,WordEmbeddings,Chapter9:UpandRunningwithTensorFlow
tf.reduce_mean(),ManuallyComputingtheGradients,NameScopes,ConstructionPhase-ConstructionPhase,ℓ1andℓ2Regularization,TrainingaSequenceClassifier-TrainingaSequenceClassifier,PerformingPCAwithanUndercompleteLinearAutoencoder,TensorFlowImplementation,TrainingOneAutoencoderataTime,TrainingOneAutoencoderataTime,TensorFlowImplementation,TensorFlowImplementation,LearningtoPlayMs.Pac-ManUsingDeepQ-Learning
tf.reduce_sum(),ℓ1andℓ2Regularization,TensorFlowImplementation-TensorFlowImplementation,VariationalAutoencoders-GeneratingDigits,LearningtoPlayMs.Pac-ManUsingDeepQ-Learning-LearningtoPlayMs.Pac-ManUsingDeepQ-Learning
tf.reset_default_graph(),ManagingGraphs
tf.reshape(),TrainingtoPredictTimeSeries,LearningtoPlayMs.Pac-ManUsingDeepQ-Learning
tf.RunOptions,In-GraphVersusBetween-GraphReplication
tf.Session,CreatingYourFirstGraphandRunningItinaSession,Chapter9:UpandRunningwithTensorFlow
tf.shape(),TensorFlowImplementation,VariationalAutoencoders
tf.square(),ManuallyComputingtheGradients,NameScopes,TrainingtoPredictTimeSeries,PerformingPCAwithanUndercompleteLinearAutoencoder,TensorFlowImplementation,TrainingOneAutoencoderataTime,TrainingOneAutoencoderataTime,TensorFlowImplementation,TensorFlowImplementation,VariationalAutoencoders-GeneratingDigits,LearningtoPlayMs.Pac-ManUsingDeepQ-Learning
tf.stack(),Readingthetrainingdatadirectlyfromthegraph,MultithreadedreadersusingaCoordinatorandaQueueRunner,StaticUnrollingThroughTime
tf.string,Readingthetrainingdatadirectlyfromthegraph,MultithreadedreadersusingaCoordinatorandaQueueRunner
tf.summary.FileWriter,VisualizingtheGraphandTrainingCurvesUsingTensorBoard-VisualizingtheGraphandTrainingCurvesUsingTensorBoard
tf.summary.scalar(),VisualizingtheGraphandTrainingCurvesUsingTensorBoard
tf.tanh(),BasicRNNsinTensorFlow
tf.TextLineReader,Readingthetrainingdatadirectlyfromthegraph,MultithreadedreadersusingaCoordinatorandaQueueRunner
tf.to_float(),PolicyGradients-PolicyGradients
tf.train.AdamOptimizer,AdamOptimization,AdamOptimization,TrainingaSequenceClassifier,TrainingtoPredictTimeSeries,PerformingPCAwithanUndercompleteLinearAutoencoder,TensorFlowImplementation-TyingWeights,TrainingOneAutoencoderataTime,TensorFlowImplementation,GeneratingDigits,PolicyGradients-PolicyGradients,LearningtoPlayMs.Pac-ManUsingDeepQ-Learning
tf.train.ClusterSpec,MultipleDevicesAcrossMultipleServers
tf.train.Coordinator,MultithreadedreadersusingaCoordinatorandaQueueRunner-MultithreadedreadersusingaCoordinatorandaQueueRunner
tf.train.exponential_decay(),LearningRateScheduling
tf.train.GradientDescentOptimizer,UsinganOptimizer,ConstructionPhase,GradientClipping,MomentumOptimization,AdamOptimization
tf.train.MomentumOptimizer,UsinganOptimizer,MomentumOptimization-NesterovAcceleratedGradient,LearningRateScheduling,Exercises,TensorFlowimplementation,Chapter10:IntroductiontoArtificialNeuralNetworks-Chapter11:TrainingDeepNeuralNets
tf.train.QueueRunner,MultithreadedreadersusingaCoordinatorandaQueueRunner-Otherconveniencefunctions
tf.train.replica_device_setter(),ShardingVariablesAcrossMultipleParameterServers-SharingStateAcrossSessionsUsingResourceContainers
tf.train.RMSPropOptimizer,RMSProp
tf.train.Saver,SavingandRestoringModels-SavingandRestoringModels,ConstructionPhase,Exercises,ApplyingDropout,PolicyGradients,LearningtoPlayMs.Pac-ManUsingDeepQ-Learning
tf.train.Server,MultipleDevicesAcrossMultipleServers
tf.train.start_queue_runners(),Otherconveniencefunctions
tf.transpose(),LinearRegressionwithTensorFlow-ManuallyComputingtheGradients,StaticUnrollingThroughTime,TyingWeights
tf.truncated_normal(),ConstructionPhase
tf.unstack(),StaticUnrollingThroughTime-DynamicUnrollingThroughTime,TrainingtoPredictTimeSeries,Chapter14:RecurrentNeuralNetworks
tf.Variable,CreatingYourFirstGraphandRunningItinaSession,Chapter9:UpandRunningwithTensorFlow
tf.variable_scope(), Sharing Variables-Sharing Variables, Reusing Models from Other Frameworks, Sharing State Across Sessions Using Resource Containers, Training a Sequence Classifier, Learning to Play Ms. Pac-Man Using Deep Q-Learning
tf.zeros(),ConstructionPhase,BasicRNNsinTensorFlow,TyingWeights
truncatedbackpropagationthroughtime,TheDifficultyofTrainingoverManyTimeSteps
visualizinggraphandtrainingcurves,VisualizingtheGraphandTrainingCurvesUsingTensorBoard-VisualizingtheGraphandTrainingCurvesUsingTensorBoard
TensorFlowServing,OneNeuralNetworkperDevice
tensorflow.contrib,TraininganMLPwithTensorFlow’sHigh-LevelAPI
testset,TestingandValidating,CreateaTestSet-CreateaTestSet,MNIST
testingandvalidating,TestingandValidating-TestingandValidating
textattributes,HandlingTextandCategoricalAttributes-HandlingTextandCategoricalAttributes
TextLineReader,Readingthetrainingdatadirectlyfromthegraph
TF-slim,UpandRunningwithTensorFlow
tf.layers.conv1d(),ResNet
tf.layers.conv2d(),LearningtoPlayMs.Pac-ManUsingDeepQ-Learning
tf.layers.conv2d_transpose(),ResNet
tf.layers.conv3d(),ResNet
tf.layers.dense(),XavierandHeInitialization,ImplementingBatchNormalizationwithTensorFlow
tf.layers.separable_conv2d(),ResNet
TF.Learn,UpandRunningwithTensorFlow,TraininganMLPwithTensorFlow’sHigh-LevelAPI
tf.nn.atrous_conv2d(),ResNet
tf.nn.depthwise_conv2d(),ResNet
thermalequilibrium,BoltzmannMachines
thread pools (inter-op/intra-op), in TensorFlow, Parallel Execution
thresholdvariable,SharingVariables-SharingVariables
Tikhonovregularization,RidgeRegression
timeseriesdata,RecurrentNeuralNetworks
toarray(),HandlingTextandCategoricalAttributes
tolerancehyperparameter,ComputationalComplexity
training,ImplementingBatchNormalizationwithTensorFlow-ImplementingBatchNormalizationwithTensorFlow,ApplyingDropout
trainingdata,WhatIsMachineLearning?
insufficientquantities,InsufficientQuantityofTrainingData
irrelevantfeatures,IrrelevantFeatures
loading,LoadingDataDirectlyfromtheGraph-Otherconveniencefunctions
nonrepresentative,NonrepresentativeTrainingData
overfitting,OverfittingtheTrainingData-OverfittingtheTrainingData
poorquality,Poor-QualityData
underfitting,UnderfittingtheTrainingData
traininginstance,WhatIsMachineLearning?
trainingmodels,Model-basedlearning,TrainingModels-Exercises
learningcurvesin,LearningCurves-LearningCurves
LinearRegression,TrainingModels,LinearRegression-Mini-batchGradientDescent
LogisticRegression,LogisticRegression-SoftmaxRegression
overview,TrainingModels-TrainingModels
PolynomialRegression,TrainingModels,PolynomialRegression-PolynomialRegression
trainingobjectives,TrainingObjective-TrainingObjective
trainingset,WhatIsMachineLearning?,TestingandValidating,DiscoverandVisualizetheDatatoGainInsights,PreparetheDataforMachineLearningAlgorithms,TrainingandEvaluatingontheTrainingSet-TrainingandEvaluatingontheTrainingSet
costfunctionof,TrainingandCostFunction-TrainingandCostFunction
shuffling,MNIST
transfer learning, Reusing Pretrained Layers-Pretraining on an Auxiliary Task (see also pretrained layers reuse)
transform(),DataCleaning,TransformationPipelines
transformationpipelines,TransformationPipelines-SelectandTrainaModel
transformers,DataCleaning
transformers,custom,CustomTransformers-CustomTransformers
transpose(),StaticUnrollingThroughTime
truenegativerate(TNR),TheROCCurve
truepositiverate(TPR),ConfusionMatrix,TheROCCurve
truncatedbackpropagationthroughtime,TheDifficultyofTrainingoverManyTimeSteps
tuples,Queuesoftuples
tyingweights,TyingWeights
U
underfitting,UnderfittingtheTrainingData,TrainingandEvaluatingontheTrainingSet,GaussianRBFKernel
univariateregression,FrametheProblem
unstack(),StaticUnrollingThroughTime
unsupervisedlearning,Unsupervisedlearning-Unsupervisedlearning
anomalydetection,Unsupervisedlearning
associationrulelearning,Unsupervisedlearning,Unsupervisedlearning
clustering,Unsupervisedlearning
dimensionalityreductionalgorithm,Unsupervisedlearning
visualizationalgorithms,Unsupervisedlearning
unsupervisedpretraining,UnsupervisedPretraining-UnsupervisedPretraining,UnsupervisedPretrainingUsingStackedAutoencoders-UnsupervisedPretrainingUsingStackedAutoencoders
upsampling,ResNet
utilityfunction,Model-basedlearning
V
validationset,TestingandValidating
ValueIteration,MarkovDecisionProcesses
value_counts(),TakeaQuickLookattheDataStructure
vanishing gradients, Vanishing/Exploding Gradients Problems (see also gradients, vanishing and exploding)
variables,sharing,SharingVariables-SharingVariables
variable_scope(),SharingVariables-SharingVariables
variance
bias/variancetradeoff,LearningCurves
variancepreservation,PreservingtheVariance-PreservingtheVariance
variance_scaling_initializer(),XavierandHeInitialization
variationalautoencoders,VariationalAutoencoders-GeneratingDigits
VGGNet,ResNet
visualcortex,TheArchitectureoftheVisualCortex
visualization,VisualizingtheGraphandTrainingCurvesUsingTensorBoard-VisualizingtheGraphandTrainingCurvesUsingTensorBoard
visualizationalgorithms,Unsupervisedlearning-Unsupervisedlearning
voicerecognition,ConvolutionalNeuralNetworks
votingclassifiers,VotingClassifiers-VotingClassifiers
W
warmupphase,Asynchronousupdates
weaklearners,VotingClassifiers
weight-tying,TyingWeights
weights,ConstructionPhase
freezing,FreezingtheLowerLayers
while_loop(),DynamicUnrollingThroughTime
whiteboxmodels,MakingPredictions
worker,MultipleDevicesAcrossMultipleServers
workerservice,TheMasterandWorkerServices
worker_device,ShardingVariablesAcrossMultipleParameterServers
workspacedirectory,GettheData-DownloadtheData
X
Xavierinitialization,Vanishing/ExplodingGradientsProblems-XavierandHeInitialization
Y
YouTube,IntroductiontoArtificialNeuralNetworks
Z
zeropadding,ConvolutionalLayer,TensorFlowImplementation
About the Author

Aurélien Géron is a Machine Learning consultant. A former Googler, he led the YouTube video classification team from 2013 to 2016. He was also a founder and CTO of Wifirst from 2002 to 2012, a leading Wireless ISP in France; and a founder and CTO of Polyconseil in 2001, the firm that now manages the electric car sharing service Autolib’.

Before this he worked as an engineer in a variety of domains: finance (JP Morgan and Société Générale), defense (Canada’s DOD), and healthcare (blood transfusion). He published a few technical books (on C++, WiFi, and internet architectures), and was a Computer Science lecturer in a French engineering school.

A few fun facts: he taught his three children to count in binary with their fingers (up to 1023), he studied microbiology and evolutionary genetics before going into software engineering, and his parachute didn’t open on the second jump.
Colophon

The animal on the cover of Hands-On Machine Learning with Scikit-Learn and TensorFlow is the far eastern fire salamander (Salamandra infraimmaculata), an amphibian found in the Middle East. They have black skin featuring large yellow spots on their back and head. These spots are a warning coloration meant to keep predators at bay. Full-grown salamanders can be over a foot in length.

Far eastern fire salamanders live in subtropical shrubland and forests near rivers or other freshwater bodies. They spend most of their life on land, but lay their eggs in the water. They subsist mostly on a diet of insects, worms, and small crustaceans, but occasionally eat other salamanders. Males of the species have been known to live up to 23 years, while females can live up to 21 years.

Although not yet endangered, the far eastern fire salamander population is in decline. Primary threats include damming of rivers (which disrupts the salamander’s breeding) and pollution. They are also threatened by the recent introduction of predatory fish, such as the mosquitofish. These fish were intended to control the mosquito population, but they also feed on young salamanders.

Many of the animals on O’Reilly covers are endangered; all of them are important to the world. To learn more about how you can help, go to animals.oreilly.com.

The cover image is from Wood’s Illustrated Natural History. The cover fonts are URW Typewriter and Guardian Sans. The text font is Adobe Minion Pro; the heading font is Adobe Myriad Condensed; and the code font is Dalton Maag’s Ubuntu Mono.
Preface
The Machine Learning Tsunami
Machine Learning in Your Projects
Objective and Approach
Prerequisites
Roadmap
Other Resources
Conventions Used in This Book
Using Code Examples
O’Reilly Safari
How to Contact Us
Acknowledgments
I. The Fundamentals of Machine Learning
1. The Machine Learning Landscape
What Is Machine Learning?
Why Use Machine Learning?
Types of Machine Learning Systems
Supervised/Unsupervised Learning
Batch and Online Learning
Instance-Based Versus Model-Based Learning
Main Challenges of Machine Learning
Insufficient Quantity of Training Data
Nonrepresentative Training Data
Poor-Quality Data
Irrelevant Features
Overfitting the Training Data
Underfitting the Training Data
Stepping Back
Testing and Validating
Exercises
2. End-to-End Machine Learning Project
Working with Real Data
Look at the Big Picture
Frame the Problem
Select a Performance Measure
Check the Assumptions
Get the Data
Create the Workspace
Download the Data
Take a Quick Look at the Data Structure
Create a Test Set
Discover and Visualize the Data to Gain Insights
Visualizing Geographical Data
Looking for Correlations
Experimenting with Attribute Combinations
Prepare the Data for Machine Learning Algorithms
Data Cleaning
Handling Text and Categorical Attributes
Custom Transformers
Feature Scaling
Transformation Pipelines
Select and Train a Model
Training and Evaluating on the Training Set
Better Evaluation Using Cross-Validation
Fine-Tune Your Model
Grid Search
Randomized Search
Ensemble Methods
Analyze the Best Models and Their Errors
Evaluate Your System on the Test Set
Launch, Monitor, and Maintain Your System
Try It Out!
Exercises
3. Classification
MNIST
Training a Binary Classifier
Performance Measures
Measuring Accuracy Using Cross-Validation
Confusion Matrix
Precision and Recall
Precision/Recall Tradeoff
The ROC Curve
Multiclass Classification
Error Analysis
Multilabel Classification
Multioutput Classification
Exercises
4. Training Models
Linear Regression
The Normal Equation
Computational Complexity
Gradient Descent
Batch Gradient Descent
Stochastic Gradient Descent
Mini-batch Gradient Descent
Polynomial Regression
Learning Curves
Regularized Linear Models
Ridge Regression
Lasso Regression
Elastic Net
Early Stopping
Logistic Regression
Estimating Probabilities
Training and Cost Function
Decision Boundaries
Softmax Regression
Exercises
5. Support Vector Machines
Linear SVM Classification
Soft Margin Classification
Nonlinear SVM Classification
Polynomial Kernel
Adding Similarity Features
Gaussian RBF Kernel
Computational Complexity
SVM Regression
Under the Hood
Decision Function and Predictions
Training Objective
Quadratic Programming
The Dual Problem
Kernelized SVM
Online SVMs
Exercises
6. Decision Trees
Training and Visualizing a Decision Tree
Making Predictions
Estimating Class Probabilities
The CART Training Algorithm
Computational Complexity
Gini Impurity or Entropy?
Regularization Hyperparameters
Regression
Instability
Exercises
7. Ensemble Learning and Random Forests
Voting Classifiers
Bagging and Pasting
Bagging and Pasting in Scikit-Learn
Out-of-Bag Evaluation
Random Patches and Random Subspaces
Random Forests
Extra-Trees
Feature Importance
Boosting
AdaBoost
Gradient Boosting
Stacking
Exercises
8. Dimensionality Reduction
The Curse of Dimensionality
Main Approaches for Dimensionality Reduction
Projection
Manifold Learning
PCA
Preserving the Variance
Principal Components
Projecting Down to d Dimensions
Using Scikit-Learn
Explained Variance Ratio
Choosing the Right Number of Dimensions
PCA for Compression
Incremental PCA
Randomized PCA
Kernel PCA
Selecting a Kernel and Tuning Hyperparameters
LLE
Other Dimensionality Reduction Techniques
Exercises
II. Neural Networks and Deep Learning
9. Up and Running with TensorFlow
Installation
Creating Your First Graph and Running It in a Session
Managing Graphs
Lifecycle of a Node Value
Linear Regression with TensorFlow
Implementing Gradient Descent
Manually Computing the Gradients
Using autodiff
Using an Optimizer
Feeding Data to the Training Algorithm
Saving and Restoring Models
Visualizing the Graph and Training Curves Using TensorBoard
Name Scopes
Modularity
Sharing Variables
Exercises
10. Introduction to Artificial Neural Networks
From Biological to Artificial Neurons
Biological Neurons
Logical Computations with Neurons
The Perceptron
Multi-Layer Perceptron and Backpropagation
Training an MLP with TensorFlow’s High-Level API
Training a DNN Using Plain TensorFlow
Construction Phase
Execution Phase
Using the Neural Network
Fine-Tuning Neural Network Hyperparameters
Number of Hidden Layers
Number of Neurons per Hidden Layer
Activation Functions
Exercises
11. Training Deep Neural Nets
Vanishing/Exploding Gradients Problems
Xavier and He Initialization
Nonsaturating Activation Functions
Batch Normalization
Gradient Clipping
Reusing Pretrained Layers
Reusing a TensorFlow Model
Reusing Models from Other Frameworks
Freezing the Lower Layers
Caching the Frozen Layers
Tweaking, Dropping, or Replacing the Upper Layers
Model Zoos
Unsupervised Pretraining
Pretraining on an Auxiliary Task
Faster Optimizers
Momentum Optimization
Nesterov Accelerated Gradient
AdaGrad
RMSProp
Adam Optimization
Learning Rate Scheduling
Avoiding Overfitting Through Regularization
Early Stopping
ℓ1 and ℓ2 Regularization
Dropout
Max-Norm Regularization
Data Augmentation
Practical Guidelines
Exercises
12. Distributing TensorFlow Across Devices and Servers
Multiple Devices on a Single Machine
Installation
Managing the GPU RAM
Placing Operations on Devices
Parallel Execution
Control Dependencies
Multiple Devices Across Multiple Servers
Opening a Session
The Master and Worker Services
Pinning Operations Across Tasks
Sharding Variables Across Multiple Parameter Servers
Sharing State Across Sessions Using Resource Containers
Asynchronous Communication Using TensorFlow Queues
Loading Data Directly from the Graph
Parallelizing Neural Networks on a TensorFlow Cluster
One Neural Network per Device
In-Graph Versus Between-Graph Replication
Model Parallelism
Data Parallelism
Exercises
13. Convolutional Neural Networks
The Architecture of the Visual Cortex
Convolutional Layer
Filters
Stacking Multiple Feature Maps
TensorFlow Implementation
Memory Requirements
Pooling Layer
CNN Architectures
LeNet-5
AlexNet
GoogLeNet
ResNet
Exercises
14. Recurrent Neural Networks
Recurrent Neurons
Memory Cells
Input and Output Sequences
Basic RNNs in TensorFlow
Static Unrolling Through Time
Dynamic Unrolling Through Time
Handling Variable Length Input Sequences
Handling Variable-Length Output Sequences
Training RNNs
Training a Sequence Classifier
Training to Predict Time Series
Creative RNN
Deep RNNs
Distributing a Deep RNN Across Multiple GPUs
Applying Dropout
The Difficulty of Training over Many Time Steps
LSTM Cell
Peephole Connections
GRU Cell
Natural Language Processing
Word Embeddings
An Encoder–Decoder Network for Machine Translation
Exercises
15. Autoencoders
Efficient Data Representations
Performing PCA with an Undercomplete Linear Autoencoder
Stacked Autoencoders
TensorFlow Implementation
Tying Weights
Training One Autoencoder at a Time
Visualizing the Reconstructions
Visualizing Features
Unsupervised Pretraining Using Stacked Autoencoders
Denoising Autoencoders
TensorFlow Implementation
Sparse Autoencoders
TensorFlow Implementation
Variational Autoencoders
Generating Digits
Other Autoencoders
Exercises
16. Reinforcement Learning
Learning to Optimize Rewards
Policy Search
Introduction to OpenAI Gym
Neural Network Policies
Evaluating Actions: The Credit Assignment Problem
Policy Gradients
Markov Decision Processes
Temporal Difference Learning and Q-Learning
Exploration Policies
Approximate Q-Learning
Learning to Play Ms. Pac-Man Using Deep Q-Learning
Exercises
Thank You!
A. Exercise Solutions
Chapter 1: The Machine Learning Landscape
Chapter 2: End-to-End Machine Learning Project
Chapter 3: Classification
Chapter 4: Training Models
Chapter 5: Support Vector Machines
Chapter 6: Decision Trees
Chapter 7: Ensemble Learning and Random Forests
Chapter 8: Dimensionality Reduction
Chapter 9: Up and Running with TensorFlow
Chapter 10: Introduction to Artificial Neural Networks
Chapter 11: Training Deep Neural Nets
Chapter 12: Distributing TensorFlow Across Devices and Servers
Chapter 13: Convolutional Neural Networks
Chapter 14: Recurrent Neural Networks
Chapter 15: Autoencoders
Chapter 16: Reinforcement Learning
B. Machine Learning Project Checklist
Frame the Problem and Look at the Big Picture
Get the Data
Explore the Data
Prepare the Data
Short-List Promising Models
Fine-Tune the System
Present Your Solution
Launch!
C. SVM Dual Problem
D. Autodiff
Manual Differentiation
Symbolic Differentiation
Numerical Differentiation
Forward-Mode Autodiff
Reverse-Mode Autodiff
E. Other Popular ANN Architectures
Hopfield Networks
Boltzmann Machines
Restricted Boltzmann Machines
Deep Belief Nets
Self-Organizing Maps
Index