
Hands-On Machine Learning with Scikit-Learn and TensorFlow

Concepts, Tools, and Techniques to Build Intelligent Systems

Aurélien Géron

Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron

Copyright © 2017 Aurélien Géron. All rights reserved.

Printed in the United States of America.

Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editor: Nicole Tache
Production Editor: Nicholas Adams
Copyeditor: Rachel Monaghan
Proofreader: Charles Roumeliotis
Indexer: Wendy Catalano
Interior Designer: David Futato
Cover Designer: Randy Comer
Illustrator: Rebecca Demarest

March 2017: First Edition

Revision History for the First Edition
2017-03-10: First Release
2017-06-09: Second Release

See http://oreilly.com/catalog/errata.csp?isbn=9781491962299 for release details.

The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. Hands-On Machine Learning with Scikit-Learn and TensorFlow, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc.

While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-491-96229-9

[LSI]

Preface

The Machine Learning Tsunami

In 2006, Geoffrey Hinton et al. published a paper1 showing how to train a deep neural network capable of recognizing handwritten digits with state-of-the-art precision (>98%). They branded this technique "Deep Learning." Training a deep neural net was widely considered impossible at the time,2 and most researchers had abandoned the idea since the 1990s. This paper revived the interest of the scientific community, and before long many new papers demonstrated that Deep Learning was not only possible, but capable of mind-blowing achievements that no other Machine Learning (ML) technique could hope to match (with the help of tremendous computing power and great amounts of data). This enthusiasm soon extended to many other areas of Machine Learning.

Fast-forward 10 years and Machine Learning has conquered the industry: it is now at the heart of much of the magic in today's high-tech products, ranking your web search results, powering your smartphone's speech recognition, recommending videos, and beating the world champion at the game of Go. Before you know it, it will be driving your car.

Machine Learning in Your Projects

So naturally you are excited about Machine Learning and you would love to join the party!

Perhaps you would like to give your homemade robot a brain of its own? Make it recognize faces? Or learn to walk around?

Or maybe your company has tons of data (user logs, financial data, production data, machine sensor data, hotline stats, HR reports, etc.), and more than likely you could unearth some hidden gems if you just knew where to look; for example:
Segment customers and find the best marketing strategy for each group
Recommend products for each client based on what similar clients bought
Detect which transactions are likely to be fraudulent
Predict next year's revenue
And more

Whatever the reason, you have decided to learn Machine Learning and implement it in your projects. Great idea!

Objective and Approach

This book assumes that you know close to nothing about Machine Learning. Its goal is to give you the concepts, the intuitions, and the tools you need to actually implement programs capable of learning from data.

We will cover a large number of techniques, from the simplest and most commonly used (such as linear regression) to some of the Deep Learning techniques that regularly win competitions.

Rather than implementing our own toy versions of each algorithm, we will be using actual production-ready Python frameworks:

Scikit-Learn is very easy to use, yet it implements many Machine Learning algorithms efficiently, so it makes for a great entry point to learn Machine Learning.

TensorFlow is a more complex library for distributed numerical computation using data flow graphs. It makes it possible to train and run very large neural networks efficiently by distributing the computations across potentially thousands of multi-GPU servers. TensorFlow was created at Google and supports many of their large-scale Machine Learning applications. It was open-sourced in November 2015.

The book favors a hands-on approach, growing an intuitive understanding of Machine Learning through concrete working examples and just a little bit of theory. While you can read this book without picking up your laptop, we highly recommend you experiment with the code examples available online as Jupyter notebooks at https://github.com/ageron/handson-ml.

Prerequisites

This book assumes that you have some Python programming experience and that you are familiar with Python's main scientific libraries, in particular NumPy, Pandas, and Matplotlib.

Also, if you care about what's under the hood you should have a reasonable understanding of college-level math as well (calculus, linear algebra, probabilities, and statistics).

If you don't know Python yet, http://learnpython.org/ is a great place to start. The official tutorial on python.org is also quite good.

If you have never used Jupyter, Chapter 2 will guide you through installation and the basics: it is a great tool to have in your toolbox.

If you are not familiar with Python's scientific libraries, the provided Jupyter notebooks include a few tutorials. There is also a quick math tutorial for linear algebra.

Roadmap

This book is organized in two parts. Part I, The Fundamentals of Machine Learning, covers the following topics:
What is Machine Learning? What problems does it try to solve? What are the main categories and fundamental concepts of Machine Learning systems?
The main steps in a typical Machine Learning project.
Learning by fitting a model to data.
Optimizing a cost function.
Handling, cleaning, and preparing data.
Selecting and engineering features.
Selecting a model and tuning hyperparameters using cross-validation.
The main challenges of Machine Learning, in particular underfitting and overfitting (the bias/variance tradeoff).
Reducing the dimensionality of the training data to fight the curse of dimensionality.
The most common learning algorithms: Linear and Polynomial Regression, Logistic Regression, k-Nearest Neighbors, Support Vector Machines, Decision Trees, Random Forests, and Ensemble methods.

Part II, Neural Networks and Deep Learning, covers the following topics:
What are neural nets? What are they good for?
Building and training neural nets using TensorFlow.
The most important neural net architectures: feedforward neural nets, convolutional nets, recurrent nets, long short-term memory (LSTM) nets, and autoencoders.
Techniques for training deep neural nets.
Scaling neural networks for huge datasets.
Reinforcement learning.

The first part is based mostly on Scikit-Learn while the second part uses TensorFlow.

CAUTION: Don't jump into deep waters too hastily: while Deep Learning is no doubt one of the most exciting areas in Machine Learning, you should master the fundamentals first. Moreover, most problems can be solved quite well using simpler techniques such as Random Forests and Ensemble methods (discussed in Part I). Deep Learning is best suited for complex problems such as image recognition, speech recognition, or natural language processing, provided you have enough data, computing power, and patience.

Other Resources

Many resources are available to learn about Machine Learning. Andrew Ng's ML course on Coursera and Geoffrey Hinton's course on neural networks and Deep Learning are amazing, although they both require a significant time investment (think months).

There are also many interesting websites about Machine Learning, including of course Scikit-Learn's exceptional User Guide. You may also enjoy Dataquest, which provides very nice interactive tutorials, and ML blogs such as those listed on Quora. Finally, the Deep Learning website has a good list of resources to learn more.

Of course there are also many other introductory books about Machine Learning, in particular:

Joel Grus, Data Science from Scratch (O'Reilly). This book presents the fundamentals of Machine Learning, and implements some of the main algorithms in pure Python (from scratch, as the name suggests).

Stephen Marsland, Machine Learning: An Algorithmic Perspective (Chapman and Hall). This book is a great introduction to Machine Learning, covering a wide range of topics in depth, with code examples in Python (also from scratch, but using NumPy).

Sebastian Raschka, Python Machine Learning (Packt Publishing). Also a great introduction to Machine Learning, this book leverages Python open source libraries (Pylearn2 and Theano).

Yaser S. Abu-Mostafa, Malik Magdon-Ismail, and Hsuan-Tien Lin, Learning from Data (AMLBook). A rather theoretical approach to ML, this book provides deep insights, in particular on the bias/variance tradeoff (see Chapter 4).

Stuart Russell and Peter Norvig, Artificial Intelligence: A Modern Approach, 3rd Edition (Pearson). This is a great (and huge) book covering an incredible amount of topics, including Machine Learning. It helps put ML into perspective.

Finally, a great way to learn is to join ML competition websites such as Kaggle.com: this will allow you to practice your skills on real-world problems, with help and insights from some of the best ML professionals out there.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width
Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements and keywords.

Constant width bold
Shows commands or other text that should be typed literally by the user.

Constant width italic
Shows text that should be replaced with user-supplied values or by values determined by context.

TIP: This element signifies a tip or suggestion.

NOTE: This element signifies a general note.

WARNING: This element indicates a warning or caution.

Using Code Examples

Supplemental material (code examples, exercises, etc.) is available for download at https://github.com/ageron/handson-ml.

This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you're reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O'Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product's documentation does require permission.

We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: "Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron (O'Reilly). Copyright 2017 Aurélien Géron, 978-1-491-96229-9."

If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at permissions@oreilly.com.

O'Reilly Safari

NOTE: Safari (formerly Safari Books Online) is a membership-based training and reference platform for enterprise, government, educators, and individuals.

Members have access to thousands of books, training videos, Learning Paths, interactive tutorials, and curated playlists from over 250 publishers, including O'Reilly Media, Harvard Business Review, Prentice Hall Professional, Addison-Wesley Professional, Microsoft Press, Sams, Que, Peachpit Press, Adobe, Focal Press, Cisco Press, John Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt, Adobe Press, FT Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett, and Course Technology, among others.

For more information, please visit http://oreilly.com/safari.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

O'Reilly Media, Inc.
1005 Gravenstein Highway North
Sebastopol, CA 95472
800-998-9938 (in the United States or Canada)
707-829-0515 (international or local)
707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at http://bit.ly/hands-on-machine-learning-with-scikit-learn-and-tensorflow.

To comment or ask technical questions about this book, send email to bookquestions@oreilly.com.

For more information about our books, courses, conferences, and news, see our website at http://www.oreilly.com.

Find us on Facebook: http://facebook.com/oreilly
Follow us on Twitter: http://twitter.com/oreillymedia
Watch us on YouTube: http://www.youtube.com/oreillymedia

Acknowledgments

I would like to thank my Google colleagues, in particular the YouTube video classification team, for teaching me so much about Machine Learning. I could never have started this project without them. Special thanks to my personal ML gurus: Clément Courbet, Julien Dubois, Mathias Kende, Daniel Kitachewsky, James Pack, Alexander Pak, Anosh Raj, Vitor Sessak, Wiktor Tomczak, Ingrid von Glehn, Rich Washington, and everyone at YouTube Paris.

I am incredibly grateful to all the amazing people who took time out of their busy lives to review my book in so much detail. Thanks to Pete Warden for answering all my TensorFlow questions, reviewing Part II, providing many interesting insights, and of course for being part of the core TensorFlow team. You should definitely check out his blog! Many thanks to Lukas Biewald for his very thorough review of Part II: he left no stone unturned, tested all the code (and caught a few errors), made many great suggestions, and his enthusiasm was contagious. You should check out his blog and his cool robots! Thanks to Justin Francis, who also reviewed Part II very thoroughly, catching errors and providing great insights, in particular in Chapter 16. Check out his posts on TensorFlow!

Huge thanks as well to David Andrzejewski, who reviewed Part I and provided incredibly useful feedback, identifying unclear sections and suggesting how to improve them. Check out his website! Thanks to Grégoire Mesnil, who reviewed Part II and contributed very interesting practical advice on training neural networks. Thanks as well to Eddy Hung, Salim Sémaoune, Karim Matrah, Ingrid von Glehn, Iain Smears, and Vincent Guilbeau for reviewing Part I and making many useful suggestions. And I also wish to thank my father-in-law, Michel Tessier, former mathematics teacher and now a great translator of Anton Chekhov, for helping me iron out some of the mathematics and notations in this book and reviewing the linear algebra Jupyter notebook.

And of course, a gigantic "thank you" to my dear brother Sylvain, who reviewed every single chapter, tested every line of code, provided feedback on virtually every section, and encouraged me from the first line to the last. Love you, bro!

Many thanks as well to O'Reilly's fantastic staff, in particular Nicole Tache, who gave me insightful feedback and was always cheerful, encouraging, and helpful. Thanks as well to Marie Beaugureau, Ben Lorica, Mike Loukides, and Laurel Ruma for believing in this project and helping me define its scope. Thanks to Matt Hacker and all of the Atlas team for answering all my technical questions regarding formatting, asciidoc, and LaTeX, and thanks to Rachel Monaghan, Nick Adams, and all of the production team for their final review and their hundreds of corrections.

Last but not least, I am infinitely grateful to my beloved wife, Emmanuelle, and to our three wonderful kids, Alexandre, Rémi, and Gabrielle, for encouraging me to work hard on this book, asking many questions (who said you can't teach neural networks to a seven-year-old?), and even bringing me cookies and coffee. What more can one dream of?

1. Available on Hinton's homepage at http://www.cs.toronto.edu/~hinton/.

2. Despite the fact that Yann LeCun's deep convolutional neural networks had worked well for image recognition since the 1990s, although they were not as general purpose.


Part I. The Fundamentals of Machine Learning

Chapter 1. The Machine Learning Landscape

When most people hear "Machine Learning," they picture a robot: a dependable butler or a deadly Terminator, depending on who you ask. But Machine Learning is not just a futuristic fantasy; it's already here. In fact, it has been around for decades in some specialized applications, such as Optical Character Recognition (OCR). But the first ML application that really became mainstream, improving the lives of hundreds of millions of people, took over the world back in the 1990s: it was the spam filter. Not exactly a self-aware Skynet, but it does technically qualify as Machine Learning (it has actually learned so well that you seldom need to flag an email as spam anymore). It was followed by hundreds of ML applications that now quietly power hundreds of products and features that you use regularly, from better recommendations to voice search.

Where does Machine Learning start and where does it end? What exactly does it mean for a machine to learn something? If I download a copy of Wikipedia, has my computer really "learned" something? Is it suddenly smarter? In this chapter we will start by clarifying what Machine Learning is and why you may want to use it.

Then, before we set out to explore the Machine Learning continent, we will take a look at the map and learn about the main regions and the most notable landmarks: supervised versus unsupervised learning, online versus batch learning, instance-based versus model-based learning. Then we will look at the workflow of a typical ML project, discuss the main challenges you may face, and cover how to evaluate and fine-tune a Machine Learning system.

This chapter introduces a lot of fundamental concepts (and jargon) that every data scientist should know by heart. It will be a high-level overview (the only chapter without much code), all rather simple, but you should make sure everything is crystal clear to you before continuing to the rest of the book. So grab a coffee and let's get started!

TIP: If you already know all the Machine Learning basics, you may want to skip directly to Chapter 2. If you are not sure, try to answer all the questions listed at the end of the chapter before moving on.

What Is Machine Learning?

Machine Learning is the science (and art) of programming computers so they can learn from data.

Here is a slightly more general definition:

[Machine Learning is the] field of study that gives computers the ability to learn without being explicitly programmed.
Arthur Samuel, 1959

And a more engineering-oriented one:

A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.
Tom Mitchell, 1997

For example, your spam filter is a Machine Learning program that can learn to flag spam given examples of spam emails (e.g., flagged by users) and examples of regular (nonspam, also called "ham") emails. The examples that the system uses to learn are called the training set. Each training example is called a training instance (or sample). In this case, the task T is to flag spam for new emails, the experience E is the training data, and the performance measure P needs to be defined; for example, you can use the ratio of correctly classified emails. This particular performance measure is called accuracy and it is often used in classification tasks.
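
As a quick illustration of this performance measure, here is a minimal sketch; the labels and predictions below are made up, and accuracy is simply the fraction of emails the filter got right:

# A toy illustration of accuracy (the labels and predictions are made up).
y_true = ["spam", "ham", "ham", "spam", "ham"]   # what each email really is
y_pred = ["spam", "ham", "spam", "spam", "ham"]  # what the filter predicted

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)
print(accuracy)  # 0.8, i.e., 80% of the emails were classified correctly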

If you just download a copy of Wikipedia, your computer has a lot more data, but it is not suddenly better at any task. Thus, it is not Machine Learning.

Why Use Machine Learning?

Consider how you would write a spam filter using traditional programming techniques (Figure 1-1):
1. First you would look at what spam typically looks like. You might notice that some words or phrases (such as "4U," "credit card," "free," and "amazing") tend to come up a lot in the subject. Perhaps you would also notice a few other patterns in the sender's name, the email's body, and so on.
2. You would write a detection algorithm for each of the patterns that you noticed, and your program would flag emails as spam if a number of these patterns are detected.
3. You would test your program, and repeat steps 1 and 2 until it is good enough.

Figure 1-1. The traditional approach

Since the problem is not trivial, your program will likely become a long list of complex rules—pretty hard to maintain.

In contrast, a spam filter based on Machine Learning techniques automatically learns which words and phrases are good predictors of spam by detecting unusually frequent patterns of words in the spam examples compared to the ham examples (Figure 1-2). The program is much shorter, easier to maintain, and most likely more accurate.

Figure 1-2. Machine Learning approach

Moreover, if spammers notice that all their emails containing "4U" are blocked, they might start writing "For U" instead. A spam filter using traditional programming techniques would need to be updated to flag "For U" emails. If spammers keep working around your spam filter, you will need to keep writing new rules forever.

In contrast, a spam filter based on Machine Learning techniques automatically notices that "For U" has become unusually frequent in spam flagged by users, and it starts flagging them without your intervention (Figure 1-3).

Figure 1-3. Automatically adapting to change

Another area where Machine Learning shines is for problems that either are too complex for traditional approaches or have no known algorithm. For example, consider speech recognition: say you want to start simple and write a program capable of distinguishing the words "one" and "two." You might notice that the word "two" starts with a high-pitch sound ("T"), so you could hardcode an algorithm that measures high-pitch sound intensity and use that to distinguish ones and twos. Obviously this technique will not scale to thousands of words spoken by millions of very different people in noisy environments and in dozens of languages. The best solution (at least today) is to write an algorithm that learns by itself, given many example recordings for each word.

Finally, Machine Learning can help humans learn (Figure 1-4): ML algorithms can be inspected to see what they have learned (although for some algorithms this can be tricky). For instance, once the spam filter has been trained on enough spam, it can easily be inspected to reveal the list of words and combinations of words that it believes are the best predictors of spam. Sometimes this will reveal unsuspected correlations or new trends, and thereby lead to a better understanding of the problem.

Applying ML techniques to dig into large amounts of data can help discover patterns that were not immediately apparent. This is called data mining.

Figure 1-4. Machine Learning can help humans learn

To summarize, Machine Learning is great for:
Problems for which existing solutions require a lot of hand-tuning or long lists of rules: one Machine Learning algorithm can often simplify code and perform better.
Complex problems for which there is no good solution at all using a traditional approach: the best Machine Learning techniques can find a solution.
Fluctuating environments: a Machine Learning system can adapt to new data.
Getting insights about complex problems and large amounts of data.

Types of Machine Learning Systems

There are so many different types of Machine Learning systems that it is useful to classify them in broad categories based on:

Whether or not they are trained with human supervision (supervised, unsupervised, semisupervised, and Reinforcement Learning)

Whether or not they can learn incrementally on the fly (online versus batch learning)

Whether they work by simply comparing new data points to known data points, or instead detect patterns in the training data and build a predictive model, much like scientists do (instance-based versus model-based learning)

These criteria are not exclusive; you can combine them in any way you like. For example, a state-of-the-art spam filter may learn on the fly using a deep neural network model trained using examples of spam and ham; this makes it an online, model-based, supervised learning system.

Let's look at each of these criteria a bit more closely.

Supervised/Unsupervised Learning

Machine Learning systems can be classified according to the amount and type of supervision they get during training. There are four major categories: supervised learning, unsupervised learning, semisupervised learning, and Reinforcement Learning.

Supervised learning

In supervised learning, the training data you feed to the algorithm includes the desired solutions, called labels (Figure 1-5).

Figure 1-5. A labeled training set for supervised learning (e.g., spam classification)

A typical supervised learning task is classification. The spam filter is a good example of this: it is trained with many example emails along with their class (spam or ham), and it must learn how to classify new emails.

Another typical task is to predict a target numeric value, such as the price of a car, given a set of features (mileage, age, brand, etc.) called predictors. This sort of task is called regression (Figure 1-6).1 To train the system, you need to give it many examples of cars, including both their predictors and their labels (i.e., their prices).

NOTE: In Machine Learning an attribute is a data type (e.g., "Mileage"), while a feature has several meanings depending on the context, but generally means an attribute plus its value (e.g., "Mileage = 15,000"). Many people use the words attribute and feature interchangeably, though.

Figure 1-6. Regression

Note that some regression algorithms can be used for classification as well, and vice versa. For example, Logistic Regression is commonly used for classification, as it can output a value that corresponds to the probability of belonging to a given class (e.g., 20% chance of being spam).
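
As a quick illustration, here is a minimal sketch (the single-feature toy dataset is made up) of how Scikit-Learn's LogisticRegression can return class probabilities rather than just a class label:

# Minimal sketch: Logistic Regression returning class probabilities (made-up toy data).
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])  # one feature per instance
y = np.array([0, 0, 0, 1, 1, 1])                          # 0 = ham, 1 = spam, say

clf = LogisticRegression()
clf.fit(X, y)

print(clf.predict([[3.5]]))        # predicted class for a new instance
print(clf.predict_proba([[3.5]]))  # estimated probability of each class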

Here are some of the most important supervised learning algorithms (covered in this book):
k-Nearest Neighbors
Linear Regression
Logistic Regression
Support Vector Machines (SVMs)
Decision Trees and Random Forests
Neural networks2

Unsupervised learning

In unsupervised learning, as you might guess, the training data is unlabeled (Figure 1-7). The system tries to learn without a teacher.

Figure 1-7. An unlabeled training set for unsupervised learning

Here are some of the most important unsupervised learning algorithms (we will cover dimensionality reduction in Chapter 8):

Clustering:
k-Means
Hierarchical Cluster Analysis (HCA)
Expectation Maximization

Visualization and dimensionality reduction:
Principal Component Analysis (PCA)
Kernel PCA
Locally-Linear Embedding (LLE)
t-distributed Stochastic Neighbor Embedding (t-SNE)

Association rule learning:
Apriori
Eclat

For example, say you have a lot of data about your blog's visitors. You may want to run a clustering algorithm to try to detect groups of similar visitors (Figure 1-8). At no point do you tell the algorithm which group a visitor belongs to: it finds those connections without your help. For example, it might notice that 40% of your visitors are males who love comic books and generally read your blog in the evening, while 20% are young sci-fi lovers who visit during the weekends, and so on. If you use a hierarchical clustering algorithm, it may also subdivide each group into smaller groups. This may help you target your posts for each group.

Figure 1-8. Clustering
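
To make this concrete, here is a minimal clustering sketch using Scikit-Learn's KMeans; the visitor features (age and usual visit hour) are invented for the example:

# Minimal clustering sketch (the visitor data is made up for illustration).
import numpy as np
from sklearn.cluster import KMeans

# Two features per visitor: [age, usual visit hour]
visitors = np.array([[25, 21], [31, 22], [24, 20],
                     [17, 14], [19, 15], [18, 16]])

kmeans = KMeans(n_clusters=2, random_state=42)
labels = kmeans.fit_predict(visitors)   # cluster index assigned to each visitor
print(labels)                           # e.g., [0 0 0 1 1 1] (cluster numbers may vary)
print(kmeans.cluster_centers_)          # the center of each detected group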

Visualization algorithms are also good examples of unsupervised learning algorithms: you feed them a lot of complex and unlabeled data, and they output a 2D or 3D representation of your data that can easily be plotted (Figure 1-9). These algorithms try to preserve as much structure as they can (e.g., trying to keep separate clusters in the input space from overlapping in the visualization), so you can understand how the data is organized and perhaps identify unsuspected patterns.

Figure 1-9. Example of a t-SNE visualization highlighting semantic clusters3

A related task is dimensionality reduction, in which the goal is to simplify the data without losing too much information. One way to do this is to merge several correlated features into one. For example, a car's mileage may be very correlated with its age, so the dimensionality reduction algorithm will merge them into one feature that represents the car's wear and tear. This is called feature extraction.

TIP: It is often a good idea to try to reduce the dimension of your training data using a dimensionality reduction algorithm before you feed it to another Machine Learning algorithm (such as a supervised learning algorithm). It will run much faster, the data will take up less disk and memory space, and in some cases it may also perform better.
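
As a minimal sketch of this tip, the following code (on random data with one deliberately redundant feature) uses PCA to project 10 features down to 3 before any further training:

# Minimal dimensionality reduction sketch with PCA (random data, for illustration only).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(42)
X = rng.rand(100, 10)                                     # 100 instances, 10 features
X[:, 1] = 2 * X[:, 0] + rng.normal(scale=0.01, size=100)  # make two features highly correlated

pca = PCA(n_components=3)        # keep only 3 components
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                 # (100, 3)
print(pca.explained_variance_ratio_)   # variance preserved by each component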

Yet another important unsupervised task is anomaly detection—for example, detecting unusual credit card transactions to prevent fraud, catching manufacturing defects, or automatically removing outliers from a dataset before feeding it to another learning algorithm. The system is trained with normal instances, and when it sees a new instance it can tell whether it looks like a normal one or whether it is likely an anomaly (see Figure 1-10).

Figure 1-10. Anomaly detection
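
One way (among several) to implement this idea in Scikit-Learn is an Isolation Forest; the sketch below trains it on made-up "normal" transaction amounts and then checks two new amounts:

# Minimal anomaly detection sketch using an Isolation Forest (one possible approach;
# the transaction amounts are made up for the example).
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
normal_amounts = rng.normal(loc=50, scale=10, size=(200, 1))  # "normal" transactions

detector = IsolationForest(contamination=0.01, random_state=42)
detector.fit(normal_amounts)

print(detector.predict([[52.0]]))   #  1 means "looks like a normal instance"
print(detector.predict([[500.0]]))  # -1 means "likely an anomaly"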

Finally, another common unsupervised task is association rule learning, in which the goal is to dig into large amounts of data and discover interesting relations between attributes. For example, suppose you own a supermarket. Running an association rule on your sales logs may reveal that people who purchase barbecue sauce and potato chips also tend to buy steak. Thus, you may want to place these items close to each other.

Semisupervised learning

Some algorithms can deal with partially labeled training data, usually a lot of unlabeled data and a little bit of labeled data. This is called semisupervised learning (Figure 1-11).

Some photo-hosting services, such as Google Photos, are good examples of this. Once you upload all your family photos to the service, it automatically recognizes that the same person A shows up in photos 1, 5, and 11, while another person B shows up in photos 2, 5, and 7. This is the unsupervised part of the algorithm (clustering). Now all the system needs is for you to tell it who these people are. Just one label per person,4 and it is able to name everyone in every photo, which is useful for searching photos.

Figure 1-11. Semisupervised learning

Most semisupervised learning algorithms are combinations of unsupervised and supervised algorithms. For example, deep belief networks (DBNs) are based on unsupervised components called restricted Boltzmann machines (RBMs) stacked on top of one another. RBMs are trained sequentially in an unsupervised manner, and then the whole system is fine-tuned using supervised learning techniques.

Reinforcement Learning

Reinforcement Learning is a very different beast. The learning system, called an agent in this context, can observe the environment, select and perform actions, and get rewards in return (or penalties in the form of negative rewards, as in Figure 1-12). It must then learn by itself what is the best strategy, called a policy, to get the most reward over time. A policy defines what action the agent should choose when it is in a given situation.

Figure 1-12. Reinforcement Learning

For example, many robots implement Reinforcement Learning algorithms to learn how to walk. DeepMind's AlphaGo program is also a good example of Reinforcement Learning: it made the headlines in March 2016 when it beat the world champion Lee Sedol at the game of Go. It learned its winning policy by analyzing millions of games, and then playing many games against itself. Note that learning was turned off during the games against the champion; AlphaGo was just applying the policy it had learned.

Batch and Online Learning

Another criterion used to classify Machine Learning systems is whether or not the system can learn incrementally from a stream of incoming data.

Batch learning

In batch learning, the system is incapable of learning incrementally: it must be trained using all the available data. This will generally take a lot of time and computing resources, so it is typically done offline. First the system is trained, and then it is launched into production and runs without learning anymore; it just applies what it has learned. This is called offline learning.

If you want a batch learning system to know about new data (such as a new type of spam), you need to train a new version of the system from scratch on the full dataset (not just the new data, but also the old data), then stop the old system and replace it with the new one.

Fortunately, the whole process of training, evaluating, and launching a Machine Learning system can be automated fairly easily (as shown in Figure 1-3), so even a batch learning system can adapt to change. Simply update the data and train a new version of the system from scratch as often as needed.

This solution is simple and often works fine, but training using the full set of data can take many hours, so you would typically train a new system only every 24 hours or even just weekly. If your system needs to adapt to rapidly changing data (e.g., to predict stock prices), then you need a more reactive solution.

Also, training on the full set of data requires a lot of computing resources (CPU, memory space, disk space, disk I/O, network I/O, etc.). If you have a lot of data and you automate your system to train from scratch every day, it will end up costing you a lot of money. If the amount of data is huge, it may even be impossible to use a batch learning algorithm.

Finally, if your system needs to be able to learn autonomously and it has limited resources (e.g., a smartphone application or a rover on Mars), then carrying around large amounts of training data and taking up a lot of resources to train for hours every day is a showstopper.

Fortunately, a better option in all these cases is to use algorithms that are capable of learning incrementally.

Online learning

In online learning, you train the system incrementally by feeding it data instances sequentially, either individually or by small groups called mini-batches. Each learning step is fast and cheap, so the system can learn about new data on the fly, as it arrives (see Figure 1-13).

Figure 1-13. Online learning

Online learning is great for systems that receive data as a continuous flow (e.g., stock prices) and need to adapt to change rapidly or autonomously. It is also a good option if you have limited computing resources: once an online learning system has learned about new data instances, it does not need them anymore, so you can discard them (unless you want to be able to roll back to a previous state and "replay" the data). This can save a huge amount of space.

Online learning algorithms can also be used to train systems on huge datasets that cannot fit in one machine's main memory (this is called out-of-core learning). The algorithm loads part of the data, runs a training step on that data, and repeats the process until it has run on all of the data (see Figure 1-14).

WARNING: This whole process is usually done offline (i.e., not on the live system), so online learning can be a confusing name. Think of it as incremental learning.

Figure 1-14. Using online learning to handle huge datasets
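
Here is a minimal incremental learning sketch: an SGD-based linear model is fed made-up mini-batches one at a time via partial_fit, standing in for a stream of incoming data:

# Minimal online learning sketch: one cheap learning step per mini-batch (synthetic data).
import numpy as np
from sklearn.linear_model import SGDRegressor

model = SGDRegressor(learning_rate="constant", eta0=0.01, random_state=42)

rng = np.random.RandomState(42)
for step in range(1000):                  # pretend each iteration brings a new mini-batch
    X_batch = rng.rand(32, 1)             # 32 instances, 1 feature
    y_batch = 3 * X_batch.ravel() + 4 + rng.normal(scale=0.1, size=32)
    model.partial_fit(X_batch, y_batch)   # incremental update; old batches can be discarded

print(model.coef_, model.intercept_)      # should end up close to 3 and 4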

One important parameter of online learning systems is how fast they should adapt to changing data: this is called the learning rate. If you set a high learning rate, then your system will rapidly adapt to new data, but it will also tend to quickly forget the old data (you don't want a spam filter to flag only the latest kinds of spam it was shown). Conversely, if you set a low learning rate, the system will have more inertia; that is, it will learn more slowly, but it will also be less sensitive to noise in the new data or to sequences of nonrepresentative data points.

A big challenge with online learning is that if bad data is fed to the system, the system's performance will gradually decline. If we are talking about a live system, your clients will notice. For example, bad data could come from a malfunctioning sensor on a robot, or from someone spamming a search engine to try to rank high in search results. To reduce this risk, you need to monitor your system closely and promptly switch learning off (and possibly revert to a previously working state) if you detect a drop in performance. You may also want to monitor the input data and react to abnormal data (e.g., using an anomaly detection algorithm).

Instance-Based Versus Model-Based Learning

One more way to categorize Machine Learning systems is by how they generalize. Most Machine Learning tasks are about making predictions. This means that given a number of training examples, the system needs to be able to generalize to examples it has never seen before. Having a good performance measure on the training data is good, but insufficient; the true goal is to perform well on new instances.

There are two main approaches to generalization: instance-based learning and model-based learning.

Instance-based learning

Possibly the most trivial form of learning is simply to learn by heart. If you were to create a spam filter this way, it would just flag all emails that are identical to emails that have already been flagged by users—not the worst solution, but certainly not the best.

Instead of just flagging emails that are identical to known spam emails, your spam filter could be programmed to also flag emails that are very similar to known spam emails. This requires a measure of similarity between two emails. A (very basic) similarity measure between two emails could be to count the number of words they have in common. The system would flag an email as spam if it has many words in common with a known spam email.
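
Here is a minimal sketch of that (very basic) similarity measure, counting the distinct words two emails share; the example emails are made up:

# Minimal sketch of the word-overlap similarity measure described above.
def common_word_count(email_a, email_b):
    """Return how many distinct words the two emails have in common."""
    return len(set(email_a.lower().split()) & set(email_b.lower().split()))

known_spam = "win a free credit card prize now"
new_email = "claim your free prize now"
print(common_word_count(known_spam, new_email))  # 3 ("free", "prize", "now")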

This is called instance-based learning: the system learns the examples by heart, then generalizes to new cases using a similarity measure (Figure 1-15).

Figure 1-15. Instance-based learning

Model-based learning

Another way to generalize from a set of examples is to build a model of these examples, then use that model to make predictions. This is called model-based learning (Figure 1-16).

Figure 1-16. Model-based learning

For example, suppose you want to know if money makes people happy, so you download the Better Life Index data from the OECD's website as well as stats about GDP per capita from the IMF's website. Then you join the tables and sort by GDP per capita. Table 1-1 shows an excerpt of what you get.

Table 1-1. Does money make people happier?

Country          GDP per capita (USD)   Life satisfaction
Hungary          12,240                 4.9
Korea            27,195                 5.8
France           37,675                 6.5
Australia        50,962                 7.3
United States    55,805                 7.2

Let's plot the data for a few random countries (Figure 1-17).

Figure 1-17. Do you see a trend here?

There does seem to be a trend here! Although the data is noisy (i.e., partly random), it looks like life satisfaction goes up more or less linearly as the country's GDP per capita increases. So you decide to model life satisfaction as a linear function of GDP per capita. This step is called model selection: you selected a linear model of life satisfaction with just one attribute, GDP per capita (Equation 1-1).

Equation 1-1. A simple linear model

life_satisfaction = θ0 + θ1 × GDP_per_capita

This model has two model parameters, θ0 and θ1.5 By tweaking these parameters, you can make your model represent any linear function, as shown in Figure 1-18.

Figure 1-18. A few possible linear models

Before you can use your model, you need to define the parameter values θ0 and θ1. How can you know which values will make your model perform best? To answer this question, you need to specify a performance measure. You can either define a utility function (or fitness function) that measures how good your model is, or you can define a cost function that measures how bad it is. For linear regression problems, people typically use a cost function that measures the distance between the linear model's predictions and the training examples; the objective is to minimize this distance.

This is where the Linear Regression algorithm comes in: you feed it your training examples and it finds the parameters that make the linear model fit best to your data. This is called training the model. In our case the algorithm finds that the optimal parameter values are θ0 = 4.85 and θ1 = 4.91 × 10^-5.

Now the model fits the training data as closely as possible (for a linear model), as you can see in Figure 1-19.

Figure 1-19. The linear model that fits the training data best

You are finally ready to run the model to make predictions. For example, say you want to know how happy Cypriots are, and the OECD data does not have the answer. Fortunately, you can use your model to make a good prediction: you look up Cyprus's GDP per capita, find $22,587, and then apply your model and find that life satisfaction is likely to be somewhere around 4.85 + 22,587 × 4.91 × 10^-5 = 5.96.

To whet your appetite, Example 1-1 shows the Python code that loads the data, prepares it,6 creates a scatterplot for visualization, and then trains a linear model and makes a prediction.7

Example 1-1. Training and running a linear model using Scikit-Learn

import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn.linear_model

# Load the data
oecd_bli = pd.read_csv("oecd_bli_2015.csv", thousands=',')
gdp_per_capita = pd.read_csv("gdp_per_capita.csv", thousands=',', delimiter='\t',
                             encoding='latin1', na_values="n/a")

# Prepare the data
country_stats = prepare_country_stats(oecd_bli, gdp_per_capita)
X = np.c_[country_stats["GDP per capita"]]
y = np.c_[country_stats["Life satisfaction"]]

# Visualize the data
country_stats.plot(kind='scatter', x="GDP per capita", y='Life satisfaction')
plt.show()

# Select a linear model
model = sklearn.linear_model.LinearRegression()

# Train the model
model.fit(X, y)

# Make a prediction for Cyprus
X_new = [[22587]]  # Cyprus' GDP per capita
print(model.predict(X_new))  # outputs [[5.96242338]]

NOTE: If you had used an instance-based learning algorithm instead, you would have found that Slovenia has the closest GDP per capita to that of Cyprus ($20,732), and since the OECD data tells us that Slovenians' life satisfaction is 5.7, you would have predicted a life satisfaction of 5.7 for Cyprus. If you zoom out a bit and look at the two next closest countries, you will find Portugal and Spain with life satisfactions of 5.1 and 6.5, respectively. Averaging these three values, you get 5.77, which is pretty close to your model-based prediction. This simple algorithm is called k-Nearest Neighbors regression (in this example, k = 3).

Replacing the Linear Regression model with k-Nearest Neighbors regression in the previous code is as simple as replacing this line:

model = sklearn.linear_model.LinearRegression()

with this one (after importing sklearn.neighbors):

model = sklearn.neighbors.KNeighborsRegressor(n_neighbors=3)

If all went well, your model will make good predictions. If not, you may need to use more attributes (employment rate, health, air pollution, etc.), get more or better quality training data, or perhaps select a more powerful model (e.g., a Polynomial Regression model).

In summary:
You studied the data.
You selected a model.
You trained it on the training data (i.e., the learning algorithm searched for the model parameter values that minimize a cost function).
Finally, you applied the model to make predictions on new cases (this is called inference), hoping that this model will generalize well.

This is what a typical Machine Learning project looks like. In Chapter 2 you will experience this first-hand by going through an end-to-end project.

We have covered a lot of ground so far: you now know what Machine Learning is really about, why it is useful, what some of the most common categories of ML systems are, and what a typical project workflow looks like. Now let's look at what can go wrong in learning and prevent you from making accurate predictions.

Main Challenges of Machine Learning

In short, since your main task is to select a learning algorithm and train it on some data, the two things that can go wrong are "bad algorithm" and "bad data." Let's start with examples of bad data.

Insufficient Quantity of Training Data

For a toddler to learn what an apple is, all it takes is for you to point to an apple and say "apple" (possibly repeating this procedure a few times). Now the child is able to recognize apples in all sorts of colors and shapes. Genius.

Machine Learning is not quite there yet; it takes a lot of data for most Machine Learning algorithms to work properly. Even for very simple problems you typically need thousands of examples, and for complex problems such as image or speech recognition you may need millions of examples (unless you can reuse parts of an existing model).

THE UNREASONABLE EFFECTIVENESS OF DATA

In a famous paper published in 2001, Microsoft researchers Michele Banko and Eric Brill showed that very different Machine Learning algorithms, including fairly simple ones, performed almost identically well on a complex problem of natural language disambiguation8 once they were given enough data (as you can see in Figure 1-20).

Figure 1-20. The importance of data versus algorithms9

As the authors put it: "these results suggest that we may want to reconsider the trade-off between spending time and money on algorithm development versus spending it on corpus development."

The idea that data matters more than algorithms for complex problems was further popularized by Peter Norvig et al. in a paper titled "The Unreasonable Effectiveness of Data" published in 2009.10 It should be noted, however, that small- and medium-sized datasets are still very common, and it is not always easy or cheap to get extra training data, so don't abandon algorithms just yet.

Nonrepresentative Training Data

In order to generalize well, it is crucial that your training data be representative of the new cases you want to generalize to. This is true whether you use instance-based learning or model-based learning.

For example, the set of countries we used earlier for training the linear model was not perfectly representative; a few countries were missing. Figure 1-21 shows what the data looks like when you add the missing countries.

Figure 1-21. A more representative training sample

If you train a linear model on this data, you get the solid line, while the old model is represented by the dotted line. As you can see, not only does adding a few missing countries significantly alter the model, but it makes it clear that such a simple linear model is probably never going to work well. It seems that very rich countries are not happier than moderately rich countries (in fact they seem unhappier), and conversely some poor countries seem happier than many rich countries.

By using a nonrepresentative training set, we trained a model that is unlikely to make accurate predictions, especially for very poor and very rich countries.

It is crucial to use a training set that is representative of the cases you want to generalize to. This is often harder than it sounds: if the sample is too small, you will have sampling noise (i.e., nonrepresentative data as a result of chance), but even very large samples can be nonrepresentative if the sampling method is flawed. This is called sampling bias.

A FAMOUS EXAMPLE OF SAMPLING BIAS

Perhaps the most famous example of sampling bias happened during the US presidential election in 1936, which pitted Landon against Roosevelt: the Literary Digest conducted a very large poll, sending mail to about 10 million people. It got 2.4 million answers, and predicted with high confidence that Landon would get 57% of the votes. Instead, Roosevelt won with 62% of the votes. The flaw was in the Literary Digest's sampling method:

First, to obtain the addresses to send the polls to, the Literary Digest used telephone directories, lists of magazine subscribers, club membership lists, and the like. All of these lists tend to favor wealthier people, who are more likely to vote Republican (hence Landon).

Second, less than 25% of the people who received the poll answered. Again, this introduces a sampling bias, by ruling out people who don't care much about politics, people who don't like the Literary Digest, and other key groups. This is a special type of sampling bias called nonresponse bias.

Here is another example: say you want to build a system to recognize funk music videos. One way to build your training set is to search "funk music" on YouTube and use the resulting videos. But this assumes that YouTube's search engine returns a set of videos that are representative of all the funk music videos on YouTube. In reality, the search results are likely to be biased toward popular artists (and if you live in Brazil you will get a lot of "funk carioca" videos, which sound nothing like James Brown). On the other hand, how else can you get a large training set?

Poor-Quality Data

Obviously, if your training data is full of errors, outliers, and noise (e.g., due to poor-quality measurements), it will make it harder for the system to detect the underlying patterns, so your system is less likely to perform well. It is often well worth the effort to spend time cleaning up your training data. The truth is, most data scientists spend a significant part of their time doing just that. For example:
If some instances are clearly outliers, it may help to simply discard them or try to fix the errors manually.
If some instances are missing a few features (e.g., 5% of your customers did not specify their age), you must decide whether you want to ignore this attribute altogether, ignore these instances, fill in the missing values (e.g., with the median age, as in the short sketch after this list), or train one model with the feature and one model without it, and so on.
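
As a minimal sketch of one of these options, the following code fills missing ages with the median age using Pandas (the small customer table is made up for the example):

# Minimal sketch: filling missing values with the median (made-up customer data).
import numpy as np
import pandas as pd

customers = pd.DataFrame({"age": [34, 27, np.nan, 45, np.nan, 52],
                          "income": [40000, 32000, 51000, 62000, 28000, 70000]})

median_age = customers["age"].median()
customers["age"] = customers["age"].fillna(median_age)  # fill in the missing ages
print(customers)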

Irrelevant Features

As the saying goes: garbage in, garbage out. Your system will only be capable of learning if the training data contains enough relevant features and not too many irrelevant ones. A critical part of the success of a Machine Learning project is coming up with a good set of features to train on. This process, called feature engineering, involves:
Feature selection: selecting the most useful features to train on among existing features.
Feature extraction: combining existing features to produce a more useful one (as we saw earlier, dimensionality reduction algorithms can help).
Creating new features by gathering new data.

Now that we have looked at many examples of bad data, let's look at a couple of examples of bad algorithms.

Overfitting the Training Data

Say you are visiting a foreign country and the taxi driver rips you off. You might be tempted to say that all taxi drivers in that country are thieves. Overgeneralizing is something that we humans do all too often, and unfortunately machines can fall into the same trap if we are not careful. In Machine Learning this is called overfitting: it means that the model performs well on the training data, but it does not generalize well.

Figure 1-22 shows an example of a high-degree polynomial life satisfaction model that strongly overfits the training data. Even though it performs much better on the training data than the simple linear model, would you really trust its predictions?

Figure 1-22. Overfitting the training data

Complex models such as deep neural networks can detect subtle patterns in the data, but if the training set is noisy, or if it is too small (which introduces sampling noise), then the model is likely to detect patterns in the noise itself. Obviously these patterns will not generalize to new instances. For example, say you feed your life satisfaction model many more attributes, including uninformative ones such as the country's name. In that case, a complex model may detect patterns like the fact that all countries in the training data with a w in their name have a life satisfaction greater than 7: New Zealand (7.3), Norway (7.4), Sweden (7.2), and Switzerland (7.5). How confident are you that the W-satisfaction rule generalizes to Rwanda or Zimbabwe? Obviously this pattern occurred in the training data by pure chance, but the model has no way to tell whether a pattern is real or simply the result of noise in the data.

WARNING: Overfitting happens when the model is too complex relative to the amount and noisiness of the training data. The possible solutions are:
To simplify the model by selecting one with fewer parameters (e.g., a linear model rather than a high-degree polynomial model), by reducing the number of attributes in the training data, or by constraining the model
To gather more training data
To reduce the noise in the training data (e.g., fix data errors and remove outliers)

Constraining a model to make it simpler and reduce the risk of overfitting is called regularization. For example, the linear model we defined earlier has two parameters, θ0 and θ1. This gives the learning algorithm two degrees of freedom to adapt the model to the training data: it can tweak both the height (θ0) and the slope (θ1) of the line. If we forced θ1 = 0, the algorithm would have only one degree of freedom and would have a much harder time fitting the data properly: all it could do is move the line up or down to get as close as possible to the training instances, so it would end up around the mean. A very simple model indeed! If we allow the algorithm to modify θ1 but we force it to keep it small, then the learning algorithm will effectively have somewhere in between one and two degrees of freedom. It will produce a simpler model than with two degrees of freedom, but more complex than with just one. You want to find the right balance between fitting the data perfectly and keeping the model simple enough to ensure that it will generalize well.
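
To make this concrete, here is a minimal sketch on made-up noisy data: Ridge regression is one common regularized linear model, and its alpha hyperparameter controls how strongly the slope is constrained (a plain Linear Regression model is shown for comparison):

# Minimal regularization sketch (made-up noisy linear data).
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.RandomState(42)
X = rng.rand(20, 1)
y = 4 + 3 * X.ravel() + rng.normal(scale=1.0, size=20)

lin_reg = LinearRegression().fit(X, y)
ridge_reg = Ridge(alpha=10.0).fit(X, y)   # larger alpha = stronger constraint on the slope

print(lin_reg.coef_)    # unconstrained slope
print(ridge_reg.coef_)  # smaller slope, thanks to regularization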

Figure 1-23 shows three models: the dotted line represents the original model that was trained with a few countries missing, the dashed line is our second model trained with all countries, and the solid line is a linear model trained with the same data as the first model but with a regularization constraint. You can see that regularization forced the model to have a smaller slope, which fits the training data it was trained on a bit less well, but actually allows it to generalize better to new examples.

Figure 1-23. Regularization reduces the risk of overfitting

The amount of regularization to apply during learning can be controlled by a hyperparameter. A hyperparameter is a parameter of a learning algorithm (not of the model). As such, it is not affected by the learning algorithm itself; it must be set prior to training and remains constant during training. If you set the regularization hyperparameter to a very large value, you will get an almost flat model (a slope close to zero); the learning algorithm will almost certainly not overfit the training data, but it will be less likely to find a good solution. Tuning hyperparameters is an important part of building a Machine Learning system (you will see a detailed example in the next chapter).

Underfitting the Training Data

As you might guess, underfitting is the opposite of overfitting: it occurs when your model is too simple to learn the underlying structure of the data. For example, a linear model of life satisfaction is prone to underfit; reality is just more complex than the model, so its predictions are bound to be inaccurate, even on the training examples.

The main options to fix this problem are:
Selecting a more powerful model, with more parameters
Feeding better features to the learning algorithm (feature engineering)
Reducing the constraints on the model (e.g., reducing the regularization hyperparameter)

Stepping Back

By now you already know a lot about Machine Learning. However, we went through so many concepts that you may be feeling a little lost, so let's step back and look at the big picture:

Machine Learning is about making machines get better at some task by learning from data, instead of having to explicitly code rules.

There are many different types of ML systems: supervised or not, batch or online, instance-based or model-based, and so on.

In an ML project you gather data in a training set, and you feed the training set to a learning algorithm. If the algorithm is model-based, it tunes some parameters to fit the model to the training set (i.e., to make good predictions on the training set itself), and then hopefully it will be able to make good predictions on new cases as well. If the algorithm is instance-based, it just learns the examples by heart and uses a similarity measure to generalize to new instances.

The system will not perform well if your training set is too small, or if the data is not representative, noisy, or polluted with irrelevant features (garbage in, garbage out). Lastly, your model needs to be neither too simple (in which case it will underfit) nor too complex (in which case it will overfit).

There's just one last important topic to cover: once you have trained a model, you don't want to just "hope" it generalizes to new cases. You want to evaluate it, and fine-tune it if necessary. Let's see how.

Testing and Validating

The only way to know how well a model will generalize to new cases is to actually try it out on new cases. One way to do that is to put your model in production and monitor how well it performs. This works well, but if your model is horribly bad, your users will complain—not the best idea.

A better option is to split your data into two sets: the training set and the test set. As these names imply, you train your model using the training set, and you test it using the test set. The error rate on new cases is called the generalization error (or out-of-sample error), and by evaluating your model on the test set, you get an estimation of this error. This value tells you how well your model will perform on instances it has never seen before.

If the training error is low (i.e., your model makes few mistakes on the training set) but the generalization error is high, it means that your model is overfitting the training data.

TIP: It is common to use 80% of the data for training and hold out 20% for testing.
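
In code, such a split is one line with Scikit-Learn's train_test_split; the data below is random and only there to show the resulting sizes:

# Minimal 80/20 train/test split sketch (random data, for illustration).
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(42)
X = rng.rand(100, 3)   # 100 instances, 3 features
y = rng.rand(100)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(len(X_train), len(X_test))  # 80 20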

So evaluating a model is simple enough: just use a test set. Now suppose you are hesitating between two models (say a linear model and a polynomial model): how can you decide? One option is to train both and compare how well they generalize using the test set.

Now suppose that the linear model generalizes better, but you want to apply some regularization to avoid overfitting. The question is: how do you choose the value of the regularization hyperparameter? One option is to train 100 different models using 100 different values for this hyperparameter. Suppose you find the best hyperparameter value that produces a model with the lowest generalization error, say just 5% error.

So you launch this model into production, but unfortunately it does not perform as well as expected and produces 15% errors. What just happened?

The problem is that you measured the generalization error multiple times on the test set, and you adapted the model and hyperparameters to produce the best model for that set. This means that the model is unlikely to perform as well on new data.

A common solution to this problem is to have a second holdout set called the validation set. You train multiple models with various hyperparameters using the training set, you select the model and hyperparameters that perform best on the validation set, and when you're happy with your model you run a single final test against the test set to get an estimate of the generalization error.

To avoid "wasting" too much training data in validation sets, a common technique is to use cross-validation: the training set is split into complementary subsets, and each model is trained against a different combination of these subsets and validated against the remaining parts. Once the model type and hyperparameters have been selected, a final model is trained using these hyperparameters on the full training set, and the generalization error is measured on the test set.
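
Here is a minimal cross-validation sketch (on made-up training data): each model is trained on part of the training set and validated on the held-out remainder, five times over, and the scores are then averaged:

# Minimal cross-validation sketch (synthetic training data).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(42)
X_train = rng.rand(100, 1)
y_train = 4 + 3 * X_train.ravel() + rng.normal(scale=0.5, size=100)

model = LinearRegression()
scores = cross_val_score(model, X_train, y_train, cv=5)  # 5-fold cross-validation
print(scores)         # one validation score per fold
print(scores.mean())  # averaged estimate of the model's performance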

NO FREE LUNCH THEOREM

A model is a simplified version of the observations. The simplifications are meant to discard the superfluous details that are unlikely to generalize to new instances. However, to decide what data to discard and what data to keep, you must make assumptions. For example, a linear model makes the assumption that the data is fundamentally linear and that the distance between the instances and the straight line is just noise, which can safely be ignored.

In a famous 1996 paper,11 David Wolpert demonstrated that if you make absolutely no assumption about the data, then there is no reason to prefer one model over any other. This is called the No Free Lunch (NFL) theorem. For some datasets the best model is a linear model, while for other datasets it is a neural network. There is no model that is a priori guaranteed to work better (hence the name of the theorem). The only way to know for sure which model is best is to evaluate them all. Since this is not possible, in practice you make some reasonable assumptions about the data and you evaluate only a few reasonable models. For example, for simple tasks you may evaluate linear models with various levels of regularization, and for a complex problem you may evaluate various neural networks.

Exercises

In this chapter we have covered some of the most important concepts in Machine Learning. In the next chapters we will dive deeper and write more code, but before we do, make sure you know how to answer the following questions:
1. How would you define Machine Learning?
2. Can you name four types of problems where it shines?
3. What is a labeled training set?
4. What are the two most common supervised tasks?
5. Can you name four common unsupervised tasks?
6. What type of Machine Learning algorithm would you use to allow a robot to walk in various unknown terrains?
7. What type of algorithm would you use to segment your customers into multiple groups?
8. Would you frame the problem of spam detection as a supervised learning problem or an unsupervised learning problem?
9. What is an online learning system?
10. What is out-of-core learning?
11. What type of learning algorithm relies on a similarity measure to make predictions?
12. What is the difference between a model parameter and a learning algorithm's hyperparameter?
13. What do model-based learning algorithms search for? What is the most common strategy they use to succeed? How do they make predictions?
14. Can you name four of the main challenges in Machine Learning?
15. If your model performs great on the training data but generalizes poorly to new instances, what is happening? Can you name three possible solutions?
16. What is a test set and why would you want to use it?
17. What is the purpose of a validation set?
18. What can go wrong if you tune hyperparameters using the test set?
19. What is cross-validation and why would you prefer it to a validation set?

Solutions to these exercises are available in Appendix A.

1. Fun fact: this odd-sounding name is a statistics term introduced by Francis Galton while he was studying the fact that the children of tall people tend to be shorter than their parents. Since children were shorter, he called this regression to the mean. This name was then applied to the methods he used to analyze correlations between variables.
2. Some neural network architectures can be unsupervised, such as autoencoders and restricted Boltzmann machines. They can also be semisupervised, such as in deep belief networks and unsupervised pretraining.
3. Notice how animals are rather well separated from vehicles, how horses are close to deer but far from birds, and so on. Figure reproduced with permission from Socher, Ganjoo, Manning, and Ng (2013), "T-SNE visualization of the semantic word space."
4. That's when the system works perfectly. In practice it often creates a few clusters per person, and sometimes mixes up two people who look alike, so you need to provide a few labels per person and manually clean up some clusters.
5. By convention, the Greek letter θ (theta) is frequently used to represent model parameters.
6. The code assumes that prepare_country_stats() is already defined: it merges the GDP and life satisfaction data into a single Pandas dataframe.
7. It's okay if you don't understand all the code yet; we will present Scikit-Learn in the following chapters.
8. For example, knowing whether to write "to," "two," or "too" depending on the context.
9. Figure reproduced with permission from Banko and Brill (2001), "Learning Curves for Confusion Set Disambiguation."
10. "The Unreasonable Effectiveness of Data," Peter Norvig et al. (2009).
11. "The Lack of A Priori Distinctions Between Learning Algorithms," D. Wolpert (1996).


Chapter2.End-to-EndMachineLearningProject

Inthischapter,youwillgothroughanexampleprojectendtoend,pretendingtobearecentlyhireddatascientistinarealestatecompany.1Herearethemainstepsyouwillgothrough:1. Lookatthebigpicture.

2. Getthedata.

3. Discoverandvisualizethedatatogaininsights.

4. PreparethedataforMachineLearningalgorithms.

5. Selectamodelandtrainit.

6. Fine-tuneyourmodel.

7. Presentyoursolution.

8. Launch,monitor,andmaintainyoursystem.

Working with Real Data

When you are learning about Machine Learning it is best to actually experiment with real-world data, not just artificial datasets. Fortunately, there are thousands of open datasets to choose from, ranging across all sorts of domains. Here are a few places you can look to get data:

Popular open data repositories:
  UC Irvine Machine Learning Repository
  Kaggle datasets
  Amazon's AWS datasets

Meta portals (they list open data repositories):
  http://dataportals.org/
  http://opendatamonitor.eu/
  http://quandl.com/

Other pages listing many popular open data repositories:
  Wikipedia's list of Machine Learning datasets
  Quora.com question
  Datasets subreddit

In this chapter we chose the California Housing Prices dataset from the StatLib repository2 (see Figure 2-1). This dataset was based on data from the 1990 California census. It is not exactly recent (you could still afford a nice house in the Bay Area at the time), but it has many qualities for learning, so we will pretend it is recent data. We also added a categorical attribute and removed a few features for teaching purposes.

Figure 2-1. California housing prices

Look at the Big Picture

Welcome to Machine Learning Housing Corporation! The first task you are asked to perform is to build a model of housing prices in California using the California census data. This data has metrics such as the population, median income, median housing price, and so on for each block group in California. Block groups are the smallest geographical unit for which the US Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people). We will just call them "districts" for short.

Your model should learn from this data and be able to predict the median housing price in any district, given all the other metrics.

TIP
Since you are a well-organized data scientist, the first thing you do is to pull out your Machine Learning project checklist. You can start with the one in Appendix B; it should work reasonably well for most Machine Learning projects but make sure to adapt it to your needs. In this chapter we will go through many checklist items, but we will also skip a few, either because they are self-explanatory or because they will be discussed in later chapters.

Frame the Problem

The first question to ask your boss is what exactly is the business objective; building a model is probably not the end goal. How does the company expect to use and benefit from this model? This is important because it will determine how you frame the problem, what algorithms you will select, what performance measure you will use to evaluate your model, and how much effort you should spend tweaking it.

Your boss answers that your model's output (a prediction of a district's median housing price) will be fed to another Machine Learning system (see Figure 2-2), along with many other signals.3 This downstream system will determine whether it is worth investing in a given area or not. Getting this right is critical, as it directly affects revenue.

Figure 2-2. A Machine Learning pipeline for real estate investments

PIPELINES

A sequence of data processing components is called a data pipeline. Pipelines are very common in Machine Learning systems, since there is a lot of data to manipulate and many data transformations to apply.

Components typically run asynchronously. Each component pulls in a large amount of data, processes it, and spits out the result in another data store, and then some time later the next component in the pipeline pulls this data and spits out its own output, and so on. Each component is fairly self-contained: the interface between components is simply the data store. This makes the system quite simple to grasp (with the help of a data flow graph), and different teams can focus on different components. Moreover, if a component breaks down, the downstream components can often continue to run normally (at least for a while) by just using the last output from the broken component. This makes the architecture quite robust.

On the other hand, a broken component can go unnoticed for some time if proper monitoring is not implemented. The data gets stale and the overall system's performance drops.

The next question to ask is what the current solution looks like (if any). It will often give you a reference performance, as well as insights on how to solve the problem. Your boss answers that the district housing prices are currently estimated manually by experts: a team gathers up-to-date information about a district, and when they cannot get the median housing price, they estimate it using complex rules.

This is costly and time-consuming, and their estimates are not great; in cases where they manage to find out the actual median housing price, they often realize that their estimates were off by more than 10%. This is why the company thinks that it would be useful to train a model to predict a district's median housing price given other data about that district. The census data looks like a great dataset to exploit for this purpose, since it includes the median housing prices of thousands of districts, as well as other data.

Okay, with all this information you are now ready to start designing your system. First, you need to frame the problem: is it supervised, unsupervised, or Reinforcement Learning? Is it a classification task, a regression task, or something else? Should you use batch learning or online learning techniques? Before you read on, pause and try to answer these questions for yourself.

Have you found the answers? Let's see: it is clearly a typical supervised learning task since you are given labeled training examples (each instance comes with the expected output, i.e., the district's median housing price). Moreover, it is also a typical regression task, since you are asked to predict a value. More specifically, this is a multivariate regression problem since the system will use multiple features to make a prediction (it will use the district's population, the median income, etc.). In the first chapter, you predicted life satisfaction based on just one feature, the GDP per capita, so it was a univariate regression problem. Finally, there is no continuous flow of data coming in the system, there is no particular need to adjust to changing data rapidly, and the data is small enough to fit in memory, so plain batch learning should do just fine.

TIP
If the data was huge, you could either split your batch learning work across multiple servers (using the MapReduce technique, as we will see later), or you could use an online learning technique instead.

Select a Performance Measure

Your next step is to select a performance measure. A typical performance measure for regression problems is the Root Mean Square Error (RMSE). It gives an idea of how much error the system typically makes in its predictions, with a higher weight for large errors. Equation 2-1 shows the mathematical formula to compute the RMSE.

Equation 2-1. Root Mean Square Error (RMSE)

\mathrm{RMSE}(\mathbf{X}, h) = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\bigl(h(\mathbf{x}^{(i)}) - y^{(i)}\bigr)^{2}}

NOTATIONS

This equation introduces several very common Machine Learning notations that we will use throughout this book:

m is the number of instances in the dataset you are measuring the RMSE on.

For example, if you are evaluating the RMSE on a validation set of 2,000 districts, then m = 2,000.

x^(i) is a vector of all the feature values (excluding the label) of the ith instance in the dataset, and y^(i) is its label (the desired output value for that instance).

For example, if the first district in the dataset is located at longitude –118.29°, latitude 33.91°, and it has 1,416 inhabitants with a median income of $38,372, and the median house value is $156,400 (ignoring the other features for now), then:

\mathbf{x}^{(1)} = \begin{pmatrix} -118.29 \\ 33.91 \\ 1{,}416 \\ 38{,}372 \end{pmatrix}

and:

y^{(1)} = 156{,}400

X is a matrix containing all the feature values (excluding labels) of all instances in the dataset. There is one row per instance and the ith row is equal to the transpose of x^(i), noted (x^(i))^T.4

For example, if the first district is as just described, then the matrix X looks like this:

\mathbf{X} = \begin{pmatrix} (\mathbf{x}^{(1)})^{T} \\ (\mathbf{x}^{(2)})^{T} \\ \vdots \\ (\mathbf{x}^{(2000)})^{T} \end{pmatrix} = \begin{pmatrix} -118.29 & 33.91 & 1{,}416 & 38{,}372 \\ \vdots & \vdots & \vdots & \vdots \end{pmatrix}

h is your system's prediction function, also called a hypothesis. When your system is given an instance's feature vector x^(i), it outputs a predicted value ŷ^(i) = h(x^(i)) for that instance (ŷ is pronounced "y-hat").

For example, if your system predicts that the median housing price in the first district is $158,400, then ŷ^(1) = h(x^(1)) = 158,400. The prediction error for this district is ŷ^(1) – y^(1) = 2,000.

RMSE(X, h) is the cost function measured on the set of examples using your hypothesis h.

We use lowercase italic font for scalar values (such as m or y^(i)) and function names (such as h), lowercase bold font for vectors (such as x^(i)), and uppercase bold font for matrices (such as X).

Even though the RMSE is generally the preferred performance measure for regression tasks, in some contexts you may prefer to use another function. For example, suppose that there are many outlier districts. In that case, you may consider using the Mean Absolute Error (also called the Average Absolute Deviation; see Equation 2-2):

Equation 2-2. Mean Absolute Error

\mathrm{MAE}(\mathbf{X}, h) = \frac{1}{m}\sum_{i=1}^{m}\bigl|h(\mathbf{x}^{(i)}) - y^{(i)}\bigr|

Both the RMSE and the MAE are ways to measure the distance between two vectors: the vector of predictions and the vector of target values. Various distance measures, or norms, are possible:

Computing the root of a sum of squares (RMSE) corresponds to the Euclidean norm: it is the notion of distance you are familiar with. It is also called the ℓ2 norm, noted ∥·∥₂ (or just ∥·∥).

Computing the sum of absolutes (MAE) corresponds to the ℓ1 norm, noted ∥·∥₁. It is sometimes called the Manhattan norm because it measures the distance between two points in a city if you can only travel along orthogonal city blocks.

More generally, the ℓk norm of a vector v containing n elements is defined as \lVert \mathbf{v} \rVert_{k} = \bigl(|v_1|^{k} + |v_2|^{k} + \cdots + |v_n|^{k}\bigr)^{1/k}. ℓ0 just gives the number of non-zero elements in the vector, and ℓ∞ gives the maximum absolute value in the vector.

The higher the norm index, the more it focuses on large values and neglects small ones. This is why the RMSE is more sensitive to outliers than the MAE. But when outliers are exponentially rare (like in a bell-shaped curve), the RMSE performs very well and is generally preferred.
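To make these definitions concrete, here is a minimal NumPy sketch (not taken from the book's notebook; the toy arrays are made-up values for illustration only) that computes the RMSE, the MAE, and a generic ℓk norm of the error vector:

import numpy as np

# Made-up predictions and targets, for illustration only
y_pred = np.array([158400., 173200., 95800.])
y_true = np.array([156400., 180100., 101300.])
errors = y_pred - y_true

rmse = np.sqrt(np.mean(errors ** 2))   # l2-style measure: penalizes large errors more
mae = np.mean(np.abs(errors))          # l1-style measure: more robust to outliers

def lk_norm(v, k):
    """Generic lk norm of a vector v: (sum of |v_i|^k) ** (1/k)."""
    return np.sum(np.abs(v) ** k) ** (1.0 / k)

print("RMSE:", rmse)
print("MAE:", mae)
print("l1 norm of errors:", lk_norm(errors, 1))   # sum of absolute errors
print("l2 norm of errors:", lk_norm(errors, 2))   # sqrt of the sum of squared errors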

Check the Assumptions

Lastly, it is good practice to list and verify the assumptions that were made so far (by you or others); this can catch serious issues early on. For example, the district prices that your system outputs are going to be fed into a downstream Machine Learning system, and we assume that these prices are going to be used as such. But what if the downstream system actually converts the prices into categories (e.g., "cheap," "medium," or "expensive") and then uses those categories instead of the prices themselves? In this case, getting the price perfectly right is not important at all; your system just needs to get the category right. If that's so, then the problem should have been framed as a classification task, not a regression task. You don't want to find this out after working on a regression system for months.

Fortunately, after talking with the team in charge of the downstream system, you are confident that they do indeed need the actual prices, not just categories. Great! You're all set, the lights are green, and you can start coding now!

GettheDataIt’stimetogetyourhandsdirty.Don’thesitatetopickupyourlaptopandwalkthroughthefollowingcodeexamplesinaJupyternotebook.ThefullJupyternotebookisavailableathttps://github.com/ageron/handson-ml.

CreatetheWorkspaceFirstyouwillneedtohavePythoninstalled.Itisprobablyalreadyinstalledonyoursystem.Ifnot,youcangetitathttps://www.python.org/.5

NextyouneedtocreateaworkspacedirectoryforyourMachineLearningcodeanddatasets.Openaterminalandtypethefollowingcommands(afterthe$prompts):

$exportML_PATH="$HOME/ml"#Youcanchangethepathifyouprefer

$mkdir-p$ML_PATH

YouwillneedanumberofPythonmodules:Jupyter,NumPy,Pandas,Matplotlib,andScikit-Learn.IfyoualreadyhaveJupyterrunningwithallthesemodulesinstalled,youcansafelyskipto“DownloadtheData”.Ifyoudon’thavethemyet,therearemanywaystoinstallthem(andtheirdependencies).Youcanuseyoursystem’spackagingsystem(e.g.,apt-getonUbuntu,orMacPortsorHomeBrewonmacOS),installaScientificPythondistributionsuchasAnacondaanduseitspackagingsystem,orjustusePython’sownpackagingsystem,pip,whichisincludedbydefaultwiththePythonbinaryinstallers(sincePython2.7.9).6Youcanchecktoseeifpipisinstalledbytypingthefollowingcommand:

$pip3--version

pip9.0.1from[...]/lib/python3.5/site-packages(python3.5)

Youshouldmakesureyouhavearecentversionofpipinstalled,attheveryleast>1.4tosupportbinarymoduleinstallation(a.k.a.wheels).Toupgradethepipmodule,type:7

$pip3install--upgradepip

Collectingpip

[...]

Successfullyinstalledpip-9.0.1

CREATING AN ISOLATED ENVIRONMENT

If you would like to work in an isolated environment (which is strongly recommended so you can work on different projects without having conflicting library versions), install virtualenv by running the following pip command:

$ pip3 install --user --upgrade virtualenv
Collecting virtualenv
[...]
Successfully installed virtualenv

Now you can create an isolated Python environment by typing:

$ cd $ML_PATH
$ virtualenv env
Using base prefix '[...]'
New python executable in [...]/ml/env/bin/python3.5
Also creating executable in [...]/ml/env/bin/python
Installing setuptools, pip, wheel...done.

Now every time you want to activate this environment, just open a terminal and type:

$ cd $ML_PATH
$ source env/bin/activate

While the environment is active, any package you install using pip will be installed in this isolated environment, and Python will only have access to these packages (if you also want access to the system's site packages, you should create the environment using virtualenv's --system-site-packages option). Check out virtualenv's documentation for more information.

Now you can install all the required modules and their dependencies using this simple pip command:

$ pip3 install --upgrade jupyter matplotlib numpy pandas scipy scikit-learn
Collecting jupyter
Downloading jupyter-1.0.0-py2.py3-none-any.whl
Collecting matplotlib
[...]

To check your installation, try to import every module like this:

$ python3 -c "import jupyter, matplotlib, numpy, pandas, scipy, sklearn"

There should be no output and no error. Now you can fire up Jupyter by typing:

$ jupyter notebook
[I 15:24 NotebookApp] Serving notebooks from local directory: [...]/ml
[I 15:24 NotebookApp] 0 active kernels
[I 15:24 NotebookApp] The Jupyter Notebook is running at: http://localhost:8888/
[I 15:24 NotebookApp] Use Control-C to stop this server and shut down all
kernels (twice to skip confirmation).

A Jupyter server is now running in your terminal, listening to port 8888. You can visit this server by opening your web browser to http://localhost:8888/ (this usually happens automatically when the server starts). You should see your empty workspace directory (containing only the env directory if you followed the preceding virtualenv instructions).

Now create a new Python notebook by clicking on the New button and selecting the appropriate Python version8 (see Figure 2-3).

This does three things: first, it creates a new notebook file called Untitled.ipynb in your workspace; second, it starts a Jupyter Python kernel to run this notebook; and third, it opens this notebook in a new tab. You should start by renaming this notebook to "Housing" (this will automatically rename the file to Housing.ipynb) by clicking Untitled and typing the new name.

Figure 2-3. Your workspace in Jupyter

A notebook contains a list of cells. Each cell can contain executable code or formatted text. Right now the notebook contains only one empty code cell, labeled "In [1]:". Try typing print("Hello world!") in the cell, and click on the play button (see Figure 2-4) or press Shift-Enter. This sends the current cell to this notebook's Python kernel, which runs it and returns the output. The result is displayed below the cell, and since we reached the end of the notebook, a new cell is automatically created. Go through the User Interface Tour from Jupyter's Help menu to learn the basics.

Figure 2-4. Hello world Python notebook

Download the Data

In typical environments your data would be available in a relational database (or some other common datastore) and spread across multiple tables/documents/files. To access it, you would first need to get your credentials and access authorizations,9 and familiarize yourself with the data schema. In this project, however, things are much simpler: you will just download a single compressed file, housing.tgz, which contains a comma-separated value (CSV) file called housing.csv with all the data.

You could use your web browser to download it, and run tar xzf housing.tgz to decompress the file and extract the CSV file, but it is preferable to create a small function to do that. It is useful in particular if data changes regularly, as it allows you to write a small script that you can run whenever you need to fetch the latest data (or you can set up a scheduled job to do that automatically at regular intervals). Automating the process of fetching the data is also useful if you need to install the dataset on multiple machines.

Here is the function to fetch the data:10

import os
import tarfile
from six.moves import urllib

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    if not os.path.isdir(housing_path):
        os.makedirs(housing_path)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path=housing_path)
    housing_tgz.close()

Now when you call fetch_housing_data(), it creates a datasets/housing directory in your workspace, downloads the housing.tgz file, and extracts the housing.csv from it in this directory.

Now let's load the data using Pandas. Once again you should write a small function to load the data:

import pandas as pd

def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

This function returns a Pandas DataFrame object containing all the data.

Take a Quick Look at the Data Structure

Let's take a look at the top five rows using the DataFrame's head() method (see Figure 2-5).

Figure 2-5. Top five rows in the dataset

Each row represents one district. There are 10 attributes (you can see the first 6 in the screenshot): longitude, latitude, housing_median_age, total_rooms, total_bedrooms, population, households, median_income, median_house_value, and ocean_proximity.

The info() method is useful to get a quick description of the data, in particular the total number of rows, and each attribute's type and number of non-null values (see Figure 2-6).

Figure 2-6. Housing info

There are 20,640 instances in the dataset, which means that it is fairly small by Machine Learning standards, but it's perfect to get started. Notice that the total_bedrooms attribute has only 20,433 non-null values, meaning that 207 districts are missing this feature. We will need to take care of this later.

All attributes are numerical, except the ocean_proximity field. Its type is object, so it could hold any kind of Python object, but since you loaded this data from a CSV file you know that it must be a text attribute. When you looked at the top five rows, you probably noticed that the values in the ocean_proximity column were repetitive, which means that it is probably a categorical attribute. You can find out what categories exist and how many districts belong to each category by using the value_counts() method:

>>> housing["ocean_proximity"].value_counts()
<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: ocean_proximity, dtype: int64

Let's look at the other fields. The describe() method shows a summary of the numerical attributes (Figure 2-7).

Figure 2-7. Summary of each numerical attribute

The count, mean, min, and max rows are self-explanatory. Note that the null values are ignored (so, for example, count of total_bedrooms is 20,433, not 20,640). The std row shows the standard deviation, which measures how dispersed the values are.11 The 25%, 50%, and 75% rows show the corresponding percentiles: a percentile indicates the value below which a given percentage of observations in a group of observations falls. For example, 25% of the districts have a housing_median_age lower than 18, while 50% are lower than 29 and 75% are lower than 37. These are often called the 25th percentile (or 1st quartile), the median, and the 75th percentile (or 3rd quartile).

Another quick way to get a feel of the type of data you are dealing with is to plot a histogram for each numerical attribute. A histogram shows the number of instances (on the vertical axis) that have a given value range (on the horizontal axis). You can either plot this one attribute at a time, or you can call the hist() method on the whole dataset, and it will plot a histogram for each numerical attribute (see Figure 2-8). For example, you can see that slightly over 800 districts have a median_house_value equal to about $100,000.

%matplotlib inline   # only in a Jupyter notebook
import matplotlib.pyplot as plt
housing.hist(bins=50, figsize=(20,15))
plt.show()

NOTE
The hist() method relies on Matplotlib, which in turn relies on a user-specified graphical backend to draw on your screen. So before you can plot anything, you need to specify which backend Matplotlib should use. The simplest option is to use Jupyter's magic command %matplotlib inline. This tells Jupyter to set up Matplotlib so it uses Jupyter's own backend. Plots are then rendered within the notebook itself. Note that calling show() is optional in a Jupyter notebook, as Jupyter will automatically display plots when a cell is executed.

Figure 2-8. A histogram for each numerical attribute

Notice a few things in these histograms:

1. First, the median income attribute does not look like it is expressed in US dollars (USD). After checking with the team that collected the data, you are told that the data has been scaled and capped at 15 (actually 15.0001) for higher median incomes, and at 0.5 (actually 0.4999) for lower median incomes. Working with preprocessed attributes is common in Machine Learning, and it is not necessarily a problem, but you should try to understand how the data was computed.

2. The housing median age and the median house value were also capped. The latter may be a serious problem since it is your target attribute (your labels). Your Machine Learning algorithms may learn that prices never go beyond that limit. You need to check with your client team (the team that will use your system's output) to see if this is a problem or not. If they tell you that they need precise predictions even beyond $500,000, then you have mainly two options:

   a. Collect proper labels for the districts whose labels were capped.

   b. Remove those districts from the training set (and also from the test set, since your system should not be evaluated poorly if it predicts values beyond $500,000).

3. These attributes have very different scales. We will discuss this later in this chapter when we explore feature scaling.

4. Finally, many histograms are tail heavy: they extend much farther to the right of the median than to the left. This may make it a bit harder for some Machine Learning algorithms to detect patterns. We will try transforming these attributes later on to have more bell-shaped distributions.

Hopefully you now have a better understanding of the kind of data you are dealing with.

WARNING
Wait! Before you look at the data any further, you need to create a test set, put it aside, and never look at it.

Create a Test Set

It may sound strange to voluntarily set aside part of the data at this stage. After all, you have only taken a quick glance at the data, and surely you should learn a whole lot more about it before you decide what algorithms to use, right? This is true, but your brain is an amazing pattern detection system, which means that it is highly prone to overfitting: if you look at the test set, you may stumble upon some seemingly interesting pattern in the test data that leads you to select a particular kind of Machine Learning model. When you estimate the generalization error using the test set, your estimate will be too optimistic and you will launch a system that will not perform as well as expected. This is called data snooping bias.

Creating a test set is theoretically quite simple: just pick some instances randomly, typically 20% of the dataset, and set them aside:

import numpy as np

def split_train_test(data, test_ratio):
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]

You can then use this function like this:

>>> train_set, test_set = split_train_test(housing, 0.2)
>>> print(len(train_set), "train +", len(test_set), "test")
16512 train + 4128 test

Well, this works, but it is not perfect: if you run the program again, it will generate a different test set! Over time, you (or your Machine Learning algorithms) will get to see the whole dataset, which is what you want to avoid.

One solution is to save the test set on the first run and then load it in subsequent runs. Another option is to set the random number generator's seed (e.g., np.random.seed(42))12 before calling np.random.permutation(), so that it always generates the same shuffled indices.

But both these solutions will break next time you fetch an updated dataset. A common solution is to use each instance's identifier to decide whether or not it should go in the test set (assuming instances have a unique and immutable identifier). For example, you could compute a hash of each instance's identifier, keep only the last byte of the hash, and put the instance in the test set if this value is lower or equal to 51 (~20% of 256). This ensures that the test set will remain consistent across multiple runs, even if you refresh the dataset. The new test set will contain 20% of the new instances, but it will not contain any instance that was previously in the training set. Here is a possible implementation:

import hashlib

def test_set_check(identifier, test_ratio, hash):
    return hash(np.int64(identifier)).digest()[-1] < 256 * test_ratio

def split_train_test_by_id(data, test_ratio, id_column, hash=hashlib.md5):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio, hash))
    return data.loc[~in_test_set], data.loc[in_test_set]

Unfortunately, the housing dataset does not have an identifier column. The simplest solution is to use the row index as the ID:

housing_with_id = housing.reset_index()   # adds an `index` column
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "index")

If you use the row index as a unique identifier, you need to make sure that new data gets appended to the end of the dataset, and no row ever gets deleted. If this is not possible, then you can try to use the most stable features to build a unique identifier. For example, a district's latitude and longitude are guaranteed to be stable for a few million years, so you could combine them into an ID like so:13

housing_with_id["id"] = housing["longitude"] * 1000 + housing["latitude"]
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "id")

Scikit-Learn provides a few functions to split datasets into multiple subsets in various ways. The simplest function is train_test_split, which does pretty much the same thing as the function split_train_test defined earlier, with a couple of additional features. First there is a random_state parameter that allows you to set the random generator seed as explained previously, and second you can pass it multiple datasets with an identical number of rows, and it will split them on the same indices (this is very useful, for example, if you have a separate DataFrame for labels):

from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

So far we have considered purely random sampling methods. This is generally fine if your dataset is large enough (especially relative to the number of attributes), but if it is not, you run the risk of introducing a significant sampling bias. When a survey company decides to call 1,000 people to ask them a few questions, they don't just pick 1,000 people randomly in a phone booth. They try to ensure that these 1,000 people are representative of the whole population. For example, the US population is composed of 51.3% female and 48.7% male, so a well-conducted survey in the US would try to maintain this ratio in the sample: 513 female and 487 male. This is called stratified sampling: the population is divided into homogeneous subgroups called strata, and the right number of instances is sampled from each stratum to guarantee that the test set is representative of the overall population. If they used purely random sampling, there would be about a 12% chance of sampling a skewed test set with either less than 49% female or more than 54% female. Either way, the survey results would be significantly biased.

Suppose you chatted with experts who told you that the median income is a very important attribute to predict median housing prices. You may want to ensure that the test set is representative of the various categories of incomes in the whole dataset. Since the median income is a continuous numerical attribute, you first need to create an income category attribute. Let's look at the median income histogram more closely (see Figure 2-8): most median income values are clustered around $20,000–$50,000, but some median incomes go far beyond $60,000. It is important to have a sufficient number of instances in your dataset for each stratum, or else the estimate of the stratum's importance may be biased. This means that you should not have too many strata, and each stratum should be large enough. The following code creates an income category attribute by dividing the median income by 1.5 (to limit the number of income categories), and rounding up using ceil (to have discrete categories), and then merging all the categories greater than 5 into category 5:

housing["income_cat"] = np.ceil(housing["median_income"] / 1.5)
housing["income_cat"].where(housing["income_cat"] < 5, 5.0, inplace=True)

These income categories are represented in Figure 2-9:

Figure 2-9. Histogram of income categories

Now you are ready to do stratified sampling based on the income category. For this you can use Scikit-Learn's StratifiedShuffleSplit class:

from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

Let's see if this worked as expected. You can start by looking at the income category proportions in the full housing dataset:

>>> housing["income_cat"].value_counts() / len(housing)
3.0    0.350581
2.0    0.318847
4.0    0.176308
5.0    0.114438
1.0    0.039826
Name: income_cat, dtype: float64

With similar code you can measure the income category proportions in the test set. Figure 2-10 compares the income category proportions in the overall dataset, in the test set generated with stratified sampling, and in a test set generated using purely random sampling. As you can see, the test set generated using stratified sampling has income category proportions almost identical to those in the full dataset, whereas the test set generated using purely random sampling is quite skewed.

Figure 2-10. Sampling bias comparison of stratified versus purely random sampling
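As a rough sketch of what that "similar code" could look like (income_cat_proportions is a hypothetical helper, not defined in the chapter; it assumes income_cat has not been dropped yet and that a purely random split is redone here so it also contains income_cat):

import pandas as pd

def income_cat_proportions(data):
    # Hypothetical helper: proportion of districts in each income category
    return data["income_cat"].value_counts() / len(data)

rand_train_set, rand_test_set = train_test_split(housing, test_size=0.2, random_state=42)

compare_props = pd.DataFrame({
    "Overall": income_cat_proportions(housing),
    "Stratified": income_cat_proportions(strat_test_set),
    "Random": income_cat_proportions(rand_test_set),
}).sort_index()
print(compare_props)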

Now you should remove the income_cat attribute so the data is back to its original state:

for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)

We spent quite a bit of time on test set generation for a good reason: this is an often neglected but critical part of a Machine Learning project. Moreover, many of these ideas will be useful later when we discuss cross-validation. Now it's time to move on to the next stage: exploring the data.

Discover and Visualize the Data to Gain Insights

So far you have only taken a quick glance at the data to get a general understanding of the kind of data you are manipulating. Now the goal is to go a little bit more in depth.

First, make sure you have put the test set aside and you are only exploring the training set. Also, if the training set is very large, you may want to sample an exploration set, to make manipulations easy and fast. In our case, the set is quite small so you can just work directly on the full set. Let's create a copy so you can play with it without harming the training set:

housing = strat_train_set.copy()

Visualizing Geographical Data

Since there is geographical information (latitude and longitude), it is a good idea to create a scatterplot of all districts to visualize the data (Figure 2-11):

housing.plot(kind="scatter", x="longitude", y="latitude")

Figure 2-11. A geographical scatterplot of the data

This looks like California all right, but other than that it is hard to see any particular pattern. Setting the alpha option to 0.1 makes it much easier to visualize the places where there is a high density of data points (Figure 2-12):

housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1)

Figure 2-12. A better visualization highlighting high-density areas

Now that's much better: you can clearly see the high-density areas, namely the Bay Area and around Los Angeles and San Diego, plus a long line of fairly high density in the Central Valley, in particular around Sacramento and Fresno.

More generally, our brains are very good at spotting patterns in pictures, but you may need to play around with visualization parameters to make the patterns stand out.

Now let's look at the housing prices (Figure 2-13). The radius of each circle represents the district's population (option s), and the color represents the price (option c). We will use a predefined color map (option cmap) called jet, which ranges from blue (low values) to red (high prices):14

housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
    s=housing["population"]/100, label="population", figsize=(10,7),
    c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True,
)
plt.legend()

Figure 2-13. California housing prices

This image tells you that the housing prices are very much related to the location (e.g., close to the ocean) and to the population density, as you probably knew already. It will probably be useful to use a clustering algorithm to detect the main clusters, and add new features that measure the proximity to the cluster centers. The ocean proximity attribute may be useful as well, although in Northern California the housing prices in coastal districts are not too high, so it is not a simple rule.
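As a rough illustration of that clustering idea (this is not something the chapter implements; the number of clusters, 10, and the new column name are arbitrary choices for this sketch), you could use Scikit-Learn's KMeans to find a handful of geographic clusters and add a distance-to-nearest-cluster-center feature:

from sklearn.cluster import KMeans

coords = housing[["longitude", "latitude"]].values

# Fit 10 geographic clusters; fit_transform() returns each district's distance
# to every cluster center, so the row-wise minimum is the distance to the
# nearest cluster center.
kmeans = KMeans(n_clusters=10, random_state=42)
distances = kmeans.fit_transform(coords)
housing["dist_to_nearest_cluster"] = distances.min(axis=1)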

Looking for Correlations

Since the dataset is not too large, you can easily compute the standard correlation coefficient (also called Pearson's r) between every pair of attributes using the corr() method:

corr_matrix = housing.corr()

Now let's look at how much each attribute correlates with the median house value:

>>> corr_matrix["median_house_value"].sort_values(ascending=False)
median_house_value    1.000000
median_income         0.687170
total_rooms           0.135231
housing_median_age    0.114220
households            0.064702
total_bedrooms        0.047865
population           -0.026699
longitude            -0.047279
latitude             -0.142826
Name: median_house_value, dtype: float64

The correlation coefficient ranges from –1 to 1. When it is close to 1, it means that there is a strong positive correlation; for example, the median house value tends to go up when the median income goes up. When the coefficient is close to –1, it means that there is a strong negative correlation; you can see a small negative correlation between the latitude and the median house value (i.e., prices have a slight tendency to go down when you go north). Finally, coefficients close to zero mean that there is no linear correlation. Figure 2-14 shows various plots along with the correlation coefficient between their horizontal and vertical axes.

Figure 2-14. Standard correlation coefficient of various datasets (source: Wikipedia; public domain image)

WARNING
The correlation coefficient only measures linear correlations ("if x goes up, then y generally goes up/down"). It may completely miss out on nonlinear relationships (e.g., "if x is close to zero then y generally goes up"). Note how all the plots of the bottom row have a correlation coefficient equal to zero despite the fact that their axes are clearly not independent: these are examples of nonlinear relationships. Also, the second row shows examples where the correlation coefficient is equal to 1 or –1; notice that this has nothing to do with the slope. For example, your height in inches has a correlation coefficient of 1 with your height in feet or in nanometers.

Another way to check for correlation between attributes is to use Pandas' scatter_matrix function, which plots every numerical attribute against every other numerical attribute. Since there are now 11 numerical attributes, you would get 11² = 121 plots, which would not fit on a page, so let's just focus on a few promising attributes that seem most correlated with the median housing value (Figure 2-15):

from pandas.tools.plotting import scatter_matrix

attributes = ["median_house_value", "median_income", "total_rooms",
              "housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))

Figure 2-15. Scatter matrix

The main diagonal (top left to bottom right) would be full of straight lines if Pandas plotted each variable against itself, which would not be very useful. So instead Pandas displays a histogram of each attribute (other options are available; see Pandas' documentation for more details).

The most promising attribute to predict the median house value is the median income, so let's zoom in on their correlation scatterplot (Figure 2-16):

housing.plot(kind="scatter", x="median_income", y="median_house_value",
             alpha=0.1)

This plot reveals a few things. First, the correlation is indeed very strong; you can clearly see the upward trend and the points are not too dispersed. Second, the price cap that we noticed earlier is clearly visible as a horizontal line at $500,000. But this plot reveals other less obvious straight lines: a horizontal line around $450,000, another around $350,000, perhaps one around $280,000, and a few more below that. You may want to try removing the corresponding districts to prevent your algorithms from learning to reproduce these data quirks.

Figure 2-16. Median income versus median house value

Experimenting with Attribute Combinations

Hopefully the previous sections gave you an idea of a few ways you can explore the data and gain insights. You identified a few data quirks that you may want to clean up before feeding the data to a Machine Learning algorithm, and you found interesting correlations between attributes, in particular with the target attribute. You also noticed that some attributes have a tail-heavy distribution, so you may want to transform them (e.g., by computing their logarithm). Of course, your mileage will vary considerably with each project, but the general ideas are similar.
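For instance, a log transform is one simple way to tame a tail-heavy attribute; here is a minimal sketch (not part of the chapter's pipeline, and the new column name is purely illustrative):

# population is strictly positive in this dataset, so a plain log is safe here
housing["log_population"] = np.log(housing["population"])
housing["log_population"].hist(bins=50)   # the distribution should now look more bell-shaped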

One last thing you may want to do before actually preparing the data for Machine Learning algorithms is to try out various attribute combinations. For example, the total number of rooms in a district is not very useful if you don't know how many households there are. What you really want is the number of rooms per household. Similarly, the total number of bedrooms by itself is not very useful: you probably want to compare it to the number of rooms. And the population per household also seems like an interesting attribute combination to look at. Let's create these new attributes:

housing["rooms_per_household"] = housing["total_rooms"] / housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"] / housing["total_rooms"]
housing["population_per_household"] = housing["population"] / housing["households"]

And now let's look at the correlation matrix again:

>>> corr_matrix = housing.corr()
>>> corr_matrix["median_house_value"].sort_values(ascending=False)
median_house_value          1.000000
median_income               0.687160
rooms_per_household         0.146285
total_rooms                 0.135097
housing_median_age          0.114110
households                  0.064506
total_bedrooms              0.047689
population_per_household   -0.021985
population                 -0.026920
longitude                  -0.047432
latitude                   -0.142724
bedrooms_per_room          -0.259984
Name: median_house_value, dtype: float64

Hey, not bad! The new bedrooms_per_room attribute is much more correlated with the median house value than the total number of rooms or bedrooms. Apparently houses with a lower bedroom/room ratio tend to be more expensive. The number of rooms per household is also more informative than the total number of rooms in a district: obviously the larger the houses, the more expensive they are.

This round of exploration does not have to be absolutely thorough; the point is to start off on the right foot and quickly gain insights that will help you get a first reasonably good prototype. But this is an iterative process: once you get a prototype up and running, you can analyze its output to gain more insights and come back to this exploration step.

Prepare the Data for Machine Learning Algorithms

It's time to prepare the data for your Machine Learning algorithms. Instead of just doing this manually, you should write functions to do that, for several good reasons:

This will allow you to reproduce these transformations easily on any dataset (e.g., the next time you get a fresh dataset).

You will gradually build a library of transformation functions that you can reuse in future projects.

You can use these functions in your live system to transform the new data before feeding it to your algorithms.

This will make it possible for you to easily try various transformations and see which combination of transformations works best.

But first let's revert to a clean training set (by copying strat_train_set once again), and let's separate the predictors and the labels since we don't necessarily want to apply the same transformations to the predictors and the target values (note that drop() creates a copy of the data and does not affect strat_train_set):

housing = strat_train_set.drop("median_house_value", axis=1)
housing_labels = strat_train_set["median_house_value"].copy()

Data Cleaning

Most Machine Learning algorithms cannot work with missing features, so let's create a few functions to take care of them. You noticed earlier that the total_bedrooms attribute has some missing values, so let's fix this. You have three options:

Get rid of the corresponding districts.

Get rid of the whole attribute.

Set the values to some value (zero, the mean, the median, etc.).

You can accomplish these easily using DataFrame's dropna(), drop(), and fillna() methods:

housing.dropna(subset=["total_bedrooms"])    # option 1
housing.drop("total_bedrooms", axis=1)       # option 2
median = housing["total_bedrooms"].median()  # option 3
housing["total_bedrooms"].fillna(median, inplace=True)

If you choose option 3, you should compute the median value on the training set, and use it to fill the missing values in the training set, but also don't forget to save the median value that you have computed. You will need it later to replace missing values in the test set when you want to evaluate your system, and also once the system goes live to replace missing values in new data.

Scikit-Learn provides a handy class to take care of missing values: Imputer. Here is how to use it. First, you need to create an Imputer instance, specifying that you want to replace each attribute's missing values with the median of that attribute:

from sklearn.preprocessing import Imputer

imputer = Imputer(strategy="median")

Since the median can only be computed on numerical attributes, we need to create a copy of the data without the text attribute ocean_proximity:

housing_num = housing.drop("ocean_proximity", axis=1)

Now you can fit the imputer instance to the training data using the fit() method:

imputer.fit(housing_num)

The imputer has simply computed the median of each attribute and stored the result in its statistics_ instance variable. Only the total_bedrooms attribute had missing values, but we cannot be sure that there won't be any missing values in new data after the system goes live, so it is safer to apply the imputer to all the numerical attributes:

>>> imputer.statistics_
array([ -118.51, 34.26, 29., 2119.5, 433., 1164., 408., 3.5409])
>>> housing_num.median().values
array([ -118.51, 34.26, 29., 2119.5, 433., 1164., 408., 3.5409])

Now you can use this "trained" imputer to transform the training set by replacing missing values by the learned medians:

X = imputer.transform(housing_num)

The result is a plain NumPy array containing the transformed features. If you want to put it back into a Pandas DataFrame, it's simple:

housing_tr = pd.DataFrame(X, columns=housing_num.columns)

SCIKIT-LEARN DESIGN

Scikit-Learn's API is remarkably well designed. The main design principles are:15

Consistency. All objects share a consistent and simple interface:

Estimators. Any object that can estimate some parameters based on a dataset is called an estimator (e.g., an imputer is an estimator). The estimation itself is performed by the fit() method, and it takes only a dataset as a parameter (or two for supervised learning algorithms; the second dataset contains the labels). Any other parameter needed to guide the estimation process is considered a hyperparameter (such as an imputer's strategy), and it must be set as an instance variable (generally via a constructor parameter).

Transformers. Some estimators (such as an imputer) can also transform a dataset; these are called transformers. Once again, the API is quite simple: the transformation is performed by the transform() method with the dataset to transform as a parameter. It returns the transformed dataset. This transformation generally relies on the learned parameters, as is the case for an imputer. All transformers also have a convenience method called fit_transform() that is equivalent to calling fit() and then transform() (but sometimes fit_transform() is optimized and runs much faster).

Predictors. Finally, some estimators are capable of making predictions given a dataset; they are called predictors. For example, the LinearRegression model in the previous chapter was a predictor: it predicted life satisfaction given a country's GDP per capita. A predictor has a predict() method that takes a dataset of new instances and returns a dataset of corresponding predictions. It also has a score() method that measures the quality of the predictions given a test set (and the corresponding labels in the case of supervised learning algorithms).16

Inspection. All the estimator's hyperparameters are accessible directly via public instance variables (e.g., imputer.strategy), and all the estimator's learned parameters are also accessible via public instance variables with an underscore suffix (e.g., imputer.statistics_).

Nonproliferation of classes. Datasets are represented as NumPy arrays or SciPy sparse matrices, instead of homemade classes. Hyperparameters are just regular Python strings or numbers.

Composition. Existing building blocks are reused as much as possible. For example, it is easy to create a Pipeline estimator from an arbitrary sequence of transformers followed by a final estimator, as we will see.

Sensible defaults. Scikit-Learn provides reasonable default values for most parameters, making it easy to create a baseline working system quickly.
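To make these conventions concrete, here is a small recap sketch using the imputer from this section; it introduces nothing new, it just restates the fit/transform/inspection pattern described above:

imputer = Imputer(strategy="median")   # hyperparameter set via the constructor
imputer.fit(housing_num)               # estimator: learns its parameters from the data

print(imputer.strategy)                # inspection: hyperparameter as a public attribute
print(imputer.statistics_)             # inspection: learned parameters end with an underscore

X = imputer.transform(housing_num)     # transformer: applies the learned parameters
X = imputer.fit_transform(housing_num) # or do both steps in one (often faster) call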

Handling Text and Categorical Attributes

Earlier we left out the categorical attribute ocean_proximity because it is a text attribute so we cannot compute its median. Most Machine Learning algorithms prefer to work with numbers anyway, so let's convert these text labels to numbers.

Scikit-Learn provides a transformer for this task called LabelEncoder:

>>> from sklearn.preprocessing import LabelEncoder
>>> encoder = LabelEncoder()
>>> housing_cat = housing["ocean_proximity"]
>>> housing_cat_encoded = encoder.fit_transform(housing_cat)
>>> housing_cat_encoded
array([0, 0, 4, ..., 1, 0, 3])

This is better: now we can use this numerical data in any ML algorithm. You can look at the mapping that this encoder has learned using the classes_ attribute ("<1H OCEAN" is mapped to 0, "INLAND" is mapped to 1, etc.):

>>> print(encoder.classes_)
['<1H OCEAN' 'INLAND' 'ISLAND' 'NEAR BAY' 'NEAR OCEAN']

One issue with this representation is that ML algorithms will assume that two nearby values are more similar than two distant values. Obviously this is not the case (for example, categories 0 and 4 are more similar than categories 0 and 1). To fix this issue, a common solution is to create one binary attribute per category: one attribute equal to 1 when the category is "<1H OCEAN" (and 0 otherwise), another attribute equal to 1 when the category is "INLAND" (and 0 otherwise), and so on. This is called one-hot encoding, because only one attribute will be equal to 1 (hot), while the others will be 0 (cold).

Scikit-Learn provides a OneHotEncoder encoder to convert integer categorical values into one-hot vectors. Let's encode the categories as one-hot vectors. Note that fit_transform() expects a 2D array, but housing_cat_encoded is a 1D array, so we need to reshape it:17

>>> from sklearn.preprocessing import OneHotEncoder
>>> encoder = OneHotEncoder()
>>> housing_cat_1hot = encoder.fit_transform(housing_cat_encoded.reshape(-1, 1))
>>> housing_cat_1hot
<16512x5 sparse matrix of type '<class 'numpy.float64'>'
	with 16512 stored elements in Compressed Sparse Row format>

Notice that the output is a SciPy sparse matrix, instead of a NumPy array. This is very useful when you have categorical attributes with thousands of categories. After one-hot encoding we get a matrix with thousands of columns, and the matrix is full of zeros except for one 1 per row. Using up tons of memory mostly to store zeros would be very wasteful, so instead a sparse matrix only stores the location of the nonzero elements. You can use it mostly like a normal 2D array,18 but if you really want to convert it to a (dense) NumPy array, just call the toarray() method:

>>> housing_cat_1hot.toarray()
array([[ 1.,  0.,  0.,  0.,  0.],
       [ 1.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  1.],
       ...,
       [ 0.,  1.,  0.,  0.,  0.],
       [ 1.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.]])

We can apply both transformations (from text categories to integer categories, then from integer categories to one-hot vectors) in one shot using the LabelBinarizer class:

>>> from sklearn.preprocessing import LabelBinarizer
>>> encoder = LabelBinarizer()
>>> housing_cat_1hot = encoder.fit_transform(housing_cat)
>>> housing_cat_1hot
array([[1, 0, 0, 0, 0],
       [1, 0, 0, 0, 0],
       [0, 0, 0, 0, 1],
       ...,
       [0, 1, 0, 0, 0],
       [1, 0, 0, 0, 0],
       [0, 0, 0, 1, 0]])

Note that this returns a dense NumPy array by default. You can get a sparse matrix instead by passing sparse_output=True to the LabelBinarizer constructor.

Custom Transformers

Although Scikit-Learn provides many useful transformers, you will need to write your own for tasks such as custom cleanup operations or combining specific attributes. You will want your transformer to work seamlessly with Scikit-Learn functionalities (such as pipelines), and since Scikit-Learn relies on duck typing (not inheritance), all you need is to create a class and implement three methods: fit() (returning self), transform(), and fit_transform(). You can get the last one for free by simply adding TransformerMixin as a base class. Also, if you add BaseEstimator as a base class (and avoid *args and **kargs in your constructor) you will get two extra methods (get_params() and set_params()) that will be useful for automatic hyperparameter tuning. For example, here is a small transformer class that adds the combined attributes we discussed earlier:

from sklearn.base import BaseEstimator, TransformerMixin

rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True):  # no *args or **kargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)

In this example the transformer has one hyperparameter, add_bedrooms_per_room, set to True by default (it is often helpful to provide sensible defaults). This hyperparameter will allow you to easily find out whether adding this attribute helps the Machine Learning algorithms or not. More generally, you can add a hyperparameter to gate any data preparation step that you are not 100% sure about. The more you automate these data preparation steps, the more combinations you can automatically try out, making it much more likely that you will find a great combination (and saving you a lot of time).

Feature Scaling

One of the most important transformations you need to apply to your data is feature scaling. With few exceptions, Machine Learning algorithms don't perform well when the input numerical attributes have very different scales. This is the case for the housing data: the total number of rooms ranges from about 6 to 39,320, while the median incomes only range from 0 to 15. Note that scaling the target values is generally not required.

There are two common ways to get all attributes to have the same scale: min-max scaling and standardization.

Min-max scaling (many people call this normalization) is quite simple: values are shifted and rescaled so that they end up ranging from 0 to 1. We do this by subtracting the min value and dividing by the max minus the min. Scikit-Learn provides a transformer called MinMaxScaler for this. It has a feature_range hyperparameter that lets you change the range if you don't want 0–1 for some reason.

Standardization is quite different: first it subtracts the mean value (so standardized values always have a zero mean), and then it divides by the standard deviation so that the resulting distribution has unit variance. Unlike min-max scaling, standardization does not bound values to a specific range, which may be a problem for some algorithms (e.g., neural networks often expect an input value ranging from 0 to 1). However, standardization is much less affected by outliers. For example, suppose a district had a median income equal to 100 (by mistake). Min-max scaling would then crush all the other values from 0–15 down to 0–0.15, whereas standardization would not be much affected. Scikit-Learn provides a transformer called StandardScaler for standardization.

WARNING
As with all the transformations, it is important to fit the scalers to the training data only, not to the full dataset (including the test set). Only then can you use them to transform the training set and the test set (and new data).
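Here is a minimal sketch of both scalers and of the fit-on-the-training-data-only rule; it is not part of the chapter's final pipeline, which applies StandardScaler inside a Pipeline below, and the commented test-set variable is hypothetical:

from sklearn.preprocessing import MinMaxScaler, StandardScaler

min_max_scaler = MinMaxScaler(feature_range=(0, 1))   # (0, 1) is already the default range
housing_num_minmax = min_max_scaler.fit_transform(housing_num)

std_scaler = StandardScaler()
std_scaler.fit(housing_num)                     # learn mean and scale from the training set only
housing_num_std = std_scaler.transform(housing_num)
# Later, reuse the same fitted scaler on the test set or on new data:
# X_test_std = std_scaler.transform(X_test_num)   # X_test_num is hypothetical here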

Transformation Pipelines

As you can see, there are many data transformation steps that need to be executed in the right order. Fortunately, Scikit-Learn provides the Pipeline class to help with such sequences of transformations. Here is a small pipeline for the numerical attributes:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
        ('imputer', Imputer(strategy="median")),
        ('attribs_adder', CombinedAttributesAdder()),
        ('std_scaler', StandardScaler()),
    ])

housing_num_tr = num_pipeline.fit_transform(housing_num)

The Pipeline constructor takes a list of name/estimator pairs defining a sequence of steps. All but the last estimator must be transformers (i.e., they must have a fit_transform() method). The names can be anything you like (as long as they don't contain double underscores "__").

When you call the pipeline's fit() method, it calls fit_transform() sequentially on all transformers, passing the output of each call as the parameter to the next call, until it reaches the final estimator, for which it just calls the fit() method.

The pipeline exposes the same methods as the final estimator. In this example, the last estimator is a StandardScaler, which is a transformer, so the pipeline has a transform() method that applies all the transforms to the data in sequence (it also has a fit_transform() method that we could have used instead of calling fit() and then transform()).

Now it would be nice if we could feed a Pandas DataFrame directly into our pipeline, instead of having to first manually extract the numerical columns into a NumPy array. There is nothing in Scikit-Learn to handle Pandas DataFrames,19 but we can write a custom transformer for this task:

from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

Our DataFrameSelector will transform the data by selecting the desired attributes, dropping the rest, and converting the resulting DataFrame to a NumPy array. With this, you can easily write a pipeline that will take a Pandas DataFrame and handle only the numerical values: the pipeline would just start with a DataFrameSelector to pick only the numerical attributes, followed by the other preprocessing steps we discussed earlier. And you can just as easily write another pipeline for the categorical attributes as well by simply selecting the categorical attributes using a DataFrameSelector and then applying a LabelBinarizer.

num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

num_pipeline = Pipeline([
        ('selector', DataFrameSelector(num_attribs)),
        ('imputer', Imputer(strategy="median")),
        ('attribs_adder', CombinedAttributesAdder()),
        ('std_scaler', StandardScaler()),
    ])

cat_pipeline = Pipeline([
        ('selector', DataFrameSelector(cat_attribs)),
        ('label_binarizer', LabelBinarizer()),
    ])

But how can you join these two pipelines into a single pipeline? The answer is to use Scikit-Learn's FeatureUnion class. You give it a list of transformers (which can be entire transformer pipelines); when its transform() method is called, it runs each transformer's transform() method in parallel, waits for their output, and then concatenates them and returns the result (and of course calling its fit() method calls each transformer's fit() method). A full pipeline handling both numerical and categorical attributes may look like this:

from sklearn.pipeline import FeatureUnion

full_pipeline = FeatureUnion(transformer_list=[
        ("num_pipeline", num_pipeline),
        ("cat_pipeline", cat_pipeline),
    ])

And you can run the whole pipeline simply:

>>> housing_prepared = full_pipeline.fit_transform(housing)
>>> housing_prepared
array([[-1.15604281,  0.77194962,  0.74333089, ...,  0.        ,
         0.        ,  0.        ],
       [-1.17602483,  0.6596948 , -1.1653172 , ...,  0.        ,
         0.        ,  0.        ],
       [...]
>>> housing_prepared.shape
(16512, 16)

Select and Train a Model

At last! You framed the problem, you got the data and explored it, you sampled a training set and a test set, and you wrote transformation pipelines to clean up and prepare your data for Machine Learning algorithms automatically. You are now ready to select and train a Machine Learning model.

Training and Evaluating on the Training Set

The good news is that thanks to all these previous steps, things are now going to be much simpler than you might think. Let's first train a LinearRegression model, like we did in the previous chapter:

from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)

Done! You now have a working Linear Regression model. Let's try it out on a few instances from the training set:

>>> some_data = housing.iloc[:5]
>>> some_labels = housing_labels.iloc[:5]
>>> some_data_prepared = full_pipeline.transform(some_data)
>>> print("Predictions:", lin_reg.predict(some_data_prepared))
Predictions: [ 210644.6045  317768.8069  210956.4333   59218.9888  189747.5584]
>>> print("Labels:", list(some_labels))
Labels: [286600.0, 340600.0, 196900.0, 46300.0, 254500.0]

It works, although the predictions are not exactly accurate (e.g., the first prediction is off by close to 40%!). Let's measure this regression model's RMSE on the whole training set using Scikit-Learn's mean_squared_error function:

>>> from sklearn.metrics import mean_squared_error
>>> housing_predictions = lin_reg.predict(housing_prepared)
>>> lin_mse = mean_squared_error(housing_labels, housing_predictions)
>>> lin_rmse = np.sqrt(lin_mse)
>>> lin_rmse
68628.198198489219

Okay, this is better than nothing but clearly not a great score: most districts' median_housing_values range between $120,000 and $265,000, so a typical prediction error of $68,628 is not very satisfying. This is an example of a model underfitting the training data. When this happens it can mean that the features do not provide enough information to make good predictions, or that the model is not powerful enough. As we saw in the previous chapter, the main ways to fix underfitting are to select a more powerful model, to feed the training algorithm with better features, or to reduce the constraints on the model. This model is not regularized, so this rules out the last option. You could try to add more features (e.g., the log of the population), but first let's try a more complex model to see how it does.

Let’strainaDecisionTreeRegressor.Thisisapowerfulmodel,capableoffindingcomplexnonlinearrelationshipsinthedata(DecisionTreesarepresentedinmoredetailinChapter6).Thecodeshouldlookfamiliarbynow:

fromsklearn.treeimportDecisionTreeRegressor

tree_reg=DecisionTreeRegressor()

tree_reg.fit(housing_prepared,housing_labels)

Nowthatthemodelistrained,let’sevaluateitonthetrainingset:

>>>housing_predictions=tree_reg.predict(housing_prepared)

>>>tree_mse=mean_squared_error(housing_labels,housing_predictions)

>>>tree_rmse=np.sqrt(tree_mse)

>>>tree_rmse

0.0

Wait,what!?Noerroratall?Couldthismodelreallybeabsolutelyperfect?Ofcourse,itismuchmorelikelythatthemodelhasbadlyoverfitthedata.Howcanyoubesure?Aswesawearlier,youdon’twanttotouchthetestsetuntilyouarereadytolaunchamodelyouareconfidentabout,soyouneedtousepartofthetrainingsetfortraining,andpartformodelvalidation.

BetterEvaluationUsingCross-ValidationOnewaytoevaluatetheDecisionTreemodelwouldbetousethetrain_test_splitfunctiontosplitthetrainingsetintoasmallertrainingsetandavalidationset,thentrainyourmodelsagainstthesmallertrainingsetandevaluatethemagainstthevalidationset.It’sabitofwork,butnothingtoodifficultanditwouldworkfairlywell.

AgreatalternativeistouseScikit-Learn’scross-validationfeature.ThefollowingcodeperformsK-foldcross-validation:itrandomlysplitsthetrainingsetinto10distinctsubsetscalledfolds,thenittrainsandevaluatestheDecisionTreemodel10times,pickingadifferentfoldforevaluationeverytimeandtrainingontheother9folds.Theresultisanarraycontainingthe10evaluationscores:

fromsklearn.model_selectionimportcross_val_score

scores=cross_val_score(tree_reg,housing_prepared,housing_labels,

scoring="neg_mean_squared_error",cv=10)

tree_rmse_scores=np.sqrt(-scores)

WARNINGScikit-Learncross-validationfeaturesexpectautilityfunction(greaterisbetter)ratherthanacostfunction(lowerisbetter),sothescoringfunctionisactuallytheoppositeoftheMSE(i.e.,anegativevalue),whichiswhytheprecedingcodecomputes-scoresbeforecalculatingthesquareroot.

Let’slookattheresults:

>>>defdisplay_scores(scores):

...print("Scores:",scores)

...print("Mean:",scores.mean())

...print("Standarddeviation:",scores.std())

...

>>>display_scores(tree_rmse_scores)

Scores:[70232.013648266828.4683989272444.0872100370761.50186201

71125.5269765375581.2931985770169.5928616470055.37863456

75370.4911677371222.39081244]

Mean:71379.0744771

Standarddeviation:2458.31882043

NowtheDecisionTreedoesn’tlookasgoodasitdidearlier.Infact,itseemstoperformworsethantheLinearRegressionmodel!Noticethatcross-validationallowsyoutogetnotonlyanestimateoftheperformanceofyourmodel,butalsoameasureofhowprecisethisestimateis(i.e.,itsstandarddeviation).TheDecisionTreehasascoreofapproximately71,379,generally±2,458.Youwouldnothavethisinformationifyoujustusedonevalidationset.Butcross-validationcomesatthecostoftrainingthemodelseveraltimes,soitisnotalwayspossible.

Let’scomputethesamescoresfortheLinearRegressionmodeljusttobesure:

>>>lin_scores=cross_val_score(lin_reg,housing_prepared,housing_labels,

...scoring="neg_mean_squared_error",cv=10)

...

>>>lin_rmse_scores=np.sqrt(-lin_scores)

>>>display_scores(lin_rmse_scores)

Scores:[66760.9737157266962.6191424470349.9485340174757.02629506

68031.1338893871193.8418342664968.1370652768261.95557897

71527.6421787467665.10082067]

Mean:69047.8379055

Standarddeviation:2735.51074287

That’sright:theDecisionTreemodelisoverfittingsobadlythatitperformsworsethantheLinearRegressionmodel.

Let’stryonelastmodelnow:theRandomForestRegressor.AswewillseeinChapter7,RandomForestsworkbytrainingmanyDecisionTreesonrandomsubsetsofthefeatures,thenaveragingouttheirpredictions.BuildingamodelontopofmanyothermodelsiscalledEnsembleLearning,anditisoftenagreatwaytopushMLalgorithmsevenfurther.Wewillskipmostofthecodesinceitisessentiallythesameasfortheothermodels:

>>>fromsklearn.ensembleimportRandomForestRegressor

>>>forest_reg=RandomForestRegressor()

>>>forest_reg.fit(housing_prepared,housing_labels)

>>>[...]

>>>forest_rmse

21941.911027380233

>>>display_scores(forest_rmse_scores)

Scores:[51650.9440547148920.8064549852979.1609675254412.74042021

50861.2938116356488.5569972751866.9012078649752.24599537

55399.5071319153309.74548294]

Mean:52564.1902524

Standarddeviation:2301.87380392

Wow,thisismuchbetter:RandomForestslookverypromising.However,notethatthescoreonthetrainingsetisstillmuchlowerthanonthevalidationsets,meaningthatthemodelisstilloverfittingthetrainingset.Possiblesolutionsforoverfittingaretosimplifythemodel,constrainit(i.e.,regularizeit),orgetalotmoretrainingdata.However,beforeyoudivemuchdeeperinRandomForests,youshouldtryoutmanyothermodelsfromvariouscategoriesofMachineLearningalgorithms(severalSupportVectorMachineswithdifferentkernels,possiblyaneuralnetwork,etc.),withoutspendingtoomuchtimetweakingthehyperparameters.Thegoalistoshortlistafew(twotofive)promisingmodels.

TIPYoushouldsaveeverymodelyouexperimentwith,soyoucancomebackeasilytoanymodelyouwant.Makesureyousaveboththehyperparametersandthetrainedparameters,aswellasthecross-validationscoresandperhapstheactualpredictionsaswell.Thiswillallowyoutoeasilycomparescoresacrossmodeltypes,andcomparethetypesoferrorstheymake.YoucaneasilysaveScikit-LearnmodelsbyusingPython’spicklemodule,orusingsklearn.externals.joblib,whichismoreefficientatserializinglargeNumPyarrays:

fromsklearn.externalsimportjoblib

joblib.dump(my_model,"my_model.pkl")

#andlater...

my_model_loaded=joblib.load("my_model.pkl")
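If you also want to keep the scores and predictions alongside each trained model, one simple convention (purely illustrative; the file name and dictionary keys below are arbitrary choices) is to dump a small dictionary per experiment:

from sklearn.externals import joblib

experiment = {
    "model": forest_reg,                                   # fitted estimator (hyperparameters + trained parameters)
    "cv_rmse_scores": forest_rmse_scores,                  # the cross-validation scores computed above
    "predictions": forest_reg.predict(housing_prepared),   # optional: training-set predictions
}
joblib.dump(experiment, "forest_reg_experiment.pkl")

# later, to compare experiments:
experiment = joblib.load("forest_reg_experiment.pkl")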

Fine-Tune Your Model

Let's assume that you now have a shortlist of promising models. You now need to fine-tune them. Let's look at a few ways you can do that.

Grid Search

One way to do that would be to fiddle with the hyperparameters manually, until you find a great combination of hyperparameter values. This would be very tedious work, and you may not have time to explore many combinations.

Instead you should get Scikit-Learn's GridSearchCV to search for you. All you need to do is tell it which hyperparameters you want it to experiment with, and what values to try out, and it will evaluate all the possible combinations of hyperparameter values, using cross-validation. For example, the following code searches for the best combination of hyperparameter values for the RandomForestRegressor:

from sklearn.model_selection import GridSearchCV

param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
  ]

forest_reg = RandomForestRegressor()

grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error')
grid_search.fit(housing_prepared, housing_labels)

TIP When you have no idea what value a hyperparameter should have, a simple approach is to try out consecutive powers of 10 (or a smaller number if you want a more fine-grained search, as shown in this example with the n_estimators hyperparameter).

This param_grid tells Scikit-Learn to first evaluate all 3 × 4 = 12 combinations of n_estimators and max_features hyperparameter values specified in the first dict (don't worry about what these hyperparameters mean for now; they will be explained in Chapter 7), then try all 2 × 3 = 6 combinations of hyperparameter values in the second dict, but this time with the bootstrap hyperparameter set to False instead of True (which is the default value for this hyperparameter).

All in all, the grid search will explore 12 + 6 = 18 combinations of RandomForestRegressor hyperparameter values, and it will train each model five times (since we are using five-fold cross validation). In other words, all in all, there will be 18 × 5 = 90 rounds of training! It may take quite a long time, but when it is done you can get the best combination of parameters like this:

>>> grid_search.best_params_
{'max_features': 8, 'n_estimators': 30}

TIP Since 8 and 30 are the maximum values that were evaluated, you should probably try searching again with higher values, since the score may continue to improve.

You can also get the best estimator directly:

>>> grid_search.best_estimator_
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features=8, max_leaf_nodes=None, min_impurity_split=1e-07,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=30, n_jobs=1,
           oob_score=False, random_state=42, verbose=0, warm_start=False)

NOTE If GridSearchCV is initialized with refit=True (which is the default), then once it finds the best estimator using cross-validation, it retrains it on the whole training set. This is usually a good idea since feeding it more data will likely improve its performance.

And of course the evaluation scores are also available:

>>> cvres = grid_search.cv_results_
>>> for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
...     print(np.sqrt(-mean_score), params)
...
63825.0479302 {'max_features': 2, 'n_estimators': 3}
55643.8429091 {'max_features': 2, 'n_estimators': 10}
53380.6566859 {'max_features': 2, 'n_estimators': 30}
60959.1388585 {'max_features': 4, 'n_estimators': 3}
52740.5841667 {'max_features': 4, 'n_estimators': 10}
50374.1421461 {'max_features': 4, 'n_estimators': 30}
58661.2866462 {'max_features': 6, 'n_estimators': 3}
52009.9739798 {'max_features': 6, 'n_estimators': 10}
50154.1177737 {'max_features': 6, 'n_estimators': 30}
57865.3616801 {'max_features': 8, 'n_estimators': 3}
51730.0755087 {'max_features': 8, 'n_estimators': 10}
49694.8514333 {'max_features': 8, 'n_estimators': 30}
62874.4073931 {'max_features': 2, 'n_estimators': 3, 'bootstrap': False}
54561.9398157 {'max_features': 2, 'n_estimators': 10, 'bootstrap': False}
59416.6463145 {'max_features': 3, 'n_estimators': 3, 'bootstrap': False}
52660.245911 {'max_features': 3, 'n_estimators': 10, 'bootstrap': False}
57490.0168279 {'max_features': 4, 'n_estimators': 3, 'bootstrap': False}
51093.9059428 {'max_features': 4, 'n_estimators': 10, 'bootstrap': False}

In this example, we obtain the best solution by setting the max_features hyperparameter to 8, and the n_estimators hyperparameter to 30. The RMSE score for this combination is 49,694, which is slightly better than the score you got earlier using the default hyperparameter values (which was 52,564). Congratulations, you have successfully fine-tuned your best model!

TIP Don't forget that you can treat some of the data preparation steps as hyperparameters. For example, the grid search will automatically find out whether or not to add a feature you were not sure about (e.g., using the add_bedrooms_per_room hyperparameter of your CombinedAttributesAdder transformer). It may similarly be used to automatically find the best way to handle outliers, missing features, feature selection, and more.

Randomized Search

The grid search approach is fine when you are exploring relatively few combinations, like in the previous example, but when the hyperparameter search space is large, it is often preferable to use RandomizedSearchCV instead. This class can be used in much the same way as the GridSearchCV class, but instead of trying out all possible combinations, it evaluates a given number of random combinations by selecting a random value for each hyperparameter at every iteration. This approach has two main benefits (a short usage sketch follows this list):

If you let the randomized search run for, say, 1,000 iterations, this approach will explore 1,000 different values for each hyperparameter (instead of just a few values per hyperparameter with the grid search approach).

You have more control over the computing budget you want to allocate to hyperparameter search, simply by setting the number of iterations.
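For example, here is a minimal sketch (the parameter ranges are illustrative choices; SciPy's randint distribution lets the search draw a fresh integer at every iteration):

from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor

param_distribs = {
    "n_estimators": randint(low=1, high=200),   # sampled anew at each iteration
    "max_features": randint(low=1, high=8),
}

forest_reg = RandomForestRegressor()
rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,
                                n_iter=10, cv=5,
                                scoring="neg_mean_squared_error", random_state=42)
rnd_search.fit(housing_prepared, housing_labels)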

Ensemble Methods

Another way to fine-tune your system is to try to combine the models that perform best. The group (or "ensemble") will often perform better than the best individual model (just like Random Forests perform better than the individual Decision Trees they rely on), especially if the individual models make very different types of errors. We will cover this topic in more detail in Chapter 7.
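As a taste of the idea, a very simple hand-rolled ensemble just averages the predictions of two already-fitted models (forest_reg and lin_reg from earlier are assumed here); whether this actually helps depends on how different their errors are:

import numpy as np
from sklearn.metrics import mean_squared_error

forest_pred = forest_reg.predict(housing_prepared)
lin_pred = lin_reg.predict(housing_prepared)

# Naive ensemble: the unweighted average of the two models' predictions
ensemble_pred = (forest_pred + lin_pred) / 2
ensemble_rmse = np.sqrt(mean_squared_error(housing_labels, ensemble_pred))  # training-set RMSE, for illustration only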

Analyze the Best Models and Their Errors

You will often gain good insights on the problem by inspecting the best models. For example, the RandomForestRegressor can indicate the relative importance of each attribute for making accurate predictions:

>>> feature_importances = grid_search.best_estimator_.feature_importances_
>>> feature_importances
array([  7.33442355e-02,   6.29090705e-02,   4.11437985e-02,
         1.46726854e-02,   1.41064835e-02,   1.48742809e-02,
         1.42575993e-02,   3.66158981e-01,   5.64191792e-02,
         1.08792957e-01,   5.33510773e-02,   1.03114883e-02,
         1.64780994e-01,   6.02803867e-05,   1.96041560e-03,
         2.85647464e-03])

Let's display these importance scores next to their corresponding attribute names:

>>> extra_attribs = ["rooms_per_hhold", "pop_per_hhold", "bedrooms_per_room"]
>>> cat_one_hot_attribs = list(encoder.classes_)
>>> attributes = num_attribs + extra_attribs + cat_one_hot_attribs
>>> sorted(zip(feature_importances, attributes), reverse=True)
[(0.36615898061813418, 'median_income'),
 (0.16478099356159051, 'INLAND'),
 (0.10879295677551573, 'pop_per_hhold'),
 (0.073344235516012421, 'longitude'),
 (0.062909070482620302, 'latitude'),
 (0.056419179181954007, 'rooms_per_hhold'),
 (0.053351077347675809, 'bedrooms_per_room'),
 (0.041143798478729635, 'housing_median_age'),
 (0.014874280890402767, 'population'),
 (0.014672685420543237, 'total_rooms'),
 (0.014257599323407807, 'households'),
 (0.014106483453584102, 'total_bedrooms'),
 (0.010311488326303787, '<1H OCEAN'),
 (0.0028564746373201579, 'NEAR OCEAN'),
 (0.0019604155994780701, 'NEAR BAY'),
 (6.0280386727365991e-05, 'ISLAND')]

With this information, you may want to try dropping some of the less useful features (e.g., apparently only one ocean_proximity category is really useful, so you could try dropping the others).

You should also look at the specific errors that your system makes, then try to understand why it makes them and what could fix the problem (adding extra features or, on the contrary, getting rid of uninformative ones, cleaning up outliers, etc.).

Evaluate Your System on the Test Set

After tweaking your models for a while, you eventually have a system that performs sufficiently well. Now is the time to evaluate the final model on the test set. There is nothing special about this process; just get the predictors and the labels from your test set, run your full_pipeline to transform the data (call transform(), not fit_transform()!), and evaluate the final model on the test set:

final_model = grid_search.best_estimator_

X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()

X_test_prepared = full_pipeline.transform(X_test)

final_predictions = final_model.predict(X_test_prepared)

final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)   # => evaluates to 47,766.0

The performance will usually be slightly worse than what you measured using cross-validation if you did a lot of hyperparameter tuning (because your system ends up fine-tuned to perform well on the validation data, and will likely not perform as well on unknown datasets). It is not the case in this example, but when this happens you must resist the temptation to tweak the hyperparameters to make the numbers look good on the test set; the improvements would be unlikely to generalize to new data.

Now comes the project prelaunch phase: you need to present your solution (highlighting what you have learned, what worked and what did not, what assumptions were made, and what your system's limitations are), document everything, and create nice presentations with clear visualizations and easy-to-remember statements (e.g., "the median income is the number one predictor of housing prices").

Launch, Monitor, and Maintain Your System

Perfect, you got approval to launch! You need to get your solution ready for production, in particular by plugging the production input data sources into your system and writing tests.

You also need to write monitoring code to check your system's live performance at regular intervals and trigger alerts when it drops. This is important to catch not only sudden breakage, but also performance degradation. This is quite common because models tend to "rot" as data evolves over time, unless the models are regularly trained on fresh data.

Evaluating your system's performance will require sampling the system's predictions and evaluating them. This will generally require a human analysis. These analysts may be field experts, or workers on a crowdsourcing platform (such as Amazon Mechanical Turk or CrowdFlower). Either way, you need to plug the human evaluation pipeline into your system.

You should also make sure you evaluate the system's input data quality. Sometimes performance will degrade slightly because of a poor quality signal (e.g., a malfunctioning sensor sending random values, or another team's output becoming stale), but it may take a while before your system's performance degrades enough to trigger an alert. If you monitor your system's inputs, you may catch this earlier. Monitoring the inputs is particularly important for online learning systems.

Finally, you will generally want to train your models on a regular basis using fresh data. You should automate this process as much as possible. If you don't, you are very likely to refresh your model only every six months (at best), and your system's performance may fluctuate severely over time. If your system is an online learning system, you should make sure you save snapshots of its state at regular intervals so you can easily roll back to a previously working state.
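One lightweight way to do this, sketched here with hypothetical names, is to dump a timestamped copy of the model every time it is updated with fresh data, so any earlier state can be reloaded later:

import os
import time
from sklearn.externals import joblib

def save_snapshot(model, directory="model_snapshots"):
    # Dump a timestamped copy of the model so any earlier state can be restored later
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, "model_%d.pkl" % int(time.time()))
    joblib.dump(model, path)
    return path

# e.g., call save_snapshot(my_online_model) after each update on fresh data,
# and joblib.load() an older snapshot if live performance drops.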

Try It Out!

Hopefully this chapter gave you a good idea of what a Machine Learning project looks like, and showed you some of the tools you can use to train a great system. As you can see, much of the work is in the data preparation step, building monitoring tools, setting up human evaluation pipelines, and automating regular model training. The Machine Learning algorithms are also important, of course, but it is probably preferable to be comfortable with the overall process and know three or four algorithms well rather than to spend all your time exploring advanced algorithms and not enough time on the overall process.

So, if you have not already done so, now is a good time to pick up a laptop, select a dataset that you are interested in, and try to go through the whole process from A to Z. A good place to start is on a competition website such as http://kaggle.com/: you will have a dataset to play with, a clear goal, and people to share the experience with.

Exercises

Using this chapter's housing dataset:

1. Try a Support Vector Machine regressor (sklearn.svm.SVR), with various hyperparameters such as kernel="linear" (with various values for the C hyperparameter) or kernel="rbf" (with various values for the C and gamma hyperparameters). Don't worry about what these hyperparameters mean for now. How does the best SVR predictor perform?

2. Try replacing GridSearchCV with RandomizedSearchCV.

3. Try adding a transformer in the preparation pipeline to select only the most important attributes.

4. Try creating a single pipeline that does the full data preparation plus the final prediction.

5. Automatically explore some preparation options using GridSearchCV.

Solutions to these exercises are available in the online Jupyter notebooks at https://github.com/ageron/handson-ml.

The example project is completely fictitious; the goal is just to illustrate the main steps of a Machine Learning project, not to learn anything about the real estate business.

The original dataset appeared in R. Kelley Pace and Ronald Barry, "Sparse Spatial Autoregressions," Statistics & Probability Letters 33, no. 3 (1997): 291–297.

A piece of information fed to a Machine Learning system is often called a signal in reference to Shannon's information theory: you want a high signal/noise ratio.

Recall that the transpose operator flips a column vector into a row vector (and vice versa).

The latest version of Python 3 is recommended. Python 2.7+ should work fine too, but it is deprecated. If you use Python 2, you must add from __future__ import division, print_function, unicode_literals at the beginning of your code.

We will show the installation steps using pip in a bash shell on a Linux or macOS system. You may need to adapt these commands to your own system. On Windows, we recommend installing Anaconda instead.

You may need to have administrator rights to run this command; if so, try prefixing it with sudo.

Note that Jupyter can handle multiple versions of Python, and even many other languages such as R or Octave.

You might also need to check legal constraints, such as private fields that should never be copied to unsafe data stores.

In a real project you would save this code in a Python file, but for now you can just write it in your Jupyter notebook.

The standard deviation is generally denoted σ (the Greek letter sigma), and it is the square root of the variance, which is the average of the squared deviation from the mean. When a feature has a bell-shaped normal distribution (also called a Gaussian distribution), which is very common, the "68-95-99.7" rule applies: about 68% of the values fall within 1σ of the mean, 95% within 2σ, and 99.7% within 3σ.

You will often see people set the random seed to 42. This number has no special property, other than to be The Answer to the Ultimate Question of Life, the Universe, and Everything.

The location information is actually quite coarse, and as a result many districts will have the exact same ID, so they will end up in the same set (test or train). This introduces some unfortunate sampling bias.

If you are reading this in grayscale, grab a red pen and scribble over most of the coastline from the Bay Area down to San Diego (as you might expect). You can add a patch of yellow around Sacramento as well.

For more details on the design principles, see "API design for machine learning software: experiences from the scikit-learn project," L. Buitinck, G. Louppe, M. Blondel, F. Pedregosa, A. Müller, et al. (2013).

Some predictors also provide methods to measure the confidence of their predictions.

NumPy's reshape() function allows one dimension to be –1, which means "unspecified": the value is inferred from the length of the array and the remaining dimensions.

See SciPy's documentation for more details.

But check out Pull Request #3886, which may introduce a ColumnTransformer class making attribute-specific transformations easy. You could also run pip3 install sklearn-pandas to get a DataFrameMapper class with a similar objective.


Chapter 3. Classification

In Chapter 1 we mentioned that the most common supervised learning tasks are regression (predicting values) and classification (predicting classes). In Chapter 2 we explored a regression task, predicting housing values, using various algorithms such as Linear Regression, Decision Trees, and Random Forests (which will be explained in further detail in later chapters). Now we will turn our attention to classification systems.

MNIST

In this chapter, we will be using the MNIST dataset, which is a set of 70,000 small images of digits handwritten by high school students and employees of the US Census Bureau. Each image is labeled with the digit it represents. This set has been studied so much that it is often called the "Hello World" of Machine Learning: whenever people come up with a new classification algorithm, they are curious to see how it will perform on MNIST. Whenever someone learns Machine Learning, sooner or later they tackle MNIST.

Scikit-Learn provides many helper functions to download popular datasets. MNIST is one of them. The following code fetches the MNIST dataset:1

>>> from sklearn.datasets import fetch_mldata
>>> mnist = fetch_mldata('MNIST original')
>>> mnist
{'COL_NAMES': ['label', 'data'],
 'DESCR': 'mldata.org dataset: mnist-original',
 'data': array([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
 'target': array([ 0.,  0.,  0., ...,  9.,  9.,  9.])}

Datasets loaded by Scikit-Learn generally have a similar dictionary structure including:

A DESCR key describing the dataset

A data key containing an array with one row per instance and one column per feature

A target key containing an array with the labels

Let's look at these arrays:

>>> X, y = mnist["data"], mnist["target"]
>>> X.shape
(70000, 784)
>>> y.shape
(70000,)

There are 70,000 images, and each image has 784 features. This is because each image is 28 × 28 pixels, and each feature simply represents one pixel's intensity, from 0 (white) to 255 (black). Let's take a peek at one digit from the dataset. All you need to do is grab an instance's feature vector, reshape it to a 28 × 28 array, and display it using Matplotlib's imshow() function:

%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt

some_digit = X[36000]
some_digit_image = some_digit.reshape(28, 28)

plt.imshow(some_digit_image, cmap=matplotlib.cm.binary,
           interpolation="nearest")
plt.axis("off")
plt.show()

This looks like a 5, and indeed that's what the label tells us:

>>> y[36000]
5.0

Figure 3-1 shows a few more images from the MNIST dataset to give you a feel for the complexity of the classification task.

Figure 3-1. A few digits from the MNIST dataset

But wait! You should always create a test set and set it aside before inspecting the data closely. The MNIST dataset is actually already split into a training set (the first 60,000 images) and a test set (the last 10,000 images):

X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

Let's also shuffle the training set; this will guarantee that all cross-validation folds will be similar (you don't want one fold to be missing some digits). Moreover, some learning algorithms are sensitive to the order of the training instances, and they perform poorly if they get many similar instances in a row. Shuffling the dataset ensures that this won't happen:2

import numpy as np

shuffle_index = np.random.permutation(60000)
X_train, y_train = X_train[shuffle_index], y_train[shuffle_index]

Training a Binary Classifier

Let's simplify the problem for now and only try to identify one digit, for example the number 5. This "5-detector" will be an example of a binary classifier, capable of distinguishing between just two classes, 5 and not-5. Let's create the target vectors for this classification task:

y_train_5 = (y_train == 5)  # True for all 5s, False for all other digits.
y_test_5 = (y_test == 5)

Okay, now let's pick a classifier and train it. A good place to start is with a Stochastic Gradient Descent (SGD) classifier, using Scikit-Learn's SGDClassifier class. This classifier has the advantage of being capable of handling very large datasets efficiently. This is in part because SGD deals with training instances independently, one at a time (which also makes SGD well suited for online learning), as we will see later. Let's create an SGDClassifier and train it on the whole training set:

from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_5)

TIP The SGDClassifier relies on randomness during training (hence the name "stochastic"). If you want reproducible results, you should set the random_state parameter.

Now you can use it to detect images of the number 5:

>>> sgd_clf.predict([some_digit])
array([ True], dtype=bool)

The classifier guesses that this image represents a 5 (True). Looks like it guessed right in this particular case! Now, let's evaluate this model's performance.

Performance Measures

Evaluating a classifier is often significantly trickier than evaluating a regressor, so we will spend a large part of this chapter on this topic. There are many performance measures available, so grab another coffee and get ready to learn many new concepts and acronyms!

Measuring Accuracy Using Cross-Validation

A good way to evaluate a model is to use cross-validation, just as you did in Chapter 2.

IMPLEMENTING CROSS-VALIDATION

Occasionally you will need more control over the cross-validation process than what Scikit-Learn provides off-the-shelf. In these cases, you can implement cross-validation yourself; it is actually fairly straightforward. The following code does roughly the same thing as Scikit-Learn's cross_val_score() function, and prints the same result:

from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone

skfolds = StratifiedKFold(n_splits=3, random_state=42)

for train_index, test_index in skfolds.split(X_train, y_train_5):
    clone_clf = clone(sgd_clf)
    X_train_folds = X_train[train_index]
    y_train_folds = (y_train_5[train_index])
    X_test_fold = X_train[test_index]
    y_test_fold = (y_train_5[test_index])

    clone_clf.fit(X_train_folds, y_train_folds)
    y_pred = clone_clf.predict(X_test_fold)
    n_correct = sum(y_pred == y_test_fold)
    print(n_correct / len(y_pred))  # prints 0.9502, 0.96565 and 0.96495

The StratifiedKFold class performs stratified sampling (as explained in Chapter 2) to produce folds that contain a representative ratio of each class. At each iteration the code creates a clone of the classifier, trains that clone on the training folds, and makes predictions on the test fold. Then it counts the number of correct predictions and outputs the ratio of correct predictions.

Let’susethecross_val_score()functiontoevaluateyourSGDClassifiermodelusingK-foldcross-validation,withthreefolds.RememberthatK-foldcross-validationmeanssplittingthetrainingsetintoK-folds(inthiscase,three),thenmakingpredictionsandevaluatingthemoneachfoldusingamodeltrainedontheremainingfolds(seeChapter2):

>>>fromsklearn.model_selectionimportcross_val_score

>>>cross_val_score(sgd_clf,X_train,y_train_5,cv=3,scoring="accuracy")

array([0.9502,0.96565,0.96495])

Wow!Above95%accuracy(ratioofcorrectpredictions)onallcross-validationfolds?Thislooksamazing,doesn’tit?Well,beforeyougettooexcited,let’slookataverydumbclassifierthatjustclassifieseverysingleimageinthe“not-5”class:

fromsklearn.baseimportBaseEstimator

classNever5Classifier(BaseEstimator):

deffit(self,X,y=None):

pass

defpredict(self,X):

returnnp.zeros((len(X),1),dtype=bool)

Canyouguessthismodel’saccuracy?Let’sfindout:

>>>never_5_clf=Never5Classifier()

>>>cross_val_score(never_5_clf,X_train,y_train_5,cv=3,scoring="accuracy")

array([0.909,0.90715,0.9128])

That’sright,ithasover90%accuracy!Thisissimplybecauseonlyabout10%oftheimagesare5s,soifyoualwaysguessthatanimageisnota5,youwillberightabout90%ofthetime.BeatsNostradamus.

Thisdemonstrateswhyaccuracyisgenerallynotthepreferredperformancemeasureforclassifiers,especiallywhenyouaredealingwithskeweddatasets(i.e.,whensomeclassesaremuchmorefrequentthanothers).

ConfusionMatrixAmuchbetterwaytoevaluatetheperformanceofaclassifieristolookattheconfusionmatrix.ThegeneralideaistocountthenumberoftimesinstancesofclassAareclassifiedasclassB.Forexample,toknowthenumberoftimestheclassifierconfusedimagesof5swith3s,youwouldlookinthe5throwand3rdcolumnoftheconfusionmatrix.

Tocomputetheconfusionmatrix,youfirstneedtohaveasetofpredictions,sotheycanbecomparedtotheactualtargets.Youcouldmakepredictionsonthetestset,butlet’skeepituntouchedfornow(rememberthatyouwanttousethetestsetonlyattheveryendofyourproject,onceyouhaveaclassifierthatyouarereadytolaunch).Instead,youcanusethecross_val_predict()function:

fromsklearn.model_selectionimportcross_val_predict

y_train_pred=cross_val_predict(sgd_clf,X_train,y_train_5,cv=3)

Justlikethecross_val_score()function,cross_val_predict()performsK-foldcross-validation,butinsteadofreturningtheevaluationscores,itreturnsthepredictionsmadeoneachtestfold.Thismeansthatyougetacleanpredictionforeachinstanceinthetrainingset(“clean”meaningthatthepredictionismadebyamodelthatneversawthedataduringtraining).

Nowyouarereadytogettheconfusionmatrixusingtheconfusion_matrix()function.Justpassitthetargetclasses(y_train_5)andthepredictedclasses(y_train_pred):

>>>fromsklearn.metricsimportconfusion_matrix

>>>confusion_matrix(y_train_5,y_train_pred)

array([[53272,1307],

[1077,4344]])

Eachrowinaconfusionmatrixrepresentsanactualclass,whileeachcolumnrepresentsapredictedclass.Thefirstrowofthismatrixconsidersnon-5images(thenegativeclass):53,272ofthemwerecorrectlyclassifiedasnon-5s(theyarecalledtruenegatives),whiletheremaining1,307werewronglyclassifiedas5s(falsepositives).Thesecondrowconsiderstheimagesof5s(thepositiveclass):1,077werewronglyclassifiedasnon-5s(falsenegatives),whiletheremaining4,344werecorrectlyclassifiedas5s(truepositives).Aperfectclassifierwouldhaveonlytruepositivesandtruenegatives,soitsconfusionmatrixwouldhavenonzerovaluesonlyonitsmaindiagonal(toplefttobottomright):

>>>confusion_matrix(y_train_5,y_train_perfect_predictions)

array([[54579,0],

[0,5421]])

Theconfusionmatrixgivesyoualotofinformation,butsometimesyoumaypreferamoreconcisemetric.Aninterestingonetolookatistheaccuracyofthepositivepredictions;thisiscalledtheprecisionoftheclassifier(Equation3-1).

Equation3-1.Precision

TPisthenumberoftruepositives,andFPisthenumberoffalsepositives.

Atrivialwaytohaveperfectprecisionistomakeonesinglepositivepredictionandensureitiscorrect(precision=1/1=100%).Thiswouldnotbeveryusefulsincetheclassifierwouldignoreallbutonepositiveinstance.Soprecisionistypicallyusedalongwithanothermetricnamedrecall,alsocalledsensitivityortruepositiverate(TPR):thisistheratioofpositiveinstancesthatarecorrectlydetectedbytheclassifier(Equation3-2).

Equation3-2.Recall

FNisofcoursethenumberoffalsenegatives.

If you are confused about the confusion matrix, Figure 3-2 may help.

Figure 3-2. An illustrated confusion matrix

Precision and Recall

Scikit-Learn provides several functions to compute classifier metrics, including precision and recall:

>>> from sklearn.metrics import precision_score, recall_score
>>> precision_score(y_train_5, y_train_pred)  # == 4344 / (4344 + 1307)
0.76871350203503808
>>> recall_score(y_train_5, y_train_pred)  # == 4344 / (4344 + 1077)
0.80132816823464303

Now your 5-detector does not look as shiny as it did when you looked at its accuracy. When it claims an image represents a 5, it is correct only 77% of the time. Moreover, it only detects 80% of the 5s.

It is often convenient to combine precision and recall into a single metric called the F1 score, in particular if you need a simple way to compare two classifiers. The F1 score is the harmonic mean of precision and recall (Equation 3-3). Whereas the regular mean treats all values equally, the harmonic mean gives much more weight to low values. As a result, the classifier will only get a high F1 score if both recall and precision are high.

Equation 3-3. F1 score

    F1 = 2 / (1/precision + 1/recall) = 2 × (precision × recall) / (precision + recall) = TP / (TP + (FN + FP)/2)

To compute the F1 score, simply call the f1_score() function:

>>> from sklearn.metrics import f1_score
>>> f1_score(y_train_5, y_train_pred)
0.78468208092485547

The F1 score favors classifiers that have similar precision and recall. This is not always what you want: in some contexts you mostly care about precision, and in other contexts you really care about recall. For example, if you trained a classifier to detect videos that are safe for kids, you would probably prefer a classifier that rejects many good videos (low recall) but keeps only safe ones (high precision), rather than a classifier that has a much higher recall but lets a few really bad videos show up in your product (in such cases, you may even want to add a human pipeline to check the classifier's video selection). On the other hand, suppose you train a classifier to detect shoplifters on surveillance images: it is probably fine if your classifier has only 30% precision as long as it has 99% recall (sure, the security guards will get a few false alerts, but almost all shoplifters will get caught).

Unfortunately, you can't have it both ways: increasing precision reduces recall, and vice versa. This is called the precision/recall tradeoff.

Precision/Recall Tradeoff

To understand this tradeoff, let's look at how the SGDClassifier makes its classification decisions. For each instance, it computes a score based on a decision function, and if that score is greater than a threshold, it assigns the instance to the positive class, or else it assigns it to the negative class. Figure 3-3 shows a few digits positioned from the lowest score on the left to the highest score on the right. Suppose the decision threshold is positioned at the central arrow (between the two 5s): you will find 4 true positives (actual 5s) on the right of that threshold, and one false positive (actually a 6). Therefore, with that threshold, the precision is 80% (4 out of 5). But out of 6 actual 5s, the classifier only detects 4, so the recall is 67% (4 out of 6). Now if you raise the threshold (move it to the arrow on the right), the false positive (the 6) becomes a true negative, thereby increasing precision (up to 100% in this case), but one true positive becomes a false negative, decreasing recall down to 50%. Conversely, lowering the threshold increases recall and reduces precision.

Figure 3-3. Decision threshold and precision/recall tradeoff

Scikit-Learn does not let you set the threshold directly, but it does give you access to the decision scores that it uses to make predictions. Instead of calling the classifier's predict() method, you can call its decision_function() method, which returns a score for each instance, and then make predictions based on those scores using any threshold you want:

>>> y_scores = sgd_clf.decision_function([some_digit])
>>> y_scores
array([ 161855.74572176])
>>> threshold = 0
>>> y_some_digit_pred = (y_scores > threshold)
>>> y_some_digit_pred
array([ True], dtype=bool)

The SGDClassifier uses a threshold equal to 0, so the previous code returns the same result as the predict() method (i.e., True). Let's raise the threshold:

>>> threshold = 200000
>>> y_some_digit_pred = (y_scores > threshold)
>>> y_some_digit_pred
array([False], dtype=bool)

This confirms that raising the threshold decreases recall. The image actually represents a 5, and the classifier detects it when the threshold is 0, but it misses it when the threshold is increased to 200,000.

So how can you decide which threshold to use? For this you will first need to get the scores of all instances in the training set using the cross_val_predict() function again, but this time specifying that you want it to return decision scores instead of predictions:

y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3,
                             method="decision_function")

Now with these scores you can compute precision and recall for all possible thresholds using the precision_recall_curve() function:

from sklearn.metrics import precision_recall_curve

precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)

Finally, you can plot precision and recall as functions of the threshold value using Matplotlib (Figure 3-4):

def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision")
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall")
    plt.xlabel("Threshold")
    plt.legend(loc="upper left")
    plt.ylim([0, 1])

plot_precision_recall_vs_threshold(precisions, recalls, thresholds)
plt.show()

Figure 3-4. Precision and recall versus the decision threshold

NOTE You may wonder why the precision curve is bumpier than the recall curve in Figure 3-4. The reason is that precision may sometimes go down when you raise the threshold (although in general it will go up). To understand why, look back at Figure 3-3 and notice what happens when you start from the central threshold and move it just one digit to the right: precision goes from 4/5 (80%) down to 3/4 (75%). On the other hand, recall can only go down when the threshold is increased, which explains why its curve looks smooth.

Now you can simply select the threshold value that gives you the best precision/recall tradeoff for your task. Another way to select a good precision/recall tradeoff is to plot precision directly against recall, as shown in Figure 3-5.

Figure 3-5. Precision versus recall

You can see that precision really starts to fall sharply around 80% recall. You will probably want to select a precision/recall tradeoff just before that drop, for example at around 60% recall. But of course the choice depends on your project.
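The code for Figure 3-5 is not listed in the text, but a minimal sketch of such a plot (reusing the precisions and recalls arrays computed above) could look like this:

def plot_precision_vs_recall(precisions, recalls):
    plt.plot(recalls, precisions, "b-", linewidth=2)   # precision as a function of recall
    plt.xlabel("Recall")
    plt.ylabel("Precision")
    plt.axis([0, 1, 0, 1])

plot_precision_vs_recall(precisions, recalls)
plt.show()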

Solet’ssupposeyoudecidetoaimfor90%precision.Youlookupthefirstplot(zoominginabit)andfindthatyouneedtouseathresholdofabout70,000.Tomakepredictions(onthetrainingsetfornow),insteadofcallingtheclassifier’spredict()method,youcanjustrunthiscode:

y_train_pred_90=(y_scores>70000)

Let’scheckthesepredictions’precisionandrecall:

>>>precision_score(y_train_5,y_train_pred_90)

0.86592051164915484

>>>recall_score(y_train_5,y_train_pred_90)

0.69931746910164172

Great,youhavea90%precisionclassifier(orcloseenough)!Asyoucansee,itisfairlyeasytocreateaclassifierwithvirtuallyanyprecisionyouwant:justsetahighenoughthreshold,andyou’redone.Hmm,notsofast.Ahigh-precisionclassifierisnotveryusefulifitsrecallistoolow!

TIPIfsomeonesays“let’sreach99%precision,”youshouldask,“atwhatrecall?”

The ROC Curve

The receiver operating characteristic (ROC) curve is another common tool used with binary classifiers. It is very similar to the precision/recall curve, but instead of plotting precision versus recall, the ROC curve plots the true positive rate (another name for recall) against the false positive rate. The FPR is the ratio of negative instances that are incorrectly classified as positive. It is equal to one minus the true negative rate, which is the ratio of negative instances that are correctly classified as negative. The TNR is also called specificity. Hence the ROC curve plots sensitivity (recall) versus 1 – specificity.

To plot the ROC curve, you first need to compute the TPR and FPR for various threshold values, using the roc_curve() function:

from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)

Then you can plot the FPR against the TPR using Matplotlib. This code produces the plot in Figure 3-6:

def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'k--')
    plt.axis([0, 1, 0, 1])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')

plot_roc_curve(fpr, tpr)
plt.show()

Figure 3-6. ROC curve

Once again there is a tradeoff: the higher the recall (TPR), the more false positives (FPR) the classifier produces. The dotted line represents the ROC curve of a purely random classifier; a good classifier stays as far away from that line as possible (toward the top-left corner).

One way to compare classifiers is to measure the area under the curve (AUC). A perfect classifier will have a ROC AUC equal to 1, whereas a purely random classifier will have a ROC AUC equal to 0.5. Scikit-Learn provides a function to compute the ROC AUC:

>>> from sklearn.metrics import roc_auc_score
>>> roc_auc_score(y_train_5, y_scores)
0.96244965559671547

TIP Since the ROC curve is so similar to the precision/recall (or PR) curve, you may wonder how to decide which one to use. As a rule of thumb, you should prefer the PR curve whenever the positive class is rare or when you care more about the false positives than the false negatives, and the ROC curve otherwise. For example, looking at the previous ROC curve (and the ROC AUC score), you may think that the classifier is really good. But this is mostly because there are few positives (5s) compared to the negatives (non-5s). In contrast, the PR curve makes it clear that the classifier has room for improvement (the curve could be closer to the top-right corner).

Let’strainaRandomForestClassifierandcompareitsROCcurveandROCAUCscoretotheSGDClassifier.First,youneedtogetscoresforeachinstanceinthetrainingset.Butduetothewayitworks(seeChapter7),theRandomForestClassifierclassdoesnothaveadecision_function()method.Insteadithasapredict_proba()method.Scikit-Learnclassifiersgenerallyhaveoneortheother.Thepredict_proba()methodreturnsanarraycontainingarowperinstanceandacolumnperclass,eachcontainingtheprobabilitythatthegiveninstancebelongstothegivenclass(e.g.,70%chancethattheimagerepresentsa5):

fromsklearn.ensembleimportRandomForestClassifier

forest_clf=RandomForestClassifier(random_state=42)

y_probas_forest=cross_val_predict(forest_clf,X_train,y_train_5,cv=3,

method="predict_proba")

ButtoplotaROCcurve,youneedscores,notprobabilities.Asimplesolutionistousethepositiveclass’sprobabilityasthescore:

y_scores_forest=y_probas_forest[:,1]#score=probaofpositiveclass

fpr_forest,tpr_forest,thresholds_forest=roc_curve(y_train_5,y_scores_forest)

NowyouarereadytoplottheROCcurve.ItisusefultoplotthefirstROCcurveaswelltoseehowtheycompare(Figure3-7):

plt.plot(fpr,tpr,"b:",label="SGD")

plot_roc_curve(fpr_forest,tpr_forest,"RandomForest")

plt.legend(loc="lowerright")

plt.show()

Figure3-7.ComparingROCcurves

AsyoucanseeinFigure3-7,theRandomForestClassifier’sROCcurvelooksmuchbetterthantheSGDClassifier’s:itcomesmuchclosertothetop-leftcorner.Asaresult,itsROCAUCscoreisalsosignificantlybetter:

>>>roc_auc_score(y_train_5,y_scores_forest)

0.99312433660038291

Trymeasuringtheprecisionandrecallscores:youshouldfind98.5%precisionand82.8%recall.Nottoobad!

Hopefullyyounowknowhowtotrainbinaryclassifiers,choosetheappropriatemetricforyourtask,evaluateyourclassifiersusingcross-validation,selecttheprecision/recalltradeoffthatfitsyourneeds,andcomparevariousmodelsusingROCcurvesandROCAUCscores.Nowlet’strytodetectmorethanjustthe5s.

Multiclass Classification

Whereas binary classifiers distinguish between two classes, multiclass classifiers (also called multinomial classifiers) can distinguish between more than two classes.

Some algorithms (such as Random Forest classifiers or naive Bayes classifiers) are capable of handling multiple classes directly. Others (such as Support Vector Machine classifiers or Linear classifiers) are strictly binary classifiers. However, there are various strategies that you can use to perform multiclass classification using multiple binary classifiers.

For example, one way to create a system that can classify the digit images into 10 classes (from 0 to 9) is to train 10 binary classifiers, one for each digit (a 0-detector, a 1-detector, a 2-detector, and so on). Then when you want to classify an image, you get the decision score from each classifier for that image and you select the class whose classifier outputs the highest score. This is called the one-versus-all (OvA) strategy (also called one-versus-the-rest).

Another strategy is to train a binary classifier for every pair of digits: one to distinguish 0s and 1s, another to distinguish 0s and 2s, another for 1s and 2s, and so on. This is called the one-versus-one (OvO) strategy. If there are N classes, you need to train N × (N – 1) / 2 classifiers. For the MNIST problem, this means training 45 binary classifiers! When you want to classify an image, you have to run the image through all 45 classifiers and see which class wins the most duels. The main advantage of OvO is that each classifier only needs to be trained on the part of the training set for the two classes that it must distinguish.

Some algorithms (such as Support Vector Machine classifiers) scale poorly with the size of the training set, so for these algorithms OvO is preferred since it is faster to train many classifiers on small training sets than training few classifiers on large training sets. For most binary classification algorithms, however, OvA is preferred.

Scikit-Learn detects when you try to use a binary classification algorithm for a multiclass classification task, and it automatically runs OvA (except for SVM classifiers for which it uses OvO). Let's try this with the SGDClassifier:

>>> sgd_clf.fit(X_train, y_train)  # y_train, not y_train_5
>>> sgd_clf.predict([some_digit])
array([ 5.])

That was easy! This code trains the SGDClassifier on the training set using the original target classes from 0 to 9 (y_train), instead of the 5-versus-all target classes (y_train_5). Then it makes a prediction (a correct one in this case). Under the hood, Scikit-Learn actually trained 10 binary classifiers, got their decision scores for the image, and selected the class with the highest score.

To see that this is indeed the case, you can call the decision_function() method. Instead of returning just one score per instance, it now returns 10 scores, one per class:

>>> some_digit_scores = sgd_clf.decision_function([some_digit])
>>> some_digit_scores
array([[-311402.62954431, -363517.28355739, -446449.5306454 ,
        -183226.61023518, -414337.15339485,  161855.74572176,
        -452576.39616343, -471957.14962573, -518542.33997148,
        -536774.63961222]])

The highest score is indeed the one corresponding to class 5:

>>> np.argmax(some_digit_scores)
5
>>> sgd_clf.classes_
array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9.])
>>> sgd_clf.classes_[5]
5.0

WARNING When a classifier is trained, it stores the list of target classes in its classes_ attribute, ordered by value. In this case, the index of each class in the classes_ array conveniently matches the class itself (e.g., the class at index 5 happens to be class 5), but in general you won't be so lucky.
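In the general case, the safe pattern is therefore to map the winning index back through the classes_ array rather than using the index itself as the label; a one-line sketch:

# Map the index of the highest score back to the actual class label:
predicted_class = sgd_clf.classes_[np.argmax(some_digit_scores)]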

If you want to force Scikit-Learn to use one-versus-one or one-versus-all, you can use the OneVsOneClassifier or OneVsRestClassifier classes. Simply create an instance and pass a binary classifier to its constructor. For example, this code creates a multiclass classifier using the OvO strategy, based on a SGDClassifier:

>>> from sklearn.multiclass import OneVsOneClassifier
>>> ovo_clf = OneVsOneClassifier(SGDClassifier(random_state=42))
>>> ovo_clf.fit(X_train, y_train)
>>> ovo_clf.predict([some_digit])
array([ 5.])
>>> len(ovo_clf.estimators_)
45

Training a RandomForestClassifier is just as easy:

>>> forest_clf.fit(X_train, y_train)
>>> forest_clf.predict([some_digit])
array([ 5.])

This time Scikit-Learn did not have to run OvA or OvO because Random Forest classifiers can directly classify instances into multiple classes. You can call predict_proba() to get the list of probabilities that the classifier assigned to each instance for each class:

>>> forest_clf.predict_proba([some_digit])
array([[ 0.1,  0. ,  0. ,  0.1,  0. ,  0.8,  0. ,  0. ,  0. ,  0. ]])

You can see that the classifier is fairly confident about its prediction: the 0.8 at the 5th index in the array means that the model estimates an 80% probability that the image represents a 5. It also thinks that the image could instead be a 0 or a 3 (10% chance each).

Now of course you want to evaluate these classifiers. As usual, you want to use cross-validation. Let's evaluate the SGDClassifier's accuracy using the cross_val_score() function:

>>> cross_val_score(sgd_clf, X_train, y_train, cv=3, scoring="accuracy")
array([ 0.84063187,  0.84899245,  0.86652998])

It gets over 84% on all test folds. If you used a random classifier, you would get 10% accuracy, so this is not such a bad score, but you can still do much better. For example, simply scaling the inputs (as discussed in Chapter 2) increases accuracy above 90%:

>>> from sklearn.preprocessing import StandardScaler
>>> scaler = StandardScaler()
>>> X_train_scaled = scaler.fit_transform(X_train.astype(np.float64))
>>> cross_val_score(sgd_clf, X_train_scaled, y_train, cv=3, scoring="accuracy")
array([ 0.91011798,  0.90874544,  0.906636  ])

Error Analysis

Of course, if this were a real project, you would follow the steps in your Machine Learning project checklist (see Appendix B): exploring data preparation options, trying out multiple models, shortlisting the best ones and fine-tuning their hyperparameters using GridSearchCV, and automating as much as possible, as you did in the previous chapter. Here, we will assume that you have found a promising model and you want to find ways to improve it. One way to do this is to analyze the types of errors it makes.

First, you can look at the confusion matrix. You need to make predictions using the cross_val_predict() function, then call the confusion_matrix() function, just like you did earlier:

>>> y_train_pred = cross_val_predict(sgd_clf, X_train_scaled, y_train, cv=3)
>>> conf_mx = confusion_matrix(y_train, y_train_pred)
>>> conf_mx
array([[5725,    3,   24,    9,   10,   49,   50,   10,   39,    4],
       [   2, 6493,   43,   25,    7,   40,    5,   10,  109,    8],
       [  51,   41, 5321,  104,   89,   26,   87,   60,  166,   13],
       [  47,   46,  141, 5342,    1,  231,   40,   50,  141,   92],
       [  19,   29,   41,   10, 5366,    9,   56,   37,   86,  189],
       [  73,   45,   36,  193,   64, 4582,  111,   30,  193,   94],
       [  29,   34,   44,    2,   42,   85, 5627,   10,   45,    0],
       [  25,   24,   74,   32,   54,   12,    6, 5787,   15,  236],
       [  52,  161,   73,  156,   10,  163,   61,   25, 5027,  123],
       [  43,   35,   26,   92,  178,   28,    2,  223,   82, 5240]])

That's a lot of numbers. It's often more convenient to look at an image representation of the confusion matrix, using Matplotlib's matshow() function:

plt.matshow(conf_mx, cmap=plt.cm.gray)
plt.show()

This confusion matrix looks fairly good, since most images are on the main diagonal, which means that they were classified correctly. The 5s look slightly darker than the other digits, which could mean that there are fewer images of 5s in the dataset or that the classifier does not perform as well on 5s as on other digits. In fact, you can verify that both are the case.

Let's focus the plot on the errors. First, you need to divide each value in the confusion matrix by the number of images in the corresponding class, so you can compare error rates instead of absolute number of errors (which would make abundant classes look unfairly bad):

row_sums = conf_mx.sum(axis=1, keepdims=True)
norm_conf_mx = conf_mx / row_sums

Now let's fill the diagonal with zeros to keep only the errors, and let's plot the result:

np.fill_diagonal(norm_conf_mx, 0)
plt.matshow(norm_conf_mx, cmap=plt.cm.gray)
plt.show()

Now you can clearly see the kinds of errors the classifier makes. Remember that rows represent actual classes, while columns represent predicted classes. The columns for classes 8 and 9 are quite bright, which tells you that many images get misclassified as 8s or 9s. Similarly, the rows for classes 8 and 9 are also quite bright, telling you that 8s and 9s are often confused with other digits. Conversely, some rows are pretty dark, such as row 1: this means that most 1s are classified correctly (a few are confused with 8s, but that's about it). Notice that the errors are not perfectly symmetrical; for example, there are more 5s misclassified as 8s than the reverse.

Analyzing the confusion matrix can often give you insights on ways to improve your classifier. Looking at this plot, it seems that your efforts should be spent on improving classification of 8s and 9s, as well as fixing the specific 3/5 confusion. For example, you could try to gather more training data for these digits. Or you could engineer new features that would help the classifier, for example by writing an algorithm to count the number of closed loops (e.g., 8 has two, 6 has one, 5 has none). Or you could preprocess the images (e.g., using Scikit-Image, Pillow, or OpenCV) to make some patterns stand out more, such as closed loops.

Analyzing individual errors can also be a good way to gain insights on what your classifier is doing and why it is failing, but it is more difficult and time-consuming. For example, let's plot examples of 3s and 5s (the plot_digits() function just uses Matplotlib's imshow() function; see this chapter's Jupyter notebook for details):

cl_a, cl_b = 3, 5
X_aa = X_train[(y_train == cl_a) & (y_train_pred == cl_a)]
X_ab = X_train[(y_train == cl_a) & (y_train_pred == cl_b)]
X_ba = X_train[(y_train == cl_b) & (y_train_pred == cl_a)]
X_bb = X_train[(y_train == cl_b) & (y_train_pred == cl_b)]

plt.figure(figsize=(8, 8))
plt.subplot(221); plot_digits(X_aa[:25], images_per_row=5)
plt.subplot(222); plot_digits(X_ab[:25], images_per_row=5)
plt.subplot(223); plot_digits(X_ba[:25], images_per_row=5)
plt.subplot(224); plot_digits(X_bb[:25], images_per_row=5)
plt.show()

The two 5 × 5 blocks on the left show digits classified as 3s, and the two 5 × 5 blocks on the right show images classified as 5s. Some of the digits that the classifier gets wrong (i.e., in the bottom-left and top-right blocks) are so badly written that even a human would have trouble classifying them (e.g., the 5 on the 8th row and 1st column truly looks like a 3). However, most misclassified images seem like obvious errors to us, and it's hard to understand why the classifier made the mistakes it did.3 The reason is that we used a simple SGDClassifier, which is a linear model. All it does is assign a weight per class to each pixel, and when it sees a new image it just sums up the weighted pixel intensities to get a score for each class. So since 3s and 5s differ only by a few pixels, this model will easily confuse them.

The main difference between 3s and 5s is the position of the small line that joins the top line to the bottom arc. If you draw a 3 with the junction slightly shifted to the left, the classifier might classify it as a 5, and vice versa. In other words, this classifier is quite sensitive to image shifting and rotation. So one way to reduce the 3/5 confusion would be to preprocess the images to ensure that they are well centered and not too rotated. This will probably help reduce other errors as well.
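For instance, here is a rough sketch of one possible centering step (not from the book; it uses SciPy's center-of-mass and shift functions to nudge each 28 × 28 image so that its ink sits in the middle):

import numpy as np
from scipy.ndimage import center_of_mass, shift

def center_digit(flat_image):
    # Shift a flattened 28x28 MNIST image so its center of mass sits at the image center
    image = flat_image.reshape(28, 28)
    cy, cx = center_of_mass(image)        # where the "ink" currently is
    dy, dx = 14 - cy, 14 - cx             # offset to the center of the image
    return shift(image, [dy, dx], cval=0).reshape(784)

X_train_centered = np.apply_along_axis(center_digit, axis=1, arr=X_train)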

Multilabel Classification

Until now each instance has always been assigned to just one class. In some cases you may want your classifier to output multiple classes for each instance. For example, consider a face-recognition classifier: what should it do if it recognizes several people on the same picture? Of course it should attach one label per person it recognizes. Say the classifier has been trained to recognize three faces, Alice, Bob, and Charlie; then when it is shown a picture of Alice and Charlie, it should output [1, 0, 1] (meaning "Alice yes, Bob no, Charlie yes"). Such a classification system that outputs multiple binary labels is called a multilabel classification system.

We won't go into face recognition just yet, but let's look at a simpler example, just for illustration purposes:

from sklearn.neighbors import KNeighborsClassifier

y_train_large = (y_train >= 7)
y_train_odd = (y_train % 2 == 1)
y_multilabel = np.c_[y_train_large, y_train_odd]

knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_multilabel)

This code creates a y_multilabel array containing two target labels for each digit image: the first indicates whether or not the digit is large (7, 8, or 9) and the second indicates whether or not it is odd. The next lines create a KNeighborsClassifier instance (which supports multilabel classification, but not all classifiers do) and we train it using the multiple targets array. Now you can make a prediction, and notice that it outputs two labels:

>>> knn_clf.predict([some_digit])
array([[False,  True]], dtype=bool)

And it gets it right! The digit 5 is indeed not large (False) and odd (True).

There are many ways to evaluate a multilabel classifier, and selecting the right metric really depends on your project. For example, one approach is to measure the F1 score for each individual label (or any other binary classifier metric discussed earlier), then simply compute the average score. This code computes the average F1 score across all labels:

>>> y_train_knn_pred = cross_val_predict(knn_clf, X_train, y_train, cv=3)
>>> f1_score(y_train, y_train_knn_pred, average="macro")
0.96845540180280221

This assumes that all labels are equally important, which may not be the case. In particular, if you have many more pictures of Alice than of Bob or Charlie, you may want to give more weight to the classifier's score on pictures of Alice. One simple option is to give each label a weight equal to its support (i.e., the number of instances with that target label). To do this, simply set average="weighted" in the preceding code.4
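Concretely, that is just a change of the average argument (no output shown here, since the exact score depends on the predictions computed above):

f1_score(y_train, y_train_knn_pred, average="weighted")  # each label weighted by its support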

Multioutput Classification

The last type of classification task we are going to discuss here is called multioutput-multiclass classification (or simply multioutput classification). It is simply a generalization of multilabel classification where each label can be multiclass (i.e., it can have more than two possible values).

To illustrate this, let's build a system that removes noise from images. It will take as input a noisy digit image, and it will (hopefully) output a clean digit image, represented as an array of pixel intensities, just like the MNIST images. Notice that the classifier's output is multilabel (one label per pixel) and each label can have multiple values (pixel intensity ranges from 0 to 255). It is thus an example of a multioutput classification system.

NOTE The line between classification and regression is sometimes blurry, such as in this example. Arguably, predicting pixel intensity is more akin to regression than to classification. Moreover, multioutput systems are not limited to classification tasks; you could even have a system that outputs multiple labels per instance, including both class labels and value labels.

Let's start by creating the training and test sets by taking the MNIST images and adding noise to their pixel intensities using NumPy's randint() function. The target images will be the original images:

noise = np.random.randint(0, 100, (len(X_train), 784))
X_train_mod = X_train + noise
noise = np.random.randint(0, 100, (len(X_test), 784))
X_test_mod = X_test + noise
y_train_mod = X_train
y_test_mod = X_test

Let's take a peek at an image from the test set (yes, we're snooping on the test data, so you should be frowning right now):

On the left is the noisy input image, and on the right is the clean target image. Now let's train the classifier and make it clean this image:

knn_clf.fit(X_train_mod, y_train_mod)
clean_digit = knn_clf.predict([X_test_mod[some_index]])
plot_digit(clean_digit)

Looks close enough to the target! This concludes our tour of classification. Hopefully you should now know how to select good metrics for classification tasks, pick the appropriate precision/recall tradeoff, compare classifiers, and more generally build good classification systems for a variety of tasks.

Exercises

1. Try to build a classifier for the MNIST dataset that achieves over 97% accuracy on the test set. Hint: the KNeighborsClassifier works quite well for this task; you just need to find good hyperparameter values (try a grid search on the weights and n_neighbors hyperparameters).

2. Write a function that can shift an MNIST image in any direction (left, right, up, or down) by one pixel.5 Then, for each image in the training set, create four shifted copies (one per direction) and add them to the training set. Finally, train your best model on this expanded training set and measure its accuracy on the test set. You should observe that your model performs even better now! This technique of artificially growing the training set is called data augmentation or training set expansion.

3. Tackle the Titanic dataset. A great place to start is on Kaggle.

4. Build a spam classifier (a more challenging exercise):

Download examples of spam and ham from Apache SpamAssassin's public datasets.

Unzip the datasets and familiarize yourself with the data format.

Split the datasets into a training set and a test set.

Write a data preparation pipeline to convert each email into a feature vector. Your preparation pipeline should transform an email into a (sparse) vector indicating the presence or absence of each possible word. For example, if all emails only ever contain four words, "Hello," "how," "are," "you," then the email "Hello you Hello Hello you" would be converted into a vector [1, 0, 0, 1] (meaning ["Hello" is present, "how" is absent, "are" is absent, "you" is present]), or [3, 0, 0, 2] if you prefer to count the number of occurrences of each word.

You may want to add hyperparameters to your preparation pipeline to control whether or not to strip off email headers, convert each email to lowercase, remove punctuation, replace all URLs with "URL," replace all numbers with "NUMBER," or even perform stemming (i.e., trim off word endings; there are Python libraries available to do this).

Then try out several classifiers and see if you can build a great spam classifier, with both high recall and high precision.

Solutions to these exercises are available in the online Jupyter notebooks at https://github.com/ageron/handson-ml.

By default Scikit-Learn caches downloaded datasets in a directory called $HOME/scikit_learn_data.

Shuffling may be a bad idea in some contexts, for example if you are working on time series data (such as stock market prices or weather conditions). We will explore this in the next chapters.

But remember that our brain is a fantastic pattern recognition system, and our visual system does a lot of complex preprocessing before any information reaches our consciousness, so the fact that it feels simple does not mean that it is.


Scikit-Learn offers a few other averaging options and multilabel classifier metrics; see the documentation for more details.

You can use the shift() function from the scipy.ndimage.interpolation module. For example, shift(image, [2, 1], cval=0) shifts the image 2 pixels down and 1 pixel to the right.


Chapter4.TrainingModels

SofarwehavetreatedMachineLearningmodelsandtheirtrainingalgorithmsmostlylikeblackboxes.Ifyouwentthroughsomeoftheexercisesinthepreviouschapters,youmayhavebeensurprisedbyhowmuchyoucangetdonewithoutknowinganythingaboutwhat’sunderthehood:youoptimizedaregressionsystem,youimprovedadigitimageclassifier,andyouevenbuiltaspamclassifierfromscratch—allthiswithoutknowinghowtheyactuallywork.Indeed,inmanysituationsyoudon’treallyneedtoknowtheimplementationdetails.

However,havingagoodunderstandingofhowthingsworkcanhelpyouquicklyhomeinontheappropriatemodel,therighttrainingalgorithmtouse,andagoodsetofhyperparametersforyourtask.Understandingwhat’sunderthehoodwillalsohelpyoudebugissuesandperformerroranalysismoreefficiently.Lastly,mostofthetopicsdiscussedinthischapterwillbeessentialinunderstanding,building,andtrainingneuralnetworks(discussedinPartIIofthisbook).

Inthischapter,wewillstartbylookingattheLinearRegressionmodel,oneofthesimplestmodelsthereis.Wewilldiscusstwoverydifferentwaystotrainit:

Usingadirect“closed-form”equationthatdirectlycomputesthemodelparametersthatbestfitthemodeltothetrainingset(i.e.,themodelparametersthatminimizethecostfunctionoverthetrainingset).

Usinganiterativeoptimizationapproach,calledGradientDescent(GD),thatgraduallytweaksthemodelparameterstominimizethecostfunctionoverthetrainingset,eventuallyconvergingtothesamesetofparametersasthefirstmethod.WewilllookatafewvariantsofGradientDescentthatwewilluseagainandagainwhenwestudyneuralnetworksinPartII:BatchGD,Mini-batchGD,andStochasticGD.

NextwewilllookatPolynomialRegression,amorecomplexmodelthatcanfitnonlineardatasets.SincethismodelhasmoreparametersthanLinearRegression,itismorepronetooverfittingthetrainingdata,sowewilllookathowtodetectwhetherornotthisisthecase,usinglearningcurves,andthenwewilllookatseveralregularizationtechniquesthatcanreducetheriskofoverfittingthetrainingset.

Finally,wewilllookattwomoremodelsthatarecommonlyusedforclassificationtasks:LogisticRegressionandSoftmaxRegression.

WARNINGTherewillbequiteafewmathequationsinthischapter,usingbasicnotionsoflinearalgebraandcalculus.Tounderstandtheseequations,youwillneedtoknowwhatvectorsandmatricesare,howtotransposethem,whatthedotproductis,whatmatrixinverseis,andwhatpartialderivativesare.Ifyouareunfamiliarwiththeseconcepts,pleasegothroughthelinearalgebraandcalculusintroductorytutorialsavailableasJupyternotebooksintheonlinesupplementalmaterial.Forthosewhoaretrulyallergictomathematics,youshouldstillgothroughthischapterandsimplyskiptheequations;hopefully,thetextwillbesufficienttohelpyouunderstandmostoftheconcepts.

Linear Regression

In Chapter 1, we looked at a simple regression model of life satisfaction: life_satisfaction = θ0 + θ1 × GDP_per_capita.

This model is just a linear function of the input feature GDP_per_capita. θ0 and θ1 are the model's parameters.

More generally, a linear model makes a prediction by simply computing a weighted sum of the input features, plus a constant called the bias term (also called the intercept term), as shown in Equation 4-1.

Equation 4-1. Linear Regression model prediction
$\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$

ŷ is the predicted value.

n is the number of features.

xi is the ith feature value.

θj is the jth model parameter (including the bias term θ0 and the feature weights θ1, θ2, …, θn).

This can be written much more concisely using a vectorized form, as shown in Equation 4-2.

Equation 4-2. Linear Regression model prediction (vectorized form)
$\hat{y} = h_{\boldsymbol{\theta}}(\mathbf{x}) = \boldsymbol{\theta}^T \cdot \mathbf{x}$

θ is the model's parameter vector, containing the bias term θ0 and the feature weights θ1 to θn.

θT is the transpose of θ (a row vector instead of a column vector).

x is the instance's feature vector, containing x0 to xn, with x0 always equal to 1.

θT · x is the dot product of θT and x.

hθ is the hypothesis function, using the model parameters θ.
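To make this concrete, here is a tiny NumPy sketch of Equation 4-2; the parameter and feature values below are made up purely for illustration:

import numpy as np

theta = np.array([[4.0], [3.0]])  # hypothetical parameter vector: bias theta_0 = 4, weight theta_1 = 3
x = np.array([[1.0], [2.0]])      # instance feature vector: x0 = 1 (bias feature), x1 = 2
y_hat = theta.T.dot(x)            # theta^T · x  =  4 + 3 × 2
print(y_hat)                      # [[10.]]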

Okay, that's the Linear Regression model, so now how do we train it? Well, recall that training a model means setting its parameters so that the model best fits the training set. For this purpose, we first need a measure of how well (or poorly) the model fits the training data. In Chapter 2 we saw that the most common performance measure of a regression model is the Root Mean Square Error (RMSE) (Equation 2-1). Therefore, to train a Linear Regression model, you need to find the value of θ that minimizes the RMSE. In practice, it is simpler to minimize the Mean Square Error (MSE) than the RMSE, and it leads to the same result (because the value that minimizes a function also minimizes its square root).1

The MSE of a Linear Regression hypothesis hθ on a training set X is calculated using Equation 4-3.

Equation 4-3. MSE cost function for a Linear Regression model
$\mathrm{MSE}(\mathbf{X}, h_{\boldsymbol{\theta}}) = \frac{1}{m}\sum_{i=1}^{m}\left(\boldsymbol{\theta}^T \cdot \mathbf{x}^{(i)} - y^{(i)}\right)^2$

Most of these notations were presented in Chapter 2 (see "Notations"). The only difference is that we write hθ instead of just h in order to make it clear that the model is parametrized by the vector θ. To simplify notations, we will just write MSE(θ) instead of MSE(X, hθ).
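As a small sanity check, here is one way you might compute MSE(θ) with NumPy (assuming X_b already contains the bias column x0 = 1, as in the code later in this chapter):

import numpy as np

def mse(theta, X_b, y):
    predictions = X_b.dot(theta)            # h_theta(x) for every instance at once
    return np.mean((predictions - y) ** 2)  # average of the squared errors (Equation 4-3)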

The Normal Equation

To find the value of θ that minimizes the cost function, there is a closed-form solution—in other words, a mathematical equation that gives the result directly. This is called the Normal Equation (Equation 4-4).2

Equation 4-4. Normal Equation
$\hat{\boldsymbol{\theta}} = \left(\mathbf{X}^T \cdot \mathbf{X}\right)^{-1} \cdot \mathbf{X}^T \cdot \mathbf{y}$

θ̂ is the value of θ that minimizes the cost function.

y is the vector of target values containing y(1) to y(m).

Let’sgeneratesomelinear-lookingdatatotestthisequationon(Figure4-1):

importnumpyasnp

X=2*np.random.rand(100,1)

y=4+3*X+np.random.randn(100,1)

Figure4-1.Randomlygeneratedlineardataset

Nowlet’scompute usingtheNormalEquation.Wewillusetheinv()functionfromNumPy’sLinearAlgebramodule(np.linalg)tocomputetheinverseofamatrix,andthedot()methodformatrixmultiplication:

X_b=np.c_[np.ones((100,1)),X]#addx0=1toeachinstance

theta_best=np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)

The actual function that we used to generate the data is y = 4 + 3x1 + Gaussian noise. Let's see what the equation found:

>>> theta_best
array([[4.21509616],
       [2.77011339]])

We would have hoped for θ0 = 4 and θ1 = 3 instead of θ0 = 4.215 and θ1 = 2.770. Close enough, but the noise made it impossible to recover the exact parameters of the original function.

Now you can make predictions using θ̂:

>>> X_new = np.array([[0], [2]])
>>> X_new_b = np.c_[np.ones((2, 1)), X_new]  # add x0 = 1 to each instance
>>> y_predict = X_new_b.dot(theta_best)
>>> y_predict
array([[4.21509616],
       [9.75532293]])

Let’splotthismodel’spredictions(Figure4-2):

plt.plot(X_new,y_predict,"r-")

plt.plot(X,y,"b.")

plt.axis([0,2,0,15])

plt.show()

Figure 4-2. Linear Regression model predictions

The equivalent code using Scikit-Learn looks like this:3

>>> from sklearn.linear_model import LinearRegression
>>> lin_reg = LinearRegression()
>>> lin_reg.fit(X, y)
>>> lin_reg.intercept_, lin_reg.coef_
(array([4.21509616]), array([[2.77011339]]))
>>> lin_reg.predict(X_new)
array([[4.21509616],
       [9.75532293]])

Computational Complexity

The Normal Equation computes the inverse of XT · X, which is an n × n matrix (where n is the number of features). The computational complexity of inverting such a matrix is typically about O(n^2.4) to O(n^3) (depending on the implementation). In other words, if you double the number of features, you multiply the computation time by roughly 2^2.4 ≈ 5.3 to 2^3 = 8.

WARNING: The Normal Equation gets very slow when the number of features grows large (e.g., 100,000).

On the positive side, this equation is linear with regards to the number of instances in the training set (it is O(m)), so it handles large training sets efficiently, provided they can fit in memory.

Also, once you have trained your Linear Regression model (using the Normal Equation or any other algorithm), predictions are very fast: the computational complexity is linear with regards to both the number of instances you want to make predictions on and the number of features. In other words, making predictions on twice as many instances (or twice as many features) will just take roughly twice as much time.

Now we will look at very different ways to train a Linear Regression model, better suited for cases where there are a large number of features, or too many training instances to fit in memory.

Gradient Descent

Gradient Descent is a very generic optimization algorithm capable of finding optimal solutions to a wide range of problems. The general idea of Gradient Descent is to tweak parameters iteratively in order to minimize a cost function.

Supposeyouarelostinthemountainsinadensefog;youcanonlyfeeltheslopeofthegroundbelowyourfeet.Agoodstrategytogettothebottomofthevalleyquicklyistogodownhillinthedirectionofthesteepestslope.ThisisexactlywhatGradientDescentdoes:itmeasuresthelocalgradientoftheerrorfunctionwithregardstotheparametervectorθ,anditgoesinthedirectionofdescendinggradient.Oncethegradientiszero,youhavereachedaminimum!

Concretely,youstartbyfillingθwithrandomvalues(thisiscalledrandominitialization),andthenyouimproveitgradually,takingonebabystepatatime,eachstepattemptingtodecreasethecostfunction(e.g.,theMSE),untilthealgorithmconvergestoaminimum(seeFigure4-3).

Figure 4-3. Gradient Descent

AnimportantparameterinGradientDescentisthesizeofthesteps,determinedbythelearningratehyperparameter.Ifthelearningrateistoosmall,thenthealgorithmwillhavetogothroughmanyiterationstoconverge,whichwilltakealongtime(seeFigure4-4).

Figure 4-4. Learning rate too small

Ontheotherhand,ifthelearningrateistoohigh,youmightjumpacrossthevalleyandendupontheotherside,possiblyevenhigherupthanyouwerebefore.Thismightmakethealgorithmdiverge,withlargerandlargervalues,failingtofindagoodsolution(seeFigure4-5).

Figure 4-5. Learning rate too large

Finally,notallcostfunctionslooklikeniceregularbowls.Theremaybeholes,ridges,plateaus,andallsortsofirregularterrains,makingconvergencetotheminimumverydifficult.Figure4-6showsthetwomainchallengeswithGradientDescent:iftherandominitializationstartsthealgorithmontheleft,thenitwillconvergetoalocalminimum,whichisnotasgoodastheglobalminimum.Ifitstartsontheright,thenitwilltakeaverylongtimetocrosstheplateau,andifyoustoptooearlyyouwillneverreachtheglobalminimum.

Figure 4-6. Gradient Descent pitfalls

Fortunately,theMSEcostfunctionforaLinearRegressionmodelhappenstobeaconvexfunction,whichmeansthatifyoupickanytwopointsonthecurve,thelinesegmentjoiningthemnevercrossesthecurve.Thisimpliesthattherearenolocalminima,justoneglobalminimum.Itisalsoacontinuousfunctionwithaslopethatneverchangesabruptly.4Thesetwofactshaveagreatconsequence:GradientDescentisguaranteedtoapproacharbitrarilyclosetheglobalminimum(ifyouwaitlongenoughandifthelearningrateisnottoohigh).

Infact,thecostfunctionhastheshapeofabowl,butitcanbeanelongatedbowlifthefeatureshaveverydifferentscales.Figure4-7showsGradientDescentonatrainingsetwherefeatures1and2havethesamescale(ontheleft),andonatrainingsetwherefeature1hasmuchsmallervaluesthanfeature2(ontheright).5

Figure 4-7. Gradient Descent with and without feature scaling

Asyoucansee,onthelefttheGradientDescentalgorithmgoesstraighttowardtheminimum,therebyreachingitquickly,whereasontherightitfirstgoesinadirectionalmostorthogonaltothedirectionoftheglobalminimum,anditendswithalongmarchdownanalmostflatvalley.Itwilleventuallyreachtheminimum,butitwilltakealongtime.

WARNING: When using Gradient Descent, you should ensure that all features have a similar scale (e.g., using Scikit-Learn's StandardScaler class), or else it will take much longer to converge.

Thisdiagramalsoillustratesthefactthattrainingamodelmeanssearchingforacombinationofmodelparametersthatminimizesacostfunction(overthetrainingset).Itisasearchinthemodel’sparameterspace:themoreparametersamodelhas,themoredimensionsthisspacehas,andtheharderthesearchis:searchingforaneedleina300-dimensionalhaystackismuchtrickierthaninthreedimensions.Fortunately,sincethecostfunctionisconvexinthecaseofLinearRegression,theneedleissimplyatthebottomofthebowl.

Batch Gradient Descent

To implement Gradient Descent, you need to compute the gradient of the cost function with regards to each model parameter θj. In other words, you need to calculate how much the cost function will change if you change θj just a little bit. This is called a partial derivative. It is like asking "what is the slope of the mountain under my feet if I face east?" and then asking the same question facing north (and so on for all other dimensions, if you can imagine a universe with more than three dimensions). Equation 4-5 computes the partial derivative of the cost function with regards to parameter θj, noted ∂MSE(θ)/∂θj.

Equation 4-5. Partial derivatives of the cost function
$\frac{\partial}{\partial \theta_j}\mathrm{MSE}(\boldsymbol{\theta}) = \frac{2}{m}\sum_{i=1}^{m}\left(\boldsymbol{\theta}^T \cdot \mathbf{x}^{(i)} - y^{(i)}\right)x_j^{(i)}$

Instead of computing these partial derivatives individually, you can use Equation 4-6 to compute them all in one go. The gradient vector, noted ∇θMSE(θ), contains all the partial derivatives of the cost function (one for each model parameter).

Equation 4-6. Gradient vector of the cost function
$\nabla_{\boldsymbol{\theta}}\mathrm{MSE}(\boldsymbol{\theta}) = \frac{2}{m}\mathbf{X}^T \cdot \left(\mathbf{X} \cdot \boldsymbol{\theta} - \mathbf{y}\right)$

WARNING: Notice that this formula involves calculations over the full training set X, at each Gradient Descent step! This is why the algorithm is called Batch Gradient Descent: it uses the whole batch of training data at every step. As a result it is terribly slow on very large training sets (but we will see much faster Gradient Descent algorithms shortly). However, Gradient Descent scales well with the number of features; training a Linear Regression model when there are hundreds of thousands of features is much faster using Gradient Descent than using the Normal Equation.

Once you have the gradient vector, which points uphill, just go in the opposite direction to go downhill. This means subtracting ∇θMSE(θ) from θ. This is where the learning rate η comes into play:6 multiply the gradient vector by η to determine the size of the downhill step (Equation 4-7).

Equation 4-7. Gradient Descent step
$\boldsymbol{\theta}^{(\text{next step})} = \boldsymbol{\theta} - \eta\,\nabla_{\boldsymbol{\theta}}\mathrm{MSE}(\boldsymbol{\theta})$

Let’slookataquickimplementationofthisalgorithm:

eta=0.1#learningrate

n_iterations=1000

m=100

theta=np.random.randn(2,1)#randominitialization

foriterationinrange(n_iterations):

gradients=2/m*X_b.T.dot(X_b.dot(theta)-y)

theta=theta-eta*gradients

Thatwasn’ttoohard!Let’slookattheresultingtheta:

>>>theta

array([[4.21509616],

[2.77011339]])

Hey,that’sexactlywhattheNormalEquationfound!GradientDescentworkedperfectly.Butwhatifyouhadusedadifferentlearningrateeta?Figure4-8showsthefirst10stepsofGradientDescentusingthreedifferentlearningrates(thedashedlinerepresentsthestartingpoint).

Figure 4-8. Gradient Descent with various learning rates

Ontheleft,thelearningrateistoolow:thealgorithmwilleventuallyreachthesolution,butitwilltakealongtime.Inthemiddle,thelearningratelooksprettygood:injustafewiterations,ithasalreadyconvergedtothesolution.Ontheright,thelearningrateistoohigh:thealgorithmdiverges,jumpingallovertheplaceandactuallygettingfurtherandfurtherawayfromthesolutionateverystep.

Tofindagoodlearningrate,youcanusegridsearch(seeChapter2).However,youmaywanttolimitthenumberofiterationssothatgridsearchcaneliminatemodelsthattaketoolongtoconverge.

You may wonder how to set the number of iterations. If it is too low, you will still be far away from the optimal solution when the algorithm stops, but if it is too high, you will waste time while the model parameters do not change anymore. A simple solution is to set a very large number of iterations but to interrupt the algorithm when the gradient vector becomes tiny—that is, when its norm becomes smaller than a tiny number ϵ (called the tolerance)—because this happens when Gradient Descent has (almost) reached the minimum.
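For example, here is one way you could add such a tolerance check to the Batch Gradient Descent loop shown earlier (reusing X_b, y, and m from the preceding code; the value of ϵ is arbitrary):

eta = 0.1
epsilon = 1e-6                   # tolerance: stop when the gradient vector's norm drops below this
theta = np.random.randn(2, 1)    # random initialization
for iteration in range(100000):  # generous iteration budget
    gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y)
    if np.linalg.norm(gradients) < epsilon:
        break                    # Gradient Descent has (almost) reached the minimum
    theta = theta - eta * gradients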

CONVERGENCE RATE

When the cost function is convex and its slope does not change abruptly (as is the case for the MSE cost function), it can be shown that Batch Gradient Descent with a fixed learning rate has a convergence rate of O(1/iterations). In other words, if you divide the tolerance ϵ by 10 (to have a more precise solution), then the algorithm will have to run about 10 times more iterations.

Stochastic Gradient Descent

The main problem with Batch Gradient Descent is the fact that it uses the whole training set to compute the gradients at every step, which makes it very slow when the training set is large. At the opposite extreme, Stochastic Gradient Descent just picks a random instance in the training set at every step and computes the gradients based only on that single instance. Obviously this makes the algorithm much faster since it has very little data to manipulate at every iteration. It also makes it possible to train on huge training sets, since only one instance needs to be in memory at each iteration (SGD can be implemented as an out-of-core algorithm.7)

Ontheotherhand,duetoitsstochastic(i.e.,random)nature,thisalgorithmismuchlessregularthanBatchGradientDescent:insteadofgentlydecreasinguntilitreachestheminimum,thecostfunctionwillbounceupanddown,decreasingonlyonaverage.Overtimeitwillendupveryclosetotheminimum,butonceitgetsthereitwillcontinuetobouncearound,neversettlingdown(seeFigure4-9).Sooncethealgorithmstops,thefinalparametervaluesaregood,butnotoptimal.

Figure 4-9. Stochastic Gradient Descent

Whenthecostfunctionisveryirregular(asinFigure4-6),thiscanactuallyhelpthealgorithmjumpoutoflocalminima,soStochasticGradientDescenthasabetterchanceoffindingtheglobalminimumthanBatchGradientDescentdoes.

Therefore randomness is good to escape from local optima, but bad because it means that the algorithm can never settle at the minimum. One solution to this dilemma is to gradually reduce the learning rate. The steps start out large (which helps make quick progress and escape local minima), then get smaller and smaller, allowing the algorithm to settle at the global minimum. This process is called simulated annealing, because it resembles the process of annealing in metallurgy where molten metal is slowly cooled down. The function that determines the learning rate at each iteration is called the learning schedule. If the learning rate is reduced too quickly, you may get stuck in a local minimum, or even end up frozen halfway to the minimum. If the learning rate is reduced too slowly, you may jump around the minimum for a long time and end up with a suboptimal solution if you halt training too early.

This code implements Stochastic Gradient Descent using a simple learning schedule:

n_epochs = 50
t0, t1 = 5, 50  # learning schedule hyperparameters

def learning_schedule(t):
    return t0 / (t + t1)

theta = np.random.randn(2, 1)  # random initialization

for epoch in range(n_epochs):
    for i in range(m):
        random_index = np.random.randint(m)
        xi = X_b[random_index:random_index+1]
        yi = y[random_index:random_index+1]
        gradients = 2 * xi.T.dot(xi.dot(theta) - yi)
        eta = learning_schedule(epoch * m + i)
        theta = theta - eta * gradients

By convention we iterate by rounds of m iterations; each round is called an epoch. While the Batch Gradient Descent code iterated 1,000 times through the whole training set, this code goes through the training set only 50 times and reaches a fairly good solution:

>>> theta
array([[4.21076011],
       [2.74856079]])

Figure 4-10 shows the first 10 steps of training (notice how irregular the steps are).

Figure 4-10. Stochastic Gradient Descent first 10 steps

Note that since instances are picked randomly, some instances may be picked several times per epoch while others may not be picked at all. If you want to be sure that the algorithm goes through every instance at each epoch, another approach is to shuffle the training set, then go through it instance by instance, then shuffle it again, and so on. However, this generally converges more slowly.
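For instance, an epoch-by-epoch shuffling variant could be sketched like this (reusing X_b, y, m, n_epochs, and learning_schedule from the preceding code):

theta = np.random.randn(2, 1)  # random initialization
for epoch in range(n_epochs):
    shuffled_indices = np.random.permutation(m)  # visit every instance exactly once per epoch
    X_b_shuffled = X_b[shuffled_indices]
    y_shuffled = y[shuffled_indices]
    for i in range(m):
        xi = X_b_shuffled[i:i+1]
        yi = y_shuffled[i:i+1]
        gradients = 2 * xi.T.dot(xi.dot(theta) - yi)
        eta = learning_schedule(epoch * m + i)
        theta = theta - eta * gradients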

To perform Linear Regression using SGD with Scikit-Learn, you can use the SGDRegressor class, which defaults to optimizing the squared error cost function. The following code runs 50 epochs, starting with a learning rate of 0.1 (eta0=0.1), using the default learning schedule (different from the preceding one), and it does not use any regularization (penalty=None; more details on this shortly):

from sklearn.linear_model import SGDRegressor

sgd_reg = SGDRegressor(n_iter=50, penalty=None, eta0=0.1)
sgd_reg.fit(X, y.ravel())

Once again, you find a solution very close to the one returned by the Normal Equation:

>>> sgd_reg.intercept_, sgd_reg.coef_
(array([4.16782089]), array([2.72603052]))

Mini-batch Gradient Descent

The last Gradient Descent algorithm we will look at is called Mini-batch Gradient Descent. It is quite simple to understand once you know Batch and Stochastic Gradient Descent: at each step, instead of computing the gradients based on the full training set (as in Batch GD) or based on just one instance (as in Stochastic GD), Mini-batch GD computes the gradients on small random sets of instances called mini-batches. The main advantage of Mini-batch GD over Stochastic GD is that you can get a performance boost from hardware optimization of matrix operations, especially when using GPUs.

The algorithm's progress in parameter space is less erratic than with SGD, especially with fairly large mini-batches. As a result, Mini-batch GD will end up walking around a bit closer to the minimum than SGD. But, on the other hand, it may be harder for it to escape from local minima (in the case of problems that suffer from local minima, unlike Linear Regression as we saw earlier). Figure 4-11 shows the paths taken by the three Gradient Descent algorithms in parameter space during training. They all end up near the minimum, but Batch GD's path actually stops at the minimum, while both Stochastic GD and Mini-batch GD continue to walk around. However, don't forget that Batch GD takes a lot of time to take each step, and Stochastic GD and Mini-batch GD would also reach the minimum if you used a good learning schedule.

Figure 4-11. Gradient Descent paths in parameter space

Let’scomparethealgorithmswe’vediscussedsofarforLinearRegression8(recallthatmisthenumberoftraininginstancesandnisthenumberoffeatures);seeTable4-1.

Table 4-1. Comparison of algorithms for Linear Regression

Algorithm         Large m   Out-of-core support   Large n   Hyperparams   Scaling required   Scikit-Learn
Normal Equation   Fast      No                    Slow      0             No                 LinearRegression
Batch GD          Slow      No                    Fast      2             Yes                n/a
Stochastic GD     Fast      Yes                   Fast      ≥2            Yes                SGDRegressor
Mini-batch GD     Fast      Yes                   Fast      ≥2            Yes                n/a

NOTE: There is almost no difference after training: all these algorithms end up with very similar models and make predictions in exactly the same way.

Polynomial Regression

What if your data is actually more complex than a simple straight line? Surprisingly, you can actually use a linear model to fit nonlinear data. A simple way to do this is to add powers of each feature as new features, then train a linear model on this extended set of features. This technique is called Polynomial Regression.

Let’slookatanexample.First,let’sgeneratesomenonlineardata,basedonasimplequadraticequation9(plussomenoise;seeFigure4-12):

m=100

X=6*np.random.rand(m,1)-3

y=0.5*X**2+X+2+np.random.randn(m,1)

Figure 4-12. Generated nonlinear and noisy dataset

Clearly, a straight line will never fit this data properly. So let's use Scikit-Learn's PolynomialFeatures class to transform our training data, adding the square (2nd-degree polynomial) of each feature in the training set as new features (in this case there is just one feature):

>>> from sklearn.preprocessing import PolynomialFeatures
>>> poly_features = PolynomialFeatures(degree=2, include_bias=False)
>>> X_poly = poly_features.fit_transform(X)
>>> X[0]
array([-0.75275929])
>>> X_poly[0]
array([-0.75275929, 0.56664654])

X_poly now contains the original feature of X plus the square of this feature. Now you can fit a LinearRegression model to this extended training data (Figure 4-13):

>>> lin_reg = LinearRegression()
>>> lin_reg.fit(X_poly, y)
>>> lin_reg.intercept_, lin_reg.coef_
(array([1.78134581]), array([[0.93366893, 0.56456263]]))

Figure 4-13. Polynomial Regression model predictions

Not bad: the model estimates ŷ = 0.56 x1² + 0.93 x1 + 1.78 when in fact the original function was y = 0.5 x1² + 1.0 x1 + 2.0 + Gaussian noise.

Note that when there are multiple features, Polynomial Regression is capable of finding relationships between features (which is something a plain Linear Regression model cannot do). This is made possible by the fact that PolynomialFeatures also adds all combinations of features up to the given degree. For example, if there were two features a and b, PolynomialFeatures with degree=3 would not only add the features a², a³, b², and b³, but also the combinations ab, a²b, and ab².

WARNING: PolynomialFeatures(degree=d) transforms an array containing n features into an array containing (n + d)! / (d! n!) features, where n! is the factorial of n, equal to 1 × 2 × 3 × ⋯ × n. Beware of the combinatorial explosion of the number of features!
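For instance, with n = 2 features and degree d = 3, the formula gives (2 + 3)! / (3! × 2!) = 10 features (including the bias column, which PolynomialFeatures adds by default), and you can check this directly:

>>> from sklearn.preprocessing import PolynomialFeatures
>>> import numpy as np
>>> X_two_features = np.random.rand(5, 2)                           # 5 instances, n = 2 features
>>> PolynomialFeatures(degree=3).fit_transform(X_two_features).shape
(5, 10)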

Learning Curves

If you perform high-degree Polynomial Regression, you will likely fit the training data much better than with plain Linear Regression. For example, Figure 4-14 applies a 300-degree polynomial model to the preceding training data, and compares the result with a pure linear model and a quadratic model (2nd-degree polynomial). Notice how the 300-degree polynomial model wiggles around to get as close as possible to the training instances.

Figure 4-14. High-degree Polynomial Regression

Ofcourse,thishigh-degreePolynomialRegressionmodelisseverelyoverfittingthetrainingdata,whilethelinearmodelisunderfittingit.Themodelthatwillgeneralizebestinthiscaseisthequadraticmodel.Itmakessensesincethedatawasgeneratedusingaquadraticmodel,butingeneralyouwon’tknowwhatfunctiongeneratedthedata,sohowcanyoudecidehowcomplexyourmodelshouldbe?Howcanyoutellthatyourmodelisoverfittingorunderfittingthedata?

InChapter2youusedcross-validationtogetanestimateofamodel’sgeneralizationperformance.Ifamodelperformswellonthetrainingdatabutgeneralizespoorlyaccordingtothecross-validationmetrics,thenyourmodelisoverfitting.Ifitperformspoorlyonboth,thenitisunderfitting.Thisisonewaytotellwhenamodelistoosimpleortoocomplex.

Another way is to look at the learning curves: these are plots of the model's performance on the training set and the validation set as a function of the training set size. To generate the plots, simply train the model several times on different sized subsets of the training set. The following code defines a function that plots the learning curves of a model given some training data:

from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def plot_learning_curves(model, X, y):
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
    train_errors, val_errors = [], []
    for m in range(1, len(X_train)):
        model.fit(X_train[:m], y_train[:m])
        y_train_predict = model.predict(X_train[:m])
        y_val_predict = model.predict(X_val)
        train_errors.append(mean_squared_error(y_train_predict, y_train[:m]))
        val_errors.append(mean_squared_error(y_val_predict, y_val))
    plt.plot(np.sqrt(train_errors), "r-+", linewidth=2, label="train")
    plt.plot(np.sqrt(val_errors), "b-", linewidth=3, label="val")
    plt.legend(loc="upper right")     # legend and axis labels added here for readability
    plt.xlabel("Training set size")
    plt.ylabel("RMSE")

Let’slookatthelearningcurvesoftheplainLinearRegressionmodel(astraightline;Figure4-15):

lin_reg=LinearRegression()

plot_learning_curves(lin_reg,X,y)

Figure 4-15. Learning curves

Thisdeservesabitofexplanation.First,let’slookattheperformanceonthetrainingdata:whentherearejustoneortwoinstancesinthetrainingset,themodelcanfitthemperfectly,whichiswhythecurvestartsatzero.Butasnewinstancesareaddedtothetrainingset,itbecomesimpossibleforthemodeltofitthetrainingdataperfectly,bothbecausethedataisnoisyandbecauseitisnotlinearatall.Sotheerroronthetrainingdatagoesupuntilitreachesaplateau,atwhichpointaddingnewinstancestothetrainingsetdoesn’tmaketheaverageerrormuchbetterorworse.Nowlet’slookattheperformanceofthemodelonthevalidationdata.Whenthemodelistrainedonveryfewtraininginstances,itisincapableofgeneralizingproperly,whichiswhythevalidationerrorisinitiallyquitebig.Thenasthemodelisshownmoretrainingexamples,itlearnsandthusthevalidationerrorslowlygoesdown.However,onceagainastraightlinecannotdoagoodjobmodelingthedata,sotheerrorendsupataplateau,veryclosetotheothercurve.

These learning curves are typical of an underfitting model. Both curves have reached a plateau; they are close and fairly high.

TIP: If your model is underfitting the training data, adding more training examples will not help. You need to use a more complex model or come up with better features.

Nowlet’slookatthelearningcurvesofa10th-degreepolynomialmodelonthesamedata(Figure4-16):

fromsklearn.pipelineimportPipeline

polynomial_regression=Pipeline((

("poly_features",PolynomialFeatures(degree=10,include_bias=False)),

("lin_reg",LinearRegression()),

))

plot_learning_curves(polynomial_regression,X,y)

These learning curves look a bit like the previous ones, but there are two very important differences:

The error on the training data is much lower than with the Linear Regression model.

There is a gap between the curves. This means that the model performs significantly better on the training data than on the validation data, which is the hallmark of an overfitting model. However, if you used a much larger training set, the two curves would continue to get closer.

Figure 4-16. Learning curves for the polynomial model

TIP: One way to improve an overfitting model is to feed it more training data until the validation error reaches the training error.

THE BIAS/VARIANCE TRADEOFF

An important theoretical result of statistics and Machine Learning is the fact that a model's generalization error can be expressed as the sum of three very different errors:

Bias: This part of the generalization error is due to wrong assumptions, such as assuming that the data is linear when it is actually quadratic. A high-bias model is most likely to underfit the training data.10

Variance: This part is due to the model's excessive sensitivity to small variations in the training data. A model with many degrees of freedom (such as a high-degree polynomial model) is likely to have high variance, and thus to overfit the training data.

Irreducible error: This part is due to the noisiness of the data itself. The only way to reduce this part of the error is to clean up the data (e.g., fix the data sources, such as broken sensors, or detect and remove outliers).

Increasing a model's complexity will typically increase its variance and reduce its bias. Conversely, reducing a model's complexity increases its bias and reduces its variance. This is why it is called a tradeoff.

Regularized Linear Models

As we saw in Chapters 1 and 2, a good way to reduce overfitting is to regularize the model (i.e., to constrain it): the fewer degrees of freedom it has, the harder it will be for it to overfit the data. For example, a simple way to regularize a polynomial model is to reduce the number of polynomial degrees.

For a linear model, regularization is typically achieved by constraining the weights of the model. We will now look at Ridge Regression, Lasso Regression, and Elastic Net, which implement three different ways to constrain the weights.

Ridge Regression

Ridge Regression (also called Tikhonov regularization) is a regularized version of Linear Regression: a regularization term equal to α Σ θi² is added to the cost function. This forces the learning algorithm to not only fit the data but also keep the model weights as small as possible. Note that the regularization term should only be added to the cost function during training. Once the model is trained, you want to evaluate the model's performance using the unregularized performance measure.

NOTE: It is quite common for the cost function used during training to be different from the performance measure used for testing. Apart from regularization, another reason why they might be different is that a good training cost function should have optimization-friendly derivatives, while the performance measure used for testing should be as close as possible to the final objective. A good example of this is a classifier trained using a cost function such as the log loss (discussed in a moment) but evaluated using precision/recall.

The hyperparameter α controls how much you want to regularize the model. If α = 0 then Ridge Regression is just Linear Regression. If α is very large, then all weights end up very close to zero and the result is a flat line going through the data's mean. Equation 4-8 presents the Ridge Regression cost function.11

Equation 4-8. Ridge Regression cost function
$J(\boldsymbol{\theta}) = \mathrm{MSE}(\boldsymbol{\theta}) + \alpha\,\frac{1}{2}\sum_{i=1}^{n}\theta_i^2$

Note that the bias term θ0 is not regularized (the sum starts at i = 1, not 0). If we define w as the vector of feature weights (θ1 to θn), then the regularization term is simply equal to ½(‖w‖2)², where ‖·‖2 represents the ℓ2 norm of the weight vector.12 For Gradient Descent, just add αw to the MSE gradient vector (Equation 4-6).
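As a sketch of what this looks like in code, here is Batch Gradient Descent with the Ridge gradient just described (reusing X_b, y, and m from earlier in the chapter; α and η are arbitrary):

alpha = 0.1
eta = 0.1
theta = np.random.randn(2, 1)                          # random initialization
for iteration in range(1000):
    mse_gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y)
    w = np.r_[np.zeros((1, 1)), theta[1:]]             # copy of theta with the bias term zeroed out
    theta = theta - eta * (mse_gradients + alpha * w)  # add alpha·w to the MSE gradient vector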

WARNING: It is important to scale the data (e.g., using a StandardScaler) before performing Ridge Regression, as it is sensitive to the scale of the input features. This is true of most regularized models.

Figure 4-17 shows several Ridge models trained on some linear data using different α values. On the left, plain Ridge models are used, leading to linear predictions. On the right, the data is first expanded using PolynomialFeatures(degree=10), then it is scaled using a StandardScaler, and finally the Ridge models are applied to the resulting features: this is Polynomial Regression with Ridge regularization. Note how increasing α leads to flatter (i.e., less extreme, more reasonable) predictions; this reduces the model's variance but increases its bias.

As with Linear Regression, we can perform Ridge Regression either by computing a closed-form equation or by performing Gradient Descent. The pros and cons are the same. Equation 4-9 shows the closed-form solution (where A is the n × n identity matrix13 except with a 0 in the top-left cell, corresponding to the bias term).

Figure 4-17. Ridge Regression

Equation 4-9. Ridge Regression closed-form solution
$\hat{\boldsymbol{\theta}} = \left(\mathbf{X}^T \cdot \mathbf{X} + \alpha\mathbf{A}\right)^{-1} \cdot \mathbf{X}^T \cdot \mathbf{y}$

Here is how to perform Ridge Regression with Scikit-Learn using a closed-form solution (a variant of Equation 4-9 using a matrix factorization technique by André-Louis Cholesky):

>>> from sklearn.linear_model import Ridge
>>> ridge_reg = Ridge(alpha=1, solver="cholesky")
>>> ridge_reg.fit(X, y)
>>> ridge_reg.predict([[1.5]])
array([[1.55071465]])

And using Stochastic Gradient Descent:14

>>> sgd_reg = SGDRegressor(penalty="l2")
>>> sgd_reg.fit(X, y.ravel())
>>> sgd_reg.predict([[1.5]])
array([1.13500145])

The penalty hyperparameter sets the type of regularization term to use. Specifying "l2" indicates that you want SGD to add a regularization term to the cost function equal to half the square of the ℓ2 norm of the weight vector: this is simply Ridge Regression.

Lasso Regression

Least Absolute Shrinkage and Selection Operator Regression (simply called Lasso Regression) is another regularized version of Linear Regression: just like Ridge Regression, it adds a regularization term to the cost function, but it uses the ℓ1 norm of the weight vector instead of half the square of the ℓ2 norm (see Equation 4-10).

Equation 4-10. Lasso Regression cost function
$J(\boldsymbol{\theta}) = \mathrm{MSE}(\boldsymbol{\theta}) + \alpha\sum_{i=1}^{n}\left|\theta_i\right|$

Figure 4-18 shows the same thing as Figure 4-17 but replaces Ridge models with Lasso models and uses smaller α values.

Figure 4-18. Lasso Regression

An important characteristic of Lasso Regression is that it tends to completely eliminate the weights of the least important features (i.e., set them to zero). For example, the dashed line in the right plot on Figure 4-18 (with α = 10⁻⁷) looks quadratic, almost linear: all the weights for the high-degree polynomial features are equal to zero. In other words, Lasso Regression automatically performs feature selection and outputs a sparse model (i.e., with few nonzero feature weights).

You can get a sense of why this is the case by looking at Figure 4-19: on the top-left plot, the background contours (ellipses) represent an unregularized MSE cost function (α = 0), and the white circles show the Batch Gradient Descent path with that cost function. The foreground contours (diamonds) represent the ℓ1 penalty, and the triangles show the BGD path for this penalty only (α → ∞). Notice how the path first reaches θ1 = 0, then rolls down a gutter until it reaches θ2 = 0. On the top-right plot, the contours represent the same cost function plus an ℓ1 penalty with α = 0.5. The global minimum is on the θ2 = 0 axis. BGD first reaches θ2 = 0, then rolls down the gutter until it reaches the global minimum. The two bottom plots show the same thing but use an ℓ2 penalty instead. The regularized minimum is closer to θ = 0 than the unregularized minimum, but the weights do not get fully eliminated.

Figure 4-19. Lasso versus Ridge regularization

TIP: On the Lasso cost function, the BGD path tends to bounce across the gutter toward the end. This is because the slope changes abruptly at θ2 = 0. You need to gradually reduce the learning rate in order to actually converge to the global minimum.

The Lasso cost function is not differentiable at θi = 0 (for i = 1, 2, …, n), but Gradient Descent still works fine if you use a subgradient vector g15 instead when any θi = 0. Equation 4-11 shows a subgradient vector equation you can use for Gradient Descent with the Lasso cost function.

Equation 4-11. Lasso Regression subgradient vector
$g(\boldsymbol{\theta}, J) = \nabla_{\boldsymbol{\theta}}\mathrm{MSE}(\boldsymbol{\theta}) + \alpha \begin{pmatrix} \operatorname{sign}(\theta_1) \\ \operatorname{sign}(\theta_2) \\ \vdots \\ \operatorname{sign}(\theta_n) \end{pmatrix} \quad \text{where } \operatorname{sign}(\theta_i) = \begin{cases} -1 & \text{if } \theta_i < 0 \\ 0 & \text{if } \theta_i = 0 \\ +1 & \text{if } \theta_i > 0 \end{cases}$
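A corresponding Gradient Descent sketch using this subgradient (same data and conventions as before, arbitrary α and η) might look like this:

alpha = 0.1
eta = 0.1
theta = np.random.randn(2, 1)                          # random initialization
for iteration in range(1000):
    mse_gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y)
    subgradient = np.sign(theta)                       # sign(theta_i): -1, 0, or +1 (Equation 4-11)
    subgradient[0] = 0                                 # the bias term theta_0 is not regularized
    theta = theta - eta * (mse_gradients + alpha * subgradient)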

Here is a small Scikit-Learn example using the Lasso class. Note that you could instead use an SGDRegressor(penalty="l1").

>>> from sklearn.linear_model import Lasso
>>> lasso_reg = Lasso(alpha=0.1)
>>> lasso_reg.fit(X, y)
>>> lasso_reg.predict([[1.5]])
array([1.53788174])

Elastic Net

Elastic Net is a middle ground between Ridge Regression and Lasso Regression. The regularization term is a simple mix of both Ridge and Lasso's regularization terms, and you can control the mix ratio r. When r = 0, Elastic Net is equivalent to Ridge Regression, and when r = 1, it is equivalent to Lasso Regression (see Equation 4-12).

Equation 4-12. Elastic Net cost function
$J(\boldsymbol{\theta}) = \mathrm{MSE}(\boldsymbol{\theta}) + r\,\alpha\sum_{i=1}^{n}\left|\theta_i\right| + \frac{1-r}{2}\,\alpha\sum_{i=1}^{n}\theta_i^2$

So when should you use plain Linear Regression (i.e., without any regularization), Ridge, Lasso, or Elastic Net? It is almost always preferable to have at least a little bit of regularization, so generally you should avoid plain Linear Regression. Ridge is a good default, but if you suspect that only a few features are actually useful, you should prefer Lasso or Elastic Net since they tend to reduce the useless features' weights down to zero as we have discussed. In general, Elastic Net is preferred over Lasso since Lasso may behave erratically when the number of features is greater than the number of training instances or when several features are strongly correlated.

Here is a short example using Scikit-Learn's ElasticNet (l1_ratio corresponds to the mix ratio r):

>>> from sklearn.linear_model import ElasticNet
>>> elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)
>>> elastic_net.fit(X, y)
>>> elastic_net.predict([[1.5]])
array([1.54333232])

Early Stopping

A very different way to regularize iterative learning algorithms such as Gradient Descent is to stop training as soon as the validation error reaches a minimum. This is called early stopping. Figure 4-20 shows a complex model (in this case a high-degree Polynomial Regression model) being trained using Batch Gradient Descent. As the epochs go by, the algorithm learns and its prediction error (RMSE) on the training set naturally goes down, and so does its prediction error on the validation set. However, after a while the validation error stops decreasing and actually starts to go back up. This indicates that the model has started to overfit the training data. With early stopping you just stop training as soon as the validation error reaches the minimum. It is such a simple and efficient regularization technique that Geoffrey Hinton called it a "beautiful free lunch."

Figure 4-20. Early stopping regularization

TIP: With Stochastic and Mini-batch Gradient Descent, the curves are not so smooth, and it may be hard to know whether you have reached the minimum or not. One solution is to stop only after the validation error has been above the minimum for some time (when you are confident that the model will not do any better), then roll back the model parameters to the point where the validation error was at a minimum.

Here is a basic implementation of early stopping:

from copy import deepcopy  # deepcopy keeps the trained weights; sklearn.base.clone would copy only the hyperparameters

sgd_reg = SGDRegressor(n_iter=1, warm_start=True, penalty=None,
                       learning_rate="constant", eta0=0.0005)

minimum_val_error = float("inf")
best_epoch = None
best_model = None
for epoch in range(1000):
    sgd_reg.fit(X_train_poly_scaled, y_train)  # continues where it left off
    y_val_predict = sgd_reg.predict(X_val_poly_scaled)
    val_error = mean_squared_error(y_val_predict, y_val)
    if val_error < minimum_val_error:
        minimum_val_error = val_error
        best_epoch = epoch
        best_model = deepcopy(sgd_reg)

Note that with warm_start=True, when the fit() method is called, it just continues training where it left off instead of restarting from scratch.
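The TIP above (keep training until the validation error has been above its minimum for a while, then roll back) can be sketched with a simple patience counter; the variables are the same as in the snippet above, and the patience value is arbitrary:

sgd_reg = SGDRegressor(n_iter=1, warm_start=True, penalty=None,
                       learning_rate="constant", eta0=0.0005)
minimum_val_error = float("inf")
best_model = None
patience = 50                       # number of epochs without improvement to tolerate
epochs_without_progress = 0
for epoch in range(1000):
    sgd_reg.fit(X_train_poly_scaled, y_train)          # continues where it left off
    y_val_predict = sgd_reg.predict(X_val_poly_scaled)
    val_error = mean_squared_error(y_val_predict, y_val)
    if val_error < minimum_val_error:
        minimum_val_error = val_error
        best_model = deepcopy(sgd_reg)                  # remember the best model so far
        epochs_without_progress = 0
    else:
        epochs_without_progress += 1
        if epochs_without_progress >= patience:         # no improvement for a while:
            break                                       # stop, and fall back to best_model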

Logistic Regression

As we discussed in Chapter 1, some regression algorithms can be used for classification as well (and vice versa). Logistic Regression (also called Logit Regression) is commonly used to estimate the probability that an instance belongs to a particular class (e.g., what is the probability that this email is spam?). If the estimated probability is greater than 50%, then the model predicts that the instance belongs to that class (called the positive class, labeled "1"), or else it predicts that it does not (i.e., it belongs to the negative class, labeled "0"). This makes it a binary classifier.

Estimating Probabilities

So how does it work? Just like a Linear Regression model, a Logistic Regression model computes a weighted sum of the input features (plus a bias term), but instead of outputting the result directly like the Linear Regression model does, it outputs the logistic of this result (see Equation 4-13).

Equation 4-13. Logistic Regression model estimated probability (vectorized form)
$\hat{p} = h_{\boldsymbol{\theta}}(\mathbf{x}) = \sigma(\boldsymbol{\theta}^T \cdot \mathbf{x})$

The logistic—also called the logit, noted σ(·)—is a sigmoid function (i.e., S-shaped) that outputs a number between 0 and 1. It is defined as shown in Equation 4-14 and Figure 4-21.

Equation 4-14. Logistic function
$\sigma(t) = \dfrac{1}{1 + \exp(-t)}$

Figure 4-21. Logistic function

Once the Logistic Regression model has estimated the probability p̂ = hθ(x) that an instance x belongs to the positive class, it can make its prediction ŷ easily (see Equation 4-15).

Equation 4-15. Logistic Regression model prediction
$\hat{y} = \begin{cases} 0 & \text{if } \hat{p} < 0.5, \\ 1 & \text{if } \hat{p} \geq 0.5. \end{cases}$

Notice that σ(t) < 0.5 when t < 0, and σ(t) ≥ 0.5 when t ≥ 0, so a Logistic Regression model predicts 1 if θT · x is positive, and 0 if it is negative.
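In NumPy terms, this probability estimate and decision rule might look as follows (the parameter values here are made up for illustration only):

import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))          # Equation 4-14

theta = np.array([[-4.0], [2.5]])        # hypothetical parameters: bias and one feature weight
x = np.array([[1.0], [1.7]])             # x0 = 1 (bias feature), x1 = 1.7
p_hat = sigmoid(theta.T.dot(x))          # Equation 4-13: estimated probability of the positive class
y_pred = (p_hat >= 0.5).astype(int)      # Equation 4-15: predict 1 iff theta^T·x >= 0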

Training and Cost Function

Good, now you know how a Logistic Regression model estimates probabilities and makes predictions. But how is it trained? The objective of training is to set the parameter vector θ so that the model estimates high probabilities for positive instances (y = 1) and low probabilities for negative instances (y = 0). This idea is captured by the cost function shown in Equation 4-16 for a single training instance x.

Equation 4-16. Cost function of a single training instance
$c(\boldsymbol{\theta}) = \begin{cases} -\log(\hat{p}) & \text{if } y = 1, \\ -\log(1 - \hat{p}) & \text{if } y = 0. \end{cases}$

This cost function makes sense because –log(t) grows very large when t approaches 0, so the cost will be large if the model estimates a probability close to 0 for a positive instance, and it will also be very large if the model estimates a probability close to 1 for a negative instance. On the other hand, –log(t) is close to 0 when t is close to 1, so the cost will be close to 0 if the estimated probability is close to 0 for a negative instance or close to 1 for a positive instance, which is precisely what we want.

The cost function over the whole training set is simply the average cost over all training instances. It can be written in a single expression (as you can verify easily), called the log loss, shown in Equation 4-17.

Equation 4-17. Logistic Regression cost function (log loss)
$J(\boldsymbol{\theta}) = -\dfrac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\left(\hat{p}^{(i)}\right) + \left(1 - y^{(i)}\right)\log\left(1 - \hat{p}^{(i)}\right)\right]$

The bad news is that there is no known closed-form equation to compute the value of θ that minimizes this cost function (there is no equivalent of the Normal Equation). But the good news is that this cost function is convex, so Gradient Descent (or any other optimization algorithm) is guaranteed to find the global minimum (if the learning rate is not too large and you wait long enough). The partial derivative of the cost function with regards to the jth model parameter θj is given by Equation 4-18.

Equation 4-18. Logistic cost function partial derivatives
$\dfrac{\partial}{\partial \theta_j}J(\boldsymbol{\theta}) = \dfrac{1}{m}\sum_{i=1}^{m}\left(\sigma\!\left(\boldsymbol{\theta}^T \cdot \mathbf{x}^{(i)}\right) - y^{(i)}\right)x_j^{(i)}$

This equation looks very much like Equation 4-5: for each instance it computes the prediction error and multiplies it by the jth feature value, and then it computes the average over all training instances. Once you have the gradient vector containing all the partial derivatives you can use it in the Batch Gradient Descent algorithm. That's it: you now know how to train a Logistic Regression model. For Stochastic GD you would of course just take one instance at a time, and for Mini-batch GD you would use a mini-batch at a time.
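Putting Equations 4-13 and 4-18 together, a minimal Batch Gradient Descent sketch for Logistic Regression could look like this (it assumes X_b is a training matrix that already includes a bias column and y is a column vector of 0/1 labels):

import numpy as np

eta = 0.1
n_iterations = 1000
m = len(X_b)
theta = np.random.randn(X_b.shape[1], 1)         # random initialization
for iteration in range(n_iterations):
    p_hat = 1 / (1 + np.exp(-X_b.dot(theta)))    # sigma(theta^T·x) for every instance
    gradients = 1/m * X_b.T.dot(p_hat - y)       # Equation 4-18, all parameters at once
    theta = theta - eta * gradients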

Decision Boundaries

Let's use the iris dataset to illustrate Logistic Regression. This is a famous dataset that contains the sepal and petal length and width of 150 iris flowers of three different species: Iris-Setosa, Iris-Versicolor, and Iris-Virginica (see Figure 4-22).

Figure 4-22. Flowers of three iris plant species16

Let’strytobuildaclassifiertodetecttheIris-Virginicatypebasedonlyonthepetalwidthfeature.Firstlet’sloadthedata:

>>>fromsklearnimportdatasets

>>>iris=datasets.load_iris()

>>>list(iris.keys())

['data','target_names','feature_names','target','DESCR']

>>>X=iris["data"][:,3:]#petalwidth

>>>y=(iris["target"]==2).astype(np.int)#1ifIris-Virginica,else0

Nowlet’strainaLogisticRegressionmodel:

fromsklearn.linear_modelimportLogisticRegression

log_reg=LogisticRegression()

log_reg.fit(X,y)

Let’slookatthemodel’sestimatedprobabilitiesforflowerswithpetalwidthsvaryingfrom0to3cm(Figure4-23):

X_new=np.linspace(0,3,1000).reshape(-1,1)

y_proba=log_reg.predict_proba(X_new)

plt.plot(X_new,y_proba[:,1],"g-",label="Iris-Virginica")

plt.plot(X_new,y_proba[:,0],"b--",label="NotIris-Virginica")

#+moreMatplotlibcodetomaketheimagelookpretty

Figure 4-23. Estimated probabilities and decision boundary

ThepetalwidthofIris-Virginicaflowers(representedbytriangles)rangesfrom1.4cmto2.5cm,whiletheotheririsflowers(representedbysquares)generallyhaveasmallerpetalwidth,rangingfrom0.1cmto1.8cm.Noticethatthereisabitofoverlap.Aboveabout2cmtheclassifierishighlyconfidentthattheflowerisanIris-Virginica(itoutputsahighprobabilitytothatclass),whilebelow1cmitishighlyconfidentthatitisnotanIris-Virginica(highprobabilityforthe“NotIris-Virginica”class).Inbetweentheseextremes,theclassifierisunsure.However,ifyouaskittopredicttheclass(usingthepredict()methodratherthanthepredict_proba()method),itwillreturnwhicheverclassisthemostlikely.Therefore,thereisadecisionboundaryataround1.6cmwherebothprobabilitiesareequalto50%:ifthepetalwidthishigherthan1.6cm,theclassifierwillpredictthattheflowerisanIris-Virginica,orelseitwillpredictthatitisnot(evenifitisnotveryconfident):

>>> log_reg.predict([[1.7], [1.5]])
array([1, 0])
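If you want to locate this boundary numerically, one way is to look for the first petal width whose estimated probability reaches 50%, using the X_new and y_proba arrays computed above:

>>> decision_boundary = X_new[y_proba[:, 1] >= 0.5][0]  # first width with P(Iris-Virginica) >= 50%
>>> decision_boundary  # roughly 1.6 cm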

Figure4-24showsthesamedatasetbutthistimedisplayingtwofeatures:petalwidthandlength.Oncetrained,theLogisticRegressionclassifiercanestimatetheprobabilitythatanewflowerisanIris-Virginicabasedonthesetwofeatures.Thedashedlinerepresentsthepointswherethemodelestimatesa50%probability:thisisthemodel’sdecisionboundary.Notethatitisalinearboundary.17Eachparallellinerepresentsthepointswherethemodeloutputsaspecificprobability,from15%(bottomleft)to90%(topright).Alltheflowersbeyondthetop-rightlinehaveanover90%chanceofbeingIris-Virginicaaccordingtothemodel.

Figure 4-24. Linear decision boundary

Just like the other linear models, Logistic Regression models can be regularized using ℓ1 or ℓ2 penalties. Scikit-Learn actually adds an ℓ2 penalty by default.

NOTE: The hyperparameter controlling the regularization strength of a Scikit-Learn LogisticRegression model is not alpha (as in other linear models), but its inverse: C. The higher the value of C, the less the model is regularized.

Softmax Regression

The Logistic Regression model can be generalized to support multiple classes directly, without having to train and combine multiple binary classifiers (as discussed in Chapter 3). This is called Softmax Regression, or Multinomial Logistic Regression.

The idea is quite simple: when given an instance x, the Softmax Regression model first computes a score sk(x) for each class k, then estimates the probability of each class by applying the softmax function (also called the normalized exponential) to the scores. The equation to compute sk(x) should look familiar, as it is just like the equation for Linear Regression prediction (see Equation 4-19).

Equation 4-19. Softmax score for class k
$s_k(\mathbf{x}) = \left(\boldsymbol{\theta}^{(k)}\right)^T \cdot \mathbf{x}$

Note that each class has its own dedicated parameter vector θ(k). All these vectors are typically stored as rows in a parameter matrix Θ.

Once you have computed the score of every class for the instance x, you can estimate the probability p̂k that the instance belongs to class k by running the scores through the softmax function (Equation 4-20): it computes the exponential of every score, then normalizes them (dividing by the sum of all the exponentials).

Equation 4-20. Softmax function
$\hat{p}_k = \sigma(\mathbf{s}(\mathbf{x}))_k = \dfrac{\exp\left(s_k(\mathbf{x})\right)}{\sum_{j=1}^{K}\exp\left(s_j(\mathbf{x})\right)}$

K is the number of classes.

s(x) is a vector containing the scores of each class for the instance x.

σ(s(x))k is the estimated probability that the instance x belongs to class k given the scores of each class for that instance.

Just like the Logistic Regression classifier, the Softmax Regression classifier predicts the class with the highest estimated probability (which is simply the class with the highest score), as shown in Equation 4-21.

Equation 4-21. Softmax Regression classifier prediction
$\hat{y} = \underset{k}{\operatorname{argmax}}\ \sigma(\mathbf{s}(\mathbf{x}))_k = \underset{k}{\operatorname{argmax}}\ s_k(\mathbf{x}) = \underset{k}{\operatorname{argmax}}\ \left(\left(\boldsymbol{\theta}^{(k)}\right)^T \cdot \mathbf{x}\right)$

The argmax operator returns the value of a variable that maximizes a function. In this equation, it returns the value of k that maximizes the estimated probability σ(s(x))k.
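As a rough NumPy translation of Equations 4-19 to 4-21 (the parameter matrix Theta and the instance x_b below are random placeholders):

import numpy as np

K, n = 3, 2                                   # e.g., 3 classes and 2 features
Theta = np.random.randn(K, n + 1)             # one parameter row vector theta^(k) per class
x_b = np.r_[1.0, np.random.rand(n)]           # instance feature vector with bias feature x0 = 1

scores = Theta.dot(x_b)                       # s_k(x) for every class k (Equation 4-19)
exps = np.exp(scores - scores.max())          # subtract the max for numerical stability
class_probabilities = exps / exps.sum()       # softmax (Equation 4-20)
y_pred = np.argmax(scores)                    # Equation 4-21: highest score = highest probability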

TIP: The Softmax Regression classifier predicts only one class at a time (i.e., it is multiclass, not multioutput) so it should be used only with mutually exclusive classes such as different types of plants. You cannot use it to recognize multiple people in one picture.

Now that you know how the model estimates probabilities and makes predictions, let's take a look at training. The objective is to have a model that estimates a high probability for the target class (and consequently a low probability for the other classes). Minimizing the cost function shown in Equation 4-22, called the cross entropy, should lead to this objective because it penalizes the model when it estimates a low probability for a target class. Cross entropy is frequently used to measure how well a set of estimated class probabilities match the target classes (we will use it again several times in the following chapters).

Equation 4-22. Cross entropy cost function
$J(\boldsymbol{\Theta}) = -\dfrac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K} y_k^{(i)}\log\left(\hat{p}_k^{(i)}\right)$

y_k^(i) is equal to 1 if the target class for the ith instance is k; otherwise, it is equal to 0.

Notice that when there are just two classes (K = 2), this cost function is equivalent to the Logistic Regression's cost function (log loss; see Equation 4-17).

CROSS ENTROPY

Cross entropy originated from information theory. Suppose you want to efficiently transmit information about the weather every day. If there are eight options (sunny, rainy, etc.), you could encode each option using 3 bits since 2³ = 8. However, if you think it will be sunny almost every day, it would be much more efficient to code "sunny" on just one bit (0) and the other seven options on 4 bits (starting with a 1). Cross entropy measures the average number of bits you actually send per option. If your assumption about the weather is perfect, cross entropy will just be equal to the entropy of the weather itself (i.e., its intrinsic unpredictability). But if your assumptions are wrong (e.g., if it rains often), cross entropy will be greater by an amount called the Kullback–Leibler divergence.

The cross entropy between two probability distributions p and q is defined as $H(p, q) = -\sum_{x} p(x)\log q(x)$ (at least when the distributions are discrete).

The gradient vector of this cost function with regards to θ(k) is given by Equation 4-23:

Equation 4-23. Cross entropy gradient vector for class k
$\nabla_{\boldsymbol{\theta}^{(k)}} J(\boldsymbol{\Theta}) = \dfrac{1}{m}\sum_{i=1}^{m}\left(\hat{p}_k^{(i)} - y_k^{(i)}\right)\mathbf{x}^{(i)}$

Now you can compute the gradient vector for every class, then use Gradient Descent (or any other optimization algorithm) to find the parameter matrix Θ that minimizes the cost function.

Let’suseSoftmaxRegressiontoclassifytheirisflowersintoallthreeclasses.Scikit-Learn’sLogisticRegressionusesone-versus-allbydefaultwhenyoutrainitonmorethantwoclasses,butyoucansetthemulti_classhyperparameterto"multinomial"toswitchittoSoftmaxRegressioninstead.YoumustalsospecifyasolverthatsupportsSoftmaxRegression,suchasthe"lbfgs"solver(seeScikit-Learn’sdocumentationformoredetails).Italsoappliesℓ2regularizationbydefault,whichyoucancontrolusingthehyperparameterC.

X=iris["data"][:,(2,3)]#petallength,petalwidth

y=iris["target"]

softmax_reg=LogisticRegression(multi_class="multinomial",solver="lbfgs",C=10)

softmax_reg.fit(X,y)

So the next time you find an iris with 5 cm long and 2 cm wide petals, you can ask your model to tell you what type of iris it is, and it will answer Iris-Virginica (class 2) with 94.2% probability (or Iris-Versicolor with 5.8% probability):

>>> softmax_reg.predict([[5, 2]])
array([2])
>>> softmax_reg.predict_proba([[5, 2]])
array([[6.33134078e-07, 5.75276067e-02, 9.42471760e-01]])

Figure4-25showstheresultingdecisionboundaries,representedbythebackgroundcolors.Noticethatthedecisionboundariesbetweenanytwoclassesarelinear.ThefigurealsoshowstheprobabilitiesfortheIris-Versicolorclass,representedbythecurvedlines(e.g.,thelinelabeledwith0.450representsthe45%probabilityboundary).Noticethatthemodelcanpredictaclassthathasanestimatedprobabilitybelow50%.Forexample,atthepointwherealldecisionboundariesmeet,allclasseshaveanequalestimatedprobabilityof33%.

Figure 4-25. Softmax Regression decision boundaries

Exercises

1. What Linear Regression training algorithm can you use if you have a training set with millions of features?

2. Suppose the features in your training set have very different scales. What algorithms might suffer from this, and how? What can you do about it?

3. Can Gradient Descent get stuck in a local minimum when training a Logistic Regression model?

4. Do all Gradient Descent algorithms lead to the same model provided you let them run long enough?

5. Suppose you use Batch Gradient Descent and you plot the validation error at every epoch. If you notice that the validation error consistently goes up, what is likely going on? How can you fix this?

6. Is it a good idea to stop Mini-batch Gradient Descent immediately when the validation error goes up?

7. Which Gradient Descent algorithm (among those we discussed) will reach the vicinity of the optimal solution the fastest? Which will actually converge? How can you make the others converge as well?

8. Suppose you are using Polynomial Regression. You plot the learning curves and you notice that there is a large gap between the training error and the validation error. What is happening? What are three ways to solve this?

9. Suppose you are using Ridge Regression and you notice that the training error and the validation error are almost equal and fairly high. Would you say that the model suffers from high bias or high variance? Should you increase the regularization hyperparameter α or reduce it?

10. Why would you want to use: Ridge Regression instead of plain Linear Regression (i.e., without any regularization)? Lasso instead of Ridge Regression? Elastic Net instead of Lasso?

11. Suppose you want to classify pictures as outdoor/indoor and daytime/nighttime. Should you implement two Logistic Regression classifiers or one Softmax Regression classifier?

12. Implement Batch Gradient Descent with early stopping for Softmax Regression (without using Scikit-Learn).

Solutions to these exercises are available in Appendix A.

1. It is often the case that a learning algorithm will try to optimize a different function than the performance measure used to evaluate the final model. This is generally because that function is easier to compute, because it has useful differentiation properties that the performance measure lacks, or because we want to constrain the model during training, as we will see when we discuss regularization.

2. The demonstration that this returns the value of θ that minimizes the cost function is outside the scope of this book.

3. Note that Scikit-Learn separates the bias term (intercept_) from the feature weights (coef_).

4. Technically speaking, its derivative is Lipschitz continuous.

5. Since feature 1 is smaller, it takes a larger change in θ1 to affect the cost function, which is why the bowl is elongated along the θ1 axis.

6. Eta (η) is the 7th letter of the Greek alphabet.

7. Out-of-core algorithms are discussed in Chapter 1.

8. While the Normal Equation can only perform Linear Regression, the Gradient Descent algorithms can be used to train many other models, as we will see.

9. A quadratic equation is of the form y = ax² + bx + c.

10. This notion of bias is not to be confused with the bias term of linear models.

11. It is common to use the notation J(θ) for cost functions that don't have a short name; we will often use this notation throughout the rest of this book. The context will make it clear which cost function is being discussed.

12. Norms are discussed in Chapter 2.

13. A square matrix full of 0s except for 1s on the main diagonal (top-left to bottom-right).

14. Alternatively you can use the Ridge class with the "sag" solver. Stochastic Average GD is a variant of SGD. For more details, see the presentation "Minimizing Finite Sums with the Stochastic Average Gradient Algorithm" by Mark Schmidt et al. from the University of British Columbia.

15. You can think of a subgradient vector at a nondifferentiable point as an intermediate vector between the gradient vectors around that point.

16. Photos reproduced from the corresponding Wikipedia pages. Iris-Virginica photo by Frank Mayfield (Creative Commons BY-SA 2.0), Iris-Versicolor photo by D. Gordon E. Robertson (Creative Commons BY-SA 3.0), and Iris-Setosa photo is public domain.

17. It is the set of points x such that θ0 + θ1x1 + θ2x2 = 0, which defines a straight line.

Chapter 5. Support Vector Machines

ASupportVectorMachine(SVM)isaverypowerfulandversatileMachineLearningmodel,capableofperforminglinearornonlinearclassification,regression,andevenoutlierdetection.ItisoneofthemostpopularmodelsinMachineLearning,andanyoneinterestedinMachineLearningshouldhaveitintheirtoolbox.SVMsareparticularlywellsuitedforclassificationofcomplexbutsmall-ormedium-sizeddatasets.

ThischapterwillexplainthecoreconceptsofSVMs,howtousethem,andhowtheywork.

LinearSVMClassificationThefundamentalideabehindSVMsisbestexplainedwithsomepictures.Figure5-1showspartoftheirisdatasetthatwasintroducedattheendofChapter4.Thetwoclassescanclearlybeseparatedeasilywithastraightline(theyarelinearlyseparable).Theleftplotshowsthedecisionboundariesofthreepossiblelinearclassifiers.Themodelwhosedecisionboundaryisrepresentedbythedashedlineissobadthatitdoesnotevenseparatetheclassesproperly.Theothertwomodelsworkperfectlyonthistrainingset,buttheirdecisionboundariescomesoclosetotheinstancesthatthesemodelswillprobablynotperformaswellonnewinstances.Incontrast,thesolidlineintheplotontherightrepresentsthedecisionboundaryofanSVMclassifier;thislinenotonlyseparatesthetwoclassesbutalsostaysasfarawayfromtheclosesttraininginstancesaspossible.YoucanthinkofanSVMclassifierasfittingthewidestpossiblestreet(representedbytheparalleldashedlines)betweentheclasses.Thisiscalledlargemarginclassification.

Figure 5-1. Large margin classification

Noticethataddingmoretraininginstances“offthestreet”willnotaffectthedecisionboundaryatall:itisfullydetermined(or“supported”)bytheinstanceslocatedontheedgeofthestreet.Theseinstancesarecalledthesupportvectors(theyarecircledinFigure5-1).

WARNING: SVMs are sensitive to the feature scales, as you can see in Figure 5-2: on the left plot, the vertical scale is much larger than the horizontal scale, so the widest possible street is close to horizontal. After feature scaling (e.g., using Scikit-Learn's StandardScaler), the decision boundary looks much better (on the right plot).

Figure 5-2. Sensitivity to feature scales

SoftMarginClassificationIfwestrictlyimposethatallinstancesbeoffthestreetandontherightside,thisiscalledhardmarginclassification.Therearetwomainissueswithhardmarginclassification.First,itonlyworksifthedataislinearlyseparable,andseconditisquitesensitivetooutliers.Figure5-3showstheirisdatasetwithjustoneadditionaloutlier:ontheleft,itisimpossibletofindahardmargin,andontherightthedecisionboundaryendsupverydifferentfromtheonewesawinFigure5-1withouttheoutlier,anditwillprobablynotgeneralizeaswell.

Figure 5-3. Hard margin sensitivity to outliers

Toavoidtheseissuesitispreferabletouseamoreflexiblemodel.Theobjectiveistofindagoodbalancebetweenkeepingthestreetaslargeaspossibleandlimitingthemarginviolations(i.e.,instancesthatendupinthemiddleofthestreetorevenonthewrongside).Thisiscalledsoftmarginclassification.

InScikit-Learn’sSVMclasses,youcancontrolthisbalanceusingtheChyperparameter:asmallerCvalueleadstoawiderstreetbutmoremarginviolations.Figure5-4showsthedecisionboundariesandmarginsoftwosoftmarginSVMclassifiersonanonlinearlyseparabledataset.Ontheleft,usingahighCvaluetheclassifiermakesfewermarginviolationsbutendsupwithasmallermargin.Ontheright,usingalowCvaluethemarginismuchlarger,butmanyinstancesenduponthestreet.However,itseemslikelythatthesecondclassifierwillgeneralizebetter:infactevenonthistrainingsetitmakesfewerpredictionerrors,sincemostofthemarginviolationsareactuallyonthecorrectsideofthedecisionboundary.

Figure 5-4. Fewer margin violations versus large margin

TIP: If your SVM model is overfitting, you can try regularizing it by reducing C.

The following Scikit-Learn code loads the iris dataset, scales the features, and then trains a linear SVM model (using the LinearSVC class with C=1 and the hinge loss function, described shortly) to detect Iris-Virginica flowers. The resulting model is represented on the right of Figure 5-4.

import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X = iris["data"][:, (2, 3)]  # petal length, petal width
y = (iris["target"] == 2).astype(np.float64)  # Iris-Virginica

svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("linear_svc", LinearSVC(C=1, loss="hinge")),
])
svm_clf.fit(X, y)

Then, as usual, you can use the model to make predictions:

>>> svm_clf.predict([[5.5, 1.7]])
array([1.])

NOTE: Unlike Logistic Regression classifiers, SVM classifiers do not output probabilities for each class.

Alternatively, you could use the SVC class, using SVC(kernel="linear", C=1), but it is much slower, especially with large training sets, so it is not recommended. Another option is to use the SGDClassifier class, with SGDClassifier(loss="hinge", alpha=1/(m*C)). This applies regular Stochastic Gradient Descent (see Chapter 4) to train a linear SVM classifier. It does not converge as fast as the LinearSVC class, but it can be useful to handle huge datasets that do not fit in memory (out-of-core training), or to handle online classification tasks.
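For example, the SGDClassifier alternative mentioned above could be set up as follows (a sketch using the same X, y, and C as in the LinearSVC example; unlike the pipeline above, it does not scale the features for you):

from sklearn.linear_model import SGDClassifier

m = len(X)  # number of training instances
C = 1
sgd_svm_clf = SGDClassifier(loss="hinge", alpha=1/(m*C))
sgd_svm_clf.fit(X, y)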

TIP: The LinearSVC class regularizes the bias term, so you should center the training set first by subtracting its mean. This is automatic if you scale the data using the StandardScaler. Moreover, make sure you set the loss hyperparameter to "hinge", as it is not the default value. Finally, for better performance you should set the dual hyperparameter to False, unless there are more features than training instances (we will discuss duality later in the chapter).

Nonlinear SVM Classification

Although linear SVM classifiers are efficient and work surprisingly well in many cases, many datasets are not even close to being linearly separable. One approach to handling nonlinear datasets is to add more features, such as polynomial features (as you did in Chapter 4); in some cases this can result in a linearly separable dataset. Consider the left plot in Figure 5-5: it represents a simple dataset with just one feature x1. This dataset is not linearly separable, as you can see. But if you add a second feature x2 = (x1)², the resulting 2D dataset is perfectly linearly separable.

Figure 5-5. Adding features to make a dataset linearly separable

To implement this idea using Scikit-Learn, you can create a Pipeline containing a PolynomialFeatures transformer (discussed in "Polynomial Regression"), followed by a StandardScaler and a LinearSVC. Let's test this on the moons dataset (see Figure 5-6):

from sklearn.datasets import make_moons
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

X, y = make_moons(n_samples=100, noise=0.15)  # generate the moons dataset used below

polynomial_svm_clf = Pipeline([
    ("poly_features", PolynomialFeatures(degree=3)),
    ("scaler", StandardScaler()),
    ("svm_clf", LinearSVC(C=10, loss="hinge"))
])
polynomial_svm_clf.fit(X, y)

Figure 5-6. Linear SVM classifier using polynomial features

Polynomial Kernel

Adding polynomial features is simple to implement and can work great with all sorts of Machine Learning algorithms (not just SVMs), but at a low polynomial degree it cannot deal with very complex datasets, and with a high polynomial degree it creates a huge number of features, making the model too slow.

Fortunately, when using SVMs you can apply an almost miraculous mathematical technique called the kernel trick (it is explained in a moment). It makes it possible to get the same result as if you added many polynomial features, even with very high-degree polynomials, without actually having to add them. So there is no combinatorial explosion of the number of features since you don't actually add any features. This trick is implemented by the SVC class. Let's test it on the moons dataset:

from sklearn.svm import SVC

poly_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="poly", degree=3, coef0=1, C=5))
])
poly_kernel_svm_clf.fit(X, y)

This code trains an SVM classifier using a 3rd-degree polynomial kernel. It is represented on the left of Figure 5-7. On the right is another SVM classifier using a 10th-degree polynomial kernel. Obviously, if your model is overfitting, you might want to reduce the polynomial degree. Conversely, if it is underfitting, you can try increasing it. The hyperparameter coef0 controls how much the model is influenced by high-degree polynomials versus low-degree polynomials.

Figure 5-7. SVM classifiers with a polynomial kernel

TIP
A common approach to find the right hyperparameter values is to use grid search (see Chapter 2). It is often faster to first do a very coarse grid search, then a finer grid search around the best values found. Having a good sense of what each hyperparameter actually does can also help you search in the right part of the hyperparameter space.
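For example, a coarse grid search over this kernel's hyperparameters could look like the following minimal sketch (the grid values are illustrative assumptions, not recommendations; X and y are the moons data used above):

from sklearn.model_selection import GridSearchCV

param_grid = {
    "svm_clf__degree": [2, 3, 5],
    "svm_clf__coef0": [0, 1, 10],
    "svm_clf__C": [0.1, 1, 10],
}
grid_search = GridSearchCV(poly_kernel_svm_clf, param_grid, cv=3)
grid_search.fit(X, y)
print(grid_search.best_params_)  # then run a finer search around these values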

Adding Similarity Features

Another technique to tackle nonlinear problems is to add features computed using a similarity function that measures how much each instance resembles a particular landmark. For example, let's take the one-dimensional dataset discussed earlier and add two landmarks to it at x1 = -2 and x1 = 1 (see the left plot in Figure 5-8). Next, let's define the similarity function to be the Gaussian Radial Basis Function (RBF) with γ = 0.3 (see Equation 5-1).

Equation 5-1. Gaussian RBF

    φγ(x, ℓ) = exp(-γ ∥x - ℓ∥²)

It is a bell-shaped function varying from 0 (very far away from the landmark ℓ) to 1 (at the landmark). Now we are ready to compute the new features. For example, let's look at the instance x1 = -1: it is located at a distance of 1 from the first landmark, and 2 from the second landmark. Therefore its new features are x2 = exp(-0.3 × 1²) ≈ 0.74 and x3 = exp(-0.3 × 2²) ≈ 0.30. The plot on the right of Figure 5-8 shows the transformed dataset (dropping the original features). As you can see, it is now linearly separable.

Figure 5-8. Similarity features using the Gaussian RBF
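As a quick check of those numbers, here is a minimal NumPy sketch using the landmark values and γ from above:

import numpy as np

def gaussian_rbf(x, landmark, gamma=0.3):
    # Equation 5-1: similarity between instance x and a landmark
    return np.exp(-gamma * np.sum((x - landmark) ** 2))

x = np.array([-1.0])
print(gaussian_rbf(x, np.array([-2.0])))  # ≈ 0.74
print(gaussian_rbf(x, np.array([1.0])))   # ≈ 0.30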

You may wonder how to select the landmarks. The simplest approach is to create a landmark at the location of each and every instance in the dataset. This creates many dimensions and thus increases the chances that the transformed training set will be linearly separable. The downside is that a training set with m instances and n features gets transformed into a training set with m instances and m features (assuming you drop the original features). If your training set is very large, you end up with an equally large number of features.

Gaussian RBF Kernel

Just like the polynomial features method, the similarity features method can be useful with any Machine Learning algorithm, but it may be computationally expensive to compute all the additional features, especially on large training sets. However, once again the kernel trick does its SVM magic: it makes it possible to obtain a similar result as if you had added many similarity features, without actually having to add them. Let's try the Gaussian RBF kernel using the SVC class:

rbf_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="rbf", gamma=5, C=0.001))
])

rbf_kernel_svm_clf.fit(X, y)

This model is represented on the bottom left of Figure 5-9. The other plots show models trained with different values of hyperparameters gamma (γ) and C. Increasing gamma makes the bell-shaped curve narrower (see the left plot of Figure 5-8), and as a result each instance's range of influence is smaller: the decision boundary ends up being more irregular, wiggling around individual instances. Conversely, a small gamma value makes the bell-shaped curve wider, so instances have a larger range of influence, and the decision boundary ends up smoother. So γ acts like a regularization hyperparameter: if your model is overfitting, you should reduce it, and if it is underfitting, you should increase it (similar to the C hyperparameter).

Figure 5-9. SVM classifiers using an RBF kernel

Other kernels exist but are used much more rarely. For example, some kernels are specialized for specific data structures. String kernels are sometimes used when classifying text documents or DNA sequences (e.g., using the string subsequence kernel or kernels based on the Levenshtein distance).

TIP
With so many kernels to choose from, how can you decide which one to use? As a rule of thumb, you should always try the linear kernel first (remember that LinearSVC is much faster than SVC(kernel="linear")), especially if the training set is very large or if it has plenty of features. If the training set is not too large, you should try the Gaussian RBF kernel as well; it works well in most cases. Then if you have spare time and computing power, you can also experiment with a few other kernels using cross-validation and grid search, especially if there are kernels specialized for your training set's data structure.

Computational Complexity

The LinearSVC class is based on the liblinear library, which implements an optimized algorithm for linear SVMs.1 It does not support the kernel trick, but it scales almost linearly with the number of training instances and the number of features: its training time complexity is roughly O(m × n).

The algorithm takes longer if you require a very high precision. This is controlled by the tolerance hyperparameter ϵ (called tol in Scikit-Learn). In most classification tasks, the default tolerance is fine.

The SVC class is based on the libsvm library, which implements an algorithm that supports the kernel trick.2 The training time complexity is usually between O(m² × n) and O(m³ × n). Unfortunately, this means that it gets dreadfully slow when the number of training instances gets large (e.g., hundreds of thousands of instances). This algorithm is perfect for complex but small or medium training sets. However, it scales well with the number of features, especially with sparse features (i.e., when each instance has few nonzero features). In this case, the algorithm scales roughly with the average number of nonzero features per instance. Table 5-1 compares Scikit-Learn's SVM classification classes.

Table 5-1. Comparison of Scikit-Learn classes for SVM classification

Class           Time complexity            Out-of-core support   Scaling required   Kernel trick
LinearSVC       O(m × n)                   No                    Yes                No
SGDClassifier   O(m × n)                   Yes                   Yes                No
SVC             O(m² × n) to O(m³ × n)     No                    Yes                Yes

SVM Regression

As we mentioned earlier, the SVM algorithm is quite versatile: not only does it support linear and nonlinear classification, but it also supports linear and nonlinear regression. The trick is to reverse the objective: instead of trying to fit the largest possible street between two classes while limiting margin violations, SVM Regression tries to fit as many instances as possible on the street while limiting margin violations (i.e., instances off the street). The width of the street is controlled by a hyperparameter ϵ. Figure 5-10 shows two linear SVM Regression models trained on some random linear data, one with a large margin (ϵ = 1.5) and the other with a small margin (ϵ = 0.5).

Figure 5-10. SVM Regression

Adding more training instances within the margin does not affect the model's predictions; thus, the model is said to be ϵ-insensitive.

You can use Scikit-Learn's LinearSVR class to perform linear SVM Regression. The following code produces the model represented on the left of Figure 5-10 (the training data should be scaled and centered first):

from sklearn.svm import LinearSVR

svm_reg = LinearSVR(epsilon=1.5)
svm_reg.fit(X, y)
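Since the training data should be scaled and centered first, one convenient option is to wrap the regressor in a pipeline, as in this minimal sketch (assuming X and y hold the random linear data used for Figure 5-10):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

svm_reg_scaled = Pipeline([
    ("scaler", StandardScaler()),            # scales and centers the features
    ("linear_svr", LinearSVR(epsilon=1.5)),
])
svm_reg_scaled.fit(X, y)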

To tackle nonlinear regression tasks, you can use a kernelized SVM model. For example, Figure 5-11 shows SVM Regression on a random quadratic training set, using a 2nd-degree polynomial kernel. There is little regularization on the left plot (i.e., a large C value), and much more regularization on the right plot (i.e., a small C value).

Figure 5-11. SVM regression using a 2nd-degree polynomial kernel

The following code produces the model represented on the left of Figure 5-11 using Scikit-Learn's SVR class (which supports the kernel trick). The SVR class is the regression equivalent of the SVC class, and the LinearSVR class is the regression equivalent of the LinearSVC class. The LinearSVR class scales linearly with the size of the training set (just like the LinearSVC class), while the SVR class gets much too slow when the training set grows large (just like the SVC class).

from sklearn.svm import SVR

svm_poly_reg = SVR(kernel="poly", degree=2, C=100, epsilon=0.1)
svm_poly_reg.fit(X, y)

NOTE
SVMs can also be used for outlier detection; see Scikit-Learn's documentation for more details.

Under the Hood

This section explains how SVMs make predictions and how their training algorithms work, starting with linear SVM classifiers. You can safely skip it and go straight to the exercises at the end of this chapter if you are just getting started with Machine Learning, and come back later when you want to get a deeper understanding of SVMs.

First, a word about notations: in Chapter 4 we used the convention of putting all the model parameters in one vector θ, including the bias term θ0 and the input feature weights θ1 to θn, and adding a bias input x0 = 1 to all instances. In this chapter, we will use a different convention, which is more convenient (and more common) when you are dealing with SVMs: the bias term will be called b and the feature weights vector will be called w. No bias feature will be added to the input feature vectors.

Decision Function and Predictions

The linear SVM classifier model predicts the class of a new instance x by simply computing the decision function wᵀ · x + b = w1 x1 + ⋯ + wn xn + b: if the result is positive, the predicted class ŷ is the positive class (1), or else it is the negative class (0); see Equation 5-2.

Equation 5-2. Linear SVM classifier prediction

    ŷ = 0 if wᵀ · x + b < 0
    ŷ = 1 if wᵀ · x + b ≥ 0

Figure 5-12 shows the decision function that corresponds to the model on the right of Figure 5-4: it is a two-dimensional plane since this dataset has two features (petal width and petal length). The decision boundary is the set of points where the decision function is equal to 0: it is the intersection of two planes, which is a straight line (represented by the thick solid line).3

Figure 5-12. Decision function for the iris dataset

The dashed lines represent the points where the decision function is equal to 1 or -1: they are parallel and at equal distance to the decision boundary, forming a margin around it. Training a linear SVM classifier means finding the values of w and b that make this margin as wide as possible while avoiding margin violations (hard margin) or limiting them (soft margin).
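To make this concrete, here is a minimal sketch that inspects w and b on the svm_clf pipeline trained earlier and evaluates the decision function by hand (the instance [5.5, 1.7] is the same example used above; note that the weights live in the scaled feature space, so we scale the instance first):

scaler = svm_clf.named_steps["scaler"]
linear_svc = svm_clf.named_steps["linear_svc"]

w = linear_svc.coef_[0]       # feature weights
b = linear_svc.intercept_[0]  # bias term

x_scaled = scaler.transform([[5.5, 1.7]])[0]
print(w.dot(x_scaled) + b)                      # positive, so the predicted class is 1
print(svm_clf.decision_function([[5.5, 1.7]]))  # same value, computed by the pipeline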

Training Objective

Consider the slope of the decision function: it is equal to the norm of the weight vector, ∥w∥. If we divide this slope by 2, the points where the decision function is equal to ±1 are going to be twice as far away from the decision boundary. In other words, dividing the slope by 2 will multiply the margin by 2. Perhaps this is easier to visualize in 2D in Figure 5-13. The smaller the weight vector w, the larger the margin.

Figure 5-13. A smaller weight vector results in a larger margin

So we want to minimize ∥w∥ to get a large margin. However, if we also want to avoid any margin violation (hard margin), then we need the decision function to be greater than 1 for all positive training instances, and lower than -1 for negative training instances. If we define t(i) = -1 for negative instances (if y(i) = 0) and t(i) = 1 for positive instances (if y(i) = 1), then we can express this constraint as t(i)(wᵀ · x(i) + b) ≥ 1 for all instances.

We can therefore express the hard margin linear SVM classifier objective as the constrained optimization problem in Equation 5-3.

Equation 5-3. Hard margin linear SVM classifier objective

    minimize (over w, b)   ½ wᵀ · w
    subject to             t(i)(wᵀ · x(i) + b) ≥ 1   for i = 1, 2, …, m

NOTE
We are minimizing ½ wᵀ · w, which is equal to ½∥w∥², rather than minimizing ∥w∥. This is because it will give the same result (since the values of w and b that minimize a value also minimize half of its square), but ½∥w∥² has a nice and simple derivative (it is just w) while ∥w∥ is not differentiable at w = 0. Optimization algorithms work much better on differentiable functions.

To get the soft margin objective, we need to introduce a slack variable ζ(i) ≥ 0 for each instance:4 ζ(i) measures how much the ith instance is allowed to violate the margin. We now have two conflicting objectives: making the slack variables as small as possible to reduce the margin violations, and making ½ wᵀ · w as small as possible to increase the margin. This is where the C hyperparameter comes in: it allows us to define the tradeoff between these two objectives. This gives us the constrained optimization problem in Equation 5-4.

Equation 5-4. Soft margin linear SVM classifier objective

    minimize (over w, b, ζ)   ½ wᵀ · w + C Σᵢ ζ(i)
    subject to                t(i)(wᵀ · x(i) + b) ≥ 1 - ζ(i)   and   ζ(i) ≥ 0   for i = 1, 2, …, m

Quadratic Programming

The hard margin and soft margin problems are both convex quadratic optimization problems with linear constraints. Such problems are known as Quadratic Programming (QP) problems. Many off-the-shelf solvers are available to solve QP problems using a variety of techniques that are outside the scope of this book.5 The general problem formulation is given by Equation 5-5.

Equation 5-5. Quadratic Programming problem

    minimize (over p)   ½ pᵀ · H · p + fᵀ · p
    subject to          A · p ≤ b
    where p is an np-dimensional vector, H is an np × np matrix, f is an np-dimensional vector,
    A is an nc × np matrix, and b is an nc-dimensional vector (nc is the number of constraints).

Note that the expression A · p ≤ b actually defines nc constraints: pᵀ · a(i) ≤ b(i) for i = 1, 2, …, nc, where a(i) is the vector containing the elements of the ith row of A and b(i) is the ith element of b.

You can easily verify that if you set the QP parameters in the following way, you get the hard margin linear SVM classifier objective:

np = n + 1, where n is the number of features (the +1 is for the bias term).

nc = m, where m is the number of training instances.

H is the np × np identity matrix, except with a zero in the top-left cell (to ignore the bias term).

f = 0, an np-dimensional vector full of 0s.

b = -1, an nc-dimensional vector full of -1s.

a(i) = -t(i) ẋ(i), where ẋ(i) is equal to x(i) with an extra bias feature ẋ0 = 1.

So one way to train a hard margin linear SVM classifier is just to use an off-the-shelf QP solver by passing it the preceding parameters. The resulting vector p will contain the bias term b = p0 and the feature weights wi = pi for i = 1, 2, …, n. Similarly, you can use a QP solver to solve the soft margin problem (see the exercises at the end of the chapter).

However, to use the kernel trick we are going to look at a different constrained optimization problem.

The Dual Problem

Given a constrained optimization problem, known as the primal problem, it is possible to express a different but closely related problem, called its dual problem. The solution to the dual problem typically gives a lower bound to the solution of the primal problem, but under some conditions it can even have the same solution as the primal problem. Luckily, the SVM problem happens to meet these conditions,6 so you can choose to solve the primal problem or the dual problem; both will have the same solution. Equation 5-6 shows the dual form of the linear SVM objective (if you are interested in knowing how to derive the dual problem from the primal problem, see Appendix C).

Equation 5-6. Dual form of the linear SVM objective

    minimize (over α)   ½ Σᵢ Σⱼ α(i) α(j) t(i) t(j) x(i)ᵀ · x(j)  -  Σᵢ α(i)
    subject to          α(i) ≥ 0   for i = 1, 2, …, m

Once you find the vector α̂ that minimizes this equation (using a QP solver), you can compute the ŵ and b̂ that minimize the primal problem by using Equation 5-7.

Equation 5-7. From the dual solution to the primal solution

    ŵ = Σᵢ α̂(i) t(i) x(i)
    b̂ = (1 / nₛ) Σ over the support vectors (α̂(i) > 0) of ( t(i) - ŵᵀ · x(i) ),   where nₛ is the number of support vectors

The dual problem is faster to solve than the primal when the number of training instances is smaller than the number of features. More importantly, it makes the kernel trick possible, while the primal does not. So what is this kernel trick anyway?

Kernelized SVM

Suppose you want to apply a 2nd-degree polynomial transformation to a two-dimensional training set (such as the moons training set), then train a linear SVM classifier on the transformed training set. Equation 5-8 shows the 2nd-degree polynomial mapping function ϕ that you want to apply.

Equation 5-8. Second-degree polynomial mapping

    ϕ(x) = ϕ( (x1, x2) ) = ( x1²,  √2 x1 x2,  x2² )

Notice that the transformed vector is three-dimensional instead of two-dimensional. Now let's look at what happens to a couple of two-dimensional vectors, a and b, if we apply this 2nd-degree polynomial mapping and then compute the dot product of the transformed vectors (see Equation 5-9).

Equation 5-9. Kernel trick for a 2nd-degree polynomial mapping

    ϕ(a)ᵀ · ϕ(b) = a1² b1² + 2 a1 b1 a2 b2 + a2² b2² = (a1 b1 + a2 b2)² = (aᵀ · b)²

How about that? The dot product of the transformed vectors is equal to the square of the dot product of the original vectors: ϕ(a)ᵀ · ϕ(b) = (aᵀ · b)².
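A quick numerical check of this identity, using the mapping from Equation 5-8 on two arbitrary example vectors:

import numpy as np

def phi(x):
    # 2nd-degree polynomial mapping from Equation 5-8
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

a = np.array([2.0, 3.0])
b = np.array([-1.0, 0.5])
print(phi(a).dot(phi(b)))  # dot product in the transformed 3D space
print(a.dot(b) ** 2)       # square of the dot product in the original 2D space: same value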

Now here is the key insight: if you apply the transformation ϕ to all training instances, then the dual problem (see Equation 5-6) will contain the dot product ϕ(x(i))ᵀ · ϕ(x(j)). But if ϕ is the 2nd-degree polynomial transformation defined in Equation 5-8, then you can replace this dot product of transformed vectors simply by (x(i)ᵀ · x(j))². So you don't actually need to transform the training instances at all: just replace the dot product by its square in Equation 5-6. The result will be strictly the same as if you went through the trouble of actually transforming the training set then fitting a linear SVM algorithm, but this trick makes the whole process much more computationally efficient. This is the essence of the kernel trick.

The function K(a, b) = (aᵀ · b)² is called a 2nd-degree polynomial kernel. In Machine Learning, a kernel is a function capable of computing the dot product ϕ(a)ᵀ · ϕ(b) based only on the original vectors a and b, without having to compute (or even to know about) the transformation ϕ. Equation 5-10 lists some of the most commonly used kernels.

Equation 5-10. Common kernels

    Linear:         K(a, b) = aᵀ · b
    Polynomial:     K(a, b) = (γ aᵀ · b + r)^d
    Gaussian RBF:   K(a, b) = exp(-γ ∥a - b∥²)
    Sigmoid:        K(a, b) = tanh(γ aᵀ · b + r)

MERCER'S THEOREM

According to Mercer's theorem, if a function K(a, b) respects a few mathematical conditions called Mercer's conditions (K must be continuous, symmetric in its arguments so K(a, b) = K(b, a), etc.), then there exists a function ϕ that maps a and b into another space (possibly with much higher dimensions) such that K(a, b) = ϕ(a)ᵀ · ϕ(b). So you can use K as a kernel since you know ϕ exists, even if you don't know what ϕ is. In the case of the Gaussian RBF kernel, it can be shown that ϕ actually maps each training instance to an infinite-dimensional space, so it's a good thing you don't need to actually perform the mapping!

Note that some frequently used kernels (such as the Sigmoid kernel) don't respect all of Mercer's conditions, yet they generally work well in practice.

There is still one loose end we must tie up. Equation 5-7 shows how to go from the dual solution to the primal solution in the case of a linear SVM classifier, but if you apply the kernel trick you end up with equations that include ϕ(x(i)). In fact, ŵ must have the same number of dimensions as ϕ(x(i)), which may be huge or even infinite, so you can't compute it. But how can you make predictions without knowing ŵ? Well, the good news is that you can plug the formula for ŵ from Equation 5-7 into the decision function for a new instance x(n), and you get an equation with only dot products between input vectors. This makes it possible to use the kernel trick, once again (Equation 5-11).

Equation 5-11. Making predictions with a kernelized SVM

Note that since α(i) ≠ 0 only for support vectors, making predictions involves computing the dot product of the new input vector x(n) with only the support vectors, not all the training instances. Of course, you also need to compute the bias term b̂, using the same trick (Equation 5-12).

Equation 5-12. Computing the bias term using the kernel trick

If you are starting to get a headache, it's perfectly normal: it's an unfortunate side effect of the kernel trick.
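You can see this at work on a fitted Scikit-Learn model: after training, an SVC exposes its support vectors and the corresponding nonzero dual coefficients, which are all the decision function needs. A minimal sketch, using the poly_kernel_svm_clf pipeline trained earlier:

svc = poly_kernel_svm_clf.named_steps["svm_clf"]
print(svc.support_vectors_.shape)  # the support vectors (usually far fewer than m instances)
print(svc.dual_coef_.shape)        # their nonzero dual coefficients (one per support vector)
print(svc.intercept_)              # the bias term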

Online SVMs

Before concluding this chapter, let's take a quick look at online SVM classifiers (recall that online learning means learning incrementally, typically as new instances arrive).

For linear SVM classifiers, one method is to use Gradient Descent (e.g., using SGDClassifier) to minimize the cost function in Equation 5-13, which is derived from the primal problem. Unfortunately it converges much more slowly than the methods based on QP.

Equation 5-13. Linear SVM classifier cost function

    J(w, b) = ½ wᵀ · w + C Σᵢ max(0, 1 - t(i)(wᵀ · x(i) + b))

The first sum in the cost function will push the model to have a small weight vector w, leading to a larger margin. The second sum computes the total of all margin violations. An instance's margin violation is equal to 0 if it is located off the street and on the correct side, or else it is proportional to the distance to the correct side of the street. Minimizing this term ensures that the model makes the margin violations as small and as few as possible.

HINGE LOSS

The function max(0, 1 - t) is called the hinge loss function (represented below). It is equal to 0 when t ≥ 1. Its derivative (slope) is equal to -1 if t < 1 and 0 if t > 1. It is not differentiable at t = 1, but just like for Lasso Regression (see "Lasso Regression") you can still use Gradient Descent using any subderivative at t = 1 (i.e., any value between -1 and 0).
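A tiny sketch of the hinge loss and one valid subgradient choice (this just illustrates the definition above; it is not library code):

import numpy as np

def hinge_loss(t):
    return np.maximum(0, 1 - t)

def hinge_subgradient(t):
    # any value between -1 and 0 is acceptable at t == 1; here we pick 0
    return np.where(t < 1, -1.0, 0.0)

t = np.array([-1.0, 0.0, 1.0, 2.0])
print(hinge_loss(t))         # [2. 1. 0. 0.]
print(hinge_subgradient(t))  # [-1. -1.  0.  0.]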

It is also possible to implement online kernelized SVMs — for example, using "Incremental and Decremental SVM Learning"7 or "Fast Kernel Classifiers with Online and Active Learning."8 However, these are implemented in Matlab and C++. For large-scale nonlinear problems, you may want to consider using neural networks instead (see Part II).

Exercises

1. What is the fundamental idea behind Support Vector Machines?

2. What is a support vector?

3. Why is it important to scale the inputs when using SVMs?

4. Can an SVM classifier output a confidence score when it classifies an instance? What about a probability?

5. Should you use the primal or the dual form of the SVM problem to train a model on a training set with millions of instances and hundreds of features?

6. Say you trained an SVM classifier with an RBF kernel. It seems to underfit the training set: should you increase or decrease γ (gamma)? What about C?

7. How should you set the QP parameters (H, f, A, and b) to solve the soft margin linear SVM classifier problem using an off-the-shelf QP solver?

8. Train a LinearSVC on a linearly separable dataset. Then train an SVC and a SGDClassifier on the same dataset. See if you can get them to produce roughly the same model.

9. Train an SVM classifier on the MNIST dataset. Since SVM classifiers are binary classifiers, you will need to use one-versus-all to classify all 10 digits. You may want to tune the hyperparameters using small validation sets to speed up the process. What accuracy can you reach?

10. Train an SVM regressor on the California housing dataset.

Solutions to these exercises are available in Appendix A.

1. "A Dual Coordinate Descent Method for Large-scale Linear SVM," Lin et al. (2008).

2. "Sequential Minimal Optimization (SMO)," J. Platt (1998).

3. More generally, when there are n features, the decision function is an n-dimensional hyperplane, and the decision boundary is an (n - 1)-dimensional hyperplane.

4. Zeta (ζ) is the 8th letter of the Greek alphabet.

5. To learn more about Quadratic Programming, you can start by reading Stephen Boyd and Lieven Vandenberghe, Convex Optimization (Cambridge, UK: Cambridge University Press, 2004) or watch Richard Brown's series of video lectures.

6. The objective function is convex, and the inequality constraints are continuously differentiable and convex functions.

7. "Incremental and Decremental Support Vector Machine Learning," G. Cauwenberghs, T. Poggio (2001).

8. "Fast Kernel Classifiers with Online and Active Learning," A. Bordes, S. Ertekin, J. Weston, L. Bottou (2005).


Chapter 6. Decision Trees

Like SVMs, Decision Trees are versatile Machine Learning algorithms that can perform both classification and regression tasks, and even multioutput tasks. They are very powerful algorithms, capable of fitting complex datasets. For example, in Chapter 2 you trained a DecisionTreeRegressor model on the California housing dataset, fitting it perfectly (actually overfitting it).

Decision Trees are also the fundamental components of Random Forests (see Chapter 7), which are among the most powerful Machine Learning algorithms available today.

In this chapter we will start by discussing how to train, visualize, and make predictions with Decision Trees. Then we will go through the CART training algorithm used by Scikit-Learn, and we will discuss how to regularize trees and use them for regression tasks. Finally, we will discuss some of the limitations of Decision Trees.

Training and Visualizing a Decision Tree

To understand Decision Trees, let's just build one and take a look at how it makes predictions. The following code trains a DecisionTreeClassifier on the iris dataset (see Chapter 4):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data[:, 2:]  # petal length and width
y = iris.target

tree_clf = DecisionTreeClassifier(max_depth=2)
tree_clf.fit(X, y)

You can visualize the trained Decision Tree by first using the export_graphviz() method to output a graph definition file called iris_tree.dot:

from sklearn.tree import export_graphviz

export_graphviz(
    tree_clf,
    out_file=image_path("iris_tree.dot"),
    feature_names=iris.feature_names[2:],
    class_names=iris.target_names,
    rounded=True,
    filled=True
)

Then you can convert this .dot file to a variety of formats such as PDF or PNG using the dot command-line tool from the graphviz package.1 This command line converts the .dot file to a .png image file:

$ dot -Tpng iris_tree.dot -o iris_tree.png

Your first decision tree looks like Figure 6-1.

Figure 6-1. Iris Decision Tree

Making Predictions

Let's see how the tree represented in Figure 6-1 makes predictions. Suppose you find an iris flower and you want to classify it. You start at the root node (depth 0, at the top): this node asks whether the flower's petal length is smaller than 2.45 cm. If it is, then you move down to the root's left child node (depth 1, left). In this case, it is a leaf node (i.e., it does not have any child nodes), so it does not ask any questions: you can simply look at the predicted class for that node and the Decision Tree predicts that your flower is an Iris-Setosa (class=setosa).

Now suppose you find another flower, but this time the petal length is greater than 2.45 cm. You must move down to the root's right child node (depth 1, right), which is not a leaf node, so it asks another question: is the petal width smaller than 1.75 cm? If it is, then your flower is most likely an Iris-Versicolor (depth 2, left). If not, it is likely an Iris-Virginica (depth 2, right). It's really that simple.

NOTE
One of the many qualities of Decision Trees is that they require very little data preparation. In particular, they don't require feature scaling or centering at all.

A node's samples attribute counts how many training instances it applies to. For example, 100 training instances have a petal length greater than 2.45 cm (depth 1, right), among which 54 have a petal width smaller than 1.75 cm (depth 2, left). A node's value attribute tells you how many training instances of each class this node applies to: for example, the bottom-right node applies to 0 Iris-Setosa, 1 Iris-Versicolor, and 45 Iris-Virginica. Finally, a node's gini attribute measures its impurity: a node is "pure" (gini=0) if all training instances it applies to belong to the same class. For example, since the depth-1 left node applies only to Iris-Setosa training instances, it is pure and its gini score is 0. Equation 6-1 shows how the training algorithm computes the gini score Gi of the ith node. For example, the depth-2 left node has a gini score equal to 1 - (0/54)² - (49/54)² - (5/54)² ≈ 0.168. Another impurity measure is discussed shortly.

Equation 6-1. Gini impurity

    Gᵢ = 1 - Σₖ pᵢ,ₖ²   (summing over the n classes)

pi,k is the ratio of class k instances among the training instances in the ith node.
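As a quick check of the ≈0.168 figure above, here is a minimal sketch computing the Gini impurity of the depth-2 left node from its class counts:

import numpy as np

counts = np.array([0, 49, 5])   # Iris-Setosa, Iris-Versicolor, Iris-Virginica in this node
p = counts / counts.sum()
print(1 - np.sum(p ** 2))       # ≈ 0.168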

NOTE
Scikit-Learn uses the CART algorithm, which produces only binary trees: nonleaf nodes always have two children (i.e., questions only have yes/no answers). However, other algorithms such as ID3 can produce Decision Trees with nodes that have more than two children.

Figure 6-2 shows this Decision Tree's decision boundaries. The thick vertical line represents the decision boundary of the root node (depth 0): petal length = 2.45 cm. Since the left area is pure (only Iris-Setosa), it cannot be split any further. However, the right area is impure, so the depth-1 right node splits it at petal width = 1.75 cm (represented by the dashed line). Since max_depth was set to 2, the Decision Tree stops right there. However, if you set max_depth to 3, then the two depth-2 nodes would each add another decision boundary (represented by the dotted lines).

Figure 6-2. Decision Tree decision boundaries

MODEL INTERPRETATION: WHITE BOX VERSUS BLACK BOX

As you can see, Decision Trees are fairly intuitive and their decisions are easy to interpret. Such models are often called white box models. In contrast, as we will see, Random Forests or neural networks are generally considered black box models. They make great predictions, and you can easily check the calculations that they performed to make these predictions; nevertheless, it is usually hard to explain in simple terms why the predictions were made. For example, if a neural network says that a particular person appears in a picture, it is hard to know what actually contributed to this prediction: did the model recognize that person's eyes? Her mouth? Her nose? Her shoes? Or even the couch that she was sitting on? Conversely, Decision Trees provide nice and simple classification rules that can even be applied manually if need be (e.g., for flower classification).

Estimating Class Probabilities

A Decision Tree can also estimate the probability that an instance belongs to a particular class k: first it traverses the tree to find the leaf node for this instance, and then it returns the ratio of training instances of class k in this node. For example, suppose you have found a flower whose petals are 5 cm long and 1.5 cm wide. The corresponding leaf node is the depth-2 left node, so the Decision Tree should output the following probabilities: 0% for Iris-Setosa (0/54), 90.7% for Iris-Versicolor (49/54), and 9.3% for Iris-Virginica (5/54). And of course if you ask it to predict the class, it should output Iris-Versicolor (class 1) since it has the highest probability. Let's check this:

>>> tree_clf.predict_proba([[5, 1.5]])
array([[0.        , 0.90740741, 0.09259259]])
>>> tree_clf.predict([[5, 1.5]])
array([1])

Perfect! Notice that the estimated probabilities would be identical anywhere else in the bottom-right rectangle of Figure 6-2 — for example, if the petals were 6 cm long and 1.5 cm wide (even though it seems obvious that it would most likely be an Iris-Virginica in this case).

The CART Training Algorithm

Scikit-Learn uses the Classification And Regression Tree (CART) algorithm to train Decision Trees (also called "growing" trees). The idea is really quite simple: the algorithm first splits the training set in two subsets using a single feature k and a threshold tk (e.g., "petal length ≤ 2.45 cm"). How does it choose k and tk? It searches for the pair (k, tk) that produces the purest subsets (weighted by their size). The cost function that the algorithm tries to minimize is given by Equation 6-2.

Equation 6-2. CART cost function for classification

    J(k, tₖ) = (m_left / m) G_left + (m_right / m) G_right
    where G_left/right measures the impurity of the left/right subset,
    and m_left/right is the number of instances in the left/right subset.

Once it has successfully split the training set in two, it splits the subsets using the same logic, then the sub-subsets and so on, recursively. It stops recursing once it reaches the maximum depth (defined by the max_depth hyperparameter), or if it cannot find a split that will reduce impurity. A few other hyperparameters (described in a moment) control additional stopping conditions (min_samples_split, min_samples_leaf, min_weight_fraction_leaf, and max_leaf_nodes).

WARNING
As you can see, the CART algorithm is a greedy algorithm: it greedily searches for an optimum split at the top level, then repeats the process at each level. It does not check whether or not the split will lead to the lowest possible impurity several levels down. A greedy algorithm often produces a reasonably good solution, but it is not guaranteed to be the optimal solution.

Unfortunately, finding the optimal tree is known to be an NP-Complete problem:2 it requires O(exp(m)) time, making the problem intractable even for fairly small training sets. This is why we must settle for a "reasonably good" solution.

Computational Complexity

Making predictions requires traversing the Decision Tree from the root to a leaf. Decision Trees are generally approximately balanced, so traversing the Decision Tree requires going through roughly O(log2(m)) nodes.3 Since each node only requires checking the value of one feature, the overall prediction complexity is just O(log2(m)), independent of the number of features. So predictions are very fast, even when dealing with large training sets.

However, the training algorithm compares all features (or less if max_features is set) on all samples at each node. This results in a training complexity of O(n × m log(m)). For small training sets (less than a few thousand instances), Scikit-Learn can speed up training by presorting the data (set presort=True), but this slows down training considerably for larger training sets.

Gini Impurity or Entropy?

By default, the Gini impurity measure is used, but you can select the entropy impurity measure instead by setting the criterion hyperparameter to "entropy". The concept of entropy originated in thermodynamics as a measure of molecular disorder: entropy approaches zero when molecules are still and well ordered. It later spread to a wide variety of domains, including Shannon's information theory, where it measures the average information content of a message:4 entropy is zero when all messages are identical. In Machine Learning, it is frequently used as an impurity measure: a set's entropy is zero when it contains instances of only one class. Equation 6-3 shows the definition of the entropy of the ith node.

For example, the depth-2 left node in Figure 6-1 has an entropy equal to -(49/54) log(49/54) - (5/54) log(5/54) ≈ 0.31.

Equation 6-3. Entropy

    Hᵢ = - Σₖ pᵢ,ₖ log(pᵢ,ₖ)   (summing only over the classes with pᵢ,ₖ ≠ 0)
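A quick check of that ≈0.31 value (it corresponds to using the natural logarithm):

import numpy as np

p = np.array([49/54, 5/54])     # nonzero class ratios in the depth-2 left node
print(-np.sum(p * np.log(p)))   # ≈ 0.31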

So should you use Gini impurity or entropy? The truth is, most of the time it does not make a big difference: they lead to similar trees. Gini impurity is slightly faster to compute, so it is a good default. However, when they differ, Gini impurity tends to isolate the most frequent class in its own branch of the tree, while entropy tends to produce slightly more balanced trees.5

Regularization Hyperparameters

Decision Trees make very few assumptions about the training data (as opposed to linear models, which obviously assume that the data is linear, for example). If left unconstrained, the tree structure will adapt itself to the training data, fitting it very closely, and most likely overfitting it. Such a model is often called a nonparametric model, not because it does not have any parameters (it often has a lot) but because the number of parameters is not determined prior to training, so the model structure is free to stick closely to the data. In contrast, a parametric model such as a linear model has a predetermined number of parameters, so its degree of freedom is limited, reducing the risk of overfitting (but increasing the risk of underfitting).

To avoid overfitting the training data, you need to restrict the Decision Tree's freedom during training. As you know by now, this is called regularization. The regularization hyperparameters depend on the algorithm used, but generally you can at least restrict the maximum depth of the Decision Tree. In Scikit-Learn, this is controlled by the max_depth hyperparameter (the default value is None, which means unlimited). Reducing max_depth will regularize the model and thus reduce the risk of overfitting.

The DecisionTreeClassifier class has a few other parameters that similarly restrict the shape of the Decision Tree: min_samples_split (the minimum number of samples a node must have before it can be split), min_samples_leaf (the minimum number of samples a leaf node must have), min_weight_fraction_leaf (same as min_samples_leaf but expressed as a fraction of the total number of weighted instances), max_leaf_nodes (the maximum number of leaf nodes), and max_features (the maximum number of features that are evaluated for splitting at each node). Increasing min_* hyperparameters or reducing max_* hyperparameters will regularize the model.

NOTE
Other algorithms work by first training the Decision Tree without restrictions, then pruning (deleting) unnecessary nodes. A node whose children are all leaf nodes is considered unnecessary if the purity improvement it provides is not statistically significant. Standard statistical tests, such as the χ² test, are used to estimate the probability that the improvement is purely the result of chance (which is called the null hypothesis). If this probability, called the p-value, is higher than a given threshold (typically 5%, controlled by a hyperparameter), then the node is considered unnecessary and its children are deleted. The pruning continues until all unnecessary nodes have been pruned.

Figure 6-3 shows two Decision Trees trained on the moons dataset (introduced in Chapter 5). On the left, the Decision Tree is trained with the default hyperparameters (i.e., no restrictions), and on the right the Decision Tree is trained with min_samples_leaf=4. It is quite obvious that the model on the left is overfitting, and the model on the right will probably generalize better.

Figure 6-3. Regularization using min_samples_leaf
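A minimal sketch of the two models compared in Figure 6-3 (assuming X and y hold the moons data from Chapter 5):

from sklearn.tree import DecisionTreeClassifier

deep_tree_clf = DecisionTreeClassifier()                    # unrestricted: likely to overfit
reg_tree_clf = DecisionTreeClassifier(min_samples_leaf=4)   # regularized
deep_tree_clf.fit(X, y)
reg_tree_clf.fit(X, y)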

Regression

Decision Trees are also capable of performing regression tasks. Let's build a regression tree using Scikit-Learn's DecisionTreeRegressor class, training it on a noisy quadratic dataset with max_depth=2:

from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(max_depth=2)
tree_reg.fit(X, y)

The resulting tree is represented in Figure 6-4.

Figure 6-4. A Decision Tree for regression

This tree looks very similar to the classification tree you built earlier. The main difference is that instead of predicting a class in each node, it predicts a value. For example, suppose you want to make a prediction for a new instance with x1 = 0.6. You traverse the tree starting at the root, and you eventually reach the leaf node that predicts value=0.1106. This prediction is simply the average target value of the 110 training instances associated with this leaf node. This prediction results in a Mean Squared Error (MSE) equal to 0.0151 over these 110 instances.

This model's predictions are represented on the left of Figure 6-5. If you set max_depth=3, you get the predictions represented on the right. Notice how the predicted value for each region is always the average target value of the instances in that region. The algorithm splits each region in a way that makes most training instances as close as possible to that predicted value.

Figure 6-5. Predictions of two Decision Tree regression models

The CART algorithm works mostly the same way as earlier, except that instead of trying to split the training set in a way that minimizes impurity, it now tries to split the training set in a way that minimizes the MSE. Equation 6-4 shows the cost function that the algorithm tries to minimize.

Equation 6-4. CART cost function for regression

    J(k, tₖ) = (m_left / m) MSE_left + (m_right / m) MSE_right
    where MSE_left/right measures the squared error of the left/right subset with respect to that subset's
    predicted value ŷ_node (the average target value of the instances in the subset).

Just like for classification tasks, Decision Trees are prone to overfitting when dealing with regression tasks. Without any regularization (i.e., using the default hyperparameters), you get the predictions on the left of Figure 6-6. It is obviously overfitting the training set very badly. Just setting min_samples_leaf=10 results in a much more reasonable model, represented on the right of Figure 6-6.

Figure 6-6. Regularizing a Decision Tree regressor

Instability

Hopefully by now you are convinced that Decision Trees have a lot going for them: they are simple to understand and interpret, easy to use, versatile, and powerful. However, they do have a few limitations. First, as you may have noticed, Decision Trees love orthogonal decision boundaries (all splits are perpendicular to an axis), which makes them sensitive to training set rotation. For example, Figure 6-7 shows a simple linearly separable dataset: on the left, a Decision Tree can split it easily, while on the right, after the dataset is rotated by 45°, the decision boundary looks unnecessarily convoluted. Although both Decision Trees fit the training set perfectly, it is very likely that the model on the right will not generalize well. One way to limit this problem is to use PCA (see Chapter 8), which often results in a better orientation of the training data.

Figure 6-7. Sensitivity to training set rotation

More generally, the main issue with Decision Trees is that they are very sensitive to small variations in the training data. For example, if you just remove the widest Iris-Versicolor from the iris training set (the one with petals 4.8 cm long and 1.8 cm wide) and train a new Decision Tree, you may get the model represented in Figure 6-8. As you can see, it looks very different from the previous Decision Tree (Figure 6-2). Actually, since the training algorithm used by Scikit-Learn is stochastic,6 you may get very different models even on the same training data (unless you set the random_state hyperparameter).

Figure 6-8. Sensitivity to training set details

Random Forests can limit this instability by averaging predictions over many trees, as we will see in the next chapter.

Exercises

1. What is the approximate depth of a Decision Tree trained (without restrictions) on a training set with 1 million instances?

2. Is a node's Gini impurity generally lower or greater than its parent's? Is it generally lower/greater, or always lower/greater?

3. If a Decision Tree is overfitting the training set, is it a good idea to try decreasing max_depth?

4. If a Decision Tree is underfitting the training set, is it a good idea to try scaling the input features?

5. If it takes one hour to train a Decision Tree on a training set containing 1 million instances, roughly how much time will it take to train another Decision Tree on a training set containing 10 million instances?

6. If your training set contains 100,000 instances, will setting presort=True speed up training?

7. Train and fine-tune a Decision Tree for the moons dataset.

a. Generate a moons dataset using make_moons(n_samples=10000, noise=0.4).

b. Split it into a training set and a test set using train_test_split().

c. Use grid search with cross-validation (with the help of the GridSearchCV class) to find good hyperparameter values for a DecisionTreeClassifier. Hint: try various values for max_leaf_nodes.

d. Train it on the full training set using these hyperparameters, and measure your model's performance on the test set. You should get roughly 85% to 87% accuracy.

8. Grow a forest.

a. Continuing the previous exercise, generate 1,000 subsets of the training set, each containing 100 instances selected randomly. Hint: you can use Scikit-Learn's ShuffleSplit class for this.

b. Train one Decision Tree on each subset, using the best hyperparameter values found above. Evaluate these 1,000 Decision Trees on the test set. Since they were trained on smaller sets, these Decision Trees will likely perform worse than the first Decision Tree, achieving only about 80% accuracy.

c. Now comes the magic. For each test set instance, generate the predictions of the 1,000 Decision Trees, and keep only the most frequent prediction (you can use SciPy's mode() function for this). This gives you majority-vote predictions over the test set.

d. Evaluate these predictions on the test set: you should obtain a slightly higher accuracy than your first model (about 0.5 to 1.5% higher). Congratulations, you have trained a Random Forest classifier!

Solutions to these exercises are available in Appendix A.

1. Graphviz is an open source graph visualization software package, available at http://www.graphviz.org/.

2. P is the set of problems that can be solved in polynomial time. NP is the set of problems whose solutions can be verified in polynomial time. An NP-Hard problem is a problem to which any NP problem can be reduced in polynomial time. An NP-Complete problem is both NP and NP-Hard. A major open mathematical question is whether or not P = NP. If P ≠ NP (which seems likely), then no polynomial algorithm will ever be found for any NP-Complete problem (except perhaps on a quantum computer).

3. log2 is the binary logarithm. It is equal to log2(m) = log(m)/log(2).

4. A reduction of entropy is often called an information gain.

5. See Sebastian Raschka's interesting analysis for more details.

6. It randomly selects the set of features to evaluate at each node.


Chapter 7. Ensemble Learning and Random Forests

Suppose you ask a complex question to thousands of random people, then aggregate their answers. In many cases you will find that this aggregated answer is better than an expert's answer. This is called the wisdom of the crowd. Similarly, if you aggregate the predictions of a group of predictors (such as classifiers or regressors), you will often get better predictions than with the best individual predictor. A group of predictors is called an ensemble; thus, this technique is called Ensemble Learning, and an Ensemble Learning algorithm is called an Ensemble method.

For example, you can train a group of Decision Tree classifiers, each on a different random subset of the training set. To make predictions, you just obtain the predictions of all individual trees, then predict the class that gets the most votes (see the last exercise in Chapter 6). Such an ensemble of Decision Trees is called a Random Forest, and despite its simplicity, this is one of the most powerful Machine Learning algorithms available today.

Moreover, as we discussed in Chapter 2, you will often use Ensemble methods near the end of a project, once you have already built a few good predictors, to combine them into an even better predictor. In fact, the winning solutions in Machine Learning competitions often involve several Ensemble methods (most famously in the Netflix Prize competition).

In this chapter we will discuss the most popular Ensemble methods, including bagging, boosting, stacking, and a few others. We will also explore Random Forests.

Voting Classifiers

Suppose you have trained a few classifiers, each one achieving about 80% accuracy. You may have a Logistic Regression classifier, an SVM classifier, a Random Forest classifier, a K-Nearest Neighbors classifier, and perhaps a few more (see Figure 7-1).

Figure 7-1. Training diverse classifiers

A very simple way to create an even better classifier is to aggregate the predictions of each classifier and predict the class that gets the most votes. This majority-vote classifier is called a hard voting classifier (see Figure 7-2).

Figure 7-2. Hard voting classifier predictions

Somewhat surprisingly, this voting classifier often achieves a higher accuracy than the best classifier in the ensemble. In fact, even if each classifier is a weak learner (meaning it does only slightly better than random guessing), the ensemble can still be a strong learner (achieving high accuracy), provided there are a sufficient number of weak learners and they are sufficiently diverse.

How is this possible? The following analogy can help shed some light on this mystery. Suppose you have a slightly biased coin that has a 51% chance of coming up heads, and 49% chance of coming up tails. If you toss it 1,000 times, you will generally get more or less 510 heads and 490 tails, and hence a majority of heads. If you do the math, you will find that the probability of obtaining a majority of heads after 1,000 tosses is close to 75%. The more you toss the coin, the higher the probability (e.g., with 10,000 tosses, the probability climbs over 97%). This is due to the law of large numbers: as you keep tossing the coin, the ratio of heads gets closer and closer to the probability of heads (51%). Figure 7-3 shows 10 series of biased coin tosses. You can see that as the number of tosses increases, the ratio of heads approaches 51%. Eventually all 10 series end up so close to 51% that they are consistently above 50%.

Figure 7-3. The law of large numbers
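You can check these probabilities with a short computation using the binomial distribution (a sketch; the 51% bias and the toss counts are the values used above):

from scipy.stats import binom

for n in (1000, 10000):
    p_majority = 1 - binom.cdf(n // 2, n, 0.51)   # P(strictly more than half of the tosses are heads)
    print(n, p_majority)                          # roughly 0.73 for 1,000 tosses, over 0.97 for 10,000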

Similarly, suppose you build an ensemble containing 1,000 classifiers that are individually correct only 51% of the time (barely better than random guessing). If you predict the majority voted class, you can hope for up to 75% accuracy! However, this is only true if all classifiers are perfectly independent, making uncorrelated errors, which is clearly not the case since they are trained on the same data. They are likely to make the same types of errors, so there will be many majority votes for the wrong class, reducing the ensemble's accuracy.

TIP
Ensemble methods work best when the predictors are as independent from one another as possible. One way to get diverse classifiers is to train them using very different algorithms. This increases the chance that they will make very different types of errors, improving the ensemble's accuracy.

The following code creates and trains a voting classifier in Scikit-Learn, composed of three diverse classifiers (the training set is the moons dataset, introduced in Chapter 5):

from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC()

voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='hard')
voting_clf.fit(X_train, y_train)

Let's look at each classifier's accuracy on the test set:

>>> from sklearn.metrics import accuracy_score
>>> for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
...     clf.fit(X_train, y_train)
...     y_pred = clf.predict(X_test)
...     print(clf.__class__.__name__, accuracy_score(y_test, y_pred))
...
LogisticRegression 0.864
RandomForestClassifier 0.872
SVC 0.888
VotingClassifier 0.896

There you have it! The voting classifier slightly outperforms all the individual classifiers.

If all classifiers are able to estimate class probabilities (i.e., they have a predict_proba() method), then you can tell Scikit-Learn to predict the class with the highest class probability, averaged over all the individual classifiers. This is called soft voting. It often achieves higher performance than hard voting because it gives more weight to highly confident votes. All you need to do is replace voting="hard" with voting="soft" and ensure that all classifiers can estimate class probabilities. This is not the case for the SVC class by default, so you need to set its probability hyperparameter to True (this will make the SVC class use cross-validation to estimate class probabilities, slowing down training, and it will add a predict_proba() method). If you modify the preceding code to use soft voting, you will find that the voting classifier achieves over 91% accuracy!
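Concretely, the soft voting variant of the preceding code only requires the two changes just described:

log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC(probability=True)   # so that SVC exposes a predict_proba() method

voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='soft')                # average the class probabilities instead of counting votes
voting_clf.fit(X_train, y_train)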

Bagging and Pasting

One way to get a diverse set of classifiers is to use very different training algorithms, as just discussed. Another approach is to use the same training algorithm for every predictor, but to train them on different random subsets of the training set. When sampling is performed with replacement, this method is called bagging1 (short for bootstrap aggregating2). When sampling is performed without replacement, it is called pasting.3

In other words, both bagging and pasting allow training instances to be sampled several times across multiple predictors, but only bagging allows training instances to be sampled several times for the same predictor. This sampling and training process is represented in Figure 7-4.

Figure 7-4. Pasting/bagging training set sampling and training

Once all predictors are trained, the ensemble can make a prediction for a new instance by simply aggregating the predictions of all predictors. The aggregation function is typically the statistical mode (i.e., the most frequent prediction, just like a hard voting classifier) for classification, or the average for regression. Each individual predictor has a higher bias than if it were trained on the original training set, but aggregation reduces both bias and variance.4 Generally, the net result is that the ensemble has a similar bias but a lower variance than a single predictor trained on the original training set.

As you can see in Figure 7-4, predictors can all be trained in parallel, via different CPU cores or even different servers. Similarly, predictions can be made in parallel. This is one of the reasons why bagging and pasting are such popular methods: they scale very well.

Bagging and Pasting in Scikit-Learn

Scikit-Learn offers a simple API for both bagging and pasting with the BaggingClassifier class (or BaggingRegressor for regression). The following code trains an ensemble of 500 Decision Tree classifiers,5 each trained on 100 training instances randomly sampled from the training set with replacement (this is an example of bagging, but if you want to use pasting instead, just set bootstrap=False). The n_jobs parameter tells Scikit-Learn the number of CPU cores to use for training and predictions (-1 tells Scikit-Learn to use all available cores):

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=100, bootstrap=True, n_jobs=-1)
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)

NOTE
The BaggingClassifier automatically performs soft voting instead of hard voting if the base classifier can estimate class probabilities (i.e., if it has a predict_proba() method), which is the case with Decision Tree classifiers.

Figure 7-5 compares the decision boundary of a single Decision Tree with the decision boundary of a bagging ensemble of 500 trees (from the preceding code), both trained on the moons dataset. As you can see, the ensemble's predictions will likely generalize much better than the single Decision Tree's predictions: the ensemble has a comparable bias but a smaller variance (it makes roughly the same number of errors on the training set, but the decision boundary is less irregular).

Figure 7-5. A single Decision Tree versus a bagging ensemble of 500 trees

Bootstrapping introduces a bit more diversity in the subsets that each predictor is trained on, so bagging ends up with a slightly higher bias than pasting, but this also means that predictors end up being less correlated so the ensemble's variance is reduced. Overall, bagging often results in better models, which explains why it is generally preferred. However, if you have spare time and CPU power you can use cross-validation to evaluate both bagging and pasting and select the one that works best.

Out-of-Bag Evaluation

With bagging, some instances may be sampled several times for any given predictor, while others may not be sampled at all. By default a BaggingClassifier samples m training instances with replacement (bootstrap=True), where m is the size of the training set. This means that only about 63% of the training instances are sampled on average for each predictor.6 The remaining 37% of the training instances that are not sampled are called out-of-bag (oob) instances. Note that they are not the same 37% for all predictors.
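The 63% figure comes from the fact that the probability of a given instance being picked at least once in m draws with replacement is 1 - (1 - 1/m)^m, which approaches 1 - exp(-1) ≈ 0.632 as m grows. A quick check (the value of m is just an example):

import numpy as np

m = 10000
print(1 - (1 - 1/m) ** m)   # ≈ 0.632
print(1 - np.exp(-1))       # ≈ 0.632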

Since a predictor never sees the oob instances during training, it can be evaluated on these instances, without the need for a separate validation set or cross-validation. You can evaluate the ensemble itself by averaging out the oob evaluations of each predictor.

In Scikit-Learn, you can set oob_score=True when creating a BaggingClassifier to request an automatic oob evaluation after training. The following code demonstrates this. The resulting evaluation score is available through the oob_score_ variable:

>>> bag_clf = BaggingClassifier(
...     DecisionTreeClassifier(), n_estimators=500,
...     bootstrap=True, n_jobs=-1, oob_score=True)
...
>>> bag_clf.fit(X_train, y_train)
>>> bag_clf.oob_score_
0.90133333333333332

According to this oob evaluation, this BaggingClassifier is likely to achieve about 90.1% accuracy on the test set. Let's verify this:

>>> from sklearn.metrics import accuracy_score
>>> y_pred = bag_clf.predict(X_test)
>>> accuracy_score(y_test, y_pred)
0.91200000000000003

We get 91.2% accuracy on the test set — close enough!

The oob decision function for each training instance is also available through the oob_decision_function_ variable. In this case (since the base estimator has a predict_proba() method) the decision function returns the class probabilities for each training instance. For example, the oob evaluation estimates that the second training instance has a 65.9% probability of belonging to the positive class (and 34.1% of belonging to the negative class):

>>> bag_clf.oob_decision_function_
array([[0.31746032, 0.68253968],
       [0.34117647, 0.65882353],
       [1.        , 0.        ],
       ...
       [1.        , 0.        ],
       [0.03108808, 0.96891192],
       [0.57291667, 0.42708333]])

Random Patches and Random Subspaces

The BaggingClassifier class supports sampling the features as well. This is controlled by two hyperparameters: max_features and bootstrap_features. They work the same way as max_samples and bootstrap, but for feature sampling instead of instance sampling. Thus, each predictor will be trained on a random subset of the input features.

This is particularly useful when you are dealing with high-dimensional inputs (such as images). Sampling both training instances and features is called the Random Patches method.7 Keeping all training instances (i.e., bootstrap=False and max_samples=1.0) but sampling features (i.e., bootstrap_features=True and/or max_features smaller than 1.0) is called the Random Subspaces method.8

Sampling features results in even more predictor diversity, trading a bit more bias for a lower variance.
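For instance, a Random Subspaces configuration of the BaggingClassifier might look like this minimal sketch (the max_features value of 0.5 is only an illustrative choice):

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    bootstrap=False, max_samples=1.0,            # keep all training instances
    bootstrap_features=True, max_features=0.5,   # but sample the features
    n_jobs=-1)
bag_clf.fit(X_train, y_train)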

Random Forests

As we have discussed, a Random Forest9 is an ensemble of Decision Trees, generally trained via the bagging method (or sometimes pasting), typically with max_samples set to the size of the training set. Instead of building a BaggingClassifier and passing it a DecisionTreeClassifier, you can instead use the RandomForestClassifier class, which is more convenient and optimized for Decision Trees10 (similarly, there is a RandomForestRegressor class for regression tasks). The following code trains a Random Forest classifier with 500 trees (each limited to a maximum of 16 nodes), using all available CPU cores:

from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(X_train, y_train)

y_pred_rf = rnd_clf.predict(X_test)

With a few exceptions, a RandomForestClassifier has all the hyperparameters of a DecisionTreeClassifier (to control how trees are grown), plus all the hyperparameters of a BaggingClassifier to control the ensemble itself.11

The Random Forest algorithm introduces extra randomness when growing trees; instead of searching for the very best feature when splitting a node (see Chapter 6), it searches for the best feature among a random subset of features. This results in a greater tree diversity, which (once again) trades a higher bias for a lower variance, generally yielding an overall better model. The following BaggingClassifier is roughly equivalent to the previous RandomForestClassifier:

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(splitter="random", max_leaf_nodes=16),
    n_estimators=500, max_samples=1.0, bootstrap=True, n_jobs=-1)

Extra-Trees

When you are growing a tree in a Random Forest, at each node only a random subset of the features is considered for splitting (as discussed earlier). It is possible to make trees even more random by also using random thresholds for each feature rather than searching for the best possible thresholds (like regular Decision Trees do).

A forest of such extremely random trees is simply called an Extremely Randomized Trees ensemble12 (or Extra-Trees for short). Once again, this trades more bias for a lower variance. It also makes Extra-Trees much faster to train than regular Random Forests since finding the best possible threshold for each feature at every node is one of the most time-consuming tasks of growing a tree.

You can create an Extra-Trees classifier using Scikit-Learn's ExtraTreesClassifier class. Its API is identical to the RandomForestClassifier class. Similarly, the ExtraTreesRegressor class has the same API as the RandomForestRegressor class.
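Since the API is identical, the earlier Random Forest code can be turned into an Extra-Trees classifier with a one-line change, as in this minimal sketch:

from sklearn.ensemble import ExtraTreesClassifier

ext_clf = ExtraTreesClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
ext_clf.fit(X_train, y_train)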

TIP
It is hard to tell in advance whether a RandomForestClassifier will perform better or worse than an ExtraTreesClassifier. Generally, the only way to know is to try both and compare them using cross-validation (and tuning the hyperparameters using grid search).

Feature Importance

Yet another great quality of Random Forests is that they make it easy to measure the relative importance of each feature. Scikit-Learn measures a feature's importance by looking at how much the tree nodes that use that feature reduce impurity on average (across all trees in the forest). More precisely, it is a weighted average, where each node's weight is equal to the number of training samples that are associated with it (see Chapter 6).

Scikit-Learn computes this score automatically for each feature after training, then it scales the results so that the sum of all importances is equal to 1. You can access the result using the feature_importances_ variable. For example, the following code trains a RandomForestClassifier on the iris dataset (introduced in Chapter 4) and outputs each feature's importance. It seems that the most important features are the petal length (44%) and width (42%), while sepal length and width are rather unimportant in comparison (11% and 2%, respectively).

>>> from sklearn.datasets import load_iris
>>> iris = load_iris()
>>> rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
>>> rnd_clf.fit(iris["data"], iris["target"])
>>> for name, score in zip(iris["feature_names"], rnd_clf.feature_importances_):
...     print(name, score)
...
sepal length (cm) 0.112492250999
sepal width (cm) 0.0231192882825
petal length (cm) 0.441030464364
petal width (cm) 0.423357996355

Similarly, if you train a Random Forest classifier on the MNIST dataset (introduced in Chapter 3) and plot each pixel's importance, you get the image represented in Figure 7-6.

Figure 7-6. MNIST pixel importance (according to a Random Forest classifier)

Random Forests are very handy to get a quick understanding of what features actually matter, in particular if you need to perform feature selection.

Boosting

Boosting (originally called hypothesis boosting) refers to any Ensemble method that can combine several weak learners into a strong learner. The general idea of most boosting methods is to train predictors sequentially, each trying to correct its predecessor. There are many boosting methods available, but by far the most popular are AdaBoost13 (short for Adaptive Boosting) and Gradient Boosting. Let's start with AdaBoost.

AdaBoost

One way for a new predictor to correct its predecessor is to pay a bit more attention to the training instances that the predecessor underfitted. This results in new predictors focusing more and more on the hard cases. This is the technique used by AdaBoost.

For example, to build an AdaBoost classifier, a first base classifier (such as a Decision Tree) is trained and used to make predictions on the training set. The relative weight of misclassified training instances is then increased. A second classifier is trained using the updated weights and again it makes predictions on the training set, weights are updated, and so on (see Figure 7-7).

Figure 7-7. AdaBoost sequential training with instance weight updates

Figure 7-8 shows the decision boundaries of five consecutive predictors on the moons dataset (in this example, each predictor is a highly regularized SVM classifier with an RBF kernel14). The first classifier gets many instances wrong, so their weights get boosted. The second classifier therefore does a better job on these instances, and so on. The plot on the right represents the same sequence of predictors except that the learning rate is halved (i.e., the misclassified instance weights are boosted half as much at every iteration). As you can see, this sequential learning technique has some similarities with Gradient Descent, except that instead of tweaking a single predictor's parameters to minimize a cost function, AdaBoost adds predictors to the ensemble, gradually making it better.

Figure 7-8. Decision boundaries of consecutive predictors

Once all predictors are trained, the ensemble makes predictions very much like bagging or pasting, except that predictors have different weights depending on their overall accuracy on the weighted training set.

WARNING
There is one important drawback to this sequential learning technique: it cannot be parallelized (or only partially), since each predictor can only be trained after the previous predictor has been trained and evaluated. As a result, it does not scale as well as bagging or pasting.

Let's take a closer look at the AdaBoost algorithm. Each instance weight w(i) is initially set to 1/m. A first predictor is trained and its weighted error rate r1 is computed on the training set; see Equation 7-1.

Equation 7-1. Weighted error rate of the jth predictor

    rⱼ = ( Σ of w(i) over the instances misclassified by the jth predictor, i.e., ŷⱼ(i) ≠ y(i) ) / ( Σ of w(i) over all m instances )

The predictor's weight αj is then computed using Equation 7-2, where η is the learning rate hyperparameter (defaults to 1).15 The more accurate the predictor is, the higher its weight will be. If it is just guessing randomly, then its weight will be close to zero. However, if it is most often wrong (i.e., less accurate than random guessing), then its weight will be negative.

Equation 7-2. Predictor weight

    αⱼ = η log( (1 - rⱼ) / rⱼ )

Next the instance weights are updated using Equation 7-3: the misclassified instances are boosted.

Equation 7-3. Weight update rule

    for i = 1, 2, …, m:
        w(i) ← w(i)              if the ith instance was classified correctly (ŷⱼ(i) = y(i))
        w(i) ← w(i) exp(αⱼ)      if it was misclassified (ŷⱼ(i) ≠ y(i))

Then all the instance weights are normalized (i.e., divided by Σᵢ w(i)).

Finally, a new predictor is trained using the updated weights, and the whole process is repeated (the new predictor's weight is computed, the instance weights are updated, then another predictor is trained, and so on). The algorithm stops when the desired number of predictors is reached, or when a perfect predictor is found.
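To make one iteration of this loop concrete, here is a minimal NumPy sketch of Equations 7-1 to 7-3 (y and y_pred stand for the true labels and the current predictor's predictions; the toy values are only for illustration):

import numpy as np

def adaboost_step(w, y, y_pred, eta=1.0):
    """One AdaBoost weight update (Equations 7-1 to 7-3)."""
    misclassified = (y_pred != y)
    r = w[misclassified].sum() / w.sum()   # weighted error rate (Equation 7-1)
    alpha = eta * np.log((1 - r) / r)      # predictor weight (Equation 7-2)
    w = w.copy()
    w[misclassified] *= np.exp(alpha)      # boost the misclassified instances (Equation 7-3)
    return w / w.sum(), alpha              # normalize the weights

m = 5
w = np.full(m, 1 / m)                      # initial instance weights
y      = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])         # the predictor got one instance wrong
w, alpha = adaboost_step(w, y, y_pred)
print(w, alpha)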

Tomakepredictions,AdaBoostsimplycomputesthepredictionsofallthepredictorsandweighsthemusingthepredictorweightsαj.Thepredictedclassistheonethatreceivesthemajorityofweightedvotes(seeEquation7-4).

Equation7-4.AdaBoostpredictions

Scikit-Learn actually uses a multiclass version of AdaBoost called SAMME16 (which stands for Stagewise Additive Modeling using a Multiclass Exponential loss function). When there are just two classes, SAMME is equivalent to AdaBoost. Moreover, if the predictors can estimate class probabilities (i.e., if they have a predict_proba() method), Scikit-Learn can use a variant of SAMME called SAMME.R (the R stands for "Real"), which relies on class probabilities rather than predictions and generally performs better.

The following code trains an AdaBoost classifier based on 200 Decision Stumps using Scikit-Learn's AdaBoostClassifier class (as you might expect, there is also an AdaBoostRegressor class). A Decision Stump is a Decision Tree with max_depth=1—in other words, a tree composed of a single decision node plus two leaf nodes. This is the default base estimator for the AdaBoostClassifier class:

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=200,
    algorithm="SAMME.R", learning_rate=0.5)
ada_clf.fit(X_train, y_train)

TIP: If your AdaBoost ensemble is overfitting the training set, you can try reducing the number of estimators or more strongly regularizing the base estimator.

Gradient Boosting

Another very popular Boosting algorithm is Gradient Boosting.17 Just like AdaBoost, Gradient Boosting works by sequentially adding predictors to an ensemble, each one correcting its predecessor. However, instead of tweaking the instance weights at every iteration like AdaBoost does, this method tries to fit the new predictor to the residual errors made by the previous predictor.

Let's go through a simple regression example using Decision Trees as the base predictors (of course, Gradient Boosting also works great with classification tasks). This is called Gradient Tree Boosting, or Gradient Boosted Regression Trees (GBRT). First, let's fit a DecisionTreeRegressor to the training set (for example, a noisy quadratic training set):

from sklearn.tree import DecisionTreeRegressor

tree_reg1 = DecisionTreeRegressor(max_depth=2)
tree_reg1.fit(X, y)

Now train a second DecisionTreeRegressor on the residual errors made by the first predictor:

y2 = y - tree_reg1.predict(X)
tree_reg2 = DecisionTreeRegressor(max_depth=2)
tree_reg2.fit(X, y2)

Then we train a third regressor on the residual errors made by the second predictor:

y3 = y2 - tree_reg2.predict(X)
tree_reg3 = DecisionTreeRegressor(max_depth=2)
tree_reg3.fit(X, y3)

Now we have an ensemble containing three trees. It can make predictions on a new instance simply by adding up the predictions of all the trees:

y_pred = sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))

Figure 7-9 represents the predictions of these three trees in the left column, and the ensemble's predictions in the right column. In the first row, the ensemble has just one tree, so its predictions are exactly the same as the first tree's predictions. In the second row, a new tree is trained on the residual errors of the first tree. On the right you can see that the ensemble's predictions are equal to the sum of the predictions of the first two trees. Similarly, in the third row another tree is trained on the residual errors of the second tree. You can see that the ensemble's predictions gradually get better as trees are added to the ensemble.

A simpler way to train GBRT ensembles is to use Scikit-Learn's GradientBoostingRegressor class. Much like the RandomForestRegressor class, it has hyperparameters to control the growth of Decision Trees (e.g., max_depth, min_samples_leaf, and so on), as well as hyperparameters to control the ensemble training, such as the number of trees (n_estimators). The following code creates the same ensemble as the previous one:

from sklearn.ensemble import GradientBoostingRegressor

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=1.0)
gbrt.fit(X, y)

Figure 7-9. Gradient Boosting

The learning_rate hyperparameter scales the contribution of each tree. If you set it to a low value, such as 0.1, you will need more trees in the ensemble to fit the training set, but the predictions will usually generalize better. This is a regularization technique called shrinkage. Figure 7-10 shows two GBRT ensembles trained with a low learning rate: the one on the left does not have enough trees to fit the training set, while the one on the right has too many trees and overfits the training set.
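For instance, a shrinkage-style ensemble might be configured as follows; the 200-tree value is just an illustrative guess, since the number of trees actually needed depends on your dataset (the next paragraphs show how to find it with early stopping):

gbrt_slow = GradientBoostingRegressor(max_depth=2, n_estimators=200, learning_rate=0.1)
gbrt_slow.fit(X, y)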

Figure 7-10. GBRT ensembles with not enough predictors (left) and too many (right)

In order to find the optimal number of trees, you can use early stopping (see Chapter 4). A simple way to implement this is to use the staged_predict() method: it returns an iterator over the predictions made by the ensemble at each stage of training (with one tree, two trees, etc.). The following code trains a GBRT ensemble with 120 trees, then measures the validation error at each stage of training to find the optimal number of trees, and finally trains another GBRT ensemble using the optimal number of trees:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X_train, X_val, y_train, y_val = train_test_split(X, y)

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=120)
gbrt.fit(X_train, y_train)

errors = [mean_squared_error(y_val, y_pred)
          for y_pred in gbrt.staged_predict(X_val)]
bst_n_estimators = np.argmin(errors) + 1  # stage i corresponds to i + 1 trees

gbrt_best = GradientBoostingRegressor(max_depth=2, n_estimators=bst_n_estimators)
gbrt_best.fit(X_train, y_train)

The validation errors are represented on the left of Figure 7-11, and the best model's predictions are represented on the right.

Figure 7-11. Tuning the number of trees using early stopping

It is also possible to implement early stopping by actually stopping training early (instead of training a large number of trees first and then looking back to find the optimal number). You can do so by setting warm_start=True, which makes Scikit-Learn keep existing trees when the fit() method is called, allowing incremental training. The following code stops training when the validation error does not improve for five iterations in a row:

gbrt = GradientBoostingRegressor(max_depth=2, warm_start=True)

min_val_error = float("inf")
error_going_up = 0
for n_estimators in range(1, 120):
    gbrt.n_estimators = n_estimators
    gbrt.fit(X_train, y_train)
    y_pred = gbrt.predict(X_val)
    val_error = mean_squared_error(y_val, y_pred)
    if val_error < min_val_error:
        min_val_error = val_error
        error_going_up = 0
    else:
        error_going_up += 1
        if error_going_up == 5:
            break  # early stopping

The GradientBoostingRegressor class also supports a subsample hyperparameter, which specifies the fraction of training instances to be used for training each tree. For example, if subsample=0.25, then each tree is trained on 25% of the training instances, selected randomly. As you can probably guess by now, this trades a higher bias for a lower variance. It also speeds up training considerably. This technique is called Stochastic Gradient Boosting.
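For example, a Stochastic Gradient Boosting ensemble could be configured like this (a sketch; the particular hyperparameter values are arbitrary):

sgbrt = GradientBoostingRegressor(max_depth=2, n_estimators=120, subsample=0.25)
sgbrt.fit(X_train, y_train)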

NOTE: It is possible to use Gradient Boosting with other cost functions. This is controlled by the loss hyperparameter (see Scikit-Learn's documentation for more details).

Stacking

The last Ensemble method we will discuss in this chapter is called stacking (short for stacked generalization).18 It is based on a simple idea: instead of using trivial functions (such as hard voting) to aggregate the predictions of all predictors in an ensemble, why don't we train a model to perform this aggregation? Figure 7-12 shows such an ensemble performing a regression task on a new instance. Each of the bottom three predictors predicts a different value (3.1, 2.7, and 2.9), and then the final predictor (called a blender, or a meta learner) takes these predictions as inputs and makes the final prediction (3.0).

Figure 7-12. Aggregating predictions using a blending predictor

To train the blender, a common approach is to use a hold-out set.19 Let's see how it works. First, the training set is split into two subsets. The first subset is used to train the predictors in the first layer (see Figure 7-13).

Figure 7-13. Training the first layer

Next, the first layer's predictors are used to make predictions on the second (held-out) set (see Figure 7-14). This ensures that the predictions are "clean," since the predictors never saw these instances during training. Now for each instance in the hold-out set there are three predicted values. We can create a new training set using these predicted values as input features (which makes this new training set three-dimensional), and keeping the target values. The blender is trained on this new training set, so it learns to predict the target value given the first layer's predictions.
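Here is a minimal sketch of this procedure; the choice of base predictors and of a Linear Regression model as the blender are assumptions made purely for illustration, and X_train, y_train, and X_new are assumed to exist:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR

# Split the training set: subset 1 trains the first layer, the hold-out set trains the blender
X_sub1, X_holdout, y_sub1, y_holdout = train_test_split(X_train, y_train, test_size=0.5)

first_layer = [RandomForestRegressor(n_estimators=100),
               ExtraTreesRegressor(n_estimators=100),
               SVR()]
for predictor in first_layer:
    predictor.fit(X_sub1, y_sub1)

# "Clean" predictions on the hold-out set become the blender's input features
X_blend = np.column_stack([p.predict(X_holdout) for p in first_layer])
blender = LinearRegression()
blender.fit(X_blend, y_holdout)

# To predict on new instances, go through both layers sequentially
X_new_blend = np.column_stack([p.predict(X_new) for p in first_layer])
y_pred = blender.predict(X_new_blend)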

Figure 7-14. Training the blender

It is actually possible to train several different blenders this way (e.g., one using Linear Regression, another using Random Forest Regression, and so on): we get a whole layer of blenders. The trick is to split the training set into three subsets: the first one is used to train the first layer, the second one is used to create the training set used to train the second layer (using predictions made by the predictors of the first layer), and the third one is used to create the training set to train the third layer (using predictions made by the predictors of the second layer). Once this is done, we can make a prediction for a new instance by going through each layer sequentially, as shown in Figure 7-15.

Figure 7-15. Predictions in a multilayer stacking ensemble

Unfortunately, Scikit-Learn does not support stacking directly, but it is not too hard to roll out your own implementation (see the following exercises). Alternatively, you can use an open source implementation such as brew (available at https://github.com/viisar/brew).

Exercises

1. If you have trained five different models on the exact same training data, and they all achieve 95% precision, is there any chance that you can combine these models to get better results? If so, how? If not, why?

2. What is the difference between hard and soft voting classifiers?

3. Is it possible to speed up training of a bagging ensemble by distributing it across multiple servers? What about pasting ensembles, boosting ensembles, random forests, or stacking ensembles?

4. What is the benefit of out-of-bag evaluation?

5. What makes Extra-Trees more random than regular Random Forests? How can this extra randomness help? Are Extra-Trees slower or faster than regular Random Forests?

6. If your AdaBoost ensemble underfits the training data, what hyperparameters should you tweak and how?

7. If your Gradient Boosting ensemble overfits the training set, should you increase or decrease the learning rate?

8. Load the MNIST data (introduced in Chapter 3), and split it into a training set, a validation set, and a test set (e.g., use 40,000 instances for training, 10,000 for validation, and 10,000 for testing). Then train various classifiers, such as a Random Forest classifier, an Extra-Trees classifier, and an SVM. Next, try to combine them into an ensemble that outperforms them all on the validation set, using a soft or hard voting classifier. Once you have found one, try it on the test set. How much better does it perform compared to the individual classifiers?

9. Run the individual classifiers from the previous exercise to make predictions on the validation set, and create a new training set with the resulting predictions: each training instance is a vector containing the set of predictions from all your classifiers for an image, and the target is the image's class. Congratulations, you have just trained a blender, and together with the classifiers they form a stacking ensemble! Now let's evaluate the ensemble on the test set. For each image in the test set, make predictions with all your classifiers, then feed the predictions to the blender to get the ensemble's predictions. How does it compare to the voting classifier you trained earlier?

Solutions to these exercises are available in Appendix A.

1. "Bagging Predictors," L. Breiman (1996).

2. In statistics, resampling with replacement is called bootstrapping.

3. "Pasting small votes for classification in large databases and on-line," L. Breiman (1999).

4. Bias and variance were introduced in Chapter 4.

5. max_samples can alternatively be set to a float between 0.0 and 1.0, in which case the max number of instances to sample is equal to the size of the training set times max_samples.

6. As m grows, this ratio approaches 1 – exp(–1) ≈ 63.212%.

7. "Ensembles on Random Patches," G. Louppe and P. Geurts (2012).

8. "The random subspace method for constructing decision forests," Tin Kam Ho (1998).

9. "Random Decision Forests," T. Ho (1995).

10. The BaggingClassifier class remains useful if you want a bag of something other than Decision Trees.

11. There are a few notable exceptions: splitter is absent (forced to "random"), presort is absent (forced to False), max_samples is absent (forced to 1.0), and base_estimator is absent (forced to DecisionTreeClassifier with the provided hyperparameters).

12. "Extremely randomized trees," P. Geurts, D. Ernst, L. Wehenkel (2005).

13. "A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting," Yoav Freund, Robert E. Schapire (1997).

14. This is just for illustrative purposes. SVMs are generally not good base predictors for AdaBoost, because they are slow and tend to be unstable with AdaBoost.

15. The original AdaBoost algorithm does not use a learning rate hyperparameter.

16. For more details, see "Multi-Class AdaBoost," J. Zhu et al. (2006).

17. First introduced in "Arcing the Edge," L. Breiman (1997).

18. "Stacked Generalization," D. Wolpert (1992).

19. Alternatively, it is possible to use out-of-fold predictions. In some contexts this is called stacking, while using a hold-out set is called blending. However, for many people these terms are synonymous.

Chapter 8. Dimensionality Reduction

Many Machine Learning problems involve thousands or even millions of features for each training instance. Not only does this make training extremely slow, it can also make it much harder to find a good solution, as we will see. This problem is often referred to as the curse of dimensionality.

Fortunately, in real-world problems, it is often possible to reduce the number of features considerably, turning an intractable problem into a tractable one. For example, consider the MNIST images (introduced in Chapter 3): the pixels on the image borders are almost always white, so you could completely drop these pixels from the training set without losing much information. Figure 7-6 confirms that these pixels are utterly unimportant for the classification task. Moreover, two neighboring pixels are often highly correlated: if you merge them into a single pixel (e.g., by taking the mean of the two pixel intensities), you will not lose much information.

WARNING: Reducing dimensionality does lose some information (just like compressing an image to JPEG can degrade its quality), so even though it will speed up training, it may also make your system perform slightly worse. It also makes your pipelines a bit more complex and thus harder to maintain. So you should first try to train your system with the original data before considering using dimensionality reduction if training is too slow. In some cases, however, reducing the dimensionality of the training data may filter out some noise and unnecessary details and thus result in higher performance (but in general it won't; it will just speed up training).

Apart from speeding up training, dimensionality reduction is also extremely useful for data visualization (or DataViz). Reducing the number of dimensions down to two (or three) makes it possible to plot a high-dimensional training set on a graph and often gain some important insights by visually detecting patterns, such as clusters.

In this chapter we will discuss the curse of dimensionality and get a sense of what goes on in high-dimensional space. Then, we will present the two main approaches to dimensionality reduction (projection and Manifold Learning), and we will go through three of the most popular dimensionality reduction techniques: PCA, Kernel PCA, and LLE.

The Curse of Dimensionality

We are so used to living in three dimensions1 that our intuition fails us when we try to imagine a high-dimensional space. Even a basic 4D hypercube is incredibly hard to picture in our mind (see Figure 8-1), let alone a 200-dimensional ellipsoid bent in a 1,000-dimensional space.

Figure 8-1. Point, segment, square, cube, and tesseract (0D to 4D hypercubes)2

It turns out that many things behave very differently in high-dimensional space. For example, if you pick a random point in a unit square (a 1 × 1 square), it will have only about a 0.4% chance of being located less than 0.001 from a border (in other words, it is very unlikely that a random point will be "extreme" along any dimension). But in a 10,000-dimensional unit hypercube (a 1 × 1 × ⋯ × 1 cube, with ten thousand 1s), this probability is greater than 99.999999%. Most points in a high-dimensional hypercube are very close to the border.3

Here is a more troublesome difference: if you pick two points randomly in a unit square, the distance between these two points will be, on average, roughly 0.52. If you pick two random points in a unit 3D cube, the average distance will be roughly 0.66. But what about two points picked randomly in a 1,000,000-dimensional hypercube? Well, the average distance, believe it or not, will be about 408.25 (roughly $\sqrt{1{,}000{,}000/6}$)! This is quite counterintuitive: how can two points be so far apart when they both lie within the same unit hypercube? This fact implies that high-dimensional datasets are at risk of being very sparse: most training instances are likely to be far away from each other. Of course, this also means that a new instance will likely be far away from any training instance, making predictions much less reliable than in lower dimensions, since they will be based on much larger extrapolations. In short, the more dimensions the training set has, the greater the risk of overfitting it.
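You can check these average distances yourself with a quick Monte Carlo simulation; the following sketch uses 1,000 dimensions instead of 1,000,000 to keep memory usage modest, and the sample size of 10,000 pairs is an arbitrary choice:

import numpy as np

np.random.seed(42)
n_pairs = 10000
for n_dims in (2, 3, 1000):
    a = np.random.rand(n_pairs, n_dims)  # random points in the unit hypercube
    b = np.random.rand(n_pairs, n_dims)
    avg_dist = np.linalg.norm(a - b, axis=1).mean()
    print(n_dims, "dimensions -> average distance ≈", round(avg_dist, 2))
# Expect roughly 0.52, 0.66, and sqrt(1000/6) ≈ 12.91, respectively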

In theory, one solution to the curse of dimensionality could be to increase the size of the training set to reach a sufficient density of training instances. Unfortunately, in practice, the number of training instances required to reach a given density grows exponentially with the number of dimensions. With just 100 features (much less than in the MNIST problem), you would need more training instances than atoms in the observable universe in order for training instances to be within 0.1 of each other on average, assuming they were spread out uniformly across all dimensions.

Main Approaches for Dimensionality Reduction

Before we dive into specific dimensionality reduction algorithms, let's take a look at the two main approaches to reducing dimensionality: projection and Manifold Learning.

Projection

In most real-world problems, training instances are not spread out uniformly across all dimensions. Many features are almost constant, while others are highly correlated (as discussed earlier for MNIST). As a result, all training instances actually lie within (or close to) a much lower-dimensional subspace of the high-dimensional space. This sounds very abstract, so let's look at an example. In Figure 8-2 you can see a 3D dataset represented by the circles.

Figure 8-2. A 3D dataset lying close to a 2D subspace

Notice that all training instances lie close to a plane: this is a lower-dimensional (2D) subspace of the high-dimensional (3D) space. Now if we project every training instance perpendicularly onto this subspace (as represented by the short lines connecting the instances to the plane), we get the new 2D dataset shown in Figure 8-3. Ta-da! We have just reduced the dataset's dimensionality from 3D to 2D. Note that the axes correspond to new features z1 and z2 (the coordinates of the projections on the plane).

Figure 8-3. The new 2D dataset after projection

However, projection is not always the best approach to dimensionality reduction. In many cases the subspace may twist and turn, such as in the famous Swiss roll toy dataset represented in Figure 8-4.

Figure 8-4. Swiss roll dataset

Simply projecting onto a plane (e.g., by dropping x3) would squash different layers of the Swiss roll together, as shown on the left of Figure 8-5. However, what you really want is to unroll the Swiss roll to obtain the 2D dataset on the right of Figure 8-5.

Figure 8-5. Squashing by projecting onto a plane (left) versus unrolling the Swiss roll (right)

Manifold Learning

The Swiss roll is an example of a 2D manifold. Put simply, a 2D manifold is a 2D shape that can be bent and twisted in a higher-dimensional space. More generally, a d-dimensional manifold is a part of an n-dimensional space (where d < n) that locally resembles a d-dimensional hyperplane. In the case of the Swiss roll, d = 2 and n = 3: it locally resembles a 2D plane, but it is rolled in the third dimension.

Many dimensionality reduction algorithms work by modeling the manifold on which the training instances lie; this is called Manifold Learning. It relies on the manifold assumption, also called the manifold hypothesis, which holds that most real-world high-dimensional datasets lie close to a much lower-dimensional manifold. This assumption is very often empirically observed.

Once again, think about the MNIST dataset: all handwritten digit images have some similarities. They are made of connected lines, the borders are white, they are more or less centered, and so on. If you randomly generated images, only a ridiculously tiny fraction of them would look like handwritten digits. In other words, the degrees of freedom available to you if you try to create a digit image are dramatically lower than the degrees of freedom you would have if you were allowed to generate any image you wanted. These constraints tend to squeeze the dataset into a lower-dimensional manifold.

The manifold assumption is often accompanied by another implicit assumption: that the task at hand (e.g., classification or regression) will be simpler if expressed in the lower-dimensional space of the manifold. For example, in the top row of Figure 8-6 the Swiss roll is split into two classes: in the 3D space (on the left), the decision boundary would be fairly complex, but in the 2D unrolled manifold space (on the right), the decision boundary is a simple straight line.

However, this assumption does not always hold. For example, in the bottom row of Figure 8-6, the decision boundary is located at x1 = 5. This decision boundary looks very simple in the original 3D space (a vertical plane), but it looks more complex in the unrolled manifold (a collection of four independent line segments).

In short, if you reduce the dimensionality of your training set before training a model, it will definitely speed up training, but it may not always lead to a better or simpler solution; it all depends on the dataset.

Hopefully you now have a good sense of what the curse of dimensionality is and how dimensionality reduction algorithms can fight it, especially when the manifold assumption holds. The rest of this chapter will go through some of the most popular algorithms.

Figure 8-6. The decision boundary may not always be simpler with lower dimensions

PCA

Principal Component Analysis (PCA) is by far the most popular dimensionality reduction algorithm. First it identifies the hyperplane that lies closest to the data, and then it projects the data onto it.

Preserving the Variance

Before you can project the training set onto a lower-dimensional hyperplane, you first need to choose the right hyperplane. For example, a simple 2D dataset is represented on the left of Figure 8-7, along with three different axes (i.e., one-dimensional hyperplanes). On the right is the result of the projection of the dataset onto each of these axes. As you can see, the projection onto the solid line preserves the maximum variance, while the projection onto the dotted line preserves very little variance, and the projection onto the dashed line preserves an intermediate amount of variance.

Figure 8-7. Selecting the subspace onto which to project

It seems reasonable to select the axis that preserves the maximum amount of variance, as it will most likely lose less information than the other projections. Another way to justify this choice is that it is the axis that minimizes the mean squared distance between the original dataset and its projection onto that axis. This is the rather simple idea behind PCA.4

Principal Components

PCA identifies the axis that accounts for the largest amount of variance in the training set. In Figure 8-7, it is the solid line. It also finds a second axis, orthogonal to the first one, that accounts for the largest amount of remaining variance. In this 2D example there is no choice: it is the dotted line. If it were a higher-dimensional dataset, PCA would also find a third axis, orthogonal to both previous axes, and a fourth, a fifth, and so on—as many axes as the number of dimensions in the dataset.

The unit vector that defines the ith axis is called the ith principal component (PC). In Figure 8-7, the 1st PC is c1 and the 2nd PC is c2. In Figure 8-2 the first two PCs are represented by the orthogonal arrows in the plane, and the third PC would be orthogonal to the plane (pointing up or down).

NOTE: The direction of the principal components is not stable: if you perturb the training set slightly and run PCA again, some of the new PCs may point in the opposite direction of the original PCs. However, they will generally still lie on the same axes. In some cases, a pair of PCs may even rotate or swap, but the plane they define will generally remain the same.

So how can you find the principal components of a training set? Luckily, there is a standard matrix factorization technique called Singular Value Decomposition (SVD) that can decompose the training set matrix X into the dot product of three matrices U · Σ · VT, where VT contains all the principal components that we are looking for, as shown in Equation 8-1.

Equation 8-1. Principal components matrix

$$ \mathbf{V}^T = \begin{pmatrix} \mathbf{c}_1 & \mathbf{c}_2 & \cdots & \mathbf{c}_n \end{pmatrix} $$

The following Python code uses NumPy's svd() function to obtain all the principal components of the training set, then extracts the first two PCs:

X_centered = X - X.mean(axis=0)
U, s, V = np.linalg.svd(X_centered)
c1 = V.T[:, 0]
c2 = V.T[:, 1]

WARNING: PCA assumes that the dataset is centered around the origin. As we will see, Scikit-Learn's PCA classes take care of centering the data for you. However, if you implement PCA yourself (as in the preceding example), or if you use other libraries, don't forget to center the data first.

Projecting Down to d Dimensions

Once you have identified all the principal components, you can reduce the dimensionality of the dataset down to d dimensions by projecting it onto the hyperplane defined by the first d principal components. Selecting this hyperplane ensures that the projection will preserve as much variance as possible. For example, in Figure 8-2 the 3D dataset is projected down to the 2D plane defined by the first two principal components, preserving a large part of the dataset's variance. As a result, the 2D projection looks very much like the original 3D dataset.

To project the training set onto the hyperplane, you can simply compute the dot product of the training set matrix X by the matrix Wd, defined as the matrix containing the first d principal components (i.e., the matrix composed of the first d columns of VT), as shown in Equation 8-2.

Equation 8-2. Projecting the training set down to d dimensions

$$ \mathbf{X}_{d\text{-proj}} = \mathbf{X} \cdot \mathbf{W}_d $$

The following Python code projects the training set onto the plane defined by the first two principal components:

W2 = V.T[:, :2]
X2D = X_centered.dot(W2)

There you have it! You now know how to reduce the dimensionality of any dataset down to any number of dimensions, while preserving as much variance as possible.

Using Scikit-Learn

Scikit-Learn's PCA class implements PCA using SVD decomposition just like we did before. The following code applies PCA to reduce the dimensionality of the dataset down to two dimensions (note that it automatically takes care of centering the data):

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X2D = pca.fit_transform(X)

After fitting the PCA transformer to the dataset, you can access the principal components using the components_ variable (note that it contains the PCs as horizontal vectors, so, for example, the first principal component is equal to pca.components_.T[:, 0]).

Explained Variance Ratio

Another very useful piece of information is the explained variance ratio of each principal component, available via the explained_variance_ratio_ variable. It indicates the proportion of the dataset's variance that lies along the axis of each principal component. For example, let's look at the explained variance ratios of the first two components of the 3D dataset represented in Figure 8-2:

>>> pca.explained_variance_ratio_
array([ 0.84248607,  0.14631839])

This tells you that 84.2% of the dataset's variance lies along the first axis, and 14.6% lies along the second axis. This leaves less than 1.2% for the third axis, so it is reasonable to assume that it probably carries little information.

Choosing the Right Number of Dimensions

Instead of arbitrarily choosing the number of dimensions to reduce down to, it is generally preferable to choose the number of dimensions that add up to a sufficiently large portion of the variance (e.g., 95%). Unless, of course, you are reducing dimensionality for data visualization—in that case you will generally want to reduce the dimensionality down to 2 or 3.

The following code computes PCA without reducing dimensionality, then computes the minimum number of dimensions required to preserve 95% of the training set's variance:

pca = PCA()
pca.fit(X_train)
cumsum = np.cumsum(pca.explained_variance_ratio_)
d = np.argmax(cumsum >= 0.95) + 1

You could then set n_components=d and run PCA again. However, there is a much better option: instead of specifying the number of principal components you want to preserve, you can set n_components to be a float between 0.0 and 1.0, indicating the ratio of variance you wish to preserve:

pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_train)

Yet another option is to plot the explained variance as a function of the number of dimensions (simply plot cumsum; see Figure 8-8). There will usually be an elbow in the curve, where the explained variance stops growing fast. You can think of this as the intrinsic dimensionality of the dataset. In this case, you can see that reducing the dimensionality down to about 100 dimensions wouldn't lose too much explained variance.
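A minimal plotting sketch might look like this (the axis labels are illustrative choices, and cumsum is the array computed above):

import numpy as np
import matplotlib.pyplot as plt

plt.plot(np.arange(1, len(cumsum) + 1), cumsum)
plt.xlabel("Number of dimensions")
plt.ylabel("Cumulative explained variance")
plt.show()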

Figure 8-8. Explained variance as a function of the number of dimensions

PCA for Compression

Obviously after dimensionality reduction, the training set takes up much less space. For example, try applying PCA to the MNIST dataset while preserving 95% of its variance. You should find that each instance will have just over 150 features, instead of the original 784 features. So while most of the variance is preserved, the dataset is now less than 20% of its original size! This is a reasonable compression ratio, and you can see how this can speed up a classification algorithm (such as an SVM classifier) tremendously.

It is also possible to decompress the reduced dataset back to 784 dimensions by applying the inverse transformation of the PCA projection. Of course this won't give you back the original data, since the projection lost a bit of information (within the 5% variance that was dropped), but it will likely be quite close to the original data. The mean squared distance between the original data and the reconstructed data (compressed and then decompressed) is called the reconstruction error. For example, the following code compresses the MNIST dataset down to 154 dimensions, then uses the inverse_transform() method to decompress it back to 784 dimensions. Figure 8-9 shows a few digits from the original training set (on the left), and the corresponding digits after compression and decompression. You can see that there is a slight image quality loss, but the digits are still mostly intact.

pca = PCA(n_components=154)
X_reduced = pca.fit_transform(X_train)
X_recovered = pca.inverse_transform(X_reduced)

Figure 8-9. MNIST compression preserving 95% of the variance

The equation of the inverse transformation is shown in Equation 8-3.

Equation 8-3. PCA inverse transformation, back to the original number of dimensions

$$ \mathbf{X}_{\text{recovered}} = \mathbf{X}_{d\text{-proj}} \cdot \mathbf{W}_d^T $$

Incremental PCA

One problem with the preceding implementation of PCA is that it requires the whole training set to fit in memory in order for the SVD algorithm to run. Fortunately, Incremental PCA (IPCA) algorithms have been developed: you can split the training set into mini-batches and feed an IPCA algorithm one mini-batch at a time. This is useful for large training sets, and also to apply PCA online (i.e., on the fly, as new instances arrive).

The following code splits the MNIST dataset into 100 mini-batches (using NumPy's array_split() function) and feeds them to Scikit-Learn's IncrementalPCA class5 to reduce the dimensionality of the MNIST dataset down to 154 dimensions (just like before). Note that you must call the partial_fit() method with each mini-batch rather than the fit() method with the whole training set:

from sklearn.decomposition import IncrementalPCA

n_batches = 100
inc_pca = IncrementalPCA(n_components=154)
for X_batch in np.array_split(X_train, n_batches):
    inc_pca.partial_fit(X_batch)

X_reduced = inc_pca.transform(X_train)

Alternatively, you can use NumPy's memmap class, which allows you to manipulate a large array stored in a binary file on disk as if it were entirely in memory; the class loads only the data it needs in memory, when it needs it. Since the IncrementalPCA class uses only a small part of the array at any given time, the memory usage remains under control. This makes it possible to call the usual fit() method, as you can see in the following code:

X_mm = np.memmap(filename, dtype="float32", mode="r", shape=(m, n))

batch_size = m // n_batches
inc_pca = IncrementalPCA(n_components=154, batch_size=batch_size)
inc_pca.fit(X_mm)

Randomized PCA

Scikit-Learn offers yet another option to perform PCA, called Randomized PCA. This is a stochastic algorithm that quickly finds an approximation of the first d principal components. Its computational complexity is O(m × d²) + O(d³), instead of O(m × n²) + O(n³), so it is dramatically faster than the previous algorithms when d is much smaller than n.

rnd_pca = PCA(n_components=154, svd_solver="randomized")
X_reduced = rnd_pca.fit_transform(X_train)

Kernel PCA

In Chapter 5 we discussed the kernel trick, a mathematical technique that implicitly maps instances into a very high-dimensional space (called the feature space), enabling nonlinear classification and regression with Support Vector Machines. Recall that a linear decision boundary in the high-dimensional feature space corresponds to a complex nonlinear decision boundary in the original space.

It turns out that the same trick can be applied to PCA, making it possible to perform complex nonlinear projections for dimensionality reduction. This is called Kernel PCA (kPCA).6 It is often good at preserving clusters of instances after projection, or sometimes even unrolling datasets that lie close to a twisted manifold.

For example, the following code uses Scikit-Learn's KernelPCA class to perform kPCA with an RBF kernel (see Chapter 5 for more details about the RBF kernel and the other kernels):

from sklearn.decomposition import KernelPCA

rbf_pca = KernelPCA(n_components=2, kernel="rbf", gamma=0.04)
X_reduced = rbf_pca.fit_transform(X)

Figure 8-10 shows the Swiss roll, reduced to two dimensions using a linear kernel (equivalent to simply using the PCA class), an RBF kernel, and a sigmoid kernel (logistic).

Figure 8-10. Swiss roll reduced to 2D using kPCA with various kernels

Selecting a Kernel and Tuning Hyperparameters

As kPCA is an unsupervised learning algorithm, there is no obvious performance measure to help you select the best kernel and hyperparameter values. However, dimensionality reduction is often a preparation step for a supervised learning task (e.g., classification), so you can simply use grid search to select the kernel and hyperparameters that lead to the best performance on that task. For example, the following code creates a two-step pipeline, first reducing dimensionality to two dimensions using kPCA, then applying Logistic Regression for classification. Then it uses GridSearchCV to find the best kernel and gamma value for kPCA in order to get the best classification accuracy at the end of the pipeline:

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

clf = Pipeline([
        ("kpca", KernelPCA(n_components=2)),
        ("log_reg", LogisticRegression())
    ])

param_grid = [{
        "kpca__gamma": np.linspace(0.03, 0.05, 10),
        "kpca__kernel": ["rbf", "sigmoid"]
    }]

grid_search = GridSearchCV(clf, param_grid, cv=3)
grid_search.fit(X, y)

The best kernel and hyperparameters are then available through the best_params_ variable:

>>> print(grid_search.best_params_)
{'kpca__gamma': 0.043333333333333335, 'kpca__kernel': 'rbf'}

Another approach, this time entirely unsupervised, is to select the kernel and hyperparameters that yield the lowest reconstruction error. However, reconstruction is not as easy as with linear PCA. Here's why. Figure 8-11 shows the original Swiss roll 3D dataset (top left), and the resulting 2D dataset after kPCA is applied using an RBF kernel (top right). Thanks to the kernel trick, this is mathematically equivalent to mapping the training set to an infinite-dimensional feature space (bottom right) using the feature map φ, then projecting the transformed training set down to 2D using linear PCA. Notice that if we could invert the linear PCA step for a given instance in the reduced space, the reconstructed point would lie in feature space, not in the original space (e.g., like the one represented by an x in the diagram). Since the feature space is infinite-dimensional, we cannot compute the reconstructed point, and therefore we cannot compute the true reconstruction error. Fortunately, it is possible to find a point in the original space that would map close to the reconstructed point. This is called the reconstruction pre-image. Once you have this pre-image, you can measure its squared distance to the original instance. You can then select the kernel and hyperparameters that minimize this reconstruction pre-image error.

Figure 8-11. Kernel PCA and the reconstruction pre-image error

You may be wondering how to perform this reconstruction. One solution is to train a supervised regression model, with the projected instances as the training set and the original instances as the targets. Scikit-Learn will do this automatically if you set fit_inverse_transform=True, as shown in the following code:7

rbf_pca = KernelPCA(n_components=2, kernel="rbf", gamma=0.0433,
                    fit_inverse_transform=True)
X_reduced = rbf_pca.fit_transform(X)
X_preimage = rbf_pca.inverse_transform(X_reduced)

NOTE: By default, fit_inverse_transform=False and KernelPCA has no inverse_transform() method. This method only gets created when you set fit_inverse_transform=True.

You can then compute the reconstruction pre-image error:

>>> from sklearn.metrics import mean_squared_error
>>> mean_squared_error(X, X_preimage)
32.786308795766132

Now you can use grid search with cross-validation to find the kernel and hyperparameters that minimize this pre-image reconstruction error.

LLE

Locally Linear Embedding (LLE)8 is another very powerful nonlinear dimensionality reduction (NLDR) technique. It is a Manifold Learning technique that does not rely on projections like the previous algorithms. In a nutshell, LLE works by first measuring how each training instance linearly relates to its closest neighbors (c.n.), and then looking for a low-dimensional representation of the training set where these local relationships are best preserved (more details shortly). This makes it particularly good at unrolling twisted manifolds, especially when there is not too much noise.

For example, the following code uses Scikit-Learn's LocallyLinearEmbedding class to unroll the Swiss roll. The resulting 2D dataset is shown in Figure 8-12. As you can see, the Swiss roll is completely unrolled and the distances between instances are locally well preserved. However, distances are not preserved on a larger scale: the left part of the unrolled Swiss roll is squeezed, while the right part is stretched. Nevertheless, LLE did a pretty good job at modeling the manifold.

from sklearn.manifold import LocallyLinearEmbedding

lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10)
X_reduced = lle.fit_transform(X)

Figure 8-12. Unrolled Swiss roll using LLE

Here’showLLEworks:first,foreachtraininginstancex(i),thealgorithmidentifiesitskclosestneighbors(intheprecedingcodek=10),thentriestoreconstructx(i)asalinearfunctionoftheseneighbors.Morespecifically,itfindstheweightswi,jsuchthatthesquareddistancebetweenx(i)and

isassmallaspossible,assumingwi,j=0ifx(j)isnotoneofthekclosestneighborsofx(i).ThusthefirststepofLLEistheconstrainedoptimizationproblemdescribedinEquation8-4,whereWis

theweightmatrixcontainingalltheweightswi,j.Thesecondconstraintsimplynormalizestheweightsforeachtraininginstancex(i).

Equation8-4.LLEstep1:linearlymodelinglocalrelationships

Afterthisstep,theweightmatrix (containingtheweights )encodesthelocallinearrelationshipsbetweenthetraininginstances.Nowthesecondstepistomapthetraininginstancesintoad-dimensionalspace(whered<n)whilepreservingtheselocalrelationshipsasmuchaspossible.Ifz(i)istheimageof

x(i)inthisd-dimensionalspace,thenwewantthesquareddistancebetweenz(i)and tobeassmallaspossible.ThisidealeadstotheunconstrainedoptimizationproblemdescribedinEquation8-5.Itlooksverysimilartothefirststep,butinsteadofkeepingtheinstancesfixedandfindingtheoptimalweights,wearedoingthereverse:keepingtheweightsfixedandfindingtheoptimalpositionoftheinstances’imagesinthelow-dimensionalspace.NotethatZisthematrixcontainingallz(i).

Equation8-5.LLEstep2:reducingdimensionalitywhilepreservingrelationships

Scikit-Learn’sLLEimplementationhasthefollowingcomputationalcomplexity:O(mlog(m)nlog(k))forfindingtheknearestneighbors,O(mnk3)foroptimizingtheweights,andO(dm2)forconstructingthelow-dimensionalrepresentations.Unfortunately,them2inthelasttermmakesthisalgorithmscalepoorlytoverylargedatasets.

Other Dimensionality Reduction Techniques

There are many other dimensionality reduction techniques, several of which are available in Scikit-Learn. Here are some of the most popular (a short usage sketch follows the list):

Multidimensional Scaling (MDS) reduces dimensionality while trying to preserve the distances between the instances (see Figure 8-13).

Isomap creates a graph by connecting each instance to its nearest neighbors, then reduces dimensionality while trying to preserve the geodesic distances9 between the instances.

t-Distributed Stochastic Neighbor Embedding (t-SNE) reduces dimensionality while trying to keep similar instances close and dissimilar instances apart. It is mostly used for visualization, in particular to visualize clusters of instances in high-dimensional space (e.g., to visualize the MNIST images in 2D).

Linear Discriminant Analysis (LDA) is actually a classification algorithm, but during training it learns the most discriminative axes between the classes, and these axes can then be used to define a hyperplane onto which to project the data. The benefit is that the projection will keep classes as far apart as possible, so LDA is a good technique to reduce dimensionality before running another classification algorithm such as an SVM classifier.
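As a quick illustration, here is a minimal sketch showing how these transformers can be applied with Scikit-Learn; the hyperparameter values are defaults or arbitrary choices, and X and y are assumed to be an existing dataset and its labels:

from sklearn.manifold import MDS, Isomap, TSNE
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

mds = MDS(n_components=2)
X_reduced_mds = mds.fit_transform(X)

isomap = Isomap(n_components=2)
X_reduced_isomap = isomap.fit_transform(X)

tsne = TSNE(n_components=2)
X_reduced_tsne = tsne.fit_transform(X)

lda = LinearDiscriminantAnalysis(n_components=2)
X_reduced_lda = lda.fit_transform(X, y)  # LDA is supervised, so it also needs the labels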

Figure 8-13. Reducing the Swiss roll to 2D using various techniques

Exercises

1. What are the main motivations for reducing a dataset's dimensionality? What are the main drawbacks?

2. What is the curse of dimensionality?

3. Once a dataset's dimensionality has been reduced, is it possible to reverse the operation? If so, how? If not, why?

4. Can PCA be used to reduce the dimensionality of a highly nonlinear dataset?

5. Suppose you perform PCA on a 1,000-dimensional dataset, setting the explained variance ratio to 95%. How many dimensions will the resulting dataset have?

6. In what cases would you use vanilla PCA, Incremental PCA, Randomized PCA, or Kernel PCA?

7. How can you evaluate the performance of a dimensionality reduction algorithm on your dataset?

8. Does it make any sense to chain two different dimensionality reduction algorithms?

9. Load the MNIST dataset (introduced in Chapter 3) and split it into a training set and a test set (take the first 60,000 instances for training, and the remaining 10,000 for testing). Train a Random Forest classifier on the dataset and time how long it takes, then evaluate the resulting model on the test set. Next, use PCA to reduce the dataset's dimensionality, with an explained variance ratio of 95%. Train a new Random Forest classifier on the reduced dataset and see how long it takes. Was training much faster? Next evaluate the classifier on the test set: how does it compare to the previous classifier?

10. Use t-SNE to reduce the MNIST dataset down to two dimensions and plot the result using Matplotlib. You can use a scatterplot using 10 different colors to represent each image's target class. Alternatively, you can write colored digits at the location of each instance, or even plot scaled-down versions of the digit images themselves (if you plot all digits, the visualization will be too cluttered, so you should either draw a random sample or plot an instance only if no other instance has already been plotted at a close distance). You should get a nice visualization with well-separated clusters of digits. Try using other dimensionality reduction algorithms such as PCA, LLE, or MDS and compare the resulting visualizations.

Solutions to these exercises are available in Appendix A.

1. Well, four dimensions if you count time, and a few more if you are a string theorist.

2. Watch a rotating tesseract projected into 3D space at http://goo.gl/OM7ktJ. Image by Wikipedia user NerdBoy1392 (Creative Commons BY-SA 3.0). Reproduced from https://en.wikipedia.org/wiki/Tesseract.

3. Fun fact: anyone you know is probably an extremist in at least one dimension (e.g., how much sugar they put in their coffee), if you consider enough dimensions.

4. "On Lines and Planes of Closest Fit to Systems of Points in Space," K. Pearson (1901).

5. Scikit-Learn uses the algorithm described in "Incremental Learning for Robust Visual Tracking," D. Ross et al. (2007).

6. "Kernel Principal Component Analysis," B. Schölkopf, A. Smola, K. Müller (1999).

7. Scikit-Learn uses the algorithm based on Kernel Ridge Regression described in Gokhan H. Bakır, Jason Weston, and Bernhard Schölkopf, "Learning to Find Pre-images" (Tübingen, Germany: Max Planck Institute for Biological Cybernetics, 2004).

8. "Nonlinear Dimensionality Reduction by Locally Linear Embedding," S. Roweis, L. Saul (2000).

9. The geodesic distance between two nodes in a graph is the number of nodes on the shortest path between these nodes.

Part II. Neural Networks and Deep Learning

Chapter 9. Up and Running with TensorFlow

TensorFlow is a powerful open source software library for numerical computation, particularly well suited and fine-tuned for large-scale Machine Learning. Its basic principle is simple: you first define in Python a graph of computations to perform (for example, the one in Figure 9-1), and then TensorFlow takes that graph and runs it efficiently using optimized C++ code.

Figure 9-1. A simple computation graph

Most importantly, it is possible to break up the graph into several chunks and run them in parallel across multiple CPUs or GPUs (as shown in Figure 9-2). TensorFlow also supports distributed computing, so you can train colossal neural networks on humongous training sets in a reasonable amount of time by splitting the computations across hundreds of servers (see Chapter 12). TensorFlow can train a network with millions of parameters on a training set composed of billions of instances with millions of features each. This should come as no surprise, since TensorFlow was developed by the Google Brain team and it powers many of Google's large-scale services, such as Google Cloud Speech, Google Photos, and Google Search.

Figure 9-2. Parallel computation on multiple CPUs/GPUs/servers

When TensorFlow was open-sourced in November 2015, there were already many popular open source libraries for Deep Learning (Table 9-1 lists a few), and to be fair most of TensorFlow's features already existed in one library or another. Nevertheless, TensorFlow's clean design, scalability, flexibility,1 and great documentation (not to mention Google's name) quickly boosted it to the top of the list. In short, TensorFlow was designed to be flexible, scalable, and production-ready, and existing frameworks arguably hit only two out of the three of these. Here are some of TensorFlow's highlights:

It runs not only on Windows, Linux, and macOS, but also on mobile devices, including both iOS and Android.

It provides a very simple Python API called TF.Learn2 (tensorflow.contrib.learn), compatible with Scikit-Learn. As you will see, you can use it to train various types of neural networks in just a few lines of code. It was previously an independent project called Scikit Flow (or skflow).

It also provides another simple API called TF-slim (tensorflow.contrib.slim) to simplify building, training, and evaluating neural networks.

Several other high-level APIs have been built independently on top of TensorFlow, such as Keras (now available in tensorflow.contrib.keras) or Pretty Tensor.

Its main Python API offers much more flexibility (at the cost of higher complexity) to create all sorts of computations, including any neural network architecture you can think of.

It includes highly efficient C++ implementations of many ML operations, particularly those needed to build neural networks. There is also a C++ API to define your own high-performance operations.

It provides several advanced optimization nodes to search for the parameters that minimize a cost function. These are very easy to use since TensorFlow automatically takes care of computing the gradients of the functions you define. This is called automatic differentiating (or autodiff).

It also comes with a great visualization tool called TensorBoard that allows you to browse through the computation graph, view learning curves, and more.

Google also launched a cloud service to run TensorFlow graphs.

Last but not least, it has a dedicated team of passionate and helpful developers, and a growing community contributing to improving it. It is one of the most popular open source projects on GitHub, and more and more great projects are being built on top of it (for examples, check out the resources page on https://www.tensorflow.org/, or https://github.com/jtoy/awesome-tensorflow). To ask technical questions, you should use http://stackoverflow.com/ and tag your question with "tensorflow". You can file bugs and feature requests through GitHub. For general discussions, join the Google group.

In this chapter, we will go through the basics of TensorFlow, from installation to creating, running, saving, and visualizing simple computational graphs. Mastering these basics is important before you build your first neural network (which we will do in the next chapter).

Table 9-1. Open source Deep Learning libraries (not an exhaustive list)

Library          API                   Platforms                             Started by                                Year
Caffe            Python, C++, Matlab   Linux, macOS, Windows                 Y. Jia, UC Berkeley (BVLC)                2013
Deeplearning4j   Java, Scala, Clojure  Linux, macOS, Windows, Android        A. Gibson, J. Patterson                   2014
H2O              Python, R             Linux, macOS, Windows                 H2O.ai                                    2014
MXNet            Python, C++, others   Linux, macOS, Windows, iOS, Android   DMLC                                      2015
TensorFlow       Python, C++           Linux, macOS, Windows, iOS, Android   Google                                    2015
Theano           Python                Linux, macOS, iOS                     University of Montreal                    2010
Torch            C++, Lua              Linux, macOS, iOS, Android            R. Collobert, K. Kavukcuoglu, C. Farabet  2002

InstallationLet’sgetstarted!AssumingyouinstalledJupyterandScikit-LearnbyfollowingtheinstallationinstructionsinChapter2,youcansimplyusepiptoinstallTensorFlow.Ifyoucreatedanisolatedenvironmentusingvirtualenv,youfirstneedtoactivateit:

$cd$ML_PATH#YourMLworkingdirectory(e.g.,$HOME/ml)

$sourceenv/bin/activate

Next,installTensorFlow:

$pip3install--upgradetensorflow

NOTEForGPUsupport,youneedtoinstalltensorflow-gpuinsteadoftensorflow.SeeChapter12formoredetails.

Totestyourinstallation,typethefollowingcommand.ItshouldoutputtheversionofTensorFlowyouinstalled.

$python3-c'importtensorflow;print(tensorflow.__version__)'

1.0.0

Creating Your First Graph and Running It in a Session

The following code creates the graph represented in Figure 9-1:

import tensorflow as tf

x = tf.Variable(3, name="x")
y = tf.Variable(4, name="y")
f = x*x*y + y + 2

That’sallthereistoit!Themostimportantthingtounderstandisthatthiscodedoesnotactuallyperformanycomputation,eventhoughitlookslikeitdoes(especiallythelastline).Itjustcreatesacomputationgraph.Infact,eventhevariablesarenotinitializedyet.Toevaluatethisgraph,youneedtoopenaTensorFlowsessionanduseittoinitializethevariablesandevaluatef.ATensorFlowsessiontakescareofplacingtheoperationsontodevicessuchasCPUsandGPUsandrunningthem,anditholdsallthevariablevalues.3Thefollowingcodecreatesasession,initializesthevariables,andevaluates,andfthenclosesthesession(whichfreesupresources):

>>>sess=tf.Session()

>>>sess.run(x.initializer)

>>>sess.run(y.initializer)

>>>result=sess.run(f)

>>>print(result)

42

>>>sess.close()

Havingtorepeatsess.run()allthetimeisabitcumbersome,butfortunatelythereisabetterway:

withtf.Session()assess:

x.initializer.run()

y.initializer.run()

result=f.eval()

Insidethewithblock,thesessionissetasthedefaultsession.Callingx.initializer.run()isequivalenttocallingtf.get_default_session().run(x.initializer),andsimilarlyf.eval()isequivalenttocallingtf.get_default_session().run(f).Thismakesthecodeeasiertoread.Moreover,thesessionisautomaticallyclosedattheendoftheblock.

Insteadofmanuallyrunningtheinitializerforeverysinglevariable,youcanusetheglobal_variables_initializer()function.Notethatitdoesnotactuallyperformtheinitializationimmediately,butrathercreatesanodeinthegraphthatwillinitializeallvariableswhenitisrun:

init=tf.global_variables_initializer()#prepareaninitnode

withtf.Session()assess:

init.run()#actuallyinitializeallthevariables

result=f.eval()

Inside Jupyter or within a Python shell you may prefer to create an InteractiveSession. The only difference from a regular Session is that when an InteractiveSession is created it automatically sets itself as the default session, so you don't need a with block (but you do need to close the session manually when you are done with it):

>>> sess = tf.InteractiveSession()
>>> init.run()
>>> result = f.eval()
>>> print(result)
42
>>> sess.close()

A TensorFlow program is typically split into two parts: the first part builds a computation graph (this is called the construction phase), and the second part runs it (this is the execution phase). The construction phase typically builds a computation graph representing the ML model and the computations required to train it. The execution phase generally runs a loop that evaluates a training step repeatedly (for example, one step per mini-batch), gradually improving the model parameters. We will go through an example shortly.

Managing Graphs

Any node you create is automatically added to the default graph:

>>> x1 = tf.Variable(1)
>>> x1.graph is tf.get_default_graph()
True

In most cases this is fine, but sometimes you may want to manage multiple independent graphs. You can do this by creating a new Graph and temporarily making it the default graph inside a with block, like so:

>>> graph = tf.Graph()
>>> with graph.as_default():
...     x2 = tf.Variable(2)
...
>>> x2.graph is graph
True
>>> x2.graph is tf.get_default_graph()
False

TIP: In Jupyter (or in a Python shell), it is common to run the same commands more than once while you are experimenting. As a result, you may end up with a default graph containing many duplicate nodes. One solution is to restart the Jupyter kernel (or the Python shell), but a more convenient solution is to just reset the default graph by running tf.reset_default_graph().

Lifecycle of a Node Value

When you evaluate a node, TensorFlow automatically determines the set of nodes that it depends on and it evaluates these nodes first. For example, consider the following code:

w = tf.constant(3)
x = w + 2
y = x + 5
z = x * 3

with tf.Session() as sess:
    print(y.eval())  # 10
    print(z.eval())  # 15

First, this code defines a very simple graph. Then it starts a session and runs the graph to evaluate y: TensorFlow automatically detects that y depends on x, which depends on w, so it first evaluates w, then x, then y, and returns the value of y. Finally, the code runs the graph to evaluate z. Once again, TensorFlow detects that it must first evaluate w and x. It is important to note that it will not reuse the result of the previous evaluation of w and x. In short, the preceding code evaluates w and x twice.

All node values are dropped between graph runs, except variable values, which are maintained by the session across graph runs (queues and readers also maintain some state, as we will see in Chapter 12). A variable starts its life when its initializer is run, and it ends when the session is closed.

If you want to evaluate y and z efficiently, without evaluating w and x twice as in the previous code, you must ask TensorFlow to evaluate both y and z in just one graph run, as shown in the following code:

with tf.Session() as sess:
    y_val, z_val = sess.run([y, z])
    print(y_val)  # 10
    print(z_val)  # 15

WARNING: In single-process TensorFlow, multiple sessions do not share any state, even if they reuse the same graph (each session would have its own copy of every variable). In distributed TensorFlow (see Chapter 12), variable state is stored on the servers, not in the sessions, so multiple sessions can share the same variables.

Linear Regression with TensorFlow

TensorFlow operations (also called ops for short) can take any number of inputs and produce any number of outputs. For example, the addition and multiplication ops each take two inputs and produce one output. Constants and variables take no input (they are called source ops). The inputs and outputs are multidimensional arrays, called tensors (hence the name "tensor flow"). Just like NumPy arrays, tensors have a type and a shape. In fact, in the Python API tensors are simply represented by NumPy ndarrays. They typically contain floats, but you can also use them to carry strings (arbitrary byte arrays).

In the examples so far, the tensors just contained a single scalar value, but you can of course perform computations on arrays of any shape. For example, the following code manipulates 2D arrays to perform Linear Regression on the California housing dataset (introduced in Chapter 2). It starts by fetching the dataset; then it adds an extra bias input feature (x0 = 1) to all training instances (it does so using NumPy so it runs immediately); then it creates two TensorFlow constant nodes, X and y, to hold this data and the targets,4 and it uses some of the matrix operations provided by TensorFlow to define theta. These matrix functions—transpose(), matmul(), and matrix_inverse()—are self-explanatory, but as usual they do not perform any computations immediately; instead, they create nodes in the graph that will perform them when the graph is run. You may recognize that the definition of theta corresponds to the Normal Equation ($\hat{\boldsymbol{\theta}} = (\mathbf{X}^T \cdot \mathbf{X})^{-1} \cdot \mathbf{X}^T \cdot \mathbf{y}$; see Chapter 4). Finally, the code creates a session and uses it to evaluate theta.

import numpy as np
from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing()
m, n = housing.data.shape
housing_data_plus_bias = np.c_[np.ones((m, 1)), housing.data]

X = tf.constant(housing_data_plus_bias, dtype=tf.float32, name="X")
y = tf.constant(housing.target.reshape(-1, 1), dtype=tf.float32, name="y")
XT = tf.transpose(X)
theta = tf.matmul(tf.matmul(tf.matrix_inverse(tf.matmul(XT, X)), XT), y)

with tf.Session() as sess:
    theta_value = theta.eval()

The main benefit of this code versus computing the Normal Equation directly using NumPy is that TensorFlow will automatically run this on your GPU card if you have one (provided you installed TensorFlow with GPU support, of course; see Chapter 12 for more details).

Implementing Gradient Descent

Let's try using Batch Gradient Descent (introduced in Chapter 4) instead of the Normal Equation. First we will do this by manually computing the gradients, then we will use TensorFlow's autodiff feature to let TensorFlow compute the gradients automatically, and finally we will use a couple of TensorFlow's out-of-the-box optimizers.

WARNING: When using Gradient Descent, remember that it is important to first normalize the input feature vectors, or else training may be much slower. You can do this using TensorFlow, NumPy, Scikit-Learn's StandardScaler, or any other solution you prefer. The following code assumes that this normalization has already been done.

Manually Computing the Gradients

The following code should be fairly self-explanatory, except for a few new elements:

The random_uniform() function creates a node in the graph that will generate a tensor containing random values, given its shape and value range, much like NumPy's rand() function.

The assign() function creates a node that will assign a new value to a variable. In this case, it implements the Batch Gradient Descent step θ(next step) = θ – η∇θMSE(θ).

The main loop executes the training step over and over again (n_epochs times), and every 100 iterations it prints out the current Mean Squared Error (mse). You should see the MSE go down at every iteration.

n_epochs = 1000
learning_rate = 0.01

X = tf.constant(scaled_housing_data_plus_bias, dtype=tf.float32, name="X")
y = tf.constant(housing.target.reshape(-1, 1), dtype=tf.float32, name="y")
theta = tf.Variable(tf.random_uniform([n + 1, 1], -1.0, 1.0), name="theta")
y_pred = tf.matmul(X, theta, name="predictions")
error = y_pred - y
mse = tf.reduce_mean(tf.square(error), name="mse")
gradients = 2/m * tf.matmul(tf.transpose(X), error)
training_op = tf.assign(theta, theta - learning_rate * gradients)

init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)

    for epoch in range(n_epochs):
        if epoch % 100 == 0:
            print("Epoch", epoch, "MSE =", mse.eval())
        sess.run(training_op)

    best_theta = theta.eval()

Using autodiff

The preceding code works fine, but it requires mathematically deriving the gradients from the cost function (MSE). In the case of Linear Regression, it is reasonably easy, but if you had to do this with deep neural networks you would get quite a headache: it would be tedious and error-prone. You could use symbolic differentiation to automatically find the equations for the partial derivatives for you, but the resulting code would not necessarily be very efficient.

To understand why, consider the function f(x) = exp(exp(exp(x))). If you know calculus, you can figure out its derivative f′(x) = exp(x) × exp(exp(x)) × exp(exp(exp(x))). If you code f(x) and f′(x) separately and exactly as they appear, your code will not be as efficient as it could be. A more efficient solution would be to write a function that first computes exp(x), then exp(exp(x)), then exp(exp(exp(x))), and returns all three. This gives you f(x) directly (the third term), and if you need the derivative you can just multiply all three terms and you are done. With the naïve approach you would have had to call the exp function nine times to compute both f(x) and f′(x). With this approach you just need to call it three times.
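For illustration, here is a small sketch (plain Python, not part of TensorFlow) of that more efficient approach: the three intermediate terms are computed once and reused for both f(x) and f′(x), so exp is called only three times in total:

import numpy as np

def f_terms(x):
    # Compute exp(x), exp(exp(x)), and exp(exp(exp(x))) once each
    t1 = np.exp(x)
    t2 = np.exp(t1)
    t3 = np.exp(t2)
    return t1, t2, t3

def f_and_derivative(x):
    t1, t2, t3 = f_terms(x)
    return t3, t1 * t2 * t3  # f(x) is the third term; f'(x) is the product of all three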

It gets worse when your function is defined by some arbitrary code. Can you find the equation (or the code) to compute the partial derivatives of the following function? Hint: don't even try.

def my_func(a, b):
    z = 0
    for i in range(100):
        z = a * np.cos(z + i) + z * np.sin(b - i)
    return z

Fortunately, TensorFlow's autodiff feature comes to the rescue: it can automatically and efficiently compute the gradients for you. Simply replace the gradients = ... line in the Gradient Descent code in the previous section with the following line, and the code will continue to work just fine:

gradients = tf.gradients(mse, [theta])[0]

The gradients() function takes an op (in this case mse) and a list of variables (in this case just theta), and it creates a list of ops (one per variable) to compute the gradients of the op with regards to each variable. So the gradients node will compute the gradient vector of the MSE with regards to theta.

There are four main approaches to computing gradients automatically. They are summarized in Table 9-2. TensorFlow uses reverse-mode autodiff, which is perfect (efficient and accurate) when there are many inputs and few outputs, as is often the case in neural networks. It computes all the partial derivatives of the outputs with regards to all the inputs in just n_outputs + 1 graph traversals.

Table 9-2. Main solutions to compute gradients automatically

Technique Nbofgraphtraversalstocomputeallgradients

Accuracy Supportsarbitrarycode

Comment

Numericaldifferentiation

ninputs+1 Low Yes Trivialtoimplement

Symbolicdifferentiation N/A High No Buildsaverydifferent

graph

Forward-modeautodiff ninputs High Yes Usesdualnumbers

Reverse-modeautodiff noutputs+1 High Yes ImplementedbyTensorFlow

Ifyouareinterestedinhowthismagicworks,checkoutAppendixD.

Using an Optimizer

So TensorFlow computes the gradients for you. But it gets even easier: it also provides a number of optimizers out of the box, including a Gradient Descent optimizer. You can simply replace the preceding gradients = ... and training_op = ... lines with the following code, and once again everything will just work fine:

optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(mse)

If you want to use a different type of optimizer, you just need to change one line. For example, you can use a momentum optimizer (which often converges much faster than Gradient Descent; see Chapter 11) by defining the optimizer like this:

optimizer = tf.train.MomentumOptimizer(learning_rate=learning_rate,
                                       momentum=0.9)

Feeding Data to the Training Algorithm

Let's try to modify the previous code to implement Mini-batch Gradient Descent. For this, we need a way to replace X and y at every iteration with the next mini-batch. The simplest way to do this is to use placeholder nodes. These nodes are special because they don't actually perform any computation, they just output the data you tell them to output at runtime. They are typically used to pass the training data to TensorFlow during training. If you don't specify a value at runtime for a placeholder, you get an exception.

To create a placeholder node, you must call the placeholder() function and specify the output tensor's data type. Optionally, you can also specify its shape, if you want to enforce it. If you specify None for a dimension, it means "any size." For example, the following code creates a placeholder node A, and also a node B = A + 5. When we evaluate B, we pass a feed_dict to the eval() method that specifies the value of A. Note that A must have rank 2 (i.e., it must be two-dimensional) and there must be three columns (or else an exception is raised), but it can have any number of rows.

>>> A = tf.placeholder(tf.float32, shape=(None, 3))
>>> B = A + 5
>>> with tf.Session() as sess:
...     B_val_1 = B.eval(feed_dict={A: [[1, 2, 3]]})
...     B_val_2 = B.eval(feed_dict={A: [[4, 5, 6], [7, 8, 9]]})
...
>>> print(B_val_1)
[[ 6.  7.  8.]]
>>> print(B_val_2)
[[  9.  10.  11.]
 [ 12.  13.  14.]]

NOTE: You can actually feed the output of any operations, not just placeholders. In this case TensorFlow does not try to evaluate these operations; it uses the values you feed it.

To implement Mini-batch Gradient Descent, we only need to tweak the existing code slightly. First change the definition of X and y in the construction phase to make them placeholder nodes:

X = tf.placeholder(tf.float32, shape=(None, n + 1), name="X")
y = tf.placeholder(tf.float32, shape=(None, 1), name="y")

Then define the batch size and compute the total number of batches:

batch_size = 100
n_batches = int(np.ceil(m / batch_size))

Finally, in the execution phase, fetch the mini-batches one by one, then provide the value of X and y via the feed_dict parameter when evaluating a node that depends on either of them.

def fetch_batch(epoch, batch_index, batch_size):
    [...]  # load the data from disk
    return X_batch, y_batch

with tf.Session() as sess:
    sess.run(init)

    for epoch in range(n_epochs):
        for batch_index in range(n_batches):
            X_batch, y_batch = fetch_batch(epoch, batch_index, batch_size)
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})

    best_theta = theta.eval()
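
The body of fetch_batch() is left out above. Purely as an illustrative sketch (assuming the scaled_housing_data_plus_bias array and housing.target used earlier are still in memory rather than loaded from disk), one possible implementation samples a random mini-batch like this:

def fetch_batch(epoch, batch_index, batch_size):
    # Hypothetical in-memory version: sample a random mini-batch, seeding the
    # RNG so each (epoch, batch_index) pair always yields the same batch.
    rnd = np.random.RandomState(epoch * n_batches + batch_index)
    indices = rnd.randint(m, size=batch_size)
    X_batch = scaled_housing_data_plus_bias[indices]
    y_batch = housing.target.reshape(-1, 1)[indices]
    return X_batch, y_batch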

NOTE: We don't need to pass the value of X and y when evaluating theta since it does not depend on either of them.

Saving and Restoring Models

Once you have trained your model, you should save its parameters to disk so you can come back to it whenever you want, use it in another program, compare it to other models, and so on. Moreover, you probably want to save checkpoints at regular intervals during training so that if your computer crashes during training you can continue from the last checkpoint rather than start over from scratch.

TensorFlow makes saving and restoring a model very easy. Just create a Saver node at the end of the construction phase (after all variable nodes are created); then, in the execution phase, just call its save() method whenever you want to save the model, passing it the session and path of the checkpoint file:

[...]
theta = tf.Variable(tf.random_uniform([n + 1, 1], -1.0, 1.0), name="theta")
[...]
init = tf.global_variables_initializer()
saver = tf.train.Saver()

with tf.Session() as sess:
    sess.run(init)

    for epoch in range(n_epochs):
        if epoch % 100 == 0:  # checkpoint every 100 epochs
            save_path = saver.save(sess, "/tmp/my_model.ckpt")
        sess.run(training_op)

    best_theta = theta.eval()
    save_path = saver.save(sess, "/tmp/my_model_final.ckpt")

Restoring a model is just as easy: you create a Saver at the end of the construction phase just like before, but then at the beginning of the execution phase, instead of initializing the variables using the init node, you call the restore() method of the Saver object:

with tf.Session() as sess:
    saver.restore(sess, "/tmp/my_model_final.ckpt")
    [...]

By default a Saver saves and restores all variables under their own name, but if you need more control, you can specify which variables to save or restore, and what names to use. For example, the following Saver will save or restore only the theta variable under the name weights:

saver = tf.train.Saver({"weights": theta})

By default, the save() method also saves the structure of the graph in a second file with the same name plus a .meta extension. You can load this graph structure using tf.train.import_meta_graph(). This adds the graph to the default graph, and returns a Saver instance that you can then use to restore the graph's state (i.e., the variable values):

saver = tf.train.import_meta_graph("/tmp/my_model_final.ckpt.meta")

with tf.Session() as sess:
    saver.restore(sess, "/tmp/my_model_final.ckpt")
    [...]

This allows you to fully restore a saved model, including both the graph structure and the variable values, without having to search for the code that built it.

Visualizing the Graph and Training Curves Using TensorBoard

So now we have a computation graph that trains a Linear Regression model using Mini-batch Gradient Descent, and we are saving checkpoints at regular intervals. Sounds sophisticated, doesn't it? However, we are still relying on the print() function to visualize progress during training. There is a better way: enter TensorBoard. If you feed it some training stats, it will display nice interactive visualizations of these stats in your web browser (e.g., learning curves). You can also provide it the graph's definition and it will give you a great interface to browse through it. This is very useful to identify errors in the graph, to find bottlenecks, and so on.

The first step is to tweak your program a bit so it writes the graph definition and some training stats (for example, the training error, MSE) to a log directory that TensorBoard will read from. You need to use a different log directory every time you run your program, or else TensorBoard will merge stats from different runs, which will mess up the visualizations. The simplest solution for this is to include a timestamp in the log directory name. Add the following code at the beginning of the program:

from datetime import datetime

now = datetime.utcnow().strftime("%Y%m%d%H%M%S")
root_logdir = "tf_logs"
logdir = "{}/run-{}/".format(root_logdir, now)

Next, add the following code at the very end of the construction phase:

mse_summary = tf.summary.scalar('MSE', mse)
file_writer = tf.summary.FileWriter(logdir, tf.get_default_graph())

The first line creates a node in the graph that will evaluate the MSE value and write it to a TensorBoard-compatible binary log string called a summary. The second line creates a FileWriter that you will use to write summaries to logfiles in the log directory. The first parameter indicates the path of the log directory (in this case something like tf_logs/run-20160906091959/, relative to the current directory). The second (optional) parameter is the graph you want to visualize. Upon creation, the FileWriter creates the log directory if it does not already exist (and its parent directories if needed), and writes the graph definition in a binary logfile called an events file.

Next you need to update the execution phase to evaluate the mse_summary node regularly during training (e.g., every 10 mini-batches). This will output a summary that you can then write to the events file using the file_writer. Here is the updated code:

[...]
for batch_index in range(n_batches):
    X_batch, y_batch = fetch_batch(epoch, batch_index, batch_size)
    if batch_index % 10 == 0:
        summary_str = mse_summary.eval(feed_dict={X: X_batch, y: y_batch})
        step = epoch * n_batches + batch_index
        file_writer.add_summary(summary_str, step)
    sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
[...]

WARNING: Avoid logging training stats at every single training step, as this would significantly slow down training.

Finally, you want to close the FileWriter at the end of the program:

file_writer.close()

Now run this program: it will create the log directory and write an events file in this directory, containing both the graph definition and the MSE values. Open up a shell and go to your working directory, then type ls -l tf_logs/run* to list the contents of the log directory:

$ cd $ML_PATH    # Your ML working directory (e.g., $HOME/ml)
$ ls -l tf_logs/run*
total 40
-rw-r--r--  1 ageron  staff  18620 Sep  6 11:10 events.out.tfevents.1472553182.mymac

If you run the program a second time, you should see a second directory in the tf_logs/ directory:

$ ls -l tf_logs/
total 0
drwxr-xr-x  3 ageron  staff  102 Sep  6 10:07 run-20160906091959
drwxr-xr-x  3 ageron  staff  102 Sep  6 10:22 run-20160906092202

Great! Now it's time to fire up the TensorBoard server. You need to activate your virtualenv environment if you created one, then start the server by running the tensorboard command, pointing it to the root log directory. This starts the TensorBoard web server, listening on port 6006 (which is "goog" written upside down):

$ source env/bin/activate
$ tensorboard --logdir tf_logs/
Starting TensorBoard on port 6006
(You can navigate to http://0.0.0.0:6006)

Next open a browser and go to http://0.0.0.0:6006/ (or http://localhost:6006/). Welcome to TensorBoard! In the Events tab you should see MSE on the right. If you click on it, you will see a plot of the MSE during training, for both runs (Figure 9-3). You can check or uncheck the runs you want to see, zoom in or out, hover over the curve to get details, and so on.

Figure 9-3. Visualizing training stats using TensorBoard

Now click on the Graphs tab. You should see the graph shown in Figure 9-4.

To reduce clutter, the nodes that have many edges (i.e., connections to other nodes) are separated out to an auxiliary area on the right (you can move a node back and forth between the main graph and the auxiliary area by right-clicking on it). Some parts of the graph are also collapsed by default. For example, try hovering over the gradients node, then click on its expand icon to open this subgraph. Next, in this subgraph, try expanding the mse_grad subgraph.

Figure 9-4. Visualizing the graph using TensorBoard

TIP: If you want to take a peek at the graph directly within Jupyter, you can use the show_graph() function available in the notebook for this chapter. It was originally written by A. Mordvintsev in his great deepdream tutorial notebook. Another option is to install E. Jang's TensorFlow debugger tool, which includes a Jupyter extension for graph visualization (and more).

Name Scopes

When dealing with more complex models such as neural networks, the graph can easily become cluttered with thousands of nodes. To avoid this, you can create name scopes to group related nodes. For example, let's modify the previous code to define the error and mse ops within a name scope called "loss":

with tf.name_scope("loss") as scope:
    error = y_pred - y
    mse = tf.reduce_mean(tf.square(error), name="mse")

The name of each op defined within the scope is now prefixed with "loss/":

>>> print(error.op.name)
loss/sub
>>> print(mse.op.name)
loss/mse

In TensorBoard, the mse and error nodes now appear inside the loss namespace, which appears collapsed by default (Figure 9-5).

Figure 9-5. A collapsed namescope in TensorBoard

Modularity

Suppose you want to create a graph that adds the output of two rectified linear units (ReLU). A ReLU computes a linear function of the inputs, and outputs the result if it is positive, and 0 otherwise, as shown in Equation 9-1.

Equation 9-1. Rectified linear unit

hw,b(X) = max(X · w + b, 0)

The following code does the job, but it's quite repetitive:

n_features = 3
X = tf.placeholder(tf.float32, shape=(None, n_features), name="X")

w1 = tf.Variable(tf.random_normal((n_features, 1)), name="weights1")
w2 = tf.Variable(tf.random_normal((n_features, 1)), name="weights2")
b1 = tf.Variable(0.0, name="bias1")
b2 = tf.Variable(0.0, name="bias2")

z1 = tf.add(tf.matmul(X, w1), b1, name="z1")
z2 = tf.add(tf.matmul(X, w2), b2, name="z2")

relu1 = tf.maximum(z1, 0., name="relu1")
relu2 = tf.maximum(z1, 0., name="relu2")

output = tf.add(relu1, relu2, name="output")

Such repetitive code is hard to maintain and error-prone (in fact, this code contains a cut-and-paste error; did you spot it?). It would become even worse if you wanted to add a few more ReLUs. Fortunately, TensorFlow lets you stay DRY (Don't Repeat Yourself): simply create a function to build a ReLU. The following code creates five ReLUs and outputs their sum (note that add_n() creates an operation that will compute the sum of a list of tensors):

def relu(X):
    w_shape = (int(X.get_shape()[1]), 1)
    w = tf.Variable(tf.random_normal(w_shape), name="weights")
    b = tf.Variable(0.0, name="bias")
    z = tf.add(tf.matmul(X, w), b, name="z")
    return tf.maximum(z, 0., name="relu")

n_features = 3
X = tf.placeholder(tf.float32, shape=(None, n_features), name="X")
relus = [relu(X) for i in range(5)]
output = tf.add_n(relus, name="output")

Note that when you create a node, TensorFlow checks whether its name already exists, and if it does it appends an underscore followed by an index to make the name unique. So the first ReLU contains nodes named "weights", "bias", "z", and "relu" (plus many more nodes with their default name, such as "MatMul"); the second ReLU contains nodes named "weights_1", "bias_1", and so on; the third ReLU contains nodes named "weights_2", "bias_2", and so on. TensorBoard identifies such series and collapses them together to reduce clutter (as you can see in Figure 9-6).
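
For example (a quick check you can run yourself, not from the book, and assuming a fresh graph), printing a couple of op names right after building the five ReLUs shows the automatic suffixes:

>>> print(relus[0].op.name)   # the output node of the first ReLU
relu
>>> print(relus[1].op.name)   # same name, so TensorFlow appended _1
relu_1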

Figure 9-6. Collapsed node series

Using name scopes, you can make the graph much clearer. Simply move all the content of the relu() function inside a name scope. Figure 9-7 shows the resulting graph. Notice that TensorFlow also gives the name scopes unique names by appending _1, _2, and so on.

def relu(X):
    with tf.name_scope("relu"):
        [...]

Figure 9-7. A clearer graph using name-scoped units

Sharing Variables

If you want to share a variable between various components of your graph, one simple option is to create it first, then pass it as a parameter to the functions that need it. For example, suppose you want to control the ReLU threshold (currently hardcoded to 0) using a shared threshold variable for all ReLUs. You could just create that variable first, and then pass it to the relu() function:

def relu(X, threshold):
    with tf.name_scope("relu"):
        [...]
        return tf.maximum(z, threshold, name="max")

threshold = tf.Variable(0.0, name="threshold")
X = tf.placeholder(tf.float32, shape=(None, n_features), name="X")
relus = [relu(X, threshold) for i in range(5)]
output = tf.add_n(relus, name="output")

This works fine: now you can control the threshold for all ReLUs using the threshold variable. However, if there are many shared parameters such as this one, it will be painful to have to pass them around as parameters all the time. Many people create a Python dictionary containing all the variables in their model, and pass it around to every function. Others create a class for each module (e.g., a ReLU class using class variables to handle the shared parameter). Yet another option is to set the shared variable as an attribute of the relu() function upon the first call, like so:

def relu(X):
    with tf.name_scope("relu"):
        if not hasattr(relu, "threshold"):
            relu.threshold = tf.Variable(0.0, name="threshold")
        [...]
        return tf.maximum(z, relu.threshold, name="max")

TensorFlow offers another option, which may lead to slightly cleaner and more modular code than the previous solutions.5 This solution is a bit tricky to understand at first, but since it is used a lot in TensorFlow it is worth going into a bit of detail. The idea is to use the get_variable() function to create the shared variable if it does not exist yet, or reuse it if it already exists. The desired behavior (creating or reusing) is controlled by an attribute of the current variable_scope(). For example, the following code will create a variable named "relu/threshold" (as a scalar, since shape=(), and using 0.0 as the initial value):

with tf.variable_scope("relu"):
    threshold = tf.get_variable("threshold", shape=(),
                                initializer=tf.constant_initializer(0.0))

Note that if the variable has already been created by an earlier call to get_variable(), this code will raise an exception. This behavior prevents reusing variables by mistake. If you want to reuse a variable, you need to explicitly say so by setting the variable scope's reuse attribute to True (in which case you don't have to specify the shape or the initializer):

with tf.variable_scope("relu", reuse=True):
    threshold = tf.get_variable("threshold")

This code will fetch the existing "relu/threshold" variable, or raise an exception if it does not exist or if it was not created using get_variable(). Alternatively, you can set the reuse attribute to True inside the block by calling the scope's reuse_variables() method:

with tf.variable_scope("relu") as scope:
    scope.reuse_variables()
    threshold = tf.get_variable("threshold")

WARNING: Once reuse is set to True, it cannot be set back to False within the block. Moreover, if you define other variable scopes inside this one, they will automatically inherit reuse=True. Lastly, only variables created by get_variable() can be reused this way.

Now you have all the pieces you need to make the relu() function access the threshold variable without having to pass it as a parameter:

def relu(X):
    with tf.variable_scope("relu", reuse=True):
        threshold = tf.get_variable("threshold")  # reuse existing variable
        [...]
        return tf.maximum(z, threshold, name="max")

X = tf.placeholder(tf.float32, shape=(None, n_features), name="X")
with tf.variable_scope("relu"):  # create the variable
    threshold = tf.get_variable("threshold", shape=(),
                                initializer=tf.constant_initializer(0.0))
relus = [relu(X) for relu_index in range(5)]
output = tf.add_n(relus, name="output")

This code first defines the relu() function, then creates the relu/threshold variable (as a scalar that will later be initialized to 0.0) and builds five ReLUs by calling the relu() function. The relu() function reuses the relu/threshold variable, and creates the other ReLU nodes.

NOTE: Variables created using get_variable() are always named using the name of their variable_scope as a prefix (e.g., "relu/threshold"), but for all other nodes (including variables created with tf.Variable()) the variable scope acts like a new name scope. In particular, if a name scope with an identical name was already created, then a suffix is added to make the name unique. For example, all nodes created in the preceding code (except the threshold variable) have a name prefixed with "relu_1/" to "relu_5/", as shown in Figure 9-8.

Figure 9-8. Five ReLUs sharing the threshold variable

It is somewhat unfortunate that the threshold variable must be defined outside the relu() function, where all the rest of the ReLU code resides. To fix this, the following code creates the threshold variable within the relu() function upon the first call, then reuses it in subsequent calls. Now the relu() function does not have to worry about name scopes or variable sharing: it just calls get_variable(), which will create or reuse the threshold variable (it does not need to know which is the case). The rest of the code calls relu() five times, making sure to set reuse=False on the first call, and reuse=True for the other calls.

def relu(X):
    threshold = tf.get_variable("threshold", shape=(),
                                initializer=tf.constant_initializer(0.0))
    [...]
    return tf.maximum(z, threshold, name="max")

X = tf.placeholder(tf.float32, shape=(None, n_features), name="X")
relus = []
for relu_index in range(5):
    with tf.variable_scope("relu", reuse=(relu_index >= 1)) as scope:
        relus.append(relu(X))
output = tf.add_n(relus, name="output")

The resulting graph is slightly different than before, since the shared variable lives within the first ReLU (see Figure 9-9).

Figure 9-9. Five ReLUs sharing the threshold variable

This concludes this introduction to TensorFlow. We will discuss more advanced topics as we go through the following chapters, in particular many operations related to deep neural networks, convolutional neural networks, and recurrent neural networks as well as how to scale up with TensorFlow using multithreading, queues, multiple GPUs, and multiple servers.

Exercises

1. What are the main benefits of creating a computation graph rather than directly executing the computations? What are the main drawbacks?

2. Is the statement a_val = a.eval(session=sess) equivalent to a_val = sess.run(a)?

3. Is the statement a_val, b_val = a.eval(session=sess), b.eval(session=sess) equivalent to a_val, b_val = sess.run([a, b])?

4. Can you run two graphs in the same session?

5. If you create a graph g containing a variable w, then start two threads and open a session in each thread, both using the same graph g, will each session have its own copy of the variable w or will it be shared?

6. When is a variable initialized? When is it destroyed?

7. What is the difference between a placeholder and a variable?

8. What happens when you run the graph to evaluate an operation that depends on a placeholder but you don't feed its value? What happens if the operation does not depend on the placeholder?

9. When you run a graph, can you feed the output value of any operation, or just the value of placeholders?

10. How can you set a variable to any value you want (during the execution phase)?

11. How many times does reverse-mode autodiff need to traverse the graph in order to compute the gradients of the cost function with regards to 10 variables? What about forward-mode autodiff? And symbolic differentiation?

12. Implement Logistic Regression with Mini-batch Gradient Descent using TensorFlow. Train it and evaluate it on the moons dataset (introduced in Chapter 5). Try adding all the bells and whistles:

Define the graph within a logistic_regression() function that can be reused easily.

Save checkpoints using a Saver at regular intervals during training, and save the final model at the end of training.

Restore the last checkpoint upon startup if training was interrupted.

Define the graph using nice scopes so the graph looks good in TensorBoard.

Add summaries to visualize the learning curves in TensorBoard.

Try tweaking some hyperparameters such as the learning rate or the mini-batch size and look at the shape of the learning curve.

Solutions to these exercises are available in Appendix A.

1. TensorFlow is not limited to neural networks or even Machine Learning; you could run quantum physics simulations if you wanted.

2. Not to be confused with the TFLearn library, which is an independent project.

3. In distributed TensorFlow, variable values are stored on the servers instead of the session, as we will see in Chapter 12.

4. Note that housing.target is a 1D array, but we need to reshape it to a column vector to compute theta. Recall that NumPy's reshape() function accepts –1 (meaning "unspecified") for one of the dimensions: that dimension will be computed based on the array's length and the remaining dimensions.

5. Creating a ReLU class is arguably the cleanest option, but it is rather heavyweight.


Chapter 10. Introduction to Artificial Neural Networks

Birds inspired us to fly, burdock plants inspired velcro, and nature has inspired many other inventions. It seems only logical, then, to look at the brain's architecture for inspiration on how to build an intelligent machine. This is the key idea that inspired artificial neural networks (ANNs). However, although planes were inspired by birds, they don't have to flap their wings. Similarly, ANNs have gradually become quite different from their biological cousins. Some researchers even argue that we should drop the biological analogy altogether (e.g., by saying "units" rather than "neurons"), lest we restrict our creativity to biologically plausible systems.1

ANNs are at the very core of Deep Learning. They are versatile, powerful, and scalable, making them ideal to tackle large and highly complex Machine Learning tasks, such as classifying billions of images (e.g., Google Images), powering speech recognition services (e.g., Apple's Siri), recommending the best videos to watch to hundreds of millions of users every day (e.g., YouTube), or learning to beat the world champion at the game of Go by examining millions of past games and then playing against itself (DeepMind's AlphaGo).

In this chapter, we will introduce artificial neural networks, starting with a quick tour of the very first ANN architectures. Then we will present Multi-Layer Perceptrons (MLPs) and implement one using TensorFlow to tackle the MNIST digit classification problem (introduced in Chapter 3).

From Biological to Artificial Neurons

Surprisingly, ANNs have been around for quite a while: they were first introduced back in 1943 by the neurophysiologist Warren McCulloch and the mathematician Walter Pitts. In their landmark paper,2 "A Logical Calculus of Ideas Immanent in Nervous Activity," McCulloch and Pitts presented a simplified computational model of how biological neurons might work together in animal brains to perform complex computations using propositional logic. This was the first artificial neural network architecture. Since then many other architectures have been invented, as we will see.

The early successes of ANNs until the 1960s led to the widespread belief that we would soon be conversing with truly intelligent machines. When it became clear that this promise would go unfulfilled (at least for quite a while), funding flew elsewhere and ANNs entered a long dark era. In the early 1980s there was a revival of interest in ANNs as new network architectures were invented and better training techniques were developed. But by the 1990s, powerful alternative Machine Learning techniques such as Support Vector Machines (see Chapter 5) were favored by most researchers, as they seemed to offer better results and stronger theoretical foundations. Finally, we are now witnessing yet another wave of interest in ANNs. Will this wave die out like the previous ones did? There are a few good reasons to believe that this one is different and will have a much more profound impact on our lives:

There is now a huge quantity of data available to train neural networks, and ANNs frequently outperform other ML techniques on very large and complex problems.

The tremendous increase in computing power since the 1990s now makes it possible to train large neural networks in a reasonable amount of time. This is in part due to Moore's Law, but also thanks to the gaming industry, which has produced powerful GPU cards by the millions.

The training algorithms have been improved. To be fair they are only slightly different from the ones used in the 1990s, but these relatively small tweaks have a huge positive impact.

Some theoretical limitations of ANNs have turned out to be benign in practice. For example, many people thought that ANN training algorithms were doomed because they were likely to get stuck in local optima, but it turns out that this is rather rare in practice (or when it is the case, they are usually fairly close to the global optimum).

ANNs seem to have entered a virtuous circle of funding and progress. Amazing products based on ANNs regularly make the headline news, which pulls more and more attention and funding toward them, resulting in more and more progress, and even more amazing products.

Biological Neurons

Before we discuss artificial neurons, let's take a quick look at a biological neuron (represented in Figure 10-1). It is an unusual-looking cell mostly found in animal cerebral cortexes (e.g., your brain), composed of a cell body containing the nucleus and most of the cell's complex components, and many branching extensions called dendrites, plus one very long extension called the axon. The axon's length may be just a few times longer than the cell body, or up to tens of thousands of times longer. Near its extremity the axon splits off into many branches called telodendria, and at the tip of these branches are minuscule structures called synaptic terminals (or simply synapses), which are connected to the dendrites (or directly to the cell body) of other neurons. Biological neurons receive short electrical impulses called signals from other neurons via these synapses. When a neuron receives a sufficient number of signals from other neurons within a few milliseconds, it fires its own signals.

Figure 10-1. Biological neuron3

Thus, individual biological neurons seem to behave in a rather simple way, but they are organized in a vast network of billions of neurons, each neuron typically connected to thousands of other neurons. Highly complex computations can be performed by a vast network of fairly simple neurons, much like a complex anthill can emerge from the combined efforts of simple ants. The architecture of biological neural networks (BNN)4 is still the subject of active research, but some parts of the brain have been mapped, and it seems that neurons are often organized in consecutive layers, as shown in Figure 10-2.

Figure 10-2. Multiple layers in a biological neural network (human cortex)5

Logical Computations with Neurons

Warren McCulloch and Walter Pitts proposed a very simple model of the biological neuron, which later became known as an artificial neuron: it has one or more binary (on/off) inputs and one binary output. The artificial neuron simply activates its output when more than a certain number of its inputs are active. McCulloch and Pitts showed that even with such a simplified model it is possible to build a network of artificial neurons that computes any logical proposition you want. For example, let's build a few ANNs that perform various logical computations (see Figure 10-3), assuming that a neuron is activated when at least two of its inputs are active.

Figure 10-3. ANNs performing simple logical computations

The first network on the left is simply the identity function: if neuron A is activated, then neuron C gets activated as well (since it receives two input signals from neuron A), but if neuron A is off, then neuron C is off as well.

The second network performs a logical AND: neuron C is activated only when both neurons A and B are activated (a single input signal is not enough to activate neuron C).

The third network performs a logical OR: neuron C gets activated if either neuron A or neuron B is activated (or both).

Finally, if we suppose that an input connection can inhibit the neuron's activity (which is the case with biological neurons), then the fourth network computes a slightly more complex logical proposition: neuron C is activated only if neuron A is active and if neuron B is off. If neuron A is active all the time, then you get a logical NOT: neuron C is active when neuron B is off, and vice versa.

You can easily imagine how these networks can be combined to compute complex logical expressions (see the exercises at the end of the chapter), as the small sketch below also illustrates.
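
To make the "activated when at least two of its inputs are active" rule concrete, here is a small NumPy sketch (not from the book) of the identity, AND, and OR networks of Figure 10-3; each connection simply forwards a signal, and a neuron fires when it receives at least two active signals:

import numpy as np

def fires(signals):
    # McCulloch-Pitts-style neuron: active if it receives >= 2 active signals
    return int(np.sum(signals) >= 2)

def identity_net(A):   # C receives two connections from A
    return fires([A, A])

def and_net(A, B):     # C receives one connection from A and one from B
    return fires([A, B])

def or_net(A, B):      # C receives two connections from A and two from B
    return fires([A, A, B, B])

for A in (0, 1):
    for B in (0, 1):
        print(A, B, "->", "AND:", and_net(A, B), "OR:", or_net(A, B))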

The Perceptron

The Perceptron is one of the simplest ANN architectures, invented in 1957 by Frank Rosenblatt. It is based on a slightly different artificial neuron (see Figure 10-4) called a linear threshold unit (LTU): the inputs and output are now numbers (instead of binary on/off values) and each input connection is associated with a weight. The LTU computes a weighted sum of its inputs (z = w1 x1 + w2 x2 + ⋯ + wn xn = wT · x), then applies a step function to that sum and outputs the result: hw(x) = step(z) = step(wT · x).

Figure 10-4. Linear threshold unit

The most common step function used in Perceptrons is the Heaviside step function (see Equation 10-1). Sometimes the sign function is used instead.

Equation 10-1. Common step functions used in Perceptrons

heaviside(z) = 0 if z < 0, 1 if z ≥ 0
sgn(z) = –1 if z < 0, 0 if z = 0, +1 if z > 0

A single LTU can be used for simple linear binary classification. It computes a linear combination of the inputs and if the result exceeds a threshold, it outputs the positive class or else outputs the negative class (just like a Logistic Regression classifier or a linear SVM). For example, you could use a single LTU to classify iris flowers based on the petal length and width (also adding an extra bias feature x0 = 1, just like we did in previous chapters). Training an LTU means finding the right values for w0, w1, and w2 (the training algorithm is discussed shortly).

A Perceptron is simply composed of a single layer of LTUs,6 with each neuron connected to all the inputs. These connections are often represented using special passthrough neurons called input neurons: they just output whatever input they are fed. Moreover, an extra bias feature is generally added (x0 = 1). This bias feature is typically represented using a special type of neuron called a bias neuron, which just outputs 1 all the time.

A Perceptron with two inputs and three outputs is represented in Figure 10-5. This Perceptron can classify instances simultaneously into three different binary classes, which makes it a multioutput classifier.

Figure 10-5. Perceptron diagram

So how is a Perceptron trained? The Perceptron training algorithm proposed by Frank Rosenblatt was largely inspired by Hebb's rule. In his book The Organization of Behavior, published in 1949, Donald Hebb suggested that when a biological neuron often triggers another neuron, the connection between these two neurons grows stronger. This idea was later summarized by Siegrid Löwel in this catchy phrase: "Cells that fire together, wire together." This rule later became known as Hebb's rule (or Hebbian learning); that is, the connection weight between two neurons is increased whenever they have the same output. Perceptrons are trained using a variant of this rule that takes into account the error made by the network; it does not reinforce connections that lead to the wrong output. More specifically, the Perceptron is fed one training instance at a time, and for each instance it makes its predictions. For every output neuron that produced a wrong prediction, it reinforces the connection weights from the inputs that would have contributed to the correct prediction. The rule is shown in Equation 10-2.

Equation 10-2. Perceptron learning rule (weight update)

wi,j(next step) = wi,j + η (yj – ŷj) xi

wi,j is the connection weight between the ith input neuron and the jth output neuron.

xi is the ith input value of the current training instance.

ŷj is the output of the jth output neuron for the current training instance.

yj is the target output of the jth output neuron for the current training instance.

η is the learning rate.

The decision boundary of each output neuron is linear, so Perceptrons are incapable of learning complex patterns (just like Logistic Regression classifiers). However, if the training instances are linearly separable, Rosenblatt demonstrated that this algorithm would converge to a solution.7 This is called the Perceptron convergence theorem.
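
As an illustrative sketch (not the book's code) of Equation 10-2 for a single LTU with a bias input x0 = 1, one pass over a training set could look like this:

import numpy as np

def perceptron_epoch(X, y, w, eta=0.1):
    # X: (m, n) inputs, y: (m,) targets in {0, 1}, w: (n + 1,) weights incl. bias
    for x_i, target in zip(X, y):
        x_b = np.r_[1.0, x_i]                  # prepend the bias feature x0 = 1
        y_hat = 1 if x_b.dot(w) >= 0 else 0    # step(w^T . x)
        w = w + eta * (target - y_hat) * x_b   # Equation 10-2, applied per instance
    return w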

Scikit-Learn provides a Perceptron class that implements a single LTU network. It can be used pretty much as you would expect, for example, on the iris dataset (introduced in Chapter 4):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron

iris = load_iris()
X = iris.data[:, (2, 3)]               # petal length, petal width
y = (iris.target == 0).astype(np.int)  # Iris Setosa?

per_clf = Perceptron(random_state=42)
per_clf.fit(X, y)

y_pred = per_clf.predict([[2, 0.5]])

You may have recognized that the Perceptron learning algorithm strongly resembles Stochastic Gradient Descent. In fact, Scikit-Learn's Perceptron class is equivalent to using an SGDClassifier with the following hyperparameters: loss="perceptron", learning_rate="constant", eta0=1 (the learning rate), and penalty=None (no regularization).
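
In code, that equivalence looks like this (a sketch using the hyperparameters just listed and the same X and y as above; random_state is only added here for reproducibility):

from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(loss="perceptron", learning_rate="constant",
                        eta0=1, penalty=None, random_state=42)
sgd_clf.fit(X, y)              # trains the same kind of single-LTU model
sgd_clf.predict([[2, 0.5]])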

Note that contrary to Logistic Regression classifiers, Perceptrons do not output a class probability; rather, they just make predictions based on a hard threshold. This is one of the good reasons to prefer Logistic Regression over Perceptrons.

In their 1969 monograph titled Perceptrons, Marvin Minsky and Seymour Papert highlighted a number of serious weaknesses of Perceptrons, in particular the fact that they are incapable of solving some trivial problems (e.g., the Exclusive OR (XOR) classification problem; see the left side of Figure 10-6). Of course this is true of any other linear classification model as well (such as Logistic Regression classifiers), but researchers had expected much more from Perceptrons, and their disappointment was great: as a result, many researchers dropped connectionism altogether (i.e., the study of neural networks) in favor of higher-level problems such as logic, problem solving, and search.

However, it turns out that some of the limitations of Perceptrons can be eliminated by stacking multiple Perceptrons. The resulting ANN is called a Multi-Layer Perceptron (MLP). In particular, an MLP can solve the XOR problem, as you can verify by computing the output of the MLP represented on the right of Figure 10-6, for each combination of inputs: with inputs (0, 0) or (1, 1) the network outputs 0, and with inputs (0, 1) or (1, 0) it outputs 1.

Figure 10-6. XOR classification problem and an MLP that solves it

Multi-Layer Perceptron and Backpropagation

An MLP is composed of one (passthrough) input layer, one or more layers of LTUs, called hidden layers, and one final layer of LTUs called the output layer (see Figure 10-7). Every layer except the output layer includes a bias neuron and is fully connected to the next layer. When an ANN has two or more hidden layers, it is called a deep neural network (DNN).

Figure 10-7. Multi-Layer Perceptron

For many years researchers struggled to find a way to train MLPs, without success. But in 1986, D. E. Rumelhart et al. published a groundbreaking article8 introducing the backpropagation training algorithm.9 Today we would describe it as Gradient Descent using reverse-mode autodiff (Gradient Descent was introduced in Chapter 4, and autodiff was discussed in Chapter 9).

For each training instance, the algorithm feeds it to the network and computes the output of every neuron in each consecutive layer (this is the forward pass, just like when making predictions). Then it measures the network's output error (i.e., the difference between the desired output and the actual output of the network), and it computes how much each neuron in the last hidden layer contributed to each output neuron's error. It then proceeds to measure how much of these error contributions came from each neuron in the previous hidden layer, and so on until the algorithm reaches the input layer. This reverse pass efficiently measures the error gradient across all the connection weights in the network by propagating the error gradient backward in the network (hence the name of the algorithm). If you check out the reverse-mode autodiff algorithm in Appendix D, you will find that the forward and reverse passes of backpropagation simply perform reverse-mode autodiff. The last step of the backpropagation algorithm is a Gradient Descent step on all the connection weights in the network, using the error gradients measured earlier.

Let's make this even shorter: for each training instance the backpropagation algorithm first makes a prediction (forward pass), measures the error, then goes through each layer in reverse to measure the error contribution from each connection (reverse pass), and finally slightly tweaks the connection weights to reduce the error (Gradient Descent step).

In order for this algorithm to work properly, the authors made a key change to the MLP's architecture: they replaced the step function with the logistic function, σ(z) = 1 / (1 + exp(–z)). This was essential because the step function contains only flat segments, so there is no gradient to work with (Gradient Descent cannot move on a flat surface), while the logistic function has a well-defined nonzero derivative everywhere, allowing Gradient Descent to make some progress at every step. The backpropagation algorithm may be used with other activation functions, instead of the logistic function. Two other popular activation functions are:

The hyperbolic tangent function tanh(z) = 2σ(2z) – 1
Just like the logistic function it is S-shaped, continuous, and differentiable, but its output value ranges from –1 to 1 (instead of 0 to 1 in the case of the logistic function), which tends to make each layer's output more or less normalized (i.e., centered around 0) at the beginning of training. This often helps speed up convergence.

The ReLU function (introduced in Chapter 9)
ReLU(z) = max(0, z). It is continuous but unfortunately not differentiable at z = 0 (the slope changes abruptly, which can make Gradient Descent bounce around). However, in practice it works very well and has the advantage of being fast to compute. Most importantly, the fact that it does not have a maximum output value also helps reduce some issues during Gradient Descent (we will come back to this in Chapter 11).
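
For reference, here is a small NumPy sketch (not from the book) of the activation functions just discussed, which you could use to reproduce plots like Figure 10-8:

import numpy as np

def step(z):
    return np.where(z < 0, 0, 1)      # Heaviside step function

def logistic(z):
    return 1 / (1 + np.exp(-z))       # sigma(z)

def tanh(z):
    return 2 * logistic(2 * z) - 1    # identical to np.tanh(z)

def relu(z):
    return np.maximum(0, z)           # max(0, z)

z = np.linspace(-5, 5, 200)           # e.g., feed these to matplotlib to plot each curve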

These popular activation functions and their derivatives are represented in Figure 10-8.

Figure 10-8. Activation functions and their derivatives

An MLP is often used for classification, with each output corresponding to a different binary class (e.g., spam/ham, urgent/not-urgent, and so on). When the classes are exclusive (e.g., classes 0 through 9 for digit image classification), the output layer is typically modified by replacing the individual activation functions by a shared softmax function (see Figure 10-9). The softmax function was introduced in Chapter 3. The output of each neuron corresponds to the estimated probability of the corresponding class. Note that the signal flows only in one direction (from the inputs to the outputs), so this architecture is an example of a feedforward neural network (FNN).

Figure 10-9. A modern MLP (including ReLU and softmax) for classification

NOTE: Biological neurons seem to implement a roughly sigmoid (S-shaped) activation function, so researchers stuck to sigmoid functions for a very long time. But it turns out that the ReLU activation function generally works better in ANNs. This is one of the cases where the biological analogy was misleading.

Training an MLP with TensorFlow's High-Level API

The simplest way to train an MLP with TensorFlow is to use the high-level API TF.Learn, which offers a Scikit-Learn–compatible API. The DNNClassifier class makes it fairly easy to train a deep neural network with any number of hidden layers, and a softmax output layer to output estimated class probabilities. For example, the following code trains a DNN for classification with two hidden layers (one with 300 neurons, and the other with 100 neurons) and a softmax output layer with 10 neurons:

import tensorflow as tf

feature_cols = tf.contrib.learn.infer_real_valued_columns_from_input(X_train)
dnn_clf = tf.contrib.learn.DNNClassifier(hidden_units=[300, 100], n_classes=10,
                                         feature_columns=feature_cols)
dnn_clf = tf.contrib.learn.SKCompat(dnn_clf)  # if TensorFlow >= 1.1
dnn_clf.fit(X_train, y_train, batch_size=50, steps=40000)

The code first creates a set of real valued columns from the training set (other types of columns, such as categorical columns, are available). Then we create the DNNClassifier, and we wrap it in a Scikit-Learn compatibility helper. Finally, we run 40,000 training iterations using batches of 50 instances.

If you run this code on the MNIST dataset (after scaling it, e.g., by using Scikit-Learn's StandardScaler), you will actually get a model that achieves around 98.2% accuracy on the test set! That's better than the best model we trained in Chapter 3:

>>> from sklearn.metrics import accuracy_score
>>> y_pred = dnn_clf.predict(X_test)
>>> accuracy_score(y_test, y_pred['classes'])
0.98250000000000004

WARNING: The tensorflow.contrib package contains many useful functions, but it is a place for experimental code that has not yet graduated to be part of the core TensorFlow API. So the DNNClassifier class (and any other contrib code) may change without notice in the future.

Under the hood, the DNNClassifier class creates all the neuron layers, based on the ReLU activation function (we can change this by setting the activation_fn hyperparameter). The output layer relies on the softmax function, and the cost function is cross entropy (introduced in Chapter 4).

Training a DNN Using Plain TensorFlow

If you want more control over the architecture of the network, you may prefer to use TensorFlow's lower-level Python API (introduced in Chapter 9). In this section we will build the same model as before using this API, and we will implement Mini-batch Gradient Descent to train it on the MNIST dataset. The first step is the construction phase, building the TensorFlow graph. The second step is the execution phase, where you actually run the graph to train the model.

Construction Phase

Let's start. First we need to import the tensorflow library. Then we must specify the number of inputs and outputs, and set the number of hidden neurons in each layer:

import tensorflow as tf
import numpy as np  # used below for np.sqrt()

n_inputs = 28*28  # MNIST
n_hidden1 = 300
n_hidden2 = 100
n_outputs = 10

Next, just like you did in Chapter 9, you can use placeholder nodes to represent the training data and targets. The shape of X is only partially defined. We know that it will be a 2D tensor (i.e., a matrix), with instances along the first dimension and features along the second dimension, and we know that the number of features is going to be 28 x 28 (one feature per pixel), but we don't know yet how many instances each training batch will contain. So the shape of X is (None, n_inputs). Similarly, we know that y will be a 1D tensor with one entry per instance, but again we don't know the size of the training batch at this point, so the shape is (None).

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int64, shape=(None), name="y")

Now let's create the actual neural network. The placeholder X will act as the input layer; during the execution phase, it will be replaced with one training batch at a time (note that all the instances in a training batch will be processed simultaneously by the neural network). Now you need to create the two hidden layers and the output layer. The two hidden layers are almost identical: they differ only by the inputs they are connected to and by the number of neurons they contain. The output layer is also very similar, but it uses a softmax activation function instead of a ReLU activation function. So let's create a neuron_layer() function that we will use to create one layer at a time. It will need parameters to specify the inputs, the number of neurons, the activation function, and the name of the layer:

def neuron_layer(X, n_neurons, name, activation=None):
    with tf.name_scope(name):
        n_inputs = int(X.get_shape()[1])
        stddev = 2 / np.sqrt(n_inputs)
        init = tf.truncated_normal((n_inputs, n_neurons), stddev=stddev)
        W = tf.Variable(init, name="kernel")
        b = tf.Variable(tf.zeros([n_neurons]), name="bias")
        Z = tf.matmul(X, W) + b
        if activation is not None:
            return activation(Z)
        else:
            return Z

Let's go through this code line by line:

1. First we create a name scope using the name of the layer: it will contain all the computation nodes for this neuron layer. This is optional, but the graph will look much nicer in TensorBoard if its nodes are well organized.

2. Next, we get the number of inputs by looking up the input matrix's shape and getting the size of the second dimension (the first dimension is for instances).

3. The next three lines create a W variable that will hold the weights matrix (often called the layer's kernel). It will be a 2D tensor containing all the connection weights between each input and each neuron; hence, its shape will be (n_inputs, n_neurons). It will be initialized randomly, using a truncated normal10 (Gaussian) distribution with a standard deviation of 2/√n_inputs (matching the stddev in the code above). Using this specific standard deviation helps the algorithm converge much faster (we will discuss this further in Chapter 11; it is one of those small tweaks to neural networks that have had a tremendous impact on their efficiency). It is important to initialize connection weights randomly for all hidden layers to avoid any symmetries that the Gradient Descent algorithm would be unable to break.11

4. The next line creates a b variable for biases, initialized to 0 (no symmetry issue in this case), with one bias parameter per neuron.

5. Then we create a subgraph to compute Z = X · W + b. This vectorized implementation will efficiently compute the weighted sums of the inputs plus the bias term for each and every neuron in the layer, for all the instances in the batch in just one shot.

6. Finally, if an activation parameter is provided, such as tf.nn.relu (i.e., max(0, Z)), then the code returns activation(Z), or else it just returns Z.

Okay, so now you have a nice function to create a neuron layer. Let's use it to create the deep neural network! The first hidden layer takes X as its input. The second takes the output of the first hidden layer as its input. And finally, the output layer takes the output of the second hidden layer as its input.

with tf.name_scope("dnn"):
    hidden1 = neuron_layer(X, n_hidden1, name="hidden1",
                           activation=tf.nn.relu)
    hidden2 = neuron_layer(hidden1, n_hidden2, name="hidden2",
                           activation=tf.nn.relu)
    logits = neuron_layer(hidden2, n_outputs, name="outputs")

Notice that once again we used a name scope for clarity. Also note that logits is the output of the neural network before going through the softmax activation function: for optimization reasons, we will handle the softmax computation later.

As you might expect, TensorFlow comes with many handy functions to create standard neural network layers, so there's often no need to define your own neuron_layer() function like we just did. For example, TensorFlow's tf.layers.dense() function (previously called tf.contrib.layers.fully_connected()) creates a fully connected layer, where all the inputs are connected to all the neurons in the layer. It takes care of creating the weights and biases variables, named kernel and bias respectively, using the appropriate initialization strategy, and you can set the activation function using the activation argument. As we will see in Chapter 11, it also supports regularization parameters. Let's tweak the preceding code to use the dense() function instead of our neuron_layer() function. Simply replace the dnn construction section with the following code:

with tf.name_scope("dnn"):
    hidden1 = tf.layers.dense(X, n_hidden1, name="hidden1",
                              activation=tf.nn.relu)
    hidden2 = tf.layers.dense(hidden1, n_hidden2, name="hidden2",
                              activation=tf.nn.relu)
    logits = tf.layers.dense(hidden2, n_outputs, name="outputs")

Now that we have the neural network model ready to go, we need to define the cost function that we will use to train it. Just as we did for Softmax Regression in Chapter 4, we will use cross entropy. As we discussed earlier, cross entropy will penalize models that estimate a low probability for the target class. TensorFlow provides several functions to compute cross entropy. We will use sparse_softmax_cross_entropy_with_logits(): it computes the cross entropy based on the "logits" (i.e., the output of the network before going through the softmax activation function), and it expects labels in the form of integers ranging from 0 to the number of classes minus 1 (in our case, from 0 to 9). This will give us a 1D tensor containing the cross entropy for each instance. We can then use TensorFlow's reduce_mean() function to compute the mean cross entropy over all instances.

with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y,
                                                              logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")

NOTE: The sparse_softmax_cross_entropy_with_logits() function is equivalent to applying the softmax activation function and then computing the cross entropy, but it is more efficient, and it properly takes care of corner cases like logits equal to 0. This is why we did not apply the softmax activation function earlier. There is also another function called softmax_cross_entropy_with_logits(), which takes labels in the form of one-hot vectors (instead of ints from 0 to the number of classes minus 1).
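
If you wanted to use the one-hot variant instead, a hedged sketch of the loss definition (replacing the one above; tf.one_hot converts class indices to one-hot vectors) would look something like this:

with tf.name_scope("loss"):
    y_one_hot = tf.one_hot(y, depth=n_outputs)  # class indices -> one-hot vectors
    xentropy = tf.nn.softmax_cross_entropy_with_logits(labels=y_one_hot,
                                                       logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")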

We have the neural network model, we have the cost function, and now we need to define a GradientDescentOptimizer that will tweak the model parameters to minimize the cost function. Nothing new; it's just like we did in Chapter 9:

learning_rate = 0.01

with tf.name_scope("train"):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    training_op = optimizer.minimize(loss)

The last important step in the construction phase is to specify how to evaluate the model. We will simply use accuracy as our performance measure. First, for each instance, determine if the neural network's prediction is correct by checking whether or not the highest logit corresponds to the target class. For this you can use the in_top_k() function. This returns a 1D tensor full of boolean values, so we need to cast these booleans to floats and then compute the average. This will give us the network's overall accuracy.

with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

And, as usual, we need to create a node to initialize all variables, and we will also create a Saver to save our trained model parameters to disk:

init = tf.global_variables_initializer()
saver = tf.train.Saver()

Phew! This concludes the construction phase. This was fewer than 40 lines of code, but it was pretty intense: we created placeholders for the inputs and the targets, we created a function to build a neuron layer, we used it to create the DNN, we defined the cost function, we created an optimizer, and finally we defined the performance measure. Now on to the execution phase.

Execution Phase

This part is much shorter and simpler. First, let's load MNIST. We could use Scikit-Learn for that as we did in previous chapters, but TensorFlow offers its own helper that fetches the data, scales it (between 0 and 1), shuffles it, and provides a simple function to load one mini-batch at a time. So let's use it instead:

from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets("/tmp/data/")

Now we define the number of epochs that we want to run, as well as the size of the mini-batches:

n_epochs = 40
batch_size = 50

And now we can train the model:

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(mnist.train.num_examples // batch_size):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        acc_train = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
        acc_test = accuracy.eval(feed_dict={X: mnist.test.images,
                                            y: mnist.test.labels})
        print(epoch, "Train accuracy:", acc_train, "Test accuracy:", acc_test)

    save_path = saver.save(sess, "./my_model_final.ckpt")

This code opens a TensorFlow session, and it runs the init node that initializes all the variables. Then it runs the main training loop: at each epoch, the code iterates through a number of mini-batches that corresponds to the training set size. Each mini-batch is fetched via the next_batch() method, and then the code simply runs the training operation, feeding it the current mini-batch input data and targets. Next, at the end of each epoch, the code evaluates the model on the last mini-batch and on the full test set, and it prints out the result. Finally, the model parameters are saved to disk.

Using the Neural Network

Now that the neural network is trained, you can use it to make predictions. To do that, you can reuse the same construction phase, but change the execution phase like this:

with tf.Session() as sess:
    saver.restore(sess, "./my_model_final.ckpt")
    X_new_scaled = [...]  # some new images (scaled from 0 to 1)
    Z = logits.eval(feed_dict={X: X_new_scaled})
    y_pred = np.argmax(Z, axis=1)

First the code loads the model parameters from disk. Then it loads some new images that you want to classify. Remember to apply the same feature scaling as for the training data (in this case, scale it from 0 to 1). Then the code evaluates the logits node. If you wanted to know all the estimated class probabilities, you would need to apply the softmax() function to the logits, but if you just want to predict a class, you can simply pick the class that has the highest logit value (using the argmax() function does the trick).
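
If you do want the estimated class probabilities, a minimal sketch (using tf.nn.softmax, run inside the same session as above; ideally the extra node would be defined once in the construction phase) would be:

y_proba = tf.nn.softmax(logits)                        # softmax node on top of the logits
proba_val = y_proba.eval(feed_dict={X: X_new_scaled})  # one row of class probabilities per image
y_pred = np.argmax(proba_val, axis=1)                  # same predictions as using the raw logits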

Fine-Tuning Neural Network Hyperparameters

The flexibility of neural networks is also one of their main drawbacks: there are many hyperparameters to tweak. Not only can you use any imaginable network topology (how neurons are interconnected), but even in a simple MLP you can change the number of layers, the number of neurons per layer, the type of activation function to use in each layer, the weight initialization logic, and much more. How do you know what combination of hyperparameters is the best for your task?

Of course, you can use grid search with cross-validation to find the right hyperparameters, like you did in previous chapters, but since there are many hyperparameters to tune, and since training a neural network on a large dataset takes a lot of time, you will only be able to explore a tiny part of the hyperparameter space in a reasonable amount of time. It is much better to use randomized search, as we discussed in Chapter 2. Another option is to use a tool such as Oscar, which implements more complex algorithms to help you find a good set of hyperparameters quickly.

It helps to have an idea of what values are reasonable for each hyperparameter, so you can restrict the search space. Let's start with the number of hidden layers.

Number of Hidden Layers

For many problems, you can just begin with a single hidden layer and you will get reasonable results. It has actually been shown that an MLP with just one hidden layer can model even the most complex functions provided it has enough neurons. For a long time, these facts convinced researchers that there was no need to investigate any deeper neural networks. But they overlooked the fact that deep networks have a much higher parameter efficiency than shallow ones: they can model complex functions using exponentially fewer neurons than shallow nets, making them much faster to train.

To understand why, suppose you are asked to draw a forest using some drawing software, but you are forbidden to use copy/paste. You would have to draw each tree individually, branch per branch, leaf per leaf. If you could instead draw one leaf, copy/paste it to draw a branch, then copy/paste that branch to create a tree, and finally copy/paste this tree to make a forest, you would be finished in no time. Real-world data is often structured in such a hierarchical way and DNNs automatically take advantage of this fact: lower hidden layers model low-level structures (e.g., line segments of various shapes and orientations), intermediate hidden layers combine these low-level structures to model intermediate-level structures (e.g., squares, circles), and the highest hidden layers and the output layer combine these intermediate structures to model high-level structures (e.g., faces).

Not only does this hierarchical architecture help DNNs converge faster to a good solution, it also improves their ability to generalize to new datasets. For example, if you have already trained a model to recognize faces in pictures, and you now want to train a new neural network to recognize hairstyles, then you can kickstart training by reusing the lower layers of the first network. Instead of randomly initializing the weights and biases of the first few layers of the new neural network, you can initialize them to the value of the weights and biases of the lower layers of the first network. This way the network will not have to learn from scratch all the low-level structures that occur in most pictures; it will only have to learn the higher-level structures (e.g., hairstyles).

In summary, for many problems you can start with just one or two hidden layers and it will work just fine (e.g., you can easily reach above 97% accuracy on the MNIST dataset using just one hidden layer with a few hundred neurons, and above 98% accuracy using two hidden layers with the same total amount of neurons, in roughly the same amount of training time). For more complex problems, you can gradually ramp up the number of hidden layers, until you start overfitting the training set. Very complex tasks, such as large image classification or speech recognition, typically require networks with dozens of layers (or even hundreds, but not fully connected ones, as we will see in Chapter 13), and they need a huge amount of training data. However, you will rarely have to train such networks from scratch: it is much more common to reuse parts of a pretrained state-of-the-art network that performs a similar task. Training will be a lot faster and require much less data (we will discuss this in Chapter 11).

Number of Neurons per Hidden Layer

Obviously the number of neurons in the input and output layers is determined by the type of input and output your task requires. For example, the MNIST task requires 28 x 28 = 784 input neurons and 10 output neurons. As for the hidden layers, a common practice is to size them to form a funnel, with fewer and fewer neurons at each layer, the rationale being that many low-level features can coalesce into far fewer high-level features. For example, a typical neural network for MNIST may have two hidden layers, the first with 300 neurons and the second with 100. However, this practice is not as common now, and you may simply use the same size for all hidden layers (for example, all hidden layers with 150 neurons): that's just one hyperparameter to tune instead of one per layer. Just like for the number of layers, you can try increasing the number of neurons gradually until the network starts overfitting. In general you will get more bang for the buck by increasing the number of layers than the number of neurons per layer. Unfortunately, as you can see, finding the perfect amount of neurons is still somewhat of a black art.

A simpler approach is to pick a model with more layers and neurons than you actually need, then use early stopping to prevent it from overfitting (and other regularization techniques, especially dropout, as we will see in Chapter 11). This has been dubbed the "stretch pants" approach:12 instead of wasting time looking for pants that perfectly match your size, just use large stretch pants that will shrink down to the right size.

Activation Functions

In most cases you can use the ReLU activation function in the hidden layers (or one of its variants, as we will see in Chapter 11). It is a bit faster to compute than other activation functions, and Gradient Descent does not get stuck as much on plateaus, thanks to the fact that it does not saturate for large input values (as opposed to the logistic function or the hyperbolic tangent function, which saturate at 1).

For the output layer, the softmax activation function is generally a good choice for classification tasks (when the classes are mutually exclusive). For regression tasks, you can simply use no activation function at all.

This concludes this introduction to artificial neural networks. In the following chapters, we will discuss techniques to train very deep nets, and distribute training across multiple servers and GPUs. Then we will explore a few other popular neural network architectures: convolutional neural networks, recurrent neural networks, and autoencoders.13

Exercises

1. Draw an ANN using the original artificial neurons (like the ones in Figure 10-3) that computes A ⊕ B (where ⊕ represents the XOR operation). Hint: A ⊕ B = (A ∧ ¬B) ∨ (¬A ∧ B).

2. Why is it generally preferable to use a Logistic Regression classifier rather than a classical Perceptron (i.e., a single layer of linear threshold units trained using the Perceptron training algorithm)? How can you tweak a Perceptron to make it equivalent to a Logistic Regression classifier?

3. Why was the logistic activation function a key ingredient in training the first MLPs?

4. Name three popular activation functions. Can you draw them?

5. Suppose you have an MLP composed of one input layer with 10 passthrough neurons, followed by one hidden layer with 50 artificial neurons, and finally one output layer with 3 artificial neurons. All artificial neurons use the ReLU activation function.

What is the shape of the input matrix X?

What about the shape of the hidden layer's weight vector Wh, and the shape of its bias vector bh?

What is the shape of the output layer's weight vector Wo, and its bias vector bo?

What is the shape of the network's output matrix Y?

Write the equation that computes the network's output matrix Y as a function of X, Wh, bh, Wo and bo.

6. How many neurons do you need in the output layer if you want to classify email into spam or ham? What activation function should you use in the output layer? If instead you want to tackle MNIST, how many neurons do you need in the output layer, using what activation function? Answer the same questions for getting your network to predict housing prices as in Chapter 2.

7. What is backpropagation and how does it work? What is the difference between backpropagation and reverse-mode autodiff?

8. Can you list all the hyperparameters you can tweak in an MLP? If the MLP overfits the training data, how could you tweak these hyperparameters to try to solve the problem?

9. Train a deep MLP on the MNIST dataset and see if you can get over 98% precision. Just like in the last exercise of Chapter 9, try adding all the bells and whistles (i.e., save checkpoints, restore the last checkpoint in case of an interruption, add summaries, plot learning curves using TensorBoard, and so on).

Solutions to these exercises are available in Appendix A.

1. You can get the best of both worlds by being open to biological inspirations without being afraid to create biologically unrealistic models, as long as they work well.

2. "A Logical Calculus of Ideas Immanent in Nervous Activity," W. McCulloch and W. Pitts (1943).

3. Image by Bruce Blaus (Creative Commons 3.0). Reproduced from https://en.wikipedia.org/wiki/Neuron.

4. In the context of Machine Learning, the phrase "neural networks" generally refers to ANNs, not BNNs.

5. Drawing of a cortical lamination by S. Ramon y Cajal (public domain). Reproduced from https://en.wikipedia.org/wiki/Cerebral_cortex.

6. The name Perceptron is sometimes used to mean a tiny network with a single LTU.

7. Note that this solution is generally not unique: in general when the data are linearly separable, there is an infinity of hyperplanes that can separate them.

8. "Learning Internal Representations by Error Propagation," D. Rumelhart, G. Hinton, R. Williams (1986).

9. This algorithm was actually invented several times by various researchers in different fields, starting with P. Werbos in 1974.

10. Using a truncated normal distribution rather than a regular normal distribution ensures that there won't be any large weights, which could slow down training.

11. For example, if you set all the weights to 0, then all neurons will output 0, and the error gradient will be the same for all neurons in a given hidden layer. The Gradient Descent step will then update all the weights in exactly the same way in each layer, so they will all remain equal. In other words, despite having hundreds of neurons per layer, your model will act as if there were only one neuron per layer. It is not going to fly.

12. By Vincent Vanhoucke in his Deep Learning class on Udacity.com.

13. A few extra ANN architectures are presented in Appendix E.


Chapter 11. Training Deep Neural Nets

In Chapter 10 we introduced artificial neural networks and trained our first deep neural network. But it was a very shallow DNN, with only two hidden layers. What if you need to tackle a very complex problem, such as detecting hundreds of types of objects in high-resolution images? You may need to train a much deeper DNN, perhaps with (say) 10 layers, each containing hundreds of neurons, connected by hundreds of thousands of connections. This would not be a walk in the park:

First, you would be faced with the tricky vanishing gradients problem (or the related exploding gradients problem) that affects deep neural networks and makes lower layers very hard to train.

Second, with such a large network, training would be extremely slow.

Third, a model with millions of parameters would severely risk overfitting the training set.

In this chapter, we will go through each of these problems in turn and present techniques to solve them. We will start by explaining the vanishing gradients problem and exploring some of the most popular solutions to this problem. Next we will look at various optimizers that can speed up training large models tremendously compared to plain Gradient Descent. Finally, we will go through a few popular regularization techniques for large neural networks.

With these tools, you will be able to train very deep nets: welcome to Deep Learning!

Vanishing/Exploding Gradients Problems

As we discussed in Chapter 10, the backpropagation algorithm works by going from the output layer to the input layer, propagating the error gradient on the way. Once the algorithm has computed the gradient of the cost function with regards to each parameter in the network, it uses these gradients to update each parameter with a Gradient Descent step.

Unfortunately, gradients often get smaller and smaller as the algorithm progresses down to the lower layers. As a result, the Gradient Descent update leaves the lower layer connection weights virtually unchanged, and training never converges to a good solution. This is called the vanishing gradients problem. In some cases, the opposite can happen: the gradients can grow bigger and bigger, so many layers get insanely large weight updates and the algorithm diverges. This is the exploding gradients problem, which is mostly encountered in recurrent neural networks (see Chapter 14). More generally, deep neural networks suffer from unstable gradients; different layers may learn at widely different speeds.

Although this unfortunate behavior has been empirically observed for quite a while (it was one of the reasons why deep neural networks were mostly abandoned for a long time), it is only around 2010 that significant progress was made in understanding it. A paper titled "Understanding the Difficulty of Training Deep Feedforward Neural Networks" by Xavier Glorot and Yoshua Bengio1 found a few suspects, including the combination of the popular logistic sigmoid activation function and the weight initialization technique that was most popular at the time, namely random initialization using a normal distribution with a mean of 0 and a standard deviation of 1. In short, they showed that with this activation function and this initialization scheme, the variance of the outputs of each layer is much greater than the variance of its inputs. Going forward in the network, the variance keeps increasing after each layer until the activation function saturates at the top layers. This is actually made worse by the fact that the logistic function has a mean of 0.5, not 0 (the hyperbolic tangent function has a mean of 0 and behaves slightly better than the logistic function in deep networks).

Looking at the logistic activation function (see Figure 11-1), you can see that when inputs become large (negative or positive), the function saturates at 0 or 1, with a derivative extremely close to 0. Thus when backpropagation kicks in, it has virtually no gradient to propagate back through the network, and what little gradient exists keeps getting diluted as backpropagation progresses down through the top layers, so there is really nothing left for the lower layers.

Figure 11-1. Logistic activation function saturation

Xavier and He Initialization

In their paper, Glorot and Bengio propose a way to significantly alleviate this problem. We need the signal to flow properly in both directions: in the forward direction when making predictions, and in the reverse direction when backpropagating gradients. We don't want the signal to die out, nor do we want it to explode and saturate. For the signal to flow properly, the authors argue that we need the variance of the outputs of each layer to be equal to the variance of its inputs,2 and we also need the gradients to have equal variance before and after flowing through a layer in the reverse direction (please check out the paper if you are interested in the mathematical details). It is actually not possible to guarantee both unless the layer has an equal number of input and output connections, but they proposed a good compromise that has proven to work very well in practice: the connection weights must be initialized randomly as described in Equation 11-1, where n_inputs and n_outputs are the number of input and output connections for the layer whose weights are being initialized (also called fan-in and fan-out). This initialization strategy is often called Xavier initialization (after the author's first name), or sometimes Glorot initialization.

Equation 11-1. Xavier initialization (when using the logistic activation function)

Normal distribution with mean 0 and standard deviation $\sigma = \sqrt{\dfrac{2}{n_\text{inputs} + n_\text{outputs}}}$

Or a uniform distribution between $-r$ and $+r$, with $r = \sqrt{\dfrac{6}{n_\text{inputs} + n_\text{outputs}}}$

When the number of input connections is roughly equal to the number of output connections, you get simpler equations (e.g., $\sigma = 1/\sqrt{n_\text{inputs}}$ or $r = \sqrt{3}/\sqrt{n_\text{inputs}}$). We used this simplified strategy in Chapter 10.3

Using the Xavier initialization strategy can speed up training considerably, and it is one of the tricks that led to the current success of Deep Learning. Some recent papers4 have provided similar strategies for different activation functions, as shown in Table 11-1. The initialization strategy for the ReLU activation function (and its variants, including the ELU activation described shortly) is sometimes called He initialization (after the last name of its author).

Table 11-1. Initialization parameters for each type of activation function

Activation function      | Uniform distribution [−r, r]               | Normal distribution
Logistic                 | r = √(6 / (n_inputs + n_outputs))          | σ = √(2 / (n_inputs + n_outputs))
Hyperbolic tangent       | r = 4 √(6 / (n_inputs + n_outputs))        | σ = 4 √(2 / (n_inputs + n_outputs))
ReLU (and its variants)  | r = √2 √(6 / (n_inputs + n_outputs))       | σ = √2 √(2 / (n_inputs + n_outputs))

By default, the tf.layers.dense() function (introduced in Chapter 10) uses Xavier initialization (with a uniform distribution). You can change this to He initialization by using the variance_scaling_initializer() function like this:

he_init = tf.contrib.layers.variance_scaling_initializer()
hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu,
                          kernel_initializer=he_init, name="hidden1")

NOTE
He initialization considers only the fan-in, not the average between fan-in and fan-out like in Xavier initialization. This is also the default for the variance_scaling_initializer() function, but you can change this by setting the argument mode="FAN_AVG".

Nonsaturating Activation Functions

One of the insights in the 2010 paper by Glorot and Bengio was that the vanishing/exploding gradients problems were in part due to a poor choice of activation function. Until then most people had assumed that if Mother Nature had chosen to use roughly sigmoid activation functions in biological neurons, they must be an excellent choice. But it turns out that other activation functions behave much better in deep neural networks, in particular the ReLU activation function, mostly because it does not saturate for positive values (and also because it is quite fast to compute).

Unfortunately, the ReLU activation function is not perfect. It suffers from a problem known as the dying ReLUs: during training, some neurons effectively die, meaning they stop outputting anything other than 0. In some cases, you may find that half of your network's neurons are dead, especially if you used a large learning rate. During training, if a neuron's weights get updated such that the weighted sum of the neuron's inputs is negative, it will start outputting 0. When this happens, the neuron is unlikely to come back to life since the gradient of the ReLU function is 0 when its input is negative.

To solve this problem, you may want to use a variant of the ReLU function, such as the leaky ReLU. This function is defined as LeakyReLUα(z) = max(αz, z) (see Figure 11-2). The hyperparameter α defines how much the function "leaks": it is the slope of the function for z < 0, and is typically set to 0.01. This small slope ensures that leaky ReLUs never die; they can go into a long coma, but they have a chance to eventually wake up. A recent paper5 compared several variants of the ReLU activation function and one of its conclusions was that the leaky variants always outperformed the strict ReLU activation function. In fact, setting α = 0.2 (huge leak) seemed to result in better performance than α = 0.01 (small leak). They also evaluated the randomized leaky ReLU (RReLU), where α is picked randomly in a given range during training, and it is fixed to an average value during testing. It also performed fairly well and seemed to act as a regularizer (reducing the risk of overfitting the training set). Finally, they also evaluated the parametric leaky ReLU (PReLU), where α is authorized to be learned during training (instead of being a hyperparameter, it becomes a parameter that can be modified by backpropagation like any other parameter). This was reported to strongly outperform ReLU on large image datasets, but on smaller datasets it runs the risk of overfitting the training set.

Figure 11-2. Leaky ReLU
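TensorFlow has no predefined PReLU either, but as a rough sketch you could create a trainable α yourself (this is only an illustration: it uses a single shared α per layer and an arbitrary initial value of 0.01, whereas the original paper also considers one α per feature map):

def parametric_relu(z, name=None):
    with tf.variable_scope(name, default_name="prelu"):
        # alpha starts out like a small leak, but is then learned by backpropagation
        alpha = tf.get_variable("alpha", shape=(),
                                initializer=tf.constant_initializer(0.01))
        return tf.maximum(alpha * z, z)

hidden1 = tf.layers.dense(X, n_hidden1, activation=parametric_relu, name="hidden1")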

Last but not least, a 2015 paper by Djork-Arné Clevert et al.6 proposed a new activation function called the exponential linear unit (ELU) that outperformed all the ReLU variants in their experiments: training time was reduced and the neural network performed better on the test set. It is represented in Figure 11-3, and Equation 11-2 shows its definition.

Equation 11-2. ELU activation function

$$\text{ELU}_\alpha(z) = \begin{cases} \alpha\left(\exp(z) - 1\right) & \text{if } z < 0 \\ z & \text{if } z \geq 0 \end{cases}$$

Figure 11-3. ELU activation function

It looks a lot like the ReLU function, with a few major differences:

First, it takes on negative values when z < 0, which allows the unit to have an average output closer to 0. This helps alleviate the vanishing gradients problem, as discussed earlier. The hyperparameter α defines the value that the ELU function approaches when z is a large negative number. It is usually set to 1, but you can tweak it like any other hyperparameter if you want.

Second, it has a nonzero gradient for z < 0, which avoids the dying units issue.

Third, the function is smooth everywhere, including around z = 0, which helps speed up Gradient Descent, since it does not bounce as much left and right of z = 0.

The main drawback of the ELU activation function is that it is slower to compute than the ReLU and its variants (due to the use of the exponential function), but during training this is compensated by the faster convergence rate. However, at test time an ELU network will be slower than a ReLU network.
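To make Equation 11-2 concrete, here is a tiny NumPy sketch of the ELU function (for illustration only; in TensorFlow you would use the built-in function shown below):

import numpy as np

def elu(z, alpha=1):
    # alpha * (exp(z) - 1) for z < 0, and just z for z >= 0
    return np.where(z < 0, alpha * (np.exp(z) - 1), z)

print(elu(np.array([-2.0, 0.0, 3.0])))  # approximately [-0.86, 0., 3.]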

TIP
So which activation function should you use for the hidden layers of your deep neural networks? Although your mileage will vary, in general ELU > leaky ReLU (and its variants) > ReLU > tanh > logistic. If you care a lot about runtime performance, then you may prefer leaky ReLUs over ELUs. If you don't want to tweak yet another hyperparameter, you may just use the default α values suggested earlier (0.01 for the leaky ReLU, and 1 for ELU). If you have spare time and computing power, you can use cross-validation to evaluate other activation functions, in particular RReLU if your network is overfitting, or PReLU if you have a huge training set.

TensorFlow offers an elu() function that you can use to build your neural network. Simply set the activation argument when calling the dense() function, like this:

hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.elu, name="hidden1")

TensorFlow does not have a predefined function for leaky ReLUs, but it is easy enough to define:

def leaky_relu(z, name=None):
    return tf.maximum(0.01 * z, z, name=name)

hidden1 = tf.layers.dense(X, n_hidden1, activation=leaky_relu, name="hidden1")

Batch Normalization

Although using He initialization along with ELU (or any variant of ReLU) can significantly reduce the vanishing/exploding gradients problems at the beginning of training, it doesn't guarantee that they won't come back during training.

In a 2015 paper,7 Sergey Ioffe and Christian Szegedy proposed a technique called Batch Normalization (BN) to address the vanishing/exploding gradients problems, and more generally the problem that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change (which they call the Internal Covariate Shift problem).

The technique consists of adding an operation in the model just before the activation function of each layer, simply zero-centering and normalizing the inputs, then scaling and shifting the result using two new parameters per layer (one for scaling, the other for shifting). In other words, this operation lets the model learn the optimal scale and mean of the inputs for each layer.

In order to zero-center and normalize the inputs, the algorithm needs to estimate the inputs' mean and standard deviation. It does so by evaluating the mean and standard deviation of the inputs over the current mini-batch (hence the name "Batch Normalization"). The whole operation is summarized in Equation 11-3.

Equation 11-3. Batch Normalization algorithm

1. $\mu_B = \dfrac{1}{m_B} \sum_{i=1}^{m_B} \mathbf{x}^{(i)}$

2. $\sigma_B^2 = \dfrac{1}{m_B} \sum_{i=1}^{m_B} \left(\mathbf{x}^{(i)} - \mu_B\right)^2$

3. $\hat{\mathbf{x}}^{(i)} = \dfrac{\mathbf{x}^{(i)} - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$

4. $\mathbf{z}^{(i)} = \gamma \hat{\mathbf{x}}^{(i)} + \beta$

μB is the empirical mean, evaluated over the whole mini-batch B.

σB is the empirical standard deviation, also evaluated over the whole mini-batch.

mB is the number of instances in the mini-batch.

x̂(i) is the zero-centered and normalized input.

γ is the scaling parameter for the layer.

β is the shifting parameter (offset) for the layer.

ϵ is a tiny number to avoid division by zero (typically 10⁻⁵). This is called a smoothing term.

z(i) is the output of the BN operation: it is a scaled and shifted version of the inputs.

At test time, there is no mini-batch to compute the empirical mean and standard deviation, so instead you simply use the whole training set's mean and standard deviation. These are typically efficiently computed during training using a moving average. So, in total, four parameters are learned for each batch-normalized layer: γ (scale), β (offset), μ (mean), and σ (standard deviation).

The authors demonstrated that this technique considerably improved all the deep neural networks they experimented with. The vanishing gradients problem was strongly reduced, to the point that they could use saturating activation functions such as the tanh and even the logistic activation function. The networks were also much less sensitive to the weight initialization. They were able to use much larger learning rates, significantly speeding up the learning process. Specifically, they note that "Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. […] Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters." Finally, like a gift that keeps on giving, Batch Normalization also acts like a regularizer, reducing the need for other regularization techniques (such as dropout, described later in the chapter).

Batch Normalization does, however, add some complexity to the model (although it removes the need for normalizing the input data since the first hidden layer will take care of that, provided it is batch-normalized). Moreover, there is a runtime penalty: the neural network makes slower predictions due to the extra computations required at each layer. So if you need predictions to be lightning-fast, you may want to check how well plain ELU + He initialization perform before playing with Batch Normalization.

NOTE
You may find that training is rather slow at first while Gradient Descent is searching for the optimal scales and offsets for each layer, but it accelerates once it has found reasonably good values.

Implementing Batch Normalization with TensorFlow

TensorFlow provides a tf.nn.batch_normalization() function that simply centers and normalizes the inputs, but you must compute the mean and standard deviation yourself (based on the mini-batch data during training or on the full dataset during testing, as just discussed) and pass them as parameters to this function, and you must also handle the creation of the scaling and offset parameters (and pass them to this function). It is doable, but not the most convenient approach. Instead, you should use the tf.layers.batch_normalization() function, which handles all this for you, as in the following code:

import tensorflow as tf

n_inputs = 28 * 28
n_hidden1 = 300
n_hidden2 = 100
n_outputs = 10

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
training = tf.placeholder_with_default(False, shape=(), name='training')

hidden1 = tf.layers.dense(X, n_hidden1, name="hidden1")
bn1 = tf.layers.batch_normalization(hidden1, training=training, momentum=0.9)
bn1_act = tf.nn.elu(bn1)

hidden2 = tf.layers.dense(bn1_act, n_hidden2, name="hidden2")
bn2 = tf.layers.batch_normalization(hidden2, training=training, momentum=0.9)
bn2_act = tf.nn.elu(bn2)

logits_before_bn = tf.layers.dense(bn2_act, n_outputs, name="outputs")
logits = tf.layers.batch_normalization(logits_before_bn, training=training,
                                       momentum=0.9)

Let’swalkthroughthiscode.Thefirstlinesarefairlyself-explanatory,untilwedefinethetrainingplaceholder:wewillsetittoTrueduringtraining,butotherwiseitwilldefaulttoFalse.Thiswillbeusedtotellthetf.layers.batch_normalization()functionwhetheritshouldusethecurrentmini-batch’smeanandstandarddeviation(duringtraining)orthewholetrainingset’smeanandstandarddeviation(duringtesting).

Then,wealternatefullyconnectedlayersandbatchnormalizationlayers:thefullyconnectedlayersarecreatedusingthetf.layers.dense()function,justlikewedidinChapter10.Notethatwedon’tspecifyanyactivationfunctionforthefullyconnectedlayersbecausewewanttoapplytheactivationfunctionaftereachbatchnormalizationlayer.8Wecreatethebatchnormalizationlayersusingthetf.layers.batch_normalization()function,settingitstrainingandmomentumparameters.TheBNalgorithmusesexponentialdecaytocomputetherunningaverages,whichiswhyitrequiresthemomentumparameter:givenanewvaluev,therunningaverage isupdatedthroughtheequation:

Agoodmomentumvalueistypicallycloseto1—forexample,0.9,0.99,or0.999(youwantmore9sforlargerdatasetsandsmallermini-batches).
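As a tiny illustration of this update rule (plain Python, made-up values):

momentum = 0.9
running_mean = 0.0
for batch_mean in [10.0, 12.0, 11.0, 13.0]:
    # v_hat <- v_hat * momentum + v * (1 - momentum)
    running_mean = running_mean * momentum + batch_mean * (1 - momentum)
print(running_mean)  # about 3.99: the running average warms up slowly toward ~11.5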

You may have noticed that the code is quite repetitive, with the same batch normalization parameters appearing over and over again. To avoid this repetition, you can use the partial() function from the functools module (part of Python's standard library). It creates a thin wrapper around a function and allows you to define default values for some parameters. The creation of the network layers in the preceding code can be modified like so:

from functools import partial

my_batch_norm_layer = partial(tf.layers.batch_normalization,
                              training=training, momentum=0.9)

hidden1 = tf.layers.dense(X, n_hidden1, name="hidden1")
bn1 = my_batch_norm_layer(hidden1)
bn1_act = tf.nn.elu(bn1)
hidden2 = tf.layers.dense(bn1_act, n_hidden2, name="hidden2")
bn2 = my_batch_norm_layer(hidden2)
bn2_act = tf.nn.elu(bn2)
logits_before_bn = tf.layers.dense(bn2_act, n_outputs, name="outputs")
logits = my_batch_norm_layer(logits_before_bn)

It may not look much better than before in this small example, but if you have 10 layers and want to use the same activation function, initializer, regularizer, and so on, in all layers, this trick will make your code much more readable.

The rest of the construction phase is the same as in Chapter 10: define the cost function, create an optimizer, tell it to minimize the cost function, define the evaluation operations, create a Saver, and so on.

The execution phase is also pretty much the same, with two exceptions. First, during training, whenever you run an operation that depends on the batch_normalization() layer, you need to set the training placeholder to True. Second, the batch_normalization() function creates a few operations that must be evaluated at each step during training in order to update the moving averages (recall that these moving averages are needed to evaluate the training set's mean and standard deviation). These operations are automatically added to the UPDATE_OPS collection, so all we need to do is get the list of operations in that collection and run them at each training iteration:

extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(mnist.train.num_examples // batch_size):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run([training_op, extra_update_ops],
                     feed_dict={training: True, X: X_batch, y: y_batch})
        accuracy_val = accuracy.eval(feed_dict={X: mnist.test.images,
                                                y: mnist.test.labels})
        print(epoch, "Test accuracy:", accuracy_val)

    save_path = saver.save(sess, "./my_model_final.ckpt")

That’sall!Inthistinyexamplewithjusttwolayers,it’sunlikelythatBatchNormalizationwillhaveaverypositiveimpact,butfordeepernetworksitcanmakeatremendousdifference.

Gradient Clipping

A popular technique to lessen the exploding gradients problem is to simply clip the gradients during backpropagation so that they never exceed some threshold (this is mostly useful for recurrent neural networks; see Chapter 14). This is called Gradient Clipping.9 In general people now prefer Batch Normalization, but it's still useful to know about Gradient Clipping and how to implement it.

In TensorFlow, the optimizer's minimize() function takes care of both computing the gradients and applying them, so you must instead call the optimizer's compute_gradients() method first, then create an operation to clip the gradients using the clip_by_value() function, and finally create an operation to apply the clipped gradients using the optimizer's apply_gradients() method:

threshold = 1.0

optimizer = tf.train.GradientDescentOptimizer(learning_rate)
grads_and_vars = optimizer.compute_gradients(loss)
capped_gvs = [(tf.clip_by_value(grad, -threshold, threshold), var)
              for grad, var in grads_and_vars]
training_op = optimizer.apply_gradients(capped_gvs)

You would then run this training_op at every training step, as usual. It will compute the gradients, clip them between –1.0 and 1.0, and apply them. The threshold is a hyperparameter you can tune.

Reusing Pretrained Layers

It is generally not a good idea to train a very large DNN from scratch: instead, you should always try to find an existing neural network that accomplishes a similar task to the one you are trying to tackle, then just reuse the lower layers of this network: this is called transfer learning. It will not only speed up training considerably, but will also require much less training data.

For example, suppose that you have access to a DNN that was trained to classify pictures into 100 different categories, including animals, plants, vehicles, and everyday objects. You now want to train a DNN to classify specific types of vehicles. These tasks are very similar, so you should try to reuse parts of the first network (see Figure 11-4).

Figure 11-4. Reusing pretrained layers

NOTE
If the input pictures of your new task don't have the same size as the ones used in the original task, you will have to add a preprocessing step to resize them to the size expected by the original model. More generally, transfer learning will only work well if the inputs have similar low-level features.
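For example, a minimal (hypothetical) preprocessing step could rely on TensorFlow's image resizing operation, assuming the original model expects 224 × 224 RGB images:

# X contains the new task's images, whatever their original size
X = tf.placeholder(tf.float32, shape=(None, None, None, 3), name="X")
X_resized = tf.image.resize_images(X, [224, 224])  # match the original model's input size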

Reusing a TensorFlow Model

If the original model was trained using TensorFlow, you can simply restore it and train it on the new task. As we discussed in Chapter 9, you can use the import_meta_graph() function to import the operations into the default graph. This returns a Saver that you can later use to load the model's state:

saver = tf.train.import_meta_graph("./my_model_final.ckpt.meta")

You must then get a handle on the operations and tensors you will need for training. For this, you can use the graph's get_operation_by_name() and get_tensor_by_name() methods. The name of a tensor is the name of the operation that outputs it followed by :0 (or :1 if it is the second output, :2 if it is the third, and so on):

X = tf.get_default_graph().get_tensor_by_name("X:0")
y = tf.get_default_graph().get_tensor_by_name("y:0")
accuracy = tf.get_default_graph().get_tensor_by_name("eval/accuracy:0")
training_op = tf.get_default_graph().get_operation_by_name("GradientDescent")

If the pretrained model is not well documented, then you will have to explore the graph to find the names of the operations you will need. In this case, you can either explore the graph using TensorBoard (for this you must first export the graph using a FileWriter, as discussed in Chapter 9), or you can use the graph's get_operations() method to list all the operations:

for op in tf.get_default_graph().get_operations():
    print(op.name)

If you are the author of the original model, you could make things easier for people who will reuse your model by giving operations very clear names and documenting them. Another approach is to create a collection containing all the important operations that people will want to get a handle on:

for op in (X, y, accuracy, training_op):
    tf.add_to_collection("my_important_ops", op)

This way people who reuse your model will be able to simply write:

X, y, accuracy, training_op = tf.get_collection("my_important_ops")

You can then restore the model's state using the Saver and continue training using your own data:

with tf.Session() as sess:
    saver.restore(sess, "./my_model_final.ckpt")
    [...]  # train the model on your own data

Alternatively, if you have access to the Python code that built the original graph, you can use it instead of import_meta_graph().

In general, you will want to reuse only part of the original model, typically the lower layers. If you use import_meta_graph() to restore the graph, it will load the entire original graph, but nothing prevents you from just ignoring the layers you do not care about. For example, as shown in Figure 11-4, you could build new layers (e.g., one hidden layer and one output layer) on top of a pretrained layer (e.g., pretrained hidden layer 3). You would also need to compute the loss for this new output, and create an optimizer to minimize that loss.

If you have access to the pretrained graph's Python code, you can just reuse the parts you need and chop out the rest. However, in this case you need a Saver to restore the pretrained model (specifying which variables you want to restore; otherwise, TensorFlow will complain that the graphs do not match), and another Saver to save the new model. For example, the following code restores only hidden layers 1, 2, and 3:

[...]  # build the new model with the same hidden layers 1-3 as before

reuse_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES,
                               scope="hidden[123]")  # regular expression
reuse_vars_dict = dict([(var.op.name, var) for var in reuse_vars])
restore_saver = tf.train.Saver(reuse_vars_dict)  # to restore layers 1-3

init = tf.global_variables_initializer()  # to init all variables, old and new
saver = tf.train.Saver()  # to save the new model

with tf.Session() as sess:
    init.run()
    restore_saver.restore(sess, "./my_model_final.ckpt")
    [...]  # train the model
    save_path = saver.save(sess, "./my_new_model_final.ckpt")

First we build the new model, making sure to copy the original model's hidden layers 1 to 3. Then we get the list of all variables in hidden layers 1 to 3, using the regular expression "hidden[123]". Next, we create a dictionary that maps the name of each variable in the original model to its name in the new model (generally you want to keep the exact same names). Then we create a Saver that will restore only these variables. We also create an operation to initialize all the variables (old and new) and a second Saver to save the entire new model, not just layers 1 to 3. We then start a session and initialize all variables in the model, then restore the variable values from the original model's layers 1 to 3. Finally, we train the model on the new task and save it.

TIP
The more similar the tasks are, the more layers you want to reuse (starting with the lower layers). For very similar tasks, you can try keeping all the hidden layers and just replace the output layer.

Reusing Models from Other Frameworks

If the model was trained using another framework, you will need to load the model parameters manually (e.g., using Theano code if it was trained with Theano), then assign them to the appropriate variables. This can be quite tedious. For example, the following code shows how you would copy the weights and biases from the first hidden layer of a model trained using another framework:

original_w = [...]  # Load the weights from the other framework
original_b = [...]  # Load the biases from the other framework

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu, name="hidden1")
[...]  # Build the rest of the model

# Get a handle on the assignment nodes for the hidden1 variables
graph = tf.get_default_graph()
assign_kernel = graph.get_operation_by_name("hidden1/kernel/Assign")
assign_bias = graph.get_operation_by_name("hidden1/bias/Assign")
init_kernel = assign_kernel.inputs[1]
init_bias = assign_bias.inputs[1]

init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init, feed_dict={init_kernel: original_w, init_bias: original_b})
    # [...] Train the model on your new task

In this implementation, we first load the pretrained model using the other framework (not shown here), and we extract from it the model parameters we want to reuse. Next, we build our TensorFlow model as usual. Then comes the tricky part: every TensorFlow variable has an associated assignment operation that is used to initialize it. We start by getting a handle on these assignment operations (they have the same name as the variable, plus "/Assign"). We also get a handle on each assignment operation's second input: in the case of an assignment operation, the second input corresponds to the value that will be assigned to the variable, so in this case it is the variable's initialization value. Once we start the session, we run the usual initialization operation, but this time we feed it the values we want for the variables we want to reuse. Alternatively, we could have created new assignment operations and placeholders, and used them to set the values of the variables after initialization. But why create new nodes in the graph when everything we need is already there?
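For completeness, here is a minimal sketch of that alternative (explicit placeholders plus tf.assign(); the scope and variable names assume the default naming used by tf.layers.dense() in the code above):

with tf.variable_scope("hidden1", reuse=True):
    kernel = tf.get_variable("kernel")
    bias = tf.get_variable("bias")

kernel_ph = tf.placeholder(tf.float32, shape=kernel.shape)
bias_ph = tf.placeholder(tf.float32, shape=bias.shape)
assign_kernel = tf.assign(kernel, kernel_ph)
assign_bias = tf.assign(bias, bias_ph)

init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)
    # overwrite the initialized values with the pretrained parameters
    sess.run([assign_kernel, assign_bias],
             feed_dict={kernel_ph: original_w, bias_ph: original_b})
    # [...] Train the model on your new task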

Freezing the Lower Layers

It is likely that the lower layers of the first DNN have learned to detect low-level features in pictures that will be useful across both image classification tasks, so you can just reuse these layers as they are. It is generally a good idea to "freeze" their weights when training the new DNN: if the lower-layer weights are fixed, then the higher-layer weights will be easier to train (because they won't have to learn a moving target). To freeze the lower layers during training, one solution is to give the optimizer the list of variables to train, excluding the variables from the lower layers:

train_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES,
                               scope="hidden[34]|outputs")
training_op = optimizer.minimize(loss, var_list=train_vars)

The first line gets the list of all trainable variables in hidden layers 3 and 4 and in the output layer. This leaves out the variables in the hidden layers 1 and 2. Next we provide this restricted list of trainable variables to the optimizer's minimize() function. Ta-da! Layers 1 and 2 are now frozen: they will not budge during training (these are often called frozen layers).

Another option is to add a stop_gradient() layer in the graph. Any layer below it will be frozen:

with tf.name_scope("dnn"):
    hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu,
                              name="hidden1")  # reused, frozen
    hidden2 = tf.layers.dense(hidden1, n_hidden2, activation=tf.nn.relu,
                              name="hidden2")  # reused, frozen
    hidden2_stop = tf.stop_gradient(hidden2)
    hidden3 = tf.layers.dense(hidden2_stop, n_hidden3, activation=tf.nn.relu,
                              name="hidden3")  # reused, not frozen
    hidden4 = tf.layers.dense(hidden3, n_hidden4, activation=tf.nn.relu,
                              name="hidden4")  # new!
    logits = tf.layers.dense(hidden4, n_outputs, name="outputs")  # new!

Caching the Frozen Layers

Since the frozen layers won't change, it is possible to cache the output of the topmost frozen layer for each training instance. Since training goes through the whole dataset many times, this will give you a huge speed boost as you will only need to go through the frozen layers once per training instance (instead of once per epoch). For example, you could first run the whole training set through the lower layers (assuming you have enough RAM), then during training, instead of building batches of training instances, you would build batches of outputs from hidden layer 2 and feed them to the training operation:

import numpy as np

n_batches = mnist.train.num_examples // batch_size

with tf.Session() as sess:
    init.run()
    restore_saver.restore(sess, "./my_model_final.ckpt")

    h2_cache = sess.run(hidden2, feed_dict={X: mnist.train.images})

    for epoch in range(n_epochs):
        shuffled_idx = np.random.permutation(mnist.train.num_examples)
        hidden2_batches = np.array_split(h2_cache[shuffled_idx], n_batches)
        y_batches = np.array_split(mnist.train.labels[shuffled_idx], n_batches)
        for hidden2_batch, y_batch in zip(hidden2_batches, y_batches):
            sess.run(training_op, feed_dict={hidden2: hidden2_batch, y: y_batch})

    save_path = saver.save(sess, "./my_new_model_final.ckpt")

The last line of the training loop runs the training operation defined earlier (which does not touch layers 1 and 2), and feeds it a batch of outputs from the second hidden layer (as well as the targets for that batch). Since we give TensorFlow the output of hidden layer 2, it does not try to evaluate it (or any node it depends on).

Tweaking, Dropping, or Replacing the Upper Layers

The output layer of the original model should usually be replaced since it is most likely not useful at all for the new task, and it may not even have the right number of outputs for the new task.

Similarly, the upper hidden layers of the original model are less likely to be as useful as the lower layers, since the high-level features that are most useful for the new task may differ significantly from the ones that were most useful for the original task. You want to find the right number of layers to reuse.

Try freezing all the copied layers first, then train your model and see how it performs. Then try unfreezing one or two of the top hidden layers to let backpropagation tweak them and see if performance improves. The more training data you have, the more layers you can unfreeze.

If you still cannot get good performance, and you have little training data, try dropping the top hidden layer(s) and freeze all remaining hidden layers again. You can iterate until you find the right number of layers to reuse. If you have plenty of training data, you may try replacing the top hidden layers instead of dropping them, and even add more hidden layers.

Model Zoos

Where can you find a neural network trained for a task similar to the one you want to tackle? The first place to look is obviously in your own catalog of models. This is one good reason to save all your models and organize them so you can retrieve them later easily. Another option is to search in a model zoo. Many people train Machine Learning models for various tasks and kindly release their pretrained models to the public.

TensorFlow has its own model zoo available at https://github.com/tensorflow/models. In particular, it contains most of the state-of-the-art image classification nets such as VGG, Inception, and ResNet (see Chapter 13, and check out the models/slim directory), including the code, the pretrained models, and tools to download popular image datasets.

Another popular model zoo is Caffe's Model Zoo. It also contains many computer vision models (e.g., LeNet, AlexNet, ZFNet, GoogLeNet, VGGNet, inception) trained on various datasets (e.g., ImageNet, Places Database, CIFAR10, etc.). Saumitro Dasgupta wrote a converter, which is available at https://github.com/ethereon/caffe-tensorflow.

Unsupervised Pretraining

Suppose you want to tackle a complex task for which you don't have much labeled training data, but unfortunately you cannot find a model trained on a similar task. Don't lose all hope! First, you should of course try to gather more labeled training data, but if this is too hard or too expensive, you may still be able to perform unsupervised pretraining (see Figure 11-5). That is, if you have plenty of unlabeled training data, you can try to train the layers one by one, starting with the lowest layer and then going up, using an unsupervised feature detector algorithm such as Restricted Boltzmann Machines (RBMs; see Appendix E) or autoencoders (see Chapter 15). Each layer is trained on the output of the previously trained layers (all layers except the one being trained are frozen). Once all layers have been trained this way, you can fine-tune the network using supervised learning (i.e., with backpropagation).

This is a rather long and tedious process, but it often works well; in fact, it is this technique that Geoffrey Hinton and his team used in 2006 and which led to the revival of neural networks and the success of Deep Learning. Until 2010, unsupervised pretraining (typically using RBMs) was the norm for deep nets, and it was only after the vanishing gradients problem was alleviated that it became much more common to train DNNs purely using backpropagation. However, unsupervised pretraining (today typically using autoencoders rather than RBMs) is still a good option when you have a complex task to solve, no similar model you can reuse, and little labeled training data but plenty of unlabeled training data.10

Figure 11-5. Unsupervised pretraining

Pretraining on an Auxiliary Task

One last option is to train a first neural network on an auxiliary task for which you can easily obtain or generate labeled training data, then reuse the lower layers of that network for your actual task. The first neural network's lower layers will learn feature detectors that will likely be reusable by the second neural network.

For example, if you want to build a system to recognize faces, you may only have a few pictures of each individual, clearly not enough to train a good classifier. Gathering hundreds of pictures of each person would not be practical. However, you could gather a lot of pictures of random people on the internet and train a first neural network to detect whether or not two different pictures feature the same person. Such a network would learn good feature detectors for faces, so reusing its lower layers would allow you to train a good face classifier using little training data.

It is often rather cheap to gather unlabeled training examples, but quite expensive to label them. In this situation, a common technique is to label all your training examples as "good," then generate many new training instances by corrupting the good ones, and label these corrupted instances as "bad." Then you can train a first neural network to classify instances as good or bad. For example, you could download millions of sentences, label them as "good," then randomly change a word in each sentence and label the resulting sentences as "bad." If a neural network can tell that "The dog sleeps" is a good sentence but "The dog they" is bad, it probably knows quite a lot about language. Reusing its lower layers will likely help in many language processing tasks.

Another approach is to train a first network to output a score for each training instance, and use a cost function that ensures that a good instance's score is greater than a bad instance's score by at least some margin. This is called max margin learning.
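A minimal sketch of such a max margin cost in TensorFlow might look like the following (the score placeholders and the margin value are hypothetical; in practice the scores would be produced by the network itself):

good_scores = tf.placeholder(tf.float32, shape=(None,))  # scores of "good" instances
bad_scores = tf.placeholder(tf.float32, shape=(None,))   # scores of corrupted instances
margin = 1.0

# penalize every pair where a good score does not beat a bad score by at least the margin
loss = tf.reduce_mean(tf.maximum(0.0, margin - good_scores + bad_scores))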

Faster Optimizers

Training a very large deep neural network can be painfully slow. So far we have seen four ways to speed up training (and reach a better solution): applying a good initialization strategy for the connection weights, using a good activation function, using Batch Normalization, and reusing parts of a pretrained network. Another huge speed boost comes from using a faster optimizer than the regular Gradient Descent optimizer. In this section we will present the most popular ones: Momentum optimization, Nesterov Accelerated Gradient, AdaGrad, RMSProp, and finally Adam optimization.

Momentum Optimization

Imagine a bowling ball rolling down a gentle slope on a smooth surface: it will start out slowly, but it will quickly pick up momentum until it eventually reaches terminal velocity (if there is some friction or air resistance). This is the very simple idea behind Momentum optimization, proposed by Boris Polyak in 1964.11 In contrast, regular Gradient Descent will simply take small regular steps down the slope, so it will take much more time to reach the bottom.

Recall that Gradient Descent simply updates the weights θ by directly subtracting the gradient of the cost function J(θ) with regards to the weights (∇θJ(θ)) multiplied by the learning rate η. The equation is: θ ← θ − η ∇θJ(θ). It does not care about what the earlier gradients were. If the local gradient is tiny, it goes very slowly.

Momentum optimization cares a great deal about what previous gradients were: at each iteration, it subtracts the local gradient from the momentum vector m (multiplied by the learning rate η), and it updates the weights by simply adding this momentum vector (see Equation 11-4). In other words, the gradient is used as an acceleration, not as a speed. To simulate some sort of friction mechanism and prevent the momentum from growing too large, the algorithm introduces a new hyperparameter β, simply called the momentum, which must be set between 0 (high friction) and 1 (no friction). A typical momentum value is 0.9.

Equation 11-4. Momentum algorithm

1. $\mathbf{m} \leftarrow \beta \mathbf{m} - \eta \nabla_\theta J(\theta)$

2. $\theta \leftarrow \theta + \mathbf{m}$

You can easily verify that if the gradient remains constant, the terminal velocity (i.e., the maximum size of the weight updates) is equal to that gradient multiplied by the learning rate η multiplied by 1/(1 − β) (ignoring the sign). For example, if β = 0.9, then the terminal velocity is equal to 10 times the gradient times the learning rate, so Momentum optimization ends up going 10 times faster than Gradient Descent! This allows Momentum optimization to escape from plateaus much faster than Gradient Descent. In particular, we saw in Chapter 4 that when the inputs have very different scales the cost function will look like an elongated bowl (see Figure 4-7). Gradient Descent goes down the steep slope quite fast, but then it takes a very long time to go down the valley. In contrast, Momentum optimization will roll down the bottom of the valley faster and faster until it reaches the bottom (the optimum). In deep neural networks that don't use Batch Normalization, the upper layers will often end up having inputs with very different scales, so using Momentum optimization helps a lot. It can also help roll past local optima.
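You can check this claim with a tiny simulation of Equation 11-4 under a constant gradient (plain Python, illustrative values only):

eta, beta, gradient = 0.1, 0.9, 1.0
m = 0.0
for step in range(100):
    m = beta * m - eta * gradient   # Equation 11-4, step 1
print(m)  # close to -eta * gradient / (1 - beta) = -1.0, i.e., 10x the plain GD step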

NOTE
Due to the momentum, the optimizer may overshoot a bit, then come back, overshoot again, and oscillate like this many times before stabilizing at the minimum. This is one of the reasons why it is good to have a bit of friction in the system: it gets rid of these oscillations and thus speeds up convergence.

Implementing Momentum optimization in TensorFlow is a no-brainer: just replace the GradientDescentOptimizer with the MomentumOptimizer, then lie back and profit!

optimizer = tf.train.MomentumOptimizer(learning_rate=learning_rate,
                                       momentum=0.9)

The one drawback of Momentum optimization is that it adds yet another hyperparameter to tune. However, the momentum value of 0.9 usually works well in practice and almost always goes faster than Gradient Descent.

Nesterov Accelerated Gradient

One small variant to Momentum optimization, proposed by Yurii Nesterov in 1983,12 is almost always faster than vanilla Momentum optimization. The idea of Nesterov Momentum optimization, or Nesterov Accelerated Gradient (NAG), is to measure the gradient of the cost function not at the local position but slightly ahead in the direction of the momentum (see Equation 11-5). The only difference from vanilla Momentum optimization is that the gradient is measured at θ + βm rather than at θ.

Equation 11-5. Nesterov Accelerated Gradient algorithm

1. $\mathbf{m} \leftarrow \beta \mathbf{m} - \eta \nabla_\theta J(\theta + \beta \mathbf{m})$

2. $\theta \leftarrow \theta + \mathbf{m}$

This small tweak works because in general the momentum vector will be pointing in the right direction (i.e., toward the optimum), so it will be slightly more accurate to use the gradient measured a bit farther in that direction rather than using the gradient at the original position, as you can see in Figure 11-6 (where ∇1 represents the gradient of the cost function measured at the starting point θ, and ∇2 represents the gradient at the point located at θ + βm). As you can see, the Nesterov update ends up slightly closer to the optimum. After a while, these small improvements add up and NAG ends up being significantly faster than regular Momentum optimization. Moreover, note that when the momentum pushes the weights across a valley, ∇1 continues to push further across the valley, while ∇2 pushes back toward the bottom of the valley. This helps reduce oscillations and thus converges faster.

NAG will almost always speed up training compared to regular Momentum optimization. To use it, simply set use_nesterov=True when creating the MomentumOptimizer:

optimizer = tf.train.MomentumOptimizer(learning_rate=learning_rate,
                                       momentum=0.9, use_nesterov=True)

Figure 11-6. Regular versus Nesterov Momentum optimization

AdaGrad

Consider the elongated bowl problem again: Gradient Descent starts by quickly going down the steepest slope, then slowly goes down the bottom of the valley. It would be nice if the algorithm could detect this early on and correct its direction to point a bit more toward the global optimum.

The AdaGrad algorithm13 achieves this by scaling down the gradient vector along the steepest dimensions (see Equation 11-6):

Equation 11-6. AdaGrad algorithm

1. $\mathbf{s} \leftarrow \mathbf{s} + \nabla_\theta J(\theta) \otimes \nabla_\theta J(\theta)$

2. $\theta \leftarrow \theta - \eta \, \nabla_\theta J(\theta) \oslash \sqrt{\mathbf{s} + \epsilon}$

The first step accumulates the square of the gradients into the vector s (the ⊗ symbol represents the element-wise multiplication). This vectorized form is equivalent to computing si ← si + (∂J(θ)/∂θi)² for each element si of the vector s; in other words, each si accumulates the squares of the partial derivative of the cost function with regards to parameter θi. If the cost function is steep along the ith dimension, then si will get larger and larger at each iteration.

The second step is almost identical to Gradient Descent, but with one big difference: the gradient vector is scaled down by a factor of √(s + ϵ) (the ⊘ symbol represents the element-wise division, and ϵ is a smoothing term to avoid division by zero, typically set to 10⁻¹⁰). This vectorized form is equivalent to computing θi ← θi − η (∂J(θ)/∂θi) / √(si + ϵ) for all parameters θi (simultaneously).

In short, this algorithm decays the learning rate, but it does so faster for steep dimensions than for dimensions with gentler slopes. This is called an adaptive learning rate. It helps point the resulting updates more directly toward the global optimum (see Figure 11-7). One additional benefit is that it requires much less tuning of the learning rate hyperparameter η.

Figure 11-7. AdaGrad versus Gradient Descent

AdaGrad often performs well for simple quadratic problems, but unfortunately it often stops too early when training neural networks. The learning rate gets scaled down so much that the algorithm ends up stopping entirely before reaching the global optimum. So even though TensorFlow has an AdagradOptimizer, you should not use it to train deep neural networks (it may be efficient for simpler tasks such as Linear Regression, though).
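If you do want to try it on such a simpler task, the class is used just like the other optimizers:

optimizer = tf.train.AdagradOptimizer(learning_rate=learning_rate)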

RMSProp

Although AdaGrad slows down a bit too fast and ends up never converging to the global optimum, the RMSProp algorithm14 fixes this by accumulating only the gradients from the most recent iterations (as opposed to all the gradients since the beginning of training). It does so by using exponential decay in the first step (see Equation 11-7).

Equation 11-7. RMSProp algorithm

1. $\mathbf{s} \leftarrow \beta \mathbf{s} + (1 - \beta) \nabla_\theta J(\theta) \otimes \nabla_\theta J(\theta)$

2. $\theta \leftarrow \theta - \eta \, \nabla_\theta J(\theta) \oslash \sqrt{\mathbf{s} + \epsilon}$

The decay rate β is typically set to 0.9. Yes, it is once again a new hyperparameter, but this default value often works well, so you may not need to tune it at all.

As you might expect, TensorFlow has an RMSPropOptimizer class:

optimizer = tf.train.RMSPropOptimizer(learning_rate=learning_rate,
                                      momentum=0.9, decay=0.9, epsilon=1e-10)

Except on very simple problems, this optimizer almost always performs much better than AdaGrad. It also generally converges faster than Momentum optimization and Nesterov Accelerated Gradients. In fact, it was the preferred optimization algorithm of many researchers until Adam optimization came around.

Adam Optimization

Adam,15 which stands for adaptive moment estimation, combines the ideas of Momentum optimization and RMSProp: just like Momentum optimization it keeps track of an exponentially decaying average of past gradients, and just like RMSProp it keeps track of an exponentially decaying average of past squared gradients (see Equation 11-8).16

Equation 11-8. Adam algorithm

1. $\mathbf{m} \leftarrow \beta_1 \mathbf{m} - (1 - \beta_1) \nabla_\theta J(\theta)$

2. $\mathbf{s} \leftarrow \beta_2 \mathbf{s} + (1 - \beta_2) \nabla_\theta J(\theta) \otimes \nabla_\theta J(\theta)$

3. $\hat{\mathbf{m}} \leftarrow \dfrac{\mathbf{m}}{1 - {\beta_1}^T}$

4. $\hat{\mathbf{s}} \leftarrow \dfrac{\mathbf{s}}{1 - {\beta_2}^T}$

5. $\theta \leftarrow \theta + \eta \, \hat{\mathbf{m}} \oslash \sqrt{\hat{\mathbf{s}} + \epsilon}$

T represents the iteration number (starting at 1).

If you just look at steps 1, 2, and 5, you will notice Adam's close similarity to both Momentum optimization and RMSProp. The only difference is that step 1 computes an exponentially decaying average rather than an exponentially decaying sum, but these are actually equivalent except for a constant factor (the decaying average is just 1 – β1 times the decaying sum). Steps 3 and 4 are somewhat of a technical detail: since m and s are initialized at 0, they will be biased toward 0 at the beginning of training, so these two steps will help boost m and s at the beginning of training.

The momentum decay hyperparameter β1 is typically initialized to 0.9, while the scaling decay hyperparameter β2 is often initialized to 0.999. As earlier, the smoothing term ϵ is usually initialized to a tiny number such as 10⁻⁸. These are the default values for TensorFlow's AdamOptimizer class, so you can simply use:

optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)

In fact, since Adam is an adaptive learning rate algorithm (like AdaGrad and RMSProp), it requires less tuning of the learning rate hyperparameter η. You can often use the default value η = 0.001, making Adam even easier to use than Gradient Descent.

WARNING
This book initially recommended using Adam optimization, because it was generally considered faster and better than other methods. However, a 2017 paper17 by Ashia C. Wilson et al. showed that adaptive optimization methods (i.e., AdaGrad, RMSProp and Adam optimization) can lead to solutions that generalize poorly on some datasets. So you may want to stick to Momentum optimization or Nesterov Accelerated Gradient for now, until researchers have a better understanding of this issue.

All the optimization techniques discussed so far only rely on the first-order partial derivatives (Jacobians). The optimization literature contains amazing algorithms based on the second-order partial derivatives (the Hessians). Unfortunately, these algorithms are very hard to apply to deep neural networks because there are n² Hessians per output (where n is the number of parameters), as opposed to just n Jacobians per output. Since DNNs typically have tens of thousands of parameters, the second-order optimization algorithms often don't even fit in memory, and even when they do, computing the Hessians is just too slow.

TRAINING SPARSE MODELS

All the optimization algorithms just presented produce dense models, meaning that most parameters will be nonzero. If you need a blazingly fast model at runtime, or if you need it to take up less memory, you may prefer to end up with a sparse model instead.

One trivial way to achieve this is to train the model as usual, then get rid of the tiny weights (set them to 0).

Another option is to apply strong ℓ1 regularization during training, as it pushes the optimizer to zero out as many weights as it can (as discussed in Chapter 4 about Lasso Regression).

However, in some cases these techniques may remain insufficient. One last option is to apply Dual Averaging, often called Follow The Regularized Leader (FTRL), a technique proposed by Yurii Nesterov.18 When used with ℓ1 regularization, this technique often leads to very sparse models. TensorFlow implements a variant of FTRL called FTRL-Proximal19 in the FTRLOptimizer class.
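For example, a minimal sketch (the actual class name is tf.train.FtrlOptimizer, and the regularization strength below is just an illustrative value):

optimizer = tf.train.FtrlOptimizer(learning_rate=learning_rate,
                                   l1_regularization_strength=0.001)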

Learning Rate Scheduling

Finding a good learning rate can be tricky. If you set it way too high, training may actually diverge (as we discussed in Chapter 4). If you set it too low, training will eventually converge to the optimum, but it will take a very long time. If you set it slightly too high, it will make progress very quickly at first, but it will end up dancing around the optimum, never settling down (unless you use an adaptive learning rate optimization algorithm such as AdaGrad, RMSProp, or Adam, but even then it may take time to settle). If you have a limited computing budget, you may have to interrupt training before it has converged properly, yielding a suboptimal solution (see Figure 11-8).

Figure 11-8. Learning curves for various learning rates η

You may be able to find a fairly good learning rate by training your network several times during just a few epochs using various learning rates and comparing the learning curves. The ideal learning rate will learn quickly and converge to a good solution.

However, you can do better than a constant learning rate: if you start with a high learning rate and then reduce it once it stops making fast progress, you can reach a good solution faster than with the optimal constant learning rate. There are many different strategies to reduce the learning rate during training. These strategies are called learning schedules (we briefly introduced this concept in Chapter 4), the most common of which are:

Predetermined piecewise constant learning rate
For example, set the learning rate to η0 = 0.1 at first, then to η1 = 0.001 after 50 epochs. Although this solution can work very well, it often requires fiddling around to figure out the right learning rates and when to use them.

Performance scheduling
Measure the validation error every N steps (just like for early stopping) and reduce the learning rate by a factor of λ when the error stops dropping.

Exponential scheduling
Set the learning rate to a function of the iteration number t: η(t) = η0 10^(–t/r). This works great, but it requires tuning η0 and r. The learning rate will drop by a factor of 10 every r steps.

Power scheduling
Set the learning rate to η(t) = η0 (1 + t/r)^(–c). The hyperparameter c is typically set to 1. This is similar to exponential scheduling, but the learning rate drops much more slowly.

A 2013 paper20 by Andrew Senior et al. compared the performance of some of the most popular learning schedules when training deep neural networks for speech recognition using Momentum optimization. The authors concluded that, in this setting, both performance scheduling and exponential scheduling performed well, but they favored exponential scheduling because it is simpler to implement, is easy to tune, and converged slightly faster to the optimal solution.

Implementing a learning schedule with TensorFlow is fairly straightforward:

initial_learning_rate = 0.1
decay_steps = 10000
decay_rate = 1/10
global_step = tf.Variable(0, trainable=False, name="global_step")
learning_rate = tf.train.exponential_decay(initial_learning_rate, global_step,
                                           decay_steps, decay_rate)
optimizer = tf.train.MomentumOptimizer(learning_rate, momentum=0.9)
training_op = optimizer.minimize(loss, global_step=global_step)

After setting the hyperparameter values, we create a nontrainable variable global_step (initialized to 0) to keep track of the current training iteration number. Then we define an exponentially decaying learning rate (with η0 = 0.1 and r = 10,000) using TensorFlow's exponential_decay() function. Next, we create an optimizer (in this example, a MomentumOptimizer) using this decaying learning rate. Finally, we create the training operation by calling the optimizer's minimize() method; since we pass it the global_step variable, it will kindly take care of incrementing it. That's it!

Since AdaGrad, RMSProp, and Adam optimization automatically reduce the learning rate during training, it is not necessary to add an extra learning schedule. For other optimization algorithms, using exponential decay or performance scheduling can considerably speed up convergence.

Avoiding Overfitting Through Regularization

With four parameters I can fit an elephant and with five I can make him wiggle his trunk.
John von Neumann, cited by Enrico Fermi in Nature 427

Deep neural networks typically have tens of thousands of parameters, sometimes even millions. With so many parameters, the network has an incredible amount of freedom and can fit a huge variety of complex datasets. But this great flexibility also means that it is prone to overfitting the training set.

With millions of parameters you can fit the whole zoo. In this section we will present some of the most popular regularization techniques for neural networks, and how to implement them with TensorFlow: early stopping, ℓ1 and ℓ2 regularization, dropout, max-norm regularization, and data augmentation.

Early Stopping

To avoid overfitting the training set, a great solution is early stopping (introduced in Chapter 4): just interrupt training when its performance on the validation set starts dropping.

One way to implement this with TensorFlow is to evaluate the model on a validation set at regular intervals (e.g., every 50 steps), and save a "winner" snapshot if it outperforms previous "winner" snapshots. Count the number of steps since the last "winner" snapshot was saved, and interrupt training when this number reaches some limit (e.g., 2,000 steps). Then restore the last "winner" snapshot.
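Here is a minimal sketch of that idea (the names best_loss, checks_without_progress, my_winner_model.ckpt, and the validation tensors X_valid and y_valid are made up; it assumes the usual training_op, loss, saver, session, and numpy import from the earlier examples):

best_loss = np.infty
checks_without_progress = 0
max_checks_without_progress = 40   # 40 checks x 50 steps = 2,000 steps

for step in range(n_steps):
    X_batch, y_batch = mnist.train.next_batch(batch_size)
    sess.run(training_op, feed_dict={training: True, X: X_batch, y: y_batch})
    if step % 50 == 0:
        loss_val = loss.eval(feed_dict={X: X_valid, y: y_valid})
        if loss_val < best_loss:
            best_loss = loss_val
            checks_without_progress = 0
            saver.save(sess, "./my_winner_model.ckpt")   # save the "winner"
        else:
            checks_without_progress += 1
            if checks_without_progress > max_checks_without_progress:
                print("Early stopping!")
                break

saver.restore(sess, "./my_winner_model.ckpt")   # roll back to the best snapshot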

Although early stopping works very well in practice, you can usually get much higher performance out of your network by combining it with other regularization techniques.

ℓ1 and ℓ2 Regularization

Just like you did in Chapter 4 for simple linear models, you can use ℓ1 and ℓ2 regularization to constrain a neural network's connection weights (but typically not its biases).

One way to do this using TensorFlow is to simply add the appropriate regularization terms to your cost function. For example, assuming you have just one hidden layer with weights W1 and one output layer with weights W2, then you can apply ℓ1 regularization like this:

[...]  # construct the neural network
W1 = tf.get_default_graph().get_tensor_by_name("hidden1/kernel:0")
W2 = tf.get_default_graph().get_tensor_by_name("outputs/kernel:0")

scale = 0.001  # l1 regularization hyperparameter

with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y,
                                                              logits=logits)
    base_loss = tf.reduce_mean(xentropy, name="avg_xentropy")
    reg_losses = tf.reduce_sum(tf.abs(W1)) + tf.reduce_sum(tf.abs(W2))
    loss = tf.add(base_loss, scale * reg_losses, name="loss")

However, if there are many layers, this approach is not very convenient. Fortunately, TensorFlow provides a better option. Many functions that create variables (such as get_variable() or tf.layers.dense()) accept a *_regularizer argument for each created variable (e.g., kernel_regularizer). You can pass any function that takes weights as an argument and returns the corresponding regularization loss. The l1_regularizer(), l2_regularizer(), and l1_l2_regularizer() functions return such functions. The following code puts all this together:

my_dense_layer = partial(
    tf.layers.dense, activation=tf.nn.relu,
    kernel_regularizer=tf.contrib.layers.l1_regularizer(scale))

with tf.name_scope("dnn"):
    hidden1 = my_dense_layer(X, n_hidden1, name="hidden1")
    hidden2 = my_dense_layer(hidden1, n_hidden2, name="hidden2")
    logits = my_dense_layer(hidden2, n_outputs, activation=None,
                            name="outputs")

This code creates a neural network with two hidden layers and one output layer, and it also creates nodes in the graph to compute the ℓ1 regularization loss corresponding to each layer's weights. TensorFlow automatically adds these nodes to a special collection containing all the regularization losses. You just need to add these regularization losses to your overall loss, like this:

reg_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
loss = tf.add_n([base_loss] + reg_losses, name="loss")

WARNING
Don't forget to add the regularization losses to your overall loss, or else they will simply be ignored.

DropoutThemostpopularregularizationtechniquefordeepneuralnetworksisarguablydropout.Itwasproposed21byG.E.Hintonin2012andfurtherdetailedinapaper22byNitishSrivastavaetal.,andithasproventobehighlysuccessful:eventhestate-of-the-artneuralnetworksgota1–2%accuracyboostsimplybyaddingdropout.Thismaynotsoundlikealot,butwhenamodelalreadyhas95%accuracy,gettinga2%accuracyboostmeansdroppingtheerrorratebyalmost40%(goingfrom5%errortoroughly3%).

Itisafairlysimplealgorithm:ateverytrainingstep,everyneuron(includingtheinputneuronsbutexcludingtheoutputneurons)hasaprobabilitypofbeingtemporarily“droppedout,”meaningitwillbeentirelyignoredduringthistrainingstep,butitmaybeactiveduringthenextstep(seeFigure11-9).Thehyperparameterpiscalledthedropoutrate,anditistypicallysetto50%.Aftertraining,neuronsdon’tgetdroppedanymore.Andthat’sall(exceptforatechnicaldetailwewilldiscussmomentarily).

Figure11-9.Dropoutregularization

It is quite surprising at first that this rather brutal technique works at all. Would a company perform better if its employees were told to toss a coin every morning to decide whether or not to go to work? Well, who knows; perhaps it would! The company would obviously be forced to adapt its organization; it could not rely on any single person to fill in the coffee machine or perform any other critical tasks, so this expertise would have to be spread across several people. Employees would have to learn to cooperate with many of their coworkers, not just a handful of them. The company would become much more resilient. If one person quit, it wouldn't make much of a difference. It's unclear whether this idea would actually work for companies, but it certainly does for neural networks. Neurons trained with dropout cannot co-adapt with their neighboring neurons; they have to be as useful as possible on their own. They also cannot rely excessively on just a few input neurons; they must pay attention to each of their input neurons. They end up being less sensitive to slight changes in the inputs. In the end you get a more robust network that generalizes better.

Another way to understand the power of dropout is to realize that a unique neural network is generated at each training step. Since each neuron can be either present or absent, there is a total of 2^N possible networks (where N is the total number of droppable neurons). This is such a huge number that it is virtually impossible for the same neural network to be sampled twice. Once you have run 10,000 training steps, you have essentially trained 10,000 different neural networks (each with just one training instance). These neural networks are obviously not independent since they share many of their weights, but they are nevertheless all different. The resulting neural network can be seen as an averaging ensemble of all these smaller neural networks.

There is one small but important technical detail. Suppose p = 50%, in which case during testing a neuron will be connected to twice as many input neurons as it was (on average) during training. To compensate for this fact, we need to multiply each neuron's input connection weights by 0.5 after training. If we don't, each neuron will get a total input signal roughly twice as large as what the network was trained on, and it is unlikely to perform well. More generally, we need to multiply each input connection weight by the keep probability (1 – p) after training. Alternatively, we can divide each neuron's output by the keep probability during training (these alternatives are not perfectly equivalent, but they work equally well).
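To make the scaling rule concrete, here is a minimal NumPy sketch of the second alternative (dividing by the keep probability during training, often called "inverted dropout"); the array values and the 50% rate are just illustrative assumptions:

import numpy as np

keep_prob = 0.5                                # assumed keep probability (1 - p)
activations = np.array([0.2, 1.5, 0.8, 0.3])   # hypothetical layer outputs

# Training time: randomly drop units, then divide the survivors by keep_prob
# so the expected total input to the next layer stays unchanged.
mask = np.random.rand(*activations.shape) < keep_prob
dropped = activations * mask / keep_prob

# Test time: nothing to do; outputs are used as-is, which is equivalent to
# multiplying the weights by keep_prob in the other scheme.
print(dropped)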

To implement dropout using TensorFlow, you can simply apply the tf.layers.dropout() function to the input layer and/or to the output of any hidden layer you want. During training, this function randomly drops some items (setting them to 0) and divides the remaining items by the keep probability. After training, this function does nothing at all. The following code applies dropout regularization to our three-layer neural network:

[...]

training = tf.placeholder_with_default(False, shape=(), name='training')

dropout_rate = 0.5  # == 1 - keep_prob
X_drop = tf.layers.dropout(X, dropout_rate, training=training)

with tf.name_scope("dnn"):
    hidden1 = tf.layers.dense(X_drop, n_hidden1, activation=tf.nn.relu,
                              name="hidden1")
    hidden1_drop = tf.layers.dropout(hidden1, dropout_rate, training=training)
    hidden2 = tf.layers.dense(hidden1_drop, n_hidden2, activation=tf.nn.relu,
                              name="hidden2")
    hidden2_drop = tf.layers.dropout(hidden2, dropout_rate, training=training)
    logits = tf.layers.dense(hidden2_drop, n_outputs, name="outputs")

WARNING: You want to use the tf.layers.dropout() function, not tf.nn.dropout(). The first one turns off (no-op) when not training, which is what you want, while the second one does not.

Of course, just like you did earlier for Batch Normalization, you need to set training to True when training, and leave the default False value when testing.

If you observe that the model is overfitting, you can increase the dropout rate. Conversely, you should try decreasing the dropout rate if the model underfits the training set. It can also help to increase the dropout rate for large layers, and reduce it for small ones.

Dropout does tend to significantly slow down convergence, but it usually results in a much better model when tuned properly. So, it is generally well worth the extra time and effort.

NOTE: Dropconnect is a variant of dropout where individual connections are dropped randomly rather than whole neurons. In general dropout performs better.

Max-Norm Regularization

Another regularization technique that is quite popular for neural networks is called max-norm regularization: for each neuron, it constrains the weights w of the incoming connections such that ∥ w ∥₂ ≤ r, where r is the max-norm hyperparameter and ∥ · ∥₂ is the ℓ2 norm.

We typically implement this constraint by computing ∥ w ∥₂ after each training step and clipping w if needed (w ← w · r / ∥ w ∥₂).

Reducing r increases the amount of regularization and helps reduce overfitting. Max-norm regularization can also help alleviate the vanishing/exploding gradients problems (if you are not using Batch Normalization).

TensorFlow does not provide an off-the-shelf max-norm regularizer, but it is not too hard to implement. The following code gets a handle on the weights of the first hidden layer, then it uses the clip_by_norm() function to create an operation that will clip the weights along the second axis so that each row vector ends up with a maximum norm of 1.0. The last line creates an assignment operation that will assign the clipped weights to the weights variable:

threshold = 1.0
weights = tf.get_default_graph().get_tensor_by_name("hidden1/kernel:0")
clipped_weights = tf.clip_by_norm(weights, clip_norm=threshold, axes=1)
clip_weights = tf.assign(weights, clipped_weights)

Then you just apply this operation after each training step, like so:

sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
clip_weights.eval()

In general, you would do this for every hidden layer. Although this solution should work fine, it is a bit messy. A cleaner solution is to create a max_norm_regularizer() function and use it just like the earlier l1_regularizer() function:

def max_norm_regularizer(threshold, axes=1, name="max_norm",
                         collection="max_norm"):
    def max_norm(weights):
        clipped = tf.clip_by_norm(weights, clip_norm=threshold, axes=axes)
        clip_weights = tf.assign(weights, clipped, name=name)
        tf.add_to_collection(collection, clip_weights)
        return None  # there is no regularization loss term
    return max_norm

This function returns a parametrized max_norm() function that you can use like any other regularizer:

max_norm_reg = max_norm_regularizer(threshold=1.0)

with tf.name_scope("dnn"):
    hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu,
                              kernel_regularizer=max_norm_reg, name="hidden1")
    hidden2 = tf.layers.dense(hidden1, n_hidden2, activation=tf.nn.relu,
                              kernel_regularizer=max_norm_reg, name="hidden2")
    logits = tf.layers.dense(hidden2, n_outputs, name="outputs")

Note that max-norm regularization does not require adding a regularization loss term to your overall loss function, which is why the max_norm() function returns None. But you still need to be able to run the clip_weights operations after each training step, so you need to be able to get a handle on them. This is why the max_norm() function adds the clip_weights operation to a collection of max-norm clipping operations. You need to fetch these clipping operations and run them after each training step:

clip_all_weights = tf.get_collection("max_norm")

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(mnist.train.num_examples // batch_size):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
            sess.run(clip_all_weights)

Much cleaner code, isn't it?

Data Augmentation

One last regularization technique, data augmentation, consists of generating new training instances from existing ones, artificially boosting the size of the training set. This will reduce overfitting, making this a regularization technique. The trick is to generate realistic training instances; ideally, a human should not be able to tell which instances were generated and which ones were not. Moreover, simply adding white noise will not help; the modifications you apply should be learnable (white noise is not).

For example, if your model is meant to classify pictures of mushrooms, you can slightly shift, rotate, and resize every picture in the training set by various amounts and add the resulting pictures to the training set (see Figure 11-10). This forces the model to be more tolerant to the position, orientation, and size of the mushrooms in the picture. If you want the model to be more tolerant to lighting conditions, you can similarly generate many images with various contrasts. Assuming the mushrooms are symmetrical, you can also flip the pictures horizontally. By combining these transformations you can greatly increase the size of your training set.

Figure 11-10. Generating new training instances from existing ones

It is often preferable to generate training instances on the fly during training rather than wasting storage space and network bandwidth. TensorFlow offers several image manipulation operations such as transposing (shifting), rotating, resizing, flipping, and cropping, as well as adjusting the brightness, contrast, saturation, and hue (see the API documentation for more details). This makes it easy to implement data augmentation for image datasets.
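As a rough illustration, here is a minimal sketch of on-the-fly augmentation using a few of TensorFlow's tf.image operations; the image placeholder and the particular parameter values are just assumptions for the example:

image = tf.placeholder(tf.float32, shape=(None, None, 3))  # one RGB image

# Randomly flip and perturb brightness/contrast, then resize, to create a new variant.
augmented = tf.image.random_flip_left_right(image)
augmented = tf.image.random_brightness(augmented, max_delta=0.2)
augmented = tf.image.random_contrast(augmented, lower=0.8, upper=1.2)
augmented = tf.image.resize_images(augmented, [28, 28])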

NOTE: Another powerful technique to train very deep neural networks is to add skip connections (a skip connection is when you add the input of a layer to the output of a higher layer). We will explore this idea in Chapter 13 when we talk about deep residual networks.

Practical Guidelines

In this chapter, we have covered a wide range of techniques and you may be wondering which ones you should use. The configuration in Table 11-2 will work fine in most cases.

Table 11-2. Default DNN configuration

Initialization:          He initialization
Activation function:     ELU
Normalization:           Batch Normalization
Regularization:          Dropout
Optimizer:               Nesterov Accelerated Gradient
Learning rate schedule:  None

Of course, you should try to reuse parts of a pretrained neural network if you can find one that solves a similar problem.

This default configuration may need to be tweaked:

If you can't find a good learning rate (convergence was too slow, so you increased the learning rate, and now convergence is fast but the network's accuracy is suboptimal), then you can try adding a learning schedule such as exponential decay.

If your training set is a bit too small, you can implement data augmentation.

If you need a sparse model, you can add some ℓ1 regularization to the mix (and optionally zero out the tiny weights after training). If you need an even sparser model, you can try using FTRL instead of Adam optimization, along with ℓ1 regularization (see the sketch after this list).

If you need a lightning-fast model at runtime, you may want to drop Batch Normalization, and possibly replace the ELU activation function with the leaky ReLU. Having a sparse model will also help.
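For instance, here is a minimal sketch of the FTRL-plus-ℓ1 option; the learning rate and regularization strength below are placeholder values, not recommendations:

optimizer = tf.train.FtrlOptimizer(learning_rate=0.05,
                                   l1_regularization_strength=0.001)
training_op = optimizer.minimize(loss)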

With these guidelines, you are now ready to train very deep nets. Well, if you are very patient, that is! If you use a single machine, you may have to wait for days or even months for training to complete. In the next chapter we will discuss how to use Distributed TensorFlow to train and run models across many servers and GPUs.

Exercises

1. Is it okay to initialize all the weights to the same value as long as that value is selected randomly using He initialization?

2. Is it okay to initialize the bias terms to 0?

3. Name three advantages of the ELU activation function over ReLU.

4. In which cases would you want to use each of the following activation functions: ELU, leaky ReLU (and its variants), ReLU, tanh, logistic, and softmax?

5. What may happen if you set the momentum hyperparameter too close to 1 (e.g., 0.99999) when using a MomentumOptimizer?

6. Name three ways you can produce a sparse model.

7. Does dropout slow down training? Does it slow down inference (i.e., making predictions on new instances)?

8. Deep Learning.

   a. Build a DNN with five hidden layers of 100 neurons each, He initialization, and the ELU activation function.

   b. Using Adam optimization and early stopping, try training it on MNIST but only on digits 0 to 4, as we will use transfer learning for digits 5 to 9 in the next exercise. You will need a softmax output layer with five neurons, and as always make sure to save checkpoints at regular intervals and save the final model so you can reuse it later.

   c. Tune the hyperparameters using cross-validation and see what precision you can achieve.

   d. Now try adding Batch Normalization and compare the learning curves: is it converging faster than before? Does it produce a better model?

   e. Is the model overfitting the training set? Try adding dropout to every layer and try again. Does it help?

9. Transfer learning.

   a. Create a new DNN that reuses all the pretrained hidden layers of the previous model, freezes them, and replaces the softmax output layer with a new one.

   b. Train this new DNN on digits 5 to 9, using only 100 images per digit, and time how long it takes. Despite this small number of examples, can you achieve high precision?

   c. Try caching the frozen layers, and train the model again: how much faster is it now?

   d. Try again reusing just four hidden layers instead of five. Can you achieve a higher precision?

   e. Now unfreeze the top two hidden layers and continue training: can you get the model to perform even better?

10. Pretraining on an auxiliary task.

   a. In this exercise you will build a DNN that compares two MNIST digit images and predicts whether they represent the same digit or not. Then you will reuse the lower layers of this network to train an MNIST classifier using very little training data. Start by building two DNNs (let's call them DNN A and B), both similar to the one you built earlier but without the output layer: each DNN should have five hidden layers of 100 neurons each, He initialization, and ELU activation. Next, add one more hidden layer with 10 units on top of both DNNs. To do this, you should use TensorFlow's concat() function with axis=1 to concatenate the outputs of both DNNs for each instance, then feed the result to the hidden layer. Finally, add an output layer with a single neuron using the logistic activation function.

   b. Split the MNIST training set in two sets: split #1 should contain 55,000 images, and split #2 should contain 5,000 images. Create a function that generates a training batch where each instance is a pair of MNIST images picked from split #1. Half of the training instances should be pairs of images that belong to the same class, while the other half should be images from different classes. For each pair, the training label should be 0 if the images are from the same class, or 1 if they are from different classes.

   c. Train the DNN on this training set. For each image pair, you can simultaneously feed the first image to DNN A and the second image to DNN B. The whole network will gradually learn to tell whether two images belong to the same class or not.

   d. Now create a new DNN by reusing and freezing the hidden layers of DNN A and adding a softmax output layer on top with 10 neurons. Train this network on split #2 and see if you can achieve high performance despite having only 500 images per class.

Solutions to these exercises are available in Appendix A.

1. "Understanding the Difficulty of Training Deep Feedforward Neural Networks," X. Glorot, Y. Bengio (2010).

2. Here's an analogy: if you set a microphone amplifier's knob too close to zero, people won't hear your voice, but if you set it too close to the max, your voice will be saturated and people won't understand what you are saying. Now imagine a chain of such amplifiers: they all need to be set properly in order for your voice to come out loud and clear at the end of the chain. Your voice has to come out of each amplifier at the same amplitude as it came in.

3. This simplified strategy was actually already proposed much earlier, for example in the 1998 book Neural Networks: Tricks of the Trade by Genevieve Orr and Klaus-Robert Müller (Springer).

4. Such as "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification," K. He et al. (2015).

5. "Empirical Evaluation of Rectified Activations in Convolution Network," B. Xu et al. (2015).

6. "Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)," D. Clevert, T. Unterthiner, S. Hochreiter (2015).

7. "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift," S. Ioffe and C. Szegedy (2015).

8. Many researchers argue that it is just as good, or even better, to place the batch normalization layers after (rather than before) the activations.

9. "On the difficulty of training recurrent neural networks," R. Pascanu et al. (2013).

10. Another option is to come up with a supervised task for which you can easily gather a lot of labeled training data, then use transfer learning, as explained earlier. For example, if you want to train a model to identify your friends in pictures, you could download millions of faces on the internet and train a classifier to detect whether two faces are identical or not, then use this classifier to compare a new picture with each picture of your friends.

11. "Some methods of speeding up the convergence of iteration methods," B. Polyak (1964).

12. "A Method for Unconstrained Convex Minimization Problem with the Rate of Convergence O(1/k²)," Yurii Nesterov (1983).

13. "Adaptive Subgradient Methods for Online Learning and Stochastic Optimization," J. Duchi et al. (2011).

14. This algorithm was created by Tijmen Tieleman and Geoffrey Hinton in 2012, and presented by Geoffrey Hinton in his Coursera class on neural networks (slides: http://goo.gl/RsQeis; video: https://goo.gl/XUbIyJ). Amusingly, since the authors have not written a paper to describe it, researchers often cite "slide 29 in lecture 6" in their papers.

15. "Adam: A Method for Stochastic Optimization," D. Kingma, J. Ba (2015).

16. These are estimations of the mean and (uncentered) variance of the gradients. The mean is often called the first moment, while the variance is often called the second moment, hence the name of the algorithm.

17. "The Marginal Value of Adaptive Gradient Methods in Machine Learning," A. C. Wilson et al. (2017).

18. "Primal-Dual Subgradient Methods for Convex Problems," Yurii Nesterov (2005).

19. "Ad Click Prediction: a View from the Trenches," H. McMahan et al. (2013).

20. "An Empirical Study of Learning Rates in Deep Neural Networks for Speech Recognition," A. Senior et al. (2013).

21. "Improving neural networks by preventing co-adaptation of feature detectors," G. Hinton et al. (2012).

22. "Dropout: A Simple Way to Prevent Neural Networks from Overfitting," N. Srivastava et al. (2014).

Chapter 12. Distributing TensorFlow Across Devices and Servers

In Chapter 11 we discussed several techniques that can considerably speed up training: better weight initialization, Batch Normalization, sophisticated optimizers, and so on. However, even with all of these techniques, training a large neural network on a single machine with a single CPU can take days or even weeks.

In this chapter we will see how to use TensorFlow to distribute computations across multiple devices (CPUs and GPUs) and run them in parallel (see Figure 12-1). First we will distribute computations across multiple devices on just one machine, then on multiple devices across multiple machines.

Figure 12-1. Executing a TensorFlow graph across multiple devices in parallel

TensorFlow's support of distributed computing is one of its main highlights compared to other neural network frameworks. It gives you full control over how to split (or replicate) your computation graph across devices and servers, and it lets you parallelize and synchronize operations in flexible ways so you can choose between all sorts of parallelization approaches.

We will look at some of the most popular approaches to parallelizing the execution and training of a neural network. Instead of waiting for weeks for a training algorithm to complete, you may end up waiting for just a few hours. Not only does this save an enormous amount of time, it also means that you can experiment with various models much more easily, and frequently retrain your models on fresh data.

Other great use cases of parallelization include exploring a much larger hyperparameter space when fine-tuning your model, and running large ensembles of neural networks efficiently.

But we must learn to walk before we can run. Let's start by parallelizing simple graphs across several GPUs on a single machine.

Multiple Devices on a Single Machine

You can often get a major performance boost simply by adding GPU cards to a single machine. In fact, in many cases this will suffice; you won't need to use multiple machines at all. For example, you can typically train a neural network just as fast using 8 GPUs on a single machine rather than 16 GPUs across multiple machines (due to the extra delay imposed by network communications in a multimachine setup).

In this section we will look at how to set up your environment so that TensorFlow can use multiple GPU cards on one machine. Then we will look at how you can distribute operations across available devices and execute them in parallel.

Installation

In order to run TensorFlow on multiple GPU cards, you first need to make sure your GPU cards have an Nvidia Compute Capability greater than or equal to 3.0. This includes Nvidia's Titan, Titan X, K20, and K40 cards (if you own another card, you can check its compatibility at https://developer.nvidia.com/cuda-gpus).

TIP: If you don't own any GPU cards, you can use a hosting service with GPU capability such as Amazon AWS. Detailed instructions to set up TensorFlow 0.9 with Python 3.5 on an Amazon AWS GPU instance are available in Žiga Avsec's helpful blog post. It should not be too hard to update it to the latest version of TensorFlow. Google also released a cloud service called Cloud Machine Learning to run TensorFlow graphs. In May 2016, they announced that their platform now includes servers equipped with tensor processing units (TPUs), processors specialized for Machine Learning that are much faster than GPUs for many ML tasks. Of course, another option is simply to buy your own GPU card. Tim Dettmers wrote a great blog post to help you choose, and he updates it fairly regularly.

You must then download and install the appropriate version of the CUDA and cuDNN libraries (CUDA 8.0 and cuDNN 5.1 if you are using the binary installation of TensorFlow 1.0.0), and set a few environment variables so TensorFlow knows where to find CUDA and cuDNN. The detailed installation instructions are likely to change fairly quickly, so it is best that you follow the instructions on TensorFlow's website.

Nvidia's Compute Unified Device Architecture library (CUDA) allows developers to use CUDA-enabled GPUs for all sorts of computations (not just graphics acceleration). Nvidia's CUDA Deep Neural Network library (cuDNN) is a GPU-accelerated library of primitives for DNNs. It provides optimized implementations of common DNN computations such as activation layers, normalization, forward and backward convolutions, and pooling (see Chapter 13). It is part of Nvidia's Deep Learning SDK (note that it requires creating an Nvidia developer account in order to download it). TensorFlow uses CUDA and cuDNN to control the GPU cards and accelerate computations (see Figure 12-2).

Figure 12-2. TensorFlow uses CUDA and cuDNN to control GPUs and boost DNNs

You can use the nvidia-smi command to check that CUDA is properly installed. It lists the available GPU cards, as well as processes running on each card:

$ nvidia-smi
Wed Sep 16 09:50:03 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.63     Driver Version: 352.63         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID K520           Off  | 0000:00:03.0     Off |                  N/A |
| N/A   27C    P8    17W / 125W |    11MiB / 4095MiB   |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Finally, you must install TensorFlow with GPU support. If you created an isolated environment using virtualenv, you first need to activate it:

$ cd $ML_PATH              # Your ML working directory (e.g., $HOME/ml)
$ source env/bin/activate

Then install the appropriate GPU-enabled version of TensorFlow:

$ pip3 install --upgrade tensorflow-gpu

Now you can open up a Python shell and check that TensorFlow detects and uses CUDA and cuDNN properly by importing TensorFlow and creating a session:

>>> import tensorflow as tf
I [...]/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I [...]/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I [...]/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I [...]/dso_loader.cc:108] successfully opened CUDA library libcuda.so.1 locally
I [...]/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
>>> sess = tf.Session()
[...]
I [...]/gpu_init.cc:102] Found device 0 with properties:
name: GRID K520
major: 3 minor: 0 memoryClockRate (GHz) 0.797
pciBusID 0000:00:03.0
Total memory: 4.00GiB
Free memory: 3.95GiB
I [...]/gpu_init.cc:126] DMA: 0
I [...]/gpu_init.cc:136] 0:   Y
I [...]/gpu_device.cc:839] Creating TensorFlow device
(/gpu:0) -> (device: 0, name: GRID K520, pci bus id: 0000:00:03.0)

Looks good! TensorFlow detected the CUDA and cuDNN libraries, and it used the CUDA library to detect the GPU card (in this case an Nvidia Grid K520 card).

Managing the GPU RAM

By default TensorFlow automatically grabs all the RAM in all available GPUs the first time you run a graph, so you will not be able to start a second TensorFlow program while the first one is still running. If you try, you will get the following error:

E [...]/cuda_driver.cc:965] failed to allocate 3.66G (3928915968 bytes) from
device: CUDA_ERROR_OUT_OF_MEMORY

One solution is to run each process on different GPU cards. To do this, the simplest option is to set the CUDA_VISIBLE_DEVICES environment variable so that each process only sees the appropriate GPU cards. For example, you could start two programs like this:

$ CUDA_VISIBLE_DEVICES=0,1 python3 program_1.py
# and in another terminal:
$ CUDA_VISIBLE_DEVICES=3,2 python3 program_2.py

Program #1 will only see GPU cards 0 and 1 (numbered 0 and 1, respectively), and program #2 will only see GPU cards 2 and 3 (numbered 1 and 0, respectively). Everything will work fine (see Figure 12-3).

Figure 12-3. Each program gets two GPUs for itself

Another option is to tell TensorFlow to grab only a fraction of the memory. For example, to make TensorFlow grab only 40% of each GPU's memory, you must create a ConfigProto object, set its gpu_options.per_process_gpu_memory_fraction option to 0.4, and create the session using this configuration:

config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.4
session = tf.Session(config=config)

Now two programs like this one can run in parallel using the same GPU cards (but not three, since 3 × 0.4 > 1). See Figure 12-4.

Figure 12-4. Each program gets all four GPUs, but with only 40% of the RAM each

If you run the nvidia-smi command while both programs are running, you should see that each process holds roughly 40% of the total RAM of each card:

$ nvidia-smi
[...]
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      5231    C   python                                        1677MiB |
|    0      5262    C   python                                        1677MiB |
|    1      5231    C   python                                        1677MiB |
|    1      5262    C   python                                        1677MiB |
[...]

Yet another option is to tell TensorFlow to grab memory only when it needs it. To do this you must set config.gpu_options.allow_growth to True. However, TensorFlow never releases memory once it has grabbed it (to avoid memory fragmentation) so you may still run out of memory after a while. It may be harder to guarantee a deterministic behavior using this option, so in general you probably want to stick with one of the previous options.
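For completeness, here is a minimal sketch of that last option (same pattern as the memory-fraction example above):

config = tf.ConfigProto()
config.gpu_options.allow_growth = True   # allocate GPU memory on demand
session = tf.Session(config=config)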

Okay, now you have a working GPU-enabled TensorFlow installation. Let's see how to use it!

Placing Operations on Devices

The TensorFlow whitepaper1 presents a friendly dynamic placer algorithm that automagically distributes operations across all available devices, taking into account things like the measured computation time in previous runs of the graph, estimations of the size of the input and output tensors to each operation, the amount of RAM available in each device, communication delay when transferring data in and out of devices, hints and constraints from the user, and more. Unfortunately, this sophisticated algorithm is internal to Google; it was not released in the open source version of TensorFlow. The reason it was left out seems to be that in practice a small set of placement rules specified by the user actually results in more efficient placement than what the dynamic placer is capable of. However, the TensorFlow team is working on improving the dynamic placer, and perhaps it will eventually be good enough to be released.

Until then TensorFlow relies on the simple placer, which (as its name suggests) is very basic.

Simple placement

Whenever you run a graph, if TensorFlow needs to evaluate a node that is not placed on a device yet, it uses the simple placer to place it, along with all other nodes that are not placed yet. The simple placer respects the following rules:

If a node was already placed on a device in a previous run of the graph, it is left on that device.

Else, if the user pinned a node to a device (described next), the placer places it on that device.

Else, it defaults to GPU #0, or the CPU if there is no GPU.

As you can see, placing operations on the appropriate device is mostly up to you. If you don't do anything, the whole graph will be placed on the default device. To pin nodes onto a device, you must create a device block using the device() function. For example, the following code pins the variable a and the constant b on the CPU, but the multiplication node c is not pinned on any device, so it will be placed on the default device:

with tf.device("/cpu:0"):
    a = tf.Variable(3.0)
    b = tf.constant(4.0)

c = a * b

NOTE: The "/cpu:0" device aggregates all CPUs on a multi-CPU system. There is currently no way to pin nodes on specific CPUs or to use just a subset of all CPUs.

Logging placements

Let's check that the simple placer respects the placement constraints we have just defined. For this you can set the log_device_placement option to True; this tells the placer to log a message whenever it places a node. For example:

>>> config = tf.ConfigProto()
>>> config.log_device_placement = True
>>> sess = tf.Session(config=config)
I [...] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GRID K520,
pci bus id: 0000:00:03.0)
[...]
>>> a.initializer.run(session=sess)
I [...] a: /job:localhost/replica:0/task:0/cpu:0
I [...] a/read: /job:localhost/replica:0/task:0/cpu:0
I [...] mul: /job:localhost/replica:0/task:0/gpu:0
I [...] a/Assign: /job:localhost/replica:0/task:0/cpu:0
I [...] b: /job:localhost/replica:0/task:0/cpu:0
I [...] a/initial_value: /job:localhost/replica:0/task:0/cpu:0
>>> sess.run(c)
12

The lines starting with "I" for Info are the log messages. When we create a session, TensorFlow logs a message to tell us that it has found a GPU card (in this case the Grid K520 card). Then the first time we run the graph (in this case when initializing the variable a), the simple placer is run and places each node on the device it was assigned to. As expected, the log messages show that all nodes are placed on "/cpu:0" except the multiplication node, which ends up on the default device "/gpu:0" (you can safely ignore the prefix /job:localhost/replica:0/task:0 for now; we will talk about it in a moment). Notice that the second time we run the graph (to compute c), the placer is not used since all the nodes TensorFlow needs to compute c are already placed.

Dynamic placement function

When you create a device block, you can specify a function instead of a device name. TensorFlow will call this function for each operation it needs to place in the device block, and the function must return the name of the device to pin the operation on. For example, the following code pins all the variable nodes to "/cpu:0" (in this case just the variable a) and all other nodes to "/gpu:0":

def variables_on_cpu(op):
    if op.type == "Variable":
        return "/cpu:0"
    else:
        return "/gpu:0"

with tf.device(variables_on_cpu):
    a = tf.Variable(3.0)
    b = tf.constant(4.0)
    c = a * b

You can easily implement more complex algorithms, such as pinning variables across GPUs in a round-robin fashion.
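As a rough sketch of what such a round-robin placement function might look like (the two-GPU setup and the variables_on_gpus name are assumptions for illustration):

from itertools import cycle

gpu_devices = cycle(["/gpu:0", "/gpu:1"])   # assumed: two GPUs available

def variables_on_gpus(op):
    # Send each Variable to the next GPU in turn; everything else to GPU #0.
    if op.type == "Variable":
        return next(gpu_devices)
    else:
        return "/gpu:0"

with tf.device(variables_on_gpus):
    v1 = tf.Variable(1.0)   # placed on /gpu:0
    v2 = tf.Variable(2.0)   # placed on /gpu:1
    v3 = tf.Variable(3.0)   # placed on /gpu:0 again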

Operations and kernels

For a TensorFlow operation to run on a device, it needs to have an implementation for that device; this is called a kernel. Many operations have kernels for both CPUs and GPUs, but not all of them. For example, TensorFlow does not have a GPU kernel for integer variables, so the following code will fail when TensorFlow tries to place the variable i on GPU #0:

>>> with tf.device("/gpu:0"):
...     i = tf.Variable(3)
[...]
>>> sess.run(i.initializer)
Traceback (most recent call last):
[...]
tensorflow.python.framework.errors.InvalidArgumentError: Cannot assign a device
to node 'Variable': Could not satisfy explicit device specification

Note that TensorFlow infers that the variable must be of type int32 since the initialization value is an integer. If you change the initialization value to 3.0 instead of 3, or if you explicitly set dtype=tf.float32 when creating the variable, everything will work fine.
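For example, either of these variants should place without complaint, since float32 variables do have a GPU kernel:

with tf.device("/gpu:0"):
    i = tf.Variable(3.0)                   # inferred as float32
    j = tf.Variable(3, dtype=tf.float32)   # explicit dtype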

Soft placement

By default, if you try to pin an operation on a device for which the operation has no kernel, you get the exception shown earlier when TensorFlow tries to place the operation on the device. If you prefer TensorFlow to fall back to the CPU instead, you can set the allow_soft_placement configuration option to True:

with tf.device("/gpu:0"):
    i = tf.Variable(3)

config = tf.ConfigProto()
config.allow_soft_placement = True
sess = tf.Session(config=config)
sess.run(i.initializer)  # the placer runs and falls back to /cpu:0

So far we have discussed how to place nodes on different devices. Now let's see how TensorFlow will run these nodes in parallel.

Parallel Execution

When TensorFlow runs a graph, it starts by finding out the list of nodes that need to be evaluated, and it counts how many dependencies each of them has. TensorFlow then starts evaluating the nodes with zero dependencies (i.e., source nodes). If these nodes are placed on separate devices, they obviously get evaluated in parallel. If they are placed on the same device, they get evaluated in different threads, so they may run in parallel too (in separate GPU threads or CPU cores).

TensorFlow manages a thread pool on each device to parallelize operations (see Figure 12-5). These are called the inter-op thread pools. Some operations have multithreaded kernels: they can use other thread pools (one per device) called the intra-op thread pools.

Figure 12-5. Parallelized execution of a TensorFlow graph

For example, in Figure 12-5, operations A, B, and C are source ops, so they can immediately be evaluated. Operations A and B are placed on GPU #0, so they are sent to this device's inter-op thread pool, and immediately evaluated in parallel. Operation A happens to have a multithreaded kernel; its computations are split in three parts, which are executed in parallel by the intra-op thread pool. Operation C goes to GPU #1's inter-op thread pool.

As soon as operation C finishes, the dependency counters of operations D and E will be decremented and will both reach 0, so both operations will be sent to the inter-op thread pool to be executed.

TIP: You can control the number of threads per inter-op pool by setting the inter_op_parallelism_threads option. Note that the first session you start creates the inter-op thread pools. All other sessions will just reuse them unless you set the use_per_session_threads option to True. You can control the number of threads per intra-op pool by setting the intra_op_parallelism_threads option.
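These options live on the same ConfigProto object we used earlier; here is a minimal sketch (the thread counts are arbitrary placeholder values):

config = tf.ConfigProto()
config.inter_op_parallelism_threads = 4   # threads used to run independent ops
config.intra_op_parallelism_threads = 8   # threads used inside multithreaded kernels
sess = tf.Session(config=config)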

Control Dependencies

In some cases, it may be wise to postpone the evaluation of an operation even though all the operations it depends on have been executed. For example, if it uses a lot of memory but its value is needed only much further in the graph, it would be best to evaluate it at the last moment to avoid needlessly occupying RAM that other operations may need. Another example is a set of operations that depend on data located outside of the device. If they all run at the same time, they may saturate the device's communication bandwidth, and they will end up all waiting on I/O. Other operations that need to communicate data will also be blocked. It would be preferable to execute these communication-heavy operations sequentially, allowing the device to perform other operations in parallel.

To postpone evaluation of some nodes, a simple solution is to add control dependencies. For example, the following code tells TensorFlow to evaluate x and y only after a and b have been evaluated:

a = tf.constant(1.0)
b = a + 2.0

with tf.control_dependencies([a, b]):
    x = tf.constant(3.0)
    y = tf.constant(4.0)

z = x + y

Obviously, since z depends on x and y, evaluating z also implies waiting for a and b to be evaluated, even though it is not explicitly in the control_dependencies() block. Also, since b depends on a, we could simplify the preceding code by just creating a control dependency on [b] instead of [a, b], but in some cases "explicit is better than implicit."

Great! Now you know:

How to place operations on multiple devices in any way you please

How these operations get executed in parallel

How to create control dependencies to optimize parallel execution

It's time to distribute computations across multiple servers!

Multiple Devices Across Multiple Servers

To run a graph across multiple servers, you first need to define a cluster. A cluster is composed of one or more TensorFlow servers, called tasks, typically spread across several machines (see Figure 12-6). Each task belongs to a job. A job is just a named group of tasks that typically have a common role, such as keeping track of the model parameters (such a job is usually named "ps" for parameter server), or performing computations (such a job is usually named "worker").

Figure 12-6. TensorFlow cluster

The following cluster specification defines two jobs, "ps" and "worker", containing one task and two tasks, respectively. In this example, machine A hosts two TensorFlow servers (i.e., tasks), listening on different ports: one is part of the "ps" job, and the other is part of the "worker" job. Machine B just hosts one TensorFlow server, part of the "worker" job.

cluster_spec = tf.train.ClusterSpec({
    "ps": [
        "machine-a.example.com:2221",  # /job:ps/task:0
    ],
    "worker": [
        "machine-a.example.com:2222",  # /job:worker/task:0
        "machine-b.example.com:2222",  # /job:worker/task:1
    ]})

To start a TensorFlow server, you must create a Server object, passing it the cluster specification (so it can communicate with other servers) and its own job name and task number. For example, to start the first worker task, you would run the following code on machine A:

server = tf.train.Server(cluster_spec, job_name="worker", task_index=0)

It is usually simpler to just run one task per machine, but the previous example demonstrates that TensorFlow allows you to run multiple tasks on the same machine if you want.2 If you have several servers on one machine, you will need to ensure that they don't all try to grab all the RAM of every GPU, as explained earlier. For example, in Figure 12-6 the "ps" task does not see the GPU devices, since presumably its process was launched with CUDA_VISIBLE_DEVICES="". Note that the CPU is shared by all tasks located on the same machine.

If you want the process to do nothing other than run the TensorFlow server, you can block the main thread by telling it to wait for the server to finish using the join() method (otherwise the server will be killed as soon as your main thread exits). Since there is currently no way to stop the server, this will actually block forever:

server.join()  # blocks until the server stops (i.e., never)

Opening a Session

Once all the tasks are up and running (doing nothing yet), you can open a session on any of the servers, from a client located in any process on any machine (even from a process running one of the tasks), and use that session like a regular local session. For example:

a = tf.constant(1.0)
b = a + 2
c = a * 3

with tf.Session("grpc://machine-b.example.com:2222") as sess:
    print(c.eval())  # 9.0

This client code first creates a simple graph, then opens a session on the TensorFlow server located on machine B (which we will call the master), and instructs it to evaluate c. The master starts by placing the operations on the appropriate devices. In this example, since we did not pin any operation on any device, the master simply places them all on its own default device (in this case, machine B's GPU device). Then it just evaluates c as instructed by the client, and it returns the result.

The Master and Worker Services

The client uses the gRPC protocol (Google Remote Procedure Call) to communicate with the server. This is an efficient open source framework to call remote functions and get their outputs across a variety of platforms and languages.3 It is based on HTTP2, which opens a connection and leaves it open during the whole session, allowing efficient bidirectional communication once the connection is established. Data is transmitted in the form of protocol buffers, another open source Google technology. This is a lightweight binary data interchange format.

WARNING: All servers in a TensorFlow cluster may communicate with any other server in the cluster, so make sure to open the appropriate ports on your firewall.

Every TensorFlow server provides two services: the master service and the worker service. The master service allows clients to open sessions and use them to run graphs. It coordinates the computations across tasks, relying on the worker service to actually execute computations on other tasks and get their results.

This architecture gives you a lot of flexibility. One client can connect to multiple servers by opening multiple sessions in different threads. One server can handle multiple sessions simultaneously from one or more clients. You can run one client per task (typically within the same process), or just one client to control all tasks. All options are open.

Pinning Operations Across Tasks

You can use device blocks to pin operations on any device managed by any task, by specifying the job name, task index, device type, and device index. For example, the following code pins a to the CPU of the first task in the "ps" job (that's the CPU on machine A), and it pins b to the second GPU managed by the first task of the "worker" job (that's GPU #1 on machine A). Finally, c is not pinned to any device, so the master places it on its own default device (machine B's GPU #0 device).

with tf.device("/job:ps/task:0/cpu:0"):
    a = tf.constant(1.0)

with tf.device("/job:worker/task:0/gpu:1"):
    b = a + 2

c = a + b

As earlier, if you omit the device type and index, TensorFlow will default to the task's default device; for example, pinning an operation to "/job:ps/task:0" will place it on the default device of the first task of the "ps" job (machine A's CPU). If you also omit the task index (e.g., "/job:ps"), TensorFlow defaults to "/task:0". If you omit the job name and the task index, TensorFlow defaults to the session's master task.

Sharding Variables Across Multiple Parameter Servers

As we will see shortly, a common pattern when training a neural network on a distributed setup is to store the model parameters on a set of parameter servers (i.e., the tasks in the "ps" job) while other tasks focus on computations (i.e., the tasks in the "worker" job). For large models with millions of parameters, it is useful to shard these parameters across multiple parameter servers, to reduce the risk of saturating a single parameter server's network card. If you were to manually pin every variable to a different parameter server, it would be quite tedious. Fortunately, TensorFlow provides the replica_device_setter() function, which distributes variables across all the "ps" tasks in a round-robin fashion. For example, the following code pins five variables to two parameter servers:

with tf.device(tf.train.replica_device_setter(ps_tasks=2)):
    v1 = tf.Variable(1.0)  # pinned to /job:ps/task:0
    v2 = tf.Variable(2.0)  # pinned to /job:ps/task:1
    v3 = tf.Variable(3.0)  # pinned to /job:ps/task:0
    v4 = tf.Variable(4.0)  # pinned to /job:ps/task:1
    v5 = tf.Variable(5.0)  # pinned to /job:ps/task:0

Instead of passing the number of ps_tasks, you can pass the cluster spec cluster=cluster_spec and TensorFlow will simply count the number of tasks in the "ps" job.
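In other words, assuming the cluster_spec defined earlier, the following sketch is equivalent to passing ps_tasks=2:

with tf.device(tf.train.replica_device_setter(cluster=cluster_spec)):
    v1 = tf.Variable(1.0)  # still pinned to /job:ps/task:0
    v2 = tf.Variable(2.0)  # still pinned to /job:ps/task:1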

If you create other operations in the block, beyond just variables, TensorFlow automatically pins them to "/job:worker", which will default to the first device managed by the first task in the "worker" job. You can pin them to another device by setting the worker_device parameter, but a better approach is to use embedded device blocks. An inner device block can override the job, task, or device defined in an outer block. For example:

with tf.device(tf.train.replica_device_setter(ps_tasks=2)):
    v1 = tf.Variable(1.0)  # pinned to /job:ps/task:0 (+ defaults to /cpu:0)
    v2 = tf.Variable(2.0)  # pinned to /job:ps/task:1 (+ defaults to /cpu:0)
    v3 = tf.Variable(3.0)  # pinned to /job:ps/task:0 (+ defaults to /cpu:0)
    [...]
    s = v1 + v2            # pinned to /job:worker (+ defaults to task:0/gpu:0)
    with tf.device("/gpu:1"):
        p1 = 2 * s         # pinned to /job:worker/gpu:1 (+ defaults to /task:0)
        with tf.device("/task:1"):
            p2 = 3 * s     # pinned to /job:worker/task:1/gpu:1

NOTE: This example assumes that the parameter servers are CPU-only, which is typically the case since they only need to store and communicate parameters, not perform intensive computations.

Sharing State Across Sessions Using Resource Containers

When you are using a plain local session (not the distributed kind), each variable's state is managed by the session itself; as soon as it ends, all variable values are lost. Moreover, multiple local sessions cannot share any state, even if they both run the same graph; each session has its own copy of every variable (as we discussed in Chapter 9). In contrast, when you are using distributed sessions, variable state is managed by resource containers located on the cluster itself, not by the sessions. So if you create a variable named x using one client session, it will automatically be available to any other session on the same cluster (even if both sessions are connected to a different server). For example, consider the following client code:

# simple_client.py
import tensorflow as tf
import sys

x = tf.Variable(0.0, name="x")
increment_x = tf.assign(x, x + 1)

with tf.Session(sys.argv[1]) as sess:
    if sys.argv[2:] == ["init"]:
        sess.run(x.initializer)
    sess.run(increment_x)
    print(x.eval())

Let's suppose you have a TensorFlow cluster up and running on machines A and B, port 2222. You could launch the client, have it open a session with the server on machine A, and tell it to initialize the variable, increment it, and print its value by launching the following command:

$ python3 simple_client.py grpc://machine-a.example.com:2222 init
1.0

Now if you launch the client with the following command, it will connect to the server on machine B and magically reuse the same variable x (this time we don't ask the server to initialize the variable):

$ python3 simple_client.py grpc://machine-b.example.com:2222
2.0

This feature cuts both ways: it's great if you want to share variables across multiple sessions, but if you want to run completely independent computations on the same cluster you will have to be careful not to use the same variable names by accident. One way to ensure that you won't have name clashes is to wrap all of your construction phase inside a variable scope with a unique name for each computation, for example:

with tf.variable_scope("my_problem_1"):
    [...]  # Construction phase of problem 1

A better option is to use a container block:

with tf.container("my_problem_1"):
    [...]  # Construction phase of problem 1

This will use a container dedicated to problem #1, instead of the default one (whose name is an empty string ""). One advantage is that variable names remain nice and short. Another advantage is that you can easily reset a named container. For example, the following command will connect to the server on machine A and ask it to reset the container named "my_problem_1", which will free all the resources this container used (and also close all sessions open on the server). Any variable managed by this container must be initialized before you can use it again:

tf.Session.reset("grpc://machine-a.example.com:2222", ["my_problem_1"])

Resource containers make it easy to share variables across sessions in flexible ways. For example, Figure 12-7 shows four clients running different graphs on the same cluster, but sharing some variables. Clients A and B share the same variable x managed by the default container, while clients C and D share another variable named x managed by the container named "my_problem_1". Note that client C even uses variables from both containers.

Figure 12-7. Resource containers

Resource containers also take care of preserving the state of other stateful operations, namely queues and readers. Let's take a look at queues first.

Asynchronous Communication Using TensorFlow Queues

Queues are another great way to exchange data between multiple sessions; for example, one common use case is to have a client create a graph that loads the training data and pushes it into a queue, while another client creates a graph that pulls the data from the queue and trains a model (see Figure 12-8). This can speed up training considerably because the training operations don't have to wait for the next mini-batch at every step.

Figure 12-8. Using queues to load the training data asynchronously

TensorFlow provides various kinds of queues. The simplest kind is the first-in first-out (FIFO) queue. For example, the following code creates a FIFO queue that can store up to 10 tensors containing two float values each:

q = tf.FIFOQueue(capacity=10, dtypes=[tf.float32], shapes=[[2]],
                 name="q", shared_name="shared_q")

WARNING: To share variables across sessions, all you had to do was to specify the same name and container on both ends. With queues TensorFlow does not use the name attribute but instead uses shared_name, so it is important to specify it (even if it is the same as the name). And, of course, use the same container.

Enqueuing data

To push data to a queue, you must create an enqueue operation. For example, the following code pushes three training instances to the queue:

# training_data_loader.py
import tensorflow as tf

q = [...]

training_instance = tf.placeholder(tf.float32, shape=(2))
enqueue = q.enqueue([training_instance])

with tf.Session("grpc://machine-a.example.com:2222") as sess:
    sess.run(enqueue, feed_dict={training_instance: [1., 2.]})
    sess.run(enqueue, feed_dict={training_instance: [3., 4.]})
    sess.run(enqueue, feed_dict={training_instance: [5., 6.]})

Instead of enqueuing instances one by one, you can enqueue several at a time using an enqueue_many operation:

[...]

training_instances = tf.placeholder(tf.float32, shape=(None, 2))
enqueue_many = q.enqueue_many([training_instances])

with tf.Session("grpc://machine-a.example.com:2222") as sess:
    sess.run(enqueue_many,
             feed_dict={training_instances: [[1., 2.], [3., 4.], [5., 6.]]})

Both examples enqueue the same three tensors to the queue.

Dequeuing data

To pull the instances out of the queue, on the other end, you need to use a dequeue operation:

# trainer.py
import tensorflow as tf

q = [...]

dequeue = q.dequeue()

with tf.Session("grpc://machine-a.example.com:2222") as sess:
    print(sess.run(dequeue))  # [1., 2.]
    print(sess.run(dequeue))  # [3., 4.]
    print(sess.run(dequeue))  # [5., 6.]

In general you will want to pull a whole mini-batch at once, instead of pulling just one instance at a time. To do so, you must use a dequeue_many operation, specifying the mini-batch size:

[...]

batch_size = 2
dequeue_mini_batch = q.dequeue_many(batch_size)

with tf.Session("grpc://machine-a.example.com:2222") as sess:
    print(sess.run(dequeue_mini_batch))  # [[1., 2.], [3., 4.]]
    print(sess.run(dequeue_mini_batch))  # blocked waiting for another instance

When a queue is full, the enqueue operation will block until items are pulled out by a dequeue operation. Similarly, when a queue is empty (or you are using dequeue_many() and there are fewer items than the mini-batch size), the dequeue operation will block until enough items are pushed into the queue using an enqueue operation.

Queues of tuples

Each item in a queue can be a tuple of tensors (of various types and shapes) instead of just a single tensor. For example, the following queue stores pairs of tensors, one of type int32 and shape (), and the other of type float32 and shape [3, 2]:

q = tf.FIFOQueue(capacity=10, dtypes=[tf.int32, tf.float32], shapes=[[], [3, 2]],
                 name="q", shared_name="shared_q")

The enqueue operation must be given pairs of tensors (note that each pair represents only one item in the queue):

a = tf.placeholder(tf.int32, shape=())
b = tf.placeholder(tf.float32, shape=(3, 2))
enqueue = q.enqueue((a, b))

with tf.Session([...]) as sess:
    sess.run(enqueue, feed_dict={a: 10, b: [[1., 2.], [3., 4.], [5., 6.]]})
    sess.run(enqueue, feed_dict={a: 11, b: [[2., 4.], [6., 8.], [0., 2.]]})
    sess.run(enqueue, feed_dict={a: 12, b: [[3., 6.], [9., 2.], [5., 8.]]})

On the other end, the dequeue() function now creates a pair of dequeue operations:

dequeue_a, dequeue_b = q.dequeue()

In general, you should run these operations together:

with tf.Session([...]) as sess:
    a_val, b_val = sess.run([dequeue_a, dequeue_b])
    print(a_val)  # 10
    print(b_val)  # [[1., 2.], [3., 4.], [5., 6.]]

WARNING: If you run dequeue_a on its own, it will dequeue a pair and return only the first element; the second element will be lost (and similarly, if you run dequeue_b on its own, the first element will be lost).

The dequeue_many() function also returns a pair of operations:

batch_size = 2
dequeue_as, dequeue_bs = q.dequeue_many(batch_size)

You can use it as you would expect:

with tf.Session([...]) as sess:
    a, b = sess.run([dequeue_as, dequeue_bs])
    print(a)  # [10, 11]
    print(b)  # [[[1., 2.], [3., 4.], [5., 6.]], [[2., 4.], [6., 8.], [0., 2.]]]
    a, b = sess.run([dequeue_as, dequeue_bs])  # blocked waiting for another pair

Closing a queue

It is possible to close a queue to signal to the other sessions that no more data will be enqueued:

close_q = q.close()

with tf.Session([...]) as sess:
    [...]
    sess.run(close_q)

Subsequent executions of enqueue or enqueue_many operations will raise an exception. By default, any pending enqueue request will be honored, unless you call q.close(cancel_pending_enqueues=True).

Subsequent executions of dequeue or dequeue_many operations will continue to succeed as long as there are items in the queue, but they will fail when there are not enough items left in the queue. If you are using a dequeue_many operation and there are a few instances left in the queue, but fewer than the mini-batch size, they will be lost. You may prefer to use a dequeue_up_to operation instead; it behaves exactly like dequeue_many except when a queue is closed and there are fewer than batch_size instances left in the queue, in which case it just returns them.
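Its usage is otherwise identical to dequeue_many(), as in this minimal sketch:

dequeue_remaining = q.dequeue_up_to(batch_size)   # returns up to batch_size items
# once the queue is closed, this returns whatever is left instead of failing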

RandomShuffleQueue

TensorFlow also supports a couple more types of queues, including RandomShuffleQueue, which can be used just like a FIFOQueue except that items are dequeued in a random order. This can be useful to shuffle training instances at each epoch during training. First, let's create the queue:

q = tf.RandomShuffleQueue(capacity=50, min_after_dequeue=10,
                          dtypes=[tf.float32], shapes=[()],
                          name="q", shared_name="shared_q")

The min_after_dequeue specifies the minimum number of items that must remain in the queue after a dequeue operation. This ensures that there will be enough instances in the queue to have enough randomness (once the queue is closed, the min_after_dequeue limit is ignored). Now suppose that you enqueued 22 items in this queue (floats 1. to 22.). Here is how you could dequeue them:

dequeue = q.dequeue_many(5)

with tf.Session([...]) as sess:
    print(sess.run(dequeue))  # [20. 15. 11. 12.  4.] (17 items left)
    print(sess.run(dequeue))  # [ 5. 13.  6.  0. 17.] (12 items left)
    print(sess.run(dequeue))  # 12 - 5 < 10: blocked waiting for 3 more instances

PaddingFifoQueue

A PaddingFIFOQueue can also be used just like a FIFOQueue except that it accepts tensors of variable sizes along any dimension (but with a fixed rank). When you are dequeuing them with a dequeue_many or dequeue_up_to operation, each tensor is padded with zeros along every variable dimension to make it the same size as the largest tensor in the mini-batch. For example, you could enqueue 2D tensors (matrices) of arbitrary sizes:

q = tf.PaddingFIFOQueue(capacity=50, dtypes=[tf.float32], shapes=[(None, None)],
                        name="q", shared_name="shared_q")
v = tf.placeholder(tf.float32, shape=(None, None))
enqueue = q.enqueue([v])

with tf.Session([...]) as sess:
    sess.run(enqueue, feed_dict={v: [[1., 2.], [3., 4.], [5., 6.]]})        # 3x2
    sess.run(enqueue, feed_dict={v: [[1.]]})                                # 1x1
    sess.run(enqueue, feed_dict={v: [[7., 8., 9., 5.], [6., 7., 8., 9.]]})  # 2x4

If we just dequeue one item at a time, we get the exact same tensors that were enqueued. But if we dequeue several items at a time (using dequeue_many() or dequeue_up_to()), the queue automatically pads the tensors appropriately. For example, if we dequeue all three items at once, all tensors will be padded with zeros to become 3 × 4 tensors, since the maximum size for the first dimension is 3 (first item) and the maximum size for the second dimension is 4 (third item):

>>> q = [...]
>>> dequeue = q.dequeue_many(3)
>>> with tf.Session([...]) as sess:
...     print(sess.run(dequeue))
...
[[[ 1.  2.  0.  0.]
  [ 3.  4.  0.  0.]
  [ 5.  6.  0.  0.]]

 [[ 1.  0.  0.  0.]
  [ 0.  0.  0.  0.]
  [ 0.  0.  0.  0.]]

 [[ 7.  8.  9.  5.]
  [ 6.  7.  8.  9.]
  [ 0.  0.  0.  0.]]]

This type of queue can be useful when you are dealing with variable-length inputs, such as sequences of words (see Chapter 14).

Okay, now let's pause for a second: so far you have learned to distribute computations across multiple devices and servers, share variables across sessions, and communicate asynchronously using queues. Before you start training neural networks, though, there's one last topic we need to discuss: how to efficiently load training data.

Loading Data Directly from the Graph

So far we have assumed that the clients would load the training data and feed it to the cluster using placeholders. This is simple and works quite well for simple setups, but it is rather inefficient since it transfers the training data several times:

1. From the filesystem to the client

2. From the client to the master task

3. Possibly from the master task to other tasks where the data is needed

It gets worse if you have several clients training various neural networks using the same training data (for example, for hyperparameter tuning): if every client loads the data simultaneously, you may end up even saturating your file server or the network's bandwidth.

Preload the data into a variable

For datasets that can fit in memory, a better option is to load the training data once and assign it to a variable, then just use that variable in your graph. This is called preloading the training set. This way the data will be transferred only once from the client to the cluster (but it may still need to be moved around from task to task depending on which operations need it). The following code shows how to load the full training set into a variable:

training_set_init = tf.placeholder(tf.float32, shape=(None, n_features))
training_set = tf.Variable(training_set_init, trainable=False, collections=[],
                           name="training_set")

with tf.Session([...]) as sess:
    data = [...]  # load the training data from the datastore
    sess.run(training_set.initializer, feed_dict={training_set_init: data})

You must set trainable=False so the optimizers don't try to tweak this variable. You should also set collections=[] to ensure that this variable won't get added to the GraphKeys.GLOBAL_VARIABLES collection, which is used for saving and restoring checkpoints.

NOTE: This example assumes that all of your training set (including the labels) consists only of float32 values. If that's not the case, you will need one variable per type.

Reading the training data directly from the graph

If the training set does not fit in memory, a good solution is to use reader operations: these are operations capable of reading data directly from the filesystem. This way the training data never needs to flow through the clients at all. TensorFlow provides readers for various file formats:

CSV

Fixed-length binary records

TensorFlow's own TFRecords format, based on protocol buffers

Let's look at a simple example reading from a CSV file (for other formats, please check out the API documentation). Suppose you have a file named my_test.csv that contains training instances, and you want to create operations to read it. Suppose it has the following content, with two float features x1 and x2 and one integer target representing a binary class:

x1, x2, target
1., 2., 0
4., 5 , 1
7., , 0

First, let's create a TextLineReader to read this file. A TextLineReader opens a file (once we tell it which one to open) and reads lines one by one. It is a stateful operation, like variables and queues: it preserves its state across multiple runs of the graph, keeping track of which file it is currently reading and what its current position is in this file.

reader = tf.TextLineReader(skip_header_lines=1)

Next, we create a queue that the reader will pull from to know which file to read next. We also create an enqueue operation and a placeholder to push any filename we want to the queue, and we create an operation to close the queue once we have no more files to read:

filename_queue = tf.FIFOQueue(capacity=10, dtypes=[tf.string], shapes=[()])
filename = tf.placeholder(tf.string)
enqueue_filename = filename_queue.enqueue([filename])
close_filename_queue = filename_queue.close()

Now we are ready to create a read operation that will read one record (i.e., a line) at a time and return a key/value pair. The key is the record's unique identifier, a string composed of the filename, a colon (:), and the line number, and the value is simply a string containing the content of the line:

key, value = reader.read(filename_queue)

We have all we need to read the file line by line! But we are not quite done yet: we need to parse this string to get the features and target:

x1, x2, target = tf.decode_csv(value, record_defaults=[[-1.], [-1.], [-1]])
features = tf.stack([x1, x2])

The first line uses TensorFlow's CSV parser to extract the values from the current line. The default values are used when a field is missing (in this example the third training instance's x2 feature), and they are also used to determine the type of each field (in this case two floats and one integer).

Finally, we can push this training instance and its target to a RandomShuffleQueue that we will share with the training graph (so it can pull mini-batches from it), and we create an operation to close that queue when we are done pushing instances to it:

instance_queue = tf.RandomShuffleQueue(
    capacity=10, min_after_dequeue=2,
    dtypes=[tf.float32, tf.int32], shapes=[[2], []],
    name="instance_q", shared_name="shared_instance_q")

enqueue_instance = instance_queue.enqueue([features, target])
close_instance_queue = instance_queue.close()

Wow! That was a lot of work just to read a file. Plus we only created the graph, so now we need to run it:

with tf.Session([...]) as sess:
    sess.run(enqueue_filename, feed_dict={filename: "my_test.csv"})
    sess.run(close_filename_queue)
    try:
        while True:
            sess.run(enqueue_instance)
    except tf.errors.OutOfRangeError as ex:
        pass  # no more records in the current file and no more files to read
    sess.run(close_instance_queue)

First we open the session, and then we enqueue the filename "my_test.csv" and immediately close that queue since we will not enqueue any more filenames. Then we run an infinite loop to enqueue instances one by one. The enqueue_instance operation depends on the reader reading the next line, so at every iteration a new record is read until it reaches the end of the file. At that point it tries to read the filename queue to know which file to read next, and since the queue is closed it throws an OutOfRangeError exception (if we did not close the queue, it would just remain blocked until we pushed another filename or closed the queue). Lastly, we close the instance queue so that the training operations pulling from it won't get blocked forever. Figure 12-9 summarizes what we have learned; it represents a typical graph for reading training instances from a set of CSV files.

Figure 12-9. A graph dedicated to reading training instances from CSV files

In the training graph, you need to create the shared instance queue and simply dequeue mini-batches from it:

instance_queue = tf.RandomShuffleQueue([...], shared_name="shared_instance_q")
mini_batch_instances, mini_batch_targets = instance_queue.dequeue_up_to(2)
[...]  # use the mini_batch instances and targets to build the training graph
training_op = [...]

with tf.Session([...]) as sess:
    try:
        for step in range(max_steps):
            sess.run(training_op)
    except tf.errors.OutOfRangeError as ex:
        pass  # no more training instances

In this example, the first mini-batch will contain the first two instances of the CSV file, and the second mini-batch will contain the last instance.

WARNING: TensorFlow queues don't handle sparse tensors well, so if your training instances are sparse you should parse the records after the instance queue.

This architecture will only use one thread to read records and push them to the instance queue. You can get a much higher throughput by having multiple threads read simultaneously from multiple files using multiple readers. Let's see how.

Multithreaded readers using a Coordinator and a QueueRunner

To have multiple threads read instances simultaneously, you could create Python threads (using the threading module) and manage them yourself. However, TensorFlow provides some tools to make this simpler: the Coordinator class and the QueueRunner class.

A Coordinator is a very simple object whose sole purpose is to coordinate stopping multiple threads. First you create a Coordinator:

coord = tf.train.Coordinator()

Then you give it to all threads that need to stop jointly, and their main loop looks like this:

while not coord.should_stop():
    [...]  # do something

Any thread can request that every thread stop by calling the Coordinator's request_stop() method:

coord.request_stop()

Every thread will stop as soon as it finishes its current iteration. You can wait for all of the threads to finish by calling the Coordinator's join() method, passing it the list of threads:

coord.join(list_of_threads)
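For example, here is a minimal sketch (it is not the book's code; the worker function and its stop condition are made up purely for illustration) showing three Python threads cooperating through a Coordinator:

import threading
import tensorflow as tf

def worker(coord, worker_id):
    for step in range(1000):
        if coord.should_stop():
            break                      # another thread asked everyone to stop
        if worker_id == 0 and step == 10:  # arbitrary stop condition for this demo
            coord.request_stop()       # ask every thread to stop

coord = tf.train.Coordinator()
threads = [threading.Thread(target=worker, args=(coord, i)) for i in range(3)]
for thread in threads:
    thread.start()
coord.join(threads)  # wait for all three threads to finish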

A QueueRunner runs multiple threads that each run an enqueue operation repeatedly, filling up a queue as fast as possible. As soon as the queue is closed, the next thread that tries to push an item to the queue will get an OutOfRangeError; this thread catches the error and immediately tells other threads to stop using a Coordinator. The following code shows how you can use a QueueRunner to have five threads reading instances simultaneously and pushing them to an instance queue:

[...]  # same construction phase as earlier
queue_runner = tf.train.QueueRunner(instance_queue, [enqueue_instance] * 5)

with tf.Session() as sess:
    sess.run(enqueue_filename, feed_dict={filename: "my_test.csv"})
    sess.run(close_filename_queue)
    coord = tf.train.Coordinator()
    enqueue_threads = queue_runner.create_threads(sess, coord=coord, start=True)

The first line creates the QueueRunner and tells it to run five threads, all running the same enqueue_instance operation repeatedly. Then we start a session and we enqueue the name of the files to read (in this case just "my_test.csv"). Next we create a Coordinator that the QueueRunner will use to stop gracefully, as just explained. Finally, we tell the QueueRunner to create the threads and start them. The threads will read all training instances and push them to the instance queue, and then they will all stop gracefully.

This will be a bit more efficient than earlier, but we can do better. Currently all threads are reading from the same file. We can make them read simultaneously from separate files instead (assuming the training data is sharded across multiple CSV files) by creating multiple readers (see Figure 12-10).

Figure 12-10. Reading simultaneously from multiple files

For this we need to write a small function to create a reader and the nodes that will read and push one instance to the instance queue:

def read_and_push_instance(filename_queue, instance_queue):
    reader = tf.TextLineReader(skip_header_lines=1)
    key, value = reader.read(filename_queue)
    x1, x2, target = tf.decode_csv(value, record_defaults=[[-1.], [-1.], [-1]])
    features = tf.stack([x1, x2])
    enqueue_instance = instance_queue.enqueue([features, target])
    return enqueue_instance

Next we define the queues:

filename_queue = tf.FIFOQueue(capacity=10, dtypes=[tf.string], shapes=[()])
filename = tf.placeholder(tf.string)
enqueue_filename = filename_queue.enqueue([filename])
close_filename_queue = filename_queue.close()
instance_queue = tf.RandomShuffleQueue([...])

And finally we create the QueueRunner, but this time we give it a list of different enqueue operations. Each operation will use a different reader, so the threads will simultaneously read from different files:

read_and_enqueue_ops = [
    read_and_push_instance(filename_queue, instance_queue)
    for i in range(5)]
queue_runner = tf.train.QueueRunner(instance_queue, read_and_enqueue_ops)

The execution phase is then the same as before: first push the names of the files to read, then create a Coordinator and create and start the QueueRunner threads. This time all threads will read from different files simultaneously until all files are read entirely, and then the QueueRunner will close the instance queue so that other ops pulling from it don't get blocked.
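Concretely, the execution phase could look like the following sketch, reusing the enqueue_filename, filename, close_filename_queue, and queue_runner nodes defined above (the filenames are placeholders, not files from the book):

with tf.Session() as sess:
    for fname in ["my_train_0.csv", "my_train_1.csv"]:  # hypothetical CSV shards
        sess.run(enqueue_filename, feed_dict={filename: fname})
    sess.run(close_filename_queue)
    coord = tf.train.Coordinator()
    enqueue_threads = queue_runner.create_threads(sess, coord=coord, start=True)
    coord.join(enqueue_threads)  # wait until every file has been read and pushed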

Other convenience functions

TensorFlow also offers a few convenience functions to simplify some common tasks when reading training instances. We will go over just a few (see the API documentation for the full list).

The string_input_producer() takes a 1D tensor containing a list of filenames, creates a thread that pushes one filename at a time to the filename queue, and then closes the queue. If you specify a number of epochs, it will cycle through the filenames once per epoch before closing the queue. By default, it shuffles the filenames at each epoch. It creates a QueueRunner to manage its thread, and adds it to the GraphKeys.QUEUE_RUNNERS collection. To start every QueueRunner in that collection, you can call the tf.train.start_queue_runners() function. Note that if you forget to start the QueueRunner, the filename queue will be open and empty, and your readers will be blocked forever.

There are a few other producer functions that similarly create a queue and a corresponding QueueRunner for running an enqueue operation (e.g., input_producer(), range_input_producer(), and slice_input_producer()).

The shuffle_batch() function takes a list of tensors (e.g., [features, target]) and creates:

A RandomShuffleQueue

A QueueRunner to enqueue the tensors to the queue (added to the GraphKeys.QUEUE_RUNNERS collection)

A dequeue_many operation to extract a mini-batch from the queue

This makes it easy to manage in a single process a multithreaded input pipeline feeding a queue and a training pipeline reading mini-batches from that queue. Also check out the batch(), batch_join(), and shuffle_batch_join() functions that provide similar functionality.
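To see how these pieces fit together, here is a hedged sketch that reads the same my_test.csv file using string_input_producer() and shuffle_batch() instead of the manually built queues above (the batch size, capacity, and number of epochs are arbitrary choices for the example):

import tensorflow as tf

filename_queue = tf.train.string_input_producer(["my_test.csv"], num_epochs=1)
reader = tf.TextLineReader(skip_header_lines=1)
key, value = reader.read(filename_queue)
x1, x2, target = tf.decode_csv(value, record_defaults=[[-1.], [-1.], [-1]])
features = tf.stack([x1, x2])
X_batch, y_batch = tf.train.shuffle_batch(
    [features, target], batch_size=2, capacity=10, min_after_dequeue=2)

with tf.Session() as sess:
    # num_epochs is tracked with a local variable, hence the extra initializer
    sess.run([tf.global_variables_initializer(), tf.local_variables_initializer()])
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    try:
        while True:
            print(sess.run([X_batch, y_batch]))
    except tf.errors.OutOfRangeError:
        pass  # no more mini-batches
    coord.request_stop()
    coord.join(threads)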

Okay! You now have all the tools you need to start training and running neural networks efficiently across multiple devices and servers on a TensorFlow cluster. Let's review what you have learned:

Using multiple GPU devices

Setting up and starting a TensorFlow cluster

Distributing computations across multiple devices and servers

Sharing variables (and other stateful ops such as queues and readers) across sessions using containers

Coordinating multiple graphs working asynchronously using queues

Reading inputs efficiently using readers, queue runners, and coordinators

Now let's use all of this to parallelize neural networks!

Parallelizing Neural Networks on a TensorFlow Cluster

In this section, first we will look at how to parallelize several neural networks by simply placing each one on a different device. Then we will look at the much trickier problem of training a single neural network across multiple devices and servers.

One Neural Network per Device

The most trivial way to train and run neural networks on a TensorFlow cluster is to take the exact same code you would use for a single device on a single machine, and specify the master server's address when creating the session. That's it: you're done! Your code will be running on the server's default device. You can change the device that will run your graph simply by putting your code's construction phase within a device block.
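For example, here is a minimal sketch, assuming a worker task is already listening at the hypothetical address machine-a.example.com:2222 and has a GPU:

import tensorflow as tf

with tf.device("/job:worker/task:0/gpu:0"):  # pin the construction phase to a device
    x = tf.Variable(0.0)
    training_op = tf.assign_add(x, 1.0)      # stand-in for a real training operation

with tf.Session("grpc://machine-a.example.com:2222") as sess:  # hypothetical master address
    sess.run(x.initializer)
    sess.run(training_op)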

By running several client sessions in parallel (in different threads or different processes), connecting them to different servers, and configuring them to use different devices, you can quite easily train or run many neural networks in parallel, across all devices and all machines in your cluster (see Figure 12-11). The speedup is almost linear.4 Training 100 neural networks across 50 servers with 2 GPUs each will not take much longer than training just 1 neural network on 1 GPU.

Figure 12-11. Training one neural network per device

This solution is perfect for hyperparameter tuning: each device in the cluster will train a different model with its own set of hyperparameters. The more computing power you have, the larger the hyperparameter space you can explore.

It also works perfectly if you host a web service that receives a large number of queries per second (QPS) and you need your neural network to make a prediction for each query. Simply replicate the neural network across all devices on the cluster and dispatch queries across all devices. By adding more servers you can handle an unlimited number of QPS (however, this will not reduce the time it takes to process a single request since it will still have to wait for a neural network to make a prediction).

NOTE
Another option is to serve your neural networks using TensorFlow Serving. It is an open source system, released by Google in February 2016, designed to serve a high volume of queries to Machine Learning models (typically built with TensorFlow). It handles model versioning, so you can easily deploy a new version of your network to production, or experiment with various algorithms without interrupting your service, and it can sustain a heavy load by adding more servers. For more details, check out https://tensorflow.github.io/serving/.

In-Graph Versus Between-Graph Replication

You can also parallelize the training of a large ensemble of neural networks by simply placing every neural network on a different device (ensembles were introduced in Chapter 7). However, once you want to run the ensemble, you will need to aggregate the individual predictions made by each neural network to produce the ensemble's prediction, and this requires a bit of coordination.

There are two major approaches to handling a neural network ensemble (or any other graph that contains large chunks of independent computations):

You can create one big graph, containing every neural network, each pinned to a different device, plus the computations needed to aggregate the individual predictions from all the neural networks (see Figure 12-12). Then you just create one session to any server in the cluster and let it take care of everything (including waiting for all individual predictions to be available before aggregating them). This approach is called in-graph replication.

Figure 12-12. In-graph replication

Alternatively, you can create one separate graph for each neural network and handle synchronization between these graphs yourself. This approach is called between-graph replication. One typical implementation is to coordinate the execution of these graphs using queues (see Figure 12-13). A set of clients handles one neural network each, reading from its dedicated input queue, and writing to its dedicated prediction queue. Another client is in charge of reading the inputs and pushing them to all the input queues (copying all inputs to every queue). Finally, one last client is in charge of reading one prediction from each prediction queue and aggregating them to produce the ensemble's prediction.

Figure 12-13. Between-graph replication

These solutions have their pros and cons. In-graph replication is somewhat simpler to implement since you don't have to manage multiple clients and multiple queues. However, between-graph replication is a bit easier to organize into well-bounded and easy-to-test modules. Moreover, it gives you more flexibility. For example, you could add a dequeue timeout in the aggregator client so that the ensemble would not fail even if one of the neural network clients crashes or if one neural network takes too long to produce its prediction. TensorFlow lets you specify a timeout when calling the run() function by passing a RunOptions with timeout_in_ms:

with tf.Session([...]) as sess:
    [...]
    run_options = tf.RunOptions()
    run_options.timeout_in_ms = 1000  # 1s timeout
    try:
        pred = sess.run(dequeue_prediction, options=run_options)
    except tf.errors.DeadlineExceededError as ex:
        [...]  # the dequeue operation timed out after 1s

Another way you can specify a timeout is to set the session's operation_timeout_in_ms configuration option, but in this case the run() function times out if any operation takes longer than the timeout delay:

config = tf.ConfigProto()
config.operation_timeout_in_ms = 1000  # 1s timeout for every operation

with tf.Session([...], config=config) as sess:
    [...]
    try:
        pred = sess.run(dequeue_prediction)
    except tf.errors.DeadlineExceededError as ex:
        [...]  # the dequeue operation timed out after 1s

Model Parallelism

So far we have run each neural network on a single device. What if we want to run a single neural network across multiple devices? This requires chopping your model into separate chunks and running each chunk on a different device. This is called model parallelism. Unfortunately, model parallelism turns out to be pretty tricky, and it really depends on the architecture of your neural network. For fully connected networks, there is generally not much to be gained from this approach (see Figure 12-14). Intuitively, it may seem that an easy way to split the model is to place each layer on a different device, but this does not work since each layer needs to wait for the output of the previous layer before it can do anything. So perhaps you can slice it vertically; for example, with the left half of each layer on one device, and the right part on another device? This is slightly better, since both halves of each layer can indeed work in parallel, but the problem is that each half of the next layer requires the output of both halves, so there will be a lot of cross-device communication (represented by the dashed arrows). This is likely to completely cancel out the benefit of the parallel computation, since cross-device communication is slow (especially if it is across separate machines).

Figure 12-14. Splitting a fully connected neural network

However, as we will see in Chapter 13, some neural network architectures, such as convolutional neural networks, contain layers that are only partially connected to the lower layers, so it is much easier to distribute chunks across devices in an efficient way.

Figure 12-15. Splitting a partially connected neural network

Moreover, as we will see in Chapter 14, some deep recurrent neural networks are composed of several layers of memory cells (see the left side of Figure 12-16). A cell's output at time t is fed back to its input at time t + 1 (as you can see more clearly on the right side of Figure 12-16). If you split such a network horizontally, placing each layer on a different device, then at the first step only one device will be active, at the second step two will be active, and by the time the signal propagates to the output layer all devices will be active simultaneously. There is still a lot of cross-device communication going on, but since each cell may be fairly complex, the benefit of running multiple cells in parallel often outweighs the communication penalty.

Figure 12-16. Splitting a deep recurrent neural network

In short, model parallelism can speed up running or training some types of neural networks, but not all, and it requires special care and tuning, such as making sure that devices that need to communicate the most run on the same machine.

Data Parallelism

Another way to parallelize the training of a neural network is to replicate it on each device, run a training step simultaneously on all replicas using a different mini-batch for each, and then aggregate the gradients to update the model parameters. This is called data parallelism (see Figure 12-17).

Figure 12-17. Data parallelism

There are two variants of this approach: synchronous updates and asynchronous updates.

Synchronous updates

With synchronous updates, the aggregator waits for all gradients to be available before computing the average and applying the result (i.e., using the aggregated gradients to update the model parameters). Once a replica has finished computing its gradients, it must wait for the parameters to be updated before it can proceed to the next mini-batch. The downside is that some devices may be slower than others, so all other devices will have to wait for them at every step. Moreover, the parameters will be copied to every device almost at the same time (immediately after the gradients are applied), which may saturate the parameter servers' bandwidth.

TIP
To reduce the waiting time at each step, you could ignore the gradients from the slowest few replicas (typically ~10%). For example, you could run 20 replicas, but only aggregate the gradients from the fastest 18 replicas at each step, and just ignore the gradients from the last 2. As soon as the parameters are updated, the first 18 replicas can start working again immediately, without having to wait for the 2 slowest replicas. This setup is generally described as having 18 replicas plus 2 spare replicas.5

Asynchronous updates

With asynchronous updates, whenever a replica has finished computing the gradients, it immediately uses them to update the model parameters. There is no aggregation (remove the "mean" step in Figure 12-17), and no synchronization. Replicas just work independently of the other replicas. Since there is no waiting for the other replicas, this approach runs more training steps per minute. Moreover, although the parameters still need to be copied to every device at every step, this happens at different times for each replica so the risk of bandwidth saturation is reduced.

Data parallelism with asynchronous updates is an attractive choice, because of its simplicity, the absence of synchronization delay, and a better use of the bandwidth. However, although it works reasonably well in practice, it is almost surprising that it works at all! Indeed, by the time a replica has finished computing the gradients based on some parameter values, these parameters will have been updated several times by other replicas (on average N – 1 times if there are N replicas) and there is no guarantee that the computed gradients will still be pointing in the right direction (see Figure 12-18). When gradients are severely out-of-date, they are called stale gradients: they can slow down convergence, introducing noise and wobble effects (the learning curve may contain temporary oscillations), or they can even make the training algorithm diverge.

Figure 12-18. Stale gradients when using asynchronous updates

There are a few ways to reduce the effect of stale gradients:

Reduce the learning rate.

Drop stale gradients or scale them down.

Adjust the mini-batch size.

Start the first few epochs using just one replica (this is called the warmup phase). Stale gradients tend to be more damaging at the beginning of training, when gradients are typically large and the parameters have not settled into a valley of the cost function yet, so different replicas may push the parameters in quite different directions.

A paper published by the Google Brain team in April 2016 benchmarked various approaches and found that data parallelism with synchronous updates using a few spare replicas was the most efficient, not only converging faster but also producing a better model. However, this is still an active area of research, so you should not rule out asynchronous updates quite yet.

Bandwidth saturation

Whether you use synchronous or asynchronous updates, data parallelism still requires communicating the model parameters from the parameter servers to every replica at the beginning of every training step, and the gradients in the other direction at the end of each training step. Unfortunately, this means that there always comes a point where adding an extra GPU will not improve performance at all because the time spent moving the data in and out of GPU RAM (and possibly across the network) will outweigh the speedup obtained by splitting the computation load. At that point, adding more GPUs will just increase saturation and slow down training.

TIP
For some models, typically relatively small and trained on a very large training set, you are often better off training the model on a single machine with a single GPU.

Saturation is more severe for large dense models, since they have a lot of parameters and gradients to transfer. It is less severe for small models (but the parallelization gain is small) and also for large sparse models, since the gradients are typically mostly zeros, so they can be communicated efficiently. Jeff Dean, initiator and lead of the Google Brain project, reported typical speedups of 25–40x when distributing computations across 50 GPUs for dense models, and a 300x speedup for sparser models trained across 500 GPUs. As you can see, sparse models really do scale better. Here are a few concrete examples:

Neural Machine Translation: 6x speedup on 8 GPUs

Inception/ImageNet: 32x speedup on 50 GPUs

RankBrain: 300x speedup on 500 GPUs

These numbers represent the state of the art in Q1 2016. Beyond a few dozen GPUs for a dense model or a few hundred GPUs for a sparse model, saturation kicks in and performance degrades. There is plenty of research going on to solve this problem (exploring peer-to-peer architectures rather than centralized parameter servers, using lossy model compression, optimizing when and what the replicas need to communicate, and so on), so there will likely be a lot of progress in parallelizing neural networks in the next few years.

In the meantime, here are a few simple steps you can take to reduce the saturation problem:

Group your GPUs on a few servers rather than scattering them across many servers. This will avoid unnecessary network hops.

Shard the parameters across multiple parameter servers (as discussed earlier).

Drop the model parameters' float precision from 32 bits (tf.float32) to 16 bits (tf.bfloat16). This will cut in half the amount of data to transfer, without much impact on the convergence rate or the model's performance.
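For example, a hedged illustration of that last step (the parameter shape is arbitrary, just for the example):

import tensorflow as tf

params = tf.Variable(tf.random_normal([1000, 1000]))  # 32-bit parameters
params_bf16 = tf.cast(params, tf.bfloat16)            # half the bytes to transfer
params_back = tf.cast(params_bf16, tf.float32)        # cast back before computing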

TIP
Although 16-bit precision is the minimum for training neural networks, you can actually drop down to 8-bit precision after training to reduce the size of the model and speed up computations. This is called quantizing the neural network. It is particularly useful for deploying and running pretrained models on mobile phones. See Pete Warden's great post on the subject.

TensorFlow implementation

To implement data parallelism using TensorFlow, you first need to choose whether you want in-graph replication or between-graph replication, and whether you want synchronous updates or asynchronous updates. Let's look at how you would implement each combination (see the exercises and the Jupyter notebooks for complete code examples).

With in-graph replication + synchronous updates, you build one big graph containing all the model replicas (placed on different devices), and a few nodes to aggregate all their gradients and feed them to an optimizer. Your code opens a session to the cluster and simply runs the training operation repeatedly.

With in-graph replication + asynchronous updates, you also create one big graph, but with one optimizer per replica, and you run one thread per replica, repeatedly running the replica's optimizer.

With between-graph replication + asynchronous updates, you run multiple independent clients (typically in separate processes), each training the model replica as if it were alone in the world, but the parameters are actually shared with other replicas (using a resource container).

With between-graph replication + synchronous updates, once again you run multiple clients, each training a model replica based on shared parameters, but this time you wrap the optimizer (e.g., a MomentumOptimizer) within a SyncReplicasOptimizer. Each replica uses this optimizer as it would use any other optimizer, but under the hood this optimizer sends the gradients to a set of queues (one per variable), which is read by one of the replicas' SyncReplicasOptimizer, called the chief. The chief aggregates the gradients and applies them, then writes a token to a token queue for each replica, signaling it that it can go ahead and compute the next gradients. This approach supports having spare replicas.
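Here is a hedged sketch of the wrapping step only; the toy loss, the number of replicas, and the optimizer hyperparameters are assumptions made up for the example, not values from the book:

import tensorflow as tf

x = tf.Variable(0.0)
loss = tf.square(x - 1.0)  # toy loss, just for illustration
global_step = tf.Variable(0, trainable=False, name="global_step")

base_optimizer = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9)
optimizer = tf.train.SyncReplicasOptimizer(
    base_optimizer,
    replicas_to_aggregate=3,   # aggregate gradients from 3 replicas per step
    total_num_replicas=3)
training_op = optimizer.minimize(loss, global_step=global_step)
# The chief replica additionally runs a hook that manages the token queue:
sync_hook = optimizer.make_session_run_hook(is_chief=True)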

If you go through the exercises, you will implement each of these four solutions. You will easily be able to apply what you have learned to train large deep neural networks across dozens of servers and GPUs! In the following chapters we will go through a few more important neural network architectures before we tackle Reinforcement Learning.

Exercises

1. If you get a CUDA_ERROR_OUT_OF_MEMORY when starting your TensorFlow program, what is probably going on? What can you do about it?

2. What is the difference between pinning an operation on a device and placing an operation on a device?

3. If you are running on a GPU-enabled TensorFlow installation, and you just use the default placement, will all operations be placed on the first GPU?

4. If you pin a variable to "/gpu:0", can it be used by operations placed on /gpu:1? Or by operations placed on "/cpu:0"? Or by operations pinned to devices located on other servers?

5. Can two operations placed on the same device run in parallel?

6. What is a control dependency and when would you want to use one?

7. Suppose you train a DNN for days on a TensorFlow cluster, and immediately after your training program ends you realize that you forgot to save the model using a Saver. Is your trained model lost?

8. Train several DNNs in parallel on a TensorFlow cluster, using different hyperparameter values. This could be DNNs for MNIST classification or any other task you are interested in. The simplest option is to write a single client program that trains only one DNN, then run this program in multiple processes in parallel, with different hyperparameter values for each client. The program should have command-line options to control what server and device the DNN should be placed on, and what resource container and hyperparameter values to use (make sure to use a different resource container for each DNN). Use a validation set or cross-validation to select the top three models.

9. Create an ensemble using the top three models from the previous exercise. Define it in a single graph, ensuring that each DNN runs on a different device. Evaluate it on the validation set: does the ensemble perform better than the individual DNNs?

10. Train a DNN using between-graph replication and data parallelism with asynchronous updates, timing how long it takes to reach a satisfying performance. Next, try again using synchronous updates. Do synchronous updates produce a better model? Is training faster? Split the DNN vertically and place each vertical slice on a different device, and train the model again. Is training any faster? Is the performance any different?

Solutions to these exercises are available in Appendix A.

1. "TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems," Google Research (2015).

2. You can even start multiple tasks in the same process. It may be useful for tests, but it is not recommended in production.

3. It is the next version of Google's internal Stubby service, which Google has used successfully for over a decade. See http://grpc.io/ for more details.

4. Not 100% linear if you wait for all devices to finish, since the total time will be the time taken by the slowest device.

5. This name is slightly confusing since it sounds like some replicas are special, doing nothing. In reality, all replicas are equivalent: they all work hard to be among the fastest at each training step, and the losers vary at every step (unless some devices are really slower than others).

Chapter 13. Convolutional Neural Networks

Although IBM's Deep Blue supercomputer beat the chess world champion Garry Kasparov back in 1996, until quite recently computers were unable to reliably perform seemingly trivial tasks such as detecting a puppy in a picture or recognizing spoken words. Why are these tasks so effortless to us humans? The answer lies in the fact that perception largely takes place outside the realm of our consciousness, within specialized visual, auditory, and other sensory modules in our brains. By the time sensory information reaches our consciousness, it is already adorned with high-level features; for example, when you look at a picture of a cute puppy, you cannot choose not to see the puppy, or not to notice its cuteness. Nor can you explain how you recognize a cute puppy; it's just obvious to you. Thus, we cannot trust our subjective experience: perception is not trivial at all, and to understand it we must look at how the sensory modules work.

Convolutional neural networks (CNNs) emerged from the study of the brain's visual cortex, and they have been used in image recognition since the 1980s. In the last few years, thanks to the increase in computational power, the amount of available training data, and the tricks presented in Chapter 11 for training deep nets, CNNs have managed to achieve superhuman performance on some complex visual tasks. They power image search services, self-driving cars, automatic video classification systems, and more. Moreover, CNNs are not restricted to visual perception: they are also successful at other tasks, such as voice recognition or natural language processing (NLP); however, we will focus on visual applications for now.

In this chapter we will present where CNNs came from, what their building blocks look like, and how to implement them using TensorFlow. Then we will present some of the best CNN architectures.

The Architecture of the Visual Cortex

David H. Hubel and Torsten Wiesel performed a series of experiments on cats in 19581 and 19592 (and a few years later on monkeys3), giving crucial insights on the structure of the visual cortex (the authors received the Nobel Prize in Physiology or Medicine in 1981 for their work). In particular, they showed that many neurons in the visual cortex have a small local receptive field, meaning they react only to visual stimuli located in a limited region of the visual field (see Figure 13-1, in which the local receptive fields of five neurons are represented by dashed circles). The receptive fields of different neurons may overlap, and together they tile the whole visual field. Moreover, the authors showed that some neurons react only to images of horizontal lines, while others react only to lines with different orientations (two neurons may have the same receptive field but react to different line orientations). They also noticed that some neurons have larger receptive fields, and they react to more complex patterns that are combinations of the lower-level patterns. These observations led to the idea that the higher-level neurons are based on the outputs of neighboring lower-level neurons (in Figure 13-1, notice that each neuron is connected only to a few neurons from the previous layer). This powerful architecture is able to detect all sorts of complex patterns in any area of the visual field.

Figure 13-1. Local receptive fields in the visual cortex

These studies of the visual cortex inspired the neocognitron, introduced in 1980,4 which gradually evolved into what we now call convolutional neural networks. An important milestone was a 1998 paper5 by Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner, which introduced the famous LeNet-5 architecture, widely used to recognize handwritten check numbers. This architecture has some building blocks that you already know, such as fully connected layers and sigmoid activation functions, but it also introduces two new building blocks: convolutional layers and pooling layers. Let's look at them now.

NOTE
Why not simply use a regular deep neural network with fully connected layers for image recognition tasks? Unfortunately, although this works fine for small images (e.g., MNIST), it breaks down for larger images because of the huge number of parameters it requires. For example, a 100 × 100 image has 10,000 pixels, and if the first layer has just 1,000 neurons (which already severely restricts the amount of information transmitted to the next layer), this means a total of 10 million connections. And that's just the first layer. CNNs solve this problem using partially connected layers.

Convolutional Layer

The most important building block of a CNN is the convolutional layer:6 neurons in the first convolutional layer are not connected to every single pixel in the input image (like they were in previous chapters), but only to pixels in their receptive fields (see Figure 13-2). In turn, each neuron in the second convolutional layer is connected only to neurons located within a small rectangle in the first layer. This architecture allows the network to concentrate on low-level features in the first hidden layer, then assemble them into higher-level features in the next hidden layer, and so on. This hierarchical structure is common in real-world images, which is one of the reasons why CNNs work so well for image recognition.

Figure 13-2. CNN layers with rectangular local receptive fields

NOTE
Until now, all multilayer neural networks we looked at had layers composed of a long line of neurons, and we had to flatten input images to 1D before feeding them to the neural network. Now each layer is represented in 2D, which makes it easier to match neurons with their corresponding inputs.

A neuron located in row i, column j of a given layer is connected to the outputs of the neurons in the previous layer located in rows i to i + fh – 1, columns j to j + fw – 1, where fh and fw are the height and width of the receptive field (see Figure 13-3). In order for a layer to have the same height and width as the previous layer, it is common to add zeros around the inputs, as shown in the diagram. This is called zero padding.

Figure 13-3. Connections between layers and zero padding

It is also possible to connect a large input layer to a much smaller layer by spacing out the receptive fields, as shown in Figure 13-4. The distance between two consecutive receptive fields is called the stride. In the diagram, a 5 × 7 input layer (plus zero padding) is connected to a 3 × 4 layer, using 3 × 3 receptive fields and a stride of 2 (in this example the stride is the same in both directions, but it does not have to be so). A neuron located in row i, column j in the upper layer is connected to the outputs of the neurons in the previous layer located in rows i × sh to i × sh + fh – 1, columns j × sw to j × sw + fw – 1, where sh and sw are the vertical and horizontal strides.

Figure 13-4. Reducing dimensionality using a stride
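As a quick check of the arithmetic behind Figure 13-4 (a sketch, not book code): with zero padding, the output size is the input size divided by the stride, rounded up:

import math

input_height, input_width, stride = 5, 7, 2
print(math.ceil(input_height / stride),
      math.ceil(input_width / stride))  # 3 4, matching the 3 × 4 upper layer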

Filters

A neuron's weights can be represented as a small image the size of the receptive field. For example, Figure 13-5 shows two possible sets of weights, called filters (or convolution kernels). The first one is represented as a black square with a vertical white line in the middle (it is a 7 × 7 matrix full of 0s except for the central column, which is full of 1s); neurons using these weights will ignore everything in their receptive field except for the central vertical line (since all inputs will get multiplied by 0, except for the ones located in the central vertical line). The second filter is a black square with a horizontal white line in the middle. Once again, neurons using these weights will ignore everything in their receptive field except for the central horizontal line.

Now if all neurons in a layer use the same vertical line filter (and the same bias term), and you feed the network the input image shown in Figure 13-5 (bottom image), the layer will output the top-left image. Notice that the vertical white lines get enhanced while the rest gets blurred. Similarly, the upper-right image is what you get if all neurons use the horizontal line filter; notice that the horizontal white lines get enhanced while the rest is blurred out. Thus, a layer full of neurons using the same filter gives you a feature map, which highlights the areas in an image that are most similar to the filter. During training, a CNN finds the most useful filters for its task, and it learns to combine them into more complex patterns (e.g., a cross is an area in an image where both the vertical filter and the horizontal filter are active).

Figure 13-5. Applying two different filters to get two feature maps

Stacking Multiple Feature Maps

Up to now, for simplicity, we have represented each convolutional layer as a thin 2D layer, but in reality it is composed of several feature maps of equal sizes, so it is more accurately represented in 3D (see Figure 13-6). Within one feature map, all neurons share the same parameters (weights and bias term), but different feature maps may have different parameters. A neuron's receptive field is the same as described earlier, but it extends across all the previous layer's feature maps. In short, a convolutional layer simultaneously applies multiple filters to its inputs, making it capable of detecting multiple features anywhere in its inputs.

NOTE
The fact that all neurons in a feature map share the same parameters dramatically reduces the number of parameters in the model, but most importantly it means that once the CNN has learned to recognize a pattern in one location, it can recognize it in any other location. In contrast, once a regular DNN has learned to recognize a pattern in one location, it can recognize it only in that particular location.

Moreover, input images are also composed of multiple sublayers: one per color channel. There are typically three: red, green, and blue (RGB). Grayscale images have just one channel, but some images may have much more (for example, satellite images that capture extra light frequencies, such as infrared).

Figure 13-6. Convolution layers with multiple feature maps, and images with three channels

Specifically, a neuron located in row i, column j of the feature map k in a given convolutional layer l is connected to the outputs of the neurons in the previous layer l – 1, located in rows i × sh to i × sh + fh – 1 and columns j × sw to j × sw + fw – 1, across all feature maps (in layer l – 1). Note that all neurons located in the same row i and column j but in different feature maps are connected to the outputs of the exact same neurons in the previous layer.

Equation 13-1 summarizes the preceding explanations in one big mathematical equation: it shows how to compute the output of a given neuron in a convolutional layer. It is a bit ugly due to all the different indices, but all it does is calculate the weighted sum of all the inputs, plus the bias term.

Equation 13-1. Computing the output of a neuron in a convolutional layer

$z_{i,j,k} = b_k + \sum_{u=0}^{f_h - 1} \sum_{v=0}^{f_w - 1} \sum_{k'=0}^{f_{n'} - 1} x_{i',j',k'} \cdot w_{u,v,k',k} \quad \text{with} \quad i' = i \times s_h + u \;\text{ and }\; j' = j \times s_w + v$

zi,j,k is the output of the neuron located in row i, column j in feature map k of the convolutional layer (layer l).

As explained earlier, sh and sw are the vertical and horizontal strides, fh and fw are the height and width of the receptive field, and fn′ is the number of feature maps in the previous layer (layer l – 1).

xi′,j′,k′ is the output of the neuron located in layer l – 1, row i′, column j′, feature map k′ (or channel k′ if the previous layer is the input layer).

bk is the bias term for feature map k (in layer l). You can think of it as a knob that tweaks the overall brightness of the feature map k.

wu,v,k′,k is the connection weight between any neuron in feature map k of the layer l and its input located at row u, column v (relative to the neuron's receptive field), and feature map k′.
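If it helps to see the indices in action, here is a naive NumPy rendition of Equation 13-1 (purely illustrative and unoptimized; it assumes any zero padding has already been applied to the inputs):

import numpy as np

def conv_layer_output(inputs, weights, biases, sh, sw):
    # inputs: [height, width, fn_prev], weights: [fh, fw, fn_prev, fn], biases: [fn]
    fh, fw, fn_prev, fn = weights.shape
    out_h = (inputs.shape[0] - fh) // sh + 1
    out_w = (inputs.shape[1] - fw) // sw + 1
    z = np.zeros((out_h, out_w, fn))
    for i in range(out_h):
        for j in range(out_w):
            for k in range(fn):
                # receptive field of neuron (i, j), across all previous feature maps
                receptive_field = inputs[i * sh: i * sh + fh, j * sw: j * sw + fw, :]
                z[i, j, k] = np.sum(receptive_field * weights[:, :, :, k]) + biases[k]
    return z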

TensorFlow Implementation

In TensorFlow, each input image is typically represented as a 3D tensor of shape [height, width, channels]. A mini-batch is represented as a 4D tensor of shape [mini-batch size, height, width, channels]. The weights of a convolutional layer are represented as a 4D tensor of shape [fh, fw, fn′, fn]. The bias terms of a convolutional layer are simply represented as a 1D tensor of shape [fn].

Let's look at a simple example. The following code loads two sample images, using Scikit-Learn's load_sample_image() (one color image of a Chinese temple, and the other of a flower). Then it creates two 7 × 7 filters (one with a vertical white line in the middle, and the other with a horizontal white line in the middle), and applies them to both images using a convolutional layer built using TensorFlow's tf.nn.conv2d() function (with zero padding and a stride of 2). Finally, it plots one of the resulting feature maps (similar to the top-right image in Figure 13-5).

import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
from sklearn.datasets import load_sample_image

# Load sample images
china = load_sample_image("china.jpg")
flower = load_sample_image("flower.jpg")
dataset = np.array([china, flower], dtype=np.float32)
batch_size, height, width, channels = dataset.shape

# Create 2 filters
filters = np.zeros(shape=(7, 7, channels, 2), dtype=np.float32)
filters[:, 3, :, 0] = 1  # vertical line
filters[3, :, :, 1] = 1  # horizontal line

# Create a graph with input X plus a convolutional layer applying the 2 filters
X = tf.placeholder(tf.float32, shape=(None, height, width, channels))
convolution = tf.nn.conv2d(X, filters, strides=[1, 2, 2, 1], padding="SAME")

with tf.Session() as sess:
    output = sess.run(convolution, feed_dict={X: dataset})

plt.imshow(output[0, :, :, 1], cmap="gray")  # plot 1st image's 2nd feature map
plt.show()

Most of this code is self-explanatory, but the tf.nn.conv2d() line deserves a bit of explanation:

X is the input mini-batch (a 4D tensor, as explained earlier).

filters is the set of filters to apply (also a 4D tensor, as explained earlier).

strides is a four-element 1D array, where the two central elements are the vertical and horizontal strides (sh and sw). The first and last elements must currently be equal to 1. They may one day be used to specify a batch stride (to skip some instances) and a channel stride (to skip some of the previous layer's feature maps or channels).

padding must be either "VALID" or "SAME":

If set to "VALID", the convolutional layer does not use zero padding, and may ignore some rows and columns at the bottom and right of the input image, depending on the stride, as shown in Figure 13-7 (for simplicity, only the horizontal dimension is shown here, but of course the same logic applies to the vertical dimension).

If set to "SAME", the convolutional layer uses zero padding if necessary. In this case, the number of output neurons is equal to the number of input neurons divided by the stride, rounded up (in this example, ceil(13 / 5) = 3). Then zeros are added as evenly as possible around the inputs.

Figure 13-7. Padding options (input width: 13, filter width: 6, stride: 5)

In this simple example, we manually created the filters, but in a real CNN you would let the training algorithm discover the best filters automatically. TensorFlow has a tf.layers.conv2d() function which creates the filters variable for you (called kernel), and initializes it randomly. For example, the following code creates an input placeholder followed by a convolutional layer with two 7 × 7 feature maps, using 2 × 2 strides (note that this function only expects the vertical and horizontal strides), and "SAME" padding:

X = tf.placeholder(shape=(None, height, width, channels), dtype=tf.float32)
conv = tf.layers.conv2d(X, filters=2, kernel_size=7, strides=[2, 2],
                        padding="SAME")

Unfortunately, convolutional layers have quite a few hyperparameters: you must choose the number of filters, their height and width, the strides, and the padding type. As always, you can use cross-validation to find the right hyperparameter values, but this is very time-consuming. We will discuss common CNN architectures later, to give you some idea of what hyperparameter values work best in practice.

Memory Requirements

Another problem with CNNs is that the convolutional layers require a huge amount of RAM, especially during training, because the reverse pass of backpropagation requires all the intermediate values computed during the forward pass.

For example, consider a convolutional layer with 5 × 5 filters, outputting 200 feature maps of size 150 × 100, with stride 1 and SAME padding. If the input is a 150 × 100 RGB image (three channels), then the number of parameters is (5 × 5 × 3 + 1) × 200 = 15,200 (the +1 corresponds to the bias terms), which is fairly small compared to a fully connected layer.7 However, each of the 200 feature maps contains 150 × 100 neurons, and each of these neurons needs to compute a weighted sum of its 5 × 5 × 3 = 75 inputs: that's a total of 225 million float multiplications. Not as bad as a fully connected layer, but still quite computationally intensive. Moreover, if the feature maps are represented using 32-bit floats, then the convolutional layer's output will occupy 200 × 150 × 100 × 32 = 96 million bits (about 11.4 MB) of RAM.8 And that's just for one instance! If a training batch contains 100 instances, then this layer will use up over 1 GB of RAM!
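The arithmetic above is easy to reproduce (purely illustrative):

params = (5 * 5 * 3 + 1) * 200          # 15,200 parameters
multiplications = 200 * 150 * 100 * 75  # 225,000,000 float multiplications
output_bytes = 200 * 150 * 100 * 32 / 8  # 96 million bits = 12 million bytes
print(params, multiplications, output_bytes / (1024 * 1024))  # about 11.4 MB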

During inference (i.e., when making a prediction for a new instance) the RAM occupied by one layer can be released as soon as the next layer has been computed, so you only need as much RAM as required by two consecutive layers. But during training everything computed during the forward pass needs to be preserved for the reverse pass, so the amount of RAM needed is (at least) the total amount of RAM required by all layers.

TIP
If training crashes because of an out-of-memory error, you can try reducing the mini-batch size. Alternatively, you can try reducing dimensionality using a stride, or removing a few layers. Or you can try using 16-bit floats instead of 32-bit floats. Or you could distribute the CNN across multiple devices.

Now let's look at the second common building block of CNNs: the pooling layer.

Pooling Layer

Once you understand how convolutional layers work, the pooling layers are quite easy to grasp. Their goal is to subsample (i.e., shrink) the input image in order to reduce the computational load, the memory usage, and the number of parameters (thereby limiting the risk of overfitting). Reducing the input image size also makes the neural network tolerate a little bit of image shift (location invariance).

Just like in convolutional layers, each neuron in a pooling layer is connected to the outputs of a limited number of neurons in the previous layer, located within a small rectangular receptive field. You must define its size, the stride, and the padding type, just like before. However, a pooling neuron has no weights; all it does is aggregate the inputs using an aggregation function such as the max or mean. Figure 13-8 shows a max pooling layer, which is the most common type of pooling layer. In this example, we use a 2 × 2 pooling kernel, a stride of 2, and no padding. Note that only the max input value in each kernel makes it to the next layer. The other inputs are dropped.

Figure 13-8. Max pooling layer (2 × 2 pooling kernel, stride 2, no padding)

This is obviously a very destructive kind of layer: even with a tiny 2 × 2 kernel and a stride of 2, the output will be two times smaller in both directions (so its area will be four times smaller), simply dropping 75% of the input values.

A pooling layer typically works on every input channel independently, so the output depth is the same as the input depth. You may alternatively pool over the depth dimension, as we will see next, in which case the image's spatial dimensions (height and width) remain unchanged, but the number of channels is reduced.

Implementing a max pooling layer in TensorFlow is quite easy. The following code creates a max pooling layer using a 2 × 2 kernel, stride 2, and no padding, then applies it to all the images in the dataset:

[...]  # load the image dataset, just like above

# Create a graph with input X plus a max pooling layer
X = tf.placeholder(tf.float32, shape=(None, height, width, channels))
max_pool = tf.nn.max_pool(X, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding="VALID")

with tf.Session() as sess:
    output = sess.run(max_pool, feed_dict={X: dataset})

plt.imshow(output[0].astype(np.uint8))  # plot the output for the 1st image
plt.show()

The ksize argument contains the kernel shape along all four dimensions of the input tensor: [batch size, height, width, channels]. TensorFlow currently does not support pooling over multiple instances, so the first element of ksize must be equal to 1. Moreover, it does not support pooling over both the spatial dimensions (height and width) and the depth dimension, so either ksize[1] and ksize[2] must both be equal to 1, or ksize[3] must be equal to 1.

To create an average pooling layer, just use the avg_pool() function instead of max_pool().

Now you know all the building blocks to create a convolutional neural network. Let's see how to assemble them.

CNN Architectures

Typical CNN architectures stack a few convolutional layers (each one generally followed by a ReLU layer), then a pooling layer, then another few convolutional layers (+ ReLU), then another pooling layer, and so on. The image gets smaller and smaller as it progresses through the network, but it also typically gets deeper and deeper (i.e., with more feature maps) thanks to the convolutional layers (see Figure 13-9). At the top of the stack, a regular feedforward neural network is added, composed of a few fully connected layers (+ ReLUs), and the final layer outputs the prediction (e.g., a softmax layer that outputs estimated class probabilities).

Figure 13-9. Typical CNN architecture

TIP
A common mistake is to use convolution kernels that are too large. You can often get the same effect as a 5 × 5 kernel by stacking two 3 × 3 kernels on top of each other, for a lot less compute.

Over the years, variants of this fundamental architecture have been developed, leading to amazing advances in the field. A good measure of this progress is the error rate in competitions such as the ILSVRC ImageNet challenge. In this competition the top-5 error rate for image classification fell from over 26% to barely over 3% in just five years. The top-five error rate is the number of test images for which the system's top 5 predictions did not include the correct answer. The images are large (256 pixels high) and there are 1,000 classes, some of which are really subtle (try distinguishing 120 dog breeds). Looking at the evolution of the winning entries is a good way to understand how CNNs work.

We will first look at the classical LeNet-5 architecture (1998), then three of the winners of the ILSVRC challenge: AlexNet (2012), GoogLeNet (2014), and ResNet (2015).

OTHER VISUAL TASKS

There was stunning progress as well in other visual tasks such as object detection and localization, and image segmentation. In object detection and localization, the neural network typically outputs a sequence of bounding boxes around various objects in the image. For example, see Maxime Oquab et al.'s 2015 paper that outputs a heatmap for each object class, or Russell Stewart et al.'s 2015 paper that uses a combination of a CNN to detect faces and a recurrent neural network to output a sequence of bounding boxes around them. In image segmentation, the net outputs an image (usually of the same size as the input) where each pixel indicates the class of the object to which the corresponding input pixel belongs. For example, check out Evan Shelhamer et al.'s 2016 paper.

LeNet-5

The LeNet-5 architecture is perhaps the most widely known CNN architecture. As mentioned earlier, it was created by Yann LeCun in 1998 and widely used for handwritten digit recognition (MNIST). It is composed of the layers shown in Table 13-1.

Table 13-1. LeNet-5 architecture

Layer  Type             Maps  Size     Kernel size  Stride  Activation
Out    Fully Connected  –     10       –            –       RBF
F6     Fully Connected  –     84       –            –       tanh
C5     Convolution      120   1 × 1    5 × 5        1       tanh
S4     Avg Pooling      16    5 × 5    2 × 2        2       tanh
C3     Convolution      16    10 × 10  5 × 5        1       tanh
S2     Avg Pooling      6     14 × 14  2 × 2        2       tanh
C1     Convolution      6     28 × 28  5 × 5        1       tanh
In     Input            1     32 × 32  –            –       –

There are a few extra details to be noted:

MNIST images are 28 × 28 pixels, but they are zero-padded to 32 × 32 pixels and normalized before being fed to the network. The rest of the network does not use any padding, which is why the size keeps shrinking as the image progresses through the network.

The average pooling layers are slightly more complex than usual: each neuron computes the mean of its inputs, then multiplies the result by a learnable coefficient (one per map) and adds a learnable bias term (again, one per map), then finally applies the activation function.

Most neurons in C3 maps are connected to neurons in only three or four S2 maps (instead of all six S2 maps). See table 1 in the original paper for details.

The output layer is a bit special: instead of computing the dot product of the inputs and the weight vector, each neuron outputs the square of the Euclidean distance between its input vector and its weight vector. Each output measures how much the image belongs to a particular digit class. The cross entropy cost function is now preferred, as it penalizes bad predictions much more, producing larger gradients and thus converging faster.

Yann LeCun's website ("LENET" section) features great demos of LeNet-5 classifying digits.

AlexNet

The AlexNet CNN architecture9 won the 2012 ImageNet ILSVRC challenge by a large margin: it achieved 17% top-5 error rate while the second best achieved only 26%! It was developed by Alex Krizhevsky (hence the name), Ilya Sutskever, and Geoffrey Hinton. It is quite similar to LeNet-5, only much larger and deeper, and it was the first to stack convolutional layers directly on top of each other, instead of stacking a pooling layer on top of each convolutional layer. Table 13-2 presents this architecture.

Table 13-2. AlexNet architecture

Layer  Type             Maps     Size       Kernel size  Stride  Padding  Activation
Out    Fully Connected  –        1,000      –            –       –        Softmax
F9     Fully Connected  –        4,096      –            –       –        ReLU
F8     Fully Connected  –        4,096      –            –       –        ReLU
C7     Convolution      256      13 × 13    3 × 3        1       SAME     ReLU
C6     Convolution      384      13 × 13    3 × 3        1       SAME     ReLU
C5     Convolution      384      13 × 13    3 × 3        1       SAME     ReLU
S4     Max Pooling      256      13 × 13    3 × 3        2       VALID    –
C3     Convolution      256      27 × 27    5 × 5        1       SAME     ReLU
S2     Max Pooling      96       27 × 27    3 × 3        2       VALID    –
C1     Convolution      96       55 × 55    11 × 11      4       SAME     ReLU
In     Input            3 (RGB)  224 × 224  –            –       –        –

To reduce overfitting, the authors used two regularization techniques we discussed in previous chapters: first they applied dropout (with a 50% dropout rate) during training to the outputs of layers F8 and F9. Second, they performed data augmentation by randomly shifting the training images by various offsets, flipping them horizontally, and changing the lighting conditions.

AlexNet also uses a competitive normalization step immediately after the ReLU step of layers C1 and C3, called local response normalization. This form of normalization makes the neurons that most strongly activate inhibit neurons at the same location but in neighboring feature maps (such competitive activation has been observed in biological neurons). This encourages different feature maps to specialize, pushing them apart and forcing them to explore a wider range of features, ultimately improving generalization. Equation 13-2 shows how to apply LRN.

Equation 13-2. Local response normalization

$b_i = a_i \left( k + \alpha \sum_{j=j_{\text{low}}}^{j_{\text{high}}} a_j^2 \right)^{-\beta} \quad \text{with} \quad j_{\text{high}} = \min\left(i + \tfrac{r}{2},\, f_n - 1\right) \;\text{ and }\; j_{\text{low}} = \max\left(0,\, i - \tfrac{r}{2}\right)$

bi is the normalized output of the neuron located in feature map i, at some row u and column v (note that in this equation we consider only neurons located at this row and column, so u and v are not shown).

ai is the activation of that neuron after the ReLU step, but before normalization.

k, α, β, and r are hyperparameters. k is called the bias, and r is called the depth radius.

fn is the number of feature maps.

For example, if r = 2 and a neuron has a strong activation, it will inhibit the activation of the neurons located in the feature maps immediately above and below its own.

In AlexNet, the hyperparameters are set as follows: r = 2, α = 0.00002, β = 0.75, and k = 1. This step can be implemented using TensorFlow's tf.nn.local_response_normalization() operation.
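For instance, here is a hedged sketch applying it with the hyperparameters above to a placeholder standing in for a ReLU'd convolutional output (the shape is an assumption; also note that TensorFlow's depth_radius argument is the half-width of the normalization window, so the exact mapping to the book's r is worth double-checking):

import tensorflow as tf

conv_relu = tf.placeholder(tf.float32, shape=(None, 55, 55, 96))  # assumed activations
lrn = tf.nn.local_response_normalization(conv_relu, depth_radius=2,
                                         bias=1.0, alpha=0.00002, beta=0.75)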

A variant of AlexNet called ZFNet was developed by Matthew Zeiler and Rob Fergus and won the 2013 ILSVRC challenge. It is essentially AlexNet with a few tweaked hyperparameters (number of feature maps, kernel size, stride, etc.).

GoogLeNet

The GoogLeNet architecture was developed by Christian Szegedy et al. from Google Research,10 and it won the ILSVRC 2014 challenge by pushing the top-5 error rate below 7%. This great performance came in large part from the fact that the network was much deeper than previous CNNs (see Figure 13-11). This was made possible by sub-networks called inception modules,11 which allow GoogLeNet to use parameters much more efficiently than previous architectures: GoogLeNet actually has 10 times fewer parameters than AlexNet (roughly 6 million instead of 60 million).

Figure 13-10 shows the architecture of an inception module. The notation "3 × 3 + 2(S)" means that the layer uses a 3 × 3 kernel, stride 2, and SAME padding. The input signal is first copied and fed to four different layers. All convolutional layers use the ReLU activation function. Note that the second set of convolutional layers uses different kernel sizes (1 × 1, 3 × 3, and 5 × 5), allowing them to capture patterns at different scales. Also note that every single layer uses a stride of 1 and SAME padding (even the max pooling layer), so their outputs all have the same height and width as their inputs. This makes it possible to concatenate all the outputs along the depth dimension in the final depth concat layer (i.e., stack the feature maps from all four top convolutional layers). This concatenation layer can be implemented in TensorFlow using the tf.concat() operation, with axis=3 (axis 3 is the depth).

Figure 13-10. Inception module
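Here is a hedged, simplified sketch of that depth-concat step (it omits the 1 × 1 bottleneck layers that a real inception module places before the 3 × 3 and 5 × 5 convolutions and after the pooling layer; the input shape and numbers of feature maps are arbitrary):

import tensorflow as tf

X = tf.placeholder(tf.float32, shape=(None, 28, 28, 192))  # assumed input
branch1 = tf.layers.conv2d(X, filters=64, kernel_size=1, strides=1,
                           padding="SAME", activation=tf.nn.relu)
branch2 = tf.layers.conv2d(X, filters=128, kernel_size=3, strides=1,
                           padding="SAME", activation=tf.nn.relu)
branch3 = tf.layers.conv2d(X, filters=32, kernel_size=5, strides=1,
                           padding="SAME", activation=tf.nn.relu)
branch4 = tf.nn.max_pool(X, ksize=[1, 3, 3, 1], strides=[1, 1, 1, 1], padding="SAME")
output = tf.concat([branch1, branch2, branch3, branch4], axis=3)  # stack along depth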

You may wonder why inception modules have convolutional layers with 1 × 1 kernels. Surely these layers cannot capture any features since they look at only one pixel at a time? In fact, these layers serve two purposes:

First, they are configured to output many fewer feature maps than their inputs, so they serve as bottleneck layers, meaning they reduce dimensionality. This is particularly useful before the 3 × 3 and 5 × 5 convolutions, since these are very computationally expensive layers.

Second, each pair of convolutional layers ([1 × 1, 3 × 3] and [1 × 1, 5 × 5]) acts like a single, powerful convolutional layer, capable of capturing more complex patterns. Indeed, instead of sweeping a simple linear classifier across the image (as a single convolutional layer does), this pair of convolutional layers sweeps a two-layer neural network across the image.

In short, you can think of the whole inception module as a convolutional layer on steroids, able to output feature maps that capture complex patterns at various scales.

WARNING
The number of convolutional kernels for each convolutional layer is a hyperparameter. Unfortunately, this means that you have six more hyperparameters to tweak for every inception layer you add.

Now let's look at the architecture of the GoogLeNet CNN (see Figure 13-11). It is so deep that we had to represent it in three columns, but GoogLeNet is actually one tall stack, including nine inception modules (the boxes with the spinning tops) that actually contain three layers each. The number of feature maps output by each convolutional layer and each pooling layer is shown before the kernel size. The six numbers in the inception modules represent the number of feature maps output by each convolutional layer in the module (in the same order as in Figure 13-10). Note that all the convolutional layers use the ReLU activation function.

Figure 13-11. GoogLeNet architecture

Let's go through this network:

The first two layers divide the image's height and width by 4 (so its area is divided by 16), to reduce the computational load.

Then the local response normalization layer ensures that the previous layers learn a wide variety of features (as discussed earlier).

Two convolutional layers follow, where the first acts like a bottleneck layer. As explained earlier, you can think of this pair as a single smarter convolutional layer.

Again, a local response normalization layer ensures that the previous layers capture a wide variety of patterns.

Next a max pooling layer reduces the image height and width by 2, again to speed up computations.

Then comes the tall stack of nine inception modules, interleaved with a couple of max pooling layers to reduce dimensionality and speed up the net.

Next, the average pooling layer uses a kernel the size of the feature maps with VALID padding, outputting 1 × 1 feature maps: this surprising strategy is called global average pooling (a short sketch of this step follows the walkthrough). It effectively forces the previous layers to produce feature maps that are actually confidence maps for each target class (since other kinds of features would be destroyed by the averaging step). This makes it unnecessary to have several fully connected layers at the top of the CNN (like in AlexNet), considerably reducing the number of parameters in the network and limiting the risk of overfitting.

The last layers are self-explanatory: dropout for regularization, then a fully connected layer with a softmax activation function to output estimated class probabilities.
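A hedged sketch of the global average pooling step mentioned in the walkthrough above (the feature map shape is arbitrary): averaging each feature map over its full height and width leaves a single value per map:

import tensorflow as tf

feature_maps = tf.placeholder(tf.float32, shape=(None, 7, 7, 1024))  # assumed shape
global_avg = tf.reduce_mean(feature_maps, axis=[1, 2])               # shape: (None, 1024)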

This diagram is slightly simplified: the original GoogLeNet architecture also included two auxiliary classifiers plugged on top of the third and sixth inception modules. They were both composed of one average pooling layer, one convolutional layer, two fully connected layers, and a softmax activation layer. During training, their loss (scaled down by 70%) was added to the overall loss. The goal was to fight the vanishing gradients problem and regularize the network. However, it was shown that their effect was relatively minor.

ResNet

Last but not least, the winner of the ILSVRC 2015 challenge was the Residual Network (or ResNet), developed by Kaiming He et al.,12 which delivered an astounding top-5 error rate under 3.6%, using an extremely deep CNN composed of 152 layers. The key to being able to train such a deep network is to use skip connections (also called shortcut connections): the signal feeding into a layer is also added to the output of a layer located a bit higher up the stack. Let's see why this is useful.

When training a neural network, the goal is to make it model a target function h(x). If you add the input x to the output of the network (i.e., you add a skip connection), then the network will be forced to model f(x) = h(x) – x rather than h(x). This is called residual learning (see Figure 13-12).

Figure 13-12. Residual learning

When you initialize a regular neural network, its weights are close to zero, so the network just outputs values close to zero. If you add a skip connection, the resulting network just outputs a copy of its inputs; in other words, it initially models the identity function. If the target function is fairly close to the identity function (which is often the case), this will speed up training considerably.

Moreover, if you add many skip connections, the network can start making progress even if several layers have not started learning yet (see Figure 13-13). Thanks to skip connections, the signal can easily make its way across the whole network. The deep residual network can be seen as a stack of residual units, where each residual unit is a small neural network with a skip connection.

Figure 13-13. Regular deep neural network (left) and deep residual network (right)

Now let's look at ResNet's architecture (see Figure 13-14). It is actually surprisingly simple. It starts and ends exactly like GoogLeNet (except without a dropout layer), and in between is just a very deep stack of simple residual units. Each residual unit is composed of two convolutional layers, with Batch Normalization (BN) and ReLU activation, using 3 × 3 kernels and preserving spatial dimensions (stride 1, SAME padding).

Figure 13-14. ResNet architecture
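Here is a hedged sketch of one such residual unit (the helper name and the training flag are assumptions for the example; it also assumes the input already has n_maps channels so the addition is valid):

import tensorflow as tf

def residual_unit(inputs, n_maps, training):
    conv1 = tf.layers.conv2d(inputs, filters=n_maps, kernel_size=3, strides=1,
                             padding="SAME", use_bias=False)
    act1 = tf.nn.relu(tf.layers.batch_normalization(conv1, training=training))
    conv2 = tf.layers.conv2d(act1, filters=n_maps, kernel_size=3, strides=1,
                             padding="SAME", use_bias=False)
    bn2 = tf.layers.batch_normalization(conv2, training=training)
    return tf.nn.relu(bn2 + inputs)  # add the skip connection, then apply ReLU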

Note that the number of feature maps is doubled every few residual units, at the same time as their height and width are halved (using a convolutional layer with stride 2). When this happens the inputs cannot be added directly to the outputs of the residual unit since they don't have the same shape (for example, this problem affects the skip connection represented by the dashed arrow in Figure 13-14). To solve this problem, the inputs are passed through a 1 × 1 convolutional layer with stride 2 and the right number of output feature maps (see Figure 13-15).

Figure 13-15. Skip connection when changing feature map size and depth

ResNet-34 is the ResNet with 34 layers (only counting the convolutional layers and the fully connected layer) containing three residual units that output 64 feature maps, 4 RUs with 128 maps, 6 RUs with 256 maps, and 3 RUs with 512 maps.

ResNets deeper than that, such as ResNet-152, use slightly different residual units. Instead of two 3 × 3 convolutional layers with (say) 256 feature maps, they use three convolutional layers: first a 1 × 1 convolutional layer with just 64 feature maps (4 times less), which acts as a bottleneck layer (as discussed already), then a 3 × 3 layer with 64 feature maps, and finally another 1 × 1 convolutional layer with 256 feature maps (4 times 64) that restores the original depth. ResNet-152 contains three such RUs that output 256 maps, then 8 RUs with 512 maps, a whopping 36 RUs with 1,024 maps, and finally 3 RUs with 2,048 maps.

As you can see, the field is moving rapidly, with all sorts of architectures popping out every year. One clear trend is that CNNs keep getting deeper and deeper. They are also getting lighter, requiring fewer and fewer parameters. At present, the ResNet architecture is both the most powerful and arguably the simplest, so it is really the one you should probably use for now, but keep looking at the ILSVRC challenge every year. The 2016 winners were the Trimps-Soushen team from China with an astounding 2.99% error rate. To achieve this they trained combinations of the previous models and joined them into an ensemble. Depending on the task, the reduced error rate may or may not be worth the extra complexity.

There are a few other architectures that you may want to look at, in particular VGGNet13 (runner-up of the ILSVRC 2014 challenge) and Inception-v414 (which merges the ideas of GoogLeNet and ResNet and achieves close to 3% top-5 error rate on ImageNet classification).

NOTE
There is really nothing special about implementing the various CNN architectures we just discussed. We saw earlier how to build all the individual building blocks, so now all you need is to assemble them to create the desired architecture. We will build a complete CNN in the upcoming exercises and you will find full working code in the Jupyter notebooks.

TENSORFLOW CONVOLUTION OPERATIONS

TensorFlow also offers a few other kinds of convolutional layers (a short example follows this list):

tf.layers.conv1d() creates a convolutional layer for 1D inputs. This is useful, for example, in natural language processing, where a sentence may be represented as a 1D array of words, and the receptive field covers a few neighboring words.

tf.layers.conv3d() creates a convolutional layer for 3D inputs, such as a 3D PET scan.

tf.nn.atrous_conv2d() creates an atrous convolutional layer ("à trous" is French for "with holes"). This is equivalent to using a regular convolutional layer with a filter dilated by inserting rows and columns of zeros (i.e., holes). For example, a 1 × 3 filter equal to [[1,2,3]] may be dilated with a dilation rate of 4, resulting in a dilated filter [[1, 0, 0, 0, 2, 0, 0, 0, 3]]. This allows the convolutional layer to have a larger receptive field at no computational price and using no extra parameters.

tf.layers.conv2d_transpose() creates a transpose convolutional layer, sometimes called a deconvolutional layer,¹⁵ which upsamples an image. It does so by inserting zeros between the inputs, so you can think of this as a regular convolutional layer using a fractional stride. Upsampling is useful, for example, in image segmentation: in a typical CNN, feature maps get smaller and smaller as you progress through the network, so if you want to output an image of the same size as the input, you need an upsampling layer.

tf.nn.depthwise_conv2d() creates a depthwise convolutional layer that applies every filter to every individual input channel independently. Thus, if there are fn filters and fn′ input channels, then this will output fn × fn′ feature maps.

tf.layers.separable_conv2d() creates a separable convolutional layer that first acts like a depthwise convolutional layer, then applies a 1 × 1 convolutional layer to the resulting feature maps. This makes it possible to apply filters to arbitrary sets of input channels.
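For illustration, here is a small, hedged sketch of two of these operations; the placeholder shapes and filter values are made up for the example:

import numpy as np
import tensorflow as tf

# 1D convolution over sequences of 100 word vectors of size 50 (shapes chosen for illustration)
seq = tf.placeholder(tf.float32, shape=[None, 100, 50])
conv1 = tf.layers.conv1d(seq, filters=32, kernel_size=3, padding="same")

# Atrous convolution: a 1 x 3 filter [[1, 2, 3]] applied with a dilation rate of 4
images = tf.placeholder(tf.float32, shape=[None, 1, 9, 1])
fltr = tf.constant(np.array([1., 2., 3.], dtype=np.float32).reshape(1, 3, 1, 1))
dilated = tf.nn.atrous_conv2d(images, fltr, rate=4, padding="VALID")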

Exercises

1. What are the advantages of a CNN over a fully connected DNN for image classification?

2. Consider a CNN composed of three convolutional layers, each with 3 × 3 kernels, a stride of 2, and SAME padding. The lowest layer outputs 100 feature maps, the middle one outputs 200, and the top one outputs 400. The input images are RGB images of 200 × 300 pixels. What is the total number of parameters in the CNN? If we are using 32-bit floats, at least how much RAM will this network require when making a prediction for a single instance? What about when training on a mini-batch of 50 images?

3. If your GPU runs out of memory while training a CNN, what are five things you could try to solve the problem?

4. Why would you want to add a max pooling layer rather than a convolutional layer with the same stride?

5. When would you want to add a local response normalization layer?

6. Can you name the main innovations in AlexNet, compared to LeNet-5? What about the main innovations in GoogLeNet and ResNet?

7. Build your own CNN and try to achieve the highest possible accuracy on MNIST.

8. Classifying large images using Inception v3.

a. Download some images of various animals. Load them in Python, for example using the matplotlib.image.mpimg.imread() function or the scipy.misc.imread() function. Resize and/or crop them to 299 × 299 pixels, and ensure that they have just three channels (RGB), with no transparency channel.

b. Download the latest pretrained Inception v3 model: the checkpoint is available at https://goo.gl/nxSQvl.

c. Create the Inception v3 model by calling the inception_v3() function, as shown below. This must be done within an argument scope created by the inception_v3_arg_scope() function. Also, you must set is_training=False and num_classes=1001 like so:

from tensorflow.contrib.slim.nets import inception
import tensorflow.contrib.slim as slim

X = tf.placeholder(tf.float32, shape=[None, 299, 299, 3], name="X")
with slim.arg_scope(inception.inception_v3_arg_scope()):
    logits, end_points = inception.inception_v3(
        X, num_classes=1001, is_training=False)
predictions = end_points["Predictions"]
saver = tf.train.Saver()

d. Open a session and use the Saver to restore the pretrained model checkpoint you downloaded earlier.

e. Run the model to classify the images you prepared. Display the top five predictions for each image, along with the estimated probability (the list of class names is available at https://goo.gl/brXRtZ). How accurate is the model?

9. Transfer learning for large image classification.

a. Create a training set containing at least 100 images per class. For example, you could classify your own pictures based on the location (beach, mountain, city, etc.), or alternatively you can just use an existing dataset, such as the flowers dataset or MIT's places dataset (requires registration, and it is huge).

b. Write a preprocessing step that will resize and crop the image to 299 × 299, with some randomness for data augmentation.

c. Using the pretrained Inception v3 model from the previous exercise, freeze all layers up to the bottleneck layer (i.e., the last layer before the output layer), and replace the output layer with the appropriate number of outputs for your new classification task (e.g., the flowers dataset has five mutually exclusive classes so the output layer must have five neurons and use the softmax activation function).

d. Split your dataset into a training set and a test set. Train the model on the training set and evaluate it on the test set.

10. Go through TensorFlow's DeepDream tutorial. It is a fun way to familiarize yourself with various ways of visualizing the patterns learned by a CNN, and to generate art using Deep Learning.

Solutions to these exercises are available in Appendix A.

"Single Unit Activity in Striate Cortex of Unrestrained Cats," D. Hubel and T. Wiesel (1958).

"Receptive Fields of Single Neurones in the Cat's Striate Cortex," D. Hubel and T. Wiesel (1959).

"Receptive Fields and Functional Architecture of Monkey Striate Cortex," D. Hubel and T. Wiesel (1968).

"Neocognitron: A Self-organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position," K. Fukushima (1980).

"Gradient-Based Learning Applied to Document Recognition," Y. LeCun et al. (1998).

A convolution is a mathematical operation that slides one function over another and measures the integral of their pointwise multiplication. It has deep connections with the Fourier transform and the Laplace transform, and is heavily used in signal processing. Convolutional layers actually use cross-correlations, which are very similar to convolutions (see http://goo.gl/HAfxXd for more details).

A fully connected layer with 150 × 100 neurons, each connected to all 150 × 100 × 3 inputs, would have 150² × 100² × 3 = 675 million parameters!

1 MB = 1,024 kB = 1,024 × 1,024 bytes = 1,024 × 1,024 × 8 bits.

"ImageNet Classification with Deep Convolutional Neural Networks," A. Krizhevsky et al. (2012).

"Going Deeper with Convolutions," C. Szegedy et al. (2015).

In the 2010 movie Inception, the characters keep going deeper and deeper into multiple layers of dreams, hence the name of these modules.


"Deep Residual Learning for Image Recognition," K. He et al. (2015).

"Very Deep Convolutional Networks for Large-Scale Image Recognition," K. Simonyan and A. Zisserman (2015).

"Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning," C. Szegedy et al. (2016).

This name is quite misleading since this layer does not perform a deconvolution, which is a well-defined mathematical operation (the inverse of a convolution).


Chapter 14. Recurrent Neural Networks

Thebatterhitstheball.Youimmediatelystartrunning,anticipatingtheball’strajectory.Youtrackitandadaptyourmovements,andfinallycatchit(underathunderofapplause).Predictingthefutureiswhatyoudoallthetime,whetheryouarefinishingafriend’ssentenceoranticipatingthesmellofcoffeeatbreakfast.Inthischapter,wearegoingtodiscussrecurrentneuralnetworks(RNN),aclassofnetsthatcanpredictthefuture(well,uptoapoint,ofcourse).Theycananalyzetimeseriesdatasuchasstockprices,andtellyouwhentobuyorsell.Inautonomousdrivingsystems,theycananticipatecartrajectoriesandhelpavoidaccidents.Moregenerally,theycanworkonsequencesofarbitrarylengths,ratherthanonfixed-sizedinputslikeallthenetswehavediscussedsofar.Forexample,theycantakesentences,documents,oraudiosamplesasinput,makingthemextremelyusefulfornaturallanguageprocessing(NLP)systemssuchasautomatictranslation,speech-to-text,orsentimentanalysis(e.g.,readingmoviereviewsandextractingtherater’sfeelingaboutthemovie).

Moreover,RNNs’abilitytoanticipatealsomakesthemcapableofsurprisingcreativity.Youcanaskthemtopredictwhicharethemostlikelynextnotesinamelody,thenrandomlypickoneofthesenotesandplayit.Thenaskthenetforthenextmostlikelynotes,playit,andrepeattheprocessagainandagain.Beforeyouknowit,yournetwillcomposeamelodysuchastheoneproducedbyGoogle’sMagentaproject.Similarly,RNNscangeneratesentences,imagecaptions,andmuchmore.TheresultisnotexactlyShakespeareorMozartyet,butwhoknowswhattheywillproduceafewyearsfromnow?

Inthischapter,wewilllookatthefundamentalconceptsunderlyingRNNs,themainproblemtheyface(namely,vanishing/explodinggradients,discussedinChapter11),andthesolutionswidelyusedtofightit:LSTMandGRUcells.Alongtheway,asalways,wewillshowhowtoimplementRNNsusingTensorFlow.Finally,wewilltakealookatthearchitectureofamachinetranslationsystem.

Recurrent Neurons

Up to now we have mostly looked at feedforward neural networks, where the activations flow only in one direction, from the input layer to the output layer (except for a few networks in Appendix E). A recurrent neural network looks very much like a feedforward neural network, except it also has connections pointing backward. Let's look at the simplest possible RNN, composed of just one neuron receiving inputs, producing an output, and sending that output back to itself, as shown in Figure 14-1 (left). At each time step t (also called a frame), this recurrent neuron receives the inputs x(t) as well as its own output from the previous time step, y(t–1). We can represent this tiny network against the time axis, as shown in Figure 14-1 (right). This is called unrolling the network through time.

Figure 14-1. A recurrent neuron (left), unrolled through time (right)

Youcaneasilycreatealayerofrecurrentneurons.Ateachtimestept,everyneuronreceivesboththeinputvectorx(t)andtheoutputvectorfromtheprevioustimestepy(t–1),asshowninFigure14-2.Notethatboththeinputsandoutputsarevectorsnow(whentherewasjustasingleneuron,theoutputwasascalar).

Figure 14-2. A layer of recurrent neurons (left), unrolled through time (right)

Each recurrent neuron has two sets of weights: one for the inputs x(t) and the other for the outputs of the previous time step, y(t–1). Let's call these weight vectors wx and wy. The output of a recurrent layer can be computed pretty much as you might expect, as shown in Equation 14-1 (b is the bias term and ϕ(·) is the activation function, e.g., ReLU¹).

Equation 14-1. Output of a recurrent layer for a single instance
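Using the notation just introduced (wx, wy, b, and ϕ), the equation can be written as:

$$ \mathbf{y}_{(t)} = \phi\left(\mathbf{x}_{(t)}^T \cdot \mathbf{w}_x + \mathbf{y}_{(t-1)}^T \cdot \mathbf{w}_y + b\right) $$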

Just like for feedforward neural networks, we can compute a recurrent layer's output in one shot for a whole mini-batch using a vectorized form of the previous equation (see Equation 14-2).

Equation 14-2. Outputs of a layer of recurrent neurons for all instances in a mini-batch
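In terms of the matrices defined below, this can be written as:

$$ \begin{aligned} \mathbf{Y}_{(t)} &= \phi\left(\mathbf{X}_{(t)} \cdot \mathbf{W}_x + \mathbf{Y}_{(t-1)} \cdot \mathbf{W}_y + \mathbf{b}\right) \\ &= \phi\left(\left[\mathbf{X}_{(t)} \quad \mathbf{Y}_{(t-1)}\right] \cdot \mathbf{W} + \mathbf{b}\right) \quad \text{with} \quad \mathbf{W} = \begin{bmatrix} \mathbf{W}_x \\ \mathbf{W}_y \end{bmatrix} \end{aligned} $$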

Y(t) is an m × n_neurons matrix containing the layer's outputs at time step t for each instance in the mini-batch (m is the number of instances in the mini-batch and n_neurons is the number of neurons).

X(t) is an m × n_inputs matrix containing the inputs for all instances (n_inputs is the number of input features).

Wx is an n_inputs × n_neurons matrix containing the connection weights for the inputs of the current time step.

Wy is an n_neurons × n_neurons matrix containing the connection weights for the outputs of the previous time step.

The weight matrices Wx and Wy are often concatenated into a single weight matrix W of shape (n_inputs + n_neurons) × n_neurons (see the second line of Equation 14-2).

b is a vector of size n_neurons containing each neuron's bias term.

Notice that Y(t) is a function of X(t) and Y(t–1), which is a function of X(t–1) and Y(t–2), which is a function of X(t–2) and Y(t–3), and so on. This makes Y(t) a function of all the inputs since time t = 0 (that is, X(0), X(1), …, X(t)). At the first time step, t = 0, there are no previous outputs, so they are typically assumed to be all zeros.

Memory Cells

Since the output of a recurrent neuron at time step t is a function of all the inputs from previous time steps, you could say it has a form of memory. A part of a neural network that preserves some state across time steps is called a memory cell (or simply a cell). A single recurrent neuron, or a layer of recurrent neurons, is a very basic cell, but later in this chapter we will look at some more complex and powerful types of cells.

In general a cell's state at time step t, denoted h(t) (the "h" stands for "hidden"), is a function of some inputs at that time step and its state at the previous time step: h(t) = f(h(t–1), x(t)). Its output at time step t, denoted y(t), is also a function of the previous state and the current inputs. In the case of the basic cells we have discussed so far, the output is simply equal to the state, but in more complex cells this is not always the case, as shown in Figure 14-3.

Figure14-3.Acell’shiddenstateanditsoutputmaybedifferent

Input and Output Sequences

An RNN can simultaneously take a sequence of inputs and produce a sequence of outputs (see Figure 14-4, top-left network). For example, this type of network is useful for predicting time series such as stock prices: you feed it the prices over the last N days, and it must output the prices shifted by one day into the future (i.e., from N – 1 days ago to tomorrow).

Alternatively,youcouldfeedthenetworkasequenceofinputs,andignorealloutputsexceptforthelastone(seethetop-rightnetwork).Inotherwords,thisisasequence-to-vectornetwork.Forexample,youcouldfeedthenetworkasequenceofwordscorrespondingtoamoviereview,andthenetworkwouldoutputasentimentscore(e.g.,from–1[hate]to+1[love]).

Conversely,youcouldfeedthenetworkasingleinputatthefirsttimestep(andzerosforallothertimesteps),andletitoutputasequence(seethebottom-leftnetwork).Thisisavector-to-sequencenetwork.Forexample,theinputcouldbeanimage,andtheoutputcouldbeacaptionforthatimage.

Lastly,youcouldhaveasequence-to-vectornetwork,calledanencoder,followedbyavector-to-sequencenetwork,calledadecoder(seethebottom-rightnetwork).Forexample,thiscanbeusedfortranslatingasentencefromonelanguagetoanother.Youwouldfeedthenetworkasentenceinonelanguage,theencoderwouldconvertthissentenceintoasinglevectorrepresentation,andthenthedecoderwoulddecodethisvectorintoasentenceinanotherlanguage.Thistwo-stepmodel,calledanEncoder–Decoder,worksmuchbetterthantryingtotranslateontheflywithasinglesequence-to-sequenceRNN(liketheonerepresentedonthetopleft),sincethelastwordsofasentencecanaffectthefirstwordsofthetranslation,soyouneedtowaituntilyouhaveheardthewholesentencebeforetranslatingit.

Figure 14-4. Seq to seq (top left), seq to vector (top right), vector to seq (bottom left), delayed seq to seq (bottom right)

Sounds promising, so let's start coding!

Basic RNNs in TensorFlow

First, let's implement a very simple RNN model, without using any of TensorFlow's RNN operations, to better understand what goes on under the hood. We will create an RNN composed of a layer of five recurrent neurons (like the RNN represented in Figure 14-2), using the tanh activation function. We will assume that the RNN runs over only two time steps, taking input vectors of size 3 at each time step. The following code builds this RNN, unrolled through two time steps:

n_inputs = 3
n_neurons = 5

X0 = tf.placeholder(tf.float32, [None, n_inputs])
X1 = tf.placeholder(tf.float32, [None, n_inputs])

Wx = tf.Variable(tf.random_normal(shape=[n_inputs, n_neurons], dtype=tf.float32))
Wy = tf.Variable(tf.random_normal(shape=[n_neurons, n_neurons], dtype=tf.float32))
b = tf.Variable(tf.zeros([1, n_neurons], dtype=tf.float32))

Y0 = tf.tanh(tf.matmul(X0, Wx) + b)
Y1 = tf.tanh(tf.matmul(Y0, Wy) + tf.matmul(X1, Wx) + b)

init = tf.global_variables_initializer()

This network looks much like a two-layer feedforward neural network, with a few twists: first, the same weights and bias terms are shared by both layers, and second, we feed inputs at each layer, and we get outputs from each layer. To run the model, we need to feed it the inputs at both time steps, like so:

import numpy as np

# Mini-batch:        instance 0, instance 1, instance 2, instance 3
X0_batch = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 0, 1]])  # t = 0
X1_batch = np.array([[9, 8, 7], [0, 0, 0], [6, 5, 4], [3, 2, 1]])  # t = 1

with tf.Session() as sess:
    init.run()
    Y0_val, Y1_val = sess.run([Y0, Y1], feed_dict={X0: X0_batch, X1: X1_batch})

This mini-batch contains four instances, each with an input sequence composed of exactly two inputs. At the end, Y0_val and Y1_val contain the outputs of the network at both time steps for all neurons and all instances in the mini-batch:

>>> print(Y0_val)  # output at t = 0
[[-0.0664006   0.96257669  0.68105787  0.70918542 -0.89821595]   # instance 0
 [ 0.9977755  -0.71978885 -0.99657625  0.9673925  -0.99989718]   # instance 1
 [ 0.99999774 -0.99898815 -0.99999893  0.99677622 -0.99999988]   # instance 2
 [ 1.         -1.         -1.         -0.99818915  0.99950868]]  # instance 3
>>> print(Y1_val)  # output at t = 1
[[ 1.         -1.         -1.          0.40200216 -1.        ]   # instance 0
 [-0.12210433  0.62805319  0.96718419 -0.99371207 -0.25839335]   # instance 1
 [ 0.99999827 -0.9999994  -0.9999975  -0.85943311 -0.9999879 ]   # instance 2
 [ 0.99928284 -0.99999815 -0.99990582  0.98579615 -0.92205751]]  # instance 3

Thatwasn’ttoohard,butofcourseifyouwanttobeabletorunanRNNover100timesteps,thegraphisgoingtobeprettybig.Nowlet’slookathowtocreatethesamemodelusingTensorFlow’sRNNoperations.

Static Unrolling Through Time

The static_rnn() function creates an unrolled RNN network by chaining cells. The following code creates the exact same model as the previous one:

X0 = tf.placeholder(tf.float32, [None, n_inputs])
X1 = tf.placeholder(tf.float32, [None, n_inputs])

basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
output_seqs, states = tf.contrib.rnn.static_rnn(basic_cell, [X0, X1],
                                                dtype=tf.float32)
Y0, Y1 = output_seqs

Firstwecreatetheinputplaceholders,asbefore.ThenwecreateaBasicRNNCell,whichyoucanthinkofasafactorythatcreatescopiesofthecelltobuildtheunrolledRNN(oneforeachtimestep).Thenwecallstatic_rnn(),givingitthecellfactoryandtheinputtensors,andtellingitthedatatypeoftheinputs(thisisusedtocreatetheinitialstatematrix,whichbydefaultisfullofzeros).Thestatic_rnn()functioncallsthecellfactory’s__call__()functiononceperinput,creatingtwocopiesofthecell(eachcontainingalayeroffiverecurrentneurons),withsharedweightsandbiasterms,anditchainsthemjustlikewedidearlier.Thestatic_rnn()functionreturnstwoobjects.ThefirstisaPythonlistcontainingtheoutputtensorsforeachtimestep.Thesecondisatensorcontainingthefinalstatesofthenetwork.Whenyouareusingbasiccells,thefinalstateissimplyequaltothelastoutput.

Iftherewere50timesteps,itwouldnotbeveryconvenienttohavetodefine50inputplaceholdersand50outputtensors.Moreover,atexecutiontimeyouwouldhavetofeedeachofthe50placeholdersandmanipulatethe50outputs.Let’ssimplifythis.ThefollowingcodebuildsthesameRNNagain,butthistimeittakesasingleinputplaceholderofshape[None,n_steps,n_inputs]wherethefirstdimensionisthemini-batchsize.Thenitextractsthelistofinputsequencesforeachtimestep.X_seqsisaPythonlistofn_stepstensorsofshape[None,n_inputs],whereonceagainthefirstdimensionisthemini-batchsize.Todothis,wefirstswapthefirsttwodimensionsusingthetranspose()function,sothatthetimestepsarenowthefirstdimension.ThenweextractaPythonlistoftensorsalongthefirstdimension(i.e.,onetensorpertimestep)usingtheunstack()function.Thenexttwolinesarethesameasbefore.Finally,wemergealltheoutputtensorsintoasingletensorusingthestack()function,andweswapthefirsttwodimensionstogetafinaloutputstensorofshape[None,n_steps,n_neurons](againthefirstdimensionisthemini-batchsize).

X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
X_seqs = tf.unstack(tf.transpose(X, perm=[1, 0, 2]))

basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
output_seqs, states = tf.contrib.rnn.static_rnn(basic_cell, X_seqs,
                                                dtype=tf.float32)
outputs = tf.transpose(tf.stack(output_seqs), perm=[1, 0, 2])

Now we can run the network by feeding it a single tensor that contains all the mini-batch sequences:

X_batch = np.array([
        # t = 0      t = 1
        [[0, 1, 2], [9, 8, 7]],  # instance 0
        [[3, 4, 5], [0, 0, 0]],  # instance 1
        [[6, 7, 8], [6, 5, 4]],  # instance 2
        [[9, 0, 1], [3, 2, 1]],  # instance 3
    ])

with tf.Session() as sess:
    init.run()
    outputs_val = outputs.eval(feed_dict={X: X_batch})

And we get a single outputs_val tensor for all instances, all time steps, and all neurons:

>>>print(outputs_val)

[[[-0.912797270.83698678-0.892779410.80308062-0.5283336]

[-1.1.-0.997948290.99985468-0.99273592]]

[[-0.999943910.99951613-0.99469250.99030769-0.94413054]

[0.487333090.93389565-0.313620720.885736110.2424476]]

[[-1.0.99999875-0.999750140.99956584-0.99466234]

[-0.999948560.99999434-0.960581720.99784708-0.9099462]]

[[-0.959724250.999514820.96938795-0.969908-0.67668229]

[-0.845960140.962882280.96856463-0.14777924-0.9119423]]]

However, this approach still builds a graph containing one cell per time step. If there were 50 time steps, the graph would look pretty ugly. It is a bit like writing a program without ever using loops (e.g., Y0=f(0, X0); Y1=f(Y0, X1); Y2=f(Y1, X2); ...; Y50=f(Y49, X50)). With such a large graph, you may even get out-of-memory (OOM) errors during backpropagation (especially with the limited memory of GPU cards), since it must store all tensor values during the forward pass so it can use them to compute gradients during the reverse pass.

Fortunately, there is a better solution: the dynamic_rnn() function.

Dynamic Unrolling Through Time

The dynamic_rnn() function uses a while_loop() operation to run over the cell the appropriate number of times, and you can set swap_memory=True if you want it to swap the GPU's memory to the CPU's memory during backpropagation to avoid OOM errors. Conveniently, it also accepts a single tensor for all inputs at every time step (shape [None, n_steps, n_inputs]) and it outputs a single tensor for all outputs at every time step (shape [None, n_steps, n_neurons]); there is no need to stack, unstack, or transpose. The following code creates the same RNN as earlier using the dynamic_rnn() function. It's so much nicer!

X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])

basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
outputs, states = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32)

NOTEDuringbackpropagation,thewhile_loop()operationdoestheappropriatemagic:itstoresthetensorvaluesforeachiterationduringtheforwardpasssoitcanusethemtocomputegradientsduringthereversepass.

Handling Variable Length Input Sequences

So far we have used only fixed-size input sequences (all exactly two steps long). What if the input sequences have variable lengths (e.g., like sentences)? In this case you should set the sequence_length argument when calling the dynamic_rnn() (or static_rnn()) function; it must be a 1D tensor indicating the length of the input sequence for each instance. For example:

seq_length = tf.placeholder(tf.int32, [None])

[...]

outputs, states = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32,
                                    sequence_length=seq_length)

Forexample,supposethesecondinputsequencecontainsonlyoneinputinsteadoftwo.ItmustbepaddedwithazerovectorinordertofitintheinputtensorX(becausetheinputtensor’sseconddimensionisthesizeofthelongestsequence—i.e.,2).

X_batch = np.array([
        # step 0     step 1
        [[0, 1, 2], [9, 8, 7]],  # instance 0
        [[3, 4, 5], [0, 0, 0]],  # instance 1 (padded with a zero vector)
        [[6, 7, 8], [6, 5, 4]],  # instance 2
        [[9, 0, 1], [3, 2, 1]],  # instance 3
    ])
seq_length_batch = np.array([2, 1, 2, 2])

Of course, you now need to feed values for both placeholders X and seq_length:

with tf.Session() as sess:
    init.run()
    outputs_val, states_val = sess.run(
        [outputs, states], feed_dict={X: X_batch, seq_length: seq_length_batch})

NowtheRNNoutputszerovectorsforeverytimesteppasttheinputsequencelength(lookatthesecondinstance’soutputforthesecondtimestep):

>>>print(outputs_val)

[[[-0.68579948-0.25901747-0.80249101-0.18141513-0.37491536]

[-0.99996698-0.945011850.98072106-0.96897620.99966913]]#finalstate

[[-0.99099374-0.64768541-0.67801034-0.74154460.7719509]#finalstate

[0.0.0.0.0.]]#zerovector

[[-0.99978048-0.85583007-0.49696958-0.938385780.98505187]

[-0.99951065-0.891487960.94170523-0.384076570.97499216]]#finalstate

[[-0.02052618-0.945880470.999352040.372833310.9998163]

[-0.910523470.057694090.47446665-0.446110370.89394671]]]#finalstate

Moreover, the states tensor contains the final state of each cell (excluding the zero vectors):

>>>print(states_val)

[[-0.99996698-0.945011850.98072106-0.96897620.99966913]#t=1

[-0.99099374-0.64768541-0.67801034-0.74154460.7719509]#t=0!!!

[-0.99951065-0.891487960.94170523-0.384076570.97499216]#t=1

[-0.910523470.057694090.47446665-0.446110370.89394671]]#t=1

Handling Variable-Length Output Sequences

What if the output sequences have variable lengths as well? If you know in advance what length each sequence will have (for example if you know that it will be the same length as the input sequence), then you can set the sequence_length parameter as described above. Unfortunately, in general this will not be possible: for example, the length of a translated sentence is generally different from the length of the input sentence. In this case, the most common solution is to define a special output called an end-of-sequence token (EOS token). Any output past the EOS should be ignored (we will discuss this later in this chapter).

Okay, now you know how to build an RNN network (or more precisely an RNN network unrolled through time). But how do you train it?

Training RNNs

To train an RNN, the trick is to unroll it through time (like we just did) and then simply use regular backpropagation (see Figure 14-5). This strategy is called backpropagation through time (BPTT).

Figure 14-5. Backpropagation through time

Just like in regular backpropagation, there is a first forward pass through the unrolled network (represented by the dashed arrows); then the output sequence is evaluated using a cost function (where t_min and t_max are the first and last output time steps, not counting the ignored outputs), and the gradients of that cost function are propagated backward through the unrolled network (represented by the solid arrows); and finally the model parameters are updated using the gradients computed during BPTT. Note that the gradients flow backward through all the outputs used by the cost function, not just through the final output (for example, in Figure 14-5 the cost function is computed using the last three outputs of the network, Y(2), Y(3), and Y(4), so gradients flow through these three outputs, but not through Y(0) and Y(1)). Moreover, since the same parameters W and b are used at each time step, backpropagation will do the right thing and sum over all time steps.

Training a Sequence Classifier

Let's train an RNN to classify MNIST images. A convolutional neural network would be better suited for image classification (see Chapter 13), but this makes for a simple example that you are already familiar with. We will treat each image as a sequence of 28 rows of 28 pixels each (since each MNIST image is 28 × 28 pixels). We will use cells of 150 recurrent neurons, plus a fully connected layer containing 10 neurons (one per class) connected to the output of the last time step, followed by a softmax layer (see Figure 14-6).

Figure 14-6. Sequence classifier

The construction phase is quite straightforward; it's pretty much the same as the MNIST classifier we built in Chapter 10 except that an unrolled RNN replaces the hidden layers. Note that the fully connected layer is connected to the states tensor, which contains only the final state of the RNN (i.e., the 28th output). Also note that y is a placeholder for the target classes.

n_steps = 28
n_inputs = 28
n_neurons = 150
n_outputs = 10

learning_rate = 0.001

X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
y = tf.placeholder(tf.int32, [None])

basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
outputs, states = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32)

logits = tf.layers.dense(states, n_outputs)
xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y,
                                                          logits=logits)
loss = tf.reduce_mean(xentropy)
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(loss)
correct = tf.nn.in_top_k(logits, y, 1)
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

init = tf.global_variables_initializer()

Nowlet’sloadtheMNISTdataandreshapethetestdatato[batch_size,n_steps,n_inputs]asisexpectedbythenetwork.Wewilltakecareofreshapingthetrainingdatainamoment.

from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets("/tmp/data/")
X_test = mnist.test.images.reshape((-1, n_steps, n_inputs))
y_test = mnist.test.labels

Now we are ready to train the RNN. The execution phase is exactly the same as for the MNIST classifier in Chapter 10, except that we reshape each training batch before feeding it to the network.

n_epochs = 100
batch_size = 150

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(mnist.train.num_examples // batch_size):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            X_batch = X_batch.reshape((-1, n_steps, n_inputs))
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        acc_train = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
        acc_test = accuracy.eval(feed_dict={X: X_test, y: y_test})
        print(epoch, "Train accuracy:", acc_train, "Test accuracy:", acc_test)

The output should look like this:

0 Train accuracy: 0.94 Test accuracy: 0.9308
1 Train accuracy: 0.933333 Test accuracy: 0.9431
[...]
98 Train accuracy: 0.98 Test accuracy: 0.9794
99 Train accuracy: 1.0 Test accuracy: 0.9804

We get over 98% accuracy—not bad! Plus you would certainly get a better result by tuning the hyperparameters, initializing the RNN weights using He initialization, training longer, or adding a bit of regularization (e.g., dropout).

TIP You can specify an initializer for the RNN by wrapping its construction code in a variable scope (e.g., use variable_scope("rnn", initializer=variance_scaling_initializer()) to use He initialization).
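For example, a minimal sketch of this tip, reusing the X placeholder and n_neurons defined earlier (this is just one way to do it):

he_init = tf.contrib.layers.variance_scaling_initializer()
with tf.variable_scope("rnn", initializer=he_init):
    # Variables created inside this scope pick up the He initializer by default
    basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
    outputs, states = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32)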

Training to Predict Time Series

Now let's take a look at how to handle time series, such as stock prices, air temperature, brain wave patterns, and so on. In this section we will train an RNN to predict the next value in a generated time series. Each training instance is a randomly selected sequence of 20 consecutive values from the time series, and the target sequence is the same as the input sequence, except it is shifted by one time step into the future (see Figure 14-7).

Figure 14-7. Time series (left), and a training instance from that series (right)

First,let’screatetheRNN.Itwillcontain100recurrentneuronsandwewillunrollitover20timestepssinceeachtraininginstancewillbe20inputslong.Eachinputwillcontainonlyonefeature(thevalueatthattime).Thetargetsarealsosequencesof20inputs,eachcontainingasinglevalue.Thecodeisalmostthesameasearlier:

n_steps = 20
n_inputs = 1
n_neurons = 100
n_outputs = 1

X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
y = tf.placeholder(tf.float32, [None, n_steps, n_outputs])

cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons, activation=tf.nn.relu)
outputs, states = tf.nn.dynamic_rnn(cell, X, dtype=tf.float32)

NOTEIngeneralyouwouldhavemorethanjustoneinputfeature.Forexample,ifyouweretryingtopredictstockprices,youwouldlikelyhavemanyotherinputfeaturesateachtimestep,suchaspricesofcompetingstocks,ratingsfromanalysts,oranyotherfeaturethatmighthelpthesystemmakeitspredictions.

At each time step we now have an output vector of size 100. But what we actually want is a single output value at each time step. The simplest solution is to wrap the cell in an OutputProjectionWrapper. A cell wrapper acts like a normal cell, proxying every method call to an underlying cell, but it also adds some functionality. The OutputProjectionWrapper adds a fully connected layer of linear neurons (i.e., without any activation function) on top of each output (but it does not affect the cell state). All these fully connected layers share the same (trainable) weights and bias terms. The resulting RNN is represented in Figure 14-8.

Figure 14-8. RNN cells using output projections

Wrapping a cell is quite easy. Let's tweak the preceding code by wrapping the BasicRNNCell into an OutputProjectionWrapper:

cell = tf.contrib.rnn.OutputProjectionWrapper(
    tf.contrib.rnn.BasicRNNCell(num_units=n_neurons, activation=tf.nn.relu),
    output_size=n_outputs)

So far, so good. Now we need to define the cost function. We will use the Mean Squared Error (MSE), as we did in previous regression tasks. Next we will create an Adam optimizer, the training op, and the variable initialization op, as usual:

learning_rate = 0.001

loss = tf.reduce_mean(tf.square(outputs - y))
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(loss)

init = tf.global_variables_initializer()

Now on to the execution phase:

n_iterations = 1500
batch_size = 50

with tf.Session() as sess:
    init.run()
    for iteration in range(n_iterations):
        X_batch, y_batch = [...]  # fetch the next training batch
        sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        if iteration % 100 == 0:
            mse = loss.eval(feed_dict={X: X_batch, y: y_batch})
            print(iteration, "\tMSE:", mse)

Theprogram’soutputshouldlooklikethis:

0       MSE: 13.6543
100     MSE: 0.538476
200     MSE: 0.168532
300     MSE: 0.0879579
400     MSE: 0.0633425
[...]

Once the model is trained, you can make predictions:

X_new = [...]  # New sequences
y_pred = sess.run(outputs, feed_dict={X: X_new})

Figure 14-9 shows the predicted sequence for the instance we looked at earlier (in Figure 14-7), after just 1,000 training iterations.

Figure 14-9. Time series prediction

AlthoughusinganOutputProjectionWrapperisthesimplestsolutiontoreducethedimensionalityoftheRNN’soutputsequencesdowntojustonevaluepertimestep(perinstance),itisnotthemostefficient.Thereisatrickierbutmoreefficientsolution:youcanreshapetheRNNoutputsfrom[batch_size,n_steps,n_neurons]to[batch_size*n_steps,n_neurons],thenapplyasinglefullyconnectedlayerwiththeappropriateoutputsize(inourcasejust1),whichwillresultinanoutputtensorofshape[batch_size*n_steps,n_outputs],andthenreshapethistensorto[batch_size,n_steps,n_outputs].TheseoperationsarerepresentedinFigure14-10.

Figure14-10.Stackalltheoutputs,applytheprojection,thenunstacktheresult

To implement this solution, we first revert to a basic cell, without the OutputProjectionWrapper:

cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons, activation=tf.nn.relu)
rnn_outputs, states = tf.nn.dynamic_rnn(cell, X, dtype=tf.float32)

Thenwestackalltheoutputsusingthereshape()operation,applythefullyconnectedlinearlayer(withoutusinganyactivationfunction;thisisjustaprojection),andfinallyunstackalltheoutputs,againusingreshape():

stacked_rnn_outputs = tf.reshape(rnn_outputs, [-1, n_neurons])
stacked_outputs = tf.layers.dense(stacked_rnn_outputs, n_outputs)
outputs = tf.reshape(stacked_outputs, [-1, n_steps, n_outputs])

The rest of the code is the same as earlier. This can provide a significant speed boost since there is just one fully connected layer instead of one per time step.

Creative RNN

Now that we have a model that can predict the future, we can use it to generate some creative sequences, as explained at the beginning of the chapter. All we need is to provide it a seed sequence containing n_steps values (e.g., full of zeros), use the model to predict the next value, append this predicted value to the sequence, feed the last n_steps values to the model to predict the next value, and so on. This process generates a new sequence that has some resemblance to the original time series (see Figure 14-11).

sequence = [0.] * n_steps
for iteration in range(300):
    X_batch = np.array(sequence[-n_steps:]).reshape(1, n_steps, 1)
    y_pred = sess.run(outputs, feed_dict={X: X_batch})
    sequence.append(y_pred[0, -1, 0])

Figure 14-11. Creative sequences, seeded with zeros (left) or with an instance (right)

NowyoucantrytofeedallyourJohnLennonalbumstoanRNNandseeifitcangeneratethenext“Imagine.”However,youwillprobablyneedamuchmorepowerfulRNN,withmoreneurons,andalsomuchdeeper.Let’slookatdeepRNNsnow.

Deep RNNs

It is quite common to stack multiple layers of cells, as shown in Figure 14-12. This gives you a deep RNN.

To implement a deep RNN in TensorFlow, you can create several cells and stack them into a MultiRNNCell. In the following code we stack three identical cells (but you could very well use various kinds of cells with a different number of neurons):

n_neurons = 100
n_layers = 3

layers = [tf.contrib.rnn.BasicRNNCell(num_units=n_neurons,
                                      activation=tf.nn.relu)
          for layer in range(n_layers)]
multi_layer_cell = tf.contrib.rnn.MultiRNNCell(layers)
outputs, states = tf.nn.dynamic_rnn(multi_layer_cell, X, dtype=tf.float32)

Figure 14-12. Deep RNN (left), unrolled through time (right)

That’sallthereistoit!Thestatesvariableisatuplecontainingonetensorperlayer,eachrepresentingthefinalstateofthatlayer’scell(withshape[batch_size,n_neurons]).Ifyousetstate_is_tuple=FalsewhencreatingtheMultiRNNCell,thenstatesbecomesasingletensorcontainingthestatesfromeverylayer,concatenatedalongthecolumnaxis(i.e.,itsshapeis[batch_size,n_layers*n_neurons]).NotethatbeforeTensorFlow0.11.0,thisbehaviorwasthedefault.

Distributing a Deep RNN Across Multiple GPUs

Chapter 12 pointed out that we can efficiently distribute deep RNNs across multiple GPUs by pinning each layer to a different GPU (see Figure 12-16). However, if you try to create each cell in a different device() block, it will not work:

with tf.device("/gpu:0"):  # BAD! This is ignored.
    layer1 = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)

with tf.device("/gpu:1"):  # BAD! Ignored again.
    layer2 = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)

ThisfailsbecauseaBasicRNNCellisacellfactory,notacellperse(asmentionedearlier);nocellsgetcreatedwhenyoucreatethefactory,andthusnovariablesdoeither.Thedeviceblockissimplyignored.Thecellsactuallygetcreatedlater.Whenyoucalldynamic_rnn(),itcallstheMultiRNNCell,whichcallseachindividualBasicRNNCell,whichcreatetheactualcells(includingtheirvariables).Unfortunately,noneoftheseclassesprovideanywaytocontrolthedevicesonwhichthevariablesgetcreated.Ifyoutrytoputthedynamic_rnn()callwithinadeviceblock,thewholeRNNgetspinnedtoasingledevice.Soareyoustuck?Fortunatelynot!Thetrickistocreateyourowncellwrapper:

import tensorflow as tf

class DeviceCellWrapper(tf.contrib.rnn.RNNCell):
    def __init__(self, device, cell):
        self._cell = cell
        self._device = device

    @property
    def state_size(self):
        return self._cell.state_size

    @property
    def output_size(self):
        return self._cell.output_size

    def __call__(self, inputs, state, scope=None):
        with tf.device(self._device):
            return self._cell(inputs, state, scope)

This wrapper simply proxies every method call to another cell, except it wraps the __call__() function within a device block.² Now you can distribute each layer on a different GPU:

devices = ["/gpu:0", "/gpu:1", "/gpu:2"]
cells = [DeviceCellWrapper(dev, tf.contrib.rnn.BasicRNNCell(num_units=n_neurons))
         for dev in devices]
multi_layer_cell = tf.contrib.rnn.MultiRNNCell(cells)
outputs, states = tf.nn.dynamic_rnn(multi_layer_cell, X, dtype=tf.float32)

WARNING Do not set state_is_tuple=False, or the MultiRNNCell will concatenate all the cell states into a single tensor, on a single GPU.

Applying Dropout

If you build a very deep RNN, it may end up overfitting the training set. To prevent that, a common technique is to apply dropout (introduced in Chapter 11). You can simply add a dropout layer before or after the RNN as usual, but if you also want to apply dropout between the RNN layers, you need to use a DropoutWrapper. The following code applies dropout to the inputs of each layer in the RNN, dropping each input with a 50% probability:

keep_prob = 0.5

cells = [tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
         for layer in range(n_layers)]
cells_drop = [tf.contrib.rnn.DropoutWrapper(cell, input_keep_prob=keep_prob)
              for cell in cells]
multi_layer_cell = tf.contrib.rnn.MultiRNNCell(cells_drop)
rnn_outputs, states = tf.nn.dynamic_rnn(multi_layer_cell, X, dtype=tf.float32)

Note that it is also possible to apply dropout to the outputs by setting output_keep_prob.

The main problem with this code is that it will apply dropout not only during training but also during testing, which is not what you want (recall that dropout should be applied only during training). Unfortunately, the DropoutWrapper does not support a training placeholder (yet?), so you must either write your own dropout wrapper class, or have two different graphs: one for training, and the other for testing. The second option looks like this:

import sys

training = (sys.argv[-1] == "train")

X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
y = tf.placeholder(tf.float32, [None, n_steps, n_outputs])
cells = [tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
         for layer in range(n_layers)]
if training:
    cells = [tf.contrib.rnn.DropoutWrapper(cell, input_keep_prob=keep_prob)
             for cell in cells]
multi_layer_cell = tf.contrib.rnn.MultiRNNCell(cells)
rnn_outputs, states = tf.nn.dynamic_rnn(multi_layer_cell, X, dtype=tf.float32)
[...]  # build the rest of the graph

with tf.Session() as sess:
    if training:
        init.run()
        for iteration in range(n_iterations):
            [...]  # train the model
        save_path = saver.save(sess, "/tmp/my_model.ckpt")
    else:
        saver.restore(sess, "/tmp/my_model.ckpt")
        [...]  # use the model

With that you should be able to train all sorts of RNNs! Unfortunately, if you want to train an RNN on long sequences, things will get a bit harder. Let's see why and what you can do about it.

The Difficulty of Training over Many Time Steps

To train an RNN on long sequences, you will need to run it over many time steps, making the unrolled RNN a very deep network. Just like any deep neural network it may suffer from the vanishing/exploding gradients problem (discussed in Chapter 11) and take forever to train. Many of the tricks we discussed to alleviate this problem can be used for deep unrolled RNNs as well: good parameter initialization, nonsaturating activation functions (e.g., ReLU), Batch Normalization, Gradient Clipping, and faster optimizers. However, if the RNN needs to handle even moderately long sequences (e.g., 100 inputs), then training will still be very slow.

ThesimplestandmostcommonsolutiontothisproblemistounrolltheRNNonlyoveralimitednumberoftimestepsduringtraining.Thisiscalledtruncatedbackpropagationthroughtime.InTensorFlowyoucanimplementitsimplybytruncatingtheinputsequences.Forexample,inthetimeseriespredictionproblem,youwouldsimplyreducen_stepsduringtraining.Theproblem,ofcourse,isthatthemodelwillnotbeabletolearnlong-termpatterns.Oneworkaroundcouldbetomakesurethattheseshortenedsequencescontainbotholdandrecentdata,sothatthemodelcanlearntouseboth(e.g.,thesequencecouldcontainmonthlydataforthelastfivemonths,thenweeklydataforthelastfiveweeks,thendailydataoverthelastfivedays).Butthisworkaroundhasitslimits:whatiffine-graineddatafromlastyearisactuallyuseful?Whatiftherewasabriefbutsignificanteventthatabsolutelymustbetakenintoaccount,evenyearslater(e.g.,theresultofanelection)?

Besidesthelongtrainingtime,asecondproblemfacedbylong-runningRNNsisthefactthatthememoryofthefirstinputsgraduallyfadesaway.Indeed,duetothetransformationsthatthedatagoesthroughwhentraversinganRNN,someinformationislostaftereachtimestep.Afterawhile,theRNN’sstatecontainsvirtuallynotraceofthefirstinputs.Thiscanbeashowstopper.Forexample,sayyouwanttoperformsentimentanalysisonalongreviewthatstartswiththefourwords“Ilovedthismovie,”buttherestofthereviewliststhemanythingsthatcouldhavemadethemovieevenbetter.IftheRNNgraduallyforgetsthefirstfourwords,itwillcompletelymisinterpretthereview.Tosolvethisproblem,varioustypesofcellswithlong-termmemoryhavebeenintroduced.Theyhaveprovedsosuccessfulthatthebasiccellsarenotmuchusedanymore.Let’sfirstlookatthemostpopularoftheselongmemorycells:theLSTMcell.

LSTM Cell

The Long Short-Term Memory (LSTM) cell was proposed in 1997³ by Sepp Hochreiter and Jürgen Schmidhuber, and it was gradually improved over the years by several researchers, such as Alex Graves, Haşim Sak,⁴ Wojciech Zaremba,⁵ and many more. If you consider the LSTM cell as a black box, it can be used very much like a basic cell, except it will perform much better; training will converge faster and it will detect long-term dependencies in the data. In TensorFlow, you can simply use a BasicLSTMCell instead of a BasicRNNCell:

lstm_cell = tf.contrib.rnn.BasicLSTMCell(num_units=n_neurons)

LSTM cells manage two state vectors, and for performance reasons they are kept separate by default. You can change this default behavior by setting state_is_tuple=False when creating the BasicLSTMCell.

So how does an LSTM cell work? The architecture of a basic LSTM cell is shown in Figure 14-13.

Figure 14-13. LSTM cell

Ifyoudon’tlookatwhat’sinsidethebox,theLSTMcelllooksexactlylikearegularcell,exceptthatitsstateissplitintwovectors:h(t)andc(t)(“c”standsfor“cell”).Youcanthinkofh(t)astheshort-termstateandc(t)asthelong-termstate.

Nowlet’sopenthebox!Thekeyideaisthatthenetworkcanlearnwhattostoreinthelong-termstate,whattothrowaway,andwhattoreadfromit.Asthelong-termstatec(t–1)traversesthenetworkfromlefttoright,youcanseethatitfirstgoesthroughaforgetgate,droppingsomememories,andthenitaddssomenewmemoriesviatheadditionoperation(whichaddsthememoriesthatwereselectedbyaninputgate).Theresultc(t)issentstraightout,withoutanyfurthertransformation.So,ateachtimestep,somememoriesaredroppedandsomememoriesareadded.Moreover,aftertheadditionoperation,thelong-termstateiscopiedandpassedthroughthetanhfunction,andthentheresultisfilteredbytheoutputgate.

Thisproducestheshort-termstateh(t)(whichisequaltothecell’soutputforthistimestepy(t)).Nowlet’slookatwherenewmemoriescomefromandhowthegateswork.

First,thecurrentinputvectorx(t)andthepreviousshort-termstateh(t–1)arefedtofourdifferentfullyconnectedlayers.Theyallserveadifferentpurpose:

Themainlayeristheonethatoutputsg(t).Ithastheusualroleofanalyzingthecurrentinputsx(t)andtheprevious(short-term)stateh(t–1).Inabasiccell,thereisnothingelsethanthislayer,anditsoutputgoesstraightouttoy(t)andh(t).Incontrast,inanLSTMcellthislayer’soutputdoesnotgostraightout,butinsteaditispartiallystoredinthelong-termstate.

Thethreeotherlayersaregatecontrollers.Sincetheyusethelogisticactivationfunction,theiroutputsrangefrom0to1.Asyoucansee,theiroutputsarefedtoelement-wisemultiplicationoperations,soiftheyoutput0s,theyclosethegate,andiftheyoutput1s,theyopenit.Specifically:

Theforgetgate(controlledbyf(t))controlswhichpartsofthelong-termstateshouldbeerased.

Theinputgate(controlledbyi(t))controlswhichpartsofg(t)shouldbeaddedtothelong-termstate(thisiswhywesaiditwasonly“partiallystored”).

Finally,theoutputgate(controlledbyo(t))controlswhichpartsofthelong-termstateshouldbereadandoutputatthistimestep(bothtoh(t))andy(t).

Inshort,anLSTMcellcanlearntorecognizeanimportantinput(that’stheroleoftheinputgate),storeitinthelong-termstate,learntopreserveitforaslongasitisneeded(that’stheroleoftheforgetgate),andlearntoextractitwheneveritisneeded.Thisexplainswhytheyhavebeenamazinglysuccessfulatcapturinglong-termpatternsintimeseries,longtexts,audiorecordings,andmore.

Equation 14-3 summarizes how to compute the cell's long-term state, its short-term state, and its output at each time step for a single instance (the equations for a whole mini-batch are very similar).

Equation 14-3. LSTM computations
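These computations can be written out from the description of the four layers and three gates above (σ is the logistic function, ⊗ denotes element-wise multiplication; the weight matrices and bias terms are listed just below):

$$ \begin{aligned} \mathbf{i}_{(t)} &= \sigma\left(\mathbf{W}_{xi}^T \cdot \mathbf{x}_{(t)} + \mathbf{W}_{hi}^T \cdot \mathbf{h}_{(t-1)} + \mathbf{b}_i\right) \\ \mathbf{f}_{(t)} &= \sigma\left(\mathbf{W}_{xf}^T \cdot \mathbf{x}_{(t)} + \mathbf{W}_{hf}^T \cdot \mathbf{h}_{(t-1)} + \mathbf{b}_f\right) \\ \mathbf{o}_{(t)} &= \sigma\left(\mathbf{W}_{xo}^T \cdot \mathbf{x}_{(t)} + \mathbf{W}_{ho}^T \cdot \mathbf{h}_{(t-1)} + \mathbf{b}_o\right) \\ \mathbf{g}_{(t)} &= \tanh\left(\mathbf{W}_{xg}^T \cdot \mathbf{x}_{(t)} + \mathbf{W}_{hg}^T \cdot \mathbf{h}_{(t-1)} + \mathbf{b}_g\right) \\ \mathbf{c}_{(t)} &= \mathbf{f}_{(t)} \otimes \mathbf{c}_{(t-1)} + \mathbf{i}_{(t)} \otimes \mathbf{g}_{(t)} \\ \mathbf{y}_{(t)} &= \mathbf{h}_{(t)} = \mathbf{o}_{(t)} \otimes \tanh\left(\mathbf{c}_{(t)}\right) \end{aligned} $$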

Wxi, Wxf, Wxo, and Wxg are the weight matrices of each of the four layers for their connection to the input vector x(t).

Whi, Whf, Who, and Whg are the weight matrices of each of the four layers for their connection to the previous short-term state h(t–1).

bi, bf, bo, and bg are the bias terms for each of the four layers. Note that TensorFlow initializes bf to a vector full of 1s instead of 0s. This prevents forgetting everything at the beginning of training.

Peephole Connections

In a basic LSTM cell, the gate controllers can look only at the input x(t) and the previous short-term state h(t–1). It may be a good idea to give them a bit more context by letting them peek at the long-term state as well. This idea was proposed by Felix Gers and Jürgen Schmidhuber in 2000.⁶ They proposed an LSTM variant with extra connections called peephole connections: the previous long-term state c(t–1) is added as an input to the controllers of the forget gate and the input gate, and the current long-term state c(t) is added as input to the controller of the output gate.

To implement peephole connections in TensorFlow, you must use the LSTMCell instead of the BasicLSTMCell and set use_peepholes=True:

lstm_cell = tf.contrib.rnn.LSTMCell(num_units=n_neurons, use_peepholes=True)

There are many other variants of the LSTM cell. One particularly popular variant is the GRU cell, which we will look at now.

GRU Cell

The Gated Recurrent Unit (GRU) cell (see Figure 14-14) was proposed by Kyunghyun Cho et al. in a 2014 paper⁷ that also introduced the Encoder–Decoder network we mentioned earlier.

Figure 14-14. GRU cell

The GRU cell is a simplified version of the LSTM cell, and it seems to perform just as well⁸ (which explains its growing popularity). The main simplifications are:

Both state vectors are merged into a single vector h(t).

A single gate controller controls both the forget gate and the input gate. If the gate controller outputs a 1, the input gate is open and the forget gate is closed. If it outputs a 0, the opposite happens. In other words, whenever a memory must be stored, the location where it will be stored is erased first. This is actually a frequent variant to the LSTM cell in and of itself.

There is no output gate; the full state vector is output at every time step. However, there is a new gate controller that controls which part of the previous state will be shown to the main layer.

Equation 14-4 summarizes how to compute the cell's state at each time step for a single instance.

Equation 14-4. GRU computations
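One common way to write these computations, consistent with the simplifications listed above (z(t) plays the role of the single forget/input controller and r(t) controls which part of the previous state is shown to the main layer), is:

$$ \begin{aligned} \mathbf{z}_{(t)} &= \sigma\left(\mathbf{W}_{xz}^T \cdot \mathbf{x}_{(t)} + \mathbf{W}_{hz}^T \cdot \mathbf{h}_{(t-1)}\right) \\ \mathbf{r}_{(t)} &= \sigma\left(\mathbf{W}_{xr}^T \cdot \mathbf{x}_{(t)} + \mathbf{W}_{hr}^T \cdot \mathbf{h}_{(t-1)}\right) \\ \mathbf{g}_{(t)} &= \tanh\left(\mathbf{W}_{xg}^T \cdot \mathbf{x}_{(t)} + \mathbf{W}_{hg}^T \cdot \left(\mathbf{r}_{(t)} \otimes \mathbf{h}_{(t-1)}\right)\right) \\ \mathbf{h}_{(t)} &= \left(1 - \mathbf{z}_{(t)}\right) \otimes \mathbf{h}_{(t-1)} + \mathbf{z}_{(t)} \otimes \mathbf{g}_{(t)} \end{aligned} $$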

Creating a GRU cell in TensorFlow is trivial:

gru_cell = tf.contrib.rnn.GRUCell(num_units=n_neurons)

LSTM or GRU cells are one of the main reasons behind the success of RNNs in recent years, in particular for applications in natural language processing (NLP).

Natural Language Processing

Most of the state-of-the-art NLP applications, such as machine translation, automatic summarization, parsing, sentiment analysis, and more, are now based (at least in part) on RNNs. In this last section, we will take a quick look at what a machine translation model looks like. This topic is very well covered by TensorFlow's awesome Word2Vec and Seq2Seq tutorials, so you should definitely check them out.

Word Embeddings

Before we start, we need to choose a word representation. One option could be to represent each word using a one-hot vector. Suppose your vocabulary contains 50,000 words, then the nth word would be represented as a 50,000-dimensional vector, full of 0s except for a 1 at the nth position. However, with such a large vocabulary, this sparse representation would not be efficient at all. Ideally, you want similar words to have similar representations, making it easy for the model to generalize what it learns about a word to all similar words. For example, if the model is told that "I drink milk" is a valid sentence, and if it knows that "milk" is close to "water" but far from "shoes," then it will know that "I drink water" is probably a valid sentence as well, while "I drink shoes" is probably not. But how can you come up with such a meaningful representation?

Themostcommonsolutionistorepresenteachwordinthevocabularyusingafairlysmallanddensevector(e.g.,150dimensions),calledanembedding,andjustlettheneuralnetworklearnagoodembeddingforeachwordduringtraining.Atthebeginningoftraining,embeddingsaresimplychosenrandomly,butduringtraining,backpropagationautomaticallymovestheembeddingsaroundinawaythathelpstheneuralnetworkperformitstask.Typicallythismeansthatsimilarwordswillgraduallyclusterclosetooneanother,andevenenduporganizedinarathermeaningfulway.Forexample,embeddingsmayendupplacedalongvariousaxesthatrepresentgender,singular/plural,adjective/noun,andsoon.Theresultcanbetrulyamazing.9

In TensorFlow, you first need to create the variable representing the embeddings for every word in your vocabulary (initialized randomly):

vocabulary_size = 50000
embedding_size = 150

init_embeds = tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0)
embeddings = tf.Variable(init_embeds)

Nowsupposeyouwanttofeedthesentence“Idrinkmilk”toyourneuralnetwork.Youshouldfirstpreprocessthesentenceandbreakitintoalistofknownwords.Forexampleyoumayremoveunnecessarycharacters,replaceunknownwordsbyapredefinedtokenwordsuchas“[UNK]”,replacenumericalvaluesby“[NUM]”,replaceURLsby“[URL]”,andsoon.Onceyouhavealistofknownwords,youcanlookupeachword’sintegeridentifier(from0to49999)inadictionary,forexample[72,3335,288].Atthatpoint,youarereadytofeedthesewordidentifierstoTensorFlowusingaplaceholder,andapplytheembedding_lookup()functiontogetthecorrespondingembeddings:

train_inputs = tf.placeholder(tf.int32, shape=[None])      # from ids...
embed = tf.nn.embedding_lookup(embeddings, train_inputs)   # ...to embeddings

Once your model has learned good word embeddings, they can actually be reused fairly efficiently in any NLP application: after all, "milk" is still close to "water" and far from "shoes" no matter what your application is. In fact, instead of training your own word embeddings, you may want to download pretrained word embeddings. Just like when reusing pretrained layers (see Chapter 11), you can choose to freeze the pretrained embeddings (e.g., creating the embeddings variable using trainable=False) or let backpropagation tweak them for your application. The first option will speed up training, but the second may lead to slightly higher performance.
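For example, a minimal sketch of the frozen option; pretrained_embeddings stands here for a NumPy array of downloaded embeddings (the name is just for illustration):

# pretrained_embeddings: a [vocabulary_size, embedding_size] NumPy array you downloaded
embeddings = tf.Variable(pretrained_embeddings, trainable=False, name="embeddings")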

TIPEmbeddingsarealsousefulforrepresentingcategoricalattributesthatcantakeonalargenumberofdifferentvalues,especiallywhentherearecomplexsimilaritiesbetweenvalues.Forexample,considerprofessions,hobbies,dishes,species,brands,andsoon.

Younowhavealmostallthetoolsyouneedtoimplementamachinetranslationsystem.Let’slookatthisnow.

An Encoder–Decoder Network for Machine Translation

Let's take a look at a simple machine translation model¹⁰ that will translate English sentences to French (see Figure 14-15).

Figure 14-15. A simple machine translation model

TheEnglishsentencesarefedtotheencoder,andthedecoderoutputstheFrenchtranslations.NotethattheFrenchtranslationsarealsousedasinputstothedecoder,butpushedbackbyonestep.Inotherwords,thedecoderisgivenasinputthewordthatitshouldhaveoutputatthepreviousstep(regardlessofwhatitactuallyoutput).Fortheveryfirstword,itisgivenatokenthatrepresentsthebeginningofthesentence(e.g.,“<go>”).Thedecoderisexpectedtoendthesentencewithanend-of-sequence(EOS)token(e.g.,“<eos>”).

NotethattheEnglishsentencesarereversedbeforetheyarefedtotheencoder.Forexample“Idrinkmilk”isreversedto“milkdrinkI.”ThisensuresthatthebeginningoftheEnglishsentencewillbefedlasttotheencoder,whichisusefulbecausethat’sgenerallythefirstthingthatthedecoderneedstotranslate.

Eachwordisinitiallyrepresentedbyasimpleintegeridentifier(e.g.,288fortheword“milk”).Next,anembeddinglookupreturnsthewordembedding(asexplainedearlier,thisisadense,fairlylow-dimensionalvector).Thesewordembeddingsarewhatisactuallyfedtotheencoderandthedecoder.

Ateachstep,thedecoderoutputsascoreforeachwordintheoutputvocabulary(i.e.,French),andthentheSoftmaxlayerturnsthesescoresintoprobabilities.Forexample,atthefirststeptheword“Je”mayhaveaprobabilityof20%,“Tu”mayhaveaprobabilityof1%,andsoon.Thewordwiththehighestprobabilityisoutput.Thisisverymuchlikearegularclassificationtask,soyoucantrainthemodelusingthesoftmax_cross_entropy_with_logits()function.

Notethatatinferencetime(aftertraining),youwillnothavethetargetsentencetofeedtothedecoder.

Instead,simplyfeedthedecoderthewordthatitoutputatthepreviousstep,asshowninFigure14-16(thiswillrequireanembeddinglookupthatisnotshownonthediagram).

Figure14-16.Feedingthepreviousoutputwordasinputatinferencetime

Okay,nowyouhavethebigpicture.However,ifyougothroughTensorFlow’ssequence-to-sequencetutorialandyoulookatthecodeinrnn/translate/seq2seq_model.py(intheTensorFlowmodels),youwillnoticeafewimportantdifferences:

First,sofarwehaveassumedthatallinputsequences(totheencoderandtothedecoder)haveaconstantlength.Butobviouslysentencelengthsmayvary.Thereareseveralwaysthatthiscanbehandled—forexample,usingthesequence_lengthargumenttothestatic_rnn()ordynamic_rnn()functionstospecifyeachsentence’slength(asdiscussedearlier).However,anotherapproachisusedinthetutorial(presumablyforperformancereasons):sentencesaregroupedintobucketsofsimilarlengths(e.g.,abucketforthe1-to6-wordsentences,anotherforthe7-to12-wordsentences,andsoon11),andtheshortersentencesarepaddedusingaspecialpaddingtoken(e.g.,“<pad>”).Forexample“Idrinkmilk”becomes“<pad><pad><pad>milkdrinkI”,anditstranslationbecomes“Jeboisdulait<eos><pad>”.Ofcourse,wewanttoignoreanyoutputpasttheEOStoken.Forthis,thetutorial’simplementationusesatarget_weightsvector.Forexample,forthetargetsentence“Jeboisdulait<eos><pad>”,theweightswouldbesetto[1.0,1.0,1.0,1.0,1.0,0.0](noticetheweight0.0thatcorrespondstothepaddingtokeninthetargetsentence).SimplymultiplyingthelossesbythetargetweightswillzerooutthelossesthatcorrespondtowordspastEOStokens.

Second, when the output vocabulary is large (which is the case here), outputting a probability for each and every possible word would be terribly slow. If the target vocabulary contains, say, 50,000 French words, then the decoder would output 50,000-dimensional vectors, and then computing the softmax function over such a large vector would be very computationally intensive. To avoid this, one solution is to let the decoder output much smaller vectors, such as 1,000-dimensional vectors, then use a sampling technique to estimate the loss without having to compute it over every single word in the target vocabulary. This Sampled Softmax technique was introduced in 2015 by Sébastien Jean et al.¹² In TensorFlow you can use the sampled_softmax_loss() function (a short sketch combining this with the target weights trick appears after this list).

Third,thetutorial’simplementationusesanattentionmechanismthatletsthedecoderpeekintotheinputsequence.AttentionaugmentedRNNsarebeyondthescopeofthisbook,butifyouare

interestedtherearehelpfulpapersaboutmachinetranslation,13machinereading,14andimagecaptions15usingattention.

Finally,thetutorial’simplementationmakesuseofthetf.nn.legacy_seq2seqmodule,whichprovidestoolstobuildvariousEncoder–Decodermodelseasily.Forexample,theembedding_rnn_seq2seq()functioncreatesasimpleEncoder–Decodermodelthatautomaticallytakescareofwordembeddingsforyou,justliketheonerepresentedinFigure14-15.Thiscodewilllikelybeupdatedquicklytousethenewtf.nn.seq2seqmodule.

You now have all the tools you need to understand the sequence-to-sequence tutorial's implementation. Check it out and train your own English-to-French translator!

Exercises

1. Can you think of a few applications for a sequence-to-sequence RNN? What about a sequence-to-vector RNN? And a vector-to-sequence RNN?

2. Whydopeopleuseencoder–decoderRNNsratherthanplainsequence-to-sequenceRNNsforautomatictranslation?

3. HowcouldyoucombineaconvolutionalneuralnetworkwithanRNNtoclassifyvideos?

4. WhataretheadvantagesofbuildinganRNNusingdynamic_rnn()ratherthanstatic_rnn()?

5. Howcanyoudealwithvariable-lengthinputsequences?Whataboutvariable-lengthoutputsequences?

6. WhatisacommonwaytodistributetrainingandexecutionofadeepRNNacrossmultipleGPUs?

7. EmbeddedRebergrammarswereusedbyHochreiterandSchmidhuberintheirpaperaboutLSTMs.Theyareartificialgrammarsthatproducestringssuchas“BPBTSXXVPSEPE.”CheckoutJennyOrr’sniceintroductiontothistopic.ChooseaparticularembeddedRebergrammar(suchastheonerepresentedonJennyOrr’spage),thentrainanRNNtoidentifywhetherastringrespectsthatgrammarornot.Youwillfirstneedtowriteafunctioncapableofgeneratingatrainingbatchcontainingabout50%stringsthatrespectthegrammar,and50%thatdon’t.

8. Tacklethe“Howmuchdiditrain?II”Kagglecompetition.Thisisatimeseriespredictiontask:youaregivensnapshotsofpolarimetricradarvaluesandaskedtopredictthehourlyraingaugetotal.LuisAndreDutraeSilva’sinterviewgivessomeinterestinginsightsintothetechniquesheusedtoreachsecondplaceinthecompetition.Inparticular,heusedanRNNcomposedoftwoLSTMlayers.

9. GothroughTensorFlow’sWord2Vectutorialtocreatewordembeddings,andthengothroughtheSeq2SeqtutorialtotrainanEnglish-to-Frenchtranslationsystem.

SolutionstotheseexercisesareavailableinAppendixA.

Notethatmanyresearchersprefertousethehyperbolictangent(tanh)activationfunctioninRNNsratherthantheReLUactivationfunction.Forexample,takealookatbyVuPhametal.’spaper“DropoutImprovesRecurrentNeuralNetworksforHandwritingRecognition”.However,ReLU-basedRNNsarealsopossible,asshowninQuocV.Leetal.’spaper“ASimpleWaytoInitializeRecurrentNetworksofRectifiedLinearUnits”.

Thisusesthedecoratordesignpattern.

“LongShort-TermMemory,”S.HochreiterandJ.Schmidhuber(1997).

“LongShort-TermMemoryRecurrentNeuralNetworkArchitecturesforLargeScaleAcousticModeling,”H.Saketal.(2014).

“RecurrentNeuralNetworkRegularization,”W.Zarembaetal.(2015).

“RecurrentNetsthatTimeandCount,”F.GersandJ.Schmidhuber(2000).

“LearningPhraseRepresentationsusingRNNEncoder–DecoderforStatisticalMachineTranslation,”K.Choetal.(2014).

A2015paperbyKlausGreffetal.,“LSTM:ASearchSpaceOdyssey,”seemstoshowthatallLSTMvariantsperformroughlythesame.


For more details, check out Christopher Olah's great post, or Sebastian Ruder's series of posts.

"Sequence to Sequence Learning with Neural Networks," I. Sutskever et al. (2014).

The bucket sizes used in the tutorial are different.

"On Using Very Large Target Vocabulary for Neural Machine Translation," S. Jean et al. (2015).

"Neural Machine Translation by Jointly Learning to Align and Translate," D. Bahdanau et al. (2014).

"Long Short-Term Memory-Networks for Machine Reading," J. Cheng (2016).

"Show, Attend and Tell: Neural Image Caption Generation with Visual Attention," K. Xu et al. (2015).


Chapter 15. Autoencoders

Autoencoders are artificial neural networks capable of learning efficient representations of the input data, called codings, without any supervision (i.e., the training set is unlabeled). These codings typically have a much lower dimensionality than the input data, making autoencoders useful for dimensionality reduction (see Chapter 8). More importantly, autoencoders act as powerful feature detectors, and they can be used for unsupervised pretraining of deep neural networks (as we discussed in Chapter 11). Lastly, they are capable of randomly generating new data that looks very similar to the training data; this is called a generative model. For example, you could train an autoencoder on pictures of faces, and it would then be able to generate new faces.

Surprisingly, autoencoders work by simply learning to copy their inputs to their outputs. This may sound like a trivial task, but we will see that constraining the network in various ways can make it rather difficult. For example, you can limit the size of the internal representation, or you can add noise to the inputs and train the network to recover the original inputs. These constraints prevent the autoencoder from trivially copying the inputs directly to the outputs, which forces it to learn efficient ways of representing the data. In short, the codings are byproducts of the autoencoder's attempt to learn the identity function under some constraints.

In this chapter we will explain in more depth how autoencoders work, what types of constraints can be imposed, and how to implement them using TensorFlow, whether it is for dimensionality reduction, feature extraction, unsupervised pretraining, or as generative models.

Efficient Data Representations

Which of the following number sequences do you find the easiest to memorize?

40, 27, 25, 36, 81, 57, 10, 73, 19, 68

50, 25, 76, 38, 19, 58, 29, 88, 44, 22, 11, 34, 17, 52, 26, 13, 40, 20

At first glance, it would seem that the first sequence should be easier, since it is much shorter. However, if you look carefully at the second sequence, you may notice that it follows two simple rules: even numbers are followed by their half, and odd numbers are followed by their triple plus one (this is a famous sequence known as the hailstone sequence). Once you notice this pattern, the second sequence becomes much easier to memorize than the first because you only need to memorize the two rules, the first number, and the length of the sequence. Note that if you could quickly and easily memorize very long sequences, you would not care much about the existence of a pattern in the second sequence. You would just learn every number by heart, and that would be that. It is the fact that it is hard to memorize long sequences that makes it useful to recognize patterns, and hopefully this clarifies why constraining an autoencoder during training pushes it to discover and exploit patterns in the data.
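To make the two rules concrete, here is a small illustration (not part of the chapter's code examples) that regenerates the second sequence from its first number and its length:

def hailstone(first, length):
    """Apply the two rules: halve even numbers, triple-plus-one odd numbers."""
    sequence = [first]
    for _ in range(length - 1):
        n = sequence[-1]
        sequence.append(n // 2 if n % 2 == 0 else 3 * n + 1)
    return sequence

print(hailstone(50, 18))
# [50, 25, 76, 38, 19, 58, 29, 88, 44, 22, 11, 34, 17, 52, 26, 13, 40, 20]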

The relationship between memory, perception, and pattern matching was famously studied by William Chase and Herbert Simon in the early 1970s.1 They observed that expert chess players were able to memorize the positions of all the pieces in a game by looking at the board for just 5 seconds, a task that most people would find impossible. However, this was only the case when the pieces were placed in realistic positions (from actual games), not when the pieces were placed randomly. Chess experts don't have a much better memory than you and I, they just see chess patterns more easily thanks to their experience with the game. Noticing patterns helps them store information efficiently.

Just like the chess players in this memory experiment, an autoencoder looks at the inputs, converts them to an efficient internal representation, and then spits out something that (hopefully) looks very close to the inputs. An autoencoder is always composed of two parts: an encoder (or recognition network) that converts the inputs to an internal representation, followed by a decoder (or generative network) that converts the internal representation to the outputs (see Figure 15-1).

As you can see, an autoencoder typically has the same architecture as a Multi-Layer Perceptron (MLP; see Chapter 10), except that the number of neurons in the output layer must be equal to the number of inputs. In this example, there is just one hidden layer composed of two neurons (the encoder), and one output layer composed of three neurons (the decoder). The outputs are often called the reconstructions since the autoencoder tries to reconstruct the inputs, and the cost function contains a reconstruction loss that penalizes the model when the reconstructions are different from the inputs.

Figure 15-1. The chess memory experiment (left) and a simple autoencoder (right)

Because the internal representation has a lower dimensionality than the input data (it is 2D instead of 3D), the autoencoder is said to be undercomplete. An undercomplete autoencoder cannot trivially copy its inputs to the codings, yet it must find a way to output a copy of its inputs. It is forced to learn the most important features in the input data (and drop the unimportant ones).

Let's see how to implement a very simple undercomplete autoencoder for dimensionality reduction.

Performing PCA with an Undercomplete Linear Autoencoder

If the autoencoder uses only linear activations and the cost function is the Mean Squared Error (MSE), then it can be shown that it ends up performing Principal Component Analysis (see Chapter 8).

The following code builds a simple linear autoencoder to perform PCA on a 3D dataset, projecting it to 2D:

import tensorflow as tf

n_inputs = 3   # 3D inputs
n_hidden = 2   # 2D codings
n_outputs = n_inputs

learning_rate = 0.01

X = tf.placeholder(tf.float32, shape=[None, n_inputs])
hidden = tf.layers.dense(X, n_hidden)
outputs = tf.layers.dense(hidden, n_outputs)

reconstruction_loss = tf.reduce_mean(tf.square(outputs - X))  # MSE

optimizer = tf.train.AdamOptimizer(learning_rate)
training_op = optimizer.minimize(reconstruction_loss)

init = tf.global_variables_initializer()

This code is really not very different from all the MLPs we built in past chapters. The two things to note are:

The number of outputs is equal to the number of inputs.

To perform simple PCA, we do not use any activation function (i.e., all neurons are linear) and the cost function is the MSE. We will see more complex autoencoders shortly.

Now let's load the dataset, train the model on the training set, and use it to encode the test set (i.e., project it to 2D):

X_train, X_test = [...]  # load the dataset

n_iterations = 1000
codings = hidden  # the output of the hidden layer provides the codings

with tf.Session() as sess:
    init.run()
    for iteration in range(n_iterations):
        training_op.run(feed_dict={X: X_train})  # no labels (unsupervised)
    codings_val = codings.eval(feed_dict={X: X_test})

Figure 15-2 shows the original 3D dataset (at the left) and the output of the autoencoder's hidden layer (i.e., the coding layer, at the right). As you can see, the autoencoder found the best 2D plane to project the data onto, preserving as much variance in the data as it could (just like PCA).

Figure 15-2. PCA performed by an undercomplete linear autoencoder

Stacked Autoencoders

Just like other neural networks we have discussed, autoencoders can have multiple hidden layers. In this case they are called stacked autoencoders (or deep autoencoders). Adding more layers helps the autoencoder learn more complex codings. However, one must be careful not to make the autoencoder too powerful. Imagine an encoder so powerful that it just learns to map each input to a single arbitrary number (and the decoder learns the reverse mapping). Obviously such an autoencoder will reconstruct the training data perfectly, but it will not have learned any useful data representation in the process (and it is unlikely to generalize well to new instances).

The architecture of a stacked autoencoder is typically symmetrical with regard to the central hidden layer (the coding layer). To put it simply, it looks like a sandwich. For example, an autoencoder for MNIST (introduced in Chapter 3) may have 784 inputs, followed by a hidden layer with 300 neurons, then a central hidden layer of 150 neurons, then another hidden layer with 300 neurons, and an output layer with 784 neurons. This stacked autoencoder is represented in Figure 15-3.

Figure 15-3. Stacked autoencoder

TensorFlow Implementation

You can implement a stacked autoencoder very much like a regular deep MLP. In particular, the same techniques we used in Chapter 11 for training deep nets can be applied. For example, the following code builds a stacked autoencoder for MNIST, using He initialization, the ELU activation function, and ℓ2 regularization. The code should look very familiar, except that there are no labels (no y):

from functools import partial

n_inputs = 28 * 28  # for MNIST
n_hidden1 = 300
n_hidden2 = 150  # codings
n_hidden3 = n_hidden1
n_outputs = n_inputs

learning_rate = 0.01
l2_reg = 0.0001

X = tf.placeholder(tf.float32, shape=[None, n_inputs])

he_init = tf.contrib.layers.variance_scaling_initializer()
l2_regularizer = tf.contrib.layers.l2_regularizer(l2_reg)
my_dense_layer = partial(tf.layers.dense,
                         activation=tf.nn.elu,
                         kernel_initializer=he_init,
                         kernel_regularizer=l2_regularizer)

hidden1 = my_dense_layer(X, n_hidden1)
hidden2 = my_dense_layer(hidden1, n_hidden2)  # codings
hidden3 = my_dense_layer(hidden2, n_hidden3)
outputs = my_dense_layer(hidden3, n_outputs, activation=None)

reconstruction_loss = tf.reduce_mean(tf.square(outputs - X))  # MSE
reg_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
loss = tf.add_n([reconstruction_loss] + reg_losses)

optimizer = tf.train.AdamOptimizer(learning_rate)
training_op = optimizer.minimize(loss)

init = tf.global_variables_initializer()

You can then train the model normally. Note that the digit labels (y_batch) are unused:

n_epochs = 5
batch_size = 150

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        n_batches = mnist.train.num_examples // batch_size
        for iteration in range(n_batches):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run(training_op, feed_dict={X: X_batch})

Tying Weights

When an autoencoder is neatly symmetrical, like the one we just built, a common technique is to tie the weights of the decoder layers to the weights of the encoder layers. This halves the number of weights in the model, speeding up training and limiting the risk of overfitting. Specifically, if the autoencoder has a total of N layers (not counting the input layer), and W_L represents the connection weights of the Lth layer (e.g., layer 1 is the first hidden layer, layer N/2 is the coding layer, and layer N is the output layer), then the decoder layer weights can be defined simply as: W_{N–L+1} = W_L^T (with L = 1, 2, ..., N/2).

Unfortunately, implementing tied weights in TensorFlow using the dense() function is a bit cumbersome; it's actually easier to just define the layers manually. The code ends up significantly more verbose:

activation = tf.nn.elu
regularizer = tf.contrib.layers.l2_regularizer(l2_reg)
initializer = tf.contrib.layers.variance_scaling_initializer()

X = tf.placeholder(tf.float32, shape=[None, n_inputs])

weights1_init = initializer([n_inputs, n_hidden1])
weights2_init = initializer([n_hidden1, n_hidden2])

weights1 = tf.Variable(weights1_init, dtype=tf.float32, name="weights1")
weights2 = tf.Variable(weights2_init, dtype=tf.float32, name="weights2")
weights3 = tf.transpose(weights2, name="weights3")  # tied weights
weights4 = tf.transpose(weights1, name="weights4")  # tied weights

biases1 = tf.Variable(tf.zeros(n_hidden1), name="biases1")
biases2 = tf.Variable(tf.zeros(n_hidden2), name="biases2")
biases3 = tf.Variable(tf.zeros(n_hidden3), name="biases3")
biases4 = tf.Variable(tf.zeros(n_outputs), name="biases4")

hidden1 = activation(tf.matmul(X, weights1) + biases1)
hidden2 = activation(tf.matmul(hidden1, weights2) + biases2)
hidden3 = activation(tf.matmul(hidden2, weights3) + biases3)
outputs = tf.matmul(hidden3, weights4) + biases4

reconstruction_loss = tf.reduce_mean(tf.square(outputs - X))
reg_loss = regularizer(weights1) + regularizer(weights2)
loss = reconstruction_loss + reg_loss

optimizer = tf.train.AdamOptimizer(learning_rate)
training_op = optimizer.minimize(loss)

init = tf.global_variables_initializer()

This code is fairly straightforward, but there are a few important things to note:

First, weights3 and weights4 are not variables, they are respectively the transpose of weights2 and weights1 (they are "tied" to them).

Second, since they are not variables, it's no use regularizing them: we only regularize weights1 and weights2.

Third, biases are never tied, and never regularized.

Training One Autoencoder at a Time

Rather than training the whole stacked autoencoder in one go like we just did, it is often much faster to train one shallow autoencoder at a time, then stack all of them into a single stacked autoencoder (hence the name), as shown in Figure 15-4. This is especially useful for very deep autoencoders.

Figure 15-4. Training one autoencoder at a time

During the first phase of training, the first autoencoder learns to reconstruct the inputs. During the second phase, the second autoencoder learns to reconstruct the output of the first autoencoder's hidden layer. Finally, you just build a big sandwich using all these autoencoders, as shown in Figure 15-4 (i.e., you first stack the hidden layers of each autoencoder, then the output layers in reverse order). This gives you the final stacked autoencoder. You could easily train more autoencoders this way, building a very deep stacked autoencoder.

To implement this multiphase training algorithm, the simplest approach is to use a different TensorFlow graph for each phase. After training an autoencoder, you just run the training set through it and capture the output of the hidden layer. This output then serves as the training set for the next autoencoder. Once all autoencoders have been trained this way, you simply copy the weights and biases from each autoencoder and use them to build the stacked autoencoder. Implementing this approach is quite straightforward, so we won't detail it here, but please check out the code in the Jupyter notebooks for an example.
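If you just want the general shape of this approach, here is a minimal sketch (this is not the notebook's code; the train_autoencoder() helper and its default hyperparameters are made up for illustration). It trains one shallow autoencoder in its own graph and returns the hidden-layer codings of the training set, which then serve as the training set for the next autoencoder; in practice you would also save each autoencoder's weights and biases so you can assemble the final stacked autoencoder:

import numpy as np
import tensorflow as tf

def train_autoencoder(X_train, n_neurons, n_epochs=5, batch_size=150,
                      learning_rate=0.01):
    """Train one shallow autoencoder in its own graph; return its codings."""
    graph = tf.Graph()
    with graph.as_default():
        n_inputs = X_train.shape[1]
        X = tf.placeholder(tf.float32, shape=[None, n_inputs])
        hidden = tf.layers.dense(X, n_neurons, activation=tf.nn.elu)
        outputs = tf.layers.dense(hidden, n_inputs)
        loss = tf.reduce_mean(tf.square(outputs - X))
        training_op = tf.train.AdamOptimizer(learning_rate).minimize(loss)
        init = tf.global_variables_initializer()
    with tf.Session(graph=graph) as sess:
        init.run()
        for epoch in range(n_epochs):
            for start in range(0, len(X_train), batch_size):
                X_batch = X_train[start:start + batch_size]
                sess.run(training_op, feed_dict={X: X_batch})
        codings = hidden.eval(feed_dict={X: X_train})  # training set for the next AE
    return codings

# First autoencoder learns from the raw inputs, the second from the codings:
# codings1 = train_autoencoder(X_train, n_neurons=300)
# codings2 = train_autoencoder(codings1, n_neurons=150)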

Another approach is to use a single graph containing the whole stacked autoencoder, plus some extra operations to perform each training phase, as shown in Figure 15-5.

Figure 15-5. A single graph to train a stacked autoencoder

This deserves a bit of explanation:

The central column in the graph is the full stacked autoencoder. This part can be used after training.

The left column is the set of operations needed to run the first phase of training. It creates an output layer that bypasses hidden layers 2 and 3. This output layer shares the same weights and biases as the stacked autoencoder's output layer. On top of that are the training operations that will aim at making the output as close as possible to the inputs. Thus, this phase will train the weights and biases for the hidden layer 1 and the output layer (i.e., the first autoencoder).

The right column in the graph is the set of operations needed to run the second phase of training. It adds the training operation that will aim at making the output of hidden layer 3 as close as possible to the output of hidden layer 1. Note that we must freeze hidden layer 1 while running phase 2. This phase will train the weights and biases for hidden layers 2 and 3 (i.e., the second autoencoder).

The TensorFlow code looks like this:

[...]  # Build the whole stacked autoencoder normally.
       # In this example, the weights are not tied.

optimizer = tf.train.AdamOptimizer(learning_rate)

with tf.name_scope("phase1"):
    phase1_outputs = tf.matmul(hidden1, weights4) + biases4
    phase1_reconstruction_loss = tf.reduce_mean(tf.square(phase1_outputs - X))
    phase1_reg_loss = regularizer(weights1) + regularizer(weights4)
    phase1_loss = phase1_reconstruction_loss + phase1_reg_loss
    phase1_training_op = optimizer.minimize(phase1_loss)

with tf.name_scope("phase2"):
    phase2_reconstruction_loss = tf.reduce_mean(tf.square(hidden3 - hidden1))
    phase2_reg_loss = regularizer(weights2) + regularizer(weights3)
    phase2_loss = phase2_reconstruction_loss + phase2_reg_loss
    train_vars = [weights2, biases2, weights3, biases3]
    phase2_training_op = optimizer.minimize(phase2_loss, var_list=train_vars)

The first phase is rather straightforward: we just create an output layer that skips hidden layers 2 and 3, then build the training operations to minimize the distance between the outputs and the inputs (plus some regularization).

The second phase just adds the operations needed to minimize the distance between the output of hidden layer 3 and hidden layer 1 (also with some regularization). Most importantly, we provide the list of trainable variables to the minimize() method, making sure to leave out weights1 and biases1; this effectively freezes hidden layer 1 during phase 2.

During the execution phase, all you need to do is run the phase 1 training op for a number of epochs, then the phase 2 training op for some more epochs.

TIP
Since hidden layer 1 is frozen during phase 2, its output will always be the same for any given training instance. To avoid having to recompute the output of hidden layer 1 at every single epoch, you can compute it for the whole training set at the end of phase 1, then directly feed the cached output of hidden layer 1 during phase 2. This can give you a nice performance boost.
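Here is one way this caching could look in the execution phase. It is a minimal sketch (building on the variables defined earlier, with numpy imported as np and mnist loaded as before), and it relies on the fact that TensorFlow lets you feed the value of an intermediate tensor such as hidden1 just like a placeholder:

with tf.Session() as sess:
    init.run()
    # Phase 1: train the first autoencoder on the raw inputs.
    for epoch in range(n_epochs):
        for iteration in range(mnist.train.num_examples // batch_size):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run(phase1_training_op, feed_dict={X: X_batch})
    # Cache the (now frozen) output of hidden layer 1 for the whole training set.
    hidden1_cache = hidden1.eval(feed_dict={X: mnist.train.images})
    # Phase 2: feed the cached codings instead of recomputing hidden1 every time.
    for epoch in range(n_epochs):
        n_batches = mnist.train.num_examples // batch_size
        for iteration in range(n_batches):
            indices = np.random.permutation(mnist.train.num_examples)[:batch_size]
            hidden1_batch = hidden1_cache[indices]
            sess.run(phase2_training_op, feed_dict={hidden1: hidden1_batch})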

Visualizing the Reconstructions

One way to ensure that an autoencoder is properly trained is to compare the inputs and the outputs. They must be fairly similar, and the differences should be unimportant details. Let's plot two random digits and their reconstructions:

n_test_digits = 2
X_test = mnist.test.images[:n_test_digits]

with tf.Session() as sess:
    [...]  # Train the Autoencoder
    outputs_val = outputs.eval(feed_dict={X: X_test})

def plot_image(image, shape=[28, 28]):
    plt.imshow(image.reshape(shape), cmap="Greys", interpolation="nearest")
    plt.axis("off")

for digit_index in range(n_test_digits):
    plt.subplot(n_test_digits, 2, digit_index * 2 + 1)
    plot_image(X_test[digit_index])
    plt.subplot(n_test_digits, 2, digit_index * 2 + 2)
    plot_image(outputs_val[digit_index])

Figure 15-6 shows the resulting images.

Figure 15-6. Original digits (left) and their reconstructions (right)

Looks close enough. So the autoencoder has properly learned to reproduce its inputs, but has it learned useful features? Let's take a look.

Visualizing Features

Once your autoencoder has learned some features, you may want to take a look at them. There are various techniques for this. Arguably the simplest technique is to consider each neuron in every hidden layer, and find the training instances that activate it the most. This is especially useful for the top hidden layers since they often capture relatively large features that you can easily spot in a group of training instances that contain them. For example, if a neuron strongly activates when it sees a cat in a picture, it will be pretty obvious that the pictures that activate it the most all contain cats. However, for lower layers, this technique does not work so well, as the features are smaller and more abstract, so it's often hard to understand exactly what the neuron is getting all excited about.
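For example, here is a minimal sketch of this first technique (not from the book's code; it builds on the stacked autoencoder and the plot_image() function defined earlier, with numpy imported as np). It finds, for one neuron of the coding layer, the training images that activate it the most:

with tf.Session() as sess:
    [...]  # train the autoencoder (or restore it)
    # Activations of the coding layer for the whole training set.
    codings_val = hidden2.eval(feed_dict={X: mnist.train.images})

neuron_index = 0  # pick any neuron of the coding layer
top_instances = np.argsort(codings_val[:, neuron_index])[-8:]  # 8 strongest activations
for position, instance_index in enumerate(top_instances):
    plt.subplot(1, 8, position + 1)
    plot_image(mnist.train.images[instance_index])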

Let’slookatanothertechnique.Foreachneuroninthefirsthiddenlayer,youcancreateanimagewhereapixel’sintensitycorrespondstotheweightoftheconnectiontothegivenneuron.Forexample,thefollowingcodeplotsthefeatureslearnedbyfiveneuronsinthefirsthiddenlayer:

withtf.Session()assess:

[...]#trainautoencoder

weights1_val=weights1.eval()

foriinrange(5):

plt.subplot(1,5,i+1)

plot_image(weights1_val.T[i])

Youmaygetlow-levelfeaturessuchastheonesshowninFigure15-7.

Figure15-7.Featureslearnedbyfiveneuronsfromthefirsthiddenlayer

Thefirstfourfeaturesseemtocorrespondtosmallpatches,whilethefifthfeatureseemstolookforverticalstrokes(notethatthesefeaturescomefromthestackeddenoisingautoencoderthatwewilldiscusslater).

Another technique is to feed the autoencoder a random input image, measure the activation of the neuron you are interested in, and then perform backpropagation to tweak the image in such a way that the neuron will activate even more. If you iterate several times (performing gradient ascent), the image will gradually turn into the most exciting image (for the neuron). This is a useful technique to visualize the kinds of inputs that a neuron is looking for.
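Here is a minimal sketch of that gradient-ascent idea (again not the book's code; it builds on the stacked autoencoder defined earlier, picks one neuron of the coding layer hidden2, and uses an arbitrary step size and number of iterations):

neuron_index = 3  # the coding-layer neuron we want to "excite"
activation = hidden2[0, neuron_index]
gradient = tf.gradients(activation, X)[0]  # d(activation) / d(input image)

with tf.Session() as sess:
    [...]  # train the autoencoder (or restore it)
    image = np.random.uniform(0.0, 1.0, size=[1, n_inputs])  # random starting image
    for step in range(100):
        grad_val = gradient.eval(feed_dict={X: image})
        image += 0.1 * grad_val            # gradient ascent step
        image = np.clip(image, 0.0, 1.0)   # keep valid pixel values
    plot_image(image[0])  # the kind of input this neuron is looking for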

Finally, if you are using an autoencoder to perform unsupervised pretraining (for example, for a classification task), a simple way to verify that the features learned by the autoencoder are useful is to measure the performance of the classifier.

Unsupervised Pretraining Using Stacked Autoencoders

As we discussed in Chapter 11, if you are tackling a complex supervised task but you do not have a lot of labeled training data, one solution is to find a neural network that performs a similar task, and then reuse its lower layers. This makes it possible to train a high-performance model using only little training data because your neural network won't have to learn all the low-level features; it will just reuse the feature detectors learned by the existing net.

Similarly, if you have a large dataset but most of it is unlabeled, you can first train a stacked autoencoder using all the data, then reuse the lower layers to create a neural network for your actual task, and train it using the labeled data. For example, Figure 15-8 shows how to use a stacked autoencoder to perform unsupervised pretraining for a classification neural network. The stacked autoencoder itself is typically trained one autoencoder at a time, as discussed earlier. When training the classifier, if you really don't have much labeled training data, you may want to freeze the pretrained layers (at least the lower ones).

Figure 15-8. Unsupervised pretraining using autoencoders

NOTE
This situation is actually quite common, because building a large unlabeled dataset is often cheap (e.g., a simple script can download millions of images off the internet), but labeling them can only be done reliably by humans (e.g., classifying images as cute or not). Labeling instances is time-consuming and costly, so it is quite common to have only a few thousand labeled instances.

As we discussed earlier, one of the triggers of the current Deep Learning tsunami is the discovery in 2006 by Geoffrey Hinton et al. that deep neural networks can be pretrained in an unsupervised fashion. They used restricted Boltzmann machines for that (see Appendix E), but in 2007 Yoshua Bengio et al. showed2 that autoencoders worked just as well.

There is nothing special about the TensorFlow implementation: just train an autoencoder using all the training data, then reuse its encoder layers to create a new neural network (see Chapter 11 for more details on how to reuse pretrained layers, or check out the code examples in the Jupyter notebooks).

Up to now, in order to force the autoencoder to learn interesting features, we have limited the size of the coding layer, making it undercomplete. There are actually many other kinds of constraints that can be used, including ones that allow the coding layer to be just as large as the inputs, or even larger, resulting in an overcomplete autoencoder. Let's look at some of those approaches now.

Denoising Autoencoders

Another way to force the autoencoder to learn useful features is to add noise to its inputs, training it to recover the original, noise-free inputs. This prevents the autoencoder from trivially copying its inputs to its outputs, so it ends up having to find patterns in the data.

The idea of using autoencoders to remove noise has been around since the 1980s (e.g., it is mentioned in Yann LeCun's 1987 master's thesis). In a 2008 paper,3 Pascal Vincent et al. showed that autoencoders could also be used for feature extraction. In a 2010 paper,4 Vincent et al. introduced stacked denoising autoencoders.

The noise can be pure Gaussian noise added to the inputs, or it can be randomly switched-off inputs, just like in dropout (introduced in Chapter 11). Figure 15-9 shows both options.

Figure 15-9. Denoising autoencoders, with Gaussian noise (left) or dropout (right)

TensorFlow Implementation

Implementing denoising autoencoders in TensorFlow is not too hard. Let's start with Gaussian noise. It's really just like training a regular autoencoder, except you add noise to the inputs, and the reconstruction loss is calculated based on the original inputs:

noise_level = 1.0

X = tf.placeholder(tf.float32, shape=[None, n_inputs])
X_noisy = X + noise_level * tf.random_normal(tf.shape(X))

hidden1 = tf.layers.dense(X_noisy, n_hidden1, activation=tf.nn.relu,
                          name="hidden1")
[...]
reconstruction_loss = tf.reduce_mean(tf.square(outputs - X))  # MSE
[...]

WARNING
Since the shape of X is only partially defined during the construction phase, we cannot know in advance the shape of the noise that we must add to X. We cannot call X.get_shape() because this would just return the partially defined shape of X ([None, n_inputs]), and random_normal() expects a fully defined shape so it would raise an exception. Instead, we call tf.shape(X), which creates an operation that will return the shape of X at runtime, which will be fully defined at that point.

Implementing the dropout version, which is more common, is not much harder:

dropout_rate = 0.3

training = tf.placeholder_with_default(False, shape=(), name='training')

X = tf.placeholder(tf.float32, shape=[None, n_inputs])
X_drop = tf.layers.dropout(X, dropout_rate, training=training)

hidden1 = tf.layers.dense(X_drop, n_hidden1, activation=tf.nn.relu,
                          name="hidden1")
[...]
reconstruction_loss = tf.reduce_mean(tf.square(outputs - X))  # MSE
[...]

During training we must set training to True (as explained in Chapter 11) using the feed_dict:

sess.run(training_op, feed_dict={X: X_batch, training: True})

During testing it is not necessary to set training to False, since we set that as the default in the call to the placeholder_with_default() function.

Sparse Autoencoders

Another kind of constraint that often leads to good feature extraction is sparsity: by adding an appropriate term to the cost function, the autoencoder is pushed to reduce the number of active neurons in the coding layer. For example, it may be pushed to have on average only 5% significantly active neurons in the coding layer. This forces the autoencoder to represent each input as a combination of a small number of activations. As a result, each neuron in the coding layer typically ends up representing a useful feature (if you could speak only a few words per month, you would probably try to make them worth listening to).

In order to favor sparse models, we must first measure the actual sparsity of the coding layer at each training iteration. We do so by computing the average activation of each neuron in the coding layer, over the whole training batch. The batch size must not be too small, or else the mean will not be accurate.

Once we have the mean activation per neuron, we want to penalize the neurons that are too active by adding a sparsity loss to the cost function. For example, if we measure that a neuron has an average activation of 0.3, but the target sparsity is 0.1, it must be penalized to activate less. One approach could be simply adding the squared error (0.3 – 0.1)² to the cost function, but in practice a better approach is to use the Kullback–Leibler divergence (briefly discussed in Chapter 4), which has much stronger gradients than the Mean Squared Error, as you can see in Figure 15-10.

Figure 15-10. Sparsity loss

Given two discrete probability distributions P and Q, the KL divergence between these distributions, noted D_KL(P ‖ Q), can be computed using Equation 15-1.

Equation 15-1. Kullback–Leibler divergence

D_KL(P ‖ Q) = Σ_i P(i) log( P(i) / Q(i) )

In our case, we want to measure the divergence between the target probability p that a neuron in the coding layer will activate, and the actual probability q (i.e., the mean activation over the training batch). So the KL divergence simplifies to Equation 15-2.

Equation 15-2. KL divergence between the target sparsity p and the actual sparsity q

D_KL(p ‖ q) = p log(p / q) + (1 – p) log( (1 – p) / (1 – q) )

Once we have computed the sparsity loss for each neuron in the coding layer, we just sum up these losses, and add the result to the cost function. In order to control the relative importance of the sparsity loss and the reconstruction loss, we can multiply the sparsity loss by a sparsity weight hyperparameter. If this weight is too high, the model will stick closely to the target sparsity, but it may not reconstruct the inputs properly, making the model useless. Conversely, if it is too low, the model will mostly ignore the sparsity objective and it will not learn any interesting features.
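To get a feel for the numbers in the example above (target sparsity p = 0.1, measured mean activation q = 0.3), here is a quick check written in plain NumPy rather than TensorFlow; it is just an illustration of Equation 15-2:

import numpy as np

def kl_divergence(p, q):
    # Equation 15-2
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

p, q = 0.1, 0.3
print(kl_divergence(p, q))  # about 0.116
print((q - p) ** 2)         # 0.04: the squared-error penalty

As q drifts further away from p (in particular toward 0 or 1), the KL penalty and its gradient grow much faster than the squared error, which is what pushes the mean activation toward the target sparsity more reliably.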

TensorFlow Implementation

We now have all we need to implement a sparse autoencoder using TensorFlow:

def kl_divergence(p, q):
    return p * tf.log(p / q) + (1 - p) * tf.log((1 - p) / (1 - q))

learning_rate = 0.01
sparsity_target = 0.1
sparsity_weight = 0.2

[...]  # Build a normal autoencoder (in this example the coding layer is hidden1)

hidden1_mean = tf.reduce_mean(hidden1, axis=0)  # batch mean
sparsity_loss = tf.reduce_sum(kl_divergence(sparsity_target, hidden1_mean))
reconstruction_loss = tf.reduce_mean(tf.square(outputs - X))  # MSE
loss = reconstruction_loss + sparsity_weight * sparsity_loss

optimizer = tf.train.AdamOptimizer(learning_rate)
training_op = optimizer.minimize(loss)

An important detail is the fact that the activations of the coding layer must be between 0 and 1 (but not equal to 0 or 1), or else the KL divergence will return NaN (Not a Number). A simple solution is to use the logistic activation function for the coding layer:

hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.sigmoid)

One simple trick can speed up convergence: instead of using the MSE, we can choose a reconstruction loss that will have larger gradients. Cross entropy is often a good choice. To use it, we must normalize the inputs to make them take on values from 0 to 1, and use the logistic activation function in the output layer so the outputs also take on values from 0 to 1. TensorFlow's sigmoid_cross_entropy_with_logits() function takes care of efficiently applying the logistic (sigmoid) activation function to the outputs and computing the cross entropy:

[...]
logits = tf.layers.dense(hidden1, n_outputs)
outputs = tf.nn.sigmoid(logits)

xentropy = tf.nn.sigmoid_cross_entropy_with_logits(labels=X, logits=logits)
reconstruction_loss = tf.reduce_sum(xentropy)

Note that the outputs operation is not needed during training (we use it only when we want to look at the reconstructions).

Variational Autoencoders

Another important category of autoencoders was introduced in 2014 by Diederik Kingma and Max Welling,5 and has quickly become one of the most popular types of autoencoders: variational autoencoders.

They are quite different from all the autoencoders we have discussed so far, in particular:

They are probabilistic autoencoders, meaning that their outputs are partly determined by chance, even after training (as opposed to denoising autoencoders, which use randomness only during training).

Most importantly, they are generative autoencoders, meaning that they can generate new instances that look like they were sampled from the training set.

Both these properties make them rather similar to RBMs (see Appendix E), but they are easier to train and the sampling process is much faster (with RBMs you need to wait for the network to stabilize into a "thermal equilibrium" before you can sample a new instance).

Let's take a look at how they work. Figure 15-11 (left) shows a variational autoencoder. You can recognize, of course, the basic structure of all autoencoders, with an encoder followed by a decoder (in this example, they both have two hidden layers), but there is a twist: instead of directly producing a coding for a given input, the encoder produces a mean coding μ and a standard deviation σ. The actual coding is then sampled randomly from a Gaussian distribution with mean μ and standard deviation σ. After that the decoder just decodes the sampled coding normally. The right part of the diagram shows a training instance going through this autoencoder. First, the encoder produces μ and σ, then a coding is sampled randomly (notice that it is not exactly located at μ), and finally this coding is decoded, and the final output resembles the training instance.

Figure 15-11. Variational autoencoder (left), and an instance going through it (right)

As you can see on the diagram, although the inputs may have a very convoluted distribution, a variational autoencoder tends to produce codings that look as though they were sampled from a simple Gaussian distribution:6 during training, the cost function (discussed next) pushes the codings to gradually migrate within the coding space (also called the latent space) to occupy a roughly (hyper)spherical region that looks like a cloud of Gaussian points. One great consequence is that after training a variational autoencoder, you can very easily generate a new instance: just sample a random coding from the Gaussian distribution, decode it, and voilà!

So let's look at the cost function. It is composed of two parts. The first is the usual reconstruction loss that pushes the autoencoder to reproduce its inputs (we can use cross entropy for this, as discussed earlier). The second is the latent loss that pushes the autoencoder to have codings that look as though they were sampled from a simple Gaussian distribution, for which we use the KL divergence between the target distribution (the Gaussian distribution) and the actual distribution of the codings. The math is a bit more complex than earlier, in particular because of the Gaussian noise, which limits the amount of information that can be transmitted to the coding layer (thus pushing the autoencoder to learn useful features). Luckily, the equations simplify to the following code for the latent loss:7

eps = 1e-10  # smoothing term to avoid computing log(0) which is NaN
latent_loss = 0.5 * tf.reduce_sum(
    tf.square(hidden3_sigma) + tf.square(hidden3_mean)
    - 1 - tf.log(eps + tf.square(hidden3_sigma)))

One common variant is to train the encoder to output γ = log(σ²) rather than σ. Wherever we need σ we can just compute σ = exp(γ / 2). This makes it a bit easier for the encoder to capture sigmas of different scales, and thus it helps speed up convergence. The latent loss ends up a bit simpler:

latent_loss = 0.5 * tf.reduce_sum(
    tf.exp(hidden3_gamma) + tf.square(hidden3_mean) - 1 - hidden3_gamma)

The following code builds the variational autoencoder shown in Figure 15-11 (left), using the log(σ²) variant:

from functools import partial

n_inputs = 28 * 28
n_hidden1 = 500
n_hidden2 = 500
n_hidden3 = 20  # codings
n_hidden4 = n_hidden2
n_hidden5 = n_hidden1
n_outputs = n_inputs

learning_rate = 0.001

initializer = tf.contrib.layers.variance_scaling_initializer()
my_dense_layer = partial(
    tf.layers.dense,
    activation=tf.nn.elu,
    kernel_initializer=initializer)

X = tf.placeholder(tf.float32, [None, n_inputs])
hidden1 = my_dense_layer(X, n_hidden1)
hidden2 = my_dense_layer(hidden1, n_hidden2)
hidden3_mean = my_dense_layer(hidden2, n_hidden3, activation=None)
hidden3_gamma = my_dense_layer(hidden2, n_hidden3, activation=None)
noise = tf.random_normal(tf.shape(hidden3_gamma), dtype=tf.float32)
hidden3 = hidden3_mean + tf.exp(0.5 * hidden3_gamma) * noise
hidden4 = my_dense_layer(hidden3, n_hidden4)
hidden5 = my_dense_layer(hidden4, n_hidden5)
logits = my_dense_layer(hidden5, n_outputs, activation=None)
outputs = tf.sigmoid(logits)

xentropy = tf.nn.sigmoid_cross_entropy_with_logits(labels=X, logits=logits)
reconstruction_loss = tf.reduce_sum(xentropy)
latent_loss = 0.5 * tf.reduce_sum(
    tf.exp(hidden3_gamma) + tf.square(hidden3_mean) - 1 - hidden3_gamma)
loss = reconstruction_loss + latent_loss

optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(loss)

init = tf.global_variables_initializer()
saver = tf.train.Saver()

Generating Digits

Now let's use this variational autoencoder to generate images that look like handwritten digits. All we need to do is train the model, then sample random codings from a Gaussian distribution and decode them.

import numpy as np

n_digits = 60
n_epochs = 50
batch_size = 150

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        n_batches = mnist.train.num_examples // batch_size
        for iteration in range(n_batches):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run(training_op, feed_dict={X: X_batch})

    codings_rnd = np.random.normal(size=[n_digits, n_hidden3])
    outputs_val = outputs.eval(feed_dict={hidden3: codings_rnd})

That's it. Now we can see what the "handwritten" digits produced by the autoencoder look like (see Figure 15-12):

for iteration in range(n_digits):
    plt.subplot(n_digits, 10, iteration + 1)
    plot_image(outputs_val[iteration])

Figure 15-12. Images of handwritten digits generated by the variational autoencoder

A majority of these digits look pretty convincing, while a few are rather "creative." But don't be too harsh on the autoencoder; it only started learning less than an hour ago. Give it a bit more training time, and those digits will look better and better.

Other Autoencoders

The amazing successes of supervised learning in image recognition, speech recognition, text translation, and more have somewhat overshadowed unsupervised learning, but it is actually booming. New architectures for autoencoders and other unsupervised learning algorithms are invented regularly, so much so that we cannot cover them all in this book. Here is a brief (by no means exhaustive) overview of a few more types of autoencoders that you may want to check out:

Contractive autoencoder (CAE)8
The autoencoder is constrained during training so that the derivatives of the codings with regard to the inputs are small. In other words, two similar inputs must have similar codings.

Stacked convolutional autoencoders9
Autoencoders that learn to extract visual features by reconstructing images processed through convolutional layers.

Generative stochastic network (GSN)10
A generalization of denoising autoencoders, with the added capability to generate data.

Winner-take-all (WTA) autoencoder11
During training, after computing the activations of all the neurons in the coding layer, only the top k% activations for each neuron over the training batch are preserved, and the rest are set to zero. Naturally this leads to sparse codings. Moreover, a similar WTA approach can be used to produce sparse convolutional autoencoders.

Adversarial autoencoders12
One network is trained to reproduce its inputs, and at the same time another is trained to find inputs that the first network is unable to properly reconstruct. This pushes the first autoencoder to learn robust codings.

Exercises

1. What are the main tasks that autoencoders are used for?

2. Suppose you want to train a classifier and you have plenty of unlabeled training data, but only a few thousand labeled instances. How can autoencoders help? How would you proceed?

3. If an autoencoder perfectly reconstructs the inputs, is it necessarily a good autoencoder? How can you evaluate the performance of an autoencoder?

4. What are undercomplete and overcomplete autoencoders? What is the main risk of an excessively undercomplete autoencoder? What about the main risk of an overcomplete autoencoder?

5. How do you tie weights in a stacked autoencoder? What is the point of doing so?

6. What is a common technique to visualize features learned by the lower layer of a stacked autoencoder? What about higher layers?

7. What is a generative model? Can you name a type of generative autoencoder?

8. Let's use a denoising autoencoder to pretrain an image classifier:

You can use MNIST (simplest), or another large set of images such as CIFAR10 if you want a bigger challenge. If you choose CIFAR10, you need to write code to load batches of images for training. If you want to skip this part, TensorFlow's model zoo contains tools to do just that.

Split the dataset into a training set and a test set. Train a deep denoising autoencoder on the full training set.

Check that the images are fairly well reconstructed, and visualize the low-level features. Visualize the images that most activate each neuron in the coding layer.

Build a classification deep neural network, reusing the lower layers of the autoencoder. Train it using only 10% of the training set. Can you get it to perform as well as the same classifier trained on the full training set?

9. Semantic hashing, introduced in 2008 by Ruslan Salakhutdinov and Geoffrey Hinton,13 is a technique used for efficient information retrieval: a document (e.g., an image) is passed through a system, typically a neural network, which outputs a fairly low-dimensional binary vector (e.g., 30 bits). Two similar documents are likely to have identical or very similar hashes. By indexing each document using its hash, it is possible to retrieve many documents similar to a particular document almost instantly, even if there are billions of documents: just compute the hash of the document and look up all documents with that same hash (or hashes differing by just one or two bits). Let's implement semantic hashing using a slightly tweaked stacked autoencoder:

Create a stacked autoencoder containing two hidden layers below the coding layer, and train it on the image dataset you used in the previous exercise. The coding layer should contain 30 neurons and use the logistic activation function to output values between 0 and 1. After training, to produce the hash of an image, you can simply run it through the autoencoder, take the output of the coding layer, and round every value to the closest integer (0 or 1).

One neat trick proposed by Salakhutdinov and Hinton is to add Gaussian noise (with zero mean) to the inputs of the coding layer, during training only. In order to preserve a high signal-to-noise ratio, the autoencoder will learn to feed large values to the coding layer (so that the noise becomes negligible). In turn, this means that the logistic function of the coding layer will likely saturate at 0 or 1. As a result, rounding the codings to 0 or 1 won't distort them too much, and this will improve the reliability of the hashes.

Compute the hash of every image, and see if images with identical hashes look alike. Since MNIST and CIFAR10 are labeled, a more objective way to measure the performance of the autoencoder for semantic hashing is to ensure that images with the same hash generally have the same class. One way to do this is to measure the average Gini purity (introduced in Chapter 6) of the sets of images with identical (or very similar) hashes.

Try fine-tuning the hyperparameters using cross-validation.

Note that with a labeled dataset, another approach is to train a convolutional neural network (see Chapter 13) for classification, then use the layer below the output layer to produce the hashes. See Jinma Gua and Jianmin Li's 2015 paper.14 See if that performs better.

10. Train a variational autoencoder on the image dataset used in the previous exercises (MNIST or CIFAR10), and make it generate images. Alternatively, you can try to find an unlabeled dataset that you are interested in and see if you can generate new samples.

Solutions to these exercises are available in Appendix A.

"Perception in chess," W. Chase and H. Simon (1973).

"Greedy Layer-Wise Training of Deep Networks," Y. Bengio et al. (2007).

"Extracting and Composing Robust Features with Denoising Autoencoders," P. Vincent et al. (2008).

"Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion," P. Vincent et al. (2010).

"Auto-Encoding Variational Bayes," D. Kingma and M. Welling (2014).

Variational autoencoders are actually more general; the codings are not limited to Gaussian distributions.

For more mathematical details, check out the original paper on variational autoencoders, or Carl Doersch's great tutorial (2016).

"Contractive Auto-Encoders: Explicit Invariance During Feature Extraction," S. Rifai et al. (2011).

"Stacked Convolutional Auto-Encoders for Hierarchical Feature Extraction," J. Masci et al. (2011).

"GSNs: Generative Stochastic Networks," G. Alain et al. (2015).

"Winner-Take-All Autoencoders," A. Makhzani and B. Frey (2015).

"Adversarial Autoencoders," A. Makhzani et al. (2016).

"Semantic Hashing," R. Salakhutdinov and G. Hinton (2008).

"CNN Based Hashing for Image Retrieval," J. Gua and J. Li (2015).


Chapter 16. Reinforcement Learning

Reinforcement Learning (RL) is one of the most exciting fields of Machine Learning today, and also one of the oldest. It has been around since the 1950s, producing many interesting applications over the years,1 in particular in games (e.g., TD-Gammon, a Backgammon playing program) and in machine control, but seldom making the headline news. But a revolution took place in 2013 when researchers from an English startup called DeepMind demonstrated a system that could learn to play just about any Atari game from scratch,2 eventually outperforming humans3 in most of them, using only raw pixels as inputs and without any prior knowledge of the rules of the games.4 This was the first of a series of amazing feats, culminating in March 2016 with the victory of their system AlphaGo against Lee Sedol, the world champion of the game of Go. No program had ever come close to beating a master of this game, let alone the world champion. Today the whole field of RL is boiling with new ideas, with a wide range of applications. DeepMind was bought by Google for over 500 million dollars in 2014.

So how did they do it? With hindsight it seems rather simple: they applied the power of Deep Learning to the field of Reinforcement Learning, and it worked beyond their wildest dreams. In this chapter we will first explain what Reinforcement Learning is and what it is good at, and then we will present two of the most important techniques in deep Reinforcement Learning: policy gradients and deep Q-networks (DQN), including a discussion of Markov decision processes (MDP). We will use these techniques to train a model to balance a pole on a moving cart, and another to play Atari games. The same techniques can be used for a wide variety of tasks, from walking robots to self-driving cars.

Learning to Optimize Rewards

In Reinforcement Learning, a software agent makes observations and takes actions within an environment, and in return it receives rewards. Its objective is to learn to act in a way that will maximize its expected long-term rewards. If you don't mind a bit of anthropomorphism, you can think of positive rewards as pleasure, and negative rewards as pain (the term "reward" is a bit misleading in this case). In short, the agent acts in the environment and learns by trial and error to maximize its pleasure and minimize its pain.

This is quite a broad setting, which can apply to a wide variety of tasks. Here are a few examples (see Figure 16-1):

1. The agent can be the program controlling a walking robot. In this case, the environment is the real world, the agent observes the environment through a set of sensors such as cameras and touch sensors, and its actions consist of sending signals to activate motors. It may be programmed to get positive rewards whenever it approaches the target destination, and negative rewards whenever it wastes time, goes in the wrong direction, or falls down.

2. The agent can be the program controlling Ms. Pac-Man. In this case, the environment is a simulation of the Atari game, the actions are the nine possible joystick positions (upper left, down, center, and so on), the observations are screenshots, and the rewards are just the game points.

3. Similarly, the agent can be the program playing a board game such as the game of Go.

4. The agent does not have to control a physically (or virtually) moving thing. For example, it can be a smart thermostat, getting rewards whenever it is close to the target temperature and saves energy, and negative rewards when humans need to tweak the temperature, so the agent must learn to anticipate human needs.

5. The agent can observe stock market prices and decide how much to buy or sell every second. Rewards are obviously the monetary gains and losses.

Figure 16-1. Reinforcement Learning examples: (a) walking robot, (b) Ms. Pac-Man, (c) Go player, (d) thermostat, (e) automatic trader5

Note that there may not be any positive rewards at all; for example, the agent may move around in a maze, getting a negative reward at every time step, so it better find the exit as quickly as possible! There are many other examples of tasks where Reinforcement Learning is well suited, such as self-driving cars, placing ads on a web page, or controlling where an image classification system should focus its attention.

Policy Search

The algorithm used by the software agent to determine its actions is called its policy. For example, the policy could be a neural network taking observations as inputs and outputting the action to take (see Figure 16-2).

Figure 16-2. Reinforcement Learning using a neural network policy

The policy can be any algorithm you can think of, and it does not even have to be deterministic. For example, consider a robotic vacuum cleaner whose reward is the amount of dust it picks up in 30 minutes. Its policy could be to move forward with some probability p every second, or randomly rotate left or right with probability 1 – p. The rotation angle would be a random angle between –r and +r. Since this policy involves some randomness, it is called a stochastic policy. The robot will have an erratic trajectory, which guarantees that it will eventually get to any place it can reach and pick up all the dust. The question is: how much dust will it pick up in 30 minutes?

How would you train such a robot? There are just two policy parameters you can tweak: the probability p and the angle range r. One possible learning algorithm could be to try out many different values for these parameters, and pick the combination that performs best (see Figure 16-3). This is an example of policy search, in this case using a brute force approach. However, when the policy space is too large (which is generally the case), finding a good set of parameters this way is like searching for a needle in a gigantic haystack.

Another way to explore the policy space is to use genetic algorithms. For example, you could randomly create a first generation of 100 policies and try them out, then "kill" the 80 worst policies6 and make the 20 survivors produce 4 offspring each. An offspring is just a copy of its parent7 plus some random variation. The surviving policies plus their offspring together constitute the second generation. You can continue to iterate through generations this way, until you find a good policy.

Figure 16-3. Four points in policy space and the agent's corresponding behavior

Yet another approach is to use optimization techniques, by evaluating the gradients of the rewards with regard to the policy parameters, then tweaking these parameters by following the gradient toward higher rewards (gradient ascent). This approach is called policy gradients (PG), which we will discuss in more detail later in this chapter. For example, going back to the vacuum cleaner robot, you could slightly increase p and evaluate whether this increases the amount of dust picked up by the robot in 30 minutes; if it does, then increase p some more, or else reduce p. We will implement a popular PG algorithm using TensorFlow, but before we do we need to create an environment for the agent to live in, so it's time to introduce OpenAI gym.
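For concreteness, here is a minimal sketch of the brute force search and the genetic approach for the vacuum cleaner example. It is purely illustrative: evaluate_policy() is a hypothetical stand-in for running the robot (or a simulator) for 30 minutes with the given parameters, and the parameter ranges and mutation scales are arbitrary choices:

import numpy as np

def evaluate_policy(p, r):
    """Hypothetical stand-in: in reality you would run the robot (or a simulator)
    for 30 minutes with parameters (p, r) and return the dust collected."""
    return np.random.rand()  # dummy score so the sketch runs end to end

# Brute force policy search: try a grid of parameter values, keep the best.
best_score, best_params = -np.inf, None
for p in np.linspace(0.1, 0.9, 9):
    for r in np.linspace(10, 180, 18):
        score = evaluate_policy(p, r)
        if score > best_score:
            best_score, best_params = score, (p, r)

# Genetic approach: keep the 20 best policies, each produces 4 mutated offspring.
population = [np.array([np.random.uniform(0.1, 0.9), np.random.uniform(10, 180)])
              for _ in range(100)]
for generation in range(10):
    scores = [evaluate_policy(p, r) for p, r in population]
    survivors = [population[i] for i in np.argsort(scores)[-20:]]
    offspring = [parent + np.random.normal(scale=[0.05, 5.0])
                 for parent in survivors for _ in range(4)]
    population = survivors + offspring  # survivors plus offspring: next generation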

Introduction to OpenAI Gym

One of the challenges of Reinforcement Learning is that in order to train an agent, you first need to have a working environment. If you want to program an agent that will learn to play an Atari game, you will need an Atari game simulator. If you want to program a walking robot, then the environment is the real world and you can directly train your robot in that environment, but this has its limits: if the robot falls off a cliff, you can't just click "undo." You can't speed up time either; adding more computing power won't make the robot move any faster. And it's generally too expensive to train 1,000 robots in parallel. In short, training is hard and slow in the real world, so you generally need a simulated environment at least to bootstrap training.

OpenAI gym8 is a toolkit that provides a wide variety of simulated environments (Atari games, board games, 2D and 3D physical simulations, and so on), so you can train agents, compare them, or develop new RL algorithms.

Let's install OpenAI gym. For a minimal OpenAI gym installation, simply use pip:

$ pip3 install --upgrade gym

Next open up a Python shell or a Jupyter notebook and create your first environment:

>>> import gym
>>> env = gym.make("CartPole-v0")
[2016-10-14 16:03:23,199] Making new env: CartPole-v0
>>> obs = env.reset()
>>> obs
array([-0.03799846, -0.03288115,  0.02337094,  0.00720711])
>>> env.render()

The make() function creates an environment, in this case a CartPole environment. This is a 2D simulation in which a cart can be accelerated left or right in order to balance a pole placed on top of it (see Figure 16-4). After the environment is created, we must initialize it using the reset() method. This returns the first observation. Observations depend on the type of environment. For the CartPole environment, each observation is a 1D NumPy array containing four floats: these floats represent the cart's horizontal position (0.0 = center), its velocity, the angle of the pole (0.0 = vertical), and its angular velocity. Finally, the render() method displays the environment as shown in Figure 16-4.

Figure 16-4. The CartPole environment

If you want render() to return the rendered image as a NumPy array, you can set the mode parameter to rgb_array (note that other environments may support different modes):

>>> img = env.render(mode="rgb_array")
>>> img.shape  # height, width, channels (3 = RGB)
(400, 600, 3)

TIP
Unfortunately, the CartPole (and a few other environments) renders the image to the screen even if you set the mode to "rgb_array". The only way to avoid this is to use a fake X server such as Xvfb or Xdummy. For example, you can install Xvfb and start Python using the following command: xvfb-run -s "-screen 0 1400x900x24" python. Or use the xvfbwrapper package.

Let’sasktheenvironmentwhatactionsarepossible:

>>>env.action_space

Discrete(2)

Discrete(2)meansthatthepossibleactionsareintegers0and1,whichrepresentacceleratingleft(0)orright(1).Otherenvironmentsmayhavemorediscreteactions,orotherkindsofactions(e.g.,continuous).Sincethepoleisleaningtowardtheright,let’sacceleratethecarttowardtheright:

>>>action=1#accelerateright

>>>obs,reward,done,info=env.step(action)

>>>obs

array([-0.03865608,0.16189797,0.02351508,-0.27801135])

>>>reward

1.0

>>>done

False

>>>info

{}

Thestep()methodexecutesthegivenactionandreturnsfourvalues:

obs

Thisisthenewobservation.Thecartisnowmovingtowardtheright(obs[1]>0).Thepoleisstilltiltedtowardtheright(obs[2]>0),butitsangularvelocityisnownegative(obs[3]<0),soitwilllikelybetiltedtowardtheleftafterthenextstep.

reward

Inthisenvironment,yougetarewardof1.0ateverystep,nomatterwhatyoudo,sothegoalistokeeprunningaslongaspossible.

done

ThisvaluewillbeTruewhentheepisodeisover.Thiswillhappenwhenthepoletiltstoomuch.Afterthat,theenvironmentmustberesetbeforeitcanbeusedagain.

info

Thisdictionarymayprovideextradebuginformationinotherenvironments.Thisdatashouldnotbeusedfortraining(itwouldbecheating).

Let’shardcodeasimplepolicythatacceleratesleftwhenthepoleisleaningtowardtheleftandacceleratesrightwhenthepoleisleaningtowardtheright.Wewillrunthispolicytoseetheaveragerewardsitgetsover500episodes:

defbasic_policy(obs):

angle=obs[2]

return0ifangle<0else1

totals=[]

forepisodeinrange(500):

episode_rewards=0

obs=env.reset()

forstepinrange(1000):#1000stepsmax,wedon'twanttorunforever

action=basic_policy(obs)

obs,reward,done,info=env.step(action)

episode_rewards+=reward

ifdone:

break

totals.append(episode_rewards)

Thiscodeishopefullyself-explanatory.Let’slookattheresult:

>>>importnumpyasnp

>>>np.mean(totals),np.std(totals),np.min(totals),np.max(totals)

(42.125999999999998,9.1237121830974033,24.0,68.0)

Evenwith500tries,thispolicynevermanagedtokeepthepoleuprightformorethan68consecutivesteps.Notgreat.IfyoulookatthesimulationintheJupyternotebooks,youwillseethatthecartoscillatesleftandrightmoreandmorestronglyuntilthepoletiltstoomuch.Let’sseeifaneuralnetworkcancomeupwithabetterpolicy.

Neural Network Policies

Let's create a neural network policy. Just like the policy we hardcoded earlier, this neural network will take an observation as input, and it will output the action to be executed. More precisely, it will estimate a probability for each action, and then we will select an action randomly according to the estimated probabilities (see Figure 16-5). In the case of the CartPole environment, there are just two possible actions (left or right), so we only need one output neuron. It will output the probability p of action 0 (left), and of course the probability of action 1 (right) will be 1 – p. For example, if it outputs 0.7, then we will pick action 0 with 70% probability, and action 1 with 30% probability.

Figure 16-5. Neural network policy

You may wonder why we are picking a random action based on the probability given by the neural network, rather than just picking the action with the highest score. This approach lets the agent find the right balance between exploring new actions and exploiting the actions that are known to work well. Here's an analogy: suppose you go to a restaurant for the first time, and all the dishes look equally appealing so you randomly pick one. If it turns out to be good, you can increase the probability to order it next time, but you shouldn't increase that probability up to 100%, or else you will never try out the other dishes, some of which may be even better than the one you tried.

Also note that in this particular environment, the past actions and observations can safely be ignored, since each observation contains the environment's full state. If there were some hidden state, then you may need to consider past actions and observations as well. For example, if the environment only revealed the position of the cart but not its velocity, you would have to consider not only the current observation but also the previous observation in order to estimate the current velocity. Another example is when the observations are noisy; in that case, you generally want to use the past few observations to estimate the most likely current state. The CartPole problem is thus as simple as can be; the observations are noise-free and they contain the environment's full state.

Here is the code to build this neural network policy using TensorFlow:

import tensorflow as tf

# 1. Specify the neural network architecture
n_inputs = 4  # == env.observation_space.shape[0]
n_hidden = 4  # it's a simple task, we don't need more hidden neurons
n_outputs = 1  # only outputs the probability of accelerating left
initializer = tf.contrib.layers.variance_scaling_initializer()

# 2. Build the neural network
X = tf.placeholder(tf.float32, shape=[None, n_inputs])
hidden = tf.layers.dense(X, n_hidden, activation=tf.nn.elu,
                         kernel_initializer=initializer)
logits = tf.layers.dense(hidden, n_outputs,
                         kernel_initializer=initializer)
outputs = tf.nn.sigmoid(logits)

# 3. Select a random action based on the estimated probabilities
p_left_and_right = tf.concat(axis=1, values=[outputs, 1 - outputs])
action = tf.multinomial(tf.log(p_left_and_right), num_samples=1)

init = tf.global_variables_initializer()

Let's go through this code:

1. After the imports, we define the neural network architecture. The number of inputs is the size of the observation space (which in the case of the CartPole is four), we just have four hidden units and no need for more, and we have just one output probability (the probability of going left).

2. Next we build the neural network. In this example, it's a vanilla Multi-Layer Perceptron, with a single output. Note that the output layer uses the logistic (sigmoid) activation function in order to output a probability from 0.0 to 1.0. If there were more than two possible actions, there would be one output neuron per action, and you would use the softmax activation function instead.

3. Lastly, we call the multinomial() function to pick a random action. This function independently samples one (or more) integers, given the log probability of each integer. For example, if you call it with the array [np.log(0.5), np.log(0.2), np.log(0.3)] and with num_samples=5, then it will output five integers, each of which will have a 50% probability of being 0, 20% of being 1, and 30% of being 2. In our case we just need one integer representing the action to take. Since the outputs tensor only contains the probability of going left, we must first concatenate 1 - outputs to it to have a tensor containing the probability of both left and right actions. Note that if there were more than two possible actions, the neural network would have to output one probability per action so you would not need the concatenation step.

Okay, we now have a neural network policy that will take observations and output actions. But how do we train it?

Evaluating Actions: The Credit Assignment Problem

If we knew what the best action was at each step, we could train the neural network as usual, by minimizing the cross entropy between the estimated probability and the target probability. It would just be regular supervised learning. However, in Reinforcement Learning the only guidance the agent gets is through rewards, and rewards are typically sparse and delayed. For example, if the agent manages to balance the pole for 100 steps, how can it know which of the 100 actions it took were good, and which of them were bad? All it knows is that the pole fell after the last action, but surely this last action is not entirely responsible. This is called the credit assignment problem: when the agent gets a reward, it is hard for it to know which actions should get credited (or blamed) for it. Think of a dog that gets rewarded hours after it behaved well; will it understand what it is rewarded for?

To tackle this problem, a common strategy is to evaluate an action based on the sum of all the rewards that come after it, usually applying a discount rate r at each step. For example (see Figure 16-6), if an agent decides to go right three times in a row and gets +10 reward after the first step, 0 after the second step, and finally –50 after the third step, then assuming we use a discount rate r = 0.8, the first action will have a total score of 10 + r × 0 + r² × (–50) = –22. If the discount rate is close to 0, then future rewards won't count for much compared to immediate rewards. Conversely, if the discount rate is close to 1, then rewards far into the future will count almost as much as immediate rewards. Typical discount rates are 0.95 or 0.99. With a discount rate of 0.95, rewards 13 steps into the future count roughly for half as much as immediate rewards (since 0.95¹³ ≈ 0.5), while with a discount rate of 0.99, rewards 69 steps into the future count for half as much as immediate rewards. In the CartPole environment, actions have fairly short-term effects, so choosing a discount rate of 0.95 seems reasonable.

Figure 16-6. Discounted rewards
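As a quick sanity check of the arithmetic in this example (this snippet is just an illustration, not part of the chapter's training code):

rewards = [10, 0, -50]
discount_rate = 0.8
first_action_score = sum(reward * discount_rate ** step
                         for step, reward in enumerate(rewards))
print(first_action_score)  # 10 + 0.8 * 0 + 0.8**2 * (-50) = -22.0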

Of course, a good action may be followed by several bad actions that cause the pole to fall quickly, resulting in the good action getting a low score (similarly, a good actor may sometimes star in a terrible movie). However, if we play the game enough times, on average good actions will get a better score than bad ones. So, to get fairly reliable action scores, we must run many episodes and normalize all the action scores (by subtracting the mean and dividing by the standard deviation). After that, we can reasonably assume that actions with a negative score were bad while actions with a positive score were good. Perfect: now that we have a way to evaluate each action, we are ready to train our first agent using policy gradients. Let's see how.

Policy Gradients

As discussed earlier, PG algorithms optimize the parameters of a policy by following the gradients toward higher rewards. One popular class of PG algorithms, called REINFORCE algorithms, was introduced back in 1992 by Ronald Williams.9 Here is one common variant:

1. First, let the neural network policy play the game several times and at each step compute the gradients that would make the chosen action even more likely, but don't apply these gradients yet.

2. Once you have run several episodes, compute each action's score (using the method described in the previous paragraph).

3. If an action's score is positive, it means that the action was good and you want to apply the gradients computed earlier to make the action even more likely to be chosen in the future. However, if the score is negative, it means the action was bad and you want to apply the opposite gradients to make this action slightly less likely in the future. The solution is simply to multiply each gradient vector by the corresponding action's score.

4. Finally, compute the mean of all the resulting gradient vectors, and use it to perform a Gradient Descent step.

Let’simplementthisalgorithmusingTensorFlow.Wewilltraintheneuralnetworkpolicywebuiltearliersothatitlearnstobalancethepoleonthecart.Let’sstartbycompletingtheconstructionphasewecodedearliertoaddthetargetprobability,thecostfunction,andthetrainingoperation.Sinceweareactingasthoughthechosenactionisthebestpossibleaction,thetargetprobabilitymustbe1.0ifthechosenactionisaction0(left)and0.0ifitisaction1(right):

y=1.-tf.to_float(action)

Nowthatwehaveatargetprobability,wecandefinethecostfunction(crossentropy)andcomputethegradients:

learning_rate=0.01

cross_entropy=tf.nn.sigmoid_cross_entropy_with_logits(labels=y,

logits=logits)

optimizer=tf.train.AdamOptimizer(learning_rate)

grads_and_vars=optimizer.compute_gradients(cross_entropy)

Notethatwearecallingtheoptimizer’scompute_gradients()methodinsteadoftheminimize()method.Thisisbecausewewanttotweakthegradientsbeforeweapplythem.10Thecompute_gradients()methodreturnsalistofgradientvector/variablepairs(onepairpertrainablevariable).Let’sputallthegradientsinalist,tomakeitmoreconvenienttoobtaintheirvalues:

gradients=[gradforgrad,variableingrads_and_vars]

Okay, now comes the tricky part. During the execution phase, the algorithm will run the policy and at each step it will evaluate these gradient tensors and store their values. After a number of episodes it will tweak these gradients as explained earlier (i.e., multiply them by the action scores and normalize them) and compute the mean of the tweaked gradients. Next, it will need to feed the resulting gradients back to the optimizer so that it can perform an optimization step. This means we need one placeholder per gradient vector. Moreover, we must create the operation that will apply the updated gradients. For this we will call the optimizer's apply_gradients() function, which takes a list of gradient vector/variable pairs. Instead of giving it the original gradient vectors, we will give it a list containing the updated gradients (i.e., the ones fed through the gradient placeholders):

gradient_placeholders = []
grads_and_vars_feed = []
for grad, variable in grads_and_vars:
    gradient_placeholder = tf.placeholder(tf.float32, shape=grad.get_shape())
    gradient_placeholders.append(gradient_placeholder)
    grads_and_vars_feed.append((gradient_placeholder, variable))
training_op = optimizer.apply_gradients(grads_and_vars_feed)

Let’sstepbackandtakealookatthefullconstructionphase:

n_inputs=4

n_hidden=4

n_outputs=1

initializer=tf.contrib.layers.variance_scaling_initializer()

learning_rate=0.01

X=tf.placeholder(tf.float32,shape=[None,n_inputs])

hidden=tf.layers.dense(X,n_hidden,activation=tf.nn.elu,

kernel_initializer=initializer)

logits=tf.layers.dense(hidden,n_outputs,

kernel_initializer=initializer)

outputs=tf.nn.sigmoid(logits)

p_left_and_right=tf.concat(axis=1,values=[outputs,1-outputs])

action=tf.multinomial(tf.log(p_left_and_right),num_samples=1)

y=1.-tf.to_float(action)

cross_entropy=tf.nn.sigmoid_cross_entropy_with_logits(

labels=y,logits=logits)

optimizer=tf.train.AdamOptimizer(learning_rate)

grads_and_vars=optimizer.compute_gradients(cross_entropy)

gradients=[gradforgrad,variableingrads_and_vars]

gradient_placeholders=[]

grads_and_vars_feed=[]

forgrad,variableingrads_and_vars:

gradient_placeholder=tf.placeholder(tf.float32,shape=grad.get_shape())

gradient_placeholders.append(gradient_placeholder)

grads_and_vars_feed.append((gradient_placeholder,variable))

training_op=optimizer.apply_gradients(grads_and_vars_feed)

init=tf.global_variables_initializer()

saver=tf.train.Saver()

On to the execution phase! We will need a couple of functions to compute the total discounted rewards, given the raw rewards, and to normalize the results across multiple episodes:

def discount_rewards(rewards, discount_rate):
    discounted_rewards = np.empty(len(rewards))
    cumulative_rewards = 0
    for step in reversed(range(len(rewards))):
        cumulative_rewards = rewards[step] + cumulative_rewards * discount_rate
        discounted_rewards[step] = cumulative_rewards
    return discounted_rewards

def discount_and_normalize_rewards(all_rewards, discount_rate):
    all_discounted_rewards = [discount_rewards(rewards, discount_rate)
                              for rewards in all_rewards]
    flat_rewards = np.concatenate(all_discounted_rewards)
    reward_mean = flat_rewards.mean()
    reward_std = flat_rewards.std()
    return [(discounted_rewards - reward_mean) / reward_std
            for discounted_rewards in all_discounted_rewards]

Let’scheckthatthisworks:

>>>discount_rewards([10,0,-50],discount_rate=0.8)

array([-22.,-40.,-50.])

>>>discount_and_normalize_rewards([[10,0,-50],[10,20]],discount_rate=0.8)

[array([-0.28435071,-0.86597718,-1.18910299]),

array([1.26665318,1.0727777])]

Thecalltodiscount_rewards()returnsexactlywhatweexpect(seeFigure16-6).Youcanverifythatthefunctiondiscount_and_normalize_rewards()doesindeedreturnthenormalizedscoresforeachactioninbothepisodes.Noticethatthefirstepisodewasmuchworsethanthesecond,soitsnormalizedscoresareallnegative;allactionsfromthefirstepisodewouldbeconsideredbad,andconverselyallactionsfromthesecondepisodewouldbeconsideredgood.

We now have all we need to train the policy:

n_iterations = 250       # number of training iterations
n_max_steps = 1000       # max steps per episode
n_games_per_update = 10  # train the policy every 10 episodes
save_iterations = 10     # save the model every 10 training iterations
discount_rate = 0.95

with tf.Session() as sess:
    init.run()
    for iteration in range(n_iterations):
        all_rewards = []    # all sequences of raw rewards for each episode
        all_gradients = []  # gradients saved at each step of each episode
        for game in range(n_games_per_update):
            current_rewards = []    # all raw rewards from the current episode
            current_gradients = []  # all gradients from the current episode
            obs = env.reset()
            for step in range(n_max_steps):
                action_val, gradients_val = sess.run(
                    [action, gradients],
                    feed_dict={X: obs.reshape(1, n_inputs)})  # one obs
                obs, reward, done, info = env.step(action_val[0][0])
                current_rewards.append(reward)
                current_gradients.append(gradients_val)
                if done:
                    break
            all_rewards.append(current_rewards)
            all_gradients.append(current_gradients)

        # At this point we have run the policy for 10 episodes, and we are
        # ready for a policy update using the algorithm described earlier.
        all_rewards = discount_and_normalize_rewards(all_rewards, discount_rate)
        feed_dict = {}
        for var_index, grad_placeholder in enumerate(gradient_placeholders):
            # multiply the gradients by the action scores, and compute the mean
            mean_gradients = np.mean(
                [reward * all_gradients[game_index][step][var_index]
                 for game_index, rewards in enumerate(all_rewards)
                 for step, reward in enumerate(rewards)],
                axis=0)
            feed_dict[grad_placeholder] = mean_gradients
        sess.run(training_op, feed_dict=feed_dict)
        if iteration % save_iterations == 0:
            saver.save(sess, "./my_policy_net_pg.ckpt")

Each training iteration starts by running the policy for 10 episodes (with maximum 1,000 steps per episode, to avoid running forever). At each step, we also compute the gradients, pretending that the chosen action was the best. After these 10 episodes have been run, we compute the action scores using the discount_and_normalize_rewards() function; we go through each trainable variable, across all episodes and all steps, to multiply each gradient vector by its corresponding action score; and we compute the mean of the resulting gradients. Finally, we run the training operation, feeding it these mean gradients (one per trainable variable). We also save the model every 10 training operations.

And we're done! This code will train the neural network policy, and it will successfully learn to balance the pole on the cart (you can try it out in the Jupyter notebooks). Note that there are actually two ways the agent can lose the game: either the pole can tilt too much, or the cart can go completely off the screen. With 250 training iterations, the policy learns to balance the pole quite well, but it is not yet good enough at avoiding going off the screen. A few hundred more training iterations will fix that.

TIP
Researchers try to find algorithms that work well even when the agent initially knows nothing about the environment. However, unless you are writing a paper, you should inject as much prior knowledge as possible into the agent, as it will speed up training dramatically. For example, you could add negative rewards proportional to the distance from the center of the screen, and to the pole's angle. Also, if you already have a reasonably good policy (e.g., hardcoded), you may want to train the neural network to imitate it before using policy gradients to improve it.
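As an illustration of this kind of reward shaping, here is a minimal sketch of a penalized reward for CartPole. It assumes the usual CartPole observation layout (cart position, cart velocity, pole angle, pole angular velocity); the helper name shape_reward and the penalty weights are made up for this example, not part of the book's code:

def shape_reward(obs, reward, position_weight=0.1, angle_weight=1.0):
    """Subtract penalties for drifting off-center and for tilting the pole."""
    cart_position, _, pole_angle, _ = obs
    return reward - position_weight * abs(cart_position) - angle_weight * abs(pole_angle)

# Possible usage inside the game loop, right after env.step():
# obs, reward, done, info = env.step(action_val[0][0])
# reward = shape_reward(obs, reward)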

Despite its relative simplicity, this algorithm is quite powerful. You can use it to tackle much harder problems than balancing a pole on a cart. In fact, AlphaGo was based on a similar PG algorithm (plus Monte Carlo Tree Search, which is beyond the scope of this book).

We will now look at another popular family of algorithms. Whereas PG algorithms directly try to optimize the policy to increase rewards, the algorithms we will look at now are less direct: the agent learns to estimate the expected sum of discounted future rewards for each state, or the expected sum of discounted future rewards for each action in each state, then uses this knowledge to decide how to act. To understand these algorithms, we must first introduce Markov decision processes (MDP).

Markov Decision Processes

In the early 20th century, the mathematician Andrey Markov studied stochastic processes with no memory, called Markov chains. Such a process has a fixed number of states, and it randomly evolves from one state to another at each step. The probability for it to evolve from a state s to a state s′ is fixed, and it depends only on the pair (s, s′), not on past states (the system has no memory).

Figure 16-7 shows an example of a Markov chain with four states. Suppose that the process starts in state s0, and there is a 70% chance that it will remain in that state at the next step. Eventually it is bound to leave that state and never come back, since no other state points back to s0. If it goes to state s1, it will then most likely go to state s2 (90% probability), then immediately back to state s1 (with 100% probability). It may alternate a number of times between these two states, but eventually it will fall into state s3 and remain there forever (this is a terminal state). Markov chains can have very different dynamics, and they are heavily used in thermodynamics, chemistry, statistics, and much more.

Figure 16-7. Example of a Markov chain
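If you want to play with this chain yourself, here is a small simulation sketch. Only some of the transition probabilities appear in the text above, so the matrix below fills in the missing entries with plausible illustrative values (70% to stay in s0, 90% for s1 to s2, 100% for s2 back to s1, and s3 absorbing):

import numpy as np

# Rows are current states s0..s3, columns are next states; each row sums to 1.
# Only the probabilities mentioned in the text are authoritative; the rest
# (e.g., how the 30% of leaving s0 is split) are illustrative guesses.
transition_probabilities = np.array([
    [0.7, 0.2, 0.0, 0.1],  # from s0
    [0.0, 0.0, 0.9, 0.1],  # from s1
    [0.0, 1.0, 0.0, 0.0],  # from s2
    [0.0, 0.0, 0.0, 1.0],  # from s3 (terminal)
])

state = 0
for step in range(20):
    print(state, end=" ")
    state = np.random.choice(4, p=transition_probabilities[state])
print()  # e.g., 0 0 0 1 2 1 2 1 3 3 3 ...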

Markov decision processes were first described in the 1950s by Richard Bellman.11 They resemble Markov chains but with a twist: at each step, an agent can choose one of several possible actions, and the transition probabilities depend on the chosen action. Moreover, some state transitions return some reward (positive or negative), and the agent's goal is to find a policy that will maximize rewards over time.

For example, the MDP represented in Figure 16-8 has three states and up to three possible discrete actions at each step. If it starts in state s0, the agent can choose between actions a0, a1, or a2. If it chooses action a1, it just remains in state s0 with certainty, and without any reward. It can thus decide to stay there forever if it wants. But if it chooses action a0, it has a 70% probability of gaining a reward of +10, and remaining in state s0. It can then try again and again to gain as much reward as possible. But at one point it is going to end up instead in state s1. In state s1 it has only two possible actions: a0 or a2. It can choose to stay put by repeatedly choosing action a0, or it can choose to move on to state s2 and get a negative reward of -50 (ouch). In state s2 it has no other choice than to take action a1, which will most likely lead it back to state s0, gaining a reward of +40 on the way. You get the picture. By looking at this MDP, can you guess which strategy will gain the most reward over time? In state s0 it is clear that action a0 is the best option, and in state s2 the agent has no choice but to take action a1, but in state s1 it is not obvious whether the agent should stay put (a0) or go through the fire (a2).

Figure 16-8. Example of a Markov decision process

Bellman found a way to estimate the optimal state value of any state s, noted V*(s), which is the sum of all discounted future rewards the agent can expect on average after it reaches a state s, assuming it acts optimally. He showed that if the agent acts optimally, then the Bellman Optimality Equation applies (see Equation 16-1). This recursive equation says that if the agent acts optimally, then the optimal value of the current state is equal to the reward it will get on average after taking one optimal action, plus the expected optimal value of all possible next states that this action can lead to.

Equation 16-1. Bellman Optimality Equation

    V*(s) = max_a Σ_s′ T(s, a, s′) · [R(s, a, s′) + γ · V*(s′)]    for all s

T(s, a, s′) is the transition probability from state s to state s′, given that the agent chose action a.

R(s, a, s′) is the reward that the agent gets when it goes from state s to state s′, given that the agent chose action a.

γ is the discount rate.

This equation leads directly to an algorithm that can precisely estimate the optimal state value of every possible state: you first initialize all the state value estimates to zero, and then you iteratively update them using the Value Iteration algorithm (see Equation 16-2). A remarkable result is that, given enough time, these estimates are guaranteed to converge to the optimal state values, corresponding to the optimal policy.

Equation 16-2. Value Iteration algorithm

    V_k+1(s) ← max_a Σ_s′ T(s, a, s′) · [R(s, a, s′) + γ · V_k(s′)]    for all s

V_k(s) is the estimated value of state s at the kth iteration of the algorithm.

NOTE
This algorithm is an example of Dynamic Programming, which breaks down a complex problem (in this case estimating a potentially infinite sum of discounted future rewards) into tractable subproblems that can be tackled iteratively (in this case finding the action that maximizes the average reward plus the discounted next state value).

Knowing the optimal state values can be useful, in particular to evaluate a policy, but it does not tell the agent explicitly what to do. Luckily, Bellman found a very similar algorithm to estimate the optimal state-action values, generally called Q-Values. The optimal Q-Value of the state-action pair (s, a), noted Q*(s, a), is the sum of discounted future rewards the agent can expect on average after it reaches the state s and chooses action a, but before it sees the outcome of this action, assuming it acts optimally after that action.

Here is how it works: once again, you start by initializing all the Q-Value estimates to zero, then you update them using the Q-Value Iteration algorithm (see Equation 16-3).

Equation 16-3. Q-Value Iteration algorithm

    Q_k+1(s, a) ← Σ_s′ T(s, a, s′) · [R(s, a, s′) + γ · max_a′ Q_k(s′, a′)]    for all (s, a)

Once you have the optimal Q-Values, defining the optimal policy, noted π*(s), is trivial: when the agent is in state s, it should choose the action with the highest Q-Value for that state: π*(s) = argmax_a Q*(s, a).

Let’sapplythisalgorithmtotheMDPrepresentedinFigure16-8.First,weneedtodefinetheMDP:

nan=np.nan#representsimpossibleactions

T=np.array([#shape=[s,a,s']

[[0.7,0.3,0.0],[1.0,0.0,0.0],[0.8,0.2,0.0]],

[[0.0,1.0,0.0],[nan,nan,nan],[0.0,0.0,1.0]],

[[nan,nan,nan],[0.8,0.1,0.1],[nan,nan,nan]],

])

R=np.array([#shape=[s,a,s']

[[10.,0.0,0.0],[0.0,0.0,0.0],[0.0,0.0,0.0]],

[[10.,0.0,0.0],[nan,nan,nan],[0.0,0.0,-50.]],

[[nan,nan,nan],[40.,0.0,0.0],[nan,nan,nan]],

])

possible_actions=[[0,1,2],[0,2],[1]]

Nowlet’sruntheQ-ValueIterationalgorithm:

Q=np.full((3,3),-np.inf)#-infforimpossibleactions

forstate,actionsinenumerate(possible_actions):

Q[state,actions]=0.0#Initialvalue=0.0,forallpossibleactions

learning_rate=0.01

discount_rate=0.95

n_iterations=100

foriterationinrange(n_iterations):

Q_prev=Q.copy()

forsinrange(3):

forainpossible_actions[s]:

Q[s,a]=np.sum([

T[s,a,sp]*(R[s,a,sp]+discount_rate*np.max(Q_prev[sp]))

forspinrange(3)

])

The resulting Q-Values look like this:

>>> Q
array([[ 21.89498982,  20.80024033,  16.86353093],
       [  1.11669335,         -inf,   1.17573546],
       [        -inf,  53.86946068,         -inf]])
>>> np.argmax(Q, axis=1)  # optimal action for each state
array([0, 2, 1])

This gives us the optimal policy for this MDP, when using a discount rate of 0.95: in state s0 choose action a0, in state s1 choose action a2 (go through the fire!), and in state s2 choose action a1 (the only possible action). Interestingly, if you reduce the discount rate to 0.9, the optimal policy changes: in state s1 the best action becomes a0 (stay put; don't go through the fire). It makes sense because if you value the present much more than the future, then the prospect of future rewards is not worth immediate pain.
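You can check this yourself by rerunning the Q-Value Iteration loop above with a different discount rate; here is a minimal sketch (it reuses the T, R, and possible_actions arrays defined earlier):

discount_rate = 0.90  # value the present more than the future
Q = np.full((3, 3), -np.inf)
for state, actions in enumerate(possible_actions):
    Q[state, actions] = 0.0

for iteration in range(100):
    Q_prev = Q.copy()
    for s in range(3):
        for a in possible_actions[s]:
            Q[s, a] = np.sum([
                T[s, a, sp] * (R[s, a, sp] + discount_rate * np.max(Q_prev[sp]))
                for sp in range(3)
            ])

print(np.argmax(Q, axis=1))  # the best action in state s1 should now be a0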

Temporal Difference Learning and Q-Learning

Reinforcement Learning problems with discrete actions can often be modeled as Markov decision processes, but the agent initially has no idea what the transition probabilities are (it does not know T(s, a, s′)), and it does not know what the rewards are going to be either (it does not know R(s, a, s′)). It must experience each state and each transition at least once to know the rewards, and it must experience them multiple times if it is to have a reasonable estimate of the transition probabilities.

The Temporal Difference Learning (TD Learning) algorithm is very similar to the Value Iteration algorithm, but tweaked to take into account the fact that the agent has only partial knowledge of the MDP. In general we assume that the agent initially knows only the possible states and actions, and nothing more. The agent uses an exploration policy (for example, a purely random policy) to explore the MDP, and as it progresses the TD Learning algorithm updates the estimates of the state values based on the transitions and rewards that are actually observed (see Equation 16-4).

Equation 16-4. TD Learning algorithm

    V_k+1(s) ← (1 - α) · V_k(s) + α · (r + γ · V_k(s′))

α is the learning rate (e.g., 0.01).

TIP
TD Learning has many similarities with Stochastic Gradient Descent, in particular the fact that it handles one sample at a time. Just like SGD, it can only truly converge if you gradually reduce the learning rate (otherwise it will keep bouncing around the optimum).

For each state s, this algorithm simply keeps track of a running average of the immediate rewards the agent gets upon leaving that state, plus the rewards it expects to get later (assuming it acts optimally).
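To make the running-average idea concrete, here is a minimal TD Learning sketch for the state values of the MDP defined above. It uses a purely random exploration policy (so it estimates the state values under that exploration policy), reuses the T, R, and possible_actions arrays from earlier, and the decaying learning-rate schedule is just one reasonable choice:

V = np.zeros(3)     # state value estimates
discount_rate = 0.95
s = 0               # start in state s0

for iteration in range(20000):
    a = np.random.choice(possible_actions[s])   # random exploration policy
    sp = np.random.choice(range(3), p=T[s, a])  # sample the next state
    reward = R[s, a, sp]
    alpha = 0.05 / (1 + iteration * 0.001)      # decaying learning rate
    V[s] = (1 - alpha) * V[s] + alpha * (reward + discount_rate * V[sp])
    s = sp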

Similarly, the Q-Learning algorithm is an adaptation of the Q-Value Iteration algorithm to the situation where the transition probabilities and the rewards are initially unknown (see Equation 16-5).

Equation 16-5. Q-Learning algorithm

    Q_k+1(s, a) ← (1 - α) · Q_k(s, a) + α · (r + γ · max_a′ Q_k(s′, a′))

For each state-action pair (s, a), this algorithm keeps track of a running average of the rewards r the agent gets upon leaving the state s with action a, plus the rewards it expects to get later. Since the target policy would act optimally, we take the maximum of the Q-Value estimates for the next state.

Here is how Q-Learning can be implemented:

import numpy.random as rnd

learning_rate0 = 0.05
learning_rate_decay = 0.1
n_iterations = 20000

s = 0  # start in state 0

Q = np.full((3, 3), -np.inf)  # -inf for impossible actions
for state, actions in enumerate(possible_actions):
    Q[state, actions] = 0.0  # Initial value = 0.0, for all possible actions

for iteration in range(n_iterations):
    a = rnd.choice(possible_actions[s])   # choose an action (randomly)
    sp = rnd.choice(range(3), p=T[s, a])  # pick next state using T[s, a]
    reward = R[s, a, sp]
    learning_rate = learning_rate0 / (1 + iteration * learning_rate_decay)
    Q[s, a] = (1 - learning_rate) * Q[s, a] + learning_rate * (
        reward + discount_rate * np.max(Q[sp])
    )
    s = sp  # move to next state

Given enough iterations, this algorithm will converge to the optimal Q-Values. This is called an off-policy algorithm because the policy being trained is not the one being executed. It is somewhat surprising that this algorithm is capable of learning the optimal policy by just watching an agent act randomly (imagine learning to play golf when your teacher is a drunken monkey). Can we do better?

Exploration Policies

Of course Q-Learning can work only if the exploration policy explores the MDP thoroughly enough. Although a purely random policy is guaranteed to eventually visit every state and every transition many times, it may take an extremely long time to do so. Therefore, a better option is to use the ε-greedy policy: at each step it acts randomly with probability ε, or greedily (choosing the action with the highest Q-Value) with probability 1 - ε. The advantage of the ε-greedy policy (compared to a completely random policy) is that it will spend more and more time exploring the interesting parts of the environment, as the Q-Value estimates get better and better, while still spending some time visiting unknown regions of the MDP. It is quite common to start with a high value for ε (e.g., 1.0) and then gradually reduce it (e.g., down to 0.05).

Alternatively, rather than relying on chance for exploration, another approach is to encourage the exploration policy to try actions that it has not tried much before. This can be implemented as a bonus added to the Q-Value estimates, as shown in Equation 16-6.

Equation 16-6. Q-Learning using an exploration function

    Q_k+1(s, a) ← (1 - α) · Q_k(s, a) + α · (r + γ · max_a′ f(Q_k(s′, a′), N(s′, a′)))

N(s′, a′) counts the number of times the action a′ was chosen in state s′.

f(q, n) is an exploration function, such as f(q, n) = q + K/(1 + n), where K is a curiosity hyperparameter that measures how much the agent is attracted to the unknown.
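Here is a minimal sketch of how this exploration bonus could plug into the Q-Learning loop shown earlier. The count array N, the helper name exploration_function, and the value K = 0.5 are illustrative assumptions, not code from the book:

import numpy as np

K = 0.5               # curiosity hyperparameter (illustrative value)
N = np.zeros((3, 3))  # N[s, a]: how many times action a was taken in state s

def exploration_function(q, n):
    return q + K / (1 + n)

# Inside the Q-Learning loop, the update target would become:
# N[s, a] += 1
# target = reward + discount_rate * np.max(exploration_function(Q[sp], N[sp]))
# Q[s, a] = (1 - learning_rate) * Q[s, a] + learning_rate * target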

Approximate Q-Learning

The main problem with Q-Learning is that it does not scale well to large (or even medium) MDPs with many states and actions. Consider trying to use Q-Learning to train an agent to play Ms. Pac-Man. There are over 250 pellets that Ms. Pac-Man can eat, each of which can be present or absent (i.e., already eaten). So the number of possible states is greater than 2^250 ≈ 10^75 (and that's considering the possible states of the pellets only). This is way more than the number of atoms in the observable universe, so there's absolutely no way you can keep track of an estimate for every single Q-Value.

The solution is to find a function that approximates the Q-Values using a manageable number of parameters. This is called Approximate Q-Learning. For years it was recommended to use linear combinations of hand-crafted features extracted from the state (e.g., distance of the closest ghosts, their directions, and so on) to estimate Q-Values, but DeepMind showed that using deep neural networks can work much better, especially for complex problems, and it does not require any feature engineering. A DNN used to estimate Q-Values is called a deep Q-network (DQN), and using a DQN for Approximate Q-Learning is called Deep Q-Learning.

In the rest of this chapter, we will use Deep Q-Learning to train an agent to play Ms. Pac-Man, much like DeepMind did in 2013. The code can easily be tweaked to learn to play the majority of Atari games quite well. It can achieve superhuman skill at most action games, but it is not so good at games with long-running storylines.

Learning to Play Ms. Pac-Man Using Deep Q-Learning

Since we will be using an Atari environment, we must first install OpenAI gym's Atari dependencies. While we're at it, we will also install dependencies for other OpenAI gym environments that you may want to play with. On macOS, assuming you have installed Homebrew, you need to run:

$ brew install cmake boost boost-python sdl2 swig wget

On Ubuntu, type the following command (replacing python3 with python if you are using Python 2):

$ apt-get install -y python3-numpy python3-dev cmake zlib1g-dev libjpeg-dev \
    xvfb libav-tools xorg-dev python3-opengl libboost-all-dev libsdl2-dev swig

Then install the extra Python modules:

$ pip3 install --upgrade 'gym[all]'

If everything went well, you should be able to create a Ms. Pac-Man environment:

>>> env = gym.make("MsPacman-v0")
>>> obs = env.reset()
>>> obs.shape  # [height, width, channels]
(210, 160, 3)
>>> env.action_space
Discrete(9)

As you can see, there are nine discrete actions available, which correspond to the nine possible positions of the joystick (left, right, up, down, center, upper left, and so on), and the observations are simply screenshots of the Atari screen (see Figure 16-9, left), represented as 3D NumPy arrays. These images are a bit large, so we will create a small preprocessing function that will crop the image and shrink it down to 88 × 80 pixels, convert it to grayscale, and improve the contrast of Ms. Pac-Man. This will reduce the amount of computations required by the DQN, and speed up training.

mspacman_color = np.array([210, 164, 74]).mean()

def preprocess_observation(obs):
    img = obs[1:176:2, ::2]  # crop and downsize
    img = img.mean(axis=2)  # to greyscale
    img[img == mspacman_color] = 0  # improve contrast
    img = (img - 128) / 128 - 1  # normalize from -1. to 1.
    return img.reshape(88, 80, 1)

The result of preprocessing is shown in Figure 16-9 (right).

Figure 16-9. Ms. Pac-Man observation, original (left) and after preprocessing (right)

Next,let’screatetheDQN.Itcouldjusttakeastate-actionpair(s,a)asinput,andoutputanestimateofthecorrespondingQ-ValueQ(s,a),butsincetheactionsarediscreteitismoreconvenienttouseaneuralnetworkthattakesonlyastatesasinputandoutputsoneQ-Valueestimateperaction.TheDQNwillbecomposedofthreeconvolutionallayers,followedbytwofullyconnectedlayers,includingtheoutputlayer(seeFigure16-10).

Figure16-10.DeepQ-networktoplayMs.Pac-Man

Aswewillsee,thetrainingalgorithmwewilluserequirestwoDQNswiththesamearchitecture(butdifferentparameters):onewillbeusedtodriveMs.Pac-Manduringtraining(theactor),andtheotherwillwatchtheactorandlearnfromitstrialsanderrors(thecritic).Atregularintervalswewillcopythecritictotheactor.SinceweneedtwoidenticalDQNs,wewillcreateaq_network()functiontobuildthem:

input_height = 88
input_width = 80
input_channels = 1
conv_n_maps = [32, 64, 64]
conv_kernel_sizes = [(8, 8), (4, 4), (3, 3)]
conv_strides = [4, 2, 1]
conv_paddings = ["SAME"] * 3
conv_activation = [tf.nn.relu] * 3
n_hidden_in = 64 * 11 * 10  # conv3 has 64 maps of 11x10 each
n_hidden = 512
hidden_activation = tf.nn.relu
n_outputs = env.action_space.n  # 9 discrete actions are available
initializer = tf.contrib.layers.variance_scaling_initializer()

def q_network(X_state, name):
    prev_layer = X_state
    conv_layers = []
    with tf.variable_scope(name) as scope:
        for n_maps, kernel_size, stride, padding, activation in zip(
                conv_n_maps, conv_kernel_sizes, conv_strides,
                conv_paddings, conv_activation):
            prev_layer = tf.layers.conv2d(
                prev_layer, filters=n_maps, kernel_size=kernel_size,
                strides=stride, padding=padding, activation=activation,
                kernel_initializer=initializer)
            conv_layers.append(prev_layer)
        last_conv_layer_flat = tf.reshape(prev_layer, shape=[-1, n_hidden_in])
        hidden = tf.layers.dense(last_conv_layer_flat, n_hidden,
                                 activation=hidden_activation,
                                 kernel_initializer=initializer)
        outputs = tf.layers.dense(hidden, n_outputs,
                                  kernel_initializer=initializer)
    trainable_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES,
                                       scope=scope.name)
    trainable_vars_by_name = {var.name[len(scope.name):]: var
                              for var in trainable_vars}
    return outputs, trainable_vars_by_name

The first part of this code defines the hyperparameters of the DQN architecture. Then the q_network() function creates the DQN, taking the environment's state X_state as input, and the name of the variable scope. Note that we will just use one observation to represent the environment's state since there's almost no hidden state (except for blinking objects and the ghosts' directions).

The trainable_vars_by_name dictionary gathers all the trainable variables of this DQN. It will be useful in a minute when we create operations to copy the critic DQN to the actor DQN. The keys of the dictionary are the names of the variables, stripping the part of the prefix that just corresponds to the scope's name. It looks like this:

>>> trainable_vars_by_name
{'/conv2d/bias:0': <tensorflow.python.ops.variables.Variable at 0x121cf7b50>,
 '/conv2d/kernel:0': <tensorflow.python.ops.variables.Variable...>,
 '/conv2d_1/bias:0': <tensorflow.python.ops.variables.Variable...>,
 '/conv2d_1/kernel:0': <tensorflow.python.ops.variables.Variable...>,
 '/conv2d_2/bias:0': <tensorflow.python.ops.variables.Variable...>,
 '/conv2d_2/kernel:0': <tensorflow.python.ops.variables.Variable...>,
 '/dense/bias:0': <tensorflow.python.ops.variables.Variable...>,
 '/dense/kernel:0': <tensorflow.python.ops.variables.Variable...>,
 '/dense_1/bias:0': <tensorflow.python.ops.variables.Variable...>,
 '/dense_1/kernel:0': <tensorflow.python.ops.variables.Variable...>}

Nowlet’screatetheinputplaceholder,thetwoDQNs,andtheoperationtocopythecriticDQNtotheactorDQN:

X_state=tf.placeholder(tf.float32,shape=[None,input_height,input_width,

input_channels])

actor_q_values,actor_vars=q_network(X_state,name="q_networks/actor")

critic_q_values,critic_vars=q_network(X_state,name="q_networks/critic")

copy_ops=[actor_var.assign(critic_vars[var_name])

forvar_name,actor_varinactor_vars.items()]

copy_critic_to_actor=tf.group(*copy_ops)

Let’sstepbackforasecond:wenowhavetwoDQNsthatarebothcapableoftakinganenvironmentstate(i.e.,apreprocessedobservation)asinputandoutputtinganestimatedQ-Valueforeachpossibleactioninthatstate.Pluswehaveanoperationcalledcopy_critic_to_actortocopyallthetrainablevariablesofthecriticDQNtotheactorDQN.WeuseTensorFlow’stf.group()functiontogroupalltheassignmentoperationsintoasingleconvenientoperation.

TheactorDQNcanbeusedtoplayMs.Pac-Man(initiallyverybadly).Asdiscussedearlier,youwantittoexplorethegamethoroughlyenough,soyougenerallywanttocombineitwithanε-greedypolicyoranotherexplorationstrategy.

But what about the critic DQN? How will it learn to play the game? The short answer is that it will try to make its Q-Value predictions match the Q-Values estimated by the actor through its experience of the game. Specifically, we will let the actor play for a while, storing all its experiences in a replay memory. Each memory will be a 5-tuple (state, action, next state, reward, continue), where the "continue" item will be equal to 0.0 when the game is over, or 1.0 otherwise. Next, at regular intervals we will sample a batch of memories from the replay memory, and we will estimate the Q-Values from these memories. Finally, we will train the critic DQN to predict these Q-Values using regular supervised learning techniques. Once every few training iterations, we will copy the critic DQN to the actor DQN. And that's it! Equation 16-7 shows the cost function used to train the critic DQN:

Equation 16-7. Deep Q-Learning cost function

    J(θ_critic) = (1/m) · Σ_i=1..m (y^(i) - Q(s^(i), a^(i), θ_critic))^2
    with  y^(i) = r^(i) + γ · max_a′ Q(s′^(i), a′, θ_actor)

s^(i), a^(i), r^(i) and s′^(i) are respectively the state, action, reward, and next state of the ith memory sampled from the replay memory.

m is the size of the memory batch.

θ_critic and θ_actor are the critic and the actor's parameters.

Q(s^(i), a^(i), θ_critic) is the critic DQN's prediction of the ith memorized state-action's Q-Value.

Q(s′^(i), a′, θ_actor) is the actor DQN's prediction of the Q-Value it can expect from the next state s′^(i) if it chooses action a′.

y^(i) is the target Q-Value for the ith memory. Note that it is equal to the reward actually observed by the actor, plus the actor's prediction of what future rewards it should expect if it were to play optimally (as far as it knows).

J(θ_critic) is the cost function used to train the critic DQN. As you can see, it is just the Mean Squared Error between the target Q-Values y^(i) as estimated by the actor DQN, and the critic DQN's predictions of these Q-Values.

NOTE
The replay memory is optional, but highly recommended. Without it, you would train the critic DQN using consecutive experiences that may be very correlated. This would introduce a lot of bias and slow down the training algorithm's convergence. By using a replay memory, we ensure that the memories fed to the training algorithm can be fairly uncorrelated.

Let’saddthecriticDQN’strainingoperations.First,weneedtobeabletocomputeitspredictedQ-Valuesforeachstate-actioninthememorybatch.SincetheDQNoutputsoneQ-Valueforeverypossibleaction,weneedtokeeponlytheQ-Valuethatcorrespondstotheactionthatwasactuallychoseninthismemory.Forthis,wewillconverttheactiontoaone-hotvector(recallthatthisisavectorfullof0sexceptfora1attheithindex),andmultiplyitbytheQ-Values:thiswillzerooutallQ-Valuesexceptfortheonecorrespondingtothememorizedaction.ThenjustsumoverthefirstaxistoobtainonlythedesiredQ-Valuepredictionforeachmemory.

X_action=tf.placeholder(tf.int32,shape=[None])

q_value=tf.reduce_sum(critic_q_values*tf.one_hot(X_action,n_outputs),

axis=1,keep_dims=True)

Nextlet’saddthetrainingoperations,assumingthetargetQ-Valueswillbefedthroughaplaceholder.Wealsocreateanontrainablevariablecalledglobal_step.Theoptimizer’sminimize()operationwilltakecareofincrementingit.PluswecreatetheusualinitoperationandaSaver.

y=tf.placeholder(tf.float32,shape=[None,1])

cost=tf.reduce_mean(tf.square(y-q_value))

global_step=tf.Variable(0,trainable=False,name='global_step')

optimizer=tf.train.AdamOptimizer(learning_rate)

training_op=optimizer.minimize(cost,global_step=global_step)

init=tf.global_variables_initializer()

saver=tf.train.Saver()

That’sitfortheconstructionphase.Beforewelookattheexecutionphase,wewillneedacoupleoftools.First,let’sstartbyimplementingthereplaymemory.Wewilluseadequelistsinceitisveryefficientatpushingitemstothequeueandpoppingthemoutfromtheendofthelistwhenthemaximummemorysizeisreached.Wewillalsowriteasmallfunctiontorandomlysampleabatchofexperiencesfromthereplaymemory:

fromcollectionsimportdeque

replay_memory_size=10000

replay_memory=deque([],maxlen=replay_memory_size)

defsample_memories(batch_size):

indices=rnd.permutation(len(replay_memory))[:batch_size]

cols=[[],[],[],[],[]]#state,action,reward,next_state,continue

foridxinindices:

memory=replay_memory[idx]

forcol,valueinzip(cols,memory):

col.append(value)

cols=[np.array(col)forcolincols]

return(cols[0],cols[1],cols[2].reshape(-1,1),cols[3],

cols[4].reshape(-1,1))

Next, we will need the actor to explore the game. We will use the ε-greedy policy, and gradually decrease ε from 1.0 to 0.05, in 50,000 training steps:

eps_min = 0.05
eps_max = 1.0
eps_decay_steps = 50000

def epsilon_greedy(q_values, step):
    epsilon = max(eps_min, eps_max - (eps_max - eps_min) * step / eps_decay_steps)
    if rnd.rand() < epsilon:
        return rnd.randint(n_outputs)  # random action
    else:
        return np.argmax(q_values)  # optimal action

That’sit!Wehaveallweneedtostarttraining.Theexecutionphasedoesnotcontainanythingtoocomplex,butitisabitlong,sotakeadeepbreath.Ready?Let’sgo!First,let’sinitializeafewvariables:

n_steps=100000#totalnumberoftrainingsteps

training_start=1000#starttrainingafter1,000gameiterations

training_interval=3#runatrainingstepevery3gameiterations

save_steps=50#savethemodelevery50trainingsteps

copy_steps=25#copythecritictotheactorevery25trainingsteps

discount_rate=0.95

skip_start=90#skipthestartofeverygame(it'sjustwaitingtime)

batch_size=50

iteration=0#gameiterations

checkpoint_path="./my_dqn.ckpt"

done=True#envneedstobereset

Next,let’sopenthesessionandrunthemaintrainingloop:

withtf.Session()assess:

ifos.path.isfile(checkpoint_path):

saver.restore(sess,checkpoint_path)

else:

init.run()

whileTrue:

step=global_step.eval()

ifstep>=n_steps:

break

iteration+=1

ifdone:#gameover,startagain

obs=env.reset()

forskipinrange(skip_start):#skipthestartofeachgame

obs,reward,done,info=env.step(0)

state=preprocess_observation(obs)

#Actorevaluateswhattodo

q_values=actor_q_values.eval(feed_dict={X_state:[state]})

action=epsilon_greedy(q_values,step)

#Actorplays

obs,reward,done,info=env.step(action)

next_state=preprocess_observation(obs)

#Let'smemorizewhatjusthappened

replay_memory.append((state,action,reward,next_state,1.0-done))

state=next_state

ifiteration<training_startoriteration%training_interval!=0:

continue

#Criticlearns

X_state_val,X_action_val,rewards,X_next_state_val,continues=(

sample_memories(batch_size))

next_q_values=actor_q_values.eval(

feed_dict={X_state:X_next_state_val})

max_next_q_values=np.max(next_q_values,axis=1,keepdims=True)

y_val=rewards+continues*discount_rate*max_next_q_values

training_op.run(feed_dict={X_state:X_state_val,

X_action:X_action_val,y:y_val})

#Regularlycopycritictoactor

ifstep%copy_steps==0:

copy_critic_to_actor.run()

#Andsaveregularly

ifstep%save_steps==0:

saver.save(sess,checkpoint_path)

We start by restoring the models if a checkpoint file exists, or else we just initialize the variables normally. Then the main loop starts, where iteration counts the total number of game steps we have gone through since the program started, and step counts the total number of training steps since training started (if a checkpoint is restored, the global step is restored as well). Then the code resets the game (and skips the first boring game steps, where nothing happens). Next, the actor evaluates what to do and plays the game, and its experience is memorized in the replay memory. Then, at regular intervals (after a warmup period), the critic goes through a training step. It samples a batch of memories and asks the actor to estimate the Q-Values of all actions for the next state, and it applies Equation 16-7 to compute the target Q-Value y_val. The only tricky part here is that we must multiply the next state's Q-Values by the continues vector to zero out the Q-Values corresponding to memories where the game was over. Next we run a training operation to improve the critic's ability to predict Q-Values. Finally, at regular intervals we copy the critic to the actor, and we save the model.

TIP
Unfortunately, training is very slow: if you use your laptop for training, it will take days before Ms. Pac-Man gets any good, and if you look at the learning curve, measuring the average rewards per episode, you will notice that it is extremely noisy. At some points there may be no apparent progress for a very long time until suddenly the agent learns to survive a reasonable amount of time. As mentioned earlier, one solution is to inject as much prior knowledge as possible into the model (e.g., through preprocessing, rewards, and so on), and you can also try to bootstrap the model by first training it to imitate a basic strategy. In any case, RL still requires quite a lot of patience and tweaking, but the end result is very exciting.

Exercises

1. How would you define Reinforcement Learning? How is it different from regular supervised or unsupervised learning?

2. Can you think of three possible applications of RL that were not mentioned in this chapter? For each of them, what is the environment? What is the agent? What are possible actions? What are the rewards?

3. What is the discount rate? Can the optimal policy change if you modify the discount rate?

4. How do you measure the performance of a Reinforcement Learning agent?

5. What is the credit assignment problem? When does it occur? How can you alleviate it?

6. What is the point of using a replay memory?

7. What is an off-policy RL algorithm?

8. Use Deep Q-Learning to tackle OpenAI gym's "BipedalWalker-v2." The Q-networks do not need to be very deep for this task.

9. Use policy gradients to train an agent to play Pong, the famous Atari game (Pong-v0 in the OpenAI gym). Beware: an individual observation is insufficient to tell the direction and speed of the ball. One solution is to pass two observations at a time to the neural network policy. To reduce dimensionality and speed up training, you should definitely preprocess these images (crop, resize, and convert them to black and white), and possibly merge them into a single image (e.g., by overlaying them).

10. If you have about $100 to spare, you can purchase a Raspberry Pi 3 plus some cheap robotics components, install TensorFlow on the Pi, and go wild! For an example, check out this fun post by Lukas Biewald, or take a look at GoPiGo or BrickPi. Why not try to build a real-life cartpole by training the robot using policy gradients? Or build a robotic spider that learns to walk; give it rewards any time it gets closer to some objective (you will need sensors to measure the distance to the objective). The only limit is your imagination.

Solutions to these exercises are available in Appendix A.

Thank You!

Before we close the last chapter of this book, I would like to thank you for reading it up to the last paragraph. I truly hope that you had as much pleasure reading this book as I had writing it, and that it will be useful for your projects, big or small.

If you find errors, please send feedback. More generally, I would love to know what you think, so please don't hesitate to contact me via O'Reilly, or through the ageron/handson-ml GitHub project.

Going forward, my best advice to you is to practice and practice: try going through all the exercises if you have not done so already, play with the Jupyter notebooks, join Kaggle.com or some other ML community, watch ML courses, read papers, attend conferences, meet experts. You may also want to study some topics that we did not cover in this book, including recommender systems, clustering algorithms, anomaly detection algorithms, and genetic algorithms.

My greatest hope is that this book will inspire you to build a wonderful ML application that will benefit all of us! What will it be?

Aurélien Géron, November 26th, 2016

1. For more details, be sure to check out Richard Sutton and Andrew Barto's book on RL, Reinforcement Learning: An Introduction (MIT Press), or David Silver's free online RL course at University College London.

2. "Playing Atari with Deep Reinforcement Learning," V. Mnih et al. (2013).

3. "Human-level control through deep reinforcement learning," V. Mnih et al. (2015).

4. Check out the videos of DeepMind's system learning to play Space Invaders, Breakout, and more at https://goo.gl/yTsH6X.

5. Images (a), (c), and (d) are reproduced from Wikipedia. (a) and (d) are in the public domain. (c) was created by user Stevertigo and released under Creative Commons BY-SA 2.0. (b) is a screenshot from the Ms. Pac-Man game, copyright Atari (the author believes it to be fair use in this chapter). (e) was reproduced from Pixabay, released under Creative Commons CC0.

6. It is often better to give the poor performers a slight chance of survival, to preserve some diversity in the "gene pool."

7. If there is a single parent, this is called asexual reproduction. With two (or more) parents, it is called sexual reproduction. An offspring's genome (in this case a set of policy parameters) is randomly composed of parts of its parents' genomes.

8. OpenAI is a nonprofit artificial intelligence research company, funded in part by Elon Musk. Its stated goal is to promote and develop friendly AIs that will benefit humanity (rather than exterminate it).

9. "Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning," R. Williams (1992).

10. We already did something similar in Chapter 11 when we discussed Gradient Clipping: we first computed the gradients, then we clipped them, and finally we applied the clipped gradients.

11. "A Markovian Decision Process," R. Bellman (1957).

Appendix A. Exercise Solutions

NOTE
Solutions to the coding exercises are available in the online Jupyter notebooks at https://github.com/ageron/handson-ml.

Chapter 1: The Machine Learning Landscape

1. Machine Learning is about building systems that can learn from data. Learning means getting better at some task, given some performance measure.

2. Machine Learning is great for complex problems for which we have no algorithmic solution, to replace long lists of hand-tuned rules, to build systems that adapt to fluctuating environments, and finally to help humans learn (e.g., data mining).

3. A labeled training set is a training set that contains the desired solution (a.k.a. a label) for each instance.

4. The two most common supervised tasks are regression and classification.

5. Common unsupervised tasks include clustering, visualization, dimensionality reduction, and association rule learning.

6. Reinforcement Learning is likely to perform best if we want a robot to learn to walk in various unknown terrains, since this is typically the type of problem that Reinforcement Learning tackles. It might be possible to express the problem as a supervised or semisupervised learning problem, but it would be less natural.

7. If you don't know how to define the groups, then you can use a clustering algorithm (unsupervised learning) to segment your customers into clusters of similar customers. However, if you know what groups you would like to have, then you can feed many examples of each group to a classification algorithm (supervised learning), and it will classify all your customers into these groups.

8. Spam detection is a typical supervised learning problem: the algorithm is fed many emails along with their label (spam or not spam).

9. An online learning system can learn incrementally, as opposed to a batch learning system. This makes it capable of adapting rapidly to both changing data and autonomous systems, and of training on very large quantities of data.

10. Out-of-core algorithms can handle vast quantities of data that cannot fit in a computer's main memory. An out-of-core learning algorithm chops the data into mini-batches and uses online learning techniques to learn from these mini-batches.

11. An instance-based learning system learns the training data by heart; then, when given a new instance, it uses a similarity measure to find the most similar learned instances and uses them to make predictions.

12. A model has one or more model parameters that determine what it will predict given a new instance (e.g., the slope of a linear model). A learning algorithm tries to find optimal values for these parameters such that the model generalizes well to new instances. A hyperparameter is a parameter of the learning algorithm itself, not of the model (e.g., the amount of regularization to apply).

13. Model-based learning algorithms search for an optimal value for the model parameters such that the model will generalize well to new instances. We usually train such systems by minimizing a cost function that measures how bad the system is at making predictions on the training data, plus a penalty for model complexity if the model is regularized. To make predictions, we feed the new instance's features into the model's prediction function, using the parameter values found by the learning algorithm.

14. Some of the main challenges in Machine Learning are the lack of data, poor data quality, nonrepresentative data, uninformative features, excessively simple models that underfit the training data, and excessively complex models that overfit the data.

15. If a model performs great on the training data but generalizes poorly to new instances, the model is likely overfitting the training data (or we got extremely lucky on the training data). Possible solutions to overfitting are getting more data, simplifying the model (selecting a simpler algorithm, reducing the number of parameters or features used, or regularizing the model), or reducing the noise in the training data.

16. A test set is used to estimate the generalization error that a model will make on new instances, before the model is launched in production.

17. A validation set is used to compare models. It makes it possible to select the best model and tune the hyperparameters.

18. If you tune hyperparameters using the test set, you risk overfitting the test set, and the generalization error you measure will be optimistic (you may launch a model that performs worse than you expect).

19. Cross-validation is a technique that makes it possible to compare models (for model selection and hyperparameter tuning) without the need for a separate validation set. This saves precious training data.

Chapter 2: End-to-End Machine Learning Project

See the Jupyter notebooks available at https://github.com/ageron/handson-ml.

Chapter 3: Classification

See the Jupyter notebooks available at https://github.com/ageron/handson-ml.

Chapter 4: Training Models

1. If you have a training set with millions of features you can use Stochastic Gradient Descent or Mini-batch Gradient Descent, and perhaps Batch Gradient Descent if the training set fits in memory. But you cannot use the Normal Equation because the computational complexity grows quickly (more than quadratically) with the number of features.

2. If the features in your training set have very different scales, the cost function will have the shape of an elongated bowl, so the Gradient Descent algorithms will take a long time to converge. To solve this you should scale the data before training the model. Note that the Normal Equation will work just fine without scaling. Moreover, regularized models may converge to a suboptimal solution if the features are not scaled: indeed, since regularization penalizes large weights, features with smaller values will tend to be ignored compared to features with larger values.

3. Gradient Descent cannot get stuck in a local minimum when training a Logistic Regression model because the cost function is convex.

4. If the optimization problem is convex (such as Linear Regression or Logistic Regression), and assuming the learning rate is not too high, then all Gradient Descent algorithms will approach the global optimum and end up producing fairly similar models. However, unless you gradually reduce the learning rate, Stochastic GD and Mini-batch GD will never truly converge; instead, they will keep jumping back and forth around the global optimum. This means that even if you let them run for a very long time, these Gradient Descent algorithms will produce slightly different models.

5. If the validation error consistently goes up after every epoch, then one possibility is that the learning rate is too high and the algorithm is diverging. If the training error also goes up, then this is clearly the problem and you should reduce the learning rate. However, if the training error is not going up, then your model is overfitting the training set and you should stop training.

6. Due to their random nature, neither Stochastic Gradient Descent nor Mini-batch Gradient Descent is guaranteed to make progress at every single training iteration. So if you immediately stop training when the validation error goes up, you may stop much too early, before the optimum is reached. A better option is to save the model at regular intervals, and when it has not improved for a long time (meaning it will probably never beat the record), you can revert to the best saved model.

7. Stochastic Gradient Descent has the fastest training iteration since it considers only one training instance at a time, so it is generally the first to reach the vicinity of the global optimum (or Mini-batch GD with a very small mini-batch size). However, only Batch Gradient Descent will actually converge, given enough training time. As mentioned, Stochastic GD and Mini-batch GD will bounce around the optimum, unless you gradually reduce the learning rate.

8. If the validation error is much higher than the training error, this is likely because your model is overfitting the training set. One way to try to fix this is to reduce the polynomial degree: a model with fewer degrees of freedom is less likely to overfit. Another thing you can try is to regularize the model, for example by adding an ℓ2 penalty (Ridge) or an ℓ1 penalty (Lasso) to the cost function. This will also reduce the degrees of freedom of the model. Lastly, you can try to increase the size of the training set.

9. If both the training error and the validation error are almost equal and fairly high, the model is likely underfitting the training set, which means it has a high bias. You should try reducing the regularization hyperparameter α.

10. Let's see:

A model with some regularization typically performs better than a model without any regularization, so you should generally prefer Ridge Regression over plain Linear Regression.

Lasso Regression uses an ℓ1 penalty, which tends to push the weights down to exactly zero. This leads to sparse models, where all weights are zero except for the most important weights. This is a way to perform feature selection automatically, which is good if you suspect that only a few features actually matter. When you are not sure, you should prefer Ridge Regression.

Elastic Net is generally preferred over Lasso since Lasso may behave erratically in some cases (when several features are strongly correlated or when there are more features than training instances). However, it does add an extra hyperparameter to tune. If you just want Lasso without the erratic behavior, you can just use Elastic Net with an l1_ratio close to 1.

11. If you want to classify pictures as outdoor/indoor and daytime/nighttime, since these are not exclusive classes (i.e., all four combinations are possible) you should train two Logistic Regression classifiers.

12. See the Jupyter notebooks available at https://github.com/ageron/handson-ml.

Chapter 5: Support Vector Machines

1. The fundamental idea behind Support Vector Machines is to fit the widest possible "street" between the classes. In other words, the goal is to have the largest possible margin between the decision boundary that separates the two classes and the training instances. When performing soft margin classification, the SVM searches for a compromise between perfectly separating the two classes and having the widest possible street (i.e., a few instances may end up on the street). Another key idea is to use kernels when training on nonlinear datasets.

2. After training an SVM, a support vector is any instance located on the "street" (see the previous answer), including its border. The decision boundary is entirely determined by the support vectors. Any instance that is not a support vector (i.e., off the street) has no influence whatsoever; you could remove them, add more instances, or move them around, and as long as they stay off the street they won't affect the decision boundary. Computing the predictions only involves the support vectors, not the whole training set.

3. SVMs try to fit the largest possible "street" between the classes (see the first answer), so if the training set is not scaled, the SVM will tend to neglect small features (see Figure 5-2).

4. An SVM classifier can output the distance between the test instance and the decision boundary, and you can use this as a confidence score. However, this score cannot be directly converted into an estimation of the class probability. If you set probability=True when creating an SVM in Scikit-Learn, then after training it will calibrate the probabilities using Logistic Regression on the SVM's scores (trained by an additional five-fold cross-validation on the training data). This will add the predict_proba() and predict_log_proba() methods to the SVM.

5. This question applies only to linear SVMs since kernelized SVMs can only use the dual form. The computational complexity of the primal form of the SVM problem is proportional to the number of training instances m, while the computational complexity of the dual form is proportional to a number between m^2 and m^3. So if there are millions of instances, you should definitely use the primal form, because the dual form will be much too slow.

6. If an SVM classifier trained with an RBF kernel underfits the training set, there might be too much regularization. To decrease it, you need to increase gamma or C (or both).

7. Let's call the QP parameters for the hard-margin problem H′, f′, A′, and b′ (see "Quadratic Programming"). The QP parameters for the soft-margin problem have m additional parameters (n_p = n + 1 + m) and m additional constraints (n_c = 2m). They can be defined like so:

H is equal to H′, plus m columns of 0s on the right and m rows of 0s at the bottom.

f is equal to f′ with m additional elements, all equal to the value of the hyperparameter C.

b is equal to b′ with m additional elements, all equal to 0.

A is equal to A′, with an extra m × m identity matrix I_m appended to the right, -I_m just below it, and the rest filled with zeros.

For the solutions to exercises 8, 9, and 10, please see the Jupyter notebooks available at https://github.com/ageron/handson-ml.

Chapter 6: Decision Trees

1. The depth of a well-balanced binary tree containing m leaves is equal to log2(m), rounded up. A binary Decision Tree (one that makes only binary decisions, as is the case of all trees in Scikit-Learn) will end up more or less well balanced at the end of training, with one leaf per training instance if it is trained without restrictions. Thus, if the training set contains one million instances, the Decision Tree will have a depth of log2(10^6) ≈ 20 (actually a bit more since the tree will generally not be perfectly well balanced).

2. A node's Gini impurity is generally lower than its parent's. This is due to the CART training algorithm's cost function, which splits each node in a way that minimizes the weighted sum of its children's Gini impurities. However, it is possible for a node to have a higher Gini impurity than its parent, as long as this increase is more than compensated for by a decrease of the other child's impurity. For example, consider a node containing four instances of class A and 1 of class B. Its Gini impurity is 1 - (1/5)^2 - (4/5)^2 = 0.32. Now suppose the dataset is one-dimensional and the instances are lined up in the following order: A, B, A, A, A. You can verify that the algorithm will split this node after the second instance, producing one child node with instances A, B, and the other child node with instances A, A, A. The first child node's Gini impurity is 1 - (1/2)^2 - (1/2)^2 = 0.5, which is higher than its parent's. This is compensated for by the fact that the other node is pure, so the overall weighted Gini impurity is (2/5) × 0.5 + (3/5) × 0 = 0.2, which is lower than the parent's Gini impurity.
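A quick sanity check of these numbers (a throwaway snippet, not part of the original answer):

def gini(counts):
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

parent = gini([4, 1])                     # 0.32
left, right = gini([1, 1]), gini([3, 0])  # 0.5 and 0.0
weighted = (2 / 5) * left + (3 / 5) * right
print(parent, left, weighted)             # 0.32 0.5 0.2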

3. If a Decision Tree is overfitting the training set, it may be a good idea to decrease max_depth, since this will constrain the model, regularizing it.

4. Decision Trees don't care whether or not the training data is scaled or centered; that's one of the nice things about them. So if a Decision Tree underfits the training set, scaling the input features will just be a waste of time.

5. The computational complexity of training a Decision Tree is O(n × m log(m)). So if you multiply the training set size by 10, the training time will be multiplied by K = (n × 10m × log(10m)) / (n × m × log(m)) = 10 × log(10m) / log(m). If m = 10^6, then K ≈ 11.7, so you can expect the training time to be roughly 11.7 hours.
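You can reproduce the ≈11.7 factor with a couple of lines (a throwaway check, not part of the original answer):

import numpy as np

m = 10 ** 6
K = 10 * np.log(10 * m) / np.log(m)
print(K)  # about 11.7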

6. Presorting the training set speeds up training only if the dataset is smaller than a few thousand instances. If it contains 100,000 instances, setting presort=True will considerably slow down training.

For the solutions to exercises 7 and 8, please see the Jupyter notebooks available at https://github.com/ageron/handson-ml.

Chapter 7: Ensemble Learning and Random Forests

1. If you have trained five different models and they all achieve 95% precision, you can try combining them into a voting ensemble, which will often give you even better results. It works better if the models are very different (e.g., an SVM classifier, a Decision Tree classifier, a Logistic Regression classifier, and so on). It is even better if they are trained on different training instances (that's the whole point of bagging and pasting ensembles), but if not it will still work as long as the models are very different.

2. A hard voting classifier just counts the votes of each classifier in the ensemble and picks the class that gets the most votes. A soft voting classifier computes the average estimated class probability for each class and picks the class with the highest probability. This gives high-confidence votes more weight and often performs better, but it works only if every classifier is able to estimate class probabilities (e.g., for the SVM classifiers in Scikit-Learn you must set probability=True).

3. It is quite possible to speed up training of a bagging ensemble by distributing it across multiple servers, since each predictor in the ensemble is independent of the others. The same goes for pasting ensembles and Random Forests, for the same reason. However, each predictor in a boosting ensemble is built based on the previous predictor, so training is necessarily sequential, and you will not gain anything by distributing training across multiple servers. Regarding stacking ensembles, all the predictors in a given layer are independent of each other, so they can be trained in parallel on multiple servers. However, the predictors in one layer can only be trained after the predictors in the previous layer have all been trained.

4. With out-of-bag evaluation, each predictor in a bagging ensemble is evaluated using instances that it was not trained on (they were held out). This makes it possible to have a fairly unbiased evaluation of the ensemble without the need for an additional validation set. Thus, you have more instances available for training, and your ensemble can perform slightly better.

5. When you are growing a tree in a Random Forest, only a random subset of the features is considered for splitting at each node. This is true as well for Extra-Trees, but they go one step further: rather than searching for the best possible thresholds, like regular Decision Trees do, they use random thresholds for each feature. This extra randomness acts like a form of regularization: if a Random Forest overfits the training data, Extra-Trees might perform better. Moreover, since Extra-Trees don't search for the best possible thresholds, they are much faster to train than Random Forests. However, they are neither faster nor slower than Random Forests when making predictions.

6. If your AdaBoost ensemble underfits the training data, you can try increasing the number of estimators or reducing the regularization hyperparameters of the base estimator. You may also try slightly increasing the learning rate.

7. If your Gradient Boosting ensemble overfits the training set, you should try decreasing the learning rate. You could also use early stopping to find the right number of predictors (you probably have too many).

For the solutions to exercises 8 and 9, please see the Jupyter notebooks available at https://github.com/ageron/handson-ml.

Chapter 8: Dimensionality Reduction

1. Motivations and drawbacks:

The main motivations for dimensionality reduction are:

To speed up a subsequent training algorithm (in some cases it may even remove noise and redundant features, making the training algorithm perform better).

To visualize the data and gain insights on the most important features.

Simply to save space (compression).

The main drawbacks are:

Some information is lost, possibly degrading the performance of subsequent training algorithms.

It can be computationally intensive.

It adds some complexity to your Machine Learning pipelines.

Transformed features are often hard to interpret.

2. The curse of dimensionality refers to the fact that many problems that do not exist in low-dimensional space arise in high-dimensional space. In Machine Learning, one common manifestation is the fact that randomly sampled high-dimensional vectors are generally very sparse, increasing the risk of overfitting and making it very difficult to identify patterns in the data without having plenty of training data.

3. Once a dataset's dimensionality has been reduced using one of the algorithms we discussed, it is almost always impossible to perfectly reverse the operation, because some information gets lost during dimensionality reduction. Moreover, while some algorithms (such as PCA) have a simple reverse transformation procedure that can reconstruct a dataset relatively similar to the original, other algorithms (such as t-SNE) do not.

4. PCA can be used to significantly reduce the dimensionality of most datasets, even if they are highly nonlinear, because it can at least get rid of useless dimensions. However, if there are no useless dimensions (for example, the Swiss roll), then reducing dimensionality with PCA will lose too much information. You want to unroll the Swiss roll, not squash it.

5. That's a trick question: it depends on the dataset. Let's look at two extreme examples. First, suppose the dataset is composed of points that are almost perfectly aligned. In this case, PCA can reduce the dataset down to just one dimension while still preserving 95% of the variance. Now imagine that the dataset is composed of perfectly random points, scattered all around the 1,000 dimensions. In this case roughly 950 dimensions are required to preserve 95% of the variance. So the answer is, it depends on the dataset, and it could be any number between 1 and 950. Plotting the explained variance as a function of the number of dimensions is one way to get a rough idea of the dataset's intrinsic dimensionality.

6. Regular PCA is the default, but it works only if the dataset fits in memory. Incremental PCA is useful for large datasets that don't fit in memory, but it is slower than regular PCA, so if the dataset fits in memory you should prefer regular PCA. Incremental PCA is also useful for online tasks, when you need to apply PCA on the fly, every time a new instance arrives. Randomized PCA is useful when you want to considerably reduce dimensionality and the dataset fits in memory; in this case, it is much faster than regular PCA. Finally, Kernel PCA is useful for nonlinear datasets.

7. Intuitively, a dimensionality reduction algorithm performs well if it eliminates a lot of dimensions from the dataset without losing too much information. One way to measure this is to apply the reverse transformation and measure the reconstruction error. However, not all dimensionality reduction algorithms provide a reverse transformation. Alternatively, if you are using dimensionality reduction as a preprocessing step before another Machine Learning algorithm (e.g., a Random Forest classifier), then you can simply measure the performance of that second algorithm; if dimensionality reduction did not lose too much information, then the algorithm should perform just as well as when using the original dataset.

8. It can absolutely make sense to chain two different dimensionality reduction algorithms. A common example is using PCA to quickly get rid of a large number of useless dimensions, then applying another much slower dimensionality reduction algorithm, such as LLE. This two-step approach will likely yield the same performance as using LLE only, but in a fraction of the time.

For the solutions to exercises 9 and 10, please see the Jupyter notebooks available at https://github.com/ageron/handson-ml.

Chapter9:UpandRunningwithTensorFlow1. Mainbenefitsanddrawbacksofcreatingacomputationgraphratherthandirectlyexecutingthe

computations:Mainbenefits:

TensorFlowcanautomaticallycomputethegradientsforyou(usingreverse-modeautodiff).

TensorFlowcantakecareofrunningtheoperationsinparallelindifferentthreads.

Itmakesiteasiertorunthesamemodelacrossdifferentdevices.

Itsimplifiesintrospection—forexample,toviewthemodelinTensorBoard.

Maindrawbacks:Itmakesthelearningcurvesteeper.

Itmakesstep-by-stepdebuggingharder.

2. Yes,thestatementa_val=a.eval(session=sess)isindeedequivalenttoa_val=sess.run(a).

3. No,thestatementa_val,b_val=a.eval(session=sess),b.eval(session=sess)isnotequivalenttoa_val,b_val=sess.run([a,b]).Indeed,thefirststatementrunsthegraphtwice(oncetocomputea,oncetocomputeb),whilethesecondstatementrunsthegraphonlyonce.Ifanyoftheseoperations(ortheopstheydependon)havesideeffects(e.g.,avariableismodified,anitemisinsertedinaqueue,orareaderreadsafile),thentheeffectswillbedifferent.Iftheydon’thavesideeffects,bothstatementswillreturnthesameresult,butthesecondstatementwillbefasterthanthefirst.
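A minimal sketch contrasting the two statements (assuming the TensorFlow 1.x API used throughout the book; the tiny constant graph is just for illustration):

import tensorflow as tf

w = tf.constant(3)
x = w + 2
a = x + 5
b = x * 3

with tf.Session() as sess:
    a_val = a.eval(session=sess)     # first graph run
    b_val = b.eval(session=sess)     # second graph run: x is recomputed
    a_val, b_val = sess.run([a, b])  # single graph run: x is computed only once
    print(a_val, b_val)              # 10 15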

4. No,youcannotruntwographsinthesamesession.Youwouldhavetomergethegraphsintoasinglegraphfirst.

5. InlocalTensorFlow,sessionsmanagevariablevalues,soifyoucreateagraphgcontainingavariablew,thenstarttwothreadsandopenalocalsessionineachthread,bothusingthesamegraphg,theneachsessionwillhaveitsowncopyofthevariablew.However,indistributedTensorFlow,variablevaluesarestoredincontainersmanagedbythecluster,soifbothsessionsconnecttothesameclusterandusethesamecontainer,thentheywillsharethesamevariablevalueforw.

6. Avariableisinitializedwhenyoucallitsinitializer,anditisdestroyedwhenthesessionends.IndistributedTensorFlow,variablesliveincontainersonthecluster,soclosingasessionwillnotdestroythevariable.Todestroyavariable,youneedtoclearitscontainer.

7. Variablesandplaceholdersareextremelydifferent,butbeginnersoftenconfusethem:Avariableisanoperationthatholdsavalue.Ifyourunthevariable,itreturnsthatvalue.Beforeyoucanrunit,youneedtoinitializeit.Youcanchangethevariable’svalue(for

example,byusinganassignmentoperation).Itisstateful:thevariablekeepsthesamevalueuponsuccessiverunsofthegraph.Itistypicallyusedtoholdmodelparametersbutalsoforotherpurposes(e.g.,tocounttheglobaltrainingstep).

Placeholderstechnicallydon’tdomuch:theyjustholdinformationaboutthetypeandshapeofthetensortheyrepresent,buttheyhavenovalue.Infact,ifyoutrytoevaluateanoperationthatdependsonaplaceholder,youmustfeedTensorFlowthevalueoftheplaceholder(usingthefeed_dictargument)orelseyouwillgetanexception.PlaceholdersaretypicallyusedtofeedtrainingortestdatatoTensorFlowduringtheexecutionphase.Theyarealsousefultopassavaluetoanassignmentnode,tochangethevalueofavariable(e.g.,modelweights).

8. Ifyourunthegraphtoevaluateanoperationthatdependsonaplaceholderbutyoudon’tfeeditsvalue,yougetanexception.Iftheoperationdoesnotdependontheplaceholder,thennoexceptionisraised.

9. Whenyourunagraph,youcanfeedtheoutputvalueofanyoperation,notjustthevalueofplaceholders.Inpractice,however,thisisratherrare(itcanbeuseful,forexample,whenyouarecachingtheoutputoffrozenlayers;seeChapter11).

10. Youcanspecifyavariable’sinitialvaluewhenconstructingthegraph,anditwillbeinitializedlaterwhenyourunthevariable’sinitializerduringtheexecutionphase.Ifyouwanttochangethatvariable’svaluetoanythingyouwantduringtheexecutionphase,thenthesimplestoptionistocreateanassignmentnode(duringthegraphconstructionphase)usingthetf.assign()function,passingthevariableandaplaceholderasparameters.Duringtheexecutionphase,youcanruntheassignmentoperationandfeedthevariable’snewvalueusingtheplaceholder.

import tensorflow as tf

x = tf.Variable(tf.random_uniform(shape=(), minval=0.0, maxval=1.0))
x_new_val = tf.placeholder(shape=(), dtype=tf.float32)
x_assign = tf.assign(x, x_new_val)

with tf.Session():
    x.initializer.run()  # random number is sampled *now*
    print(x.eval())      # 0.646157 (some random number)
    x_assign.eval(feed_dict={x_new_val: 5.0})
    print(x.eval())      # 5.0

11. Reverse-modeautodiff(implementedbyTensorFlow)needstotraversethegraphonlytwiceinordertocomputethegradientsofthecostfunctionwithregardstoanynumberofvariables.Ontheotherhand,forward-modeautodiffwouldneedtorunonceforeachvariable(so10timesifwewantthegradientswithregardsto10differentvariables).Asforsymbolicdifferentiation,itwouldbuildadifferentgraphtocomputethegradients,soitwouldnottraversetheoriginalgraphatall(exceptwhenbuildingthenewgradientsgraph).Ahighlyoptimizedsymbolicdifferentiationsystemcouldpotentiallyrunthenewgradientsgraphonlyoncetocomputethegradientswithregardstoallvariables,butthatnewgraphmaybehorriblycomplexandinefficientcomparedtotheoriginalgraph.

12. SeetheJupyternotebooksavailableathttps://github.com/ageron/handson-ml.

Chapter 10: Introduction to Artificial Neural Networks

1. Here is a neural network based on the original artificial neurons that computes A ⊕ B (where ⊕ represents the exclusive OR), using the fact that A ⊕ B = (A ∧ ¬B) ∨ (¬A ∧ B). There are other solutions, for example using the fact that A ⊕ B = (A ∨ B) ∧ ¬(A ∧ B), or the fact that A ⊕ B = (A ∨ B) ∧ (¬A ∨ ¬B), and so on.

2. AclassicalPerceptronwillconvergeonlyifthedatasetislinearlyseparable,anditwon’tbeabletoestimateclassprobabilities.Incontrast,aLogisticRegressionclassifierwillconvergetoagoodsolutionevenifthedatasetisnotlinearlyseparable,anditwilloutputclassprobabilities.IfyouchangethePerceptron’sactivationfunctiontothelogisticactivationfunction(orthesoftmaxactivationfunctioniftherearemultipleneurons),andifyoutrainitusingGradientDescent(orsomeotheroptimizationalgorithmminimizingthecostfunction,typicallycrossentropy),thenitbecomesequivalenttoaLogisticRegressionclassifier.

3. ThelogisticactivationfunctionwasakeyingredientintrainingthefirstMLPsbecauseitsderivativeisalwaysnonzero,soGradientDescentcanalwaysrolldowntheslope.Whentheactivationfunctionisastepfunction,GradientDescentcannotmove,asthereisnoslopeatall.

4. Thestepfunction,thelogisticfunction,thehyperbolictangent,therectifiedlinearunit(seeFigure10-8).SeeChapter11forotherexamples,suchasELUandvariantsoftheReLU.

5. ConsideringtheMLPdescribedinthequestion:supposeyouhaveanMLPcomposedofoneinputlayerwith10passthroughneurons,followedbyonehiddenlayerwith50artificialneurons,andfinallyoneoutputlayerwith3artificialneurons.AllartificialneuronsusetheReLUactivationfunction.

TheshapeoftheinputmatrixXism×10,wheremrepresentsthetrainingbatchsize.

Theshapeofthehiddenlayer’sweightvectorWhis10×50andthelengthofitsbiasvectorbhis50.

Theshapeoftheoutputlayer’sweightvectorWois50×3,andthelengthofitsbiasvectorbo

is3.

Theshapeofthenetwork’soutputmatrixYism×3.

Y=ReLU(ReLU(X·Wh+bh)·Wo+bo).RecallthattheReLUfunctionjustsetseverynegativenumberinthematrixtozero.Alsonotethatwhenyouareaddingabiasvectortoamatrix,itisaddedtoeverysinglerowinthematrix,whichiscalledbroadcasting.
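A small NumPy sketch of that forward pass, just to check the shapes (the batch size of 32 is an arbitrary illustrative choice):

import numpy as np

def relu(z):
    return np.maximum(z, 0)

m = 32                       # arbitrary training batch size
X = np.random.rand(m, 10)    # input matrix: m x 10
Wh = np.random.rand(10, 50)  # hidden layer weights
bh = np.zeros(50)            # hidden layer biases (broadcast to every row)
Wo = np.random.rand(50, 3)   # output layer weights
bo = np.zeros(3)             # output layer biases

Y = relu(relu(X.dot(Wh) + bh).dot(Wo) + bo)
print(Y.shape)               # (32, 3)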

6. Toclassifyemailintospamorham,youjustneedoneneuronintheoutputlayerofaneuralnetwork—forexample,indicatingtheprobabilitythattheemailisspam.Youwouldtypicallyusethelogisticactivationfunctionintheoutputlayerwhenestimatingaprobability.IfinsteadyouwanttotackleMNIST,youneed10neuronsintheoutputlayer,andyoumustreplacethelogisticfunctionwiththesoftmaxactivationfunction,whichcanhandlemultipleclasses,outputtingoneprobabilityperclass.Now,ifyouwantyourneuralnetworktopredicthousingpriceslikeinChapter2,thenyouneedoneoutputneuron,usingnoactivationfunctionatallintheoutputlayer.4

7. Backpropagationisatechniqueusedtotrainartificialneuralnetworks.Itfirstcomputesthegradientsofthecostfunctionwithregardstoeverymodelparameter(alltheweightsandbiases),andthenitperformsaGradientDescentstepusingthesegradients.Thisbackpropagationstepistypicallyperformedthousandsormillionsoftimes,usingmanytrainingbatches,untilthemodelparametersconvergetovaluesthat(hopefully)minimizethecostfunction.Tocomputethegradients,backpropagationusesreverse-modeautodiff(althoughitwasn’tcalledthatwhenbackpropagationwasinvented,andithasbeenreinventedseveraltimes).Reverse-modeautodiffperformsaforwardpassthroughacomputationgraph,computingeverynode’svalueforthecurrenttrainingbatch,andthenitperformsareversepass,computingallthegradientsatonce(seeAppendixDformoredetails).Sowhat’sthedifference?Well,backpropagationreferstothewholeprocessoftraininganartificialneuralnetworkusingmultiplebackpropagationsteps,eachofwhichcomputesgradientsandusesthemtoperformaGradientDescentstep.Incontrast,reverse-modeautodiffisasimplyatechniquetocomputegradientsefficiently,andithappenstobeusedbybackpropagation.

8. HereisalistofallthehyperparametersyoucantweakinabasicMLP:thenumberofhiddenlayers,thenumberofneuronsineachhiddenlayer,andtheactivationfunctionusedineachhiddenlayerandintheoutputlayer.5Ingeneral,theReLUactivationfunction(oroneofitsvariants;seeChapter11)isagooddefaultforthehiddenlayers.Fortheoutputlayer,ingeneralyouwillwantthelogisticactivationfunctionforbinaryclassification,thesoftmaxactivationfunctionformulticlassclassification,ornoactivationfunctionforregression.IftheMLPoverfitsthetrainingdata,youcantryreducingthenumberofhiddenlayersandreducingthenumberofneuronsperhiddenlayer.

9. SeetheJupyternotebooksavailableathttps://github.com/ageron/handson-ml.

Chapter11:TrainingDeepNeuralNets1. No,allweightsshouldbesampledindependently;theyshouldnotallhavethesameinitialvalue.

Oneimportantgoalofsamplingweightsrandomlyistobreaksymmetries:ifalltheweightshavethesameinitialvalue,evenifthatvalueisnotzero,thensymmetryisnotbroken(i.e.,allneuronsinagivenlayerareequivalent),andbackpropagationwillbeunabletobreakit.Concretely,thismeansthatalltheneuronsinanygivenlayerwillalwayshavethesameweights.It’slikehavingjustoneneuronperlayer,andmuchslower.Itisvirtuallyimpossibleforsuchaconfigurationtoconvergetoagoodsolution.

2. Itisperfectlyfinetoinitializethebiastermstozero.Somepeopleliketoinitializethemjustlikeweights,andthat’sokaytoo;itdoesnotmakemuchdifference.

3. AfewadvantagesoftheELUfunctionovertheReLUfunctionare:Itcantakeonnegativevalues,sotheaverageoutputoftheneuronsinanygivenlayeristypicallycloserto0thanwhenusingtheReLUactivationfunction(whichneveroutputsnegativevalues).Thishelpsalleviatethevanishinggradientsproblem.

Italwayshasanonzeroderivative,whichavoidsthedyingunitsissuethatcanaffectReLUunits.

Itissmootheverywhere,whereastheReLU’sslopeabruptlyjumpsfrom0to1atz=0.SuchanabruptchangecanslowdownGradientDescentbecauseitwillbouncearoundz=0.

4. TheELUactivationfunctionisagooddefault.Ifyouneedtheneuralnetworktobeasfastaspossible,youcanuseoneoftheleakyReLUvariantsinstead(e.g.,asimpleleakyReLUusingthedefaulthyperparametervalue).ThesimplicityoftheReLUactivationfunctionmakesitmanypeople’spreferredoption,despitethefactthattheyaregenerallyoutperformedbytheELUandleakyReLU.However,theReLUactivationfunction’scapabilityofoutputtingpreciselyzerocanbeusefulinsomecases(e.g.,seeChapter15).Thehyperbolictangent(tanh)canbeusefulintheoutputlayerifyouneedtooutputanumberbetween–1and1,butnowadaysitisnotusedmuchinhiddenlayers.Thelogisticactivationfunctionisalsousefulintheoutputlayerwhenyouneedtoestimateaprobability(e.g.,forbinaryclassification),butitisalsorarelyusedinhiddenlayers(thereareexceptions—forexample,forthecodinglayerofvariationalautoencoders;seeChapter15).Finally,thesoftmaxactivationfunctionisusefulintheoutputlayertooutputprobabilitiesformutuallyexclusiveclasses,butotherthanthatitisrarely(ifever)usedinhiddenlayers.

5. Ifyousetthemomentumhyperparametertoocloseto1(e.g.,0.99999)whenusingaMomentumOptimizer,thenthealgorithmwilllikelypickupalotofspeed,hopefullyroughlytowardtheglobalminimum,butthenitwillshootrightpasttheminimum,duetoitsmomentum.Thenitwillslowdownandcomeback,accelerateagain,overshootagain,andsoon.Itmayoscillatethiswaymanytimesbeforeconverging,sooverallitwilltakemuchlongertoconvergethanwithasmallermomentumvalue.

6. Onewaytoproduceasparsemodel(i.e.,withmostweightsequaltozero)istotrainthemodelnormally,thenzeroouttinyweights.Formoresparsity,youcanapplyℓ1regularizationduring

training,whichpushestheoptimizertowardsparsity.Athirdoptionistocombineℓ1regularizationwithdualaveraging,usingTensorFlow’sFTRLOptimizerclass.

7. Yes,dropoutdoesslowdowntraining,ingeneralroughlybyafactoroftwo.However,ithasnoimpactoninferencesinceitisonlyturnedonduringtraining.

Forthesolutionstoexercises8,9,and10,pleaseseetheJupyternotebooksavailableathttps://github.com/ageron/handson-ml.

Chapter12:DistributingTensorFlowAcrossDevicesandServers1. WhenaTensorFlowprocessstarts,itgrabsalltheavailablememoryonallGPUdevicesthatare

visibletoit,soifyougetaCUDA_ERROR_OUT_OF_MEMORYwhenstartingyourTensorFlowprogram,itprobablymeansthatotherprocessesarerunningthathavealreadygrabbedallthememoryonatleastonevisibleGPUdevice(mostlikelyitisanotherTensorFlowprocess).Tofixthisproblem,atrivialsolutionistostoptheotherprocessesandtryagain.However,ifyouneedallprocessestorunsimultaneously,asimpleoptionistodedicatedifferentdevicestoeachprocess,bysettingtheCUDA_VISIBLE_DEVICESenvironmentvariableappropriatelyforeachdevice.AnotheroptionistoconfigureTensorFlowtograbonlypartoftheGPUmemory,insteadofallofit,bycreatingaConfigProto,settingitsgpu_options.per_process_gpu_memory_fractiontotheproportionofthetotalmemorythatitshouldgrab(e.g.,0.4),andusingthisConfigProtowhenopeningasession.ThelastoptionistotellTensorFlowtograbmemoryonlywhenitneedsitbysettingthegpu_options.allow_growthtoTrue.However,thislastoptionisusuallynotrecommendedbecauseanymemorythatTensorFlowgrabsisneverreleased,anditishardertoguaranteearepeatablebehavior(theremayberaceconditionsdependingonwhichprocessesstartfirst,howmuchmemorytheyneedduringtraining,andsoon).
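For example, here is a minimal sketch of the last two options (TensorFlow 1.x API; the 0.4 fraction is just an example value). The CUDA_VISIBLE_DEVICES variable would be set in the shell before launching each process.

import tensorflow as tf

# In the shell, e.g.:  CUDA_VISIBLE_DEVICES=0 python program_1.py
#                      CUDA_VISIBLE_DEVICES=1 python program_2.py

config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.4  # grab only 40% of each GPU
# config.gpu_options.allow_growth = True  # alternative: grab memory only when needed

with tf.Session(config=config) as sess:
    pass  # build and run your graph here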

2. Bypinninganoperationonadevice,youaretellingTensorFlowthatthisiswhereyouwouldlikethisoperationtobeplaced.However,someconstraintsmaypreventTensorFlowfromhonoringyourrequest.Forexample,theoperationmayhavenoimplementation(calledakernel)forthatparticulartypeofdevice.Inthiscase,TensorFlowwillraiseanexceptionbydefault,butyoucanconfigureittofallbacktotheCPUinstead(thisiscalledsoftplacement).Anotherexampleisanoperationthatcanmodifyavariable;thisoperationandthevariableneedtobecollocated.SothedifferencebetweenpinninganoperationandplacinganoperationisthatpinningiswhatyouaskTensorFlow(“PleaseplacethisoperationonGPU#1”)whileplacementiswhatTensorFlowactuallyendsupdoing(“Sorry,fallingbacktotheCPU”).

3. IfyouarerunningonaGPU-enabledTensorFlowinstallation,andyoujustusethedefaultplacement,thenifalloperationshaveaGPUkernel(i.e.,aGPUimplementation),yes,theywillallbeplacedonthefirstGPU.However,ifoneormoreoperationsdonothaveaGPUkernel,thenbydefaultTensorFlowwillraiseanexception.IfyouconfigureTensorFlowtofallbacktotheCPUinstead(softplacement),thenalloperationswillbeplacedonthefirstGPUexcepttheoneswithoutaGPUkernelandalltheoperationsthatmustbecollocatedwiththem(seetheanswertothepreviousexercise).

4. Yes,ifyoupinavariableto"/gpu:0",itcanbeusedbyoperationsplacedon/gpu:1.TensorFlowwillautomaticallytakecareofaddingtheappropriateoperationstotransferthevariable’svalueacrossdevices.Thesamegoesfordeviceslocatedondifferentservers(aslongastheyarepartofthesamecluster).

5. Yes,twooperationsplacedonthesamedevicecanruninparallel:TensorFlowautomaticallytakescareofrunningoperationsinparallel(ondifferentCPUcoresordifferentGPUthreads),aslongasnooperationdependsonanotheroperation’soutput.Moreover,youcanstartmultiplesessionsinparallelthreads(orprocesses),andevaluateoperationsineachthread.Sincesessionsareindependent,TensorFlowwillbeabletoevaluateanyoperationfromonesessioninparallelwith

anyoperationfromanothersession.

6. ControldependenciesareusedwhenyouwanttopostponetheevaluationofanoperationXuntilaftersomeotheroperationsarerun,eventhoughtheseoperationsarenotrequiredtocomputeX.ThisisusefulinparticularwhenXwouldoccupyalotofmemoryandyouonlyneeditlaterinthecomputationgraph,orifXusesupalotofI/O(forexample,itrequiresalargevariablevaluelocatedonadifferentdeviceorserver)andyoudon’twantittorunatthesametimeasotherI/O-hungryoperations,toavoidsaturatingthebandwidth.
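A minimal sketch of a control dependency (TensorFlow 1.x; the constants stand in for real, expensive operations):

import tensorflow as tf

a = tf.constant(1.0)
b = tf.constant(2.0)

with tf.control_dependencies([a, b]):
    # x will be evaluated only after a and b, even though it does not use them.
    x = tf.constant(3.0)

z = x + 4.0

with tf.Session() as sess:
    print(sess.run(z))  # 7.0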

7. You’reinluck!IndistributedTensorFlow,thevariablevaluesliveincontainersmanagedbythecluster,soevenifyouclosethesessionandexittheclientprogram,themodelparametersarestillaliveandwellonthecluster.Yousimplyneedtoopenanewsessiontotheclusterandsavethemodel(makesureyoudon’tcallthevariableinitializersorrestoreapreviousmodel,asthiswoulddestroyyourpreciousnewmodel!).

Forthesolutionstoexercises8,9,and10,pleaseseetheJupyternotebooksavailableathttps://github.com/ageron/handson-ml.

Chapter13:ConvolutionalNeuralNetworks1. ThesearethemainadvantagesofaCNNoverafullyconnectedDNNforimageclassification:

Becauseconsecutivelayersareonlypartiallyconnectedandbecauseitheavilyreusesitsweights,aCNNhasmanyfewerparametersthanafullyconnectedDNN,whichmakesitmuchfastertotrain,reducestheriskofoverfitting,andrequiresmuchlesstrainingdata.

WhenaCNNhaslearnedakernelthatcandetectaparticularfeature,itcandetectthatfeatureanywhereontheimage.Incontrast,whenaDNNlearnsafeatureinonelocation,itcandetectitonlyinthatparticularlocation.Sinceimagestypicallyhaveveryrepetitivefeatures,CNNsareabletogeneralizemuchbetterthanDNNsforimageprocessingtaskssuchasclassification,usingfewertrainingexamples.

Finally,aDNNhasnopriorknowledgeofhowpixelsareorganized;itdoesnotknowthatnearbypixelsareclose.ACNN’sarchitectureembedsthispriorknowledge.Lowerlayerstypicallyidentifyfeaturesinsmallareasoftheimages,whilehigherlayerscombinethelower-levelfeaturesintolargerfeatures.Thisworkswellwithmostnaturalimages,givingCNNsadecisiveheadstartcomparedtoDNNs.

2. Let’scomputehowmanyparameterstheCNNhas.Sinceitsfirstconvolutionallayerhas3×3kernels,andtheinputhasthreechannels(red,green,andblue),theneachfeaturemaphas3×3×3weights,plusabiasterm.That’s28parametersperfeaturemap.Sincethisfirstconvolutionallayerhas100featuremaps,ithasatotalof2,800parameters.Thesecondconvolutionallayerhas3×3kernels,anditsinputisthesetof100featuremapsofthepreviouslayer,soeachfeaturemaphas3×3×100=900weights,plusabiasterm.Sinceithas200featuremaps,thislayerhas901×200=180,200parameters.Finally,thethirdandlastconvolutionallayeralsohas3×3kernels,anditsinputisthesetof200featuremapsofthepreviouslayers,soeachfeaturemaphas3×3×200=1,800weights,plusabiasterm.Sinceithas400featuremaps,thislayerhasatotalof1,801×400=720,400parameters.Allinall,theCNNhas2,800+180,200+720,400=903,400parameters.Nowlet’scomputehowmuchRAMthisneuralnetworkwillrequire(atleast)whenmakingapredictionforasingleinstance.Firstlet’scomputethefeaturemapsizeforeachlayer.Sinceweareusingastrideof2andSAMEpadding,thehorizontalandverticalsizeofthefeaturemapsaredividedby2ateachlayer(roundingupifnecessary),soastheinputchannelsare200×300pixels,thefirstlayer’sfeaturemapsare100×150,thesecondlayer’sfeaturemapsare50×75,andthethirdlayer’sfeaturemapsare25×38.Since32bitsis4bytesandthefirstconvolutionallayerhas100featuremaps,thisfirstlayertakesup4x100×150×100=6millionbytes(about5.7MB,consideringthat1MB=1,024KBand1KB=1,024bytes).Thesecondlayertakesup4×50×75×200=3millionbytes(about2.9MB).Finally,thethirdlayertakesup4×25×38×400=1,520,000bytes(about1.4MB).However,oncealayerhasbeencomputed,thememoryoccupiedbythepreviouslayercanbereleased,soifeverythingiswelloptimized,only6+9=15millionbytes(about14.3MB)ofRAMwillberequired(whenthesecondlayerhasjustbeencomputed,butthememoryoccupiedbythefirstlayerisnotreleasedyet).Butwait,youalsoneedtoaddthememoryoccupiedbytheCNN’sparameters.Wecomputedearlierthatithas903,400parameters,eachusingup4bytes,sothisadds3,613,600bytes(about3.4MB).ThetotalRAMrequiredis(atleast)18,613,600bytes(about17.8MB).

Lastly,let’scomputetheminimumamountofRAMrequiredwhentrainingtheCNNonamini-batchof50images.DuringtrainingTensorFlowusesbackpropagation,whichrequireskeepingallvaluescomputedduringtheforwardpassuntilthereversepassbegins.SowemustcomputethetotalRAMrequiredbyalllayersforasingleinstanceandmultiplythatby50!Atthatpointlet’sstartcountinginmegabytesratherthanbytes.Wecomputedbeforethatthethreelayersrequirerespectively5.7,2.9,and1.4MBforeachinstance.That’satotalof10.0MBperinstance.Sofor50instancesthetotalRAMis500MB.AddtothattheRAMrequiredbytheinputimages,whichis50×4×200×300×3=36millionbytes(about34.3MB),plustheRAMrequiredforthemodelparameters,whichisabout3.4MB(computedearlier),plussomeRAMforthegradients(wewillneglectthemsincetheycanbereleasedgraduallyasbackpropagationgoesdownthelayersduringthereversepass).Weareuptoatotalofroughly500.0+34.3+3.4=537.7MB.Andthat’sreallyanoptimisticbareminimum.

3. IfyourGPUrunsoutofmemorywhiletrainingaCNN,herearefivethingsyoucouldtrytosolvetheproblem(otherthanpurchasingaGPUwithmoreRAM):

Reducethemini-batchsize.

Reducedimensionalityusingalargerstrideinoneormorelayers.

Removeoneormorelayers.

Use16-bitfloatsinsteadof32-bitfloats.

DistributetheCNNacrossmultipledevices.

4. Amaxpoolinglayerhasnoparametersatall,whereasaconvolutionallayerhasquiteafew(seethepreviousquestions).

5. Alocalresponsenormalizationlayermakestheneuronsthatmoststronglyactivateinhibitneuronsatthesamelocationbutinneighboringfeaturemaps,whichencouragesdifferentfeaturemapstospecializeandpushesthemapart,forcingthemtoexploreawiderrangeoffeatures.Itistypicallyusedinthelowerlayerstohavealargerpooloflow-levelfeaturesthattheupperlayerscanbuildupon.

6. ThemaininnovationsinAlexNetcomparedtoLeNet-5are(1)itismuchlargeranddeeper,and(2)itstacksconvolutionallayersdirectlyontopofeachother,insteadofstackingapoolinglayerontopofeachconvolutionallayer.ThemaininnovationinGoogLeNetistheintroductionofinceptionmodules,whichmakeitpossibletohaveamuchdeepernetthanpreviousCNNarchitectures,withfewerparameters.Finally,ResNet’smaininnovationistheintroductionofskipconnections,whichmakeitpossibletogowellbeyond100layers.Arguably,itssimplicityandconsistencyarealsoratherinnovative.

Forthesolutionstoexercises7,8,9,and10,pleaseseetheJupyternotebooksavailableathttps://github.com/ageron/handson-ml.

Chapter14:RecurrentNeuralNetworks1. HereareafewRNNapplications:

Forasequence-to-sequenceRNN:predictingtheweather(oranyothertimeseries),machinetranslation(usinganencoder–decoderarchitecture),videocaptioning,speechtotext,musicgeneration(orothersequencegeneration),identifyingthechordsofasong.

Forasequence-to-vectorRNN:classifyingmusicsamplesbymusicgenre,analyzingthesentimentofabookreview,predictingwhatwordanaphasicpatientisthinkingofbasedonreadingsfrombrainimplants,predictingtheprobabilitythatauserwillwanttowatchamoviebasedonherwatchhistory(thisisoneofmanypossibleimplementationsofcollaborativefiltering).

Foravector-to-sequenceRNN:imagecaptioning,creatingamusicplaylistbasedonanembeddingofthecurrentartist,generatingamelodybasedonasetofparameters,locatingpedestriansinapicture(e.g.,avideoframefromaself-drivingcar’scamera).

2. Ingeneral,ifyoutranslateasentenceonewordatatime,theresultwillbeterrible.Forexample,theFrenchsentence“Jevousenprie”means“Youarewelcome,”butifyoutranslateitonewordatatime,youget“Iyouinpray.”Huh?Itismuchbettertoreadthewholesentencefirstandthentranslateit.Aplainsequence-to-sequenceRNNwouldstarttranslatingasentenceimmediatelyafterreadingthefirstword,whileanencoder–decoderRNNwillfirstreadthewholesentenceandthentranslateit.Thatsaid,onecouldimagineaplainsequence-to-sequenceRNNthatwouldoutputsilencewheneveritisunsureaboutwhattosaynext(justlikehumantranslatorsdowhentheymusttranslatealivebroadcast).

3. Toclassifyvideosbasedonthevisualcontent,onepossiblearchitecturecouldbetotake(say)oneframepersecond,thenruneachframethroughaconvolutionalneuralnetwork,feedtheoutputoftheCNNtoasequence-to-vectorRNN,andfinallyrunitsoutputthroughasoftmaxlayer,givingyoualltheclassprobabilities.Fortrainingyouwouldjustusecrossentropyasthecostfunction.Ifyouwantedtousetheaudioforclassificationaswell,youcouldconverteverysecondofaudiotoaspectrograph,feedthisspectrographtoaCNN,andfeedtheoutputofthisCNNtotheRNN(alongwiththecorrespondingoutputoftheotherCNN).

4. BuildinganRNNusingdynamic_rnn()ratherthanstatic_rnn()offersseveraladvantages:Itisbasedonawhile_loop()operationthatisabletoswaptheGPU’smemorytotheCPU’smemoryduringbackpropagation,avoidingout-of-memoryerrors.

Itisarguablyeasiertouse,asitcandirectlytakeasingletensorasinputandoutput(coveringalltimesteps),ratherthanalistoftensors(onepertimestep).Noneedtostack,unstack,ortranspose.

Itgeneratesasmallergraph,easiertovisualizeinTensorBoard.

5. Tohandlevariablelengthinputsequences,thesimplestoptionistosetthesequence_lengthparameterwhencallingthestatic_rnn()ordynamic_rnn()functions.Anotheroptionistopad

thesmallerinputs(e.g.,withzeros)tomakethemthesamesizeasthelargestinput(thismaybefasterthanthefirstoptioniftheinputsequencesallhaveverysimilarlengths).Tohandlevariable-lengthoutputsequences,ifyouknowinadvancethelengthofeachoutputsequence,youcanusethesequence_lengthparameter(forexample,considerasequence-to-sequenceRNNthatlabelseveryframeinavideowithaviolencescore:theoutputsequencewillbeexactlythesamelengthastheinputsequence).Ifyoudon’tknowinadvancethelengthoftheoutputsequence,youcanusethepaddingtrick:alwaysoutputthesamesizesequence,butignoreanyoutputsthatcomeaftertheend-of-sequencetoken(byignoringthemwhencomputingthecostfunction).
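Here is a minimal TensorFlow 1.x sketch of the first option (the tiny dimensions and the zero-padding of the second, shorter sequence are just for illustration):

import numpy as np
import tensorflow as tf

n_steps, n_inputs, n_neurons = 2, 3, 5

X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
seq_length = tf.placeholder(tf.int32, [None])

cell = tf.nn.rnn_cell.BasicRNNCell(num_units=n_neurons)
outputs, states = tf.nn.dynamic_rnn(cell, X, dtype=tf.float32,
                                    sequence_length=seq_length)

X_batch = np.random.rand(2, n_steps, n_inputs)
X_batch[1, 1, :] = 0                  # second instance is zero-padded
seq_length_batch = np.array([2, 1])   # actual lengths of the two sequences

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    out = sess.run(outputs, feed_dict={X: X_batch,
                                       seq_length: seq_length_batch})
    print(out[1, 1])                  # all zeros: past the sequence's end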

6. TodistributetrainingandexecutionofadeepRNNacrossmultipleGPUs,acommontechniqueissimplytoplaceeachlayeronadifferentGPU(seeChapter12).

Forthesolutionstoexercises7,8,and9,pleaseseetheJupyternotebooksavailableathttps://github.com/ageron/handson-ml.

Chapter15:Autoencoders1. Herearesomeofthemaintasksthatautoencodersareusedfor:

Featureextraction

Unsupervisedpretraining

Dimensionalityreduction

Generativemodels

Anomalydetection(anautoencoderisgenerallybadatreconstructingoutliers)

2. Ifyouwanttotrainaclassifierandyouhaveplentyofunlabeledtrainingdata,butonlyafewthousandlabeledinstances,thenyoucouldfirsttrainadeepautoencoderonthefulldataset(labeled+unlabeled),thenreuseitslowerhalffortheclassifier(i.e.,reusethelayersuptothecodingslayer,included)andtraintheclassifierusingthelabeleddata.Ifyouhavelittlelabeleddata,youprobablywanttofreezethereusedlayerswhentrainingtheclassifier.

3. Thefactthatanautoencoderperfectlyreconstructsitsinputsdoesnotnecessarilymeanthatitisagoodautoencoder;perhapsitissimplyanovercompleteautoencoderthatlearnedtocopyitsinputstothecodingslayerandthentotheoutputs.Infact,evenifthecodingslayercontainedasingleneuron,itwouldbepossibleforaverydeepautoencodertolearntomapeachtraininginstancetoadifferentcoding(e.g.,thefirstinstancecouldbemappedto0.001,thesecondto0.002,thethirdto0.003,andsoon),anditcouldlearn“byheart”toreconstructtherighttraininginstanceforeachcoding.Itwouldperfectlyreconstructitsinputswithoutreallylearninganyusefulpatterninthedata.Inpracticesuchamappingisunlikelytohappen,butitillustratesthefactthatperfectreconstructionsarenotaguaranteethattheautoencoderlearnedanythinguseful.However,ifitproducesverybadreconstructions,thenitisalmostguaranteedtobeabadautoencoder.Toevaluatetheperformanceofanautoencoder,oneoptionistomeasurethereconstructionloss(e.g.,computetheMSE,themeansquareoftheoutputsminustheinputs).Again,ahighreconstructionlossisagoodsignthattheautoencoderisbad,butalowreconstructionlossisnotaguaranteethatitisgood.Youshouldalsoevaluatetheautoencoderaccordingtowhatitwillbeusedfor.Forexample,ifyouareusingitforunsupervisedpretrainingofaclassifier,thenyoushouldalsoevaluatetheclassifier’sperformance.

4. Anundercompleteautoencoderisonewhosecodingslayerissmallerthantheinputandoutputlayers.Ifitislarger,thenitisanovercompleteautoencoder.Themainriskofanexcessivelyundercompleteautoencoderisthatitmayfailtoreconstructtheinputs.Themainriskofanovercompleteautoencoderisthatitmayjustcopytheinputstotheoutputs,withoutlearninganyusefulfeature.

5. Totietheweightsofanencoderlayeranditscorrespondingdecoderlayer,yousimplymakethedecoderweightsequaltothetransposeoftheencoderweights.Thisreducesthenumberofparametersinthemodelbyhalf,oftenmakingtrainingconvergefasterwithlesstrainingdata,andreducingtheriskofoverfittingthetrainingset.
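A minimal TensorFlow 1.x sketch of weight tying for a single-hidden-layer autoencoder (the layer sizes and the sigmoid activation are arbitrary illustrative choices):

import tensorflow as tf

n_inputs, n_hidden = 784, 150

X = tf.placeholder(tf.float32, shape=[None, n_inputs])

weights1 = tf.Variable(tf.truncated_normal([n_inputs, n_hidden], stddev=0.1))
weights2 = tf.transpose(weights1, name="weights2")  # tied weights: not a new variable
biases1 = tf.Variable(tf.zeros(n_hidden))
biases2 = tf.Variable(tf.zeros(n_inputs))

hidden = tf.nn.sigmoid(tf.matmul(X, weights1) + biases1)
outputs = tf.matmul(hidden, weights2) + biases2

reconstruction_loss = tf.reduce_mean(tf.square(outputs - X))  # MSE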

6. Tovisualizethefeatureslearnedbythelowerlayerofastackedautoencoder,acommontechniqueissimplytoplottheweightsofeachneuron,byreshapingeachweightvectortothesizeofaninputimage(e.g.,forMNIST,reshapingaweightvectorofshape[784]to[28,28]).Tovisualizethefeatureslearnedbyhigherlayers,onetechniqueistodisplaythetraininginstancesthatmostactivateeachneuron.

7. Agenerativemodelisamodelcapableofrandomlygeneratingoutputsthatresemblethetraininginstances.Forexample,oncetrainedsuccessfullyontheMNISTdataset,agenerativemodelcanbeusedtorandomlygeneraterealisticimagesofdigits.Theoutputdistributionistypicallysimilartothetrainingdata.Forexample,sinceMNISTcontainsmanyimagesofeachdigit,thegenerativemodelwouldoutputroughlythesamenumberofimagesofeachdigit.Somegenerativemodelscanbeparametrized—forexample,togenerateonlysomekindsofoutputs.Anexampleofagenerativeautoencoderisthevariationalautoencoder.

Forthesolutionstoexercises8,9,and10,pleaseseetheJupyternotebooksavailableathttps://github.com/ageron/handson-ml.

Chapter16:ReinforcementLearning1. ReinforcementLearningisanareaofMachineLearningaimedatcreatingagentscapableoftaking

actionsinanenvironmentinawaythatmaximizesrewardsovertime.TherearemanydifferencesbetweenRLandregularsupervisedandunsupervisedlearning.Hereareafew:

Insupervisedandunsupervisedlearning,thegoalisgenerallytofindpatternsinthedata.InReinforcementLearning,thegoalistofindagoodpolicy.

Unlikeinsupervisedlearning,theagentisnotexplicitlygiventhe“right”answer.Itmustlearnbytrialanderror.

Unlikeinunsupervisedlearning,thereisaformofsupervision,throughrewards.Wedonottelltheagenthowtoperformthetask,butwedotellitwhenitismakingprogressorwhenitisfailing.

AReinforcementLearningagentneedstofindtherightbalancebetweenexploringtheenvironment,lookingfornewwaysofgettingrewards,andexploitingsourcesofrewardsthatitalreadyknows.Incontrast,supervisedandunsupervisedlearningsystemsgenerallydon’tneedtoworryaboutexploration;theyjustfeedonthetrainingdatatheyaregiven.

Insupervisedandunsupervisedlearning,traininginstancesaretypicallyindependent(infact,theyaregenerallyshuffled).InReinforcementLearning,consecutiveobservationsaregenerallynotindependent.Anagentmayremaininthesameregionoftheenvironmentforawhilebeforeitmoveson,soconsecutiveobservationswillbeverycorrelated.Insomecasesareplaymemoryisusedtoensurethatthetrainingalgorithmgetsfairlyindependentobservations.

2. HereareafewpossibleapplicationsofReinforcementLearning,otherthanthosementionedinChapter16:

MusicpersonalizationTheenvironmentisauser’spersonalizedwebradio.Theagentisthesoftwaredecidingwhatsongtoplaynextforthatuser.Itspossibleactionsaretoplayanysonginthecatalog(itmusttrytochooseasongtheuserwillenjoy)ortoplayanadvertisement(itmusttrytochooseanadthattheuserwillbeinterestedin).Itgetsasmallrewardeverytimetheuserlistenstoasong,alargerrewardeverytimetheuserlistenstoanad,anegativerewardwhentheuserskipsasongoranad,andaverynegativerewardiftheuserleaves.

MarketingTheenvironmentisyourcompany’smarketingdepartment.Theagentisthesoftwarethatdefineswhichcustomersamailingcampaignshouldbesentto,giventheirprofileandpurchasehistory(foreachcustomerithastwopossibleactions:sendordon’tsend).Itgetsanegativerewardforthecostofthemailingcampaign,andapositiverewardforestimatedrevenuegeneratedfromthiscampaign.

Productdelivery

Lettheagentcontrolafleetofdeliverytrucks,decidingwhattheyshouldpickupatthedepots,wheretheyshouldgo,whattheyshoulddropoff,andsoon.Theywouldgetpositiverewardsforeachproductdeliveredontime,andnegativerewardsforlatedeliveries.

3. Whenestimatingthevalueofanaction,ReinforcementLearningalgorithmstypicallysumalltherewardsthatthisactionledto,givingmoreweighttoimmediaterewards,andlessweighttolaterrewards(consideringthatanactionhasmoreinfluenceonthenearfuturethanonthedistantfuture).Tomodelthis,adiscountrateistypicallyappliedateachtimestep.Forexample,withadiscountrateof0.9,arewardof100thatisreceivedtwotimestepslateriscountedasonly0.92×100=81whenyouareestimatingthevalueoftheaction.Youcanthinkofthediscountrateasameasureofhowmuchthefutureisvaluedrelativetothepresent:ifitisverycloseto1,thenthefutureisvaluedalmostasmuchasthepresent.Ifitiscloseto0,thenonlyimmediaterewardsmatter.Ofcourse,thisimpactstheoptimalpolicytremendously:ifyouvaluethefuture,youmaybewillingtoputupwithalotofimmediatepainfortheprospectofeventualrewards,whileifyoudon’tvaluethefuture,youwilljustgrabanyimmediaterewardyoucanfind,neverinvestinginthefuture.
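In code, the discounted sum for that example looks like this (a reward of 100 received two time steps later, with a discount rate of 0.9):

discount_rate = 0.9
rewards = [0, 0, 100]   # the reward arrives two time steps later

value = sum(reward * discount_rate**step
            for step, reward in enumerate(rewards))
print(value)            # 0.9 ** 2 * 100, i.e., about 81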

4. TomeasuretheperformanceofaReinforcementLearningagent,youcansimplysumuptherewardsitgets.Inasimulatedenvironment,youcanrunmanyepisodesandlookatthetotalrewardsitgetsonaverage(andpossiblylookatthemin,max,standarddeviation,andsoon).

5. ThecreditassignmentproblemisthefactthatwhenaReinforcementLearningagentreceivesareward,ithasnodirectwayofknowingwhichofitspreviousactionscontributedtothisreward.Ittypicallyoccurswhenthereisalargedelaybetweenanactionandtheresultingrewards(e.g.,duringagameofAtari’sPong,theremaybeafewdozentimestepsbetweenthemomenttheagenthitstheballandthemomentitwinsthepoint).Onewaytoalleviateitistoprovidetheagentwithshorter-termrewards,whenpossible.Thisusuallyrequirespriorknowledgeaboutthetask.Forexample,ifwewanttobuildanagentthatwilllearntoplaychess,insteadofgivingitarewardonlywhenitwinsthegame,wecouldgiveitarewardeverytimeitcapturesoneoftheopponent’spieces.

6. Anagentcanoftenremaininthesameregionofitsenvironmentforawhile,soallofitsexperienceswillbeverysimilarforthatperiodoftime.Thiscanintroducesomebiasinthelearningalgorithm.Itmaytuneitspolicyforthisregionoftheenvironment,butitwillnotperformwellassoonasitmovesoutofthisregion.Tosolvethisproblem,youcanuseareplaymemory;insteadofusingonlythemostimmediateexperiencesforlearning,theagentwilllearnbasedonabufferofitspastexperiences,recentandnotsorecent(perhapsthisiswhywedreamatnight:toreplayourexperiencesofthedayandbetterlearnfromthem?).

7. Anoff-policyRLalgorithmlearnsthevalueoftheoptimalpolicy(i.e.,thesumofdiscountedrewardsthatcanbeexpectedforeachstateiftheagentactsoptimally),independentlyofhowtheagentactuallyacts.Q-Learningisagoodexampleofsuchanalgorithm.Incontrast,anon-policyalgorithmlearnsthevalueofthepolicythattheagentactuallyexecutes,includingbothexplorationandexploitation.

Forthesolutionstoexercises8,9,and10,pleaseseetheJupyternotebooksavailableathttps://github.com/ageron/handson-ml.

1. If you draw a straight line between any two points on the curve, the line never crosses the curve.

2. Moreover, the Normal Equation requires computing the inverse of a matrix, but that matrix is not always invertible. In contrast, the matrix for Ridge Regression is always invertible.

3. log2 is the binary log, log2(m) = log(m)/log(2).

4. When the values to predict can vary by many orders of magnitude, then you may want to predict the logarithm of the target value rather than the target value directly. Simply computing the exponential of the neural network's output will give you the estimated value (since exp(log v) = v).

5. In Chapter 11 we discuss many techniques that introduce additional hyperparameters: the type of weight initialization, the activation function hyperparameters (e.g., the amount of leak in leaky ReLU), the Gradient Clipping threshold, the type of optimizer and its hyperparameters (e.g., the momentum hyperparameter when using a MomentumOptimizer), the type of regularization for each layer, the regularization hyperparameters (e.g., the dropout rate when using dropout), and so on.

AppendixB.MachineLearningProjectChecklist

ThischecklistcanguideyouthroughyourMachineLearningprojects.Thereareeightmainsteps:1. Frametheproblemandlookatthebigpicture.

2. Getthedata.

3. Explorethedatatogaininsights.

4. PreparethedatatobetterexposetheunderlyingdatapatternstoMachineLearningalgorithms.

5. Exploremanydifferentmodelsandshort-listthebestones.

6. Fine-tuneyourmodelsandcombinethemintoagreatsolution.

7. Presentyoursolution.

8. Launch,monitor,andmaintainyoursystem.

Obviously,youshouldfeelfreetoadaptthischecklisttoyourneeds.

FrametheProblemandLookattheBigPicture1. Definetheobjectiveinbusinessterms.

2. Howwillyoursolutionbeused?

3. Whatarethecurrentsolutions/workarounds(ifany)?

4. Howshouldyouframethisproblem(supervised/unsupervised,online/offline,etc.)?

5. Howshouldperformancebemeasured?

6. Istheperformancemeasurealignedwiththebusinessobjective?

7. Whatwouldbetheminimumperformanceneededtoreachthebusinessobjective?

8. Whatarecomparableproblems?Canyoureuseexperienceortools?

9. Ishumanexpertiseavailable?

10. Howwouldyousolvetheproblemmanually?

11. Listtheassumptionsyou(orothers)havemadesofar.

12. Verifyassumptionsifpossible.

GettheDataNote:automateasmuchaspossiblesoyoucaneasilygetfreshdata.1. Listthedatayouneedandhowmuchyouneed.

2. Findanddocumentwhereyoucangetthatdata.

3. Checkhowmuchspaceitwilltake.

4. Checklegalobligations,andgetauthorizationifnecessary.

5. Getaccessauthorizations.

6. Createaworkspace(withenoughstoragespace).

7. Getthedata.

8. Convertthedatatoaformatyoucaneasilymanipulate(withoutchangingthedataitself).

9. Ensuresensitiveinformationisdeletedorprotected(e.g.,anonymized).

10. Checkthesizeandtypeofdata(timeseries,sample,geographical,etc.).

11. Sampleatestset,putitaside,andneverlookatit(nodatasnooping!).

ExploretheDataNote:trytogetinsightsfromafieldexpertforthesesteps.1. Createacopyofthedataforexploration(samplingitdowntoamanageablesizeifnecessary).

2. CreateaJupyternotebooktokeeparecordofyourdataexploration.

3. Studyeachattributeanditscharacteristics:Name

Type(categorical,int/float,bounded/unbounded,text,structured,etc.)

%ofmissingvalues

Noisinessandtypeofnoise(stochastic,outliers,roundingerrors,etc.)

Possiblyusefulforthetask?

Typeofdistribution(Gaussian,uniform,logarithmic,etc.)

4. Forsupervisedlearningtasks,identifythetargetattribute(s).

5. Visualizethedata.

6. Studythecorrelationsbetweenattributes.

7. Studyhowyouwouldsolvetheproblemmanually.

8. Identifythepromisingtransformationsyoumaywanttoapply.

9. Identifyextradatathatwouldbeuseful(gobackto“GettheData”).

10. Documentwhatyouhavelearned.

PreparetheDataNotes:

Workoncopiesofthedata(keeptheoriginaldatasetintact).

Writefunctionsforalldatatransformationsyouapply,forfivereasons:Soyoucaneasilypreparethedatathenexttimeyougetafreshdataset

Soyoucanapplythesetransformationsinfutureprojects

Tocleanandpreparethetestset

Tocleanandpreparenewdatainstancesonceyoursolutionislive

Tomakeiteasytotreatyourpreparationchoicesashyperparameters

1. Datacleaning:Fixorremoveoutliers(optional).

Fillinmissingvalues(e.g.,withzero,mean,median…)ordroptheirrows(orcolumns).

2. Featureselection(optional):Droptheattributesthatprovidenousefulinformationforthetask.

3. Featureengineering,whereappropriate:Discretizecontinuousfeatures.

Decomposefeatures(e.g.,categorical,date/time,etc.).

Addpromisingtransformationsoffeatures(e.g.,log(x),sqrt(x),x^2,etc.).

Aggregatefeaturesintopromisingnewfeatures.

4. Featurescaling:standardizeornormalizefeatures.

Short-ListPromisingModelsNotes:

Ifthedataishuge,youmaywanttosamplesmallertrainingsetssoyoucantrainmanydifferentmodelsinareasonabletime(beawarethatthispenalizescomplexmodelssuchaslargeneuralnetsorRandomForests).

Onceagain,trytoautomatethesestepsasmuchaspossible.

1. Trainmanyquickanddirtymodelsfromdifferentcategories(e.g.,linear,naiveBayes,SVM,RandomForests,neuralnet,etc.)usingstandardparameters.

2. Measureandcomparetheirperformance.Foreachmodel,useN-foldcross-validationandcomputethemeanandstandarddeviationoftheperformancemeasureontheNfolds.

3. Analyzethemostsignificantvariablesforeachalgorithm.

4. Analyzethetypesoferrorsthemodelsmake.Whatdatawouldahumanhaveusedtoavoidtheseerrors?

5. Haveaquickroundoffeatureselectionandengineering.

6. Haveoneortwomorequickiterationsofthefiveprevioussteps.

7. Short-listthetopthreetofivemostpromisingmodels,preferringmodelsthatmakedifferenttypesoferrors.

Fine-TunetheSystemNotes:

Youwillwanttouseasmuchdataaspossibleforthisstep,especiallyasyoumovetowardtheendoffine-tuning.

Asalwaysautomatewhatyoucan.

1. Fine-tunethehyperparametersusingcross-validation.Treatyourdatatransformationchoicesashyperparameters,especiallywhenyouarenotsureaboutthem(e.g.,shouldIreplacemissingvalueswithzeroorwiththemedianvalue?Orjustdroptherows?).

Unlessthereareveryfewhyperparametervaluestoexplore,preferrandomsearchovergridsearch.Iftrainingisverylong,youmaypreferaBayesianoptimizationapproach(e.g.,usingGaussianprocesspriors,asdescribedbyJasperSnoek,HugoLarochelle,andRyanAdams).1

2. TryEnsemblemethods.Combiningyourbestmodelswilloftenperformbetterthanrunningthemindividually.

3. Onceyouareconfidentaboutyourfinalmodel,measureitsperformanceonthetestsettoestimatethegeneralizationerror.

WARNINGDon’ttweakyourmodelaftermeasuringthegeneralizationerror:youwouldjuststartoverfittingthetestset.

PresentYourSolution1. Documentwhatyouhavedone.

2. Createanicepresentation.Makesureyouhighlightthebigpicturefirst.

3. Explainwhyyoursolutionachievesthebusinessobjective.

4. Don’tforgettopresentinterestingpointsyounoticedalongtheway.Describewhatworkedandwhatdidnot.

Listyourassumptionsandyoursystem’slimitations.

5. Ensureyourkeyfindingsarecommunicatedthroughbeautifulvisualizationsoreasy-to-rememberstatements(e.g.,“themedianincomeisthenumber-onepredictorofhousingprices”).

Launch!1. Getyoursolutionreadyforproduction(plugintoproductiondatainputs,writeunittests,etc.).

2. Writemonitoringcodetocheckyoursystem’sliveperformanceatregularintervalsandtriggeralertswhenitdrops.

Bewareofslowdegradationtoo:modelstendto“rot”asdataevolves.

Measuringperformancemayrequireahumanpipeline(e.g.,viaacrowdsourcingservice).

Alsomonitoryourinputs’quality(e.g.,amalfunctioningsensorsendingrandomvalues,oranotherteam’soutputbecomingstale).Thisisparticularlyimportantforonlinelearningsystems.

3. Retrainyourmodelsonaregularbasisonfreshdata(automateasmuchaspossible).

1. "Practical Bayesian Optimization of Machine Learning Algorithms," J. Snoek, H. Larochelle, R. Adams (2012).

AppendixC.SVMDualProblem

Tounderstandduality,youfirstneedtounderstandtheLagrangemultipliersmethod.Thegeneralideaistotransformaconstrainedoptimizationobjectiveintoanunconstrainedone,bymovingtheconstraintsintotheobjectivefunction.Let’slookatasimpleexample.Supposeyouwanttofindthevaluesofxandythatminimizethefunctionf(x,y)=x2+2y,subjecttoanequalityconstraint:3x+2y+1=0.UsingtheLagrangemultipliersmethod,westartbydefininganewfunctioncalledtheLagrangian(orLagrangefunction):g(x,y,α)=f(x,y)–α(3x+2y+1).Eachconstraint(inthiscasejustone)issubtractedfromtheoriginalobjective,multipliedbyanewvariablecalledaLagrangemultiplier.

Joseph-Louis Lagrange showed that if (x, y) is a solution to the constrained optimization problem, then there must exist an α such that (x, y, α) is a stationary point of the Lagrangian (a stationary point is a point where all partial derivatives are equal to zero). In other words, we can compute the partial derivatives of g(x, y, α) with regards to x, y, and α; we can find the points where these derivatives are all equal to zero; and the solutions to the constrained optimization problem (if they exist) must be among these stationary points.

In this example the partial derivatives are ∂g/∂x = 2x - 3α, ∂g/∂y = 2 - 2α, and ∂g/∂α = -(3x + 2y + 1).

When all these partial derivatives are equal to 0, we find that 2x - 3α = 2 - 2α = 3x + 2y + 1 = 0, from which we can easily find that α = 1, x = 3/2, and y = -11/4. This is the only stationary point, and as it respects the constraint, it must be the solution to the constrained optimization problem.

However,thismethodappliesonlytoequalityconstraints.Fortunately,undersomeregularityconditions(whicharerespectedbytheSVMobjectives),thismethodcanbegeneralizedtoinequalityconstraintsaswell(e.g.,3x+2y+1≥0).ThegeneralizedLagrangianforthehardmarginproblemisgivenbyEquationC-1,wheretheα(i)variablesarecalledtheKarush–Kuhn–Tucker(KKT)multipliers,andtheymustbegreaterorequaltozero.

EquationC-1.GeneralizedLagrangianforthehardmarginproblem

JustlikewiththeLagrangemultipliersmethod,youcancomputethepartialderivativesandlocatethe

stationary points. If there is a solution, it will necessarily be among the stationary points (w, b, α) that respect the KKT conditions:

Respect the problem's constraints: t(i)(wT · x(i) + b) ≥ 1 for i = 1, 2, …, m.

Verify α(i) ≥ 0 for i = 1, 2, …, m.

Either α(i) = 0 or the ith constraint must be an active constraint, meaning it must hold by equality: t(i)(wT · x(i) + b) = 1. This condition is called the complementary slackness condition. It implies that either α(i) = 0 or the ith instance lies on the boundary (it is a support vector).

NotethattheKKTconditionsarenecessaryconditionsforastationarypointtobeasolutionoftheconstrainedoptimizationproblem.Undersomeconditions,theyarealsosufficientconditions.Luckily,theSVMoptimizationproblemhappenstomeettheseconditions,soanystationarypointthatmeetstheKKTconditionsisguaranteedtobeasolutiontotheconstrainedoptimizationproblem.

WecancomputethepartialderivativesofthegeneralizedLagrangianwithregardstowandbwithEquationC-2.

EquationC-2.PartialderivativesofthegeneralizedLagrangian

Whenthesepartialderivativesareequalto0,wehaveEquationC-3.

EquationC-3.Propertiesofthestationarypoints

IfweplugtheseresultsintothedefinitionofthegeneralizedLagrangian,sometermsdisappearandwe

findEquationC-4.

EquationC-4.DualformoftheSVMproblem

The goal is now to find the vector α that minimizes this function, with α(i) ≥ 0 for all instances. This constrained optimization problem is the dual problem we were looking for.

Once you find the optimal α, you can compute w using the first line of Equation C-3. To compute b, you can use the fact that a support vector verifies t(i)(wT · x(i) + b) = 1, so if the kth instance is a support vector (i.e., α(k) > 0), you can use it to compute b = t(k) - wT · x(k). However, it is often preferred to compute the average over all support vectors to get a more stable and precise value, as in Equation C-5.

Equation C-5. Bias term estimation using the dual form

b = (1/ns) Σ (t(i) - wT · x(i)), averaging over the ns support vectors (the instances with α(i) > 0)

AppendixD.Autodiff

ThisappendixexplainshowTensorFlow’sautodifffeatureworks,andhowitcomparestoothersolutions.

Supposeyoudefineafunctionf(x,y)=x2y+y+2,andyouneeditspartialderivatives and ,typicallytoperformGradientDescent(orsomeotheroptimizationalgorithm).Yourmainoptionsaremanualdifferentiation,symbolicdifferentiation,numericaldifferentiation,forward-modeautodiff,andfinallyreverse-modeautodiff.TensorFlowimplementsthislastoption.Let’sgothrougheachoftheseoptions.

ManualDifferentiationThefirstapproachistopickupapencilandapieceofpaperanduseyourcalculusknowledgetoderivethepartialderivativesmanually.Forthefunctionf(x,y)justdefined,itisnottoohard;youjustneedtousefiverules:

Thederivativeofaconstantis0.

Thederivativeofλxisλ(whereλisaconstant).

Thederivativeofxλisλxλ–1,sothederivativeofx2is2x.

Thederivativeofasumoffunctionsisthesumofthesefunctions’derivatives.

Thederivativeofλtimesafunctionisλtimesitsderivative.

From these rules, you can derive Equation D-1:

Equation D-1. Partial derivatives of f(x, y)

∂f/∂x = 2xy and ∂f/∂y = x² + 1

Thisapproachcanbecomeverytediousformorecomplexfunctions,andyouruntheriskofmakingmistakes.Thegoodnewsisthatderivingthemathematicalequationsforthepartialderivativeslikewejustdidcanbeautomated,throughaprocesscalledsymbolicdifferentiation.

Symbolic Differentiation

Figure D-1 shows how symbolic differentiation works on an even simpler function, g(x, y) = 5 + xy. The graph for that function is represented on the left. After symbolic differentiation, we get the graph on the right, which represents the partial derivative ∂g/∂x (we could similarly obtain the partial derivative with regards to y).

Figure D-1. Symbolic differentiation

The algorithm starts by getting the partial derivative of the leaf nodes. The constant node (5) returns the constant 0, since the derivative of a constant is always 0. The variable x returns the constant 1 since ∂x/∂x = 1, and the variable y returns the constant 0 since ∂y/∂x = 0 (if we were looking for the partial derivative with regards to y, it would be the reverse).

Now we have all we need to move up the graph to the multiplication node in function g. Calculus tells us that the derivative of the product of two functions u and v is ∂(u × v)/∂x = ∂u/∂x × v + u × ∂v/∂x. We can therefore construct a large part of the graph on the right, representing 0 × x + y × 1.

Finally, we can go up to the addition node in function g. As mentioned, the derivative of a sum of functions is the sum of these functions' derivatives. So we just need to create an addition node and connect it to the parts of the graph we have already computed. We get the correct partial derivative: ∂g/∂x = 0 + (0 × x + y × 1).

However, it can be simplified (a lot). A few trivial pruning steps can be applied to this graph to get rid of all unnecessary operations, and we get a much smaller graph with just one node: y.

Inthiscase,simplificationisfairlyeasy,butforamorecomplexfunction,symbolicdifferentiationcanproduceahugegraphthatmaybetoughtosimplifyandleadtosuboptimalperformance.Mostimportantly,symbolicdifferentiationcannotdealwithfunctionsdefinedwitharbitrarycode—forexample,the

followingfunctiondiscussedinChapter9:

import numpy as np

def my_func(a, b):
    z = 0
    for i in range(100):
        z = a * np.cos(z + i) + z * np.sin(b - i)
    return z

Numerical Differentiation

The simplest solution is to compute an approximation of the derivatives, numerically. Recall that the derivative h′(x0) of a function h(x) at a point x0 is the slope of the function at that point, or more precisely Equation D-2.

Equation D-2. Derivative of a function h(x) at point x0

h′(x0) = lim(ϵ → 0) [h(x0 + ϵ) - h(x0)] / ϵ

So if we want to calculate the partial derivative of f(x, y) with regards to x, at x = 3 and y = 4, we can simply compute f(3 + ϵ, 4) - f(3, 4) and divide the result by ϵ, using a very small value for ϵ. That's exactly what the following code does:

def f(x, y):
    return x**2*y + y + 2

def derivative(f, x, y, x_eps, y_eps):
    return (f(x + x_eps, y + y_eps) - f(x, y)) / (x_eps + y_eps)

df_dx = derivative(f, 3, 4, 0.00001, 0)
df_dy = derivative(f, 3, 4, 0, 0.00001)

Unfortunately, the result is imprecise (and it gets worse for more complex functions). The correct results are respectively 24 and 10, but instead we get:

>>> print(df_dx)
24.000039999805264
>>> print(df_dy)
10.000000000331966

Noticethattocomputebothpartialderivatives,wehavetocallf()atleastthreetimes(wecalleditfourtimesintheprecedingcode,butitcouldbeoptimized).Iftherewere1,000parameters,wewouldneedtocallf()atleast1,001times.Whenyouaredealingwithlargeneuralnetworks,thismakesnumericaldifferentiationwaytooinefficient.

However,numericaldifferentiationissosimpletoimplementthatitisagreattooltocheckthattheothermethodsareimplementedcorrectly.Forexample,ifitdisagreeswithyourmanuallyderivedfunction,thenyourfunctionprobablycontainsamistake.

Forward-Mode Autodiff

Forward-mode autodiff is neither numerical differentiation nor symbolic differentiation, but in some ways it is their love child. It relies on dual numbers, which are (weird but fascinating) numbers of the form a + bϵ, where a and b are real numbers and ϵ is an infinitesimal number such that ϵ² = 0 (but ϵ ≠ 0). You can think of the dual number 42 + 24ϵ as something akin to 42.0000000024 with an infinite number of 0s (but of course this is simplified just to give you some idea of what dual numbers are). A dual number is represented in memory as a pair of floats. For example, 42 + 24ϵ is represented by the pair (42.0, 24.0).

Dual numbers can be added, multiplied, and so on, as shown in Equation D-3.

Equation D-3. A few operations with dual numbers

λ(a + bϵ) = λa + λbϵ
(a + bϵ) + (c + dϵ) = (a + c) + (b + d)ϵ
(a + bϵ) × (c + dϵ) = ac + (ad + bc)ϵ + (bd)ϵ² = ac + (ad + bc)ϵ

Most importantly, it can be shown that h(a + bϵ) = h(a) + b × h′(a)ϵ, so computing h(a + ϵ) gives you both h(a) and the derivative h′(a) in just one shot. Figure D-2 shows how forward-mode autodiff computes the partial derivative of f(x, y) with regards to x at x = 3 and y = 4. All we need to do is compute f(3 + ϵ, 4); this will output a dual number whose first component is equal to f(3, 4) and whose second component is equal to ∂f/∂x(3, 4).

Figure D-2. Forward-mode autodiff

To compute ∂f/∂y(3, 4) we would have to go through the graph again, but this time with x = 3 and y = 4 + ϵ.

Soforward-modeautodiffismuchmoreaccuratethannumericaldifferentiation,butitsuffersfromthesamemajorflaw:iftherewere1,000parameters,itwouldrequire1,000passesthroughthegraphtocomputeallthepartialderivatives.Thisiswherereverse-modeautodiffshines:itcancomputealloftheminjusttwopassesthroughthegraph.

Reverse-Mode Autodiff

Reverse-mode autodiff is the solution implemented by TensorFlow. It first goes through the graph in the forward direction (i.e., from the inputs to the output) to compute the value of each node. Then it does a second pass, this time in the reverse direction (i.e., from the output to the inputs), to compute all the partial derivatives. Figure D-3 represents the second pass. During the first pass, all the node values were computed, starting from x = 3 and y = 4. You can see those values at the bottom right of each node (e.g., x × x = 9). The nodes are labeled n1 to n7 for clarity. The output node is n7: f(3, 4) = n7 = 42.

Figure D-3. Reverse-mode autodiff

The idea is to gradually go down the graph, computing the partial derivative of f(x, y) with regards to each consecutive node, until we reach the variable nodes. For this, reverse-mode autodiff relies heavily on the chain rule, shown in Equation D-4.

Equation D-4. Chain rule

∂f/∂x = ∂f/∂ni × ∂ni/∂x

Since n7 is the output node, f = n7, so trivially ∂f/∂n7 = 1.

Let's continue down the graph to n5: how much does f vary when n5 varies? The answer is ∂f/∂n5 = ∂f/∂n7 × ∂n7/∂n5. We already know that ∂f/∂n7 = 1, so all we need is ∂n7/∂n5. Since n7 simply performs the sum n5 + n6, we find that ∂n7/∂n5 = 1, so ∂f/∂n5 = 1 × 1 = 1.

Now we can proceed to node n4: how much does f vary when n4 varies? The answer is ∂f/∂n4 = ∂f/∂n5 × ∂n5/∂n4. Since n5 = n4 × n2, we find that ∂n5/∂n4 = n2, so ∂f/∂n4 = 1 × n2 = 4.

The process continues until we reach the bottom of the graph. At that point we will have calculated all the partial derivatives of f(x, y) at the point x = 3 and y = 4. In this example, we find ∂f/∂x = 24 and ∂f/∂y = 10. Sounds about right!

Reverse-modeautodiffisaverypowerfulandaccuratetechnique,especiallywhentherearemanyinputsandfewoutputs,sinceitrequiresonlyoneforwardpassplusonereversepassperoutputtocomputeallthepartialderivativesforalloutputswithregardstoalltheinputs.Mostimportantly,itcandealwithfunctionsdefinedbyarbitrarycode.Itcanalsohandlefunctionsthatarenotentirelydifferentiable,aslongasyouaskittocomputethepartialderivativesatpointsthataredifferentiable.

TIPIfyouimplementanewtypeofoperationinTensorFlowandyouwanttomakeitcompatiblewithautodiff,thenyouneedtoprovideafunctionthatbuildsasubgraphtocomputeitspartialderivativeswithregardstoitsinputs.Forexample,supposeyouimplementafunctionthatcomputesthesquareofitsinputf(x)=x2.Inthatcaseyouwouldneedtoprovidethecorrespondingderivativefunctionf′(x)=2x.Notethatthisfunctiondoesnotcomputeanumericalresult,butinsteadbuildsasubgraphthatwill(later)computetheresult.Thisisveryusefulbecauseitmeansthatyoucancomputegradientsofgradients(tocomputesecond-orderderivatives,orevenhigher-orderderivatives).

AppendixE.OtherPopularANNArchitectures

InthisappendixwewillgiveaquickoverviewofafewhistoricallyimportantneuralnetworkarchitecturesthataremuchlessusedtodaythandeepMulti-LayerPerceptrons(Chapter10),convolutionalneuralnetworks(Chapter13),recurrentneuralnetworks(Chapter14),orautoencoders(Chapter15).Theyareoftenmentionedintheliterature,andsomearestillusedinmanyapplications,soitisworthknowingaboutthem.Moreover,wewilldiscussdeepbeliefnets(DBNs),whichwerethestateoftheartinDeepLearninguntiltheearly2010s.Theyarestillthesubjectofveryactiveresearch,sotheymaywellcomebackwithavengeanceinthenearfuture.

HopfieldNetworksHopfieldnetworkswerefirstintroducedbyW.A.Littlein1974,thenpopularizedbyJ.Hopfieldin1982.Theyareassociativememorynetworks:youfirstteachthemsomepatterns,andthenwhentheyseeanewpatternthey(hopefully)outputtheclosestlearnedpattern.Thishasmadethemusefulinparticularforcharacterrecognitionbeforetheywereoutperformedbyotherapproaches.Youfirsttrainthenetworkbyshowingitexamplesofcharacterimages(eachbinarypixelmapstooneneuron),andthenwhenyoushowitanewcharacterimage,afterafewiterationsitoutputstheclosestlearnedcharacter.

Theyarefullyconnectedgraphs(seeFigureE-1);thatis,everyneuronisconnectedtoeveryotherneuron.Notethatonthediagramtheimagesare6×6pixels,sotheneuralnetworkontheleftshouldcontain36neurons(and648connections),butforvisualclarityamuchsmallernetworkisrepresented.

FigureE-1.Hopfieldnetwork

ThetrainingalgorithmworksbyusingHebb’srule:foreachtrainingimage,theweightbetweentwoneuronsisincreasedifthecorrespondingpixelsarebothonorbothoff,butdecreasedifonepixelisonandtheotherisoff.

Toshowanewimagetothenetwork,youjustactivatetheneuronsthatcorrespondtoactivepixels.Thenetworkthencomputestheoutputofeveryneuron,andthisgivesyouanewimage.Youcanthentakethisnewimageandrepeatthewholeprocess.Afterawhile,thenetworkreachesastablestate.Generally,thiscorrespondstothetrainingimagethatmostresemblestheinputimage.

Aso-calledenergyfunctionisassociatedwithHopfieldnets.Ateachiteration,theenergydecreases,sothenetworkisguaranteedtoeventuallystabilizetoalow-energystate.Thetrainingalgorithmtweakstheweightsinawaythatdecreasestheenergylevelofthetrainingpatterns,sothenetworkislikelytostabilizeinoneoftheselow-energyconfigurations.Unfortunately,somepatternsthatwerenotinthetrainingsetalsoendupwithlowenergy,sothenetworksometimesstabilizesinaconfigurationthatwas

notlearned.Thesearecalledspuriouspatterns.

AnothermajorflawwithHopfieldnetsisthattheydon’tscaleverywell—theirmemorycapacityisroughlyequalto14%ofthenumberofneurons.Forexample,toclassify28×28images,youwouldneedaHopfieldnetwith784fullyconnectedneuronsand306,936weights.Suchanetworkwouldonlybeabletolearnabout110differentcharacters(14%of784).That’salotofparametersforsuchasmallmemory.

Boltzmann Machines

Boltzmann machines were invented in 1985 by Geoffrey Hinton and Terrence Sejnowski. Just like Hopfield nets, they are fully connected ANNs, but they are based on stochastic neurons: instead of using a deterministic step function to decide what value to output, these neurons output 1 with some probability, and 0 otherwise. The probability function that these ANNs use is based on the Boltzmann distribution (used in statistical mechanics), hence their name. Equation E-1 gives the probability that a particular neuron will output a 1.

Equation E-1. Probability that the ith neuron will output 1

p\left(s_i = 1\right) = \sigma\left(\frac{\sum_{j=1}^{N} w_{i,j}\, s_j + b_i}{T}\right)

where:

s_j is the jth neuron's state (0 or 1).

w_{i,j} is the connection weight between the ith and jth neurons. Note that w_{i,i} = 0.

b_i is the ith neuron's bias term. We can implement this term by adding a bias neuron to the network.

N is the number of neurons in the network.

T is a number called the network's temperature; the higher the temperature, the more random the output is (i.e., the more the probability approaches 50%).

σ is the logistic function.

Neurons in Boltzmann machines are separated into two groups: visible units and hidden units (see Figure E-2). All neurons work in the same stochastic way, but the visible units are the ones that receive the inputs and from which outputs are read.

Figure E-2. Boltzmann machine

Because of its stochastic nature, a Boltzmann machine will never stabilize into a fixed configuration, but instead it will keep switching between many configurations. If it is left running for a sufficiently long time, the probability of observing a particular configuration will only be a function of the connection weights and bias terms, not of the original configuration (similarly, after you shuffle a deck of cards for long enough, the configuration of the deck does not depend on the initial state). When the network reaches this state where the original configuration is "forgotten," it is said to be in thermal equilibrium (although its configuration keeps changing all the time). By setting the network parameters appropriately, letting the network reach thermal equilibrium, and then observing its state, we can simulate a wide range of probability distributions. This is called a generative model.

Training a Boltzmann machine means finding the parameters that will make the network approximate the training set's probability distribution. For example, if there are three visible neurons and the training set contains 75% (0, 1, 1) triplets, 10% (0, 0, 1) triplets, and 15% (1, 1, 1) triplets, then after training a Boltzmann machine, you could use it to generate random binary triplets with about the same probability distribution. For example, about 75% of the time it would output the (0, 1, 1) triplet.

Such a generative model can be used in a variety of ways. For example, if it is trained on images, and you provide an incomplete or noisy image to the network, it will automatically "repair" the image in a reasonable way. You can also use a generative model for classification. Just add a few visible neurons to encode the training image's class (e.g., add 10 visible neurons and turn on only the fifth neuron when the training image represents a 5). Then, when given a new image, the network will automatically turn on the appropriate visible neurons, indicating the image's class (e.g., it will turn on the fifth visible neuron if the image represents a 5).

Unfortunately, there is no efficient technique to train Boltzmann machines. However, fairly efficient algorithms have been developed to train restricted Boltzmann machines (RBMs).

Restricted Boltzmann Machines

An RBM is simply a Boltzmann machine in which there are no connections between visible units or between hidden units, only between visible and hidden units. For example, Figure E-3 represents an RBM with three visible units and four hidden units.

Figure E-3. Restricted Boltzmann machine

A very efficient training algorithm, called Contrastive Divergence, was introduced in 2005 by Miguel Á. Carreira-Perpiñán and Geoffrey Hinton.1 Here is how it works: for each training instance x, the algorithm starts by feeding it to the network by setting the state of the visible units to x_1, x_2, …, x_n. Then you compute the state of the hidden units by applying the stochastic equation described before (Equation E-1). This gives you a hidden vector h (where h_i is equal to the state of the ith unit). Next you compute the state of the visible units, by applying the same stochastic equation. This gives you a vector x′. Then once again you compute the state of the hidden units, which gives you a vector h′. Now you can update each connection weight by applying the rule in Equation E-2.

Equation E-2. Contrastive divergence weight update

w_{i,j} \leftarrow w_{i,j} + \eta\left(\mathbf{x}\,\mathbf{h}^\top - \mathbf{x}'\,\mathbf{h}'^\top\right)

where η is the learning rate, x and h are the visible and hidden state vectors computed from the training instance, and x′ and h′ are the ones computed from the reconstruction.

The great benefit of this algorithm is that it does not require waiting for the network to reach thermal equilibrium: it just goes forward, backward, and forward again, and that's it. This makes it incomparably more efficient than previous algorithms, and it was a key ingredient to the first success of Deep Learning based on multiple stacked RBMs.
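Here is what one such update can look like as a minimal NumPy sketch (not code from the book; it assumes binary units, a temperature of 1, a weight matrix W of shape [n_hidden, n_visible], and η as the learning rate):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_binary(probs, rng):
    # Each unit outputs 1 with the given probability, 0 otherwise (Equation E-1 with T = 1).
    return (rng.random(probs.shape) < probs).astype(np.float64)

def cd1_update(W, b_visible, b_hidden, x, eta=0.01, rng=None):
    # One Contrastive Divergence (CD-1) step for a single 0/1 training vector x.
    if rng is None:
        rng = np.random.default_rng()
    h = sample_binary(sigmoid(W @ x + b_hidden), rng)           # hidden states given the data
    x_prime = sample_binary(sigmoid(W.T @ h + b_visible), rng)  # reconstructed visible states
    h_prime = sigmoid(W @ x_prime + b_hidden)                   # hidden probabilities for the reconstruction
    W += eta * (np.outer(h, x) - np.outer(h_prime, x_prime))    # Equation E-2 (with W laid out hidden x visible)
    b_hidden += eta * (h - h_prime)                             # the usual companion bias updates
    b_visible += eta * (x - x_prime)
    return W, b_visible, b_hidden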

Deep Belief Nets

Several layers of RBMs can be stacked; the hidden units of the first-layer RBM serve as the visible units for the second-layer RBM, and so on. Such an RBM stack is called a deep belief net (DBN).

Yee-Whye Teh, one of Geoffrey Hinton's students, observed that it was possible to train DBNs one layer at a time using Contrastive Divergence, starting with the lower layers and then gradually moving up to the top layers. This led to the groundbreaking article that kickstarted the Deep Learning tsunami in 2006.2
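Schematically, this greedy layer-by-layer procedure looks like the following minimal NumPy sketch (not code from the book; train_rbm() runs CD-1 as in the earlier sketch, with arbitrary hyperparameters, and feeding hidden probabilities to the next layer is a common simplification):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_rbm(data, n_hidden, n_epochs=10, eta=0.01, rng=None):
    # Train one RBM with CD-1 and return its weight matrix and hidden biases.
    if rng is None:
        rng = np.random.default_rng(42)
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_hidden, n_visible))
    b_v, b_h = np.zeros(n_visible), np.zeros(n_hidden)
    for _ in range(n_epochs):
        for x in data:
            h = (rng.random(n_hidden) < sigmoid(W @ x + b_h)).astype(float)
            x_p = (rng.random(n_visible) < sigmoid(W.T @ h + b_v)).astype(float)
            h_p = sigmoid(W @ x_p + b_h)
            W += eta * (np.outer(h, x) - np.outer(h_p, x_p))
            b_h += eta * (h - h_p)
            b_v += eta * (x - x_p)
    return W, b_h

def greedy_pretrain(data, layer_sizes):
    # Train a stack of RBMs one layer at a time; each layer's hidden
    # probabilities become the next layer's training data.
    layers = []
    for n_hidden in layer_sizes:
        W, b_h = train_rbm(data, n_hidden)
        layers.append((W, b_h))
        data = sigmoid(data @ W.T + b_h)
    return layers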

Just like RBMs, DBNs learn to reproduce the probability distribution of their inputs, without any supervision. However, they are much better at it, for the same reason that deep neural networks are more powerful than shallow ones: real-world data is often organized in hierarchical patterns, and DBNs take advantage of that. Their lower layers learn low-level features in the input data, while higher layers learn high-level features.

Just like RBMs, DBNs are fundamentally unsupervised, but you can also train them in a supervised manner by adding some visible units to represent the labels. Moreover, one great feature of DBNs is that they can be trained in a semisupervised fashion. Figure E-4 represents such a DBN configured for semisupervised learning.

Figure E-4. A deep belief network configured for semisupervised learning

First, RBM 1 is trained without supervision. It learns low-level features in the training data. Then RBM 2 is trained with RBM 1's hidden units as inputs, again without supervision: it learns higher-level features (note that RBM 2's hidden units include only the three rightmost units, not the label units). Several more RBMs could be stacked this way, but you get the idea. So far, training was 100% unsupervised. Lastly, RBM 3 is trained using RBM 2's hidden units as inputs, along with extra visible units used to represent the target labels (e.g., a one-hot vector representing the instance class). It learns to associate high-level features with training labels. This is the supervised step.

At the end of training, if you feed RBM 1 a new instance, the signal will propagate up to RBM 2, then up to the top of RBM 3, and then back down to the label units; hopefully, the appropriate label will light up. This is how a DBN can be used for classification.

One great benefit of this semisupervised approach is that you don't need much labeled training data. If the unsupervised RBMs do a good enough job, then only a small amount of labeled training instances per class will be necessary. Similarly, a baby learns to recognize objects without supervision, so when you point to a chair and say "chair," the baby can associate the word "chair" with the class of objects it has already learned to recognize on its own. You don't need to point to every single chair and say "chair"; only a few examples will suffice (just enough so the baby can be sure that you are indeed referring to the chair, not to its color or one of the chair's parts).

Quite amazingly, DBNs can also work in reverse. If you activate one of the label units, the signal will propagate up to the hidden units of RBM 3, then down to RBM 2, and then RBM 1, and a new instance will be output by the visible units of RBM 1. This new instance will usually look like a regular instance of the class whose label unit you activated. This generative capability of DBNs is quite powerful. For example, it has been used to automatically generate captions for images, and vice versa: first a DBN is trained (without supervision) to learn features in images, and another DBN is trained (again without supervision) to learn features in sets of captions (e.g., "car" often comes with "automobile"). Then an RBM is stacked on top of both DBNs and trained with a set of images along with their captions; it learns to associate high-level features in images with high-level features in captions. Next, if you feed the image DBN an image of a car, the signal will propagate through the network, up to the top-level RBM, and back down to the bottom of the caption DBN, producing a caption. Due to the stochastic nature of RBMs and DBNs, the caption will keep changing randomly, but it will generally be appropriate for the image. If you generate a few hundred captions, the most frequently generated ones will likely be a good description of the image.3

Self-Organizing Maps

Self-organizing maps (SOMs) are quite different from all the other types of neural networks we have discussed so far. They are used to produce a low-dimensional representation of a high-dimensional dataset, generally for visualization, clustering, or classification. The neurons are spread across a map (typically 2D for visualization, but it can be any number of dimensions you want), as shown in Figure E-5, and each neuron has a weighted connection to every input (note that the diagram shows just two inputs, but there are typically a very large number, since the whole point of SOMs is to reduce dimensionality).

Figure E-5. Self-organizing maps

Once the network is trained, you can feed it a new instance and this will activate only one neuron (i.e., one point on the map): the neuron whose weight vector is closest to the input vector. In general, instances that are nearby in the original input space will activate neurons that are nearby on the map. This makes SOMs useful for visualization (in particular, you can easily identify clusters on the map), but also for applications like speech recognition. For example, if each instance represents the audio recording of a person pronouncing a vowel, then different pronunciations of the vowel "a" will activate neurons in the same area of the map, while instances of the vowel "e" will activate neurons in another area, and intermediate sounds will generally activate intermediate neurons on the map.

NOTE
One important difference with the other dimensionality reduction techniques discussed in Chapter 8 is that all instances get mapped to a discrete number of points in the low-dimensional space (one point per neuron). When there are very few neurons, this technique is better described as clustering rather than dimensionality reduction.

The training algorithm is unsupervised. It works by having all the neurons compete against each other. First, all the weights are initialized randomly. Then a training instance is picked randomly and fed to the network. All neurons compute the distance between their weight vector and the input vector (this is very different from the artificial neurons we have seen so far). The neuron that measures the smallest distance wins and tweaks its weight vector to be slightly closer to the input vector, making it more likely to win future competitions for other inputs similar to this one. It also recruits its neighboring neurons, and they too update their weight vectors to be slightly closer to the input vector (but they don't update their weights as much as the winner neuron). Then the algorithm picks another training instance and repeats the process, again and again. This algorithm tends to make nearby neurons gradually specialize in similar inputs.4
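Here is a minimal NumPy sketch of this competitive loop (not code from the book; the 10 × 10 map size, the learning rate, and the neighborhood radius are arbitrary choices, and in practice both the learning rate and the radius are usually decayed over the course of training):

import numpy as np

def train_som(X, map_shape=(10, 10), n_iterations=5000, eta=0.5, sigma=2.0):
    # X: array of shape (n_instances, n_inputs). Returns the trained weight map
    # of shape (map_rows, map_cols, n_inputs).
    n_rows, n_cols = map_shape
    rng = np.random.default_rng(42)
    weights = rng.random((n_rows, n_cols, X.shape[1]))     # random initialization
    # (row, col) coordinates of every map neuron, used to compute the neighborhood
    grid = np.stack(np.meshgrid(np.arange(n_rows), np.arange(n_cols),
                                indexing="ij"), axis=-1)
    for _ in range(n_iterations):
        x = X[rng.integers(len(X))]                        # pick a training instance at random
        dists = np.linalg.norm(weights - x, axis=-1)       # every neuron's distance to x
        winner = np.unravel_index(np.argmin(dists), dists.shape)
        # neurons close to the winner on the map get a larger update
        map_dists = np.linalg.norm(grid - np.array(winner), axis=-1)
        influence = np.exp(-map_dists ** 2 / (2 * sigma ** 2))
        weights += eta * influence[..., np.newaxis] * (x - weights)
    return weights

To map a new instance x_new after training, you would simply look up np.argmin(np.linalg.norm(weights - x_new, axis=-1)), i.e., the neuron whose weight vector is closest.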

1. "On Contrastive Divergence Learning," M. Á. Carreira-Perpiñán and G. Hinton (2005).

2. "A Fast Learning Algorithm for Deep Belief Nets," G. Hinton, S. Osindero, Y. Teh (2006).

3. See this video by Geoffrey Hinton for more details and a demo: http://goo.gl/7Z5QiS.

4. You can imagine a class of young children with roughly similar skills. One child happens to be slightly better at basketball. This motivates her to practice more, especially with her friends. After a while, this group of friends gets so good at basketball that other kids cannot compete. But that's okay, because the other kids specialize in other topics. After a while, the class is full of little specialized groups.


max-norm,Max-NormRegularization-Max-NormRegularization

RidgeRegression,RidgeRegression-RidgeRegression

shrinkage,GradientBoosting

ℓ1andℓ2regularization,ℓ1andℓ2Regularization-ℓ1andℓ2Regularization

REINFORCEalgorithms,PolicyGradients

ReinforcementLearning(RL),ReinforcementLearning-ReinforcementLearning,ReinforcementLearning-ThankYou!

actions,EvaluatingActions:TheCreditAssignmentProblem-EvaluatingActions:TheCreditAssignmentProblem

creditassignmentproblem,EvaluatingActions:TheCreditAssignmentProblem-EvaluatingActions:TheCreditAssignmentProblem

discountrate,EvaluatingActions:TheCreditAssignmentProblem

examplesof,LearningtoOptimizeRewards

Markovdecisionprocesses,MarkovDecisionProcesses-MarkovDecisionProcesses

neuralnetworkpolicies,NeuralNetworkPolicies-NeuralNetworkPolicies

OpenAIgym,IntroductiontoOpenAIGym-IntroductiontoOpenAIGym

PGalgorithms,PolicyGradients-PolicyGradients

policysearch,PolicySearch-PolicySearch

Q-Learningalgorithm,TemporalDifferenceLearningandQ-Learning-LearningtoPlayMs.Pac-ManUsingDeepQ-Learning

rewards,learningtooptimize,LearningtoOptimizeRewards-LearningtoOptimizeRewards

TemporalDifference(TD)Learning,TemporalDifferenceLearningandQ-Learning-TemporalDifferenceLearningandQ-Learning

ReLU(rectifiedlinearunits),Modularity-Modularity

ReLUactivation,ResNet

ReLUfunction,Multi-LayerPerceptronandBackpropagation,ActivationFunctions,XavierandHeInitialization-NonsaturatingActivationFunctions

relu(z),ConstructionPhase

render(),IntroductiontoOpenAIGym

replaymemory,LearningtoPlayMs.Pac-ManUsingDeepQ-Learning

replica_device_setter(),ShardingVariablesAcrossMultipleParameterServers

request_stop(),MultithreadedreadersusingaCoordinatorandaQueueRunner

reset(),IntroductiontoOpenAIGym

reset_default_graph(),ManagingGraphs

reshape(),TrainingtoPredictTimeSeries

residualerrors,GradientBoosting-GradientBoosting

residuallearning,ResNet

residualnetwork(ResNet),ModelZoos,ResNet-ResNet

residualunits,ResNet

ResNet,ResNet-ResNet

resourcecontainers,SharingStateAcrossSessionsUsingResourceContainers-SharingStateAcrossSessionsUsingResourceContainers

restore(),SavingandRestoringModels

restrictedBoltzmannmachines(RBMs),Semisupervisedlearning,UnsupervisedPretraining,BoltzmannMachines

reuse_variables(),SharingVariables

reverse-modeautodiff,Reverse-ModeAutodiff-Reverse-ModeAutodiff

rewards,inRL,LearningtoOptimizeRewards-LearningtoOptimizeRewards

rgb_array,IntroductiontoOpenAIGym

RidgeRegression,RidgeRegression-RidgeRegression,ElasticNet

RMSProp,RMSProp

ROC(receiveroperatingcharacteristic)curve,TheROCCurve-TheROCCurve

RootMeanSquareError(RMSE),SelectaPerformanceMeasure-SelectaPerformanceMeasure,LinearRegression

RReLU(randomizedleakyReLU),NonsaturatingActivationFunctions

run(),CreatingYourFirstGraphandRunningItinaSession,In-GraphVersusBetween-GraphReplication

S

SampledSoftmax,AnEncoder–DecoderNetworkforMachineTranslation

samplingbias,NonrepresentativeTrainingData-Poor-QualityData,CreateaTestSet

samplingnoise,NonrepresentativeTrainingData

save(),SavingandRestoringModels

Savernode,SavingandRestoringModels

ScikitFlow,UpandRunningwithTensorFlow

Scikit-Learn,CreatetheWorkspace

about,ObjectiveandApproach

baggingandpastingin,BaggingandPastinginScikit-Learn-BaggingandPastinginScikit-Learn

CARTalgorithm,MakingPredictions-TheCARTTrainingAlgorithm,Regression

cross-validation,BetterEvaluationUsingCross-Validation-BetterEvaluationUsingCross-Validation

designprinciples,DataCleaning-DataCleaning

imputer,DataCleaning-HandlingTextandCategoricalAttributes

LinearSVRclass,SVMRegression

MinMaxScaler,FeatureScaling

min_andmax_hyperparameters,RegularizationHyperparameters

PCAimplementation,UsingScikit-Learn

Perceptronclass,ThePerceptron

Pipelineconstructor,TransformationPipelines-SelectandTrainaModel,NonlinearSVMClassification

RandomizedPCA,RandomizedPCA

RidgeRegressionwith,RidgeRegression

SAMME,AdaBoost

SGDClassifier,TrainingaBinaryClassifier,Precision/RecallTradeoff-Precision/RecallTradeoff,MulticlassClassification

SGDRegressor,StochasticGradientDescent

sklearn.base.BaseEstimator,CustomTransformers,TransformationPipelines,MeasuringAccuracyUsingCross-Validation

sklearn.base.clone(),MeasuringAccuracyUsingCross-Validation,EarlyStopping

sklearn.base.TransformerMixin,CustomTransformers,TransformationPipelines

sklearn.datasets.fetch_california_housing(),LinearRegressionwithTensorFlow

sklearn.datasets.fetch_mldata(),MNIST

sklearn.datasets.load_iris(),DecisionBoundaries,SoftMarginClassification,TrainingandVisualizingaDecisionTree,FeatureImportance,ThePerceptron

sklearn.datasets.load_sample_images(),TensorFlowImplementation-TensorFlowImplementation

sklearn.datasets.make_moons(),NonlinearSVMClassification,Exercises

sklearn.decomposition.IncrementalPCA,IncrementalPCA

sklearn.decomposition.KernelPCA, Kernel PCA-Selecting a Kernel and Tuning Hyperparameters, Selecting a Kernel and Tuning Hyperparameters

sklearn.decomposition.PCA,UsingScikit-Learn

sklearn.ensemble.AdaBoostClassifier,AdaBoost

sklearn.ensemble.BaggingClassifier,BaggingandPastinginScikit-Learn-RandomForests

sklearn.ensemble.GradientBoostingRegressor,GradientBoosting,GradientBoosting-GradientBoosting

sklearn.ensemble.RandomForestClassifier,TheROCCurve,MulticlassClassification,VotingClassifiers

sklearn.ensemble.RandomForestRegressor,BetterEvaluationUsingCross-Validation,GridSearch-AnalyzetheBestModelsandTheirErrors,RandomForests-Extra-Trees,GradientBoosting

sklearn.ensemble.VotingClassifier,VotingClassifiers

sklearn.externals.joblib,BetterEvaluationUsingCross-Validation

sklearn.linear_model.ElasticNet,ElasticNet

sklearn.linear_model.Lasso,LassoRegression

sklearn.linear_model.LinearRegression,Model-basedlearning-Model-basedlearning,DataCleaning,TrainingandEvaluatingontheTrainingSet,TheNormalEquation,Mini-batchGradientDescent,PolynomialRegression,LearningCurves-LearningCurves

sklearn.linear_model.LogisticRegression,DecisionBoundaries,DecisionBoundaries,SoftmaxRegression,VotingClassifiers,SelectingaKernelandTuningHyperparameters

sklearn.linear_model.Perceptron,ThePerceptron

sklearn.linear_model.Ridge,RidgeRegression

sklearn.linear_model.SGDClassifier,TrainingaBinaryClassifier

sklearn.linear_model.SGDRegressor, Stochastic Gradient Descent-Mini-batch Gradient Descent, Ridge Regression, Lasso Regression-Early Stopping

sklearn.manifold.LocallyLinearEmbedding,LLE-LLE

sklearn.metrics.accuracy_score(),VotingClassifiers,Out-of-BagEvaluation,TraininganMLPwithTensorFlow’sHigh-LevelAPI

sklearn.metrics.confusion_matrix(),ConfusionMatrix,ErrorAnalysis

sklearn.metrics.f1_score(),PrecisionandRecall,MultilabelClassification

sklearn.metrics.mean_squared_error(),TrainingandEvaluatingontheTrainingSet-TrainingandEvaluatingontheTrainingSet,EvaluateYourSystemontheTestSet,LearningCurves,EarlyStopping,GradientBoosting-GradientBoosting,SelectingaKernelandTuningHyperparameters

sklearn.metrics.precision_recall_curve(),Precision/RecallTradeoff

sklearn.metrics.precision_score(),PrecisionandRecall,Precision/RecallTradeoff

sklearn.metrics.recall_score(),PrecisionandRecall,Precision/RecallTradeoff

sklearn.metrics.roc_auc_score(),TheROCCurve-TheROCCurve

sklearn.metrics.roc_curve(),TheROCCurve-TheROCCurve

sklearn.model_selection.cross_val_predict(),ConfusionMatrix,Precision/RecallTradeoff,TheROCCurve,ErrorAnalysis,MultilabelClassification

sklearn.model_selection.cross_val_score(),BetterEvaluationUsingCross-Validation-BetterEvaluationUsingCross-Validation,MeasuringAccuracyUsingCross-Validation-ConfusionMatrix

sklearn.model_selection.GridSearchCV,GridSearch-RandomizedSearch,Exercises,ErrorAnalysis,Exercises,SelectingaKernelandTuningHyperparameters

sklearn.model_selection.StratifiedKFold,MeasuringAccuracyUsingCross-Validation

sklearn.model_selection.StratifiedShuffleSplit,CreateaTestSet

sklearn.model_selection.train_test_split(),CreateaTestSet,TrainingandEvaluatingontheTrainingSet,LearningCurves,Exercises,GradientBoosting

sklearn.multiclass.OneVsOneClassifier,MulticlassClassification

sklearn.neighbors.KNeighborsClassifier,MultilabelClassification,Exercises

sklearn.neighbors.KNeighborsRegressor,Model-basedlearning

sklearn.pipeline.FeatureUnion,TransformationPipelines

sklearn.pipeline.Pipeline,TransformationPipelines,LearningCurves,SoftMarginClassification-NonlinearSVMClassification,SelectingaKernelandTuningHyperparameters

sklearn.preprocessing.Imputer,DataCleaning,TransformationPipelines

sklearn.preprocessing.LabelBinarizer,HandlingTextandCategoricalAttributes,TransformationPipelines

sklearn.preprocessing.LabelEncoder,HandlingTextandCategoricalAttributes

sklearn.preprocessing.OneHotEncoder,HandlingTextandCategoricalAttributes

sklearn.preprocessing.PolynomialFeatures,PolynomialRegression-PolynomialRegression,LearningCurves,RidgeRegression,NonlinearSVMClassification

sklearn.preprocessing.StandardScaler,FeatureScaling-TransformationPipelines,MulticlassClassification,GradientDescent,RidgeRegression,LinearSVMClassification,SoftMarginClassification-PolynomialKernel,GaussianRBFKernel,ImplementingGradientDescent,TraininganMLPwithTensorFlow’sHigh-LevelAPI

sklearn.svm.LinearSVC,SoftMarginClassification-NonlinearSVMClassification,GaussianRBFKernel-ComputationalComplexity,SVMRegression,Exercises

sklearn.svm.LinearSVR,SVMRegression-SVMRegression

sklearn.svm.SVC,SoftMarginClassification,PolynomialKernel,GaussianRBFKernel-ComputationalComplexity,SVMRegression,Exercises,VotingClassifiers

sklearn.svm.SVR,Exercises,SVMRegression

sklearn.tree.DecisionTreeClassifier,RegularizationHyperparameters,Exercises,BaggingandPastinginScikit-Learn-Out-of-BagEvaluation,RandomForests,AdaBoost

sklearn.tree.DecisionTreeRegressor,TrainingandEvaluatingontheTrainingSet,DecisionTrees,Regression,GradientBoosting-GradientBoosting

sklearn.tree.export_graphviz(),TrainingandVisualizingaDecisionTree

StandardScaler,GradientDescent,ImplementingGradientDescent,TraininganMLPwithTensorFlow’sHigh-LevelAPI

SVMclassificationclasses,ComputationalComplexity

TF.Learn,UpandRunningwithTensorFlow

userguide,OtherResources

score(),DataCleaning

searchspace,RandomizedSearch,Fine-TuningNeuralNetworkHyperparameters

second-orderpartialderivatives(Hessians),AdamOptimization

self-organizingmaps(SOMs),Self-OrganizingMaps-Self-OrganizingMaps

semantichashing,Exercises

semisupervisedlearning,Semisupervisedlearning

sensitivity,ConfusionMatrix,TheROCCurve

sentimentanalysis,RecurrentNeuralNetworks

separable_conv2d(),ResNet

sequences,RecurrentNeuralNetworks

sequence_length,HandlingVariableLengthInputSequences-HandlingVariable-LengthOutputSequences,AnEncoder–DecoderNetworkforMachineTranslation

Shannon'sinformationtheory,GiniImpurityorEntropy?

shortcutconnections,ResNet

show(),TakeaQuickLookattheDataStructure

show_graph(),VisualizingtheGraphandTrainingCurvesUsingTensorBoard

shrinkage,GradientBoosting

shuffle_batch(),Otherconveniencefunctions

shuffle_batch_join(),Otherconveniencefunctions

sigmoidfunction,EstimatingProbabilities

sigmoid_cross_entropy_with_logits(),TensorFlowImplementation

similarityfunction,AddingSimilarityFeatures-AddingSimilarityFeatures

simulatedannealing,StochasticGradientDescent

simulatedenvironments,IntroductiontoOpenAIGym

(seealsoOpenAIGym)

SingularValueDecomposition(SVD),PrincipalComponents

skeweddatasets,MeasuringAccuracyUsingCross-Validation

skipconnections,DataAugmentation,ResNet

slackvariable,TrainingObjective

smoothingterms,BatchNormalization,AdaGrad,AdamOptimization,VariationalAutoencoders

softmarginclassification,SoftMarginClassification-SoftMarginClassification

softplacements,Softplacement

softvoting,VotingClassifiers

softmaxfunction,SoftmaxRegression,Multi-LayerPerceptronandBackpropagation,TraininganMLPwithTensorFlow’sHigh-LevelAPI

SoftmaxRegression,SoftmaxRegression-SoftmaxRegression

sourceops,LinearRegressionwithTensorFlow,ParallelExecution

spamfilters,TheMachineLearningLandscape-WhyUseMachineLearning?,Supervisedlearning

sparseautoencoders,SparseAutoencoders-TensorFlowImplementation

sparsematrix,HandlingTextandCategoricalAttributes

sparsemodels,LassoRegression,AdamOptimization

sparse_softmax_cross_entropy_with_logits(),ConstructionPhase

sparsityloss,SparseAutoencoders

specificity,TheROCCurve

speechrecognition,WhyUseMachineLearning?

spuriouspatterns,HopfieldNetworks

stack(),StaticUnrollingThroughTime

stackedautoencoders,StackedAutoencoders-UnsupervisedPretrainingUsingStackedAutoencoders

TensorFlowimplementation,TensorFlowImplementation

trainingone-at-a-time,TrainingOneAutoencoderataTime-TrainingOneAutoencoderataTime

tyingweights,TyingWeights-TyingWeights

unsupervisedpretrainingwith,UnsupervisedPretrainingUsingStackedAutoencoders-UnsupervisedPretrainingUsingStackedAutoencoders

visualizingthereconstructions,VisualizingtheReconstructions-VisualizingtheReconstructions

stackeddenoisingautoencoders,VisualizingFeatures,DenoisingAutoencoders

stackeddenoisingencoders,DenoisingAutoencoders

stackedgeneralization(seestacking)

stacking,Stacking-Stacking

stalegradients,Asynchronousupdates

standardcorrelationcoefficient,LookingforCorrelations

standardization,FeatureScaling

StandardScaler,TransformationPipelines,ImplementingGradientDescent,TraininganMLPwithTensorFlow’sHigh-LevelAPI

state-actionvalues,MarkovDecisionProcesses

statestensor,HandlingVariableLengthInputSequences

state_is_tuple,DistributingaDeepRNNAcrossMultipleGPUs,LSTMCell

staticunrollingthroughtime,StaticUnrollingThroughTime-StaticUnrollingThroughTime

static_rnn(),StaticUnrollingThroughTime-StaticUnrollingThroughTime,AnEncoder–DecoderNetworkforMachineTranslation

stationarypoint,SVMDualProblem-SVMDualProblem

statisticalmode,BaggingandPasting

statisticalsignificance,RegularizationHyperparameters

stemming,Exercises

stepfunctions,ThePerceptron

step(),IntroductiontoOpenAIGym

StochasticGradientBoosting,GradientBoosting

StochasticGradientDescent(SGD),StochasticGradientDescent-StochasticGradientDescent,SoftMarginClassification,ThePerceptron

training,TrainingandCostFunction

StochasticGradientDescent(SGD)classifier,TrainingaBinaryClassifier,RidgeRegression

stochasticneurons,BoltzmannMachines

stochasticpolicy,PolicySearch

stratifiedsampling,CreateaTestSet-CreateaTestSet,MeasuringAccuracyUsingCross-Validation

stride,ConvolutionalLayer

stringkernels,GaussianRBFKernel

string_input_producer(),Otherconveniencefunctions

stronglearners,VotingClassifiers

subderivatives,OnlineSVMs

subgradientvector,LassoRegression

subsample,GradientBoosting,PoolingLayer

supervisedlearning,Supervised/UnsupervisedLearning-Supervisedlearning

SupportVectorMachines(SVMs),MulticlassClassification,SupportVectorMachines-Exercises

decisionfunctionandpredictions,DecisionFunctionandPredictions-DecisionFunctionandPredictions

dualproblem,SVMDualProblem-SVMDualProblem

kernelizedSVM,KernelizedSVM-KernelizedSVM

linearclassification,LinearSVMClassification-SoftMarginClassification

mechanicsof,UndertheHood-OnlineSVMs

nonlinearclassification,NonlinearSVMClassification-ComputationalComplexity

onlineSVMs,OnlineSVMs-OnlineSVMs

QuadraticProgramming(QP)problems,QuadraticProgramming-QuadraticProgramming

SVMregression,SVMRegression-OnlineSVMs

thedualproblem,TheDualProblem

trainingobjective,TrainingObjective-TrainingObjective

supportvectors,LinearSVMClassification

svd(),PrincipalComponents

symbolicdifferentiation,Usingautodiff,SymbolicDifferentiation-NumericalDifferentiation

synchronousupdates,Synchronousupdates

T

t-DistributedStochasticNeighborEmbedding(t-SNE),OtherDimensionalityReductionTechniques

tailheavy,TakeaQuickLookattheDataStructure

targetattributes,TakeaQuickLookattheDataStructure

target_weights,AnEncoder–DecoderNetworkforMachineTranslation

tasks,MultipleDevicesAcrossMultipleServers

TemporalDifference(TD)Learning,TemporalDifferenceLearningandQ-Learning-TemporalDifferenceLearningandQ-Learning

tensorprocessingunits(TPUs),Installation

TensorBoard,UpandRunningwithTensorFlow

TensorFlow,UpandRunningwithTensorFlow-Exercises

about,ObjectiveandApproach

autodiff,Usingautodiff-Usingautodiff,Autodiff-Reverse-ModeAutodiff

BatchNormalizationwith,ImplementingBatchNormalizationwithTensorFlow-ImplementingBatchNormalizationwithTensorFlow

constructionphase,CreatingYourFirstGraphandRunningItinaSession

controldependencies,ControlDependencies

conveniencefunctions,Otherconveniencefunctions

convolutionallayers,ResNet

convolutionalneuralnetworksand,TensorFlowImplementation-TensorFlowImplementation

dataparallelismand,TensorFlowimplementation

denoisingautoencoders,TensorFlowImplementation-TensorFlowImplementation

dropoutwith,Dropout

dynamicplacer,PlacingOperationsonDevices

executionphase,CreatingYourFirstGraphandRunningItinaSession

feedingdatatothetrainingalgorithm,FeedingDatatotheTrainingAlgorithm-FeedingDatatotheTrainingAlgorithm

GradientDescentwith,ImplementingGradientDescent-UsinganOptimizer

graphs,managing,ManagingGraphs

initialgraphcreationandsessionrun,CreatingYourFirstGraphandRunningItinaSession-CreatingYourFirstGraphandRunningItinaSession

installation,Installation

l1andl2regularizationwith,ℓ1andℓ2Regularization

learningschedulesin,LearningRateScheduling

LinearRegressionwith,LinearRegressionwithTensorFlow-LinearRegressionwithTensorFlow

maxpoolinglayerin,PoolingLayer

max-normregularizationwith,Max-NormRegularization

modelzoo,ModelZoos

modularity,Modularity-Modularity

Momentumoptimizationin,MomentumOptimization

namescopes,NameScopes

neuralnetworkpolicies,NeuralNetworkPolicies

NLPtutorials,NaturalLanguageProcessing,AnEncoder–DecoderNetworkforMachineTranslation

nodevaluelifecycle,LifecycleofaNodeValue

operations(ops),LinearRegressionwithTensorFlow

optimizer,UsinganOptimizer

overview,UpandRunningwithTensorFlow-UpandRunningwithTensorFlow

paralleldistributedcomputing(seeparalleldistributedcomputingwithTensorFlow)

PythonAPI

construction,ConstructionPhase-ConstructionPhase

execution,ExecutionPhase

usingtheneuralnetwork,UsingtheNeuralNetwork

queues(seequeues)

reusingpretrainedlayers,ReusingaTensorFlowModel-ReusingaTensorFlowModel

RNNsin,BasicRNNsinTensorFlow-HandlingVariable-LengthOutputSequences

(seealsorecurrentneuralnetworks(RNNs))

savingandrestoringmodels,SavingandRestoringModels-SavingandRestoringModels

sharingvariables,SharingVariables-SharingVariables

simpleplacer,PlacingOperationsonDevices

sparseautoencoderswith,TensorFlowImplementation

andstackedautoencoders,TensorFlowImplementation

TensorBoard,VisualizingtheGraphandTrainingCurvesUsingTensorBoard-VisualizingtheGraphandTrainingCurvesUsingTensorBoard

tf.abs(),ℓ1andℓ2Regularization

tf.add(),Modularity,ℓ1andℓ2Regularization

tf.add_n(),Modularity-SharingVariables,SharingVariables-SharingVariables

tf.add_to_collection(),Max-NormRegularization

tf.assign(), Manually Computing the Gradients, Reusing Models from Other Frameworks, Max-Norm Regularization-Max-Norm Regularization, Chapter 9: Up and Running with TensorFlow

tf.bfloat16,Bandwidthsaturation

tf.bool,Dropout

tf.cast(),ConstructionPhase,TrainingaSequenceClassifier

tf.clip_by_norm(),Max-NormRegularization-Max-NormRegularization

tf.clip_by_value(),GradientClipping

tf.concat(),Exercises,GoogLeNet,NeuralNetworkPolicies,PolicyGradients

tf.ConfigProto,ManagingtheGPURAM,Loggingplacements-Softplacement,In-GraphVersusBetween-GraphReplication,Chapter12:DistributingTensorFlowAcrossDevicesandServers

tf.constant(),LifecycleofaNodeValue-ManuallyComputingtheGradients,Simpleplacement-Dynamicplacementfunction,ControlDependencies,OpeningaSession-PinningOperationsAcrossTasks

tf.constant_initializer(),SharingVariables-SharingVariables

tf.container(),SharingStateAcrossSessionsUsingResourceContainers-AsynchronousCommunicationUsingTensorFlowQueues,TensorFlowimplementation-Exercises,Chapter9:UpandRunningwithTensorFlow

tf.contrib.layers.l1_regularizer(),ℓ1andℓ2Regularization,Max-NormRegularization

tf.contrib.layers.l2_regularizer(),ℓ1andℓ2Regularization,TensorFlowImplementation-TyingWeights

tf.contrib.layers.variance_scaling_initializer(),XavierandHeInitialization-XavierandHeInitialization,TrainingaSequenceClassifier,TensorFlowImplementation-TyingWeights,VariationalAutoencoders,NeuralNetworkPolicies,PolicyGradients,LearningtoPlayMs.Pac-ManUsingDeepQ-Learning

tf.contrib.learn.DNNClassifier,TraininganMLPwithTensorFlow’sHigh-LevelAPI

tf.contrib.learn.infer_real_valued_columns_from_input(),TraininganMLPwithTensorFlow’sHigh-LevelAPI

tf.contrib.rnn.BasicLSTMCell,LSTMCell,PeepholeConnections

tf.contrib.rnn.BasicRNNCell,StaticUnrollingThroughTime-DynamicUnrollingThroughTime,TrainingaSequenceClassifier,TrainingtoPredictTimeSeries-TrainingtoPredictTimeSeries,TrainingtoPredictTimeSeries,DeepRNNs-ApplyingDropout,LSTMCell

tf.contrib.rnn.DropoutWrapper,ApplyingDropout

tf.contrib.rnn.GRUCell,GRUCell

tf.contrib.rnn.LSTMCell,PeepholeConnections

tf.contrib.rnn.MultiRNNCell,DeepRNNs-ApplyingDropout

tf.contrib.rnn.OutputProjectionWrapper,TrainingtoPredictTimeSeries-TrainingtoPredictTimeSeries

tf.contrib.rnn.RNNCell,DistributingaDeepRNNAcrossMultipleGPUs

tf.contrib.rnn.static_rnn(),BasicRNNsinTensorFlow-HandlingVariableLengthInputSequences,AnEncoder–DecoderNetworkforMachineTranslation-Exercises,Chapter14:RecurrentNeuralNetworks-Chapter14:RecurrentNeuralNetworks

tf.contrib.slimmodule,UpandRunningwithTensorFlow,Exercises

tf.contrib.slim.netsmodule(nets),Exercises

tf.control_dependencies(),ControlDependencies

tf.decode_csv(),Readingthetrainingdatadirectlyfromthegraph,MultithreadedreadersusingaCoordinatorandaQueueRunner

tf.device(),Simpleplacement-Softplacement,PinningOperationsAcrossTasks-ShardingVariablesAcrossMultipleParameterServers,DistributingaDeepRNNAcrossMultipleGPUs-DistributingaDeepRNNAcrossMultipleGPUs

tf.exp(),VariationalAutoencoders-GeneratingDigits

tf.FIFOQueue,AsynchronousCommunicationUsingTensorFlowQueues,Queuesoftuples-RandomShuffleQueue,Readingthetrainingdatadirectlyfromthegraph,MultithreadedreadersusingaCoordinatorandaQueueRunner

tf.float32,LinearRegressionwithTensorFlow,Chapter9:UpandRunningwithTensorFlow

tf.get_collection(),ReusingaTensorFlowModel-FreezingtheLowerLayers,ℓ1andℓ2Regularization,Max-NormRegularization,TensorFlowImplementation,LearningtoPlayMs.Pac-ManUsingDeepQ-Learning

tf.get_default_graph(),ManagingGraphs,VisualizingtheGraphandTrainingCurvesUsingTensorBoard

tf.get_default_session(),CreatingYourFirstGraphandRunningItinaSession

tf.get_variable(),SharingVariables-SharingVariables,ReusingModelsfromOtherFrameworks,ℓ1andℓ2Regularization

tf.global_variables(),ReusingaTensorFlowModel

tf.global_variables_initializer(),CreatingYourFirstGraphandRunningItinaSession,ManuallyComputingtheGradients

tf.gradients(),Usingautodiff

tf.Graph,CreatingYourFirstGraphandRunningItinaSession,ManagingGraphs,VisualizingtheGraphandTrainingCurvesUsingTensorBoard,LoadingDataDirectlyfromtheGraph,In-GraphVersusBetween-GraphReplication

tf.GraphKeys.GLOBAL_VARIABLES,ReusingaTensorFlowModel-FreezingtheLowerLayers

tf.GraphKeys.REGULARIZATION_LOSSES,ℓ1andℓ2Regularization,TensorFlowImplementation

tf.group(),LearningtoPlayMs.Pac-ManUsingDeepQ-Learning

tf.int32,Operationsandkernels-Queuesoftuples,Readingthetrainingdatadirectlyfromthegraph,HandlingVariableLengthInputSequences,TrainingaSequenceClassifier,WordEmbeddings,LearningtoPlayMs.Pac-ManUsingDeepQ-Learning

tf.int64,ConstructionPhase

tf.InteractiveSession,CreatingYourFirstGraphandRunningItinaSession

tf.layers.batch_normalization(), Implementing Batch Normalization with TensorFlow-Implementing Batch Normalization with TensorFlow

tf.layers.dense(),ConstructionPhase

TF.Learn,TraininganMLPwithTensorFlow’sHigh-LevelAPI

tf.log(),TensorFlowImplementation,VariationalAutoencoders,NeuralNetworkPolicies,PolicyGradients

tf.matmul(),LinearRegressionwithTensorFlow-ManuallyComputingtheGradients,Modularity,ConstructionPhase,BasicRNNsinTensorFlow,TyingWeights,TrainingOneAutoencoderataTime,TensorFlowImplementation,TensorFlowImplementation-TensorFlowImplementation

tf.matrix_inverse(),LinearRegressionwithTensorFlow

tf.maximum(),Modularity,SharingVariables-SharingVariables,NonsaturatingActivationFunctions

tf.multinomial(),NeuralNetworkPolicies,PolicyGradients

tf.name_scope(),NameScopes,Modularity-SharingVariables,ConstructionPhase,ConstructionPhase-ConstructionPhase,TrainingOneAutoencoderataTime-TrainingOneAutoencoderataTime

tf.nn.conv2d(),TensorFlowImplementation-TensorFlowImplementation

tf.nn.dynamic_rnn(),StaticUnrollingThroughTime-DynamicUnrollingThroughTime,TrainingaSequenceClassifier,TrainingtoPredictTimeSeries,TrainingtoPredictTimeSeries,DeepRNNs-ApplyingDropout,AnEncoder–DecoderNetworkforMachineTranslation-Exercises,Chapter14:RecurrentNeuralNetworks-Chapter14:RecurrentNeuralNetworks

tf.nn.elu(),NonsaturatingActivationFunctions,TensorFlowImplementation-TyingWeights,VariationalAutoencoders,NeuralNetworkPolicies,PolicyGradients

tf.nn.embedding_lookup(),WordEmbeddings

tf.nn.in_top_k(),ConstructionPhase,TrainingaSequenceClassifier

tf.nn.max_pool(),PoolingLayer-PoolingLayer

tf.nn.relu(), Construction Phase, Training to Predict Time Series-Training to Predict Time Series, Training to Predict Time Series, Learning to Play Ms. Pac-Man Using Deep Q-Learning

tf.nn.sigmoid_cross_entropy_with_logits(),TensorFlowImplementation,GeneratingDigits,PolicyGradients-PolicyGradients

tf.nn.sparse_softmax_cross_entropy_with_logits(),ConstructionPhase-ConstructionPhase,TrainingaSequenceClassifier

tf.one_hot(),LearningtoPlayMs.Pac-ManUsingDeepQ-Learning

tf.PaddingFIFOQueue,PaddingFifoQueue

tf.placeholder(),FeedingDatatotheTrainingAlgorithm-FeedingDatatotheTrainingAlgorithm,Chapter9:UpandRunningwithTensorFlow

tf.placeholder_with_default(),TensorFlowImplementation

tf.RandomShuffleQueue,RandomShuffleQueue,Readingthetrainingdatadirectlyfromthegraph-Readingthetrainingdatadirectlyfromthegraph,MultithreadedreadersusingaCoordinatorandaQueueRunner-Otherconveniencefunctions

tf.random_normal(),Modularity,BasicRNNsinTensorFlow,TensorFlowImplementation,VariationalAutoencoders

tf.random_uniform(),ManuallyComputingtheGradients,SavingandRestoringModels,WordEmbeddings,Chapter9:UpandRunningwithTensorFlow

tf.reduce_mean(),ManuallyComputingtheGradients,NameScopes,ConstructionPhase-ConstructionPhase,ℓ1andℓ2Regularization,TrainingaSequenceClassifier-TrainingaSequenceClassifier,PerformingPCAwithanUndercompleteLinearAutoencoder,TensorFlowImplementation,TrainingOneAutoencoderataTime,TrainingOneAutoencoderataTime,TensorFlowImplementation,TensorFlowImplementation,LearningtoPlayMs.Pac-ManUsingDeepQ-Learning

tf.reduce_sum(),ℓ1andℓ2Regularization,TensorFlowImplementation-TensorFlowImplementation,VariationalAutoencoders-GeneratingDigits,LearningtoPlayMs.Pac-ManUsingDeepQ-Learning-LearningtoPlayMs.Pac-ManUsingDeepQ-Learning

tf.reset_default_graph(),ManagingGraphs

tf.reshape(),TrainingtoPredictTimeSeries,LearningtoPlayMs.Pac-ManUsingDeepQ-Learning

tf.RunOptions,In-GraphVersusBetween-GraphReplication

tf.Session,CreatingYourFirstGraphandRunningItinaSession,Chapter9:UpandRunningwithTensorFlow

tf.shape(),TensorFlowImplementation,VariationalAutoencoders

tf.square(),ManuallyComputingtheGradients,NameScopes,TrainingtoPredictTimeSeries,PerformingPCAwithanUndercompleteLinearAutoencoder,TensorFlowImplementation,TrainingOneAutoencoderataTime,TrainingOneAutoencoderataTime,TensorFlowImplementation,TensorFlowImplementation,VariationalAutoencoders-GeneratingDigits,LearningtoPlayMs.Pac-ManUsingDeepQ-Learning

tf.stack(),Readingthetrainingdatadirectlyfromthegraph,MultithreadedreadersusingaCoordinatorandaQueueRunner,StaticUnrollingThroughTime

tf.string,Readingthetrainingdatadirectlyfromthegraph,MultithreadedreadersusingaCoordinatorandaQueueRunner

tf.summary.FileWriter,VisualizingtheGraphandTrainingCurvesUsingTensorBoard-VisualizingtheGraphandTrainingCurvesUsingTensorBoard

tf.summary.scalar(),VisualizingtheGraphandTrainingCurvesUsingTensorBoard

tf.tanh(),BasicRNNsinTensorFlow

tf.TextLineReader,Readingthetrainingdatadirectlyfromthegraph,MultithreadedreadersusingaCoordinatorandaQueueRunner

tf.to_float(),PolicyGradients-PolicyGradients

tf.train.AdamOptimizer,AdamOptimization,AdamOptimization,TrainingaSequenceClassifier,TrainingtoPredictTimeSeries,PerformingPCAwithanUndercompleteLinearAutoencoder,TensorFlowImplementation-TyingWeights,TrainingOneAutoencoderataTime,TensorFlowImplementation,GeneratingDigits,PolicyGradients-PolicyGradients,LearningtoPlayMs.Pac-ManUsingDeepQ-Learning

tf.train.ClusterSpec,MultipleDevicesAcrossMultipleServers

tf.train.Coordinator,MultithreadedreadersusingaCoordinatorandaQueueRunner-MultithreadedreadersusingaCoordinatorandaQueueRunner

tf.train.exponential_decay(),LearningRateScheduling

tf.train.GradientDescentOptimizer,UsinganOptimizer,ConstructionPhase,GradientClipping,MomentumOptimization,AdamOptimization

tf.train.MomentumOptimizer,UsinganOptimizer,MomentumOptimization-NesterovAcceleratedGradient,LearningRateScheduling,Exercises,TensorFlowimplementation,Chapter10:IntroductiontoArtificialNeuralNetworks-Chapter11:TrainingDeepNeuralNets

tf.train.QueueRunner,MultithreadedreadersusingaCoordinatorandaQueueRunner-Otherconveniencefunctions

tf.train.replica_device_setter(),ShardingVariablesAcrossMultipleParameterServers-SharingStateAcrossSessionsUsingResourceContainers

tf.train.RMSPropOptimizer,RMSProp

tf.train.Saver,SavingandRestoringModels-SavingandRestoringModels,ConstructionPhase,Exercises,ApplyingDropout,PolicyGradients,LearningtoPlayMs.Pac-ManUsingDeepQ-Learning

tf.train.Server,MultipleDevicesAcrossMultipleServers

tf.train.start_queue_runners(),Otherconveniencefunctions

tf.transpose(),LinearRegressionwithTensorFlow-ManuallyComputingtheGradients,StaticUnrollingThroughTime,TyingWeights

tf.truncated_normal(),ConstructionPhase

tf.unstack(),StaticUnrollingThroughTime-DynamicUnrollingThroughTime,TrainingtoPredictTimeSeries,Chapter14:RecurrentNeuralNetworks

tf.Variable,CreatingYourFirstGraphandRunningItinaSession,Chapter9:UpandRunningwithTensorFlow

tf.variable_scope(), Sharing Variables-Sharing Variables, Reusing Models from Other Frameworks, Sharing State Across Sessions Using Resource Containers, Training a Sequence Classifier, Learning to Play Ms. Pac-Man Using Deep Q-Learning

tf.zeros(),ConstructionPhase,BasicRNNsinTensorFlow,TyingWeights

truncatedbackpropagationthroughtime,TheDifficultyofTrainingoverManyTimeSteps

visualizinggraphandtrainingcurves,VisualizingtheGraphandTrainingCurvesUsingTensorBoard-VisualizingtheGraphandTrainingCurvesUsingTensorBoard

TensorFlowServing,OneNeuralNetworkperDevice

tensorflow.contrib,TraininganMLPwithTensorFlow’sHigh-LevelAPI

testset,TestingandValidating,CreateaTestSet-CreateaTestSet,MNIST

testingandvalidating,TestingandValidating-TestingandValidating

textattributes,HandlingTextandCategoricalAttributes-HandlingTextandCategoricalAttributes

TextLineReader,Readingthetrainingdatadirectlyfromthegraph

TF-slim,UpandRunningwithTensorFlow

tf.layers.conv1d(),ResNet

tf.layers.conv2d(),LearningtoPlayMs.Pac-ManUsingDeepQ-Learning

tf.layers.conv2d_transpose(),ResNet

tf.layers.conv3d(),ResNet

tf.layers.dense(),XavierandHeInitialization,ImplementingBatchNormalizationwithTensorFlow

tf.layers.separable_conv2d(),ResNet

TF.Learn,UpandRunningwithTensorFlow,TraininganMLPwithTensorFlow’sHigh-LevelAPI

tf.nn.atrous_conv2d(),ResNet

tf.nn.depthwise_conv2d(),ResNet

thermalequilibrium,BoltzmannMachines

thread pools (inter-op/intra-op), in TensorFlow, Parallel Execution

thresholdvariable,SharingVariables-SharingVariables

Tikhonovregularization,RidgeRegression

timeseriesdata,RecurrentNeuralNetworks

toarray(),HandlingTextandCategoricalAttributes

tolerancehyperparameter,ComputationalComplexity

training,ImplementingBatchNormalizationwithTensorFlow-ImplementingBatchNormalizationwithTensorFlow,ApplyingDropout

trainingdata,WhatIsMachineLearning?

insufficientquantities,InsufficientQuantityofTrainingData

irrelevantfeatures,IrrelevantFeatures

loading,LoadingDataDirectlyfromtheGraph-Otherconveniencefunctions

nonrepresentative,NonrepresentativeTrainingData

overfitting,OverfittingtheTrainingData-OverfittingtheTrainingData

poorquality,Poor-QualityData

underfitting,UnderfittingtheTrainingData

traininginstance,WhatIsMachineLearning?

trainingmodels,Model-basedlearning,TrainingModels-Exercises

learningcurvesin,LearningCurves-LearningCurves

LinearRegression,TrainingModels,LinearRegression-Mini-batchGradientDescent

LogisticRegression,LogisticRegression-SoftmaxRegression

overview,TrainingModels-TrainingModels

PolynomialRegression,TrainingModels,PolynomialRegression-PolynomialRegression

trainingobjectives,TrainingObjective-TrainingObjective

trainingset,WhatIsMachineLearning?,TestingandValidating,DiscoverandVisualizetheDatatoGainInsights,PreparetheDataforMachineLearningAlgorithms,TrainingandEvaluatingontheTrainingSet-TrainingandEvaluatingontheTrainingSet

costfunctionof,TrainingandCostFunction-TrainingandCostFunction

shuffling,MNIST

transferlearning,ReusingPretrainedLayers-PretrainingonanAuxiliaryTask

(seealsopretrainedlayersreuse)

transform(),DataCleaning,TransformationPipelines

transformationpipelines,TransformationPipelines-SelectandTrainaModel

transformers,DataCleaning

transformers,custom,CustomTransformers-CustomTransformers

transpose(),StaticUnrollingThroughTime

truenegativerate(TNR),TheROCCurve

truepositiverate(TPR),ConfusionMatrix,TheROCCurve

truncatedbackpropagationthroughtime,TheDifficultyofTrainingoverManyTimeSteps

tuples,Queuesoftuples

tyingweights,TyingWeights

U

underfitting,UnderfittingtheTrainingData,TrainingandEvaluatingontheTrainingSet,GaussianRBFKernel

univariateregression,FrametheProblem

unstack(),StaticUnrollingThroughTime

unsupervisedlearning,Unsupervisedlearning-Unsupervisedlearning

anomalydetection,Unsupervisedlearning

associationrulelearning,Unsupervisedlearning,Unsupervisedlearning

clustering,Unsupervisedlearning

dimensionalityreductionalgorithm,Unsupervisedlearning

visualizationalgorithms,Unsupervisedlearning

unsupervisedpretraining,UnsupervisedPretraining-UnsupervisedPretraining,UnsupervisedPretrainingUsingStackedAutoencoders-UnsupervisedPretrainingUsingStackedAutoencoders

upsampling,ResNet

utilityfunction,Model-basedlearning

V

validationset,TestingandValidating

ValueIteration,MarkovDecisionProcesses

value_counts(),TakeaQuickLookattheDataStructure

vanishinggradients,Vanishing/ExplodingGradientsProblems

(seealsogradients,vanishingandexploding)

variables,sharing,SharingVariables-SharingVariables

variable_scope(),SharingVariables-SharingVariables

variance

bias/variancetradeoff,LearningCurves

variancepreservation,PreservingtheVariance-PreservingtheVariance

variance_scaling_initializer(),XavierandHeInitialization

variationalautoencoders,VariationalAutoencoders-GeneratingDigits

VGGNet,ResNet

visualcortex,TheArchitectureoftheVisualCortex

visualization,VisualizingtheGraphandTrainingCurvesUsingTensorBoard-VisualizingtheGraphandTrainingCurvesUsingTensorBoard

visualizationalgorithms,Unsupervisedlearning-Unsupervisedlearning

voicerecognition,ConvolutionalNeuralNetworks

votingclassifiers,VotingClassifiers-VotingClassifiers

W

warmupphase,Asynchronousupdates

weaklearners,VotingClassifiers

weight-tying,TyingWeights

weights,ConstructionPhase

freezing,FreezingtheLowerLayers

while_loop(),DynamicUnrollingThroughTime

whiteboxmodels,MakingPredictions

worker,MultipleDevicesAcrossMultipleServers

workerservice,TheMasterandWorkerServices

worker_device,ShardingVariablesAcrossMultipleParameterServers

workspacedirectory,GettheData-DownloadtheData

X

Xavierinitialization,Vanishing/ExplodingGradientsProblems-XavierandHeInitialization

Y

YouTube,IntroductiontoArtificialNeuralNetworks

Z

zeropadding,ConvolutionalLayer,TensorFlowImplementation

About the Author

Aurélien Géron is a Machine Learning consultant. A former Googler, he led the YouTube video classification team from 2013 to 2016. He was also a founder and CTO of Wifirst from 2002 to 2012, a leading Wireless ISP in France; and a founder and CTO of Polyconseil in 2001, the firm that now manages the electric car sharing service Autolib'.

Before this he worked as an engineer in a variety of domains: finance (JP Morgan and Société Générale), defense (Canada's DOD), and healthcare (blood transfusion). He published a few technical books (on C++, WiFi, and internet architectures), and was a Computer Science lecturer in a French engineering school.

A few fun facts: he taught his three children to count in binary with their fingers (up to 1023), he studied microbiology and evolutionary genetics before going into software engineering, and his parachute didn't open on the second jump.

Colophon

The animal on the cover of Hands-On Machine Learning with Scikit-Learn and TensorFlow is the far eastern fire salamander (Salamandra infraimmaculata), an amphibian found in the Middle East. They have black skin featuring large yellow spots on their back and head. These spots are a warning coloration meant to keep predators at bay. Full-grown salamanders can be over a foot in length.

Far eastern fire salamanders live in subtropical shrubland and forests near rivers or other freshwater bodies. They spend most of their life on land, but lay their eggs in the water. They subsist mostly on a diet of insects, worms, and small crustaceans, but occasionally eat other salamanders. Males of the species have been known to live up to 23 years, while females can live up to 21 years.

Although not yet endangered, the far eastern fire salamander population is in decline. Primary threats include damming of rivers (which disrupts the salamander's breeding) and pollution. They are also threatened by the recent introduction of predatory fish, such as the mosquitofish. These fish were intended to control the mosquito population, but they also feed on young salamanders.

Many of the animals on O'Reilly covers are endangered; all of them are important to the world. To learn more about how you can help, go to animals.oreilly.com.

The cover image is from Wood's Illustrated Natural History. The cover fonts are URW Typewriter and Guardian Sans. The text font is Adobe Minion Pro; the heading font is Adobe Myriad Condensed; and the code font is Dalton Maag's Ubuntu Mono.

PrefaceTheMachineLearningTsunami

MachineLearninginYourProjects

ObjectiveandApproach

Prerequisites

Roadmap

OtherResources

ConventionsUsedinThisBook

UsingCodeExamples

O’ReillySafari

HowtoContactUs

Acknowledgments

I.TheFundamentalsofMachineLearning

1.TheMachineLearningLandscapeWhatIsMachineLearning?

WhyUseMachineLearning?

TypesofMachineLearningSystemsSupervised/UnsupervisedLearning

BatchandOnlineLearning

Instance-BasedVersusModel-BasedLearning

MainChallengesofMachineLearningInsufficientQuantityofTrainingData

NonrepresentativeTrainingData

Poor-QualityData

IrrelevantFeatures

OverfittingtheTrainingData

UnderfittingtheTrainingData

SteppingBack

TestingandValidating

Exercises

2.End-to-EndMachineLearningProjectWorkingwithRealData

LookattheBigPictureFrametheProblem

SelectaPerformanceMeasure

ChecktheAssumptions

GettheDataCreatetheWorkspace

DownloadtheData

TakeaQuickLookattheDataStructure

CreateaTestSet

DiscoverandVisualizetheDatatoGainInsightsVisualizingGeographicalData

LookingforCorrelations

ExperimentingwithAttributeCombinations

PreparetheDataforMachineLearningAlgorithmsDataCleaning

HandlingTextandCategoricalAttributes

CustomTransformers

FeatureScaling

TransformationPipelines

SelectandTrainaModelTrainingandEvaluatingontheTrainingSet

BetterEvaluationUsingCross-Validation

Fine-TuneYourModelGridSearch

RandomizedSearch

EnsembleMethods

AnalyzetheBestModelsandTheirErrors

EvaluateYourSystemontheTestSet

Launch,Monitor,andMaintainYourSystem

TryItOut!

Exercises

3.ClassificationMNIST

TrainingaBinaryClassifier

PerformanceMeasuresMeasuringAccuracyUsingCross-Validation

ConfusionMatrix

PrecisionandRecall

Precision/RecallTradeoff

TheROCCurve

MulticlassClassification

ErrorAnalysis

MultilabelClassification

MultioutputClassification

Exercises

4.TrainingModelsLinearRegression

TheNormalEquation

ComputationalComplexity

GradientDescentBatchGradientDescent

StochasticGradientDescent

Mini-batchGradientDescent

PolynomialRegression

LearningCurves

RegularizedLinearModelsRidgeRegression

LassoRegression

ElasticNet

EarlyStopping

LogisticRegressionEstimatingProbabilities

TrainingandCostFunction

DecisionBoundaries

SoftmaxRegression

Exercises

5.SupportVectorMachinesLinearSVMClassification

SoftMarginClassification

NonlinearSVMClassificationPolynomialKernel

AddingSimilarityFeatures

GaussianRBFKernel

ComputationalComplexity

SVMRegression

UndertheHoodDecisionFunctionandPredictions

TrainingObjective

QuadraticProgramming

TheDualProblem

KernelizedSVM

OnlineSVMs

Exercises

6.DecisionTreesTrainingandVisualizingaDecisionTree

MakingPredictions

EstimatingClassProbabilities

TheCARTTrainingAlgorithm

ComputationalComplexity

GiniImpurityorEntropy?

RegularizationHyperparameters

Regression

Instability

Exercises

7.EnsembleLearningandRandomForestsVotingClassifiers

BaggingandPastingBaggingandPastinginScikit-Learn

Out-of-BagEvaluation

RandomPatchesandRandomSubspaces

RandomForestsExtra-Trees

FeatureImportance

BoostingAdaBoost

GradientBoosting

Stacking

Exercises

8.DimensionalityReductionTheCurseofDimensionality

MainApproachesforDimensionalityReductionProjection

ManifoldLearning

PCAPreservingtheVariance

PrincipalComponents

ProjectingDowntodDimensions

UsingScikit-Learn

ExplainedVarianceRatio

ChoosingtheRightNumberofDimensions

PCAforCompression

IncrementalPCA

RandomizedPCA

KernelPCASelectingaKernelandTuningHyperparameters

LLE

OtherDimensionalityReductionTechniques

Exercises

II.NeuralNetworksandDeepLearning

9.UpandRunningwithTensorFlowInstallation

CreatingYourFirstGraphandRunningItinaSession

ManagingGraphs

LifecycleofaNodeValue

LinearRegressionwithTensorFlow

ImplementingGradientDescentManuallyComputingtheGradients

Usingautodiff

UsinganOptimizer

FeedingDatatotheTrainingAlgorithm

SavingandRestoringModels

VisualizingtheGraphandTrainingCurvesUsingTensorBoard

NameScopes

Modularity

SharingVariables

Exercises

10.IntroductiontoArtificialNeuralNetworksFromBiologicaltoArtificialNeurons

BiologicalNeurons

LogicalComputationswithNeurons

ThePerceptron

Multi-LayerPerceptronandBackpropagation

TraininganMLPwithTensorFlow’sHigh-LevelAPI

TrainingaDNNUsingPlainTensorFlowConstructionPhase

ExecutionPhase

UsingtheNeuralNetwork

Fine-TuningNeuralNetworkHyperparametersNumberofHiddenLayers

NumberofNeuronsperHiddenLayer

ActivationFunctions

Exercises

11.TrainingDeepNeuralNetsVanishing/ExplodingGradientsProblems

XavierandHeInitialization

NonsaturatingActivationFunctions

BatchNormalization

GradientClipping

ReusingPretrainedLayersReusingaTensorFlowModel

ReusingModelsfromOtherFrameworks

FreezingtheLowerLayers

CachingtheFrozenLayers

Tweaking,Dropping,orReplacingtheUpperLayers

ModelZoos

UnsupervisedPretraining

PretrainingonanAuxiliaryTask

FasterOptimizersMomentumOptimization

NesterovAcceleratedGradient

AdaGrad

RMSProp

AdamOptimization

LearningRateScheduling

AvoidingOverfittingThroughRegularizationEarlyStopping

ℓ1andℓ2Regularization

Dropout

Max-NormRegularization

DataAugmentation

PracticalGuidelines

Exercises

12.DistributingTensorFlowAcrossDevicesandServersMultipleDevicesonaSingleMachine

Installation

ManagingtheGPURAM

PlacingOperationsonDevices

ParallelExecution

ControlDependencies

MultipleDevicesAcrossMultipleServersOpeningaSession

TheMasterandWorkerServices

PinningOperationsAcrossTasks

ShardingVariablesAcrossMultipleParameterServers

SharingStateAcrossSessionsUsingResourceContainers

AsynchronousCommunicationUsingTensorFlowQueues

LoadingDataDirectlyfromtheGraph

ParallelizingNeuralNetworksonaTensorFlowClusterOneNeuralNetworkperDevice

In-GraphVersusBetween-GraphReplication

ModelParallelism

DataParallelism

Exercises

13.ConvolutionalNeuralNetworksTheArchitectureoftheVisualCortex

ConvolutionalLayerFilters

StackingMultipleFeatureMaps

TensorFlowImplementation

MemoryRequirements

PoolingLayer

CNNArchitecturesLeNet-5

AlexNet

GoogLeNet

ResNet

Exercises

14.RecurrentNeuralNetworksRecurrentNeurons

MemoryCells

InputandOutputSequences

BasicRNNsinTensorFlowStaticUnrollingThroughTime

DynamicUnrollingThroughTime

HandlingVariableLengthInputSequences

HandlingVariable-LengthOutputSequences

TrainingRNNs

TrainingaSequenceClassifier

TrainingtoPredictTimeSeries

CreativeRNN

DeepRNNsDistributingaDeepRNNAcrossMultipleGPUs

ApplyingDropout

TheDifficultyofTrainingoverManyTimeSteps

LSTMCellPeepholeConnections

GRUCell

NaturalLanguageProcessingWordEmbeddings

AnEncoder–DecoderNetworkforMachineTranslation

Exercises

15.AutoencodersEfficientDataRepresentations

PerformingPCAwithanUndercompleteLinearAutoencoder

StackedAutoencodersTensorFlowImplementation

TyingWeights

TrainingOneAutoencoderataTime

VisualizingtheReconstructions

VisualizingFeatures

UnsupervisedPretrainingUsingStackedAutoencoders

DenoisingAutoencodersTensorFlowImplementation

SparseAutoencoders

TensorFlowImplementation

VariationalAutoencodersGeneratingDigits

OtherAutoencoders

Exercises

16.ReinforcementLearningLearningtoOptimizeRewards

PolicySearch

IntroductiontoOpenAIGym

NeuralNetworkPolicies

EvaluatingActions:TheCreditAssignmentProblem

PolicyGradients

MarkovDecisionProcesses

TemporalDifferenceLearningandQ-LearningExplorationPolicies

ApproximateQ-Learning

LearningtoPlayMs.Pac-ManUsingDeepQ-Learning

Exercises

ThankYou!

A.ExerciseSolutionsChapter1:TheMachineLearningLandscape

Chapter2:End-to-EndMachineLearningProject

Chapter3:Classification

Chapter4:TrainingModels

Chapter5:SupportVectorMachines

Chapter6:DecisionTrees

Chapter7:EnsembleLearningandRandomForests

Chapter8:DimensionalityReduction

Chapter9:UpandRunningwithTensorFlow

Chapter10:IntroductiontoArtificialNeuralNetworks

Chapter11:TrainingDeepNeuralNets

Chapter12:DistributingTensorFlowAcrossDevicesandServers

Chapter13:ConvolutionalNeuralNetworks

Chapter14:RecurrentNeuralNetworks

Chapter15:Autoencoders

Chapter16:ReinforcementLearning

B.MachineLearningProjectChecklistFrametheProblemandLookattheBigPicture

GettheData

ExploretheData

PreparetheData

Short-ListPromisingModels

Fine-TunetheSystem

PresentYourSolution

Launch!

C.SVMDualProblem

D.AutodiffManualDifferentiation

SymbolicDifferentiation

NumericalDifferentiation

Forward-ModeAutodiff

Reverse-ModeAutodiff

E.OtherPopularANNArchitecturesHopfieldNetworks

BoltzmannMachines

RestrictedBoltzmannMachines

DeepBeliefNets

Self-OrganizingMaps

Index
