
Hands-On Machine Learning with Scikit-Learn and TensorFlow

Concepts, Tools, and Techniques to Build Intelligent Systems

Aurélien Géron

Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron

Copyright © 2017 Aurélien Géron. All rights reserved.

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: [email protected].

Editor: Nicole Tache

Production Editor: Nicholas Adams

Copyeditor: Rachel Monaghan

Proofreader: Charles Roumeliotis

Indexer: Wendy Catalano

Interior Designer: David Futato

Cover Designer: Randy Comer

Illustrator: Rebecca Demarest

March 2017: First Edition

Revision History for the First Edition

2017-03-10: First Release

2017-06-09: Second Release

See http://oreilly.com/catalog/errata.csp?isbn=9781491962299 for release details.

The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Hands-On Machine Learning with Scikit-Learn and TensorFlow, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc.

While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-491-96229-9

[LSI]

Preface

The Machine Learning Tsunami

In 2006, Geoffrey Hinton et al. published a paper[1] showing how to train a deep neural network capable of recognizing handwritten digits with state-of-the-art precision (>98%). They branded this technique “Deep Learning.” Training a deep neural net was widely considered impossible at the time,[2] and most researchers had abandoned the idea since the 1990s. This paper revived the interest of the scientific community, and before long many new papers demonstrated that Deep Learning was not only possible, but capable of mind-blowing achievements that no other Machine Learning (ML) technique could hope to match (with the help of tremendous computing power and great amounts of data). This enthusiasm soon extended to many other areas of Machine Learning.

Fast-forward 10 years and Machine Learning has conquered the industry: it is now at the heart of much of the magic in today’s high-tech products, ranking your web search results, powering your smartphone’s speech recognition, recommending videos, and beating the world champion at the game of Go. Before you know it, it will be driving your car.

Machine Learning in Your Projects

So naturally you are excited about Machine Learning and you would love to join the party!

Perhaps you would like to give your homemade robot a brain of its own? Make it recognize faces? Or learn to walk around?

Or maybe your company has tons of data (user logs, financial data, production data, machine sensor data, hotline stats, HR reports, etc.), and more than likely you could unearth some hidden gems if you just knew where to look; for example:

Segment customers and find the best marketing strategy for each group

Recommend products for each client based on what similar clients bought

Detect which transactions are likely to be fraudulent

Predict next year’s revenue

And more

Whatever the reason, you have decided to learn Machine Learning and implement it in your projects. Great idea!

Objective and Approach

This book assumes that you know close to nothing about Machine Learning. Its goal is to give you the concepts, the intuitions, and the tools you need to actually implement programs capable of learning from data.

We will cover a large number of techniques, from the simplest and most commonly used (such as linear regression) to some of the Deep Learning techniques that regularly win competitions.

Rather than implementing our own toy versions of each algorithm, we will be using actual production-ready Python frameworks:

Scikit-Learn is very easy to use, yet it implements many Machine Learning algorithms efficiently, so it makes for a great entry point to learn Machine Learning.

TensorFlow is a more complex library for distributed numerical computation using data flow graphs. It makes it possible to train and run very large neural networks efficiently by distributing the computations across potentially thousands of multi-GPU servers. TensorFlow was created at Google and supports many of their large-scale Machine Learning applications. It was open-sourced in November 2015.

The book favors a hands-on approach, growing an intuitive understanding of Machine Learning through concrete working examples and just a little bit of theory. While you can read this book without picking up your laptop, we highly recommend you experiment with the code examples available online as Jupyter notebooks at https://github.com/ageron/handson-ml.

Prerequisites

This book assumes that you have some Python programming experience and that you are familiar with Python’s main scientific libraries, in particular NumPy, Pandas, and Matplotlib.

Also, if you care about what’s under the hood you should have a reasonable understanding of college-level math as well (calculus, linear algebra, probabilities, and statistics).

If you don’t know Python yet, http://learnpython.org/ is a great place to start. The official tutorial on python.org is also quite good.

If you have never used Jupyter, Chapter 2 will guide you through installation and the basics: it is a great tool to have in your toolbox.

If you are not familiar with Python’s scientific libraries, the provided Jupyter notebooks include a few tutorials. There is also a quick math tutorial for linear algebra.

Roadmap

This book is organized in two parts. Part I, The Fundamentals of Machine Learning, covers the following topics:

What is Machine Learning? What problems does it try to solve? What are the main categories and fundamental concepts of Machine Learning systems?

The main steps in a typical Machine Learning project.

Learning by fitting a model to data.

Optimizing a cost function.

Handling, cleaning, and preparing data.

Selecting and engineering features.

Selecting a model and tuning hyperparameters using cross-validation.

The main challenges of Machine Learning, in particular underfitting and overfitting (the bias/variance tradeoff).

Reducing the dimensionality of the training data to fight the curse of dimensionality.

The most common learning algorithms: Linear and Polynomial Regression, Logistic Regression, k-Nearest Neighbors, Support Vector Machines, Decision Trees, Random Forests, and Ensemble methods.

Part II, Neural Networks and Deep Learning, covers the following topics:

What are neural nets? What are they good for?

Building and training neural nets using TensorFlow.

The most important neural net architectures: feedforward neural nets, convolutional nets, recurrent nets, long short-term memory (LSTM) nets, and autoencoders.

Techniques for training deep neural nets.

Scaling neural networks for huge datasets.

Reinforcement learning.

The first part is based mostly on Scikit-Learn, while the second part uses TensorFlow.

CAUTION: Don’t jump into deep waters too hastily: while Deep Learning is no doubt one of the most exciting areas in Machine Learning, you should master the fundamentals first. Moreover, most problems can be solved quite well using simpler techniques such as Random Forests and Ensemble methods (discussed in Part I). Deep Learning is best suited for complex problems such as image recognition, speech recognition, or natural language processing, provided you have enough data, computing power, and patience.

Other Resources

Many resources are available to learn about Machine Learning. Andrew Ng’s ML course on Coursera and Geoffrey Hinton’s course on neural networks and Deep Learning are amazing, although they both require a significant time investment (think months).

There are also many interesting websites about Machine Learning, including of course Scikit-Learn’s exceptional User Guide. You may also enjoy Dataquest, which provides very nice interactive tutorials, and ML blogs such as those listed on Quora. Finally, the Deep Learning website has a good list of resources to learn more.

Of course there are also many other introductory books about Machine Learning, in particular:

Joel Grus, Data Science from Scratch (O’Reilly). This book presents the fundamentals of Machine Learning, and implements some of the main algorithms in pure Python (from scratch, as the name suggests).

Stephen Marsland, Machine Learning: An Algorithmic Perspective (Chapman and Hall). This book is a great introduction to Machine Learning, covering a wide range of topics in depth, with code examples in Python (also from scratch, but using NumPy).

Sebastian Raschka, Python Machine Learning (Packt Publishing). Also a great introduction to Machine Learning, this book leverages Python open source libraries (Pylearn 2 and Theano).

Yaser S. Abu-Mostafa, Malik Magdon-Ismail, and Hsuan-Tien Lin, Learning from Data (AMLBook). A rather theoretical approach to ML, this book provides deep insights, in particular on the bias/variance tradeoff (see Chapter 4).

Stuart Russell and Peter Norvig, Artificial Intelligence: A Modern Approach, 3rd Edition (Pearson). This is a great (and huge) book covering an incredible amount of topics, including Machine Learning. It helps put ML into perspective.

Finally, a great way to learn is to join ML competition websites such as Kaggle.com: this will allow you to practice your skills on real-world problems, with help and insights from some of the best ML professionals out there.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic: Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width: Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements and keywords.

Constant width bold: Shows commands or other text that should be typed literally by the user.

Constant width italic: Shows text that should be replaced with user-supplied values or by values determined by context.

TIP: This element signifies a tip or suggestion.

NOTE: This element signifies a general note.

WARNING: This element indicates a warning or caution.

Using Code Examples

Supplemental material (code examples, exercises, etc.) is available for download at https://github.com/ageron/handson-ml.

This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron (O’Reilly). Copyright 2017 Aurélien Géron, 978-1-491-96229-9.”

If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at [email protected].

O’Reilly Safari

NOTE: Safari (formerly Safari Books Online) is a membership-based training and reference platform for enterprise, government, educators, and individuals.

Members have access to thousands of books, training videos, Learning Paths, interactive tutorials, and curated playlists from over 250 publishers, including O’Reilly Media, Harvard Business Review, Prentice Hall Professional, Addison-Wesley Professional, Microsoft Press, Sams, Que, Peachpit Press, Adobe, Focal Press, Cisco Press, John Wiley & Sons, Syngress, Morgan Kaufmann, IBM Redbooks, Packt, Adobe Press, FT Press, Apress, Manning, New Riders, McGraw-Hill, Jones & Bartlett, and Course Technology, among others.

For more information, please visit http://oreilly.com/safari.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

O’Reilly Media, Inc.

1005 Gravenstein Highway North

Sebastopol, CA 95472

800-998-9938 (in the United States or Canada)

707-829-0515 (international or local)

707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at http://bit.ly/hands-on-machine-learning-with-scikit-learn-and-tensorflow.

To comment or ask technical questions about this book, send email to [email protected].

For more information about our books, courses, conferences, and news, see our website at http://www.oreilly.com.

Find us on Facebook: http://facebook.com/oreilly

Follow us on Twitter: http://twitter.com/oreillymedia

Watch us on YouTube: http://www.youtube.com/oreillymedia

Acknowledgments

I would like to thank my Google colleagues, in particular the YouTube video classification team, for teaching me so much about Machine Learning. I could never have started this project without them. Special thanks to my personal ML gurus: Clément Courbet, Julien Dubois, Mathias Kende, Daniel Kitachewsky, James Pack, Alexander Pak, Anosh Raj, Vitor Sessak, Wiktor Tomczak, Ingrid von Glehn, Rich Washington, and everyone at YouTube Paris.

I am incredibly grateful to all the amazing people who took time out of their busy lives to review my book in so much detail. Thanks to Pete Warden for answering all my TensorFlow questions, reviewing Part II, providing many interesting insights, and of course for being part of the core TensorFlow team. You should definitely check out his blog! Many thanks to Lukas Biewald for his very thorough review of Part II: he left no stone unturned, tested all the code (and caught a few errors), made many great suggestions, and his enthusiasm was contagious. You should check out his blog and his cool robots! Thanks to Justin Francis, who also reviewed Part II very thoroughly, catching errors and providing great insights, in particular in Chapter 16. Check out his posts on TensorFlow!

Huge thanks as well to David Andrzejewski, who reviewed Part I and provided incredibly useful feedback, identifying unclear sections and suggesting how to improve them. Check out his website! Thanks to Grégoire Mesnil, who reviewed Part II and contributed very interesting practical advice on training neural networks. Thanks as well to Eddy Hung, Salim Sémaoune, Karim Matrah, Ingrid von Glehn, Iain Smears, and Vincent Guilbeau for reviewing Part I and making many useful suggestions. And I also wish to thank my father-in-law, Michel Tessier, former mathematics teacher and now a great translator of Anton Chekhov, for helping me iron out some of the mathematics and notations in this book and reviewing the linear algebra Jupyter notebook.

And of course, a gigantic “thank you” to my dear brother Sylvain, who reviewed every single chapter, tested every line of code, provided feedback on virtually every section, and encouraged me from the first line to the last. Love you, bro!

Many thanks as well to O’Reilly’s fantastic staff, in particular Nicole Tache, who gave me insightful feedback and was always cheerful, encouraging, and helpful. Thanks as well to Marie Beaugureau, Ben Lorica, Mike Loukides, and Laurel Ruma for believing in this project and helping me define its scope. Thanks to Matt Hacker and all of the Atlas team for answering all my technical questions regarding formatting, asciidoc, and LaTeX, and thanks to Rachel Monaghan, Nick Adams, and all of the production team for their final review and their hundreds of corrections.

Last but not least, I am infinitely grateful to my beloved wife, Emmanuelle, and to our three wonderful kids, Alexandre, Rémi, and Gabrielle, for encouraging me to work hard on this book, asking many questions (who said you can’t teach neural networks to a seven-year-old?), and even bringing me cookies and coffee. What more can one dream of?

1. Available on Hinton’s home page at http://www.cs.toronto.edu/~hinton/.

2. Despite the fact that Yann LeCun’s deep convolutional neural networks had worked well for image recognition since the 1990s (although they were not as general purpose).

Part I. The Fundamentals of Machine Learning

Chapter 1. The Machine Learning Landscape

When most people hear “Machine Learning,” they picture a robot: a dependable butler or a deadly Terminator, depending on who you ask. But Machine Learning is not just a futuristic fantasy, it’s already here. In fact, it has been around for decades in some specialized applications, such as Optical Character Recognition (OCR). But the first ML application that really became mainstream, improving the lives of hundreds of millions of people, took over the world back in the 1990s: it was the spam filter. Not exactly a self-aware Skynet, but it does technically qualify as Machine Learning (it has actually learned so well that you seldom need to flag an email as spam anymore). It was followed by hundreds of ML applications that now quietly power hundreds of products and features that you use regularly, from better recommendations to voice search.

Where does Machine Learning start and where does it end? What exactly does it mean for a machine to learn something? If I download a copy of Wikipedia, has my computer really “learned” something? Is it suddenly smarter? In this chapter we will start by clarifying what Machine Learning is and why you may want to use it.

Then, before we set out to explore the Machine Learning continent, we will take a look at the map and learn about the main regions and the most notable landmarks: supervised versus unsupervised learning, online versus batch learning, instance-based versus model-based learning. Then we will look at the workflow of a typical ML project, discuss the main challenges you may face, and cover how to evaluate and fine-tune a Machine Learning system.

This chapter introduces a lot of fundamental concepts (and jargon) that every data scientist should know by heart. It will be a high-level overview (the only chapter without much code), all rather simple, but you should make sure everything is crystal clear to you before continuing to the rest of the book. So grab a coffee and let’s get started!

TIP: If you already know all the Machine Learning basics, you may want to skip directly to Chapter 2. If you are not sure, try to answer all the questions listed at the end of the chapter before moving on.

What Is Machine Learning?

Machine Learning is the science (and art) of programming computers so they can learn from data.

Here is a slightly more general definition:

[Machine Learning is the] field of study that gives computers the ability to learn without being explicitly programmed.
(Arthur Samuel, 1959)

And a more engineering-oriented one:

A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.
(Tom Mitchell, 1997)

For example, your spam filter is a Machine Learning program that can learn to flag spam given examples of spam emails (e.g., flagged by users) and examples of regular (nonspam, also called “ham”) emails. The examples that the system uses to learn are called the training set. Each training example is called a training instance (or sample). In this case, the task T is to flag spam for new emails, the experience E is the training data, and the performance measure P needs to be defined; for example, you can use the ratio of correctly classified emails. This particular performance measure is called accuracy and it is often used in classification tasks.
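
To make the performance measure P concrete, here is a minimal sketch (not from the book; the label lists are made up) that computes accuracy as the ratio of correctly classified emails:

y_true = ["spam", "ham", "ham", "spam", "ham"]   # actual classes (invented)
y_pred = ["spam", "ham", "spam", "spam", "ham"]  # classes predicted by the filter

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)  # 0.8: 4 out of 5 emails were classified correctly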

If you just download a copy of Wikipedia, your computer has a lot more data, but it is not suddenly better at any task. Thus, it is not Machine Learning.

Why Use Machine Learning?

Consider how you would write a spam filter using traditional programming techniques (Figure 1-1):

1. First you would look at what spam typically looks like. You might notice that some words or phrases (such as “4U,” “credit card,” “free,” and “amazing”) tend to come up a lot in the subject. Perhaps you would also notice a few other patterns in the sender’s name, the email’s body, and so on.

2. You would write a detection algorithm for each of the patterns that you noticed, and your program would flag emails as spam if a number of these patterns are detected.

3. You would test your program, and repeat steps 1 and 2 until it is good enough.

Figure 1-1. The traditional approach

Since the problem is not trivial, your program will likely become a long list of complex rules — pretty hard to maintain.

In contrast, a spam filter based on Machine Learning techniques automatically learns which words and phrases are good predictors of spam by detecting unusually frequent patterns of words in the spam examples compared to the ham examples (Figure 1-2). The program is much shorter, easier to maintain, and most likely more accurate.

Figure 1-2. Machine Learning approach
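
To make this concrete, here is a hedged sketch of such a learning spam filter; this is not the book’s example, and the tiny dataset is invented. It counts word frequencies with Scikit-Learn’s CountVectorizer and trains a Naive Bayes classifier on them:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = ["4U credit card free amazing offer", "meeting agenda for tomorrow",
          "free credit 4U", "lunch tomorrow?"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = ham (invented examples)

vectorizer = CountVectorizer()        # learns which words appear in the corpus
X = vectorizer.fit_transform(emails)  # word counts per email
model = MultinomialNB().fit(X, labels)
print(model.predict(vectorizer.transform(["free offer 4U"])))  # likely [1]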

Moreover, if spammers notice that all their emails containing “4U” are blocked, they might start writing “For U” instead. A spam filter using traditional programming techniques would need to be updated to flag “For U” emails. If spammers keep working around your spam filter, you will need to keep writing new rules forever.

In contrast, a spam filter based on Machine Learning techniques automatically notices that “For U” has become unusually frequent in spam flagged by users, and it starts flagging them without your intervention (Figure 1-3).

Figure 1-3. Automatically adapting to change

Another area where Machine Learning shines is for problems that either are too complex for traditional approaches or have no known algorithm. For example, consider speech recognition: say you want to start simple and write a program capable of distinguishing the words “one” and “two.” You might notice that the word “two” starts with a high-pitch sound (“T”), so you could hardcode an algorithm that measures high-pitch sound intensity and use that to distinguish ones and twos. Obviously this technique will not scale to thousands of words spoken by millions of very different people in noisy environments and in dozens of languages. The best solution (at least today) is to write an algorithm that learns by itself, given many example recordings for each word.

Finally, Machine Learning can help humans learn (Figure 1-4): ML algorithms can be inspected to see what they have learned (although for some algorithms this can be tricky). For instance, once the spam filter has been trained on enough spam, it can easily be inspected to reveal the list of words and combinations of words that it believes are the best predictors of spam. Sometimes this will reveal unsuspected correlations or new trends, and thereby lead to a better understanding of the problem.

Applying ML techniques to dig into large amounts of data can help discover patterns that were not immediately apparent. This is called data mining.

Figure 1-4. Machine Learning can help humans learn

To summarize, Machine Learning is great for:

Problems for which existing solutions require a lot of hand-tuning or long lists of rules: one Machine Learning algorithm can often simplify code and perform better.

Complex problems for which there is no good solution at all using a traditional approach: the best Machine Learning techniques can find a solution.

Fluctuating environments: a Machine Learning system can adapt to new data.

Getting insights about complex problems and large amounts of data.

Types of Machine Learning Systems

There are so many different types of Machine Learning systems that it is useful to classify them in broad categories, based on:

Whether or not they are trained with human supervision (supervised, unsupervised, semisupervised, and Reinforcement Learning)

Whether or not they can learn incrementally on the fly (online versus batch learning)

Whether they work by simply comparing new data points to known data points, or instead detect patterns in the training data and build a predictive model, much like scientists do (instance-based versus model-based learning)

These criteria are not exclusive; you can combine them in any way you like. For example, a state-of-the-art spam filter may learn on the fly using a deep neural network model trained using examples of spam and ham; this makes it an online, model-based, supervised learning system.

Let’s look at each of these criteria a bit more closely.

Supervised/Unsupervised Learning

Machine Learning systems can be classified according to the amount and type of supervision they get during training. There are four major categories: supervised learning, unsupervised learning, semisupervised learning, and Reinforcement Learning.

Supervised learning

In supervised learning, the training data you feed to the algorithm includes the desired solutions, called labels (Figure 1-5).

Figure 1-5. A labeled training set for supervised learning (e.g., spam classification)

A typical supervised learning task is classification. The spam filter is a good example of this: it is trained with many example emails along with their class (spam or ham), and it must learn how to classify new emails.

Another typical task is to predict a target numeric value, such as the price of a car, given a set of features (mileage, age, brand, etc.) called predictors. This sort of task is called regression (Figure 1-6).[1] To train the system, you need to give it many examples of cars, including both their predictors and their labels (i.e., their prices).

NOTE: In Machine Learning an attribute is a data type (e.g., “Mileage”), while a feature has several meanings depending on the context, but generally means an attribute plus its value (e.g., “Mileage = 15,000”). Many people use the words attribute and feature interchangeably, though.

Figure 1-6. Regression

Note that some regression algorithms can be used for classification as well, and vice versa. For example, Logistic Regression is commonly used for classification, as it can output a value that corresponds to the probability of belonging to a given class (e.g., 20% chance of being spam).
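
As a small illustration (the one-feature dataset below is made up), Scikit-Learn’s LogisticRegression exposes these class probabilities through its predict_proba() method:

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0], [1], [2], [3], [4], [5]])  # e.g., number of suspicious words
y = np.array([0, 0, 0, 1, 1, 1])              # 0 = ham, 1 = spam (invented)

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[2.5]]))  # probability of each class, roughly [[0.5, 0.5]]
print(clf.predict([[2.5]]))        # the class with the highest probability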

Here are some of the most important supervised learning algorithms (covered in this book):

k-Nearest Neighbors

Linear Regression

Logistic Regression

Support Vector Machines (SVMs)

Decision Trees and Random Forests

Neural networks[2]

Unsupervised learning

In unsupervised learning, as you might guess, the training data is unlabeled (Figure 1-7). The system tries to learn without a teacher.

Figure 1-7. An unlabeled training set for unsupervised learning

Here are some of the most important unsupervised learning algorithms (we will cover dimensionality reduction in Chapter 8):

Clustering:

k-Means

Hierarchical Cluster Analysis (HCA)

Expectation Maximization

Visualization and dimensionality reduction:

Principal Component Analysis (PCA)

Kernel PCA

Locally-Linear Embedding (LLE)

t-distributed Stochastic Neighbor Embedding (t-SNE)

Association rule learning:

Apriori

Eclat

For example, say you have a lot of data about your blog’s visitors. You may want to run a clustering algorithm to try to detect groups of similar visitors (Figure 1-8). At no point do you tell the algorithm which group a visitor belongs to: it finds those connections without your help. For example, it might notice that 40% of your visitors are males who love comic books and generally read your blog in the evening, while 20% are young sci-fi lovers who visit during the weekends, and so on. If you use a hierarchical clustering algorithm, it may also subdivide each group into smaller groups. This may help you target your posts for each group.

Figure 1-8. Clustering
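
As a rough sketch of this idea (the visitor features below are fabricated), Scikit-Learn’s KMeans can assign each visitor to a group without ever being told what the groups are:

import numpy as np
from sklearn.cluster import KMeans

# Fabricated visitor features: [age, comic posts read, sci-fi posts read]
visitors = np.array([[35, 40, 2], [33, 38, 5], [19, 3, 30],
                     [22, 5, 28], [34, 42, 1], [20, 2, 33]])

kmeans = KMeans(n_clusters=2, random_state=42).fit(visitors)
print(kmeans.labels_)  # group index assigned to each visitor, e.g., [0 0 1 1 0 1]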

Visualization algorithms are also good examples of unsupervised learning algorithms: you feed them a lot of complex and unlabeled data, and they output a 2D or 3D representation of your data that can easily be plotted (Figure 1-9). These algorithms try to preserve as much structure as they can (e.g., trying to keep separate clusters in the input space from overlapping in the visualization), so you can understand how the data is organized and perhaps identify unsuspected patterns.

Figure 1-9. Example of a t-SNE visualization highlighting semantic clusters[3]

A related task is dimensionality reduction, in which the goal is to simplify the data without losing too much information. One way to do this is to merge several correlated features into one. For example, a car’s mileage may be very correlated with its age, so the dimensionality reduction algorithm will merge them into one feature that represents the car’s wear and tear. This is called feature extraction.
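
Here is a minimal sketch of this kind of feature extraction, assuming Scikit-Learn’s PCA and fabricated car data (in practice you would scale the features first, since mileage dwarfs age):

import numpy as np
from sklearn.decomposition import PCA

# Fabricated car data: mileage and age are strongly correlated
cars = np.array([[15000, 1.0], [30000, 2.1], [60000, 4.0],
                 [90000, 5.9], [120000, 8.2]])

pca = PCA(n_components=1)
wear_and_tear = pca.fit_transform(cars)  # one feature summarizing both
print(pca.explained_variance_ratio_)     # close to 1.0: little information lost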

TIP: It is often a good idea to try to reduce the dimension of your training data using a dimensionality reduction algorithm before you feed it to another Machine Learning algorithm (such as a supervised learning algorithm). It will run much faster, the data will take up less disk and memory space, and in some cases it may also perform better.

Yet another important unsupervised task is anomaly detection — for example, detecting unusual credit card transactions to prevent fraud, catching manufacturing defects, or automatically removing outliers from a dataset before feeding it to another learning algorithm. The system is trained with normal instances, and when it sees a new instance it can tell whether it looks like a normal one or whether it is likely an anomaly (see Figure 1-10).

Figure 1-10. Anomaly detection
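
One possible sketch of this, assuming Scikit-Learn’s IsolationForest (one of several anomaly detection algorithms; the transaction amounts are invented):

import numpy as np
from sklearn.ensemble import IsolationForest

normal = np.array([[50.0], [52.0], [49.0], [51.0], [48.0], [50.5]])  # normal transactions
detector = IsolationForest(random_state=42).fit(normal)
print(detector.predict([[50.0], [5000.0]]))  # likely [ 1 -1]: the second one looks anomalous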

Finally, another common unsupervised task is association rule learning, in which the goal is to dig into large amounts of data and discover interesting relations between attributes. For example, suppose you own a supermarket. Running an association rule on your sales logs may reveal that people who purchase barbecue sauce and potato chips also tend to buy steak. Thus, you may want to place these items close to each other.
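
Scikit-Learn does not cover association rule learning; as a hedged sketch, the third-party mlxtend library (not used in the book) offers an Apriori implementation. The one-hot basket data below is made up:

import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Each row is a shopping cart; True means the item was bought (invented data)
baskets = pd.DataFrame({
    "bbq_sauce":    [True, True, False, True],
    "potato_chips": [True, True, False, True],
    "steak":        [True, True, True, True],
})

itemsets = apriori(baskets, min_support=0.5, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "confidence"]])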

Semisupervised learning

Some algorithms can deal with partially labeled training data, usually a lot of unlabeled data and a little bit of labeled data. This is called semisupervised learning (Figure 1-11).

Some photo-hosting services, such as Google Photos, are good examples of this. Once you upload all your family photos to the service, it automatically recognizes that the same person A shows up in photos 1, 5, and 11, while another person B shows up in photos 2, 5, and 7. This is the unsupervised part of the algorithm (clustering). Now all the system needs is for you to tell it who these people are. Just one label per person,[4] and it is able to name everyone in every photo, which is useful for searching photos.

Figure 1-11. Semisupervised learning

Most semisupervised learning algorithms are combinations of unsupervised and supervised algorithms. For example, deep belief networks (DBNs) are based on unsupervised components called restricted Boltzmann machines (RBMs) stacked on top of one another. RBMs are trained sequentially in an unsupervised manner, and then the whole system is fine-tuned using supervised learning techniques.

Reinforcement Learning

Reinforcement Learning is a very different beast. The learning system, called an agent in this context, can observe the environment, select and perform actions, and get rewards in return (or penalties in the form of negative rewards, as in Figure 1-12). It must then learn by itself what is the best strategy, called a policy, to get the most reward over time. A policy defines what action the agent should choose when it is in a given situation.

Figure 1-12. Reinforcement Learning

For example, many robots implement Reinforcement Learning algorithms to learn how to walk. DeepMind’s AlphaGo program is also a good example of Reinforcement Learning: it made the headlines in March 2016 when it beat the world champion Lee Sedol at the game of Go. It learned its winning policy by analyzing millions of games, and then playing many games against itself. Note that learning was turned off during the games against the champion; AlphaGo was just applying the policy it had learned.

Batch and Online Learning

Another criterion used to classify Machine Learning systems is whether or not the system can learn incrementally from a stream of incoming data.

Batch learning

In batch learning, the system is incapable of learning incrementally: it must be trained using all the available data. This will generally take a lot of time and computing resources, so it is typically done offline. First the system is trained, and then it is launched into production and runs without learning anymore; it just applies what it has learned. This is called offline learning.

If you want a batch learning system to know about new data (such as a new type of spam), you need to train a new version of the system from scratch on the full dataset (not just the new data, but also the old data), then stop the old system and replace it with the new one.

Fortunately, the whole process of training, evaluating, and launching a Machine Learning system can be automated fairly easily (as shown in Figure 1-3), so even a batch learning system can adapt to change. Simply update the data and train a new version of the system from scratch as often as needed.

This solution is simple and often works fine, but training using the full set of data can take many hours, so you would typically train a new system only every 24 hours or even just weekly. If your system needs to adapt to rapidly changing data (e.g., to predict stock prices), then you need a more reactive solution.

Also, training on the full set of data requires a lot of computing resources (CPU, memory space, disk space, disk I/O, network I/O, etc.). If you have a lot of data and you automate your system to train from scratch every day, it will end up costing you a lot of money. If the amount of data is huge, it may even be impossible to use a batch learning algorithm.

Finally, if your system needs to be able to learn autonomously and it has limited resources (e.g., a smartphone application or a rover on Mars), then carrying around large amounts of training data and taking up a lot of resources to train for hours every day is a showstopper.

Fortunately, a better option in all these cases is to use algorithms that are capable of learning incrementally.

Online learning

In online learning, you train the system incrementally by feeding it data instances sequentially, either individually or by small groups called mini-batches. Each learning step is fast and cheap, so the system can learn about new data on the fly, as it arrives (see Figure 1-13).

Figure 1-13. Online learning

Online learning is great for systems that receive data as a continuous flow (e.g., stock prices) and need to adapt to change rapidly or autonomously. It is also a good option if you have limited computing resources: once an online learning system has learned about new data instances, it does not need them anymore, so you can discard them (unless you want to be able to roll back to a previous state and “replay” the data). This can save a huge amount of space.

Online learning algorithms can also be used to train systems on huge datasets that cannot fit in one machine’s main memory (this is called out-of-core learning). The algorithm loads part of the data, runs a training step on that data, and repeats the process until it has run on all of the data (see Figure 1-14).
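
Here is a minimal sketch of this incremental process, assuming Scikit-Learn’s SGDRegressor and its partial_fit() method; the stream of mini-batches is simulated with synthetic data:

import numpy as np
from sklearn.linear_model import SGDRegressor

model = SGDRegressor()
for _ in range(100):                     # simulate a stream of mini-batches
    X_batch = np.random.rand(32, 1)      # in practice: load the next chunk from disk
    y_batch = 3 * X_batch.ravel() + 4 + np.random.randn(32) * 0.1
    model.partial_fit(X_batch, y_batch)  # one fast, cheap learning step

print(model.coef_, model.intercept_)     # should drift toward [3.] and [4.]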

WARNING: This whole process is usually done offline (i.e., not on the live system), so online learning can be a confusing name. Think of it as incremental learning.

Figure 1-14. Using online learning to handle huge datasets

One important parameter of online learning systems is how fast they should adapt to changing data: this is called the learning rate. If you set a high learning rate, then your system will rapidly adapt to new data, but it will also tend to quickly forget the old data (you don’t want a spam filter to flag only the latest kinds of spam it was shown). Conversely, if you set a low learning rate, the system will have more inertia; that is, it will learn more slowly, but it will also be less sensitive to noise in the new data or to sequences of nonrepresentative data points.

A big challenge with online learning is that if bad data is fed to the system, the system’s performance will gradually decline. If we are talking about a live system, your clients will notice. For example, bad data could come from a malfunctioning sensor on a robot, or from someone spamming a search engine to try to rank high in search results. To reduce this risk, you need to monitor your system closely and promptly switch learning off (and possibly revert to a previously working state) if you detect a drop in performance. You may also want to monitor the input data and react to abnormal data (e.g., using an anomaly detection algorithm).

Instance-Based Versus Model-Based Learning

One more way to categorize Machine Learning systems is by how they generalize. Most Machine Learning tasks are about making predictions. This means that given a number of training examples, the system needs to be able to generalize to examples it has never seen before. Having a good performance measure on the training data is good, but insufficient; the true goal is to perform well on new instances.

There are two main approaches to generalization: instance-based learning and model-based learning.

Instance-based learning

Possibly the most trivial form of learning is simply to learn by heart. If you were to create a spam filter this way, it would just flag all emails that are identical to emails that have already been flagged by users — not the worst solution, but certainly not the best.

Instead of just flagging emails that are identical to known spam emails, your spam filter could be programmed to also flag emails that are very similar to known spam emails. This requires a measure of similarity between two emails. A (very basic) similarity measure between two emails could be to count the number of words they have in common. The system would flag an email as spam if it has many words in common with a known spam email.
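
A tiny sketch of that word-count similarity measure (the emails are made up):

def similarity(email_a, email_b):
    """A (very basic) similarity measure: how many words two emails share."""
    return len(set(email_a.lower().split()) & set(email_b.lower().split()))

known_spam = "win a free credit card now"
new_email = "free credit card offer"
print(similarity(new_email, known_spam))  # 3 shared words: looks spammy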

This is called instance-based learning: the system learns the examples by heart, then generalizes to new cases using a similarity measure (Figure 1-15).

Figure 1-15. Instance-based learning

Model-based learning

Another way to generalize from a set of examples is to build a model of these examples, then use that model to make predictions. This is called model-based learning (Figure 1-16).

Figure 1-16. Model-based learning

For example, suppose you want to know if money makes people happy, so you download the Better Life Index data from the OECD’s website as well as stats about GDP per capita from the IMF’s website. Then you join the tables and sort by GDP per capita. Table 1-1 shows an excerpt of what you get.

Table 1-1. Does money make people happier?

Country        GDP per capita (USD)   Life satisfaction
Hungary        12,240                 4.9
Korea          27,195                 5.8
France         37,675                 6.5
Australia      50,962                 7.3
United States  55,805                 7.2

Let’s plot the data for a few random countries (Figure 1-17).

Figure 1-17. Do you see a trend here?

There does seem to be a trend here! Although the data is noisy (i.e., partly random), it looks like life satisfaction goes up more or less linearly as the country’s GDP per capita increases. So you decide to model life satisfaction as a linear function of GDP per capita. This step is called model selection: you selected a linear model of life satisfaction with just one attribute, GDP per capita (Equation 1-1).

Equation 1-1. A simple linear model

life_satisfaction = θ0 + θ1 × GDP_per_capita

This model has two model parameters, θ0 and θ1.[5] By tweaking these parameters, you can make your model represent any linear function, as shown in Figure 1-18.

Figure 1-18. A few possible linear models

Before you can use your model, you need to define the parameter values θ0 and θ1. How can you know which values will make your model perform best? To answer this question, you need to specify a performance measure. You can either define a utility function (or fitness function) that measures how good your model is, or you can define a cost function that measures how bad it is. For linear regression problems, people typically use a cost function that measures the distance between the linear model’s predictions and the training examples; the objective is to minimize this distance.

This is where the Linear Regression algorithm comes in: you feed it your training examples and it finds the parameters that make the linear model fit best to your data. This is called training the model. In our case the algorithm finds that the optimal parameter values are θ0 = 4.85 and θ1 = 4.91 × 10^-5.

Now the model fits the training data as closely as possible (for a linear model), as you can see in Figure 1-19.

Figure 1-19. The linear model that fits the training data best

You are finally ready to run the model to make predictions. For example, say you want to know how happy Cypriots are, and the OECD data does not have the answer. Fortunately, you can use your model to make a good prediction: you look up Cyprus’s GDP per capita, find $22,587, and then apply your model and find that life satisfaction is likely to be somewhere around 4.85 + 22,587 × 4.91 × 10^-5 = 5.96.

To whet your appetite, Example 1-1 shows the Python code that loads the data, prepares it,[6] creates a scatterplot for visualization, and then trains a linear model and makes a prediction.[7]

Example 1-1. Training and running a linear model using Scikit-Learn

import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn.linear_model

# Load the data
oecd_bli = pd.read_csv("oecd_bli_2015.csv", thousands=',')
gdp_per_capita = pd.read_csv("gdp_per_capita.csv", thousands=',', delimiter='\t',
                             encoding='latin1', na_values="n/a")

# Prepare the data
country_stats = prepare_country_stats(oecd_bli, gdp_per_capita)  # helper defined in the book's online notebooks
X = np.c_[country_stats["GDP per capita"]]
y = np.c_[country_stats["Life satisfaction"]]

# Visualize the data
country_stats.plot(kind='scatter', x="GDP per capita", y='Life satisfaction')
plt.show()

# Select a linear model
model = sklearn.linear_model.LinearRegression()

# Train the model
model.fit(X, y)

# Make a prediction for Cyprus
X_new = [[22587]]  # Cyprus' GDP per capita
print(model.predict(X_new))  # outputs [[ 5.96242338]]

NOTE: If you had used an instance-based learning algorithm instead, you would have found that Slovenia has the closest GDP per capita to that of Cyprus ($20,732), and since the OECD data tells us that Slovenians’ life satisfaction is 5.7, you would have predicted a life satisfaction of 5.7 for Cyprus. If you zoom out a bit and look at the two next closest countries, you will find Portugal and Spain with life satisfactions of 5.1 and 6.5, respectively. Averaging these three values, you get 5.77, which is pretty close to your model-based prediction. This simple algorithm is called k-Nearest Neighbors regression (in this example, k = 3).

Replacing the Linear Regression model with k-Nearest Neighbors regression in the previous code is as simple as replacing this line:

model = sklearn.linear_model.LinearRegression()

with this one:

model = sklearn.neighbors.KNeighborsRegressor(n_neighbors=3)

(you will also need to import sklearn.neighbors)

If all went well, your model will make good predictions. If not, you may need to use more attributes (employment rate, health, air pollution, etc.), get more or better quality training data, or perhaps select a more powerful model (e.g., a Polynomial Regression model).

In summary:

You studied the data.

You selected a model.

You trained it on the training data (i.e., the learning algorithm searched for the model parameter values that minimize a cost function).

Finally, you applied the model to make predictions on new cases (this is called inference), hoping that this model will generalize well.

This is what a typical Machine Learning project looks like. In Chapter 2 you will experience this first-hand by going through an end-to-end project.

We have covered a lot of ground so far: you now know what Machine Learning is really about, why it is useful, what some of the most common categories of ML systems are, and what a typical project workflow looks like. Now let’s look at what can go wrong in learning and prevent you from making accurate predictions.

Main Challenges of Machine Learning

In short, since your main task is to select a learning algorithm and train it on some data, the two things that can go wrong are “bad algorithm” and “bad data.” Let’s start with examples of bad data.

Insufficient Quantity of Training Data

For a toddler to learn what an apple is, all it takes is for you to point to an apple and say “apple” (possibly repeating this procedure a few times). Now the child is able to recognize apples in all sorts of colors and shapes. Genius.

Machine Learning is not quite there yet; it takes a lot of data for most Machine Learning algorithms to work properly. Even for very simple problems you typically need thousands of examples, and for complex problems such as image or speech recognition you may need millions of examples (unless you can reuse parts of an existing model).

THE UNREASONABLE EFFECTIVENESS OF DATA

In a famous paper published in 2001, Microsoft researchers Michele Banko and Eric Brill showed that very different Machine Learning algorithms, including fairly simple ones, performed almost identically well on a complex problem of natural language disambiguation[8] once they were given enough data (as you can see in Figure 1-20).

Figure 1-20. The importance of data versus algorithms[9]

As the authors put it: “these results suggest that we may want to reconsider the trade-off between spending time and money on algorithm development versus spending it on corpus development.”

The idea that data matters more than algorithms for complex problems was further popularized by Peter Norvig et al. in a paper titled “The Unreasonable Effectiveness of Data,” published in 2009.[10] It should be noted, however, that small and medium-sized datasets are still very common, and it is not always easy or cheap to get extra training data, so don’t abandon algorithms just yet.

Nonrepresentative Training Data

In order to generalize well, it is crucial that your training data be representative of the new cases you want to generalize to. This is true whether you use instance-based learning or model-based learning.

For example, the set of countries we used earlier for training the linear model was not perfectly representative; a few countries were missing. Figure 1-21 shows what the data looks like when you add the missing countries.

Figure 1-21. A more representative training sample

If you train a linear model on this data, you get the solid line, while the old model is represented by the dotted line. As you can see, not only does adding a few missing countries significantly alter the model, but it makes it clear that such a simple linear model is probably never going to work well. It seems that very rich countries are not happier than moderately rich countries (in fact they seem unhappier), and conversely some poor countries seem happier than many rich countries.

By using a nonrepresentative training set, we trained a model that is unlikely to make accurate predictions, especially for very poor and very rich countries.

It is crucial to use a training set that is representative of the cases you want to generalize to. This is often harder than it sounds: if the sample is too small, you will have sampling noise (i.e., nonrepresentative data as a result of chance), but even very large samples can be nonrepresentative if the sampling method is flawed. This is called sampling bias.

A FAMOUS EXAMPLE OF SAMPLING BIAS

Perhaps the most famous example of sampling bias happened during the US presidential election in 1936, which pitted Landon against Roosevelt: the Literary Digest conducted a very large poll, sending mail to about 10 million people. It got 2.4 million answers, and predicted with high confidence that Landon would get 57% of the votes. Instead, Roosevelt won with 62% of the votes. The flaw was in the Literary Digest’s sampling method:

First, to obtain the addresses to send the polls to, the Literary Digest used telephone directories, lists of magazine subscribers, club membership lists, and the like. All of these lists tend to favor wealthier people, who are more likely to vote Republican (hence Landon).

Second, less than 25% of the people who received the poll answered. Again, this introduces a sampling bias, by ruling out people who don’t care much about politics, people who don’t like the Literary Digest, and other key groups. This is a special type of sampling bias called nonresponse bias.

Here is another example: say you want to build a system to recognize funk music videos. One way to build your training set is to search “funk music” on YouTube and use the resulting videos. But this assumes that YouTube’s search engine returns a set of videos that are representative of all the funk music videos on YouTube. In reality, the search results are likely to be biased toward popular artists (and if you live in Brazil you will get a lot of “funk carioca” videos, which sound nothing like James Brown). On the other hand, how else can you get a large training set?

Poor-Quality Data

Obviously, if your training data is full of errors, outliers, and noise (e.g., due to poor-quality measurements), it will make it harder for the system to detect the underlying patterns, so your system is less likely to perform well. It is often well worth the effort to spend time cleaning up your training data. The truth is, most data scientists spend a significant part of their time doing just that. For example:

If some instances are clearly outliers, it may help to simply discard them or try to fix the errors manually.

If some instances are missing a few features (e.g., 5% of your customers did not specify their age), you must decide whether you want to ignore this attribute altogether, ignore these instances, fill in the missing values (e.g., with the median age), or train one model with the feature and one model without it, and so on.

Irrelevant Features

As the saying goes: garbage in, garbage out. Your system will only be capable of learning if the training data contains enough relevant features and not too many irrelevant ones. A critical part of the success of a Machine Learning project is coming up with a good set of features to train on. This process, called feature engineering, involves:

Feature selection: selecting the most useful features to train on among existing features.

Feature extraction: combining existing features to produce a more useful one (as we saw earlier, dimensionality reduction algorithms can help).

Creating new features by gathering new data.

Now that we have looked at many examples of bad data, let’s look at a couple of examples of bad algorithms.

Overfitting the Training Data

Say you are visiting a foreign country and the taxi driver rips you off. You might be tempted to say that all taxi drivers in that country are thieves. Overgeneralizing is something that we humans do all too often, and unfortunately machines can fall into the same trap if we are not careful. In Machine Learning this is called overfitting: it means that the model performs well on the training data, but it does not generalize well.

Figure 1-22 shows an example of a high-degree polynomial life satisfaction model that strongly overfits the training data. Even though it performs much better on the training data than the simple linear model, would you really trust its predictions?

Figure 1-22. Overfitting the training data

Complex models such as deep neural networks can detect subtle patterns in the data, but if the training set is noisy, or if it is too small (which introduces sampling noise), then the model is likely to detect patterns in the noise itself. Obviously these patterns will not generalize to new instances. For example, say you feed your life satisfaction model many more attributes, including uninformative ones such as the country’s name. In that case, a complex model may detect patterns like the fact that all countries in the training data with a w in their name have a life satisfaction greater than 7: New Zealand (7.3), Norway (7.4), Sweden (7.2), and Switzerland (7.5). How confident are you that the W-satisfaction rule generalizes to Rwanda or Zimbabwe? Obviously this pattern occurred in the training data by pure chance, but the model has no way to tell whether a pattern is real or simply the result of noise in the data.
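
To see overfitting in code, here is a hedged sketch with a handful of made-up points loosely shaped like the GDP data: a degree-4 polynomial has enough parameters to fit five training points essentially perfectly, which says nothing about how it will do on new countries:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

X = np.array([[12.0], [27.0], [37.0], [50.0], [55.0]])  # made-up GDP-like values
y = np.array([4.9, 5.8, 6.5, 7.3, 7.2])                 # made-up satisfaction values

poly_model = Pipeline([
    ("poly", PolynomialFeatures(degree=4)),  # 5 parameters for 5 points
    ("lin", LinearRegression()),
])
poly_model.fit(X, y)
print(poly_model.score(X, y))  # ~1.0 on the training data: it memorized it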

WARNING: Overfitting happens when the model is too complex relative to the amount and noisiness of the training data. The possible solutions are:

To simplify the model by selecting one with fewer parameters (e.g., a linear model rather than a high-degree polynomial model), by reducing the number of attributes in the training data, or by constraining the model

To gather more training data

To reduce the noise in the training data (e.g., fix data errors and remove outliers)

Constraining a model to make it simpler and reduce the risk of overfitting is called regularization. For example, the linear model we defined earlier has two parameters, θ0 and θ1. This gives the learning algorithm two degrees of freedom to adapt the model to the training data: it can tweak both the height (θ0) and the slope (θ1) of the line. If we forced θ1 = 0, the algorithm would have only one degree of freedom and would have a much harder time fitting the data properly: all it could do is move the line up or down to get as close as possible to the training instances, so it would end up around the mean. A very simple model indeed! If we allow the algorithm to modify θ1 but we force it to keep it small, then the learning algorithm will effectively have somewhere in between one and two degrees of freedom. It will produce a simpler model than with two degrees of freedom, but more complex than with just one. You want to find the right balance between fitting the data perfectly and keeping the model simple enough to ensure that it will generalize well.
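
A minimal sketch of this idea, assuming Scikit-Learn’s Ridge regression as the regularized linear model (reusing the same kind of made-up points as above); a large regularization strength visibly shrinks the slope θ1:

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

X = np.array([[12.0], [27.0], [37.0], [50.0], [55.0]])  # made-up data
y = np.array([4.9, 5.8, 6.5, 7.3, 7.2])

unconstrained = LinearRegression().fit(X, y)
regularized = Ridge(alpha=1000.0).fit(X, y)  # alpha penalizes large slopes

print(unconstrained.coef_, regularized.coef_)  # the regularized slope is noticeably smaller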

Figure 1-23 shows three models: the dotted line represents the original model that was trained with a few countries missing, the dashed line is our second model trained with all countries, and the solid line is a linear model trained with the same data as the first model but with a regularization constraint. You can see that regularization forced the model to have a smaller slope, which fits the training data a bit less well, but allows it to generalize better to new examples.

Figure 1-23. Regularization reduces the risk of overfitting

The amount of regularization to apply during learning can be controlled by a hyperparameter. A hyperparameter is a parameter of a learning algorithm (not of the model). As such, it is not affected by the learning algorithm itself; it must be set prior to training and remains constant during training. If you set the regularization hyperparameter to a very large value, you will get an almost flat model (a slope close to zero); the learning algorithm will almost certainly not overfit the training data, but it will be less likely to find a good solution. Tuning hyperparameters is an important part of building a Machine Learning system (you will see a detailed example in the next chapter).


Underfitting the Training Data

As you might guess, underfitting is the opposite of overfitting: it occurs when your model is too simple to learn the underlying structure of the data. For example, a linear model of life satisfaction is prone to underfit; reality is just more complex than the model, so its predictions are bound to be inaccurate, even on the training examples.

The main options to fix this problem are:

Selecting a more powerful model, with more parameters

Feeding better features to the learning algorithm (feature engineering)

Reducing the constraints on the model (e.g., reducing the regularization hyperparameter)


Stepping Back

By now you already know a lot about Machine Learning. However, we went through so many concepts that you may be feeling a little lost, so let's step back and look at the big picture:

Machine Learning is about making machines get better at some task by learning from data, instead of having to explicitly code rules.

There are many different types of ML systems: supervised or not, batch or online, instance-based or model-based, and so on.

In an ML project you gather data in a training set, and you feed the training set to a learning algorithm. If the algorithm is model-based, it tunes some parameters to fit the model to the training set (i.e., to make good predictions on the training set itself), and then hopefully it will be able to make good predictions on new cases as well. If the algorithm is instance-based, it just learns the examples by heart and uses a similarity measure to generalize to new instances.

The system will not perform well if your training set is too small, or if the data is not representative, is noisy, or is polluted with irrelevant features (garbage in, garbage out). Lastly, your model needs to be neither too simple (in which case it will underfit) nor too complex (in which case it will overfit).

There's just one last important topic to cover: once you have trained a model, you don't want to just "hope" it generalizes to new cases. You want to evaluate it, and fine-tune it if necessary. Let's see how.


Testing and Validating

The only way to know how well a model will generalize to new cases is to actually try it out on new cases. One way to do that is to put your model in production and monitor how well it performs. This works well, but if your model is horribly bad, your users will complain, which is not the best idea.

A better option is to split your data into two sets: the training set and the test set. As these names imply, you train your model using the training set, and you test it using the test set. The error rate on new cases is called the generalization error (or out-of-sample error), and by evaluating your model on the test set, you get an estimate of this error. This value tells you how well your model will perform on instances it has never seen before.

If the training error is low (i.e., your model makes few mistakes on the training set) but the generalization error is high, it means that your model is overfitting the training data.

TIP
It is common to use 80% of the data for training and hold out 20% for testing.

So evaluating a model is simple enough: just use a test set. Now suppose you are hesitating between two models (say a linear model and a polynomial model): how can you decide? One option is to train both and compare how well they generalize using the test set.
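As a minimal sketch of that comparison (assuming NumPy and Scikit-Learn; the toy data below is invented for illustration and is roughly linear, so the linear model should win):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(42)
X = rng.rand(100, 1) * 10            # hypothetical feature
y = 2 * X.ravel() + rng.randn(100)   # roughly linear target with noise

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

linear = LinearRegression().fit(X_train, y_train)
poly = make_pipeline(PolynomialFeatures(degree=10),
                     LinearRegression()).fit(X_train, y_train)

# The candidate with the lower test error generalizes better on this data.
print(mean_squared_error(y_test, linear.predict(X_test)))
print(mean_squared_error(y_test, poly.predict(X_test)))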

Now suppose that the linear model generalizes better, but you want to apply some regularization to avoid overfitting. The question is: how do you choose the value of the regularization hyperparameter? One option is to train 100 different models using 100 different values for this hyperparameter. Suppose you find the best hyperparameter value, the one that produces a model with the lowest generalization error, say just 5% error.

So you launch this model into production, but unfortunately it does not perform as well as expected and produces 15% errors. What just happened?

The problem is that you measured the generalization error multiple times on the test set, and you adapted the model and hyperparameters to produce the best model for that particular set. This means that the model is unlikely to perform as well on new data.

A common solution to this problem is to have a second holdout set called the validation set. You train multiple models with various hyperparameters using the training set, you select the model and hyperparameters that perform best on the validation set, and when you're happy with your model you run a single final test against the test set to get an estimate of the generalization error.

To avoid "wasting" too much training data in validation sets, a common technique is to use cross-validation: the training set is split into complementary subsets, and each model is trained against a different combination of these subsets and validated against the remaining parts. Once the model type and hyperparameters have been selected, a final model is trained using these hyperparameters on the full training set, and the generalization error is measured on the test set.
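As a minimal sketch of this technique (assuming NumPy and Scikit-Learn; the toy data and the Ridge model are placeholders for your own training set and candidate models):

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(42)
X_train = rng.rand(100, 1) * 10                 # hypothetical training features
y_train = 2 * X_train.ravel() + rng.randn(100)  # hypothetical labels

for alpha in (0.1, 1.0, 10.0):  # candidate hyperparameter values
    model = Ridge(alpha=alpha)
    # 5-fold cross-validation: train on 4 folds, validate on the 5th, rotate.
    scores = cross_val_score(model, X_train, y_train, cv=5,
                             scoring="neg_mean_squared_error")
    print(alpha, -scores.mean())  # average validation error for this alpha

Once the best value is selected this way, you would retrain on the full training set and only then touch the test set.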


NO FREE LUNCH THEOREM

A model is a simplified version of the observations. The simplifications are meant to discard the superfluous details that are unlikely to generalize to new instances. However, to decide what data to discard and what data to keep, you must make assumptions. For example, a linear model makes the assumption that the data is fundamentally linear and that the distance between the instances and the straight line is just noise, which can safely be ignored.

In a famous 1996 paper,11 David Wolpert demonstrated that if you make absolutely no assumption about the data, then there is no reason to prefer one model over any other. This is called the No Free Lunch (NFL) theorem. For some datasets the best model is a linear model, while for other datasets it is a neural network. There is no model that is a priori guaranteed to work better (hence the name of the theorem). The only way to know for sure which model is best is to evaluate them all. Since this is not possible, in practice you make some reasonable assumptions about the data and you evaluate only a few reasonable models. For example, for simple tasks you may evaluate linear models with various levels of regularization, and for a complex problem you may evaluate various neural networks.


Exercises

In this chapter we have covered some of the most important concepts in Machine Learning. In the next chapters we will dive deeper and write more code, but before we do, make sure you know how to answer the following questions:

1. How would you define Machine Learning?

2. Can you name four types of problems where it shines?

3. What is a labeled training set?

4. What are the two most common supervised tasks?

5. Can you name four common unsupervised tasks?

6. What type of Machine Learning algorithm would you use to allow a robot to walk in various unknown terrains?

7. What type of algorithm would you use to segment your customers into multiple groups?

8. Would you frame the problem of spam detection as a supervised learning problem or an unsupervised learning problem?

9. What is an online learning system?

10. What is out-of-core learning?

11. What type of learning algorithm relies on a similarity measure to make predictions?

12. What is the difference between a model parameter and a learning algorithm's hyperparameter?

13. What do model-based learning algorithms search for? What is the most common strategy they use to succeed? How do they make predictions?

14. Can you name four of the main challenges in Machine Learning?

15. If your model performs great on the training data but generalizes poorly to new instances, what is happening? Can you name three possible solutions?

16. What is a test set and why would you want to use it?

17. What is the purpose of a validation set?

18. What can go wrong if you tune hyperparameters using the test set?

19. What is cross-validation and why would you prefer it to a validation set?


Solutions to these exercises are available in Appendix A.

1. Fun fact: this odd-sounding name is a statistics term introduced by Francis Galton while he was studying the fact that the children of tall people tend to be shorter than their parents. Since children were shorter, he called this regression to the mean. This name was then applied to the methods he used to analyze correlations between variables.

2. Some neural network architectures can be unsupervised, such as autoencoders and restricted Boltzmann machines. They can also be semisupervised, such as in deep belief networks and unsupervised pretraining.

3. Notice how animals are rather well separated from vehicles, how horses are close to deer but far from birds, and so on. Figure reproduced with permission from Socher, Ganjoo, Manning, and Ng (2013), "T-SNE visualization of the semantic word space."

4. That's when the system works perfectly. In practice it often creates a few clusters per person, and sometimes mixes up two people who look alike, so you need to provide a few labels per person and manually clean up some clusters.

5. By convention, the Greek letter θ (theta) is frequently used to represent model parameters.

6. The code assumes that prepare_country_stats() is already defined: it merges the GDP and life satisfaction data into a single Pandas dataframe.

7. It's okay if you don't understand all the code yet; we will present Scikit-Learn in the following chapters.

8. For example, knowing whether to write "to," "two," or "too" depending on the context.

9. Figure reproduced with permission from Banko and Brill (2001), "Learning Curves for Confusion Set Disambiguation."

10. "The Unreasonable Effectiveness of Data," Peter Norvig et al. (2009).

11. "The Lack of A Priori Distinctions Between Learning Algorithms," D. Wolpert (1996).



Chapter 2. End-to-End Machine Learning Project

In this chapter, you will go through an example project end to end, pretending to be a recently hired data scientist in a real estate company.1 Here are the main steps you will go through:

1. Look at the big picture.

2. Get the data.

3. Discover and visualize the data to gain insights.

4. Prepare the data for Machine Learning algorithms.

5. Select a model and train it.

6. Fine-tune your model.

7. Present your solution.

8. Launch, monitor, and maintain your system.


Working with Real Data

When you are learning about Machine Learning it is best to actually experiment with real-world data, not just artificial datasets. Fortunately, there are thousands of open datasets to choose from, ranging across all sorts of domains. Here are a few places you can look to get data:

Popular open data repositories:

UC Irvine Machine Learning Repository

Kaggle datasets

Amazon's AWS datasets

Meta portals (they list open data repositories):

http://dataportals.org/

http://opendatamonitor.eu/

http://quandl.com/

Other pages listing many popular open data repositories:

Wikipedia's list of Machine Learning datasets

Quora.com question

Datasets subreddit

In this chapter we chose the California Housing Prices dataset from the StatLib repository2 (see Figure 2-1). This dataset was based on data from the 1990 California census. It is not exactly recent (you could still afford a nice house in the Bay Area at the time), but it has many qualities for learning, so we will pretend it is recent data. We also added a categorical attribute and removed a few features for teaching purposes.

Figure 2-1. California housing prices

Look at the Big Picture

Welcome to Machine Learning Housing Corporation! The first task you are asked to perform is to build a model of housing prices in California using the California census data. This data has metrics such as the population, median income, median housing price, and so on for each block group in California. Block groups are the smallest geographical unit for which the US Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people). We will just call them "districts" for short.

Your model should learn from this data and be able to predict the median housing price in any district, given all the other metrics.

TIP
Since you are a well-organized data scientist, the first thing you do is to pull out your Machine Learning project checklist. You can start with the one in Appendix B; it should work reasonably well for most Machine Learning projects, but make sure to adapt it to your needs. In this chapter we will go through many checklist items, but we will also skip a few, either because they are self-explanatory or because they will be discussed in later chapters.

Page 60: index-of.esindex-of.es/Varios-2/Hands on Machine Learning with... · Revision History for the First Edition 2017-03-10: First Release 2017-06-09: Second Release See for release

Frame the Problem

The first question to ask your boss is what exactly the business objective is; building a model is probably not the end goal. How does the company expect to use and benefit from this model? This is important because it will determine how you frame the problem, what algorithms you will select, what performance measure you will use to evaluate your model, and how much effort you should spend tweaking it.

Your boss answers that your model's output (a prediction of a district's median housing price) will be fed to another Machine Learning system (see Figure 2-2), along with many other signals.3 This downstream system will determine whether it is worth investing in a given area or not. Getting this right is critical, as it directly affects revenue.

Figure 2-2. A Machine Learning pipeline for real estate investments

PIPELINES

A sequence of data processing components is called a data pipeline. Pipelines are very common in Machine Learning systems, since there is a lot of data to manipulate and many data transformations to apply.

Components typically run asynchronously. Each component pulls in a large amount of data, processes it, and spits out the result in another data store, and then some time later the next component in the pipeline pulls this data and spits out its own output, and so on. Each component is fairly self-contained: the interface between components is simply the data store. This makes the system quite simple to grasp (with the help of a data flow graph), and different teams can focus on different components. Moreover, if a component breaks down, the downstream components can often continue to run normally (at least for a while) by just using the last output from the broken component. This makes the architecture quite robust.

On the other hand, a broken component can go unnoticed for some time if proper monitoring is not implemented. The data gets stale and the overall system's performance drops.
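To make the data-store interface concrete, here is a minimal sketch (entirely hypothetical: the file-based store, the component names, and the toy records are ours, not part of this chapter's project). Each component reads its input from the shared store and writes its output back for the next component to pick up later:

import json
import pathlib

STORE = pathlib.Path("store")  # hypothetical shared data store
STORE.mkdir(exist_ok=True)

def ingestion_component():
    # Pulls in raw data (hardcoded here) and spits it out to the store.
    raw = [{"district": "A", "price": 156400}, {"district": "B", "price": 158400}]
    (STORE / "raw.json").write_text(json.dumps(raw))

def aggregation_component():
    # May run much later; its only interface with the previous
    # component is the data store.
    raw = json.loads((STORE / "raw.json").read_text())
    mean_price = sum(d["price"] for d in raw) / len(raw)
    (STORE / "mean_price.json").write_text(json.dumps(mean_price))

ingestion_component()
aggregation_component()
print(json.loads((STORE / "mean_price.json").read_text()))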

The next question to ask is what the current solution looks like (if any). It will often give you a reference performance, as well as insights on how to solve the problem. Your boss answers that the district housing prices are currently estimated manually by experts: a team gathers up-to-date information about a district, and when they cannot get the median housing price, they estimate it using complex rules.

This is costly and time-consuming, and their estimates are not great; in cases where they manage to find out the actual median housing price, they often realize that their estimates were off by more than 10%. This is why the company thinks that it would be useful to train a model to predict a district's median housing price given other data about that district. The census data looks like a great dataset to exploit for this purpose, since it includes the median housing prices of thousands of districts, as well as other data.

Okay, with all this information you are now ready to start designing your system. First, you need to frame the problem: is it supervised, unsupervised, or Reinforcement Learning? Is it a classification task, a regression task, or something else? Should you use batch learning or online learning techniques? Before you read on, pause and try to answer these questions for yourself.

Have you found the answers? Let's see: it is clearly a typical supervised learning task, since you are given labeled training examples (each instance comes with the expected output, i.e., the district's median housing price). Moreover, it is also a typical regression task, since you are asked to predict a value. More specifically, this is a multivariate regression problem, since the system will use multiple features to make a prediction (it will use the district's population, the median income, etc.). In the first chapter, you predicted life satisfaction based on just one feature, the GDP per capita, so it was a univariate regression problem. Finally, there is no continuous flow of data coming into the system, there is no particular need to adjust to changing data rapidly, and the data is small enough to fit in memory, so plain batch learning should do just fine.

TIP
If the data were huge, you could either split your batch learning work across multiple servers (using the MapReduce technique, as we will see later), or you could use an online learning technique instead.


Select a Performance Measure

Your next step is to select a performance measure. A typical performance measure for regression problems is the Root Mean Square Error (RMSE). It gives an idea of how much error the system typically makes in its predictions, with a higher weight for large errors. Equation 2-1 shows the mathematical formula to compute the RMSE.

Equation 2-1. Root Mean Square Error (RMSE)

$$\mathrm{RMSE}(\mathbf{X}, h) = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \left( h\left(\mathbf{x}^{(i)}\right) - y^{(i)} \right)^2}$$

NOTATIONS

This equation introduces several very common Machine Learning notations that we will use throughout this book:

m is the number of instances in the dataset you are measuring the RMSE on.

For example, if you are evaluating the RMSE on a validation set of 2,000 districts, then m = 2,000.

x(i) is a vector of all the feature values (excluding the label) of the ith instance in the dataset, and y(i) is its label (the desired output value for that instance).

For example, if the first district in the dataset is located at longitude –118.29°, latitude 33.91°, and it has 1,416 inhabitants with a median income of $38,372, and the median house value is $156,400 (ignoring the other features for now), then:

$$\mathbf{x}^{(1)} = \begin{pmatrix} -118.29 \\ 33.91 \\ 1{,}416 \\ 38{,}372 \end{pmatrix}$$

and:

$$y^{(1)} = 156{,}400$$

X is a matrix containing all the feature values (excluding labels) of all instances in the dataset. There is one row per instance, and the ith row is equal to the transpose of x(i), noted (x(i))T.4

For example, if the first district is as just described, then the matrix X looks like this:

$$\mathbf{X} = \begin{pmatrix} (\mathbf{x}^{(1)})^T \\ (\mathbf{x}^{(2)})^T \\ \vdots \end{pmatrix} = \begin{pmatrix} -118.29 & 33.91 & 1{,}416 & 38{,}372 \\ \vdots & \vdots & \vdots & \vdots \end{pmatrix}$$

h is your system's prediction function, also called a hypothesis. When your system is given an instance's feature vector x(i), it outputs a predicted value ŷ(i) = h(x(i)) for that instance (ŷ is pronounced "y-hat").

For example, if your system predicts that the median housing price in the first district is $158,400, then ŷ(1) = h(x(1)) = 158,400. The prediction error for this district is ŷ(1) – y(1) = 2,000.

RMSE(X, h) is the cost function measured on the set of examples using your hypothesis h.

We use lowercase italic font for scalar values (such as m or y(i)) and function names (such as h), lowercase bold font for vectors (such as x(i)), and uppercase bold font for matrices (such as X).

Even though the RMSE is generally the preferred performance measure for regression tasks, in some contexts you may prefer to use another function. For example, suppose that there are many outlier districts. In that case, you may consider using the Mean Absolute Error (also called the Average Absolute Deviation; see Equation 2-2):

Equation 2-2. Mean Absolute Error

$$\mathrm{MAE}(\mathbf{X}, h) = \frac{1}{m} \sum_{i=1}^{m} \left| h\left(\mathbf{x}^{(i)}\right) - y^{(i)} \right|$$

Both the RMSE and the MAE are ways to measure the distance between two vectors: the vector of predictions and the vector of target values. Various distance measures, or norms, are possible:

Computing the root of a sum of squares (RMSE) corresponds to the Euclidean norm: it is the notion of distance you are familiar with. It is also called the ℓ2 norm, noted ∥·∥2 (or just ∥·∥).

Computing the sum of absolutes (MAE) corresponds to the ℓ1 norm, noted ∥·∥1. It is sometimes called the Manhattan norm because it measures the distance between two points in a city if you can only travel along orthogonal city blocks.

More generally, the ℓk norm of a vector v containing n elements is defined as $\|\mathbf{v}\|_k = \left( |v_1|^k + |v_2|^k + \cdots + |v_n|^k \right)^{1/k}$. ℓ0 just gives the number of non-zero elements in the vector, and ℓ∞ gives the maximum absolute value in the vector.

The higher the norm index, the more it focuses on large values and neglects small ones. This is why the RMSE is more sensitive to outliers than the MAE. But when outliers are exponentially rare (like in a bell-shaped curve), the RMSE performs very well and is generally preferred.
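As a minimal sketch (assuming NumPy; the prediction and target vectors below are invented for illustration), both measures are just scaled norms of the error vector:

import numpy as np

y_true = np.array([156400., 158000., 99700., 240200.])   # hypothetical targets
y_pred = np.array([158400., 155000., 101000., 230000.])  # hypothetical predictions
error = y_pred - y_true
m = len(error)

rmse = np.sqrt(np.mean(error ** 2))  # based on the l2 norm
mae = np.mean(np.abs(error))         # based on the l1 norm

# Equivalent formulations via vector norms:
assert np.isclose(rmse, np.linalg.norm(error, 2) / np.sqrt(m))
assert np.isclose(mae, np.linalg.norm(error, 1) / m)
print(rmse, mae)  # RMSE >= MAE, with the gap growing as outliers grow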


Check the Assumptions

Lastly, it is good practice to list and verify the assumptions that were made so far (by you or others); this can catch serious issues early on. For example, the district prices that your system outputs are going to be fed into a downstream Machine Learning system, and we assume that these prices are going to be used as such. But what if the downstream system actually converts the prices into categories (e.g., "cheap," "medium," or "expensive") and then uses those categories instead of the prices themselves? In this case, getting the price perfectly right is not important at all; your system just needs to get the category right. If that's so, then the problem should have been framed as a classification task, not a regression task. You don't want to find this out after working on a regression system for months.

Fortunately, after talking with the team in charge of the downstream system, you are confident that they do indeed need the actual prices, not just categories. Great! You're all set, the lights are green, and you can start coding now!

Get the Data

It's time to get your hands dirty. Don't hesitate to pick up your laptop and walk through the following code examples in a Jupyter notebook. The full Jupyter notebook is available at https://github.com/ageron/handson-ml.

Create the Workspace

First you will need to have Python installed. It is probably already installed on your system. If not, you can get it at https://www.python.org/.5

Next you need to create a workspace directory for your Machine Learning code and datasets. Open a terminal and type the following commands (after the $ prompts):

$ export ML_PATH="$HOME/ml"     # You can change the path if you prefer
$ mkdir -p $ML_PATH

You will need a number of Python modules: Jupyter, NumPy, Pandas, Matplotlib, and Scikit-Learn. If you already have Jupyter running with all these modules installed, you can safely skip to "Download the Data". If you don't have them yet, there are many ways to install them (and their dependencies). You can use your system's packaging system (e.g., apt-get on Ubuntu, or MacPorts or Homebrew on macOS), install a Scientific Python distribution such as Anaconda and use its packaging system, or just use Python's own packaging system, pip, which is included by default with the Python binary installers (since Python 2.7.9).6 You can check to see if pip is installed by typing the following command:

$ pip3 --version
pip 9.0.1 from [...]/lib/python3.5/site-packages (python 3.5)

You should make sure you have a recent version of pip installed, at the very least >1.4 to support binary module installation (a.k.a. wheels). To upgrade the pip module, type:7

$ pip3 install --upgrade pip
Collecting pip
[...]
Successfully installed pip-9.0.1


CREATING AN ISOLATED ENVIRONMENT

If you would like to work in an isolated environment (which is strongly recommended so you can work on different projects without having conflicting library versions), install virtualenv by running the following pip command:

$ pip3 install --user --upgrade virtualenv
Collecting virtualenv
[...]
Successfully installed virtualenv

Now you can create an isolated Python environment by typing:

$ cd $ML_PATH
$ virtualenv env
Using base prefix '[...]'
New python executable in [...]/ml/env/bin/python3.5
Also creating executable in [...]/ml/env/bin/python
Installing setuptools, pip, wheel...done.

Now every time you want to activate this environment, just open a terminal and type:

$ cd $ML_PATH
$ source env/bin/activate

While the environment is active, any package you install using pip will be installed in this isolated environment, and Python will only have access to these packages (if you also want access to the system's site packages, you should create the environment using virtualenv's --system-site-packages option). Check out virtualenv's documentation for more information.

Now you can install all the required modules and their dependencies using this simple pip command:

$ pip3 install --upgrade jupyter matplotlib numpy pandas scipy scikit-learn
Collecting jupyter
Downloading jupyter-1.0.0-py2.py3-none-any.whl
Collecting matplotlib
[...]

To check your installation, try to import every module like this:

$ python3 -c "import jupyter, matplotlib, numpy, pandas, scipy, sklearn"

There should be no output and no error. Now you can fire up Jupyter by typing:

$ jupyter notebook
[I 15:24 NotebookApp] Serving notebooks from local directory: [...]/ml
[I 15:24 NotebookApp] 0 active kernels
[I 15:24 NotebookApp] The Jupyter Notebook is running at: http://localhost:8888/
[I 15:24 NotebookApp] Use Control-C to stop this server and shut down all
kernels (twice to skip confirmation).

A Jupyter server is now running in your terminal, listening to port 8888. You can visit this server by opening your web browser to http://localhost:8888/ (this usually happens automatically when the server starts). You should see your empty workspace directory (containing only the env directory if you followed the preceding virtualenv instructions).

Now create a new Python notebook by clicking on the New button and selecting the appropriate Python version8 (see Figure 2-3).

This does three things: first, it creates a new notebook file called Untitled.ipynb in your workspace; second, it starts a Jupyter Python kernel to run this notebook; and third, it opens this notebook in a new tab. You should start by renaming this notebook to "Housing" (this will automatically rename the file to Housing.ipynb) by clicking Untitled and typing the new name.

Figure 2-3. Your workspace in Jupyter

A notebook contains a list of cells. Each cell can contain executable code or formatted text. Right now the notebook contains only one empty code cell, labeled "In [1]:". Try typing print("Hello world!") in the cell, and click on the play button (see Figure 2-4) or press Shift-Enter. This sends the current cell to this notebook's Python kernel, which runs it and returns the output. The result is displayed below the cell, and since we reached the end of the notebook, a new cell is automatically created. Go through the User Interface Tour from Jupyter's Help menu to learn the basics.

Figure 2-4. Hello world Python notebook


Download the Data

In typical environments your data would be available in a relational database (or some other common data store) and spread across multiple tables/documents/files. To access it, you would first need to get your credentials and access authorizations,9 and familiarize yourself with the data schema. In this project, however, things are much simpler: you will just download a single compressed file, housing.tgz, which contains a comma-separated value (CSV) file called housing.csv with all the data.

You could use your web browser to download it, and run tar xzf housing.tgz to decompress the file and extract the CSV file, but it is preferable to create a small function to do that. It is useful in particular if the data changes regularly, as it allows you to write a small script that you can run whenever you need to fetch the latest data (or you can set up a scheduled job to do that automatically at regular intervals). Automating the process of fetching the data is also useful if you need to install the dataset on multiple machines.

Here is the function to fetch the data:10

import os
import tarfile
from six.moves import urllib

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    if not os.path.isdir(housing_path):
        os.makedirs(housing_path)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path=housing_path)
    housing_tgz.close()

Now when you call fetch_housing_data(), it creates a datasets/housing directory in your workspace, downloads the housing.tgz file, and extracts the housing.csv from it in this directory.

Now let's load the data using Pandas. Once again you should write a small function to load the data:

import pandas as pd

def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

This function returns a Pandas DataFrame object containing all the data.

Take a Quick Look at the Data Structure

Let's take a look at the top five rows using the DataFrame's head() method (see Figure 2-5).

Figure 2-5. Top five rows in the dataset

Each row represents one district. There are 10 attributes (you can see the first 6 in the screenshot): longitude, latitude, housing_median_age, total_rooms, total_bedrooms, population, households, median_income, median_house_value, and ocean_proximity.

The info() method is useful to get a quick description of the data, in particular the total number of rows, and each attribute's type and number of non-null values (see Figure 2-6).

Figure 2-6. Housing info

There are 20,640 instances in the dataset, which means that it is fairly small by Machine Learning standards, but it's perfect to get started. Notice that the total_bedrooms attribute has only 20,433 non-null values, meaning that 207 districts are missing this feature. We will need to take care of this later.

All attributes are numerical, except the ocean_proximity field. Its type is object, so it could hold any kind of Python object, but since you loaded this data from a CSV file you know that it must be a text attribute. When you looked at the top five rows, you probably noticed that the values in the ocean_proximity column were repetitive, which means that it is probably a categorical attribute. You can find out what categories exist and how many districts belong to each category by using the value_counts() method:

>>> housing["ocean_proximity"].value_counts()
<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: ocean_proximity, dtype: int64

Let’slookattheotherfields.Thedescribe()methodshowsasummaryofthenumericalattributes(Figure2-7).

Figure2-7.Summaryofeachnumericalattribute

Thecount,mean,min,andmaxrowsareself-explanatory.Notethatthenullvaluesareignored(so,forexample,countoftotal_bedroomsis20,433,not20,640).Thestdrowshowsthestandarddeviation,whichmeasureshowdispersedthevaluesare.11The25%,50%,and75%rowsshowthecorrespondingpercentiles:apercentileindicatesthevaluebelowwhichagivenpercentageofobservationsinagroupofobservationsfalls.Forexample,25%ofthedistrictshaveahousing_median_agelowerthan18,while50%arelowerthan29and75%arelowerthan37.Theseareoftencalledthe25thpercentile(or1stquartile),themedian,andthe75thpercentile(or3rdquartile).

Anotherquickwaytogetafeelofthetypeofdatayouaredealingwithistoplotahistogramforeachnumericalattribute.Ahistogramshowsthenumberofinstances(ontheverticalaxis)thathaveagivenvaluerange(onthehorizontalaxis).Youcaneitherplotthisoneattributeatatime,oryoucancallthehist()methodonthewholedataset,anditwillplotahistogramforeachnumericalattribute(seeFigure2-8).Forexample,youcanseethatslightlyover800districtshaveamedian_house_valueequaltoabout$100,000.

%matplotlib inline   # only in a Jupyter notebook
import matplotlib.pyplot as plt
housing.hist(bins=50, figsize=(20,15))
plt.show()


NOTE
The hist() method relies on Matplotlib, which in turn relies on a user-specified graphical backend to draw on your screen. So before you can plot anything, you need to specify which backend Matplotlib should use. The simplest option is to use Jupyter's magic command %matplotlib inline. This tells Jupyter to set up Matplotlib so it uses Jupyter's own backend. Plots are then rendered within the notebook itself. Note that calling show() is optional in a Jupyter notebook, as Jupyter will automatically display plots when a cell is executed.

Figure 2-8. A histogram for each numerical attribute

Notice a few things in these histograms:

1. First, the median income attribute does not look like it is expressed in US dollars (USD). After checking with the team that collected the data, you are told that the data has been scaled and capped at 15 (actually 15.0001) for higher median incomes, and at 0.5 (actually 0.4999) for lower median incomes. Working with preprocessed attributes is common in Machine Learning, and it is not necessarily a problem, but you should try to understand how the data was computed.

2. The housing median age and the median house value were also capped. The latter may be a serious problem since it is your target attribute (your labels). Your Machine Learning algorithms may learn that prices never go beyond that limit. You need to check with your client team (the team that will use your system's output) to see if this is a problem or not. If they tell you that they need precise predictions even beyond $500,000, then you have mainly two options:

a. Collect proper labels for the districts whose labels were capped.

b. Remove those districts from the training set (and also from the test set, since your system should not be evaluated poorly if it predicts values beyond $500,000).


3. These attributes have very different scales. We will discuss this later in this chapter when we explore feature scaling.

4. Finally, many histograms are tail heavy: they extend much farther to the right of the median than to the left. This may make it a bit harder for some Machine Learning algorithms to detect patterns. We will try transforming these attributes later on to have more bell-shaped distributions.

Hopefully you now have a better understanding of the kind of data you are dealing with.

WARNING
Wait! Before you look at the data any further, you need to create a test set, put it aside, and never look at it.


Create a Test Set

It may sound strange to voluntarily set aside part of the data at this stage. After all, you have only taken a quick glance at the data, and surely you should learn a whole lot more about it before you decide what algorithms to use, right? This is true, but your brain is an amazing pattern detection system, which means that it is highly prone to overfitting: if you look at the test set, you may stumble upon some seemingly interesting pattern in the test data that leads you to select a particular kind of Machine Learning model. When you estimate the generalization error using the test set, your estimate will be too optimistic and you will launch a system that will not perform as well as expected. This is called data snooping bias.

Creating a test set is theoretically quite simple: just pick some instances randomly, typically 20% of the dataset, and set them aside:

import numpy as np

def split_train_test(data, test_ratio):
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]

You can then use this function like this:

>>> train_set, test_set = split_train_test(housing, 0.2)
>>> print(len(train_set), "train +", len(test_set), "test")
16512 train + 4128 test

Well, this works, but it is not perfect: if you run the program again, it will generate a different test set! Over time, you (or your Machine Learning algorithms) will get to see the whole dataset, which is what you want to avoid.

One solution is to save the test set on the first run and then load it in subsequent runs. Another option is to set the random number generator's seed (e.g., np.random.seed(42))12 before calling np.random.permutation(), so that it always generates the same shuffled indices.

But both these solutions will break the next time you fetch an updated dataset. A common solution is to use each instance's identifier to decide whether or not it should go in the test set (assuming instances have a unique and immutable identifier). For example, you could compute a hash of each instance's identifier, keep only the last byte of the hash, and put the instance in the test set if this value is lower than or equal to 51 (~20% of 256). This ensures that the test set will remain consistent across multiple runs, even if you refresh the dataset. The new test set will contain 20% of the new instances, but it will not contain any instance that was previously in the training set. Here is a possible implementation:

import hashlib

def test_set_check(identifier, test_ratio, hash):
    return hash(np.int64(identifier)).digest()[-1] < 256 * test_ratio

def split_train_test_by_id(data, test_ratio, id_column, hash=hashlib.md5):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio, hash))
    return data.loc[~in_test_set], data.loc[in_test_set]


Unfortunately, the housing dataset does not have an identifier column. The simplest solution is to use the row index as the ID:

housing_with_id = housing.reset_index()   # adds an `index` column
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "index")

If you use the row index as a unique identifier, you need to make sure that new data gets appended to the end of the dataset, and no row ever gets deleted. If this is not possible, then you can try to use the most stable features to build a unique identifier. For example, a district's latitude and longitude are guaranteed to be stable for a few million years, so you could combine them into an ID like so:13

housing_with_id["id"] = housing["longitude"] * 1000 + housing["latitude"]
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "id")

Scikit-Learn provides a few functions to split datasets into multiple subsets in various ways. The simplest function is train_test_split, which does pretty much the same thing as the function split_train_test defined earlier, with a couple of additional features. First there is a random_state parameter that allows you to set the random generator seed as explained previously, and second you can pass it multiple datasets with an identical number of rows, and it will split them on the same indices (this is very useful, for example, if you have a separate DataFrame for labels):

from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

So far we have considered purely random sampling methods. This is generally fine if your dataset is large enough (especially relative to the number of attributes), but if it is not, you run the risk of introducing a significant sampling bias. When a survey company decides to call 1,000 people to ask them a few questions, they don't just pick 1,000 people randomly in a phone booth. They try to ensure that these 1,000 people are representative of the whole population. For example, the US population is composed of 51.3% female and 48.7% male, so a well-conducted survey in the US would try to maintain this ratio in the sample: 513 female and 487 male. This is called stratified sampling: the population is divided into homogeneous subgroups called strata, and the right number of instances is sampled from each stratum to guarantee that the test set is representative of the overall population. If they used purely random sampling, there would be about a 12% chance of sampling a skewed test set with either less than 49% female or more than 54% female. Either way, the survey results would be significantly biased.
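As a quick sanity check of that figure (a minimal sketch, assuming SciPy; the parameters are the ones from the paragraph above), you can compute the probability of such a skewed sample with a binomial distribution:

from scipy.stats import binom

n, p = 1000, 0.513  # sample size and true proportion of females
# P(fewer than 490 females) + P(more than 540 females) in a random sample
skewed = binom.cdf(489, n, p) + (1 - binom.cdf(540, n, p))
print(skewed)  # roughly 0.11, close to the ~12% chance mentioned above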

Suppose you chatted with experts who told you that the median income is a very important attribute to predict median housing prices. You may want to ensure that the test set is representative of the various categories of incomes in the whole dataset. Since the median income is a continuous numerical attribute, you first need to create an income category attribute. Let's look at the median income histogram more closely (see Figure 2-8): most median income values are clustered around $20,000–$50,000, but some median incomes go far beyond $60,000. It is important to have a sufficient number of instances in your dataset for each stratum, or else the estimate of the stratum's importance may be biased. This means that you should not have too many strata, and each stratum should be large enough. The following code creates an income category attribute by dividing the median income by 1.5 (to limit the number of income categories), and rounding up using ceil (to have discrete categories), and then merging all the categories greater than 5 into category 5:

housing["income_cat"]=np.ceil(housing["median_income"]/1.5)

housing["income_cat"].where(housing["income_cat"]<5,5.0,inplace=True)

These income categories are represented in Figure 2-9:

Figure 2-9. Histogram of income categories

Now you are ready to do stratified sampling based on the income category. For this you can use Scikit-Learn's StratifiedShuffleSplit class:

from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

Let’sseeifthisworkedasexpected.Youcanstartbylookingattheincomecategoryproportionsinthefullhousingdataset:

>>>housing["income_cat"].value_counts()/len(housing)

3.00.350581

2.00.318847

4.00.176308

5.00.114438

1.00.039826

Name:income_cat,dtype:float64

With similar code you can measure the income category proportions in the test set. Figure 2-10 compares the income category proportions in the overall dataset, in the test set generated with stratified sampling, and in a test set generated using purely random sampling. As you can see, the test set generated using stratified sampling has income category proportions almost identical to those in the full dataset, whereas the test set generated using purely random sampling is quite skewed.

Figure 2-10. Sampling bias comparison of stratified versus purely random sampling

Now you should remove the income_cat attribute so the data is back to its original state:

for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)

We spent quite a bit of time on test set generation for a good reason: this is an often neglected but critical part of a Machine Learning project. Moreover, many of these ideas will be useful later when we discuss cross-validation. Now it's time to move on to the next stage: exploring the data.

Discover and Visualize the Data to Gain Insights

So far you have only taken a quick glance at the data to get a general understanding of the kind of data you are manipulating. Now the goal is to go a little bit more in depth.

First, make sure you have put the test set aside and you are only exploring the training set. Also, if the training set is very large, you may want to sample an exploration set, to make manipulations easy and fast. In our case, the set is quite small, so you can just work directly on the full set. Let's create a copy so you can play with it without harming the training set:

housing = strat_train_set.copy()


Visualizing Geographical Data

Since there is geographical information (latitude and longitude), it is a good idea to create a scatterplot of all districts to visualize the data (Figure 2-11):

housing.plot(kind="scatter", x="longitude", y="latitude")

Figure 2-11. A geographical scatterplot of the data

This looks like California all right, but other than that it is hard to see any particular pattern. Setting the alpha option to 0.1 makes it much easier to visualize the places where there is a high density of data points (Figure 2-12):

housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1)


Figure 2-12. A better visualization highlighting high-density areas

Now that's much better: you can clearly see the high-density areas, namely the Bay Area and around Los Angeles and San Diego, plus a long line of fairly high density in the Central Valley, in particular around Sacramento and Fresno.

More generally, our brains are very good at spotting patterns in pictures, but you may need to play around with visualization parameters to make the patterns stand out.

Nowlet’slookatthehousingprices(Figure2-13).Theradiusofeachcirclerepresentsthedistrict’spopulation(options),andthecolorrepresentstheprice(optionc).Wewilluseapredefinedcolormap(optioncmap)calledjet,whichrangesfromblue(lowvalues)tored(highprices):14

housing.plot(kind="scatter",x="longitude",y="latitude",alpha=0.4,

s=housing["population"]/100,label="population",figsize=(10,7),

c="median_house_value",cmap=plt.get_cmap("jet"),colorbar=True,

)

plt.legend()

Figure 2-13. California housing prices

This image tells you that the housing prices are very much related to the location (e.g., close to the ocean) and to the population density, as you probably knew already. It will probably be useful to use a clustering algorithm to detect the main clusters, and add new features that measure the proximity to the cluster centers (a minimal sketch of this idea follows). The ocean proximity attribute may be useful as well, although in Northern California the housing prices in coastal districts are not too high, so it is not a simple rule.
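Here is such a sketch (an illustration, not part of this chapter's pipeline; it assumes Scikit-Learn and the housing exploration copy defined above, and the number of clusters is an arbitrary choice):

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical choice: 10 geographic clusters based on district coordinates
coords = housing[["longitude", "latitude"]].values
kmeans = KMeans(n_clusters=10, random_state=42).fit(coords)

# Distance from each district to its nearest cluster center, as a new feature
nearest_centers = kmeans.cluster_centers_[kmeans.labels_]
housing["dist_to_cluster_center"] = np.linalg.norm(coords - nearest_centers, axis=1)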


Looking for Correlations

Since the dataset is not too large, you can easily compute the standard correlation coefficient (also called Pearson's r) between every pair of attributes using the corr() method:

corr_matrix = housing.corr()

Now let's look at how much each attribute correlates with the median house value:

>>> corr_matrix["median_house_value"].sort_values(ascending=False)
median_house_value    1.000000
median_income         0.687170
total_rooms           0.135231
housing_median_age    0.114220
households            0.064702
total_bedrooms        0.047865
population           -0.026699
longitude            -0.047279
latitude             -0.142826
Name: median_house_value, dtype: float64

The correlation coefficient ranges from –1 to 1. When it is close to 1, it means that there is a strong positive correlation; for example, the median house value tends to go up when the median income goes up. When the coefficient is close to –1, it means that there is a strong negative correlation; you can see a small negative correlation between the latitude and the median house value (i.e., prices have a slight tendency to go down when you go north). Finally, coefficients close to zero mean that there is no linear correlation. Figure 2-14 shows various plots along with the correlation coefficient between their horizontal and vertical axes.

Figure 2-14. Standard correlation coefficient of various datasets (source: Wikipedia; public domain image)


WARNING
The correlation coefficient only measures linear correlations ("if x goes up, then y generally goes up/down"). It may completely miss out on nonlinear relationships (e.g., "if x is close to zero then y generally goes up"). Note how all the plots of the bottom row have a correlation coefficient equal to zero despite the fact that their axes are clearly not independent: these are examples of nonlinear relationships. Also, the second row shows examples where the correlation coefficient is equal to 1 or –1; notice that this has nothing to do with the slope. For example, your height in inches has a correlation coefficient of 1 with your height in feet or in nanometers.

Another way to check for correlation between attributes is to use Pandas' scatter_matrix function, which plots every numerical attribute against every other numerical attribute. Since there are now 11 numerical attributes, you would get 11² = 121 plots, which would not fit on a page, so let's just focus on a few promising attributes that seem most correlated with the median housing value (Figure 2-15):

from pandas.tools.plotting import scatter_matrix

attributes = ["median_house_value", "median_income", "total_rooms",
              "housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))

Figure 2-15. Scatter matrix

The main diagonal (top left to bottom right) would be full of straight lines if Pandas plotted each variable against itself, which would not be very useful. So instead Pandas displays a histogram of each attribute (other options are available; see Pandas' documentation for more details).

The most promising attribute to predict the median house value is the median income, so let's zoom in on their correlation scatterplot (Figure 2-16):

housing.plot(kind="scatter",x="median_income",y="median_house_value",

Page 85: index-of.esindex-of.es/Varios-2/Hands on Machine Learning with... · Revision History for the First Edition 2017-03-10: First Release 2017-06-09: Second Release See for release

alpha=0.1)

This plot reveals a few things. First, the correlation is indeed very strong; you can clearly see the upward trend and the points are not too dispersed. Second, the price cap that we noticed earlier is clearly visible as a horizontal line at $500,000. But this plot reveals other less obvious straight lines: a horizontal line around $450,000, another around $350,000, perhaps one around $280,000, and a few more below that. You may want to try removing the corresponding districts to prevent your algorithms from learning to reproduce these data quirks.

Figure 2-16. Median income versus median house value


Experimenting with Attribute Combinations

Hopefully the previous sections gave you an idea of a few ways you can explore the data and gain insights. You identified a few data quirks that you may want to clean up before feeding the data to a Machine Learning algorithm, and you found interesting correlations between attributes, in particular with the target attribute. You also noticed that some attributes have a tail-heavy distribution, so you may want to transform them (e.g., by computing their logarithm). Of course, your mileage will vary considerably with each project, but the general ideas are similar.

One last thing you may want to do before actually preparing the data for Machine Learning algorithms is to try out various attribute combinations. For example, the total number of rooms in a district is not very useful if you don't know how many households there are. What you really want is the number of rooms per household. Similarly, the total number of bedrooms by itself is not very useful: you probably want to compare it to the number of rooms. And the population per household also seems like an interesting attribute combination to look at. Let's create these new attributes:

housing["rooms_per_household"]=housing["total_rooms"]/housing["households"]

housing["bedrooms_per_room"]=housing["total_bedrooms"]/housing["total_rooms"]

housing["population_per_household"]=housing["population"]/housing["households"]

Andnowlet’slookatthecorrelationmatrixagain:

>>>corr_matrix=housing.corr()

>>>corr_matrix["median_house_value"].sort_values(ascending=False)

median_house_value1.000000

median_income0.687160

rooms_per_household0.146285

total_rooms0.135097

housing_median_age0.114110

households0.064506

total_bedrooms0.047689

population_per_household-0.021985

population-0.026920

longitude-0.047432

latitude-0.142724

bedrooms_per_room-0.259984

Name:median_house_value,dtype:float64

Hey, not bad! The new bedrooms_per_room attribute is much more correlated with the median house value than the total number of rooms or bedrooms. Apparently houses with a lower bedroom/room ratio tend to be more expensive. The number of rooms per household is also more informative than the total number of rooms in a district: obviously the larger the houses, the more expensive they are.

This round of exploration does not have to be absolutely thorough; the point is to start off on the right foot and quickly gain insights that will help you get a first reasonably good prototype. But this is an iterative process: once you get a prototype up and running, you can analyze its output to gain more insights and come back to this exploration step.


Prepare the Data for Machine Learning Algorithms

It's time to prepare the data for your Machine Learning algorithms. Instead of just doing this manually, you should write functions to do that, for several good reasons:

This will allow you to reproduce these transformations easily on any dataset (e.g., the next time you get a fresh dataset).

You will gradually build a library of transformation functions that you can reuse in future projects.

You can use these functions in your live system to transform the new data before feeding it to your algorithms.

This will make it possible for you to easily try various transformations and see which combination of transformations works best.

But first let's revert to a clean training set (by copying strat_train_set once again), and let's separate the predictors and the labels, since we don't necessarily want to apply the same transformations to the predictors and the target values (note that drop() creates a copy of the data and does not affect strat_train_set):

housing = strat_train_set.drop("median_house_value", axis=1)
housing_labels = strat_train_set["median_house_value"].copy()


Data Cleaning

Most Machine Learning algorithms cannot work with missing features, so let's create a few functions to take care of them. You noticed earlier that the total_bedrooms attribute has some missing values, so let's fix this. You have three options:

Get rid of the corresponding districts.

Get rid of the whole attribute.

Set the values to some value (zero, the mean, the median, etc.).

You can accomplish these easily using DataFrame's dropna(), drop(), and fillna() methods:

housing.dropna(subset=["total_bedrooms"])    # option 1
housing.drop("total_bedrooms", axis=1)       # option 2
median = housing["total_bedrooms"].median()  # option 3
housing["total_bedrooms"].fillna(median, inplace=True)

If you choose option 3, you should compute the median value on the training set and use it to fill the missing values in the training set, but also don't forget to save the median value that you have computed. You will need it later to replace missing values in the test set when you want to evaluate your system, and also once the system goes live, to replace missing values in new data.
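For example (a minimal sketch, assuming the DataFrames defined earlier; strat_test_set stands in for the test set or any new data), the statistic learned on the training set is reused as-is everywhere else:

# Compute the statistic on the training set only, then keep it around
median = housing["total_bedrooms"].median()

housing["total_bedrooms"] = housing["total_bedrooms"].fillna(median)
# Same median, not the test set's own, to avoid information leakage
strat_test_set["total_bedrooms"] = strat_test_set["total_bedrooms"].fillna(median)

In practice, the Imputer described next stores these medians for you.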

Scikit-Learn provides a handy class to take care of missing values: Imputer. Here is how to use it. First, you need to create an Imputer instance, specifying that you want to replace each attribute's missing values with the median of that attribute:

from sklearn.preprocessing import Imputer

imputer = Imputer(strategy="median")

Since the median can only be computed on numerical attributes, we need to create a copy of the data without the text attribute ocean_proximity:

housing_num = housing.drop("ocean_proximity", axis=1)

Now you can fit the imputer instance to the training data using the fit() method:

imputer.fit(housing_num)

The imputer has simply computed the median of each attribute and stored the result in its statistics_ instance variable. Only the total_bedrooms attribute had missing values, but we cannot be sure that there won't be any missing values in new data after the system goes live, so it is safer to apply the imputer to all the numerical attributes:

>>> imputer.statistics_
array([-118.51, 34.26, 29., 2119.5, 433., 1164., 408., 3.5409])
>>> housing_num.median().values
array([-118.51, 34.26, 29., 2119.5, 433., 1164., 408., 3.5409])

Now you can use this "trained" imputer to transform the training set by replacing missing values with the learned medians:

X = imputer.transform(housing_num)

The result is a plain NumPy array containing the transformed features. If you want to put it back into a Pandas DataFrame, it's simple:

housing_tr = pd.DataFrame(X, columns=housing_num.columns)

SCIKIT-LEARN DESIGN

Scikit-Learn's API is remarkably well designed. The main design principles are:15

Consistency. All objects share a consistent and simple interface:

Estimators. Any object that can estimate some parameters based on a dataset is called an estimator (e.g., an imputer is an estimator). The estimation itself is performed by the fit() method, and it takes only a dataset as a parameter (or two for supervised learning algorithms; the second dataset contains the labels). Any other parameter needed to guide the estimation process is considered a hyperparameter (such as an imputer's strategy), and it must be set as an instance variable (generally via a constructor parameter).

Transformers. Some estimators (such as an imputer) can also transform a dataset; these are called transformers. Once again, the API is quite simple: the transformation is performed by the transform() method with the dataset to transform as a parameter. It returns the transformed dataset. This transformation generally relies on the learned parameters, as is the case for an imputer. All transformers also have a convenience method called fit_transform() that is equivalent to calling fit() and then transform() (but sometimes fit_transform() is optimized and runs much faster).

Predictors. Finally, some estimators are capable of making predictions given a dataset; they are called predictors. For example, the LinearRegression model in the previous chapter was a predictor: it predicted life satisfaction given a country's GDP per capita. A predictor has a predict() method that takes a dataset of new instances and returns a dataset of corresponding predictions. It also has a score() method that measures the quality of the predictions given a test set (and the corresponding labels in the case of supervised learning algorithms).16

Inspection. All the estimator's hyperparameters are accessible directly via public instance variables (e.g., imputer.strategy), and all the estimator's learned parameters are also accessible via public instance variables with an underscore suffix (e.g., imputer.statistics_).

Nonproliferation of classes. Datasets are represented as NumPy arrays or SciPy sparse matrices, instead of homemade classes. Hyperparameters are just regular Python strings or numbers.

Composition. Existing building blocks are reused as much as possible. For example, it is easy to create a Pipeline estimator from an arbitrary sequence of transformers followed by a final estimator, as we will see.

Sensible defaults. Scikit-Learn provides reasonable default values for most parameters, making it easy to create a baseline working system quickly.
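To make these principles concrete, here is a minimal sketch using the imputer we just created (nothing new here, just the API surface gathered in one place):

imputer = Imputer(strategy="median")     # hyperparameter set via the constructor
imputer.fit(housing_num)                 # estimator: learns parameters from the data
print(imputer.strategy)                  # hyperparameters are public instance variables
print(imputer.statistics_)               # learned parameters get an underscore suffix
X = imputer.transform(housing_num)       # transformer: returns the transformed dataset
X = imputer.fit_transform(housing_num)   # convenience: fit() then transform() in one call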


Handling Text and Categorical Attributes

Earlier we left out the categorical attribute ocean_proximity because it is a text attribute so we cannot compute its median. Most Machine Learning algorithms prefer to work with numbers anyway, so let's convert these text labels to numbers.

Scikit-Learn provides a transformer for this task called LabelEncoder:

>>> from sklearn.preprocessing import LabelEncoder
>>> encoder = LabelEncoder()
>>> housing_cat = housing["ocean_proximity"]
>>> housing_cat_encoded = encoder.fit_transform(housing_cat)
>>> housing_cat_encoded
array([0, 0, 4, ..., 1, 0, 3])

This is better: now we can use this numerical data in any ML algorithm. You can look at the mapping that this encoder has learned using the classes_ attribute ("<1H OCEAN" is mapped to 0, "INLAND" is mapped to 1, etc.):

>>> print(encoder.classes_)
['<1H OCEAN' 'INLAND' 'ISLAND' 'NEAR BAY' 'NEAR OCEAN']

One issue with this representation is that ML algorithms will assume that two nearby values are more similar than two distant values. Obviously this is not the case (for example, categories 0 and 4 are more similar than categories 0 and 1). To fix this issue, a common solution is to create one binary attribute per category: one attribute equal to 1 when the category is "<1H OCEAN" (and 0 otherwise), another attribute equal to 1 when the category is "INLAND" (and 0 otherwise), and so on. This is called one-hot encoding, because only one attribute will be equal to 1 (hot), while the others will be 0 (cold).

Scikit-Learn provides a OneHotEncoder encoder to convert integer categorical values into one-hot vectors. Let's encode the categories as one-hot vectors. Note that fit_transform() expects a 2D array, but housing_cat_encoded is a 1D array, so we need to reshape it:17

>>> from sklearn.preprocessing import OneHotEncoder
>>> encoder = OneHotEncoder()
>>> housing_cat_1hot = encoder.fit_transform(housing_cat_encoded.reshape(-1,1))
>>> housing_cat_1hot
<16512x5 sparse matrix of type '<class 'numpy.float64'>'
	with 16512 stored elements in Compressed Sparse Row format>

Notice that the output is a SciPy sparse matrix, instead of a NumPy array. This is very useful when you have categorical attributes with thousands of categories. After one-hot encoding we get a matrix with thousands of columns, and the matrix is full of zeros except for one 1 per row. Using up tons of memory mostly to store zeros would be very wasteful, so instead a sparse matrix only stores the location of the nonzero elements. You can use it mostly like a normal 2D array,18 but if you really want to convert it to a (dense) NumPy array, just call the toarray() method:

>>> housing_cat_1hot.toarray()
array([[ 1.,  0.,  0.,  0.,  0.],
       [ 1.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  1.],
       ...,
       [ 0.,  1.,  0.,  0.,  0.],
       [ 1.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.]])

We can apply both transformations (from text categories to integer categories, then from integer categories to one-hot vectors) in one shot using the LabelBinarizer class:

>>> from sklearn.preprocessing import LabelBinarizer
>>> encoder = LabelBinarizer()
>>> housing_cat_1hot = encoder.fit_transform(housing_cat)
>>> housing_cat_1hot
array([[1, 0, 0, 0, 0],
       [1, 0, 0, 0, 0],
       [0, 0, 0, 0, 1],
       ...,
       [0, 1, 0, 0, 0],
       [1, 0, 0, 0, 0],
       [0, 0, 0, 1, 0]])

Note that this returns a dense NumPy array by default. You can get a sparse matrix instead by passing sparse_output=True to the LabelBinarizer constructor.


Custom Transformers

Although Scikit-Learn provides many useful transformers, you will need to write your own for tasks such as custom cleanup operations or combining specific attributes. You will want your transformer to work seamlessly with Scikit-Learn functionalities (such as pipelines), and since Scikit-Learn relies on duck typing (not inheritance), all you need is to create a class and implement three methods: fit() (returning self), transform(), and fit_transform(). You can get the last one for free by simply adding TransformerMixin as a base class. Also, if you add BaseEstimator as a base class (and avoid *args and **kwargs in your constructor) you will get two extra methods (get_params() and set_params()) that will be useful for automatic hyperparameter tuning. For example, here is a small transformer class that adds the combined attributes we discussed earlier:

from sklearn.base import BaseEstimator, TransformerMixin

rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True):  # no *args or **kwargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)

In this example the transformer has one hyperparameter, add_bedrooms_per_room, set to True by default (it is often helpful to provide sensible defaults). This hyperparameter will allow you to easily find out whether adding this attribute helps the Machine Learning algorithms or not. More generally, you can add a hyperparameter to gate any data preparation step that you are not 100% sure about. The more you automate these data preparation steps, the more combinations you can automatically try out, making it much more likely that you will find a great combination (and saving you a lot of time).


Feature Scaling

One of the most important transformations you need to apply to your data is feature scaling. With few exceptions, Machine Learning algorithms don't perform well when the input numerical attributes have very different scales. This is the case for the housing data: the total number of rooms ranges from about 6 to 39,320, while the median incomes only range from 0 to 15. Note that scaling the target values is generally not required.

There are two common ways to get all attributes to have the same scale: min-max scaling and standardization.

Min-max scaling (many people call this normalization) is quite simple: values are shifted and rescaled so that they end up ranging from 0 to 1. We do this by subtracting the min value and dividing by the max minus the min. Scikit-Learn provides a transformer called MinMaxScaler for this. It has a feature_range hyperparameter that lets you change the range if you don't want 0–1 for some reason.

Standardization is quite different: first it subtracts the mean value (so standardized values always have a zero mean), and then it divides by the standard deviation so that the resulting distribution has unit variance. Unlike min-max scaling, standardization does not bound values to a specific range, which may be a problem for some algorithms (e.g., neural networks often expect an input value ranging from 0 to 1). However, standardization is much less affected by outliers. For example, suppose a district had a median income equal to 100 (by mistake). Min-max scaling would then crush all the other values from 0–15 down to 0–0.15, whereas standardization would not be much affected. Scikit-Learn provides a transformer called StandardScaler for standardization.
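As a minimal sketch, here is how both scalers could be applied to the imputed numerical data from earlier (the output variable names are illustrative):

from sklearn.preprocessing import MinMaxScaler, StandardScaler

min_max_scaler = MinMaxScaler(feature_range=(0, 1))   # (0, 1) is the default range
housing_num_minmax = min_max_scaler.fit_transform(housing_tr)

std_scaler = StandardScaler()                         # zero mean, unit variance
housing_num_std = std_scaler.fit_transform(housing_tr)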

WARNING
As with all the transformations, it is important to fit the scalers to the training data only, not to the full dataset (including the test set). Only then can you use them to transform the training set and the test set (and new data).


Transformation Pipelines

As you can see, there are many data transformation steps that need to be executed in the right order. Fortunately, Scikit-Learn provides the Pipeline class to help with such sequences of transformations. Here is a small pipeline for the numerical attributes:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
        ('imputer', Imputer(strategy="median")),
        ('attribs_adder', CombinedAttributesAdder()),
        ('std_scaler', StandardScaler()),
    ])

housing_num_tr = num_pipeline.fit_transform(housing_num)

The Pipeline constructor takes a list of name/estimator pairs defining a sequence of steps. All but the last estimator must be transformers (i.e., they must have a fit_transform() method). The names can be anything you like (as long as they don't contain double underscores, "__").

When you call the pipeline's fit() method, it calls fit_transform() sequentially on all transformers, passing the output of each call as the parameter to the next call, until it reaches the final estimator, for which it just calls the fit() method.

The pipeline exposes the same methods as the final estimator. In this example, the last estimator is a StandardScaler, which is a transformer, so the pipeline has a transform() method that applies all the transforms to the data in sequence (it also has a fit_transform() method that we could have used instead of calling fit() and then transform()).

Now it would be nice if we could feed a Pandas DataFrame directly into our pipeline, instead of having to first manually extract the numerical columns into a NumPy array. There is nothing in Scikit-Learn to handle Pandas DataFrames,19 but we can write a custom transformer for this task:

from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

Our DataFrameSelector will transform the data by selecting the desired attributes, dropping the rest, and converting the resulting DataFrame to a NumPy array. With this, you can easily write a pipeline that will take a Pandas DataFrame and handle only the numerical values: the pipeline would just start with a DataFrameSelector to pick only the numerical attributes, followed by the other preprocessing steps we discussed earlier. And you can just as easily write another pipeline for the categorical attributes as well, by simply selecting the categorical attributes using a DataFrameSelector and then applying a LabelBinarizer.


num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

num_pipeline = Pipeline([
        ('selector', DataFrameSelector(num_attribs)),
        ('imputer', Imputer(strategy="median")),
        ('attribs_adder', CombinedAttributesAdder()),
        ('std_scaler', StandardScaler()),
    ])

cat_pipeline = Pipeline([
        ('selector', DataFrameSelector(cat_attribs)),
        ('label_binarizer', LabelBinarizer()),
    ])

But how can you join these two pipelines into a single pipeline? The answer is to use Scikit-Learn's FeatureUnion class. You give it a list of transformers (which can be entire transformer pipelines); when its transform() method is called, it runs each transformer's transform() method in parallel, waits for their output, and then concatenates them and returns the result (and of course calling its fit() method calls each transformer's fit() method). A full pipeline handling both numerical and categorical attributes may look like this:

from sklearn.pipeline import FeatureUnion

full_pipeline = FeatureUnion(transformer_list=[
        ("num_pipeline", num_pipeline),
        ("cat_pipeline", cat_pipeline),
    ])

And you can run the whole pipeline simply:

>>> housing_prepared = full_pipeline.fit_transform(housing)
>>> housing_prepared
array([[-1.15604281,  0.77194962,  0.74333089, ...,  0.        ,
         0.        ,  0.        ],
       [-1.17602483,  0.6596948 , -1.1653172 , ...,  0.        ,
         0.        ,  0.        ],
       [...]
>>> housing_prepared.shape
(16512, 16)


Select and Train a Model

At last! You framed the problem, you got the data and explored it, you sampled a training set and a test set, and you wrote transformation pipelines to clean up and prepare your data for Machine Learning algorithms automatically. You are now ready to select and train a Machine Learning model.


Training and Evaluating on the Training Set

The good news is that thanks to all these previous steps, things are now going to be much simpler than you might think. Let's first train a LinearRegression model, like we did in the previous chapter:

from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)

Done! You now have a working Linear Regression model. Let's try it out on a few instances from the training set:

>>> some_data = housing.iloc[:5]
>>> some_labels = housing_labels.iloc[:5]
>>> some_data_prepared = full_pipeline.transform(some_data)
>>> print("Predictions:", lin_reg.predict(some_data_prepared))
Predictions: [ 210644.6045  317768.8069  210956.4333   59218.9888  189747.5584]
>>> print("Labels:", list(some_labels))
Labels: [286600.0, 340600.0, 196900.0, 46300.0, 254500.0]

It works, although the predictions are not exactly accurate (e.g., the first prediction is off by close to 40%!). Let's measure this regression model's RMSE on the whole training set using Scikit-Learn's mean_squared_error function:

>>> from sklearn.metrics import mean_squared_error
>>> housing_predictions = lin_reg.predict(housing_prepared)
>>> lin_mse = mean_squared_error(housing_labels, housing_predictions)
>>> lin_rmse = np.sqrt(lin_mse)
>>> lin_rmse
68628.198198489219

Okay, this is better than nothing but clearly not a great score: most districts' median_housing_values range between $120,000 and $265,000, so a typical prediction error of $68,628 is not very satisfying. This is an example of a model underfitting the training data. When this happens it can mean that the features do not provide enough information to make good predictions, or that the model is not powerful enough. As we saw in the previous chapter, the main ways to fix underfitting are to select a more powerful model, to feed the training algorithm with better features, or to reduce the constraints on the model. This model is not regularized, so this rules out the last option. You could try to add more features (e.g., the log of the population), but first let's try a more complex model to see how it does.

Let's train a DecisionTreeRegressor. This is a powerful model, capable of finding complex nonlinear relationships in the data (Decision Trees are presented in more detail in Chapter 6). The code should look familiar by now:

from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor()
tree_reg.fit(housing_prepared, housing_labels)

Now that the model is trained, let's evaluate it on the training set:


>>> housing_predictions = tree_reg.predict(housing_prepared)
>>> tree_mse = mean_squared_error(housing_labels, housing_predictions)
>>> tree_rmse = np.sqrt(tree_mse)
>>> tree_rmse
0.0

Wait, what!? No error at all? Could this model really be absolutely perfect? Of course, it is much more likely that the model has badly overfit the data. How can you be sure? As we saw earlier, you don't want to touch the test set until you are ready to launch a model you are confident about, so you need to use part of the training set for training, and part for model validation.


Better Evaluation Using Cross-Validation

One way to evaluate the Decision Tree model would be to use the train_test_split function to split the training set into a smaller training set and a validation set, then train your models against the smaller training set and evaluate them against the validation set. It's a bit of work, but nothing too difficult and it would work fairly well.

A great alternative is to use Scikit-Learn's cross-validation feature. The following code performs K-fold cross-validation: it randomly splits the training set into 10 distinct subsets called folds, then it trains and evaluates the Decision Tree model 10 times, picking a different fold for evaluation every time and training on the other 9 folds. The result is an array containing the 10 evaluation scores:

from sklearn.model_selection import cross_val_score

scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)

WARNING
Scikit-Learn's cross-validation features expect a utility function (greater is better) rather than a cost function (lower is better), so the scoring function is actually the opposite of the MSE (i.e., a negative value), which is why the preceding code computes -scores before calculating the square root.

Let's look at the results:

>>> def display_scores(scores):
...     print("Scores:", scores)
...     print("Mean:", scores.mean())
...     print("Standard deviation:", scores.std())
...
>>> display_scores(tree_rmse_scores)
Scores: [ 70232.0136482   66828.46839892  72444.08721003  70761.50186201
  71125.52697653  75581.29319857  70169.59286164  70055.37863456
  75370.49116773  71222.39081244]
Mean: 71379.0744771
Standard deviation: 2458.31882043

Now the Decision Tree doesn't look as good as it did earlier. In fact, it seems to perform worse than the Linear Regression model! Notice that cross-validation allows you to get not only an estimate of the performance of your model, but also a measure of how precise this estimate is (i.e., its standard deviation). The Decision Tree has a score of approximately 71,379, generally ±2,458. You would not have this information if you just used one validation set. But cross-validation comes at the cost of training the model several times, so it is not always possible.

Let’scomputethesamescoresfortheLinearRegressionmodeljusttobesure:

>>>lin_scores=cross_val_score(lin_reg,housing_prepared,housing_labels,

...scoring="neg_mean_squared_error",cv=10)

...

>>>lin_rmse_scores=np.sqrt(-lin_scores)

>>>display_scores(lin_rmse_scores)

Scores:[66760.9737157266962.6191424470349.9485340174757.02629506

68031.1338893871193.8418342664968.1370652768261.95557897

Page 100: index-of.esindex-of.es/Varios-2/Hands on Machine Learning with... · Revision History for the First Edition 2017-03-10: First Release 2017-06-09: Second Release See for release

71527.6421787467665.10082067]

Mean:69047.8379055

Standarddeviation:2735.51074287

That’sright:theDecisionTreemodelisoverfittingsobadlythatitperformsworsethantheLinearRegressionmodel.

Let’stryonelastmodelnow:theRandomForestRegressor.AswewillseeinChapter7,RandomForestsworkbytrainingmanyDecisionTreesonrandomsubsetsofthefeatures,thenaveragingouttheirpredictions.BuildingamodelontopofmanyothermodelsiscalledEnsembleLearning,anditisoftenagreatwaytopushMLalgorithmsevenfurther.Wewillskipmostofthecodesinceitisessentiallythesameasfortheothermodels:

>>>fromsklearn.ensembleimportRandomForestRegressor

>>>forest_reg=RandomForestRegressor()

>>>forest_reg.fit(housing_prepared,housing_labels)

>>>[...]

>>>forest_rmse

21941.911027380233

>>>display_scores(forest_rmse_scores)

Scores:[51650.9440547148920.8064549852979.1609675254412.74042021

50861.2938116356488.5569972751866.9012078649752.24599537

55399.5071319153309.74548294]

Mean:52564.1902524

Standarddeviation:2301.87380392

Wow,thisismuchbetter:RandomForestslookverypromising.However,notethatthescoreonthetrainingsetisstillmuchlowerthanonthevalidationsets,meaningthatthemodelisstilloverfittingthetrainingset.Possiblesolutionsforoverfittingaretosimplifythemodel,constrainit(i.e.,regularizeit),orgetalotmoretrainingdata.However,beforeyoudivemuchdeeperinRandomForests,youshouldtryoutmanyothermodelsfromvariouscategoriesofMachineLearningalgorithms(severalSupportVectorMachineswithdifferentkernels,possiblyaneuralnetwork,etc.),withoutspendingtoomuchtimetweakingthehyperparameters.Thegoalistoshortlistafew(twotofive)promisingmodels.

TIPYoushouldsaveeverymodelyouexperimentwith,soyoucancomebackeasilytoanymodelyouwant.Makesureyousaveboththehyperparametersandthetrainedparameters,aswellasthecross-validationscoresandperhapstheactualpredictionsaswell.Thiswillallowyoutoeasilycomparescoresacrossmodeltypes,andcomparethetypesoferrorstheymake.YoucaneasilysaveScikit-LearnmodelsbyusingPython’spicklemodule,orusingsklearn.externals.joblib,whichismoreefficientatserializinglargeNumPyarrays:

fromsklearn.externalsimportjoblib

joblib.dump(my_model,"my_model.pkl")

#andlater...

my_model_loaded=joblib.load("my_model.pkl")


Fine-TuneYourModelLet’sassumethatyounowhaveashortlistofpromisingmodels.Younowneedtofine-tunethem.Let’slookatafewwaysyoucandothat.


Grid Search

One way to do that would be to fiddle with the hyperparameters manually, until you find a great combination of hyperparameter values. This would be very tedious work, and you may not have time to explore many combinations.

Instead you should get Scikit-Learn's GridSearchCV to search for you. All you need to do is tell it which hyperparameters you want it to experiment with, and what values to try out, and it will evaluate all the possible combinations of hyperparameter values, using cross-validation. For example, the following code searches for the best combination of hyperparameter values for the RandomForestRegressor:

from sklearn.model_selection import GridSearchCV

param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
  ]

forest_reg = RandomForestRegressor()

grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error')
grid_search.fit(housing_prepared, housing_labels)

TIP
When you have no idea what value a hyperparameter should have, a simple approach is to try out consecutive powers of 10 (or a smaller number if you want a more fine-grained search, as shown in this example with the n_estimators hyperparameter).

This param_grid tells Scikit-Learn to first evaluate all 3 × 4 = 12 combinations of n_estimators and max_features hyperparameter values specified in the first dict (don't worry about what these hyperparameters mean for now; they will be explained in Chapter 7), then try all 2 × 3 = 6 combinations of hyperparameter values in the second dict, but this time with the bootstrap hyperparameter set to False instead of True (which is the default value for this hyperparameter).

All in all, the grid search will explore 12 + 6 = 18 combinations of RandomForestRegressor hyperparameter values, and it will train each model five times (since we are using five-fold cross-validation). In other words, all in all, there will be 18 × 5 = 90 rounds of training! It may take quite a long time, but when it is done you can get the best combination of parameters like this:

>>> grid_search.best_params_
{'max_features': 8, 'n_estimators': 30}

TIP
Since 8 and 30 are the maximum values that were evaluated, you should probably try searching again with higher values, since the score may continue to improve.
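For example, a follow-up search might look like the following sketch (these wider value ranges are illustrative assumptions, not values from this chapter):

param_grid = [
    {'n_estimators': [30, 50, 100], 'max_features': [8, 10, 12]},
  ]
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error')
grid_search.fit(housing_prepared, housing_labels)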


You can also get the best estimator directly:

>>> grid_search.best_estimator_
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features=8, max_leaf_nodes=None, min_impurity_split=1e-07,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=30, n_jobs=1,
           oob_score=False, random_state=42, verbose=0, warm_start=False)

NOTE
If GridSearchCV is initialized with refit=True (which is the default), then once it finds the best estimator using cross-validation, it retrains it on the whole training set. This is usually a good idea since feeding it more data will likely improve its performance.

And of course the evaluation scores are also available:

>>> cvres = grid_search.cv_results_
>>> for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
...     print(np.sqrt(-mean_score), params)
...
63825.0479302 {'max_features': 2, 'n_estimators': 3}
55643.8429091 {'max_features': 2, 'n_estimators': 10}
53380.6566859 {'max_features': 2, 'n_estimators': 30}
60959.1388585 {'max_features': 4, 'n_estimators': 3}
52740.5841667 {'max_features': 4, 'n_estimators': 10}
50374.1421461 {'max_features': 4, 'n_estimators': 30}
58661.2866462 {'max_features': 6, 'n_estimators': 3}
52009.9739798 {'max_features': 6, 'n_estimators': 10}
50154.1177737 {'max_features': 6, 'n_estimators': 30}
57865.3616801 {'max_features': 8, 'n_estimators': 3}
51730.0755087 {'max_features': 8, 'n_estimators': 10}
49694.8514333 {'max_features': 8, 'n_estimators': 30}
62874.4073931 {'max_features': 2, 'n_estimators': 3, 'bootstrap': False}
54561.9398157 {'max_features': 2, 'n_estimators': 10, 'bootstrap': False}
59416.6463145 {'max_features': 3, 'n_estimators': 3, 'bootstrap': False}
52660.245911 {'max_features': 3, 'n_estimators': 10, 'bootstrap': False}
57490.0168279 {'max_features': 4, 'n_estimators': 3, 'bootstrap': False}
51093.9059428 {'max_features': 4, 'n_estimators': 10, 'bootstrap': False}

In this example, we obtain the best solution by setting the max_features hyperparameter to 8, and the n_estimators hyperparameter to 30. The RMSE score for this combination is 49,694, which is slightly better than the score you got earlier using the default hyperparameter values (which was 52,564). Congratulations, you have successfully fine-tuned your best model!

TIP
Don't forget that you can treat some of the data preparation steps as hyperparameters. For example, the grid search will automatically find out whether or not to add a feature you were not sure about (e.g., using the add_bedrooms_per_room hyperparameter of your CombinedAttributesAdder transformer). It may similarly be used to automatically find the best way to handle outliers, missing features, feature selection, and more.


Randomized Search

The grid search approach is fine when you are exploring relatively few combinations, like in the previous example, but when the hyperparameter search space is large, it is often preferable to use RandomizedSearchCV instead. This class can be used in much the same way as the GridSearchCV class, but instead of trying out all possible combinations, it evaluates a given number of random combinations by selecting a random value for each hyperparameter at every iteration. This approach has two main benefits:

If you let the randomized search run for, say, 1,000 iterations, this approach will explore 1,000 different values for each hyperparameter (instead of just a few values per hyperparameter with the grid search approach).

You have more control over the computing budget you want to allocate to hyperparameter search, simply by setting the number of iterations.
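As a minimal sketch (the distributions and iteration count below are illustrative assumptions, not values from this chapter):

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_distribs = {
    'n_estimators': randint(low=1, high=200),   # a new value is sampled at each iteration
    'max_features': randint(low=1, high=8),
}

rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,
                                n_iter=10, cv=5,
                                scoring='neg_mean_squared_error')
rnd_search.fit(housing_prepared, housing_labels)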


Ensemble Methods

Another way to fine-tune your system is to try to combine the models that perform best. The group (or "ensemble") will often perform better than the best individual model (just like Random Forests perform better than the individual Decision Trees they rely on), especially if the individual models make very different types of errors. We will cover this topic in more detail in Chapter 7.


Analyze the Best Models and Their Errors

You will often gain good insights on the problem by inspecting the best models. For example, the RandomForestRegressor can indicate the relative importance of each attribute for making accurate predictions:

>>> feature_importances = grid_search.best_estimator_.feature_importances_
>>> feature_importances
array([  7.33442355e-02,   6.29090705e-02,   4.11437985e-02,
         1.46726854e-02,   1.41064835e-02,   1.48742809e-02,
         1.42575993e-02,   3.66158981e-01,   5.64191792e-02,
         1.08792957e-01,   5.33510773e-02,   1.03114883e-02,
         1.64780994e-01,   6.02803867e-05,   1.96041560e-03,
         2.85647464e-03])

Let's display these importance scores next to their corresponding attribute names:

>>> extra_attribs = ["rooms_per_hhold", "pop_per_hhold", "bedrooms_per_room"]
>>> cat_one_hot_attribs = list(encoder.classes_)
>>> attributes = num_attribs + extra_attribs + cat_one_hot_attribs
>>> sorted(zip(feature_importances, attributes), reverse=True)
[(0.36615898061813418, 'median_income'),
 (0.16478099356159051, 'INLAND'),
 (0.10879295677551573, 'pop_per_hhold'),
 (0.073344235516012421, 'longitude'),
 (0.062909070482620302, 'latitude'),
 (0.056419179181954007, 'rooms_per_hhold'),
 (0.053351077347675809, 'bedrooms_per_room'),
 (0.041143798478729635, 'housing_median_age'),
 (0.014874280890402767, 'population'),
 (0.014672685420543237, 'total_rooms'),
 (0.014257599323407807, 'households'),
 (0.014106483453584102, 'total_bedrooms'),
 (0.010311488326303787, '<1H OCEAN'),
 (0.0028564746373201579, 'NEAR OCEAN'),
 (0.0019604155994780701, 'NEAR BAY'),
 (6.0280386727365991e-05, 'ISLAND')]

With this information, you may want to try dropping some of the less useful features (e.g., apparently only one ocean_proximity category is really useful, so you could try dropping the others).
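One hedged way to act on this insight is a small custom transformer that keeps only the k most important features (TopFeatureSelector is a hypothetical helper for illustration, not part of Scikit-Learn):

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class TopFeatureSelector(BaseEstimator, TransformerMixin):
    def __init__(self, feature_importances, k):
        self.feature_importances = feature_importances
        self.k = k
    def fit(self, X, y=None):
        # indices of the k largest importance scores
        self.feature_indices_ = np.argsort(self.feature_importances)[-self.k:]
        return self
    def transform(self, X):
        return X[:, self.feature_indices_]

# e.g., keep only the 5 most important features:
housing_top_k = TopFeatureSelector(feature_importances, k=5).fit_transform(housing_prepared)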

You should also look at the specific errors that your system makes, then try to understand why it makes them and what could fix the problem (adding extra features or, on the contrary, getting rid of uninformative ones, cleaning up outliers, etc.).


Evaluate Your System on the Test Set

After tweaking your models for a while, you eventually have a system that performs sufficiently well. Now is the time to evaluate the final model on the test set. There is nothing special about this process; just get the predictors and the labels from your test set, run your full_pipeline to transform the data (call transform(), not fit_transform()!), and evaluate the final model on the test set:

final_model = grid_search.best_estimator_

X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()

X_test_prepared = full_pipeline.transform(X_test)

final_predictions = final_model.predict(X_test_prepared)

final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)   # => evaluates to 47,766.0

The performance will usually be slightly worse than what you measured using cross-validation if you did a lot of hyperparameter tuning (because your system ends up fine-tuned to perform well on the validation data, and will likely not perform as well on unknown datasets). It is not the case in this example, but when this happens you must resist the temptation to tweak the hyperparameters to make the numbers look good on the test set; the improvements would be unlikely to generalize to new data.

Now comes the project prelaunch phase: you need to present your solution (highlighting what you have learned, what worked and what did not, what assumptions were made, and what your system's limitations are), document everything, and create nice presentations with clear visualizations and easy-to-remember statements (e.g., "the median income is the number one predictor of housing prices").


Launch, Monitor, and Maintain Your System

Perfect, you got approval to launch! You need to get your solution ready for production, in particular by plugging the production input data sources into your system and writing tests.

You also need to write monitoring code to check your system's live performance at regular intervals and trigger alerts when it drops. This is important to catch not only sudden breakage, but also performance degradation. This is quite common because models tend to "rot" as data evolves over time, unless the models are regularly trained on fresh data.
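As a minimal sketch of such a check (the function, the alerting hook, and the 10% threshold are all illustrative assumptions):

def check_live_performance(live_rmse, baseline_rmse, tolerance=0.10):
    # Trigger an alert if the live RMSE degrades more than `tolerance` versus the baseline.
    if live_rmse > baseline_rmse * (1 + tolerance):
        send_alert("Model RMSE degraded: %.0f vs. baseline %.0f"
                   % (live_rmse, baseline_rmse))   # send_alert() is hypothetical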

Evaluatingyoursystem’sperformancewillrequiresamplingthesystem’spredictionsandevaluatingthem.Thiswillgenerallyrequireahumananalysis.Theseanalystsmaybefieldexperts,orworkersonacrowdsourcingplatform(suchasAmazonMechanicalTurkorCrowdFlower).Eitherway,youneedtoplugthehumanevaluationpipelineintoyoursystem.

Youshouldalsomakesureyouevaluatethesystem’sinputdataquality.Sometimesperformancewilldegradeslightlybecauseofapoorqualitysignal(e.g.,amalfunctioningsensorsendingrandomvalues,oranotherteam’soutputbecomingstale),butitmaytakeawhilebeforeyoursystem’sperformancedegradesenoughtotriggeranalert.Ifyoumonitoryoursystem’sinputs,youmaycatchthisearlier.Monitoringtheinputsisparticularlyimportantforonlinelearningsystems.

Finally,youwillgenerallywanttotrainyourmodelsonaregularbasisusingfreshdata.Youshouldautomatethisprocessasmuchaspossible.Ifyoudon’t,youareverylikelytorefreshyourmodelonlyeverysixmonths(atbest),andyoursystem’sperformancemayfluctuateseverelyovertime.Ifyoursystemisanonlinelearningsystem,youshouldmakesureyousavesnapshotsofitsstateatregularintervalssoyoucaneasilyrollbacktoapreviouslyworkingstate.


Try It Out!

Hopefully this chapter gave you a good idea of what a Machine Learning project looks like, and showed you some of the tools you can use to train a great system. As you can see, much of the work is in the data preparation step, building monitoring tools, setting up human evaluation pipelines, and automating regular model training. The Machine Learning algorithms are also important, of course, but it is probably preferable to be comfortable with the overall process and know three or four algorithms well, rather than to spend all your time exploring advanced algorithms and not enough time on the overall process.

So, if you have not already done so, now is a good time to pick up a laptop, select a dataset that you are interested in, and try to go through the whole process from A to Z. A good place to start is on a competition website such as http://kaggle.com/: you will have a dataset to play with, a clear goal, and people to share the experience with.


Exercises

Using this chapter's housing dataset:

1. Try a Support Vector Machine regressor (sklearn.svm.SVR), with various hyperparameters such as kernel="linear" (with various values for the C hyperparameter) or kernel="rbf" (with various values for the C and gamma hyperparameters). Don't worry about what these hyperparameters mean for now. How does the best SVR predictor perform?

2. Try replacing GridSearchCV with RandomizedSearchCV.

3. Try adding a transformer in the preparation pipeline to select only the most important attributes.

4. Try creating a single pipeline that does the full data preparation plus the final prediction.

5. Automatically explore some preparation options using GridSearchCV.

Solutions to these exercises are available in the online Jupyter notebooks at https://github.com/ageron/handson-ml.

1. The example project is completely fictitious; the goal is just to illustrate the main steps of a Machine Learning project, not to learn anything about the real estate business.

2. The original dataset appeared in R. Kelley Pace and Ronald Barry, "Sparse Spatial Autoregressions," Statistics & Probability Letters 33, no. 3 (1997): 291–297.

3. A piece of information fed to a Machine Learning system is often called a signal in reference to Shannon's information theory: you want a high signal/noise ratio.

4. Recall that the transpose operator flips a column vector into a row vector (and vice versa).

5. The latest version of Python 3 is recommended. Python 2.7+ should work fine too, but it is deprecated. If you use Python 2, you must add from __future__ import division, print_function, unicode_literals at the beginning of your code.

6. We will show the installation steps using pip in a bash shell on a Linux or macOS system. You may need to adapt these commands to your own system. On Windows, we recommend installing Anaconda instead.

7. You may need to have administrator rights to run this command; if so, try prefixing it with sudo.

8. Note that Jupyter can handle multiple versions of Python, and even many other languages such as R or Octave.

9. You might also need to check legal constraints, such as private fields that should never be copied to unsafe data stores.

10. In a real project you would save this code in a Python file, but for now you can just write it in your Jupyter notebook.

11. The standard deviation is generally denoted σ (the Greek letter sigma), and it is the square root of the variance, which is the average of the squared deviation from the mean. When a feature has a bell-shaped normal distribution (also called a Gaussian distribution), which is very common, the "68-95-99.7" rule applies: about 68% of the values fall within 1σ of the mean, 95% within 2σ, and 99.7% within 3σ.

12. You will often see people set the random seed to 42. This number has no special property, other than to be The Answer to the Ultimate Question of Life, the Universe, and Everything.

13. The location information is actually quite coarse, and as a result many districts will have the exact same ID, so they will end up in the same set (test or train). This introduces some unfortunate sampling bias.

14. If you are reading this in grayscale, grab a red pen and scribble over most of the coastline from the Bay Area down to San Diego (as you might expect). You can add a patch of yellow around Sacramento as well.

15. For more details on the design principles, see "API design for machine learning software: experiences from the scikit-learn project," L. Buitinck, G. Louppe, M. Blondel, F. Pedregosa, A. Müller, et al. (2013).

16. Some predictors also provide methods to measure the confidence of their predictions.

17. NumPy's reshape() function allows one dimension to be –1, which means "unspecified": the value is inferred from the length of the array and the remaining dimensions.

18. See SciPy's documentation for more details.

19. But check out Pull Request #3886, which may introduce a ColumnTransformer class making attribute-specific transformations easy. You could also run pip3 install sklearn-pandas to get a DataFrameMapper class with a similar objective.


Chapter 3. Classification

In Chapter 1 we mentioned that the most common supervised learning tasks are regression (predicting values) and classification (predicting classes). In Chapter 2 we explored a regression task, predicting housing values, using various algorithms such as Linear Regression, Decision Trees, and Random Forests (which will be explained in further detail in later chapters). Now we will turn our attention to classification systems.


MNIST

In this chapter, we will be using the MNIST dataset, which is a set of 70,000 small images of digits handwritten by high school students and employees of the US Census Bureau. Each image is labeled with the digit it represents. This set has been studied so much that it is often called the "Hello World" of Machine Learning: whenever people come up with a new classification algorithm, they are curious to see how it will perform on MNIST. Whenever someone learns Machine Learning, sooner or later they tackle MNIST.

Scikit-Learn provides many helper functions to download popular datasets. MNIST is one of them. The following code fetches the MNIST dataset:1

>>> from sklearn.datasets import fetch_mldata
>>> mnist = fetch_mldata('MNIST original')
>>> mnist
{'COL_NAMES': ['label', 'data'],
 'DESCR': 'mldata.org dataset: mnist-original',
 'data': array([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
 'target': array([ 0.,  0.,  0., ...,  9.,  9.,  9.])}

Datasets loaded by Scikit-Learn generally have a similar dictionary structure including:

A DESCR key describing the dataset

A data key containing an array with one row per instance and one column per feature

A target key containing an array with the labels

Let's look at these arrays:

>>> X, y = mnist["data"], mnist["target"]
>>> X.shape
(70000, 784)
>>> y.shape
(70000,)

There are 70,000 images, and each image has 784 features. This is because each image is 28 × 28 pixels, and each feature simply represents one pixel's intensity, from 0 (white) to 255 (black). Let's take a peek at one digit from the dataset. All you need to do is grab an instance's feature vector, reshape it to a 28 × 28 array, and display it using Matplotlib's imshow() function:

%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt

some_digit = X[36000]
some_digit_image = some_digit.reshape(28, 28)

plt.imshow(some_digit_image, cmap=matplotlib.cm.binary,
           interpolation="nearest")
plt.axis("off")
plt.show()

This looks like a 5, and indeed that's what the label tells us:

>>> y[36000]
5.0

Figure 3-1 shows a few more images from the MNIST dataset to give you a feel for the complexity of the classification task.


Figure 3-1. A few digits from the MNIST dataset

But wait! You should always create a test set and set it aside before inspecting the data closely. The MNIST dataset is actually already split into a training set (the first 60,000 images) and a test set (the last 10,000 images):

X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

Let's also shuffle the training set; this will guarantee that all cross-validation folds will be similar (you don't want one fold to be missing some digits). Moreover, some learning algorithms are sensitive to the order of the training instances, and they perform poorly if they get many similar instances in a row. Shuffling the dataset ensures that this won't happen:2

import numpy as np

shuffle_index = np.random.permutation(60000)
X_train, y_train = X_train[shuffle_index], y_train[shuffle_index]


TrainingaBinaryClassifierLet’ssimplifytheproblemfornowandonlytrytoidentifyonedigit—forexample,thenumber5.This“5-detector”willbeanexampleofabinaryclassifier,capableofdistinguishingbetweenjusttwoclasses,5andnot-5.Let’screatethetargetvectorsforthisclassificationtask:

y_train_5=(y_train==5)#Trueforall5s,Falseforallotherdigits.

y_test_5=(y_test==5)

Okay,nowlet’spickaclassifierandtrainit.AgoodplacetostartiswithaStochasticGradientDescent(SGD)classifier,usingScikit-Learn’sSGDClassifierclass.Thisclassifierhastheadvantageofbeingcapableofhandlingverylargedatasetsefficiently.ThisisinpartbecauseSGDdealswithtraininginstancesindependently,oneatatime(whichalsomakesSGDwellsuitedforonlinelearning),aswewillseelater.Let’screateanSGDClassifierandtrainitonthewholetrainingset:

fromsklearn.linear_modelimportSGDClassifier

sgd_clf=SGDClassifier(random_state=42)

sgd_clf.fit(X_train,y_train_5)

TIPTheSGDClassifierreliesonrandomnessduringtraining(hencethename“stochastic”).Ifyouwantreproducibleresults,youshouldsettherandom_stateparameter.

Nowyoucanuseittodetectimagesofthenumber5:

>>>sgd_clf.predict([some_digit])

array([True],dtype=bool)

Theclassifierguessesthatthisimagerepresentsa5(True).Lookslikeitguessedrightinthisparticularcase!Now,let’sevaluatethismodel’sperformance.


Performance Measures

Evaluating a classifier is often significantly trickier than evaluating a regressor, so we will spend a large part of this chapter on this topic. There are many performance measures available, so grab another coffee and get ready to learn many new concepts and acronyms!


Measuring Accuracy Using Cross-Validation

A good way to evaluate a model is to use cross-validation, just as you did in Chapter 2.

IMPLEMENTING CROSS-VALIDATION

Occasionally you will need more control over the cross-validation process than what Scikit-Learn provides off-the-shelf. In these cases, you can implement cross-validation yourself; it is actually fairly straightforward. The following code does roughly the same thing as Scikit-Learn's cross_val_score() function, and prints the same result:

from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone

skfolds = StratifiedKFold(n_splits=3, random_state=42)

for train_index, test_index in skfolds.split(X_train, y_train_5):
    clone_clf = clone(sgd_clf)
    X_train_folds = X_train[train_index]
    y_train_folds = (y_train_5[train_index])
    X_test_fold = X_train[test_index]
    y_test_fold = (y_train_5[test_index])

    clone_clf.fit(X_train_folds, y_train_folds)
    y_pred = clone_clf.predict(X_test_fold)
    n_correct = sum(y_pred == y_test_fold)
    print(n_correct / len(y_pred))  # prints 0.9502, 0.96565 and 0.96495

The StratifiedKFold class performs stratified sampling (as explained in Chapter 2) to produce folds that contain a representative ratio of each class. At each iteration the code creates a clone of the classifier, trains that clone on the training folds, and makes predictions on the test fold. Then it counts the number of correct predictions and outputs the ratio of correct predictions.

Let’susethecross_val_score()functiontoevaluateyourSGDClassifiermodelusingK-foldcross-validation,withthreefolds.RememberthatK-foldcross-validationmeanssplittingthetrainingsetintoK-folds(inthiscase,three),thenmakingpredictionsandevaluatingthemoneachfoldusingamodeltrainedontheremainingfolds(seeChapter2):

>>>fromsklearn.model_selectionimportcross_val_score

>>>cross_val_score(sgd_clf,X_train,y_train_5,cv=3,scoring="accuracy")

array([0.9502,0.96565,0.96495])

Wow!Above95%accuracy(ratioofcorrectpredictions)onallcross-validationfolds?Thislooksamazing,doesn’tit?Well,beforeyougettooexcited,let’slookataverydumbclassifierthatjustclassifieseverysingleimageinthe“not-5”class:

fromsklearn.baseimportBaseEstimator

classNever5Classifier(BaseEstimator):

deffit(self,X,y=None):

pass

defpredict(self,X):

returnnp.zeros((len(X),1),dtype=bool)

Canyouguessthismodel’saccuracy?Let’sfindout:

>>>never_5_clf=Never5Classifier()

>>>cross_val_score(never_5_clf,X_train,y_train_5,cv=3,scoring="accuracy")

array([0.909,0.90715,0.9128])

Page 119: index-of.esindex-of.es/Varios-2/Hands on Machine Learning with... · Revision History for the First Edition 2017-03-10: First Release 2017-06-09: Second Release See for release

That’sright,ithasover90%accuracy!Thisissimplybecauseonlyabout10%oftheimagesare5s,soifyoualwaysguessthatanimageisnota5,youwillberightabout90%ofthetime.BeatsNostradamus.

Thisdemonstrateswhyaccuracyisgenerallynotthepreferredperformancemeasureforclassifiers,especiallywhenyouaredealingwithskeweddatasets(i.e.,whensomeclassesaremuchmorefrequentthanothers).


Confusion Matrix

A much better way to evaluate the performance of a classifier is to look at the confusion matrix. The general idea is to count the number of times instances of class A are classified as class B. For example, to know the number of times the classifier confused images of 5s with 3s, you would look in the 5th row and 3rd column of the confusion matrix.

To compute the confusion matrix, you first need to have a set of predictions, so they can be compared to the actual targets. You could make predictions on the test set, but let's keep it untouched for now (remember that you want to use the test set only at the very end of your project, once you have a classifier that you are ready to launch). Instead, you can use the cross_val_predict() function:

from sklearn.model_selection import cross_val_predict

y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)

Just like the cross_val_score() function, cross_val_predict() performs K-fold cross-validation, but instead of returning the evaluation scores, it returns the predictions made on each test fold. This means that you get a clean prediction for each instance in the training set ("clean" meaning that the prediction is made by a model that never saw the data during training).

Now you are ready to get the confusion matrix using the confusion_matrix() function. Just pass it the target classes (y_train_5) and the predicted classes (y_train_pred):

>>> from sklearn.metrics import confusion_matrix
>>> confusion_matrix(y_train_5, y_train_pred)
array([[53272,  1307],
       [ 1077,  4344]])

Each row in a confusion matrix represents an actual class, while each column represents a predicted class. The first row of this matrix considers non-5 images (the negative class): 53,272 of them were correctly classified as non-5s (they are called true negatives), while the remaining 1,307 were wrongly classified as 5s (false positives). The second row considers the images of 5s (the positive class): 1,077 were wrongly classified as non-5s (false negatives), while the remaining 4,344 were correctly classified as 5s (true positives). A perfect classifier would have only true positives and true negatives, so its confusion matrix would have nonzero values only on its main diagonal (top left to bottom right):

>>> confusion_matrix(y_train_5, y_train_perfect_predictions)
array([[54579,     0],
       [    0,  5421]])

The confusion matrix gives you a lot of information, but sometimes you may prefer a more concise metric. An interesting one to look at is the accuracy of the positive predictions; this is called the precision of the classifier (Equation 3-1).

Equation 3-1. Precision

$\text{precision} = \dfrac{TP}{TP + FP}$


TP is the number of true positives, and FP is the number of false positives.

A trivial way to have perfect precision is to make one single positive prediction and ensure it is correct (precision = 1/1 = 100%). This would not be very useful since the classifier would ignore all but one positive instance. So precision is typically used along with another metric named recall, also called sensitivity or true positive rate (TPR): this is the ratio of positive instances that are correctly detected by the classifier (Equation 3-2).

Equation 3-2. Recall

$\text{recall} = \dfrac{TP}{TP + FN}$

FN is of course the number of false negatives.
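As a quick sanity check, you can compute both metrics by hand from the confusion matrix obtained above (a minimal sketch; ravel() flattens the 2 × 2 matrix):

tn, fp, fn, tp = confusion_matrix(y_train_5, y_train_pred).ravel()
precision = tp / (tp + fp)   # 4344 / (4344 + 1307) ≈ 0.769
recall = tp / (tp + fn)      # 4344 / (4344 + 1077) ≈ 0.801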

If you are confused about the confusion matrix, Figure 3-2 may help.

Figure 3-2. An illustrated confusion matrix


Precision and Recall

Scikit-Learn provides several functions to compute classifier metrics, including precision and recall:

>>> from sklearn.metrics import precision_score, recall_score
>>> precision_score(y_train_5, y_train_pred)  # == 4344 / (4344 + 1307)
0.76871350203503808
>>> recall_score(y_train_5, y_train_pred)     # == 4344 / (4344 + 1077)
0.80132816823464303

Now your 5-detector does not look as shiny as it did when you looked at its accuracy. When it claims an image represents a 5, it is correct only 77% of the time. Moreover, it only detects 80% of the 5s.

It is often convenient to combine precision and recall into a single metric called the F1 score, in particular if you need a simple way to compare two classifiers. The F1 score is the harmonic mean of precision and recall (Equation 3-3). Whereas the regular mean treats all values equally, the harmonic mean gives much more weight to low values. As a result, the classifier will only get a high F1 score if both recall and precision are high.

Equation 3-3. F1 score

$F_1 = \dfrac{2}{\frac{1}{\text{precision}} + \frac{1}{\text{recall}}} = 2 \times \dfrac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} = \dfrac{TP}{TP + \frac{FN + FP}{2}}$

To compute the F1 score, simply call the f1_score() function:

>>> from sklearn.metrics import f1_score
>>> f1_score(y_train_5, y_train_pred)
0.78468208092485547

The F1 score favors classifiers that have similar precision and recall. This is not always what you want: in some contexts you mostly care about precision, and in other contexts you really care about recall. For example, if you trained a classifier to detect videos that are safe for kids, you would probably prefer a classifier that rejects many good videos (low recall) but keeps only safe ones (high precision), rather than a classifier that has a much higher recall but lets a few really bad videos show up in your product (in such cases, you may even want to add a human pipeline to check the classifier's video selection). On the other hand, suppose you train a classifier to detect shoplifters on surveillance images: it is probably fine if your classifier has only 30% precision as long as it has 99% recall (sure, the security guards will get a few false alerts, but almost all shoplifters will get caught).

Unfortunately, you can't have it both ways: increasing precision reduces recall, and vice versa. This is called the precision/recall tradeoff.


Precision/Recall Tradeoff

To understand this tradeoff, let's look at how the SGDClassifier makes its classification decisions. For each instance, it computes a score based on a decision function, and if that score is greater than a threshold, it assigns the instance to the positive class, or else it assigns it to the negative class. Figure 3-3 shows a few digits positioned from the lowest score on the left to the highest score on the right. Suppose the decision threshold is positioned at the central arrow (between the two 5s): you will find 4 true positives (actual 5s) on the right of that threshold, and one false positive (actually a 6). Therefore, with that threshold, the precision is 80% (4 out of 5). But out of 6 actual 5s, the classifier only detects 4, so the recall is 67% (4 out of 6). Now if you raise the threshold (move it to the arrow on the right), the false positive (the 6) becomes a true negative, thereby increasing precision (up to 100% in this case), but one true positive becomes a false negative, decreasing recall down to 50%. Conversely, lowering the threshold increases recall and reduces precision.

Figure 3-3. Decision threshold and precision/recall tradeoff

Scikit-Learn does not let you set the threshold directly, but it does give you access to the decision scores that it uses to make predictions. Instead of calling the classifier's predict() method, you can call its decision_function() method, which returns a score for each instance, and then make predictions based on those scores using any threshold you want:

>>> y_scores = sgd_clf.decision_function([some_digit])
>>> y_scores
array([ 161855.74572176])
>>> threshold = 0
>>> y_some_digit_pred = (y_scores > threshold)
array([ True], dtype=bool)

The SGDClassifier uses a threshold equal to 0, so the previous code returns the same result as the predict() method (i.e., True). Let's raise the threshold:

>>> threshold = 200000
>>> y_some_digit_pred = (y_scores > threshold)
>>> y_some_digit_pred
array([False], dtype=bool)

This confirms that raising the threshold decreases recall. The image actually represents a 5, and the classifier detects it when the threshold is 0, but it misses it when the threshold is increased to 200,000.


So how can you decide which threshold to use? For this you will first need to get the scores of all instances in the training set using the cross_val_predict() function again, but this time specifying that you want it to return decision scores instead of predictions:

y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3,
                             method="decision_function")

Now with these scores you can compute precision and recall for all possible thresholds using the precision_recall_curve() function:

from sklearn.metrics import precision_recall_curve

precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)

Finally, you can plot precision and recall as functions of the threshold value using Matplotlib (Figure 3-4):

def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision")
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall")
    plt.xlabel("Threshold")
    plt.legend(loc="upper left")
    plt.ylim([0, 1])

plot_precision_recall_vs_threshold(precisions, recalls, thresholds)
plt.show()

Figure 3-4. Precision and recall versus the decision threshold

NOTE
You may wonder why the precision curve is bumpier than the recall curve in Figure 3-4. The reason is that precision may sometimes go down when you raise the threshold (although in general it will go up). To understand why, look back at Figure 3-3 and notice what happens when you start from the central threshold and move it just one digit to the right: precision goes from 4/5 (80%) down to 3/4 (75%). On the other hand, recall can only go down when the threshold is increased, which explains why its curve looks smooth.


Now you can simply select the threshold value that gives you the best precision/recall tradeoff for your task. Another way to select a good precision/recall tradeoff is to plot precision directly against recall, as shown in Figure 3-5.

Figure 3-5. Precision versus recall

You can see that precision really starts to fall sharply around 80% recall. You will probably want to select a precision/recall tradeoff just before that drop, for example at around 60% recall. But of course the choice depends on your project.

So let's suppose you decide to aim for 90% precision. You look up the first plot (zooming in a bit) and find that you need to use a threshold of about 70,000. To make predictions (on the training set for now), instead of calling the classifier's predict() method, you can just run this code:

y_train_pred_90 = (y_scores > 70000)

Let's check these predictions' precision and recall:

>>> precision_score(y_train_5, y_train_pred_90)
0.86592051164915484
>>> recall_score(y_train_5, y_train_pred_90)
0.69931746910164172

Great, you have a 90% precision classifier (or close enough)! As you can see, it is fairly easy to create a classifier with virtually any precision you want: just set a high enough threshold, and you're done. Hmm, not so fast. A high-precision classifier is not very useful if its recall is too low!


TIP
If someone says "let's reach 99% precision," you should ask, "at what recall?"


The ROC Curve

The receiver operating characteristic (ROC) curve is another common tool used with binary classifiers. It is very similar to the precision/recall curve, but instead of plotting precision versus recall, the ROC curve plots the true positive rate (another name for recall) against the false positive rate. The FPR is the ratio of negative instances that are incorrectly classified as positive. It is equal to one minus the true negative rate, which is the ratio of negative instances that are correctly classified as negative. The TNR is also called specificity. Hence the ROC curve plots sensitivity (recall) versus 1 – specificity.
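To make these definitions concrete, here is a minimal sketch that computes the TPR, FPR, and TNR by hand from a 2 × 2 confusion matrix (the tiny label arrays are invented purely for illustration):

import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # hypothetical labels
y_pred = np.array([0, 0, 1, 1, 0, 1, 1, 1])  # hypothetical predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
tpr = tp / (tp + fn)  # sensitivity (recall): 3/4 = 0.75
fpr = fp / (fp + tn)  # 1 - specificity: 2/4 = 0.5
tnr = tn / (tn + fp)  # specificity: 2/4 = 0.5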

To plot the ROC curve, you first need to compute the TPR and FPR for various threshold values, using the roc_curve() function:

from sklearn.metrics import roc_curve

fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)

Then you can plot the FPR against the TPR using Matplotlib. This code produces the plot in Figure 3-6:

def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'k--')
    plt.axis([0, 1, 0, 1])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')

plot_roc_curve(fpr, tpr)
plt.show()

Figure 3-6. ROC curve

Once again there is a tradeoff: the higher the recall (TPR), the more false positives (FPR) the classifier produces. The dotted line represents the ROC curve of a purely random classifier; a good classifier stays as far away from that line as possible (toward the top-left corner).

One way to compare classifiers is to measure the area under the curve (AUC). A perfect classifier will have a ROC AUC equal to 1, whereas a purely random classifier will have a ROC AUC equal to 0.5. Scikit-Learn provides a function to compute the ROC AUC:

>>> from sklearn.metrics import roc_auc_score
>>> roc_auc_score(y_train_5, y_scores)
0.96244965559671547

TIP
Since the ROC curve is so similar to the precision/recall (or PR) curve, you may wonder how to decide which one to use. As a rule of thumb, you should prefer the PR curve whenever the positive class is rare or when you care more about the false positives than the false negatives, and the ROC curve otherwise. For example, looking at the previous ROC curve (and the ROC AUC score), you may think that the classifier is really good. But this is mostly because there are few positives (5s) compared to the negatives (non-5s). In contrast, the PR curve makes it clear that the classifier has room for improvement (the curve could be closer to the top-right corner).

Let's train a RandomForestClassifier and compare its ROC curve and ROC AUC score to the SGDClassifier. First, you need to get scores for each instance in the training set. But due to the way it works (see Chapter 7), the RandomForestClassifier class does not have a decision_function() method. Instead it has a predict_proba() method. Scikit-Learn classifiers generally have one or the other. The predict_proba() method returns an array containing a row per instance and a column per class, each containing the probability that the given instance belongs to the given class (e.g., 70% chance that the image represents a 5):

from sklearn.ensemble import RandomForestClassifier

forest_clf = RandomForestClassifier(random_state=42)
y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3,
                                    method="predict_proba")

But to plot a ROC curve, you need scores, not probabilities. A simple solution is to use the positive class's probability as the score:

y_scores_forest = y_probas_forest[:, 1]  # score = proba of positive class
fpr_forest, tpr_forest, thresholds_forest = roc_curve(y_train_5, y_scores_forest)

Now you are ready to plot the ROC curve. It is useful to plot the first ROC curve as well to see how they compare (Figure 3-7):

plt.plot(fpr, tpr, "b:", label="SGD")
plot_roc_curve(fpr_forest, tpr_forest, "Random Forest")
plt.legend(loc="lower right")
plt.show()


Figure 3-7. Comparing ROC curves

As you can see in Figure 3-7, the RandomForestClassifier's ROC curve looks much better than the SGDClassifier's: it comes much closer to the top-left corner. As a result, its ROC AUC score is also significantly better:

>>> roc_auc_score(y_train_5, y_scores_forest)
0.99312433660038291

Try measuring the precision and recall scores: you should find 98.5% precision and 82.8% recall. Not too bad!

Hopefully you now know how to train binary classifiers, choose the appropriate metric for your task, evaluate your classifiers using cross-validation, select the precision/recall tradeoff that fits your needs, and compare various models using ROC curves and ROC AUC scores. Now let's try to detect more than just the 5s.


Multiclass Classification

Whereas binary classifiers distinguish between two classes, multiclass classifiers (also called multinomial classifiers) can distinguish between more than two classes.

Some algorithms (such as Random Forest classifiers or naive Bayes classifiers) are capable of handling multiple classes directly. Others (such as Support Vector Machine classifiers or Linear classifiers) are strictly binary classifiers. However, there are various strategies that you can use to perform multiclass classification using multiple binary classifiers.

For example, one way to create a system that can classify the digit images into 10 classes (from 0 to 9) is to train 10 binary classifiers, one for each digit (a 0-detector, a 1-detector, a 2-detector, and so on). Then when you want to classify an image, you get the decision score from each classifier for that image and you select the class whose classifier outputs the highest score. This is called the one-versus-all (OvA) strategy (also called one-versus-the-rest).

Another strategy is to train a binary classifier for every pair of digits: one to distinguish 0s and 1s, another to distinguish 0s and 2s, another for 1s and 2s, and so on. This is called the one-versus-one (OvO) strategy. If there are N classes, you need to train N × (N – 1) / 2 classifiers. For the MNIST problem, this means training 45 binary classifiers! When you want to classify an image, you have to run the image through all 45 classifiers and see which class wins the most duels. The main advantage of OvO is that each classifier only needs to be trained on the part of the training set for the two classes that it must distinguish.

Some algorithms (such as Support Vector Machine classifiers) scale poorly with the size of the training set, so for these algorithms OvO is preferred since it is faster to train many classifiers on small training sets than training few classifiers on large training sets. For most binary classification algorithms, however, OvA is preferred.

Scikit-Learn detects when you try to use a binary classification algorithm for a multiclass classification task, and it automatically runs OvA (except for SVM classifiers for which it uses OvO). Let's try this with the SGDClassifier:

>>> sgd_clf.fit(X_train, y_train)  # y_train, not y_train_5
>>> sgd_clf.predict([some_digit])
array([ 5.])

That was easy! This code trains the SGDClassifier on the training set using the original target classes from 0 to 9 (y_train), instead of the 5-versus-all target classes (y_train_5). Then it makes a prediction (a correct one in this case). Under the hood, Scikit-Learn actually trained 10 binary classifiers, got their decision scores for the image, and selected the class with the highest score.

To see that this is indeed the case, you can call the decision_function() method. Instead of returning just one score per instance, it now returns 10 scores, one per class:

>>> some_digit_scores = sgd_clf.decision_function([some_digit])
>>> some_digit_scores
array([[-311402.62954431, -363517.28355739, -446449.5306454 ,
        -183226.61023518, -414337.15339485,  161855.74572176,
        -452576.39616343, -471957.14962573, -518542.33997148,
        -536774.63961222]])

The highest score is indeed the one corresponding to class 5:

>>> np.argmax(some_digit_scores)
5
>>> sgd_clf.classes_
array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9.])
>>> sgd_clf.classes_[5]
5.0

WARNING
When a classifier is trained, it stores the list of target classes in its classes_ attribute, ordered by value. In this case, the index of each class in the classes_ array conveniently matches the class itself (e.g., the class at index 5 happens to be class 5), but in general you won't be so lucky.

If you want to force Scikit-Learn to use one-versus-one or one-versus-all, you can use the OneVsOneClassifier or OneVsRestClassifier classes. Simply create an instance and pass a binary classifier to its constructor. For example, this code creates a multiclass classifier using the OvO strategy, based on a SGDClassifier:

>>> from sklearn.multiclass import OneVsOneClassifier
>>> ovo_clf = OneVsOneClassifier(SGDClassifier(random_state=42))
>>> ovo_clf.fit(X_train, y_train)
>>> ovo_clf.predict([some_digit])
array([ 5.])
>>> len(ovo_clf.estimators_)
45
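Forcing the OvA strategy works the same way; here is a minimal sketch (not in the original example) using OneVsRestClassifier, which trains one binary classifier per class:

>>> from sklearn.multiclass import OneVsRestClassifier
>>> ovr_clf = OneVsRestClassifier(SGDClassifier(random_state=42))
>>> ovr_clf.fit(X_train, y_train)
>>> len(ovr_clf.estimators_)
10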

Training a RandomForestClassifier is just as easy:

>>> forest_clf.fit(X_train, y_train)
>>> forest_clf.predict([some_digit])
array([ 5.])

This time Scikit-Learn did not have to run OvA or OvO because Random Forest classifiers can directly classify instances into multiple classes. You can call predict_proba() to get the list of probabilities that the classifier assigned to each instance for each class:

>>> forest_clf.predict_proba([some_digit])
array([[ 0.1,  0. ,  0. ,  0.1,  0. ,  0.8,  0. ,  0. ,  0. ,  0. ]])

You can see that the classifier is fairly confident about its prediction: the 0.8 at the 5th index in the array means that the model estimates an 80% probability that the image represents a 5. It also thinks that the image could instead be a 0 or a 3 (10% chance each).

Now of course you want to evaluate these classifiers. As usual, you want to use cross-validation. Let's evaluate the SGDClassifier's accuracy using the cross_val_score() function:


>>> cross_val_score(sgd_clf, X_train, y_train, cv=3, scoring="accuracy")
array([ 0.84063187,  0.84899245,  0.86652998])

It gets over 84% on all test folds. If you used a random classifier, you would get 10% accuracy, so this is not such a bad score, but you can still do much better. For example, simply scaling the inputs (as discussed in Chapter 2) increases accuracy above 90%:

>>> from sklearn.preprocessing import StandardScaler
>>> scaler = StandardScaler()
>>> X_train_scaled = scaler.fit_transform(X_train.astype(np.float64))
>>> cross_val_score(sgd_clf, X_train_scaled, y_train, cv=3, scoring="accuracy")
array([ 0.91011798,  0.90874544,  0.906636  ])


Error Analysis

Of course, if this were a real project, you would follow the steps in your Machine Learning project checklist (see Appendix B): exploring data preparation options, trying out multiple models, shortlisting the best ones and fine-tuning their hyperparameters using GridSearchCV, and automating as much as possible, as you did in the previous chapter. Here, we will assume that you have found a promising model and you want to find ways to improve it. One way to do this is to analyze the types of errors it makes.

First, you can look at the confusion matrix. You need to make predictions using the cross_val_predict() function, then call the confusion_matrix() function, just like you did earlier:

>>> y_train_pred = cross_val_predict(sgd_clf, X_train_scaled, y_train, cv=3)
>>> conf_mx = confusion_matrix(y_train, y_train_pred)
>>> conf_mx
array([[5725,    3,   24,    9,   10,   49,   50,   10,   39,    4],
       [   2, 6493,   43,   25,    7,   40,    5,   10,  109,    8],
       [  51,   41, 5321,  104,   89,   26,   87,   60,  166,   13],
       [  47,   46,  141, 5342,    1,  231,   40,   50,  141,   92],
       [  19,   29,   41,   10, 5366,    9,   56,   37,   86,  189],
       [  73,   45,   36,  193,   64, 4582,  111,   30,  193,   94],
       [  29,   34,   44,    2,   42,   85, 5627,   10,   45,    0],
       [  25,   24,   74,   32,   54,   12,    6, 5787,   15,  236],
       [  52,  161,   73,  156,   10,  163,   61,   25, 5027,  123],
       [  43,   35,   26,   92,  178,   28,    2,  223,   82, 5240]])

That's a lot of numbers. It's often more convenient to look at an image representation of the confusion matrix, using Matplotlib's matshow() function:

plt.matshow(conf_mx, cmap=plt.cm.gray)
plt.show()


This confusion matrix looks fairly good, since most images are on the main diagonal, which means that they were classified correctly. The 5s look slightly darker than the other digits, which could mean that there are fewer images of 5s in the dataset or that the classifier does not perform as well on 5s as on other digits. In fact, you can verify that both are the case.

Let's focus the plot on the errors. First, you need to divide each value in the confusion matrix by the number of images in the corresponding class, so you can compare error rates instead of absolute numbers of errors (which would make abundant classes look unfairly bad):

row_sums = conf_mx.sum(axis=1, keepdims=True)
norm_conf_mx = conf_mx / row_sums

Now let's fill the diagonal with zeros to keep only the errors, and let's plot the result:

np.fill_diagonal(norm_conf_mx, 0)
plt.matshow(norm_conf_mx, cmap=plt.cm.gray)
plt.show()


Now you can clearly see the kinds of errors the classifier makes. Remember that rows represent actual classes, while columns represent predicted classes. The columns for classes 8 and 9 are quite bright, which tells you that many images get misclassified as 8s or 9s. Similarly, the rows for classes 8 and 9 are also quite bright, telling you that 8s and 9s are often confused with other digits. Conversely, some rows are pretty dark, such as row 1: this means that most 1s are classified correctly (a few are confused with 8s, but that's about it). Notice that the errors are not perfectly symmetrical; for example, there are more 5s misclassified as 8s than the reverse.

Analyzing the confusion matrix can often give you insights on ways to improve your classifier. Looking at this plot, it seems that your efforts should be spent on improving classification of 8s and 9s, as well as fixing the specific 3/5 confusion. For example, you could try to gather more training data for these digits. Or you could engineer new features that would help the classifier — for example, writing an algorithm to count the number of closed loops (e.g., 8 has two, 6 has one, 5 has none). Or you could preprocess the images (e.g., using Scikit-Image, Pillow, or OpenCV) to make some patterns stand out more, such as closed loops.

Analyzing individual errors can also be a good way to gain insights on what your classifier is doing and why it is failing, but it is more difficult and time-consuming. For example, let's plot examples of 3s and 5s (the plot_digits() function just uses Matplotlib's imshow() function; see this chapter's Jupyter notebook for details):

cl_a, cl_b = 3, 5
X_aa = X_train[(y_train == cl_a) & (y_train_pred == cl_a)]
X_ab = X_train[(y_train == cl_a) & (y_train_pred == cl_b)]
X_ba = X_train[(y_train == cl_b) & (y_train_pred == cl_a)]
X_bb = X_train[(y_train == cl_b) & (y_train_pred == cl_b)]

plt.figure(figsize=(8, 8))
plt.subplot(221); plot_digits(X_aa[:25], images_per_row=5)
plt.subplot(222); plot_digits(X_ab[:25], images_per_row=5)
plt.subplot(223); plot_digits(X_ba[:25], images_per_row=5)
plt.subplot(224); plot_digits(X_bb[:25], images_per_row=5)
plt.show()

The two 5 × 5 blocks on the left show digits classified as 3s, and the two 5 × 5 blocks on the right show images classified as 5s. Some of the digits that the classifier gets wrong (i.e., in the bottom-left and top-right blocks) are so badly written that even a human would have trouble classifying them (e.g., the 5 on the 8th row and 1st column truly looks like a 3). However, most misclassified images seem like obvious errors to us, and it's hard to understand why the classifier made the mistakes it did.3 The reason is that we used a simple SGDClassifier, which is a linear model. All it does is assign a weight per class to each pixel, and when it sees a new image it just sums up the weighted pixel intensities to get a score for each class. So since 3s and 5s differ only by a few pixels, this model will easily confuse them.


The main difference between 3s and 5s is the position of the small line that joins the top line to the bottom arc. If you draw a 3 with the junction slightly shifted to the left, the classifier might classify it as a 5, and vice versa. In other words, this classifier is quite sensitive to image shifting and rotation. So one way to reduce the 3/5 confusion would be to preprocess the images to ensure that they are well centered and not too rotated. This will probably help reduce other errors as well.


Multilabel Classification

Until now each instance has always been assigned to just one class. In some cases you may want your classifier to output multiple classes for each instance. For example, consider a face-recognition classifier: what should it do if it recognizes several people on the same picture? Of course it should attach one label per person it recognizes. Say the classifier has been trained to recognize three faces, Alice, Bob, and Charlie; then when it is shown a picture of Alice and Charlie, it should output [1, 0, 1] (meaning "Alice yes, Bob no, Charlie yes"). Such a classification system that outputs multiple binary labels is called a multilabel classification system.

We won't go into face recognition just yet, but let's look at a simpler example, just for illustration purposes:

from sklearn.neighbors import KNeighborsClassifier

y_train_large = (y_train >= 7)
y_train_odd = (y_train % 2 == 1)
y_multilabel = np.c_[y_train_large, y_train_odd]

knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_multilabel)

This code creates a y_multilabel array containing two target labels for each digit image: the first indicates whether or not the digit is large (7, 8, or 9) and the second indicates whether or not it is odd. The next lines create a KNeighborsClassifier instance (which supports multilabel classification, but not all classifiers do) and we train it using the multiple targets array. Now you can make a prediction, and notice that it outputs two labels:

>>> knn_clf.predict([some_digit])
array([[False,  True]], dtype=bool)

And it gets it right! The digit 5 is indeed not large (False) and odd (True).

There are many ways to evaluate a multilabel classifier, and selecting the right metric really depends on your project. For example, one approach is to measure the F1 score for each individual label (or any other binary classifier metric discussed earlier), then simply compute the average score. This code computes the average F1 score across all labels:

>>> y_train_knn_pred = cross_val_predict(knn_clf, X_train, y_multilabel, cv=3)
>>> f1_score(y_multilabel, y_train_knn_pred, average="macro")
0.96845540180280221

This assumes that all labels are equally important, which may not be the case. In particular, if you have many more pictures of Alice than of Bob or Charlie, you may want to give more weight to the classifier's score on pictures of Alice. One simple option is to give each label a weight equal to its support (i.e., the number of instances with that target label). To do this, simply set average="weighted" in the preceding code.4
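As a quick sketch, that variant is just a one-argument change to the call above (the resulting score is not shown in the original text):

f1_score(y_multilabel, y_train_knn_pred, average="weighted")  # support-weighted average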


Multioutput Classification

The last type of classification task we are going to discuss here is called multioutput-multiclass classification (or simply multioutput classification). It is simply a generalization of multilabel classification where each label can be multiclass (i.e., it can have more than two possible values).

To illustrate this, let's build a system that removes noise from images. It will take as input a noisy digit image, and it will (hopefully) output a clean digit image, represented as an array of pixel intensities, just like the MNIST images. Notice that the classifier's output is multilabel (one label per pixel) and each label can have multiple values (pixel intensity ranges from 0 to 255). It is thus an example of a multioutput classification system.

NOTE
The line between classification and regression is sometimes blurry, such as in this example. Arguably, predicting pixel intensity is more akin to regression than to classification. Moreover, multioutput systems are not limited to classification tasks; you could even have a system that outputs multiple labels per instance, including both class labels and value labels.

Let's start by creating the training and test sets by taking the MNIST images and adding noise to their pixel intensities using NumPy's randint() function. The target images will be the original images:

noise = np.random.randint(0, 100, (len(X_train), 784))
X_train_mod = X_train + noise
noise = np.random.randint(0, 100, (len(X_test), 784))
X_test_mod = X_test + noise
y_train_mod = X_train
y_test_mod = X_test

Let's take a peek at an image from the test set (yes, we're snooping on the test data, so you should be frowning right now):

On the left is the noisy input image, and on the right is the clean target image. Now let's train the classifier and make it clean this image:

knn_clf.fit(X_train_mod, y_train_mod)
clean_digit = knn_clf.predict([X_test_mod[some_index]])
plot_digit(clean_digit)

Looks close enough to the target! This concludes our tour of classification. Hopefully you should now know how to select good metrics for classification tasks, pick the appropriate precision/recall tradeoff, compare classifiers, and more generally build good classification systems for a variety of tasks.


Exercises

1. Try to build a classifier for the MNIST dataset that achieves over 97% accuracy on the test set. Hint: the KNeighborsClassifier works quite well for this task; you just need to find good hyperparameter values (try a grid search on the weights and n_neighbors hyperparameters).

2. Write a function that can shift an MNIST image in any direction (left, right, up, or down) by one pixel.5 Then, for each image in the training set, create four shifted copies (one per direction) and add them to the training set. Finally, train your best model on this expanded training set and measure its accuracy on the test set. You should observe that your model performs even better now! This technique of artificially growing the training set is called data augmentation or training set expansion. (A possible sketch of such a shift function is shown below.)
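One possible sketch of that shift function (a hint, not the official solution), using the scipy.ndimage.interpolation.shift() function mentioned in the footnote:

from scipy.ndimage.interpolation import shift

def shift_image(image, dx, dy):
    # Shift a flattened 28x28 MNIST image dx pixels to the right
    # and dy pixels down, filling the uncovered pixels with zeros.
    image = image.reshape(28, 28)
    shifted_image = shift(image, [dy, dx], cval=0)
    return shifted_image.reshape(784)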

3. Tackle the Titanic dataset. A great place to start is on Kaggle.

4. Build a spam classifier (a more challenging exercise):

Download examples of spam and ham from Apache SpamAssassin's public datasets.

Unzip the datasets and familiarize yourself with the data format.

Split the datasets into a training set and a test set.

Write a data preparation pipeline to convert each email into a feature vector. Your preparation pipeline should transform an email into a (sparse) vector indicating the presence or absence of each possible word. For example, if all emails only ever contain four words, "Hello," "how," "are," "you," then the email "Hello you Hello Hello you" would be converted into a vector [1, 0, 0, 1] (meaning ["Hello" is present, "how" is absent, "are" is absent, "you" is present]), or [3, 0, 0, 2] if you prefer to count the number of occurrences of each word.

You may want to add hyperparameters to your preparation pipeline to control whether or not to strip off email headers, convert each email to lowercase, remove punctuation, replace all URLs with "URL," replace all numbers with "NUMBER," or even perform stemming (i.e., trim off word endings; there are Python libraries available to do this).

Then try out several classifiers and see if you can build a great spam classifier, with both high recall and high precision.

Solutions to these exercises are available in the online Jupyter notebooks at https://github.com/ageron/handson-ml.

1. By default Scikit-Learn caches downloaded datasets in a directory called $HOME/scikit_learn_data.

2. Shuffling may be a bad idea in some contexts — for example, if you are working on time series data (such as stock market prices or weather conditions). We will explore this in the next chapters.

3. But remember that our brain is a fantastic pattern recognition system, and our visual system does a lot of complex preprocessing before any information reaches our consciousness, so the fact that it feels simple does not mean that it is.


4. Scikit-Learn offers a few other averaging options and multilabel classifier metrics; see the documentation for more details.

5. You can use the shift() function from the scipy.ndimage.interpolation module. For example, shift(image, [2, 1], cval=0) shifts the image 2 pixels down and 1 pixel to the right.


Chapter 4. Training Models

So far we have treated Machine Learning models and their training algorithms mostly like black boxes. If you went through some of the exercises in the previous chapters, you may have been surprised by how much you can get done without knowing anything about what's under the hood: you optimized a regression system, you improved a digit image classifier, and you even built a spam classifier from scratch — all this without knowing how they actually work. Indeed, in many situations you don't really need to know the implementation details.

However, having a good understanding of how things work can help you quickly home in on the appropriate model, the right training algorithm to use, and a good set of hyperparameters for your task. Understanding what's under the hood will also help you debug issues and perform error analysis more efficiently. Lastly, most of the topics discussed in this chapter will be essential in understanding, building, and training neural networks (discussed in Part II of this book).

In this chapter, we will start by looking at the Linear Regression model, one of the simplest models there is. We will discuss two very different ways to train it:

Using a direct "closed-form" equation that directly computes the model parameters that best fit the model to the training set (i.e., the model parameters that minimize the cost function over the training set).

Using an iterative optimization approach, called Gradient Descent (GD), that gradually tweaks the model parameters to minimize the cost function over the training set, eventually converging to the same set of parameters as the first method. We will look at a few variants of Gradient Descent that we will use again and again when we study neural networks in Part II: Batch GD, Mini-batch GD, and Stochastic GD.

Next we will look at Polynomial Regression, a more complex model that can fit nonlinear datasets. Since this model has more parameters than Linear Regression, it is more prone to overfitting the training data, so we will look at how to detect whether or not this is the case, using learning curves, and then we will look at several regularization techniques that can reduce the risk of overfitting the training set.

Finally, we will look at two more models that are commonly used for classification tasks: Logistic Regression and Softmax Regression.

WARNING
There will be quite a few math equations in this chapter, using basic notions of linear algebra and calculus. To understand these equations, you will need to know what vectors and matrices are, how to transpose them, what the dot product is, what matrix inverse is, and what partial derivatives are. If you are unfamiliar with these concepts, please go through the linear algebra and calculus introductory tutorials available as Jupyter notebooks in the online supplemental material. For those who are truly allergic to mathematics, you should still go through this chapter and simply skip the equations; hopefully, the text will be sufficient to help you understand most of the concepts.


Linear Regression

In Chapter 1, we looked at a simple regression model of life satisfaction: life_satisfaction = θ0 + θ1 × GDP_per_capita.

This model is just a linear function of the input feature GDP_per_capita. θ0 and θ1 are the model's parameters.

More generally, a linear model makes a prediction by simply computing a weighted sum of the input features, plus a constant called the bias term (also called the intercept term), as shown in Equation 4-1.

Equation 4-1. Linear Regression model prediction

    \hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n

ŷ is the predicted value.

n is the number of features.

xi is the ith feature value.

θj is the jth model parameter (including the bias term θ0 and the feature weights θ1, θ2, ⋯, θn).

This can be written much more concisely using a vectorized form, as shown in Equation 4-2.

Equation 4-2. Linear Regression model prediction (vectorized form)

    \hat{y} = h_\theta(\mathbf{x}) = \theta^T \cdot \mathbf{x}

θ is the model's parameter vector, containing the bias term θ0 and the feature weights θ1 to θn.

θT is the transpose of θ (a row vector instead of a column vector).

x is the instance's feature vector, containing x0 to xn, with x0 always equal to 1.

θT · x is the dot product of θT and x.

hθ is the hypothesis function, using the model parameters θ.
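To make Equation 4-2 concrete, here is a tiny numeric sketch (the values are invented for illustration):

import numpy as np

theta = np.array([4.0, 3.0])  # bias term theta_0 = 4 and feature weight theta_1 = 3
x = np.array([1.0, 2.0])      # x_0 = 1 (always), feature value x_1 = 2
y_hat = theta.T.dot(x)        # theta^T . x = 4 * 1 + 3 * 2 = 10.0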

Okay,that’stheLinearRegressionmodel,sonowhowdowetrainit?Well,recallthattrainingamodelmeanssettingitsparameterssothatthemodelbestfitsthetrainingset.Forthispurpose,wefirstneedameasureofhowwell(orpoorly)themodelfitsthetrainingdata.InChapter2wesawthatthemostcommonperformancemeasureofaregressionmodelistheRootMeanSquareError(RMSE)(Equation2-1).Therefore,totrainaLinearRegressionmodel,youneedtofindthevalueofθthatminimizesthe

Page 145: index-of.esindex-of.es/Varios-2/Hands on Machine Learning with... · Revision History for the First Edition 2017-03-10: First Release 2017-06-09: Second Release See for release

RMSE.Inpractice,itissimplertominimizetheMeanSquareError(MSE)thantheRMSE,anditleadstothesameresult(becausethevaluethatminimizesafunctionalsominimizesitssquareroot).1

TheMSEofaLinearRegressionhypothesishθonatrainingsetXiscalculatedusingEquation4-3.

Equation4-3.MSEcostfunctionforaLinearRegressionmodel

MostofthesenotationswerepresentedinChapter2(see“Notations”).Theonlydifferenceisthatwewritehθinsteadofjusthinordertomakeitclearthatthemodelisparametrizedbythevectorθ.Tosimplifynotations,wewilljustwriteMSE(θ)insteadofMSE(X,hθ).


The Normal Equation

To find the value of θ that minimizes the cost function, there is a closed-form solution — in other words, a mathematical equation that gives the result directly. This is called the Normal Equation (Equation 4-4).2

Equation 4-4. Normal Equation

    \hat{\theta} = \left( \mathbf{X}^T \cdot \mathbf{X} \right)^{-1} \cdot \mathbf{X}^T \cdot \mathbf{y}

θ̂ is the value of θ that minimizes the cost function.

y is the vector of target values containing y(1) to y(m).

Let's generate some linear-looking data to test this equation on (Figure 4-1):

import numpy as np

X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

Figure 4-1. Randomly generated linear dataset

Now let's compute θ̂ using the Normal Equation. We will use the inv() function from NumPy's Linear Algebra module (np.linalg) to compute the inverse of a matrix, and the dot() method for matrix multiplication:

X_b = np.c_[np.ones((100, 1)), X]  # add x0 = 1 to each instance
theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)


The actual function that we used to generate the data is y = 4 + 3x1 + Gaussian noise. Let's see what the equation found:

>>> theta_best
array([[ 4.21509616],
       [ 2.77011339]])

We would have hoped for θ0 = 4 and θ1 = 3 instead of θ0 = 4.215 and θ1 = 2.770. Close enough, but the noise made it impossible to recover the exact parameters of the original function.

Now you can make predictions using θ̂:

>>> X_new = np.array([[0], [2]])
>>> X_new_b = np.c_[np.ones((2, 1)), X_new]  # add x0 = 1 to each instance
>>> y_predict = X_new_b.dot(theta_best)
>>> y_predict
array([[ 4.21509616],
       [ 9.75532293]])

Let's plot this model's predictions (Figure 4-2):

plt.plot(X_new, y_predict, "r-")
plt.plot(X, y, "b.")
plt.axis([0, 2, 0, 15])
plt.show()

Figure 4-2. Linear Regression model predictions

The equivalent code using Scikit-Learn looks like this:3

>>> from sklearn.linear_model import LinearRegression
>>> lin_reg = LinearRegression()
>>> lin_reg.fit(X, y)
>>> lin_reg.intercept_, lin_reg.coef_
(array([ 4.21509616]), array([[ 2.77011339]]))
>>> lin_reg.predict(X_new)
array([[ 4.21509616],
       [ 9.75532293]])


Computational Complexity

The Normal Equation computes the inverse of X^T · X, which is an n × n matrix (where n is the number of features). The computational complexity of inverting such a matrix is typically about O(n^2.4) to O(n^3) (depending on the implementation). In other words, if you double the number of features, you multiply the computation time by roughly 2^2.4 = 5.3 to 2^3 = 8.

WARNING
The Normal Equation gets very slow when the number of features grows large (e.g., 100,000).

On the positive side, this equation is linear with regards to the number of instances in the training set (it is O(m)), so it handles large training sets efficiently, provided they can fit in memory.

Also, once you have trained your Linear Regression model (using the Normal Equation or any other algorithm), predictions are very fast: the computational complexity is linear with regards to both the number of instances you want to make predictions on and the number of features. In other words, making predictions on twice as many instances (or twice as many features) will just take roughly twice as much time.

Now we will look at very different ways to train a Linear Regression model, better suited for cases where there are a large number of features, or too many training instances to fit in memory.


Gradient Descent

Gradient Descent is a very generic optimization algorithm capable of finding optimal solutions to a wide range of problems. The general idea of Gradient Descent is to tweak parameters iteratively in order to minimize a cost function.

Suppose you are lost in the mountains in a dense fog; you can only feel the slope of the ground below your feet. A good strategy to get to the bottom of the valley quickly is to go downhill in the direction of the steepest slope. This is exactly what Gradient Descent does: it measures the local gradient of the error function with regards to the parameter vector θ, and it goes in the direction of descending gradient. Once the gradient is zero, you have reached a minimum!

Concretely, you start by filling θ with random values (this is called random initialization), and then you improve it gradually, taking one baby step at a time, each step attempting to decrease the cost function (e.g., the MSE), until the algorithm converges to a minimum (see Figure 4-3).

Figure 4-3. Gradient Descent

An important parameter in Gradient Descent is the size of the steps, determined by the learning rate hyperparameter. If the learning rate is too small, then the algorithm will have to go through many iterations to converge, which will take a long time (see Figure 4-4).


Figure 4-4. Learning rate too small

On the other hand, if the learning rate is too high, you might jump across the valley and end up on the other side, possibly even higher up than you were before. This might make the algorithm diverge, with larger and larger values, failing to find a good solution (see Figure 4-5).

Figure 4-5. Learning rate too large

Finally, not all cost functions look like nice regular bowls. There may be holes, ridges, plateaus, and all sorts of irregular terrains, making convergence to the minimum very difficult. Figure 4-6 shows the two main challenges with Gradient Descent: if the random initialization starts the algorithm on the left, then it will converge to a local minimum, which is not as good as the global minimum. If it starts on the right, then it will take a very long time to cross the plateau, and if you stop too early you will never reach the global minimum.


Figure 4-6. Gradient Descent pitfalls

Fortunately, the MSE cost function for a Linear Regression model happens to be a convex function, which means that if you pick any two points on the curve, the line segment joining them never crosses the curve. This implies that there are no local minima, just one global minimum. It is also a continuous function with a slope that never changes abruptly.4 These two facts have a great consequence: Gradient Descent is guaranteed to approach arbitrarily close the global minimum (if you wait long enough and if the learning rate is not too high).

In fact, the cost function has the shape of a bowl, but it can be an elongated bowl if the features have very different scales. Figure 4-7 shows Gradient Descent on a training set where features 1 and 2 have the same scale (on the left), and on a training set where feature 1 has much smaller values than feature 2 (on the right).5

Figure 4-7. Gradient Descent with and without feature scaling

As you can see, on the left the Gradient Descent algorithm goes straight toward the minimum, thereby reaching it quickly, whereas on the right it first goes in a direction almost orthogonal to the direction of the global minimum, and it ends with a long march down an almost flat valley. It will eventually reach the minimum, but it will take a long time.


WARNING
When using Gradient Descent, you should ensure that all features have a similar scale (e.g., using Scikit-Learn's StandardScaler class), or else it will take much longer to converge.

This diagram also illustrates the fact that training a model means searching for a combination of model parameters that minimizes a cost function (over the training set). It is a search in the model's parameter space: the more parameters a model has, the more dimensions this space has, and the harder the search is: searching for a needle in a 300-dimensional haystack is much trickier than in three dimensions. Fortunately, since the cost function is convex in the case of Linear Regression, the needle is simply at the bottom of the bowl.


Batch Gradient Descent

To implement Gradient Descent, you need to compute the gradient of the cost function with regards to each model parameter θj. In other words, you need to calculate how much the cost function will change if you change θj just a little bit. This is called a partial derivative. It is like asking "what is the slope of the mountain under my feet if I face east?" and then asking the same question facing north (and so on for all other dimensions, if you can imagine a universe with more than three dimensions). Equation 4-5 computes the partial derivative of the cost function with regards to parameter θj, noted ∂MSE(θ)/∂θj.

Equation 4-5. Partial derivatives of the cost function

    \frac{\partial}{\partial \theta_j} \mathrm{MSE}(\theta) = \frac{2}{m} \sum_{i=1}^{m} \left( \theta^T \cdot \mathbf{x}^{(i)} - y^{(i)} \right) x_j^{(i)}

Instead of computing these partial derivatives individually, you can use Equation 4-6 to compute them all in one go. The gradient vector, noted ∇θMSE(θ), contains all the partial derivatives of the cost function (one for each model parameter).

Equation 4-6. Gradient vector of the cost function

    \nabla_\theta \mathrm{MSE}(\theta) = \frac{2}{m} \mathbf{X}^T \cdot \left( \mathbf{X} \cdot \theta - \mathbf{y} \right)

WARNING
Notice that this formula involves calculations over the full training set X, at each Gradient Descent step! This is why the algorithm is called Batch Gradient Descent: it uses the whole batch of training data at every step. As a result it is terribly slow on very large training sets (but we will see much faster Gradient Descent algorithms shortly). However, Gradient Descent scales well with the number of features; training a Linear Regression model when there are hundreds of thousands of features is much faster using Gradient Descent than using the Normal Equation.

Once you have the gradient vector, which points uphill, just go in the opposite direction to go downhill. This means subtracting ∇θMSE(θ) from θ. This is where the learning rate η comes into play:6 multiply the gradient vector by η to determine the size of the downhill step (Equation 4-7).

Equation 4-7. Gradient Descent step

    \theta^{(\text{next step})} = \theta - \eta \, \nabla_\theta \mathrm{MSE}(\theta)


Let’slookataquickimplementationofthisalgorithm:

eta=0.1#learningrate

n_iterations=1000

m=100

theta=np.random.randn(2,1)#randominitialization

foriterationinrange(n_iterations):

gradients=2/m*X_b.T.dot(X_b.dot(theta)-y)

theta=theta-eta*gradients

Thatwasn’ttoohard!Let’slookattheresultingtheta:

>>>theta

array([[4.21509616],

[2.77011339]])

Hey,that’sexactlywhattheNormalEquationfound!GradientDescentworkedperfectly.Butwhatifyouhadusedadifferentlearningrateeta?Figure4-8showsthefirst10stepsofGradientDescentusingthreedifferentlearningrates(thedashedlinerepresentsthestartingpoint).

Figure4-8.GradientDescentwithvariouslearningrates

Ontheleft,thelearningrateistoolow:thealgorithmwilleventuallyreachthesolution,butitwilltakealongtime.Inthemiddle,thelearningratelooksprettygood:injustafewiterations,ithasalreadyconvergedtothesolution.Ontheright,thelearningrateistoohigh:thealgorithmdiverges,jumpingallovertheplaceandactuallygettingfurtherandfurtherawayfromthesolutionateverystep.

Tofindagoodlearningrate,youcanusegridsearch(seeChapter2).However,youmaywanttolimitthenumberofiterationssothatgridsearchcaneliminatemodelsthattaketoolongtoconverge.

Youmaywonderhowtosetthenumberofiterations.Ifitistoolow,youwillstillbefarawayfromtheoptimalsolutionwhenthealgorithmstops,butifitistoohigh,youwillwastetimewhilethemodelparametersdonotchangeanymore.Asimplesolutionistosetaverylargenumberofiterationsbuttointerruptthealgorithmwhenthegradientvectorbecomestiny—thatis,whenitsnormbecomessmallerthanatinynumberϵ(calledthetolerance)—becausethishappenswhenGradientDescenthas(almost)

Page 156: index-of.esindex-of.es/Varios-2/Hands on Machine Learning with... · Revision History for the First Edition 2017-03-10: First Release 2017-06-09: Second Release See for release

reachedtheminimum.
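Here is a minimal sketch of that stopping rule, reusing the Batch Gradient Descent loop above (the tolerance and iteration cap are arbitrary illustrative values):

epsilon = 1e-6           # tolerance (arbitrary, for illustration)
max_iterations = 100000  # deliberately large cap

theta = np.random.randn(2, 1)  # random initialization
for iteration in range(max_iterations):
    gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y)
    theta = theta - eta * gradients
    if np.linalg.norm(gradients) < epsilon:
        break  # the gradient vector is tiny: we are (almost) at the minimum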

CONVERGENCE RATE
When the cost function is convex and its slope does not change abruptly (as is the case for the MSE cost function), it can be shown that Batch Gradient Descent with a fixed learning rate has a convergence rate of O(1/iterations). In other words, if you divide the tolerance ϵ by 10 (to have a more precise solution), then the algorithm will have to run about 10 times more iterations.


Stochastic Gradient Descent

The main problem with Batch Gradient Descent is the fact that it uses the whole training set to compute the gradients at every step, which makes it very slow when the training set is large. At the opposite extreme, Stochastic Gradient Descent just picks a random instance in the training set at every step and computes the gradients based only on that single instance. Obviously this makes the algorithm much faster since it has very little data to manipulate at every iteration. It also makes it possible to train on huge training sets, since only one instance needs to be in memory at each iteration (SGD can be implemented as an out-of-core algorithm.7)

On the other hand, due to its stochastic (i.e., random) nature, this algorithm is much less regular than Batch Gradient Descent: instead of gently decreasing until it reaches the minimum, the cost function will bounce up and down, decreasing only on average. Over time it will end up very close to the minimum, but once it gets there it will continue to bounce around, never settling down (see Figure 4-9). So once the algorithm stops, the final parameter values are good, but not optimal.

Figure 4-9. Stochastic Gradient Descent

When the cost function is very irregular (as in Figure 4-6), this can actually help the algorithm jump out of local minima, so Stochastic Gradient Descent has a better chance of finding the global minimum than Batch Gradient Descent does.

Therefore randomness is good to escape from local optima, but bad because it means that the algorithm can never settle at the minimum. One solution to this dilemma is to gradually reduce the learning rate. The steps start out large (which helps make quick progress and escape local minima), then get smaller and smaller, allowing the algorithm to settle at the global minimum. This process is called simulated annealing, because it resembles the process of annealing in metallurgy where molten metal is slowly cooled down. The function that determines the learning rate at each iteration is called the learning schedule. If the learning rate is reduced too quickly, you may get stuck in a local minimum, or even end up frozen halfway to the minimum. If the learning rate is reduced too slowly, you may jump around the minimum for a long time and end up with a suboptimal solution if you halt training too early.

This code implements Stochastic Gradient Descent using a simple learning schedule:

n_epochs = 50
t0, t1 = 5, 50  # learning schedule hyperparameters

def learning_schedule(t):
    return t0 / (t + t1)

theta = np.random.randn(2, 1)  # random initialization

for epoch in range(n_epochs):
    for i in range(m):
        random_index = np.random.randint(m)
        xi = X_b[random_index:random_index+1]
        yi = y[random_index:random_index+1]
        gradients = 2 * xi.T.dot(xi.dot(theta) - yi)
        eta = learning_schedule(epoch * m + i)
        theta = theta - eta * gradients

By convention we iterate by rounds of m iterations; each round is called an epoch. While the Batch Gradient Descent code iterated 1,000 times through the whole training set, this code goes through the training set only 50 times and reaches a fairly good solution:

>>> theta
array([[ 4.21076011],
       [ 2.74856079]])

Figure 4-10 shows the first 10 steps of training (notice how irregular the steps are).

Figure 4-10. Stochastic Gradient Descent first 10 steps


Note that since instances are picked randomly, some instances may be picked several times per epoch while others may not be picked at all. If you want to be sure that the algorithm goes through every instance at each epoch, another approach is to shuffle the training set, then go through it instance by instance, then shuffle it again, and so on. However, this generally converges more slowly.
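A minimal sketch of that shuffled variant, reusing the same X_b, y, and learning_schedule() as above:

for epoch in range(n_epochs):
    shuffled_indices = np.random.permutation(m)  # reshuffle at each epoch
    for i, random_index in enumerate(shuffled_indices):
        xi = X_b[random_index:random_index+1]
        yi = y[random_index:random_index+1]
        gradients = 2 * xi.T.dot(xi.dot(theta) - yi)
        eta = learning_schedule(epoch * m + i)
        theta = theta - eta * gradients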

To perform Linear Regression using SGD with Scikit-Learn, you can use the SGDRegressor class, which defaults to optimizing the squared error cost function. The following code runs 50 epochs, starting with a learning rate of 0.1 (eta0=0.1), using the default learning schedule (different from the preceding one), and it does not use any regularization (penalty=None; more details on this shortly):

from sklearn.linear_model import SGDRegressor

sgd_reg = SGDRegressor(n_iter=50, penalty=None, eta0=0.1)
sgd_reg.fit(X, y.ravel())

Once again, you find a solution very close to the one returned by the Normal Equation:

>>> sgd_reg.intercept_, sgd_reg.coef_
(array([ 4.16782089]), array([ 2.72603052]))


Mini-batch Gradient Descent

The last Gradient Descent algorithm we will look at is called Mini-batch Gradient Descent. It is quite simple to understand once you know Batch and Stochastic Gradient Descent: at each step, instead of computing the gradients based on the full training set (as in Batch GD) or based on just one instance (as in Stochastic GD), Mini-batch GD computes the gradients on small random sets of instances called mini-batches. The main advantage of Mini-batch GD over Stochastic GD is that you can get a performance boost from hardware optimization of matrix operations, especially when using GPUs.
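The excerpt does not list Mini-batch GD code; here is a minimal sketch under the same setup as the previous examples (the mini-batch size is an arbitrary illustrative choice):

n_epochs = 50
minibatch_size = 20  # arbitrary, for illustration

theta = np.random.randn(2, 1)  # random initialization
t = 0
for epoch in range(n_epochs):
    shuffled_indices = np.random.permutation(m)
    X_b_shuffled = X_b[shuffled_indices]
    y_shuffled = y[shuffled_indices]
    for i in range(0, m, minibatch_size):
        t += 1
        xi = X_b_shuffled[i:i+minibatch_size]
        yi = y_shuffled[i:i+minibatch_size]
        gradients = 2/minibatch_size * xi.T.dot(xi.dot(theta) - yi)
        eta = learning_schedule(t)  # same schedule as in the SGD example
        theta = theta - eta * gradients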

Thealgorithm’sprogressinparameterspaceislesserraticthanwithSGD,especiallywithfairlylargemini-batches.Asaresult,Mini-batchGDwillendupwalkingaroundabitclosertotheminimumthanSGD.But,ontheotherhand,itmaybeharderforittoescapefromlocalminima(inthecaseofproblemsthatsufferfromlocalminima,unlikeLinearRegressionaswesawearlier).Figure4-11showsthepathstakenbythethreeGradientDescentalgorithmsinparameterspaceduringtraining.Theyallendupneartheminimum,butBatchGD’spathactuallystopsattheminimum,whilebothStochasticGDandMini-batchGDcontinuetowalkaround.However,don’tforgetthatBatchGDtakesalotoftimetotakeeachstep,andStochasticGDandMini-batchGDwouldalsoreachtheminimumifyouusedagoodlearningschedule.

Figure4-11.GradientDescentpathsinparameterspace

Let’scomparethealgorithmswe’vediscussedsofarforLinearRegression8(recallthatmisthenumberoftraininginstancesandnisthenumberoffeatures);seeTable4-1.

Table4-1.ComparisonofalgorithmsforLinearRegression

Algorithm Largem Out-of-coresupport Largen Hyperparams Scalingrequired Scikit-Learn

NormalEquation Fast No Slow 0 No LinearRegression

BatchGD Slow No Fast 2 Yes n/a

Page 161: index-of.esindex-of.es/Varios-2/Hands on Machine Learning with... · Revision History for the First Edition 2017-03-10: First Release 2017-06-09: Second Release See for release

StochasticGD Fast Yes Fast ≥2 Yes SGDRegressor

Mini-batchGD Fast Yes Fast ≥2 Yes n/a

NOTEThereisalmostnodifferenceaftertraining:allthesealgorithmsendupwithverysimilarmodelsandmakepredictionsinexactlythesameway.


Polynomial Regression

What if your data is actually more complex than a simple straight line? Surprisingly, you can actually use a linear model to fit nonlinear data. A simple way to do this is to add powers of each feature as new features, then train a linear model on this extended set of features. This technique is called Polynomial Regression.

Let's look at an example. First, let's generate some nonlinear data, based on a simple quadratic equation9 (plus some noise; see Figure 4-12):

m = 100
X = 6 * np.random.rand(m, 1) - 3
y = 0.5 * X**2 + X + 2 + np.random.randn(m, 1)

Figure 4-12. Generated nonlinear and noisy dataset

Clearly, a straight line will never fit this data properly. So let's use Scikit-Learn's PolynomialFeatures class to transform our training data, adding the square (2nd-degree polynomial) of each feature in the training set as new features (in this case there is just one feature):

>>> from sklearn.preprocessing import PolynomialFeatures
>>> poly_features = PolynomialFeatures(degree=2, include_bias=False)
>>> X_poly = poly_features.fit_transform(X)
>>> X[0]
array([-0.75275929])
>>> X_poly[0]
array([-0.75275929,  0.56664654])

X_poly now contains the original feature of X plus the square of this feature. Now you can fit a LinearRegression model to this extended training data (Figure 4-13):


>>> lin_reg = LinearRegression()
>>> lin_reg.fit(X_poly, y)
>>> lin_reg.intercept_, lin_reg.coef_
(array([ 1.78134581]), array([[ 0.93366893,  0.56456263]]))

Figure 4-13. Polynomial Regression model predictions

Not bad: the model estimates ŷ = 0.56 x1² + 0.93 x1 + 1.78 when in fact the original function was y = 0.5 x1² + 1.0 x1 + 2.0 + Gaussian noise.

Note that when there are multiple features, Polynomial Regression is capable of finding relationships between features (which is something a plain Linear Regression model cannot do). This is made possible by the fact that PolynomialFeatures also adds all combinations of features up to the given degree. For example, if there were two features a and b, PolynomialFeatures with degree=3 would not only add the features a², a³, b², and b³, but also the combinations ab, a²b, and ab².

WARNING
PolynomialFeatures(degree=d) transforms an array containing n features into an array containing (n + d)! / (d! n!) features, where n! is the factorial of n, equal to 1 × 2 × 3 × ⋯ × n. Beware of the combinatorial explosion of the number of features!
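To make that count concrete, here is a small sketch (the array contents are random; only the shape matters):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X_two = np.random.rand(5, 2)  # five instances, two features a and b
poly = PolynomialFeatures(degree=3, include_bias=False)
print(poly.fit_transform(X_two).shape)
# (5, 9): a, b, a^2, ab, b^2, a^3, a^2 b, a b^2, b^3
# With the bias column included, that is (2 + 3)! / (3! 2!) = 10 features.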


Learning Curves

If you perform high-degree Polynomial Regression, you will likely fit the training data much better than with plain Linear Regression. For example, Figure 4-14 applies a 300-degree polynomial model to the preceding training data, and compares the result with a pure linear model and a quadratic model (2nd-degree polynomial). Notice how the 300-degree polynomial model wiggles around to get as close as possible to the training instances.

Figure 4-14. High-degree Polynomial Regression

Of course, this high-degree Polynomial Regression model is severely overfitting the training data, while the linear model is underfitting it. The model that will generalize best in this case is the quadratic model. It makes sense since the data was generated using a quadratic model, but in general you won't know what function generated the data, so how can you decide how complex your model should be? How can you tell that your model is overfitting or underfitting the data?

In Chapter 2 you used cross-validation to get an estimate of a model's generalization performance. If a model performs well on the training data but generalizes poorly according to the cross-validation metrics, then your model is overfitting. If it performs poorly on both, then it is underfitting. This is one way to tell when a model is too simple or too complex.

Another way is to look at the learning curves: these are plots of the model's performance on the training set and the validation set as a function of the training set size. To generate the plots, simply train the model several times on different sized subsets of the training set. The following code defines a function that plots the learning curves of a model given some training data:

from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def plot_learning_curves(model, X, y):
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
    train_errors, val_errors = [], []
    for m in range(1, len(X_train)):
        model.fit(X_train[:m], y_train[:m])
        y_train_predict = model.predict(X_train[:m])
        y_val_predict = model.predict(X_val)
        train_errors.append(mean_squared_error(y_train_predict, y_train[:m]))
        val_errors.append(mean_squared_error(y_val_predict, y_val))
    plt.plot(np.sqrt(train_errors), "r-+", linewidth=2, label="train")
    plt.plot(np.sqrt(val_errors), "b-", linewidth=3, label="val")

Let's look at the learning curves of the plain Linear Regression model (a straight line; Figure 4-15):

lin_reg = LinearRegression()
plot_learning_curves(lin_reg, X, y)

Figure 4-15. Learning curves

This deserves a bit of explanation. First, let's look at the performance on the training data: when there are just one or two instances in the training set, the model can fit them perfectly, which is why the curve starts at zero. But as new instances are added to the training set, it becomes impossible for the model to fit the training data perfectly, both because the data is noisy and because it is not linear at all. So the error on the training data goes up until it reaches a plateau, at which point adding new instances to the training set doesn't make the average error much better or worse.

Now let's look at the performance of the model on the validation data. When the model is trained on very few training instances, it is incapable of generalizing properly, which is why the validation error is initially quite big. Then as the model is shown more training examples, it learns and thus the validation error slowly goes down. However, once again a straight line cannot do a good job modeling the data, so the error ends up at a plateau, very close to the other curve.

These learning curves are typical of an underfitting model. Both curves have reached a plateau; they are close and fairly high.


TIP
If your model is underfitting the training data, adding more training examples will not help. You need to use a more complex model or come up with better features.

Now let's look at the learning curves of a 10th-degree polynomial model on the same data (Figure 4-16):

from sklearn.pipeline import Pipeline

polynomial_regression = Pipeline([
        ("poly_features", PolynomialFeatures(degree=10, include_bias=False)),
        ("lin_reg", LinearRegression()),
    ])

plot_learning_curves(polynomial_regression, X, y)

These learning curves look a bit like the previous ones, but there are two very important differences:

The error on the training data is much lower than with the Linear Regression model.

There is a gap between the curves. This means that the model performs significantly better on the training data than on the validation data, which is the hallmark of an overfitting model. However, if you used a much larger training set, the two curves would continue to get closer.

Figure 4-16. Learning curves for the polynomial model

TIP
One way to improve an overfitting model is to feed it more training data until the validation error reaches the training error.


THE BIAS/VARIANCE TRADEOFF

An important theoretical result of statistics and Machine Learning is the fact that a model's generalization error can be expressed as the sum of three very different errors:

Bias
This part of the generalization error is due to wrong assumptions, such as assuming that the data is linear when it is actually quadratic. A high-bias model is most likely to underfit the training data.10

Variance
This part is due to the model's excessive sensitivity to small variations in the training data. A model with many degrees of freedom (such as a high-degree polynomial model) is likely to have high variance, and thus to overfit the training data.

Irreducible error
This part is due to the noisiness of the data itself. The only way to reduce this part of the error is to clean up the data (e.g., fix the data sources, such as broken sensors, or detect and remove outliers).

Increasing a model's complexity will typically increase its variance and reduce its bias. Conversely, reducing a model's complexity increases its bias and reduces its variance. This is why it is called a tradeoff.


Regularized Linear Models

As we saw in Chapters 1 and 2, a good way to reduce overfitting is to regularize the model (i.e., to constrain it): the fewer degrees of freedom it has, the harder it will be for it to overfit the data. For example, a simple way to regularize a polynomial model is to reduce the number of polynomial degrees.

For a linear model, regularization is typically achieved by constraining the weights of the model. We will now look at Ridge Regression, Lasso Regression, and Elastic Net, which implement three different ways to constrain the weights.


Ridge Regression

Ridge Regression (also called Tikhonov regularization) is a regularized version of Linear Regression: a regularization term equal to $\alpha \sum_{i=1}^{n} \theta_i^2$ is added to the cost function. This forces the learning algorithm to not only fit the data but also keep the model weights as small as possible. Note that the regularization term should only be added to the cost function during training. Once the model is trained, you want to evaluate the model's performance using the unregularized performance measure.

NOTE
It is quite common for the cost function used during training to be different from the performance measure used for testing. Apart from regularization, another reason why they might be different is that a good training cost function should have optimization-friendly derivatives, while the performance measure used for testing should be as close as possible to the final objective. A good example of this is a classifier trained using a cost function such as the log loss (discussed in a moment) but evaluated using precision/recall.

The hyperparameter α controls how much you want to regularize the model. If α = 0 then Ridge Regression is just Linear Regression. If α is very large, then all weights end up very close to zero and the result is a flat line going through the data's mean. Equation 4-8 presents the Ridge Regression cost function.11

Equation 4-8. Ridge Regression cost function

$J(\boldsymbol{\theta}) = \mathrm{MSE}(\boldsymbol{\theta}) + \alpha \frac{1}{2} \sum_{i=1}^{n} \theta_i^2$

Note that the bias term θ0 is not regularized (the sum starts at i = 1, not 0). If we define w as the vector of feature weights (θ1 to θn), then the regularization term is simply equal to $\frac{1}{2}(\lVert \mathbf{w} \rVert_2)^2$, where $\lVert \cdot \rVert_2$ represents the ℓ2 norm of the weight vector.12 For Gradient Descent, just add αw to the MSE gradient vector (Equation 4-6).

WARNING
It is important to scale the data (e.g., using a StandardScaler) before performing Ridge Regression, as it is sensitive to the scale of the input features. This is true of most regularized models.

Figure 4-17 shows several Ridge models trained on some linear data using different α values. On the left, plain Ridge models are used, leading to linear predictions. On the right, the data is first expanded using PolynomialFeatures(degree=10), then it is scaled using a StandardScaler, and finally the Ridge models are applied to the resulting features: this is Polynomial Regression with Ridge regularization. Note how increasing α leads to flatter (i.e., less extreme, more reasonable) predictions; this reduces the model's variance but increases its bias.

As with Linear Regression, we can perform Ridge Regression either by computing a closed-form equation or by performing Gradient Descent. The pros and cons are the same. Equation 4-9 shows the closed-form solution (where A is the (n + 1) × (n + 1) identity matrix,13 except with a 0 in the top-left cell, corresponding to the bias term).

Figure 4-17. Ridge Regression

Equation 4-9. Ridge Regression closed-form solution

$\hat{\boldsymbol{\theta}} = \left( \mathbf{X}^T \cdot \mathbf{X} + \alpha \mathbf{A} \right)^{-1} \cdot \mathbf{X}^T \cdot \mathbf{y}$

Here is how to perform Ridge Regression with Scikit-Learn using a closed-form solution (a variant of Equation 4-9 using a matrix factorization technique by André-Louis Cholesky):

>>> from sklearn.linear_model import Ridge
>>> ridge_reg = Ridge(alpha=1, solver="cholesky")
>>> ridge_reg.fit(X, y)
>>> ridge_reg.predict([[1.5]])
array([[1.55071465]])
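If you want to check Equation 4-9 by hand, a short NumPy sketch follows; it assumes X and y are the arrays used above, and the variable names are illustrative:

import numpy as np

alpha = 1
X_b = np.c_[np.ones((len(X), 1)), X]   # add the bias feature x0 = 1 to each instance
A = np.identity(X_b.shape[1])
A[0, 0] = 0                            # do not regularize the bias term
# Equation 4-9: theta = (X^T X + alpha A)^-1 X^T y
theta_ridge = np.linalg.inv(X_b.T.dot(X_b) + alpha * A).dot(X_b.T).dot(y)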

And using Stochastic Gradient Descent:14

>>> from sklearn.linear_model import SGDRegressor
>>> sgd_reg = SGDRegressor(penalty="l2")
>>> sgd_reg.fit(X, y.ravel())
>>> sgd_reg.predict([[1.5]])
array([1.13500145])

The penalty hyperparameter sets the type of regularization term to use. Specifying "l2" indicates that you want SGD to add a regularization term to the cost function equal to half the square of the ℓ2 norm of the weight vector: this is simply Ridge Regression.


Lasso Regression

Least Absolute Shrinkage and Selection Operator Regression (simply called Lasso Regression) is another regularized version of Linear Regression: just like Ridge Regression, it adds a regularization term to the cost function, but it uses the ℓ1 norm of the weight vector instead of half the square of the ℓ2 norm (see Equation 4-10).

Equation 4-10. Lasso Regression cost function

$J(\boldsymbol{\theta}) = \mathrm{MSE}(\boldsymbol{\theta}) + \alpha \sum_{i=1}^{n} \left| \theta_i \right|$

Figure 4-18 shows the same thing as Figure 4-17 but replaces Ridge models with Lasso models and uses smaller α values.

Figure 4-18. Lasso Regression

An important characteristic of Lasso Regression is that it tends to completely eliminate the weights of the least important features (i.e., set them to zero). For example, the dashed line in the right plot on Figure 4-18 (with α = 10⁻⁷) looks quadratic, almost linear: all the weights for the high-degree polynomial features are equal to zero. In other words, Lasso Regression automatically performs feature selection and outputs a sparse model (i.e., with few nonzero feature weights).

You can get a sense of why this is the case by looking at Figure 4-19: on the top-left plot, the background contours (ellipses) represent an unregularized MSE cost function (α = 0), and the white circles show the Batch Gradient Descent path with that cost function. The foreground contours (diamonds) represent the ℓ1 penalty, and the triangles show the BGD path for this penalty only (α → ∞). Notice how the path first reaches θ1 = 0, then rolls down a gutter until it reaches θ2 = 0. On the top-right plot, the contours represent the same cost function plus an ℓ1 penalty with α = 0.5. The global minimum is on the θ2 = 0 axis. BGD first reaches θ2 = 0, then rolls down the gutter until it reaches the global minimum. The two bottom plots show the same thing but use an ℓ2 penalty instead. The regularized minimum is closer to θ = 0 than the unregularized minimum, but the weights do not get fully eliminated.

Figure 4-19. Lasso versus Ridge regularization

TIP
On the Lasso cost function, the BGD path tends to bounce across the gutter toward the end. This is because the slope changes abruptly at θ2 = 0. You need to gradually reduce the learning rate in order to actually converge to the global minimum.

The Lasso cost function is not differentiable at θi = 0 (for i = 1, 2, ⋯, n), but Gradient Descent still works fine if you use a subgradient vector g15 instead when any θi = 0. Equation 4-11 shows a subgradient vector equation you can use for Gradient Descent with the Lasso cost function.

Equation 4-11. Lasso Regression subgradient vector

$g(\boldsymbol{\theta}, J) = \nabla_{\boldsymbol{\theta}}\, \mathrm{MSE}(\boldsymbol{\theta}) + \alpha \begin{pmatrix} \operatorname{sign}(\theta_1) \\ \vdots \\ \operatorname{sign}(\theta_n) \end{pmatrix} \quad \text{where } \operatorname{sign}(\theta_i) = \begin{cases} -1 & \text{if } \theta_i < 0 \\ 0 & \text{if } \theta_i = 0 \\ +1 & \text{if } \theta_i > 0 \end{cases}$

Here is a small Scikit-Learn example using the Lasso class. Note that you could instead use an SGDRegressor(penalty="l1").

>>> from sklearn.linear_model import Lasso
>>> lasso_reg = Lasso(alpha=0.1)
>>> lasso_reg.fit(X, y)
>>> lasso_reg.predict([[1.5]])
array([1.53788174])


Elastic Net

Elastic Net is a middle ground between Ridge Regression and Lasso Regression. The regularization term is a simple mix of both Ridge and Lasso's regularization terms, and you can control the mix ratio r. When r = 0, Elastic Net is equivalent to Ridge Regression, and when r = 1, it is equivalent to Lasso Regression (see Equation 4-12).

Equation 4-12. Elastic Net cost function

$J(\boldsymbol{\theta}) = \mathrm{MSE}(\boldsymbol{\theta}) + r \alpha \sum_{i=1}^{n} \left| \theta_i \right| + \frac{1 - r}{2} \alpha \sum_{i=1}^{n} \theta_i^2$

So when should you use plain Linear Regression (i.e., without any regularization), Ridge, Lasso, or Elastic Net? It is almost always preferable to have at least a little bit of regularization, so generally you should avoid plain Linear Regression. Ridge is a good default, but if you suspect that only a few features are actually useful, you should prefer Lasso or Elastic Net since they tend to reduce the useless features' weights down to zero, as we have discussed. In general, Elastic Net is preferred over Lasso since Lasso may behave erratically when the number of features is greater than the number of training instances or when several features are strongly correlated.

Here is a short example using Scikit-Learn's ElasticNet (l1_ratio corresponds to the mix ratio r):

>>> from sklearn.linear_model import ElasticNet
>>> elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)
>>> elastic_net.fit(X, y)
>>> elastic_net.predict([[1.5]])
array([1.54333232])


Early Stopping

A very different way to regularize iterative learning algorithms such as Gradient Descent is to stop training as soon as the validation error reaches a minimum. This is called early stopping. Figure 4-20 shows a complex model (in this case a high-degree Polynomial Regression model) being trained using Batch Gradient Descent. As the epochs go by, the algorithm learns and its prediction error (RMSE) on the training set naturally goes down, and so does its prediction error on the validation set. However, after a while the validation error stops decreasing and actually starts to go back up. This indicates that the model has started to overfit the training data. With early stopping you just stop training as soon as the validation error reaches the minimum. It is such a simple and efficient regularization technique that Geoffrey Hinton called it a "beautiful free lunch."

Figure 4-20. Early stopping regularization

TIP
With Stochastic and Mini-batch Gradient Descent, the curves are not so smooth, and it may be hard to know whether you have reached the minimum or not. One solution is to stop only after the validation error has been above the minimum for some time (when you are confident that the model will not do any better), then roll back the model parameters to the point where the validation error was at a minimum.

Here is a basic implementation of early stopping:

from copy import deepcopy
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error

# note: n_iter was renamed max_iter in later Scikit-Learn releases
sgd_reg = SGDRegressor(n_iter=1, warm_start=True, penalty=None,
                       learning_rate="constant", eta0=0.0005)

minimum_val_error = float("inf")
best_epoch = None
best_model = None
for epoch in range(1000):
    sgd_reg.fit(X_train_poly_scaled, y_train)  # continues where it left off
    y_val_predict = sgd_reg.predict(X_val_poly_scaled)
    val_error = mean_squared_error(y_val, y_val_predict)
    if val_error < minimum_val_error:
        minimum_val_error = val_error
        best_epoch = epoch
        best_model = deepcopy(sgd_reg)  # keep a snapshot of the best model, weights included

Note that with warm_start=True, when the fit() method is called, it just continues training where it left off instead of restarting from scratch. Also note that the code uses deepcopy() rather than Scikit-Learn's clone(), because clone() copies the estimator without its trained parameters.
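The code above assumes that the training and validation sets have already been expanded with polynomial features and scaled. A minimal sketch of that preparation (the polynomial degree and the 50/50 split are assumptions):

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X_train, X_val, y_train, y_val = train_test_split(X, y.ravel(), test_size=0.5)

poly_scaler = Pipeline([
    ("poly_features", PolynomialFeatures(degree=90, include_bias=False)),
    ("std_scaler", StandardScaler()),
])
X_train_poly_scaled = poly_scaler.fit_transform(X_train)
X_val_poly_scaled = poly_scaler.transform(X_val)  # reuse the training set's scaling statistics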


Logistic Regression

As we discussed in Chapter 1, some regression algorithms can be used for classification as well (and vice versa). Logistic Regression (also called Logit Regression) is commonly used to estimate the probability that an instance belongs to a particular class (e.g., what is the probability that this email is spam?). If the estimated probability is greater than 50%, then the model predicts that the instance belongs to that class (called the positive class, labeled "1"), or else it predicts that it does not (i.e., it belongs to the negative class, labeled "0"). This makes it a binary classifier.


Estimating Probabilities

So how does it work? Just like a Linear Regression model, a Logistic Regression model computes a weighted sum of the input features (plus a bias term), but instead of outputting the result directly like the Linear Regression model does, it outputs the logistic of this result (see Equation 4-13).

Equation 4-13. Logistic Regression model estimated probability (vectorized form)

$\hat{p} = h_{\boldsymbol{\theta}}(\mathbf{x}) = \sigma\left( \boldsymbol{\theta}^T \cdot \mathbf{x} \right)$

The logistic, noted σ(·), is a sigmoid function (i.e., S-shaped) that outputs a number between 0 and 1. It is defined as shown in Equation 4-14 and Figure 4-21.

Equation 4-14. Logistic function

$\sigma(t) = \frac{1}{1 + \exp(-t)}$

Figure 4-21. Logistic function

Once the Logistic Regression model has estimated the probability p̂ = hθ(x) that an instance x belongs to the positive class, it can make its prediction ŷ easily (see Equation 4-15).

Equation 4-15. Logistic Regression model prediction

$\hat{y} = \begin{cases} 0 & \text{if } \hat{p} < 0.5 \\ 1 & \text{if } \hat{p} \geq 0.5 \end{cases}$

Notice that σ(t) < 0.5 when t < 0, and σ(t) ≥ 0.5 when t ≥ 0, so a Logistic Regression model predicts 1 if θᵀ·x is positive, and 0 if it is negative.


Training and Cost Function

Good, now you know how a Logistic Regression model estimates probabilities and makes predictions. But how is it trained? The objective of training is to set the parameter vector θ so that the model estimates high probabilities for positive instances (y = 1) and low probabilities for negative instances (y = 0). This idea is captured by the cost function shown in Equation 4-16 for a single training instance x.

Equation 4-16. Cost function of a single training instance

$c(\boldsymbol{\theta}) = \begin{cases} -\log(\hat{p}) & \text{if } y = 1 \\ -\log(1 - \hat{p}) & \text{if } y = 0 \end{cases}$

This cost function makes sense because –log(t) grows very large when t approaches 0, so the cost will be large if the model estimates a probability close to 0 for a positive instance, and it will also be very large if the model estimates a probability close to 1 for a negative instance. On the other hand, –log(t) is close to 0 when t is close to 1, so the cost will be close to 0 if the estimated probability is close to 0 for a negative instance or close to 1 for a positive instance, which is precisely what we want.

The cost function over the whole training set is simply the average cost over all training instances. It can be written in a single expression (as you can verify easily), called the log loss, shown in Equation 4-17.

Equation 4-17. Logistic Regression cost function (log loss)

$J(\boldsymbol{\theta}) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log\left( \hat{p}^{(i)} \right) + \left(1 - y^{(i)}\right) \log\left( 1 - \hat{p}^{(i)} \right) \right]$

The bad news is that there is no known closed-form equation to compute the value of θ that minimizes this cost function (there is no equivalent of the Normal Equation). But the good news is that this cost function is convex, so Gradient Descent (or any other optimization algorithm) is guaranteed to find the global minimum (if the learning rate is not too large and you wait long enough). The partial derivative of the cost function with regard to the jth model parameter θj is given by Equation 4-18.

Equation 4-18. Logistic cost function partial derivatives

$\frac{\partial}{\partial \theta_j} J(\boldsymbol{\theta}) = \frac{1}{m} \sum_{i=1}^{m} \left( \sigma\left( \boldsymbol{\theta}^T \cdot \mathbf{x}^{(i)} \right) - y^{(i)} \right) x_j^{(i)}$

This equation looks very much like Equation 4-5: for each instance it computes the prediction error and multiplies it by the jth feature value, and then it computes the average over all training instances. Once you have the gradient vector containing all the partial derivatives you can use it in the Batch Gradient Descent algorithm. That's it: you now know how to train a Logistic Regression model. For Stochastic GD you would of course just take one instance at a time, and for Mini-batch GD you would use a mini-batch at a time.
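To make this concrete, here is a short NumPy sketch of Batch Gradient Descent for Logistic Regression; X_b is assumed to be the training set with a bias feature x0 = 1 prepended, y a vector of 0/1 labels, and the learning rate and iteration count are arbitrary choices:

import numpy as np

def sigmoid(t):
    return 1 / (1 + np.exp(-t))   # Equation 4-14

eta = 0.1           # learning rate (assumed)
n_iterations = 1000
m = len(X_b)
theta = np.random.randn(X_b.shape[1])  # random initialization

for iteration in range(n_iterations):
    p_hat = sigmoid(X_b.dot(theta))              # Equation 4-13: estimated probabilities
    gradients = (1 / m) * X_b.T.dot(p_hat - y)   # Equation 4-18, vectorized over all instances
    theta = theta - eta * gradients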


Decision Boundaries

Let's use the iris dataset to illustrate Logistic Regression. This is a famous dataset that contains the sepal and petal length and width of 150 iris flowers of three different species: Iris-Setosa, Iris-Versicolor, and Iris-Virginica (see Figure 4-22).

Figure 4-22. Flowers of three iris plant species16

Let’strytobuildaclassifiertodetecttheIris-Virginicatypebasedonlyonthepetalwidthfeature.Firstlet’sloadthedata:

>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> list(iris.keys())
['data', 'target_names', 'feature_names', 'target', 'DESCR']
>>> X = iris["data"][:, 3:]  # petal width
>>> y = (iris["target"] == 2).astype(int)  # 1 if Iris-Virginica, else 0

Nowlet’strainaLogisticRegressionmodel:

from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression()
log_reg.fit(X, y)

Let’slookatthemodel’sestimatedprobabilitiesforflowerswithpetalwidthsvaryingfrom0to3cm(Figure4-23):

import numpy as np
import matplotlib.pyplot as plt

X_new = np.linspace(0, 3, 1000).reshape(-1, 1)
y_proba = log_reg.predict_proba(X_new)
plt.plot(X_new, y_proba[:, 1], "g-", label="Iris-Virginica")
plt.plot(X_new, y_proba[:, 0], "b--", label="Not Iris-Virginica")
# + more Matplotlib code to make the image look pretty


Figure 4-23. Estimated probabilities and decision boundary

The petal width of Iris-Virginica flowers (represented by triangles) ranges from 1.4 cm to 2.5 cm, while the other iris flowers (represented by squares) generally have a smaller petal width, ranging from 0.1 cm to 1.8 cm. Notice that there is a bit of overlap. Above about 2 cm the classifier is highly confident that the flower is an Iris-Virginica (it outputs a high probability to that class), while below 1 cm it is highly confident that it is not an Iris-Virginica (high probability for the "Not Iris-Virginica" class). In between these extremes, the classifier is unsure. However, if you ask it to predict the class (using the predict() method rather than the predict_proba() method), it will return whichever class is the most likely. Therefore, there is a decision boundary at around 1.6 cm where both probabilities are equal to 50%: if the petal width is higher than 1.6 cm, the classifier will predict that the flower is an Iris-Virginica, or else it will predict that it is not (even if it is not very confident):

>>> log_reg.predict([[1.7], [1.5]])
array([1, 0])

Figure 4-24 shows the same dataset but this time displaying two features: petal width and length. Once trained, the Logistic Regression classifier can estimate the probability that a new flower is an Iris-Virginica based on these two features. The dashed line represents the points where the model estimates a 50% probability: this is the model's decision boundary. Note that it is a linear boundary.17 Each parallel line represents the points where the model outputs a specific probability, from 15% (bottom left) to 90% (top right). All the flowers beyond the top-right line have an over 90% chance of being Iris-Virginica according to the model.

Figure 4-24. Linear decision boundary


Just like the other linear models, Logistic Regression models can be regularized using ℓ1 or ℓ2 penalties. Scikit-Learn actually adds an ℓ2 penalty by default.

NOTE
The hyperparameter controlling the regularization strength of a Scikit-Learn LogisticRegression model is not alpha (as in other linear models), but its inverse: C. The higher the value of C, the less the model is regularized.


Softmax Regression

The Logistic Regression model can be generalized to support multiple classes directly, without having to train and combine multiple binary classifiers (as discussed in Chapter 3). This is called Softmax Regression, or Multinomial Logistic Regression.

The idea is quite simple: when given an instance x, the Softmax Regression model first computes a score sk(x) for each class k, then estimates the probability of each class by applying the softmax function (also called the normalized exponential) to the scores. The equation to compute sk(x) should look familiar, as it is just like the equation for Linear Regression prediction (see Equation 4-19).

Equation 4-19. Softmax score for class k

$s_k(\mathbf{x}) = \left( \boldsymbol{\theta}^{(k)} \right)^T \cdot \mathbf{x}$

Note that each class has its own dedicated parameter vector θ⁽ᵏ⁾. All these vectors are typically stored as rows in a parameter matrix Θ.

Once you have computed the score of every class for the instance x, you can estimate the probability p̂k that the instance belongs to class k by running the scores through the softmax function (Equation 4-20): it computes the exponential of every score, then normalizes them (dividing by the sum of all the exponentials).

Equation 4-20. Softmax function

$\hat{p}_k = \sigma\left( \mathbf{s}(\mathbf{x}) \right)_k = \frac{\exp\left( s_k(\mathbf{x}) \right)}{\sum_{j=1}^{K} \exp\left( s_j(\mathbf{x}) \right)}$

K is the number of classes.

s(x) is a vector containing the scores of each class for the instance x.

σ(s(x))k is the estimated probability that the instance x belongs to class k given the scores of each class for that instance.

Just like the Logistic Regression classifier, the Softmax Regression classifier predicts the class with the highest estimated probability (which is simply the class with the highest score), as shown in Equation 4-21.

Equation 4-21. Softmax Regression classifier prediction

$\hat{y} = \underset{k}{\operatorname{argmax}}\ \sigma\left( \mathbf{s}(\mathbf{x}) \right)_k = \underset{k}{\operatorname{argmax}}\ s_k(\mathbf{x}) = \underset{k}{\operatorname{argmax}}\ \left( \boldsymbol{\theta}^{(k)} \right)^T \cdot \mathbf{x}$


The argmax operator returns the value of a variable that maximizes a function. In this equation, it returns the value of k that maximizes the estimated probability σ(s(x))k.

TIP
The Softmax Regression classifier predicts only one class at a time (i.e., it is multiclass, not multioutput) so it should be used only with mutually exclusive classes such as different types of plants. You cannot use it to recognize multiple people in one picture.

Now that you know how the model estimates probabilities and makes predictions, let's take a look at training. The objective is to have a model that estimates a high probability for the target class (and consequently a low probability for the other classes). Minimizing the cost function shown in Equation 4-22, called the cross entropy, should lead to this objective because it penalizes the model when it estimates a low probability for a target class. Cross entropy is frequently used to measure how well a set of estimated class probabilities match the target classes (we will use it again several times in the following chapters).

Equation 4-22. Cross entropy cost function

$J(\boldsymbol{\Theta}) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} y_k^{(i)} \log\left( \hat{p}_k^{(i)} \right)$

$y_k^{(i)}$ is equal to 1 if the target class for the ith instance is k; otherwise, it is equal to 0.

Notice that when there are just two classes (K = 2), this cost function is equivalent to the Logistic Regression's cost function (log loss; see Equation 4-17).

CROSS ENTROPY

Cross entropy originated from information theory. Suppose you want to efficiently transmit information about the weather every day. If there are eight options (sunny, rainy, etc.), you could encode each option using 3 bits since 2³ = 8. However, if you think it will be sunny almost every day, it would be much more efficient to code "sunny" on just one bit (0) and the other seven options on 4 bits (starting with a 1). Cross entropy measures the average number of bits you actually send per option. If your assumption about the weather is perfect, cross entropy will just be equal to the entropy of the weather itself (i.e., its intrinsic unpredictability). But if your assumptions are wrong (e.g., if it rains often), cross entropy will be greater by an amount called the Kullback–Leibler divergence.

The cross entropy between two probability distributions p and q is defined as $H(p, q) = -\sum_{x} p(x) \log q(x)$ (at least when the distributions are discrete).
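Sticking with the weather example, a quick computation (the distributions are made up for illustration):

import numpy as np

p = np.array([0.80, 0.10, 0.05, 0.05])   # true weather distribution (assumed)
q = np.array([0.25, 0.25, 0.25, 0.25])   # your (wrong) assumed distribution
cross_entropy = -np.sum(p * np.log2(q))  # 2.0 bits per option here
entropy = -np.sum(p * np.log2(p))        # intrinsic unpredictability: about 1.02 bits
kl_divergence = cross_entropy - entropy  # the extra bits due to the wrong assumptions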

The gradient vector of this cost function with regard to θ⁽ᵏ⁾ is given by Equation 4-23:

Equation 4-23. Cross entropy gradient vector for class k

$\nabla_{\boldsymbol{\theta}^{(k)}} J(\boldsymbol{\Theta}) = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{p}_k^{(i)} - y_k^{(i)} \right) \mathbf{x}^{(i)}$


Now you can compute the gradient vector for every class, then use Gradient Descent (or any other optimization algorithm) to find the parameter matrix Θ that minimizes the cost function.
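Here is a hedged NumPy sketch of that training loop, putting Equations 4-19, 4-20, and 4-23 together; a bias-augmented training set X_b and one-hot targets Y_onehot are assumed to be prepared already, and the hyperparameter values are arbitrary:

import numpy as np

eta = 0.1           # learning rate (assumed)
n_iterations = 1000
m, n_inputs = X_b.shape        # X_b includes the bias feature x0 = 1
K = Y_onehot.shape[1]          # number of classes
Theta = np.random.randn(n_inputs, K)

for iteration in range(n_iterations):
    logits = X_b.dot(Theta)                                    # Equation 4-19: one score per class
    exps = np.exp(logits - logits.max(axis=1, keepdims=True))  # shift for numerical stability
    P_hat = exps / exps.sum(axis=1, keepdims=True)             # Equation 4-20: softmax
    gradients = (1 / m) * X_b.T.dot(P_hat - Y_onehot)          # Equation 4-23 for every class at once
    Theta = Theta - eta * gradients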

Let’suseSoftmaxRegressiontoclassifytheirisflowersintoallthreeclasses.Scikit-Learn’sLogisticRegressionusesone-versus-allbydefaultwhenyoutrainitonmorethantwoclasses,butyoucansetthemulti_classhyperparameterto"multinomial"toswitchittoSoftmaxRegressioninstead.YoumustalsospecifyasolverthatsupportsSoftmaxRegression,suchasthe"lbfgs"solver(seeScikit-Learn’sdocumentationformoredetails).Italsoappliesℓ2regularizationbydefault,whichyoucancontrolusingthehyperparameterC.

X = iris["data"][:, (2, 3)]  # petal length, petal width
y = iris["target"]

softmax_reg = LogisticRegression(multi_class="multinomial", solver="lbfgs", C=10)
softmax_reg.fit(X, y)

So the next time you find an iris with petals 5 cm long and 2 cm wide, you can ask your model to tell you what type of iris it is, and it will answer Iris-Virginica (class 2) with 94.2% probability (or Iris-Versicolor with 5.8% probability):

>>> softmax_reg.predict([[5, 2]])
array([2])
>>> softmax_reg.predict_proba([[5, 2]])
array([[6.33134078e-07, 5.75276067e-02, 9.42471760e-01]])

Figure 4-25 shows the resulting decision boundaries, represented by the background colors. Notice that the decision boundaries between any two classes are linear. The figure also shows the probabilities for the Iris-Versicolor class, represented by the curved lines (e.g., the line labeled with 0.450 represents the 45% probability boundary). Notice that the model can predict a class that has an estimated probability below 50%. For example, at the point where all decision boundaries meet, all classes have an equal estimated probability of 33%.


Figure 4-25. Softmax Regression decision boundaries


Exercises

1. What Linear Regression training algorithm can you use if you have a training set with millions of features?

2. Suppose the features in your training set have very different scales. What algorithms might suffer from this, and how? What can you do about it?

3. Can Gradient Descent get stuck in a local minimum when training a Logistic Regression model?

4. Do all Gradient Descent algorithms lead to the same model provided you let them run long enough?

5. Suppose you use Batch Gradient Descent and you plot the validation error at every epoch. If you notice that the validation error consistently goes up, what is likely going on? How can you fix this?

6. Is it a good idea to stop Mini-batch Gradient Descent immediately when the validation error goes up?

7. Which Gradient Descent algorithm (among those we discussed) will reach the vicinity of the optimal solution the fastest? Which will actually converge? How can you make the others converge as well?

8. Suppose you are using Polynomial Regression. You plot the learning curves and you notice that there is a large gap between the training error and the validation error. What is happening? What are three ways to solve this?

9. Suppose you are using Ridge Regression and you notice that the training error and the validation error are almost equal and fairly high. Would you say that the model suffers from high bias or high variance? Should you increase the regularization hyperparameter α or reduce it?

10. Why would you want to use:

Ridge Regression instead of plain Linear Regression (i.e., without any regularization)?

Lasso instead of Ridge Regression?

Elastic Net instead of Lasso?

11. Suppose you want to classify pictures as outdoor/indoor and daytime/nighttime. Should you implement two Logistic Regression classifiers or one Softmax Regression classifier?

12. Implement Batch Gradient Descent with early stopping for Softmax Regression (without using Scikit-Learn).

Solutions to these exercises are available in Appendix A.

1. It is often the case that a learning algorithm will try to optimize a different function than the performance measure used to evaluate the final model. This is generally because that function is easier to compute, because it has useful differentiation properties that the performance measure lacks, or because we want to constrain the model during training, as we will see when we discuss regularization.

2. The demonstration that this returns the value of θ that minimizes the cost function is outside the scope of this book.

3. Note that Scikit-Learn separates the bias term (intercept_) from the feature weights (coef_).

4. Technically speaking, its derivative is Lipschitz continuous.

5. Since feature 1 is smaller, it takes a larger change in θ1 to affect the cost function, which is why the bowl is elongated along the θ1 axis.

6. Eta (η) is the 7th letter of the Greek alphabet.

7. Out-of-core algorithms are discussed in Chapter 1.

8. While the Normal Equation can only perform Linear Regression, the Gradient Descent algorithms can be used to train many other models, as we will see.

9. A quadratic equation is of the form y = ax² + bx + c.

10. This notion of bias is not to be confused with the bias term of linear models.

11. It is common to use the notation J(θ) for cost functions that don't have a short name; we will often use this notation throughout the rest of this book. The context will make it clear which cost function is being discussed.

12. Norms are discussed in Chapter 2.

13. A square matrix full of 0s except for 1s on the main diagonal (top left to bottom right).

14. Alternatively you can use the Ridge class with the "sag" solver. Stochastic Average GD is a variant of SGD. For more details, see the presentation "Minimizing Finite Sums with the Stochastic Average Gradient Algorithm" by Mark Schmidt et al. from the University of British Columbia.

15. You can think of a subgradient vector at a nondifferentiable point as an intermediate vector between the gradient vectors around that point.

16. Photos reproduced from the corresponding Wikipedia pages. Iris-Virginica photo by Frank Mayfield (Creative Commons BY-SA 2.0), Iris-Versicolor photo by D. Gordon E. Robertson (Creative Commons BY-SA 3.0), and Iris-Setosa photo is public domain.

17. It is the set of points x such that θ0 + θ1x1 + θ2x2 = 0, which defines a straight line.


Chapter 5. Support Vector Machines

A Support Vector Machine (SVM) is a very powerful and versatile Machine Learning model, capable of performing linear or nonlinear classification, regression, and even outlier detection. It is one of the most popular models in Machine Learning, and anyone interested in Machine Learning should have it in their toolbox. SVMs are particularly well suited for classification of complex but small- or medium-sized datasets.

This chapter will explain the core concepts of SVMs, how to use them, and how they work.


Linear SVM Classification

The fundamental idea behind SVMs is best explained with some pictures. Figure 5-1 shows part of the iris dataset that was introduced at the end of Chapter 4. The two classes can clearly be separated easily with a straight line (they are linearly separable). The left plot shows the decision boundaries of three possible linear classifiers. The model whose decision boundary is represented by the dashed line is so bad that it does not even separate the classes properly. The other two models work perfectly on this training set, but their decision boundaries come so close to the instances that these models will probably not perform as well on new instances. In contrast, the solid line in the plot on the right represents the decision boundary of an SVM classifier; this line not only separates the two classes but also stays as far away from the closest training instances as possible. You can think of an SVM classifier as fitting the widest possible street (represented by the parallel dashed lines) between the classes. This is called large margin classification.

Figure 5-1. Large margin classification

Notice that adding more training instances "off the street" will not affect the decision boundary at all: it is fully determined (or "supported") by the instances located on the edge of the street. These instances are called the support vectors (they are circled in Figure 5-1).

WARNING
SVMs are sensitive to the feature scales, as you can see in Figure 5-2: on the left plot, the vertical scale is much larger than the horizontal scale, so the widest possible street is close to horizontal. After feature scaling (e.g., using Scikit-Learn's StandardScaler), the decision boundary looks much better (on the right plot).

Figure 5-2. Sensitivity to feature scales


Soft Margin Classification

If we strictly impose that all instances be off the street and on the right side, this is called hard margin classification. There are two main issues with hard margin classification. First, it only works if the data is linearly separable, and second, it is quite sensitive to outliers. Figure 5-3 shows the iris dataset with just one additional outlier: on the left, it is impossible to find a hard margin, and on the right the decision boundary ends up very different from the one we saw in Figure 5-1 without the outlier, and it will probably not generalize as well.

Figure 5-3. Hard margin sensitivity to outliers

To avoid these issues it is preferable to use a more flexible model. The objective is to find a good balance between keeping the street as large as possible and limiting the margin violations (i.e., instances that end up in the middle of the street or even on the wrong side). This is called soft margin classification.

InScikit-Learn’sSVMclasses,youcancontrolthisbalanceusingtheChyperparameter:asmallerCvalueleadstoawiderstreetbutmoremarginviolations.Figure5-4showsthedecisionboundariesandmarginsoftwosoftmarginSVMclassifiersonanonlinearlyseparabledataset.Ontheleft,usingahighCvaluetheclassifiermakesfewermarginviolationsbutendsupwithasmallermargin.Ontheright,usingalowCvaluethemarginismuchlarger,butmanyinstancesenduponthestreet.However,itseemslikelythatthesecondclassifierwillgeneralizebetter:infactevenonthistrainingsetitmakesfewerpredictionerrors,sincemostofthemarginviolationsareactuallyonthecorrectsideofthedecisionboundary.

Figure 5-4. Fewer margin violations versus large margin

TIP
If your SVM model is overfitting, you can try regularizing it by reducing C.


The following Scikit-Learn code loads the iris dataset, scales the features, and then trains a linear SVM model (using the LinearSVC class with C = 1 and the hinge loss function, described shortly) to detect Iris-Virginica flowers. The resulting model is represented on the right of Figure 5-4.

import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X = iris["data"][:, (2, 3)]  # petal length, petal width
y = (iris["target"] == 2).astype(np.float64)  # Iris-Virginica

svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("linear_svc", LinearSVC(C=1, loss="hinge")),
])

svm_clf.fit(X, y)

Then, as usual, you can use the model to make predictions:

>>> svm_clf.predict([[5.5, 1.7]])
array([1.])

NOTE
Unlike Logistic Regression classifiers, SVM classifiers do not output probabilities for each class.

Alternatively, you could use the SVC class, using SVC(kernel="linear", C=1), but it is much slower, especially with large training sets, so it is not recommended. Another option is to use the SGDClassifier class, with SGDClassifier(loss="hinge", alpha=1/(m*C)). This applies regular Stochastic Gradient Descent (see Chapter 4) to train a linear SVM classifier. It does not converge as fast as the LinearSVC class, but it can be useful to handle huge datasets that do not fit in memory (out-of-core training), or to handle online classification tasks.

TIP
The LinearSVC class regularizes the bias term, so you should center the training set first by subtracting its mean. This is automatic if you scale the data using the StandardScaler. Moreover, make sure you set the loss hyperparameter to "hinge", as it is not the default value. Finally, for better performance you should set the dual hyperparameter to False, unless there are more features than training instances (we will discuss duality later in the chapter).


Nonlinear SVM Classification

Although linear SVM classifiers are efficient and work surprisingly well in many cases, many datasets are not even close to being linearly separable. One approach to handling nonlinear datasets is to add more features, such as polynomial features (as you did in Chapter 4); in some cases this can result in a linearly separable dataset. Consider the left plot in Figure 5-5: it represents a simple dataset with just one feature x1. This dataset is not linearly separable, as you can see. But if you add a second feature x2 = (x1)², the resulting 2D dataset is perfectly linearly separable.

Figure 5-5. Adding features to make a dataset linearly separable

To implement this idea using Scikit-Learn, you can create a Pipeline containing a PolynomialFeatures transformer (discussed in "Polynomial Regression"), followed by a StandardScaler and a LinearSVC. Let's test this on the moons dataset (see Figure 5-6):

from sklearn.datasets import make_moons
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import LinearSVC

X, y = make_moons(n_samples=100, noise=0.15)  # build the moons dataset (assumed parameters)

polynomial_svm_clf = Pipeline([
    ("poly_features", PolynomialFeatures(degree=3)),
    ("scaler", StandardScaler()),
    ("svm_clf", LinearSVC(C=10, loss="hinge")),
])

polynomial_svm_clf.fit(X, y)


Figure 5-6. Linear SVM classifier using polynomial features


Polynomial Kernel

Adding polynomial features is simple to implement and can work great with all sorts of Machine Learning algorithms (not just SVMs), but at a low polynomial degree it cannot deal with very complex datasets, and with a high polynomial degree it creates a huge number of features, making the model too slow.

Fortunately, when using SVMs you can apply an almost miraculous mathematical technique called the kernel trick (it is explained in a moment). It makes it possible to get the same result as if you added many polynomial features, even with very high-degree polynomials, without actually having to add them. So there is no combinatorial explosion of the number of features since you don't actually add any features. This trick is implemented by the SVC class. Let's test it on the moons dataset:

from sklearn.svm import SVC

poly_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="poly", degree=3, coef0=1, C=5)),
])

poly_kernel_svm_clf.fit(X, y)

This code trains an SVM classifier using a 3rd-degree polynomial kernel. It is represented on the left of Figure 5-7. On the right is another SVM classifier using a 10th-degree polynomial kernel. Obviously, if your model is overfitting, you might want to reduce the polynomial degree. Conversely, if it is underfitting, you can try increasing it. The hyperparameter coef0 controls how much the model is influenced by high-degree polynomials versus low-degree polynomials.

Figure 5-7. SVM classifiers with a polynomial kernel

TIP
A common approach to find the right hyperparameter values is to use grid search (see Chapter 2). It is often faster to first do a very coarse grid search, then a finer grid search around the best values found. Having a good sense of what each hyperparameter actually does can also help you search in the right part of the hyperparameter space.


Adding Similarity Features

Another technique to tackle nonlinear problems is to add features computed using a similarity function that measures how much each instance resembles a particular landmark. For example, let's take the one-dimensional dataset discussed earlier and add two landmarks to it at x1 = –2 and x1 = 1 (see the left plot in Figure 5-8). Next, let's define the similarity function to be the Gaussian Radial Basis Function (RBF) with γ = 0.3 (see Equation 5-1).

Equation 5-1. Gaussian RBF

$\phi_{\gamma}(\mathbf{x}, \boldsymbol{\ell}) = \exp\left( -\gamma \left\lVert \mathbf{x} - \boldsymbol{\ell} \right\rVert^2 \right)$

It is a bell-shaped function varying from 0 (very far away from the landmark) to 1 (at the landmark). Now we are ready to compute the new features. For example, let's look at the instance x1 = –1: it is located at a distance of 1 from the first landmark, and 2 from the second landmark. Therefore its new features are x2 = exp(–0.3 × 1²) ≈ 0.74 and x3 = exp(–0.3 × 2²) ≈ 0.30. The plot on the right of Figure 5-8 shows the transformed dataset (dropping the original features). As you can see, it is now linearly separable.

Figure 5-8. Similarity features using the Gaussian RBF
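The sketch below reproduces this computation for the instance x1 = –1 (the landmark locations and γ come from the text):

import numpy as np

def gaussian_rbf(x, landmark, gamma):
    # Equation 5-1: exp(-gamma * ||x - landmark||^2)
    return np.exp(-gamma * np.linalg.norm(x - landmark) ** 2)

gamma = 0.3
x = np.array([-1.0])
x2 = gaussian_rbf(x, np.array([-2.0]), gamma)  # ≈ 0.74 (distance 1 to the first landmark)
x3 = gaussian_rbf(x, np.array([1.0]), gamma)   # ≈ 0.30 (distance 2 to the second landmark)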

You may wonder how to select the landmarks. The simplest approach is to create a landmark at the location of each and every instance in the dataset. This creates many dimensions and thus increases the chances that the transformed training set will be linearly separable. The downside is that a training set with m instances and n features gets transformed into a training set with m instances and m features (assuming you drop the original features). If your training set is very large, you end up with an equally large number of features.


Gaussian RBF Kernel

Just like the polynomial features method, the similarity features method can be useful with any Machine Learning algorithm, but it may be computationally expensive to compute all the additional features, especially on large training sets. However, once again the kernel trick does its SVM magic: it makes it possible to obtain a similar result as if you had added many similarity features, without actually having to add them. Let's try the Gaussian RBF kernel using the SVC class:

rbf_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="rbf", gamma=5, C=0.001)),
])

rbf_kernel_svm_clf.fit(X, y)

This model is represented on the bottom left of Figure 5-9. The other plots show models trained with different values of hyperparameters gamma (γ) and C. Increasing gamma makes the bell-shaped curve narrower (see the left plot of Figure 5-8), and as a result each instance's range of influence is smaller: the decision boundary ends up being more irregular, wiggling around individual instances. Conversely, a small gamma value makes the bell-shaped curve wider, so instances have a larger range of influence, and the decision boundary ends up smoother. So γ acts like a regularization hyperparameter: if your model is overfitting, you should reduce it, and if it is underfitting, you should increase it (similar to the C hyperparameter).

Figure 5-9. SVM classifiers using an RBF kernel

Other kernels exist but are used much more rarely. For example, some kernels are specialized for specific data structures. String kernels are sometimes used when classifying text documents or DNA sequences (e.g., using the string subsequence kernel or kernels based on the Levenshtein distance).


TIP
With so many kernels to choose from, how can you decide which one to use? As a rule of thumb, you should always try the linear kernel first (remember that LinearSVC is much faster than SVC(kernel="linear")), especially if the training set is very large or if it has plenty of features. If the training set is not too large, you should try the Gaussian RBF kernel as well; it works well in most cases. Then if you have spare time and computing power, you can also experiment with a few other kernels using cross-validation and grid search, especially if there are kernels specialized for your training set's data structure.


Computational Complexity

The LinearSVC class is based on the liblinear library, which implements an optimized algorithm for linear SVMs.1 It does not support the kernel trick, but it scales almost linearly with the number of training instances and the number of features: its training time complexity is roughly O(m × n).

The algorithm takes longer if you require a very high precision. This is controlled by the tolerance hyperparameter ϵ (called tol in Scikit-Learn). In most classification tasks, the default tolerance is fine.

The SVC class is based on the libsvm library, which implements an algorithm that supports the kernel trick.2 The training time complexity is usually between O(m² × n) and O(m³ × n). Unfortunately, this means that it gets dreadfully slow when the number of training instances gets large (e.g., hundreds of thousands of instances). This algorithm is perfect for complex but small or medium training sets. However, it scales well with the number of features, especially with sparse features (i.e., when each instance has few nonzero features). In this case, the algorithm scales roughly with the average number of nonzero features per instance. Table 5-1 compares Scikit-Learn's SVM classification classes.

Table 5-1. Comparison of Scikit-Learn classes for SVM classification

Class          Time complexity           Out-of-core support   Scaling required   Kernel trick
LinearSVC      O(m × n)                  No                    Yes                No
SGDClassifier  O(m × n)                  Yes                   Yes                No
SVC            O(m² × n) to O(m³ × n)    No                    Yes                Yes


SVM Regression

As we mentioned earlier, the SVM algorithm is quite versatile: not only does it support linear and nonlinear classification, but it also supports linear and nonlinear regression. The trick is to reverse the objective: instead of trying to fit the largest possible street between two classes while limiting margin violations, SVM Regression tries to fit as many instances as possible on the street while limiting margin violations (i.e., instances off the street). The width of the street is controlled by a hyperparameter ϵ. Figure 5-10 shows two linear SVM Regression models trained on some random linear data, one with a large margin (ϵ = 1.5) and the other with a small margin (ϵ = 0.5).

Figure 5-10. SVM Regression

Adding more training instances within the margin does not affect the model's predictions; thus, the model is said to be ϵ-insensitive.

YoucanuseScikit-Learn’sLinearSVRclasstoperformlinearSVMRegression.ThefollowingcodeproducesthemodelrepresentedontheleftofFigure5-10(thetrainingdatashouldbescaledandcenteredfirst):

from sklearn.svm import LinearSVR

svm_reg = LinearSVR(epsilon=1.5)
svm_reg.fit(X, y)

To tackle nonlinear regression tasks, you can use a kernelized SVM model. For example, Figure 5-11 shows SVM Regression on a random quadratic training set, using a 2nd-degree polynomial kernel. There is little regularization on the left plot (i.e., a large C value), and much more regularization on the right plot (i.e., a small C value).


Figure 5-11. SVM regression using a 2nd-degree polynomial kernel

The following code produces the model represented on the left of Figure 5-11 using Scikit-Learn's SVR class (which supports the kernel trick). The SVR class is the regression equivalent of the SVC class, and the LinearSVR class is the regression equivalent of the LinearSVC class. The LinearSVR class scales linearly with the size of the training set (just like the LinearSVC class), while the SVR class gets much too slow when the training set grows large (just like the SVC class).

from sklearn.svm import SVR

svm_poly_reg = SVR(kernel="poly", degree=2, C=100, epsilon=0.1)
svm_poly_reg.fit(X, y)

NOTE
SVMs can also be used for outlier detection; see Scikit-Learn's documentation for more details.


Under the Hood

This section explains how SVMs make predictions and how their training algorithms work, starting with linear SVM classifiers. You can safely skip it and go straight to the exercises at the end of this chapter if you are just getting started with Machine Learning, and come back later when you want to get a deeper understanding of SVMs.

First, a word about notation: in Chapter 4 we used the convention of putting all the model parameters in one vector θ, including the bias term θ0 and the input feature weights θ1 to θn, and adding a bias input x0 = 1 to all instances. In this chapter, we will use a different convention, which is more convenient (and more common) when you are dealing with SVMs: the bias term will be called b and the feature weights vector will be called w. No bias feature will be added to the input feature vectors.


Decision Function and Predictions

The linear SVM classifier model predicts the class of a new instance x by simply computing the decision function wᵀ·x + b = w1x1 + ⋯ + wnxn + b: if the result is positive, the predicted class ŷ is the positive class (1), or else it is the negative class (0); see Equation 5-2.

Equation 5-2. Linear SVM classifier prediction

$\hat{y} = \begin{cases} 0 & \text{if } \mathbf{w}^T \cdot \mathbf{x} + b < 0 \\ 1 & \text{if } \mathbf{w}^T \cdot \mathbf{x} + b \geq 0 \end{cases}$

Figure 5-12 shows the decision function that corresponds to the model on the right of Figure 5-4: it is a two-dimensional plane since this dataset has two features (petal width and petal length). The decision boundary is the set of points where the decision function is equal to 0: it is the intersection of two planes, which is a straight line (represented by the thick solid line).3

Figure 5-12. Decision function for the iris dataset

The dashed lines represent the points where the decision function is equal to 1 or –1: they are parallel and at equal distances to the decision boundary, forming a margin around it. Training a linear SVM classifier means finding the values of w and b that make this margin as wide as possible while avoiding margin violations (hard margin) or limiting them (soft margin).


Training Objective

Consider the slope of the decision function: it is equal to the norm of the weight vector, ‖w‖. If we divide this slope by 2, the points where the decision function is equal to ±1 are going to be twice as far away from the decision boundary. In other words, dividing the slope by 2 will multiply the margin by 2. Perhaps this is easier to visualize in 2D in Figure 5-13. The smaller the weight vector w, the larger the margin.

Figure 5-13. A smaller weight vector results in a larger margin

So we want to minimize ‖w‖ to get a large margin. However, if we also want to avoid any margin violation (hard margin), then we need the decision function to be greater than 1 for all positive training instances, and lower than –1 for negative training instances. If we define t⁽ⁱ⁾ = –1 for negative instances (if y⁽ⁱ⁾ = 0) and t⁽ⁱ⁾ = 1 for positive instances (if y⁽ⁱ⁾ = 1), then we can express this constraint as t⁽ⁱ⁾(wᵀ·x⁽ⁱ⁾ + b) ≥ 1 for all instances.

We can therefore express the hard margin linear SVM classifier objective as the constrained optimization problem in Equation 5-3.

Equation 5-3. Hard margin linear SVM classifier objective

$\underset{\mathbf{w},\, b}{\text{minimize}} \quad \frac{1}{2} \mathbf{w}^T \cdot \mathbf{w} \qquad \text{subject to} \quad t^{(i)} \left( \mathbf{w}^T \cdot \mathbf{x}^{(i)} + b \right) \geq 1 \quad \text{for } i = 1, 2, \cdots, m$

NOTE
We are minimizing ½ wᵀ·w, which is equal to ½‖w‖², rather than minimizing ‖w‖. This is because it will give the same result (since the values of w and b that minimize a value also minimize half of its square), but ½‖w‖² has a nice and simple derivative (it is just w) while ‖w‖ is not differentiable at w = 0. Optimization algorithms work much better on differentiable functions.

To get the soft margin objective, we need to introduce a slack variable ζ⁽ⁱ⁾ ≥ 0 for each instance:4 ζ⁽ⁱ⁾ measures how much the ith instance is allowed to violate the margin. We now have two conflicting objectives: making the slack variables as small as possible to reduce the margin violations, and making ½ wᵀ·w as small as possible to increase the margin. This is where the C hyperparameter comes in: it allows us to define the tradeoff between these two objectives. This gives us the constrained optimization problem in Equation 5-4.

Equation 5-4. Soft margin linear SVM classifier objective

$\underset{\mathbf{w},\, b,\, \zeta}{\text{minimize}} \quad \frac{1}{2} \mathbf{w}^T \cdot \mathbf{w} + C \sum_{i=1}^{m} \zeta^{(i)} \qquad \text{subject to} \quad t^{(i)} \left( \mathbf{w}^T \cdot \mathbf{x}^{(i)} + b \right) \geq 1 - \zeta^{(i)} \ \text{and}\ \zeta^{(i)} \geq 0 \quad \text{for } i = 1, 2, \cdots, m$


Quadratic Programming

The hard margin and soft margin problems are both convex quadratic optimization problems with linear constraints. Such problems are known as Quadratic Programming (QP) problems. Many off-the-shelf solvers are available to solve QP problems using a variety of techniques that are outside the scope of this book.5 The general problem formulation is given by Equation 5-5.

Equation 5-5. Quadratic Programming problem

$\underset{\mathbf{p}}{\text{minimize}} \quad \frac{1}{2} \mathbf{p}^T \cdot \mathbf{H} \cdot \mathbf{p} + \mathbf{f}^T \cdot \mathbf{p} \qquad \text{subject to} \quad \mathbf{A} \cdot \mathbf{p} \leq \mathbf{b}$

where p is an np-dimensional vector (np = number of parameters), H is an np × np matrix, f is an np-dimensional vector, A is an nc × np matrix (nc = number of constraints), and b is an nc-dimensional vector.

Note that the expression A·p ≤ b actually defines nc constraints: pᵀ·a⁽ⁱ⁾ ≤ b⁽ⁱ⁾ for i = 1, 2, ⋯, nc, where a⁽ⁱ⁾ is the vector containing the elements of the ith row of A and b⁽ⁱ⁾ is the ith element of b.

You can easily verify that if you set the QP parameters in the following way, you get the hard margin linear SVM classifier objective:

np = n + 1, where n is the number of features (the +1 is for the bias term).

nc = m, where m is the number of training instances.

H is the np × np identity matrix, except with a zero in the top-left cell (to ignore the bias term).

f = 0, an np-dimensional vector full of 0s.

b = –1, an nc-dimensional vector full of –1s.

a⁽ⁱ⁾ = –t⁽ⁱ⁾ ẋ⁽ⁱ⁾, where ẋ⁽ⁱ⁾ is equal to x⁽ⁱ⁾ with an extra bias feature ẋ0 = 1.

So one way to train a hard margin linear SVM classifier is just to use an off-the-shelf QP solver by passing it the preceding parameters; a sketch using one such solver follows. The resulting vector p will contain the bias term b = p0 and the feature weights wi = pi for i = 1, 2, ⋯, n. Similarly, you can use a QP solver to solve the soft margin problem (see the exercises at the end of the chapter).
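Here is a hedged sketch using cvxopt, one of many off-the-shelf QP solvers (it is an assumption that you have it installed; X is an m × n training set and t a vector of ±1 labels, both assumed prepared):

import numpy as np
from cvxopt import matrix, solvers

m, n = X.shape
H = np.identity(n + 1)
H[0, 0] = 0                          # zero in the top-left cell to ignore the bias term
f = np.zeros(n + 1)
X_dot = np.c_[np.ones((m, 1)), X]    # x_dot(i): x(i) with an extra bias feature x0 = 1
A = -t.reshape(-1, 1) * X_dot        # a(i) = -t(i) x_dot(i)
b = -np.ones(m)                      # A p <= b encodes t(i)(w.x(i) + b) >= 1
result = solvers.qp(matrix(H), matrix(f), matrix(A), matrix(b))
p = np.array(result["x"]).ravel()    # p[0] is the bias term, p[1:] the feature weights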

However, to use the kernel trick we are going to look at a different constrained optimization problem.


The Dual Problem

Given a constrained optimization problem, known as the primal problem, it is possible to express a different but closely related problem, called its dual problem. The solution to the dual problem typically gives a lower bound to the solution of the primal problem, but under some conditions it can even have the same solutions as the primal problem. Luckily, the SVM problem happens to meet these conditions,6 so you can choose to solve the primal problem or the dual problem; both will have the same solution. Equation 5-6 shows the dual form of the linear SVM objective (if you are interested in knowing how to derive the dual problem from the primal problem, see Appendix C).

Equation 5-6. Dual form of the linear SVM objective

$\underset{\alpha}{\text{minimize}} \quad \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha^{(i)} \alpha^{(j)} t^{(i)} t^{(j)}\, \mathbf{x}^{(i)T} \cdot \mathbf{x}^{(j)} - \sum_{i=1}^{m} \alpha^{(i)} \qquad \text{subject to} \quad \alpha^{(i)} \geq 0 \quad \text{for } i = 1, 2, \cdots, m$

Once you find the vector α̂ that minimizes this equation (using a QP solver), you can compute ŵ and b̂ that minimize the primal problem by using Equation 5-7.

Equation 5-7. From the dual solution to the primal solution

$\hat{\mathbf{w}} = \sum_{i=1}^{m} \hat{\alpha}^{(i)} t^{(i)} \mathbf{x}^{(i)} \qquad \hat{b} = \frac{1}{n_s} \sum_{\substack{i=1 \\ \hat{\alpha}^{(i)} > 0}}^{m} \left( t^{(i)} - \hat{\mathbf{w}}^T \cdot \mathbf{x}^{(i)} \right)$

(where ns is the number of support vectors)

The dual problem is faster to solve than the primal when the number of training instances is smaller than the number of features. More importantly, it makes the kernel trick possible, while the primal does not. So what is this kernel trick anyway?


Kernelized SVM

Suppose you want to apply a 2nd-degree polynomial transformation to a two-dimensional training set (such as the moons training set), then train a linear SVM classifier on the transformed training set. Equation 5-8 shows the 2nd-degree polynomial mapping function ϕ that you want to apply.

Equation 5-8. Second-degree polynomial mapping

$\phi(\mathbf{x}) = \phi\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} x_1^2 \\ \sqrt{2}\, x_1 x_2 \\ x_2^2 \end{pmatrix}$

Notice that the transformed vector is three-dimensional instead of two-dimensional. Now let's look at what happens to a couple of two-dimensional vectors, a and b, if we apply this 2nd-degree polynomial mapping and then compute the dot product of the transformed vectors (see Equation 5-9).

Equation 5-9. Kernel trick for a 2nd-degree polynomial mapping

$\phi(\mathbf{a})^T \cdot \phi(\mathbf{b}) = a_1^2 b_1^2 + 2 a_1 b_1 a_2 b_2 + a_2^2 b_2^2 = \left( a_1 b_1 + a_2 b_2 \right)^2 = \left( \mathbf{a}^T \cdot \mathbf{b} \right)^2$

How about that? The dot product of the transformed vectors is equal to the square of the dot product of the original vectors: ϕ(a)ᵀ·ϕ(b) = (aᵀ·b)².
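You can check this identity numerically with a couple of arbitrary 2D vectors:

import numpy as np

def phi(v):
    # Equation 5-8: 2nd-degree polynomial mapping
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

a = np.array([2.0, 3.0])     # arbitrary example vectors
b = np.array([4.0, 1.0])

lhs = phi(a).dot(phi(b))     # dot product in the transformed 3D space
rhs = a.dot(b) ** 2          # square of the dot product in the original 2D space
assert np.isclose(lhs, rhs)  # both equal 121.0 here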

Now here is the key insight: if you apply the transformation ϕ to all training instances, then the dual problem (see Equation 5-6) will contain the dot product ϕ(x⁽ⁱ⁾)ᵀ·ϕ(x⁽ʲ⁾). But if ϕ is the 2nd-degree polynomial transformation defined in Equation 5-8, then you can replace this dot product of transformed vectors simply by (x⁽ⁱ⁾ᵀ·x⁽ʲ⁾)². So you don't actually need to transform the training instances at all: just replace the dot product by its square in Equation 5-6. The result will be strictly the same as if you went through the trouble of actually transforming the training set then fitting a linear SVM algorithm, but this trick makes the whole process much more computationally efficient. This is the essence of the kernel trick.

The function K(a, b) = (aᵀ·b)² is called a 2nd-degree polynomial kernel. In Machine Learning, a kernel is a function capable of computing the dot product ϕ(a)ᵀ·ϕ(b) based only on the original vectors a and b, without having to compute (or even to know about) the transformation ϕ. Equation 5-10 lists some of the most commonly used kernels.

Equation 5-10. Common kernels

Linear: $K(\mathbf{a}, \mathbf{b}) = \mathbf{a}^T \cdot \mathbf{b}$
Polynomial: $K(\mathbf{a}, \mathbf{b}) = \left( \gamma\, \mathbf{a}^T \cdot \mathbf{b} + r \right)^d$
Gaussian RBF: $K(\mathbf{a}, \mathbf{b}) = \exp\left( -\gamma \left\lVert \mathbf{a} - \mathbf{b} \right\rVert^2 \right)$
Sigmoid: $K(\mathbf{a}, \mathbf{b}) = \tanh\left( \gamma\, \mathbf{a}^T \cdot \mathbf{b} + r \right)$

MERCER'S THEOREM

According to Mercer's theorem, if a function K(a, b) respects a few mathematical conditions called Mercer's conditions (K must be continuous, symmetric in its arguments so K(a, b) = K(b, a), etc.), then there exists a function ϕ that maps a and b into another space (possibly with much higher dimensions) such that K(a, b) = ϕ(a)ᵀ·ϕ(b). So you can use K as a kernel since you know ϕ exists, even if you don't know what ϕ is. In the case of the Gaussian RBF kernel, it can be shown that ϕ actually maps each training instance to an infinite-dimensional space, so it's a good thing you don't need to actually perform the mapping!

Note that some frequently used kernels (such as the Sigmoid kernel) don't respect all of Mercer's conditions, yet they generally work well in practice.

There is still one loose end we must tie. Equation 5-7 shows how to go from the dual solution to the primal solution in the case of a linear SVM classifier, but if you apply the kernel trick you end up with equations that include ϕ(x⁽ⁱ⁾). In fact, ŵ must have the same number of dimensions as ϕ(x⁽ⁱ⁾), which may be huge or even infinite, so you can't compute it. But how can you make predictions without knowing ŵ? Well, the good news is that you can plug in the formula for ŵ from Equation 5-7 into the decision function for a new instance x⁽ⁿ⁾, and you get an equation with only dot products between input vectors. This makes it possible to use the kernel trick, once again (Equation 5-11).

Equation 5-11. Making predictions with a kernelized SVM

$h_{\hat{\mathbf{w}},\, \hat{b}}\left( \phi(\mathbf{x}^{(n)}) \right) = \hat{\mathbf{w}}^T \cdot \phi(\mathbf{x}^{(n)}) + \hat{b} = \sum_{\substack{i=1 \\ \hat{\alpha}^{(i)} > 0}}^{m} \hat{\alpha}^{(i)} t^{(i)} K\left( \mathbf{x}^{(i)}, \mathbf{x}^{(n)} \right) + \hat{b}$

Note that since α⁽ⁱ⁾ ≠ 0 only for support vectors, making predictions involves computing the dot product of the new input vector x⁽ⁿ⁾ with only the support vectors, not all the training instances. Of course, you also need to compute the bias term b̂, using the same trick (Equation 5-12).

Equation 5-12. Computing the bias term using the kernel trick

$\hat{b} = \frac{1}{n_s} \sum_{\substack{i=1 \\ \hat{\alpha}^{(i)} > 0}}^{m} \left( t^{(i)} - \sum_{\substack{j=1 \\ \hat{\alpha}^{(j)} > 0}}^{m} \hat{\alpha}^{(j)} t^{(j)} K\left( \mathbf{x}^{(i)}, \mathbf{x}^{(j)} \right) \right)$


If you are starting to get a headache, it's perfectly normal: it's an unfortunate side effect of the kernel trick.


Online SVMs

Before concluding this chapter, let's take a quick look at online SVM classifiers (recall that online learning means learning incrementally, typically as new instances arrive).

For linear SVM classifiers, one method is to use Gradient Descent (e.g., using SGDClassifier) to minimize the cost function in Equation 5-13, which is derived from the primal problem. Unfortunately it converges much more slowly than the methods based on QP.

Equation 5-13. Linear SVM classifier cost function

$J(\mathbf{w}, b) = \frac{1}{2} \mathbf{w}^T \cdot \mathbf{w} + C \sum_{i=1}^{m} \max\left( 0,\ 1 - t^{(i)} \left( \mathbf{w}^T \cdot \mathbf{x}^{(i)} + b \right) \right)$

The first sum in the cost function will push the model to have a small weight vector w, leading to a larger margin. The second sum computes the total of all margin violations. An instance's margin violation is equal to 0 if it is located off the street and on the correct side, or else it is proportional to the distance to the correct side of the street. Minimizing this term ensures that the model makes the margin violations as small and as few as possible.

HINGE LOSS

The function max(0, 1 – t) is called the hinge loss function (a sketch follows below). It is equal to 0 when t ≥ 1. Its derivative (slope) is equal to –1 if t < 1 and 0 if t > 1. It is not differentiable at t = 1, but just like for Lasso Regression (see "Lasso Regression") you can still use Gradient Descent using any subderivative at t = 1 (i.e., any value between –1 and 0).
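Since the original plot is not reproduced here, a tiny sketch captures the function and a valid subderivative choice at t = 1:

import numpy as np

def hinge_loss(t):
    return np.maximum(0, 1 - t)        # 0 when t >= 1, slope -1 when t < 1

def hinge_slope(t):
    return np.where(t < 1, -1.0, 0.0)  # uses subderivative 0 at t = 1 (any value in [-1, 0] is valid)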

It is also possible to implement online kernelized SVMs, for example using “Incremental and Decremental SVM Learning”⁷ or “Fast Kernel Classifiers with Online and Active Learning.”⁸ However, these are implemented in Matlab and C++. For large-scale nonlinear problems, you may want to consider using neural networks instead (see Part II).

Exercises

1. What is the fundamental idea behind Support Vector Machines?

2. What is a support vector?

3. Why is it important to scale the inputs when using SVMs?

4. Can an SVM classifier output a confidence score when it classifies an instance? What about a probability?

5. Should you use the primal or the dual form of the SVM problem to train a model on a training set with millions of instances and hundreds of features?

6. Say you trained an SVM classifier with an RBF kernel. It seems to underfit the training set: should you increase or decrease γ (gamma)? What about C?

7. How should you set the QP parameters (H, f, A, and b) to solve the soft margin linear SVM classifier problem using an off-the-shelf QP solver?

8. Train a LinearSVC on a linearly separable dataset. Then train an SVC and a SGDClassifier on the same dataset. See if you can get them to produce roughly the same model.

9. Train an SVM classifier on the MNIST dataset. Since SVM classifiers are binary classifiers, you will need to use one-versus-all to classify all 10 digits. You may want to tune the hyperparameters using small validation sets to speed up the process. What accuracy can you reach?

10. Train an SVM regressor on the California housing dataset.

Solutions to these exercises are available in Appendix A.

1. “A Dual Coordinate Descent Method for Large-scale Linear SVM,” Lin et al. (2008).

2. “Sequential Minimal Optimization (SMO),” J. Platt (1998).

3. More generally, when there are n features, the decision function is an n-dimensional hyperplane, and the decision boundary is an (n – 1)-dimensional hyperplane.

4. Zeta (ζ) is the 8th letter of the Greek alphabet.

5. To learn more about Quadratic Programming, you can start by reading Stephen Boyd and Lieven Vandenberghe, Convex Optimization (Cambridge, UK: Cambridge University Press, 2004) or watch Richard Brown’s series of video lectures.

6. The objective function is convex, and the inequality constraints are continuously differentiable and convex functions.

7. “Incremental and Decremental Support Vector Machine Learning,” G. Cauwenberghs, T. Poggio (2001).

8. “Fast Kernel Classifiers with Online and Active Learning,” A. Bordes, S. Ertekin, J. Weston, L. Bottou (2005).

Chapter 6. Decision Trees

Like SVMs, Decision Trees are versatile Machine Learning algorithms that can perform both classification and regression tasks, and even multioutput tasks. They are very powerful algorithms, capable of fitting complex datasets. For example, in Chapter 2 you trained a DecisionTreeRegressor model on the California housing dataset, fitting it perfectly (actually overfitting it).

Decision Trees are also the fundamental components of Random Forests (see Chapter 7), which are among the most powerful Machine Learning algorithms available today.

In this chapter we will start by discussing how to train, visualize, and make predictions with Decision Trees. Then we will go through the CART training algorithm used by Scikit-Learn, and we will discuss how to regularize trees and use them for regression tasks. Finally, we will discuss some of the limitations of Decision Trees.

Training and Visualizing a Decision Tree

To understand Decision Trees, let’s just build one and take a look at how it makes predictions. The following code trains a DecisionTreeClassifier on the iris dataset (see Chapter 4):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data[:, 2:]  # petal length and width
y = iris.target

tree_clf = DecisionTreeClassifier(max_depth=2)
tree_clf.fit(X, y)

You can visualize the trained Decision Tree by first using the export_graphviz() method to output a graph definition file called iris_tree.dot:

from sklearn.tree import export_graphviz

export_graphviz(
        tree_clf,
        out_file=image_path("iris_tree.dot"),  # image_path() is a small helper that builds the output file path
        feature_names=iris.feature_names[2:],
        class_names=iris.target_names,
        rounded=True,
        filled=True
    )

Then you can convert this .dot file to a variety of formats such as PDF or PNG using the dot command-line tool from the graphviz package.¹ This command line converts the .dot file to a .png image file:

$ dot -Tpng iris_tree.dot -o iris_tree.png

Your first decision tree looks like Figure 6-1.

Figure 6-1. Iris Decision Tree

Making Predictions

Let’s see how the tree represented in Figure 6-1 makes predictions. Suppose you find an iris flower and you want to classify it. You start at the root node (depth 0, at the top): this node asks whether the flower’s petal length is smaller than 2.45 cm. If it is, then you move down to the root’s left child node (depth 1, left). In this case, it is a leaf node (i.e., it does not have any children nodes), so it does not ask any questions: you can simply look at the predicted class for that node, and the Decision Tree predicts that your flower is an Iris-Setosa (class=setosa).

Now suppose you find another flower, but this time the petal length is greater than 2.45 cm. You must move down to the root’s right child node (depth 1, right), which is not a leaf node, so it asks another question: is the petal width smaller than 1.75 cm? If it is, then your flower is most likely an Iris-Versicolor (depth 2, left). If not, it is likely an Iris-Virginica (depth 2, right). It’s really that simple.

NOTE
One of the many qualities of Decision Trees is that they require very little data preparation. In particular, they don’t require feature scaling or centering at all.

A node’s samples attribute counts how many training instances it applies to. For example, 100 training instances have a petal length greater than 2.45 cm (depth 1, right), among which 54 have a petal width smaller than 1.75 cm (depth 2, left). A node’s value attribute tells you how many training instances of each class this node applies to: for example, the bottom-right node applies to 0 Iris-Setosa, 1 Iris-Versicolor, and 45 Iris-Virginica. Finally, a node’s gini attribute measures its impurity: a node is “pure” (gini=0) if all training instances it applies to belong to the same class. For example, since the depth-1 left node applies only to Iris-Setosa training instances, it is pure and its gini score is 0. Equation 6-1 shows how the training algorithm computes the gini score Gᵢ of the ith node. For example, the depth-2 left node has a gini score equal to 1 – (0/54)² – (49/54)² – (5/54)² ≈ 0.168. Another impurity measure is discussed shortly.

Equation 6-1. Gini impurity

$$G_i = 1 - \sum_{k=1}^{n} p_{i,k}^{\,2}$$

p_{i,k} is the ratio of class k instances among the training instances in the ith node.
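For instance, here is a quick sketch (not from the book's code) that reproduces the 0.168 figure from the class counts above:

import numpy as np

# A sketch: Gini impurity of a node given its class counts.
def gini(class_counts):
    ratios = np.array(class_counts) / sum(class_counts)
    return 1 - np.sum(ratios ** 2)

print(gini([0, 49, 5]))  # the depth-2 left node: ~0.168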

NOTE
Scikit-Learn uses the CART algorithm, which produces only binary trees: nonleaf nodes always have two children (i.e., questions only have yes/no answers). However, other algorithms such as ID3 can produce Decision Trees with nodes that have more than two children.

Figure 6-2 shows this Decision Tree’s decision boundaries. The thick vertical line represents the decision boundary of the root node (depth 0): petal length = 2.45 cm. Since the left area is pure (only Iris-Setosa), it cannot be split any further. However, the right area is impure, so the depth-1 right node splits it at petal width = 1.75 cm (represented by the dashed line). Since max_depth was set to 2, the Decision Tree stops right there. However, if you set max_depth to 3, then the two depth-2 nodes would each add another decision boundary (represented by the dotted lines).

Figure 6-2. Decision Tree decision boundaries

MODEL INTERPRETATION: WHITE BOX VERSUS BLACK BOX

As you can see, Decision Trees are fairly intuitive and their decisions are easy to interpret. Such models are often called white box models. In contrast, as we will see, Random Forests or neural networks are generally considered black box models. They make great predictions, and you can easily check the calculations that they performed to make these predictions; nevertheless, it is usually hard to explain in simple terms why the predictions were made. For example, if a neural network says that a particular person appears in a picture, it is hard to know what actually contributed to this prediction: did the model recognize that person’s eyes? Her mouth? Her nose? Her shoes? Or even the couch that she was sitting on? Conversely, Decision Trees provide nice and simple classification rules that can even be applied manually if need be (e.g., for flower classification).

Estimating Class Probabilities

A Decision Tree can also estimate the probability that an instance belongs to a particular class k: first it traverses the tree to find the leaf node for this instance, and then it returns the ratio of training instances of class k in this node. For example, suppose you have found a flower whose petals are 5 cm long and 1.5 cm wide. The corresponding leaf node is the depth-2 left node, so the Decision Tree should output the following probabilities: 0% for Iris-Setosa (0/54), 90.7% for Iris-Versicolor (49/54), and 9.3% for Iris-Virginica (5/54). And of course if you ask it to predict the class, it should output Iris-Versicolor (class 1) since it has the highest probability. Let’s check this:

>>> tree_clf.predict_proba([[5, 1.5]])
array([[ 0.        ,  0.90740741,  0.09259259]])
>>> tree_clf.predict([[5, 1.5]])
array([1])

Perfect! Notice that the estimated probabilities would be identical anywhere else in the bottom-right rectangle of Figure 6-2; for example, if the petals were 6 cm long and 1.5 cm wide (even though it seems obvious that it would most likely be an Iris-Virginica in this case).

The CART Training Algorithm

Scikit-Learn uses the Classification And Regression Tree (CART) algorithm to train Decision Trees (also called “growing” trees). The idea is really quite simple: the algorithm first splits the training set in two subsets using a single feature k and a threshold t_k (e.g., “petal length ≤ 2.45 cm”). How does it choose k and t_k? It searches for the pair (k, t_k) that produces the purest subsets (weighted by their size). The cost function that the algorithm tries to minimize is given by Equation 6-2.

Equation 6-2. CART cost function for classification

$$J(k, t_k) = \frac{m_{\text{left}}}{m}\, G_{\text{left}} + \frac{m_{\text{right}}}{m}\, G_{\text{right}}$$

where G_left/right measures the impurity of the left/right subset, and m_left/right is the number of instances in the left/right subset.

Once it has successfully split the training set in two, it splits the subsets using the same logic, then the sub-subsets, and so on, recursively. It stops recursing once it reaches the maximum depth (defined by the max_depth hyperparameter), or if it cannot find a split that will reduce impurity. A few other hyperparameters (described in a moment) control additional stopping conditions (min_samples_split, min_samples_leaf, min_weight_fraction_leaf, and max_leaf_nodes).

WARNING
As you can see, the CART algorithm is a greedy algorithm: it greedily searches for an optimum split at the top level, then repeats the process at each level. It does not check whether or not the split will lead to the lowest possible impurity several levels down. A greedy algorithm often produces a reasonably good solution, but it is not guaranteed to be the optimal solution.

Unfortunately, finding the optimal tree is known to be an NP-Complete problem:² it requires O(exp(m)) time, making the problem intractable even for fairly small training sets. This is why we must settle for a “reasonably good” solution.

Computational Complexity

Making predictions requires traversing the Decision Tree from the root to a leaf. Decision Trees are generally approximately balanced, so traversing the Decision Tree requires going through roughly O(log₂(m)) nodes.³ Since each node only requires checking the value of one feature, the overall prediction complexity is just O(log₂(m)), independent of the number of features. So predictions are very fast, even when dealing with large training sets.

However, the training algorithm compares all features (or fewer if max_features is set) on all samples at each node. This results in a training complexity of O(n × m log(m)). For small training sets (less than a few thousand instances), Scikit-Learn can speed up training by presorting the data (set presort=True), but this slows down training considerably for larger training sets.

Gini Impurity or Entropy?

By default, the Gini impurity measure is used, but you can select the entropy impurity measure instead by setting the criterion hyperparameter to "entropy". The concept of entropy originated in thermodynamics as a measure of molecular disorder: entropy approaches zero when molecules are still and well ordered. It later spread to a wide variety of domains, including Shannon’s information theory, where it measures the average information content of a message:⁴ entropy is zero when all messages are identical. In Machine Learning, it is frequently used as an impurity measure: a set’s entropy is zero when it contains instances of only one class. Equation 6-3 shows the definition of the entropy of the ith node.

For example, the depth-2 left node in Figure 6-1 has an entropy equal to –(49/54) log(49/54) – (5/54) log(5/54) ≈ 0.31.

Equation 6-3. Entropy

$$H_i = - \sum_{\substack{k=1 \\ p_{i,k} \neq 0}}^{n} p_{i,k}\, \log\left(p_{i,k}\right)$$

So should you use Gini impurity or entropy? The truth is, most of the time it does not make a big difference: they lead to similar trees. Gini impurity is slightly faster to compute, so it is a good default. However, when they differ, Gini impurity tends to isolate the most frequent class in its own branch of the tree, while entropy tends to produce slightly more balanced trees.⁵
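Switching impurity measures is a one-liner (a sketch, reusing the iris variables X and y from earlier):

from sklearn.tree import DecisionTreeClassifier

# A sketch: train with the entropy impurity measure instead of Gini.
tree_clf = DecisionTreeClassifier(criterion="entropy", max_depth=2)
tree_clf.fit(X, y)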

Regularization Hyperparameters

Decision Trees make very few assumptions about the training data (as opposed to linear models, which obviously assume that the data is linear, for example). If left unconstrained, the tree structure will adapt itself to the training data, fitting it very closely, and most likely overfitting it. Such a model is often called a nonparametric model, not because it does not have any parameters (it often has a lot) but because the number of parameters is not determined prior to training, so the model structure is free to stick closely to the data. In contrast, a parametric model such as a linear model has a predetermined number of parameters, so its degree of freedom is limited, reducing the risk of overfitting (but increasing the risk of underfitting).

To avoid overfitting the training data, you need to restrict the Decision Tree’s freedom during training. As you know by now, this is called regularization. The regularization hyperparameters depend on the algorithm used, but generally you can at least restrict the maximum depth of the Decision Tree. In Scikit-Learn, this is controlled by the max_depth hyperparameter (the default value is None, which means unlimited). Reducing max_depth will regularize the model and thus reduce the risk of overfitting.

The DecisionTreeClassifier class has a few other parameters that similarly restrict the shape of the Decision Tree: min_samples_split (the minimum number of samples a node must have before it can be split), min_samples_leaf (the minimum number of samples a leaf node must have), min_weight_fraction_leaf (same as min_samples_leaf but expressed as a fraction of the total number of weighted instances), max_leaf_nodes (maximum number of leaf nodes), and max_features (maximum number of features that are evaluated for splitting at each node). Increasing min_* hyperparameters or reducing max_* hyperparameters will regularize the model.

NOTE
Other algorithms work by first training the Decision Tree without restrictions, then pruning (deleting) unnecessary nodes. A node whose children are all leaf nodes is considered unnecessary if the purity improvement it provides is not statistically significant. Standard statistical tests, such as the χ² test, are used to estimate the probability that the improvement is purely the result of chance (which is called the null hypothesis). If this probability, called the p-value, is higher than a given threshold (typically 5%, controlled by a hyperparameter), then the node is considered unnecessary and its children are deleted. The pruning continues until all unnecessary nodes have been pruned.

Figure 6-3 shows two Decision Trees trained on the moons dataset (introduced in Chapter 5). On the left, the Decision Tree is trained with the default hyperparameters (i.e., no restrictions), and on the right the Decision Tree is trained with min_samples_leaf=4. It is quite obvious that the model on the left is overfitting, and the model on the right will probably generalize better.
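A sketch of how such a comparison might be set up (the make_moons parameters here are assumptions, not taken from the figure):

from sklearn.datasets import make_moons
from sklearn.tree import DecisionTreeClassifier

# A sketch: an unrestricted tree versus a regularized one on the moons dataset.
Xm, ym = make_moons(n_samples=100, noise=0.25, random_state=53)

deep_tree_clf1 = DecisionTreeClassifier(random_state=42)  # no restrictions
deep_tree_clf2 = DecisionTreeClassifier(min_samples_leaf=4, random_state=42)
deep_tree_clf1.fit(Xm, ym)
deep_tree_clf2.fit(Xm, ym)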

Figure 6-3. Regularization using min_samples_leaf

Regression

Decision Trees are also capable of performing regression tasks. Let’s build a regression tree using Scikit-Learn’s DecisionTreeRegressor class, training it on a noisy quadratic dataset with max_depth=2:

from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(max_depth=2)
tree_reg.fit(X, y)
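(X and y here are assumed to be a noisy quadratic dataset; for example, one way you might generate such data, with parameters that are an assumption rather than from the text:)

import numpy as np

# A sketch: a noisy quadratic dataset (assumed shape, not from the text).
np.random.seed(42)
m = 200
X = np.random.rand(m, 1)
y = 4 * (X - 0.5) ** 2 + np.random.randn(m, 1) / 10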

The resulting tree is represented in Figure 6-4.

Figure 6-4. A Decision Tree for regression

This tree looks very similar to the classification tree you built earlier. The main difference is that instead of predicting a class in each node, it predicts a value. For example, suppose you want to make a prediction for a new instance with x₁ = 0.6. You traverse the tree starting at the root, and you eventually reach the leaf node that predicts value=0.1106. This prediction is simply the average target value of the 110 training instances associated with this leaf node. This prediction results in a Mean Squared Error (MSE) equal to 0.0151 over these 110 instances.

This model’s predictions are represented on the left of Figure 6-5. If you set max_depth=3, you get the predictions represented on the right. Notice how the predicted value for each region is always the average target value of the instances in that region. The algorithm splits each region in a way that makes most training instances as close as possible to that predicted value.

Figure 6-5. Predictions of two Decision Tree regression models

The CART algorithm works mostly the same way as earlier, except that instead of trying to split the training set in a way that minimizes impurity, it now tries to split the training set in a way that minimizes the MSE. Equation 6-4 shows the cost function that the algorithm tries to minimize.

Equation 6-4. CART cost function for regression

$$J(k, t_k) = \frac{m_{\text{left}}}{m}\, \mathrm{MSE}_{\text{left}} + \frac{m_{\text{right}}}{m}\, \mathrm{MSE}_{\text{right}} \quad \text{where} \quad \begin{cases} \mathrm{MSE}_{\text{node}} = \displaystyle\sum_{i \in \text{node}} \left(\hat{y}_{\text{node}} - y^{(i)}\right)^2 \\[6pt] \hat{y}_{\text{node}} = \dfrac{1}{m_{\text{node}}} \displaystyle\sum_{i \in \text{node}} y^{(i)} \end{cases}$$

Just like for classification tasks, Decision Trees are prone to overfitting when dealing with regression tasks. Without any regularization (i.e., using the default hyperparameters), you get the predictions on the left of Figure 6-6. It is obviously overfitting the training set very badly. Just setting min_samples_leaf=10 results in a much more reasonable model, represented on the right of Figure 6-6.

Figure 6-6. Regularizing a Decision Tree regressor

Instability

Hopefully by now you are convinced that Decision Trees have a lot going for them: they are simple to understand and interpret, easy to use, versatile, and powerful. However, they do have a few limitations. First, as you may have noticed, Decision Trees love orthogonal decision boundaries (all splits are perpendicular to an axis), which makes them sensitive to training set rotation. For example, Figure 6-7 shows a simple linearly separable dataset: on the left, a Decision Tree can split it easily, while on the right, after the dataset is rotated by 45°, the decision boundary looks unnecessarily convoluted. Although both Decision Trees fit the training set perfectly, it is very likely that the model on the right will not generalize well. One way to limit this problem is to use PCA (see Chapter 8), which often results in a better orientation of the training data.

Figure 6-7. Sensitivity to training set rotation

More generally, the main issue with Decision Trees is that they are very sensitive to small variations in the training data. For example, if you just remove the widest Iris-Versicolor from the iris training set (the one with petals 4.8 cm long and 1.8 cm wide) and train a new Decision Tree, you may get the model represented in Figure 6-8. As you can see, it looks very different from the previous Decision Tree (Figure 6-2). Actually, since the training algorithm used by Scikit-Learn is stochastic,⁶ you may get very different models even on the same training data (unless you set the random_state hyperparameter).

Figure 6-8. Sensitivity to training set details

Random Forests can limit this instability by averaging predictions over many trees, as we will see in the next chapter.

Exercises

1. What is the approximate depth of a Decision Tree trained (without restrictions) on a training set with 1 million instances?

2. Is a node’s Gini impurity generally lower or greater than its parent’s? Is it generally lower/greater, or always lower/greater?

3. If a Decision Tree is overfitting the training set, is it a good idea to try decreasing max_depth?

4. If a Decision Tree is underfitting the training set, is it a good idea to try scaling the input features?

5. If it takes one hour to train a Decision Tree on a training set containing 1 million instances, roughly how much time will it take to train another Decision Tree on a training set containing 10 million instances?

6. If your training set contains 100,000 instances, will setting presort=True speed up training?

7. Train and fine-tune a Decision Tree for the moons dataset.

a. Generate a moons dataset using make_moons(n_samples=10000, noise=0.4).

b. Split it into a training set and a test set using train_test_split().

c. Use grid search with cross-validation (with the help of the GridSearchCV class) to find good hyperparameter values for a DecisionTreeClassifier. Hint: try various values for max_leaf_nodes.

d. Train it on the full training set using these hyperparameters, and measure your model’s performance on the test set. You should get roughly 85% to 87% accuracy.

8. Grow a forest.

a. Continuing the previous exercise, generate 1,000 subsets of the training set, each containing 100 instances selected randomly. Hint: you can use Scikit-Learn’s ShuffleSplit class for this.

b. Train one Decision Tree on each subset, using the best hyperparameter values found above. Evaluate these 1,000 Decision Trees on the test set. Since they were trained on smaller sets, these Decision Trees will likely perform worse than the first Decision Tree, achieving only about 80% accuracy.

c. Now comes the magic. For each test set instance, generate the predictions of the 1,000 Decision Trees, and keep only the most frequent prediction (you can use SciPy’s mode() function for this). This gives you majority-vote predictions over the test set.

d. Evaluate these predictions on the test set: you should obtain a slightly higher accuracy than your first model (about 0.5 to 1.5% higher). Congratulations, you have trained a Random Forest classifier!

Solutions to these exercises are available in Appendix A.

1. Graphviz is an open source graph visualization software package, available at http://www.graphviz.org/.

2. P is the set of problems that can be solved in polynomial time. NP is the set of problems whose solutions can be verified in polynomial time. An NP-Hard problem is a problem to which any NP problem can be reduced in polynomial time. An NP-Complete problem is both NP and NP-Hard. A major open mathematical question is whether or not P = NP. If P ≠ NP (which seems likely), then no polynomial algorithm will ever be found for any NP-Complete problem (except perhaps on a quantum computer).

3. log₂ is the binary logarithm: log₂(m) = log(m)/log(2).

4. A reduction of entropy is often called an information gain.

5. See Sebastian Raschka’s interesting analysis for more details.

6. It randomly selects the set of features to evaluate at each node.

Chapter 7. Ensemble Learning and Random Forests

Suppose you ask a complex question to thousands of random people, then aggregate their answers. In many cases you will find that this aggregated answer is better than an expert’s answer. This is called the wisdom of the crowd. Similarly, if you aggregate the predictions of a group of predictors (such as classifiers or regressors), you will often get better predictions than with the best individual predictor. A group of predictors is called an ensemble; thus, this technique is called Ensemble Learning, and an Ensemble Learning algorithm is called an Ensemble method.

For example, you can train a group of Decision Tree classifiers, each on a different random subset of the training set. To make predictions, you just obtain the predictions of all individual trees, then predict the class that gets the most votes (see the last exercise in Chapter 6). Such an ensemble of Decision Trees is called a Random Forest, and despite its simplicity, this is one of the most powerful Machine Learning algorithms available today.

Moreover, as we discussed in Chapter 2, you will often use Ensemble methods near the end of a project, once you have already built a few good predictors, to combine them into an even better predictor. In fact, the winning solutions in Machine Learning competitions often involve several Ensemble methods (most famously in the Netflix Prize competition).

In this chapter we will discuss the most popular Ensemble methods, including bagging, boosting, stacking, and a few others. We will also explore Random Forests.

Voting Classifiers

Suppose you have trained a few classifiers, each one achieving about 80% accuracy. You may have a Logistic Regression classifier, an SVM classifier, a Random Forest classifier, a K-Nearest Neighbors classifier, and perhaps a few more (see Figure 7-1).

Figure 7-1. Training diverse classifiers

A very simple way to create an even better classifier is to aggregate the predictions of each classifier and predict the class that gets the most votes. This majority-vote classifier is called a hard voting classifier (see Figure 7-2).

Figure 7-2. Hard voting classifier predictions

Somewhat surprisingly, this voting classifier often achieves a higher accuracy than the best classifier in the ensemble. In fact, even if each classifier is a weak learner (meaning it does only slightly better than random guessing), the ensemble can still be a strong learner (achieving high accuracy), provided there are a sufficient number of weak learners and they are sufficiently diverse.

How is this possible? The following analogy can help shed some light on this mystery. Suppose you have a slightly biased coin that has a 51% chance of coming up heads, and 49% chance of coming up tails. If you toss it 1,000 times, you will generally get more or less 510 heads and 490 tails, and hence a majority of heads. If you do the math, you will find that the probability of obtaining a majority of heads after 1,000 tosses is close to 75%. The more you toss the coin, the higher the probability (e.g., with 10,000 tosses, the probability climbs over 97%). This is due to the law of large numbers: as you keep tossing the coin, the ratio of heads gets closer and closer to the probability of heads (51%). Figure 7-3 shows 10 series of biased coin tosses. You can see that as the number of tosses increases, the ratio of heads approaches 51%. Eventually all 10 series end up so close to 51% that they are consistently above 50%.
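You can check these probabilities yourself (a sketch using SciPy, not code from the text):

from scipy.stats import binom

# A sketch: probability of a majority of heads with a 51% biased coin.
print(1 - binom.cdf(500, 1000, 0.51))    # P(>500 heads in 1,000 tosses) ≈ 0.747
print(1 - binom.cdf(5000, 10000, 0.51))  # P(>5,000 heads in 10,000 tosses) ≈ 0.978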

Figure 7-3. The law of large numbers

Similarly, suppose you build an ensemble containing 1,000 classifiers that are individually correct only 51% of the time (barely better than random guessing). If you predict the majority voted class, you can hope for up to 75% accuracy! However, this is only true if all classifiers are perfectly independent, making uncorrelated errors, which is clearly not the case since they are trained on the same data. They are likely to make the same types of errors, so there will be many majority votes for the wrong class, reducing the ensemble’s accuracy.

TIP
Ensemble methods work best when the predictors are as independent from one another as possible. One way to get diverse classifiers is to train them using very different algorithms. This increases the chance that they will make very different types of errors, improving the ensemble’s accuracy.

The following code creates and trains a voting classifier in Scikit-Learn, composed of three diverse classifiers (the training set is the moons dataset, introduced in Chapter 5):

from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC()

voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='hard')
voting_clf.fit(X_train, y_train)

Let’s look at each classifier’s accuracy on the test set:

>>> from sklearn.metrics import accuracy_score
>>> for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
...     clf.fit(X_train, y_train)
...     y_pred = clf.predict(X_test)
...     print(clf.__class__.__name__, accuracy_score(y_test, y_pred))
...
LogisticRegression 0.864
RandomForestClassifier 0.872
SVC 0.888
VotingClassifier 0.896

There you have it! The voting classifier slightly outperforms all the individual classifiers.

If all classifiers are able to estimate class probabilities (i.e., they have a predict_proba() method), then you can tell Scikit-Learn to predict the class with the highest class probability, averaged over all the individual classifiers. This is called soft voting. It often achieves higher performance than hard voting because it gives more weight to highly confident votes. All you need to do is replace voting="hard" with voting="soft" and ensure that all classifiers can estimate class probabilities. This is not the case of the SVC class by default, so you need to set its probability hyperparameter to True (this will make the SVC class use cross-validation to estimate class probabilities, slowing down training, and it will add a predict_proba() method). If you modify the preceding code to use soft voting, you will find that the voting classifier achieves over 91% accuracy!
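The modified code would look like this (a sketch mirroring the hard-voting example above; the imports are the same):

# A sketch: the same ensemble with soft voting.
log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC(probability=True)  # so SVC can estimate class probabilities

voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
    voting='soft')
voting_clf.fit(X_train, y_train)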

Bagging and Pasting

One way to get a diverse set of classifiers is to use very different training algorithms, as just discussed. Another approach is to use the same training algorithm for every predictor, but to train them on different random subsets of the training set. When sampling is performed with replacement, this method is called bagging¹ (short for bootstrap aggregating²). When sampling is performed without replacement, it is called pasting.³

In other words, both bagging and pasting allow training instances to be sampled several times across multiple predictors, but only bagging allows training instances to be sampled several times for the same predictor. This sampling and training process is represented in Figure 7-4.

Figure 7-4. Pasting/bagging training set sampling and training

Once all predictors are trained, the ensemble can make a prediction for a new instance by simply aggregating the predictions of all predictors. The aggregation function is typically the statistical mode (i.e., the most frequent prediction, just like a hard voting classifier) for classification, or the average for regression. Each individual predictor has a higher bias than if it were trained on the original training set, but aggregation reduces both bias and variance.⁴ Generally, the net result is that the ensemble has a similar bias but a lower variance than a single predictor trained on the original training set.

As you can see in Figure 7-4, predictors can all be trained in parallel, via different CPU cores or even different servers. Similarly, predictions can be made in parallel. This is one of the reasons why bagging and pasting are such popular methods: they scale very well.

Bagging and Pasting in Scikit-Learn

Scikit-Learn offers a simple API for both bagging and pasting with the BaggingClassifier class (or BaggingRegressor for regression). The following code trains an ensemble of 500 Decision Tree classifiers,⁵ each trained on 100 training instances randomly sampled from the training set with replacement (this is an example of bagging, but if you want to use pasting instead, just set bootstrap=False). The n_jobs parameter tells Scikit-Learn the number of CPU cores to use for training and predictions (–1 tells Scikit-Learn to use all available cores):

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=100, bootstrap=True, n_jobs=-1)
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)

NOTE
The BaggingClassifier automatically performs soft voting instead of hard voting if the base classifier can estimate class probabilities (i.e., if it has a predict_proba() method), which is the case with Decision Tree classifiers.

Figure 7-5 compares the decision boundary of a single Decision Tree with the decision boundary of a bagging ensemble of 500 trees (from the preceding code), both trained on the moons dataset. As you can see, the ensemble’s predictions will likely generalize much better than the single Decision Tree’s predictions: the ensemble has a comparable bias but a smaller variance (it makes roughly the same number of errors on the training set, but the decision boundary is less irregular).

Figure 7-5. A single Decision Tree versus a bagging ensemble of 500 trees

Bootstrapping introduces a bit more diversity in the subsets that each predictor is trained on, so bagging ends up with a slightly higher bias than pasting, but this also means that predictors end up being less correlated, so the ensemble’s variance is reduced. Overall, bagging often results in better models, which explains why it is generally preferred. However, if you have spare time and CPU power you can use cross-validation to evaluate both bagging and pasting and select the one that works best.

Out-of-Bag Evaluation

With bagging, some instances may be sampled several times for any given predictor, while others may not be sampled at all. By default a BaggingClassifier samples m training instances with replacement (bootstrap=True), where m is the size of the training set. This means that only about 63% of the training instances are sampled on average for each predictor.⁶ The remaining 37% of the training instances that are not sampled are called out-of-bag (oob) instances. Note that they are not the same 37% for all predictors.
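The 63% figure is easy to check numerically (a sketch):

# A sketch: the fraction of distinct instances in a bootstrap sample
# approaches 1 - exp(-1) ≈ 63.2% as m grows.
m = 100000
print(1 - (1 - 1 / m) ** m)  # ~0.632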

Since a predictor never sees the oob instances during training, it can be evaluated on these instances, without the need for a separate validation set or cross-validation. You can evaluate the ensemble itself by averaging out the oob evaluations of each predictor.

In Scikit-Learn, you can set oob_score=True when creating a BaggingClassifier to request an automatic oob evaluation after training. The following code demonstrates this. The resulting evaluation score is available through the oob_score_ variable:

>>> bag_clf = BaggingClassifier(
...     DecisionTreeClassifier(), n_estimators=500,
...     bootstrap=True, n_jobs=-1, oob_score=True)
...
>>> bag_clf.fit(X_train, y_train)
>>> bag_clf.oob_score_
0.90133333333333332

According to this oob evaluation, this BaggingClassifier is likely to achieve about 90.1% accuracy on the test set. Let’s verify this:

>>> from sklearn.metrics import accuracy_score
>>> y_pred = bag_clf.predict(X_test)
>>> accuracy_score(y_test, y_pred)
0.91200000000000003

We get 91.2% accuracy on the test set: close enough!

The oob decision function for each training instance is also available through the oob_decision_function_ variable. In this case (since the base estimator has a predict_proba() method) the decision function returns the class probabilities for each training instance. For example, the oob evaluation estimates that the second training instance has a 60.6% probability of belonging to the positive class (and 39.4% of belonging to the negative class):

>>> bag_clf.oob_decision_function_
array([[ 0.31746032,  0.68253968],
       [ 0.34117647,  0.65882353],
       [ 1.        ,  0.        ],
       ...
       [ 1.        ,  0.        ],
       [ 0.03108808,  0.96891192],
       [ 0.57291667,  0.42708333]])

Random Patches and Random Subspaces

The BaggingClassifier class supports sampling the features as well. This is controlled by two hyperparameters: max_features and bootstrap_features. They work the same way as max_samples and bootstrap, but for feature sampling instead of instance sampling. Thus, each predictor will be trained on a random subset of the input features.

This is particularly useful when you are dealing with high-dimensional inputs (such as images). Sampling both training instances and features is called the Random Patches method.⁷ Keeping all training instances (i.e., bootstrap=False and max_samples=1.0) but sampling features (i.e., bootstrap_features=True and/or max_features smaller than 1.0) is called the Random Subspaces method.⁸

Sampling features results in even more predictor diversity, trading a bit more bias for a lower variance.
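For example, a Random Subspaces-style ensemble might be configured like this (a sketch reusing the imports from the bagging example above; the specific hyperparameter values are assumptions):

# A sketch: Random Subspaces (keep all training instances, sample features).
bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    bootstrap=False, max_samples=1.0,            # all training instances
    bootstrap_features=True, max_features=0.5,   # sample half the features
    n_jobs=-1)
bag_clf.fit(X_train, y_train)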

Random Forests

As we have discussed, a Random Forest⁹ is an ensemble of Decision Trees, generally trained via the bagging method (or sometimes pasting), typically with max_samples set to the size of the training set. Instead of building a BaggingClassifier and passing it a DecisionTreeClassifier, you can instead use the RandomForestClassifier class, which is more convenient and optimized for Decision Trees¹⁰ (similarly, there is a RandomForestRegressor class for regression tasks). The following code trains a Random Forest classifier with 500 trees (each limited to maximum 16 nodes), using all available CPU cores:

from sklearn.ensemble import RandomForestClassifier

rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
rnd_clf.fit(X_train, y_train)

y_pred_rf = rnd_clf.predict(X_test)

With a few exceptions, a RandomForestClassifier has all the hyperparameters of a DecisionTreeClassifier (to control how trees are grown), plus all the hyperparameters of a BaggingClassifier to control the ensemble itself.¹¹

The Random Forest algorithm introduces extra randomness when growing trees; instead of searching for the very best feature when splitting a node (see Chapter 6), it searches for the best feature among a random subset of features. This results in a greater tree diversity, which (once again) trades a higher bias for a lower variance, generally yielding an overall better model. The following BaggingClassifier is roughly equivalent to the previous RandomForestClassifier:

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(splitter="random", max_leaf_nodes=16),
    n_estimators=500, max_samples=1.0, bootstrap=True, n_jobs=-1)

Extra-Trees

When you are growing a tree in a Random Forest, at each node only a random subset of the features is considered for splitting (as discussed earlier). It is possible to make trees even more random by also using random thresholds for each feature rather than searching for the best possible thresholds (like regular Decision Trees do).

A forest of such extremely random trees is simply called an Extremely Randomized Trees ensemble¹² (or Extra-Trees for short). Once again, this trades more bias for a lower variance. It also makes Extra-Trees much faster to train than regular Random Forests since finding the best possible threshold for each feature at every node is one of the most time-consuming tasks of growing a tree.

You can create an Extra-Trees classifier using Scikit-Learn’s ExtraTreesClassifier class. Its API is identical to the RandomForestClassifier class. Similarly, the ExtraTreesRegressor class has the same API as the RandomForestRegressor class.
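For example (a sketch mirroring the Random Forest code above):

from sklearn.ensemble import ExtraTreesClassifier

# A sketch: the same API as RandomForestClassifier.
ext_clf = ExtraTreesClassifier(n_estimators=500, max_leaf_nodes=16, n_jobs=-1)
ext_clf.fit(X_train, y_train)
y_pred_ext = ext_clf.predict(X_test)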

TIP
It is hard to tell in advance whether a RandomForestClassifier will perform better or worse than an ExtraTreesClassifier. Generally, the only way to know is to try both and compare them using cross-validation (and tuning the hyperparameters using grid search).

Feature Importance

Yet another great quality of Random Forests is that they make it easy to measure the relative importance of each feature. Scikit-Learn measures a feature’s importance by looking at how much the tree nodes that use that feature reduce impurity on average (across all trees in the forest). More precisely, it is a weighted average, where each node’s weight is equal to the number of training samples that are associated with it (see Chapter 6).

Scikit-Learn computes this score automatically for each feature after training, then it scales the results so that the sum of all importances is equal to 1. You can access the result using the feature_importances_ variable. For example, the following code trains a RandomForestClassifier on the iris dataset (introduced in Chapter 4) and outputs each feature’s importance. It seems that the most important features are the petal length (44%) and width (42%), while sepal length and width are rather unimportant in comparison (11% and 2%, respectively).

>>> from sklearn.datasets import load_iris
>>> iris = load_iris()
>>> rnd_clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
>>> rnd_clf.fit(iris["data"], iris["target"])
>>> for name, score in zip(iris["feature_names"], rnd_clf.feature_importances_):
...     print(name, score)
...
sepal length (cm) 0.112492250999
sepal width (cm) 0.0231192882825
petal length (cm) 0.441030464364
petal width (cm) 0.423357996355

Similarly, if you train a Random Forest classifier on the MNIST dataset (introduced in Chapter 3) and plot each pixel’s importance, you get the image represented in Figure 7-6.

Figure 7-6. MNIST pixel importance (according to a Random Forest classifier)
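A sketch of how such a plot could be produced (the data loader and plotting details are assumptions, not from the text; fetch_mldata matches Scikit-Learn versions from this book's era, while newer versions use fetch_openml('mnist_784')):

import matplotlib.pyplot as plt
from sklearn.datasets import fetch_mldata

# A sketch: train on MNIST, then show each pixel's importance as a 28x28 image.
mnist = fetch_mldata('MNIST original')
rnd_clf = RandomForestClassifier(n_estimators=10, n_jobs=-1)
rnd_clf.fit(mnist["data"], mnist["target"])

plt.imshow(rnd_clf.feature_importances_.reshape(28, 28), cmap="hot")
plt.axis("off")
plt.colorbar()
plt.show()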

Random Forests are very handy to get a quick understanding of what features actually matter, in particular if you need to perform feature selection.

Boosting

Boosting (originally called hypothesis boosting) refers to any Ensemble method that can combine several weak learners into a strong learner. The general idea of most boosting methods is to train predictors sequentially, each trying to correct its predecessor. There are many boosting methods available, but by far the most popular are AdaBoost¹³ (short for Adaptive Boosting) and Gradient Boosting. Let’s start with AdaBoost.

AdaBoost

One way for a new predictor to correct its predecessor is to pay a bit more attention to the training instances that the predecessor underfitted. This results in new predictors focusing more and more on the hard cases. This is the technique used by AdaBoost.

For example, to build an AdaBoost classifier, a first base classifier (such as a Decision Tree) is trained and used to make predictions on the training set. The relative weight of misclassified training instances is then increased. A second classifier is trained using the updated weights and again it makes predictions on the training set, weights are updated, and so on (see Figure 7-7).

Figure 7-7. AdaBoost sequential training with instance weight updates

Figure 7-8 shows the decision boundaries of five consecutive predictors on the moons dataset (in this example, each predictor is a highly regularized SVM classifier with an RBF kernel¹⁴). The first classifier gets many instances wrong, so their weights get boosted. The second classifier therefore does a better job on these instances, and so on. The plot on the right represents the same sequence of predictors except that the learning rate is halved (i.e., the misclassified instance weights are boosted half as much at every iteration). As you can see, this sequential learning technique has some similarities with Gradient Descent, except that instead of tweaking a single predictor’s parameters to minimize a cost function, AdaBoost adds predictors to the ensemble, gradually making it better.

Figure 7-8. Decision boundaries of consecutive predictors

Once all predictors are trained, the ensemble makes predictions very much like bagging or pasting, except that predictors have different weights depending on their overall accuracy on the weighted training set.

WARNING
There is one important drawback to this sequential learning technique: it cannot be parallelized (or only partially), since each predictor can only be trained after the previous predictor has been trained and evaluated. As a result, it does not scale as well as bagging or pasting.

Let’s take a closer look at the AdaBoost algorithm. Each instance weight w(i) is initially set to 1/m. A first predictor is trained and its weighted error rate r₁ is computed on the training set; see Equation 7-1.

Equation 7-1. Weighted error rate of the jth predictor

$$r_j = \frac{\displaystyle\sum_{\substack{i=1 \\ \hat{y}_j^{(i)} \neq y^{(i)}}}^{m} w^{(i)}}{\displaystyle\sum_{i=1}^{m} w^{(i)}} \quad \text{where } \hat{y}_j^{(i)} \text{ is the } j\text{th predictor's prediction for the } i\text{th instance}$$

The predictor’s weight αⱼ is then computed using Equation 7-2, where η is the learning rate hyperparameter (defaults to 1).¹⁵ The more accurate the predictor is, the higher its weight will be. If it is just guessing randomly, then its weight will be close to zero. However, if it is most often wrong (i.e., less accurate than random guessing), then its weight will be negative.

Equation 7-2. Predictor weight

$$\alpha_j = \eta \, \log\frac{1 - r_j}{r_j}$$

Next the instance weights are updated using Equation 7-3: the misclassified instances are boosted.

Equation 7-3. Weight update rule

$$\text{for } i = 1, 2, \dots, m: \qquad w^{(i)} \leftarrow \begin{cases} w^{(i)} & \text{if } \hat{y}_j^{(i)} = y^{(i)} \\ w^{(i)} \exp(\alpha_j) & \text{if } \hat{y}_j^{(i)} \neq y^{(i)} \end{cases}$$

Then all the instance weights are normalized (i.e., divided by the sum of all the instance weights, $\sum_{i=1}^{m} w^{(i)}$).

Finally, a new predictor is trained using the updated weights, and the whole process is repeated (the new predictor’s weight is computed, the instance weights are updated, then another predictor is trained, and so on). The algorithm stops when the desired number of predictors is reached, or when a perfect predictor is found.

To make predictions, AdaBoost simply computes the predictions of all the predictors and weighs them using the predictor weights αⱼ. The predicted class is the one that receives the majority of weighted votes (see Equation 7-4).

Equation 7-4. AdaBoost predictions

$$\hat{y}(\mathbf{x}) = \underset{k}{\operatorname{argmax}} \sum_{\substack{j=1 \\ \hat{y}_j(\mathbf{x}) = k}}^{N} \alpha_j \quad \text{where } N \text{ is the number of predictors}$$
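To make the update rules concrete, here is a minimal NumPy sketch of one boosting round (an illustration of Equations 7-1 to 7-3, not Scikit-Learn's implementation):

import numpy as np

# A sketch of one AdaBoost round, illustrating Equations 7-1 to 7-3.
def adaboost_round(w, y, y_pred, eta=1.0):
    missed = (y_pred != y)
    r_j = w[missed].sum() / w.sum()               # Equation 7-1: weighted error rate
    alpha_j = eta * np.log((1 - r_j) / r_j)       # Equation 7-2: predictor weight
    w = np.where(missed, w * np.exp(alpha_j), w)  # Equation 7-3: boost misclassified weights
    return w / w.sum(), alpha_j                   # normalize the weights

m = 5
w = np.full(m, 1 / m)  # each instance weight starts at 1/m
y      = np.array([ 1, -1,  1,  1, -1])
y_pred = np.array([ 1,  1,  1, -1, -1])  # this predictor got instances 2 and 4 wrong
w, alpha = adaboost_round(w, y, y_pred)
print(w, alpha)  # the two misclassified instances now carry more weight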

Scikit-Learn actually uses a multiclass version of AdaBoost called SAMME¹⁶ (which stands for Stagewise Additive Modeling using a Multiclass Exponential loss function). When there are just two classes, SAMME is equivalent to AdaBoost. Moreover, if the predictors can estimate class probabilities (i.e., if they have a predict_proba() method), Scikit-Learn can use a variant of SAMME called SAMME.R (the R stands for “Real”), which relies on class probabilities rather than predictions and generally performs better.

The following code trains an AdaBoost classifier based on 200 Decision Stumps using Scikit-Learn’s AdaBoostClassifier class (as you might expect, there is also an AdaBoostRegressor class). A Decision Stump is a Decision Tree with max_depth=1; in other words, a tree composed of a single decision node plus two leaf nodes. This is the default base estimator for the AdaBoostClassifier class:

from sklearn.ensemble import AdaBoostClassifier

ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=200,
    algorithm="SAMME.R", learning_rate=0.5)
ada_clf.fit(X_train, y_train)

TIP
If your AdaBoost ensemble is overfitting the training set, you can try reducing the number of estimators or more strongly regularizing the base estimator.

Gradient Boosting

Another very popular Boosting algorithm is Gradient Boosting.¹⁷ Just like AdaBoost, Gradient Boosting works by sequentially adding predictors to an ensemble, each one correcting its predecessor. However, instead of tweaking the instance weights at every iteration like AdaBoost does, this method tries to fit the new predictor to the residual errors made by the previous predictor.

Let’s go through a simple regression example using Decision Trees as the base predictors (of course Gradient Boosting also works great with classification tasks). This is called Gradient Tree Boosting, or Gradient Boosted Regression Trees (GBRT). First, let’s fit a DecisionTreeRegressor to the training set (for example, a noisy quadratic training set):

from sklearn.tree import DecisionTreeRegressor

tree_reg1 = DecisionTreeRegressor(max_depth=2)
tree_reg1.fit(X, y)

Now train a second DecisionTreeRegressor on the residual errors made by the first predictor:

y2 = y - tree_reg1.predict(X)
tree_reg2 = DecisionTreeRegressor(max_depth=2)
tree_reg2.fit(X, y2)

Then we train a third regressor on the residual errors made by the second predictor:

y3 = y2 - tree_reg2.predict(X)
tree_reg3 = DecisionTreeRegressor(max_depth=2)
tree_reg3.fit(X, y3)

Now we have an ensemble containing three trees. It can make predictions on a new instance simply by adding up the predictions of all the trees:

y_pred = sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))

Figure 7-9 represents the predictions of these three trees in the left column, and the ensemble’s predictions in the right column. In the first row, the ensemble has just one tree, so its predictions are exactly the same as the first tree’s predictions. In the second row, a new tree is trained on the residual errors of the first tree. On the right you can see that the ensemble’s predictions are equal to the sum of the predictions of the first two trees. Similarly, in the third row another tree is trained on the residual errors of the second tree. You can see that the ensemble’s predictions gradually get better as trees are added to the ensemble.

A simpler way to train GBRT ensembles is to use Scikit-Learn’s GradientBoostingRegressor class. Much like the RandomForestRegressor class, it has hyperparameters to control the growth of Decision Trees (e.g., max_depth, min_samples_leaf, and so on), as well as hyperparameters to control the ensemble training, such as the number of trees (n_estimators). The following code creates the same ensemble as the previous one:

from sklearn.ensemble import GradientBoostingRegressor

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=1.0)
gbrt.fit(X, y)

Figure 7-9. Gradient Boosting

The learning_rate hyperparameter scales the contribution of each tree. If you set it to a low value, such as 0.1, you will need more trees in the ensemble to fit the training set, but the predictions will usually generalize better. This is a regularization technique called shrinkage. Figure 7-10 shows two GBRT ensembles trained with a low learning rate: the one on the left does not have enough trees to fit the training set, while the one on the right has too many trees and overfits the training set.

Figure 7-10. GBRT ensembles with not enough predictors (left) and too many (right)

In order to find the optimal number of trees, you can use early stopping (see Chapter 4). A simple way to implement this is to use the staged_predict() method: it returns an iterator over the predictions made by the ensemble at each stage of training (with one tree, two trees, etc.). The following code trains a GBRT ensemble with 120 trees, then measures the validation error at each stage of training to find the optimal number of trees, and finally trains another GBRT ensemble using the optimal number of trees:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X_train, X_val, y_train, y_val = train_test_split(X, y)

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=120)
gbrt.fit(X_train, y_train)

errors = [mean_squared_error(y_val, y_pred)
          for y_pred in gbrt.staged_predict(X_val)]
bst_n_estimators = np.argmin(errors)

gbrt_best = GradientBoostingRegressor(max_depth=2, n_estimators=bst_n_estimators)
gbrt_best.fit(X_train, y_train)

The validation errors are represented on the left of Figure 7-11, and the best model’s predictions are represented on the right.

Figure 7-11. Tuning the number of trees using early stopping

It is also possible to implement early stopping by actually stopping training early (instead of training a large number of trees first and then looking back to find the optimal number). You can do so by setting warm_start=True, which makes Scikit-Learn keep existing trees when the fit() method is called, allowing incremental training. The following code stops training when the validation error does not improve for five iterations in a row:

gbrt = GradientBoostingRegressor(max_depth=2, warm_start=True)

min_val_error = float("inf")
error_going_up = 0
for n_estimators in range(1, 120):
    gbrt.n_estimators = n_estimators
    gbrt.fit(X_train, y_train)
    y_pred = gbrt.predict(X_val)
    val_error = mean_squared_error(y_val, y_pred)
    if val_error < min_val_error:
        min_val_error = val_error
        error_going_up = 0
    else:
        error_going_up += 1
        if error_going_up == 5:
            break  # early stopping

The GradientBoostingRegressor class also supports a subsample hyperparameter, which specifies the fraction of training instances to be used for training each tree. For example, if subsample=0.25, then each tree is trained on 25% of the training instances, selected randomly. As you can probably guess by now, this trades a higher bias for a lower variance. It also speeds up training considerably. This technique is called Stochastic Gradient Boosting.
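For example (a sketch; the subsample value comes from the text, the other hyperparameters are assumptions):

# A sketch: Stochastic Gradient Boosting, each tree sees 25% of the instances.
gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=120, subsample=0.25)
gbrt.fit(X_train, y_train)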

NOTE
It is possible to use Gradient Boosting with other cost functions. This is controlled by the loss hyperparameter (see Scikit-Learn’s documentation for more details).

Stacking

The last Ensemble method we will discuss in this chapter is called stacking (short for stacked generalization).¹⁸ It is based on a simple idea: instead of using trivial functions (such as hard voting) to aggregate the predictions of all predictors in an ensemble, why don’t we train a model to perform this aggregation? Figure 7-12 shows such an ensemble performing a regression task on a new instance. Each of the bottom three predictors predicts a different value (3.1, 2.7, and 2.9), and then the final predictor (called a blender, or a meta learner) takes these predictions as inputs and makes the final prediction (3.0).

Figure 7-12. Aggregating predictions using a blending predictor

To train the blender, a common approach is to use a hold-out set.¹⁹ Let’s see how it works. First, the training set is split in two subsets. The first subset is used to train the predictors in the first layer (see Figure 7-13).

Figure 7-13. Training the first layer

Next, the first layer predictors are used to make predictions on the second (held-out) set (see Figure 7-14). This ensures that the predictions are “clean,” since the predictors never saw these instances during training. Now for each instance in the hold-out set there are three predicted values. We can create a new training set using these predicted values as input features (which makes this new training set three-dimensional), and keeping the target values. The blender is trained on this new training set, so it learns to predict the target value given the first layer’s predictions.

Figure 7-14. Training the blender

It is actually possible to train several different blenders this way (e.g., one using Linear Regression, another using Random Forest Regression, and so on): we get a whole layer of blenders. The trick is to split the training set into three subsets: the first one is used to train the first layer, the second one is used to create the training set used to train the second layer (using predictions made by the predictors of the first layer), and the third one is used to create the training set to train the third layer (using predictions made by the predictors of the second layer). Once this is done, we can make a prediction for a new instance by going through each layer sequentially, as shown in Figure 7-15.

Figure 7-15. Predictions in a multilayer stacking ensemble

Unfortunately, Scikit-Learn does not support stacking directly, but it is not too hard to roll your own implementation (see the following exercises). Alternatively, you can use an open source implementation such as brew (available at https://github.com/viisar/brew).
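Here is a minimal sketch of the hold-out approach described above (the choice of first-layer predictors and blender are assumptions, and X_new stands for new instances to predict):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression

# A sketch: two-layer stacking with a hold-out set.
X_sub, X_hold, y_sub, y_hold = train_test_split(X_train, y_train, test_size=0.5)

# First layer: train the base predictors on the first subset.
layer1 = [RandomForestClassifier(n_estimators=100),
          ExtraTreesClassifier(n_estimators=100)]
for clf in layer1:
    clf.fit(X_sub, y_sub)

# Build the blender's training set from "clean" hold-out predictions.
X_blend = np.column_stack([clf.predict(X_hold) for clf in layer1])
blender = LogisticRegression()
blender.fit(X_blend, y_hold)

# To predict on new instances, go through both layers.
X_new_blend = np.column_stack([clf.predict(X_new) for clf in layer1])
y_pred = blender.predict(X_new_blend)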

Exercises

1. If you have trained five different models on the exact same training data, and they all achieve 95% precision, is there any chance that you can combine these models to get better results? If so, how? If not, why?

2. What is the difference between hard and soft voting classifiers?

3. Is it possible to speed up training of a bagging ensemble by distributing it across multiple servers? What about pasting ensembles, boosting ensembles, random forests, or stacking ensembles?

4. What is the benefit of out-of-bag evaluation?

5. What makes Extra-Trees more random than regular Random Forests? How can this extra randomness help? Are Extra-Trees slower or faster than regular Random Forests?

6. If your AdaBoost ensemble underfits the training data, what hyperparameters should you tweak and how?

7. If your Gradient Boosting ensemble overfits the training set, should you increase or decrease the learning rate?

8. Load the MNIST data (introduced in Chapter 3), and split it into a training set, a validation set, and a test set (e.g., use 40,000 instances for training, 10,000 for validation, and 10,000 for testing). Then train various classifiers, such as a Random Forest classifier, an Extra-Trees classifier, and an SVM. Next, try to combine them into an ensemble that outperforms them all on the validation set, using a soft or hard voting classifier. Once you have found one, try it on the test set. How much better does it perform compared to the individual classifiers?

9. Run the individual classifiers from the previous exercise to make predictions on the validation set, and create a new training set with the resulting predictions: each training instance is a vector containing the set of predictions from all your classifiers for an image, and the target is the image’s class. Congratulations, you have just trained a blender, and together with the classifiers they form a stacking ensemble! Now let’s evaluate the ensemble on the test set. For each image in the test set, make predictions with all your classifiers, then feed the predictions to the blender to get the ensemble’s predictions. How does it compare to the voting classifier you trained earlier?

Solutions to these exercises are available in Appendix A.

1. "Bagging Predictors," L. Breiman (1996).

2. In statistics, resampling with replacement is called bootstrapping.

3. "Pasting small votes for classification in large databases and on-line," L. Breiman (1999).

4. Bias and variance were introduced in Chapter 4.

5. max_samples can alternatively be set to a float between 0.0 and 1.0, in which case the max number of instances to sample is equal to the size of the training set times max_samples.


6. As m grows, this ratio approaches 1 – exp(–1) ≈ 63.212%.

7. "Ensembles on Random Patches," G. Louppe and P. Geurts (2012).

8. "The random subspace method for constructing decision forests," Tin Kam Ho (1998).

9. "Random Decision Forests," T. Ho (1995).

10. The BaggingClassifier class remains useful if you want a bag of something other than Decision Trees.

11. There are a few notable exceptions: splitter is absent (forced to "random"), presort is absent (forced to False), max_samples is absent (forced to 1.0), and base_estimator is absent (forced to DecisionTreeClassifier with the provided hyperparameters).

12. "Extremely randomized trees," P. Geurts, D. Ernst, L. Wehenkel (2005).

13. "A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting," Yoav Freund, Robert E. Schapire (1997).

14. This is just for illustrative purposes. SVMs are generally not good base predictors for AdaBoost, because they are slow and tend to be unstable with AdaBoost.

15. The original AdaBoost algorithm does not use a learning rate hyperparameter.

16. For more details, see "Multi-Class AdaBoost," J. Zhu et al. (2006).

17. First introduced in "Arcing the Edge," L. Breiman (1997).

18. "Stacked Generalization," D. Wolpert (1992).

19. Alternatively, it is possible to use out-of-fold predictions. In some contexts this is called stacking, while using a hold-out set is called blending. However, for many people these terms are synonymous.


Chapter 8. Dimensionality Reduction

Many Machine Learning problems involve thousands or even millions of features for each training instance. Not only does this make training extremely slow, it can also make it much harder to find a good solution, as we will see. This problem is often referred to as the curse of dimensionality.

Fortunately, in real-world problems, it is often possible to reduce the number of features considerably, turning an intractable problem into a tractable one. For example, consider the MNIST images (introduced in Chapter 3): the pixels on the image borders are almost always white, so you could completely drop these pixels from the training set without losing much information. Figure 7-6 confirms that these pixels are utterly unimportant for the classification task. Moreover, two neighboring pixels are often highly correlated: if you merge them into a single pixel (e.g., by taking the mean of the two pixel intensities), you will not lose much information.

WARNING
Reducing dimensionality does lose some information (just like compressing an image to JPEG can degrade its quality), so even though it will speed up training, it may also make your system perform slightly worse. It also makes your pipelines a bit more complex and thus harder to maintain. So you should first try to train your system with the original data before considering using dimensionality reduction if training is too slow. In some cases, however, reducing the dimensionality of the training data may filter out some noise and unnecessary details and thus result in higher performance (but in general it won't; it will just speed up training).

Apart from speeding up training, dimensionality reduction is also extremely useful for data visualization (or DataViz). Reducing the number of dimensions down to two (or three) makes it possible to plot a high-dimensional training set on a graph and often gain some important insights by visually detecting patterns, such as clusters.

In this chapter we will discuss the curse of dimensionality and get a sense of what goes on in high-dimensional space. Then, we will present the two main approaches to dimensionality reduction (projection and Manifold Learning), and we will go through three of the most popular dimensionality reduction techniques: PCA, Kernel PCA, and LLE.


The Curse of Dimensionality

We are so used to living in three dimensions1 that our intuition fails us when we try to imagine a high-dimensional space. Even a basic 4D hypercube is incredibly hard to picture in our mind (see Figure 8-1), let alone a 200-dimensional ellipsoid bent in a 1,000-dimensional space.

Figure 8-1. Point, segment, square, cube, and tesseract (0D to 4D hypercubes)2

It turns out that many things behave very differently in high-dimensional space. For example, if you pick a random point in a unit square (a 1 × 1 square), it will have only about a 0.4% chance of being located less than 0.001 from a border (in other words, it is very unlikely that a random point will be "extreme" along any dimension). But in a 10,000-dimensional unit hypercube (a 1 × 1 × ⋯ × 1 cube, with ten thousand 1s), this probability is greater than 99.999999%. Most points in a high-dimensional hypercube are very close to the border.3

Here is a more troublesome difference: if you pick two points randomly in a unit square, the distance between these two points will be, on average, roughly 0.52. If you pick two random points in a unit 3D cube, the average distance will be roughly 0.66. But what about two points picked randomly in a 1,000,000-dimensional hypercube? Well, the average distance, believe it or not, will be about 408.25 (roughly √(1,000,000/6))! This is quite counterintuitive: how can two points be so far apart when they both lie within the same unit hypercube? This fact implies that high-dimensional datasets are at risk of being very sparse: most training instances are likely to be far away from each other. Of course, this also means that a new instance will likely be far away from any training instance, making predictions much less reliable than in lower dimensions, since they will be based on much larger extrapolations. In short, the more dimensions the training set has, the greater the risk of overfitting it.
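These average distances are easy to check empirically. The following quick Monte Carlo sketch (not from the book) estimates the average distance between two random points in a d-dimensional unit hypercube:

import numpy as np

def avg_distance(d, n_pairs=10000, seed=42):
    # Draw n_pairs pairs of uniform random points and average their distances.
    rng = np.random.RandomState(seed)
    a = rng.rand(n_pairs, d)
    b = rng.rand(n_pairs, d)
    return np.linalg.norm(a - b, axis=1).mean()

print(avg_distance(2))                    # roughly 0.52
print(avg_distance(3))                    # roughly 0.66
print(avg_distance(1000, n_pairs=1000))   # roughly sqrt(1000/6) ≈ 12.9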

In theory, one solution to the curse of dimensionality could be to increase the size of the training set to reach a sufficient density of training instances. Unfortunately, in practice, the number of training instances required to reach a given density grows exponentially with the number of dimensions. With just 100 features (much less than in the MNIST problem), you would need more training instances than atoms in the observable universe in order for training instances to be within 0.1 of each other on average, assuming they were spread out uniformly across all dimensions.


Main Approaches for Dimensionality Reduction

Before we dive into specific dimensionality reduction algorithms, let's take a look at the two main approaches to reducing dimensionality: projection and Manifold Learning.


Projection

In most real-world problems, training instances are not spread out uniformly across all dimensions. Many features are almost constant, while others are highly correlated (as discussed earlier for MNIST). As a result, all training instances actually lie within (or close to) a much lower-dimensional subspace of the high-dimensional space. This sounds very abstract, so let's look at an example. In Figure 8-2 you can see a 3D dataset represented by the circles.

Figure 8-2. A 3D dataset lying close to a 2D subspace

Notice that all training instances lie close to a plane: this is a lower-dimensional (2D) subspace of the high-dimensional (3D) space. Now if we project every training instance perpendicularly onto this subspace (as represented by the short lines connecting the instances to the plane), we get the new 2D dataset shown in Figure 8-3. Ta-da! We have just reduced the dataset's dimensionality from 3D to 2D. Note that the axes correspond to new features z1 and z2 (the coordinates of the projections on the plane).


Figure 8-3. The new 2D dataset after projection

However, projection is not always the best approach to dimensionality reduction. In many cases the subspace may twist and turn, such as in the famous Swiss roll toy dataset represented in Figure 8-4.


Figure 8-4. Swiss roll dataset
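If you want to experiment with this dataset yourself, a minimal sketch (assuming scikit-learn is available) that generates the Swiss roll of Figure 8-4:

from sklearn.datasets import make_swiss_roll

X, t = make_swiss_roll(n_samples=1000, noise=0.2, random_state=42)
print(X.shape)  # (1000, 3): a 2D manifold rolled up in 3D space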

Simply projecting onto a plane (e.g., by dropping x3) would squash different layers of the Swiss roll together, as shown on the left of Figure 8-5. However, what you really want is to unroll the Swiss roll to obtain the 2D dataset on the right of Figure 8-5.

Figure 8-5. Squashing by projecting onto a plane (left) versus unrolling the Swiss roll (right)


Manifold Learning

The Swiss roll is an example of a 2D manifold. Put simply, a 2D manifold is a 2D shape that can be bent and twisted in a higher-dimensional space. More generally, a d-dimensional manifold is a part of an n-dimensional space (where d < n) that locally resembles a d-dimensional hyperplane. In the case of the Swiss roll, d = 2 and n = 3: it locally resembles a 2D plane, but it is rolled in the third dimension.

Many dimensionality reduction algorithms work by modeling the manifold on which the training instances lie; this is called Manifold Learning. It relies on the manifold assumption, also called the manifold hypothesis, which holds that most real-world high-dimensional datasets lie close to a much lower-dimensional manifold. This assumption is very often empirically observed.

Once again, think about the MNIST dataset: all handwritten digit images have some similarities. They are made of connected lines, the borders are white, they are more or less centered, and so on. If you randomly generated images, only a ridiculously tiny fraction of them would look like handwritten digits. In other words, the degrees of freedom available to you if you try to create a digit image are dramatically lower than the degrees of freedom you would have if you were allowed to generate any image you wanted. These constraints tend to squeeze the dataset into a lower-dimensional manifold.

The manifold assumption is often accompanied by another implicit assumption: that the task at hand (e.g., classification or regression) will be simpler if expressed in the lower-dimensional space of the manifold. For example, in the top row of Figure 8-6 the Swiss roll is split into two classes: in the 3D space (on the left), the decision boundary would be fairly complex, but in the 2D unrolled manifold space (on the right), the decision boundary is a simple straight line.

However, this assumption does not always hold. For example, in the bottom row of Figure 8-6, the decision boundary is located at x1 = 5. This decision boundary looks very simple in the original 3D space (a vertical plane), but it looks more complex in the unrolled manifold (a collection of four independent line segments).

In short, if you reduce the dimensionality of your training set before training a model, it will definitely speed up training, but it may not always lead to a better or simpler solution; it all depends on the dataset.

Hopefully you now have a good sense of what the curse of dimensionality is and how dimensionality reduction algorithms can fight it, especially when the manifold assumption holds. The rest of this chapter will go through some of the most popular algorithms.


Figure 8-6. The decision boundary may not always be simpler with lower dimensions


PCA

Principal Component Analysis (PCA) is by far the most popular dimensionality reduction algorithm. First it identifies the hyperplane that lies closest to the data, and then it projects the data onto it.


Preserving the Variance

Before you can project the training set onto a lower-dimensional hyperplane, you first need to choose the right hyperplane. For example, a simple 2D dataset is represented on the left of Figure 8-7, along with three different axes (i.e., one-dimensional hyperplanes). On the right is the result of the projection of the dataset onto each of these axes. As you can see, the projection onto the solid line preserves the maximum variance, while the projection onto the dotted line preserves very little variance, and the projection onto the dashed line preserves an intermediate amount of variance.

Figure 8-7. Selecting the subspace onto which to project

It seems reasonable to select the axis that preserves the maximum amount of variance, as it will most likely lose less information than the other projections. Another way to justify this choice is that it is the axis that minimizes the mean squared distance between the original dataset and its projection onto that axis. This is the rather simple idea behind PCA.4


Principal Components

PCA identifies the axis that accounts for the largest amount of variance in the training set. In Figure 8-7, it is the solid line. It also finds a second axis, orthogonal to the first one, that accounts for the largest amount of remaining variance. In this 2D example there is no choice: it is the dotted line. If it were a higher-dimensional dataset, PCA would also find a third axis, orthogonal to both previous axes, and a fourth, a fifth, and so on: as many axes as the number of dimensions in the dataset.

The unit vector that defines the ith axis is called the ith principal component (PC). In Figure 8-7, the 1st PC is c1 and the 2nd PC is c2. In Figure 8-2 the first two PCs are represented by the orthogonal arrows in the plane, and the third PC would be orthogonal to the plane (pointing up or down).

NOTE
The direction of the principal components is not stable: if you perturb the training set slightly and run PCA again, some of the new PCs may point in the opposite direction of the original PCs. However, they will generally still lie on the same axes. In some cases, a pair of PCs may even rotate or swap, but the plane they define will generally remain the same.

So how can you find the principal components of a training set? Luckily, there is a standard matrix factorization technique called Singular Value Decomposition (SVD) that can decompose the training set matrix X into the dot product of three matrices U · Σ · V^T, where V^T contains all the principal components that we are looking for, as shown in Equation 8-1.

Equation 8-1. Principal components matrix

    V^T = (c1  c2  ⋯  cn)

The following Python code uses NumPy's svd() function to obtain all the principal components of the training set, then extracts the first two PCs:

X_centered = X - X.mean(axis=0)
U, s, V = np.linalg.svd(X_centered)
c1 = V.T[:, 0]
c2 = V.T[:, 1]


WARNING
PCA assumes that the dataset is centered around the origin. As we will see, Scikit-Learn's PCA classes take care of centering the data for you. However, if you implement PCA yourself (as in the preceding example), or if you use other libraries, don't forget to center the data first.


Projecting Down to d Dimensions

Once you have identified all the principal components, you can reduce the dimensionality of the dataset down to d dimensions by projecting it onto the hyperplane defined by the first d principal components. Selecting this hyperplane ensures that the projection will preserve as much variance as possible. For example, in Figure 8-2 the 3D dataset is projected down to the 2D plane defined by the first two principal components, preserving a large part of the dataset's variance. As a result, the 2D projection looks very much like the original 3D dataset.

To project the training set onto the hyperplane, you can simply compute the dot product of the training set matrix X by the matrix W_d, defined as the matrix containing the first d principal components (i.e., the matrix composed of the first d columns of V^T), as shown in Equation 8-2.

Equation 8-2. Projecting the training set down to d dimensions

    X_d-proj = X · W_d

The following Python code projects the training set onto the plane defined by the first two principal components:

W2 = V.T[:, :2]
X2D = X_centered.dot(W2)

There you have it! You now know how to reduce the dimensionality of any dataset down to any number of dimensions, while preserving as much variance as possible.


Using Scikit-Learn

Scikit-Learn's PCA class implements PCA using SVD decomposition just like we did before. The following code applies PCA to reduce the dimensionality of the dataset down to two dimensions (note that it automatically takes care of centering the data):

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X2D = pca.fit_transform(X)

After fitting the PCA transformer to the dataset, you can access the principal components using the components_ variable (note that it contains the PCs as horizontal vectors, so, for example, the first principal component is equal to pca.components_.T[:, 0]).


Explained Variance Ratio

Another very useful piece of information is the explained variance ratio of each principal component, available via the explained_variance_ratio_ variable. It indicates the proportion of the dataset's variance that lies along the axis of each principal component. For example, let's look at the explained variance ratios of the first two components of the 3D dataset represented in Figure 8-2:

>>> pca.explained_variance_ratio_
array([ 0.84248607,  0.14631839])

This tells you that 84.2% of the dataset's variance lies along the first axis, and 14.6% lies along the second axis. This leaves less than 1.2% for the third axis, so it is reasonable to assume that it probably carries little information.


Choosing the Right Number of Dimensions

Instead of arbitrarily choosing the number of dimensions to reduce down to, it is generally preferable to choose the number of dimensions that add up to a sufficiently large portion of the variance (e.g., 95%). Unless, of course, you are reducing dimensionality for data visualization; in that case you will generally want to reduce the dimensionality down to 2 or 3.

The following code computes PCA without reducing dimensionality, then computes the minimum number of dimensions required to preserve 95% of the training set's variance:

pca = PCA()
pca.fit(X_train)
cumsum = np.cumsum(pca.explained_variance_ratio_)
d = np.argmax(cumsum >= 0.95) + 1

You could then set n_components=d and run PCA again. However, there is a much better option: instead of specifying the number of principal components you want to preserve, you can set n_components to be a float between 0.0 and 1.0, indicating the ratio of variance you wish to preserve:

pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_train)

Yet another option is to plot the explained variance as a function of the number of dimensions (simply plot cumsum; see Figure 8-8). There will usually be an elbow in the curve, where the explained variance stops growing fast. You can think of this as the intrinsic dimensionality of the dataset. In this case, you can see that reducing the dimensionality down to about 100 dimensions wouldn't lose too much explained variance.

Figure 8-8. Explained variance as a function of the number of dimensions
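A minimal plotting sketch (assuming matplotlib and the cumsum array computed above) that reproduces the shape of this curve:

import matplotlib.pyplot as plt

plt.plot(cumsum)
plt.xlabel("Dimensions")
plt.ylabel("Explained Variance")
plt.show()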


PCA for Compression

Obviously after dimensionality reduction, the training set takes up much less space. For example, try applying PCA to the MNIST dataset while preserving 95% of its variance. You should find that each instance will have just over 150 features, instead of the original 784 features. So while most of the variance is preserved, the dataset is now less than 20% of its original size! This is a reasonable compression ratio, and you can see how this can speed up a classification algorithm (such as an SVM classifier) tremendously.

It is also possible to decompress the reduced dataset back to 784 dimensions by applying the inverse transformation of the PCA projection. Of course this won't give you back the original data, since the projection lost a bit of information (within the 5% variance that was dropped), but it will likely be quite close to the original data. The mean squared distance between the original data and the reconstructed data (compressed and then decompressed) is called the reconstruction error. For example, the following code compresses the MNIST dataset down to 154 dimensions, then uses the inverse_transform() method to decompress it back to 784 dimensions. Figure 8-9 shows a few digits from the original training set (on the left), and the corresponding digits after compression and decompression. You can see that there is a slight image quality loss, but the digits are still mostly intact.

pca = PCA(n_components=154)
X_reduced = pca.fit_transform(X_train)
X_recovered = pca.inverse_transform(X_reduced)

Figure 8-9. MNIST compression preserving 95% of the variance

The equation of the inverse transformation is shown in Equation 8-3.

Equation 8-3. PCA inverse transformation, back to the original number of dimensions

    X_recovered = X_d-proj · W_d^T
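If you want to measure the reconstruction error yourself, here is a quick sketch (not from the book, reusing X_train and X_recovered from above):

import numpy as np

# Mean squared distance between original and reconstructed instances
reconstruction_error = np.mean(np.sum((X_train - X_recovered) ** 2, axis=1))
print(reconstruction_error)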


Incremental PCA

One problem with the preceding implementation of PCA is that it requires the whole training set to fit in memory in order for the SVD algorithm to run. Fortunately, Incremental PCA (IPCA) algorithms have been developed: you can split the training set into mini-batches and feed an IPCA algorithm one mini-batch at a time. This is useful for large training sets, and also to apply PCA online (i.e., on the fly, as new instances arrive).

The following code splits the MNIST dataset into 100 mini-batches (using NumPy's array_split() function) and feeds them to Scikit-Learn's IncrementalPCA class5 to reduce the dimensionality of the MNIST dataset down to 154 dimensions (just like before). Note that you must call the partial_fit() method with each mini-batch rather than the fit() method with the whole training set:

from sklearn.decomposition import IncrementalPCA

n_batches = 100
inc_pca = IncrementalPCA(n_components=154)
for X_batch in np.array_split(X_train, n_batches):
    inc_pca.partial_fit(X_batch)

X_reduced = inc_pca.transform(X_train)

Alternatively, you can use NumPy's memmap class, which allows you to manipulate a large array stored in a binary file on disk as if it were entirely in memory; the class loads only the data it needs in memory, when it needs it. Since the IncrementalPCA class uses only a small part of the array at any given time, the memory usage remains under control. This makes it possible to call the usual fit() method, as you can see in the following code:

X_mm = np.memmap(filename, dtype="float32", mode="r", shape=(m, n))  # "r" = read-only

batch_size = m // n_batches
inc_pca = IncrementalPCA(n_components=154, batch_size=batch_size)
inc_pca.fit(X_mm)


Randomized PCA

Scikit-Learn offers yet another option to perform PCA, called Randomized PCA. This is a stochastic algorithm that quickly finds an approximation of the first d principal components. Its computational complexity is O(m × d²) + O(d³), instead of O(m × n²) + O(n³), so it is dramatically faster than the previous algorithms when d is much smaller than n.

rnd_pca = PCA(n_components=154, svd_solver="randomized")
X_reduced = rnd_pca.fit_transform(X_train)


Kernel PCA

In Chapter 5 we discussed the kernel trick, a mathematical technique that implicitly maps instances into a very high-dimensional space (called the feature space), enabling nonlinear classification and regression with Support Vector Machines. Recall that a linear decision boundary in the high-dimensional feature space corresponds to a complex nonlinear decision boundary in the original space.

It turns out that the same trick can be applied to PCA, making it possible to perform complex nonlinear projections for dimensionality reduction. This is called Kernel PCA (kPCA).6 It is often good at preserving clusters of instances after projection, or sometimes even unrolling datasets that lie close to a twisted manifold.

For example, the following code uses Scikit-Learn's KernelPCA class to perform kPCA with an RBF kernel (see Chapter 5 for more details about the RBF kernel and the other kernels):

from sklearn.decomposition import KernelPCA

rbf_pca = KernelPCA(n_components=2, kernel="rbf", gamma=0.04)
X_reduced = rbf_pca.fit_transform(X)

Figure 8-10 shows the Swiss roll, reduced to two dimensions using a linear kernel (equivalent to simply using the PCA class), an RBF kernel, and a sigmoid kernel (Logistic).

Figure 8-10. Swiss roll reduced to 2D using kPCA with various kernels


Selecting a Kernel and Tuning Hyperparameters

As kPCA is an unsupervised learning algorithm, there is no obvious performance measure to help you select the best kernel and hyperparameter values. However, dimensionality reduction is often a preparation step for a supervised learning task (e.g., classification), so you can simply use grid search to select the kernel and hyperparameters that lead to the best performance on that task. For example, the following code creates a two-step pipeline, first reducing dimensionality to two dimensions using kPCA, then applying Logistic Regression for classification. Then it uses GridSearchCV to find the best kernel and gamma value for kPCA in order to get the best classification accuracy at the end of the pipeline:

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

clf = Pipeline([
        ("kpca", KernelPCA(n_components=2)),
        ("log_reg", LogisticRegression())
    ])

param_grid = [{
        "kpca__gamma": np.linspace(0.03, 0.05, 10),
        "kpca__kernel": ["rbf", "sigmoid"]
    }]

grid_search = GridSearchCV(clf, param_grid, cv=3)
grid_search.fit(X, y)

The best kernel and hyperparameters are then available through the best_params_ variable:

>>> print(grid_search.best_params_)
{'kpca__gamma': 0.043333333333333335, 'kpca__kernel': 'rbf'}

Another approach, this time entirely unsupervised, is to select the kernel and hyperparameters that yield the lowest reconstruction error. However, reconstruction is not as easy as with linear PCA. Here's why. Figure 8-11 shows the original Swiss roll 3D dataset (top left), and the resulting 2D dataset after kPCA is applied using an RBF kernel (top right). Thanks to the kernel trick, this is mathematically equivalent to mapping the training set to an infinite-dimensional feature space (bottom right) using the feature map φ, then projecting the transformed training set down to 2D using linear PCA. Notice that if we could invert the linear PCA step for a given instance in the reduced space, the reconstructed point would lie in feature space, not in the original space (e.g., like the one represented by an x in the diagram). Since the feature space is infinite-dimensional, we cannot compute the reconstructed point, and therefore we cannot compute the true reconstruction error. Fortunately, it is possible to find a point in the original space that would map close to the reconstructed point. This is called the reconstruction pre-image. Once you have this pre-image, you can measure its squared distance to the original instance. You can then select the kernel and hyperparameters that minimize this reconstruction pre-image error.


Figure 8-11. Kernel PCA and the reconstruction pre-image error

You may be wondering how to perform this reconstruction. One solution is to train a supervised regression model, with the projected instances as the training set and the original instances as the targets. Scikit-Learn will do this automatically if you set fit_inverse_transform=True, as shown in the following code:7

rbf_pca = KernelPCA(n_components=2, kernel="rbf", gamma=0.0433,
                    fit_inverse_transform=True)
X_reduced = rbf_pca.fit_transform(X)
X_preimage = rbf_pca.inverse_transform(X_reduced)

NOTE
By default, fit_inverse_transform=False and KernelPCA has no inverse_transform() method. This method only gets created when you set fit_inverse_transform=True.

You can then compute the reconstruction pre-image error:

>>> from sklearn.metrics import mean_squared_error
>>> mean_squared_error(X, X_preimage)
32.786308795766132

Now you can use grid search with cross-validation to find the kernel and hyperparameters that minimize this pre-image reconstruction error.


LLE

Locally Linear Embedding (LLE)8 is another very powerful nonlinear dimensionality reduction (NLDR) technique. It is a Manifold Learning technique that does not rely on projections like the previous algorithms. In a nutshell, LLE works by first measuring how each training instance linearly relates to its closest neighbors (c.n.), and then looking for a low-dimensional representation of the training set where these local relationships are best preserved (more details shortly). This makes it particularly good at unrolling twisted manifolds, especially when there is not too much noise.

For example, the following code uses Scikit-Learn's LocallyLinearEmbedding class to unroll the Swiss roll. The resulting 2D dataset is shown in Figure 8-12. As you can see, the Swiss roll is completely unrolled and the distances between instances are locally well preserved. However, distances are not preserved on a larger scale: the left part of the unrolled Swiss roll is squeezed, while the right part is stretched. Nevertheless, LLE did a pretty good job at modeling the manifold.

from sklearn.manifold import LocallyLinearEmbedding

lle = LocallyLinearEmbedding(n_components=2, n_neighbors=10)
X_reduced = lle.fit_transform(X)

Figure 8-12. Unrolled Swiss roll using LLE

Here's how LLE works: first, for each training instance x(i), the algorithm identifies its k closest neighbors (in the preceding code k = 10), then tries to reconstruct x(i) as a linear function of these neighbors. More specifically, it finds the weights w_i,j such that the squared distance between x(i) and Σ_j w_i,j x(j) is as small as possible, assuming w_i,j = 0 if x(j) is not one of the k closest neighbors of x(i). Thus the first step of LLE is the constrained optimization problem described in Equation 8-4, where W is the weight matrix containing all the weights w_i,j. The second constraint simply normalizes the weights for each training instance x(i).

Equation 8-4. LLE step 1: linearly modeling local relationships

    Ŵ = argmin_W Σ_{i=1..m} ‖x(i) − Σ_{j=1..m} w_i,j x(j)‖²
    subject to: w_i,j = 0 if x(j) is not one of the k c.n. of x(i)
                Σ_{j=1..m} w_i,j = 1 for i = 1, 2, …, m

After this step, the weight matrix Ŵ (containing the weights ŵ_i,j) encodes the local linear relationships between the training instances. Now the second step is to map the training instances into a d-dimensional space (where d < n) while preserving these local relationships as much as possible. If z(i) is the image of x(i) in this d-dimensional space, then we want the squared distance between z(i) and Σ_j ŵ_i,j z(j) to be as small as possible. This idea leads to the unconstrained optimization problem described in Equation 8-5. It looks very similar to the first step, but instead of keeping the instances fixed and finding the optimal weights, we are doing the reverse: keeping the weights fixed and finding the optimal position of the instances' images in the low-dimensional space. Note that Z is the matrix containing all z(i).

Equation 8-5. LLE step 2: reducing dimensionality while preserving relationships

    Ẑ = argmin_Z Σ_{i=1..m} ‖z(i) − Σ_{j=1..m} ŵ_i,j z(j)‖²

Scikit-Learn's LLE implementation has the following computational complexity: O(m log(m) n log(k)) for finding the k nearest neighbors, O(m n k³) for optimizing the weights, and O(d m²) for constructing the low-dimensional representations. Unfortunately, the m² in the last term makes this algorithm scale poorly to very large datasets.


Other Dimensionality Reduction Techniques

There are many other dimensionality reduction techniques, several of which are available in Scikit-Learn. Here are some of the most popular (a short usage sketch follows Figure 8-13):

Multidimensional Scaling (MDS) reduces dimensionality while trying to preserve the distances between the instances (see Figure 8-13).

Isomap creates a graph by connecting each instance to its nearest neighbors, then reduces dimensionality while trying to preserve the geodesic distances9 between the instances.

t-Distributed Stochastic Neighbor Embedding (t-SNE) reduces dimensionality while trying to keep similar instances close and dissimilar instances apart. It is mostly used for visualization, in particular to visualize clusters of instances in high-dimensional space (e.g., to visualize the MNIST images in 2D).

Linear Discriminant Analysis (LDA) is actually a classification algorithm, but during training it learns the most discriminative axes between the classes, and these axes can then be used to define a hyperplane onto which to project the data. The benefit is that the projection will keep classes as far apart as possible, so LDA is a good technique to reduce dimensionality before running another classification algorithm such as an SVM classifier.

Figure 8-13. Reducing the Swiss roll to 2D using various techniques
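Here is a brief usage sketch (assuming X is the Swiss roll data from earlier) applying some of these techniques with Scikit-Learn, as in Figure 8-13; note that LDA would additionally require class labels:

from sklearn.manifold import MDS, Isomap, TSNE

mds = MDS(n_components=2)
X_reduced_mds = mds.fit_transform(X)

isomap = Isomap(n_components=2)
X_reduced_isomap = isomap.fit_transform(X)

tsne = TSNE(n_components=2)
X_reduced_tsne = tsne.fit_transform(X)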


Exercises

1. What are the main motivations for reducing a dataset's dimensionality? What are the main drawbacks?

2. What is the curse of dimensionality?

3. Once a dataset's dimensionality has been reduced, is it possible to reverse the operation? If so, how? If not, why?

4. Can PCA be used to reduce the dimensionality of a highly nonlinear dataset?

5. Suppose you perform PCA on a 1,000-dimensional dataset, setting the explained variance ratio to 95%. How many dimensions will the resulting dataset have?

6. In what cases would you use vanilla PCA, Incremental PCA, Randomized PCA, or Kernel PCA?

7. How can you evaluate the performance of a dimensionality reduction algorithm on your dataset?

8. Does it make any sense to chain two different dimensionality reduction algorithms?

9. Load the MNIST dataset (introduced in Chapter 3) and split it into a training set and a test set (take the first 60,000 instances for training, and the remaining 10,000 for testing). Train a Random Forest classifier on the dataset and time how long it takes, then evaluate the resulting model on the test set. Next, use PCA to reduce the dataset's dimensionality, with an explained variance ratio of 95%. Train a new Random Forest classifier on the reduced dataset and see how long it takes. Was training much faster? Next evaluate the classifier on the test set: how does it compare to the previous classifier?

10. Use t-SNE to reduce the MNIST dataset down to two dimensions and plot the result using Matplotlib. You can use a scatterplot using 10 different colors to represent each image's target class. Alternatively, you can write colored digits at the location of each instance, or even plot scaled-down versions of the digit images themselves (if you plot all digits, the visualization will be too cluttered, so you should either draw a random sample or plot an instance only if no other instance has already been plotted at a close distance). You should get a nice visualization with well-separated clusters of digits. Try using other dimensionality reduction algorithms such as PCA, LLE, or MDS and compare the resulting visualizations.

Solutions to these exercises are available in Appendix A.

1. Well, four dimensions if you count time, and a few more if you are a string theorist.

2. Watch a rotating tesseract projected into 3D space at http://goo.gl/OM7ktJ. Image by Wikipedia user NerdBoy1392 (Creative Commons BY-SA 3.0). Reproduced from https://en.wikipedia.org/wiki/Tesseract.

3. Fun fact: anyone you know is probably an extremist in at least one dimension (e.g., how much sugar they put in their coffee), if you consider enough dimensions.

4. "On Lines and Planes of Closest Fit to Systems of Points in Space," K. Pearson (1901).

5. Scikit-Learn uses the algorithm described in "Incremental Learning for Robust Visual Tracking," D. Ross et al. (2007).


6. "Kernel Principal Component Analysis," B. Schölkopf, A. Smola, K. Müller (1999).

7. Scikit-Learn uses the algorithm based on Kernel Ridge Regression described in Gokhan H. Bakır, Jason Weston, and Bernhard Scholkopf, "Learning to Find Pre-images" (Tubingen, Germany: Max Planck Institute for Biological Cybernetics, 2004).

8. "Nonlinear Dimensionality Reduction by Locally Linear Embedding," S. Roweis, L. Saul (2000).

9. The geodesic distance between two nodes in a graph is the number of nodes on the shortest path between these nodes.


Part II. Neural Networks and Deep Learning


Chapter 9. Up and Running with TensorFlow

TensorFlow is a powerful open source software library for numerical computation, particularly well suited and fine-tuned for large-scale Machine Learning. Its basic principle is simple: you first define in Python a graph of computations to perform (for example, the one in Figure 9-1), and then TensorFlow takes that graph and runs it efficiently using optimized C++ code.

Figure 9-1. A simple computation graph

Most importantly, it is possible to break up the graph into several chunks and run them in parallel across multiple CPUs or GPUs (as shown in Figure 9-2). TensorFlow also supports distributed computing, so you can train colossal neural networks on humongous training sets in a reasonable amount of time by splitting the computations across hundreds of servers (see Chapter 12). TensorFlow can train a network with millions of parameters on a training set composed of billions of instances with millions of features each. This should come as no surprise, since TensorFlow was developed by the Google Brain team and it powers many of Google's large-scale services, such as Google Cloud Speech, Google Photos, and Google Search.


Figure 9-2. Parallel computation on multiple CPUs/GPUs/servers

When TensorFlow was open-sourced in November 2015, there were already many popular open source libraries for Deep Learning (Table 9-1 lists a few), and to be fair most of TensorFlow's features already existed in one library or another. Nevertheless, TensorFlow's clean design, scalability, flexibility,1 and great documentation (not to mention Google's name) quickly boosted it to the top of the list. In short, TensorFlow was designed to be flexible, scalable, and production-ready, and existing frameworks arguably hit only two out of the three of these. Here are some of TensorFlow's highlights:

It runs not only on Windows, Linux, and macOS, but also on mobile devices, including both iOS and Android.

It provides a very simple Python API called TF.Learn2 (tensorflow.contrib.learn), compatible with Scikit-Learn. As you will see, you can use it to train various types of neural networks in just a few lines of code. It was previously an independent project called Scikit Flow (or skflow).

It also provides another simple API called TF-slim (tensorflow.contrib.slim) to simplify building, training, and evaluating neural networks.

Several other high-level APIs have been built independently on top of TensorFlow, such as Keras (now available in tensorflow.contrib.keras) or Pretty Tensor.


Its main Python API offers much more flexibility (at the cost of higher complexity) to create all sorts of computations, including any neural network architecture you can think of.

It includes highly efficient C++ implementations of many ML operations, particularly those needed to build neural networks. There is also a C++ API to define your own high-performance operations.

It provides several advanced optimization nodes to search for the parameters that minimize a cost function. These are very easy to use since TensorFlow automatically takes care of computing the gradients of the functions you define. This is called automatic differentiating (or autodiff).

It also comes with a great visualization tool called TensorBoard that allows you to browse through the computation graph, view learning curves, and more.

Google also launched a cloud service to run TensorFlow graphs.

Last but not least, it has a dedicated team of passionate and helpful developers, and a growing community contributing to improving it. It is one of the most popular open source projects on GitHub, and more and more great projects are being built on top of it (for examples, check out the resources page on https://www.tensorflow.org/, or https://github.com/jtoy/awesome-tensorflow). To ask technical questions, you should use http://stackoverflow.com/ and tag your question with "tensorflow". You can file bugs and feature requests through GitHub. For general discussions, join the Google group.

In this chapter, we will go through the basics of TensorFlow, from installation to creating, running, saving, and visualizing simple computational graphs. Mastering these basics is important before you build your first neural network (which we will do in the next chapter).

Table 9-1. Open source Deep Learning libraries (not an exhaustive list)

Library          API                    Platforms                             Started by                                 Year
Caffe            Python, C++, Matlab    Linux, macOS, Windows                 Y. Jia, UC Berkeley (BVLC)                 2013
Deeplearning4j   Java, Scala, Clojure   Linux, macOS, Windows, Android        A. Gibson, J. Patterson                    2014
H2O              Python, R              Linux, macOS, Windows                 H2O.ai                                     2014
MXNet            Python, C++, others    Linux, macOS, Windows, iOS, Android   DMLC                                       2015
TensorFlow       Python, C++            Linux, macOS, Windows, iOS, Android   Google                                     2015
Theano           Python                 Linux, macOS, iOS                     University of Montreal                     2010
Torch            C++, Lua               Linux, macOS, iOS, Android            R. Collobert, K. Kavukcuoglu, C. Farabet   2002


Installation

Let's get started! Assuming you installed Jupyter and Scikit-Learn by following the installation instructions in Chapter 2, you can simply use pip to install TensorFlow. If you created an isolated environment using virtualenv, you first need to activate it:

$ cd $ML_PATH    # Your ML working directory (e.g., $HOME/ml)
$ source env/bin/activate

Next, install TensorFlow:

$ pip3 install --upgrade tensorflow

NOTE
For GPU support, you need to install tensorflow-gpu instead of tensorflow. See Chapter 12 for more details.

To test your installation, type the following command. It should output the version of TensorFlow you installed.

$ python3 -c 'import tensorflow; print(tensorflow.__version__)'
1.0.0


Creating Your First Graph and Running It in a Session

The following code creates the graph represented in Figure 9-1:

import tensorflow as tf

x = tf.Variable(3, name="x")
y = tf.Variable(4, name="y")
f = x*x*y + y + 2

That's all there is to it! The most important thing to understand is that this code does not actually perform any computation, even though it looks like it does (especially the last line). It just creates a computation graph. In fact, even the variables are not initialized yet. To evaluate this graph, you need to open a TensorFlow session and use it to initialize the variables and evaluate f. A TensorFlow session takes care of placing the operations onto devices such as CPUs and GPUs and running them, and it holds all the variable values.3 The following code creates a session, initializes the variables, evaluates f, and then closes the session (which frees up resources):

>>> sess = tf.Session()
>>> sess.run(x.initializer)
>>> sess.run(y.initializer)
>>> result = sess.run(f)
>>> print(result)
42
>>> sess.close()

Having to repeat sess.run() all the time is a bit cumbersome, but fortunately there is a better way:

with tf.Session() as sess:
    x.initializer.run()
    y.initializer.run()
    result = f.eval()

Inside the with block, the session is set as the default session. Calling x.initializer.run() is equivalent to calling tf.get_default_session().run(x.initializer), and similarly f.eval() is equivalent to calling tf.get_default_session().run(f). This makes the code easier to read. Moreover, the session is automatically closed at the end of the block.

Instead of manually running the initializer for every single variable, you can use the global_variables_initializer() function. Note that it does not actually perform the initialization immediately, but rather creates a node in the graph that will initialize all variables when it is run:

init = tf.global_variables_initializer()  # prepare an init node

with tf.Session() as sess:
    init.run()  # actually initialize all the variables
    result = f.eval()

Inside Jupyter or within a Python shell you may prefer to create an InteractiveSession. The only difference from a regular Session is that when an InteractiveSession is created it automatically sets itself as the default session, so you don't need a with block (but you do need to close the session manually when you are done with it):

>>> sess = tf.InteractiveSession()
>>> init.run()
>>> result = f.eval()
>>> print(result)
42
>>> sess.close()

A TensorFlow program is typically split into two parts: the first part builds a computation graph (this is called the construction phase), and the second part runs it (this is the execution phase). The construction phase typically builds a computation graph representing the ML model and the computations required to train it. The execution phase generally runs a loop that evaluates a training step repeatedly (for example, one step per mini-batch), gradually improving the model parameters. We will go through an example shortly.


Managing Graphs

Any node you create is automatically added to the default graph:

>>> x1 = tf.Variable(1)
>>> x1.graph is tf.get_default_graph()
True

In most cases this is fine, but sometimes you may want to manage multiple independent graphs. You can do this by creating a new Graph and temporarily making it the default graph inside a with block, like so:

>>> graph = tf.Graph()
>>> with graph.as_default():
...     x2 = tf.Variable(2)
...
>>> x2.graph is graph
True
>>> x2.graph is tf.get_default_graph()
False

TIP
In Jupyter (or in a Python shell), it is common to run the same commands more than once while you are experimenting. As a result, you may end up with a default graph containing many duplicate nodes. One solution is to restart the Jupyter kernel (or the Python shell), but a more convenient solution is to just reset the default graph by running tf.reset_default_graph().


Lifecycle of a Node Value

When you evaluate a node, TensorFlow automatically determines the set of nodes that it depends on and it evaluates these nodes first. For example, consider the following code:

w = tf.constant(3)
x = w + 2
y = x + 5
z = x * 3

with tf.Session() as sess:
    print(y.eval())  # 10
    print(z.eval())  # 15

First, this code defines a very simple graph. Then it starts a session and runs the graph to evaluate y: TensorFlow automatically detects that y depends on x, which depends on w, so it first evaluates w, then x, then y, and returns the value of y. Finally, the code runs the graph to evaluate z. Once again, TensorFlow detects that it must first evaluate w and x. It is important to note that it will not reuse the result of the previous evaluation of w and x. In short, the preceding code evaluates w and x twice.

All node values are dropped between graph runs, except variable values, which are maintained by the session across graph runs (queues and readers also maintain some state, as we will see in Chapter 12). A variable starts its life when its initializer is run, and it ends when the session is closed.
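As a small illustration (not from the book), the following sketch shows a variable keeping its value across graph runs within one session:

v = tf.Variable(0)
increment = tf.assign(v, v + 1)  # a node that adds 1 to v when run

with tf.Session() as sess:
    sess.run(v.initializer)      # the variable's life starts here
    print(sess.run(increment))   # 1
    print(sess.run(increment))   # 2: v kept its value between runs
# the session is closed here, ending the variable's life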

If you want to evaluate y and z efficiently, without evaluating w and x twice as in the previous code, you must ask TensorFlow to evaluate both y and z in just one graph run, as shown in the following code:

with tf.Session() as sess:
    y_val, z_val = sess.run([y, z])
    print(y_val)  # 10
    print(z_val)  # 15

WARNING
In single-process TensorFlow, multiple sessions do not share any state, even if they reuse the same graph (each session would have its own copy of every variable). In distributed TensorFlow (see Chapter 12), variable state is stored on the servers, not in the sessions, so multiple sessions can share the same variables.


Linear Regression with TensorFlow

TensorFlow operations (also called ops for short) can take any number of inputs and produce any number of outputs. For example, the addition and multiplication ops each take two inputs and produce one output. Constants and variables take no input (they are called source ops). The inputs and outputs are multidimensional arrays, called tensors (hence the name "tensor flow"). Just like NumPy arrays, tensors have a type and a shape. In fact, in the Python API tensors are simply represented by NumPy ndarrays. They typically contain floats, but you can also use them to carry strings (arbitrary byte arrays).

In the examples so far, the tensors just contained a single scalar value, but you can of course perform computations on arrays of any shape. For example, the following code manipulates 2D arrays to perform Linear Regression on the California housing dataset (introduced in Chapter 2). It starts by fetching the dataset; then it adds an extra bias input feature (x0 = 1) to all training instances (it does so using NumPy so it runs immediately); then it creates two TensorFlow constant nodes, X and y, to hold this data and the targets,4 and it uses some of the matrix operations provided by TensorFlow to define theta. These matrix functions (transpose(), matmul(), and matrix_inverse()) are self-explanatory, but as usual they do not perform any computations immediately; instead, they create nodes in the graph that will perform them when the graph is run. You may recognize that the definition of theta corresponds to the Normal Equation (θ̂ = (X^T · X)^(–1) · X^T · y; see Chapter 4). Finally, the code creates a session and uses it to evaluate theta.

import numpy as np
from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing()
m, n = housing.data.shape
housing_data_plus_bias = np.c_[np.ones((m, 1)), housing.data]

X = tf.constant(housing_data_plus_bias, dtype=tf.float32, name="X")
y = tf.constant(housing.target.reshape(-1, 1), dtype=tf.float32, name="y")
XT = tf.transpose(X)
theta = tf.matmul(tf.matmul(tf.matrix_inverse(tf.matmul(XT, X)), XT), y)

with tf.Session() as sess:
    theta_value = theta.eval()

The main benefit of this code versus computing the Normal Equation directly using NumPy is that TensorFlow will automatically run this on your GPU card if you have one (provided you installed TensorFlow with GPU support, of course; see Chapter 12 for more details).


Implementing Gradient Descent

Let's try using Batch Gradient Descent (introduced in Chapter 4) instead of the Normal Equation. First we will do this by manually computing the gradients, then we will use TensorFlow's autodiff feature to let TensorFlow compute the gradients automatically, and finally we will use a couple of TensorFlow's out-of-the-box optimizers.

WARNING
When using Gradient Descent, remember that it is important to first normalize the input feature vectors, or else training may be much slower. You can do this using TensorFlow, NumPy, Scikit-Learn's StandardScaler, or any other solution you prefer. The following code assumes that this normalization has already been done.


Manually Computing the Gradients

The following code should be fairly self-explanatory, except for a few new elements:

The random_uniform() function creates a node in the graph that will generate a tensor containing random values, given its shape and value range, much like NumPy's rand() function.

The assign() function creates a node that will assign a new value to a variable. In this case, it implements the Batch Gradient Descent step θ(next step) = θ – η ∇θ MSE(θ).

The main loop executes the training step over and over again (n_epochs times), and every 100 iterations it prints out the current Mean Squared Error (mse). You should see the MSE go down at every iteration.

n_epochs = 1000
learning_rate = 0.01

X = tf.constant(scaled_housing_data_plus_bias, dtype=tf.float32, name="X")
y = tf.constant(housing.target.reshape(-1, 1), dtype=tf.float32, name="y")
theta = tf.Variable(tf.random_uniform([n + 1, 1], -1.0, 1.0), name="theta")
y_pred = tf.matmul(X, theta, name="predictions")
error = y_pred - y
mse = tf.reduce_mean(tf.square(error), name="mse")
gradients = 2/m * tf.matmul(tf.transpose(X), error)
training_op = tf.assign(theta, theta - learning_rate * gradients)

init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)

    for epoch in range(n_epochs):
        if epoch % 100 == 0:
            print("Epoch", epoch, "MSE =", mse.eval())
        sess.run(training_op)

    best_theta = theta.eval()


Using autodiff

The preceding code works fine, but it requires mathematically deriving the gradients from the cost function (MSE). In the case of Linear Regression, it is reasonably easy, but if you had to do this with deep neural networks you would get quite a headache: it would be tedious and error-prone. You could use symbolic differentiation to automatically find the equations for the partial derivatives for you, but the resulting code would not necessarily be very efficient.

To understand why, consider the function f(x) = exp(exp(exp(x))). If you know calculus, you can figure out its derivative f′(x) = exp(x) × exp(exp(x)) × exp(exp(exp(x))). If you code f(x) and f′(x) separately and exactly as they appear, your code will not be as efficient as it could be. A more efficient solution would be to write a function that first computes exp(x), then exp(exp(x)), then exp(exp(exp(x))), and returns all three. This gives you f(x) directly (the third term), and if you need the derivative you can just multiply all three terms and you are done. With the naïve approach you would have had to call the exp function nine times to compute both f(x) and f′(x). With this approach you just need to call it three times.
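In code, the efficient approach looks like this (a simple sketch, not from the book):

from math import exp

def f_and_derivative(x):
    # Compute each nested term once and reuse it.
    e1 = exp(x)   # exp(x)
    e2 = exp(e1)  # exp(exp(x))
    e3 = exp(e2)  # exp(exp(exp(x))) = f(x)
    return e3, e1 * e2 * e3  # f(x) and f'(x): three exp calls instead of nine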

It gets worse when your function is defined by some arbitrary code. Can you find the equation (or the code) to compute the partial derivatives of the following function? Hint: don't even try.

def my_func(a, b):
    z = 0
    for i in range(100):
        z = a * np.cos(z + i) + z * np.sin(b - i)
    return z

Fortunately, TensorFlow's autodiff feature comes to the rescue: it can automatically and efficiently compute the gradients for you. Simply replace the gradients = ... line in the Gradient Descent code in the previous section with the following line, and the code will continue to work just fine:

gradients = tf.gradients(mse, [theta])[0]

The gradients() function takes an op (in this case mse) and a list of variables (in this case just theta), and it creates a list of ops (one per variable) to compute the gradients of the op with regard to each variable. So the gradients node will compute the gradient vector of the MSE with regard to theta.

There are four main approaches to computing gradients automatically. They are summarized in Table 9-2. TensorFlow uses reverse-mode autodiff, which is perfect (efficient and accurate) when there are many inputs and few outputs, as is often the case in neural networks. It computes all the partial derivatives of the outputs with regard to all the inputs in just n_outputs + 1 graph traversals.

Table 9-2. Main solutions to compute gradients automatically

Technique                   Nb of graph traversals to compute all gradients   Accuracy   Supports arbitrary code   Comment
Numerical differentiation   n_inputs + 1                                      Low        Yes                       Trivial to implement
Symbolic differentiation    N/A                                               High       No                        Builds a very different graph
Forward-mode autodiff       n_inputs                                          High       Yes                       Uses dual numbers
Reverse-mode autodiff       n_outputs + 1                                     High       Yes                       Implemented by TensorFlow

If you are interested in how this magic works, check out Appendix D.


Using an Optimizer

So TensorFlow computes the gradients for you. But it gets even easier: it also provides a number of optimizers out of the box, including a Gradient Descent optimizer. You can simply replace the preceding gradients = ... and training_op = ... lines with the following code, and once again everything will just work fine:

optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(mse)

If you want to use a different type of optimizer, you just need to change one line. For example, you can use a momentum optimizer (which often converges much faster than Gradient Descent; see Chapter 11) by defining the optimizer like this:

optimizer = tf.train.MomentumOptimizer(learning_rate=learning_rate,
                                       momentum=0.9)


Feeding Data to the Training Algorithm

Let's try to modify the previous code to implement Mini-batch Gradient Descent. For this, we need a way to replace X and y at every iteration with the next mini-batch. The simplest way to do this is to use placeholder nodes. These nodes are special because they don't actually perform any computation, they just output the data you tell them to output at runtime. They are typically used to pass the training data to TensorFlow during training. If you don't specify a value at runtime for a placeholder, you get an exception.

Tocreateaplaceholdernode,youmustcalltheplaceholder()functionandspecifytheoutputtensor’sdatatype.Optionally,youcanalsospecifyitsshape,ifyouwanttoenforceit.IfyouspecifyNoneforadimension,itmeans“anysize.”Forexample,thefollowingcodecreatesaplaceholdernodeA,andalsoanodeB=A+5.WhenweevaluateB,wepassafeed_dicttotheeval()methodthatspecifiesthevalueofA.NotethatAmusthaverank2(i.e.,itmustbetwo-dimensional)andtheremustbethreecolumns(orelseanexceptionisraised),butitcanhaveanynumberofrows.

>>> A = tf.placeholder(tf.float32, shape=(None, 3))
>>> B = A + 5
>>> with tf.Session() as sess:
...     B_val_1 = B.eval(feed_dict={A: [[1, 2, 3]]})
...     B_val_2 = B.eval(feed_dict={A: [[4, 5, 6], [7, 8, 9]]})
...
>>> print(B_val_1)
[[ 6.  7.  8.]]
>>> print(B_val_2)
[[  9.  10.  11.]
 [ 12.  13.  14.]]

NOTE: You can actually feed the output of any operations, not just placeholders. In this case TensorFlow does not try to evaluate these operations; it uses the values you feed it.
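For example, here is a minimal sketch (the C node below is ours, not part of the book's code): we feed a value for B directly, so TensorFlow never evaluates A + 5.

C = B * 2  # an op that depends on B
with tf.Session() as sess:
    # feeding B short-circuits its computation; A does not need to be fed
    C_val = C.eval(feed_dict={B: [[0., 1., 2.]]})

print(C_val)  # [[0. 2. 4.]]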

To implement Mini-batch Gradient Descent, we only need to tweak the existing code slightly. First change the definition of X and y in the construction phase to make them placeholder nodes:

X = tf.placeholder(tf.float32, shape=(None, n + 1), name="X")
y = tf.placeholder(tf.float32, shape=(None, 1), name="y")

Then define the batch size and compute the total number of batches:

batch_size = 100
n_batches = int(np.ceil(m / batch_size))

Finally, in the execution phase, fetch the mini-batches one by one, then provide the value of X and y via the feed_dict parameter when evaluating a node that depends on either of them.

def fetch_batch(epoch, batch_index, batch_size):
    [...]  # load the data from disk
    return X_batch, y_batch

with tf.Session() as sess:
    sess.run(init)

    for epoch in range(n_epochs):
        for batch_index in range(n_batches):
            X_batch, y_batch = fetch_batch(epoch, batch_index, batch_size)
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})

    best_theta = theta.eval()

NOTEWedon’tneedtopassthevalueofXandywhenevaluatingthetasinceitdoesnotdependoneitherofthem.


Saving and Restoring Models

Once you have trained your model, you should save its parameters to disk so you can come back to it whenever you want, use it in another program, compare it to other models, and so on. Moreover, you probably want to save checkpoints at regular intervals during training so that if your computer crashes during training you can continue from the last checkpoint rather than start over from scratch.

TensorFlow makes saving and restoring a model very easy. Just create a Saver node at the end of the construction phase (after all variable nodes are created); then, in the execution phase, just call its save() method whenever you want to save the model, passing it the session and path of the checkpoint file:

[...]
theta = tf.Variable(tf.random_uniform([n + 1, 1], -1.0, 1.0), name="theta")
[...]
init = tf.global_variables_initializer()
saver = tf.train.Saver()

with tf.Session() as sess:
    sess.run(init)

    for epoch in range(n_epochs):
        if epoch % 100 == 0:  # checkpoint every 100 epochs
            save_path = saver.save(sess, "/tmp/my_model.ckpt")

        sess.run(training_op)

    best_theta = theta.eval()
    save_path = saver.save(sess, "/tmp/my_model_final.ckpt")

Restoring a model is just as easy: you create a Saver at the end of the construction phase just like before, but then at the beginning of the execution phase, instead of initializing the variables using the init node, you call the restore() method of the Saver object:

with tf.Session() as sess:
    saver.restore(sess, "/tmp/my_model_final.ckpt")
    [...]

By default a Saver saves and restores all variables under their own name, but if you need more control, you can specify which variables to save or restore, and what names to use. For example, the following Saver will save or restore only the theta variable under the name weights:

saver = tf.train.Saver({"weights": theta})

By default, the save() method also saves the structure of the graph in a second file with the same name plus a .meta extension. You can load this graph structure using tf.train.import_meta_graph(). This adds the graph to the default graph, and returns a Saver instance that you can then use to restore the graph's state (i.e., the variable values):

saver = tf.train.import_meta_graph("/tmp/my_model_final.ckpt.meta")

with tf.Session() as sess:
    saver.restore(sess, "/tmp/my_model_final.ckpt")
    [...]


This allows you to fully restore a saved model, including both the graph structure and the variable values, without having to search for the code that built it.


Visualizing the Graph and Training Curves Using TensorBoard

So now we have a computation graph that trains a Linear Regression model using Mini-batch Gradient Descent, and we are saving checkpoints at regular intervals. Sounds sophisticated, doesn't it? However, we are still relying on the print() function to visualize progress during training. There is a better way: enter TensorBoard. If you feed it some training stats, it will display nice interactive visualizations of these stats in your web browser (e.g., learning curves). You can also provide it the graph's definition and it will give you a great interface to browse through it. This is very useful to identify errors in the graph, to find bottlenecks, and so on.

The first step is to tweak your program a bit so it writes the graph definition and some training stats (for example, the training error, MSE) to a log directory that TensorBoard will read from. You need to use a different log directory every time you run your program, or else TensorBoard will merge stats from different runs, which will mess up the visualizations. The simplest solution for this is to include a timestamp in the log directory name. Add the following code at the beginning of the program:

from datetime import datetime

now = datetime.utcnow().strftime("%Y%m%d%H%M%S")
root_logdir = "tf_logs"
logdir = "{}/run-{}/".format(root_logdir, now)

Next, add the following code at the very end of the construction phase:

mse_summary = tf.summary.scalar('MSE', mse)
file_writer = tf.summary.FileWriter(logdir, tf.get_default_graph())

The first line creates a node in the graph that will evaluate the MSE value and write it to a TensorBoard-compatible binary log string called a summary. The second line creates a FileWriter that you will use to write summaries to logfiles in the log directory. The first parameter indicates the path of the log directory (in this case something like tf_logs/run-20160906091959/, relative to the current directory). The second (optional) parameter is the graph you want to visualize. Upon creation, the FileWriter creates the log directory if it does not already exist (and its parent directories if needed), and writes the graph definition in a binary logfile called an events file.

Next you need to update the execution phase to evaluate the mse_summary node regularly during training (e.g., every 10 mini-batches). This will output a summary that you can then write to the events file using the file_writer. Here is the updated code:

[...]
for batch_index in range(n_batches):
    X_batch, y_batch = fetch_batch(epoch, batch_index, batch_size)
    if batch_index % 10 == 0:
        summary_str = mse_summary.eval(feed_dict={X: X_batch, y: y_batch})
        step = epoch * n_batches + batch_index
        file_writer.add_summary(summary_str, step)
    sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
[...]


WARNING: Avoid logging training stats at every single training step, as this would significantly slow down training.

Finally, you want to close the FileWriter at the end of the program:

file_writer.close()

Now run this program: it will create the log directory and write an events file in this directory, containing both the graph definition and the MSE values. Open up a shell and go to your working directory, then type ls -l tf_logs/run* to list the contents of the log directory:

$ cd $ML_PATH               # your ML working directory (e.g., $HOME/ml)
$ ls -l tf_logs/run*
total 40
-rw-r--r--  1 ageron  staff  18620 Sep  6 11:10 events.out.tfevents.1472553182.mymac

If you run the program a second time, you should see a second directory in the tf_logs/ directory:

$ ls -l tf_logs/
total 0
drwxr-xr-x  3 ageron  staff  102 Sep  6 10:07 run-20160906091959
drwxr-xr-x  3 ageron  staff  102 Sep  6 10:22 run-20160906092202

Great!Nowit’stimetofireuptheTensorBoardserver.Youneedtoactivateyourvirtualenvenvironmentifyoucreatedone,thenstarttheserverbyrunningthetensorboardcommand,pointingittotherootlogdirectory.ThisstartstheTensorBoardwebserver,listeningonport6006(whichis“goog”writtenupsidedown):

$sourceenv/bin/activate

$tensorboard--logdirtf_logs/

StartingTensorBoardonport6006

(Youcannavigatetohttp://0.0.0.0:6006)

Next open a browser and go to http://0.0.0.0:6006/ (or http://localhost:6006/). Welcome to TensorBoard! In the Events tab you should see MSE on the right. If you click on it, you will see a plot of the MSE during training, for both runs (Figure 9-3). You can check or uncheck the runs you want to see, zoom in or out, hover over the curve to get details, and so on.


Figure 9-3. Visualizing training stats using TensorBoard

Now click on the Graphs tab. You should see the graph shown in Figure 9-4.

To reduce clutter, the nodes that have many edges (i.e., connections to other nodes) are separated out to an auxiliary area on the right (you can move a node back and forth between the main graph and the auxiliary area by right-clicking on it). Some parts of the graph are also collapsed by default. For example, try hovering over the gradients node, then click on its expand icon to expand this subgraph. Next, in this subgraph, try expanding the mse_grad subgraph.

Figure 9-4. Visualizing the graph using TensorBoard

TIP: If you want to take a peek at the graph directly within Jupyter, you can use the show_graph() function available in the notebook for this chapter. It was originally written by A. Mordvintsev in his great deepdream tutorial notebook. Another option is to install E. Jang's TensorFlow debugger tool, which includes a Jupyter extension for graph visualization (and more).


Name Scopes

When dealing with more complex models such as neural networks, the graph can easily become cluttered with thousands of nodes. To avoid this, you can create name scopes to group related nodes. For example, let's modify the previous code to define the error and mse ops within a name scope called "loss":

with tf.name_scope("loss") as scope:
    error = y_pred - y
    mse = tf.reduce_mean(tf.square(error), name="mse")

The name of each op defined within the scope is now prefixed with "loss/":

>>> print(error.op.name)
loss/sub
>>> print(mse.op.name)
loss/mse

In TensorBoard, the mse and error nodes now appear inside the loss namespace, which appears collapsed by default (Figure 9-5).

Figure 9-5. A collapsed name scope in TensorBoard


Modularity

Suppose you want to create a graph that adds the output of two rectified linear units (ReLU). A ReLU computes a linear function of the inputs, and outputs the result if it is positive, and 0 otherwise, as shown in Equation 9-1.

Equation 9-1. Rectified linear unit

hw,b(X) = max(X · w + b, 0)

The following code does the job, but it's quite repetitive:

n_features = 3
X = tf.placeholder(tf.float32, shape=(None, n_features), name="X")

w1 = tf.Variable(tf.random_normal((n_features, 1)), name="weights1")
w2 = tf.Variable(tf.random_normal((n_features, 1)), name="weights2")
b1 = tf.Variable(0.0, name="bias1")
b2 = tf.Variable(0.0, name="bias2")

z1 = tf.add(tf.matmul(X, w1), b1, name="z1")
z2 = tf.add(tf.matmul(X, w2), b2, name="z2")

relu1 = tf.maximum(z1, 0., name="relu1")
relu2 = tf.maximum(z1, 0., name="relu2")

output = tf.add(relu1, relu2, name="output")

Such repetitive code is hard to maintain and error-prone (in fact, this code contains a cut-and-paste error; did you spot it?). It would become even worse if you wanted to add a few more ReLUs. Fortunately, TensorFlow lets you stay DRY (Don't Repeat Yourself): simply create a function to build a ReLU. The following code creates five ReLUs and outputs their sum (note that add_n() creates an operation that will compute the sum of a list of tensors):

def relu(X):
    w_shape = (int(X.get_shape()[1]), 1)
    w = tf.Variable(tf.random_normal(w_shape), name="weights")
    b = tf.Variable(0.0, name="bias")
    z = tf.add(tf.matmul(X, w), b, name="z")
    return tf.maximum(z, 0., name="relu")

n_features = 3
X = tf.placeholder(tf.float32, shape=(None, n_features), name="X")
relus = [relu(X) for i in range(5)]
output = tf.add_n(relus, name="output")

Note that when you create a node, TensorFlow checks whether its name already exists, and if it does it appends an underscore followed by an index to make the name unique. So the first ReLU contains nodes named "weights", "bias", "z", and "relu" (plus many more nodes with their default name, such as "MatMul"); the second ReLU contains nodes named "weights_1", "bias_1", and so on; the third ReLU contains nodes named "weights_2", "bias_2", and so on. TensorBoard identifies such series and collapses them together to reduce clutter (as you can see in Figure 9-6).


Figure 9-6. Collapsed node series

Using name scopes, you can make the graph much clearer. Simply move all the content of the relu() function inside a name scope. Figure 9-7 shows the resulting graph. Notice that TensorFlow also gives the name scopes unique names by appending _1, _2, and so on.

def relu(X):
    with tf.name_scope("relu"):
        [...]


Figure 9-7. A clearer graph using name-scoped units


Sharing Variables

If you want to share a variable between various components of your graph, one simple option is to create it first, then pass it as a parameter to the functions that need it. For example, suppose you want to control the ReLU threshold (currently hardcoded to 0) using a shared threshold variable for all ReLUs. You could just create that variable first, and then pass it to the relu() function:

def relu(X, threshold):
    with tf.name_scope("relu"):
        [...]
        return tf.maximum(z, threshold, name="max")

threshold = tf.Variable(0.0, name="threshold")
X = tf.placeholder(tf.float32, shape=(None, n_features), name="X")
relus = [relu(X, threshold) for i in range(5)]
output = tf.add_n(relus, name="output")

This works fine: now you can control the threshold for all ReLUs using the threshold variable. However, if there are many shared parameters such as this one, it will be painful to have to pass them around as parameters all the time. Many people create a Python dictionary containing all the variables in their model, and pass it around to every function. Others create a class for each module (e.g., a ReLU class using class variables to handle the shared parameter). Yet another option is to set the shared variable as an attribute of the relu() function upon the first call, like so:

def relu(X):
    with tf.name_scope("relu"):
        if not hasattr(relu, "threshold"):
            relu.threshold = tf.Variable(0.0, name="threshold")
        [...]
        return tf.maximum(z, relu.threshold, name="max")

TensorFlow offers another option, which may lead to slightly cleaner and more modular code than the previous solutions.5 This solution is a bit tricky to understand at first, but since it is used a lot in TensorFlow it is worth going into a bit of detail. The idea is to use the get_variable() function to create the shared variable if it does not exist yet, or reuse it if it already exists. The desired behavior (creating or reusing) is controlled by an attribute of the current variable_scope(). For example, the following code will create a variable named "relu/threshold" (as a scalar, since shape=(), and using 0.0 as the initial value):

with tf.variable_scope("relu"):
    threshold = tf.get_variable("threshold", shape=(),
                                initializer=tf.constant_initializer(0.0))

Note that if the variable has already been created by an earlier call to get_variable(), this code will raise an exception. This behavior prevents reusing variables by mistake. If you want to reuse a variable, you need to explicitly say so by setting the variable scope's reuse attribute to True (in which case you don't have to specify the shape or the initializer):

with tf.variable_scope("relu", reuse=True):
    threshold = tf.get_variable("threshold")

This code will fetch the existing "relu/threshold" variable, or raise an exception if it does not exist or if it was not created using get_variable(). Alternatively, you can set the reuse attribute to True inside the block by calling the scope's reuse_variables() method:

with tf.variable_scope("relu") as scope:
    scope.reuse_variables()
    threshold = tf.get_variable("threshold")

WARNING: Once reuse is set to True, it cannot be set back to False within the block. Moreover, if you define other variable scopes inside this one, they will automatically inherit reuse=True. Lastly, only variables created by get_variable() can be reused this way.

Now you have all the pieces you need to make the relu() function access the threshold variable without having to pass it as a parameter:

def relu(X):
    with tf.variable_scope("relu", reuse=True):
        threshold = tf.get_variable("threshold")  # reuse existing variable
        [...]
        return tf.maximum(z, threshold, name="max")

X = tf.placeholder(tf.float32, shape=(None, n_features), name="X")
with tf.variable_scope("relu"):  # create the variable
    threshold = tf.get_variable("threshold", shape=(),
                                initializer=tf.constant_initializer(0.0))
relus = [relu(X) for relu_index in range(5)]
output = tf.add_n(relus, name="output")

This code first defines the relu() function, then creates the relu/threshold variable (as a scalar that will later be initialized to 0.0) and builds five ReLUs by calling the relu() function. The relu() function reuses the relu/threshold variable, and creates the other ReLU nodes.

NOTE: Variables created using get_variable() are always named using the name of their variable_scope as a prefix (e.g., "relu/threshold"), but for all other nodes (including variables created with tf.Variable()) the variable scope acts like a new name scope. In particular, if a name scope with an identical name was already created, then a suffix is added to make the name unique. For example, all nodes created in the preceding code (except the threshold variable) have a name prefixed with "relu_1/" to "relu_5/", as shown in Figure 9-8.


Figure 9-8. Five ReLUs sharing the threshold variable

It is somewhat unfortunate that the threshold variable must be defined outside the relu() function, where all the rest of the ReLU code resides. To fix this, the following code creates the threshold variable within the relu() function upon the first call, then reuses it in subsequent calls. Now the relu() function does not have to worry about name scopes or variable sharing: it just calls get_variable(), which will create or reuse the threshold variable (it does not need to know which is the case). The rest of the code calls relu() five times, making sure to set reuse=False on the first call, and reuse=True for the other calls.

def relu(X):
    threshold = tf.get_variable("threshold", shape=(),
                                initializer=tf.constant_initializer(0.0))
    [...]
    return tf.maximum(z, threshold, name="max")

X = tf.placeholder(tf.float32, shape=(None, n_features), name="X")
relus = []
for relu_index in range(5):
    with tf.variable_scope("relu", reuse=(relu_index >= 1)) as scope:
        relus.append(relu(X))
output = tf.add_n(relus, name="output")

The resulting graph is slightly different than before, since the shared variable lives within the first ReLU (see Figure 9-9).

Figure 9-9. Five ReLUs sharing the threshold variable

This concludes this introduction to TensorFlow. We will discuss more advanced topics as we go through the following chapters, in particular many operations related to deep neural networks, convolutional neural networks, and recurrent neural networks, as well as how to scale up with TensorFlow using multithreading, queues, multiple GPUs, and multiple servers.


Exercises

1. What are the main benefits of creating a computation graph rather than directly executing the computations? What are the main drawbacks?

2. Is the statement a_val = a.eval(session=sess) equivalent to a_val = sess.run(a)?

3. Is the statement a_val, b_val = a.eval(session=sess), b.eval(session=sess) equivalent to a_val, b_val = sess.run([a, b])?

4. Can you run two graphs in the same session?

5. If you create a graph g containing a variable w, then start two threads and open a session in each thread, both using the same graph g, will each session have its own copy of the variable w or will it be shared?

6. When is a variable initialized? When is it destroyed?

7. What is the difference between a placeholder and a variable?

8. What happens when you run the graph to evaluate an operation that depends on a placeholder but you don't feed its value? What happens if the operation does not depend on the placeholder?

9. When you run a graph, can you feed the output value of any operation, or just the value of placeholders?

10. How can you set a variable to any value you want (during the execution phase)?

11. How many times does reverse-mode autodiff need to traverse the graph in order to compute the gradients of the cost function with regard to 10 variables? What about forward-mode autodiff? And symbolic differentiation?

12. Implement Logistic Regression with Mini-batch Gradient Descent using TensorFlow. Train it and evaluate it on the moons dataset (introduced in Chapter 5). Try adding all the bells and whistles:

- Define the graph within a logistic_regression() function that can be reused easily.
- Save checkpoints using a Saver at regular intervals during training, and save the final model at the end of training.
- Restore the last checkpoint upon startup if training was interrupted.
- Define the graph using nice scopes so the graph looks good in TensorBoard.
- Add summaries to visualize the learning curves in TensorBoard.
- Try tweaking some hyperparameters such as the learning rate or the mini-batch size and look at the shape of the learning curve.


Solutions to these exercises are available in Appendix A.

1. TensorFlow is not limited to neural networks or even Machine Learning; you could run quantum physics simulations if you wanted.

2. Not to be confused with the TFLearn library, which is an independent project.

3. In distributed TensorFlow, variable values are stored on the servers instead of the session, as we will see in Chapter 12.

4. Note that housing.target is a 1D array, but we need to reshape it to a column vector to compute theta. Recall that NumPy's reshape() function accepts –1 (meaning "unspecified") for one of the dimensions: that dimension will be computed based on the array's length and the remaining dimensions.

5. Creating a ReLU class is arguably the cleanest option, but it is rather heavyweight.


Chapter 10. Introduction to Artificial Neural Networks

Birds inspired us to fly, burdock plants inspired Velcro, and nature has inspired many other inventions. It seems only logical, then, to look at the brain's architecture for inspiration on how to build an intelligent machine. This is the key idea that inspired artificial neural networks (ANNs). However, although planes were inspired by birds, they don't have to flap their wings. Similarly, ANNs have gradually become quite different from their biological cousins. Some researchers even argue that we should drop the biological analogy altogether (e.g., by saying "units" rather than "neurons"), lest we restrict our creativity to biologically plausible systems.1

ANNs are at the very core of Deep Learning. They are versatile, powerful, and scalable, making them ideal to tackle large and highly complex Machine Learning tasks, such as classifying billions of images (e.g., Google Images), powering speech recognition services (e.g., Apple's Siri), recommending the best videos to watch to hundreds of millions of users every day (e.g., YouTube), or learning to beat the world champion at the game of Go by examining millions of past games and then playing against itself (DeepMind's AlphaGo).

In this chapter, we will introduce artificial neural networks, starting with a quick tour of the very first ANN architectures. Then we will present Multi-Layer Perceptrons (MLPs) and implement one using TensorFlow to tackle the MNIST digit classification problem (introduced in Chapter 3).


From Biological to Artificial Neurons

Surprisingly, ANNs have been around for quite a while: they were first introduced back in 1943 by the neurophysiologist Warren McCulloch and the mathematician Walter Pitts. In their landmark paper,2 "A Logical Calculus of Ideas Immanent in Nervous Activity," McCulloch and Pitts presented a simplified computational model of how biological neurons might work together in animal brains to perform complex computations using propositional logic. This was the first artificial neural network architecture. Since then many other architectures have been invented, as we will see.

The early successes of ANNs until the 1960s led to the widespread belief that we would soon be conversing with truly intelligent machines. When it became clear that this promise would go unfulfilled (at least for quite a while), funding flew elsewhere and ANNs entered a long dark era. In the early 1980s there was a revival of interest in ANNs as new network architectures were invented and better training techniques were developed. But by the 1990s, powerful alternative Machine Learning techniques such as Support Vector Machines (see Chapter 5) were favored by most researchers, as they seemed to offer better results and stronger theoretical foundations. Finally, we are now witnessing yet another wave of interest in ANNs. Will this wave die out like the previous ones did? There are a few good reasons to believe that this one is different and will have a much more profound impact on our lives:

- There is now a huge quantity of data available to train neural networks, and ANNs frequently outperform other ML techniques on very large and complex problems.

- The tremendous increase in computing power since the 1990s now makes it possible to train large neural networks in a reasonable amount of time. This is in part due to Moore's Law, but also thanks to the gaming industry, which has produced powerful GPU cards by the millions.

- The training algorithms have been improved. To be fair they are only slightly different from the ones used in the 1990s, but these relatively small tweaks have a huge positive impact.

- Some theoretical limitations of ANNs have turned out to be benign in practice. For example, many people thought that ANN training algorithms were doomed because they were likely to get stuck in local optima, but it turns out that this is rather rare in practice (or when it is the case, they are usually fairly close to the global optimum).

- ANNs seem to have entered a virtuous circle of funding and progress. Amazing products based on ANNs regularly make the headline news, which pulls more and more attention and funding toward them, resulting in more and more progress, and even more amazing products.


Biological Neurons

Before we discuss artificial neurons, let's take a quick look at a biological neuron (represented in Figure 10-1). It is an unusual-looking cell mostly found in animal cerebral cortexes (e.g., your brain), composed of a cell body containing the nucleus and most of the cell's complex components, and many branching extensions called dendrites, plus one very long extension called the axon. The axon's length may be just a few times longer than the cell body, or up to tens of thousands of times longer. Near its extremity the axon splits off into many branches called telodendria, and at the tip of these branches are minuscule structures called synaptic terminals (or simply synapses), which are connected to the dendrites (or directly to the cell body) of other neurons. Biological neurons receive short electrical impulses called signals from other neurons via these synapses. When a neuron receives a sufficient number of signals from other neurons within a few milliseconds, it fires its own signals.

Figure 10-1. Biological neuron3

Thus, individual biological neurons seem to behave in a rather simple way, but they are organized in a vast network of billions of neurons, each neuron typically connected to thousands of other neurons. Highly complex computations can be performed by a vast network of fairly simple neurons, much like a complex anthill can emerge from the combined efforts of simple ants. The architecture of biological neural networks (BNN)4 is still the subject of active research, but some parts of the brain have been mapped, and it seems that neurons are often organized in consecutive layers, as shown in Figure 10-2.


Figure 10-2. Multiple layers in a biological neural network (human cortex)5


Logical Computations with Neurons

Warren McCulloch and Walter Pitts proposed a very simple model of the biological neuron, which later became known as an artificial neuron: it has one or more binary (on/off) inputs and one binary output. The artificial neuron simply activates its output when more than a certain number of its inputs are active. McCulloch and Pitts showed that even with such a simplified model it is possible to build a network of artificial neurons that computes any logical proposition you want. For example, let's build a few ANNs that perform various logical computations (see Figure 10-3), assuming that a neuron is activated when at least two of its inputs are active.

Figure 10-3. ANNs performing simple logical computations

The first network on the left is simply the identity function: if neuron A is activated, then neuron C gets activated as well (since it receives two input signals from neuron A), but if neuron A is off, then neuron C is off as well.

The second network performs a logical AND: neuron C is activated only when both neurons A and B are activated (a single input signal is not enough to activate neuron C).

The third network performs a logical OR: neuron C gets activated if either neuron A or neuron B is activated (or both).

Finally, if we suppose that an input connection can inhibit the neuron's activity (which is the case with biological neurons), then the fourth network computes a slightly more complex logical proposition: neuron C is activated only if neuron A is active and if neuron B is off. If neuron A is active all the time, then you get a logical NOT: neuron C is active when neuron B is off, and vice versa.

You can easily imagine how these networks can be combined to compute complex logical expressions (see the exercises at the end of the chapter).
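Here is a minimal Python sketch of these four networks, assuming (as described above) that each unit fires when at least two input signals are active, and reading the connection patterns off the descriptions of Figure 10-3; the function names are ours:

def mp_neuron(input_signals, threshold=2):
    # McCulloch-Pitts unit: fires when at least `threshold` inputs are active
    return 1 if sum(input_signals) >= threshold else 0

def mp_neuron_inhib(excitatory, inhibitory, threshold=2):
    # Variant with inhibitory connections: any active inhibitory input vetoes the unit
    return 0 if sum(inhibitory) > 0 else mp_neuron(excitatory, threshold)

A, B = 1, 0  # try all four (A, B) combinations
c_identity = mp_neuron([A, A])             # C = A (A feeds C through two connections)
c_and      = mp_neuron([A, B])             # C = A AND B
c_or       = mp_neuron([A, A, B, B])       # C = A OR B (each input connects twice)
c_a_not_b  = mp_neuron_inhib([A, A], [B])  # C = A AND (NOT B)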


The Perceptron

The Perceptron is one of the simplest ANN architectures, invented in 1957 by Frank Rosenblatt. It is based on a slightly different artificial neuron (see Figure 10-4) called a linear threshold unit (LTU): the inputs and output are now numbers (instead of binary on/off values) and each input connection is associated with a weight. The LTU computes a weighted sum of its inputs (z = w1 x1 + w2 x2 + ⋯ + wn xn = wᵀ · x), then applies a step function to that sum and outputs the result: hw(x) = step(z) = step(wᵀ · x).

Figure 10-4. Linear threshold unit

The most common step function used in Perceptrons is the Heaviside step function (see Equation 10-1). Sometimes the sign function is used instead.

Equation 10-1. Common step functions used in Perceptrons

heaviside(z) = 0 if z < 0, 1 if z ≥ 0        sgn(z) = –1 if z < 0, 0 if z = 0, +1 if z > 0

A single LTU can be used for simple linear binary classification. It computes a linear combination of the inputs and if the result exceeds a threshold, it outputs the positive class or else outputs the negative class (just like a Logistic Regression classifier or a linear SVM). For example, you could use a single LTU to classify iris flowers based on the petal length and width (also adding an extra bias feature x0 = 1, just like we did in previous chapters). Training an LTU means finding the right values for w0, w1, and w2 (the training algorithm is discussed shortly).

A Perceptron is simply composed of a single layer of LTUs,6 with each neuron connected to all the inputs. These connections are often represented using special passthrough neurons called input neurons: they just output whatever input they are fed. Moreover, an extra bias feature is generally added (x0 = 1). This bias feature is typically represented using a special type of neuron called a bias neuron, which just outputs 1 all the time.


A Perceptron with two inputs and three outputs is represented in Figure 10-5. This Perceptron can classify instances simultaneously into three different binary classes, which makes it a multioutput classifier.

Figure 10-5. Perceptron diagram

So how is a Perceptron trained? The Perceptron training algorithm proposed by Frank Rosenblatt was largely inspired by Hebb's rule. In his book The Organization of Behavior, published in 1949, Donald Hebb suggested that when a biological neuron often triggers another neuron, the connection between these two neurons grows stronger. This idea was later summarized by Siegrid Löwel in this catchy phrase: "Cells that fire together, wire together." This rule later became known as Hebb's rule (or Hebbian learning); that is, the connection weight between two neurons is increased whenever they have the same output. Perceptrons are trained using a variant of this rule that takes into account the error made by the network; it does not reinforce connections that lead to the wrong output. More specifically, the Perceptron is fed one training instance at a time, and for each instance it makes its predictions. For every output neuron that produced a wrong prediction, it reinforces the connection weights from the inputs that would have contributed to the correct prediction. The rule is shown in Equation 10-2.

Equation 10-2. Perceptron learning rule (weight update)

wi,j(next step) = wi,j + η (yj – ŷj) xi

- wi,j is the connection weight between the ith input neuron and the jth output neuron.

- xi is the ith input value of the current training instance.

- ŷj is the output of the jth output neuron for the current training instance.

- yj is the target output of the jth output neuron for the current training instance.

- η is the learning rate.

The decision boundary of each output neuron is linear, so Perceptrons are incapable of learning complex patterns (just like Logistic Regression classifiers). However, if the training instances are linearly separable, Rosenblatt demonstrated that this algorithm would converge to a solution.7 This is called the Perceptron convergence theorem.

Scikit-Learn provides a Perceptron class that implements a single LTU network. It can be used pretty much as you would expect, for example on the iris dataset (introduced in Chapter 4):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron

iris = load_iris()
X = iris.data[:, (2, 3)]  # petal length, petal width
y = (iris.target == 0).astype(np.int)  # Iris Setosa?

per_clf = Perceptron(random_state=42)
per_clf.fit(X, y)

y_pred = per_clf.predict([[2, 0.5]])

You may have recognized that the Perceptron learning algorithm strongly resembles Stochastic Gradient Descent. In fact, Scikit-Learn's Perceptron class is equivalent to using an SGDClassifier with the following hyperparameters: loss="perceptron", learning_rate="constant", eta0=1 (the learning rate), and penalty=None (no regularization).
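In other words (a sketch built directly from the hyperparameters just listed), this should train an equivalent model:

from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(loss="perceptron", learning_rate="constant",
                        eta0=1, penalty=None, random_state=42)
sgd_clf.fit(X, y)  # same X and y as above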

Note that contrary to Logistic Regression classifiers, Perceptrons do not output a class probability; rather, they just make predictions based on a hard threshold. This is one of the good reasons to prefer Logistic Regression over Perceptrons.

In their 1969 monograph titled Perceptrons, Marvin Minsky and Seymour Papert highlighted a number of serious weaknesses of Perceptrons, in particular the fact that they are incapable of solving some trivial problems (e.g., the Exclusive OR (XOR) classification problem; see the left side of Figure 10-6). Of course this is true of any other linear classification model as well (such as Logistic Regression classifiers), but researchers had expected much more from Perceptrons, and their disappointment was great: as a result, many researchers dropped connectionism altogether (i.e., the study of neural networks) in favor of higher-level problems such as logic, problem solving, and search.

However, it turns out that some of the limitations of Perceptrons can be eliminated by stacking multiple Perceptrons. The resulting ANN is called a Multi-Layer Perceptron (MLP). In particular, an MLP can solve the XOR problem, as you can verify by computing the output of the MLP represented on the right of Figure 10-6, for each combination of inputs: with inputs (0, 0) or (1, 1) the network outputs 0, and with inputs (0, 1) or (1, 0) it outputs 1.
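For intuition, here is one standard weight assignment that solves XOR with step-activated units (our own choice of weights; the exact weights in Figure 10-6 may differ), verified with a quick sketch:

def step(z):
    return 1 if z >= 0 else 0

def xor_mlp(x1, x2):
    h_or  = step(x1 + x2 - 0.5)      # hidden unit computing x1 OR x2
    h_and = step(x1 + x2 - 1.5)      # hidden unit computing x1 AND x2
    return step(h_or - h_and - 0.5)  # output: OR AND (NOT AND), i.e., XOR

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), "->", xor_mlp(x1, x2))  # prints 0, 1, 1, 0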


Figure 10-6. XOR classification problem and an MLP that solves it


Multi-Layer Perceptron and Backpropagation

An MLP is composed of one (passthrough) input layer, one or more layers of LTUs, called hidden layers, and one final layer of LTUs called the output layer (see Figure 10-7). Every layer except the output layer includes a bias neuron and is fully connected to the next layer. When an ANN has two or more hidden layers, it is called a deep neural network (DNN).

Figure 10-7. Multi-Layer Perceptron

For many years researchers struggled to find a way to train MLPs, without success. But in 1986, D. E. Rumelhart et al. published a groundbreaking article8 introducing the backpropagation training algorithm.9 Today we would describe it as Gradient Descent using reverse-mode autodiff (Gradient Descent was introduced in Chapter 4, and autodiff was discussed in Chapter 9).

For each training instance, the algorithm feeds it to the network and computes the output of every neuron in each consecutive layer (this is the forward pass, just like when making predictions). Then it measures the network's output error (i.e., the difference between the desired output and the actual output of the network), and it computes how much each neuron in the last hidden layer contributed to each output neuron's error. It then proceeds to measure how much of these error contributions came from each neuron in the previous hidden layer, and so on until the algorithm reaches the input layer. This reverse pass efficiently measures the error gradient across all the connection weights in the network by propagating the error gradient backward in the network (hence the name of the algorithm). If you check out the reverse-mode autodiff algorithm in Appendix D, you will find that the forward and reverse passes of backpropagation simply perform reverse-mode autodiff. The last step of the backpropagation algorithm is a Gradient Descent step on all the connection weights in the network, using the error gradients measured earlier.


Let’smakethisevenshorter:foreachtraininginstancethebackpropagationalgorithmfirstmakesaprediction(forwardpass),measurestheerror,thengoesthrougheachlayerinreversetomeasuretheerrorcontributionfromeachconnection(reversepass),andfinallyslightlytweakstheconnectionweightstoreducetheerror(GradientDescentstep).

Inorderforthisalgorithmtoworkproperly,theauthorsmadeakeychangetotheMLP’sarchitecture:theyreplacedthestepfunctionwiththelogisticfunction,σ(z)=1/(1+exp(–z)).Thiswasessentialbecausethestepfunctioncontainsonlyflatsegments,sothereisnogradienttoworkwith(GradientDescentcannotmoveonaflatsurface),whilethelogisticfunctionhasawell-definednonzeroderivativeeverywhere,allowingGradientDescenttomakesomeprogressateverystep.Thebackpropagationalgorithmmaybeusedwithotheractivationfunctions,insteadofthelogisticfunction.Twootherpopularactivationfunctionsare:

- The hyperbolic tangent function, tanh(z) = 2σ(2z) – 1. Just like the logistic function it is S-shaped, continuous, and differentiable, but its output value ranges from –1 to 1 (instead of 0 to 1 in the case of the logistic function), which tends to make each layer's output more or less normalized (i.e., centered around 0) at the beginning of training. This often helps speed up convergence.

- The ReLU function (introduced in Chapter 9), ReLU(z) = max(0, z). It is continuous but unfortunately not differentiable at z = 0 (the slope changes abruptly, which can make Gradient Descent bounce around). However, in practice it works very well and has the advantage of being fast to compute. Most importantly, the fact that it does not have a maximum output value also helps reduce some issues during Gradient Descent (we will come back to this in Chapter 11).

These popular activation functions and their derivatives are represented in Figure 10-8.

Figure 10-8. Activation functions and their derivatives

An MLP is often used for classification, with each output corresponding to a different binary class (e.g., spam/ham, urgent/not-urgent, and so on). When the classes are exclusive (e.g., classes 0 through 9 for digit image classification), the output layer is typically modified by replacing the individual activation functions by a shared softmax function (see Figure 10-9). The softmax function was introduced in Chapter 3. The output of each neuron corresponds to the estimated probability of the corresponding class. Note that the signal flows only in one direction (from the inputs to the outputs), so this architecture is an example of a feedforward neural network (FNN).


Figure 10-9. A modern MLP (including ReLU and softmax) for classification

NOTE: Biological neurons seem to implement a roughly sigmoid (S-shaped) activation function, so researchers stuck to sigmoid functions for a very long time. But it turns out that the ReLU activation function generally works better in ANNs. This is one of the cases where the biological analogy was misleading.


Training an MLP with TensorFlow's High-Level API

The simplest way to train an MLP with TensorFlow is to use the high-level API TF.Learn, which offers a Scikit-Learn-compatible API. The DNNClassifier class makes it fairly easy to train a deep neural network with any number of hidden layers, and a softmax output layer to output estimated class probabilities. For example, the following code trains a DNN for classification with two hidden layers (one with 300 neurons, and the other with 100 neurons) and a softmax output layer with 10 neurons:

import tensorflow as tf

feature_cols = tf.contrib.learn.infer_real_valued_columns_from_input(X_train)
dnn_clf = tf.contrib.learn.DNNClassifier(hidden_units=[300, 100], n_classes=10,
                                         feature_columns=feature_cols)
dnn_clf = tf.contrib.learn.SKCompat(dnn_clf)  # if TensorFlow >= 1.1
dnn_clf.fit(X_train, y_train, batch_size=50, steps=40000)

The code first creates a set of real-valued columns from the training set (other types of columns, such as categorical columns, are available). Then we create the DNNClassifier, and we wrap it in a Scikit-Learn compatibility helper. Finally, we run 40,000 training iterations using batches of 50 instances.

If you run this code on the MNIST dataset (after scaling it, e.g., by using Scikit-Learn's StandardScaler), you will actually get a model that achieves around 98.2% accuracy on the test set! That's better than the best model we trained in Chapter 3:

>>> from sklearn.metrics import accuracy_score
>>> y_pred = dnn_clf.predict(X_test)
>>> accuracy_score(y_test, y_pred['classes'])
0.98250000000000004

WARNING: The tensorflow.contrib package contains many useful functions, but it is a place for experimental code that has not yet graduated to be part of the core TensorFlow API. So the DNNClassifier class (and any other contrib code) may change without notice in the future.

Under the hood, the DNNClassifier class creates all the neuron layers, based on the ReLU activation function (we can change this by setting the activation_fn hyperparameter). The output layer relies on the softmax function, and the cost function is cross entropy (introduced in Chapter 4).
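For example, a quick sketch of swapping in tanh activations (assuming the activation_fn hyperparameter accepts any TensorFlow activation function, the way it accepts tf.nn.relu):

dnn_clf = tf.contrib.learn.DNNClassifier(hidden_units=[300, 100], n_classes=10,
                                         feature_columns=feature_cols,
                                         activation_fn=tf.nn.tanh)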


Training a DNN Using Plain TensorFlow

If you want more control over the architecture of the network, you may prefer to use TensorFlow's lower-level Python API (introduced in Chapter 9). In this section we will build the same model as before using this API, and we will implement Mini-batch Gradient Descent to train it on the MNIST dataset. The first step is the construction phase, building the TensorFlow graph. The second step is the execution phase, where you actually run the graph to train the model.


ConstructionPhaseLet’sstart.Firstweneedtoimportthetensorflowlibrary.Thenwemustspecifythenumberofinputsandoutputs,andsetthenumberofhiddenneuronsineachlayer:

import tensorflow as tf
import numpy as np  # needed by the neuron_layer() function below

n_inputs = 28 * 28  # MNIST
n_hidden1 = 300
n_hidden2 = 100
n_outputs = 10

Next, just like you did in Chapter 9, you can use placeholder nodes to represent the training data and targets. The shape of X is only partially defined. We know that it will be a 2D tensor (i.e., a matrix), with instances along the first dimension and features along the second dimension, and we know that the number of features is going to be 28 × 28 (one feature per pixel), but we don't know yet how many instances each training batch will contain. So the shape of X is (None, n_inputs). Similarly, we know that y will be a 1D tensor with one entry per instance, but again we don't know the size of the training batch at this point, so the shape is (None).

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
y = tf.placeholder(tf.int64, shape=(None), name="y")

Nowlet’screatetheactualneuralnetwork.TheplaceholderXwillactastheinputlayer;duringtheexecutionphase,itwillbereplacedwithonetrainingbatchatatime(notethatalltheinstancesinatrainingbatchwillbeprocessedsimultaneouslybytheneuralnetwork).Nowyouneedtocreatethetwohiddenlayersandtheoutputlayer.Thetwohiddenlayersarealmostidentical:theydifferonlybytheinputstheyareconnectedtoandbythenumberofneuronstheycontain.Theoutputlayerisalsoverysimilar,butitusesasoftmaxactivationfunctioninsteadofaReLUactivationfunction.Solet’screateaneuron_layer()functionthatwewillusetocreateonelayeratatime.Itwillneedparameterstospecifytheinputs,thenumberofneurons,theactivationfunction,andthenameofthelayer:

def neuron_layer(X, n_neurons, name, activation=None):
    with tf.name_scope(name):
        n_inputs = int(X.get_shape()[1])
        stddev = 2 / np.sqrt(n_inputs)
        init = tf.truncated_normal((n_inputs, n_neurons), stddev=stddev)
        W = tf.Variable(init, name="kernel")
        b = tf.Variable(tf.zeros([n_neurons]), name="bias")
        Z = tf.matmul(X, W) + b
        if activation is not None:
            return activation(Z)
        else:
            return Z

Let’sgothroughthiscodelinebyline:1. Firstwecreateanamescopeusingthenameofthelayer:itwillcontainallthecomputationnodes

forthisneuronlayer.Thisisoptional,butthegraphwilllookmuchnicerinTensorBoardifitsnodesarewellorganized.


2. Next, we get the number of inputs by looking up the input matrix's shape and getting the size of the second dimension (the first dimension is for instances).

3. The next three lines create a W variable that will hold the weights matrix (often called the layer's kernel). It will be a 2D tensor containing all the connection weights between each input and each neuron; hence, its shape will be (n_inputs, n_neurons). It will be initialized randomly, using a truncated10 normal (Gaussian) distribution with a standard deviation of 2/√n_inputs. Using this specific standard deviation helps the algorithm converge much faster (we will discuss this further in Chapter 11; it is one of those small tweaks to neural networks that have had a tremendous impact on their efficiency). It is important to initialize connection weights randomly for all hidden layers to avoid any symmetries that the Gradient Descent algorithm would be unable to break.11

4. The next line creates a b variable for biases, initialized to 0 (no symmetry issue in this case), with one bias parameter per neuron.

5. Then we create a subgraph to compute Z = X · W + b. This vectorized implementation will efficiently compute the weighted sums of the inputs plus the bias term for each and every neuron in the layer, for all the instances in the batch in just one shot.

6. Finally, if an activation parameter is provided, such as tf.nn.relu (i.e., max(0, Z)), then the code returns activation(Z), or else it just returns Z.

Okay, so now you have a nice function to create a neuron layer. Let's use it to create the deep neural network! The first hidden layer takes X as its input. The second takes the output of the first hidden layer as its input. And finally, the output layer takes the output of the second hidden layer as its input.

with tf.name_scope("dnn"):
    hidden1 = neuron_layer(X, n_hidden1, name="hidden1",
                           activation=tf.nn.relu)
    hidden2 = neuron_layer(hidden1, n_hidden2, name="hidden2",
                           activation=tf.nn.relu)
    logits = neuron_layer(hidden2, n_outputs, name="outputs")

Notice that once again we used a name scope for clarity. Also note that logits is the output of the neural network before going through the softmax activation function: for optimization reasons, we will handle the softmax computation later.

As you might expect, TensorFlow comes with many handy functions to create standard neural network layers, so there's often no need to define your own neuron_layer() function like we just did. For example, TensorFlow's tf.layers.dense() function (previously called tf.contrib.layers.fully_connected()) creates a fully connected layer, where all the inputs are connected to all the neurons in the layer. It takes care of creating the weights and biases variables, named kernel and bias respectively, using the appropriate initialization strategy, and you can set the activation function using the activation argument. As we will see in Chapter 11, it also supports regularization parameters. Let's tweak the preceding code to use the dense() function instead of our neuron_layer() function. Simply replace the dnn construction section with the following code:

with tf.name_scope("dnn"):
    hidden1 = tf.layers.dense(X, n_hidden1, name="hidden1",
                              activation=tf.nn.relu)
    hidden2 = tf.layers.dense(hidden1, n_hidden2, name="hidden2",
                              activation=tf.nn.relu)
    logits = tf.layers.dense(hidden2, n_outputs, name="outputs")

Now that we have the neural network model ready to go, we need to define the cost function that we will use to train it. Just as we did for Softmax Regression in Chapter 4, we will use cross entropy. As we discussed earlier, cross entropy will penalize models that estimate a low probability for the target class. TensorFlow provides several functions to compute cross entropy. We will use sparse_softmax_cross_entropy_with_logits(): it computes the cross entropy based on the "logits" (i.e., the output of the network before going through the softmax activation function), and it expects labels in the form of integers ranging from 0 to the number of classes minus 1 (in our case, from 0 to 9). This will give us a 1D tensor containing the cross entropy for each instance. We can then use TensorFlow's reduce_mean() function to compute the mean cross entropy over all instances.

with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y,
                                                              logits=logits)
    loss = tf.reduce_mean(xentropy, name="loss")

NOTE: The sparse_softmax_cross_entropy_with_logits() function is equivalent to applying the softmax activation function and then computing the cross entropy, but it is more efficient, and it properly takes care of corner cases like logits equal to 0. This is why we did not apply the softmax activation function earlier. There is also another function called softmax_cross_entropy_with_logits(), which takes labels in the form of one-hot vectors (instead of ints from 0 to the number of classes minus 1).

We have the neural network model, we have the cost function, and now we need to define a GradientDescentOptimizer that will tweak the model parameters to minimize the cost function. Nothing new; it's just like we did in Chapter 9:

learning_rate = 0.01

with tf.name_scope("train"):
    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    training_op = optimizer.minimize(loss)

The last important step in the construction phase is to specify how to evaluate the model. We will simply use accuracy as our performance measure. First, for each instance, determine if the neural network's prediction is correct by checking whether or not the highest logit corresponds to the target class. For this you can use the in_top_k() function. This returns a 1D tensor full of boolean values, so we need to cast these booleans to floats and then compute the average. This will give us the network's overall accuracy.

with tf.name_scope("eval"):
    correct = tf.nn.in_top_k(logits, y, 1)
    accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

And, as usual, we need to create a node to initialize all variables, and we will also create a Saver to save our trained model parameters to disk:

init = tf.global_variables_initializer()
saver = tf.train.Saver()

Phew! This concludes the construction phase. This was fewer than 40 lines of code, but it was pretty intense: we created placeholders for the inputs and the targets, we created a function to build a neuron layer, we used it to create the DNN, we defined the cost function, we created an optimizer, and finally we defined the performance measure. Now on to the execution phase.


Execution Phase

This part is much shorter and simpler. First, let's load MNIST. We could use Scikit-Learn for that as we did in previous chapters, but TensorFlow offers its own helper that fetches the data, scales it (between 0 and 1), shuffles it, and provides a simple function to load one mini-batch at a time. So let's use it instead:

from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets("/tmp/data/")

Now we define the number of epochs that we want to run, as well as the size of the mini-batches:

n_epochs = 40
batch_size = 50

And now we can train the model:

with tf.Session() as sess:
    init.run()

    for epoch in range(n_epochs):
        for iteration in range(mnist.train.num_examples // batch_size):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        acc_train = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
        acc_test = accuracy.eval(feed_dict={X: mnist.test.images,
                                            y: mnist.test.labels})
        print(epoch, "Train accuracy:", acc_train, "Test accuracy:", acc_test)

    save_path = saver.save(sess, "./my_model_final.ckpt")

This code opens a TensorFlow session, and it runs the init node that initializes all the variables. Then it runs the main training loop: at each epoch, the code iterates through a number of mini-batches that corresponds to the training set size. Each mini-batch is fetched via the next_batch() method, and then the code simply runs the training operation, feeding it the current mini-batch input data and targets. Next, at the end of each epoch, the code evaluates the model on the last mini-batch and on the full test set, and it prints out the result. Finally, the model parameters are saved to disk.


Using the Neural Network

Now that the neural network is trained, you can use it to make predictions. To do that, you can reuse the same construction phase, but change the execution phase like this:

with tf.Session() as sess:
    saver.restore(sess, "./my_model_final.ckpt")
    X_new_scaled = [...]  # some new images (scaled from 0 to 1)
    Z = logits.eval(feed_dict={X: X_new_scaled})
    y_pred = np.argmax(Z, axis=1)

First the code loads the model parameters from disk. Then it loads some new images that you want to classify. Remember to apply the same feature scaling as for the training data (in this case, scale it from 0 to 1). Then the code evaluates the logits node. If you wanted to know all the estimated class probabilities, you would need to apply the softmax() function to the logits, but if you just want to predict a class, you can simply pick the class that has the highest logit value (using the argmax() function does the trick).
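For example, a minimal sketch of getting the estimated class probabilities (the y_proba name is ours; tf.nn.softmax() is TensorFlow's softmax op):

y_proba = tf.nn.softmax(logits)  # add this node at the end of the construction phase

with tf.Session() as sess:
    saver.restore(sess, "./my_model_final.ckpt")
    # one row of 10 probabilities per image
    proba_val = y_proba.eval(feed_dict={X: X_new_scaled})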


Fine-Tuning Neural Network Hyperparameters

The flexibility of neural networks is also one of their main drawbacks: there are many hyperparameters to tweak. Not only can you use any imaginable network topology (how neurons are interconnected), but even in a simple MLP you can change the number of layers, the number of neurons per layer, the type of activation function to use in each layer, the weight initialization logic, and much more. How do you know what combination of hyperparameters is the best for your task?

Of course, you can use grid search with cross-validation to find the right hyperparameters, like you did in previous chapters, but since there are many hyperparameters to tune, and since training a neural network on a large dataset takes a lot of time, you will only be able to explore a tiny part of the hyperparameter space in a reasonable amount of time. It is much better to use randomized search, as we discussed in Chapter 2. Another option is to use a tool such as Oscar, which implements more complex algorithms to help you find a good set of hyperparameters quickly.

It helps to have an idea of what values are reasonable for each hyperparameter, so you can restrict the search space. Let's start with the number of hidden layers.


Number of Hidden Layers

For many problems, you can just begin with a single hidden layer and you will get reasonable results. It has actually been shown that an MLP with just one hidden layer can model even the most complex functions provided it has enough neurons. For a long time, these facts convinced researchers that there was no need to investigate any deeper neural networks. But they overlooked the fact that deep networks have a much higher parameter efficiency than shallow ones: they can model complex functions using exponentially fewer neurons than shallow nets, making them much faster to train.

To understand why, suppose you are asked to draw a forest using some drawing software, but you are forbidden to use copy/paste. You would have to draw each tree individually, branch per branch, leaf per leaf. If you could instead draw one leaf, copy/paste it to draw a branch, then copy/paste that branch to create a tree, and finally copy/paste this tree to make a forest, you would be finished in no time. Real-world data is often structured in such a hierarchical way and DNNs automatically take advantage of this fact: lower hidden layers model low-level structures (e.g., line segments of various shapes and orientations), intermediate hidden layers combine these low-level structures to model intermediate-level structures (e.g., squares, circles), and the highest hidden layers and the output layer combine these intermediate structures to model high-level structures (e.g., faces).

Not only does this hierarchical architecture help DNNs converge faster to a good solution, it also improves their ability to generalize to new datasets. For example, if you have already trained a model to recognize faces in pictures, and you now want to train a new neural network to recognize hairstyles, then you can kickstart training by reusing the lower layers of the first network. Instead of randomly initializing the weights and biases of the first few layers of the new neural network, you can initialize them to the value of the weights and biases of the lower layers of the first network. This way the network will not have to learn from scratch all the low-level structures that occur in most pictures; it will only have to learn the higher-level structures (e.g., hairstyles).

In summary, for many problems you can start with just one or two hidden layers and it will work just fine (e.g., you can easily reach above 97% accuracy on the MNIST dataset using just one hidden layer with a few hundred neurons, and above 98% accuracy using two hidden layers with the same total amount of neurons, in roughly the same amount of training time). For more complex problems, you can gradually ramp up the number of hidden layers, until you start overfitting the training set. Very complex tasks, such as large image classification or speech recognition, typically require networks with dozens of layers (or even hundreds, but not fully connected ones, as we will see in Chapter 13), and they need a huge amount of training data. However, you will rarely have to train such networks from scratch: it is much more common to reuse parts of a pretrained state-of-the-art network that performs a similar task. Training will be a lot faster and require much less data (we will discuss this in Chapter 11).


Number of Neurons per Hidden Layer

Obviously the number of neurons in the input and output layers is determined by the type of input and output your task requires. For example, the MNIST task requires 28 x 28 = 784 input neurons and 10 output neurons. As for the hidden layers, a common practice is to size them to form a funnel, with fewer and fewer neurons at each layer—the rationale being that many low-level features can coalesce into far fewer high-level features. For example, a typical neural network for MNIST may have two hidden layers, the first with 300 neurons and the second with 100. However, this practice is not as common now, and you may simply use the same size for all hidden layers—for example, all hidden layers with 150 neurons: that's just one hyperparameter to tune instead of one per layer. Just like for the number of layers, you can try increasing the number of neurons gradually until the network starts overfitting. In general you will get more bang for the buck by increasing the number of layers than the number of neurons per layer. Unfortunately, as you can see, finding the perfect amount of neurons is still somewhat of a black art.

A simpler approach is to pick a model with more layers and neurons than you actually need, then use early stopping to prevent it from overfitting (and other regularization techniques, especially dropout, as we will see in Chapter 11). This has been dubbed the "stretch pants" approach:12 instead of wasting time looking for pants that perfectly match your size, just use large stretch pants that will shrink down to the right size.


Activation Functions

In most cases you can use the ReLU activation function in the hidden layers (or one of its variants, as we will see in Chapter 11). It is a bit faster to compute than other activation functions, and Gradient Descent does not get stuck as much on plateaus, thanks to the fact that it does not saturate for large input values (as opposed to the logistic function or the hyperbolic tangent function, which saturate at 1).

For the output layer, the softmax activation function is generally a good choice for classification tasks (when the classes are mutually exclusive). For regression tasks, you can simply use no activation function at all.
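For example, a minimal sketch of the two kinds of output layers (hidden and n_classes stand for whatever your network provides; tf.layers.dense() applies no activation function unless you pass one):

# classification: one logit per class; softmax turns logits into probabilities
logits = tf.layers.dense(hidden, n_classes, name="logits")
y_proba = tf.nn.softmax(logits)

# regression: a single output neuron with no activation function
y_pred = tf.layers.dense(hidden, 1, name="y_pred")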

This concludes this introduction to artificial neural networks. In the following chapters, we will discuss techniques to train very deep nets, and distribute training across multiple servers and GPUs. Then we will explore a few other popular neural network architectures: convolutional neural networks, recurrent neural networks, and autoencoders.13


Exercises

1. Draw an ANN using the original artificial neurons (like the ones in Figure 10-3) that computes A ⊕ B (where ⊕ represents the XOR operation). Hint: A ⊕ B = (A ∧ ¬B) ∨ (¬A ∧ B).

2. Why is it generally preferable to use a Logistic Regression classifier rather than a classical Perceptron (i.e., a single layer of linear threshold units trained using the Perceptron training algorithm)? How can you tweak a Perceptron to make it equivalent to a Logistic Regression classifier?

3. Why was the logistic activation function a key ingredient in training the first MLPs?

4. Name three popular activation functions. Can you draw them?

5. Suppose you have an MLP composed of one input layer with 10 passthrough neurons, followed by one hidden layer with 50 artificial neurons, and finally one output layer with 3 artificial neurons. All artificial neurons use the ReLU activation function.

What is the shape of the input matrix X?

What about the shape of the hidden layer's weight vector Wh, and the shape of its bias vector bh?

What is the shape of the output layer's weight vector Wo, and its bias vector bo?

What is the shape of the network's output matrix Y?

Write the equation that computes the network's output matrix Y as a function of X, Wh, bh, Wo and bo.

6. How many neurons do you need in the output layer if you want to classify email into spam or ham? What activation function should you use in the output layer? If instead you want to tackle MNIST, how many neurons do you need in the output layer, using what activation function? Answer the same questions for getting your network to predict housing prices as in Chapter 2.

7. What is backpropagation and how does it work? What is the difference between backpropagation and reverse-mode autodiff?

8. Can you list all the hyperparameters you can tweak in an MLP? If the MLP overfits the training data, how could you tweak these hyperparameters to try to solve the problem?

9. Train a deep MLP on the MNIST dataset and see if you can get over 98% precision. Just like in the last exercise of Chapter 9, try adding all the bells and whistles (i.e., save checkpoints, restore the last checkpoint in case of an interruption, add summaries, plot learning curves using TensorBoard, and so on).

Solutions to these exercises are available in Appendix A.


1. You can get the best of both worlds by being open to biological inspirations without being afraid to create biologically unrealistic models, as long as they work well.

2. "A Logical Calculus of Ideas Immanent in Nervous Activity," W. McCulloch and W. Pitts (1943).

3. Image by Bruce Blaus (Creative Commons 3.0). Reproduced from https://en.wikipedia.org/wiki/Neuron.

4. In the context of Machine Learning, the phrase "neural networks" generally refers to ANNs, not BNNs.

5. Drawing of a cortical lamination by S. Ramon y Cajal (public domain). Reproduced from https://en.wikipedia.org/wiki/Cerebral_cortex.

6. The name Perceptron is sometimes used to mean a tiny network with a single LTU.

7. Note that this solution is generally not unique: in general when the data are linearly separable, there is an infinity of hyperplanes that can separate them.

8. "Learning Internal Representations by Error Propagation," D. Rumelhart, G. Hinton, R. Williams (1986).

9. This algorithm was actually invented several times by various researchers in different fields, starting with P. Werbos in 1974.

10. Using a truncated normal distribution rather than a regular normal distribution ensures that there won't be any large weights, which could slow down training.

11. For example, if you set all the weights to 0, then all neurons will output 0, and the error gradient will be the same for all neurons in a given hidden layer. The Gradient Descent step will then update all the weights in exactly the same way in each layer, so they will all remain equal. In other words, despite having hundreds of neurons per layer, your model will act as if there were only one neuron per layer. It is not going to fly.

12. By Vincent Vanhoucke in his Deep Learning class on Udacity.com.

13. A few extra ANN architectures are presented in Appendix E.


Chapter 11. Training Deep Neural Nets

In Chapter 10 we introduced artificial neural networks and trained our first deep neural network. But it was a very shallow DNN, with only two hidden layers. What if you need to tackle a very complex problem, such as detecting hundreds of types of objects in high-resolution images? You may need to train a much deeper DNN, perhaps with (say) 10 layers, each containing hundreds of neurons, connected by hundreds of thousands of connections. This would not be a walk in the park:

First, you would be faced with the tricky vanishing gradients problem (or the related exploding gradients problem) that affects deep neural networks and makes lower layers very hard to train.

Second, with such a large network, training would be extremely slow.

Third, a model with millions of parameters would severely risk overfitting the training set.

In this chapter, we will go through each of these problems in turn and present techniques to solve them. We will start by explaining the vanishing gradients problem and exploring some of the most popular solutions to this problem. Next we will look at various optimizers that can speed up training large models tremendously compared to plain Gradient Descent. Finally, we will go through a few popular regularization techniques for large neural networks.

With these tools, you will be able to train very deep nets: welcome to Deep Learning!


Vanishing/Exploding Gradients Problems

As we discussed in Chapter 10, the backpropagation algorithm works by going from the output layer to the input layer, propagating the error gradient on the way. Once the algorithm has computed the gradient of the cost function with regard to each parameter in the network, it uses these gradients to update each parameter with a Gradient Descent step.

Unfortunately, gradients often get smaller and smaller as the algorithm progresses down to the lower layers. As a result, the Gradient Descent update leaves the lower layer connection weights virtually unchanged, and training never converges to a good solution. This is called the vanishing gradients problem. In some cases, the opposite can happen: the gradients can grow bigger and bigger, so many layers get insanely large weight updates and the algorithm diverges. This is the exploding gradients problem, which is mostly encountered in recurrent neural networks (see Chapter 14). More generally, deep neural networks suffer from unstable gradients; different layers may learn at widely different speeds.

Although this unfortunate behavior has been empirically observed for quite a while (it was one of the reasons why deep neural networks were mostly abandoned for a long time), it is only around 2010 that significant progress was made in understanding it. A paper titled "Understanding the Difficulty of Training Deep Feedforward Neural Networks" by Xavier Glorot and Yoshua Bengio1 found a few suspects, including the combination of the popular logistic sigmoid activation function and the weight initialization technique that was most popular at the time, namely random initialization using a normal distribution with a mean of 0 and a standard deviation of 1. In short, they showed that with this activation function and this initialization scheme, the variance of the outputs of each layer is much greater than the variance of its inputs. Going forward in the network, the variance keeps increasing after each layer until the activation function saturates at the top layers. This is actually made worse by the fact that the logistic function has a mean of 0.5, not 0 (the hyperbolic tangent function has a mean of 0 and behaves slightly better than the logistic function in deep networks).

Looking at the logistic activation function (see Figure 11-1), you can see that when inputs become large (negative or positive), the function saturates at 0 or 1, with a derivative extremely close to 0. Thus when backpropagation kicks in, it has virtually no gradient to propagate back through the network, and what little gradient exists keeps getting diluted as backpropagation progresses down through the top layers, so there is really nothing left for the lower layers.


Figure 11-1. Logistic activation function saturation


Xavier and He Initialization

In their paper, Glorot and Bengio propose a way to significantly alleviate this problem. We need the signal to flow properly in both directions: in the forward direction when making predictions, and in the reverse direction when backpropagating gradients. We don't want the signal to die out, nor do we want it to explode and saturate. For the signal to flow properly, the authors argue that we need the variance of the outputs of each layer to be equal to the variance of its inputs,2 and we also need the gradients to have equal variance before and after flowing through a layer in the reverse direction (please check out the paper if you are interested in the mathematical details). It is actually not possible to guarantee both unless the layer has an equal number of input and output connections, but they proposed a good compromise that has proven to work very well in practice: the connection weights must be initialized randomly as described in Equation 11-1, where n_inputs and n_outputs are the number of input and output connections for the layer whose weights are being initialized (also called fan-in and fan-out). This initialization strategy is often called Xavier initialization (after the author's first name), or sometimes Glorot initialization.

Equation 11-1. Xavier initialization (when using the logistic activation function)

Normal distribution with mean 0 and standard deviation σ = √(2 / (n_inputs + n_outputs))

Or a uniform distribution between −r and +r, with r = √(6 / (n_inputs + n_outputs))

When the number of input connections is roughly equal to the number of output connections, you get simpler equations (e.g., σ = 1/√n_inputs or r = √3/√n_inputs). We used this simplified strategy in Chapter 10.3

Using the Xavier initialization strategy can speed up training considerably, and it is one of the tricks that led to the current success of Deep Learning. Some recent papers4 have provided similar strategies for different activation functions, as shown in Table 11-1. The initialization strategy for the ReLU activation function (and its variants, including the ELU activation described shortly) is sometimes called He initialization (after the last name of its author).

Table 11-1. Initialization parameters for each type of activation function

Activation function       Uniform distribution [−r, r]                 Normal distribution
Logistic                  r = √(6 / (n_inputs + n_outputs))            σ = √(2 / (n_inputs + n_outputs))
Hyperbolic tangent        r = 4 √(6 / (n_inputs + n_outputs))          σ = 4 √(2 / (n_inputs + n_outputs))
ReLU (and its variants)   r = √2 √(6 / (n_inputs + n_outputs))         σ = √2 √(2 / (n_inputs + n_outputs))

By default, the tf.layers.dense() function (introduced in Chapter 10) uses Xavier initialization (with a uniform distribution). You can change this to He initialization by using the variance_scaling_initializer() function like this:

he_init = tf.contrib.layers.variance_scaling_initializer()
hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu,
                          kernel_initializer=he_init, name="hidden1")

NOTE
He initialization considers only the fan-in, not the average between fan-in and fan-out like in Xavier initialization. This is also the default for the variance_scaling_initializer() function, but you can change this by setting the argument mode="FAN_AVG".
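For example, a minimal sketch of that variant, reusing the code above ("FAN_AVG" makes the initializer average fan-in and fan-out):

he_avg_init = tf.contrib.layers.variance_scaling_initializer(mode="FAN_AVG")
hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu,
                          kernel_initializer=he_avg_init, name="hidden1")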


Nonsaturating Activation Functions

One of the insights in the 2010 paper by Glorot and Bengio was that the vanishing/exploding gradients problems were in part due to a poor choice of activation function. Until then most people had assumed that if Mother Nature had chosen to use roughly sigmoid activation functions in biological neurons, they must be an excellent choice. But it turns out that other activation functions behave much better in deep neural networks, in particular the ReLU activation function, mostly because it does not saturate for positive values (and also because it is quite fast to compute).

Unfortunately, the ReLU activation function is not perfect. It suffers from a problem known as the dying ReLUs: during training, some neurons effectively die, meaning they stop outputting anything other than 0. In some cases, you may find that half of your network's neurons are dead, especially if you used a large learning rate. During training, if a neuron's weights get updated such that the weighted sum of the neuron's inputs is negative, it will start outputting 0. When this happens, the neuron is unlikely to come back to life since the gradient of the ReLU function is 0 when its input is negative.

To solve this problem, you may want to use a variant of the ReLU function, such as the leaky ReLU. This function is defined as LeakyReLUα(z) = max(αz, z) (see Figure 11-2). The hyperparameter α defines how much the function "leaks": it is the slope of the function for z < 0, and is typically set to 0.01. This small slope ensures that leaky ReLUs never die; they can go into a long coma, but they have a chance to eventually wake up. A recent paper5 compared several variants of the ReLU activation function and one of its conclusions was that the leaky variants always outperformed the strict ReLU activation function. In fact, setting α = 0.2 (huge leak) seemed to result in better performance than α = 0.01 (small leak). They also evaluated the randomized leaky ReLU (RReLU), where α is picked randomly in a given range during training, and it is fixed to an average value during testing. It also performed fairly well and seemed to act as a regularizer (reducing the risk of overfitting the training set). Finally, they also evaluated the parametric leaky ReLU (PReLU), where α is authorized to be learned during training (instead of being a hyperparameter, it becomes a parameter that can be modified by backpropagation like any other parameter). This was reported to strongly outperform ReLU on large image datasets, but on smaller datasets it runs the risk of overfitting the training set.
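If you want to experiment with a PReLU, here is a minimal sketch of the channel-shared variant (this helper is not part of TensorFlow; give each layer a distinct name so that it gets its own trainable α):

def parametric_relu(z, name="prelu"):
    with tf.variable_scope(name):
        # alpha is a trainable parameter, updated by backpropagation
        alpha = tf.get_variable("alpha", shape=(), dtype=tf.float32,
                                initializer=tf.constant_initializer(0.01))
    return tf.maximum(alpha * z, z)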


Figure 11-2. Leaky ReLU

Last but not least, a 2015 paper by Djork-Arné Clevert et al.6 proposed a new activation function called the exponential linear unit (ELU) that outperformed all the ReLU variants in their experiments: training time was reduced and the neural network performed better on the test set. It is represented in Figure 11-3, and Equation 11-2 shows its definition.

Equation 11-2. ELU activation function

ELUα(z) = α(exp(z) − 1) if z < 0
ELUα(z) = z             if z ≥ 0


Figure 11-3. ELU activation function

It looks a lot like the ReLU function, with a few major differences:

First, it takes on negative values when z < 0, which allows the unit to have an average output closer to 0. This helps alleviate the vanishing gradients problem, as discussed earlier. The hyperparameter α defines the value that the ELU function approaches when z is a large negative number. It is usually set to 1, but you can tweak it like any other hyperparameter if you want.

Second, it has a nonzero gradient for z < 0, which avoids the dying units issue.

Third, the function is smooth everywhere, including around z = 0, which helps speed up Gradient Descent, since it does not bounce as much left and right of z = 0.

The main drawback of the ELU activation function is that it is slower to compute than the ReLU and its variants (due to the use of the exponential function), but during training this is compensated by the faster convergence rate. However, at test time an ELU network will be slower than a ReLU network.

TIP
So which activation function should you use for the hidden layers of your deep neural networks? Although your mileage will vary, in general ELU > leaky ReLU (and its variants) > ReLU > tanh > logistic. If you care a lot about runtime performance, then you may prefer leaky ReLUs over ELUs. If you don't want to tweak yet another hyperparameter, you may just use the default α values suggested earlier (0.01 for the leaky ReLU, and 1 for ELU). If you have spare time and computing power, you can use cross-validation to evaluate other activation functions, in particular RReLU if your network is overfitting, or PReLU if you have a huge training set.

TensorFlow offers an elu() function that you can use to build your neural network. Simply set the activation argument when calling the dense() function, like this:


hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.elu, name="hidden1")

TensorFlow does not have a predefined function for leaky ReLUs, but it is easy enough to define:

def leaky_relu(z, name=None):
    return tf.maximum(0.01 * z, z, name=name)

hidden1 = tf.layers.dense(X, n_hidden1, activation=leaky_relu, name="hidden1")


Batch Normalization

Although using He initialization along with ELU (or any variant of ReLU) can significantly reduce the vanishing/exploding gradients problems at the beginning of training, it doesn't guarantee that they won't come back during training.

In a 2015 paper,7 Sergey Ioffe and Christian Szegedy proposed a technique called Batch Normalization (BN) to address the vanishing/exploding gradients problems, and more generally the problem that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change (which they call the Internal Covariate Shift problem).

The technique consists of adding an operation in the model just before the activation function of each layer, simply zero-centering and normalizing the inputs, then scaling and shifting the result using two new parameters per layer (one for scaling, the other for shifting). In other words, this operation lets the model learn the optimal scale and mean of the inputs for each layer.

In order to zero-center and normalize the inputs, the algorithm needs to estimate the inputs' mean and standard deviation. It does so by evaluating the mean and standard deviation of the inputs over the current mini-batch (hence the name "Batch Normalization"). The whole operation is summarized in Equation 11-3.

Equation 11-3. Batch Normalization algorithm

1. μB = (1 / mB) Σ x(i)            (sum over the instances in the mini-batch B)
2. σB² = (1 / mB) Σ (x(i) − μB)²
3. x̂(i) = (x(i) − μB) / √(σB² + ϵ)
4. z(i) = γ x̂(i) + β

μB is the empirical mean, evaluated over the whole mini-batch B.

σB is the empirical standard deviation, also evaluated over the whole mini-batch.

mB is the number of instances in the mini-batch.

x̂(i) is the zero-centered and normalized input.

γ is the scaling parameter for the layer.

β is the shifting parameter (offset) for the layer.

ϵ is a tiny number to avoid division by zero (typically 10⁻⁵). This is called a smoothing term.

z(i) is the output of the BN operation: it is a scaled and shifted version of the inputs.

At test time, there is no mini-batch to compute the empirical mean and standard deviation, so instead you simply use the whole training set's mean and standard deviation. These are typically efficiently computed during training using a moving average. So, in total, four parameters are learned for each batch-normalized layer: γ (scale), β (offset), μ (mean), and σ (standard deviation).

The authors demonstrated that this technique considerably improved all the deep neural networks they experimented with. The vanishing gradients problem was strongly reduced, to the point that they could use saturating activation functions such as the tanh and even the logistic activation function. The networks were also much less sensitive to the weight initialization. They were able to use much larger learning rates, significantly speeding up the learning process. Specifically, they note that "Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. [...] Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters." Finally, like a gift that keeps on giving, Batch Normalization also acts like a regularizer, reducing the need for other regularization techniques (such as dropout, described later in the chapter).

Batch Normalization does, however, add some complexity to the model (although it removes the need for normalizing the input data since the first hidden layer will take care of that, provided it is batch-normalized). Moreover, there is a runtime penalty: the neural network makes slower predictions due to the extra computations required at each layer. So if you need predictions to be lightning-fast, you may want to check how well plain ELU + He initialization perform before playing with Batch Normalization.

NOTE
You may find that training is rather slow at first while Gradient Descent is searching for the optimal scales and offsets for each layer, but it accelerates once it has found reasonably good values.

Implementing Batch Normalization with TensorFlow

TensorFlow provides a tf.nn.batch_normalization() function that simply centers and normalizes the inputs, but you must compute the mean and standard deviation yourself (based on the mini-batch data during training or on the full dataset during testing, as just discussed) and pass them as parameters to this function, and you must also handle the creation of the scaling and offset parameters (and pass them to this function). It is doable, but not the most convenient approach. Instead, you should use the tf.layers.batch_normalization() function, which handles all this for you, as in the following code:

import tensorflow as tf

n_inputs = 28 * 28
n_hidden1 = 300
n_hidden2 = 100
n_outputs = 10

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
training = tf.placeholder_with_default(False, shape=(), name='training')

hidden1 = tf.layers.dense(X, n_hidden1, name="hidden1")
bn1 = tf.layers.batch_normalization(hidden1, training=training, momentum=0.9)
bn1_act = tf.nn.elu(bn1)

hidden2 = tf.layers.dense(bn1_act, n_hidden2, name="hidden2")
bn2 = tf.layers.batch_normalization(hidden2, training=training, momentum=0.9)
bn2_act = tf.nn.elu(bn2)

logits_before_bn = tf.layers.dense(bn2_act, n_outputs, name="outputs")
logits = tf.layers.batch_normalization(logits_before_bn, training=training,
                                       momentum=0.9)

Let's walk through this code. The first lines are fairly self-explanatory, until we define the training placeholder: we will set it to True during training, but otherwise it will default to False. This will be used to tell the tf.layers.batch_normalization() function whether it should use the current mini-batch's mean and standard deviation (during training) or the whole training set's mean and standard deviation (during testing).

Then, we alternate fully connected layers and batch normalization layers: the fully connected layers are created using the tf.layers.dense() function, just like we did in Chapter 10. Note that we don't specify any activation function for the fully connected layers because we want to apply the activation function after each batch normalization layer.8 We create the batch normalization layers using the tf.layers.batch_normalization() function, setting its training and momentum parameters. The BN algorithm uses exponential decay to compute the running averages, which is why it requires the momentum parameter: given a new value v, the running average v̂ is updated through the equation:

v̂ ← v̂ × momentum + v × (1 − momentum)

A good momentum value is typically close to 1—for example, 0.9, 0.99, or 0.999 (you want more 9s for larger datasets and smaller mini-batches).

You may have noticed that the code is quite repetitive, with the same batch normalization parameters appearing over and over again. To avoid this repetition, you can use the partial() function from the functools module (part of Python's standard library). It creates a thin wrapper around a function and allows you to define default values for some parameters. The creation of the network layers in the preceding code can be modified like so:

from functools import partial

my_batch_norm_layer = partial(tf.layers.batch_normalization,
                              training=training, momentum=0.9)

hidden1 = tf.layers.dense(X, n_hidden1, name="hidden1")
bn1 = my_batch_norm_layer(hidden1)
bn1_act = tf.nn.elu(bn1)
hidden2 = tf.layers.dense(bn1_act, n_hidden2, name="hidden2")
bn2 = my_batch_norm_layer(hidden2)
bn2_act = tf.nn.elu(bn2)
logits_before_bn = tf.layers.dense(bn2_act, n_outputs, name="outputs")
logits = my_batch_norm_layer(logits_before_bn)

It may not look much better than before in this small example, but if you have 10 layers and want to use the same activation function, initializer, regularizer, and so on, in all layers, this trick will make your code much more readable.

The rest of the construction phase is the same as in Chapter 10: define the cost function, create an optimizer, tell it to minimize the cost function, define the evaluation operations, create a Saver, and so on.

The execution phase is also pretty much the same, with two exceptions. First, during training, whenever you run an operation that depends on the batch_normalization() layer, you need to set the training placeholder to True. Second, the batch_normalization() function creates a few operations that must be evaluated at each step during training in order to update the moving averages (recall that these moving averages are needed to evaluate the training set's mean and standard deviation). These operations are automatically added to the UPDATE_OPS collection, so all we need to do is get the list of operations in that collection and run them at each training iteration:

extra_update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(mnist.train.num_examples // batch_size):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run([training_op, extra_update_ops],
                     feed_dict={training: True, X: X_batch, y: y_batch})
        accuracy_val = accuracy.eval(feed_dict={X: mnist.test.images,
                                                y: mnist.test.labels})
        print(epoch, "Test accuracy:", accuracy_val)
    save_path = saver.save(sess, "./my_model_final.ckpt")

That's all! In this tiny example with just two layers, it's unlikely that Batch Normalization will have a very positive impact, but for deeper networks it can make a tremendous difference.


Gradient Clipping

A popular technique to lessen the exploding gradients problem is to simply clip the gradients during backpropagation so that they never exceed some threshold (this is mostly useful for recurrent neural networks; see Chapter 14). This is called Gradient Clipping.9 In general people now prefer Batch Normalization, but it's still useful to know about Gradient Clipping and how to implement it.

In TensorFlow, the optimizer's minimize() function takes care of both computing the gradients and applying them, so you must instead call the optimizer's compute_gradients() method first, then create an operation to clip the gradients using the clip_by_value() function, and finally create an operation to apply the clipped gradients using the optimizer's apply_gradients() method:

threshold = 1.0

optimizer = tf.train.GradientDescentOptimizer(learning_rate)
grads_and_vars = optimizer.compute_gradients(loss)
capped_gvs = [(tf.clip_by_value(grad, -threshold, threshold), var)
              for grad, var in grads_and_vars]
training_op = optimizer.apply_gradients(capped_gvs)

You would then run this training_op at every training step, as usual. It will compute the gradients, clip them between –1.0 and 1.0, and apply them. The threshold is a hyperparameter you can tune.
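Note that clipping each gradient value independently can change the direction of the overall gradient vector. If you would rather preserve that direction, one option (not shown in the book's example above) is to clip by the global norm instead; a minimal sketch, reusing the same optimizer and threshold:

gradients, variables = zip(*optimizer.compute_gradients(loss))
clipped_gradients, global_norm = tf.clip_by_global_norm(gradients,
                                                        clip_norm=threshold)
training_op = optimizer.apply_gradients(zip(clipped_gradients, variables))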


Reusing Pretrained Layers

It is generally not a good idea to train a very large DNN from scratch: instead, you should always try to find an existing neural network that accomplishes a similar task to the one you are trying to tackle, then just reuse the lower layers of this network: this is called transfer learning. It will not only speed up training considerably, but will also require much less training data.

For example, suppose that you have access to a DNN that was trained to classify pictures into 100 different categories, including animals, plants, vehicles, and everyday objects. You now want to train a DNN to classify specific types of vehicles. These tasks are very similar, so you should try to reuse parts of the first network (see Figure 11-4).

Figure 11-4. Reusing pretrained layers

NOTE
If the input pictures of your new task don't have the same size as the ones used in the original task, you will have to add a preprocessing step to resize them to the size expected by the original model. More generally, transfer learning will only work well if the inputs have similar low-level features.


Reusing a TensorFlow Model

If the original model was trained using TensorFlow, you can simply restore it and train it on the new task. As we discussed in Chapter 9, you can use the import_meta_graph() function to import the operations into the default graph. This returns a Saver that you can later use to load the model's state:

saver = tf.train.import_meta_graph("./my_model_final.ckpt.meta")

You must then get a handle on the operations and tensors you will need for training. For this, you can use the graph's get_operation_by_name() and get_tensor_by_name() methods. The name of a tensor is the name of the operation that outputs it followed by :0 (or :1 if it is the second output, :2 if it is the third, and so on):

X = tf.get_default_graph().get_tensor_by_name("X:0")
y = tf.get_default_graph().get_tensor_by_name("y:0")
accuracy = tf.get_default_graph().get_tensor_by_name("eval/accuracy:0")
training_op = tf.get_default_graph().get_operation_by_name("GradientDescent")

If the pretrained model is not well documented, then you will have to explore the graph to find the names of the operations you will need. In this case, you can either explore the graph using TensorBoard (for this you must first export the graph using a FileWriter, as discussed in Chapter 9), or you can use the graph's get_operations() method to list all the operations:

for op in tf.get_default_graph().get_operations():
    print(op.name)

If you are the author of the original model, you could make things easier for people who will reuse your model by giving operations very clear names and documenting them. Another approach is to create a collection containing all the important operations that people will want to get a handle on:

for op in (X, y, accuracy, training_op):
    tf.add_to_collection("my_important_ops", op)

This way people who reuse your model will be able to simply write:

X, y, accuracy, training_op = tf.get_collection("my_important_ops")

You can then restore the model's state using the Saver and continue training using your own data:

with tf.Session() as sess:
    saver.restore(sess, "./my_model_final.ckpt")
    [...]  # train the model on your own data

Alternatively, if you have access to the Python code that built the original graph, you can use it instead of import_meta_graph().

In general, you will want to reuse only part of the original model, typically the lower layers. If you use import_meta_graph() to restore the graph, it will load the entire original graph, but nothing prevents you from just ignoring the layers you do not care about. For example, as shown in Figure 11-4, you could build new layers (e.g., one hidden layer and one output layer) on top of a pretrained layer (e.g., pretrained hidden layer 3). You would also need to compute the loss for this new output, and create an optimizer to minimize that loss.

If you have access to the pretrained graph's Python code, you can just reuse the parts you need and chop out the rest. However, in this case you need a Saver to restore the pretrained model (specifying which variables you want to restore; otherwise, TensorFlow will complain that the graphs do not match), and another Saver to save the new model. For example, the following code restores only hidden layers 1, 2, and 3:

[...]  # build the new model with the same hidden layers 1-3 as before

reuse_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES,
                               scope="hidden[123]")  # regular expression
reuse_vars_dict = dict([(var.op.name, var) for var in reuse_vars])
restore_saver = tf.train.Saver(reuse_vars_dict)  # to restore layers 1-3

init = tf.global_variables_initializer()  # to init all variables, old and new
saver = tf.train.Saver()  # to save the new model

with tf.Session() as sess:
    init.run()
    restore_saver.restore(sess, "./my_model_final.ckpt")
    [...]  # train the model
    save_path = saver.save(sess, "./my_new_model_final.ckpt")

First we build the new model, making sure to copy the original model's hidden layers 1 to 3. Then we get the list of all variables in hidden layers 1 to 3, using the regular expression "hidden[123]". Next, we create a dictionary that maps the name of each variable in the original model to its name in the new model (generally you want to keep the exact same names). Then we create a Saver that will restore only these variables. We also create an operation to initialize all the variables (old and new) and a second Saver to save the entire new model, not just layers 1 to 3. We then start a session and initialize all variables in the model, then restore the variable values from the original model's layers 1 to 3. Finally, we train the model on the new task and save it.

TIP
The more similar the tasks are, the more layers you want to reuse (starting with the lower layers). For very similar tasks, you can try keeping all the hidden layers and just replace the output layer.


Reusing Models from Other Frameworks

If the model was trained using another framework, you will need to load the model parameters manually (e.g., using Theano code if it was trained with Theano), then assign them to the appropriate variables. This can be quite tedious. For example, the following code shows how you would copy the weights and biases from the first hidden layer of a model trained using another framework:

original_w = [...]  # Load the weights from the other framework
original_b = [...]  # Load the biases from the other framework

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name="X")
hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu, name="hidden1")
[...]  # Build the rest of the model

# Get a handle on the assignment nodes for the hidden1 variables
graph = tf.get_default_graph()
assign_kernel = graph.get_operation_by_name("hidden1/kernel/Assign")
assign_bias = graph.get_operation_by_name("hidden1/bias/Assign")
init_kernel = assign_kernel.inputs[1]
init_bias = assign_bias.inputs[1]

init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init, feed_dict={init_kernel: original_w, init_bias: original_b})
    # [...] Train the model on your new task

In this implementation, we first load the pretrained model using the other framework (not shown here), and we extract from it the model parameters we want to reuse. Next, we build our TensorFlow model as usual. Then comes the tricky part: every TensorFlow variable has an associated assignment operation that is used to initialize it. We start by getting a handle on these assignment operations (they have the same name as the variable, plus "/Assign"). We also get a handle on each assignment operation's second input: in the case of an assignment operation, the second input corresponds to the value that will be assigned to the variable, so in this case it is the variable's initialization value. Once we start the session, we run the usual initialization operation, but this time we feed it the values we want for the variables we want to reuse. Alternatively, we could have created new assignment operations and placeholders, and used them to set the values of the variables after initialization. But why create new nodes in the graph when everything we need is already there?


Freezing the Lower Layers

It is likely that the lower layers of the first DNN have learned to detect low-level features in pictures that will be useful across both image classification tasks, so you can just reuse these layers as they are. It is generally a good idea to "freeze" their weights when training the new DNN: if the lower-layer weights are fixed, then the higher-layer weights will be easier to train (because they won't have to learn a moving target). To freeze the lower layers during training, one solution is to give the optimizer the list of variables to train, excluding the variables from the lower layers:

train_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES,
                               scope="hidden[34]|outputs")
training_op = optimizer.minimize(loss, var_list=train_vars)

The first line gets the list of all trainable variables in hidden layers 3 and 4 and in the output layer. This leaves out the variables in hidden layers 1 and 2. Next we provide this restricted list of trainable variables to the optimizer's minimize() function. Ta-da! Layers 1 and 2 are now frozen: they will not budge during training (these are often called frozen layers).

Another option is to add a stop_gradient() layer in the graph. Any layer below it will be frozen:

with tf.name_scope("dnn"):
    hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu,
                              name="hidden1")  # reused, frozen
    hidden2 = tf.layers.dense(hidden1, n_hidden2, activation=tf.nn.relu,
                              name="hidden2")  # reused, frozen
    hidden2_stop = tf.stop_gradient(hidden2)
    hidden3 = tf.layers.dense(hidden2_stop, n_hidden3, activation=tf.nn.relu,
                              name="hidden3")  # reused, not frozen
    hidden4 = tf.layers.dense(hidden3, n_hidden4, activation=tf.nn.relu,
                              name="hidden4")  # new!
    logits = tf.layers.dense(hidden4, n_outputs, name="outputs")  # new!


Caching the Frozen Layers

Since the frozen layers won't change, it is possible to cache the output of the topmost frozen layer for each training instance. Since training goes through the whole dataset many times, this will give you a huge speed boost as you will only need to go through the frozen layers once per training instance (instead of once per epoch). For example, you could first run the whole training set through the lower layers (assuming you have enough RAM), then during training, instead of building batches of training instances, you would build batches of outputs from hidden layer 2 and feed them to the training operation:

import numpy as np

n_batches = mnist.train.num_examples // batch_size

with tf.Session() as sess:
    init.run()
    restore_saver.restore(sess, "./my_model_final.ckpt")

    h2_cache = sess.run(hidden2, feed_dict={X: mnist.train.images})

    for epoch in range(n_epochs):
        shuffled_idx = np.random.permutation(mnist.train.num_examples)
        hidden2_batches = np.array_split(h2_cache[shuffled_idx], n_batches)
        y_batches = np.array_split(mnist.train.labels[shuffled_idx], n_batches)
        for hidden2_batch, y_batch in zip(hidden2_batches, y_batches):
            sess.run(training_op, feed_dict={hidden2: hidden2_batch, y: y_batch})

    save_path = saver.save(sess, "./my_new_model_final.ckpt")

The last line of the training loop runs the training operation defined earlier (which does not touch layers 1 and 2), and feeds it a batch of outputs from the second hidden layer (as well as the targets for that batch). Since we give TensorFlow the output of hidden layer 2, it does not try to evaluate it (or any node it depends on).


Tweaking, Dropping, or Replacing the Upper Layers

The output layer of the original model should usually be replaced since it is most likely not useful at all for the new task, and it may not even have the right number of outputs for the new task.

Similarly, the upper hidden layers of the original model are less likely to be as useful as the lower layers, since the high-level features that are most useful for the new task may differ significantly from the ones that were most useful for the original task. You want to find the right number of layers to reuse.

Try freezing all the copied layers first, then train your model and see how it performs. Then try unfreezing one or two of the top hidden layers to let backpropagation tweak them and see if performance improves. The more training data you have, the more layers you can unfreeze.

If you still cannot get good performance, and you have little training data, try dropping the top hidden layer(s) and freeze all remaining hidden layers again. You can iterate until you find the right number of layers to reuse. If you have plenty of training data, you may try replacing the top hidden layers instead of dropping them, and even add more hidden layers.


Model Zoos

Where can you find a neural network trained for a task similar to the one you want to tackle? The first place to look is obviously in your own catalog of models. This is one good reason to save all your models and organize them so you can retrieve them later easily. Another option is to search in a model zoo. Many people train Machine Learning models for various tasks and kindly release their pretrained models to the public.

TensorFlow has its own model zoo available at https://github.com/tensorflow/models. In particular, it contains most of the state-of-the-art image classification nets such as VGG, Inception, and ResNet (see Chapter 13, and check out the models/slim directory), including the code, the pretrained models, and tools to download popular image datasets.

Another popular model zoo is Caffe's Model Zoo. It also contains many computer vision models (e.g., LeNet, AlexNet, ZFNet, GoogLeNet, VGGNet, inception) trained on various datasets (e.g., ImageNet, Places Database, CIFAR10, etc.). Saumitro Dasgupta wrote a converter, which is available at https://github.com/ethereon/caffe-tensorflow.


Unsupervised Pretraining

Suppose you want to tackle a complex task for which you don't have much labeled training data, but unfortunately you cannot find a model trained on a similar task. Don't lose all hope! First, you should of course try to gather more labeled training data, but if this is too hard or too expensive, you may still be able to perform unsupervised pretraining (see Figure 11-5). That is, if you have plenty of unlabeled training data, you can try to train the layers one by one, starting with the lowest layer and then going up, using an unsupervised feature detector algorithm such as Restricted Boltzmann Machines (RBMs; see Appendix E) or autoencoders (see Chapter 15). Each layer is trained on the output of the previously trained layers (all layers except the one being trained are frozen). Once all layers have been trained this way, you can fine-tune the network using supervised learning (i.e., with backpropagation).

This is a rather long and tedious process, but it often works well; in fact, it is this technique that Geoffrey Hinton and his team used in 2006 and which led to the revival of neural networks and the success of Deep Learning. Until 2010, unsupervised pretraining (typically using RBMs) was the norm for deep nets, and it was only after the vanishing gradients problem was alleviated that it became much more common to train DNNs purely using backpropagation. However, unsupervised pretraining (today typically using autoencoders rather than RBMs) is still a good option when you have a complex task to solve, no similar model you can reuse, and little labeled training data but plenty of unlabeled training data.10

Figure 11-5. Unsupervised pretraining


Pretraining on an Auxiliary Task

One last option is to train a first neural network on an auxiliary task for which you can easily obtain or generate labeled training data, then reuse the lower layers of that network for your actual task. The first neural network's lower layers will learn feature detectors that will likely be reusable by the second neural network.

For example, if you want to build a system to recognize faces, you may only have a few pictures of each individual—clearly not enough to train a good classifier. Gathering hundreds of pictures of each person would not be practical. However, you could gather a lot of pictures of random people on the internet and train a first neural network to detect whether or not two different pictures feature the same person. Such a network would learn good feature detectors for faces, so reusing its lower layers would allow you to train a good face classifier using little training data.

It is often rather cheap to gather unlabeled training examples, but quite expensive to label them. In this situation, a common technique is to label all your training examples as "good," then generate many new training instances by corrupting the good ones, and label these corrupted instances as "bad." Then you can train a first neural network to classify instances as good or bad. For example, you could download millions of sentences, label them as "good," then randomly change a word in each sentence and label the resulting sentences as "bad." If a neural network can tell that "The dog sleeps" is a good sentence but "The dog they" is bad, it probably knows quite a lot about language. Reusing its lower layers will likely help in many language processing tasks.
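A minimal sketch of generating such "bad" examples (plain Python; the tiny vocabulary here is just a stand-in for a real word list):

import random

def corrupt(sentence, vocabulary):
    # replace one randomly chosen word with a random word from the vocabulary
    words = sentence.split()
    i = random.randrange(len(words))
    words[i] = random.choice(vocabulary)
    return " ".join(words)

good_sentence = "The dog sleeps"
bad_sentence = corrupt(good_sentence, vocabulary=["they", "blue", "quickly"])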

Another approach is to train a first network to output a score for each training instance, and use a cost function that ensures that a good instance's score is greater than a bad instance's score by at least some margin. This is called max margin learning.
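For example, such a max margin cost could look like this minimal sketch (good_scores and bad_scores are hypothetical tensors holding the network's scores for paired good and bad instances):

margin = 1.0
# the loss is zero only once each good score beats its bad score by the margin
loss = tf.reduce_mean(tf.maximum(0.0, margin - good_scores + bad_scores))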


Faster Optimizers

Training a very large deep neural network can be painfully slow. So far we have seen four ways to speed up training (and reach a better solution): applying a good initialization strategy for the connection weights, using a good activation function, using Batch Normalization, and reusing parts of a pretrained network. Another huge speed boost comes from using a faster optimizer than the regular Gradient Descent optimizer. In this section we will present the most popular ones: Momentum optimization, Nesterov Accelerated Gradient, AdaGrad, RMSProp, and finally Adam optimization.


Momentum Optimization

Imagine a bowling ball rolling down a gentle slope on a smooth surface: it will start out slowly, but it will quickly pick up momentum until it eventually reaches terminal velocity (if there is some friction or air resistance). This is the very simple idea behind Momentum optimization, proposed by Boris Polyak in 1964.11 In contrast, regular Gradient Descent will simply take small regular steps down the slope, so it will take much more time to reach the bottom.

Recall that Gradient Descent simply updates the weights θ by directly subtracting the gradient of the cost function J(θ) with regard to the weights (∇θJ(θ)) multiplied by the learning rate η. The equation is: θ ← θ − η∇θJ(θ). It does not care about what the earlier gradients were. If the local gradient is tiny, it goes very slowly.

Momentum optimization cares a great deal about what previous gradients were: at each iteration, it subtracts the local gradient from the momentum vector m (multiplied by the learning rate η), and it updates the weights by simply adding this momentum vector (see Equation 11-4). In other words, the gradient is used as an acceleration, not as a speed. To simulate some sort of friction mechanism and prevent the momentum from growing too large, the algorithm introduces a new hyperparameter β, simply called the momentum, which must be set between 0 (high friction) and 1 (no friction). A typical momentum value is 0.9.

Equation 11-4. Momentum algorithm

1. m ← βm − η∇θJ(θ)
2. θ ← θ + m

You can easily verify that if the gradient remains constant, the terminal velocity (i.e., the maximum size of the weight updates) is equal to that gradient multiplied by the learning rate η multiplied by 1/(1 − β) (ignoring the sign). For example, if β = 0.9, then the terminal velocity is equal to 10 times the gradient times the learning rate, so Momentum optimization ends up going 10 times faster than Gradient Descent! This allows Momentum optimization to escape from plateaus much faster than Gradient Descent. In particular, we saw in Chapter 4 that when the inputs have very different scales the cost function will look like an elongated bowl (see Figure 4-7). Gradient Descent goes down the steep slope quite fast, but then it takes a very long time to go down the valley. In contrast, Momentum optimization will roll down the bottom of the valley faster and faster until it reaches the bottom (the optimum). In deep neural networks that don't use Batch Normalization, the upper layers will often end up having inputs with very different scales, so using Momentum optimization helps a lot. It can also help roll past local optima.


NOTE
Due to the momentum, the optimizer may overshoot a bit, then come back, overshoot again, and oscillate like this many times before stabilizing at the minimum. This is one of the reasons why it is good to have a bit of friction in the system: it gets rid of these oscillations and thus speeds up convergence.

Implementing Momentum optimization in TensorFlow is a no-brainer: just replace the GradientDescentOptimizer with the MomentumOptimizer, then lie back and profit!

optimizer = tf.train.MomentumOptimizer(learning_rate=learning_rate,
                                       momentum=0.9)

The one drawback of Momentum optimization is that it adds yet another hyperparameter to tune. However, the momentum value of 0.9 usually works well in practice and almost always goes faster than Gradient Descent.


Nesterov Accelerated Gradient

One small variant to Momentum optimization, proposed by Yurii Nesterov in 1983,12 is almost always faster than vanilla Momentum optimization. The idea of Nesterov Momentum optimization, or Nesterov Accelerated Gradient (NAG), is to measure the gradient of the cost function not at the local position but slightly ahead in the direction of the momentum (see Equation 11-5). The only difference from vanilla Momentum optimization is that the gradient is measured at θ + βm rather than at θ.

Equation 11-5. Nesterov Accelerated Gradient algorithm

1. m ← βm − η∇θJ(θ + βm)
2. θ ← θ + m

This small tweak works because in general the momentum vector will be pointing in the right direction (i.e., toward the optimum), so it will be slightly more accurate to use the gradient measured a bit farther in that direction rather than using the gradient at the original position, as you can see in Figure 11-6 (where ∇1 represents the gradient of the cost function measured at the starting point θ, and ∇2 represents the gradient at the point located at θ + βm). As you can see, the Nesterov update ends up slightly closer to the optimum. After a while, these small improvements add up and NAG ends up being significantly faster than regular Momentum optimization. Moreover, note that when the momentum pushes the weights across a valley, ∇1 continues to push further across the valley, while ∇2 pushes back toward the bottom of the valley. This helps reduce oscillations and thus converges faster.

NAG will almost always speed up training compared to regular Momentum optimization. To use it, simply set use_nesterov=True when creating the MomentumOptimizer:

optimizer = tf.train.MomentumOptimizer(learning_rate=learning_rate,
                                       momentum=0.9, use_nesterov=True)


Figure 11-6. Regular versus Nesterov Momentum optimization


AdaGrad

Consider the elongated bowl problem again: Gradient Descent starts by quickly going down the steepest slope, then slowly goes down the bottom of the valley. It would be nice if the algorithm could detect this early on and correct its direction to point a bit more toward the global optimum.

The AdaGrad algorithm13 achieves this by scaling down the gradient vector along the steepest dimensions (see Equation 11-6):

Equation 11-6. AdaGrad algorithm

1. s ← s + ∇θJ(θ) ⊗ ∇θJ(θ)
2. θ ← θ − η ∇θJ(θ) ⊘ √(s + ϵ)

The first step accumulates the square of the gradients into the vector s (the ⊗ symbol represents the element-wise multiplication). This vectorized form is equivalent to computing si ← si + (∂J(θ)/∂θi)² for each element si of the vector s; in other words, each si accumulates the squares of the partial derivative of the cost function with regard to parameter θi. If the cost function is steep along the ith dimension, then si will get larger and larger at each iteration.

The second step is almost identical to Gradient Descent, but with one big difference: the gradient vector is scaled down by a factor of √(s + ϵ) (the ⊘ symbol represents the element-wise division, and ϵ is a smoothing term to avoid division by zero, typically set to 10⁻¹⁰). This vectorized form is equivalent to computing θi ← θi − η (∂J(θ)/∂θi) / √(si + ϵ) for all parameters θi (simultaneously).

In short, this algorithm decays the learning rate, but it does so faster for steep dimensions than for dimensions with gentler slopes. This is called an adaptive learning rate. It helps point the resulting updates more directly toward the global optimum (see Figure 11-7). One additional benefit is that it requires much less tuning of the learning rate hyperparameter η.


Figure 11-7. AdaGrad versus Gradient Descent

AdaGrad often performs well for simple quadratic problems, but unfortunately it often stops too early when training neural networks. The learning rate gets scaled down so much that the algorithm ends up stopping entirely before reaching the global optimum. So even though TensorFlow has an AdagradOptimizer, you should not use it to train deep neural networks (it may be efficient for simpler tasks such as Linear Regression, though).


RMSProp

Although AdaGrad slows down a bit too fast and ends up never converging to the global optimum, the RMSProp algorithm14 fixes this by accumulating only the gradients from the most recent iterations (as opposed to all the gradients since the beginning of training). It does so by using exponential decay in the first step (see Equation 11-7).

Equation 11-7. RMSProp algorithm

1. s ← βs + (1 − β) ∇θJ(θ) ⊗ ∇θJ(θ)
2. θ ← θ − η ∇θJ(θ) ⊘ √(s + ϵ)

The decay rate β is typically set to 0.9. Yes, it is once again a new hyperparameter, but this default value often works well, so you may not need to tune it at all.

As you might expect, TensorFlow has an RMSPropOptimizer class:

optimizer = tf.train.RMSPropOptimizer(learning_rate=learning_rate,
                                      momentum=0.9, decay=0.9, epsilon=1e-10)

Except on very simple problems, this optimizer almost always performs much better than AdaGrad. It also generally converges faster than Momentum optimization and Nesterov Accelerated Gradients. In fact, it was the preferred optimization algorithm of many researchers until Adam optimization came around.


AdamOptimizationAdam,15whichstandsforadaptivemomentestimation,combinestheideasofMomentumoptimizationandRMSProp:justlikeMomentumoptimizationitkeepstrackofanexponentiallydecayingaverageofpastgradients,andjustlikeRMSPropitkeepstrackofanexponentiallydecayingaverageofpastsquaredgradients(seeEquation11-8).16

Equation 11-8. Adam algorithm

1. m ← β₁m + (1 − β₁) ∇θJ(θ)
2. s ← β₂s + (1 − β₂) ∇θJ(θ) ⊗ ∇θJ(θ)
3. m̂ ← m / (1 − β₁^T)
4. ŝ ← s / (1 − β₂^T)
5. θ ← θ − η m̂ ⊘ √(ŝ + ϵ)

T represents the iteration number (starting at 1).

If you just look at steps 1, 2, and 5, you will notice Adam's close similarity to both Momentum optimization and RMSProp. The only difference is that step 1 computes an exponentially decaying average rather than an exponentially decaying sum, but these are actually equivalent except for a constant factor (the decaying average is just 1 − β₁ times the decaying sum). Steps 3 and 4 are somewhat of a technical detail: since m and s are initialized at 0, they will be biased toward 0 at the beginning of training, so these two steps will help boost m and s at the beginning of training.

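Here is a minimal NumPy sketch of these five steps, again assuming a hypothetical grad_J function and illustrative values; it is only meant to make the bias correction concrete:

import numpy as np

eta, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8
theta = np.array([1.0, 1.0])   # illustrative starting point
m = np.zeros_like(theta)       # first moment (mean) estimate
s = np.zeros_like(theta)       # second moment (variance) estimate

for T in range(1, 1001):       # T starts at 1
    g = grad_J(theta)          # hypothetical: returns dJ/dtheta
    m = beta1 * m + (1 - beta1) * g              # step 1: decaying average
    s = beta2 * s + (1 - beta2) * g * g          # step 2: decaying squared avg
    m_hat = m / (1 - beta1 ** T)                 # step 3: bias correction
    s_hat = s / (1 - beta2 ** T)                 # step 4: bias correction
    theta = theta - eta * m_hat / np.sqrt(s_hat + eps)   # step 5: update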
The momentum decay hyperparameter β₁ is typically initialized to 0.9, while the scaling decay hyperparameter β₂ is often initialized to 0.999. As earlier, the smoothing term ϵ is usually initialized to a tiny number such as 10⁻⁸. These are the default values for TensorFlow's AdamOptimizer class, so you can simply use:

optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)

In fact, since Adam is an adaptive learning rate algorithm (like AdaGrad and RMSProp), it requires less tuning of the learning rate hyperparameter η. You can often use the default value η = 0.001, making Adam even easier to use than Gradient Descent.


WARNING
This book initially recommended using Adam optimization, because it was generally considered faster and better than other methods. However, a 2017 paper17 by Ashia C. Wilson et al. showed that adaptive optimization methods (i.e., AdaGrad, RMSProp, and Adam optimization) can lead to solutions that generalize poorly on some datasets. So you may want to stick to Momentum optimization or Nesterov Accelerated Gradient for now, until researchers have a better understanding of this issue.

All the optimization techniques discussed so far only rely on the first-order partial derivatives (Jacobians). The optimization literature contains amazing algorithms based on the second-order partial derivatives (the Hessians). Unfortunately, these algorithms are very hard to apply to deep neural networks because there are n² Hessians per output (where n is the number of parameters), as opposed to just n Jacobians per output. Since DNNs typically have tens of thousands of parameters, the second-order optimization algorithms often don't even fit in memory, and even when they do, computing the Hessians is just too slow.

TRAINING SPARSE MODELS

All the optimization algorithms just presented produce dense models, meaning that most parameters will be nonzero. If you need a blazingly fast model at runtime, or if you need it to take up less memory, you may prefer to end up with a sparse model instead.

One trivial way to achieve this is to train the model as usual, then get rid of the tiny weights (set them to 0).

Another option is to apply strong ℓ1 regularization during training, as it pushes the optimizer to zero out as many weights as it can (as discussed in Chapter 4 about Lasso Regression).

However, in some cases these techniques may remain insufficient. One last option is to apply Dual Averaging, often called Follow The Regularized Leader (FTRL), a technique proposed by Yurii Nesterov.18 When used with ℓ1 regularization, this technique often leads to very sparse models. TensorFlow implements a variant of FTRL called FTRL-Proximal19 in the FtrlOptimizer class.

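Here is a minimal sketch of using it (the learning rate and ℓ1 strength are illustrative values, and loss is assumed to be defined as usual):

# Sketch: training a sparse model with FTRL-Proximal (TensorFlow 1.x)
optimizer = tf.train.FtrlOptimizer(learning_rate=0.05,
                                   l1_regularization_strength=0.001)
training_op = optimizer.minimize(loss)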

Learning Rate Scheduling

Finding a good learning rate can be tricky. If you set it way too high, training may actually diverge (as we discussed in Chapter 4). If you set it too low, training will eventually converge to the optimum, but it will take a very long time. If you set it slightly too high, it will make progress very quickly at first, but it will end up dancing around the optimum, never settling down (unless you use an adaptive learning rate optimization algorithm such as AdaGrad, RMSProp, or Adam, but even then it may take time to settle). If you have a limited computing budget, you may have to interrupt training before it has converged properly, yielding a suboptimal solution (see Figure 11-8).

Figure 11-8. Learning curves for various learning rates η

You may be able to find a fairly good learning rate by training your network several times during just a few epochs using various learning rates and comparing the learning curves. The ideal learning rate will learn quickly and converge to a good solution.

However, you can do better than a constant learning rate: if you start with a high learning rate and then reduce it once it stops making fast progress, you can reach a good solution faster than with the optimal constant learning rate. There are many different strategies to reduce the learning rate during training. These strategies are called learning schedules (we briefly introduced this concept in Chapter 4), the most common of which are:

Predetermined piecewise constant learning rate
For example, set the learning rate to η₀ = 0.1 at first, then to η₁ = 0.001 after 50 epochs. Although this solution can work very well, it often requires fiddling around to figure out the right learning rates and when to use them.

Performance scheduling
Measure the validation error every N steps (just like for early stopping) and reduce the learning rate by a factor of λ when the error stops dropping (a minimal sketch appears at the end of this section).

Exponential scheduling
Set the learning rate to a function of the iteration number t: η(t) = η₀ 10^(−t/r). This works great, but it requires tuning η₀ and r. The learning rate will drop by a factor of 10 every r steps.

Power scheduling
Set the learning rate to η(t) = η₀ (1 + t/r)^(−c). The hyperparameter c is typically set to 1. This is similar to exponential scheduling, but the learning rate drops much more slowly.

A 2013 paper20 by Andrew Senior et al. compared the performance of some of the most popular learning schedules when training deep neural networks for speech recognition using Momentum optimization. The authors concluded that, in this setting, both performance scheduling and exponential scheduling performed well, but they favored exponential scheduling because it is simpler to implement, is easy to tune, and converged slightly faster to the optimal solution.

Implementing a learning schedule with TensorFlow is fairly straightforward:

initial_learning_rate = 0.1
decay_steps = 10000
decay_rate = 1/10
global_step = tf.Variable(0, trainable=False, name="global_step")
learning_rate = tf.train.exponential_decay(initial_learning_rate, global_step,
                                           decay_steps, decay_rate)
optimizer = tf.train.MomentumOptimizer(learning_rate, momentum=0.9)
training_op = optimizer.minimize(loss, global_step=global_step)

After setting the hyperparameter values, we create a nontrainable variable global_step (initialized to 0) to keep track of the current training iteration number. Then we define an exponentially decaying learning rate (with η₀ = 0.1 and r = 10,000) using TensorFlow's exponential_decay() function. Next, we create an optimizer (in this example, a MomentumOptimizer) using this decaying learning rate. Finally, we create the training operation by calling the optimizer's minimize() method; since we pass it the global_step variable, it will kindly take care of incrementing it. That's it!

Since AdaGrad, RMSProp, and Adam optimization automatically reduce the learning rate during training, it is not necessary to add an extra learning schedule. For other optimization algorithms, using exponential decay or performance scheduling can considerably speed up convergence.

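As promised, here is a minimal sketch of performance scheduling, assuming the usual training graph (loss, X, y, init, and an MNIST data feeder) from earlier examples; the check interval and the reduction factor λ = 1.5 are illustrative:

learning_rate = tf.placeholder(tf.float32, shape=(), name="learning_rate")
optimizer = tf.train.MomentumOptimizer(learning_rate, momentum=0.9)
training_op = optimizer.minimize(loss)

lr_val = 0.1                       # initial learning rate (illustrative)
best_val_loss = float("inf")
with tf.Session() as sess:
    init.run()
    for step in range(n_steps):
        X_batch, y_batch = mnist.train.next_batch(batch_size)
        sess.run(training_op,
                 feed_dict={X: X_batch, y: y_batch, learning_rate: lr_val})
        if step % 500 == 0:        # measure the validation error every N steps
            val_loss = loss.eval(feed_dict={X: X_valid, y: y_valid})
            if val_loss < best_val_loss:
                best_val_loss = val_loss
            else:
                lr_val /= 1.5      # reduce by a factor of λ when progress stalls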

Avoiding Overfitting Through Regularization

With four parameters I can fit an elephant and with five I can make him wiggle his trunk.
—John von Neumann, cited by Enrico Fermi in Nature 427

Deep neural networks typically have tens of thousands of parameters, sometimes even millions. With so many parameters, the network has an incredible amount of freedom and can fit a huge variety of complex datasets. But this great flexibility also means that it is prone to overfitting the training set.

With millions of parameters you can fit the whole zoo. In this section we will present some of the most popular regularization techniques for neural networks, and how to implement them with TensorFlow: early stopping, ℓ1 and ℓ2 regularization, dropout, max-norm regularization, and data augmentation.


Early Stopping

To avoid overfitting the training set, a great solution is early stopping (introduced in Chapter 4): just interrupt training when its performance on the validation set starts dropping.

One way to implement this with TensorFlow is to evaluate the model on a validation set at regular intervals (e.g., every 50 steps), and save a “winner” snapshot if it outperforms previous “winner” snapshots. Count the number of steps since the last “winner” snapshot was saved, and interrupt training when this number reaches some limit (e.g., 2,000 steps). Then restore the last “winner” snapshot.

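Here is a minimal sketch of this recipe, assuming the usual training graph (loss, training_op, X, y, init, and an MNIST data feeder) and a validation set (X_valid, y_valid) from earlier examples; the checkpoint path and limits are illustrative:

saver = tf.train.Saver()
best_loss = float("inf")
checks_without_progress = 0
max_checks = 40          # 40 checks × 50 steps = 2,000 steps without progress

with tf.Session() as sess:
    init.run()
    for step in range(n_steps):
        X_batch, y_batch = mnist.train.next_batch(batch_size)
        sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        if step % 50 == 0:
            val_loss = loss.eval(feed_dict={X: X_valid, y: y_valid})
            if val_loss < best_loss:
                best_loss = val_loss
                checks_without_progress = 0
                saver.save(sess, "/tmp/my_model.ckpt")   # save the "winner"
            else:
                checks_without_progress += 1
                if checks_without_progress > max_checks:
                    break                                # early stopping
    saver.restore(sess, "/tmp/my_model.ckpt")            # roll back to the best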
Although early stopping works very well in practice, you can usually get much higher performance out of your network by combining it with other regularization techniques.


ℓ1 and ℓ2 Regularization

Just like you did in Chapter 4 for simple linear models, you can use ℓ1 and ℓ2 regularization to constrain a neural network's connection weights (but typically not its biases).

One way to do this using TensorFlow is to simply add the appropriate regularization terms to your cost function. For example, assuming you have just one hidden layer with weights W1 and one output layer with weights W2, then you can apply ℓ1 regularization like this:

[...]  # construct the neural network
W1 = tf.get_default_graph().get_tensor_by_name("hidden1/kernel:0")
W2 = tf.get_default_graph().get_tensor_by_name("outputs/kernel:0")

scale = 0.001  # l1 regularization hyperparameter

with tf.name_scope("loss"):
    xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y,
                                                              logits=logits)
    base_loss = tf.reduce_mean(xentropy, name="avg_xentropy")
    reg_losses = tf.reduce_sum(tf.abs(W1)) + tf.reduce_sum(tf.abs(W2))
    loss = tf.add(base_loss, scale * reg_losses, name="loss")

However, if there are many layers, this approach is not very convenient. Fortunately, TensorFlow provides a better option. Many functions that create variables (such as get_variable() or tf.layers.dense()) accept a *_regularizer argument for each created variable (e.g., kernel_regularizer). You can pass any function that takes weights as an argument and returns the corresponding regularization loss. The l1_regularizer(), l2_regularizer(), and l1_l2_regularizer() functions return such functions. The following code puts all this together:

my_dense_layer = partial(
    tf.layers.dense, activation=tf.nn.relu,
    kernel_regularizer=tf.contrib.layers.l1_regularizer(scale))

with tf.name_scope("dnn"):
    hidden1 = my_dense_layer(X, n_hidden1, name="hidden1")
    hidden2 = my_dense_layer(hidden1, n_hidden2, name="hidden2")
    logits = my_dense_layer(hidden2, n_outputs, activation=None,
                            name="outputs")

This code creates a neural network with two hidden layers and one output layer, and it also creates nodes in the graph to compute the ℓ1 regularization loss corresponding to each layer's weights. TensorFlow automatically adds these nodes to a special collection containing all the regularization losses. You just need to add these regularization losses to your overall loss, like this:

reg_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
loss = tf.add_n([base_loss] + reg_losses, name="loss")

WARNING
Don't forget to add the regularization losses to your overall loss, or else they will simply be ignored.


Dropout

The most popular regularization technique for deep neural networks is arguably dropout. It was proposed21 by G. E. Hinton in 2012 and further detailed in a paper22 by Nitish Srivastava et al., and it has proven to be highly successful: even the state-of-the-art neural networks got a 1–2% accuracy boost simply by adding dropout. This may not sound like a lot, but when a model already has 95% accuracy, getting a 2% accuracy boost means dropping the error rate by almost 40% (going from 5% error to roughly 3%).

It is a fairly simple algorithm: at every training step, every neuron (including the input neurons but excluding the output neurons) has a probability p of being temporarily “dropped out,” meaning it will be entirely ignored during this training step, but it may be active during the next step (see Figure 11-9). The hyperparameter p is called the dropout rate, and it is typically set to 50%. After training, neurons don't get dropped anymore. And that's all (except for a technical detail we will discuss momentarily).

Figure 11-9. Dropout regularization

It is quite surprising at first that this rather brutal technique works at all. Would a company perform better if its employees were told to toss a coin every morning to decide whether or not to go to work? Well, who knows; perhaps it would! The company would obviously be forced to adapt its organization; it could not rely on any single person to fill in the coffee machine or perform any other critical tasks, so this expertise would have to be spread across several people. Employees would have to learn to cooperate with many of their coworkers, not just a handful of them. The company would become much more resilient. If one person quit, it wouldn't make much of a difference. It's unclear whether this idea would actually work for companies, but it certainly does for neural networks. Neurons trained with dropout cannot co-adapt with their neighboring neurons; they have to be as useful as possible on their own. They also cannot rely excessively on just a few input neurons; they must pay attention to each of their input neurons. They end up being less sensitive to slight changes in the inputs. In the end you get a more robust network that generalizes better.

Another way to understand the power of dropout is to realize that a unique neural network is generated at each training step. Since each neuron can be either present or absent, there is a total of 2^N possible networks (where N is the total number of droppable neurons). This is such a huge number that it is virtually impossible for the same neural network to be sampled twice. Once you have run 10,000 training steps, you have essentially trained 10,000 different neural networks (each with just one training instance). These neural networks are obviously not independent since they share many of their weights, but they are nevertheless all different. The resulting neural network can be seen as an averaging ensemble of all these smaller neural networks.

There is one small but important technical detail. Suppose p = 50%, in which case during testing a neuron will be connected to twice as many input neurons as it was (on average) during training. To compensate for this fact, we need to multiply each neuron's input connection weights by 0.5 after training. If we don't, each neuron will get a total input signal roughly twice as large as what the network was trained on, and it is unlikely to perform well. More generally, we need to multiply each input connection weight by the keep probability (1 − p) after training. Alternatively, we can divide each neuron's output by the keep probability during training (these alternatives are not perfectly equivalent, but they work equally well).

To implement dropout using TensorFlow, you can simply apply the tf.layers.dropout() function to the input layer and/or to the output of any hidden layer you want. During training, this function randomly drops some items (setting them to 0) and divides the remaining items by the keep probability. After training, this function does nothing at all. The following code applies dropout regularization to our three-layer neural network:

[...]
training = tf.placeholder_with_default(False, shape=(), name='training')

dropout_rate = 0.5  # == 1 - keep_prob
X_drop = tf.layers.dropout(X, dropout_rate, training=training)

with tf.name_scope("dnn"):
    hidden1 = tf.layers.dense(X_drop, n_hidden1, activation=tf.nn.relu,
                              name="hidden1")
    hidden1_drop = tf.layers.dropout(hidden1, dropout_rate, training=training)
    hidden2 = tf.layers.dense(hidden1_drop, n_hidden2, activation=tf.nn.relu,
                              name="hidden2")
    hidden2_drop = tf.layers.dropout(hidden2, dropout_rate, training=training)
    logits = tf.layers.dense(hidden2_drop, n_outputs, name="outputs")

WARNING
You want to use the tf.layers.dropout() function, not tf.nn.dropout(). The first one turns off (no-op) when not training, which is what you want, while the second one does not.

Of course, just like you did earlier for Batch Normalization, you need to set training to True when training, and leave the default False value when testing.

If you observe that the model is overfitting, you can increase the dropout rate. Conversely, you should try decreasing the dropout rate if the model underfits the training set. It can also help to increase the dropout rate for large layers, and reduce it for small ones.

Dropout does tend to significantly slow down convergence, but it usually results in a much better model when tuned properly. So, it is generally well worth the extra time and effort.

NOTE
Dropconnect is a variant of dropout where individual connections are dropped randomly rather than whole neurons. In general dropout performs better.


Max-Norm Regularization

Another regularization technique that is quite popular for neural networks is called max-norm regularization: for each neuron, it constrains the weights w of the incoming connections such that ∥w∥₂ ≤ r, where r is the max-norm hyperparameter and ∥·∥₂ is the ℓ2 norm.

We typically implement this constraint by computing ∥w∥₂ after each training step and clipping w if needed (w ← w · r/∥w∥₂).

Reducing r increases the amount of regularization and helps reduce overfitting. Max-norm regularization can also help alleviate the vanishing/exploding gradients problems (if you are not using Batch Normalization).

TensorFlow does not provide an off-the-shelf max-norm regularizer, but it is not too hard to implement. The following code gets a handle on the weights of the first hidden layer, then it uses the clip_by_norm() function to create an operation that will clip the weights along the second axis so that each row vector ends up with a maximum norm of 1.0. The last line creates an assignment operation that will assign the clipped weights to the weights variable:

threshold = 1.0
weights = tf.get_default_graph().get_tensor_by_name("hidden1/kernel:0")
clipped_weights = tf.clip_by_norm(weights, clip_norm=threshold, axes=1)
clip_weights = tf.assign(weights, clipped_weights)

Then you just apply this operation after each training step, like so:

sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
clip_weights.eval()

In general, you would do this for every hidden layer. Although this solution should work fine, it is a bit messy. A cleaner solution is to create a max_norm_regularizer() function and use it just like the earlier l1_regularizer() function:

def max_norm_regularizer(threshold, axes=1, name="max_norm",
                         collection="max_norm"):
    def max_norm(weights):
        clipped = tf.clip_by_norm(weights, clip_norm=threshold, axes=axes)
        clip_weights = tf.assign(weights, clipped, name=name)
        tf.add_to_collection(collection, clip_weights)
        return None  # there is no regularization loss term
    return max_norm

This function returns a parametrized max_norm() function that you can use like any other regularizer:

max_norm_reg = max_norm_regularizer(threshold=1.0)

with tf.name_scope("dnn"):
    hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.relu,
                              kernel_regularizer=max_norm_reg, name="hidden1")
    hidden2 = tf.layers.dense(hidden1, n_hidden2, activation=tf.nn.relu,
                              kernel_regularizer=max_norm_reg, name="hidden2")
    logits = tf.layers.dense(hidden2, n_outputs, name="outputs")


Note that max-norm regularization does not require adding a regularization loss term to your overall loss function, which is why the max_norm() function returns None. But you still need to be able to run the clip_weights operations after each training step, so you need to be able to get a handle on them. This is why the max_norm() function adds the clip_weights operation to a collection of max-norm clipping operations. You need to fetch these clipping operations and run them after each training step:

clip_all_weights = tf.get_collection("max_norm")

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(mnist.train.num_examples // batch_size):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
            sess.run(clip_all_weights)

Much cleaner code, isn't it?


Data Augmentation

One last regularization technique, data augmentation, consists of generating new training instances from existing ones, artificially boosting the size of the training set. This will reduce overfitting, making this a regularization technique. The trick is to generate realistic training instances; ideally, a human should not be able to tell which instances were generated and which ones were not. Moreover, simply adding white noise will not help; the modifications you apply should be learnable (white noise is not).

For example, if your model is meant to classify pictures of mushrooms, you can slightly shift, rotate, and resize every picture in the training set by various amounts and add the resulting pictures to the training set (see Figure 11-10). This forces the model to be more tolerant to the position, orientation, and size of the mushrooms in the picture. If you want the model to be more tolerant to lighting conditions, you can similarly generate many images with various contrasts. Assuming the mushrooms are symmetrical, you can also flip the pictures horizontally. By combining these transformations you can greatly increase the size of your training set.

Figure 11-10. Generating new training instances from existing ones

It is often preferable to generate training instances on the fly during training rather than wasting storage space and network bandwidth. TensorFlow offers several image manipulation operations such as transposing (shifting), rotating, resizing, flipping, and cropping, as well as adjusting the brightness, contrast, saturation, and hue (see the API documentation for more details). This makes it easy to implement data augmentation for image datasets.

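For instance, here is a minimal sketch of an on-the-fly augmentation function built from these tf.image operations (the crop size and adjustment ranges are illustrative, and the input is assumed to be a [height, width, 3] image tensor at least 24 pixels on each side):

def augment(image):
    image = tf.image.random_flip_left_right(image)             # horizontal flip
    image = tf.image.random_brightness(image, max_delta=0.2)   # lighting
    image = tf.image.random_contrast(image, lower=0.8, upper=1.2)
    image = tf.random_crop(image, size=[24, 24, 3])            # random shift
    return image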

NOTE
Another powerful technique to train very deep neural networks is to add skip connections (a skip connection is when you add the input of a layer to the output of a higher layer). We will explore this idea in Chapter 13 when we talk about deep residual networks.


Practical Guidelines

In this chapter, we have covered a wide range of techniques and you may be wondering which ones you should use. The configuration in Table 11-2 will work fine in most cases.

Table 11-2. Default DNN configuration

Initialization:           He initialization
Activation function:      ELU
Normalization:            Batch Normalization
Regularization:           Dropout
Optimizer:                Nesterov Accelerated Gradient
Learning rate schedule:   None

Of course, you should try to reuse parts of a pretrained neural network if you can find one that solves a similar problem.

This default configuration may need to be tweaked:

If you can't find a good learning rate (convergence was too slow, so you increased the learning rate, and now convergence is fast but the network's accuracy is suboptimal), then you can try adding a learning schedule such as exponential decay.

If your training set is a bit too small, you can implement data augmentation.

If you need a sparse model, you can add some ℓ1 regularization to the mix (and optionally zero out the tiny weights after training). If you need an even sparser model, you can try using FTRL instead of Adam optimization, along with ℓ1 regularization.

If you need a lightning-fast model at runtime, you may want to drop Batch Normalization, and possibly replace the ELU activation function with the leaky ReLU. Having a sparse model will also help.

With these guidelines, you are now ready to train very deep nets—well, if you are very patient, that is! If you use a single machine, you may have to wait for days or even months for training to complete. In the next chapter we will discuss how to use distributed TensorFlow to train and run models across many servers and GPUs.


Exercises

1. Is it okay to initialize all the weights to the same value as long as that value is selected randomly using He initialization?

2. Is it okay to initialize the bias terms to 0?

3. Name three advantages of the ELU activation function over ReLU.

4. In which cases would you want to use each of the following activation functions: ELU, leaky ReLU (and its variants), ReLU, tanh, logistic, and softmax?

5. What may happen if you set the momentum hyperparameter too close to 1 (e.g., 0.99999) when using a MomentumOptimizer?

6. Name three ways you can produce a sparse model.

7. Does dropout slow down training? Does it slow down inference (i.e., making predictions on new instances)?

8. Deep Learning.

a. Build a DNN with five hidden layers of 100 neurons each, He initialization, and the ELU activation function.

b. Using Adam optimization and early stopping, try training it on MNIST but only on digits 0 to 4, as we will use transfer learning for digits 5 to 9 in the next exercise. You will need a softmax output layer with five neurons, and as always make sure to save checkpoints at regular intervals and save the final model so you can reuse it later.

c. Tune the hyperparameters using cross-validation and see what precision you can achieve.

d. Now try adding Batch Normalization and compare the learning curves: is it converging faster than before? Does it produce a better model?

e. Is the model overfitting the training set? Try adding dropout to every layer and try again. Does it help?

9. Transfer learning.

a. Create a new DNN that reuses all the pretrained hidden layers of the previous model, freezes them, and replaces the softmax output layer with a new one.

b. Train this new DNN on digits 5 to 9, using only 100 images per digit, and time how long it takes. Despite this small number of examples, can you achieve high precision?

c. Try caching the frozen layers, and train the model again: how much faster is it now?

d. Try again reusing just four hidden layers instead of five. Can you achieve a higher precision?

e. Now unfreeze the top two hidden layers and continue training: can you get the model to perform even better?

10. Pretraining on an auxiliary task.

a. In this exercise you will build a DNN that compares two MNIST digit images and predicts whether they represent the same digit or not. Then you will reuse the lower layers of this network to train an MNIST classifier using very little training data. Start by building two DNNs (let's call them DNN A and B), both similar to the one you built earlier but without the output layer: each DNN should have five hidden layers of 100 neurons each, He initialization, and ELU activation. Next, add one more hidden layer with 10 units on top of both DNNs. To do this, you should use TensorFlow's concat() function with axis=1 to concatenate the outputs of both DNNs for each instance, then feed the result to the hidden layer. Finally, add an output layer with a single neuron using the logistic activation function.

b. Split the MNIST training set in two sets: split #1 should contain 55,000 images, and split #2 should contain 5,000 images. Create a function that generates a training batch where each instance is a pair of MNIST images picked from split #1. Half of the training instances should be pairs of images that belong to the same class, while the other half should be images from different classes. For each pair, the training label should be 0 if the images are from the same class, or 1 if they are from different classes.

c. Train the DNN on this training set. For each image pair, you can simultaneously feed the first image to DNN A and the second image to DNN B. The whole network will gradually learn to tell whether two images belong to the same class or not.

d. Now create a new DNN by reusing and freezing the hidden layers of DNN A and adding a softmax output layer on top with 10 neurons. Train this network on split #2 and see if you can achieve high performance despite having only 500 images per class.

Solutions to these exercises are available in Appendix A.

1. “Understanding the Difficulty of Training Deep Feedforward Neural Networks,” X. Glorot, Y. Bengio (2010).

2. Here's an analogy: if you set a microphone amplifier's knob too close to zero, people won't hear your voice, but if you set it too close to the max, your voice will be saturated and people won't understand what you are saying. Now imagine a chain of such amplifiers: they all need to be set properly in order for your voice to come out loud and clear at the end of the chain. Your voice has to come out of each amplifier at the same amplitude as it came in.

3. This simplified strategy was actually already proposed much earlier—for example, in the 1998 book Neural Networks: Tricks of the Trade by Genevieve Orr and Klaus-Robert Müller (Springer).

4. Such as “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification,” K. He et al. (2015).

5. “Empirical Evaluation of Rectified Activations in Convolution Network,” B. Xu et al. (2015).

6. “Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs),” D. Clevert, T. Unterthiner, S. Hochreiter (2015).

7. “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift,” S. Ioffe and C. Szegedy (2015).

8. Many researchers argue that it is just as good, or even better, to place the batch normalization layers after (rather than before) the activations.

9. “On the difficulty of training recurrent neural networks,” R. Pascanu et al. (2013).

10. Another option is to come up with a supervised task for which you can easily gather a lot of labeled training data, then use transfer learning, as explained earlier. For example, if you want to train a model to identify your friends in pictures, you could download millions of faces on the internet and train a classifier to detect whether two faces are identical or not, then use this classifier to compare a new picture with each picture of your friends.

11. “Some methods of speeding up the convergence of iteration methods,” B. Polyak (1964).

12. “A Method for Unconstrained Convex Minimization Problem with the Rate of Convergence O(1/k²),” Yurii Nesterov (1983).

13. “Adaptive Subgradient Methods for Online Learning and Stochastic Optimization,” J. Duchi et al. (2011).

14. This algorithm was created by Tijmen Tieleman and Geoffrey Hinton in 2012, and presented by Geoffrey Hinton in his Coursera class on neural networks (slides: http://goo.gl/RsQeis; video: https://goo.gl/XUbIyJ). Amusingly, since the authors have not written a paper to describe it, researchers often cite “slide 29 in lecture 6” in their papers.

15. “Adam: A Method for Stochastic Optimization,” D. Kingma, J. Ba (2015).

16. These are estimations of the mean and (uncentered) variance of the gradients. The mean is often called the first moment, while the variance is often called the second moment, hence the name of the algorithm.

17. “The Marginal Value of Adaptive Gradient Methods in Machine Learning,” A. C. Wilson et al. (2017).

18. “Primal-Dual Subgradient Methods for Convex Problems,” Yurii Nesterov (2005).

19. “Ad Click Prediction: a View from the Trenches,” H. McMahan et al. (2013).

20. “An Empirical Study of Learning Rates in Deep Neural Networks for Speech Recognition,” A. Senior et al. (2013).

21. “Improving neural networks by preventing co-adaptation of feature detectors,” G. Hinton et al. (2012).

22. “Dropout: A Simple Way to Prevent Neural Networks from Overfitting,” N. Srivastava et al. (2014).


Chapter 12. Distributing TensorFlow Across Devices and Servers

In Chapter 11 we discussed several techniques that can considerably speed up training: better weight initialization, Batch Normalization, sophisticated optimizers, and so on. However, even with all of these techniques, training a large neural network on a single machine with a single CPU can take days or even weeks.

In this chapter we will see how to use TensorFlow to distribute computations across multiple devices (CPUs and GPUs) and run them in parallel (see Figure 12-1). First we will distribute computations across multiple devices on just one machine, then on multiple devices across multiple machines.

Figure 12-1. Executing a TensorFlow graph across multiple devices in parallel

TensorFlow's support of distributed computing is one of its main highlights compared to other neural network frameworks. It gives you full control over how to split (or replicate) your computation graph across devices and servers, and it lets you parallelize and synchronize operations in flexible ways so you can choose between all sorts of parallelization approaches.

We will look at some of the most popular approaches to parallelizing the execution and training of a neural network. Instead of waiting for weeks for a training algorithm to complete, you may end up waiting for just a few hours. Not only does this save an enormous amount of time, it also means that you can experiment with various models much more easily, and frequently retrain your models on fresh data.

Other great use cases of parallelization include exploring a much larger hyperparameter space when fine-tuning your model, and running large ensembles of neural networks efficiently.


But we must learn to walk before we can run. Let's start by parallelizing simple graphs across several GPUs on a single machine.


Multiple Devices on a Single Machine

You can often get a major performance boost simply by adding GPU cards to a single machine. In fact, in many cases this will suffice; you won't need to use multiple machines at all. For example, you can typically train a neural network just as fast using 8 GPUs on a single machine rather than 16 GPUs across multiple machines (due to the extra delay imposed by network communications in a multimachine setup).

In this section we will look at how to set up your environment so that TensorFlow can use multiple GPU cards on one machine. Then we will look at how you can distribute operations across available devices and execute them in parallel.


Installation

In order to run TensorFlow on multiple GPU cards, you first need to make sure your GPU cards have NVidia Compute Capability (greater or equal to 3.0). This includes Nvidia's Titan, Titan X, K20, and K40 cards (if you own another card, you can check its compatibility at https://developer.nvidia.com/cuda-gpus).

TIP
If you don't own any GPU cards, you can use a hosting service with GPU capability such as Amazon AWS. Detailed instructions to set up TensorFlow 0.9 with Python 3.5 on an Amazon AWS GPU instance are available in Žiga Avsec's helpful blog post. It should not be too hard to update it to the latest version of TensorFlow. Google also released a cloud service called Cloud Machine Learning to run TensorFlow graphs. In May 2016, they announced that their platform now includes servers equipped with tensor processing units (TPUs), processors specialized for Machine Learning that are much faster than GPUs for many ML tasks. Of course, another option is simply to buy your own GPU card. Tim Dettmers wrote a great blog post to help you choose, and he updates it fairly regularly.

You must then download and install the appropriate version of the CUDA and cuDNN libraries (CUDA 8.0 and cuDNN 5.1 if you are using the binary installation of TensorFlow 1.0.0), and set a few environment variables so TensorFlow knows where to find CUDA and cuDNN. The detailed installation instructions are likely to change fairly quickly, so it is best that you follow the instructions on TensorFlow's website.

Nvidia's Compute Unified Device Architecture library (CUDA) allows developers to use CUDA-enabled GPUs for all sorts of computations (not just graphics acceleration). Nvidia's CUDA Deep Neural Network library (cuDNN) is a GPU-accelerated library of primitives for DNNs. It provides optimized implementations of common DNN computations such as activation layers, normalization, forward and backward convolutions, and pooling (see Chapter 13). It is part of Nvidia's Deep Learning SDK (note that it requires creating an Nvidia developer account in order to download it). TensorFlow uses CUDA and cuDNN to control the GPU cards and accelerate computations (see Figure 12-2).


Figure 12-2. TensorFlow uses CUDA and cuDNN to control GPUs and boost DNNs

You can use the nvidia-smi command to check that CUDA is properly installed. It lists the available GPU cards, as well as processes running on each card:

$ nvidia-smi
Wed Sep 16 09:50:03 2016
+------------------------------------------------------+
| NVIDIA-SMI 352.63     Driver Version: 352.63         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID K520           Off  | 0000:00:03.0     Off |                  N/A |
| N/A   27C    P8    17W / 125W |    11MiB / 4095MiB   |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Finally, you must install TensorFlow with GPU support. If you created an isolated environment using virtualenv, you first need to activate it:

$ cd $ML_PATH    # Your ML working directory (e.g., $HOME/ml)
$ source env/bin/activate

Then install the appropriate GPU-enabled version of TensorFlow:

$ pip3 install --upgrade tensorflow-gpu


Now you can open up a Python shell and check that TensorFlow detects and uses CUDA and cuDNN properly by importing TensorFlow and creating a session:

>>> import tensorflow as tf
I [...]/dso_loader.cc:108] successfully opened CUDA library libcublas.so locally
I [...]/dso_loader.cc:108] successfully opened CUDA library libcudnn.so locally
I [...]/dso_loader.cc:108] successfully opened CUDA library libcufft.so locally
I [...]/dso_loader.cc:108] successfully opened CUDA library libcuda.so.1 locally
I [...]/dso_loader.cc:108] successfully opened CUDA library libcurand.so locally
>>> sess = tf.Session()
[...]
I [...]/gpu_init.cc:102] Found device 0 with properties:
name: GRID K520
major: 3 minor: 0 memoryClockRate (GHz) 0.797
pciBusID 0000:00:03.0
Total memory: 4.00GiB
Free memory: 3.95GiB
I [...]/gpu_init.cc:126] DMA: 0
I [...]/gpu_init.cc:136] 0:   Y
I [...]/gpu_device.cc:839] Creating TensorFlow device
(/gpu:0) -> (device: 0, name: GRID K520, pci bus id: 0000:00:03.0)

Looks good! TensorFlow detected the CUDA and cuDNN libraries, and it used the CUDA library to detect the GPU card (in this case an Nvidia Grid K520 card).


Managing the GPU RAM

By default TensorFlow automatically grabs all the RAM in all available GPUs the first time you run a graph, so you will not be able to start a second TensorFlow program while the first one is still running. If you try, you will get the following error:

E [...]/cuda_driver.cc:965] failed to allocate 3.66G (3928915968 bytes) from
device: CUDA_ERROR_OUT_OF_MEMORY

One solution is to run each process on different GPU cards. To do this, the simplest option is to set the CUDA_VISIBLE_DEVICES environment variable so that each process only sees the appropriate GPU cards. For example, you could start two programs like this:

$ CUDA_VISIBLE_DEVICES=0,1 python3 program_1.py
# and in another terminal:
$ CUDA_VISIBLE_DEVICES=3,2 python3 program_2.py

Program #1 will only see GPU cards 0 and 1 (numbered 0 and 1, respectively), and program #2 will only see GPU cards 2 and 3 (numbered 1 and 0, respectively). Everything will work fine (see Figure 12-3).

Figure 12-3. Each program gets two GPUs for itself

Another option is to tell TensorFlow to grab only a fraction of the memory. For example, to make TensorFlow grab only 40% of each GPU's memory, you must create a ConfigProto object, set its gpu_options.per_process_gpu_memory_fraction option to 0.4, and create the session using this configuration:

config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.4
session = tf.Session(config=config)

Now two programs like this one can run in parallel using the same GPU cards (but not three, since 3 × 0.4 > 1). See Figure 12-4.


Figure 12-4. Each program gets all four GPUs, but with only 40% of the RAM each

If you run the nvidia-smi command while both programs are running, you should see that each process holds roughly 40% of the total RAM of each card:

$ nvidia-smi
[...]
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      5231    C   python                                       1677MiB  |
|    0      5262    C   python                                       1677MiB  |
|    1      5231    C   python                                       1677MiB  |
|    1      5262    C   python                                       1677MiB  |
[...]

Yet another option is to tell TensorFlow to grab memory only when it needs it. To do this you must set config.gpu_options.allow_growth to True. However, TensorFlow never releases memory once it has grabbed it (to avoid memory fragmentation) so you may still run out of memory after a while. It may be harder to guarantee a deterministic behavior using this option, so in general you probably want to stick with one of the previous options.

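For completeness, a minimal sketch of this option:

config = tf.ConfigProto()
config.gpu_options.allow_growth = True   # allocate GPU memory on demand
session = tf.Session(config=config)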
Okay, now you have a working GPU-enabled TensorFlow installation. Let's see how to use it!


Placing Operations on Devices

The TensorFlow whitepaper1 presents a friendly dynamic placer algorithm that automagically distributes operations across all available devices, taking into account things like the measured computation time in previous runs of the graph, estimations of the size of the input and output tensors to each operation, the amount of RAM available in each device, communication delay when transferring data in and out of devices, hints and constraints from the user, and more. Unfortunately, this sophisticated algorithm is internal to Google; it was not released in the open source version of TensorFlow. The reason it was left out seems to be that in practice a small set of placement rules specified by the user actually results in more efficient placement than what the dynamic placer is capable of. However, the TensorFlow team is working on improving the dynamic placer, and perhaps it will eventually be good enough to be released.

Until then TensorFlow relies on the simple placer, which (as its name suggests) is very basic.

Simple placement

Whenever you run a graph, if TensorFlow needs to evaluate a node that is not placed on a device yet, it uses the simple placer to place it, along with all other nodes that are not placed yet. The simple placer respects the following rules:

If a node was already placed on a device in a previous run of the graph, it is left on that device.

Else, if the user pinned a node to a device (described next), the placer places it on that device.

Else, it defaults to GPU #0, or the CPU if there is no GPU.

As you can see, placing operations on the appropriate device is mostly up to you. If you don't do anything, the whole graph will be placed on the default device. To pin nodes onto a device, you must create a device block using the device() function. For example, the following code pins the variable a and the constant b on the CPU, but the multiplication node c is not pinned on any device, so it will be placed on the default device:

with tf.device("/cpu:0"):
    a = tf.Variable(3.0)
    b = tf.constant(4.0)

c = a * b

NOTE
The "/cpu:0" device aggregates all CPUs on a multi-CPU system. There is currently no way to pin nodes on specific CPUs or to use just a subset of all CPUs.

Logging placements

Let's check that the simple placer respects the placement constraints we have just defined. For this you can set the log_device_placement option to True; this tells the placer to log a message whenever it places a node. For example:

>>> config = tf.ConfigProto()
>>> config.log_device_placement = True
>>> sess = tf.Session(config=config)
I [...] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GRID K520,
pci bus id: 0000:00:03.0)
[...]
>>> x.initializer.run(session=sess)
I [...] a: /job:localhost/replica:0/task:0/cpu:0
I [...] a/read: /job:localhost/replica:0/task:0/cpu:0
I [...] mul: /job:localhost/replica:0/task:0/gpu:0
I [...] a/Assign: /job:localhost/replica:0/task:0/cpu:0
I [...] b: /job:localhost/replica:0/task:0/cpu:0
I [...] a/initial_value: /job:localhost/replica:0/task:0/cpu:0
>>> sess.run(c)
12

The lines starting with "I" for Info are the log messages. When we create a session, TensorFlow logs a message to tell us that it has found a GPU card (in this case the Grid K520 card). Then the first time we run the graph (in this case when initializing the variable a), the simple placer is run and places each node on the device it was assigned to. As expected, the log messages show that all nodes are placed on "/cpu:0" except the multiplication node, which ends up on the default device "/gpu:0" (you can safely ignore the prefix /job:localhost/replica:0/task:0 for now; we will talk about it in a moment). Notice that the second time we run the graph (to compute c), the placer is not used since all the nodes TensorFlow needs to compute c are already placed.

Dynamic placement function

When you create a device block, you can specify a function instead of a device name. TensorFlow will call this function for each operation it needs to place in the device block, and the function must return the name of the device to pin the operation on. For example, the following code pins all the variable nodes to "/cpu:0" (in this case just the variable a) and all other nodes to "/gpu:0":

def variables_on_cpu(op):
    if op.type == "Variable":
        return "/cpu:0"
    else:
        return "/gpu:0"

with tf.device(variables_on_cpu):
    a = tf.Variable(3.0)
    b = tf.constant(4.0)
    c = a * b

You can easily implement more complex algorithms, such as pinning variables across GPUs in a round-robin fashion.

Operations and kernels

For a TensorFlow operation to run on a device, it needs to have an implementation for that device; this is called a kernel. Many operations have kernels for both CPUs and GPUs, but not all of them. For example, TensorFlow does not have a GPU kernel for integer variables, so the following code will fail when TensorFlow tries to place the variable i on GPU #0:


>>> with tf.device("/gpu:0"):
...     i = tf.Variable(3)
[...]
>>> sess.run(i.initializer)
Traceback (most recent call last):
[...]
tensorflow.python.framework.errors.InvalidArgumentError: Cannot assign a device
to node 'Variable': Could not satisfy explicit device specification

Note that TensorFlow infers that the variable must be of type int32 since the initialization value is an integer. If you change the initialization value to 3.0 instead of 3, or if you explicitly set dtype=tf.float32 when creating the variable, everything will work fine.

Soft placement

By default, if you try to pin an operation on a device for which the operation has no kernel, you get the exception shown earlier when TensorFlow tries to place the operation on the device. If you prefer TensorFlow to fall back to the CPU instead, you can set the allow_soft_placement configuration option to True:

with tf.device("/gpu:0"):
    i = tf.Variable(3)

config = tf.ConfigProto()
config.allow_soft_placement = True
sess = tf.Session(config=config)
sess.run(i.initializer)  # the placer runs and falls back to /cpu:0

So far we have discussed how to place nodes on different devices. Now let's see how TensorFlow will run these nodes in parallel.


Parallel Execution

When TensorFlow runs a graph, it starts by finding out the list of nodes that need to be evaluated, and it counts how many dependencies each of them has. TensorFlow then starts evaluating the nodes with zero dependencies (i.e., source nodes). If these nodes are placed on separate devices, they obviously get evaluated in parallel. If they are placed on the same device, they get evaluated in different threads, so they may run in parallel too (in separate GPU threads or CPU cores).

TensorFlow manages a thread pool on each device to parallelize operations (see Figure 12-5). These are called the inter-op thread pools. Some operations have multithreaded kernels: they can use other thread pools (one per device) called the intra-op thread pools.

Figure 12-5. Parallelized execution of a TensorFlow graph

For example, in Figure 12-5, operations A, B, and C are source ops, so they can immediately be evaluated. Operations A and B are placed on GPU #0, so they are sent to this device's inter-op thread pool, and immediately evaluated in parallel. Operation A happens to have a multithreaded kernel; its computations are split in three parts, which are executed in parallel by the intra-op thread pool. Operation C goes to GPU #1's inter-op thread pool.

As soon as operation C finishes, the dependency counters of operations D and E will be decremented and will both reach 0, so both operations will be sent to the inter-op thread pool to be executed.

Page 405: index-of.esindex-of.es/Varios-2/Hands on Machine Learning with... · Revision History for the First Edition 2017-03-10: First Release 2017-06-09: Second Release See for release

TIP
You can control the number of threads per inter-op pool by setting the inter_op_parallelism_threads option. Note that the first session you start creates the inter-op thread pools. All other sessions will just reuse them unless you set the use_per_session_threads option to True. You can control the number of threads per intra-op pool by setting the intra_op_parallelism_threads option.

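For example, a minimal sketch of setting these options (the thread counts are illustrative):

config = tf.ConfigProto(inter_op_parallelism_threads=2,
                        intra_op_parallelism_threads=4)
sess = tf.Session(config=config)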

Control Dependencies

In some cases, it may be wise to postpone the evaluation of an operation even though all the operations it depends on have been executed. For example, if it uses a lot of memory but its value is needed only much further in the graph, it would be best to evaluate it at the last moment to avoid needlessly occupying RAM that other operations may need. Another example is a set of operations that depend on data located outside of the device. If they all run at the same time, they may saturate the device's communication bandwidth, and they will end up all waiting on I/O. Other operations that need to communicate data will also be blocked. It would be preferable to execute these communication-heavy operations sequentially, allowing the device to perform other operations in parallel.

To postpone evaluation of some nodes, a simple solution is to add control dependencies. For example, the following code tells TensorFlow to evaluate x and y only after a and b have been evaluated:

a = tf.constant(1.0)
b = a + 2.0

with tf.control_dependencies([a, b]):
    x = tf.constant(3.0)
    y = tf.constant(4.0)

z = x + y

Obviously, since z depends on x and y, evaluating z also implies waiting for a and b to be evaluated, even though it is not explicitly in the control_dependencies() block. Also, since b depends on a, we could simplify the preceding code by just creating a control dependency on [b] instead of [a, b], but in some cases “explicit is better than implicit.”

Great! Now you know:

How to place operations on multiple devices in any way you please

How these operations get executed in parallel

How to create control dependencies to optimize parallel execution

It's time to distribute computations across multiple servers!


Multiple Devices Across Multiple Servers

To run a graph across multiple servers, you first need to define a cluster. A cluster is composed of one or more TensorFlow servers, called tasks, typically spread across several machines (see Figure 12-6). Each task belongs to a job. A job is just a named group of tasks that typically have a common role, such as keeping track of the model parameters (such a job is usually named "ps" for parameter server), or performing computations (such a job is usually named "worker").

Figure 12-6. TensorFlow cluster

The following cluster specification defines two jobs, "ps" and "worker", containing one task and two tasks, respectively. In this example, machine A hosts two TensorFlow servers (i.e., tasks), listening on different ports: one is part of the "ps" job, and the other is part of the "worker" job. Machine B just hosts one TensorFlow server, part of the "worker" job.

cluster_spec = tf.train.ClusterSpec({
    "ps": [
        "machine-a.example.com:2221",  # /job:ps/task:0
    ],
    "worker": [
        "machine-a.example.com:2222",  # /job:worker/task:0
        "machine-b.example.com:2222",  # /job:worker/task:1
    ]})

To start a TensorFlow server, you must create a Server object, passing it the cluster specification (so it can communicate with other servers) and its own job name and task number. For example, to start the first worker task, you would run the following code on machine A:

server = tf.train.Server(cluster_spec, job_name="worker", task_index=0)


It is usually simpler to just run one task per machine, but the previous example demonstrates that TensorFlow allows you to run multiple tasks on the same machine if you want.2 If you have several servers on one machine, you will need to ensure that they don't all try to grab all the RAM of every GPU, as explained earlier. For example, in Figure 12-6 the "ps" task does not see the GPU devices, since presumably its process was launched with CUDA_VISIBLE_DEVICES="". Note that the CPU is shared by all tasks located on the same machine.

If you want the process to do nothing other than run the TensorFlow server, you can block the main thread by telling it to wait for the server to finish using the join() method (otherwise the server will be killed as soon as your main thread exits). Since there is currently no way to stop the server, this will actually block forever:

server.join()  # blocks until the server stops (i.e., never)


Opening a Session

Once all the tasks are up and running (doing nothing yet), you can open a session on any of the servers, from a client located in any process on any machine (even from a process running one of the tasks), and use that session like a regular local session. For example:

a = tf.constant(1.0)
b = a + 2
c = a * 3

with tf.Session("grpc://machine-b.example.com:2222") as sess:
    print(c.eval())  # 9.0

This client code first creates a simple graph, then opens a session on the TensorFlow server located on machine B (which we will call the master), and instructs it to evaluate c. The master starts by placing the operations on the appropriate devices. In this example, since we did not pin any operation on any device, the master simply places them all on its own default device—in this case, machine B's GPU device. Then it just evaluates c as instructed by the client, and it returns the result.


The Master and Worker Services

The client uses the gRPC protocol (Google Remote Procedure Call) to communicate with the server. This is an efficient open source framework to call remote functions and get their outputs across a variety of platforms and languages.3 It is based on HTTP2, which opens a connection and leaves it open during the whole session, allowing efficient bidirectional communication once the connection is established. Data is transmitted in the form of protocol buffers, another open source Google technology. This is a lightweight binary data interchange format.

WARNING
All servers in a TensorFlow cluster may communicate with any other server in the cluster, so make sure to open the appropriate ports on your firewall.

Every TensorFlow server provides two services: the master service and the worker service. The master service allows clients to open sessions and use them to run graphs. It coordinates the computations across tasks, relying on the worker service to actually execute computations on other tasks and get their results.

This architecture gives you a lot of flexibility. One client can connect to multiple servers by opening multiple sessions in different threads. One server can handle multiple sessions simultaneously from one or more clients. You can run one client per task (typically within the same process), or just one client to control all tasks. All options are open.


Pinning Operations Across Tasks

You can use device blocks to pin operations on any device managed by any task, by specifying the job name, task index, device type, and device index. For example, the following code pins a to the CPU of the first task in the "ps" job (that's the CPU on machine A), and it pins b to the second GPU managed by the first task of the "worker" job (that's GPU #1 on machine A). Finally, c is not pinned to any device, so the master places it on its own default device (machine B's GPU #0 device).

with tf.device("/job:ps/task:0/cpu:0"):
    a = tf.constant(1.0)

with tf.device("/job:worker/task:0/gpu:1"):
    b = a + 2

c = a + b

As earlier, if you omit the device type and index, TensorFlow will default to the task's default device; for example, pinning an operation to "/job:ps/task:0" will place it on the default device of the first task of the "ps" job (machine A's CPU). If you also omit the task index (e.g., "/job:ps"), TensorFlow defaults to "/task:0". If you omit the job name and the task index, TensorFlow defaults to the session's master task.


Sharding Variables Across Multiple Parameter Servers

As we will see shortly, a common pattern when training a neural network on a distributed setup is to store the model parameters on a set of parameter servers (i.e., the tasks in the "ps" job) while other tasks focus on computations (i.e., the tasks in the "worker" job). For large models with millions of parameters, it is useful to shard these parameters across multiple parameter servers, to reduce the risk of saturating a single parameter server's network card. If you were to manually pin every variable to a different parameter server, it would be quite tedious. Fortunately, TensorFlow provides the replica_device_setter() function, which distributes variables across all the "ps" tasks in a round-robin fashion. For example, the following code pins five variables to two parameter servers:

with tf.device(tf.train.replica_device_setter(ps_tasks=2)):
    v1 = tf.Variable(1.0)  # pinned to /job:ps/task:0
    v2 = tf.Variable(2.0)  # pinned to /job:ps/task:1
    v3 = tf.Variable(3.0)  # pinned to /job:ps/task:0
    v4 = tf.Variable(4.0)  # pinned to /job:ps/task:1
    v5 = tf.Variable(5.0)  # pinned to /job:ps/task:0

Instead of passing the number of ps_tasks, you can pass the cluster spec cluster=cluster_spec and TensorFlow will simply count the number of tasks in the "ps" job.
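For example, here is a minimal sketch of this variant, assuming cluster_spec is the tf.train.ClusterSpec created when the cluster was defined:

with tf.device(tf.train.replica_device_setter(cluster=cluster_spec)):
    v1 = tf.Variable(1.0)  # still pinned to /job:ps/task:0
    v2 = tf.Variable(2.0)  # still pinned to /job:ps/task:1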

If you create other operations in the block, beyond just variables, TensorFlow automatically pins them to "/job:worker", which will default to the first device managed by the first task in the "worker" job. You can pin them to another device by setting the worker_device parameter, but a better approach is to use embedded device blocks. An inner device block can override the job, task, or device defined in an outer block. For example:

with tf.device(tf.train.replica_device_setter(ps_tasks=2)):
    v1 = tf.Variable(1.0)  # pinned to /job:ps/task:0 (+ defaults to /cpu:0)
    v2 = tf.Variable(2.0)  # pinned to /job:ps/task:1 (+ defaults to /cpu:0)
    v3 = tf.Variable(3.0)  # pinned to /job:ps/task:0 (+ defaults to /cpu:0)
    [...]
    s = v1 + v2            # pinned to /job:worker (+ defaults to task:0/gpu:0)
    with tf.device("/gpu:1"):
        p1 = 2 * s         # pinned to /job:worker/gpu:1 (+ defaults to /task:0)
        with tf.device("/task:1"):
            p2 = 3 * s     # pinned to /job:worker/task:1/gpu:1

NOTE
This example assumes that the parameter servers are CPU-only, which is typically the case since they only need to store and communicate parameters, not perform intensive computations.


Sharing State Across Sessions Using Resource Containers

When you are using a plain local session (not the distributed kind), each variable's state is managed by the session itself; as soon as it ends, all variable values are lost. Moreover, multiple local sessions cannot share any state, even if they both run the same graph; each session has its own copy of every variable (as we discussed in Chapter 9). In contrast, when you are using distributed sessions, variable state is managed by resource containers located on the cluster itself, not by the sessions. So if you create a variable named x using one client session, it will automatically be available to any other session on the same cluster (even if the two sessions are connected to different servers). For example, consider the following client code:

# simple_client.py
import tensorflow as tf
import sys

x = tf.Variable(0.0, name="x")
increment_x = tf.assign(x, x + 1)

with tf.Session(sys.argv[1]) as sess:
    if sys.argv[2:] == ["init"]:
        sess.run(x.initializer)
    sess.run(increment_x)
    print(x.eval())

Let's suppose you have a TensorFlow cluster up and running on machines A and B, port 2222. You could launch the client, have it open a session with the server on machine A, and tell it to initialize the variable, increment it, and print its value by launching the following command:

$ python3 simple_client.py grpc://machine-a.example.com:2222 init
1.0

Now if you launch the client with the following command, it will connect to the server on machine B and magically reuse the same variable x (this time we don't ask the server to initialize the variable):

$ python3 simple_client.py grpc://machine-b.example.com:2222
2.0

This feature cuts both ways: it's great if you want to share variables across multiple sessions, but if you want to run completely independent computations on the same cluster you will have to be careful not to use the same variable names by accident. One way to ensure that you won't have name clashes is to wrap all of your construction phase inside a variable scope with a unique name for each computation, for example:

with tf.variable_scope("my_problem_1"):
    [...]  # Construction phase of problem 1

A better option is to use a container block:

with tf.container("my_problem_1"):
    [...]  # Construction phase of problem 1


This will use a container dedicated to problem #1, instead of the default one (whose name is an empty string ""). One advantage is that variable names remain nice and short. Another advantage is that you can easily reset a named container. For example, the following command will connect to the server on machine A and ask it to reset the container named "my_problem_1", which will free all the resources this container used (and also close all sessions open on the server). Any variable managed by this container must be initialized before you can use it again:

tf.Session.reset("grpc://machine-a.example.com:2222", ["my_problem_1"])

Resource containers make it easy to share variables across sessions in flexible ways. For example, Figure 12-7 shows four clients running different graphs on the same cluster, but sharing some variables. Clients A and B share the same variable x managed by the default container, while clients C and D share another variable named x managed by the container named "my_problem_1". Note that client C even uses variables from both containers.

Figure 12-7. Resource containers

Resource containers also take care of preserving the state of other stateful operations, namely queues and readers. Let's take a look at queues first.


Asynchronous Communication Using TensorFlow Queues

Queues are another great way to exchange data between multiple sessions; for example, one common use case is to have a client create a graph that loads the training data and pushes it into a queue, while another client creates a graph that pulls the data from the queue and trains a model (see Figure 12-8). This can speed up training considerably because the training operations don't have to wait for the next mini-batch at every step.

Figure 12-8. Using queues to load the training data asynchronously

TensorFlow provides various kinds of queues. The simplest kind is the first-in first-out (FIFO) queue. For example, the following code creates a FIFO queue that can store up to 10 tensors containing two float values each:

q = tf.FIFOQueue(capacity=10, dtypes=[tf.float32], shapes=[[2]],
                 name="q", shared_name="shared_q")

WARNING
To share variables across sessions, all you had to do was to specify the same name and container on both ends. With queues TensorFlow does not use the name attribute but instead uses shared_name, so it is important to specify it (even if it is the same as the name). And, of course, use the same container.

Enqueuing data

To push data to a queue, you must create an enqueue operation. For example, the following code pushes three training instances to the queue:

# training_data_loader.py
import tensorflow as tf

q = [...]
training_instance = tf.placeholder(tf.float32, shape=(2))
enqueue = q.enqueue([training_instance])

with tf.Session("grpc://machine-a.example.com:2222") as sess:
    sess.run(enqueue, feed_dict={training_instance: [1., 2.]})
    sess.run(enqueue, feed_dict={training_instance: [3., 4.]})
    sess.run(enqueue, feed_dict={training_instance: [5., 6.]})

Instead of enqueuing instances one by one, you can enqueue several at a time using an enqueue_many operation:

[...]
training_instances = tf.placeholder(tf.float32, shape=(None, 2))
enqueue_many = q.enqueue_many([training_instances])

with tf.Session("grpc://machine-a.example.com:2222") as sess:
    sess.run(enqueue_many,
             feed_dict={training_instances: [[1., 2.], [3., 4.], [5., 6.]]})

Both examples enqueue the same three tensors to the queue.

Dequeuing data

To pull the instances out of the queue, on the other end, you need to use a dequeue operation:

# trainer.py
import tensorflow as tf

q = [...]
dequeue = q.dequeue()

with tf.Session("grpc://machine-a.example.com:2222") as sess:
    print(sess.run(dequeue))  # [1., 2.]
    print(sess.run(dequeue))  # [3., 4.]
    print(sess.run(dequeue))  # [5., 6.]

In general you will want to pull a whole mini-batch at once, instead of pulling just one instance at a time. To do so, you must use a dequeue_many operation, specifying the mini-batch size:

[...]
batch_size = 2
dequeue_mini_batch = q.dequeue_many(batch_size)

with tf.Session("grpc://machine-a.example.com:2222") as sess:
    print(sess.run(dequeue_mini_batch))  # [[1., 2.], [3., 4.]]
    print(sess.run(dequeue_mini_batch))  # blocked waiting for another instance

When a queue is full, the enqueue operation will block until items are pulled out by a dequeue operation. Similarly, when a queue is empty (or you are using dequeue_many() and there are fewer items than the mini-batch size), the dequeue operation will block until enough items are pushed into the queue using an enqueue operation.

Queues of tuples

Each item in a queue can be a tuple of tensors (of various types and shapes) instead of just a single tensor. For example, the following queue stores pairs of tensors, one of type int32 and shape (), and the other of type float32 and shape [3,2]:

q = tf.FIFOQueue(capacity=10, dtypes=[tf.int32, tf.float32], shapes=[[], [3, 2]],
                 name="q", shared_name="shared_q")

The enqueue operation must be given pairs of tensors (note that each pair represents only one item in the queue):

a = tf.placeholder(tf.int32, shape=())
b = tf.placeholder(tf.float32, shape=(3, 2))
enqueue = q.enqueue((a, b))

with tf.Session([...]) as sess:
    sess.run(enqueue, feed_dict={a: 10, b: [[1., 2.], [3., 4.], [5., 6.]]})
    sess.run(enqueue, feed_dict={a: 11, b: [[2., 4.], [6., 8.], [0., 2.]]})
    sess.run(enqueue, feed_dict={a: 12, b: [[3., 6.], [9., 2.], [5., 8.]]})

On the other end, the dequeue() function now creates a pair of dequeue operations:

dequeue_a, dequeue_b = q.dequeue()

In general, you should run these operations together:

with tf.Session([...]) as sess:
    a_val, b_val = sess.run([dequeue_a, dequeue_b])
    print(a_val)  # 10
    print(b_val)  # [[1., 2.], [3., 4.], [5., 6.]]

WARNING
If you run dequeue_a on its own, it will dequeue a pair and return only the first element; the second element will be lost (and similarly, if you run dequeue_b on its own, the first element will be lost).

The dequeue_many() function also returns a pair of operations:

batch_size = 2
dequeue_as, dequeue_bs = q.dequeue_many(batch_size)

You can use it as you would expect:

with tf.Session([...]) as sess:
    a, b = sess.run([dequeue_as, dequeue_bs])
    print(a)  # [10, 11]
    print(b)  # [[[1., 2.], [3., 4.], [5., 6.]], [[2., 4.], [6., 8.], [0., 2.]]]
    a, b = sess.run([dequeue_as, dequeue_bs])  # blocked waiting for another pair

Closing a queue

It is possible to close a queue to signal to the other sessions that no more data will be enqueued:

close_q = q.close()

with tf.Session([...]) as sess:
    [...]
    sess.run(close_q)

Subsequent executions of enqueue or enqueue_many operations will raise an exception. By default, any pending enqueue request will be honored, unless you call q.close(cancel_pending_enqueues=True).

Subsequent executions of dequeue or dequeue_many operations will continue to succeed as long as there are items in the queue, but they will fail when there are not enough items left in the queue. If you are using a dequeue_many operation and there are a few instances left in the queue, but fewer than the mini-batch size, they will be lost. You may prefer to use a dequeue_up_to operation instead; it behaves exactly like dequeue_many except when a queue is closed and there are fewer than batch_size instances left in the queue, in which case it just returns them.
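For example, here is a minimal sketch of the difference, reusing the earlier float queue (the values shown are illustrative):

# Assuming the queue was closed with three instances left in it,
# dequeue_up_to(2) returns a full mini-batch, then the smaller remainder.
dequeue_batch = q.dequeue_up_to(2)

with tf.Session([...]) as sess:
    print(sess.run(dequeue_batch))  # e.g., [[1., 2.], [3., 4.]]
    print(sess.run(dequeue_batch))  # e.g., [[5., 6.]] (no blocking, no error)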

RandomShuffleQueue

TensorFlow also supports a couple more types of queues, including RandomShuffleQueue, which can be used just like a FIFOQueue except that items are dequeued in a random order. This can be useful to shuffle training instances at each epoch during training. First, let's create the queue:

q = tf.RandomShuffleQueue(capacity=50, min_after_dequeue=10,
                          dtypes=[tf.float32], shapes=[()],
                          name="q", shared_name="shared_q")

The min_after_dequeue specifies the minimum number of items that must remain in the queue after a dequeue operation. This ensures that there will be enough instances in the queue to have enough randomness (once the queue is closed, the min_after_dequeue limit is ignored). Now suppose that you enqueued 22 items in this queue (floats 1. to 22.). Here is how you could dequeue them:

dequeue = q.dequeue_many(5)

with tf.Session([...]) as sess:
    print(sess.run(dequeue))  # [20. 15. 11. 12. 4.] (17 items left)
    print(sess.run(dequeue))  # [5. 13. 6. 0. 17.] (12 items left)
    print(sess.run(dequeue))  # 12 - 5 < 10: blocked waiting for 3 more instances

PaddingFifoQueue

A PaddingFIFOQueue can also be used just like a FIFOQueue except that it accepts tensors of variable sizes along any dimension (but with a fixed rank). When you are dequeuing them with a dequeue_many or dequeue_up_to operation, each tensor is padded with zeros along every variable dimension to make it the same size as the largest tensor in the mini-batch. For example, you could enqueue 2D tensors (matrices) of arbitrary sizes:

q = tf.PaddingFIFOQueue(capacity=50, dtypes=[tf.float32], shapes=[(None, None)],
                        name="q", shared_name="shared_q")
v = tf.placeholder(tf.float32, shape=(None, None))
enqueue = q.enqueue([v])

with tf.Session([...]) as sess:
    sess.run(enqueue, feed_dict={v: [[1., 2.], [3., 4.], [5., 6.]]})        # 3x2
    sess.run(enqueue, feed_dict={v: [[1.]]})                                # 1x1
    sess.run(enqueue, feed_dict={v: [[7., 8., 9., 5.], [6., 7., 8., 9.]]})  # 2x4

If we just dequeue one item at a time, we get the exact same tensors that were enqueued. But if we dequeue several items at a time (using dequeue_many() or dequeue_up_to()), the queue automatically pads the tensors appropriately. For example, if we dequeue all three items at once, all tensors will be padded with zeros to become 3 × 4 tensors, since the maximum size for the first dimension is 3 (first item) and the maximum size for the second dimension is 4 (third item):


>>> q = [...]
>>> dequeue = q.dequeue_many(3)
>>> with tf.Session([...]) as sess:
...     print(sess.run(dequeue))
...
[[[ 1.  2.  0.  0.]
  [ 3.  4.  0.  0.]
  [ 5.  6.  0.  0.]]

 [[ 1.  0.  0.  0.]
  [ 0.  0.  0.  0.]
  [ 0.  0.  0.  0.]]

 [[ 7.  8.  9.  5.]
  [ 6.  7.  8.  9.]
  [ 0.  0.  0.  0.]]]

This type of queue can be useful when you are dealing with variable-length inputs, such as sequences of words (see Chapter 14).

Okay, now let's pause for a second: so far you have learned to distribute computations across multiple devices and servers, share variables across sessions, and communicate asynchronously using queues. Before you start training neural networks, though, there's one last topic we need to discuss: how to efficiently load training data.


Loading Data Directly from the Graph

So far we have assumed that the clients would load the training data and feed it to the cluster using placeholders. This is simple and works quite well for simple setups, but it is rather inefficient since it transfers the training data several times:

1. From the file system to the client
2. From the client to the master task
3. Possibly from the master task to other tasks where the data is needed

It gets worse if you have several clients training various neural networks using the same training data (for example, for hyperparameter tuning): if every client loads the data simultaneously, you may end up even saturating your file server or the network's bandwidth.

Preload the data into a variable

For datasets that can fit in memory, a better option is to load the training data once and assign it to a variable, then just use that variable in your graph. This is called preloading the training set. This way the data will be transferred only once from the client to the cluster (but it may still need to be moved around from task to task depending on which operations need it). The following code shows how to load the full training set into a variable:

training_set_init = tf.placeholder(tf.float32, shape=(None, n_features))
training_set = tf.Variable(training_set_init, trainable=False, collections=[],
                           name="training_set")

with tf.Session([...]) as sess:
    data = [...]  # load the training data from the datastore
    sess.run(training_set.initializer, feed_dict={training_set_init: data})

You must set trainable=False so the optimizers don't try to tweak this variable. You should also set collections=[] to ensure that this variable won't get added to the GraphKeys.GLOBAL_VARIABLES collection, which is used for saving and restoring checkpoints.
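Once the training set is preloaded, the graph can slice mini-batches directly out of the variable. Here is a minimal sketch; the batch_index placeholder and the batch size are hypothetical, not part of the original example:

batch_size = 100                                   # illustrative
batch_index = tf.placeholder(tf.int32, shape=())   # hypothetical placeholder
mini_batch = tf.slice(training_set, begin=[batch_index * batch_size, 0],
                      size=[batch_size, n_features])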

NOTE
This example assumes that all of your training set (including the labels) consists only of float32 values. If that's not the case, you will need one variable per type.

Reading the training data directly from the graph

If the training set does not fit in memory, a good solution is to use reader operations: these are operations capable of reading data directly from the filesystem. This way the training data never needs to flow through the clients at all. TensorFlow provides readers for various file formats:

CSV


Fixed-length binary records

TensorFlow's own TFRecords format, based on protocol buffers

Let's look at a simple example reading from a CSV file (for other formats, please check out the API documentation). Suppose you have a file named my_test.csv that contains training instances, and you want to create operations to read it. Suppose it has the following content, with two float features x1 and x2 and one integer target representing a binary class:

x1,x2,target

1.,2.,0

4.,5,1

7.,,0

First, let's create a TextLineReader to read this file. A TextLineReader opens a file (once we tell it which one to open) and reads lines one by one. It is a stateful operation, like variables and queues: it preserves its state across multiple runs of the graph, keeping track of which file it is currently reading and what its current position is in this file.

reader = tf.TextLineReader(skip_header_lines=1)

Next, we create a queue that the reader will pull from to know which file to read next. We also create an enqueue operation and a placeholder to push any filename we want to the queue, and we create an operation to close the queue once we have no more files to read:

filename_queue = tf.FIFOQueue(capacity=10, dtypes=[tf.string], shapes=[()])
filename = tf.placeholder(tf.string)
enqueue_filename = filename_queue.enqueue([filename])
close_filename_queue = filename_queue.close()

Now we are ready to create a read operation that will read one record (i.e., a line) at a time and return a key/value pair. The key is the record's unique identifier (a string composed of the filename, a colon (:), and the line number), and the value is simply a string containing the content of the line:

key, value = reader.read(filename_queue)

We have all we need to read the file line by line! But we are not quite done yet: we need to parse this string to get the features and target:

x1, x2, target = tf.decode_csv(value, record_defaults=[[-1.], [-1.], [-1]])
features = tf.stack([x1, x2])

The first line uses TensorFlow's CSV parser to extract the values from the current line. The default values are used when a field is missing (in this example the third training instance's x2 feature), and they are also used to determine the type of each field (in this case two floats and one integer).

Finally, we can push this training instance and its target to a RandomShuffleQueue that we will share with the training graph (so it can pull mini-batches from it), and we create an operation to close that queue when we are done pushing instances to it:


instance_queue = tf.RandomShuffleQueue(
    capacity=10, min_after_dequeue=2,
    dtypes=[tf.float32, tf.int32], shapes=[[2], []],
    name="instance_q", shared_name="shared_instance_q")

enqueue_instance = instance_queue.enqueue([features, target])
close_instance_queue = instance_queue.close()

Wow! That was a lot of work just to read a file. Plus we only created the graph, so now we need to run it:

with tf.Session([...]) as sess:
    sess.run(enqueue_filename, feed_dict={filename: "my_test.csv"})
    sess.run(close_filename_queue)
    try:
        while True:
            sess.run(enqueue_instance)
    except tf.errors.OutOfRangeError as ex:
        pass  # no more records in the current file and no more files to read
    sess.run(close_instance_queue)

First we open the session, and then we enqueue the filename "my_test.csv" and immediately close that queue since we will not enqueue any more filenames. Then we run an infinite loop to enqueue instances one by one. The enqueue_instance depends on the reader reading the next line, so at every iteration a new record is read until it reaches the end of the file. At that point it tries to read the filename queue to know which file to read next, and since the queue is closed it throws an OutOfRangeError exception (if we did not close the queue, it would just remain blocked until we pushed another filename or closed the queue). Lastly, we close the instance queue so that the training operations pulling from it won't get blocked forever. Figure 12-9 summarizes what we have learned; it represents a typical graph for reading training instances from a set of CSV files.

Figure 12-9. A graph dedicated to reading training instances from CSV files

In the training graph, you need to create the shared instance queue and simply dequeue mini-batches from it:

instance_queue = tf.RandomShuffleQueue([...], shared_name="shared_instance_q")
mini_batch_instances, mini_batch_targets = instance_queue.dequeue_up_to(2)
[...]  # use the mini-batch instances and targets to build the training graph
training_op = [...]

with tf.Session([...]) as sess:
    try:
        for step in range(max_steps):
            sess.run(training_op)
    except tf.errors.OutOfRangeError as ex:
        pass  # no more training instances


In this example, the first mini-batch will contain the first two instances of the CSV file, and the second mini-batch will contain the last instance.

WARNING
TensorFlow queues don't handle sparse tensors well, so if your training instances are sparse you should parse the records after the instance queue.

This architecture will only use one thread to read records and push them to the instance queue. You can get a much higher throughput by having multiple threads read simultaneously from multiple files using multiple readers. Let's see how.

Multithreaded readers using a Coordinator and a QueueRunner

To have multiple threads read instances simultaneously, you could create Python threads (using the threading module) and manage them yourself. However, TensorFlow provides some tools to make this simpler: the Coordinator class and the QueueRunner class.

A Coordinator is a very simple object whose sole purpose is to coordinate stopping multiple threads. First you create a Coordinator:

coord = tf.train.Coordinator()

Then you give it to all threads that need to stop jointly, and their main loop looks like this:

while not coord.should_stop():
    [...]  # do something

Any thread can request that every thread stop by calling the Coordinator's request_stop() method:

coord.request_stop()

Every thread will stop as soon as it finishes its current iteration. You can wait for all of the threads to finish by calling the Coordinator's join() method, passing it the list of threads:

coord.join(list_of_threads)
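Putting these pieces together, here is a minimal sketch of managing your own Python threads with a Coordinator (do_something() is a hypothetical placeholder for the real per-iteration work):

import threading
import tensorflow as tf

def worker(coord):
    while not coord.should_stop():
        try:
            do_something()  # hypothetical per-iteration work
        except Exception:
            coord.request_stop()  # any failure stops every thread

coord = tf.train.Coordinator()
threads = [threading.Thread(target=worker, args=(coord,)) for _ in range(5)]
for thread in threads:
    thread.start()
coord.join(threads)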

A QueueRunner runs multiple threads that each run an enqueue operation repeatedly, filling up a queue as fast as possible. As soon as the queue is closed, the next thread that tries to push an item to the queue will get an OutOfRangeError; this thread catches the error and immediately tells other threads to stop using a Coordinator. The following code shows how you can use a QueueRunner to have five threads reading instances simultaneously and pushing them to an instance queue:

[...]  # same construction phase as earlier
queue_runner = tf.train.QueueRunner(instance_queue, [enqueue_instance] * 5)

with tf.Session() as sess:
    sess.run(enqueue_filename, feed_dict={filename: "my_test.csv"})
    sess.run(close_filename_queue)
    coord = tf.train.Coordinator()
    enqueue_threads = queue_runner.create_threads(sess, coord=coord, start=True)

The first line creates the QueueRunner and tells it to run five threads, all running the same enqueue_instance operation repeatedly. Then we start a session and we enqueue the name of the files to read (in this case just "my_test.csv"). Next we create a Coordinator that the QueueRunner will use to stop gracefully, as just explained. Finally, we tell the QueueRunner to create the threads and start them. The threads will read all training instances and push them to the instance queue, and then they will all stop gracefully.

This will be a bit more efficient than earlier, but we can do better. Currently all threads are reading from the same file. We can make them read simultaneously from separate files instead (assuming the training data is sharded across multiple CSV files) by creating multiple readers (see Figure 12-10).

Figure 12-10. Reading simultaneously from multiple files

For this we need to write a small function to create a reader and the nodes that will read and push one instance to the instance queue:

def read_and_push_instance(filename_queue, instance_queue):
    reader = tf.TextLineReader(skip_header_lines=1)
    key, value = reader.read(filename_queue)
    x1, x2, target = tf.decode_csv(value, record_defaults=[[-1.], [-1.], [-1]])
    features = tf.stack([x1, x2])
    enqueue_instance = instance_queue.enqueue([features, target])
    return enqueue_instance

Next we define the queues:

filename_queue = tf.FIFOQueue(capacity=10, dtypes=[tf.string], shapes=[()])
filename = tf.placeholder(tf.string)
enqueue_filename = filename_queue.enqueue([filename])
close_filename_queue = filename_queue.close()
instance_queue = tf.RandomShuffleQueue([...])

And finally we create the QueueRunner, but this time we give it a list of different enqueue operations. Each operation will use a different reader, so the threads will simultaneously read from different files:

read_and_enqueue_ops = [read_and_push_instance(filename_queue, instance_queue)
                        for i in range(5)]
queue_runner = tf.train.QueueRunner(instance_queue, read_and_enqueue_ops)

The execution phase is then the same as before: first push the names of the files to read, then create a Coordinator and create and start the QueueRunner threads. This time all threads will read from different files simultaneously until all files are read entirely, and then the QueueRunner will close the instance queue so that other ops pulling from it don't get blocked.

Other convenience functions

TensorFlow also offers a few convenience functions to simplify some common tasks when reading training instances. We will go over just a few (see the API documentation for the full list).

The string_input_producer() takes a 1D tensor containing a list of filenames, creates a thread that pushes one filename at a time to the filename queue, and then closes the queue. If you specify a number of epochs, it will cycle through the filenames once per epoch before closing the queue. By default, it shuffles the filenames at each epoch. It creates a QueueRunner to manage its thread, and adds it to the GraphKeys.QUEUE_RUNNERS collection. To start every QueueRunner in that collection, you can call the tf.train.start_queue_runners() function. Note that if you forget to start the QueueRunner, the filename queue will be open and empty, and your readers will be blocked forever.
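For example, here is a minimal sketch (the filenames and the number of epochs are illustrative):

filename_queue = tf.train.string_input_producer(
    ["data_0.csv", "data_1.csv"], num_epochs=10)  # hypothetical filenames
reader = tf.TextLineReader(skip_header_lines=1)
key, value = reader.read(filename_queue)

with tf.Session() as sess:
    sess.run(tf.local_variables_initializer())  # num_epochs uses a local counter
    threads = tf.train.start_queue_runners(sess=sess)
    [...]  # run the graph; the producer's thread feeds the filename queue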

There are a few other producer functions that similarly create a queue and a corresponding QueueRunner for running an enqueue operation (e.g., input_producer(), range_input_producer(), and slice_input_producer()).

The shuffle_batch() function takes a list of tensors (e.g., [features, target]) and creates:

A RandomShuffleQueue

A QueueRunner to enqueue the tensors to the queue (added to the GraphKeys.QUEUE_RUNNERS collection)

A dequeue_many operation to extract a mini-batch from the queue

This makes it easy to manage, in a single process, a multithreaded input pipeline feeding a queue and a training pipeline reading mini-batches from that queue. Also check out the batch(), batch_join(), and shuffle_batch_join() functions that provide similar functionality.
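For example, here is a minimal sketch, reusing the features and target tensors defined earlier (the capacity and batch size are illustrative):

mini_batch_features, mini_batch_targets = tf.train.shuffle_batch(
    [features, target], batch_size=32, capacity=1000, min_after_dequeue=100)

with tf.Session() as sess:
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    [...]  # training loop using the mini-batch tensors
    coord.request_stop()
    coord.join(threads)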

Okay! You now have all the tools you need to start training and running neural networks efficiently across multiple devices and servers on a TensorFlow cluster. Let's review what you have learned:

Using multiple GPU devices

Setting up and starting a TensorFlow cluster

Distributing computations across multiple devices and servers

Sharing variables (and other stateful ops such as queues and readers) across sessions using containers


Coordinating multiple graphs working asynchronously using queues

Reading inputs efficiently using readers, queue runners, and coordinators

Now let's use all of this to parallelize neural networks!


Parallelizing Neural Networks on a TensorFlow Cluster

In this section, first we will look at how to parallelize several neural networks by simply placing each one on a different device. Then we will look at the much trickier problem of training a single neural network across multiple devices and servers.


One Neural Network per Device

The most trivial way to train and run neural networks on a TensorFlow cluster is to take the exact same code you would use for a single device on a single machine, and specify the master server's address when creating the session. That's it, you're done! Your code will be running on the server's default device. You can change the device that will run your graph simply by putting your code's construction phase within a device block.
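For example, here is a minimal sketch (the device and server address are the hypothetical ones used throughout this chapter):

with tf.device("/job:worker/task:0/gpu:0"):  # optional device block
    [...]  # construction phase, exactly as on a single machine

with tf.Session("grpc://machine-a.example.com:2222") as sess:
    [...]  # execution phase, exactly as on a single machine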

By running several client sessions in parallel (in different threads or different processes), connecting them to different servers, and configuring them to use different devices, you can quite easily train or run many neural networks in parallel, across all devices and all machines in your cluster (see Figure 12-11). The speedup is almost linear.4 Training 100 neural networks across 50 servers with 2 GPUs each will not take much longer than training just 1 neural network on 1 GPU.

Figure 12-11. Training one neural network per device

This solution is perfect for hyperparameter tuning: each device in the cluster will train a different model with its own set of hyperparameters. The more computing power you have, the larger the hyperparameter space you can explore.

It also works perfectly if you host a web service that receives a large number of queries per second (QPS) and you need your neural network to make a prediction for each query. Simply replicate the neural network across all devices on the cluster and dispatch queries across all devices. By adding more servers you can handle an unlimited number of QPS (however, this will not reduce the time it takes to process a single request since it will still have to wait for a neural network to make a prediction).


NOTE
Another option is to serve your neural networks using TensorFlow Serving. It is an open source system, released by Google in February 2016, designed to serve a high volume of queries to Machine Learning models (typically built with TensorFlow). It handles model versioning, so you can easily deploy a new version of your network to production, or experiment with various algorithms without interrupting your service, and it can sustain a heavy load by adding more servers. For more details, check out https://tensorflow.github.io/serving/.


In-Graph Versus Between-Graph Replication

You can also parallelize the training of a large ensemble of neural networks by simply placing every neural network on a different device (ensembles were introduced in Chapter 7). However, once you want to run the ensemble, you will need to aggregate the individual predictions made by each neural network to produce the ensemble's prediction, and this requires a bit of coordination.

There are two major approaches to handling a neural network ensemble (or any other graph that contains large chunks of independent computations):

You can create one big graph, containing every neural network, each pinned to a different device, plus the computations needed to aggregate the individual predictions from all the neural networks (see Figure 12-12). Then you just create one session to any server in the cluster and let it take care of everything (including waiting for all individual predictions to be available before aggregating them). This approach is called in-graph replication.

Figure 12-12. In-graph replication

Alternatively, you can create one separate graph for each neural network and handle synchronization between these graphs yourself. This approach is called between-graph replication. One typical implementation is to coordinate the execution of these graphs using queues (see Figure 12-13). A set of clients handles one neural network each, reading from its dedicated input queue, and writing to its dedicated prediction queue. Another client is in charge of reading the inputs and pushing them to all the input queues (copying all inputs to every queue). Finally, one last client is in charge of reading one prediction from each prediction queue and aggregating them to produce the ensemble's prediction.


Figure 12-13. Between-graph replication

These solutions have their pros and cons. In-graph replication is somewhat simpler to implement since you don't have to manage multiple clients and multiple queues. However, between-graph replication is a bit easier to organize into well-bounded and easy-to-test modules. Moreover, it gives you more flexibility. For example, you could add a dequeue timeout in the aggregator client so that the ensemble would not fail even if one of the neural network clients crashes or if one neural network takes too long to produce its prediction. TensorFlow lets you specify a timeout when calling the run() function by passing a RunOptions with timeout_in_ms:

with tf.Session([...]) as sess:
    [...]
    run_options = tf.RunOptions()
    run_options.timeout_in_ms = 1000  # 1s timeout
    try:
        pred = sess.run(dequeue_prediction, options=run_options)
    except tf.errors.DeadlineExceededError as ex:
        [...]  # the dequeue operation timed out after 1s

Another way you can specify a timeout is to set the session's operation_timeout_in_ms configuration option, but in this case the run() function times out if any operation takes longer than the timeout delay:

config = tf.ConfigProto()
config.operation_timeout_in_ms = 1000  # 1s timeout for every operation

with tf.Session([...], config=config) as sess:
    [...]
    try:
        pred = sess.run(dequeue_prediction)
    except tf.errors.DeadlineExceededError as ex:
        [...]  # the dequeue operation timed out after 1s


Model Parallelism

So far we have run each neural network on a single device. What if we want to run a single neural network across multiple devices? This requires chopping your model into separate chunks and running each chunk on a different device. This is called model parallelism. Unfortunately, model parallelism turns out to be pretty tricky, and it really depends on the architecture of your neural network. For fully connected networks, there is generally not much to be gained from this approach (see Figure 12-14). Intuitively, it may seem that an easy way to split the model is to place each layer on a different device, but this does not work since each layer needs to wait for the output of the previous layer before it can do anything. So perhaps you can slice it vertically, for example with the left half of each layer on one device, and the right part on another device? This is slightly better, since both halves of each layer can indeed work in parallel, but the problem is that each half of the next layer requires the output of both halves, so there will be a lot of cross-device communication (represented by the dashed arrows). This is likely to completely cancel out the benefit of the parallel computation, since cross-device communication is slow (especially if it is across separate machines).

Figure 12-14. Splitting a fully connected neural network

However, as we will see in Chapter 13, some neural network architectures, such as convolutional neural networks, contain layers that are only partially connected to the lower layers, so it is much easier to distribute chunks across devices in an efficient way.


Figure 12-15. Splitting a partially connected neural network

Moreover, as we will see in Chapter 14, some deep recurrent neural networks are composed of several layers of memory cells (see the left side of Figure 12-16). A cell's output at time t is fed back to its input at time t + 1 (as you can see more clearly on the right side of Figure 12-16). If you split such a network horizontally, placing each layer on a different device, then at the first step only one device will be active, at the second step two will be active, and by the time the signal propagates to the output layer all devices will be active simultaneously. There is still a lot of cross-device communication going on, but since each cell may be fairly complex, the benefit of running multiple cells in parallel often outweighs the communication penalty.

Figure 12-16. Splitting a deep recurrent neural network


In short, model parallelism can speed up running or training some types of neural networks, but not all, and it requires special care and tuning, such as making sure that devices that need to communicate the most run on the same machine.


Data Parallelism

Another way to parallelize the training of a neural network is to replicate it on each device, run a training step simultaneously on all replicas using a different mini-batch for each, and then aggregate the gradients to update the model parameters. This is called data parallelism (see Figure 12-17).

Figure 12-17. Data parallelism

There are two variants of this approach: synchronous updates and asynchronous updates.

Synchronous updates

With synchronous updates, the aggregator waits for all gradients to be available before computing the average and applying the result (i.e., using the aggregated gradients to update the model parameters). Once a replica has finished computing its gradients, it must wait for the parameters to be updated before it can proceed to the next mini-batch. The downside is that some devices may be slower than others, so all other devices will have to wait for them at every step. Moreover, the parameters will be copied to every device almost at the same time (immediately after the gradients are applied), which may saturate the parameter servers' bandwidth.

TIP
To reduce the waiting time at each step, you could ignore the gradients from the slowest few replicas (typically ~10%). For example, you could run 20 replicas, but only aggregate the gradients from the fastest 18 replicas at each step, and just ignore the gradients from the last 2. As soon as the parameters are updated, the first 18 replicas can start working again immediately, without having to wait for the 2 slowest replicas. This setup is generally described as having 18 replicas plus 2 spare replicas.5


Asynchronous updates

With asynchronous updates, whenever a replica has finished computing the gradients, it immediately uses them to update the model parameters. There is no aggregation (remove the "mean" step in Figure 12-17), and no synchronization. Replicas just work independently of the other replicas. Since there is no waiting for the other replicas, this approach runs more training steps per minute. Moreover, although the parameters still need to be copied to every device at every step, this happens at different times for each replica so the risk of bandwidth saturation is reduced.

Data parallelism with asynchronous updates is an attractive choice, because of its simplicity, the absence of synchronization delay, and a better use of the bandwidth. However, although it works reasonably well in practice, it is almost surprising that it works at all! Indeed, by the time a replica has finished computing the gradients based on some parameter values, these parameters will have been updated several times by other replicas (on average N – 1 times if there are N replicas) and there is no guarantee that the computed gradients will still be pointing in the right direction (see Figure 12-18). When gradients are severely out-of-date, they are called stale gradients: they can slow down convergence, introducing noise and wobble effects (the learning curve may contain temporary oscillations), or they can even make the training algorithm diverge.

Figure 12-18. Stale gradients when using asynchronous updates

There are a few ways to reduce the effect of stale gradients:

Reduce the learning rate.

Drop stale gradients or scale them down.

Adjust the mini-batch size.

Start the first few epochs using just one replica (this is called the warmup phase). Stale gradients tend to be more damaging at the beginning of training, when gradients are typically large and the parameters have not settled into a valley of the cost function yet, so different replicas may push the parameters in quite different directions.


A paper published by the Google Brain team in April 2016 benchmarked various approaches and found that data parallelism with synchronous updates using a few spare replicas was the most efficient, not only converging faster but also producing a better model. However, this is still an active area of research, so you should not rule out asynchronous updates quite yet.

Bandwidth saturation

Whether you use synchronous or asynchronous updates, data parallelism still requires communicating the model parameters from the parameter servers to every replica at the beginning of every training step, and the gradients in the other direction at the end of each training step. Unfortunately, this means that there always comes a point when adding an extra GPU will not improve performance at all because the time spent moving the data in and out of GPU RAM (and possibly across the network) will outweigh the speedup obtained by splitting the computation load. At that point, adding more GPUs will just increase saturation and slow down training.

TIP
For some models, typically relatively small and trained on a very large training set, you are often better off training the model on a single machine with a single GPU.

Saturation is more severe for large dense models, since they have a lot of parameters and gradients to transfer. It is less severe for small models (but the parallelization gain is small) and also for large sparse models since the gradients are typically mostly zeros, so they can be communicated efficiently. Jeff Dean, initiator and lead of the Google Brain project, reported typical speedups of 25–40x when distributing computations across 50 GPUs for dense models, and a 300x speedup for sparser models trained across 500 GPUs. As you can see, sparse models really do scale better. Here are a few concrete examples:

Neural Machine Translation: 6x speedup on 8 GPUs

Inception/ImageNet: 32x speedup on 50 GPUs

RankBrain: 300x speedup on 500 GPUs

These numbers represent the state of the art in Q1 2016. Beyond a few dozen GPUs for a dense model or a few hundred GPUs for a sparse model, saturation kicks in and performance degrades. There is plenty of research going on to solve this problem (exploring peer-to-peer architectures rather than centralized parameter servers, using lossy model compression, optimizing when and what the replicas need to communicate, and so on), so there will likely be a lot of progress in parallelizing neural networks in the next few years.

In the meantime, here are a few simple steps you can take to reduce the saturation problem:

Group your GPUs on a few servers rather than scattering them across many servers. This will avoid unnecessary network hops.

Shard the parameters across multiple parameter servers (as discussed earlier).


Drop the model parameters' float precision from 32 bits (tf.float32) to 16 bits (tf.bfloat16). This will cut in half the amount of data to transfer, without much impact on the convergence rate or the model's performance.
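For example, here is a minimal sketch of this idea (params stands for a hypothetical float32 parameter tensor):

params_16 = tf.cast(params, tf.bfloat16)    # half the bytes on the wire
params_32 = tf.cast(params_16, tf.float32)  # restore full width to compute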

TIP
Although 16-bit precision is the minimum for training neural networks, you can actually drop down to 8-bit precision after training to reduce the size of the model and speed up computations. This is called quantizing the neural network. It is particularly useful for deploying and running pretrained models on mobile phones. See Pete Warden's great post on the subject.

TensorFlow implementation

To implement data parallelism using TensorFlow, you first need to choose whether you want in-graph replication or between-graph replication, and whether you want synchronous updates or asynchronous updates. Let's look at how you would implement each combination (see the exercises and the Jupyter notebooks for complete code examples).

With in-graph replication + synchronous updates, you build one big graph containing all the model replicas (placed on different devices), and a few nodes to aggregate all their gradients and feed them to an optimizer. Your code opens a session to the cluster and simply runs the training operation repeatedly.

With in-graph replication + asynchronous updates, you also create one big graph, but with one optimizer per replica, and you run one thread per replica, repeatedly running the replica's optimizer.

With between-graph replication + asynchronous updates, you run multiple independent clients (typically in separate processes), each training the model replica as if it were alone in the world, but the parameters are actually shared with other replicas (using a resource container).

With between-graph replication + synchronous updates, once again you run multiple clients, each training a model replica based on shared parameters, but this time you wrap the optimizer (e.g., a MomentumOptimizer) within a SyncReplicasOptimizer. Each replica uses this optimizer as it would use any other optimizer, but under the hood this optimizer sends the gradients to a set of queues (one per variable), which is read by one of the replicas' SyncReplicasOptimizer, called the chief. The chief aggregates the gradients and applies them, then writes a token to a token queue for each replica, signaling it that it can go ahead and compute the next gradients. This approach supports having spare replicas.
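For example, here is a minimal sketch of wrapping an optimizer this way (the replica counts match the spare-replicas example from earlier; loss and global_step are assumed to be defined by the replica's training graph):

optimizer = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9)
sync_optimizer = tf.train.SyncReplicasOptimizer(
    optimizer, replicas_to_aggregate=18, total_num_replicas=20)
training_op = sync_optimizer.minimize(loss, global_step=global_step)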

If you go through the exercises, you will implement each of these four solutions. You will easily be able to apply what you have learned to train large deep neural networks across dozens of servers and GPUs! In the following chapters we will go through a few more important neural network architectures before we tackle Reinforcement Learning.


Exercises

1. If you get a CUDA_ERROR_OUT_OF_MEMORY when starting your TensorFlow program, what is probably going on? What can you do about it?

2. What is the difference between pinning an operation on a device and placing an operation on a device?

3. If you are running on a GPU-enabled TensorFlow installation, and you just use the default placement, will all operations be placed on the first GPU?

4. If you pin a variable to "/gpu:0", can it be used by operations placed on /gpu:1? Or by operations placed on "/cpu:0"? Or by operations pinned to devices located on other servers?

5. Can two operations placed on the same device run in parallel?

6. What is a control dependency and when would you want to use one?

7. Suppose you train a DNN for days on a TensorFlow cluster, and immediately after your training program ends you realize that you forgot to save the model using a Saver. Is your trained model lost?

8. Train several DNNs in parallel on a TensorFlow cluster, using different hyperparameter values. This could be DNNs for MNIST classification or any other task you are interested in. The simplest option is to write a single client program that trains only one DNN, then run this program in multiple processes in parallel, with different hyperparameter values for each client. The program should have command-line options to control what server and device the DNN should be placed on, and what resource container and hyperparameter values to use (make sure to use a different resource container for each DNN). Use a validation set or cross-validation to select the top three models.

9. Create an ensemble using the top three models from the previous exercise. Define it in a single graph, ensuring that each DNN runs on a different device. Evaluate it on the validation set: does the ensemble perform better than the individual DNNs?

10. Train a DNN using between-graph replication and data parallelism with asynchronous updates, timing how long it takes to reach a satisfying performance. Next, try again using synchronous updates. Do synchronous updates produce a better model? Is training faster? Split the DNN vertically and place each vertical slice on a different device, and train the model again. Is training any faster? Is the performance any different?

Solutions to these exercises are available in Appendix A.

1. "TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems," Google Research (2015).

2. You can even start multiple tasks in the same process. It may be useful for tests, but it is not recommended in production.

3. It is the next version of Google's internal Stubby service, which Google has used successfully for over a decade. See http://grpc.io/ for more details.

4. Not 100% linear if you wait for all devices to finish, since the total time will be the time taken by the slowest device.

5. This name is slightly confusing since it sounds like some replicas are special, doing nothing. In reality, all replicas are equivalent: they all work hard to be among the fastest at each training step, and the losers vary at every step (unless some devices are really slower than others).


Chapter 13. Convolutional Neural Networks

Although IBM's Deep Blue supercomputer beat the chess world champion Garry Kasparov back in 1997, until quite recently computers were unable to reliably perform seemingly trivial tasks such as detecting a puppy in a picture or recognizing spoken words. Why are these tasks so effortless to us humans? The answer lies in the fact that perception largely takes place outside the realm of our consciousness, within specialized visual, auditory, and other sensory modules in our brains. By the time sensory information reaches our consciousness, it is already adorned with high-level features; for example, when you look at a picture of a cute puppy, you cannot choose not to see the puppy, or not to notice its cuteness. Nor can you explain how you recognize a cute puppy; it's just obvious to you. Thus, we cannot trust our subjective experience: perception is not trivial at all, and to understand it we must look at how the sensory modules work.

Convolutional neural networks (CNNs) emerged from the study of the brain's visual cortex, and they have been used in image recognition since the 1980s. In the last few years, thanks to the increase in computational power, the amount of available training data, and the tricks presented in Chapter 11 for training deep nets, CNNs have managed to achieve superhuman performance on some complex visual tasks. They power image search services, self-driving cars, automatic video classification systems, and more. Moreover, CNNs are not restricted to visual perception: they are also successful at other tasks, such as voice recognition or natural language processing (NLP); however, we will focus on visual applications for now.

In this chapter we will present where CNNs came from, what their building blocks look like, and how to implement them using TensorFlow. Then we will present some of the best CNN architectures.


The Architecture of the Visual Cortex

David H. Hubel and Torsten Wiesel performed a series of experiments on cats in 19581 and 19592 (and a few years later on monkeys3), giving crucial insights on the structure of the visual cortex (the authors received the Nobel Prize in Physiology or Medicine in 1981 for their work). In particular, they showed that many neurons in the visual cortex have a small local receptive field, meaning they react only to visual stimuli located in a limited region of the visual field (see Figure 13-1, in which the local receptive fields of five neurons are represented by dashed circles). The receptive fields of different neurons may overlap, and together they tile the whole visual field. Moreover, the authors showed that some neurons react only to images of horizontal lines, while others react only to lines with different orientations (two neurons may have the same receptive field but react to different line orientations). They also noticed that some neurons have larger receptive fields, and they react to more complex patterns that are combinations of the lower-level patterns. These observations led to the idea that the higher-level neurons are based on the outputs of neighboring lower-level neurons (in Figure 13-1, notice that each neuron is connected only to a few neurons from the previous layer). This powerful architecture is able to detect all sorts of complex patterns in any area of the visual field.

Figure 13-1. Local receptive fields in the visual cortex

These studies of the visual cortex inspired the neocognitron, introduced in 1980,4 which gradually evolved into what we now call convolutional neural networks. An important milestone was a 1998 paper5 by Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner, which introduced the famous LeNet-5 architecture, widely used to recognize handwritten check numbers. This architecture has some building blocks that you already know, such as fully connected layers and sigmoid activation functions, but it also introduces two new building blocks: convolutional layers and pooling layers. Let's look at them now.

NOTE
Why not simply use a regular deep neural network with fully connected layers for image recognition tasks? Unfortunately, although this works fine for small images (e.g., MNIST), it breaks down for larger images because of the huge number of parameters it requires. For example, a 100 × 100 image has 10,000 pixels, and if the first layer has just 1,000 neurons (which already severely restricts the amount of information transmitted to the next layer), this means a total of 10 million connections. And that's just the first layer. CNNs solve this problem using partially connected layers.


Convolutional Layer

The most important building block of a CNN is the convolutional layer:6 neurons in the first convolutional layer are not connected to every single pixel in the input image (like they were in previous chapters), but only to pixels in their receptive fields (see Figure 13-2). In turn, each neuron in the second convolutional layer is connected only to neurons located within a small rectangle in the first layer. This architecture allows the network to concentrate on low-level features in the first hidden layer, then assemble them into higher-level features in the next hidden layer, and so on. This hierarchical structure is common in real-world images, which is one of the reasons why CNNs work so well for image recognition.

Figure 13-2. CNN layers with rectangular local receptive fields

NOTE
Until now, all multilayer neural networks we looked at had layers composed of a long line of neurons, and we had to flatten input images to 1D before feeding them to the neural network. Now each layer is represented in 2D, which makes it easier to match neurons with their corresponding inputs.

A neuron located in row i, column j of a given layer is connected to the outputs of the neurons in the previous layer located in rows i to i + fh – 1, columns j to j + fw – 1, where fh and fw are the height and width of the receptive field (see Figure 13-3). In order for a layer to have the same height and width as the previous layer, it is common to add zeros around the inputs, as shown in the diagram. This is called zero padding.


Figure 13-3. Connections between layers and zero padding

It is also possible to connect a large input layer to a much smaller layer by spacing out the receptive fields, as shown in Figure 13-4. The distance between two consecutive receptive fields is called the stride. In the diagram, a 5 × 7 input layer (plus zero padding) is connected to a 3 × 4 layer, using 3 × 3 receptive fields and a stride of 2 (in this example the stride is the same in both directions, but it does not have to be so). A neuron located in row i, column j in the upper layer is connected to the outputs of the neurons in the previous layer located in rows i × sh to i × sh + fh – 1, columns j × sw to j × sw + fw – 1, where sh and sw are the vertical and horizontal strides.

Figure 13-4. Reducing dimensionality using a stride


Filters

A neuron's weights can be represented as a small image the size of the receptive field. For example, Figure 13-5 shows two possible sets of weights, called filters (or convolution kernels). The first one is represented as a black square with a vertical white line in the middle (it is a 7 × 7 matrix full of 0s except for the central column, which is full of 1s); neurons using these weights will ignore everything in their receptive field except for the central vertical line (since all inputs will get multiplied by 0, except for the ones located in the central vertical line). The second filter is a black square with a horizontal white line in the middle. Once again, neurons using these weights will ignore everything in their receptive field except for the central horizontal line.

Now if all neurons in a layer use the same vertical line filter (and the same bias term), and you feed the network the input image shown in Figure 13-5 (bottom image), the layer will output the top-left image. Notice that the vertical white lines get enhanced while the rest gets blurred. Similarly, the upper-right image is what you get if all neurons use the horizontal line filter; notice that the horizontal white lines get enhanced while the rest is blurred out. Thus, a layer full of neurons using the same filter gives you a feature map, which highlights the areas in an image that are most similar to the filter. During training, a CNN finds the most useful filters for its task, and it learns to combine them into more complex patterns (e.g., a cross is an area in an image where both the vertical filter and the horizontal filter are active).

Figure 13-5. Applying two different filters to get two feature maps


Stacking Multiple Feature Maps

Up to now, for simplicity, we have represented each convolutional layer as a thin 2D layer, but in reality it is composed of several feature maps of equal sizes, so it is more accurately represented in 3D (see Figure 13-6). Within one feature map, all neurons share the same parameters (weights and bias term), but different feature maps may have different parameters. A neuron's receptive field is the same as described earlier, but it extends across all the previous layer's feature maps. In short, a convolutional layer simultaneously applies multiple filters to its inputs, making it capable of detecting multiple features anywhere in its inputs.

NOTE
The fact that all neurons in a feature map share the same parameters dramatically reduces the number of parameters in the model, but most importantly it means that once the CNN has learned to recognize a pattern in one location, it can recognize it in any other location. In contrast, once a regular DNN has learned to recognize a pattern in one location, it can recognize it only in that particular location.

Moreover, input images are also composed of multiple sublayers: one per color channel. There are typically three: red, green, and blue (RGB). Grayscale images have just one channel, but some images may have many more (for example, satellite images that capture extra light frequencies, such as infrared).


Figure 13-6. Convolution layers with multiple feature maps, and images with three channels

Specifically, a neuron located in row i, column j of the feature map k in a given convolutional layer l is connected to the outputs of the neurons in the previous layer l – 1, located in rows i × sh to i × sh + fh – 1 and columns j × sw to j × sw + fw – 1, across all feature maps (in layer l – 1). Note that all neurons located in the same row i and column j but in different feature maps are connected to the outputs of the exact same neurons in the previous layer.

Equation 13-1 summarizes the preceding explanations in one big mathematical equation: it shows how to compute the output of a given neuron in a convolutional layer. It is a bit ugly due to all the different indices, but all it does is calculate the weighted sum of all the inputs, plus the bias term.

Equation 13-1. Computing the output of a neuron in a convolutional layer

z_{i,j,k} = b_k + Σ_{u=0}^{fh–1} Σ_{v=0}^{fw–1} Σ_{k′=0}^{fn′–1} x_{i′,j′,k′} · w_{u,v,k′,k}    with i′ = i × sh + u and j′ = j × sw + v

z_{i,j,k} is the output of the neuron located in row i, column j in feature map k of the convolutional layer (layer l).

As explained earlier, sh and sw are the vertical and horizontal strides, fh and fw are the height and width of the receptive field, and fn′ is the number of feature maps in the previous layer (layer l – 1).

x_{i′,j′,k′} is the output of the neuron located in layer l – 1, row i′, column j′, feature map k′ (or channel k′ if the previous layer is the input layer).

b_k is the bias term for feature map k (in layer l). You can think of it as a knob that tweaks the overall brightness of the feature map k.

w_{u,v,k′,k} is the connection weight between any neuron in feature map k of the layer l and its input located at row u, column v (relative to the neuron's receptive field), and feature map k′.
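To make the indexing concrete, here is a minimal NumPy sketch of Equation 13-1 (a naive triple loop, not an efficient implementation; the function name and the assumption of no zero padding are ours):

import numpy as np

def conv_layer_output(inputs, weights, biases, sh, sw):
    """inputs: [height, width, fn'], weights: [fh, fw, fn', fn], biases: [fn]."""
    fh, fw, fn_prev, fn = weights.shape
    out_h = (inputs.shape[0] - fh) // sh + 1  # no zero padding here
    out_w = (inputs.shape[1] - fw) // sw + 1
    z = np.zeros((out_h, out_w, fn))
    for i in range(out_h):
        for j in range(out_w):
            # receptive field of neuron (i, j), across all input feature maps
            patch = inputs[i*sh : i*sh + fh, j*sw : j*sw + fw, :]
            for k in range(fn):
                z[i, j, k] = np.sum(patch * weights[:, :, :, k]) + biases[k]
    return z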


TensorFlow Implementation

In TensorFlow, each input image is typically represented as a 3D tensor of shape [height, width, channels]. A mini-batch is represented as a 4D tensor of shape [mini-batch size, height, width, channels]. The weights of a convolutional layer are represented as a 4D tensor of shape [fh, fw, fn′, fn]. The bias terms of a convolutional layer are simply represented as a 1D tensor of shape [fn].

Let's look at a simple example. The following code loads two sample images, using Scikit-Learn's load_sample_image() (one image of a Chinese temple, and the other of a flower). Then it creates two 7 × 7 filters (one with a vertical white line in the middle, and the other with a horizontal white line in the middle), and applies them to both images using a convolutional layer built using TensorFlow's tf.nn.conv2d() function (with zero padding and a stride of 2). Finally, it plots one of the resulting feature maps (similar to the top-right image in Figure 13-5).

import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
from sklearn.datasets import load_sample_image

# Load sample images
china = load_sample_image("china.jpg")
flower = load_sample_image("flower.jpg")
dataset = np.array([china, flower], dtype=np.float32)
batch_size, height, width, channels = dataset.shape

# Create 2 filters
filters = np.zeros(shape=(7, 7, channels, 2), dtype=np.float32)
filters[:, 3, :, 0] = 1  # vertical line
filters[3, :, :, 1] = 1  # horizontal line

# Create a graph with input X plus a convolutional layer applying the 2 filters
X = tf.placeholder(tf.float32, shape=(None, height, width, channels))
convolution = tf.nn.conv2d(X, filters, strides=[1, 2, 2, 1], padding="SAME")

with tf.Session() as sess:
    output = sess.run(convolution, feed_dict={X: dataset})

plt.imshow(output[0, :, :, 1], cmap="gray")  # plot 1st image's 2nd feature map
plt.show()

Most of this code is self-explanatory, but the tf.nn.conv2d() line deserves a bit of explanation:

X is the input mini-batch (a 4D tensor, as explained earlier).

filters is the set of filters to apply (also a 4D tensor, as explained earlier).

strides is a four-element 1D array, where the two central elements are the vertical and horizontal strides (sh and sw). The first and last elements must currently be equal to 1. They may one day be used to specify a batch stride (to skip some instances) and a channel stride (to skip some of the previous layer's feature maps or channels).

padding must be either "VALID" or "SAME": If set to "VALID", the convolutional layer does not use zero padding, and may ignore some rows and columns at the bottom and right of the input image, depending on the stride, as shown in Figure 13-7 (for simplicity, only the horizontal dimension is shown here, but of course the same logic applies to the vertical dimension).


If set to "SAME", the convolutional layer uses zero padding if necessary. In this case, the number of output neurons is equal to the number of input neurons divided by the stride, rounded up (in this example, ceil(13 / 5) = 3). Then zeros are added as evenly as possible around the inputs.

Figure 13-7. Padding options—input width: 13, filter width: 6, stride: 5

In this simple example, we manually created the filters, but in a real CNN you would let the training algorithm discover the best filters automatically. TensorFlow has a tf.layers.conv2d() function which creates the filters variable for you (called kernel), and initializes it randomly. For example, the following code creates an input placeholder followed by a convolutional layer with two 7 × 7 feature maps, using 2 × 2 strides (note that this function only expects the vertical and horizontal strides), and "SAME" padding:

X = tf.placeholder(shape=(None, height, width, channels), dtype=tf.float32)
conv = tf.layers.conv2d(X, filters=2, kernel_size=7, strides=[2, 2],
                        padding="SAME")

Unfortunately, convolutional layers have quite a few hyperparameters: you must choose the number of filters, their height and width, the strides, and the padding type. As always, you can use cross-validation to find the right hyperparameter values, but this is very time-consuming. We will discuss common CNN architectures later, to give you some idea of what hyperparameter values work best in practice.


Memory Requirements

Another problem with CNNs is that the convolutional layers require a huge amount of RAM, especially during training, because the reverse pass of backpropagation requires all the intermediate values computed during the forward pass.

For example, consider a convolutional layer with 5 × 5 filters, outputting 200 feature maps of size 150 × 100, with stride 1 and SAME padding. If the input is a 150 × 100 RGB image (three channels), then the number of parameters is (5 × 5 × 3 + 1) × 200 = 15,200 (the + 1 corresponds to the bias terms), which is fairly small compared to a fully connected layer.7 However, each of the 200 feature maps contains 150 × 100 neurons, and each of these neurons needs to compute a weighted sum of its 5 × 5 × 3 = 75 inputs: that's a total of 225 million float multiplications. Not as bad as a fully connected layer, but still quite computationally intensive. Moreover, if the feature maps are represented using 32-bit floats, then the convolutional layer's output will occupy 200 × 150 × 100 × 32 = 96 million bits (about 11.4 MB) of RAM.8 And that's just for one instance! If a training batch contains 100 instances, then this layer will use up over 1 GB of RAM!
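You can reproduce this arithmetic with a few lines of Python (just a sanity check of the numbers above):

params = (5 * 5 * 3 + 1) * 200             # 15,200 parameters
mults = 200 * 150 * 100 * (5 * 5 * 3)      # 225,000,000 float multiplications
output_bits = 200 * 150 * 100 * 32         # 96,000,000 bits
output_mb = output_bits / 8 / 1024 / 1024  # about 11.4 MB per instance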

During inference (i.e., when making a prediction for a new instance) the RAM occupied by one layer can be released as soon as the next layer has been computed, so you only need as much RAM as required by two consecutive layers. But during training everything computed during the forward pass needs to be preserved for the reverse pass, so the amount of RAM needed is (at least) the total amount of RAM required by all layers.

TIP
If training crashes because of an out-of-memory error, you can try reducing the mini-batch size. Alternatively, you can try reducing dimensionality using a stride, or removing a few layers. Or you can try using 16-bit floats instead of 32-bit floats. Or you could distribute the CNN across multiple devices.

Now let's look at the second common building block of CNNs: the pooling layer.


Pooling Layer

Once you understand how convolutional layers work, the pooling layers are quite easy to grasp. Their goal is to subsample (i.e., shrink) the input image in order to reduce the computational load, the memory usage, and the number of parameters (thereby limiting the risk of overfitting). Reducing the input image size also makes the neural network tolerate a little bit of image shift (location invariance).

Just like in convolutional layers, each neuron in a pooling layer is connected to the outputs of a limited number of neurons in the previous layer, located within a small rectangular receptive field. You must define its size, the stride, and the padding type, just like before. However, a pooling neuron has no weights; all it does is aggregate the inputs using an aggregation function such as the max or mean. Figure 13-8 shows a max pooling layer, which is the most common type of pooling layer. In this example, we use a 2 × 2 pooling kernel, a stride of 2, and no padding. Note that only the max input value in each kernel makes it to the next layer. The other inputs are dropped.

Figure 13-8. Max pooling layer (2 × 2 pooling kernel, stride 2, no padding)

This is obviously a very destructive kind of layer: even with a tiny 2 × 2 kernel and a stride of 2, the output will be two times smaller in both directions (so its area will be four times smaller), simply dropping 75% of the input values.

A pooling layer typically works on every input channel independently, so the output depth is the same as the input depth. You may alternatively pool over the depth dimension, as we will see next, in which case the image's spatial dimensions (height and width) remain unchanged, but the number of channels is reduced.

Implementing a max pooling layer in TensorFlow is quite easy. The following code creates a max pooling layer using a 2 × 2 kernel, stride 2, and no padding, then applies it to all the images in the dataset:

[...]  # load the image dataset, just like above

# Create a graph with input X plus a max pooling layer
X = tf.placeholder(tf.float32, shape=(None, height, width, channels))
max_pool = tf.nn.max_pool(X, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1],
                          padding="VALID")

with tf.Session() as sess:
    output = sess.run(max_pool, feed_dict={X: dataset})

plt.imshow(output[0].astype(np.uint8))  # plot the output for the 1st image
plt.show()

The ksize argument contains the kernel shape along all four dimensions of the input tensor: [batch size, height, width, channels]. TensorFlow currently does not support pooling over multiple instances, so the first element of ksize must be equal to 1. Moreover, it does not support pooling over both the spatial dimensions (height and width) and the depth dimension, so either ksize[1] and ksize[2] must both be equal to 1, or ksize[3] must be equal to 1.

To create an average pooling layer, just use the avg_pool() function instead of max_pool().
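For instance, here is an average pooling layer equivalent to the max pooling layer above, plus a sketch of depthwise pooling (assuming the number of channels is divisible by 3, as with RGB images) that keeps the spatial dimensions but divides the depth by 3:

avg_pool = tf.nn.avg_pool(X, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1],
                          padding="VALID")

# depthwise max pooling: spatial dims unchanged, channels divided by 3
depth_pool = tf.nn.max_pool(X, ksize=[1, 1, 1, 3], strides=[1, 1, 1, 3],
                            padding="VALID")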

Now you know all the building blocks to create a convolutional neural network. Let's see how to assemble them.


CNN Architectures

Typical CNN architectures stack a few convolutional layers (each one generally followed by a ReLU layer), then a pooling layer, then another few convolutional layers (+ReLU), then another pooling layer, and so on. The image gets smaller and smaller as it progresses through the network, but it also typically gets deeper and deeper (i.e., with more feature maps) thanks to the convolutional layers (see Figure 13-9). At the top of the stack, a regular feedforward neural network is added, composed of a few fully connected layers (+ReLUs), and the final layer outputs the prediction (e.g., a softmax layer that outputs estimated class probabilities).

Figure 13-9. Typical CNN architecture

TIP
A common mistake is to use convolution kernels that are too large. You can often get the same effect as a 5 × 5 kernel by stacking two 3 × 3 kernels on top of each other, for a lot less compute.

Over the years, variants of this fundamental architecture have been developed, leading to amazing advances in the field. A good measure of this progress is the error rate in competitions such as the ILSVRC ImageNet challenge. In this competition the top-5 error rate for image classification fell from over 26% to barely over 3% in just five years. The top-five error rate is the number of test images for which the system's top 5 predictions did not include the correct answer. The images are large (256 pixels high) and there are 1,000 classes, some of which are really subtle (try distinguishing 120 dog breeds). Looking at the evolution of the winning entries is a good way to understand how CNNs work.

We will first look at the classical LeNet-5 architecture (1998), then three of the winners of the ILSVRC challenge: AlexNet (2012), GoogLeNet (2014), and ResNet (2015).

OTHER VISUAL TASKS

There was stunning progress as well in other visual tasks such as object detection and localization, and image segmentation. In object detection and localization, the neural network typically outputs a sequence of bounding boxes around various objects in the image. For example, see Maxime Oquab et al.'s 2015 paper that outputs a heat map for each object class, or Russell Stewart et al.'s 2015 paper that uses a combination of a CNN to detect faces and a recurrent neural network to output a sequence of bounding boxes around them. In image segmentation, the net outputs an image (usually of the same size as the input) where each pixel indicates the class of the object to which the corresponding input pixel belongs. For example, check out Evan Shelhamer et al.'s 2016 paper.


LeNet-5

The LeNet-5 architecture is perhaps the most widely known CNN architecture. As mentioned earlier, it was created by Yann LeCun in 1998 and widely used for handwritten digit recognition (MNIST). It is composed of the layers shown in Table 13-1.

Table 13-1. LeNet-5 architecture

Layer  Type             Maps  Size     Kernel size  Stride  Activation
Out    Fully Connected  –     10       –            –       RBF
F6     Fully Connected  –     84       –            –       tanh
C5     Convolution      120   1×1      5×5          1       tanh
S4     Avg Pooling      16    5×5      2×2          2       tanh
C3     Convolution      16    10×10    5×5          1       tanh
S2     Avg Pooling      6     14×14    2×2          2       tanh
C1     Convolution      6     28×28    5×5          1       tanh
In     Input            1     32×32    –            –       –

There are a few extra details to be noted:

MNIST images are 28 × 28 pixels, but they are zero-padded to 32 × 32 pixels and normalized before being fed to the network. The rest of the network does not use any padding, which is why the size keeps shrinking as the image progresses through the network.

The average pooling layers are slightly more complex than usual: each neuron computes the mean of its inputs, then multiplies the result by a learnable coefficient (one per map) and adds a learnable bias term (again, one per map), then finally applies the activation function.

Most neurons in C3 maps are connected to neurons in only three or four S2 maps (instead of all six S2 maps). See table 1 in the original paper for details.

The output layer is a bit special: instead of computing the dot product of the inputs and the weight vector, each neuron outputs the square of the Euclidean distance between its input vector and its weight vector. Each output measures how much the image belongs to a particular digit class. The cross entropy cost function is now preferred, as it penalizes bad predictions much more, producing larger gradients and thus converging faster.

Yann LeCun's website ("LENET" section) features great demos of LeNet-5 classifying digits.


AlexNet

The AlexNet CNN architecture9 won the 2012 ImageNet ILSVRC challenge by a large margin: it achieved 17% top-5 error rate while the second best achieved only 26%! It was developed by Alex Krizhevsky (hence the name), Ilya Sutskever, and Geoffrey Hinton. It is quite similar to LeNet-5, only much larger and deeper, and it was the first to stack convolutional layers directly on top of each other, instead of stacking a pooling layer on top of each convolutional layer. Table 13-2 presents this architecture.

Table 13-2. AlexNet architecture

Layer  Type             Maps     Size     Kernel size  Stride  Padding  Activation
Out    Fully Connected  –        1,000    –            –       –        Softmax
F9     Fully Connected  –        4,096    –            –       –        ReLU
F8     Fully Connected  –        4,096    –            –       –        ReLU
C7     Convolution      256      13×13    3×3          1       SAME     ReLU
C6     Convolution      384      13×13    3×3          1       SAME     ReLU
C5     Convolution      384      13×13    3×3          1       SAME     ReLU
S4     Max Pooling      256      13×13    3×3          2       VALID    –
C3     Convolution      256      27×27    5×5          1       SAME     ReLU
S2     Max Pooling      96       27×27    3×3          2       VALID    –
C1     Convolution      96       55×55    11×11        4       SAME     ReLU
In     Input            3 (RGB)  224×224  –            –       –        –

To reduce overfitting, the authors used two regularization techniques we discussed in previous chapters: first they applied dropout (with a 50% dropout rate) during training to the outputs of layers F8 and F9. Second, they performed data augmentation by randomly shifting the training images by various offsets, flipping them horizontally, and changing the lighting conditions.

AlexNet also uses a competitive normalization step immediately after the ReLU step of layers C1 and C3, called local response normalization. This form of normalization makes the neurons that most strongly activate inhibit neurons at the same location but in neighboring feature maps (such competitive activation has been observed in biological neurons). This encourages different feature maps to specialize, pushing them apart and forcing them to explore a wider range of features, ultimately improving generalization. Equation 13-2 shows how to apply LRN.

Equation 13-2. Local response normalization

b_i = a_i · (k + α · Σ_{j=j_low}^{j_high} a_j²)^(–β)    with j_high = min(i + r/2, fn – 1) and j_low = max(0, i – r/2)

b_i is the normalized output of the neuron located in feature map i, at some row u and column v (note that in this equation we consider only neurons located at this row and column, so u and v are not shown).

a_i is the activation of that neuron after the ReLU step, but before normalization.

k, α, β, and r are hyperparameters. k is called the bias, and r is called the depth radius.

fn is the number of feature maps.

For example, if r = 2 and a neuron has a strong activation, it will inhibit the activation of the neurons located in the feature maps immediately above and below its own.

In AlexNet, the hyperparameters are set as follows: r = 2, α = 0.00002, β = 0.75, and k = 1. This step can be implemented using TensorFlow's tf.nn.local_response_normalization() operation.
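For example, the following one-liner sketches this step with AlexNet's hyperparameter values (conv1_relu stands for the output of the ReLU step of layer C1; the name is ours):

lrn1 = tf.nn.local_response_normalization(conv1_relu, depth_radius=2,
                                          bias=1, alpha=0.00002, beta=0.75)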

A variant of AlexNet called ZFNet was developed by Matthew Zeiler and Rob Fergus and won the 2013 ILSVRC challenge. It is essentially AlexNet with a few tweaked hyperparameters (number of feature maps, kernel size, stride, etc.).


GoogLeNet

The GoogLeNet architecture was developed by Christian Szegedy et al. from Google Research,10 and it won the ILSVRC 2014 challenge by pushing the top-5 error rate below 7%. This great performance came in large part from the fact that the network was much deeper than previous CNNs (see Figure 13-11). This was made possible by sub-networks called inception modules,11 which allow GoogLeNet to use parameters much more efficiently than previous architectures: GoogLeNet actually has 10 times fewer parameters than AlexNet (roughly 6 million instead of 60 million).

Figure 13-10 shows the architecture of an inception module. The notation "3 × 3 + 2(S)" means that the layer uses a 3 × 3 kernel, stride 2, and SAME padding. The input signal is first copied and fed to four different layers. All convolutional layers use the ReLU activation function. Note that the second set of convolutional layers uses different kernel sizes (1 × 1, 3 × 3, and 5 × 5), allowing them to capture patterns at different scales. Also note that every single layer uses a stride of 1 and SAME padding (even the max pooling layer), so their outputs all have the same height and width as their inputs. This makes it possible to concatenate all the outputs along the depth dimension in the final depth concat layer (i.e., stack the feature maps from all four top convolutional layers). This concatenation layer can be implemented in TensorFlow using the tf.concat() operation, with axis=3 (axis 3 is the depth).

Figure 13-10. Inception module

You may wonder why inception modules have convolutional layers with 1 × 1 kernels. Surely these layers cannot capture any features, since they look at only one pixel at a time? In fact, these layers serve two purposes:

First, they are configured to output many fewer feature maps than their inputs, so they serve as bottleneck layers, meaning they reduce dimensionality. This is particularly useful before the 3 × 3 and 5 × 5 convolutions, since these are very computationally expensive layers.

Second, each pair of convolutional layers ([1 × 1, 3 × 3] and [1 × 1, 5 × 5]) acts like a single, powerful convolutional layer, capable of capturing more complex patterns. Indeed, instead of sweeping a simple linear classifier across the image (as a single convolutional layer does), this pair of convolutional layers sweeps a two-layer neural network across the image.

In short, you can think of the whole inception module as a convolutional layer on steroids, able to output feature maps that capture complex patterns at various scales.

WARNING
The number of convolutional kernels for each convolutional layer is a hyperparameter. Unfortunately, this means that you have six more hyperparameters to tweak for every inception layer you add.

Nowlet’slookatthearchitectureoftheGoogLeNetCNN(seeFigure13-11).Itissodeepthatwehadtorepresentitinthreecolumns,butGoogLeNetisactuallyonetallstack,includingnineinceptionmodules(theboxeswiththespinningtops)thatactuallycontainthreelayerseach.Thenumberoffeaturemapsoutputbyeachconvolutionallayerandeachpoolinglayerisshownbeforethekernelsize.Thesixnumbersintheinceptionmodulesrepresentthenumberoffeaturemapsoutputbyeachconvolutionallayerinthemodule(inthesameorderasinFigure13-10).NotethatalltheconvolutionallayersusetheReLUactivationfunction.

Figure 13-11. GoogLeNet architecture


Let’sgothroughthisnetwork:Thefirsttwolayersdividetheimage’sheightandwidthby4(soitsareaisdividedby16),toreducethecomputationalload.

Thenthelocalresponsenormalizationlayerensuresthatthepreviouslayerslearnawidevarietyoffeatures(asdiscussedearlier).

Twoconvolutionallayersfollow,wherethefirstactslikeabottlenecklayer.Asexplainedearlier,youcanthinkofthispairasasinglesmarterconvolutionallayer.

Again,alocalresponsenormalizationlayerensuresthatthepreviouslayerscaptureawidevarietyofpatterns.

Nextamaxpoolinglayerreducestheimageheightandwidthby2,againtospeedupcomputations.

Thencomesthetallstackofnineinceptionmodules,interleavedwithacouplemaxpoolinglayerstoreducedimensionalityandspeedupthenet.

Next,theaveragepoolinglayerusesakernelthesizeofthefeaturemapswithVALIDpadding,outputting1×1featuremaps:thissurprisingstrategyiscalledglobalaveragepooling.Iteffectivelyforcesthepreviouslayerstoproducefeaturemapsthatareactuallyconfidencemapsforeachtargetclass(sinceotherkindsoffeatureswouldbedestroyedbytheaveragingstep).ThismakesitunnecessarytohaveseveralfullyconnectedlayersatthetopoftheCNN(likeinAlexNet),considerablyreducingthenumberofparametersinthenetworkandlimitingtheriskofoverfitting.

Thelastlayersareself-explanatory:dropoutforregularization,thenafullyconnectedlayerwithasoftmaxactivationfunctiontooutputestimatedclassprobabilities.

Thisdiagramisslightlysimplified:theoriginalGoogLeNetarchitecturealsoincludedtwoauxiliaryclassifierspluggedontopofthethirdandsixthinceptionmodules.Theywerebothcomposedofoneaveragepoolinglayer,oneconvolutionallayer,twofullyconnectedlayers,andasoftmaxactivationlayer.Duringtraining,theirloss(scaleddownby70%)wasaddedtotheoverallloss.Thegoalwastofightthevanishinggradientsproblemandregularizethenetwork.However,itwasshownthattheireffectwasrelativelyminor.


ResNet

Last but not least, the winner of the ILSVRC 2015 challenge was the Residual Network (or ResNet), developed by Kaiming He et al.,12 which delivered an astounding top-5 error rate under 3.6%, using an extremely deep CNN composed of 152 layers. The key to being able to train such a deep network is to use skip connections (also called shortcut connections): the signal feeding into a layer is also added to the output of a layer located a bit higher up the stack. Let's see why this is useful.

When training a neural network, the goal is to make it model a target function h(x). If you add the input x to the output of the network (i.e., you add a skip connection), then the network will be forced to model f(x) = h(x) – x rather than h(x). This is called residual learning (see Figure 13-12).

Figure 13-12. Residual learning

When you initialize a regular neural network, its weights are close to zero, so the network just outputs values close to zero. If you add a skip connection, the resulting network just outputs a copy of its inputs; in other words, it initially models the identity function. If the target function is fairly close to the identity function (which is often the case), this will speed up training considerably.

Moreover, if you add many skip connections, the network can start making progress even if several layers have not started learning yet (see Figure 13-13). Thanks to skip connections, the signal can easily make its way across the whole network. The deep residual network can be seen as a stack of residual units, where each residual unit is a small neural network with a skip connection.


Figure 13-13. Regular deep neural network (left) and deep residual network (right)

Nowlet’slookatResNet’sarchitecture(seeFigure13-14).Itisactuallysurprisinglysimple.ItstartsandendsexactlylikeGoogLeNet(exceptwithoutadropoutlayer),andinbetweenisjustaverydeepstackofsimpleresidualunits.Eachresidualunitiscomposedoftwoconvolutionallayers,withBatchNormalization(BN)andReLUactivation,using3×3kernelsandpreservingspatialdimensions(stride1,SAMEpadding).

Figure 13-14. ResNet architecture

Note that the number of feature maps is doubled every few residual units, at the same time as their height and width are halved (using a convolutional layer with stride 2). When this happens the inputs cannot be added directly to the outputs of the residual unit since they don't have the same shape (for example, this problem affects the skip connection represented by the dashed arrow in Figure 13-14). To solve this problem, the inputs are passed through a 1 × 1 convolutional layer with stride 2 and the right number of output feature maps (see Figure 13-15).

Figure 13-15. Skip connection when changing feature map size and depth

ResNet-34 is the ResNet with 34 layers (only counting the convolutional layers and the fully connected layer) containing three residual units that output 64 feature maps, 4 RUs with 128 maps, 6 RUs with 256 maps, and 3 RUs with 512 maps.

ResNets deeper than that, such as ResNet-152, use slightly different residual units. Instead of two 3 × 3 convolutional layers with (say) 256 feature maps, they use three convolutional layers: first a 1 × 1 convolutional layer with just 64 feature maps (4 times less), which acts as a bottleneck layer (as discussed already), then a 3 × 3 layer with 64 feature maps, and finally another 1 × 1 convolutional layer with 256 feature maps (4 times 64) that restores the original depth. ResNet-152 contains three such RUs that output 256 maps, then 8 RUs with 512 maps, a whopping 36 RUs with 1,024 maps, and finally 3 RUs with 2,048 maps.

As you can see, the field is moving rapidly, with all sorts of architectures popping out every year. One clear trend is that CNNs keep getting deeper and deeper. They are also getting lighter, requiring fewer and fewer parameters. At present, the ResNet architecture is both the most powerful and arguably the simplest, so it is really the one you should probably use for now, but keep looking at the ILSVRC challenge every year. The 2016 winners were the Trimps-Soushen team from China with an astounding 2.99% error rate. To achieve this they trained combinations of the previous models and joined them into an ensemble. Depending on the task, the reduced error rate may or may not be worth the extra complexity.

There are a few other architectures that you may want to look at, in particular VGGNet13 (runner-up of the ILSVRC 2014 challenge) and Inception-v414 (which merges the ideas of GoogLeNet and ResNet and achieves close to 3% top-5 error rate on ImageNet classification).


NOTE
There is really nothing special about implementing the various CNN architectures we just discussed. We saw earlier how to build all the individual building blocks, so now all you need is to assemble them to create the desired architecture. We will build a complete CNN in the upcoming exercises and you will find full working code in the Jupyter notebooks.

TENSORFLOW CONVOLUTION OPERATIONS

TensorFlow also offers a few other kinds of convolutional layers:

tf.layers.conv1d() creates a convolutional layer for 1D inputs. This is useful, for example, in natural language processing, where a sentence may be represented as a 1D array of words, and the receptive field covers a few neighboring words.

tf.layers.conv3d() creates a convolutional layer for 3D inputs, such as 3D PET scans.

tf.nn.atrous_conv2d() creates an atrous convolutional layer ("à trous" is French for "with holes"). This is equivalent to using a regular convolutional layer with a filter dilated by inserting rows and columns of zeros (i.e., holes). For example, a 1 × 3 filter equal to [[1,2,3]] may be dilated with a dilation rate of 4, resulting in a dilated filter [[1,0,0,0,2,0,0,0,3]]. This allows the convolutional layer to have a larger receptive field at no computational price and using no extra parameters (see the sketch after this list).

tf.layers.conv2d_transpose() creates a transpose convolutional layer, sometimes called a deconvolutional layer,15 which upsamples an image. It does so by inserting zeros between the inputs, so you can think of this as a regular convolutional layer using a fractional stride. Upsampling is useful, for example, in image segmentation: in a typical CNN, feature maps get smaller and smaller as you progress through the network, so if you want to output an image of the same size as the input, you need an upsampling layer.

tf.nn.depthwise_conv2d() creates a depthwise convolutional layer that applies every filter to every individual input channel independently. Thus, if there are fn filters and fn′ input channels, then this will output fn × fn′ feature maps.

tf.layers.separable_conv2d() creates a separable convolutional layer that first acts like a depthwise convolutional layer, then applies a 1 × 1 convolutional layer to the resulting feature maps. This makes it possible to apply filters to arbitrary sets of input channels.
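As a minimal sketch of two of these calls (reusing the X placeholder and the 7 × 7 filters tensor from the earlier tf.nn.conv2d() example; the dilation rate and layer sizes are arbitrary):

atrous = tf.nn.atrous_conv2d(X, filters, rate=4, padding="SAME")
upsample = tf.layers.conv2d_transpose(X, filters=2, kernel_size=7,
                                      strides=[2, 2], padding="SAME")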


Exercises

1. What are the advantages of a CNN over a fully connected DNN for image classification?

2. Consider a CNN composed of three convolutional layers, each with 3 × 3 kernels, a stride of 2, and SAME padding. The lowest layer outputs 100 feature maps, the middle one outputs 200, and the top one outputs 400. The input images are RGB images of 200 × 300 pixels. What is the total number of parameters in the CNN? If we are using 32-bit floats, at least how much RAM will this network require when making a prediction for a single instance? What about when training on a mini-batch of 50 images?

3. If your GPU runs out of memory while training a CNN, what are five things you could try to solve the problem?

4. Why would you want to add a max pooling layer rather than a convolutional layer with the same stride?

5. When would you want to add a local response normalization layer?

6. Can you name the main innovations in AlexNet, compared to LeNet-5? What about the main innovations in GoogLeNet and ResNet?

7. Build your own CNN and try to achieve the highest possible accuracy on MNIST.

8. Classifying large images using Inception v3.

a. Download some images of various animals. Load them in Python, for example using the matplotlib.image.mpimg.imread() function or the scipy.misc.imread() function. Resize and/or crop them to 299 × 299 pixels, and ensure that they have just three channels (RGB), with no transparency channel.

b. Download the latest pretrained Inception v3 model: the checkpoint is available at https://goo.gl/nxSQvl.

c. Create the Inception v3 model by calling the inception_v3() function, as shown below. This must be done within an argument scope created by the inception_v3_arg_scope() function. Also, you must set is_training=False and num_classes=1001 like so:

from tensorflow.contrib.slim.nets import inception
import tensorflow.contrib.slim as slim

X = tf.placeholder(tf.float32, shape=[None, 299, 299, 3], name="X")
with slim.arg_scope(inception.inception_v3_arg_scope()):
    logits, end_points = inception.inception_v3(
        X, num_classes=1001, is_training=False)
predictions = end_points["Predictions"]
saver = tf.train.Saver()

d. Open a session and use the Saver to restore the pretrained model checkpoint you downloaded earlier.

e. Run the model to classify the images you prepared. Display the top five predictions for each image, along with the estimated probability (the list of class names is available at https://goo.gl/brXRtZ). How accurate is the model?

9. Transfer learning for large image classification.

a. Create a training set containing at least 100 images per class. For example, you could classify your own pictures based on the location (beach, mountain, city, etc.), or alternatively you can just use an existing dataset, such as the flowers dataset or MIT's places dataset (requires registration, and it is huge).

b. Write a preprocessing step that will resize and crop the image to 299 × 299, with some randomness for data augmentation.

c. Using the pretrained Inception v3 model from the previous exercise, freeze all layers up to the bottleneck layer (i.e., the last layer before the output layer), and replace the output layer with the appropriate number of outputs for your new classification task (e.g., the flowers dataset has five mutually exclusive classes so the output layer must have five neurons and use the softmax activation function).

d. Split your dataset into a training set and a test set. Train the model on the training set and evaluate it on the test set.

10. Go through TensorFlow's DeepDream tutorial. It is a fun way to familiarize yourself with various ways of visualizing the patterns learned by a CNN, and to generate art using Deep Learning.

Solutions to these exercises are available in Appendix A.

1. "Single Unit Activity in Striate Cortex of Unrestrained Cats," D. Hubel and T. Wiesel (1958).

2. "Receptive Fields of Single Neurones in the Cat's Striate Cortex," D. Hubel and T. Wiesel (1959).

3. "Receptive Fields and Functional Architecture of Monkey Striate Cortex," D. Hubel and T. Wiesel (1968).

4. "Neocognitron: A Self-organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position," K. Fukushima (1980).

5. "Gradient-Based Learning Applied to Document Recognition," Y. LeCun et al. (1998).

6. A convolution is a mathematical operation that slides one function over another and measures the integral of their pointwise multiplication. It has deep connections with the Fourier transform and the Laplace transform, and is heavily used in signal processing. Convolutional layers actually use cross-correlations, which are very similar to convolutions (see http://goo.gl/HAfxXd for more details).

7. A fully connected layer with 150 × 100 neurons, each connected to all 150 × 100 × 3 inputs, would have 150² × 100² × 3 = 675 million parameters!

8. 1 MB = 1,024 kB = 1,024 × 1,024 bytes = 1,024 × 1,024 × 8 bits.

9. "ImageNet Classification with Deep Convolutional Neural Networks," A. Krizhevsky et al. (2012).

10. "Going Deeper with Convolutions," C. Szegedy et al. (2015).

11. In the 2010 movie Inception, the characters keep going deeper and deeper into multiple layers of dreams, hence the name of these modules.

12. "Deep Residual Learning for Image Recognition," K. He et al. (2015).

13. "Very Deep Convolutional Networks for Large-Scale Image Recognition," K. Simonyan and A. Zisserman (2015).

14. "Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning," C. Szegedy et al. (2016).

15. This name is quite misleading since this layer does not perform a deconvolution, which is a well-defined mathematical operation (the inverse of a convolution).


Chapter 14. Recurrent Neural Networks

The batter hits the ball. You immediately start running, anticipating the ball's trajectory. You track it and adapt your movements, and finally catch it (under a thunder of applause). Predicting the future is what you do all the time, whether you are finishing a friend's sentence or anticipating the smell of coffee at breakfast. In this chapter, we are going to discuss recurrent neural networks (RNN), a class of nets that can predict the future (well, up to a point, of course). They can analyze time series data such as stock prices, and tell you when to buy or sell. In autonomous driving systems, they can anticipate car trajectories and help avoid accidents. More generally, they can work on sequences of arbitrary lengths, rather than on fixed-sized inputs like all the nets we have discussed so far. For example, they can take sentences, documents, or audio samples as input, making them extremely useful for natural language processing (NLP) systems such as automatic translation, speech-to-text, or sentiment analysis (e.g., reading movie reviews and extracting the rater's feeling about the movie).

Moreover, RNNs' ability to anticipate also makes them capable of surprising creativity. You can ask them to predict which are the most likely next notes in a melody, then randomly pick one of these notes and play it. Then ask the net for the next most likely notes, play it, and repeat the process again and again. Before you know it, your net will compose a melody such as the one produced by Google's Magenta project. Similarly, RNNs can generate sentences, image captions, and much more. The result is not exactly Shakespeare or Mozart yet, but who knows what they will produce a few years from now?

In this chapter, we will look at the fundamental concepts underlying RNNs, the main problem they face (namely, vanishing/exploding gradients, discussed in Chapter 11), and the solutions widely used to fight it: LSTM and GRU cells. Along the way, as always, we will show how to implement RNNs using TensorFlow. Finally, we will take a look at the architecture of a machine translation system.


Recurrent Neurons

Up to now we have mostly looked at feedforward neural networks, where the activations flow only in one direction, from the input layer to the output layer (except for a few networks in Appendix E). A recurrent neural network looks very much like a feedforward neural network, except it also has connections pointing backward. Let's look at the simplest possible RNN, composed of just one neuron receiving inputs, producing an output, and sending that output back to itself, as shown in Figure 14-1 (left). At each time step t (also called a frame), this recurrent neuron receives the inputs x(t) as well as its own output from the previous time step, y(t–1). We can represent this tiny network against the time axis, as shown in Figure 14-1 (right). This is called unrolling the network through time.

Figure 14-1. A recurrent neuron (left), unrolled through time (right)

You can easily create a layer of recurrent neurons. At each time step t, every neuron receives both the input vector x(t) and the output vector from the previous time step y(t–1), as shown in Figure 14-2. Note that both the inputs and outputs are vectors now (when there was just a single neuron, the output was a scalar).

Figure 14-2. A layer of recurrent neurons (left), unrolled through time (right)

Each recurrent neuron has two sets of weights: one for the inputs x(t) and the other for the outputs of the previous time step, y(t–1). Let's call these weight vectors wx and wy. The output of a recurrent layer can be computed pretty much as you might expect, as shown in Equation 14-1 (b is the bias term and ϕ(·) is the activation function, e.g., ReLU).

Equation 14-1. Output of a recurrent layer for a single instance

y_(t) = ϕ(Wx⊤ · x_(t) + Wy⊤ · y_(t–1) + b)

Just like for feedforward neural networks, we can compute a recurrent layer's output in one shot for a whole mini-batch using a vectorized form of the previous equation (see Equation 14-2).

Equation 14-2. Outputs of a layer of recurrent neurons for all instances in a mini-batch

Y_(t) = ϕ(X_(t) · Wx + Y_(t–1) · Wy + b)
      = ϕ([X_(t) Y_(t–1)] · W + b)    with W = [Wx ; Wy] (Wx and Wy stacked vertically)

Y(t) is an m × n_neurons matrix containing the layer's outputs at time step t for each instance in the mini-batch (m is the number of instances in the mini-batch and n_neurons is the number of neurons).

X(t) is an m × n_inputs matrix containing the inputs for all instances (n_inputs is the number of input features).

Wx is an n_inputs × n_neurons matrix containing the connection weights for the inputs of the current time step.

Wy is an n_neurons × n_neurons matrix containing the connection weights for the outputs of the previous time step.

The weight matrices Wx and Wy are often concatenated into a single weight matrix W of shape (n_inputs + n_neurons) × n_neurons (see the second line of Equation 14-2).

b is a vector of size n_neurons containing each neuron's bias term.

Notice that Y(t) is a function of X(t) and Y(t–1), which is a function of X(t–1) and Y(t–2), which is a function of X(t–2) and Y(t–3), and so on. This makes Y(t) a function of all the inputs since time t = 0 (that is, X(0), X(1), …, X(t)). At the first time step, t = 0, there are no previous outputs, so they are typically assumed to be all zeros.
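A quick NumPy check makes the equivalence of the two lines of Equation 14-2 concrete (the sizes and the use of tanh as ϕ are arbitrary choices for this sketch):

import numpy as np

m, n_inputs, n_neurons = 4, 3, 5
rng = np.random.RandomState(42)
X_t, Y_prev = rng.rand(m, n_inputs), rng.rand(m, n_neurons)
Wx, Wy = rng.rand(n_inputs, n_neurons), rng.rand(n_neurons, n_neurons)
b = rng.rand(n_neurons)

two_terms = np.tanh(X_t.dot(Wx) + Y_prev.dot(Wy) + b)
W = np.vstack([Wx, Wy])  # shape (n_inputs + n_neurons, n_neurons)
concat = np.tanh(np.c_[X_t, Y_prev].dot(W) + b)
assert np.allclose(two_terms, concat)  # both forms give the same outputs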


Memory Cells

Since the output of a recurrent neuron at time step t is a function of all the inputs from previous time steps, you could say it has a form of memory. A part of a neural network that preserves some state across time steps is called a memory cell (or simply a cell). A single recurrent neuron, or a layer of recurrent neurons, is a very basic cell, but later in this chapter we will look at some more complex and powerful types of cells.

In general a cell's state at time step t, denoted h(t) (the "h" stands for "hidden"), is a function of some inputs at that time step and its state at the previous time step: h(t) = f(h(t–1), x(t)). Its output at time step t, denoted y(t), is also a function of the previous state and the current inputs. In the case of the basic cells we have discussed so far, the output is simply equal to the state, but in more complex cells this is not always the case, as shown in Figure 14-3.

Figure 14-3. A cell's hidden state and its output may be different


Input and Output Sequences

An RNN can simultaneously take a sequence of inputs and produce a sequence of outputs (see Figure 14-4, top-left network). For example, this type of network is useful for predicting time series such as stock prices: you feed it the prices over the last N days, and it must output the prices shifted by one day into the future (i.e., from N – 1 days ago to tomorrow).

Alternatively, you could feed the network a sequence of inputs, and ignore all outputs except for the last one (see the top-right network). In other words, this is a sequence-to-vector network. For example, you could feed the network a sequence of words corresponding to a movie review, and the network would output a sentiment score (e.g., from –1 [hate] to +1 [love]).

Conversely, you could feed the network a single input at the first time step (and zeros for all other time steps), and let it output a sequence (see the bottom-left network). This is a vector-to-sequence network. For example, the input could be an image, and the output could be a caption for that image.

Lastly, you could have a sequence-to-vector network, called an encoder, followed by a vector-to-sequence network, called a decoder (see the bottom-right network). For example, this can be used for translating a sentence from one language to another. You would feed the network a sentence in one language, the encoder would convert this sentence into a single vector representation, and then the decoder would decode this vector into a sentence in another language. This two-step model, called an Encoder–Decoder, works much better than trying to translate on the fly with a single sequence-to-sequence RNN (like the one represented on the top left), since the last words of a sentence can affect the first words of the translation, so you need to wait until you have heard the whole sentence before translating it.

Figure 14-4. Seq to seq (top left), seq to vector (top right), vector to seq (bottom left), delayed seq to seq (bottom right)

Sounds promising, so let's start coding!


Basic RNNs in TensorFlow

First, let's implement a very simple RNN model, without using any of TensorFlow's RNN operations, to better understand what goes on under the hood. We will create an RNN composed of a layer of five recurrent neurons (like the RNN represented in Figure 14-2), using the tanh activation function. We will assume that the RNN runs over only two time steps, taking input vectors of size 3 at each time step. The following code builds this RNN, unrolled through two time steps:

import tensorflow as tf

n_inputs = 3
n_neurons = 5

X0 = tf.placeholder(tf.float32, [None, n_inputs])
X1 = tf.placeholder(tf.float32, [None, n_inputs])

Wx = tf.Variable(tf.random_normal(shape=[n_inputs, n_neurons], dtype=tf.float32))
Wy = tf.Variable(tf.random_normal(shape=[n_neurons, n_neurons], dtype=tf.float32))
b = tf.Variable(tf.zeros([1, n_neurons], dtype=tf.float32))

Y0 = tf.tanh(tf.matmul(X0, Wx) + b)
Y1 = tf.tanh(tf.matmul(Y0, Wy) + tf.matmul(X1, Wx) + b)

init = tf.global_variables_initializer()

This network looks much like a two-layer feedforward neural network, with a few twists: first, the same weights and bias terms are shared by both layers, and second, we feed inputs at each layer, and we get outputs from each layer. To run the model, we need to feed it the inputs at both time steps, like so:

import numpy as np

# Mini-batch:        instance 0, instance 1, instance 2, instance 3
X0_batch = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 0, 1]])  # t = 0
X1_batch = np.array([[9, 8, 7], [0, 0, 0], [6, 5, 4], [3, 2, 1]])  # t = 1

with tf.Session() as sess:
    init.run()
    Y0_val, Y1_val = sess.run([Y0, Y1], feed_dict={X0: X0_batch, X1: X1_batch})

This mini-batch contains four instances, each with an input sequence composed of exactly two inputs. At the end, Y0_val and Y1_val contain the outputs of the network at both time steps for all neurons and all instances in the mini-batch:

>>> print(Y0_val)  # output at t = 0
[[-0.0664006   0.96257669  0.68105787  0.70918542 -0.89821595]   # instance 0
 [ 0.9977755  -0.71978885 -0.99657625  0.9673925  -0.99989718]   # instance 1
 [ 0.99999774 -0.99898815 -0.99999893  0.99677622 -0.99999988]   # instance 2
 [ 1.         -1.         -1.         -0.99818915  0.99950868]]  # instance 3
>>> print(Y1_val)  # output at t = 1
[[ 1.         -1.         -1.          0.40200216 -1.        ]   # instance 0
 [-0.12210433  0.62805319  0.96718419 -0.99371207 -0.25839335]   # instance 1
 [ 0.99999827 -0.9999994  -0.9999975  -0.85943311 -0.9999879 ]   # instance 2
 [ 0.99928284 -0.99999815 -0.99990582  0.98579615 -0.92205751]]  # instance 3

Thatwasn’ttoohard,butofcourseifyouwanttobeabletorunanRNNover100timesteps,thegraphisgoingtobeprettybig.Nowlet’slookathowtocreatethesamemodelusingTensorFlow’sRNNoperations.


Static Unrolling Through Time

The static_rnn() function creates an unrolled RNN network by chaining cells. The following code creates the exact same model as the previous one:

X0 = tf.placeholder(tf.float32, [None, n_inputs])
X1 = tf.placeholder(tf.float32, [None, n_inputs])

basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
output_seqs, states = tf.contrib.rnn.static_rnn(basic_cell, [X0, X1],
                                                dtype=tf.float32)
Y0, Y1 = output_seqs

First we create the input placeholders, as before. Then we create a BasicRNNCell, which you can think of as a factory that creates copies of the cell to build the unrolled RNN (one for each time step). Then we call static_rnn(), giving it the cell factory and the input tensors, and telling it the data type of the inputs (this is used to create the initial state matrix, which by default is full of zeros). The static_rnn() function calls the cell factory's __call__() function once per input, creating two copies of the cell (each containing a layer of five recurrent neurons), with shared weights and bias terms, and it chains them just like we did earlier. The static_rnn() function returns two objects. The first is a Python list containing the output tensors for each time step. The second is a tensor containing the final states of the network. When you are using basic cells, the final state is simply equal to the last output.

If there were 50 time steps, it would not be very convenient to have to define 50 input placeholders and 50 output tensors. Moreover, at execution time you would have to feed each of the 50 placeholders and manipulate the 50 outputs. Let's simplify this. The following code builds the same RNN again, but this time it takes a single input placeholder of shape [None, n_steps, n_inputs] where the first dimension is the mini-batch size. Then it extracts the list of input sequences for each time step. X_seqs is a Python list of n_steps tensors of shape [None, n_inputs], where once again the first dimension is the mini-batch size. To do this, we first swap the first two dimensions using the transpose() function, so that the time steps are now the first dimension. Then we extract a Python list of tensors along the first dimension (i.e., one tensor per time step) using the unstack() function. The next two lines are the same as before. Finally, we merge all the output tensors into a single tensor using the stack() function, and we swap the first two dimensions to get a final outputs tensor of shape [None, n_steps, n_neurons] (again the first dimension is the mini-batch size).

X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
X_seqs = tf.unstack(tf.transpose(X, perm=[1, 0, 2]))

basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
output_seqs, states = tf.contrib.rnn.static_rnn(basic_cell, X_seqs,
                                                dtype=tf.float32)
outputs = tf.transpose(tf.stack(output_seqs), perm=[1, 0, 2])

Now we can run the network by feeding it a single tensor that contains all the mini-batch sequences:

X_batch = np.array([
        # t = 0      t = 1
        [[0, 1, 2], [9, 8, 7]],  # instance 0
        [[3, 4, 5], [0, 0, 0]],  # instance 1
        [[6, 7, 8], [6, 5, 4]],  # instance 2
        [[9, 0, 1], [3, 2, 1]],  # instance 3
    ])

with tf.Session() as sess:
    init.run()
    outputs_val = outputs.eval(feed_dict={X: X_batch})

And we get a single outputs_val tensor for all instances, all time steps, and all neurons:

>>> print(outputs_val)
[[[-0.91279727  0.83698678 -0.89277941  0.80308062 -0.5283336 ]
  [-1.          1.         -0.99794829  0.99985468 -0.99273592]]

 [[-0.99994391  0.99951613 -0.9946925   0.99030769 -0.94413054]
  [ 0.48733309  0.93389565 -0.31362072  0.88573611  0.2424476 ]]

 [[-1.          0.99999875 -0.99975014  0.99956584 -0.99466234]
  [-0.99994856  0.99999434 -0.96058172  0.99784708 -0.9099462 ]]

 [[-0.95972425  0.99951482  0.96938795 -0.969908   -0.67668229]
  [-0.84596014  0.96288228  0.96856463 -0.14777924 -0.9119423 ]]]

However, this approach still builds a graph containing one cell per time step. If there were 50 time steps, the graph would look pretty ugly. It is a bit like writing a program without ever using loops (e.g., Y0=f(0, X0); Y1=f(Y0, X1); Y2=f(Y1, X2); ...; Y50=f(Y49, X50)). With such a large graph, you may even get out-of-memory (OOM) errors during backpropagation (especially with the limited memory of GPU cards), since it must store all tensor values during the forward pass so it can use them to compute gradients during the reverse pass.

Fortunately, there is a better solution: the dynamic_rnn() function.


Dynamic Unrolling Through Time

The dynamic_rnn() function uses a while_loop() operation to run over the cell the appropriate number of times, and you can set swap_memory=True if you want it to swap the GPU's memory to the CPU's memory during backpropagation to avoid OOM errors. Conveniently, it also accepts a single tensor for all inputs at every time step (shape [None, n_steps, n_inputs]) and it outputs a single tensor for all outputs at every time step (shape [None, n_steps, n_neurons]); there is no need to stack, unstack, or transpose. The following code creates the same RNN as earlier using the dynamic_rnn() function. It's so much nicer!

X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])

basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
outputs, states = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32)

NOTE
During backpropagation, the while_loop() operation does the appropriate magic: it stores the tensor values for each iteration during the forward pass so it can use them to compute gradients during the reverse pass.


Handling Variable-Length Input Sequences

So far we have used only fixed-size input sequences (all exactly two steps long). What if the input sequences have variable lengths (e.g., like sentences)? In this case you should set the sequence_length argument when calling the dynamic_rnn() (or static_rnn()) function; it must be a 1D tensor indicating the length of the input sequence for each instance. For example:

seq_length = tf.placeholder(tf.int32, [None])

[...]

outputs, states = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32,
                                    sequence_length=seq_length)

For example, suppose the second input sequence contains only one input instead of two. It must be padded with a zero vector in order to fit in the input tensor X (because the input tensor's second dimension is the size of the longest sequence—i.e., 2).

X_batch = np.array([
        # step 0     step 1
        [[0, 1, 2], [9, 8, 7]],  # instance 0
        [[3, 4, 5], [0, 0, 0]],  # instance 1 (padded with a zero vector)
        [[6, 7, 8], [6, 5, 4]],  # instance 2
        [[9, 0, 1], [3, 2, 1]],  # instance 3
    ])
seq_length_batch = np.array([2, 1, 2, 2])

Of course, you now need to feed values for both placeholders X and seq_length:

with tf.Session() as sess:
    init.run()
    outputs_val, states_val = sess.run(
        [outputs, states], feed_dict={X: X_batch, seq_length: seq_length_batch})

Now the RNN outputs zero vectors for every time step past the input sequence length (look at the second instance's output for the second time step):

>>> print(outputs_val)
[[[-0.68579948 -0.25901747 -0.80249101 -0.18141513 -0.37491536]
  [-0.99996698 -0.94501185  0.98072106 -0.9689762   0.99966913]]  # final state

 [[-0.99099374 -0.64768541 -0.67801034 -0.7415446   0.7719509 ]   # final state
  [ 0.          0.          0.          0.          0.        ]]  # zero vector

 [[-0.99978048 -0.85583007 -0.49696958 -0.93838578  0.98505187]
  [-0.99951065 -0.89148796  0.94170523 -0.38407657  0.97499216]]  # final state

 [[-0.02052618 -0.94588047  0.99935204  0.37283331  0.9998163 ]
  [-0.91052347  0.05769409  0.47446665 -0.44611037  0.89394671]]] # final state

Moreover, the states tensor contains the final state of each cell (excluding the zero vectors):

>>> print(states_val)
[[-0.99996698 -0.94501185  0.98072106 -0.9689762   0.99966913]   # t = 1
 [-0.99099374 -0.64768541 -0.67801034 -0.7415446   0.7719509 ]   # t = 0 !!!
 [-0.99951065 -0.89148796  0.94170523 -0.38407657  0.97499216]   # t = 1
 [-0.91052347  0.05769409  0.47446665 -0.44611037  0.89394671]]  # t = 1


Handling Variable-Length Output Sequences

What if the output sequences have variable lengths as well? If you know in advance what length each sequence will have (for example if you know that it will be the same length as the input sequence), then you can set the sequence_length parameter as described above. Unfortunately, in general this will not be possible: for example, the length of a translated sentence is generally different from the length of the input sentence. In this case, the most common solution is to define a special output called an end-of-sequence token (EOS token). Any output past the EOS should be ignored (we will discuss this later in this chapter).

Okay, now you know how to build an RNN network (or more precisely an RNN network unrolled through time). But how do you train it?


Training RNNs

To train an RNN, the trick is to unroll it through time (like we just did) and then simply use regular backpropagation (see Figure 14-5). This strategy is called backpropagation through time (BPTT).

Figure 14-5. Backpropagation through time

Just like in regular backpropagation, there is a first forward pass through the unrolled network (represented by the dashed arrows); then the output sequence is evaluated using a cost function C(Y_(t_min), Y_(t_min+1), …, Y_(t_max)) (where t_min and t_max are the first and last output time steps, not counting the ignored outputs), and the gradients of that cost function are propagated backward through the unrolled network (represented by the solid arrows); and finally the model parameters are updated using the gradients computed during BPTT. Note that the gradients flow backward through all the outputs used by the cost function, not just through the final output (for example, in Figure 14-5 the cost function is computed using the last three outputs of the network, Y(2), Y(3), and Y(4), so gradients flow through these three outputs, but not through Y(0) and Y(1)). Moreover, since the same parameters W and b are used at each time step, backpropagation will do the right thing and sum over all time steps.


Training a Sequence Classifier

Let's train an RNN to classify MNIST images. A convolutional neural network would be better suited for image classification (see Chapter 13), but this makes for a simple example that you are already familiar with. We will treat each image as a sequence of 28 rows of 28 pixels each (since each MNIST image is 28 × 28 pixels). We will use cells of 150 recurrent neurons, plus a fully connected layer containing 10 neurons (one per class) connected to the output of the last time step, followed by a softmax layer (see Figure 14-6).

Figure 14-6. Sequence classifier

The construction phase is quite straightforward; it's pretty much the same as the MNIST classifier we built in Chapter 10 except that an unrolled RNN replaces the hidden layers. Note that the fully connected layer is connected to the states tensor, which contains only the final state of the RNN (i.e., the 28th output). Also note that y is a placeholder for the target classes.

n_steps = 28
n_inputs = 28
n_neurons = 150
n_outputs = 10

learning_rate = 0.001

X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
y = tf.placeholder(tf.int32, [None])

basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
outputs, states = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32)

logits = tf.layers.dense(states, n_outputs)
xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y,
                                                          logits=logits)
loss = tf.reduce_mean(xentropy)
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(loss)
correct = tf.nn.in_top_k(logits, y, 1)
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

init = tf.global_variables_initializer()

Nowlet’sloadtheMNISTdataandreshapethetestdatato[batch_size,n_steps,n_inputs]asisexpectedbythenetwork.Wewilltakecareofreshapingthetrainingdatainamoment.

fromtensorflow.examples.tutorials.mnistimportinput_data

mnist=input_data.read_data_sets("/tmp/data/")

X_test=mnist.test.images.reshape((-1,n_steps,n_inputs))

y_test=mnist.test.labels

Now we are ready to train the RNN. The execution phase is exactly the same as for the MNIST classifier in Chapter 10, except that we reshape each training batch before feeding it to the network.

n_epochs = 100
batch_size = 150

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        for iteration in range(mnist.train.num_examples // batch_size):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            X_batch = X_batch.reshape((-1, n_steps, n_inputs))
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        acc_train = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
        acc_test = accuracy.eval(feed_dict={X: X_test, y: y_test})
        print(epoch, "Train accuracy:", acc_train, "Test accuracy:", acc_test)

The output should look like this:

0 Train accuracy: 0.94 Test accuracy: 0.9308
1 Train accuracy: 0.933333 Test accuracy: 0.9431
[...]
98 Train accuracy: 0.98 Test accuracy: 0.9794
99 Train accuracy: 1.0 Test accuracy: 0.9804

We get over 98% accuracy, not bad! Plus you would certainly get a better result by tuning the hyperparameters, initializing the RNN weights using He initialization, training longer, or adding a bit of regularization (e.g., dropout).

TIP: You can specify an initializer for the RNN by wrapping its construction code in a variable scope (e.g., use variable_scope("rnn", initializer=variance_scaling_initializer()) to use He initialization).
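For instance, here is a minimal sketch of that trick, reusing the construction code from above; nested variable scopes inherit the initializer, so the cell's variables pick up He initialization:

he_init = tf.contrib.layers.variance_scaling_initializer()
with tf.variable_scope("rnn", initializer=he_init):
    basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
    outputs, states = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32)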


Training to Predict Time Series

Now let's take a look at how to handle time series, such as stock prices, air temperature, brain wave patterns, and so on. In this section we will train an RNN to predict the next value in a generated time series. Each training instance is a randomly selected sequence of 20 consecutive values from the time series, and the target sequence is the same as the input sequence, except it is shifted by one time step into the future (see Figure 14-7).

Figure 14-7. Time series (left), and a training instance from that series (right)

First, let's create the RNN. It will contain 100 recurrent neurons and we will unroll it over 20 time steps since each training instance will be 20 inputs long. Each input will contain only one feature (the value at that time). The targets are also sequences of 20 inputs, each containing a single value. The code is almost the same as earlier:

n_steps = 20
n_inputs = 1
n_neurons = 100
n_outputs = 1

X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
y = tf.placeholder(tf.float32, [None, n_steps, n_outputs])

cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons, activation=tf.nn.relu)
outputs, states = tf.nn.dynamic_rnn(cell, X, dtype=tf.float32)

NOTE: In general you would have more than just one input feature. For example, if you were trying to predict stock prices, you would likely have many other input features at each time step, such as prices of competing stocks, ratings from analysts, or any other feature that might help the system make its predictions.

At each time step we now have an output vector of size 100. But what we actually want is a single output value at each time step. The simplest solution is to wrap the cell in an OutputProjectionWrapper. A cell wrapper acts like a normal cell, proxying every method call to an underlying cell, but it also adds some functionality. The OutputProjectionWrapper adds a fully connected layer of linear neurons (i.e., without any activation function) on top of each output (but it does not affect the cell state). All these fully connected layers share the same (trainable) weights and bias terms. The resulting RNN is represented in Figure 14-8.

Figure 14-8. RNN cells using output projections

Wrapping a cell is quite easy. Let's tweak the preceding code by wrapping the BasicRNNCell into an OutputProjectionWrapper:

cell = tf.contrib.rnn.OutputProjectionWrapper(
    tf.contrib.rnn.BasicRNNCell(num_units=n_neurons, activation=tf.nn.relu),
    output_size=n_outputs)

So far, so good. Now we need to define the cost function. We will use the Mean Squared Error (MSE), as we did in previous regression tasks. Next we will create an Adam optimizer, the training op, and the variable initialization op, as usual:

learning_rate = 0.001

loss = tf.reduce_mean(tf.square(outputs - y))
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(loss)

init = tf.global_variables_initializer()

Now on to the execution phase:

n_iterations = 1500
batch_size = 50

with tf.Session() as sess:
    init.run()
    for iteration in range(n_iterations):
        X_batch, y_batch = [...]  # fetch the next training batch
        sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        if iteration % 100 == 0:
            mse = loss.eval(feed_dict={X: X_batch, y: y_batch})
            print(iteration, "\tMSE:", mse)
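The batch-fetching code is elided above ([...]). For concreteness, here is a minimal sketch of one way to generate such batches; the time_series() function and all its constants are arbitrary illustrative choices, not necessarily the series used to produce the figures:

import numpy as np

def time_series(t):
    # An arbitrary synthetic series, for illustration only
    return t * np.sin(t) / 3 + 2 * np.sin(t * 5)

def next_batch(batch_size, n_steps, resolution=0.1, t_max=30):
    # Pick random starting points, then sample n_steps + 1 consecutive values
    t0 = np.random.rand(batch_size, 1) * (t_max - (n_steps + 1) * resolution)
    Ts = t0 + np.arange(0, n_steps + 1) * resolution
    ys = time_series(Ts)
    # Inputs are the first n_steps values; targets are the same values shifted by one step
    return ys[:, :-1].reshape(-1, n_steps, 1), ys[:, 1:].reshape(-1, n_steps, 1)

X_batch, y_batch = next_batch(50, n_steps)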


Theprogram’soutputshouldlooklikethis:

0MSE:13.6543

100MSE:0.538476

200MSE:0.168532

300MSE:0.0879579

400MSE:0.0633425

[...]

Once the model is trained, you can make predictions:

X_new = [...]  # new sequences
y_pred = sess.run(outputs, feed_dict={X: X_new})

Figure 14-9 shows the predicted sequence for the instance we looked at earlier (in Figure 14-7), after just 1,000 training iterations.

Figure 14-9. Time series prediction

Although using an OutputProjectionWrapper is the simplest solution to reduce the dimensionality of the RNN's output sequences down to just one value per time step (per instance), it is not the most efficient. There is a trickier but more efficient solution: you can reshape the RNN outputs from [batch_size, n_steps, n_neurons] to [batch_size * n_steps, n_neurons], then apply a single fully connected layer with the appropriate output size (in our case just 1), which will result in an output tensor of shape [batch_size * n_steps, n_outputs], and then reshape this tensor to [batch_size, n_steps, n_outputs]. These operations are represented in Figure 14-10.


Figure 14-10. Stack all the outputs, apply the projection, then unstack the result

To implement this solution, we first revert to a basic cell, without the OutputProjectionWrapper:

cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons, activation=tf.nn.relu)
rnn_outputs, states = tf.nn.dynamic_rnn(cell, X, dtype=tf.float32)

Then we stack all the outputs using the reshape() operation, apply the fully connected linear layer (without using any activation function; this is just a projection), and finally unstack all the outputs, again using reshape():

stacked_rnn_outputs = tf.reshape(rnn_outputs, [-1, n_neurons])
stacked_outputs = tf.layers.dense(stacked_rnn_outputs, n_outputs)
outputs = tf.reshape(stacked_outputs, [-1, n_steps, n_outputs])

The rest of the code is the same as earlier. This can provide a significant speed boost since there is just one fully connected layer instead of one per time step.


Creative RNN

Now that we have a model that can predict the future, we can use it to generate some creative sequences, as explained at the beginning of the chapter. All we need is to provide it a seed sequence containing n_steps values (e.g., full of zeros), use the model to predict the next value, append this predicted value to the sequence, feed the last n_steps values to the model to predict the next value, and so on. This process generates a new sequence that has some resemblance to the original time series (see Figure 14-11).

import numpy as np

sequence = [0.] * n_steps
for iteration in range(300):
    X_batch = np.array(sequence[-n_steps:]).reshape(1, n_steps, 1)
    y_pred = sess.run(outputs, feed_dict={X: X_batch})
    sequence.append(y_pred[0, -1, 0])

Figure 14-11. Creative sequences, seeded with zeros (left) or with an instance (right)

Now you can try to feed all your John Lennon albums to an RNN and see if it can generate the next "Imagine." However, you will probably need a much more powerful RNN, with more neurons, and also much deeper. Let's look at deep RNNs now.


Deep RNNs

It is quite common to stack multiple layers of cells, as shown in Figure 14-12. This gives you a deep RNN.

To implement a deep RNN in TensorFlow, you can create several cells and stack them into a MultiRNNCell. In the following code we stack three identical cells (but you could very well use various kinds of cells with a different number of neurons):

n_neurons = 100
n_layers = 3

layers = [tf.contrib.rnn.BasicRNNCell(num_units=n_neurons,
                                      activation=tf.nn.relu)
          for layer in range(n_layers)]
multi_layer_cell = tf.contrib.rnn.MultiRNNCell(layers)
outputs, states = tf.nn.dynamic_rnn(multi_layer_cell, X, dtype=tf.float32)

Figure 14-12. Deep RNN (left), unrolled through time (right)

That's all there is to it! The states variable is a tuple containing one tensor per layer, each representing the final state of that layer's cell (with shape [batch_size, n_neurons]). If you set state_is_tuple=False when creating the MultiRNNCell, then states becomes a single tensor containing the states from every layer, concatenated along the column axis (i.e., its shape is [batch_size, n_layers * n_neurons]). Note that before TensorFlow 0.11.0, this behavior was the default.


Distributing a Deep RNN Across Multiple GPUs

Chapter 12 pointed out that we can efficiently distribute deep RNNs across multiple GPUs by pinning each layer to a different GPU (see Figure 12-16). However, if you try to create each cell in a different device() block, it will not work:

with tf.device("/gpu:0"):  # BAD! This is ignored.
    layer1 = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)

with tf.device("/gpu:1"):  # BAD! Ignored again.
    layer2 = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)

This fails because a BasicRNNCell is a cell factory, not a cell per se (as mentioned earlier); no cells get created when you create the factory, and thus no variables do either. The device block is simply ignored. The cells actually get created later. When you call dynamic_rnn(), it calls the MultiRNNCell, which calls each individual BasicRNNCell, and these create the actual cells (including their variables). Unfortunately, none of these classes provide any way to control the devices on which the variables get created. If you try to put the dynamic_rnn() call within a device block, the whole RNN gets pinned to a single device. So are you stuck? Fortunately not! The trick is to create your own cell wrapper:

import tensorflow as tf

class DeviceCellWrapper(tf.contrib.rnn.RNNCell):
    def __init__(self, device, cell):
        self._cell = cell
        self._device = device

    @property
    def state_size(self):
        return self._cell.state_size

    @property
    def output_size(self):
        return self._cell.output_size

    def __call__(self, inputs, state, scope=None):
        with tf.device(self._device):
            return self._cell(inputs, state, scope)

This wrapper simply proxies every method call to another cell, except it wraps the __call__() function within a device block.2 Now you can distribute each layer on a different GPU:

devices = ["/gpu:0", "/gpu:1", "/gpu:2"]
cells = [DeviceCellWrapper(dev, tf.contrib.rnn.BasicRNNCell(num_units=n_neurons))
         for dev in devices]
multi_layer_cell = tf.contrib.rnn.MultiRNNCell(cells)
outputs, states = tf.nn.dynamic_rnn(multi_layer_cell, X, dtype=tf.float32)

WARNING: Do not set state_is_tuple=False, or the MultiRNNCell will concatenate all the cell states into a single tensor, on a single GPU.


Applying Dropout

If you build a very deep RNN, it may end up overfitting the training set. To prevent that, a common technique is to apply dropout (introduced in Chapter 11). You can simply add a dropout layer before or after the RNN as usual, but if you also want to apply dropout between the RNN layers, you need to use a DropoutWrapper. The following code applies dropout to the inputs of each layer in the RNN, dropping each input with a 50% probability:

keep_prob = 0.5

cells = [tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
         for layer in range(n_layers)]
cells_drop = [tf.contrib.rnn.DropoutWrapper(cell, input_keep_prob=keep_prob)
              for cell in cells]
multi_layer_cell = tf.contrib.rnn.MultiRNNCell(cells_drop)
rnn_outputs, states = tf.nn.dynamic_rnn(multi_layer_cell, X, dtype=tf.float32)

Note that it is also possible to apply dropout to the outputs by setting output_keep_prob, as sketched below.
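For example, a minimal variant of the wrapper above that drops both inputs and outputs:

cells_drop = [tf.contrib.rnn.DropoutWrapper(cell,
                                            input_keep_prob=keep_prob,
                                            output_keep_prob=keep_prob)
              for cell in cells]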

The main problem with this code is that it will apply dropout not only during training but also during testing, which is not what you want (recall that dropout should be applied only during training). Unfortunately, the DropoutWrapper does not support a training placeholder (yet?), so you must either write your own dropout wrapper class, or have two different graphs: one for training, and the other for testing. The second option looks like this:

import sys

training = (sys.argv[-1] == "train")

X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
y = tf.placeholder(tf.float32, [None, n_steps, n_outputs])

cells = [tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
         for layer in range(n_layers)]
if training:
    cells = [tf.contrib.rnn.DropoutWrapper(cell, input_keep_prob=keep_prob)
             for cell in cells]
multi_layer_cell = tf.contrib.rnn.MultiRNNCell(cells)
rnn_outputs, states = tf.nn.dynamic_rnn(multi_layer_cell, X, dtype=tf.float32)
[...]  # build the rest of the graph

with tf.Session() as sess:
    if training:
        init.run()
        for iteration in range(n_iterations):
            [...]  # train the model
        save_path = saver.save(sess, "/tmp/my_model.ckpt")
    else:
        saver.restore(sess, "/tmp/my_model.ckpt")
        [...]  # use the model

With that you should be able to train all sorts of RNNs! Unfortunately, if you want to train an RNN on long sequences, things will get a bit harder. Let's see why and what you can do about it.


The Difficulty of Training over Many Time Steps

To train an RNN on long sequences, you will need to run it over many time steps, making the unrolled RNN a very deep network. Just like any deep neural network it may suffer from the vanishing/exploding gradients problem (discussed in Chapter 11) and take forever to train. Many of the tricks we discussed to alleviate this problem can be used for deep unrolled RNNs as well: good parameter initialization, nonsaturating activation functions (e.g., ReLU), Batch Normalization, Gradient Clipping, and faster optimizers. However, if the RNN needs to handle even moderately long sequences (e.g., 100 inputs), then training will still be very slow.

The simplest and most common solution to this problem is to unroll the RNN only over a limited number of time steps during training. This is called truncated backpropagation through time. In TensorFlow you can implement it simply by truncating the input sequences. For example, in the time series prediction problem, you would simply reduce n_steps during training. The problem, of course, is that the model will not be able to learn long-term patterns. One workaround could be to make sure that these shortened sequences contain both old and recent data, so that the model can learn to use both (e.g., the sequence could contain monthly data for the last five months, then weekly data for the last five weeks, then daily data over the last five days). But this workaround has its limits: what if fine-grained data from last year is actually useful? What if there was a brief but significant event that absolutely must be taken into account, even years later (e.g., the result of an election)?

Besides the long training time, a second problem faced by long-running RNNs is the fact that the memory of the first inputs gradually fades away. Indeed, due to the transformations that the data goes through when traversing an RNN, some information is lost after each time step. After a while, the RNN's state contains virtually no trace of the first inputs. This can be a showstopper. For example, say you want to perform sentiment analysis on a long review that starts with the four words "I loved this movie," but the rest of the review lists the many things that could have made the movie even better. If the RNN gradually forgets the first four words, it will completely misinterpret the review. To solve this problem, various types of cells with long-term memory have been introduced. They have proved so successful that the basic cells are not much used anymore. Let's first look at the most popular of these long memory cells: the LSTM cell.


LSTM Cell

The Long Short-Term Memory (LSTM) cell was proposed in 1997 by Sepp Hochreiter and Jürgen Schmidhuber,3 and it was gradually improved over the years by several researchers, such as Alex Graves, Haşim Sak,4 Wojciech Zaremba,5 and many more. If you consider the LSTM cell as a black box, it can be used very much like a basic cell, except it will perform much better; training will converge faster and it will detect long-term dependencies in the data. In TensorFlow, you can simply use a BasicLSTMCell instead of a BasicRNNCell:

lstm_cell = tf.contrib.rnn.BasicLSTMCell(num_units=n_neurons)

LSTM cells manage two state vectors, and for performance reasons they are kept separate by default. You can change this default behavior by setting state_is_tuple=False when creating the BasicLSTMCell.

So how does an LSTM cell work? The architecture of a basic LSTM cell is shown in Figure 14-13.

Figure 14-13. LSTM cell

Ifyoudon’tlookatwhat’sinsidethebox,theLSTMcelllooksexactlylikearegularcell,exceptthatitsstateissplitintwovectors:h(t)andc(t)(“c”standsfor“cell”).Youcanthinkofh(t)astheshort-termstateandc(t)asthelong-termstate.

Nowlet’sopenthebox!Thekeyideaisthatthenetworkcanlearnwhattostoreinthelong-termstate,whattothrowaway,andwhattoreadfromit.Asthelong-termstatec(t–1)traversesthenetworkfromlefttoright,youcanseethatitfirstgoesthroughaforgetgate,droppingsomememories,andthenitaddssomenewmemoriesviatheadditionoperation(whichaddsthememoriesthatwereselectedbyaninputgate).Theresultc(t)issentstraightout,withoutanyfurthertransformation.So,ateachtimestep,somememoriesaredroppedandsomememoriesareadded.Moreover,aftertheadditionoperation,thelong-termstateiscopiedandpassedthroughthetanhfunction,andthentheresultisfilteredbytheoutputgate.

Page 493: index-of.esindex-of.es/Varios-2/Hands on Machine Learning with... · Revision History for the First Edition 2017-03-10: First Release 2017-06-09: Second Release See for release

Thisproducestheshort-termstateh(t)(whichisequaltothecell’soutputforthistimestepy(t)).Nowlet’slookatwherenewmemoriescomefromandhowthegateswork.

First, the current input vector x(t) and the previous short-term state h(t–1) are fed to four different fully connected layers. They all serve a different purpose:

The main layer is the one that outputs g(t). It has the usual role of analyzing the current inputs x(t) and the previous (short-term) state h(t–1). In a basic cell, there is nothing else than this layer, and its output goes straight out to y(t) and h(t). In contrast, in an LSTM cell this layer's output does not go straight out, but instead it is partially stored in the long-term state.

The three other layers are gate controllers. Since they use the logistic activation function, their outputs range from 0 to 1. As you can see, their outputs are fed to element-wise multiplication operations, so if they output 0s, they close the gate, and if they output 1s, they open it. Specifically:

The forget gate (controlled by f(t)) controls which parts of the long-term state should be erased.

The input gate (controlled by i(t)) controls which parts of g(t) should be added to the long-term state (this is why we said it was only "partially stored").

Finally, the output gate (controlled by o(t)) controls which parts of the long-term state should be read and output at this time step (both to h(t) and to y(t)).

In short, an LSTM cell can learn to recognize an important input (that's the role of the input gate), store it in the long-term state, learn to preserve it for as long as it is needed (that's the role of the forget gate), and learn to extract it whenever it is needed. This explains why they have been amazingly successful at capturing long-term patterns in time series, long texts, audio recordings, and more.

Equation 14-3 summarizes how to compute the cell's long-term state, its short-term state, and its output at each time step for a single instance (the equations for a whole mini-batch are very similar).

Equation 14-3. LSTM computations

i(t) = σ(Wxiᵀ · x(t) + Whiᵀ · h(t–1) + bi)
f(t) = σ(Wxfᵀ · x(t) + Whfᵀ · h(t–1) + bf)
o(t) = σ(Wxoᵀ · x(t) + Whoᵀ · h(t–1) + bo)
g(t) = tanh(Wxgᵀ · x(t) + Whgᵀ · h(t–1) + bg)
c(t) = f(t) ⊗ c(t–1) + i(t) ⊗ g(t)
y(t) = h(t) = o(t) ⊗ tanh(c(t))

(where σ is the logistic function and ⊗ denotes element-wise multiplication)

Wxi, Wxf, Wxo, and Wxg are the weight matrices of each of the four layers for their connection to the input vector x(t).

Whi, Whf, Who, and Whg are the weight matrices of each of the four layers for their connection to the previous short-term state h(t–1).

bi, bf, bo, and bg are the bias terms for each of the four layers. Note that TensorFlow initializes bf to a vector full of 1s instead of 0s. This prevents forgetting everything at the beginning of training.


Peephole Connections

In a basic LSTM cell, the gate controllers can look only at the input x(t) and the previous short-term state h(t–1). It may be a good idea to give them a bit more context by letting them peek at the long-term state as well. This idea was proposed by Felix Gers and Jürgen Schmidhuber in 2000.6 They proposed an LSTM variant with extra connections called peephole connections: the previous long-term state c(t–1) is added as an input to the controllers of the forget gate and the input gate, and the current long-term state c(t) is added as input to the controller of the output gate.

To implement peephole connections in TensorFlow, you must use the LSTMCell instead of the BasicLSTMCell and set use_peepholes=True:

lstm_cell = tf.contrib.rnn.LSTMCell(num_units=n_neurons, use_peepholes=True)

There are many other variants of the LSTM cell. One particularly popular variant is the GRU cell, which we will look at now.


GRU Cell

The Gated Recurrent Unit (GRU) cell (see Figure 14-14) was proposed by Kyunghyun Cho et al. in a 2014 paper7 that also introduced the Encoder–Decoder network we mentioned earlier.

Figure 14-14. GRU cell

The GRU cell is a simplified version of the LSTM cell, and it seems to perform just as well8 (which explains its growing popularity). The main simplifications are:

Both state vectors are merged into a single vector h(t).

A single gate controller controls both the forget gate and the input gate. If the gate controller outputs a 1, the input gate is open and the forget gate is closed. If it outputs a 0, the opposite happens. In other words, whenever a memory must be stored, the location where it will be stored is erased first. This is actually a frequent variant to the LSTM cell in and of itself.

There is no output gate; the full state vector is output at every time step. However, there is a new gate controller that controls which part of the previous state will be shown to the main layer.

Equation 14-4 summarizes how to compute the cell's state at each time step for a single instance.

Equation 14-4. GRU computations

z(t) = σ(Wxzᵀ · x(t) + Whzᵀ · h(t–1))
r(t) = σ(Wxrᵀ · x(t) + Whrᵀ · h(t–1))
g(t) = tanh(Wxgᵀ · x(t) + Whgᵀ · (r(t) ⊗ h(t–1)))
h(t) = (1 – z(t)) ⊗ h(t–1) + z(t) ⊗ g(t)


Creating a GRU cell in TensorFlow is trivial:

gru_cell = tf.contrib.rnn.GRUCell(num_units=n_neurons)

LSTM or GRU cells are one of the main reasons behind the success of RNNs in recent years, in particular for applications in natural language processing (NLP).


Natural Language Processing

Most of the state-of-the-art NLP applications, such as machine translation, automatic summarization, parsing, sentiment analysis, and more, are now based (at least in part) on RNNs. In this last section, we will take a quick look at what a machine translation model looks like. This topic is very well covered by TensorFlow's awesome Word2Vec and Seq2Seq tutorials, so you should definitely check them out.


Word Embeddings

Before we start, we need to choose a word representation. One option could be to represent each word using a one-hot vector. Suppose your vocabulary contains 50,000 words; then the nth word would be represented as a 50,000-dimensional vector, full of 0s except for a 1 at the nth position. However, with such a large vocabulary, this sparse representation would not be efficient at all. Ideally, you want similar words to have similar representations, making it easy for the model to generalize what it learns about a word to all similar words. For example, if the model is told that "I drink milk" is a valid sentence, and if it knows that "milk" is close to "water" but far from "shoes," then it will know that "I drink water" is probably a valid sentence as well, while "I drink shoes" is probably not. But how can you come up with such a meaningful representation?

The most common solution is to represent each word in the vocabulary using a fairly small and dense vector (e.g., 150 dimensions), called an embedding, and just let the neural network learn a good embedding for each word during training. At the beginning of training, embeddings are simply chosen randomly, but during training, backpropagation automatically moves the embeddings around in a way that helps the neural network perform its task. Typically this means that similar words will gradually cluster close to one another, and even end up organized in a rather meaningful way. For example, embeddings may end up placed along various axes that represent gender, singular/plural, adjective/noun, and so on. The result can be truly amazing.9

In TensorFlow, you first need to create the variable representing the embeddings for every word in your vocabulary (initialized randomly):

vocabulary_size = 50000
embedding_size = 150

init_embeds = tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0)
embeddings = tf.Variable(init_embeds)

Now suppose you want to feed the sentence "I drink milk" to your neural network. You should first preprocess the sentence and break it into a list of known words. For example, you may remove unnecessary characters, replace unknown words by a predefined token word such as "[UNK]", replace numerical values by "[NUM]", replace URLs by "[URL]", and so on. Once you have a list of known words, you can look up each word's integer identifier (from 0 to 49999) in a dictionary, for example [72, 3335, 288]. At that point, you are ready to feed these word identifiers to TensorFlow using a placeholder, and apply the embedding_lookup() function to get the corresponding embeddings:

train_inputs = tf.placeholder(tf.int32, shape=[None])     # from ids...
embed = tf.nn.embedding_lookup(embeddings, train_inputs)  # ...to embeddings
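For example, here is a quick sanity check (just a sketch; the ids [72, 3335, 288] are the example identifiers for "I drink milk" from above, and the embeddings are still random at this point):

with tf.Session() as sess:
    tf.global_variables_initializer().run()
    embed_val = sess.run(embed, feed_dict={train_inputs: [72, 3335, 288]})
    print(embed_val.shape)  # (3, 150): one 150-dimensional embedding per word id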

Once your model has learned good word embeddings, they can actually be reused fairly efficiently in any NLP application: after all, "milk" is still close to "water" and far from "shoes" no matter what your application is. In fact, instead of training your own word embeddings, you may want to download pretrained word embeddings. Just like when reusing pretrained layers (see Chapter 11), you can choose to freeze the pretrained embeddings (e.g., creating the embeddings variable using trainable=False) or let backpropagation tweak them for your application. The first option will speed up training, but the second may lead to slightly higher performance.

TIP: Embeddings are also useful for representing categorical attributes that can take on a large number of different values, especially when there are complex similarities between values. For example, consider professions, hobbies, dishes, species, brands, and so on.

You now have almost all the tools you need to implement a machine translation system. Let's look at this now.


An Encoder–Decoder Network for Machine Translation

Let's take a look at a simple machine translation model10 that will translate English sentences to French (see Figure 14-15).

Figure 14-15. A simple machine translation model

The English sentences are fed to the encoder, and the decoder outputs the French translations. Note that the French translations are also used as inputs to the decoder, but pushed back by one step. In other words, the decoder is given as input the word that it should have output at the previous step (regardless of what it actually output). For the very first word, it is given a token that represents the beginning of the sentence (e.g., "<go>"). The decoder is expected to end the sentence with an end-of-sequence (EOS) token (e.g., "<eos>").

Note that the English sentences are reversed before they are fed to the encoder. For example, "I drink milk" is reversed to "milk drink I." This ensures that the beginning of the English sentence will be fed last to the encoder, which is useful because that's generally the first thing that the decoder needs to translate.

Each word is initially represented by a simple integer identifier (e.g., 288 for the word "milk"). Next, an embedding lookup returns the word embedding (as explained earlier, this is a dense, fairly low-dimensional vector). These word embeddings are what is actually fed to the encoder and the decoder.

At each step, the decoder outputs a score for each word in the output vocabulary (i.e., French), and then the Softmax layer turns these scores into probabilities. For example, at the first step the word "Je" may have a probability of 20%, "Tu" may have a probability of 1%, and so on. The word with the highest probability is output. This is very much like a regular classification task, so you can train the model using the softmax_cross_entropy_with_logits() function.

Note that at inference time (after training), you will not have the target sentence to feed to the decoder. Instead, simply feed the decoder the word that it output at the previous step, as shown in Figure 14-16 (this will require an embedding lookup that is not shown on the diagram).

Figure 14-16. Feeding the previous output word as input at inference time

Okay, now you have the big picture. However, if you go through TensorFlow's sequence-to-sequence tutorial and you look at the code in rnn/translate/seq2seq_model.py (in the TensorFlow models), you will notice a few important differences:

First, so far we have assumed that all input sequences (to the encoder and to the decoder) have a constant length. But obviously sentence lengths may vary. There are several ways that this can be handled: for example, using the sequence_length argument to the static_rnn() or dynamic_rnn() functions to specify each sentence's length (as discussed earlier). However, another approach is used in the tutorial (presumably for performance reasons): sentences are grouped into buckets of similar lengths (e.g., a bucket for the 1- to 6-word sentences, another for the 7- to 12-word sentences, and so on11), and the shorter sentences are padded using a special padding token (e.g., "<pad>"). For example, "I drink milk" becomes "<pad> <pad> <pad> milk drink I", and its translation becomes "Je bois du lait <eos> <pad>". Of course, we want to ignore any output past the EOS token. For this, the tutorial's implementation uses a target_weights vector. For example, for the target sentence "Je bois du lait <eos> <pad>", the weights would be set to [1.0, 1.0, 1.0, 1.0, 1.0, 0.0] (notice the weight 0.0 that corresponds to the padding token in the target sentence). Simply multiplying the losses by the target weights will zero out the losses that correspond to words past EOS tokens.

Second, when the output vocabulary is large (which is the case here), outputting a probability for each and every possible word would be terribly slow. If the target vocabulary contains, say, 50,000 French words, then the decoder would output 50,000-dimensional vectors, and then computing the softmax function over such a large vector would be very computationally intensive. To avoid this, one solution is to let the decoder output much smaller vectors, such as 1,000-dimensional vectors, then use a sampling technique to estimate the loss without having to compute it over every single word in the target vocabulary. This Sampled Softmax technique was introduced in 2015 by Sébastien Jean et al.12 In TensorFlow you can use the sampled_softmax_loss() function, as sketched below.
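Here is a minimal sketch of what that call could look like; every tensor name below (proj_w, proj_b, decoder_out, target_words) is hypothetical, since the actual tutorial code is organized differently:

loss = tf.reduce_mean(tf.nn.sampled_softmax_loss(
    weights=proj_w,        # output projection matrix, shape [50000, 1000]
    biases=proj_b,         # shape [50000]
    labels=target_words,   # true word ids, shape [batch_size, 1]
    inputs=decoder_out,    # decoder outputs, shape [batch_size, 1000]
    num_sampled=512,       # number of negative classes sampled per batch
    num_classes=50000))    # full target vocabulary size

At inference time you compute the full softmax instead; sampling is only used to speed up training.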

Third,thetutorial’simplementationusesanattentionmechanismthatletsthedecoderpeekintotheinputsequence.AttentionaugmentedRNNsarebeyondthescopeofthisbook,butifyouare

Page 503: index-of.esindex-of.es/Varios-2/Hands on Machine Learning with... · Revision History for the First Edition 2017-03-10: First Release 2017-06-09: Second Release See for release

interestedtherearehelpfulpapersaboutmachinetranslation,13machinereading,14andimagecaptions15usingattention.

Finally,thetutorial’simplementationmakesuseofthetf.nn.legacy_seq2seqmodule,whichprovidestoolstobuildvariousEncoder–Decodermodelseasily.Forexample,theembedding_rnn_seq2seq()functioncreatesasimpleEncoder–Decodermodelthatautomaticallytakescareofwordembeddingsforyou,justliketheonerepresentedinFigure14-15.Thiscodewilllikelybeupdatedquicklytousethenewtf.nn.seq2seqmodule.

Younowhaveallthetoolsyouneedtounderstandthesequence-to-sequencetutorial’simplementation.CheckitoutandtrainyourownEnglish-to-Frenchtranslator!


Exercises

1. Can you think of a few applications for a sequence-to-sequence RNN? What about a sequence-to-vector RNN? And a vector-to-sequence RNN?

2. Why do people use encoder–decoder RNNs rather than plain sequence-to-sequence RNNs for automatic translation?

3. How could you combine a convolutional neural network with an RNN to classify videos?

4. What are the advantages of building an RNN using dynamic_rnn() rather than static_rnn()?

5. How can you deal with variable-length input sequences? What about variable-length output sequences?

6. What is a common way to distribute training and execution of a deep RNN across multiple GPUs?

7. Embedded Reber grammars were used by Hochreiter and Schmidhuber in their paper about LSTMs. They are artificial grammars that produce strings such as "BPBTSXXVPSEPE." Check out Jenny Orr's nice introduction to this topic. Choose a particular embedded Reber grammar (such as the one represented on Jenny Orr's page), then train an RNN to identify whether a string respects that grammar or not. You will first need to write a function capable of generating a training batch containing about 50% strings that respect the grammar, and 50% that don't.

8. Tackle the "How much did it rain? II" Kaggle competition. This is a time series prediction task: you are given snapshots of polarimetric radar values and asked to predict the hourly rain gauge total. Luis Andre Dutra e Silva's interview gives some interesting insights into the techniques he used to reach second place in the competition. In particular, he used an RNN composed of two LSTM layers.

9. Go through TensorFlow's Word2Vec tutorial to create word embeddings, and then go through the Seq2Seq tutorial to train an English-to-French translation system.

Solutions to these exercises are available in Appendix A.

1. Note that many researchers prefer to use the hyperbolic tangent (tanh) activation function in RNNs rather than the ReLU activation function. For example, take a look at Vu Pham et al.'s paper "Dropout Improves Recurrent Neural Networks for Handwriting Recognition". However, ReLU-based RNNs are also possible, as shown in Quoc V. Le et al.'s paper "A Simple Way to Initialize Recurrent Networks of Rectified Linear Units".

2. This uses the decorator design pattern.

3. "Long Short-Term Memory," S. Hochreiter and J. Schmidhuber (1997).

4. "Long Short-Term Memory Recurrent Neural Network Architectures for Large Scale Acoustic Modeling," H. Sak et al. (2014).

5. "Recurrent Neural Network Regularization," W. Zaremba et al. (2015).

6. "Recurrent Nets that Time and Count," F. Gers and J. Schmidhuber (2000).

7. "Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation," K. Cho et al. (2014).

8. A 2015 paper by Klaus Greff et al., "LSTM: A Search Space Odyssey," seems to show that all LSTM variants perform roughly the same.



9. For more details, check out Christopher Olah's great post, or Sebastian Ruder's series of posts.

10. "Sequence to Sequence Learning with Neural Networks," I. Sutskever et al. (2014).

11. The bucket sizes used in the tutorial are different.

12. "On Using Very Large Target Vocabulary for Neural Machine Translation," S. Jean et al. (2015).

13. "Neural Machine Translation by Jointly Learning to Align and Translate," D. Bahdanau et al. (2014).

14. "Long Short-Term Memory-Networks for Machine Reading," J. Cheng (2016).

15. "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention," K. Xu et al. (2015).



Chapter 15. Autoencoders

Autoencoders are artificial neural networks capable of learning efficient representations of the input data, called codings, without any supervision (i.e., the training set is unlabeled). These codings typically have a much lower dimensionality than the input data, making autoencoders useful for dimensionality reduction (see Chapter 8). More importantly, autoencoders act as powerful feature detectors, and they can be used for unsupervised pretraining of deep neural networks (as we discussed in Chapter 11). Lastly, they are capable of randomly generating new data that looks very similar to the training data; this is called a generative model. For example, you could train an autoencoder on pictures of faces, and it would then be able to generate new faces.

Surprisingly, autoencoders work by simply learning to copy their inputs to their outputs. This may sound like a trivial task, but we will see that constraining the network in various ways can make it rather difficult. For example, you can limit the size of the internal representation, or you can add noise to the inputs and train the network to recover the original inputs. These constraints prevent the autoencoder from trivially copying the inputs directly to the outputs, which forces it to learn efficient ways of representing the data. In short, the codings are byproducts of the autoencoder's attempt to learn the identity function under some constraints.

In this chapter we will explain in more depth how autoencoders work, what types of constraints can be imposed, and how to implement them using TensorFlow, whether it is for dimensionality reduction, feature extraction, unsupervised pretraining, or as generative models.


Efficient Data Representations

Which of the following number sequences do you find the easiest to memorize?

40, 27, 25, 36, 81, 57, 10, 73, 19, 68

50, 25, 76, 38, 19, 58, 29, 88, 44, 22, 11, 34, 17, 52, 26, 13, 40, 20

At first glance, it would seem that the first sequence should be easier, since it is much shorter. However, if you look carefully at the second sequence, you may notice that it follows two simple rules: even numbers are followed by their half, and odd numbers are followed by their triple plus one (this is a famous sequence known as the hailstone sequence). Once you notice this pattern, the second sequence becomes much easier to memorize than the first because you only need to memorize the two rules, the first number, and the length of the sequence. Note that if you could quickly and easily memorize very long sequences, you would not care much about the existence of a pattern in the second sequence. You would just learn every number by heart, and that would be that. It is the fact that it is hard to memorize long sequences that makes it useful to recognize patterns, and hopefully this clarifies why constraining an autoencoder during training pushes it to discover and exploit patterns in the data.
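If you want to play with it, here is a short sketch that generates the hailstone sequence from those two rules:

def hailstone(x, length):
    seq = [x]
    for _ in range(length - 1):
        x = x // 2 if x % 2 == 0 else 3 * x + 1  # half if even, triple plus one if odd
        seq.append(x)
    return seq

print(hailstone(50, 18))  # [50, 25, 76, 38, 19, 58, 29, 88, 44, 22, 11, 34, 17, 52, 26, 13, 40, 20]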

The relationship between memory, perception, and pattern matching was famously studied by William Chase and Herbert Simon in the early 1970s.1 They observed that expert chess players were able to memorize the positions of all the pieces in a game by looking at the board for just 5 seconds, a task that most people would find impossible. However, this was only the case when the pieces were placed in realistic positions (from actual games), not when the pieces were placed randomly. Chess experts don't have a much better memory than you and I, they just see chess patterns more easily thanks to their experience with the game. Noticing patterns helps them store information efficiently.

Just like the chess players in this memory experiment, an autoencoder looks at the inputs, converts them to an efficient internal representation, and then spits out something that (hopefully) looks very close to the inputs. An autoencoder is always composed of two parts: an encoder (or recognition network) that converts the inputs to an internal representation, followed by a decoder (or generative network) that converts the internal representation to the outputs (see Figure 15-1).

As you can see, an autoencoder typically has the same architecture as a Multi-Layer Perceptron (MLP; see Chapter 10), except that the number of neurons in the output layer must be equal to the number of inputs. In this example, there is just one hidden layer composed of two neurons (the encoder), and one output layer composed of three neurons (the decoder). The outputs are often called the reconstructions since the autoencoder tries to reconstruct the inputs, and the cost function contains a reconstruction loss that penalizes the model when the reconstructions are different from the inputs.


Figure 15-1. The chess memory experiment (left) and a simple autoencoder (right)

Because the internal representation has a lower dimensionality than the input data (it is 2D instead of 3D), the autoencoder is said to be undercomplete. An undercomplete autoencoder cannot trivially copy its inputs to the codings, yet it must find a way to output a copy of its inputs. It is forced to learn the most important features in the input data (and drop the unimportant ones).

Let's see how to implement a very simple undercomplete autoencoder for dimensionality reduction.


Performing PCA with an Undercomplete Linear Autoencoder

If the autoencoder uses only linear activations and the cost function is the Mean Squared Error (MSE), then it can be shown that it ends up performing Principal Component Analysis (see Chapter 8).

The following code builds a simple linear autoencoder to perform PCA on a 3D dataset, projecting it to 2D:

import tensorflow as tf

n_inputs = 3   # 3D inputs
n_hidden = 2   # 2D codings
n_outputs = n_inputs

learning_rate = 0.01

X = tf.placeholder(tf.float32, shape=[None, n_inputs])
hidden = tf.layers.dense(X, n_hidden)
outputs = tf.layers.dense(hidden, n_outputs)

reconstruction_loss = tf.reduce_mean(tf.square(outputs - X))  # MSE

optimizer = tf.train.AdamOptimizer(learning_rate)
training_op = optimizer.minimize(reconstruction_loss)

init = tf.global_variables_initializer()

This code is really not very different from all the MLPs we built in past chapters. The two things to note are:

The number of outputs is equal to the number of inputs.

To perform simple PCA, we do not use any activation function (i.e., all neurons are linear) and the cost function is the MSE. We will see more complex autoencoders shortly.

Nowlet’sloadthedataset,trainthemodelonthetrainingset,anduseittoencodethetestset(i.e.,projectitto2D):

X_train,X_test=[...]#loadthedataset

n_iterations=1000

codings=hidden#theoutputofthehiddenlayerprovidesthecodings

withtf.Session()assess:

init.run()

foriterationinrange(n_iterations):

training_op.run(feed_dict={X:X_train})#nolabels(unsupervised)

codings_val=codings.eval(feed_dict={X:X_test})

Figure 15-2 shows the original 3D dataset (at the left) and the output of the autoencoder's hidden layer (i.e., the coding layer, at the right). As you can see, the autoencoder found the best 2D plane to project the data onto, preserving as much variance in the data as it could (just like PCA).


Figure 15-2. PCA performed by an undercomplete linear autoencoder


Stacked Autoencoders

Just like other neural networks we have discussed, autoencoders can have multiple hidden layers. In this case they are called stacked autoencoders (or deep autoencoders). Adding more layers helps the autoencoder learn more complex codings. However, one must be careful not to make the autoencoder too powerful. Imagine an encoder so powerful that it just learns to map each input to a single arbitrary number (and the decoder learns the reverse mapping). Obviously such an autoencoder will reconstruct the training data perfectly, but it will not have learned any useful data representation in the process (and it is unlikely to generalize well to new instances).

The architecture of a stacked autoencoder is typically symmetrical with regards to the central hidden layer (the coding layer). To put it simply, it looks like a sandwich. For example, an autoencoder for MNIST (introduced in Chapter 3) may have 784 inputs, followed by a hidden layer with 300 neurons, then a central hidden layer of 150 neurons, then another hidden layer with 300 neurons, and an output layer with 784 neurons. This stacked autoencoder is represented in Figure 15-3.

Figure 15-3. Stacked autoencoder


TensorFlow Implementation

You can implement a stacked autoencoder very much like a regular deep MLP. In particular, the same techniques we used in Chapter 11 for training deep nets can be applied. For example, the following code builds a stacked autoencoder for MNIST, using He initialization, the ELU activation function, and ℓ2 regularization. The code should look very familiar, except that there are no labels (no y):

from functools import partial

n_inputs = 28 * 28  # for MNIST
n_hidden1 = 300
n_hidden2 = 150  # codings
n_hidden3 = n_hidden1
n_outputs = n_inputs

learning_rate = 0.01
l2_reg = 0.0001

X = tf.placeholder(tf.float32, shape=[None, n_inputs])

he_init = tf.contrib.layers.variance_scaling_initializer()
l2_regularizer = tf.contrib.layers.l2_regularizer(l2_reg)
my_dense_layer = partial(tf.layers.dense,
                         activation=tf.nn.elu,
                         kernel_initializer=he_init,
                         kernel_regularizer=l2_regularizer)

hidden1 = my_dense_layer(X, n_hidden1)
hidden2 = my_dense_layer(hidden1, n_hidden2)  # codings
hidden3 = my_dense_layer(hidden2, n_hidden3)
outputs = my_dense_layer(hidden3, n_outputs, activation=None)

reconstruction_loss = tf.reduce_mean(tf.square(outputs - X))  # MSE
reg_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
loss = tf.add_n([reconstruction_loss] + reg_losses)

optimizer = tf.train.AdamOptimizer(learning_rate)
training_op = optimizer.minimize(loss)

init = tf.global_variables_initializer()

You can then train the model normally. Note that the digit labels (y_batch) are unused:

n_epochs = 5
batch_size = 150

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        n_batches = mnist.train.num_examples // batch_size
        for iteration in range(n_batches):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run(training_op, feed_dict={X: X_batch})


Tying Weights

When an autoencoder is neatly symmetrical, like the one we just built, a common technique is to tie the weights of the decoder layers to the weights of the encoder layers. This halves the number of weights in the model, speeding up training and limiting the risk of overfitting. Specifically, if the autoencoder has a total of N layers (not counting the input layer), and WL represents the connection weights of the Lth layer (e.g., layer 1 is the first hidden layer, layer N/2 is the coding layer, and layer N is the output layer), then the decoder layer weights can be defined simply as: WN–L+1 = WLᵀ (with L = 1, 2, …, N/2).

Unfortunately, implementing tied weights in TensorFlow using the dense() function is a bit cumbersome; it's actually easier to just define the layers manually. The code ends up significantly more verbose:

activation = tf.nn.elu
regularizer = tf.contrib.layers.l2_regularizer(l2_reg)
initializer = tf.contrib.layers.variance_scaling_initializer()

X = tf.placeholder(tf.float32, shape=[None, n_inputs])

weights1_init = initializer([n_inputs, n_hidden1])
weights2_init = initializer([n_hidden1, n_hidden2])

weights1 = tf.Variable(weights1_init, dtype=tf.float32, name="weights1")
weights2 = tf.Variable(weights2_init, dtype=tf.float32, name="weights2")
weights3 = tf.transpose(weights2, name="weights3")  # tied weights
weights4 = tf.transpose(weights1, name="weights4")  # tied weights

biases1 = tf.Variable(tf.zeros(n_hidden1), name="biases1")
biases2 = tf.Variable(tf.zeros(n_hidden2), name="biases2")
biases3 = tf.Variable(tf.zeros(n_hidden3), name="biases3")
biases4 = tf.Variable(tf.zeros(n_outputs), name="biases4")

hidden1 = activation(tf.matmul(X, weights1) + biases1)
hidden2 = activation(tf.matmul(hidden1, weights2) + biases2)
hidden3 = activation(tf.matmul(hidden2, weights3) + biases3)
outputs = tf.matmul(hidden3, weights4) + biases4

reconstruction_loss = tf.reduce_mean(tf.square(outputs - X))
reg_loss = regularizer(weights1) + regularizer(weights2)
loss = reconstruction_loss + reg_loss

optimizer = tf.train.AdamOptimizer(learning_rate)
training_op = optimizer.minimize(loss)

init = tf.global_variables_initializer()

This code is fairly straightforward, but there are a few important things to note:

First, weights3 and weights4 are not variables, they are respectively the transpose of weights2 and weights1 (they are "tied" to them).

Second, since they are not variables, it's no use regularizing them: we only regularize weights1 and weights2.

Third, biases are never tied, and never regularized.


Training One Autoencoder at a Time

Rather than training the whole stacked autoencoder in one go like we just did, it is often much faster to train one shallow autoencoder at a time, then stack all of them into a single stacked autoencoder (hence the name), as shown on Figure 15-4. This is especially useful for very deep autoencoders.

Figure 15-4. Training one autoencoder at a time

During the first phase of training, the first autoencoder learns to reconstruct the inputs. During the second phase, the second autoencoder learns to reconstruct the output of the first autoencoder's hidden layer. Finally, you just build a big sandwich using all these autoencoders, as shown in Figure 15-4 (i.e., you first stack the hidden layers of each autoencoder, then the output layers in reverse order). This gives you the final stacked autoencoder. You could easily train more autoencoders this way, building a very deep stacked autoencoder.

To implement this multiphase training algorithm, the simplest approach is to use a different TensorFlow graph for each phase. After training an autoencoder, you just run the training set through it and capture the output of the hidden layer. This output then serves as the training set for the next autoencoder. Once all autoencoders have been trained this way, you simply copy the weights and biases from each autoencoder and use them to build the stacked autoencoder. Implementing this approach is quite straightforward, so we won't detail it here, but please check out the code in the Jupyter notebooks for an example.

Another approach is to use a single graph containing the whole stacked autoencoder, plus some extra operations to perform each training phase, as shown in Figure 15-5.


Figure 15-5. A single graph to train a stacked autoencoder

This deserves a bit of explanation: The central column in the graph is the full stacked autoencoder. This part can be used after training.

The left column is the set of operations needed to run the first phase of training. It creates an output layer that bypasses hidden layers 2 and 3. This output layer shares the same weights and biases as the stacked autoencoder's output layer. On top of that are the training operations that will aim at making the output as close as possible to the inputs. Thus, this phase will train the weights and biases for the hidden layer 1 and the output layer (i.e., the first autoencoder).

The right column in the graph is the set of operations needed to run the second phase of training. It adds the training operation that will aim at making the output of hidden layer 3 as close as possible to the output of hidden layer 1. Note that we must freeze hidden layer 1 while running phase 2. This phase will train the weights and biases for hidden layers 2 and 3 (i.e., the second autoencoder).

The TensorFlow code looks like this:

[...]  # Build the whole stacked autoencoder normally.
       # In this example, the weights are not tied.

optimizer = tf.train.AdamOptimizer(learning_rate)

with tf.name_scope("phase1"):
    phase1_outputs = tf.matmul(hidden1, weights4) + biases4
    phase1_reconstruction_loss = tf.reduce_mean(tf.square(phase1_outputs - X))
    phase1_reg_loss = regularizer(weights1) + regularizer(weights4)
    phase1_loss = phase1_reconstruction_loss + phase1_reg_loss
    phase1_training_op = optimizer.minimize(phase1_loss)

with tf.name_scope("phase2"):
    phase2_reconstruction_loss = tf.reduce_mean(tf.square(hidden3 - hidden1))
    phase2_reg_loss = regularizer(weights2) + regularizer(weights3)
    phase2_loss = phase2_reconstruction_loss + phase2_reg_loss
    train_vars = [weights2, biases2, weights3, biases3]
    phase2_training_op = optimizer.minimize(phase2_loss, var_list=train_vars)

The first phase is rather straightforward: we just create an output layer that skips hidden layers 2 and 3, then build the training operations to minimize the distance between the outputs and the inputs (plus some regularization).

The second phase just adds the operations needed to minimize the distance between the output of hidden layer 3 and hidden layer 1 (also with some regularization). Most importantly, we provide the list of trainable variables to the minimize() method, making sure to leave out weights1 and biases1; this effectively freezes hidden layer 1 during phase 2.

During the execution phase, all you need to do is run the phase 1 training op for a number of epochs, then the phase 2 training op for some more epochs.
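A minimal sketch of that execution phase (the epoch counts here are arbitrary; batch fetching is the same as for the earlier MNIST code):

training_ops = [phase1_training_op, phase2_training_op]
n_epochs_per_phase = [4, 4]  # arbitrary

with tf.Session() as sess:
    init.run()
    for phase in range(2):
        for epoch in range(n_epochs_per_phase[phase]):
            n_batches = mnist.train.num_examples // batch_size
            for iteration in range(n_batches):
                X_batch, y_batch = mnist.train.next_batch(batch_size)
                sess.run(training_ops[phase], feed_dict={X: X_batch})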

TIP: Since hidden layer 1 is frozen during phase 2, its output will always be the same for any given training instance. To avoid having to recompute the output of hidden layer 1 at every single epoch, you can compute it for the whole training set at the end of phase 1, then directly feed the cached output of hidden layer 1 during phase 2. This can give you a nice performance boost.


Visualizing the Reconstructions

One way to ensure that an autoencoder is properly trained is to compare the inputs and the outputs. They must be fairly similar, and the differences should be unimportant details. Let's plot two random digits and their reconstructions:

import matplotlib.pyplot as plt

n_test_digits = 2
X_test = mnist.test.images[:n_test_digits]

with tf.Session() as sess:
    [...]  # Train the Autoencoder
    outputs_val = outputs.eval(feed_dict={X: X_test})

def plot_image(image, shape=[28, 28]):
    plt.imshow(image.reshape(shape), cmap="Greys", interpolation="nearest")
    plt.axis("off")

for digit_index in range(n_test_digits):
    plt.subplot(n_test_digits, 2, digit_index * 2 + 1)
    plot_image(X_test[digit_index])
    plt.subplot(n_test_digits, 2, digit_index * 2 + 2)
    plot_image(outputs_val[digit_index])

Figure 15-6 shows the resulting images.

Figure 15-6. Original digits (left) and their reconstructions (right)

Looks close enough. So the autoencoder has properly learned to reproduce its inputs, but has it learned useful features? Let's take a look.


Visualizing Features

Once your autoencoder has learned some features, you may want to take a look at them. There are various techniques for this. Arguably the simplest technique is to consider each neuron in every hidden layer, and find the training instances that activate it the most. This is especially useful for the top hidden layers since they often capture relatively large features that you can easily spot in a group of training instances that contain them. For example, if a neuron strongly activates when it sees a cat in a picture, it will be pretty obvious that the pictures that activate it the most all contain cats. However, for lower layers, this technique does not work so well, as the features are smaller and more abstract, so it's often hard to understand exactly what the neuron is getting all excited about.

Let's look at another technique. For each neuron in the first hidden layer, you can create an image where a pixel's intensity corresponds to the weight of the connection to the given neuron. For example, the following code plots the features learned by five neurons in the first hidden layer:

with tf.Session() as sess:
    [...]  # train autoencoder
    weights1_val = weights1.eval()

for i in range(5):
    plt.subplot(1, 5, i + 1)
    plot_image(weights1_val.T[i])

You may get low-level features such as the ones shown in Figure 15-7.

Figure 15-7. Features learned by five neurons from the first hidden layer

The first four features seem to correspond to small patches, while the fifth feature seems to look for vertical strokes (note that these features come from the stacked denoising autoencoder that we will discuss later).

Another technique is to feed the autoencoder a random input image, measure the activation of the neuron you are interested in, and then perform backpropagation to tweak the image in such a way that the neuron will activate even more. If you iterate several times (performing gradient ascent), the image will gradually turn into the most exciting image (for the neuron). This is a useful technique to visualize the kinds of inputs that a neuron is looking for.
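A minimal sketch of this gradient-ascent trick, assuming the stacked autoencoder built earlier (the neuron index, step count, and step size are all arbitrary choices):

import numpy as np

neuron_activation = hidden2[0, 7]  # an arbitrary neuron in the coding layer
gradient = tf.gradients(neuron_activation, X)[0]

with tf.Session() as sess:
    [...]  # train or restore the autoencoder
    image = np.random.rand(1, n_inputs)  # start from a random input image
    for step in range(100):
        grad_val = sess.run(gradient, feed_dict={X: image})
        image += 0.1 * grad_val  # one small gradient ascent step on the input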

Finally, if you are using an autoencoder to perform unsupervised pretraining (for example, for a classification task), a simple way to verify that the features learned by the autoencoder are useful is to measure the performance of the classifier.


UnsupervisedPretrainingUsingStackedAutoencodersAswediscussedinChapter11,ifyouaretacklingacomplexsupervisedtaskbutyoudonothavealotoflabeledtrainingdata,onesolutionistofindaneuralnetworkthatperformsasimilartask,andthenreuseitslowerlayers.Thismakesitpossibletotrainahigh-performancemodelusingonlylittletrainingdatabecauseyourneuralnetworkwon’thavetolearnallthelow-levelfeatures;itwilljustreusethefeaturedetectorslearnedbytheexistingnet.

Similarly,ifyouhavealargedatasetbutmostofitisunlabeled,youcanfirsttrainastackedautoencoderusingallthedata,thenreusethelowerlayerstocreateaneuralnetworkforyouractualtask,andtrainitusingthelabeleddata.Forexample,Figure15-8showshowtouseastackedautoencodertoperformunsupervisedpretrainingforaclassificationneuralnetwork.Thestackedautoencoderitselfistypicallytrainedoneautoencoderatatime,asdiscussedearlier.Whentrainingtheclassifier,ifyoureallydon’thavemuchlabeledtrainingdata,youmaywanttofreezethepretrainedlayers(atleastthelowerones).

Figure 15-8. Unsupervised pretraining using autoencoders

NOTE
This situation is actually quite common, because building a large unlabeled dataset is often cheap (e.g., a simple script can download millions of images off the internet), but labeling them can only be done reliably by humans (e.g., classifying images as cute or not). Labeling instances is time-consuming and costly, so it is quite common to have only a few thousand labeled instances.

As we discussed earlier, one of the triggers of the current Deep Learning tsunami is the discovery in 2006 by Geoffrey Hinton et al. that deep neural networks can be pretrained in an unsupervised fashion. They used restricted Boltzmann machines for that (see Appendix E), but in 2007 Yoshua Bengio et al. showed² that autoencoders worked just as well.

There is nothing special about the TensorFlow implementation: just train an autoencoder using all the training data, then reuse its encoder layers to create a new neural network (see Chapter 11 for more details on how to reuse pretrained layers, or check out the code examples in the Jupyter notebooks).
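For example, here is a minimal sketch of that reuse, assuming weights1 and biases1 are the trained first encoder layer's variables (all names and hyperparameters here are illustrative, not taken from the book's notebooks):

n_outputs = 10  # e.g., one class per digit

X = tf.placeholder(tf.float32, shape=[None, n_inputs])
y = tf.placeholder(tf.int32, shape=[None])

hidden1 = tf.nn.elu(tf.matmul(X, weights1) + biases1)  # reused encoder layer
logits = tf.layers.dense(hidden1, n_outputs, name="new_logits")

xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y,
                                                          logits=logits)
loss = tf.reduce_mean(xentropy)

# Freeze the pretrained layer by optimizing only the new layer's variables
new_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES,
                             scope="new_logits")
training_op = tf.train.AdamOptimizer(0.01).minimize(loss, var_list=new_vars)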

Up to now, in order to force the autoencoder to learn interesting features, we have limited the size of the coding layer, making it undercomplete. There are actually many other kinds of constraints that can be used, including ones that allow the coding layer to be just as large as the inputs, or even larger, resulting in an overcomplete autoencoder. Let's look at some of those approaches now.

Denoising Autoencoders

Another way to force the autoencoder to learn useful features is to add noise to its inputs, training it to recover the original, noise-free inputs. This prevents the autoencoder from trivially copying its inputs to its outputs, so it ends up having to find patterns in the data.

The idea of using autoencoders to remove noise has been around since the 1980s (e.g., it is mentioned in Yann LeCun's 1987 master's thesis). In a 2008 paper,³ Pascal Vincent et al. showed that autoencoders could also be used for feature extraction. In a 2010 paper,⁴ Vincent et al. introduced stacked denoising autoencoders.

The noise can be pure Gaussian noise added to the inputs, or it can be randomly switched-off inputs, just like in dropout (introduced in Chapter 11). Figure 15-9 shows both options.

Figure 15-9. Denoising autoencoders, with Gaussian noise (left) or dropout (right)

TensorFlow Implementation

Implementing denoising autoencoders in TensorFlow is not too hard. Let's start with Gaussian noise. It's really just like training a regular autoencoder, except you add noise to the inputs, and the reconstruction loss is calculated based on the original inputs:

noise_level = 1.0

X = tf.placeholder(tf.float32, shape=[None, n_inputs])
X_noisy = X + noise_level * tf.random_normal(tf.shape(X))

hidden1 = tf.layers.dense(X_noisy, n_hidden1, activation=tf.nn.relu,
                          name="hidden1")
[...]
reconstruction_loss = tf.reduce_mean(tf.square(outputs - X))  # MSE
[...]

WARNING
Since the shape of X is only partially defined during the construction phase, we cannot know in advance the shape of the noise that we must add to X. We cannot call X.get_shape() because this would just return the partially defined shape of X ([None, n_inputs]), and random_normal() expects a fully defined shape so it would raise an exception. Instead, we call tf.shape(X), which creates an operation that will return the shape of X at runtime, which will be fully defined at that point.

Implementing the dropout version, which is more common, is not much harder:

dropout_rate = 0.3

training = tf.placeholder_with_default(False, shape=(), name='training')

X = tf.placeholder(tf.float32, shape=[None, n_inputs])
X_drop = tf.layers.dropout(X, dropout_rate, training=training)

hidden1 = tf.layers.dense(X_drop, n_hidden1, activation=tf.nn.relu,
                          name="hidden1")
[...]
reconstruction_loss = tf.reduce_mean(tf.square(outputs - X))  # MSE
[...]

During training we must set training to True (as explained in Chapter 11) using the feed_dict:

sess.run(training_op, feed_dict={X: X_batch, training: True})

During testing it is not necessary to set training to False, since we set that as the default in the call to the placeholder_with_default() function.

Sparse Autoencoders

Another kind of constraint that often leads to good feature extraction is sparsity: by adding an appropriate term to the cost function, the autoencoder is pushed to reduce the number of active neurons in the coding layer. For example, it may be pushed to have on average only 5% significantly active neurons in the coding layer. This forces the autoencoder to represent each input as a combination of a small number of activations. As a result, each neuron in the coding layer typically ends up representing a useful feature (if you could speak only a few words per month, you would probably try to make them worth listening to).

In order to favor sparse models, we must first measure the actual sparsity of the coding layer at each training iteration. We do so by computing the average activation of each neuron in the coding layer, over the whole training batch. The batch size must not be too small, or else the mean will not be accurate.

Once we have the mean activation per neuron, we want to penalize the neurons that are too active by adding a sparsity loss to the cost function. For example, if we measure that a neuron has an average activation of 0.3, but the target sparsity is 0.1, it must be penalized to activate less. One approach could be simply adding the squared error (0.3 – 0.1)² to the cost function, but in practice a better approach is to use the Kullback–Leibler divergence (briefly discussed in Chapter 4), which has much stronger gradients than the Mean Squared Error, as you can see in Figure 15-10.

Figure 15-10. Sparsity loss

Given two discrete probability distributions P and Q, the KL divergence between these distributions, noted D_KL(P ∥ Q), can be computed using Equation 15-1.

Equation 15-1. Kullback–Leibler divergence

$D_{KL}(P \parallel Q) = \sum_i P(i) \log \dfrac{P(i)}{Q(i)}$

In our case, we want to measure the divergence between the target probability p that a neuron in the coding layer will activate, and the actual probability q (i.e., the mean activation over the training batch). So the KL divergence simplifies to Equation 15-2.

Equation 15-2. KL divergence between the target sparsity p and the actual sparsity q

$D_{KL}(p \parallel q) = p \log \dfrac{p}{q} + (1 - p) \log \dfrac{1 - p}{1 - q}$
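As a quick sanity check of Equation 15-2 (this snippet is illustrative, not part of the book's code): for the neuron mentioned above, with target sparsity p = 0.1 and mean activation q = 0.3, the KL-based sparsity loss is roughly three times larger than the squared error:

import numpy as np

p, q = 0.1, 0.3  # target sparsity, measured mean activation
kl = p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))
print(kl)            # ~0.116
print((q - p) ** 2)  # 0.04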

Once we have computed the sparsity loss for each neuron in the coding layer, we just sum up these losses, and add the result to the cost function. In order to control the relative importance of the sparsity loss and the reconstruction loss, we can multiply the sparsity loss by a sparsity weight hyperparameter. If this weight is too high, the model will stick closely to the target sparsity, but it may not reconstruct the inputs properly, making the model useless. Conversely, if it is too low, the model will mostly ignore the sparsity objective and it will not learn any interesting features.

TensorFlow Implementation

We now have all we need to implement a sparse autoencoder using TensorFlow:

def kl_divergence(p, q):
    return p * tf.log(p / q) + (1 - p) * tf.log((1 - p) / (1 - q))

learning_rate = 0.01
sparsity_target = 0.1
sparsity_weight = 0.2

[...]  # Build a normal autoencoder (in this example the coding layer is hidden1)

hidden1_mean = tf.reduce_mean(hidden1, axis=0)  # batch mean
sparsity_loss = tf.reduce_sum(kl_divergence(sparsity_target, hidden1_mean))
reconstruction_loss = tf.reduce_mean(tf.square(outputs - X))  # MSE
loss = reconstruction_loss + sparsity_weight * sparsity_loss

optimizer = tf.train.AdamOptimizer(learning_rate)
training_op = optimizer.minimize(loss)

An important detail is the fact that the activations of the coding layer must be between 0 and 1 (but not equal to 0 or 1), or else the KL divergence will return NaN (Not a Number). A simple solution is to use the logistic activation function for the coding layer:

hidden1 = tf.layers.dense(X, n_hidden1, activation=tf.nn.sigmoid)

One simple trick can speed up convergence: instead of using the MSE, we can choose a reconstruction loss that will have larger gradients. Cross entropy is often a good choice. To use it, we must normalize the inputs to make them take on values from 0 to 1, and use the logistic activation function in the output layer so the outputs also take on values from 0 to 1. TensorFlow's sigmoid_cross_entropy_with_logits() function takes care of efficiently applying the logistic (sigmoid) activation function to the outputs and computing the cross entropy:

[...]
logits = tf.layers.dense(hidden1, n_outputs)
outputs = tf.nn.sigmoid(logits)

xentropy = tf.nn.sigmoid_cross_entropy_with_logits(labels=X, logits=logits)
reconstruction_loss = tf.reduce_sum(xentropy)

Note that the outputs operation is not needed during training (we use it only when we want to look at the reconstructions).

Page 527: index-of.esindex-of.es/Varios-2/Hands on Machine Learning with... · Revision History for the First Edition 2017-03-10: First Release 2017-06-09: Second Release See for release

VariationalAutoencodersAnotherimportantcategoryofautoencoderswasintroducedin2014byDiederikKingmaandMaxWelling,5andhasquicklybecomeoneofthemostpopulartypesofautoencoders:variationalautoencoders.

Theyarequitedifferentfromalltheautoencoderswehavediscussedsofar,inparticular:Theyareprobabilisticautoencoders,meaningthattheiroutputsarepartlydeterminedbychance,evenaftertraining(asopposedtodenoisingautoencoders,whichuserandomnessonlyduringtraining).

Mostimportantly,theyaregenerativeautoencoders,meaningthattheycangeneratenewinstancesthatlookliketheyweresampledfromthetrainingset.

BoththesepropertiesmakethemrathersimilartoRBMs(seeAppendixE),buttheyareeasiertotrainandthesamplingprocessismuchfaster(withRBMsyouneedtowaitforthenetworktostabilizeintoa“thermalequilibrium”beforeyoucansampleanewinstance).

Let’stakealookathowtheywork.Figure15-11(left)showsavariationalautoencoder.Youcanrecognize,ofcourse,thebasicstructureofallautoencoders,withanencoderfollowedbyadecoder(inthisexample,theybothhavetwohiddenlayers),butthereisatwist:insteadofdirectlyproducingacodingforagiveninput,theencoderproducesameancodingμandastandarddeviationσ.TheactualcodingisthensampledrandomlyfromaGaussiandistributionwithmeanμandstandarddeviationσ.Afterthatthedecoderjustdecodesthesampledcodingnormally.Therightpartofthediagramshowsatraininginstancegoingthroughthisautoencoder.First,theencoderproducesμandσ,thenacodingissampledrandomly(noticethatitisnotexactlylocatedatμ),andfinallythiscodingisdecoded,andthefinaloutputresemblesthetraininginstance.

Figure 15-11. Variational autoencoder (left), and an instance going through it (right)

As you can see on the diagram, although the inputs may have a very convoluted distribution, a variational autoencoder tends to produce codings that look as though they were sampled from a simple Gaussian distribution:⁶ during training, the cost function (discussed next) pushes the codings to gradually migrate within the coding space (also called the latent space) to occupy a roughly (hyper)spherical region that looks like a cloud of Gaussian points. One great consequence is that after training a variational autoencoder, you can very easily generate a new instance: just sample a random coding from the Gaussian distribution, decode it, and voilà!

So let's look at the cost function. It is composed of two parts. The first is the usual reconstruction loss that pushes the autoencoder to reproduce its inputs (we can use cross entropy for this, as discussed earlier). The second is the latent loss that pushes the autoencoder to have codings that look as though they were sampled from a simple Gaussian distribution, for which we use the KL divergence between the target distribution (the Gaussian distribution) and the actual distribution of the codings. The math is a bit more complex than earlier, in particular because of the Gaussian noise, which limits the amount of information that can be transmitted to the coding layer (thus pushing the autoencoder to learn useful features). Luckily, the equations simplify to the following code for the latent loss:⁷

eps = 1e-10  # smoothing term to avoid computing log(0) which is NaN
latent_loss = 0.5 * tf.reduce_sum(
    tf.square(hidden3_sigma) + tf.square(hidden3_mean)
    - 1 - tf.log(eps + tf.square(hidden3_sigma)))

One common variant is to train the encoder to output γ = log(σ²) rather than σ. Wherever we need σ we can just compute σ = exp(γ / 2). This makes it a bit easier for the encoder to capture sigmas of different scales, and thus it helps speed up convergence. The latent loss ends up a bit simpler:

latent_loss = 0.5 * tf.reduce_sum(
    tf.exp(hidden3_gamma) + tf.square(hidden3_mean) - 1 - hidden3_gamma)

The following code builds the variational autoencoder shown in Figure 15-11 (left), using the log(σ²) variant:

from functools import partial

n_inputs = 28 * 28
n_hidden1 = 500
n_hidden2 = 500
n_hidden3 = 20  # codings
n_hidden4 = n_hidden2
n_hidden5 = n_hidden1
n_outputs = n_inputs
learning_rate = 0.001

initializer = tf.contrib.layers.variance_scaling_initializer()

my_dense_layer = partial(
    tf.layers.dense,
    activation=tf.nn.elu,
    kernel_initializer=initializer)

X = tf.placeholder(tf.float32, [None, n_inputs])
hidden1 = my_dense_layer(X, n_hidden1)
hidden2 = my_dense_layer(hidden1, n_hidden2)
hidden3_mean = my_dense_layer(hidden2, n_hidden3, activation=None)
hidden3_gamma = my_dense_layer(hidden2, n_hidden3, activation=None)
noise = tf.random_normal(tf.shape(hidden3_gamma), dtype=tf.float32)
hidden3 = hidden3_mean + tf.exp(0.5 * hidden3_gamma) * noise
hidden4 = my_dense_layer(hidden3, n_hidden4)
hidden5 = my_dense_layer(hidden4, n_hidden5)
logits = my_dense_layer(hidden5, n_outputs, activation=None)
outputs = tf.sigmoid(logits)

xentropy = tf.nn.sigmoid_cross_entropy_with_logits(labels=X, logits=logits)
reconstruction_loss = tf.reduce_sum(xentropy)
latent_loss = 0.5 * tf.reduce_sum(
    tf.exp(hidden3_gamma) + tf.square(hidden3_mean) - 1 - hidden3_gamma)
loss = reconstruction_loss + latent_loss

optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(loss)

init = tf.global_variables_initializer()
saver = tf.train.Saver()

Generating Digits

Now let's use this variational autoencoder to generate images that look like handwritten digits. All we need to do is train the model, then sample random codings from a Gaussian distribution and decode them.

import numpy as np

n_digits = 60
n_epochs = 50
batch_size = 150

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        n_batches = mnist.train.num_examples // batch_size
        for iteration in range(n_batches):
            X_batch, y_batch = mnist.train.next_batch(batch_size)
            sess.run(training_op, feed_dict={X: X_batch})

    codings_rnd = np.random.normal(size=[n_digits, n_hidden3])
    outputs_val = outputs.eval(feed_dict={hidden3: codings_rnd})

That’sit.Nowwecanseewhatthe“handwritten”digitsproducedbytheautoencoderlooklike(seeFigure15-12):

foriterationinrange(n_digits):

plt.subplot(n_digits,10,iteration+1)

plot_image(outputs_val[iteration])

Figure15-12.Imagesofhandwrittendigitsgeneratedbythevariationalautoencoder

Amajorityofthesedigitslookprettyconvincing,whileafewarerather“creative.”Butdon’tbetooharshontheautoencoder—itonlystartedlearninglessthananhourago.Giveitabitmoretrainingtime,andthosedigitswilllookbetterandbetter.

Other Autoencoders

The amazing successes of supervised learning in image recognition, speech recognition, text translation, and more have somewhat overshadowed unsupervised learning, but it is actually booming. New architectures for autoencoders and other unsupervised learning algorithms are invented regularly, so much so that we cannot cover them all in this book. Here is a brief (by no means exhaustive) overview of a few more types of autoencoders that you may want to check out:

Contractive autoencoder (CAE)⁸
The autoencoder is constrained during training so that the derivatives of the codings with regard to the inputs are small. In other words, two similar inputs must have similar codings.

Stacked convolutional autoencoders⁹
Autoencoders that learn to extract visual features by reconstructing images processed through convolutional layers.

Generative stochastic network (GSN)¹⁰
A generalization of denoising autoencoders, with the added capability to generate data.

Winner-take-all (WTA) autoencoder¹¹
During training, after computing the activations of all the neurons in the coding layer, only the top k% activations for each neuron over the training batch are preserved, and the rest are set to zero. Naturally this leads to sparse codings. Moreover, a similar WTA approach can be used to produce sparse convolutional autoencoders (a minimal sketch of the WTA step appears after this list).

Adversarial autoencoders¹²
One network is trained to reproduce its inputs, and at the same time another is trained to find inputs that the first network is unable to properly reconstruct. This pushes the first autoencoder to learn robust codings.
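For instance, here is a minimal NumPy sketch of the winner-take-all step described above (the batch size, coding size, and value of k are all arbitrary):

import numpy as np

k_percent = 5  # keep only the top 5% of activations per neuron (arbitrary)
codings = np.random.rand(150, 20)  # hypothetical batch of coding activations

# For each neuron (column), find its top-k% activation threshold over the
# batch, then zero out every activation below that threshold
thresholds = np.percentile(codings, 100 - k_percent, axis=0)
sparse_codings = np.where(codings >= thresholds, codings, 0.0)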

Exercises

1. What are the main tasks that autoencoders are used for?

2. Suppose you want to train a classifier and you have plenty of unlabeled training data, but only a few thousand labeled instances. How can autoencoders help? How would you proceed?

3. If an autoencoder perfectly reconstructs the inputs, is it necessarily a good autoencoder? How can you evaluate the performance of an autoencoder?

4. What are undercomplete and overcomplete autoencoders? What is the main risk of an excessively undercomplete autoencoder? What about the main risk of an overcomplete autoencoder?

5. How do you tie weights in a stacked autoencoder? What is the point of doing so?

6. What is a common technique to visualize features learned by the lower layer of a stacked autoencoder? What about higher layers?

7. What is a generative model? Can you name a type of generative autoencoder?

8. Let’suseadenoisingautoencodertopretrainanimageclassifier:YoucanuseMNIST(simplest),oranotherlargesetofimagessuchasCIFAR10ifyouwantabiggerchallenge.IfyouchooseCIFAR10,youneedtowritecodetoloadbatchesofimagesfortraining.Ifyouwanttoskipthispart,TensorFlow’smodelzoocontainstoolstodojustthat.

Splitthedatasetintoatrainingsetandatestset.Trainadeepdenoisingautoencoderonthefulltrainingset.

Checkthattheimagesarefairlywellreconstructed,andvisualizethelow-levelfeatures.Visualizetheimagesthatmostactivateeachneuroninthecodinglayer.

Buildaclassificationdeepneuralnetwork,reusingthelowerlayersoftheautoencoder.Trainitusingonly10%ofthetrainingset.Canyougetittoperformaswellasthesameclassifiertrainedonthefulltrainingset?

9. Semantic hashing, introduced in 2008 by Ruslan Salakhutdinov and Geoffrey Hinton,¹³ is a technique used for efficient information retrieval: a document (e.g., an image) is passed through a system, typically a neural network, which outputs a fairly low-dimensional binary vector (e.g., 30 bits). Two similar documents are likely to have identical or very similar hashes. By indexing each document using its hash, it is possible to retrieve many documents similar to a particular document almost instantly, even if there are billions of documents: just compute the hash of the document and look up all documents with that same hash (or hashes differing by just one or two bits). Let's implement semantic hashing using a slightly tweaked stacked autoencoder:

Create a stacked autoencoder containing two hidden layers below the coding layer, and train it on the image dataset you used in the previous exercise. The coding layer should contain 30 neurons and use the logistic activation function to output values between 0 and 1. After training, to produce the hash of an image, you can simply run it through the autoencoder, take the output of the coding layer, and round every value to the closest integer (0 or 1).

One neat trick proposed by Salakhutdinov and Hinton is to add Gaussian noise (with zero mean) to the inputs of the coding layer, during training only. In order to preserve a high signal-to-noise ratio, the autoencoder will learn to feed large values to the coding layer (so that the noise becomes negligible). In turn, this means that the logistic function of the coding layer will likely saturate at 0 or 1. As a result, rounding the codings to 0 or 1 won't distort them too much, and this will improve the reliability of the hashes.

Compute the hash of every image, and see if images with identical hashes look alike. Since MNIST and CIFAR10 are labeled, a more objective way to measure the performance of the autoencoder for semantic hashing is to ensure that images with the same hash generally have the same class. One way to do this is to measure the average Gini purity (introduced in Chapter 6) of the sets of images with identical (or very similar) hashes.

Try fine-tuning the hyperparameters using cross-validation.

Note that with a labeled dataset, another approach is to train a convolutional neural network (see Chapter 13) for classification, then use the layer below the output layer to produce the hashes. See Jinma Gua and Jianmin Li's 2015 paper.¹⁴ See if that performs better.

10. Train a variational autoencoder on the image dataset used in the previous exercises (MNIST or CIFAR10), and make it generate images. Alternatively, you can try to find an unlabeled dataset that you are interested in and see if you can generate new samples.

Solutions to these exercises are available in Appendix A.

1. “Perception in chess,” W. Chase and H. Simon (1973).
2. “Greedy Layer-Wise Training of Deep Networks,” Y. Bengio et al. (2007).
3. “Extracting and Composing Robust Features with Denoising Autoencoders,” P. Vincent et al. (2008).
4. “Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion,” P. Vincent et al. (2010).
5. “Auto-Encoding Variational Bayes,” D. Kingma and M. Welling (2014).
6. Variational autoencoders are actually more general; the codings are not limited to Gaussian distributions.
7. For more mathematical details, check out the original paper on variational autoencoders, or Carl Doersch’s great tutorial (2016).
8. “Contractive Auto-Encoders: Explicit Invariance During Feature Extraction,” S. Rifai et al. (2011).
9. “Stacked Convolutional Auto-Encoders for Hierarchical Feature Extraction,” J. Masci et al. (2011).
10. “GSNs: Generative Stochastic Networks,” G. Alain et al. (2015).
11. “Winner-Take-All Autoencoders,” A. Makhzani and B. Frey (2015).
12. “Adversarial Autoencoders,” A. Makhzani et al. (2016).
13. “Semantic Hashing,” R. Salakhutdinov and G. Hinton (2008).
14. “CNN Based Hashing for Image Retrieval,” J. Gua and J. Li (2015).


Chapter 16. Reinforcement Learning

Reinforcement Learning (RL) is one of the most exciting fields of Machine Learning today, and also one of the oldest. It has been around since the 1950s, producing many interesting applications over the years,¹ in particular in games (e.g., TD-Gammon, a Backgammon-playing program) and in machine control, but seldom making the headline news. But a revolution took place in 2013 when researchers from an English startup called DeepMind demonstrated a system that could learn to play just about any Atari game from scratch,² eventually outperforming humans³ in most of them, using only raw pixels as inputs and without any prior knowledge of the rules of the games.⁴ This was the first of a series of amazing feats, culminating in March 2016 with the victory of their system AlphaGo against Lee Sedol, the world champion of the game of Go. No program had ever come close to beating a master of this game, let alone the world champion. Today the whole field of RL is boiling with new ideas, with a wide range of applications. DeepMind was bought by Google for over 500 million dollars in 2014.

So how did they do it? With hindsight it seems rather simple: they applied the power of Deep Learning to the field of Reinforcement Learning, and it worked beyond their wildest dreams. In this chapter we will first explain what Reinforcement Learning is and what it is good at, and then we will present two of the most important techniques in deep Reinforcement Learning: policy gradients and deep Q-networks (DQN), including a discussion of Markov decision processes (MDP). We will use these techniques to train a model to balance a pole on a moving cart, and another to play Atari games. The same techniques can be used for a wide variety of tasks, from walking robots to self-driving cars.

Learning to Optimize Rewards

In Reinforcement Learning, a software agent makes observations and takes actions within an environment, and in return it receives rewards. Its objective is to learn to act in a way that will maximize its expected long-term rewards. If you don't mind a bit of anthropomorphism, you can think of positive rewards as pleasure, and negative rewards as pain (the term "reward" is a bit misleading in this case). In short, the agent acts in the environment and learns by trial and error to maximize its pleasure and minimize its pain.

This is quite a broad setting, which can apply to a wide variety of tasks. Here are a few examples (see Figure 16-1):

1. The agent can be the program controlling a walking robot. In this case, the environment is the real world, the agent observes the environment through a set of sensors such as cameras and touch sensors, and its actions consist of sending signals to activate motors. It may be programmed to get positive rewards whenever it approaches the target destination, and negative rewards whenever it wastes time, goes in the wrong direction, or falls down.

2. The agent can be the program controlling Ms. Pac-Man. In this case, the environment is a simulation of the Atari game, the actions are the nine possible joystick positions (upper left, down, center, and so on), the observations are screenshots, and the rewards are just the game points.

3. Similarly, the agent can be the program playing a board game such as the game of Go.

4. The agent does not have to control a physically (or virtually) moving thing. For example, it can be a smart thermostat, getting rewards whenever it is close to the target temperature and saves energy, and negative rewards when humans need to tweak the temperature, so the agent must learn to anticipate human needs.

5. The agent can observe stock market prices and decide how much to buy or sell every second. Rewards are obviously the monetary gains and losses.

Figure 16-1. Reinforcement Learning examples: (a) walking robot, (b) Ms. Pac-Man, (c) Go player, (d) thermostat, (e) automatic trader⁵

Note that there may not be any positive rewards at all; for example, the agent may move around in a maze, getting a negative reward at every time step, so it had better find the exit as quickly as possible! There are many other examples of tasks where Reinforcement Learning is well suited, such as self-driving cars, placing ads on a web page, or controlling where an image classification system should focus its attention.

Policy Search

The algorithm used by the software agent to determine its actions is called its policy. For example, the policy could be a neural network taking observations as inputs and outputting the action to take (see Figure 16-2).

Figure 16-2. Reinforcement Learning using a neural network policy

The policy can be any algorithm you can think of, and it does not even have to be deterministic. For example, consider a robotic vacuum cleaner whose reward is the amount of dust it picks up in 30 minutes. Its policy could be to move forward with some probability p every second, or randomly rotate left or right with probability 1 – p. The rotation angle would be a random angle between –r and +r. Since this policy involves some randomness, it is called a stochastic policy. The robot will have an erratic trajectory, which guarantees that it will eventually get to any place it can reach and pick up all the dust. The question is: how much dust will it pick up in 30 minutes?
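For concreteness, here is a minimal sketch of this stochastic policy (the default values of p and r are arbitrary, and the simulation around it is left out):

import numpy as np

def vacuum_policy(p=0.7, r=30.0):
    """One decision per second: move forward with probability p,
    otherwise rotate by a random angle between -r and +r degrees."""
    if np.random.rand() < p:
        return ("forward", 0.0)
    return ("rotate", np.random.uniform(-r, r))

# 30 minutes of decisions, one per second
actions = [vacuum_policy() for second in range(30 * 60)]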

How would you train such a robot? There are just two policy parameters you can tweak: the probability p and the angle range r. One possible learning algorithm could be to try out many different values for these parameters, and pick the combination that performs best (see Figure 16-3). This is an example of policy search, in this case using a brute force approach. However, when the policy space is too large (which is generally the case), finding a good set of parameters this way is like searching for a needle in a gigantic haystack.

Another way to explore the policy space is to use genetic algorithms. For example, you could randomly create a first generation of 100 policies and try them out, then “kill” the 80 worst policies⁶ and make the 20 survivors produce 4 offspring each. An offspring is just a copy of its parent⁷ plus some random variation. The surviving policies plus their offspring together constitute the second generation. You can continue to iterate through generations this way, until you find a good policy.
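In code, one generation of this genetic search could look like the following sketch, where evaluate_policy() is a hypothetical function returning the dust collected by a (p, r) policy in 30 minutes, and the mutation scales are arbitrary:

import numpy as np

# 100 random (p, r) policies: p in [0, 1], r in [0, 180] degrees
population = [np.random.uniform([0.0, 0.0], [1.0, 180.0]) for _ in range(100)]

for generation in range(20):
    scores = [evaluate_policy(p, r) for p, r in population]  # hypothetical
    ranked = [population[i] for i in np.argsort(scores)[::-1]]
    survivors = ranked[:20]  # "kill" the 80 worst policies
    offspring = [np.clip(parent + np.random.normal(scale=[0.05, 5.0]),
                         [0.0, 0.0], [1.0, 180.0])  # copy + random variation
                 for parent in survivors for _ in range(4)]
    population = survivors + offspring  # the next generation of 100 policies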

Figure 16-3. Four points in policy space and the agent's corresponding behavior

Yet another approach is to use optimization techniques, by evaluating the gradients of the rewards with regard to the policy parameters, then tweaking these parameters by following the gradient toward higher rewards (gradient ascent). This approach is called policy gradients (PG), which we will discuss in more detail later in this chapter. For example, going back to the vacuum cleaner robot, you could slightly increase p and evaluate whether this increases the amount of dust picked up by the robot in 30 minutes; if it does, then increase p some more, or else reduce p. We will implement a popular PG algorithm using TensorFlow, but before we do we need to create an environment for the agent to live in, so it's time to introduce OpenAI gym.

Introduction to OpenAI Gym

One of the challenges of Reinforcement Learning is that in order to train an agent, you first need to have a working environment. If you want to program an agent that will learn to play an Atari game, you will need an Atari game simulator. If you want to program a walking robot, then the environment is the real world and you can directly train your robot in that environment, but this has its limits: if the robot falls off a cliff, you can't just click “undo.” You can't speed up time either; adding more computing power won't make the robot move any faster. And it's generally too expensive to train 1,000 robots in parallel. In short, training is hard and slow in the real world, so you generally need a simulated environment at least to bootstrap training.

OpenAI gym⁸ is a toolkit that provides a wide variety of simulated environments (Atari games, board games, 2D and 3D physical simulations, and so on), so you can train agents, compare them, or develop new RL algorithms.

Let's install OpenAI gym. For a minimal OpenAI gym installation, simply use pip:

$ pip3 install --upgrade gym

Next open up a Python shell or a Jupyter notebook and create your first environment:

>>> import gym
>>> env = gym.make("CartPole-v0")
[2016-10-14 16:03:23,199] Making new env: CartPole-v0
>>> obs = env.reset()
>>> obs
array([-0.03799846, -0.03288115,  0.02337094,  0.00720711])
>>> env.render()

The make() function creates an environment, in this case a CartPole environment. This is a 2D simulation in which a cart can be accelerated left or right in order to balance a pole placed on top of it (see Figure 16-4). After the environment is created, we must initialize it using the reset() method. This returns the first observation. Observations depend on the type of environment. For the CartPole environment, each observation is a 1D NumPy array containing four floats: these floats represent the cart's horizontal position (0.0 = center), its velocity, the angle of the pole (0.0 = vertical), and its angular velocity. Finally, the render() method displays the environment as shown in Figure 16-4.

Figure 16-4. The CartPole environment

If you want render() to return the rendered image as a NumPy array, you can set the mode parameter to rgb_array (note that other environments may support different modes):

>>> img = env.render(mode="rgb_array")
>>> img.shape  # height, width, channels (3 = RGB)
(400, 600, 3)

TIP
Unfortunately, the CartPole (and a few other environments) renders the image to the screen even if you set the mode to "rgb_array". The only way to avoid this is to use a fake X server such as Xvfb or Xdummy. For example, you can install Xvfb and start Python using the following command: xvfb-run -s "-screen 0 1400x900x24" python. Or use the xvfbwrapper package.

Let's ask the environment what actions are possible:

>>> env.action_space
Discrete(2)

Discrete(2) means that the possible actions are integers 0 and 1, which represent accelerating left (0) or right (1). Other environments may have more discrete actions, or other kinds of actions (e.g., continuous). Since the pole is leaning toward the right, let's accelerate the cart toward the right:

>>> action = 1  # accelerate right
>>> obs, reward, done, info = env.step(action)
>>> obs
array([-0.03865608,  0.16189797,  0.02351508, -0.27801135])
>>> reward
1.0
>>> done
False
>>> info
{}

The step() method executes the given action and returns four values:

obs
This is the new observation. The cart is now moving toward the right (obs[1] > 0). The pole is still tilted toward the right (obs[2] > 0), but its angular velocity is now negative (obs[3] < 0), so it will likely be tilted toward the left after the next step.

reward
In this environment, you get a reward of 1.0 at every step, no matter what you do, so the goal is to keep running as long as possible.

done
This value will be True when the episode is over. This will happen when the pole tilts too much. After that, the environment must be reset before it can be used again.

info
This dictionary may provide extra debug information in other environments. This data should not be used for training (it would be cheating).

Let’shardcodeasimplepolicythatacceleratesleftwhenthepoleisleaningtowardtheleftandacceleratesrightwhenthepoleisleaningtowardtheright.Wewillrunthispolicytoseetheaveragerewardsitgetsover500episodes:

def basic_policy(obs):
    angle = obs[2]
    return 0 if angle < 0 else 1

totals = []
for episode in range(500):
    episode_rewards = 0
    obs = env.reset()
    for step in range(1000):  # 1000 steps max, we don't want to run forever
        action = basic_policy(obs)
        obs, reward, done, info = env.step(action)
        episode_rewards += reward
        if done:
            break
    totals.append(episode_rewards)

This code is hopefully self-explanatory. Let's look at the result:

>>> import numpy as np
>>> np.mean(totals), np.std(totals), np.min(totals), np.max(totals)
(42.125999999999998, 9.1237121830974033, 24.0, 68.0)

Even with 500 tries, this policy never managed to keep the pole upright for more than 68 consecutive steps. Not great. If you look at the simulation in the Jupyter notebooks, you will see that the cart oscillates left and right more and more strongly until the pole tilts too much. Let's see if a neural network can come up with a better policy.

Neural Network Policies

Let's create a neural network policy. Just like the policy we hardcoded earlier, this neural network will take an observation as input, and it will output the action to be executed. More precisely, it will estimate a probability for each action, and then we will select an action randomly according to the estimated probabilities (see Figure 16-5). In the case of the CartPole environment, there are just two possible actions (left or right), so we only need one output neuron. It will output the probability p of action 0 (left), and of course the probability of action 1 (right) will be 1 – p. For example, if it outputs 0.7, then we will pick action 0 with 70% probability, and action 1 with 30% probability.

Figure 16-5. Neural network policy

You may wonder why we are picking a random action based on the probability given by the neural network, rather than just picking the action with the highest score. This approach lets the agent find the right balance between exploring new actions and exploiting the actions that are known to work well. Here's an analogy: suppose you go to a restaurant for the first time, and all the dishes look equally appealing so you randomly pick one. If it turns out to be good, you can increase the probability to order it next time, but you shouldn't increase that probability up to 100%, or else you will never try out the other dishes, some of which may be even better than the one you tried.

Also note that in this particular environment, the past actions and observations can safely be ignored, since each observation contains the environment's full state. If there were some hidden state, then you may need to consider past actions and observations as well. For example, if the environment only revealed the position of the cart but not its velocity, you would have to consider not only the current observation but also the previous observation in order to estimate the current velocity. Another example is when the observations are noisy; in that case, you generally want to use the past few observations to estimate the most likely current state. The CartPole problem is thus as simple as can be; the observations are noise-free and they contain the environment's full state.

Here is the code to build this neural network policy using TensorFlow:

import tensorflow as tf

# 1. Specify the neural network architecture
n_inputs = 4  # == env.observation_space.shape[0]
n_hidden = 4  # it's a simple task, we don't need more hidden neurons
n_outputs = 1  # only outputs the probability of accelerating left
initializer = tf.contrib.layers.variance_scaling_initializer()

# 2. Build the neural network
X = tf.placeholder(tf.float32, shape=[None, n_inputs])
hidden = tf.layers.dense(X, n_hidden, activation=tf.nn.elu,
                         kernel_initializer=initializer)
logits = tf.layers.dense(hidden, n_outputs,
                         kernel_initializer=initializer)
outputs = tf.nn.sigmoid(logits)

# 3. Select a random action based on the estimated probabilities
p_left_and_right = tf.concat(axis=1, values=[outputs, 1 - outputs])
action = tf.multinomial(tf.log(p_left_and_right), num_samples=1)

init = tf.global_variables_initializer()

Let’sgothroughthiscode:1. Aftertheimports,wedefinetheneuralnetworkarchitecture.Thenumberofinputsisthesizeofthe

observationspace(whichinthecaseoftheCartPoleisfour),wejusthavefourhiddenunitsandnoneedformore,andwehavejustoneoutputprobability(theprobabilityofgoingleft).

2. Nextwebuildtheneuralnetwork.Inthisexample,it’savanillaMulti-LayerPerceptron,withasingleoutput.Notethattheoutputlayerusesthelogistic(sigmoid)activationfunctioninordertooutputaprobabilityfrom0.0to1.0.Ifthereweremorethantwopossibleactions,therewouldbeoneoutputneuronperaction,andyouwouldusethesoftmaxactivationfunctioninstead.

3. Lastly,wecallthemultinomial()functiontopickarandomaction.Thisfunctionindependentlysamplesone(ormore)integers,giventhelogprobabilityofeachinteger.Forexample,ifyoucallitwiththearray[np.log(0.5),np.log(0.2),np.log(0.3)]andwithnum_samples=5,thenitwilloutputfiveintegers,eachofwhichwillhavea50%probabilityofbeing0,20%ofbeing1,and30%ofbeing2.Inourcasewejustneedoneintegerrepresentingtheactiontotake.Sincetheoutputstensoronlycontainstheprobabilityofgoingleft,wemustfirstconcatenate1-outputstoittohaveatensorcontainingtheprobabilityofbothleftandrightactions.Notethatifthereweremorethantwopossibleactions,theneuralnetworkwouldhavetooutputoneprobabilityperactionsoyouwouldnotneedtheconcatenationstep.
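You can verify this sampling behavior with a quick snippet (illustrative only):

import numpy as np
import tensorflow as tf

log_probs = tf.constant([[np.log(0.5), np.log(0.2), np.log(0.3)]])
samples = tf.multinomial(log_probs, num_samples=5)
with tf.Session() as sess:
    print(samples.eval())  # e.g., [[0 2 0 0 1]]: 0 is the most frequent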

Okay, we now have a neural network policy that will take observations and output actions. But how do we train it?

Evaluating Actions: The Credit Assignment Problem

If we knew what the best action was at each step, we could train the neural network as usual, by minimizing the cross entropy between the estimated probability and the target probability. It would just be regular supervised learning. However, in Reinforcement Learning the only guidance the agent gets is through rewards, and rewards are typically sparse and delayed. For example, if the agent manages to balance the pole for 100 steps, how can it know which of the 100 actions it took were good, and which of them were bad? All it knows is that the pole fell after the last action, but surely this last action is not entirely responsible. This is called the credit assignment problem: when the agent gets a reward, it is hard for it to know which actions should get credited (or blamed) for it. Think of a dog that gets rewarded hours after it behaved well; will it understand what it is rewarded for?

To tackle this problem, a common strategy is to evaluate an action based on the sum of all the rewards that come after it, usually applying a discount rate r at each step. For example (see Figure 16-6), if an agent decides to go right three times in a row and gets +10 reward after the first step, 0 after the second step, and finally –50 after the third step, then assuming we use a discount rate r = 0.8, the first action will have a total score of 10 + r × 0 + r² × (–50) = –22. If the discount rate is close to 0, then future rewards won't count for much compared to immediate rewards. Conversely, if the discount rate is close to 1, then rewards far into the future will count almost as much as immediate rewards. Typical discount rates are 0.95 or 0.99. With a discount rate of 0.95, rewards 13 steps into the future count roughly for half as much as immediate rewards (since 0.95¹³ ≈ 0.5), while with a discount rate of 0.99, rewards 69 steps into the future count for half as much as immediate rewards. In the CartPole environment, actions have fairly short-term effects, so choosing a discount rate of 0.95 seems reasonable.

Figure 16-6. Discounted rewards

Of course, a good action may be followed by several bad actions that cause the pole to fall quickly, resulting in the good action getting a low score (similarly, a good actor may sometimes star in a terrible movie). However, if we play the game enough times, on average good actions will get a better score than bad ones. So, to get fairly reliable action scores, we must run many episodes and normalize all the action scores (by subtracting the mean and dividing by the standard deviation). After that, we can reasonably assume that actions with a negative score were bad while actions with a positive score were good. Perfect—now that we have a way to evaluate each action, we are ready to train our first agent using policy gradients. Let's see how.

Policy Gradients

As discussed earlier, PG algorithms optimize the parameters of a policy by following the gradients toward higher rewards. One popular class of PG algorithms, called REINFORCE algorithms, was introduced back in 1992⁹ by Ronald Williams. Here is one common variant:

1. First, let the neural network policy play the game several times and at each step compute the gradients that would make the chosen action even more likely, but don't apply these gradients yet.

2. Once you have run several episodes, compute each action's score (using the method described in the previous section).

3. If an action's score is positive, it means that the action was good and you want to apply the gradients computed earlier to make the action even more likely to be chosen in the future. However, if the score is negative, it means the action was bad and you want to apply the opposite gradients to make this action slightly less likely in the future. The solution is simply to multiply each gradient vector by the corresponding action's score.

4. Finally, compute the mean of all the resulting gradient vectors, and use it to perform a Gradient Descent step.

Let’simplementthisalgorithmusingTensorFlow.Wewilltraintheneuralnetworkpolicywebuiltearliersothatitlearnstobalancethepoleonthecart.Let’sstartbycompletingtheconstructionphasewecodedearliertoaddthetargetprobability,thecostfunction,andthetrainingoperation.Sinceweareactingasthoughthechosenactionisthebestpossibleaction,thetargetprobabilitymustbe1.0ifthechosenactionisaction0(left)and0.0ifitisaction1(right):

y=1.-tf.to_float(action)

Now that we have a target probability, we can define the cost function (cross entropy) and compute the gradients:

learning_rate = 0.01

cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits(labels=y,
                                                        logits=logits)
optimizer = tf.train.AdamOptimizer(learning_rate)
grads_and_vars = optimizer.compute_gradients(cross_entropy)

Note that we are calling the optimizer's compute_gradients() method instead of the minimize() method. This is because we want to tweak the gradients before we apply them.¹⁰ The compute_gradients() method returns a list of gradient vector/variable pairs (one pair per trainable variable). Let's put all the gradients in a list, to make it more convenient to obtain their values:

gradients = [grad for grad, variable in grads_and_vars]

Okay, now comes the tricky part. During the execution phase, the algorithm will run the policy and at each step it will evaluate these gradient tensors and store their values. After a number of episodes it will tweak these gradients as explained earlier (i.e., multiply them by the action scores and normalize them) and compute the mean of the tweaked gradients. Next, it will need to feed the resulting gradients back to the optimizer so that it can perform an optimization step. This means we need one placeholder per gradient vector. Moreover, we must create the operation that will apply the updated gradients. For this we will call the optimizer's apply_gradients() function, which takes a list of gradient vector/variable pairs. Instead of giving it the original gradient vectors, we will give it a list containing the updated gradients (i.e., the ones fed through the gradient placeholders):

gradient_placeholders = []
grads_and_vars_feed = []
for grad, variable in grads_and_vars:
    gradient_placeholder = tf.placeholder(tf.float32, shape=grad.get_shape())
    gradient_placeholders.append(gradient_placeholder)
    grads_and_vars_feed.append((gradient_placeholder, variable))
training_op = optimizer.apply_gradients(grads_and_vars_feed)

Let’sstepbackandtakealookatthefullconstructionphase:

n_inputs = 4
n_hidden = 4
n_outputs = 1
initializer = tf.contrib.layers.variance_scaling_initializer()
learning_rate = 0.01

X = tf.placeholder(tf.float32, shape=[None, n_inputs])

hidden = tf.layers.dense(X, n_hidden, activation=tf.nn.elu,
                         kernel_initializer=initializer)
logits = tf.layers.dense(hidden, n_outputs,
                         kernel_initializer=initializer)
outputs = tf.nn.sigmoid(logits)
p_left_and_right = tf.concat(axis=1, values=[outputs, 1 - outputs])
action = tf.multinomial(tf.log(p_left_and_right), num_samples=1)

y = 1. - tf.to_float(action)
cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits(
    labels=y, logits=logits)
optimizer = tf.train.AdamOptimizer(learning_rate)
grads_and_vars = optimizer.compute_gradients(cross_entropy)
gradients = [grad for grad, variable in grads_and_vars]
gradient_placeholders = []
grads_and_vars_feed = []
for grad, variable in grads_and_vars:
    gradient_placeholder = tf.placeholder(tf.float32, shape=grad.get_shape())
    gradient_placeholders.append(gradient_placeholder)
    grads_and_vars_feed.append((gradient_placeholder, variable))
training_op = optimizer.apply_gradients(grads_and_vars_feed)

init = tf.global_variables_initializer()
saver = tf.train.Saver()

On to the execution phase! We will need a couple of functions to compute the total discounted rewards, given the raw rewards, and to normalize the results across multiple episodes:

def discount_rewards(rewards, discount_rate):
    discounted_rewards = np.empty(len(rewards))
    cumulative_rewards = 0
    for step in reversed(range(len(rewards))):
        cumulative_rewards = rewards[step] + cumulative_rewards * discount_rate
        discounted_rewards[step] = cumulative_rewards
    return discounted_rewards

def discount_and_normalize_rewards(all_rewards, discount_rate):
    all_discounted_rewards = [discount_rewards(rewards, discount_rate)
                              for rewards in all_rewards]
    flat_rewards = np.concatenate(all_discounted_rewards)
    reward_mean = flat_rewards.mean()
    reward_std = flat_rewards.std()
    return [(discounted_rewards - reward_mean) / reward_std
            for discounted_rewards in all_discounted_rewards]

Let’scheckthatthisworks:

>>>discount_rewards([10,0,-50],discount_rate=0.8)

array([-22.,-40.,-50.])

>>>discount_and_normalize_rewards([[10,0,-50],[10,20]],discount_rate=0.8)

[array([-0.28435071,-0.86597718,-1.18910299]),

array([1.26665318,1.0727777])]

The call to discount_rewards() returns exactly what we expect (see Figure 16-6). You can verify that the function discount_and_normalize_rewards() does indeed return the normalized scores for each action in both episodes. Notice that the first episode was much worse than the second, so its normalized scores are all negative; all actions from the first episode would be considered bad, and conversely all actions from the second episode would be considered good.

We now have all we need to train the policy:

n_iterations = 250       # number of training iterations
n_max_steps = 1000       # max steps per episode
n_games_per_update = 10  # train the policy every 10 episodes
save_iterations = 10     # save the model every 10 training iterations
discount_rate = 0.95

with tf.Session() as sess:
    init.run()
    for iteration in range(n_iterations):
        all_rewards = []    # all sequences of raw rewards for each episode
        all_gradients = []  # gradients saved at each step of each episode
        for game in range(n_games_per_update):
            current_rewards = []    # all raw rewards from the current episode
            current_gradients = []  # all gradients from the current episode
            obs = env.reset()
            for step in range(n_max_steps):
                action_val, gradients_val = sess.run(
                    [action, gradients],
                    feed_dict={X: obs.reshape(1, n_inputs)})  # one obs
                obs, reward, done, info = env.step(action_val[0][0])
                current_rewards.append(reward)
                current_gradients.append(gradients_val)
                if done:
                    break
            all_rewards.append(current_rewards)
            all_gradients.append(current_gradients)

        # At this point we have run the policy for 10 episodes, and we are
        # ready for a policy update using the algorithm described earlier.
        all_rewards = discount_and_normalize_rewards(all_rewards, discount_rate)
        feed_dict = {}
        for var_index, grad_placeholder in enumerate(gradient_placeholders):
            # multiply the gradients by the action scores, and compute the mean
            mean_gradients = np.mean(
                [reward * all_gradients[game_index][step][var_index]
                 for game_index, rewards in enumerate(all_rewards)
                 for step, reward in enumerate(rewards)],
                axis=0)
            feed_dict[grad_placeholder] = mean_gradients
        sess.run(training_op, feed_dict=feed_dict)
        if iteration % save_iterations == 0:
            saver.save(sess, "./my_policy_net_pg.ckpt")

Each training iteration starts by running the policy for 10 episodes (with maximum 1,000 steps per episode, to avoid running forever). At each step, we also compute the gradients, pretending that the chosen action was the best. After these 10 episodes have been run, we compute the action scores using the discount_and_normalize_rewards() function; we go through each trainable variable, across all episodes and all steps, to multiply each gradient vector by its corresponding action score; and we compute the mean of the resulting gradients. Finally, we run the training operation, feeding it these mean gradients (one per trainable variable). We also save the model every 10 training operations.

And we're done! This code will train the neural network policy, and it will successfully learn to balance the pole on the cart (you can try it out in the Jupyter notebooks). Note that there are actually two ways the agent can lose the game: either the pole can tilt too much, or the cart can go completely off the screen. With 250 training iterations, the policy learns to balance the pole quite well, but it is not yet good enough at avoiding going off the screen. A few hundred more training iterations will fix that.

TIP
Researchers try to find algorithms that work well even when the agent initially knows nothing about the environment. However, unless you are writing a paper, you should inject as much prior knowledge as possible into the agent, as it will speed up training dramatically. For example, you could add negative rewards proportional to the distance from the center of the screen, and to the pole's angle. Also, if you already have a reasonably good policy (e.g., hardcoded), you may want to train the neural network to imitate it before using policy gradients to improve it.

Despite its relative simplicity, this algorithm is quite powerful. You can use it to tackle much harder problems than balancing a pole on a cart. In fact, AlphaGo was based on a similar PG algorithm (plus Monte Carlo Tree Search, which is beyond the scope of this book).

We will now look at another popular family of algorithms. Whereas PG algorithms directly try to optimize the policy to increase rewards, the algorithms we will look at now are less direct: the agent learns to estimate the expected sum of discounted future rewards for each state, or the expected sum of discounted future rewards for each action in each state, then uses this knowledge to decide how to act. To understand these algorithms, we must first introduce Markov decision processes (MDP).

Markov Decision Processes

In the early 20th century, the mathematician Andrey Markov studied stochastic processes with no memory, called Markov chains. Such a process has a fixed number of states, and it randomly evolves from one state to another at each step. The probability for it to evolve from a state s to a state s′ is fixed, and it depends only on the pair (s, s′), not on past states (the system has no memory).

Figure 16-7 shows an example of a Markov chain with four states. Suppose that the process starts in state s0, and there is a 70% chance that it will remain in that state at the next step. Eventually it is bound to leave that state and never come back since no other state points back to s0. If it goes to state s1, it will then most likely go to state s2 (90% probability), then immediately back to state s1 (with 100% probability). It may alternate a number of times between these two states, but eventually it will fall into state s3 and remain there forever (this is a terminal state). Markov chains can have very different dynamics, and they are heavily used in thermodynamics, chemistry, statistics, and much more.

Figure 16-7. Example of a Markov chain
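To make this concrete, here is a small simulation of such a chain; the transition matrix below is one plausible reading of Figure 16-7, so treat any probability not stated in the text as an assumption:

import numpy as np

# transition_probs[s][s'] = probability of going from state s to state s'
transition_probs = np.array([
    [0.7, 0.2, 0.0, 0.1],  # from s0: 70% chance of staying in s0
    [0.0, 0.0, 0.9, 0.1],  # from s1: 90% chance of going to s2
    [0.0, 1.0, 0.0, 0.0],  # from s2: always straight back to s1
    [0.0, 0.0, 0.0, 1.0],  # s3 is terminal: it loops on itself forever
])

state = 0
path = [state]
for step in range(20):
    state = np.random.choice(4, p=transition_probs[state])
    path.append(state)
print(path)  # the walk typically ends up stuck in state 3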

Markov decision processes were first described in the 1950s by Richard Bellman.¹¹ They resemble Markov chains but with a twist: at each step, an agent can choose one of several possible actions, and the transition probabilities depend on the chosen action. Moreover, some state transitions return some reward (positive or negative), and the agent's goal is to find a policy that will maximize rewards over time.

For example, the MDP represented in Figure 16-8 has three states and up to three possible discrete actions at each step. If it starts in state s0, the agent can choose between actions a0, a1, or a2. If it chooses action a1, it just remains in state s0 with certainty, and without any reward. It can thus decide to stay there forever if it wants. But if it chooses action a0, it has a 70% probability of gaining a reward of +10, and remaining in state s0. It can then try again and again to gain as much reward as possible. But at one point it is going to end up instead in state s1. In state s1 it has only two possible actions: a0 or a2. It can choose to stay put by repeatedly choosing action a0, or it can choose to move on to state s2 and get a negative reward of –50 (ouch). In state s2 it has no other choice than to take action a1, which will most likely lead it back to state s0, gaining a reward of +40 on the way. You get the picture. By looking at this MDP, can you guess which strategy will gain the most reward over time? In state s0 it is clear that action a0 is the best option, and in state s2 the agent has no choice but to take action a1, but in state s1 it is not obvious whether the agent should stay put (a0) or go through the fire (a2).

Figure 16-8. Example of a Markov decision process

Bellman found a way to estimate the optimal state value of any state s, noted V*(s), which is the sum of all discounted future rewards the agent can expect on average after it reaches a state s, assuming it acts optimally. He showed that if the agent acts optimally, then the Bellman Optimality Equation applies (see Equation 16-1). This recursive equation says that if the agent acts optimally, then the optimal value of the current state is equal to the reward it will get on average after taking one optimal action, plus the expected optimal value of all possible next states that this action can lead to.

Equation 16-1. Bellman Optimality Equation

$V^*(s) = \max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma \cdot V^*(s') \right]$ for all s

T(s, a, s′) is the transition probability from state s to state s′, given that the agent chose action a.

R(s, a, s′) is the reward that the agent gets when it goes from state s to state s′, given that the agent chose action a.

γ is the discount rate.

This equation leads directly to an algorithm that can precisely estimate the optimal state value of every possible state: you first initialize all the state value estimates to zero, and then you iteratively update them using the Value Iteration algorithm (see Equation 16-2). A remarkable result is that, given enough time, these estimates are guaranteed to converge to the optimal state values, corresponding to the optimal policy.

Equation 16-2. Value Iteration algorithm

$V_{k+1}(s) \leftarrow \max_a \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma \cdot V_k(s') \right]$ for all s

V_k(s) is the estimated value of state s at the kth iteration of the algorithm.

NOTE This algorithm is an example of Dynamic Programming, which breaks down a complex problem (in this case estimating a potentially infinite sum of discounted future rewards) into tractable sub-problems that can be tackled iteratively (in this case finding the action that maximizes the average reward plus the discounted next state value).

Knowing the optimal state values can be useful, in particular to evaluate a policy, but it does not tell the agent explicitly what to do. Luckily, Bellman found a very similar algorithm to estimate the optimal state-action values, generally called Q-Values. The optimal Q-Value of the state-action pair (s, a), noted Q*(s, a), is the sum of discounted future rewards the agent can expect on average after it reaches the state s and chooses action a, but before it sees the outcome of this action, assuming it acts optimally after that action.

Here is how it works: once again, you start by initializing all the Q-Value estimates to zero, then you update them using the Q-Value Iteration algorithm (see Equation 16-3).

Equation 16-3. Q-Value Iteration algorithm

Q_{k+1}(s, a) ← Σ_s′ T(s, a, s′) · [R(s, a, s′) + γ · max_{a′} Q_k(s′, a′)],  for all (s, a)

Once you have the optimal Q-Values, defining the optimal policy, noted π*(s), is trivial: when the agent is in state s, it should choose the action with the highest Q-Value for that state: π*(s) = argmax_a Q*(s, a).

Let’sapplythisalgorithmtotheMDPrepresentedinFigure16-8.First,weneedtodefinetheMDP:

nan = np.nan  # represents impossible actions

T = np.array([  # shape=[s, a, s']
    [[0.7, 0.3, 0.0], [1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
    [[0.0, 1.0, 0.0], [nan, nan, nan], [0.0, 0.0, 1.0]],
    [[nan, nan, nan], [0.8, 0.1, 0.1], [nan, nan, nan]],
])
R = np.array([  # shape=[s, a, s']
    [[10., 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
    [[10., 0.0, 0.0], [nan, nan, nan], [0.0, 0.0, -50.]],
    [[nan, nan, nan], [40., 0.0, 0.0], [nan, nan, nan]],
])
possible_actions = [[0, 1, 2], [0, 2], [1]]
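As an aside, Equation 16-2 can be applied just as directly to these arrays to compute the optimal state values. Here is a minimal Value Iteration sketch (it reuses the T, R, and possible_actions arrays above; the iteration budget and the 0.95 discount rate match the values used below):

V = np.zeros(3)  # one value estimate per state, initialized to zero
for k in range(100):
    V_prev = V.copy()
    for s in range(3):
        V[s] = max(np.sum([T[s, a, sp] * (R[s, a, sp] + 0.95 * V_prev[sp])
                           for sp in range(3)])
                   for a in possible_actions[s])  # Equation 16-2
print(V)  # the optimal state values V*(s0), V*(s1), V*(s2)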

Nowlet’sruntheQ-ValueIterationalgorithm:

Q = np.full((3, 3), -np.inf)  # -inf for impossible actions
for state, actions in enumerate(possible_actions):
    Q[state, actions] = 0.0  # Initial value = 0.0, for all possible actions

learning_rate = 0.01
discount_rate = 0.95
n_iterations = 100

for iteration in range(n_iterations):
    Q_prev = Q.copy()
    for s in range(3):
        for a in possible_actions[s]:
            Q[s, a] = np.sum([
                T[s, a, sp] * (R[s, a, sp] + discount_rate * np.max(Q_prev[sp]))
                for sp in range(3)
            ])

The resulting Q-Values look like this:

>>> Q
array([[ 21.89498982,  20.80024033,  16.86353093],
       [  1.11669335,         -inf,   1.17573546],
       [        -inf,  53.86946068,         -inf]])
>>> np.argmax(Q, axis=1)  # optimal action for each state
array([0, 2, 1])

This gives us the optimal policy for this MDP, when using a discount rate of 0.95: in state s0 choose action a0, in state s1 choose action a2 (go through the fire!), and in state s2 choose action a1 (the only possible action). Interestingly, if you reduce the discount rate to 0.9, the optimal policy changes: in state s1 the best action becomes a0 (stay put; don't go through the fire). It makes sense because if you value the present much more than the future, then the prospect of future rewards is not worth immediate pain.
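You can check this claim by wrapping the loop above in a small helper function (the function name is ours) and comparing the greedy policies obtained with the two discount rates:

def q_value_iteration(discount_rate, n_iterations=100):
    Q = np.full((3, 3), -np.inf)  # -inf for impossible actions
    for state, actions in enumerate(possible_actions):
        Q[state, actions] = 0.0
    for iteration in range(n_iterations):
        Q_prev = Q.copy()
        for s in range(3):
            for a in possible_actions[s]:
                Q[s, a] = np.sum([
                    T[s, a, sp] * (R[s, a, sp] + discount_rate * np.max(Q_prev[sp]))
                    for sp in range(3)])
    return Q

print(np.argmax(q_value_iteration(0.95), axis=1))  # [0 2 1]: go through the fire
print(np.argmax(q_value_iteration(0.90), axis=1))  # [0 0 1]: stay put in s1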


Temporal Difference Learning and Q-Learning

Reinforcement Learning problems with discrete actions can often be modeled as Markov decision processes, but the agent initially has no idea what the transition probabilities are (it does not know T(s, a, s′)), and it does not know what the rewards are going to be either (it does not know R(s, a, s′)). It must experience each state and each transition at least once to know the rewards, and it must experience them multiple times if it is to have a reasonable estimate of the transition probabilities.

The Temporal Difference Learning (TD Learning) algorithm is very similar to the Value Iteration algorithm, but tweaked to take into account the fact that the agent has only partial knowledge of the MDP. In general we assume that the agent initially knows only the possible states and actions, and nothing more. The agent uses an exploration policy—for example, a purely random policy—to explore the MDP, and as it progresses the TD Learning algorithm updates the estimates of the state values based on the transitions and rewards that are actually observed (see Equation 16-4).

Equation 16-4. TD Learning algorithm

V_{k+1}(s) ← (1 − α) · V_k(s) + α · (r + γ · V_k(s′))

α is the learning rate (e.g., 0.01).

TIP TD Learning has many similarities with Stochastic Gradient Descent, in particular the fact that it handles one sample at a time. Just like SGD, it can only truly converge if you gradually reduce the learning rate (otherwise it will keep bouncing around the optimum).

For each state s, this algorithm simply keeps track of a running average of the immediate rewards the agent gets upon leaving that state, plus the rewards it expects to get later (assuming it acts optimally).
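Here is what Equation 16-4 looks like in code, with a purely random exploration policy on the same MDP (a minimal sketch reusing the T, R, and possible_actions arrays from earlier; the constants are arbitrary choices):

import numpy.random as rnd

V = np.zeros(3)   # state value estimates
alpha = 0.01      # learning rate
s = 0
for step in range(50000):
    a = rnd.choice(possible_actions[s])    # random exploration policy
    sp = rnd.choice(range(3), p=T[s, a])   # observe the next state
    reward = R[s, a, sp]                   # observe the reward
    V[s] = (1 - alpha) * V[s] + alpha * (reward + 0.95 * V[sp])  # Equation 16-4
    s = sp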

Similarly, the Q-Learning algorithm is an adaptation of the Q-Value Iteration algorithm to the situation where the transition probabilities and the rewards are initially unknown (see Equation 16-5).

Equation 16-5. Q-Learning algorithm

Q_{k+1}(s, a) ← (1 − α) · Q_k(s, a) + α · (r + γ · max_{a′} Q_k(s′, a′))

For each state-action pair (s, a), this algorithm keeps track of a running average of the rewards r the agent gets upon leaving the state s with action a, plus the rewards it expects to get later. Since the target policy would act optimally, we take the maximum of the Q-Value estimates for the next state.

Here is how Q-Learning can be implemented:

import numpy.random as rnd

learning_rate0 = 0.05
learning_rate_decay = 0.1


n_iterations = 20000

s = 0  # start in state 0

Q = np.full((3, 3), -np.inf)  # -inf for impossible actions
for state, actions in enumerate(possible_actions):
    Q[state, actions] = 0.0  # Initial value = 0.0, for all possible actions

for iteration in range(n_iterations):
    a = rnd.choice(possible_actions[s])   # choose an action (randomly)
    sp = rnd.choice(range(3), p=T[s, a])  # pick next state using T[s, a]
    reward = R[s, a, sp]
    learning_rate = learning_rate0 / (1 + iteration * learning_rate_decay)
    Q[s, a] = (1 - learning_rate) * Q[s, a] + learning_rate * (
        reward + discount_rate * np.max(Q[sp])
    )
    s = sp  # move to next state

Given enough iterations, this algorithm will converge to the optimal Q-Values. This is called an off-policy algorithm because the policy being trained is not the one being executed. It is somewhat surprising that this algorithm is capable of learning the optimal policy by just watching an agent act randomly (imagine learning to play golf when your teacher is a drunken monkey). Can we do better?


Exploration Policies

Of course Q-Learning can work only if the exploration policy explores the MDP thoroughly enough. Although a purely random policy is guaranteed to eventually visit every state and every transition many times, it may take an extremely long time to do so. Therefore, a better option is to use the ε-greedy policy: at each step it acts randomly with probability ε, or greedily (choosing the action with the highest Q-Value) with probability 1–ε. The advantage of the ε-greedy policy (compared to a completely random policy) is that it will spend more and more time exploring the interesting parts of the environment, as the Q-Value estimates get better and better, while still spending some time visiting unknown regions of the MDP. It is quite common to start with a high value for ε (e.g., 1.0) and then gradually reduce it (e.g., down to 0.05).

Alternatively, rather than relying on chance for exploration, another approach is to encourage the exploration policy to try actions that it has not tried much before. This can be implemented as a bonus added to the Q-Value estimates, as shown in Equation 16-6.

Equation 16-6. Q-Learning using an exploration function

Q_{k+1}(s, a) ← (1 − α) · Q_k(s, a) + α · (r + γ · max_{a′} f(Q_k(s′, a′), N(s′, a′)))

N(s′, a′) counts the number of times the action a′ was chosen in state s′.

f(q, n) is an exploration function, such as f(q, n) = q + K/(1 + n), where K is a curiosity hyperparameter that measures how much the agent is attracted to the unknown.
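Here is a minimal sketch of Equation 16-6 (reusing T, R, possible_actions, and the numpy.random import from earlier; the value of K and the other constants are arbitrary choices):

K = 5.0  # curiosity hyperparameter (arbitrary value)

def f(q, n):
    return q + K / (1 + n)  # the exploration function from Equation 16-6

N = np.zeros((3, 3))          # N[s, a]: how many times action a was chosen in s
Q = np.full((3, 3), -np.inf)  # -inf for impossible actions
for state, actions in enumerate(possible_actions):
    Q[state, actions] = 0.0

alpha, s = 0.05, 0
for step in range(20000):
    a = max(possible_actions[s], key=lambda a: f(Q[s, a], N[s, a]))  # greedy w.r.t. f
    N[s, a] += 1
    sp = rnd.choice(range(3), p=T[s, a])
    reward = R[s, a, sp]
    next_value = max(f(Q[sp, ap], N[sp, ap]) for ap in possible_actions[sp])
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (reward + 0.95 * next_value)
    s = sp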


Approximate Q-Learning

The main problem with Q-Learning is that it does not scale well to large (or even medium) MDPs with many states and actions. Consider trying to use Q-Learning to train an agent to play Ms. Pac-Man. There are over 250 pellets that Ms. Pac-Man can eat, each of which can be present or absent (i.e., already eaten). So the number of possible states is greater than 2²⁵⁰ ≈ 10⁷⁵ (and that's considering the possible states only of the pellets). This is way more than the number of atoms in the observable universe, so there's absolutely no way you can keep track of an estimate for every single Q-Value.

The solution is to find a function that approximates the Q-Values using a manageable number of parameters. This is called Approximate Q-Learning. For years it was recommended to use linear combinations of hand-crafted features extracted from the state (e.g., distance of the closest ghosts, their directions, and so on) to estimate Q-Values, but DeepMind showed that using deep neural networks can work much better, especially for complex problems, and it does not require any feature engineering. A DNN used to estimate Q-Values is called a deep Q-network (DQN), and using a DQN for Approximate Q-Learning is called Deep Q-Learning.
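To make the idea concrete before jumping to DQNs, here is a minimal sketch of Approximate Q-Learning with a linear model Q_θ(s, a) = θ · φ(s, a), trained by a semi-gradient update on the TD error. The one-hot feature map below is a hypothetical illustration (on this tiny MDP it simply recovers tabular Q-Learning); real features would be hand-crafted, as described above:

def phi(s, a):
    features = np.zeros(9)       # one feature per (state, action) pair
    features[3 * s + a] = 1.0    # hypothetical one-hot feature map
    return features

theta = np.zeros(9)              # the model's parameters

def q_hat(s, a):
    return theta.dot(phi(s, a))  # approximate Q-Value

alpha, s = 0.01, 0
for step in range(50000):
    a = rnd.choice(possible_actions[s])
    sp = rnd.choice(range(3), p=T[s, a])
    reward = R[s, a, sp]
    target = reward + 0.95 * max(q_hat(sp, ap) for ap in possible_actions[sp])
    theta += alpha * (target - q_hat(s, a)) * phi(s, a)  # semi-gradient step
    s = sp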

In the rest of this chapter, we will use Deep Q-Learning to train an agent to play Ms. Pac-Man, much like DeepMind did in 2013. The code can easily be tweaked to learn to play the majority of Atari games quite well. It can achieve superhuman skill at most action games, but it is not so good at games with long-running storylines.


Learning to Play Ms. Pac-Man Using Deep Q-Learning

Since we will be using an Atari environment, we must first install OpenAI gym's Atari dependencies. While we're at it, we will also install dependencies for other OpenAI gym environments that you may want to play with. On macOS, assuming you have installed Homebrew, you need to run:

$ brew install cmake boost boost-python sdl2 swig wget

On Ubuntu, type the following command (replacing python3 with python if you are using Python 2):

$ apt-get install -y python3-numpy python3-dev cmake zlib1g-dev libjpeg-dev \
    xvfb libav-tools xorg-dev python3-opengl libboost-all-dev libsdl2-dev swig

Then install the extra Python modules:

$ pip3 install --upgrade 'gym[all]'

If everything went well, you should be able to create a Ms. Pac-Man environment:

>>> env = gym.make("MsPacman-v0")
>>> obs = env.reset()
>>> obs.shape  # [height, width, channels]
(210, 160, 3)
>>> env.action_space
Discrete(9)

As you can see, there are nine discrete actions available, which correspond to the nine possible positions of the joystick (left, right, up, down, center, upper left, and so on), and the observations are simply screenshots of the Atari screen (see Figure 16-9, left), represented as 3D NumPy arrays. These images are a bit large, so we will create a small preprocessing function that will crop the image and shrink it down to 88 × 80 pixels, convert it to grayscale, and improve the contrast of Ms. Pac-Man. This will reduce the amount of computations required by the DQN, and speed up training.

mspacman_color = np.array([210, 164, 74]).mean()

def preprocess_observation(obs):
    img = obs[1:176:2, ::2]  # crop and downsize
    img = img.mean(axis=2)  # to greyscale
    img[img == mspacman_color] = 0  # improve contrast
    img = (img - 128) / 128  # normalize from -1. to 1.
    return img.reshape(88, 80, 1)
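As a quick sanity check, you can run the function on a raw observation (a minimal sketch, assuming the env created above):

obs = env.reset()
state = preprocess_observation(obs)
print(state.shape)  # (88, 80, 1)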

The result of preprocessing is shown in Figure 16-9 (right).


Figure 16-9. Ms. Pac-Man observation, original (left) and after preprocessing (right)

Next,let’screatetheDQN.Itcouldjusttakeastate-actionpair(s,a)asinput,andoutputanestimateofthecorrespondingQ-ValueQ(s,a),butsincetheactionsarediscreteitismoreconvenienttouseaneuralnetworkthattakesonlyastatesasinputandoutputsoneQ-Valueestimateperaction.TheDQNwillbecomposedofthreeconvolutionallayers,followedbytwofullyconnectedlayers,includingtheoutputlayer(seeFigure16-10).


Figure 16-10. Deep Q-network to play Ms. Pac-Man

As we will see, the training algorithm we will use requires two DQNs with the same architecture (but different parameters): one will be used to drive Ms. Pac-Man during training (the actor), and the other will watch the actor and learn from its trials and errors (the critic). At regular intervals we will copy the critic to the actor. Since we need two identical DQNs, we will create a q_network() function to build them:

input_height = 88
input_width = 80
input_channels = 1
conv_n_maps = [32, 64, 64]
conv_kernel_sizes = [(8, 8), (4, 4), (3, 3)]
conv_strides = [4, 2, 1]
conv_paddings = ["SAME"] * 3
conv_activation = [tf.nn.relu] * 3
n_hidden_in = 64 * 11 * 10  # conv3 has 64 maps of 11x10 each
n_hidden = 512
hidden_activation = tf.nn.relu
n_outputs = env.action_space.n  # 9 discrete actions are available
initializer = tf.contrib.layers.variance_scaling_initializer()

def q_network(X_state, name):
    prev_layer = X_state
    conv_layers = []
    with tf.variable_scope(name) as scope:
        for n_maps, kernel_size, stride, padding, activation in zip(
                conv_n_maps, conv_kernel_sizes, conv_strides,
                conv_paddings, conv_activation):
            prev_layer = tf.layers.conv2d(
                prev_layer, filters=n_maps, kernel_size=kernel_size,
                strides=stride, padding=padding, activation=activation,
                kernel_initializer=initializer)
            conv_layers.append(prev_layer)
        last_conv_layer_flat = tf.reshape(prev_layer, shape=[-1, n_hidden_in])
        hidden = tf.layers.dense(last_conv_layer_flat, n_hidden,
                                 activation=hidden_activation,
                                 kernel_initializer=initializer)
        outputs = tf.layers.dense(hidden, n_outputs,
                                  kernel_initializer=initializer)
    trainable_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES,
                                       scope=scope.name)
    trainable_vars_by_name = {var.name[len(scope.name):]: var
                              for var in trainable_vars}
    return outputs, trainable_vars_by_name

The first part of this code defines the hyperparameters of the DQN architecture. Then the q_network() function creates the DQN, taking the environment's state X_state as input, and the name of the variable scope. Note that we will just use one observation to represent the environment's state since there's almost no hidden state (except for blinking objects and the ghosts' directions).

The trainable_vars_by_name dictionary gathers all the trainable variables of this DQN. It will be useful in a minute when we create operations to copy the critic DQN to the actor DQN. The keys of the dictionary are the names of the variables, stripping the part of the prefix that just corresponds to the scope's name. It looks like this:

>>> trainable_vars_by_name
{'/conv2d/bias:0': <tensorflow.python.ops.variables.Variable at 0x121cf7b50>,
 '/conv2d/kernel:0': <tensorflow.python.ops.variables.Variable...>,
 '/conv2d_1/bias:0': <tensorflow.python.ops.variables.Variable...>,
 '/conv2d_1/kernel:0': <tensorflow.python.ops.variables.Variable...>,
 '/conv2d_2/bias:0': <tensorflow.python.ops.variables.Variable...>,
 '/conv2d_2/kernel:0': <tensorflow.python.ops.variables.Variable...>,
 '/dense/bias:0': <tensorflow.python.ops.variables.Variable...>,
 '/dense/kernel:0': <tensorflow.python.ops.variables.Variable...>,
 '/dense_1/bias:0': <tensorflow.python.ops.variables.Variable...>,
 '/dense_1/kernel:0': <tensorflow.python.ops.variables.Variable...>}

Nowlet’screatetheinputplaceholder,thetwoDQNs,andtheoperationtocopythecriticDQNtotheactorDQN:

X_state = tf.placeholder(tf.float32, shape=[None, input_height, input_width,
                                            input_channels])
actor_q_values, actor_vars = q_network(X_state, name="q_networks/actor")
critic_q_values, critic_vars = q_network(X_state, name="q_networks/critic")

copy_ops = [actor_var.assign(critic_vars[var_name])
            for var_name, actor_var in actor_vars.items()]
copy_critic_to_actor = tf.group(*copy_ops)


Let’sstepbackforasecond:wenowhavetwoDQNsthatarebothcapableoftakinganenvironmentstate(i.e.,apreprocessedobservation)asinputandoutputtinganestimatedQ-Valueforeachpossibleactioninthatstate.Pluswehaveanoperationcalledcopy_critic_to_actortocopyallthetrainablevariablesofthecriticDQNtotheactorDQN.WeuseTensorFlow’stf.group()functiontogroupalltheassignmentoperationsintoasingleconvenientoperation.

The actor DQN can be used to play Ms. Pac-Man (initially very badly). As discussed earlier, you want it to explore the game thoroughly enough, so you generally want to combine it with an ε-greedy policy or another exploration strategy.

But what about the critic DQN? How will it learn to play the game? The short answer is that it will try to make its Q-Value predictions match the Q-Values estimated by the actor through its experience of the game. Specifically, we will let the actor play for a while, storing all its experiences in a replay memory. Each memory will be a 5-tuple (state, action, next state, reward, continue), where the “continue” item will be equal to 0.0 when the game is over, or 1.0 otherwise. Next, at regular intervals we will sample a batch of memories from the replay memory, and we will estimate the Q-Values from these memories. Finally, we will train the critic DQN to predict these Q-Values using regular supervised learning techniques. Once every few training iterations, we will copy the critic DQN to the actor DQN. And that's it! Equation 16-7 shows the cost function used to train the critic DQN:

Equation 16-7. Deep Q-Learning cost function

J(θ_critic) = (1/m) Σ_{i=1..m} [y^(i) − Q(s^(i), a^(i), θ_critic)]²,
where  y^(i) = r^(i) + γ · max_{a′} Q(s′^(i), a′, θ_actor)

s^(i), a^(i), r^(i) and s′^(i) are respectively the state, action, reward, and next state of the ith memory sampled from the replay memory.

m is the size of the memory batch.

θ_critic and θ_actor are the critic and the actor's parameters.

Q(s^(i), a^(i), θ_critic) is the critic DQN's prediction of the ith memorized state-action's Q-Value.

Q(s′^(i), a′, θ_actor) is the actor DQN's prediction of the Q-Value it can expect from the next state s′^(i) if it chooses action a′.

y^(i) is the target Q-Value for the ith memory. Note that it is equal to the reward actually observed by the actor, plus the actor's prediction of what future rewards it should expect if it were to play optimally (as far as it knows).

J(θ_critic) is the cost function used to train the critic DQN. As you can see, it is just the Mean Squared Error between the target Q-Values y^(i) as estimated by the actor DQN, and the critic DQN's predictions of these Q-Values.

NOTE The replay memory is optional, but highly recommended. Without it, you would train the critic DQN using consecutive experiences that may be very correlated. This would introduce a lot of bias and slow down the training algorithm's convergence. By using a replay memory, we ensure that the memories fed to the training algorithm can be fairly uncorrelated.

Let’saddthecriticDQN’strainingoperations.First,weneedtobeabletocomputeitspredictedQ-Valuesforeachstate-actioninthememorybatch.SincetheDQNoutputsoneQ-Valueforeverypossibleaction,weneedtokeeponlytheQ-Valuethatcorrespondstotheactionthatwasactuallychoseninthismemory.Forthis,wewillconverttheactiontoaone-hotvector(recallthatthisisavectorfullof0sexceptfora1attheithindex),andmultiplyitbytheQ-Values:thiswillzerooutallQ-Valuesexceptfortheonecorrespondingtothememorizedaction.ThenjustsumoverthefirstaxistoobtainonlythedesiredQ-Valuepredictionforeachmemory.

X_action = tf.placeholder(tf.int32, shape=[None])
q_value = tf.reduce_sum(critic_q_values * tf.one_hot(X_action, n_outputs),
                        axis=1, keep_dims=True)

Nextlet’saddthetrainingoperations,assumingthetargetQ-Valueswillbefedthroughaplaceholder.Wealsocreateanontrainablevariablecalledglobal_step.Theoptimizer’sminimize()operationwilltakecareofincrementingit.PluswecreatetheusualinitoperationandaSaver.

y = tf.placeholder(tf.float32, shape=[None, 1])
cost = tf.reduce_mean(tf.square(y - q_value))
global_step = tf.Variable(0, trainable=False, name='global_step')
learning_rate = 0.001  # the optimizer's learning rate (a typical value)
optimizer = tf.train.AdamOptimizer(learning_rate)
training_op = optimizer.minimize(cost, global_step=global_step)

init = tf.global_variables_initializer()
saver = tf.train.Saver()

That’sitfortheconstructionphase.Beforewelookattheexecutionphase,wewillneedacoupleoftools.First,let’sstartbyimplementingthereplaymemory.Wewilluseadequelistsinceitisveryefficientatpushingitemstothequeueandpoppingthemoutfromtheendofthelistwhenthemaximummemorysizeisreached.Wewillalsowriteasmallfunctiontorandomlysampleabatchofexperiencesfromthereplaymemory:

from collections import deque

replay_memory_size = 10000
replay_memory = deque([], maxlen=replay_memory_size)

def sample_memories(batch_size):
    indices = rnd.permutation(len(replay_memory))[:batch_size]
    cols = [[], [], [], [], []]  # state, action, reward, next_state, continue
    for idx in indices:
        memory = replay_memory[idx]
        for col, value in zip(cols, memory):
            col.append(value)
    cols = [np.array(col) for col in cols]
    return (cols[0], cols[1], cols[2].reshape(-1, 1), cols[3],
            cols[4].reshape(-1, 1))

Next, we will need the actor to explore the game. We will use the ε-greedy policy, and gradually decrease ε from 1.0 to 0.05, in 50,000 training steps:

eps_min = 0.05
eps_max = 1.0
eps_decay_steps = 50000

def epsilon_greedy(q_values, step):
    epsilon = max(eps_min, eps_max - (eps_max - eps_min) * step / eps_decay_steps)
    if rnd.rand() < epsilon:
        return rnd.randint(n_outputs)  # random action
    else:
        return np.argmax(q_values)  # optimal action

That’sit!Wehaveallweneedtostarttraining.Theexecutionphasedoesnotcontainanythingtoocomplex,butitisabitlong,sotakeadeepbreath.Ready?Let’sgo!First,let’sinitializeafewvariables:

n_steps = 100000  # total number of training steps
training_start = 1000  # start training after 1,000 game iterations
training_interval = 3  # run a training step every 3 game iterations
save_steps = 50  # save the model every 50 training steps
copy_steps = 25  # copy the critic to the actor every 25 training steps
discount_rate = 0.95
skip_start = 90  # skip the start of every game (it's just waiting time)
batch_size = 50
iteration = 0  # game iterations
checkpoint_path = "./my_dqn.ckpt"
done = True  # env needs to be reset

Next,let’sopenthesessionandrunthemaintrainingloop:

import os

with tf.Session() as sess:
    if os.path.isfile(checkpoint_path):
        saver.restore(sess, checkpoint_path)
    else:
        init.run()
    while True:
        step = global_step.eval()
        if step >= n_steps:
            break
        iteration += 1
        if done:  # game over, start again
            obs = env.reset()
            for skip in range(skip_start):  # skip the start of each game
                obs, reward, done, info = env.step(0)
            state = preprocess_observation(obs)

        # Actor evaluates what to do
        q_values = actor_q_values.eval(feed_dict={X_state: [state]})
        action = epsilon_greedy(q_values, step)

        # Actor plays
        obs, reward, done, info = env.step(action)
        next_state = preprocess_observation(obs)

        # Let's memorize what just happened
        replay_memory.append((state, action, reward, next_state, 1.0 - done))
        state = next_state

        if iteration < training_start or iteration % training_interval != 0:
            continue

        # Critic learns
        X_state_val, X_action_val, rewards, X_next_state_val, continues = (
            sample_memories(batch_size))
        next_q_values = actor_q_values.eval(
            feed_dict={X_state: X_next_state_val})
        max_next_q_values = np.max(next_q_values, axis=1, keepdims=True)
        y_val = rewards + continues * discount_rate * max_next_q_values
        training_op.run(feed_dict={X_state: X_state_val,
                                   X_action: X_action_val, y: y_val})

        # Regularly copy critic to actor
        if step % copy_steps == 0:
            copy_critic_to_actor.run()

        # And save regularly
        if step % save_steps == 0:
            saver.save(sess, checkpoint_path)

We start by restoring the models if a checkpoint file exists, or else we just initialize the variables normally. Then the main loop starts, where iteration counts the total number of game steps we have gone through since the program started, and step counts the total number of training steps since training started (if a checkpoint is restored, the global step is restored as well). Then the code resets the game (and skips the first boring game steps, where nothing happens). Next, the actor evaluates what to do, and plays the game, and its experience is memorized in the replay memory. Then, at regular intervals (after a warmup period), the critic goes through a training step. It samples a batch of memories and asks the actor to estimate the Q-Values of all actions for the next state, and it applies Equation 16-7 to compute the target Q-Value y_val. The only tricky part here is that we must multiply the next state's Q-Values by the continues vector to zero out the Q-Values corresponding to memories where the game was over. Next we run a training operation to improve the critic's ability to predict Q-Values. Finally, at regular intervals we copy the critic to the actor, and we save the model.

TIP Unfortunately, training is very slow: if you use your laptop for training, it will take days before Ms. Pac-Man gets any good, and if you look at the learning curve, measuring the average rewards per episode, you will notice that it is extremely noisy. At some points there may be no apparent progress for a very long time until suddenly the agent learns to survive a reasonable amount of time. As mentioned earlier, one solution is to inject as much prior knowledge as possible into the model (e.g., through preprocessing, rewards, and so on), and you can also try to bootstrap the model by first training it to imitate a basic strategy. In any case, RL still requires quite a lot of patience and tweaking, but the end result is very exciting.


Exercises

1. How would you define Reinforcement Learning? How is it different from regular supervised or unsupervised learning?

2. Can you think of three possible applications of RL that were not mentioned in this chapter? For each of them, what is the environment? What is the agent? What are possible actions? What are the rewards?

3. What is the discount rate? Can the optimal policy change if you modify the discount rate?

4. How do you measure the performance of a Reinforcement Learning agent?

5. What is the credit assignment problem? When does it occur? How can you alleviate it?

6. What is the point of using a replay memory?

7. What is an off-policy RL algorithm?

8. Use Deep Q-Learning to tackle OpenAI gym's “BipedalWalker-v2.” The Q-networks do not need to be very deep for this task.

9. Use policy gradients to train an agent to play Pong, the famous Atari game (Pong-v0 in the OpenAI gym). Beware: an individual observation is insufficient to tell the direction and speed of the ball. One solution is to pass two observations at a time to the neural network policy. To reduce dimensionality and speed up training, you should definitely preprocess these images (crop, resize, and convert them to black and white), and possibly merge them into a single image (e.g., by overlaying them).

10. If you have about $100 to spare, you can purchase a Raspberry Pi 3 plus some cheap robotics components, install TensorFlow on the Pi, and go wild! For an example, check out this fun post by Lukas Biewald, or take a look at GoPiGo or BrickPi. Why not try to build a real-life cartpole by training the robot using policy gradients? Or build a robotic spider that learns to walk; give it rewards any time it gets closer to some objective (you will need sensors to measure the distance to the objective). The only limit is your imagination.

Solutions to these exercises are available in Appendix A.


Thank You!

Before we close the last chapter of this book, I would like to thank you for reading it up to the last paragraph. I truly hope that you had as much pleasure reading this book as I had writing it, and that it will be useful for your projects, big or small.

If you find errors, please send feedback. More generally, I would love to know what you think, so please don't hesitate to contact me via O'Reilly, or through the ageron/handson-ml GitHub project.

Going forward, my best advice to you is to practice and practice: try going through all the exercises if you have not done so already, play with the Jupyter notebooks, join Kaggle.com or some other ML community, watch ML courses, read papers, attend conferences, meet experts. You may also want to study some topics that we did not cover in this book, including recommender systems, clustering algorithms, anomaly detection algorithms, and genetic algorithms.

My greatest hope is that this book will inspire you to build a wonderful ML application that will benefit all of us! What will it be?

Aurélien Géron, November 26th, 2016

1. For more details, be sure to check out Richard Sutton and Andrew Barto's book on RL, Reinforcement Learning: An Introduction (MIT Press), or David Silver's free online RL course at University College London.

2. "Playing Atari with Deep Reinforcement Learning," V. Mnih et al. (2013).

3. "Human-level control through deep reinforcement learning," V. Mnih et al. (2015).

4. Check out the videos of DeepMind's system learning to play Space Invaders, Breakout, and more at https://goo.gl/yTsH6X.

5. Images (a), (c), and (d) are reproduced from Wikipedia. (a) and (d) are in the public domain. (c) was created by user Stevertigo and released under Creative Commons BY-SA 2.0. (b) is a screenshot from the Ms. Pac-Man game, copyright Atari (the author believes it to be fair use in this chapter). (e) was reproduced from Pixabay, released under Creative Commons CC0.

6. It is often better to give the poor performers a slight chance of survival, to preserve some diversity in the "gene pool."

7. If there is a single parent, this is called asexual reproduction. With two (or more) parents, it is called sexual reproduction. An offspring's genome (in this case a set of policy parameters) is randomly composed of parts of its parents' genomes.

8. OpenAI is a nonprofit artificial intelligence research company, funded in part by Elon Musk. Its stated goal is to promote and develop friendly AIs that will benefit humanity (rather than exterminate it).

9. "Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning," R. Williams (1992).

10. We already did something similar in Chapter 11 when we discussed Gradient Clipping: we first computed the gradients, then we clipped them, and finally we applied the clipped gradients.

11. "A Markovian Decision Process," R. Bellman (1957).


Appendix A. Exercise Solutions

NOTE Solutions to the coding exercises are available in the online Jupyter notebooks at https://github.com/ageron/handson-ml.


Chapter 1: The Machine Learning Landscape

1. Machine Learning is about building systems that can learn from data. Learning means getting better at some task, given some performance measure.

2. Machine Learning is great for complex problems for which we have no algorithmic solution, to replace long lists of hand-tuned rules, to build systems that adapt to fluctuating environments, and finally to help humans learn (e.g., data mining).

3. A labeled training set is a training set that contains the desired solution (a.k.a. a label) for each instance.

4. The two most common supervised tasks are regression and classification.

5. Common unsupervised tasks include clustering, visualization, dimensionality reduction, and association rule learning.

6. Reinforcement Learning is likely to perform best if we want a robot to learn to walk in various unknown terrains, since this is typically the type of problem that Reinforcement Learning tackles. It might be possible to express the problem as a supervised or semisupervised learning problem, but it would be less natural.

7. Ifyoudon’tknowhowtodefinethegroups,thenyoucanuseaclusteringalgorithm(unsupervisedlearning)tosegmentyourcustomersintoclustersofsimilarcustomers.However,ifyouknowwhatgroupsyouwouldliketohave,thenyoucanfeedmanyexamplesofeachgrouptoaclassificationalgorithm(supervisedlearning),anditwillclassifyallyourcustomersintothesegroups.

8. Spam detection is a typical supervised learning problem: the algorithm is fed many emails along with their label (spam or not spam).

9. An online learning system can learn incrementally, as opposed to a batch learning system. This makes it capable of adapting rapidly to both changing data and autonomous systems, and of training on very large quantities of data.

10. Out-of-core algorithms can handle vast quantities of data that cannot fit in a computer's main memory. An out-of-core learning algorithm chops the data into mini-batches and uses online learning techniques to learn from these mini-batches.

11. An instance-based learning system learns the training data by heart; then, when given a new instance, it uses a similarity measure to find the most similar learned instances and uses them to make predictions.

12. A model has one or more model parameters that determine what it will predict given a new instance (e.g., the slope of a linear model). A learning algorithm tries to find optimal values for these parameters such that the model generalizes well to new instances. A hyperparameter is a parameter of the learning algorithm itself, not of the model (e.g., the amount of regularization to apply).


13. Model-based learning algorithms search for an optimal value for the model parameters such that the model will generalize well to new instances. We usually train such systems by minimizing a cost function that measures how bad the system is at making predictions on the training data, plus a penalty for model complexity if the model is regularized. To make predictions, we feed the new instance's features into the model's prediction function, using the parameter values found by the learning algorithm.

14. Some of the main challenges in Machine Learning are the lack of data, poor data quality, nonrepresentative data, uninformative features, excessively simple models that underfit the training data, and excessively complex models that overfit the data.

15. If a model performs great on the training data but generalizes poorly to new instances, the model is likely overfitting the training data (or we got extremely lucky on the training data). Possible solutions to overfitting are getting more data, simplifying the model (selecting a simpler algorithm, reducing the number of parameters or features used, or regularizing the model), or reducing the noise in the training data.

16. A test set is used to estimate the generalization error that a model will make on new instances, before the model is launched in production.

17. A validation set is used to compare models. It makes it possible to select the best model and tune the hyperparameters.

18. If you tune hyperparameters using the test set, you risk overfitting the test set, and the generalization error you measure will be optimistic (you may launch a model that performs worse than you expect).

19. Cross-validation is a technique that makes it possible to compare models (for model selection and hyperparameter tuning) without the need for a separate validation set. This saves precious training data.


Chapter 2: End-to-End Machine Learning Project

See the Jupyter notebooks available at https://github.com/ageron/handson-ml.


Chapter 3: Classification

See the Jupyter notebooks available at https://github.com/ageron/handson-ml.


Chapter 4: Training Models

1. If you have a training set with millions of features you can use Stochastic Gradient Descent or Mini-batch Gradient Descent, and perhaps Batch Gradient Descent if the training set fits in memory. But you cannot use the Normal Equation because the computational complexity grows quickly (more than quadratically) with the number of features.

2. If the features in your training set have very different scales, the cost function will have the shape of an elongated bowl, so the Gradient Descent algorithms will take a long time to converge. To solve this you should scale the data before training the model. Note that the Normal Equation will work just fine without scaling. Moreover, regularized models may converge to a suboptimal solution if the features are not scaled: indeed, since regularization penalizes large weights, features with smaller values will tend to be ignored compared to features with larger values.

3. Gradient Descent cannot get stuck in a local minimum when training a Logistic Regression model because the cost function is convex.¹

4. If the optimization problem is convex (such as Linear Regression or Logistic Regression), and assuming the learning rate is not too high, then all Gradient Descent algorithms will approach the global optimum and end up producing fairly similar models. However, unless you gradually reduce the learning rate, Stochastic GD and Mini-batch GD will never truly converge; instead, they will keep jumping back and forth around the global optimum. This means that even if you let them run for a very long time, these Gradient Descent algorithms will produce slightly different models.

5. If the validation error consistently goes up after every epoch, then one possibility is that the learning rate is too high and the algorithm is diverging. If the training error also goes up, then this is clearly the problem and you should reduce the learning rate. However, if the training error is not going up, then your model is overfitting the training set and you should stop training.

6. Due to their random nature, neither Stochastic Gradient Descent nor Mini-batch Gradient Descent is guaranteed to make progress at every single training iteration. So if you immediately stop training when the validation error goes up, you may stop much too early, before the optimum is reached. A better option is to save the model at regular intervals, and when it has not improved for a long time (meaning it will probably never beat the record), you can revert to the best saved model.

7. Stochastic Gradient Descent has the fastest training iteration since it considers only one training instance at a time, so it is generally the first to reach the vicinity of the global optimum (or Mini-batch GD with a very small mini-batch size). However, only Batch Gradient Descent will actually converge, given enough training time. As mentioned, Stochastic GD and Mini-batch GD will bounce around the optimum, unless you gradually reduce the learning rate.

8. If the validation error is much higher than the training error, this is likely because your model is overfitting the training set. One way to try to fix this is to reduce the polynomial degree: a model with fewer degrees of freedom is less likely to overfit. Another thing you can try is to regularize the model—for example, by adding an ℓ2 penalty (Ridge) or an ℓ1 penalty (Lasso) to the cost function. This will also reduce the degrees of freedom of the model. Lastly, you can try to increase the size of the training set.

9. If both the training error and the validation error are almost equal and fairly high, the model is likely underfitting the training set, which means it has a high bias. You should try reducing the regularization hyperparameter α.

10. Let’ssee:Amodelwithsomeregularizationtypicallyperformsbetterthanamodelwithoutanyregularization,soyoushouldgenerallypreferRidgeRegressionoverplainLinearRegression.2

LassoRegressionusesanℓ1penalty,whichtendstopushtheweightsdowntoexactlyzero.Thisleadstosparsemodels,whereallweightsarezeroexceptforthemostimportantweights.Thisisawaytoperformfeatureselectionautomatically,whichisgoodifyoususpectthatonlyafewfeaturesactuallymatter.Whenyouarenotsure,youshouldpreferRidgeRegression.

ElasticNetisgenerallypreferredoverLassosinceLassomaybehaveerraticallyinsomecases(whenseveralfeaturesarestronglycorrelatedorwhentherearemorefeaturesthantraininginstances).However,itdoesaddanextrahyperparametertotune.IfyoujustwantLassowithouttheerraticbehavior,youcanjustuseElasticNetwithanl1_ratiocloseto1.

11. If you want to classify pictures as outdoor/indoor and daytime/nighttime, since these are not exclusive classes (i.e., all four combinations are possible) you should train two Logistic Regression classifiers.

12. See the Jupyter notebooks available at https://github.com/ageron/handson-ml.


Chapter 5: Support Vector Machines

1. The fundamental idea behind Support Vector Machines is to fit the widest possible “street” between the classes. In other words, the goal is to have the largest possible margin between the decision boundary that separates the two classes and the training instances. When performing soft margin classification, the SVM searches for a compromise between perfectly separating the two classes and having the widest possible street (i.e., a few instances may end up on the street). Another key idea is to use kernels when training on nonlinear datasets.

2. After training an SVM, a support vector is any instance located on the “street” (see the previous answer), including its border. The decision boundary is entirely determined by the support vectors. Any instance that is not a support vector (i.e., off the street) has no influence whatsoever; you could remove them, add more instances, or move them around, and as long as they stay off the street they won't affect the decision boundary. Computing the predictions only involves the support vectors, not the whole training set.

3. SVMs try to fit the largest possible “street” between the classes (see the first answer), so if the training set is not scaled, the SVM will tend to neglect small features (see Figure 5-2).

4. An SVM classifier can output the distance between the test instance and the decision boundary, and you can use this as a confidence score. However, this score cannot be directly converted into an estimation of the class probability. If you set probability=True when creating an SVM in Scikit-Learn, then after training it will calibrate the probabilities using Logistic Regression on the SVM's scores (trained by an additional five-fold cross-validation on the training data). This will add the predict_proba() and predict_log_proba() methods to the SVM.

5. This question applies only to linear SVMs since kernelized SVMs can only use the dual form. The computational complexity of the primal form of the SVM problem is proportional to the number of training instances m, while the computational complexity of the dual form is proportional to a number between m² and m³. So if there are millions of instances, you should definitely use the primal form, because the dual form will be much too slow.

6. If an SVM classifier trained with an RBF kernel underfits the training set, there might be too much regularization. To decrease it, you need to increase gamma or C (or both).

7. Let’scalltheQPparametersforthehard-marginproblemH′,f′,A′andb′(see“QuadraticProgramming”).TheQPparametersforthesoft-marginproblemhavemadditionalparameters(np=n+1+m)andmadditionalconstraints(nc=2m).Theycanbedefinedlikeso:

HisequaltoH′,plusmcolumnsof0sontherightandmrowsof0satthebottom:

fisequaltof′withmadditionalelements,allequaltothevalueofthehyperparameterC.

bisequaltob′withmadditionalelements,allequalto0.

AisequaltoA′,withanextram×midentitymatrixImappendedtotheright,–Imjustbelowit,

Page 576: index-of.esindex-of.es/Varios-2/Hands on Machine Learning with... · Revision History for the First Edition 2017-03-10: First Release 2017-06-09: Second Release See for release

andtherestfilledwithzeros:

For the solutions to exercises 8, 9, and 10, please see the Jupyter notebooks available at https://github.com/ageron/handson-ml.


Chapter 6: Decision Trees

1. The depth of a well-balanced binary tree containing m leaves is equal to log₂(m),³ rounded up. A binary Decision Tree (one that makes only binary decisions, as is the case of all trees in Scikit-Learn) will end up more or less well balanced at the end of training, with one leaf per training instance if it is trained without restrictions. Thus, if the training set contains one million instances, the Decision Tree will have a depth of log₂(10⁶) ≈ 20 (actually a bit more since the tree will generally not be perfectly well balanced).

2. Anode’sGiniimpurityisgenerallylowerthanitsparent’s.ThisisduetotheCARTtrainingalgorithm’scostfunction,whichsplitseachnodeinawaythatminimizestheweightedsumofitschildren’sGiniimpurities.However,itispossibleforanodetohaveahigherGiniimpuritythanitsparent,aslongasthisincreaseismorethancompensatedforbyadecreaseoftheotherchild’simpurity.Forexample,consideranodecontainingfourinstancesofclassAand1ofclassB.ItsGini

impurityis =0.32.Nowsupposethedatasetisone-dimensionalandtheinstancesarelinedupinthefollowingorder:A,B,A,A,A.Youcanverifythatthealgorithmwillsplitthisnodeafterthesecondinstance,producingonechildnodewithinstancesA,B,andtheotherchildnode

withinstancesA,A,A.Thefirstchildnode’sGiniimpurityis =0.5,whichishigherthanitsparent.Thisiscompensatedforbythefactthattheothernodeispure,sotheoverall

weightedGiniimpurityis 0.5+ =0.2,whichislowerthantheparent’sGiniimpurity.

3. If a Decision Tree is overfitting the training set, it may be a good idea to decrease max_depth, since this will constrain the model, regularizing it.

4. DecisionTreesdon’tcarewhetherornotthetrainingdataisscaledorcentered;that’soneofthenicethingsaboutthem.SoifaDecisionTreeunderfitsthetrainingset,scalingtheinputfeatureswilljustbeawasteoftime.

5. The computational complexity of training a Decision Tree is O(n × m log(m)). So if you multiply the training set size by 10, the training time will be multiplied by K = (n × 10m × log(10m)) / (n × m × log(m)) = 10 × log(10m) / log(m). If m = 10⁶, then K ≈ 11.7, so you can expect the training time to be roughly 11.7 hours.

6. Presorting the training set speeds up training only if the dataset is smaller than a few thousand instances. If it contains 100,000 instances, setting presort=True will considerably slow down training.

For the solutions to exercises 7 and 8, please see the Jupyter notebooks available at https://github.com/ageron/handson-ml.
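The Gini computations in answer 2 can be verified with a few lines of NumPy (a minimal sketch; the helper name is ours):

import numpy as np

def gini(labels):
    # Gini impurity: 1 minus the sum of squared class proportions
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / counts.sum()
    return 1.0 - np.sum(proportions ** 2)

parent = ["A", "B", "A", "A", "A"]
left, right = parent[:2], parent[2:]           # split after the second instance
print(gini(parent))                            # 0.32
print(gini(left), gini(right))                 # 0.5  0.0
print(2/5 * gini(left) + 3/5 * gini(right))    # 0.2, lower than the parent's 0.32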


Chapter 7: Ensemble Learning and Random Forests

1. If you have trained five different models and they all achieve 95% precision, you can try combining them into a voting ensemble, which will often give you even better results. It works better if the models are very different (e.g., an SVM classifier, a Decision Tree classifier, a Logistic Regression classifier, and so on). It is even better if they are trained on different training instances (that's the whole point of bagging and pasting ensembles), but if not it will still work as long as the models are very different.

2. A hard voting classifier just counts the votes of each classifier in the ensemble and picks the class that gets the most votes. A soft voting classifier computes the average estimated class probability for each class and picks the class with the highest probability. This gives high-confidence votes more weight and often performs better, but it works only if every classifier is able to estimate class probabilities (e.g., for the SVM classifiers in Scikit-Learn you must set probability=True).

3. It is quite possible to speed up training of a bagging ensemble by distributing it across multiple servers, since each predictor in the ensemble is independent of the others. The same goes for pasting ensembles and Random Forests, for the same reason. However, each predictor in a boosting ensemble is built based on the previous predictor, so training is necessarily sequential, and you will not gain anything by distributing training across multiple servers. Regarding stacking ensembles, all the predictors in a given layer are independent of each other, so they can be trained in parallel on multiple servers. However, the predictors in one layer can only be trained after the predictors in the previous layer have all been trained.

4. With out-of-bag evaluation, each predictor in a bagging ensemble is evaluated using instances that it was not trained on (they were held out). This makes it possible to have a fairly unbiased evaluation of the ensemble without the need for an additional validation set. Thus, you have more instances available for training, and your ensemble can perform slightly better.

5. When you are growing a tree in a Random Forest, only a random subset of the features is considered for splitting at each node. This is true as well for Extra-Trees, but they go one step further: rather than searching for the best possible thresholds, like regular Decision Trees do, they use random thresholds for each feature. This extra randomness acts like a form of regularization: if a Random Forest overfits the training data, Extra-Trees might perform better. Moreover, since Extra-Trees don't search for the best possible thresholds, they are much faster to train than Random Forests. However, they are neither faster nor slower than Random Forests when making predictions.

6. If your AdaBoost ensemble underfits the training data, you can try increasing the number of estimators or reducing the regularization hyperparameters of the base estimator. You may also try slightly increasing the learning rate.

7. If your Gradient Boosting ensemble overfits the training set, you should try decreasing the learning rate. You could also use early stopping to find the right number of predictors (you probably have too many).

For the solutions to exercises 8 and 9, please see the Jupyter notebooks available at https://github.com/ageron/handson-ml.


Chapter 8: Dimensionality Reduction

1. Motivations and drawbacks:

The main motivations for dimensionality reduction are:

To speed up a subsequent training algorithm (in some cases it may even remove noise and redundant features, making the training algorithm perform better).

To visualize the data and gain insights on the most important features.

Simply to save space (compression).

The main drawbacks are:

Some information is lost, possibly degrading the performance of subsequent training algorithms.

It can be computationally intensive.

It adds some complexity to your Machine Learning pipelines.

Transformed features are often hard to interpret.

2. The curse of dimensionality refers to the fact that many problems that do not exist in low-dimensional space arise in high-dimensional space. In Machine Learning, one common manifestation is the fact that randomly sampled high-dimensional vectors are generally very sparse, increasing the risk of overfitting and making it very difficult to identify patterns in the data without having plenty of training data.

3. Onceadataset’sdimensionalityhasbeenreducedusingoneofthealgorithmswediscussed,itisalmostalwaysimpossibletoperfectlyreversetheoperation,becausesomeinformationgetslostduringdimensionalityreduction.Moreover,whilesomealgorithms(suchasPCA)haveasimplereversetransformationprocedurethatcanreconstructadatasetrelativelysimilartotheoriginal,otheralgorithms(suchasT-SNE)donot.

4. PCA can be used to significantly reduce the dimensionality of most datasets, even if they are highly nonlinear, because it can at least get rid of useless dimensions. However, if there are no useless dimensions—for example, the Swiss roll—then reducing dimensionality with PCA will lose too much information. You want to unroll the Swiss roll, not squash it.

5. That’satrickquestion:itdependsonthedataset.Let’slookattwoextremeexamples.First,supposethedatasetiscomposedofpointsthatarealmostperfectlyaligned.Inthiscase,PCAcanreducethedatasetdowntojustonedimensionwhilestillpreserving95%ofthevariance.Nowimaginethatthedatasetiscomposedofperfectlyrandompoints,scatteredallaroundthe1,000dimensions.Inthiscaseroughly950dimensionsarerequiredtopreserve95%ofthevariance.Sotheansweris,itdependsonthedataset,anditcouldbeanynumberbetween1and950.Plottingtheexplainedvarianceasafunctionofthenumberofdimensionsisonewaytogetaroughideaofthedataset’sintrinsicdimensionality.


6. Regular PCA is the default, but it works only if the dataset fits in memory. Incremental PCA is useful for large datasets that don't fit in memory, but it is slower than regular PCA, so if the dataset fits in memory you should prefer regular PCA. Incremental PCA is also useful for online tasks, when you need to apply PCA on the fly, every time a new instance arrives. Randomized PCA is useful when you want to considerably reduce dimensionality and the dataset fits in memory; in this case, it is much faster than regular PCA. Finally, Kernel PCA is useful for nonlinear datasets.

7. Intuitively, a dimensionality reduction algorithm performs well if it eliminates a lot of dimensions from the dataset without losing too much information. One way to measure this is to apply the reverse transformation and measure the reconstruction error. However, not all dimensionality reduction algorithms provide a reverse transformation. Alternatively, if you are using dimensionality reduction as a preprocessing step before another Machine Learning algorithm (e.g., a Random Forest classifier), then you can simply measure the performance of that second algorithm; if dimensionality reduction did not lose too much information, then the algorithm should perform just as well as when using the original dataset.

8. It can absolutely make sense to chain two different dimensionality reduction algorithms. A common example is using PCA to quickly get rid of a large number of useless dimensions, then applying another much slower dimensionality reduction algorithm, such as LLE. This two-step approach will likely yield the same performance as using LLE only, but in a fraction of the time.

For the solutions to exercises 9 and 10, please see the Jupyter notebooks available at https://github.com/ageron/handson-ml.


Chapter 9: Up and Running with TensorFlow

1. Main benefits and drawbacks of creating a computation graph rather than directly executing the computations:

Main benefits:

TensorFlow can automatically compute the gradients for you (using reverse-mode autodiff).

TensorFlow can take care of running the operations in parallel in different threads.

It makes it easier to run the same model across different devices.

It simplifies introspection—for example, to view the model in TensorBoard.

Main drawbacks:

It makes the learning curve steeper.

It makes step-by-step debugging harder.

2. Yes, the statement a_val = a.eval(session=sess) is indeed equivalent to a_val = sess.run(a).

3. No, the statement a_val, b_val = a.eval(session=sess), b.eval(session=sess) is not equivalent to a_val, b_val = sess.run([a, b]). Indeed, the first statement runs the graph twice (once to compute a, once to compute b), while the second statement runs the graph only once. If any of these operations (or the ops they depend on) have side effects (e.g., a variable is modified, an item is inserted in a queue, or a reader reads a file), then the effects will be different. If they don't have side effects, both statements will return the same result, but the second statement will be faster than the first. (A short demonstration follows this list.)

4. No, you cannot run two graphs in the same session. You would have to merge the graphs into a single graph first.

5. In local TensorFlow, sessions manage variable values, so if you create a graph g containing a variable w, then start two threads and open a local session in each thread, both using the same graph g, then each session will have its own copy of the variable w. However, in distributed TensorFlow, variable values are stored in containers managed by the cluster, so if both sessions connect to the same cluster and use the same container, then they will share the same variable value for w.

6. A variable is initialized when you call its initializer, and it is destroyed when the session ends. In distributed TensorFlow, variables live in containers on the cluster, so closing a session will not destroy the variable. To destroy a variable, you need to clear its container.

7. Variables and placeholders are extremely different, but beginners often confuse them:

A variable is an operation that holds a value. If you run the variable, it returns that value. Before you can run it, you need to initialize it. You can change the variable's value (for example, by using an assignment operation). It is stateful: the variable keeps the same value upon successive runs of the graph. It is typically used to hold model parameters but also for other purposes (e.g., to count the global training step).

Placeholders technically don't do much: they just hold information about the type and shape of the tensor they represent, but they have no value. In fact, if you try to evaluate an operation that depends on a placeholder, you must feed TensorFlow the value of the placeholder (using the feed_dict argument) or else you will get an exception. Placeholders are typically used to feed training or test data to TensorFlow during the execution phase. They are also useful to pass a value to an assignment node, to change the value of a variable (e.g., model weights).

8. If you run the graph to evaluate an operation that depends on a placeholder but you don't feed its value, you get an exception. If the operation does not depend on the placeholder, then no exception is raised.

9. When you run a graph, you can feed the output value of any operation, not just the value of placeholders. In practice, however, this is rather rare (it can be useful, for example, when you are caching the output of frozen layers; see Chapter 11).

10. Youcanspecifyavariable’sinitialvaluewhenconstructingthegraph,anditwillbeinitializedlaterwhenyourunthevariable’sinitializerduringtheexecutionphase.Ifyouwanttochangethatvariable’svaluetoanythingyouwantduringtheexecutionphase,thenthesimplestoptionistocreateanassignmentnode(duringthegraphconstructionphase)usingthetf.assign()function,passingthevariableandaplaceholderasparameters.Duringtheexecutionphase,youcanruntheassignmentoperationandfeedthevariable’snewvalueusingtheplaceholder.

import tensorflow as tf

x = tf.Variable(tf.random_uniform(shape=(), minval=0.0, maxval=1.0))
x_new_val = tf.placeholder(shape=(), dtype=tf.float32)
x_assign = tf.assign(x, x_new_val)

with tf.Session():
    x.initializer.run()  # random number is sampled *now*
    print(x.eval())  # 0.646157 (some random number)
    x_assign.eval(feed_dict={x_new_val: 5.0})
    print(x.eval())  # 5.0

11. Reverse-mode autodiff (implemented by TensorFlow) needs to traverse the graph only twice in order to compute the gradients of the cost function with regards to any number of variables. On the other hand, forward-mode autodiff would need to run once for each variable (so 10 times if we want the gradients with regards to 10 different variables). As for symbolic differentiation, it would build a different graph to compute the gradients, so it would not traverse the original graph at all (except when building the new gradients graph). A highly optimized symbolic differentiation system could potentially run the new gradients graph only once to compute the gradients with regards to all variables, but that new graph may be horribly complex and inefficient compared to the original graph.

12. See the Jupyter notebooks available at https://github.com/ageron/handson-ml.
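Here is a short demonstration of the difference described in answer 3, using a variable increment as the side effect (a minimal sketch; the variable and op names are ours):

import tensorflow as tf

v = tf.Variable(0)
inc = tf.assign_add(v, 1)  # running this op increments v (a side effect)
a = inc + 0
b = inc + 0

with tf.Session() as sess:
    v.initializer.run()
    a_val, b_val = a.eval(session=sess), b.eval(session=sess)  # two graph runs
    print(v.eval())  # 2: the side effect happened twice
    a_val, b_val = sess.run([a, b])  # a single graph run
    print(v.eval())  # 3: the side effect happened only once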


Chapter 10: Introduction to Artificial Neural Networks

1. Here is a neural network based on the original artificial neurons that computes A ⊕ B (where ⊕ represents the exclusive OR), using the fact that A ⊕ B = (A ∧ ¬B) ∨ (¬A ∧ B). There are other solutions; for example, using the fact that A ⊕ B = (A ∨ B) ∧ ¬(A ∧ B), or the fact that A ⊕ B = (A ∨ B) ∧ (¬A ∨ ¬B), and so on.

2. A classical Perceptron will converge only if the dataset is linearly separable, and it won't be able to estimate class probabilities. In contrast, a Logistic Regression classifier will converge to a good solution even if the dataset is not linearly separable, and it will output class probabilities. If you change the Perceptron's activation function to the logistic activation function (or the softmax activation function if there are multiple neurons), and if you train it using Gradient Descent (or some other optimization algorithm minimizing the cost function, typically cross entropy), then it becomes equivalent to a Logistic Regression classifier.

3. The logistic activation function was a key ingredient in training the first MLPs because its derivative is always nonzero, so Gradient Descent can always roll down the slope. When the activation function is a step function, Gradient Descent cannot move, as there is no slope at all.

4. The step function, the logistic function, the hyperbolic tangent, the rectified linear unit (see Figure 10-8). See Chapter 11 for other examples, such as ELU and variants of the ReLU.

5. Considering the MLP described in the question: suppose you have an MLP composed of one input layer with 10 passthrough neurons, followed by one hidden layer with 50 artificial neurons, and finally one output layer with 3 artificial neurons. All artificial neurons use the ReLU activation function.

The shape of the input matrix X is m × 10, where m represents the training batch size.

The shape of the hidden layer's weight matrix Wh is 10 × 50, and the length of its bias vector bh is 50.

The shape of the output layer's weight matrix Wo is 50 × 3, and the length of its bias vector bo is 3.

The shape of the network's output matrix Y is m × 3.

Y = ReLU(ReLU(X · Wh + bh) · Wo + bo). Recall that the ReLU function just sets every negative number in the matrix to zero. Also note that when you are adding a bias vector to a matrix, it is added to every single row in the matrix, which is called broadcasting.
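A quick way to sanity-check these shapes is to run the forward pass in NumPy (a minimal sketch; the batch size m = 32 is arbitrary):

    import numpy as np

    m = 32
    X = np.random.rand(m, 10)
    Wh, bh = np.random.rand(10, 50), np.random.rand(50)
    Wo, bo = np.random.rand(50, 3), np.random.rand(3)

    relu = lambda z: np.maximum(z, 0)  # sets negative values to zero
    Y = relu(relu(X.dot(Wh) + bh).dot(Wo) + bo)  # biases are broadcast row-wise
    print(Y.shape)  # (32, 3), i.e., m x 3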

6. To classify email into spam or ham, you just need one neuron in the output layer of a neural network, for example, indicating the probability that the email is spam. You would typically use the logistic activation function in the output layer when estimating a probability. If instead you want to tackle MNIST, you need 10 neurons in the output layer, and you must replace the logistic function with the softmax activation function, which can handle multiple classes, outputting one probability per class. Now, if you want your neural network to predict housing prices like in Chapter 2, then you need one output neuron, using no activation function at all in the output layer.4

7. Backpropagation is a technique used to train artificial neural networks. It first computes the gradients of the cost function with regards to every model parameter (all the weights and biases), and then it performs a Gradient Descent step using these gradients. This backpropagation step is typically performed thousands or millions of times, using many training batches, until the model parameters converge to values that (hopefully) minimize the cost function. To compute the gradients, backpropagation uses reverse-mode autodiff (although it wasn't called that when backpropagation was invented, and it has been reinvented several times). Reverse-mode autodiff performs a forward pass through a computation graph, computing every node's value for the current training batch, and then it performs a reverse pass, computing all the gradients at once (see Appendix D for more details). So what's the difference? Well, backpropagation refers to the whole process of training an artificial neural network using multiple backpropagation steps, each of which computes gradients and uses them to perform a Gradient Descent step. In contrast, reverse-mode autodiff is simply a technique to compute gradients efficiently, and it happens to be used by backpropagation.

8. Here is a list of all the hyperparameters you can tweak in a basic MLP: the number of hidden layers, the number of neurons in each hidden layer, and the activation function used in each hidden layer and in the output layer.5 In general, the ReLU activation function (or one of its variants; see Chapter 11) is a good default for the hidden layers. For the output layer, in general you will want the logistic activation function for binary classification, the softmax activation function for multiclass classification, or no activation function for regression. If the MLP overfits the training data, you can try reducing the number of hidden layers and reducing the number of neurons per hidden layer.

9. See the Jupyter notebooks available at https://github.com/ageron/handson-ml.


Chapter 11: Training Deep Neural Nets

1. No, all weights should be sampled independently; they should not all have the same initial value. One important goal of sampling weights randomly is to break symmetries: if all the weights have the same initial value, even if that value is not zero, then symmetry is not broken (i.e., all neurons in a given layer are equivalent), and backpropagation will be unable to break it. Concretely, this means that all the neurons in any given layer will always have the same weights. It's like having just one neuron per layer, and much slower. It is virtually impossible for such a configuration to converge to a good solution.

2. It is perfectly fine to initialize the bias terms to zero. Some people like to initialize them just like weights, and that's okay too; it does not make much difference.

3. A few advantages of the ELU function over the ReLU function are:

- It can take on negative values, so the average output of the neurons in any given layer is typically closer to 0 than when using the ReLU activation function (which never outputs negative values). This helps alleviate the vanishing gradients problem.

- It always has a nonzero derivative, which avoids the dying units issue that can affect ReLU units.

- It is smooth everywhere, whereas the ReLU's slope abruptly jumps from 0 to 1 at z = 0. Such an abrupt change can slow down Gradient Descent because it will bounce around z = 0.

4. The ELU activation function is a good default. If you need the neural network to be as fast as possible, you can use one of the leaky ReLU variants instead (e.g., a simple leaky ReLU using the default hyperparameter value). The simplicity of the ReLU activation function makes it many people's preferred option, despite the fact that it is generally outperformed by the ELU and leaky ReLU. However, the ReLU activation function's capability of outputting precisely zero can be useful in some cases (e.g., see Chapter 15). The hyperbolic tangent (tanh) can be useful in the output layer if you need to output a number between –1 and 1, but nowadays it is not used much in hidden layers. The logistic activation function is also useful in the output layer when you need to estimate a probability (e.g., for binary classification), but it is also rarely used in hidden layers (there are exceptions, for example, for the coding layer of variational autoencoders; see Chapter 15). Finally, the softmax activation function is useful in the output layer to output probabilities for mutually exclusive classes, but other than that it is rarely (if ever) used in hidden layers.

5. If you set the momentum hyperparameter too close to 1 (e.g., 0.99999) when using a MomentumOptimizer, then the algorithm will likely pick up a lot of speed, hopefully roughly toward the global minimum, but then it will shoot right past the minimum, due to its momentum. Then it will slow down and come back, accelerate again, overshoot again, and so on. It may oscillate this way many times before converging, so overall it will take much longer to converge than with a smaller momentum value.

6. One way to produce a sparse model (i.e., with most weights equal to zero) is to train the model normally, then zero out tiny weights. For more sparsity, you can apply ℓ1 regularization during training, which pushes the optimizer toward sparsity. A third option is to combine ℓ1 regularization with dual averaging, using TensorFlow's FTRLOptimizer class.
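For example, here is a minimal sketch of the third option (the linear model and placeholders are purely illustrative):

    import tensorflow as tf

    X = tf.placeholder(tf.float32, shape=[None, 5])
    y = tf.placeholder(tf.float32, shape=[None, 1])
    w = tf.Variable(tf.zeros([5, 1]))
    loss = tf.reduce_mean(tf.square(tf.matmul(X, w) - y))

    # FTRL combines l1 regularization with dual averaging, pushing toward sparsity.
    optimizer = tf.train.FtrlOptimizer(learning_rate=0.01,
                                       l1_regularization_strength=0.001)
    training_op = optimizer.minimize(loss)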

7. Yes, dropout does slow down training, in general roughly by a factor of two. However, it has no impact on inference since it is only turned on during training.

For the solutions to exercises 8, 9, and 10, please see the Jupyter notebooks available at https://github.com/ageron/handson-ml.


Chapter 12: Distributing TensorFlow Across Devices and Servers

1. When a TensorFlow process starts, it grabs all the available memory on all GPU devices that are visible to it, so if you get a CUDA_ERROR_OUT_OF_MEMORY when starting your TensorFlow program, it probably means that other processes are running that have already grabbed all the memory on at least one visible GPU device (most likely it is another TensorFlow process). To fix this problem, a trivial solution is to stop the other processes and try again. However, if you need all processes to run simultaneously, a simple option is to dedicate different devices to each process, by setting the CUDA_VISIBLE_DEVICES environment variable appropriately for each process. Another option is to configure TensorFlow to grab only part of the GPU memory, instead of all of it, by creating a ConfigProto, setting its gpu_options.per_process_gpu_memory_fraction to the proportion of the total memory that it should grab (e.g., 0.4), and using this ConfigProto when opening a session. The last option is to tell TensorFlow to grab memory only when it needs it by setting gpu_options.allow_growth to True. However, this last option is usually not recommended because any memory that TensorFlow grabs is never released, and it is harder to guarantee a repeatable behavior (there may be race conditions depending on which processes start first, how much memory they need during training, and so on).
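For example, a minimal sketch of these ConfigProto options:

    import tensorflow as tf

    config = tf.ConfigProto()
    config.gpu_options.per_process_gpu_memory_fraction = 0.4  # grab only 40%
    # or alternatively (usually not recommended, see above):
    # config.gpu_options.allow_growth = True
    sess = tf.Session(config=config)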

2. By pinning an operation on a device, you are telling TensorFlow that this is where you would like this operation to be placed. However, some constraints may prevent TensorFlow from honoring your request. For example, the operation may have no implementation (called a kernel) for that particular type of device. In this case, TensorFlow will raise an exception by default, but you can configure it to fall back to the CPU instead (this is called soft placement). Another example is an operation that can modify a variable; this operation and the variable need to be collocated. So the difference between pinning an operation and placing an operation is that pinning is what you ask TensorFlow ("Please place this operation on GPU #1") while placement is what TensorFlow actually ends up doing ("Sorry, falling back to the CPU").
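For example, a minimal sketch of pinning with soft placement enabled (the variable is illustrative):

    import tensorflow as tf

    with tf.device("/gpu:1"):
        v = tf.Variable(3.0)  # ask TensorFlow to place v on GPU #1

    # With soft placement, TensorFlow falls back to the CPU
    # whenever it cannot honor the pinning request.
    config = tf.ConfigProto(allow_soft_placement=True)
    sess = tf.Session(config=config)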

3. If you are running on a GPU-enabled TensorFlow installation, and you just use the default placement, then if all operations have a GPU kernel (i.e., a GPU implementation), yes, they will all be placed on the first GPU. However, if one or more operations do not have a GPU kernel, then by default TensorFlow will raise an exception. If you configure TensorFlow to fall back to the CPU instead (soft placement), then all operations will be placed on the first GPU except the ones without a GPU kernel and all the operations that must be collocated with them (see the answer to the previous exercise).

4. Yes, if you pin a variable to "/gpu:0", it can be used by operations placed on /gpu:1. TensorFlow will automatically take care of adding the appropriate operations to transfer the variable's value across devices. The same goes for devices located on different servers (as long as they are part of the same cluster).

5. Yes, two operations placed on the same device can run in parallel: TensorFlow automatically takes care of running operations in parallel (on different CPU cores or different GPU threads), as long as no operation depends on another operation's output. Moreover, you can start multiple sessions in parallel threads (or processes), and evaluate operations in each thread. Since sessions are independent, TensorFlow will be able to evaluate any operation from one session in parallel with any operation from another session.

6. Control dependencies are used when you want to postpone the evaluation of an operation X until after some other operations are run, even though these operations are not required to compute X. This is useful in particular when X would occupy a lot of memory and you only need it later in the computation graph, or if X uses up a lot of I/O (for example, it requires a large variable value located on a different device or server) and you don't want it to run at the same time as other I/O-hungry operations, to avoid saturating the bandwidth.
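For example, a minimal sketch (the constants are illustrative):

    import tensorflow as tf

    a = tf.constant(1.0)
    b = tf.constant(2.0)
    with tf.control_dependencies([a, b]):
        x = tf.constant(3.0)  # x will be evaluated only after a and b

    with tf.Session() as sess:
        print(sess.run(x))  # 3.0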

7. You're in luck! In distributed TensorFlow, the variable values live in containers managed by the cluster, so even if you close the session and exit the client program, the model parameters are still alive and well on the cluster. You simply need to open a new session to the cluster and save the model (make sure you don't call the variable initializers or restore a previous model, as this would destroy your precious new model!).

For the solutions to exercises 8, 9, and 10, please see the Jupyter notebooks available at https://github.com/ageron/handson-ml.


Chapter 13: Convolutional Neural Networks

1. These are the main advantages of a CNN over a fully connected DNN for image classification:

- Because consecutive layers are only partially connected and because it heavily reuses its weights, a CNN has many fewer parameters than a fully connected DNN, which makes it much faster to train, reduces the risk of overfitting, and requires much less training data.

- When a CNN has learned a kernel that can detect a particular feature, it can detect that feature anywhere on the image. In contrast, when a DNN learns a feature in one location, it can detect it only in that particular location. Since images typically have very repetitive features, CNNs are able to generalize much better than DNNs for image processing tasks such as classification, using fewer training examples.

- Finally, a DNN has no prior knowledge of how pixels are organized; it does not know that nearby pixels are close. A CNN's architecture embeds this prior knowledge. Lower layers typically identify features in small areas of the images, while higher layers combine the lower-level features into larger features. This works well with most natural images, giving CNNs a decisive head start compared to DNNs.

2. Let's compute how many parameters the CNN has. Since its first convolutional layer has 3 × 3 kernels, and the input has three channels (red, green, and blue), then each feature map has 3 × 3 × 3 weights, plus a bias term. That's 28 parameters per feature map. Since this first convolutional layer has 100 feature maps, it has a total of 2,800 parameters. The second convolutional layer has 3 × 3 kernels, and its input is the set of 100 feature maps of the previous layer, so each feature map has 3 × 3 × 100 = 900 weights, plus a bias term. Since it has 200 feature maps, this layer has 901 × 200 = 180,200 parameters. Finally, the third and last convolutional layer also has 3 × 3 kernels, and its input is the set of 200 feature maps of the previous layers, so each feature map has 3 × 3 × 200 = 1,800 weights, plus a bias term. Since it has 400 feature maps, this layer has a total of 1,801 × 400 = 720,400 parameters. All in all, the CNN has 2,800 + 180,200 + 720,400 = 903,400 parameters.

Now let's compute how much RAM this neural network will require (at least) when making a prediction for a single instance. First let's compute the feature map size for each layer. Since we are using a stride of 2 and SAME padding, the horizontal and vertical size of the feature maps are divided by 2 at each layer (rounding up if necessary), so as the input channels are 200 × 300 pixels, the first layer's feature maps are 100 × 150, the second layer's feature maps are 50 × 75, and the third layer's feature maps are 25 × 38. Since 32 bits is 4 bytes and the first convolutional layer has 100 feature maps, this first layer takes up 4 × 100 × 150 × 100 = 6 million bytes (about 5.7 MB, considering that 1 MB = 1,024 KB and 1 KB = 1,024 bytes). The second layer takes up 4 × 50 × 75 × 200 = 3 million bytes (about 2.9 MB). Finally, the third layer takes up 4 × 25 × 38 × 400 = 1,520,000 bytes (about 1.4 MB). However, once a layer has been computed, the memory occupied by the previous layer can be released, so if everything is well optimized, only 6 + 3 = 9 million bytes (about 8.6 MB) of RAM will be required (when the second layer has just been computed, but the memory occupied by the first layer is not released yet). But wait, you also need to add the memory occupied by the CNN's parameters. We computed earlier that it has 903,400 parameters, each using up 4 bytes, so this adds 3,613,600 bytes (about 3.4 MB). The total RAM required is (at least) 12,613,600 bytes (about 12.0 MB).

Lastly, let's compute the minimum amount of RAM required when training the CNN on a mini-batch of 50 images. During training TensorFlow uses backpropagation, which requires keeping all values computed during the forward pass until the reverse pass begins. So we must compute the total RAM required by all layers for a single instance and multiply that by 50! At that point let's start counting in megabytes rather than bytes. We computed before that the three layers require respectively 5.7, 2.9, and 1.4 MB for each instance. That's a total of 10.0 MB per instance. So for 50 instances the total RAM is 500 MB. Add to that the RAM required by the input images, which is 50 × 4 × 200 × 300 × 3 = 36 million bytes (about 34.3 MB), plus the RAM required for the model parameters, which is about 3.4 MB (computed earlier), plus some RAM for the gradients (we will neglect them since they can be released gradually as backpropagation goes down the layers during the reverse pass). We are up to a total of roughly 500.0 + 34.3 + 3.4 = 537.7 MB. And that's really an optimistic bare minimum.
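As a quick sanity check of the parameter counts, here is the arithmetic in Python (the variable names are just illustrative):

    # Each feature map needs (kernel_height * kernel_width * input_channels + 1) parameters.
    conv1 = (3 * 3 * 3 + 1) * 100    # 2,800
    conv2 = (3 * 3 * 100 + 1) * 200  # 180,200
    conv3 = (3 * 3 * 200 + 1) * 400  # 720,400
    print(conv1 + conv2 + conv3)     # 903,400 parameters in total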

3. If your GPU runs out of memory while training a CNN, here are five things you could try to solve the problem (other than purchasing a GPU with more RAM):

- Reduce the mini-batch size.
- Reduce dimensionality using a larger stride in one or more layers.
- Remove one or more layers.
- Use 16-bit floats instead of 32-bit floats.
- Distribute the CNN across multiple devices.

4. A max pooling layer has no parameters at all, whereas a convolutional layer has quite a few (see the previous questions).

5. A local response normalization layer makes the neurons that most strongly activate inhibit neurons at the same location but in neighboring feature maps, which encourages different feature maps to specialize and pushes them apart, forcing them to explore a wider range of features. It is typically used in the lower layers to have a larger pool of low-level features that the upper layers can build upon.

6. The main innovations in AlexNet compared to LeNet-5 are (1) it is much larger and deeper, and (2) it stacks convolutional layers directly on top of each other, instead of stacking a pooling layer on top of each convolutional layer. The main innovation in GoogLeNet is the introduction of inception modules, which make it possible to have a much deeper net than previous CNN architectures, with fewer parameters. Finally, ResNet's main innovation is the introduction of skip connections, which make it possible to go well beyond 100 layers. Arguably, its simplicity and consistency are also rather innovative.

For the solutions to exercises 7, 8, 9, and 10, please see the Jupyter notebooks available at https://github.com/ageron/handson-ml.


Chapter 14: Recurrent Neural Networks

1. Here are a few RNN applications:

- For a sequence-to-sequence RNN: predicting the weather (or any other time series), machine translation (using an encoder–decoder architecture), video captioning, speech to text, music generation (or other sequence generation), identifying the chords of a song.

- For a sequence-to-vector RNN: classifying music samples by music genre, analyzing the sentiment of a book review, predicting what word an aphasic patient is thinking of based on readings from brain implants, predicting the probability that a user will want to watch a movie based on her watch history (this is one of many possible implementations of collaborative filtering).

- For a vector-to-sequence RNN: image captioning, creating a music playlist based on an embedding of the current artist, generating a melody based on a set of parameters, locating pedestrians in a picture (e.g., a video frame from a self-driving car's camera).

2. In general, if you translate a sentence one word at a time, the result will be terrible. For example, the French sentence "Je vous en prie" means "You are welcome," but if you translate it one word at a time, you get "I you in pray." Huh? It is much better to read the whole sentence first and then translate it. A plain sequence-to-sequence RNN would start translating a sentence immediately after reading the first word, while an encoder–decoder RNN will first read the whole sentence and then translate it. That said, one could imagine a plain sequence-to-sequence RNN that would output silence whenever it is unsure about what to say next (just like human translators do when they must translate a live broadcast).

3. To classify videos based on the visual content, one possible architecture could be to take (say) one frame per second, then run each frame through a convolutional neural network, feed the output of the CNN to a sequence-to-vector RNN, and finally run its output through a softmax layer, giving you all the class probabilities. For training you would just use cross entropy as the cost function. If you wanted to use the audio for classification as well, you could convert every second of audio to a spectrogram, feed this spectrogram to a CNN, and feed the output of this CNN to the RNN (along with the corresponding output of the other CNN).

4. Building an RNN using dynamic_rnn() rather than static_rnn() offers several advantages:

- It is based on a while_loop() operation that is able to swap the GPU's memory to the CPU's memory during backpropagation, avoiding out-of-memory errors.

- It is arguably easier to use, as it can directly take a single tensor as input and output (covering all time steps), rather than a list of tensors (one per time step). No need to stack, unstack, or transpose.

- It generates a smaller graph, easier to visualize in TensorBoard.

5. To handle variable length input sequences, the simplest option is to set the sequence_length parameter when calling the static_rnn() or dynamic_rnn() functions; see the sketch after this paragraph. Another option is to pad the smaller inputs (e.g., with zeros) to make them the same size as the largest input (this may be faster than the first option if the input sequences all have very similar lengths). To handle variable-length output sequences, if you know in advance the length of each output sequence, you can use the sequence_length parameter (for example, consider a sequence-to-sequence RNN that labels every frame in a video with a violence score: the output sequence will be exactly the same length as the input sequence). If you don't know in advance the length of the output sequence, you can use the padding trick: always output the same size sequence, but ignore any outputs that come after the end-of-sequence token (by ignoring them when computing the cost function).
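Here is a minimal sketch using dynamic_rnn() with the sequence_length parameter (the dimensions are illustrative):

    import tensorflow as tf

    n_steps, n_inputs, n_neurons = 10, 5, 32
    X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
    seq_length = tf.placeholder(tf.int32, [None])  # one length per instance

    cell = tf.nn.rnn_cell.BasicRNNCell(num_units=n_neurons)
    outputs, states = tf.nn.dynamic_rnn(cell, X, dtype=tf.float32,
                                        sequence_length=seq_length)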

6. To distribute training and execution of a deep RNN across multiple GPUs, a common technique is simply to place each layer on a different GPU (see Chapter 12).

For the solutions to exercises 7, 8, and 9, please see the Jupyter notebooks available at https://github.com/ageron/handson-ml.


Chapter 15: Autoencoders

1. Here are some of the main tasks that autoencoders are used for:

- Feature extraction
- Unsupervised pretraining
- Dimensionality reduction
- Generative models
- Anomaly detection (an autoencoder is generally bad at reconstructing outliers)

2. If you want to train a classifier and you have plenty of unlabeled training data, but only a few thousand labeled instances, then you could first train a deep autoencoder on the full dataset (labeled + unlabeled), then reuse its lower half for the classifier (i.e., reuse the layers up to and including the codings layer) and train the classifier using the labeled data. If you have little labeled data, you probably want to freeze the reused layers when training the classifier.

3. The fact that an autoencoder perfectly reconstructs its inputs does not necessarily mean that it is a good autoencoder; perhaps it is simply an overcomplete autoencoder that learned to copy its inputs to the codings layer and then to the outputs. In fact, even if the codings layer contained a single neuron, it would be possible for a very deep autoencoder to learn to map each training instance to a different coding (e.g., the first instance could be mapped to 0.001, the second to 0.002, the third to 0.003, and so on), and it could learn "by heart" to reconstruct the right training instance for each coding. It would perfectly reconstruct its inputs without really learning any useful pattern in the data. In practice such a mapping is unlikely to happen, but it illustrates the fact that perfect reconstructions are not a guarantee that the autoencoder learned anything useful. However, if it produces very bad reconstructions, then it is almost guaranteed to be a bad autoencoder.

To evaluate the performance of an autoencoder, one option is to measure the reconstruction loss (e.g., compute the MSE, the mean square of the outputs minus the inputs). Again, a high reconstruction loss is a good sign that the autoencoder is bad, but a low reconstruction loss is not a guarantee that it is good. You should also evaluate the autoencoder according to what it will be used for. For example, if you are using it for unsupervised pretraining of a classifier, then you should also evaluate the classifier's performance.

4. An undercomplete autoencoder is one whose codings layer is smaller than the input and output layers. If it is larger, then it is an overcomplete autoencoder. The main risk of an excessively undercomplete autoencoder is that it may fail to reconstruct the inputs. The main risk of an overcomplete autoencoder is that it may just copy the inputs to the outputs, without learning any useful feature.

5. To tie the weights of an encoder layer and its corresponding decoder layer, you simply make the decoder weights equal to the transpose of the encoder weights. This reduces the number of parameters in the model by half, often making training converge faster with less training data, and reducing the risk of overfitting the training set.
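For example, a minimal sketch (assuming n_inputs and n_hidden are defined, and using a random initializer purely for illustration):

    weights1 = tf.Variable(tf.random_normal([n_inputs, n_hidden]))  # encoder
    weights2 = tf.transpose(weights1, name="weights2")  # decoder, tied weights
    # weights2 is not a variable, so the model has half as many weight parameters.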


6. To visualize the features learned by the lower layer of a stacked autoencoder, a common technique is simply to plot the weights of each neuron, by reshaping each weight vector to the size of an input image (e.g., for MNIST, reshaping a weight vector of shape [784] to [28, 28]). To visualize the features learned by higher layers, one technique is to display the training instances that most activate each neuron.

7. A generative model is a model capable of randomly generating outputs that resemble the training instances. For example, once trained successfully on the MNIST dataset, a generative model can be used to randomly generate realistic images of digits. The output distribution is typically similar to the training data. For example, since MNIST contains many images of each digit, the generative model would output roughly the same number of images of each digit. Some generative models can be parametrized, for example, to generate only some kinds of outputs. An example of a generative autoencoder is the variational autoencoder.

For the solutions to exercises 8, 9, and 10, please see the Jupyter notebooks available at https://github.com/ageron/handson-ml.


Chapter 16: Reinforcement Learning

1. Reinforcement Learning is an area of Machine Learning aimed at creating agents capable of taking actions in an environment in a way that maximizes rewards over time. There are many differences between RL and regular supervised and unsupervised learning. Here are a few:

- In supervised and unsupervised learning, the goal is generally to find patterns in the data. In Reinforcement Learning, the goal is to find a good policy.

- Unlike in supervised learning, the agent is not explicitly given the "right" answer. It must learn by trial and error.

- Unlike in unsupervised learning, there is a form of supervision, through rewards. We do not tell the agent how to perform the task, but we do tell it when it is making progress or when it is failing.

- A Reinforcement Learning agent needs to find the right balance between exploring the environment, looking for new ways of getting rewards, and exploiting sources of rewards that it already knows. In contrast, supervised and unsupervised learning systems generally don't need to worry about exploration; they just feed on the training data they are given.

- In supervised and unsupervised learning, training instances are typically independent (in fact, they are generally shuffled). In Reinforcement Learning, consecutive observations are generally not independent. An agent may remain in the same region of the environment for a while before it moves on, so consecutive observations will be very correlated. In some cases a replay memory is used to ensure that the training algorithm gets fairly independent observations.

2. Here are a few possible applications of Reinforcement Learning, other than those mentioned in Chapter 16:

Music personalization: The environment is a user's personalized web radio. The agent is the software deciding what song to play next for that user. Its possible actions are to play any song in the catalog (it must try to choose a song the user will enjoy) or to play an advertisement (it must try to choose an ad that the user will be interested in). It gets a small reward every time the user listens to a song, a larger reward every time the user listens to an ad, a negative reward when the user skips a song or an ad, and a very negative reward if the user leaves.

Marketing: The environment is your company's marketing department. The agent is the software that defines which customers a mailing campaign should be sent to, given their profile and purchase history (for each customer it has two possible actions: send or don't send). It gets a negative reward for the cost of the mailing campaign, and a positive reward for estimated revenue generated from this campaign.

Product delivery: Let the agent control a fleet of delivery trucks, deciding what they should pick up at the depots, where they should go, what they should drop off, and so on. They would get positive rewards for each product delivered on time, and negative rewards for late deliveries.

3. When estimating the value of an action, Reinforcement Learning algorithms typically sum all the rewards that this action led to, giving more weight to immediate rewards, and less weight to later rewards (considering that an action has more influence on the near future than on the distant future). To model this, a discount rate is typically applied at each time step. For example, with a discount rate of 0.9, a reward of 100 that is received two time steps later is counted as only 0.9² × 100 = 81 when you are estimating the value of the action. You can think of the discount rate as a measure of how much the future is valued relative to the present: if it is very close to 1, then the future is valued almost as much as the present. If it is close to 0, then only immediate rewards matter. Of course, this impacts the optimal policy tremendously: if you value the future, you may be willing to put up with a lot of immediate pain for the prospect of eventual rewards, while if you don't value the future, you will just grab any immediate reward you can find, never investing in the future.

4. To measure the performance of a Reinforcement Learning agent, you can simply sum up the rewards it gets. In a simulated environment, you can run many episodes and look at the total rewards it gets on average (and possibly look at the min, max, standard deviation, and so on).

5. The credit assignment problem is the fact that when a Reinforcement Learning agent receives a reward, it has no direct way of knowing which of its previous actions contributed to this reward. It typically occurs when there is a large delay between an action and the resulting rewards (e.g., during a game of Atari's Pong, there may be a few dozen time steps between the moment the agent hits the ball and the moment it wins the point). One way to alleviate it is to provide the agent with shorter-term rewards, when possible. This usually requires prior knowledge about the task. For example, if we want to build an agent that will learn to play chess, instead of giving it a reward only when it wins the game, we could give it a reward every time it captures one of the opponent's pieces.

6. An agent can often remain in the same region of its environment for a while, so all of its experiences will be very similar for that period of time. This can introduce some bias in the learning algorithm. It may tune its policy for this region of the environment, but it will not perform well as soon as it moves out of this region. To solve this problem, you can use a replay memory; instead of using only the most immediate experiences for learning, the agent will learn based on a buffer of its past experiences, recent and not so recent (perhaps this is why we dream at night: to replay our experiences of the day and better learn from them?).

7. An off-policy RL algorithm learns the value of the optimal policy (i.e., the sum of discounted rewards that can be expected for each state if the agent acts optimally), independently of how the agent actually acts. Q-Learning is a good example of such an algorithm. In contrast, an on-policy algorithm learns the value of the policy that the agent actually executes, including both exploration and exploitation.

For the solutions to exercises 8, 9, and 10, please see the Jupyter notebooks available at https://github.com/ageron/handson-ml.

1. If you draw a straight line between any two points on the curve, the line never crosses the curve.

2. Moreover, the Normal Equation requires computing the inverse of a matrix, but that matrix is not always invertible. In contrast, the matrix for Ridge Regression is always invertible.

3. log₂ is the binary log, log₂(m) = log(m)/log(2).

4. When the values to predict can vary by many orders of magnitude, then you may want to predict the logarithm of the target value rather than the target value directly. Simply computing the exponential of the neural network's output will give you the estimated value (since exp(log v) = v).

5. In Chapter 11 we discuss many techniques that introduce additional hyperparameters: type of weight initialization, activation function hyperparameters (e.g., amount of leak in leaky ReLU), Gradient Clipping threshold, type of optimizer and its hyperparameters (e.g., the momentum hyperparameter when using a MomentumOptimizer), type of regularization for each layer, the regularization hyperparameters (e.g., dropout rate when using dropout), and so on.


Appendix B. Machine Learning Project Checklist

This checklist can guide you through your Machine Learning projects. There are eight main steps:

1. Frame the problem and look at the big picture.
2. Get the data.
3. Explore the data to gain insights.
4. Prepare the data to better expose the underlying data patterns to Machine Learning algorithms.
5. Explore many different models and short-list the best ones.
6. Fine-tune your models and combine them into a great solution.
7. Present your solution.
8. Launch, monitor, and maintain your system.

Obviously, you should feel free to adapt this checklist to your needs.


Frame the Problem and Look at the Big Picture

1. Define the objective in business terms.
2. How will your solution be used?
3. What are the current solutions/workarounds (if any)?
4. How should you frame this problem (supervised/unsupervised, online/offline, etc.)?
5. How should performance be measured?
6. Is the performance measure aligned with the business objective?
7. What would be the minimum performance needed to reach the business objective?
8. What are comparable problems? Can you reuse experience or tools?
9. Is human expertise available?
10. How would you solve the problem manually?
11. List the assumptions you (or others) have made so far.
12. Verify assumptions if possible.


Get the Data

Note: automate as much as possible so you can easily get fresh data.

1. List the data you need and how much you need.
2. Find and document where you can get that data.
3. Check how much space it will take.
4. Check legal obligations, and get authorization if necessary.
5. Get access authorizations.
6. Create a workspace (with enough storage space).
7. Get the data.
8. Convert the data to a format you can easily manipulate (without changing the data itself).
9. Ensure sensitive information is deleted or protected (e.g., anonymized).
10. Check the size and type of data (time series, sample, geographical, etc.).
11. Sample a test set, put it aside, and never look at it (no data snooping!).


Explore the Data

Note: try to get insights from a field expert for these steps.

1. Create a copy of the data for exploration (sampling it down to a manageable size if necessary).
2. Create a Jupyter notebook to keep a record of your data exploration.
3. Study each attribute and its characteristics:
   - Name
   - Type (categorical, int/float, bounded/unbounded, text, structured, etc.)
   - % of missing values
   - Noisiness and type of noise (stochastic, outliers, rounding errors, etc.)
   - Possibly useful for the task?
   - Type of distribution (Gaussian, uniform, logarithmic, etc.)
4. For supervised learning tasks, identify the target attribute(s).
5. Visualize the data.
6. Study the correlations between attributes.
7. Study how you would solve the problem manually.
8. Identify the promising transformations you may want to apply.
9. Identify extra data that would be useful (go back to "Get the Data").
10. Document what you have learned.


Prepare the Data

Notes:
- Work on copies of the data (keep the original dataset intact).
- Write functions for all data transformations you apply, for five reasons:
  - So you can easily prepare the data the next time you get a fresh dataset
  - So you can apply these transformations in future projects
  - To clean and prepare the test set
  - To clean and prepare new data instances once your solution is live
  - To make it easy to treat your preparation choices as hyperparameters

1. Data cleaning:
   - Fix or remove outliers (optional).
   - Fill in missing values (e.g., with zero, mean, median…) or drop their rows (or columns).
2. Feature selection (optional):
   - Drop the attributes that provide no useful information for the task.
3. Feature engineering, where appropriate:
   - Discretize continuous features.
   - Decompose features (e.g., categorical, date/time, etc.).
   - Add promising transformations of features (e.g., log(x), sqrt(x), x^2, etc.).
   - Aggregate features into promising new features.
4. Feature scaling: standardize or normalize features.


Short-List Promising Models

Notes:
- If the data is huge, you may want to sample smaller training sets so you can train many different models in a reasonable time (be aware that this penalizes complex models such as large neural nets or Random Forests).
- Once again, try to automate these steps as much as possible.

1. Train many quick and dirty models from different categories (e.g., linear, naive Bayes, SVM, Random Forests, neural net, etc.) using standard parameters.
2. Measure and compare their performance. For each model, use N-fold cross-validation and compute the mean and standard deviation of the performance measure on the N folds.
3. Analyze the most significant variables for each algorithm.
4. Analyze the types of errors the models make. What data would a human have used to avoid these errors?
5. Have a quick round of feature selection and engineering.
6. Have one or two more quick iterations of the five previous steps.
7. Short-list the top three to five most promising models, preferring models that make different types of errors.


Fine-Tune the System

Notes:
- You will want to use as much data as possible for this step, especially as you move toward the end of fine-tuning.
- As always automate what you can.

1. Fine-tune the hyperparameters using cross-validation. Treat your data transformation choices as hyperparameters, especially when you are not sure about them (e.g., should I replace missing values with zero or with the median value? Or just drop the rows?). Unless there are very few hyperparameter values to explore, prefer random search over grid search. If training is very long, you may prefer a Bayesian optimization approach (e.g., using Gaussian process priors, as described by Jasper Snoek, Hugo Larochelle, and Ryan Adams).1
2. Try Ensemble methods. Combining your best models will often perform better than running them individually.
3. Once you are confident about your final model, measure its performance on the test set to estimate the generalization error.

WARNING: Don't tweak your model after measuring the generalization error: you would just start overfitting the test set.


Present Your Solution

1. Document what you have done.
2. Create a nice presentation. Make sure you highlight the big picture first.
3. Explain why your solution achieves the business objective.
4. Don't forget to present interesting points you noticed along the way. Describe what worked and what did not. List your assumptions and your system's limitations.
5. Ensure your key findings are communicated through beautiful visualizations or easy-to-remember statements (e.g., "the median income is the number-one predictor of housing prices").


Launch!

1. Get your solution ready for production (plug into production data inputs, write unit tests, etc.).
2. Write monitoring code to check your system's live performance at regular intervals and trigger alerts when it drops.
   - Beware of slow degradation too: models tend to "rot" as data evolves.
   - Measuring performance may require a human pipeline (e.g., via a crowdsourcing service).
   - Also monitor your inputs' quality (e.g., a malfunctioning sensor sending random values, or another team's output becoming stale). This is particularly important for online learning systems.
3. Retrain your models on a regular basis on fresh data (automate as much as possible).

1. "Practical Bayesian Optimization of Machine Learning Algorithms," J. Snoek, H. Larochelle, R. Adams (2012).


Appendix C. SVM Dual Problem

To understand duality, you first need to understand the Lagrange multipliers method. The general idea is to transform a constrained optimization objective into an unconstrained one, by moving the constraints into the objective function. Let's look at a simple example. Suppose you want to find the values of x and y that minimize the function f(x, y) = x² + 2y, subject to an equality constraint: 3x + 2y + 1 = 0. Using the Lagrange multipliers method, we start by defining a new function called the Lagrangian (or Lagrange function): g(x, y, α) = f(x, y) – α(3x + 2y + 1). Each constraint (in this case just one) is subtracted from the original objective, multiplied by a new variable called a Lagrange multiplier.

Joseph-Louis Lagrange showed that if (x̂, ŷ) is a solution to the constrained optimization problem, then there must exist an α̂ such that (x̂, ŷ, α̂) is a stationary point of the Lagrangian (a stationary point is a point where all partial derivatives are equal to zero). In other words, we can compute the partial derivatives of g(x, y, α) with regards to x, y, and α; we can find the points where these derivatives are all equal to zero; and the solutions to the constrained optimization problem (if they exist) must be among these stationary points.

In this example the partial derivatives are:

$$\frac{\partial g}{\partial x} = 2x - 3\alpha, \qquad \frac{\partial g}{\partial y} = 2 - 2\alpha, \qquad \frac{\partial g}{\partial \alpha} = -3x - 2y - 1$$

When all these partial derivatives are equal to 0, we find that 2x − 3α = 0, 2 − 2α = 0, and 3x + 2y + 1 = 0, from which we can easily find that α = 1, x = 3/2, and y = −11/4. This is the only stationary point, and as it respects the constraint, it must be the solution to the constrained optimization problem.

However, this method applies only to equality constraints. Fortunately, under some regularity conditions (which are respected by the SVM objectives), this method can be generalized to inequality constraints as well (e.g., 3x + 2y + 1 ≥ 0). The generalized Lagrangian for the hard margin problem is given by Equation C-1, where the α(i) variables are called the Karush–Kuhn–Tucker (KKT) multipliers, and they must be greater or equal to zero.

Equation C-1. Generalized Lagrangian for the hard margin problem

$$\mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\,\mathbf{w}^T \cdot \mathbf{w} \;-\; \sum_{i=1}^{m} \alpha^{(i)} \left( t^{(i)} \left( \mathbf{w}^T \cdot \mathbf{x}^{(i)} + b \right) - 1 \right) \quad \text{with } \alpha^{(i)} \ge 0 \text{ for } i = 1, 2, \dots, m$$

Just like with the Lagrange multipliers method, you can compute the partial derivatives and locate the stationary points. If there is a solution, it will necessarily be among the stationary points (ŵ, b̂, α̂) that respect the KKT conditions:

- Respect the problem's constraints: t(i)(ŵᵀ · x(i) + b̂) ≥ 1 for i = 1, 2, …, m,

- Verify α̂(i) ≥ 0 for i = 1, 2, …, m,

- Either α̂(i) = 0 or the ith constraint must be an active constraint, meaning it must hold by equality: t(i)(ŵᵀ · x(i) + b̂) = 1. This condition is called the complementary slackness condition. It implies that either α̂(i) = 0 or the ith instance lies on the boundary (it is a support vector).

Note that the KKT conditions are necessary conditions for a stationary point to be a solution of the constrained optimization problem. Under some conditions, they are also sufficient conditions. Luckily, the SVM optimization problem happens to meet these conditions, so any stationary point that meets the KKT conditions is guaranteed to be a solution to the constrained optimization problem.

We can compute the partial derivatives of the generalized Lagrangian with regards to w and b with Equation C-2.

Equation C-2. Partial derivatives of the generalized Lagrangian

$$\nabla_{\mathbf{w}} \mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha}) = \mathbf{w} - \sum_{i=1}^{m} \alpha^{(i)} t^{(i)} \mathbf{x}^{(i)}, \qquad \frac{\partial}{\partial b} \mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha}) = -\sum_{i=1}^{m} \alpha^{(i)} t^{(i)}$$

When these partial derivatives are equal to 0, we have Equation C-3.

Equation C-3. Properties of the stationary points

$$\hat{\mathbf{w}} = \sum_{i=1}^{m} \hat{\alpha}^{(i)} t^{(i)} \mathbf{x}^{(i)}, \qquad \sum_{i=1}^{m} \hat{\alpha}^{(i)} t^{(i)} = 0$$

If we plug these results into the definition of the generalized Lagrangian, some terms disappear and we find Equation C-4.

Equation C-4. Dual form of the SVM problem

$$\mathcal{L}(\hat{\mathbf{w}}, \hat{b}, \boldsymbol{\alpha}) = \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha^{(i)} \alpha^{(j)} t^{(i)} t^{(j)} \, \mathbf{x}^{(i)T} \cdot \mathbf{x}^{(j)} \;-\; \sum_{i=1}^{m} \alpha^{(i)}$$

The goal is now to find the vector α̂ that minimizes this function, with α(i) ≥ 0 for all instances. This constrained optimization problem is the dual problem we were looking for.

Once you find the optimal α̂, you can compute ŵ using the first line of Equation C-3. To compute b̂, you can use the fact that a support vector verifies t(i)(ŵᵀ · x(i) + b̂) = 1, so if the kth instance is a support vector (i.e., α̂(k) > 0), you can use it to compute b̂ = t(k) − ŵᵀ · x(k). However, it is often preferred to compute the average over all support vectors to get a more stable and precise value, as in Equation C-5 (where nₛ is the number of support vectors).

Equation C-5. Bias term estimation using the dual form

$$\hat{b} = \frac{1}{n_s} \sum_{\substack{i=1 \\ \hat{\alpha}^{(i)} > 0}}^{m} \left( t^{(i)} - \hat{\mathbf{w}}^T \cdot \mathbf{x}^{(i)} \right)$$


Appendix D. Autodiff

This appendix explains how TensorFlow's autodiff feature works, and how it compares to other solutions.

Suppose you define a function f(x, y) = x²y + y + 2, and you need its partial derivatives ∂f/∂x and ∂f/∂y, typically to perform Gradient Descent (or some other optimization algorithm). Your main options are manual differentiation, symbolic differentiation, numerical differentiation, forward-mode autodiff, and finally reverse-mode autodiff. TensorFlow implements this last option. Let's go through each of these options.


Manual Differentiation

The first approach is to pick up a pencil and a piece of paper and use your calculus knowledge to derive the partial derivatives manually. For the function f(x, y) just defined, it is not too hard; you just need to use five rules:

- The derivative of a constant is 0.
- The derivative of λx is λ (where λ is a constant).
- The derivative of x^λ is λx^(λ–1), so the derivative of x² is 2x.
- The derivative of a sum of functions is the sum of these functions' derivatives.
- The derivative of λ times a function is λ times its derivative.

From these rules, you can derive Equation D-1:

Equation D-1. Partial derivatives of f(x, y)

$$\frac{\partial f}{\partial x} = 2xy, \qquad \frac{\partial f}{\partial y} = x^2 + 1$$

This approach can become very tedious for more complex functions, and you run the risk of making mistakes. The good news is that deriving the mathematical equations for the partial derivatives like we just did can be automated, through a process called symbolic differentiation.


Symbolic Differentiation

Figure D-1 shows how symbolic differentiation works on an even simpler function, g(x, y) = 5 + xy. The graph for that function is represented on the left. After symbolic differentiation, we get the graph on the right, which represents the partial derivative ∂g/∂x (we could similarly obtain the partial derivative with regards to y).

Figure D-1. Symbolic differentiation

The algorithm starts by getting the partial derivative of the leaf nodes. The constant node (5) returns the constant 0, since the derivative of a constant is always 0. The variable x returns the constant 1 since ∂x/∂x = 1, and the variable y returns the constant 0 since ∂y/∂x = 0 (if we were looking for the partial derivative with regards to y, it would be the reverse).

Now we have all we need to move up the graph to the multiplication node in function g. Calculus tells us that the derivative of the product of two functions u and v is ∂(u × v)/∂x = ∂v/∂x × u + v × ∂u/∂x. We can therefore construct a large part of the graph on the right, representing 0 × x + y × 1.

Finally, we can go up to the addition node in function g. As mentioned, the derivative of a sum of functions is the sum of these functions' derivatives. So we just need to create an addition node and connect it to the parts of the graph we have already computed. We get the correct partial derivative: ∂g/∂x = 0 + (0 × x + y × 1).

However, it can be simplified (a lot). A few trivial pruning steps can be applied to this graph to get rid of all unnecessary operations, and we get a much smaller graph with just one node: ∂g/∂x = y.

In this case, simplification is fairly easy, but for a more complex function, symbolic differentiation can produce a huge graph that may be tough to simplify and lead to suboptimal performance. Most importantly, symbolic differentiation cannot deal with functions defined with arbitrary code; for example, the following function discussed in Chapter 9:

    def my_func(a, b):
        z = 0
        for i in range(100):
            z = a * np.cos(z + i) + z * np.sin(b - i)
        return z


Numerical Differentiation

The simplest solution is to compute an approximation of the derivatives, numerically. Recall that the derivative h′(x₀) of a function h(x) at a point x₀ is the slope of the function at that point, or more precisely Equation D-2.

Equation D-2. Derivative of a function h(x) at point x₀

$$h'(x_0) = \lim_{x \to x_0} \frac{h(x) - h(x_0)}{x - x_0} = \lim_{\epsilon \to 0} \frac{h(x_0 + \epsilon) - h(x_0)}{\epsilon}$$

So if we want to calculate the partial derivative of f(x, y) with regards to x, at x = 3 and y = 4, we can simply compute f(3 + ϵ, 4) – f(3, 4) and divide the result by ϵ, using a very small value for ϵ. That's exactly what the following code does:

    def f(x, y):
        return x**2*y + y + 2

    def derivative(f, x, y, x_eps, y_eps):
        return (f(x + x_eps, y + y_eps) - f(x, y)) / (x_eps + y_eps)

    df_dx = derivative(f, 3, 4, 0.00001, 0)
    df_dy = derivative(f, 3, 4, 0, 0.00001)

Unfortunately, the result is imprecise (and it gets worse for more complex functions). The correct results are respectively 24 and 10, but instead we get:

    >>> print(df_dx)
    24.000039999805264
    >>> print(df_dy)
    10.000000000331966

Notice that to compute both partial derivatives, we have to call f() at least three times (we called it four times in the preceding code, but it could be optimized). If there were 1,000 parameters, we would need to call f() at least 1,001 times. When you are dealing with large neural networks, this makes numerical differentiation way too inefficient.

However, numerical differentiation is so simple to implement that it is a great tool to check that the other methods are implemented correctly. For example, if it disagrees with your manually derived function, then your function probably contains a mistake.


Forward-Mode Autodiff

Forward-mode autodiff is neither numerical differentiation nor symbolic differentiation, but in some ways it is their love child. It relies on dual numbers, which are (weird but fascinating) numbers of the form a + bϵ where a and b are real numbers and ϵ is an infinitesimal number such that ϵ² = 0 (but ϵ ≠ 0). You can think of the dual number 42 + 24ϵ as something akin to 42.0000…000024 with an infinite number of 0s (but of course this is simplified just to give you some idea of what dual numbers are). A dual number is represented in memory as a pair of floats. For example, 42 + 24ϵ is represented by the pair (42.0, 24.0).

Dual numbers can be added, multiplied, and so on, as shown in Equation D-3.

Equation D-3. A few operations with dual numbers

$$\lambda(a + b\epsilon) = \lambda a + \lambda b \epsilon$$
$$(a + b\epsilon) + (c + d\epsilon) = (a + c) + (b + d)\epsilon$$
$$(a + b\epsilon) \times (c + d\epsilon) = ac + (ad + bc)\epsilon + (bd)\epsilon^2 = ac + (ad + bc)\epsilon$$

Most importantly, it can be shown that h(a + bϵ) = h(a) + b × h′(a)ϵ, so computing h(a + ϵ) gives you both h(a) and the derivative h′(a) in just one shot. Figure D-2 shows how forward-mode autodiff computes the partial derivative of f(x, y) with regards to x at x = 3 and y = 4. All we need to do is compute f(3 + ϵ, 4); this will output a dual number whose first component is equal to f(3, 4) and whose second component is equal to ∂f/∂x (3, 4).


Figure D-2. Forward-mode autodiff

To compute ∂f/∂y (3, 4) we would have to go through the graph again, but this time with x = 3 and y = 4 + ϵ.

So forward-mode autodiff is much more accurate than numerical differentiation, but it suffers from the same major flaw: if there were 1,000 parameters, it would require 1,000 passes through the graph to compute all the partial derivatives. This is where reverse-mode autodiff shines: it can compute all of them in just two passes through the graph.
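To make this concrete, here is a minimal sketch of forward-mode autodiff in pure Python, using a tiny hypothetical Dual class (only the addition and multiplication rules from Equation D-3 are implemented, which is enough for f):

    class Dual:
        def __init__(self, value, eps=0.0):
            self.value, self.eps = value, eps
        def __add__(self, other):
            other = other if isinstance(other, Dual) else Dual(other)
            return Dual(self.value + other.value, self.eps + other.eps)
        __radd__ = __add__
        def __mul__(self, other):
            other = other if isinstance(other, Dual) else Dual(other)
            return Dual(self.value * other.value,
                        self.value * other.eps + self.eps * other.value)
        __rmul__ = __mul__

    def f(x, y):
        return x * x * y + y + 2

    result = f(Dual(3.0, 1.0), Dual(4.0))  # x = 3 + eps, y = 4
    print(result.value, result.eps)        # 42.0 24.0, i.e., f(3, 4) and df/dx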


Reverse-Mode Autodiff

Reverse-mode autodiff is the solution implemented by TensorFlow. It first goes through the graph in the forward direction (i.e., from the inputs to the output) to compute the value of each node. Then it does a second pass, this time in the reverse direction (i.e., from the output to the inputs) to compute all the partial derivatives. Figure D-3 represents the second pass. During the first pass, all the node values were computed, starting from x = 3 and y = 4. You can see those values at the bottom right of each node (e.g., x × x = 9). The nodes are labeled n1 to n7 for clarity. The output node is n7: f(3, 4) = n7 = 42.

Figure D-3. Reverse-mode autodiff

The idea is to gradually go down the graph, computing the partial derivative of f(x, y) with regards to each consecutive node, until we reach the variable nodes. For this, reverse-mode autodiff relies heavily on the chain rule, shown in Equation D-4.

Equation D-4. Chain rule

$$\frac{\partial f}{\partial x} = \frac{\partial f}{\partial n_i} \times \frac{\partial n_i}{\partial x}$$

Since n7 is the output node, f = n7 so trivially ∂f/∂n7 = 1.

Let's continue down the graph to n5: how much does f vary when n5 varies? The answer is ∂f/∂n5 = ∂f/∂n7 × ∂n7/∂n5. We already know that ∂f/∂n7 = 1, so all we need is ∂n7/∂n5. Since n7 simply performs the sum n5 + n6, we find that ∂n7/∂n5 = 1, so ∂f/∂n5 = 1 × 1 = 1.

Now we can proceed to node n4: how much does f vary when n4 varies? The answer is ∂f/∂n4 = ∂f/∂n5 × ∂n5/∂n4. Since n5 = n4 × n2, we find that ∂n5/∂n4 = n2, so ∂f/∂n4 = 1 × n2 = 4.

The process continues until we reach the bottom of the graph. At that point we will have calculated all the partial derivatives of f(x, y) at the point x = 3 and y = 4. In this example, we find ∂f/∂x = 24 and ∂f/∂y = 10. Sounds about right!

Reverse-mode autodiff is a very powerful and accurate technique, especially when there are many inputs and few outputs, since it requires only one forward pass plus one reverse pass per output to compute all the partial derivatives for all outputs with regards to all the inputs. Most importantly, it can deal with functions defined by arbitrary code. It can also handle functions that are not entirely differentiable, as long as you ask it to compute the partial derivatives at points that are differentiable.
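In TensorFlow this is exposed through tf.gradients(), which builds graph nodes computing the partial derivatives. Here is a minimal sketch on the running example:

    import tensorflow as tf

    x = tf.Variable(3.0)
    y = tf.Variable(4.0)
    f = x * x * y + y + 2

    gradients = tf.gradients(f, [x, y])  # builds nodes computing df/dx and df/dy
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        print(sess.run(gradients))  # [24.0, 10.0]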

TIP: If you implement a new type of operation in TensorFlow and you want to make it compatible with autodiff, then you need to provide a function that builds a subgraph to compute its partial derivatives with regards to its inputs. For example, suppose you implement a function that computes the square of its input f(x) = x². In that case you would need to provide the corresponding derivative function f′(x) = 2x. Note that this function does not compute a numerical result, but instead builds a subgraph that will (later) compute the result. This is very useful because it means that you can compute gradients of gradients (to compute second-order derivatives, or even higher-order derivatives).


Appendix E. Other Popular ANN Architectures

In this appendix we will give a quick overview of a few historically important neural network architectures that are much less used today than deep Multi-Layer Perceptrons (Chapter 10), convolutional neural networks (Chapter 13), recurrent neural networks (Chapter 14), or autoencoders (Chapter 15). They are often mentioned in the literature, and some are still used in many applications, so it is worth knowing about them. Moreover, we will discuss deep belief nets (DBNs), which were the state of the art in Deep Learning until the early 2010s. They are still the subject of very active research, so they may well come back with a vengeance in the near future.


Hopfield Networks

Hopfield networks were first introduced by W. A. Little in 1974, then popularized by J. Hopfield in 1982. They are associative memory networks: you first teach them some patterns, and then when they see a new pattern they (hopefully) output the closest learned pattern. This has made them useful in particular for character recognition before they were outperformed by other approaches. You first train the network by showing it examples of character images (each binary pixel maps to one neuron), and then when you show it a new character image, after a few iterations it outputs the closest learned character.

They are fully connected graphs (see Figure E-1); that is, every neuron is connected to every other neuron. Note that on the diagram the images are 6 × 6 pixels, so the neural network on the left should contain 36 neurons (and 630 connections), but for visual clarity a much smaller network is represented.

Figure E-1. Hopfield network

The training algorithm works by using Hebb's rule: for each training image, the weight between two neurons is increased if the corresponding pixels are both on or both off, but decreased if one pixel is on and the other is off.

To show a new image to the network, you just activate the neurons that correspond to active pixels. The network then computes the output of every neuron, and this gives you a new image. You can then take this new image and repeat the whole process. After a while, the network reaches a stable state. Generally, this corresponds to the training image that most resembles the input image.

A so-called energy function is associated with Hopfield nets. At each iteration, the energy decreases, so the network is guaranteed to eventually stabilize to a low-energy state. The training algorithm tweaks the weights in a way that decreases the energy level of the training patterns, so the network is likely to stabilize in one of these low-energy configurations. Unfortunately, some patterns that were not in the training set also end up with low energy, so the network sometimes stabilizes in a configuration that was not learned. These are called spurious patterns.

Another major flaw with Hopfield nets is that they don't scale very well: their memory capacity is roughly equal to 14% of the number of neurons. For example, to classify 28 × 28 images, you would need a Hopfield net with 784 fully connected neurons and 306,936 weights. Such a network would only be able to learn about 110 different characters (14% of 784). That's a lot of parameters for such a small memory.


Boltzmann Machines

Boltzmann machines were invented in 1985 by Geoffrey Hinton and Terrence Sejnowski. Just like Hopfield nets, they are fully connected ANNs, but they are based on stochastic neurons: instead of using a deterministic step function to decide what value to output, these neurons output 1 with some probability, and 0 otherwise. The probability function that these ANNs use is based on the Boltzmann distribution (used in statistical mechanics), hence their name. Equation E-1 gives the probability that a particular neuron will output a 1.

Equation E-1. Probability that the ith neuron will output 1

$p\left(s_i = 1\right) = \sigma\left(\dfrac{\sum_{j=1}^{N} w_{i,j}\,s_j + b_i}{T}\right)$

where:

s_j is the jth neuron's state (0 or 1).

w_{i,j} is the connection weight between the ith and jth neurons. Note that w_{i,i} = 0.

b_i is the ith neuron's bias term. We can implement this term by adding a bias neuron to the network.

N is the number of neurons in the network.

T is a number called the network's temperature; the higher the temperature, the more random the output is (i.e., the more the probability approaches 50%).

σ is the logistic function.
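To make Equation E-1 concrete, here is a minimal NumPy sketch (for illustration only) that resamples every neuron once; calling it repeatedly simulates the network drifting toward thermal equilibrium:

import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def boltzmann_step(s, W, b, T=1.0, rng=np.random.default_rng()):
    # Resample each neuron's binary state according to Equation E-1.
    for i in range(len(s)):
        p = logistic((W[i] @ s + b[i]) / T)  # probability that neuron i outputs 1
        s[i] = 1 if rng.random() < p else 0
    return s

# Toy usage: 5 neurons, random symmetric weights, no self-connections.
rng = np.random.default_rng(42)
W = rng.normal(size=(5, 5))
W = (W + W.T) / 2
np.fill_diagonal(W, 0)
b = rng.normal(size=5)
s = rng.integers(0, 2, size=5)
for _ in range(1000):                    # let the network drift toward equilibrium
    s = boltzmann_step(s, W, b, T=1.0, rng=rng)
print(s)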

Neurons in Boltzmann machines are separated into two groups: visible units and hidden units (see Figure E-2). All neurons work in the same stochastic way, but the visible units are the ones that receive the inputs and from which outputs are read.


Figure E-2. Boltzmann machine

Because of its stochastic nature, a Boltzmann machine will never stabilize into a fixed configuration; instead it will keep switching between many configurations. If it is left running for a sufficiently long time, the probability of observing a particular configuration will only be a function of the connection weights and bias terms, not of the original configuration (similarly, after you shuffle a deck of cards for long enough, the configuration of the deck does not depend on the initial state). When the network reaches this state where the original configuration is "forgotten," it is said to be in thermal equilibrium (although its configuration keeps changing all the time). By setting the network parameters appropriately, letting the network reach thermal equilibrium, and then observing its state, we can simulate a wide range of probability distributions. This is called a generative model.

Training a Boltzmann machine means finding the parameters that will make the network approximate the training set's probability distribution. For example, if there are three visible neurons and the training set contains 75% (0, 1, 1) triplets, 10% (0, 0, 1) triplets, and 15% (1, 1, 1) triplets, then after training a Boltzmann machine, you could use it to generate random binary triplets with about the same probability distribution. For example, about 75% of the time it would output the (0, 1, 1) triplet.

Such a generative model can be used in a variety of ways. For example, if it is trained on images, and you provide an incomplete or noisy image to the network, it will automatically "repair" the image in a reasonable way. You can also use a generative model for classification. Just add a few visible neurons to encode the training image's class (e.g., add 10 visible neurons and turn on only the fifth neuron when the training image represents a 5). Then, when given a new image, the network will automatically turn on the appropriate visible neurons, indicating the image's class (e.g., it will turn on the fifth visible neuron if the image represents a 5).

Unfortunately, there is no efficient technique to train Boltzmann machines. However, fairly efficient algorithms have been developed to train restricted Boltzmann machines (RBMs).


Restricted Boltzmann Machines

An RBM is simply a Boltzmann machine in which there are no connections between visible units or between hidden units, only between visible and hidden units. For example, Figure E-3 represents an RBM with three visible units and four hidden units.

Figure E-3. Restricted Boltzmann machine

A very efficient training algorithm, called Contrastive Divergence, was introduced in 2005 by Miguel Á. Carreira-Perpiñán and Geoffrey Hinton.1 Here is how it works: for each training instance x, the algorithm starts by feeding it to the network by setting the state of the visible units to x₁, x₂, …, xₙ. Then you compute the state of the hidden units by applying the stochastic equation described before (Equation E-1). This gives you a hidden vector h (where hᵢ is equal to the state of the ith unit). Next you compute the state of the visible units, by applying the same stochastic equation. This gives you a vector x̂. Then once again you compute the state of the hidden units, which gives you a vector ĥ. Now you can update each connection weight by applying the rule in Equation E-2.

Equation E-2. Contrastive divergence weight update

$w_{i,j} \leftarrow w_{i,j} + \eta\left(\mathbf{x}\,\mathbf{h}^{\intercal} - \hat{\mathbf{x}}\,\hat{\mathbf{h}}^{\intercal}\right)_{i,j}$

where η is the learning rate.

The great benefit of this algorithm is that it does not require waiting for the network to reach thermal equilibrium: it just goes forward, backward, and forward again, and that's it. This makes it incomparably more efficient than previous algorithms, and it was a key ingredient to the first success of Deep Learning based on multiple stacked RBMs.
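Here is a minimal NumPy sketch of one such update (for illustration only; the function name cd1_update is ours, W is stored as a hidden × visible matrix, so the outer products are transposed relative to Equation E-2, and the usual bias updates, which Equation E-2 does not show, are included):

import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_update(x, W, b, c, eta=0.01, rng=np.random.default_rng()):
    # One Contrastive Divergence step; b are the visible biases, c the hidden
    # biases, and eta is the learning rate.
    h = (rng.random(len(c)) < logistic(W @ x + c)).astype(float)        # sample h from x
    x_hat = (rng.random(len(b)) < logistic(W.T @ h + b)).astype(float)  # reconstruct x
    h_hat = (rng.random(len(c)) < logistic(W @ x_hat + c)).astype(float)
    W += eta * (np.outer(h, x) - np.outer(h_hat, x_hat))  # Equation E-2 (transposed)
    b += eta * (x - x_hat)
    c += eta * (h - h_hat)
    return W, b, c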


Deep Belief Nets

Several layers of RBMs can be stacked; the hidden units of the first-layer RBM serve as the visible units for the second-layer RBM, and so on. Such an RBM stack is called a deep belief net (DBN).

Yee-Whye Teh, one of Geoffrey Hinton's students, observed that it was possible to train DBNs one layer at a time using Contrastive Divergence, starting with the lower layers and then gradually moving up to the top layers. This led to the groundbreaking article that kickstarted the Deep Learning tsunami in 2006.2

Just like RBMs, DBNs learn to reproduce the probability distribution of their inputs, without any supervision. However, they are much better at it, for the same reason that deep neural networks are more powerful than shallow ones: real-world data is often organized in hierarchical patterns, and DBNs take advantage of that. Their lower layers learn low-level features in the input data, while higher layers learn high-level features.

Just like RBMs, DBNs are fundamentally unsupervised, but you can also train them in a supervised manner by adding some visible units to represent the labels. Moreover, one great feature of DBNs is that they can be trained in a semisupervised fashion. Figure E-4 represents such a DBN configured for semisupervised learning.

Figure E-4. A deep belief network configured for semisupervised learning

First, RBM 1 is trained without supervision. It learns low-level features in the training data. Then RBM 2 is trained with RBM 1's hidden units as inputs, again without supervision: it learns higher-level features (note that RBM 2's hidden units include only the three rightmost units, not the label units). Several more RBMs could be stacked this way, but you get the idea. So far, training was 100% unsupervised. Lastly, RBM 3 is trained using both RBM 2's hidden units as inputs, as well as extra visible units used to represent the target labels (e.g., a one-hot vector representing the instance class). It learns to associate high-level features with training labels. This is the supervised step.
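The following self-contained sketch illustrates this greedy layer-wise procedure on toy data (it is only an illustration: it uses mean-field probabilities instead of samples and drops the bias terms for brevity):

import numpy as np

rng = np.random.default_rng(42)

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_rbm(X, n_hidden, n_epochs=5, lr=0.01):
    # Minimal CD-1 trainer; returns the weights and the hidden codes of X.
    W = rng.normal(scale=0.01, size=(n_hidden, X.shape[1]))
    for _ in range(n_epochs):
        for x in X:
            h = logistic(W @ x)
            x_hat = logistic(W.T @ h)
            h_hat = logistic(W @ x_hat)
            W += lr * (np.outer(h, x) - np.outer(h_hat, x_hat))
    return W, logistic(X @ W.T)

# Greedy layer-wise training: each RBM learns from the codes of the one below.
X = rng.integers(0, 2, size=(100, 36)).astype(float)  # toy binary "images"
labels = np.eye(10)[rng.integers(0, 10, size=100)]    # one-hot label units
W1, H1 = train_rbm(X, n_hidden=20)                    # unsupervised (RBM 1)
W2, H2 = train_rbm(H1, n_hidden=10)                   # unsupervised (RBM 2)
W3, _ = train_rbm(np.c_[H2, labels], n_hidden=10)     # supervised step (RBM 3)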


At the end of training, if you feed RBM 1 a new instance, the signal will propagate up to RBM 2, then up to the top of RBM 3, and then back down to the label units; hopefully, the appropriate label will light up. This is how a DBN can be used for classification.

One great benefit of this semisupervised approach is that you don't need much labeled training data. If the unsupervised RBMs do a good enough job, then only a small amount of labeled training instances per class will be necessary. Similarly, a baby learns to recognize objects without supervision, so when you point to a chair and say "chair," the baby can associate the word "chair" with the class of objects it has already learned to recognize on its own. You don't need to point to every single chair and say "chair"; only a few examples will suffice (just enough so the baby can be sure that you are indeed referring to the chair, not to its color or one of the chair's parts).

Quite amazingly, DBNs can also work in reverse. If you activate one of the label units, the signal will propagate up to the hidden units of RBM 3, then down to RBM 2, and then RBM 1, and a new instance will be output by the visible units of RBM 1. This new instance will usually look like a regular instance of the class whose label unit you activated. This generative capability of DBNs is quite powerful. For example, it has been used to automatically generate captions for images, and vice versa: first a DBN is trained (without supervision) to learn features in images, and another DBN is trained (again without supervision) to learn features in sets of captions (e.g., "car" often comes with "automobile"). Then an RBM is stacked on top of both DBNs and trained with a set of images along with their captions; it learns to associate high-level features in images with high-level features in captions. Next, if you feed the image DBN an image of a car, the signal will propagate through the network, up to the top-level RBM, and back down to the bottom of the caption DBN, producing a caption. Due to the stochastic nature of RBMs and DBNs, the caption will keep changing randomly, but it will generally be appropriate for the image. If you generate a few hundred captions, the most frequently generated ones will likely be a good description of the image.3


Self-Organizing Maps

Self-organizing maps (SOMs) are quite different from all the other types of neural networks we have discussed so far. They are used to produce a low-dimensional representation of a high-dimensional dataset, generally for visualization, clustering, or classification. The neurons are spread across a map (typically 2D for visualization, but it can be any number of dimensions you want), as shown in Figure E-5, and each neuron has a weighted connection to every input (note that the diagram shows just two inputs, but there are typically a very large number, since the whole point of SOMs is to reduce dimensionality).

FigureE-5.Self-organizingmaps

Once the network is trained, you can feed it a new instance and this will activate only one neuron (i.e., one point on the map): the neuron whose weight vector is closest to the input vector. In general, instances that are nearby in the original input space will activate neurons that are nearby on the map. This makes SOMs useful for visualization (in particular, you can easily identify clusters on the map), but also for applications like speech recognition. For example, if each instance represents the audio recording of a person pronouncing a vowel, then different pronunciations of the vowel "a" will activate neurons in the same area of the map, while instances of the vowel "e" will activate neurons in another area, and intermediate sounds will generally activate intermediate neurons on the map.


NOTE
One important difference with the other dimensionality reduction techniques discussed in Chapter 8 is that all instances get mapped to a discrete number of points in the low-dimensional space (one point per neuron). When there are very few neurons, this technique is better described as clustering rather than dimensionality reduction.

The training algorithm is unsupervised. It works by having all the neurons compete against each other. First, all the weights are initialized randomly. Then a training instance is picked randomly and fed to the network. All neurons compute the distance between their weight vector and the input vector (this is very different from the artificial neurons we have seen so far). The neuron that measures the smallest distance wins and tweaks its weight vector to be slightly closer to the input vector, making it more likely to win future competitions for other inputs similar to this one. It also recruits its neighboring neurons, and they too update their weight vectors to be slightly closer to the input vector (but they don't update their weights as much as the winner neuron). Then the algorithm picks another training instance and repeats the process, again and again. This algorithm tends to make nearby neurons gradually specialize in similar inputs.4
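Here is a minimal NumPy sketch of this competitive training loop (an illustration under simplified assumptions: a fixed Gaussian neighborhood and a linearly decaying update size):

import numpy as np

def train_som(X, map_shape=(10, 10), n_iterations=5000,
              learning_rate=0.5, radius=2.0):
    # Neurons are laid out on a 2D grid; each one holds a weight vector of
    # the same dimension as the inputs.
    rng = np.random.default_rng(42)
    rows, cols = map_shape
    weights = rng.random((rows, cols, X.shape[1]))
    grid = np.dstack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"))
    for t in range(n_iterations):
        x = X[rng.integers(len(X))]                  # pick a random training instance
        dists = np.linalg.norm(weights - x, axis=2)  # every neuron measures its distance
        winner = np.unravel_index(dists.argmin(), dists.shape)
        # The winner moves closest to x; its neighbors on the map move less.
        grid_dist = np.linalg.norm(grid - np.array(winner), axis=2)
        influence = np.exp(-grid_dist**2 / (2 * radius**2))
        decay = 1.0 - t / n_iterations               # shrink the updates over time
        weights += decay * learning_rate * influence[..., None] * (x - weights)
    return weights

X = np.random.default_rng(0).random((200, 3))        # e.g., 200 random RGB colors
som_weights = train_som(X)                           # nearby neurons end up with similar colors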

1. "On Contrastive Divergence Learning," M. Á. Carreira-Perpiñán and G. Hinton (2005).

2. "A Fast Learning Algorithm for Deep Belief Nets," G. Hinton, S. Osindero, and Y. Teh (2006).

3. See this video by Geoffrey Hinton for more details and a demo: http://goo.gl/7Z5QiS.

4. You can imagine a class of young children with roughly similar skills. One child happens to be slightly better at basketball. This motivates her to practice more, especially with her friends. After a while, this group of friends gets so good at basketball that other kids cannot compete. But that's okay, because the other kids specialize in other topics. After a while, the class is full of little specialized groups.


Index

Symbols

__call__(), Static Unrolling Through Time
ε-greedy policy, Exploration Policies, Learning to Play Ms. Pac-Man Using Deep Q-Learning
ε-insensitive, SVM Regression
χ² test (see chi square test)
ℓ0 norm, Select a Performance Measure
ℓ1 and ℓ2 regularization, ℓ1 and ℓ2 Regularization-ℓ1 and ℓ2 Regularization
ℓ1 norm, Select a Performance Measure, Lasso Regression, Decision Boundaries, Adam Optimization, Avoiding Overfitting Through Regularization
ℓ2 norm, Select a Performance Measure, Ridge Regression-Lasso Regression, Decision Boundaries, Softmax Regression, Avoiding Overfitting Through Regularization, Max-Norm Regularization
ℓk norm, Select a Performance Measure
ℓ∞ norm, Select a Performance Measure

A

accuracy, What Is Machine Learning?, Measuring Accuracy Using Cross-Validation-Measuring Accuracy Using Cross-Validation
actions, evaluating, Evaluating Actions: The Credit Assignment Problem-Evaluating Actions: The Credit Assignment Problem
activation functions, Multi-Layer Perceptron and Backpropagation-Multi-Layer Perceptron and Backpropagation
active constraints, SVM Dual Problem
actors, Learning to Play Ms. Pac-Man Using Deep Q-Learning
actual class, Confusion Matrix
AdaBoost, AdaBoost-AdaBoost
Adagrad, AdaGrad-AdaGrad
Adam optimization, Adam Optimization-Adam Optimization, Adam Optimization
adaptive learning rate, AdaGrad
adaptive moment optimization, Adam Optimization
agents, Learning to Optimize Rewards
AlexNet architecture, AlexNet-AlexNet
algorithms
  preparing data for, Prepare the Data for Machine Learning Algorithms-Select and Train a Model
AlphaGo, Reinforcement Learning, Introduction to Artificial Neural Networks, Reinforcement Learning, Policy Gradients
Anaconda, Create the Workspace
anomaly detection, Unsupervised learning
Apple's Siri, Introduction to Artificial Neural Networks
apply_gradients(), Gradient Clipping, Policy Gradients
area under the curve (AUC), The ROC Curve
array_split(), Incremental PCA
artificial neural networks (ANNs), Introduction to Artificial Neural Networks-Exercises
  Boltzmann Machines, Boltzmann Machines-Boltzmann Machines
  deep belief networks (DBNs), Deep Belief Nets-Deep Belief Nets
  evolution of, From Biological to Artificial Neurons
  Hopfield Networks, Hopfield Networks-Hopfield Networks
  hyperparameter fine-tuning, Fine-Tuning Neural Network Hyperparameters-Activation Functions
  overview, Introduction to Artificial Neural Networks-From Biological to Artificial Neurons
  Perceptrons, The Perceptron-Multi-Layer Perceptron and Backpropagation
  self-organizing maps, Self-Organizing Maps-Self-Organizing Maps
  training a DNN with TensorFlow, Training a DNN Using Plain TensorFlow-Using the Neural Network
artificial neuron, Logical Computations with Neurons
  (see also artificial neural network (ANN))
assign(), Manually Computing the Gradients
association rule learning, Unsupervised learning
associative memory networks, Hopfield Networks
assumptions, checking, Check the Assumptions
asynchronous updates, Asynchronous updates-Asynchronous updates
asynchronous communication, Asynchronous Communication Using TensorFlow Queues-PaddingFifoQueue
atrous_conv2d(), ResNet
attention mechanism, An Encoder–Decoder Network for Machine Translation
attributes, Supervised learning, Take a Quick Look at the Data Structure-Take a Quick Look at the Data Structure
  (see also data structure)
  combinations of, Experimenting with Attribute Combinations-Experimenting with Attribute Combinations
  preprocessed, Take a Quick Look at the Data Structure
  target, Take a Quick Look at the Data Structure
autodiff, Using autodiff-Using autodiff, Autodiff-Reverse-Mode Autodiff
  forward-mode, Forward-Mode Autodiff-Forward-Mode Autodiff
  manual differentiation, Manual Differentiation
  numerical differentiation, Numerical Differentiation
  reverse-mode, Reverse-Mode Autodiff-Reverse-Mode Autodiff
  symbolic differentiation, Symbolic Differentiation-Numerical Differentiation
autoencoders, Autoencoders-Exercises
  adversarial, Other Autoencoders
  contractive, Other Autoencoders
  denoising, Denoising Autoencoders-TensorFlow Implementation
  efficient data representations, Efficient Data Representations
  generative stochastic network (GSN), Other Autoencoders
  overcomplete, Unsupervised Pretraining Using Stacked Autoencoders
  PCA with undercomplete linear autoencoder, Performing PCA with an Undercomplete Linear Autoencoder
  reconstructions, Efficient Data Representations
  sparse, Sparse Autoencoders-TensorFlow Implementation
  stacked, Stacked Autoencoders-Unsupervised Pretraining Using Stacked Autoencoders
  stacked convolutional, Other Autoencoders
  undercomplete, Efficient Data Representations
  variational, Variational Autoencoders-Generating Digits
  visualizing features, Visualizing Features-Visualizing Features
  winner-take-all (WTA), Other Autoencoders
automatic differentiation, Up and Running with TensorFlow
autonomous driving systems, Recurrent Neural Networks
Average Absolute Deviation, Select a Performance Measure
average pooling layer, Pooling Layer
avg_pool(), Pooling Layer

B

backpropagation, Multi-Layer Perceptron and Backpropagation-Multi-Layer Perceptron and Backpropagation, Vanishing/Exploding Gradients Problems, Unsupervised Pretraining, Visualizing Features
backpropagation through time (BPTT), Training RNNs
bagging and pasting, Bagging and Pasting-Out-of-Bag Evaluation
  out-of-bag evaluation, Out-of-Bag Evaluation-Out-of-Bag Evaluation
  in Scikit-Learn, Bagging and Pasting in Scikit-Learn-Bagging and Pasting in Scikit-Learn
bandwidth saturation, Bandwidth saturation-Bandwidth saturation
BasicLSTMCell, LSTM Cell
BasicRNNCell, Distributing a Deep RNN Across Multiple GPUs-Distributing a Deep RNN Across Multiple GPUs
Batch Gradient Descent, Batch Gradient Descent-Batch Gradient Descent, Lasso Regression
batch learning, Batch learning-Batch learning
Batch Normalization, Batch Normalization-Implementing Batch Normalization with TensorFlow, ResNet
  operation summary, Batch Normalization
  with TensorFlow, Implementing Batch Normalization with TensorFlow-Implementing Batch Normalization with TensorFlow
batch(), Other convenience functions
batch_join(), Other convenience functions
batch_normalization(), Implementing Batch Normalization with TensorFlow-Implementing Batch Normalization with TensorFlow
Bellman Optimality Equation, Markov Decision Processes
between-graph replication, In-Graph Versus Between-Graph Replication
bias neurons, The Perceptron
bias term, Linear Regression
bias/variance tradeoff, Learning Curves
biases, Construction Phase
binary classifiers, Training a Binary Classifier, Logistic Regression
biological neurons, From Biological to Artificial Neurons-Biological Neurons
black box models, Making Predictions
blending, Stacking-Exercises
Boltzmann Machines, Boltzmann Machines-Boltzmann Machines
  (see also restricted Boltzmann machines (RBMs))
boosting, Boosting-Gradient Boosting
  AdaBoost, AdaBoost-AdaBoost
  Gradient Boosting, Gradient Boosting-Gradient Boosting
bootstrap aggregation (see bagging)
bootstrapping, Grid Search, Bagging and Pasting, Introduction to OpenAI Gym, Learning to Play Ms. Pac-Man Using Deep Q-Learning
bottleneck layers, GoogLeNet
brew, Stacking

C

Caffe model zoo, Model Zoos
CART (Classification and Regression Tree) algorithm, Making Predictions-The CART Training Algorithm, Regression
categorical attributes, Handling Text and Categorical Attributes-Handling Text and Categorical Attributes
cell wrapper, Training to Predict Time Series
chi square test, Regularization Hyperparameters
classification versus regression, Supervised learning, Multioutput Classification
classifiers
  binary, Training a Binary Classifier
  error analysis, Error Analysis-Error Analysis
  evaluating, Multiclass Classification
  MNIST dataset, MNIST-MNIST
  multiclass, Multiclass Classification-Multiclass Classification
  multilabel, Multilabel Classification-Multilabel Classification
  multioutput, Multioutput Classification-Multioutput Classification
  performance measures, Performance Measures-The ROC Curve
  precision of, Confusion Matrix
  voting, Voting Classifiers-Voting Classifiers
clip_by_value(), Gradient Clipping
closed-form equation, Training Models, Ridge Regression, Training and Cost Function
cluster specification, Multiple Devices Across Multiple Servers
clustering algorithms, Unsupervised learning
clusters, Multiple Devices Across Multiple Servers
coding space, Variational Autoencoders
codings, Autoencoders
complementary slackness condition, SVM Dual Problem
components_, Using Scikit-Learn
computational complexity, Computational Complexity, Computational Complexity, Computational Complexity
compute_gradients(), Gradient Clipping, Policy Gradients
concat(), GoogLeNet
config.gpu_options, Managing the GPU RAM
ConfigProto, Managing the GPU RAM
confusion matrix, Confusion Matrix-Confusion Matrix, Error Analysis-Error Analysis
connectionism, The Perceptron
constrained optimization, Training Objective, SVM Dual Problem
Contrastive Divergence, Restricted Boltzmann Machines
control dependencies, Control Dependencies
conv1d(), ResNet
conv2d_transpose(), ResNet
conv3d(), ResNet
convergence rate, Batch Gradient Descent
convex function, Gradient Descent
convolution kernels, Filters, CNN Architectures, GoogLeNet
convolutional neural networks (CNNs), Convolutional Neural Networks-Exercises
  architectures, CNN Architectures-ResNet
    AlexNet, AlexNet-AlexNet
    GoogLeNet, GoogLeNet-GoogLeNet
    LeNet-5, LeNet-5-LeNet-5
    ResNet, ResNet-ResNet
  convolutional layer, Convolutional Layer-Memory Requirements, GoogLeNet, ResNet
    feature maps, Stacking Multiple Feature Maps-TensorFlow Implementation
    filters, Filters
    memory requirement, Memory Requirements-Memory Requirements
  evolution of, The Architecture of the Visual Cortex
  pooling layer, Pooling Layer-Pooling Layer
  TensorFlow implementation, TensorFlow Implementation-TensorFlow Implementation
Coordinator class, Multithreaded readers using a Coordinator and a QueueRunner-Multithreaded readers using a Coordinator and a QueueRunner
correlation coefficient, Looking for Correlations-Looking for Correlations
correlations, finding, Looking for Correlations-Looking for Correlations
cost function, Model-based learning, Select a Performance Measure
  in AdaBoost, AdaBoost
  in AdaGrad, AdaGrad
  in artificial neural networks, Training an MLP with TensorFlow's High-Level API, Construction Phase-Construction Phase
  in autodiff, Using autodiff
  in batch normalization, Implementing Batch Normalization with TensorFlow
  cross entropy, LeNet-5
  deep Q-Learning, Learning to Play Ms. Pac-Man Using Deep Q-Learning
  in Elastic Net, Elastic Net
  in Gradient Descent, Training Models, Gradient Descent-Gradient Descent, Batch Gradient Descent, Batch Gradient Descent-Stochastic Gradient Descent, Gradient Boosting, Vanishing/Exploding Gradients Problems
  in Logistic Regression, Training and Cost Function-Training and Cost Function
  in PG algorithms, Policy Gradients
  in variational autoencoders, Variational Autoencoders
  in Lasso Regression, Lasso Regression-Lasso Regression
  in Linear Regression, The Normal Equation, Gradient Descent
  in Momentum optimization, Momentum Optimization-Nesterov Accelerated Gradient
  in pretrained layers reuse, Pretraining on an Auxiliary Task
  in ridge regression, Ridge Regression-Ridge Regression
  in RNNs, Training RNNs, Training to Predict Time Series
  stale gradients and, Asynchronous updates
creative sequences, Creative RNN
credit assignment problem, Evaluating Actions: The Credit Assignment Problem-Evaluating Actions: The Credit Assignment Problem
critics, Learning to Play Ms. Pac-Man Using Deep Q-Learning
cross entropy, Softmax Regression-Softmax Regression, Training an MLP with TensorFlow's High-Level API, TensorFlow Implementation, Policy Gradients
cross-validation, Testing and Validating, Better Evaluation Using Cross-Validation-Better Evaluation Using Cross-Validation, Measuring Accuracy Using Cross-Validation-Measuring Accuracy Using Cross-Validation
CUDA library, Installation
cuDNN library, Installation
curse of dimensionality, Dimensionality Reduction-The Curse of Dimensionality
  (see also dimensionality reduction)
custom transformers, Custom Transformers-Custom Transformers

D

data, Testing and Validating
  (see also test data; training data)
  creating workspace for, Get the Data-Download the Data
  downloading, Download the Data-Download the Data
  finding correlations in, Looking for Correlations-Looking for Correlations
  making assumptions about, Testing and Validating
  preparing for Machine Learning algorithms, Prepare the Data for Machine Learning Algorithms-Select and Train a Model
  test-set creation, Create a Test Set-Create a Test Set
  working with real data, Working with Real Data
data augmentation, Data Augmentation-Data Augmentation
data cleaning, Data Cleaning-Handling Text and Categorical Attributes
data mining, Why Use Machine Learning?
data parallelism, Data Parallelism-TensorFlow implementation
  asynchronous updates, Asynchronous updates-Asynchronous updates
  bandwidth saturation, Bandwidth saturation-Bandwidth saturation
  synchronous updates, Synchronous updates
  TensorFlow implementation, TensorFlow implementation
data pipeline, Frame the Problem
data snooping bias, Create a Test Set
data structure, Take a Quick Look at the Data Structure-Take a Quick Look at the Data Structure
data visualization, Visualizing Geographical Data-Visualizing Geographical Data
DataFrame, Data Cleaning
dataquest, Other Resources
decision boundaries, Decision Boundaries-Decision Boundaries, Softmax Regression, Making Predictions
decision function, Precision/Recall Tradeoff, Decision Function and Predictions-Decision Function and Predictions
Decision Stumps, AdaBoost
decision threshold, Precision/Recall Tradeoff
Decision Trees, Training and Evaluating on the Training Set-Better Evaluation Using Cross-Validation, Decision Trees-Exercises, Ensemble Learning and Random Forests
  binary trees, Making Predictions
  class probability estimates, Estimating Class Probabilities
  computational complexity, Computational Complexity
  decision boundaries, Making Predictions
  Gini impurity, Gini Impurity or Entropy?
  instability with, Instability-Instability
  numbers of children, Making Predictions
  predictions, Making Predictions-Estimating Class Probabilities
  Random Forests (see Random Forests)
  regression tasks, Regression-Regression
  regularization hyperparameters, Regularization Hyperparameters-Regularization Hyperparameters
  training and visualizing, Training and Visualizing a Decision Tree-Making Predictions
decoder, Efficient Data Representations
deconvolutional layer, ResNet
deep autoencoders (see stacked autoencoders)
deep belief networks (DBNs), Semisupervised learning, Deep Belief Nets-Deep Belief Nets
Deep Learning, Reinforcement Learning
  (see also Reinforcement Learning; TensorFlow)
  about, The Machine Learning Tsunami, Roadmap
  libraries, Up and Running with TensorFlow-Up and Running with TensorFlow
deep neural networks (DNNs), Multi-Layer Perceptron and Backpropagation, Training Deep Neural Nets-Exercises
  (see also Multi-Layer Perceptrons (MLP))
  faster optimizers for, Faster Optimizers-Learning Rate Scheduling
  regularization, Avoiding Overfitting Through Regularization-Data Augmentation
  reusing pretrained layers, Reusing Pretrained Layers-Pretraining on an Auxiliary Task
  training guidelines overview, Practical Guidelines
  training with TensorFlow, Training a DNN Using Plain TensorFlow-Using the Neural Network
  training with TF.Learn, Training an MLP with TensorFlow's High-Level API
  unstable gradients, Vanishing/Exploding Gradients Problems
  vanishing and exploding gradients, Training Deep Neural Nets-Gradient Clipping
Deep Q-Learning, Approximate Q-Learning-Learning to Play Ms. Pac-Man Using Deep Q-Learning
  Ms. Pac-Man example, Learning to Play Ms. Pac-Man Using Deep Q-Learning-Learning to Play Ms. Pac-Man Using Deep Q-Learning
deep Q-network, Approximate Q-Learning
deep RNNs, Deep RNNs-The Difficulty of Training over Many Time Steps
  applying dropout, Applying Dropout
  distributing across multiple GPUs, Distributing a Deep RNN Across Multiple GPUs
  long sequence difficulties, The Difficulty of Training over Many Time Steps
  truncated backpropagation through time, The Difficulty of Training over Many Time Steps
DeepMind, Reinforcement Learning, Introduction to Artificial Neural Networks, Reinforcement Learning, Approximate Q-Learning
degrees of freedom, Overfitting the Training Data, Learning Curves
denoising autoencoders, Denoising Autoencoders-TensorFlow Implementation
dense(), Construction Phase, Tying Weights
depth concat layer, GoogLeNet
depth radius, AlexNet
depthwise_conv2d(), ResNet
dequeue(), Queues of tuples
dequeue_many(), Queues of tuples, PaddingFifoQueue
dequeue_up_to(), Closing a queue-PaddingFifoQueue
dequeuing data, Dequeuing data
describe(), Take a Quick Look at the Data Structure
device blocks, Sharding Variables Across Multiple Parameter Servers
device(), Simple placement
dimensionality reduction, Unsupervised learning, Dimensionality Reduction-Exercises, Autoencoders
  approaches to
    Manifold Learning, Manifold Learning
    projection, Projection-Projection
  choosing the right number of dimensions, Choosing the Right Number of Dimensions
  curse of dimensionality, Dimensionality Reduction-The Curse of Dimensionality
  and data visualization, Dimensionality Reduction
  Isomap, Other Dimensionality Reduction Techniques
  LLE (Locally Linear Embedding), LLE-LLE
  Multidimensional Scaling, Other Dimensionality Reduction Techniques-Other Dimensionality Reduction Techniques
  PCA (Principal Component Analysis), PCA-Randomized PCA
  t-Distributed Stochastic Neighbor Embedding (t-SNE), Other Dimensionality Reduction Techniques
discount rate, Evaluating Actions: The Credit Assignment Problem
distributed computing, Up and Running with TensorFlow
distributed sessions, Sharing State Across Sessions Using Resource Containers-Sharing State Across Sessions Using Resource Containers
DNNClassifier, Training an MLP with TensorFlow's High-Level API
drop(), Prepare the Data for Machine Learning Algorithms
dropconnect, Dropout
dropna(), Data Cleaning
dropout, Number of Neurons per Hidden Layer, Applying Dropout
dropout rate, Dropout
dropout(), Dropout
DropoutWrapper, Applying Dropout
DRY (Don't Repeat Yourself), Modularity
Dual Averaging, Adam Optimization
dual numbers, Forward-Mode Autodiff
dual problem, The Dual Problem
duality, SVM Dual Problem
dying ReLUs, Nonsaturating Activation Functions
dynamic placements, Dynamic placement function
dynamic placer, Placing Operations on Devices
Dynamic Programming, Markov Decision Processes
dynamic unrolling through time, Dynamic Unrolling Through Time
dynamic_rnn(), Dynamic Unrolling Through Time, Distributing a Deep RNN Across Multiple GPUs, An Encoder–Decoder Network for Machine Translation

E

early stopping, Early Stopping-Early Stopping, Gradient Boosting, Number of Neurons per Hidden Layer, Early Stopping
Elastic Net, Elastic Net
embedded device blocks, Sharding Variables Across Multiple Parameter Servers
Embedded Reber grammars, Exercises
embeddings, Word Embeddings-Word Embeddings
embedding_lookup(), Word Embeddings
encoder, Efficient Data Representations
Encoder–Decoder, Input and Output Sequences
end-of-sequence (EOS) token, Handling Variable-Length Output Sequences
energy functions, Hopfield Networks
enqueuing data, Enqueuing data
Ensemble Learning, Better Evaluation Using Cross-Validation, Ensemble Methods, Ensemble Learning and Random Forests-Exercises
  bagging and pasting, Bagging and Pasting-Out-of-Bag Evaluation
  boosting, Boosting-Gradient Boosting
  in-graph versus between-graph replication, In-Graph Versus Between-Graph Replication-In-Graph Versus Between-Graph Replication
  Random Forests, Random Forests-Feature Importance
  (see also Random Forests)
  random patches and random subspaces, Random Patches and Random Subspaces
  stacking, Stacking-Stacking
entropy impurity measure, Gini Impurity or Entropy?
environments, in reinforcement learning, Learning to Optimize Rewards-Evaluating Actions: The Credit Assignment Problem, Exploration Policies, Learning to Play Ms. Pac-Man Using Deep Q-Learning
episodes (in RL), Introduction to OpenAI Gym, Evaluating Actions: The Credit Assignment Problem-Policy Gradients, Policy Gradients-Policy Gradients, Learning to Play Ms. Pac-Man Using Deep Q-Learning
epochs, Stochastic Gradient Descent
ε-insensitive, SVM Regression
equality constraints, SVM Dual Problem
error analysis, Error Analysis-Error Analysis
estimators, Data Cleaning
Euclidean norm, Select a Performance Measure
eval(), Feeding Data to the Training Algorithm
evaluating models, Testing and Validating-Testing and Validating
explained variance, Choosing the Right Number of Dimensions
explained variance ratio, Explained Variance Ratio
exploding gradients, Vanishing/Exploding Gradients Problems
  (see also gradients, vanishing and exploding)
exploration policies, Exploration Policies
exponential decay, Implementing Batch Normalization with TensorFlow
exponential linear unit (ELU), Nonsaturating Activation Functions-Nonsaturating Activation Functions
exponential scheduling, Learning Rate Scheduling
Extra-Trees, Extra-Trees

F

F1 score, Precision and Recall-Precision and Recall
face-recognition, Multilabel Classification
fake X server, Introduction to OpenAI Gym
false positive rate (FPR), The ROC Curve-The ROC Curve
fan-in, Xavier and He Initialization, Xavier and He Initialization
fan-out, Xavier and He Initialization, Xavier and He Initialization
feature detection, Autoencoders
feature engineering, Irrelevant Features
feature extraction, Unsupervised learning
feature importance, Feature Importance-Feature Importance
feature maps, Selecting a Kernel and Tuning Hyperparameters, Filters-TensorFlow Implementation, ResNet
feature scaling, Feature Scaling
feature selection, Irrelevant Features, Grid Search, Lasso Regression, Feature Importance, Prepare the Data
feature space, Kernel PCA, Selecting a Kernel and Tuning Hyperparameters
feature vector, Select a Performance Measure, Linear Regression, Under the Hood, Implementing Gradient Descent
features, Supervised learning
FeatureUnion, Transformation Pipelines
feedforward neural network (FNN), Multi-Layer Perceptron and Backpropagation
feed_dict, Feeding Data to the Training Algorithm
FIFOQueue, Asynchronous Communication Using TensorFlow Queues, RandomShuffleQueue
fillna(), Data Cleaning
first-in first-out (FIFO) queues, Asynchronous Communication Using TensorFlow Queues
first-order partial derivatives (Jacobians), Adam Optimization
fit(), Data Cleaning, Transformation Pipelines, Incremental PCA
fitness function, Model-based learning
fit_inverse_transform=, Selecting a Kernel and Tuning Hyperparameters
fit_transform(), Data Cleaning, Transformation Pipelines
folds, Better Evaluation Using Cross-Validation, MNIST, Measuring Accuracy Using Cross-Validation-Measuring Accuracy Using Cross-Validation
Follow The Regularized Leader (FTRL), Adam Optimization
forget gate, LSTM Cell
forward-mode autodiff, Forward-Mode Autodiff-Forward-Mode Autodiff
framing a problem, Frame the Problem-Frame the Problem
frozen layers, Freezing the Lower Layers-Caching the Frozen Layers
functools.partial(), Implementing Batch Normalization with TensorFlow, TensorFlow Implementation, Variational Autoencoders

G

game play (see reinforcement learning)
gamma value, Gaussian RBF Kernel
gate controllers, LSTM Cell
Gaussian distribution, Variational Autoencoders, Generating Digits
Gaussian RBF, Adding Similarity Features
Gaussian RBF kernel, Gaussian RBF Kernel-Gaussian RBF Kernel, Kernelized SVM
generalization error, Testing and Validating
generalized Lagrangian, SVM Dual Problem-SVM Dual Problem
generative autoencoders, Variational Autoencoders
generative models, Autoencoders, Boltzmann Machines
genetic algorithms, Policy Search
geodesic distance, Other Dimensionality Reduction Techniques
get_variable(), Sharing Variables-Sharing Variables
Gini impurity, Making Predictions, Gini Impurity or Entropy?
global average pooling, GoogLeNet
global_step, Learning to Play Ms. Pac-Man Using Deep Q-Learning
global_variables_initializer(), Creating Your First Graph and Running It in a Session
Glorot initialization, Vanishing/Exploding Gradients Problems-Xavier and He Initialization
Google, Up and Running with TensorFlow
Google Images, Introduction to Artificial Neural Networks
Google Photos, Semisupervised learning
GoogLeNet architecture, GoogLeNet-GoogLeNet
gpu_options.per_process_gpu_memory_fraction, Managing the GPU RAM
gradient ascent, Policy Search
Gradient Boosted Regression Trees (GBRT), Gradient Boosting
Gradient Boosting, Gradient Boosting-Gradient Boosting
Gradient Descent (GD), Training Models, Gradient Descent-Mini-batch Gradient Descent, Online SVMs, Training Deep Neural Nets, Momentum Optimization, AdaGrad
  algorithm comparisons, Mini-batch Gradient Descent-Mini-batch Gradient Descent
  automatically computing gradients, Using autodiff-Using autodiff
  Batch GD, Batch Gradient Descent-Batch Gradient Descent, Lasso Regression
  defining, Gradient Descent
  local minimum versus global minimum, Gradient Descent
  manually computing gradients, Manually Computing the Gradients
  Mini-batch GD, Mini-batch Gradient Descent-Mini-batch Gradient Descent, Feeding Data to the Training Algorithm-Feeding Data to the Training Algorithm
  optimizer, Using an Optimizer
  Stochastic GD, Stochastic Gradient Descent-Stochastic Gradient Descent, Soft Margin Classification
  with TensorFlow, Implementing Gradient Descent-Using an Optimizer
Gradient Tree Boosting, Gradient Boosting
GradientDescentOptimizer, Construction Phase
gradients(), Using autodiff
gradients, vanishing and exploding, Training Deep Neural Nets-Gradient Clipping, The Difficulty of Training over Many Time Steps
  Batch Normalization, Batch Normalization-Implementing Batch Normalization with TensorFlow
  Glorot and He initialization, Vanishing/Exploding Gradients Problems-Xavier and He Initialization
  gradient clipping, Gradient Clipping
  nonsaturating activation functions, Nonsaturating Activation Functions-Nonsaturating Activation Functions
graphviz, Training and Visualizing a Decision Tree
greedy algorithm, The CART Training Algorithm
grid search, Fine-Tune Your Model-Grid Search, Polynomial Kernel
group(), Learning to Play Ms. Pac-Man Using Deep Q-Learning
GRU (Gated Recurrent Unit) cell, GRU Cell-GRU Cell

H

hailstone sequence, Efficient Data Representations
hard margin classification, Soft Margin Classification-Soft Margin Classification
hard voting classifiers, Voting Classifiers-Voting Classifiers
harmonic mean, Precision and Recall
He initialization, Vanishing/Exploding Gradients Problems-Xavier and He Initialization
Heaviside step function, The Perceptron
Hebb's rule, The Perceptron, Hopfield Networks
Hebbian learning, The Perceptron
hidden layers, Multi-Layer Perceptron and Backpropagation
hierarchical clustering, Unsupervised learning
hinge loss function, Online SVMs
histograms, Take a Quick Look at the Data Structure-Take a Quick Look at the Data Structure
hold-out sets, Stacking
  (see also blenders)
Hopfield Networks, Hopfield Networks-Hopfield Networks
hyperbolic tangent (tanh activation function), Multi-Layer Perceptron and Backpropagation, Activation Functions, Vanishing/Exploding Gradients Problems, Xavier and He Initialization, Recurrent Neurons
hyperparameters, Overfitting the Training Data, Custom Transformers, Grid Search-Grid Search, Evaluate Your System on the Test Set, Gradient Descent, Polynomial Kernel, Computational Complexity, Fine-Tuning Neural Network Hyperparameters
  (see also neural network hyperparameters)
hyperplane, Decision Function and Predictions, Manifold Learning-PCA, Projecting Down to d Dimensions, Other Dimensionality Reduction Techniques
hypothesis, Select a Performance Measure
  manifold, Manifold Learning
hypothesis boosting (see boosting)
hypothesis function, Linear Regression
hypothesis, null, Regularization Hyperparameters

I

identity matrix, Ridge Regression, Quadratic Programming
ILSVRC ImageNet challenge, CNN Architectures
image classification, CNN Architectures
impurity measures, Making Predictions, Gini Impurity or Entropy?
in-graph replication, In-Graph Versus Between-Graph Replication
inception modules, GoogLeNet
Inception-v4, ResNet
incremental learning, Online learning, Incremental PCA
inequality constraints, SVM Dual Problem
inference, Model-based learning, Exercises, Memory Requirements, An Encoder–Decoder Network for Machine Translation
info(), Take a Quick Look at the Data Structure
information gain, Gini Impurity or Entropy?
information theory, Gini Impurity or Entropy?
init node, Saving and Restoring Models
input gate, LSTM Cell
input neurons, The Perceptron
input_keep_prob, Applying Dropout
instance-based learning, Instance-based learning, Model-based learning
InteractiveSession, Creating Your First Graph and Running It in a Session
intercept term, Linear Regression
Internal Covariate Shift problem, Batch Normalization
inter_op_parallelism_threads, Parallel Execution
intra_op_parallelism_threads, Parallel Execution
inverse_transform(), Selecting a Kernel and Tuning Hyperparameters
in_top_k(), Construction Phase
irreducible error, Learning Curves
isolated environment, Create the Workspace-Create the Workspace
Isomap, Other Dimensionality Reduction Techniques

J

jobs, Multiple Devices Across Multiple Servers
join(), Multiple Devices Across Multiple Servers, Multithreaded readers using a Coordinator and a QueueRunner
Jupyter, Create the Workspace, Create the Workspace, Take a Quick Look at the Data Structure

K

K-fold cross-validation, Better Evaluation Using Cross-Validation-Better Evaluation Using Cross-Validation, Measuring Accuracy Using Cross-Validation
k-Nearest Neighbors, Model-based learning, Multilabel Classification
Karush–Kuhn–Tucker (KKT) conditions, SVM Dual Problem
keep probability, Dropout
Keras, Up and Running with TensorFlow
Kernel PCA (kPCA), Kernel PCA-Selecting a Kernel and Tuning Hyperparameters
kernel trick, Polynomial Kernel, Gaussian RBF Kernel, The Dual Problem-Kernelized SVM, Kernel PCA
kernelized SVM, Kernelized SVM-Kernelized SVM
kernels, Polynomial Kernel-Gaussian RBF Kernel, Operations and kernels
Kullback–Leibler divergence, Softmax Regression, Sparse Autoencoders

L

l1_l2_regularizer(), ℓ1 and ℓ2 Regularization
LabelBinarizer, Transformation Pipelines
labels, Supervised learning, Frame the Problem
Lagrange function, SVM Dual Problem-SVM Dual Problem
Lagrange multiplier, SVM Dual Problem
landmarks, Adding Similarity Features-Adding Similarity Features
large margin classification, Linear SVM Classification-Linear SVM Classification
Lasso Regression, Lasso Regression-Lasso Regression
latent loss, Variational Autoencoders
latent space, Variational Autoencoders
law of large numbers, Voting Classifiers
leaky ReLU, Nonsaturating Activation Functions
learning rate, Online learning, Gradient Descent, Batch Gradient Descent-Stochastic Gradient Descent
learning rate scheduling, Stochastic Gradient Descent, Learning Rate Scheduling-Learning Rate Scheduling
LeNet-5 architecture, The Architecture of the Visual Cortex, LeNet-5-LeNet-5
Levenshtein distance, Gaussian RBF Kernel
liblinear library, Computational Complexity
libsvm library, Computational Complexity
Linear Discriminant Analysis (LDA), Other Dimensionality Reduction Techniques
linear models
  early stopping, Early Stopping-Early Stopping
  Elastic Net, Elastic Net
  Lasso Regression, Lasso Regression-Lasso Regression
  Linear Regression (see Linear Regression)
  regression (see Linear Regression)
  Ridge Regression, Ridge Regression-Ridge Regression, Elastic Net
  SVM, Linear SVM Classification-Soft Margin Classification
Linear Regression, Model-based learning, Training and Evaluating on the Training Set, Training Models-Mini-batch Gradient Descent, Elastic Net
  computational complexity, Computational Complexity
  Gradient Descent in, Gradient Descent-Mini-batch Gradient Descent
  learning curves in, Learning Curves-Learning Curves
  Normal Equation, The Normal Equation-Computational Complexity
  regularizing models (see regularization)
  using Stochastic Gradient Descent (SGD), Stochastic Gradient Descent
  with TensorFlow, Linear Regression with TensorFlow-Linear Regression with TensorFlow
linear SVM classification, Linear SVM Classification-Soft Margin Classification
linear threshold units (LTUs), The Perceptron
Lipschitz continuous, Gradient Descent
LLE (Locally Linear Embedding), LLE-LLE
load_sample_images(), TensorFlow Implementation
local receptive field, The Architecture of the Visual Cortex
local response normalization, AlexNet
local sessions, Sharing State Across Sessions Using Resource Containers
location invariance, Pooling Layer
log loss, Training and Cost Function
logging placements, Logging placements-Logging placements
logistic function, Estimating Probabilities
Logistic Regression, Supervised learning, Logistic Regression-Softmax Regression
  decision boundaries, Decision Boundaries-Decision Boundaries
  estimating probabilities, Estimating Probabilities-Estimating Probabilities
  Softmax Regression model, Softmax Regression-Softmax Regression
  training and cost function, Training and Cost Function-Training and Cost Function
log_device_placement, Logging placements
LSTM (Long Short-Term Memory) cell, LSTM Cell-GRU Cell

M

machine control (see reinforcement learning)
Machine Learning
  large-scale projects (see TensorFlow)
  notations, Select a Performance Measure-Select a Performance Measure
  process example, End-to-End Machine Learning Project-Exercises
  project checklist, Look at the Big Picture, Machine Learning Project Checklist-Launch!
  resources on, Other Resources-Other Resources
  uses for, Machine Learning in Your Projects-Machine Learning in Your Projects
Machine Learning basics
  attributes, Supervised learning
  challenges, Main Challenges of Machine Learning-Stepping Back
    algorithm problems, Overfitting the Training Data-Underfitting the Training Data
    training data problems, Poor-Quality Data
  definition, What Is Machine Learning?
  features, Supervised learning
  overview, The Machine Learning Landscape
  reasons for using, Why Use Machine Learning?-Why Use Machine Learning?
  spam filter example, What Is Machine Learning?-Why Use Machine Learning?
  summary, Stepping Back
  testing and validating, Testing and Validating-Testing and Validating
  types of systems, Types of Machine Learning Systems-Model-based learning
    batch and online learning, Batch and Online Learning-Online learning
    instance-based versus model-based learning, Instance-Based Versus Model-Based Learning-Model-based learning
    supervised/unsupervised learning, Supervised/Unsupervised Learning-Reinforcement Learning
  workflow example, Model-based learning-Model-based learning
machine translation (see natural language processing (NLP))
make(), Introduction to OpenAI Gym
Manhattan norm, Select a Performance Measure
manifold assumption/hypothesis, Manifold Learning
Manifold Learning, Manifold Learning, LLE
  (see also LLE (Locally Linear Embedding))
MapReduce, Frame the Problem
margin violations, Soft Margin Classification
Markov chains, Markov Decision Processes
Markov decision processes, Markov Decision Processes-Markov Decision Processes
master service, The Master and Worker Services
Matplotlib, Create the Workspace, Take a Quick Look at the Data Structure, The ROC Curve, Error Analysis
max margin learning, Pretraining on an Auxiliary Task
max pooling layer, Pooling Layer
max-norm regularization, Max-Norm Regularization-Max-Norm Regularization
max_norm(), Max-Norm Regularization
max_norm_regularizer(), Max-Norm Regularization
max_pool(), Pooling Layer
Mean Absolute Error (MAE), Select a Performance Measure-Select a Performance Measure
mean coding, Variational Autoencoders
Mean Square Error (MSE), Linear Regression, Manually Computing the Gradients, Sparse Autoencoders
measure of similarity, Instance-based learning
memmap, Incremental PCA
memory cells, Model Parallelism, Memory Cells
Mercer's theorem, Kernelized SVM
meta learner (see blending)
min-max scaling, Feature Scaling
Mini-batch Gradient Descent, Mini-batch Gradient Descent-Mini-batch Gradient Descent, Training and Cost Function, Feeding Data to the Training Algorithm-Feeding Data to the Training Algorithm
mini-batches, Online learning
minimize(), Gradient Clipping, Freezing the Lower Layers, Policy Gradients, Learning to Play Ms. Pac-Man Using Deep Q-Learning
min_after_dequeue, RandomShuffleQueue
MNIST dataset, MNIST-MNIST
model parallelism, Model Parallelism-Model Parallelism
model parameters, Gradient Descent, Batch Gradient Descent, Early Stopping, Under the Hood, Quadratic Programming, Creating Your First Graph and Running It in a Session, Construction Phase, Training RNNs
  defining, Model-based learning
model selection, Model-based learning
model zoos, Model Zoos
model-based learning, Model-based learning-Model-based learning
models
  analyzing, Analyze the Best Models and Their Errors-Analyze the Best Models and Their Errors
  evaluating on test set, Evaluate Your System on the Test Set-Evaluate Your System on the Test Set
moments, Adam Optimization
Momentum optimization, Momentum Optimization-Momentum Optimization
Monte Carlo tree search, Policy Gradients
Multi-Layer Perceptrons (MLP), Introduction to Artificial Neural Networks, The Perceptron-Multi-Layer Perceptron and Backpropagation, Neural Network Policies
  training with TF.Learn, Training an MLP with TensorFlow's High-Level API
multiclass classifiers, Multiclass Classification-Multiclass Classification
Multidimensional Scaling (MDS), Other Dimensionality Reduction Techniques
multilabel classifiers, Multilabel Classification-Multilabel Classification
Multinomial Logistic Regression (see Softmax Regression)
multinomial(), Neural Network Policies
multioutput classifiers, Multioutput Classification-Multioutput Classification
MultiRNNCell, Distributing a Deep RNN Across Multiple GPUs
multithreaded readers, Multithreaded readers using a Coordinator and a QueueRunner-Multithreaded readers using a Coordinator and a QueueRunner
multivariate regression, Frame the Problem

N

naiveBayesclassifiers,MulticlassClassification

namescopes,NameScopes

Page 663: index-of.esindex-of.es/Varios-2/Hands on Machine Learning with... · Revision History for the First Edition 2017-03-10: First Release 2017-06-09: Second Release See for release

naturallanguageprocessing(NLP),RecurrentNeuralNetworks,NaturalLanguageProcessing-AnEncoder–DecoderNetworkforMachineTranslation

encoder-decodernetworkformachinetranslation,AnEncoder–DecoderNetworkforMachineTranslation-AnEncoder–DecoderNetworkforMachineTranslation

TensorFlowtutorials,NaturalLanguageProcessing,AnEncoder–DecoderNetworkforMachineTranslation

wordembeddings,WordEmbeddings-WordEmbeddings

NesterovAcceleratedGradient(NAG),NesterovAcceleratedGradient-NesterovAcceleratedGradient

Nesterovmomentumoptimization,NesterovAcceleratedGradient-NesterovAcceleratedGradient

networktopology,Fine-TuningNeuralNetworkHyperparameters

neuralnetworkhyperparameters,Fine-TuningNeuralNetworkHyperparameters-ActivationFunctions

activationfunctions,ActivationFunctions

neuronsperhiddenlayer,NumberofNeuronsperHiddenLayer

numberofhiddenlayers,NumberofHiddenLayers-NumberofHiddenLayers

neuralnetworkpolicies,NeuralNetworkPolicies-NeuralNetworkPolicies

neurons

biological,FromBiologicaltoArtificialNeurons-BiologicalNeurons

logicalcomputationswith,LogicalComputationswithNeurons

neuron_layer(),ConstructionPhase

next_batch(),ExecutionPhase

NoFreeLunchtheorem,TestingandValidating

nodeedges,VisualizingtheGraphandTrainingCurvesUsingTensorBoard

nonlineardimensionalityreduction(NLDR),LLE

Page 664: index-of.esindex-of.es/Varios-2/Hands on Machine Learning with... · Revision History for the First Edition 2017-03-10: First Release 2017-06-09: Second Release See for release

(see also Kernel PCA; LLE (Locally Linear Embedding))

nonlinear SVM classification, Nonlinear SVM Classification-Computational Complexity

computational complexity, Computational Complexity

Gaussian RBF kernel, Gaussian RBF Kernel-Gaussian RBF Kernel

with polynomial features, Nonlinear SVM Classification-Polynomial Kernel

polynomial kernel, Polynomial Kernel-Polynomial Kernel

similarity features, adding, Adding Similarity Features-Adding Similarity Features

nonparametric models, Regularization Hyperparameters

nonresponse bias, Nonrepresentative Training Data

nonsaturating activation functions, Nonsaturating Activation Functions-Nonsaturating Activation Functions

Normal Equation, The Normal Equation-Computational Complexity

normalization, Feature Scaling

normalized exponential, Softmax Regression

norms, Select a Performance Measure

notations, Select a Performance Measure-Select a Performance Measure

NP-Complete problems, The CART Training Algorithm

null hypothesis, Regularization Hyperparameters

numerical differentiation, Numerical Differentiation

NumPy, Create the Workspace

NumPy arrays, Handling Text and Categorical Attributes

NVidia Compute Capability, Installation

nvidia-smi, Managing the GPU RAM

n_components, Choosing the Right Number of Dimensions

O

observation space, Neural Network Policies

off-policy algorithm, Temporal Difference Learning and Q-Learning

offline learning, Batch learning

one-hot encoding, Handling Text and Categorical Attributes

one-versus-all (OvA) strategy, Multiclass Classification, Softmax Regression, Exercises

one-versus-one (OvO) strategy, Multiclass Classification

online learning, Online learning-Online learning

online SVMs, Online SVMs-Online SVMs

OpenAI Gym, Introduction to OpenAI Gym-Introduction to OpenAI Gym

operation_timeout_in_ms, In-Graph Versus Between-Graph Replication

Optical Character Recognition (OCR), The Machine Learning Landscape

optimal state value, Markov Decision Processes

optimizers, Faster Optimizers-Learning Rate Scheduling

AdaGrad, AdaGrad-AdaGrad

Adam optimization, Adam Optimization-Adam Optimization, Adam Optimization

Gradient Descent (see Gradient Descent optimizer)

learning rate scheduling, Learning Rate Scheduling-Learning Rate Scheduling

Momentum optimization, Momentum Optimization-Momentum Optimization

Nesterov Accelerated Gradient (NAG), Nesterov Accelerated Gradient-Nesterov Accelerated Gradient

RMSProp, RMSProp

out-of-bag evaluation, Out-of-Bag Evaluation-Out-of-Bag Evaluation

out-of-core learning, Online learning

out-of-memory (OOM) errors, Static Unrolling Through Time

out-of-sample error, Testing and Validating

OutOfRangeError, Reading the training data directly from the graph, Multithreaded readers using a Coordinator and a QueueRunner

output gate, LSTM Cell

output layer, Multi-Layer Perceptron and Backpropagation

OutputProjectionWrapper, Training to Predict Time Series-Training to Predict Time Series

output_keep_prob, Applying Dropout

overcomplete autoencoder, Unsupervised Pretraining Using Stacked Autoencoders

overfitting, Overfitting the Training Data-Overfitting the Training Data, Create a Test Set, Soft Margin Classification, Gaussian RBF Kernel, Regularization Hyperparameters, Regression, Number of Neurons per Hidden Layer

avoiding through regularization, Avoiding Overfitting Through Regularization-Data Augmentation

P

p-value, Regularization Hyperparameters

PaddingFIFOQueue, PaddingFifoQueue

Pandas, Create the Workspace, Download the Data

scatter_matrix, Looking for Correlations-Looking for Correlations

parallel distributed computing, Distributing TensorFlow Across Devices and Servers-Exercises

data parallelism, Data Parallelism-TensorFlow implementation

in-graph versus between-graph replication, In-Graph Versus Between-Graph Replication-Model Parallelism

model parallelism, Model Parallelism-Model Parallelism

multiple devices across multiple servers, Multiple Devices Across Multiple Servers-Other convenience functions

asynchronous communication using queues, Asynchronous Communication Using TensorFlow Queues-PaddingFifoQueue

loading training data, Loading Data Directly from the Graph-Other convenience functions

master and worker services, The Master and Worker Services

opening a session, Opening a Session

pinning operations across tasks, Pinning Operations Across Tasks

sharding variables, Sharding Variables Across Multiple Parameter Servers

sharing state across sessions, Sharing State Across Sessions Using Resource Containers-Sharing State Across Sessions Using Resource Containers

multiple devices on a single machine, Multiple Devices on a Single Machine-Control Dependencies

control dependencies, Control Dependencies

installation, Installation-Installation

managing the GPU RAM, Managing the GPU RAM-Managing the GPU RAM

parallel execution, Parallel Execution-Parallel Execution

placing operations on devices, Placing Operations on Devices-Soft placement

one neural network per device, One Neural Network per Device-One Neural Network per Device

parameter efficiency, Number of Hidden Layers

parameter matrix, Softmax Regression

parameter server (ps), Multiple Devices Across Multiple Servers

parameter space, Gradient Descent

parameter vector, Linear Regression, Gradient Descent, Training and Cost Function, Softmax Regression

parametric models, Regularization Hyperparameters

partial derivative, Batch Gradient Descent

partial_fit(), Incremental PCA

Pearson's r, Looking for Correlations

peephole connections, Peephole Connections

penalties (see rewards, in RL)

percentiles, Take a Quick Look at the Data Structure

Perceptron convergence theorem, The Perceptron

Perceptrons, The Perceptron-Multi-Layer Perceptron and Backpropagation

versus Logistic Regression, The Perceptron

training, The Perceptron-The Perceptron

performance measures, Select a Performance Measure-Select a Performance Measure

confusion matrix, Confusion Matrix-Confusion Matrix

cross-validation, Measuring Accuracy Using Cross-Validation-Measuring Accuracy Using Cross-Validation

precision and recall, Precision and Recall-Precision/Recall Tradeoff

ROC (receiver operating characteristic) curve, The ROC Curve-The ROC Curve

performance scheduling, Learning Rate Scheduling

permutation(), Create a Test Set

PG algorithms, Policy Gradients

photo-hosting services, Semisupervised learning

pinning operations, Pinning Operations Across Tasks

pip, Create the Workspace

Pipeline constructor, Transformation Pipelines-Select and Train a Model

pipelines, Frame the Problem

placeholder nodes, Feeding Data to the Training Algorithm

placers (see simple placer; dynamic placer)

policy, Policy Search

policy gradients, Policy Search (see PG algorithms)

policy space, Policy Search

polynomial features, adding, Nonlinear SVM Classification-Polynomial Kernel

polynomial kernel, Polynomial Kernel-Polynomial Kernel, Kernelized SVM

Polynomial Regression, Training Models, Polynomial Regression-Polynomial Regression

learning curves in, Learning Curves-Learning Curves

pooling kernel, Pooling Layer

pooling layer, Pooling Layer-Pooling Layer

power scheduling, Learning Rate Scheduling

precision, Confusion Matrix

precision and recall, Precision and Recall-Precision/Recall Tradeoff

F-1 score, Precision and Recall-Precision and Recall

precision/recall (PR) curve, The ROC Curve

precision/recall tradeoff, Precision/Recall Tradeoff-Precision/Recall Tradeoff

predetermined piecewise constant learning rate, Learning Rate Scheduling

predict(), Data Cleaning

predicted class, Confusion Matrix

predictions, Confusion Matrix-Confusion Matrix, Decision Function and Predictions-Decision Function and Predictions, Making Predictions-Estimating Class Probabilities

predictors, Supervised learning, Data Cleaning

preloading training data, Preload the data into a variable

PReLU (parametric leaky ReLU), Nonsaturating Activation Functions

preprocessed attributes, Take a Quick Look at the Data Structure

pretrained layers reuse, Reusing Pretrained Layers-Pretraining on an Auxiliary Task

auxiliary task, Pretraining on an Auxiliary Task-Pretraining on an Auxiliary Task

caching frozen layers, Caching the Frozen Layers

freezing lower layers, Freezing the Lower Layers

model zoos, Model Zoos

other frameworks, Reusing Models from Other Frameworks

TensorFlow model, Reusing a TensorFlow Model-Reusing a TensorFlow Model

unsupervised pretraining, Unsupervised Pretraining-Unsupervised Pretraining

upper layers, Tweaking, Dropping, or Replacing the Upper Layers

PrettyTensor, Up and Running with TensorFlow

primal problem, The Dual Problem

principal component, Principal Components

Principal Component Analysis (PCA), PCA-Randomized PCA

explained variance ratios, Explained Variance Ratio

finding principal components, Principal Components-Principal Components

for compression, PCA for Compression-Incremental PCA

Incremental PCA, Incremental PCA-Randomized PCA

Kernel PCA (kPCA), Kernel PCA-Selecting a Kernel and Tuning Hyperparameters

projecting down to d dimensions, Projecting Down to d Dimensions

Randomized PCA, Randomized PCA

Scikit-Learn for, Using Scikit-Learn

variance, preserving, Preserving the Variance-Preserving the Variance

probabilistic autoencoders, Variational Autoencoders

probabilities, estimating, Estimating Probabilities-Estimating Probabilities, Estimating Class Probabilities

producer functions, Other convenience functions

projection, Projection-Projection

propositional logic, From Biological to Artificial Neurons

pruning, Regularization Hyperparameters, Symbolic Differentiation

Python

isolated environment in, Create the Workspace-Create the Workspace

notebooks in, Create the Workspace-Download the Data

pickle, Better Evaluation Using Cross-Validation

pip, Create the Workspace

Q

Q-Learning algorithm, Temporal Difference Learning and Q-Learning-Learning to Play Ms. Pac-Man Using Deep Q-Learning

approximate Q-Learning, Approximate Q-Learning

deep Q-Learning, Approximate Q-Learning-Learning to Play Ms. Pac-Man Using Deep Q-Learning

Q-Value Iteration Algorithm, Markov Decision Processes

Q-Values, Markov Decision Processes

Quadratic Programming (QP) Problems, Quadratic Programming-Quadratic Programming

quantizing, Bandwidth saturation

queries per second (QPS), One Neural Network per Device

QueueRunner, Multithreaded readers using a Coordinator and a QueueRunner-Multithreaded readers using a Coordinator and a QueueRunner

queues, Asynchronous Communication Using TensorFlow Queues-PaddingFifoQueue

closing, Closing a queue

dequeuing data, Dequeuing data

enqueuing data, Enqueuing data

first-in first-out (FIFO), Asynchronous Communication Using TensorFlow Queues

of tuples, Queues of tuples

PaddingFIFOQueue, PaddingFifoQueue

RandomShuffleQueue, RandomShuffleQueue

q_network(), Learning to Play Ms. Pac-Man Using Deep Q-Learning

R

Radial Basis Function (RBF), Adding Similarity Features

Random Forests, Better Evaluation Using Cross-Validation-Grid Search, Multiclass Classification, Decision Trees, Instability, Ensemble Learning and Random Forests, Random Forests-Feature Importance

Extra-Trees, Extra-Trees

feature importance, Feature Importance-Feature Importance

random initialization, Gradient Descent, Batch Gradient Descent, Stochastic Gradient Descent, Vanishing/Exploding Gradients Problems

Random Patches and Random Subspaces, Random Patches and Random Subspaces

randomized leaky ReLU (RReLU), Nonsaturating Activation Functions

Randomized PCA, Randomized PCA

randomized search, Randomized Search, Fine-Tuning Neural Network Hyperparameters

RandomShuffleQueue, RandomShuffleQueue, Reading the training data directly from the graph

random_uniform(), Manually Computing the Gradients

reader operations, Reading the training data directly from the graph

recall, Confusion Matrix

recognition network, Efficient Data Representations

reconstruction error, PCA for Compression

reconstruction loss, Efficient Data Representations, TensorFlow Implementation, Variational Autoencoders

reconstruction pre-image, Selecting a Kernel and Tuning Hyperparameters

reconstructions, Efficient Data Representations

recurrent neural networks (RNNs), Recurrent Neural Networks-Exercises

deep RNNs, Deep RNNs-The Difficulty of Training over Many Time Steps

exploration policies, Exploration Policies

GRU cell, GRU Cell-GRU Cell

input and output sequences, Input and Output Sequences-Input and Output Sequences

LSTM cell, LSTM Cell-GRU Cell

natural language processing (NLP), Natural Language Processing-An Encoder–Decoder Network for Machine Translation

in TensorFlow, Basic RNNs in TensorFlow-Handling Variable-Length Output Sequences

dynamic unrolling through time, Dynamic Unrolling Through Time

static unrolling through time, Static Unrolling Through Time-Static Unrolling Through Time

variable length input sequences, Handling Variable Length Input Sequences

variable length output sequences, Handling Variable-Length Output Sequences

training, Training RNNs-Creative RNN

backpropagation through time (BPTT), Training RNNs

creative sequences, Creative RNN

sequence classifiers, Training a Sequence Classifier-Training a Sequence Classifier

time series predictions, Training to Predict Time Series-Training to Predict Time Series

recurrent neurons, Recurrent Neurons-Input and Output Sequences

memory cells, Memory Cells

reduce_mean(), Construction Phase

reduce_sum(), TensorFlow Implementation-TensorFlow Implementation, Variational Autoencoders, Learning to Play Ms. Pac-Man Using Deep Q-Learning

regression, Supervised learning

Decision Trees, Regression-Regression

regression models

linear, Training and Evaluating on the Training Set

regression versus classification, Multioutput Classification

regularization, Overfitting the Training Data-Overfitting the Training Data, Testing and Validating, Regularized Linear Models-Early Stopping

data augmentation, Data Augmentation-Data Augmentation

Decision Trees, Regularization Hyperparameters-Regularization Hyperparameters

dropout, Dropout-Dropout

early stopping, Early Stopping-Early Stopping, Early Stopping

Elastic Net, Elastic Net

Lasso Regression, Lasso Regression-Lasso Regression

max-norm, Max-Norm Regularization-Max-Norm Regularization

Ridge Regression, Ridge Regression-Ridge Regression

shrinkage, Gradient Boosting

ℓ1 and ℓ2 regularization, ℓ1 and ℓ2 Regularization-ℓ1 and ℓ2 Regularization

REINFORCE algorithms, Policy Gradients

Reinforcement Learning (RL), Reinforcement Learning-Reinforcement Learning, Reinforcement Learning-Thank You!

actions, Evaluating Actions: The Credit Assignment Problem-Evaluating Actions: The Credit Assignment Problem

credit assignment problem, Evaluating Actions: The Credit Assignment Problem-Evaluating Actions: The Credit Assignment Problem

discount rate, Evaluating Actions: The Credit Assignment Problem

examples of, Learning to Optimize Rewards

Markov decision processes, Markov Decision Processes-Markov Decision Processes

neural network policies, Neural Network Policies-Neural Network Policies

OpenAI Gym, Introduction to OpenAI Gym-Introduction to OpenAI Gym

PG algorithms, Policy Gradients-Policy Gradients

policy search, Policy Search-Policy Search

Q-Learning algorithm, Temporal Difference Learning and Q-Learning-Learning to Play Ms. Pac-Man Using Deep Q-Learning

rewards, learning to optimize, Learning to Optimize Rewards-Learning to Optimize Rewards

Temporal Difference (TD) Learning, Temporal Difference Learning and Q-Learning-Temporal Difference Learning and Q-Learning

ReLU (rectified linear units), Modularity-Modularity

ReLU activation, ResNet

ReLU function, Multi-Layer Perceptron and Backpropagation, Activation Functions, Xavier and He Initialization-Nonsaturating Activation Functions

relu(z), Construction Phase

render(), Introduction to OpenAI Gym

replay memory, Learning to Play Ms. Pac-Man Using Deep Q-Learning

replica_device_setter(), Sharding Variables Across Multiple Parameter Servers

request_stop(), Multithreaded readers using a Coordinator and a QueueRunner

reset(), Introduction to OpenAI Gym

reset_default_graph(), Managing Graphs

reshape(), Training to Predict Time Series

residual errors, Gradient Boosting-Gradient Boosting

residual learning, ResNet

residual network (ResNet), Model Zoos, ResNet-ResNet

residual units, ResNet

ResNet, ResNet-ResNet

resource containers, Sharing State Across Sessions Using Resource Containers-Sharing State Across Sessions Using Resource Containers

restore(), Saving and Restoring Models

restricted Boltzmann machines (RBMs), Semisupervised learning, Unsupervised Pretraining, Boltzmann Machines

reuse_variables(), Sharing Variables

reverse-mode autodiff, Reverse-Mode Autodiff-Reverse-Mode Autodiff

rewards, in RL, Learning to Optimize Rewards-Learning to Optimize Rewards

rgb_array, Introduction to OpenAI Gym

Ridge Regression, Ridge Regression-Ridge Regression, Elastic Net

RMSProp, RMSProp

ROC (receiver operating characteristic) curve, The ROC Curve-The ROC Curve

Root Mean Square Error (RMSE), Select a Performance Measure-Select a Performance Measure, Linear Regression

RReLU (randomized leaky ReLU), Nonsaturating Activation Functions

run(), Creating Your First Graph and Running It in a Session, In-Graph Versus Between-Graph Replication

S

Sampled Softmax, An Encoder–Decoder Network for Machine Translation

sampling bias, Nonrepresentative Training Data-Poor-Quality Data, Create a Test Set

sampling noise, Nonrepresentative Training Data

save(), Saving and Restoring Models

Saver node, Saving and Restoring Models

Scikit Flow, Up and Running with TensorFlow

Scikit-Learn, Create the Workspace

about, Objective and Approach

bagging and pasting in, Bagging and Pasting in Scikit-Learn-Bagging and Pasting in Scikit-Learn

CART algorithm, Making Predictions-The CART Training Algorithm, Regression

cross-validation, Better Evaluation Using Cross-Validation-Better Evaluation Using Cross-Validation

design principles, Data Cleaning-Data Cleaning

imputer, Data Cleaning-Handling Text and Categorical Attributes

LinearSVR class, SVM Regression

MinMaxScaler, Feature Scaling

min_ and max_ hyperparameters, Regularization Hyperparameters

PCA implementation, Using Scikit-Learn

Perceptron class, The Perceptron

Pipeline constructor, Transformation Pipelines-Select and Train a Model, Nonlinear SVM Classification

Randomized PCA, Randomized PCA

Ridge Regression with, Ridge Regression

SAMME, AdaBoost

SGDClassifier, Training a Binary Classifier, Precision/Recall Tradeoff-Precision/Recall Tradeoff, Multiclass Classification

SGDRegressor, Stochastic Gradient Descent

sklearn.base.BaseEstimator, Custom Transformers, Transformation Pipelines, Measuring Accuracy Using Cross-Validation

sklearn.base.clone(), Measuring Accuracy Using Cross-Validation, Early Stopping

sklearn.base.TransformerMixin, Custom Transformers, Transformation Pipelines

sklearn.datasets.fetch_california_housing(), Linear Regression with TensorFlow

sklearn.datasets.fetch_mldata(), MNIST

sklearn.datasets.load_iris(), Decision Boundaries, Soft Margin Classification, Training and Visualizing a Decision Tree, Feature Importance, The Perceptron

sklearn.datasets.load_sample_images(), TensorFlow Implementation-TensorFlow Implementation

sklearn.datasets.make_moons(), Nonlinear SVM Classification, Exercises

sklearn.decomposition.IncrementalPCA, Incremental PCA

sklearn.decomposition.KernelPCA, Kernel PCA-Selecting a Kernel and Tuning Hyperparameters, Selecting a Kernel and Tuning Hyperparameters

sklearn.decomposition.PCA, Using Scikit-Learn

sklearn.ensemble.AdaBoostClassifier, AdaBoost

sklearn.ensemble.BaggingClassifier, Bagging and Pasting in Scikit-Learn-Random Forests

sklearn.ensemble.GradientBoostingRegressor, Gradient Boosting, Gradient Boosting-Gradient Boosting

sklearn.ensemble.RandomForestClassifier, The ROC Curve, Multiclass Classification, Voting Classifiers

sklearn.ensemble.RandomForestRegressor, Better Evaluation Using Cross-Validation, Grid Search-Analyze the Best Models and Their Errors, Random Forests-Extra-Trees, Gradient Boosting

sklearn.ensemble.VotingClassifier, Voting Classifiers

sklearn.externals.joblib, Better Evaluation Using Cross-Validation

sklearn.linear_model.ElasticNet, Elastic Net

sklearn.linear_model.Lasso, Lasso Regression

sklearn.linear_model.LinearRegression, Model-based learning-Model-based learning, Data Cleaning, Training and Evaluating on the Training Set, The Normal Equation, Mini-batch Gradient Descent, Polynomial Regression, Learning Curves-Learning Curves

sklearn.linear_model.LogisticRegression, Decision Boundaries, Decision Boundaries, Softmax Regression, Voting Classifiers, Selecting a Kernel and Tuning Hyperparameters

sklearn.linear_model.Perceptron, The Perceptron

sklearn.linear_model.Ridge, Ridge Regression

sklearn.linear_model.SGDClassifier, Training a Binary Classifier

sklearn.linear_model.SGDRegressor, Stochastic Gradient Descent-Mini-batch Gradient Descent, Ridge Regression, Lasso Regression-Early Stopping

sklearn.manifold.LocallyLinearEmbedding, LLE-LLE

sklearn.metrics.accuracy_score(), Voting Classifiers, Out-of-Bag Evaluation, Training an MLP with TensorFlow's High-Level API

sklearn.metrics.confusion_matrix(), Confusion Matrix, Error Analysis

sklearn.metrics.f1_score(), Precision and Recall, Multilabel Classification

sklearn.metrics.mean_squared_error(), Training and Evaluating on the Training Set-Training and Evaluating on the Training Set, Evaluate Your System on the Test Set, Learning Curves, Early Stopping, Gradient Boosting-Gradient Boosting, Selecting a Kernel and Tuning Hyperparameters

sklearn.metrics.precision_recall_curve(), Precision/Recall Tradeoff

sklearn.metrics.precision_score(), Precision and Recall, Precision/Recall Tradeoff

sklearn.metrics.recall_score(), Precision and Recall, Precision/Recall Tradeoff

sklearn.metrics.roc_auc_score(), The ROC Curve-The ROC Curve

sklearn.metrics.roc_curve(), The ROC Curve-The ROC Curve

sklearn.model_selection.cross_val_predict(), Confusion Matrix, Precision/Recall Tradeoff, The ROC Curve, Error Analysis, Multilabel Classification

sklearn.model_selection.cross_val_score(), Better Evaluation Using Cross-Validation-Better Evaluation Using Cross-Validation, Measuring Accuracy Using Cross-Validation-Confusion Matrix

sklearn.model_selection.GridSearchCV, Grid Search-Randomized Search, Exercises, Error Analysis, Exercises, Selecting a Kernel and Tuning Hyperparameters

sklearn.model_selection.StratifiedKFold, Measuring Accuracy Using Cross-Validation

sklearn.model_selection.StratifiedShuffleSplit, Create a Test Set

sklearn.model_selection.train_test_split(), Create a Test Set, Training and Evaluating on the Training Set, Learning Curves, Exercises, Gradient Boosting

sklearn.multiclass.OneVsOneClassifier, Multiclass Classification

sklearn.neighbors.KNeighborsClassifier, Multilabel Classification, Exercises

sklearn.neighbors.KNeighborsRegressor, Model-based learning

sklearn.pipeline.FeatureUnion, Transformation Pipelines

sklearn.pipeline.Pipeline, Transformation Pipelines, Learning Curves, Soft Margin Classification-Nonlinear SVM Classification, Selecting a Kernel and Tuning Hyperparameters

sklearn.preprocessing.Imputer, Data Cleaning, Transformation Pipelines

sklearn.preprocessing.LabelBinarizer, Handling Text and Categorical Attributes, Transformation Pipelines

sklearn.preprocessing.LabelEncoder, Handling Text and Categorical Attributes

sklearn.preprocessing.OneHotEncoder, Handling Text and Categorical Attributes

sklearn.preprocessing.PolynomialFeatures, Polynomial Regression-Polynomial Regression, Learning Curves, Ridge Regression, Nonlinear SVM Classification

sklearn.preprocessing.StandardScaler, Feature Scaling-Transformation Pipelines, Multiclass Classification, Gradient Descent, Ridge Regression, Linear SVM Classification, Soft Margin Classification-Polynomial Kernel, Gaussian RBF Kernel, Implementing Gradient Descent, Training an MLP with TensorFlow's High-Level API

sklearn.svm.LinearSVC, Soft Margin Classification-Nonlinear SVM Classification, Gaussian RBF Kernel-Computational Complexity, SVM Regression, Exercises

sklearn.svm.LinearSVR, SVM Regression-SVM Regression

sklearn.svm.SVC, Soft Margin Classification, Polynomial Kernel, Gaussian RBF Kernel-Computational Complexity, SVM Regression, Exercises, Voting Classifiers

sklearn.svm.SVR, Exercises, SVM Regression

sklearn.tree.DecisionTreeClassifier, Regularization Hyperparameters, Exercises, Bagging and Pasting in Scikit-Learn-Out-of-Bag Evaluation, Random Forests, AdaBoost

sklearn.tree.DecisionTreeRegressor, Training and Evaluating on the Training Set, Decision Trees, Regression, Gradient Boosting-Gradient Boosting

sklearn.tree.export_graphviz(), Training and Visualizing a Decision Tree

StandardScaler, Gradient Descent, Implementing Gradient Descent, Training an MLP with TensorFlow's High-Level API

SVM classification classes, Computational Complexity

TF.Learn, Up and Running with TensorFlow

user guide, Other Resources

score(), Data Cleaning

search space, Randomized Search, Fine-Tuning Neural Network Hyperparameters

second-order partial derivatives (Hessians), Adam Optimization

self-organizing maps (SOMs), Self-Organizing Maps-Self-Organizing Maps

semantic hashing, Exercises

semisupervised learning, Semisupervised learning

sensitivity, Confusion Matrix, The ROC Curve

sentiment analysis, Recurrent Neural Networks

separable_conv2d(), ResNet

sequences, Recurrent Neural Networks

sequence_length, Handling Variable Length Input Sequences-Handling Variable-Length Output Sequences, An Encoder–Decoder Network for Machine Translation

Shannon's information theory, Gini Impurity or Entropy?

shortcut connections, ResNet

show(), Take a Quick Look at the Data Structure

show_graph(), Visualizing the Graph and Training Curves Using TensorBoard

shrinkage, Gradient Boosting

shuffle_batch(), Other convenience functions

shuffle_batch_join(), Other convenience functions

sigmoid function, Estimating Probabilities

sigmoid_cross_entropy_with_logits(), TensorFlow Implementation

similarity function, Adding Similarity Features-Adding Similarity Features

simulated annealing, Stochastic Gradient Descent

simulated environments, Introduction to OpenAI Gym

(see also OpenAI Gym)

Singular Value Decomposition (SVD), Principal Components

skewed datasets, Measuring Accuracy Using Cross-Validation

skip connections, Data Augmentation, ResNet

slack variable, Training Objective

smoothing terms, Batch Normalization, AdaGrad, Adam Optimization, Variational Autoencoders

soft margin classification, Soft Margin Classification-Soft Margin Classification

soft placements, Soft placement

soft voting, Voting Classifiers

softmax function, Softmax Regression, Multi-Layer Perceptron and Backpropagation, Training an MLP with TensorFlow's High-Level API

Softmax Regression, Softmax Regression-Softmax Regression

source ops, Linear Regression with TensorFlow, Parallel Execution

spam filters, The Machine Learning Landscape-Why Use Machine Learning?, Supervised learning

sparse autoencoders, Sparse Autoencoders-TensorFlow Implementation

sparse matrix, Handling Text and Categorical Attributes

sparse models, Lasso Regression, Adam Optimization

sparse_softmax_cross_entropy_with_logits(), Construction Phase

sparsity loss, Sparse Autoencoders

specificity, The ROC Curve

speech recognition, Why Use Machine Learning?

spurious patterns, Hopfield Networks

stack(), Static Unrolling Through Time

stacked autoencoders, Stacked Autoencoders-Unsupervised Pretraining Using Stacked Autoencoders

TensorFlow implementation, TensorFlow Implementation

training one-at-a-time, Training One Autoencoder at a Time-Training One Autoencoder at a Time

tying weights, Tying Weights-Tying Weights

unsupervised pretraining with, Unsupervised Pretraining Using Stacked Autoencoders-Unsupervised Pretraining Using Stacked Autoencoders

visualizing the reconstructions, Visualizing the Reconstructions-Visualizing the Reconstructions

stacked denoising autoencoders, Visualizing Features, Denoising Autoencoders

stacked denoising encoders, Denoising Autoencoders

stacked generalization (see stacking)

stacking, Stacking-Stacking

stale gradients, Asynchronous updates

standard correlation coefficient, Looking for Correlations

standardization, Feature Scaling

StandardScaler, Transformation Pipelines, Implementing Gradient Descent, Training an MLP with TensorFlow's High-Level API

state-action values, Markov Decision Processes

states tensor, Handling Variable Length Input Sequences

state_is_tuple, Distributing a Deep RNN Across Multiple GPUs, LSTM Cell

static unrolling through time, Static Unrolling Through Time-Static Unrolling Through Time

static_rnn(), Static Unrolling Through Time-Static Unrolling Through Time, An Encoder–Decoder Network for Machine Translation

stationary point, SVM Dual Problem-SVM Dual Problem

statistical mode, Bagging and Pasting

statistical significance, Regularization Hyperparameters

stemming, Exercises

step functions, The Perceptron

step(), Introduction to OpenAI Gym

Stochastic Gradient Boosting, Gradient Boosting

Stochastic Gradient Descent (SGD), Stochastic Gradient Descent-Stochastic Gradient Descent, Soft Margin Classification, The Perceptron

training, Training and Cost Function

Stochastic Gradient Descent (SGD) classifier, Training a Binary Classifier, Ridge Regression

stochastic neurons, Boltzmann Machines

stochastic policy, Policy Search

stratified sampling, Create a Test Set-Create a Test Set, Measuring Accuracy Using Cross-Validation

stride, Convolutional Layer

string kernels, Gaussian RBF Kernel

string_input_producer(), Other convenience functions

strong learners, Voting Classifiers

subderivatives, Online SVMs

subgradient vector, Lasso Regression

subsample, Gradient Boosting, Pooling Layer

supervised learning, Supervised/Unsupervised Learning-Supervised learning

Support Vector Machines (SVMs), Multiclass Classification, Support Vector Machines-Exercises

decision function and predictions, Decision Function and Predictions-Decision Function and Predictions

dual problem, SVM Dual Problem-SVM Dual Problem

kernelized SVM, Kernelized SVM-Kernelized SVM

linear classification, Linear SVM Classification-Soft Margin Classification

mechanics of, Under the Hood-Online SVMs

nonlinear classification, Nonlinear SVM Classification-Computational Complexity

online SVMs, Online SVMs-Online SVMs

Quadratic Programming (QP) problems, Quadratic Programming-Quadratic Programming

SVM regression, SVM Regression-Online SVMs

the dual problem, The Dual Problem

training objective, Training Objective-Training Objective

support vectors, Linear SVM Classification

svd(), Principal Components

symbolic differentiation, Using autodiff, Symbolic Differentiation-Numerical Differentiation

synchronous updates, Synchronous updates

T

t-Distributed Stochastic Neighbor Embedding (t-SNE), Other Dimensionality Reduction Techniques

tail heavy, Take a Quick Look at the Data Structure

target attributes, Take a Quick Look at the Data Structure

target_weights, An Encoder–Decoder Network for Machine Translation

tasks, Multiple Devices Across Multiple Servers

Temporal Difference (TD) Learning, Temporal Difference Learning and Q-Learning-Temporal Difference Learning and Q-Learning

tensor processing units (TPUs), Installation

TensorBoard, Up and Running with TensorFlow

TensorFlow, Up and Running with TensorFlow-Exercises

about, Objective and Approach

autodiff, Using autodiff-Using autodiff, Autodiff-Reverse-Mode Autodiff

Batch Normalization with, Implementing Batch Normalization with TensorFlow-Implementing Batch Normalization with TensorFlow

construction phase, Creating Your First Graph and Running It in a Session

control dependencies, Control Dependencies

convenience functions, Other convenience functions

convolutional layers, ResNet

convolutional neural networks and, TensorFlow Implementation-TensorFlow Implementation

data parallelism and, TensorFlow implementation

denoising autoencoders, TensorFlow Implementation-TensorFlow Implementation

dropout with, Dropout

dynamic placer, Placing Operations on Devices

execution phase, Creating Your First Graph and Running It in a Session

feeding data to the training algorithm, Feeding Data to the Training Algorithm-Feeding Data to the Training Algorithm

Gradient Descent with, Implementing Gradient Descent-Using an Optimizer

graphs, managing, Managing Graphs

initial graph creation and session run, Creating Your First Graph and Running It in a Session-Creating Your First Graph and Running It in a Session

installation, Installation

ℓ1 and ℓ2 regularization with, ℓ1 and ℓ2 Regularization

learning schedules in, Learning Rate Scheduling

Linear Regression with, Linear Regression with TensorFlow-Linear Regression with TensorFlow

max pooling layer in, Pooling Layer

max-norm regularization with, Max-Norm Regularization

model zoo, Model Zoos

modularity, Modularity-Modularity

Momentum optimization in, Momentum Optimization

name scopes, Name Scopes

neural network policies, Neural Network Policies

NLP tutorials, Natural Language Processing, An Encoder–Decoder Network for Machine Translation

node value lifecycle, Lifecycle of a Node Value

operations (ops), Linear Regression with TensorFlow

optimizer, Using an Optimizer

overview, Up and Running with TensorFlow-Up and Running with TensorFlow

parallel distributed computing (see parallel distributed computing with TensorFlow)

Python API

construction, Construction Phase-Construction Phase

execution, Execution Phase

using the neural network, Using the Neural Network

queues (see queues)

reusing pretrained layers, Reusing a TensorFlow Model-Reusing a TensorFlow Model

RNNs in, Basic RNNs in TensorFlow-Handling Variable-Length Output Sequences

(see also recurrent neural networks (RNNs))

saving and restoring models, Saving and Restoring Models-Saving and Restoring Models

sharing variables, Sharing Variables-Sharing Variables

simple placer, Placing Operations on Devices

sparse autoencoders with, TensorFlow Implementation

and stacked autoencoders, TensorFlow Implementation

TensorBoard, Visualizing the Graph and Training Curves Using TensorBoard-Visualizing the Graph and Training Curves Using TensorBoard

tf.abs(), ℓ1 and ℓ2 Regularization

tf.add(), Modularity, ℓ1 and ℓ2 Regularization

tf.add_n(), Modularity-Sharing Variables, Sharing Variables-Sharing Variables

tf.add_to_collection(), Max-Norm Regularization

tf.assign(), Manually Computing the Gradients, Reusing Models from Other Frameworks, Max-Norm Regularization-Max-Norm Regularization, Chapter 9: Up and Running with TensorFlow

tf.bfloat16, Bandwidth saturation

tf.bool, Dropout

tf.cast(), Construction Phase, Training a Sequence Classifier

tf.clip_by_norm(), Max-Norm Regularization-Max-Norm Regularization

tf.clip_by_value(), Gradient Clipping

tf.concat(), Exercises, GoogLeNet, Neural Network Policies, Policy Gradients

tf.ConfigProto, Managing the GPU RAM, Logging placements-Soft placement, In-Graph Versus Between-Graph Replication, Chapter 12: Distributing TensorFlow Across Devices and Servers

tf.constant(), Lifecycle of a Node Value-Manually Computing the Gradients, Simple placement-Dynamic placement function, Control Dependencies, Opening a Session-Pinning Operations Across Tasks

tf.constant_initializer(), Sharing Variables-Sharing Variables

tf.container(), Sharing State Across Sessions Using Resource Containers-Asynchronous Communication Using TensorFlow Queues, TensorFlow implementation-Exercises, Chapter 9: Up and Running with TensorFlow

tf.contrib.layers.l1_regularizer(), ℓ1 and ℓ2 Regularization, Max-Norm Regularization

tf.contrib.layers.l2_regularizer(), ℓ1 and ℓ2 Regularization, TensorFlow Implementation-Tying Weights

tf.contrib.layers.variance_scaling_initializer(), Xavier and He Initialization-Xavier and He Initialization, Training a Sequence Classifier, TensorFlow Implementation-Tying Weights, Variational Autoencoders, Neural Network Policies, Policy Gradients, Learning to Play Ms. Pac-Man Using Deep Q-Learning

tf.contrib.learn.DNNClassifier, Training an MLP with TensorFlow's High-Level API

tf.contrib.learn.infer_real_valued_columns_from_input(), Training an MLP with TensorFlow's High-Level API

tf.contrib.rnn.BasicLSTMCell, LSTM Cell, Peephole Connections

tf.contrib.rnn.BasicRNNCell, Static Unrolling Through Time-Dynamic Unrolling Through Time, Training a Sequence Classifier, Training to Predict Time Series-Training to Predict Time Series, Training to Predict Time Series, Deep RNNs-Applying Dropout, LSTM Cell

tf.contrib.rnn.DropoutWrapper, Applying Dropout

tf.contrib.rnn.GRUCell, GRU Cell

tf.contrib.rnn.LSTMCell, Peephole Connections

tf.contrib.rnn.MultiRNNCell, Deep RNNs-Applying Dropout

tf.contrib.rnn.OutputProjectionWrapper, Training to Predict Time Series-Training to Predict Time Series

tf.contrib.rnn.RNNCell, Distributing a Deep RNN Across Multiple GPUs

tf.contrib.rnn.static_rnn(), Basic RNNs in TensorFlow-Handling Variable Length Input Sequences, An Encoder–Decoder Network for Machine Translation-Exercises, Chapter 14: Recurrent Neural Networks-Chapter 14: Recurrent Neural Networks

tf.contrib.slim module, Up and Running with TensorFlow, Exercises

tf.contrib.slim.nets module (nets), Exercises

tf.control_dependencies(), Control Dependencies

tf.decode_csv(), Reading the training data directly from the graph, Multithreaded readers using a Coordinator and a QueueRunner

tf.device(), Simple placement-Soft placement, Pinning Operations Across Tasks-Sharding Variables Across Multiple Parameter Servers, Distributing a Deep RNN Across Multiple GPUs-Distributing a Deep RNN Across Multiple GPUs

tf.exp(), Variational Autoencoders-Generating Digits

tf.FIFOQueue, Asynchronous Communication Using TensorFlow Queues, Queues of tuples-RandomShuffleQueue, Reading the training data directly from the graph, Multithreaded readers using a Coordinator and a QueueRunner

tf.float32, Linear Regression with TensorFlow, Chapter 9: Up and Running with TensorFlow

tf.get_collection(), Reusing a TensorFlow Model-Freezing the Lower Layers, ℓ1 and ℓ2 Regularization, Max-Norm Regularization, TensorFlow Implementation, Learning to Play Ms. Pac-Man Using Deep Q-Learning

tf.get_default_graph(), Managing Graphs, Visualizing the Graph and Training Curves Using TensorBoard

tf.get_default_session(), Creating Your First Graph and Running It in a Session

tf.get_variable(), Sharing Variables-Sharing Variables, Reusing Models from Other Frameworks, ℓ1 and ℓ2 Regularization

tf.global_variables(), Reusing a TensorFlow Model

tf.global_variables_initializer(), Creating Your First Graph and Running It in a Session, Manually Computing the Gradients

tf.gradients(), Using autodiff

tf.Graph, Creating Your First Graph and Running It in a Session, Managing Graphs, Visualizing the Graph and Training Curves Using TensorBoard, Loading Data Directly from the Graph, In-Graph Versus Between-Graph Replication

tf.GraphKeys.GLOBAL_VARIABLES, Reusing a TensorFlow Model-Freezing the Lower Layers

tf.GraphKeys.REGULARIZATION_LOSSES, ℓ1 and ℓ2 Regularization, TensorFlow Implementation

tf.group(), Learning to Play Ms. Pac-Man Using Deep Q-Learning

tf.int32, Operations and kernels-Queues of tuples, Reading the training data directly from the graph, Handling Variable Length Input Sequences, Training a Sequence Classifier, Word Embeddings, Learning to Play Ms. Pac-Man Using Deep Q-Learning

tf.int64, Construction Phase

tf.InteractiveSession, Creating Your First Graph and Running It in a Session

tf.layers.batch_normalization(), Implementing Batch Normalization with TensorFlow-Implementing Batch Normalization with TensorFlow

tf.layers.dense(), Construction Phase

TF.Learn, Training an MLP with TensorFlow's High-Level API

tf.log(), TensorFlow Implementation, Variational Autoencoders, Neural Network Policies, Policy Gradients

tf.matmul(), Linear Regression with TensorFlow-Manually Computing the Gradients, Modularity, Construction Phase, Basic RNNs in TensorFlow, Tying Weights, Training One Autoencoder at a Time, TensorFlow Implementation, TensorFlow Implementation-TensorFlow Implementation

tf.matrix_inverse(), Linear Regression with TensorFlow

tf.maximum(), Modularity, Sharing Variables-Sharing Variables, Nonsaturating Activation Functions

tf.multinomial(), Neural Network Policies, Policy Gradients

tf.name_scope(), Name Scopes, Modularity-Sharing Variables, Construction Phase, Construction Phase-Construction Phase, Training One Autoencoder at a Time-Training One Autoencoder at a Time

tf.nn.conv2d(), TensorFlow Implementation-TensorFlow Implementation

tf.nn.dynamic_rnn(), Static Unrolling Through Time-Dynamic Unrolling Through Time, Training a Sequence Classifier, Training to Predict Time Series, Training to Predict Time Series, Deep RNNs-Applying Dropout, An Encoder–Decoder Network for Machine Translation-Exercises, Chapter 14: Recurrent Neural Networks-Chapter 14: Recurrent Neural Networks

tf.nn.elu(), Nonsaturating Activation Functions, TensorFlow Implementation-Tying Weights, Variational Autoencoders, Neural Network Policies, Policy Gradients

tf.nn.embedding_lookup(), Word Embeddings

tf.nn.in_top_k(), Construction Phase, Training a Sequence Classifier

tf.nn.max_pool(), Pooling Layer-Pooling Layer

tf.nn.relu(), Construction Phase, Training to Predict Time Series-Training to Predict Time Series, Training to Predict Time Series, Learning to Play Ms. Pac-Man Using Deep Q-Learning

tf.nn.sigmoid_cross_entropy_with_logits(), TensorFlow Implementation, Generating Digits, Policy Gradients-Policy Gradients

tf.nn.sparse_softmax_cross_entropy_with_logits(), Construction Phase-Construction Phase, Training a Sequence Classifier

tf.one_hot(), Learning to Play Ms. Pac-Man Using Deep Q-Learning

tf.PaddingFIFOQueue, PaddingFifoQueue

tf.placeholder(), Feeding Data to the Training Algorithm-Feeding Data to the Training Algorithm, Chapter 9: Up and Running with TensorFlow

tf.placeholder_with_default(), TensorFlow Implementation

tf.RandomShuffleQueue, RandomShuffleQueue, Reading the training data directly from the graph-Reading the training data directly from the graph, Multithreaded readers using a Coordinator and a QueueRunner-Other convenience functions

tf.random_normal(), Modularity, Basic RNNs in TensorFlow, TensorFlow Implementation, Variational Autoencoders

tf.random_uniform(), Manually Computing the Gradients, Saving and Restoring Models, Word Embeddings, Chapter 9: Up and Running with TensorFlow

tf.reduce_mean(), Manually Computing the Gradients, Name Scopes, Construction Phase-Construction Phase, ℓ1 and ℓ2 Regularization, Training a Sequence Classifier-Training a Sequence Classifier, Performing PCA with an Undercomplete Linear Autoencoder, TensorFlow Implementation, Training One Autoencoder at a Time, Training One Autoencoder at a Time, TensorFlow Implementation, TensorFlow Implementation, Learning to Play Ms. Pac-Man Using Deep Q-Learning

tf.reduce_sum(), ℓ1 and ℓ2 Regularization, TensorFlow Implementation-TensorFlow Implementation, Variational Autoencoders-Generating Digits, Learning to Play Ms. Pac-Man Using Deep Q-Learning-Learning to Play Ms. Pac-Man Using Deep Q-Learning

tf.reset_default_graph(), Managing Graphs

tf.reshape(), Training to Predict Time Series, Learning to Play Ms. Pac-Man Using Deep Q-Learning

tf.RunOptions, In-Graph Versus Between-Graph Replication

tf.Session, Creating Your First Graph and Running It in a Session, Chapter 9: Up and Running with TensorFlow

tf.shape(), TensorFlow Implementation, Variational Autoencoders

tf.square(), Manually Computing the Gradients, Name Scopes, Training to Predict Time Series, Performing PCA with an Undercomplete Linear Autoencoder, TensorFlow Implementation, Training One Autoencoder at a Time, Training One Autoencoder at a Time, TensorFlow Implementation, TensorFlow Implementation, Variational Autoencoders-Generating Digits, Learning to Play Ms. Pac-Man Using Deep Q-Learning

tf.stack(), Reading the training data directly from the graph, Multithreaded readers using a Coordinator and a QueueRunner, Static Unrolling Through Time

tf.string, Reading the training data directly from the graph, Multithreaded readers using a Coordinator and a QueueRunner

tf.summary.FileWriter, Visualizing the Graph and Training Curves Using TensorBoard-Visualizing the Graph and Training Curves Using TensorBoard

tf.summary.scalar(), Visualizing the Graph and Training Curves Using TensorBoard

tf.tanh(), Basic RNNs in TensorFlow

tf.TextLineReader, Reading the training data directly from the graph, Multithreaded readers using a Coordinator and a QueueRunner

tf.to_float(), Policy Gradients-Policy Gradients

tf.train.AdamOptimizer, Adam Optimization, Adam Optimization, Training a Sequence Classifier, Training to Predict Time Series, Performing PCA with an Undercomplete Linear Autoencoder, TensorFlow Implementation-Tying Weights, Training One Autoencoder at a Time, TensorFlow Implementation, Generating Digits, Policy Gradients-Policy Gradients, Learning to Play Ms. Pac-Man Using Deep Q-Learning

tf.train.ClusterSpec, Multiple Devices Across Multiple Servers

tf.train.Coordinator, Multithreaded readers using a Coordinator and a QueueRunner-Multithreaded readers using a Coordinator and a QueueRunner

tf.train.exponential_decay(), Learning Rate Scheduling

tf.train.GradientDescentOptimizer, Using an Optimizer, Construction Phase, Gradient Clipping, Momentum Optimization, Adam Optimization

tf.train.MomentumOptimizer, Using an Optimizer, Momentum Optimization-Nesterov Accelerated Gradient, Learning Rate Scheduling, Exercises, TensorFlow implementation, Chapter 10: Introduction to Artificial Neural Networks-Chapter 11: Training Deep Neural Nets

tf.train.QueueRunner, Multithreaded readers using a Coordinator and a QueueRunner-Other convenience functions

tf.train.replica_device_setter(), Sharding Variables Across Multiple Parameter Servers-Sharing State Across Sessions Using Resource Containers

tf.train.RMSPropOptimizer, RMSProp

tf.train.Saver, Saving and Restoring Models-Saving and Restoring Models, Construction Phase, Exercises, Applying Dropout, Policy Gradients, Learning to Play Ms. Pac-Man Using Deep Q-Learning

tf.train.Server, Multiple Devices Across Multiple Servers

tf.train.start_queue_runners(), Other convenience functions

tf.transpose(), Linear Regression with TensorFlow-Manually Computing the Gradients, Static Unrolling Through Time, Tying Weights

tf.truncated_normal(), Construction Phase

tf.unstack(), Static Unrolling Through Time-Dynamic Unrolling Through Time, Training to Predict Time Series, Chapter 14: Recurrent Neural Networks

tf.Variable, Creating Your First Graph and Running It in a Session, Chapter 9: Up and Running with TensorFlow

tf.variable_scope(), Sharing Variables-Sharing Variables, Reusing Models from Other Frameworks, Sharing State Across Sessions Using Resource Containers, Training a Sequence Classifier, Learning to Play Ms. Pac-Man Using Deep Q-Learning

tf.zeros(), Construction Phase, Basic RNNs in TensorFlow, Tying Weights

truncated backpropagation through time, The Difficulty of Training over Many Time Steps

visualizing graph and training curves, Visualizing the Graph and Training Curves Using TensorBoard-Visualizing the Graph and Training Curves Using TensorBoard

TensorFlow Serving, One Neural Network per Device

tensorflow.contrib, Training an MLP with TensorFlow's High-Level API

test set, Testing and Validating, Create a Test Set-Create a Test Set, MNIST

testing and validating, Testing and Validating-Testing and Validating

text attributes, Handling Text and Categorical Attributes-Handling Text and Categorical Attributes

TextLineReader, Reading the training data directly from the graph

TF-slim, Up and Running with TensorFlow

tf.layers.conv1d(), ResNet

tf.layers.conv2d(), Learning to Play Ms. Pac-Man Using Deep Q-Learning

tf.layers.conv2d_transpose(), ResNet

tf.layers.conv3d(), ResNet

tf.layers.dense(), Xavier and He Initialization, Implementing Batch Normalization with TensorFlow

tf.layers.separable_conv2d(), ResNet

TF.Learn, Up and Running with TensorFlow, Training an MLP with TensorFlow's High-Level API

tf.nn.atrous_conv2d(), ResNet

tf.nn.depthwise_conv2d(), ResNet

thermal equilibrium, Boltzmann Machines

thread pools (inter-op/intra-op), in TensorFlow, Parallel Execution

threshold variable, Sharing Variables-Sharing Variables

Tikhonov regularization, Ridge Regression

time series data, Recurrent Neural Networks

toarray(), Handling Text and Categorical Attributes

tolerance hyperparameter, Computational Complexity

training, Implementing Batch Normalization with TensorFlow-Implementing Batch Normalization with TensorFlow, Applying Dropout

training data, What Is Machine Learning?

insufficient quantities, Insufficient Quantity of Training Data

irrelevant features, Irrelevant Features

loading, Loading Data Directly from the Graph-Other convenience functions

nonrepresentative, Nonrepresentative Training Data

overfitting, Overfitting the Training Data-Overfitting the Training Data

poor quality, Poor-Quality Data

underfitting, Underfitting the Training Data

training instance, What Is Machine Learning?

training models, Model-based learning, Training Models-Exercises

learning curves in, Learning Curves-Learning Curves

Linear Regression, Training Models, Linear Regression-Mini-batch Gradient Descent

Logistic Regression, Logistic Regression-Softmax Regression

overview, Training Models-Training Models

Polynomial Regression, Training Models, Polynomial Regression-Polynomial Regression

training objectives, Training Objective-Training Objective

training set, What Is Machine Learning?, Testing and Validating, Discover and Visualize the Data to Gain Insights, Prepare the Data for Machine Learning Algorithms, Training and Evaluating on the Training Set-Training and Evaluating on the Training Set

cost function of, Training and Cost Function-Training and Cost Function

shuffling, MNIST

transfer learning, Reusing Pretrained Layers-Pretraining on an Auxiliary Task

(see also pretrained layers reuse)

transform(), Data Cleaning, Transformation Pipelines

transformation pipelines, Transformation Pipelines-Select and Train a Model

transformers, Data Cleaning

transformers, custom, Custom Transformers-Custom Transformers

transpose(), Static Unrolling Through Time

true negative rate (TNR), The ROC Curve

true positive rate (TPR), Confusion Matrix, The ROC Curve

truncated backpropagation through time, The Difficulty of Training over Many Time Steps

tuples, Queues of tuples

tying weights, Tying Weights

U

underfitting, Underfitting the Training Data, Training and Evaluating on the Training Set, Gaussian RBF Kernel

univariate regression, Frame the Problem

unstack(), Static Unrolling Through Time

unsupervised learning, Unsupervised learning-Unsupervised learning

anomaly detection, Unsupervised learning

association rule learning, Unsupervised learning, Unsupervised learning

clustering, Unsupervised learning

dimensionality reduction algorithm, Unsupervised learning

visualization algorithms, Unsupervised learning

unsupervised pretraining, Unsupervised Pretraining-Unsupervised Pretraining, Unsupervised Pretraining Using Stacked Autoencoders-Unsupervised Pretraining Using Stacked Autoencoders

upsampling, ResNet

utility function, Model-based learning

V

validation set, Testing and Validating

Value Iteration, Markov Decision Processes

value_counts(), Take a Quick Look at the Data Structure

vanishing gradients, Vanishing/Exploding Gradients Problems

(see also gradients, vanishing and exploding)

variables, sharing, Sharing Variables-Sharing Variables

variable_scope(), Sharing Variables-Sharing Variables

variance

bias/variance tradeoff, Learning Curves

variance preservation, Preserving the Variance-Preserving the Variance

variance_scaling_initializer(), Xavier and He Initialization

variational autoencoders, Variational Autoencoders-Generating Digits

VGGNet, ResNet

visual cortex, The Architecture of the Visual Cortex

visualization, Visualizing the Graph and Training Curves Using TensorBoard-Visualizing the Graph and Training Curves Using TensorBoard

visualization algorithms, Unsupervised learning-Unsupervised learning

voice recognition, Convolutional Neural Networks

voting classifiers, Voting Classifiers-Voting Classifiers

W

warmup phase, Asynchronous updates

weak learners, Voting Classifiers

weight-tying, Tying Weights

weights, Construction Phase

freezing, Freezing the Lower Layers

while_loop(), Dynamic Unrolling Through Time

white box models, Making Predictions

worker, Multiple Devices Across Multiple Servers

worker service, The Master and Worker Services

worker_device, Sharding Variables Across Multiple Parameter Servers

workspace directory, Get the Data-Download the Data

X

Xavier initialization, Vanishing/Exploding Gradients Problems-Xavier and He Initialization

Y

YouTube, Introduction to Artificial Neural Networks

Z

zero padding, Convolutional Layer, TensorFlow Implementation

About the Author

Aurélien Géron is a Machine Learning consultant. A former Googler, he led the YouTube video classification team from 2013 to 2016. He was also a founder and CTO of Wifirst from 2002 to 2012, a leading Wireless ISP in France; and a founder and CTO of Polyconseil in 2001, the firm that now manages the electric car sharing service Autolib'.

Before this he worked as an engineer in a variety of domains: finance (JP Morgan and Société Générale), defense (Canada's DOD), and healthcare (blood transfusion). He published a few technical books (on C++, WiFi, and internet architectures), and was a Computer Science lecturer in a French engineering school.

A few fun facts: he taught his three children to count in binary with their fingers (up to 1023), he studied microbiology and evolutionary genetics before going into software engineering, and his parachute didn't open on the second jump.

Colophon

The animal on the cover of Hands-On Machine Learning with Scikit-Learn and TensorFlow is the far eastern fire salamander (Salamandra infraimmaculata), an amphibian found in the Middle East. They have black skin featuring large yellow spots on their back and head. These spots are a warning coloration meant to keep predators at bay. Full-grown salamanders can be over a foot in length.

Far eastern fire salamanders live in subtropical shrubland and forests near rivers or other freshwater bodies. They spend most of their life on land, but lay their eggs in the water. They subsist mostly on a diet of insects, worms, and small crustaceans, but occasionally eat other salamanders. Males of the species have been known to live up to 23 years, while females can live up to 21 years.

Although not yet endangered, the far eastern fire salamander population is in decline. Primary threats include damming of rivers (which disrupts the salamander's breeding) and pollution. They are also threatened by the recent introduction of predatory fish, such as the mosquitofish. These fish were intended to control the mosquito population, but they also feed on young salamanders.

Many of the animals on O'Reilly covers are endangered; all of them are important to the world. To learn more about how you can help, go to animals.oreilly.com.

The cover image is from Wood's Illustrated Natural History. The cover fonts are URW Typewriter and Guardian Sans. The text font is Adobe Minion Pro; the heading font is Adobe Myriad Condensed; and the code font is Dalton Maag's Ubuntu Mono.

Preface

The Machine Learning Tsunami

Machine Learning in Your Projects

Objective and Approach

Prerequisites

Roadmap

Other Resources

Conventions Used in This Book

Using Code Examples

O’Reilly Safari

How to Contact Us

Acknowledgments

I. The Fundamentals of Machine Learning

1. The Machine Learning Landscape

What Is Machine Learning?

Why Use Machine Learning?

Types of Machine Learning Systems

Supervised/Unsupervised Learning

Batch and Online Learning

Instance-Based Versus Model-Based Learning

Main Challenges of Machine Learning

Insufficient Quantity of Training Data

Nonrepresentative Training Data

Poor-Quality Data

Irrelevant Features

Overfitting the Training Data

Underfitting the Training Data

Stepping Back

Testing and Validating

Exercises

2. End-to-End Machine Learning Project

Working with Real Data

Look at the Big Picture

Frame the Problem

Select a Performance Measure

Check the Assumptions

Get the Data

Create the Workspace

Download the Data

Take a Quick Look at the Data Structure

Create a Test Set

Discover and Visualize the Data to Gain Insights

Visualizing Geographical Data

Looking for Correlations

Experimenting with Attribute Combinations

Prepare the Data for Machine Learning Algorithms

Data Cleaning

Handling Text and Categorical Attributes

Custom Transformers

Feature Scaling

Transformation Pipelines

Select and Train a Model

Training and Evaluating on the Training Set

Better Evaluation Using Cross-Validation

Fine-Tune Your Model

Grid Search

Randomized Search

Ensemble Methods

Analyze the Best Models and Their Errors

Evaluate Your System on the Test Set

Launch, Monitor, and Maintain Your System

Try It Out!

Exercises

3. Classification

MNIST

Training a Binary Classifier

Performance Measures

Measuring Accuracy Using Cross-Validation

Confusion Matrix

Precision and Recall

Precision/Recall Tradeoff

The ROC Curve

Multiclass Classification

Error Analysis

Multilabel Classification

Multioutput Classification

Exercises

4. Training Models

Linear Regression

The Normal Equation

Computational Complexity

Gradient Descent

Batch Gradient Descent

Stochastic Gradient Descent

Mini-batch Gradient Descent

Polynomial Regression

Learning Curves

Regularized Linear Models

Ridge Regression

Lasso Regression

Elastic Net

Early Stopping

Logistic Regression

Estimating Probabilities

Training and Cost Function

Decision Boundaries

Softmax Regression

Exercises

5. Support Vector Machines

Linear SVM Classification

Soft Margin Classification

Nonlinear SVM Classification

Polynomial Kernel

Adding Similarity Features

Gaussian RBF Kernel

Computational Complexity

SVM Regression

Under the Hood

Decision Function and Predictions

Training Objective

Quadratic Programming

The Dual Problem

Kernelized SVM

Online SVMs

Exercises

6. Decision Trees

Training and Visualizing a Decision Tree

Making Predictions

Estimating Class Probabilities

The CART Training Algorithm

Computational Complexity

Gini Impurity or Entropy?

Regularization Hyperparameters

Regression

Instability

Exercises

7. Ensemble Learning and Random Forests

Voting Classifiers

Bagging and Pasting

Bagging and Pasting in Scikit-Learn

Out-of-Bag Evaluation

Random Patches and Random Subspaces

Random Forests

Extra-Trees

Feature Importance

Boosting

AdaBoost

Gradient Boosting

Stacking

Exercises

8. Dimensionality Reduction

The Curse of Dimensionality

Main Approaches for Dimensionality Reduction

Projection

Manifold Learning

PCA

Preserving the Variance

Principal Components

Projecting Down to d Dimensions

Using Scikit-Learn

Explained Variance Ratio

Choosing the Right Number of Dimensions

PCA for Compression

Incremental PCA

Randomized PCA

Kernel PCA

Selecting a Kernel and Tuning Hyperparameters

LLE

Other Dimensionality Reduction Techniques

Exercises

II. Neural Networks and Deep Learning

9. Up and Running with TensorFlow

Installation

Creating Your First Graph and Running It in a Session

Managing Graphs

Lifecycle of a Node Value

Linear Regression with TensorFlow

Implementing Gradient Descent

Manually Computing the Gradients

Using autodiff

Using an Optimizer

Feeding Data to the Training Algorithm

Saving and Restoring Models

Visualizing the Graph and Training Curves Using TensorBoard

Name Scopes

Modularity

Sharing Variables

Exercises

10. Introduction to Artificial Neural Networks

From Biological to Artificial Neurons

Biological Neurons

Logical Computations with Neurons

The Perceptron

Multi-Layer Perceptron and Backpropagation

Training an MLP with TensorFlow’s High-Level API

Training a DNN Using Plain TensorFlow

Construction Phase

Execution Phase

Using the Neural Network

Fine-Tuning Neural Network Hyperparameters

Number of Hidden Layers

Number of Neurons per Hidden Layer

Activation Functions

Exercises

11. Training Deep Neural Nets

Vanishing/Exploding Gradients Problems

Xavier and He Initialization

Nonsaturating Activation Functions

Batch Normalization

Gradient Clipping

Reusing Pretrained Layers

Reusing a TensorFlow Model

Reusing Models from Other Frameworks

Freezing the Lower Layers

Caching the Frozen Layers

Tweaking, Dropping, or Replacing the Upper Layers

Model Zoos

Unsupervised Pretraining

Pretraining on an Auxiliary Task

Faster Optimizers

Momentum Optimization

Nesterov Accelerated Gradient

AdaGrad

RMSProp

Adam Optimization

Learning Rate Scheduling

Avoiding Overfitting Through Regularization

Early Stopping

ℓ1 and ℓ2 Regularization

Dropout

Max-Norm Regularization

Data Augmentation

Practical Guidelines

Exercises

12. Distributing TensorFlow Across Devices and Servers

Multiple Devices on a Single Machine

Installation

Managing the GPU RAM

Placing Operations on Devices

Parallel Execution

Control Dependencies

Multiple Devices Across Multiple Servers

Opening a Session

The Master and Worker Services

Pinning Operations Across Tasks

Sharding Variables Across Multiple Parameter Servers

Sharing State Across Sessions Using Resource Containers

Asynchronous Communication Using TensorFlow Queues

Loading Data Directly from the Graph

Parallelizing Neural Networks on a TensorFlow Cluster

One Neural Network per Device

In-Graph Versus Between-Graph Replication

Model Parallelism

Data Parallelism

Exercises

13. Convolutional Neural Networks

The Architecture of the Visual Cortex

Convolutional Layer

Filters

Stacking Multiple Feature Maps

TensorFlow Implementation

Memory Requirements

Pooling Layer

CNN Architectures

LeNet-5

AlexNet

GoogLeNet

ResNet

Exercises

14. Recurrent Neural Networks

Recurrent Neurons

Memory Cells

Input and Output Sequences

Basic RNNs in TensorFlow

Static Unrolling Through Time

Dynamic Unrolling Through Time

Handling Variable-Length Input Sequences

Handling Variable-Length Output Sequences

Training RNNs

Training a Sequence Classifier

Training to Predict Time Series

Creative RNN

Deep RNNs

Distributing a Deep RNN Across Multiple GPUs

Applying Dropout

The Difficulty of Training over Many Time Steps

LSTM Cell

Peephole Connections

GRU Cell

Natural Language Processing

Word Embeddings

An Encoder–Decoder Network for Machine Translation

Exercises

15. Autoencoders

Efficient Data Representations

Performing PCA with an Undercomplete Linear Autoencoder

Stacked Autoencoders

TensorFlow Implementation

Tying Weights

Training One Autoencoder at a Time

Visualizing the Reconstructions

Visualizing Features

Unsupervised Pretraining Using Stacked Autoencoders

Denoising Autoencoders

TensorFlow Implementation

Sparse Autoencoders

TensorFlow Implementation

Variational Autoencoders

Generating Digits

Other Autoencoders

Exercises

16. Reinforcement Learning

Learning to Optimize Rewards

Policy Search

Introduction to OpenAI Gym

Neural Network Policies

Evaluating Actions: The Credit Assignment Problem

Policy Gradients

Markov Decision Processes

Temporal Difference Learning and Q-Learning

Exploration Policies

Approximate Q-Learning

Learning to Play Ms. Pac-Man Using Deep Q-Learning

Exercises

Thank You!

A. Exercise Solutions

Chapter 1: The Machine Learning Landscape

Chapter 2: End-to-End Machine Learning Project

Chapter 3: Classification

Chapter 4: Training Models

Chapter 5: Support Vector Machines

Chapter 6: Decision Trees

Chapter 7: Ensemble Learning and Random Forests

Chapter 8: Dimensionality Reduction

Chapter 9: Up and Running with TensorFlow

Chapter 10: Introduction to Artificial Neural Networks

Chapter 11: Training Deep Neural Nets

Chapter 12: Distributing TensorFlow Across Devices and Servers

Chapter 13: Convolutional Neural Networks

Chapter 14: Recurrent Neural Networks

Chapter 15: Autoencoders

Chapter 16: Reinforcement Learning

B. Machine Learning Project Checklist

Frame the Problem and Look at the Big Picture

Get the Data

Explore the Data

Prepare the Data

Short-List Promising Models

Fine-Tune the System

Present Your Solution

Launch!

C. SVM Dual Problem

D. Autodiff

Manual Differentiation

Symbolic Differentiation

Numerical Differentiation

Forward-Mode Autodiff

Reverse-Mode Autodiff

E. Other Popular ANN Architectures

Hopfield Networks

Boltzmann Machines

Restricted Boltzmann Machines

Deep Belief Nets

Self-Organizing Maps

Index