Big Data Essentials
Copyright © 2016 by Anil K. Maheshwari, Ph.D.
By purchasing this book, you agree not to copy or distribute the book by any means, mechanical or electronic.
No part of this book may be copied or transmitted without written permission.
Other books by the same author:
Data Analytics Made Accessible, the #1 Bestseller in Data Mining
Moksha: Liberation Through Transcendence
Preface
Big Data is a new, and inclusive, natural phenomenon. It is as messy as nature itself. It requires a new kind of Consciousness to fathom its scale and scope, and its many opportunities and challenges. Understanding the essentials of Big Data requires suspending many conventional expectations and assumptions about data … such as completeness, clarity, consistency, and conciseness. Fathoming and taming the multi-layered Big Data is a dream that is slowly becoming a reality. It is a rapidly evolving field that is growing exponentially in value and capabilities.
There is a growing number of books being written on Big Data. They fall mostly in two categories. The first kind focus on business aspects, and discuss the strategic internal shifts required for reaping the business benefits from the many opportunities offered by Big Data. The second kind focus on particular technology platforms, such as Hadoop or Spark. This book aims to bring together the business context and the technologies in a seamless way.
This book was written to meet the needs for an introductory Big Data course. It is meant for students, as well as executives, who wish to take advantage of emerging opportunities in Big Data. It provides an intuition of the wholeness of the field in a simple language, free from jargon and code. All the essential Big Data technology tools and platforms such as Hadoop, MapReduce, Spark, and NoSQL are discussed. Most of the relevant programming details have been moved to Appendices to ensure readability. The short chapters make it easy to quickly understand the key concepts. A complete case study of developing a Big Data application is included.
Thanks to Maharishi Mahesh Yogi for creating a wonderful university whose consciousness-based environment made writing this evolutionary book possible. Thanks to many current and former students for contributing to this book. Dheeraj Pandey assisted with the Weblog analyzer application and its details. Suraj Thapalia assisted with the Hadoop installation guide. Enkhbileg Tseeleesuren helped write the Spark tutorial. Thanks to my family for supporting me in this process. My daughters Ankita and Nupur reviewed the book and made helpful comments. My father Mr. RL Maheshwari and brother Dr. Sunil Maheshwari also read the book and enthusiastically approved it. My colleague Dr. Edi Shivaji too reviewed the book.
May the Big Data Force be with you!
Dr. Anil Maheshwari
August 2016, Fairfield, IA
Contents
Preface
Chapter 1 – Wholeness of Big Data
Introduction
Understanding Big Data
CASELET: IBM Watson: A Big Data system
Capturing Big Data
Volume of Data
Velocity of Data
Variety of Data
Veracity of Data
Benefitting from Big Data
Management of Big Data
Organizing Big Data
Analyzing Big Data
Technology Challenges for Big Data
Storing Huge Volumes
Ingesting streams at an extremely fast pace
Handling a variety of forms and functions of data
Processing data at huge speeds
Conclusion and Summary
Organization of the rest of the book
Review Questions
Liberty Stores Case Exercise: Step B1
Section 1
Chapter 2 - Big Data Applications
Introduction
CASELET: Big Data Gets the Flu
Big Data Sources
People to People Communications
Social Media
People to Machine Communications
Web access
Machine to Machine (M2M) Communications
RFID tags
Sensors
Big Data Applications
Monitoring and Tracking Applications
Analysis and Insight Applications
New Product Development
Conclusion
Review Questions
Liberty Stores Case Exercise: Step B2
Chapter 3 - Big Data Architecture
Introduction
CASELET: Google Query Architecture
Standard Big data architecture
Big Data Architecture examples
IBM Watson
Netflix
Ebay
VMWare
The Weather Company
TicketMaster
Paypal
CERN
Conclusion
Review Questions
Liberty Stores Case Exercise: Step B3
Section 2
Chapter 4: Distributed Computing using Hadoop
Introduction
Hadoop Framework
HDFS Design Goals
Master-Slave Architecture
Block system
Ensuring Data Integrity
Installing HDFS
Reading and Writing Local Files into HDFS
Reading and Writing Data Streams into HDFS
Sequence Files
YARN
Conclusion
Review Questions
Chapter 5 – Parallel Processing with MapReduce
Introduction
MapReduce Overview
MapReduce programming
MapReduce Data Types and Formats
Writing MapReduce Programming
Testing MapReduce Programs
MapReduce Jobs Execution
How MapReduce Works
Managing Failures
Shuffle and Sort
Progress and Status Updates
Hadoop Streaming
Conclusion
Review Questions
Chapter 6 – NoSQL databases
Introduction
RDBMS Vs NoSQL
Types of NoSQL Databases
Architecture of NoSQL
CAP theorem
Popular NoSQL Databases
HBase
Architecture Overview
Reading and Writing Data
Cassandra
Architecture Overview
Reading and Writing Data
Hive Language
HIVE Language Capabilities
Pig Language
Conclusion
Review Questions
Chapter 7 – Stream Processing with Spark
Introduction
Spark Architecture
Resilient Distributed Datasets (RDD)
Directed Acyclic Graph (DAG)
Spark Ecosystem
Spark for big data processing
MLlib
Spark GraphX
SparkR
Spark SQL
Spark Streaming
Spark applications
Spark vs Hadoop
Conclusion
Review Questions
Chapter 8 – Ingesting Data
Wholeness
Messaging Systems
Point to Point Messaging System
Publish-Subscribe Messaging System
Apache Kafka
Use Cases
Kafka Architecture
Producers
Consumers
Broker
Topic
Summary of Key Attributes
Distribution
Guarantees
Client Libraries
Apache ZooKeeper
Kafka Producer example in Java
Conclusion
Review Questions
References
Chapter 9 – Cloud Computing Primer
Introduction
Cloud Computing Characteristics
In-house storage
Cloud storage
Cloud Computing: Evolution of Virtualized Architecture
Cloud Service Models
Cloud Computing Myths
Cloud Computing: Getting Started
Conclusion
Review Questions
Section 3
Chapter 10 – Web Log Analyzer application case study
Introduction
Client-Server Architecture
Web Log analyzer
Requirements
Solution Architecture
Benefits of this solution
Technology stack
Apache Spark
Spark Deployment
Components of Spark
HDFS
MongoDB
Apache Flume
Overall Application logic
Technical Plan for the Application
Scala Spark code for log analysis
Sample Log data
Sample Input Data:
Sample Output of Web Log Analysis
Conclusion and Findings
Review Questions
Chapter 10: Data Mining Primer
Gathering and selecting data
Data cleansing and preparation
Outputs of Data Mining
Evaluating Data Mining Results
Data Mining Techniques
Mining Big Data
From Causation to Correlation
From Sampling to the Whole
From Dataset to Datastream
Data Mining Best Practices
Conclusion
Review Questions
Appendix 1: Hadoop Installation on Amazon Web Services (AWS) Elastic Compute Cloud (EC2)
Creating Cluster server on AWS, Install Hadoop from Cloudera
Step 1: Creating Amazon EC2 Servers.
Step 2: Connecting server and installing required Cloudera distribution of Hadoop
Step 3: WordCount using MapReduce
Appendix 2: Spark Installation and Tutorial
Step 1: Verifying Java Installation
Step 2: Verifying Scala installation
Step 3: Downloading Scala
Step 4: Installing Scala
Step 5: Downloading Spark
Step 6: Installing Spark
Step 7: Verifying the Spark Installation
Step 8: Application: WordCount in Scala
Additional Resources
About the Author
Chapter 1 – Wholeness of Big Data
Introduction
Big Data is an all-inclusive term that refers to extremely large, very fast, diverse, and complex data that cannot be managed with traditional data management tools. Ideally, Big Data would harness all kinds of data, and deliver the right information, to the right person, in the right quantity, at the right time, to help make the right decision. Big Data can be managed by developing infinitely scalable, totally flexible, and evolutionary data architectures, coupled with the use of extremely cost-effective computing components. The infinite potential knowledge embedded within this cosmic computer would help connect everything to the Unified Field of all the laws of nature.
This book will provide a complete overview of Big Data for the executive and the data specialist. This chapter will cover the key challenges and benefits of Big Data, and the essential tools and technologies now available for organizing and manipulating Big Data.
Understanding Big Data
Big Data can be examined on two levels. On a fundamental level, it is data that can be analyzed and utilized for the benefit of the business. On another level, it is a special kind of data that poses unique challenges. This is the level that this book will focus on.
Figure 1.1: Big Data Context
At the level of business, data generated by business operations can be analyzed to generate insights that can help the business make better decisions. This makes the business grow bigger, and generate even more data, and the cycle continues. This is represented by the blue cycle on the top-right of Figure 1.1. This aspect is discussed in Chapter 10, a primer on Data Analytics.
On another level, Big Data is different from traditional data in every way: space, time, and function. The quantity of Big Data is on the order of 1,000 times more than that of traditional data. The speed of data generation and transmission is on the order of 1,000 times faster. The forms and functions of Big Data are much more diverse: from numbers to text, pictures, audio, videos, activity logs, machine data, and more. There are also many more sources of data, from individuals to organizations to governments, using a range of devices from mobile phones to computers to industrial machines. Not all data will be of equal quality and value. This is represented by the red cycle on the bottom left of Figure 1.1. This aspect of Big Data, and its new technologies, is the main focus of this book.
Big Data is mostly unstructured data. Every type of data is structured differently, and will have to be dealt with differently. There are huge opportunities for technology providers to innovate and manage the entire life cycle of Big Data … to generate, gather, store, organize, analyze, and visualize this data.
CASELET: IBM Watson: A Big Data system
IBM created the Watson system as a way of pushing the boundaries of Artificial Intelligence and natural language understanding technologies. Watson beat the world champion human players of Jeopardy (quiz style TV show) in Feb 2011. Watson reads up on data about everything on the web including the entire Wikipedia. It digests and absorbs the data based on simple generic rules such as: books have authors; stories have heroes; and drugs treat ailments. A Jeopardy clue, received in the form of a cryptic phrase, is broken down into many possible potential sub-clues of the correct answer. Each sub-clue is examined to see the likelihood of its answer being the correct answer for the main problem. Watson calculates the confidence level of each possible answer. If the confidence level reaches more than a threshold level, it decides to offer the answer to the clue. It manages to do all this in a mere 3 seconds.
Watson is now being applied to diagnosing diseases, especially cancer. Watson can read all the new research published in the medical journals to update its knowledge base. It is being used to diagnose the probability of various diseases, by applying factors such as patient’s current symptoms, health history, genetic history, medication records, and other factors to recommend a particular diagnosis. (Source: Smartest machines on Earth: youtube.com/watch?v=TCOhyaw5bwg)
Figure 1.2: IBM Watson playing Jeopardy
Q1: What kinds of Big Data knowledge, technologies and skills are required to build a system like Watson? What kind of resources are needed?
Q2: Will doctors be able to compete with Watson in diagnosing diseases and prescribing medications? Who else could benefit from a system like Watson?
Capturing Big Data
If data were simply growing too large, OR only moving too fast, OR only becoming too diverse, it would be relatively easy. However, when the four Vs (Volume, Velocity, Variety, and Veracity) arrive together in an interactive manner, it creates a perfect storm. While the Volume and Velocity of data drive the major technological concerns and the costs of managing Big Data, these two Vs are themselves being driven by the 3rd V, the Variety of forms and functions and sources of data.
Volume of Data
The quantity of data has been relentlessly doubling every 12-18 months. Traditional data is measured in Gigabytes (GB) and Terabytes (TB), but Big Data is measured in Petabytes (PB) and Exabytes (1 Exabyte = 1 Million TB).
This data is so huge that it is almost a miracle that one can find any specific thing in it, in a reasonable period of time. Searching the world-wide web was the first true Big Data application. Google perfected the art of this application, and developed many of the path-breaking technologies we see today to manage Big Data.
The primary reason for the growth of data is the dramatic reduction in the cost of storing data. The costs of storing data have decreased by 30-40% every year. Therefore, there is an incentive to record everything that can be observed. It is called ‘datafication’ of the world. The costs of computation and communication have also been coming down, similarly. Another reason for the growth of data is the increase in the number of forms and functions of data. More about this in the Variety section.
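The doubling rate and the yearly fall in storage cost quoted above compound quickly. The following is a minimal Python sketch of that arithmetic; the function names and the ten-year horizon are illustrative assumptions, not figures from the book.

```python
TB_PER_EXABYTE = 1_000_000  # 1 Exabyte = 1 million TB, as stated in the text


def growth_factor(years, doubling_months):
    """How many times larger the data becomes after `years`,
    if it doubles every `doubling_months` months."""
    return 2 ** (years * 12 / doubling_months)


def relative_storage_cost(years, annual_decline):
    """Cost of storing one unit of data, relative to today, after `years`
    of the cost falling by `annual_decline` (e.g. 0.30 for 30%) per year."""
    return (1 - annual_decline) ** years


if __name__ == "__main__":
    # Assumed horizon of 10 years, using both ends of the quoted ranges.
    for months in (12, 18):
        print(f"doubling every {months} months: x{growth_factor(10, months):.0f} in 10 years")
    for decline in (0.30, 0.40):
        print(f"{decline:.0%} yearly decline: {relative_storage_cost(10, decline):.4f} of today's unit cost")
```

Under these assumptions the volume grows roughly a hundredfold to a thousandfold in a decade, while the unit cost of storage falls to a few percent of its starting value or less.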
Velocity of Data
If traditional data is like a lake, Big Data is like a fast-flowing river. Big Data is being generated by billions of devices, and communicated at the speed of the internet. Ingesting all this data is like drinking from a fire hose. One does not have control over how fast the data will come. A huge unpredictable data-stream is the new metaphor for thinking about Big Data.
The primary reason for the increased velocity of data is the increase in internet speed. Internet speeds available to homes and offices are now increasing from 10 Mbps to 1 Gbps (100 times faster). More people are getting access to high-speed internet around the world. Another important reason is the increased variety of sources that can generate and communicate data from anywhere, at any time. More on that in the Variety section.
Variety of Data
Big Data is inclusive of all forms of data, for all kinds of functions, from all sources and devices. If traditional data, such as invoices and ledgers, were like a small store, Big Data is the biggest imaginable shopping mall that offers unlimited variety. There are three major kinds of variety.
1. The first aspect of variety is the form of data. Data types range in order of simplicity and size from numbers to text, graph, map, audio, video, and others. There could be a composite of data that includes many elements in a single file. For example, text documents have text and graphs and pictures embedded in them. Videos can have charts and songs embedded in them. Audio and video have different and more complex storage formats than numbers and text. Numbers and text can be more easily analyzed than an audio or video file. How should composite entities be stored and analyzed?
2. The second aspect is the variety of function of data. There are human chats and conversation data, songs and movies for entertainment, business transaction records, machine operations performance data, new product design data, old data for backup, etc. Human communication data would be processed very differently from operational performance data, with totally different objectives. A variety of applications are needed to compare pictures in order to recognize people’s faces; compare voices to identify the speaker; and compare handwritings to identify the writer.
3. The third aspect of variety is the source of data. Mobile phones and tablet devices enable a wide series of applications or apps to access data and generate data anytime, anywhere. Web access logs are another new and huge source of diagnostic data. ERP systems generate massive amounts of structured business transactional information. Sensors on machines, and RFID tags on assets, generate incessant and repetitive data. Broadly speaking, there are three types of sources of data: human-to-human communications; human-to-machine communications; and machine-to-machine communications. The sources of data, and their respective applications arising from that data, will be discussed in the next chapter.
Figure 1.3: Sources of Big Data (Source: Hortonworks.com)
Veracity of Data
Veracity relates to the believability and quality of data. Big Data is messy. There is a lot of misinformation and disinformation. The reasons for poor quality of data can range from human and technical error, to malicious intent.
1. The source of information may not be authoritative. For example, not all websites are equally trustworthy. Any information from whitehouse.gov or from nytimes.com is more likely to be authentic and complete. Wikipedia is useful, but not all pages are equally reliable. The communicator may have an agenda or a point of view.
2. The data may not be received correctly because of human or technical failure. Sensors and machines for gathering and communicating data may malfunction and may record and transmit incorrect data. Urgency may require the transmission of the best data available at a point in time. Such data makes reconciliation with later, accurate records more problematic.
3. The data provided and received may, however, also be intentionally wrong, for competitive or security reasons.
Data needs to be sifted and organized by quality factors, for it to be put to any great use.
Benefitting from Big Data
Data usually belongs to the organization that generates it. There is other data, such as social media data, that is freely accessible under an open general license. Organizations can use this data to learn about their consumers, improve their service delivery, and design new products to delight their customers and to gain a competitive advantage. Data is also like a new natural resource. It is being used to design new digital products, such as on-demand entertainment and learning.
Organizations may choose to gather and store this data for later analysis, or to sell it to other organizations, who might benefit from it. They may also legitimately choose to discard parts of their data for privacy or legal reasons. However, organizations cannot afford to ignore Big Data. Organizations that do not learn to engage with Big Data could find themselves left far behind their competition, landing in the dustbin of history. Innovative small and new organizations can use Big Data to quickly scale up and beat larger and more mature organizations.
Big Data applications exist in all industries and aspects of life. There are three major types of Big Data applications: Monitoring and Tracking, Analysis and Insight, and new digital product development.
Monitoring and Tracking Applications: Consumer goods producers use monitoring and tracking applications to understand the sentiments and needs of their customers. Industrial organizations use Big Data to track inventory in massive interlinked global supply chains. Factory owners use it to monitor machine performance and do preventive maintenance. Utility companies use it to predict energy consumption, and manage demand and supply. Information Technology companies use it to track website performance and improve its usefulness. Financial organizations use it to project trends better and make more effective and profitable bets, etc.
Analysis and Insight: Political organizations use Big Data to micro-target voters and win elections. Police use Big Data to predict and prevent crime. Hospitals use it to better diagnose diseases and make medicine prescriptions. Ad agencies use it to design more targeted marketing campaigns quickly. Fashion designers use it to track trends and create more innovative products.
Figure 1.4: The first Big Data President
New Product Development: Incoming data could be used to design new products such as reality TV entertainment. Stock market feeds could be a digital product. This area needs much more development.
Management of Big Data
Many organizations have started initiatives around the use of Big Data. However, most organizations do not necessarily have a grip on it. Here are some emerging insights into making better use of Big Data.
1. Across all industries, the business case for Big Data is strongly focused on addressing customer-centric objectives. The first focus on deploying Big Data initiatives is to protect and enhance customer relationships and customer experience.
2. Solve a real pain-point. Big Data should be deployed for specific business objectives in order to have management avoid being overwhelmed by the sheer size of it all.
3. Organizations are beginning their pilot implementations by using existing and newly accessible internal sources of data. It is better to begin with data under one’s control and where one has a superior understanding of the data.
4. Put humans and data together to get the most insight. Combining data-based analysis with human intuition and perspectives is better than going just one way.
5. Advanced analytical capabilities are required, but lacking, for organizations to get the most value from Big Data. There is a growing awareness of building or hiring those skills and capabilities.
6. Use more diverse data, not just more data. This would provide a broader perspective into reality and better quality insights.
7. The faster you analyze the data, the more its predictive value. The value of data depreciates with time. If the data is not processed in five minutes, then the immediate advantage is lost.
8. Don’t throw away data if no immediate use can be seen for it. Data has value beyond what you initially anticipate. Data can add perspective to other data later on in a multiplicative manner.
9. Maintain one copy of your data, not multiple. This would help avoid confusion and increase efficiency.
10. Plan for exponential growth. Data is expected to continue to grow at exponential rates. Storage costs continue to fall, data generation continues to grow, data-based applications continue to grow in capability and functionality.
11. A scalable and extensible information management foundation is a prerequisite for big data advancement. Big Data builds upon a resilient, secure, efficient, flexible, and real-time information processing environment.
12. Big Data is transforming business, just like IT did. Big Data is a new phase representing a digital world. Business and society are not immune to its strong impacts.
Organizing Big Data
Good organization depends upon the purpose of the organization.
Given huge quantities, it would be desirable to organize the data to speed up the search process for finding a specific, desired thing in the entire data. The cost of storing and processing the data, too, would be a major driver for the choice of an organizing pattern.
Given the fast speed of data, it would be desirable to create a scalable number of ingest points. It will also be desirable to create at least a thin veneer of control over the data by maintaining counts and averages over time, unique values received, etc.
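The "thin veneer of control" described above can be pictured as a small summary object that is updated as each value arrives, without storing the stream itself. This is a minimal Python sketch; the class name `StreamSummary` and the sample readings are hypothetical.

```python
class StreamSummary:
    """Keeps a running count, a running average, and the set of
    unique values seen, updated one incoming value at a time."""

    def __init__(self):
        self.count = 0
        self.total = 0.0
        self.unique = set()

    def ingest(self, value):
        """Update the summary with one incoming value."""
        self.count += 1
        self.total += value
        self.unique.add(value)

    @property
    def average(self):
        """Average of all values seen so far (0.0 before any arrive)."""
        return self.total / self.count if self.count else 0.0


summary = StreamSummary()
for reading in [3, 5, 3, 9]:  # hypothetical sensor readings
    summary.ingest(reading)
print(summary.count, summary.average, len(summary.unique))
```

One caveat on this design: an exact set of unique values grows with the data, so very large streams usually rely on approximate counting structures instead.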
Given the variety in form factors, data needs to be stored and analyzed differently. Videos need to be stored separately and used for serving in a streaming mode. Text data may be combined, cleaned, and visualized for themes and sentiments.
Given different quality levels of data, various data sources may need to be ranked and prioritized before serving them to the audience. For example, the quality of a web page may be computed through a PageRank mechanism.
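To make the PageRank idea concrete: a page ranks highly when other highly ranked pages link to it. The sketch below is a simplified textbook version written in Python, not Google's production system; the four-page link graph, the damping factor of 0.85, and the fixed 50 iterations are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by repeated redistribution of rank.

    `links` maps each page to the list of pages it links to.
    Every linked-to page must also appear as a key in `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page keeps a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: page "c" is linked to by three other pages.
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph, the page with the most incoming links ends up ranked first, and the page that nobody links to ends up last.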
Analyzing Big Data
Big Data can be analyzed in two ways. These are called analyzing Big Data in motion or Big Data at rest. The first way is to process the incoming stream of data in real time for quick and effective statistics about the data. The other way is to store and structure the data and apply standard analytical techniques on batches of data for generating insights. This could then be visualized using real-time dashboards. Big Data can be utilized to visualize a flowing or a static situation. The nature of processing this huge, diverse, and largely unstructured data can be limited only by one’s imagination.
Figure 1.5: Big Data Architecture
A million points of data can be plotted in a graph and offer a view of the density of data. However, plotting a million points on the graph may produce a blurred image which may hide, rather than highlight, the distinctions. In such a case, binning the data would help, or selecting the top few frequent categories may deliver greater insights. Streaming data can also be visualized by simple counts and averages over time. For example, below is a dynamically updated chart that shows up-to-date statistics of visitor traffic to my blog site, anilmah.com. The bar shows the number of page views, and the inner darker bar shows the number of unique visitors. The dashboard could show the view by days, weeks or years also.
Figure 1.6: Real-time Dashboard for website performance for the author’s blog
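The two simplifications mentioned above, binning the data and keeping only the most frequent categories, can each be written in a few lines. This is a minimal Python sketch; the function names, the bin width, and the randomly generated points are illustrative assumptions.

```python
import random
from collections import Counter


def bin_counts(values, bin_width):
    """Count how many values fall into each bin of width `bin_width`.
    Each bin is labeled by its lower edge."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


def top_categories(labels, k):
    """Return the k most frequent labels with their counts."""
    return Counter(labels).most_common(k)


if __name__ == "__main__":
    random.seed(0)
    # A million made-up points reduce to roughly ten bars.
    points = [random.uniform(0, 100) for _ in range(1_000_000)]
    for lower_edge, count in bin_counts(points, 10).items():
        print(f"{lower_edge:5.0f} to {lower_edge + 10:3.0f}: {count}")
```

The same `Counter` approach works for a word cloud: counting words instead of bins gives the frequencies that decide how large each word is drawn.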
Text data could be combined, filtered, cleaned, thematically analyzed, and visualized in a word cloud. Here is a word cloud from a recent stream of tweets (i.e., Twitter messages) from US Presidential candidates Hillary Clinton and Donald Trump. Larger words imply greater frequency of occurrence in the tweets. This can help understand the major topics of discussion between the two.
Figure 1.7: A word cloud of Hillary Clinton’s and Donald Trump’s tweets
Technology Challenges for Big Data
There are four major technological challenges, and matching layers of technologies to manage Big Data.
Storing Huge Volumes
The first challenge relates to storing huge quantities of data. No machine can be big enough to store the relentlessly growing quantity of data. Therefore, data needs to be stored in a large number of smaller inexpensive machines. However, with a large number of machines, there is the inevitable challenge of machine failure. Each of these commodity machines will fail at some point or another. Failure of a machine could entail a loss of data stored on it.
The first layer of Big Data technology helps store huge volumes of data, while avoiding the risk of data loss. It distributes data across the large cluster of inexpensive commodity machines, and ensures that every piece of data is stored on multiple machines, so that at least one copy remains available when a machine fails. Hadoop is the most well-known clustering technology for Big Data. Its data storage pattern is called Hadoop Distributed File System (HDFS). This system is built on the pattern of the Google File System (GFS), designed to store billions of pages and sort them to answer user search queries.
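The idea of storing every piece of data on multiple machines can be sketched as a placement table. The Python below is a deliberately simplified illustration and not the HDFS implementation: real HDFS chooses machines with rack awareness, whereas this sketch assumes simple round-robin placement, and the machine names are made up. The default of three copies mirrors the usual HDFS replication factor.

```python
def place_blocks(num_blocks, machines, replication=3):
    """Assign each block to `replication` distinct machines, round-robin."""
    if replication > len(machines):
        raise ValueError("need at least as many machines as replicas")
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            machines[(block + i) % len(machines)] for i in range(replication)
        ]
    return placement


def surviving_blocks(placement, failed):
    """Blocks that still have at least one copy after `failed` machines go down."""
    return [block for block, nodes in placement.items() if set(nodes) - set(failed)]


cluster = ["m1", "m2", "m3", "m4"]  # hypothetical machine names
placement = place_blocks(6, cluster)
print(placement[0])                                # machines holding block 0
print(surviving_blocks(placement, ["m1", "m2"]))   # two machines fail
```

With three copies of each block, any two machines can fail in this toy cluster without a block being lost; losing three machines at once can still destroy some blocks, which is why the number of copies is a tunable trade-off between safety and storage cost.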
Ingesting streams at an extremely fast pace
The second challenge relates to the Velocity of data, i.e., handling torrential streams of data. Some of them may be too large to store, but must still be ingested and monitored. The solution lies in creating special ingesting systems that can open a practically unlimited number of channels for receiving data. These queuing systems can hold data, from which consumer applications can request and process data at their own pace.
Big Data technology manages this velocity problem using a special stream-processing engine, where all incoming data is fed into a central queuing system. From there, a fork-shaped system sends data in two directions: to batch processing and to stream processing. The stream-processing engine can do its work while the batch processing does its work. Apache Spark is one of the most popular systems for streaming applications.
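The queue-and-fork idea can be illustrated with a toy in-memory queue. This is a sketch under stated assumptions, not the API of any real ingest system: the class names are hypothetical, and each consumer simply remembers how far it has read.

```python
class IngestQueue:
    """Toy central queue: events are appended and kept in arrival order."""

    def __init__(self):
        self._events = []

    def publish(self, event):
        self._events.append(event)

    def read_from(self, offset, max_events):
        """Return up to `max_events` events starting at position `offset`."""
        return self._events[offset:offset + max_events]


class Consumer:
    """Reads from the queue at its own pace by tracking its own offset."""

    def __init__(self, queue, batch_size):
        self.queue = queue
        self.batch_size = batch_size
        self.offset = 0

    def poll(self):
        events = self.queue.read_from(self.offset, self.batch_size)
        self.offset += len(events)
        return events


queue = IngestQueue()
for reading in [10, 20, 30, 40, 50]:
    queue.publish(reading)

# The "fork": two independent consumers read the same queue.
stream_consumer = Consumer(queue, batch_size=1)    # one event at a time
batch_consumer = Consumer(queue, batch_size=100)   # everything accumulated so far

first = stream_consumer.poll()
second = stream_consumer.poll()
everything = batch_consumer.poll()
print(first, second, everything)
```

Because each consumer keeps its own position, the slow batch path and the fast stream path never block each other.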
Handling a variety of forms and functions of data
The third challenge relates to the structuring and access of all the varieties of data that comprise Big Data. Storing them in traditional flat or relational file structures would be too wasteful and slow. The third layer of Big Data technology solves this problem by storing the data in non-relational systems that relax many of the stringent conditions of the relational model. These are called NoSQL (Not Only SQL) databases.
HBase and Cassandra are two of the better-known NoSQL database systems. HBase, for example, stores each data element separately along with its key identifying information. This is called a key-value pair format. Cassandra is a wide-column store that similarly organizes data by key, with a flexible set of columns for each row. There are many other variants of NoSQL databases. High-level data access languages, such as Pig and Hive, are used to access this data.
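To make the key-value pair format concrete, here is a minimal sketch of a store that keeps every value under its own key. It is an illustration only; the class, method names, and sample records are hypothetical and do not reflect the HBase or Cassandra APIs.

```python
class CellStore:
    """Toy store: each value is kept separately under a (row key, column) pair."""

    def __init__(self):
        self._cells = {}

    def put(self, row_key, column, value):
        self._cells[(row_key, column)] = value

    def get(self, row_key, column, default=None):
        return self._cells.get((row_key, column), default)

    def row(self, row_key):
        """Collect every stored column for one row key."""
        return {
            column: value
            for (key, column), value in self._cells.items()
            if key == row_key
        }


store = CellStore()
store.put("user:1", "name", "Asha")
store.put("user:1", "city", "Fairfield")
store.put("user:2", "name", "Ben")   # this row has no "city" column at all

print(store.row("user:1"))
print(store.get("user:2", "city"))
```

Unlike a relational table, rows need not share the same columns, so missing values take up no space.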
Processing data at huge speeds
The fourth challenge relates to moving large amounts of data from storage to the processor, as this would consume enormous network capacity and choke the network. The alternative and innovative approach is to move the processing to the data.
The second layer of Big Data technology avoids the choking of the network. It distributes the task logic throughout the cluster of machines where the data is stored. Those machines work in parallel, each on the data assigned to it. A follow-up process consolidates the outputs of all the small tasks and delivers the final results. MapReduce, also invented by Google, is the best-known technology for parallel processing of distributed Big Data.
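The pattern of many small tasks followed by a consolidation step can be sketched with the classic word-count example. This single-process sketch only mimics the flow; the function names are illustrative, and in a real cluster each map task would run on the machine that holds its chunk of data.

```python
from collections import defaultdict

def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of text."""
    return [(word, 1) for word in chunk.lower().split()]

def shuffle(mapped_outputs):
    """Group all emitted values by key, across every mapper's output."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Consolidate the values for each key into a single total."""
    return {key: sum(values) for key, values in grouped.items()}

# Each string stands in for a chunk of data stored on a different machine.
chunks = ["big data is big", "data moves fast"]
mapped = [map_phase(chunk) for chunk in chunks]
totals = reduce_phase(shuffle(mapped))
print(totals)
```

Only the small (word, count) pairs travel across the network, not the raw chunks, which is what keeps the network from choking.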
Table 1.1: Technological challenges and solutions for Big Data

| Challenge | Description | Solution | Technology |
| --- | --- | --- | --- |
| Volume | Avoid risk of data loss from machine failure in clusters of commodity machines | Replicate segments of data in multiple machines; master node keeps track of segment location | HDFS |
| Volume & Velocity | Avoid choking of network bandwidth by moving large volumes of data | Move processing logic to where the data is stored; manage using parallel processing algorithms | MapReduce |
| Variety | Efficient storage of large and small data objects | Columnar databases using key-value pair format | HBase, Cassandra |
| Velocity | Monitoring streams too large to store | Fork-shaped architecture to process data as stream and as batch | Spark |
Once these major technological challenges are met, all traditional analytical and presentation tools can be applied to Big Data. There are many additional supportive technologies to make the task of managing Big Data easier. For example, a resource manager (such as YARN) can help monitor the resource usage and load balancing of the machines in the cluster.
Conclusion and Summary
Big Data is a major phenomenon that impacts everyone, and is an opportunity to create new ways of working. Big Data is extremely large, complex, fast, and not always clean. It is data that comes from many sources such as people, the web, and machine communications. It needs to be gathered, organized, and processed in a cost-effective way that manages the volume, velocity, variety, and veracity of Big Data. Hadoop and Spark systems are popular technological platforms for this purpose. Here is a list of the many differences between traditional data and Big Data.
Table 1.2: Comparing Big Data with Traditional Data

| Feature | Traditional Data | Big Data |
| --- | --- | --- |
| Representative Structure | Lake / Pool | Flowing Stream / River |
| Primary Purpose | Manage business activities | Communicate, Monitor |
| Source of data | Business transactions, documents | Social media, Web access logs, machine generated |
| Volume of data | Gigabytes, Terabytes | Petabytes, Exabytes |
| Velocity of data | Ingest level is controlled | Real-time unpredictable ingest |
| Variety of data | Alphanumeric | Audio, Video, Graphs, Text |
| Veracity of data | Clean, more trustworthy | Varies depending on source |
| Structure of data | Well-structured | Semi- or Un-structured |
| Physical Storage of Data | In a Storage Area Network | Distributed clusters of commodity computers |
| Database organization | Relational databases | NoSQL databases |
| Data Access | SQL | NoSQL such as Pig |
| Data Manipulation | Conventional data processing | Parallel processing |
| Data Visualization | Variety of tools | Dynamic dashboards with simple measures |
| Database Tools | Commercial systems | Open-source: Hadoop, Spark |
| Total Cost of System | Medium to High | High |
Organization of the rest of the book
This book will cover applications, architectures, and the essential Big Data technologies. The rest of the book is organized as follows.
Section 1 will discuss sources, applications, and architectural topics. Chapter 2 will discuss a few compelling business applications of Big Data, based on the understanding of the different sources and formats of data. Chapter 3 will cover some examples of architectures used by many Big Data applications.
Section 2 will discuss the six major technology elements identified in the Big Data Ecosystem (Figure 1.5). Chapter 4 will discuss Hadoop and how its Distributed File System (HDFS) works. Chapter 5 will discuss MapReduce and how this parallel processing algorithm works. Chapter 6 will discuss NoSQL databases, to learn how to structure the data into databases for fast access. The Pig and Hive languages, for data access, will be included. Chapter 7 will cover streaming data, and the systems for ingesting and processing this data. This chapter will cover Spark, an integrated, in-memory processing toolset to manage Big Data. Chapter 8 will cover data ingest systems, with Apache Kafka. Chapter 9 will be a primer on Cloud Computing technologies used for renting storage and computers at third-party locations.
Section 3 will include primers and tutorials. Chapter 10 will present a case study on the web log analyzer, an application that ingests a log of a large number of web request entries every day and can create summary and exception reports. Chapter 11 will be a primer on data analytics technologies for analyzing data. A full treatment can be found in my book, Data Analytics Made Accessible. Appendix 1 will be a tutorial on installing a Hadoop cluster on the Amazon EC2 cloud. Appendix 2 will be a tutorial on installing and using Spark.
Review Questions
Q1: What is Big Data? Why should anyone care?
Q2: Describe the 4V model of Big Data.
Q3: What are the major technological challenges in managing Big Data?
Q4: What are the technologies available to manage Big Data?
Q5: What kinds of analyses can be done on Big Data?
Q6: Watch the Cloudera CEO present the evolution of Hadoop at https://www.youtube.com/watch?v=S9xnYBVqLws. Why did people not pay attention to Hadoop and MapReduce when they were introduced? What implications does this have for emerging technologies?
Liberty Stores Case Exercise: Step B1
Liberty Stores Inc. is a specialized global retail chain that sells organic food, organic clothing, wellness products, and education products to enlightened LOHAS (Lifestyles of the Healthy and Sustainable) citizens worldwide. The company is 20 years old, and is growing rapidly. It now operates in 5 continents, 50 countries, and 150 cities, and has 500 stores. It sells 20,000 products and has 10,000 employees. The company has revenues of over $5 billion and a profit of about 5% of its revenue. The company pays special attention to the conditions under which the products are grown and produced. It donates about one-fifth (20%) of its pre-tax profits to global and local charitable causes.
Q1: Create a comprehensive Big Data strategy for the CEO of the company.
Q2: How can Big Data systems such as IBM Watson help this company?
Section 1
This section covers three important high-level topics.
Chapter 2 will cover Big Data sources, and many applications in many industries.
Chapter 3 will cover architectures for managing Big Data.
Chapter 2 - Big Data Applications
Introduction
If a traditional software application is a lovely cat, then a Big Data application is a powerful tiger. An ideal Big Data application will take advantage of all the richness of data and produce relevant information to make the organization responsive and successful. Big Data applications can align the organization with the totality of natural laws, the source of all success.
Companies like the consumer goods giant Procter & Gamble have inserted Big Data into all aspects of their planning and operations. The industrial giant Volkswagen asks all its business units to identify some realistic initiative using Big Data to grow their unit's sales. The entertainment giant Netflix processes 400 billion user actions every day. These are some of the biggest users of Big Data.
Figure 2-0-1: Big Data application is a powerful tiger (Source: Flickr.com)
CASELET: Big Data Gets the Flu
Google Flu Trends was an enormously successful influenza forecasting service, pioneered by Google. It employed Big Data, such as the stream of search terms used in its ubiquitous Internet search service. The program aimed to better predict flu outbreaks using data and information from the U.S. Centers for Disease Control and Prevention (CDC). What was most amazing was that this application was able to predict the onset of flu almost two weeks before the CDC saw it coming. From 2004 till about 2012 it was able to successfully predict the timing and geographical location of the arrival of the flu season around the world.
Figure 2-0-2: Google Flu Trends
However, it failed spectacularly to predict the 2013 flu outbreak. Data used to predict Ebola's spread in 2014-15 yielded wildly inaccurate results, and created a major panic. Newspapers across the globe spread this application's worst-case scenarios for the Ebola outbreak of 2014.
Google Flu Trends failed for two reasons: Big Data hubris and algorithmic dynamics. (a) The quantity of data does not mean that one can ignore foundational issues of measurement, construct validity and reliability, and dependencies among data. (b) Google Flu Trends predictions were based on a commercial search algorithm that frequently changes, based on Google's business goals. This uncertainty skewed the data in ways even Google engineers did not understand, which in turn skewed the accuracy of the predictions. Perhaps the biggest lesson is that there is far less information in the data typically available in the early stages of an outbreak than is needed to parameterize the test models.
Q1: What lessons would you learn from the death of a prominent and highly successful Big Data application?
Q2: What other Big Data applications can be inspired by the success of this application?
Big Data Sources
Big Data is inclusive of all data about all activities everywhere. It can, thus, potentially transform our perspective on life and the universe. It brings new insights in real time and can make life happier and make the world more productive. Big Data can, however, also bring perils, in terms of violation of privacy, and social and economic disruption.
There are three major categories of data sources: human communications, human-machine communications, and machine-machine communications.
People to People Communications
People and corporations increasingly communicate over electronic networks. Distance and time have been annihilated. Everyone communicates through phone and email. News travels instantly. Influential networks have expanded. The content of communication has become richer and multimedia. High-resolution cameras in mobile phones enable people to take pictures and videos, and instantly share them with friends and family. All these communications are stored in the facilities of many intermediaries, such as telecom and internet service providers. Social media is a new, but particularly transformative, type of human-human communication.
Social Media
Social media platforms such as Facebook, Twitter, LinkedIn, YouTube, Flickr, Tumblr, Skype, Snapchat, and others have become an increasingly intimate part of modern life. These are among the hundreds of social media that people use, and they generate huge streams of text, pictures, videos, logs, and other multimedia data.
People share messages and pictures through social media such as Facebook and YouTube. They share photo albums through Flickr. They communicate in short asynchronous messages with each other on Twitter. They make friends on Facebook, and follow others on Twitter. They do video conferencing using Skype, and leaders deliver messages that sometimes go viral through social media. All these data streams are part of Big Data, and can be monitored and analyzed to understand many phenomena, such as patterns of communication, as well as the gist of the conversations. These media have been used for a wide variety of purposes with stunning effects.
Figure 2-0-3: Sampling of major social media
People to Machine Communications
Sensors and the web are two of the kinds of machines that people communicate with. Personal assistants such as Siri and Cortana are the latest in human-machine communications, as they try to understand human requests in natural language and fulfil them. Wearable devices such as Fitbit and smart watches are smart devices that read, store, and analyze people's personal data such as blood pressure and weight, food and exercise data, and sleep patterns. The World Wide Web is like a knowledge machine that people interact with to get answers to their queries.
Web access
The World Wide Web has integrated itself into all parts of human and machine activity. The usage of the tens of billions of pages by billions of web users generates huge amounts of enormously valuable clickstream data. Every time a web page is requested, a log entry is generated at the provider end. The web page provider tracks the identity of the requesting device and user, and the time and spatial location of each request. On the requester side, there are small pieces of data called cookies, which track the web pages received, the date and time of access, and some identifying information about the user. All the web access logs and cookie records can provide web usage records that can be analyzed to discover opportunities for marketing purposes.
A web log analyzer is an application required to monitor streaming web access logs in real time to check on website health and to flag errors. A detailed case study of a practical development of this application is shown in Chapter 10.
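The core of such an analyzer can be sketched briefly. This is an illustration only, not the case-study application: it assumes a hypothetical, simplified log format of three space-separated fields (timestamp, path, HTTP status code), whereas real web server logs carry more fields.

```python
from collections import Counter

def summarize(log_lines):
    """Count requests per status code and collect error entries (status >= 400)."""
    status_counts = Counter()
    errors = []
    for line in log_lines:
        parts = line.split()
        # Skip malformed entries rather than failing on them.
        if len(parts) != 3 or not parts[2].isdigit():
            continue
        timestamp, path, status = parts
        code = int(status)
        status_counts[code] += 1
        if code >= 400:
            errors.append((timestamp, path, code))
    return status_counts, errors

# Made-up sample entries in the assumed format.
sample_log = [
    "2016-05-01T10:00:00 /index.html 200",
    "2016-05-01T10:00:01 /about.html 200",
    "2016-05-01T10:00:02 /missing.html 404",
    "2016-05-01T10:00:03 /checkout 500",
    "this line is malformed",
]
status_counts, errors = summarize(sample_log)
print(status_counts)
print(errors)
```

The status counts feed a summary report, while the error list is the basis for an exception report.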
Machine to Machine (M2M) Communications
M2M communications are also sometimes called the Internet of Things (IoT). A trillion devices are connected to the internet, and they communicate with each other or with some master machines. All this data can be accessed and harnessed by the makers and owners of those machines.
Machines and equipment have many kinds of sensors to measure certain environmental parameters, which can be broadcast to communicate their status. RFID tags and sensors embedded in machines help generate the data. Containers on ships are tagged with RFID tags that convey their location to all those who can listen. Similarly, when pallets of goods are moved in warehouses or large retail stores, those pallets carry electromagnetic (RFID) tags that convey their location. Cars carry an RFID transponder to identify themselves to automated toll booths and pay the tolls. Robots in a factory, and internet-connected refrigerators in a house, continually broadcast a 'heartbeat' to signal that they are functioning normally. Surveillance videos using commodity cameras are another major source of machine-generated data.
Automobiles contain sensors that record and communicate operational data. A modern car can generate many megabytes of data every day, and there are more than 1 billion motor vehicles on the road. Thus the automotive industry by itself generates huge amounts of data. Self-driving cars would only add to the quantity of data generated.
RFID tags
An RFID tag is a radio transmitter with a small antenna that can respond to, and communicate essential information to, special readers through a Radio Frequency (RF) channel. A few years ago, major retailers such as Walmart decided to invest in RFID technology to take the retail industry to a new level. They required their suppliers to invest in RFID tags on the supplied products. Today, almost all retailers and manufacturers have implemented RFID-tag-based solutions.
Figure 2-0-4: A small passive RFID tag
Here is how an RFID tag works. When a passive RFID tag comes into the vicinity of an RF reader and is 'tickled', the tag responds by broadcasting a fixed identifying code. An active RFID tag has its own battery and storage, and can store and communicate a lot more information. Every reading of a message from an RFID tag by an RF reader creates a log entry. Thus there is a steady stream of data from every reader as it records information about all the RFID tags in its area of influence. The records may be logged regularly, and thus there will be many more records than are necessary to track the location and movement of an item. All the duplicate and redundant records are removed, to produce clean, consolidated data about the location and status of items.
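The clean-up of redundant reads can be sketched as follows. This is one simple way to do it, assuming each read is a hypothetical (timestamp, tag id, reader id) record; a read is kept only when a tag shows up at a different reader than the one that saw it last.

```python
def consolidate_reads(reads):
    """Collapse repeated reads into one movement record per location change.

    `reads` is an iterable of (timestamp, tag_id, reader_id) tuples.
    Returns the list of movements and the latest known reader for each tag.
    """
    last_reader = {}
    movements = []
    for timestamp, tag_id, reader_id in sorted(reads):
        # Keep the read only if the tag has moved since its previous read.
        if last_reader.get(tag_id) != reader_id:
            movements.append((timestamp, tag_id, reader_id))
            last_reader[tag_id] = reader_id
    return movements, last_reader

# Made-up reads: tagA is seen twice at the dock, then twice in aisle 5.
raw_reads = [
    (1, "tagA", "dock"),
    (2, "tagA", "dock"),
    (3, "tagA", "aisle5"),
    (1, "tagB", "dock"),
    (4, "tagA", "aisle5"),
    (5, "tagB", "dock"),
]
movements, current_location = consolidate_reads(raw_reads)
print(movements)
print(current_location)
```

Six raw records shrink to three movement records, while the latest location of each item is still known.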
Sensors
A sensor is a small device that can observe and record physical or chemical parameters. Sensors are everywhere. A photo sensor in an elevator or train door can sense if someone is moving, and thus keep the door from closing. A CCTV camera can record a video for surveillance purposes. A GPS device can record its geographical location at every moment.
Figure 2-0-5: An embedded sensor
Temperature sensors in a car can measure the temperature of the engine, the tires, and more. The thermostat in a building or a refrigerator also has a temperature sensor. A pressure sensor can measure the pressure inside an industrial boiler.
Big Data Applications
Monitoring and Tracking Applications
Public Health Monitoring
The US government is encouraging all healthcare stakeholders to establish a national platform for interoperability and data sharing standards. This would enable secondary use of health data, which would advance Big Data analytics and personalized holistic precision medicine. This would be a broad-based platform like the Google Flu Trends case.
Consumer Sentiment Monitoring
Social media has become more powerful than advertising. Many consumer goods companies have moved the bulk of their marketing budgets from traditional advertising media into social media. They have set up Big Data listening platforms, where social media data streams (including tweets, Facebook posts, and blog posts) are filtered and analyzed for certain keywords or sentiments, by certain demographics and regions. Actionable information from this analysis is delivered to marketing professionals for appropriate action, especially when the product is new to the market.
Figure 2-0-6: Architecture for a Listening Platform (Source: Intelligenthq.com)
Asset tracking
The US Department of Defense is encouraging the industry to devise a tiny RFID chip that could prevent the counterfeiting of electronic parts that end up in avionics or circuit boards for other devices. Airplanes are among the heaviest users of sensors, which track every aspect of the performance of every part of the plane. The data can be displayed on the dashboard, as well as stored for later detailed analysis. Working with communicating devices, these sensors can produce a torrent of data.
Theft by visitors, shoppers, and even employees is a major source of loss of revenue for retailers. All valuable items in the store can be assigned RFID tags, and the gates of the store equipped with RF readers. This helps secure the products and reduce leakage (theft) from the store.
Supply chain monitoring
All containers on ships communicate their status and location using RFID tags. Thus, retailers and their suppliers can gain real-time visibility into the inventory throughout the global supply chain. Retailers can know exactly where the items are in the warehouse, and so can bring them into the store at the right time. This is particularly relevant for seasonal items that need to be sold on time, or else they will have to be sold at a discount. With item-level RFID tags, retailers also gain full visibility of each item and can serve their customers better.
Electricity Consumption Tracking
Electric utilities can track the status of generating and transmission systems, and also measure and predict the consumption of electricity. Sophisticated sensors can help monitor voltage, current, frequency, temperature, and other vital operating characteristics of huge and expensive electric distribution infrastructure. Smart meters can measure the consumption of electricity at regular intervals of one hour or less. This data is analyzed to make real-time decisions to maximize power capacity utilization and total revenue generation.
Preventive Machine Maintenance
All machines, including cars and computers, will fail sometime, because one or more of their components will fail. Any precious equipment could be equipped with sensors. The continuous stream of data from the sensors could be monitored and analyzed to forecast the status of key components, and thus monitor the overall machine's health. Preventive maintenance can be scheduled to reduce the cost of downtime.
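One simple way to watch a sensor stream for signs of trouble is a rolling average with an alert limit. The sketch below is an illustration under assumed values: the window size, the limit, and the temperature readings are all hypothetical, and real systems typically use richer forecasting models.

```python
from collections import deque

def first_alert(readings, window=3, limit=90.0):
    """Return the index at which the rolling mean of the last `window`
    readings first exceeds `limit`, or None if it never does."""
    recent = deque(maxlen=window)  # automatically drops the oldest reading
    for index, value in enumerate(readings):
        recent.append(value)
        if len(recent) == window and sum(recent) / window > limit:
            return index
    return None

# Made-up engine temperatures that drift upward over time.
temperatures = [80, 82, 85, 88, 91, 95, 99]
alert_index = first_alert(temperatures)
print(alert_index)
```

Averaging over a window avoids raising an alert on a single noisy reading, while still catching a sustained upward drift early enough to schedule maintenance.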
Analysis and Insight Applications
Big Data can be structured and analyzed using data mining techniques to produce insights and patterns that can be used to make business better.
Predictive Policing
The Los Angeles Police Department (LAPD) invented the concept of Predictive Policing. The LAPD worked with UCLA researchers to analyze its large database of 13 million crimes recorded over 80 years, and predicted the likelihood of crimes of certain types, at certain times, and in certain locations. They identified hotspots where crimes had occurred, and where crime was likely to happen in the future. Crime patterns were mathematically modeled after a simple insight borrowed from the metaphor of earthquakes and their aftershocks. In essence, it said that once a crime occurred in a location, it represented a certain disturbance in harmony, and would thus lead to a greater likelihood of a similar crime occurring in the local vicinity in the near future. The model showed, for each police beat, the specific neighborhood blocks and specific time slots where crime was likely to occur.
Figure 2-0-7: LAPD officer on predictive policing (Source: nbclosangeles.com)
By aligning the police cars' patrol schedules with the model's predictions, the LAPD was able to reduce crime by 12% to 26% for different categories of crime. Recently, the San Francisco Police Department released its own crime data covering over 2 years, so that data analysts could model that data and help prevent future crimes.
Winning Political Elections
The US President, Barack Obama, was the first major political candidate to use Big Data in a significant way, in the 2008 elections. He is the first Big Data president. His campaign gathered data about millions of people, including supporters. They invented the "Donate Now" button for use in emails to obtain campaign contributions from millions of supporters. They created personal profiles of millions of supporters, recording what they had done and could do for the campaign. Data was used to identify undecided voters who could be converted to their side. They provided phone numbers of these undecided voters to the supporters to call, and then recorded the outcome of those calls over the web, using interactive applications. Obama himself used his Twitter account to communicate his messages directly to his millions of followers.
After the elections, Obama converted the list of supporters into an advocacy machine that would provide grassroots support for the President's initiatives. Since then, almost all campaigns have used Big Data. Senator Bernie Sanders used the same Big Data playbook to build an effective national political machine powered entirely by small donors. Analyst Nate Silver created sophisticated predictive models, using inputs from many political polls and surveys, that outperformed pundits in predicting the winners of US elections. Silver was, however, unsuccessful in predicting Donald Trump's rise, which shows the limits of Big Data.
Personal Health
Correct diagnosis is the sine qua non of effective treatment. Medical knowledge and technology are growing by leaps and bounds. IBM Watson is a Big Data analytics engine that ingests and digests vast quantities of medical information, and then applies it intelligently to an individual situation. Watson can help provide a detailed medical diagnosis using current symptoms, patient history, medication history, environmental trends, and other parameters. Similar products might be offered as an app to licensed doctors, and even individuals, to improve productivity and accuracy in healthcare.
New Product Development
These applications are totally new concepts that did not exist earlier.
Flexible auto insurance
An auto insurance company can use the GPS data from cars to calculate the risk of accidents based on travel patterns. Automobile companies can use the car sensor data to track the performance of a car. Safer drivers can be rewarded and errant drivers can be penalized.
Figure 2-0-8: GPS-based tracking of vehicles
Location-based retail promotion
A retailer, or a third-party advertiser, can target customers with specific promotions and coupons based on location data obtained through GPS, the time of day, and the presence of stores nearby, mapped to the consumer preference data available from social media databases. Ads and offers can be delivered through mobile apps, SMS, and email. These are examples of mobile apps.
Recommendation service
Ecommerce has been a fast-growing industry over the last couple of decades. A variety of products are sold and shared over the internet. Web users' browsing and purchase history on ecommerce sites is utilized to learn about their preferences and needs, and to advertise relevant product and pricing offers in real time. Amazon uses a personalized recommendation engine to suggest additional products to consumers based on the affinities of various products. Netflix also uses a recommendation engine to suggest entertainment options to its users.
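Product affinity can be estimated, in its simplest form, by counting how often two products are bought together. The sketch below shows that idea only; it is not how Amazon or Netflix actually build their engines, and the function names and shopping baskets are made up.

```python
from collections import Counter
from itertools import combinations

def co_purchase_counts(baskets):
    """Count how many baskets contain each pair of products."""
    counts = Counter()
    for basket in baskets:
        # Sort so that each pair is always stored in the same order.
        for first, second in combinations(sorted(set(basket)), 2):
            counts[(first, second)] += 1
    return counts

def recommend(product, counts, top_n=2):
    """Suggest the products most often bought together with `product`."""
    scores = Counter()
    for (first, second), times in counts.items():
        if first == product:
            scores[second] += times
        elif second == product:
            scores[first] += times
    return [item for item, _ in scores.most_common(top_n)]

# Made-up purchase histories.
baskets = [
    ["tea", "honey", "mug"],
    ["tea", "honey"],
    ["tea", "mug"],
    ["honey", "soap"],
    ["tea", "honey", "soap"],
]
counts = co_purchase_counts(baskets)
suggestions = recommend("tea", counts)
print(suggestions)
```

A shopper who puts tea in the cart would be shown honey first, because the two were bought together most often.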
Conclusion
Big Data has applicability across all industries. There are three major types of data sources of Big Data. They are people-people communications, people-machine communications, and machine-machine communications. Each type has many sources of data. There are three types of applications. They are the monitoring type, the analysis type, and new product development. This chapter presents a few business applications of each of those three types.
Review Questions
Q1: What are the major sources of Big Data? Describe a source of each type.
Q2: What are the three major types of Big Data applications? Describe two applications of each type.
Q3: Would it be ethical to arrest someone based on a Big Data model's prediction that the person is likely to commit a crime?
Q4: An auto insurance company learned about the movements of a person based on the GPS installed in the vehicle. Would it be ethical to use that as a surveillance tool?
Q5: Research and describe a Big Data application that has a proven return on investment for an organization.
Liberty Stores Case Exercise: Step B2
The Board of Directors asked the company to take concrete and effective steps to become a data-driven company. The company wants to understand its customers better. It wants to improve the happiness levels of its customers and employees. It wants to innovate on new products that its customers would like. It wants to relate its charitable activities to the interests of its customers.
Q1: What kind of data sources should the company capture for this?
Q2: What kind of Big Data applications would you suggest for this company?
Chapter 3 - Big Data Architecture
Introduction
Big Data Application Architecture is the configuration of tools and modules to accomplish the whole task. An ideal architecture would be resilient, secure, cost-effective, and adaptive to new needs and environments. This is achieved by beginning with proven architectures, and creatively and progressively restructuring them with new elements as additional needs and problems arise. Big Data architectures ultimately align with the architecture of the Universe, the source of all invincibility.
CASELET: Google Query Architecture
Google invented the first Big Data architecture. Their goal was to gather all the information on the web, organize it, and search it for specific queries from millions of users. An additional goal was to find a way to monetize this service by serving relevant and prioritized online advertisements on behalf of clients.
Google developed web crawling agents which would follow all the links in the web and make a copy of all the content on all the webpages they visited.
Google invented cost-effective, resilient, and fast ways to store and process all that exponentially growing data. It developed a scale-out architecture in which it could linearly increase its storage capacity by inserting additional computers into its computing network. The data files were distributed over the large number of machines in the cluster. This distributed file system was called the Google File System, and was the precursor to HDFS.
Google would sort or index the data thus gathered so that it could be searched efficiently. They invented the key-value NoSQL database architecture to store a variety of data objects. They developed the storage system to avoid updates in the same place. Thus the data was written once, and read multiple times.
Figure 3‑0‑1: Google Query Architecture
Google developed the MapReduce parallel processing architecture whereby large datasets could be processed by thousands of computers in parallel, with each computer processing a chunk of data, to produce quick results for the overall job.
The Hadoop ecosystem of data management tools, such as the Hadoop Distributed File System (HDFS), columnar database systems like HBase, querying tools such as Hive, and more, emerged from Google's inventions. Storm is a streaming data technology that produces instant results. Lambda Architecture is a Y-shaped architecture that branches out the incoming data stream for batch as well as stream processing.
Q1: Why should Google publish its File System and the MapReduce parallel programming system and send them into the open-source system?
Q2: What else can be done with Google's repository of all the web's data?
Standard Big Data architecture
Here is the generic Big Data Architecture introduced in Chapter 1. There are many sources of data. All data is funneled in through an ingest system. The data is forked into two sides: a stream processing system and a batch processing system. The outcomes of this processing can be sent into NoSQL databases for later retrieval, or sent directly for consumption by many applications and devices.
Figure 3‑0‑2: Big Data Application Architecture
A big data solution typically comprises these logical layers. Each layer can be represented by one or more available technologies.
Big data sources: The sources of data for an application depend upon what data is required to perform the kind of analyses you need. The various sources of Big Data were described in chapter 2. The data will vary in origin, size, speed, form, and function, as described by the 4 Vs in chapter 1. Data sources can be internal or external to the organization. The scope of access to available data could be limited. The level of structure could be high or low. The speed of data and its quantity will also be high or low depending upon the data source.
Data ingest layer: This layer is responsible for acquiring data from the data sources. The data comes in through a scalable set of input points that can acquire it at various speeds and in various quantities. The data is sent to a batch processing system, a stream processing system, or directly to a storage file system (such as HDFS). Compliance regulations and governance policies impact what data can be stored and for how long.
Batch Processing layer: This analysis layer receives data from the ingest point, from the file system, or from the NoSQL databases. Data is processed using parallel programming techniques (such as MapReduce) to produce the desired results. This batch processing layer thus needs to understand the data sources and data types, the algorithms that would work on that data, and the format of the desired outcomes. The output of this layer could be sent for instant reporting, or stored in a NoSQL database for an on-demand report for the client.
Streaming Processing layer: This layer receives data directly from the ingest point. Data is processed in real time using parallel stream-processing techniques (such as Storm or Spark) to produce the desired results. This layer thus needs to understand the data sources and data types extremely well, and the super-light algorithms that would work on that data to produce the desired results. The outcome of this layer too could be stored in the NoSQL databases.
Data Organizing Layer: This layer receives data from both the batch and stream processing layers. Its objective is to organize the data for easy access. It is represented by NoSQL databases. High-level languages such as Hive and Pig can be used to easily access data and generate reports.
Data Consumption layer: This layer consumes the output provided by the analysis layers, directly or through the organizing layer. The outcome could be standard reports, data analytics, dashboards and other visualization applications, or recommendation engines, delivered on mobile and other devices.
Infrastructure Layer: At the bottom there is a layer that manages the raw resources of storage, compute, and communication. This is increasingly provided through a cloud computing paradigm.
Distributed File System Layer: This layer includes the Hadoop Distributed File System (HDFS). It would also include supporting applications, such as YARN (Yet Another Resource Negotiator), that enable efficient access to data storage and its transfer.
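The flow through these layers can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class name PipelineSketch, the in-memory list standing in for the file system, and the maps standing in for NoSQL stores are all hypothetical, and no real ingest or processing tool is involved. It shows a single ingest point forking each event to a stream path that is updated immediately and a batch path that is recomputed later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of the ingest -> (stream, batch) -> consumption flow.
final class PipelineSketch {
    // Stands in for the distributed file system that keeps all raw events.
    private final List<String> rawEvents = new ArrayList<>();
    // Stand in for the NoSQL stores that hold processed results.
    private final Map<String, Integer> streamView = new TreeMap<>();
    private final Map<String, Integer> batchView = new TreeMap<>();

    // Ingest layer: every event is forked to both processing paths.
    void ingest(String event) {
        rawEvents.add(event);                      // kept for later batch runs
        streamView.merge(event, 1, Integer::sum);  // lightweight real-time count
    }

    // Batch layer: recompute the complete view from all stored events.
    void runBatch() {
        batchView.clear();
        for (String event : rawEvents) {
            batchView.merge(event, 1, Integer::sum);
        }
    }

    // Consumption layer: applications read whichever view they need.
    int streamCount(String event) {
        return streamView.getOrDefault(event, 0);
    }

    int batchCount(String event) {
        return batchView.getOrDefault(event, 0);
    }

    public static void main(String[] args) {
        PipelineSketch pipeline = new PipelineSketch();
        pipeline.ingest("click");
        pipeline.ingest("view");
        pipeline.ingest("click");
        System.out.println("stream view of click: " + pipeline.streamCount("click"));
        pipeline.runBatch();
        System.out.println("batch view of click: " + pipeline.batchCount("click"));
    }
}
```

The stream view is available as soon as an event arrives, while the batch view only changes when a batch run completes; that trade-off is the reason for keeping both paths.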
Big Data Architecture examples
Every major organization and application has a unique optimized infrastructure to suit its specific needs. Below are some architecture examples from some very prominent users and designers of Big Data applications.
IBM Watson
IBM Watson uses Spark to manage incoming data streams. It also uses Spark's Machine Learning library (MLlib) to analyze data and predict diseases.
Netflix
This is one of the largest providers of online video entertainment. They handle 400 billion online events per day. As a cutting-edge user of big data technologies, they are constantly innovating their mix of technologies to deliver the best performance. Kafka is the common messaging system for all incoming requests. They host the entire infrastructure on Amazon Web Services (AWS). Data is stored in AWS's S3 as well as in Cassandra and HBase. Spark is used for stream processing.
(Source: Netflix)
eBay
eBay is the second-largest ecommerce company in the world. It delivers 800 million listings from 25 million sellers to 160 million buyers. To manage this huge stream of activity, eBay uses a stack of Hadoop, Spark, Kafka, and other elements. They think that Kafka is the best new thing for processing data streams.
VMware
Here is VMware's view of a Big Data architecture. It is similar to, but more detailed than, our main Big Data architecture diagram.
The Weather Company
The Weather Company serves weather data globally through websites and mobile apps. It uses a streaming architecture based on Apache Spark.
Ticketmaster
This is the world's largest company that sells event tickets. Their goal is to make tickets available to purchase for real fans, and prevent bad actors from manipulating the system to increase the price of the tickets in the secondary markets.
The goal of this professional networking company is to maintain an efficient system for processing the streaming data and make the link options available in real-time.
PayPal
This payments-facilitation company needs to understand and acquire customers, and process a large number of payment transactions.
CERN
This premier high-energy physics research lab computes petabytes of data, using in-memory stream processing to handle data from millions of sensors and devices.
Conclusion
Big Data applications are architected to do stream as well as batch processing. Data is ingested and fed into streaming and batch processing. Most tools used for big data processing are open source tools served through the Apache community, and some key distributors of those technologies.
Review Questions
Q1: Describe the Big Data processing architecture.
Q2: What are Google's contributions to Big Data processing?
Q3: What are some of the hottest technologies visible in Big Data processing?
Liberty Stores Case Exercise: Step B3
The company wants to build a scalable and futuristic platform for its Big Data.
Q1: What kind of Big Data processing architecture would you suggest for this company?
Section 2
This section covers the important Big Data technologies defined in the Big Data architecture specified in chapter 3.
Chapter 4 will cover Hadoop and its Distributed File System (HDFS).
Chapter 5 will cover the parallel processing algorithm, MapReduce.
Chapter 6 will cover NoSQL databases such as HBase and Cassandra. It will also cover the Pig and Hive languages used for accessing those databases.
Chapter 7 will cover Spark, a fast and integrated streaming data management platform.
Chapter 8 will cover data ingest systems, using Apache Kafka.
Chapter 9 will cover the Cloud Computing model.
Chapter 4: Distributed Computing using Hadoop
Introduction
A distributed system is a clever way of storing huge quantities of data, securely and cost-effectively, for speed and ease, for retrieval and processing, using a networked collection of commodity machines. The ideal distributed file system would store infinite amounts of data while making the complexity completely transparent to the user, and enable easy access to the right data instantly. This would be achieved by storing fragments of data at different locations, and internally managing the lower-level tasks of storing and replicating data across the network. The distributed system ultimately leads to the creation of the unbounded cosmic computer that is aligned with the Unified Field of all the laws of nature.
Hadoop Framework
The Apache Hadoop distributed computing framework is composed of the following modules:
1. Hadoop Common – contains libraries and utilities needed by other Hadoop modules
2. Hadoop Distributed File System (HDFS) – a distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster
3. YARN – a resource-management platform responsible for managing computing resources in clusters and using them for scheduling of users' applications, and
4. MapReduce – an implementation of the MapReduce programming model for large scale data processing.
This chapter will cover Hadoop Common, HDFS, and YARN. The next chapter will cover MapReduce.
HDFS Design Goals
The Hadoop distributed file system (HDFS) is a distributed and scalable file-system. It is designed for applications that deal with large data sizes. It is also designed to deal with mostly immutable files, i.e. write data once, but read it many times.
HDFS has the following major design goals:
1. Hardware failure management – it will happen, and one must plan for it.
2. Huge volume – create capacity for a large number of huge files, with fast read/write throughput
3. High speed – create a mechanism to provide low latency access to streaming applications
4. High variety – maintain simple data coherence, by writing data once but reading many times.
5. Open-source – maintain easy accessibility of data using any hardware, software, and database platform
6. Network efficiency – minimize network bandwidth requirement, by minimizing data movement
Master-Slave Architecture
Hadoop is an architecture for organizing computers in a master-slave relationship that helps achieve great scalability in processing. An HDFS cluster has two types of nodes operating in a master-worker pattern: a single master node (called NameNode), and a large number of slave worker nodes (called DataNodes). A small Hadoop cluster includes a single master and multiple worker nodes. A large Hadoop cluster would consist of a master and thousands of small ordinary machines as worker nodes.
Figure 4‑0‑1: Master-Slave Architecture
The master node manages the overall file system and its namespace, and controls the access to files by clients. The master node is aware of the data-nodes: i.e. what blocks of which file are stored on which data node. It also controls the processing plan for all applications running on the data on the cluster. There is only one master node. Unfortunately, that makes it a single point of failure. Therefore, whenever possible, the master node has a hot backup just in case the master node dies unexpectedly. The master node uses a transaction log to persistently record every change that occurs to file system metadata.
The worker nodes store the data blocks in their storage space, as directed by the master node. Each worker node typically contains many disks to maximize storage capacity and access speed. Each worker node has its own local file system. A worker node has no awareness of the distributed file structure. It simply stores each block of data as directed, as if each block were a separate file. The DataNodes store and serve up blocks of data over the network using a block protocol, under the direction of the NameNode.
Figure 4‑0‑2: Hadoop Architecture (Source: Hadoop.apache.org)
The NameNode stores all relevant information about all the DataNodes, and the files stored in those DataNodes. The NameNode will contain:
- For every DataNode: its name, rack, capacity, and health
- For every file: its name, replicas, type, size, timestamp, location, health, etc.
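The kind of bookkeeping listed above can be pictured as a pair of in-memory maps. The sketch below is a heavily simplified, hypothetical model: the class NameNodeMetadataSketch, its fields, and its methods are invented for illustration and are not Hadoop's internal classes. It only shows how a master could answer the question "which DataNodes hold block i of this file?".

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, heavily simplified model of the metadata a NameNode keeps.
final class NameNodeMetadataSketch {

    // What the master records about each DataNode.
    static final class DataNodeInfo {
        final String name;
        final String rack;
        final long capacityBytes;
        boolean healthy = true;

        DataNodeInfo(String name, String rack, long capacityBytes) {
            this.name = name;
            this.rack = rack;
            this.capacityBytes = capacityBytes;
        }
    }

    private final Map<String, DataNodeInfo> dataNodes = new HashMap<>();
    // File name -> for each block (in order), the DataNodes holding a replica.
    private final Map<String, List<List<String>>> fileBlocks = new HashMap<>();

    void registerDataNode(String name, String rack, long capacityBytes) {
        dataNodes.put(name, new DataNodeInfo(name, rack, capacityBytes));
    }

    // Record that the next block of a file is replicated on the given nodes.
    void addBlock(String file, List<String> replicaNodes) {
        for (String node : replicaNodes) {
            if (!dataNodes.containsKey(node)) {
                throw new IllegalArgumentException("unknown DataNode: " + node);
            }
        }
        fileBlocks.computeIfAbsent(file, f -> new ArrayList<>())
                  .add(new ArrayList<>(replicaNodes));
    }

    // Which DataNodes can serve block number blockIndex of the file?
    List<String> locate(String file, int blockIndex) {
        List<List<String>> blocks = fileBlocks.get(file);
        if (blocks == null || blockIndex < 0 || blockIndex >= blocks.size()) {
            return Collections.emptyList();
        }
        return Collections.unmodifiableList(blocks.get(blockIndex));
    }

    // Rack of a registered DataNode, or null if the node is unknown.
    String rackOf(String dataNode) {
        DataNodeInfo info = dataNodes.get(dataNode);
        return info == null ? null : info.rack;
    }
}
```

Because the workers know nothing about files, every lookup of this kind has to go through the master, which is why its metadata must fit in memory and be protected by a log and a backup.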
If a DataNode fails, there is no serious problem. The data on the failed DataNode will be accessed from its replicas on other DataNodes. The failed DataNode can be automatically recreated on another machine, by writing all those file blocks from the other healthy replicas. Each data-node sends a heartbeat message to the name-node periodically. Without this message, the DataNode is assumed to be dead. The DataNode replication effort would automatically kick in to replace the dead data-node.
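The heartbeat idea can be sketched as a table of "last seen" timestamps. This is an illustrative assumption, not Hadoop's implementation: the class HeartbeatMonitorSketch is hypothetical, the timeout is an arbitrary value chosen by the caller, and the clock is passed in explicitly so that the behavior is easy to follow.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: the master remembers when it last heard from each node.
final class HeartbeatMonitorSketch {
    private final long timeoutMillis;
    private final Map<String, Long> lastSeen = new HashMap<>();

    HeartbeatMonitorSketch(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Called whenever a heartbeat message arrives from a DataNode.
    void heartbeat(String dataNode, long nowMillis) {
        lastSeen.put(dataNode, nowMillis);
    }

    // Nodes whose last heartbeat is older than the timeout are presumed dead.
    List<String> deadNodes(long nowMillis) {
        List<String> dead = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastSeen.entrySet()) {
            if (nowMillis - entry.getValue() > timeoutMillis) {
                dead.add(entry.getKey());
            }
        }
        Collections.sort(dead); // stable, readable ordering
        return dead;
    }
}
```

A master using such a table would then schedule new replicas for every block that lived on a node reported dead.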
The file system has a set of features and capabilities to completely hide the splintering and scattering of data, and enable the user to deal with the data at a high, logical level.
The NameNode tries to ensure that files are evenly spread across the data-nodes in the cluster. That balances the storage and computing load, and also limits the extent of loss from the failure of a node. The NameNode also tries to optimize the networking load. When retrieving data or ordering the processing, the NameNode tries to pick fragments from multiple nodes to balance the processing load and speed up the total processing effort. The NameNode also tries to store fragments of files on the same node for speed of reading and writing. Processing is done on the node where the file fragment is stored.
Any piece of data is stored typically on three nodes: two on the same rack, and one on a different rack. DataNodes can talk to each other to rebalance data, to move copies around, and to keep the replication of data high.
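The "two on one rack, one on another" rule can be sketched as a small selection function. This is a simplification under stated assumptions: the class ReplicaPlacementSketch and its place method are hypothetical, the first replica's node is given by the caller, and the sketch ignores load, free space, and the other factors a real placement policy would weigh.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of rack-aware replica placement for one block.
final class ReplicaPlacementSketch {

    // nodeToRack maps DataNode name -> rack name (iteration order is the
    // order in which candidates are considered).
    static List<String> place(Map<String, String> nodeToRack, String firstNode) {
        String homeRack = nodeToRack.get(firstNode);
        if (homeRack == null) {
            throw new IllegalArgumentException("unknown node: " + firstNode);
        }
        String sameRackNode = null;
        String otherRackNode = null;
        for (Map.Entry<String, String> entry : nodeToRack.entrySet()) {
            if (entry.getKey().equals(firstNode)) {
                continue; // never place two replicas on the same node
            }
            if (entry.getValue().equals(homeRack)) {
                if (sameRackNode == null) sameRackNode = entry.getKey();
            } else if (otherRackNode == null) {
                otherRackNode = entry.getKey();
            }
        }
        List<String> chosen = new ArrayList<>();
        chosen.add(firstNode);
        if (sameRackNode != null) chosen.add(sameRackNode);   // second copy, same rack
        if (otherRackNode != null) chosen.add(otherRackNode); // third copy, different rack
        return chosen;
    }

    public static void main(String[] args) {
        Map<String, String> cluster = new LinkedHashMap<>();
        cluster.put("dn1", "rackA");
        cluster.put("dn2", "rackA");
        cluster.put("dn3", "rackB");
        System.out.println(place(cluster, "dn1"));
    }
}
```

Keeping one copy on a different rack is what allows the data to survive the loss of a whole rack, while the two same-rack copies keep most replication traffic local.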
Block system
HDFS stores large files (typically gigabytes to terabytes) by storing segments (called blocks) of the file across multiple machines. A block of data is the fundamental storage unit in HDFS. Data files are described, read and written in block-sized granularity. All storage capacity and file sizes are measured in blocks. A block ranges from 16-128 MB in size, with a default block size of 64 MB (the Hadoop 1.x default; Hadoop 2.x and later default to 128 MB). Thus, with 64 MB blocks, an HDFS file is chopped up into 64 MB chunks, and if possible, each chunk will reside on a different DataNode.
Every data file takes up a number of blocks depending upon its size. Thus a 100 MB file will occupy two blocks (100 MB divided by 64 MB, rounded up), with some room to spare. Every storage disk can accommodate a number of blocks depending upon the size of the disk. Thus a 1 terabyte storage will have about 16,000 blocks (1 TB divided by 64 MB).
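The block arithmetic above can be checked with a few lines of Java. The class BlockMath and its method names are hypothetical helpers written for this illustration, and the sketch assumes binary units (1 MB = 1024 × 1024 bytes, 1 TB = 1024 × 1024 MB).

```java
// Hypothetical helper illustrating the block arithmetic in the text.
final class BlockMath {
    static final long MB = 1024L * 1024L;
    static final long TB = 1024L * 1024L * MB;

    // Number of blocks a file occupies: ceiling of fileBytes / blockBytes.
    static long blocksNeeded(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes;
    }

    // Number of whole blocks that fit on a disk of the given size.
    static long blocksOnDisk(long diskBytes, long blockBytes) {
        return diskBytes / blockBytes;
    }

    public static void main(String[] args) {
        System.out.println(blocksNeeded(100 * MB, 64 * MB)); // prints 2
        System.out.println(blocksOnDisk(TB, 64 * MB));       // prints 16384
    }
}
```

With binary units a 1 TB disk holds 16,384 blocks of 64 MB, which is where the round figure of about 16,000 comes from.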
Every file is organized as a consecutively numbered sequence of blocks. A file's blocks are stored physically close to each other for ease of access, as far as possible. The file's block size and replication factor are configurable by the application that writes the file on HDFS.
Ensuring Data Integrity
Hadoop ensures that no data will be lost or corrupted, during storage or processing. The files are written only once, and never updated in place. They can be read many times. Only one client can write or append to a file at a time. No concurrent updates are allowed.
If data is indeed lost or corrupted, or if a part of the disk gets corrupted, a new healthy replica for that lost block will be automatically recreated by copying from the replicas on other data-nodes. At least one of the replicas is stored on a data-node on a different rack. This guards against the failure of the rack of nodes, or of the networking router on it.
A checksum algorithm is applied on all data written to HDFS. A process of serialization is used to turn files into a byte stream for transmission over a network or for writing to persistent storage. Hadoop has additional security built in, using Kerberos for authentication.
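How a checksum exposes corruption can be shown with the CRC32 class from the Java standard library. This is a sketch of the general technique only: the class ChecksumSketch is hypothetical, and the checksum type and chunk size that HDFS itself uses are configuration details not covered here.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Hypothetical sketch: store a checksum at write time, verify it at read time.
final class ChecksumSketch {

    // Compute a CRC-32 checksum over the given bytes.
    static long checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    // A block is considered intact if its checksum still matches the stored one.
    static boolean isIntact(byte[] data, long storedChecksum) {
        return checksum(data) == storedChecksum;
    }

    public static void main(String[] args) {
        byte[] block = "some block of data".getBytes(StandardCharsets.UTF_8);
        long stored = checksum(block);          // saved alongside the block
        System.out.println(isIntact(block, stored));
        block[0] ^= 0x01;                       // simulate a flipped bit on disk
        System.out.println(isIntact(block, stored));
    }
}
```

When verification fails, the reader can discard the damaged copy and fetch the block from another replica, which is how checksums and replication work together.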
Installing HDFS
It is possible to run Hadoop on an in-house cluster of machines, or on the cloud inexpensively. As an example, The New York Times used 100 Amazon Elastic Compute Cloud (EC2) instances (DataNodes) and a Hadoop application to process 4 TB of raw image TIFF data stored in Amazon Simple Storage Service (S3) into 11 million finished PDFs in the space of 24 hours at a computation cost of about $240 (not including bandwidth). See Chapter 9 for a primer on Cloud Computing. See Appendix 1 for a step-by-step tutorial on installing Hadoop on Amazon EC2.
Hadoop is written in Java, so it requires a working Java installation. Installing Hadoop takes a lot of resources. For example, all information about fragments of files needs to be in NameNode memory. A rule of thumb is that Hadoop needs approximately 1 GB of memory to manage 1 million file fragments. Many easy mechanisms exist to install the entire Hadoop stack. Using a GUI such as Cloudera Resources Manager to install a Cloudera Hadoop stack is easy. This stack includes HDFS and many other related components, such as HBase, Pig, YARN, and more. Installing it on a cluster on a cloud services provider like AWS is easier than installing Java Virtual Machines (JVMs). HDFS can be installed by using the Cloudera GUI Resources Manager. If installing from the command line, download Hadoop from one of the Apache mirror sites.
Because Hadoop is written in Java, most access to files is provided through the Java abstract class org.apache.hadoop.fs.FileSystem. HDFS can be mounted directly with a Filesystem in Userspace (FUSE) virtual file system on Linux and some other Unix systems. File access can be achieved through the native Java application programming interface (API). Another API, called Thrift, helps to generate a client in the language of the user's choosing (such as C++, Java, Python). When the hadoop command is invoked with a class name as the first argument, it launches a Java virtual machine (JVM) to run the class, along with the relevant Hadoop libraries (and their dependencies) on the classpath.
HDFS has a UNIX-like command line interface (CLI). Use the sh shell to communicate with Hadoop. HDFS has a UNIX-like permissions model for files and directories. There are three progressively increasing levels of permissions: read (r), write (w), and execute (x). Create an hduser account, and communicate using the ssh shell on the local machine.
% hadoop fs -help    ## get detailed help on every command
Reading and Writing Local Files into HDFS
There are two different ways to transfer data: from the local file system, or from an input/output stream. Copying a file from the local file system to HDFS can be done by:
% hadoop fs -copyFromLocal path/filename
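Reading and Writing Data Streams into HDFS
Reading a file from HDFS by using a java.net.URL object to open a stream requires a short script, as below. For java.net.URL to recognize the hdfs:// scheme, Hadoop's FsUrlStreamHandlerFactory must first be registered once per JVM through URL.setURLStreamHandlerFactory().
    // Needs java.io.InputStream, java.net.URL, org.apache.hadoop.io.IOUtils
    InputStream in = null;
    try {
        in = new URL("hdfs://host/path").openStream();
        // details of processing the stream "in"
    } finally {
        IOUtils.closeStream(in);  // always release the stream
    }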
A simple method to create a new file is as follows:
    public FSDataOutputStream create(Path p) throws IOException
Data can be appended to an existing file using the append() method:
    public FSDataOutputStream append(Path p) throws IOException
A directory can be created by a simple method:
    public boolean mkdirs(Path p) throws IOException
List the contents of a directory using:
    public FileStatus[] listStatus(Path p) throws IOException
    public FileStatus[] listStatus(Path p, PathFilter filter) throws IOException
Sequence Files
The incoming data files can range from very small to extremely large, and with different structures. Big Data files are therefore organized quite differently to handle the diversity of file sizes and types. Large files are stored as HDFS files, with file fragments distributed across the cluster. However, smaller files should be bunched together into a single segment for efficient storage.
Sequence Files are a specialized data structure within Hadoop to handle smaller files with smaller record sizes. A SequenceFile uses a persistent data structure for data available in key-value pair format. This helps to store smaller objects efficiently. HDFS and MapReduce are designed to work with large files, so packing small files into a SequenceFile container makes storing and processing the smaller files more efficient for HDFS and MapReduce.
Sequence files are row-oriented file formats, which means that the values for each row are stored contiguously in the file. This format is appropriate when a large number of columns of a single row are needed for processing at the same time. There are easy commands to create, read and write SequenceFile structures. Sorting and merging Sequence Files is native to the MapReduce system. A MapFile is essentially a sorted SequenceFile with an index to permit lookups by key.
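The idea of packing many small files into one key-value container can be sketched with plain Java streams. To be clear about the assumptions: the class SmallFilePackerSketch and its record layout (file name, length, bytes) are invented for this illustration and are not the actual SequenceFile on-disk format or API.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: many small files stored as key-value records
// (key = file name, value = file bytes) inside one container.
final class SmallFilePackerSketch {

    static byte[] pack(Map<String, byte[]> smallFiles) {
        ByteArrayOutputStream container = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(container)) {
            out.writeInt(smallFiles.size());          // number of records
            for (Map.Entry<String, byte[]> file : smallFiles.entrySet()) {
                out.writeUTF(file.getKey());          // key
                out.writeInt(file.getValue().length); // value length
                out.write(file.getValue());           // value bytes, stored contiguously
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return container.toByteArray();
    }

    static Map<String, byte[]> unpack(byte[] container) {
        Map<String, byte[]> files = new LinkedHashMap<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(container))) {
            int count = in.readInt();
            for (int i = 0; i < count; i++) {
                String key = in.readUTF();
                byte[] value = new byte[in.readInt()];
                in.readFully(value);
                files.put(key, value);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return files;
    }
}
```

One large container costs the file system a single entry to track, where the same content stored as thousands of tiny files would cost thousands of entries.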
YARN
YARN (Yet Another Resource Negotiator) is the architectural center of Hadoop. It is often characterized as a large-scale, distributed operating system for big data applications. YARN manages resources and monitors workloads, in a secure multi-tenant environment, while ensuring high availability across multiple Hadoop clusters. YARN also brings great flexibility as a common platform to run multiple tools and applications such as interactive SQL (e.g. Hive), real-time streaming (e.g. Spark), and batch processing (MapReduce), to work on data stored in a single HDFS storage platform. It gives clusters the scalability to expand beyond 1000 nodes. It also improves cluster utilization through dynamic allocation of cluster resources to various applications.
Figure 4‑0‑3: Hadoop Distributed Architecture including YARN
The Resource Manager in YARN has two main components: Scheduler and Applications Manager.
The YARN Scheduler allocates resources to the various requesting applications. It does so based on an abstract notion of a resource Container which incorporates elements such as memory, CPU, disk storage, network, etc. Each machine also has a NodeManager that manages all the Containers on that machine, and reports status on resources and Containers to the YARN Scheduler.
The YARN Applications Manager accepts new job submissions from the client. It then requests a first resource Container for the application-specific ApplicationMaster program, and monitors the health and execution of the application. Once running, the ApplicationMaster directly negotiates additional resource containers from the Scheduler as needed.
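The notion of a scheduler handing out resource containers can be sketched as simple bookkeeping of free memory and CPU cores per machine. Everything here is a hypothetical simplification: the class ContainerSchedulerSketch and its first-fit policy are invented for illustration, and real YARN schedulers apply queueing and fairness policies that are not modeled.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: grant containers from the first node with enough room.
final class ContainerSchedulerSketch {

    // Free resources reported for one machine.
    private static final class Node {
        int freeMemoryMb;
        int freeCores;

        Node(int memoryMb, int cores) {
            this.freeMemoryMb = memoryMb;
            this.freeCores = cores;
        }
    }

    private final Map<String, Node> nodes = new LinkedHashMap<>();

    void addNode(String name, int memoryMb, int cores) {
        nodes.put(name, new Node(memoryMb, cores));
    }

    // Returns the name of the node hosting the new container,
    // or null if no node currently has enough free resources.
    String allocate(int memoryMb, int cores) {
        for (Map.Entry<String, Node> entry : nodes.entrySet()) {
            Node node = entry.getValue();
            if (node.freeMemoryMb >= memoryMb && node.freeCores >= cores) {
                node.freeMemoryMb -= memoryMb;
                node.freeCores -= cores;
                return entry.getKey();
            }
        }
        return null;
    }

    // Called when a container finishes and its resources are returned.
    void release(String nodeName, int memoryMb, int cores) {
        Node node = nodes.get(nodeName);
        if (node == null) {
            throw new IllegalArgumentException("unknown node: " + nodeName);
        }
        node.freeMemoryMb += memoryMb;
        node.freeCores += cores;
    }
}
```

In this picture an application's first allocate call would be for its ApplicationMaster, and later calls would come from that ApplicationMaster as it asks for more containers.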
Conclusion
Hadoop is the major technology for managing big data. HDFS securely stores data on large clusters of commodity machines. A master machine controls the storage and processing activities of the worker machines. A NameNode controls the namespace and storage information for the file system on the DataNodes. In classic (pre-YARN) MapReduce, a master JobTracker controls the processing of tasks at the DataNodes. YARN is the resource manager that manages all resources dynamically and efficiently across all applications on the cluster. The Hadoop file system and other parts of the Hadoop stack are distributed by many vendors, and can be easily installed on cloud computing infrastructure. A Hadoop installation tutorial is in Appendix A.
Review Questions
Q1: How does Hadoop differ from a traditional file system?
Q2: What are the design goals for HDFS?
Q3: How does HDFS ensure security and integrity of data?
Q4: How does a master node differ from the worker node?
Chapter 5 – Parallel Processing with MapReduce
Introduction
A parallel processing system is a clever way to process huge amounts of data in a short period of time by enlisting the services of many computing devices to work on parts of the job simultaneously. The ideal parallel processing system will work across any computational problem, using any number of computing devices, across any size of data sets, with ease and high programmer productivity. This is achieved by framing the problem in a way that it can be broken down into many parts, such that each part can be partially processed independently of the other parts; and then the intermediate results from processing the parts can be combined to produce a final solution. Infinite parallel processing is the essence of infinite dynamism of the laws of nature.
MapReduce Overview
MapReduce is a parallel programming framework for speeding up large-scale data processing for certain types of tasks. It does so with minimal movement of data on distributed file systems such as HDFS clusters, so that results arrive far sooner than with serial processing. There are two major prerequisites for MapReduce programming: (a) the application must lend itself to parallel programming; (b) the data can be expressed in key-value pairs.
MapReduce processing is similar to the UNIX sequence (also called pipe) structure, e.g. the UNIX command:
grep -oE '[[:alpha:]]+' myfile.txt | sort -f | uniq -ci
will produce a word count of the text document called myfile.txt.
There are three commands in this sequence, and they work as follows: (a) grep reads the text file and creates an intermediate stream with one word on a line; (b) sort sorts that intermediate stream, and produces an alphabetically sorted list of the words (the -f flag ignores case); (c) the count command (uniq -c, with -i to ignore case) works on that sorted list to produce the number of occurrences of each word, and displays the results to the user as word and frequency pairs.
For example: Suppose myfile.txt contains the following text:
Myfile: We are going to a picnic near our house. Many of our friends are coming. You are welcome to join us. We will have fun.
The outputs of Grep, Sort and Word Count will be as shown below.
Grep      Sort      Word      Count
We        a         a         1
are       are       are       3
going     are       coming    1
to        are       friends   1
a         coming    fun       1
picnic    friends   going     1
near      fun       have      1
our       going     house     1
house     have      join      1
Many      house     many      1
of        join      near      1
our       many      of        1
friends   near      our       2
are       of        picnic    1
coming    our       to        2
You       our       us        1
are       picnic    we        2
welcome   to        welcome   1
to        to        will      1
join      us        you       1
us        We
we        we
will      welcome
have      will
fun       you
If the file is very large, then it will take the computer a long time to process it. Parallel processing can help here.
MapReduce speeds up the computation by having different computers read and process small chunks of the file in parallel. Thus, if a file can be broken down into 100 small chunks, each chunk can be processed at a separate computer in parallel. The total time taken to process the file could be as little as 1/100 of the time taken otherwise. However, the results of the computation on the small chunks now reside in 100 different places. This large number of partial results needs to be combined to produce a composite result. The outputs from the various chunks will be combined by another program called the Reduce program.
The Map step will distribute the full job into smaller tasks that can be done on separate computers, each using only a part of the data set. The result of the Map step will be considered as intermediate results. The Reduce step will read the intermediate results, combine all of them, and produce the final result. The programmer needs to specify the functional logic for both the map and reduce steps. The sorting between the Map and Reduce steps does not need to be specified; it is automatically taken care of by the MapReduce system as a standard service provided to every job. The sorting of the data requires a field to sort on. Thus the intermediate results need to have some kind of a key field, and a set of associated non-key attribute(s) for that key.
Figure 5-0-1: MapReduce Architecture
In practice, to manage the variety of data structures stored in the file system, data is stored as one key and one non-key attribute. Thus the data is represented as a key-value pair. The intermediate results and the final results will also be in key-value pair format. Thus a key requirement for the use of the MapReduce parallel processing system is that the input data and output data must both be represented in key-value format.
The Map step reads data in key-value pair format. The programmer decides what the characteristics of the key and value fields should be. The Map step produces results in key-value pair format. However, the keys produced by the Map step, i.e. in the intermediate results, need not be the same keys as in the input data. So, those can be called key2-value2 pairs.
The Reduce step reads the key2-value2 pairs, the intermediate results produced by the Map step. The Reduce step will produce an output using the same keys that it read. Only the values associated with those keys will change as a result of processing. Thus it can be labeled as key2-value3 format.
Suppose the text in myfile.txt can be split into 4 approximately equal segments. It could be done with each sentence as a separate piece of text. The four segments will look as follows:
Segment 1: We are going to a picnic near our house.
Segment 2: Many of our friends are coming.
Segment 3: You are welcome to join us.
Segment 4: We will have fun.
Thus the input to the 4 processors in the Map step will be in key-value pair format. The first column is the key, which is the entire sentence in this case. The second column is the value, which in this application is the frequency of the sentence.
We are going to a picnic near our house.   1
Many of our friends are coming.            1
You are welcome to join us.                1
We will have fun.                          1
This task can be done in parallel by four processors. Each of these segments will be a task for a different processor. Thus each task will produce a file of words, each with a count of 1. There will be four intermediate files, in <key,value> pair format, shown below.
Key2     Value2   Key2     Value2   Key2     Value2   Key2     Value2
we       1        many     1        you      1        we       1
are      1        of       1        are      1        will     1
going    1        our      1        welcome  1        have     1
to       1        friends  1        to       1        fun      1
a        1        are      1        join     1
picnic   1        coming   1        us       1
near     1
our      1
house    1
The sort process inherent within MapReduce will sort each of the intermediate files, and produce the following sorted key-value pairs:
Key2     Value2   Key2     Value2   Key2     Value2   Key2     Value2
a        1        are      1        are      1        fun      1
are      1        coming   1        join     1        have     1
going    1        friends  1        to       1        we       1
house    1        many     1        us       1        will     1
near     1        of       1        welcome  1
our      1        our      1        you      1
picnic   1
to       1
we       1
The Reduce function will read the sorted intermediate files, and combine the counts for all the unique words, to produce the following output. The keys remain the same as in the intermediate results. However, the values change as counts from each of the intermediate files are added up for each key. For example, the count for the word ‘are’ goes up to 3.
Key2 Value3
a 1
are 3
coming 1
friends 1
fun 1
going 1
have 1
house 1
join 1
many 1
near 1
of 1
our 2
picnic 1
to 2
us 1
we 2
welcome 1
will 1
you 1
This output will be identical to that produced by the UNIX sequence earlier.
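The worked example above can be traced in a few lines of ordinary Python. This is only a single-machine sketch of the Map, Sort and Reduce steps, not Hadoop code; the names map_segment and reduce_word are made up for illustration, and lowercasing and punctuation stripping are assumptions chosen to match the tables above.

```python
from itertools import groupby
import string

SEGMENTS = [
    "We are going to a picnic near our house.",
    "Many of our friends are coming.",
    "You are welcome to join us.",
    "We will have fun.",
]

def map_segment(sentence, _frequency):
    """Map step: emit a (word, 1) pair for every word in one segment."""
    for token in sentence.split():
        word = token.strip(string.punctuation).lower()
        if word:
            yield word, 1

def reduce_word(word, counts):
    """Reduce step: add up all the counts seen for one word."""
    return word, sum(counts)

# Map: each segment could be handled by a different processor.
intermediate = [pair for s in SEGMENTS for pair in map_segment(s, 1)]

# Sort: done by the framework in real MapReduce; here a plain sort by key.
intermediate.sort(key=lambda pair: pair[0])

# Reduce: one call per distinct key, over all the values for that key.
word_counts = dict(
    reduce_word(word, (count for _, count in group))
    for word, group in groupby(intermediate, key=lambda pair: pair[0])
)

for word in sorted(word_counts):
    print(word, word_counts[word])
```

The printed list has one line per unique word, in the key2-value3 form described earlier.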
MapReduce Programming
A data processing problem needs to be transformed into the MapReduce model. The first step is to visualize the processing plan as a map step and a reduce step. When the processing gets more complex, this complexity generally shows up as more MapReduce jobs, or as more complex map and reduce jobs. Having more but simpler MapReduce jobs leads to more easily maintainable mapper and reducer programs.
MapReduce Data Types and Formats
MapReduce has a simple model of data processing: inputs and outputs for the map and reduce functions are key-value pairs. The map and reduce functions in Hadoop MapReduce have the following general form:
map: (K1, V1) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)
In general, the map input key and value types (K1 and V1) are different from the map output types (K2 and V2). However, the reduce input must have the same types as the map output, although the reduce output types may be different again (K3 and V3). Since Mapper and Reducer are separate classes, the type parameters have different scopes.
Hadoop can process many different types of data formats, from flat text files to databases. An input split is a chunk of the input that is processed by a single map. Each map processes a single split. Each split is divided into records, and the map processes each record (a key-value pair) in turn. Splits and records are logical: they may map to a full file, a part of a file, or a collection of files. In a database context, a split might correspond to a range of rows from a table, and a record to a row in that range.
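The general form of map and reduce, and the record-by-record processing of a split, can be written down with Python type hints. This is a hedged illustration rather than the Hadoop API: run_job, city_temp_mapper and max_temp_reducer are hypothetical names, and the sample records (a byte offset as key, a text line as value) are invented.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple, TypeVar

K1 = TypeVar("K1")
V1 = TypeVar("V1")
K2 = TypeVar("K2")
V2 = TypeVar("V2")
K3 = TypeVar("K3")
V3 = TypeVar("V3")

# map:    (K1, V1)       -> list(K2, V2)
# reduce: (K2, list(V2)) -> list(K3, V3)
Mapper = Callable[[K1, V1], Iterable[Tuple[K2, V2]]]
Reducer = Callable[[K2, List[V2]], Iterable[Tuple[K3, V3]]]

def run_job(records: Iterable[Tuple[K1, V1]],
            mapper: Mapper, reducer: Reducer) -> List[Tuple[K3, V3]]:
    """Run the mapper on each record, group values by key, then reduce."""
    groups: Dict[K2, List[V2]] = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in mapper(k1, v1):
            groups[k2].append(v2)
    output: List[Tuple[K3, V3]] = []
    for k2 in sorted(groups):          # reducers see keys in sorted order
        output.extend(reducer(k2, groups[k2]))
    return output

def city_temp_mapper(offset: int, line: str):
    """(K1=int, V1=str) -> (K2=str, V2=int)"""
    city, temp = line.split()
    yield city, int(temp)

def max_temp_reducer(city: str, temps: List[int]):
    """(K2=str, list of V2=int) -> (K3=str, V3=float)"""
    yield city, float(max(temps))

split = [(0, "paris 12"), (9, "oslo 3"), (16, "paris 18")]
result = run_job(split, city_temp_mapper, max_temp_reducer)
print(result)
```

Note how the key and value types change at each stage, while the reducer's input types match the mapper's output types.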
Writing MapReduce Programs
Start by writing pseudocode for the map and reduce functions. The program code for both the map and the reduce function can then be written in Java or other languages. In Java, the map function is represented by the generic Mapper class. It uses four type parameters: input key, input value, output key, and output value. This class declares a map() method that the programmer overrides. This method receives the input key and input value, and would normally produce an output key and output value. For more complex problems, it is better to use a higher-level language than MapReduce, such as Pig, Hive, Cascading, Crunch, or Spark.
A mapper commonly performs input format parsing, projection (selecting the relevant fields), and filtering (selecting the records of interest). The reducer typically combines (adds or averages) those values.
Figure 5-0-2: MapReduce program Flow
Here is the step-by-step logic. Imagine that we want to do a word count of all unique words in a text.
1. The big document is split into many segments. The map step is run on each segment of data. The output will be a set of (key, value) pairs. In this case, the key will be a word in the document.
2. The system will gather the (key, value) pair outputs from all the mappers, and will sort them by key. The sorted list itself may then be split into a few segments.
3. A Reducer task will read the sorted list and produce a combined list of word counts.
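Here is the pseudocode for word count (the same logic can then be written in Java):
map(String key, String value):
    // key: document name; value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word; values: a list of counts for that word
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(key, AsString(result));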
Testing MapReduce Programs
Mapper programs running on a cluster can be complicated to debug. The time-honored way of debugging programs is via print statements. However, with the programs eventually running on tens or thousands of nodes, it is best to debug the programs in stages. Therefore, first run the program using small sample datasets to ensure that it is working correctly. Then expand the unit tests to cover a larger dataset and run it on a cluster. Ensure that the mapper or reducer can handle the inputs correctly. Running against the full dataset is likely to expose some more issues, which should be fixed by altering your mapper or reducer to handle the new cases. After the program is working, it may be tuned to make the entire MapReduce job run faster.
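The first stage, testing the map and reduce logic on tiny samples before touching a cluster, might look like the sketch below. It uses plain Python functions and assertions rather than any Hadoop test library; map_words, reduce_counts and the test names are hypothetical.

```python
def map_words(line):
    """Mapper under test: emit (word, 1) for each word in a line."""
    for token in line.lower().split():
        word = token.strip(".,;:!?\"'")
        if word:
            yield word, 1

def reduce_counts(word, counts):
    """Reducer under test: total the counts for one word."""
    return word, sum(counts)

def test_mapper_on_small_sample():
    expected = [("we", 1), ("will", 1), ("have", 1), ("fun", 1)]
    assert list(map_words("We will have fun.")) == expected

def test_mapper_edge_cases():
    # Inputs that a full dataset is likely to contain sooner or later.
    assert list(map_words("")) == []
    assert list(map_words("  ...  ")) == []

def test_reducer():
    assert reduce_counts("are", [1, 1, 1]) == ("are", 3)
    assert reduce_counts("never", []) == ("never", 0)

for test in (test_mapper_on_small_sample, test_mapper_edge_cases, test_reducer):
    test()
print("all tests passed")
```

Each new failure found on the full dataset can be added as another small test case before the mapper or reducer is changed.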
It may be desirable to split the logic into many simple mappers and chain them into a single mapper using a facility (the ChainMapper library class) built into Hadoop. It can run a chain of mappers, followed by a reducer and another chain of mappers, in a single MapReduce job.
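The idea of chaining simple mappers can be illustrated without Hadoop. The sketch below is not the ChainMapper API; chain_mappers and the three small mappers are hypothetical, and the comma-separated log format is invented for the example.

```python
def chain_mappers(*mappers):
    """Compose mappers so every pair emitted by one feeds the next."""
    def chained(key, value):
        pairs = [(key, value)]
        for mapper in mappers:
            pairs = [out for k, v in pairs for out in mapper(k, v)]
        return pairs
    return chained

def parse(offset, line):
    """Parsing: turn a raw line into (user, (status, size))."""
    user, status, size = line.split(",")
    yield user, (int(status), int(size))

def keep_ok(user, fields):
    """Filtering: keep only records with status 200."""
    if fields[0] == 200:
        yield user, fields

def project_size(user, fields):
    """Projection: keep only the size field."""
    yield user, fields[1]

mapper = chain_mappers(parse, keep_ok, project_size)
print(mapper(0, "alice,200,512"))
print(mapper(14, "bob,404,0"))
```

Each small mapper does one job (parse, filter, project), which keeps it easy to test in isolation.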
MapReduce Jobs Execution
A MapReduce job is specified by the Map program and the Reduce program, along with the data sets associated with that job. There is another master program that resides and runs endlessly on the NameNode. It is called the JobTracker, and it tracks the progress of the MapReduce jobs from the beginning to completion. Hadoop divides the job into two kinds of tasks: map tasks and reduce tasks. Hadoop moves the Map and Reduce computation logic to each DataNode that is hosting a part of the data. The coordination between the nodes is accomplished using YARN, Hadoop's native resource manager.
The master machine (NameNode) is completely aware of the data stored on each of the worker machines (DataNodes). It schedules the map or reduce jobs to TaskTrackers with full awareness of the data location. For example: if node A contains data (x, y, z) and node B contains data (a, b, c), the JobTracker schedules node B to perform map or reduce tasks on (a, b, c), and node A would be scheduled to perform map or reduce tasks on (x, y, z). This reduces the data traffic and prevents choking of the network.
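The node A / node B example can be expressed as a tiny lookup. This sketch assumes each block lives on exactly one node, which is a simplification (HDFS normally keeps several replicas); schedule_by_locality is a hypothetical name, not a Hadoop function.

```python
def schedule_by_locality(block_locations, blocks_to_process):
    """Assign each block's task to the node that already stores the block."""
    block_to_node = {
        block: node
        for node, blocks in block_locations.items()
        for block in blocks
    }
    return {block: block_to_node[block] for block in blocks_to_process}

block_locations = {"A": ["x", "y", "z"], "B": ["a", "b", "c"]}
assignment = schedule_by_locality(block_locations, ["a", "x", "c"])
print(assignment)
```

Because every task runs where its data already sits, no block has to cross the network before processing.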
Each DataNode runs a worker program called the TaskTracker. This program monitors the execution of every task assigned to it by the NameNode. When the task is completed, the TaskTracker sends a completion message to the JobTracker program on the NameNode.
The jobs and tasks work in a master-slave mode.
Figure 5-0-3: Hierarchical Monitoring Architecture
When there is more than one job in a MapReduce workflow, it is necessary that they be executed in the right order. For a linear chain of jobs it might be easy. For a more complex directed acyclic graph (DAG) of jobs, there are libraries that can help orchestrate your workflow. Or one can use Apache Oozie, a system for running workflows of dependent jobs.
Oozie consists of two main parts: a workflow engine that stores and runs workflows composed of different types of Hadoop jobs (MapReduce, Pig, Hive, and so on), and a coordinator engine that runs workflow jobs based on predefined schedules and data availability. Oozie has been designed to scale, and it can manage the timely execution of thousands of workflows in a Hadoop cluster.
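Running a DAG of jobs in the right order amounts to a topological sort of the dependency graph. The sketch below uses Python's standard graphlib module (Python 3.9 or later) purely to show the ordering idea; it is not how Oozie workflows are defined, and the job names are invented.

```python
from graphlib import TopologicalSorter

# Each job maps to the set of jobs that must finish before it starts.
dependencies = {
    "clean_logs": set(),
    "count_visits": {"clean_logs"},
    "count_errors": {"clean_logs"},
    "daily_report": {"count_visits", "count_errors"},
}

order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

The two counting jobs do not depend on each other, so a workflow engine is free to run them in either order or in parallel.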
The dataset for the MapReduce job is divided into fixed-size pieces called input splits, or just splits. Hadoop creates one map task for each split, which runs the user-defined map function for each record in the split. The tasks are scheduled using YARN and run on nodes in the cluster. YARN ensures that if a task fails or is inordinately delayed, it will be automatically scheduled to run on a different node. The outputs of the map jobs are fed as input to the reduce job. That logic is also propagated to the node(s) that will do the reduce jobs. To save on bandwidth, Hadoop allows the use of a combiner function on the map output. Then the combiner function's output forms the input to the reduce function.
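The bandwidth saving from a combiner can be seen in a small simulation. This is a sketch in plain Python, not Hadoop code; map_words, combine and reduce_all are hypothetical names, and the two input segments are invented. It relies on the fact that addition can be applied in stages without changing the total.

```python
from collections import Counter

def map_words(segment):
    """Map: one (word, 1) pair per word."""
    return [(word.lower(), 1) for word in segment.split()]

def combine(pairs):
    """Combiner: pre-aggregate the output of a single map task."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return sorted(local.items())

def reduce_all(map_outputs):
    """Reduce: total the counts arriving from every map task."""
    total = Counter()
    for pairs in map_outputs:
        for word, count in pairs:
            total[word] += count
    return dict(total)

segments = ["to be or not to be", "to do is to be"]
raw = [map_words(s) for s in segments]
combined = [combine(pairs) for pairs in raw]

pairs_without_combiner = sum(len(p) for p in raw)
pairs_with_combiner = sum(len(p) for p in combined)
print(pairs_without_combiner, pairs_with_combiner)
print(reduce_all(raw) == reduce_all(combined))
```

Fewer pairs leave each map task, yet the reducer arrives at the same totals. A combiner is only safe for operations, such as sums or maximums, that give the same answer when applied in stages.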
How MapReduce Works
A MapReduce job can be executed with a single method call: submit() on a Job object. When the resource manager receives a call to its submitApplication() method, it hands off the request to the YARN scheduler. The scheduler allocates a container, and the resource manager then launches the application master's process. The application master for MapReduce jobs is a Java application whose main class is MRAppMaster. It initializes the job by creating a number of bookkeeping objects to keep track of the job's progress. It retrieves the input splits computed in the client from the shared file system. It then creates a map task object for each split, as well as a number of reduce task objects determined by the mapreduce.job.reduces property (set by the setNumReduceTasks() method on Job). Tasks are given IDs at this point. The application master must decide how to run the tasks that make up the MapReduce job. The application master requests containers for all the map and reduce tasks in the job from the resource manager. Once a task has been assigned resources for a container on a particular node by the resource manager's scheduler, the application master starts the container by contacting the node manager. The task is executed by a Java application whose main class is YarnChild.
Managing Failures
There can be failures at the level of the entire job or of particular tasks. The application master itself could also fail.
Task failure usually happens when the user code in the map or reduce task throws a runtime exception. If this happens, the task JVM reports the error to its parent application master, where it is logged into error logs. The application master will then reschedule execution of the task on another data node.
The entire job, i.e. the MapReduce application master running on YARN, can fail too. In that case, it is started again, subject to a maximum number of attempts, which is a user-set configuration parameter.
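The pattern of logging a failure, retrying elsewhere, and giving up after a configured number of attempts can be sketched as follows. This is an illustration of the idea only, not YARN's implementation; run_with_retries, flaky_task and the node names are hypothetical.

```python
def run_with_retries(task, nodes, max_attempts):
    """Try a task on successive nodes until it succeeds or attempts run out."""
    error_log = []
    for attempt in range(max_attempts):
        node = nodes[attempt % len(nodes)]   # move on to a different node
        try:
            return task(node), error_log
        except RuntimeError as exc:
            error_log.append(f"attempt {attempt + 1} on {node}: {exc}")
    raise RuntimeError(f"task failed after {max_attempts} attempts: {error_log}")

def flaky_task(node):
    """A stand-in task that always fails on node1."""
    if node == "node1":
        raise RuntimeError("disk error")
    return f"done on {node}"

result, log = run_with_retries(flaky_task, ["node1", "node2"], max_attempts=4)
print(result)
print(log)
```

The maximum number of attempts keeps a task that can never succeed from being retried forever.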
If a node manager on a data node fails by crashing or running very slowly, it will stop sending heartbeats to the resource manager (or send them very infrequently). The resource manager will then remove it from its pool of nodes to schedule containers on. Any task or application master running on the failed node manager will be recovered using error logs, and started on other nodes.
The YARN resource manager can also fail, and this has more severe consequences for the entire cluster. Therefore, typically, there will be a hot standby for the resource manager. If the active resource manager fails, then the standby can take over without a significant interruption to the client. The new resource manager can read the application information from the state store, and then restart the applications that were running on the cluster.
Shuffle and Sort
MapReduce guarantees that the input to every reducer is sorted by key. The process by which the system performs the sort, and transfers the map outputs to the reducers as inputs, is known as the shuffle.
When the map function starts producing output, it is not directly written to disk. The process takes advantage of buffering writes in memory and doing some presorting for efficiency reasons. Each map task has a circular memory buffer that it writes the output to. Before writing to disk, a background thread first divides the data into partitions corresponding to the reducers that they will ultimately be sent to. Within each partition, the background thread performs an in-memory sort by key. If there is a combiner function, it is run on the output of the sort so that there is less data to transfer to the reducer.
The reduce task needs the map output for its particular partition from several map tasks across the cluster. The map tasks may finish at different times, so the reduce task starts reading their outputs as soon as each completes. When all the map outputs have been read, the reduce task merges the map outputs, maintaining their sort ordering. The reduce function is invoked for each key in the sorted output. The output of this phase is written directly to the output file system, such as HDFS.
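The partition, sort and merge stages of the shuffle can be mimicked in memory. The sketch below is an assumption-laden miniature: it picks a reducer with a CRC32 hash of the key modulo the number of reducers (a choice made for this example, since Python's built-in hash() of strings varies between runs), and the function names are hypothetical.

```python
import heapq
import zlib

NUM_REDUCERS = 2

def partition(key, num_reducers):
    """Pick the reducer for a key using a hash that is stable across runs."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def map_side(pairs, num_reducers):
    """Split one map task's output into per-reducer partitions, each sorted by key."""
    parts = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        parts[partition(key, num_reducers)].append((key, value))
    return [sorted(part) for part in parts]

def reduce_side(map_outputs, reducer_id):
    """Merge the sorted runs destined for one reducer, preserving key order."""
    runs = [output[reducer_id] for output in map_outputs]
    return list(heapq.merge(*runs))

task1 = [("we", 1), ("are", 1), ("going", 1)]
task2 = [("you", 1), ("are", 1), ("welcome", 1), ("we", 1)]
map_outputs = [map_side(task, NUM_REDUCERS) for task in (task1, task2)]

for reducer_id in range(NUM_REDUCERS):
    print(reducer_id, reduce_side(map_outputs, reducer_id))
```

Because the same key always hashes to the same partition, all the pairs for a given word end up at one reducer, already in sorted order.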
Progress and Status Updates
MapReduce jobs are long-running batch jobs. It is important for the user to get feedback on the job's progress. A job and each of its tasks have a status, which includes the state of the job or task (e.g., running, successfully completed, failed), the progress of maps and reduces, and the values of the job's counters. These values are constantly communicated back to the client. When the application master receives a notification that the last task for a job is complete, it changes the status for the job to “successful.” Job statistics and counters are communicated to the user.
Hadoop comes with a native web-based GUI for tracking the MapReduce jobs. It displays useful information about a job's progress, such as how many tasks have been completed, and which ones are still being executed. Once the job is completed, one can view the job statistics and logs.
Hadoop Streaming
Hadoop Streaming uses standard Unix streams as the interface between Hadoop and the user program. Streaming is ideal for text processing applications. Map input data is passed over standard input to your map function, which processes it line by line and writes lines to standard output. A map output key-value pair is written as a single tab-delimited line. Input to the reduce function is in the same format (a tab-separated key-value pair) passed over standard input. The reduce function reads lines from standard input, which the framework guarantees are sorted by key, and writes its results to standard output.
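A streaming-style mapper and reducer only need to read lines and write tab-delimited lines. The sketch below keeps both as functions over any iterable of lines so that it can be tried locally; in a real streaming job each would be a separate script reading sys.stdin and printing to standard output. The function names and the sample text are assumptions, and the sorted() call stands in for the framework's sort.

```python
import io
from itertools import groupby

def mapper(lines):
    """Emit one 'word<TAB>1' line per word read from the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def reducer(lines):
    """Sum the counts of consecutive lines that share a key (input sorted by key)."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"

# Local dry run: a text stream stands in for standard input.
text = io.StringIO("We will have fun\nwe are going\n")
map_output = sorted(mapper(text))      # the framework would do this sort
reduce_output = list(reducer(map_output))
for line in reduce_output:
    print(line)
```

The reducer depends on the sorted-by-key guarantee: it only compares each line's key with the previous one, so it never needs to hold all the words in memory.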
Conclusion
MapReduce is the first popular parallel programming framework for Big Data. It works well for applications where the data can be large, divisible into separate sets, and represented in <key,value> pair format. The application logic is divided into two parts: a Map program and a Reduce program. Each of these programs can be run in parallel by several machines.
Review Questions
Q1: What is MapReduce? What are its benefits?
Q2: What is the key-value pair format? How is it different from other data structures? What are its benefits and limitations?
Chapter 6 – NoSQL Databases
A NoSQL database is a clever way to cost-effectively organize large amounts of heterogeneous data for efficient access and updates. The ideal NoSQL database is completely aligned with the nature of the problems being solved, and is superfast in that task. This is achieved by releasing and relaxing many of the integrity and redundancy constraints of storing data in relational databases, and storing data in many innovative formats as aligned with business need. The diverse NoSQL databases will ultimately collectively evolve into a holistic set of efficient and elegant data structures at the heart of a cosmic computer of infinite organization capacity.
Introduction
Relational database management systems (RDBMS) are a powerful database technology used by almost all enterprises. Relational databases are structured and optimized to ensure accuracy and consistency of data, while also eliminating any redundancy of data. These databases are stored on the largest and most reliable of computers to ensure that the data is always available at a granular level and at a high speed.
Big data, however, is a much larger and more unpredictable stream of data. Relational databases are inadequate for this task, and would also be very expensive for such large data volumes. Managing the cost and speed of handling such large and heterogeneous data streams requires relaxing many of the strict rules and requirements of relational data. Depending upon which constraint(s) are relaxed, a different kind of database structure will emerge. These are called NoSQL databases, to differentiate them from relational databases that use Structured Query Language (SQL) as the primary means to manipulate data.
NoSQL databases are next-generation databases that are non-relational in their design. The name NoSQL is meant to differentiate it from antiquated, ‘pre-relational’ databases. Today, almost every organization that needs to gather customer feedback and sentiments to improve its business will use a NoSQL database. NoSQL is useful when an enterprise needs to access, analyze and utilize massive amounts of either structured or unstructured data, or data that is stored remotely on any virtual server across the globe.
The constraints of a relational database are relaxed in many ways. For example, relational databases require that any data element can be randomly accessed and its value updated in that same physical location. However, the simple physics of storage says that it is simpler and faster to read or write sequential blocks of data on a disk. Therefore, NoSQL database files are written once and almost never updated in place. If a new version of a part of the data becomes available, it would be stored elsewhere by the system. The system would have the intelligence to link the updated data to the old data.
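The write-once idea, where an update is stored as a new record and linked to the old one rather than overwriting it, can be shown with a toy in-memory store. This is a conceptual sketch only; AppendOnlyStore and its methods are hypothetical and do not correspond to any particular NoSQL product.

```python
class AppendOnlyStore:
    """Toy write-once store: an update appends a new version, never overwrites."""

    def __init__(self):
        self._log = []      # records in the order written; never modified
        self._latest = {}   # key -> position of the newest record in the log

    def put(self, key, value):
        """Append a new version and point the index at it."""
        self._log.append((key, value))
        self._latest[key] = len(self._log) - 1

    def get(self, key):
        """Return the newest version of a key."""
        return self._log[self._latest[key]][1]

    def history(self, key):
        """Return every version ever written for a key, oldest first."""
        return [value for k, value in self._log if k == key]

store = AppendOnlyStore()
store.put("user1", "Chicago")
store.put("user2", "Boston")
store.put("user1", "Denver")   # an "update" is just another sequential write
print(store.get("user1"))
print(store.history("user1"))
```

All writes are sequential appends, and the small index is the "intelligence" that links a key to its most recent version.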
Pig and Hive are two key and popular languages in the Hadoop ecosystem that work well on NoSQL databases. Pig originated at Yahoo while Hive originated at Facebook. Both Pig and Hive can use the same data as an input, and can achieve similar results with queries. Both Pig Latin and Hive commands eventually compile to Map and Reduce jobs. They have a similar goal: to ease the complexity of writing complex Java MapReduce programs. Most MapReduce jobs can be implemented easily in Hive or Pig.
For analytical needs, Hive is preferable over Pig. For controlled processing, Pig's scripting design is preferable. Hive leads to ease and productivity using its SQL-like design and user interface. Pig offers greater control over data flows. Java MapReduce, with its more advanced APIs, can be used when something special is needed, such as interacting with a third-party tool, or handling some special data characteristics.
RDBMS Vs NoSQL
RDBMS and NoSQL databases are different in many ways. First, NoSQL databases generally do not support relational schemas or the SQL language. The term NoSQL stands mostly for “Not only SQL”. Second, their transaction processing capabilities are fast but weak, and they typically do not support the ACID (Atomicity, Consistency, Isolation, Durability) properties associated with transaction processing using relational databases. Instead, they are approximately accurate at any point in time, and will be eventually consistent. Third, these databases are also distributed and horizontally scalable, to manage web-scale databases using Hadoop clusters of storage. Thus they work well with the write-once, read-many storage mechanism of Hadoop clusters.
| Feature | RDBMS | NoSQL |
| --- | --- | --- |
| Applications | Mostly centralized applications (e.g. ERP) | Mostly designed for decentralized applications (e.g. Web, mobile, sensors) |
| Availability | Moderate to high | Continuous availability to receive and serve data |
| Velocity | Moderate velocity of data | High velocity of data (devices, sensors, social media, etc.). Low latency of access. |
| Data Volume | Moderate size; archived after a certain period | Huge volume of data, stored mostly for a long time or forever; linearly scalable DB. |
| Data Sources | Data arrives from one or few, mostly predictable sources | Data arrives from multiple locations and is of unpredictable nature |
| Data Type | Data are mostly structured | Structured or unstructured data |
| Data Access | Primary concern is reading the data | Concern is both read and write |
| Technology | Standardized relational schemas; SQL language | Many designs with many implementations of data structures and access languages |
| Cost | Expensive; commercial | Low; open-source software |
TypesofNoSQLDatabasesThevarietyofbigdatameansthatfilesizeandtypeswillvaryenormously.Therearespecializeddatabasestosuitdifferentpurposes.
1. DocumentDatabases:Storinga10GBvideomoviefileasasingleobjectcouldbespeededupbysequentiallystoringthedataincontiguousblocksofphysicalstorage.Anindexcouldstoretheidentifyinginformationaboutthemovie,andtheaddressofthestartingblock.Therestofstoragedetailscouldbehandledbythesystem.Thisstorageformatwouldbeacalleddocumentstoreformat.Theindexwouldcontainthenameofthemovie,andthevalueistheentirevideofile,characterizedbythefirstblockofstorage.Documentdatabasesaregenerallyusefulforcontentmanagementsystems,bloggingplatforms,webanalytics,real-timeanalytics,ecommerce-applications.Wewouldavoidusingdocumentdatabasesforsystemsthatneedcomplextransactionsspanningmultipleoperationsorqueriesagainstvaryingaggregatestructures.
2. Key-ValuePairDatabases:Therecouldbeacollectionofmanydataelementssuchasacollectionoftextmessageswhichcouldalsofitintoasinglephysicalblockofstorage.Eachtextmessageisauniqueobject.Thisdatawouldneedtobequeriedoften.Thatcollectionofmessagescouldalsobestoredinakey-valuepairformat,bycombiningtheidentifierofthemessageandthecontentofthemessage.Key-valuedatabasesareusefulforstoringsessioninformation,userprofiles,preferences,andshoppingcartdata.Key-valuedatabasesdon’tworksowellwhenweneedtoquerybynon-keyfieldsoronmultiplekeyfieldsatthesametime.
3. GraphDatabases:Geographicmapdatathatisstoredinsetofrelationshipsorlinksbetweenpoints.Graphdatabasesareverywellsuitedtoproblemspaceswherewehaveconnecteddata,suchassocialnetworks,spatialdata,routinginformation,andrecommendationengines.
4. Columnar Databases: Some kinds of databases are needed to speed up oft-sought queries on very large data sets. Suppose there is an extremely large data warehouse of web log access data, which is rolled up by the number of web accesses by the hour. This needs to be queried or summarized often, involving only some of the data fields from the database. Thus the query could be sped up by creating a database structure that includes only the relevant columns of the dataset, along with the key identifying information. This is called a columnar database format, and is useful for content management systems, blogging platforms, maintaining counters, expiring usage, and heavy write volumes such as log aggregation. Column family databases work well for systems whose query patterns have stabilized.
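To make the row-versus-column distinction concrete, here is a minimal Python sketch. The web-log records and field names are invented for illustration and do not come from any particular database; the point is only that an hourly rollup over a columnar layout needs to read a single column.

```python
from collections import Counter

# Hypothetical web-log records, stored row by row (illustrative data only).
rows = [
    {"hour": 9, "url": "/home", "bytes": 512},
    {"hour": 9, "url": "/cart", "bytes": 2048},
    {"hour": 10, "url": "/home", "bytes": 256},
]


def to_columns(records):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# The hourly rollup reads only the "hour" column and ignores url and bytes.
hits_per_hour = Counter(columns["hour"])
print(dict(hits_per_hour))
```

In a real columnar store each column would live in its own file or block, so a query like this one would skip most of the data on disk.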
The choice of NoSQL database depends on the system requirements. There are at least 200 implementations of NoSQL databases of these four types. Visit nosql-database.org for more.
Despite the name, a NoSQL database does not necessarily prohibit the use of a structured query language (such as the SQL used with MySQL). While some NoSQL systems are entirely non-relational, others just avoid selected functionality of an RDBMS, such as fixed table schemas and join operations. In NoSQL systems, instead of using tables, the data can be organized in key/value pair format, and then SQL can be used.
The first popular NoSQL database was HBase, which is a part of the Hadoop family. The most popular NoSQL database used today is Apache Cassandra, which was developed and owned by Facebook until it was released as open source in 2008. Other NoSQL database systems are SimpleDB, Google’s BigTable, MemcacheDB, Oracle NoSQL, Voldemort, etc.
Architecture of NoSQL
Figure 6‑0‑1: NoSQL Databases Architecture
One of the key concepts underlying NoSQL databases is that database management has moved to a two-layer architecture, separating the concerns of data modeling and data storage. The data storage layer focuses on high-performance, scalable data storage for the task at hand. The data management layer supports a variety of database formats, and allows low-level access to the data through specialized languages that are more appropriate for the job, rather than being constrained to the standard SQL format.
A NoSQL database maps the data into key/value pairs and saves the data in the storage unit. There is no storage of data in a centralized tabular form, so the database is highly scalable. The data could be of different forms and come from different sources, and it can all be stored in similar key/value pair formats.
There are a variety of NoSQL architectures. Some popular NoSQL databases like MongoDB are designed in a master/slave model, like many RDBMSs. Other popular NoSQL databases like Cassandra are designed in a master-less fashion, where all the nodes in the cluster are the same. So it is the architecture of the NoSQL database system that determines which benefits of a distributed and scalable system emerge, such as continuous availability, distributed access, high speed, and so on.
NoSQL databases give developers a lot of options to choose from, and let them fine-tune the system to their specific requirements. This requires understanding how the data is going to be consumed by the system, with questions such as: Is it read-heavy or write-heavy? Is there a need to query data with random query parameters? Will the system be able to handle inconsistent data?
CAP theorem
Data is expected to be accurate and available. In a distributed environment, accuracy depends upon the consistency of data. A system is considered Consistent if all replicas of a piece of data contain the same value. The system is considered Available if the data is available at all points in time. It is also desirable for the data to be consistent and available even when a network failure renders the database partitioned into two or more islands. A system is considered Partition Tolerant if processing can continue in both partitions in the case of a network failure. In practice it is hard to achieve all three.
The choice between Consistency and Availability remains the unavoidable reality for distributed data stores. The CAP theorem states that in any distributed system one can choose only two out of the three (Consistency, Availability and Partition Tolerance). The third will be determined by those choices.
NoSQL databases can be tuned to suit one’s choice of high consistency or availability. For example, for a NoSQL database, there are essentially three parameters:
- N = replication factor, i.e. the number of replicas created for each piece of data
- R = minimum number of nodes that should respond to a read request for it to be considered successful
- W = minimum number of nodes that should respond to a write request before it is considered successful
Setting the values of R and W very high (R=N and W=N) will make the system more consistent. However, every read and write must then wait for all N replicas to respond, so operations will be slow and Availability will be low. At the other end, setting R and W very low (such as R=1 and W=1) would make the cluster highly available, as even a single successful read (or write) would let the cluster report success. However, consistency of data on the cluster will be low, since many replicas may not yet have received the latest copy of the data.
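The effect of the three parameters can be sketched in a few lines of Python. The function name and the returned fields are hypothetical. The rule that R + W > N guarantees overlapping read and write sets is the usual quorum condition in Dynamo-style stores; it is an assumption added here, not something stated in the text above.

```python
def quorum_profile(n, r, w):
    """Describe the consistency/availability trade-off for a given N, R, W."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("R and W must be between 1 and N")
    return {
        # If read and write sets must overlap, a read always sees the latest write.
        "strong_consistency": r + w > n,
        # Number of replicas that may be down while reads / writes still succeed.
        "read_failures_tolerated": n - r,
        "write_failures_tolerated": n - w,
    }


for r, w in [(3, 3), (2, 2), (1, 1)]:
    print(f"N=3 R={r} W={w} ->", quorum_profile(3, r, w))
```

With N=3, the setting R=W=3 is consistent but tolerates no failed replica, while R=W=1 tolerates two failed replicas but gives up the overlap guarantee. R=W=2 is a common middle ground.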
If a network gets partitioned because of a network failure, then one has to trade off availability versus consistency. NoSQL database users often choose availability and partition tolerance over strong consistency. They argue that short periods of application misbehavior are less problematic than short periods of unavailability.
Consistency is more expensive than Availability in terms of throughput or latency. However, HDFS chooses consistency, as three failed data nodes can potentially render a file’s blocks completely unavailable.
Popular NoSQL Databases
We cover two of the more popular offerings.
HBase
Apache HBase is a column-oriented, non-relational, distributed database system that runs on top of HDFS. An HBase system comprises a set of tables. Each table contains rows and columns, much like a traditional database. Each table must have an element defined as a Primary Key (the row key); all access to HBase tables is done using the Primary Key. An HBase column represents an attribute of an object. For example, if the table is storing diagnostic logs from web servers, each row will be a log record. Each column in that table will represent an attribute such as the date/time of the record, or the server name. HBase permits many attributes to be grouped together into a column family, so that all elements of a column family are stored together as essentially a composite attribute.
Columnar databases differ from relational databases in how the data is stored. In a relational database, all the columns/attributes of a given row are stored together. With HBase you must predefine the table schema and specify the column families. All rows of a column family will be stored sequentially. However, new columns can be added to families at any time, making the schema flexible and therefore able to adapt to changing application requirements.
Architecture Overview
HBase is built on the master-slave concept. In HBase a master node manages the cluster, while the worker nodes (called region servers) store portions of the tables and perform the work on the data. HBase is designed after Google Bigtable, and offers similar capabilities on top of Hadoop and HDFS. It does consistent reads and writes. It does automatic and configurable sharding of tables. A shard is a segment of the database.
Figure 6‑0‑2: HBase Architecture
Physically, HBase is composed of three types of servers in a master-slave type of architecture.
(a) The NameNode maintains metadata information for all the physical data blocks that comprise the files.
(b) Region Servers serve data for reads and writes.
(c) The Hadoop DataNode stores the data that the Region Server is managing.
HBase tables are divided horizontally by row key range into “Regions.” A region contains all rows in the table between the region’s start key and end key. Region assignment and DDL (create, delete tables) operations are handled by the HBase Master process. ZooKeeper, a separate coordination service in the Hadoop ecosystem, maintains a live cluster state. There is automatic failover support between Region Servers. All HBase data is stored in HDFS files. Region Servers are colocated with the HDFS DataNodes, which enables data locality (putting the data close to where it is needed) for the data served by the Region Servers. HBase data is local when it is written, but when a region is moved, it is not local until compaction.
Each Region Server creates an ephemeral node in ZooKeeper. The HMaster monitors these nodes to discover available region servers, and it also monitors these nodes for server failures.
A master is responsible for coordinating the region servers, including assigning regions on startup, reassigning regions for recovery or load balancing, and monitoring their health. It is also the interface for creating, deleting, and updating tables.
Reading and Writing Data
There is a special HBase Catalog table called the META table, which holds the location of the regions in the cluster. ZooKeeper stores the location of the META table.
This is what happens the first time a client reads or writes to HBase:
1. The client gets the Region Server that hosts the META table from ZooKeeper.
2. The client queries the .META. server to get the region server corresponding to the row key it wants to access. The client caches this information along with the META table location.
3. It gets the row from the corresponding Region Server.
For future reads, the client uses the cache to retrieve the META location and previously read row keys. Over time, it does not need to query the META table, unless there is a miss because a region has moved; then it will re-query and update the cache.
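The lookup-and-cache behavior described above can be illustrated with a toy model in Python. The classes `MetaTable` and `Client`, the region boundaries, and the server names are all hypothetical; this is a sketch of the idea, not the HBase client API, and it leaves out the re-query that happens when a region moves.

```python
import bisect


class MetaTable:
    """Toy stand-in for the META table: maps row-key ranges to region servers."""

    def __init__(self, regions):
        # Each region is (start_key, end_key, server); end_key None = open-ended.
        self.regions = sorted(regions, key=lambda region: region[0])
        self.start_keys = [region[0] for region in self.regions]
        self.lookups = 0  # counts round trips to META

    def locate(self, row_key):
        self.lookups += 1
        # Pick the last region whose start key is <= row_key.
        index = bisect.bisect_right(self.start_keys, row_key) - 1
        return self.regions[index]


class Client:
    """Caches region locations so that repeat reads skip the META lookup."""

    def __init__(self, meta):
        self.meta = meta
        self.cache = []  # regions this client has already located

    def find_server(self, row_key):
        for start, end, server in self.cache:
            if start <= row_key and (end is None or row_key < end):
                return server  # cache hit: no META round trip
        region = self.meta.locate(row_key)  # cache miss: ask META
        self.cache.append(region)
        return region[2]


meta = MetaTable([("", "m", "rs1"), ("m", None, "rs2")])
client = Client(meta)
print(client.find_server("apple"), client.find_server("zebra"), client.find_server("avocado"))
print("META lookups:", meta.lookups)
```

The third read falls in a region the client has already seen, so only two META lookups are made for three reads.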
Cassandra
Apache Cassandra is a highly scalable open source non-relational database that offers continuous uptime, simplicity, and easy data distribution across multiple data centers and the cloud. Cassandra was originally developed at Facebook and was open sourced in 2008. For modern online applications it provides many benefits over traditional relational databases, such as a scalable architecture, continuous availability, high data protection, data replication across multiple data centers, data compression, and a SQL-like language.
Architecture Overview
Cassandra’s architecture is the source of its ability to scale and provide continuous availability. Rather than using a master-slave architecture, it has a master-less “ring” design that is easy to set up and maintain. In Cassandra, all nodes play an equal role, and all nodes communicate with one another through a distributed and highly scalable protocol called gossip.
So the scalable Cassandra architecture provides the capacity to handle large volumes of data, and large numbers of concurrent users or operations, across multiple data centers, just as easily as a relational database handles a normal operation. To enhance its capacity, one simply adds new nodes to an existing cluster, without taking down the system or redesigning it from scratch.
The Cassandra architecture also means that, unlike master-slave systems, it has no single point of failure and is thus capable of offering continuous availability and uptime.
Reading and Writing Data
Data to be written to a Cassandra node is first recorded in an on-disk commit log and then written to a memory-based structure called a “memTable”. When a memTable’s size exceeds a set threshold, the data is written to a file on disk called an “SSTable”. In this way the write operation is fully sequential in nature, with many input/output operations occurring at the same time, rather than one at a time over a long period.
For a read operation, Cassandra consults an in-memory data structure called a “Bloom filter”, which indicates whether an SSTable is likely to hold the required data. The Bloom filter can tell very quickly whether a file may have the needed data or not. If it returns true, Cassandra checks another layer of in-memory caches, and then fetches the compressed data on disk. If the answer is false, Cassandra does not bother reading that SSTable and looks to another file for the required data.
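The write and read paths just described can be modeled in a short Python sketch. The `Node` and `BloomFilter` classes, the filter size, and the flush threshold are illustrative assumptions; real Cassandra also merges versions by timestamp and compacts SSTables, which this toy omits.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size=256, hashes=3):
        self.size, self.hashes = size, hashes
        self.bits = [False] * size

    def _positions(self, key):
        for seed in range(self.hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for position in self._positions(key):
            self.bits[position] = True

    def might_contain(self, key):
        # False means definitely absent; True means possibly present.
        return all(self.bits[position] for position in self._positions(key))


class Node:
    def __init__(self, memtable_limit=2):
        self.commit_log = []   # append-only log, written first for durability
        self.memtable = {}     # recent writes held in memory
        self.sstables = []     # flushed (bloom_filter, immutable dict) pairs
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))
        self.memtable[key] = value
        if len(self.memtable) > self.memtable_limit:
            self._flush()

    def _flush(self):
        bloom = BloomFilter()
        for key in self.memtable:
            bloom.add(key)
        self.sstables.append((bloom, dict(self.memtable)))
        self.memtable = {}

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for bloom, table in reversed(self.sstables):  # newest SSTable first
            # Skip any SSTable whose filter rules the key out.
            if bloom.might_contain(key) and key in table:
                return table[key]
        return None


node = Node()
for i in range(5):
    node.write(f"user{i}", f"profile{i}")
print(node.read("user1"), node.read("user4"), node.read("missing"))
print(len(node.sstables), "SSTable(s) flushed")
```

Because the filter never gives a false negative, skipping an SSTable on a "false" answer is always safe; a "true" answer only means the file is worth reading.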
Write Syntax:
TTransport tr = new TSocket(HOST, PORT);
TFramedTransport tf = new TFramedTransport(tr);
TProtocol protocol = new TBinaryProtocol(tf);
Cassandra.Client client = new Cassandra.Client(protocol);
tf.open();
client.insert(userIDKey, cp, new Column("column-name".getBytes(UTF8), "column-data".getBytes(UTF8), clock), CL);
Read Syntax:
Column col = client.get(userIDKey, colPathName, CL).getColumn();
LOG.debug("Column name: " + new String(col.name, UTF8));
LOG.debug("Column value: " + new String(col.value, UTF8));
Hive Language
Hive is a declarative SQL-like language for queries. Hive was designed to appeal to a community comfortable with SQL. It is used mainly by data analysts on the server side, for designing reports. It has its own metadata section, which can be defined ahead of time, before data is loaded. Hive supports map and reduce transform scripts in the language of the user’s choice, which can be embedded within SQL clauses. It is widely used at Facebook by analysts comfortable with SQL, as well as by data miners programming in Python. Hive is best used for traditional data warehousing tasks; it is not designed for online transaction processing.
Hive is best suited for structured data. Hive can be used to query data stored in HBase, which is a key-value store. Hive’s SQL-like structure makes transformation of data to and from an RDBMS easier. Supporting SQL syntax also makes it easy to integrate with existing BI tools. Hive needs the data to be first imported (or loaded), and after that it can be worked upon. In the case of streaming data, one would have to keep filling buckets (or files); Hive can then be used to process each filled bucket, while other buckets keep storing the newly arriving data.
Hive data columns are mapped to tables in HDFS. This mapping is stored in metadata. All HQL queries are converted to MapReduce jobs. A table can have one or more partition keys. There are the usual SQL data types, plus Arrays, Maps, and Structs to represent more complex types of data. There are user-defined functions for mapping and aggregating.
Figure 6‑3: Hive Architecture
Hive Language Capabilities
Hive’s SQL provides almost all basic SQL operations. These operations work on tables and/or partitions. These operations are: SELECT, FROM, WHERE, JOIN, GROUP BY, ORDER BY. It also allows the results to be stored in another table, or in an HDFS file.
The statement to create a page_view table would be like:
CREATE TABLE page_view(viewTime INT, userid BIGINT,
    page_url STRING, referrer_url STRING,
    ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
STORED AS SEQUENCEFILE;
Here is a script that creates a staging table, from which data can be loaded into the page_view table.
CREATE EXTERNAL TABLE page_view_stg(viewTime INT, userid BIGINT,
    page_url STRING, referrer_url STRING,
    ip STRING COMMENT 'IP Address of the User',
    country STRING COMMENT 'country of origination')
COMMENT 'This is the staging page view table'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '44' LINES TERMINATED BY '12'
STORED AS TEXTFILE
LOCATION '/user/data/staging/page_view';
The table created above can be stored in HDFS as a TextFile or as a SequenceFile.
To load data, first copy the raw file into the staging location in HDFS:
hadoop dfs -put /tmp/pv_2008-06-08.txt /user/data/staging/page_view
An INSERT query that fills one partition of page_view from the staging table will then look like:
FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US')
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, pvs.ip
WHERE pvs.country = 'US';
Pig Language
Pig is a high-level procedural language. It is used mainly for programming. It helps to create a step-by-step flow of data for processing. It operates mostly on the client side of the cluster. Pig Latin follows a procedural programming model and is more natural to use for building a data pipeline, such as an ETL job. It gives full control over how the data flows through the pipeline and when to checkpoint the data in the pipeline; it supports DAGs in the pipeline, such as splits, and gives more control over optimization. Pig works well with unstructured data. For complex operations such as analyzing matrices, or searching for patterns in unstructured data, Pig gives greater control and more options.
Pig allows one to load data and user code at any point in the pipeline. This can be important for ingesting streaming data from satellites or instruments. Pig also uses lazy evaluation. Pig is faster in data import but slower in actual execution than an RDBMS-friendly language like Hive. Pig is well suited to parallelization, and so it is better suited for very large datasets where throughput (amount of data processed) is more important than latency (speed of response).
Pig is SQL-like, but differs to a great extent. It does not have a dedicated metadata section; the schema has to be defined in the program itself. Pig can be easier for someone who has no earlier experience with SQL.
Conclusion
NoSQL databases emerged in response to the limitations of relational databases in handling the sheer volume, nature and growth of data. NoSQL databases offer functionality like MapReduce. NoSQL databases are proving to be a viable solution to enterprise data needs, and continue to be so. There are four types of NoSQL databases: columnar, key-value, document, and graph databases. Cassandra and HBase are among the most popular NoSQL databases. Hive is a SQL-type language to access data from NoSQL databases. Pig is a procedural high-level language that gives greater control over data flows.
Review Questions
Q1: What is a NoSQL database? What are the different types of it?
Q2: How does a NoSQL database leverage the power of MapReduce?
Q3: What are the kinds of NoSQL databases? What are the advantages of each?
Q4: What are the similarities and differences between Hive and Pig?
Chapter 7 – Stream Processing with Spark
A stream processing system is a clever way to process large quantities of data from a vast set of extremely fast incoming data streams. The ideal stream processing engine will capture and report in real time the essence of all data streams, no matter their speed, size, or number. This is achieved by using innovative algorithms and filters that relax many computational accuracy requirements, to compute simple approximate metrics in real time. A stream processing engine aligns with the infinite dynamism of the flow of nature.
Introduction
Apache Spark is an integrated, fast, in-memory, general-purpose engine for large-scale data processing. Spark is ideal for iterative and interactive processing tasks on large data sets and streams. Spark achieves 10-100x performance over Hadoop by operating with an in-memory construct called ‘Resilient Distributed Datasets’, which helps avoid the latencies involved in disk reads and writes. While Spark is compatible with Hadoop file systems and tools, a large-scale adoption of Spark and its built-in libraries (for Machine Learning, Graph Processing, Stream Processing, SQL) will deliver seamless fast data processing along with high programmer productivity. Spark has become a more efficient and productive alternative for the Hadoop ecosystem, and is increasingly being used in industry.
Apache Spark was originally developed in 2009 in UC Berkeley’s AMPLab, and open sourced in 2010; it later became an Apache project. It can process data from a variety of data repositories, including the Hadoop Distributed File System (HDFS), and NoSQL databases such as HBase and Cassandra. Spark supports in-memory processing to boost the performance of big data analytics applications, but it can also do conventional disk-based processing when data sets are too large to fit into the available system memory. Spark gives us a comprehensive, unified framework to manage big data processing requirements with a variety of data sets that are diverse in nature (text data, graph data, etc.) as well as in the source of data (batch v. real-time streaming data). Spark enables applications in Hadoop clusters to run up to 100 times faster in memory and 10 times faster even when running on disk. Spark is an alternative to Hadoop MapReduce rather than a replacement for Hadoop. It provides a comprehensive and unified solution to manage different big data use cases and requirements.
Spark Architecture
The core Spark engine functions partly as an application programming interface (API) layer and underpins a set of related tools for managing and analyzing data, including a SQL query engine, a library of machine learning algorithms, a graph processing system, and streaming data processing software. Spark allows programmers to develop complex, multi-step data pipelines using the directed acyclic graph (DAG) pattern. It also supports in-memory data sharing across DAGs, so that different jobs can work with the same data. Spark runs on top of existing Hadoop Distributed File System (HDFS) infrastructure to provide enhanced and additional functionality. It provides support for deploying Spark applications in an existing Hadoop v1 cluster (with SIMR, Spark-Inside-MapReduce), a Hadoop v2 YARN cluster, or even Apache Mesos.
Next we introduce two important features of Spark: RDDs and DAGs.
Resilient Distributed Datasets (RDD)
An RDD (Resilient Distributed Dataset) is a distributed memory abstraction. RDDs are motivated by two types of applications that current computing frameworks handle inefficiently: iterative algorithms and interactive data mining tools. In both cases, keeping data in memory can improve performance by an order of magnitude.
RDDs are immutable, partitioned collections of records, which can only be created by coarse-grained operations such as map, filter, and group-by. Coarse-grained means that the operations are applied to all elements in a dataset. RDDs can only be created by reading data from stable storage such as HDFS, or by transformations on existing RDDs.
Once data is read into an RDD object in Spark, a variety of operations can be performed by calling abstract Spark APIs. The two major types of operation available are transformations and actions. Transformations return a new, modified RDD based on the original. Several transformations are available through the Spark API, including map(), filter(), sample(), and union(). Actions return a value based on some computation being performed on an RDD. Some examples of actions supported by the Spark API include reduce(), count(), first(), and foreach().
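The split between transformations and actions can be imitated in plain Python. `ToyRDD` below is a hypothetical stand-in written for this illustration, not the Spark API: transformations only record a step and return a new object, and nothing is computed until an action replays the recorded steps.

```python
import functools


class ToyRDD:
    """Immutable, lazily evaluated dataset (an illustration, not Spark)."""

    def __init__(self, source, steps=()):
        self._source = source  # base data, never modified
        self._steps = steps    # recorded transformations, in order

    # Transformations: return a new ToyRDD and compute nothing yet.
    def map(self, fn):
        return ToyRDD(self._source, self._steps + (("map", fn),))

    def filter(self, fn):
        return ToyRDD(self._source, self._steps + (("filter", fn),))

    def _compute(self):
        data = iter(self._source)
        for kind, fn in self._steps:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return data

    # Actions: replay the recorded steps and return a value.
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())

    def reduce(self, fn):
        return functools.reduce(fn, self._compute())


numbers = ToyRDD(range(1, 11))
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(even_squares.collect())
print(even_squares.reduce(lambda a, b: a + b))
print(numbers.count())
```

Note that `numbers` is unchanged after `even_squares` is derived from it, which mirrors the immutability of RDDs; the recorded steps play the role of the lineage that Spark uses to recompute lost partitions.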
Directed Acyclic Graph (DAG)
DAG refers to a directed acyclic graph. This approach is an important feature of real-time Big Data platforms. Tools such as Storm, Spark, and Tez offer new capabilities for building highly interactive, real-time computing systems to power real-time BI, predictive analytics, real-time marketing, and other critical systems.
DAGScheduler is the scheduling layer of Apache Spark that implements stage-oriented scheduling: after an RDD action has been called, it becomes a job that is then transformed into a set of stages that are submitted as TaskSets for execution. In general, DAGScheduler does three things in Spark: it computes an execution DAG (a DAG of stages) for a job; it determines the preferred locations to run each task on; and it handles failures due to shuffle output files being lost.
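The idea of ordering stages by their dependencies can be sketched with Python's standard `graphlib` module (Python 3.9 or later). The stage names and dependencies below are made up for illustration, and the sketch covers only the ordering step, not locality or failure handling.

```python
from graphlib import TopologicalSorter

# Hypothetical stage dependencies: stage -> set of stages it depends on.
stage_dependencies = {
    "read_logs": set(),
    "read_users": set(),
    "parse": {"read_logs"},
    "join": {"parse", "read_users"},
    "aggregate": {"join"},
}


def schedule(dependencies):
    """Group stages into waves; every stage in a wave has all parents finished."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


print(schedule(stage_dependencies))
```

Stages that land in the same wave have no dependency on each other, so a scheduler is free to run them in parallel.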
Spark Ecosystem
Spark is an integrated stack of tools responsible for scheduling, distributing, and monitoring applications consisting of many computational tasks across many worker machines, or a computing cluster. Spark is written primarily in Scala, but includes code from Python, Java, R, and other languages. Spark comes with a set of integrated tools that reduce learning time and deliver higher user productivity. The Spark ecosystem includes the Mesos resource manager and other tools.
Spark has already overtaken Hadoop in general, because of the benefits it provides in terms of faster execution of iterative processing algorithms.
Spark for big data processing
Spark supports big data mining through relevant libraries including MLlib, GraphX and SparkR, and through the Spark SQL language and the Streaming library.
MLlib
MLlib is Spark’s machine learning library. It consists of basic machine learning algorithms such as classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as lower-level optimization primitives and higher-level pipeline APIs. At the same time, we care about algorithmic performance. Spark excels at iterative computation, enabling MLlib to run fast. So MLlib also contains high-quality algorithms that leverage iteration, and can yield better results than the one-pass approximations sometimes used on MapReduce. In addition, Spark MLlib is easy to use and supports Scala, Java, Python, and SparkR.
For example, decision trees are a popular data classification technique. Spark MLlib supports decision trees for binary and multiclass classification, using both continuous and categorical features. The implementation partitions data by rows, allowing distributed training with millions of instances.
Functions in Decision Trees
Signature: public static DecisionTreeModel trainClassifier(…)
Method to train a decision tree model for binary or multiclass classification.
Parameters:
• input - Training dataset: RDD of LabeledPoint. Labels should take values {0, 1, …, numClasses-1}.
• numClassesForClassification - Number of classes for classification.
• categoricalFeaturesInfo - Map storing the arity of categorical features.
• impurity - Criterion used for information gain calculation. Supported values: “gini” or “entropy”.
• maxDepth - Maximum depth of the tree (suggested value: 4).
• maxBins - Maximum number of bins used for splitting features (suggested value: 100).
Returns: DecisionTreeModel that can be used for prediction
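The impurity parameter selects the criterion used to score candidate splits. As a rough illustration, the standard textbook definitions of Gini impurity and entropy, and the information gain of a split, can be written in a few lines of Python. The function names are chosen for this sketch and are not MLlib's internals.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits of the class distribution."""
    total = len(labels)
    return sum(
        -(count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(parent, left, right, impurity=gini):
    """Impurity of the parent minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted


labels = [0, 0, 1, 1]
print(gini(labels), entropy(labels))
print(information_gain(labels, [0, 0], [1, 1]))
print(information_gain(labels, [0, 0], [1, 1], impurity=entropy))
```

A split that separates the classes perfectly recovers all of the parent's impurity as gain; the tree builder picks, at each node, the candidate split with the highest gain.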
Spark GraphX
Efficient processing of large graphs is another important and challenging issue. Many practical computing problems concern large graphs. For example, Google has to run its PageRank on billions of web pages and maybe trillions of web links. GraphX is a new component in Spark for graphs and graph-parallel computation. At a high level, GraphX extends the Spark RDD by introducing a new Graph abstraction: a directed multi-graph with properties attached to each vertex and edge.
To support graph computation, GraphX exposes a set of fundamental operators such as subgraph, joinVertices, and aggregateMessages, on the basis of an optimized variant of the Pregel API (Pregel is the system at Google that powers PageRank). In addition, GraphX includes a growing collection of graph algorithms and builders to simplify graph analytics tasks.
We compute the PageRank of each page as follows:
import org.apache.spark.graphx.GraphLoader
// load the edges as a graph object
val graph = GraphLoader.edgeListFile(sc, "outlink.txt")
// run PageRank until it converges to the given tolerance
val ranks = graph.pageRank(0.00000001).vertices
// join the ranks with the web pages
val pages = sc.textFile("pages.txt").map { line =>
  val fields = line.split(",")
  (fields(0).toLong, fields(1))
}
val ranksByPagename = pages.join(ranks).map { case (id, (pagename, rank)) => (pagename, rank) }
// print the output
println(ranksByPagename.collect().mkString("\n"))
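For readers who want to see what the PageRank computation itself does, here is a small single-machine sketch in Python using plain power iteration. It is not GraphX's implementation: the damping factor of 0.85, the normalization of ranks to sum to 1, and the handling of pages without out-links are assumptions made for this illustration, and the edge list is invented.

```python
def pagerank(edges, damping=0.85, tolerance=1e-8, max_iterations=100):
    """Iterate rank updates until the total change falls below the tolerance."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    n = len(nodes)
    ranks = {node: 1.0 / n for node in nodes}
    for _ in range(max_iterations):
        # Rank held by pages without out-links is spread evenly over all pages.
        dangling = sum(ranks[node] for node in nodes if not out_links[node])
        new_ranks = {node: (1 - damping) / n + damping * dangling / n for node in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_ranks[dst] += damping * ranks[src] / len(out_links[src])
        change = sum(abs(new_ranks[node] - ranks[node]) for node in nodes)
        ranks = new_ranks
        if change < tolerance:
            break
    return ranks


# Hypothetical link graph: pages 1, 2, 3 form a cycle and page 4 links to page 3.
edges = [(1, 2), (2, 3), (3, 1), (4, 3)]
ranks = pagerank(edges)
for page, rank in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(rank, 4))
```

Page 3 ends up with the highest rank because it receives links from two pages, while page 4, which nothing links to, keeps only the baseline share.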
SparkR
R is a popular statistical programming language with a number of extensions that support data processing and machine learning tasks. However, interactive data analysis in R is usually limited, as the runtime is single-threaded and can only process data sets that fit in a single machine’s memory. SparkR, an R package initially developed at the AMPLab, provides an R frontend to Apache Spark; using Spark’s distributed computation engine, it allows us to run large-scale data analysis from the R shell. SparkR exposes the RDD API of Spark as distributed lists in R. For example, one can read an input file from HDFS and process every line using lapply on an RDD. A short example follows:
sc <- sparkR.init("local")
lines <- textFile(sc, "hdfs://data.txt")
wordsPerLine <- lapply(lines, function(line) { length(unlist(strsplit(line, " "))) })
In addition to lapply, SparkR also allows closures to be applied on every partition using lapplyWithPartition. Other supported RDD functions include operations like reduce, reduceByKey, groupByKey and collect.
Spark SQL
Spark SQL is a language provided to deal with structured data. Using it, one can run queries on the data and get meaningful results. It supports queries through SQL as well as HQL (Hive Query Language), which is Apache Hive’s version of SQL.
Spark Streaming
Spark Streaming receives data streams from input sources, processes them in a cluster, and pushes the results out to databases or dashboards. Spark chops the data streams into batches of a few seconds each. Spark treats each batch of data as an RDD and processes it using RDD operations. The processed results are pushed out as batches.
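The micro-batching idea can be sketched in plain Python. The event tuples, the two-second batch interval, and the function name are invented for this illustration; the sketch assumes events arrive sorted by timestamp and simply groups them into consecutive fixed-length windows, each of which is then processed as a small batch.

```python
from collections import Counter
from itertools import groupby


def micro_batches(events, batch_seconds):
    """Yield (window_start, [payloads]) for consecutive fixed-length windows.

    events is an iterable of (timestamp_seconds, payload), sorted by timestamp.
    """
    def window_start(event):
        return int(event[0] // batch_seconds) * batch_seconds

    for start, group in groupby(events, key=window_start):
        yield start, [payload for _, payload in group]


# Hypothetical request log: (arrival time in seconds, request).
events = [
    (0.5, "GET /"),
    (1.2, "GET /cart"),
    (2.1, "GET /"),
    (2.9, "POST /pay"),
    (5.0, "GET /"),
]

for start, batch in micro_batches(events, batch_seconds=2):
    # Each batch is processed with ordinary batch operations, here a count.
    print(start, dict(Counter(batch)))
```

Shorter batch intervals lower the latency of results but increase the per-batch scheduling overhead, which is the main tuning trade-off in this style of stream processing.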
Spark applications
Some hot data problems that are solved well by a tool like Apache Spark include:
1. Real-time log data monitoring.
2. Massive natural language processing.
3. Large-scale online recommendation systems.
A simple word count application can be run in the Spark shell as below.
val textFile = sc.textFile("C:\\Users\\MyName\\Documents\\obamaSpeech.txt")
// Comment: loads the text file as the RDD textFile
val counts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
// Comment: counts the occurrences of each word, splitting lines on spaces
counts.count()
// Result: the number of distinct words, shown by the shell as below
// Long = 52
counts.saveAsTextFile("C:\\Users\\MyName\\Desktop\\counts1")
// Comment: saves the counts to a folder on my Desktop
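To make the three steps of the word count easier to follow, here is a rough plain-Python equivalent of flatMap, map and reduceByKey. It is a sketch only: it runs on a single machine, the function name word_count and the sample lines are hypothetical, and unlike the Scala split it drops empty strings produced by repeated spaces.

```python
from collections import defaultdict

def word_count(lines):
    # flatMap: split every line on spaces and flatten into one list of words
    words = [w for line in lines for w in line.split(" ") if w]
    # map: pair every word with the count 1
    pairs = [(w, 1) for w in words]
    # reduceByKey: add up the 1s for each distinct word
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

counts = word_count(["to be or not", "to be"])
print(counts)       # per-word totals
print(len(counts))  # number of distinct words, the analogue of counts.count()
```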
Spark vs Hadoop
Spark and Hadoop are both popular Apache projects dedicated to big data processing. Hadoop, for many years, was the leading open source big data platform, and many companies already use a distributed computing framework like Hadoop based on MapReduce. Table 9.1 provides a summary of the differences between Hadoop and Spark.
Feature | Hadoop | Spark
Purpose | Resilient, cost-effective storage and processing of large data sets | Fast, general-purpose engine for large-scale data processing
Core component | Hadoop Distributed File System (HDFS) | Spark Core, the in-memory processing engine
Storage | HDFS manages massive data collections across multiple nodes within a cluster of commodity servers | Spark doesn't do distributed storage. It operates on distributed data collections.
Fault tolerance | Hadoop uses replication to achieve fault tolerance | Spark uses RDDs for fault tolerance, which minimizes network I/O
Nature of processing | Accompanied by MapReduce, it includes batch processing of this data in parallel mode | Batch as well as stream processing
Sweet spot | Batch processing | Iterative and interactive processing jobs that can fit in memory
Processing speed | MapReduce is slow | Spark can be up to 10x faster than MapReduce for batch processing and up to 100x faster for stream processing
Security | More secure | Less secure
Failure recovery | Hadoop can recover from system faults or failures, since data are written to disk after every operation | With Spark, data objects are stored in RDDs. These can be reconstructed after faults or failures
Analytics tools | Separate engine | Built-in MLlib (Machine Learning) and GraphX (Graph Processing) libraries
Compatibility | Primary storage model is HDFS | Compatibility with HDFS and other storage formats
Language support | Java | Scala is the native language. APIs for Python, Java, R, and others.
Driving organization | Yahoo | AMPLab at UC Berkeley
Technology owners | Apache, open-source, free | Open-source, free
Key distributors | Cloudera, Hortonworks, MapR | Databricks, AMPLab
Cost of system | Medium to high | Medium to high
Conclusion
Spark is a new integrated system for big data processing. Its most important core abstraction is the RDD, along with relevant libraries like MLlib and GraphX. Spark is a powerful open source processing engine built around speed, ease of use, and sophisticated analytics.
Review Questions
Q1: Describe the Spark ecosystem.
Q2: Compare Spark and Hadoop in terms of their ability to do stream computing.
Q3: What is an RDD? How does it make Spark faster?
Q4: Describe three major capabilities in Spark for data analytics.
Chapter 8 – Ingesting Data
Wholeness
A data ingesting system is a reliable and efficient point of reception for all data coming into a system. This system is designed to be flexible and scalable to receive data from various sources, at various times, speeds, and quantities. The ingest system makes the data available for use by the target applications in real time. Ideally, all data would be smoothly received and made available for downstream applications to securely and reliably access at their own convenience. A dedicated data ingest mechanism is achieved by creating a fast and flexible buffer for receiving and storing all incoming streams of data. The data in the buffer is stored in a sequential manner, and is made available to all consuming applications in a fast and orderly manner.
Big Data arrives into a system at unpredictable speeds and quantities. Business applications thereafter receive and process this data at some planned throughput capacity. An ingest buffer is needed to communicate the data without loss of data or speed. This buffer idea has historically been called a messaging system, not too dissimilar from a mailbox system at the post office. Incoming messages are put into a set of organized locations, from where the target applications receive them when they are ready.
With huge amounts of data coming in from different sources, and many more consuming applications, a point-to-point system of delivering messages becomes inadequate and slow. Alternatively, incoming data can be categorized into certain topics, and stored in the respective location or locations for those topics. Instead of data being received and held in storage for a specific target application, the data may now be consumed by any application that is interested in data related to a topic. Each consuming application can choose to read data about one or more topics of interest. This is called the publish-and-subscribe system.
Messaging Systems
A messaging system is an asynchronous mode of communicating data between applications. There are two generic kinds of messaging systems: a point-to-point system and a publish-subscribe (pub-sub) system. Most messaging patterns now follow the pub-sub model.
Point to Point Messaging System
In a point-to-point system, every message is directed at a particular receiver. A common queue can receive messages from many producers. Any particular message can be received and consumed by only one receiver. Once that target consumer reads a message in the queue, that message disappears from that queue. The typical example of this system is an Order Processing System, where each order will be processed by one Order Processor.
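A minimal sketch of the point-to-point behavior, assuming a single in-memory queue: once one receiver takes a message, no other receiver can see it. The class name PointToPointQueue and the order names are hypothetical and stand in for a real queuing product.

```python
from collections import deque

class PointToPointQueue:
    """A shared queue in which each message is delivered to exactly one receiver."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        self._messages.append(message)

    def receive(self):
        """Remove and return the oldest message, or None if the queue is empty."""
        return self._messages.popleft() if self._messages else None

orders = PointToPointQueue()
for order in ["order-1", "order-2", "order-3"]:
    orders.send(order)

processor_a = orders.receive()   # takes the first order; it is now gone from the queue
processor_b = orders.receive()   # a second processor gets the next order, not the same one
print(processor_a, processor_b)
```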
Publish-Subscribe Messaging System
In a pub-sub messaging system, the applications publish their output to a standard messaging queue. The target recipient only needs to know where to get the message, whenever it is ready to pick up the message. Applications can thus ignore the mechanics of interaction with other applications, and simply care about the message itself. This is especially valuable when there may be many target recipients for a message. In a pub-sub system, messages are entered into the messaging queue asynchronously from client applications.
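By contrast with a point-to-point queue, a pub-sub system keeps each message available to every interested reader. The following sketch assumes an in-memory store keyed by topic; the class name TopicBus and the topic names are hypothetical.

```python
from collections import defaultdict

class TopicBus:
    """Messages are stored per topic and are not removed when they are read."""

    def __init__(self):
        self._log = defaultdict(list)          # topic name -> list of messages

    def publish(self, topic, message):
        self._log[topic].append(message)

    def read(self, topic, start=0):
        """Return all messages on a topic from position `start` onward."""
        return list(self._log.get(topic, [])[start:])

bus = TopicBus()
bus.publish("clicks", "user 1 clicked Buy")
bus.publish("clicks", "user 2 clicked Help")
bus.publish("searches", "user 1 searched 'kafka'")

# Two independent subscribers read the same topic and both see every message.
analytics_view = bus.read("clicks")
billing_view = bus.read("clicks")
print(analytics_view == billing_view)
```

The publisher never needs to know how many subscribers exist, which is the decoupling the text describes.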
A message queuing system needs to be fast and secure to serve many applications, both producers and subscribers. Messages are also replicated across multiple locations for reliability of data.
There are two popular data ingesting systems used in Big Data. An older system, called Flume, is closely tied to the Hadoop distributed file system. The newer and more popular system is a general-purpose system called Apache Kafka. In this chapter we will discuss the newer system, Kafka.
Apache Kafka
Apache Kafka is an open source publish-and-subscribe message broker system. Kafka aims to provide an integrated high-throughput, low-latency messaging platform for handling real-time data feeds. In the abstract, it is a single point of contact between all producers and consumers of data. All producers of data send data to Kafka. All consumers of data read data from Kafka (Figure 8.1).
Figure 8-1: Kafka core idea
Kafka is a distributed, partitioned, scalable, replicated messaging system, with a simple but unique design. It was initially developed by LinkedIn and was open sourced in early 2011. The Apache Software Foundation is now responsible for its development and improvement. Kafka is valuable for enterprise-level infrastructure because of its simplicity and scalability. Kafka is written in the high-level Scala programming language.
Use Cases
Following are some popular use cases of Apache Kafka.
Messaging
Kafka is a very good alternative to a traditional message broker because the Kafka messaging system has better throughput, built-in partitioning, replication, and better fault tolerance. Kafka is a very good solution for large-scale message processing applications.
Website Activity Tracking
Website activity tracking was one of the initial use cases for Kafka at LinkedIn. The users' online activity tracking pipeline was rebuilt as a set of real-time data feeds. General web activity tracking involves a very large volume of data, and Kafka is very good at handling this volume. User activity types such as page views, searches, clicks, etc. can be designated as central topics, and the activity data can be published to those topics. Those events are then available for real-time or offline processing and reporting.
Stream Processing
Popular frameworks such as Storm and Spark Streaming read data from a topic, process it, and send it to other users and consumer applications. They may even write it back to Kafka under a new topic. Kafka's strong durability is also very useful for stream processing.
Log Aggregation
Activity log aggregation typically gathers physical log files from servers and puts them all in a central place for processing. Kafka can abstract away the details of the files and provide a cleaner abstraction of log data as a stream of messages. Use of Kafka then allows for lower-latency processing and easier support for multiple data sources and distributed data consumption. Unlike dedicated log-centric systems, Kafka offers higher performance and stronger durability guarantees due to replication.
Commit Log
Kafka can be used as an external commit log for a distributed database system. This log helps to re-sync data between nodes, and lets failed nodes restore their data. The log compaction feature in Kafka helps to support this usage more efficiently.
Kafka Architecture
In the abstract, Kafka brokers deal with producers and consumers of data. A producer pushes data into the ingest system at its own speed, scale and convenience. A consumer pulls data out of the system at its own speed, scale and convenience. All the received data is organized by categories, called topics. Incoming data is sorted and stored into topic servers. The consumers of data can subscribe to one or more topics (Figure 8.2).
Figure 8-2: Kafka Ecosystem
There is more than one broker (also called a server) for each topic, for reliability of the messaging system. Thus two or more brokers will store data on each topic. Only one broker can be the leader for a given partition at any given time. If the lead broker fails, a second one can automatically take over and prevent the loss of access to data.
Kafka is designed for distributed high-throughput systems. In comparison to other messaging systems, Kafka has better throughput, built-in partitioning, replication and inherent fault tolerance, which makes it a good fit for large-scale message processing applications. It has the ability to handle a large number of diverse consumers. It integrates very well with Apache Storm, Spark and other real-time streaming data applications. Kafka is very fast and can perform 2 million writes/sec. With replication properly configured, it is designed to avoid downtime and data loss.
There are a lot of contributing organizations helping to improve the Kafka open-source system. It has very well documented online resources. It has been used by many big organizations such as LinkedIn, Cisco Systems, Spotify, PayPal, HubSpot, Shopify, Uber and more. HubSpot uses Kafka to deliver real-time notification of when a recipient opens their email. PayPal uses Kafka to process millions of updates in a minute.
Producers
A producer is responsible for selecting the topic, and the partition within it, for the message that it wants to convey. It can use a round-robin algorithm to balance the load among partitions. There can be both synchronous and asynchronous producers for producing messages and publishing them to the partition.
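The round-robin choice of partition mentioned above could look roughly like this. It is a sketch with in-memory lists standing in for partitions; the class name RoundRobinProducer is hypothetical and this is not the partitioner shipped with any Kafka client.

```python
from itertools import count

class RoundRobinProducer:
    """Spreads messages evenly by sending each one to the next partition in turn."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]
        self._counter = count()              # 0, 1, 2, ... one tick per message

    def send(self, message):
        index = next(self._counter) % len(self.partitions)
        self.partitions[index].append(message)
        return index                         # the partition that received the message

producer = RoundRobinProducer(num_partitions=3)
chosen = [producer.send(f"msg-{i}") for i in range(7)]
sizes = [len(p) for p in producer.partitions]
print(chosen, sizes)
```

With seven messages and three partitions, no partition ends up more than one message ahead of another, which is the load balancing the text refers to.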
Consumers
A consumer is responsible for reading the data about the topics to which it has subscribed. The consumer is responsible for reading the data within a reasonable period of time, before the queues are emptied for efficient management of storage. Different consuming applications can read the data at different times. Kafka has stronger ordering guarantees than a traditional messaging system. A consumer needs to know how far it has read in that queue, so as to avoid reading duplicates or losing some data.
Broker
A broker is a server in a Kafka cluster. The cluster may have many such servers or brokers.
Topic
A topic is a category into which messages are published. For each topic there is a separate partitioned log for storage of messages. Each partition has an ordered sequence of messages for that topic. Each message in the partition is assigned a unique sequential number, also called the offset. This offset helps to identify each message within the partition.
The consumer reads the data sequentially according to offset numbers. The consumer maintains the offset to remember how far it has read. Generally, the offset increases linearly as messages are consumed. However, a consumer can reset the offset to access the data again and reprocess it as needed.
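The relationship between a partition, its offsets and a consumer's position can be sketched as follows. This is an in-memory illustration only; the class names PartitionLog and Consumer and the methods poll and seek are hypothetical here and merely resemble what real client libraries offer.

```python
class PartitionLog:
    """An append-only sequence; a message's offset is its position in the list."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        self._messages.append(message)
        return len(self._messages) - 1       # offset assigned to this message

    def read(self, offset, max_messages=10):
        return self._messages[offset:offset + max_messages]

class Consumer:
    """Keeps its own offset so it knows how far it has read."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self, max_messages=10):
        batch = self.log.read(self.offset, max_messages)
        self.offset += len(batch)            # advance past what was just read
        return batch

    def seek(self, offset):
        self.offset = offset                 # rewind (or skip ahead) to reprocess

log = PartitionLog()
offsets = [log.append(f"m{i}") for i in range(5)]

consumer = Consumer(log)
first = consumer.poll(max_messages=3)        # reads offsets 0..2
second = consumer.poll()                     # reads the rest, offsets 3..4
consumer.seek(1)                             # reset the offset to re-read old data
replay = consumer.poll(max_messages=2)
print(first, second, replay)
```

Reading does not remove anything from the log, so a second consumer with its own offset could read the same messages at a different pace.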
The Kafka cluster keeps all the published messages, whether or not they have been consumed, for a configurable period of time. For example, if the log retention is set to seven days, then for the seven days after publishing, the message is available for consumption. After seven days, Kafka discards the messages to free up space.
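The seven-day example above amounts to a simple age check. The sketch below assumes each stored message carries a publish timestamp in seconds; the function name apply_retention and the sample data are hypothetical, and a real broker discards data in larger chunks rather than message by message.

```python
SEVEN_DAYS = 7 * 24 * 60 * 60   # retention period in seconds

def apply_retention(log, now, retention_seconds=SEVEN_DAYS):
    """Keep only (timestamp, message) entries published within the retention period.

    Whether a message has been consumed plays no part in the decision.
    """
    cutoff = now - retention_seconds
    return [(ts, msg) for ts, msg in log if ts > cutoff]

DAY = 24 * 60 * 60
log = [(0 * DAY, "published on day 0"),
       (3 * DAY, "published on day 3"),
       (8 * DAY, "published on day 8")]

kept = apply_retention(log, now=9 * DAY)
print([msg for _, msg in kept])
```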
Kafka's performance is not affected by the size of data. Each partition must fit on the servers that host it, but a topic may have multiple partitions. This enables Kafka to manage an arbitrary amount of data. Also, the partition acts as the unit of parallelism.
Summary of Key Attributes
1. Disk based: Kafka works on a cluster of disks. It does not keep everything in memory, and keeps writing to the disk to make the storage permanent.
2. Fault tolerant: Data in Kafka is replicated across multiple brokers. When any leader broker fails, a follower broker takes over as leader and everything continues to work normally.
3. Scalable: Kafka can scale up easily by adding more partitions or more brokers. More brokers help to spread the load and this provides greater throughput.
4. Low latency: Kafka does very little processing on the data. Thus it has very low latency. Messages produced by the producer are published and available to the consumer within a few milliseconds.
5. Finite retention: Kafka by default keeps the message in the cluster for a week. After that the storage is refreshed. Thus the data consumers have up to a week to catch up on data, in case they fall behind for any reason.
Distribution
The Kafka cluster maintains multiple servers over the distributed network. The partitions of the log are maintained over this network. Each server handles data and requests for a share of the partitions. Each partition is replicated across a configurable number of servers for fault tolerance. For each partition, one of the servers acts as the main server, also called the "leader", while it may or may not have one or more secondary servers, also known as "followers". The leader is responsible for handling all the read and write operations for the partition, while the followers silently replicate the leader. The followers become very helpful when the leader fails: one of the followers automatically becomes the new leader and takes over the work. One server can be a leader for some of the partitions on it, while it may be a follower for other partitions. Thus one server can act as both leader and follower. This helps to balance the workload on the servers within the cluster.
Guarantees
Messages sent by a producer to a particular topic partition maintain the order in which they were sent. For example, if messages M1 and M2 were sent by the same producer and M1 was sent first, then M1 will have a lower offset than M2. Therefore, M1 will always appear before M2 for the consumer.
Each topic has a replication factor N, and the system can tolerate up to N-1 server failures without losing any messages committed to the log.
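The leader-follower scheme and the N-1 guarantee can be illustrated with a toy model. It assumes every live replica always holds a complete copy of the log, which is a simplification; the class name ReplicatedPartition and the broker names are hypothetical.

```python
class ReplicatedPartition:
    """One partition replicated on several brokers; the first live broker leads."""

    def __init__(self, replicas):
        self.alive = list(replicas)   # replication factor N = len(replicas)
        self.log = []                 # assumed to be fully copied to every live replica

    @property
    def leader(self):
        return self.alive[0] if self.alive else None

    def write(self, message):
        if self.leader is None:
            raise RuntimeError("no replica left to serve the partition")
        self.log.append(message)      # the leader accepts the write; followers copy it

    def fail(self, broker):
        # If the failed broker was the leader, the next follower takes over.
        self.alive.remove(broker)

partition = ReplicatedPartition(["broker-1", "broker-2", "broker-3"])   # N = 3
partition.write("M1")

partition.fail("broker-1")
leader_after_one = partition.leader
partition.write("M2")

partition.fail("broker-2")            # second failure, still within N-1 = 2
leader_after_two = partition.leader
print(leader_after_one, leader_after_two, partition.log)
```

After two of the three brokers have failed, the remaining broker still serves both committed messages in their original order. A third failure would leave no replica, which is why the tolerance stops at N-1.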
Client Libraries
Kafka supports the following client libraries:
1. Python: Pure Python implementation with full protocol support. Consumer and Producer implementations are also included.
2. C: High-performance C library with full protocol support.
3. C++, Ruby, JavaScript and more.
Apache ZooKeeper
Kafka is built on top of ZooKeeper. Apache ZooKeeper is a distributed configuration and synchronization service used in Hadoop clusters. Here it serves as the coordination interface between the Kafka brokers and consumers. The Kafka servers store basic metadata in ZooKeeper and share information about topics, brokers, consumer offsets (queue readers) and so on.
Since ZooKeeper does its own layers of replication, the failure of a Kafka broker does not affect the state of the Kafka cluster. Even if ZooKeeper fails, Kafka will restore the state once ZooKeeper restarts. This helps Kafka avoid downtime. ZooKeeper also manages the selection of an alternative leader broker, in case of a Kafka leader failure.
Kafka Producer example in Java
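// Requires the kafka-clients library (producer classes and ByteArraySerializer)
// Configure the producer
Properties config = new Properties();
config.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:8082");
config.setProperty(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
config.setProperty(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(config);
// Build a record for the topic and send it asynchronously
ProducerRecord<byte[], byte[]> record = new ProducerRecord<>("topic", "key".getBytes(), "value".getBytes());
Future<RecordMetadata> response = producer.send(record);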
Conclusion
Big data is ingested using a dedicated system. These often take the form of messaging systems. Publish-and-subscribe systems are efficient ways of delivering data from many sources to many targets, in a reliable, secure and efficient way. Kafka is an open-source, reliable, secure, and scalable publish-subscribe messaging system. It deals with producers as well as consumers of data. Messages are published to a set of central topics. Each consumer can subscribe to any number of topics. Kafka uses a leader-follower system of managing replicated partitions for the same set of data, to ensure full reliability and zero downtime.
Review Questions
Q1: What is a data ingest system? Why is it an important topic?
Q2: What are the two ways of delivering data from many sources to many targets?
Q3: What is Kafka? What are its advantages? Describe 3 use cases of Kafka.
Q4: What is a topic? How does it help with data ingest management?
References
1. http://kafka.apache.org/documentation.html#introduction
Chapter 9 – Cloud Computing Primer
Cloud computing is a cost-effective and flexible mode of delivering IT infrastructure as a service to clients, over the internet, on a metered basis. The cloud computing model offers clients enormous flexibility to use as much IT capacity (compute, storage, network) as needed without having to invest in dedicated IT capacity of one's own. IT usage can be scaled up or down in minutes. The complex IT infrastructure management skills are all owned by the cloud computing provider, and problems can be resolved much faster. The client can simply access a smoothly running IT infrastructure over a fast internet connection. IT capacity in the cloud can be purchased as a custom package depending upon one's needs in terms of average and peak IT requirements. The computing cloud is the ultimate cosmic computer aligned with all laws of nature.
Introduction
Managing very large and fast data streams is a huge challenge. It requires making critical decisions about their storage, structure, and access. This data would be stored in large clusters of hundreds or thousands of inexpensive computers. Such clusters are often called server farms. The location and size of such clusters impact costs. The server farms may be located in the organization's own data centers, or they may be rented from specialized third-party organizations called cloud computing service providers.
Cloud computing provides IT leadership with a cost-effective and predictable solution for reliably meeting their large data management needs. There are many vendors offering this service. Prices keep dropping regularly, because IT components keep getting cheaper, there is a growing volume of business, and there is effective competition. With cloud computing, the IT expense becomes an operating expense rather than a capital expense. The costs of IT become aligned with revenue streams, which makes cash flow management easier.
One of the main reasons for enterprises moving to cloud computing is to experiment with new and risky projects. This flexible model makes it much easier to launch new products and services, without being exposed to the risk of a heavy loss in IT infrastructure. For example, a new Hollywood movie's website will have millions of visitors for a month before and for a month after the movie's release date. After that the visits to the website will drop dramatically. The website owner would benefit enormously from using a cloud computing model where they pay for the peak web usage capacity for those few months, and much less as the usage drops. More importantly, the flexibility ensures that their website will not crash in case the movie becomes a super-hit and attracts an unusually large number of visitors.
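The movie website example can be turned into a small piece of arithmetic. All figures below are made up purely for illustration (server counts, the price per server-month, and the assumption that owned and rented servers cost the same per month); the function names are hypothetical as well.

```python
def fixed_capacity_cost(peak_servers, months, cost_per_server_month):
    """Cost of owning enough servers for the peak, paid for in every month."""
    return peak_servers * months * cost_per_server_month

def pay_per_use_cost(servers_by_month, cost_per_server_month):
    """Cost of paying only for the servers actually used in each month."""
    return sum(servers_by_month) * cost_per_server_month

# Hypothetical usage: two peak months around the release, then ten quiet months.
usage = [100, 100] + [5] * 10
rate = 50   # hypothetical price of one server for one month

fixed = fixed_capacity_cost(max(usage), len(usage), rate)   # 100 * 12 * 50
cloud = pay_per_use_cost(usage, rate)                       # (200 + 50) * 50
print(fixed, cloud, fixed - cloud)
```

Under these assumed numbers the pay-per-use bill is a fraction of the fixed-capacity bill. The gap narrows as usage becomes steadier, which is why the model suits spiky workloads best.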
Cloud Computing Characteristics
Here are the major characteristics of a cloud computing model.
1. Flexible capacity: The capacity can scale up rapidly. One can expand and reduce resources according to one's specific service requirements, as and when needed. The cloud internally does regular workload balancing among the needs of millions of clients, and this helps bring down costs for everyone.
2. Attractive payment model: Cloud computing works on a pay-per-use model, i.e. one pays only for what one uses, and for how long one uses it. IT costs become an operating expense rather than a capital expense for the client. Resource prices may be negotiated at long-term contract rates, and resources can also be purchased at spot market rates.
3. Resiliency and security: The failure of any individual server or storage resource does not impact the user. The servers and storage for all clients are isolated to maximize the security of data.
In-house storage
Most organizations have data centers for running their regular IT operations. An organization may decide to expand its own data center to store large streams of data. The organization can ensure complete security and privacy of its data if it keeps all the data in-house. However, the costs and complexity of managing this data are increasing, and it is not cost-effective for every organization to manage huge data centers. Hiring and retaining people with the scarce advanced skills to manage such data centers would also be a challenge.
Cloud storage
It is now becoming a trend for organizations to choose to store their data in massive data centers owned by other specialized companies. Their data and processing capacity reside in some sort of huge cloud out there, which is accessible from anywhere, anytime, through a simple internet connection.
Companies like Amazon, Google, Microsoft, Apple, and IBM are among the major providers of cloud storage and computing services around the world. They own and operate data centers with millions of computers in them.
Figure 0-1: A cloud computing data center
Commercially, cloud service providers are able to consolidate the requirements of thousands or millions of customers, and make flexible amounts of data storage and computing capacity available to clients on a per-usage basis. This pay model is similar to how electric utility companies charge consumers for their usage of electricity in homes and offices. Cloud computing offers much lower costs per use, just as using the electric utility costs much less than owning and operating one's own electricity generators.
A major disadvantage of cloud storage is that the data is stored away from one's physical control. Thus the security of precious data is left in the hands of the cloud computing provider. While the security protocols are rapidly improving, there are no failsafe methods for securing data in the cloud. There is also a risk of being locked into one provider's infrastructure. The cost-benefit tradeoffs have definitely tilted towards using cloud computing providers. At some future point in time, the cloud services providers might be heavily regulated like the electric utilities.
Cloud Computing: Evolution of Virtualized Architecture
Cloud computing is essentially a commercial model for virtualized server infrastructure. IBM began to offer time-sharing services on its mainframe computers in the 1960s. Now that same technology is offered on networks of small machines through the virtualization process.
Virtualization assumes that logical machines can be differentiated from physical machines. A physical server could run multiple Virtual Machines (VMs), and one virtual machine may span multiple physical servers. The virtualization software is called a hypervisor. It abstracts all machines into Virtual Machines, using an easy GUI. Virtualization software can typically run on a heterogeneous physical infrastructure, and convert all IT capacity into a single unified capacity. This capacity can then be provisioned in slices and packages. The user applications are not aware that they are running in a virtualized environment, so they run as if on a dedicated machine. The applications can also run on top of their own native operating systems.
Cloud Service Models
There are two major dimensions to conceptualize the cloud computing models: the scope of services received, and the control over and cost of those services.
1. The range of cloud computing services from a cloud computing provider falls into three broad buckets:
   1. Infrastructure as a service (IaaS): This is the lowest level of service, and includes only raw capacity of compute, storage, and networking. The price for this service is the lowest.
   2. Platform as a service (PaaS): This includes IaaS, along with other technologies and services. These are still very general tools such as an open source Hadoop, Spark or Cassandra implementation, along with certain monitoring tools. The costs are a little higher because of the additional management and monitoring services provided by the provider.
   3. Software as a service (SaaS): This includes the computing platform as well as business applications that get work done. For example, salesforce.com was one of the first CRM applications sold only on a SaaS model. Google sells an email service to organizations on a per-user-per-month basis. This is also the most expensive type of cloud service.
2. The other way the cloud services differ is in terms of ownership and control.
   1. Public cloud: This is a large shared infrastructure made available to one and all, in a low-cost and multi-tenancy model. The client can access it using any device. The downside is that the data also resides on the cloud, and thus could be vulnerable to theft or hacking. The costs to the client are low, and variable depending upon use.
   2. Private cloud: This is a cloud version of an in-house IT infrastructure. The organization will have exclusive control over the entire infrastructure. The costs would be fixed and higher.
   3. Hybrid cloud: This is a mix of flexibility of capacity, and much control over some key aspects of it. One could retain complete control over critical applications, while using shared infrastructure for non-critical applications.
All levels of infrastructure and pay models are useful, as they serve different levels of needs for client organizations. However, most of the growth in cloud computing is happening because of the attractiveness of the low cost of the public cloud model.
Cloud Computing Myths
There are a couple of misconceptions about the costs and benefits of cloud computing.
1. Myth: Public cloud computing will satisfy all requirements: scalability, flexibility, pay-per-use, resilience, multi-tenancy, and security. In reality, depending upon the type of service selected (SaaS, IaaS, or PaaS), the service can satisfy specific subsets of these requirements.
2. Myth: Cloud computing is useful only if you are outsourcing your IT functions to an external service provider. In reality, one could use a private cloud computing model for a section of IT applications to offer on-demand, scalable, and pay-per-use deployments within your enterprise's own data center.
Cloud Computing: Getting Started
Below is a framework for cloud adoption.
- Learn more about the context for getting benefits from cloud computing.
- Select the right model and level of cloud capacity.
- Set up the applications, and a monitoring system for those applications and the total cloud footprint.
- Choose a service provider, say Amazon Web Services, the leading provider of cloud computing.
- Use Appendix A to install Hadoop on AWS EC2 public cloud infrastructure.
Conclusion
Cloud computing is a business model that provides shared, flexible, cost-effective IT infrastructure for getting started quickly on building an application. For Big Data applications, it can be even more attractive to test the system using rented facilities before deciding whether to invest in dedicated IT infrastructure.
Review Questions
Q1: Describe the cloud computing model.
Q2: What are the advantages of cloud computing over in-house computing?
Q3: Describe the technical architecture for cloud computing.
Q4: Name a few major providers of cloud computing services.
Section 3
This section covers the other relevant concepts and tutorials for effectively managing and utilizing Big Data.
Chapter 10 brings all the tools together in a case study of developing a web log analyzer, as an example of a useful Big Data application.
Chapter 11 covers the overall view of Data Mining tools and techniques to extract benefit from Big Data.
Appendix 1 shows, step by step, how to install a Hadoop cluster on a cloud computing platform.
Appendix 2 is a tutorial on installing and running Spark.
Chapter 10 – Web Log Analyzer Application Case Study
Introduction
A web log analyzer is an automated software tool that helps to analyze web application server logs and make decisions on a number of issues arising from them. An ideal web log analyzer would analyze unlimited streams of data and help keep the entire universe running smoothly and without fault. This would be done by eliminating the need for manually accessing the logs, automating the flow of information, and alerting the system administrator as needed.
Client-Server Architecture
Every web-based application runs on a client-server architecture. Clients are entities that access servers, and servers are entities that respond to the client with a solution. A lot of clients simultaneously try to access servers. The servers may be a database server, a network server, the application server, or any server in the n-tier architecture. For each request, a log entry is generated. The rate of access requests determines the stream of log entries. This leads to a potentially huge log over time. The log can be processed as a stream of data. The log can also be stored on the servers for later analysis.
Logs can be used for monitoring, audit, and analysis purposes. They can help with error diagnostics in case a website becomes slow or goes down. Logs can be analyzed to detect hacking activity. They can also be analyzed to summarize the popularity of web pages and the distribution of the page requesters. They can help with tracking access volumes, and with scaling the infrastructure up or down.
Web Log Analyzer
The log analyzer receives streaming logs from a server location, and analyzes multiple things using many algorithms to generate the desired results. The system is completely automated. The log is produced, and it is consumed to make real-time reports. It is easy to imagine the massive data flow produced by the log in the server environment while it is also being analyzed simultaneously on the administrator side.
Requirements
This is a log analyzer to analyze a web application hosted on a server. It is a busy application owned by a big company. It receives more than 15,000 web access requests per hour. All the access requests need to be logged, and dumped to the Hadoop File System (HDFS) periodically. The analyzer is required to ingest real-time log data, and filter out a part of the data for analyzing and dumping to HDFS. It has to do streaming data flow management as well as batch processing. The analyzer needs to process the data before it is dumped into HDFS, and also after it is put into HDFS. The system administrators should be alerted in real time about possible threats, overloads, delays, potential errors, and any other damages. The results of all the analyses need to be stored in a database for later presentation in a graphical format. The results have to be made available for any period of time, without any missing time values. The log data has to be preserved for the future without losing any of it.
Solution Architecture
Get streaming data using Apache Flume, and send it to HDFS. Use Apache Spark as the data flow management platform and processing engine. Store the results of analysis in MongoDB. This is a safe solution, because the data gets stored in the Hadoop cluster and is available for future requirements, even while it is being analyzed in real time. The results of real-time processing also go into MongoDB.
Fig 10.1: Web Log Analyzer Architecture
Benefits of this solution
The advantages of this solution are:
1. Real-time logging and analysis. Data generated on the server is streamed directly to HDFS by the Flume agent without delay. Every log entry generated at every point in time is analyzed and used for monitoring and decision making.
2. Automatic log handling and storage. Loading data into HDFS normally requires manually running certain Hadoop commands. This log analyzer uses a Flume agent or Spark Streaming to handle all data on its own, without any externally managed effort.
3. Easy and convenient implementation, using built-in and easy-to-customize machine learning algorithms in Spark.
4. Easy error handling, server request handling, and overall server performance optimization. It makes the server smarter by keeping track of almost every aspect of the server.
Technology stack
The technology stack used for this application is shown below. A brief description of each component follows.
1. Apache Spark v2
2. Hadoop 2.6.0 cdh5
3. Apache Flume
4. Scala, Java
5. MongoDB
6. RESTful web services
7. Front-end UI tools
8. Linux shell scripts
Apache Spark
Spark is a fast, in-memory cluster computing technology, designed for fast and streaming computation. It works with the Hadoop ecosystem and extends the MapReduce model to support more types of computation, including interactive queries and stream processing. It has many libraries and packages, such as machine learning (MLlib) and graph computation (GraphX). It claims to execute 10 to 100 times faster than Hadoop MapReduce because of its in-memory computation model. It also supports multiple languages such as Scala, Python, Java, and R.
Spark Deployment
1. Standalone
2. Hadoop YARN
3. SIMR (Spark in MapReduce) or Mesos
Components of Spark
Spark SQL: Provides a data abstraction called SchemaRDD, which supports structured and semi-structured data.
Spark Streaming: Ingests data in mini-batches and performs RDD transformations on those mini-batches, enabling streaming data analytics using RDDs.
MLlib (machine learning): A distributed machine learning framework that operates in memory at high speed and offers many ML algorithms.
GraphX: A distributed graph processing framework that provides an API for many graph computation algorithms.
Spark Core: The general execution engine for the Spark platform, upon which all other functionality is built. It takes care of task dispatching and scheduling, and basic I/O functionality.
Spark-shell: A powerful tool to analyze data interactively. It is available in Scala and Python. Spark's primary data abstraction is an in-memory collection of items called an RDD (Resilient Distributed Dataset). An RDD can be created from Hadoop input formats such as files on HDFS, or by transforming existing RDDs into new RDDs using filters and maps.
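The filter-and-map style of deriving new RDDs from existing ones can be tried without a cluster, because Scala's standard collections offer operators with the same names. The sketch below is only an analogy under that assumption: it uses a plain `Seq` instead of an RDD, and the object name `RddStyleSketch` and the sample lines are hypothetical.

```scala
object RddStyleSketch {
  // Hypothetical log lines standing in for the contents of a file on HDFS.
  val lines: Seq[String] = Seq(
    "GET /index.html 200",
    "GET /missing 404",
    "POST /login 200"
  )

  // filter keeps the matching items; map transforms each remaining item.
  // On an RDD the same two calls would define new RDDs lazily.
  def okPaths(input: Seq[String]): Seq[String] =
    input
      .filter(line => line.endsWith(" 200"))
      .map(line => line.split(" ")(1))

  def main(args: Array[String]): Unit =
    println(okPaths(lines).mkString(", "))
}
```

In spark-shell, the same chain could be applied to an RDD of lines; the collection version is shown here only because it runs with the Scala standard library alone.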
Scripting and programming model using SparkContext: One can use an IDE to develop and test the analytics code. One can then create a jar to run the analytics using the Hadoop architecture. The jar can also be submitted to the Spark engine using the spark-submit utility. For example:
spark-submit --class apache.accesslogs.ServerLogAnalyzer --master local ScalaSpark/Scala1/target/scala-2.10/Scala1-assembly-1.0.jar > output.txt
HDFS
HDFS is a distributed file system that is at the core of the Hadoop system.
- Deployed on low-cost commodity hardware
- Fault tolerant
- Supports batch processing
- Designed for large datasets or large files
- Maintains coherence through a write-once, read-many-times model
- Moves computation to the location of the data
MongoDB
It is a document-oriented database. It came into existence as a NoSQL database.
Apache Flume
Flume is an open-source tool for handling streaming logs or data. It is a distributed and reliable system for efficiently collecting, aggregating, and moving large amounts of data from many different sources to a centralized data store. It is a popular tool to assist with data flow and storage to HDFS. Flume is not restricted to log data. The data sources are customizable, so the source might be event data, traffic data, social media data, or any other data source. The major components of Flume are:
- Event
- Agent
- Data Generators
- Centralized Stores
Overall Application Logic
The system reads access logs and presents the results in tabular and graphical form to end users. This system provides the following major functions:
1. Calculate content size
2. Count response codes
3. Analyze requesting IP addresses
4. Manage endpoints
Technical Plan for the Application
Technically, the project has the following structure:
1. Flume takes the streaming log from the running application server and stores it in HDFS. Flume uses compression when storing huge log files, to speed up data transfer and for storage efficiency.
2. Apache Spark uses HDFS as its input source and analyzes the data using MLlib. Apache Spark stores the analyzed data in MongoDB.
3. A RESTful Java service fetches JSON objects from MongoDB and sends them to the front end. Graphical tools are used to present the data.
Scala Spark code for log analysis
Note: This application is written in the Scala language. Below is the operative part of the code. Visit the GitHub link below for the complete Scala code for this application.
import org.apache.spark.rdd.RDD // belongs at the top of the source file

// Calculates the size of the logged content and prints the average, maximum, and minimum.
// The RDD of sizes is cached because several actions reuse it.
def calcContentSize(log: RDD[AccessLogs]) = {
  val size = log.map(log => log.contentSize).cache()
  val average = size.reduce(_ + _) / size.count()
  println("Content Size:: Average :: " + average +
    " || Maximum :: " + size.max() + " || Minimum :: " + size.min())
}

// Prints every response code with its frequency of occurrence.
def responseCodeCount(log: RDD[AccessLogs]) = {
  val responseCount = log.map(log => (log.responseCode, 1))
    .reduceByKey(_ + _)
    .take(1000)
  println(s"""Response Codes Count: ${responseCount.mkString("[", ",", "]")}""")
}

// Keeps the IP addresses that have more than one request in the server log.
def ipAddressFilter(log: RDD[AccessLogs]) = {
  val result = log.map(log => (log.ipAddr, 1))
    .reduceByKey(_ + _)
    .filter(count => count._2 > 1)
    //.map(_._1).take(10)
    .collect()
  println(s"""IP Addresses Count:: ${result.mkString("[", ",", "]")}""")
}
} // closes the enclosing object, which is not shown in this excerpt
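The excerpt covers content size, response codes, and IP addresses, but not the endpoint counts that appear in the sample output. A minimal sketch of that fourth function follows. It is an assumption about how such a count could work, not the author's Spark code: it uses plain Scala collections, and the names `EndpointSketch`, `AccessLog`, and `topEndpoints` are hypothetical.

```scala
object EndpointSketch {
  // Trimmed-down, hypothetical stand-in for the AccessLogs record in the excerpt.
  final case class AccessLog(endPoint: String, responseCode: Int)

  // Count the requests per endpoint and keep the n most requested endpoints.
  def topEndpoints(logs: Seq[AccessLog], n: Int): Seq[(String, Int)] =
    logs
      .groupBy(_.endPoint)
      .map { case (endPoint, hits) => (endPoint, hits.size) }
      .toSeq
      .sortBy { case (endPoint, count) => (-count, endPoint) } // ties broken by name
      .take(n)

  def main(args: Array[String]): Unit = {
    val logs = Seq(
      AccessLog("/login.php", 200),
      AccessLog("/index.html", 200),
      AccessLog("/login.php", 401)
    )
    println(s"EndPoints::${topEndpoints(logs, 10).mkString("[", ",", "]")}")
  }
}
```

On an RDD, the same idea would typically be expressed with a key-value map followed by `reduceByKey`, in the style of `responseCodeCount` above.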
Sample Log data
Sample Input Data:
Input Fields (selected fields):
Certain fields have been omitted to make the code clear. The response code is the basis of the major reports.
1. ipAddress: String
2. dateTime: String
3. method: String
4. endPoint: String
5. protocol: String
6. responseCode: Long
7. contentSize: Long
Sample Input Rows of Data:
64.242.88.10 [07/Mar/2014:16:05:49 -0800] "GET /twiki/bin/edit/Main/Double_bounce_sender?topicparent=Main.ConfigurationVariables HTTP/1.1" 401 12846
64.242.88.10 [07/Mar/2014:16:06:51 -0800] "GET /twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1.3&rev2=1.2 HTTP/1.1" 200 4523
64.242.88.10 [07/Mar/2014:16:10:02 -0800] "GET /mailman/listinfo/hsdivision HTTP/1.1" 200 6291
64.242.88.10 [07/Mar/2014:16:11:58 -0800] "GET /twiki/bin/view/TWiki/WikiSyntax HTTP/1.1" 200 7352
64.242.88.10 [07/Mar/2014:16:20:55 -0800] "GET /twiki/bin/view/Main/DCCAndPostFix HTTP/1.1" 200 5253
64.242.88.10 [07/Mar/2014:16:23:12 -0800] "GET /twiki/bin/oops/TWiki/AppendixFileSystem?template=oopsmore&param1=1.12&param2=1.12 HTTP/1.1" 200 11382
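Before any of the reports can be computed, each raw row has to be split into the seven input fields. The sketch below shows one way to do that with a regular expression. It assumes rows in the trimmed layout `ip [timestamp] "method endpoint protocol" code size`, separated by single spaces and using straight double quotes; the names `LogLineParser`, `AccessLog`, and `parse` are hypothetical and are not taken from the application's source.

```scala
object LogLineParser {
  // Field names follow the input-field list above.
  final case class AccessLog(
    ipAddress: String, dateTime: String, method: String, endPoint: String,
    protocol: String, responseCode: Long, contentSize: Long)

  // One capture group per field: ip, [timestamp], "method endpoint protocol", code, size.
  private val Pattern =
    """^(\S+) \[([^\]]+)\] "(\S+) (\S+) (\S+)" (\d{3}) (\d+)$""".r

  // Returns None for rows that do not match, so bad rows can be skipped or counted.
  def parse(line: String): Option[AccessLog] = line match {
    case Pattern(ip, ts, method, endPoint, protocol, code, size) =>
      Some(AccessLog(ip, ts, method, endPoint, protocol, code.toLong, size.toLong))
    case _ => None
  }

  def main(args: Array[String]): Unit = {
    val row = "64.242.88.10 [07/Mar/2014:16:10:02 -0800] " +
      "\"GET /mailman/listinfo/hsdivision HTTP/1.1\" 200 6291"
    println(parse(row))
  }
}
```

Returning an `Option` rather than throwing keeps a single malformed row from stopping a long-running job; a full Apache access log, which carries extra fields, would need a longer pattern.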
Sample Output of Web Log Analysis
Content Size:: Average :: 10101 || Maximum :: 138789 || Minimum :: 0
Response Codes Count: [(401,113),(200,591),(302,1)]
IP Addresses Count:: [(127.0.0.1,31),(207.195.59.160,15),(67.131.107.5,3),(203.147.138.233,13),(64.242.88.10,452),(10.0.0.153,188)]
EndPoints:: [(/wap/Project/login.php,15),(/cgi-bin/mailgraph.cgi/mailgraph_2.png,12),(/cgi-bin/mailgraph.cgi/mailgraph_0.png,12),(/wap/Project/loginsubmit.php,12),(/cgi-bin/mailgraph.cgi/mailgraph_2_err.png,12),(/cgi-bin/mailgraph.cgi/mailgraph_1.png,12),(/cgi-bin/mailgraph.cgi/mailgraph_0_err.png,12),(/cgi-bin/mailgraph.cgi/mailgraph_1_err.png,12),(/cgi-bin/mailgraph.cgi/mailgraph_3_err.png,12),(/cgi-bin/mailgraph.cgi/mailgraph_3.png,12)]
Intermediate data is stored in the Hadoop File System in CSV format.
To see detailed code, visit: https://github.com/databricks/reference-apps/blob/master/logs_analyzer/chapter1/scala/src/main/scala/com/databricks/apps/logs/chapter1/LogAnalyzer.scala
This web log analyzer can be enhanced in many ways. For example, it can analyze the history of logs from previous years and discover web access trends. This application can also be made to move data older than 5 years into permanent backup storage.
Conclusion and Findings
There are more than 100 technologies around the Apache ecosystem. The most basic is the MapReduce technique used by the Hadoop engine. Many stacks are available on top of MapReduce. It is important to incorporate the right set of elements to develop the right stack for a particular large-scale data analytics task. A few awesome technologies like HDFS, Spark, Hive, MongoDB, and Flume/Kafka are likely to make a big data application powerful and worthy.
It is also useful to experiment with many other technologies during the development of this log analyzer. Flume and Kafka are among the most powerful tools to handle streaming data. Spark has its own streaming API, but it is not easy to incorporate with HDFS storage. Developing this application also helps one learn Linux-based tasks and shell scripts, along with some data handling tools like AWK and the Stream Editor (sed).
This application reduces the burden of manual handling of logs on database, application, or history servers. Moreover, it helps to present analyzed data in an impressive way that leads to easy decision making. This application came into development after much research on big data tools such as Apache Spark. That saved a lot of time and cost later. It was developed using agile development practices.
Review Questions
Q1: Describe the advantages of a web log analyzer.
Q2: Describe the major challenges in developing this application.
Q3: Check out the references below. Identify 3-4 major lessons learned from the code and video.
Chapter 11: Data Mining Primer
Data mining is the art and science of discovering knowledge, insights, and patterns in data. It is the act of extracting useful patterns from an organized collection of data. Patterns must be valid, novel, potentially useful, and understandable. The implicit assumption is that data about the past can reveal patterns of activity that can be projected into the future.
Data mining is a multidisciplinary field that borrows techniques from a variety of fields. It utilizes the knowledge of data quality and data organizing from the databases area. It draws modeling and analytical techniques from the statistics and computer science (artificial intelligence) areas. It also draws the knowledge of decision-making from the field of business management.
The field of data mining emerged in the context of pattern recognition in defense, such as identifying a friend-or-foe on a battlefield. Like many other defense-inspired technologies, it has evolved to help gain a competitive advantage in business.
For example, "customers who buy cheese and milk also buy bread 90 percent of the time" would be a useful pattern for a grocery store, which can then stock the products appropriately. Similarly, "people with blood pressure greater than 160 and an age greater than 65 were at a high risk of dying from a heart stroke" is of great diagnostic value for doctors, who can then focus on treating such patients with urgent care and great sensitivity.
Past data can be of predictive value in many complex situations, especially where the pattern may not be easily visible without the modeling technique. Here is a dramatic case of a data-driven decision-making system that beats the best of human experts. Using past data, a decision tree model was developed to predict votes for Justice Sandra Day O'Connor, who had a swing vote in a 5–4 divided US Supreme Court. All her previous decisions were coded on a few variables. What emerged from data mining was a simple four-step decision tree that was able to accurately predict her votes 71 percent of the time. In contrast, the legal analysts could at best predict correctly 59 percent of the time. (Source: Martin et al. 2004)
Gathering and selecting data
To learn from data, quality data needs to be effectively gathered, cleaned and organized, and then efficiently mined. One requires the skills and technologies for consolidation and integration of data elements from many sources.
Gathering and curating data takes time and effort, particularly when it is unstructured or semi-structured. Data can come in many forms, such as databases, blogs, images, videos, audio, and chats. There are streams of unstructured social media data from blogs, chats, and tweets. There are streams of machine-generated data from connected machines, RFID tags, the internet of things, and so on. Eventually the data should be rectangularized, that is, put in rectangular data shapes with clear columns and rows, before submitting it to data mining.
Knowledge of the business domain helps select the right streams of data for pursuing new insights. Only the data that suits the nature of the problem being solved should be gathered. The data elements should be relevant, and suitably address the problem being solved. They could directly impact the problem, or they could be a suitable proxy for the effect being measured. Select data could also be gathered from the data warehouse. Every industry and function will have its own requirements and constraints. The healthcare industry will provide a different type of data with different data names. The HR function would provide different kinds of data. There would be different issues of quality and privacy for these data.
Data cleansing and preparation
The quality of data is critical to the success and value of the data mining project. Otherwise, the situation will be of the garbage in, garbage out (GIGO) kind. The quality of incoming data varies by the source and nature of the data. Data from internal operations is likely to be of higher quality, as it will be accurate and consistent. Data from social media and other public sources is less under the control of the business, and is less likely to be reliable.
Data almost certainly needs to be cleansed and transformed before it can be used for data mining. There are many ways in which data may need to be cleansed (filling missing values, reining in the effects of outliers, transforming fields, binning continuous variables, and much more) before it can be ready for analysis. Data cleansing and preparation is a labor-intensive or semi-automated activity that can take up 60-80% of the time needed for a data mining project.
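Two of the cleansing steps named above, filling missing values and binning a continuous variable, can be sketched in a few lines. This is an illustration under stated assumptions rather than a recommended procedure: mean imputation is only one of several ways to fill gaps, the age cut-offs are arbitrary, and the names `CleansingSketch`, `fillMissingWithMean`, and `ageBin` are hypothetical.

```scala
object CleansingSketch {
  // Replace missing values (None) with the mean of the observed values.
  def fillMissingWithMean(values: Seq[Option[Double]]): Seq[Double] = {
    val observed = values.flatten
    val mean = if (observed.isEmpty) 0.0 else observed.sum / observed.size
    values.map(_.getOrElse(mean))
  }

  // Bin a continuous age into a coarse category (the cut-offs are illustrative).
  def ageBin(age: Double): String =
    if (age < 30) "young" else if (age < 60) "middle-aged" else "senior"

  def main(args: Array[String]): Unit = {
    val ages = Seq(Some(25.0), None, Some(65.0))
    println(fillMissingWithMean(ages).map(ageBin).mkString(", "))
  }
}
```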
Outputs of Data Mining
Data mining techniques can serve different types of objectives. The outputs of data mining will reflect the objective being served. There are many ways of representing the outputs of data mining.
One popular form of data mining output is a decision tree. It is a hierarchically branched structure that helps visually follow the steps to make a model-based decision. The tree may have certain attributes, such as probabilities assigned to each branch. A related format is a set of business rules, which are if-then statements that show causality. A decision tree can be mapped to business rules. If the objective function is prediction, then a decision tree or business rules are the most appropriate mode of representing the output.
The output can be in the form of a regression equation or mathematical function that represents the best-fitting curve to represent the data. This equation may include linear and nonlinear terms. Regression equations are a good way of representing the output of classification exercises. These are also a good representation of forecasting formulae.
A population "centroid" is a statistical measure for describing the central tendencies of a collection of data points. These might be defined in a multidimensional space. For example, a centroid could be "middle-aged, highly educated, high-net-worth professionals, married with two children, living in the coastal areas". Or a population of "20-something, ivy-league-educated, tech entrepreneurs based in Silicon Valley". Or it could be a collection of "vehicles more than 20 years old, giving low mileage per gallon, which failed environmental inspection". These are typical representations of the output of a cluster analysis exercise.
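For numeric data, a centroid can be computed as the per-dimension mean of the points in a cluster. The sketch below assumes that every point is a vector of the same length; the name `CentroidSketch` and the two example dimensions (age and number of children) are hypothetical.

```scala
object CentroidSketch {
  // Centroid = per-dimension mean of a non-empty set of equally sized points.
  def centroid(points: Seq[Vector[Double]]): Vector[Double] = {
    require(points.nonEmpty, "need at least one point")
    val dims = points.head.length
    require(points.forall(_.length == dims), "points must share a dimension")
    Vector.tabulate(dims)(d => points.map(_(d)).sum / points.size)
  }

  def main(args: Array[String]): Unit = {
    // Each hypothetical point is (age, number of children).
    val cluster = Seq(Vector(40.0, 2.0), Vector(50.0, 2.0))
    println(centroid(cluster))
  }
}
```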
Business rules are an appropriate representation of the output of a market basket analysis exercise. These rules are if-then statements with some probability parameters associated with each rule. For example, those that buy milk and bread will also buy butter (with 80 percent probability).
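The probability attached to such a rule is commonly computed as its confidence: among the baskets that contain the "if" items, the fraction that also contain the "then" item. The sketch below works that out for the milk, bread, and butter rule on made-up baskets; the data and the names `RuleConfidenceSketch` and `confidence` are hypothetical.

```scala
object RuleConfidenceSketch {
  type Basket = Set[String]

  // confidence(A => b) = (#baskets containing A and b) / (#baskets containing A)
  // Returns None when no basket contains the "if" items.
  def confidence(baskets: Seq[Basket], ifItems: Set[String], thenItem: String): Option[Double] = {
    val matching = baskets.filter(b => ifItems.subsetOf(b))
    if (matching.isEmpty) None
    else Some(matching.count(_.contains(thenItem)).toDouble / matching.size)
  }

  // Made-up baskets: five contain milk and bread, four of those also contain butter.
  val baskets: Seq[Basket] = Seq(
    Set("milk", "bread", "butter"),
    Set("milk", "bread", "butter"),
    Set("milk", "bread", "butter", "eggs"),
    Set("milk", "bread", "butter"),
    Set("milk", "bread"),
    Set("eggs", "butter")
  )

  def main(args: Array[String]): Unit =
    println(confidence(baskets, Set("milk", "bread"), "butter"))
}
```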
Evaluating Data Mining Results
There are two primary kinds of data mining processes: supervised learning and unsupervised learning. In supervised learning, a decision model can be created using past data, and the model can then be used to predict the correct answer for future data instances. Classification is the main category of supervised learning activity. There are many techniques for classification, decision trees being the most popular one. Each of these techniques can be implemented with many algorithms. A common metric for all classification techniques is predictive accuracy.
Predictive Accuracy = (Correct Predictions) / (Total Predictions)
Suppose a data mining project has been initiated to develop a predictive model for cancer patients using a decision tree. Using a relevant set of variables and data instances, a decision tree model has been created. The model is then used to predict other data instances. When a data point that is truly positive is classified as positive, that is a correct prediction, called a true positive (TP). Similarly, when a truly negative data point is classified as negative, that is a true negative (TN). On the other hand, when a truly positive data point is classified by the model as negative, that is an incorrect prediction, called a false negative (FN). Similarly, when a truly negative data point is classified as positive, that is a false positive (FP). This is represented using the confusion matrix (Figure 10.1).
| Confusion Matrix | True Class: Positive | True Class: Negative |
|---|---|---|
| Predicted Class: Positive | True Positive (TP) | False Positive (FP) |
| Predicted Class: Negative | False Negative (FN) | True Negative (TN) |
Figure 10.1: Confusion Matrix
Thus the predictive accuracy can be specified by the following formula.
Predictive Accuracy = (TP + TN) / (TP + TN + FP + FN)
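The accuracy formula can be checked on a small worked case. The sketch below uses made-up counts (40 true positives, 10 false positives, 20 false negatives, 30 true negatives), for which the accuracy is (40 + 30) / 100 = 0.7; the names `AccuracySketch` and `ConfusionMatrix` are hypothetical.

```scala
object AccuracySketch {
  // The four cells of a confusion matrix, held as counts.
  final case class ConfusionMatrix(tp: Long, fp: Long, fn: Long, tn: Long) {
    def total: Long = tp + fp + fn + tn

    // Predictive accuracy = (TP + TN) / (TP + TN + FP + FN)
    def accuracy: Double = {
      require(total > 0, "no predictions to evaluate")
      (tp + tn).toDouble / total
    }
  }

  def main(args: Array[String]): Unit = {
    val cm = ConfusionMatrix(tp = 40, fp = 10, fn = 20, tn = 30) // made-up counts
    println(f"Predictive accuracy: ${cm.accuracy}%.2f")
  }
}
```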
All classification techniques have a predictive accuracy associated with a predictive model. The highest value can be 100%. In practice, predictive models with more than 70% accuracy can be considered usable in business domains, depending upon the nature of the business.
There are no good objective measures to judge the accuracy of unsupervised learning techniques such as Cluster Analysis. There is no single right answer for the results of these techniques. For example, the value of the segmentation model depends upon the value the decision-maker sees in those results.
![Page 206: Big Data Essentialseniac2017.files.wordpress.com/2017/04/big-data-essential.pdf · No part of this book may be copied or transmitted without written permission. Other books by the](https://reader034.vdocuments.site/reader034/viewer/2022050103/5f41e3f77febeb5fd84475f6/html5/thumbnails/206.jpg)
Data Mining Techniques
Data may be mined to help make more efficient decisions in the future. Or it may be used to explore the data to find interesting associative patterns. The right technique depends upon the kind of problem being solved (Figure 10.2).
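Data Mining Techniques

| Learning type | Category | Techniques |
|---|---|---|
| Supervised Learning (predictive ability based on past data) | Classification: Machine Learning | Decision Trees, Neural Networks |
| Supervised Learning (predictive ability based on past data) | Classification: Statistics | Regression |
| Unsupervised Learning (exploratory analysis to discover patterns) | | Clustering Analysis, Association Rules |

Figure 10.2: Important Data Mining Techniques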
The most important class of problems solved using data mining is classification problems. Classification techniques are called supervised learning because there is a way to supervise whether the model is providing the right or wrong answers. These are problems where data from past decisions is mined to extract the few rules and patterns that would improve the accuracy of the decision-making process in the future. The data of past decisions is organized and mined for decision rules or equations, which are then codified to produce more accurate decisions.
Decision trees are the most popular data mining technique, for many reasons.
1. Decision trees are easy to understand and easy to use, by analysts as well as executives. They also show a high predictive accuracy.
2. Decision trees select the most relevant variables automatically out of all the available variables for decision making.
3. Decision trees are tolerant of data quality issues and do not require much data preparation from the users.
4. Even non-linear relationships can be handled well by decision trees.
There are many algorithms to implement decision trees. Some of the popular ones are C5, CART and CHAID.
Regression is one of the most popular statistical data mining techniques. The goal of regression is to derive a smooth, well-defined curve that best fits the data. Regression analysis techniques, for example, can be used to model and predict energy consumption as a function of daily temperature. Simply plotting the data may show a non-linear curve. Applying a non-linear regression equation can then fit the data well, with high accuracy. Once such a regression model has been developed, the energy consumption on any future day can be predicted using this equation. The accuracy of the regression model depends primarily upon the dataset used, and much less on the particular algorithm or tools used.
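To make the energy-versus-temperature example concrete, here is a minimal sketch that fits a quadratic curve, energy = a + b*t + c*t^2, by ordinary least squares using only the Python standard library. The quadratic form, the function names, and the data (generated from a known curve so the fit can be checked) are all assumptions made for illustration; a real project would use measured data and would likely rely on a statistics package.

```python
def solve_linear_system(matrix, rhs):
    """Solve a small dense system with Gaussian elimination and partial pivoting."""
    n = len(rhs)
    # Augmented matrix so that row operations update both sides together.
    aug = [list(row) + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    # Back substitution on the upper-triangular system.
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - known) / aug[row][row]
    return solution


def fit_quadratic(xs, ys):
    """Least-squares fit of y = a + b*x + c*x^2 via the normal equations."""
    power_sums = [sum(x ** p for x in xs) for p in range(5)]
    matrix = [[power_sums[i + j] for j in range(3)] for i in range(3)]
    rhs = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(3)]
    return solve_linear_system(matrix, rhs)


def predict(coefficients, x):
    a, b, c = coefficients
    return a + b * x + c * x * x


# Hypothetical data: daily temperature (deg C) and energy use, generated from
# energy = 50 - 4*t + 0.1*t^2 so that the fitted coefficients can be checked.
temperatures = [0, 5, 10, 15, 20, 25, 30, 35]
energy = [50 - 4 * t + 0.1 * t * t for t in temperatures]

coefficients = fit_quadratic(temperatures, energy)
forecast = predict(coefficients, 22)  # predicted energy use on a 22-degree day
print(coefficients, forecast)
```

Because the sample data contain no noise, the fit recovers the generating coefficients almost exactly; with real measurements the curve would only approximate the points.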
Artificial Neural Networks (ANN) is a sophisticated data mining technique from the Artificial Intelligence stream in Computer Science. It mimics the behavior of human neural structure: neurons receive stimuli, process them, and communicate their results to other neurons successively, and eventually a neuron outputs a decision. A decision task may be processed by just one neuron and the result may be communicated soon. Alternatively, there could be many layers of neurons involved in a decision task, depending upon the complexity of the domain. The neural network can be trained by making a decision over and over again with many data points. It will continue to learn by adjusting its internal computation and communication parameters based on feedback received on its previous decisions. The intermediate values passed within the layers of neurons may not make any intuitive sense to an observer. Thus, neural networks are considered a black-box system.
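The idea of learning from feedback can be sketched with the simplest possible case: a single artificial neuron trained with the classic perceptron rule. This is only an illustration of the feedback loop, not a multi-layer network, and the function names, the learning rate, and the toy task (the logical AND of two inputs) are assumptions chosen for the example.

```python
def step(value):
    """Fire (1) when the weighted input is positive, otherwise stay silent (0)."""
    return 1 if value > 0 else 0


def neuron_output(weights, bias, inputs):
    total = bias + sum(w * x for w, x in zip(weights, inputs))
    return step(total)


def train_neuron(samples, epochs=20, learning_rate=1):
    """Adjust weights and bias using the error made on each training example."""
    weights = [0] * len(samples[0][0])
    bias = 0
    for _ in range(epochs):
        for inputs, target in samples:
            # Feedback: +1 if the neuron should have fired, -1 if it should not, 0 if correct.
            error = target - neuron_output(weights, bias, inputs)
            weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


# Toy training set: the logical AND of two binary inputs.
and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_neuron(and_samples)
predictions = [neuron_output(weights, bias, inputs) for inputs, _ in and_samples]
print(weights, bias, predictions)
```

The learned weights and bias are just numbers with no obvious meaning on their own, which gives a small taste of why larger networks are described as black boxes.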
Cluster Analysis is an exploratory learning technique that helps in identifying a set of similar groups in the data. It is a technique used for automatic identification of natural groupings of things. Data instances that are similar to (or near) each other are categorized into one cluster, while data instances that are very different (or far away) from each other are categorized into separate clusters. There can be any number of clusters that could be produced by the data. The K-means technique is a popular technique that lets the user guide the selection of the right number (K) of clusters from the data. Clustering is also known as the segmentation technique. It helps divide and conquer large data sets. The technique shows the clusters of things from past data. The output is the centroid of each cluster and the allocation of data points to their clusters. The centroid definitions are then used to assign new data instances to their cluster homes. Clustering is also a part of the artificial intelligence family of techniques.
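A compact sketch of the K-means loop may help: it alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its members, then uses the final centroids to place a new data instance. The points, the choice of K = 2, and the starting centroids are hypothetical; production implementations usually choose starting centroids more carefully.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest_centroid(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda i: squared_distance(point, centroids[i]))


def k_means(points, initial_centroids, max_iterations=100):
    centroids = [tuple(c) for c in initial_centroids]
    assignment = []
    for _ in range(max_iterations):
        # Assignment step: attach every point to its nearest centroid.
        assignment = [nearest_centroid(p, centroids) for p in points]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for index, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignment) if a == index]
            if members:
                new_centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(old)  # leave an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # centroids stopped moving, so the clustering has converged
        centroids = new_centroids
    return centroids, assignment


# Hypothetical two-dimensional data with two visible groups (K = 2).
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, assignment = k_means(points, initial_centroids=[(1, 1), (8, 8)])

# A new data instance is assigned to the cluster with the nearest centroid.
new_point_cluster = nearest_centroid((7, 7), centroids)
print(centroids, assignment, new_point_cluster)
```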
Association rules are a popular data mining method in business, especially where selling is involved. Also known as market basket analysis, it helps in answering questions about cross-selling opportunities. This is the heart of the personalization engine used by ecommerce sites like Amazon.com and streaming movie sites like Netflix.com. The technique helps find interesting relationships (affinities) between variables (items or events). These are represented as rules of the form X → Y, where X and Y are sets of data items. A form of unsupervised learning, it has no dependent variable, and there are no right or wrong answers. There are just stronger and weaker affinities. Thus, each rule has a confidence level assigned to it. A part of the machine learning family, this technique achieved legendary status when a fascinating relationship was found in the sales of diapers and beers.
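The confidence of a rule X → Y can be illustrated with a few lines of code. The sketch below uses the standard definitions of support (the share of transactions containing an itemset) and confidence (the share of transactions containing X that also contain Y). The five shopping baskets are invented for the example and do not come from any real sales data.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for basket in transactions if items <= basket)
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent.

    Assumes the antecedent appears in at least one transaction.
    """
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


# Hypothetical market baskets (illustrative only).
transactions = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
    {"milk", "bread"},
]

rule_support = support(transactions, {"diapers", "beer"})
rule_confidence = confidence(transactions, {"diapers"}, {"beer"})
print(rule_support, rule_confidence)
```

In this toy data, diapers and beer appear together in two of five baskets (support 0.4), and two of the three diaper baskets also contain beer (confidence about 0.67).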
Mining Big Data
As data grows larger and larger, there are a few ways in which analyzing Big Data is different.
From Causation to Correlation
There is more data available than there are theories and tools available to explain it. Historically, theories of human behavior, and theories of the universe in general, have been intuited and tested using limited and sampled data, with some statistical confidence level. Now that data is available in extremely large quantities about many people and many factors, there may be too much noise in the data to articulate and test clean theories. In that case, it may suffice to value co-occurrences or correlations of events as significant without necessarily establishing strong causation.
From Sampling to the Whole
Pooling all the data together into a single big data system can help discover events that bring about a fuller picture of the situation, and highlight threats or opportunities that an organization faces. Working from the full dataset can enable discovering remote but extremely valuable insights. For example, an analysis of the purchasing habits of millions of customers and their billions of transactions at thousands of stores can give an organization a vast, detailed and dynamic view of sales patterns in the company, which may not be available from the analysis of small samples of data by each store or region.
From Dataset to Datastream
A flowing stream has a perishable and unlimited connotation to it, while a dataset has a finitude and permanence about it. With any given infrastructure, one can only consume so much data at a time. Data streams are many, large and fast. Thus one has to choose which of the many streams of data one wants to engage with. It is equivalent to deciding which stream to fish in. The metrics used for analysis of streams tend to be relatively simple and relate to the time dimension. Most of the metrics are statistical measures such as counts and means. For example, a company might want to monitor customer sentiment about its products. So it could create a social media listening platform that would read all tweets and blog posts about it in real time. This platform would (a) keep a count of positive and negative sentiment messages every minute, and (b) flag any messages that merit attention, such as by sending an online advertisement or purchase offer to that customer.
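The per-minute counting described above can be sketched as a single pass over incoming messages. In this illustration each message is assumed to arrive as a tuple of (timestamp in seconds, sentiment label, text), with the sentiment already decided by some upstream classifier; that format, the function name, and the rule of flagging every negative message are assumptions made for the example.

```python
from collections import Counter, defaultdict


def monitor(messages):
    """Tally sentiment per minute and collect messages that merit attention."""
    counts = defaultdict(Counter)
    flagged = []
    for timestamp_seconds, sentiment, text in messages:
        minute = timestamp_seconds // 60  # bucket each message into its minute
        counts[minute][sentiment] += 1
        if sentiment == "negative":
            flagged.append(text)  # hypothetical rule: follow up on every complaint
    return counts, flagged


# Hypothetical stream of already-labelled messages (illustrative only).
stream = [
    (5, "positive", "love it"),
    (42, "negative", "arrived broken"),
    (61, "positive", "great support"),
    (75, "positive", "works well"),
    (118, "negative", "too expensive"),
]

counts, flagged = monitor(stream)
print(dict(counts), flagged)
```

Each message is read once and then discarded, which reflects the perishable nature of a stream: only the small running counts are kept.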
Data Mining Best Practices
Effective and successful use of data mining requires both business and technology skills. The business aspects help one understand the domain and the key questions. They also help one imagine possible relationships in the data, and create hypotheses to test them. The IT aspects help fetch the data from many sources, clean up the data, assemble it to meet the needs of the business problem, and then run the data mining techniques on the platform.
An important element is to go after the problem iteratively. It is better to divide and conquer the problem with smaller amounts of data, and get closer to the heart of the solution in an iterative sequence of steps. There are several best practices learned from the use of data mining techniques over a long period of time. The data mining industry has proposed a Cross-Industry Standard Process for Data Mining (CRISP-DM). It has six essential steps (Figure 4.3):
Figure 4.3: CRISP-DM Data Mining cycle
1. Business Understanding: The first and most important step in data mining is asking the right business questions. A question is a good one if answering it would lead to large payoffs for the organization, financially and otherwise. In other words, selecting a data mining project is like selecting any other project, in that it should show strong payoffs if the project is successful. There should be strong executive support for the data mining project, which means that the project aligns well with the business strategy. A related important step is to be creative and open in proposing imaginative hypotheses for the solution. Thinking outside the box is important, both in terms of the proposed model and in terms of the datasets available and required.
2. Data Understanding: A related important step is to understand the data available for mining. One needs to be imaginative in scouring many sources for the data elements that help address the hypotheses to solve a problem. Without relevant data, the hypotheses cannot be tested.
3. Data Preparation: The data should be relevant, clean and of high quality. It is important to assemble a team that has a mix of technical and business skills, and that understands the domain and the data. Data cleaning can take 60-70% of the time in a data mining project. It may be desirable to continue to experiment and add new data elements from external sources that could help improve predictive accuracy.
4. Modeling: This is the actual task of running many algorithms on the available data to discover whether the hypotheses are supported. Patience is required in continuously engaging with the data until the data yields some good insights. A host of modeling tools and algorithms should be used. A tool could be tried with different options, such as running different decision tree algorithms.
5. Model Evaluation: One should not accept what the data says at first. It is better to triangulate the analysis by applying multiple data mining techniques, and conducting many what-if scenarios, to build confidence in the solution. One should evaluate and improve the model's predictive accuracy with more test data. When the accuracy has reached a satisfactory level, the model should be deployed.
6. Dissemination and rollout: It is important that the data mining solution is presented to the key stakeholders, and is deployed in the organization. Otherwise the project will be a waste of time and a setback for establishing and supporting a data-based decision-making culture in the organization. The model should eventually be embedded in the organization's business processes.
Conclusion
Data Mining is like diving into the rough material to discover a valuable finished nugget. While the technique is important, domain knowledge is also important to provide imaginative solutions that can then be tested with data mining. The business objective should be well understood and should always be kept in mind to ensure that the results are beneficial to the sponsor of the exercise.
Review Questions
1. What is data mining? What are supervised and unsupervised learning techniques?
2. Describe the key steps in the data mining process. Why is it important to follow these processes?
3. What is a confusion matrix?
4. Why is data preparation so important and time consuming?
5. What are some of the most popular data mining techniques?
6. How is mining Big Data different from traditional data mining?
Appendix 1: Hadoop Installation on Amazon Web Services (AWS) Elastic Compute Cloud (EC2)
Creating a Cluster Server on AWS and Installing Hadoop from Cloudera
The objective of this tutorial is to set up a big data processing infrastructure using cloud computing, and Hadoop and Spark software.
Step 1: Creating Amazon EC2 Servers
1. Open https://aws.amazon.com/
2. Click on Services.
3. Click on EC2.
Once you click on EC2, you will see the resulting screen. If you already have a server, you can see the number of running servers, their volumes and other information.
4. Click on the Launch Instance button.
5. Click on AWS Marketplace.
6. Type Ubuntu in the search text box.
7. Click on the Select button.
8. Ubuntu is free, so you don't have to worry about the service price. Click on the Continue button.
9. Choose General purpose m1.large and click on Next: Configure Instance Details. (Do not choose the Micro Instance t1.micro. It is free, but it will not be able to handle the installation.)
10. Click on Next: Add Storage.
11. Specify a volume size of 20 GB (the default is 8 GB, which will not be sufficient) and click on Next: Tag Instance.
12. Type the name cs488-master (this label identifies which server is the master and which are the slaves) and click on Next: Security Group.
13. We need to open our server to the world, including most of the ports, because Cloudera needs to add more ports. Specify the group name, then set Type: Custom TCP Rule, Port Range: 0-65500, Source: Anywhere, and click on Review Instance.
14. A warning message appears only because we have opened our server to the world, so ignore it for now. Click on the Launch button.
15. Type the key pair name and click on the Download Key Pair button (remember the location of the downloaded file, because we need this file to log in to the server), then click on Launch Instances.
16. Now the master server is created.
Now we need four more servers to make the cluster. We don't need to repeat this process four times. We just increase the number of instances we need, and we get the four servers.
Now we are going to launch four more servers, which are the slaves.
Please repeat steps 4-9.
Go to AWS Marketplace, choose Ubuntu, and select the instance type (General purpose).
17. Type 4 in Number of Instances, which will create four more servers for us.
18. Name the servers cs488-slave.
19. Select the previously created security group.
20. It is important that you choose the existing key pair for these servers too.
If everything goes well, you will see that you have 5 instances, 5 volumes, 1 key pair, and 1 or 2 security groups.
We have now successfully created 5 servers.
Step 2: Connecting to the server and installing the required Cloudera distribution of Hadoop
First of all, take a note of all your server details (IP address and DNS address) for the master and the slaves.
Master Public DNS Address: ec2-54-200-210-141.us-west-2.compute.amazonaws.com
Master Private IP Address: 172.31.20.82
Slave 1 Private IP: 172.31.26.245
Slave 2 Private IP: 172.31.26.242
Slave 3 Private IP: 172.31.26.243
Slave 4 Private IP: 172.31.26.244
Once you have these recorded, you can connect to the server. If you are using Linux as your operating system, you can use the ssh command from a terminal to connect.
Connecting to the server (Windows)
1. Download the SSH software (PuTTY) (http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html). Also download PuTTYgen to convert our authentication file from .pem to .ppk.
2. Open PuTTYgen and load the authentication file.
Click on Save Private Key.
3. Open PuTTY, type the master public DNS address in Host Name, then click on SSH in the left panel > click on Auth >> select the recently converted authentication file (.ppk), and finally click on the Open button.
4. Now you will be able to connect to the server. Type "ubuntu", the default username, to log in to the system.
5. Once you connect, type the following commands into the terminal:
6. sudo aptitude update
7. cd /usr/local/src/
8. sudo wget http://archive.cloudera.com/cm4/installer/latest/cloudera-manager-installer.bin
9. sudo chmod u+x cloudera-manager-installer.bin
10. sudo ./cloudera-manager-installer.bin
11. There are four more steps where you click on Next and on Yes for the license agreement. Once you finish the installation, you need to restart the service.
12. sudo service cloudera-scm-server restart
You are now able to connect to Cloudera from your browser. The address will be http://<YOUR PUBLIC DNS SERVER>:7180, e.g. http://ec2-54-200-210-141.us-west-2.compute.amazonaws.com:7180, and the default username and password to log in to the system are admin/admin.
Once you restart the server, it will open the login screen again. The same username and password (admin/admin) are used to log in to the system.
13. Click on Launch the Classic Wizard.
14. Click on Continue.
15. Provide all the private IP addresses of the master and slave computers and click on the Search button.
16. Click on the Continue button.
17. Choose None for SOLR1…. and None for IMPAL…., and click on the Continue button.
18. Click on Another User >> type "ubuntu" and select All hosts accept same private key >> upload the authentication file (.pem) and click on the Continue button.
19. Now Cloudera will install the software on each of our servers.
20. Once the installation is complete, click on the Continue button.
21. Once it reaches 100%, click on the Continue button. Do not disconnect from the internet or shut down the machine. If the process does not complete, we will need to redo the whole process.
22. Click on Continue.
23. Choose Core Hadoop and click on the Inspect Role Assignments button.
24. For your master IP, only NameNode should be selected and DataNode should be unchecked. This is important for separating the master and slave servers.
25. Now Cloudera will install all the services. For future use, you can record the username and password of each service. Click on Test Connection.
26. Click on Continue.
27. Now all the installation is complete. You now have 1 master node and 4 data nodes.
28. You should see the dashboard.
Step 3: WordCount using MapReduce
29. Now log in to the master server from PuTTY.
30. Run the following commands:
31. cd ~/
32. mkdir code-and-data
33. cd code-and-data
34. sudo wget https://s3.amazonaws.com/learn-hadoop/hadoop-infiniteskills-richmorrow-class.tgz
35. sudo tar -xvzf hadoop-infiniteskills-richmorrow-class.tgz
36. cd data
37. sudo -u hdfs hadoop fs -mkdir /user/ubuntu
38. sudo -u hdfs hadoop fs -chown ubuntu /user/ubuntu
39. hadoop fs -put shakespeare shakespeare-hdfs
40. hadoop version
41. hadoop fs -ls shakespeare-hdfs
42. sudo hadoop jar /opt/cloudera/parcels/CDH-4.7.1-1.cdh4.7.1.p0.47/share/hue/apps/oozie/examples/lib/hadoop-examples.jar wordcount shakespeare-hdfs wordcount-output
43. hadoop jar /opt/cloudera/parcels/CDH-4.7.1-1.cdh4.7.1.p0.47/share/hue/apps/oozie/examples/lib/hadoop-examples.jar sleep -m 10 -r 10 -mt 20000 -rt 20000
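To see what the wordcount job computes, the following single-process sketch mimics the three MapReduce stages (map, shuffle, reduce) in plain Python. It is only a conceptual illustration and not the code inside hadoop-examples.jar; the function names and the two sample lines are made up for the example, and on a cluster each stage would run in parallel across the data nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every whitespace-separated word in a line."""
    return [(word, 1) for word in line.split()]


def shuffle_phase(pairs):
    """Group the emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_phase(word, counts):
    """Sum all the counts emitted for one word."""
    return word, sum(counts)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


# Two made-up input lines standing in for the files stored in HDFS.
lines = ["to be or not to be", "that is the question"]
result = word_count(lines)
print(result)
```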
Appendix 2: Spark Installation and Tutorial
This tutorial will help install Spark and get it running on a standalone machine. It will then help develop a simple analytical application using R language.
Step 1: Verifying Java Installation
A Java installation is one of the mandatory prerequisites for installing Spark. Try the following command to verify the Java version.
$ java -version
If Java is already installed on your system, you get to see the following response:
java version "1.7.0_71"
Java(TM) SE Runtime Environment (build 1.7.0_71-b13)
Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)
In case you do not have Java installed on your system, install Java before proceeding to the next step.
Step 2: Verifying Scala installation
Verify the Scala installation using the following command.
$ scala -version
If Scala is already installed on your system, you get to see the following response:
Scala code runner version 2.11.6 -- Copyright 2002-2013, LAMP/EPFL
In case you don't have Scala installed on your system, proceed to the next step for Scala installation.
Step 3: Downloading Scala
Download the latest version of Scala by visiting the following link: Download Scala. For this tutorial, we are using the scala-2.11.6 version. After downloading, you will find the Scala tar file in the download folder.
Step 4: Installing Scala
Follow the steps given below for installing Scala.
Extract the Scala tar file
Type the following command for extracting the Scala tar file.
$ tar xvf scala-2.11.6.tgz
Move Scala software files
Use the following commands for moving the Scala software files to the respective directory (/usr/local/scala).
$ su -
Password:
# cd /home/Hadoop/Downloads/
# mv scala-2.11.6 /usr/local/scala
# exit
Set PATH for Scala
Use the following command for setting PATH for Scala.
$ export PATH=$PATH:/usr/local/scala/bin
Verifying Scala Installation
After installation, it is better to verify it. Use the following command for verifying the Scala installation.
$ scala -version
If Scala is installed correctly, you get to see the following response:
Scala code runner version 2.11.6 -- Copyright 2002-2013, LAMP/EPFL
Step 5: Downloading Spark
Download the latest version of Spark. For this tutorial, we are using the spark-1.3.1-bin-hadoop2.6 version. After downloading it, you will find the Spark tar file in the download folder.
Step 6: Installing Spark
Follow the steps given below for installing Spark.
Extracting the Spark tar file
Use the following command for extracting the Spark tar file.
$ tar xvf spark-1.3.1-bin-hadoop2.6.tgz
Moving Spark software files
Use the following commands for moving the Spark software files to the respective directory (/usr/local/spark).
$ su -
Password:
# cd /home/Hadoop/Downloads/
# mv spark-1.3.1-bin-hadoop2.6 /usr/local/spark
# exit
Setting up the environment for Spark
Add the following line to the ~/.bashrc file. It adds the location of the Spark software files to the PATH variable.
export PATH=$PATH:/usr/local/spark/bin
Use the following command for sourcing the ~/.bashrc file.
$ source ~/.bashrc
Step 7: Verifying the Spark Installation
Write the following command for opening the Spark shell.
$ spark-shell
If Spark is installed successfully, you will find the following output.
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/06/04 15:25:22 INFO SecurityManager: Changing view acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: Changing modify acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop)
15/06/04 15:25:22 INFO HttpServer: Starting HTTP Server
15/06/04 15:25:23 INFO Utils: Successfully started service 'HTTP class server' on port 43292.
Welcome to Spark version 1.4.0
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_71)
Type in expressions to have them evaluated.
Spark context available as sc
scala>
Hereyoucanseethevideo:
HowtoinstallSpark
Youmightencounter“filespecifiednotfounderror”whenyouarefirstinstallingSPARKstandalone:
To fix this, you have to set up your JAVA_HOME.
Step 1: Start -> Run -> Command Prompt (cmd)
Step 2: Determine where your JDK is located; by default it is under C:\Program Files.
Step 3: Select the JDK you want to use; in my case, I will use my JDK_8.
Copy the JDK directory to your clipboard, go to your CMD window, set JAVA_HOME to that directory, and press Enter.
Step 4: Add it to the general PATH
And press Enter.
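To double-check the JAVA_HOME setting described in the steps above, one option is a small script like the sketch below. It is written in Python purely for illustration; the function name java_home_looks_valid is hypothetical, and the check (a bin folder containing java or java.exe) is an assumption about a typical JDK layout rather than something the steps above state.

```python
import os


def java_home_looks_valid(java_home):
    """Return True if `java_home` is a directory containing bin/java (or java.exe)."""
    if not java_home or not os.path.isdir(java_home):
        return False
    # Assumed layout: the launcher lives in <JAVA_HOME>/bin on every platform.
    candidates = ("java", "java.exe")
    return any(os.path.isfile(os.path.join(java_home, "bin", name))
               for name in candidates)


if __name__ == "__main__":
    value = os.environ.get("JAVA_HOME")
    print("JAVA_HOME =", value)
    print("looks valid:", java_home_looks_valid(value))
```

Run it from a new command prompt, since environment variable changes are generally not visible in windows that were already open.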
Now go to your Spark folder and run bin\spark-shell.
You have installed Spark; let's try to use it.
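Step 8: Application: WordCount in Scala
Now we will do an example of word count in Scala, typed at the scala> prompt of the Spark shell:
// Read the input file as an RDD of lines
val textFile = sc.textFile("hdfs://…")
// Split each line into words, pair every word with 1, then sum the 1s per word
val counts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
// Write the (word, count) pairs back out
counts.saveAsTextFile("hdfs://…")
NOTE: If you are working on a stand-alone Spark:
The counts.saveAsTextFile("hdfs://…") command may give you a NullPointerException error.
Solution: counts.coalesce(1).saveAsTextFile()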
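To see what the flatMap, map, and reduceByKey steps of the word count each contribute, the following sketch reproduces the same three stages on an in-memory list of lines using only the Python standard library. It does not use Spark at all; the function name word_count is hypothetical, and the single-process loop stands in for what Spark would do across a cluster.

```python
from collections import defaultdict
from itertools import chain


def word_count(lines):
    """Count words in `lines` using the same three stages as the Spark example."""
    # Stage 1 (flatMap): split every line on spaces and flatten into one stream of words.
    words = chain.from_iterable(line.split(" ") for line in lines)
    # Stage 2 (map): pair each word with the count 1.
    pairs = ((word, 1) for word in words)
    # Stage 3 (reduceByKey): add up the counts that share the same word.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


if __name__ == "__main__":
    print(word_count(["to be or", "not to be"]))
```

As in the Spark version, splitting on a single space means that runs of several spaces produce empty strings that get counted as words; real input usually needs some cleaning first.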
For implementing a word cloud we could use R in our Spark console.
However, if you click on SparkR straight away you will get an error.
To fix this:
Step 1: Set up the environment variables.
In the PATH variable add your path: I added -> ;C:\spark-1.5.1-bin-hadoop2.6\spark-1.5.1-bin-hadoop2.6\;C:\spark-1.5.1-bin-hadoop2.6\spark-1.5.1-bin-hadoop2.6\sbin;C:\spark-1.5.1-bin-hadoop2.6\spark-1.5.1-bin-hadoop2.6\bin
Step 2: Install the R software and RStudio. Then add the path of the R software to the PATH variable.
I added this to my existing path -> ;C:\Program Files\R\R-3.2.2\bin\x64\ (Remember, each path that you add must be separated by a semicolon, with no spaces.)
Step 3: Run the command prompt as an administrator.
Step 4: Now execute the command “SparkR” from the command prompt. If successful, you should see the message “Spark context is available…”. If your path is not set correctly, you can alternatively navigate to the location where you have downloaded SparkR (in my case C:\spark-1.5.1-bin-hadoop2.6\spark-1.5.1-bin-hadoop2.6\bin) and execute the “SparkR” command there.
Step 5: Configuration inside RStudio to connect to Spark
Execute the three commands below in RStudio every time:
# Here we are setting up the SPARK_HOME environment variable
Sys.setenv(SPARK_HOME = "C:/spark-1.5.1-bin-hadoop2.6/spark-1.5.1-bin-hadoop2.6")
# Set the library path
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
# Load the SparkR library
library(SparkR)
If you see the SparkR startup message, you are all set to start working with SparkR.
Now let's start coding in R:
# Load the text-mining and word-cloud packages used below
library(tm)
library(wordcloud)
lords <- Corpus(DirSource("temp/"))
To see what's in that corpus, type the command
inspect(lords)
This should print out the contents on the main screen. Next, we need to clean it up. Execute the following in the command line, one line at a time:
lords <- tm_map(lords, stripWhitespace)
# Recent versions of tm require base functions such as tolower to be wrapped in content_transformer()
lords <- tm_map(lords, content_transformer(tolower))
lords <- tm_map(lords, removeWords, stopwords("english"))
lords <- tm_map(lords, stemDocument)
The tm_map function comes with the tm package. The various commands are self-explanatory: strip unnecessary whitespace, convert everything to lower case (otherwise the word cloud might highlight capitalised words separately), remove common English words like ‘the’ (so-called ‘stopwords’), and carry out text stemming for the final tidy-up. Depending on what you want to achieve, you could also explicitly remove numbers and punctuation by passing removeNumbers and removePunctuation to tm_map.
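To make the effect of the cleaning steps concrete, here is a rough sketch of the same ideas in plain Python: collapsing whitespace, lower-casing, dropping stopwords, and then counting the words a cloud would be sized by. It is not the tm package and omits stemming and punctuation removal; the names STOPWORDS, clean, and top_words are hypothetical, and the stopword set is a tiny illustrative list, not tm's English stopword list.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption for this sketch only).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}


def clean(text):
    """Return the cleaned list of words for one document."""
    text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
    text = text.lower()                       # treat 'Lords' and 'lords' alike
    # Drop empty strings and stopwords.
    return [w for w in text.split(" ") if w and w not in STOPWORDS]


def top_words(documents, n):
    """Return the `n` most frequent cleaned words across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(clean(doc))
    return counts.most_common(n)


if __name__ == "__main__":
    print(clean("  The  House of   Lords "))
    print(top_words(["the lords", "Lords of the house"], 1))
```

The frequency table returned by top_words is the kind of input a word cloud is drawn from: the more often a word survives cleaning, the larger it appears.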
It is possible that you may get error messages whilst executing some of the commands, e.g. for missing packages. If so, install the missing packages with install.packages() and repeat the command.
If all is well then you should now be ready to create your first word cloud! Try this:
wordcloud(lords, scale = c(5, 0.5), max.words = 100, random.order = FALSE, rot.per = 0.35, use.r.layout = FALSE, colors = brewer.pal(8, "Dark2"))
Additional Resources
Here are some other books, papers, videos, and other resources for a deeper dive into the topics covered in this book.
1. Mayer-Schonberger, Viktor; Cukier, Kenneth (2013). Big Data: A Revolution That Will Transform How We Live, Work, and Think. Houghton Mifflin Harcourt.
2. McKinsey Global Institute Report (2011). Big data: The next frontier for innovation, competition, and productivity. Mckinsey.com
3. Silver, N. (2012). The Signal and the Noise: Why So Many Predictions Fail but Some Don't. Penguin Press.
4. Matei Zaharia et al. (2010). “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing,” University of California, Berkeley.
5. Sandy Ryza, Uri Laserson et al. (2014). “Advanced Analytics with Spark”. O'Reilly.
Websites:
6. Apache Hadoop resources: https://hadoop.apache.org/docs/r2.7.2/
7. Apache HDFS: https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
8. Hadoop API site: http://hadoop.apache.org/docs/current/api/
9. Apache Spark: http://spark.apache.org/docs/latest/
10. https://www.biostat.wisc.edu/~kbroman/Rintro/Rwinpack.html
11. http://robjhyndman.com/hyndsight/building-r-packages-for-windows/
12. https://stevemosher.wordpress.com/ten-steps-to-building-an-r-package-under-windows/
13. http://www.inside-r.org/packages/cran/wordcloud/docs/wordcloud
14. https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
15. https://intellipaat.com/tutorial/spark-tutorial/
16. https://blogs.apache.org/foundation/entry/the_apache_software_foundation_announces50
17. https://en.wikipedia.org/wiki/NoSQL
18. http://www.planetcassandra.org/what-is-apache-cassandra/
19. http://www.datastax.com/nosql
20. https://www.sitepen.com/blog/2010/05/11/nosql-architecture/
21. http://nosql-database.org/
22. http://webpages.uncc.edu/xwu/5160/nosqldbs.pdf
Video Resources
23. Doug Cutting on ‘Hadoop at 10’: https://www.youtube.com/watch?v=yDZRDDu3CJo
24. Status of Apache community: https://www.youtube.com/watch?v=sOZnf8Nn3Fo
25. Spark 2.0 updates, showing a nice demo (across R, Scala, and SQL) using tweets and clustering: https://www.youtube.com/watch?v=9xSz0ppBtFg
26. https://www.youtube.com/watch?v=VwiGHUKAHWM
27. https://www.youtube.com/watch?v=L5QWO8QBG5c
28. https://www.youtube.com/watch?v=KvQto_b3sqw
29. https://www.youtube.com/watch?v=YW28qItH_tA
About the Author
Dr. Anil Maheshwari is a Professor of Computer Science and Information Systems, and the Director of the Center for Data Analytics, at Maharishi University of Management. He teaches courses in data analytics, and helps organizations extract deep insights from their data. He worked in a variety of leadership roles at IBM in Austin, TX, and has also worked at many other companies, including startups.
He has taught at the University of Cincinnati, City University of New York, University of Illinois, and others. He earned an Electrical Engineering degree from the Indian Institute of Technology in Delhi, an MBA from the Indian Institute of Management in Ahmedabad, and a Ph.D. from Case Western Reserve University. He is a practitioner of the Transcendental Meditation technique.
He is the author of the #1 bestseller Data Analytics Made Accessible.
He blogs interesting stuff on IT and Enlightenment at anilmah.com
Instructors can reach him for course materials at akm2030@gmail.com. Speaking engagements are welcome.