CS 378 – Big Data Programming
Lecture 5: Summarization Patterns
CS 378 – Fall 2017 Big Data Programming
Review
• Assignment 2 – Questions?
• MRUnit – How do you test map() or reduce() calls that produce multiple outputs?
• Issues with calculating variance
Summarization
• Other summarizations of interest
  – Min, max, median
• Suppose we are interested in these metrics for paragraph length (Assignment 2 data)
  – If paragraph lengths are normally distributed, then the median will be very near the mean
  – If the distribution of paragraph lengths is skewed, then the mean and median will be very different
Summarization
• Min and max are straightforward
• For each paragraph, output two values
  – Min length (the length of the current paragraph)
  – Max length (the length of the current paragraph)
  – Key?
• Combiner will get a key and a list of values
  – Select the min and max from the list, output the values
  – Key?
• Reducer does the same
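The min/max flow above can be sketched in plain Python, without Hadoop. This is only an illustration under stated assumptions: paragraph length is measured in words, every record shares one constant key (`KEY`, a hypothetical name and just one possible answer to the "Key?" question), and the same `combine` function serves as both combiner and reducer because taking a min or max is associative.

```python
KEY = "paragraph_length"  # hypothetical single key shared by all records


def map_paragraph(paragraph):
    """Emit (key, (min, max)); for one paragraph both equal its length."""
    length = len(paragraph.split())  # assumption: length counted in words
    return KEY, (length, length)


def combine(values):
    """Collapse a non-empty list of (min, max) pairs into one pair."""
    mins, maxs = zip(*values)
    return min(mins), max(maxs)


def run(splits):
    """Simulate one mapper plus combiner per split, then a single reducer."""
    partials = []
    for split in splits:
        values = [map_paragraph(p)[1] for p in split]
        if values:
            partials.append(combine(values))  # combiner step
    return combine(partials)                  # reducer step
```

Because the combiner output has the same shape as the mapper output, the reducer can consume either one unchanged.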
Summarization
• Median
  – Get all the values, sort them, then find the middle
• Since our computation is distributed, no mapper sees all the values
• Should we send them all to one reducer?
  – Not utilizing map-reduce (computation not distributed)
  – Data sizes likely too large to keep in memory
Summarization
• Median
  – Keep the unique paragraph lengths, and
  – The frequency of each length
• Map output:
  – <paragraph length, 1>
• Combiner gets a list of these pairs and updates the count for recurring lengths
• Reducer does the same, then computes the median
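A minimal sketch of the length/frequency approach, again in plain Python rather than Hadoop. The function names are hypothetical, paragraph length is assumed to be a word count, and for an even number of paragraphs the median is taken as the average of the two middle values (the slides do not say which convention is intended).

```python
from collections import Counter


def map_lengths(paragraphs):
    """Mapper: emit <paragraph length, 1> for each paragraph."""
    return [(len(p.split()), 1) for p in paragraphs]


def combine_counts(pairs):
    """Combiner: sum the counts for recurring lengths."""
    counts = Counter()
    for length, count in pairs:
        counts[length] += count
    return counts


def reduce_counts(partials):
    """Reducer, part 1: merge the per-split frequency tables."""
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds counts together
    return merged


def median_from_counts(counts):
    """Reducer, part 2: walk the lengths in sorted order to the middle."""
    total = sum(counts.values())
    lower_rank, upper_rank = (total - 1) // 2, total // 2  # 0-based ranks
    seen, lower = 0, None
    for length in sorted(counts):
        seen += counts[length]
        if lower is None and seen > lower_rank:
            lower = length
        if seen > upper_rank:
            return (lower + length) / 2
    raise ValueError("no values")
```

Only the distinct lengths and their counts are held in memory, which is typically far smaller than the full list of values.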
Summarization
• Median
  – Hadoop provides the SortedMapWritable class
  – Can associate a frequency count with a paragraph length
  – Keeps the lengths in sorted order
• See the example in Chapter 2 of MapReduce Design Patterns
• How could we compute all in one pass over the data?
  – min, max, median
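One possible answer to the one-pass question, sketched in plain Python: once the lengths are kept in sorted order with their counts, min is the first key, max is the last key, and the median falls out of a cumulative count. A plain `dict` sorted by key stands in for the sorted map here, `summarize` is a hypothetical name, and the lower median is returned for even totals to keep the sketch short.

```python
def summarize(counts):
    """Return (min, max, median) from a non-empty {length: frequency} map."""
    lengths = sorted(counts)           # stands in for a map kept in key order
    total = sum(counts.values())
    middle = (total - 1) // 2          # 0-based rank of the (lower) median
    seen = 0
    median = None
    for length in lengths:
        seen += counts[length]
        if seen > middle:              # cumulative count has reached the middle
            median = length
            break
    return lengths[0], lengths[-1], median
```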
Counters
• Hadoop map-reduce infrastructure provides counters
  – Accessed by group name
  – Cannot have a large number of counters
    • For example, can't use counters to solve WordCount
  – A few tens of counters can be used
• Counters are stored in memory on JobTracker
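To illustrate why a small, fixed set of counters works while WordCount does not, here is a toy model in plain Python. It is not the Hadoop counter API: `CounterSketch`, the limit of 50, and the bucket thresholds are all hypothetical values chosen for the example.

```python
from collections import defaultdict


class CounterSketch:
    """Toy stand-in for job counters: group name -> counter name -> value."""

    def __init__(self, limit=50):  # assumed cap, "a few tens" per the slide
        self.limit = limit
        self.groups = defaultdict(dict)

    def total(self):
        """Number of distinct counters across all groups."""
        return sum(len(names) for names in self.groups.values())

    def increment(self, group, name, amount=1):
        if name not in self.groups[group]:
            if self.total() >= self.limit:
                raise RuntimeError("counter limit reached")
            self.groups[group][name] = 0
        self.groups[group][name] += amount


def bucket(length):
    """Map a paragraph length to one of three counter names (assumed cutoffs)."""
    if length < 50:
        return "SHORT"
    if length < 200:
        return "MEDIUM"
    return "LONG"
```

Bucketing needs only three counters no matter how much data is processed, whereas WordCount would need one counter per distinct word and would hit any small limit almost immediately.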
Counters
Figure 2-6, MapReduce Design Patterns
How Hadoop MapReduce Works
• We've seen some terms like:
  – Job, JobTracker, TaskTracker (MapReduce 1)
  – Job, ResourceManager, NodeManager (YARN, MapReduce 2)
• Let's look at what they do
• Details from Chapter 7, Hadoop: The Definitive Guide, 4th Edition
How Hadoop MapReduce Works
Figure 7-1, Hadoop: The Definitive Guide, 4th Edition
Job Submission
• Job submission
  – Do the input files exist? Can splits be computed?
  – Does the output directory exist?
    • If yes, the job fails. Hadoop expects to create this directory
  – Copy resources to HDFS
    • JAR files
    • Configuration file
    • Computed file splits
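The submission checks above can be pictured with a small plain-Python analogy. This is not Hadoop's submission code: `check_submission` is a hypothetical function, and the `exists` parameter is an assumption that lets the check run against a local file system or any other path lookup.

```python
import os


def check_submission(input_paths, output_dir, exists=os.path.exists):
    """Fail fast, before any task runs, if the job's paths are unusable."""
    for path in input_paths:
        if not exists(path):
            raise FileNotFoundError(f"input path does not exist: {path}")
    if exists(output_dir):
        # The framework expects to create the output directory itself,
        # so a directory that is already there is treated as an error.
        raise FileExistsError(f"output directory already exists: {output_dir}")
```

Refusing to reuse an existing output directory guards against one job silently overwriting the results of another.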
Resource Manager
• Creates tasks (work to be done)
  – Map task for each input split
  – Requested number of reducer tasks
  – Job setup, job cleanup tasks
• Map tasks are assigned to task trackers that are "close" to the input split location
  – Data local preferred
  – Rack local next
• Reduce task can go anywhere. Why?
• Scheduling algorithm orders the tasks
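The locality preference for map tasks can be sketched as a simple three-tier choice. This is an illustration only, not the actual scheduler: `pick_node`, the node names, and the `rack_of` mapping are hypothetical.

```python
def pick_node(split_hosts, free_nodes, rack_of):
    """Choose a node for a map task: data-local, then rack-local, then any."""
    # 1. Data local: a free node that already stores the input split.
    for node in free_nodes:
        if node in split_hosts:
            return node, "data-local"
    # 2. Rack local: a free node on the same rack as some copy of the split.
    split_racks = {rack_of[host] for host in split_hosts}
    for node in free_nodes:
        if rack_of[node] in split_racks:
            return node, "rack-local"
    # 3. Otherwise any free node; the split must cross racks to reach it.
    if free_nodes:
        return free_nodes[0], "off-rack"
    return None, "none"
```

For example, with `rack_of = {"n1": "r1", "n2": "r1", "n3": "r2"}` and a split stored on `n1`, a free `n1` is chosen first, then `n2` on the same rack, and `n3` only as a last resort.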
Task Execution
• Configured for several map and reduce tasks
• Each task has status info (state, progress, counters)
• Periodically sends info to MRAppMaster
  – Running, successful completion, failed
  – Progress (% complete)
• For a new task
  – Copy files to local file system (JAR, configuration)
  – Launch a new JVM (YarnChild drives execution)
  – Load the mapper/reducer class and run the task
Task Progress
• Read in input pair (mapper or reducer)
• Write an output pair (mapper or reducer)
• Set the status description
• Increment a counter
• Reporting progress
Task Progress
• Mapper – straightforward
  – How much of the input has been processed
• Reducer – more complicated
  – Sort, shuffle and reduce are considered here
  – Progress is an estimate of how much of the total work has been done
  – One-third allocated to each
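The one-third rule for reducer progress amounts to a short piece of arithmetic. A minimal sketch, assuming each phase reports its own completion as a fraction between 0 and 1 (`reducer_progress` and its parameter names are hypothetical):

```python
def reducer_progress(shuffle_done, sort_done, reduce_done):
    """Estimate overall reducer progress; each phase contributes one third."""
    phases = (shuffle_done, sort_done, reduce_done)
    for fraction in phases:
        if not 0.0 <= fraction <= 1.0:
            raise ValueError("phase progress must be between 0 and 1")
    return sum(phases) / 3
```

A reducer that has finished its shuffle and is halfway through its sort would report (1 + 0.5 + 0) / 3 = 0.5, or 50% complete, even though the reduce function itself has not yet been called.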
Shuffle
Figure 7-4, Hadoop: The Definitive Guide, 4th Edition
MapReduce in Hadoop
Figure 2-4, Hadoop: The Definitive Guide
The number of reduce tasks is not governed by the size of the input, but instead is specified independently. In “The Default MapReduce Job” on page 227, you will see how to choose the number of reduce tasks for a given job.
When there are multiple reducers, the map tasks partition their output, each creating one partition for each reduce task. There can be many keys (and their associated values) in each partition, but the records for any given key are all in a single partition. The partitioning can be controlled by a user-defined partitioning function, but normally the default partitioner, which buckets keys using a hash function, works very well.
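Hash partitioning as described above can be sketched in plain Python. Hadoop's default partitioner derives the bucket from the key's own hash code; this sketch substitutes `zlib.crc32` as a deterministic stand-in, and the function names are hypothetical.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Bucket a string key by hash; crc32 is a stand-in for the key's hash code."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def partition_map_output(pairs, num_reducers):
    """Group one map task's (key, value) output into per-reducer partitions."""
    partitions = defaultdict(list)
    for key, value in pairs:
        partitions[partition(key, num_reducers)].append((key, value))
    return partitions
```

Because the bucket depends only on the key, every map task sends a given key to the same reducer, which is what guarantees that all records for that key end up in a single partition.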
The data flow for the general case of multiple reduce tasks is illustrated in Figure 2-4. This diagram makes it clear why the data flow between map and reduce tasks is colloquially known as “the shuffle,” as each reduce task is fed by many map tasks. The shuffle is more complicated than this diagram suggests, and tuning it can have a big impact on job execution time, as you will see in “Shuffle and Sort” on page 208.
Figure 2-4. MapReduce data flow with multiple reduce tasks
Finally, it’s also possible to have zero reduce tasks. This can be appropriate when you don’t need the shuffle because the processing can be carried out entirely in parallel (a few examples are discussed in “NLineInputFormat” on page 247). In this case, the only off-node data transfer is when the map tasks write to HDFS (see Figure 2-5).
Combiner Functions
Many MapReduce jobs are limited by the bandwidth available on the cluster, so it pays to minimize the data transferred between map and reduce tasks. Hadoop allows the user to specify a combiner function to be run on the map output, and the combiner