CS60021: Scalable Data Mining: Map Reduce
Sourangshu Bhattacharya

TRANSCRIPT

Page 1: CS60021: Scalable Data Mining Map Reduce

CS60021: Scalable Data Mining
Map Reduce
Sourangshu Bhattacharya

Page 2: CS60021: Scalable Data Mining Map Reduce

Motivation: Google Example
•  20+ billion web pages x 20KB = 400+ TB
•  1 computer reads 30-35 MB/sec from disk
   –  ~4 months to read the web
•  ~1,000 hard drives to store the web
•  Takes even more to do something useful with the data!
•  Today, a standard architecture for such problems is emerging:
   –  Cluster of commodity Linux nodes
   –  Commodity network (ethernet) to connect them

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
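The "~4 months" figure follows directly from the numbers on the slide; a quick back-of-envelope check (all figures approximate):

```python
# Back-of-envelope check of the slide's numbers (approximate figures).
web_size_tb = 400
read_rate_mb_per_s = 35  # one disk, upper end of 30-35 MB/sec

seconds = web_size_tb * 1_000_000 / read_rate_mb_per_s  # 1 TB ~ 10^6 MB
months = seconds / (3600 * 24 * 30)
print(round(months, 1))  # ~4.4 months on a single disk
```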

Page 3: CS60021: Scalable Data Mining Map Reduce

Cluster Architecture

[Diagram: two racks of commodity nodes, each node with its own CPU, Mem, and Disk; one switch per rack, racks linked by a backbone switch]

•  Each rack contains 16-64 nodes
•  1 Gbps between any pair of nodes in a rack
•  2-10 Gbps backbone between racks

In 2011 it was estimated that Google had 1M machines, http://bit.ly/Shh0RO

Page 4: CS60021: Scalable Data Mining Map Reduce

Large-scale Computing
•  Large-scale computing for data mining problems on commodity hardware
•  Challenges:
   –  How do you distribute computation?
   –  How can we make it easy to write distributed programs?
   –  Machines fail:
      •  One server may stay up 3 years (1,000 days)
      •  If you have 1,000 servers, expect to lose 1/day
      •  People estimated Google had ~1M machines in 2011
         –  1,000 machines fail every day!

Page 5: CS60021: Scalable Data Mining Map Reduce

Big Data Challenges
•  Scalability: processing should scale with increase in data.
•  Fault Tolerance: function in presence of hardware failure.
•  Cost Effective: should run on commodity hardware.
•  Ease of use: programs should be small.
•  Flexibility: able to process unstructured data.

•  Solution: MapReduce!

Page 6: CS60021: Scalable Data Mining Map Reduce

Idea and Solution
•  Issue: Copying data over a network takes time
•  Idea:
   –  Bring computation close to the data
   –  Store files multiple times for reliability
•  Map-reduce addresses these problems
   –  Elegant way to work with big data
   –  Storage infrastructure: file system
      •  Google: GFS. Hadoop: HDFS
   –  Programming model
      •  Map-Reduce

Page 7: CS60021: Scalable Data Mining Map Reduce

Storage Infrastructure

•  Problem:
   –  If nodes fail, how to store data persistently?
•  Answer:
   –  Distributed File System:
      •  Provides global file namespace
      •  Google GFS; Hadoop HDFS
•  Typical usage pattern
   –  Huge files (100s of GB to TB)
   –  Data is rarely updated in place
   –  Reads and appends are common

Page 8: CS60021: Scalable Data Mining Map Reduce

What is Hadoop?
•  A scalable fault-tolerant distributed system for data storage and processing.
•  Core Hadoop:
   –  Hadoop Distributed File System (HDFS)
   –  Hadoop YARN: Job Scheduling and Cluster Resource Management
   –  Hadoop MapReduce: Framework for distributed data processing
•  Open Source system with large community support. https://hadoop.apache.org/

Page 9: CS60021: Scalable Data Mining Map Reduce

Hadoop Architecture

[Figure: YARN architecture]

Courtesy: http://hadoop.apache.org/docs/r2.3.0/hadoop-yarn/hadoop-yarn-site/YARN.html

Page 10: CS60021: Scalable Data Mining Map Reduce

HDFS

Page 11: CS60021: Scalable Data Mining Map Reduce

HDFS
•  Assumptions:
   –  Hardware failure is the norm.
   –  Streaming data access.
   –  Write once, read many times.
   –  High throughput, not low latency.
   –  Large datasets.
•  Characteristics:
   –  Performs best with a modest number of large files
   –  Optimized for streaming reads
   –  Layer on top of native file system

Page 12: CS60021: Scalable Data Mining Map Reduce

HDFS
•  Data is organized into files and directories.
•  Files are divided into blocks and distributed to nodes.
•  Block placement is known at the time of read
   –  Computation is moved to the same node.
•  Replication is used for:
   –  Speed
   –  Fault tolerance
   –  Self healing

Page 13: CS60021: Scalable Data Mining Map Reduce

Goals of HDFS
•  Very Large Distributed File System
   –  10K nodes, 100 million files, 10 PB
•  Assumes Commodity Hardware
   –  Files are replicated to handle hardware failure
   –  Detects failures and recovers from them
•  Optimized for Batch Processing
   –  Data locations exposed so that computations can move to where data resides
   –  Provides very high aggregate bandwidth
•  User Space, runs on heterogeneous OS

Page 14: CS60021: Scalable Data Mining Map Reduce

Distributed File System
•  Single Namespace for entire cluster
•  Data Coherency
   –  Write-once-read-many access model
   –  Client can only append to existing files
•  Files are broken up into blocks
   –  Typically 128 MB block size
   –  Each block replicated on multiple DataNodes
•  Intelligent Client
   –  Client can find location of blocks
   –  Client accesses data directly from DataNode
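The block split above is plain fixed-size chunking; a minimal sketch of how many blocks a file occupies (128 MB default, per the slide):

```python
# How a file maps onto fixed-size blocks (128 MB default, per the slide).
BLOCK_SIZE = 128 * 1024 * 1024

def num_blocks(file_size_bytes):
    # ceiling division: the last block may be smaller than BLOCK_SIZE
    return -(-file_size_bytes // BLOCK_SIZE)

one_gb = 1024 ** 3
print(num_blocks(one_gb))      # 8
print(num_blocks(one_gb + 1))  # 9 (one extra, nearly empty block)
```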

Page 15: CS60021: Scalable Data Mining Map Reduce
Page 16: CS60021: Scalable Data Mining Map Reduce

NameNode Metadata
•  Meta-data in Memory
   –  The entire metadata is in main memory
   –  No demand paging of meta-data
•  Types of Metadata
   –  List of files
   –  List of blocks for each file
   –  List of DataNodes for each block
   –  File attributes, e.g. creation time, replication factor
•  A Transaction Log
   –  Records file creations, file deletions, etc.

Page 17: CS60021: Scalable Data Mining Map Reduce

DataNode
•  A Block Server
   –  Stores data in the local file system (e.g. ext3)
   –  Stores meta-data of a block (e.g. CRC)
   –  Serves data and meta-data to Clients
•  Block Report
   –  Periodically sends a report of all existing blocks to the NameNode
•  Facilitates Pipelining of Data
   –  Forwards data to other specified DataNodes

Page 18: CS60021: Scalable Data Mining Map Reduce

HDFS read (client)

Source: Hadoop: The Definitive Guide

Page 19: CS60021: Scalable Data Mining Map Reduce

HDFS write (client)

Source: Hadoop: The Definitive Guide

Page 20: CS60021: Scalable Data Mining Map Reduce

Block Placement
•  Current Strategy
   –  One replica on local node
   –  Second replica on a remote rack
   –  Third replica on same remote rack
   –  Additional replicas are randomly placed
•  Clients read from nearest replica
•  Would like to make this policy pluggable
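The current strategy can be sketched as a small rack-aware placement function (hypothetical data model; real HDFS placement lives in the NameNode):

```python
# Sketch of the slide's placement strategy (hypothetical cluster model).
import random

def place_replicas(local_node, racks):
    # racks: dict rack_name -> list of nodes; local_node is the writer.
    local_rack = next(r for r, nodes in racks.items() if local_node in nodes)
    remote_rack = random.choice([r for r in racks if r != local_rack])
    replicas = [local_node]                        # 1st replica: local node
    replicas += random.sample(racks[remote_rack], 2)  # 2nd & 3rd: same remote rack
    # additional replicas (beyond 3) would be placed randomly
    return replicas

racks = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4", "n5"]}
print(place_replicas("n1", racks))  # e.g. ['n1', 'n4', 'n3']
```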

Page 21: CS60021: Scalable Data Mining Map Reduce

NameNode Failure
•  A single point of failure
•  Transaction Log stored in multiple directories
   –  A directory on the local file system
   –  A directory on a remote file system (NFS/CIFS)
•  Need to develop a real HA solution

Page 22: CS60021: Scalable Data Mining Map Reduce

Data Pipelining
•  Client retrieves a list of DataNodes on which to place replicas of a block
•  Client writes block to the first DataNode
•  The first DataNode forwards the data to the next DataNode in the pipeline
•  When all replicas are written, the Client moves on to write the next block in the file
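A toy model of the steps above: the client writes only to the first DataNode, which forwards the block down the chain (all names hypothetical):

```python
# Toy model of HDFS write pipelining (hypothetical names).
def write_block(block, pipeline, stores):
    # pipeline: DataNode names from the NameNode; stores: node -> list of blocks
    head, *rest = pipeline
    stores[head].append(block)            # client (or previous node) writes here
    if rest:
        write_block(block, rest, stores)  # DataNode forwards to the next one

stores = {"dn1": [], "dn2": [], "dn3": []}
write_block("blk_1", ["dn1", "dn2", "dn3"], stores)
print(all("blk_1" in s for s in stores.values()))  # True
```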

Page 23: CS60021: Scalable Data Mining Map Reduce

MAPREDUCE

Page 24: CS60021: Scalable Data Mining Map Reduce

What is MapReduce?
•  Method for distributing a task across multiple servers.
•  Proposed by Dean and Ghemawat, 2004.
•  Consists of two developer-created phases:
   –  Map
   –  Reduce
•  In between Map and Reduce is the Shuffle and Sort phase.
•  User is responsible for casting the problem into the map-reduce framework.
•  Multiple map-reduce jobs can be "chained".

Page 25: CS60021: Scalable Data Mining Map Reduce

Programming Model: MapReduce

Warm-up task:
•  We have a huge text document
•  Count the number of times each distinct word appears in the file
•  Sample application:
   –  Analyze web server logs to find popular URLs

Page 26: CS60021: Scalable Data Mining Map Reduce

Task: Word Count

Case 1:
–  File too large for memory, but all <word, count> pairs fit in memory

Case 2:
•  Count occurrences of words:
   –  words(doc.txt) | sort | uniq -c
   –  where words takes a file and outputs the words in it, one per line
•  Case 2 captures the essence of MapReduce
   –  Great thing is that it is naturally parallelizable

Page 27: CS60021: Scalable Data Mining Map Reduce

MapReduce: Overview
•  Sequentially read a lot of data
•  Map:
   –  Extract something you care about
•  Group by key: Sort and Shuffle
•  Reduce:
   –  Aggregate, summarize, filter or transform
•  Write the result

Outline stays the same; Map and Reduce change to fit the problem

Page 28: CS60021: Scalable Data Mining Map Reduce

MapReduce: The Map Step

[Diagram: map is called once per input key-value pair (k, v) and emits zero or more intermediate key-value pairs]

Input key-value pairs → map → Intermediate key-value pairs

Page 29: CS60021: Scalable Data Mining Map Reduce

MapReduce: The Reduce Step

[Diagram: intermediate key-value pairs are grouped by key into key-value groups (k, <v, v, ...>); reduce is called once per group and emits output key-value pairs]

Intermediate key-value pairs → Group by key → Key-value groups → reduce → Output key-value pairs

Page 30: CS60021: Scalable Data Mining Map Reduce

More Specifically
•  Input: a set of key-value pairs
•  Programmer specifies two methods:
   –  Map(k, v) → <k', v'>*
      •  Takes a key-value pair and outputs a set of key-value pairs
         –  E.g., key is the filename, value is a single line in the file
      •  There is one Map call for every (k, v) pair
   –  Reduce(k', <v'>*) → <k', v''>*
      •  All values v' with same key k' are reduced together and processed in v' order
      •  There is one Reduce function call per unique key k'
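The contract above can be sketched as a toy single-process runner (Python stands in for Hadoop's Java API; the sort is a stand-in for the shuffle phase):

```python
# Toy single-process MapReduce runner following the contract above.
# map_fn: (k, v) -> iterable of (k', v'); reduce_fn: (k', [v']) -> iterable of (k', v'')
from itertools import groupby
from operator import itemgetter

def map_reduce(inputs, map_fn, reduce_fn):
    intermediate = []
    for k, v in inputs:                        # one Map call per (k, v) pair
        intermediate.extend(map_fn(k, v))
    intermediate.sort(key=itemgetter(0))       # stands in for shuffle and sort
    output = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        values = [v for _, v in group]         # key-value group <k', v'*>
        output.extend(reduce_fn(key, values))  # one Reduce call per unique key
    return output

# Word count through the generic runner:
out = map_reduce(
    [("doc.txt", "the crew of the crew")],
    lambda k, v: [(w, 1) for w in v.split()],
    lambda k, vs: [(k, sum(vs))],
)
print(dict(out))  # {'crew': 2, 'of': 1, 'the': 2}
```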

Page 31: CS60021: Scalable Data Mining Map Reduce

MapReduce: Word Counting

Big document (input):
"The crew of the space shuttle Endeavor recently returned to Earth as ambassadors, harbingers of a new era of space exploration. Scientists at NASA are saying that the recent assembly of the Dextre bot is the first step in a long-term space-based man/machine partnership. 'The work we're doing now -- the robotics we're doing -- is what we're going to need ...'"

MAP: Read input and produce a set of key-value pairs (provided by the programmer):
(The, 1) (crew, 1) (of, 1) (the, 1) (space, 1) (shuttle, 1) (Endeavor, 1) (recently, 1) ...

Group by key: Collect all pairs with same key:
(crew, 1) (crew, 1) (space, 1) (the, 1) (the, 1) (the, 1) (shuttle, 1) (recently, 1) ...

Reduce: Collect all values belonging to the key and output (provided by the programmer):
(crew, 2) (space, 1) (the, 3) (shuttle, 1) (recently, 1) ...

Sequentially read the data; only sequential reads.

Page 32: CS60021: Scalable Data Mining Map Reduce

Word Count Using MapReduce

map(key, value):
  // key: document name; value: text of the document
  for each word w in value:
    emit(w, 1)

reduce(key, values):
  // key: a word; values: an iterator over counts
  result = 0
  for each count v in values:
    result += v
  emit(key, result)
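A runnable Python version of the pseudocode above, with emit modeled by yield and a dictionary standing in for the shuffle/sort phase:

```python
# Runnable translation of the word-count pseudocode above.
from collections import defaultdict

def map_fn(key, value):
    # key: document name; value: text of the document
    for w in value.split():
        yield (w, 1)

def reduce_fn(key, values):
    # key: a word; values: an iterator over counts
    result = 0
    for v in values:
        result += v
    yield (key, result)

# Group by key: a stand-in for the shuffle/sort phase.
groups = defaultdict(list)
for k, v in map_fn("doc.txt", "to be or not to be"):
    groups[k].append(v)

counts = dict(kv for k, vs in groups.items() for kv in reduce_fn(k, vs))
print(counts["to"], counts["be"])  # 2 2
```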

Page 33: CS60021: Scalable Data Mining Map Reduce

Map Phase
•  User writes the mapper method.
•  Input is an unstructured record:
   –  E.g. a row of an RDBMS table,
   –  a line of a text file, etc.
•  Output is a set of records of the form <key, value>
   –  Both key and value can be anything, e.g. text, number, etc.
   –  E.g. for a row of an RDBMS table: <column id, value>
   –  For a line of a text file: <word, count>

Page 34: CS60021: Scalable Data Mining Map Reduce

Shuffle/Sort phase
•  Shuffle phase ensures that all the mapper output records with the same key value go to the same reducer.
•  Sort ensures that among the records received at each reducer, records with the same key arrive together.

Page 35: CS60021: Scalable Data Mining Map Reduce

Reduce phase
•  Reducer is a user-defined function which processes mapper output records with some of the keys output by the mapper.
•  Input is of the form <key, value>
   –  All records having the same key arrive together.
•  Output is a set of records of the form <key, value>
   –  Key is not important

Page 36: CS60021: Scalable Data Mining Map Reduce

Parallel picture

Page 37: CS60021: Scalable Data Mining Map Reduce

Example

•  Word Count: Count the total no. of occurrences of each word

Page 38: CS60021: Scalable Data Mining Map Reduce

MapReduce

What was the max/min temperature for the last century?
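The temperature query fits the same mold: map emits (year, temperature) pairs, reduce takes the max (or min) per year. A sketch with made-up records:

```python
# Sketch of the temperature query as map/reduce (records are made up).
from collections import defaultdict

records = [
    ("1950-06-01", 31.2), ("1950-12-01", -4.0),
    ("1951-07-15", 35.6), ("1951-01-20", -9.8),
]

def map_fn(date, temp):
    yield (date[:4], temp)       # key = year

def reduce_max(year, temps):
    yield (year, max(temps))     # swap in min() for the minimum query

groups = defaultdict(list)
for date, temp in records:
    for k, v in map_fn(date, temp):
        groups[k].append(v)

maxima = dict(kv for y, ts in groups.items() for kv in reduce_max(y, ts))
print(maxima)  # {'1950': 31.2, '1951': 35.6}
```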

Page 39: CS60021: Scalable Data Mining Map Reduce

Hadoop MapReduce
•  Provides:
   –  Automatic parallelization and distribution
   –  Fault tolerance
   –  Methods for interfacing with HDFS for colocation of computation and storage of output
   –  Status and monitoring tools
   –  API in Java
   –  Ability to define the mapper and reducer in many languages through Hadoop streaming

Page 40: CS60021: Scalable Data Mining Map Reduce

Hadoop MR Data Flow

Source: Hadoop: The Definitive Guide

Page 41: CS60021: Scalable Data Mining Map Reduce

Hadoop (v2) MR job

Source: Hadoop: The Definitive Guide

Page 42: CS60021: Scalable Data Mining Map Reduce

Shuffle and sort

Source: Hadoop: The Definitive Guide

Page 43: CS60021: Scalable Data Mining Map Reduce

Data Flow

•  Input and final output are stored on a distributed file system (FS):
   –  Scheduler tries to schedule map tasks "close" to physical storage location of input data
•  Intermediate results are stored on local FS of Map and Reduce workers
•  Output is often input to another MapReduce task

Page 44: CS60021: Scalable Data Mining Map Reduce

Coordination: Master

•  Master node takes care of coordination:
   –  Task status: (idle, in-progress, completed)
   –  Idle tasks get scheduled as workers become available
   –  When a map task completes, it sends the master the location and sizes of its R intermediate files, one for each reducer
   –  Master pushes this info to reducers
•  Master pings workers periodically to detect failures

Page 45: CS60021: Scalable Data Mining Map Reduce

Fault tolerance

•  Comes from scalability and cost effectiveness
•  HDFS:
   –  Replication
•  MapReduce:
   –  Restarting failed tasks: map and reduce
   –  Writing map output to FS
   –  Minimizes re-computation

Page 46: CS60021: Scalable Data Mining Map Reduce

Dealing with Failures

•  Map worker failure
   –  Map tasks completed or in-progress at worker are reset to idle
   –  Reduce workers are notified when task is rescheduled on another worker
•  Reduce worker failure
   –  Only in-progress tasks are reset to idle
   –  Reduce task is restarted
•  Master failure
   –  MapReduce task is aborted and client is notified

Page 47: CS60021: Scalable Data Mining Map Reduce

Failures

•  Task failure
   –  Task has failed: report error to node manager, app master, client.
   –  Task not responsive, JVM failure: node manager restarts tasks.
•  Application Master failure
   –  Application master sends heartbeats to resource manager.
   –  If not received, the resource manager retrieves the job history of the run tasks.
•  Node manager failure

Page 48: CS60021: Scalable Data Mining Map Reduce

How many Map and Reduce jobs?

•  M map tasks, R reduce tasks
•  Rule of thumb:
   –  Make M much larger than the number of nodes in the cluster
   –  One DFS chunk per map is common
   –  Improves dynamic load balancing and speeds up recovery from worker failures
•  Usually R is smaller than M
   –  Because output is spread across R files

Page 49: CS60021: Scalable Data Mining Map Reduce

Task Granularity & Pipelining

•  Fine granularity tasks: map tasks >> machines
   –  Minimizes time for fault recovery
   –  Can do pipeline shuffling with map execution
   –  Better dynamic load balancing

Page 50: CS60021: Scalable Data Mining Map Reduce

Refinements: Backup Tasks

•  Problem
   –  Slow workers significantly lengthen the job completion time:
      •  Other jobs on the machine
      •  Bad disks
      •  Weird things
•  Solution
   –  Near end of phase, spawn backup copies of tasks
      •  Whichever one finishes first "wins"
•  Effect
   –  Dramatically shortens job completion time

Page 51: CS60021: Scalable Data Mining Map Reduce

Refinement: Combiners

•  Often a Map task will produce many pairs of the form (k, v1), (k, v2), ... for the same key k
   –  E.g., popular words in the word count example
•  Can save network time by pre-aggregating values in the mapper:
   –  combine(k, list(v1)) → v2
   –  Combiner is usually the same as the reduce function
•  Works only if reduce function is commutative and associative
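A sketch of a word-count combiner: each mapper pre-aggregates its own output before anything is shuffled (valid here because addition is commutative and associative):

```python
# Sketch of a combiner for word count: pre-aggregate inside one mapper.
from collections import Counter

def map_with_combiner(lines):
    local = Counter()             # combine step, local to this mapper
    for line in lines:
        local.update(line.split())
    return list(local.items())    # far fewer pairs cross the network

pairs = map_with_combiner(["the crew of the space shuttle", "the crew"])
print(dict(pairs)["the"])  # 3
```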

Page 52: CS60021: Scalable Data Mining Map Reduce

Refinement: Combiners

•  Back to our word counting example:
   –  Combiner combines the values of all keys of a single mapper (single machine):
   –  Much less data needs to be copied and shuffled!

Page 53: CS60021: Scalable Data Mining Map Reduce

Refinement: Partition Function

•  Want to control how keys get partitioned
   –  Inputs to map tasks are created by contiguous splits of input file
   –  Reduce needs to ensure that records with the same intermediate key end up at the same worker
•  System uses a default partition function:
   –  hash(key) mod R
•  Sometimes useful to override the hash function:
   –  E.g., hash(hostname(URL)) mod R ensures URLs from a host end up in the same output file
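Both partitioners can be sketched in a few lines (Python's built-in hash stands in for Hadoop's partitioner; R is the reducer count):

```python
# Default vs. custom partition function, as sketched above.
from urllib.parse import urlparse

R = 4  # number of reducers

def default_partition(key):
    return hash(key) % R

def url_partition(url):
    # All URLs from one host go to the same reducer/output file.
    return hash(urlparse(url).hostname) % R

p1 = url_partition("http://example.com/a")
p2 = url_partition("http://example.com/b")
print(p1 == p2)  # True: same host, same reducer
```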

Page 54: CS60021: Scalable Data Mining Map Reduce

Example: Join By Map-Reduce

•  Compute the natural join R(A, B) ⋈ S(B, C)
•  R and S are each stored in files
•  Tuples are pairs (a, b) or (b, c)

R(A, B):      S(B, C):      R ⋈ S (A, C):
(a1, b1)      (b2, c1)      (a3, c1)
(a2, b1)      (b2, c2)      (a3, c2)
(a3, b2)      (b3, c3)      (a4, c3)
(a4, b3)

Page 55: CS60021: Scalable Data Mining Map Reduce

Map-Reduce Join

•  Use a hash function h from B-values to 1...k
•  A Map process turns:
   –  Each input tuple R(a, b) into key-value pair (b, (a, R))
   –  Each input tuple S(b, c) into (b, (c, S))
•  Map processes send each key-value pair with key b to Reduce process h(b)
   –  Hadoop does this automatically; just tell it what k is.
•  Each Reduce process matches all the pairs (b, (a, R)) with all (b, (c, S)) and outputs (a, b, c).
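The join can be sketched in one process, with tags marking which relation each tuple came from (using the R and S tuples from the previous slide):

```python
# Sketch of the reduce-side join described above, run in one process.
from collections import defaultdict

R = [("a1", "b1"), ("a2", "b1"), ("a3", "b2"), ("a4", "b3")]
S = [("b2", "c1"), ("b2", "c2"), ("b3", "c3")]

# Map: R(a, b) -> (b, (a, 'R')); S(b, c) -> (b, (c, 'S')).
# Grouping by b stands in for routing key b to Reduce process h(b).
groups = defaultdict(list)
for a, b in R:
    groups[b].append((a, "R"))
for b, c in S:
    groups[b].append((c, "S"))

# Reduce: for each key b, match every R-value with every S-value.
joined = []
for b, pairs in groups.items():
    a_vals = [x for x, tag in pairs if tag == "R"]
    c_vals = [x for x, tag in pairs if tag == "S"]
    joined.extend((a, b, c) for a in a_vals for c in c_vals)

print(sorted(joined))
# [('a3', 'b2', 'c1'), ('a3', 'b2', 'c2'), ('a4', 'b3', 'c3')]
```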