
  • Distributed Computing using Spark
    (leveraging data-parallel program structure)

  • Review: which program performs better?

    void add(int n, float* A, float* B, float* C) {
        for (int i = 0; i < n; i++)
            C[i] = A[i] + B[i];
    }

  • The previous example involved globally restructuring the order of computation to improve producer-consumer locality (improving the arithmetic intensity of the program)

  • Parallel computers in class so far

    Many cores connected to a single shared memory system
    Image credit: https://wccftech.com/intel-cascade-lake-advanced-performance-48-core-xeon-cpus-announced/

  • Warehouse-scale computing

  • Scale-out cluster computing

    ▪ Inexpensive way to realize a high-core-count, high-memory (in aggregate) computer
    - Made from (somewhat*) commodity Linux servers (commodity processors, networking, and storage)
    - Private per-server address space
    - Relatively low-bandwidth connectivity between servers

    *Cloud vendors like AWS, Google, MS Azure, and Facebook make significant customizations in their datacenters.

  • Typical commodity server

    - CPU x2 (8-64 cores), DRAM (64-512 GB), ~50 GB/s memory bandwidth
    - Disks x10 (10-30 TB), ~50 MB/s each
    - Network (top-of-rack switch): 1-4 GB/s to nodes in the same rack, 0.1-4 GB/s to nodes in other racks
    - ~40 1RU servers per rack

  • Why write an application for a cluster?

    ▪ Motivating problem:
    - Consider processing 100 TB of data
    - On one node with one disk, scanning at 50 MB/s: ~23 days (100 TB / 50 MB/s ≈ 2,000,000 seconds)
    - On 1000 nodes, each scanning at 50 MB/s: ~33 minutes!

    ▪ Challenge: it can be hard to effectively utilize 1000 nodes
    - Need to program 1000 x cores_per_node total cores
    - Have to worry about machine failures
    - Or machines that are faster or slower than others
    - It would be nice to have parallel programming frameworks that make it easier to utilize resources at this scale!*

    *We've already seen programming languages/frameworks to help us with SIMD, multi-core, and GPU-based programming.

  • Today I need you to assume cluster storage systems exist

    ▪ If nodes can fail, how do we store data persistently?

    ▪ Modern solution: distributed storage systems
    - Provide a global namespace for files
    - Examples: Google GFS, Hadoop HDFS, Amazon S3

    ▪ Typical usage patterns
    - Huge files (100s of GB to TBs)
    - Data is rarely updated in place
    - Reads and appends are common (e.g., log files)

  • Example: Hadoop distributed FS (HDFS)

    • Global namespace
    • Files split into ~200 MB blocks
    • Each block replicated on multiple DataNodes
    • Intelligent client: finds locations of blocks from the NameNode; requests data directly from DataNodes

    [Figure: a Client first asks the NameNode for block locations (the NameNode tracks File1 -> Block 1, Block 2 and File2 -> Block 3, Block 4, and the DataNodes holding each block), then reads/writes the data blocks directly from the DataNodes, each of which stores a subset of the replicated blocks.]
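
    The client-side read path above can be summarized in a few lines. A minimal, in-memory sketch (the class and method names below are illustrative stand-ins, not the real HDFS client API): first ask the NameNode for the file's block list and replica locations, then read each block directly from one of its DataNodes.

    // Toy sketch of the "intelligent client" flow; NameNode, DataNode, and
    // readFile here are hypothetical stand-ins, not the real HDFS API.
    class NameNode(fileToBlocks: Map[String, Seq[Int]],
                   blockToNodes: Map[Int, Seq[String]]) {
      // Which blocks make up this file, and which DataNodes hold replicas of each?
      def getBlockLocations(file: String): Seq[(Int, Seq[String])] =
        fileToBlocks(file).map(b => (b, blockToNodes(b)))
    }

    class DataNode(blocks: Map[Int, String]) {
      // Serve one block's contents directly, with no NameNode involvement.
      def readBlock(id: Int): String = blocks(id)
    }

    def readFile(file: String, nn: NameNode, dataNodes: Map[String, DataNode]): String =
      nn.getBlockLocations(file)
        .map { case (blockId, replicas) => dataNodes(replicas.head).readBlock(blockId) }
        .mkString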

  • Let's say the website gets very popular…

  • The log of page views gets quite large…

    Assume log.txt is a large file, stored in a distributed file system like HDFS
    Below: cluster of 4 nodes, each node with a 1 TB disk
    Contents of log.txt are distributed evenly in blocks across the cluster

    [Figure: Nodes 0-3, each with a CPU, DRAM, and a 1 TB disk; log.txt blocks 0-1 on Node 0, blocks 2-3 on Node 1, blocks 4-5 on Node 2, blocks 6-7 on Node 3.]

  • Imagine your professors want to know a bit more about the students visiting the 431 website…

    For example:
    "What type of mobile phone are all these students using?"
    "When did they first download the handout for the homework assignment due tomorrow?"

  • Consider a simple programming model

    // this function is called once per line in input file by runtime
    // input: string (line of input file)
    // output: adds (user_agent, 1) entry to list
    void mapper(string line, multimap<string,int>& results) {
        string user_agent = parse_requester_user_agent(line);
        if (is_mobile_client(user_agent))
            results.add(user_agent, 1);
    }

    // this function is called once per unique key (user_agent) in results
    // values is a list of values associated with the given key
    void reducer(string key, list<int> values, int& result) {
        int sum = 0;
        for (v in values)
            sum += v;
        result = sum;
    }

    // iterator over lines of text file
    LineByLineReader input("hdfs://431log.txt");

    // stores output
    Writer output("hdfs://…");

    // do stuff
    runMapReduceJob(mapper, reducer, input, output);

    (The code above computes the count of page views by each type of mobile phone.)
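
    Before designing the distributed implementation (next slides), it may help to pin down what runMapReduceJob should compute. A minimal single-machine sketch of its semantics in Scala (the Scala formulation and this exact signature are assumptions for illustration, not the course's C-style interface):

    // Sequential reference semantics for runMapReduceJob: apply the mapper to every
    // input line, group the emitted (key, value) pairs by key, then apply the
    // reducer once per unique key.
    def runMapReduceJob(mapper: String => Seq[(String, Int)],
                        reducer: (String, Seq[Int]) => Int,
                        input: Iterator[String]): Map[String, Int] =
      input.flatMap(mapper)
           .toSeq
           .groupBy { case (key, _) => key }
           .map { case (key, pairs) => key -> reducer(key, pairs.map(_._2)) }

    // Usage matching the slide's example (parse_requester_user_agent and
    // is_mobile_client are the slide's helpers, assumed available):
    // val counts = runMapReduceJob(
    //   line => { val ua = parse_requester_user_agent(line)
    //             if (is_mobile_client(ua)) Seq((ua, 1)) else Seq() },
    //   (_, values) => values.sum,
    //   input)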

  • Let's design an implementation of runMapReduceJob

  • Step 1: running the mapper function

    (The mapper/reducer code and the runMapReduceJob call from the previous slide are shown again on this slide for reference.)

    [Figure: Nodes 0-3, each with a CPU and a disk holding two blocks of log.txt (blocks 0-7).]

    Step 1: run mapper function on all lines of file
    Question: How to assign work to nodes?

    Idea 1: use a work queue for the list of input blocks to process
    Dynamic assignment: a free node takes the next available block (work queue: block 0, block 1, block 2, ...; see the sketch below)

    Idea 2: data-distribution-based assignment: each node processes lines in the blocks of the input file that are stored locally.
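
    A minimal sketch of Idea 1 within one machine's worth of workers (a real cluster scheduler would hand out blocks over the network, but the structure is the same): a shared queue of input blocks, and each free worker repeatedly pulls the next available block. The names here are assumptions for illustration.

    import java.util.concurrent.ConcurrentLinkedQueue

    // Dynamic assignment: free workers repeatedly take the next available block.
    def runMappersDynamically(blocks: Seq[String], numWorkers: Int)
                             (processBlock: String => Unit): Unit = {
      val queue = new ConcurrentLinkedQueue[String]()
      blocks.foreach(queue.add)
      val workers = (0 until numWorkers).map { _ =>
        new Thread(() => {
          var b = queue.poll()                 // null once the queue is empty
          while (b != null) { processBlock(b); b = queue.poll() }
        })
      }
      workers.foreach(_.start())
      workers.foreach(_.join())
    }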

  • Steps 2 and 3: gathering data, running the reducer

    (Same mapper/reducer code and four-node cluster as before.)

    Step 2: Prepare intermediate data for reducer
    Step 3: Run reducer function on all keys
    Question: how to assign reducer tasks?
    Question: how to get all data for a key onto the correct worker node?

    Keys to reduce (generated by the mappers), spread across the nodes:
    - Node 0: "Safari iOS" values 0, "Chrome" values 0, "Chrome Glass" values 0
    - Node 1: "Safari iOS" values 1, "Chrome" values 1
    - Node 2: "Safari iOS" values 2, "Chrome" values 2
    - Node 3: "Safari iOS" values 3, "Chrome" values 3, "Safari iWatch" values 3

  • Steps 2 and 3: gathering data, running the reducer

    // gather all input data for key, then execute reducer
    // to produce final result
    void runReducer(string key, reducer, result) {
        list inputs;
        for (n in nodes) {
            filename = get_filename(key, n);
            read lines of filename, append into inputs;
        }
        reducer(key, inputs, result);
    }

    Step 2: Prepare intermediate data for reducer.
    Step 3: Run reducer function on all keys.
    Question: how to assign reducer tasks?
    Question: how to get all data for a key onto the correct worker node?

    Example: assign "Safari iOS" to Node 0 (Node 0 then gathers the "Safari iOS" values 0-3 from all four nodes and runs the reducer for that key).

    [Figure: same four-node cluster and per-node intermediate key/value lists as the previous slide.]
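
    One common answer to both questions, as a sketch of the usual approach (not necessarily exactly what the course implementation does): hash each key to a node id. Every mapper then writes its (key, value) pairs into per-destination-node files, and each node runs the reducer for exactly the keys that hash to it.

    // Deterministic key -> reducer-node assignment: every node computes the same
    // answer for a given key, so all values for that key end up on one node.
    def reducerNodeFor(key: String, numNodes: Int): Int =
      (key.hashCode % numNodes + numNodes) % numNodes   // non-negative modulo

    // e.g., with 4 nodes, every mapper sends its ("Safari iOS", 1) pairs to
    // node reducerNodeFor("Safari iOS", 4), which later runs reducer("Safari iOS", ...).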

  • Additional implementation challenges at scale

    [Figure: a cluster of 1000 nodes (Node 0 … Node 999), each with a CPU and a disk holding blocks of log.txt.]

    Nodes may fail during program execution

    Some nodes may run slower than others (due to different amounts of work, heterogeneity in the cluster, etc.)

  • Job scheduler responsibilities

    ▪ Exploit data locality: "move computation to the data"
    - Run mapper jobs on nodes that contain input files
    - Run reducer jobs on nodes that already have most of the data for a certain key

    ▪ Handling node failures
    - Scheduler detects job failures and reruns the job on new machines
    - This is possible since inputs reside in persistent storage (distributed file system)
    - Scheduler duplicates jobs on multiple machines (reduces overall processing latency incurred by node failures)

    ▪ Handling slow machines
    - Scheduler duplicates jobs on multiple machines

  • runMapReduceJob problems?

    ▪ Permits only a very simple program structure
    - Programs must be structured as: map, followed by reduce by key
    - See DryadLINQ for generalization to DAGs

    ▪ Iterative algorithms must load from disk each iteration
    - No primitives for sharing data directly between jobs

    [Figure: an iterative job as a chain of map-reduce jobs; each iteration starts with an HDFS read of its input and ends with an HDFS write of its output, which the next iteration reads back in.]

  • Spark

    In-memory, fault-tolerant distributed computing
    http://spark.apache.org/
    [Zaharia et al. NSDI 2012]

  • Goals

    ▪ Programming model for cluster-scale computations where there is significant reuse of intermediate datasets
    - Iterative machine learning and graph algorithms
    - Interactive data mining: load a large dataset into the aggregate memory of the cluster and then perform multiple ad-hoc queries

    ▪ Don't want to incur the inefficiency of writing intermediates to a persistent distributed file system (want to keep them in memory)
    - Challenge: efficiently implementing fault tolerance for large-scale distributed in-memory computations.
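
    To make the "reuse of intermediate datasets" goal concrete, here is a minimal sketch of the kind of program Spark targets (illustrative Spark-style code, not from the slides; parsePoint and gradientStep are hypothetical helpers): the input is loaded and parsed once, kept in cluster memory, and reused every iteration.

    // Parse the dataset once and ask Spark to keep it in aggregate cluster memory.
    var points = spark.textFile("hdfs://431log.txt")
                      .map(line => parsePoint(line))   // parsePoint: hypothetical parser
                      .persist()

    var model = 0.0
    for (i <- 1 to 20) {
      // Each iteration reads `points` from memory; nothing is written back to HDFS
      // between iterations (contrast with the iterative map-reduce picture earlier).
      model += points.map(p => gradientStep(p, model)).reduce((a, b) => a + b)
    }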

  • Fault tolerance for in-memory calculations

    ▪ Replicate all computations
    - Expensive solution: decreases peak throughput

    ▪ Checkpoint and rollback
    - Periodically save state of program to persistent storage
    - Restart from last checkpoint on node failure

    ▪ Maintain log of updates (commands and data)
    - High overhead for maintaining logs

    Recall map-reduce solutions:
    - Checkpoints after each map/reduce step by writing results to the file system
    - Scheduler's list of outstanding (but not yet complete) jobs is a log
    - Functional structure of programs allows for restart at the granularity of a single mapper or reducer invocation (don't have to restart the entire program)

  • Resilient distributed dataset (RDD)

    Spark's key programming abstraction:
    - Read-only ordered collection of records (immutable)
    - RDDs can only be created by deterministic transformations on data in persistent storage or on existing RDDs
    - Actions on RDDs return data to application

    // create RDD from filesystem data
    var lines = spark.textFile("hdfs://431log.txt");

    // create RDD using filter() transformation on lines
    var mobileViews = lines.filter((x: String) => isMobileClient(x));

    // another filter() transformation
    var safariViews = mobileViews.filter((x: String) => x.contains("Safari"));

    // then count number of elements in RDD via count() action
    var numViews = safariViews.count();

    [Lineage diagram: 431log.txt →.textFile(…)→ lines →.filter(...)→ mobileViews →.filter(...)→ safariViews →.count()→ numViews (an int); lines, mobileViews, and safariViews are RDDs.]

  • Repeating the map-reduce example

    // 1. create RDD from filesystem data
    // 2. create RDD with only lines from mobile clients
    // 3. create RDD with elements of type (String, Int) from line string
    // 4. group elements by key
    // 5. call provided reduction function on all keys to count views
    var perAgentCounts = spark.textFile("hdfs://431log.txt")
                              .filter(x => isMobileClient(x))
                              .map(x => (parseUserAgent(x), 1))
                              .reduceByKey((x, y) => x + y)
                              .collect();

    [Lineage diagram: 431log.txt →.textFile(…)→ lines →.filter(isMobileClient(…))→ →.map(parseUserAgent(…))→ →.reduceByKey(…)→ →.collect()→ perAgentCounts (an Array of (String, Int)).]

    "Lineage": sequence of RDD operations needed to compute output

  • Another Spark program

    // create RDD from filesystem data
    var lines = spark.textFile("hdfs://431log.txt");

    // create RDD using filter() transformation on lines
    var mobileViews = lines.filter((x: String) => isMobileClient(x));

    // instruct Spark runtime to try to keep mobileViews in memory
    mobileViews.persist();

    // create a new RDD by filtering mobileViews
    // then count number of elements in new RDD via count() action
    var numViews = mobileViews.filter(_.contains("Safari")).count();

    // 1. create new RDD by filtering only Chrome views
    // 2. for each element (page view), split string and take timestamp of page view
    // 3. convert RDD to a scalar sequence (collect() action)
    var timestamps = mobileViews.filter(_.contains("Chrome"))
                                .map(_.split(" ")(0))
                                .collect();

    [Lineage diagram: 431log.txt →.textFile(…)→ lines →.filter(isMobileClient(…))→ mobileViews, which branches into .filter(contains("Safari")).count() → numViews and .filter(contains("Chrome")).map(split(…)).collect() → timestamps.]

  • RDD transformations and actions

    Transformations (data-parallel operators taking an input RDD to a new RDD):
    - map(f : T → U)                   : RDD[T] → RDD[U]
    - filter(f : T → Bool)             : RDD[T] → RDD[T]
    - flatMap(f : T → Seq[U])          : RDD[T] → RDD[U]
    - sample(fraction : Float)         : RDD[T] → RDD[T]  (deterministic sampling)
    - groupByKey()                     : RDD[(K, V)] → RDD[(K, Seq[V])]
    - reduceByKey(f : (V, V) → V)      : RDD[(K, V)] → RDD[(K, V)]
    - union()                          : (RDD[T], RDD[T]) → RDD[T]
    - join()                           : (RDD[(K, V)], RDD[(K, W)]) → RDD[(K, (V, W))]
    - cogroup()                        : (RDD[(K, V)], RDD[(K, W)]) → RDD[(K, (Seq[V], Seq[W]))]
    - crossProduct()                   : (RDD[T], RDD[U]) → RDD[(T, U)]
    - mapValues(f : V → W)             : RDD[(K, V)] → RDD[(K, W)]  (preserves partitioning)
    - sort(c : Comparator[K])          : RDD[(K, V)] → RDD[(K, V)]
    - partitionBy(p : Partitioner[K])  : RDD[(K, V)] → RDD[(K, V)]

    Actions (provide data back to the "host" application):
    - count()                : RDD[T] → Long
    - collect()              : RDD[T] → Seq[T]
    - reduce(f : (T, T) → T) : RDD[T] → T
    - lookup(k : K)          : RDD[(K, V)] → Seq[V]  (on hash/range partitioned RDDs)
    - save(path : String)    : outputs RDD to a storage system, e.g., HDFS
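
    A small sketch combining several of the operators above on the running example (clientssupported.txt and its format are borrowed from a later slide; parseClientInfoLine is a hypothetical parser, and parseUserAgent / isMobileClient are the same assumed helpers as before):

    // (user agent, view count) via map + reduceByKey
    var viewsPerAgent = spark.textFile("hdfs://431log.txt")
                             .filter(x => isMobileClient(x))
                             .map(x => (parseUserAgent(x), 1))
                             .reduceByKey((a, b) => a + b);        // RDD of (String, Int)

    // (user agent, "yes"/"no") pairs describing which agents are supported
    var clientInfo = spark.textFile("hdfs://clientssupported.txt")
                          .map(line => parseClientInfoLine(line)); // hypothetical parser

    // join on user agent, then bring the result back to the host application
    var supported = viewsPerAgent.join(clientInfo)                 // RDD of (String, (Int, String))
                                 .collect();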

  • How do we implement RDDs? In particular, how should they be stored?

    var lines = spark.textFile("hdfs://431log.txt");
    var lower = lines.map(_.toLower());
    var mobileViews = lower.filter(x => isMobileClient(x));
    var howMany = mobileViews.count();

    [Figure: four nodes, each with a CPU, DRAM, and a disk holding two blocks of log.txt (blocks 0-7).]

    Question: should we think of RDDs like arrays?

  • How do we implement RDDs? In particular, how should they be stored?

    var lines = spark.textFile("hdfs://431log.txt");
    var lower = lines.map(_.toLower());
    var mobileViews = lower.filter(x => isMobileClient(x));
    var howMany = mobileViews.count();

    [Figure: the same four nodes, but now each node's DRAM holds two partitions each of lines, lower, and mobileViews (partitions 0-1 on the first node, 2-3 on the second, 4-5 on the third, 6-7 on the fourth), alongside the log.txt blocks on disk.]

    In-memory representation would be huge! (larger than the original file on disk)

  • RDD partitioning and dependencies

    var lines = spark.textFile("hdfs://431log.txt");
    var lower = lines.map(_.toLower());
    var mobileViews = lower.filter(x => isMobileClient(x));
    var howMany = mobileViews.count();

    [Figure: across Nodes 0-3, log.txt blocks 0-7 feed .load() into lines partitions 0-7, which feed .map() into lower partitions 0-7, which feed .filter() into mobileViews partitions 0-7. Block 0 covers records 0-1000 and block 1 records 1000-2000, so lines/lower partitions 0 and 1 cover those same ranges; after filtering, mobileViews partition 0 has 670 elements and partition 1 has 212 elements. Black lines show dependencies between RDD partitions.]

  • Implementing a sequence of RDD ops efficiently

    var lines = spark.textFile("hdfs://431log.txt");
    var lower = lines.map(_.toLower());
    var mobileViews = lower.filter(x => isMobileClient(x));
    var howMany = mobileViews.count();

    Recall the "loop fusion" example from the opening slides of the lecture.

    The following code stores only one line of the log file in memory at a time, and only reads input data from disk once (a "streaming" solution):

    int count = 0;
    while (!inputFile.eof()) {
        string line = inputFile.readLine();
        string lower = line.toLower;
        if (isMobileClient(lower))
            count++;
    }

  • A simple interface for RDDs

    var lines = spark.textFile("hdfs://431log.txt");
    var lower = lines.map(_.toLower());
    var mobileViews = lower.filter(x => isMobileClient(x));
    var howMany = mobileViews.count();

    // create RDD from text file on disk
    RDD::textFile(string filename) {
        return new RDDFromTextFile(open(filename));
    }

    // create RDD by mapping map_func onto input (parent) RDD
    RDD::map(RDD parent, map_func) {
        return new RDDFromMap(parent, map_func);
    }

    // create RDD by filtering input (parent) RDD
    RDD::filter(RDD parent, filter_func) {
        return new RDDFromFilter(parent, filter_func);
    }

    RDDFromTextFile::next() {
        return inputFile.readLine();
    }

    RDDFromMap::next() {
        var el = parent.next();
        return map_func(el);
    }

    RDDFromFilter::next() {
        while (parent.hasMoreElements()) {
            var el = parent.next();
            if (filter_func(el))
                return el;
        }
    }

    RDD::hasMoreElements() {
        return parent.hasMoreElements();
    }

    // overloaded since no parent exists
    RDDFromTextFile::hasMoreElements() {
        return !inputFile.eof();
    }

    // count action (forces evaluation of RDD)
    RDD::count() {
        int count = 0;
        while (hasMoreElements()) {
            var el = next();
            count++;
        }
        return count;
    }
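
    The same pull-based structure can be sketched with Scala's built-in lazy Iterators (reading a local file here as a stand-in for HDFS; isMobileClient is a stand-in predicate): each stage wraps its parent's iterator, so the final count pulls one element at a time through the whole chain without materializing intermediate collections.

    import scala.io.Source

    def isMobileClient(line: String): Boolean = line.contains("Mobile")   // stand-in

    val lines: Iterator[String]  = Source.fromFile("431log.txt").getLines()
    val lower: Iterator[String]  = lines.map(_.toLowerCase)               // wraps lines
    val mobile: Iterator[String] = lower.filter(isMobileClient)           // wraps lower
    val howMany: Int             = mobile.size   // like count(): pulls elements one by one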

  • Narrow dependencies

    var lines = spark.textFile("hdfs://431log.txt");
    var lower = lines.map(_.toLower());
    var mobileViews = lower.filter(x => isMobileClient(x));
    var howMany = mobileViews.count();

    [Figure: same partition diagram as before; across Nodes 0-3, log.txt blocks 0-7 feed .load() into lines partitions 0-7, then .map() into lower partitions 0-7, then .filter() into mobileViews partitions 0-7.]

    "Narrow dependencies" = each partition of the parent RDD is referenced by at most one child RDD partition

    - Allows for fusing of operations (here: can apply map and then filter all at once on an input element)
    - In this example: no communication between nodes of the cluster (communication of one int at the end to perform the count() reduction)

  • Wide dependencies

    [Figure: RDD_A partitions 0-3 feed .groupByKey() into RDD_B partitions 0-3, with every parent partition contributing to every child partition.]

    ▪ Wide dependencies = each partition of the parent RDD is referenced by multiple child RDD partitions

    ▪ Challenges:
    - Must compute all of RDD_A before computing RDD_B
    - Example: groupByKey() may induce all-to-all communication as shown above
    - May trigger significant recomputation of ancestor lineage upon node failure (I will address resilience in a few slides)

    groupByKey: RDD[(K,V)] → RDD[(K,Seq[V])]
    "Make a new RDD where each element is a sequence containing all values from the parent RDD with the same key."
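
    A short sketch, using the per-user-agent counting example from earlier, of why the choice of operator matters for a wide dependency: both versions below shuffle data, but reduceByKey can combine values within each partition before the shuffle, so far less data crosses the network than with groupByKey followed by a per-key sum.

    var pairs = spark.textFile("hdfs://431log.txt")
                     .filter(x => isMobileClient(x))
                     .map(x => (parseUserAgent(x), 1));

    // Shuffles every (key, 1) pair, then sums each key's sequence of values.
    var viaGroup = pairs.groupByKey()
                        .mapValues(vs => vs.sum);

    // Shuffles only one partial sum per key per partition.
    var viaReduce = pairs.reduceByKey((a, b) => a + b);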

  • Cost of operations depends on partitioning

    join: RDD[(K,V)], RDD[(K,W)] → RDD[(K,(V,W))]
    Assume data in RDD_A and RDD_B are partitioned by key: hash username to partition id

    Case 1: RDD_A and RDD_B have the same hash partition: join only creates narrow dependencies
    [Figure: each RDD_A partition, e.g. ("Kayvon",1), ("Teguh",23), lines up with the RDD_B partition holding the same keys, e.g. ("Kayvon","fizz"), ("Teguh","buzz"), so each RDD_C partition, e.g. ("Kayvon",(1,"fizz")), ("Teguh",(23,"buzz")), depends on exactly one partition of each parent.]

    Case 2: RDD_A and RDD_B have different hash partitions: join creates wide dependencies
    [Figure: the same keys are spread differently across RDD_B's partitions, e.g. ("Kayvon","fizz") sits with ("Alex","splat"), so producing each RDD_C partition requires data from multiple partitions of RDD_B: an all-to-all shuffle.]

  • partitionBy() transformation

    ▪ Inform Spark how to partition an RDD
    - e.g., HashPartitioner, RangePartitioner

    // create RDDs from filesystem data
    var lines = spark.textFile("hdfs://431log.txt");
    var clientInfo = spark.textFile("hdfs://clientssupported.txt");  // (user agent, "yes"/"no")

    // create RDD using filter() transformation on lines
    var mobileViews = lines.filter(x => isMobileClient(x)).map(x => parseUserAgent(x));

    // HashPartitioner maps keys to integers
    var partitioner = spark.HashPartitioner(100);

    // inform Spark of the partitioning
    // .persist() also instructs Spark to try to keep the dataset in memory
    var mobileViewPartitioned = mobileViews.partitionBy(partitioner)
                                           .persist();
    var clientInfoPartitioned = clientInfo.partitionBy(partitioner)
                                          .persist();

    // join user agents with whether they are supported or not supported
    // Note: this join only creates narrow dependencies due to the explicit partitioning above
    var joined = mobileViewPartitioned.join(clientInfoPartitioned);

    ▪ .persist():
    - Inform Spark this RDD's contents should be retained in memory
    - .persist(RELIABLE) = store contents in durable storage (like a checkpoint)

  • Scheduling Spark computations

    [Figure: a lineage graph over RDD_A through RDD_G, connected by .groupByKey(), .map(), .union(), and .join(), ending in a .save(); the graph is divided into Stage 1, Stage 2, and Stage 3, and shaded partitions (blocks 0-2) mark RDDs already materialized in memory.]

    Actions (e.g., save()) trigger evaluation of the Spark lineage graph.
    Stage 1 computation: do nothing, since the input is already materialized in memory
    Stage 2 computation: evaluate map in a fused manner; only actually materialize RDD_F
    Stage 3 computation: execute join (could stream the operation to disk, do not need to materialize)

  • Implementing resilience via lineage

    // create RDD from filesystem data
    var lines = spark.textFile("hdfs://431log.txt");

    // create RDD using filter() transformation on lines
    var mobileViews = lines.filter((x: String) => isMobileClient(x));

    // 1. create new RDD by filtering only Chrome views
    // 2. for each element, split string and take timestamp of page view (first element)
    // 3. convert RDD to a scalar sequence (collect() action)
    var timestamps = mobileViews.filter(_.contains("Chrome"))
                                .map(_.split(" ")(0));

    [Lineage diagram: 431log.txt →.load(…)→ lines →.filter(...)→ mobileViews →.filter(...)→ Chrome views →.map(_.split(" ")(0))→ timestamps.]

    ▪ RDD transformations are bulk, deterministic, and functional
    - Implication: runtime can always reconstruct the contents of an RDD from its lineage (the sequence of transformations used to create it)
    - Lineage is a log of transformations
    - Efficient: since the log records bulk data-parallel operations, the overhead of logging is low (compared to logging fine-grained operations, as in a database)
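
    A minimal sketch of the idea (hypothetical helper types, not Spark's internals): if the runtime records, for each RDD, the source file plus the ordered list of transformations applied, it can rebuild any partition by re-reading that partition's block from the replicated file system and replaying the transformations.

    // Lineage for the `timestamps` RDD: a source path plus an ordered list of
    // bulk transformations over that partition's lines.
    case class Lineage(sourcePath: String,
                       ops: List[Iterator[String] => Iterator[String]])

    def isMobileClient(line: String): Boolean = line.contains("Mobile")   // stand-in

    val timestampsLineage = Lineage(
      "hdfs://431log.txt",
      List(
        it => it.filter(isMobileClient),          // -> mobileViews
        it => it.filter(_.contains("Chrome")),    // -> Chrome views
        it => it.map(_.split(" ")(0))             // -> timestamps
      ))

    // Recover one lost partition: re-read its block (readBlock is a hypothetical
    // helper backed by an HDFS replica) and replay the lineage over it.
    def recoverPartition(l: Lineage, blockId: Int,
                         readBlock: (String, Int) => Iterator[String]): Seq[String] =
      l.ops.foldLeft(readBlock(l.sourcePath, blockId))((data, op) => op(data)).toSeq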

  • Upon node failure: recompute lost RDD partitions from lineage

    var lines = spark.textFile("hdfs://431log.txt");
    var mobileViews = lines.filter((x: String) => isMobileClient(x));
    var timestamps = mobileViews.filter(_.contains("Chrome"))
                                .map(_.split(" ")(0));

    [Figure: the four-node cluster, with log.txt blocks 0-7 on disk and partitions 0-7 of mobileViews and timestamps in each node's DRAM. One node crashes, losing mobileViews/timestamps partitions 2 and 3.]

    [Lineage diagram: 431log.txt →.load(…)→ lines →.filter(...)→ mobileViews →.filter(...)→ Chrome views →.map(_.split(" ")(0))→ timestamps.]

    Must reload the required subset of data from disk and recompute the entire sequence of operations given by the lineage to regenerate partitions 2 and 3 of RDD timestamps.

    Note (not shown): recall that file system data is replicated, so assume blocks 2 and 3 remain accessible to all nodes

  • Upon node failure: recompute lost RDD partitions from lineage

    var lines = spark.textFile("hdfs://431log.txt");
    var mobileViews = lines.filter((x: String) => isMobileClient(x));
    var timestamps = mobileViews.filter(_.contains("Chrome"))
                                .map(_.split(" ")(0));

    [Figure: same as the previous slide, but after recovery; timestamps partitions 2 and 3 have been recomputed elsewhere in the cluster.]

    Must reload the required subset of data from disk and recompute the entire sequence of operations given by the lineage to regenerate partitions 2 and 3 of RDD timestamps.

    Note (not shown): file system data is replicated, so assume blocks 2 and 3 remain accessible to all nodes

  • Spark performance

    [Bar chart: iteration time (s) for Hadoop, HadoopBM, and Spark on Logistic Regression and K-Means, split into first iteration vs. later iterations; Spark's later iterations are far faster than both Hadoop variants. 100 GB of data on a 100-node cluster.]

    HadoopBM = Hadoop Binary In-Memory (convert text input to binary, store in an in-memory version of HDFS)

    Q. Wait, the baseline parses text input in each iteration of an iterative algorithm?
    A. Yes. Sigh…

    Anything else puzzling here?

    HadoopBM's first iteration is slow because it runs an extra Hadoop job to copy the binary form of the input data to in-memory HDFS

    Accessing data from HDFS, even if in memory, has high overhead:
    - Multiple memcopies in the file system + a checksum

  • Caution: "scale out" is not the entire story

    ▪ Distributed systems designed for cloud execution address many difficult challenges, and have been instrumental in the explosion of "big-data" computing at large scale:
    - Scale-out parallelism to many machines
    - Resiliency in the face of failures
    - Complexity of managing clusters of machines

    But "scale out" is not the whole story. At the end of the day you want good performance.

    20 iterations of PageRank (iterative graph algorithm)

    Datasets:
    name    twitter rv [11]   uk-2007-05 [4]
    nodes   41,652,230        105,896,555
    edges   1,468,365,182     3,738,733,648
    size    5.76 GB           14.72 GB

    scalable system       cores   twitter   uk-2007-05
    GraphChi [10]         2       3160s     6972s
    Stratosphere [6]      16      2250s     -
    X-Stream [17]         16      1488s     -
    Spark [8]             128     857s      1759s
    Giraph [8]            128     596s      1235s
    GraphLab [8]          128     249s      833s
    GraphX [8]            128     419s      462s
    Single thread (SSD)   1       300s      651s
    Single thread (RAM)   1       275s      -

    Further optimization of the single-threaded baseline brought its time down to 110s:
    Vertex order (SSD)    1       300s      651s
    Vertex order (RAM)    1       275s      -
    Hilbert order (SSD)   1       242s      256s
    Hilbert order (RAM)   1       110s      -

    ["Scalability! At what COST?" McSherry et al. HotOS 2015]

  • Caution: "scale out" is not the entire story

    Label propagation [McSherry et al. HotOS 2015]:
    scalable system       cores   twitter   uk-2007-05
    Stratosphere [6]      16      950s      -
    X-Stream [17]         16      1159s     -
    Spark [8]             128     1784s     ≤ 8000s
    Giraph [8]            128     200s      ≤ 8000s
    GraphLab [8]          128     242s      714s
    GraphX [8]            128     251s      800s
    Single thread (SSD)   1       153s      417s

    PageRank, BIDData Suite (1 GPU-accelerated node) [Canny and Zhao, KDD 13]:
    System        Graph (V x E)   Time (s)   Gflops   Procs
    Hadoop        ? x 1.1B        198        0.015    50x8
    Spark         40M x 1.5B      97.4       0.03     50x2
    Twister       50M x 1.4B      36         0.09     60x4
    PowerGraph    40M x 1.4B      3.6        0.8      64x8
    BIDMat        60M x 1.4B      6          0.5      1x8
    BIDMat+disk   60M x 1.4B      24         0.16     1x8

    Latent Dirichlet Allocation (LDA):
    System        Docs/hr   Gflops   Procs
    Smola [15]    1.6M      0.5      100x8
    PowerGraph    1.1M      0.3      64x16
    BIDMach       3.6M      30       1x8x1

    From McSherry 2015:
    "The published work on big data systems has fetishized scalability as the most important feature of a distributed data processing platform. While nearly all such publications detail their system's impressive scalability, few directly evaluate their absolute performance against reasonable benchmarks. To what degree are these systems truly improving performance, as opposed to parallelizing overheads that they themselves introduce?"

    COST = "Configuration that Outperforms a Single Thread"
    Perhaps surprisingly, many published systems have unbounded COST (i.e., no configuration outperforms the best single-threaded implementation) for all of the problems to which they have been applied.

  • Performance improvements to Spark

    ▪ With increasing DRAM sizes and faster persistent storage (SSD), there is interest in improving the CPU utilization of Spark applications
    - Goal: reduce "COST"

    ▪ Efforts looking at adding efficient code generation to the Spark ecosystem (e.g., generate SIMD kernels, target accelerators like GPUs, etc.) to close the gap on single-node performance
    - RDD storage layouts must change to enable high-performance SIMD processing (e.g., struct of arrays instead of array of structs)
    - See Spark's Project Tungsten, Weld [Palkar CIDR '17], IBM's SparkGPU

    ▪ High-performance computing ideas are influencing the design of future performance-oriented distributed systems
    - Conversely: the scientific computing community has a lot to learn from the distributed computing community about elasticity and utility computing

  • Spark summary

    ▪ Opaque sequence abstraction (RDD) to encapsulate intermediates of cluster computations (previously, frameworks like Hadoop/MapReduce stored intermediates in the file system)
    - Observation: "files are a poor abstraction for intermediate variables in large-scale data-parallel programs"
    - RDDs are read-only, and created by deterministic data-parallel operators
    - Lineage tracked and used for locality-aware scheduling and fault tolerance (allows recomputation of partitions of an RDD on failure, rather than restore from a checkpoint*)
    - Bulk operations -> overhead of lineage tracking (logging) is low

    ▪ Simple, versatile abstraction upon which many domain-specific distributed computing frameworks are being implemented.
    - See Apache Spark project: spark.apache.org

    *Note that .persist(RELIABLE) allows the programmer to request checkpointing in long-lineage situations.

  • Modern Spark ecosystem

    - Interleave computation and database query (Spark SQL); can apply transformations to RDDs produced by SQL queries
    - Machine learning library (MLlib) built on top of Spark abstractions
    - GraphLab-like library (GraphX) built on top of Spark abstractions

    Compelling feature: enables integration/composition of multiple domain-specific frameworks (since all collections are implemented under the hood with RDDs and scheduled using the Spark scheduler)