TRANSCRIPT
-
Distributed Computing using Spark
(leveraging data-parallel program structure)
-
Review: which program performs better?

void add(int n, float* A, float* B, float* C) {
    for (int i = 0; i < n; i++)
        C[i] = A[i] + B[i];
}
-
The previous example involved globally restructuring the order of computation to improve producer-consumer locality (improve the arithmetic intensity of the program).
-
Parallel computers in class so far

Many cores connected to a single shared memory system.

Image credit: https://wccftech.com/intel-cascade-lake-advanced-performance-48-core-xeon-cpus-announced/
-
Warehouse scale computing
-
Scale-out cluster computing
▪ Inexpensive way to realize a high-core-count, high-memory (in aggregate) computer
- Made from (somewhat*) commodity Linux servers (commodity processors, networking, and storage)
- Private per-server address space
- Relatively low bandwidth connectivity between servers

* Cloud vendors like AWS, Google, MS Azure, and Facebook make significant customizations in their datacenters.
-
Typical commodity server
- 2x CPU (8-64 cores), with ~50 GB/s bandwidth to DRAM (64-512 GB)
- 10x disks (10-30 TB, ~50 MB/s each)
- Network, via a top-of-rack switch: 1-4 GB/s to nodes in the same rack, 0.1-4 GB/s to nodes in other racks
- ~40 1RU servers per rack
-
Why write an application for a cluster?
▪ Motivating problem:
- Consider processing 100 TB of data
- On one node with one disk, scanning at 50 MB/s: ~23 days (100 TB / 50 MB/s ≈ 2 × 10^6 seconds)
- On 1000 nodes, each scanning at 50 MB/s: ~33 minutes!
▪ Challenge: it can be hard to effectively utilize 1000 nodes
- Need to program 1000 x cores_per_node total cores
- Have to worry about machine failures
- Or machines that are faster or slower than others
- It would be nice to have parallel programming frameworks that make it easier to utilize resources at this scale!*

* We've already seen programming languages/frameworks to help us with SIMD, multi-core, and GPU-based programming.
-
Today I need you to assume cluster storage systems exist
▪ If nodes can fail, how do we store data persistently?
▪ Modern solution: distributed storage systems
- Provide a global namespace for files
- Examples: Google GFS, Hadoop HDFS, Amazon S3
▪ Typical usage patterns
- Huge files (100s of GB to TBs)
- Data is rarely updated in place
- Reads and appends are common (e.g., log files)
-
Example: Hadoop distributed FS (HDFS)
• Global namespace
• Files split into ~200 MB blocks
• Each block replicated on multiple DataNodes
• Intelligent client: finds locations of blocks by asking the NameNode, then requests data directly from DataNodes

[Figure: the NameNode tracks block locations (File1 -> Block1, Block2; File2 -> Block3, Block4; ...). The client issues "get block locations" requests to the NameNode and reads/writes data blocks directly from/to DataNodes; each block is replicated across multiple DataNodes.]
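To make the client's role concrete, here is a sketch of the read path in Scala. This is a hypothetical API for illustration, not Hadoop's actual interface; NameNode, DataNode, getBlockLocations, and readBlock are all assumed names.

// Hypothetical types for illustration only.
case class DataNode(id: Int) { def readBlock(blockId: Int): Array[Byte] = ??? }
class NameNode { def getBlockLocations(path: String): Seq[(Int, Seq[DataNode])] = ??? }

// The client asks the NameNode only for metadata (block -> replica locations),
// then fetches the bytes directly from DataNodes; data never flows through the NameNode.
def readFile(nameNode: NameNode, path: String): Seq[Array[Byte]] =
  nameNode.getBlockLocations(path).map { case (blockId, replicas) =>
    replicas.head.readBlock(blockId)   // any replica works; pick the closest in practice
  }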
-
Let's say the website gets very popular…
-
The log of page views gets quite large…

Assume log.txt is a large file, stored in a distributed file system like HDFS.
Below: a cluster of four nodes, each node with a 1 TB disk.
Contents of log.txt are distributed evenly in blocks across the cluster:

[Figure: Node 0 (CPU, DRAM, 1 TB disk): log.txt blocks 0-1; Node 1: blocks 2-3; Node 2: blocks 4-5; Node 3: blocks 6-7]
-
Imagine your professors want to know a bit more about the students visiting the 431 website…

For example:
"What type of mobile phone are all these students using?"
"When did they first download the handout for the homework assignment due tomorrow?"
-
Consider a simple programming model

// this function is called once per line in the input file by the runtime
// input: string (line of input file)
// output: adds (user_agent, 1) entry to list
void mapper(string line, multimap<string,int>& results) {
    string user_agent = parse_requester_user_agent(line);
    if (is_mobile_client(user_agent))
        results.add(user_agent, 1);
}

// this function is called once per unique key (user_agent) in results
// values is a list of values associated with the given key
void reducer(string key, list<int> values, int& result) {
    int sum = 0;
    for (v in values)
        sum += v;
    result = sum;
}

// iterator over lines of text file
LineByLineReader input("hdfs://431log.txt");

// stores output
Writer output("hdfs://...");

// do stuff
runMapReduceJob(mapper, reducer, input, output);

(The code above computes the count of page views by each type of mobile phone.)
-
Let's design an implementation of runMapReduceJob
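Before designing the distributed version, it may help to pin down what runMapReduceJob computes at all. Below is a minimal single-machine sketch in Scala (my own sketch, not from the slides): no distribution, no fault tolerance, just the map phase, a group-by-key, and the reduce phase.

import scala.collection.mutable.ListBuffer

def runMapReduceJob[K, V, R](
    mapper: (String, ListBuffer[(K, V)]) => Unit,  // emits (key, value) pairs per input line
    reducer: (K, Seq[V]) => R,                     // folds all values for one key
    lines: Iterator[String]): Map[K, R] = {
  val intermediate = ListBuffer[(K, V)]()
  lines.foreach(line => mapper(line, intermediate))  // map phase: once per line
  intermediate
    .groupBy(_._1)                                   // gather all values per unique key
    .map { case (k, kvs) => k -> reducer(k, kvs.map(_._2).toSeq) }  // reduce phase: once per key
}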
-
Step 1: running the mapper function

[Figure: the same four-node cluster; log.txt blocks 0-7 reside on the nodes' local disks (blocks 0-1 on Node 0, 2-3 on Node 1, 4-5 on Node 2, 6-7 on Node 3). The mapper/reducer code and the runMapReduceJob call from the previous slide are shown alongside.]

Step 1: run the mapper function on all lines of the file.
Question: How to assign work to nodes?
- Idea 1: use a work queue for the list of input blocks to process (block 0, block 1, block 2, ...). Dynamic assignment: a free node takes the next available block. (Sketched below.)
- Idea 2: data-distribution-based assignment: each node processes the lines in blocks of the input file that are stored locally.
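A sketch of Idea 1 in Scala (my own toy version; processBlock stands in for running the mapper over one block):

import java.util.concurrent.ConcurrentLinkedQueue

val numBlocks = 8
val workQueue = new ConcurrentLinkedQueue[Integer]()
(0 until numBlocks).foreach(b => workQueue.add(Int.box(b)))

def processBlock(b: Int): Unit = println(s"running mapper over block $b")

// Each worker (thread or node) runs this loop; a free worker takes the next block.
def workerLoop(): Unit = {
  var block = workQueue.poll()        // returns null once the queue is empty
  while (block != null) {
    processBlock(block)
    block = workQueue.poll()
  }
}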
-
Steps 2 and 3: gathering data, running the reducer

[Figure: after the map phase, each node holds intermediate per-key value lists produced by its local mapper invocations. Keys to reduce (generated by the mappers): "Safari iOS" (value lists on Nodes 0-3), "Chrome" (value lists on Nodes 0-3), "Safari iWatch" (Node 3 only), "Chrome Glass" (Node 0 only).]

Step 2: prepare intermediate data for the reducer.
Step 3: run the reducer function on all keys.
Question: how to assign reducer tasks?
Question: how to get all data for a key onto the correct worker node?
-
Steps 2 and 3: gathering data, running the reducer

// gather all input data for key, then execute reducer
// to produce final result
void runReducer(string key, reducer, result) {
    list inputs;
    for (n in nodes) {
        filename = get_filename(key, n);
        read lines of filename, append into inputs;
    }
    reducer(key, inputs, result);
}

Step 2: prepare intermediate data for the reducer.
Step 3: run the reducer function on all keys.
Question: how to assign reducer tasks?
Question: how to get all data for a key onto the correct worker node?
Example: assign "Safari iOS" to Node 0.

[Figure: the four-node cluster again, with the same per-key value lists as on the previous slide. Node 0 gathers every node's "Safari iOS" values and runs the reducer on them.]
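One standard answer to both questions (consistent with, though not stated on, the slide) is hash partitioning of keys: every mapper and the scheduler agree on the same key-to-node mapping, so all values for a key land on the node that will reduce it. A sketch:

// Deterministic key -> reducer-node assignment.
def reducerNodeFor(key: String, numReduceNodes: Int): Int =
  Math.floorMod(key.hashCode, numReduceNodes)

// e.g., every mapper appends its ("Safari iOS", 1) entries to the file destined
// for node reducerNodeFor("Safari iOS", 4), so that node can reduce the key locally.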
-
Additional implementation challenges at scale

[Figure: a 1000-node cluster (Node 0 ... Node 999), each node with a CPU and blocks of log.txt on its local disk]

Nodes may fail during program execution.
Some nodes may run slower than others (due to different amounts of work, heterogeneity in the cluster, etc.).
-
Job scheduler responsibilities
▪ Exploit data locality: "move computation to the data"
- Run mapper jobs on nodes that contain input files
- Run reducer jobs on nodes that already have most of the data for a certain key
▪ Handling node failures
- Scheduler detects job failures and reruns the job on new machines
- This is possible since inputs reside in persistent storage (distributed file system)
- Scheduler duplicates jobs on multiple machines (reduces overall processing latency incurred by node failures)
▪ Handling slow machines
- Scheduler duplicates jobs on multiple machines
-
runMapReduceJob problems?
▪ Permits only a very simple program structure
- Programs must be structured as: map, followed by reduce by key
- See DryadLINQ for generalization to DAGs
▪ Iterative algorithms must load from disk each iteration
- No primitives for sharing data directly between jobs

[Figure: an iterative job: the input is read from HDFS, iteration 1 writes its result to HDFS, iteration 2 reads that result back from HDFS and writes again, and so on — every iteration round-trips through HDFS.]
-
Spark: in-memory, fault-tolerant distributed computing
http://spark.apache.org/
[Zaharia et al. NSDI 2012]
-
Goals
▪ Programming model for cluster-scale computations where there is significant reuse of intermediate datasets
- Iterative machine learning and graph algorithms
- Interactive data mining: load a large dataset into the aggregate memory of the cluster and then perform multiple ad-hoc queries
▪ Don't want to incur the inefficiency of writing intermediates to the persistent distributed file system (want to keep them in memory)
- Challenge: efficiently implementing fault tolerance for large-scale distributed in-memory computations.
-
Fault tolerance for in-memory calculations
▪ Replicate all computations
- Expensive solution: decreases peak throughput
▪ Checkpoint and rollback
- Periodically save state of program to persistent storage
- Restart from last checkpoint on node failure
▪ Maintain log of updates (commands and data)
- High overhead for maintaining logs

Recall map-reduce solutions:
- Checkpoints after each map/reduce step by writing results to the file system
- Scheduler's list of outstanding (but not yet complete) jobs is a log
- Functional structure of programs allows for restart at the granularity of a single mapper or reducer invocation (don't have to restart the entire program)
-
Resilient distributed dataset (RDD)
Spark's key programming abstraction:
- Read-only ordered collection of records (immutable)
- RDDs can only be created by deterministic transformations on data in persistent storage or on existing RDDs
- Actions on RDDs return data to the application

// create RDD from file system data
var lines = spark.textFile("hdfs://431log.txt");

// create RDD using filter() transformation on lines
var mobileViews = lines.filter((x: String) => isMobileClient(x));

// another filter() transformation
var safariViews = mobileViews.filter((x: String) => x.contains("Safari"));

// then count number of elements in RDD via count() action
var numViews = safariViews.count();

[Figure: lineage of RDDs: 431log.txt -> .textFile(…) -> lines -> .filter(...) -> mobileViews -> .filter(...) -> safariViews -> .count() -> numViews (an int)]
-
Repeating the map-reduce example

// 1. create RDD from file system data
// 2. create RDD with only lines from mobile clients
// 3. create RDD with elements of type (String, Int) from line string
// 4. group elements by key
// 5. call provided reduction function on all keys to count views
var perAgentCounts = spark.textFile("hdfs://431log.txt")
                          .filter(x => isMobileClient(x))
                          .map(x => (parseUserAgent(x), 1))
                          .reduceByKey((x, y) => x + y)
                          .collect();

[Figure: lineage: 431log.txt -> .textFile(…) -> lines -> .filter(isMobileClient(…)) -> .map(parseUserAgent(…)) -> .reduceByKey(…) -> .collect() -> perAgentCounts (an Array of (String, Int))]

"Lineage": the sequence of RDD operations needed to compute the output
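An aside (not on the slide): reduceByKey can combine values per key on each node before any communication, while the equivalent computation via groupByKey ships every individual (key, 1) pair across the network first. The less efficient equivalent, for contrast:

var perAgentCounts2 = spark.textFile("hdfs://431log.txt")
                           .filter(x => isMobileClient(x))
                           .map(x => (parseUserAgent(x), 1))
                           .groupByKey()                 // moves every (key, 1) pair
                           .mapValues(vals => vals.sum)  // then sums per key
                           .collect();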
-
Another Spark program

// create RDD from file system data
var lines = spark.textFile("hdfs://431log.txt");

// create RDD using filter() transformation on lines
var mobileViews = lines.filter((x: String) => isMobileClient(x));

// instruct Spark runtime to try to keep mobileViews in memory
mobileViews.persist();

// create a new RDD by filtering mobileViews
// then count number of elements in new RDD via count() action
var numViews = mobileViews.filter(_.contains("Safari")).count();

// 1. create new RDD by filtering only Chrome views
// 2. for each element, split string and take timestamp of page view
// 3. convert RDD to a scalar sequence (collect() action)
var timestamps = mobileViews.filter(_.contains("Chrome"))
                            .map(_.split(" ")(0))
                            .collect();

[Figure: lineage: 431log.txt -> .textFile(…) -> lines -> .filter(isMobileClient(…)) -> mobileViews (persisted); mobileViews -> .filter(contains("Safari")) -> .count() -> numViews; mobileViews -> .filter(contains("Chrome")) -> .map(split(…)) -> .collect() -> timestamps]
-
RDD transformations and actions

Transformations (data-parallel operators taking an input RDD to a new RDD):
map(f: T => U)                 : RDD[T] => RDD[U]
filter(f: T => Bool)           : RDD[T] => RDD[T]
flatMap(f: T => Seq[U])        : RDD[T] => RDD[U]
sample(fraction: Float)        : RDD[T] => RDD[T] (deterministic sampling)
groupByKey()                   : RDD[(K, V)] => RDD[(K, Seq[V])]
reduceByKey(f: (V, V) => V)    : RDD[(K, V)] => RDD[(K, V)]
union()                        : (RDD[T], RDD[T]) => RDD[T]
join()                         : (RDD[(K, V)], RDD[(K, W)]) => RDD[(K, (V, W))]
cogroup()                      : (RDD[(K, V)], RDD[(K, W)]) => RDD[(K, (Seq[V], Seq[W]))]
crossProduct()                 : (RDD[T], RDD[U]) => RDD[(T, U)]
mapValues(f: V => W)           : RDD[(K, V)] => RDD[(K, W)] (preserves partitioning)
sort(c: Comparator[K])         : RDD[(K, V)] => RDD[(K, V)]
partitionBy(p: Partitioner[K]) : RDD[(K, V)] => RDD[(K, V)]

Actions (provide data back to the "host" application):
count()                        : RDD[T] => Long
collect()                      : RDD[T] => Seq[T]
reduce(f: (T, T) => T)         : RDD[T] => T
lookup(k: K)                   : RDD[(K, V)] => Seq[V] (on hash/range partitioned RDDs)
save(path: String)             : outputs RDD to a storage system, e.g., HDFS
-
How do we implement RDDs? In particular, how should they be stored?

var lines = spark.textFile("hdfs://431log.txt");
var lower = lines.map(_.toLower());
var mobileViews = lower.filter(x => isMobileClient(x));
var howMany = mobileViews.count();

[Figure: four nodes (Node 0-3), each with a CPU, DRAM, and two log.txt blocks on disk]

Question: should we think of RDDs like arrays?
-
How do we implement RDDs? In particular, how should they be stored?

var lines = spark.textFile("hdfs://431log.txt");
var lower = lines.map(_.toLower());
var mobileViews = lower.filter(x => isMobileClient(x));
var howMany = mobileViews.count();

[Figure: if every RDD were fully materialized, each of Nodes 0-3 would hold in DRAM two partitions each of lines, lower, and mobileViews — one per local log.txt block (partitions 0-7 across the cluster).]

The in-memory representation would be huge! (larger than the original file on disk)
-
RDD partitioning and dependencies

var lines = spark.textFile("hdfs://431log.txt");
var lower = lines.map(_.toLower());
var mobileViews = lower.filter(x => isMobileClient(x));
var howMany = mobileViews.count();

[Figure: Nodes 0-3 hold log.txt blocks 0-7. .load() yields lines parts 0-7 (e.g., part 0 covers range 0-1000 of block 0, part 1 covers range 1000-2000), .map() yields lower parts 0-7 with the same extents, and .filter() yields mobileViews parts 0-7 (e.g., part 0 has 670 elements, part 1 has 212). Black lines show dependencies between RDD partitions: each partition depends only on the corresponding partition of its parent.]
-
Implementing a sequence of RDD ops efficiently

var lines = spark.textFile("hdfs://431log.txt");
var lower = lines.map(_.toLower());
var mobileViews = lower.filter(x => isMobileClient(x));
var howMany = mobileViews.count();

Recall the "loop fusion" example from the opening slides of the lecture.

The following code stores only one line of the log file in memory at a time, and reads input data from disk only once (a "streaming" solution):

int count = 0;
while (!inputFile.eof()) {
    string line = inputFile.readLine();
    string lower = line.toLower();
    if (isMobileClient(lower))
        count++;
}
-
A simple interface for RDDs

var lines = spark.textFile("hdfs://431log.txt");
var lower = lines.map(_.toLower());
var mobileViews = lower.filter(x => isMobileClient(x));
var howMany = mobileViews.count();

// create RDD from text file on disk
RDD::textFile(string filename) {
    return new RDDFromTextFile(open(filename));
}

// create RDD by mapping map_func onto input (parent) RDD
RDD::map(RDD parent, map_func) {
    return new RDDFromMap(parent, map_func);
}

// create RDD by filtering input (parent) RDD
RDD::filter(RDD parent, filter_func) {
    return new RDDFromFilter(parent, filter_func);
}

RDDFromTextFile::next() {
    return inputFile.readLine();
}

RDDFromMap::next() {
    var el = parent.next();
    return map_func(el);
}

RDDFromFilter::next() {
    while (parent.hasMoreElements()) {
        var el = parent.next();
        if (filter_func(el))
            return el;
    }
}

RDD::hasMoreElements() {
    return parent.hasMoreElements();
}

// overloaded since no parent exists
RDDFromTextFile::hasMoreElements() {
    return !inputFile.eof();
}

// count action (forces evaluation of RDD)
RDD::count() {
    int count = 0;
    while (hasMoreElements()) {
        var el = next();
        count++;
    }
    return count;
}
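The same streaming structure falls out of Scala's lazy iterators; a minimal sketch (not Spark's implementation; assumes a local file and a stand-in isMobileClient predicate):

def isMobileClient(line: String): Boolean = line.contains("mobile")  // stand-in predicate

// Each operation wraps its parent's iterator; nothing executes until an
// "action" (here, size) pulls elements through the whole chain once.
val mobileViews = scala.io.Source.fromFile("431log.txt").getLines()
  .map(_.toLowerCase)          // lazy: no work performed yet
  .filter(isMobileClient)      // still lazy

val howMany = mobileViews.size // forces evaluation, one element at a time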
-
Narrow dependencies

var lines = spark.textFile("hdfs://431log.txt");
var lower = lines.map(_.toLower());
var mobileViews = lower.filter(x => isMobileClient(x));
var howMany = mobileViews.count();

[Figure: the same partition diagram as before: .load() -> lines parts 0-7, .map() -> lower parts 0-7, .filter() -> mobileViews parts 0-7, with one-to-one dependencies across Nodes 0-3.]

"Narrow dependencies" = each partition of the parent RDD is referenced by at most one child RDD partition
- Allows for fusing of operations (here: can apply map and then filter all at once on each input element)
- In this example: no communication between nodes of the cluster (communication of one int at the end to perform the count() reduction)
-
Wide dependencies

[Figure: .groupByKey() from RDD_A (parts 0-3) to RDD_B (parts 0-3): every RDD_B partition depends on every RDD_A partition]

▪ Wide dependencies = each partition of the parent RDD is referenced by multiple child RDD partitions
▪ Challenges:
- Must compute all of RDD_A before computing RDD_B
- Example: groupByKey() may induce all-to-all communication as shown above
- May trigger significant recomputation of ancestor lineage upon node failure (I will address resilience in a few slides)

groupByKey: RDD[(K, V)] → RDD[(K, Seq[V])]
"Make a new RDD where each element is a sequence containing all values from the parent RDD with the same key."
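For concreteness, a tiny groupByKey example using Spark's Scala API on an in-memory dataset (sc is a SparkContext; my own example, not from the slide):

var views = sc.parallelize(Seq(("Safari iOS", 1), ("Chrome", 1), ("Safari iOS", 1)));
var grouped = views.groupByKey();  // RDD[(String, Iterable[Int])]
grouped.collect();
// => Array(("Safari iOS", [1, 1]), ("Chrome", [1]))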
-
Cost of operations depends on partitioning

join: RDD[(K, V)], RDD[(K, W)] → RDD[(K, (V, W))]
Assume data in RDD_A and RDD_B are partitioned by key: hash username to partition id.

Case 1: RDD_A and RDD_B have the same hash partition: join only creates narrow dependencies.
[Figure: RDD_A parts 0-3 hold ("Kayvon",1),("Teguh",23) / ("Randy",1024),("Ravi",32) / ("Alex",50),("Riya",9) / ("Tao",10),("Junhong",100); RDD_B parts 0-3 hold the matching ("Kayvon","fizz"),("Teguh","buzz") / ("Randy","wham"),("Ravi","pow") / ("Alex","splat"),("Riya","pop") / ("Tao","slap"),("Junhong","bam"). Each RDD_C partition, e.g., ("Kayvon",(1,"fizz")),("Teguh",(23,"buzz")), is computed from exactly one partition of each parent.]

Case 2: RDD_A and RDD_B have different hash partitions: join creates wide dependencies.
[Figure: the same pairs are scattered differently across RDD_B's partitions (e.g., part 0 holds ("Kayvon","fizz"),("Alex","splat")), so producing each RDD_C partition requires data from multiple partitions of both parents.]
-
partitionBy() transformation
▪ Inform Spark of how to partition an RDD
- e.g., HashPartitioner, RangePartitioner

// create RDDs from file system data
var lines = spark.textFile("hdfs://431log.txt");
var clientInfo = spark.textFile("hdfs://clientssupported.txt"); // (useragent, "yes"/"no")

// create RDD using filter() transformation on lines
var mobileViews = lines.filter(x => isMobileClient(x)).map(x => parseUserAgent(x));

// HashPartitioner maps keys to integers
var partitioner = spark.HashPartitioner(100);

// inform Spark of the partitioning
// .persist() also instructs Spark to try to keep the dataset in memory
var mobileViewsPartitioned = mobileViews.partitionBy(partitioner)
                                        .persist();
var clientInfoPartitioned = clientInfo.partitionBy(partitioner)
                                      .persist();

// join user agents with whether they are supported or not supported
// Note: this join only creates narrow dependencies due to the explicit partitioning above
var joined = mobileViewsPartitioned.join(clientInfoPartitioned);

▪ .persist():
- Inform Spark this RDD's contents should be retained in memory
- .persist(RELIABLE) = store contents in durable storage (like a checkpoint)
-
Scheduling Spark computations

[Figure: a lineage DAG. RDD_A (parts 0-2) -> .groupByKey() -> RDD_C (parts 0-1); RDD_B (parts 0-2) -> .map() -> RDD_D (parts 0-1); RDD_C and RDD_D -> .union() -> RDD_E (parts 0-1); RDD_E and RDD_F (parts 0-3) -> .join() -> RDD_G (parts 0-2) -> .save(). Shaded partitions are RDDs already materialized in memory; the DAG is divided into Stage 1 and Stage 2 computations.]

Actions (e.g., save()) trigger evaluation of the Spark lineage graph.
- Stage 1 computation: do nothing, since the input is already materialized in memory
- Stage 2 computation: evaluate the map in a fused manner; only actually materialize RDD_F
- Stage 3 computation: execute the join (could stream the operation to disk; do not need to materialize RDD_G)
-
Implementing resilience via lineage

// create RDD from file system data
var lines = spark.textFile("hdfs://431log.txt");

// create RDD using filter() transformation on lines
var mobileViews = lines.filter((x: String) => isMobileClient(x));

// 1. create new RDD by filtering only Chrome views
// 2. for each element, split string and take timestamp of page view (first element)
// 3. convert RDD to a scalar sequence (collect() action)
var timestamps = mobileViews.filter(_.contains("Chrome"))
                            .map(_.split(" ")(0));

[Figure: lineage: 431log.txt -> .load(…) -> lines -> .filter(...) -> mobileViews -> .filter(...) -> Chrome views -> .map(_.split(" ")(0)) -> timestamps]

▪ RDD transformations are bulk, deterministic, and functional
- Implication: the runtime can always reconstruct the contents of an RDD from its lineage (the sequence of transformations used to create it)
- Lineage is a log of transformations
- Efficient: since the log records bulk data-parallel operations, the overhead of logging is low (compared to logging fine-grained operations, like in a database)
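Mechanically, regenerating a lost partition just means replaying the lineage over the corresponding input block. A hedged single-machine sketch (readHdfsBlock and isMobileClient are assumed helpers, not real APIs):

// Recompute partition p of the timestamps RDD from durable input.
def recoverTimestampsPartition(p: Int): Seq[String] =
  readHdfsBlock("hdfs://431log.txt", p)   // reload a surviving replica of input block p
    .filter(isMobileClient)               // replay lineage step 1
    .filter(_.contains("Chrome"))         // replay lineage step 2
    .map(_.split(" ")(0))                 // replay lineage step 3: keep the timestamp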
-
Upon node failure: recompute lost RDD partitions from lineage

var lines = spark.textFile("hdfs://431log.txt");
var mobileViews = lines.filter((x: String) => isMobileClient(x));
var timestamps = mobileViews.filter(_.contains("Chrome"))
                            .map(_.split(" ")(0));

[Figure: lineage: 431log.txt -> .load(…) -> lines -> .filter(...) -> mobileViews -> .filter(...) -> Chrome views -> .map(_.split(" ")(0)) -> timestamps. Nodes 0-3 each hold two log.txt blocks on disk and two partitions each of mobileViews and timestamps in DRAM. The node holding partitions 2 and 3 crashes ("CRASH!").]

Must reload the required subset of data from disk and recompute the entire sequence of operations given by the lineage to regenerate partitions 2 and 3 of RDD timestamps.
Note (not shown): recall that file system data is replicated, so assume blocks 2 and 3 remain accessible to all nodes.
-
Upon node failure: recompute lost RDD partitions from lineage

(Same program and lineage as the previous slide.)

[Figure: after the crash of the node holding partitions 2 and 3, surviving nodes reload log.txt blocks 2 and 3 from their replicas, replay the lineage (.load -> .filter -> .filter -> .map), and regenerate timestamps partitions 2 and 3.]

Must reload the required subset of data from disk and recompute the entire sequence of operations given by the lineage to regenerate partitions 2 and 3 of RDD timestamps.
Note (not shown): file system data is replicated, so assume blocks 2 and 3 remain accessible to all nodes.
-
Spark performance

[Figure: iteration time (s) for Logistic Regression and K-Means — first iteration vs. later iterations — on Hadoop, HadoopBM, and Spark; 100 GB of data on a 100-node cluster. Spark's later iterations are far faster than either Hadoop variant.]

HadoopBM = Hadoop Binary In-Memory (convert text input to binary, store in an in-memory version of HDFS)

Q. Wait, the baseline parses text input in each iteration of an iterative algorithm?
A. Yes. Sigh…

Anything else puzzling here?
- HadoopBM's first iteration is slow because it runs an extra Hadoop job to copy the binary form of the input data to in-memory HDFS
- Accessing data from HDFS, even if in memory, has high overhead: multiple memcpys in the file system + a checksum
-
Caution: "scale out" is not the entire story
▪ Distributed systems designed for cloud execution address many difficult challenges, and have been instrumental in the explosion of "big-data" computing and large-scale analytics:
- Scale-out parallelism to many machines
- Resiliency in the face of failures
- Complexity of managing clusters of machines
▪ But "scale out" is not the whole story. At the end of the day you want good performance.

20 iterations of PageRank (iterative graph algorithm):

name    twitter_rv [11]   uk-2007-05 [4]
nodes   41,652,230        105,896,555
edges   1,468,365,182     3,738,733,648
size    5.76 GB           14.72 GB

scalable system       cores   twitter   uk-2007-05
GraphChi [10]         2       3160s     6972s
Stratosphere [6]      16      2250s     -
X-Stream [17]         16      1488s     -
Spark [8]             128     857s      1759s
Giraph [8]            128     596s      1235s
GraphLab [8]          128     249s      833s
GraphX [8]            128     419s      462s
Single thread (SSD)   1       300s      651s
Single thread (RAM)   1       275s      -

Further optimization of the single-thread baseline brought its time down to 110s:

Vertex order (SSD)    1       300s      651s
Vertex order (RAM)    1       275s      -
Hilbert order (SSD)   1       242s      256s
Hilbert order (RAM)   1       110s      -

["Scalability! At what COST?" McSherry et al. HotOS 2015]
-
Caution: "scale out" is not the entire story

Label propagation, from [McSherry et al. HotOS 2015]:

scalable system       cores   twitter   uk-2007-05
Stratosphere [6]      16      950s      -
X-Stream [17]         16      1159s     -
Spark [8]             128     1784s     ≥ 8000s
Giraph [8]            128     200s      ≥ 8000s
GraphLab [8]          128     242s      714s
GraphX [8]            128     251s      800s
Single thread (SSD)   1       153s      417s

PageRank, BIDData Suite (1 GPU-accelerated node) [Canny and Zhao, KDD 13]:

System        Graph (V x E)   Time (s)   Gflops   Procs
Hadoop        ? x 1.1B        198        0.015    50x8
Spark         40M x 1.5B      97.4       0.03     50x2
Twister       50M x 1.4B      36         0.09     60x4
PowerGraph    40M x 1.4B      3.6        0.8      64x8
BIDMat        60M x 1.4B      6          0.5      1x8
BIDMat+disk   60M x 1.4B      24         0.16     1x8

Latent Dirichlet Allocation (LDA):

System        Docs/hr   Gflops   Procs
Smola [15]    1.6M      0.5      100x8
PowerGraph    1.1M      0.3      64x16
BIDMach       3.6M      30       1x8x1

From McSherry 2015:
"The published work on big data systems has fetishized scalability as the most important feature of a distributed data processing platform. While nearly all such publications detail their system's impressive scalability, few directly evaluate their absolute performance against reasonable benchmarks. To what degree are these systems truly improving performance, as opposed to parallelizing overheads that they themselves introduce?"

COST = "Configuration that Outperforms a Single Thread"
Perhaps surprisingly, many published systems have unbounded COST — i.e., no configuration outperforms the best single-threaded implementation — for all of the problems to which they have been applied.
-
Performance improvements to Spark
▪ With increasing DRAM sizes and faster persistent storage (SSD), there is interest in improving the CPU utilization of Spark applications
- Goal: reduce "COST"
▪ Efforts looking at adding efficient code generation to the Spark ecosystem (e.g., generate SIMD kernels, target accelerators like GPUs, etc.) to close the gap on single-node performance
- RDD storage layouts must change to enable high-performance SIMD processing (e.g., struct of arrays instead of array of structs); see the sketch after this list
- See Spark's Project Tungsten, Weld [Palkar CIDR '17], IBM's Spark GPU
▪ High-performance computing ideas are influencing the design of future performance-oriented distributed systems
- Conversely: the scientific computing community has a lot to learn from the distributed computing community about elasticity and utility computing
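A hedged illustration of the layout point above in Scala (not Spark's actual internal representation):

// Array-of-structs: one object per record; the same field of different
// records is scattered across the heap.
case class View(userAgent: String, timestamp: Long)
val aos = Array(View("Safari", 1L), View("Chrome", 2L))

// Struct-of-arrays (columnar): each field is contiguous in memory,
// so a scan over timestamps is amenable to SIMD execution.
case class Views(userAgents: Array[String], timestamps: Array[Long])
val soa = Views(Array("Safari", "Chrome"), Array(1L, 2L))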
-
Spark summary
▪ Opaque sequence abstraction (RDD) to encapsulate intermediates of cluster computations (previously, frameworks like Hadoop/MapReduce stored intermediates in the file system)
- Observation: "files are a poor abstraction for intermediate variables in large-scale data-parallel programs"
- RDDs are read-only, and created by deterministic data-parallel operators
- Lineage tracked and used for locality-aware scheduling and fault tolerance (allows recomputation of partitions of an RDD on failure, rather than restore from a checkpoint*)
- Bulk operations -> overhead of lineage tracking (logging) is low
▪ Simple, versatile abstraction upon which many domain-specific distributed computing frameworks are being implemented
- See the Apache Spark project: spark.apache.org

* Note that .persist(RELIABLE) allows the programmer to request checkpointing in long-lineage situations.
-
Modern Spark ecosystem
- Spark SQL: interleave computation and database query; can apply transformations to RDDs produced by SQL queries
- MLlib: machine learning library built on top of Spark abstractions
- GraphX: GraphLab-like library built on top of Spark abstractions

Compelling feature: enables integration/composition of multiple domain-specific frameworks (since all collections are implemented under the hood with RDDs and scheduled using the Spark scheduler)