
  • Distributed Computing using Spark
    (leveraging data-parallel program structure)

  • Review: which program performs better?

    void add(int n, float* A, float* B, float* C) {
        for (int i = 0; i < n; i++)
            C[i] = A[i] + B[i];
    }

  • The previous example involved globally restructuring the order of computation to improve producer-consumer locality (improving the arithmetic intensity of the program)

  • Parallel computers in class so far

    Many cores connected to a single shared memory system
    Image credit: https://wccftech.com/intel-cascade-lake-advanced-performance-48-core-xeon-cpus-announced/

  • Warehouse-scale computing

  • Scale-out cluster computing

    ▪ Inexpensive way to realize a high-core-count, high-memory (in aggregate) computer
    - Made from (somewhat*) commodity Linux servers (commodity processors, networking, and storage)
    - Private per-server address space
    - Relatively low-bandwidth connectivity between servers

    *Cloud vendors like AWS, Google, MS Azure, and Facebook make significant customizations in their datacenters.

  • Typical commodity server

    - CPU x2 (8-64 cores), DRAM (64-512 GB), ~50 GB/s memory bandwidth
    - Disks x10 (10-30 TB), ~50 MB/s each
    - Network (top-of-rack switch): 1-4 GB/s to nodes in the same rack, 0.1-4 GB/s to nodes in other racks
    - ~40 1RU servers per rack

  • Why write an application for a cluster?

    ▪ Motivating problem:
    - Consider processing 100 TB of data
    - On one node with one disk, scanning at 50 MB/s: ~23 days (100 TB / 50 MB/s ≈ 2,000,000 seconds)
    - On 1000 nodes, each scanning at 50 MB/s: ~33 minutes!

    ▪ Challenge: it can be hard to effectively utilize 1000 nodes
    - Need to program 1000 x cores_per_node total cores
    - Have to worry about machine failures
    - Or machines that are faster or slower than others
    - It would be nice to have parallel programming frameworks that make it easier to utilize resources at this scale!*

    *We've already seen programming languages/frameworks to help us with SIMD, multi-core, and GPU-based programming.

  • Today I need you to assume cluster storage systems exist

    ▪ If nodes can fail, how do we store data persistently?

    ▪ Modern solution: distributed storage systems
    - Provide a global namespace for files
    - Examples: Google GFS, Hadoop HDFS, Amazon S3

    ▪ Typical usage patterns
    - Huge files (100s of GB to TBs)
    - Data is rarely updated in place
    - Reads and appends are common (e.g., log files)

  • Example: Hadoop distributed FS (HDFS)

    • Global namespace
    • Files split into ~200 MB blocks
    • Each block replicated on multiple DataNodes
    • Intelligent client: finds locations of blocks from the NameNode; requests data directly from DataNodes

    [Figure: a Client first asks the NameNode for block locations (the NameNode tracks File1 -> Block 1, Block 2 and File2 -> Block 3, Block 4, and the DataNodes holding each block), then reads/writes the data blocks directly from the DataNodes, each of which stores a subset of the replicated blocks.]
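
    The client-side read path above can be summarized in a few lines. A minimal, in-memory sketch (the class and method names below are illustrative stand-ins, not the real HDFS client API): first ask the NameNode for the file's block list and replica locations, then read each block directly from one of its DataNodes.

    // Toy sketch of the "intelligent client" flow; NameNode, DataNode, and
    // readFile here are hypothetical stand-ins, not the real HDFS API.
    class NameNode(fileToBlocks: Map[String, Seq[Int]],
                   blockToNodes: Map[Int, Seq[String]]) {
      // Which blocks make up this file, and which DataNodes hold replicas of each?
      def getBlockLocations(file: String): Seq[(Int, Seq[String])] =
        fileToBlocks(file).map(b => (b, blockToNodes(b)))
    }

    class DataNode(blocks: Map[Int, String]) {
      // Serve one block's contents directly, with no NameNode involvement.
      def readBlock(id: Int): String = blocks(id)
    }

    def readFile(file: String, nn: NameNode, dataNodes: Map[String, DataNode]): String =
      nn.getBlockLocations(file)
        .map { case (blockId, replicas) => dataNodes(replicas.head).readBlock(blockId) }
        .mkString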

  • Let's say the website gets very popular…

  • The log of page views gets quite large…

    Assume log.txt is a large file, stored in a distributed file system like HDFS
    Below: cluster of 4 nodes, each node with a 1 TB disk
    Contents of log.txt are distributed evenly in blocks across the cluster

    [Figure: Nodes 0-3, each with a CPU, DRAM, and a 1 TB disk; log.txt blocks 0-1 on Node 0, blocks 2-3 on Node 1, blocks 4-5 on Node 2, blocks 6-7 on Node 3.]

  • Imagine your professors want to know a bit more about the students visiting the 431 website…

    For example:
    "What type of mobile phone are all these students using?"
    "When did they first download the handout for the homework assignment due tomorrow?"

  • Consider a simple programming model

    // this function is called once per line in input file by runtime
    // input: string (line of input file)
    // output: adds (user_agent, 1) entry to list
    void mapper(string line, multimap<string,int>& results) {
        string user_agent = parse_requester_user_agent(line);
        if (is_mobile_client(user_agent))
            results.add(user_agent, 1);
    }

    // this function is called once per unique key (user_agent) in results
    // values is a list of values associated with the given key
    void reducer(string key, list<int> values, int& result) {
        int sum = 0;
        for (v in values)
            sum += v;
        result = sum;
    }

    // iterator over lines of text file
    LineByLineReader input("hdfs://431log.txt");

    // stores output
    Writer output("hdfs://…");

    // do stuff
    runMapReduceJob(mapper, reducer, input, output);

    (The code above computes the count of page views by each type of mobile phone.)
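
    Before designing the distributed implementation (next slides), it may help to pin down what runMapReduceJob should compute. A minimal single-machine sketch of its semantics in Scala (the Scala formulation and this exact signature are assumptions for illustration, not the course's C-style interface):

    // Sequential reference semantics for runMapReduceJob: apply the mapper to every
    // input line, group the emitted (key, value) pairs by key, then apply the
    // reducer once per unique key.
    def runMapReduceJob(mapper: String => Seq[(String, Int)],
                        reducer: (String, Seq[Int]) => Int,
                        input: Iterator[String]): Map[String, Int] =
      input.flatMap(mapper)
           .toSeq
           .groupBy { case (key, _) => key }
           .map { case (key, pairs) => key -> reducer(key, pairs.map(_._2)) }

    // Usage matching the slide's example (parse_requester_user_agent and
    // is_mobile_client are the slide's helpers, assumed available):
    // val counts = runMapReduceJob(
    //   line => { val ua = parse_requester_user_agent(line)
    //             if (is_mobile_client(ua)) Seq((ua, 1)) else Seq() },
    //   (_, values) => values.sum,
    //   input)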

  • Let's design an implementation of runMapReduceJob

  • Step 1: running the mapper function

    (The mapper/reducer code and the runMapReduceJob call from the previous slide are shown again on this slide for reference.)

    [Figure: Nodes 0-3, each with a CPU and a disk holding two blocks of log.txt (blocks 0-7).]

    Step 1: run mapper function on all lines of file
    Question: How to assign work to nodes?

    Idea 1: use a work queue for the list of input blocks to process
    Dynamic assignment: a free node takes the next available block (work queue: block 0, block 1, block 2, ...; see the sketch below)

    Idea 2: data-distribution-based assignment: each node processes lines in the blocks of the input file that are stored locally.
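
    A minimal sketch of Idea 1 within one machine's worth of workers (a real cluster scheduler would hand out blocks over the network, but the structure is the same): a shared queue of input blocks, and each free worker repeatedly pulls the next available block. The names here are assumptions for illustration.

    import java.util.concurrent.ConcurrentLinkedQueue

    // Dynamic assignment: free workers repeatedly take the next available block.
    def runMappersDynamically(blocks: Seq[String], numWorkers: Int)
                             (processBlock: String => Unit): Unit = {
      val queue = new ConcurrentLinkedQueue[String]()
      blocks.foreach(queue.add)
      val workers = (0 until numWorkers).map { _ =>
        new Thread(() => {
          var b = queue.poll()                 // null once the queue is empty
          while (b != null) { processBlock(b); b = queue.poll() }
        })
      }
      workers.foreach(_.start())
      workers.foreach(_.join())
    }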

  • Steps 2 and 3: gathering data, running the reducer

    (Same mapper/reducer code and four-node cluster as before.)

    Step 2: Prepare intermediate data for reducer
    Step 3: Run reducer function on all keys
    Question: how to assign reducer tasks?
    Question: how to get all data for a key onto the correct worker node?

    Keys to reduce (generated by the mappers), spread across the nodes:
    - Node 0: "Safari iOS" values 0, "Chrome" values 0, "Chrome Glass" values 0
    - Node 1: "Safari iOS" values 1, "Chrome" values 1
    - Node 2: "Safari iOS" values 2, "Chrome" values 2
    - Node 3: "Safari iOS" values 3, "Chrome" values 3, "Safari iWatch" values 3

  • Steps 2 and 3: gathering data, running the reducer

    // gather all input data for key, then execute reducer
    // to produce final result
    void runReducer(string key, reducer, result) {
        list inputs;
        for (n in nodes) {
            filename = get_filename(key, n);
            read lines of filename, append into inputs;
        }
        reducer(key, inputs, result);
    }

    Step 2: Prepare intermediate data for reducer.
    Step 3: Run reducer function on all keys.
    Question: how to assign reducer tasks?
    Question: how to get all data for a key onto the correct worker node?

    Example: assign "Safari iOS" to Node 0 (Node 0 then gathers the "Safari iOS" values 0-3 from all four nodes and runs the reducer for that key).

    [Figure: same four-node cluster and per-node intermediate key/value lists as the previous slide.]
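
    One common answer to both questions, as a sketch of the usual approach (not necessarily exactly what the course implementation does): hash each key to a node id. Every mapper then writes its (key, value) pairs into per-destination-node files, and each node runs the reducer for exactly the keys that hash to it.

    // Deterministic key -> reducer-node assignment: every node computes the same
    // answer for a given key, so all values for that key end up on one node.
    def reducerNodeFor(key: String, numNodes: Int): Int =
      (key.hashCode % numNodes + numNodes) % numNodes   // non-negative modulo

    // e.g., with 4 nodes, every mapper sends its ("Safari iOS", 1) pairs to
    // node reducerNodeFor("Safari iOS", 4), which later runs reducer("Safari iOS", ...).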

  • Additional implementation challenges at scale

    [Figure: a cluster of 1000 nodes (Node 0 … Node 999), each with a CPU and a disk holding blocks of log.txt.]

    Nodes may fail during program execution

    Some nodes may run slower than others (due to different amounts of work, heterogeneity in the cluster, etc.)

  • Job scheduler responsibilities

    ▪ Exploit data locality: "move computation to the data"
    - Run mapper jobs on nodes that contain input files
    - Run reducer jobs on nodes that already have most of the data for a certain key

    ▪ Handling node failures
    - Scheduler detects job failures and reruns the job on new machines
    - This is possible since inputs reside in persistent storage (distributed file system)
    - Scheduler duplicates jobs on multiple machines (reduces overall processing latency incurred by node failures)

    ▪ Handling slow machines
    - Scheduler duplicates jobs on multiple machines

  • runMapReduceJob problems?

    ▪ Permits only a very simple program structure
    - Programs must be structured as: map, followed by reduce by key
    - See DryadLINQ for generalization to DAGs

    ▪ Iterative algorithms must load from disk each iteration
    - No primitives for sharing data directly between jobs

    [Figure: an iterative job as a chain of map-reduce jobs; each iteration starts with an HDFS read of its input and ends with an HDFS write of its output, which the next iteration reads back in.]

  • Spark

    In-memory, fault-tolerant distributed computing
    http://spark.apache.org/
    [Zaharia et al. NSDI 2012]

  • Goals

    ▪ Programming model for cluster-scale computations where there is significant reuse of intermediate datasets
    - Iterative machine learning and graph algorithms
    - Interactive data mining: load a large dataset into the aggregate memory of the cluster and then perform multiple ad-hoc queries

    ▪ Don't want to incur the inefficiency of writing intermediates to a persistent distributed file system (want to keep them in memory)
    - Challenge: efficiently implementing fault tolerance for large-scale distributed in-memory computations.
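
    To make the "reuse of intermediate datasets" goal concrete, here is a minimal sketch of the kind of program Spark targets (illustrative Spark-style code, not from the slides; parsePoint and gradientStep are hypothetical helpers): the input is loaded and parsed once, kept in cluster memory, and reused every iteration.

    // Parse the dataset once and ask Spark to keep it in aggregate cluster memory.
    var points = spark.textFile("hdfs://431log.txt")
                      .map(line => parsePoint(line))   // parsePoint: hypothetical parser
                      .persist()

    var model = 0.0
    for (i <- 1 to 20) {
      // Each iteration reads `points` from memory; nothing is written back to HDFS
      // between iterations (contrast with the iterative map-reduce picture earlier).
      model += points.map(p => gradientStep(p, model)).reduce((a, b) => a + b)
    }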

  • Fault tolerance for in-memory calculations

    ▪ Replicate all computations
    - Expensive solution: decreases peak throughput

    ▪ Checkpoint and rollback
    - Periodically save state of program to persistent storage
    - Restart from last checkpoint on node failure

    ▪ Maintain log of updates (commands and data)
    - High overhead for maintaining logs

    Recall map-reduce solutions:
    - Checkpoints after each map/reduce step by writing results to the file system
    - Scheduler's list of outstanding (but not yet complete) jobs is a log
    - Functional structure of programs allows for restart at the granularity of a single mapper or reducer invocation (don't have to restart the entire program)

  • Resilient distributed dataset (RDD)

    Spark's key programming abstraction:
    - Read-only ordered collection of records (immutable)
    - RDDs can only be created by deterministic transformations on data in persistent storage or on existing RDDs
    - Actions on RDDs return data to application

    // create RDD from filesystem data
    var lines = spark.textFile("hdfs://431log.txt");

    // create RDD using filter() transformation on lines
    var mobileViews = lines.filter((x: String) => isMobileClient(x));

    // another filter() transformation
    var safariViews = mobileViews.filter((x: String) => x.contains("Safari"));

    // then count number of elements in RDD via count() action
    var numViews = safariViews.count();

    [Lineage diagram: 431log.txt →.textFile(…)→ lines →.filter(...)→ mobileViews →.filter(...)→ safariViews →.count()→ numViews (an int); lines, mobileViews, and safariViews are RDDs.]

  • Repeating the map-reduce example

    // 1. create RDD from filesystem data
    // 2. create RDD with only lines from mobile clients
    // 3. create RDD with elements of type (String, Int) from line string
    // 4. group elements by key
    // 5. call provided reduction function on all keys to count views
    var perAgentCounts = spark.textFile("hdfs://431log.txt")
                              .filter(x => isMobileClient(x))
                              .map(x => (parseUserAgent(x), 1))
                              .reduceByKey((x, y) => x + y)
                              .collect();

    [Lineage diagram: 431log.txt →.textFile(…)→ lines →.filter(isMobileClient(…))→ →.map(parseUserAgent(…))→ →.reduceByKey(…)→ →.collect()→ perAgentCounts (an Array of (String, Int)).]

    "Lineage": sequence of RDD operations needed to compute output

  • Another Spark program

    // create RDD from filesystem data
    var lines = spark.textFile("hdfs://431log.txt");

    // create RDD using filter() transformation on lines
    var mobileViews = lines.filter((x: String) => isMobileClient(x));

    // instruct Spark runtime to try to keep mobileViews in memory
    mobileViews.persist();

    // create a new RDD by filtering mobileViews
    // then count number of elements in new RDD via count() action
    var numViews = mobileViews.filter(_.contains("Safari")).count();

    // 1. create new RDD by filtering only Chrome views
    // 2. for each element (page view), split string and take timestamp of page view
    // 3. convert RDD to a scalar sequence (collect() action)
    var timestamps = mobileViews.filter(_.contains("Chrome"))
                                .map(_.split(" ")(0))
                                .collect();

    [Lineage diagram: 431log.txt →.textFile(…)→ lines →.filter(isMobileClient(…))→ mobileViews, which branches into .filter(contains("Safari")).count() → numViews and .filter(contains("Chrome")).map(split(…)).collect() → timestamps.]

  • RDD transformations and actions

    Transformations (data-parallel operators taking an input RDD to a new RDD):
    - map(f : T → U)                   : RDD[T] → RDD[U]
    - filter(f : T → Bool)             : RDD[T] → RDD[T]
    - flatMap(f : T → Seq[U])          : RDD[T] → RDD[U]
    - sample(fraction : Float)         : RDD[T] → RDD[T]  (deterministic sampling)
    - groupByKey()                     : RDD[(K, V)] → RDD[(K, Seq[V])]
    - reduceByKey(f : (V, V) → V)      : RDD[(K, V)] → RDD[(K, V)]
    - union()                          : (RDD[T], RDD[T]) → RDD[T]
    - join()                           : (RDD[(K, V)], RDD[(K, W)]) → RDD[(K, (V, W))]
    - cogroup()                        : (RDD[(K, V)], RDD[(K, W)]) → RDD[(K, (Seq[V], Seq[W]))]
    - crossProduct()                   : (RDD[T], RDD[U]) → RDD[(T, U)]
    - mapValues(f : V → W)             : RDD[(K, V)] → RDD[(K, W)]  (preserves partitioning)
    - sort(c : Comparator[K])          : RDD[(K, V)] → RDD[(K, V)]
    - partitionBy(p : Partitioner[K])  : RDD[(K, V)] → RDD[(K, V)]

    Actions (provide data back to the "host" application):
    - count()                : RDD[T] → Long
    - collect()              : RDD[T] → Seq[T]
    - reduce(f : (T, T) → T) : RDD[T] → T
    - lookup(k : K)          : RDD[(K, V)] → Seq[V]  (on hash/range partitioned RDDs)
    - save(path : String)    : outputs RDD to a storage system, e.g., HDFS
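
    A small sketch combining several of the operators above on the running example (clientssupported.txt and its format are borrowed from a later slide; parseClientInfoLine is a hypothetical parser, and parseUserAgent / isMobileClient are the same assumed helpers as before):

    // (user agent, view count) via map + reduceByKey
    var viewsPerAgent = spark.textFile("hdfs://431log.txt")
                             .filter(x => isMobileClient(x))
                             .map(x => (parseUserAgent(x), 1))
                             .reduceByKey((a, b) => a + b);        // RDD of (String, Int)

    // (user agent, "yes"/"no") pairs describing which agents are supported
    var clientInfo = spark.textFile("hdfs://clientssupported.txt")
                          .map(line => parseClientInfoLine(line)); // hypothetical parser

    // join on user agent, then bring the result back to the host application
    var supported = viewsPerAgent.join(clientInfo)                 // RDD of (String, (Int, String))
                                 .collect();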

  • How do we implement RDDs? In particular, how should they be stored?

    var lines = spark.textFile("hdfs://431log.txt");
    var lower = lines.map(_.toLower());
    var mobileViews = lower.filter(x => isMobileClient(x));
    var howMany = mobileViews.count();

    [Figure: four nodes, each with a CPU, DRAM, and a disk holding two blocks of log.txt (blocks 0-7).]

    Question: should we think of RDDs like arrays?

  • How do we implement RDDs? In particular, how should they be stored?

    var lines = spark.textFile("hdfs://431log.txt");
    var lower = lines.map(_.toLower());
    var mobileViews = lower.filter(x => isMobileClient(x));
    var howMany = mobileViews.count();

    [Figure: the same four nodes, but now each node's DRAM holds two partitions each of lines, lower, and mobileViews (partitions 0-1 on the first node, 2-3 on the second, 4-5 on the third, 6-7 on the fourth), alongside the log.txt blocks on disk.]

    In-memory representation would be huge! (larger than the original file on disk)

  • RDD partitioning and dependencies

    var lines = spark.textFile("hdfs://431log.txt");
    var lower = lines.map(_.toLower());
    var mobileViews = lower.filter(x => isMobileClient(x));
    var howMany = mobileViews.count();

    [Figure: across Nodes 0-3, log.txt blocks 0-7 feed .load() into lines partitions 0-7, which feed .map() into lower partitions 0-7, which feed .filter() into mobileViews partitions 0-7. Block 0 covers records 0-1000 and block 1 records 1000-2000, so lines/lower partitions 0 and 1 cover those same ranges; after filtering, mobileViews partition 0 has 670 elements and partition 1 has 212 elements. Black lines show dependencies between RDD partitions.]

  • Implementing a sequence of RDD ops efficiently

    var lines = spark.textFile("hdfs://431log.txt");
    var lower = lines.map(_.toLower());
    var mobileViews = lower.filter(x => isMobileClient(x));
    var howMany = mobileViews.count();

    Recall the "loop fusion" example from the opening slides of the lecture.

    The following code stores only one line of the log file in memory at a time, and only reads input data from disk once (a "streaming" solution):

    int count = 0;
    while (!inputFile.eof()) {
        string line = inputFile.readLine();
        string lower = line.toLower;
        if (isMobileClient(lower))
            count++;
    }

  • A simple interface for RDDs

    var lines = spark.textFile("hdfs://431log.txt");
    var lower = lines.map(_.toLower());
    var mobileViews = lower.filter(x => isMobileClient(x));
    var howMany = mobileViews.count();

    // create RDD from text file on disk
    RDD::textFile(string filename) {
        return new RDDFromTextFile(open(filename));
    }

    // create RDD by mapping map_func onto input (parent) RDD
    RDD::map(RDD parent, map_func) {
        return new RDDFromMap(parent, map_func);
    }

    // create RDD by filtering input (parent) RDD
    RDD::filter(RDD parent, filter_func) {
        return new RDDFromFilter(parent, filter_func);
    }

    RDDFromTextFile::next() {
        return inputFile.readLine();
    }

    RDDFromMap::next() {
        var el = parent.next();
        return map_func(el);
    }

    RDDFromFilter::next() {
        while (parent.hasMoreElements()) {
            var el = parent.next();
            if (filter_func(el))
                return el;
        }
    }

    RDD::hasMoreElements() {
        return parent.hasMoreElements();
    }

    // overloaded since no parent exists
    RDDFromTextFile::hasMoreElements() {
        return !inputFile.eof();
    }

    // count action (forces evaluation of RDD)
    RDD::count() {
        int count = 0;
        while (hasMoreElements()) {
            var el = next();
            count++;
        }
        return count;
    }
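
    The same pull-based structure can be sketched with Scala's built-in lazy Iterators (reading a local file here as a stand-in for HDFS; isMobileClient is a stand-in predicate): each stage wraps its parent's iterator, so the final count pulls one element at a time through the whole chain without materializing intermediate collections.

    import scala.io.Source

    def isMobileClient(line: String): Boolean = line.contains("Mobile")   // stand-in

    val lines: Iterator[String]  = Source.fromFile("431log.txt").getLines()
    val lower: Iterator[String]  = lines.map(_.toLowerCase)               // wraps lines
    val mobile: Iterator[String] = lower.filter(isMobileClient)           // wraps lower
    val howMany: Int             = mobile.size   // like count(): pulls elements one by one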

  • Narrow dependencies

    var lines = spark.textFile("hdfs://431log.txt");
    var lower = lines.map(_.toLower());
    var mobileViews = lower.filter(x => isMobileClient(x));
    var howMany = mobileViews.count();

    [Figure: same partition diagram as before; across Nodes 0-3, log.txt blocks 0-7 feed .load() into lines partitions 0-7, then .map() into lower partitions 0-7, then .filter() into mobileViews partitions 0-7.]

    "Narrow dependencies" = each partition of the parent RDD is referenced by at most one child RDD partition

    - Allows for fusing of operations (here: can apply map and then filter all at once on an input element)
    - In this example: no communication between nodes of the cluster (communication of one int at the end to perform the count() reduction)

  • Wide dependencies

    [Figure: RDD_A partitions 0-3 feed .groupByKey() into RDD_B partitions 0-3, with every parent partition contributing to every child partition.]

    ▪ Wide dependencies = each partition of the parent RDD is referenced by multiple child RDD partitions

    ▪ Challenges:
    - Must compute all of RDD_A before computing RDD_B
    - Example: groupByKey() may induce all-to-all communication as shown above
    - May trigger significant recomputation of ancestor lineage upon node failure (I will address resilience in a few slides)

    groupByKey: RDD[(K,V)] → RDD[(K,Seq[V])]
    "Make a new RDD where each element is a sequence containing all values from the parent RDD with the same key."
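
    A short sketch, using the per-user-agent counting example from earlier, of why the choice of operator matters for a wide dependency: both versions below shuffle data, but reduceByKey can combine values within each partition before the shuffle, so far less data crosses the network than with groupByKey followed by a per-key sum.

    var pairs = spark.textFile("hdfs://431log.txt")
                     .filter(x => isMobileClient(x))
                     .map(x => (parseUserAgent(x), 1));

    // Shuffles every (key, 1) pair, then sums each key's sequence of values.
    var viaGroup = pairs.groupByKey()
                        .mapValues(vs => vs.sum);

    // Shuffles only one partial sum per key per partition.
    var viaReduce = pairs.reduceByKey((a, b) => a + b);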

  • Cost of operations depends on partitioning

    join: RDD[(K,V)], RDD[(K,W)] → RDD[(K,(V,W))]
    Assume data in RDD_A and RDD_B are partitioned by key: hash username to partition id

    Case 1: RDD_A and RDD_B have the same hash partition: join only creates narrow dependencies
    [Figure: each RDD_A partition, e.g. ("Kayvon",1), ("Teguh",23), lines up with the RDD_B partition holding the same keys, e.g. ("Kayvon","fizz"), ("Teguh","buzz"), so each RDD_C partition, e.g. ("Kayvon",(1,"fizz")), ("Teguh",(23,"buzz")), depends on exactly one partition of each parent.]

    Case 2: RDD_A and RDD_B have different hash partitions: join creates wide dependencies
    [Figure: the same keys are spread differently across RDD_B's partitions, e.g. ("Kayvon","fizz") sits with ("Alex","splat"), so producing each RDD_C partition requires data from multiple partitions of RDD_B: an all-to-all shuffle.]

  • partitionBy() transformation

    ▪ Inform Spark how to partition an RDD
    - e.g., HashPartitioner, RangePartitioner

    // create RDDs from filesystem data
    var lines = spark.textFile("hdfs://431log.txt");
    var clientInfo = spark.textFile("hdfs://clientssupported.txt");  // (user agent, "yes"/"no")

    // create RDD using filter() transformation on lines
    var mobileViews = lines.filter(x => isMobileClient(x)).map(x => parseUserAgent(x));

    // HashPartitioner maps keys to integers
    var partitioner = spark.HashPartitioner(100);

    // inform Spark of the partitioning
    // .persist() also instructs Spark to try to keep the dataset in memory
    var mobileViewPartitioned = mobileViews.partitionBy(partitioner)
                                           .persist();
    var clientInfoPartitioned = clientInfo.partitionBy(partitioner)
                                          .persist();

    // join user agents with whether they are supported or not supported
    // Note: this join only creates narrow dependencies due to the explicit partitioning above
    var joined = mobileViewPartitioned.join(clientInfoPartitioned);

    ▪ .persist():
    - Inform Spark this RDD's contents should be retained in memory
    - .persist(RELIABLE) = store contents in durable storage (like a checkpoint)

  • Scheduling Spark computations

    [Figure: a lineage graph over RDD_A through RDD_G, connected by .groupByKey(), .map(), .union(), and .join(), ending in a .save(); the graph is divided into Stage 1, Stage 2, and Stage 3, and shaded partitions (blocks 0-2) mark RDDs already materialized in memory.]

    Actions (e.g., save()) trigger evaluation of the Spark lineage graph.
    Stage 1 computation: do nothing, since the input is already materialized in memory
    Stage 2 computation: evaluate map in a fused manner; only actually materialize RDD_F
    Stage 3 computation: execute join (could stream the operation to disk, do not need to materialize)

  • Implementing resilience via lineage

    // create RDD from filesystem data
    var lines = spark.textFile("hdfs://431log.txt");

    // create RDD using filter() transformation on lines
    var mobileViews = lines.filter((x: String) => isMobileClient(x));

    // 1. create new RDD by filtering only Chrome views
    // 2. for each element, split string and take timestamp of page view (first element)
    // 3. convert RDD to a scalar sequence (collect() action)
    var timestamps = mobileViews.filter(_.contains("Chrome"))
                                .map(_.split(" ")(0));

    [Lineage diagram: 431log.txt →.load(…)→ lines →.filter(...)→ mobileViews →.filter(...)→ Chrome views →.map(_.split(" ")(0))→ timestamps.]

    ▪ RDD transformations are bulk, deterministic, and functional
    - Implication: runtime can always reconstruct the contents of an RDD from its lineage (the sequence of transformations used to create it)
    - Lineage is a log of transformations
    - Efficient: since the log records bulk data-parallel operations, the overhead of logging is low (compared to logging fine-grained operations, as in a database)
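
    A minimal sketch of the idea (hypothetical helper types, not Spark's internals): if the runtime records, for each RDD, the source file plus the ordered list of transformations applied, it can rebuild any partition by re-reading that partition's block from the replicated file system and replaying the transformations.

    // Lineage for the `timestamps` RDD: a source path plus an ordered list of
    // bulk transformations over that partition's lines.
    case class Lineage(sourcePath: String,
                       ops: List[Iterator[String] => Iterator[String]])

    def isMobileClient(line: String): Boolean = line.contains("Mobile")   // stand-in

    val timestampsLineage = Lineage(
      "hdfs://431log.txt",
      List(
        it => it.filter(isMobileClient),          // -> mobileViews
        it => it.filter(_.contains("Chrome")),    // -> Chrome views
        it => it.map(_.split(" ")(0))             // -> timestamps
      ))

    // Recover one lost partition: re-read its block (readBlock is a hypothetical
    // helper backed by an HDFS replica) and replay the lineage over it.
    def recoverPartition(l: Lineage, blockId: Int,
                         readBlock: (String, Int) => Iterator[String]): Seq[String] =
      l.ops.foldLeft(readBlock(l.sourcePath, blockId))((data, op) => op(data)).toSeq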

  • Upon node failure: recompute lost RDD partitions from lineage

    var lines = spark.textFile("hdfs://431log.txt");
    var mobileViews = lines.filter((x: String) => isMobileClient(x));
    var timestamps = mobileViews.filter(_.contains("Chrome"))
                                .map(_.split(" ")(0));

    [Figure: the four-node cluster, with log.txt blocks 0-7 on disk and partitions 0-7 of mobileViews and timestamps in each node's DRAM. One node crashes, losing mobileViews/timestamps partitions 2 and 3.]

    [Lineage diagram: 431log.txt →.load(…)→ lines →.filter(...)→ mobileViews →.filter(...)→ Chrome views →.map(_.split(" ")(0))→ timestamps.]

    Must reload the required subset of data from disk and recompute the entire sequence of operations given by the lineage to regenerate partitions 2 and 3 of RDD timestamps.

    Note (not shown): recall that file system data is replicated, so assume blocks 2 and 3 remain accessible to all nodes

  • Upon node failure: recompute lost RDD partitions from lineage

    var lines = spark.textFile("hdfs://431log.txt");
    var mobileViews = lines.filter((x: String) => isMobileClient(x));
    var timestamps = mobileViews.filter(_.contains("Chrome"))
                                .map(_.split(" ")(0));

    [Figure: same as the previous slide, but after recovery; timestamps partitions 2 and 3 have been recomputed elsewhere in the cluster.]

    Must reload the required subset of data from disk and recompute the entire sequence of operations given by the lineage to regenerate partitions 2 and 3 of RDD timestamps.

    Note (not shown): file system data is replicated, so assume blocks 2 and 3 remain accessible to all nodes

  • Spark performance

    [Bar chart: iteration time (s) for Hadoop, HadoopBM, and Spark on Logistic Regression and K-Means, split into first iteration vs. later iterations; Spark's later iterations are far faster than both Hadoop variants. 100 GB of data on a 100-node cluster.]

    HadoopBM = Hadoop Binary In-Memory (convert text input to binary, store in an in-memory version of HDFS)

    Q. Wait, the baseline parses text input in each iteration of an iterative algorithm?
    A. Yes. Sigh…

    Anything else puzzling here?

    HadoopBM's first iteration is slow because it runs an extra Hadoop job to copy the binary form of the input data to in-memory HDFS

    Accessing data from HDFS, even if in memory, has high overhead:
    - Multiple memcopies in the file system + a checksum

  • Caution: "scale out" is not the entire story

    ▪ Distributed systems designed for cloud execution address many difficult challenges, and have been instrumental in the explosion of "big-data" computing at large scale:
    - Scale-out parallelism to many machines
    - Resiliency in the face of failures
    - Complexity of managing clusters of machines

    But "scale out" is not the whole story. At the end of the day you want good performance.

    20 iterations of PageRank (iterative graph algorithm)

    Datasets:
    name    twitter rv [11]   uk-2007-05 [4]
    nodes   41,652,230        105,896,555
    edges   1,468,365,182     3,738,733,648
    size    5.76 GB           14.72 GB

    scalable system       cores   twitter   uk-2007-05
    GraphChi [10]         2       3160s     6972s
    Stratosphere [6]      16      2250s     -
    X-Stream [17]         16      1488s     -
    Spark [8]             128     857s      1759s
    Giraph [8]            128     596s      1235s
    GraphLab [8]          128     249s      833s
    GraphX [8]            128     419s      462s
    Single thread (SSD)   1       300s      651s
    Single thread (RAM)   1       275s      -

    Further optimization of the single-threaded baseline brought its time down to 110s:
    Vertex order (SSD)    1       300s      651s
    Vertex order (RAM)    1       275s      -
    Hilbert order (SSD)   1       242s      256s
    Hilbert order (RAM)   1       110s      -

    ["Scalability! At what COST?" McSherry et al. HotOS 2015]

  • Caution: "scale out" is not the entire story

    Label propagation [McSherry et al. HotOS 2015]:
    scalable system       cores   twitter   uk-2007-05
    Stratosphere [6]      16      950s      -
    X-Stream [17]         16      1159s     -
    Spark [8]             128     1784s     ≤ 8000s
    Giraph [8]            128     200s      ≤ 8000s
    GraphLab [8]          128     242s      714s
    GraphX [8]            128     251s      800s
    Single thread (SSD)   1       153s      417s

    PageRank, BIDData Suite (1 GPU-accelerated node) [Canny and Zhao, KDD 13]:
    System        Graph (V x E)   Time (s)   Gflops   Procs
    Hadoop        ? x 1.1B        198        0.015    50x8
    Spark         40M x 1.5B      97.4       0.03     50x2
    Twister       50M x 1.4B      36         0.09     60x4
    PowerGraph    40M x 1.4B      3.6        0.8      64x8
    BIDMat        60M x 1.4B      6          0.5      1x8
    BIDMat+disk   60M x 1.4B      24         0.16     1x8

    Latent Dirichlet Allocation (LDA):
    System        Docs/hr   Gflops   Procs
    Smola [15]    1.6M      0.5      100x8
    PowerGraph    1.1M      0.3      64x16
    BIDMach       3.6M      30       1x8x1

    From McSherry 2015:
    "The published work on big data systems has fetishized scalability as the most important feature of a distributed data processing platform. While nearly all such publications detail their system's impressive scalability, few directly evaluate their absolute performance against reasonable benchmarks. To what degree are these systems truly improving performance, as opposed to parallelizing overheads that they themselves introduce?"

    COST = "Configuration that Outperforms a Single Thread"
    Perhaps surprisingly, many published systems have unbounded COST (i.e., no configuration outperforms the best single-threaded implementation) for all of the problems to which they have been applied.

  • Performance improvements to Spark

    ▪ With increasing DRAM sizes and faster persistent storage (SSD), there is interest in improving the CPU utilization of Spark applications
    - Goal: reduce "COST"

    ▪ Efforts looking at adding efficient code generation to the Spark ecosystem (e.g., generate SIMD kernels, target accelerators like GPUs, etc.) to close the gap on single-node performance
    - RDD storage layouts must change to enable high-performance SIMD processing (e.g., struct of arrays instead of array of structs)
    - See Spark's Project Tungsten, Weld [Palkar CIDR '17], IBM's SparkGPU

    ▪ High-performance computing ideas are influencing the design of future performance-oriented distributed systems
    - Conversely: the scientific computing community has a lot to learn from the distributed computing community about elasticity and utility computing

  • Spark summary

    ▪ Opaque sequence abstraction (RDD) to encapsulate intermediates of cluster computations (previously, frameworks like Hadoop/MapReduce stored intermediates in the file system)
    - Observation: "files are a poor abstraction for intermediate variables in large-scale data-parallel programs"
    - RDDs are read-only, and created by deterministic data-parallel operators
    - Lineage tracked and used for locality-aware scheduling and fault tolerance (allows recomputation of partitions of an RDD on failure, rather than restore from a checkpoint*)
    - Bulk operations -> overhead of lineage tracking (logging) is low

    ▪ Simple, versatile abstraction upon which many domain-specific distributed computing frameworks are being implemented.
    - See Apache Spark project: spark.apache.org

    *Note that .persist(RELIABLE) allows the programmer to request checkpointing in long-lineage situations.

  • Modern Spark ecosystem

    - Interleave computation and database query (Spark SQL); can apply transformations to RDDs produced by SQL queries
    - Machine learning library (MLlib) built on top of Spark abstractions
    - GraphLab-like library (GraphX) built on top of Spark abstractions

    Compelling feature: enables integration/composition of multiple domain-specific frameworks (since all collections are implemented under the hood with RDDs and scheduled using the Spark scheduler)