mpi, dataflow, streaming: messaging for diverse requirements · 2017-09-26 · abstract: mpi,...
TRANSCRIPT
MPI, Dataflow, Streaming: Messaging for Diverse Requirements
Argonne National Lab, Chicago, Illinois, Geoffrey Fox, September 25, 2017
Indiana University, Department of Intelligent Systems Engineering
[email protected], http://www.dsc.soic.indiana.edu/, http://spidal.org/
Work with Judy Qiu, Shantenu Jha, Supun Kamburugamuve, Kannan Govindarajan, Pulasthi Wickramasinghe
9/26/17 1
25 years of MPI Symposium
Abstract: MPI, Dataflow, Streaming: Messaging for Diverse Requirements
• We look at the messaging needed in a variety of parallel, distributed, cloud and edge computing applications.
• We compare technology approaches in MPI, Asynchronous Many-Task systems, Apache NiFi, Heron, Kafka, OpenWhisk, Pregel, Spark and Flink, event-driven simulations (HLA) and Microsoft Naiad.
• We suggest an event-triggered dataflow polymorphic runtime with implementations that trade off performance, fault tolerance, and usability.
• Integrate Parallel Computing, Big Data, Grids
Motivating Remarks
• MPI is wonderful (and impossible to beat?) for closely coupled parallel computing, but
• There are many other regimes where either parallel computing and/or message passing is essential
• Application domains where other/higher-level concepts are successful/necessary
  • Internet of Things and Edge Computing growing in importance
  • Use of public clouds increasing rapidly
• Clouds becoming diverse, with subsystems containing GPUs, FPGAs, high performance networks, storage, memory…
• Rich software stacks:
  • HPC (High Performance Computing) for Parallel Computing less used than (?)
  • Apache for Big Data Software Stack ABDS, including some edge computing (streaming data)
• A lot of confusion coming from different communities (database, distributed, parallel computing, machine learning, computational/data science) investigating similar ideas with little knowledge exchange and mixed-up requirements
Requirements
• On general principles, parallel and distributed computing have different requirements even if sometimes similar functionalities
• The Apache stack ABDS typically uses distributed computing concepts
  • For example, the Reduce operation is different in MPI (Harp) and Spark
• Large scale simulation requirements are well understood
• Big Data requirements are not clear, but there are a few key use types:
  1) Pleasingly parallel processing (including local machine learning LML), as of different tweets from different users, with perhaps MapReduce style of statistics and visualizations; possibly Streaming
  2) Database model with queries again supported by MapReduce for horizontal scaling
  3) Global Machine Learning GML with a single job using multiple nodes as classic parallel computing
  4) Deep Learning certainly needs HPC; possibly only multiple small systems
• Current workloads stress 1) and 2) and are suited to current clouds and to ABDS (with no HPC)
  • This explains why Spark, with poor GML performance, is so successful and why it can ignore MPI
HPC Runtime versus ABDS distributed Computing Model on Data Analytics
• Hadoop writes to disk and is slowest; Spark and Flink spawn many processes and do not support AllReduce directly; MPI does in-place combined reduce/broadcast and is fastest
• Need a Polymorphic Reduction capability choosing the best implementation
• Use HPC architecture with Mutable model, Immutable data
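The difference between the two Reduce styles can be sketched in plain Python (a single-process simulation for illustration only; this is neither MPI nor Spark code, and the function names are invented):

```python
# Toy comparison of Reduce styles. "ranks" each hold a partial vector.

def dataflow_reduce_broadcast(partials):
    """Spark/Flink style: reduce builds a NEW dataset, then broadcast copies it."""
    reduced = [sum(col) for col in zip(*partials)]   # reduce creates fresh data
    return [list(reduced) for _ in partials]         # broadcast: one copy per task

def inplace_allreduce(partials):
    """MPI style: combined reduce/broadcast that overwrites each rank's buffer."""
    total = [sum(col) for col in zip(*partials)]
    for buf in partials:
        buf[:] = total                               # every rank's buffer mutated in place
    return partials

ranks = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(inplace_allreduce(ranks)[0])   # each rank now holds [9.0, 12.0]
```

The dataflow version leaves its inputs immutable and allocates new data at each step; the in-place version updates the mutable buffers directly, which is one reason the MPI model is faster for iterative analytics.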
Multidimensional Scaling: 3 Nested Parallel Sections

[Figures: MDS execution time on 16 nodes with 20 processes in each node with varying number of points; and MDS execution time with 32000 points on a varying number of nodes, each node running 20 parallel tasks. Curves compare Flink, Spark and MPI.]

MPI a Factor of 20-200 Faster than Spark/Flink
Implementing Twister2 to support a Grid linked to an HPC Cloud

[Diagram: Centralized HPC Cloud + IoT Devices; Centralized HPC Cloud + Edge = Fog + IoT Devices. The HPC Cloud can be federated.]
Serverless (server hidden) computing attractive to user: “No server is easier to manage than no server”
• Cloud-owner provided Cloud-native platform for
  • Event-driven applications which
  • Scale up and down instantly and automatically
  • Charges for actual usage at a millisecond granularity
[Diagram: stack from Bare Metal through IaaS, Container Orchestrators and PaaS to Serverless]
• GridSolve, Neos were FaaS
See review http://dx.doi.org/10.13140/RG.2.2.15007.87206
Twister2: “Next Generation Grid-Edge-HPC Cloud”
• Original 2010 Twister paper was a particular approach to Map-Collective iterative processing for machine learning
• Re-engineer current Apache Big Data software systems as a toolkit, with MPI as an option
  • Base on Apache Heron as most modern and “neutral” on controversial issues
• Support a serverless (cloud-native) dataflow event-driven HPC-FaaS (microservice) framework running across application and geographic domains.
• Support all types of Data analysis from GML to Edge computing
• Build on Cloud best practice but use HPC wherever possible to get high performance
  • Smoothly support current paradigms Naiad, Hadoop, Spark, Flink, Storm, Heron, MPI…
  • Use interoperable common abstractions but multiple polymorphic implementations.
  • i.e. do not require a single runtime
• Focus on Runtime, but this implies an HPC-FaaS programming and execution model
  • This describes a next generation Grid based on data and edge devices, not computing as in the original Grid
See long paper http://dsc.soic.indiana.edu/publications/Twister2.pdf
Communication (Messaging) Models
• MPI Gold Standard: Tightly synchronized applications
  • Efficient communications (µs latency) with use of advanced hardware
  • In-place communications and computations (Process scope for state)
• Basic (coarse-grain) dataflow: Model a computation as a graph
  • Nodes do computations, with Tasks as computations, and edges are asynchronous communications
  • A computation is activated when its input data dependencies are satisfied
• Streaming dataflow: Pub-Sub with data partitioned into streams
  • Streams are unbounded, ordered data tuples
  • Order of events is important and data are grouped into time windows
• Machine Learning dataflow: Iterative computations
  • There is both Model and Data, but only communicate the model
  • Collective communication operations such as AllReduce, AllGather (no differential operators in Big Data problems)
  • Can use in-place MPI style communication

[Diagram: dataflow graph with source (S), worker (W) and gather (G) nodes]
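The activation rule for basic dataflow can be sketched in a few lines of Python (illustrative only; the class and method names are invented, not from any of the systems above):

```python
# A dataflow node fires its task only once every input edge has delivered data.

class Node:
    def __init__(self, task, inputs):
        self.task = task                  # computation to run when activated
        self.inputs = set(inputs)         # names of input edges (dependencies)
        self.received = {}                # values delivered so far

    def deliver(self, edge, value):
        """Receive a value on an input edge; fire when all dependencies arrive."""
        self.received[edge] = value
        if set(self.received) == self.inputs:   # input data dependencies satisfied
            return self.task(self.received)     # computation is activated
        return None                             # still waiting on other edges

sink = Node(lambda ins: ins["a"] + ins["b"], ["a", "b"])
assert sink.deliver("a", 3) is None   # waiting on edge "b"
print(sink.deliver("b", 4))           # fires: 3 + 4 = 7
```

In a real runtime the edges are asynchronous messages between processes; here the delivery calls simply stand in for message arrival.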
Core SPIDAL Parallel HPC Library with Collectives Used
• DA-MDS: Rotate, AllReduce, Broadcast
• Directed Force Dimension Reduction: AllGather, AllReduce
• Irregular DAVS Clustering: Partial Rotate, AllReduce, Broadcast
• DA Semimetric Clustering: Rotate, AllReduce, Broadcast
• K-means: AllReduce, Broadcast, AllGather (DAAL)
• SVM: AllReduce, AllGather
• SubGraph Mining: AllGather, AllReduce
• Latent Dirichlet Allocation: Rotate, AllReduce
• Matrix Factorization (SGD): Rotate (DAAL)
• Recommender System (ALS): Rotate (DAAL)
• Singular Value Decomposition (SVD): AllGather (DAAL)
• QR Decomposition (QR): Reduce, Broadcast (DAAL)
• Neural Network: AllReduce (DAAL)
• Covariance: AllReduce (DAAL)
• Low Order Moments: Reduce (DAAL)
• Naive Bayes: Reduce (DAAL)
• Linear Regression: Reduce (DAAL)
• Ridge Regression: Reduce (DAAL)
• Multi-class Logistic Regression: Regroup, Rotate, AllGather
• Random Forest: AllReduce
• Principal Component Analysis (PCA): AllReduce (DAAL)
DAAL implies integrated with Intel DAAL Optimized Data Analytics Library (runs on KNL!)
Coordination Points
• There are, in many approaches, “coordination points” that can be implicit or explicit
• Twister2 makes coordination points an important (first class) concept
  • Dataflow nodes in Heron, Flink, Spark, Naiad; we call these fine-grain dataflow
  • Issuance of a Collective communication command in MPI
  • Start and End of a Parallel section in OpenMP
  • End of a job; we call these coarse-grain dataflow nodes, and these are seen in workflow systems such as Pegasus, Taverna, Kepler and NiFi (from Apache)
• Twister2 will allow users to specify the existence of a named coordination point and allow actions to be initiated
  • Produce an RDD style dataset as user specified
  • Launch new tasks as in Heron, Flink, Spark, Naiad
  • Change execution model as in OpenMP Parallel section
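A hypothetical sketch of how named coordination points with user-attached actions might look (plain Python; every name here is invented for illustration and is not the Twister2 API):

```python
# Registry of named coordination points: the user names a point and attaches
# actions (snapshot a dataset, launch tasks, switch execution model).

actions = {}   # coordination point name -> list of callbacks

def at_coordination_point(name, action):
    """User declares an action to run when a named point is reached."""
    actions.setdefault(name, []).append(action)

def reach(name, state):
    """Called by the runtime when execution arrives at a named point."""
    for action in actions.get(name, []):
        action(state)

snapshots = []
# Attach an action: produce an RDD-style snapshot of the state at this point.
at_coordination_point("after_reduce", lambda s: snapshots.append(dict(s)))
reach("after_reduce", {"model": [0.1, 0.2]})
print(snapshots)   # [{'model': [0.1, 0.2]}]
```

Points with no registered actions are free, so making coordination points first class need not cost anything when the user ignores them.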
NiFi Workflow with Coarse Grain Coordination
K-means and Dataflow

[Diagram: Dataflow for K-means. DataSet<Points> and DataSet<InitialCentroids> feed a Map (nearest centroid calculation); a Reduce (update centroids) produces DataSet<UpdatedCentroids>, which is Broadcast back as the job Iterates. Full Job Reduce, the Maps and Another Job appear as Coarse Grain Workflow Nodes; the Internal Execution (Iteration) Nodes give Fine-Grain Coordination, with Dataflow Communication and HPC Communication at the “Coordination Points”.]
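The dataflow above can be sketched in ordinary Python (a one-dimensional toy for illustration, not the Twister2 or Spark implementation; the function name is invented):

```python
# K-means as Map (nearest centroid), Reduce (update centroids), Broadcast
# (updated centroids flow back into the next iteration, implicit here).

def kmeans(points, centroids, iterations=10):
    for _ in range(iterations):
        # Map: assign each point to its nearest centroid
        assign = [min(range(len(centroids)),
                      key=lambda j: abs(p - centroids[j])) for p in points]
        # Reduce: update each centroid to the mean of its assigned points
        # (max(1, ...) guards against an empty cluster in this toy)
        centroids = [
            sum(p for p, a in zip(points, assign) if a == j) /
            max(1, sum(1 for a in assign if a == j))
            for j in range(len(centroids))
        ]
        # Broadcast: the new centroids are reused by all "map tasks" next round
    return centroids

print(kmeans([1.0, 1.2, 8.0, 8.4], [0.0, 10.0], iterations=5))   # ≈ [1.1, 8.2]
```

In the fine-grain dataflow picture, each of these three steps is a dataflow node; a coarse-grain workflow node would instead wrap the whole iterated job.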
Handling of State
• State is a key issue and handled differently in systems
• MPI, Naiad, Storm, Heron have long running tasks that preserve state
  • MPI tasks stop at end of job
  • Naiad, Storm, Heron tasks change at (fine-grain) dataflow nodes but all tasks run forever
• Spark and Flink tasks stop and refresh at dataflow nodes but preserve some state as RDD/datasets using in-memory databases
• All systems agree on actions at a coarse grain dataflow (at job level); only keep state by exchanging data.
Fault Tolerance and State
• A similar form of check-pointing mechanism is used already in HPC and Big Data
  • although HPC is informal, as it doesn't typically specify a dataflow graph
• Flink and Spark do better than MPI due to use of database technologies; MPI is a bit harder due to richer state, but there is an obvious integrated model using RDD type snapshots of MPI style jobs
• Checkpoint after each stage of the dataflow graph
  • Natural synchronization point
  • Let's allow the user to choose when to checkpoint (not every stage)
  • Save state as the user specifies; Spark just saves Model state, which is insufficient for complex algorithms
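A minimal sketch of this checkpointing idea, assuming invented names throughout: the full user-specified state is saved only after the stages the user selects, each a natural synchronization point.

```python
# Run a dataflow pipeline stage by stage; checkpoint full state only after
# the stages in checkpoint_after, using a user-supplied save callback.

def run_pipeline(stages, state, checkpoint_after, save):
    for name, fn in stages:
        state = fn(state)
        if name in checkpoint_after:   # user chooses when to checkpoint
            save(name, state)          # save full state, not just the model
    return state

saved = {}
stages = [("load", lambda s: s + [1, 2, 3]),
          ("reduce", lambda s: [sum(s)])]
result = run_pipeline(stages, [], {"reduce"},
                      lambda n, s: saved.update({n: list(s)}))
print(result, saved)   # [6] {'reduce': [6]}
```

On failure, a runtime would restart from the last saved stage rather than from the beginning; the sketch only shows the save side of that contract.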
Twister2 Components I

| Area | Component | Implementation | Comments |
|------|-----------|----------------|----------|
| Distributed Data API | Relaxed Distributed data set | Similar to Spark RDD | ETL type data applications; Streaming; Backup for Fault Tolerance |
| | Streaming | Pub-Sub and Spouts as in Heron | API to pub-sub messages |
| | Data access | Access common data sources including file, connecting to message brokers etc. | All the above applications can use this base functionality |
| | Distributed Shared Memory | Similar to PGAS | Machine learning such as graph algorithms |
| Task API (FaaS API, Function as a Service) | Dynamic Task Scheduling | Dynamic scheduling as in AMT | Some machine learning; FaaS |
| | Static Task Scheduling | Static scheduling as in Flink & Heron | Streaming; ETL data pipelines |
| | Task Execution | Thread based execution as seen in Spark, Flink, Naiad, OpenMP | Look at hybrid MPI/thread support available |
| | Task Graph | Twister2 Tasks similar to Naiad and Heron | |
| Streaming and FaaS Events | | Heron, OpenWhisk, Kafka/RabbitMQ | Classic Streaming; Scaling of FaaS needs Research |
| | Elasticity | OpenWhisk | Needs experimentation |
| | Task migration | Monitoring of tasks and migrating tasks for better resource utilization | |
Twister2 Components II

| Area | Component | Implementation | Comments |
|------|-----------|----------------|----------|
| Communication API | Messages | Heron | This is user level and could map to multiple communication |
| | Dataflow Communication | Fine-Grain Twister2 Dataflow communications: MPI, TCP and RMA; Coarse grain Dataflow from NiFi, Kepler? | Streaming; Machine learning; ETL data pipelines |
| | BSP Communication Map-Collective | MPI Style communication; Harp | Machine learning |
| Execution Model | Architecture | Spark, Flink | Container/Processes/Tasks = Threads |
| Job Submit API | Resource Scheduler | Pluggable architecture for any resource scheduler (Yarn, Mesos, Slurm) | All the above applications need this base functionality |
| | Dataflow graph analyzer & optimizer | Flink; Spark is dynamic and implicit | |
| Coordination Points | Specification and Actions | Research based on MPI, Spark, Flink, NiFi (Kepler) | Synchronization Point; Backup to datasets; Refresh Tasks |
| Security | Storage, Messaging, execution | Research | Crosses all Components |
Summary of MPI in a HPC Cloud + Edge + Grid Environment
• We suggest the value of an event-driven computing model built around Cloud and HPC and spanning batch, streaming, and edge applications
  • Highly parallel on cloud; possibly sequential at the edge
• We have done a preliminary analysis of the different runtimes of MPI, Hadoop, Spark, Flink, Storm, Heron, Naiad, HPC Asynchronous Many Task (AMT)
  • There are different technologies for different circumstances, but they can be unified by high level abstractions such as communication collectives
  • Obviously MPI is best for parallel computing (by definition)
  • Apache systems use dataflow communication, which is natural for distributed systems but inevitably slow for classic parallel computing
  • No standard dataflow library (why?). Add Dataflow primitives in MPI-4?
  • MPI could adopt some of the tools of Big Data, as in Coordination Points (dataflow nodes), State management with RDD (datasets)