
Page 1: MPI, Dataflow, Streaming: Messaging for Diverse Requirements · 2017-09-26


MPI, Dataflow, Streaming: Messaging for Diverse Requirements

Argonne National Lab, Chicago, Illinois, Geoffrey Fox, September 25, 2017

Indiana University, Department of Intelligent Systems Engineering [email protected], http://www.dsc.soic.indiana.edu/, http://spidal.org/

Work with Judy Qiu, Shantenu Jha, Supun Kamburugamuve, Kannan Govindarajan, Pulasthi Wickramasinghe

9/26/17 1

25 years of MPI Symposium

Page 2:

Abstract: MPI, Dataflow, Streaming: Messaging for Diverse Requirements

• We look at the messaging needed in a variety of parallel, distributed, cloud and edge computing applications.

• We compare technology approaches in MPI, Asynchronous Many-Task systems, Apache NiFi, Heron, Kafka, OpenWhisk, Pregel, Spark and Flink, event-driven simulations (HLA), and Microsoft Naiad.

• We suggest an event-triggered dataflow polymorphic runtime with implementations that trade off performance, fault tolerance, and usability.

• Integrate Parallel Computing, Big Data, and Grids


Page 3:

• MPI is wonderful (and impossible to beat?) for closely coupled parallel computing, but

• There are many other regimes where parallel computing and/or message passing is essential

• Application domains where other/higher-level concepts are successful/necessary
• Internet of Things and Edge Computing growing in importance
• Use of public clouds increasing rapidly

• Clouds becoming diverse, with subsystems containing GPUs, FPGAs, high-performance networks, storage, memory …

• Rich software stacks:
• HPC (High Performance Computing) stack for Parallel Computing, less used than (?)
• Apache Big Data Software Stack (ABDS), including some edge computing (streaming data)

• A lot of confusion coming from different communities (database, distributed, parallel computing, machine learning, computational/data science) investigating similar ideas with little knowledge exchange and mixed-up requirements

Motivating Remarks


Page 4:

• On general principles, parallel and distributed computing have different requirements, even if sometimes similar functionalities

• Apache stack ABDS typically uses distributed computing concepts
• For example, the Reduce operation is different in MPI (Harp) and Spark

• Large-scale simulation requirements are well understood
• Big Data requirements are not clear, but there are a few key use types:

1) Pleasingly parallel processing (including local machine learning, LML), as of different tweets from different users, with perhaps a MapReduce style of statistics and visualizations; possibly Streaming

2) Database model with queries again supported by MapReduce for horizontal scaling
3) Global Machine Learning (GML) with a single job using multiple nodes as classic parallel computing
4) Deep Learning certainly needs HPC – possibly only multiple small systems

• Current workloads stress 1) and 2) and are suited to current clouds and to ABDS (with no HPC)
• This explains why Spark, with poor GML performance, is so successful and why it can ignore MPI

Requirements


Page 5:

HPC Runtime versus ABDS Distributed Computing Model on Data Analytics

Hadoop writes to disk and is slowest; Spark and Flink spawn many processes and do not support AllReduce directly; MPI does in-place combined reduce/broadcast and is fastest

Need polymorphic Reduction capability, choosing the best implementation
Use HPC architecture with mutable model and immutable data
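The contrast on this slide can be illustrated with a toy pure-Python sketch (my own code, not any of the actual frameworks): a dataflow-style reduce that allocates a fresh immutable result at each step, versus an MPI-style allreduce that combines partial results in place into a mutable model buffer.

```python
# Sketch only: contrasts the two update styles discussed on this slide.
# "Dataflow style" allocates a new (immutable) result at every step, as
# Spark/Flink do with RDDs/datasets; "MPI style" combines partial
# results in place into a mutable model buffer, then (conceptually)
# broadcasts it back -- an allreduce.

def dataflow_reduce(partitions):
    # Each step produces a brand-new list (immutable-data style).
    result = [0.0] * len(partitions[0])
    for part in partitions:
        result = [r + p for r, p in zip(result, part)]  # new list each time
    return result

def inplace_allreduce(partitions, model):
    # Partial sums are combined directly into the mutable model buffer.
    for part in partitions:
        for i, p in enumerate(part):
            model[i] += p
    return model  # every "rank" now shares the same combined model

parts = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(dataflow_reduce(parts))                 # [9.0, 12.0]
print(inplace_allreduce(parts, [0.0, 0.0]))   # [9.0, 12.0]
```

Both produce the same answer; the difference the slide measures is the allocation and data-movement cost of the immutable style at scale.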


Page 6:

Multidimensional Scaling: 3 Nested Parallel Sections

MDS execution time on 16 nodes with 20 processes in each node with varying number of points.
MDS execution time with 32000 points on varying number of nodes; each node runs 20 parallel tasks.


(Plots compare Flink, Spark, and MPI.)
MPI is a factor of 20-200 faster than Spark/Flink

Page 7:

Implementing Twister2 to support a Grid linked to an HPC Cloud

(Diagrams: Centralized HPC Cloud + IoT Devices; Centralized HPC Cloud + Edge = Fog + IoT Devices; the HPC Cloud can be federated.)


Page 8:

• Cloud-owner provided Cloud-native platform for
• Event-driven applications which
• Scale up and down instantly and automatically
• Charges for actual usage at a millisecond granularity


(Figure: spectrum from Bare Metal and IaaS through PaaS / Container Orchestrators to Serverless.)

GridSolve and Neos were FaaS


See review: http://dx.doi.org/10.13140/RG.2.2.15007.87206

Serverless (server hidden) computing attractive to user: “No server is easier to manage than no server”

Page 9:

Twister2: “Next Generation Grid-Edge-HPC Cloud”
• Original 2010 Twister paper was a particular approach to Map-Collective iterative processing for machine learning

• Re-engineer current Apache Big Data software systems as a toolkit, with MPI as an option
• Base on Apache Heron as most modern and “neutral” on controversial issues

• Support a serverless (cloud-native) dataflow event-driven HPC-FaaS (microservice) framework running across application and geographic domains.

• Support all types of data analysis, from GML to edge computing

• Build on Cloud best practice but use HPC wherever possible to get high performance
• Smoothly support current paradigms: Naiad, Hadoop, Spark, Flink, Storm, Heron, MPI …
• Use interoperable common abstractions but multiple polymorphic implementations
• i.e. do not require a single runtime

• Focus on the Runtime, but this implies an HPC-FaaS programming and execution model
• This describes a next generation Grid based on data and edge devices, not computing as in the original Grid
See long paper: http://dsc.soic.indiana.edu/publications/Twister2.pdf


Page 10:

Communication (Messaging) Models
• MPI Gold Standard: tightly synchronized applications

• Efficient communications (µs latency) with use of advanced hardware
• In-place communications and computations (Process scope for state)

• Basic (coarse-grain) dataflow: model a computation as a graph
• Nodes do the computations (tasks) and edges are asynchronous communications

• A computation is activated when its input data dependencies are satisfied

• Streaming dataflow: Pub-Sub with data partitioned into streams
• Streams are unbounded, ordered data tuples
• Order of events important; group data into time windows

• Machine Learning dataflow: iterative computations
• There is both Model and Data, but only communicate the model
• Collective communication operations such as AllReduce, AllGather (no differential operators in Big Data problems)

• Can use in-place MPI-style communication

(Diagram: small dataflow graphs of S, W, and G nodes illustrating the dataflow model.)
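As a toy illustration of the streaming-dataflow bullets above, here is a pure-Python sketch (no real pub-sub system; the function name is my own) that groups an ordered stream of timestamped tuples into fixed time windows:

```python
# Sketch: group an ordered stream of (timestamp, value) tuples into
# fixed-size time windows, as a streaming dataflow system would before
# applying a windowed operator. The window size is arbitrary.

def window_stream(events, window_secs):
    # events: iterable of (timestamp, value), assumed ordered by timestamp
    windows = {}
    for ts, value in events:
        key = int(ts // window_secs)  # which window this event falls in
        windows.setdefault(key, []).append(value)
    return windows

stream = [(0.5, "a"), (1.2, "b"), (2.7, "c"), (3.1, "d")]
print(window_stream(stream, 2.0))  # {0: ['a', 'b'], 1: ['c', 'd']}
```

A real system would evict and emit windows as event time advances rather than hold them all in one dict; this only shows the grouping semantics.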


Page 11:

Core SPIDAL Parallel HPC Library with Collectives Used
• DA-MDS: Rotate, AllReduce, Broadcast
• Directed Force Dimension Reduction: AllGather, AllReduce
• Irregular DAVS Clustering: Partial Rotate, AllReduce, Broadcast
• DA Semimetric Clustering: Rotate, AllReduce, Broadcast
• K-means: AllReduce, Broadcast, AllGather; DAAL
• SVM: AllReduce, AllGather
• SubGraph Mining: AllGather, AllReduce
• Latent Dirichlet Allocation: Rotate, AllReduce
• Matrix Factorization (SGD): Rotate; DAAL
• Recommender System (ALS): Rotate; DAAL
• Singular Value Decomposition (SVD): AllGather; DAAL
• QR Decomposition (QR): Reduce, Broadcast; DAAL
• Neural Network: AllReduce; DAAL
• Covariance: AllReduce; DAAL
• Low Order Moments: Reduce; DAAL
• Naive Bayes: Reduce; DAAL
• Linear Regression: Reduce; DAAL
• Ridge Regression: Reduce; DAAL
• Multi-class Logistic Regression: Regroup, Rotate, AllGather
• Random Forest: AllReduce
• Principal Component Analysis (PCA): AllReduce; DAAL

DAAL means integrated with the Intel DAAL Optimized Data Analytics Library (runs on KNL!)
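Since the library above is organized around a handful of collectives, a toy simulation of the two most common ones may help fix their semantics (pure Python, my own names; real implementations run across ranks over a network):

```python
# Sketch: the two collectives used most in the list above, simulated
# over a list standing in for the per-rank inputs. allreduce leaves
# every rank with the same combined value; allgather leaves every rank
# with the full concatenation of all inputs.

def allreduce(rank_values, op=sum):
    combined = op(rank_values)
    return [combined for _ in rank_values]   # same result on every rank

def allgather(rank_values):
    gathered = list(rank_values)
    return [list(gathered) for _ in rank_values]

print(allreduce([1, 2, 3]))   # [6, 6, 6]
print(allgather([1, 2, 3]))   # [[1, 2, 3], [1, 2, 3], [1, 2, 3]]
```

In MPI these correspond to MPI_Allreduce and MPI_Allgather; the algorithms above pick the collective that matches how each model update must be shared.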


Page 12:

Coordination Points
• There are, in many approaches, “coordination points” that can be implicit or explicit

• Twister2 makes coordination points an important (first-class) concept
• Dataflow nodes in Heron, Flink, Spark, Naiad; we call these fine-grain dataflow
• Issuance of a Collective communication command in MPI
• Start and end of a Parallel section in OpenMP
• End of a job; we call these coarse-grain dataflow nodes, and these are seen in workflow systems such as Pegasus, Taverna, Kepler and NiFi (from Apache)

• Twister2 will allow users to specify the existence of a named coordination point and allow actions to be initiated

• Produce an RDD-style dataset from user specified
• Launch new tasks as in Heron, Flink, Spark, Naiad
• Change execution model as in an OpenMP Parallel section
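A hypothetical sketch of what a named coordination point with attachable actions could look like (the class and method names here are my invention for illustration, not the actual Twister2 API):

```python
# Hypothetical sketch (not the real Twister2 API): a named coordination
# point where registered actions -- snapshot a dataset, launch tasks,
# switch execution model -- fire when the point is reached.

class CoordinationPoint:
    def __init__(self, name):
        self.name = name
        self.actions = []   # callbacks to run when the point is reached

    def on_reach(self, action):
        self.actions.append(action)

    def reach(self, state):
        # In a real runtime this would be a synchronization barrier
        # across workers; here we just fire the registered actions.
        return [action(state) for action in self.actions]

cp = CoordinationPoint("end_of_iteration")
cp.on_reach(lambda s: ("snapshot", list(s)))     # RDD-style dataset
cp.on_reach(lambda s: ("launch_tasks", len(s)))  # spawn follow-on tasks
print(cp.reach([1, 2, 3]))
# [('snapshot', [1, 2, 3]), ('launch_tasks', 3)]
```

The point of making this first-class is that the same named hook can trigger any of the three bulleted actions without the runtime hard-coding one behavior.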


Page 13:

NiFi Workflow with Coarse-Grain Coordination


Page 14:

K-means and Dataflow


(Diagram, Dataflow for K-means: DataSet&lt;Points&gt; and DataSet&lt;InitialCentroids&gt; feed a Map (nearest-centroid calculation); a Reduce (update centroids) produces DataSet&lt;UpdatedCentroids&gt;, which is Broadcast back for the next iteration.)

(Diagram: a full job of Maps and Reduce that Iterate, linked to another job; coarse-grain workflow nodes connect the jobs, while internal execution (iteration) nodes provide fine-grain coordination, with dataflow communication and HPC communication meeting at the “coordination points”.)
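The K-means dataflow in the diagram above can be condensed into a pure-Python loop (an illustration only, using 1-D points; real systems distribute the Map and Reduce across workers):

```python
# Sketch of the K-means dataflow on this slide: Map points to their
# nearest centroid, Reduce to updated centroids, then "broadcast" the
# new centroids into the next iteration. 1-D points for brevity.

def kmeans(points, centroids, iterations):
    for _ in range(iterations):            # Iterate
        # Map: assign each point to its nearest centroid
        assign = [min(range(len(centroids)),
                      key=lambda k: abs(p - centroids[k]))
                  for p in points]
        # Reduce: recompute each centroid as the mean of its points
        centroids = [
            sum(p for p, a in zip(points, assign) if a == k) /
            max(1, sum(1 for a in assign if a == k))
            for k in range(len(centroids))
        ]
        # Broadcast: the new centroids feed the next iteration's Map
    return centroids

result = kmeans([1.0, 1.2, 8.0, 8.4], [0.0, 10.0], 5)
print([round(c, 3) for c in result])  # [1.1, 8.2]
```

Only the (small) centroid model moves between iterations; the (large) point data stays put, which is exactly the Machine Learning dataflow pattern from Page 10.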

Page 15:

Handling of State
• State is a key issue and handled differently in systems
• MPI, Naiad, Storm, Heron have long-running tasks that preserve state

• MPI tasks stop at end of job
• Naiad, Storm, Heron tasks change at (fine-grain) dataflow nodes but all tasks run forever

• Spark and Flink tasks stop and refresh at dataflow nodes but preserve some state as RDDs/datasets using in-memory databases

• All systems agree on actions at a coarse-grain dataflow (at job level); only keep state by exchanging data


Page 16:

Fault Tolerance and State
• A similar form of checkpointing mechanism is already used in HPC and Big Data

• although HPC is informal, as it doesn't typically specify a dataflow graph
• Flink and Spark do better than MPI due to their use of database technologies; MPI is a bit harder due to richer state, but there is an obvious integrated model using RDD-type snapshots of MPI-style jobs

• Checkpoint after each stage of the dataflow graph
• Natural synchronization point
• Allow the user to choose when to checkpoint (not every stage)
• Save state as the user specifies; Spark just saves Model state, which is insufficient for complex algorithms
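A minimal sketch of the user-selected, stage-level checkpointing described above (the function and store names are hypothetical, not from any of the systems discussed):

```python
# Sketch: run a dataflow as a list of stages, checkpointing state only
# after the stages the user selects (not every stage), per the bullets
# above. The stage boundary is the natural synchronization point.

import pickle

def run_pipeline(stages, state, checkpoint_after, store):
    # stages: list of (name, fn); each fn maps state -> new state
    # checkpoint_after: set of stage names the user wants snapshotted
    for name, fn in stages:
        state = fn(state)
        if name in checkpoint_after:            # user-chosen sync point
            store[name] = pickle.dumps(state)   # RDD-style snapshot
    return state

store = {}
stages = [("load",   lambda s: s + [1, 2, 3]),
          ("square", lambda s: [x * x for x in s]),
          ("sum",    lambda s: [sum(s)])]
out = run_pipeline(stages, [], {"square"}, store)
print(out)                            # [14]
print(pickle.loads(store["square"]))  # [1, 4, 9]
```

On failure, a runtime could restart from the most recent snapshot rather than the beginning; letting the user pick which stages to snapshot trades recovery time against checkpoint overhead.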


Page 17:

Twister2 Components I


Area | Component | Implementation | Comments
Distributed Data API | Relaxed Distributed Data Set | Similar to Spark RDD | ETL-type data applications; streaming; backup for fault tolerance
Distributed Data API | Streaming | Pub-Sub and Spouts as in Heron | API to pub-sub messages
Distributed Data API | Data access | Access common data sources including files, connecting to message brokers, etc. | All the above applications can use this base functionality
Distributed Data API | Distributed Shared Memory | Similar to PGAS | Machine learning such as graph algorithms
Task API (FaaS API, Function as a Service) | Dynamic Task Scheduling | Dynamic scheduling as in AMT | Some machine learning; FaaS
Task API | Static Task Scheduling | Static scheduling as in Flink & Heron | Streaming; ETL data pipelines
Task API | Task Execution | Thread-based execution as seen in Spark, Flink, Naiad, OpenMP | Look at hybrid MPI/thread support available
Task API | Task Graph | Twister2 Tasks similar to Naiad and Heron |
Task API | Streaming and FaaS Events | Heron, OpenWhisk, Kafka/RabbitMQ | Classic streaming; scaling of FaaS needs research
Task API | Elasticity | OpenWhisk | Needs experimentation
Task API | Task migration | Monitoring of tasks and migrating tasks for better resource utilization |

Page 18:

Twister2 Components II


Area | Component | Implementation | Comments
Communication API | Messages | Heron | This is user level and could map to multiple communication systems
Communication API | Dataflow Communication (Fine-Grain) | Twister2 Dataflow communications: MPI, TCP and RMA; coarse-grain dataflow from NiFi, Kepler? | Streaming; machine learning; ETL data pipelines
Communication API | BSP Communication (Map-Collective) | MPI-style communication, Harp | Machine learning
Execution Model | Architecture | Spark, Flink | Container / Processes / Tasks = Threads
Execution Model | Job Submit API, Resource Scheduler | Pluggable architecture for any resource scheduler (Yarn, Mesos, Slurm) | All the above applications need this base functionality
Execution Model | Dataflow graph analyzer & optimizer | Flink (Spark is dynamic and implicit) |
Coordination Points | Specification and Actions | Research based on MPI, Spark, Flink, NiFi (Kepler) | Synchronization point; backup to datasets; refresh tasks
Security | Storage, messaging, execution | Research | Crosses all components

Page 19:

Summary of MPI in an HPC Cloud + Edge + Grid Environment

• We suggest the value of an event-driven computing model built around Cloud and HPC and spanning batch, streaming, and edge applications

• Highly parallel on cloud; possibly sequential at the edge
• We have done a preliminary analysis of the different runtimes of MPI, Hadoop, Spark, Flink, Storm, Heron, Naiad, HPC Asynchronous Many-Task (AMT)

• There are different technologies for different circumstances, but they can be unified by high-level abstractions such as communication collectives

• Obviously MPI is best for parallel computing (by definition)

• Apache systems use dataflow communication, which is natural for distributed systems but inevitably slow for classic parallel computing

• No standard dataflow library (why?). Add Dataflow primitives in MPI-4?

• MPI could adopt some of the tools of Big Data, as in Coordination Points (dataflow nodes) and State management with RDDs (datasets)
