TRANSCRIPT
On the Use of Burst Buffers for Accelerating Data-Intensive Scientific Workflows
Rafael Ferreira da Silva, Scott Callaghan, Ewa Deelman
12th Workshop on Workflows in Support of Large-Scale Science (WORKS) – SuperComputing’17, November 13, 2017
Funded by the US Department of Energy under Grant #DE-SC0012636
OUTLINE
Introduction: Next-generation Supercomputers, Data-intensive Workflows, In-transit / In-situ
Burst Buffers: Overview, Node-local / Remote-shared, NERSC BB
Model and Design: BB Reservations, Execution Directory, I/O Read Operations
Workflow Application: CyberShake, Workflow Implementation, I/O Performance (Darshan)
Experimental Results: Overall Write/Read Operations, I/O Performance per Process, Cumulative CPU Time, Rupture Files
Conclusion: Summary of Findings, Future Directions
A Brief Introduction and Contextualization
Traditionally, workflows have used the file system to communicate data between tasks.
To cope with increasing application demands on I/O operations, solutions targeting in situ and/or in transit processing have become mainstream approaches to attenuate I/O performance bottlenecks.
Next-Generation Exascale Supercomputers
Increased processing capabilities to over 10^18 Flop/s
Memory and disk capacity will also be significantly increased; power consumption management
I/O performance of the parallel file system (PFS) is not expected to improve much
Data-Intensive Scientific Workflows
While in situ is well adapted for computations that conform with the data distribution imposed by simulations, in transit processing targets applications where intensive data transfers are required.
1000 Genome Workflow: consumes/produces over 4.4 TB of data, and requires over 24 TB of memory across all tasks
CyberShake Workflow: consumes/produces over 700 GB of data
Improving I/O Performance with Burst Buffers
Burst buffers have emerged as a non-volatile storage solution that is positioned between the processors' memory and the PFS, buffering the large volume of data produced by the application at a higher rate than the PFS, while seamlessly draining the data to the PFS asynchronously.
In Transit Processing
Placement of the burst buffer nodes within the Cori system (NERSC)
"A burst buffer consists of the combination of rapidly accessed persistent memory with its own processing power (e.g., DRAM), and a block of symmetric multi-processor compute accessible through high-bandwidth links (e.g., PCI Express)."
Node-local vs Remote-shared
[Figure: burst buffer node architecture – Xeon processor / DRAM, Flash SSD cards over x8 PCIe, Aries interconnect over x16 PCIe, compute nodes (CN), storage servers, and the storage area network]
First Burst Buffers Use at Scale
Cori is a petascale HPC system and #6 on the June 2017 Top500 list
The NERSC BB is based on Cray DataWarp (Cray's implementation of the BB concept)
Cori System (NERSC)
Architectural overview of a burst-buffer node on Cori at NERSC
Each BB node contains a Xeon processor, 64 GB of DDR3 memory, and two 3.2 TB NAND flash SSD modules attached over two PCIe gen3 x8 interfaces, which is attached to a Cray Aries network interconnect over a PCIe gen3 x16 interface
~6.5 GB/s of sequential read and write bandwidth
Model and Design: Enabling BB in Workflow Systems
Automated BB Reservations: BB reservation operations (either persistent or scratch) consist of creation and release, as well as stage-in and stage-out operations
Transient reservations: stage-in/out operations need to be implemented at the beginning/end of each job execution
Execution Directory: Automated mapping between the workflow execution directory and the BB reservation
No changes to the application code are necessary, and the application job directly writes its output to the BB reservation
I/O Read Operations: Read operations from the BB should be transparent to the applications
Approaches: point the execution directory to the BB reservation, or create symbolic links to data endpoints into the BB (a sketch follows below)
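A minimal sketch, assuming a Cray DataWarp system such as Cori, of how these operations map onto batch directives; the capacity, paths, and file names below are illustrative and are not the ones used in the experiments:

  #!/bin/bash
  #SBATCH -N 4
  # Transient (scratch) reservation that lives only for the duration of this job
  #DW jobdw capacity=200GB access_mode=striped type=scratch
  # Stage inputs into the reservation before the job starts, and drain
  # outputs back to the PFS after it completes (stage-in / stage-out)
  #DW stage_in source=/global/cscratch1/sd/user/inputs destination=$DW_JOB_STRIPED/inputs type=directory
  #DW stage_out source=$DW_JOB_STRIPED/run destination=/global/cscratch1/sd/user/outputs type=directory

  # Transparent reads: either point the execution directory at the reservation...
  mkdir -p $DW_JOB_STRIPED/run && cd $DW_JOB_STRIPED/run
  # ...or create symbolic links from the PFS execution directory into the BB
  ln -s $DW_JOB_STRIPED/inputs/fx.sgt $SLURM_SUBMIT_DIR/fx.sgt
  srun ./my_app   # hypothetical application binary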
Workflow Application: CyberShake Workflow
CyberShake is high-performance computing software that uses 3D waveform modeling to calculate probabilistic seismic hazard analysis (PSHA) estimates for populated areas of California
Constructs and populates a 3D mesh of ~1.2 billion elements with seismic velocity data to compute Strain Green Tensors (SGTs)
Post-processing: SGTs are convolved with slip time histories for each of about 500,000 different earthquakes to generate synthetic seismograms for each event
CyberShake Workflow
CyberShake hazard map for Southern California, showing the spectral accelerations at a 2-second period exceeded with a probability of 2% in 50 years
We focus on the two CyberShake job types which together account for 97% of the compute time: the wave propagation code AWP-ODC-SGT, and the post-processing code DirectSynth
Burst Buffers: Workflow Implementation
The workflow is composed of two tightly-coupled parallel jobs (SGT_generator and direct_synth), and two system jobs (bb_setup and bb_delete)
Generates/consumes about 550 GB of data
Pegasus WMS
A general representation of the CyberShake test workflow
[Figure: CyberShake test workflow DAG with bb_setup, SGT_generator, multiple direct_synth jobs, and bb_delete; intermediate data files: fx.sgt, fx.sgtheader, fy.sgt, fy.sgtheader; outputs: seismogram, rotd, peakvals; control-flow and data-flow edges]
https://github.com/rafaelfsilva/bb-workflow
Compute job (uses the persistent reservation):
#SBATCH -p regular
#SBATCH -N 64
#SBATCH -C haswell
#SBATCH -t 05:00:00
#DW persistentdw name=csbb

bb_setup job (creates the persistent reservation):
#SBATCH -p debug
#SBATCH -N 1
#SBATCH -C haswell
#SBATCH -t 00:05:00
#BB create_persistent name=csbb capacity=700GB access=striped type=scratch

The bb_setup job goes through the regular queuing process
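A hedged sketch of the full persistent-reservation lifecycle on a DataWarp system: the destroy directive, the environment variable, and the sbatch dependency chain below follow standard SLURM/DataWarp usage and are not taken verbatim from the paper; job script names are illustrative.

  # Inside the compute job: DataWarp exposes the persistent reservation through an
  # environment variable named after the reservation (csbb here)
  RUNDIR=$DW_PERSISTENT_STRIPED_csbb/run0001   # execution directory mapped into the BB
  mkdir -p $RUNDIR && cd $RUNDIR

  # The bb_delete job only needs a directive that releases the reservation:
  #   #BB destroy_persistent name=csbb

  # The jobs can be chained with standard SLURM dependencies
  # (in the actual runs, Pegasus WMS manages this ordering):
  setup=$(sbatch --parsable bb_setup.sl)
  run=$(sbatch --parsable --dependency=afterok:$setup cybershake.sl)
  sbatch --dependency=afterany:$run bb_delete.sl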
Collecting I/O Performance Data with Darshan
HPC lightweight I/O profiling tool that captures an accurate picture of I/O behavior (including POSIX I/O, MPI-IO, and HDF5 I/O) in MPI applications
Darshan: HPC I/O Characterization Tool
Darshan is part of the default software stack on Cori
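A rough sketch of pulling I/O counters out of a Darshan log with the standard darshan-util command-line tools; the log file name below is illustrative:

  # Darshan instruments MPI applications automatically on Cori and writes a
  # compressed log per job; summarize the aggregate counters across all modules:
  darshan-parser --total direct_synth_run.darshan | grep -E 'BYTES|F_READ_TIME|F_WRITE_TIME'
  # Or generate a graphical summary report for the job:
  darshan-job-summary.pl direct_synth_run.darshan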
Experimental Results: Overall Write Operations
Average I/O performance estimate for write operations at the MPI-IO layer (left), and average I/O write performance gain (right) for the SGT_generator job
• Overall, write operations to the PFS (No-BB) have nearly constant I/O performance
• No-BB: ~900 MiB/s regardless of the number of nodes used
• Base values obtained for the BB executions (1 node, 32 cores) are over 4,600 MiB/s, and peak values scale up to ~8,200 MiB/s for 32 nodes (1,024 cores)
• Slight drop in the I/O performance (#nodes ≥ 64) due to the large number of concurrent write operations
Experimental Results: Overall Read Operations
I/O performance estimate for read operations at the MPI-IO layer (left), and average I/O read performance gain (right) for the direct_synth job
• I/O read operations from the PFS yield similar performance regardless of the number of nodes used: ~500 MiB/s
• BB: single-node performance of 4,000 MiB/s, peak values up to about 8,000 MiB/s
• Small drop in the performance for runs using 64 nodes or above – may indicate an I/O bottleneck when draining the data to/from the underlying parallel file system
Experimental Results: I/O Performance per Process
POSIX module data: Average time consumed in I/O read operations per process for the direct_synth job
• POSIX operations (left) represent buffering and synchronization operations with the system
• POSIX values are negligible when compared to the job's total runtime (~8h for 64 nodes)
• MPI-IO: the BB accelerates I/O read operations up to 10 times on average
• For larger configurations (≥32 nodes), the average time is nearly the same as when running with 16 nodes for the No-BB case
[Plots: time (seconds) spent in I/O read operations per process vs. number of nodes, for the fx.sgt and fy.sgt files, BB vs. No-BB – POSIX module data and MPI-IO module data]
MPI-IO module data: Average time consumed in I/O read operations per process for the direct_synth job
Ratio between the cumulative time spent in the user (utime) and kernel (stime) spaces for the direct_synth job, for different numbers of nodes
• Averaged values (for up to thousands of cores) may mask slower processes
In some cases, e.g., 64 nodes, the slowest time consumed in I/O read operations can slow down the application by up to 12 times the averaged value
• Ratio between the time spent in the user (utime) and kernel (stime) spaces – handling I/O-related interrupts, etc.
• Performance at 64 nodes is similar to 32 nodes, suggesting gains in application parallel efficiency would outweigh a slight I/O performance hit at 64 nodes and lead to decreased overall runtime
Experimental Results: Cumulative CPU Time
[Plot: cumulative CPU time usage (%) split into stime and utime, versus number of nodes (1–128), for BB and No-BB runs]
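As a side note, the utime/stime split shown above can be inspected for any running process from Linux accounting data; a minimal illustration (not necessarily the instrumentation used in this study) is:

  # Fields 14 (utime) and 15 (stime) of /proc/<pid>/stat are CPU times in clock ticks
  pid=$(pgrep -n direct_synth)   # hypothetical: pick one running direct_synth process
  awk '{printf "user: %d ticks, kernel: %d ticks, user share: %.1f%%\n", $14, $15, 100*$14/($14+$15)}' /proc/$pid/stat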
Ratio between the cumulative time spent in the user (utime) and kernel (stime) spaces for the direct_synth job for different numbers of rupture files (workflow runs with 64 nodes)
• A typical execution of the CyberShake workflow for a selected site in our experiment processes about 5,700 rupture files
• The processing of rupture files drives most of the CPU (user space) activity for the direct_synth job
• The use of a BB attenuates (by about 15%) the I/O processing time of the workflow jobs, for both read and write operations
Experimental Results: Rupture Files
[Plot: cumulative CPU time usage (%) split into stime and utime, versus number of rupture files (1–5,700), for BB and No-BB runs]
Major Findings
• I/O write performance was improved by a factor of 9, and I/O read performance by a factor of 15
• Performance decreased slightly at node counts above 64 (potential I/O ceiling)
• I/O performance must be balanced with parallel efficiency when using burst buffers with highly parallel applications
• I/O contention may limit the broad applicability of burst buffers for all workflow applications (e.g., in situ processing)
Conclusion and Future Work
What's Next?
• Solutions such as I/O-aware scheduling or in situ processing may also not fulfill all application requirements
We intend to investigate the use of combined in situ and in transit analysis
• Development of a production solution for the Pegasus workflow management system
ON THE USE OF BURST BUFFERS FOR ACCELERATING DATA-INTENSIVE SCIENTIFIC WORKFLOWS
Rafael Ferreira da Silva, Ph.D. – Research Assistant Professor, Computer Science Department; Computer Scientist, USC Information Sciences Institute
[email protected] – http://rafaelsilva.com
Thank You
Questions?
Funded by the US Department of Energy under Grant #DE-SC0012636