TRANSCRIPT
On the Use of Burst Buffers for Accelerating Data-Intensive Scientific Workflows
Rafael Ferreira da Silva, Scott Callaghan, Ewa Deelman
12th Workshop on Workflows in Support of Large-Scale Science (WORKS) – SuperComputing’17, November 13, 2017
Funded by the US Department of Energy under Grant #DE-SC0012636
OUTLINE
Introduction: Next-generation Supercomputers, Data-intensive Workflows, In-transit / In-situ
Burst Buffers: Overview, Node-local / Remote-shared, NERSC BB
Model and Design: BB Reservations, Execution Directory, I/O Read Operations
Workflow Application: CyberShake, Workflow Implementation, I/O Performance (Darshan)
Experimental Results: Overall Write/Read Operations, I/O Performance per Process, Cumulative CPU Time, Rupture Files
Conclusion: Summary of Findings, Future Directions
A Brief Introduction and Contextualization
Traditionally, workflows have used the file system to communicate data between tasks.
To cope with increasing application demands on I/O operations, solutions targeting in situ and/or in transit processing have become mainstream approaches to attenuate I/O performance bottlenecks.
Next-Generation Exascale Supercomputers
Increased processing capabilities to over 10^18 Flop/s
Memory and disk capacity will also be significantly increased; power consumption management
I/O performance of the parallel file system (PFS) is not expected to improve much
Data-Intensive Scientific Workflows
While in situ is well adapted for computations that conform with the data distribution imposed by simulations, in transit processing targets applications where intensive data transfers are required.
1000 Genome Workflow: consumes/produces over 4.4 TB of data, and requires over 24 TB of memory across all tasks
CyberShake Workflow: consumes/produces over 700 GB of data
Improving I/O Performance with Burst Buffers
Burst buffers have emerged as a non-volatile storage solution that is positioned between the processors' memory and the PFS, buffering the large volume of data produced by the application at a higher rate than the PFS, while seamlessly draining the data to the PFS asynchronously.
In Transit Processing
Placement of the burst buffer nodes within the Cori system (NERSC)
"A burst buffer consists of the combination of rapidly accessed persistent memory with its own processing power (e.g., DRAM), and a block of symmetric multi-processor compute accessible through high-bandwidth links (e.g., PCI Express)."
Node-local vs Remote-shared
[Figure: burst buffer node architecture – Xeon processor / DRAM, Flash SSD cards over x8 PCIe, Aries interconnect over x16 PCIe, compute nodes (CN), storage servers, and the storage area network]
First Burst Buffers Use at Scale
Cori is a petascale HPC system and #6 on the June 2017 Top500 list
The NERSC BB is based on Cray DataWarp (Cray's implementation of the BB concept)
Cori System (NERSC)
Architectural overview of a burst-buffer node on Cori at NERSC
Each BB node contains a Xeon processor, 64 GB of DDR3 memory, and two 3.2 TB NAND flash SSD modules attached over two PCIe gen3 x8 interfaces, which is attached to a Cray Aries network interconnect over a PCIe gen3 x16 interface
~6.5 GB/s of sequential read and write bandwidth
Model and Design: Enabling BB in Workflow Systems
Automated BB Reservations: BB reservation operations (either persistent or scratch) consist of creation and release, as well as stage-in and stage-out operations
Transient reservations: stage-in/out operations need to be implemented at the beginning/end of each job execution
Execution Directory: Automated mapping between the workflow execution directory and the BB reservation
No changes to the application code are necessary, and the application job directly writes its output to the BB reservation
I/O Read Operations: Read operations from the BB should be transparent to the applications
Approaches: point the execution directory to the BB reservation, or create symbolic links to data endpoints into the BB (a sketch follows below)
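A minimal sketch, assuming a Cray DataWarp system such as Cori, of how these operations map onto batch directives; the capacity, paths, and file names below are illustrative and are not the ones used in the experiments:

  #!/bin/bash
  #SBATCH -N 4
  # Transient (scratch) reservation that lives only for the duration of this job
  #DW jobdw capacity=200GB access_mode=striped type=scratch
  # Stage inputs into the reservation before the job starts, and drain
  # outputs back to the PFS after it completes (stage-in / stage-out)
  #DW stage_in source=/global/cscratch1/sd/user/inputs destination=$DW_JOB_STRIPED/inputs type=directory
  #DW stage_out source=$DW_JOB_STRIPED/run destination=/global/cscratch1/sd/user/outputs type=directory

  # Transparent reads: either point the execution directory at the reservation...
  mkdir -p $DW_JOB_STRIPED/run && cd $DW_JOB_STRIPED/run
  # ...or create symbolic links from the PFS execution directory into the BB
  ln -s $DW_JOB_STRIPED/inputs/fx.sgt $SLURM_SUBMIT_DIR/fx.sgt
  srun ./my_app   # hypothetical application binary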
Workflow Application: CyberShake Workflow
CyberShake is high-performance computing software that uses 3D waveform modeling to calculate probabilistic seismic hazard analysis (PSHA) estimates for populated areas of California
Constructs and populates a 3D mesh of ~1.2 billion elements with seismic velocity data to compute Strain Green Tensors (SGTs)
Post-processing: SGTs are convolved with slip time histories for each of about 500,000 different earthquakes to generate synthetic seismograms for each event
CyberShake Workflow
CyberShake hazard map for Southern California, showing the spectral accelerations at a 2-second period exceeded with a probability of 2% in 50 years
We focus on the two CyberShake job types which together account for 97% of the compute time: the wave propagation code AWP-ODC-SGT, and the post-processing code DirectSynth
Burst Buffers: Workflow Implementation
The workflow is composed of two tightly-coupled parallel jobs (SGT_generator and direct_synth), and two system jobs (bb_setup and bb_delete)
Generates/consumes about 550 GB of data
Pegasus WMS
A general representation of the CyberShake test workflow
[Figure: CyberShake test workflow DAG with bb_setup, SGT_generator, multiple direct_synth jobs, and bb_delete; intermediate data files: fx.sgt, fx.sgtheader, fy.sgt, fy.sgtheader; outputs: seismogram, rotd, peakvals; control-flow and data-flow edges]
https://github.com/rafaelfsilva/bb-workflow
Compute job (uses the persistent reservation):
#SBATCH -p regular
#SBATCH -N 64
#SBATCH -C haswell
#SBATCH -t 05:00:00
#DW persistentdw name=csbb

bb_setup job (creates the persistent reservation):
#SBATCH -p debug
#SBATCH -N 1
#SBATCH -C haswell
#SBATCH -t 00:05:00
#BB create_persistent name=csbb capacity=700GB access=striped type=scratch

The bb_setup job goes through the regular queuing process
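A hedged sketch of the full persistent-reservation lifecycle on a DataWarp system: the destroy directive, the environment variable, and the sbatch dependency chain below follow standard SLURM/DataWarp usage and are not taken verbatim from the paper; job script names are illustrative.

  # Inside the compute job: DataWarp exposes the persistent reservation through an
  # environment variable named after the reservation (csbb here)
  RUNDIR=$DW_PERSISTENT_STRIPED_csbb/run0001   # execution directory mapped into the BB
  mkdir -p $RUNDIR && cd $RUNDIR

  # The bb_delete job only needs a directive that releases the reservation:
  #   #BB destroy_persistent name=csbb

  # The jobs can be chained with standard SLURM dependencies
  # (in the actual runs, Pegasus WMS manages this ordering):
  setup=$(sbatch --parsable bb_setup.sl)
  run=$(sbatch --parsable --dependency=afterok:$setup cybershake.sl)
  sbatch --dependency=afterany:$run bb_delete.sl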
Collecting I/O Performance Data with Darshan
HPC lightweight I/O profiling tool that captures an accurate picture of I/O behavior (including POSIX I/O, MPI-IO, and HDF5 I/O) in MPI applications
Darshan: HPC I/O Characterization Tool
Darshan is part of the default software stack on Cori
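A rough sketch of pulling I/O counters out of a Darshan log with the standard darshan-util command-line tools; the log file name below is illustrative:

  # Darshan instruments MPI applications automatically on Cori and writes a
  # compressed log per job; summarize the aggregate counters across all modules:
  darshan-parser --total direct_synth_run.darshan | grep -E 'BYTES|F_READ_TIME|F_WRITE_TIME'
  # Or generate a graphical summary report for the job:
  darshan-job-summary.pl direct_synth_run.darshan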
Experimental Results: Overall Write Operations
Average I/O performance estimate for write operations at the MPI-IO layer (left), and average I/O write performance gain (right) for the SGT_generator job
• Overall, write operations to the PFS (No-BB) have nearly constant I/O performance
• No-BB: ~900 MiB/s regardless of the number of nodes used
• Base values obtained for the BB executions (1 node, 32 cores) are over 4,600 MiB/s, and peak values scale up to ~8,200 MiB/s for 32 nodes (1,024 cores)
• Slight drop in the I/O performance (#nodes ≥ 64) due to the large number of concurrent write operations
Experimental Results: Overall Read Operations
I/O performance estimate for read operations at the MPI-IO layer (left), and average I/O read performance gain (right) for the direct_synth job
• I/O read operations from the PFS yield similar performance regardless of the number of nodes used: ~500 MiB/s
• BB: single-node performance of 4,000 MiB/s, peak values up to about 8,000 MiB/s
• Small drop in the performance for runs using 64 nodes or above – may indicate an I/O bottleneck when draining the data to/from the underlying parallel file system
Experimental Results: I/O Performance per Process
POSIX module data: Average time consumed in I/O read operations per process for the direct_synth job
• POSIX operations (left) represent buffering and synchronization operations with the system
• POSIX values are negligible when compared to the job's total runtime (~8h for 64 nodes)
• MPI-IO: the BB accelerates I/O read operations up to 10 times on average
• For larger configurations (≥32 nodes), the average time is nearly the same as when running with 16 nodes for the No-BB case
[Plots: time (seconds) spent in I/O read operations per process vs. number of nodes, for the fx.sgt and fy.sgt files, BB vs. No-BB – POSIX module data and MPI-IO module data]
MPI-IO module data: Average time consumed in I/O read operations per process for the direct_synth job
Ratio between the cumulative time spent in the user (utime) and kernel (stime) spaces for the direct_synth job, for different numbers of nodes
• Averaged values (for up to thousands of cores) may mask slower processes
In some cases, e.g., 64 nodes, the slowest time consumed in I/O read operations can slow down the application by up to 12 times the averaged value
• Ratio between the time spent in the user (utime) and kernel (stime) spaces – handling I/O-related interrupts, etc.
• Performance at 64 nodes is similar to 32 nodes, suggesting gains in application parallel efficiency would outweigh a slight I/O performance hit at 64 nodes and lead to decreased overall runtime
Experimental Results: Cumulative CPU Time
[Plot: cumulative CPU time usage (%) split into stime and utime, versus number of nodes (1–128), for BB and No-BB runs]
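As a side note, the utime/stime split shown above can be inspected for any running process from Linux accounting data; a minimal illustration (not necessarily the instrumentation used in this study) is:

  # Fields 14 (utime) and 15 (stime) of /proc/<pid>/stat are CPU times in clock ticks
  pid=$(pgrep -n direct_synth)   # hypothetical: pick one running direct_synth process
  awk '{printf "user: %d ticks, kernel: %d ticks, user share: %.1f%%\n", $14, $15, 100*$14/($14+$15)}' /proc/$pid/stat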
Ratio between the cumulative time spent in the user (utime) and kernel (stime) spaces for the direct_synth job for different numbers of rupture files (workflow runs with 64 nodes)
• A typical execution of the CyberShake workflow for a selected site in our experiment processes about 5,700 rupture files
• The processing of rupture files drives most of the CPU (user space) activity for the direct_synth job
• The use of a BB attenuates (by about 15%) the I/O processing time of the workflow jobs, for both read and write operations
Experimental Results: Rupture Files
[Plot: cumulative CPU time usage (%) split into stime and utime, versus number of rupture files (1–5,700), for BB and No-BB runs]
Major Findings
• I/O write performance was improved by a factor of 9, and I/O read performance by a factor of 15
• Performance decreased slightly at node counts above 64 (potential I/O ceiling)
• I/O performance must be balanced with parallel efficiency when using burst buffers with highly parallel applications
• I/O contention may limit the broad applicability of burst buffers for all workflow applications (e.g., in situ processing)
Conclusion and Future Work
What's Next?
• Solutions such as I/O-aware scheduling or in situ processing may also not fulfill all application requirements
We intend to investigate the use of combined in situ and in transit analysis
• Development of a production solution for the Pegasus workflow management system
ON THE USE OF BURST BUFFERS FOR ACCELERATING DATA-INTENSIVE SCIENTIFIC WORKFLOWS
Rafael Ferreira da Silva, Ph.D. – Research Assistant Professor, Computer Science Department; Computer Scientist, USC Information Sciences Institute
[email protected] – http://rafaelsilva.com
Thank You
Questions?
Funded by the US Department of Energy under Grant #DE-SC0012636