Automating Real-time Seismic Analysis Through Streaming and High Throughput Workflows Rafael Ferreira da Silva, Ph.D. http://pegasus.isi.edu


Page 1: Automating Real-time Seismic Analysis Through Streaming and

Automating Real-time Seismic Analysis Through Streaming and High Throughput Workflows

Rafael Ferreira da Silva, Ph.D.

http://pegasus.isi.edu

Page 2: Automating Real-time Seismic Analysis Through Streaming and


Do we need seismic analysis?

Page 3: Automating Real-time Seismic Analysis Through Streaming and


USArray: a continental-scale seismic observatory

USArray TA (IRIS Service): 836 stations

394 stations have online data available

http://ds.iris.edu/ds/nodes/dmc/earthscope/usarray/_US-TA-operational/

Page 4: Automating Real-time Seismic Analysis Through Streaming and


The development of reliable risk assessment methods for seismic hazards requires real-time analysis of seismic data

Page 5: Automating Real-time Seismic Analysis Through Streaming and


So, how do we efficiently process these data?

Page 6: Automating Real-time Seismic Analysis Through Streaming and


Experiment Timeline:

Scientific Problem: Earth Science, Astronomy, Neuroinformatics, Bioinformatics, etc.

Analytical Solution

Computational Scripts: shell scripts, Python, Matlab, etc.

Automation: workflows, MapReduce, etc.; Distributed Computing: clusters, HPC, Cloud, Grid, etc.

Monitoring and Debug: fault tolerance, provenance, etc.

Scientific Result: models, quality control, image analysis, etc.

Page 7: Automating Real-time Seismic Analysis Through Streaming and


What is involved in an experiment execution?

Page 8: Automating Real-time Seismic Analysis Through Streaming and


Automate

Recover

Debug

Why Scientific Workflows?

Automates complex, multi-stage processing pipelines

Enables parallel, distributed computations

Automatically executes data transfers

Reusable, aids reproducibility

Records how data was produced (provenance)

Handles failures to provide reliability

Keeps track of data and files

Page 9: Automating Real-time Seismic Analysis Through Streaming and


Taking a closer look into a workflow…

job: command-line programs

dependency: usually data dependencies

split, merge, pipeline

DAG: directed acyclic graph

abstract workflow

executable workflow

storage constraints

optimizations

Page 10: Automating Real-time Seismic Analysis Through Streaming and


From the abstraction to execution!

stage-in job: transfers the workflow input data

stage-out job: transfers the workflow output data

registration job: registers the workflow output data
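As a sketch of how this surfaces in the workflow description (using the DAX API shown later in the deck; the job and file names here are illustrative, not from the talk), the transfer and register flags on an output file tell the planner whether to add stage-out and registration jobs for it:

from Pegasus.DAX3 import ADAG, Job, File, Link

dax = ADAG("staging_example")
result = File("xcorr.dat")

job = Job(name="cross_correlate")
job.addArguments("-o", "xcorr.dat")
# transfer=True asks the planner to add a stage-out job for this file;
# register=True asks for a registration job (a replica catalog entry).
job.uses(result, link=Link.OUTPUT, transfer=True, register=True)
dax.addJob(job)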

abstract workflow

executable workflow

storage constraints

optimizations

Page 11: Automating Real-time Seismic Analysis Through Streaming and


Optimizing storage usage…

cleanup job: removes unused data

abstract workflow

executable workflow

storage constraints

optimizations

Page 12: Automating Real-time Seismic Analysis Through Streaming and


Workflow systems provide tools to generate the abstract workflow

# Abstract workflow (DAX): one preprocessing job feeding five simulation jobs.
from Pegasus.DAX3 import ADAG, Job, File, Link

dax = ADAG("test_dax")

firstJob = Job(name="first_job")
firstInputFile = File("input.txt")
firstOutputFile = File("tmp.txt")
firstJob.addArguments("input=input.txt", "output=tmp.txt")
firstJob.uses(firstInputFile, link=Link.INPUT)
firstJob.uses(firstOutputFile, link=Link.OUTPUT)
dax.addJob(firstJob)

for i in range(0, 5):
    simulJob = Job(id="%s" % (i + 1), name="simul_job")
    simulInputFile = File("tmp.txt")
    simulOutputFile = File("output.%d.dat" % i)
    simulJob.addArguments("parameter=%d" % i, "input=tmp.txt",
                          "output=%s" % simulOutputFile.name)
    simulJob.uses(simulInputFile, link=Link.INPUT)
    simulJob.uses(simulOutputFile, link=Link.OUTPUT)
    dax.addJob(simulJob)
    dax.depends(parent=firstJob, child=simulJob)

fp = open("test.dax", "w")
dax.writeXML(fp)
fp.close()

abstract workflow

Page 13: Automating Real-time Seismic Analysis Through Streaming and

Which Workflow Management System?


Pegasus, Nimrod

Page 14: Automating Real-time Seismic Analysis Through Streaming and


…and which model to use?

task-oriented: files; parallelism (no concurrency); data transfers via files; heterogeneous execution (tasks can run on heterogeneous resources)

stream-based: streams; concurrency (tasks run concurrently); data transfers via memory or messages; homogeneous execution (tasks should run on homogeneous resources)
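A rough contrast of the two models in plain Python (illustrative only): the task-oriented step hands its result to the next step through a file, while stream-based stages are wired together and pass items through memory:

# Task-oriented: each step is a separate program; data moves via files.
def preprocess_to_file(in_path, out_path):
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(line.strip().lower() + "\n")

# Stream-based: stages are connected directly and pass items in memory.
def read_stream(path):
    with open(path) as src:
        for line in src:
            yield line

def preprocess_stream(lines):
    for line in lines:
        yield line.strip().lower()

# Usage sketch:
#   preprocess_to_file("traces.txt", "traces.pre.txt")          # file hand-off
#   for item in preprocess_stream(read_stream("traces.txt")):   # in-memory hand-off
#       ...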

Page 15: Automating Real-time Seismic Analysis Through Streaming and


What does Pegasus provide?

Automation: automates pipeline executions; parallel, distributed computations; automatically executes data transfers; heterogeneous resources; task-oriented model; the application is seen as a black box

Debug: workflow execution and job performance metrics; a set of debugging tools to unveil issues; real-time monitoring, graphs, provenance

Optimization: job clustering; data cleanup

Recovery: job failure detection; checkpoint files; job retry; rescue DAGs

Page 16: Automating Real-time Seismic Analysis Through Streaming and


…and dispel4py?

Automation: automates pipeline executions; concurrent, distributed computations; stream-based model

Workflow composition: Python library; grouping (all-to-all, all-to-one, one-to-all)

Optimization: multiple streams (in/out); avoids I/O (shared memory or message passing)

Mappings: sequential; multiprocessing (shared memory); distributed memory, message passing (MPI); distributed real-time (Apache Storm); Apache Spark (prototype)
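A minimal dispel4py-style sketch of how such a stream workflow is composed in Python (the PE names and the processing they do are illustrative, and the ProducerPE/IterativePE usage follows dispel4py's documented pattern, so treat the details as assumptions):

from dispel4py.base import ProducerPE, IterativePE
from dispel4py.workflow_graph import WorkflowGraph

class StationReader(ProducerPE):
    # Emits (illustrative) station identifiers onto its output stream.
    def __init__(self):
        ProducerPE.__init__(self)

    def _process(self, inputs):
        for station in ("ANMO", "TA01", "TA02"):
            self.write(ProducerPE.OUTPUT_NAME, station)

class Preprocess(IterativePE):
    # A real PE would detrend/whiten a trace here; this one passes data through.
    def __init__(self):
        IterativePE.__init__(self)

    def _process(self, data):
        return data  # returned values go to the default output stream

graph = WorkflowGraph()
graph.connect(StationReader(), "output", Preprocess(), "input")
# The same graph can then be run under the sequential, multiprocessing,
# MPI, or Storm mappings via the dispel4py launcher.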

Page 17: Automating Real-time Seismic Analysis Through Streaming and


Asterism greatly simplifies the effort required to develop data-intensive applications that run across multiple heterogeneous resources distributed in the wide area.

ASTERISM: a Pegasus workflow in which each sub-workflow is a dispel4py (stream-based) workflow
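Conceptually (a sketch only, reusing the DAX API from the earlier example; the "dispel4py" executable name and its arguments are placeholders, not Asterism's actual wrapper), each dispel4py sub-workflow appears to Pegasus as a job whose inputs and outputs Pegasus stages between sites:

from Pegasus.DAX3 import ADAG, Job, File, Link

dax = ADAG("asterism_like")

traces = File("traces.tar.gz")         # staged in by Pegasus
xcorr = File("xcorr_results.tar.gz")   # staged out by Pegasus

d4p = Job(name="dispel4py")            # placeholder executable name
d4p.addArguments("mpi", "xcorr_workflow.py")  # run the stream workflow under an MPI mapping
d4p.uses(traces, link=Link.INPUT)
d4p.uses(xcorr, link=Link.OUTPUT)
dax.addJob(d4p)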

Page 18: Automating Real-time Seismic Analysis Through Streaming and


Where to run scientific workflows?

Page 19: Automating Real-time Seismic Analysis Through Streaming and

High Performance Computing


There are several possible configurations… Typical of most HPC sites: the Workflow Engine runs on the submit host (e.g., the user's laptop) and the compute site provides a shared filesystem.

Page 20: Automating Real-time Seismic Analysis Through Streaming and

Cloud Computing


Typical cloud computing deployment: the Workflow Engine runs on the submit host (e.g., the user's laptop) and data is staged through highly scalable object storage (e.g., Amazon S3, Google Cloud Storage).

Page 21: Automating Real-time Seismic Analysis Through Streaming and


Grid Computing

Typical of Open Science Grid (OSG) sites: the Workflow Engine runs on the submit host (e.g., the user's laptop) and each site handles local data management.

Page 22: Automating Real-time Seismic Analysis Through Streaming and


And yes… you can mix everything! The Workflow Engine on the submit host (e.g., the user's laptop) can drive compute site A and compute site B in a single workflow, combining shared filesystems and object storage.

Page 23: Automating Real-time Seismic Analysis Through Streaming and


How do we use Asterism to automate seismic analysis?

Page 24: Automating Real-time Seismic Analysis Through Streaming and

WORKFLOW


Phase 1 (pre-process)

Phase 2 (cross-correlation)

Seismic Ambient Noise Cross-Correlation

Preprocesses and cross-correlates traces (sequences of measurements of acceleration in three dimensions) from multiple seismic stations (IRIS database)

Phase 1: data preparation using statistics for extracting information from the noise

Phase 2: computes correlations, identifying the time for signals to travel between stations; this infers properties of the intervening rock
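As a rough numerical illustration of phase 2 (plain NumPy, not the project's code): the lag at which the cross-correlation of two station traces peaks estimates the travel time of signals between the stations:

import numpy as np

def travel_time(trace_a, trace_b, sampling_rate_hz):
    # Remove the mean and normalize so amplitudes are comparable.
    a = (trace_a - trace_a.mean()) / trace_a.std()
    b = (trace_b - trace_b.mean()) / trace_b.std()
    xcorr = np.correlate(a, b, mode="full")
    lag = int(xcorr.argmax()) - (len(b) - 1)  # peak lag in samples
    return lag / sampling_rate_hz             # negative: trace_b is delayed w.r.t. trace_a

# Synthetic check: b is a copy of a delayed by 50 samples at 20 Hz.
rng = np.random.default_rng(0)
a = rng.standard_normal(2000)
b = np.roll(a, 50)
print(travel_time(a, b, sampling_rate_hz=20.0))  # -> -2.5 (seconds)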

Page 25: Automating Real-time Seismic Analysis Through Streaming and

Seismic Ambient Noise Cross-Correlation


Distributed computation framework for event stream processing

Designed for massive scalability, supports fault-tolerance with a “fail fast, auto restart” approach to processes

Rich array of available spouts specialized for receiving data from all types of sources

The "Hadoop of real-time processing", very scalable. Storm topologies are built from spouts (data sources) and bolts (processing steps).

Page 26: Automating Real-time Seismic Analysis Through Streaming and


Seismic workflow execution

Pegasus orchestrates the execution from the submit host (e.g., the user's laptop); data transfers between sites are performed by Pegasus.

Input data (~150 MB) is retrieved from the IRIS database (stations). Phase 1 runs on compute site A (MPI-based) and Phase 2 on compute site B (Apache Storm). The output data totals ~40 GB.

Page 27: Automating Real-time Seismic Analysis Through Streaming and

WORKFLOW


Southern California Earthquake Center’s CyberShake

Builders ask seismologists: What will the peak ground motion be at my new building in the next 50 years?

Seismologists answer this question using Probabilistic Seismic Hazard Analysis (PSHA)

286 sites, 4 models; each workflow has 420,000 tasks

Page 28: Automating Real-time Seismic Analysis Through Streaming and


A few more workflow features…

Page 29: Automating Real-time Seismic Analysis Through Streaming and


Performance, why not improve it?

clustered job: groups small jobs together to improve performance

task: small granularity
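A sketch of how clustering can be requested per job with a Pegasus profile in the DAX (the pegasus profile key clusters.size controls horizontal clustering; the job shown and the value 20 are illustrative):

from Pegasus.DAX3 import ADAG, Job, File, Link, Profile, Namespace

dax = ADAG("clustering_example")
for i in range(100):
    j = Job(name="preprocess_trace")             # many small, short-running tasks
    j.addArguments("trace.%03d.sac" % i)
    j.uses(File("trace.%03d.sac" % i), link=Link.INPUT)
    # Group these fine-grained tasks into clustered jobs of 20 tasks each.
    j.addProfile(Profile(Namespace.PEGASUS, "clusters.size", "20"))
    dax.addJob(j)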

workflow restructuring

workflow reduction

pegasus-mpi-cluster

hierarchical workflows

Page 30: Automating Real-time Seismic Analysis Through Streaming and


What about data reuse?

data already available: jobs whose output data is already available are pruned from the DAG (data reuse)

workflow restructuring

workflow reduction

pegasus-mpi-cluster

hierarchical workflows

workflow reduction: when data further along the DAG is also available, those jobs are pruned as well (data reuse)
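A toy version of the pruning idea in plain Python (a dict-based DAG; this is not Pegasus's actual data-reuse algorithm):

def prune_for_reuse(jobs, available_outputs):
    # jobs: {job_id: {"inputs": set of files, "outputs": set of files}}
    kept = {}
    for job_id, job in jobs.items():
        if not job["outputs"] <= available_outputs:
            kept[job_id] = job  # keep: at least one output still has to be produced
    return kept

jobs = {
    "preprocess": {"inputs": {"raw.mseed"}, "outputs": {"clean.mseed"}},
    "xcorr":      {"inputs": {"clean.mseed"}, "outputs": {"xcorr.dat"}},
}
# clean.mseed was produced by an earlier run and is already registered,
# so the preprocess job is pruned and xcorr reads the existing file instead.
print(prune_for_reuse(jobs, available_outputs={"clean.mseed"}))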

Page 31: Automating Real-time Seismic Analysis Through Streaming and


Handling large-scale workflows

pegasus-mpi-cluster

sub-workflows: the recursion ends when a workflow with only compute jobs is encountered
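A sketch of the nesting with the same DAX API (this assumes DAX3's sub-DAX job type; the file names and planner arguments are illustrative):

from Pegasus.DAX3 import ADAG, DAX, File, Link

outer = ADAG("outer")

# The inner workflow description is itself an input file of the outer workflow.
inner_dax_file = File("inner.dax")
outer.addFile(inner_dax_file)

# A sub-DAX job: Pegasus plans and runs "inner.dax" as a nested sub-workflow.
inner = DAX("inner.dax")
inner.addArguments("--output-site", "local")   # planner options for the sub-workflow
inner.uses(inner_dax_file, link=Link.INPUT)
outer.addDAX(inner)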

workflow restructuring

workflow reduction

hierarchical workflows

Page 32: Automating Real-time Seismic Analysis Through Streaming and


Running fine-grained workflows on HPC systems…

pegasus-mpi-cluster

The workflow is wrapped as an MPI job: sub-graphs of a Pegasus workflow are submitted from the submit host (e.g., the user's laptop) as monolithic jobs to remote HPC resources.

workflow restructuring

workflow reduction

hierarchical workflows

Page 33: Automating Real-time Seismic Analysis Through Streaming and

Pegasus: automate, recover, and debug scientific computations.

Get Started

Pegasus website: http://pegasus.isi.edu

[email protected]

[email protected]

HipChat

Page 34: Automating Real-time Seismic Analysis Through Streaming and

Thank You

Questions?

Rafael Ferreira da Silva, Ph.D. ([email protected])

Karan Vahi, Rafael Ferreira da Silva, Rajiv Mayani, Mats Rynge, Ewa Deelman

Automating Real-time Seismic Analysis Through Streaming and High Throughput Workflows