CyberShake Study 14.2 Technical Readiness Review

TRANSCRIPT

Page 1: CyberShake Study 14.2 Technical Readiness Review

CyberShake Study 14.2 Technical Readiness Review

Page 2: CyberShake Study 14.2 Technical Readiness Review

Study 14.2 Scientific Goals

• Compare the impact of velocity models on Los Angeles-area hazard maps
  • CVM-S4.26; BBP 1D; CVM-H 11.9, no GTL
  • Compare to CVM-S and to CVM-H 11.9 with GTL
• Investigate the impact of the GTL
  • Compare against a 1D reference model
  • Compare tomographic inversion results
• 286 sites (10 km mesh + points of interest)

Page 3: CyberShake Study 14.2 Technical Readiness Review

Study 14.2 Technical Goals

• Run both SGT and post-processing workflows on Blue Waters
• Plan to measure CyberShake application makespan (a measurement sketch follows this list)
  • Equivalent to the makespan of all of the workflows: (all jobs complete) – (first workflow submitted)
  • Includes hazard curve calculation time
  • Includes system downtime and workflow stoppages
• Will estimate time-to-solution by adding estimates of setup time and analysis time
• Compare performance, queue times, and results of GPU and CPU AWP-ODC-SGT
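
To make the makespan definition concrete, here is a minimal Python sketch, assuming each workflow log line begins with a `YYYY-MM-DD HH:MM:SS` timestamp; the log layout and file names are illustrative assumptions, not the actual CyberShake/Pegasus log format.

```python
# Minimal makespan sketch: (all jobs complete) - (first workflow submitted).
# ASSUMPTION: each log line starts with "YYYY-MM-DD HH:MM:SS"; real logs
# may differ.
from datetime import datetime
import glob

def parse_times(path):
    """Yield the timestamp at the start of each log line, skipping others."""
    with open(path) as f:
        for line in f:
            try:
                yield datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
            except ValueError:
                continue  # line does not start with a timestamp

def makespan(log_glob):
    """Earliest submission to latest completion across all workflow logs."""
    times = [t for path in glob.glob(log_glob) for t in parse_times(path)]
    return max(times) - min(times)

# Usage: makespan("workflow-logs/*.log") -> datetime.timedelta(...)
```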

Page 4: CyberShake Study 14.2 Technical Readiness Review

Performance Enhancements

• New version of the seismogram synthesis code to reduce read I/O
  • Reads in a set of extracted SGTs
  • Synthesizes multiple rupture variations (RVs) per invocation (5 in production)
• Reduce the number of sub-workflows from 8 to 6
  • Fewer jobs, less queuing time
• For CPU SGTs, increase the core count (a decomposition sketch follows this list)
  • Each processor gets a ~64 x 50 x 50 chunk of grid points
• For GPU SGTs, decrease the processor count
  • Volume must be a multiple of 20 in X and Y
  • 10 x 10 x 1 GPUs, regardless of volume
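
The processor-count rules above can be illustrated with a small sketch; the ~64 x 50 x 50 chunk size and the fixed 10 x 10 x 1 GPU grid come from this slide, while the function names and the exact rounding are our assumptions.

```python
# Sketch of the CPU/GPU decomposition rules above (illustrative helpers).
from math import ceil

def cpu_processor_grid(nx, ny, nz, chunk=(64, 50, 50)):
    """CPU SGTs: give each core a ~64 x 50 x 50 chunk of grid points."""
    return tuple(ceil(n / c) for n, c in zip((nx, ny, nz), chunk))

def gpu_volume(nx, ny, nz):
    """GPU SGTs: pad X and Y to multiples of 20; grid fixed at 10 x 10 x 1."""
    pad = lambda n: ceil(n / 20) * 20
    return (pad(nx), pad(ny), nz), (10, 10, 1)

# Usage: cpu_processor_grid(1400, 1400, 400) -> (22, 28, 8), i.e. 4,928 cores
```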

Page 5: CyberShake Study 14.2 Technical Readiness Review

Proposed Study Sites (286)

Page 6: CyberShake Study 14.2 Technical Readiness Review

Study 14.2 Data Products

• 2 CVM-S4.26 Los Angeles-area hazard models
• 1 BBP 1D Los Angeles-area hazard model
• 1 CVM-H 11.9 (no GTL) Los Angeles-area hazard model
• Hazard curves for 286 sites x 4 conditions, at 3 s, 5 s, and 10 s
• 1144 sets of 2-component SGTs
• Seismograms for all ruptures (~470M)
• Peak amplitudes in the DB for 3 s, 5 s, and 10 s

Page 7: CyberShake Study 14.2 Technical Readiness Review

Study 14.2 Notables

• First CVM-S4.26 hazard models
• First CVM-H (no GTL) hazard model
• First 1D hazard model
• First study using AWP-SGT-GPU
• First CyberShake study run as a single workflow on one system (Blue Waters)

Page 8: CyberShake Study 14.2 Technical Readiness Review

Study 14.2 Parameters

• 0.5 Hz, deterministic
• 200 m grid spacing
• CVMs: Vs min = 500 m/s
• UCERF 2
• Graves & Pitarka (2010) rupture variations
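
As a quick sanity check on how these parameters fit together (a standard finite-difference consideration; the points-per-minimum-wavelength framing is our addition, not stated on the slide):

```python
# Worked check: how 0.5 Hz, 200 m spacing, and Vs min = 500 m/s relate.
f_max = 0.5     # Hz, deterministic frequency limit
dx = 200.0      # m, grid spacing
vs_min = 500.0  # m/s, minimum S-wave velocity

lambda_min = vs_min / f_max   # shortest resolved wavelength: 1000 m
print(lambda_min / dx)        # 5.0 grid points per minimum wavelength
```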

Page 9: CyberShake Study 14.2 Technical Readiness Review

Verification

• 4 sites (USC, PAS, WNGC, SBSM)
  • AWP-SGT-CPU, CVM-S4.26
  • AWP-SGT-GPU, CVM-S4.26
  • AWP-SGT-CPU, BBP 1D
  • AWP-SGT-GPU, CVM-H 11.9, no GTL

• Plotted with previously calculated curves

Page 10: CyberShake Study 14.2 Technical Readiness Review

CVM-S4.26 (CPU)

Page 11: CyberShake Study 14.2 Technical Readiness Review

CVM-H, no GTL (CPU)

Page 12: CyberShake Study 14.2 Technical Readiness Review

Changes to SGT Software Stack

• Velocity mesh generation
  • Switched from 2 jobs (create, then merge) to 1 job
• SGTs (a wrapper sketch follows this list)
  • AWP-ODC-SGT CPU v14.2: has a wrapper because of an issue with getting the exit code back
  • AWP-ODC-SGT GPU v14.2: has a wrapper to read in the parameter file and construct command-line arguments
• NaN check
  • Always had a NaN check for RWG SGTs; now for AWP SGTs also
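
As an illustration of what such a wrapper does, here is a minimal sketch in the spirit of the GPU wrapper described above, assuming a simple `key = value` parameter file; the binary name, file format, and argument style are our assumptions, not the actual wrapper.

```python
#!/usr/bin/env python
# Hypothetical wrapper sketch: read a parameter file, build the command
# line, run the solver, and propagate its exit code back to the workflow.
import subprocess
import sys

def read_params(path):
    """Parse simple 'key = value' lines from a parameter file."""
    params = {}
    with open(path) as f:
        for line in f:
            if "=" in line:
                key, value = line.split("=", 1)
                params[key.strip()] = value.strip()
    return params

def main(param_file):
    params = read_params(param_file)
    # ASSUMPTION: illustrative binary name and --key=value argument style.
    args = ["./awp-odc-sgt-gpu"] + ["--%s=%s" % kv for kv in sorted(params.items())]
    sys.exit(subprocess.call(args))  # exit code visible to the workflow system

if __name__ == "__main__":
    main(sys.argv[1])
```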

Page 13: CyberShake Study 14.2 Technical Readiness Review

Changes to PP Software Stack

• Seismogram synthesis / PSA calculation (see the sketch after this list)
  • Modified to synthesize multiple seismograms per invocation
  • Will use 5 rupture variations per invocation
  • Reduces read I/O by a factor of 5
  • Needed to avoid congestion protection events
• All codes tagged in SVN before the study begins
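
Schematically, the batching change looks like the sketch below: one SGT read now serves several syntheses. The three helpers are stubs standing in for the real extraction, synthesis, and output code.

```python
# Schematic of batched synthesis: 1 read per 5 rupture variations (RVs).
def read_extracted_sgts(rupture):   # stub for the expensive SGT read
    return {"rupture": rupture}

def synthesize(sgts, rv):           # stub for the CPU-bound synthesis
    return (sgts["rupture"], rv)

def synthesize_batch(rupture, variations, batch=5):
    """Synthesize all RVs, reading the extracted SGTs once per batch."""
    reads = 0
    for i in range(0, len(variations), batch):
        sgts = read_extracted_sgts(rupture)
        reads += 1
        for rv in variations[i:i + batch]:
            synthesize(sgts, rv)
    return reads

# Usage: synthesize_batch("rup-123", list(range(40))) -> 8 reads, not 40
```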

Page 14: CyberShake Study 14.2 Technical Readiness Review

Changes to Workflows

• Changed workflow hierarchy
  • 1 integrated workflow per site, per velocity model
  • Added the ability to select the SGT core count dynamically
  • Put the volume creation job into the top-level workflow to reduce the hierarchy to 2 levels
  • Reduced the number of post-processing sub-workflows to 6 (fewer jobs in the queue)
• Will not keep job output if the job succeeds
  • Reduces the size of workflow logs

Page 15: CyberShake Study 14.2 Technical Readiness Review

Workflow Hierarchy

Integrated Workflow (1 per model per site) contains:
• PreCVM (creates volume)
• Generate SGT Workflow
• SGT Workflow
• PP Pre Workflow
• PP subwf 0, PP subwf 1, … PP subwf 5
• DB workflow

More details on next slide

Page 16: CyberShake Study 14.2 Technical Readiness Review

(Figure: detailed workflow hierarchy diagram)

Page 17: CyberShake Study 14.2 Technical Readiness Review

Distributed Processing

• Cron job on shock.usc.edu creates/plans/runs full workflows (a planning sketch follows this list)
  • Pegasus 4.4, from Git repository
  • Condor 8.0.3
  • Globus 5.0.4
• Jobs submitted to Blue Waters via GRAM
• Results staged back to shock, DB populated, curves generated
• Alternate CPU and GPU workflows for best queue performance
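
A rough sketch of what the cron-driven planning step might look like, using Pegasus's pegasus-plan command; the exact options, paths, and site names here are illustrative assumptions, not the actual cron job.

```python
# Hypothetical sketch of the cron job's plan-and-submit step.
import subprocess

def plan_and_submit(dax_path, site="bluewaters"):
    """Plan a CyberShake workflow with Pegasus and submit it via Condor."""
    cmd = [
        "pegasus-plan",
        "--dax", dax_path,          # abstract workflow for one site/model
        "--sites", site,            # execution site (reached via GRAM)
        "--output-site", "local",   # stage results back to shock
        "--submit",                 # hand the planned DAG to Condor
    ]
    return subprocess.call(cmd)

# The cron job could alternate CPU and GPU workflows between invocations
# for better queue behavior, as described above.
```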

Page 18: CyberShake Study 14.2 Technical Readiness Review

Computational Requirements

• Computational time: 275K node-hrs (a worked check follows this list)
  • SGT computational time: 180K node-hrs
    • CPU: 150 node-hrs/site x 286 sites x 2 models = 86K node-hrs (XE, 32 cores/node)
    • GPU: 90 node-hrs/site x 286 sites x 2 models = 52K node-hrs (XK)
    • Study 13.4 had a 29% overrun on SGTs
  • PP computational time: 95K node-hrs
    • 60 node-hrs/site x 286 sites x 4 models = 70K node-hrs (XE, 32 cores/node)
    • Study 13.4 had a 35% overrun on PP
• Current allocation has 3.0M node-hrs remaining
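
A quick arithmetic check of these figures (all inputs from the slide; the Study 13.4 overrun percentages are applied to the raw estimates):

```python
# Worked check of the node-hour budget (numbers from the slide).
sites = 286
cpu_sgt = 150 * sites * 2               # 85,800 -> ~86K node-hrs (XE)
gpu_sgt = 90 * sites * 2                # 51,480 -> ~52K node-hrs (XK)
sgt_total = (cpu_sgt + gpu_sgt) * 1.29  # +29% overrun -> ~177K, quoted as 180K
pp_total = 60 * sites * 4 * 1.35        # 68,640 +35% -> ~93K, quoted as 95K
print(round(sgt_total + pp_total))      # ~270K, in line with the 275K estimate
```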

Page 19: CyberShake Study 14.2 Technical Readiness Review

Blue Waters Storage Requirements

• Planned unpurged disk usage: 45 TB
  • SGTs: 40 GB/site x 286 sites x 4 models = 45 TB, archived on Blue Waters
• Planned purged disk usage: 783 TB (a worked check follows this list)
  • Seismograms: 11 GB/site x 286 sites x 4 models = 12.3 TB, staged back to SCEC
  • PSA files: 0.2 GB/site x 286 sites x 4 models = 0.2 TB, staged back to SCEC
  • Temporary: 690 GB/site x 286 sites x 4 models = 771 TB
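
These figures check out if TB is read as 1024 GB (our assumption; it is what makes the slide's totals come out):

```python
# Worked check of the Blue Waters disk figures (GB/site values from the slide).
sites, models = 286, 4
TB = 1024.0  # GB per TB (assumed binary units)

sgts = 40 * sites * models / TB    # 44.7 TB  -> quoted as 45 TB, unpurged
seis = 11 * sites * models / TB    # 12.3 TB  -> staged back to SCEC
psa = 0.2 * sites * models / TB    # 0.22 TB  -> staged back to SCEC
temp = 690 * sites * models / TB   # 770.9 TB -> quoted as 771 TB, temporary
print(round(seis + psa + temp))    # 783 TB planned purged usage
```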

Page 20: CyberShake Study 14.2 Technical Readiness Review

SCEC Storage Requirements

• Planned archival disk usage: 12.5 TB
  • Seismograms: 12.3 TB (scec-04 has 19 TB)
  • PSA files: 0.2 TB (scec-04)
  • Curves, disaggregations, reports: 93 GB (99% reports)
• Planned database usage: 210 GB (a worked check follows this list)
  • 3 rows/rupture variation x 410K rupture variations/site x 286 sites x 4 models = 1.4B rows
  • 1.4B rows x 151 bytes/row = 210 GB (880 GB free)
• Planned temporary disk usage: 5.5 TB
  • Workflow logs: 5.5 TB; possibly smaller, since not all output is saved anymore (scec-02 has 12 TB free)
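
The database sizing also checks out arithmetically (decimal GB assumed here):

```python
# Worked check of the database row count and size (inputs from the slide).
rows = 3 * 410000 * 286 * 4   # rows/RV x RVs/site x sites x models
print(rows)                   # 1,407,120,000 -> ~1.4B rows
print(rows * 151 / 1e9)       # ~212 GB at 151 bytes/row, quoted as 210 GB
```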

Page 21: CyberShake Study 14.2 Technical Readiness Review

Metrics Gathering

• Monitord for workflow metrics
  • Will run after workflows have completed
• Python scripts
  • Used to obtain some of the standard CyberShake metrics for comparison
• Cron job on Blue Waters (a sampler sketch follows this list)
  • Core usage over time
  • Jobs running and idle counts
• Will use the start and end of the workflow logs to perform the makespan measurement
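
The cron-driven sampling could look something like the sketch below; the `qstat -u` parsing is a placeholder for however the real Blue Waters script queries the scheduler.

```python
# Hypothetical sketch of a cron-driven job-count sampler.
import subprocess
import time

def sample(user):
    """Count running ('R') and queued ('Q') jobs for one user via qstat.
    ASSUMPTION: PBS-style `qstat -u` output with the job state in the
    second-to-last column of each job line."""
    out = subprocess.check_output(["qstat", "-u", user]).decode()
    states = [line.split()[-2] for line in out.splitlines()
              if line[:1].isdigit()]  # job lines start with a numeric job ID
    return states.count("R"), states.count("Q")

def log_sample(user, path="job_counts.csv"):
    """Append one timestamped running/idle sample to a CSV file."""
    running, idle = sample(user)
    with open(path, "a") as f:
        f.write("%d,%d,%d\n" % (time.time(), running, idle))
```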

Page 22: CyberShake Study 14.2 Technical Readiness Review

Estimated Duration

• Limiting factors:
  • Queue time
    • Especially for XK nodes; could be a substantial percentage of run time
  • Blue Waters -> SCEC transfer
    • If Blue Waters throughput is very high, the transfer could be the bottleneck
• With queues, estimated completion is 4 weeks (a worked check follows this list)
  • 1 hazard map/week
  • Requires an average of 410 nodes
  • Study 13.4 averaged 603 nodes
• With a reservation, completion depends on the reservation size
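
The 410-node average is consistent with the total budget from the computational requirements slide:

```python
# Back-of-the-envelope check of the 410-node average.
node_hours = 275000                   # total estimate from the budget slide
hours_in_4_weeks = 4 * 7 * 24         # 672 wall-clock hours
print(node_hours / hours_in_4_weeks)  # ~409 nodes averaged over 4 weeks
```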

Page 23: CyberShake Study 14.2 Technical Readiness Review

Personnel Support

• Scientists
  • Tom Jordan, Kim Olsen, Rob Graves
• Technical Lead
  • Scott Callaghan
• SGT Code Support
  • Efecan Poyraz, Yifeng Cui
• Job Submission / Run Monitoring
  • Scott Callaghan, David Gill, Heming Xu, Phil Maechling
• NCSA Support
  • Omar Padron, Tim Bouvet
• Workflow Support
  • Karan Vahi, Gideon Juve

Page 24: CyberShake Study 14.2 Technical Readiness Review

Risks

• Queue times on Blue Waters
  • In tests, GPU queue times have at times been > 1 day
• Congestion protection events
  • If triggered consistently, we will either need to throttle post-processing or suspend the run until improvements are developed

Page 25: CyberShake Study 14.2 Technical Readiness Review

Thanks for your time!