Real-Time Storm Surge Modeling in a Grid Environment
Howard M. Lander, [email protected]
TRANSCRIPT
Funded by NOAA & ONR
Bedford Institute of Oceanography
Virginia Institute of Marine Science
University of Alabama, Huntsville
Texas A&M Research Foundation
Renaissance Computing Institute
2005/2006 SCOOP Implementation Team
University of North Carolina
University of Florida
Louisiana State University
Gulf of Maine Ocean Observing System
MCNC
Southeastern Universities Research Association
External Resources, e.g. SURAgrid regional grid infrastructure, www.sura.org/suragrid
SCOOP: A Distributed Laboratory
Credits: SCOOP Team
Acknowledgements
• Funding
– “SURA Coastal Ocean Observing and Prediction (SCOOP) Program”, Office of Naval Research, Award N00014-04-1-0721, National Oceanic and Atmospheric Administration’s NOAA Ocean Service, Award NA04NOS4730254.
• SCOOP Partners
– Philip Bogden (SURA and GoMOOS); Will Perrie, Bash Toulany (BIO); Charlton Purvis, Eric Bridger (GoMOOS); Greg Stone, Gabrielle Allen, Jon MacLaren, Bret Estrada, Chirag Dekate (LSU, Center for Computation and Technology); Gerald Creager, Larry Flournoy, Wei Zhao, Donna Cote and Matt Howard (TAMU); Sara Graves, Helen Conover, Ken Keiser, Matt Smith, and Marilyn Drewry (UAH); Peter Sheng, Justin Davis, Renato Figueiredo, and Vladimir Paramygin (UFL); Harry Wang, Jian Shen and David Forrest (VIMS); Hans Graber, Neil Williams and Geoff Samuels (UMiami); and Mary Fran Yafchak, Kate Barzee, Don Riley, Don Wright and Joanne Bintz (SURA), Rick Luettich (UNC-CH), Brian Blanton (SAIC), Dan Reed, Alan Blatecky, Lavanya Ramakrishnan, Gopi Kandaswamy, Ken Galluppi (RENCI), Steve Thorpe (MCNC)
• SCOOP and SURAGrid resource partners and system administrators
– Steven Johnson (TAMU), Renato J. Figueiredo (UFL), Michael McEniry (UAH), Ian Chang-Yen (ULL), and Brad Viviano (RENCI), for providing valuable system administrator support
Outline
• Motivation
• Demo Scenario
• Grid Technologies
– Grid Architecture
– Resource Selection
• Portlet Tour
Motivation: Disaster Response
• An example close to home: North Carolina
– most disasters are weather driven
• floods, winds, and ice
• Inadequate information
– based on national and regional information
• High resolution model forecasting for local events
– improves planning and preparation
– shortens response and recovery time
Credits: Ken Galluppi
Integrated Response System
• Hurricane Season 2005
– 26 named storms, 14 hurricanes, 3 with major impact
– billions of dollars in economic losses
• SURA Coastal Ocean Observing and Prediction (SCOOP) Program
– provide early and accurate forecasts, dissemination of information
– interact in real time, i.e. evaluate and adapt
– provide infrastructure to solve interdisciplinary problems
Today …
ADCIRC: Storm Surge Modeling
• Advanced Circulation Model (ADCIRC)
– Finite Element Hydrodynamic Model for Coastal Oceans, Inlets, Rivers and Floodplains
• Scenarios
– Daily operational 24/7/365 forecasts
– Real-time ensemble model prediction
– Retrospective analysis
• Assembling meteorological and other data sets for input
– Multiple sources: U. Florida, NCEP, TAMU
• Hot-starting the model
– NCEP 6 hour operational cycle
– previous data is used to jumpstart the model run (see the sketch below)
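A minimal sketch of the hot-start decision, assuming a hypothetical archive layout and file naming: reuse the state written by the previous 6-hour NCEP cycle when it exists, otherwise fall back to a spin-up run.

```python
import os
from datetime import datetime, timedelta
from typing import Optional

CYCLE_HOURS = 6                              # NCEP operational cycle
HOTSTART_DIR = "/archive/adcirc/hotstart"    # hypothetical archive path

def find_hotstart(cycle_time: datetime) -> Optional[str]:
    """Return the hotstart file written by the previous cycle, if it exists."""
    previous = cycle_time - timedelta(hours=CYCLE_HOURS)
    candidate = os.path.join(HOTSTART_DIR, previous.strftime("%Y%m%d%H") + ".hot")
    return candidate if os.path.exists(candidate) else None

def run_spinup(cycle_time: datetime) -> str:
    """Placeholder for a cold-start spin-up run that produces a hotstart file."""
    raise NotImplementedError("spin-up run goes here")

def prepare_initial_state(cycle_time: datetime) -> str:
    """Jump-start the run from the previous cycle's output when possible."""
    return find_hotstart(cycle_time) or run_spinup(cycle_time)
```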
Demo Scenario
• Multiple model runs
– An ensemble of 11 input files for a single time period
– Plan is to go to 46 members for this year: we need help!
– Each member of the ensemble represents a distinct forecast track for the storm
– Multiple model runs for each ensemble member (see the dispatch sketch after this list)
• Data from Hurricane Katrina, August 2005
– Generated on demand at the University of Florida for the demo
– Ordinarily generated in response to storm activity in the Atlantic basin
• Portal tracks activity and status in the demo
– Status of compute resources
– Status of input and output data
– Status of model runs
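A toy sketch of fanning out the 11-member ensemble for one forecast cycle; the member naming scheme and the submission stub are hypothetical, and a real run would submit each member through the grid machinery described later.

```python
from concurrent.futures import ThreadPoolExecutor

ENSEMBLE_SIZE = 11   # one wind-field input file per member for a single time period

def submit_member_run(member_id: int, cycle: str) -> str:
    """Hypothetical stand-in for packaging and submitting one ADCIRC run."""
    wind_file = f"ens_{member_id:02d}_{cycle}.win"   # hypothetical naming scheme
    print(f"submitting run for {wind_file}")
    return f"job-{member_id}"

def dispatch_ensemble(cycle: str) -> list:
    """Launch one model run per ensemble member and collect job handles."""
    with ThreadPoolExecutor(max_workers=ENSEMBLE_SIZE) as pool:
        jobs = pool.map(lambda m: submit_member_run(m, cycle), range(ENSEMBLE_SIZE))
        return list(jobs)

if __name__ == "__main__":
    dispatch_ensemble("2005082900")   # hypothetical cycle id for the Katrina demo period
```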
Grid Architecture
[Workflow diagram spanning the Portal, compute sites A, B, and C, and supporting services. Wind data (NAM, NAH, UF-WANA forecast or hindcast fields) arrives via LDM, or a user initiates a model run from the Portal. Initialization files are fetched from the archive, or the model is run to generate a hotstart file. Resource Selection queries site status to answer "what is the best resource?"; Package Preparation builds the package for that resource; the Application Coordinator moves the package and initiates the run. When the job finishes, output files are moved back and pushed out. Resource Monitoring, model status, and resource status flow through the WS-Messenger Broker to the Portal and a MySQL database, with results shown on a visualization wall.]
Technology Exposition
• Grid technologies (Globus; a driver sketch follows below)
– standard job submission: Gatekeeper: used to dispatch and monitor jobs
– file transfer: GridFTP: used to move the prepared package to the resource and to retrieve results from the resource
– queue status: Information Services/MDS: used as an input to the resource selection algorithm and displayed in a portlet
– credential repository: MyProxy: required for job submission
• Domain products
– Local Data Manager (LDM): event-driven data transport system: used to receive input files and trigger model runs, as well as to insert results
– OPeNDAP: format-independent network data access protocol
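A hedged sketch of how these Globus pieces are typically driven from a script, assuming the pre-WS (GT2-style) command-line tools; the host names, package name, and paths are placeholders, not SCOOP values, and the exact flags can differ between Globus Toolkit versions.

```python
import subprocess

SITE = "gridnode.example.edu"        # placeholder compute resource
PACKAGE = "adcirc_package.sh"        # self-extracting package built earlier

def stage_and_submit() -> None:
    # MyProxy credential repository: obtain a short-lived credential for job submission.
    subprocess.run(["myproxy-logon", "-s", "myproxy.example.org", "-l", "scoop"],
                   check=True)
    # GridFTP: move the prepared package to the selected resource.
    subprocess.run(["globus-url-copy",
                    f"file:///tmp/{PACKAGE}",
                    f"gsiftp://{SITE}/scratch/{PACKAGE}"],
                   check=True)
    # Gatekeeper (GRAM): dispatch the job; status is checked later with globus-job-status.
    subprocess.run(["globus-job-submit", SITE, f"/scratch/{PACKAGE}"], check=True)

if __name__ == "__main__":
    stage_and_submit()
```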
Technology Exposition (2)
• Portal Technologies
– NSF NMI Open Grid Computing Environment (OGCE): used to host the portlets
• Eventing
– LEAD WS-Messenger: enables data communication among pieces of the system. Example: the application coordinator sends status information through WS-Messenger
• Web Services
– Used to send job and resource status information from a MySQL database to the portlets. Also used to track flows of data files in the system
• MySQL
– Open source relational database used to store job and resource status information for display and analyses (a status-query sketch follows below)
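A minimal sketch of the kind of lookup the status web services perform against MySQL before handing rows to the portlets; the schema (a job_status table and its columns) is an assumption, and pymysql merely stands in for whichever connector the services actually use.

```python
import pymysql   # stand-in MySQL connector; the real services may use another driver

def fetch_job_status(run_id: str) -> list:
    """Return (job_id, resource, state, updated) rows for one model run."""
    conn = pymysql.connect(host="db.example.org", user="scoop",
                           password="secret", database="scoop_status")
    try:
        with conn.cursor() as cur:
            # Hypothetical schema: a job_status table keyed by run_id.
            cur.execute(
                "SELECT job_id, resource, state, updated "
                "FROM job_status WHERE run_id = %s ORDER BY updated",
                (run_id,))
            return list(cur.fetchall())
    finally:
        conn.close()
```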
Application Coordinator
• Data Management
– real-time data movement: LDM, GridFTP
– previously generated files: SCOOP Catalog [UAH] and archive [TAMU, LSU]
• Application Preparation (see the packaging sketch below)
– conversion of data formats
– self-extracting archive containing the binary
– identify and retrieve or generate appropriate hotstart files
• Extensible
– model parameters, template scripts and environment
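A rough sketch of the packaging step, assuming a simple tar-based self-extracting archive; the stub script, file layout, and names are illustrative only, not the SCOOP packaging code.

```python
import os
import stat
import tarfile

def build_package(binary: str, inputs: list, hotstart: str,
                  out: str = "adcirc_package.sh") -> str:
    """Bundle the model binary, converted inputs, and hotstart file into a
    self-extracting shell archive that unpacks and runs the binary."""
    payload = out + ".tar.gz"
    with tarfile.open(payload, "w:gz") as tar:
        for path in [binary, hotstart, *inputs]:
            tar.add(path, arcname=os.path.basename(path))
    with open(out, "wb") as fh:
        # Tiny self-extracting stub: everything after the marker line is the tarball.
        fh.write(b"#!/bin/sh\n"
                 b"SKIP=$(awk '/^__ARCHIVE__/{print NR+1; exit}' \"$0\")\n"
                 b"tail -n +$SKIP \"$0\" | tar xz && ./" +
                 os.path.basename(binary).encode() + b"\nexit 0\n__ARCHIVE__\n")
        with open(payload, "rb") as tf:
            fh.write(tf.read())
    os.chmod(out, os.stat(out).st_mode | stat.S_IEXEC)
    return out
```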
Resource Selection
[Diagram: for each compute site (Site A … Site Z), the Resource Selection component queries Globus MDS for queue status (free CPUs, length of queue), the Network Weather Service for bandwidth, and the MySQL database for current jobs. The Application Coordinator obtains a credential from MyProxy, moves the self-extracting file with Globus GridFTP, submits the job through the Globus Gatekeeper, tracks job status, and moves output files back. A scoring sketch follows.]
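A hedged sketch of one way to rank sites from the quantities the diagram names (free CPUs, queue length, bandwidth, current jobs); the weights and the SiteStatus fields are assumptions, not the published SCOOP selection algorithm.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class SiteStatus:
    name: str
    free_cpus: int          # from Globus MDS
    queue_length: int       # from Globus MDS
    bandwidth_mbps: float   # from the Network Weather Service
    our_jobs: int           # from the MySQL status database

def score(site: SiteStatus, cpus_needed: int) -> float:
    """Higher is better; sites that cannot hold the job score -inf."""
    if site.free_cpus < cpus_needed:
        return float("-inf")
    # Illustrative weights only: prefer idle, well-connected, lightly used sites.
    return (2.0 * site.free_cpus
            + 0.5 * site.bandwidth_mbps
            - 3.0 * site.queue_length
            - 1.0 * site.our_jobs)

def select_resource(sites: List[SiteStatus], cpus_needed: int) -> Optional[SiteStatus]:
    ranked = sorted(sites, key=lambda s: score(s, cpus_needed), reverse=True)
    best = ranked[0] if ranked else None
    return best if best and score(best, cpus_needed) > float("-inf") else None
```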
Fault Tolerance and Recovery
• Verify correct operation of basic Grid services
• Implemented two-phase fault recovery (see the sketch below)
– Retry the failed step
– Move back one step (e.g. may need to run on a different resource)
• Proactive monitoring and notification
– Using WS-Messenger and Broker
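A minimal sketch of the two-phase recovery policy above: retry the failed step in place, then back up one step (for example so a different resource can be chosen); the step list, failure cap, and logging are illustrative.

```python
import logging

log = logging.getLogger("scoop.recovery")

def run_with_recovery(steps, max_total_failures=10):
    """steps: ordered list of (name, callable) pairs; callables raise on failure.

    Phase 1: retry the failed step once in place.
    Phase 2: if the retry also fails, move back one step (e.g. so the previous
    step can pick a different resource) and continue from there.
    """
    i, failures = 0, 0
    while i < len(steps):
        name, step = steps[i]
        try:
            step()
            i += 1
        except Exception as exc:
            failures += 1
            if failures > max_total_failures:
                raise RuntimeError(f"giving up after {failures} failures") from exc
            log.warning("step %s failed (%s); retrying", name, exc)
            try:
                step()               # phase 1: retry in place
                i += 1
            except Exception:
                log.warning("retry of %s failed; backing up one step", name)
                i = max(i - 1, 0)    # phase 2: back up one step
```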
Experiences from 2005 & 2006
• Murphy's Law
– "If anything can go wrong, it will"
– debugging is hard
• Resource selection
– bandwidth, resource performance, reliability
– fault tolerance
– failure recovery
• Model specifics
– verification of model results
Left: ADCIRC max water level for 72 hr forecast starting 29 Aug 2005, driven by the "usual, always-available" ETA winds.
Right: ADCIRC max water level over ALL of the UFL ensemble wind fields for 72 hr forecast starting 29 Aug 2005, driven by "UFL always-available" ETA winds.
Images credit: Brian O. Blanton, SAIC
Conclusions and Future Work
• Foundation for a highly reliable distributed Grid environment for critical applications
• Upgrade path to OGCE2 and Globus 4.0
– Early work has been done to port to OGCE2
– Use Globus 4.0 MDS triggering
• Application to other environments
– North Carolina Forecasting System
– Package standard web services for resource selection and fault-tolerant application coordination
• More sophisticated resource selection
– Use historical data and data from concurrent runs to make selections
Portal Tour
https://portal.scoop.sura.org/gridsphere
End of talk!
More Information
• SCOOP
– http://scoop.sura.org
• RENCI Projects
– NCFS: http://www.renci.org/projects/indexdr.php
– SCOOP: http://www.renci.org/projects/scoop.php and http://www.scoop.unc.edu
• SURAGrid
– https://gridportal.sura.org/
Design Principles
• Scalable real-time system
– multiple large scale simulations in parallel
– based on Grid technologies and standards
• Modular, Extensible
– apply in context of other domains
• Adaptable
– criticality of the application
– variability in grid environments
• Framework
– real-time discovery of available resources
– managing the model run on an ad-hoc set of resources
– continuous monitoring and adaptation
• active monitoring, fault tolerance, failure recovery
Portal: Monitoring
Resource Pool Management
• Resources
– Local: RENCI, MCNC
– SURAGrid: TAMU, ULL, etc.
– SCOOP Partners: UAH, UFL
• Software
– Globus Services: GridFTP, GRAM, MDS
– NWS
• Configuration (see the sketch below)
– Resource expansion using property files
– Automated test suite to check resources periodically
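A small sketch of the configuration idea above, assuming Java-style key=value property files (one per resource) and a periodic check that probes each site's gatekeeper with globusrun -a; the property keys and directory are hypothetical.

```python
import glob
import subprocess
import time

def load_properties(path: str) -> dict:
    """Parse a simple key=value property file (comments start with '#')."""
    props = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, value = line.split("=", 1)
                props[key.strip()] = value.strip()
    return props

def load_pool(conf_dir: str = "/etc/scoop/resources") -> list:
    """Each property file in the directory describes one compute resource."""
    return [load_properties(p) for p in glob.glob(f"{conf_dir}/*.properties")]

def check_resource(resource: dict) -> bool:
    """Probe the resource's gatekeeper; 'globusrun -a -r host' authenticates only."""
    result = subprocess.run(["globusrun", "-a", "-r", resource["contact"]],
                            capture_output=True)
    return result.returncode == 0

def monitor(interval_seconds: int = 3600) -> None:
    while True:
        for resource in load_pool():
            ok = check_resource(resource)
            print(resource.get("name", "?"), "OK" if ok else "FAILED")
        time.sleep(interval_seconds)
```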
Portal: Hindcast Mode
Select Run Dates and Model Details