Real-Time Storm Surge Modeling in a Grid Environment
Howard M. Lander, [email protected]
TRANSCRIPT
Funded by NOAA & ONR
Bedford Institute of Oceanography
Virginia Institute of Marine Science
University of Alabama, Huntsville
Texas A&M Research Foundation
Renaissance Computing Institute
2005/2006 SCOOP Implementation Team
University of North Carolina
University of Florida
Louisiana State University
Gulf of Maine Ocean Observing System
MCNC
Southeastern Universities Research Association
External Resources, e.g. SURAgrid regional grid infrastructure, www.sura.org/suragrid
SCOOP: A Distributed Laboratory
Credits: SCOOP Team
Acknowledgements
• Funding
– “SURA Coastal Ocean Observing and Prediction (SCOOP) Program”, Office of Naval Research, Award N00014-04-1-0721, National Oceanic and Atmospheric Administration’s NOAA Ocean Service, Award NA04NOS4730254.
• SCOOP Partners
– Philip Bogden (SURA and GoMOOS); Will Perrie, Bash Toulany (BIO); Charlton Purvis, Eric Bridger (GoMOOS); Greg Stone, Gabrielle Allen, Jon MacLaren, Bret Estrada, Chirag Dekate (LSU, Center for Computation and Technology); Gerald Creager, Larry Flournoy, Wei Zhao, Donna Cote and Matt Howard (TAMU); Sara Graves, Helen Conover, Ken Keiser, Matt Smith, and Marilyn Drewry (UAH); Peter Sheng, Justin Davis, Renato Figueiredo, and Vladimir Paramygin (UFL); Harry Wang, Jian Shen and David Forrest (VIMS); Hans Graber, Neil Williams and Geoff Samuels (UMiami); and Mary Fran Yafchak, Kate Barzee, Don Riley, Don Wright and Joanne Bintz (SURA), Rick Luettich (UNC-CH), Brian Blanton (SAIC), Dan Reed, Alan Blatecky, Lavanya Ramakrishnan, Gopi Kandaswamy, Ken Galluppi (RENCI), Steve Thorpe (MCNC)
• SCOOP and SURAGrid resource partners and system administrators
– Steven Johnson (TAMU), Renato J. Figueiredo (UFL), Michael McEniry (UAH), Ian Chang-Yen (ULL), and Brad Viviano (RENCI), for providing valuable system administrator support
Outline
• Motivation
• Demo Scenario
• Grid Technologies
– Grid Architecture
– Resource Selection
• Portlet Tour
Motivation: Disaster Response
• An example close to home: North Carolina
– most disasters are weather driven
• floods, winds, and ice
• Inadequate information
– based on national and regional information
• High resolution model forecasting for local events
– improves planning and preparation
– shortens response and recovery time
Credits: Ken Galluppi
Integrated Response System
• Hurricane Season 2005
– 26 named storms, 14 hurricanes, 3 with major impact
– billions of dollars in economic losses
• SURA Coastal Ocean Observing and Prediction (SCOOP) Program
– provide early and accurate forecasts, dissemination of information
– interact in real time, i.e. evaluate and adapt
– provide infrastructure to solve interdisciplinary problems
Today …
ADCIRC: Storm Surge Modeling
• Advanced Circulation Model (ADCIRC)
– Finite Element Hydrodynamic Model for Coastal Oceans, Inlets, Rivers and Floodplains
• Scenarios
– Daily operational 24/7/365 forecasts
– Real-time ensemble model prediction
– Retrospective analysis
• Assembling meteorological and other data sets for input
– Multiple sources: U. Florida, NCEP, TAMU
• Hot-starting the model
– NCEP 6 hour operational cycle
– previous data is used to jumpstart the model run (see the sketch below)
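A minimal sketch of the hot-start decision, assuming a hypothetical archive layout and file naming: reuse the state written by the previous 6-hour NCEP cycle when it exists, otherwise fall back to a spin-up run.

```python
import os
from datetime import datetime, timedelta
from typing import Optional

CYCLE_HOURS = 6                              # NCEP operational cycle
HOTSTART_DIR = "/archive/adcirc/hotstart"    # hypothetical archive path

def find_hotstart(cycle_time: datetime) -> Optional[str]:
    """Return the hotstart file written by the previous cycle, if it exists."""
    previous = cycle_time - timedelta(hours=CYCLE_HOURS)
    candidate = os.path.join(HOTSTART_DIR, previous.strftime("%Y%m%d%H") + ".hot")
    return candidate if os.path.exists(candidate) else None

def run_spinup(cycle_time: datetime) -> str:
    """Placeholder for a cold-start spin-up run that produces a hotstart file."""
    raise NotImplementedError("spin-up run goes here")

def prepare_initial_state(cycle_time: datetime) -> str:
    """Jump-start the run from the previous cycle's output when possible."""
    return find_hotstart(cycle_time) or run_spinup(cycle_time)
```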
Demo Scenario
• Multiple model runs
– An ensemble of 11 input files for a single time period
– Plan is to go to 46 members for this year: we need help!
– Each member of the ensemble represents a distinct forecast track for the storm
– Multiple model runs for each ensemble member (see the dispatch sketch after this list)
• Data from Hurricane Katrina, August 2005
– Generated on demand at the University of Florida for the demo
– Ordinarily generated in response to storm activity in the Atlantic basin
• Portal tracks activity and status in the demo
– Status of compute resources
– Status of input and output data
– Status of model runs
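A toy sketch of fanning out the 11-member ensemble for one forecast cycle; the member naming scheme and the submission stub are hypothetical, and a real run would submit each member through the grid machinery described later.

```python
from concurrent.futures import ThreadPoolExecutor

ENSEMBLE_SIZE = 11   # one wind-field input file per member for a single time period

def submit_member_run(member_id: int, cycle: str) -> str:
    """Hypothetical stand-in for packaging and submitting one ADCIRC run."""
    wind_file = f"ens_{member_id:02d}_{cycle}.win"   # hypothetical naming scheme
    print(f"submitting run for {wind_file}")
    return f"job-{member_id}"

def dispatch_ensemble(cycle: str) -> list:
    """Launch one model run per ensemble member and collect job handles."""
    with ThreadPoolExecutor(max_workers=ENSEMBLE_SIZE) as pool:
        jobs = pool.map(lambda m: submit_member_run(m, cycle), range(ENSEMBLE_SIZE))
        return list(jobs)

if __name__ == "__main__":
    dispatch_ensemble("2005082900")   # hypothetical cycle id for the Katrina demo period
```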
Grid Architecture
[Workflow diagram spanning the Portal, compute sites A, B, and C, and supporting services. Wind data (NAM, NAH, UF-WANA forecast or hindcast fields) arrives via LDM, or a user initiates a model run from the Portal. Initialization files are fetched from the archive, or the model is run to generate a hotstart file. Resource Selection queries site status to answer "what is the best resource?"; Package Preparation builds the package for that resource; the Application Coordinator moves the package and initiates the run. When the job finishes, output files are moved back and pushed out. Resource Monitoring, model status, and resource status flow through the WS-Messenger Broker to the Portal and a MySQL database, with results shown on a visualization wall.]
Technology Exposition
• Grid technologies (Globus; a driver sketch follows below)
– standard job submission: Gatekeeper: used to dispatch and monitor jobs
– file transfer: GridFTP: used to move the prepared package to the resource and to retrieve results from the resource
– queue status: Information Services/MDS: used as an input to the resource selection algorithm and displayed in a portlet
– credential repository: MyProxy: required for job submission
• Domain products
– Local Data Manager (LDM): event-driven data transport system: used to receive input files and trigger model runs, as well as to insert results
– OPeNDAP: format-independent network data access protocol
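A hedged sketch of how these Globus pieces are typically driven from a script, assuming the pre-WS (GT2-style) command-line tools; the host names, package name, and paths are placeholders, not SCOOP values, and the exact flags can differ between Globus Toolkit versions.

```python
import subprocess

SITE = "gridnode.example.edu"        # placeholder compute resource
PACKAGE = "adcirc_package.sh"        # self-extracting package built earlier

def stage_and_submit() -> None:
    # MyProxy credential repository: obtain a short-lived credential for job submission.
    subprocess.run(["myproxy-logon", "-s", "myproxy.example.org", "-l", "scoop"],
                   check=True)
    # GridFTP: move the prepared package to the selected resource.
    subprocess.run(["globus-url-copy",
                    f"file:///tmp/{PACKAGE}",
                    f"gsiftp://{SITE}/scratch/{PACKAGE}"],
                   check=True)
    # Gatekeeper (GRAM): dispatch the job; status is checked later with globus-job-status.
    subprocess.run(["globus-job-submit", SITE, f"/scratch/{PACKAGE}"], check=True)

if __name__ == "__main__":
    stage_and_submit()
```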
Technology Exposition (2)
• Portal Technologies
– NSF NMI Open Grid Computing Environment (OGCE): used to host the portlets
• Eventing
– LEAD WS-Messenger: enables data communication among pieces of the system. Example: the application coordinator sends status information through WS-Messenger
• Web Services
– Used to send job and resource status information from a MySQL database to the portlets. Also used to track flows of data files in the system
• MySQL
– Open source relational database used to store job and resource status information for display and analyses (a status-query sketch follows below)
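A minimal sketch of the kind of lookup the status web services perform against MySQL before handing rows to the portlets; the schema (a job_status table and its columns) is an assumption, and pymysql merely stands in for whichever connector the services actually use.

```python
import pymysql   # stand-in MySQL connector; the real services may use another driver

def fetch_job_status(run_id: str) -> list:
    """Return (job_id, resource, state, updated) rows for one model run."""
    conn = pymysql.connect(host="db.example.org", user="scoop",
                           password="secret", database="scoop_status")
    try:
        with conn.cursor() as cur:
            # Hypothetical schema: a job_status table keyed by run_id.
            cur.execute(
                "SELECT job_id, resource, state, updated "
                "FROM job_status WHERE run_id = %s ORDER BY updated",
                (run_id,))
            return list(cur.fetchall())
    finally:
        conn.close()
```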
Application Coordinator
• Data Management
– real-time data movement: LDM, GridFTP
– previously generated files: SCOOP Catalog [UAH] and archive [TAMU, LSU]
• Application Preparation (see the packaging sketch below)
– conversion of data formats
– self-extracting archive containing the binary
– identify and retrieve or generate appropriate hotstart files
• Extensible
– model parameters, template scripts and environment
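A rough sketch of the packaging step, assuming a simple tar-based self-extracting archive; the stub script, file layout, and names are illustrative only, not the SCOOP packaging code.

```python
import os
import stat
import tarfile

def build_package(binary: str, inputs: list, hotstart: str,
                  out: str = "adcirc_package.sh") -> str:
    """Bundle the model binary, converted inputs, and hotstart file into a
    self-extracting shell archive that unpacks and runs the binary."""
    payload = out + ".tar.gz"
    with tarfile.open(payload, "w:gz") as tar:
        for path in [binary, hotstart, *inputs]:
            tar.add(path, arcname=os.path.basename(path))
    with open(out, "wb") as fh:
        # Tiny self-extracting stub: everything after the marker line is the tarball.
        fh.write(b"#!/bin/sh\n"
                 b"SKIP=$(awk '/^__ARCHIVE__/{print NR+1; exit}' \"$0\")\n"
                 b"tail -n +$SKIP \"$0\" | tar xz && ./" +
                 os.path.basename(binary).encode() + b"\nexit 0\n__ARCHIVE__\n")
        with open(payload, "rb") as tf:
            fh.write(tf.read())
    os.chmod(out, os.stat(out).st_mode | stat.S_IEXEC)
    return out
```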
Resource Selection
[Diagram: for each compute site (Site A … Site Z), the Resource Selection component queries Globus MDS for queue status (free CPUs, length of queue), the Network Weather Service for bandwidth, and the MySQL database for current jobs. The Application Coordinator obtains a credential from MyProxy, moves the self-extracting file with Globus GridFTP, submits the job through the Globus Gatekeeper, tracks job status, and moves output files back. A scoring sketch follows.]
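A hedged sketch of one way to rank sites from the quantities the diagram names (free CPUs, queue length, bandwidth, current jobs); the weights and the SiteStatus fields are assumptions, not the published SCOOP selection algorithm.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class SiteStatus:
    name: str
    free_cpus: int          # from Globus MDS
    queue_length: int       # from Globus MDS
    bandwidth_mbps: float   # from the Network Weather Service
    our_jobs: int           # from the MySQL status database

def score(site: SiteStatus, cpus_needed: int) -> float:
    """Higher is better; sites that cannot hold the job score -inf."""
    if site.free_cpus < cpus_needed:
        return float("-inf")
    # Illustrative weights only: prefer idle, well-connected, lightly used sites.
    return (2.0 * site.free_cpus
            + 0.5 * site.bandwidth_mbps
            - 3.0 * site.queue_length
            - 1.0 * site.our_jobs)

def select_resource(sites: List[SiteStatus], cpus_needed: int) -> Optional[SiteStatus]:
    ranked = sorted(sites, key=lambda s: score(s, cpus_needed), reverse=True)
    best = ranked[0] if ranked else None
    return best if best and score(best, cpus_needed) > float("-inf") else None
```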
Fault Tolerance and Recovery
• Verify correct operation of basic Grid services
• Implemented two-phase fault recovery (see the sketch below)
– Retry the failed step
– Move back one step (e.g. may need to run on a different resource)
• Proactive monitoring and notification
– Using WS-Messenger and Broker
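A minimal sketch of the two-phase recovery policy above: retry the failed step in place, then back up one step (for example so a different resource can be chosen); the step list, failure cap, and logging are illustrative.

```python
import logging

log = logging.getLogger("scoop.recovery")

def run_with_recovery(steps, max_total_failures=10):
    """steps: ordered list of (name, callable) pairs; callables raise on failure.

    Phase 1: retry the failed step once in place.
    Phase 2: if the retry also fails, move back one step (e.g. so the previous
    step can pick a different resource) and continue from there.
    """
    i, failures = 0, 0
    while i < len(steps):
        name, step = steps[i]
        try:
            step()
            i += 1
        except Exception as exc:
            failures += 1
            if failures > max_total_failures:
                raise RuntimeError(f"giving up after {failures} failures") from exc
            log.warning("step %s failed (%s); retrying", name, exc)
            try:
                step()               # phase 1: retry in place
                i += 1
            except Exception:
                log.warning("retry of %s failed; backing up one step", name)
                i = max(i - 1, 0)    # phase 2: back up one step
```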
Experiences from 2005 & 2006
• Murphy's Law
– "If anything can go wrong, it will"
– debugging is hard
• Resource selection
– bandwidth, resource performance, reliability
– fault tolerance
– failure recovery
• Model specifics
– verification of model results
Left: ADCIRC max water level for 72 hr forecast starting 29 Aug 2005, driven by the "usual, always-available" ETA winds.
Right: ADCIRC max water level over ALL of the UFL ensemble wind fields for 72 hr forecast starting 29 Aug 2005, driven by "UFL always-available" ETA winds.
Images credit: Brian O. Blanton, SAIC
Conclusions and Future Work
• Foundation for a highly reliable distributed Grid environment for critical applications
• Upgrade path to OGCE2 and Globus 4.0
– Early work has been done to port to OGCE2
– Use Globus 4.0 MDS triggering
• Application to other environments
– North Carolina Forecasting System
– Package standard web services for resource selection and fault-tolerant application coordination
• More sophisticated resource selection
– Use historical data and data from concurrent runs to make selections
Portal Tour
https://portal.scoop.sura.org/gridsphere
End of talk!
More Information
• SCOOP
– http://scoop.sura.org
• RENCI Projects
– NCFS: http://www.renci.org/projects/indexdr.php
– SCOOP: http://www.renci.org/projects/scoop.php and http://www.scoop.unc.edu
• SURAGrid
– https://gridportal.sura.org/
Design Principles
• Scalable real-time system
– multiple large scale simulations in parallel
– based on Grid technologies and standards
• Modular, Extensible
– apply in context of other domains
• Adaptable
– criticality of the application
– variability in grid environments
• Framework
– real-time discovery of available resources
– managing the model run on an ad-hoc set of resources
– continuous monitoring and adaptation
• active monitoring, fault tolerance, failure recovery
Portal: Monitoring
Resource Pool Management
• Resources
– Local: RENCI, MCNC
– SURAGrid: TAMU, ULL, etc.
– SCOOP Partners: UAH, UFL
• Software
– Globus Services: GridFTP, GRAM, MDS
– NWS
• Configuration (see the sketch below)
– Resource expansion using property files
– Automated test suite to check resources periodically
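A small sketch of the configuration idea above, assuming Java-style key=value property files (one per resource) and a periodic check that probes each site's gatekeeper with globusrun -a; the property keys and directory are hypothetical.

```python
import glob
import subprocess
import time

def load_properties(path: str) -> dict:
    """Parse a simple key=value property file (comments start with '#')."""
    props = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, value = line.split("=", 1)
                props[key.strip()] = value.strip()
    return props

def load_pool(conf_dir: str = "/etc/scoop/resources") -> list:
    """Each property file in the directory describes one compute resource."""
    return [load_properties(p) for p in glob.glob(f"{conf_dir}/*.properties")]

def check_resource(resource: dict) -> bool:
    """Probe the resource's gatekeeper; 'globusrun -a -r host' authenticates only."""
    result = subprocess.run(["globusrun", "-a", "-r", resource["contact"]],
                            capture_output=True)
    return result.returncode == 0

def monitor(interval_seconds: int = 3600) -> None:
    while True:
        for resource in load_pool():
            ok = check_resource(resource)
            print(resource.get("name", "?"), "OK" if ok else "FAILED")
        time.sleep(interval_seconds)
```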
Portal: Hindcast Mode
Select Run Dates and Model Details