NCAR
The Earth System Grid (ESG) &
The Community Data Portal (CDP)
(NCAR’s Data & Grid Efforts)
for
COMMISSION FOR BASIC SYSTEMS
INFORMATION SYSTEMS and SERVICES
INTERPROGRAMME TASK TEAM ON THE FUTURE WMO INFORMATION SYSTEM
KUALA LUMPUR, 20 - 24 OCTOBER 2003
Courtesy: Don Middleton, NCAR Scientific Computing Division

“Atkins Report”: “A new age has dawned…”
“The Panel’s overarching recommendation is that the National Science Foundation should establish and lead a large-scale, interagency, and internationally coordinated Advanced Cyberinfrastructure Program (ACP) to create, deploy, and apply cyberinfrastructure in ways that radically empower all scientific and engineering research and allied education. We estimate that sustained new NSF funding of $1 billion per year is needed to achieve critical mass and to leverage the coordinated co-investment from other federal agencies, universities, industry, and international sources necessary to empower a revolution. The cost of not acting quickly or at a subcritical level could be high, both in opportunities lost and in increased fragmentation and balkanization of the research.”
Atkins Report, Executive Summary

The Earth System Grid
U.S. DOE SciDAC funded R&D effort - a “Collaboratory Pilot Project”
Build an “Earth System Grid” that enables management, discovery, distributed access, processing, & analysis of distributed terascale climate research data
Build upon Globus Toolkit and DataGrid technologies and deploy (Rubber on the road)
Potential broad application to other areas
http://www.earthsystemgrid.org

ESG Team
ANL
– Ian Foster (PI)
– Veronika Nefedova
– (John Bresenhan)
– (Bill Allcock)
LBNL
– Arie Shoshani
– Alex Sim
ORNL
– David Bernholdte
– Kasidit Chanchio
– Line Pouchard
LLNL/PCMDI
– Bob Drach
– Dean Williams (PI)
USC/ISI
– Anne Chervenak
– Carl Kesselman
– (Laura Perlman)
NCAR
– David Brown
– Luca Cinquini
– Peter Fox
– Jose Garcia
– Don Middleton (PI)
– Gary Strand

Baseline Numbers
T42 CCSM (current, 280km)
– 7.5GB/yr, 100 years -> .75TB
T85 CCSM (140km)
– 29GB/yr, 100 years -> 2.9TB
T170 CCSM (70km)
– 110GB/yr, 100 years -> 11TB

Capacity-related Improvements
Increased turnaround, model development, ensemble of runs
Increase by a factor of 10, linear data
Current T42 CCSM
– 7.5GB/yr, 100 years -> .75TB * 10 = 7.5TB
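The arithmetic behind the baseline and capacity figures can be checked with a short sketch. It assumes decimal units (1 TB = 1000 GB), which is what the slide's rounding implies; the function and dictionary names are illustrative only.

```python
# Per-year output volumes quoted on the "Baseline Numbers" slide (GB/yr).
RUNS_GB_PER_YEAR = {"T42": 7.5, "T85": 29.0, "T170": 110.0}


def archive_tb(gb_per_year, years=100, ensemble_factor=1):
    """Total archive size in TB, assuming 1 TB = 1000 GB (an assumption)."""
    return gb_per_year * years * ensemble_factor / 1000.0


# Baseline: one 100-year run at each resolution.
baseline = {name: archive_tb(rate) for name, rate in RUNS_GB_PER_YEAR.items()}

# Capacity-related improvement: a factor of 10 in runs, data scaling linearly.
t42_capacity = archive_tb(RUNS_GB_PER_YEAR["T42"], ensemble_factor=10)

for name, size in baseline.items():
    print(f"{name}: {size:g} TB per 100-year run")
print(f"T42 with 10x capacity: {t42_capacity:g} TB")
```

Because the data volume is described as linear in the capacity factor, the factor of 10 is applied as a plain multiplier.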

Capability-related Improvements
Spatial Resolution: T42 -> T85 -> T170
Increase by factor of ~ 10-20, linear data
Temporal Resolution: Study diurnal cycle, 3 hour data
Increase by factor of ~ 4, linear data
CCM3 at T170 (70km)

Capability-related Improvements
Quality: Improved boundary layer, clouds, convection, ocean physics, land model, river runoff, sea ice
Increase by another factor of 2-3, data flat
Scope: Atmospheric chemistry (sulfates, ozone…), biogeochemistry (carbon cycle, ecosystem dynamics), Middle Atmosphere Model…
Increase by another factor of 10+, linear data

Model Improvement Wishlist
Grand Total:
Increase compute by a Factor O(1000-10000)
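As a rough cross-check of the grand total, the factors quoted on the preceding slides can be multiplied together. This is only a sketch: the slides do not say whether the total folds in the capacity factor of 10, so both products are computed, and the "10+" for scope is treated as exactly 10 (a lower bound).

```python
import math

# (low, high) multipliers as quoted on the capability slides.
CAPABILITY = {
    "spatial resolution": (10, 20),
    "temporal resolution": (4, 4),
    "quality": (2, 3),
    "scope": (10, 10),  # slide says "10+", so this is a lower bound
}
CAPACITY = (10, 10)  # turnaround, model development, ensembles


def combined(factors):
    """Multiply the low ends and the high ends of each range."""
    lows = [low for low, _ in factors]
    highs = [high for _, high in factors]
    return math.prod(lows), math.prod(highs)


capability_only = combined(CAPABILITY.values())
with_capacity = combined(list(CAPABILITY.values()) + [CAPACITY])

print("capability only:", capability_only)
print("capability and capacity:", with_capacity)
```

Either way the product lands in the thousands to low tens of thousands, which is consistent in order of magnitude with the O(1000-10000) figure.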

ESG Scenario
End 2002: 1.2 million files comprising ~75TB of data at NCAR, ORNL, LANL, NERSC, and PCMDI
End 2007: As much as 3 PB (3,000 TB) of data (!)
Current practice is already broken – the future will be even worse if something isn’t done…
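The growth implied by these two data points can be worked out directly. The sketch below assumes steady exponential growth between end 2002 and end 2007 and decimal units; neither assumption comes from the slides.

```python
import math

FILES_2002 = 1.2e6
TB_2002 = 75.0
TB_2007 = 3000.0
YEARS = 5

# Overall growth and the equivalent constant yearly multiplier.
growth_factor = TB_2007 / TB_2002
annual_multiplier = growth_factor ** (1 / YEARS)
doubling_time_years = math.log(2) / math.log(annual_multiplier)

# Average file size at end 2002, in MB (1 TB = 1e6 MB assumed).
avg_file_mb = TB_2002 * 1e6 / FILES_2002

print(f"growth factor: {growth_factor:.0f}x")
print(f"annual multiplier: {annual_multiplier:.2f}x")
print(f"doubling time: {doubling_time_years:.2f} years")
print(f"average file size (2002): {avg_file_mb:.1f} MB")
```

Under those assumptions the holdings roughly double every year, which is the scale of change the slide is warning about.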

ESG Scenario (cont.)
Data
– Different formats are converted to netCDF
– netCDF is not standardized to the CF model
– Different sites require knowledge of different methods of access
Metadata
– Most kept in online files separate from data and unsearchable unless one is “in the know”
– Some kept in people’s brains
Access control
– Manual
– Not formalized
Data requests
– Beginnings of a formal process (e.g., the PCMDI model)
– Beginnings of web portals
– Far too much done by hand
– Logging nearly non-existent

ESG: Challenges
Enabling the simulation and data management team
Enabling the core research community in analyzing and visualizing results
Enabling broad multidisciplinary communities to access simulation results
We need integrated scientific work environments that enable smooth WORKFLOW for knowledge development: computation, collaboration & collaboratories, data management, access, distribution, analysis, and visualization.

ESG: Strategies
Move data a minimal amount, keep it close to computational point of origin when possible
– Data access protocols, distributed analysis
When we must move data, do it fast and with a minimum amount of human intervention
– Storage Resource Management, fast networks
Keep track of what we have, particularly what’s on deep storage
– Metadata and Replica Catalogs
Harness a federation of sites, web portals
– Globus Toolkit -> The Earth System Grid -> The UltraDataGrid
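The replica-catalog idea in the third strategy can be illustrated with a toy in-memory catalog that maps a logical file name to its physical copies and prefers a copy that is local and not on deep storage. This is a sketch only: the class names, the selection rule, and the example URLs are hypothetical and are not the Globus RLS interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    site: str
    url: str
    on_deep_storage: bool


class ReplicaCatalog:
    """Toy catalog: logical file name -> list of physical replicas."""

    def __init__(self):
        self._replicas = {}

    def register(self, logical_name, replica):
        self._replicas.setdefault(logical_name, []).append(replica)

    def lookup(self, logical_name):
        return list(self._replicas.get(logical_name, []))

    def best(self, logical_name, local_site):
        """Prefer a local copy, then a copy on disk over deep storage."""
        candidates = self.lookup(logical_name)
        if not candidates:
            raise KeyError(logical_name)
        return min(
            candidates,
            key=lambda r: (r.site != local_site, r.on_deep_storage),
        )


catalog = ReplicaCatalog()
# Hypothetical hosts and paths, for illustration only.
catalog.register("ccsm/run1/TS.nc", Replica("NCAR", "gsiftp://mss.example.org/run1/TS.nc", True))
catalog.register("ccsm/run1/TS.nc", Replica("LBNL", "gsiftp://disk.example.org/run1/TS.nc", False))

choice = catalog.best("ccsm/run1/TS.nc", local_site="ORNL")
print(choice.site, choice.url)
```

With no local copy available, the disk-resident replica wins over the one on deep storage, which matches the "move data a minimal amount, and fast" intent.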

[Figure: Storage/Data Management. Labels: Client (Selection, Control, Monitoring) with HRM; two Servers, each with an HRM and a Tera/Peta-scale Archive; tools for reliable staging, transport, and replication.]

HRM aka “DataMover”
Running well across DOE/HPSS systems
New component built that abstracts NCAR Mass Storage System
Defining next generation of requirements with climate production group
First “real” usage
“The bottom line is that it now works fine and is over 100 times faster than what I was doing before. As important as two orders of magnitude increase in throughput is, more importantly I can see a path that will essentially reduce my own time spent on file transfers to zero in the development of the climate model database” – Mike Wehner, LBNL

OPeNDAP
An Open Source Project for a Network Data Access Protocol
(originally DODS, the Distributed Oceanographic Data System)

[Figure: OPeNDAP-g (Transparency, Performance, Security, Authorization, (Processing)). Three access paths are shown. Typical application: Application -> netCDF lib -> Data (local). OPeNDAP via http: Application -> OPeNDAP Client -> Data (remote). OPeNDAP via Grid: Distributed Application -> ESG client -> Big Data (multiple remotes), using ESG + DODS. Server side: OPeNDAP Server, ESG Server, Distributed Data Access Services.]
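The "OPeNDAP via http" path works by appending a response suffix and a constraint expression to a dataset URL, so that only the requested hyperslab crosses the network. The sketch below only builds such a URL using the DAP2 `variable[start:stride:stop]` form; the host, dataset path, and variable name are hypothetical, and no request is sent.

```python
def opendap_url(base, variable, slices, response="ascii"):
    """Build a DAP2-style request URL for a hyperslab of one variable.

    base     -- dataset URL without a response suffix
    slices   -- one (start, stride, stop) tuple per dimension, stop inclusive
    response -- response suffix such as "dds", "das", "ascii", or "dods"
    """
    hyperslab = "".join(
        f"[{start}:{stride}:{stop}]" for start, stride, stop in slices
    )
    return f"{base}.{response}?{variable}{hyperslab}"


# Hypothetical dataset: first 12 time steps of a surface temperature field.
url = opendap_url(
    "http://example.org/dods/ccsm/run1.nc",
    "TS",
    [(0, 1, 11), (0, 1, 63), (0, 1, 127)],
)
print(url)
```

Some servers expect the square brackets to be percent-encoded; that detail is left out here to keep the constraint readable.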

ESG: NcML Core Schema
For XML encoding of metadata (and data) of any generic netCDF file
Objects: netCDF, dimension, variable, attribute
Beta version reference implementation as Java Library (http://www.scd.ucar.edu/vets/luca/netcdf/extract_metadata.htm)
[Figure: NcML schema diagram. Elements: netCDF (nc:netCDFType), nc:dimension, nc:variable (nc:VariableType), nc:attribute, nc:values.]
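To make the four object types concrete, the following sketch encodes the header of an imaginary netCDF file as XML. The element names follow the objects listed on the slide, but the attribute names (`name`, `length`, `shape`, `type`, `value`) and the sample header are assumptions, and the output is not validated against the actual NcML schema.

```python
import xml.etree.ElementTree as ET


def describe_netcdf(dimensions, variables, global_attrs):
    """Build an NcML-like XML tree from plain Python descriptions."""
    root = ET.Element("netCDF")
    for name, length in dimensions.items():
        ET.SubElement(root, "dimension", name=name, length=str(length))
    for name, value in global_attrs.items():
        ET.SubElement(root, "attribute", name=name, value=str(value))
    for name, spec in variables.items():
        var = ET.SubElement(
            root, "variable",
            name=name, shape=" ".join(spec["shape"]), type=spec["type"],
        )
        # Per-variable attributes nest inside the variable element.
        for attr_name, attr_value in spec.get("attrs", {}).items():
            ET.SubElement(var, "attribute", name=attr_name, value=str(attr_value))
    return root


# Hypothetical file header, for illustration only.
tree = describe_netcdf(
    dimensions={"time": 12, "lat": 64},
    variables={"TS": {"shape": ["time", "lat"], "type": "float",
                      "attrs": {"units": "K"}}},
    global_attrs={"Conventions": "CF-1.0"},
)
print(ET.tostring(tree, encoding="unicode"))
```

An XML view like this lets catalog and search tools read a file's metadata without opening the binary netCDF file itself.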

[Figure: metadata schema class diagram. Legend: Class, AbstractClass, inheritance, association.]
Object: [1] id
Activity: [0,1] name, [0,1] description, [0,1] rights, [0,n] date (type=), [0,n] note, [0,n] participant (role=), [0,n] reference (uri=)
Investigation
Project: [0,n] topic (type=), [0,1] funding
Ensemble
Campaign
Simulation: [0,n] simulationInput (type=), [0,n] simulationHardware
Observation
Experiment
Analysis
Dataset: [0,1] type, [0,1] conventions, [0,n] date (type=), [0,n] format (type=, uri=), [0,1] timeCoverage, [0,1] spaceCoverage
Person: [0,1] firstName, [0,1] lastName, [0,1] contact
Institution: [0,1] name, [0,1] type, [0,1] contact
Service: [0,1] name, [0,1] description
Relations shown: isA, isPartOf, hasParent, hasChild, hasSibling, generatedBy, worksFor, participant role=, serviceId
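The cardinalities in the diagram translate naturally into typed fields: [1] becomes a required value, [0,1] an optional one, and [0,n] a list. The sketch below does this for Dataset only. It assumes Dataset inherits the required id from Object, which the flattened diagram suggests but does not make certain, and the helper class names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TypedDate:
    value: str
    type: Optional[str] = None  # the "type=" qualifier on date


@dataclass
class Format:
    type: Optional[str] = None  # the "type=" qualifier on format
    uri: Optional[str] = None   # the "uri=" qualifier on format


@dataclass
class Dataset:
    id: str                                               # [1], assumed inherited from Object
    type: Optional[str] = None                            # [0,1]
    conventions: Optional[str] = None                     # [0,1]
    dates: List[TypedDate] = field(default_factory=list)  # [0,n]
    formats: List[Format] = field(default_factory=list)   # [0,n]
    time_coverage: Optional[str] = None                   # [0,1]
    space_coverage: Optional[str] = None                  # [0,1]


# Hypothetical record, for illustration only.
ds = Dataset(
    id="pcm.sample.run",
    conventions="CF",
    dates=[TypedDate("2003-10-20", type="created")],
    formats=[Format(type="netCDF", uri="http://example.org/data/sample.nc")],
)
print(ds)
```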

ESG Metadata Progress
Co-developed NcML with Unidata
– CF conventions in progress, almost done
Developed & evaluated a prototype metadata system
Finalized an initial schema for PCM/CCSM
– Address interoperability with federal standards and NASA/GCMD via the generation of DIF/FGDC/ISO
– Address interoperability with digital libraries via the creation of Dublin Core
Testing relational and native XML databases, and OGSA-DAI
Exploratory work for first-generation ontology
Authoring of discovery metadata in progress
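Generating Dublin Core from richer metadata is essentially a field-by-field mapping. The sketch below shows one way this could look; the mapping table, the input field names, and the plain `metadata` wrapper element are all assumptions, and only the Dublin Core element namespace and element names are standard.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)

# Hypothetical mapping from internal field names to Dublin Core elements.
FIELD_TO_DC = {
    "name": "title",
    "description": "description",
    "rights": "rights",
    "institution": "publisher",
    "format": "format",
    "uri": "identifier",
}


def to_dublin_core(record):
    """Emit one dc:* element for each mapped field that has a value."""
    root = ET.Element("metadata")  # wrapper element is an assumption
    for field_name, dc_name in FIELD_TO_DC.items():
        value = record.get(field_name)
        if value:
            ET.SubElement(root, f"{{{DC_NS}}}{dc_name}").text = value
    return root


# Hypothetical record, for illustration only.
record = {
    "name": "PCM sample run",
    "institution": "NCAR",
    "format": "netCDF",
    "uri": "http://example.org/data/sample.nc",
}
dc = to_dublin_core(record)
print(ET.tostring(dc, encoding="unicode"))
```

Fields with no Dublin Core counterpart are simply dropped, which is the usual trade-off when a detailed schema is reduced to a discovery-level record.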

[Figure: ESG Topology. Sites: ANL, ISI, LBNL, LLNL, NCAR, ORNL. Components: ESG web portal (Tomcat/Struts); OGSA-DAI with a MySQL RDBMS; RLS instances with cross-update; HRM in front of MSS, HPSS, and disk storage; disk cache; gridFTP servers; MyProxy (authenticate); CAS; GRAM gatekeeper (submit, execute); LAS server (visualize). Links are labeled query and gridFTP.]

Collaborations & Relationships
CCSM Data Management Group
The Globus Project
Other SciDAC Projects: Climate, Security & Policy for Group Collaboration, Scientific Data Management ISIC, & High-performance DataGrid Toolkit
OPeNDAP/DODS (multi-agency)
NSF National Science Digital Libraries Program (UCAR & Unidata THREDDS Project)
U.K. e-Science and British Atmospheric Data Center
NOAA NOMADS and CEOS-grid
Earth Science Portal group (multi-agency, intnl.)

Immediate Directions
Broaden usage of DataMover and refine
Continue building metadata catalogs
Revisit overall security model and consider simplified approaches
Redesign and implement user interface
Alpha version of OPeNDAPg
– Test and evaluate with client applications
Develop automation for data publishing (GT3)
Deploy for IPCC runs

The Community Data Portal (CDP)
Provide a common portal to NCAR, UCAR, and university data
Provide a sustainable cyberinfrastructure that dramatically lowers the cost of sharing data (there is HUGE interest in this)
Directly couple to simulation systems and DataMonster
Begin capturing rich metadata and catalog our scientific experiments for the world
MSS -> A Petascale Mass Knowledge System
Federate internationally (ESG, THREDDS, U.K. e-Science, NOMADS, PRISM, GEON, etc.)
“The dataportal has changed my life…” Ben Kirtman, COLA

Foster Revolutionary Change
Mass Storage System (1.5PB) -> Petascale Knowledge Repository
Establish a new paradigm for managing and accessing scientific data based on semantic organization.

Community Data Portal
Purpose:
Build an infrastructure using different methods for data exploration and delivery
Web-based retrieval and interactive analysis for MSS collections
Data sharing for multi-institution cooperative studies
Browse, select, compare, download data sets, & specify data subsets using – graphical, text entry, choice of output format
Components:
User interface, Live Access Server (LAS)
Middleware, Ferret, NCL, GrADS
File service, local, or DODS
Status:
Pilot working (2 years), more middleware testing

[Figure: Data Access. Labels: Data Collections; Massive Data, Simulation & Retrospective (CSM, PCM, DSS, MM5, WRF, MICOM, CMIWG); DODS; Live Access Server; Ferret, NCL, Other Engines; Live Access Client.]

Example … Data Analysis

Live Access Server + NCL (Grib Data)

Interface and Reanalysis 2 Sea Level Pressure

[Figure: Community Data Portal architecture (dataportal.ucar.edu). Layers: hardware, core services, middleware, user interface. Hardware: RAID disks, MSS. Services: catalogs parsing & metadata ingestion, data search & discovery, catalogs browsing, MSS data retrieval, data access (OPeNDAP, FTP, HTTP), data visualization (NCL, Ferret). Servers: Struts on Tomcat; GDS, DODS aggregation server, and LAS, each on Tomcat and each with its own UI.]

Community Data Portal Metadata Software
[Figure: metadata software data flow.]
THREDDS catalogs reference ESG metadata, DC metadata, NcML metadata, and other metadata
THREDDS catalog parser application parses the catalogs
– relational DB (MySQL): shreds XML doc into tables
– XML native DB (Xindice): stores full XML doc
Search & Discovery web application
– simple query (SQL); future advanced query (XPath, XQuery)
– Results: list of triplets (dataset id, metadata schema, metadata URL)
XML viewer web application displays metadata using schema-specific stylesheets
THREDDS catalogs browser web application
Arrow labels: reference, parses, displays, links to, uses
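The "shred into tables, then query with SQL" path can be sketched end to end in a few lines. This example swaps in Python's built-in SQLite for the MySQL database named on the slide, and the table layout, XML sample, and function names are all hypothetical; the only thing taken from the slide is that a search returns triplets of dataset id, metadata schema, and metadata URL.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical metadata document, for illustration only.
SAMPLE = """<dataset id="pcm.sample.run" schema="ESG"
         url="http://example.org/metadata/pcm.sample.run.xml">
  <title>PCM sample run</title>
  <keyword>temperature</keyword>
  <keyword>precipitation</keyword>
</dataset>"""


def shred(conn, xml_text):
    """Flatten one XML metadata document into relational rows."""
    root = ET.fromstring(xml_text)
    dataset_id = root.get("id")
    conn.execute(
        "INSERT INTO dataset VALUES (?, ?, ?)",
        (dataset_id, root.get("schema"), root.get("url")),
    )
    for child in root:
        conn.execute(
            "INSERT INTO field VALUES (?, ?, ?)",
            (dataset_id, child.tag, child.text),
        )


def search(conn, term):
    """Simple SQL query returning (dataset id, schema, URL) triplets."""
    rows = conn.execute(
        "SELECT DISTINCT d.id, d.metadata_schema, d.metadata_url "
        "FROM dataset d JOIN field f ON f.dataset_id = d.id "
        "WHERE f.value LIKE ? ORDER BY d.id",
        (f"%{term}%",),
    )
    return rows.fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dataset (id TEXT PRIMARY KEY, metadata_schema TEXT, metadata_url TEXT)"
)
conn.execute("CREATE TABLE field (dataset_id TEXT, name TEXT, value TEXT)")
shred(conn, SAMPLE)

results = search(conn, "temp")
print(results)
```

Returning a URL rather than the metadata itself keeps the relational side small; the full XML document stays in the native XML store for viewing.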

CDP Data/Catalog Contributors
ACD: MOZART v2.1 standard run (Louisa Emmons)
ATD: Radar almost ready for today!
CGD: CAS satellite data example (Lesley Smith)
CGD: CDAS and VEMAP data (Steve AulenBach, Nan Rosenbloom, Dave Schimmel)
CGD: CCSM 1000 year run (Lawrence Buja)
CGD: PCM 16 top datasets (Gary Strand)
SCD: DSS full data holdings (Bob Dattore, Steve Worley)
SCD: VETS example visualization catalog (Markus Stobbs, Luca Cinquini)
COLA: Jennifer Adams, Jim Kinter, Brian Doty

Next Steps
Recruiting (!)
– One student for data ingest
– One software engineer
– Systems
– Expanding storage by 20TB (SCD cosponsor)
Ongoing publication of datasets
Publishing documents on plans, design, how to partner, standard services, and management procedures
Building partnerships, DMWG meeting August

Closing Thoughts
Building a sustainable infrastructure for the long-term
– Difficult, expensive, and time-consuming
– Requires longer-term projects
Team-building is a critical process
– Collaboration technologies really help
Managing all the collaborations is a challenge
– But extremely valuable
Good progress, first real usage

Links
Earth System Grid
– www.earthsystemgrid.org
Community Data Portal
– dataportal.ucar.edu

END

Longer-term Missions - Observation of Key Earth System Interactions
Terra, Aura, Aqua, Landsat 7
Exploratory - Explore Specific Earth System Processes and Parameters and Demonstrate Technologies
GRACE, PICASSO, Cloudsat, QuikScat, EO-1, ICEsat, Jason-1, SRTM, VCL
We Will Examine Practically Every Aspect of the Earth System from Space in This Decade
Triana
Courtesy of Tim Killeen, NCAR

Characteristics of Infrastructure
Essential
– So important that it becomes ubiquitous
Reliable
– Example: the built environment of the Roman Empire
Expensive
– Nothing succeeds like excess (e.g. Interstate system)
– Inherently one-off (often, few economies of scale)
Clear factorization between research and practice
– Generally deploy what provably works

CDP Interactions & Opportunities
COLA
CGD/VEMAP
ACD,HAO/WACCM
CGD/CCSM, CAM
CGD/CAS
MMM/WRF
UCAR/JOSS
UCAR/Unidata
CGD,SCD,CU/GridBGC
NOAA/NOMADS
GODAE
HAO/TIEGCM,MLSO
ATD/Radar, HIAPER
ACD/Mozart, BVOC, Aqua proposal
BioGeo/CDAS
SCD/DSS
DOE/Earth System Grid
DLESE
GIS Initiative