Data Facilities Workshop - Panel on Current Concepts in Data Sharing & Interoperability
DESCRIPTION
This series of presentations was given at the EarthCube Data Facilities End-User Workshop held January 15-17, 2014 in Washington, DC. This workshop provided a forum to discuss the unique requirements and challenges associated with developing the communication, collaboration, interoperability, and governance structures that will be required to build EarthCube in conjunction with existing and emerging NSF/GEO facilities. This panel and discussion, specifically, outlined and explained several current concepts in data sharing and interoperability, featuring presentations by:
Paul Morin (UMN): Polar Cyberinfrastructure
Don Middleton (UCAR): Atmospheric/Climate
Kerstin Lehnert (LDEO): Domain Repositories & Physical Samples
David Schindel (CBOL, GRBio): Biological Perspective & Collections
Hank Loescher (NEON): Observation Networks
Daniel Fuka (Virginia Tech) and Ruth Duerr (NSIDC): Brokering
Ilya Zaslavsky (UCSD): Cross-Domain Interoperability
TRANSCRIPT
PANEL DISCUSSION – CURRENT CONCEPTS IN DATA SHARING & INTEROPERABILITY
EarthCube Data Facilities Workshop
Wednesday, January 15th 2014
D-10
White Island AWS
Archive includes 5 satellites
New tasking is WV-1 and 2
Geoeye
Quickbird
Ikonos
Worldview 1
Worldview 2
PGC Imagery Viewers · June 24, 2013 7
August 13, 2013
Data Facility Workshop; Arlington, VA.
Data System Interoperability and Standards for UCAR/NCAR and Collaborative Activities
Don Middleton (on behalf of many others)
University Corporation for Atmospheric Research
U.S. National Center for Atmospheric Research
Computational and Information Systems Laboratory
Boulder, Colorado, USA
Data Cyberinfrastructure for “Big Head” and “Long Tail” Scientific Research
Computational and Information Systems Laboratory (CISL) and Earth System Laboratory (NESL)
High Altitude Observatory (HAO)
Mauna Loa Solar Observatory
Earth Observing Lab (EOL)
Field Project Archive
Research Data Archive
Community Data Portal
Earth System Grid
ACADIS Arctic Gateway
UCAR Unidata
netCDF, THREDDS, TDS, LDM, IDV, Rosetta
These systems federate in various ways among themselves, across organizations such as ACADIS, and with external programs such as GCMD, the UN/WMO WIS, ESGF, TIGGE, and others.
ACADIS is a joint venture of NCAR EOL & CISL, the National Snow and Ice Data Center, and UCAR Unidata.
NCAR Wyoming Supercomputing Center, Cheyenne. Disk, archive, and computational resources.
Data Users and Publishers (SelfPub)
ACADIS Gateway
Federation with Other Systems (GCMD, WMO, ADE)
EOL ACADIS Collections (via THREDDS)
NSIDC Arctic Collections (via Brokering)
Future Federated Collections
Core Technology
• Spring Framework
• Hibernate
• Liquibase
• Apache SOLR
• OpenID4Java/OpenSAML
• OAI-PMH, OpenSearch
• ActiveMQ
• FreeMarker
• Java NetCDF Library
• DOIs via EZID/DataCite
Catalog Harvester (OpenSearch, DIF, THREDDS)
OAI-PMH Repository (DC, DIF, ISO)
Discovery Services (Apache SOLR)
Identity Management (OpenID, SAML)
Data Services, Access Control
Publishing Services
Metadata and Database Services
Metrics
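The harvester and repository components listed above rely on OAI-PMH, which is a plain HTTP protocol: a request is a base URL plus a `verb` and a few arguments. A minimal sketch of building such requests follows. The endpoint `https://repository.example.org/oai` and the function names are assumptions made for illustration, not ACADIS addresses or code; `oai_dc` is used because it is the one metadata prefix every OAI-PMH repository has to support.

```python
from urllib.parse import urlencode

# Hypothetical endpoint, used only for illustration.
BASE_URL = "https://repository.example.org/oai"


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date  # selective harvesting by datestamp
    if set_spec:
        params["set"] = set_spec    # selective harvesting by set
    return f"{base_url}?{urlencode(params)}"


def resume_url(base_url, token):
    """Build the follow-up request for a partial list.

    A resumptionToken is sent together with the verb only; the other
    arguments of the original request are not repeated.
    """
    return f"{base_url}?{urlencode({'verb': 'ListRecords', 'resumptionToken': token})}"


print(list_records_url(BASE_URL, from_date="2014-01-15"))
```

A harvester would fetch the first URL, parse the returned XML, and keep calling `resume_url` while the response carries a non-empty resumption token.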
HPSS
RDB
NWSC GLADE
ACADIS Arctic Collections
Automated Modeling and Observation Systems
RESTful PubServices
BagIt (from the LoC)
ACADIS is sponsored by NSF/GEO/PLR
The Chronopolis Data Preservation Network
• A Consortium of UCSD Libraries, SDSC, Univ. of Maryland, and NCAR
• Using LoC BagIt for deposits
• Based on iRODS and ACE (Audit Control Environment)
• TRAC-certified (i.e. ISO 16363)
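BagIt, mentioned above as the deposit format, is a simple directory layout: a `bagit.txt` declaration, a `data/` payload directory, and a manifest of checksums. The sketch below writes and verifies a minimal bag with the Python standard library. It is an illustration of the layout only, not the Library of Congress tooling or the Chronopolis deposit workflow; the function names, the SHA-256 choice, and the example file are assumptions.

```python
import hashlib
from pathlib import Path


def make_bag(bag_dir, payload):
    """Write a minimal BagIt-style bag; payload maps relative file names to bytes."""
    bag = Path(bag_dir)
    data = bag / "data"
    data.mkdir(parents=True, exist_ok=True)
    manifest_lines = []
    for name, content in sorted(payload.items()):
        target = data / name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        # Manifest entries are "<checksum> <path relative to the bag root>"
        manifest_lines.append(f"{digest}  data/{name}")
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n", encoding="utf-8"
    )
    (bag / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )
    return bag


def verify_bag(bag_dir):
    """Return True if every payload file still matches its manifest checksum."""
    bag = Path(bag_dir)
    manifest = (bag / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        digest, rel_path = line.split(maxsplit=1)
        actual = hashlib.sha256((bag / rel_path).read_bytes()).hexdigest()
        if actual != digest:
            return False
    return True
```

The receiving archive can rerun the checksum pass on arrival and at intervals afterwards, which is the kind of fixity check an audit service performs.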
Physical Samples in
courtesy of: Lesley Wyborn, Geoscience Australia (talk at the IGSN workshop at IGC 2012)
Connection to Digital Data
Access to the physical samples is needed to verify & reproduce published observations.
Access to sample metadata is needed for proper interpretation and re-use of sample-based data.
Access to both is needed to facilitate sharing of samples for use & re-use.
▪ Samples are often expensive to collect (drilling, remote locations).
▪ Many samples are unique and irreplaceable.
▪ Re-analysis augments utility of existing data.
Samples in EarthCube End-User Domains
Geochemistry
Structural Geology and Tectonics
Experimental Stratigraphy
Critical Zone Community
Envisioning a Digital Crust
Cyberinfrastructure for Paleogeoscience
Petrology and Geochemistry
Inland Waters
Deep Seafloor Processes and Dynamics
Coral Reef Systems Science
Geochronology
Rock Deformation and Mineral Physics Research
Key Challenges/Needs
“Global Access to Global Collections: establish repositories for all physical samples and the biological, geochemical and physical measurements made from those samples.” (Paleogeoscience)
“Poor and uneven access and management of sample collections, incomplete sample tracking and linking of samples to analyses in the literature and databases, discoverability of existing samples” (Petrology & Geochem)
“Most geological terrains of interest do not have sufficient or even sample density through space and time.” (Petrology & Geochem)
“Central archive of experimental samples with integrated workflows, database templates, and community-wide DOI system for samples” (Mineral Physics & Rock Deformation)
EarthCube SIG
Needs
Infrastructure and resources for preservation and access of physical samples
Tools for repositories to efficiently manage and improve online access to their collections.
Online registry for discovery, access, and preservation of sample data & metadata
Best practices & standards for sample curation and sample sharing for sample data & data exchange
Funding strategies, business models
geosamples.org
A multi-institutional initiative to build a “Digital Environment for Sample Curation”
▪ to advance access and re-use of physical samples
▪ to support and simplify the work of curators
▪ to advance best practices, standards, & policies for sample curation, distribution, attribution, and citation
geosamples.org collaboration
Physical collection facilities
NSF-funded repositories: LDEO, OSU, SIO, LacCore, WHOI, USPRR, UT Austin, ARF, and growing
State Surveys (AASG), USGS
Industry
Data facilities & systems: IGSN/SESAR, IMLGS, USGIN
Computer & Information Science: RENCI, UT Austin
Biocollection informatics: iPlant, iDigBio
DESC Design
Curators (Admin GUI)
Public (Admin GUI)
Samplers (User GUI)
DESC (data, tools, services)
IGSN Registry Publications
Data Systems
2009 recommendations included:
• Increase impact and improve management of collections
• Clarify and standardize management and budgeting for collections
• Create an online clearinghouse of information on Federal scientific collections
SciColl Priorities:
• Develop first cross-disciplinary registry of object-based scientific collections (GRSciColl)
• Promote interdisciplinary research utilizing scientific collections
US Interagency Working Group on Scientific Collections (IWGSC)
• Covers all scientific disciplines
• Created under White House S&T Council, reports to Life Sciences Subcommittee
• ~10 participating Departments/Agencies
• USDA and Smithsonian Co-chairs
SciColl
• Covers all scientific disciplines
• Created under OECD Global Science Forum
• Independent project, no legal status
• National and Institutional memberships
• Governance by Executive Board
• Secretariat Office at Smithsonian
Plants and animals in zoos, botanical gardens, aquariums
Plants and animals in museums, herbaria
GRSciColl
Extraterrestrial samples
Global Registry of Scientific Collections (GRSciColl)
Microbes in BRCs
Human medical samples
Disease banks
Veterinary samples
Standards repositories
Fossils and microfossils
Rocks, sediment and ice cores
Air, water, soil samples
Human artefacts
Living material in genebanks, culture collections
And more, what else?
SciColl and IWGSC ask:
How can we connect collections across disciplines?
Structure of GRSciColl
Institution Table
• Institution ID
• Institution Name
• Institution Discipline(s)
• Primary Contact
Institutional Collection Table
• Institution ID
• Collection ID
• Collection Name
• Collection Discipline
• Content Type(s)
• Primary Contact
Personal Collection Table
• Institution ID = “Personal”
• Collection ID
• Collection Name
• Collection Discipline
• Content Type(s)
• Primary Contact
Contacts Table
• Contact Name
• Primary Institution
• Primary Collection
• Additional Inst/Coll
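The four GRSciColl tables listed on this slide can be sketched as plain record types. This is only an illustration of the structure as the slide describes it, not the registry's actual schema: the class and field names, the helper function, and the sample rows are assumptions. Personal collections reuse the collection record with the institution ID set to "Personal", as the slide indicates.

```python
from dataclasses import dataclass, field
from typing import List

PERSONAL = "Personal"  # Institution ID value the slide gives for personal collections


@dataclass
class Institution:
    institution_id: str
    name: str
    disciplines: List[str]
    primary_contact: str


@dataclass
class Collection:
    institution_id: str  # an Institution ID, or PERSONAL
    collection_id: str
    name: str
    discipline: str
    content_types: List[str]
    primary_contact: str

    @property
    def is_personal(self) -> bool:
        return self.institution_id == PERSONAL


@dataclass
class Contact:
    name: str
    primary_institution: str
    primary_collection: str
    additional: List[str] = field(default_factory=list)  # extra Inst/Coll links


def collections_by_discipline(collections, discipline):
    """Cross-collection lookup keyed on the discipline vocabulary."""
    wanted = discipline.lower()
    return [c for c in collections if c.discipline.lower() == wanted]


# Hypothetical rows, for illustration only.
rows = [
    Collection("INST-1", "C-1", "Ice core archive", "Glaciology", ["ice cores"], "A. Curator"),
    Collection(PERSONAL, "C-2", "Field rock samples", "Geology", ["rocks"], "B. Researcher"),
]
print([c.name for c in collections_by_discipline(rows, "geology")])
```

A lookup like `collections_by_discipline` only works across disciplines if the discipline and content-type values come from shared vocabularies.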
SciColl and IWGSC ask:
What terms constitute the common vocabularies of discipline and content type?
INTEROPERABILITY PHILOSOPHY (OBSERVATIONAL INFRASTRUCTURE)
Hank Loescher | National Ecological Observatory Network (NEON)
Director Strategic Development | CEO Office
Get Specific Data
Many respondents appeared to want more specific details and expressed an interest in data communicated in a form that can be readily used in their work.
Lots and lots of data…
9/2008
10/2009 2/2011
3/2010
Data as a National Resource
NSF Director Suresh’s emphasis on:
• “Era of Observations”
• “Era of Data and Information”
March 2012: White House $200M “Big Data” initiative:
• NSF
• NIH
• DOE
• DOD
• DARPA
• USGS
The President’s Council of Advisors on Science and Technology (PCAST)
The PCAST report (2011) urges that even as the government deals with our nation’s economic challenges, it must:
“…address the threats to both the environmental and the economic aspects of well-being that derive from the accelerating degradation of the environmental capital – the Nation’s ecosystems and the biodiversity they contain”.
PCAST New Directions…..
Weather
Increasing importance of designing new cross-discipline data structures to support policy/decision-making
Societal Benefit Areas (SBAs)
Essential Climate Variables (ECVs)
Essential Biodiversity Variables (EBVs)
Essential Carbon Variables (ECVs)
Aligned with OSTP (NEO, US-GEO) NSF/EU Strategic Planning
Aligned with GEO, GEO-BON, GCOS, Diversitas, WMO, WCRP, etc…
Aligned with Suresh, S., 2012. Research funding: Global challenges need global solutions, Nature, 490, 337-338, doi:10.1038/490337a
Global Themes – Global Observations
Agriculture, Biodiversity, Climate, Disasters, Energy, Ecosystems, Water, Health
Why Interoperability?
• The rapid pace of large-scale environmental global changes underscores the value of accessible long-term data sets.
• Natural, managed, and socioeconomic systems are subject to complex interacting stresses that play out over extended periods of time and space.
• An era of large-scale, interdisciplinary science fueled by large data sets.
• Data Interoperability enhances the value of current scientific efforts and investment.
• Interoperability is needed to forecast future conditions for basic understanding, and for future planning, policy, and societal benefit.
• Currently, there is no accepted approach to make large datasets interoperable.
• Provides new leadership opportunities for scientists globally.
Interoperability Philosophy - scientific utility
1. Linking Science Questions and Hypotheses and Requirements
• Mapping Questions to ‘what must be done’
• ‘how’ data can/will be used jointly
• Defining Joint Science Scope
• Defines interfaces and Functionality
2. Traceability of Measurements
• Use of Recognized Standards
• Traceability to Recognized Standards, or First Principles
• Known and managed signal:noise
• Managing QA/QC
• Uncertainty budgets (ISO traceable)
3. Algorithms/Procedures
• What is the algorithm or procedural process to create a data product?
• Provides “consistent and compatible” data
• Managed through intercomparisons
• What are their relative uncertainties?
4. Informatics
• Standards - Data Formats
• Standards - Metadata formats
• Persistent Identifiers / Open-source / Policies
• Discovery tools / Dissemination / Discovery
• Ontologies, semantics and controlled vocabularies
• Archival and Curation Activities
• Provenance
Interoperability Philosophy - scientific utility
The degree to which Observatories are truly interoperable is the degree to which these four elements are adopted by collaborative facilities
Signal:noise and uncertainty estimates must also be known in order for data to have broader, global utility and prognostic capability (ecological forecasting)
Provides the frame for individual approaches and creativity, spans organizational and programmatic maturity
This Interoperability Framework is currently being implemented as part of a joint EU FP7 and US NSF Project called CoopEUS (www.coopeus.eu)
Facilitates establishing a Baseline/infrastructure with scientific creativity
Is a framework by which all parties can engage (including policy and social dimensions)
Real work, real tasks can be defined
Frontiers - Interoperability
European Union - ICOS
European Union - Lifewatch
Australia – TERN
(EU) France – ANAEE
Mexico and Canada – CarboNA / MexFlux
Korea – KEON/KoFlux/AsiaFlux
China – CERN
iLTER - global
Bottom-up Organizations
Top-down Organizations
The National Ecological Observatory Network is a project sponsored by the National Science Foundation and managed under cooperative agreement by NEON Inc.
Stacking Environmental Observatories - SoS
NEON
Biodiversity Observatories
Other Terrestrial Datasets
Stacking Environmental Observatories - SoS
NEON
Others
Biodiversity datasets
Collapse the layers
“Stuff” in the middle
The Type of Interaction and Efficacy is Dependent on the Organizational Development of the other Institution
NEON Interactions – Other Organizations
• Balancing Scientific Creativity vs. Baseline Infrastructure
• Level of System Engineering Maturity
• Base Capacity - Critical Mass
• Cultural Sensitivity
BCube: A Broker Framework for Next Generation Geoscience
Siri Jodha - PI
Brokering Framework Principles
• A broker connects information resources by mediating interactions between those resources without requiring the maintainers of those resources to adapt their existing systems
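The brokering principle above can be illustrated with a small sketch in which two resources keep their own interfaces and record shapes, and all adaptation lives in the broker. Everything here is hypothetical: `CatalogA`, `CatalogB`, the `Broker` class, and the common record fields are invented for illustration and are not the BCube implementation.

```python
class CatalogA:
    """A resource whose native interface is search() returning dicts."""

    def search(self, keyword):
        records = [{"title": "Greenland 1 km DEM", "fmt": "binary"}]
        return [r for r in records if keyword.lower() in r["title"].lower()]


class CatalogB:
    """A resource whose native interface is find() returning tuples."""

    def find(self, text):
        rows = [("Greenland Ice Sheet Melt Characteristics", "OpenSearch")]
        return [row for row in rows if text.lower() in row[0].lower()]


class Broker:
    """Mediates discovery; the resources themselves are left unchanged."""

    def __init__(self):
        self._adapters = []

    def register(self, name, query_fn, to_common):
        # query_fn: the resource's own query call
        # to_common: translates one native record into the common shape
        self._adapters.append((name, query_fn, to_common))

    def discover(self, keyword):
        results = []
        for name, query_fn, to_common in self._adapters:
            for native in query_fn(keyword):
                record = to_common(native)
                record["source"] = name
                results.append(record)
        return results


broker = Broker()
broker.register("A", CatalogA().search, lambda r: {"title": r["title"], "access": r["fmt"]})
broker.register("B", CatalogB().find, lambda row: {"title": row[0], "access": row[1]})

for hit in broker.discover("greenland"):
    print(hit["source"], hit["title"], hit["access"])
```

Adding a third resource means writing one more adapter inside the broker; neither catalog class has to change, which is the point of the principle.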
EPOS Workshop, Erice 2013
Discover
Evaluate
Access
Use
… a new technological revolution every year …
Brokers mediate between Service Buses
Preparing Data for Ingest, presented 10/27/09 by R. Duerr LID590DCL Foundations of Data Curation
What if....
A scientist could find data and services that matched their interests as easily as subscribing to the news?
myData News.org
Greenland 1 km DEM has been published
A Digital Elevation Model (DEM) of Greenland acquired by A. Researcher is available in binary format at a 1 km grid spacing in a polar stereographic projection ... more
Greenland Ice Sheet Melt Characteristics Data updated
Greenland Ice Sheet Melt Characteristics now available via OpenSearch API
What if....
Scientists could advertise AND INDEX their data so other scientists could find it AND REFERENCE IT, as simply as...
1 - Filling out a web form
2 - Saving it to your website
3 - Adding its link to your site
BCube Broker
• Service Bus Mediator
• Scientific Field to Field Translator
• Crawling, Advertising, (and Indexing)
• http://nsidc.org/bcube
• http://rd-alliance.org
Domain data repositories and cross-disciplinary data integration
governance issues
technical issues
ILYA ZASLAVSKY AND THE EARTHCUBE CINERGI PROJECT (NSF ICER-1343816)
High-level inventory and readiness assessment: viewer
http://connections.earthcube.org
Community Inventory of EarthCube Resources for Geoscience Interoperability
Data discovery is the most often cited issue in executive summaries on the EarthCube web site.
CINERGI
Short questionnaire
Columns: Function, Importance, Comments
Importance scale for every function: 1 2 3 4 5 6 7 (1 = Unimportant, 7 = Essential), or NA / DK
Functions:
• Making metadata from your facility available for search using standard metadata, via standard APIs
• Tracking demand for and cross-domain usage of your resources
• Identifying issues related to data and metadata quality and completeness
• Tracking search hits that become searches for resources managed by your data facility
• Connecting owners of relevant datasets to your facility for potential longer-term data management
• Connecting data from your facility with people, publications, models, and projects
• Identifying communities using data, tools, and models from your facility
• Validating published metadata and service signatures from your facility
• Finding and reporting to you resources that appear as duplicates across multiple registries
Potential added value by a cross-domain system
Integration with cross-domain search
Key characteristics for CINERGI