Data Driven Science: Some Lessons Learned
Data Symposium 2012 SeWHIP & CTSI
John W. Cobb, Ph.D. Milwaukee, WI March 1, 2012
2 Managed by UT-Battelle for the U.S. Department of Energy Data Driven Science
Acknowledgement and collaborators
• DataONE http://www.dataone.org/
• Cal Dig. Lib. http://www.cdlib.org/
• Cornell Lab of Ornithology http://www.birds.cornell.edu/ http://ebird.org/content/ebird/
• TeraGrid (now XSEDE) https://www.xsede.org/
• National Science Foundation
Outline
• Distinctions between cyberinfrastructure, information technology, computer science, computational science – a personal view
• Requirements analysis for interoperable long-term data archive and curation service (a datanet) with focus on ecological, biological, and environmental sciences
• Description of DataONE datanet services and related services as an exemplar (bulk of talk)
Data driven science
A computing center or a data center?
Rorschach test: What is this?
Images courtesy of the NCCS, ORNL
“A supercomputer is just one more source of petabytes of data”
A (personal) taxonomy of computing research activities
Computer Science (CS)
• Algorithm development
• Computing architecture development
• Language research
• …
Computational Science (Computational Sci)
• PDE solver algorithms
• Finite math representations of continuous PDEs
• Numerical linear algebra
• …
Cyberinfrastructure (CI)
• Developing methods for accessing computing services
• Virtual organization research
• …
Information Technology (IT)
• Research about IT operations
• Provisioning
• Production networking (internal and external connections)
• Cybersecurity
• …
In addition to research efforts, IT, CI, and computational science often have coupled operational tasks.
Contrasts: CI and IT
CI
• Enables research initiatives
• Expands topline
• Seeks new services
• Externally focused
• Project partner
IT
• Enables enterprise integration
• Optimizes bottom line
• Optimizes current services
• Internally focused
• Project component
However, pragmatically,
• University CIOs are often charged with both CI and IT responsibilities and also act as a partner in provisioning computational science. Computer science is usually not under the CIO function. Library science is also separate, although academic library operations occasionally fall under the CIO function.
• Some CI and IT services can look quite similar (system administration, network security, …)
My focus today: CI
“Like the physical infrastructure of roads, bridges, power grids, telephone lines, and water systems that support modern society, cyberinfrastructure refers to the distributed computer, information and communication technologies combined with the personnel and integrating components that provide a long-term platform to empower the modern scientific research endeavor.” (Atkins Report, NSF Blue-Ribbon Advisory Panel on Cyberinfrastructure 2003)
Outline
• Distinctions between cyberinfrastructure, information technology, computer science, computational science – a personal view
• Requirements analysis for interoperable long-term data archive and curation service (a datanet) with focus on ecological, biological, and environmental sciences
• Description of DataONE datanet services and related services as an exemplar (bulk of talk)
What is needed to build a robust set of data services?
• Archiving
• Management
• Metadata regularization
• Curation
• Retention
• Practitioner education
• Socio-cultural barriers
• …
Poor data practice “data entropy”
[Figure: information content (vertical axis) versus time (horizontal axis). Labels on the curve: Time of publication; Specific details; General details; Accident; Retirement or career change; Death.] (Michener et al. 1997)
In what sense is modern science reproducible?
cf. Brian Athey and Clifford Lynch this morning
Data management plans now required
NSF Data Management Plan Requirements “Beginning January 18, 2011, proposals submitted to NSF must include a supplementary document of no more than two pages labeled "Data Management Plan" (DMP) . This supplementary document should describe how the proposal will conform to NSF policy on the dissemination and sharing of research results. Proposals that do not include a DMP will not be able to be submitted.” http://www.nsf.gov/eng/general/dmp.jsp
Other agencies have or are instituting requirements
Data deluge and interoperability “the flood of increasingly heterogeneous data”
• Data are heterogeneous
– Syntax (format)
– Schema (model)
– Semantics (meaning)
Jones et al. 2007
Integrating such data by hand is time-consuming and brittle
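The three layers of heterogeneity named above can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the two sample records, the field names (loosely styled on Darwin Core terms), and the target schema are hypothetical and do not come from the talk or from any DataONE tool.

```python
import csv
import io
import json

# Two hypothetical sources describing the same observation.
CSV_SOURCE = "species,temp_f,obs_date\nPasserina cyanea,68.0,2008-06-01\n"
JSON_SOURCE = (
    '[{"scientificName": "Passerina cyanea", '
    '"airTemperatureC": 20.0, "eventDate": "2008-06-01"}]'
)

# Syntax (format): each source needs its own parser.
def parse_csv(text):
    return list(csv.DictReader(io.StringIO(text)))

def parse_json(text):
    return json.loads(text)

# Schema (model): each source names its fields differently,
# so map both onto one hypothetical common schema.
CSV_FIELDS = {"species": "scientific_name", "temp_f": "air_temp", "obs_date": "date"}
JSON_FIELDS = {
    "scientificName": "scientific_name",
    "airTemperatureC": "air_temp",
    "eventDate": "date",
}

def rename(record, mapping):
    return {mapping[key]: value for key, value in record.items() if key in mapping}

# Semantics (meaning): the same field can carry different units.
def fahrenheit_to_celsius(value):
    return (float(value) - 32.0) * 5.0 / 9.0

def harmonize():
    """Return both sources as records in the common schema, in Celsius."""
    records = []
    for raw in parse_csv(CSV_SOURCE):
        rec = rename(raw, CSV_FIELDS)
        rec["air_temp_c"] = round(fahrenheit_to_celsius(rec.pop("air_temp")), 1)
        records.append(rec)
    for raw in parse_json(JSON_SOURCE):
        rec = rename(raw, JSON_FIELDS)
        rec["air_temp_c"] = round(float(rec.pop("air_temp")), 1)
        records.append(rec)
    return records

if __name__ == "__main__":
    for record in harmonize():
        print(record)
```

Every new source in this style needs another hand-written parser, field map, and unit rule, which is one way to see why the manual approach scales poorly.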
Baseline assessment: scientists (2010)
• Demographics
Discipline:
– social sciences 16%
– computer science/engineering 9%
– physical sciences 12%
– environmental sciences 18%
– atmospheric science 4%
– biology 14%
– ecology 18%
– medicine 2%
– other 7%
Work Sector:
– academic 80%
– government 13%
– commercial 2%
– non-profit 3%
– other 2%
Sample sizes as labeled on the slide: n=1315, n=1317 (panels: Work Sector, Discipline)
Tenopir, C, Allard S, Douglass K, Aydinoglu AU, Wu L, Read E, Manoff M, Frame M. 2011. Data Sharing by Scientists: Practices and Perceptions. PLoS ONE. 6(6)
What standard do you currently use?
[Bar chart: number of responses by metadata language, values and labels paired in slide order]
– DIF 12
– DwC 21
– DC 26
– EML 95
– FGDC 95
– Open GIS 96
– ISO 97
– My Lab 266
– none 676
Many are interested in sharing data
[Bar chart: percent agree, values and statements paired in slide order]
– Willing to place all of my data into a central data repository with no restrictions: 41%
– Appropriate to create new datasets from shared data: 76%
– Willing to place at least some of my data into a central data repository with no restrictions: 78%
– Willing to share data across a broad group of researchers: 81%
Needs of scientists: the data lifecycle
[Figure: data lifecycle stages: Collect, Assure, Describe, Deposit, Preserve, Discover, Integrate, Analyze]
Questions scientists ask:
• How do I preserve my data?
• What tools do I use?
• Will I get credit for my work?
• How much will it cost?
• What is a data management plan?
• Who can help me?
• What is metadata?
• Where do I preserve my data?
eBird pilot project exploration and visualization
Spatio-Temporal Exploratory Model identifies factors affecting patterns of migration
Diverse bird observations and environmental data from 300,00 locations in the US integrated and analyzed using High Performance Computing resources
Environmental data layers:
• Land Cover
• Meteorology
• MODIS – Remote sensing data
Goals:
• Examine patterns of migration
• Infer how climate change may affect bird migration
[Figure: model results, occurrence of Indigo Bunting (2008); panels for Jan, Apr, Jun, Sep, Dec]
Secretary Salazar on Birds (May 3, 2011): “The State of the Birds report is a measurable indicator of how well we are fulfilling our shared role as stewards of our nation’s public lands and waters.”
Acadian Flycatcher Distribution – eBird.org
Multiple Scales
“Building the Knowledge Pyramid” 90:10 → 10:90
[Figure: pyramid with axes “Decreasing Spatial Coverage” and “Increasing Process Knowledge”. Levels as listed on the slide: Remote sensing; Intensive science sites and experiments; Extensive science sites; Volunteer & education networks. Adapted from CENR-OSTP.]
Tracing requirements
• Multiple scales
• Interoperable across repositories
• Cross organizational (VOs)
• Multiple identities
• Data heterogeneity
• Manage disparate rights policies
• Support all phases of the data life cycle
• Include education and outreach to change community practices
Outline
• Distinctions between cyberinfrastructure, information technology, computer science, computational science – a personal view
• Requirements analysis for interoperable long-term data archive and curation service (a datanet) with focus on ecological, biological, and environmental sciences
• Description of DataONE datanet services and related services as an exemplar (bulk of talk)
DataONE Movie (with Sound)
DataONE is Cyberinfrastructure
Three major components form a flexible, scalable, sustainable network
Member Nodes
• diverse institutions
• serve local community
• provide resources for managing their data
• retain copies of data
Coordinating Nodes
• retain complete metadata catalog
• indexing for search
• network-wide services
• ensure content availability (preservation)
• replication services
Investigator Toolkit
Source: DataONE/Michener
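The division of labor between Member Nodes and Coordinating Nodes can be pictured with a toy in-memory model. This is only a sketch under stated assumptions: the class names, method names, and the sample identifier are hypothetical and are not the DataONE service API. It only mirrors the roles listed above (Member Nodes hold data; a Coordinating Node holds the metadata catalog, supports search, and tracks replicas).

```python
from dataclasses import dataclass, field

@dataclass
class MemberNode:
    """Hypothetical member node: holds the data objects themselves."""
    name: str
    objects: dict = field(default_factory=dict)  # identifier -> bytes

    def store(self, identifier, data):
        self.objects[identifier] = data

@dataclass
class CoordinatingNode:
    """Hypothetical coordinating node: holds metadata, not the data."""
    catalog: dict = field(default_factory=dict)  # identifier -> metadata dict
    holders: dict = field(default_factory=dict)  # identifier -> member node names

    def register(self, node, identifier, metadata):
        # Record the metadata and which member node holds the object.
        self.catalog[identifier] = metadata
        self.holders.setdefault(identifier, []).append(node.name)

    def search(self, keyword):
        # Naive title search over the metadata catalog.
        needle = keyword.lower()
        return sorted(
            identifier
            for identifier, metadata in self.catalog.items()
            if needle in metadata.get("title", "").lower()
        )

    def replicate(self, identifier, source, target):
        # Copy the object between member nodes and track the new holder.
        target.store(identifier, source.objects[identifier])
        self.holders[identifier].append(target.name)

if __name__ == "__main__":
    mn_a, mn_b = MemberNode("mn_a"), MemberNode("mn_b")
    cn = CoordinatingNode()
    mn_a.store("example:1", b"observation data")
    cn.register(mn_a, "example:1", {"title": "Bird observations"})
    cn.replicate("example:1", mn_a, mn_b)
    print(cn.search("bird"), cn.holders["example:1"])
```

Keeping the catalog separate from the stored objects is what lets search work network-wide while each institution keeps serving its own community.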
Examples of data holdings
Types of data managed and metadata standard(s), one row per data archive (archive names appeared as logos on the slide and are not recoverable here):
• Biodiversity, taxonomic, ecological: BDP, DwC, DC, OGIS
• Biogeochemical dynamics, terrestrial ecological, Earth observation imagery: DIF, BDP, ECHO
• Ecological, biodiversity, biophysical, social, genomics, and taxonomic: EML
• Avian populations and molecular biology: DwC
• Biological and taxonomic: DC subset
• Biophysical, biodiversity, disturbance, and Earth observation imagery: EML
• Biodiversity, biotic structure, function/process, biogeochemical, climate, and hydrologic: EML
Metadata Interoperability Across Data Holdings
Abbreviations:
• EML = Ecological Metadata Language
• BDP = Biological Data Profile
• DwC = Darwin Core
• DC = Dublin Core
• DC subset = Dublin Core subset
• ECHO = EOS ClearingHOuse
• OGIS = OpenGIS
• DIF = Directory Interchange Format
Initial member nodes
ORNL-DAAC
• Community: Agency repository
• Data: Ecology and biogeochemical dynamics
• Size: 900 data products, ~ 1 TB
• Services: Tools for data preservation, replication, discovery, access, subsetting, and visualization
• Metadata stds.: FGDC subset
• Degree of curation: High
• Data submission: Agency-approved, staff-assisted submission and curation of final data product
• Sponsor: NASA
Dryad
• Community: Journal consortium
• Data: Biosciences
• Size: ~ 1,000 data products, ~ 3 GB
• Services: Tools for data preservation, replication, discovery and access
• Metadata stds.: Dublin Core application profile
• Degree of curation: Medium
• Data submission: Web-based data submission at time of journal article submission
• Sponsor: NSF/JISC, societies, publishers
KNB
• Community: Research network
• Data: Biodiversity, ecology, environment
• Size: 20,000 data products, 100s GBs
• Services: Tools for data preservation, replication, discovery, access, management, and visualization
• Metadata stds.: EML, FGDC
• Degree of curation: Low
• Data submission: Self-submission via desktop tool at any time
• Sponsor: NSF
Preserve Data and Metadata
• Metadata copied to Coordinating Nodes
• Mirrored between CNs
• Data replicated between Member Nodes
• CNs manage copies
• Checksums recorded
• Promote quality metadata
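As a rough illustration of why recorded checksums matter for replicated copies, the sketch below records a digest when an object is first stored and later checks each replica against it. The slide does not name a checksum algorithm; SHA-256 is an assumption here, and the function names and node labels are hypothetical.

```python
import hashlib

def checksum(data: bytes) -> str:
    """Return a hex digest for the object (SHA-256 is an assumed choice)."""
    return hashlib.sha256(data).hexdigest()

def verify_replicas(recorded: str, replicas: dict) -> dict:
    """Compare each replica's digest with the recorded one.

    `replicas` maps a (hypothetical) member node name to the bytes it holds.
    Returns node name -> True if the copy matches the recorded checksum.
    """
    return {name: checksum(data) == recorded for name, data in replicas.items()}

if __name__ == "__main__":
    original = b"site,species,count\nA,Passerina cyanea,3\n"
    recorded = checksum(original)  # stored alongside the metadata
    copies = {
        "mn_a": original,
        "mn_b": original.replace(b"3", b"8"),  # simulated silent corruption
    }
    for node, ok in verify_replicas(recorded, copies).items():
        print(node, "intact" if ok else "MISMATCH")
```

A mismatch identifies which copy has drifted, so it can be replaced from a replica that still matches the recorded digest.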