Data Driven Science: Some Lessons Learned
Data Symposium 2012, SeWHIP & CTSI
John W. Cobb, Ph.D.
Milwaukee, WI, March 1, 2012




2 Managed by UT-Battelle for the U.S. Department of Energy Data Driven Science

Acknowledgement and collaborators

• DataONE http://www.dataone.org/

• California Digital Library http://www.cdlib.org/

• Cornell Lab of Ornithology http://www.birds.cornell.edu/ http://ebird.org/content/ebird/

•  TeraGrid (now XSEDE) https://www.xsede.org/

• National Science Foundation


Outline

• Distinctions between cyberinfrastructure, information technology, computer science, computational science – a personal view

• Requirements analysis for interoperable long-term data archive and curation service (a datanet) with focus on ecological, biological, and environmental sciences

• Description of DataONE datanet services and related services as an exemplar (bulk of talk)




Data driven science

A computing center or a data center?

Rorschach test: What is this?

Images courtesy of the NCCS, ORNL

“A supercomputer is just one more source of petabytes of data”


A (personal) taxonomy of computing research activities

Computer Science (CS)
• Algorithm development
• Computing architecture development
• Language research
• …

Computational Science (Computational Sci)
• PDE solver algorithms
• Finite math representations of continuous PDEs
• Numerical linear algebra
• …

Cyberinfrastructure (CI)
• Developing methods for accessing computing services
• Virtual organization research
• …

Information Technology (IT)
• Research about IT operations
• Provisioning
• Production networking (internal and external connections)
• Cybersecurity
• …

In addition to research efforts, IT, CI, and computational science often have coupled operational tasks.


Contrasts: CI and IT

CI
• Enables research initiatives
• Expands topline
• Seeks new services
• Externally focused
• Project partner

IT
• Enables enterprise integration
• Optimizes bottom line
• Optimizes current services
• Internally focused
• Project component

However, pragmatically:
• University CIOs are often charged with both CI and IT responsibilities and also act as a partner in computational science provisioning. Computer science is usually not under the CIO function. Library science is also separate, although academic library operations occasionally fall under the CIO function.
• Some CI and IT services can look quite similar (system administration, network security, …)


My focus today: CI

“Like the physical infrastructure of roads, bridges, power grids, telephone lines, and water systems that support modern society, cyberinfrastructure refers to the distributed computer, information and communication technologies combined with the personnel and integrating components that provide a long-term platform to empower the modern scientific research endeavor.” (Atkins Report, NSF Blue-Ribbon Advisory Panel on Cyberinfrastructure 2003)


Outline

• Distinctions between cyberinfrastructure, information technology, computer science, computational science – a personal view

• Requirements analysis for interoperable long-term data archive and curation service (a datanet) with focus on ecological, biological, and environmental sciences

• Description of DataONE datanet services and related services as an exemplar (bulk of talk)


What is needed to build a robust set of data services?

• Archiving
• Management
• Metadata regularization
• Curation
• Retention
• Practitioner education
• Socio-cultural barriers
• …


Poor data practice: “data entropy”

[Figure: information content (vertical axis) versus time (horizontal axis), with markers for time of publication, specific details, general details, accident, retirement or career change, and death. (Michener et al. 1997)]

In what sense is modern science reproducible?

cf. Brian Athey and Clifford Lynch this morning


Data management plans now required

NSF Data Management Plan Requirements

“Beginning January 18, 2011, proposals submitted to NSF must include a supplementary document of no more than two pages labeled "Data Management Plan" (DMP). This supplementary document should describe how the proposal will conform to NSF policy on the dissemination and sharing of research results. Proposals that do not include a DMP will not be able to be submitted.”

http://www.nsf.gov/eng/general/dmp.jsp

Other agencies have instituted or are instituting similar requirements


Data deluge and interoperability “the flood of increasingly heterogeneous data”

• Data are heterogeneous
  – Syntax (format)
  – Schema (model)
  – Semantics (meaning)

Jones et al. 2007

Integrating data by hand is time-consuming and brittle
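The three levels of heterogeneity can be made concrete with a small sketch. Everything in it is an assumption for illustration only: the two sources, the field names, and the common schema are made up and do not come from the talk. One source differs from the other in syntax (CSV versus JSON), schema (column names), and semantics (Fahrenheit versus Celsius).

```python
import csv
import io
import json

# Hypothetical inputs: the same kind of observation from two sources.
CSV_SOURCE = "site,temp_f\nA1,68.0\n"                       # CSV, Fahrenheit
JSON_SOURCE = '[{"station": "B2", "temperature_c": 21.5}]'  # JSON, Celsius


def from_csv(text):
    """Parse the CSV source and map it onto the (hypothetical) common schema."""
    rows = csv.DictReader(io.StringIO(text))
    return [
        {
            "site_id": row["site"],
            # Semantics: convert Fahrenheit to Celsius.
            "temp_c": round((float(row["temp_f"]) - 32) * 5 / 9, 2),
        }
        for row in rows
    ]


def from_json(text):
    """Parse the JSON source and rename its fields to the common schema."""
    return [
        {"site_id": rec["station"], "temp_c": float(rec["temperature_c"])}
        for rec in json.loads(text)
    ]


def harmonize():
    """Combine both sources into one list of records with a shared schema."""
    return from_csv(CSV_SOURCE) + from_json(JSON_SOURCE)
```

Each new source needs its own hand-written adapter like `from_csv` or `from_json`, which is one way to see why the by-hand approach scales poorly and breaks when a source changes its format.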


Baseline assessment: scientists (2010)

• Demographics

Discipline
• social sciences 16%
• computer science/engineering 9%
• physical sciences 12%
• environmental sciences 18%
• atmospheric science 4%
• biology 14%
• ecology 18%
• medicine 2%
• other 7%

Work sector
• academic 80%
• government 13%
• commercial 2%
• non-profit 3%
• other 2%

Sample sizes as given on the slide: n=1315 and n=1317 (chart labels: Work Sector, Discipline)

Tenopir, C, Allard S, Douglass K, Aydinoglu AU, Wu L, Read E, Manoff M, Frame M. 2011. Data Sharing by Scientists: Practices and Perceptions. PLoS ONE. 6(6)


What standard do you currently use?

[Bar chart: metadata language versus number of responses, paired in the order shown on the slide]
• DIF 12
• DwC 21
• DC 26
• EML 95
• FGDC 95
• Open GIS 96
• ISO 97
• My Lab 266
• none 676


Many are interested in sharing data

[Bar chart: percent agree, paired in the order shown on the slide]
• Willing to place all of my data into a central data repository with no restrictions: 41%
• Appropriate to create new datasets from shared data: 76%
• Willing to place at least some of my data into a central data repository with no restrictions: 78%
• Willing to share data across a broad group of researchers: 81%


Needs of scientists: the data lifecycle

Lifecycle stages: Collect, Assure, Describe, Deposit, Preserve, Discover, Integrate, Analyze

Questions scientists ask:
• How do I preserve my data?
• What tools do I use?
• Will I get credit for my work?
• How much will it cost?
• What is a data management plan?
• Who can help me?
• What is metadata?
• Where do I preserve my data?


eBird pilot project exploration and visualization

Spatio-Temporal Exploratory Model identifies factors affecting patterns of migration

Diverse bird observations and environmental data from 300,00 locations in the US integrated and analyzed using High Performance Computing Resources

Data sources:
• Land Cover
• Meteorology
• MODIS – Remote sensing data

Goals:
• Examine patterns of migration
• Infer how climate change may affect bird migration

[Figure: model results, occurrence of Indigo Bunting (2008); panels for Jan, Apr, Jun, Sep, Dec]


Secretary Salazar on Birds (May 3, 2011): “The State of the Birds report is a measurable indicator of how well we are fulfilling our shared role as stewards of our nation’s public lands and waters.”

Acadian Flycatcher Distribution – eBird.org


Multiple Scales

“Building the Knowledge Pyramid” 90:10 → 10:90

[Figure: knowledge pyramid with axes “Decreasing Spatial Coverage” and “Increasing Process Knowledge”. Levels as labeled: remote sensing; intensive science sites and experiments; extensive science sites; volunteer & education networks. Adapted from CENR-OSTP]


Tracing requirements

• Multiple scales
• Interoperable across repositories
• Cross organizational (VOs)
• Multiple identities
• Data heterogeneity
• Manage disparate rights policies
• Support all phases of the data life cycle
• Include education and outreach to change community practices


Outline

• Distinctions between cyberinfrastructure, information technology, computer science, computational science – a personal view

• Requirements analysis for interoperable long-term data archive and curation service (a datanet) with focus on ecological, biological, and environmental sciences

• Description of DataONE datanet services and related services as an exemplar (bulk of talk)


DataONE Movie (with Sound)


DataONE is Cyberinfrastructure

Three major components form a flexible, scalable, sustainable network

Member Nodes
• diverse institutions
• serve local community
• provide resources for managing their data
• retain copies of data

Coordinating Nodes
• retain complete metadata catalog
• indexing for search
• network-wide services
• ensure content availability (preservation)
• replication services

Investigator Toolkit

Source: DataONE/Michener


Examples of data holdings

Types of Data Managed | Metadata Standard(s)
Biodiversity, taxonomic, ecological | BDP, DwC, DC, OGIS
Biogeochemical dynamics, terrestrial ecological Earth observation imagery | DIF, BDP, ECHO
Ecological, biodiversity, biophysical, social, genomics, and taxonomic | EML
Avian populations and molecular biology | DwC
Biological and taxonomic | DC subset
Biophysical, biodiversity, disturbance, and Earth observation imagery | EML
Biodiversity, biotic structure, function/process, biogeochemical, climate, and hydrologic | EML

[The slide's "Data Archive" column showed archive logos, which are not preserved in the transcript]

Metadata Interoperability Across Data Holdings

• EML = Ecological Metadata Language
• BDP = Biological Data Profile
• DwC = Darwin Core
• DC = Dublin Core
• DC subset = Dublin Core subset
• ECHO = EOS ClearingHOuse
• OGIS = OpenGIS
• DIF = Directory Interchange Format
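One common way to search across holdings that use different metadata standards is a crosswalk, which maps each standard's fields onto a small shared set. The sketch below is only an illustration of that idea: the field names are simplified stand-ins, not the real element paths of EML or Darwin Core, and nothing here is taken from DataONE's actual implementation.

```python
# Illustrative crosswalks only: simplified stand-in field names,
# NOT the official element names or mappings of either standard.
CROSSWALKS = {
    "EML": {"title": "title", "creator": "creator", "pubDate": "date"},
    "DwC": {"datasetName": "title", "recordedBy": "creator", "eventDate": "date"},
}


def to_common(record, standard):
    """Map a metadata record in the given standard onto shared search fields.

    Source fields with no entry in the crosswalk are dropped; crosswalk
    entries missing from the record are skipped.
    """
    mapping = CROSSWALKS[standard]
    return {
        common: record[source]
        for source, common in mapping.items()
        if source in record
    }
```

For example, `to_common({"datasetName": "Bird counts", "recordedBy": "J. Doe"}, "DwC")` yields a record with `title` and `creator` keys that can be indexed alongside records from an EML holding.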


Initial member nodes

ORNL-DAAC
• Community: Agency repository
• Data: Ecology and biogeochemical dynamics
• Size: 900 data products, ~1 TB
• Services: Tools for data preservation, replication, discovery, access, subsetting, and visualization
• Metadata stds.: FGDC subset
• Degree of curation: High
• Data submission: Agency-approved, staff-assisted submission and curation of final data product
• Sponsor: NASA

Dryad
• Community: Journal consortium
• Data: Biosciences
• Size: ~1,000 data products, ~3 GB
• Services: Tools for data preservation, replication, discovery and access
• Metadata stds.: Dublin Core application profile
• Degree of curation: Medium
• Data submission: Web-based data submission at time of journal article submission
• Sponsor: NSF/JISC, societies, publishers

KNB
• Community: Research network
• Data: Biodiversity, ecology, environment
• Size: 20,000 data products, 100s GBs
• Services: Tools for data preservation, replication, discovery, access, management, and visualization
• Metadata stds.: EML, FGDC
• Degree of curation: Low
• Data submission: Self-submission via desktop tool at any time
• Sponsor: NSF


Preserve Data and Metadata

• Metadata copied to Coordinating Nodes
• Mirrored between CNs
• Data replicated between Member Nodes
• CNs manage copies
• Checksums recorded
• Promote quality metadata
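As a rough illustration of why checksums are recorded, the sketch below compares each replica against the checksum stored when the object was first deposited. The node names, the choice of SHA-256, and the function names are assumptions made for this example; the slides do not say which algorithm or interface DataONE uses.

```python
import hashlib


def checksum(data: bytes, algorithm: str = "sha256") -> str:
    """Return the hex digest of data (algorithm choice is an assumption)."""
    return hashlib.new(algorithm, data).hexdigest()


def verify_replicas(recorded: str, replicas: dict) -> dict:
    """Compare each replica's checksum with the recorded one.

    replicas maps a (hypothetical) node name to the bytes held there.
    Returns {node_name: True if the copy is intact, else False}.
    """
    return {
        node: checksum(content) == recorded
        for node, content in replicas.items()
    }
```

A replica whose entry comes back `False` has drifted from the deposited object and would be a candidate for repair from an intact copy.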
