Archive and Access Practices that Support Data Reuse and Transparency

Steven Worley, Doug Schuster, Bob Dattore
National Center for Atmospheric Research



AGU Fall Meeting, 5-9 Dec 2011, San Francisco, USA 2

Topics

• Data Reuse and Transparency
  • What are these data features?
  • Why are they important?
• Archiving practices
• Access practices


What are these data features?

• Reuse implies:
  • Expanding usage beyond the intended primary community
  • Maintaining reference datasets and building many products from them
• Transparency implies:
  • Reproducibility: the ability to reproduce data files or products for users
  • Traceability: tagging and preserving access details


Why are Reuse and Transparency Important?

Data centers/providers are expected to support fact-based outcomes in science, as has been the tradition, but now also for policy makers, community leaders, individual citizens, and commercial interests.


Supporting New Reuse and Transparency

• Decisions by policy makers
  • Traceable open access sources
• Actions by community leaders
  • Planning for societal services
  • Emergencies, water, energy, etc.
• Usage by citizens and educators
  • Inquisitive science, family activities, safety
  • Science learning
• Commercial applications
  • Tighter coupling between engineering and science
  • Weather forecasts for wind energy production
  • Energy companies contribute mesoscale observations


Archiving practices

• Curation that assures data authenticity
  • Preserve original data formats to the maximum extent possible
  • Maintaining 100% content and accuracy is a serious challenge
• Use a “rich” metadata standard
  • A local standard?
  • Generate discipline and cross-discipline standards, e.g. ISO, DIF, etc.
• Create multiple copies
  • Data files, metadata, documentation, and software
  • Disaster recovery is not a secondary concern


Archiving practices

• Collection completeness and integrity
  • Tightly monitor data work flow
    • Account for every file
    • Read every file
    • Gather, check, preserve metadata
    • Compute and preserve file checksums
  • Maintain dataset lineage / provenance
  • Use approved processes to remove datasets (never?)
• Establish tiered “level of service” for data
  • Move old / superseded versions to lower level
  • Keep all metadata on the highest tier so that it stays discoverable
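The "account for every file, read every file, compute and preserve checksums" steps can be illustrated with a small manifest routine. This is only a sketch under stated assumptions, not a description of the archive's actual tooling: the function names, the choice of SHA-256, and the manifest layout are all hypothetical.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Read the whole file in chunks and return its SHA-256 hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, dict]:
    """Account for every regular file under root: record size and checksum."""
    manifest = {}
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        manifest[path.relative_to(root).as_posix()] = {
            "bytes": path.stat().st_size,
            "sha256": file_sha256(path),
        }
    return manifest


def verify_manifest(root: Path, manifest: dict[str, dict]) -> dict[str, list[str]]:
    """Compare the files on disk against a previously preserved manifest."""
    current = build_manifest(root)
    return {
        "missing": sorted(set(manifest) - set(current)),
        "unexpected": sorted(set(current) - set(manifest)),
        "changed": sorted(
            name
            for name in set(manifest) & set(current)
            if manifest[name]["sha256"] != current[name]["sha256"]
        ),
    }
```

A preserved manifest would be stored outside the directory it describes (otherwise it would be reported as an unexpected file) and re-verified on a schedule, so that silent loss or corruption shows up as a non-empty report.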


Archiving practices

• Explicit data version tracking
  • Internal to files
  • Within data management system
  • Include in all documentation
• Establish Digital Object Identifiers (DOIs)
  • Two-way linkage between publications and data
  • Promotes easy path for follow-on research from publications
  • Leverages skills / facilities of libraries: richer knowledge base
• Create data family tree connections


Dataset Family Tree Example

International Comprehensive Ocean Atmosphere Data Set (ICOADS)
Global marine surface observations (1662-2011)

• HadISST (1871-2011)
• NOAA OI SST (1981-2011)
• NOAA ERSST (1854-2011)
• HadSLP (1871-2011)
• JMA SST (1871-2011)
• Ocean Clouds (1900-2010)
• NOC Surf. Flux (1973-2009)
• WASwind (1950-2009)

Global and Regional Atmospheric and Ocean Re-analyses
NCEP/NCAR, NARR, ERA-40, ERA-Interim, 20CR, OARCA

Etc.


Dataset Family Tree - Evolution

[Figure: two Parent / Child / Grand Child dataset trees, one labeled “Data Center Centric”]

Challenges:
• System of immutable IDs (DOIs?)
• Multi-institution preservation commitment
• Sufficient/synchronized user access speeds
• Transparency across institutions, accepted standards/governance
• Better ways to guide users to a “best” starting point
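The parent, child, and grandchild relationships can be made concrete with a small registry keyed on immutable identifiers. This is a minimal sketch only: the class names, the field names, and the placeholder IDs in the usage note are hypothetical and do not reflect any existing data center system.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    dataset_id: str   # immutable identifier such as a DOI (placeholders in the usage note)
    title: str
    version: str      # explicit version carried with every record
    parents: list[str] = field(default_factory=list)  # IDs of the source datasets


class FamilyTree:
    """Registry of datasets and their upstream links (hypothetical design)."""

    def __init__(self) -> None:
        self._datasets: dict[str, Dataset] = {}

    def add(self, dataset: Dataset) -> None:
        # IDs are immutable: re-registering an existing ID is refused.
        if dataset.dataset_id in self._datasets:
            raise ValueError(f"ID already registered: {dataset.dataset_id}")
        # Parents must be registered first, which also rules out cycles.
        missing = [p for p in dataset.parents if p not in self._datasets]
        if missing:
            raise KeyError(f"unknown parent IDs: {missing}")
        self._datasets[dataset.dataset_id] = dataset

    def ancestors(self, dataset_id: str) -> list[str]:
        """All upstream dataset IDs, nearest first (breadth-first walk)."""
        seen: list[str] = []
        queue = list(self._datasets[dataset_id].parents)
        while queue:
            current = queue.pop(0)
            if current not in seen:
                seen.append(current)
                queue.extend(self._datasets[current].parents)
        return seen

    def descendants(self, dataset_id: str) -> list[str]:
        """All registered datasets derived from dataset_id, directly or indirectly."""
        return [d for d in self._datasets if dataset_id in self.ancestors(d)]
```

For example, after registering `Dataset("id:parent", "Parent", "1.0")`, then `Dataset("id:child", "Child", "1.0", ["id:parent"])`, then `Dataset("id:grandchild", "Grand Child", "1.0", ["id:child"])`, calling `descendants("id:parent")` lists both derived datasets. Walking descendants is one way to find every product affected when a parent is revised. When the links span institutions, the identifiers would need to resolve across data centers, which is the multi-institution challenge listed on this slide.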


Access Practices

• User IDs: key to reproducibility
• Record all data access transactions
  • Who received what and when
• Log product creation constraints from interfaces and web services
  • Space, time, parameters, format translations
• Log software IDs used for product creation
• Benefits
  • Transparency: can reproduce a data access action
  • Feedback to users about data changes
  • Use metrics to inform access service development
• Liability, security of user ID information
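One way to picture an access transaction record is an append-only log with one JSON line per request, capturing who received what and when, the subsetting constraints, and the software used. This is a sketch under assumptions: the field names, the JSON Lines layout, and the helper functions are hypothetical, not the archive's actual logging schema.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class AccessRecord:
    user_id: str            # who
    dataset_id: str         # what (dataset identifier)
    dataset_version: str    # which version of it
    files: list[str]        # files or products delivered
    constraints: dict       # space, time, parameters, format translation
    software_id: str        # ID/version of the software that built the product
    timestamp: str = field(  # when (UTC, ISO 8601)
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def append_record(log_path: str, record: AccessRecord) -> None:
    """Append one transaction as a single JSON line; earlier lines are never rewritten."""
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(asdict(record), sort_keys=True) + "\n")


def records_for_user(log_path: str, user_id: str) -> list[dict]:
    """Return every logged transaction for one user, oldest first."""
    with open(log_path, encoding="utf-8") as log:
        return [rec for rec in map(json.loads, log) if rec["user_id"] == user_id]
```

With the constraints and software ID stored next to the dataset version, a past request can in principle be replayed to reproduce the same product, and users who received a given version can be notified when it changes. Because the log holds user IDs, it needs the same security care that the slide flags.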


Metrics Example

CFSR 6-hourly, GRIB2, 1979-2011, 75 TB, 28K fields/time step, 168K files

October 2011 Metrics

• 30-40 unique users per week
• Customized (subsetting) requests normally deliver more data
• The majority of users are from universities
• 70% request netCDF, 20% include spatial subsetting
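Metrics of this kind can be derived from logged access records. The sketch below assumes each record is a dictionary with hypothetical keys `user_id`, `date` (ISO format), `format`, and `spatial_subset`; it illustrates the calculation only and is not the method used to produce the figures above.

```python
from collections import defaultdict
from datetime import date


def weekly_unique_users(records):
    """Map (ISO year, ISO week) to the number of distinct user IDs seen that week."""
    users = defaultdict(set)
    for rec in records:
        year, week, _ = date.fromisoformat(rec["date"]).isocalendar()
        users[(year, week)].add(rec["user_id"])
    return {key: len(ids) for key, ids in sorted(users.items())}


def share(records, predicate):
    """Percentage of requests for which predicate(record) is true."""
    if not records:
        return 0.0
    return 100.0 * sum(1 for rec in records if predicate(rec)) / len(records)
```

For instance, `share(records, lambda r: r["format"] == "netCDF")` gives the percentage of requests that asked for netCDF output, and `share(records, lambda r: r["spatial_subset"])` gives the percentage that included spatial subsetting.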


Conclusions

Reuse and transparency are rapidly expanding in importance

Many “best practices” in archive management support reuse and transparency

Archive access monitoring is necessary for transparency, reproducibility, and traceability

Need significant improvement in linking data family trees and data to publications to advance reuse


Research Data Archive @ NCAR
http://dss.ucar.edu/


Needs for the new communities

• Documentation that defines data limitations
• More derivative products
  • Condense large collections
  • Generate formats/outputs that easily integrate with their tools
  • Augment models and analyses to produce new products