  • Slide 1
  • Fundamental Practices for Preparing Data Sets
    Robert Cook, ORNL Distributed Active Archive Center, Environmental Sciences Division, Oak Ridge National Laboratory, Oak Ridge, TN
    CC&E Joint Science Workshop, College Park, MD, April 19, 2015
  • Slide 2
  • CC&E Workshop, Best Data Management Practices, April 19, 2015
    Data centers use the 20-year rule
    The data set and accompanying documentation should be prepared for a user 20 years into the future: what does that investigator need to know to use the data?
    Prepare the data and documentation for a user who is unfamiliar with your project, methods, and observations.
    NRC (1991)
  • Slide 3
  • Fundamental Data Practices
    1. Define the contents of your data files
    2. Define the variables
    3. Use consistent data organization
    4. Use stable file formats
    5. Assign descriptive file names
    6. Preserve processing information
    7. Perform basic quality assurance
    8. Provide documentation
    9. Protect your data
    10. Preserve your data
  • Slide 4
  • 1. Define the contents of your data files
    Content flows from the science plan (hypotheses) and is informed by the requirements of the final archive.
    Keep a set of similar measurements together in one file: same investigator, methods, time basis, and instrument.
    There are no hard and fast rules about the contents of each file.
  • Slide 5
  • 2. Define the variables
    1. Choose the units and format for each variable,
    2. Explain the format in the metadata, and
    3. Use that format consistently throughout the file
    Representation of dates and times (Date / Time Example)
    o Use yyyymmdd, e.g., January 2, 1999 is 19990102
    o Report in both local time and Coordinated Universal Time (UTC), in 24-hour notation (13:30 hrs instead of 1:30 p.m.)
    Use a code (e.g., -9999) for missing values
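The date, time, and missing-value conventions on this slide can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `format_observation`, the field names, and the UTC-5 offset are assumptions chosen for the example.

```python
from datetime import datetime, timedelta, timezone

MISSING = -9999  # missing-value code suggested on the slide


def format_observation(local_dt, temp_c=None):
    """Format one timestamped value using yyyymmdd dates and 24-hour times.

    local_dt must be timezone-aware so the UTC equivalent can be derived.
    """
    utc_dt = local_dt.astimezone(timezone.utc)
    return {
        "date": local_dt.strftime("%Y%m%d"),       # yyyymmdd, local date
        "local_time": local_dt.strftime("%H:%M"),  # 24-hour notation
        "utc_date": utc_dt.strftime("%Y%m%d"),     # UTC date can differ from local
        "utc_time": utc_dt.strftime("%H:%M"),
        "temp_c": MISSING if temp_c is None else temp_c,
    }


# Hypothetical observation: 1:30 p.m. on January 2, 1999, at a site on UTC-5,
# with no temperature reading available.
site_zone = timezone(timedelta(hours=-5))
record = format_observation(datetime(1999, 1, 2, 13, 30, tzinfo=site_zone))
print(record)
```

Whatever formats are chosen, the same ones should be described in the metadata and used throughout the file.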
  • Slide 6
  • ISO Formatted Dates Sort Chronologically
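A small Python sketch of why this holds: when dates are written year first with zero padding, plain text sorting matches chronological order, whereas a month-first format does not. The sample dates are made up for illustration.

```python
# The same three dates written in two formats (illustrative values).
iso_dates = ["19990102", "19981231", "19990110"]
month_first_dates = ["01/02/1999", "12/31/1998", "01/10/1999"]

# Plain string sorting, as a spreadsheet or file browser would do it.
sorted_iso = sorted(iso_dates)
sorted_month_first = sorted(month_first_dates)

print(sorted_iso)          # chronological order
print(sorted_month_first)  # December 1998 ends up last, which is wrong
```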
  • Slide 7
  • 2. Define the variables (cont)
    Use commonly accepted variable names and units
    o ORNL DAAC Best Practices (Hook et al., 2010): additional examples of variable names, units, and their formats
    o Next Generation Ecosystem Experiment Arctic: guidance for variable names and units
    o FLUXNET: guidance for flux tower variable names and units
    o UDUNITS: unit database and conversion between units
    o CF Standard Name: Climate Forecast (CF) standards promote sharing
    o International System of Units
  • Slide 8
  • 2. Define the variables (cont)
    Variable Table (Scholes, 2005)
    o Be consistent
    o Explicitly state units
    o Use ISO formats
  • Slide 9
  • 2. Define the variables
    Site Table

    Site Name            Site Code   Latitude (deg)   Longitude (deg)   Elevation (m)   Date
    Kataba (Mongu)       k           -15.43892        23.25298          1195            2000.02.21
    Pandamatenga         p           -18.65651        25.49955          1138            2000.03.07
    Skukuza Flux Tower   skukuza     -31.49688        25.01973          365             2000.06.15

    Scholes, R. J. 2005. SAFARI 2000 Woody Vegetation Characteristics of Kalahari and Skukuza Sites. Data set. Available on-line [http://daac.ornl.gov/] from Oak Ridge National Laboratory Distributed Active Archive Center, Oak Ridge, Tennessee, U.S.A. doi:10.3334/ORNLDAAC/777
  • Slide 10
  • 3. Use consistent data organization (one good approach)
    Each row in a file represents a complete record, and the columns represent all the variables that make up the record.

    Station   Date       Temp   Precip
    Units     YYYYMMDD   C      mm
    HOGI      19961001   12     0
    HOGI      19961002   14     3
    HOGI      19961003   19     -9999

    Note: -9999 is a missing value code for the data set
  • Slide 11
  • 3. Use consistent data organization (a 2nd good approach)
    Variable name, value, and units are placed in individual rows. This approach is used in relational databases.

    Station   Date       Variable   Value   Unit
    HOGI      19961001   Temp       12      C
    HOGI      19961002   Temp       14      C
    HOGI      19961001   Precip     0       mm
    HOGI      19961002   Precip     3       mm
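The two layouts hold the same information, so one can be derived from the other. The following Python sketch converts the one-record-per-row layout into the one-variable-per-row layout using the HOGI values from the slides. The function name `wide_to_long` and the dictionary layout are assumptions made for this example.

```python
# Records in the first layout: one row per station and date.
WIDE_ROWS = [
    {"Station": "HOGI", "Date": "19961001", "Temp": 12, "Precip": 0},
    {"Station": "HOGI", "Date": "19961002", "Temp": 14, "Precip": 3},
]

# Units for each measured variable, taken from the units row of the table.
UNITS = {"Temp": "C", "Precip": "mm"}


def wide_to_long(rows, units):
    """Return one row per (station, date, variable) with its value and unit."""
    long_rows = []
    for variable, unit in units.items():
        for row in rows:
            long_rows.append({
                "Station": row["Station"],
                "Date": row["Date"],
                "Variable": variable,
                "Value": row[variable],
                "Unit": unit,
            })
    return long_rows


long_rows = wide_to_long(WIDE_ROWS, UNITS)
for row in long_rows:
    print(row)
```

Either layout works as long as it is applied consistently across the data set.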
  • Slide 12
  • 3. Use consistent data organization (cont)
    Be consistent in file organization and formatting
    o Don't change or re-arrange columns
    Include header rows
    o The first row should contain file name, data set title, author, date, and companion file names
    o Column headings should describe the content of each column, including one row for variable names and one for variable units
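One way to produce such header rows is sketched below with Python's standard `csv` module. The file name, title, author, and companion file name are hypothetical placeholders, and `write_with_headers` is a name invented for this example; only the HOGI records come from the slides.

```python
import csv
import io


def write_with_headers(stream, metadata, names, units, records):
    """Write a metadata row, a variable-name row, a units row, then the data."""
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerow(metadata)  # file name, title, author, date, companion files
    writer.writerow(names)     # one row for variable names
    writer.writerow(units)     # one row for variable units
    writer.writerows(records)


# An in-memory buffer stands in for a real file in this sketch.
buffer = io.StringIO()
write_with_headers(
    buffer,
    metadata=[
        "hogi_met_1996.csv",                 # hypothetical file name
        "Hypothetical HOGI daily weather",   # hypothetical data set title
        "A. Author",                         # hypothetical author
        "19961004",                          # date in yyyymmdd
        "hogi_met_1996_readme.txt",          # hypothetical companion file
    ],
    names=["Station", "Date", "Temp", "Precip"],
    units=["", "YYYYMMDD", "C", "mm"],
    records=[
        ["HOGI", "19961001", 12, 0],
        ["HOGI", "19961002", 14, 3],
        ["HOGI", "19961003", 19, -9999],  # -9999 marks a missing value
    ],
)
text = buffer.getvalue()
print(text)
```

Keeping the column order fixed in `names` and `units` means every file written this way has the same structure.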
  • Slide 13
  • Example of Poor Data Practices for Collaboration and Data Sharing
    Problems with spreadsheets
    o Multiple tables
    o Embedded figures
    o No headings / units
    o Poor file names
    Courtesy of Stefanie Hampton, NCEAS
  • Slide 14
  • Stable Isotope Data at ORNL: tabular csv format
    Aranibar and Macko. 2005. doi:10.3334/ORNLDAAC/783
  • Slide 15
  • 4. Use stable file formats
    Los[e] years of critical knowledge because modern PCs could not always open old file formats.
    Lesson: Avoid proprietary formats. They may not be readable in the future.
    http://news.bbc.co.uk/2/hi/6265976.stm
  • Slide 16
  • 4. Use stable file formats (cont)
    Use text (ASCII) file formats for tabular data, e.g., .txt or .csv (comma-separated values)
    Aranibar, J. N. and S. A. Macko. 2005. SAFARI 2000 Plant and Soil C and N Isotopes, Southern Africa, 1995-2000. Data set. Available on-line [http://daac.ornl.gov/] from Oak Ridge National Laboratory Distributed Active Archive Center, Oak Ridge, Tennessee, U.S.A. doi:10.3334/ORNLDAAC/783
  • Slide 17
  • 4. Use stable file formats (cont)
    Suggested Geospatial File Formats
    Raster formats
    o GeoTIFF
    o netCDF (with CF convention preferred)
    o HDF
    o ASCII (plain text file gridded format with external projection information)
    Vector
    o Shapefile
    o ASCII
    Figure examples: GTOPO30 Elevation, Minimum Temperature
  • Slide 18
  • 5. Assign descriptive file names
    Use descriptive file names
    o Unique
    o Reflect contents
    o ASCII characters only
    o Avoid spaces
    Bad: Mydata.xls, 2001_data.csv, best version.txt
    Better: bigfoot_agro_2000_gpp.tiff
    o bigfoot: Project Name
    o agro: Site name
    o 2000: Year
    o gpp: What was measured
    o tiff: File Format
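The naming pattern in the "Better" example can be turned into a small helper that assembles the name and rejects spaces and non-ASCII characters. This is only a sketch: the function `build_file_name`, its parameter order, and the rule of allowing letters, digits, and hyphens inside each part are assumptions, not a convention stated in the slides.

```python
import re

# Each part may contain only ASCII letters, digits, and hyphens (assumed rule).
SAFE_PART = re.compile(r"[A-Za-z0-9-]+")


def build_file_name(project, site, year, variable, extension):
    """Join project, site, year, and variable with underscores, then add the extension."""
    parts = [project, site, str(year), variable]
    for part in parts + [extension]:
        if not SAFE_PART.fullmatch(part):
            raise ValueError(f"unsafe file name part: {part!r}")
    return "_".join(parts).lower() + "." + extension.lower()


name = build_file_name("bigfoot", "agro", 2000, "gpp", "tiff")
print(name)

# A part containing a space, as in "best version.txt", is rejected.
try:
    build_file_name("best version", "agro", 2000, "gpp", "txt")
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```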
  • Slide 19
  • Courtesy of PhD Comics
  • Slide 20
  • 5. Assign descriptive file names (cont)
    Organize files logically
    Make sure your file system is logical and efficient
    Folders shown: Biodiversity, Lake, Experiments, Field work, Grassland
    Files shown:
    o Biodiv_H20_heatExp_2005_2008.csv
    o Biodiv_H20_predatorExp_2001_2003.csv
    o Biodiv_H20_planktonCount_start2001_active.csv
    o Biodiv_H20_chla_profiles_2003.csv
    From S. Hampton
  • Slide 21
  • 6. Preserve processing information
    Keep raw data raw:
    o Do not include transformations, interpolations, etc. in the raw file
    o Make your raw data read-only to ensure no changes
    Raw Data File: Giles_zoopCount_Diel_2001_2003.csv

    TAX   COUNT        TEMPC
    C     3.97887358   12.3
    F     0.97261354   12.7
    M     0.53051648   12.1
    F     0            11.9
    C     10.8823893   12.8
    F     43.5295571   13.1
    M     21.7647785   14.2
    N     61.6668725   12.9

    ### Giles_zoop_temp_regress_4jun08.r
    ### Load data
    Giles