


GCE Data Toolbox and Metabase: A Sensor-to-Synthesis Software Pipeline for LTER Data Management

Wade M. Sheldon, Jr. (GCE), John F. Chamblee (CWT), Richard H. Cary (CWT)

Data Import
Data can be imported from many sources, including:
- Campbell Data Loggers
- CTDs and Sondes (SeaBird, YSI)
- Groundwater loggers (Hobo, Schlumberger, Aquatroll)
- Online Databases (USGS, NOAA NCDC, NOAA HADS, ClimDB)
- General formats (text, MATLAB, SQL queries)

Data are managed in a standardized data model combining:
- Numeric and text data columns
- Attribute metadata (name, units, data type, ...)
- QA/QC rules and flags
- Documentation metadata
- Processing lineage
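As a rough illustration of how one structure can hold values, attribute metadata, QA/QC rules, flags and lineage together, here is a minimal Python sketch. It is not the GCE Data Toolbox's actual MATLAB data structure: the class names, field names and sample values (`Column`, `DataSet`, `select_rows`, `Temp_Water`, ...) are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Column:
    """One data column with its attribute metadata, QA/QC rule text and flags."""
    name: str
    units: str
    data_type: str
    values: List[Optional[float]]
    qc_rule: str = ""                              # rule text travels with the column
    flags: List[str] = field(default_factory=list)

    def __post_init__(self):
        # One flag slot per value, in the same order as the values.
        if not self.flags:
            self.flags = [""] * len(self.values)
        if len(self.flags) != len(self.values):
            raise ValueError("flags must align with values")


@dataclass
class DataSet:
    """Columns bundled with documentation metadata and processing lineage."""
    title: str
    columns: List[Column]
    documentation: Dict[str, str] = field(default_factory=dict)
    lineage: List[str] = field(default_factory=list)

    def select_rows(self, keep: List[int]) -> "DataSet":
        """Return a derived data set that carries metadata and lineage forward."""
        cols = [
            Column(c.name, c.units, c.data_type,
                   [c.values[i] for i in keep], c.qc_rule,
                   [c.flags[i] for i in keep])
            for c in self.columns
        ]
        step = f"selected {len(keep)} of {len(self.columns[0].values)} rows"
        return DataSet(self.title, cols, dict(self.documentation),
                       self.lineage + [step])


# Made-up example values, for illustration only.
temp = Column("Temp_Water", "deg C", "float", [21.4, None, 22.1],
              qc_rule="flag Q if value < -2 or value > 40")
raw = DataSet("Example sonde data", [temp], {"site": "hypothetical"},
              ["imported from text file"])
subset = raw.select_rows([0, 2])
```

The point of the sketch is that a derived data set (`subset`) is built from the same structure as its parent, so units, rule text, flags and the accumulated lineage cannot drift apart from the values they describe.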

Data Transformation and Synthesis
- Derived data sets can be created by filtering values or refactoring data table structure (combining/splitting columns)
- Data can be re-sampled or summarized by aggregation, binning and date/time scaling
- Multiple data sets can be combined by merging (union) and joining
- All derived data contain complete metadata describing the entire processing history
- QA/QC rules can be generated for derived data columns (%missing, %flagged, ...)
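The aggregation and derived-column statistics listed above can be sketched in a few lines of Python. This is an assumption-laden illustration, not the toolbox's algorithm: the function name `daily_summary`, the record layout and the choice to exclude flagged values from the mean are all hypothetical, and the sample records are made up.

```python
from collections import defaultdict
from datetime import datetime


def daily_summary(records):
    """Aggregate (timestamp, value, flag) records into daily statistics.

    Missing (None) and flagged values are left out of the mean, and the
    percentage of each is reported so rules can be applied to derived columns.
    """
    bins = defaultdict(list)
    for ts, value, flag in records:
        bins[ts.date()].append((value, flag))

    summary = []
    for day in sorted(bins):
        obs = bins[day]
        n = len(obs)
        missing = sum(1 for v, _ in obs if v is None)
        # Count only non-missing values that carry a flag.
        flagged = sum(1 for v, f in obs if v is not None and f)
        good = [v for v, f in obs if v is not None and not f]
        summary.append({
            "date": day.isoformat(),
            "mean": sum(good) / len(good) if good else None,
            "pct_missing": 100.0 * missing / n,
            "pct_flagged": 100.0 * flagged / n,
        })
    return summary


# Made-up observations: two good values, one missing, one flagged, then a new day.
records = [
    (datetime(2012, 6, 1, 0), 20.0, ""),
    (datetime(2012, 6, 1, 6), 22.0, ""),
    (datetime(2012, 6, 1, 12), None, "M"),
    (datetime(2012, 6, 1, 18), 95.0, "Q"),
    (datetime(2012, 6, 2, 0), 21.0, ""),
]
daily = daily_summary(records)
lineage = ["binned to daily means; missing and flagged values excluded"]
```

A rule on the derived data could then, for example, flag any daily mean whose `pct_missing` exceeds a chosen threshold.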

Data & Metadata Curation
- Finalized data & metadata are uploaded to the Metabase Metadata Management System (MMS)
- Dataset metadata are linked to comprehensive databases of personnel, geographic, taxonomic and research project metadata
- Data files are version-controlled and managed on a file server linked to the Metabase MMS

Data Cataloging and Search
- The entire site data catalog is accessible through the Metabase MMS web interface and through a MATLAB search tool
- Data can be searched by core area, research theme, investigator, keywords, study dates, geographic and taxonomic coverage, or text anywhere in the metadata
- Search results can be fine-tuned using a “filter” panel on the results page
- Data set details pages include dynamic links to personnel, study site and project pages, species lists, and related data sets
- Public release dates are enforced automatically, and all data file downloads are tracked for reporting and contacting users

Data and Metadata Distribution
- EML 2.1.0 metadata files are dynamically generated for all data sets in the Metabase MMS on demand
- EML harvest lists are dynamically generated for synchronizing with Metacat and other catalogs
- Revising and re-publishing data in the Metabase re-versions the EML automatically to ensure changes are reflected
- PASTA-ready data and other EML-described entities (KML files, GIS files) are streamed from the file server via URLs in the EML to support Kepler, PASTA and DataONE clients
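To make the re-versioning and URL-based distribution ideas concrete, the Python sketch below bumps the revision of an EML package identifier and writes a skeleton EML document whose data table points at a download URL. It is only an illustration under stated assumptions: it is not Metabase code, the function names, package identifier and URL are hypothetical, and the XML is deliberately incomplete (required elements such as creator, contact and the attribute list are omitted), so it would not validate against the EML 2.1.0 schema.

```python
import xml.etree.ElementTree as ET

EML_NS = "eml://ecoinformatics.org/eml-2.1.0"


def next_package_id(package_id):
    """Increment the revision in a 'scope.identifier.revision' package id."""
    scope, identifier, revision = package_id.rsplit(".", 2)
    return f"{scope}.{identifier}.{int(revision) + 1}"


def build_eml(package_id, title, data_url):
    """Return a skeleton (not schema-valid) EML document as a string."""
    ET.register_namespace("eml", EML_NS)
    root = ET.Element(f"{{{EML_NS}}}eml",
                      {"packageId": package_id, "system": "example"})
    dataset = ET.SubElement(root, "dataset")
    ET.SubElement(dataset, "title").text = title
    table = ET.SubElement(dataset, "dataTable")
    ET.SubElement(table, "entityName").text = title
    # The data file is not embedded; clients follow the URL to stream it.
    physical = ET.SubElement(table, "physical")
    distribution = ET.SubElement(physical, "distribution")
    online = ET.SubElement(distribution, "online")
    ET.SubElement(online, "url").text = data_url
    return ET.tostring(root, encoding="unicode")


# Hypothetical identifier and URL: re-publishing yields a new revision.
published = "example-scope.100.3"
revised = next_package_id(published)
doc = build_eml(revised, "Example data set", "https://example.org/data/100.csv")
```

Because the document is generated on demand from the current database state, a revised data set simply produces a new document with the next revision number rather than an edited copy of the old one.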

Streaming data are supported through every step of a data set’s life-cycle

LTER sites use a wide range of software to acquire, process, quality control and archive data. A separate set of tools is then typically used to produce and manage metadata content and generate EML-described data packages for the LTER NIS. This separation of data and metadata processing is inefficient, risks loss of information, and often delays data release. Software developed at the GCE LTER site - the GCE Data Toolbox for MATLAB - streamlines this process by coupling metadata creation to data processing and quality control. This software also interfaces with the Metabase, a sophisticated Metadata Management System (MMS) that supports data warehousing and automatic distribution of EML-described, version-controlled data through the NIS. Used together, these systems constitute an integrated and highly automated pipeline for producing EML-described data packages for archival and synthesis efforts. This poster describes how this software is used to automate and streamline data management at the GCE and CWT LTER sites, and opportunities for use at other sites.

For More Information
GCE Data Toolbox for MATLAB support site: https://gce-svn.marsci.uga.edu/trac/GCE_Toolbox
Metabase inquiries: Wade Sheldon (sheldon@uga.edu) or John Chamblee (chamblee@uga.edu)

Acknowledgements
This material is based upon work supported by the National Science Foundation under grant numbers OCE-9982133, OCE-0620959, DEB-0823293, DEB-9632854 and DEB-0218001.

Quality Control
- QA/QC “rules” can be defined for each data column (ranging from simple limit checks to parameterized models)
- Qualifier “flags” can be assigned or unassigned graphically
- Flags “shadow” data values
- Flags can be visualized on plots and exported as coded columns
- Flagged values can be cleared or summarized in derived data

All toolbox programs include flagged value options
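The rule-and-flag behaviour described in the Quality Control list can be sketched as follows. This is a simplified Python illustration, not the toolbox's rule syntax: the helper names (`limit_rule`, `missing_rule`, `evaluate`, `clear_flagged`), the flag codes and the salinity values are all hypothetical.

```python
def limit_rule(low, high, flag):
    """Build a rule that returns `flag` for non-missing values outside [low, high]."""
    def rule(value):
        out_of_range = value is not None and not (low <= value <= high)
        return flag if out_of_range else ""
    return rule


def missing_rule(flag="M"):
    """Build a rule that returns `flag` for missing (None) values."""
    def rule(value):
        return flag if value is None else ""
    return rule


def evaluate(values, rules):
    """Return one flag string per value, joining the codes of every rule that fires."""
    return ["".join(rule(v) for rule in rules) for v in values]


def clear_flagged(values, flags):
    """Replace flagged values with None; the flag array itself is left unchanged."""
    return [None if f else v for v, f in zip(values, flags)]


# Made-up salinity readings with a missing value and two out-of-range values.
salinity = [31.2, 45.0, None, 28.7, -1.0]
rules = [missing_rule("M"), limit_rule(0, 40, "I"), limit_rule(5, 38, "Q")]

flags = evaluate(salinity, rules)          # coded column aligned with the data
cleaned = clear_flagged(salinity, flags)   # flagged values removed for analysis
pct_flagged = 100.0 * sum(1 for f in flags if f) / len(flags)
```

Keeping `flags` as a separate array aligned with the data means the original values are never overwritten: flags can be plotted, exported as a coded column, or used to clear values only when a derived product is made.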

Metadata Import and Management
- Metadata are imported with data or assigned automatically by data analysis
- Metadata templates are applied to add complete column metadata, QA/QC rules, and documentation
- Metadata content is transparently updated as data are edited and analyzed
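Template-based metadata assignment might look something like the Python sketch below, in which a template keyed by column name fills in units, data type and a QA/QC rule for matching columns. The template contents, the function name `apply_template` and the policy that imported metadata takes precedence over template defaults are assumptions made for this example, not documented toolbox behaviour.

```python
# Hypothetical template: default metadata keyed by column name.
TEMPLATE = {
    "Salinity": {
        "units": "PSU",
        "data_type": "float",
        "qc_rule": "flag Q if value < 0 or value > 40",
    },
    "Temp_Water": {
        "units": "deg C",
        "data_type": "float",
        "qc_rule": "flag Q if value < -2 or value > 40",
    },
}


def apply_template(columns, template):
    """Fill metadata gaps for columns whose names match the template.

    Returns the enriched column metadata and the names that had no match,
    so unmatched columns can be documented by hand.
    """
    enriched, unmatched = [], []
    for col in columns:
        defaults = template.get(col["name"])
        if defaults is None:
            unmatched.append(col["name"])
            enriched.append(dict(col))
            continue
        # Assumed policy: non-empty imported metadata wins over template defaults.
        imported = {k: v for k, v in col.items() if v not in ("", None)}
        enriched.append({**defaults, **imported})
    return enriched, unmatched


# Made-up import result: one column with empty units, one unknown to the template.
imported_columns = [
    {"name": "Salinity", "units": ""},
    {"name": "Depth", "units": "m"},
]
columns, unmatched = apply_template(imported_columns, TEMPLATE)
```

Reporting the unmatched names, rather than silently skipping them, is a design choice that makes gaps in the documentation visible before data are published.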