@NCInews
Large-Scale Data Collection Metadata Management at the National Computational Infrastructure (NCI)
Jingbo Wang (1), Irina Bastrakova (2), Daisy Duursma (3), Ben Evans (1), Kashif Gohar (1), Tim Mackey (2), Julia Martin (4), Matt Paget (5), Gerry Ryder (4), Guru Siddeswara (3), Lesley Wyborn (1)
(1) ANU, (2) Geoscience Australia, (3) TERN, (4) ANDS, (5) CSIRO
nci.org.au
Overview
10PB+ Research Data
VDI: Cloud scale user desktops on data
Server-side analysis and visualization
Data Services
THREDDS
Evans et al. 2014 (in press @ ISESS)
© National Computational Infrastructure 2014
Web-time analytics software
NCI Data Collections
Data sources: BOM, GA, CSIRO, ANU, International, Other National
CMIP5: 3 PB
Astronomy (Optical): 200 TB
Water Ocean: 1.5 PB
Atmosphere: 2.4 PB
Earth Observation: 2 PB
Marine Videos: 10 TB
Geophysics: 300 TB
Weather: 340 TB
NCI Data Collections cont.
We have established a petascale national data resource that is co-located with high-performance computing.
• NCI partners: Australian National University, Bureau of Meteorology, CSIRO and Geoscience Australia
• Support from the Australian Department of Education under Research Data Storage Infrastructure (RDSI)
NCI manages 38+ data collections (10+ PB) in 7 categories:
1) earth system sciences
2) climate and weather model data assets and products
3) earth and marine observations and products
4) geosciences
5) terrestrial ecosystem
6) water management and hydrology
7) others, such as astronomy, social science and biosciences
Ingest to availability
Disparate science data collections (data and metadata) come from:
• Universities 1, 2, 3…
• Non-government organizations 1, 2, 3…
• Government agencies 1, 2, 3…
Step 1 (disparate collections become curated data collections):
• record a data management plan, including the conditions/licenses of the source data
• unify the metadata into a single metadata catalogue
• record the (true) source of the data
• record the product description/algorithm
• tag with controlled vocabularies
Step 2 (curated collections become ready for data access):
• publish to the data services
• record all URIs in the data catalogue
• expose user-level metadata
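The two ingest steps can be read as a checklist that a collection record must pass before it is exposed. The following is a minimal sketch under that reading, not NCI's actual tooling; the field names (`data_management_plan`, `license`, `source`, `product_description`, `vocabulary_tags`, `service_uris`) are hypothetical and chosen only to mirror the items listed above.

```python
# Hypothetical field names mirroring the step 1 and step 2 items on the slide.
STEP1_FIELDS = (
    "data_management_plan",
    "license",
    "source",
    "product_description",
    "vocabulary_tags",
)
STEP2_FIELDS = ("service_uris",)


def ingest_status(record):
    """Return which stage a collection record has reached."""
    # Step 1: every curation field must be present and non-empty.
    if any(not record.get(name) for name in STEP1_FIELDS):
        return "step 1 incomplete"
    # Step 2: the published service URIs must be recorded in the catalogue.
    if any(not record.get(name) for name in STEP2_FIELDS):
        return "step 2 incomplete"
    return "ready for data access"


curated = {
    "data_management_plan": "dmp-001",
    "license": "CC-BY 4.0",
    "source": "Example supplier",
    "product_description": "Example product",
    "vocabulary_tags": ["climate"],
}
published = dict(curated, service_uris=["http://example.org/thredds/catalog.xml"])
```

Here `ingest_status(curated)` reports that step 2 is still outstanding, while `ingest_status(published)` reports the record as ready.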
1. Fill in the Data Management Plan
Data Management Plan (DMP) online form (attributes compliant with ISO 19115)
• At NCI, collections are the operational form for data and metadata management
• The DMP tool filters out the heterogeneity of data arriving from different sources in different formats
• An ISO 19115-compliant collection-level catalogue is automatically generated from the DMPs
• DMPs reference related datasets and record the services for accessing the data
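To illustrate how a catalogue entry might be generated from a DMP, the sketch below maps a small dictionary to a simplified ISO 19115-style XML record using the ISO 19139 `gmd` and `gco` namespaces. The DMP keys (`uuid`, `title`, `abstract`) are assumptions, the output covers only three elements and is not schema-valid, and this is not the NCI DMP tool itself.

```python
import xml.etree.ElementTree as ET

# ISO 19139 namespaces for the XML encoding of ISO 19115 metadata.
GMD = "http://www.isotc211.org/2005/gmd"
GCO = "http://www.isotc211.org/2005/gco"
ET.register_namespace("gmd", GMD)
ET.register_namespace("gco", GCO)


def _text_element(parent, tag, value):
    """Append <gmd:tag><gco:CharacterString>value</gco:CharacterString></gmd:tag>."""
    wrapper = ET.SubElement(parent, f"{{{GMD}}}{tag}")
    ET.SubElement(wrapper, f"{{{GCO}}}CharacterString").text = value
    return wrapper


def dmp_to_iso_record(dmp):
    """Map a DMP dictionary (hypothetical keys) to a simplified metadata record."""
    root = ET.Element(f"{{{GMD}}}MD_Metadata")
    _text_element(root, "fileIdentifier", dmp["uuid"])
    ident = ET.SubElement(root, f"{{{GMD}}}identificationInfo")
    data_ident = ET.SubElement(ident, f"{{{GMD}}}MD_DataIdentification")
    citation = ET.SubElement(data_ident, f"{{{GMD}}}citation")
    ci_citation = ET.SubElement(citation, f"{{{GMD}}}CI_Citation")
    _text_element(ci_citation, "title", dmp["title"])
    _text_element(data_ident, "abstract", dmp["abstract"])
    return ET.tostring(root, encoding="unicode")


example_xml = dmp_to_iso_record(
    {"uuid": "abc-123", "title": "Example collection", "abstract": "Example abstract"}
)
```

A real record would also need contact, date, language and other mandatory ISO 19115 elements, which are omitted here for brevity.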
2. DMPs are mapped to the NCI Catalogue
http://geonetwork.nci.org.au
Catalogue system infrastructure
Proposed multi-lens GeoNetwork architecture
• Top level GeoNetwork: Collection and Series (Collection 1, Collection 2, Collection 3, …)
• Lens 1: CSW Harvesting and Cross-walks (e.g. RIF-CS)
• Lens 2: Domain Specific or User deep query, served by a Full Search GeoNetwork (or domain) built from a full harvest of the metadata (Dataset 1, Dataset 2, Dataset 3, …)
• Lens 3: Dataset specific GeoNetworks (Dataset 1, Dataset 2, Dataset 3, … Dataset n)
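As an illustration of the CSW harvesting lens, the sketch below builds paged CSW 2.0.2 `GetRecords` request URLs using the standard key-value parameters. It only constructs URLs and sends nothing; the endpoint path `/geonetwork/srv/eng/csw` and the page size are assumptions, not details given in the slides.

```python
from urllib.parse import urlencode


def csw_getrecords_url(endpoint, start=1, page_size=10):
    """Build one CSW 2.0.2 GetRecords request URL (key-value encoding)."""
    params = {
        "service": "CSW",
        "version": "2.0.2",
        "request": "GetRecords",
        "typeNames": "csw:Record",
        "resultType": "results",
        "elementSetName": "full",  # ask for full records, as in a full harvest
        "startPosition": start,    # CSW positions are 1-based
        "maxRecords": page_size,
    }
    return f"{endpoint}?{urlencode(params)}"


def harvest_pages(endpoint, total_records, page_size=10):
    """List the request URLs needed to page through total_records records."""
    return [
        csw_getrecords_url(endpoint, start, page_size)
        for start in range(1, total_records + 1, page_size)
    ]


# Hypothetical endpoint path on the catalogue host named in the slides.
ENDPOINT = "http://geonetwork.nci.org.au/geonetwork/srv/eng/csw"
pages = harvest_pages(ENDPOINT, total_records=25, page_size=10)
```

A harvester would fetch each URL in turn and feed the returned records to the full-search catalogue or to a cross-walk such as RIF-CS.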
3. Data services, publishing process and in-situ analysis
VDI: Cloud scale user desktops on data
Server-side analysis and visualization
Data Services
THREDDS
4. Data citation: DOI minting methodology
NCI data policy (publishing/citation)
• Reflect and interoperate with stakeholder policies and catalogues
Data collections/series/sets analysed:
• Who? Data source, data ownership
• How? Versioning - dynamic data type, granularity
• What? DataCite schema
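Before a DOI is minted, a record can be checked against the core DataCite properties. The sketch below checks the five properties that have long been mandatory in the DataCite schema (identifier, creator, title, publisher, publication year; later schema versions also require a resource type) and formats a citation that loosely follows DataCite's recommended pattern. The dictionary layout and all example values, including the DOI, are placeholders rather than NCI's actual records.

```python
# Core DataCite properties checked before minting (dictionary keys are assumptions).
MANDATORY_FIELDS = ("identifier", "creators", "titles", "publisher", "publicationYear")


def missing_datacite_fields(record):
    """Return the mandatory properties that are absent or empty."""
    return [name for name in MANDATORY_FIELDS if not record.get(name)]


def citation_string(record):
    """Format: Creator (PublicationYear): Title. Publisher. Identifier."""
    creators = "; ".join(record["creators"])
    return (
        f"{creators} ({record['publicationYear']}): {record['titles'][0]}. "
        f"{record['publisher']}. doi:{record['identifier']}"
    )


# Placeholder record; none of these values come from the slides.
example_record = {
    "identifier": "10.5072/example-collection",
    "creators": ["Example Author"],
    "titles": ["Example Collection"],
    "publisher": "Example Publisher",
    "publicationYear": 2014,
}
```

Questions of versioning and granularity (the "How?" row above) would decide which level of the collection/series/dataset hierarchy gets its own identifier; that decision is outside this sketch.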
Data Citation Character Matrix
Characterisation Matrix for each collection: map the landscape
5. Overall data publishing procedures
Preparation
• Data management plan completed, collection-level catalogue created
• Group access defined
• License, readme and product descriptions provided
• Directory structure defined
• Update schedule, replication, backup and recovery plan established
Data ingest
• Data replicated from the supplier
• VM and Puppet configured for the specific GeoNetwork dataset
• Metadata catalogue created or harvested from a trusted partner source
Data Publishing
• DOI minted and data tagged with a unique identifier
• Hierarchical GeoNetworks linked in parent-child relationships using UUIDs
• Data services repo created and configured, and release tagged
• Data published through the data services, including the relevant data services and filesystem location
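The parent-child linking by UUID mentioned under Data Publishing can be sketched as follows. This is an illustration only: the `CatalogueRecord` class and its fields are hypothetical, and in ISO 19115 metadata such a link is commonly carried by a parent identifier element rather than by a Python object.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional
from uuid import uuid4


@dataclass
class CatalogueRecord:
    """Hypothetical catalogue entry; parent_id holds the parent's UUID."""
    title: str
    parent_id: Optional[str] = None
    record_id: str = field(default_factory=lambda: str(uuid4()))


def index_by_id(records: List[CatalogueRecord]) -> Dict[str, CatalogueRecord]:
    """Build a lookup table from UUID to record."""
    return {record.record_id: record for record in records}


def lineage(record: CatalogueRecord, by_id: Dict[str, CatalogueRecord]) -> List[str]:
    """Return titles from the top-level collection down to the given record."""
    chain = []
    current: Optional[CatalogueRecord] = record
    while current is not None:
        chain.append(current.title)
        # Follow the parent UUID upwards; stop at a record with no known parent.
        current = by_id.get(current.parent_id) if current.parent_id else None
    return list(reversed(chain))


collection = CatalogueRecord("Example collection")
series = CatalogueRecord("Example series", parent_id=collection.record_id)
dataset = CatalogueRecord("Example dataset", parent_id=series.record_id)
catalogue = index_by_id([collection, series, dataset])
```

Calling `lineage(dataset, catalogue)` walks from the dataset up to its collection, which is the traversal a top-level GeoNetwork would need in order to present dataset-specific catalogues under their parent.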
Thank you and other talks
Tuesday
Talk: Collaboratively Architecting a Scalable and Adaptable Petascale Infrastructure at NCI, Lesley Wyborn, 5:05 PM - 5:18 PM, Moscone West 2020
Poster: Enabling Data Intensive Science through Virtual Laboratories and Science Gateways, David Lescinsky, 08:00 AM - 12:20 PM, Moscone South Poster Hall
Friday
Talk: Computational Environments and Analysis methods on the NCI HPC & HPD platform, Ben Evans, 1:40 PM - 2:10 PM, Moscone West 2020
Poster: The NCI HPC & HPD Platform for Analysis of Petascale Environmental Collections, Ben Evans, 1:40 PM - 06:00 PM, Moscone South Poster Hall