Large-Scale Data Collection Metadata Management at the National Computational Infrastructure (NCI)

Jingbo Wang 1, Irina Bastrakova 2, Daisy Duursma 3, Ben Evans 1, Kashif Gohar 1, Tim Mackey 2, Julia Martin 4, Matt Paget 5, Gerry Ryder 4, Guru Siddeswara 3, Lesley Wyborn 1

1 ANU, 2 Geoscience Australia, 3 TERN, 4 ANDS, 5 CSIRO

nci.org.au, @NCInews




Overview

• 10+ PB of research data
• VDI: cloud-scale user desktops on data
• Server-side analysis and visualization
• Data Services
• THREDDS
• Web-time analytics software

Evans et al. 2014 (in press @ ISESS)
© National Computational Infrastructure 2014


NCI Data Collections

BOM, GA, CSIRO, ANU, International, Other National

• CMIP5: 3 PB
• Astronomy (Optical): 200 TB
• Water/Ocean: 1.5 PB
• Atmosphere: 2.4 PB
• Earth Observation: 2 PB
• Marine Videos: 10 TB
• Geophysics: 300 TB
• Weather: 340 TB


NCI Data Collections cont.

We have established a petascale national data resource that is co-located with high-performance computing.
• NCI partners: Australian National University, Bureau of Meteorology, CSIRO and Geoscience Australia
• Support from the Australian Department of Education under Research Data Storage Infrastructure (RDSI)

NCI manages 38+ data collections (10+ PB) in 7 categories:
1) earth system sciences
2) climate and weather model data assets and products
3) earth and marine observations and products
4) geosciences
5) terrestrial ecosystem
6) water management and hydrology
7) others such as astronomy, social science and biosciences


Ingest to availability

[Figure: disparate science data collections (step 1), curated data collections with data and metadata (step 2), ready for data access]

Contributors of disparate science data collections:
• Universities 1, 2, 3…
• Non-government organizations 1, 2, 3…
• Government agencies 1, 2, 3…

Step 1:
• record a data management plan, including the conditions/licenses of source data
• unify the metadata into a single metadata catalogue
• record the (true) source of data
• record product description/algorithm
• tag with controlled vocabularies

Step 2:
• publish to the data services
• record all URIs in the data catalogue
• expose user-level metadata
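The two ingest steps can be sketched as a small state machine. This is an illustrative sketch only, not NCI's implementation: the `CollectionRecord` class, its field names, and the `curate` and `publish` functions are all hypothetical, and the vocabulary and URIs in the usage lines are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class CollectionRecord:
    """Hypothetical catalogue entry for one data collection."""
    name: str
    license: str = ""
    true_source: str = ""
    product_description: str = ""
    keywords: list = field(default_factory=list)
    service_uris: list = field(default_factory=list)
    curated: bool = False
    published: bool = False


def curate(record, license, true_source, description, keywords, vocabulary):
    """Step 1: record licence, true source and product description,
    and accept only keywords found in the controlled vocabulary."""
    unknown = [k for k in keywords if k not in vocabulary]
    if unknown:
        raise ValueError(f"terms not in controlled vocabulary: {unknown}")
    record.license = license
    record.true_source = true_source
    record.product_description = description
    record.keywords = list(keywords)
    record.curated = True
    return record


def publish(record, service_uris):
    """Step 2: record every service URI; refuse records that skipped step 1."""
    if not record.curated:
        raise RuntimeError("complete step 1 before publishing")
    record.service_uris = list(service_uris)
    record.published = True
    return record


# Usage with placeholder values (assumptions, not real NCI entries).
vocabulary = {"climate", "ocean", "geophysics"}
demo = CollectionRecord(name="demo collection")
curate(demo, "CC-BY 4.0", "Example Agency", "Demo product", ["climate"], vocabulary)
publish(demo, ["https://example.org/thredds/catalog/demo"])
```

Making `publish` refuse uncurated records mirrors the ordering on the slide: a collection is only ready for data access once its metadata has been unified and tagged.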


1. Fill in the Data Management Plan

Data Management Plan (DMP) online form (attributes compliant with ISO 19115)

• At NCI, collections are the operational form for data and metadata management

• The DMP tool filters out the heterogeneity of data from different sources in different formats

• An ISO 19115-compliant collection-level catalogue is automatically generated from DMPs

• DMPs also reference related datasets and record the services for accessing the data
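Generating a catalogue record from a DMP could look roughly like the sketch below. It assumes the DMP form yields a simple dictionary (the keys `uuid`, `title` and `abstract` are hypothetical) and emits only a fragment of an ISO 19139 (the XML encoding of ISO 19115) record; a real record needs many more mandatory elements and would not validate as written.

```python
import xml.etree.ElementTree as ET

# Namespaces used by the ISO 19139 XML encoding of ISO 19115.
GMD = "http://www.isotc211.org/2005/gmd"
GCO = "http://www.isotc211.org/2005/gco"
ET.register_namespace("gmd", GMD)
ET.register_namespace("gco", GCO)


def _text(parent, tag, value):
    """Append <gmd:tag><gco:CharacterString>value</...></...> to parent."""
    wrapper = ET.SubElement(parent, f"{{{GMD}}}{tag}")
    ET.SubElement(wrapper, f"{{{GCO}}}CharacterString").text = value
    return wrapper


def dmp_to_iso(dmp):
    """Map a (hypothetical) DMP dictionary to a partial MD_Metadata element."""
    root = ET.Element(f"{{{GMD}}}MD_Metadata")
    _text(root, "fileIdentifier", dmp["uuid"])
    ident = ET.SubElement(root, f"{{{GMD}}}identificationInfo")
    data_id = ET.SubElement(ident, f"{{{GMD}}}MD_DataIdentification")
    citation = ET.SubElement(data_id, f"{{{GMD}}}citation")
    ci_citation = ET.SubElement(citation, f"{{{GMD}}}CI_Citation")
    _text(ci_citation, "title", dmp["title"])
    _text(data_id, "abstract", dmp["abstract"])
    return root


# Usage with placeholder values.
example_dmp = {
    "uuid": "00000000-0000-0000-0000-000000000001",
    "title": "Demo collection",
    "abstract": "Placeholder abstract for illustration.",
}
xml_text = ET.tostring(dmp_to_iso(example_dmp), encoding="unicode")
```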


2. DMPs are mapped to the NCI Catalogue

http://geonetwork.nci.org.au


Catalogue system infrastructure

Proposed multi-lens GeoNetwork architecture (diagram labels):
• Top-level GeoNetwork: Collection and Series (Collection 1, Collection 2, Collection 3, …)
• Lens 1: CSW harvesting and cross-walks (e.g. RIF-CS)
• Lens 2: Domain-specific or user deep query
• Lens 3: Dataset-specific GeoNetworks (Dataset 1, Dataset 2, Dataset 3, … Dataset n)
• Full harvest of the metadata
• Full-search GeoNetwork (or domain): Dataset 1, Dataset 2, Dataset 3, …
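To give a feel for what CSW harvesting involves, the sketch below builds paged CSW 2.0.2 `GetRecords` request URLs using the standard key-value parameters. The endpoint is a placeholder on example.org (the path shown is an assumption, not NCI's service address), and no request is actually sent.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Placeholder endpoint; the real catalogue's CSW path is not given in the slides.
CSW_ENDPOINT = "https://example.org/geonetwork/srv/eng/csw"


def csw_getrecords_url(endpoint, start=1, page_size=10):
    """Build a CSW 2.0.2 GetRecords URL asking for full ISO (gmd) records."""
    params = {
        "service": "CSW",
        "version": "2.0.2",
        "request": "GetRecords",
        "typeNames": "csw:Record",
        "resultType": "results",
        "elementSetName": "full",
        "outputSchema": "http://www.isotc211.org/2005/gmd",
        "startPosition": start,   # CSW positions are 1-based
        "maxRecords": page_size,
    }
    return f"{endpoint}?{urlencode(params)}"


def page_starts(total_records, page_size):
    """Start positions needed to harvest total_records in pages."""
    return list(range(1, total_records + 1, page_size))


# One URL per page for a hypothetical catalogue holding 25 records.
urls = [csw_getrecords_url(CSW_ENDPOINT, start, 10) for start in page_starts(25, 10)]
```

A harvester would fetch each URL in turn and apply a cross-walk (for example to RIF-CS) to the returned records; that part is left out here.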


3. Data services, publishing process and in-situ analysis

• VDI: cloud-scale user desktops on data
• Server-side analysis and visualization
• Data Services
• THREDDS


4. Data citation: DOI minting methodology

NCI data policy (publishing/citation)
• Reflect and interoperate with stakeholder policies and catalogues

Data collections/series/sets analysed:
• Who? Data source; data ownership
• How? Versioning - dynamic data type; granularity
• What? DataCite schema
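Before a DOI is minted, a record can be checked against the properties the DataCite schema treats as mandatory. The sketch below covers the five core ones (identifier, creator, title, publisher, publication year); newer schema versions also require a resource type, which is omitted here. The dictionary layout and the function name are assumptions for illustration, and the DOI shown is a placeholder.

```python
# Core mandatory DataCite properties, using hypothetical dictionary keys.
REQUIRED_FIELDS = ("identifier", "creators", "titles", "publisher", "publicationYear")


def missing_datacite_fields(record):
    """Return the mandatory properties that are absent or empty in record."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


# Placeholder record; none of these values come from a real NCI collection.
candidate = {
    "identifier": "10.5072/example-doi",
    "creators": ["Example Agency"],
    "titles": ["Demo collection, version 1"],
    "publisher": "",
    "publicationYear": 2014,
}

problems = missing_datacite_fields(candidate)
ready_to_mint = not problems
```

Running the check per collection, series or dataset is one way to make the granularity decision explicit: whatever level receives the DOI must be able to supply every mandatory property on its own.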


Data Citation Character Matrix

Characterisation matrix for each collection: map the landscape


5. Overall data publishing procedures

Preparation

• Data management plan completed; collection-level catalogue created
• Group access defined
• License, readme and product descriptions provided
• Directory structure defined
• Update schedule, replication, backup and recovery plan established

Data ingest

• Data replicated from the supplier
• VM and Puppet configured for the specific GeoNetwork dataset
• Metadata catalogue created or harvested from a trusted partner source

Data publishing

• DOI minted and data tagged with a unique identifier
• Hierarchical GeoNetworks linked in parent-child relationships using UUIDs
• Data services repo created and configured, and release tagged
• Data published through the data services, including the relevant data services and filesystem location
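The parent-child linking by UUID can be illustrated with a few lines of Python. This is a sketch under assumptions: records are plain dictionaries with hypothetical `uuid` and `parent_uuid` keys, loosely analogous to the `gmd:parentIdentifier` element in ISO 19139 records, and it does not reflect how GeoNetwork stores the relationship internally.

```python
import uuid


def new_record(title, parent_uuid=None):
    """Create a catalogue record with a fresh UUID and an optional parent link."""
    return {"uuid": str(uuid.uuid4()), "title": title, "parent_uuid": parent_uuid}


def children_of(records, parent_uuid):
    """All records whose parent link points at parent_uuid."""
    return [r for r in records if r["parent_uuid"] == parent_uuid]


def lineage(records, record_uuid):
    """Titles from a record up to its top-level ancestor."""
    by_id = {r["uuid"]: r for r in records}
    chain, seen = [], set()
    current = by_id.get(record_uuid)
    while current is not None and current["uuid"] not in seen:
        seen.add(current["uuid"])          # guard against accidental cycles
        chain.append(current["title"])
        current = by_id.get(current["parent_uuid"])
    return chain


# Usage: one collection with two datasets (placeholder titles).
collection = new_record("Collection 1")
dataset_a = new_record("Dataset 1", parent_uuid=collection["uuid"])
dataset_b = new_record("Dataset 2", parent_uuid=collection["uuid"])
catalogue = [collection, dataset_a, dataset_b]
```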


Thank you and other talks

Tuesday

• Talk: Collaboratively Architecting a Scalable and Adaptable Petascale Infrastructure at NCI, Lesley Wyborn, 5:05 PM - 5:18 PM, Moscone West 2020
• Poster: Enabling Data Intensive Science through Virtual Laboratories and Science Gateways, David Lescinsky, 8:00 AM - 12:20 PM, Moscone South Poster Hall

Friday

• Talk: Computational Environments and Analysis methods on the NCI HPC & HPD platform, Ben Evans, 1:40 PM - 2:10 PM, Moscone West 2020
• Poster: The NCI HPC & HPD Platform for Analysis of Petascale Environmental Collections, Ben Evans, 1:40 PM - 6:00 PM, Moscone South Poster Hall