A Global Research Data Platform, Ian Foster, Argonne National Laboratory & The University of Chicago


Post on 17-Jul-2020


Page 1

A Global Research Data Platform

Ian Foster
Argonne National Laboratory & The University of Chicago

Page 2

Cloud platforms have transformed how software is developed and delivered

Infrastructure as a service: IaaS

Platform as a service: PaaS

Software as a service: SaaS (web & mobile apps)

Page 3

Cloud platforms have transformed how software is developed and delivered

Can we do the same for science?
• Identify cross-cutting capabilities required by many groups
• Define simple REST APIs for accessing those capabilities
• Operate high-quality, scalable, secure, performant cloud-hosted implementations
• Ensure persistence and evolution over time

In so doing, enable many scientists and tool developers to automate and outsource tasks that are not central to their core mission, and thus reduce costs, increase quality, and promote interoperability

Page 4

What tasks?

• Auth: Manage identities, authentication, and authorization
• Transfer: Manage movement of files from A to B
• Sharing: Manage who can access data at a location
• Publish: Preserve, identify, describe, curate
• Search: Index and search data
• Identifiers: Assign identifiers to collections of files
• Automate: Organize sets of activities
• Learn: Discover, train, run machine learning models
• …

Page 5

globus.org

Automate

Page 6

An example -- NCAR RDA: climate data

• Globus Auth for single sign-on & federated login

• Globus sharing for restricted data access

• Globus transfer for data movement

• Globus management for administrative tasks
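
A minimal sketch of what "sharing for restricted data access" means in practice: each access rule names who may touch which directory and with what permission. This is an illustration only, not the Globus API; the `AccessRule` class, the `is_allowed` helper, the identities, and the paths are all hypothetical.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class AccessRule:
    principal: str    # identity the rule applies to (hypothetical names below)
    path: str         # directory covered by the rule, including everything beneath it
    permissions: str  # "r" for read-only, "rw" for read and write


def is_allowed(rules, principal: str, path: str, want: str) -> bool:
    """Return True if any rule grants `want` ("r" or "w") on `path` to `principal`."""
    target = PurePosixPath(path)
    for rule in rules:
        root = PurePosixPath(rule.path)
        # Compare whole path components so /data/restricted does not cover /data/restricted2.
        covers = target == root or root in target.parents
        if rule.principal == principal and covers and want in rule.permissions:
            return True
    return False


if __name__ == "__main__":
    rules = [AccessRule("alice@example.org", "/data/restricted", "r")]
    print(is_allowed(rules, "alice@example.org", "/data/restricted/set1/file.nc", "r"))
    print(is_allowed(rules, "bob@example.org", "/data/restricted/set1/file.nc", "r"))
```

The point of outsourcing is that a portal would hand this bookkeeping to the platform and never store or evaluate such rules itself.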

Page 7

https://docs.globus.org/mrdp

Page 8

A key message: Outsource all that you can to cloud-hosted automation platforms

For example:
• Outsource responsibility for determining user identities
• Outsource control over who can access different data and services within the portal
• Outsource responsibility for managing data uploads and downloads between various locations and storage systems
• Leverage standard web user interfaces for common user actions

Page 9

Automate and outsource: Publication and discovery

Move to permanent location (or publish in place)

• Compute and record checksums
• Obtain and record metadata
• Assign persistent identifier
• Index for discovery

2 petabytes
100 Gbps
Globus APIs

Page 10

Automate and outsource: Publication and discovery

Move to permanent location (or publish in place)

• Compute and record checksums
• Obtain and record metadata
• Assign persistent identifier
• Index for discovery

Data Publication

Indexing

materialsdatafacility.org

2 petabytes
100 Gbps
Globus APIs

Page 11

Automate and outsource: Publication and discovery

Programmatic access (REST, Python, Jupyter)

Web browse and search

Data Publication

Indexing

materialsdatafacility.org

2 petabytes
100 Gbps
Globus APIs
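
The publication steps named in this sequence of slides (compute and record checksums, obtain and record metadata, assign a persistent identifier, index for discovery) can be sketched locally with the Python standard library. This is a hedged illustration, not the Materials Data Facility or Globus implementation: `build_publication_record` and `search` are hypothetical helpers, the record layout is an assumption, and the `urn:uuid:` value is only a placeholder for an identifier that a real service would mint.

```python
import hashlib
import json
import tempfile
import uuid
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large files never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_publication_record(data_dir: Path, title: str) -> dict:
    """Checksum every file, gather basic metadata, and attach an identifier."""
    files = []
    for path in sorted(p for p in data_dir.rglob("*") if p.is_file()):
        files.append({
            "path": path.relative_to(data_dir).as_posix(),
            "bytes": path.stat().st_size,
            "sha256": sha256_of(path),
        })
    return {
        # Placeholder only: a real service would mint a DOI or Handle here.
        "identifier": f"urn:uuid:{uuid.uuid4()}",
        "title": title,
        "created": datetime.now(timezone.utc).isoformat(),
        "files": files,
    }


def search(index: list, term: str) -> list:
    """Toy index for discovery: case-insensitive substring match on titles."""
    return [rec for rec in index if term.lower() in rec["title"].lower()]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "scan_001.dat").write_bytes(b"example detector output")
        record = build_publication_record(root, "Example beamline scan")
        print(json.dumps(record, indent=2))
        print(len(search([record], "beamline")), "match(es)")
```

Each of these steps is routine but easy to get subtly wrong at scale, which is the argument for handing them to a hosted service rather than reimplementing them per project.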

Page 12

Open Storage Network: Many Petrels, for everyone

Page 13

Automate and outsource: End-to-end data pipelines

For each dataset, we must apply quality control, assign identifiers, move to compute, extract features, eventually publish to a public repository, …

Building a different custom pipeline for every situation is impractical

Page 14

Automate and outsource: End-to-end data pipelines

For each dataset, we must apply quality control, assign identifiers, move to compute, extract features, eventually publish to a public repository

Building a different custom pipeline for every situation is impractical

Automate: Trigger-action programming (“if this happens, then do that”)

Outsource: Cloud-based trigger-action service for reliability, scalability, ease of use, security, sustainability

Page 15

Automate and outsource: End-to-end pipelines with trigger-action programming

National Facility

Local Storage and Compute
• Quality Control
• Assign Handle

Beamline Instrument

Globus Transfer

Central Storage and Compute (CSC)
• Feature extraction
• Aggregate and convert format

Archive

Page 16

Automate and outsource: End-to-end pipelines with trigger-action programming

National Facility

Local Storage and Compute
• Quality Control
• Assign Handle

Beamline Instrument

• Email / SMS notification

Globus Transfer

Central Storage and Compute (CSC)
• Feature extraction
• Aggregate and convert format

Archive

Rules
• IF new files THEN run quality control scripts
• IF quality is good THEN send email and transfer data to CSC

Page 17

Automate and outsource: End-to-end pipelines with trigger-action programming

National Facility

Local Storage and Compute
• Quality Control
• Assign Handle

Beamline Instrument

• Email / SMS notification

Globus Transfer

Central Storage and Compute (CSC)
• Feature extraction
• Aggregate and convert format

Globus Transfer

Archive
• Set sharing ACLs
• Set timer for publication to Materials Data Facility

Data publication

Rules
• IF new files THEN run quality control scripts
• IF quality is good THEN send email and transfer data to CSC
• IF new files THEN run feature extraction
• IF feature detected THEN transfer data to archival storage
• IF time since ingest > 6 months THEN publish dataset to Materials Data Facility
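
The IF/THEN rules on this slide can be sketched as a small trigger-action engine. This is a toy, in-process illustration under stated assumptions, not the cloud-hosted service the talk describes: the `Rule` and `RuleEngine` classes, the event dictionaries, and the stand-in actions are all hypothetical, and "6 months" is approximated as 182 days.

```python
from collections import deque
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Callable, List


@dataclass
class Rule:
    """One trigger-action pair: IF condition(event) THEN action(event)."""
    name: str
    condition: Callable[[dict], bool]
    action: Callable[[dict], List[dict]]  # returns follow-up events


@dataclass
class RuleEngine:
    rules: List[Rule]
    fired: List[str] = field(default_factory=list)  # names of rules that fired, in order

    def run(self, event: dict) -> None:
        """Process an event and every follow-up event its actions produce."""
        queue = deque([event])
        while queue:
            current = queue.popleft()
            for rule in self.rules:
                if rule.condition(current):
                    self.fired.append(rule.name)
                    queue.extend(rule.action(current))


SIX_MONTHS = timedelta(days=182)  # assumption: "6 months" approximated as 182 days


def quality_control(event: dict) -> List[dict]:
    # Stand-in check: a real pipeline would run the facility's QC scripts.
    verdict = "good" if event["files"] else "bad"
    return [{"type": "qc_result", "quality": verdict, "files": event["files"]}]


def notify_and_transfer(event: dict) -> List[dict]:
    print(f"email: QC passed for {len(event['files'])} file(s); transferring to CSC")
    return [{"type": "new_files", "site": "csc", "files": event["files"]}]


def extract_features(event: dict) -> List[dict]:
    # Stand-in detector: treats any file name containing "peak" as a detected feature.
    detected = any("peak" in name for name in event["files"])
    return [{"type": "features", "detected": detected, "files": event["files"]}]


def archive(event: dict) -> List[dict]:
    print("transferring data to archival storage")
    return []


def publish(event: dict) -> List[dict]:
    print("publishing dataset")
    return []


RULES = [
    Rule("qc", lambda e: e["type"] == "new_files" and e.get("site") == "local", quality_control),
    Rule("notify+transfer", lambda e: e["type"] == "qc_result" and e["quality"] == "good", notify_and_transfer),
    Rule("features", lambda e: e["type"] == "new_files" and e.get("site") == "csc", extract_features),
    Rule("archive", lambda e: e["type"] == "features" and e["detected"], archive),
    Rule("publish", lambda e: e["type"] == "timer" and e["now"] - e["ingested"] > SIX_MONTHS, publish),
]


if __name__ == "__main__":
    engine = RuleEngine(RULES)
    engine.run({"type": "new_files", "site": "local", "files": ["peak_001.h5"]})
    ingested = datetime(2020, 1, 1)
    engine.run({"type": "timer", "now": ingested + timedelta(days=200), "ingested": ingested})
    print(engine.fired)
```

Because each rule only looks at one event and emits follow-up events, new stages can be added by appending a rule rather than rewriting the pipeline, which is the design choice the slides argue for.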

Page 18

globus.org

Page 19

I reported on the work of many talented people

Ben Blaiszik, Steve Tuecke, Kyle Chard, Jim Pruyne, Logan Ward, Rachana Ananthakrishnan, Ryan Chard, Mike Papka, Rick Wagner

Thanks also to:
• Other members of the Globus team
• Participants in the Open Storage Network project
• Globus users around the world

We are grateful to our sponsors

DLHub, Globus, IMaD, Petrel, Argonne Leadership Computing Facility

Page 20

Modern information technology is about leveraging platforms to outsource and automate

For example:
• Outsource responsibility for determining user identities
• Outsource control over who can access different data and services within the portal
• Outsource responsibility for managing data uploads and downloads between various locations and storage systems
• Leverage standard web interfaces for common user actions
• Outsource data publication, search, pipeline management, …