Accelerating discovery via science services

Ian Foster

Life Sciences and Biology

Advanced Materials

Condensed Matter Physics

Chemistry and Catalysis

Soft Materials

Environmental and Geo Sciences

Can we determine pathways that lead to novel states and nonequilibrium assemblies?

Can we observe – and control – nanoscale chemical transformations in macroscopic systems?

Can we create new materials with extraordinary properties – by engineering defects at the atomic scale?

Can we map – and ultimately harness – dynamic heterogeneity in complex correlated systems?

Can we unravel the secrets of biological function – across length scales?

Can we understand physical and chemical processes in the most extreme environments?


We want to accelerate progress on the most pressing questions

Publish results

Collect data

Design experiment

Test hypothesis

Hypothesize explanation

Identify patterns

Analyze data

The discovery process is iterative and time-consuming

Pose question

J.C.R. Licklider, 1960: “About 85% of my ‘thinking’ time was spent getting into a position to think, to make a decision, to learn something I needed to know.”

Outsourcing for economies of scale in the use of automated methods

Automation to apply more sophisticated methods at larger scales

Higgs discovery “only possible because of the extraordinary achievements of … grid computing”—Rolf Heuer, CERN DG

10s of PB, 100s of institutions, 1000s of scientists, 100Ks of CPUs, billions of tasks

Outsourcing and automation: (1) The Grid

Outsourcing and automation: (2) The Cloud

The Software as a Service (SaaS) revolution

Customer relationship management (CRM):

A knowledge-intensive process

Historically, handled manually or via expensive, inflexible on-premise software

SaaS has revolutionized how CRM is consumed
• Outsource to provider who runs software on cloud
• Access via simple interfaces

Ease of use, cost, flexibility, complexity


SaaS vs. on-premise

Where can we automate and outsource in science broadly?

• Run experiment
• Collect data
• Move data
• Check data
• Annotate data
• Share data
• Find similar data
• Link to literature
• Analyze data
• Publish data

Time

Automate and outsource

Science services

Many services are used by science, but have limitations

Science services exist, but do not address whole life cycle

Accelerating discovery via science services

(1) Eliminate data friction

The elimination of data friction is a key to faster discovery

Civilization advances by extending the number of important operations which we can perform without thinking about them (Whitehead, 1912)

Obstacles to data access, movement, discovery, sharing, and analysis slow research, distort research directions, and waste time (DOE reports, 2005-2015)

We have the highways but not the delivery service

Our highways encompass the Internet, ultra-high-speed networks, science DMZs, data transfer nodes, high-speed transport protocols

A good delivery service automates, schedules, accelerates, adapts. It provides APIs for experts and casual users. It cuts costs and saves time.
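One ingredient of such a delivery service is a transfer that checks its own work and retries without user involvement. The sketch below is a minimal local-filesystem illustration in Python, not a description of how Globus is implemented: `sha256_of`, `reliable_copy`, the retry count, and the linear backoff are all assumptions made for the example.

```python
import hashlib
import shutil
import time
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def reliable_copy(src: Path, dst: Path, max_attempts: int = 3,
                  backoff_s: float = 0.1) -> int:
    """Copy src to dst, verify by checksum, and retry on failure.

    Returns the number of attempts used; raises RuntimeError if all fail.
    """
    expected = sha256_of(src)
    for attempt in range(1, max_attempts + 1):
        try:
            shutil.copyfile(src, dst)
            if sha256_of(dst) == expected:
                return attempt
        except OSError:
            pass  # an I/O error counts as a failed attempt
        time.sleep(backoff_s * attempt)  # linear backoff (hypothetical policy)
    raise RuntimeError(f"could not copy {src} after {max_attempts} attempts")
```

A production service would add scheduling, parallel streams, and restart from partial files; the point here is only that verification and retry can be automated so that the researcher does not have to think about them.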

Globus: Research data management as a service

Essential research data management services
• File transfer
• Data sharing
• Data publication
• Identity and groups

Builds on 15 years of DOE research

Outsourced and automated
• High availability, reliability, performance, scalability

Convenient for
• Casual users: Web interfaces
• Power users: APIs
• Administrators: Install, manage

globus.org


“I need to easily, quickly, & reliably move data to other locations.”

Research Computing HPC Cluster

Lab Server

Campus Home Filesystem

Desktop Workstation

Personal Laptop

DOE supercomputer

Public Cloud


One APS node connects to 125 locations


“I need to get data from a scientific instrument to my analysis system.”

Next-Gen Sequencer

Light Sheet Microscope

MRI

Advanced Light Source


“I need to easily and securely share my data with my colleagues.”


Globus and the research data lifecycle

1. Researcher initiates transfer request; or requested automatically by script, science gateway

2. Globus transfers files reliably, securely

3. Researcher selects files to share, selects user or group, and sets access permissions

4. Globus controls access to shared files on existing storage; no need to move files to cloud storage!

5. Collaborator logs in to Globus and accesses shared files; no local account required; download via Globus

6. Researcher assembles data set; describes it using metadata (Dublin Core and domain-specific)

7. Curator reviews and approves; data set published on campus or other system

8. Peers, collaborators search and discover datasets; transfer and share using Globus

Locations: Instrument, Compute Facility, Publication Repository, Personal Computer

Stages: Transfer, Share, Publish, Discover

• SaaS: only a web browser required

• Use storage system of your choice

• Access using your campus credentials
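The sharing steps of the lifecycle (a researcher selects files, selects a user or group, and sets access permissions; access is then controlled on existing storage) can be pictured as a small access-rule table that is consulted on every request. The following Python sketch is only an illustration of that idea: `AccessRule`, `SharedEndpoint`, the `group:` prefix, and the `"r"`/`"rw"` permission strings are hypothetical names, not the Globus API.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from pathlib import PurePosixPath


@dataclass(frozen=True)
class AccessRule:
    principal: str    # a user name, or "group:<name>" (hypothetical convention)
    path: str         # shared folder on the existing storage system
    permissions: str  # "r" or "rw"


@dataclass
class SharedEndpoint:
    rules: list[AccessRule] = field(default_factory=list)
    groups: dict[str, set[str]] = field(default_factory=dict)

    def share(self, principal: str, path: str, permissions: str = "r") -> None:
        """Record that a user or group may access everything under path."""
        self.rules.append(AccessRule(principal, path, permissions))

    def _matches(self, rule: AccessRule, user: str) -> bool:
        if rule.principal.startswith("group:"):
            members = self.groups.get(rule.principal[len("group:"):], set())
            return user in members
        return rule.principal == user

    def allowed(self, user: str, path: str, mode: str = "r") -> bool:
        """Return True if some rule grants the user this mode on this path."""
        target = PurePosixPath(path)
        for rule in self.rules:
            root = PurePosixPath(rule.path)
            inside = target == root or root in target.parents
            if inside and self._matches(rule, user) and mode in rule.permissions:
                return True
        return False
```

The files never move in this model: only the rule table changes when a folder is shared, which is the property the lifecycle emphasizes with "no need to move files to cloud storage".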

Globus and DOE: Terabytes per month

Globus by the numbers
• 5 major services
• 130 federated campus IdPs
• 115 petabytes transferred
• 8,000 managed storage systems
• 20 billion files processed
• 99.95% uptime over past 2 years
• 25,000 registered users
• >30 institutional subscribers
• 3 months: longest transfer
• 1 petabyte: biggest transfer
• 50M: most files in one transfer
• 13 national labs use services
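Two of these figures become more tangible with a little arithmetic. The Python sketch below converts the uptime percentage into hours of downtime and estimates a mean file size. It assumes decimal units (1 PB = 10^15 bytes), 365-day years, and that the 115 PB and the 20 billion files describe the same set of transfers, which the slide does not state; the function names are hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # assumes non-leap years


def downtime_hours(uptime_percent: float, years: float) -> float:
    """Hours of downtime implied by an uptime percentage over a period."""
    return (1 - uptime_percent / 100) * years * HOURS_PER_YEAR


def mean_file_size_mb(total_petabytes: float, total_files: float) -> float:
    """Mean file size in megabytes, using decimal (SI) units."""
    return total_petabytes * 1e15 / total_files / 1e6


if __name__ == "__main__":
    print(f"{downtime_hours(99.95, 2):.2f} hours of downtime")  # about 8.76
    print(f"{mean_file_size_mb(115, 20e9):.2f} MB per file")    # about 5.75
```

Under those assumptions, 99.95% uptime over two years corresponds to roughly 8.8 hours of downtime, and the average transferred file is on the order of a few megabytes.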



(2) Create platform services


Globus service APIs provide elements of a science platform

Globus APIs
• Identity, Group, and Profile Management
• Data Publication & Discovery
• File Sharing
• File Transfer & Replication

Globus Connect

… Globus Toolkit

Publication as service for ACME climate modeling consortium

kbase.us



(3) Liberate scientific data

Q: What is the biggest obstacle to data sharing in science?

A: The vast majority of data is lost, or not online; if online, not described; if described, not indexed. Not accessible. Not discoverable. Not used.

Contrast with common practice for consumer photos (iPhoto)
• Automated capture
• Publish then curate
• Processing to add value
• Outsourced storage

We must automate the capture, linking, and indexing of all data

Globus publication service encodes and automates data publication pipelines

Example application: Materials Data Facility for materials simulation and experiment data

Proposed distributed virtual collections index, organize, tag, & manage distributed data

Think iPhoto on steroids – backed by domain knowledge and supercomputing power



chiDB: Human-computer collaboration to extract Flory-Huggins (χ) parameters from polymers literature. R. Tchoua et al.

Plenario: Spatially and temporally integrated, linked, and searchable database of urban data. C. Catlett, B. Goldstein, T. Malik et al.


Flory-Huggins parameters liberated!

R. Tchoua, J. De Pablo

“I need to publish my data so that others can find it and use it.”

Scholarly Publication

Reference Dataset

Research Community Collaboration

Publish dashboard


Configuring a publication pipeline: Publication “facets”

identifier: URL, Handle, DOI

description: none, standard, custom, domain-specific

curation: none, acceptance, machine-validated, human-validated

access: anonymous, public, collaborators, embargoed

preservation: transient, project lifetime, “forever”, archive
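The five facets lend themselves to a declarative configuration in which each pipeline picks one value per facet. The Python sketch below shows one way that could look. The facet names and values are taken from the slide, but the class names, the `from_dict` helper, and the dictionary layout are assumptions for illustration and do not reflect the actual Globus publication service configuration format.

```python
from dataclasses import dataclass
from enum import Enum


class Identifier(Enum):
    URL = "URL"
    HANDLE = "Handle"
    DOI = "DOI"


class Description(Enum):
    NONE = "none"
    STANDARD = "standard"
    CUSTOM = "custom"
    DOMAIN_SPECIFIC = "domain-specific"


class Curation(Enum):
    NONE = "none"
    ACCEPTANCE = "acceptance"
    MACHINE_VALIDATED = "machine-validated"
    HUMAN_VALIDATED = "human-validated"


class Access(Enum):
    ANONYMOUS = "anonymous"
    PUBLIC = "public"
    COLLABORATORS = "collaborators"
    EMBARGOED = "embargoed"


class Preservation(Enum):
    TRANSIENT = "transient"
    PROJECT_LIFETIME = "project lifetime"
    FOREVER = "forever"
    ARCHIVE = "archive"


@dataclass(frozen=True)
class PublicationPipeline:
    """One choice per facet; the field names mirror the slide's facet names."""
    identifier: Identifier
    description: Description
    curation: Curation
    access: Access
    preservation: Preservation

    @classmethod
    def from_dict(cls, cfg: dict) -> "PublicationPipeline":
        # Enum lookup by value raises ValueError for an unknown facet value.
        return cls(
            identifier=Identifier(cfg["identifier"]),
            description=Description(cfg["description"]),
            curation=Curation(cfg["curation"]),
            access=Access(cfg["access"]),
            preservation=Preservation(cfg["preservation"]),
        )
```

Validating the configuration at load time means a misspelled facet value fails immediately instead of surfacing later in the middle of a publication run.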




(4) Create discovery engines

Data-driven science requires collaborative discovery engines

informatics analysis

high-throughput experiments

problem specification

modeling and simulation

analysis & visualization

experimental design

Integrated databases

Rick Stevens

Example: A discovery engine for disordered structures

Diffuse scattering images from Ray Osborn et al., Argonne

Sample

Experimental scattering

Material composition: La 60%, Sr 40%

Simulated structure

Simulated scattering

Detect errors (secs to mins)

Knowledge base: past experiments; simulations; literature; expert knowledge

Select experiments (mins to hours)

Contribute to knowledge base

Simulations driven by experiments (mins to days)

Knowledge-driven decision making

Evolutionary optimization
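The loop in this figure (simulate a structure, compare simulated with experimental scattering, keep the best candidates, repeat) is an instance of inverse modeling by evolutionary optimization. The Python sketch below shows the shape of such a loop on a toy problem. `simulate` is a stand-in cosine model with two parameters, not a scattering code, and the population sizes, mutation width, and decay rate are arbitrary assumptions.

```python
import math
import random


def simulate(params):
    """Toy stand-in for a scattering simulation: parameters -> intensity profile."""
    amplitude, frequency = params
    return [amplitude * math.cos(frequency * x / 10.0) for x in range(32)]


def misfit(params, observed):
    """Sum of squared differences between simulated and observed profiles."""
    return sum((s - o) ** 2 for s, o in zip(simulate(params), observed))


def evolve(observed, generations=200, population=20, offspring=40,
           sigma=0.3, seed=0):
    """(mu + lambda) evolution: mutate parents, keep the best of parents and children."""
    rng = random.Random(seed)
    pop = [(rng.uniform(0, 2), rng.uniform(0, 2)) for _ in range(population)]
    for _ in range(generations):
        children = [
            tuple(gene + rng.gauss(0, sigma) for gene in rng.choice(pop))
            for _ in range(offspring)
        ]
        # Elitist selection: the best candidate found so far is never discarded.
        pop = sorted(pop + children, key=lambda p: misfit(p, observed))[:population]
        sigma *= 0.98  # narrow the search as the population converges
    return min(pop, key=lambda p: misfit(p, observed))


if __name__ == "__main__":
    observed = simulate((1.0, 1.3))  # synthetic "experiment" with known parameters
    best = evolve(observed)
    print(best, misfit(best, observed))
```

Each call to `misfit` is independent of the others within a generation, which is why this style of search maps naturally onto many simultaneous simulations on a large machine.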

Integrate data movement, management, workflow, and computation to accelerate data-driven applications, organize data for efficient use

New architectures and methods create opportunities and challenges

Integrate statistics/machine learning to assess many models and calibrate them against “all” relevant data

New computer facilities enable on-demand computing and high-speed analysis of large quantities of data

Simulation: characterize, predict

Assimilate; steer data acquisition

Data analysis: reconstruct, detect features, auto-correlate, particle distributions, …

Science automation services: scripting, security, storage, cataloging, transfer

Data rates: ~0.001-0.5 GB/s/flow; ~2 GB/s total burst; ~200 TB/month; ~10 concurrent flows (today; x10 in 5 yrs)

Integration: optimize, fit, …

Configure, check, guide

Compute scale, batch to immediate: 0.001, 1, 100+ PFlops

Precompute material database

Reconstruct image

Auto-correlation

Feature detection
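The data-rate figures on this slide can be related to one another with a short calculation. The Python sketch below compares the sustained rate implied by ~200 TB/month with the ~2 GB/s burst figure. It assumes a 30-day month and decimal units, and the function names are hypothetical.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumes a 30-day month


def average_rate_gb_s(tb_per_month: float) -> float:
    """Sustained rate in GB/s implied by a monthly volume in TB."""
    return tb_per_month * 1e12 / SECONDS_PER_MONTH / 1e9


def hours_at_rate(terabytes: float, gb_per_s: float) -> float:
    """Hours needed to move a volume at a constant rate."""
    return terabytes * 1e12 / (gb_per_s * 1e9) / 3600


if __name__ == "__main__":
    print(f"{average_rate_gb_s(200):.3f} GB/s sustained")       # about 0.077
    print(f"{hours_at_rate(200, 2):.1f} hours at 2 GB/s burst")  # about 27.8
```

Under these assumptions the monthly volume corresponds to well under 0.1 GB/s on average, so the burst capacity is more than twenty times the sustained load: the traffic is bursty, and the services have to be provisioned for peaks rather than averages.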

Scientific opportunities
• Probe material structure and function at unprecedented scales

Technical challenges
• Many experimental modalities
• Data rates and computation needs vary widely; increasing
• Knowledge management, integration, synthesis

Towards discovery engines for energy science (Argonne LDRD)

Linking experiment and computation

Single-crystal diffuse scattering: Defect structure in disordered materials (Osborn, Wilde, Wozniak, et al.). Estimate structure via inverse modeling: many-simulation evolutionary optimization on 100K+ BG/Q cores (Swift+OpenMP).

Near-field high-energy X-ray diffraction microscopy: Microstructure in bulk materials (Almer, Sharma, et al.). Reconstruction on 10K+ BG/Q cores (Swift) takes ~10 minutes, vs. >5 hours on APS cluster or months if data taken home. Used to detect errors in one run that would have resulted in total waste of beamtime.

X-ray nano/microtomography: Bio, geo, and material science imaging (Bicer, Gursoy, Kettimuthu, De Carlo, et al.). Innovative in-slice parallelization method gives reconstruction of 360x2048x1024 dataset in ~1 minute, using 32K BG/Q cores, vs. many days on cluster: enables quasi-instant response.

2-BM

1-ID

6-ID

Populate

Sim Sim

Select

Sim

Microstructure of a copper wire, 0.2mm diameter

Advanced Photon Source

Experimental and simulated scattering from manganite

Rapid assessment of alignment quality in high-energy diffraction microscopy

Before

After

Hemant Sharma, Justin Wozniak, Mike Wilde, Jon Almer

Science services raise research and policy questions
• What else can we automate and outsource? How do we choose opportunities? How do we measure success?
• How must our computer systems evolve? High-capacity discovery engines: where, how?
• What will science become in a services era? Will it be more democratic? Collaborative? Entrepreneurial? More or less creative? What are implications for trust and reproducibility?

What would Beer say?

“The question which asks how to use the computer in the enterprise, is, in short, the wrong question. A better formulation is to ask how the enterprise should be run given that computers exist. The best version of all is the question asking what, given computers, the enterprise now is.” – Stafford Beer, “Brain of the Firm”, 1972


Opportunities and challenges for discovery acceleration

Immediate opportunities
• Reduce data friction and accelerate discovery by applying Globus services across DOE facilities
• Develop new services to capture, link science data

Important research agenda
• Discovery engines to answer major scientific questions
• New research modalities linking computation and data
• Organization and analysis of massive science data



Thank you to our sponsors!

U.S. Department of Energy

For more information: [email protected]

Thanks to co-authors and Globus team

Globus services (globus.org)
• Foster, I. Globus Online: Accelerating and democratizing science through cloud-based services. IEEE Internet Computing (May/June):70-73, 2011.
• Chard, K., Tuecke, S. and Foster, I. Efficient and Secure Transfer, Synchronization, and Sharing of Big Data. Cloud Computing, IEEE, 1(3):46-55, 2014.
• Chard, K., Foster, I. and Tuecke, S. Globus Platform-as-a-Service for Collaborative Science Applications. Concurrency - Practice and Experience, 27(2):290-305, 2014.

Publication (globus.org/data-publication)
• Chard, K., Pruyne, J., Blaiszik, B., Ananthakrishnan, R., Tuecke, S. and Foster, I. Globus Data Publication as a Service: Lowering Barriers to Reproducible Science. 11th IEEE International Conference on eScience, Munich, Germany, 2015.

Discovery engines
• Foster, I., Ananthakrishnan, R., Blaiszik, B., Chard, K., Osborn, R., Tuecke, S., Wilde, M. and Wozniak, J. Networking materials data: Accelerating discovery at an experimental facility. Big Data and High Performance Computing, 2015.

Questions?

[email protected]