Ian Foster
Accelerating discovery via science services
Life Sciences and Biology
Advanced Materials
Condensed Matter Physics
Chemistry and Catalysis
Soft Materials
Environmental and Geo Sciences
Can we determine pathways that lead to novel states and nonequilibrium assemblies?
Can we observe – and control – nanoscale chemical transformations in macroscopic systems?
Can we create new materials with extraordinary properties – by engineering defects at the atomic scale?
Can we map – and ultimately harness – dynamic heterogeneity in complex correlated systems?
Can we unravel the secrets of biological function – across length scales?
Can we understand physical and chemical processes in the most extreme environments?
We want to accelerate progress on the most pressing questions
The discovery process is iterative and time-consuming
Pose question
Design experiment
Collect data
Analyze data
Identify patterns
Hypothesize explanation
Test hypothesis
Publish results
J.C.R. Licklider, 1960: “About 85% of my ‘thinking’ time was spent getting into a position to think, to make a decision, to learn something I needed to know.”
Outsourcing for economies of scale in the use of automated methods
Automation to apply more sophisticated methods at larger scales
Higgs discovery “only possible because of the extraordinary achievements of … grid computing”—Rolf Heuer, CERN DG
10s of PB, 100s of institutions, 1000s of scientists, 100Ks of CPUs, billions of tasks
Outsourcing and automation: (1) The Grid
Outsourcing and automation: (2) The Cloud
The Software as a Service (SaaS) revolution
Customer relationship management (CRM):
A knowledge-intensive process
Historically, handled manually or via expensive, inflexible on-premise software
SaaS has revolutionized how CRM is consumed
• Outsource to provider who runs software on cloud
• Access via simple interfaces
Compared on ease of use, cost, flexibility, and complexity: SaaS vs. on-premise
Where can we automate and outsource in science broadly?
Run experiment
Collect data
Move data
Check data
Annotate data
Share data
Find similar data
Link to literature
Analyze data
Publish data
Time
Automate and outsource
Science services
Many services are used by science, but have limitations
Science services exist, but do not address whole life cycle
Accelerating discovery via science services
(1) Eliminate data friction
The elimination of data friction is a key to faster discovery
“Civilization advances by extending the number of important operations which we can perform without thinking about them” (Whitehead, 1912)
Obstacles to data access, movement, discovery, sharing, and analysis slow research, distort research directions, and waste time (DOE reports, 2005-2015)
We have the highways but not the delivery service
Our highways encompass the Internet, ultra-high-speed networks, science DMZs, data transfer nodes, high-speed transport protocols
A good delivery service automates, schedules, accelerates, adapts. It provides APIs for experts and casual users. It cuts costs and saves time.
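To make "automates" and "adapts" concrete, here is a minimal sketch of a delivery step that copies a file, verifies it by checksum, and retries on failure. The function names `sha256_of` and `deliver`, the retry count, and the backoff are hypothetical choices for illustration. This uses only the Python standard library and is not the Globus API.

```python
import hashlib
import shutil
import tempfile
import time
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def deliver(src: Path, dst: Path, max_attempts: int = 3, backoff_s: float = 0.1) -> int:
    """Copy src to dst, verify the checksum, and retry on failure.

    Returns the number of attempts used. Hypothetical helper, not a Globus call.
    """
    expected = sha256_of(src)
    for attempt in range(1, max_attempts + 1):
        try:
            shutil.copyfile(src, dst)
            if sha256_of(dst) == expected:
                return attempt
        except OSError:
            pass  # treat an I/O error like a failed attempt
        time.sleep(backoff_s * attempt)  # simple linear backoff before retrying
    raise RuntimeError(f"could not deliver {src} after {max_attempts} attempts")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "scan_0001.dat"
        source.write_bytes(b"detector frame" * 1000)
        target = Path(tmp) / "copy_of_scan_0001.dat"
        print("attempts used:", deliver(source, target))
```

A hosted service would add scheduling, parallel streams, and restart of partial transfers. The sketch only shows why end-to-end verification and retry belong in the service and not in each researcher's scripts.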
Globus: Research data management as a service
Essential research data management services
• File transfer
• Data sharing
• Data publication
• Identity and groups
Builds on 15 years of DOE research
Outsourced and automated
• High availability, reliability, performance, scalability
Convenient for
• Casual users: Web interfaces
• Power users: APIs
• Administrators: Install, manage
globus.org
“I need to easily, quickly, & reliably move data to other locations.”
Research Computing HPC Cluster
Lab Server
Campus Home Filesystem
Desktop Workstation
Personal Laptop
DOE supercomputer
Public Cloud
One APS node connects to 125 locations
“I need to get data from a scientific instrument to my analysis system.”
Next-Gen Sequencer
Light Sheet Microscope
MRI
Advanced Light Source
“I need to easily and securely share my data with my colleagues.”
Globus and the research data lifecycle
Transfer
1. Researcher initiates transfer request, or it is requested automatically by a script or science gateway
2. Globus transfers files reliably and securely
Share
3. Researcher selects files to share, selects user or group, and sets access permissions
4. Globus controls access to shared files on existing storage; no need to move files to cloud storage!
5. Collaborator logs in to Globus and accesses shared files; no local account required; download via Globus
Publish
6. Researcher assembles data set; describes it using metadata (Dublin Core and domain-specific)
7. Curator reviews and approves; data set published on campus or other system
Discover
8. Peers, collaborators search and discover datasets; transfer and share using Globus
Locations shown: Instrument, Compute Facility, Personal Computer, Publication Repository
• SaaS: only a web browser required
• Use storage system of your choice
• Access using your campus credentials
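The sharing steps (select files, select a user or group, set access permissions) can be pictured as a small access-rule table attached to a folder that stays on its existing storage. The sketch below is an illustrative data structure only. The class `SharedFolder`, its methods, and the principal names are hypothetical and do not reflect how Globus stores or evaluates permissions.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class SharedFolder:
    """A folder on existing storage plus the access rules attached to it."""
    path: str
    # Maps a user or group name to the permissions granted, e.g. {"read"}.
    rules: Dict[str, Set[str]] = field(default_factory=dict)

    def grant(self, principal: str, *permissions: str) -> None:
        """Add permissions for a user or group."""
        self.rules.setdefault(principal, set()).update(permissions)

    def allows(self, user: str, groups: Set[str], permission: str) -> bool:
        """True if the user, or any group the user belongs to, holds the permission."""
        principals = {user} | groups
        return any(permission in self.rules.get(p, set()) for p in principals)


if __name__ == "__main__":
    folder = SharedFolder("/data/run42")
    folder.grant("group:beamline-team", "read")
    folder.grant("alice", "read", "write")
    print(folder.allows("bob", {"group:beamline-team"}, "read"))
    print(folder.allows("bob", {"group:beamline-team"}, "write"))
```

The data never moves. Only the rule table changes, which is why the slide stresses that files need not be copied to cloud storage in order to be shared.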
Globus and DOE: Terabytes per month
Globus by the numbers
5 major services
130 federated campus IdPs
115 petabytes transferred
8,000 managed storage systems
20 billion files processed
99.95% uptime over past 2 years
25,000 registered users
>30 institutional subscribers
3 months: longest transfer
1 petabyte: biggest transfer
50M: most files in one transfer
13 national labs use services
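As a rough reading of the uptime figure, 99.95% over two years leaves a budget of under nine hours of total downtime. The short calculation below assumes 365-day years. The function name is a hypothetical helper for this example.

```python
HOURS_PER_YEAR = 365 * 24  # assumes non-leap years


def allowed_downtime_hours(uptime_fraction: float, years: float) -> float:
    """Total downtime implied by an uptime fraction over a number of years."""
    return (1.0 - uptime_fraction) * years * HOURS_PER_YEAR


if __name__ == "__main__":
    hours = allowed_downtime_hours(0.9995, 2)
    print(f"{hours:.2f} hours of downtime over 2 years")  # 8.76 hours
```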
Accelerating discovery via science services
(2) Create platform services
Globus service APIs provide elements of a science platform
Identity, Group, and Profile Management
Data Publication & Discovery
File Sharing
File Transfer & Replication
Globus APIs
Globus Connect
… Globus Toolkit
Publication as service for ACME climate modeling consortium
kbase.us
Accelerating discovery via science services
(3) Liberate scientific data
Q: What is the biggest obstacle to data sharing in science?
A: The vast majority of data is lost or not online; if online, not described; if described, not indexed
Not accessible
Not discoverable
Not used
Contrast with common practice for consumer photos (iPhoto)
• Automated capture
• Publish then curate
• Processing to add value
• Outsourced storage
We must automate the capture, linking, and indexing of all data
Globus publication service encodes and automates data publication pipelines
Example application: Materials Data Facility for materials simulation and experiment data
Proposed distributed virtual collections index, organize, tag, & manage distributed data
Think iPhoto on steroids, backed by domain knowledge and supercomputing power
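A toy version of "automate the capture, linking, and indexing" is a crawler that records whatever metadata is available without human effort and builds a searchable index from it. The sketch below uses only the standard library. The record fields, the tag rule (tokens taken from the file path), and the function names are assumptions made for illustration, not a description of the Globus publication service or the proposed virtual collections.

```python
import os
import re
import tempfile
from collections import defaultdict
from pathlib import Path


def capture(root: Path) -> list:
    """Walk root and record automatically available metadata for each file."""
    records = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = Path(dirpath) / name
            relative = str(path.relative_to(root))
            records.append({
                "path": relative,
                "bytes": path.stat().st_size,
                "suffix": path.suffix.lower(),
                # Tags come from path components only; a real system would
                # also parse file contents and instrument headers.
                "tags": set(re.findall(r"[a-z0-9]+", relative.lower())),
            })
    return records


def build_index(records: list) -> dict:
    """Inverted index: tag -> sorted list of file paths carrying that tag."""
    index = defaultdict(list)
    for record in records:
        for tag in record["tags"]:
            index[tag].append(record["path"])
    return {tag: sorted(paths) for tag, paths in index.items()}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "manganite").mkdir()
        (root / "manganite" / "scan_001.h5").write_bytes(b"\0" * 10)
        (root / "copper").mkdir()
        (root / "copper" / "scan_002.h5").write_bytes(b"\0" * 20)
        index = build_index(capture(root))
        print(index["manganite"])
        print(index["scan"])
```

This follows the "publish then curate" pattern from the photo analogy: everything is captured and indexed first, and richer description is layered on afterwards.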
We must automate the capture, linking, and indexing of all data
chiDB: Human-computer collaboration to extract Flory-Huggins (χ) parameters from polymers literature (R. Tchoua et al.)
Plenario: Spatially and temporally integrated, linked, and searchable database of urban data (C. Catlett, B. Goldstein, T. Malik et al.)
Flory-Huggins parameters liberated!
R. Tchoua, J. De Pablo
“I need to publish my data so that others can find it and use it.”
Scholarly Publication
Reference Dataset
Research Community Collaboration
Publish dashboard
Configuring a publication pipeline: Publication “facets”
identifier: URL | Handle | DOI
description: none | standard | custom | domain-specific
curation: none | acceptance | machine-validated | human-validated
access: anonymous | public | collaborators | embargoed
preservation: transient | project lifetime | “forever” | archive
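One way to read this slide is that a publication pipeline is a choice of one value per facet (identifier, description, curation, access, preservation). The sketch below encodes that reading as a validated configuration object. The class name, the `FACETS` table, and the assignment of values to facets are reconstructed from the slide labels and are assumptions, not the Globus publication service's actual configuration format.

```python
from dataclasses import dataclass

# Allowed values per facet, as read from the slide (an interpretation).
FACETS = {
    "identifier": ("URL", "Handle", "DOI"),
    "description": ("none", "standard", "custom", "domain-specific"),
    "curation": ("none", "acceptance", "machine-validated", "human-validated"),
    "access": ("anonymous", "public", "collaborators", "embargoed"),
    "preservation": ("transient", "project lifetime", "forever", "archive"),
}


@dataclass(frozen=True)
class PublicationPipeline:
    """One choice per facet; invalid choices are rejected at construction."""
    identifier: str
    description: str
    curation: str
    access: str
    preservation: str

    def __post_init__(self):
        for facet, allowed in FACETS.items():
            value = getattr(self, facet)
            if value not in allowed:
                raise ValueError(f"{facet}={value!r} is not one of {allowed}")


if __name__ == "__main__":
    pipeline = PublicationPipeline(
        identifier="DOI",
        description="domain-specific",
        curation="human-validated",
        access="embargoed",
        preservation="archive",
    )
    print(pipeline)
```

Treating the facets as independent choices means a lightweight project share and a curated reference dataset can differ only in configuration while using the same pipeline.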
Accelerating discovery via science services
(4) Create discovery engines
Data-driven science requires collaborative discovery engines
informatics analysis
high-throughput experiments
problem specification
modeling and simulation
analysis & visualization
experimental design
analysis & visualization
Integrated databases
Rick Stevens
Example: A discovery engine for disordered structures
Diffuse scattering images from Ray Osborn et al., Argonne
Sample
Experimental scattering
Material composition: La 60%, Sr 40%
Simulated structure
Simulated scattering
Detect errors (secs to mins)
Knowledge base: past experiments; simulations; literature; expert knowledge
Select experiments (mins to hours)
Contribute to knowledge base
Simulations driven by experiments (mins to days)
Knowledge-driven decision making
Evolutionary optimization
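The loop on this slide (simulate a structure, compare simulated with experimental scattering, select and mutate the best candidates) can be sketched with a toy forward model. In the example below a single Gaussian peak stands in for the scattering simulation, and a simple (mu + lambda) evolution strategy recovers its two parameters. The forward model, parameter names, population sizes, and mutation schedule are all assumptions chosen for illustration. The actual diffuse-scattering work runs many full simulations on a supercomputer.

```python
import math
import random


def simulate(center: float, width: float, n_points: int = 50) -> list:
    """Toy forward model: one Gaussian peak sampled on [0, 1]."""
    return [
        math.exp(-(((i / (n_points - 1)) - center) / width) ** 2)
        for i in range(n_points)
    ]


def misfit(candidate: tuple, observed: list) -> float:
    """Sum of squared differences between simulated and observed patterns."""
    simulated = simulate(*candidate)
    return sum((s - o) ** 2 for s, o in zip(simulated, observed))


def evolve(observed, generations=60, parents=5, children=20, sigma=0.05, seed=0):
    """(mu + lambda) evolution strategy with Gaussian mutation."""
    rng = random.Random(seed)
    population = [
        (rng.uniform(0.1, 0.9), rng.uniform(0.05, 0.5)) for _ in range(parents)
    ]
    for _ in range(generations):
        offspring = []
        for _ in range(children):
            center, width = rng.choice(population)
            offspring.append((
                min(max(center + rng.gauss(0, sigma), 0.0), 1.0),
                min(max(width + rng.gauss(0, sigma), 0.01), 1.0),
            ))
        # Keep the best of parents and offspring together (elitist selection).
        population = sorted(
            population + offspring, key=lambda c: misfit(c, observed)
        )[:parents]
        sigma *= 0.95  # shrink mutations as the search narrows
    return population[0]


if __name__ == "__main__":
    experimental = simulate(0.4, 0.2)  # stands in for measured scattering
    best_center, best_width = evolve(experimental)
    print(f"center={best_center:.3f} width={best_width:.3f}")
```

Each candidate evaluation is independent, which is what makes this kind of inverse modeling a natural fit for many simultaneous simulations.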
Integrate data movement, management, workflow, and computation to accelerate data-driven applications, organize data for efficient use
New architectures and methods create opportunities and challenges
Integrate statistics/machine learning to assess many models and calibrate them against 'all' relevant data
New computer facilities enable on-demand computing and high-speed analysis of large quantities of data
Simulation: characterize, predict, assimilate, steer data acquisition
Data analysis: reconstruct, detect features, auto-correlate, particle distributions, …
Integration: optimize, fit, …; configure, check, guide
Science automation services: scripting, security, storage, cataloging, transfer
Data rates: ~0.001-0.5 GB/s/flow; ~2 GB/s total burst; ~200 TB/month; ~10 concurrent flows (today: x10 in 5 yrs)
Compute scale (PFlops): 0.001, 1, 100+; batch to immediate
Tasks: precompute material database, reconstruct image, auto-correlation, feature detection
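The quoted data rates imply very bursty traffic. A quick calculation, assuming a 30-day month and decimal units (1 TB = 1000 GB), puts the monthly volume of ~200 TB at an average well under 0.1 GB/s, against a burst figure of ~2 GB/s. The function name is a hypothetical helper for this example.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumes a 30-day month


def average_rate_gb_per_s(tb_per_month: float) -> float:
    """Average sustained rate in GB/s for a monthly volume in TB (1 TB = 1000 GB)."""
    return tb_per_month * 1000 / SECONDS_PER_MONTH


if __name__ == "__main__":
    average = average_rate_gb_per_s(200)
    burst = 2.0  # GB/s, total burst figure from the slide
    print(f"average: {average:.3f} GB/s, burst/average: {burst / average:.1f}x")
```

The gap between average and burst rates is one reason on-demand computing and scheduled transfer services matter more here than raw average bandwidth.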
Scientific opportunities
• Probe material structure and function at unprecedented scales
Technical challenges
• Many experimental modalities
• Data rates and computation needs vary widely; increasing
• Knowledge management, integration, synthesis
Towards discovery engines for energy science (Argonne LDRD)
Linking experiment and computation
Single-crystal diffuse scattering: defect structure in disordered materials (Osborn, Wilde, Wozniak, et al.). Estimate structure via inverse modeling: many-simulation evolutionary optimization on 100K+ BG/Q cores (Swift+OpenMP).
Near-field high-energy X-ray diffraction microscopy: microstructure in bulk materials (Almer, Sharma, et al.). Reconstruction on 10K+ BG/Q cores (Swift) takes ~10 minutes, vs. >5 hours on APS cluster or months if data taken home. Used to detect errors in one run that would have resulted in total waste of beamtime.
X-ray nano/microtomography: bio, geo, and material science imaging (Bicer, Gursoy, Kettimuthu, De Carlo, et al.). Innovative in-slice parallelization method gives reconstruction of 360x2048x1024 dataset in ~1 minute, using 32K BG/Q cores, vs. many days on cluster: enables quasi-instant response.
Beamlines: 2-BM, 1-ID, 6-ID
Workflow stages: Populate, Sim (multiple), Select
Microstructure of a copper wire, 0.2mm diameter
Advanced Photon Source
Experimental and simulated scattering from manganite
Rapid assessment of alignment quality in high-energy diffraction microscopy
Before
After
Hemant Sharma, Justin Wozniak, Mike Wilde, Jon Almer
Science services raise research and policy questions
• What else can we automate and outsource? How do we choose opportunities? How do we measure success?
• How must our computer systems evolve? High-capacity discovery engines: where, how?
• What will science become in a services era? Will it be more democratic? Collaborative? Entrepreneurial? More or less creative? What are implications for trust and reproducibility?
What would Beer say?
“The question which asks how to use the computer in the enterprise, is, in short, the wrong question. A better formulation is to ask how the enterprise should be run given that computers exist. The best version of all is the question asking what, given computers, the enterprise now is.”
Stafford Beer, “Brain of the Firm”, 1972
Opportunities and challenges for discovery acceleration
Immediate opportunities
• Reduce data friction and accelerate discovery by applying Globus services across DOE facilities
• Develop new services to capture, link science data
Important research agenda
• Discovery engines to answer major scientific questions
• New research modalities linking computation and data
• Organization and analysis of massive science data
Thank you to our sponsors!
U.S. DEPARTMENT OF ENERGY
For more information: [email protected]
Thanks to co-authors and Globus team
Globus services (globus.org)
• Foster, I. Globus Online: Accelerating and democratizing science through cloud-based services. IEEE Internet Computing (May/June):70-73, 2011.
• Chard, K., Tuecke, S. and Foster, I. Efficient and Secure Transfer, Synchronization, and Sharing of Big Data. Cloud Computing, IEEE, 1(3):46-55, 2014.
• Chard, K., Foster, I. and Tuecke, S. Globus Platform-as-a-Service for Collaborative Science Applications. Concurrency - Practice and Experience, 27(2):290-305, 2014.
Publication (globus.org/data-publication)
• Chard, K., Pruyne, J., Blaiszik, B., Ananthakrishnan, R., Tuecke, S. and Foster, I. Globus Data Publication as a Service: Lowering Barriers to Reproducible Science. 11th IEEE International Conference on eScience, Munich, Germany, 2015.
Discovery engines
• Foster, I., Ananthakrishnan, R., Blaiszik, B., Chard, K., Osborn, R., Tuecke, S., Wilde, M. and Wozniak, J. Networking materials data: Accelerating discovery at an experimental facility. Big Data and High Performance Computing, 2015.
Questions?