Utility HPC: Right Systems, Right Scale, Right Science
TRANSCRIPT
Jason Stowe, CEO (@jasonastowe, @cyclecomputing)
I’m here to recruit you, for a cause
We believe utility access to compute power
makes impossible science, possible.
Dynamic, utility access to compute power
is as important as uptime
(that’s why coded infrastructure is critical)
Skeptical? [Image credit: Flickr, "Tourist on Earth"]
In prior years (and today?), researchers and engineers waited for computing:
for the horsepower,
for a place to put it,
for it to be configured.
[Image credit: Flickr, "vaxomatic"]
Yesterday, high-performance engineering and science clusters were…
too small when you need them most,
too large every other time.
The Innovation Bottleneck: researchers, scientists, and engineers are forced to size their questions to the infrastructure they have. Multi-tenant systems create float capacity that is critical to innovation.
From centralized to decentralized, collaborative to independent, and right back again!

The 60s:     Mainframes         ~100% sharing   ~0 Mbit
The 70s:     VAX                 ~60% sharing   ~1 Mbit
The 80s:     The PC               0% sharing    ~10 Mbit
The 90s:     Beowulf Clusters    ~40% sharing   ~1,000 Mbit
The 00s-10s: Central Clouds     ??? % sharing   ~10,000 Mbit

Bigger and better, but further and further away from the scientist's lab.
The Scientific Method: Ask a Question → Hypothesize → Predict → Experiment/Test → Analyze → Final Results

The Test and Analyze stages require the most time, compute, and data.

Any improvement to this cycle yields multiplicative benefits.
A Challenge Across Industries:
- 3 of the Top 5 Insurance companies
- 6 of the Top 8 Pharmaceutical companies
- 2 of the Top 3 Banks
- 2 of the Top 3 Genomics Sequencing companies
- 1 of the Top 2 FPGA companies
Utility HPC in the News: WSJ, NYTimes, Wired, Bio-IT World, BusinessWeek
To accelerate science, we need automation
[Architecture diagram] Utility HPC Cluster:
- Management software driving CC1/CCG instances, EBS, S3, and a shared filesystem
- Scales to 50,000+ cores
- Data scheduling and workload portability
- Data- and application-aware movement
- Traditional scheduler, scaled massively based upon workload
- Secure HPC cluster, with HPC reporting & audit for the user
ChefConf 2012: 50,000-core CycleCloud using Chef and AWS
ChefConf 2013: 10,600-instance cluster against a cancer target
- Created in 2 hours
- Configured with Search and Data bags
- One Chef 11 server
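The pattern the slide describes is one Chef server, node discovery via Search, and cluster-wide settings in Data bags. A minimal, hypothetical recipe fragment in that style; the `clusters` data bag, the `scheduler_master` role, and the HTCondor config path are illustrative assumptions, not Cycle's actual cookbooks (this runs inside a Chef client run, not as standalone Ruby):

```ruby
# Hypothetical Chef recipe fragment.
# Pull cluster-wide settings from a data bag on the Chef 11 server.
cluster = data_bag_item('clusters', 'hpc_cluster')   # illustrative data bag

# Discover the scheduler master with Chef search instead of hardcoding IPs,
# so new nodes can join as the cluster scales up and down.
master = search(:node, 'role:scheduler_master').first

template '/etc/condor/condor_config.local' do
  source 'condor_config.erb'
  variables(
    master_ip: master && master['ipaddress'],
    slots:     cluster['slots_per_node']
  )
  notifies :restart, 'service[condor]'
end

service 'condor' do
  action [:enable, :start]
end
```

Because node data lives on the one Chef server, the same recipe converges correctly whether the cluster has 10 nodes or 10,600.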
We make software tools to easily orchestrate complex workloads and data access across Utility HPC
Today is a survey of use cases…
- Life Science: 10,600-instance molecular modeling
- Manufacturing: 600-core nuclear power plant safety simulation
- Genomics: RNA analysis for stem cells
Dynamic, utility access to compute power
is as important as uptime
Why?
#1: "Better" Science = answering the question we want to ask, not one constrained to what fits on local compute power

#2: "Faster" Science = running this "better" science, which would have taken months or years, in hours or days
Survey of Use Cases: Drug Design, CAD/CAM, Genomics, …
Life Sciences & Compute?
[Chart ("fake charts, with fake data"): compute vs. data/bandwidth for Genomics, Molecular Modeling, CAD/CAM, All Sample Analysis, Proteomics, Biomarker/Image Analysis, and Sensor Data Import]
Why is this important?
(W.H.O./Globocan 2008)
~2 million Type 2 diabetics, ~200k Type 1
Every day is crucial and costly
Before: trading off compute time vs. accuracy.
Now: accurate analysis, fewer false negatives, faster.

Process for Drug Design: Initial Coarse Screen → Higher Quality Analysis → Best Quality
Big 10 Pharma: built a 10,600-instance cluster ($44M) in 2 hours, then ran 40 years of science in 11 hours for $4,372.
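A back-of-envelope check of those numbers, assuming "40 years of science" means roughly 40 core-years of serial compute (that reading is my assumption, not stated in the deck):

```ruby
# Sanity-check the slide's claim: 40 "years of science" in 11 wall-clock
# hours for $4,372, assuming 40 core-years of serial work (an assumption).
core_hours = 40 * 365 * 24            # ~350,400 core-hours of work
wall_hours = 11
cost_usd   = 4372.0

avg_cores_busy     = core_hours / wall_hours   # cores kept busy, on average
cost_per_core_hour = cost_usd / core_hours

puts avg_cores_busy                                       # ~31,854 cores
puts format('~$%.4f per core-hour', cost_per_core_hour)   # ~$0.0125 per core-hour
```

Averaged over 10,600 instances that is about 3 busy cores per instance, which is plausible for mixed instance types, though the deck does not give the exact core count.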
[Screenshots: server count for the most recent utility supercomputer, shown in the AWS Console view and in Cycle's view of this cluster, all run by one Chef 11 server]
Earlier Drug Design: Novartis, discussed at Bio-IT World 2012
- Needed: a push-button utility supercomputer for molecular modeling
- Created: a 30,000-core run across US/EU cloud regions (AWS)
- 10 years of compute in 8 hours for $10,000
- Found 3 compounds now in the wet lab as a result
- Capacity is no longer an issue
- Hardware = software
- Testing matters (error handling, unit testing, etc.); e.g., Cycle spent ~$1M on AWS over 5 years
- The only way to do this is to automate

Lessons learned:
Servers are not house plants. Servers are wheat.
Survey of Use Cases: Drug Design, CAD/CAM, Genomics, …
Nuclear Power Plant simulation
We don't know what they're running, but it has "Safety" in it.
600-core CAD/CAM: a wait of 3 quarters of a year became 3 weeks.
[Diagram: an engineer behind the corporate firewall schedules site data (TBs of shared filesystem) onto a secure ~600-CPU HPC cluster in the external cloud: 3 weeks instead of 3 quarters]
Survey of Use Cases: Drug Design, CAD/CAM, Genomics, …
Gene Expression Analysis: Morgridge Institute for Research
- Ran a holistic comparison of all 78 terabytes of stem cell RNA samples to build a unique gene expression database
- Goal: make it easier to replicate disease in petri dishes with induced stem cells
- 1 million compute hours: 115 years of computing in 1 week for $19,555
- Cluster details: 5,000 to 10,000 cores for a week
- Very long individual analyses were checkpointed, which made spot instance usage possible
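Checkpointing is what makes spot instances usable here: a spot node can be terminated at any time, so a long analysis must persist its progress and resume where it left off. A generic sketch of that loop; the file name, step count, and checkpoint interval are illustrative, not Morgridge's actual pipeline:

```ruby
require 'json'

CHECKPOINT = 'analysis.ckpt' # illustrative checkpoint file name

# Resume from the last checkpoint if a prior (possibly spot-terminated)
# run left one behind; otherwise start fresh.
state =
  if File.exist?(CHECKPOINT)
    JSON.parse(File.read(CHECKPOINT))
  else
    { 'step' => 0, 'sum' => 0 }
  end

TOTAL_STEPS = 1_000
while state['step'] < TOTAL_STEPS
  state['sum']  += state['step'] # stand-in for one unit of real analysis work
  state['step'] += 1

  # Persist progress atomically every 100 steps, so a spot termination
  # loses at most 100 steps of work instead of the whole run.
  next unless (state['step'] % 100).zero?
  File.write("#{CHECKPOINT}.tmp", JSON.generate(state))
  File.rename("#{CHECKPOINT}.tmp", CHECKPOINT)
end

puts state['sum'] # sum of 0..999
```

The write-to-temp-then-rename step matters: a node killed mid-write leaves the previous checkpoint intact rather than a truncated file.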
Survey of Use Cases: Drug Design, CAD/CAM, Genomics, …
Code can accelerate Science.
The Scientific Method on Utility HPC: Ask a Question → Hypothesize → Predict → Experiment/Test → Analyze → Final Results
Yielding "better", "faster" research for less money.
Dynamic, utility access to compute power
is as important as uptime
I’m here to recruit you, for a cause
Contribute to Chef. Make the community better.
And you will help Cycle make impossible science,
possible.
2013 BigScience Challenge
$10,000 of free computing for science benefiting humanity
2012 winner: the 115-year genomic analysis
Enter at: http://cyclecomputing.com/big-science-challenge/enter
Thank You! Questions?