talk at microsoft cloud futures 2010
DESCRIPTION
My talk from Cloud Futures 2010, organized by MSR
TRANSCRIPT
Scientific Computing with Amazon Web Services
Deepak Singh
Cloud Futures 2010
Via Reavel under a CC-BY-NC-ND license
life science industry
Credit: Bosco Ho
By ~Prescott under a CC-BY-NC license
data
Image: Wikipedia
Image via image editor under a CC-BY license
Image: Matt Wood
Image: NOAA
gigabytes
terabytes
petabytes
petabytes
petabytes
exabytes?
really fast
Image: http://www.broadinstitute.org/~apleite/photos.html
data management
data processing
data sharing
Image: Chris Dagdigian
compute & storage limited
amazon web services
the cloud
has_many :definitions
infrastructure as a service
Your Custom Applications and Services
Compute: Amazon Elastic Compute Cloud (EC2), Elastic Load Balancing, Auto Scaling
Storage: Amazon Simple Storage Service (S3), AWS Import/Export
Database: Amazon RDS and SimpleDB
Content Delivery: Amazon CloudFront
Messaging: Amazon Simple Queue Service (SQS), Amazon Simple Notification Service (SNS)
Payments: Amazon Flexible Payments Service (FPS)
On-Demand Workforce: Amazon Mechanical Turk
Parallel Processing: Amazon Elastic MapReduce
Monitoring: Amazon CloudWatch
Management: AWS Management Console
Tools: AWS Toolkit for Eclipse, AWS SDK for .NET
Isolated Networks: Amazon Virtual Private Cloud
elasticity
3000 CPUs for one firm’s risk management application
[Chart: Number of EC2 Instances per day, Wednesday 4/22/2009 through Tuesday 4/28/2009; levels marked at 3000 and 300; annotation: 300 CPUs on weekends]
scalability
> 1PB of data in S3
highly available
Image: Chris Dagdigian
“Everything fails, all the time” -- Werner Vogels
“Things will crash. Deal with it” -- Jeff Dean
2-4% of servers will die annually
Source: Jeff Dean, LADIS 2009
1-5% of disk drives will die every year
Source: Jeff Dean, LADIS 2009
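To make these rates concrete, here is a small arithmetic sketch. The fleet sizes (1,000 servers and 10,000 drives) are hypothetical and chosen only for illustration; the failure rates are the ones quoted on the slides.

```python
# Expected annual hardware failures for a hypothetical fleet.
# Fleet sizes are illustrative assumptions, not figures from the talk.

def expected_failures(units, low_rate, high_rate):
    """Return the (low, high) expected number of failures per year."""
    return units * low_rate, units * high_rate

servers = 1_000   # hypothetical fleet size
drives = 10_000   # hypothetical fleet size

# Rates quoted on the slides: 2-4% of servers, 1-5% of disk drives per year.
server_low, server_high = expected_failures(servers, 0.02, 0.04)
drive_low, drive_high = expected_failures(drives, 0.01, 0.05)

print(f"servers: {server_low:.0f} to {server_high:.0f} failures per year")
print(f"drives:  {drive_low:.0f} to {drive_high:.0f} failures per year")
```

At that scale, failure is a routine event rather than an exception, which is the point of the quotes above.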
human errors
~20% of admin issues have unintended consequences
Source: James Hamilton
scalable & available
assume sw/hw failure
design apps to be resilient
automation & alarming
US East Region
Availability Zone A
Availability Zone B
Availability Zone C
Availability Zone D
elastic load balancing
CloudWatch
auto scaling
elastic block store
elastic IP
SQS
SNS
cost effective
pay as you go
on-demand instances
reserved instances
spot instances
Your Network
Amazon Web Services Cloud
Secure VPN Connection over the Internet
Customer’s isolated AWS resources
VPN Gateway
External Customers
Subnets
10.32.1.0/24
10.32.2.0/24
10.32.3.0/24
AMAZON VPC ARCHITECTURE
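As a rough illustration of the subnet layout in the diagram, the sketch below uses Python's standard `ipaddress` module to inspect the three CIDR blocks. The enclosing `10.32.0.0/16` range is an assumption made for this example; the slide does not state the overall VPC block.

```python
import ipaddress

# The three subnets shown on the slide.
subnets = [ipaddress.ip_network(cidr)
           for cidr in ("10.32.1.0/24", "10.32.2.0/24", "10.32.3.0/24")]

# Hypothetical enclosing range; not given on the slide.
vpc_block = ipaddress.ip_network("10.32.0.0/16")

for net in subnets:
    inside = net.subnet_of(vpc_block)
    print(f"{net}: {net.num_addresses} addresses, "
          f"private={net.is_private}, inside {vpc_block}={inside}")

# Subnets in one isolated network should not overlap each other.
no_overlap = not any(a.overlaps(b)
                     for i, a in enumerate(subnets)
                     for b in subnets[i + 1:])
print("subnets overlap:", not no_overlap)
```

Each /24 block holds 256 addresses, and all three fall in private address space, which is what lets them extend an existing internal network over the VPN connection.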
AWS + science = win
3.7 million classifications in just over three days
~15 million in less than a month
>2.6 million clicks in 100 hours
lots and lots and lots and lots and lots and lots of data and lots and lots of lots of data
scalability & availability
we are data geeks not data center geeks
data management
Shaq Image: Keith Allison under a CC-BY-SA license
Biomarker Warehouse
pre-clinical, clinical, 3rd party data and publications
[Chart: cost of In house vs. Managed Service vs. Cloud (AWS); legend: Servers, Storage, Overhead]
Estimated cost: 10 TB warehouse over 3 years
data processing
http://web.mit.edu/stardev/cluster/
http://cyclecomputing.com
http://wiki.github.com/documentcloud/cloud-crowd
sudo gem install cloud-crowd
http://leonardo.phys.washington.edu/feff/
XAFS
http://bioteam.net
ASSEMBLING GENOMES
140 million 454 reads
Image: Matt Wood
Map 100 million 100-base paired-end reads
Quad core with 5 GB of RAM would take 16 days
30 high-memory instances; 32 hours; $195
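The comparison can be worked through with a few lines of arithmetic. This sketch assumes the 32 hours is elapsed wall-clock time for the whole 30-instance run, which the slide does not state explicitly.

```python
# Figures from the slides.
reads_millions = 100
single_machine_hours = 16 * 24   # 16 days on one quad-core machine
cloud_hours = 32                 # assumed to be elapsed wall-clock time
cloud_cost_usd = 195

# How much sooner the result arrives, under the assumption above.
speedup = single_machine_hours / cloud_hours

# Cost normalized by the amount of data mapped.
cost_per_million_reads = cloud_cost_usd / reads_millions

print(f"speedup: {speedup:.0f}x")
print(f"cost: ${cost_per_million_reads:.2f} per million reads")
```

Under that assumption the job finishes 12 times sooner, at $1.95 per million reads.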
BLAT @ U. PENN
HEAVY-ION COLLISIONS
Problem: Quark matter physics conference imminent but no compute resources handy
Solution: NIMBUS context broker allowed researchers to provision 300 nodes and get the simulations done
BELLE MONTE CARLO
Credit: Tom Fifield
disk read/writes
slow & expensive
data processing
fast & cheap
distribute the data
parallel reads
data processing for the cloud
distributed file system (HDFS)
map/reduce
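For readers new to the model, here is a minimal single-process sketch of the map, shuffle, and reduce phases, using word counting as the example. It is a toy illustration in plain Python, not Hadoop code; the function names and input records are made up for this example.

```python
from collections import defaultdict

def map_phase(record):
    """Emit a (key, value) pair for every word in one input record."""
    for word in record.split():
        yield word.lower(), 1

def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values seen for one key into a single result."""
    return key, sum(values)

# Hypothetical input; in a real cluster each record could live on a different node.
records = ["the cloud", "the data", "data data"]

mapped = [pair for record in records for pair in map_phase(record)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

Because each call to `map_phase` and `reduce_phase` depends only on its own input, a framework is free to run them in parallel next to the data, which is what makes the distributed reads on the earlier slides pay off.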
Via Cloudera under a Creative Commons License
Via Cloudera under a Creative Commons License
apache pig
http://hadoop.apache.org/pig/
apache hive
http://hadoop.apache.org/hive/
work by @peteskomoroch
High Throughput Sequence Analysis
Mike Schatz, University of Maryland
CloudBurst
Catalog k-mers
Collect seeds
End-to-end alignment
http://cloudburst-bio.sourceforge.net; Bioinformatics 2009 25: 1363-1369
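The first two steps named on the slide can be illustrated with a toy sketch. This is not CloudBurst's implementation; the function names, the tiny sequences, and the choice of k = 3 are all assumptions made for this example, and the final end-to-end alignment step is left out.

```python
from collections import defaultdict

def kmers(sequence, k):
    """Yield (offset, k-mer) for every length-k window in the sequence."""
    for i in range(len(sequence) - k + 1):
        yield i, sequence[i:i + k]

def catalog(reference, k):
    """Catalog k-mers: map each k-mer to the reference positions where it occurs."""
    index = defaultdict(list)
    for pos, kmer in kmers(reference, k):
        index[kmer].append(pos)
    return index

def collect_seeds(read, index, k):
    """Collect seeds: (read offset, reference position) pairs sharing a k-mer."""
    seeds = []
    for offset, kmer in kmers(read, k):
        for ref_pos in index.get(kmer, []):
            seeds.append((offset, ref_pos))
    return seeds

# Hypothetical toy data.
reference = "ACGTACGTGA"
read = "CGTG"

index = catalog(reference, 3)
seeds = collect_seeds(read, index, 3)
print(seeds)
```

Each seed suggests a candidate placement of the read at `ref_pos - offset`; a full aligner would then verify those candidates with an end-to-end alignment.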
Crossbow: Rapid whole genome SNP analysis
Ben Langmead
http://bowtie-bio.sourceforge.net/crossbow/index.shtml
Preprocessed reads
Map: Bowtie
Sort: Bin and partition
Reduce: SoapSNP
Crossbow: Rapid whole genome SNP analysis
Langmead B, Schatz MC, Lin J, Pop M, Salzberg SL. Genome Biol 10(11): R134.
Crossbow condenses over 1,000 hours of resequencing computation into a few hours without requiring the user to own or operate a computer cluster
Assembly of Large Genomes with Cloud Computing.
Schatz MC, Sommer D, Kelley D, Pop M, et al. In Preparation.
http://contrail-bio.sourceforge.net
Scalable Genome Assembly
Input S3 bucket
Output S3 bucket
Amazon S3
Hadoop
Amazon EC2 Instances
Input dataset
output results
Deploy Application
Web Console, Command line tools
End
Notify
Get Results
Input Data
Amazon Elastic MapReduce
Elastic MapReduce
data storage & distribution
public & private
http://aws.amazon.com/publicdatasets/
sharing and collaboration
software distribution
application platforms
Image: O’Reilly Radar
business models
to conclude
built for scale
built for availability
shared dataspaces
global namespaces
task-based resources
new software architectures
new computing platforms
available today
[email protected]
Twitter: @mndoci
Presentation ideas from James Hamilton, @mza and @lessig
Thank you!