Talk at Microsoft Cloud Futures 2010

Post on 11-May-2015

DESCRIPTION

My talk from Cloud Futures 2010, organized by MSR

TRANSCRIPT

Scientific Computing with Amazon Web Services

Deepak Singh

Cloud Futures 2010

life science industry

Credit: Bosco Ho

By ~Prescott under a CC-BY-NC license

data

Image: Wikipedia

Image via image editor under a CC-BY License

Image: Matt Wood

gigabytes

terabytes

petabytes

petabytes

petabytes

exabytes?

really fast

Image: http://www.broadinstitute.org/~apleite/photos.html

data management

data processing

data sharing

Image: Chris Dagdigian

compute & storage limited

amazon web services

the cloud

has_many :definitions

infrastructure as a service

Your Custom Applications and Services

Compute: Amazon Elastic Compute Cloud (EC2), Elastic Load Balancing, Auto Scaling

Storage: Amazon Simple Storage Service (S3), AWS Import/Export

Database: Amazon RDS and SimpleDB

Content Delivery: Amazon CloudFront

Messaging: Amazon Simple Queue Service (SQS), Amazon Simple Notification Service (SNS)

Payments: Amazon Flexible Payments Service (FPS)

On-Demand Workforce: Amazon Mechanical Turk

Parallel Processing: Amazon Elastic MapReduce

Monitoring: Amazon CloudWatch

Management: AWS Management Console

Tools: AWS Toolkit for Eclipse, AWS SDK for .NET

Isolated Networks: Amazon Virtual Private Cloud

elasticity

3000 CPUs for one firm’s risk management application

[Chart: Number of EC2 Instances by day, Wednesday 4/22/2009 to Tuesday 4/28/2009; axis marks at 3000 and 300; annotation: 300 CPUs on weekends]

scalability

> 1PB of data in S3

highly available

Image: Chris Dagdigian

“Everything fails, all the time” -- Werner Vogels

“Things will crash. Deal with it” -- Jeff Dean

2-4% of servers will die annually

Source: Jeff Dean, LADIS 2009

1-5% of disk drives will die every year

Source: Jeff Dean, LADIS 2009
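As a rough illustration of what these rates mean at fleet scale, the sketch below turns an annual failure rate into expected failures per year and into the chance that at least one machine fails on a given day. The fleet size of 3000 used in the example, and the assumption that failures are independent and spread evenly over the year, are illustrative choices and not figures from the talk.

```python
def expected_annual_failures(fleet_size: int, annual_rate: float) -> float:
    """Expected number of machines lost per year at the given annual rate."""
    return fleet_size * annual_rate


def prob_any_failure_in_day(fleet_size: int, annual_rate: float) -> float:
    """Chance that at least one machine in the fleet fails on a given day.

    Assumes independent failures and a constant failure rate through the year.
    """
    # Per-machine probability of surviving a single day.
    daily_survival = (1.0 - annual_rate) ** (1.0 / 365.0)
    # The fleet has a failure-free day only if every machine survives it.
    return 1.0 - daily_survival ** fleet_size


if __name__ == "__main__":
    fleet = 3000  # hypothetical fleet size
    for rate in (0.02, 0.04):  # the 2-4% server range quoted on the slide
        print(rate,
              expected_annual_failures(fleet, rate),
              round(prob_any_failure_in_day(fleet, rate), 3))
```

Even at the low end of the range, a fleet of this size should expect to lose machines routinely, which is the point of the slides that follow.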

human errors

human errors: ~20% of admin issues have unintended consequences

Source: James Hamilton

scalable & available

assume sw/hw failure

design apps to be resilient

automation & alarming
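One common way to "assume failure" in application code is to retry a failing call with exponential backoff instead of giving up on the first error. The sketch below is a minimal, generic version; the function name `call_with_retries` and its parameters are hypothetical and not part of any AWS SDK.

```python
import time


def call_with_retries(operation, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Run operation(), retrying with exponential backoff when it raises.

    The delay doubles after each failed attempt: base_delay, 2x, 4x, ...
    The last failure is re-raised so the caller (or an alarm) can see it.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            # A real application would catch only errors known to be transient.
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


if __name__ == "__main__":
    calls = {"count": 0}

    def flaky():
        # Hypothetical operation that fails twice before succeeding.
        calls["count"] += 1
        if calls["count"] < 3:
            raise ConnectionError("simulated transient failure")
        return "ok"

    print(call_with_retries(flaky))
```

Passing `sleep` as a parameter keeps the helper testable without real waiting.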

US East Region

Availability Zone A

Availability Zone B

Availability Zone C

Availability Zone D


elastic load balancing

CloudWatch

auto scaling

elastic block store

elastic IP

SQS

SNS

cost effective

pay as you go

on-demand instances

reserved instances

spot instances

AMAZON VPC ARCHITECTURE

[Diagram labels: Your Network; Amazon Web Services Cloud; Secure VPN Connection over the Internet; VPN Gateway; Customer’s isolated AWS resources; External Customers; Subnets 10.32.1.0/24, 10.32.2.0/24, 10.32.3.0/24]
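The three subnets named in the VPC diagram can be checked with Python's standard `ipaddress` module. This is a small sketch; the enclosing block `10.32.0.0/16` is an assumption made for illustration, since the slide lists only the subnets.

```python
import ipaddress
from itertools import combinations

# Subnets as listed on the slide.
SUBNETS = [ipaddress.ip_network(cidr)
           for cidr in ("10.32.1.0/24", "10.32.2.0/24", "10.32.3.0/24")]

# Assumed enclosing address block (not stated in the talk).
ASSUMED_VPC_BLOCK = ipaddress.ip_network("10.32.0.0/16")


def subnets_are_disjoint(subnets):
    """True if no two subnets share any address."""
    return not any(a.overlaps(b) for a, b in combinations(subnets, 2))


def all_inside(subnets, parent):
    """True if every subnet lies within the parent block."""
    return all(net.subnet_of(parent) for net in subnets)


if __name__ == "__main__":
    for net in SUBNETS:
        print(net, "private:", net.is_private, "addresses:", net.num_addresses)
    print("disjoint:", subnets_are_disjoint(SUBNETS))
    print("inside assumed block:", all_inside(SUBNETS, ASSUMED_VPC_BLOCK))
```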

AWS + science = win

3.7 million classifications in just over three days

~15 million in less than a month

>2.6 million clicks in 100 hours

lots and lots and lots and lots and lots and lots of data and lots and lots of lots of data

scalability & availability

we are data geeks not data center geeks

data management

Shaq Image: Keith Allison under a CC-BY-SA license

Biomarker Warehouse

pre-clinical, clinical, 3rd party data and publications

Estimated cost: 10 TB warehouse over 3 years

[Chart: cost comparison of In House, Managed Service, and Cloud (AWS), broken down into Servers, Storage, and Overhead]

data processing

http://cyclecomputing.com

http://www.rightscale.com

http://leonardo.phys.washington.edu/feff/

XAFS

http://bioteam.net

ASSEMBLING GENOMES

140 million 454 reads

Image: Matt Wood

Map 100 million, 100 base paired end reads

Quad core with 5 GB of RAM would take 16 days

30 high-memory instances; 32 hours; $195
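The figures on these two slides imply a per-instance-hour cost and a wall-clock speedup. The arithmetic below is a sketch that assumes the 32 hours is wall-clock time for the whole 30-instance cluster and that $195 is the total bill; the slides do not state either explicitly.

```python
INSTANCES = 30            # high-memory instances, from the slide
WALL_CLOCK_HOURS = 32     # assumed to be elapsed time for the whole cluster
TOTAL_COST_USD = 195.0    # assumed to be the total cost of the run
SINGLE_MACHINE_DAYS = 16  # quad-core estimate, from the previous slide

# Total compute consumed by the cluster: 960 instance-hours.
instance_hours = INSTANCES * WALL_CLOCK_HOURS

# Implied average price: 195 / 960 = 0.203125 USD per instance-hour.
cost_per_instance_hour = TOTAL_COST_USD / instance_hours

# Elapsed-time improvement over the single machine: 384 h / 32 h = 12x.
speedup = SINGLE_MACHINE_DAYS * 24 / WALL_CLOCK_HOURS

if __name__ == "__main__":
    print(f"{instance_hours} instance-hours")
    print(f"${cost_per_instance_hour:.3f} per instance-hour")
    print(f"{speedup:.0f}x faster in elapsed time")
```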

BLAT @ U. PENN

HEAVY-ION COLLISIONS

Problem: Quark matter physics conference imminent but no compute resources handy

Solution: NIMBUS context broker allowed researchers to provision 300 nodes and get the simulations done

BELLE MONTE CARLO

Credit: Tom Fifield

disk read/writes: slow & expensive

data processing: fast & cheap

distribute the data: parallel reads

data processing for the cloud

distributed file system (HDFS)

map/reduce
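The map/reduce pattern can be sketched in a few lines of plain Python: a map step emits key/value pairs, a shuffle step groups them by key, and a reduce step combines each group. The example counts k-mers in sequencing reads. It runs in one process and only illustrates the data flow; Hadoop distributes the same steps across machines. All function names are illustrative.

```python
from collections import defaultdict


def map_phase(read, k):
    """Emit (k-mer, 1) for every k-length substring of one read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k], 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


def run_job(reads, k=3):
    """Run map, shuffle, and reduce over a list of reads."""
    pairs = (pair for read in reads for pair in map_phase(read, k))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(run_job(["ACGTAC", "CGTA"], k=3))
```

Because each map call looks at a single read and each reduce call at a single key, both phases can run in parallel on separate machines.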

Via Cloudera under a Creative Commons License

http://www.cascading.org/

apache pig

http://hadoop.apache.org/pig/

apache hive

http://hadoop.apache.org/hive/

work by @peteskomoroch

High Throughput Sequence Analysis

Mike Schatz, University of Maryland

CloudBurst

Catalog k-mers Collect seeds End-to-end alignment

http://cloudburst-bio.sourceforge.net; Bioinformatics 2009 25: 1363-1369
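The three stages named on the slide (catalog k-mers, collect seeds, end-to-end alignment) follow the general seed-and-extend idea. The sketch below is a simplified, single-process illustration of that idea and not CloudBurst's actual implementation: it allows mismatches but no gaps, and the function names, the seed length `k`, and the mismatch limit are assumptions.

```python
from collections import defaultdict


def catalog_kmers(reference, k):
    """Index every k-mer of the reference by its start position."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def collect_seeds(read, index, k):
    """Return candidate read start positions implied by exact k-mer hits."""
    candidates = set()
    for offset in range(0, len(read) - k + 1, k):  # non-overlapping seeds
        for pos in index.get(read[offset:offset + k], ()):
            start = pos - offset
            if start >= 0:
                candidates.add(start)
    return candidates


def align_end_to_end(read, reference, start, max_mismatches):
    """Compare the whole read at one position; return mismatches or None."""
    window = reference[start:start + len(read)]
    if len(window) < len(read):
        return None  # read would run off the end of the reference
    mismatches = sum(a != b for a, b in zip(read, window))
    return mismatches if mismatches <= max_mismatches else None


def map_read(read, reference, k=4, max_mismatches=1):
    """Return {start position: mismatch count} for acceptable placements."""
    index = catalog_kmers(reference, k)
    hits = {}
    for start in sorted(collect_seeds(read, index, k)):
        mismatches = align_end_to_end(read, reference, start, max_mismatches)
        if mismatches is not None:
            hits[start] = mismatches
    return hits


if __name__ == "__main__":
    print(map_read("ACGTACGGTT", "TTACGTACGGATCCA"))
```

Seeding limits the expensive full comparison to positions that already share an exact k-mer with the read.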

Crossbow: Rapid whole genome SNP analysis

Ben Langmead

http://bowtie-bio.sourceforge.net/crossbow/index.shtml

Preprocessed reads

Map: Bowtie

Sort: Bin and partition

Reduce: SoapSNP
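The flow on this slide (map reads, bin and partition by position, reduce each bin to SNP calls) can be mimicked with a toy pipeline. In the sketch below, a brute-force ungapped aligner stands in for Bowtie and a majority vote stands in for SoapSNP. Both stand-ins, the bin size, and the minimum depth are assumptions chosen only to show how data moves between stages.

```python
from collections import Counter, defaultdict


def map_align(read, reference, max_mismatches=1):
    """Stand-in for Bowtie: best ungapped placement by mismatch count."""
    best = None
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (start, mismatches)
    return None if best is None else best[0]


def bin_and_partition(alignments, bin_size):
    """Group per-base observations (position, base) by genomic bin."""
    bins = defaultdict(list)
    for start, read in alignments:
        for offset, base in enumerate(read):
            pos = start + offset
            bins[pos // bin_size].append((pos, base))
    return bins


def reduce_call_snps(observations, reference, min_depth=2):
    """Stand-in for SoapSNP: report positions where the majority base differs."""
    pileup = defaultdict(Counter)
    for pos, base in observations:
        pileup[pos][base] += 1
    snps = {}
    for pos, counts in pileup.items():
        base, depth = counts.most_common(1)[0]
        if depth >= min_depth and base != reference[pos]:
            snps[pos] = (reference[pos], base)
    return snps


def run_pipeline(reads, reference, bin_size=8):
    """Map each read, partition observations by bin, reduce each bin."""
    alignments = []
    for read in reads:
        start = map_align(read, reference)
        if start is not None:
            alignments.append((start, read))
    snps = {}
    for _, observations in sorted(bin_and_partition(alignments, bin_size).items()):
        snps.update(reduce_call_snps(observations, reference))
    return snps


if __name__ == "__main__":
    print(run_pipeline(["GTTTCAAG", "TTCAAGCT"], "ACGTTGCAAGCTTAGC"))
```

Since all observations for a position land in the same bin, each bin can be reduced independently of the others.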

Crossbow: Rapid whole genome SNP analysis

Langmead B, Schatz MC, Lin J, Pop M, Salzberg SL. Genome Biol 10(11): R134.

Crossbow condenses over 1,000 hours of resequencing computation into a few hours without requiring the user to own or operate a computer cluster

Assembly of Large Genomes with Cloud Computing.

Schatz MC, Sommer D, Kelley D, Pop M, et al. In Preparation.

http://contrail-bio.sourceforge.net

Scalable Genome Assembly

Input S3 bucket

Output S3 bucket

Amazon S3

Hadoop

Amazon EC2 Instances

Input dataset

output results

Deploy Application

Web Console, Command line tools

End

Notify

Get Results

Input Data

Amazon Elastic MapReduce

Elastic MapReduce

data storage & distribution

public & private

sharing and collaboration

software distribution

http://www.cloudbiolinux.com/

http://usegalaxy.org/cloud

application platforms

http://heroku.com

http://chempedia.com/

Image: O’Reilly Radar

business models

to conclude

built for scale

built for availability

shared dataspaces

global namespaces

task-based resources

new software architectures

new computing platforms

available today

http://aws.amazon.com/education

deesingh@amazon.com

Twitter: @mndoci

Presentation ideas from James Hamilton, @mza and @lessig

Thank you!
