cloud technical challenges

Cloud Technical Challenges Guy Coates Wellcome Trust Sanger Institute [email protected]

Upload: guy-coates

Post on 05-Dec-2014


DESCRIPTION

This talk covers the current challenges and opportunities in using cloud computing for data-heavy research computing. Talk given at the Marcus Evans "Cloud Computing in the Pharmaceutical Industry" conference, Frankfurt 2011.

TRANSCRIPT

Page 1: Cloud Technical Challenges

Cloud Technical Challenges

Guy Coates

Wellcome Trust Sanger Institute

[email protected]

Page 2: Cloud Technical Challenges

Outline

Background

Cloud Experiences

Barriers

Future Directions

Page 3: Cloud Technical Challenges

The Sanger Institute

Funded by Wellcome Trust.
• 2nd largest research charity in the world.
• ~700 employees.
• Based in Hinxton Genome Campus, Cambridge, UK.

Large scale genomic research.
• Sequenced 1/3 of the human genome (largest single contributor).
• We have active cancer, malaria, pathogen and genomic variation / human health studies.

All data is made publicly available.
• Websites, ftp, direct database access, programmatic APIs.

Page 4: Cloud Technical Challenges

Lost in the clouds...

Page 5: Cloud Technical Challenges

Victory!

Page 6: Cloud Technical Challenges

Our Cloud Experiences

Page 7: Cloud Technical Challenges

Hype Cycle

Awesome!

Just works...

Page 8: Cloud Technical Challenges

Ensembl

Ensembl is a system for genome annotation.

Data visualisation / mining web services.
• www.ensembl.org
• Provides web / programmatic interfaces to genomic data.
• 10k visitors / 126k page views per day.

Compute pipeline (HPTC workload).
• Take a raw genome and run it through a compute pipeline to find genes and other features of interest.
• Ensembl at Sanger/EBI provides automated analysis for 51 vertebrate genomes.

• Software is Open Source (Apache license).
• Data is free for download.

We have web services and HPTC workloads running on IaaS.

Page 9: Cloud Technical Challenges

Why Cloud?

Web services
• Was hosted in a single datacentre at the Genome Campus, UK.
• 1 datacentre = single point of failure.
• Access slow if you were not in western Europe.

Cloud Application
• Build worldwide network of mirrors on IaaS.

HPC
• People want to run the Ensembl HPC pipeline on their own data.
• Requires a skilled bioinformatician to get the software running, and access to an HPC cluster.

Cloud Application
• Build HPC SaaS.
• Users deploy ready-to-run Ensembl code on AWS; it self-assembles into an HPC cluster and analyses their data.

Page 10: Cloud Technical Challenges

Hype Cycle

Web services / Some HPC

Page 11: Cloud Technical Challenges

That was easy...

Page 12: Cloud Technical Challenges

Hype cycle

Sequencing informatics

Page 13: Cloud Technical Challenges

DNA sequencing

Page 14: Cloud Technical Challenges

Economic Trends:

Cost of sequencing halves every 12 months.
• cf. Moore's Law

The Human Genome Project:
• 13 years.
• 23 labs.
• $500 Million.

A human genome today:
• 3 days.
• 1 machine.
• $10,000.
• Large centres are now doing studies with 10,000s of genomes.

Trend will continue:
• Generation 3 sequencers are on their way.
• $500 genome is probable within 5 years.
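
As a rough check on the figures on this slide, the sketch below asks how long a $10,000 genome takes to reach $500 if the cost keeps halving every 12 months. The function name `years_to_reach` is made up for this illustration, and a constant halving time is an assumption, not a forecast from the talk.

```python
import math


def years_to_reach(cost_now, cost_target, halving_months=12.0):
    """Years for a cost that halves every `halving_months` to fall to `cost_target`."""
    # Number of halvings needed is log2 of the cost ratio.
    halvings = math.log2(cost_now / cost_target)
    return halvings * halving_months / 12.0


if __name__ == "__main__":
    years = years_to_reach(10_000, 500)
    print(f"$10,000 -> $500 takes about {years:.1f} years at a 12-month halving time")
```

Under that assumption the answer is a little over four years, which is consistent with the "within 5 years" estimate.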

Page 15: Cloud Technical Challenges

The scary graph

Peak Yearly capillary sequencing: 30 Gbase

Current weekly sequencing: 3000 Gbase

Page 16: Cloud Technical Challenges

[Figure: Disk Storage (Terabytes) by Year, 1994 to 2009; vertical axis 0 to 6000]

Managing Growth

We have exponential growth in storage and compute.
• Storage / compute doubles every 12 months.
• 2009: ~7 PB raw.

Gigabase of sequence ≠ Gigabyte of storage.
• 16 bytes per base for sequence data.
• Intermediate analysis typically needs 10x the disk space of the raw data.

Moore's law will not save us.
• Transistor / disk density: Td = 18 months
• Sequencing cost: Td = 12 months
• Sequencing output: Td = 3-6 months
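
The storage arithmetic on this slide can be sketched as follows. The 16 bytes per base and the 10x intermediate factor come from the slide, and the 3000 Gbase weekly figure comes from "The scary graph". The function names are invented for this sketch, and reading Td as a doubling time (a halving time for cost) is an assumption.

```python
BYTES_PER_BASE = 16        # from the slide: 16 bytes per base of sequence data
INTERMEDIATE_FACTOR = 10   # from the slide: analysis needs ~10x the raw data


def storage_terabytes(gigabases):
    """Return (raw_TB, intermediate_TB) for a sequencing output given in Gbase."""
    raw_bytes = gigabases * 1e9 * BYTES_PER_BASE
    raw_tb = raw_bytes / 1e12          # decimal terabytes
    return raw_tb, raw_tb * INTERMEDIATE_FACTOR


def growth_factor(years, doubling_months):
    """How many times a quantity grows in `years` if it doubles every `doubling_months`."""
    return 2 ** (years * 12 / doubling_months)


if __name__ == "__main__":
    raw, scratch = storage_terabytes(3000)   # one week at 3000 Gbase
    print(f"raw: {raw:.0f} TB, intermediate: {scratch:.0f} TB")
    for label, td in [("disk density", 18), ("sequencing output (slow)", 6),
                      ("sequencing output (fast)", 3)]:
        print(f"{label}: x{growth_factor(3, td):.0f} in 3 years")
```

Comparing the growth factors over the same period shows why sequencing output outruns disk density.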

Page 17: Cloud Technical Challenges

What do you need to do sequencing?

[Diagram: components needed for sequencing]
• Sequencer
• Analysis software
• LIMS System / Data Tracking
• Sample prep
• Data repository
• External repository
• HPC Resource
• Integrated compute

Page 18: Cloud Technical Challenges

What IT do you need to do sequencing?

[Diagram: the same components, with one part marked "Part covered in the grant"]
• Sequencer
• Analysis software
• Data repository
• External repository
• LIMS System / Data Tracking
• Sample prep
• HPC Resource
• Integrated compute

Page 19: Cloud Technical Challenges

This is really hard...

We have a whole division of HPC specialists, LIMS developers and bioinformaticians.

What about smaller labs with 1 or 2 sequencers?

Page 20: Cloud Technical Challenges

...and then change it.

Sequencing informatics is massively fluid.
• New chemistry.
• More sequencing machines.
• New analysis software.

Constant cycle of development and deployment.

Page 21: Cloud Technical Challenges

How can cloud help?

Page 22: Cloud Technical Challenges

What can we put on the Cloud?

[Diagram: components needed for sequencing]
• Sequencer
• Analysis software
• LIMS System / Data Tracking
• Sample prep
• Data repository
• External repository
• HPC Resource
• Integrated compute

Page 23: Cloud Technical Challenges

Does it Cloud?

How do we decide what to cloud?

Rule of thumb borrowed from HPC.
• Small data / High CPU work better in distributed environments.

IO Bound / Large data

CPU Bound / small data
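
One way to make this rule of thumb concrete is to compare the time needed to move the data with the time spent computing on it. The sketch below is a hypothetical heuristic, not a method from the talk: the function names and the `max_ratio` threshold are assumptions, and the default rate of 25 Mbytes/s is the Cambridge to EC2 Dublin figure quoted on the "Moving data is hard" slide.

```python
def transfer_hours(data_gb, rate_mbytes_per_s):
    """Hours needed to move `data_gb` (decimal GB) at a sustained rate in Mbytes/s."""
    return data_gb * 1000 / rate_mbytes_per_s / 3600


def is_cloud_friendly(data_gb, cpu_hours, rate_mbytes_per_s=25.0, max_ratio=0.1):
    """Hypothetical test: a job suits the cloud if moving its data takes
    no more than `max_ratio` of the time spent computing on it."""
    return transfer_hours(data_gb, rate_mbytes_per_s) <= max_ratio * cpu_hours


if __name__ == "__main__":
    # CPU bound / small data: 1 GB of input, 100 hours of compute.
    print(is_cloud_friendly(data_gb=1, cpu_hours=100))
    # IO bound / large data: 1000 GB of input, 10 hours of compute.
    print(is_cloud_friendly(data_gb=1000, cpu_hours=10))
```

The threshold is arbitrary; the point is only that the ratio of data movement to compute, rather than either number alone, decides which end of the axis a workload sits on.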

Page 24: Cloud Technical Challenges

Sequencing Data

( Raw data (TB) )

Alignments (200 GB)

Sequence + quality data (500 GB)

Variation data (1GB)

Individual features (3MB)

Structured data (databases)

Unstructured data (flat files)

Data size per Genome

Tracking / LIMs (100s Kbytes)

Page 25: Cloud Technical Challenges

Sequencing Data

( Raw data (TB) )

Alignments (200 GB)

Sequence + quality data (500 GB)

Variation data (1GB)

Individual features (3MB)

Structured data (databases)

Unstructured data (flat files)

Data size per Genome

Cloud Friendly

Cloud Unfriendly

Tracking / LIMs (100s Kbytes)

Page 26: Cloud Technical Challenges

Can we Cloudify Sequencing?

[Diagram: components needed for sequencing]
• Sequencer
• Analysis software
• Sample prep
• Data repository
• External repository
• HPC Resource
• Integrated compute
• LIMS System / Data Tracking

Page 27: Cloud Technical Challenges

What are the blockers?

HPC infrastructure is now available in the cloud.
• Good enough for 95% of sequencing.

Doing big data is hard:

1. You have to get the data there first.

2. You may not be allowed to put the data there.

Page 28: Cloud Technical Challenges

Moving data is hard

Tools:
• Standard tools (FTP, ssh/rsync) are not suited to wide-area networks.
• WAN tools: gridFTP / FDT / Aspera.

Data transfer rates (gridFTP/FDT via our 2 Gbit/s site link).
• Cambridge → EC2 East coast: 12 Mbytes/s (96 Mbits/s)
• Cambridge → EC2 Dublin: 25 Mbytes/s (200 Mbits/s)
• 11 hours to move 1 TB to Dublin.
• 23 hours to move 1 TB to East coast.

What speed should we get?
• Once we leave JANET (UK academic network), finding out what the connectivity is and what we should expect is almost impossible.

Do you have fast enough disks at each end to keep the network full?

Why not just ship disks?
• Logistical nightmare.
• Format issues, corruption, slow.
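
The transfer times quoted on this slide follow directly from the measured rates. The short sketch below reproduces them and also shows how little of the 2 Gbit/s site link those rates use. The function names are made up for this example, and decimal units (1 TB = 10^12 bytes) are assumed.

```python
def hours_to_move(terabytes, mbytes_per_s):
    """Hours needed to move `terabytes` (decimal TB) at a sustained rate in Mbytes/s."""
    seconds = terabytes * 1e12 / (mbytes_per_s * 1e6)
    return seconds / 3600


def link_utilisation(mbytes_per_s, link_gbits_per_s):
    """Fraction of the site link's capacity used by a transfer."""
    return mbytes_per_s * 8 / (link_gbits_per_s * 1000)


if __name__ == "__main__":
    for site, rate in [("EC2 Dublin", 25), ("EC2 East coast", 12)]:
        print(f"{site}: {hours_to_move(1, rate):.1f} h per TB, "
              f"{link_utilisation(rate, 2):.1%} of a 2 Gbit/s link")
```

Both transfers use a small fraction of the site link, which suggests the bottleneck lies somewhere beyond it.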

Page 29: Cloud Technical Challenges

Networking

How do we improve data transfers across the public internet?
• CERN approach: don't.
• Dedicated networking has been put in between CERN and the T1 centres who get all of the CERN data.

Can it work for cloud?
• Buy dedicated bandwidth to a provider.
• Ties you in.
• Should they pay?

We need good connectivity to everywhere.

Page 30: Cloud Technical Challenges

Data Security

Page 31: Cloud Technical Challenges

Are you allowed to put data on the cloud?

Default policy:

“Our data is confidential/important/critical to our business. We must keep our data on our computers.”

Page 32: Cloud Technical Challenges

What does “My System” mean?

Purchased computer in my data centre

Leased computer in my data centre

Purchased computer in a co-lo facility

Traditionally outsourced IT service

IaaS on a cloud provider

SaaS on a cloud provider

My System Not my system

Root / Admin Access?

Encrypted/ Non encrypted?

VPN / inside or outside firewall?

Legal / IP agreement in place?

Page 33: Cloud Technical Challenges

How confidential is the data?

Publicly available Genome data

Anonymised datasets (e.g. individual genomes with no identifiers)

Trade Secret / Patentable data

Low Risk High Risk

Personally identifiable datasets

Page 34: Cloud Technical Challenges

Reasons to be optimistic:

Most (all?) data security issues can be dealt with.
• But the devil is in the details.
• Data can be put on the cloud, if care is taken.

It is probably more secure there than in your own data-centre.
• Can you match AWS data availability guarantees?

Are cloud providers different from any other organisation you outsource to?

Page 35: Cloud Technical Challenges

Outstanding Issues

Audit and compliance:
• If you need IP agreements above your provider's standard T&Cs, how do you push them through?

Geographical boundaries mean little in the cloud.
• Data can be replicated across national boundaries without the end user being aware.

Moving personally identifiable data outside of the EU is potentially problematic.
• (Can be problematic within the EU; privacy laws are not as harmonised as you might think.)
• More sequencing experiments are trying to link with phenotype data (i.e. personally identifiable medical records).

Page 36: Cloud Technical Challenges

Private Cloud to the rescue?

Sequencing increasingly takes place in large consortiums.
• E.g. International Cancer Genome Consortium (http://www.icgc.org)

Can we do private clouds within the consortium?

Page 37: Cloud Technical Challenges

Traditional Collaboration

[Diagram: a Sequencing Centre + DCC and several other sequencing centres, with separate IT boxes]

Page 38: Cloud Technical Challenges

Cloud Collaborations

[Diagram: several sequencing centres and Private Cloud (IaaS / SaaS) boxes]

Page 39: Cloud Technical Challenges

Private Cloud

Advantages:
• LIMS / analysis software easily shared with consortium.
• Small organisations leverage expertise of big IT organisations.
• Academia tends to be linked by fast research networks.
• Moving data is easier.
• Consortium will be signed up to data-access agreements.
• Simplifies data governance.

Problems:
• Big change in funding model.
• Are big centres set up to provide private cloud services?
• Selling services is hard if you are a charity.
• Can we do it as well as the big internet companies?

Page 40: Cloud Technical Challenges

Cloud data archives

Page 41: Cloud Technical Challenges

Dark Archives

Storing data in an archive is not particularly useful.
• You need to be able to access the data and do something useful with it.

Data in current archives is “dark”.
• You can put/get data, but cannot compute across it.
• Is data in an inaccessible archive really useful?

Page 42: Cloud Technical Challenges

Example problem:

“We want to run our pipeline across 100TB of data currently in EGA/SRA.”

We will need to de-stage the data to Sanger, and then run the compute.
• Extra 0.5 PB of storage, 1000 cores of compute.
• 3 month lead time.
• ~$1.5M capex.
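
To get a feel for the de-staging step alone, the sketch below estimates how many days it takes to copy 100 TB at a few sustained rates. The talk gives no transfer rate for EGA/SRA to Sanger, so the rates used here are assumptions borrowed from the EC2 measurements and the 2 Gbit/s site link on the "Moving data is hard" slide. The function name is invented for this example.

```python
def destage_days(terabytes, mbits_per_s):
    """Days needed to copy `terabytes` (decimal TB) at a sustained rate in Mbit/s."""
    seconds = terabytes * 1e12 * 8 / (mbits_per_s * 1e6)
    return seconds / 86400


if __name__ == "__main__":
    # Assumed rates, not measurements of the archive link.
    for label, rate in [("96 Mbit/s", 96), ("200 Mbit/s", 200),
                        ("2 Gbit/s (full site link)", 2000)]:
        print(f"100 TB at {label}: {destage_days(100, rate):.1f} days")
```

Even at the full site link rate the copy takes days, before any storage or compute has been provisioned.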

Page 43: Cloud Technical Challenges

Cloud / Computable archives

Move the compute to the data.
• Upload workload onto VMs.
• Put VMs on compute that is “attached” to the data.

Federated between centres.
• Grid software built on top of cloud components.
• Avoids scaling problems inherent in putting everything in one place.

[Diagram: rows of CPUs alongside Data, with a VM next to the Data]
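
A minimal sketch of the "move the compute to the data" idea: given a catalogue of which centre holds which dataset, group the work so that each centre receives one workload covering only its local data. The catalogue, the dataset names and the centre names are all hypothetical, and this is an illustration of the idea rather than any existing grid or cloud scheduler.

```python
from collections import defaultdict

# Hypothetical catalogue: which centre holds which dataset.
DATASET_LOCATION = {
    "dataset-a": "centre-1",
    "dataset-b": "centre-2",
    "dataset-c": "centre-1",
}


def plan_jobs(datasets, locations):
    """Group datasets by the centre that holds them, so that one VM workload
    can be sent to each centre instead of copying the data to one place."""
    plan = defaultdict(list)
    for name in datasets:
        plan[locations[name]].append(name)
    return dict(plan)


if __name__ == "__main__":
    for centre, names in plan_jobs(DATASET_LOCATION.keys(), DATASET_LOCATION).items():
        print(f"send VM to {centre}: {', '.join(names)}")
```

Only the VM image and the results cross the network; the datasets stay where they are.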

Page 44: Cloud Technical Challenges

Acknowledgements

Sanger

• Phil Butcher
• James Beal
• Pete Clapham
• Simon Kelley
• Gen-Tao Chiang
• Steve Searle
• Jan-Hinnerk Vogel
• Bronwen Aken

EBI

• Glenn Proctor
• Steve Keenan