TRANSCRIPT
Bionimbus: Lessons from a Petabyte-Scale
Science Cloud Service Provider (CSP)
Robert Grossman
Institute for Genomics & Systems Biology Center for Research Informatics
Computation Institute Department of Medicine
University of Chicago &
Open Data Group
September 11, 2012
The OSDC & Bionimbus Teams
• Open Science Data Cloud (OSDC) Team – Matt Greenway, Allison Heath, Ray Powell, Rafael Suarez.
  – Major funding for the OSDC is provided by the Gordon and Betty Moore Foundation.
• Bionimbus Team – Elizabeth Bartom, Casey Brown, Jason Grundstad, David Hanley, Nicolas Negre, Tom Stricker, Matt Slattery, Rebecca Spokony & Kevin White.
  – Bionimbus is a joint project between the Laboratory for Advanced Computing & the White Lab at the University of Chicago and uses in part OSDC infrastructure.
Let’s Step Back 20 Years
• 1992-96: Petabyte Access & Storage Solutions (PASS) Project for the SSC.
• It developed & benchmarked federated relational, object-oriented database, object store, & column-oriented data warehouse solutions at the TB scale.
A picture of CERN’s Large Hadron Collider (LHC). The LHC took about a decade to construct, and cost about $4.75 billion. Source of picture: Conrad Melvin, Creative Commons BY-SA 2.0, www.flickr.com/photos/58220828@N07/5350788732
Part 1. Genomics as a Big Data Science
Source: Lincoln Stein
One Million Genomes
• Sequencing a million genomes would most likely fundamentally change the way we understand genomic variation.
• The genomic data for a patient is about 1 TB (including samples from both tumor and normal tissue).
• One million genomes is about 1000 PB or 1 EB
• With compression, it may be about 100 PB
• At $1000/genome, the sequencing would cost about $1B
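The numbers above can be checked with a quick back-of-envelope script. The 10x compression ratio is an assumption implied by the slide's 100 PB figure, not a stated fact:

```python
# Back-of-envelope check of the one-million-genomes figures above.
TB_PER_GENOME = 1.0          # tumor + normal sample, per the slide
GENOMES = 1_000_000
COMPRESSION = 10             # assumed ratio behind the ~100 PB estimate
COST_PER_GENOME = 1_000      # dollars

raw_pb = GENOMES * TB_PER_GENOME / 1_000       # 1 PB = 1000 TB
raw_eb = raw_pb / 1_000                        # 1 EB = 1000 PB
compressed_pb = raw_pb / COMPRESSION
cost_billions = GENOMES * COST_PER_GENOME / 1e9

print(raw_pb, raw_eb, compressed_pb, cost_billions)
# 1000.0 1.0 100.0 1.0
```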
Big data driven discovery on 1,000,000 genomes and 1 EB of data.
[Diagram: genomic-driven diagnosis, an improved understanding of genomic science, and genomic-driven drug development together lead to precision diagnosis and treatment and to preventive health care. Example tumor strata: TNBC, ER+.]
Source: White Lab, University of Chicago.
With genomics, we can stratify diseases and treat each stratum differently.
Clonal Evolution of Tumors
Tumors evolve temporally and spatially.
Source: Mel Greaves & Carlo C. Maley, Clonal evolution in cancer, Nature, Volume 481, pages 306-313, 2012.
Combinations of Rare Alleles
[Diagram: penetrance vs. allele frequency, with frequency running from very rare through rare (0.001) and uncommon (0.01) to common (0.1), and penetrance from low through modest and intermediate to high. Rare, high-penetrance alleles cause Mendelian disease; common variants of modest effect are most of the variants implicated in common disease by GWA; low-frequency variants with intermediate penetrance sit between; there are rare examples of high-penetrance common variants influencing common disease; and rare variants of small effect are very hard to identify by genetic means.]
Source: Mark McCarthy
TCGA Analysis of Lung Cancer
• 178 cases of SQCC (squamous cell carcinoma of the lung)
• Matched tumor & normal
• Mean of 360 exonic mutations, 323 CNV, & 165 rearrangements per tumor
Source: The Cancer Genome Atlas Research Network, Comprehensive genomic characterization of squamous cell lung cancers, Nature, 2012, doi:10.1038/nature11404.
Some Examples of Big Data Science

Discipline       | Duration  | Size          | # Devices
HEP - LHC        | 10 years  | 15 PB/year*   | One large instrument
Astronomy - LSST | 10 years  | 12 PB/year**  | One large instrument
Genomics - NGS   | 2-4 years | 0.5 TB/genome | 1000’s of smaller instruments

*At full capacity, the Large Hadron Collider (LHC), the world's largest particle accelerator, is expected to produce more than 15 million gigabytes of data each year. … This ambitious project connects and combines the IT power of more than 140 computer centres in 33 countries. Source: http://press.web.cern.ch/public/en/Spotlight/SpotlightGrid_081008-en.html
**As it carries out its 10-year survey, LSST will produce over 15 terabytes of raw astronomical data each night (30 terabytes processed), resulting in a database catalog of 22 petabytes and an image archive of 100 petabytes. Source: http://www.lsst.org/News/enews/teragrid-1004.html
Part 2. What Instrument Do we Use to Make Big Data Discoveries?
How do we build a “datascope?”
What is big data?
TB? PB? EB? ZB?
Think of data as big if you measure it in megawatts (MW); for example, Facebook’s Prineville Data Center is 30 MW.
Another way:
opencompute.org
An algorithm and its computing infrastructure are “big-data scalable” if adding a rack (or container) of data (and corresponding processors) lets you do the same computation in the same time, but over more data.
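One way to read this definition: if per-rack throughput is fixed and the work parallelizes across racks, then growing the data and the racks together keeps wall-clock time constant. A minimal sketch (the throughput number is made up):

```python
def wall_clock_hours(data_tb, racks, tb_per_rack_per_hour=50.0):
    """Time for an embarrassingly parallel scan: the data is spread
    evenly across racks and each rack processes only its own share."""
    return data_tb / (racks * tb_per_rack_per_hour)

# Doubling the data *and* the racks leaves the running time unchanged,
# which is the "big-data scalable" property described above.
t1 = wall_clock_hours(data_tb=1_000, racks=10)
t2 = wall_clock_hours(data_tb=2_000, racks=20)
print(t1, t2)  # 2.0 2.0
```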
Commercial Cloud Service Provider (CSP): 15 MW Data Center
• 100,000 servers, 1 PB DRAM, 100’s of PB of disk
• Automatic provisioning and infrastructure management
• Monitoring, network security and forensics
• Accounting and billing; customer-facing portal
• Data center network, ~1 Tbps egress bandwidth
• ~25 operators for a 15 MW commercial cloud
What are some of the important differences between commercial and research-focused CSPs?
Science Clouds
|                | Science CSP                                                                       | Commercial CSP                                                    |
| POV            | Democratize access to data. Integrate data to make discoveries. Long term archive. | As long as you pay the bill; as long as the business model holds. |
| Data & Storage | Data intensive computing & high performance storage                               | Internet-style scale out and object-based storage                 |
| Flows          | Large data flows in and out                                                       | Lots of small web flows                                           |
| Streams        | Streaming processing required                                                     | N/A                                                               |
| Accounting     | Essential                                                                         | Essential                                                         |
| Lock in        | Moving environments between CSPs is essential                                     | Lock-in is good                                                   |
Part 3. The Open Cloud Consortium’s Open Science Data Cloud
www.opencloudconsortium.org
• U.S.-based not-for-profit corporation.
• Manages cloud computing infrastructure to support scientific research: Open Science Data Cloud.
• Manages cloud computing testbeds: Open Cloud Testbed.
Cloud Services Operations Centers (CSOC)
• The OSDC operates a Cloud Services Operations Center (or CSOC).
• It is a CSOC focused on supporting Science Clouds for researchers.
• Compare to Network Operations Center or NOC.
• Both are an important part of cyber infrastructure for big data science.
Different Styles of OSDC Racks
• Design 1: Put cores over spindles. Higher cost, but easy to compute over all the data.
• Design 2: Separate (some of the) storage from the compute.
2012 OSDC rack design (draft): 950 TB / rack, 600 cores / rack
Open Science Data Cloud
• 3 PB in 2011; 10 PB in 2012; able to scale to 100 PB?
• Automatic provisioning and infrastructure management
• Monitoring, compliance, & security
• Accounting and billing (OSDC)
• Customer Facing Portal (Tukey)
• Data center network, ~100 Gbps bandwidth
• 5-12 operators to operate a 1-5 MW science cloud
Science Cloud SW & Services
OSDC Data Stack based upon OpenStack, Hadoop, GlusterFS, UDT, …
OSDC Philosophy
• We try to automate as much as possible (we automate the setup & operations of a rack).
• We try to write as little software as possible.
• Each project is a bit different, but in general:
• We assign (permanent) IDs to data managed by the OSDC and manage the associated metadata.
• We assign and enforce permissions for users & groups of users, and for files/objects, collections of files/objects, and collections of collections.
• We support RESTful interfaces.
• We do accounting for storage and core-hours.
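The ID-plus-permissions model sketched above, with permanent IDs, user/group permissions, and nested collections, could look roughly like this. Names and structure are illustrative, not the actual OSDC implementation:

```python
import uuid

class Item:
    """A file/object or a collection; collections can nest, giving
    collections of collections. Each item gets a permanent ID."""
    def __init__(self, name):
        self.id = str(uuid.uuid4())   # permanent ID assigned at ingest
        self.name = name
        self.acl = {}                 # principal -> set of permissions
        self.children = []            # empty for plain files/objects

    def grant(self, principal, perm):
        self.acl.setdefault(principal, set()).add(perm)

    def allowed(self, user, perm, groups=()):
        # A user may act if they, or any group they belong to, holds perm.
        return any(perm in self.acl.get(p, set()) for p in (user, *groups))

genomes = Item("tumor-normal-pairs")
genomes.grant("group:white-lab", "read")
print(genomes.allowed("alice", "read", groups=("group:white-lab",)))  # True
print(genomes.allowed("bob", "read"))                                 # False
```

In a real system the ACL check would sit behind the RESTful interface, and the same IDs would key the storage and core-hour accounting records.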
Some Of Our Biggest Mistakes
• Not charging for services. This resulted in a lot of bad behavior.
• Trying to support donated equipment without adequate staff.
• Being too optimistic about when big data software would be ready for prime time.
• Some problems with big data software don’t show up below the full scale of the OSDC, but we have only one OSDC, and it is difficult to test at this scale.
Essential Services for a Science CSP
• Support for data intensive computing
• Support for big data flows
• Account management, authentication and authorization services
• Health and status monitoring
• Billing and accounting
• Ability to rapidly provision infrastructure
• Security services, logging, event reporting
• Access to large amounts of public data
• High performance storage
• Simple data export and import services
[Diagram: matching projects to infrastructure by data size and number of projects.
• Small data, 1000’s of projects: individual scientists & small projects on public infrastructure.
• Medium to large data, 100’s of projects: community based science via Science as a Service on shared community infrastructure.
• Very large data, 10’s of projects: very large projects on dedicated infrastructure.]
Part 4. Bionimbus
Bionimbus is a joint project between the Laboratory for Advanced Computing & the White Lab at the University of Chicago.
Step 1. Prepare a sample.
Step 2. Log in to Bionimbus and get a Bionimbus Key.
Step 3. Send your sample to the sequencing center.
Step 4. Log in to Bionimbus and view your data.
Step 5. Use Bionimbus to perform standard and custom pipelines.
Bionimbus can launch multiple virtual machines.
Bionimbus Virtual Machine Releases
• Peak Calling: MAT, MA2C, PeakSeq, MACS, SPP
• Quality Control: various
• Alignment & Genotyping: Bowtie, TopHat, Samtools, Picard
Software Tools: Moving Genomes
Bionimbus Community Genomic Cloud
[Diagram: each researcher gets a personal “dropbox” + compute, connected to a cloud for public data (1K genomes, PubMed, etc.).]
Bionimbus Private Genomic Cloud
[Diagram: each researcher gets a personal “dropbox” & compute, connected to a cloud for public data (1K genomes, PubMed, etc.) and a cloud for controlled data (TCGA, dbGaP).]
Bionimbus Private Biomedical Cloud
[Diagram: each researcher gets a personal “dropbox” plus compute, connected to a cloud for public data (1K genomes, PubMed, etc.), a cloud for controlled data (TCGA, dbGaP), and a cloud for PHI data backed by a clinical research data warehouse that supports scatter, gather queries.]
[Diagram: internal sequencers and an external sequencing partner feed data, via a BID Generator, into the Bionimbus Private Cloud UC, the Bionimbus Community Cloud, the Bionimbus Private Cloud XY, Amazon, and dbGaP.]
Step 1. Get a Bionimbus ID (BID); assign project, private/community/public cloud, etc.
Step 2. Send sample to be sequenced.
Step 3a. Return raw reads.
Step 3b. Return variant calls, CNVs, annotation, …
Step 4. Secure data routing to the appropriate cloud based upon the BID.
Step 5. Cloud-based analysis using IGSB and 3rd party tools and applications.
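The routing step can be sketched as a lookup from the metadata attached to a BID to a destination cloud. The cloud names follow the slides, but the metadata fields, the routing table, and the example record are illustrative assumptions, not the actual Bionimbus schema:

```python
# Illustrative sketch of BID-based routing: data returned by the
# sequencing center is sent to a cloud chosen from the data class
# that was assigned to the BID when the sample was registered.
ROUTES = {
    "community": "Bionimbus Community Cloud",
    "private": "Bionimbus Private Cloud UC",
    "public": "Cloud for Public Data",
}

def route(bid_record):
    """Return the destination cloud for a BID's metadata record."""
    return ROUTES[bid_record["data_class"]]

bid_record = {"bid": "BID-000123", "project": "SQCC", "data_class": "private"}
print(route(bid_record))  # Bionimbus Private Cloud UC
```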
• web2py-based Front End
• Analysis Pipelines & Re-analysis Services
• Database Services (PostgreSQL)
• Data Cloud Services (Hadoop, Sector/Sphere)
• Data Ingestion Services (IDs, etc.)
• Utility Cloud Services (Eucalyptus, OpenStack)
• Intercloud Services (UDT, replication)
• >300 ChIP datasets: chromatin/RNA timecourse, CBP, PolII, Pho/silencers, HDACs, insulators, TFs
• Predictions: 537 silencers, 2,307 new promoters, 12,285 enhancers, 14,145 insulators
www.modencode.org
Negre et al., Nature, 2011
Part 5. Managing One Million Genomes
• Sequence (BAM) files, i.e. sequence data in binary form (100-1000 PB): NoSQL, DFS, file overlays?
• Variation (VCF) files, i.e. genomic variation (1-10 PB): NoSQL & scientific databases
• Summary level data (10-100 TB), enriched with clinical data: relational databases
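The three tiers suggest dispatching each file to a storage backend by its type. A toy sketch follows; the backend names echo the slide, but the dispatch function and the use of a `.csv` extension for summary-level data are assumptions for illustration:

```python
from pathlib import Path

# Illustrative mapping of the three data tiers above to backends.
TIERS = {
    ".bam": "DFS / file overlay",           # 100-1000 PB of raw reads
    ".vcf": "NoSQL / scientific database",  # 1-10 PB of variant calls
    ".csv": "relational database",          # 10-100 TB of summary data
}

def storage_tier(filename):
    """Pick a storage backend from the file's extension."""
    suffix = Path(filename).suffix
    if suffix not in TIERS:
        raise ValueError(f"no storage tier for {filename}")
    return TIERS[suffix]

print(storage_tier("patient-0001.tumor.bam"))  # DFS / file overlay
```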
Acknowledgements
Major funding and support for the Open Science Data Cloud is provided by the Gordon and Betty Moore Foundation, which has provided $2M of funding to the OSDC to launch Phase 1 of the project (2011-2014). Moore Foundation funding is used to support the OSDC-Adler, Sullivan and Root facilities. Additional funding for the OSDC has been provided by the following sponsors:
• The OCC-Y Hadoop Cluster (approximately 1000 cores and 1 PB of storage) was donated by Yahoo! in 2011.
• Cisco provides the OSDC access to the Cisco C-Wave, which connects OSDC data centers with 10 Gbps wide area networks.
• NSF awarded the OSDC a 5-year (2010-2016) $3.5M PIRE award to train scientists to use the OSDC and to further develop the underlying technology.
• OSDC technology for high performance data transport is supported in part by NSF Award 1127316.
• The StarLight Facility in Chicago enables the OSDC to connect to over 30 high performance research networks around the world at 10 Gbps or higher, with an increasing number of 100 Gbps connections.
The OSDC is managed by the Open Cloud Consortium, a 501(c)(3) not-for-profit corporation. If you are interested in providing funding or donating equipment or services, please contact us at [email protected].
For more information
• You can find some more information on my blog:
rgrossman.com.
• Some of my technical papers are also available there.
• My email address is robert.grossman at uchicago dot edu
Sources for images
• The image of the hard disk is from Norlando Pobre, Creative Commons.
• The image of the Facebook Prineville Data Center is from the Intel Free Press, www.flickr.com/photos/intelfreepress/6722296855/, Creative Commons BY 2.0.
• The image of the LHC is from Conrad Melvin, Creative Commons BY-SA 2.0, www.flickr.com/photos/58220828@N07/5350788732