TRANSCRIPT
Bionimbus: Lessons from a Petabyte-Scale
Science Cloud Service Provider (CSP)
Robert Grossman
Institute for Genomics & Systems Biology Center for Research Informatics
Computation Institute Department of Medicine
University of Chicago &
Open Data Group
September 11, 2012
The OSDC & Bionimbus Teams
• Open Science Data Cloud (OSDC) Team – Matt Greenway, Allison Heath, Ray Powell, Rafael Suarez.
  – Major funding for the OSDC is provided by the Gordon and Betty Moore Foundation.
• Bionimbus Team – Elizabeth Bartom, Casey Brown, Jason Grundstad, David Hanley, Nicolas Negre, Tom Stricker, Matt Slattery, Rebecca Spokony & Kevin White.
  – Bionimbus is a joint project between the Laboratory for Advanced Computing & the White Lab at the University of Chicago and uses in part OSDC infrastructure.
Let’s Step Back 20 Years
• 1992-96: Petabyte Access & Storage Solutions (PASS) Project for the SSC.
• It developed & benchmarked federated relational, object-oriented database, object store, & column-oriented data warehouse solutions at the TB scale.
A picture of CERN’s Large Hadron Collider (LHC). The LHC took about a decade to construct, and cost about $4.75 billion. Source of picture: Conrad Melvin, Creative Commons BY-SA 2.0, www.flickr.com/photos/58220828@N07/5350788732
Part 1. Genomics as a Big Data Science
Source: Lincoln Stein
One Million Genomes
• Sequencing a million genomes would most likely fundamentally change the way we understand genomic variation.
• The genomic data for a patient is about 1 TB (including samples from both tumor and normal tissue).
• One million genomes is about 1000 PB or 1 EB
• With compression, it may be about 100 PB
• At $1000/genome, the sequencing would cost about $1B
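The numbers above can be checked with a quick back-of-envelope script. The 10x compression ratio is an assumption implied by the slide's 100 PB figure, not a stated fact:

```python
# Back-of-envelope check of the one-million-genomes figures above.
TB_PER_GENOME = 1.0          # tumor + normal sample, per the slide
GENOMES = 1_000_000
COMPRESSION = 10             # assumed ratio behind the ~100 PB estimate
COST_PER_GENOME = 1_000      # dollars

raw_pb = GENOMES * TB_PER_GENOME / 1_000       # 1 PB = 1000 TB
raw_eb = raw_pb / 1_000                        # 1 EB = 1000 PB
compressed_pb = raw_pb / COMPRESSION
cost_billions = GENOMES * COST_PER_GENOME / 1e9

print(raw_pb, raw_eb, compressed_pb, cost_billions)
# 1000.0 1.0 100.0 1.0
```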
Big data driven discovery on 1,000,000 genomes and 1 EB of data.
[Diagram: genomic-driven diagnosis, an improved understanding of genomic science, and genomic-driven drug development together lead to precision diagnosis and treatment and to preventive health care. Example tumor strata: TNBC, ER+.]
Source: White Lab, University of Chicago.
With genomics, we can stratify diseases and treat each stratum differently.
Clonal Evolution of Tumors
Tumors evolve temporally and spatially.
Source: Mel Greaves & Carlo C. Maley, Clonal evolution in cancer, Nature, Volume 481, pages 306-313, 2012.
Combinations of Rare Alleles
[Diagram: penetrance vs. allele frequency, with frequency running from very rare through rare (0.001) and uncommon (0.01) to common (0.1), and penetrance from low through modest and intermediate to high. Rare, high-penetrance alleles cause Mendelian disease; common variants of modest effect are most of the variants implicated in common disease by GWA; low-frequency variants with intermediate penetrance sit between; there are rare examples of high-penetrance common variants influencing common disease; and rare variants of small effect are very hard to identify by genetic means.]
Source: Mark McCarthy
TCGA Analysis of Lung Cancer
• 178 cases of SQCC (squamous cell carcinoma of the lung)
• Matched tumor & normal
• Mean of 360 exonic mutations, 323 CNV, & 165 rearrangements per tumor
Source: The Cancer Genome Atlas Research Network, Comprehensive genomic characterization of squamous cell lung cancers, Nature, 2012, doi:10.1038/nature11404.
Some Examples of Big Data Science

Discipline       | Duration  | Size          | # Devices
HEP - LHC        | 10 years  | 15 PB/year*   | One large instrument
Astronomy - LSST | 10 years  | 12 PB/year**  | One large instrument
Genomics - NGS   | 2-4 years | 0.5 TB/genome | 1000’s of smaller instruments

*At full capacity, the Large Hadron Collider (LHC), the world's largest particle accelerator, is expected to produce more than 15 million gigabytes of data each year. … This ambitious project connects and combines the IT power of more than 140 computer centres in 33 countries. Source: http://press.web.cern.ch/public/en/Spotlight/SpotlightGrid_081008-en.html
**As it carries out its 10-year survey, LSST will produce over 15 terabytes of raw astronomical data each night (30 terabytes processed), resulting in a database catalog of 22 petabytes and an image archive of 100 petabytes. Source: http://www.lsst.org/News/enews/teragrid-1004.html
Part 2. What Instrument Do we Use to Make Big Data Discoveries?
How do we build a “datascope?”
What is big data?
TB? PB? EB? ZB?
Think of data as big if you measure it in megawatts (MW); for example, Facebook’s Prineville Data Center is 30 MW.
Another way:
opencompute.org
An algorithm and its computing infrastructure are “big-data scalable” if adding a rack (or container) of data (and corresponding processors) lets you do the same computation in the same time, but over more data.
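One way to read this definition: if per-rack throughput is fixed and the work parallelizes across racks, then growing the data and the racks together keeps wall-clock time constant. A minimal sketch (the throughput number is made up):

```python
def wall_clock_hours(data_tb, racks, tb_per_rack_per_hour=50.0):
    """Time for an embarrassingly parallel scan: the data is spread
    evenly across racks and each rack processes only its own share."""
    return data_tb / (racks * tb_per_rack_per_hour)

# Doubling the data *and* the racks leaves the running time unchanged,
# which is the "big-data scalable" property described above.
t1 = wall_clock_hours(data_tb=1_000, racks=10)
t2 = wall_clock_hours(data_tb=2_000, racks=20)
print(t1, t2)  # 2.0 2.0
```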
Commercial Cloud Service Provider (CSP): 15 MW Data Center
• 100,000 servers, 1 PB DRAM, 100’s of PB of disk
• Automatic provisioning and infrastructure management
• Monitoring, network security and forensics
• Accounting and billing; customer-facing portal
• Data center network, ~1 Tbps egress bandwidth
• ~25 operators for a 15 MW commercial cloud
What are some of the important differences between commercial and research-focused CSPs?
Science Clouds
|                | Science CSP                                                                       | Commercial CSP                                                    |
| POV            | Democratize access to data. Integrate data to make discoveries. Long term archive. | As long as you pay the bill; as long as the business model holds. |
| Data & Storage | Data intensive computing & high performance storage                               | Internet-style scale out and object-based storage                 |
| Flows          | Large data flows in and out                                                       | Lots of small web flows                                           |
| Streams        | Streaming processing required                                                     | N/A                                                               |
| Accounting     | Essential                                                                         | Essential                                                         |
| Lock in        | Moving environments between CSPs is essential                                     | Lock-in is good                                                   |
Part 3. The Open Cloud Consortium’s Open Science Data Cloud
www.opencloudconsortium.org
• U.S.-based not-for-profit corporation.
• Manages cloud computing infrastructure to support scientific research: Open Science Data Cloud.
• Manages cloud computing testbeds: Open Cloud Testbed.
Cloud Services Operations Centers (CSOC)
• The OSDC operates a Cloud Services Operations Center (or CSOC).
• It is a CSOC focused on supporting Science Clouds for researchers.
• Compare to Network Operations Center or NOC.
• Both are an important part of cyber infrastructure for big data science.
Different Styles of OSDC Racks
• Design 1: Put cores over spindles. Higher cost, but easy to compute over all the data.
• Design 2: Separate (some of the) storage from the compute.
2012 OSDC rack design (draft): 950 TB / rack, 600 cores / rack
Open Science Data Cloud
• 3 PB in 2011; 10 PB in 2012; able to scale to 100 PB?
• Automatic provisioning and infrastructure management
• Monitoring, compliance, & security
• Accounting and billing (OSDC)
• Customer Facing Portal (Tukey)
• Data center network, ~100 Gbps bandwidth
• 5-12 operators to operate a 1-5 MW science cloud
Science Cloud SW & Services
OSDC Data Stack based upon OpenStack, Hadoop, GlusterFS, UDT, …
OSDC Philosophy
• We try to automate as much as possible (we automate the setup & operations of a rack).
• We try to write as little software as possible.
• Each project is a bit different, but in general:
• We assign (permanent) IDs to data managed by the OSDC and manage the associated metadata.
• We assign and enforce permissions for users & groups of users, and for files/objects, collections of files/objects, and collections of collections.
• We support RESTful interfaces.
• We do accounting for storage and core-hours.
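The ID-plus-permissions model sketched above, with permanent IDs, user/group permissions, and nested collections, could look roughly like this. Names and structure are illustrative, not the actual OSDC implementation:

```python
import uuid

class Item:
    """A file/object or a collection; collections can nest, giving
    collections of collections. Each item gets a permanent ID."""
    def __init__(self, name):
        self.id = str(uuid.uuid4())   # permanent ID assigned at ingest
        self.name = name
        self.acl = {}                 # principal -> set of permissions
        self.children = []            # empty for plain files/objects

    def grant(self, principal, perm):
        self.acl.setdefault(principal, set()).add(perm)

    def allowed(self, user, perm, groups=()):
        # A user may act if they, or any group they belong to, holds perm.
        return any(perm in self.acl.get(p, set()) for p in (user, *groups))

genomes = Item("tumor-normal-pairs")
genomes.grant("group:white-lab", "read")
print(genomes.allowed("alice", "read", groups=("group:white-lab",)))  # True
print(genomes.allowed("bob", "read"))                                 # False
```

In a real system the ACL check would sit behind the RESTful interface, and the same IDs would key the storage and core-hour accounting records.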
Some Of Our Biggest Mistakes
• Not charging for services. This resulted in a lot of bad behavior.
• Trying to support donated equipment without adequate staff.
• Being too optimistic about when big data software would be ready for prime time.
• Some problems with big data software don’t show up below the full scale of the OSDC, but we have only one OSDC, and it is difficult to test at this scale.
Essential Services for a Science CSP
• Support for data intensive computing
• Support for big data flows
• Account management, authentication and authorization services
• Health and status monitoring
• Billing and accounting
• Ability to rapidly provision infrastructure
• Security services, logging, event reporting
• Access to large amounts of public data
• High performance storage
• Simple data export and import services
[Diagram: matching projects to infrastructure by data size and number of projects.
• Small data, 1000’s of projects: individual scientists & small projects on public infrastructure.
• Medium to large data, 100’s of projects: community based science via Science as a Service on shared community infrastructure.
• Very large data, 10’s of projects: very large projects on dedicated infrastructure.]
Part 4. Bionimbus
Bionimbus is a joint project between the Laboratory for Advanced Computing & the White Lab at the University of Chicago.
Step 1. Prepare a sample.
Step 2. Log in to Bionimbus and get a Bionimbus Key.
Step 3. Send your sample to the sequencing center.
Step 4. Log in to Bionimbus and view your data.
Step 5. Use Bionimbus to perform standard and custom pipelines.
Bionimbus can launch multiple virtual machines.
Bionimbus Virtual Machine Releases
• Peak Calling: MAT, MA2C, PeakSeq, MACS, SPP
• Quality Control: various
• Alignment & Genotyping: Bowtie, TopHat, Samtools, Picard
Software Tools: Moving Genomes
Bionimbus Community Genomic Cloud
[Diagram: each researcher gets a personal “dropbox” + compute, connected to a cloud for public data (1K genomes, PubMed, etc.).]
Bionimbus Private Genomic Cloud
[Diagram: each researcher gets a personal “dropbox” & compute, connected to a cloud for public data (1K genomes, PubMed, etc.) and a cloud for controlled data (TCGA, dbGaP).]
Bionimbus Private Biomedical Cloud
[Diagram: each researcher gets a personal “dropbox” plus compute, connected to a cloud for public data (1K genomes, PubMed, etc.), a cloud for controlled data (TCGA, dbGaP), and a cloud for PHI data backed by a clinical research data warehouse that supports scatter, gather queries.]
[Diagram: internal sequencers and an external sequencing partner feed data, via a BID Generator, into the Bionimbus Private Cloud UC, the Bionimbus Community Cloud, the Bionimbus Private Cloud XY, Amazon, and dbGaP.]
Step 1. Get a Bionimbus ID (BID); assign project, private/community/public cloud, etc.
Step 2. Send sample to be sequenced.
Step 3a. Return raw reads.
Step 3b. Return variant calls, CNVs, annotation, …
Step 4. Secure data routing to the appropriate cloud based upon the BID.
Step 5. Cloud-based analysis using IGSB and 3rd party tools and applications.
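The routing step can be sketched as a lookup from the metadata attached to a BID to a destination cloud. The cloud names follow the slides, but the metadata fields, the routing table, and the example record are illustrative assumptions, not the actual Bionimbus schema:

```python
# Illustrative sketch of BID-based routing: data returned by the
# sequencing center is sent to a cloud chosen from the data class
# that was assigned to the BID when the sample was registered.
ROUTES = {
    "community": "Bionimbus Community Cloud",
    "private": "Bionimbus Private Cloud UC",
    "public": "Cloud for Public Data",
}

def route(bid_record):
    """Return the destination cloud for a BID's metadata record."""
    return ROUTES[bid_record["data_class"]]

bid_record = {"bid": "BID-000123", "project": "SQCC", "data_class": "private"}
print(route(bid_record))  # Bionimbus Private Cloud UC
```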
• web2py-based Front End
• Analysis Pipelines & Re-analysis Services
• Database Services (PostgreSQL)
• Data Cloud Services (Hadoop, Sector/Sphere)
• Data Ingestion Services (IDs, etc.)
• Utility Cloud Services (Eucalyptus, OpenStack)
• Intercloud Services (UDT, replication)
• >300 ChIP datasets: chromatin/RNA timecourse, CBP, PolII, Pho/silencers, HDACs, insulators, TFs
• Predictions: 537 silencers, 2,307 new promoters, 12,285 enhancers, 14,145 insulators
www.modencode.org
Negre et al., Nature, 2011
Part 5. Managing One Million Genomes
• Sequence (BAM) files, i.e. sequence data in binary form (100-1000 PB): NoSQL, DFS, file overlays?
• Variation (VCF) files, i.e. genomic variation (1-10 PB): NoSQL & scientific databases
• Summary level data (10-100 TB), enriched with clinical data: relational databases
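The three tiers suggest dispatching each file to a storage backend by its type. A toy sketch follows; the backend names echo the slide, but the dispatch function and the use of a `.csv` extension for summary-level data are assumptions for illustration:

```python
from pathlib import Path

# Illustrative mapping of the three data tiers above to backends.
TIERS = {
    ".bam": "DFS / file overlay",           # 100-1000 PB of raw reads
    ".vcf": "NoSQL / scientific database",  # 1-10 PB of variant calls
    ".csv": "relational database",          # 10-100 TB of summary data
}

def storage_tier(filename):
    """Pick a storage backend from the file's extension."""
    suffix = Path(filename).suffix
    if suffix not in TIERS:
        raise ValueError(f"no storage tier for {filename}")
    return TIERS[suffix]

print(storage_tier("patient-0001.tumor.bam"))  # DFS / file overlay
```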
Acknowledgements
Major funding and support for the Open Science Data Cloud is provided by the Gordon and Betty Moore Foundation, which has provided $2M of funding to the OSDC to launch Phase 1 of the project (2011-2014). Moore Foundation funding is used to support the OSDC-Adler, Sullivan and Root facilities. Additional funding for the OSDC has been provided by the following sponsors:
• The OCC-Y Hadoop Cluster (approximately 1000 cores and 1 PB of storage) was donated by Yahoo! in 2011.
• Cisco provides the OSDC access to the Cisco C-Wave, which connects OSDC data centers with 10 Gbps wide area networks.
• NSF awarded the OSDC a 5-year (2010-2016) $3.5M PIRE award to train scientists to use the OSDC and to further develop the underlying technology.
• OSDC technology for high performance data transport is supported in part by NSF Award 1127316.
• The StarLight Facility in Chicago enables the OSDC to connect to over 30 high performance research networks around the world at 10 Gbps or higher, with an increasing number of 100 Gbps connections.
The OSDC is managed by the Open Cloud Consortium, a 501(c)(3) not-for-profit corporation. If you are interested in providing funding or donating equipment or services, please contact us at [email protected].
For more information
• You can find some more information on my blog:
rgrossman.com.
• Some of my technical papers are also available there.
• My email address is robert.grossman at uchicago dot edu
Sources for images
• The image of the hard disk is from Norlando Pobre, Creative Commons.
• The image of the Facebook Prineville Data Center is from the Intel Free Press, www.flickr.com/photos/intelfreepress/6722296855/, Creative Commons BY 2.0.
• The image of the LHC is from Conrad Melvin, Creative Commons BY-SA 2.0, www.flickr.com/photos/58220828@N07/5350788732