Introduction to Big Data and Science Clouds (Chapter 1, SC11 Tutorial)
TRANSCRIPT
An Introduction to Data Intensive Computing
Chapter 1: Introduction
Robert Grossman, University of Chicago and Open Data Group
Collin Bennett, Open Data Group
November 14, 2011
1. Introduction (0830-0900)
   a. Data clouds (e.g. Hadoop)
   b. Utility clouds (e.g. Amazon)
2. Managing Big Data (0900-0945)
   a. Databases
   b. Distributed file systems (e.g. Hadoop)
   c. NoSQL databases (e.g. HBase)
3. Processing Big Data (0945-1000 and 1030-1100)
   a. Multiple virtual machines & message queues
   b. MapReduce
   c. Streams over distributed file systems
4. Lab using Amazon's Elastic MapReduce (1100-1200)
For the most current version of these notes, please see:
rgrossman.com
Our perspective is to consider data intensive computing from the viewpoint of utility and data clouds.
Section 1.1 Data Intensive Science
Two of the 14 high throughput sequencers at the Ontario Institute for Cancer Research (OICR).
Moore's law also applies to the instruments that are producing data. This is creating new paradigms: "data intensive science" and "data intensive computing."
Data is Big If It is Measured in MW
• Data is big if you measure it in megawatts.
• As in, a good sweet spot for a data center is 15 MW.
• As in, Facebook's leased data centers are typically between 2.5 MW and 6.0 MW.
• Facebook's new Prineville data center is 30 MW.
• Google's computing infrastructure uses 260 MW.
Some Big Data Sciences

Discipline       | Duration  | Size          | # Devices
HEP - LHC        | 10 years  | 15 PB/year*   | One
Astronomy - LSST | 10 years  | 12 PB/year**  | One
Genomics - NGS   | 2-4 years | 0.4 TB/genome | 1000's

*At full capacity, the Large Hadron Collider (LHC), the world's largest particle accelerator, is expected to produce more than 15 million gigabytes of data each year. … This ambitious project connects and combines the IT power of more than 140 computer centres in 33 countries. Source: http://press.web.cern.ch/public/en/Spotlight/SpotlightGrid_081008-en.html
**As it carries out its 10-year survey, LSST will produce over 15 terabytes of raw astronomical data each night (30 terabytes processed), resulting in a database catalog of 22 petabytes and an image archive of 100 petabytes. Source: http://www.lsst.org/News/enews/teragrid-1004.html
An algorithm and computing infrastructure is "big-data scalable" if adding a rack of data (and corresponding processors) does not increase the time required to complete the computation but increases the amount of data that can be processed.
Add capacity with constant time (ACCT)
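The ACCT property can be sketched numerically. The sketch below is illustrative only (the per-rack capacity and processing rate are assumed figures, not from the tutorial): because each added rack brings both its own data and its own processors, total data grows while completion time stays flat.

```python
# Illustrative sketch of "add capacity with constant time" (ACCT).
# Assumed numbers: each rack holds 100 TB and processes 10 TB/hour.
DATA_PER_RACK_TB = 100
RATE_PER_RACK_TB_PER_HR = 10

def completion_time_hours(racks: int) -> float:
    """Time to process all data when racks hold both data and processors."""
    total_data = racks * DATA_PER_RACK_TB
    total_rate = racks * RATE_PER_RACK_TB_PER_HR
    return total_data / total_rate

for racks in (1, 10, 100):
    # Data processed grows 100x, but time stays 10.0 hours.
    print(racks, "racks:", completion_time_hours(racks), "hours")
```

A system that is not big-data scalable behaves differently: its processing rate stays fixed while the data grows, so completion time rises linearly with each rack of data added.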
The Term 'In the Cloud' is Annoying
• "Personally, I find the term 'in the cloud' pretentious and annoying. … the world's marketers and P.R. people seem to think that 'the cloud' just means 'online.'" David Pogue, NYT, June 16, 2011.
• More specifically, he notes that you can think of the cloud as "data and application software stored on remote servers [and accessed via the Internet]."
Idea Dates Back to the 1960s
• Virtualization first widely deployed with IBM VM/370.
[Diagram: native (full) virtualization on an IBM mainframe. IBM VM/370 hosts multiple guests, such as CMS instances running apps and MVS, each seeing its own virtual machine. A modern example of native (full) virtualization: VMware ESX.]
Usage Based Pricing Is New
1 computer in a rack for 120 hours costs the same as 120 computers in three racks for 1 hour.
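The arithmetic behind this equivalence can be sketched in a few lines, assuming a hypothetical flat hourly rate (the $0.10 per instance-hour figure below is illustrative, not an actual Amazon price):

```python
HOURLY_RATE = 0.10  # hypothetical price per instance-hour, in dollars

def cost(instances: int, hours: float) -> float:
    """Usage-based pricing: you pay for instance-hours, however you slice them."""
    return instances * hours * HOURLY_RATE

print(cost(1, 120))   # 1 computer for 120 hours  -> 12.0
print(cost(120, 1))   # 120 computers for 1 hour  -> 12.0
```

The point is that the bill depends only on the product instances × hours, so a user can trade elapsed time for parallelism at no extra cost.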
Simplicity is New
… and you have a computer ready to work.
A new programmer can develop a program to process a container full of data with less than a day of training using MapReduce.
Elastic, on-demand provisioning.
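To make the MapReduce claim concrete, here is a minimal word-count sketch in plain Python. It only mimics the programming model (a map phase emitting key/value pairs, a shuffle grouping by key, and a reduce phase aggregating each group); a real job would run on Hadoop or Amazon Elastic MapReduce across many machines rather than in one process.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)

def reduce_phase(pairs):
    # Shuffle: group pairs by key. Reduce: sum the counts per word.
    for word, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

lines = ["big data big clouds", "data clouds"]
counts = dict(reduce_phase(map_phase(lines)))
print(counts)  # {'big': 2, 'clouds': 2, 'data': 2}
```

The programmer only writes the map and reduce functions; the framework handles partitioning the input, shuffling intermediate pairs, and recovering from machine failures, which is why the learning curve is so short.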
[Diagram: the IaaS, PaaS, and SaaS service models shown as three stacks. Each stack layers hypervisor and network at the bottom, then VMs, frameworks, and apps on top. The boundary between the customer's responsibility and the cloud service provider's responsibility moves up the stack from IaaS to SaaS.]
Amazon Style Data Cloud
[Diagram: a load balancer in front of banks of EC2 instances, backed by S3 storage services, the Simple Queue Service (SQS), and SimpleDB (SDB).]
NIST Definition
• Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources that can be rapidly provisioned and released with minimal management effort or service provider interaction.
Essential Characteristics • On-demand / self-service • Broad network access • Resource pooling • Rapid elasticity • Measured service
Service Models • Software as a Service (SaaS) – consumer runs provider’s applications on cloud infrastructure • Platform as a Service (PaaS) – consumer runs consumer-created applications on the cloud using tools supported by provider • Infrastructure as a Service (IaaS) – consumer uses provider’s processing, storage, and networks
Deployment Models • Private • Community • Public • Hybrid
Google's Large Data Cloud
Google's stack:
• Applications
• Compute services: Google's MapReduce
• Data services: Google's BigTable
• Storage services: Google File System (GFS)
Hadoop's Large Data Cloud
Hadoop's stack:
• Applications
• Compute services: Hadoop's MapReduce
• Data services: NoSQL databases (e.g. HBase)
• Storage services: Hadoop Distributed File System (HDFS)
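The "data services" layer in both stacks exposes a table model quite different from a relational database. As a rough sketch (an in-memory toy, not the real HBase or BigTable API), such a table maps a (row key, column, timestamp) triple to a value, and a read returns the most recent version of a cell:

```python
# Toy model of a BigTable/HBase-style table: not a real client API,
# just the (row, column, timestamp) -> value data model in a dict.
class ToyBigTable:
    def __init__(self):
        self.cells = {}  # (row_key, column) -> {timestamp: value}

    def put(self, row_key, column, timestamp, value):
        # Writes add a new timestamped version rather than overwriting.
        self.cells.setdefault((row_key, column), {})[timestamp] = value

    def get(self, row_key, column):
        # Reads return the most recent version of the cell, if any.
        versions = self.cells.get((row_key, column), {})
        return versions[max(versions)] if versions else None

t = ToyBigTable()
t.put("com.cnn.www", "contents:html", 1, "<html>v1</html>")
t.put("com.cnn.www", "contents:html", 2, "<html>v2</html>")
print(t.get("com.cnn.www", "contents:html"))  # <html>v2</html>
```

Keeping multiple timestamped versions per cell and sorting rows by key is what lets these systems serve huge, sparse tables directly on top of a distributed file system like GFS or HDFS.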