Introduction to Big Data and Science Clouds (Chapter 1, SC11 Tutorial)
TRANSCRIPT
An Introduction to Data Intensive Computing
Chapter 1: Introduction
Robert Grossman, University of Chicago and Open Data Group
Collin Bennett, Open Data Group
November 14, 2011
1. Introduction (0830-0900)
   a. Data clouds (e.g. Hadoop)
   b. Utility clouds (e.g. Amazon)
2. Managing Big Data (0900-0945)
   a. Databases
   b. Distributed file systems (e.g. Hadoop)
   c. NoSQL databases (e.g. HBase)
3. Processing Big Data (0945-1000 and 1030-1100)
   a. Multiple virtual machines & message queues
   b. MapReduce
   c. Streams over distributed file systems
4. Lab using Amazon's Elastic MapReduce (1100-1200)
For the most current version of these notes, please see:
rgrossman.com
Our perspective is to consider data intensive computing from the viewpoint of utility and data clouds.
Section 1.1 Data Intensive Science
Two of the 14 high throughput sequencers at the Ontario Institute for Cancer Research (OICR).
Moore's law also applies to the instruments that are producing data. This is creating new paradigms: "data intensive science" and "data intensive computing."
Data is Big If It is Measured in MW
• Data is big if you measure it in megawatts.
• As in, a good sweet spot for a data center is 15 MW.
• As in, Facebook's leased data centers are typically between 2.5 MW and 6.0 MW.
• Facebook's new Prineville data center is 30 MW.
• Google's computing infrastructure uses 260 MW.
Some Big Data Sciences

Discipline       | Duration  | Size          | # Devices
HEP - LHC        | 10 years  | 15 PB/year*   | One
Astronomy - LSST | 10 years  | 12 PB/year**  | One
Genomics - NGS   | 2-4 years | 0.4 TB/genome | 1000's

*At full capacity, the Large Hadron Collider (LHC), the world's largest particle accelerator, is expected to produce more than 15 million gigabytes of data each year. … This ambitious project connects and combines the IT power of more than 140 computer centres in 33 countries. Source: http://press.web.cern.ch/public/en/Spotlight/SpotlightGrid_081008-en.html
**As it carries out its 10-year survey, LSST will produce over 15 terabytes of raw astronomical data each night (30 terabytes processed), resulting in a database catalog of 22 petabytes and an image archive of 100 petabytes. Source: http://www.lsst.org/News/enews/teragrid-1004.html
An algorithm and computing infrastructure is "big-data scalable" if adding a rack of data (and corresponding processors) does not increase the time required to complete the computation but increases the amount of data that can be processed.
Add capacity with constant time (ACCT)
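The ACCT property can be sketched numerically. The sketch below is illustrative only (the per-rack capacity and processing rate are assumed figures, not from the tutorial): because each added rack brings both its own data and its own processors, total data grows while completion time stays flat.

```python
# Illustrative sketch of "add capacity with constant time" (ACCT).
# Assumed numbers: each rack holds 100 TB and processes 10 TB/hour.
DATA_PER_RACK_TB = 100
RATE_PER_RACK_TB_PER_HR = 10

def completion_time_hours(racks: int) -> float:
    """Time to process all data when racks hold both data and processors."""
    total_data = racks * DATA_PER_RACK_TB
    total_rate = racks * RATE_PER_RACK_TB_PER_HR
    return total_data / total_rate

for racks in (1, 10, 100):
    # Data processed grows 100x, but time stays 10.0 hours.
    print(racks, "racks:", completion_time_hours(racks), "hours")
```

A system that is not big-data scalable behaves differently: its processing rate stays fixed while the data grows, so completion time rises linearly with each rack of data added.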
The Term 'In the Cloud' is Annoying
• "Personally, I find the term 'in the cloud' pretentious and annoying. … the world's marketers and P.R. people seem to think that 'the cloud' just means 'online.'" David Pogue, NYT, June 16, 2011.
• More specifically, he notes that you can think of the cloud as "data and application software stored on remote servers [and accessed via the Internet]."
Idea Dates Back to the 1960s
• Virtualization first widely deployed with IBM VM/370.
[Diagram: native (full) virtualization on an IBM mainframe. IBM VM/370 hosts multiple guests, such as CMS instances running apps and MVS, each seeing its own virtual machine. A modern example of native (full) virtualization: VMware ESX.]
Usage Based Pricing Is New
1 computer in a rack for 120 hours costs the same as 120 computers in three racks for 1 hour.
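The arithmetic behind this equivalence can be sketched in a few lines, assuming a hypothetical flat hourly rate (the $0.10 per instance-hour figure below is illustrative, not an actual Amazon price):

```python
HOURLY_RATE = 0.10  # hypothetical price per instance-hour, in dollars

def cost(instances: int, hours: float) -> float:
    """Usage-based pricing: you pay for instance-hours, however you slice them."""
    return instances * hours * HOURLY_RATE

print(cost(1, 120))   # 1 computer for 120 hours  -> 12.0
print(cost(120, 1))   # 120 computers for 1 hour  -> 12.0
```

The point is that the bill depends only on the product instances × hours, so a user can trade elapsed time for parallelism at no extra cost.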
Simplicity is New
… and you have a computer ready to work.
A new programmer can develop a program to process a container full of data with less than a day of training using MapReduce.
Elastic, on-demand provisioning.
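To make the MapReduce claim concrete, here is a minimal word-count sketch in plain Python. It only mimics the programming model (a map phase emitting key/value pairs, a shuffle grouping by key, and a reduce phase aggregating each group); a real job would run on Hadoop or Amazon Elastic MapReduce across many machines rather than in one process.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)

def reduce_phase(pairs):
    # Shuffle: group pairs by key. Reduce: sum the counts per word.
    for word, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

lines = ["big data big clouds", "data clouds"]
counts = dict(reduce_phase(map_phase(lines)))
print(counts)  # {'big': 2, 'clouds': 2, 'data': 2}
```

The programmer only writes the map and reduce functions; the framework handles partitioning the input, shuffling intermediate pairs, and recovering from machine failures, which is why the learning curve is so short.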
[Diagram: the IaaS, PaaS, and SaaS service models shown as three stacks. Each stack layers hypervisor and network at the bottom, then VMs, frameworks, and apps on top. The boundary between the customer's responsibility and the cloud service provider's responsibility moves up the stack from IaaS to SaaS.]
Amazon Style Data Cloud
[Diagram: a load balancer in front of banks of EC2 instances, backed by S3 storage services, the Simple Queue Service (SQS), and SimpleDB (SDB).]
NIST Definition
• Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources that can be rapidly provisioned and released with minimal management effort or service provider interaction.
Essential Characteristics • On-demand / self-service • Broad network access • Resource pooling • Rapid elasticity • Measured service
Service Models • Software as a Service (SaaS) – consumer runs provider’s applications on cloud infrastructure • Platform as a Service (PaaS) – consumer runs consumer-created applications on the cloud using tools supported by provider • Infrastructure as a Service (IaaS) – consumer uses provider’s processing, storage, and networks
Deployment Models • Private • Community • Public • Hybrid
Google's Large Data Cloud
Google's stack:
• Applications
• Compute services: Google's MapReduce
• Data services: Google's BigTable
• Storage services: Google File System (GFS)
Hadoop's Large Data Cloud
Hadoop's stack:
• Applications
• Compute services: Hadoop's MapReduce
• Data services: NoSQL databases (e.g. HBase)
• Storage services: Hadoop Distributed File System (HDFS)
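The "data services" layer in both stacks exposes a table model quite different from a relational database. As a rough sketch (an in-memory toy, not the real HBase or BigTable API), such a table maps a (row key, column, timestamp) triple to a value, and a read returns the most recent version of a cell:

```python
# Toy model of a BigTable/HBase-style table: not a real client API,
# just the (row, column, timestamp) -> value data model in a dict.
class ToyBigTable:
    def __init__(self):
        self.cells = {}  # (row_key, column) -> {timestamp: value}

    def put(self, row_key, column, timestamp, value):
        # Writes add a new timestamped version rather than overwriting.
        self.cells.setdefault((row_key, column), {})[timestamp] = value

    def get(self, row_key, column):
        # Reads return the most recent version of the cell, if any.
        versions = self.cells.get((row_key, column), {})
        return versions[max(versions)] if versions else None

t = ToyBigTable()
t.put("com.cnn.www", "contents:html", 1, "<html>v1</html>")
t.put("com.cnn.www", "contents:html", 2, "<html>v2</html>")
print(t.get("com.cnn.www", "contents:html"))  # <html>v2</html>
```

Keeping multiple timestamped versions per cell and sorting rows by key is what lets these systems serve huge, sparse tables directly on top of a distributed file system like GFS or HDFS.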