Introduction to Big Data and Science Clouds (Chapter 1, SC11 Tutorial)


An Introduction to Data Intensive Computing

Chapter 1: Introduction

Robert Grossman, University of Chicago, Open Data Group

Collin Bennett, Open Data Group

November 14, 2011

1. Introduction (0830-0900)
   a. Data clouds (e.g. Hadoop)
   b. Utility clouds (e.g. Amazon)

2. Managing Big Data (0900-0945)
   a. Databases
   b. Distributed file systems (e.g. Hadoop)
   c. NoSQL databases (e.g. HBase)

3. Processing Big Data (0945-1000 and 1030-1100)
   a. Multiple virtual machines and message queues
   b. MapReduce
   c. Streams over distributed file systems

4. Lab using Amazon's Elastic MapReduce (1100-1200)

 

For the most current version of these notes, please see rgrossman.com.

Our perspective is to consider data intensive computing from the viewpoint of utility and data clouds.

Section 1.1 Data Intensive Science


[Image: two of the 14 high-throughput sequencers at the Ontario Institute for Cancer Research (OICR).]

Moore's law also applies to the instruments that are producing data. This is creating new paradigms: "data intensive science" and "data intensive computing."

Source:  Lincoln  Stein  

Data is Big If It is Measured in MW

• Data is big if you measure it in megawatts.

• As in, a good sweet spot for a data center is 15 MW.

• As in, Facebook's leased data centers are typically between 2.5 MW and 6.0 MW.

• Facebook's new Prineville data center is 30 MW.

• Google's computing infrastructure uses 260 MW.

Some Big Data Sciences

Discipline         Duration    Size            # Devices
HEP - LHC          10 years    15 PB/year*     One
Astronomy - LSST   10 years    12 PB/year**    One
Genomics - NGS     2-4 years   0.4 TB/genome   1000's

*At full capacity, the Large Hadron Collider (LHC), the world's largest particle accelerator, is expected to produce more than 15 million gigabytes of data each year. … This ambitious project connects and combines the IT power of more than 140 computer centres in 33 countries. Source: http://press.web.cern.ch/public/en/Spotlight/SpotlightGrid_081008-en.html

**As it carries out its 10-year survey, LSST will produce over 15 terabytes of raw astronomical data each night (30 terabytes processed), resulting in a database catalog of 22 petabytes and an image archive of 100 petabytes. Source: http://www.lsst.org/News/enews/teragrid-1004.html

An algorithm and computing infrastructure is "big-data scalable" if adding a rack of data (and corresponding processors) does not increase the time required to complete the computation but increases the amount of data that can be processed.

Add capacity with constant time (ACCT).
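To make the ACCT idea concrete, here is a tiny Python sketch. It assumes perfectly linear scaling and uses a made-up per-rack throughput, so it illustrates the definition rather than measuring any real system.

    # A minimal sketch of "add capacity with constant time" (ACCT).
    # The per-rack rate below is a made-up, illustrative number.
    TB_PER_RACK_PER_HOUR = 10.0

    def completion_time(data_tb, racks):
        """Hours to process data_tb terabytes spread evenly over racks,
        assuming perfectly linear (big-data scalable) behaviour."""
        return data_tb / (TB_PER_RACK_PER_HOUR * racks)

    print(completion_time(100, 10))   # 1.0 hour: 100 TB on 10 racks
    print(completion_time(200, 20))   # 1.0 hour: 200 TB on 20 racks

Doubling both the data and the racks leaves the completion time unchanged, which is the ACCT property; an infrastructure that is not big-data scalable would show the time growing as racks and data are added.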

Section 1.2 What's New with Clouds?


The Term 'In the Cloud' is Annoying

• "Personally, I find the term 'in the cloud' pretentious and annoying. … the world's marketers and P.R. people seem to think that 'the cloud' just means 'online.'" David Pogue, NYT, June 16, 2011.

• More specifically, he notes that you can think of the cloud as "data and application software stored on remote servers [and accessed via the Internet]."

Utility Clouds

Infrastructure as a Service (IaaS). [Image: Amazon data center.]

Data Clouds

Large data cloud services, e.g. for ad targeting. [Image: Yahoo data center.]

Virtualization

[Diagram: without virtualization, applications run on a single OS on one computer; with virtualization, several App/OS pairs run side by side on a hypervisor on one computer.]

Idea Dates Back to the 1960s

• Virtualization was first widely deployed with IBM VM/370.

[Diagram: an IBM mainframe running IBM VM/370, hosting guest systems such as CMS and MVS, each with its own applications.]

Native (full) virtualization. Examples: VMware ESX.

Scale is New

Usage Based Pricing Is New

1 computer in a rack for 120 hours costs the same as 120 computers in three racks for 1 hour.
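For example, at a hypothetical rate of $0.10 per instance-hour, both options come to 120 instance-hours, or about $12, but the second option finishes roughly 120 times sooner.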

Simplicity is New

… and you have a computer ready to work.

A new programmer can develop a program to process a container full of data with less than a day of training using MapReduce (a small word-count sketch follows below).

Elastic, on-demand provisioning.
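As a concrete illustration of that claim, here is a minimal word-count job written as a mapper and a reducer for Hadoop Streaming. The language (Python), the file names, and the paths are illustrative choices; the slides do not prescribe a particular language or framework.

    #!/usr/bin/env python
    # mapper.py -- emit (word, 1) for every word read from stdin.
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t%d" % (word, 1))

    #!/usr/bin/env python
    # reducer.py -- sum counts per word; Hadoop sorts mapper output by
    # key, so all lines for a given word arrive together.
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print("%s\t%d" % (current_word, current_count))
            current_word, current_count = word, int(count)
    if current_word is not None:
        print("%s\t%d" % (current_word, current_count))

The pair can be launched with the Hadoop Streaming jar (hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input <input dir> -output <output dir>, with the paths filled in for your cluster), or tested locally with cat input.txt | python mapper.py | sort | python reducer.py.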

Section 1.4 Utility Clouds

[Diagram: for IaaS, PaaS, and SaaS, the stack of applications, frameworks, and virtual machines runs on the hypervisor and network, with the split between the customer's responsibility and the cloud service provider's responsibility differing across the three service models.]

Amazon Style Data Cloud

[Diagram: a load balancer in front of pools of EC2 instances, alongside the S3 storage service, the Simple Queue Service (SQS), and SimpleDB (SDB).]
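These services can also be driven programmatically. Below is a minimal sketch using boto3, a later Python SDK for AWS that the slides do not mention; the bucket and queue names are placeholders, and AWS credentials are assumed to be configured in the environment.

    # Store an input object in S3 and enqueue a pointer to it in SQS
    # so that worker EC2 instances polling the queue can pick it up.
    import boto3

    s3 = boto3.client("s3")
    sqs = boto3.client("sqs")

    # Put a (toy) input file into the S3 storage service.
    s3.put_object(Bucket="example-input-bucket",
                  Key="logs/part-0001.txt",
                  Body=b"GET /index.html 200\n")

    # Hand the work item to the Simple Queue Service.
    queue_url = sqs.get_queue_url(QueueName="example-work-queue")["QueueUrl"]
    sqs.send_message(QueueUrl=queue_url,
                     MessageBody="s3://example-input-bucket/logs/part-0001.txt")

The design mirrors the diagram: S3 holds the data, and SQS decouples producers from the pool of EC2 worker instances behind the load balancer.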

NIST Definition

• Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources that can be rapidly provisioned and released with minimal management effort or service provider interaction.

Essential Characteristics
• On-demand / self-service
• Broad network access
• Resource pooling
• Rapid elasticity
• Measured service

Service Models
• Software as a Service (SaaS): the consumer runs the provider's applications on cloud infrastructure.
• Platform as a Service (PaaS): the consumer runs consumer-created applications on the cloud using tools supported by the provider.
• Infrastructure as a Service (IaaS): the consumer uses the provider's processing, storage, and networks.

Deployment Models
• Private
• Community
• Public
• Hybrid


Section 1.5 Data Clouds

Google's Large Data Cloud

Google's stack:
• Applications
• Compute services: Google's MapReduce
• Data services: Google's BigTable
• Storage services: Google File System (GFS)

Hadoop's Large Data Cloud

Hadoop's stack:
• Applications
• Compute services: Hadoop's MapReduce
• Data services: NoSQL databases (e.g. HBase)
• Storage services: Hadoop Distributed File System (HDFS)
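As a small illustration of the data-services layer, the sketch below writes to and reads from an HBase table using happybase, one common Python client that talks to HBase through its Thrift gateway. The host name, table, and column family are made-up examples, and the table is assumed to already exist.

    # Minimal HBase read/write via happybase (hostname and table are
    # placeholders; the 'content' column family is assumed to exist).
    import happybase

    conn = happybase.Connection("hbase-thrift-host")
    table = conn.table("web_pages")

    # Store one row, keyed by a (reversed-domain) URL.
    table.put(b"com.example/index.html",
              {b"content:html": b"<html>...</html>"})

    # Read the row back.
    row = table.row(b"com.example/index.html")
    print(row[b"content:html"])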

Questions?