anthony joseph

5/10/2012

AMPLab Overview

A Berkeley View of Big Data

Anthony D. Joseph, UC Berkeley

EDUSERV Symposium, 10 May 2012

Who Am I?

• Research:

– Internet-scale systems (RAD Lab, AMP Lab)

– Security (DETERlab Testbed)

– Adversarial machine learning (SecML)

• Teaching (undergrad/grad): operating systems and systems, security, networking

Disclaimer: I don’t speak for UC or our research sponsors



Big Data is Massive…

• Facebook:

– 130TB/day: user logs

– 200-400TB/day: 83 million pictures

– >40 Billion photos

• Google: > 25 PB/day processed data

• Data generated by LHC: 1 PB/sec

• Total data created in 2010: 1 ZettaByte (1,000,000 PB)/year

– ~60% increase every year

…and Diverse…

• Walmart

– >1 million customer transactions/hr

– >2.5 PByte customer DB

• Human genome sequencing

– Analyzing 3 billion base pairs

– Ten years for first one (2003)

– Today, less than one week



…and Novel…

• Analyzing data from user behavior vs user input

• USGS TED

– Twitter-based Earthquake Detector

• Google Trends: “nowcasting”

– http://www.google.org/flutrends/

– US 2009 “Cash for Clunkers” program success

– US State unemployment rates


…and Grows Bigger…

• More and more devices

• More and more people

• Cheaper and cheaper storage

– ~50% increase in GB/$ every year



…and Bigger!

• Log everything!

– Don’t always know what question you’ll need to answer

• Stored data growing faster than both available storage and GB/$
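
As a rough illustration of that gap, the sketch below compounds the two growth rates quoted in this talk (~60% more data created per year, ~50% more GB/$ per year). Treating both as constant annual rates is an assumption, and the function name is made up for illustration.

```python
def relative_storage_cost(years, data_growth=0.60, gb_per_dollar_growth=0.50):
    """Cost of storing one year's newly created data, relative to year 0.

    Assumes constant annual growth rates (an illustrative simplification).
    """
    data = (1 + data_growth) ** years                     # data volume vs. year 0
    gb_per_dollar = (1 + gb_per_dollar_growth) ** years   # storage per dollar vs. year 0
    return data / gb_per_dollar


for y in (1, 5, 10):
    print(y, round(relative_storage_cost(y), 2))
```

Under these assumptions the ratio grows every year, so the bill for keeping everything rises even though each gigabyte gets cheaper.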


Which Big Data to Keep?

• Hard to decide what to delete

– Thankless decision: people know only when you are wrong!

– “Climate Research Unit (CRU) scientists admit they threw away key data used in global warming calculations”



Data Retention Requirements

• New NSF data retention requirements

– Proposals submitted after 18 January 2011 must include a “Data Management Plan”

– Have to keep all data (including metadata) for 3 years after research award conclusion

– Institutional/org considerations:

• Opportunity to invest in pooled storage: campus, systemwide, regional, …

• Typical cost: 8TB chunks at $1.44/GB/year collaborative space and $0.17/GB/year for archive space


Big Data Isn’t Always Big

• You don’t need to be big to have a big data problem!

– Inadequate tools to analyze data

– Data management may dominate infrastructure cost

Data that is expensive to manage, and hard to extract value from


Big Data is not Cheap!

• Storing and managing 1PB data: $500K-$1M/year

– Facebook: 200 PB/year

• “Typical” cloud-based service startup (e.g., Conviva)

– Log storage dominates infrastructure cost

[Figure: share of infrastructure cost (0%-100%) by year, 2007-2010, storage cluster vs. other; ~1PB storage capacity]
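
To put the per-petabyte figure in perspective, here is a small arithmetic sketch. It applies the $500K-$1M/PB/year range quoted on this slide to a given volume; applying that range directly to Facebook's 200 PB is an assumption made for illustration, and the function name is hypothetical.

```python
def annual_cost_range(petabytes, low_per_pb=500_000, high_per_pb=1_000_000):
    """Return (low, high) yearly cost in dollars for storing and managing `petabytes` PB."""
    return petabytes * low_per_pb, petabytes * high_per_pb


low, high = annual_cost_range(200)
print(f"${low / 1e6:.0f}M-${high / 1e6:.0f}M per year")  # prints "$100M-$200M per year"
```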

Hard to Extract Value from Data!

• Data is

– Diverse, variety of sources

– Uncurated, no schema, inconsistent semantics, syntax

– Integration a huge challenge

• No easy way to get answers that are

– High-quality

– Timely

• Challenge: maximize value from data by getting best possible answers



Requires Multifaceted Approach

• Three dimensions to improve data analysis

– Improving scale, efficiency, and quality of algorithms (Algorithms)

– Scaling up datacenters (Machines)

– Leverage human activity and intelligence (People)

• Need to adaptively and flexibly combine all three dimensions


The State of the Art

[Figure: solution space with three dimensions (Algorithms, Machines, People); example points: search, Watson/IBM]

• Today’s apps: fixed point in solution space

Need techniques to dynamically pick best operating point


What Is the Big Data Problem?

• For two main reasons:

– the more data the greater chance to find any pattern you’d like to find

• the more rows in a table, the more columns

• the more columns, the more hypotheses that can be considered

• indeed, the number of hypotheses grows exponentially in the number of columns

– the more data the less likely a sophisticated ML algorithm will run in an acceptable time frame

• and then we have to back off to cheaper algorithms that may be more error-prone
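
The first reason can be made concrete with a small simulation. The sketch below is an illustration under stated assumptions, not something from the talk: every column is pure random noise, a "hypothesis" is taken to be any subset of columns, and the function names are hypothetical. It shows the strongest correlation found by chance as more unrelated columns are tried.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def n_column_subsets(n_cols):
    """Number of column subsets, each treated here as one candidate hypothesis."""
    return 2 ** n_cols


def strongest_spurious_correlation(n_rows, n_cols, seed=0):
    """Largest |correlation| between a random target and n_cols unrelated random columns."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_rows)]
    best = 0.0
    for _ in range(n_cols):
        col = [rng.gauss(0, 1) for _ in range(n_rows)]
        best = max(best, abs(pearson(target, col)))
    return best


for cols in (1, 10, 100, 1000):
    print(cols, round(strongest_spurious_correlation(50, cols), 2))
```

None of the columns carries any signal, yet the best match can only get stronger as more columns are searched.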

A Formulation of the Problem

• Given an inferential goal and a fixed computational budget, provide a guarantee (supported by an algorithm and an analysis) that the quality of inference will increase monotonically as data accrue (without bound)

– This is far from being achieved in the current state of the literature!

• It can be achieved by building a scalable system that blends statistical and computational design principles
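
One way to write the goal above in symbols is sketched below. The notation (an estimate indexed by data size n and budget t, a loss, and a risk R) is an assumption introduced here for illustration and does not come from the talk.

```latex
% \hat{\theta}_{n,t}: estimate computed from n data points within computational budget t
% \ell: a loss measuring inference quality; \theta: the true quantity of interest
R(n, t) = \mathbb{E}\left[\ell\big(\hat{\theta}_{n,t}, \theta\big)\right],
\qquad
R(n + 1, t) \le R(n, t) \quad \text{for all } n \ge 1 \text{ and fixed } t .
```

Read this way, "quality increases monotonically" means the risk never goes up when more data arrives, even though the budget t stays fixed.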



Big Data in the US

• Many Fortune 1000+ companies with huge write once, read none big data collections

– For all the reasons I’ve already outlined…

• US Government agencies in same situation

– New R&D funding

• Many companies developing proprietary solutions

• Very active open source big data tools community

– Broad international participation

– Data Without Borders helping non-profits through pro bono data collection, analysis, and visualization


Significant USG Investment

• 29 March 2012

– US federal agencies announced more than $200 million in new commitments

– Dept of Defense, Dept of Homeland Security, Dept of Energy, Veterans Administration, Office of Scientific and Technical Information, Health and Human Services, Food and Drug Admin, National Archives & Records Admin, National Aeronautics & Space Admin, National Institutes of Health, National Science Foundation, National Security Agency, US Geological Survey


Active Open Source Community

• On-going development of several elements of Big Data analysis pipeline

• Apache Hadoop (MapReduce)

• Hive

• Apache Pig

• R / Octave

• Much more is needed!

• E.g., new analysis environments


The AMP Lab

[Figure: the Algorithms, Machines, People solution space, with search and Watson/IBM as example points]

Make sense of data at scale by tightly integrating algorithms, machines, and people


AMP Faculty and Sponsors

• Faculty

– Alex Bayen (mobile sensing platforms)

– Armando Fox (systems)

– Michael Franklin (databases): Director

– Michael Jordan (machine learning): Co-director

– Anthony Joseph (security & privacy)

– Randy Katz (systems)

– David Patterson (systems)

– Ion Stoica (systems): Co-director

– Scott Shenker (networking)

• Sponsors:


Algorithms

• State-of-the-art Machine Learning (ML) algorithms do not scale

– Prohibitive to process all data points

How do you know when to stop?

[Figure: estimate vs. # of data points, compared with the true answer]


Algorithms

• Given any problem, data and a budget

– Immediate results with continuous improvement

– Calibrate answer: provide error bars

Error bars on every answer!

[Figure: estimate with error bars vs. # of data points, compared with the true answer]

Algorithms

• Given any problem, data and a time budget

– Immediate results with continuous improvement

– Calibrate answer: provide error bars

Stop when error smaller than a given threshold

[Figure: estimate with error bars vs. # of data points and time, compared with the true answer]
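
The idea of immediate results, error bars, and a stopping threshold can be sketched for the simplest possible case, estimating a mean. This is a minimal illustration, not the AMP Lab's method: it assumes independent samples, uses a normal-approximation confidence interval, and all names and parameters are hypothetical.

```python
import math
import random


def estimate_until_precise(stream, threshold, z=1.96, min_points=30):
    """Consume values one at a time; stop once the error bar is below `threshold`.

    Returns (estimate, half_width, n_points). Uses Welford's online update for
    the mean and variance, and z * standard error as the error bar.
    """
    n, mean, m2 = 0, 0.0, 0.0
    half_width = math.inf
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n >= min_points:
            std_err = math.sqrt(m2 / (n - 1) / n)
            half_width = z * std_err
            if half_width < threshold:
                break  # answer is good enough; skip the remaining data
    return mean, half_width, n


rng = random.Random(0)
data = (rng.gauss(10, 2) for _ in range(1_000_000))  # synthetic data, true mean 10
est, err, n = estimate_until_precise(data, threshold=0.1)
print(f"estimate {est:.2f} +/- {err:.2f} after {n} of 1,000,000 points")
```

The estimate and its error bar are available after every point, so a caller could also display them continuously rather than waiting for the threshold.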



Algorithms

• Given any problem, data and a time budget

– Automatically pick the best algorithm

[Figure: estimate vs. time for a simple and a sophisticated algorithm, compared with the true answer; regions labeled “error too high”, “pick sophisticated”, “pick simple”]
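
A toy version of picking an algorithm for a given time budget is sketched below. It assumes each candidate comes with a model predicting its error as a function of the available time; the two error curves here are invented for illustration, and how such models would be obtained in practice is not covered by this sketch.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Candidate:
    name: str
    predicted_error: Callable[[float], float]  # time budget (seconds) -> expected error


def pick_algorithm(candidates: Sequence[Candidate], time_budget: float,
                   max_error: float) -> Optional[Candidate]:
    """Return the candidate with the lowest predicted error, or None if even
    the best one is predicted to exceed max_error within the budget."""
    best = min(candidates, key=lambda c: c.predicted_error(time_budget))
    if best.predicted_error(time_budget) > max_error:
        return None  # error too high for every candidate
    return best


# Hypothetical error curves: the simple algorithm converges quickly to a
# mediocre answer; the sophisticated one is slow but ends up more accurate.
simple = Candidate("simple", lambda t: 0.10 + 1.0 / (1.0 + t))
sophisticated = Candidate("sophisticated", lambda t: 0.01 + 20.0 / (1.0 + t))

for budget in (1, 10, 1000):
    choice = pick_algorithm([simple, sophisticated], budget, max_error=0.5)
    print(budget, choice.name if choice else "error too high")
```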

Machines

• “The datacenter as a computer” still in its infancy

– Special purpose clusters, e.g., Hadoop cluster

– Highly variable performance

– Hard to program

– Hard to debug



Machines

• Make datacenter a real computer!

[Figure: AMP stack and existing stack; Datacenter “OS” (e.g., Mesos) over Node OS (e.g. Linux), Node OS (e.g. Windows), Node OS (e.g. Linux), …]

• Share datacenter between multiple cluster computing apps

• Provide new abstractions and services

Machines


[Figure: AMP stack and existing stack; Hadoop, MPI, Hypertable, Cassandra, Hive, … over Node OS (e.g. Linux), Node OS (e.g. Windows), Node OS (e.g. Linux), …]

• Support existing cluster computing apps


Machines


[Figure: AMP stack and existing stack; Spark, SCADS, PIQL, Hadoop, MPI, Hypertable, Cassandra, Hive, … over Node OS (e.g. Linux), Node OS (e.g. Windows), Node OS (e.g. Linux), …]

• Spark: support interactive and iterative data analysis (e.g., ML algorithms)

• SCADS: consistency adjustable data store

• PIQL: predictive & insightful query language

Machines


[Figure: AMP stack and existing stack; applications and tools on top of Spark, SCADS, PIQL, Hadoop, MPI, Hypertable, Cassandra, Hive, … over Node OS (e.g. Linux), Node OS (e.g. Windows), Node OS (e.g. Linux), …]

Applications, tools

• Advanced ML algorithms

• Interactive data mining

• Collaborative visualization


People

• Humans can make sense of messy data!


People

• Make people an integrated part of the system!

– Leverage human activity

– Leverage human intelligence (crowdsourcing):

• Curate and clean dirty data

• Answer imprecise questions

• Test and improve algorithms

• Challenge

– Inconsistent answer quality in all dimensions (e.g., type of question, time, cost)

[Figure: people supply data and activity to Machines + Algorithms, and exchange questions and answers with them]
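
One common way to cope with inconsistent answer quality is to ask several people the same question and aggregate. The sketch below uses a plain majority vote with an agreement threshold. It is an illustrative baseline rather than the approach used in the AMP Lab, and the function name and threshold are assumptions.

```python
from collections import Counter


def aggregate_answers(answers, min_agreement=0.6):
    """Majority vote over crowd answers to one question.

    Returns (answer, agreement). The answer is None when the most common
    response falls short of `min_agreement`, signalling that more workers
    (or a differently phrased question) may be needed.
    """
    if not answers:
        return None, 0.0
    top, votes = Counter(answers).most_common(1)[0]
    agreement = votes / len(answers)
    return (top if agreement >= min_agreement else None), agreement


print(aggregate_answers(["cat", "cat", "dog"]))
print(aggregate_answers(["cat", "dog"]))
```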



Real Applications

• Mobile Millennium Project

– Alex Bayen, Civil and Environmental Engineering, UC Berkeley

• Microsimulation of urban development

– Paul Waddell, College of Environmental Design, UC Berkeley

• Crowd-based opinion formation

– Ken Goldberg, Industrial Engineering and Operations Research, UC Berkeley

• Personalized Sequencing

– Taylor Sittler, UCSF

Personalized Sequencing



The AMP Lab

[Figure: Algorithms, Machines, People, with applications Sequencing, Microsimulation, Mobile Millennium]





Big Data in 2020

Are you prepared?

• To create a new generation of big data scientists

• For ML to become an engineering discipline

• For people to be deeply integrated in the big data analysis pipeline

• Will your institution

– offer a big data curriculum touching all fields?

– have hired cross-disciplinary faculty?

– have invested in (pooled) storage infrastructure?

– have invested in public/private clouds?

– have built inter/intra campus networks?


Summary

• Goal: Tame Big Data Problem

– Get results with right quality at the right time

• Approach: Holistic integration of Algorithms, Machines, and People

• Huge research issues across many domains
