tyson condie. data is everywhere easier and cheaper than ever to collect data grows faster than...

Tyson Condie

Upload: erick-lamb

Post on 26-Dec-2015


TRANSCRIPT

Page 1: Tyson Condie. Data is Everywhere Easier and cheaper than ever to collect Data grows faster than Moore’s law (IDC report*)

Tyson Condie

Page 2

Data is Everywhere

• Easier and cheaper than ever to collect
• Data grows faster than Moore’s law

[Figure: data growth compared with Moore's Law, 2010 to 2015, vertical axis 0 to 14 (IDC report*)]

Page 3

Page 4

The New Gold Rush

• Everyone wants to extract value from data
  • Big companies & startups alike
• Huge potential
  • Already demonstrated by Google, Facebook, …
• But, untapped by most organizations
  • “We have lots of data but no one is looking at it!”

Page 5

Extracting Value from Data is Hard

• Data is massive, unstructured, and dirty
• Questions are complex
  • e.g., predict the future
• Processing and analysis tools are still in their “infancy”
• Need tools that are
  • Faster
  • More sophisticated
  • Easier to use

Page 6

Turning Data into Value

• Insights, diagnosis, e.g.,
  • Why is user engagement dropping?
  • Why is the system slow?
  • Detect spam, DDoS attacks
• Decisions, e.g.,
  • What feature to add to a product
  • Personalized medical treatment
  • What ads to show
  • What actors to cast for “House of Cards”

Data is only as useful as the decisions it enables

Page 7

What do We Need?

• Interactive queries: enable human-in-the-loop decisions
  • Big Data Workbench
  • Explore data in real time
• Streaming queries: enable automated real-time decisions
  • E.g., fraud detection, detecting DDoS attacks
• Sophisticated data processing: enable “better” decisions
  • E.g., anomaly detection, trend analysis
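
To make the streaming and anomaly-detection bullets concrete, here is a minimal sketch of an automated real-time decision. It is not taken from the talk: the class name, the sliding-window z-score rule, and the default parameters are all assumptions chosen for illustration.

```python
from collections import deque
from math import sqrt


class RollingAnomalyDetector:
    """Hypothetical detector: flags values far from a sliding window of recent values."""

    def __init__(self, window_size=20, threshold=3.0, min_samples=5):
        self.window = deque(maxlen=window_size)  # most recent normal values
        self.threshold = threshold               # allowed deviation, in standard deviations
        self.min_samples = min_samples           # no decisions until the window has this many values

    def observe(self, value):
        """Return True if value is anomalous relative to the current window."""
        is_anomaly = False
        if len(self.window) >= self.min_samples:
            mean = sum(self.window) / len(self.window)
            variance = sum((x - mean) ** 2 for x in self.window) / len(self.window)
            std = sqrt(variance)
            # A constant window (std == 0) never flags anything in this sketch.
            if std > 0 and abs(value - mean) / std > self.threshold:
                is_anomaly = True
        # Anomalous values are kept out of the window so they do not skew the baseline.
        if not is_anomaly:
            self.window.append(value)
        return is_anomaly


def detect(stream, **kwargs):
    """Run the detector over an iterable and return (index, value) for each anomaly."""
    detector = RollingAnomalyDetector(**kwargs)
    return [(i, v) for i, v in enumerate(stream) if detector.observe(v)]


if __name__ == "__main__":
    requests_per_second = [100, 102, 98, 101, 99, 100, 103, 97, 5000, 101, 100]
    print(detect(requests_per_second))
```

Each value is judged as it arrives, with no human in the loop, which is the property the streaming bullet asks for. A production system would also need windowing by time, state recovery, and scale-out.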

Page 8

The Need For Unification

• Today’s state-of-the-art analytics stack

[Diagram: data (e.g., logs) flows into three separate stacks, labeled Batch, Streaming, and Interactive queries, which serve ad-hoc queries on historical data, real-time analytics, and interactive queries on historical data]

Challenge 1: need to maintain three stacks

• Expensive and complex
• Hard to compute consistent metrics across stacks

Page 9

The Need For Unification

• Today’s state-of-the-art analytics stack

[Diagram: data (e.g., logs) flows into three separate stacks, labeled Batch, Streaming, and Interactive queries, which serve ad-hoc queries on historical data, real-time analytics, and interactive queries on historical data]

Challenge 2: hard/slow to share data, e.g.,
» Hard to perform interactive queries on streamed data

Page 10

Our Goal: Unified Big Data Runtime

[Diagram: Batch, Interactive, and Streaming combined into a Single Framework!]

Support batch, streaming, and interactive computations in a unified framework

Easy to develop sophisticated algorithms (e.g., graph and ML algorithms)
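
One way to picture the benefit of unification, and the "consistent metrics" problem it addresses, is a metric that is defined once and reused by both a batch run and a streaming run. The sketch below is purely illustrative: `RequestsPerUser`, `run_batch`, and `run_streaming` are hypothetical names, not part of any framework mentioned in the talk.

```python
from collections import Counter


class RequestsPerUser:
    """Hypothetical metric: written once, reused by every execution mode."""

    def __init__(self):
        self.counts = Counter()

    def update(self, event):
        # Fold one event into the running state.
        self.counts[event["user"]] += 1

    def result(self):
        # Return a copy so callers cannot mutate the internal state.
        return dict(self.counts)


def run_batch(events, metric_factory):
    """Batch mode: consume the whole historical dataset, then report once."""
    metric = metric_factory()
    for event in events:
        metric.update(event)
    return metric.result()


def run_streaming(event_iter, metric_factory):
    """Streaming mode: report an up-to-date result after every event."""
    metric = metric_factory()
    for event in event_iter:
        metric.update(event)
        yield metric.result()


if __name__ == "__main__":
    events = [{"user": "a"}, {"user": "b"}, {"user": "a"}]
    batch_result = run_batch(events, RequestsPerUser)
    snapshots = list(run_streaming(iter(events), RequestsPerUser))
    # Both modes share one definition, so the final numbers agree.
    print(batch_result, snapshots[-1])
```

Because both modes call the same `update` logic, the two results cannot drift apart the way independently maintained stacks can.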

Page 11

Resource Managers: Cloud Operating System

• Manage machine cluster (cloud) resources
• Tenants coordinate with the resource manager (RM) to allocate resources for running tasks
  • E.g., a MapReduce job would execute its map/reduce tasks on the allocated resources
• A few alternative designs
  • Apache YARN: the resource manager introduced in Hadoop version 2
  • Apache Mesos
  • Google Omega
  • Facebook Corona
• Goal: broaden the scope of Big Data applications
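
The tenant/RM interaction described above can be sketched with a toy model. This is not the API of YARN, Mesos, Omega, or Corona; `ToyResourceManager`, its methods, and the node and tenant names are hypothetical and exist only to show the allocate/release cycle.

```python
class ToyResourceManager:
    """Hypothetical RM: tracks free containers per node and leases them to tenants."""

    def __init__(self, containers_per_node):
        self.free = dict(containers_per_node)  # node name -> free container count
        self.leases = {}                       # tenant name -> nodes of granted containers

    def allocate(self, tenant, count):
        """Grant up to `count` containers; returns the node of each granted container."""
        granted = []
        for node in sorted(self.free):
            while self.free[node] > 0 and len(granted) < count:
                self.free[node] -= 1
                granted.append(node)
        self.leases.setdefault(tenant, []).extend(granted)
        return granted

    def release(self, tenant):
        """Return every container held by `tenant` to the free pool."""
        for node in self.leases.pop(tenant, []):
            self.free[node] += 1


if __name__ == "__main__":
    rm = ToyResourceManager({"node1": 2, "node2": 1})
    print(rm.allocate("mapreduce-job", 2))   # two tenants share one cluster
    print(rm.allocate("storm-topology", 2))  # only one container is left
    rm.release("mapreduce-job")
    print(rm.free)
```

The second tenant receives fewer containers than it asked for, which hints at why applications on a shared cluster need to cope with elastic resources.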

Page 12

The Challenge

[Diagram: Batch (MapReduce), Streaming (Storm), Interactive, and Machine Learning applications above YARN / HDFS, with the gap between them marked “!?!?!?!”]

Page 13

The Challenge

[Diagram: Batch (MapReduce), Streaming (Storm), Interactive, and Machine Learning applications above YARN / HDFS. Requirements highlighted: Fault Tolerance, High-throughput networking]

Page 14

The Challenge

[Diagram: Batch (MapReduce), Streaming (Storm), Interactive, and Machine Learning applications above YARN / HDFS. Requirements highlighted: Load spikes, Elastic resource needs]

Page 15

The Challenge

[Diagram: Batch (MapReduce), Streaming (Storm), Interactive, and Machine Learning applications above YARN / HDFS. Requirements highlighted: User-friendly toolkits, Low-latency networking]

Page 16

The Challenge

[Diagram: Batch (MapReduce), Streaming (Storm), Interactive, and Machine Learning applications above YARN / HDFS. Requirements highlighted: Complex functions/data, Iterative dataflow]

Page 17

REEF: Retainable Evaluator Execution Framework

[Diagram: REEF as a layer between YARN / HDFS and the Batch (MapReduce), Streaming (Storm), Interactive, and Machine Learning applications]

Page 18

Unified Big Data Runtime Stack

[Diagram: layered stack, listed from the bottom up]

• YARN / HDFS
• REEF
• Physical Data Parallel Operators
• Domain Specific Language (DSL)
• Applications: Batch (MapReduce), Streaming (Storm), Interactive, Machine Learning

Page 19

REEF: http://reef-project.org

Centralized control plane for building a distributed data plane

Control Plane / Data Plane

• Storage
  • Big Buffer Manager
  • Operator Access Methods
• Network
  • Message passing (sending statistics)
  • Bulk transfers (large-scale shuffle)
• State Management
  • Checkpoints
  • Data lineage

• Job Driver: user code executed on YARN’s Application Master (control plane)
• Task: user code executed within an Evaluator (data plane)
• Evaluator: execution environment for Tasks. One Evaluator is bound to one YARN Container
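
The relationship between Job Driver, Evaluator, and Task can be sketched with a toy model. This is not the REEF API: every class, function, and container name below is hypothetical, and the in-memory `cache` is only an assumption about what "retainable" could buy, namely state that survives from one Task to the next on the same Evaluator.

```python
class Evaluator:
    """Hypothetical stand-in for an execution environment bound to one container."""

    def __init__(self, container_id):
        self.container_id = container_id
        self.cache = {}      # state retained across Tasks on this Evaluator
        self.tasks_run = 0

    def run(self, task):
        # A Task here is plain user code that receives the retained state.
        self.tasks_run += 1
        return task(self.cache)


def load_task(cache):
    """First Task: load data once and leave it in the Evaluator."""
    cache["data"] = list(range(10))
    return len(cache["data"])


def sum_task(cache):
    """Second Task: reuse the data left behind by the first Task."""
    return sum(cache["data"])


class Driver:
    """Hypothetical control plane: decides which Task runs on which Evaluator."""

    def __init__(self, container_ids):
        # One Evaluator per container, mirroring the one-to-one binding on the slide.
        self.evaluators = [Evaluator(cid) for cid in container_ids]

    def run_job(self):
        results = {}
        for evaluator in self.evaluators:
            evaluator.run(load_task)
            results[evaluator.container_id] = evaluator.run(sum_task)
        return results


if __name__ == "__main__":
    driver = Driver(["container_1", "container_2"])
    print(driver.run_job())
```

The Driver holds all scheduling logic while the Evaluators only execute what they are handed, which mirrors the centralized control plane and distributed data plane split.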

Page 20

Summary

• Everyone collects data, but few extract value from it
• Unification of computation and programming models to
  • Efficiently analyze data
  • Make sophisticated, real-time decisions
• REEF provides OS functionalities
  • Used to develop higher-level Big Data applications
• Long-term goal is to…
  • Unify batch, interactive, and streaming computation models
  • Provide domain-specific toolkits to data scientists

[Diagram: Batch, Interactive, and Streaming on top of REEF]

Page 21

Scalable Analytics Institute

http://scai.cs.ucla.edu

Page 22

ScAI Projects

• Big Data systems
• Graph-based analytics
• Language design for Big Data and data streams
• Mining high-dimensional data
• User and quality modeling in Big Data