hcj 2013-01-21

Post on 26-Jan-2015

DESCRIPTION

What is the future of Hadoop? What is the new future of Hadoop? How is that different from the old one? Here is how I answered these questions at the winter Hadoop Conference of Japan 2013.

TRANSCRIPT

1©MapR Technologies - Confidential

The Power of Hadoop to Transform Business

My Background

University, Startups
  – Aptex, MusicMatch, ID Analytics, Veoh
  – big data since before it was big

Open source
  – even before the internet
  – Apache Hadoop, Mahout, Zookeeper, Drill
  – bought the beer at first HUG

MapR

Founding member of Apache Drill

MapR Technologies

Silicon Valley Startup
  – Top investors
  – Top technical and management team
      • Google, Microsoft, EMC, NetApp, Oracle

Enterprise quality distribution for Hadoop

Many extensions to basic Hadoop function

Strong supporter of Apache Drill

Philosophy First

What is History?

The study of the past

(what came before now)

What is the future?

(it comes after now)

But the future also has a past!

Do you remember the future?

Some things turned out as expected

Many things are different!

Hadoop has a history

Hadoop also has a future

The Old Future of Hadoop

Map-reduce and HDFS
  – more and more, but not really different

Eco-system additions
  – Simpler programming (Hive and Pig)
  – Key-value store
  – Ad hoc query

Stands apart from other computing
  – Required by HDFS and other limitations

The New Future of Hadoop

Real-time processing
  – Combines real-time and long-time

Integration with traditional IT
  – No need to stand apart

Integration with new technologies
  – Solr, Node.js, Twisted all should interface directly

Fast and flexible computation
  – Drill logical plan language

Example #1: Search Abuse

History matrix

One row per user

One column per thing

Recommendation based on cooccurrence

Cooccurrence gives item-item mapping

One row and column per thing
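
As a rough illustration of the two slides above, the sketch below turns a small user history (one row per user) into item-item cooccurrence counts (one row and column per thing). The data, the function name `cooccurrence`, and the use of raw counts are assumptions made for illustration; the talk names Mahout for this step, and this is not Mahout's implementation.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence(history):
    """Count, for each pair of distinct items, how many users have both."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in history.values():
        # Deduplicate per user so each user contributes at most once per pair.
        for a, b in combinations(sorted(set(items)), 2):
            counts[a][b] += 1
            counts[b][a] += 1  # the matrix is symmetric
    return counts


# Hypothetical history matrix: one row per user, listing the things they touched.
history = {
    "u1": ["apple", "bread", "cheese"],
    "u2": ["apple", "bread"],
    "u3": ["bread", "cheese"],
}

cooc = cooccurrence(history)
print(dict(cooc["bread"]))  # {'apple': 2, 'cheese': 2}
```

Each row of `cooc` is the item-item mapping for one thing: the other things that tend to appear in the same users' histories.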

Cooccurrence matrix can also be implemented as a search index
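
One way to read this slide: store each item's cooccurring items as an indexed field, then use the user's history as the query. The sketch below imitates that with a plain in-memory inverted index. It does not use Solr, and `build_index`, `recommend`, and the sample data are hypothetical names chosen for illustration.

```python
from collections import defaultdict


def build_index(indicators):
    """Invert item -> indicator items into indicator -> items (a tiny search index)."""
    index = defaultdict(set)
    for item, linked in indicators.items():
        for other in linked:
            index[other].add(item)
    return index


def recommend(index, user_history, limit=3):
    """Score each unseen item by how many history items match its indicators."""
    scores = defaultdict(int)
    for seen in user_history:
        for item in index.get(seen, ()):
            if item not in user_history:
                scores[item] += 1
    # Highest score first; ties broken alphabetically for a stable result.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:limit]]


# Hypothetical indicator lists, e.g. derived from a cooccurrence matrix.
indicators = {
    "apple": ["bread", "cheese"],
    "bread": ["apple", "cheese"],
    "cheese": ["apple", "bread"],
    "donut": ["bread"],
}

index = build_index(indicators)
print(recommend(index, ["apple", "bread"]))  # ['cheese', 'donut']
```

The query is just the user's recent history, so recommendation becomes an ordinary search request against the item index.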

[Diagram: Complete history, Cooccurrence (Mahout), Item meta-data, Solr indexing, Solr Indexer (x2), Index shards]

[Diagram: User history, Web tier, Item meta-data, Solr search, Solr Indexer (x2), Index shards]

Objective Results

At a very large credit card company

History is all transactions, all web interaction

Processing time cut from 20 hours per day to 3

Recommendation engine load time decreased from 8 hours to 3 minutes

Example #2: Web Technology

[Diagram: Real-time data, Raw logs, Fast analysis (Storm), Analytic output]

[Diagram: Raw logs, Large analysis (map-reduce), Analytic output]

[Diagram: Raw logs, Analytic output, Presentation tier (d3 + node.js), Browser query]

Objective Results

Real-time + long-time analysis is seamless

Web tier can be rooted directly on Hadoop cluster

No need to move data
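
The slides do not say how the real-time and long-time results are combined. One plausible reading, sketched here purely as an assumption, is that the presentation tier adds the batch (map-reduce) aggregates to the recent (Storm) aggregates at query time. The function name and the page-view counts are hypothetical.

```python
from collections import Counter


def combined_view(batch_counts, realtime_counts):
    """Merge long-time (batch) and real-time counts into one answer per key."""
    return Counter(batch_counts) + Counter(realtime_counts)


# Hypothetical page-view counts: batch output covers everything up to the last
# batch run, real-time output covers events that arrived since then.
batch = {"/home": 1000, "/cart": 250}
realtime = {"/home": 12, "/checkout": 3}

view = combined_view(batch, realtime)
print(view["/home"], view["/cart"], view["/checkout"])  # 1012 250 3
```

If both outputs sit in the same cluster file system, a merge like this can run in the web tier without copying data elsewhere first.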

Example #3: Apache Drill

Big Data Processing – Hadoop

                    | Batch processing
Query runtime       | Minutes to hours
Data volume         | TBs to PBs
Programming model   | MapReduce
Users               | Developers
Google project      | MapReduce
Open source project | Hadoop MapReduce

Big Data Processing – Hadoop and Storm

                    | Batch processing | Stream processing
Query runtime       | Minutes to hours | Never-ending
Data volume         | TBs to PBs       | Continuous stream
Programming model   | MapReduce        | DAG (pre-programmed)
Users               | Developers       | Developers
Google project      | MapReduce        |
Open source project | Hadoop MapReduce | Storm or Apache S4

Big Data Processing – The missing part

                    | Batch processing | Interactive analysis | Stream processing
Query runtime       | Minutes to hours |                      | Never-ending
Data volume         | TBs to PBs       |                      | Continuous stream
Programming model   | MapReduce        |                      | DAG (pre-programmed)
Users               | Developers       |                      | Developers
Google project      | MapReduce        |                      |
Open source project | Hadoop MapReduce |                      | Storm and S4

Big Data Processing – The missing part

                    | Batch processing | Interactive analysis    | Stream processing
Query runtime       | Minutes to hours | Milliseconds to minutes | Never-ending
Data volume         | TBs to PBs       | GBs to PBs              | Continuous stream
Programming model   | MapReduce        | Queries (ad hoc)        | DAG (pre-programmed)
Users               | Developers       | Analysts and developers | Developers
Google project      | MapReduce        |                         |
Open source project | Hadoop MapReduce |                         | Storm and S4

Big Data Processing

                    | Batch processing | Interactive analysis    | Stream processing
Query runtime       | Minutes to hours | Milliseconds to minutes | Never-ending
Data volume         | TBs to PBs       | GBs to PBs              | Continuous stream
Programming model   | MapReduce        | Queries                 | DAG
Users               | Developers       | Analysts and developers | Developers
Google project      | MapReduce        | Dremel                  |
Open source project | Hadoop MapReduce |                         | Storm and S4

Big Data Processing

                    | Batch processing | Interactive analysis    | Stream processing
Query runtime       | Minutes to hours | Milliseconds to minutes | Never-ending
Data volume         | TBs to PBs       | GBs to PBs              | Continuous stream
Programming model   | MapReduce        | Queries                 | DAG
Users               | Developers       | Analysts and developers | Developers
Google project      | MapReduce        | Dremel                  |
Open source project | Hadoop MapReduce | Apache Drill            | Storm and S4

Design Principles

Flexible
  • Pluggable query languages
  • Extensible execution engine
  • Pluggable data formats
      • Column-based and row-based
      • Schema and schema-less
  • Pluggable data sources

Easy
  • Unzip and run
  • Zero configuration
  • Reverse DNS not needed
  • IP addresses can change
  • Clear and concise log messages

Dependable
  • No SPOF
  • Instant recovery from crashes

Fast
  • C/C++ core with Java support
      • Google C++ style guide
  • Min latency and max throughput (limited only by hardware)

Simple Architecture

Standard Interfaces

Logical Plan Syntax:

query:[
  {
    op:"sequence",
    do:[
      { op: "scan", memo: "initial_scan", ref: "donuts", source: "local-logs", selection: {data: "activity"} },
      { op: "transform", transforms: [ { ref: "donuts.quantity", expr: "donuts.sales"} ] },
      { op: "filter", expr: "donuts.ppu < 1.00" },
      …
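
To make the operator sequence concrete, here is a minimal sketch of how a scan, transform, filter pipeline could be applied to in-memory records. This is not Drill's API or execution engine: the helper names (`run_sequence`, `transform`, `filter_by`), the sample records, and the use of Python callables in place of expression strings are all assumptions for illustration.

```python
def run_sequence(records, ops):
    """Apply each operator in order; every operator maps a list of records to a new list."""
    for op in ops:
        records = op(records)
    return records


def transform(ref, source_field):
    """Return an operator that copies source_field into a new field named ref."""
    return lambda recs: [{**r, ref: r[source_field]} for r in recs]


def filter_by(predicate):
    """Return an operator that keeps only records satisfying predicate."""
    return lambda recs: [r for r in recs if predicate(r)]


# Hypothetical result of the initial scan: a few donut records.
donuts = [
    {"name": "plain", "sales": 35, "ppu": 0.55},
    {"name": "filled", "sales": 14, "ppu": 1.25},
]

result = run_sequence(donuts, [
    transform("quantity", "sales"),        # like the plan's "transform" op
    filter_by(lambda r: r["ppu"] < 1.00),  # like the plan's "filter" op
])
print(result)  # [{'name': 'plain', 'sales': 35, 'ppu': 0.55, 'quantity': 35}]
```

The point of the logical plan is the same as in this sketch: the query is data (a list of operators), so different front-end languages can produce it and different engines can run it.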

Logical Streaming Example

{
  @id: <refnum>,
  op: "window-frame",
  input: <input>,
  keys: [ <name>,... ],
  ref: <name>,
  before: 2,
  after: here
}

Input:   0 1 2 3 4

Frames:  [0] [0 1] [0 1 2] [1 2 3] [2 3 4]
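
The frames on this slide can be reproduced with a short sliding-window function. The sketch assumes that `before: 2` means "up to two preceding records" and that `after: here` means "no records past the current one". The function name `window_frames` is hypothetical, and this is not Drill code.

```python
def window_frames(values, before, after=0):
    """For each position, return the slice covering `before` earlier and `after` later records."""
    frames = []
    for i in range(len(values)):
        start = max(0, i - before)             # clip at the start of the input
        end = min(len(values), i + after + 1)  # clip at the end of the input
        frames.append(values[start:end])
    return frames


frames = window_frames([0, 1, 2, 3, 4], before=2)
print(frames)  # [[0], [0, 1], [0, 1, 2], [1, 2, 3], [2, 3, 4]]
```

Each input record produces one frame, so a downstream aggregate over the frame yields a running statistic such as a three-point moving average.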

Logical Plan

Execution Plan

Representing a DAG

{
  @id: 19,
  op: "aggregate",
  input: 18,
  type: <simple|running|repeat>,
  keys: [<name>,...],
  aggregations: [
    {ref: <name>, expr: <aggexpr> },...
  ]
}
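
In this representation each operator carries an `@id` and names its upstream operator through `input`, so the plan forms a graph rather than a nested tree. The sketch below shows one way to recover an execution order from such references, using Python's standard `graphlib` module (Python 3.9+). Only the aggregate operator with id 19 and input 18 comes from the slide; the other two operators are hypothetical.

```python
from graphlib import TopologicalSorter

# Operators listed in arbitrary order; "input" refers to another operator's id.
ops = [
    {"id": 19, "op": "aggregate", "input": 18},
    {"id": 18, "op": "filter", "input": 17},  # hypothetical upstream operator
    {"id": 17, "op": "scan", "input": None},  # hypothetical source operator
]


def execution_order(ops):
    """Return operator ids so that every operator appears after its input."""
    graph = {
        op["id"]: ({op["input"]} if op["input"] is not None else set())
        for op in ops
    }
    return list(TopologicalSorter(graph).static_order())


order = execution_order(ops)
print(order)  # [17, 18, 19]
```

Referring to inputs by id also lets two operators share one upstream result, which a purely nested syntax cannot express without duplication.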

Non-SQL queries

Design Principles

Flexible
  • Pluggable query languages
  • Extensible execution engine
  • Pluggable data formats
      • Column-based and row-based
      • Schema and schema-less
  • Pluggable data sources

Easy
  • Unzip and run
  • Zero configuration
  • Reverse DNS not needed
  • IP addresses can change
  • Clear and concise log messages

Dependable
  • No SPOF
  • Instant recovery from crashes

Fast
  • C/C++ core with Java support
      • Google C++ style guide
  • Min latency and max throughput (limited only by hardware)

The future is not what we thought it would be

It is better!

Get Involved!

Tweet: #hcj13w #mapr

@ted_dunning

Get Involved!

Download these slides
  – http://www.mapr.com/company/events/hcj-01-21-2013

Join the Drill project
  – drill-dev-subscribe@incubator.apache.org
  – #apachedrill

Contact me:
  – tdunning@maprtech.com
  – tdunning@apache.org
  – @ted_dunning

Join MapR (in Japan!)
  – jobs@mapr.com