Thinking at Scale with Hadoop
DESCRIPTION
Presentation by Ken Krugler at the SDForum SAM SIG (Software Architecture & Modeling) meeting on Sept 22nd, 2010. This talk provides a brief introduction to Map-Reduce & Hadoop, then discusses challenges of implementing complex data processing using low-level Map-Reduce support, and a number of solutions.
TRANSCRIPT
Copyright (c) 2008 Scale Unlimited, Inc. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Thinking at Scale with Hadoop
Photo by: i_pinz, flickr
Ken Krugler
Bixo Labs & Scale Unlimited
Friday, September 24, 2010
Introduction - Ken Krugler
Founder, Bixo Labs - http://bixolabs.com
Large scale web mining workflows, running in Amazon’s cloud
Instructor, Scale Unlimited - http://scaleunlimited.com
Teaching Hadoop Boot Camp
Founder, Krugle (2005 - 2009)
Code search startup
Parse/index 2.6 billion lines of open source code
Goals for Talk
Intro to Map-Reduce
Overview of Hadoop
Challenges with Map-Reduce
Solutions to Complexity
Introduction to Map-Reduce
Photo by: exfordy, flickr
Map -> The ‘map’ function in the MapReduce algorithm, user defined
Reduce -> The ‘reduce’ function in the MapReduce algorithm, user defined
Group -> The ‘magic’ that happens between Map and Reduce
Key Value Pair -> two units of data, exchanged between Map & Reduce
Definitions
How MapReduce Works
Map translates input keys and values into new keys and values
The system groups each unique key with all of its values
Reduce translates the values of each unique key into new keys and values
Map:    [K1,V1] -> [K2,V2]
Group:  [K2,V2] -> [K2,{V2,V2,....}]
Reduce: [K2,{V2,V2,....}] -> [K3,V3]
All Together
[Diagram: many Mappers feeding a smaller number of Reducers through the Shuffle]
Canonical Example - Word Count
Word Count
Read a document, parse out the words, count the frequency of each word
Specifically, in MapReduce
With a document consisting of lines of text
Translate each line of text into key = word and value = 1
Sum the values (ones) for each unique word
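The three steps above can be sketched as a plain in-memory simulation (illustrative Java only, not the Hadoop API; class and method names are hypothetical):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class WordCountSketch {

    // Map phase: translate one line of text into [word, 1] pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Group phase: collect all values for each unique key.
    static Map<String, List<Integer>> group(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the values (ones) for each unique word.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            counts.put(entry.getKey(), entry.getValue().stream().mapToInt(Integer::intValue).sum());
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = reduce(group(map("to be or not to be")));
        System.out.println(counts);  // {to=2, be=2, or=1, not=1}
    }
}
```

In real Hadoop the map and reduce functions run on different machines and the group step happens in the shuffle, but the data flow is exactly this.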
Input [K1,V1] - line number, line of text:
[1, When in the Course of human events, it becomes]
[2, necessary for one people to dissolve the political bands]
[3, which have connected them with another, and to assume]

Map (user defined) emits [K2,V2], i.e. [word,1]:
[When,1] [in,1] [the,1] [Course,1] ... [k,1]

Group collects [K2,{V2,V2,....}], i.e. [word,{1,1,...}]:
[When,{1,1,1,1}] [people,{1,1,1,1,1,1}] [dissolve,{1,1,1}] [connected,{1,1,1,1,1}] ... [k,{v,v,...}]

Reduce (user defined) emits [K3,V3], i.e. [word,sum]:
[When,4] [people,6] [dissolve,3] [connected,5] ... [k,sum(v,v,v,..)]
Divide & Conquer
Because
The Map function only cares about the current key and value, and
The Reduce function only cares about the current key and its values
Mappers & Reducers can run Map/Reduce functions independently
Then
A Mapper can invoke Map on an arbitrary number of input keys and values
A Reducer can invoke Reduce on an arbitrary number of the unique keys
Any number of Mappers & Reducers can be run on any number of nodes
Overview of Hadoop
Photo by: exfordy, flickr
Logical Architecture
Logically, Hadoop is simply a computing cluster that provides:
a Storage layer, and
an Execution layer
[Diagram: a cluster providing a Storage layer and an Execution layer]
Logical Architecture
Where users can submit jobs that:
Run on the Execution layer, and
Read/write data from the Storage layer
[Diagram: client users submitting jobs that run on the Execution layer and read/write the Storage layer]
Storage - HDFS
Performance
Optimized for many concurrent serialized reads of large files
But only a single writer (write-once, no append)
Fidelity
Replication via block replication
[Diagram: HDFS blocks replicated across nodes, racks, and geographies]
Execution
Performance
Divide and Conquer
MapReduce
Data Affinity
Move code to data
Fidelity
Applications must choose to fail on bad data, or ignore bad data
[Diagram: map and reduce tasks spread across nodes, racks, and geographies]
Architectural Components
Solid boxes are unique applications
Dashed boxes are child JVM instances on same node as parent
Dotted boxes are blocks of managed files on same node as parent
[Diagram: a Client submits jobs to the JobTracker, which hands tasks to TaskTrackers that spawn mapper and reducer child JVMs; the NameNode (with a Secondary NameNode) handles namespace operations while DataNodes handle block read/writes]
Challenges of Map-Reduce
Photo by: Graham Racher, flickr
Complexity of keys/values
Simple != Simplicity
People think about processing records/objects
The correct key definition changes during a workflow
A “value” is typically a set of fields
Complexity of chaining jobs
Most real-world workflows require multiple Map-Reduce jobs
Connecting data between jobs is tedious
Keeping key/value definitions in sync is error-prone
As an example...
A trivial filter/count/sort workflow
// Counts the pages the users clicked and sorts by number of clicks
private void start(String inputPath, String outputPath, String filterUser) throws Exception {
    // filter and count
    Path outputDir = new Path(outputPath);
    Path parent = outputDir.getParent();
    Path tmpOutput = new Path(parent, "" + System.currentTimeMillis());

    // Create the job configuration
    Configuration conf = new Configuration();
    conf.set(FilterMapper.USER_KEY, filterUser);

    Job filterCountJob = new Job(conf, "filter+count job");
    FileInputFormat.setInputPaths(filterCountJob, new Path(inputPath));
    FileOutputFormat.setOutputPath(filterCountJob, tmpOutput);

    filterCountJob.setMapperClass(FilterMapper.class);
    filterCountJob.setMapOutputKeyClass(Text.class);
    filterCountJob.setMapOutputValueClass(LongWritable.class);

    filterCountJob.setReducerClass(CounterReducer.class);
    filterCountJob.setOutputKeyClass(Text.class);
    filterCountJob.setOutputValueClass(LongWritable.class);
    filterCountJob.setInputFormatClass(TextInputFormat.class);
    filterCountJob.setOutputFormatClass(SequenceFileOutputFormat.class);

    // run job and block until job is done
    filterCountJob.waitForCompletion(false);

    // sort
    Job sortJob = new Job(conf, "sort job");
    FileInputFormat.setInputPaths(sortJob, tmpOutput);
    FileOutputFormat.setOutputPath(sortJob, outputDir);

    sortJob.setMapperClass(SortMapper.class);
    sortJob.setMapOutputKeyClass(UserAndCount.class);
    sortJob.setMapOutputValueClass(NullWritable.class);

    // sorts are only stable with one reducer, but then you lose distribution
    sortJob.setNumReduceTasks(1);

    sortJob.setInputFormatClass(SequenceFileInputFormat.class);
    sortJob.setOutputFormatClass(TextOutputFormat.class);
    // run job and block until job is done
    sortJob.waitForCompletion(false);

    // Get rid of temp directory used to bridge between the two jobs
    FileSystem.get(conf).delete(tmpOutput, true);
}
Not seeing the forest...
Too much low-level detail obscures intent
Not seeing intent leads to bugs
And makes optimization harder
Hard to use MR optimizations
MR => MR{0,1} => M{1,}R{0,1} => M{1,}R{0,1}M{0,}
Combiners can reduce map output size
Map-side joins where one of the sides is “small”
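For an associative operation like the word-count sum, a combiner is just the reduce function applied to each mapper's local output before the shuffle. A minimal illustration (plain Java, not the Hadoop Combiner API) of why this is safe: summing per-mapper partial sums gives the same answer as summing all the raw ones, while shipping far fewer values:

```java
import java.util.List;

public class CombinerSketch {

    // The reduce-side aggregation: sum a list of counts.
    static int sum(List<Integer> values) {
        return values.stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        // Two mappers each emitted [word,1] several times for the same word.
        List<Integer> mapper1 = List.of(1, 1, 1);     // 3 values over the wire
        List<Integer> mapper2 = List.of(1, 1, 1, 1);  // 4 values over the wire

        // Without a combiner: ship all 7 ones, reduce sums them.
        int withoutCombiner = sum(List.of(1, 1, 1, 1, 1, 1, 1));

        // With a combiner: each mapper pre-sums locally, ship only 2 values.
        int withCombiner = sum(List.of(sum(mapper1), sum(mapper2)));

        System.out.println(withoutCombiner == withCombiner);  // true: both are 7
    }
}
```

This only works when the operation is associative and commutative, which is why Hadoop makes the combiner an opt-in.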
Solutions to Complexity
Cascading - Tuples, Pipes, and Operations
FlumeJava - Java Classes and Deferred Execution
Pig/Hive - “It’s just a big database”
Datameer - “It’s just a big spreadsheet”
Cascading
Photo by: Graham Racher, flickr
Cascading
An API for defining data processing workflows
First public release Jan 2008
Currently 1.1.1 and supports Hadoop 0.19.x through 0.20.2
http://www.cascading.org
Cascading
Basic unit of data is “Tuple” - roughly equal to a record
Implements all standard data processing operations
function, filter, group by, co-group, aggregator, etc
Complex applications are built by chaining operations with pipes
And are planned into the minimum number of MapReduce jobs
Operations on Tuples
Map and Reduce communicate through Keys and Values
Our problems are really about elements of Keys and Values
e.g. “username”, not “line”
The effect is more reusability of our ‘map’ and ‘reduce’ operations
Mapping onto Hadoop
Functions & Filters are map operations
Aggregates & Buffers are reduce operations
Built-in support for joining, multi-field sorting
Concept of “Taps” with “Schemes” as input and output formats
Loose Coupling
[Diagram: Source -> func -> func -> Group -> aggr -> Sink]
Above, all the possible relationships between operations and data
If Fields are the interface between operations, we can build flexible apps
Layered Over MapReduce
[Diagram: a flow of Sources, funcs, a CoGroup, a GroupBy, aggrs, a temp tap, and a Sink, planned onto alternating Map and Reduce phases]
Cascading Code
Tap input = new Hfs( new TextLine( new Fields( "line" ) ), inputPath );
Tap output = new Hfs( new TextLine( 1 ), outputPath, SinkMode.REPLACE );

Pipe pipe = new Pipe( "example" );

RegexSplitGenerator function = new RegexSplitGenerator( new Fields( "word" ), "\\s+" );
pipe = new Each( pipe, new Fields( "line" ), function );

pipe = new GroupBy( pipe, new Fields( "word" ) );

pipe = new Every( pipe, new Fields( "word" ), new Count( new Fields( "count" ) ) );
pipe = new Every( pipe, new Fields( "word" ), new Average( new Fields( "avg" ) ) );

pipe = new GroupBy( pipe, new Fields( "count", -1 ) );

Properties properties = new Properties();
FlowConnector.setApplicationJarClass( properties, Main.class );
Flow flow = new FlowConnector( properties ).connect( "example flow", input, output, pipe );
flow.complete();
[Diagram: Cascading's planner output for a log-processing flow - an 'arrival rate' assembly parses 'ts' (seconds) and 'tm' (minutes) from SequenceFile logs with a DateParser and an ExpressionFunction, GroupBy('tsCount') and GroupBy('tmCount') group them, Every(Count) counts them, and results land in TextLine taps at output/arrivalrate/sec and output/arrivalrate/min]
[Diagram: planner output for a much larger flow - many Each/CoGroup/GroupBy/Every steps stitched together with TempHfs SequenceFile taps between the generated Map-Reduce jobs]
FlumeJava
Photo by: Graham Racher, flickr
FlumeJava
Recent paper (June 2010) from Google - Chambers et al.
Internal project to make it easier to write Map-Reduce jobs
Equivalent open source project just started
Plume - http://github.com/tdunning/Plume
Java Classes, Deferred Exec
Java-centric view of map-reduce processing
Handful of classes to represent immutable, parallel collections
PCollection, PTable
Small number of primitive operations on classes
parallelDo, groupByKey, combineValues, flatten
Deferred execution allows construction/optimization of workflow
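The key idea, deferred execution, can be sketched with a toy class of my own (loosely in the spirit of FlumeJava's PCollection; this is not the real FlumeJava or Plume API): each operation only records a step in the pipeline, and nothing runs until the whole plan is known - which is what lets a real planner fuse steps into a minimal set of Map-Reduce jobs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class DeferredSketch {

    // A toy deferred parallel collection (hypothetical, for illustration only).
    static class PCollection<T> {
        private final List<?> source;
        private final List<Function<Object, Object>> pipeline;

        private PCollection(List<?> source, List<Function<Object, Object>> pipeline) {
            this.source = source;
            this.pipeline = pipeline;
        }

        static <T> PCollection<T> of(List<T> data) {
            return new PCollection<>(data, new ArrayList<>());
        }

        // Deferred: just append the step to the plan, don't execute anything.
        @SuppressWarnings("unchecked")
        <R> PCollection<R> parallelDo(Function<T, R> fn) {
            List<Function<Object, Object>> next = new ArrayList<>(pipeline);
            next.add((Function<Object, Object>) fn);
            return new PCollection<>(source, next);
        }

        // Only here do the recorded steps actually run (a planner could
        // inspect and optimize the pipeline first).
        @SuppressWarnings("unchecked")
        List<T> run() {
            List<Object> current = new ArrayList<>(source);
            for (Function<Object, Object> step : pipeline) {
                List<Object> out = new ArrayList<>();
                for (Object item : current) {
                    out.add(step.apply(item));
                }
                current = out;
            }
            return (List<T>) current;
        }
    }

    public static void main(String[] args) {
        List<Integer> result = PCollection.of(List.of(1, 2, 3))
                .parallelDo((Integer x) -> x * 2)
                .parallelDo((Integer x) -> x + 1)
                .run();
        System.out.println(result);  // [3, 5, 7]
    }
}
```

FlumeJava's real planner goes much further (fusing parallelDo chains, placing groupByKey boundaries), but the build-then-run split is the enabling mechanism.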
FlumeJava Code
PCollection<String> lines = readTextFileCollection("/gfs/data/shakes/hamlet.txt");

PCollection<String> words = lines.parallelDo(new DoFn<String,String>() {
  void process(String line, EmitFn<String> emitFn) {
    for (String word : splitIntoWords(line)) {
      emitFn.emit(word);
    }
  }
}, collectionOf(strings()));
Results of Use at Google
[Chart: lines of code vs. number of Map-Reduce jobs, from FlumeJava's use at Google]
Pig & Hive
Photo by: Tambako the Jaguar, flickr
Pig
Pig is a data flow ‘programming environment’ for processing large files
PigLatin is the text language it provides for defining data flows
Developed by researchers at Yahoo! and widely used internally
A sub-project of Hadoop
APL licensed
http://hadoop.apache.org/pig/
First incubator release v0.1.0 in Sept 2008
Pig
Provides two kinds of optimizations:
Physical - caching and stream optimizations at the MapReduce level
Algebraic - mathematical reductions at the syntax level
Allows for User Defined Functions (UDF)
must register jar of functions via register path/some.jar
can be called directly from PigLatin
B = foreach A generate myfunc.MyEvalFunc($0);
PigLatin - Example
W = LOAD 'filename' AS (url, outlink);
G = GROUP W BY url;
R = FOREACH G {
    FW = FILTER W BY outlink eq 'www.apache.org';
    PW = FW.outlink;
    DW = DISTINCT PW;
    GENERATE group, COUNT(DW);
}
Hive
Hive is a data warehouse built on top of flat files
Developed by Facebook
Available as a Hadoop contribution in v0.19.0
http://wiki.apache.org/hadoop/Hive
Hive
Data Organization into Tables with logical and hash partitioning
A Metastore to store metadata about Tables/Partitions/Buckets
Partitions are ‘how’ the data is stored (collections of related rows)
Buckets further divide Partitions based on a hash of a column
metadata is actually stored in a RDBMS
A SQL-like query language over object data stored in Tables
Custom types, structs of primitive types
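The bucketing idea above - deterministically assigning a row to one of N buckets from the hash of a column - can be illustrated in plain Java (the exact hash function Hive uses is not shown here; this is just the mechanism):

```java
public class BucketSketch {

    // Assign a row to one of numBuckets based on the hash of a column value.
    // (Illustrative only; Hive's actual hashing may differ.)
    static int bucketFor(String columnValue, int numBuckets) {
        // Mask off the sign bit so the result is always in [0, numBuckets).
        return (columnValue.hashCode() & Integer.MAX_VALUE) % numBuckets;
    }

    public static void main(String[] args) {
        int numBuckets = 32;
        // The same userid always lands in the same bucket, so a query that
        // samples or joins on that column can read just one bucket's file
        // instead of scanning the whole partition.
        System.out.println(bucketFor("user_12345", numBuckets)
                == bucketFor("user_12345", numBuckets));  // true
    }
}
```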
Hive - Queries
Queries always write results into a table
FROM user
INSERT OVERWRITE TABLE user_active
SELECT user.*
WHERE user.active = true;

FROM pv_users
INSERT OVERWRITE TABLE pv_gender_sum
SELECT pv_users.gender, count_distinct(pv_users.userid)
GROUP BY pv_users.gender
INSERT OVERWRITE DIRECTORY '/user/facebook/tmp/pv_age_sum.txt'
SELECT pv_users.age, count_distinct(pv_users.userid)
GROUP BY pv_users.age;
Datameer Analytics Solution (DAS)
Photo by: Tambako the Jaguar, flickr
DAS - Big Spreadsheets
Commercial solution from Datameer
Similar to IBM BigSheets
GUI + workflow editor on top of Hadoop
Define data sources, operations (via spreadsheet), and data sinks
http://www.datameer.com
Summary
Photo by: Tambako the Jaguar, flickr
Copyright (c) 2008 Scale Unlimited, Inc. All Rights Reserved. Reproduction or distribution of this document in any form without prior written permission is forbidden.
Thinking at Scale with Hadoop
Understand the underlying Map-Reduce architecture
Model the problem as a workflow with operations on records
Consider using Cascading or (eventually) Plume
Use Pig/Hive to solve query-centric problems
Look at DAS or BigSheets for easier analytics solutions
Resources
Hadoop mailing lists - many, e.g. [email protected]
Hadoop: The Definitive Guide by Tom White (O’Reilly)
Hadoop Training
Scale Unlimited: http://scaleunlimited.com/courses
Cloudera: http://www.cloudera.com/hadoop-training/
Cascading: http://www.cascading.org
DAS: http://www.datameer.com
Questions?