modeling with hadoop kdd2011

Modeling with Hadoop

Algorithms

Outline

•  Why learn models in MapReduce framework?
•  Types of learning in MapReduce
•  Statistical Query Model (SQM)
•  SQM Algorithms in MapReduce
•  Sequential learning methods and MapReduce
•  Challenges and Enhancements
•  Apache Mahout

Why learn models in MapReduce?

•  High data throughput
   –  Stream about 100 Tb per hour using 500 mappers
•  Framework provides fault tolerance
   –  Monitors mappers and reducers and restarts tasks on other machines should one of the machines fail
•  Excels in counting patterns over data records
•  Built on relatively cheap, commodity hardware
   –  No special-purpose computing hardware
•  Large volumes of data are increasingly stored on Grid clusters running MapReduce
   –  Especially in the internet domain

Why learn models in MapReduce?

•  Learning can become limited by computation time and not data volume
   –  With large enough data and number of machines
   –  Reduces the need to down-sample data
   –  More accurate parameter estimates compared to learning on a single machine for the same amount of time

Learning models in MapReduce

•  A primer for learning models in MapReduce (MR)
   –  Illustrate techniques for distributing the learning algorithm in a MapReduce framework
   –  Focus on the mapper and reducer computations
•  Data parallel algorithms are most appropriate for MapReduce implementations
•  Not necessarily the optimal implementation for a specific algorithm
   –  Other specialized non-MapReduce implementations exist for some algorithms, which may be better
•  MR may not be the appropriate framework for exact solutions of non-data-parallel/sequential algorithms
   –  Approximate solutions using MapReduce may be good enough



Types of learning in MapReduce

•  Three common types of learning models using the MapReduce framework
   1.  Parallel training of multiple models
       –  Train either in mappers or reducers
   2.  Ensemble training methods
       –  Train multiple models and combine them
   3.  Distributed learning algorithms
       –  Learn using both mappers and reducers
•  Use the Grid as a large cluster of independent machines (with fault tolerance)

Parallel training of multiple models

•  Train multiple models simultaneously using a learning algorithm that can be learnt in memory
•  Useful when individual models are trained using a subset, a filtered version, or a modification of the raw data
•  Can train 1000s of models simultaneously
•  Essentially, treat the Grid as a large cluster of machines
   –  Leverage fault tolerance of Hadoop
•  Train 1 model in each reducer
   –  Map
      •  Input: All data
      •  Filters subset of data relevant for each model training
      •  Output: <model_index, subset of data for training this model>
   –  Reduce
      •  Train model on data corresponding to that model_index

Parallel training of multiple models

•  Train 1 model in each reducer

[Figure: mappers read data subgroups 1, 2, ..., N and emit <"model_1", Data>, <"model_2", Data>, ...; each reducer trains one model (Model_1, Model_2, ...) on the data received for its key.]

Parallel training of multiple models

•  Train 1 model in each mapper
•  All data is sent to each mapper (as a cache archive)
•  Mapper partition file determines the training configuration and labeling strategy
   –  e.g., training one-vs.-rest models in multi-class classification
   –  Can train 1000s of classes in parallel

[Figure: training data $\{x, (c_i, c_j, \ldots, c_k)\}$, with $c_i \in \{c_1, c_2, \ldots, c_M\}$, is sent to mappers Map_1, Map_2, ..., Map_M, which output $\mathrm{Model}(c_1)$, $\mathrm{Model}(c_2)$, ..., $\mathrm{Model}(c_M)$.]

Ensemble methods

•  Train 1 base model in each mapper on a data partition
•  Combine the base models using ensemble methods (primarily bagging) in the reducer
•  Strictly, bagging requires the data to be sampled with replacement
   –  However, if the data set is very large, sampling without replacement may be acceptable
•  Base models are typically decision trees, SVMs, etc.

Ensemble Methods: Random Subspace Bagging (RSBag)

•  Assume that the training data is partitioned randomly into blocks
   –  Class distributions are roughly the same across all blocks
•  Algorithm (Yan et al. 2007)
   –  Learn 1 base model per data sub-group
   –  Optionally, use a random subset of features to train each model
   –  Combine the multiple base models into a composite classifier as the final output:
      $F_c^{i}(x) = F_c^{i-1}(x) + h_c^{i}(x)$
   –  Base model: $h_c(x) = y$, $y \in \{-1, +1\}$

RSBag in MapReduce

[Figure: the data matrix (rows = DATA, columns = Features) is sampled into blocks; Map_1, Map_2, Map_3 and Map_4 train the base models $h_c^1(x)$, $h_c^2(x)$, $h_c^3(x)$ and $h_c^4(x)$, and the reducer combines the base models into the final classifier.]

RSBag in MapReduce

•  Provides coarse level parallelism at the level of base models
   –  Base models can be decision trees, SVMs, etc.
•  Speed-up with SVM base models
   –  $\text{Speedup} = \dfrac{1}{N r_d^2 r_f}$, where $r_d$, $r_f$ = data, feature sampling ratios
   –  $N = 5$, $r_d = 0.2$, $r_f = 0.5 \rightarrow \text{Speedup} = 10$
•  Can achieve performance similar to a single classifier, with a theoretical guarantee, in less learning time
   –  Upper bound on generalization error: $E^*(F_c) \le \dfrac{\bar{\rho}\,(1 - s^2)}{s^2}$
   –  Correlation between classifiers: $\bar{\rho} = E_{\theta, \theta'}\left[\rho\left(h_c(x, \theta), h_c(x, \theta')\right)\right]$
   –  Strength of classifier: $s = E_{x, y}\left[2 P_{\theta}\left(h_c(x, \theta) = y\right) - 1\right]$

Robust Subspace Bagging (RB-SBag)

•  Sometimes the base models may over-fit the training data
   –  Correlation between base models may be high
•  Add a forward selection step for models
   –  Iteratively add base models based on their performance on validation data (Yan et al. 2009)
•  Adds another MapReduce job
   –  Select the base models using forward selection based on performance metrics on a validation dataset $V_c$

RB-SBag in MapReduce

[Figure: mappers Map_1, Map_2, ..., Map_N apply the base models $h_c^1(x), h_c^2(x), \ldots, h_c^N(x)$ to the validation data and emit <"c", {$h_c$, Prediction($V_c$)}>. The reducer (1) performs forward selection of base models and (2) combines the base models into a composite classifier.]

COMET: Cloud of Massive Ensemble Trees

•  Similar to RSBag, but uses Importance-Sampled Voting (IVoting) in each base model
•  Samples are weighted with non-uniform probability
•  Each mapper creates a set of data to train on
•  Ensemble after k iterations = E(k)
   –  Add new sample to training set:
      •  Always, if E(k) incorrectly classifies the new sample
      •  With a lower probability, $e(k) / (1 - e(k))$, if E(k) correctly classifies the new sample; $e(k)$ = error on training dataset
•  Variant of Random Forests, in which IVoting generates the training samples instead of bagging
•  Use lazy evaluation during prediction

J.D. Basilico, M.A. Munson, T.G. Kolda, K.R. Dixon, W.P. Kegelmeyer, COMET: A Recipe for Learning and Using Large Ensembles on Massive Data, 2011, http://arxiv.org/PS_cache/arxiv/pdf/1103/1103.2068v1.pdf

Distributed learning algorithms

•  Use multiple mappers and reducers to learn 1 model
•  Suitable for learning algorithms that
   –  Have heavy computing per data record
   –  Need one or few iterations for learning
   –  Do not transfer much data between iterations
•  Typical algorithms
   –  Fit the Statistical Query Model (SQM)
      •  One/few iterations
         –  Linear regression, Naïve Bayes, k-means clustering, pair-wise similarity, etc.
      •  More iterations have high overheads, e.g.,
         –  SVM, Logistic regression, etc.
   –  Divide and conquer
      •  Frequent item-set mining, approximate matrix factorization, etc.



Statistical Query Model (SQM)

•  Learning algorithm can access the learning problem only through a statistical query oracle (Kearns 1998)
•  Given a function f(x,y) over data instances, the statistical query oracle returns an estimate of the expectation of f(x,y) (averaged over the data distribution)

Statistical Query Model (SQM)

[Figure: Raw Data Samples (X,Y), Statistics Oracle, Learning Algorithm; the learning algorithm queries the oracle with $f(x, y)$ and the oracle answers with $\sum_{\text{subgroup}} f(x, y)$ computed over the raw data.]

•  Learning algorithms that calculate sufficient statistics of data, gradients of a function, etc. fit this model
•  These calculations can be expressed in a “summation form” over subgroups of data (Chu et al. 2006)

SQM in MapReduce

•  Distribute the summation calculations over each data sub-group
•  Map:
   –  Calculate function estimates over sub-groups of data
•  Reduce:
   –  Aggregate the function estimates from various sub-groups
•  Learning algorithm should be able to work with these summaries alone

SQM in MapReduce

•  Assume algorithm depends on 2 functions f(x,y) and g(x,y)

[Figure: each mapper reads one data subgroup (1, 2, ..., N) and emits <"f", $\sum_{\text{subgroup}} f(x, y)$> and <"g", $\sum_{\text{subgroup}} g(x, y)$>; the reducer outputs $\sum_{N} \sum_{\text{subgroup}} f(x, y)$ and $\sum_{N} \sum_{\text{subgroup}} g(x, y)$.]
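The summation form above can be sketched in a few lines of plain Python. This is only an illustration of the data flow, not Hadoop code: the names `mapper`, `reducer`, `functions` and the toy query functions f(x,y) = x·y and g(x,y) = x² are assumptions made for this example.

```python
from collections import defaultdict

def mapper(subgroup, functions):
    """Emit one (name, partial sum) pair per query function for a data subgroup."""
    for name, fn in functions.items():
        yield name, sum(fn(x, y) for x, y in subgroup)

def reducer(pairs):
    """Add up the partial sums that share a key."""
    totals = defaultdict(float)
    for name, partial in pairs:
        totals[name] += partial
    return dict(totals)

# Hypothetical query functions and two data subgroups of (x, y) records.
functions = {"f": lambda x, y: x * y, "g": lambda x, y: x * x}
subgroups = [[(1.0, 2.0), (2.0, 1.0)], [(3.0, 3.0)]]

pairs = [kv for sg in subgroups for kv in mapper(sg, functions)]
totals = reducer(pairs)
print(totals)
```

The learning algorithm only ever sees `totals`, which is the point of the statistical query model: the raw records stay inside the mappers.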



Algorithms in MapReduce

•  Many common algorithms can be formulated in the SQM framework (Chu et al. 2006)
   –  Classification and Regression
      •  Linear Regression, Naïve Bayes, Logistic Regression, Support Vector Machine, Decision Trees
   –  Clustering
      •  K-means, Canopy clustering, Co-clustering
   –  Back-propagation neural network
   –  Expectation Maximization
   –  PCA
•  Recommendations and Frequent Itemset mining
•  Graph Algorithms

Classification and Regression algorithms in MapReduce

•  Linear Regression
•  Naïve Bayes
•  Logistic Regression
•  Support Vector Machine
•  Decision Trees

Linear regression

•  Data vector: $x_i = (x_{i1}, x_{i2}, \ldots, x_{in})^T$
•  Real valued target: $y_i$
•  Weight of data point: $w_i$
•  Data set of $m$ points: $\{(x, y, w)\}_m$
•  Model: $y = \theta^T x$
•  Solution, in summation form:
   $\theta^* = A^{-1} b$
   $A = \sum_{i=1}^{m} w_i \left(x_i x_i^T\right)$
   $b = \sum_{i=1}^{m} w_i \left(x_i y_i\right)$

Linear Regression in MapReduce

•  Map:
   –  Input: data $\{\text{Index}, (x, y, w)\}$ from a subgroup of data
   –  Output: 2 types of keys
      •  K1, for matrix A
         »  Value1 = n x n matrix
      •  K2, for vector b
         »  Value2 = n x 1 vector
•  Reducer:
   –  Aggregate the individual mapper outputs for each key
   –  Estimate $\theta^* = A^{-1} b$

Linear Regression in MapReduce

•  A: n x n matrix, b: n x 1 vector (n = number of features)

[Figure: each mapper reads one subgroup $\{(x, y, w)\}_1, \{(x, y, w)\}_2, \ldots, \{(x, y, w)\}_k$ and emits <"A", $\sum_{\text{subgroup}} w_i x_i x_i^T$> and <"b", $\sum_{\text{subgroup}} w_i x_i y_i$>; the reducer sums the partial A and b and computes $\theta^* = A^{-1} b$.]
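A minimal sketch of this computation in plain Python follows. It assumes two features (so the final solve is a 2 x 2 system handled by Cramer's rule) and a toy data set; the function names and the data are illustrative, not taken from the slides.

```python
def mapper(subgroup):
    """Partial sums over one subgroup of (x, y, w) records:
    A = sum of w * x x^T and b = sum of w * x * y."""
    n = len(subgroup[0][0])
    A = [[0.0] * n for _ in range(n)]
    b = [0.0] * n
    for x, y, w in subgroup:
        for j in range(n):
            b[j] += w * x[j] * y
            for k in range(n):
                A[j][k] += w * x[j] * x[k]
    return A, b

def reducer(partials):
    """Add the partial A and b, then solve the 2 x 2 system A theta = b."""
    A = [[sum(p[0][j][k] for p in partials) for k in range(2)] for j in range(2)]
    b = [sum(p[1][j] for p in partials) for j in range(2)]
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(A[1][1] * b[0] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]

# Toy data (assumption): y = 2 + 3 t, encoded as x = (1, t), unit weights.
records = [((1.0, float(t)), 2.0 + 3.0 * t, 1.0) for t in range(6)]
subgroups = [records[:3], records[3:]]
theta = reducer([mapper(sg) for sg in subgroups])
print(theta)
```

Only the n x n matrix and the n-vector travel from mappers to the reducer, so the shuffle size depends on the number of features and not on the number of records.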

Naïve Bayes

•  Input data: $x = (x_1, x_2, \ldots, x_n)$; domain of $x_j$: $x_j \in \{a_{j1}, a_{j2}, \ldots, a_{jP_j}\}$
•  Categorical target: $y \in \{c_1, c_2, \ldots, c_L\}$
•  Class prediction: $y^* = \arg\max_{y} P(y = c_k) \prod_j P(x_j = a_{jp} \mid y = c_k)$
•  Two types of sufficient statistics: $P(x_j = a_{jp} \mid y = c_k)$ and $P(y = c_k)$
   –  Sum counts over sub-groups

Naïve Bayes in MapReduce

•  Map
   –  Input: data $\{x, y\}$ from a subgroup of data
   –  Output: 3 types of keys
      •  key = $(x_j = a_{jp},\ y = c_k)$, value = $\sum_{\text{subgroup}} 1(x_j = a_{jp} \mid y = c_k)$
      •  key = $(y = c_k)$, value = $\sum_{\text{subgroup}} 1(y = c_k)$
      •  key = "samples", value = $\sum_{\text{subgroup}} 1$
•  Reduce
   –  Sum all the values of each key
   –  Compute the conditional and marginal probabilities
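The three kinds of counts can be sketched as follows. This is an illustrative plain-Python version, assuming categorical features, no smoothing, and made-up weather-style records; the key layout (`"samples"`, `("y", c)`, `("x", j, a, c)`) is an assumption for this example.

```python
from collections import Counter

def mapper(subgroup):
    """Count samples, class labels, and (feature value, class) pairs in one subgroup."""
    counts = Counter()
    for x, y in subgroup:
        counts["samples"] += 1
        counts[("y", y)] += 1
        for j, a in enumerate(x):
            counts[("x", j, a, y)] += 1
    return counts

def reducer(partials):
    """Sum all the values of each key."""
    total = Counter()
    for c in partials:
        total.update(c)
    return total

def predict(total, x, classes):
    """argmax over classes of P(y = c) * prod_j P(x_j = a | y = c), without smoothing."""
    def score(c):
        s = total[("y", c)] / total["samples"]
        for j, a in enumerate(x):
            s *= total[("x", j, a, c)] / total[("y", c)]
        return s
    return max(classes, key=score)

# Hypothetical toy data split into two subgroups.
subgroups = [
    [(("sunny", "hot"), "no"), (("rainy", "mild"), "yes")],
    [(("sunny", "mild"), "no"), (("rainy", "hot"), "yes")],
]
total = reducer([mapper(sg) for sg in subgroups])
label = predict(total, ("sunny", "hot"), ["no", "yes"])
print(label)
```

A production version would normally add smoothing so that an unseen feature value does not force a zero probability.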

Logistic Regression

•  Features: $x = (x_1, x_2, \ldots, x_n)$
•  Categorical target: $y \in \{0, 1\}$
•  Data: $\{(x, y)\}_m$
•  Conditional probability: $p = P(y = 1 \mid x, \theta) = \dfrac{1}{1 + \exp(-\theta^T x)}$
•  Equivalently, $\log\left(\dfrac{p}{1 - p}\right) = \theta^T x$
   –  Log odds is a linear function of the features

Logistic Regression

•  Estimate the parameters by maximizing the log conditional likelihood (LCL) of observed data:
   $LCL = \sum_{i : y_i = 1} \log p_i + \sum_{i : y_i = 0} \log(1 - p_i)$
•  Optimize using Newton-Raphson to update $\theta$:
   $\theta = \theta - H^{-1} \nabla_\theta LCL$
   –  Gradient: $\nabla_\theta LCL_j = \sum_i (y_i - p_i)\, x_{ij}$
   –  Hessian: $H = H_\theta(LCL)$, $H_{jk} = \sum_i p_i (p_i - 1)\, x_{ij} x_{ik}$
   –  $i \in [1, m]$; $j, k \in [1, n]$
•  Gradient and Hessian are in summation form

Logistic Regression in MapReduce

•  A control program sets up the MapReduce iterations
•  Map
   –  Input: $\{(x, y)\}$
   –  Output:
      •  key = g, value = $\left(j,\ \sum_{i \in \text{subgroup}} (y_i - p_i)\, x_{ij}\right)$
      •  key = h, value = $\left(j,\ k,\ \sum_{i \in \text{subgroup}} p_i (p_i - 1)\, x_{ij} x_{ik}\right)$
•  Reduce
   –  Aggregate the values of $\nabla_\theta LCL_j$, $H_{jk}$ from all mappers
   –  Compute $H^{-1} \nabla_\theta LCL$
   –  Update $\theta = \theta - H^{-1} \nabla_\theta LCL$
•  Stop when updates become small
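The iteration above can be sketched in plain Python. The sketch assumes two parameters (an intercept and one feature) so that the Newton step reduces to a 2 x 2 solve, and it uses a small, non-separable toy data set; names, data, and the stopping tolerance are assumptions for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mapper(subgroup, theta):
    """Partial gradient g_j and Hessian h_jk of the LCL over one subgroup."""
    n = len(theta)
    g = [0.0] * n
    h = [[0.0] * n for _ in range(n)]
    for x, y in subgroup:
        p = sigmoid(sum(t * xj for t, xj in zip(theta, x)))
        for j in range(n):
            g[j] += (y - p) * x[j]
            for k in range(n):
                h[j][k] += p * (p - 1.0) * x[j] * x[k]
    return g, h

def reducer(partials, theta):
    """Aggregate g and h, then apply theta <- theta - H^{-1} g (2 x 2 case only)."""
    g = [sum(p[0][j] for p in partials) for j in range(2)]
    h = [[sum(p[1][j][k] for p in partials) for k in range(2)] for j in range(2)]
    det = h[0][0] * h[1][1] - h[0][1] * h[1][0]
    step = [(h[1][1] * g[0] - h[0][1] * g[1]) / det,
            (h[0][0] * g[1] - h[1][0] * g[0]) / det]
    return [t - s for t, s in zip(theta, step)]

# Toy data (assumption): x = (1, t) with labels that are not linearly separable.
data = [((1.0, -2.0), 0), ((1.0, -1.0), 0), ((1.0, 0.0), 1),
        ((1.0, 1.0), 0), ((1.0, 2.0), 1), ((1.0, 3.0), 1)]
subgroups = [data[:3], data[3:]]

theta = [0.0, 0.0]
for _ in range(25):  # control program: one MapReduce pass per Newton step
    new_theta = reducer([mapper(sg, theta) for sg in subgroups], theta)
    converged = max(abs(a - b) for a, b in zip(new_theta, theta)) < 1e-10
    theta = new_theta
    if converged:
        break
print(theta)
```

Each pass ships n + n² numbers per mapper, which is why the slides count every additional iteration as overhead.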

Support Vector Machine

•  Features: $x \in \mathbb{R}^n$
•  Binary target: $y \in \{-1, +1\}$
•  Objective function in primal form:
   $\min_{w, b, \xi_i > 0} \|w\|^2 + C \sum_i \xi_i^p$
   s.t. $\forall i:\ y_i (w^T x_i + b) \ge (1 - \xi_i)$
   –  p=1 (hinge loss), p=2 (quadratic loss)
•  For quadratic loss, batch gradient descent to estimate $w$, in summation form:
   $\nabla_w G = 2w + 2C \sum_i (w \cdot x_i - y_i)\, x_i$

Support Vector Machine in MapReduce

•  Map
   –  Input: $\{(x, y)\}$
   –  Output: key = G, value = $G_w = 2w + 2C \sum_{\text{subgroup}} (w \cdot x_i - y_i)\, x_i$
•  Reduce
   –  Aggregate the values of gradient from all mappers
   –  Update $w^* = w - \eta \nabla_w G$
•  Driver program that sets up the iterations and checks for convergence

Decision Trees

•  Features: $x = (x_1, x_2, \ldots, x_n)$
•  Targets: $y \in \{0, 1\}$ or $y \in \mathbb{R}$
•  Data: $D = \{(x, y)\}_m$
•  Construct Tree
   –  Each node splits the data by feature value
   –  Start from root
      •  Select best feature, value to split the node
         –  Based on reduction in data impurity between the child and parent nodes
   –  Select the next child node
   –  Repeat the process till some stopping criterion
      •  Pure node, or data is below some threshold, etc.

Decision Trees

[Figure: decision tree construction, with the annotation "Expensive step for large datasets".]

B. Panda, J. S. Herbach, S. Basu, R. J. Bayardo, PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce, 2009, Proceedings of the VLDB Endowment (PVLDB), vol. 2, no. 2, pp. 1426-1437

PLANET for Decision Trees

•  Parallel Learner for Assembling Numerous Ensemble Trees (PLANET, Panda et al. 2009)
   –  Main idea is to use MapReduce to determine the best feature value splits for nodes from large datasets
•  Each intermediate node has a sub-set of all data falling into it
•  If this sub-set is small enough to fit in memory,
   –  Grow remaining sub-tree in memory
•  Else,
   –  Launch a MapReduce job to find candidate feature value splits
   –  Select the best feature split from among the candidates
•  5 main components
   1. Controller
      •  Monitors and controls the growth of tree
   2. Initialization Task
      •  Identifies all feature values to be considered for splits
   3. FindBestSplit Task
      •  Finds best split when there is too much data to fit in memory
   4. InMemoryGrow Task
      •  Grows an entire sub-tree once the data fits in memory
   5. Model File
      •  File describing the state of the model

PLANET for Decision Trees

[Figure: PLANET components, with the MapReduce tasks labelled.]

PLANET for Decision Trees

•  Maintain 2 queues
   –  MapReduceQueue (MRQ)
      •  Contains nodes for which data is too large to fit in memory
   –  InMemoryQueue (InMemQ)
      •  Contains nodes for which data fits in memory
•  2 main MapReduce jobs
   –  MR_ExpandNodes
      •  Process nodes from the MRQ to find best split
      •  Output for each node:
         –  Candidate split positions for node along with
            »  Quality of split (using summary statistics)
            »  Predictions in left and right branches
            »  Size of data going into left and right branches
   –  MR_InMemory
      •  Process nodes from the InMemQ
      •  For a given set of nodes N, complete tree induction at nodes in N using the InMemoryGrow algorithm

PLANET for Decision Trees

•  Map function in MR_ExpandNodes
   –  Load the current model file M and set of nodes N
   –  For each record
      •  Determine if record is relevant to any of the nodes in N
      •  Add record to the summary statistics (SS) for node
      •  For each feature-value in record
         –  Add record to the summary statistics for node for split points “s” less than the value in record “v”
   –  Output
      •  key = $(n \in N,\ x \in \text{Ordered-feature},\ s)$, value = $T_{n,x}[s]$ (the key is the Split ID; the value is the SS of the candidate split)
      •  key = $(n \in N,\ x \in \text{Categorical-feature})$, value = $(v,\ T_{n,x}[v])$
      •  key = $(n \in N)$, value = SS (SS of parent node)
      •  SS for variance impurity: $T_{n,x}[s] = SS = \left(\sum_{\text{subgroup}} y,\ \sum_{\text{subgroup}} y^2,\ \sum_{\text{subgroup}} 1\right)$

PLANET for Decision Trees

•  Reduce function in MR_ExpandNodes
   –  For each node
      •  Aggregate the summary statistics for that node
   –  For each split (which is node specific)
      •  Aggregate the summary statistics for that Split ID from all map outputs of summary statistics
      •  Compute impurity of data going into left and right branches
      •  Total impurity = Impurity in left branch + Impurity in right branch
      •  If Total impurity < Best split impurity so far
         –  Best split = Current split
   –  Output the best split found
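The split search for a single node and a single ordered feature can be sketched as below. This is a simplified illustration, not the PLANET implementation: it assumes the summary statistics are (Σy, Σy², count), that each candidate split s collects the records with feature value x < s into the left branch, and that impurity is the sum of squared deviations derived from those statistics.

```python
def add(a, b):
    return tuple(p + q for p, q in zip(a, b))

def sse(ss):
    """Variance impurity (sum of squared deviations) from (sum y, sum y^2, count)."""
    s, s2, n = ss
    return s2 - s * s / n if n else 0.0

def mapper(records, splits):
    """Per-subgroup statistics: SS of the parent node and left-branch SS per split."""
    parent = (0.0, 0.0, 0)
    left = {s: (0.0, 0.0, 0) for s in splits}
    for x, y in records:
        rec = (y, y * y, 1)
        parent = add(parent, rec)
        for s in splits:
            if x < s:
                left[s] = add(left[s], rec)
    return parent, left

def reducer(partials, splits):
    """Aggregate the statistics and return (best split, total impurity)."""
    parent = (0.0, 0.0, 0)
    left = {s: (0.0, 0.0, 0) for s in splits}
    for p, l in partials:
        parent = add(parent, p)
        for s in splits:
            left[s] = add(left[s], l[s])
    best = None
    for s in splits:
        right = tuple(p - q for p, q in zip(parent, left[s]))
        total = sse(left[s]) + sse(right)
        if best is None or total < best[1]:
            best = (s, total)
    return best

# Toy (x, y) records in two subgroups and hypothetical candidate split points.
subgroups = [[(1, 1.0), (2, 1.0), (7, 5.0)], [(3, 1.0), (8, 5.0), (9, 5.0)]]
splits = [2.5, 5.0, 7.5]
best = reducer([mapper(sg, splits) for sg in subgroups], splits)
print(best)
```

The right-branch statistics are obtained by subtracting the left-branch statistics from the parent's, so mappers only need to emit one side of each split.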

Clustering algorithms in MapReduce

•  k-means clustering
•  Canopy clustering
•  Co-clustering

k-means clustering

•  Choose k samples as initial cluster centroids
•  Iterate till convergence (one MapReduce job per iteration)
   –  Assign membership of each point to closest cluster
   –  Re-compute new cluster centroids using assigned members
•  Control program to
   –  Initialize the centroids
      •  random, initial clustering on sample, etc.
   –  Run the MapReduce iterations
   –  Determine stopping criterion

k-means clustering in MapReduce

•  Map
   –  Input data points: $x_1, x_2, \ldots, x_N$
   –  Input cluster centroids: $C = (c_1, c_2, \ldots, c_K)$
   –  Assign each data point to closest cluster
   –  Output: key = $c_i$, value = $\left(\sum_{\text{subgroup}} x_j \mid x_j \in c_i,\ \sum_{\text{subgroup}} 1 \mid x_j \in c_i\right)$
•  Reduce
   –  Compute new centroids for each cluster $c_i$:
      key = $c_i$, value = $\dfrac{\sum_{\text{key} = c_i} \sum_{\text{subgroup}} x_j \mid x_j \in c_i}{\sum_{\text{key} = c_i} \sum_{\text{subgroup}} 1 \mid x_j \in c_i}$
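One k-means iteration in this style can be sketched in plain Python. The sketch assumes squared Euclidean distance and keys clusters by their index in the centroid list; names and data are illustrative only.

```python
def closest(x, centroids):
    """Index of the centroid nearest to x under squared Euclidean distance."""
    return min(range(len(centroids)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(x, centroids[i])))

def mapper(subgroup, centroids):
    """Per-subgroup (vector sum, count) for each cluster index."""
    out = {}
    for x in subgroup:
        i = closest(x, centroids)
        s, n = out.get(i, ([0.0] * len(x), 0))
        out[i] = ([a + b for a, b in zip(s, x)], n + 1)
    return out

def reducer(partials, centroids):
    """New centroid = summed vectors / summed counts; empty clusters keep the old one."""
    new = list(centroids)
    for i in range(len(centroids)):
        sums = [p[i] for p in partials if i in p]
        if sums:
            n = sum(c for _, c in sums)
            total = [sum(col) for col in zip(*(s for s, _ in sums))]
            new[i] = tuple(v / n for v in total)
    return new

# Toy data: two subgroups of 2-D points and two initial centroids.
subgroups = [[(0, 0), (0, 2)], [(10, 10), (10, 12), (0, 1)]]
centroids = [(0.0, 0.0), (10.0, 10.0)]
new_centroids = reducer([mapper(sg, centroids) for sg in subgroups], centroids)
print(new_centroids)
```

A control program would call this repeatedly, feeding `new_centroids` back in until the centroids stop moving.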

Complexity of k-means clustering

•  Each point is compared with each cluster centroid
•  Complexity = $N \cdot K \cdot O(d)$, where $O(d)$ is the complexity of the distance metric
•  Typical Euclidean distance is not a cheap operation
•  Can reduce complexity using an initial canopy clustering to partition data cheaply
   –  Preliminary step to help reduce expensive distance calculations
   –  Group data into (possibly overlapping) canopies using a cheap distance metric (McCallum et al. 2000)
   –  Compute the distance metric between a point and a cluster centroid only if they share a canopy

Canopy clustering

•  Every point in the dataset is in a canopy
•  A point can belong to multiple canopies
•  Canopy size = T1
•  Algorithm
   –  Keep a list of canopies, initially an empty list
   –  Scan each data point:
      •  If it is within distance T2 < T1 of an existing canopy, discard it. Otherwise, add this point to the list of canopies
   –  Use a cheap distance metric to construct the canopies
      •  e.g. Manhattan distance, $L_\infty$
   –  Assign points to the closest canopy

A. McCallum, K. Nigam, L. Ungar. Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching, SIGKDD 2000

Canopy clustering

Image from: http://horicky.blogspot.com/2011/04/k-means-clustering-in-map-reduce.html

Canopy clustering in MapReduce

•  Map
   –  Input data points: $x_1, x_2, \ldots, x_N$
   –  If data point is not within distance $T_2$ of an existing candidate canopy, add it as a candidate canopy point
   –  Output: key = 1, value = $x_i \mid x_i \in$ candidate-canopy
•  Reduce
   –  Keep a list of final canopy points, initially an empty list
   –  If the canopy point is not within distance $0.5 \cdot T_2$ of an existing final canopy point, add it as a final canopy point
   –  Output: key = 1, value = $x_i \mid x_i \in$ final-canopy
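The two-stage canopy selection can be sketched as follows. This plain-Python version assumes Manhattan distance as the cheap metric and takes the mapper and reducer thresholds as parameters, since the right values depend on the data; all names and numbers are illustrative.

```python
def manhattan(a, b):
    return sum(abs(p - q) for p, q in zip(a, b))

def pick_canopies(points, threshold):
    """Keep a point as a canopy center unless it lies within `threshold`
    of a center that has already been kept."""
    centers = []
    for x in points:
        if all(manhattan(x, c) > threshold for c in centers):
            centers.append(x)
    return centers

def mapper(subgroup, threshold):
    """Candidate canopy points for one data subgroup."""
    return pick_canopies(subgroup, threshold)

def reducer(candidate_lists, threshold):
    """Final canopy points, chosen among the candidates from all mappers."""
    candidates = [c for lst in candidate_lists for c in lst]
    return pick_canopies(candidates, threshold)

# Toy data: two subgroups of 2-D points, hypothetical threshold of 3 in both stages.
subgroups = [[(0, 0), (1, 0), (10, 10)], [(0, 1), (11, 10)]]
candidates = [mapper(sg, 3) for sg in subgroups]
final = reducer(candidates, 3)
print(final)
```

Because every mapper emits under the same key, a single reducer sees all candidates; this is workable only because the candidate list is far smaller than the data.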

Canopy + k-means clustering

•  Final step in canopy clustering assigns all points to the closest final canopy point
   –  Map only operation
•  Speeding up k-means using canopy clustering
   –  Initial run of canopy clustering on the data (or on a sample of data)
      •  Pick canopy centers
      •  Assign points to canopies
   –  Pick initial k-means cluster centroids
      •  Run k-means iterations
         –  Compute distance between point and centroid only if they are in the same canopy

Co-clustering

•  Cluster pair-wise relationships in dyadic data
•  Simultaneously cluster both rows and columns, based on certain criteria
•  Identify sub-matrices of rows and columns that are inter-related
•  Commonly used in text mining, recommendation systems and graph mining

[Figure: example matrix with rows and columns re-arranged into blocks]

$\begin{pmatrix} 1 & 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 1 & 1 \\ 0 & 0 & 1 & 1 & 1 \end{pmatrix}$

Co-clustering

•  Given an $m \times n$ matrix
   –  Find group assignments of rows and columns such that the resulting sub-matrices are smooth (Papadimitriou & Sun, 2008)
   –  Assign rows and columns to clusters: $r \in \{1, 2, \ldots, k\}^m$, $c \in \{1, 2, \ldots, l\}^n$, $k < m$, $l < n$

Example:

$\begin{pmatrix} 0 & 1 & 0 & 1 & 1 \\ 1 & 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 1 & 1 \\ 1 & 0 & 1 & 0 & 0 \end{pmatrix}$, $r = (2\ 1\ 2\ 1)^T$, $c = (2\ 1\ 2\ 1\ 1)^T$

Co-clustering

•  Iteratively re-arrange rows and columns while an error function keeps reducing
•  Algorithm:
   –  Input: $A_{m \times n}$, $k$, $l$
   –  Initialize r and c
   –  Compute a group statistics/cost matrix $G_{k \times l}$
   –  While cost decreases
      •  For each row $i = 1 \ldots m$ do
         –  For each row group label $p = 1 \ldots k$ do
            »  $r(i) = p$, if cost decreases
      •  Update $G$, $r$
      •  Do the same for columns
   –  Return r and c

S. Papadimitriou, J. Sun, DisCo: Distributed Co-clustering with Map-Reduce, 2008, ICDM '08. Eighth IEEE International Conference on Data Mining, pp 512-521

Co-clustering in MapReduce

•  Assumptions
   –  Error can be computed using only $r, c, G$ (sufficient statistics)
   –  Row assignments can be based on $r, c, G, a_{i:}$ (greedy search)
•  Map:
   –  Cost matrix and column cluster assignments are in all mappers
   –  Input:
      •  Key = row index $i$
      •  Value = adjacency list for row, $a_{i:}$
   –  Compute:
      •  Row statistics for current column cluster assignment: $g_i(a_{i:}, c)$
      •  Assign row to the row cluster that has the lowest cost: $r(i) \in \{1 \ldots k\}$
   –  Output:
      •  key = $r(i)$ (row cluster label for row $i$)
      •  value = $(g_i, \{i\})$ (cost of cluster assignment, row $i$)

Co-clustering in MapReduce

•  Reduce
   –  For each row cluster label $p$, merge the rows and total cost:
      $g_p = \sum_{i : r(i) = p} g_i$, $I_p = \bigcup_{i : r(i) = p} \{i\}$
   –  Output: $(p, g_p, I_p)$ = (row cluster label, total cost, rows in this row cluster)
•  Collect the results for each row cluster
   –  For each reduce output: $G_{p:} = g_p$, $r(i) = p\ \ \forall i \in I_p$

Co-clustering in MapReduce: Example

•  Assume a row and column partitioning for the matrix

$\begin{pmatrix} 0 & 1 & 0 & 1 & 1 \\ 1 & 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 1 & 1 \\ 1 & 0 & 1 & 0 & 0 \end{pmatrix}$

$k = 2$, $l = 2$, $r = (1, 1, 1, 2)$, $c = (1, 1, 1, 2, 2)$

Cost function = number of non-zeros per group: $G = \begin{pmatrix} 4 & 4 \\ 2 & 0 \end{pmatrix}$

Map: Input: <2, {1, 3}>; Output: <r(2) = 2, (g_2 = (2, 0), {2})>

Reduce: Input: <2, ((2, 0), {2})>; Output: $g_2 \mathrel{+}= (2, 0)$, $I_2 = I_2 \cup \{2\}$

Result: $r = (1, 2, 1, 2)$, $G = \begin{pmatrix} 2 & 4 \\ 4 & 0 \end{pmatrix}$

Recommendations and Frequent Itemset mining

•  Item-based collaborative filtering
•  Pair-wise similarity
•  Low-rank matrix factorization
•  Frequent Itemset mining

Item-based collaborative filtering

•  Given a user-item ratings matrix, fill in the ratings of the missing items for each user
•  Infer missing ratings from available item ratings for the user, weighted by similarity between items:

$R(u, i) = \dfrac{\sum_{j : R(u, j) \ne ?} sim(i, j) \cdot R(u, j)}{\sum_{j : R(u, j) \ne ?} sim(i, j)}$

[Figure: USER x ITEM rating matrix with entries such as 5, 1, 4, ?, 2, 5, 4, 3, 2, where "?" marks a missing rating.]

Item-based collaborative filtering

•  Estimate similarity between items as the Pearson correlation of ratings from users who have rated both items:

$sim(i, j) = \dfrac{\sum_{U_{ij}} \left(R(u, i) - \bar{R}(i)\right)\left(R(u, j) - \bar{R}(j)\right)}{\sqrt{\sum_{U_{ij}} \left(R(u, i) - \bar{R}(i)\right)^2}\ \sqrt{\sum_{U_{ij}} \left(R(u, j) - \bar{R}(j)\right)^2}}$

$U_{ij} = \{u \mid R(u, i) \ne ?,\ R(u, j) \ne ?\}$

Item-based collaborative filtering using MapReduce

•  Map
   –  Input: key = $u$, value = $\{(i, R(i)) \mid R(i) \ne ?\}$
   –  Output: ratings for item pairs: key = $(i, j)$, value = $(R(i), R(j))$
•  Reduce
   –  Input: key = $(i, j)$, value = $[(R(i), R(j))]$
   –  Output: key = $(i, j)$, value = $sim(i, j)$
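The item-pair job can be sketched in plain Python as below. Two choices here are assumptions made for the example: item pairs are emitted once in sorted order, and the item means in the Pearson correlation are taken over the users who rated both items. User and item names are made up.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt

def mapper(user, ratings):
    """ratings: dict item -> rating for one user.
    Emit ((i, j), (R_i, R_j)) for every pair of items the user rated."""
    for i, j in combinations(sorted(ratings), 2):
        yield (i, j), (ratings[i], ratings[j])

def reducer(pair_ratings):
    """Pearson correlation over the users who rated both items."""
    n = len(pair_ratings)
    mi = sum(a for a, _ in pair_ratings) / n
    mj = sum(b for _, b in pair_ratings) / n
    cov = sum((a - mi) * (b - mj) for a, b in pair_ratings)
    vi = sum((a - mi) ** 2 for a, _ in pair_ratings)
    vj = sum((b - mj) ** 2 for _, b in pair_ratings)
    return cov / sqrt(vi * vj) if vi and vj else 0.0

# Hypothetical ratings; missing items are simply absent from a user's dict.
users = {
    "u1": {"A": 5, "B": 1, "C": 4},
    "u2": {"A": 4, "B": 2},
    "u3": {"A": 1, "B": 5, "C": 2},
}
grouped = defaultdict(list)  # stands in for the shuffle phase
for u, ratings in users.items():
    for key, value in mapper(u, ratings):
        grouped[key].append(value)
sim = {key: reducer(values) for key, values in grouped.items()}
print(sim)
```

The number of emitted pairs grows quadratically with the number of items a user has rated, so very active users dominate the shuffle.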

Pair-wise Similarity

•  Compute similarity between pairs of documents in a corpus:
   $S(d_i, d_j) = \sum_{t \in V} w_{t, d_i} \cdot w_{t, d_j} = \sum_{t \in d_i \cap d_j} w_{t, d_i} \cdot w_{t, d_j}$
•  Generate a postings list for each $t \in V$:
   $P(t) = \{(d_i, w_{t, d_i}) \mid w_{t, d_i} > 0\}$
   –  This is an easy map-reduce job

Pair-wise Similarity in MapReduce

•  Generating a postings list of inverted index
   –  Map
      •  Input $d_i$
      •  For each $t \in d_i$, emit $\{t, (d_i, w_{t, d_i})\}$
   –  Reduce
      •  Emit $\{t, [(d_i, w_{t, d_i})]\}$
•  Map
   –  Input term postings list $t, P(t)$
   –  Take the Cartesian product of the postings list with itself
      •  For each pair of $(d_i, d_j) \in P(t)$, emit <$(i, j)$, $sim(i, j) = w_{t, d_i} \cdot w_{t, d_j}$>
•  Reduce
   –  For each key = $(i, j)$: $Sim(d_i, d_j) = \sum sim(i, j)$
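Both jobs can be sketched in plain Python. As a simplification, and as an assumption of this example, the mapper enumerates each unordered pair of distinct documents once instead of the full Cartesian product, which skips self-pairs and mirrored duplicates; document names and weights are made up.

```python
from collections import defaultdict
from itertools import combinations

def build_postings(docs):
    """docs: dict doc_id -> dict term -> weight.
    Returns term -> [(doc_id, weight)] with positive weights only."""
    postings = defaultdict(list)
    for d, weights in docs.items():
        for t, w in weights.items():
            if w > 0:
                postings[t].append((d, w))
    return postings

def mapper(term, posting):
    """Emit a partial score for every pair of documents sharing this term."""
    for (di, wi), (dj, wj) in combinations(sorted(posting), 2):
        yield (di, dj), wi * wj

def reducer(pairs):
    """Sum the partial scores per document pair."""
    sims = defaultdict(float)
    for key, partial in pairs:
        sims[key] += partial
    return dict(sims)

# Hypothetical term weights for three documents.
docs = {"d1": {"a": 1, "b": 2}, "d2": {"a": 3, "c": 1}, "d3": {"b": 1, "c": 2}}
postings = build_postings(docs)
pairs = [kv for t, p in postings.items() for kv in mapper(t, p)]
sims = reducer(pairs)
print(sims)
```

A term that occurs in many documents produces a quadratic number of pairs, which motivates the block-based variant on the following slide.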


•  Cartesian product of postings list with itself may produce a large set of intermediate keys
•  Modify the above algorithm as follows
   –  Split the corpus into blocks of documents and query against postings list
   –  Map
      •  Input term postings list $t, P(t)$
      •  Load blocks of documents in memory
      •  For each document $d_i$ in block
         –  If $t \in d_i$, compute partial score for each element
   –  Reduce
      •  For each document, aggregate the partial scores from mappers for all other documents
•  Can reduce intermediate keys by implementing term limits when documents are loaded into memory

Low-rank matrix factorizations

•  Useful for analyzing patterns in dyadic data
•  Given an application dependent loss function, find $V_{m \times n} \approx W_{m \times d} H_{d \times n}$, $d \ll \min(m, n)$:
   $\arg\min_{W, H} L(V, W, H)$
•  Most loss functions are sums of local losses:
   $L = \sum_{(i, j) \in Z} l(V_{ij}, W_{i*}, H_{*j})$
•  Use stochastic gradient descent (SGD) for this factorization

SGD for matrix factorization

Training set $Z = \{(i, j) \mid V_{ij} \ne ?\}$, initial values $W_0$, $H_0$
While not converged, do
   Select a training point $(i, j) \in Z$ uniformly at random
   $W'_{i*} = W_{i*} - \epsilon_n N \dfrac{\partial}{\partial W_{i*}} L_{ij}(W, H)$
   $H_{*j} = H_{*j} - \epsilon_n N \dfrac{\partial}{\partial H_{*j}} L_{ij}(W, H)$
   $W_{i*} = W'_{i*}$
end while

For local losses, the updates depend only on $l(V_{ij}, W_{i*}, H_{*j})$:

$W'_{i*} = W_{i*} - \epsilon_n N \dfrac{\partial}{\partial W_{i*}} l(V_{ij}, W_{i*}, H_{*j})$
$H_{*j} = H_{*j} - \epsilon_n N \dfrac{\partial}{\partial H_{*j}} l(V_{ij}, W_{i*}, H_{*j})$

R. Gemulla, P.J. Haas, E. Nijkamp, Y. Sismanis, IBM Tech Report, 2011

SGD for matrix factorization in MapReduce

•  Main ideas
   –  Local loss depends only on $V_{ij}, W_{i*}, H_{*j}$
   –  If sub-matrices do not share rows and columns, they can be factored independently and factors combined
   –  Stratify the input matrix such that each stratum can be processed in a distributed manner

For a block-diagonal training matrix:

$Z = \begin{pmatrix} Z^1 & 0 & \cdots & 0 \\ 0 & Z^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & Z^d \end{pmatrix}$, $W = \begin{pmatrix} W^1 \\ W^2 \\ \vdots \\ W^d \end{pmatrix}$, $H = \left(H^1\ H^2\ \cdots\ H^d\right)$

$Z^b \approx W^b H^b$


•  Stratify the input matrix (dropping missing values) into subsets $Z_s^1, Z_s^2, \ldots, Z_s^d$ such that
   $i \ne i',\ j \ne j'\ \ \forall (i, j) \in Z_s^{b_1},\ (i', j') \in Z_s^{b_2}\ (b_1 \ne b_2)$
•  Stratification
   –  Randomly permute the rows and columns of the input matrix
   –  For a permutation $j_1, j_2, \ldots, j_d$ of $1 \ldots d$: $Z_s = Z^{1 j_1} \cup Z^{2 j_2} \cup \cdots \cup Z^{d j_d}$

[Figure: the matrix $V$, with entries $V_{11} \ldots V_{mn}$, partitioned into blocks of $m/d$ rows and $n/d$ columns; block $Z^{11}$ is highlighted.]



Training set $Z$, initial values $W_0$, $H_0$, cluster size $d$
$W = W_0$, $H = H_0$
Block $Z$ / $W$ / $H$ into $d \times d$ / $d \times 1$ / $1 \times d$ blocks
While not converged, do   (epochs)
   Pick step size $\epsilon$
   For $s = 1 \ldots d$ do   (sub-epoch)
      Pick $d$ blocks $\{Z^{1 j_1}, Z^{2 j_2}, \ldots, Z^{d j_d}\}$ to form a stratum $Z_s$
      For $b = 1 \ldots d$ do   (machines)
         Run SGD on points in $Z^{b j_b}$ with step size $\epsilon$
      end for
   end for
end while
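The stratum selection in the loop above can be illustrated with a few lines of Python. The source only asks for a permutation of the column blocks in each sub-epoch; the cyclic shift used here is one simple choice and is an assumption of this sketch, as are the function names.

```python
def strata(d):
    """Yield d strata for a d x d blocking. Stratum s pairs row-block b
    with column-block (b + s) % d, so blocks within a stratum share
    no rows and no columns."""
    for s in range(d):
        yield [(b, (b + s) % d) for b in range(d)]

def block_of(i, j, m, n, d):
    """Block coordinates of matrix entry (i, j) for an m x n matrix cut into d x d blocks."""
    return (i * d // m, j * d // n)

all_strata = list(strata(3))
for stratum in all_strata:
    print(stratum)
```

Within one stratum the d blocks can be handed to d machines and processed independently, because no two blocks touch the same rows of W or columns of H.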


Frequent Itemset Mining

•  Given a set of items $I = \{a_1, a_2, \ldots, a_M\}$ and a transaction database $D = \{T_1, T_2, \ldots, T_N\}$, where each $T_i$ is a subset of $I$
•  Pattern $A \subseteq I$ is frequent if $\text{support}(A) \ge \zeta$
•  Problem
   –  Find the complete set of frequent item-sets of $D$
•  Divide and conquer approach
   –  Patterns containing A can be found using only transactions containing A
   –  Filter transactions with A: conditional database (CDB) of A
   –  Find patterns containing A in CDB(A)

Frequent Itemset Mining

•  Construct a Frequent Pattern (FP) Tree
   –  Keep only items with frequency above the minimum support
   –  Sort each transaction in descending order of frequent items
   –  Add each sorted transaction to an item prefix tree
   –  Each node in the FP tree is an item
      •  Node has count of transactions with that item in that path
      •  Nodes of same items in different paths are linked together
•  FPGrowth algorithm
   –  Start from CDB of single frequent item
   –  Build FP Tree of CDB
   –  Mine frequent patterns from CDBs using recursion
      •  Recursion terminates when CDB has a single path
      •  Frequent pattern = Union of all nodes in this tree, with support = min. support of nodes in this tree

Frequent Itemset Mining

Original transactions:
   f a c d g i m p
   a b c f l m o
   b f h j o
   b c k s p
   a f c e l p m n

Frequent items: f:4 c:4 a:3 b:3 m:3 p:3
(Infrequent: l:2 o:2 d:1 e:1 g:1 h:1 i:1 j:1 k:1 n:1 s:1)

Sorted transactions:
   f c a m p
   f c a b m
   f b
   c b p
   f c a m p

Conditional databases of frequent items:
   p: { f c a m / f c a m / c b }
   m: { f c a / f c a / f c a b }
   b: { f c a / f / c }
   a: { f c / f c / f c }
   c: { f / f / f }
   f: {}

Frequent Itemset Mining in MapReduce

•  Identifying frequent items = 1 MapReduce job
   –  Find the set of items and the associated frequency
•  Prune this frequent items list, keeping only items more frequent than minimum support
•  Mine subsequent projected CDBs in MapReduce iterations (Li et al. 2008)
   –  Project transactions in CDB by least frequent item in the mapper
   –  Breadth first search of the FP Tree using a MapReduce iteration
   –  Once projected CDB fits in memory of reducer
      •  Run FPGrowth algorithm in reducer
      •  No more growth of the sub-tree

Frequent Itemset Mining in MapReduce

[Figure: breadth-first projection of the database D over three MapReduce iterations. MR Iteration 1: D is projected into D|p, D|m, D|b, D|a, D|c. MR Iteration 2: D|am, D|cm, D|ca. MR Iteration 3: D|cam.]

•  m: { f c a / f c a / f c a b }; am: { f c / f c / f c }; cm: { f / f / f }; cam: { f / f / f }; fcam: {}
   –  Patterns: mf:3, mc:3, ma:3, mfc:3, mfa:3, mca:3, mfca:3
•  a: { f c / f c / f c }; ca: { f / f / f }; fa: {}
   –  Patterns: af:3, ac:3, afc:3
•  p: { f c a m / f c a m / c b }; pc: {}
   –  Patterns: pc:3
•  b: { f c a / f / c }
•  c: { f / f / f }; fc: {}
   –  Patterns: cf:3

Graph Algorithms

•  Ubiquitous in web applications
   –  Web-graph, Social network graph, User-item graph
•  Typical problems
   –  Popularity (e.g. PageRank)
   –  Shortest paths
   –  Clustering, semi-clustering, etc.

Graph algorithms in MapReduce

•  Vertex centric approach
   –  Work with the adjacency list of each vertex
   –  Especially useful for sparse adjacency matrices
•  Breadth first search
   –  Each MR iteration advances the horizon by one level
•  In each iteration
   –  Compute on each vertex
   –  Pass values to connected vertices for aggregation in the reducer
   –  Pass the adjacency list of each node to the reducer

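Breadth first search on Graphs in MapReduce

[Figure: breadth first search from a source vertex (level 1); MR Iteration 1 reaches the level-2 vertices and MR Iteration 2 reaches the level-3 vertices.]

•  Easy (iterative) implementations exist for some common algorithms
   –  Single source shortest path
   –  PageRank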

Single source shortest path in MapReduce

•  Find the shortest path from a given node to any reachable node
•  Given a start node:
   –  Distance to adjacent nodes = 1
   –  Distance to any other node reachable from a set of nodes S:
      DistanceTo(n) = 1 + min(DistanceTo(m), m ∈ S)
•  Map
   –  Input:
      •  Node “n”
      •  D, Adjacency list of “n”
   –  Output:
      •  For each node “p” in adjacency list
         –  <p, (D+1)>
      •  <n, Adjacency list of “n”>
•  Reduce
   –  Input:
      •  “p”, “D+1” from all nodes pointing to “p”
      •  “n”, Adjacency list of “n”
   –  Output:
      •  “p”, min(“D+1” from all nodes pointing to “p”)
      •  “n”, Adjacency list of “n”
•  The adjacency lists pass the graph from one iteration to the next
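The iteration can be sketched in plain Python as below. One detail goes beyond the slide and is an assumption of this sketch: each node also re-emits its own current distance, so that the reducer's minimum never loses an already known value. The graph is a made-up example with unit edge lengths.

```python
INF = float("inf")

def mapper(node, dist, adjacency):
    """Emit the graph structure, the node's own distance, and D+1 to each neighbour."""
    yield node, ("graph", adjacency)
    yield node, ("dist", dist)
    if dist < INF:
        for p in adjacency:
            yield p, ("dist", dist + 1)

def reducer(values):
    """Keep the adjacency list and the minimum distance seen for this node."""
    adjacency, best = [], INF
    for kind, v in values:
        if kind == "graph":
            adjacency = v
        else:
            best = min(best, v)
    return best, adjacency

def iterate(state):
    """One MapReduce pass over state: dict node -> (distance, adjacency list)."""
    grouped = {}
    for n, (d, adj) in state.items():
        for key, val in mapper(n, d, adj):
            grouped.setdefault(key, []).append(val)
    return {n: reducer(vals) for n, vals in grouped.items()}

# Hypothetical graph with start node "a".
state = {"a": (0, ["b", "c"]), "b": (INF, ["d"]), "c": (INF, ["d"]), "d": (INF, [])}
while True:  # driver program: stop when no distance changes
    new_state = iterate(state)
    if new_state == state:
        break
    state = new_state
distances = {n: d for n, (d, _) in state.items()}
print(distances)
```

The driver needs one pass per level of the graph plus one more to detect that nothing changed, which is the overhead the later slides on graph frameworks refer to.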

PageRank

•  Given a node A:
   $PR(A) = d + (1 - d) \cdot \sum_{T_i : T_i \to A} \dfrac{PR(T_i)}{C(T_i)}$
   –  $d$ = random jump probability
   –  $T_i$ = node pointing to $A$
   –  $C(T_i)$ = out-degree of $T_i$
•  Iterate this equation till convergence
•  Driver program to check if the page rank for each node has converged

PageRank in MapReduce

•  In each iteration (i)
•  Map
   –  Input:
      •  Node “n”, PR_{i-1}(n)
      •  Adjacency list of “n”
   –  Compute
      •  V = PR_{i-1}(n) / |Adjacency list of n|
   –  Output:
      •  For each node “p” in adjacency list
         –  <p, V>
      •  <n, Adjacency list of “n”>
•  Reduce
   –  Input:
      •  <“p”, V from all nodes “n” pointing to “p”>
      •  Adjacency list of “n”
   –  Compute
      •  PR_i(p) = Sum(V)
   –  Output:
      •  <p, PR_i(p)>
      •  <n, Adjacency list of “n”>
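A plain-Python sketch of the iteration follows. It applies the random jump term from the PageRank equation, PR = d + (1 - d)·Sum(V), in the reducer; the value d = 0.15, the three-node graph, and the fixed number of iterations are assumptions for this example, and every node is assumed to have at least one outgoing edge.

```python
def mapper(node, rank, adjacency):
    """Emit the graph structure and an equal share of this node's rank to each neighbour."""
    yield node, ("graph", adjacency)
    for p in adjacency:
        yield p, ("rank", rank / len(adjacency))

def reducer(values, d):
    """Sum the incoming shares V and apply PR = d + (1 - d) * Sum(V)."""
    adjacency, total = [], 0.0
    for kind, v in values:
        if kind == "graph":
            adjacency = v
        else:
            total += v
    return d + (1 - d) * total, adjacency

def iterate(state, d=0.15):
    """One MapReduce pass over state: dict node -> (rank, adjacency list)."""
    grouped = {}
    for n, (r, adj) in state.items():
        for key, val in mapper(n, r, adj):
            grouped.setdefault(key, []).append(val)
    return {n: reducer(vals, d) for n, vals in grouped.items()}

# Hypothetical graph: a -> b, a -> c, b -> c, c -> a.
state = {"a": (1.0, ["b", "c"]), "b": (1.0, ["c"]), "c": (1.0, ["a"])}
for _ in range(30):  # driver program; a real driver would test for convergence
    state = iterate(state)
ranks = {n: r for n, (r, _) in state.items()}
print(ranks)
```

With this unnormalized form of the equation the ranks sum to the number of nodes, as long as every node has outgoing edges.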

Frameworks for graph algorithms

•  MapReduce is not a good fit for graph algorithms
  –  1 iteration for each level of the graph has large overheads
•  “Bulk synchronous parallel” (BSP) model for graph processing
  –  Components – for either compute or storage
  –  Router – to deliver point to point messages
  –  Synchronization at periodic intervals (called supersteps) that are atomic
•  In each superstep, a vertex can
  –  Receive messages sent by other vertices in the previous superstep
  –  Compute using the data in that vertex and the received messages
  –  Send messages to other vertices

Frameworks for graph algorithms
•  A vertex can vote to go to the halt state
•  Computation stops when all vertices have voted to halt
•  Vertices can also mutate the graph
  –  Add/remove edges and other vertices
  –  Mutations are implemented in the next superstep
•  Framework also supports aggregators
  –  Can maintain global summaries over the graph
  –  Values communicated to all vertices before the next superstep
•  Large scale graph processing tools leveraging Grid
  –  Pregel (in Google)
  –  Open source implementation Giraph: https://github.com/aching/Giraph
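To make the superstep model concrete, here is a small single-process Python sketch that propagates the maximum vertex value through a graph. It is not the Pregel or Giraph API; the function name and data layout are assumptions chosen for illustration. Every vertex votes to halt at the end of each superstep and is reactivated only when a message arrives.

```python
def run_supersteps(graph, values, max_supersteps=50):
    """graph: {vertex: [out-neighbors]}; values: {vertex: initial value}."""
    values = dict(values)
    active = set(graph)                      # all vertices start active
    inbox = {v: [] for v in graph}
    superstep = 0
    while superstep < max_supersteps:
        # A halted vertex is reactivated when a message arrives for it.
        running = active | {v for v, msgs in inbox.items() if msgs}
        if not running:
            break                            # all vertices halted, no messages in flight
        outbox = {v: [] for v in graph}
        for v in running:
            new_value = max([values[v]] + inbox[v])
            changed = new_value > values[v]
            values[v] = new_value
            if superstep == 0 or changed:
                for u in graph[v]:           # delivered in the next superstep
                    outbox[u].append(new_value)
        active = set()                       # every vertex votes to halt
        inbox = outbox                       # synchronization barrier
        superstep += 1
    return values, superstep
```

Messages written to `outbox` only become visible after the barrier, which mirrors the rule that a vertex receives messages sent in the previous superstep.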

Outline


Sequential learning methods
•  Some learning algorithms are inherently sequential in nature, e.g.,
  –  Stochastic Gradient Descent (SGD) minimization
  –  Conditional Maximum Entropy using SGD
  –  Perceptron
•  Difficult to distribute sequential algorithms over data partitions
  –  Need frequent communication of intermediate parameter values
•  Some sequential algorithms can be trained in a cluster environment
  –  Theoretical and empirical analyses show that parameters converge to the values from sequential training over all data

Sequential learning methods in MapReduce

•  Types of sequential learning in MapReduce
  –  Single M/R job:
    •  Learn parameters on each data partition in mappers over multiple epochs
    •  Average the model parameters from all mappers in a reducer
  –  Multiple M/R jobs:
    •  Learn parameters on each data partition in each mapper for 1 epoch
    •  Average the model parameters from all mappers in a reducer
    •  Start the next iteration for the next epoch in the mapper with the average parameter values from the previous iteration
  –  Communicate between nodes
    •  Launch MPI on Hadoop cluster

Stochastic Gradient Descent (SGD) methods

•  Many learning algorithms involve optimizing an objective function (maximizing log likelihood, minimizing root mean square error etc.) over the training data to determine the optimal parameters

•  Stochastic Gradient techniques update the parameter one example at a time

•  Parameter updates are inherently sequential

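Objective:

$$w^* = \arg\min_w \sum_{i \in \text{training data}} L(x_i, y_i, w)$$

Gradient update over all training data:

$$w = w - \eta * \sum_{i \in \text{training data}} \nabla_w L(x_i, y_i, w)$$

Stochastic gradient update, one example at a time:

$$w = w - \eta * \nabla_w L(x_i, y_i, w)$$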

Parallelized SGD
•  Partition the training data into multiple partitions, each with examples chosen at random
•  Perform stochastic gradient updates on each data partition separately with constant learning rate
•  Average the solutions between different machines
•  For large scale data, Zinkevich et al. 2010 show that
  –  Parameter values converge to sequential estimates
  –  Averaging the parameters reduces variance by $O(k^{-1/2})$
  –  Bias in parameter estimates decreases as well

Parallelized SGD in MapReduce

Map: In each mapper $i \in 1 \ldots k$
  $w_{i,0} = 0$
  For $t = 1 \ldots T$
    $w_{i,t} = w_{i,(t-1)} - \eta * \nabla_w L(x_t, y_t, w_{i,t-1})$
  end for
end for

Reduce: Aggregate from all mappers:

$$v = \frac{1}{k} \sum_{i=1}^{k} w_{i,t}$$

[Figure: data partitioned across machines; parameters are averaged across all machines]
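The map and reduce steps above can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the loss is squared error for a linear model, each mapper makes several passes over its partition (so T is the number of passes times the partition size), and the function names are hypothetical.

```python
import random

def sgd_mapper(partition, eta, epochs, dim):
    """Run sequential SGD on one partition, starting from w = 0."""
    w = [0.0] * dim
    for _ in range(epochs):
        for x, y in partition:
            # Gradient of 0.5 * (w.x - y)^2 with respect to w is (w.x - y) * x
            err = sum(wj * xj for wj, xj in zip(w, x)) - y
            w = [wj - eta * err * xj for wj, xj in zip(w, x)]
    return w

def average_reducer(weight_vectors):
    """Average the weight vectors returned by all mappers."""
    k = len(weight_vectors)
    return [sum(ws) / k for ws in zip(*weight_vectors)]

def parallel_sgd(data, k, eta=0.1, epochs=50, seed=0):
    """Randomly partition the data over k simulated mappers, then average."""
    rng = random.Random(seed)
    data = list(data)
    rng.shuffle(data)
    partitions = [data[i::k] for i in range(k)]
    dim = len(data[0][0])
    return average_reducer([sgd_mapper(p, eta, epochs, dim) for p in partitions])
```

Only one weight vector per mapper crosses the network, so the whole procedure fits in a single MapReduce job.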

Parallelized SGD in MapReduce
•  Multi-pass parallel SGD (Weimer, Rao, Zinkevich 2010)
  –  Divide the data randomly among all machines: $c_t^j$ = $t$-th example sent to machine $j$
  –  Initialize weight vector $w^*$
  –  For iterations $i \in \{1 \ldots T\}$ do
    •  For each machine $j \in \{1 \ldots k\}$ do
         $w^j = w^*$
         Shuffle data uniformly at random: $p : \{1 \ldots m'\} \to \{1 \ldots m'\}$
         For each $t \in \{1 \ldots m'\}$ do
           $w^j = w^j - \eta \nabla c^j_{p(t)}(w^j)$
         end for
       end for
       $w^* = \frac{1}{k} \sum_{j=1}^{k} w^j$
     end for

[Figure: iterations × machines × data grid; parameters are averaged across all machines in each iteration, and the average is the initial value for the next iteration]
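A hedged Python sketch of the multi-pass scheme above, again using squared error for a linear model as an assumed loss. Each outer iteration corresponds to one MapReduce job: every simulated machine starts from the current average, makes one shuffled pass over its data, and the results are averaged. Function names are illustrative.

```python
import random

def grad_step(w, x, y, eta):
    """One SGD update for the loss 0.5 * (w.x - y)^2."""
    err = sum(wj * xj for wj, xj in zip(w, x)) - y
    return [wj - eta * err * xj for wj, xj in zip(w, x)]

def multipass_parallel_sgd(data, k, iterations, eta=0.1, seed=0):
    rng = random.Random(seed)
    data = list(data)
    rng.shuffle(data)                         # divide the data randomly
    partitions = [data[j::k] for j in range(k)]
    w_star = [0.0] * len(data[0][0])          # initial weight vector
    for _ in range(iterations):
        local = []
        for part in partitions:               # one simulated machine each
            w_j = list(w_star)                # start from the current average
            order = list(range(len(part)))
            rng.shuffle(order)                # shuffle the local data
            for t in order:
                x, y = part[t]
                w_j = grad_step(w_j, x, y, eta)
            local.append(w_j)
        # Average across machines; this is the start value for the next pass
        w_star = [sum(ws) / k for ws in zip(*local)]
    return w_star
```

Compared with the single-job variant, this version communicates parameters once per pass, trading extra job overhead for more frequent synchronization.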

Conditional MaxEnt models
•  Used in both binary and multi-class classification problems
•  Commonly used in NLP and computer vision

Training data:

$$S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$$

Model, where $\phi(x, y)$ is the feature vector:

$$p_w(y \mid x) = \frac{1}{Z(x)} \exp\left(w \cdot \phi(x, y)\right), \qquad Z(x) = \sum_{y \in Y} \exp\left(w \cdot \phi(x, y)\right)$$

Training objective:

$$w = \arg\min_w F_S(w) = \arg\min_w \; \lambda \|w\|^2 - \frac{1}{m} \sum_{i=1}^{m} \log p_w(y_i \mid x_i)$$

Prediction:

$$\hat{y} = \arg\max_y \; p_w(y \mid x)$$

Conditional MaxEnt in MapReduce
•  Mixture weighting method (Mann et al. 2009)
  –  Train a model in each of M mappers using standard gradient descent on a subsample of the data
  –  Average the weights from all the mappers in 1 reducer
  –  Mann et al. (2009) show that the mixture weighting estimate converges to the sequential estimate

$k$-th mapper, trained on subsample $S_k$:
  $w_k = 0$
  for $t = 1 \ldots T$ do
    $w_k = w_k - \eta * \nabla_w F_{S_k}(w_k)$
  return $w_k$

Reducer:

$$w_\mu = \sum_{k=1}^{M} \mu_k w_k, \qquad \mu_k \geq 0, \quad \sum_{k=1}^{M} \mu_k = 1$$

Perceptron algorithm
•  Online algorithm used in NLP for structure prediction, e.g.,
  –  Parsing, Named entity recognition, Machine translation etc.

Perceptron($D = \{(x_i, y_i)\}$)
  $w^{(0)} = 0;\ k = 0$
  for $n = 1 \ldots N$   (N epochs)
    for $t = 1 \ldots |D|$   (pass over the data)
      $y' = \arg\max_{y'} w^{(k)} \cdot f(x_t, y')$   (predict using current weights)
      if ($y' \neq y_t$)
        $w^{(k+1)} = w^{(k)} + f(x_t, y_t) - f(x_t, y')$   (add weight to features for correct output, remove weight from features for incorrect output)
        $k = k + 1$
  return $w^{(k)}$

Perceptron in MapReduce
•  Iterative parameter mixing
  –  Train using data sub-group for 1 epoch in each mapper
  –  Average the weights in reducer
  –  Communicate back to mapper
  –  Train next epoch in mapper

Driver, with data sub-group $D_i$ in mapper $i$:
  $w = 0$
  for $n = 1 \ldots N$
    $w^{(i,n)} = \text{OneEpochPerceptron}(D_i, w)$   (in each mapper $i$)
    $w = \sum_i \mu_{i,n} w^{(i,n)}$   (average across all machines in each iteration)
  return $w$

OneEpochPerceptron($D$, $w$)
  $w^{(0)} = w;\ k = 0$
  for $t = 1 \ldots |D|$
    $y' = \arg\max_{y'} w^{(k)} \cdot f(x_t, y')$
    if ($y' \neq y_t$)
      $w^{(k+1)} = w^{(k)} + f(x_t, y_t) - f(x_t, y')$
      $k = k + 1$
  return $w^{(k)}$
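A small Python sketch of iterative parameter mixing, under a few stated assumptions: the task is multi-class classification rather than full structured prediction, the feature function f(x, y) places x in a block of weights per label (so w · f(x, y) is a per-label dot product), and the mixing weights are uniform. All names are illustrative.

```python
def predict(w, x, labels):
    """Return the label whose weight block scores x highest."""
    return max(labels, key=lambda y: sum(wj * xj for wj, xj in zip(w[y], x)))

def one_epoch_perceptron(data, w, labels):
    """One pass of perceptron updates over a data sub-group (mapper)."""
    w = {y: list(v) for y, v in w.items()}   # work on a local copy
    for x, y in data:
        y_hat = predict(w, x, labels)
        if y_hat != y:
            # Add weight for the correct output, remove it for the predicted one
            w[y] = [wj + xj for wj, xj in zip(w[y], x)]
            w[y_hat] = [wj - xj for wj, xj in zip(w[y_hat], x)]
    return w

def iterative_parameter_mixing(partitions, labels, dim, epochs):
    """Each epoch: train one pass per partition, then average (reducer)."""
    w = {y: [0.0] * dim for y in labels}
    mu = 1.0 / len(partitions)               # uniform mixing weights
    for _ in range(epochs):
        local = [one_epoch_perceptron(part, w, labels) for part in partitions]
        w = {y: [mu * sum(col) for col in zip(*(lw[y] for lw in local))]
             for y in labels}
    return w
```

Each epoch maps to one MapReduce job, with the averaged weights passed back to the mappers as the starting point for the next epoch.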

Perceptron in MapReduce

•  McDonald et al. (2010) show that averaging parameters after each epoch:
  –  Performs as well as or better than sequential training on all data
  –  Trains better classifiers more quickly than training sequentially on all data
  –  Performs better than averaging parameters from models trained in each partition for multiple epochs to convergence

Outline


Challenges for ML algorithms on Hadoop

•  Hadoop is optimized for large batch data processing
  –  Assumes data parallelism
  –  Ideal for shared nothing computing
•  Many learning algorithms are iterative
  –  Incur significant overheads per iteration
•  Multiple scans of the same data
  –  Typically once per iteration → high I/O overhead reading data into mappers per iteration
  –  In some algorithms static data is read into mappers in each iteration
    •  e.g. input data in k-means clustering
•  Need a separate controller outside the framework to:
  –  coordinate the multiple MapReduce jobs for each iteration
  –  perform some computations between iterations and at the end
  –  measure and implement stopping criterion

Challenges for ML algorithms on Hadoop

•  Incur multiple task initialization overheads
  –  Set up and tear down mapper and reducer tasks per iteration
•  Transfer/shuffle static data between mapper and reducer repeatedly
  –  Intermediate data is transferred through index/data files on local disks of mappers and pulled by reducers
•  Blocking architecture
  –  Reducers cannot start till all map jobs complete
•  Availability of nodes in a shared environment
  –  Wait for mapper and reducer nodes to become available in each iteration in a shared computing cluster

Iterative algorithms in MapReduce

•  Overhead per iteration:
  –  Job setup
  –  Data loading
  –  Disk I/O

[Figure: iterative MapReduce job; the data is read on each pass and the result of each pass is handed to the next]

Enhancements to Hadoop
•  Many proposals to overcome these challenges
•  All try to retain the core strengths of data partitioning and fault tolerance of Hadoop to various degrees
•  Proposed enhancements and alternatives to Hadoop
  –  Worker/Aggregator framework
  –  HaLoop
  –  MapReduce Online
  –  iMapReduce
  –  Spark
  –  Twister
  –  Hadoop ML
  –  …

Worker/Aggregator framework
•  Worker
  –  Loads data in memory
  –  Iterates:
    •  Iterates over data using user specified functions
    •  Communicates state
    •  Waits for input state of next pass
•  Aggregator
  –  Receives state from the workers
  –  Aggregates state using user specified functions
  –  Sends state to all workers
•  Communicate between workers and aggregators using TCP/IP
•  Leverage the fault tolerance and data locality of Hadoop

M. Weimer, S. Rao, M. Zinkevich, 2010, NIPS 2010 Workshop on Learning on Cores, Clusters and Clouds

Parallelized SGD in Worker/Aggregator


•  Advantages:
  –  Schedule once per job
  –  Data stays in memory
  –  P2P communication

[Figure: Worker/Aggregator data flow from initial data to final result]

HaLoop
•  Programming model and architecture for iterations
  –  New APIs to express iterations in the framework
•  Loop-aware task scheduling
  –  Physically co-locate tasks that use the same data in different iterations
  –  Remember association between data and node
  –  Assign task to node that uses data cached in that node
•  Caching for loop invariant data:
  –  Detect invariants in first iteration, cache on local disk to reduce I/O and shuffling cost in subsequent iterations
  –  Cache for Mapper inputs, Reducer inputs, Reducer outputs
•  Caching to support fixpoint evaluation:
  –  Avoids the need for a dedicated MR step on each iteration

HaLoop: Efficient Iterative Data Processing on Large Clusters by Yingyi Bu, Bill Howe, Magdalena Balazinska, Michael D. Ernst. In VLDB'10

HaLoop vs. MapReduce

[Figure: Application and Framework layers for MapReduce and for HaLoop. Labels: "Starts new MR jobs repeatedly", "New, additional API", "Leverage data locality", "Caching for fast access"]

•  HaLoop framework controls the loop
•  First iteration is similar to that on Hadoop
•  Framework identifies data → node mappings, caches and indexes for fast access, and controls looping
•  Subsequent iterations leverage the above optimizations

HaLoop Design


HaLoop Programming API

Name                          Functionality
Map() & Reduce()              Specify a map & reduce function
AddMap() & AddReduce()        Specify a step in loop
SetDistanceMeasure()          Specify a distance for results
SetInput()                    Specify inputs to iterations
AddInvariantTable()           Specify loop-invariant data
SetFixedPointThreshold()      A loop termination condition
SetMaxNumberOfIterations()    Specify the max number of iterations
SetReducerInputCache()        Enable/disable reducer input cache
SetReducerOutputCache()       Enable/disable reducer output cache
SetMapperInputCache()         Enable/disable mapper input cache

Groups marked on the slide: loop control, iteration inputs, cache control

k-means clustering in HaLoop
•  k-means in HaLoop

Job job = new Job();
job.AddMap(Map_Kmeans, 1);                // Assign data point to closest cluster
job.AddReduce(Reduce_Kmeans, 1);          // Re-compute centroids
job.SetDistanceMeasure(ResultDistance);   // # of changes in cluster membership
job.SetFixedPointThreshold(0.01);         // Stopping criterion
job.SetMaxNumOfIterations(12);            // Stopping criterion
job.SetInput(IterationInput);             // Same input data to each iteration
job.SetMapperInputCache(true);            // Mappers read cached input data from local disk
job.Submit();
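For comparison, the loop that these HaLoop settings describe can be written out by hand. The Python sketch below is not HaLoop code: it is a single-process illustration in which `assign` plays the role of the map step, `recompute` the reduce step, and the driver applies the two stopping criteria. It assumes the distance measure is the fraction of points whose cluster membership changed between iterations; the slide only says "# of changes in cluster membership".

```python
def assign(points, centroids):
    """Map step: index of the closest centroid for each point."""
    return [min(range(len(centroids)),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(pt, centroids[c])))
            for pt in points]

def recompute(points, labels, centroids):
    """Reduce step: mean of the points assigned to each cluster."""
    new_centroids = []
    for c, old in enumerate(centroids):
        members = [pt for pt, label in zip(points, labels) if label == c]
        if members:
            new_centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
        else:
            new_centroids.append(old)        # keep an empty cluster where it was
    return new_centroids

def kmeans_loop(points, centroids, threshold=0.01, max_iterations=12):
    """Driver: stop at a fixed point or after max_iterations."""
    labels = None
    for iteration in range(1, max_iterations + 1):
        new_labels = assign(points, centroids)
        centroids = recompute(points, new_labels, centroids)
        if labels is not None:
            changed = sum(a != b for a, b in zip(labels, new_labels)) / len(points)
            if changed < threshold:
                return centroids, new_labels, iteration
        labels = new_labels
    return centroids, labels, max_iterations
```

In plain Hadoop this driver runs outside the framework and launches a new job per iteration; HaLoop moves the loop and the termination test into the framework.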

MapReduce Online
•  Pipeline data between operators as it is produced
  –  Decouple computation and data transfer schedules
  –  Intra-job:
    •  between mapper and reducer
  –  Inter-job:
    •  schedule multiple dependent jobs simultaneously
    •  between reducer of one job and mapper of next job
•  “Push” data from producers instead of a “pull” by consumers
•  Intermediate data is considered tentative till map job completes
  –  Also stored on disk for fault tolerance/recovery
•  Reducer starts as soon as some data is available from mappers
  –  Can compute approximate answers from partial data
•  Mappers and Reducers can also run continuously
  –  Enables stream processing

MapReduce Online, T. Condie, N. Conway, P. Alvaro, J. M. Hellerstein, K. Elmeleegy, R. Sears, 2010, NSDI'10, Proceedings of the 7th USENIX conference on Networked systems design and implementation

iMapReduce
•  Iterative processing
  –  Persistent map/reduce tasks
  –  Each reduce task has a locally connected corresponding map task
•  Maintain static data locally
  –  On local disk of mapper
•  Asynchronous map execution
  –  Persistent socket between reduce → map
  –  Completion of reduce triggers map
  –  Mappers do not need to wait

iMapReduce: A Distributed Computing Framework for Iterative Computation, Y. Zhang, Q. Gao, L. Gao, C. Wang, DataCloud 2011

iMapReduce – Iterative Processing

iMapReduce – Asynchronous map execution

[Figure: task timelines (time axis) comparing MapReduce and iMapReduce]

Spark
•  Open source cluster computing model:
  –  Different from MapReduce, but retains some basic character
•  Optimized for:
  –  iterative computations
    •  Applies to many learning algorithms
  –  interactive data mining
    •  Load data once into multiple mappers and run multiple queries
•  Programming model using working sets
  –  applications reuse intermediate results in multiple parallel operations
  –  preserves the fault tolerance of MapReduce
•  Supports
  –  Parallel loops over distributed datasets
    •  Loads data into memory for (re)use in multiple iterations
  –  Access to shared variables accessible from multiple machines
•  Implemented in Scala
•  www.spark-project.org

Spark: Cluster Computing with Working Sets. M. Zaharia, M. Chowdhury, M.J. Franklin, S. Shenker, I. Stoica. 2010, USENIX HotCloud 2010.

Outline


Mahout
•  Goal
  –  Create scalable machine learning algorithms under the Apache license
•  Scalable:
  –  to large datasets
  –  business use cases
  –  community
•  Contains both:
  –  Hadoop implementations of algorithms that scale linearly with data
  –  Fast sequential (non MapReduce) algorithms
•  Latest release is Mahout 0.5 on 27th May 2011 (as of Aug 4, 2011)
•  Wiki:
  –  https://cwiki.apache.org/confluence/display/MAHOUT/Mahout+Wiki
•  Mailing lists
  –  User, Developer, Commit notification lists
  –  https://cwiki.apache.org/confluence/display/MAHOUT/Mailing+Lists

Algorithms in Mahout
•  Classification:
  –  Logistic Regression
  –  Naïve Bayes, Complementary Naïve Bayes
  –  Random Forests
•  Clustering
  –  K-means, Fuzzy k-means
  –  Canopy
  –  Mean-shift clustering
  –  Dirichlet Process clustering
  –  Latent Dirichlet allocation
  –  Spectral clustering
•  Parallel FP growth
•  Item based recommendations
•  Stochastic Gradient Descent (sequential)

Acknowledgment

Numerous wonderful colleagues!

Questions?

Model Training Exercise

Exercise problem
•  Problem:
  –  Predict the age of abalone as a function of physical attributes
  –  Useful for ecological and commercial fishing purposes
•  Dataset:
  –  Dataset from the Marine Resources Division at the Department of Primary Industry and Fisheries, Tasmania
  –  Attributes:
    •  Gender, Length, Diameter, Height, 4 different weights – 8 attributes
  –  Target:
    •  Number of Rings in shell
    •  Age (in years) = 1.5 + number of rings in shell
  –  At: http://www.stat.duke.edu/data-sets/rlw/abalone.dat
•  Learn a linear relation between the age and the physical attributes

Exercise dataset
•  Original data sample size = 4177
•  Generate larger dataset by replicating each record
  –  Add Gaussian noise for each feature with the sample variance
  –  Do not add noise for Gender and # of rings
•  For all attributes, compared to the original dataset, the larger datasets have:
  –  same mean
  –  higher sample variance
•  Replicate by factors of:
  –  10x, 1k x, 8k x, 16k x, 32k x
  –  Datasets of about 40k, 4MM, 32MM, 64MM and 128MM records

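Exercise: Model training
•  Train a linear regression model

$$\text{Rings} = \sum_{i=0}^{8} w_i x_i, \qquad x_0 = 1$$

•  Split the training data into 100 parts
•  Mapper:
  –  Compute the matrix A and vector b on each partition
•  Reducer
  –  Aggregate the values of A and b from all mappers
  –  Compute the weights

$$w^* = A^{-1} b, \qquad A = \sum_{j} x^{(j)} \left(x^{(j)}\right)^T, \qquad b = \sum_{j} x^{(j)} y^{(j)}$$

Here $j$ runs over the training records, $x^{(j)} = (x_0, \ldots, x_8)$ is the feature vector of record $j$ and $y^{(j)}$ is its number of rings.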
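The mapper and reducer for this exercise can be sketched in a few lines of Python. This is a hedged illustration, not the exercise solution: the function names are hypothetical, the partitions are plain lists of (features, target) pairs, and a small Gauss-Jordan routine stands in for the matrix inverse so that the sketch needs only the standard library.

```python
def mapper(partition):
    """Accumulate A = sum(x x^T) and b = sum(x y) over one partition."""
    d = len(partition[0][0]) + 1              # +1 for the constant feature x0 = 1
    A = [[0.0] * d for _ in range(d)]
    b = [0.0] * d
    for features, y in partition:
        x = [1.0] + list(features)
        for r in range(d):
            b[r] += x[r] * y
            for c in range(d):
                A[r][c] += x[r] * x[c]
    return A, b

def solve(A, b):
    """Solve A w = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * p for a, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def reducer(partials):
    """Sum the per-partition A and b, then compute the weights."""
    d = len(partials[0][1])
    A = [[sum(p[0][r][c] for p in partials) for c in range(d)] for r in range(d)]
    b = [sum(p[1][r] for p in partials) for r in range(d)]
    return solve(A, b)
```

Each mapper emits only a 9 × 9 matrix and a 9-vector for the abalone data, so the amount of data shuffled to the reducer does not depend on the number of records.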

Exercise: Model Results

•  For replication factor of 10x
  –  w[Sex] = 0.747
  –  w[Length] = 1.894
  –  w[Diameter] = 2.844
  –  w[Height] = 7.213
  –  w[Whole] = 0.311
  –  w[Shucked] = -0.558
  –  w[Viscera] = 0.840
  –  w[Shell] = 3.288
  –  w[1] = 5.046

Training Times: Sequential vs Hadoop

[Figure: training time in seconds (0 to 9000) against data size in MM records (0 to 140), for two series: Hadoop and Sequential]

References

1.  M. Kearns. Efficient noise-tolerant learning from statistical queries. Journal of the ACM, Vol. 45, No. 6, November 1998, pp. 983–1006.
2.  C. Chu, S.K. Kim, Y. Lin, Y. Yu, G. Bradski, A.Y. Ng, K. Olukotun. Map-Reduce for Machine Learning on Multicore. In Proceedings of NIPS 2006, pp. 281-288.
3.  W. Zhao, H. Ma, Q. He. Parallel K-Means Clustering Based on MapReduce. CloudCom '09, Proceedings of the 1st International Conference on Cloud Computing, 2009, pp. 674-679.
4.  R. Ho. http://horicky.blogspot.com/2011/04/k-means-clustering-in-map-reduce.html
5.  Cluster Computing and MapReduce, Lecture 4. http://www.youtube.com/watch?v=1ZDybXl212Q
6.  A. McCallum, K. Nigam, L. Ungar. Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching. Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, 2000, pp. 169-178.
7.  C. Elkan, 2011. http://cseweb.ucsd.edu/~elkan/250B/logreg.pdf
8.  B. Panda, J. S. Herbach, S. Basu, R. J. Bayardo. PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce, 2009. Proceedings of the VLDB Endowment (PVLDB), vol. 2, no. 2, pp. 1426-1437.
9.  J.S. Herbach, 2009. http://fora.tv/2009/08/12/Josh_Herbach_PLANET_MapReduce_and_Tree_Learning#fullprogram
10.  R. Yan, J. Tesic, J. R. Smith. Model-shared subspace boosting for multi-label classification, 2007. In Proceedings of the 13th ACM SIGKDD Intl. Conf. on Knowledge discovery and data mining, pp. 834-843.
11.  R. Yan, M. Fleury, M. Merler, A. Natsev, J.R. Smith, 2009. Proceedings of the First ACM workshop on Large-scale multimedia retrieval and mining, pp. 35-42.
12.  J.D. Basilico, M.A. Munson, T.G. Kolda, K.R. Dixon, W.P. Kegelmeyer. COMET: A Recipe for Learning and Using Large Ensembles on massive data, 2011. http://arxiv.org/PS_cache/arxiv/pdf/1103/1103.2068v1.pdf
13.  S. Papadimitriou, J. Sun. DisCo: Distributed Co-clustering with Map-Reduce, 2008. ICDM '08, Eighth IEEE International Conference on Data Mining, pp. 512-521.
14.  M.A. Zinkevich, M. Weimer, A. Smola, L. Li. Parallelized Stochastic Gradient Descent, 2010, NIPS.
15.  T. Elsayed, J. Lin, D. Oard. Pairwise document similarity in large collections with MapReduce, 2008. In ACL, Companion Volume, pp. 265-268.
16.  J. Lin. Brute Force and Indexed Approaches to Pairwise Document Similarity Comparisons with MapReduce. Proceedings of the 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2009.
17.  M. Weimer, S. Rao, M. Zinkevich, 2010. NIPS 2010 Workshop on Learning on Cores, Clusters and Clouds.
18.  Y. Bu, B. Howe, M. Balazinska, M. D. Ernst. HaLoop: Efficient Iterative Data Processing on Large Clusters. In VLDB'10: The 36th International Conference on Very Large Data Bases, Singapore, 24-30 September, 2010.
19.  G. Mann, R. McDonald, M. Mohri, N. Silberman, D. D. Walker, 2009. In Advances in Neural Information Processing Systems 22 (2009), edited by Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, A. Culotta, pp. 1231-1239.
20.  R. McDonald, K. Hall, G. Mann. Distributed training strategies for the structured perceptron, 2010. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 456-464.
21.  H. Li, Y. Wang, D. Zhang, M. Zhang, E.Y. Chang, 2008. In Proceedings of the 2008 ACM conference on Recommender systems, pp. 107-114.
22.  R. Gemulla, P.J. Haas, E. Nijkamp, Y. Sismanis. IBM Tech Report, 2011. http://www.almaden.ibm.com/cs/people/peterh/dsgdTechRep.pdf
23.  G. Malewicz, M. H. Austern, A. J.C. Bik, J. C. Dehnert, A.H. Horn, N. Leiser, G. Czajkowski. Pregel: a system for large-scale graph processing, 2010. SIGMOD '10, Proceedings of the 2010 international conference on Management of data.
24.  T. Condie, N. Conway, P. Alvaro, J. M. Hellerstein, K. Elmeleegy, R. Sears. MapReduce Online, 2010. NSDI'10, Proceedings of the 7th USENIX conference on Networked systems design and implementation.
25.  Y. Zhang, Q. Gao, L. Gao, C. Wang. iMapReduce: A Distributed Computing Framework for Iterative Computation, 2011. DataCloud 2011.
26.  M. Zaharia, M. Chowdhury, M.J. Franklin, S. Shenker, I. Stoica. Spark: Cluster Computing with Working Sets, 2010. USENIX HotCloud 2010.

Backup

Decision Trees
•  Features: $x = (x_1, x_2, \ldots, x_n)$
•  Targets: $y \in [0, 1]$ or $y \in \mathbb{R}$
•  Data: $D = \{(x, y)\}^m$
•  Construct Tree
  –  Each node splits the data by feature value
  –  Start from root
    •  Select best feature, value to split the node
      –  Based on reduction in data impurity between the child and parent nodes
  –  Select the next child node
  –  Repeat the process till some stopping criterion
    •  Pure node, or data is below some threshold etc.

Decision Trees

[Figure: decision tree construction; finding the best split is marked as the expensive step for large datasets]

B. Panda, J. S. Herbach, S. Basu, R. J. Bayardo, PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce, 2009, Proceedings of the VLDB Endowment (PVLDB), vol. 2, no. 2, pp. 1426-1437

PLANET for Decision Trees
•  Parallel Learner for Assembling Numerous Ensemble Trees (PLANET, Panda et al. 2009)
  –  Main idea is to use MapReduce to determine the best feature value splits for nodes from large datasets
•  Each intermediate node has a sub-set of all data falling into it
•  If this sub-set is small enough to fit in memory,
  –  Grow remaining sub-tree in memory
•  Else,
  –  Launch a MapReduce job to find candidate feature value splits
  –  Select the best feature split from among the candidates
•  5 main components
  1. Controller
    •  Monitors and controls the growth of tree
  2. Initialization Task (MapReduce task)
    •  Identifies all feature values to be considered for splits
  3. FindBestSplit Task (MapReduce task)
    •  Finds best split when there is too much data to fit in memory
  4. InMemoryGrow Task (MapReduce task)
    •  Grows an entire sub-tree once the data fits in memory
  5. Model File
    •  File describing the state of the model

PLANET for Decision Trees
•  Controller
  –  Determines the state of the tree and grows it
    •  Decides if nodes are pure or have small data to become leaves
    •  Data fits in memory → Launch a MapReduce job to grow the entire sub-tree in memory
    •  Data does not fit in memory → Launch a MapReduce job to find candidate best splits
    •  Collect results from MR jobs and choose the best split for a node
    •  Update the Model File
  –  Periodically checkpoints the system
•  Model File
  –  Contains the state of the tree constructed so far
  –  Used by the controller to check which nodes to split or grow next

PLANET for Decision Trees
•  Maintain 2 queues
  –  MapReduceQueue (MRQ)
    •  Contains nodes for which data is too large to fit in memory
  –  InMemoryQueue (InMemQ)
    •  Contains nodes for which data fits in memory
•  Initialization Task (MapReduce)
  –  Identifies candidate attribute values for node splits
  –  Continuous attributes
    •  Compute an approximate equi-depth histogram
    •  Boundary points of histogram used for potential splits
  –  Categorical attributes
    •  Identify attribute's domain
    •  Sort values by average values of Y and use this for ordering
  –  Generate a file with list of attributes to be used by other tasks
•  2 main MapReduce jobs
  –  MR_ExpandNodes
    •  Process nodes from the MRQ to find best split
    •  Output for each node:
      –  Candidate split positions for node along with
        »  Quality of split (using summary statistics)
        »  Predictions in left and right branches
        »  Size of data going into left and right branches
  –  MR_InMemory
    •  Process nodes from the InMemQ
    •  For a given set of nodes N, complete tree induction at nodes in N using the InMemoryGrow algorithm

PLANET for Decision Trees
•  Map function in MR_ExpandNodes
  –  Load the current model file M and set of nodes N
  –  For each record
    •  Determine if record is relevant to any of the nodes in N
    •  Add record to the summary statistics (SS) for node
    •  For each feature-value in record
      –  Add record to the summary statistics for node for split points “s” less than the value in record “v”
  –  Output
    •  SS of candidate splits, keyed by split ID:
       key = ($n \in N$, $x \in$ Ordered feature, $s$); value = $T_{n,x}[s]$
       key = ($n \in N$, $x \in$ Categorical feature); value = ($v$, $T_{n,x}[v]$)
    •  SS of parent node:
       key = ($n \in N$); value = SS
    •  SS for variance impurity:

$$T_{n,x}[s] = SS = \left( \sum_{\text{subgroup}} y, \; \sum_{\text{subgroup}} y^2, \; \sum_{\text{subgroup}} 1 \right)$$

PLANET for Decision Trees
•  Reduce function in MR_ExpandNodes
  –  For each node
    •  Aggregate the summary statistics for that node
  –  For each split (which is node specific)
    •  Aggregate the summary statistics for that Split ID from all map outputs of summary statistics
    •  Compute impurity of data going into left and right branches
    •  Total impurity = Impurity in left branch + Impurity in right branch
    •  If Total impurity < Best split impurity so far
      –  Best split = Current split
  –  Output the best split found
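The reduce logic above can be illustrated with a short Python sketch for a single node. Several things here are assumptions made for the example: summary statistics are (sum of y, sum of y squared, count) tuples, the statistics emitted for a candidate split describe one branch (called left here) and the other branch is obtained by subtracting from the parent, and impurity is the sum of squared deviations from the branch mean (count times variance). The function names are hypothetical.

```python
from functools import reduce

def add_ss(a, b):
    """Combine two summary-statistic tuples (sum_y, sum_y2, count)."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def sub_ss(a, b):
    """Statistics of the records in a that are not in b."""
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def impurity(ss):
    """Sum of squared deviations from the mean, computed from the statistics."""
    sum_y, sum_y2, n = ss
    return sum_y2 - sum_y * sum_y / n if n > 0 else 0.0

def best_split(parent_partials, split_partials):
    """parent_partials: SS tuples from all mappers for the node.
    split_partials: {split_id: [SS tuples from all mappers for one branch]}."""
    parent = reduce(add_ss, parent_partials)
    best = None
    for split_id, partials in split_partials.items():
        left = reduce(add_ss, partials)
        right = sub_ss(parent, left)          # other branch by subtraction
        total = impurity(left) + impurity(right)
        if best is None or total < best[1]:
            best = (split_id, total)
    return best
```

Because the statistics are additive, the reducer never needs the raw records, only one small tuple per mapper per candidate split.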

PLANET for Decision Trees
•  InMemoryGrow
  –  Task to grow the entire subtree once the data for it fits in memory
  –  Similar to parallel training
  –  Map
    •  Load the current model file
    •  For each record identify the node that needs to be grown
    •  Output <Node_id, Record>
  –  Reduce
    •  Initialize the feature value file from Initialization task
    •  For each <Node_id, List<Record>> run the basic tree growing algorithm on the records
    •  Output the best split for each node in the subtree
