apache hadoop india summit 2011 talk "framework for a suite of co-clustering algorithms for...

Post on 15-Jan-2015


Framework for a suite of Co-clustering algorithms for predictive modeling on Hadoop

Vaijanath N. Rao

(vaijanath.rao@teamaol.com)

Rohini Uppuluri

(rohini.uppuluri@teamaol.com)

Presentation for[CLIENT]

Agenda

• Introduction
  • Background
  • Some Approaches
• Co-Clustering
  • Introduction
  • Related Work
• Why Hadoop?
• Goal
• Our Framework
• Conclusions and Future Work


Background

Modeling for Prediction

• Will user A like this movie?

• Will user B like this camera?

• Customer purchase decisions in an e-commerce setting

And tons of other things…


Some Approaches

• Collaborative filtering
  • User-based, item-based, model-based, content-based, hybrid, etc. (see [1], [2])
• Latent Models
  • Probabilistic Latent Semantic Indexing [3,6]
  • Matrix Factorization [4,7,8]
  • Predictive Discrete Latent Factor models [5]
• Co-clustering
  • Clustering along multiple axes: [9,10] etc.; survey in [16]


Co-clustering

[Figure: a Users × Products matrix of 0/1 entries with missing values (?) is rearranged over successive iterations into Clustered Users × Clustered Products, reducing error at each step. User attributes and product attributes are shown alongside the matrix. Each iteration consists of a row cluster update, a column cluster update, and a global model update.]
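To make the idea of predicting from co-clusters concrete, here is a minimal sketch, assuming the simplest possible global model: a missing entry (?) is predicted by the mean of the observed entries in its (row cluster, column cluster) block. The toy matrix, the cluster assignments, and the name `block_means` are illustrative assumptions and are not taken from the talk.

```python
from collections import defaultdict

def block_means(matrix, row_assign, col_assign):
    """Mean of the observed entries in each (row cluster, column cluster) block."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value is None:              # None stands for a missing entry ("?")
                continue
            key = (row_assign[i], col_assign[j])
            sums[key] += value
            counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}

# Hypothetical 4 users x 4 products click matrix.
matrix = [
    [1, 1, 0, None],
    [1, None, 0, 0],
    [0, 0, None, 1],
    [None, 0, 1, 1],
]
row_assign = [0, 0, 1, 1]   # users 0 and 1 in cluster 0, users 2 and 3 in cluster 1
col_assign = [0, 0, 1, 1]   # products 0 and 1 in cluster 0, products 2 and 3 in cluster 1

means = block_means(matrix, row_assign, col_assign)
# Predict the missing entry for user 0, product 3 from the mean of its block.
prediction = means[(row_assign[0], col_assign[3])]
```

The better the row and column clusters, the more homogeneous each block becomes, which is what "reducing error" refers to in the figure.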


Some Approaches

• Bregman co-clustering - Framework [11]

• Information theoretic co-clustering [12]

• Min sum squared co-clustering [13]

• Scalable framework based on the Bregman framework [14]

• DisCo [15]


Why Hadoop

• Real-world data is huge

• Large matrix to operate on (millions and millions of rows, millions of columns!)

• Lots of computation


Goal

• There are a number of approaches, so a common framework is needed

• Build a framework that fits multiple algorithms on Hadoop

• Make the framework easy for users to choose an algorithm and use it


Overview

[Diagram: three jobs (Row Cluster Updator Job, Column Cluster Updator Job, Global Model Updator Job) connected to four data items (Input, Row Clusters, Column Clusters, Global Model).]


Overview : Core Interfaces

• Input vector (type, id, datavec, attributevec, cost, assignment)
• Cluster (vector, len)
  • Row Cluster
  • Column Cluster
• Distance/Error Function (vector1, vector2)
• Model (matrix)
  • Row Model
  • Column Model
  • Group Model
• Objective Function (Model1, Model2)
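As a rough illustration of how these core interfaces might look in code, here is a sketch in Python. Only the names and field lists come from the slide; the concrete types, default values, the use of dataclasses and protocols, and the `squared_error` example are assumptions made for illustration (the framework itself targets Hadoop and its actual signatures are not shown in the talk).

```python
from dataclasses import dataclass
from typing import List, Optional, Protocol

Vector = List[float]
Matrix = List[List[float]]

@dataclass
class InputVector:
    type: str                          # assumed values such as "ROW" or "COLUMN"
    id: str
    datavec: Vector
    attributevec: Vector
    cost: float = float("inf")         # error against the currently assigned cluster
    assignment: Optional[int] = None   # id of the currently assigned cluster

@dataclass
class Cluster:
    vector: Vector
    len: int

class RowCluster(Cluster): ...
class ColumnCluster(Cluster): ...

@dataclass
class Model:
    matrix: Matrix

class DistanceFunction(Protocol):
    def __call__(self, vector1: Vector, vector2: Vector) -> float: ...

class ObjectiveFunction(Protocol):
    def __call__(self, model1: Model, model2: Model) -> float: ...

def squared_error(vector1: Vector, vector2: Vector) -> float:
    """One possible DistanceFunction: sum of squared differences."""
    return sum((a - b) ** 2 for a, b in zip(vector1, vector2))

row = InputVector(type="ROW", id="r1", datavec=[1.0, 0.0], attributevec=[])
cluster = RowCluster(vector=[0.0, 0.0], len=1)
row.cost = squared_error(row.datavec, cluster.vector)
```

Keeping the distance and objective functions as separate pluggable pieces is what would let different co-clustering algorithms share the same jobs.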


Currently we have

• Graph Based Bi-clustering

• DisCo


DisCo Algorithm

1. Initialization
   1.1 Initialize row and column clusters
   1.2 Compute global model
2. While the objective function is not yet met (i.e. until convergence)
   2.1 For each row in the data, pick the row group which minimizes error
   2.2 Update row clusters
   2.3 Update global model
   2.4 For each column in the data, pick the column group which minimizes error
   2.5 Update column clusters
   2.6 Update global model
3. Return row and column clusters
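The alternating loop in these steps can be sketched on a single machine as follows. This is a simplified illustration and not the authors' Hadoop implementation: it assumes a dense matrix, squared error as the error function, block means as the global model, and "error stopped decreasing" as the stopping rule. All function names are hypothetical.

```python
def global_model(data, row_assign, col_assign, k, l):
    """Global model: mean value of each (row cluster, column cluster) block."""
    sums = [[0.0] * l for _ in range(k)]
    counts = [[0] * l for _ in range(k)]
    for i, row in enumerate(data):
        for j, x in enumerate(row):
            sums[row_assign[i]][col_assign[j]] += x
            counts[row_assign[i]][col_assign[j]] += 1
    return [[sums[r][c] / counts[r][c] if counts[r][c] else 0.0
             for c in range(l)] for r in range(k)]

def total_error(data, row_assign, col_assign, model):
    """Sum of squared differences between the data and the block means."""
    return sum((x - model[row_assign[i]][col_assign[j]]) ** 2
               for i, row in enumerate(data) for j, x in enumerate(row))

def cocluster(data, row_assign, col_assign, k, l, max_iters=20):
    row_assign, col_assign = list(row_assign), list(col_assign)
    n_rows, n_cols = len(data), len(data[0])
    model = global_model(data, row_assign, col_assign, k, l)          # step 1.2
    error = total_error(data, row_assign, col_assign, model)
    for _ in range(max_iters):                                        # step 2
        for i, row in enumerate(data):                                # steps 2.1, 2.2
            row_assign[i] = min(range(k), key=lambda r: sum(
                (x - model[r][col_assign[j]]) ** 2 for j, x in enumerate(row)))
        model = global_model(data, row_assign, col_assign, k, l)      # step 2.3
        for j in range(n_cols):                                       # steps 2.4, 2.5
            col_assign[j] = min(range(l), key=lambda c: sum(
                (data[i][j] - model[row_assign[i]][c]) ** 2 for i in range(n_rows)))
        model = global_model(data, row_assign, col_assign, k, l)      # step 2.6
        new_error = total_error(data, row_assign, col_assign, model)
        if new_error >= error:        # no further improvement: stop
            break
        error = new_error
    return row_assign, col_assign, model                              # step 3

data = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Deliberately poor initial clusters (step 1.1).
rows, cols, model = cocluster(data, [0, 0, 0, 1], [0, 0, 0, 1], k=2, l=2)
```

On this toy matrix the loop recovers the two diagonal blocks. In the Hadoop setting each of the inner `for` loops corresponds to a job over the rows or columns, and the global model is the small shared state passed between jobs.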

Pick the Best Row Group/Cluster

Example

RowCluster Updator Job

Example

BiClustering

Pick the Best Row Group/Cluster

Example

Row Updator Job

Input: key lineId (0); value rowId, clickVector, attributeVector, curRowClusterId, curRowClusterError

RowCluster Mapper: pick the best row group/cluster, i.e. the one which minimizes cost or error. Emits two key types:

• DATA: rowId, clickVector, attributeVector, bestRowClusterId, cost
• ROWCLUSTER: key = best row cluster ID, value = clickVector

RowCluster Reducer, by key type:

• DATA: just emit
• ROW CLUSTER: aggregate row cluster

Output: updated row clusters. Also write: rowId, clickVector, attributeVector, bestRowClusterId, cost
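The mapper/reducer split of this job can be imitated in memory with plain Python. The sketch below is an assumption-laden illustration rather than the framework's code: it uses squared error between a row's clickVector and each cluster vector as the cost, aggregates a row cluster as the mean of its members' clickVectors, and replaces Hadoop's shuffle with a dictionary. The function names and the tuple-shaped keys are hypothetical.

```python
from collections import defaultdict

def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def row_cluster_mapper(record, row_clusters):
    """Pick the row cluster with minimum cost and emit DATA and ROWCLUSTER pairs."""
    row_id, click_vector, attribute_vector = record
    best_id, cost = min(
        ((cid, squared_error(click_vector, vec)) for cid, vec in row_clusters.items()),
        key=lambda pair: pair[1],
    )
    yield ("DATA", row_id), (click_vector, attribute_vector, best_id, cost)
    yield ("ROWCLUSTER", best_id), click_vector

def row_cluster_reducer(key, values):
    """DATA: just emit. ROWCLUSTER: aggregate member vectors into a new cluster vector."""
    key_type, _ = key
    if key_type == "DATA":
        for value in values:
            yield key, value
    else:
        n = len(values)
        yield key, [sum(column) / n for column in zip(*values)]

def run_job(records, row_clusters):
    grouped = defaultdict(list)        # stands in for the shuffle phase
    for record in records:
        for key, value in row_cluster_mapper(record, row_clusters):
            grouped[key].append(value)
    output = {}
    for key, values in grouped.items():
        for out_key, out_value in row_cluster_reducer(key, values):
            output[out_key] = out_value
    return output

records = [
    ("r1", [1.0, 0.0], []),
    ("r2", [0.5, 0.0], []),
    ("r3", [0.0, 1.0], []),
]
row_clusters = {0: [1.0, 0.0], 1: [0.0, 1.0]}
out = run_job(records, row_clusters)
```

After the run, the `ROWCLUSTER` entries of `out` hold the updated row clusters and the `DATA` entries hold each row with its new assignment and cost, mirroring the two outputs of the job.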


Example


Conclusions and Future Work

• Implementing more algorithms

• Easy-to-use examples and more documentation


References

[1] J. Herlocker, J. Konstan, A. Borchers, and J. Riedl. An algorithmic framework for performing collaborative filtering. In SIGIR, pages 230–237, 1999.

[2] J. Basilico and T. Hofmann. Unifying collaborative and content-based filtering. In ICML ’04, pages 65–72, 2004.

[3] T. Hofmann and J. Puzicha. Latent class models for collaborative filtering. In Proc. IJCAI ’99, 1999.

[4] R. Salakhutdinov and A. Mnih. Probabilistic matrix factorization. In NIPS ’07, 2007.

[5] D. Agarwal and S. Merugu. Predictive discrete latent factor models for large scale dyadic data. In Proc. KDD ’07, pages 26–35, 2007.

[6] A. Das, M. Datar, A. Garg, and S. Rajaram. Google news personalization: scalable online collaborative filtering. In WWW ’07, pages 271–280, 2007.

[7] Y. Koren. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In KDD ’08, pages 426–434, 2008.

[8] H. Ma, H. Yang, M. Lyu, and I. King. SoRec: social recommendation using probabilistic matrix factorization. In CIKM ’08, pages 931–940, 2008.

[9] Y. Cheng and G. M. Church. Biclustering of expression data. In Proc. ISMB ’00, pages 93–103, 2000.

[10] T. George and S. Merugu. A scalable collaborative filtering framework based on co-clustering. In ICDM, pages 625–628, 2005.

[11] A. Banerjee, I. Dhillon, J. Ghosh, S. Merugu, and D. Modha. A generalized maximum entropy approach to Bregman co-clustering and matrix approximation. JMLR, 8:1919–1986, 2007.

[12] I. Dhillon, S. Mallela, and D. Modha. Information-theoretic co-clustering. In Proc. KDD ’03, pages 89–98, 2003.

[13] H. Cho, I. S. Dhillon, Y. Guan, and S. Sra. Minimum sum squared residue co-clustering of gene expression data. In Proc. SDM ’04, 2004.

[14] M. Deodhar, G. Gupta, J. Ghosh, H. Cho, and I. Dhillon. A scalable framework for discovering coherent co-clusters in noisy data. In ICML ’08, 2008.

[15] S. Papadimitriou and J. Sun. DisCo: Distributed co-clustering with MapReduce: A case study towards petabyte-scale end-to-end mining. In ICDM ’08, pages 512–521, 2008.

[16] S. C. Madeira and A. L. Oliveira. Biclustering algorithms for biological data analysis: A survey. IEEE Trans. on Computational Biology and Bioinformatics, 1(1):24–45, 2004.


Thank you


Row Cluster Updator Job

Input: rowId, clickVector, attributeVector, curRowClusterId, curRowClusterError

RowCluster Mapper: pick the best row group/cluster, i.e. the one which minimizes cost or error. Emits key/value pairs of three value types:

• DATA: rowId, clickVector, attributeVector, bestRowClusterId, cost
• ROWCLUSTER: rowId, updated rowCluster
• ROW GLOB MODEL: rowId, updated partial GlobalModel

RowCluster Reducer, by value type:

• DATA: just emit
• ROW CLUSTER: aggregate row cluster
• ROW GLOB CLUSTER: aggregate partial global model for given row cluster

Output: updated row clusters and updated partial global model. Also write: rowId, clickVector, attributeVector, bestRowClusterId, cost


Column Cluster Updator Job

Input: colId, clickVector, attributeVector, curColClusterId, curColClusterError

ColCluster Mapper: pick the best column group/cluster, i.e. the one which minimizes cost or error. Emits key/value pairs of three value types:

• DATA: colId, clickVector, attributeVector, bestColClusterId, cost
• COLCLUSTER: colId, updated colCluster
• COL GLOB MODEL: colId, updated partial GlobalModel

ColCluster Reducer, by value type:

• DATA: just emit
• COL CLUSTER: aggregate col cluster
• COL GLOB CLUSTER: aggregate partial global model for given col cluster

Output: updated col clusters and updated partial global model. Also write: colId, clickVector, attributeVector, bestColClusterId, cost
