Clustering Very Large Multi-dimensional Datasets with MapReduce (蔡跳)

INTRODUCTION

• large datasets of moderate-to-high dimensional elements
• serial subspace clustering algorithms do not scale to TB- and PB-sized data, e.g., Twitter crawl: > 12 TB; Yahoo! operational data: 5 PB
• approach: take a fast, scalable serial algorithm and make it run efficiently in parallel

INTRODUCTION

• bottlenecks: I/O and network
• Best of both Worlds (BoW) automatically spots the bottleneck and picks a good strategy
• serial clustering methods are used as a plugged-in clustering subroutine

RELATED WORK

• MapReduce: a simplified distributed programming model for parallel computation over large-scale datasets

• mapper, reducer
• map stage: reads the input file and outputs (key, value) pairs
• shuffle stage: transfers the mappers' output to the reducers based on the key
• reduce stage: processes the received pairs and outputs the final result
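The three stages above can be sketched with a toy single-machine word count in Python; this is a minimal illustration of the dataflow only, not Hadoop's actual API:

```python
from collections import defaultdict

def map_stage(lines):
    """Map stage: read input records and emit (key, value) pairs."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_stage(pairs):
    """Shuffle stage: group the mappers' output by key for the reducers."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_stage(groups):
    """Reduce stage: process each key's values and output the final result."""
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_stage(shuffle_stage(map_stage(["a b a", "b c"])))
# counts == {"a": 2, "b": 2, "c": 1}
```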

BoW

• ParC: partition the data, then merge the per-partition results
• SnI: sample first, paying extra I/O to reduce network cost
• trade-off between the two

ParC--Parallel Clustering

• partition the data and assign the partitions to different machines
• each machine clusters its assigned data; the resulting clusters are called β-clusters
• merge the β-clusters to obtain the final clusters
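A minimal single-process sketch of the ParC pipeline, using 1-D points and a trivial threshold-based stand-in for the plugged-in serial clusterer; `partition`, `local_cluster`, and `merge` are illustrative names, not the paper's code:

```python
def partition(points, num_machines):
    """Assign each point to a machine (here: simple round-robin)."""
    parts = [[] for _ in range(num_machines)]
    for i, p in enumerate(points):
        parts[i % num_machines].append(p)
    return parts

def local_cluster(points, eps=1.0):
    """Stand-in serial clusterer: group sorted 1-D points within eps."""
    clusters = []
    for p in sorted(points):
        if clusters and p - clusters[-1][-1] <= eps:
            clusters[-1].append(p)
        else:
            clusters.append([p])
    return clusters  # each inner list is a beta-cluster

def merge(beta_clusters, eps=1.0):
    """Merge beta-clusters whose 1-D ranges come within eps of each other."""
    merged = []
    for c in sorted(beta_clusters, key=min):
        if merged and min(c) - max(merged[-1]) <= eps:
            merged[-1].extend(c)
        else:
            merged.append(list(c))
    return merged

parts = partition([0.0, 0.5, 10.0, 10.4, 0.9], num_machines=2)
betas = [c for part in parts for c in local_cluster(part)]
final = merge(betas)
# two final clusters: one near 0, one near 10
```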

SnI--Sample and Ignore

• sample the data and cluster the sample to obtain clusters
• ignore the points that fall inside those clusters' regions
• run ParC on the remaining points
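The sample-and-ignore idea can be sketched as follows, again in 1-D; `sample_and_ignore` and the interval-based notion of a cluster's region are simplifying assumptions for illustration:

```python
import random

def sample_and_ignore(points, clusterer, sample_rate=0.2, seed=0):
    """SnI phase 1: cluster a small sample, then ignore the points that
    already fall inside the sampled clusters' regions (1-D intervals here)."""
    rng = random.Random(seed)
    sample = [p for p in points if rng.random() < sample_rate]
    clusters = clusterer(sample)
    regions = [(min(c), max(c)) for c in clusters if c]
    remainder = [p for p in points
                 if not any(lo <= p <= hi for lo, hi in regions)]
    return clusters, remainder  # the remainder is then handed to ParC

# Toy clusterer that treats the whole sample as one cluster:
points = [0.1 * i for i in range(100)]
clusters, remainder = sample_and_ignore(points, lambda s: [s] if s else [])
# remainder contains only points outside the sampled cluster's range
```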

COST-BASED OPTIMIZATION

• ParC Cost:

  • Map Cost

  • Shuffle Cost

  • Reduce Cost

• SnI Cost

BoW

• compute ParC Cost -> costC
• compute SnI Cost -> costCs
• if costC > costCs then clusters = result of SnI
• else clusters = result of ParC
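The selection logic can be sketched in Python; note that `parc_cost` and `sni_cost` below are toy placeholders, not the paper's actual cost equations, which model disk, network, and plug-in parameters:

```python
def parc_cost(n_points, n_reducers):
    """Toy placeholder: ParC ships every point over the network once."""
    net_cost_per_point = 1.0
    return n_points * net_cost_per_point

def sni_cost(n_points, n_reducers, ignored_fraction=0.8):
    """Toy placeholder: SnI pays an extra sampling pass (local I/O) but
    ships only the points not covered by the sample's clusters."""
    sampling_pass = 0.2 * n_points       # extra read, cheaper than network
    shipped = n_points * (1.0 - ignored_fraction)
    return sampling_pass + shipped * 1.0  # 1.0 = network cost per point

def bow(n_points, n_reducers, **sni_params):
    """Pick the cheaper strategy, as in the pseudocode above."""
    cost_c = parc_cost(n_points, n_reducers)
    cost_cs = sni_cost(n_points, n_reducers, **sni_params)
    return "SnI" if cost_c > cost_cs else "ParC"
```

With these placeholder costs, SnI wins when the sample's clusters cover most of the data (few points shipped), and ParC wins when little can be ignored.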

EXPERIMENTAL RESULTS

• experiments run on Hadoop
• M45: 1.5 PB storage, 1 TB memory
• DISC/Cloud: 512 cores, 64 machines, 1 TB RAM, 256 TB disk storage

Quality of results

• average precision and recall of the clustering
• synthetic data

Scale-up results

• increasing the number of reducers

Scale-up results

• increasing the data size, with r = 128, m = 700

Accuracy of our cost equations

Thanks for your time!