Clustering Very Large Multi-dimensional Datasets with MapReduce (蔡跳)

INTRODUCTION

• large datasets of moderate-to-high dimensional elements
• serial subspace clustering algorithms do not scale to TB- and PB-sized data, e.g., Twitter crawl: > 12 TB; Yahoo! operational data: 5 PB
• approach: take a fast, scalable serial algorithm and make it run efficiently in parallel

INTRODUCTION

• bottlenecks: I/O and network
• Best of both Worlds (BoW) automatically spots the bottleneck and picks a good strategy
• serial clustering methods are used as a plugged-in clustering subroutine

RELATED WORK

• MapReduce: a simplified distributed programming model for parallel computation over large-scale datasets

• mapper, reducer
• map stage: reads the input file and outputs (key, value) pairs
• shuffle stage: transfers the mappers' output to the reducers based on the key
• reduce stage: processes the received pairs and outputs the final result
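The three stages above can be sketched with a toy single-machine word count in Python; this is a minimal illustration of the dataflow only, not Hadoop's actual API:

```python
from collections import defaultdict

def map_stage(lines):
    """Map stage: read input records and emit (key, value) pairs."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_stage(pairs):
    """Shuffle stage: group the mappers' output by key for the reducers."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_stage(groups):
    """Reduce stage: process each key's values and output the final result."""
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_stage(shuffle_stage(map_stage(["a b a", "b c"])))
# counts == {"a": 2, "b": 2, "c": 1}
```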

BoW

• ParC: partition the data, then merge the per-partition results
• SnI: sample first, paying extra I/O to reduce network cost
• trade-off between the two

ParC--Parallel Clustering

• partition the data and assign the partitions to different machines
• each machine clusters its assigned data; the resulting clusters are called β-clusters
• merge the β-clusters to obtain the final clusters
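A minimal single-process sketch of the ParC pipeline, using 1-D points and a trivial threshold-based stand-in for the plugged-in serial clusterer; `partition`, `local_cluster`, and `merge` are illustrative names, not the paper's code:

```python
def partition(points, num_machines):
    """Assign each point to a machine (here: simple round-robin)."""
    parts = [[] for _ in range(num_machines)]
    for i, p in enumerate(points):
        parts[i % num_machines].append(p)
    return parts

def local_cluster(points, eps=1.0):
    """Stand-in serial clusterer: group sorted 1-D points within eps."""
    clusters = []
    for p in sorted(points):
        if clusters and p - clusters[-1][-1] <= eps:
            clusters[-1].append(p)
        else:
            clusters.append([p])
    return clusters  # each inner list is a beta-cluster

def merge(beta_clusters, eps=1.0):
    """Merge beta-clusters whose 1-D ranges come within eps of each other."""
    merged = []
    for c in sorted(beta_clusters, key=min):
        if merged and min(c) - max(merged[-1]) <= eps:
            merged[-1].extend(c)
        else:
            merged.append(list(c))
    return merged

parts = partition([0.0, 0.5, 10.0, 10.4, 0.9], num_machines=2)
betas = [c for part in parts for c in local_cluster(part)]
final = merge(betas)
# two final clusters: one near 0, one near 10
```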

SnI--Sample and Ignore

• sample the data and cluster the sample to obtain clusters
• ignore the points that fall inside those clusters' regions
• run ParC on the remaining points
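The sample-and-ignore idea can be sketched as follows, again in 1-D; `sample_and_ignore` and the interval-based notion of a cluster's region are simplifying assumptions for illustration:

```python
import random

def sample_and_ignore(points, clusterer, sample_rate=0.2, seed=0):
    """SnI phase 1: cluster a small sample, then ignore the points that
    already fall inside the sampled clusters' regions (1-D intervals here)."""
    rng = random.Random(seed)
    sample = [p for p in points if rng.random() < sample_rate]
    clusters = clusterer(sample)
    regions = [(min(c), max(c)) for c in clusters if c]
    remainder = [p for p in points
                 if not any(lo <= p <= hi for lo, hi in regions)]
    return clusters, remainder  # the remainder is then handed to ParC

# Toy clusterer that treats the whole sample as one cluster:
points = [0.1 * i for i in range(100)]
clusters, remainder = sample_and_ignore(points, lambda s: [s] if s else [])
# remainder contains only points outside the sampled cluster's range
```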

COST-BASED OPTIMIZATION

• ParC Cost:

  • Map Cost

  • Shuffle Cost

  • Reduce Cost

• SnI Cost

BoW

• compute ParC Cost -> costC
• compute SnI Cost -> costCs
• if costC > costCs then clusters = result of SnI
• else clusters = result of ParC
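The selection logic can be sketched in Python; note that `parc_cost` and `sni_cost` below are toy placeholders, not the paper's actual cost equations, which model disk, network, and plug-in parameters:

```python
def parc_cost(n_points, n_reducers):
    """Toy placeholder: ParC ships every point over the network once."""
    net_cost_per_point = 1.0
    return n_points * net_cost_per_point

def sni_cost(n_points, n_reducers, ignored_fraction=0.8):
    """Toy placeholder: SnI pays an extra sampling pass (local I/O) but
    ships only the points not covered by the sample's clusters."""
    sampling_pass = 0.2 * n_points       # extra read, cheaper than network
    shipped = n_points * (1.0 - ignored_fraction)
    return sampling_pass + shipped * 1.0  # 1.0 = network cost per point

def bow(n_points, n_reducers, **sni_params):
    """Pick the cheaper strategy, as in the pseudocode above."""
    cost_c = parc_cost(n_points, n_reducers)
    cost_cs = sni_cost(n_points, n_reducers, **sni_params)
    return "SnI" if cost_c > cost_cs else "ParC"
```

With these placeholder costs, SnI wins when the sample's clusters cover most of the data (few points shipped), and ParC wins when little can be ignored.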

EXPERIMENTAL RESULTS

• experiments run on Hadoop
• M45: 1.5 PB storage, 1 TB memory
• DISC/Cloud: 512 cores, 64 machines, 1 TB RAM, 256 TB disk storage

Quality of results

• average precision and recall of the clustering
• synthetic data

Scale-up results

• increasing the number of reducers

Scale-up results

• increasing the data size, with r = 128, m = 700

Accuracy of our cost equations

Thanks for your time!