Clustering Very Large Multi-dimensional Datasets with MapReduce (presenter: 蔡跳)
TRANSCRIPT
INTRODUCTION
• large dataset of moderate-to-high dimensional elements
• serial subspace clustering algorithms
• TB- and PB-scale data
• e.g., Twitter crawl: > 12TB; Yahoo! operational data: 5PB
• Approach: take a fast, scalable serial algorithm and make it run efficiently in parallel
INTRODUCTION
• bottlenecks: I/O, network
• Best of both Worlds (BoW) automatically spots the bottleneck and picks a good strategy
• any serial clustering method can be used as a plugged-in clustering subroutine
RELATED WORK
• MapReduce: a simplified distributed programming model for parallel computation over large-scale datasets
• mappers, reducers
• map stage: reads the input file and outputs (key, value) pairs
• shuffle stage: transfers the mappers' output to the reducers based on the key
• reduce stage: processes the received pairs and outputs the final result
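The three stages above can be sketched in a minimal, single-process form; the function and variable names here are illustrative, not Hadoop APIs:

```python
from collections import defaultdict

def map_stage(records, mapper):
    """Map stage: turn each input record into (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs

def shuffle_stage(pairs):
    """Shuffle stage: group the mappers' output by key for the reducers."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_stage(groups, reducer):
    """Reduce stage: process each key's values and emit the final result."""
    return {key: reducer(key, values) for key, values in groups.items()}

# Classic word count: the mapper emits (word, 1), the reducer sums the counts.
def word_mapper(line):
    return [(word, 1) for word in line.split()]

def sum_reducer(key, values):
    return sum(values)

lines = ["big data big clusters", "big data"]
counts = reduce_stage(shuffle_stage(map_stage(lines, word_mapper)), sum_reducer)
print(counts)  # {'big': 3, 'data': 2, 'clusters': 1}
```

In a real Hadoop job the shuffle is performed by the framework over the network, which is exactly the bottleneck BoW's cost model accounts for.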
BoW
• compute ParC cost -> costC
• compute SnI cost -> costCs
• if costC > costCs then clusters = result of SnI
• else clusters = result of ParC
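The selection rule above can be sketched as follows; the cost estimators and the two clustering plans are passed in as callables, since the paper's actual cost formulas are not reproduced here:

```python
def bow(data, parc, sni, estimate_parc_cost, estimate_sni_cost):
    """Sketch of BoW's plan selection (names are illustrative).

    parc / sni: the two candidate clustering plans as callables.
    estimate_*_cost: hypothetical cost estimators standing in for the
    paper's I/O- and network-aware cost model.
    """
    cost_c = estimate_parc_cost(data)   # compute ParC cost -> costC
    cost_cs = estimate_sni_cost(data)   # compute SnI cost -> costCs
    if cost_c > cost_cs:
        return sni(data)                # SnI is cheaper: use its result
    return parc(data)                   # otherwise use ParC's result

# Toy usage with stub plans and fixed costs: SnI is estimated cheaper here.
plan = bow([1, 2, 3],
           parc=lambda d: "ParC result",
           sni=lambda d: "SnI result",
           estimate_parc_cost=lambda d: 10.0,
           estimate_sni_cost=lambda d: 4.0)
print(plan)  # SnI result
```

The key design point is that BoW never runs both plans: it compares the two cost estimates up front and executes only the cheaper one.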
EXPERIMENTAL RESULTS
• experiments run on Hadoop
• M45: 1.5PB storage, 1TB memory
• DISC/Cloud: 512 cores, 64 machines, 1TB RAM, 256TB disk storage