cloudclustering: toward an iterative data processing pattern on the cloud

24
CloudClustering Ankur Dave*, Wei Lu†, Jared Jackson†, Roger Barga† *UC Berkeley †Microsoft Research Toward an Iterative Data Processing Pattern on the Cloud

Upload: ankur-dave

Post on 19-Jun-2015

337 views

Category:

Technology


3 download

TRANSCRIPT

Page 1: CloudClustering: Toward an Iterative Data Processing Pattern on the Cloud

CloudClustering

Ankur Dave*, Wei Lu†, Jared Jackson†, Roger Barga†

*UC Berkeley†Microsoft Research

Toward an Iterative Data Processing Pattern on the Cloud

Page 2: CloudClustering: Toward an Iterative Data Processing Pattern on the Cloud

Wei Lu Jared Jackson Roger BargaAnkur Dave

Page 3: CloudClustering: Toward an Iterative Data Processing Pattern on the Cloud

Background

MapReduce and its successors reimplement communication and storage

But cloud services like EC2 and Azure natively provide these utilities

Can we build an efficient distributed application using only the primitives of the cloud?

Page 4: CloudClustering: Toward an Iterative Data Processing Pattern on the Cloud

Design Goals

• Efficient • Fault tolerant• Cloud-native

Page 5: CloudClustering: Toward an Iterative Data Processing Pattern on the Cloud

Outline

Windows AzureK-Means clustering algorithmCloudClustering architecture– Data locality– Buddy system

EvaluationRelated work

Page 6: CloudClustering: Toward an Iterative Data Processing Pattern on the Cloud

Windows AzureVM instances

Blob storage

Queue storage

Page 7: CloudClustering: Toward an Iterative Data Processing Pattern on the Cloud

Outline

Windows AzureK-Means clustering algorithmCloudClustering architecture– Data locality– Buddy system

EvaluationRelated work

Page 8: CloudClustering: Toward an Iterative Data Processing Pattern on the Cloud

K-Means algorithm

Iteratively groups points into clusters

Centroids

Initialize

Assign points to centroids

Recalculate centroids

Points moved?

Page 9: CloudClustering: Toward an Iterative Data Processing Pattern on the Cloud

K-Means algorithmInitialize

Partition work

Recalculate centroids

Points moved?

Assign points to centroids

Compute partial sums

Page 10: CloudClustering: Toward an Iterative Data Processing Pattern on the Cloud

Outline

Windows AzureK-Means clustering algorithmCloudClustering architecture– Data locality– Buddy system

EvaluationRelated work

Page 11: CloudClustering: Toward an Iterative Data Processing Pattern on the Cloud

Architecture

Centroids:

Initialize

Partition work

Recalculate centroids

Points moved?

Assign points to centroids

Compute partial sums

Page 12: CloudClustering: Toward an Iterative Data Processing Pattern on the Cloud

Outline

Windows AzureK-Means clustering algorithmCloudClustering architecture– Data locality– Buddy system

EvaluationRelated work

Page 13: CloudClustering: Toward an Iterative Data Processing Pattern on the Cloud

Data locality

Multiple-queue pattern unlocks data localityTradeoff: Complicates fault tolerance

Single-queue pattern is naturally fault-tolerantProblem: Not suitable for iterative computation

Page 14: CloudClustering: Toward an Iterative Data Processing Pattern on the Cloud

Outline

Windows AzureK-Means clustering algorithmCloudClustering architecture– Data locality– Buddy system

EvaluationRelated work

Page 15: CloudClustering: Toward an Iterative Data Processing Pattern on the Cloud

Handling failureInitialize

Partition work

Recalculate centroids

Points moved?

Assign points to centroids

Compute partial sums

Stall? Buddy system

Page 16: CloudClustering: Toward an Iterative Data Processing Pattern on the Cloud

Buddy system

Buddy system provides distributed fault detection and recovery

Page 17: CloudClustering: Toward an Iterative Data Processing Pattern on the Cloud

Buddy system

Spreading buddies across Azure fault domains provides increased resilience to

simultaneous failure

1

2

3

Fault domain

Page 18: CloudClustering: Toward an Iterative Data Processing Pattern on the Cloud

Buddy system

Cascaded failure detection reduces communication and improves resilience

vs.

Page 19: CloudClustering: Toward an Iterative Data Processing Pattern on the Cloud

Outline

Windows AzureK-Means clustering algorithmCloudClustering architecture– Data locality– Buddy system

EvaluationRelated work

Page 20: CloudClustering: Toward an Iterative Data Processing Pattern on the Cloud

Evaluation

Linear speedup with instance count

Page 21: CloudClustering: Toward an Iterative Data Processing Pattern on the Cloud

Evaluation

Sublinear speedup with instance sizeReason: I/O bandwidth doesn’t scale

Page 22: CloudClustering: Toward an Iterative Data Processing Pattern on the Cloud

Outline

Windows AzureK-Means clustering algorithmCloudClustering architecture– Data locality– Buddy system

EvaluationRelated work

Page 23: CloudClustering: Toward an Iterative Data Processing Pattern on the Cloud

Related work

• Existing frameworks (MapReduce, Dryad, Spark, MPI, …) all support K-Means, but reimplement reliable communication

• AzureBlast built directly on cloud services, but algorithm is not iterative

Page 24: CloudClustering: Toward an Iterative Data Processing Pattern on the Cloud

Conclusions

• CloudClustering shows that it's possible to build efficient, resilient applications using only the common cloud services

• Multiple-queue pattern unlocks data locality• Buddy system provides fault tolerance