Big-Data Computing on the Cloud: an Algorithmic Perspective
Andrea Pietracaprina
Dept. of Information Engineering (DEI), University of Padova
[email protected]
Supported in part by MIUR-PRIN Project Amanda: Algorithmics for MAssive and Networked DAta
Roma, May 20, 2016 Data Driven Innovation 1
OUTLINE
From supercomputing to cloud computing
Paradigm shift
MapReduce
Big data algorithmics
Coresets
Decompositions of large networks
Conclusions
From Supercomputing to Cloud Computing
Supercomputing (‘70s – present)
Tianhe-2 (PRC)
Algorithm design: full knowledge and exploitation of the platform architecture
• Low productivity, high costs
• Grand Challenges
• Maximum performance (exascale in 2018?)
• Massively parallel systems
Cluster era (‘90s – present)
Algorithm design: exploitation of architectural features abstracted by a few parameters
• Higher productivity and lower costs
• Wide range of commercial/scientific applications
• Good cost/performance tradeoffs
• Distributed systems (e.g., clusters, grids)
[Diagram: cluster of machines interconnected by a network (bandwidth/latency)]
Cloud Computing (‘00s – present)
Algorithm design: architecture-oblivious, with a data-centric perspective
• Novel computing environments: e.g., Hadoop, Spark, Google Dataflow
• Popular for big-data applications
• Flexibility of usage, low costs, reliability
• Infrastructure/Software as a Service (IaaS, SaaS)
[Diagram: INPUT DATA → Map – Shuffle – Reduce → OUTPUT DATA]
Paradigm Shift
Traditional Algorithmics                                     | Big-Data Algorithmics
Best balance between computation, parallelism, communication | Few scans of the whole input data
Machine-conscious design                                     | Machine-oblivious design
Noiseless, static input data                                 | Noisy, dynamic input data
Polynomial complexity                                        | (Sub-)linear complexity
MAPREDUCE
MapReduce: single round
[Diagram: one MapReduce round — INPUT → MAPPERS → SHUFFLE → REDUCERS → OUTPUT]
MAPPER: computation on individual data items
REDUCER: computation on small subsets of the input
MapReduce: multiround
Key Performance Indicators (input size N):
• Memory requirements per reducer: << N
• #Rounds (i.e., #shuffles): 1, 2, …
• Aggregate space and communication: ≈ N
[Diagram: multiround MapReduce — INPUT → ROUND 1 (mappers → shuffle → reducers) → ROUND 2 → … → ROUND r → OUTPUT]
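The Map–Shuffle–Reduce structure of a round can be simulated in ordinary Python; the sketch below (word count as the illustrative task, chosen by the editor, not taken from the talk) shows the three phases of one round:

```python
from collections import defaultdict

def map_phase(records, mapper):
    # MAP: apply the mapper independently to each input item.
    return [kv for rec in records for kv in mapper(rec)]

def shuffle_phase(pairs):
    # SHUFFLE: group intermediate (key, value) pairs by key.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_phase(groups, reducer):
    # REDUCE: each reducer sees only the (small) group of one key.
    return {k: reducer(k, vs) for k, vs in groups.items()}

# Word count: the mapper emits (word, 1); the reducer sums the ones.
lines = ["big data on the cloud", "big data algorithmics"]
counts = reduce_phase(
    shuffle_phase(map_phase(lines, lambda line: [(w, 1) for w in line.split()])),
    lambda word, ones: sum(ones),
)
print(counts["big"])  # 2
```

Chaining several such calls, with the output of one round fed to the mappers of the next, mirrors the multiround scheme above.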
Big Data Algorithmics
[Diagram: INPUT reduced to a small CORESET]
Coreset: a subset of the data (a summary) that preserves the characteristics of the whole input while filtering out redundancy
General 2-round MapReduce approach
Round 1: partition the input into small subsets and extract a partial coreset from each
Round 2: perform the analysis on the aggregation of the partial coresets
[Diagram: INPUT → partial coresets → aggregate coreset]
CHALLENGE: composability of coresets
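The two-round scheme can be phrased as a small generic driver. The sketch below is the editor's (all names hypothetical); to show the structure it uses a trivially composable coreset, the local maximum of each subset:

```python
def two_round_coreset(data, num_subsets, extract_coreset, solve):
    # Round 1: arbitrary (strided) partition; one partial coreset per subset.
    subsets = [data[i::num_subsets] for i in range(num_subsets)]
    partials = [extract_coreset(s) for s in subsets]  # independent -> parallel reducers
    # Round 2: aggregate the partial coresets and solve on the small union.
    aggregate = [x for c in partials for x in c]
    return solve(aggregate)

# Toy instantiation: for computing a maximum, the local max of each
# subset is a (trivially composable) coreset.
data = list(range(1000))
result = two_round_coreset(data, 8, lambda s: [max(s)], max)
print(result)  # 999
```

Composability is exactly the property that makes Round 2 correct: the union of the partial coresets must itself be a coreset of the whole input.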
Example: diversity maximization
Goal: find the k most diverse data objects
Applications: recommendation systems, search engines
Big Data Algorithmics: coresets
MapReduce Solution
Round 1:
• Partition the input data arbitrarily
• In each subset:
  – run a k'-clustering based on similarity (k' > k)
  – pick one representative per cluster (→ partial coreset)
[Diagram: subset of the partition → k'-clustering → partial coreset]
N.B. For enhanced accuracy, it is crucial to fix k' > k
Round 2:
• Aggregate the partial coresets
• Compute the output on the aggregate coreset
[Diagram: partial coresets → aggregate coreset → OUTPUT]
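To make the two rounds concrete, here is a small sketch in plain Python. The concrete choices are the editor's, not specified in the slides: Euclidean distance, farthest-first traversal as the k'-clustering in Round 1 and as the final solver, and minimum pairwise distance as the diversity measure:

```python
import math
import random

def farthest_first(points, k):
    # Gonzalez's farthest-first traversal: pick k well-spread points.
    reps = [points[0]]
    while len(reps) < k:
        reps.append(max(points, key=lambda p: min(math.dist(p, r) for r in reps)))
    return reps

def diversity(points):
    # Remote-edge diversity: minimum pairwise distance of the chosen set.
    return min(math.dist(p, q) for i, p in enumerate(points) for q in points[i + 1:])

def two_round_diversity(data, k, k_prime, num_subsets):
    # Round 1: arbitrary partition; each subset keeps k' > k representatives.
    subsets = [data[i::num_subsets] for i in range(num_subsets)]
    partials = [farthest_first(s, k_prime) for s in subsets]
    # Round 2: solve on the aggregate coreset (small enough for one reducer).
    aggregate = [p for c in partials for p in c]
    return farthest_first(aggregate, k)

random.seed(0)
data = [(random.random(), random.random()) for _ in range(2000)]
sol = two_round_diversity(data, k=8, k_prime=32, num_subsets=4)
print(len(sol), round(diversity(sol), 3))
```

Taking k' > k in Round 1 is what keeps enough candidates in each partial coreset for the final selection to remain accurate.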
[Diagram: overall scheme — Round 1: INPUT → partial coresets; Round 2: aggregate coreset → OUTPUT]
Experiments:
• N = 64000 data objects
• Seek the k = 64 most diverse ones
• Final coreset size: in the range [2–1024]·k
• Measure: accuracy of the solution
• 4 diversity measures
N.B. The same approach can be used in a streaming setting
Analysis of large networks in MapReduce must avoid:
• Long traversals
• Superlinear complexities
Known exact algorithms often do not meet these criteria
Network decomposition can provide a concise summary of network characteristics
Example: network diameter
Goal: determine the maximum distance between any two nodes
Applications: social networks, internet/web, linguistics, biology
[Diagram: nodes A and B at maximum distance in a network]
MapReduce Solution
• Cluster the network into few regions with small radius R, around random nodes
• R rounds
• Network summary: one node per region
• Determine overlay network of selected nodes
• Few rounds
• Compute diameter of overlay network
• Adjust for radius of original regions
• 1 round
N.B. The overlay network is a good summary of the input network; its size can be chosen to fit the memory constraints of reducers
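The three steps can be sketched sequentially in plain Python (an editor's sketch on an adjacency-list graph: multi-source BFS stands in for the round-by-round MapReduce growth, and all function names are assumptions):

```python
import random
from collections import deque

def bfs_dist(adj, src):
    # Standard BFS distances from src.
    dist, q = {src: 0}, deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def cluster(adj, centers):
    # Grow all clusters in parallel (multi-source BFS): each node joins
    # the cluster of its closest center; also track the max radius R.
    owner, dist, q = {}, {}, deque()
    for c in centers:
        owner[c], dist[c] = c, 0
        q.append(c)
    radius = 0
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in owner:
                owner[v], dist[v] = owner[u], dist[u] + 1
                radius = max(radius, dist[v])
                q.append(v)
    return owner, radius

def approx_diameter(adj, num_centers, seed=0):
    random.seed(seed)
    centers = random.sample(list(adj), num_centers)
    owner, radius = cluster(adj, centers)
    # Overlay: one node per region, linked when two regions touch.
    overlay = {c: set() for c in centers}
    for u in adj:
        for v in adj[u]:
            if owner[u] != owner[v]:
                overlay[owner[u]].add(owner[v])
    # Diameter of the small overlay, adjusted for the region radii.
    d_overlay = max(max(bfs_dist(overlay, c).values()) for c in centers)
    return d_overlay + 2 * radius

# Toy example: a path of 100 nodes (true diameter 99).
n = 100
adj = {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}
est = approx_diameter(adj, num_centers=10)
print(est)
```

The result is an estimate, not the exact diameter: the approximation quality depends on how the region radii compare to the true diameter.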
Experiments: 16-node cluster, 10Gbit Ethernet, Apache Spark
Network                               No. Nodes   No. Links   Time (s)   Rounds   Error
Roads-USA                             24M         29M         158        74       26%
Twitter                               42M         1.5G        236        5        19%
Artificial benchmarks (scalability)   500M        8G          6000       5        30%
(10K nodes in the overlay network)
Efficient network partitioning
• Progressive node sampling
• Local cluster growth from sampled nodes
• #rounds = #cluster growing steps
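The steps above can be sketched as follows (an editor's sketch in plain Python; the fixed batch size and helper names are assumptions, and a refined schedule could grow the sample geometrically):

```python
import random
from collections import deque

def progressive_partition(adj, batch=4, seed=0):
    # Repeatedly sample new centers among uncovered nodes, then grow all
    # clusters one hop per round; #rounds = #cluster-growing steps.
    rng = random.Random(seed)
    owner, frontier, rounds = {}, deque(), 0
    uncovered = set(adj)
    while uncovered:
        # Progressive sampling: a fresh batch of centers per round.
        for c in rng.sample(sorted(uncovered), min(batch, len(uncovered))):
            owner[c] = c
            uncovered.discard(c)
            frontier.append(c)
        # One cluster-growing step: expand every frontier node one hop.
        rounds += 1
        next_frontier = deque()
        while frontier:
            u = frontier.popleft()
            for v in adj[u]:
                if v in uncovered:
                    owner[v] = owner[u]
                    uncovered.discard(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return owner, rounds

# Toy example: a 10x10 grid graph.
adj = {(x, y): [(x + dx, y + dy) for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
                if 0 <= x + dx < 10 and 0 <= y + dy < 10]
       for x in range(10) for y in range(10)}
owner, rounds = progressive_partition(adj)
print(len(set(owner.values())), rounds)
```

Sampling progressively keeps the number of clusters small on well-connected graphs while still bounding the number of growing steps, i.e., the number of MapReduce rounds.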
Coping with uncertainty
Each link exists with a given probability
Applications: biology, social network analysis
• Network partitioning strategy suitable for this scenario
• Cluster = region connected with high probability
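A standard way to reason about such graphs (not detailed in the slides, added by the editor) is Monte Carlo sampling of edge realizations; the sketch below estimates the probability that two nodes are connected:

```python
import random
from collections import deque

def connected_in_sample(edges, probs, s, t, rng):
    # Draw one realization: keep each edge independently with its probability.
    adj = {}
    for (u, v), p in zip(edges, probs):
        if rng.random() < p:
            adj.setdefault(u, []).append(v)
            adj.setdefault(v, []).append(u)
    # BFS from s in the realized graph.
    seen, q = {s}, deque([s])
    while q:
        u = q.popleft()
        for v in adj.get(u, []):
            if v not in seen:
                seen.add(v)
                q.append(v)
    return t in seen

def connection_probability(edges, probs, s, t, trials=2000, seed=0):
    # Fraction of sampled realizations in which s and t are connected.
    rng = random.Random(seed)
    hits = sum(connected_in_sample(edges, probs, s, t, rng) for _ in range(trials))
    return hits / trials

# Toy example: a reliable triangle plus one weak pendant edge.
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
probs = [0.9, 0.9, 0.9, 0.1]
p_close = connection_probability(edges, probs, 0, 1)
p_far = connection_probability(edges, probs, 0, 3)
print(p_close, p_far)
```

A cluster in the sense above is then a region whose nodes have high pairwise connection probability, which is exactly what this estimator measures.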
Example: identification of protein complexes from Protein-Protein Interaction (PPI) networks
• The PPI network is viewed as an uncertain network
• Hypothesis: protein complex ≈ region with high connection probability
• Traditional general partitioning approaches are slowed down by uncertainty
Experiments show the effectiveness of the approach
CONCLUSIONS
Design of big-data algorithms (on clouds) entails a paradigm shift:
• Data-centric view
• Handling size through summarization
• Giving up exact solutions
• Coping with noisy/unreliable data
References
M. Ceccarello, A. Pietracaprina, G. Pucci, E. Upfal: Space and Time Efficient Parallel Graph Decomposition, Clustering, and Diameter Approximation. ACM SPAA 2015.
M. Ceccarello, A. Pietracaprina, G. Pucci, E. Upfal: A Practical Parallel Algorithm for Diameter Approximation of Massive Weighted Graphs. IEEE IPDPS 2016.
M. Ceccarello, A. Pietracaprina, G. Pucci, E. Upfal: MapReduce and Streaming Algorithms for Diversity Maximization in Metric Spaces of Bounded Doubling Dimension. arXiv:1605.05590, 2016.
M. Ceccarello, C. Fantozzi, A. Pietracaprina, G. Pucci, F. Vandin: Clustering in Uncertain Graphs. Work in progress, 2016.