
Space-Round Tradeoffs for MapReduce Computations
G. Pucci, DEI, University of Padova

Big-Data Computing on the Cloud: an Algorithmic Perspective

Andrea Pietracaprina
Dept. of Information Engineering (DEI), University of Padova, Italy
andrea.pietracaprina@unipd.it

Supported in part by MIUR-PRIN Project AMANDA: Algorithmics for MAssive and Networked DAta
Roma, May 20, 2016, Data Driven Innovation


OUTLINE

From supercomputing to cloud computing
Paradigm shift
MapReduce
Big data algorithmics:
  Coresets
  Decompositions of large networks
Conclusions


From Supercomputing to Cloud Computing

Supercomputing (70s to present)
Tianhe-2 (PRC)
Algorithm design: full knowledge and exploitation of the platform architecture
Low productivity, high costs
Grand Challenges
Maximum performance (exascale in 2018?)
Massively parallel systems

Supercomputers in fact date back to the 40s, but here we consider the modern era, or "Cray era".

Grand Challenges: automotive/aerospace, weather, energy, biology, neuroscience. High costs for infrastructure, maintenance, and software.

From Supercomputing to Cloud Computing

Cluster era (90s to present)
Distributed systems (e.g., clusters, grids)
Network (bandwidth/latency)
Algorithm design: exploitation of architectural features abstracted by a few parameters
Higher productivity and lower costs
Wide range of commercial/scientific applications
Good cost/performance tradeoffs

High-level algorithmic and programming models yield higher productivity and portability. Most existing clusters offer provisions for high availability and fault tolerance, and also provide for load balancing.

From Supercomputing to Cloud Computing

Cloud Computing (00s to present)
Algorithm design: architecture-oblivious design, data-centric perspective
Novel computing environments: e.g., Hadoop, Spark, Google DF
Popular for big-data applications
Flexibility of usage, low costs, reliability
Infrastructure and Software as Services (IaaS, SaaS)

[Diagram: INPUT DATA, Map, Shuffle, Reduce, OUTPUT DATA]

Functional paradigms: the placement of data and of the computation is not under the programmer's control; the focus is on the data and on its transformations.

PARADIGM SHIFT

Traditional Algorithmics  ->  Big-Data Algorithmics
Best balance between computation, parallelism, communication  ->  Few scans of the whole input data
Machine-conscious design  ->  Machine-oblivious design
Noiseless, static input data  ->  Noisy, dynamic input data
Polynomial complexity  ->  (Sub-)linear complexity

The linearity constraint often implies giving up existing exact strategies in favor of novel approximate ones.

MAPREDUCE

MapReduce: single round
[Diagram: the INPUT feeds a set of MAPPERs; a SHUFFLE phase routes their output to the REDUCERs, which produce the OUTPUT]

MAPPER: computation on individual data items.
REDUCER: computation on small subsets of the input.
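These two primitives can be illustrated with a minimal simulation of one map/shuffle/reduce round in plain Python. This is a sketch of ours, instantiated here as word count; all function names are illustrative, not from the talk:

```python
from collections import defaultdict

def mapper(line):
    # MAPPER: works on an individual data item (one line of text)
    for word in line.split():
        yield (word.lower(), 1)

def reducer(key, values):
    # REDUCER: works on the small subset of pairs sharing one key
    return (key, sum(values))

def map_reduce(data, mapper, reducer):
    # SHUFFLE: group the intermediate pairs by key, then reduce each group
    groups = defaultdict(list)
    for item in data:
        for key, value in mapper(item):
            groups[key].append(value)
    return [reducer(k, vs) for k, vs in groups.items()]

lines = ["big data on the cloud", "big data algorithmics"]
print(sorted(map_reduce(lines, mapper, reducer)))
# [('algorithmics', 1), ('big', 2), ('cloud', 1), ('data', 2), ('on', 1), ('the', 1)]
```

In a real framework the shuffle is the only global communication step; here it is simulated by the in-memory grouping.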

Hadoop: industry standard, with third-party infrastructure support (Amazon EMR, Cloudera). Slow performance on communication-intensive or iterative algorithms, because of the reliance on HDFS for communication.
Spark: emerging industry standard (e.g., Amazon, IBM, Groupon, Yahoo!). Good performance on iterative and communication-intensive algorithms.

MAPREDUCE

MapReduce: multiround
Key performance indicators (input size N):
Number of rounds
Memory requirements per reducer

Big Data Algorithmics: Coresets

MapReduce Solution
Round 1:
Partition the input
Compute a k'-clustering of each subset of the partition (with k' > k)
Pick one representative per cluster (partial coreset)

[Diagram: subset of the partition, k'-clustering, partial coreset]
N.B. For enhanced accuracy, it is crucial to fix k' > k

Classical IR scenario: retrieve the few documents most relevant to a user query.
Diversity maximization scenario: retrieve relevant documents that present all the different angles of a query. When the user's intent is unknown, one must guess all possible intents and present a selection of results covering all of them.
E-commerce, recommendation systems: return a "consideration set", hoping that the user is attracted to at least one object in the set. Example: Google News selection.

Big Data Algorithmics: Coresets

MapReduce Solution

Round 2:
Aggregate the partial coresets
Compute the output on the aggregate coreset

[Diagram: partial coresets, aggregate coreset, OUTPUT]
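The two rounds can be sketched sequentially in Python for a toy diversity-maximization task on 1-D points. As an illustrative assumption of ours, Gonzalez's farthest-first heuristic stands in for both the k'-clustering of Round 1 and the final selection; the talk does not fix the concrete subroutines:

```python
def farthest_first(points, k):
    # Gonzalez's greedy heuristic: start from the first point, then
    # repeatedly add the point farthest from the ones already chosen.
    chosen = [points[0]]
    while len(chosen) < min(k, len(points)):
        chosen.append(max(points, key=lambda q: min(abs(q - c) for c in chosen)))
    return chosen

def two_round_diversity(data, num_partitions, k, kprime):
    # Round 1: partition the input; build one partial coreset per subset
    # by selecting k' representatives (it is crucial that k' > k).
    parts = [data[i::num_partitions] for i in range(num_partitions)]
    partial = [farthest_first(p, kprime) for p in parts if p]
    # Round 2: aggregate the partial coresets; the aggregate is small
    # enough to be processed by a single reducer.
    aggregate = [x for coreset in partial for x in coreset]
    return farthest_first(aggregate, k)

print(two_round_diversity(list(range(100)), num_partitions=4, k=4, kprime=8))
```

On the input 0..99 the selection always contains the extremes 0 and 99, since they survive into their partitions' partial coresets.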


Big Data Algorithmics

[Diagram: INPUT, Round 1, PARTIAL CORESETS, Round 2, AGGREGATE CORESET, OUTPUT]


Big Data Algorithmics

Experiments:
N = 64000 data objects
Seek the k = 64 most diverse ones
Final coreset size: in the range [2, 1024] x k
Measure: accuracy of the solution
4 diversity measures
N.B. The same approach can be used in a streaming setting

Big Data Algorithmics: Decompositions of Large Networks

Big Data Algorithmics

Analysis of large networks in MapReduce must avoid:
Long traversals
Superlinear complexities
Known exact algorithms often do not meet these criteria.

A network decomposition can provide a concise summary of the network's characteristics.

Big Data Algorithmics

Example: network diameter
Goal: determine the maximum distance between any two nodes
Applications: social networks, internet/web, linguistics, biology
[Diagram: two nodes A and B at maximum distance]

Big Data Algorithmics

MapReduce Solution
Cluster the network into a few regions of small radius R, grown around random nodes (R rounds)

Big Data Algorithmics

MapReduce Solution
Network summary: one node per region
Determine the overlay network of the selected nodes (few rounds)

Big Data Algorithmics

MapReduce Solution
Compute the diameter of the overlay network
Adjust for the radius of the original regions (1 round)

N.B. The overlay network is a good summary of the input network; its size can be chosen to fit the memory constraints of the reducers.
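The three steps can be sketched as a sequential Python simulation on a connected graph. This is a toy sketch of ours, not the authors' parallel algorithm: clustering is done by one multi-source BFS instead of R MapReduce rounds, and the adjustment assumed here (each overlay hop spans at most 2R+1 original edges, plus R at each endpoint) yields an upper-bound estimate of the diameter:

```python
import random
from collections import deque

def bfs(adj, sources):
    # Multi-source BFS: for every reached node, record its distance and
    # the source (center) whose BFS tree claimed it first.
    dist, center = {}, {}
    queue = deque()
    for s in sources:
        dist[s], center[s] = 0, s
        queue.append(s)
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v], center[v] = dist[u] + 1, center[u]
                queue.append(v)
    return center, dist

def diameter_estimate(adj, num_centers, seed=0):
    random.seed(seed)
    centers = random.sample(list(adj), num_centers)
    # Step 1: cluster the network into regions around random centers.
    center, dist = bfs(adj, centers)
    radius = max(dist.values())
    # Step 2: overlay network with one node per region; regions sharing
    # an edge of the original network become neighbors.
    overlay = {c: set() for c in centers}
    for u in adj:
        for v in adj[u]:
            if center[u] != center[v]:
                overlay[center[u]].add(center[v])
    # Step 3: diameter of the overlay, adjusted for the regions' radius.
    overlay_diam = max(max(bfs(overlay, [c])[1].values()) for c in centers)
    return overlay_diam * (2 * radius + 1) + 2 * radius

path = {i: [j for j in (i - 1, i + 1) if 0 <= j < 10] for i in range(10)}
print(diameter_estimate(path, 3))  # at least 9, the true diameter
```

Choosing more centers shrinks the radius and tightens the estimate, at the cost of a larger overlay network, which is exactly the size/memory tradeoff mentioned above.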

Big Data Algorithmics

Experiments: 16-node cluster, 10 Gbit Ethernet, Apache Spark

Network      No. Nodes   No. Links   Time (s)   Rounds   Error
Roads-USA    24M         29M         1587       42       6%
Twitter      42M         1.5G        236        5        19%
Artificial   500M        8G          6000       5        30%

Benchmarks; scalability (10K nodes in the overlay network)

Big Data Algorithmics

Efficient network partitioning

Progressive node sampling

Local cluster growth from sampled nodes

#rounds = #cluster-growing steps
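The scheme above can be sketched as follows. This is an assumption-laden sketch of ours (the sampling schedule, here doubling each round, is illustrative): in each round, new centers are sampled among the still-uncovered nodes and every existing cluster grows by one step, so the number of rounds equals the number of cluster-growing steps:

```python
import random

def progressive_partition(adj, base_sample, seed=0):
    # Per round: (1) sample new centers among the uncovered nodes,
    # doubling the sample size each round; (2) one cluster-growing step,
    # in which every frontier node claims its uncovered neighbors.
    random.seed(seed)
    center, frontier = {}, set()
    uncovered = set(adj)
    sample_size, rounds = base_sample, 0
    while uncovered:
        rounds += 1
        new_centers = random.sample(sorted(uncovered),
                                    min(sample_size, len(uncovered)))
        for c in new_centers:
            center[c] = c
        frontier |= set(new_centers)
        uncovered -= set(new_centers)
        new_frontier = set()
        for u in frontier:
            for v in adj[u]:
                if v in uncovered:
                    center[v] = center[u]
                    uncovered.discard(v)
                    new_frontier.add(v)
        frontier = new_frontier
        sample_size *= 2
    return center, rounds

path = {i: [j for j in (i - 1, i + 1) if 0 <= j < 16] for i in range(16)}
assignment, rounds = progressive_partition(path, base_sample=1)
```

The progressive doubling keeps the early rounds cheap on well-connected networks while still guaranteeing termination on networks that are hard to cover.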

Big Data Algorithmics

Example
[Figures: progressive growth of the clusters, shown after rounds 2, 4, and 6]

Big Data Algorithmics: Coping with Uncertainty

Links exist with certain probabilities.
Applications: biology, social network analysis.
The network partitioning strategy is suitable for this scenario: cluster = region connected with high probability.

Big Data Algorithmics

Example: identification of protein complexes from Protein-Protein Interaction (PPI) networks
The PPI network is viewed as an uncertain network
Hypothesis: protein complex = region with high connection probability
Traditional, general partitioning approaches are slowed down by uncertainty
Experiments show the effectiveness of the approach
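The "region connected with high probability" criterion can be illustrated with a small Monte Carlo sketch of ours (not from the talk): sample "possible worlds" of the uncertain network, where each link is kept with its probability, and estimate the probability that two nodes end up connected:

```python
import random
from collections import deque

def connected(u, v, present_edges):
    # BFS restricted to the edges present in one sampled possible world.
    adj = {}
    for a, b in present_edges:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    seen, queue = {u}, deque([u])
    while queue:
        x = queue.popleft()
        if x == v:
            return True
        for y in adj.get(x, []):
            if y not in seen:
                seen.add(y)
                queue.append(y)
    return False

def connection_probability(edges, u, v, trials=2000, seed=0):
    # edges: {(a, b): probability that the link (a, b) exists}.
    # Monte Carlo: count the sampled worlds in which u and v are connected.
    random.seed(seed)
    hits = 0
    for _ in range(trials):
        world = [e for e, prob in edges.items() if random.random() < prob]
        hits += connected(u, v, world)
    return hits / trials
```

For instance, with edges {("A", "B"): 0.9, ("B", "C"): 0.9}, the estimated probability that A and C are connected comes out close to 0.9 * 0.9 = 0.81, so A, B, C would qualify as a high-probability region.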

CONCLUSIONS

The design of big-data algorithms (on clouds) entails a paradigm shift:
Data-centric view
Handling size through summarization
Giving up exact solutions
Coping with noisy/unreliable data

References

M. Ceccarello, A. Pietracaprina, G. Pucci, E. Upfal: Space and Time Efficient Parallel Graph Decomposition, Clustering, and Diameter Approximation. ACM SPAA 2015.

M. Ceccarello, A. Pietracaprina, G. Pucci, E. Upfal: A Practical Parallel Algorithm for Diameter Approximation of Massi
