edf2012 kostas tzouma - linking and analyzing bigdata - stratosphere

17
Analyzing and Linking Big Data with Stratosphere Kostas Tzoumas [email protected] www.user.tu-berlin.de/kostas.tzoumas /

Upload: european-data-forum

Post on 20-May-2015

863 views

Category:

Technology


0 download

TRANSCRIPT

Page 1: EDF2012   Kostas Tzouma - Linking and analyzing bigdata - Stratosphere

Analyzing and Linking Big Data with Stratosphere

Kostas Tzoumas

[email protected]/kostas.tzoumas/

Page 2: EDF2012   Kostas Tzouma - Linking and analyzing bigdata - Stratosphere

Linking and Analyzing Big Data with Stratosphere

■ Driven by cheaper storage and computation□ Cloud computing further enabling economies of scale □ Open-source software lowers barrier of entry

■ Societal and economical impact□ Scientific breakthroughs will come from data exploration□ Success in business dictated by the ability to quickly draw

insights from data signals■ Big Data Analytics□ Analytical workloads scale inversely with cost

■ Major challenge: Information marketplaces□ Data as a resource, analytics as a product□ An “AppStore” for data

Big Data

2

Page 3: EDF2012   Kostas Tzouma - Linking and analyzing bigdata - Stratosphere

Linking and Analyzing Big Data with Stratosphere

■ Data Pools as collections of diverse data sets■ Example data pools□ Evolving history of the Web□ Financial and statistical data

■ Data identification□ By assigning unique IDs to objects□ Enables data linkage

■ Data Analytics□ Integration and analytics on diverse data sources□ Using a common framework and language

The DOPA Vision

3

Page 4: EDF2012   Kostas Tzouma - Linking and analyzing bigdata - Stratosphere

Linking and Analyzing Big Data with Stratosphere

■ Sentiment and market analysis □ An SME producing consumer goods can analyze blogs and

social network streams, and link them with customer demographics to perform sentiment analysis and market research

■ Green houses□ A home buyer can find out the energy consumption and

distribution over time in houses in a particular area by linking and analyzing energy with demographic data.

■ Traffic analysis, transportation and construction planning□ Linking weather, traffic, road data

Some Scenarios

Page 5: EDF2012   Kostas Tzouma - Linking and analyzing bigdata - Stratosphere

Linking and Analyzing Big Data with Stratosphere

■ “Big Data” refers to different applications as well as more data□ Beyond traditional DW queries

■ Open-source in data management□ Enabled primarily by Hadoop popularity□ Changed research landscape

■ New systems are rethinking the complete data management stack at a massively parallel scale□ As systems mature, need to tackle hard and novel problems

Big Data Analytics

5

Page 6: EDF2012   Kostas Tzouma - Linking and analyzing bigdata - Stratosphere

Linking and Analyzing Big Data with Stratosphere

StratoSphereAbove the Clouds

STRATOSPHERE

6

Page 7: EDF2012   Kostas Tzouma - Linking and analyzing bigdata - Stratosphere

Linking and Analyzing Big Data with Stratosphere

■ Collaborative Research Project□ 3 Universities, 5 research groups in the Berlin area

■ Infrastructure for Big Data Analytics■ Bridge relational DBMSs and MapReduce worlds□ Intersection of functional languages and data parallelism□ Re-architecting data management systems for massive

parallelism■ Open-source research platform (Apache)■ Used by a variety of Universities and research institutes

Stratosphere

7

Page 8: EDF2012   Kostas Tzouma - Linking and analyzing bigdata - Stratosphere

Linking and Analyzing Big Data with Stratosphere

Stratosphere Architecture

8

...Nephele

PACT Optimizer

Compiler

$res = filter $e in $emp where $e.income > 30000;

QueryProcessor

Data pool

Data pool

Data pool

Page 9: EDF2012   Kostas Tzouma - Linking and analyzing bigdata - Stratosphere

Linking and Analyzing Big Data with Stratosphere

3 month

s,

1h reso

lution

950km,2km resolution

1100

km,

2km

reso

lutio

n

10TB

Stratosphere Use Cases

9

Page 10: EDF2012   Kostas Tzouma - Linking and analyzing bigdata - Stratosphere

Linking and Analyzing Big Data with Stratosphere

■ Stratosphere declarative front-end inspired by the IBM Jaql language

■ Extensible and flexible□ Easy to add libraries, e.g., for data linkage, cleansing, mining□ Easy to integrate in language syntax

■ Provides operators for Information Extraction and Data Linkage

■ Time as a first-class concept

The Meteor Query Language

10

Page 11: EDF2012   Kostas Tzouma - Linking and analyzing bigdata - Stratosphere

Linking and Analyzing Big Data with Stratosphere

■ Executes Nephele schedules□ DAGs of already parallelized operator instances□ Parallelization already done by PACT optimizer

■ Design decisions□ Designed to run on top of an IaaS cloud□ Predictable performance□ Scalability to 1000+ nodes with flexible fault-tolerance

■ Permits network, in-memory (both pipelined), file (materialization) channels

The Nephele Execution Engine

11

Page 12: EDF2012   Kostas Tzouma - Linking and analyzing bigdata - Stratosphere

Linking and Analyzing Big Data with Stratosphere

■ Internal Stratosphere programming model□ Also exposed to the programmer for advanced functionality

■ Dataflow, side-effect free programming model enabling massive parallelism

■ Centered around the concept of second-order functions□ Generalization of MapReduce

The PACT Programming Model

12

Map PACT Reduce PACT Match PACT CoGroup PACT

Page 13: EDF2012   Kostas Tzouma - Linking and analyzing bigdata - Stratosphere

Linking and Analyzing Big Data with Stratosphere

■ Knowledge of PACT signature permits automatic optimization ala Relational DBMSs

■ Emulates different hand-crafted MapReduce implementations

Optimization

13

Match (on pid)↑(tid, k=r*p)↑

Reduce (on tid)↑(pid=tid, r=∑ k)↑

(pid, tid, p)

Join P and A

Sum uppartial ranks

(pid, r )

A

broadcast part./sort (tid)

probeHT (pid)buildHT (pid)

p(pid, tid, p)

buildHT (pid)

(pid, r )

A

part./sort (tid)

partition (pid)

probeHT (pid)

Reduce (on tid)↑(pid=tid, r=∑ k)↑

Match (on pid)↑(tid, k=r*p)↑

p

partition (pid)

fifo

■ Enables orders of magnitude faster programs

■ Frees programmer from thinking about execution

Page 14: EDF2012   Kostas Tzouma - Linking and analyzing bigdata - Stratosphere

Linking and Analyzing Big Data with Stratosphere

Ongoing Research

14

MapUDF

MatchUDF

ReduceUDF

MapUDF

Src 1

Src 2

Sink

Match (on pid)↑(tid, k=r*p)↑

Reduce (on tid)↑(pid=tid, r=∑ k)↑

O

I(pid, tid, p)

CACHE

Join P and A

Sum uppartial ranks

(pid, r )

A

broadcast part./sort (tid)

probeHT (pid)buildHT (pid)

p

fifo

fifo

Page 15: EDF2012   Kostas Tzouma - Linking and analyzing bigdata - Stratosphere

Linking and Analyzing Big Data with Stratosphere

■ DOPA□ Bootstrapping the information economy by providing

information marketplaces and related business models□ Brings together heterogeneous data pools□ Enables easy linkage and analytics across data pools via a

flexible programming language■ Stratosphere□ Technical infrastructure for scalable analytics□ Pushes the MapReduce paradigm forward□ Focal point of several research initiatives across Europe and the

world

Summary

15

Page 16: EDF2012   Kostas Tzouma - Linking and analyzing bigdata - Stratosphere

Linking and Analyzing Big Data with Stratosphere

■ FP7 STREP (DOPA), DFG FOR, EIT (Stratosphere)■ DOPA partners□ TU Berlin, IMR, DataMarket, OKKAM, Vico, ami

■ Stratosphere partners□ TU Berlin, HU Berlin, HPI

■ EIT partners□ TU Berlin, SICS, TU Delft, Inria, U. Trento, Aalto U., STACZKI

Acknowledgments

16

Page 17: EDF2012   Kostas Tzouma - Linking and analyzing bigdata - Stratosphere

Linking and Analyzing Big Data with Stratosphere17

Thank you!

www.stratosphere.eu@stratosphere_eu

[email protected]