SAMOA: A Platform for Mining Big Data Streams (Apache BigData Europe 2015)

Page 1: SAMOA: A Platform for Mining Big Data Streams (Apache BigData Europe 2015)

SAMOA: A Platform for Mining Big Data Streams

Nicolas Kourtellis
Associate Researcher

Telefonica I+D, Barcelona

SAMOA: Scalable Advanced Massive Online Analysis

Page 2: SAMOA: A Platform for Mining Big Data Streams (Apache BigData Europe 2015)

What is Big Data?
- Search queries
- Facebook posts
- Emails
- Tweets
- Photo shares
- Clicks on ads
- …

SAMOA: Scalable Advanced Massive Online Analysis 30/09/15

Page 3: SAMOA: A Platform for Mining Big Data Streams (Apache BigData Europe 2015)

How BIG is your data?
- Volume (+ Variety)
  - Too large for RAM of single commodity server
- Velocity
  - Too fast for CPU of single commodity server

Page 4: SAMOA: A Platform for Mining Big Data Streams (Apache BigData Europe 2015)

What is the Streaming Paradigm?
- High amount of data, high speed of arrival
- Updated models at “real” time
- Potentially infinite sequence of data
- Change over time (concept drift)
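To make "change over time (concept drift)" concrete, here is a minimal sketch of one way drift could be flagged on an unbounded stream using bounded memory. It compares the error rate in the older and newer halves of a sliding window. The class name, window size, and margin are assumptions chosen for illustration; this is not a detector shipped with SAMOA.

```python
from collections import deque


class WindowDriftDetector:
    """Flags drift when the recent error rate exceeds the older one by a margin.

    Hypothetical helper for illustration only; not part of SAMOA.
    """

    def __init__(self, window=200, margin=0.2):
        # Bounded memory: once the window is full, the oldest item falls out.
        self.errors = deque(maxlen=window)
        self.margin = margin

    def update(self, error):
        """Add one 0/1 prediction error; return True if drift is suspected."""
        self.errors.append(error)
        if len(self.errors) < self.errors.maxlen:
            return False  # not enough evidence yet
        items = list(self.errors)
        half = len(items) // 2
        older = sum(items[:half]) / half
        recent = sum(items[half:]) / (len(items) - half)
        return recent - older > self.margin


def first_drift(error_stream, detector):
    """Return the 0-based position at which drift is first flagged, or None."""
    for position, error in enumerate(error_stream):
        if detector.update(error):
            return position
    return None
```

Feeding the detector a model's per-instance errors (0 for correct, 1 for wrong) lets the model be reset or adapted once the flag is raised, without ever storing the full stream.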

Page 5: SAMOA: A Platform for Mining Big Data Streams (Apache BigData Europe 2015)

Mining Big Data Streams
- Approximation algorithms:
  - Single pass, one data item at a time
  - Sub-linear space and time per data item
  - Small error with high probability
- A platform solution:
  - Support different algorithms & processing engines
  - Distributed
  - Scalable
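The three properties of approximation algorithms listed above can be illustrated with a Count-Min Sketch, a well-known frequency estimator. It is used here only as a representative example and is not claimed to be part of SAMOA. It processes each item once, uses a fixed-size table regardless of stream length, and never underestimates a count. With width about e/ε and depth about ln(1/δ), the overestimate is at most εN with probability at least 1 − δ, where N is the number of items seen. The default sizes below are arbitrary.

```python
import hashlib


class CountMinSketch:
    """Fixed-size frequency estimator for a stream of string items."""

    def __init__(self, width=272, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One salted hash per row gives `depth` different hash functions.
        digest = hashlib.blake2b(
            item.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        """Single pass: touch `depth` counters per item, then forget the item."""
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        """Collisions only inflate counters, so the minimum is the best guess."""
        return min(
            self.table[row][self._index(item, row)] for row in range(self.depth)
        )
```

Usage is a plain loop: call `add(word)` for every word in the stream, then `estimate(word)` at any time.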

Page 6: SAMOA: A Platform for Mining Big Data Streams (Apache BigData Europe 2015)

What is SAMOA?
- Scalable Advanced Massive Online Analysis
- A platform for mining big data streams
- Framework for developing new distributed stream mining algorithms
- Framework for deploying algorithms on new distributed stream processing engines
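The two framework roles above rest on separating algorithm logic from the engine that runs it. The sketch below illustrates that separation in miniature. All names (`Processor`, `Tokenizer`, `WordCounter`, `LocalEngine`) are hypothetical Python stand-ins chosen for this example and do not reproduce SAMOA's Java API.

```python
from abc import ABC, abstractmethod


class Processor(ABC):
    """A unit of algorithm logic that consumes one event at a time."""

    @abstractmethod
    def process(self, event):
        """Handle one event and return a list of events to emit downstream."""


class Tokenizer(Processor):
    def process(self, event):
        return event.lower().split()


class WordCounter(Processor):
    def __init__(self):
        self.counts = {}

    def process(self, event):
        self.counts[event] = self.counts.get(event, 0) + 1
        return []  # terminal processor: nothing to emit


class LocalEngine:
    """Runs a linear pipeline of processors in a single thread."""

    def run(self, source, pipeline):
        for event in source:
            pending = [event]
            for processor in pipeline:
                emitted = []
                for item in pending:
                    emitted.extend(processor.process(item))
                pending = emitted
```

The processors know nothing about threads, queues, or clusters. Under this design, an adapter for a distributed engine would only need to honor the same `process` contract, which is the sense in which an algorithm is written once and deployed on several engines.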

Page 7: SAMOA: A Platform for Mining Big Data Streams (Apache BigData Europe 2015)

Taxonomy

Page 8: SAMOA: A Platform for Mining Big Data Streams (Apache BigData Europe 2015)

SAMOA Architecture
- Machine Learning Algorithms
- Distributed Stream Processing Engines
  - Flink

Page 9: SAMOA: A Platform for Mining Big Data Streams (Apache BigData Europe 2015)

Why is SAMOA important?
- Program once, run everywhere
  - Reuse existing infrastructure
- Avoid deploy cycles
  - No system downtime
  - No complex backup/update process
  - No need to select update frequency

Page 10: SAMOA: A Platform for Mining Big Data Streams (Apache BigData Europe 2015)

ML Developer API

Page 11: SAMOA: A Platform for Mining Big Data Streams (Apache BigData Europe 2015)

ML Developer API

Page 12: SAMOA: A Platform for Mining Big Data Streams (Apache BigData Europe 2015)

Deployment

Page 13: SAMOA: A Platform for Mining Big Data Streams (Apache BigData Europe 2015)

Easy to get!

Page 14: SAMOA: A Platform for Mining Big Data Streams (Apache BigData Europe 2015)

Easy to get!

Page 15: SAMOA: A Platform for Mining Big Data Streams (Apache BigData Europe 2015)

Easy to get!

Page 16: SAMOA: A Platform for Mining Big Data Streams (Apache BigData Europe 2015)

Easy to test!

bin/samoa storm target/SAMOA-Storm-0.3.0-SNAPSHOT.jar "PrequentialEvaluation \
    -d /tmp/dump.csv \
    -i 1000000 -f 100000 \
    -l (classifiers.trees.VerticalHoeffdingTree -p 4 -k) \
    -s (generators.RandomTreeGenerator -r 1 -c 2 -o 10 -u 10)"
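The task named in the command, PrequentialEvaluation, follows the test-then-train scheme: each instance is first used to score the current model and only then to update it. The sketch below shows that loop in plain Python. It assumes that `-i` caps the number of instances and `-f` sets how often accuracy is sampled, which is our reading of the flags rather than something stated on the slide. The learner is a deliberately trivial stand-in, not the Vertical Hoeffding Tree.

```python
class MajorityClassLearner:
    """Predicts the most frequent label seen so far (illustrative stand-in)."""

    def __init__(self):
        self.counts = {}

    def predict(self, features):
        if not self.counts:
            return None  # nothing learned yet
        return max(self.counts, key=self.counts.get)

    def train(self, features, label):
        self.counts[label] = self.counts.get(label, 0) + 1


def prequential_evaluation(stream, learner, max_instances, sample_every):
    """Test-then-train: score each instance before the learner sees its label.

    Returns a list of (instances_seen, accuracy_so_far) samples.
    """
    correct = 0
    samples = []
    for seen, (features, label) in enumerate(stream, start=1):
        if learner.predict(features) == label:  # test first
            correct += 1
        learner.train(features, label)          # then train
        if seen % sample_every == 0:
            samples.append((seen, correct / seen))
        if seen >= max_instances:
            break
    return samples
```

Because every instance is tested before it is learned, no separate hold-out set is needed, which suits streams that may never end.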

Page 17: SAMOA: A Platform for Mining Big Data Streams (Apache BigData Europe 2015)

Case study: Decision Trees
- VHT: Vertical Hoeffding Tree*

Task parallelism

*VHT: Vertical Hoeffding Tree. N. Kourtellis, G. De Francisci Morales, A. Bifet, A. Murdopo. IEEE BigData 2016.

Page 18: SAMOA: A Platform for Mining Big Data Streams (Apache BigData Europe 2015)

Case study: VHT

Horizontal Parallelism

Page 19: SAMOA: A Platform for Mining Big Data Streams (Apache BigData Europe 2015)

Case study: VHT

Vertical Parallelism

Page 20: SAMOA: A Platform for Mining Big Data Streams (Apache BigData Europe 2015)

Benefits of Vertical Parallelism
- High number of attributes:
  - high level parallelism (e.g., documents)
- vs. task parallelism:
  - obvious parallelism observed
- vs. horizontal parallelism:
  - reduced memory usage (no model replication)
  - parallelized split computation
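The last two benefits can be sketched in code. In the simplified example below, each worker keeps statistics only for its own slice of the attributes, so nothing is replicated, and each proposes its best local split so that the gain computation is spread across workers. This is a sketch under stated assumptions: a single leaf, categorical attributes, information gain as the criterion, and round-robin attribute assignment. The class names are hypothetical and this is not the VHT implementation.

```python
import math
from collections import defaultdict


def entropy(class_counts):
    total = sum(class_counts.values())
    return -sum(
        (c / total) * math.log2(c / total) for c in class_counts.values() if c
    )


class AttributeWorker:
    """Holds sufficient statistics for a slice of the attributes only."""

    def __init__(self):
        # stats[attribute][value][label] -> count
        self.stats = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))

    def update(self, attribute, value, label):
        self.stats[attribute][value][label] += 1

    def best_local_split(self):
        """Return (gain, attribute) for the best attribute this worker owns."""
        best = (0.0, None)
        for attribute, by_value in self.stats.items():
            totals = defaultdict(int)
            for counts in by_value.values():
                for label, c in counts.items():
                    totals[label] += c
            n = sum(totals.values())
            remainder = sum(
                sum(counts.values()) / n * entropy(counts)
                for counts in by_value.values()
            )
            gain = entropy(totals) - remainder
            if gain > best[0]:
                best = (gain, attribute)
        return best


class VerticalSplitter:
    """Routes each attribute to one worker and merges their split proposals."""

    def __init__(self, parallelism):
        self.workers = [AttributeWorker() for _ in range(parallelism)]

    def observe(self, instance, label):
        for attribute, value in enumerate(instance):
            worker = self.workers[attribute % len(self.workers)]
            worker.update(attribute, value, label)

    def best_split(self):
        proposals = (w.best_local_split() for w in self.workers)
        return max(proposals, key=lambda pair: pair[0])
```

With many attributes, as in the bag-of-words document case mentioned above, each worker handles roughly 1/p of the statistics for parallelism p.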

Page 21: SAMOA: A Platform for Mining Big Data Streams (Apache BigData Europe 2015)

Vertical Hoeffding Tree
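For readers unfamiliar with Hoeffding trees, the sketch below shows the standard split test from the general Hoeffding tree literature. It is not taken from this slide. A leaf is split once the gap between the two best attributes exceeds the Hoeffding bound ε = sqrt(R² ln(1/δ) / (2n)), where R is the range of the split criterion, δ the allowed failure probability, and n the number of instances seen at the leaf. The function names are chosen for this example.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Epsilon such that the true mean is within epsilon of the observed mean
    with probability at least 1 - delta, after n observations."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain, second_gain, value_range, delta, n):
    """Split when the best attribute beats the runner-up by more than epsilon."""
    return best_gain - second_gain > hoeffding_bound(value_range, delta, n)
```

The bound shrinks as n grows, so a leaf that cannot yet separate two close candidates simply waits for more data.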

Page 22: SAMOA: A Platform for Mining Big Data Streams (Apache BigData Europe 2015)

Preliminary results: Tweets
- Zipf skew: 1.5
- Bag of words: 100, 1000, 10000 (attributes)
- Size of tweet: ~15 words
- Instances: 1,000,000
- Class: positive or negative (Gaussian random variable)
- 10 runs
- Local vs. Storm virtual cluster
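A synthetic stream with the parameters above could be produced along the following lines. This is a hypothetical sketch, not the generator used for the reported experiments. Words are drawn from a Zipf-like distribution in which the probability of rank r is proportional to 1/r^skew. The slide does not say how the class is derived from the Gaussian variable, so the sketch simply takes the sign of one standard-normal draw per tweet, which is an assumption.

```python
import random
from itertools import accumulate


def zipf_cumulative_weights(vocabulary_size, skew):
    """Cumulative unnormalized Zipf weights: rank r has weight 1 / r**skew."""
    return list(
        accumulate(1.0 / rank ** skew for rank in range(1, vocabulary_size + 1))
    )


def generate_tweets(count, vocabulary_size=1000, skew=1.5,
                    words_per_tweet=15, seed=0):
    """Yield (bag_of_words, label) pairs; bag_of_words has one count per word."""
    rng = random.Random(seed)
    cum_weights = zipf_cumulative_weights(vocabulary_size, skew)
    word_ids = range(vocabulary_size)
    for _ in range(count):
        words = rng.choices(word_ids, cum_weights=cum_weights, k=words_per_tweet)
        bag = [0] * vocabulary_size
        for word in words:
            bag[word] += 1
        # Assumption: the class is the sign of a standard-normal draw.
        label = "positive" if rng.gauss(0.0, 1.0) >= 0 else "negative"
        yield bag, label
```

Setting `vocabulary_size` to 100, 1000, or 10000 varies the number of attributes, matching the three bag-of-words sizes listed above.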

Page 23: SAMOA: A Platform for Mining Big Data Streams (Apache BigData Europe 2015)

Results: Accuracy

[Figure: Classification Accuracy vs. Parallelism Level vs. Number of Attributes. x-axis: Parallelism Level (4, 8, 16, local); y-axis: Correct Classification % (0 to 100); series: 100 words, 1000 words, 10000 words.]

Page 24: SAMOA: A Platform for Mining Big Data Streams (Apache BigData Europe 2015)

Results: Speedup

[Figure: Speedup vs. Parallelism Level vs. Number of Attributes. x-axis: Parallelism Level (4, 8, 16, local); y-axis: Speedup (0 to 5); series: 100 words, 1000 words, 10000 words.]

Page 25: SAMOA: A Platform for Mining Big Data Streams (Apache BigData Europe 2015)

Is SAMOA for you?
- Are you dealing with:
  - Big fast data?
  - Possibly endless streams of data?
  - Evolving data?
- Do you need updated models in real time?
- Do you want to test an algorithm on different DSPEs?

Page 26: SAMOA: A Platform for Mining Big Data Streams (Apache BigData Europe 2015)

SAMOA Team

Albert Bifet

Gianmarco De Francisci Morales

Nicolas Kourtellis

Matthieu Morel

Arinto Murdopo

Olivier Van Laere

Page 27: SAMOA: A Platform for Mining Big Data Streams (Apache BigData Europe 2015)

Status
- Apache Incubator
  - Released version 0.3.0 in July
- Execution Engines
- Input: Local FS, HDFS, Kafka [pending]
- Heron?

Page 28: SAMOA: A Platform for Mining Big Data Streams (Apache BigData Europe 2015)

Algorithms in SAMOA
- Existing:
  - Vertical Hoeffding Tree (classification)
  - CluStream (clustering)
  - Adaptive Model Rules (regression)
- Pending:
  - Distributed Naïve Bayes
  - Stochastic Gradient Descent
  - Adaptive + Boosting VHT
  - Parallelized Gradient Boosted Decision Tree
  - PARMA (frequent pattern mining)
  - …

Check SAMOA Roadmap for more

Looking for contributors!

Page 29: SAMOA: A Platform for Mining Big Data Streams (Apache BigData Europe 2015)

SAMOA: A Platform for Mining Big Data Streams

@ApacheSAMOA
http://samoa.incubator.apache.org/

https://github.com/apache/incubator-samoa

Nicolas Kourtellis
@kourtellis

[email protected]
