Real-Time Big Data Stream Analytics Albert Bifet @abifet Business Applications of Social Network Analysis (BASNA) Shenzhen, 14 December 2014

Upload: albert-bifet

Post on 12-Jul-2015

TRANSCRIPT

Page 1: Real-Time Big Data Stream Analytics

Real-Time Big Data Stream Analytics

Albert Bifet @abifet

Business Applications of Social Network Analysis (BASNA)
Shenzhen, 14 December 2014

Page 2: Real-Time Big Data Stream Analytics

Real time analytics

Page 4: Real-Time Big Data Stream Analytics

Companies need to know what is happening right now so that they can react, anticipate, and detect new business opportunities through social networks and real-time analytics.

Page 5: Real-Time Big Data Stream Analytics

Outline

Big Data Stream Analytics

MOA: Massive Online Analysis

SAMOA: Scalable Advanced Massive Online Analysis

Page 7: Real-Time Big Data Stream Analytics

Data Streams

• Sequence is potentially infinite

• High amount of data: sublinear space

• High speed of arrival: sublinear time per example

• Once an element from a data stream has been processed, it is discarded or archived

Big Data & Real Time

Page 8: Real-Time Big Data Stream Analytics

Data Streams

Approximation algorithms

• Small error rate with high probability

• An algorithm $(\varepsilon, \delta)$-approximates $F$ if it outputs $\tilde{F}$ for which $\Pr[\,|\tilde{F} - F| > \varepsilon F\,] < \delta$.


Page 9: Real-Time Big Data Stream Analytics

8 Bits Counter

1 0 1 0 1 0 1 0

What is the largest number we can store in 8 bits?

Page 11: Real-Time Big Data Stream Analytics

8 Bits Counter

[Plot: f(x) against x, both axes from 0 to 100]

f(x) = log(1 + x) / log(2)

f(0) = 0, f(1) = 1

Page 12: Real-Time Big Data Stream Analytics

8 Bits Counter

[Plot: f(x) against x, both axes from 0 to 10]

f(x) = log(1 + x) / log(2)

f(0) = 0, f(1) = 1

Page 13: Real-Time Big Data Stream Analytics

8 Bits Counter

[Plot: f(x) against x, both axes from 0 to 10]

f(x) = log(1 + x/30) / log(1 + 1/30)

f(0) = 0, f(1) = 1

Page 14: Real-Time Big Data Stream Analytics

8 Bits Counter

[Plot: f(x) against x, both axes from 0 to 100]

f(x) = log(1 + x/30) / log(1 + 1/30)

f(0) = 0, f(1) = 1
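To make the point of these plots concrete, the sketch below inverts f to see how large a count fits when an 8-bit register stores f(x) instead of x. The function names and the parameter `a` (30 on the slide, 1 for the base-2 curve) are illustrative choices, not part of the original slides.

```python
import math


def counter_value(x: float, a: float) -> float:
    """f(x) = log(1 + x/a) / log(1 + 1/a); a = 1 gives log(1 + x) / log(2)."""
    return math.log(1 + x / a) / math.log(1 + 1 / a)


def largest_count(bits: int, a: float) -> float:
    """Invert f: the count x whose stored value is the largest `bits`-bit integer."""
    c_max = 2 ** bits - 1
    return a * ((1 + 1 / a) ** c_max - 1)


plain_limit = 2 ** 8 - 1                 # storing x directly
log2_limit = largest_count(8, a=1)       # storing log2(1 + x)
a30_limit = largest_count(8, a=30)       # storing the a = 30 curve

print(plain_limit, log2_limit, a30_limit)
```

A smaller `a` reaches larger counts but with coarser steps between representable values, which is the trade-off that the base b controls in the algorithm on the following slides.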

Page 15: Real-Time Big Data Stream Analytics

8 Bits Counter

MORRIS APPROXIMATE COUNTING ALGORITHM

Init counter c ← 0
for every event in the stream
  do rand = random number between 0 and 1
     if rand < p
       then c ← c + 1

What is the largest number we can store in 8 bits?

Page 16: Real-Time Big Data Stream Analytics

8 Bits Counter

MORRIS APPROXIMATE COUNTING ALGORITHM

With a fixed p = 1/2 we can count up to about 2 × 256 events (estimating n as 2c), and the counter c has standard deviation $\sigma = \sqrt{n}/2$.

Page 17: Real-Time Big Data Stream Analytics

8 Bits Counter

MORRIS APPROXIMATE COUNTING ALGORITHM

With $p = 2^{-c}$, $E[2^c] = n + 2$ with variance $\sigma^2 = n(n+1)/2$. These values hold for a counter started at $c = 1$; started at $c = 0$, as in the pseudocode, they become $n + 1$ and $n(n-1)/2$.

Page 18: Real-Time Big Data Stream Analytics

8 Bits Counter

MORRIS APPROXIMATE COUNTING ALGORITHM

If $p = b^{-c}$, then $E[b^c] = n(b-1) + b$, and the resulting estimate $(b^c - b)/(b-1)$ of $n$ has variance $\sigma^2 = (b-1)\,n(n+1)/2$ (for a counter started at $c = 1$).
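The counter with $p = b^{-c}$ can be sketched in a few lines. This is a minimal illustration, not code from the talk: the class name, the seeding, and the helper `mean_estimate` are assumptions. The counter starts at c = 0, so the unbiased estimate is $(b^c - 1)/(b - 1)$.

```python
import random
from typing import Optional


class MorrisCounter:
    """Approximate counter: increments c with probability base ** -c."""

    def __init__(self, base: float = 2.0, seed: Optional[int] = None):
        self.base = base
        self.c = 0  # the only state kept; fits in a few bits
        self._rng = random.Random(seed)

    def add(self) -> None:
        """Process one event from the stream."""
        if self._rng.random() < self.base ** (-self.c):
            self.c += 1

    def estimate(self) -> float:
        """Estimate of the number of events seen (counter started at c = 0)."""
        return (self.base ** self.c - 1) / (self.base - 1)


def mean_estimate(n: int, trials: int, base: float = 2.0, seed: int = 0) -> float:
    """Average the estimate over independent counters that each see n events."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        counter = MorrisCounter(base, seed=rng.getrandbits(32))
        for _ in range(n):
            counter.add()
        total += counter.estimate()
    return total / trials


print(mean_estimate(n=1000, trials=400))
```

A single counter is noisy, since its standard deviation is comparable to n for b = 2. Averaging several counters, or choosing a base closer to 1, reduces the error at the cost of more bits.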

Page 19: Real-Time Big Data Stream Analytics

Data Streams

Example
Puzzle: Finding Missing Numbers

• Let $\pi$ be a permutation of $\{1, \ldots, n\}$.

• Let $\pi_{-1}$ be $\pi$ with one element missing.

• $\pi_{-1}[i]$ arrives in increasing order

Task: Determine the missing number

Page 20: Real-Time Big Data Stream Analytics

Data Streams

Example
Puzzle: Finding Missing Numbers

Use an n-bit vector to memorize all the numbers (O(n) space)

Page 21: Real-Time Big Data Stream Analytics

Data Streams

Example
Puzzle: Finding Missing Numbers

Data Streams: O(log(n)) space.

Page 22: Real-Time Big Data Stream Analytics

Data Streams

Example
Puzzle: Finding Missing Numbers

Data Streams: O(log(n)) space. Store

$$\frac{n(n+1)}{2} - \sum_{j \le i} \pi_{-1}[j].$$
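The running-difference idea can be written directly. The sketch below is an illustration with an assumed function name; it keeps a single integer of O(log n) bits while the stream passes by once.

```python
from typing import Iterable


def missing_number(stream: Iterable[int], n: int) -> int:
    """Return the one element of {1, ..., n} that never appears in the stream."""
    remaining = n * (n + 1) // 2  # sum of 1..n
    for value in stream:          # one pass, nothing else is stored
        remaining -= value
    return remaining


print(missing_number(iter([1, 2, 4, 5]), n=5))
```

The stream is consumed through a plain iterator, so the same function works whether the numbers come from a list, a file, or a socket.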

Page 23: Real-Time Big Data Stream Analytics

Min-wise independent hashing

Similarity: Jaccard coefficient of A and B = |A∩B| / |A∪B|

Page 24: Real-Time Big Data Stream Analytics

Min-wise independent hashing (Broder et al., 2000)

Let h : Items → [0,1] be a random hash function. Find the 4 items bought by either Alice or Bob with the smallest hash values:

1 Find the 4 smallest hash values for Alice, and the 4 for Bob.

2 Aggregate the results.

3 Estimate the similarity.

Example on estimating the Jaccard similarity in data streams.

Page 25: Real-Time Big Data Stream Analytics

Min-wise independent hashing

1 Find the 4 smallest hash values for Alice, and the 4 for Bob.

Page 26: Real-Time Big Data Stream Analytics

Min-wise independent hashing

2 Aggregate the results.

3 Estimate the similarity.
   • Among the 4 items with the smallest hash values, only milk is bought by both Alice and Bob, thus the estimated similarity is 1/4.

Example on estimating the Jaccard similarity in data streams.
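The three steps can be sketched as follows. The shopping baskets are made up for illustration (the slide's actual baskets are not in the transcript), SHA-256 stands in for the random hash function h, and the function names are assumptions. For brevity each sketch is built from a full set; a streaming version would maintain the k smallest hashes incrementally.

```python
import hashlib
import heapq
from typing import Iterable, Set


def h(item: str) -> float:
    """Map an item to a pseudo-random value in [0, 1)."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2 ** 64


def sketch(items: Iterable[str], k: int) -> Set[str]:
    """Step 1: keep the k items with the smallest hash values."""
    return set(heapq.nsmallest(k, set(items), key=h))


def estimate_jaccard(sketch_a: Set[str], sketch_b: Set[str], k: int) -> float:
    """Steps 2 and 3: merge the sketches, then count shared items among the k smallest."""
    smallest = heapq.nsmallest(k, sketch_a | sketch_b, key=h)
    both = sum(1 for item in smallest if item in sketch_a and item in sketch_b)
    return both / len(smallest) if smallest else 0.0


# Hypothetical baskets, for illustration only.
alice = {"milk", "bread", "eggs", "apples", "coffee", "rice"}
bob = {"milk", "tea", "eggs", "pasta", "sugar", "beans"}
k = 4
print(estimate_jaccard(sketch(alice, k), sketch(bob, k), k))
```

Each user needs only k stored items, and two sketches can be compared without revisiting the original streams.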

Page 27: Real-Time Big Data Stream Analytics

Stream Learning of Influence probabilities

Page 28: Real-Time Big Data Stream Analytics

Learning influence strength along links in social networks

• Real-time interactions

• Small amount of time and memory

• Only one pass

Page 29: Real-Time Big Data Stream Analytics

Learning influence strength along links in social networks

Social Graph and Log of Propagation

Page 30: Real-Time Big Data Stream Analytics

Learning influence probabilities

• Approximate solutions

• Superlinear space

• Linear space

• Sublinear space

• Landmark and sliding window

STRIP: Stream Learning of Influence probabilities

Page 31: Real-Time Big Data Stream Analytics

Learning influence probabilities

• The problem of learning the probabilities was first considered in Goyal et al. WSDM’10.

• In particular, the following definition was shown to give a good prediction of propagation: let $A_{u2v}(t)$ be the number of actions that propagated from user $u$ to user $v$ within time $t$, and $A_{u|v}$ the total number of actions performed by either $u$ or $v$. Then

$$p_{u,v} = A_{u2v}(t) / A_{u|v}$$

Similar to the Jaccard coefficient for estimating the similarity between sets A and B: |A∩B| / |A∪B|.
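As a worked illustration of the definition, the sketch below computes $p_{u,v}$ exactly from a small log of propagation. It is a batch computation over the whole log, not the streaming STRIP method. The log format (user, action, time), the function name, and the rule that an action propagates when v performs it after u within time t are assumptions for the example, and a link from u to v in the social graph is taken for granted.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def influence_probability(
    log: Iterable[Tuple[str, str, float]], u: str, v: str, t: float
) -> float:
    """p_{u,v} = A_{u2v}(t) / A_{u|v}, computed exactly from the full log."""
    times: Dict[str, Dict[str, float]] = defaultdict(dict)  # user -> {action: time}
    for user, action, when in log:
        times[user][action] = when

    # A_{u2v}(t): actions done by u and then by v no more than t later.
    propagated = sum(
        1
        for action, t_u in times[u].items()
        if action in times[v] and 0 < times[v][action] - t_u <= t
    )
    # A_{u|v}: actions performed by either u or v.
    either = len(set(times[u]) | set(times[v]))
    return propagated / either if either else 0.0


# Hypothetical log of propagation: (user, action, time).
example_log = [
    ("u", "a1", 0), ("v", "a1", 2),   # propagates within t = 5
    ("u", "a2", 0), ("v", "a2", 10),  # too late
    ("v", "a3", 1),                   # performed by v only
]
print(influence_probability(example_log, "u", "v", t=5))
```

The denominator is a union size and the numerator counts a kind of intersection, which is why sketches built for the Jaccard coefficient carry over to this problem.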

Page 32: Real-Time Big Data Stream Analytics

Outline

Big Data Stream Analytics

MOA: Massive Online Analysis

SAMOA: Scalable Advanced Massive Online Analysis

Page 33: Real-Time Big Data Stream Analytics

What is MOA?

{M}assive {O}nline {A}nalysis is a framework for online learning from data streams.

• It is related to WEKA

• It includes a collection of offline and online methods as well as tools for evaluation: classification, regression, clustering, outlier detection, concept drift detection and recommender systems

• Easy to extend

• Easy to design and run experiments

Page 34: Real-Time Big Data Stream Analytics

History - timeline

WEKA

• 1993 - WEKA : project starts (Ian Witten)

• Mid 1999 - WEKA 3 (100% Java) released

MOA

• Nov 2007 - First public release of MOA: Richard Kirkby, Geoff Holmes and Bernhard Pfahringer

• 2009 - MOA Concept Drift Classification

• 2010 - MOA Clustering

• 2011 - MOA Graph Mining, Multi-label classification, Twitter Reader, Active Learning

• 2014 - MOA Outliers and Recommender System

• 2014 - MOA Concept Drift

Page 35: Real-Time Big Data Stream Analytics

WEKA

• Waikato Environment for Knowledge Analysis

• Collection of state-of-the-art machine learning algorithms and data processing tools implemented in Java

• Released under the GPL

• Support for the whole process of experimental data mining
   • Preparation of input data
   • Statistical evaluation of learning schemes
   • Visualization of input data and the result of learning

• Used for education, research and applications

• Complements “Data Mining” by Witten & Frank & Hall

Page 36: Real-Time Big Data Stream Analytics

WEKA: the bird

Page 37: Real-Time Big Data Stream Analytics

MOA: the bird

The Moa (another native NZ bird) is not only flightless, like the Weka, but also extinct.

Page 41: Real-Time Big Data Stream Analytics

Classification Experimental Setting

Evaluation procedures for Data Streams

• Holdout

• Interleaved Test-Then-Train or Prequential
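Interleaved test-then-train can be sketched as a single loop: every example is first used to test the model and only then to train it. The code below is an illustration and not MOA's evaluator; the function names and the toy majority-class learner are assumptions.

```python
from collections import Counter
from typing import Callable, Hashable, Iterable, Optional, Tuple


def prequential_accuracy(
    stream: Iterable[Tuple[object, Hashable]],
    predict: Callable[[object], Optional[Hashable]],
    train: Callable[[object, Hashable], None],
) -> float:
    """Test on each example before training on it; return the fraction correct."""
    correct = total = 0
    for x, y in stream:
        if predict(x) == y:  # test first ...
            correct += 1
        train(x, y)          # ... then train on the same example
        total += 1
    return correct / total if total else 0.0


def make_majority_learner():
    """Toy learner that always predicts the most frequent label seen so far."""
    counts: Counter = Counter()

    def predict(x: object) -> Optional[Hashable]:
        return counts.most_common(1)[0][0] if counts else None

    def train(x: object, y: Hashable) -> None:
        counts[y] += 1

    return predict, train


predict, train = make_majority_learner()
labels = ["a", "a", "b", "a"]
print(prequential_accuracy(((None, y) for y in labels), predict, train))
```

No separate holdout set is needed, and every example contributes to both the accuracy estimate and the model, which suits a stream that is seen only once.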

Page 42: Real-Time Big Data Stream Analytics

Classification Experimental Setting

Data Sources

• Random Tree Generator

• Random RBF Generator

• LED Generator

• Waveform Generator

• Hyperplane

• SEA Generator

• STAGGER Generator

Page 43: Real-Time Big Data Stream Analytics

Classification Experimental Setting

Classifiers

• Naive Bayes

• Decision stumps

• Hoeffding Tree

• Hoeffding Option Tree

• Bagging and Boosting

• Random Forests

• ADWIN Bagging and Leveraging Bagging

• Adaptive Rules

• Logistic Regression, SGD

Page 44: Real-Time Big Data Stream Analytics

RAM-Hours

RAM-Hour
Every GB of RAM deployed for 1 hour

Cloud Computing Rental Cost Options
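As a small worked example of the measure, the sketch below compares two learners by RAM-hours. The function name and the figures are made up for illustration.

```python
def ram_hours(memory_gb: float, runtime_hours: float) -> float:
    """One RAM-hour is one GB of RAM deployed for one hour."""
    return memory_gb * runtime_hours


# Hypothetical learners: one fast but memory-hungry, one slow but small.
fast_large = ram_hours(memory_gb=4.0, runtime_hours=0.5)
slow_small = ram_hours(memory_gb=0.5, runtime_hours=2.0)
print(fast_large, slow_small)
```

The measure combines time and memory into a single cost, so a slower learner can still be cheaper when it holds much less memory.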

Page 46: Real-Time Big Data Stream Analytics

Clustering Experimental Setting

Internal measures               External measures
Gamma                           Rand statistic
C Index                         Jaccard coefficient
Point-Biserial                  Fowlkes and Mallows Index
Log Likelihood                  Hubert Γ statistics
Dunn’s Index                    Minkowski score
Tau                             Purity
Tau A                           van Dongen criterion
Tau C                           V-measure
Somers’ Gamma                   Completeness
Ratio of Repetition             Homogeneity
Modified Ratio of Repetition    Variation of information
Adjusted Ratio of Clustering    Mutual information
Fagan’s Index                   Class-based entropy
Deviation Index                 Cluster-based entropy
Z-Score Index                   Precision
D Index                         Recall
Silhouette coefficient          F-measure

Table: Internal and external clustering evaluation measures.

Page 47: Real-Time Big Data Stream Analytics

Clustering Experimental Setting

Clusterers

• StreamKM++

• CluStream

• ClusTree

• Den-Stream

• D-Stream

• CobWeb

Page 48: Real-Time Big Data Stream Analytics

Web

http://www.moa.cms.waikato.ac.nz

Page 49: Real-Time Big Data Stream Analytics

Easy Design of a MOA classifier

• void resetLearningImpl ()

• void trainOnInstanceImpl (Instance inst)

• double[] getVotesForInstance (Instance i)
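The division of work among the three methods can be illustrated with a tiny majority-class learner. The sketch below is a Python analogue written for this transcript, not MOA code: MOA itself is Java, and the class name, snake_case method names, and constructor argument are assumptions.

```python
from typing import List, Sequence


class MajorityClassSketch:
    """Python analogue of the three methods a MOA classifier implements."""

    def __init__(self, num_classes: int):
        self.num_classes = num_classes
        self.reset_learning_impl()

    def reset_learning_impl(self) -> None:
        """Forget everything learned so far."""
        self.class_counts: List[float] = [0.0] * self.num_classes

    def train_on_instance_impl(self, features: Sequence[float], label: int) -> None:
        """Update the model with one labelled instance from the stream."""
        self.class_counts[label] += 1.0

    def get_votes_for_instance(self, features: Sequence[float]) -> List[float]:
        """Return one vote per class; the largest vote is the prediction."""
        return list(self.class_counts)


learner = MajorityClassSketch(num_classes=2)
for label in [0, 1, 1]:
    learner.train_on_instance_impl([0.0], label)
print(learner.get_votes_for_instance([0.0]))
```

The learner never stores instances, only the state it needs to vote, which keeps memory use independent of the length of the stream.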

Page 50: Real-Time Big Data Stream Analytics

Easy Design of a MOA clusterer

• void resetLearningImpl ()

• void trainOnInstanceImpl (Instance inst)

• Clustering getClusteringResult()

Page 51: Real-Time Big Data Stream Analytics

Extensions of MOA

• Multi-label Classification

• Active Learning

• Regression

• Closed Frequent Graph Mining

• Twitter Sentiment Analysis

Page 52: Real-Time Big Data Stream Analytics

Outline

Big Data Stream Analytics

MOA: Massive Online Analysis

SAMOA: Scalable Advanced Massive Online Analysis

Page 53: Real-Time Big Data Stream Analytics

SAMOA

SAMOA is a distributed streaming machine learning (ML) framework that contains a programming abstraction for distributed streaming ML algorithms.

Page 54: Real-Time Big Data Stream Analytics

Hadoop

Hadoop architecture

Page 55: Real-Time Big Data Stream Analytics

Apache S4

Page 56: Real-Time Big Data Stream Analytics

Apache Storm

Stream, Spout, Bolt, Topology

Page 57: Real-Time Big Data Stream Analytics

SAMOA

Page 58: Real-Time Big Data Stream Analytics

SAMOA

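[Architecture diagram: the SAMOA algorithm and API layer connects through an ML-adapter to MOA and other ML frameworks, and through an SPE-adapter (samoa-SPE) to S4, Storm, and other SPEs via the samoa-S4, samoa-storm, and samoa-other-SPEs modules.]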

Page 60: Real-Time Big Data Stream Analytics

SAMOA ML Developer API

Processing Item

Processor

Stream

Page 61: Real-Time Big Data Stream Analytics

SAMOA ML Developer API

Page 62: Real-Time Big Data Stream Analytics

Web

http://samoa-project.net/

Page 63: Real-Time Big Data Stream Analytics

The Team

Page 64: Real-Time Big Data Stream Analytics

Summary

1 To increase profits, companies need to satisfy customers. To do this they need to know
   • what potential customers need
   • social networks
   • real-time analytics

2 MOA deals with evolving data streams

3 SAMOA is a distributed streaming machine learning framework

Page 65: Real-Time Big Data Stream Analytics

Thanks!

@abifet

http://moa.cms.waikato.ac.nz/
@moadatamining

http://samoa-project.net/
@samoa project