Distributed Decision Tree Learning for Mining Big Data Streams
Master Thesis Presentation (Transcript)
Presentation by: Arinto Murdopo, EMDC, [email protected]
Supervisors: Albert Bifet, Gianmarco de Francisci Morales, Ricard Gavaldà
Big Data
• 200 million users
• 400 million tweets/day
• 1+ TB/day to Hadoop
• 2.7 TB/day follower update
• 4.5 billion likes/day
• 350 million photos/day
The 3Vs: Volume, Velocity, Variety
(Figures from March–May 2013.)
Machine Learning (ML)
Make sense of the data, but how?
Machine Learning = learn & adapt based on data
Due to the 3Vs, we should:
1. Distribute, to scale
2. Stream, to be fast
3. Distribute and stream, to scale and be fast
Are We Satisfied?
We want machine learning frameworks that scale, run fast, and are loosely coupled.
SAMOA
Scalable Advanced Massive Online Analysis
Distributed Streaming Machine Learning Framework:
• Fast, using the streaming model
• Scalable, on top of distributed SPEs (Storm and S4)
• Loose coupling between ML algorithms and SPEs
Contributions
SAMOA
• Architecture and Abstractions
• Stream Processing Engine Adapter
• Integration with Storm
Vertical Hoeffding Tree
• Better than MOA for a high number of attributes
SAMOA Architecture
[Figure: SAMOA architecture. ML tasks (frequent pattern mining, clustering methods, classification methods) are implemented on SAMOA, which runs on top of SPEs such as Storm, S4, and other SPEs.]
SAMOA Abstractions
To develop distributed ML algorithms, SAMOA provides:
• Processor: contains the algorithm logic
• Processing Item (PI): the runtime unit that wraps a Processor, created with a parallelism hint
• Entrance Processing Item (EPI): a PI that consumes an external event source
• Stream: connects PIs and carries content events, routed according to a grouping
• Topology: the graph of PIs and Streams
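As a sketch of how these abstractions fit together (simplified, made-up names, not the real SAMOA API):

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins for the SAMOA abstractions (illustrative only).
interface Processor {
    void process(String event); // handle one content event
}

class ProcessingItem {
    final Processor processor;  // the wrapped algorithm logic
    final int parallelismHint;  // how many runtime copies the SPE may create
    ProcessingItem(Processor processor, int parallelismHint) {
        this.processor = processor;
        this.parallelismHint = parallelismHint;
    }
}

class Stream {
    private final List<ProcessingItem> destinations = new ArrayList<>();
    void connectTo(ProcessingItem pi) { destinations.add(pi); }
    void put(String event) {    // deliver a content event downstream
        for (ProcessingItem pi : destinations) pi.processor.process(event);
    }
}
```

A real SPE would additionally apply the stream's grouping when deciding which of a PI's parallel copies receives each event.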
SAMOA SPE-adapter
• Transforms the abstractions into SPE-specific runtime components
• Abstract factory pattern to decouple API and SPE
• Platform developers need to provide:
1. PI and EPI
2. Stream
3. Grouping
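The abstract-factory idea can be sketched as follows (illustrative names, not SAMOA's real interfaces): algorithm code asks a factory for components, and each SPE ships its own factory, so swapping the SPE never touches algorithm code.

```java
// Illustrative abstract factory decoupling the API from the SPE.
interface PI {
    String runtimeName();   // which runtime component backs this PI
}

interface ComponentFactory {
    PI createPi();          // a real adapter would also create EPIs,
                            // streams, and groupings
}

class StormPI implements PI {
    public String runtimeName() { return "storm-pi"; } // would wrap a Storm bolt
}

class S4PI implements PI {
    public String runtimeName() { return "s4-pi"; }    // would wrap an S4 PE
}

class StormComponentFactory implements ComponentFactory {
    public PI createPi() { return new StormPI(); }
}

class S4ComponentFactory implements ComponentFactory {
    public PI createPi() { return new S4PI(); }
}
```

Algorithm code holds only a ComponentFactory reference; choosing samoa-storm or samoa-S4 just selects which factory is instantiated.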
SAMOA SPE-adapter
[Figure: examples of SPE-specific runtime components produced by the SPE-adapter; the Storm components are the focus of this thesis.]
Storm
• Distributed Stream Processing Engine
• MapReduce-like programming model
Core concepts:
• Tuple: the unit of data
• Stream: an unbounded sequence of Tuples
• Spout: a source of streams, e.g. reading from a data storage that stores useful information
• Bolt: consumes streams, processes Tuples, and may emit new streams
• Topology: a DAG of Spouts and Bolts connected by streams
[Figure: example topology with spouts S1 and S2 feeding bolts B1 through B5 over streams A and B.]
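A toy simulation of the Storm model (illustrative, not Storm's real API): a Spout emits Tuples into the topology, and Bolts consume them and may forward new Tuples along the DAG.

```java
import java.util.ArrayList;
import java.util.List;

// Toy Storm-like model (illustrative, not Storm's real API).
class Tuple {
    final Object value;
    Tuple(Object value) { this.value = value; }
}

abstract class Bolt {
    final List<Bolt> downstream = new ArrayList<>(); // edges of the DAG
    abstract void execute(Tuple t);                  // process one tuple
    void emit(Tuple t) {                             // forward to downstream bolts
        for (Bolt b : downstream) b.execute(t);
    }
}

class Spout {
    final List<Bolt> downstream = new ArrayList<>();
    void nextTuple(Object value) {                   // Storm polls the spout for tuples
        Tuple t = new Tuple(value);
        for (Bolt b : downstream) b.execute(t);
    }
}
```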
SAMOA-Storm Integration
Mapping between Storm and SAMOA:
1. Spout → Entrance Processing Item (EPI)
2. Bolt → Processing Item (PI)
• Composition is used for EPI and PI
3. Bolt Stream & Spout Stream → Stream
• Storm uses a pull model
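The composition point can be sketched like this (simplified, hypothetical names): the Storm-side component holds a SAMOA processor as a field and delegates to it, rather than the SAMOA logic inheriting from any Storm class.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative composition: a bolt-like wrapper *contains* a processor.
interface Processor {
    String process(String event);
}

class ProcessingItemBolt {                     // stand-in for a Storm bolt
    private final Processor processor;         // composed, not inherited
    private final List<String> emitted = new ArrayList<>();

    ProcessingItemBolt(Processor processor) { this.processor = processor; }

    void execute(String tuple) {               // Storm would deliver tuples here
        emitted.add(processor.process(tuple)); // delegate to the SAMOA logic
    }

    List<String> emitted() { return emitted; }
}
```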
Contributions so far ..
[Figure: SAMOA's layered design. The SAMOA Algorithm and API layer sits on an SPE-adapter, which yields the samoa-S4, samoa-storm, and samoa-other-SPEs modules on top of S4, Storm, and other SPEs, and on an ML-adapter for MOA and other ML frameworks. The design aims for flexibility, scalability, and extensibility.]
Next Contribution… Distributed Algorithm implementation:
Vertical Hoeffding Tree
Decision tree:
• Classification
• Divide and conquer
• Easy to interpret
Sample Dataset
ID Code | Outlook  | Temperature | Humidity | Windy | Play
a       | sunny    | hot         | high     | false | no
b       | sunny    | hot         | high     | true  | no
c       | overcast | hot         | high     | false | yes
d       | rainy    | mild        | high     | false | yes
…       | …        | …           | …        | …     | …
Outlook, Temperature, Humidity, and Windy are attributes; Play is the class. Each row is a datum (an instance) used to build the tree.
Decision Tree
[Figure: decision tree for the sample dataset. The root splits on outlook: overcast leads to leaf Y; sunny leads to a split node on humidity (normal → Y, high → N); rainy leads to a split node on windy (false → Y, true → N). Internal nodes are split nodes; terminal nodes are leaf nodes.]
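The tree on this slide can also be written as plain code (an illustrative translation of the figure, not part of the thesis implementation):

```java
// The decision tree from the slide: the root splits on outlook; sunny leads
// to a humidity split, rainy to a windy split, overcast straight to "yes".
class WeatherTree {
    static String classify(String outlook, String humidity, boolean windy) {
        switch (outlook) {
            case "overcast": return "yes";                                // leaf
            case "sunny":    return humidity.equals("high") ? "no" : "yes";
            default:         return windy ? "no" : "yes";                 // rainy
        }
    }
}
```

It reproduces the sample dataset: instance a (sunny, high humidity, not windy) is classified no, instance c (overcast) yes.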
Very Fast Decision Tree (VFDT)
• Pioneering decision tree algorithm for streaming data
• Uses Information Gain or Gain Ratio together with the Hoeffding bound
• The Hoeffding bound decides whether the observed difference in information gain is enough to split or not
• Often called the Hoeffding Tree
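The Hoeffding bound states that, with probability 1 − δ, the true mean of a random variable with range R lies within ε = sqrt(R² ln(1/δ) / (2n)) of the mean observed over n samples. A sketch of the resulting split test (illustrative, not MOA's implementation, which also adds a tie-breaking threshold):

```java
// Hoeffding bound: epsilon = sqrt(R^2 * ln(1/delta) / (2n)).
// R is the range of the metric (log2(numClasses) for information gain),
// delta the allowed error probability, n the instances seen at the leaf.
class HoeffdingBound {
    static double epsilon(double range, double delta, long n) {
        return Math.sqrt(range * range * Math.log(1.0 / delta) / (2.0 * n));
    }

    // VFDT split rule: split when the gap between the best and the
    // second-best attribute's gain exceeds epsilon.
    static boolean shouldSplit(double bestGain, double secondBestGain,
                               double range, double delta, long n) {
        return (bestGain - secondBestGain) > epsilon(range, delta, n);
    }
}
```

As n grows, ε shrinks, so the split decision becomes increasingly confident without ever storing the instances themselves.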
Distributed Decision Tree
Types of parallelism:
• Horizontal: partition the data by instance
• Vertical: partition the data by attribute
• Task: tree leaf nodes grow in parallel
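Vertical parallelism can be sketched as routing each attribute to a worker by its index (a hypothetical round-robin scheme for illustration; the actual assignment depends on the stream's grouping):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative vertical partitioning: attribute i of every instance is
// routed to worker (i % parallelism), so each worker maintains the
// statistics for a fixed subset of attributes.
class VerticalPartitioner {
    static List<List<Double>> partition(double[] instance, int parallelism) {
        List<List<Double>> workers = new ArrayList<>();
        for (int w = 0; w < parallelism; w++) workers.add(new ArrayList<>());
        for (int i = 0; i < instance.length; i++) {
            workers.get(i % parallelism).add(instance[i]);
        }
        return workers;
    }
}
```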
MOA Hoeffding Tree Profiling
CPU time breakdown with 200 attributes: Learn 70%, Split 24%, Other 6%
Vertical Hoeffding Tree
[Figure: VHT topology. A source PI feeds instances over the source stream to the model-aggregator PI; the model-aggregator routes attribute slices over the attribute stream to n local-statistic PIs; local-statistic PIs answer over the local-result stream (with a control stream for coordination); the model-aggregator sends results over the result stream to the evaluator PI.]
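The aggregation step can be sketched as follows (illustrative, not the thesis code): each local-statistic PI reports its locally best attribute and gain over the local-result stream, and the model-aggregator reduces them to the global best and second-best gains, which feed the Hoeffding split test.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model-aggregator step: combine local-results into the
// global best and second-best information gains.
class LocalResult {
    final int attributeId;
    final double gain;
    LocalResult(int attributeId, double gain) {
        this.attributeId = attributeId;
        this.gain = gain;
    }
}

class ModelAggregator {
    // Returns {bestGain, secondBestGain} over all local results.
    static double[] bestTwoGains(List<LocalResult> locals) {
        double best = Double.NEGATIVE_INFINITY, second = Double.NEGATIVE_INFINITY;
        for (LocalResult r : locals) {
            if (r.gain > best) { second = best; best = r.gain; }
            else if (r.gain > second) { second = r.gain; }
        }
        return new double[] { best, second };
    }
}
```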
Evaluation
Metrics:
• Accuracy
• Throughput
Input data:
• Random Tree Generator
• Text Generator – resembles tweets
Cluster: 3 shared nodes with 48 GB of RAM, Intel Xeon CPU E5620 @ 2.4 GHz (16 processors), Linux kernel 2.6.18
VHT iteration 1 (VHT1)
• Goal: verify algorithm correctness (same accuracy as MOA)
• Used 2 internal queues: an instances queue and a local-result queue
• Achieved the same accuracy, but throughput was low, so we proceeded with VHT2
VHT Iteration 2 (VHT2)
Goal: improve VHT1's throughput
• Kryo serializer: 2.5x throughput improvement
• long identifiers instead of Strings
• Removed VHT1's 2 internal queues; instances arriving while a split is being attempted are discarded
tree-10
Around 8.2% difference in accuracy
tree-100
Same trend as tree-10 (7.9% difference in accuracy)
No. of Leaf Nodes, VHT2 – tree-100
Very close and very high accuracy
Accuracy, VHT2 – text-1000
Low accuracy when the number of attributes increased
Throughput, VHT2 – tree generator
Not good for dense instances with a low number of attributes
Throughput, VHT2 – text generator
Higher throughput than MHT
[Figure: bar chart of execution time (seconds) per classifier, VHT2-par-3 vs. MHT. Profiling results for text-1000 with 1000000 instances, broken down into t_calc, t_comm, and t_serial.]
Minimizing t_comm will increase throughput
[Figure: bar chart of execution time (seconds) per classifier, VHT2-par-3 vs. MHT. Profiling results for text-10000 with 100000 instances, broken down into t_calc, t_comm, and t_serial.]
Throughput: VHT2-par-3: 2631 inst/sec; MHT: 507 inst/sec
Future Work
• Open-source release
• Evaluation layer in the SAMOA architecture
• Online classification algorithms based on horizontal parallelism
Conclusions
Mining big data streams is challenging
• Systems need to satisfy the 3Vs of big data
SAMOA – Distributed Streaming ML Framework
• Architecture and Abstractions
• Stream Processing Engine (SPE) adapter
• SAMOA Integration with Storm
Vertical Hoeffding Tree
• Better than MOA for a high number of attributes