Fast Perceptron Decision Tree Learning from Evolving Data Streams
DESCRIPTION
This talk explains how to use perceptrons and combine them with decision trees for evolving data streams.
TRANSCRIPT
Fast Perceptron Decision Tree Learning from Evolving Data Streams
Albert Bifet, Geoff Holmes, Bernhard Pfahringer, and Eibe Frank
University of Waikato, Hamilton, New Zealand
Hyderabad, 23 June 2010
14th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD'10)
Motivation
RAM-Hours: time and memory in one measure
Hoeffding Decision Trees with Perceptron learners at leaves
Improve performance of classification methods for data streams
Outline
1 RAM-Hours
2 Perceptron Decision Tree Learning
3 Empirical evaluation
Mining Massive Data
2007 Digital Universe: 281 exabytes (billion gigabytes)
The amount of information created exceeded available storage for the first time

Web 2.0
106 million registered users
600 million search queries per day
3 billion requests a day via its API
Green Computing
Study and practice of using computing resources efficiently.

Algorithmic Efficiency: a main approach of Green Computing
Data Streams: fast methods that avoid storing the whole dataset in memory
Data stream classification cycle
1 Process an example at a time, and inspect it only once (at most)
2 Use a limited amount of memory
3 Work in a limited amount of time
4 Be ready to predict at any point
Mining Massive Data
Koichi Kawana: "Simplicity means the achievement of maximum effect with minimum means."

[Figure: the data-stream trade-off triangle linking time, accuracy, and memory]
Evaluation Example
             Accuracy  Time  Memory
Classifier A    70%     100      20
Classifier B    80%      20      40

Which classifier is performing better?
RAM-Hours
RAM-Hour: every GB of RAM deployed for 1 hour

Cloud computing rental cost options
Evaluation Example
             Accuracy  Time  Memory  RAM-Hours
Classifier A    70%     100      20      2,000
Classifier B    80%      20      40        800

Which classifier is performing better?
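The RAM-Hours column follows directly from the definition: memory multiplied by deployment time. A minimal sketch, assuming the table's time and memory columns are already in hours and GB:

```python
def ram_hours(memory_gb: float, time_hours: float) -> float:
    """Cost in RAM-Hours: every GB of RAM deployed for 1 hour."""
    return memory_gb * time_hours

# Classifier A: 70% accuracy, time 100, memory 20 -> 2,000 RAM-Hours
# Classifier B: 80% accuracy, time 20, memory 40  ->   800 RAM-Hours
print(ram_hours(20, 100), ram_hours(40, 20))
```

Under this single measure B is both more accurate and cheaper, which the separate time and memory columns could not show.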
Outline
1 RAM-Hours
2 Perceptron Decision Tree Learning
3 Empirical evaluation
Hoeffding Trees
Hoeffding Tree: VFDT
Pedro Domingos and Geoff Hulten. Mining high-speed data streams. 2000
With high probability, constructs a model identical to the one a traditional (greedy) method would learn
With theoretical guarantees on the error rate
[Figure: example decision tree splitting on Contains "Money" (Yes/No) and on Time (Day/Night), with YES/NO leaves]
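The split decisions that make such a tree a Hoeffding tree rest on the Hoeffding bound: after n observations of a random variable with range R, the observed mean is, with probability 1 − δ, within ε = sqrt(R² ln(1/δ) / (2n)) of the true mean. A quick numeric sketch:

```python
import math

def hoeffding_bound(R: float, delta: float, n: int) -> float:
    """Margin epsilon such that, with probability 1 - delta, the true mean of
    a variable with range R lies within epsilon of the mean of n observations."""
    return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))

# The margin shrinks with n, so after enough stream examples the best split
# can be chosen with high confidence, without revisiting past data.
for n in (100, 1000, 10000):
    print(n, hoeffding_bound(R=1.0, delta=1e-7, n=n))
```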
Hoeffding Naive Bayes Tree
Hoeffding Tree: Majority Class learner at leaves

Hoeffding Naive Bayes Tree
G. Holmes, R. Kirkby, and B. Pfahringer. Stress-testing Hoeffding trees, 2005.
monitors accuracy of a Majority Class learner
monitors accuracy of a Naive Bayes learner
predicts using the most accurate method
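That leaf-level strategy can be sketched as a small racing wrapper (a hypothetical illustration, not MOA's actual implementation): track each candidate learner's prequential error and answer with the most accurate one so far.

```python
class AdaptiveHybridLeaf:
    """Sketch of a leaf that races several predictors (e.g. Majority Class
    vs. Naive Bayes) and predicts with the most accurate one so far."""

    def __init__(self, learners):
        self.learners = learners            # objects with predict(x) / update(x, y)
        self.errors = [0] * len(learners)

    def learn(self, x, y):
        # Test-then-train: score each learner on (x, y) before updating it.
        for i, learner in enumerate(self.learners):
            if learner.predict(x) != y:
                self.errors[i] += 1
            learner.update(x, y)

    def predict(self, x):
        best = min(range(len(self.learners)), key=self.errors.__getitem__)
        return self.learners[best].predict(x)
```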
Perceptron
[Figure: perceptron with inputs Attribute 1 to Attribute 5, weights w1 to w5, and output h_w(x_i)]

Data stream: <x_i, y_i>
Classical perceptron: h_w(x_i) = sgn(w^T x_i)
Minimize mean-square error: J(w) = (1/2) Σ_i (y_i − h_w(x_i))²
Perceptron
[Figure: the same perceptron diagram]

We use the sigmoid function h_w(x) = σ(w^T x), where
σ(x) = 1/(1 + e^(−x))
σ′(x) = σ(x)(1 − σ(x))
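This derivative identity is what makes the gradient cheap to evaluate; a quick finite-difference check confirms it:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

x = 0.7
analytic = sigmoid(x) * (1.0 - sigmoid(x))                 # sigma'(x) = sigma(x)(1 - sigma(x))
numeric = (sigmoid(x + 1e-6) - sigmoid(x - 1e-6)) / 2e-6   # central finite difference
print(abs(analytic - numeric))                             # difference is negligible
```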
Perceptron
Minimize mean-square error: J(w) = (1/2) Σ_i (y_i − h_w(x_i))²

Stochastic gradient descent on each example x_i: w = w − η ∇J

Gradient of the error function:
∇J = −Σ_i (y_i − h_w(x_i)) ∇h_w(x_i)
with ∇h_w(x_i) = h_w(x_i)(1 − h_w(x_i)) x_i

Weight update rule
w = w + η Σ_i (y_i − h_w(x_i)) h_w(x_i)(1 − h_w(x_i)) x_i
Perceptron
PERCEPTRON LEARNING(Stream, η)
1 for each class
2   do PERCEPTRON LEARNING(Stream, class, η)

PERCEPTRON LEARNING(Stream, class, η)
1 ▷ Let w0 and w be randomly initialized
2 for each example (x, y) in Stream
3   do if class = y
4     then δ = (1 − h_w(x)) · h_w(x) · (1 − h_w(x))
5     else δ = (0 − h_w(x)) · h_w(x) · (1 − h_w(x))
6   w = w + η · δ · x

PERCEPTRON PREDICTION(x)
1 return arg max_class h_w_class(x)
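The pseudocode above translates almost line for line into Python. A self-contained sketch (class and helper names are my own, not MOA's): one sigmoid perceptron per class, trained online, predicting by arg max.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class MulticlassPerceptron:
    """One sigmoid perceptron per class, trained online as in the
    PERCEPTRON LEARNING pseudocode (a sketch, not MOA's exact code)."""

    def __init__(self, n_features, classes, eta=0.1, seed=0):
        rng = random.Random(seed)
        # w[c][0] is the bias w0; w[c][1:] are the feature weights.
        self.w = {c: [rng.uniform(-0.05, 0.05) for _ in range(n_features + 1)]
                  for c in classes}
        self.eta = eta

    def _h(self, c, x):
        w = self.w[c]
        return sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))

    def update(self, x, y):
        for c, w in self.w.items():
            h = self._h(c, x)
            target = 1.0 if c == y else 0.0
            delta = (target - h) * h * (1.0 - h)   # lines 4-5 of the pseudocode
            w[0] += self.eta * delta               # bias input is implicitly 1
            for i, xi in enumerate(x):
                w[i + 1] += self.eta * delta * xi  # line 6: w = w + eta * delta * x

    def predict(self, x):
        return max(self.w, key=lambda c: self._h(c, x))
```

Trained prequentially, each example is inspected only once, which is what keeps the method stream-friendly.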
Hybrid Hoeffding Trees
Hoeffding Naive Bayes Tree: two learners at leaves, Naive Bayes and Majority Class

Hoeffding Perceptron Tree: two learners at leaves, Perceptron and Majority Class

Hoeffding Naive Bayes Perceptron Tree: three learners at leaves, Naive Bayes, Perceptron and Majority Class
Outline
1 RAM-Hours
2 Perceptron Decision Tree Learning
3 Empirical evaluation
What is MOA?
{M}assive {O}nline {A}nalysis is a framework for online learning from data streams.

It is closely related to WEKA.
It includes a collection of offline and online methods as well as tools for evaluation:
boosting and bagging
Hoeffding Trees, with and without Naïve Bayes classifiers at the leaves.
What is MOA?
Easy to extend
Easy to design and run experiments

Philipp Kranen, Hardy Kremer, Timm Jansen, Thomas Seidl, Albert Bifet, Geoff Holmes, Bernhard Pfahringer
RWTH Aachen University, University of Waikato
Benchmarking Stream Clustering Algorithms within the MOA Framework
KDD 2010 Demo
MOA: the bird
The Moa (another native NZ bird) is not only flightless, like theWeka, but also extinct.
Concept Drift Framework
[Figure: sigmoid function f(t) rising from 0 to 1, centred at t0, with transition width W and slope angle α]

Definition
Given two data streams a, b, we define c = a ⊕^W_t0 b as the data stream built by joining the two data streams a and b, where
Pr[c(t) = b(t)] = 1/(1 + e^(−4(t−t0)/W))
Pr[c(t) = a(t)] = 1 − Pr[c(t) = b(t)]
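The ⊕ operator is straightforward to sketch: walk both streams in parallel and, at time t, emit b's example with the sigmoid probability above. A hypothetical implementation, assuming the streams are plain Python iterables:

```python
import math
import random

def drift_join(stream_a, stream_b, t0, W, seed=0):
    """Sketch of c = a (+)^W_t0 b: a gradual transition from stream a to
    stream b, centred at t0, with transition width W."""
    rng = random.Random(seed)
    for t, (ea, eb) in enumerate(zip(stream_a, stream_b)):
        p_b = 1.0 / (1.0 + math.exp(-4.0 * (t - t0) / W))  # Pr[c(t) = b(t)]
        yield eb if rng.random() < p_b else ea

# Long before t0 the output follows a; long after t0 it follows b.
c = list(drift_join(['a'] * 2000, ['b'] * 2000, t0=1000, W=100))
```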
Concept Drift Framework
[Figure: the same sigmoid transition function f(t)]

Example
(((a ⊕^W0_t0 b) ⊕^W1_t1 c) ⊕^W2_t2 d) ...
(((SEA9 ⊕^W_t0 SEA8) ⊕^W_2t0 SEA7) ⊕^W_3t0 SEA9.5)
CovPokElec = (CoverType ⊕^5,000_581,012 Poker) ⊕^5,000_1,000,000 ELEC2
Empirical evaluation
Accuracy

[Plot: accuracy (%) from 40 to 80 over 10,000 to 1,000,000 instances; curves: htnbp, htnb, htp, ht]

Figure: Accuracy on dataset LED with three concept drifts.
Empirical evaluation
RunTime

[Plot: time (sec.) from 0 to 35 over 10,000 to 890,000 instances; curves: htnbp, htnb, htp, ht]

Figure: Time on dataset LED with three concept drifts.
Empirical evaluation
Memory

[Plot: memory (MB) from 0 to 5 over 10,000 to 970,000 instances; curves: htnbp, htnb, htp, ht]

Figure: Memory on dataset LED with three concept drifts.
Empirical evaluation
RAM-Hours

[Plot: RAM-Hours from 0 to 4.5e-05 over 10,000 to 970,000 instances; curves: htnbp, htnb, htp, ht]

Figure: RAM-Hours on dataset LED with three concept drifts.
Empirical evaluation: Cover Type dataset

                    Accuracy    Time    Mem  RAM-Hours
Perceptron             81.68   12.21   0.05       1.00
Naïve Bayes            60.52   22.81   0.08       2.99
Hoeffding Tree         68.30   13.43   2.59      56.98
Trees
Naïve Bayes HT         81.06   24.73   2.59     104.92
Perceptron HT          83.59   16.53   3.46      93.68
NB Perceptron HT       85.77   22.16   3.46     125.59
Bagging
Naïve Bayes HT         85.73  165.75   0.80     217.20
Perceptron HT          86.33   50.06   1.66     136.12
NB Perceptron HT       87.88  115.58   1.25     236.65
Empirical evaluation: Electricity dataset

                    Accuracy    Time    Mem  RAM-Hours
Perceptron             79.07    0.53   0.01       1.00
Naïve Bayes            73.36    0.55   0.01       1.04
Hoeffding Tree         75.35    0.86   0.12      19.47
Trees
Naïve Bayes HT         80.69    0.96   0.12      21.74
Perceptron HT          84.24    0.93   0.21      36.85
NB Perceptron HT       84.34    1.07   0.21      42.40
Bagging
Naïve Bayes HT         84.36    3.17   0.13      77.75
Perceptron HT          85.22    2.59   0.44     215.02
NB Perceptron HT       86.44    3.55   0.30     200.94
Summary
http://moa.cs.waikato.ac.nz/
Summary
Sensor Networks: use Perceptron
Handheld Computers: use Hoeffding Naive Bayes Perceptron Tree
Servers: use Bagging Hoeffding Naive Bayes Perceptron Tree
Summary
http://moa.cs.waikato.ac.nz/
Conclusions
RAM-Hours as a new measure of time and memory
Hoeffding Perceptron Tree
Hoeffding Naive Bayes Perceptron Tree

Future Work
Adaptive learning rate for the Perceptron.