incremental generalized eigenvalue classification on data streams [email protected] high...

31
Incremental Generalized Eigenvalue Classification on data streams [email protected] High Performance Computing and Networking Institute National Research Council – Naples, ITALY International Workshop on Data Stream Management and Mining Beijing, October 27-29, 2008

Upload: emory-banks

Post on 29-Jan-2016

219 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Incremental Generalized Eigenvalue Classification on data streams Mario.Guarracino@cnr.it High Performance Computing and Networking Institute National

Incremental Generalized Eigenvalue Classification on data streams

[email protected]

High Performance Computing and Networking InstituteNational Research Council – Naples, ITALY

International Workshop on Data Stream Management and MiningBeijing, October 27-29, 2008

Page 2: Incremental Generalized Eigenvalue Classification on data streams Mario.Guarracino@cnr.it High Performance Computing and Networking Institute National

Acknowledgements

Panos Pardalos, Onur Seref, Claudio Cifarelli

Davide Feminiano, Salvatore Cuciniello

Rosanna Verde

2Beijing, October 27-29, 2008

Page 3: Incremental Generalized Eigenvalue Classification on data streams Mario.Guarracino@cnr.it High Performance Computing and Networking Institute National

Outline

Introduction

• Challenges, applications and existing methods

ReGEC (Regularized Generalized Eigenvalue Classifier)

I-Regec (Incremental ReGEC )

I-ReGEC on data streams

Experiments

3Beijing, October 27-29, 2008

Page 4: Incremental Generalized Eigenvalue Classification on data streams Mario.Guarracino@cnr.it High Performance Computing and Networking Institute National

Supervised learning

Supervised learning refers to the capability of a system to learn from examples

The trained system is able to answer to new questions

Supervised means the desired answer for the training set is provided by an external teacher

Binary classification is among the most successful methods for supervised learning

4Beijing, October 27-29, 2008

Page 5: Incremental Generalized Eigenvalue Classification on data streams Mario.Guarracino@cnr.it High Performance Computing and Networking Institute National

Supervised learning

Incremental Regularized Generalized Eigenvalue Classifier (I-ReGEC) is a supervised learning algorithm, using a subset of training data

The advantage is the classification model can be incrementally updated

• The algorithm online decides which points bring new information and updates the model

Experiments and comparison assess I-ReGEC classification accuracy and processing speed rates

5Beijing, October 27-29, 2008

Page 6: Incremental Generalized Eigenvalue Classification on data streams Mario.Guarracino@cnr.it High Performance Computing and Networking Institute National

The challenges

Applications on massive data sets are emerging with an increasing frequency

• Data has to be analyzed the data as soon as produced

Legacy data bases and warehouses with petabytes of data can not be loaded in main memory and data are accessed as streams

Classification algorithms have to deal with a large amount of data that are delivered in form of streams

6Beijing, October 27-29, 2008

Page 7: Incremental Generalized Eigenvalue Classification on data streams Mario.Guarracino@cnr.it High Performance Computing and Networking Institute National

The challenges

Classification has to be performed online

• Data is processed on fly, at the transmission speed

Need for data sampling

• Classification models not over fitting data and detailed enough to describe the phenomena.

Change of training behavior during time

• Nature and gradient of data changes over time

7Beijing, October 27-29, 2008

Page 8: Incremental Generalized Eigenvalue Classification on data streams Mario.Guarracino@cnr.it High Performance Computing and Networking Institute National

Applications on data streams

Sensor networks

• Power grids, telecommunications, bio, seismic, security,…

Computer network traffic

• spam, intrusion detection, IP traffic logs…

Bank transactions

• Financial data, frauds, credit cards,…

Web

• Browser clicks, user queries, link rating,…

8Beijing, October 27-29, 2008

Page 9: Incremental Generalized Eigenvalue Classification on data streams Mario.Guarracino@cnr.it High Performance Computing and Networking Institute National

Support Vector Machines

SVM algorithm is among the most successful method for classification, and its variations have been applied to data streams

The general purpose methods are only suitable for small size problems

For large problems, chunking subset selection and decomposition methods use subsets of points

SVM-Light and libSVM are among the most preferred implementations that use chunking subset selection and decomposition methods

9Beijing, October 27-29, 2008

Page 10: Incremental Generalized Eigenvalue Classification on data streams Mario.Guarracino@cnr.it High Performance Computing and Networking Institute National

Find two parallel lines with maximum margin and leave all points of a class on one side

AA BB

support vectors

Support Vector Machines

ebBebAts

..

2

||||min2

10Beijing, October 27-29, 2008

Page 11: Incremental Generalized Eigenvalue Classification on data streams Mario.Guarracino@cnr.it High Performance Computing and Networking Institute National

SVM for data streams

Batch technique uses SVM model on complete data set

Error-driven technique randomly stores k samples in a training and the other in a test set. If a point is well classified, it remains in the traing set

Fixed-partition technique divides the training set in batches of fixed size. This partition permits to add points to current SVM accordingly to the ones loaded in memory

11Beijing, October 27-29, 2008

Page 12: Incremental Generalized Eigenvalue Classification on data streams Mario.Guarracino@cnr.it High Performance Computing and Networking Institute National

SVM for data streams

Exceeding-margin technique checks, at the time t and for each new point, if the new data exceeds the margin evaluated by SVM

• In case, the point is added to the incremental training, otherwise it is discarded

Fixed-margin + errors technique adds the new point to the training if either it exceeds the margin or it is misclassified

12Beijing, October 27-29, 2008

Page 13: Incremental Generalized Eigenvalue Classification on data streams Mario.Guarracino@cnr.it High Performance Computing and Networking Institute National

Pros of SVM based methods

They require a minimal computational burden to build classification models

In case of kernel-based nonlinear classification, they reduce the size of the training set, and thus, the related kernel

All of these methods show that a sensible data reduction is possible while maintaining a comparable level of classification accuracy

13Beijing, October 27-29, 2008

Page 14: Incremental Generalized Eigenvalue Classification on data streams Mario.Guarracino@cnr.it High Performance Computing and Networking Institute National

A different approach: ReGECFind two lines, each the closest to one set and the furthest to the other

AA BB

2

2

0γω, ||||||||min

eBeA

0'

11

x

0'

22

x

14Beijing, October 27-28, 2008

M.R. Guarracino, C. Cifarelli, O. Seref, P. Pardalos. A Classification Method Based on Generalized Eigenvalue Problems, OMS, 2007.

Page 15: Incremental Generalized Eigenvalue Classification on data streams Mario.Guarracino@cnr.it High Performance Computing and Networking Institute National

ReGEC formulation

Let:

G = [A –e]’[A –e], H = [B –e]’[B –e], z = [’ ]’

Equation becomes:

Raleigh quotient of Generalized Eigenvalue Problem

G x = H x

2

2

0γω, ||||||||min

eBeA

zHz

zGzz '

'min0

15Beijing, October 27-29, 2008

Page 16: Incremental Generalized Eigenvalue Classification on data streams Mario.Guarracino@cnr.it High Performance Computing and Networking Institute National

ReGEC classification of new points

A new point is assigned to the class described by the closest line AA BB

||||

|'|),(

xPxdist i

16Beijing, October 27-29, 2008

Page 17: Incremental Generalized Eigenvalue Classification on data streams Mario.Guarracino@cnr.it High Performance Computing and Networking Institute National

ReGEC

Let [1, 1] and [2, 2] be eigenvectors of

min and max eigenvalues of G x = H x

• a ∈ A, closer to x'1 - 1 =0 than to x'2-2=0,

• b ∈ B, closer to x'2 - 2=0 than to x'1-1=0.

17Beijing, October 27-28, 2008

Page 18: Incremental Generalized Eigenvalue Classification on data streams Mario.Guarracino@cnr.it High Performance Computing and Networking Institute National

Incremental learning

The purpose of incremental learning is to find a small and robust subset of the training set that provides comparable accuracy results.

A smaller set of points reduces the probability of overfitting the problem.

A classification model built from a smaller subset is computationally more efficient in predicting new points.

As new points become available, the cost of retraining the algorithm decreases if the influence of the new points is only evaluated by the small subset.

18Beijing, October 27-29, 2008

C. Cifarelli, M.R. Guarracino, O. Seref, S. Cuciniello, and P.M. Pardalos. Incremental Classification with Generalized Eigenvalues, JoC, 2007.

Page 19: Incremental Generalized Eigenvalue Classification on data streams Mario.Guarracino@cnr.it High Performance Computing and Networking Institute National

Incremental learning algorithm1: 1 = C \ C0

2: {M0, Acc0} = Classify ( C; C0 )

3: k = 1

4: while |k| > 0 do

5: xk = x : maxx {Mk-1 ∩ k-1} {dist(x, Pclass(x))}

6: {Mk, Acck } = Classify( C; {Ck-1 U {xk}} )

7: if Acck > Acck-1 then

8: Ck = Ck-1 U {xk}

9: end if

10: k = k-1 \ {xk}

11: k = k + 1

12: end while 19Beijing, October 27-29, 2008

Page 20: Incremental Generalized Eigenvalue Classification on data streams Mario.Guarracino@cnr.it High Performance Computing and Networking Institute National

Incremental classification on data streams

Find a small and robust subset of the training set while accessing data available in awindow

When the window is full, all points are processed by the classifier

wsize

20Beijing, October 27-28, 2008

M.R. Guarracino, S. Cuciniello, D. Feminiano. Incremental Generalized Eigenvalue Classification on Data Streams. MODULAD, 2007.

Page 21: Incremental Generalized Eigenvalue Classification on data streams Mario.Guarracino@cnr.it High Performance Computing and Networking Institute National

I-ReGEC: Incremental Regularized Eigenvalue Classifier on Streams

wsize

21Beijing, October 27-29, 2008

Page 22: Incremental Generalized Eigenvalue Classification on data streams Mario.Guarracino@cnr.it High Performance Computing and Networking Institute National

I-ReGEC: Incremental Regularized Eigenvalue Classifier on Streams

At each step, data in window are processed with the incremental learning classifier…

New data Old data

And hyperplanes are built

22Beijing, October 27-29, 2008

Page 23: Incremental Generalized Eigenvalue Classification on data streams Mario.Guarracino@cnr.it High Performance Computing and Networking Institute National

I-ReGEC: Incremental Regularized Eigenvalue Classifier on Streams

…and I-ReGEC updates hyperplanes configuration

Step by step new points are processed

23Beijing, October 27-29, 2008

Page 24: Incremental Generalized Eigenvalue Classification on data streams Mario.Guarracino@cnr.it High Performance Computing and Networking Institute National

I-ReGEC: Incremental Regularized Eigenvalue Classifier on Streams

Some of them are discarted if their information contribution is useless

But not all points are considered…

24Beijing, October 27-29, 2008

Page 25: Incremental Generalized Eigenvalue Classification on data streams Mario.Guarracino@cnr.it High Performance Computing and Networking Institute National

I-ReGEC: Incremental Regularized Eigenvalue Classifier on Streams

New unknown incoming points are classified by their distance from the hyperplanes

25Beijing, October 27-29, 2008

Page 26: Incremental Generalized Eigenvalue Classification on data streams Mario.Guarracino@cnr.it High Performance Computing and Networking Institute National

An incremental learning technique based on ReGEC that determines the classification model from a very small sample of data stream

I-ReGEC: Incremental Regularized Eigenvalue Classifier on Streams

26Beijing, October 27-29, 2008

Page 27: Incremental Generalized Eigenvalue Classification on data streams Mario.Guarracino@cnr.it High Performance Computing and Networking Institute National

ExperimentsLarge-noisy-crossed-norm

Data set

200.000 points with 20 featuresequally divided in 2 classes

100.000 train points 100.000 test points

Each class is drawn from a multivariate normal distribution

27Beijing, October 27-29, 2008

Page 28: Incremental Generalized Eigenvalue Classification on data streams Mario.Guarracino@cnr.it High Performance Computing and Networking Institute National

Miss-classification results

SI-ReGEC has the lowest error and uses the smallest incremental set

B: Batch SVMED: Error-driven KNNFP: Fixed partition SVMEM: Exceeding-margin SVMEM+E: Fixed margin + errors

28Beijing, October 27-29, 2008

Page 29: Incremental Generalized Eigenvalue Classification on data streams Mario.Guarracino@cnr.it High Performance Computing and Networking Institute National

Window size

Larger windows lead to smaller train subset and execution time increases with window growth

One billion elements per day can be processedon standard hardware

29Beijing, October 27-29, 2008

Page 30: Incremental Generalized Eigenvalue Classification on data streams Mario.Guarracino@cnr.it High Performance Computing and Networking Institute National

Conclusion and future work

The classification accuracy of I-ReGEC a well compares with other methods

I-ReGEC produces small incremental training sets

In future, investigate how to dinamically adapt window size to stream rate and nonstationary data streams

Beijing, October 27-29, 2008 30

Page 31: Incremental Generalized Eigenvalue Classification on data streams Mario.Guarracino@cnr.it High Performance Computing and Networking Institute National

Incremental Generalized Eigenvalue Classification on data streams

[email protected]

High Performance Computing and Networking InstituteNational Research Council – Naples, ITALY

International Workshop on Data Stream Management and MiningBeijing, October 27-29, 2008