Incremental Generalized Eigenvalue Classification on data streams
High Performance Computing and Networking Institute, National Research Council – Naples, Italy
International Workshop on Data Stream Management and Mining, Beijing, October 27-29, 2008
Acknowledgements
Panos Pardalos, Onur Seref, Claudio Cifarelli
Davide Feminiano, Salvatore Cuciniello
Rosanna Verde
Outline
Introduction
• Challenges, applications and existing methods
ReGEC (Regularized Generalized Eigenvalue Classifier)
I-ReGEC (Incremental ReGEC)
I-ReGEC on data streams
Experiments
Supervised learning
Supervised learning refers to the capability of a system to learn from examples
The trained system is able to answer new questions
Supervised means that the desired answers for the training examples are provided by an external teacher
Binary classification is among the most successful methods for supervised learning
Supervised learning
Incremental Regularized Generalized Eigenvalue Classifier (I-ReGEC) is a supervised learning algorithm that uses a subset of the training data
The advantage is that the classification model can be incrementally updated
• The algorithm decides online which points bring new information and updates the model accordingly
Experiments and comparisons assess the classification accuracy and processing speed of I-ReGEC
The challenges
Applications on massive data sets are emerging with increasing frequency
• Data has to be analyzed as soon as it is produced
Legacy databases and warehouses with petabytes of data cannot be loaded into main memory, so data are accessed as streams
Classification algorithms have to deal with large amounts of data delivered in the form of streams
The challenges
Classification has to be performed online
• Data is processed on the fly, at transmission speed
Need for data sampling
• Classification models must not overfit the data, yet must be detailed enough to describe the phenomena
Training behavior changes over time
• The nature and distribution of the data change over time
Applications on data streams
Sensor networks
• Power grids, telecommunications, bio, seismic, security,…
Computer network traffic
• spam, intrusion detection, IP traffic logs…
Bank transactions
• Financial data, frauds, credit cards,…
Web
• Browser clicks, user queries, link rating,…
Support Vector Machines
The SVM algorithm is among the most successful methods for classification, and its variations have been applied to data streams
General-purpose methods are only suitable for small problems
For large problems, chunking, subset selection, and decomposition methods work on subsets of points
SVM-Light and libSVM are among the most widely used implementations based on chunking, subset selection, and decomposition
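As a point of reference, here is a minimal batch baseline in Python using scikit-learn, whose SVC class wraps libSVM; the two-Gaussian toy data is illustrative only, not from the talk.

```python
# Minimal batch-SVM baseline via scikit-learn's SVC, a wrapper around libSVM.
# The toy data below is illustrative only.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, size=(100, 2)),   # class A
               rng.normal(+2.0, 1.0, size=(100, 2))])  # class B
y = np.array([-1] * 100 + [+1] * 100)

clf = SVC(kernel="linear").fit(X, y)  # libSVM solves the dual by decomposition
print("number of support vectors:", len(clf.support_))
```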
Support Vector Machines
Find two parallel lines with maximum margin that leave all points of each class on one side
[Figure: two classes A and B separated by parallel lines; the points lying on the margin are the support vectors]
$$\min_{\omega \neq 0,\; b} \ \frac{\|\omega\|^2}{2} \quad \text{s.t.}\quad A\omega \ge eb + e, \quad B\omega \le eb - e$$
SVM for data streams
The batch technique builds the SVM model on the complete data set
The error-driven technique randomly stores k samples in a training set and the others in a test set; if a point is well classified, it remains in the training set
The fixed-partition technique divides the training set into batches of fixed size; this partition permits adding points to the current SVM according to the ones loaded in memory
SVM for data streams
The exceeding-margin technique checks, at time t and for each new point, whether the new data exceeds the margin evaluated by the SVM
• If so, the point is added to the incremental training set; otherwise it is discarded
The fixed-margin + errors technique adds the new point to the training set if it either exceeds the margin or is misclassified (see the sketch below)
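A minimal Python sketch of this selection rule, assuming labels in {-1, +1} and a model exposing a scikit-learn-style `decision_function` (signed distance to the separating hyperplane); the unit margin threshold matches the canonical SVM formulation.

```python
import numpy as np

def keep_point(model, x, y, margin=1.0):
    """Fixed-margin + errors rule (sketch): keep a new point (x, y) for
    incremental training only if it falls inside the current margin or is
    misclassified; otherwise discard it. `model.decision_function` follows
    the scikit-learn convention; labels y are in {-1, +1}."""
    f = float(model.decision_function(x.reshape(1, -1))[0])
    misclassified = np.sign(f) != y
    inside_margin = abs(f) < margin   # the point "exceeds" the current margin
    return misclassified or inside_margin
```

With this rule, the exceeding-margin variant uses only the `inside_margin` test, while the fixed-margin + errors variant uses both conditions.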
Pros of SVM based methods
They require a minimal computational burden to build classification models
In case of kernel-based nonlinear classification, they reduce the size of the training set, and thus, the related kernel
All of these methods show that a considerable data reduction is possible while maintaining a comparable level of classification accuracy
A different approach: ReGEC
Find two lines, each the closest to one set and the furthest from the other
[Figure: classes A and B, each fitted by its own line]
$$\min_{\omega,\,\gamma \neq 0} \ \frac{\|A\omega - e\gamma\|^2}{\|B\omega - e\gamma\|^2}$$
The two lines are $x'\omega_1 - \gamma_1 = 0$ (closest to A) and $x'\omega_2 - \gamma_2 = 0$ (closest to B)
M.R. Guarracino, C. Cifarelli, O. Seref, P. Pardalos. A Classification Method Based on Generalized Eigenvalue Problems. Optimization Methods and Software, 2007.
ReGEC formulation
Let:
G = [A −e]'[A −e], H = [B −e]'[B −e], z = [ω' γ]'
The minimization problem becomes the Rayleigh quotient of a generalized eigenvalue problem:
$$\min_{\omega,\,\gamma \neq 0} \ \frac{\|A\omega - e\gamma\|^2}{\|B\omega - e\gamma\|^2} \;=\; \min_{z \neq 0} \ \frac{z'Gz}{z'Hz}, \qquad G x = \lambda H x$$
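A minimal numpy/scipy sketch of this step, assuming A and B hold the training points of the two classes, one per row; the regularization that gives ReGEC its name (a perturbation of G and H) is omitted here for brevity.

```python
import numpy as np
from scipy.linalg import eig

def regec_train(A, B):
    """Sketch of ReGEC training: solve G z = lambda H z and keep the
    eigenvectors of the extreme eigenvalues (regularization omitted)."""
    G = np.hstack([A, -np.ones((len(A), 1))])
    H = np.hstack([B, -np.ones((len(B), 1))])
    G, H = G.T @ G, H.T @ H          # G = [A -e]'[A -e], H = [B -e]'[B -e]
    lam, V = eig(G, H)               # generalized eigenvalue problem
    lam = lam.real                   # G, H symmetric: eigenvalues are real
    z1 = V[:, np.argmin(lam)].real   # plane closest to A, furthest from B
    z2 = V[:, np.argmax(lam)].real   # plane closest to B, furthest from A
    # each z = [omega' gamma]': split into normal vector omega and offset gamma
    return (z1[:-1], z1[-1]), (z2[:-1], z2[-1])
```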
ReGEC classification of new points
A new point is assigned to the class described by the closest line:
$$\operatorname{dist}(x, P_i) = \frac{|x'\omega_i - \gamma_i|}{\|\omega_i\|}$$
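And the corresponding prediction step, as a sketch reusing the (omega, gamma) pairs returned by the training sketch above:

```python
import numpy as np

def regec_predict(x, plane_A, plane_B):
    """Assign x to the class whose hyperplane x'omega - gamma = 0 is closest,
    using dist(x, P_i) = |x'omega_i - gamma_i| / ||omega_i||."""
    def dist(plane):
        omega, gamma = plane
        return abs(x @ omega - gamma) / np.linalg.norm(omega)
    return "A" if dist(plane_A) <= dist(plane_B) else "B"
```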
ReGEC
Let [ω1, γ1] and [ω2, γ2] be the eigenvectors associated with the min and max eigenvalues of G x = λ H x. Then:
• a ∈ A is closer to x'ω1 − γ1 = 0 than to x'ω2 − γ2 = 0,
• b ∈ B is closer to x'ω2 − γ2 = 0 than to x'ω1 − γ1 = 0.
Incremental learning
The purpose of incremental learning is to find a small and robust subset of the training set that provides comparable accuracy results.
A smaller set of points reduces the probability of overfitting the problem.
A classification model built from a smaller subset is computationally more efficient in predicting new points.
As new points become available, the cost of retraining decreases if the influence of each new point is evaluated only against the small subset.
C. Cifarelli, M.R. Guarracino, O. Seref, S. Cuciniello, P.M. Pardalos. Incremental Classification with Generalized Eigenvalues. Journal of Classification, 2007.
Incremental learning algorithm
1: Γ1 = C \ C0
2: {M0, Acc0} = Classify(C, C0)
3: k = 1
4: while |Γk| > 0 do
5:   xk = arg max {x ∈ Mk−1 ∩ Γk−1} dist(x, Pclass(x))
6:   {Mk, Acck} = Classify(C, Ck−1 ∪ {xk})
7:   if Acck > Acck−1 then
8:     Ck = Ck−1 ∪ {xk}
9:   end if
10:  Γk = Γk−1 \ {xk}
11:  k = k + 1
12: end while
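A Python sketch of this loop, under stated assumptions: `classify` is a hypothetical helper that trains ReGEC on the candidate incremental set and returns the points of C it misclassifies together with the accuracy on C; `dist_to_own_plane` computes dist(x, Pclass(x)); points are assumed hashable (e.g. tuples) so they can live in sets. Keeping the previous accuracy as the comparison baseline when a point is rejected is one reasonable reading of lines 7-9.

```python
def incremental_regec(C, C0, classify, dist_to_own_plane):
    """Sketch of the incremental selection loop.

    C:  full training set, C0: initial incremental set (sets of hashable points).
    classify(C, Ck) -> (M, acc): hypothetical helper; M is the set of points of
        C misclassified by the model trained on Ck, acc the accuracy on C.
    dist_to_own_plane(x): distance of x from the hyperplane of its own class.
    """
    Gamma = set(C) - set(C0)          # Gamma_1 = C \ C0: candidates to examine
    M, acc = classify(C, C0)
    Ck = set(C0)
    while Gamma:
        pool = M & Gamma              # misclassified, not-yet-examined points
        if not pool:                  # the arg max on line 5 is undefined: stop
            break
        xk = max(pool, key=dist_to_own_plane)   # furthest from its own plane
        M_new, acc_new = classify(C, Ck | {xk})
        if acc_new > acc:             # keep xk only if accuracy improves
            Ck.add(xk)
            M, acc = M_new, acc_new
        Gamma.discard(xk)             # Gamma_k = Gamma_{k-1} \ {xk}
    return Ck
```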
Incremental classification on data streams
Find a small and robust subset of the training set while accessing the data available in a window
When the window is full, all points are processed by the classifier
[Figure: the stream is buffered in a window of size wsize]
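A sketch of the streaming wrapper, assuming a hypothetical `update_model` that runs an incremental learning step like the one in the previous section on the buffered points:

```python
def iregec_stream(stream, wsize, update_model, model=None):
    """Sketch of I-ReGEC on a data stream: buffer incoming points in a window
    of size wsize; when the window is full, hand it to the incremental
    learner, which keeps only the points that bring new information."""
    window = []
    for point in stream:                         # stream: any iterable of points
        window.append(point)
        if len(window) == wsize:
            model = update_model(model, window)  # hypothetical incremental step
            window.clear()
    if window:                                   # flush the last partial window
        model = update_model(model, window)
    return model
```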
M.R. Guarracino, S. Cuciniello, D. Feminiano. Incremental Generalized Eigenvalue Classification on Data Streams. MODULAD, 2007.
I-ReGEC: Incremental Regularized Eigenvalue Classifier on Streams
I-ReGEC: Incremental Regularized Eigenvalue Classifier on Streams
At each step, the data in the window are processed with the incremental learning classifier and the hyperplanes are built
[Figure: window contents, split into new data and old data]
I-ReGEC: Incremental Regularized Eigenvalue Classifier on Streams
…and I-ReGEC updates the hyperplane configuration
Step by step, new points are processed
I-ReGEC: Incremental Regularized Eigenvalue Classifier on Streams
But not all points are considered…
Some of them are discarded if their information contribution is negligible
I-ReGEC: Incremental Regularized Eigenvalue Classifier on Streams
New unknown incoming points are classified by their distance from the hyperplanes
I-ReGEC: Incremental Regularized Eigenvalue Classifier on Streams
An incremental learning technique based on ReGEC that determines the classification model from a very small sample of the data stream
Experiments: Large-noisy-crossed-norm
Data set
200,000 points with 20 features, equally divided into 2 classes
100,000 training points, 100,000 test points
Each class is drawn from a multivariate normal distribution
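For reproduction purposes, a hedged numpy sketch of generating a data set of this shape; the slides do not give the benchmark's means and covariances, so the parameters below are placeholders only.

```python
import numpy as np

# Illustrative generator: two 20-dimensional Gaussian classes, 100,000 points
# each, split evenly into train and test. The means and covariances are
# placeholders, not the benchmark's actual parameters.
rng = np.random.default_rng(0)
d, n = 20, 100_000
A = rng.multivariate_normal(mean=np.full(d, -1.0), cov=np.eye(d), size=n)
B = rng.multivariate_normal(mean=np.full(d, +1.0), cov=np.eye(d), size=n)
X_train = np.vstack([A[:n // 2], B[:n // 2]])   # 100,000 training points
X_test = np.vstack([A[n // 2:], B[n // 2:]])    # 100,000 test points
y_train = y_test = np.array([0] * (n // 2) + [1] * (n // 2))
```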
Misclassification results
SI-ReGEC has the lowest error and uses the smallest incremental set
B: Batch SVM
ED: Error-driven KNN
FP: Fixed-partition SVM
EM: Exceeding-margin SVM
EM+E: Fixed margin + errors
Window size
Larger windows lead to a smaller training subset, but execution time increases as the window grows
One billion elements per day (roughly 11,600 points per second) can be processed on standard hardware
Conclusion and future work
The classification accuracy of I-ReGEC compares well with that of other methods
I-ReGEC produces small incremental training sets
Future work: investigate how to dynamically adapt the window size to the stream rate and to nonstationary data streams