
ORIGINAL PAPER

Towards incremental learning of nonstationary imbalanced data stream: a multiple selectively recursive approach

Sheng Chen • Haibo He

Received: 1 August 2010 / Accepted: 4 November 2010

© Springer-Verlag 2010

Abstract  Difficulties in learning from nonstationary data streams are generally twofold. First, a dynamically structured learning framework is required to catch up with the evolution of unstable class concepts, i.e., concept drifts. Second, imbalanced class distribution over the data stream demands a mechanism to intensify the underrepresented class concepts for improved overall performance. To alleviate the challenges brought by these issues, we propose the recursive ensemble approach (REA) in this paper. To battle against the imbalanced learning problem in the training data chunk received at any timestamp t, i.e., $S_t$, REA adaptively pushes into $S_t$ part of the minority class examples received within [0, t − 1] to balance its skewed class distribution. Hypotheses are then progressively developed over time for all balanced training data chunks and combined as an ensemble classifier in a dynamically weighted manner, which addresses the concept drift issue in time. Theoretical analysis proves that REA can provide less erroneous prediction results than a comparative algorithm. Besides that, an empirical study on both synthetic benchmarks and a real-world data set validates the effectiveness of REA compared with other algorithms in terms of evaluation metrics consisting of overall prediction accuracy and ROC curve.

Keywords  Incremental learning · Nonstationary data · Imbalanced learning · Stream data · Ensemble learning · Concept drift

1 Introduction

Learning from data streams has been featured in many practical applications such as network traffic monitoring and credit fraud identification (Babcock et al. 2002). Generally speaking, a data stream is a sequence of unbounded, real-time data items arriving at a very high rate that can be read only once by an application (Gaber et al. 2003). The restriction placed by the end of this definition is also called the one-pass constraint (Aggarwal 2007), which is also claimed by other literature (Sharma 1998; Lange and Grieser 2002; Muhlbaier et al. 2009). Studies of learning from data streams have flourished for quite a few years. To name a few, Domingos and Hulten (2000) proposed the very fast decision tree (VFDT) to address data mining from high-speed data streams such as Web access data. By using the Hoeffding bound, it can offer approximately identical performance to that of a conventional learner on a static data set. Learn++ (Polikar et al. 2001) approaches learning from data streams through an aggressive ensemble-of-ensembles learning paradigm. Briefly speaking, Learn++ processes the data stream in units of data chunks. For each data chunk, Learn++ applies a multi-layer perceptron (MLP) base learner to create multiple ensemble hypotheses upon it. He and Chen (2008) proposed the IMORL framework to address learning from video data streams. It calculates the Euclidean distance in feature space between examples within consecutive data chunks to transmit sampling weights in a biased manner, i.e., hard-to-learn examples gradually receive higher weights for learning, which resembles AdaBoost's weight-updating mechanism (Freund et al. 1997) to some degree.

H. He (✉)
Department of Electrical, Computer, and Biomedical Engineering, University of Rhode Island, Kingston, RI 02881, USA
e-mail: [email protected]

S. Chen
Department of Electrical and Computer Engineering, Stevens Institute of Technology, Hoboken, NJ 07030, USA
e-mail: [email protected]

Evolving Systems · DOI 10.1007/s12530-010-9021-y


In Angelov and Zhou (2006), an approach to real-time generation of fuzzy rule-based systems of the eXtended Takagi-Sugeno (xTS) type from data streams was proposed, which applies an incremental clustering procedure to generate clusters that form fuzzy rule-based systems. Georgieva and Filev (2009) proposed the Gustafson-Kessel algorithm for incremental clustering of data streams. It applies an adaptive-distance metric to identify clusters with different shapes and orientations. As a follow-up, Filev and Georgieva (2010) extended the Gustafson-Kessel algorithm to enable real-time clustering of data streams. In Dovzan and Skrjanc (2010), a recursive version of the fuzzy identification method and a predictive functional model are proposed for the control of a nonlinear, time-varying process.

The incapability of storing all data in memory for learning, as done by traditional approaches, is not the sole challenge that data streams present to the community. Concept drift, also recognized as the time-evolving nature of data (Aggarwal 2003), means it is undesirable yet inevitable that, most of the time, class concepts evolve as the data stream moves forward. This property, combined with the virtually unbounded volume of a data stream, accounts for the so-called "stability-plasticity" dilemma (Grossberg 1988). One may be trapped in an endless loop of pondering whether to reserve just the most recent knowledge to battle against concept drift or to keep track of as much knowledge as possible to avoid "catastrophic forgetting". With regard to this, many works have been recorded that strike a balance between the two ends of the "stability-plasticity" dilemma. Marked as an effort to adapt the ensemble approach to time-evolving data streams, SEA (Street and Kim 2001) maintains an ensemble pool of C4.5 hypotheses with a fixed size, each of which is built upon a data chunk with a unique time stamp. When the request to insert a new hypothesis is made but the ensemble pool is fully occupied, a criterion is introduced to evaluate whether the new hypothesis is qualified enough to be accommodated at the expense of popping an existing hypothesis. Directly targeting the choice between new and old data, Fan (2004) examines the necessity of referring to old data's help. If it is unnecessary, reserving the most recent data suffices to yield a hypothesis with satisfying performance. Otherwise, cross validation is applied to locate the portion of old data that may be most helpful to complement the most recent data for building an optimal hypothesis. The potential problem for this approach is the choice of granularity for cross validation. Straightforwardly, finer granularity would more accurately provide the desirable portion of old data; however, increased performance comes with extra overhead. When the granularity is tuned down to the scale of a single example, cross validation degenerates into a brute-force method, which may be intractable for speed-sensitive applications. Other ways of countering concept drift include the sliding window method (Last 2002), which maintains a sliding window with either a fixed or adaptively adjustable size to determine the timeframe of the knowledge that should be reserved, and the fading factor method (Law and Zaniolo 2005), which assigns a time-decaying factor (usually in the form of an inverse exponential) to each hypothesis built over time. In such a way, old knowledge is gradually obsoleted and can be removed when the corresponding factor drops below a threshold.

Despite the popularity of data stream study, learning from nonstationary data streams with skewed class distributions is a relatively uncharted area, whose difficulty resides in its context. In the static context, the counterpart of this problem is recognized as "imbalanced learning", which corresponds to domains where certain types of data distribution over-dominate the instance space compared to others (He and Garcia 2009). It is a recently emerged area and has attracted significantly growing attention in the community (Fan et al. 1999; Chawla et al. 2002, 2003; Hong et al. 2007; Masnadi-Shirazi and Vasconcelos 2007). However, the same story does not apply to the same problem in the context of data streams, where the number of solutions is rather limited. Those on record include Gao et al. (2007), which accommodates all previous minority class examples into the current training data set to compensate for the skewed class distribution, upon which an ensemble of hypotheses is built. In lieu of this aggressive accommodation mechanism, our previous work SERA (Chen and He 2009) chooses a portion of previous minority class examples for the current training data chunk based on their similarity. The accumulation of previous minority class examples is of limited volume due to the skewed class distribution; therefore, it should not be considered a violation of the one-pass constraint.

In this paper, we propose a recursive ensemble approach (REA) in an effort to provide a solution for handling imbalanced data streams with nonstationary class concepts. Different from Gao et al. (2007), REA takes a similar step as SERA to incorporate part of the previous minority class examples into the current training data chunk. However, instead of limiting the availability of hypotheses to the current training data chunk as in SERA as well as in Gao et al. (2007), REA combines all hypotheses built over time in a dynamically weighted manner to make predictions on the testing data set.

The proposed REA framework in this work is mainly motivated by our recent approach MuSeRA (Chen and He 2010). Specifically, in this paper we investigate a different strategy of estimating the similarity between previous minority class examples and the current minority class set. Furthermore, based on the success of SERA (Chen and He 2009) and MuSeRA (Chen and He 2010), in this work we significantly extend the simulations of REA to both synthetic benchmarks and real-world data sets. We also design various simulations to test the robustness of REA under different parameter settings. Such empirical results, together with the theoretical analysis, provide a more comprehensive justification of the effectiveness of the proposed REA framework.

The rest of this paper is organized as follows. Section 2 discusses the technical details of the REA algorithm. Section 3 gives a theoretical analysis of the prediction error rate of REA and compares it with that of an existing algorithm. Section 4 introduces the configuration and assessment metrics applied to the simulations. After that, two artificially synthetic benchmarks and a real-world data set are used to evaluate the effectiveness of the proposed REA in comparison with other existing algorithms. Section 5 concludes the paper and briefly introduces potential improvements that can be made to REA in the future.

2 The proposed algorithm for nonstationary imbalanced data stream

2.1 Preliminaries for REA

Before elaborating the algorithm-level framework of REA, we would like to introduce some preliminary knowledge to facilitate its understanding.

2.1.1 The recursive approach for imbalanced learning

Sampling-based methods form a very important category of the imbalanced learning family. Generally speaking, they consist of over-sampling approaches and under-sampling approaches (He and Garcia 2009).

Over-sampling approaches, such as SMOTE/SMOTEBoost (Chawla et al. 2002, 2003) and DataBoost-IM (Hong et al. 2007), create synthetic minority class instances based upon existing minority class examples to balance the skewed class distribution. REA also seeks to amplify the number of minority class examples in the current training data chunk. But instead of creating synthetic minority class instances, REA collects minority examples from previous training data chunks over time and selectively accommodates those with high similarity to the current minority class set into the current training data chunk.

We would like to note that the approach proposed in Gao et al. (2007) also collects previous minority class examples to amplify the current training data chunk. However, the difference is that it adopts a "take-in-all" mechanism that puts all previous minority class examples into the current training data chunk, no matter how many have been accumulated. Besides, that method takes an under-sampling-like approach to partition the majority class examples, without replacement, into several disjoint subsets. Hypotheses are built on each of these subsets plus a replica of the amplified minority class set. The averaged combination of these hypotheses is used to make predictions on the current testing data set. We will see later in this section that REA uses a different ensemble approach.

2.1.2 The k-nearest neighbors selective accommodation mechanism

Similar to Chen and He (2009), REA selectively accommodates a certain amount of previous minority examples into the current training data chunk to balance the skewed class distribution. This differs from the mechanism of Gao et al. (2007), which amplifies the current training data chunk by incorporating all previous minority examples regardless of their similarity degree to the current minority example set. In Chen and He (2009), the empirical study showed that the performance of SERA was competitive compared with the take-in-all mechanism employed by Gao et al. (2007). REA inherits the selective accommodation mechanism of SERA, which gives it a good chance of receiving similar benefits.

In Chen and He (2009), similarity was estimated based on the Mahalanobis distance defined by:

$$d = \sqrt{(x - \mu)^{T} \Sigma^{-1} (x - \mu)} \qquad (1)$$

where x is the feature vector of a previous minority class example, and $\mu$ and $\Sigma$ are the mean and covariance matrix of the current minority class set, respectively. As shown in Fig. 1a, each previous minority class example calculates its Mahalanobis distance to the current minority class set, based on which SERA determines which part of the previous minority class examples should be added into the current training data chunk.
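As a quick illustration, Eq. (1) can be computed directly with NumPy; the toy minority set below is hypothetical, and the pseudo-inverse is used only as a guard for a possibly singular covariance matrix.

```python
import numpy as np

def mahalanobis_to_set(x, minority_set):
    # Eq. (1): d = sqrt((x - mu)^T Sigma^{-1} (x - mu)), where mu and Sigma
    # are the mean and covariance of the current minority class set.
    mu = minority_set.mean(axis=0)
    sigma_inv = np.linalg.pinv(np.cov(minority_set, rowvar=False))
    diff = x - mu
    return float(np.sqrt(diff @ sigma_inv @ diff))

rng = np.random.default_rng(0)
current_minority = rng.random((50, 3))   # hypothetical current minority set
print(mahalanobis_to_set(rng.random(3), current_minority))
```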

This method, however, may exhibit a potential flaw: it assumes that there are no disjoint sub-concepts within the minority class concept. Otherwise, there may exist several sub-concepts for the minority class, i.e., D1 and D2 in Fig. 1b instead of D in Fig. 1a. REA addresses this flaw by adopting the k-nearest neighbors paradigm to estimate the similarity degree. Specifically, each previous minority class example takes the number of minority examples among its k-nearest neighbors in the current training data chunk as its similarity degree to the current minority class set, as illustrated in Fig. 1c. Here the highlighted areas surrounded by dashed circles represent the k-nearest neighbors search areas for the previous minority class examples S1, S2, S3, S4 and S5. The search area of Si represents the region where the k-nearest neighbors of Si in the current training data chunk fall, which consists of both majority class and minority class examples. Since majority class examples do not affect the similarity estimation, they are not shown in Fig. 1. Current minority class examples are represented by bold circles, the numbers of which falling in each of the "search areas" are 3, 1, 2, 1, and 0, respectively. Therefore, REA ranks the similarity of S1, S2, S3, S4 and S5 to the current minority example set as: S1 > S3 > S2 = S4 > S5.
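A minimal sketch of this similarity estimate, assuming Euclidean distance and plain NumPy; the function and argument names are illustrative, not part of the original formulation.

```python
import numpy as np

def knn_similarity(prev_minority, chunk_X, chunk_y, k=10, minority_label=1):
    """Similarity degree of each previous minority example: the number of
    current minority class examples among its k nearest neighbors in the
    current training data chunk (majority neighbors do not count)."""
    sims = []
    for s in prev_minority:
        dist = np.linalg.norm(chunk_X - s, axis=1)  # distances to all chunk examples
        knn = np.argsort(dist)[:k]                  # its k-nearest-neighbors "search area"
        sims.append(int(np.sum(chunk_y[knn] == minority_label)))
    return np.array(sims)                           # e.g., [3, 1, 2, 1, 0] for S1..S5
```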


2.1.3 The ensemble approach for concept drifts

Both Gao et al. (2007) and Chen and He (2009) handle concept drifts by relying solely on the current training data chunk (inflated by all/part of the previous minority class examples). This makes sense since, due to concept drifts, only the current training data chunk stands for exactly accurate information about the class concept. However, sole maintenance of a hypothesis/hypotheses on the current training data chunk is more or less equal to discarding a significant part of previous knowledge, since knowledge of previous majority class examples can never be accessed again, either explicitly or implicitly, once they have been processed. This situation, according to Grossberg (1988), could partially account for the "catastrophic forgetting" that a qualified online learning system should manage to avoid by not disconnecting itself from previous knowledge.

To address this issue, REA maintains all hypotheses built on training data chunks over time. Concerning the way hypotheses are combined, Gao et al. (2007) employs a uniform voting mechanism, since it claims that in practice the class concepts of the testing data set may not necessarily evolve consistently with the streaming training data sets. Putting aside this debatable subject, in this work we assume the class distribution of the testing data set keeps tuned to the evolution of the training data chunks. Therefore REA weighs hypotheses according to their classification accuracy on the current training data chunk. The weighted combination of all the hypotheses makes predictions on the current testing data set.

As stated in Sect. 2.1.1, Gao et al. (2007) also employs an ensemble approach. Nonetheless, the difference is that REA dynamically associates all knowledge acquired over time in a weighted manner, which captures the very essence of incremental learning, while the method of Gao et al. (2007) is more like an exploitation process on the current training data chunk using random under-sampling. We will explore the performance of these two methods in our experiments.

2.2 The REA learning algorithm

In the general incremental learning scenario, a training data chunk $S_t$ with labeled examples and a testing data set $T_t$ with unlabeled instances always arrive in a pairwise manner at any timestamp t. The task of REA at timestamp t is to make predictions on $T_t$ as accurately as possible based on the knowledge learned from $(S_1, S_2, \ldots, S_t)$. Without loss of generality, it is assumed in REA that the imbalanced ratios of all training data chunks are the same. One can easily generalize this to the case where the training data chunks have different imbalanced ratios.

The pseudo-code of the proposed REA for incremental learning of the nonstationary imbalanced data stream at timestamp t is thus formulated as follows:
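A compact sketch of one REA time step is given below, following the description in Sect. 2.1 and the system framework of Fig. 2. It is only an illustration of the procedure: the CART base learner stands in for the base classifier discussed in Sect. 4, knn_similarity is the illustrative helper sketched in Sect. 2.1.2, and the chunk error rate with inverse-error weights is an assumed stand-in for the exact forms of Eqs. (2) and (3).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def rea_step(G, X, y, H, f=0.5, k=10, minority=1):
    """One REA time step on training chunk (X, y). G accumulates all minority
    examples seen before time t; H is the list of hypotheses built over time."""
    cur_minority = X[y == minority]               # kept for updating G afterwards
    m, c = len(y), float(np.mean(y == minority))  # chunk size and imbalanced ratio
    need = int((f - c) * m)                       # borrow (f - c) x m old examples
    Xb, yb = X, y                                 # post-balanced chunk S'_t
    if len(G) > 0 and need > 0:
        sims = knn_similarity(G, X, y, k, minority)   # sketch from Sect. 2.1.2
        take = G[np.argsort(sims)[::-1][:need]]       # highest similarity first
        Xb = np.vstack([X, take])
        yb = np.concatenate([y, np.full(len(take), minority)])
    H.append(DecisionTreeClassifier().fit(Xb, yb))    # hypothesis h_t built on S'_t
    errs = np.array([np.mean(h.predict(X) != y) for h in H])  # assumed Eq. (2): error on S_t
    w = 1.0 / np.maximum(errs, 1e-6)              # assumed Eq. (3): small error -> large weight
    G = np.vstack([G, cur_minority]) if len(G) > 0 else cur_minority
    return G, H, w                                # weighted vote of H then classifies T_t
```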

Fig. 1 The selective accommodation mechanism: circles denote current minority class examples; stars represent previous minority class examples. a Intuitive approach to decide similarities based on Euclidean distance. b Potential dilemma in applying this intuitive approach. c Proposed approach using the number of current minority class cases within the k-nearest neighbors of each previous minority example to decide similarities


Figure 2 shows the system-level framework of the proposed REA algorithm. The underlying principle of this framework is similar to Chen and He (2010). Briefly speaking, the data set G contains all minority training data prior to the current time. At time t = n, a certain amount ((f − c) × m) of minority examples in G are chosen based on the criterion that the number of minority class cases among their k-nearest neighbors within the training data chunk $S_n$ is as large as possible. These examples are then appended to $S_n$ such that the ratio of minority examples in the post-balanced training data chunk $S'_n$ is equal to f. A hypothesis $h_n$ is built upon $S'_n$ and then added into the hypothesis set $H_n$ to obtain $H_{n+1}$. Each of the hypotheses in $H_{n+1}$ is applied on $S_n$ to calculate the error rate $\{e_j\}$ using Eq. (2), which is then used to calculate the weight $\{w_j\}$ for each of them by Eq. (3). A large $e_j$ means poor performance of $h_j$ on $S_n$; $w_j$ would therefore be very small, which makes $h_j$'s impact on classifying unlabeled instances in $T_n$ negligible. A small $e_j$ generally means $h_j$ generalizes well on $S_n$ and should be given a larger weight. However, one should be cautious when $e_j$ becomes extremely small, e.g., approaching 0. This means $h_j$ has a great chance of overfitting $S_n$, which would result in poor generalization performance on $T_n$. When this situation does happen, one should refrain from adding $h_j$ into the ensemble classifier $h_{final}^{(t)}$ for classifying the testing data set $T_n$.

Finally, the hypotheses in $H_{n+1}$ are weighted by $\{w_j\}$ to obtain the final hypothesis $h_{final}^{(n)}$, which makes predictions on the current testing data set $T_n$.

3 The theoretical analysis of prediction accuracy

In this section, we present a brief discussion of the theoretical analysis of the REA framework. Since the proof can be done in a similar way to our previous work on MuSeRA (Chen and He 2010), here we only highlight several major steps of the analysis while directing interested readers to Chen and He (2010) for further details.

Assume the majority class examples can be decomposed into K subsets, each of approximately identical size to the minority class set (Gao et al. 2007); then each of these subsets can be combined with a replica of the minority example set, on which a hypothesis $h_i, i = 1, \ldots, K$, is developed. We further assume the probability output of hypothesis $h_i$ that the testing instance x belongs to class c is $f_i^c(x)$; then the corresponding probability output of the ensemble classifier is (we refer to this framework as "Uncorrelated Bagging", abbreviated "UB", in the rest of this paper):

$$f_E^c(x) = \frac{1}{K} \sum_{i=1}^{K} f_i^c(x) \qquad (5)$$

According to Tumer and Ghosh (1996), the probability output of a soft-typed classifier for an instance x can be expressed as:

$$f^c(x) = p(c|x) + g_c(x) \qquad (6)$$

where p(c|x) is the a posteriori probability that instance x belongs to class c, and $g_c(x)$ is the error associated with the output for class c.

Fig. 2 The system-level REA framework


Based on Eq. (6), given that we are targeting a binary classification problem, e.g., classes i and j, it was proved in Tumer and Ghosh (1996) that the expected error can be reasonably approximated by:

$$\mathrm{Error} = \frac{\sigma_{g_c}^2}{\left( p(c_j|x) - p(c_i|x) \right) / 2} \qquad (7)$$

where p(cj|x) and p(ci|x) are the a posteriori probabilities that instance x belongs to class j and class i under the true Bayes model, respectively; they are irrelevant to the training algorithm itself. Therefore, the expected error is proportional to the variance of $g_c(x)$ up to a constant, i.e., $\mathrm{Error} \propto \sigma_{g_c}^2$.

Given the independence of the hypotheses developed on each consolidated subset, the boundary variance of UB can be reduced by a factor of $K^2$ (Gao et al. 2007), i.e.,

$$\sigma_{b_E}^2 = \frac{1}{K^2} \sum_{i=1}^{K} \sigma_{b_i}^2 \qquad (8)$$

In our proposed REA framework, since the weights are determined to be inversely proportional to the errors of the single classifiers on the current training data chunk, they can approximately be described by:

$$w_i = \frac{C}{\sigma_{g_c^i}^2} \qquad (9)$$

where C is a constant for all $\{w_i\}$.

Based on Eq. (6), the error term $g_c^E(x)$ is part of REA's probability output, and can thus be represented as the weighted sum of the error terms of the single classifiers, i.e.,

$$g_c^E(x) = \frac{\sum_{i=1}^{N} w_i\, g_c^i(x)}{\sum_{i=1}^{N} w_i} \qquad (10)$$

If one makes the same assumption as in Gao et al. (2007) that each single classifier is independent of the others, the variance of $g_c^E(x)$ can be represented by:

$$\sigma_{g_c^E}^2 = \frac{\sum_{i=1}^{N} w_i^2\, \sigma_{g_c^i}^2}{\left( \sum_{i=1}^{N} w_i \right)^2} \qquad (11)$$

Taking Eq. (9) into consideration, Eq. (11) can be simplified into:

$$\sigma_{g_c^E}^2 = \frac{1}{\sum_{i=1}^{N} 1/\sigma_{g_c^i}^2} \qquad (12)$$

With the estimations in Eqs. (8) and (12), one can follow an analysis similar to Chen and He (2010) to prove:

$$\sigma_{g_c^E}^2 \le \sigma_{b_E}^2 \qquad (13)$$

According to the previous discussion that $\mathrm{Error} \propto \sigma_{g_c}^2$, we can conclude that the REA framework can provide less erroneous prediction results than UB (Gao et al. 2007).
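A quick numerical check of this conclusion: for any positive variances, the inverse-variance combination of Eq. (12) never exceeds the uniform average of Eq. (8), by the arithmetic-harmonic mean inequality. The variances below are randomly drawn for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
for _ in range(5):
    var = rng.uniform(0.1, 2.0, size=8)    # variances of K = 8 single classifiers
    ub = var.sum() / var.size ** 2         # Eq. (8): uniform averaging (UB)
    rea = 1.0 / np.sum(1.0 / var)          # Eq. (12): inverse-error weighting (REA)
    assert rea <= ub + 1e-12               # Eq. (13) holds
    print(f"REA variance {rea:.4f} <= UB variance {ub:.4f}")
```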

4 Simulation and discussion

Our previous work (Chen and He 2009) is based on one single hypothesis built upon the current amplified training data chunk. In this section we will show that by combining all hypotheses built upon the amplified training data chunks over time in a properly weighted manner, the performance of REA in predicting the class labels of the testing data sets can be considerably improved. Furthermore, we also compare our proposed approach with a dedicated static imbalanced learning approach, SMOTE (Chawla et al. 2002), to demonstrate that our proposed approach can effectively handle dynamic imbalanced data streams.

In our current study, we adopted the classification and regression tree (CART) as the base classifier. The strategy for making CART output the likelihood that an input instance belongs to each class is twofold: (1) the leaf node that the instance under testing falls into is located; (2) inside that leaf, the proportions of training examples belonging to each class are calculated as the likelihoods of the instance under testing for each class. As a toy example, consider a leaf node with 3 majority class examples and 2 minority class examples falling into it during training. Then, whenever an unlabeled instance reaches this leaf node during testing, it is assigned probabilities of 3/5 and 2/5 of belonging to the majority class and minority class, respectively.

The whole tree generated by CART should be pruned thereafter, because otherwise CART would always have perfect classification performance on the training data set, which is undesirable due to the potential overfitting risk. We choose to apply the cost-complexity pruning process to prune the tree created by CART. Briefly speaking, the pruning process generates a series of trees $T_0 \supset T_1 \supset \cdots \supset T_m$, where $T_0$ is the whole tree and $T_m$ is the root node (decision stump). $T_i$ is created by replacing a subtree satisfying a certain condition in $T_{i-1}$ with a leaf node. Then the tree $T_j$ with maximum accuracy on the training data set is chosen as the pruned tree. A detailed description of the cost-complexity pruning process can be found in Breiman et al. (1984).
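Both behaviors are available in, e.g., scikit-learn's CART implementation: predict_proba returns exactly the leaf class proportions described above, and cost-complexity pruning is exposed through the ccp_alpha parameter. The data below are synthetic, and picking the middle alpha of the pruning path is only an illustrative choice, not the selection rule of this paper.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, (1000, 3))
y = (X[:, 0] + X[:, 1] > 8).astype(int)          # SEA-like toy labels

# predict_proba returns leaf class proportions, e.g., a leaf holding
# 3 majority and 2 minority training examples yields [0.6, 0.4].
cart = DecisionTreeClassifier().fit(X, y)
print(cart.predict_proba(X[:1]))

# Cost-complexity pruning: each alpha on the path corresponds to one of
# the nested subtrees T_0 > T_1 > ... > T_m.
path = cart.cost_complexity_pruning_path(X, y)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
pruned = DecisionTreeClassifier(ccp_alpha=alpha).fit(X, y)
print(pruned.get_n_leaves(), "leaves after pruning")
```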

The reason for choosing CART as the base learner is that it provides the desired trade-off between speed and performance. Base learners such as logistic regression or decision stumps are not strong enough to efficiently learn knowledge from data chunks with unnatural class distributions. Other base learners, such as multi-layer perceptron (MLP) neural networks and support vector machines (SVMs), are obviously strong enough to effectively learn from streamed data chunks. The problem, however, is that they generally require much more time for the training process, which makes them poor choices for designing an on-line learning system. Besides, they usually tend to output learning models of high variance, which could result in low diversity among the hypotheses in the ensemble pool, which is what an ensemble classifier should be designed to avoid.

The configurations of the comparative algorithms and their notation are summarized as follows.

• The REA approach uses k-nearest neighbors, with k decided through cross-validation, to weigh the similarity between the previous minority class examples and the current minority class set. The post-balance ratio f is set to 0.5.
• The SERA approach uses the hypothesis built on the amplified current training data chunk to evaluate the current testing instance set. The post-balance ratio f is set to 0.5.
• The approach proposed in Gao et al. (2007), which is denoted as "UB" in this section.
• The SMOTE approach (Chawla et al. 2002) employs the synthetic minority over-sampling technique to balance the class distribution of the current training data chunk, upon which a hypothesis is built to predict on the current testing data set. The number of synthetic minority class instances plus current minority class examples should be half the number of majority class examples in the current amplified training data chunk. In other words, the post-balance ratio for SMOTE is also 0.5, if that concept can apply.
• Learning directly upon the training data chunk, which is denoted as "Normal" in the simulation results.

One may wonder how to decide the k parameter of the k-nearest neighbors similarity estimation mechanism of REA. In the context of online learning, grid search for an optimized parameter with cross validation may not be applicable. Conceptually, k should not be set larger than the number of minority class examples in the current training data chunk, because otherwise the search range of "nearest neighbors" would go way beyond the local area of the previous minority class example under consideration, making the distinction among different previous minority examples less obvious. In this study, we uniformly set k of REA to 10 for all benchmarks, which is consistently less than or equal to the number of minority class examples in the training data chunks.

4.1 Evaluation metrics

Following the routine of imbalanced learning study, the minority class data and the majority class data belong to the positive and negative classes, respectively. Let {p, n} denote the positive and negative true class labels and {Y, N} denote the predicted positive and negative class labels; the confusion matrix for the binary classification problem is then defined in Fig. 3.

By manipulating the confusion matrix, the overall prediction accuracy (OA) can be defined as:

$$OA = \frac{TP + TN}{TP + TN + FP + FN} \qquad (14)$$

OA is usually adopted in the traditional learning scenario, i.e., a static data set with balanced class distribution, to evaluate the performance of algorithms. However, when the context changes to imbalanced learning, it is wise to apply other metrics for such evaluation (He and Garcia 2009), among which the Receiver Operating Characteristics (ROC) curve and the area under the ROC curve are most strongly recommended (Fawcett 2003).

Based on the confusion matrix as defined in Fig. 3, one can calculate the TP_rate and FP_rate as follows:

$$TP\_rate = \frac{TP}{P_R} = \frac{TP}{TP + FN} \qquad (15)$$

$$FP\_rate = \frac{FP}{N_R} = \frac{FP}{FP + TN} \qquad (16)$$

The ROC space is established by plotting TP_rate over FP_rate. Generally speaking, hard-type classifiers (those that only output discrete class labels) correspond to points $(FP\_rate, TP\_rate)$ in ROC space. On the other hand, soft-type classifiers (those that output a likelihood that an instance belongs to either class label) correspond to curves in ROC space. Such curves are formulated by adjusting the decision threshold to generate a series of points in ROC space. For example, suppose an unlabeled instance $x_k$'s likelihoods of belonging to the minority class and the majority class are 0.3 and 0.7, respectively. The natural decision threshold d = 0.5 would classify $x_k$ as a majority class example, since 0.3 < d. However, d could also be set to other values, e.g., d = 0.2; in this case, $x_k$ would be classified as a minority class example, since 0.3 > d. By tuning d from 0 to 1 with a small step $\theta$, e.g., $\theta = 0.01$, a series of pairwise points (FP_rate, TP_rate) can be created in ROC space. To assess different classifiers' performance in this case, one generally uses the area under the ROC curve (AUROC) as an evaluation criterion; it is defined as the area between the ROC curve and the horizontal axis (the axis representing FP_rate).
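A direct sketch of this threshold sweep, assuming minority-class likelihoods are already in hand; AUROC is then the trapezoidal area under the collected points. The helper names are illustrative.

```python
import numpy as np

def roc_curve_points(scores, labels, step=0.01):
    """Sweep the decision threshold d over [0, 1] and collect (FP_rate,
    TP_rate) pairs per Eqs. (15)-(16); scores are minority likelihoods."""
    P, N = np.sum(labels == 1), np.sum(labels == 0)
    pts = []
    for d in np.arange(0.0, 1.0 + step, step):
        pred = scores > d                   # minority iff likelihood exceeds d
        tp = np.sum(pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        pts.append((fp / N, tp / P))
    return sorted(pts)

def auroc(pts):
    fp, tp = zip(*pts)
    return float(np.trapz(tp, fp))          # area over the FP_rate axis
```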

Fig. 3 Confusion matrix for binary classification


In order to reflect the ROC curve characteristics over all random runs, we adopt the vertical averaging approach of Fawcett (2003) to plot the averaged ROC curves. Our implementation of the vertical averaging method is illustrated in Fig. 4. Assume one would like to average two ROC curves, $l_1$ and $l_2$, both formed by a series of points in ROC space. The first step is to evenly divide the range of FP_rate into a set of intervals. Then, at each interval, find the corresponding TP_rate values of each ROC curve and average them. In Fig. 4, $X_1$ and $Y_1$ are the points from $l_1$ and $l_2$ corresponding to the interval FP_rate1. By averaging their TP_rate values, the corresponding point $Z_1$ on the averaged ROC curve is obtained. However, some ROC curves may not have corresponding points on certain intervals. In this case, one can use linear interpolation to obtain the averaged ROC points. For instance, in Fig. 4, the point $\bar{X}$ (corresponding to FP_rate2) is calculated based on the linear interpolation of the two neighboring points $X_2$ and $X_3$. Once $\bar{X}$ is obtained, it can be averaged with $Y_2$ to get the corresponding point $Z_2$ on the averaged ROC curve.
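A minimal sketch of this procedure; np.interp performs the linear interpolation used to obtain points like $\bar{X}$, and the function name is illustrative.

```python
import numpy as np

def vertical_average(curves, n_intervals=100):
    """Average ROC curves vertically: at each fixed FP_rate interval, take
    each curve's (interpolated) TP_rate and average across curves."""
    grid = np.linspace(0.0, 1.0, n_intervals + 1)   # evenly divided FP_rate range
    tprs = []
    for pts in curves:                              # each curve: (fp, tp) point list
        fp, tp = map(np.asarray, zip(*sorted(pts)))
        tprs.append(np.interp(grid, fp, tp))        # fills gaps, as with X-bar above
    return list(zip(grid, np.mean(tprs, axis=0)))   # points Z_i of the averaged curve
```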

4.2 SEA data set

4.2.1 Data preparation

The SEA data set (Street and Kim 2001) is a popular artificial benchmark for assessing the performance of stream data mining algorithms. It has three features randomized in [0, 10], where whether the sum of the first two features surpasses a defined threshold determines the class label. The third feature is irrelevant and can be considered noise to test the robustness of the algorithm under simulation. The concept drifts are designed to adjust the threshold periodically such that the algorithm under simulation is confronted with an abrupt change in class concepts after living with a stable concept for several data chunks.

Following the original design of the SEA data set, we categorize the whole data stream into 4 blocks. Inside each block, the threshold value is fixed, i.e., the class concepts are unchanged. However, whenever a new block begins, the threshold value is changed and retained until the end of that block. The threshold values of the 4 blocks are set to 8, 9, 7, and 9.5, respectively, which again adopts the configuration of Street and Kim (2001). Each block consists of 10 data chunks, each of which has 1,000 examples as the training data set and 200 instances as the testing data set. Examples with the sum of the two features greater than the threshold belong to the majority class, and the others reside in the minority class. The number of generated minority class data is restricted to 1/20 of the total number of data in the corresponding data chunk. In other words, the imbalanced ratio is set to 0.05 in our simulation. In order to introduce some uncertainty/noise into the data set, 1% of the examples inside each training data set are randomly sampled to reverse their class labels. In this way, approximately 1/6 of the minority examples are erroneously labeled, which raises a challenge in handling noise for all comparative algorithms learning from this data set.
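A sketch of generating one SEA training chunk under this configuration; the rejection-sampling details below are assumptions for illustration, not the authors' exact generator.

```python
import numpy as np

def sea_chunk(theta, n=1000, imb=0.05, noise=0.01, rng=None):
    """One SEA chunk: 3 features in [0, 10], the third irrelevant;
    x1 + x2 > theta -> majority (label 0), otherwise minority (label 1);
    minority held at ratio imb, then ~1% of labels flipped."""
    rng = rng or np.random.default_rng()
    n_min = int(n * imb)
    maj, mino = [], []
    while len(maj) < n - n_min or len(mino) < n_min:  # sample until both classes filled
        x = rng.uniform(0, 10, 3)
        (maj if x[0] + x[1] > theta else mino).append(x)
    X = np.array(maj[: n - n_min] + mino[:n_min])
    y = np.array([0] * (n - n_min) + [1] * n_min)
    flip = rng.random(n) < noise                      # class-label noise
    y[flip] = 1 - y[flip]
    return X, y

# thresholds 8, 9, 7, 9.5 over 4 blocks of 10 chunks reproduce the concept drifts
```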

Fig. 4 Vertical averaging approach for multiple ROC curves


4.2.2 Results and discussion

The simulation results for the SEA data set are averaged over 10 random runs. During each random run, the data set is generated all over again using the scheme described in Sect. 4.2.1. To view the performance of the comparative algorithms over the whole learning life, we installed "observation points" at chunks 5, 10, 15, 20, 25, 30, 35, and 40. The presentation and discussion of the simulation results on the SEA data set cover only the whole set or a subset of the observation points.

The tendency lines of the averaged prediction accuracy over the observation points are plotted in Fig. 5a. One can conclude from this figure that: (1) REA provides higher prediction accuracy on testing data over time than UB, which is consistent with the theoretical conclusion made in Sect. 3; (2) REA does not perform superiorly to the other comparative algorithms in terms of overall prediction accuracy over time. In fact, it is learning without adding any ingredients ("Normal") that provides the most competitive results in terms of overall prediction accuracy on testing data most of the time. However, as discussed previously, overall prediction accuracy is not the first thing that should be cared about in the imbalanced learning scenario. It is metrics like ROC/AUROC that determine how well an algorithm performs on imbalanced data sets.

The AUROC values of the comparative algorithms at the observation points are given in Fig. 5b, complemented by the corresponding ROC curves on data chunks 10 (Fig. 6a), 20 (Fig. 6b), 30 (Fig. 6c), and 40 (Fig. 6d), as well as the corresponding numeric AUROC values on these data chunks shown in Table 1. One can easily see that, in terms of AUROC, REA shows very competitive performance against the other comparative algorithms for learning from the SEA data set with imbalanced class distribution.

4.3 Real-world data set

4.3.1 Data preparation

The electricity market data set (ELEC data set) (Harries 1999) is used in this study to validate the effectiveness of the proposed algorithm in real-world applications. The data were collected from the Australian New South Wales Electricity Market to reflect the electricity price fluctuation (up/down) affected by demand and supply of the market. Since how the market influences the electricity price evolves unpredictably in the real world, the concrete representation of the concept drifts embedded inside the data set is inaccessible, which enables us to gain another insight into the proposed algorithm as compared to artificial benchmarks with predefined designs of concept drifts.

The original data set contains 45,312 examples dated from May 1996 to December 1998. We only retain the examples after May 11, 1997 for our simulation, since several features are missing from the examples before that date and we do not intend to investigate learning from incomplete feature sets in this study. Each example consists of 8 features. Features 1-2 represent the date and the day of the week (1-7) of the example's collection, respectively. Each example is sampled within a timeframe of 30 min, i.e., a period; thus there are altogether 48 examples collected for each day, corresponding to 48 periods a day. Feature 3 stands for the period (1-48) in which the example was collected, and is thus a purely periodic number. Features 1-3 are excluded from the feature set since they only carry the timestamp information of the data. According to the data sheet instructions, feature 4 should also be excluded from the learning process. Therefore, the remaining features are the NSW electricity demand, the VIC price, the VIC electricity demand, and the scheduled transfer between states, respectively. In summary, 27,549 examples with the last 4 features are extracted from the ELEC data set for simulation.

Fig. 5 OA and AUROC for SEA data set: a overall prediction accuracy over data chunks, b area under ROC curve over data chunks

With the order of the examples undisturbed, the extracted data set is evenly sliced into 40 data chunks. Inside each data chunk, examples representing the electricity price going down are taken as the majority class data, while the remaining examples, representing the electricity price going up, are randomly under-sampled as the minority class data. The imbalanced ratio is set to 0.05, which means only 5% of the examples inside each data chunk belong to the minority class. To conclude the preparation of this data set, 80% of the majority class data and of the minority class data inside each data chunk are randomly sampled and merged as the training data, and the remainder is used to assess the performance of the corresponding trained hypotheses.

4.3.2 Results and discussion

The results of the simulation are based upon 10 random runs, where the randomness comes from the random under-sampling of the minority class data. As for the SEA data set, observation points are set up at data chunks 5, 10, 15, 20, 25, 30, 35, and 40, and we present and discuss the simulation results only at these points.

Figure 7a plots the averaged overall prediction accuracy of the comparative algorithms. The conclusions that can be made are similar to those in Sect. 4.2.2. In brief, in terms of overall prediction accuracy, REA is consistently better than UB but inferior to some other comparative algorithms over time. However, since the data set under study has an imbalanced class distribution, it is ROC/AUROC rather than overall prediction accuracy that really decides the performance of the algorithms.

Figure 7b shows the averaged AUROC of the comparative algorithms. As complements, Fig. 8 shows the ROC curves averaged over 10 random runs of the comparative algorithms on data chunks 10 (Fig. 8a), 20 (Fig. 8b), 30 (Fig. 8c), and 40 (Fig. 8d). Table 2 gives the numerical AUROC values of all comparative algorithms on selected data chunks. One can see that, as time goes by, REA delivers very competitive AUROC results against the other comparative algorithms, which leads to the conclusion that REA can provide very satisfying performance when learning from the real-world ELEC data set with imbalanced class distribution in temporally streamed format.

Fig. 6 ROC curves for selected data chunks of SEA data set: a chunk 10, b chunk 20, c chunk 30, d chunk 40

Table 1 AUROC values for selected data chunks of SEA data set

Data chunk   Normal   SMOTE    UB       SERA     REA
10           0.9600   0.9749   0.9681   0.9637   1.0000
20           0.9349   0.9397   0.9276   0.9373   0.9966
30           0.9702   0.9602   0.9565   0.9415   0.9964
40           0.9154   0.9770   0.9051   0.9497   1.0000


4.4 Spinning hyperplane data set

4.4.1 Data preparation

Proposed in Wang et al. (2003), the spinning hyperplane (SHP) data set defines the class boundary as a hyperplane in n dimensions with coefficients $a_1, a_2, \ldots, a_n$. An example $x = (x_1, x_2, \ldots, x_n)$ is created by randomizing each feature in the range [0, 1], i.e., $x_i \in [0, 1], i = 1, \ldots, n$. A constant bias is defined as:

$$a_0 = \frac{1}{2} \sum_{i=1}^{n} a_i \qquad (17)$$

Then the class label y of the example x is determined by:

$$y = \begin{cases} 1, & \sum_{i=1}^{n} a_i x_i \ge a_0 \\ 0, & \sum_{i=1}^{n} a_i x_i < a_0 \end{cases} \qquad (18)$$

Fig. 7 OA and AUROC for ELEC data set

Fig. 8 ROC curves for selected data chunks of ELEC data set: a chunk 10, b chunk 20, c chunk 30, d chunk 40

In contrast to the abrupt concept drifts in the SEA data set, the SHP data set embraces a gradual concept drift scheme in which the class concepts undergo a "shift" whenever a new example is created. Specifically, part of the coefficients $a_1, \ldots, a_n$ are randomly sampled to have a small increment $\Delta$ added whenever a new example has been created, which is defined as:

$$\Delta = \frac{s \cdot t}{N} \qquad (19)$$

where t is the magnitude of change for every N examples, and s alternates in [−1, 1], specifying the direction of change, with a 20% chance of being reversed for every N examples. $a_0$ is also modified thereafter using Eq. (17). In such a way, the class boundary acts like a spinning hyperplane in the process of creating data. A data set with gradual concept drifts requires the learning algorithm to constantly and adaptively tune its inner parameters in order to catch up with the continuous change of class concepts.

Following the procedure in Wang et al. (2003), the dimensionality of the SHP feature set is set to 10, and the magnitude of change t is set to 0.1. The number of chunks is set to 100 instead of 40 as for the previous two data sets, since we would like to investigate REA over a longer series of streaming data chunks. Each data chunk has 1,000 examples as the training data set and 200 instances as the testing data set, i.e., N = 1,200. In addition to the normal setup with an imbalanced ratio of 0.05 and a noise level of 0.01, we also generate the data set with an imbalanced ratio of 0.01 and a noise level of 0.01, and another one with an imbalanced ratio of 0.05 and a noise level of 0.03. In this way, we can evaluate the robustness of REA in handling more imbalanced and noisier data sets. In the rest of this section, we refer to the three different setups of imbalanced ratio and noise level as "setup 1" (imbalanced ratio = 0.05, noise level = 0.01), "setup 2" (imbalanced ratio = 0.01, noise level = 0.01), and "setup 3" (imbalanced ratio = 0.05, noise level = 0.03), respectively.
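Under the configuration above (n = 10, t = 0.1, N = 1,200), the SHP generator can be sketched as follows. The fraction of coefficients drifting per example is an assumption, since the design only specifies that "part of the coefficients" are sampled.

```python
import numpy as np

def shp_stream(n_chunks=100, n_dim=10, N=1200, t=0.1, drift_frac=0.5, rng=None):
    """Yield spinning-hyperplane chunks per Eqs. (17)-(19); drift_frac is
    an assumed fraction of coefficients drifting per created example."""
    rng = rng or np.random.default_rng()
    a = rng.random(n_dim)                            # coefficients a_1 .. a_n
    s = 1.0                                          # direction of change
    for _ in range(n_chunks):
        X = rng.random((N, n_dim))                   # each x_i in [0, 1]
        y = np.empty(N, dtype=int)
        for i in range(N):
            y[i] = int(X[i] @ a >= 0.5 * a.sum())    # Eqs. (17)-(18); a_0 tracks a
            part = rng.random(n_dim) < drift_frac    # part of the coefficients drift
            a[part] += s * t / N                     # Eq. (19): Delta = s * t / N
        if rng.random() < 0.2:                       # s reversed with 20% chance per N
            s = -s
        yield X, y
```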

4.4.2 Results and discussion

As for the simulations on the previous two data sets, the results of all comparative algorithms on the SHP data set are based on the average of 10 random runs. The observation points are placed at data chunks 10, 20, 30, 40, 50, 60, 70, 80, 90, and 100.

Figures 9, 10, and 11 plot the tendency lines of overall prediction accuracy and AUROC for the comparative algorithms across the observation points under setup 1, setup 2, and setup 3, respectively.

One can see from these figures that, in terms of either overall prediction accuracy or area under the ROC curve, REA consistently performs competitively against the other comparative algorithms on the SHP data set under different configurations.

4.4.3 Study of hypotheses removal

In the scenario of long-term learning from a data stream, retaining all hypotheses in memory over time may not be a decent strategy. Besides the concern of memory occupation, hypotheses built in the distant past may hinder the classification performance on the current testing data set, and therefore should somehow be pruned/removed from the hypothesis set H. We would like to explore this issue in an empirical manner.

Imagine H is a FIFO queue. The original design of REA effectively sets the capacity of H to infinity, since from the time of its creation, any hypothesis stays in memory until the end of the data stream. Now let us assume H has a smaller capacity. Should the number of stored hypotheses exceed the capacity of H, the oldest hypothesis is automatically removed from H. In this way, it is guaranteed that H always maintains the "freshest" subset of the generated hypotheses in memory.
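This fixed-capacity variant of H maps directly onto a bounded FIFO queue; a minimal sketch with placeholder hypotheses:

```python
from collections import deque

# |H| = 40: appending a 41st hypothesis silently evicts the oldest,
# so H always holds the "freshest" subset of generated hypotheses.
H = deque(maxlen=40)
for t in range(100):          # e.g., 100 data chunks, one hypothesis each
    H.append(f"h_{t}")        # stand-in for the hypothesis built at time t
print(len(H), H[0], H[-1])    # 40 h_60 h_99
```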

Figure 12 shows the AUROC performance of REA when learning from SHP data sets with 100 data chunks when the size of H, i.e., |H|, is equal to ∞, 60, 40, and 20, respectively. One can see that REA initially improves its performance in terms of AUROC as |H| shrinks. However, when |H| = 20, the performance of REA deteriorates and is worse than in the case |H| = ∞. Based on these observations, one can conclude that there exists a trade-off between |H| and REA's performance. A heuristic would be setting |H| to approximately half the total number of data chunks received over time, which is impractical since the number of data chunks is usually unknown in real-world applications. Another method is to assign each hypothesis a factor decaying from the time it is created; when the factor decreases through a threshold g, the corresponding hypothesis is removed from H, which is pretty much like a queue with dynamic capacity. The challenge raised by this method is how to determine g, which could hardly be decided by cross validation in an online learning scenario.

4.5 Time and space complexity

Table 2 AUROC values for selected data chunks of ELEC data set

Data chunk   Normal   SMOTE    UB       SERA     REA
10           0.6763   0.6608   0.7273   0.7428   0.8152
20           0.5605   0.6715   0.6954   0.7308   0.6429
30           0.6814   0.7007   0.5654   0.6339   0.8789
40           0.7102   0.6284   0.6297   0.7516   0.9462

Time and space complexity are of critical importance for designing a real-time online learning algorithm. We expect the algorithm to learn from the data stream as quickly as possible such that it can keep pace with the data stream, which could be of very high speed. It is also desirable that the algorithm not occupy significantly large memory, due to the concern of scalability. Viewed from a slightly high level, the time and space complexity of REA should be related to: (1) the difference between the post-balance ratio f and the imbalanced ratio c, i.e., f − c; (2) the k parameter of k-nearest neighbors; and (3) the capacity of the hypothesis set H, i.e., |H|.

To get quantified insights into the time and space complexity of REA as well as the other comparative algorithms, we record their time and space consumption for learning from the SEA (40 chunks), ELEC (40 chunks), and SHP (100 chunks) data sets in Tables 3, 4, and 5, respectively. The hardware configuration for the simulation is an Intel Core i5 processor with 4 GB RAM.

One can conclude from these results that: (1) REA does not take significantly more time than the other comparative algorithms to train on the data stream, which makes it a qualified candidate for real-time online learning systems. (2) REA does take considerably more time for testing, most likely because it has to combine the outputs of multiple hypotheses to classify each testing instance. (3) REA does not occupy significantly more memory than the other comparative algorithms, which makes it possible to scale REA up to handle databases of large or even huge size.

Fig. 9 OA and AUROC for the SHP data set under setup 1 (a overall prediction accuracy; b area under the ROC curve)

Fig. 10 OA and AUROC for the SHP data set under setup 2 (a overall prediction accuracy; b area under the ROC curve)

Fig. 11 OA and AUROC for the SHP data set under setup 3 (a overall prediction accuracy; b area under the ROC curve)
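The testing-time overhead noted in observation (2) can be made concrete with a small sketch of the weighted-combination step, assuming scikit-learn-style classifiers and caller-supplied weights (the weighting rule shown is a placeholder for REA's dynamic weights, not its exact formula):

import numpy as np

def ensemble_predict_proba(hypotheses, weights, X):
    # Every stored hypothesis scores every test instance, so testing
    # cost grows roughly linearly with |H|, matching Tables 3-5.
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    probs = [h.predict_proba(X) for h in hypotheses]
    return np.average(probs, axis=0, weights=w)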

5 Conclusion

In this paper, we propose REA as a framework for learning from nonstationary imbalanced data streams. The key idea of this approach is to estimate the similarity between previous minority class examples and the current minority class set based on the k-nearest neighbor rule, and then selectively accumulate a certain number of previous minority class examples into the current data chunk to compensate for its skewed class distribution. After that, a base classifier is built on this amplified data set to develop the decision boundary, which contributes to the final decision-making process through weighted combination.
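As a concrete illustration of this selective accumulation step, the sketch below scores each previous minority example by the fraction of its k nearest neighbors, within the pooled minority data, that belong to the current minority set; the function name and this particular scoring rule are our own assumptions, not REA's exact procedure:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def select_previous_minority(X_min_cur, X_min_prev, n_needed, k=5):
    pool = np.vstack([X_min_cur, X_min_prev])
    nn = NearestNeighbors(n_neighbors=k + 1).fit(pool)
    # The first neighbor of each queried point is the point itself,
    # so it is dropped before scoring.
    _, idx = nn.kneighbors(X_min_prev)
    scores = (idx[:, 1:] < len(X_min_cur)).mean(axis=1)
    top = np.argsort(scores)[::-1][:n_needed]
    return X_min_prev[top]

The selected examples would then be appended to the current chunk before its base classifier is trained.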

We present a brief theoretical analysis and a detailed empirical study on both synthetic benchmarks and real-world data sets to show the effectiveness of this approach. There are several interesting issues to work on for REA in the future. Currently, the parameter k of the k-nearest neighbor rule in REA is decided through a heuristic; how to adaptively choose the most suitable k for data streams with different characteristics would be of critical importance for strengthening REA's robustness. The other important issue is the design of a hypothesis pruning mechanism for REA. We have already shown in our simulations that the performance of REA can be enhanced by discarding part of the old hypotheses; however, how to adaptively decide the appropriate proportion of old and new hypotheses in memory still demands a better strategy. Motivated by our preliminary results in this work, we believe this framework may provide important insights for incremental learning from nonstationary imbalanced data streams, as well as useful techniques for a wide range of real-world applications.





