Neurocomputing 277 (2018) 149–160
Contents lists available at ScienceDirect: Neurocomputing
Journal homepage: www.elsevier.com/locate/neucom

Two birds with one stone: Classifying positive and unlabeled examples on uncertain data streams

Donghong Han a,c, Shuoru Li a, Fulin Wei a,∗, Yuying Tang a, Feida Zhu b, Guoren Wang a,c

a College of Computer Science and Engineering, Northeastern University, Shenyang, China
b School of Information Systems, Singapore Management University, Singapore
c Key Laboratory of Medical Image Computing (NEU), Ministry of Education, Shenyang, China

∗ Corresponding author. E-mail addresses: [email protected] (D. Han), [email protected] (F. Wei).

Article history: Received 25 September 2016; Revised 28 February 2017; Accepted 5 March 2017; Available online 24 August 2017.

Keywords: Uncertain data streams; PU learning; Concept drift; Ensemble classifier; Cluster; ELM

Abstract

An important characteristic of the data streams in many of today's big data applications is their intrinsic uncertainty, which can occur in both item occurrence and attribute values. While this already poses great challenges for fundamental data mining tasks such as classification, things are made even more complicated by the fact that completely labeled examples are usually unavailable in such settings, leaving researchers the only option of learning classifiers from partially labeled examples on uncertain data streams. Furthermore, evolving data streams exhibit concept drift. To address these challenges, this paper focuses on learning from positive and unlabeled examples (PU learning) on uncertain data streams. To the best of our knowledge, this paper is the first work to address the uncertainty issue in both item occurrence (occurrence level) and attribute value (attribute level) for the problem of PU learning over streaming data. First, we propose an algorithm to classify positive and unlabeled examples with both kinds of uncertainty (PUU). The algorithm extracts reliable positive and negative examples with a clustering-based method, and then trains the classifier with Weighted Extreme Learning Machine (Weighted ELM). Second, we propose an algorithm for PU learning over uncertain data streams (PUUS). It adopts an ensemble model and trains base classifiers with PUU. To detect concept drift, PUUS uses cluster set similarities between the current data block and historical data blocks, and applies different update strategies for different types of concept drift to adapt to evolving uncertain data streams. Experimental results show that PUUS can effectively classify uncertain data streams with just positive and unlabeled data, while at the same time achieving good performance in detecting and handling concept drift.

© 2017 Elsevier B.V. All rights reserved.
http://dx.doi.org/10.1016/j.neucom.2017.03.094
0925-2312/© 2017 Elsevier B.V. All rights reserved.

1. Introduction

Recent years have witnessed the power of big data in driving and fueling intelligence and innovation in a wide range of industries to an unprecedented level [1]. In particular, as a typical example of big data, data streams are widely used in many fields, such as Web traffic statistics, financial analysis, and Web application services. Due to transmission error, measurement inaccuracy, sensor malfunction and so on, uncertainty is an intrinsic nature of the data streams in many applications, including WSN (Wireless Sensor Networks) [2] and RFID (Radio Frequency Identification) [3,4]. Note that in the context of uncertain data, uncertainty can be observed for both data item occurrence and data attribute values (we call the former occurrence level and the latter attribute level), compounding the difficulty of streaming data classification with an extra degree. Furthermore, different from traditional static data sets, the distribution of streaming data is subject to continuous change, so it is essential to detect concept drift and update the learning model.

In reality, users often focus on only one target category of interest on uncertain data streams and are in general not concerned with others. For example, in fraud detection or intrusion detection, only the confirmed fraud and the successful intrusion data are to be identified and handled. In these cases, we only need to label a few examples of the target category and learn a model to predict class labels for the mass of data. Tasks like this – classifying uncertain data streams with positive and unlabeled examples (but no negative examples) – are called PU learning for uncertain streaming data.

However, while uncertainty at the occurrence level and the attribute level is often observed simultaneously, to this day there has been little research considering the two in a unified model for




Table 1
Positive and unlabeled examples with uncertainty.

ID | Temperature       | Soil fertile ability       | Probability | Category | Time
1  | f1,2(x), [17,18]  | (fertile, poor)(0.9, 0.1)  | 0.9         | 1        | t1
2  | f2,2(x), [16,19]  | (fertile, poor)(0.8, 0.2)  | 0.5         | ?        | t2
3  | f3,2(x), [8,17]   | (fertile, poor)(0.6, 0.4)  | 0.6         | 1        | t3
4  | f4,2(x), [16,18]  | (fertile, poor)(0.1, 0.9)  | 0.2         | 1        | t4
5  | f5,2(x), [9,10]   | (fertile, poor)(0.3, 0.7)  | 0.6         | ?        | t5


PU learning. Table 1 shows some samples of positive and unlabeled examples with both kinds of uncertainty. Columns 2 and 3 represent the temperature attribute and the soil fertile ability of each sample. The former is an uncertain continuous attribute, with f_{i,j}(x) representing the probability density function (pdf) of the i-th sample in the j-th column; the latter is an uncertain discrete attribute. There is attribute level uncertainty in both columns. In the first sample, the temperature ranges from 17 to 18 °C and obeys the probability density f_{1,2}(x), and the soil fertile ability takes the value fertile or poor with probability 0.9 or 0.1, respectively. Column 4 gives the occurrence probability of the sample, that is, its occurrence level uncertainty. Column 5 is the class label: "1" means the sample belongs to the labeled positive class, and "?" means the sample is unlabeled. Column 6 is the sample arrival time, which can be used to determine the data block the sample belongs to.

Based on the above challenges, this paper focuses on classifying uncertain data streams with models learned from positive and unlabeled examples. To our knowledge, this paper is the first to study PU learning on uncertain data streams while handling both occurrence level and attribute level uncertainty at the same time. We summarize our main contributions as follows:

• Based on Weighted ELM [39], we propose a classification algorithm, PUU, which learns from positive and unlabeled examples with both kinds of uncertainty. To reduce dimensionality, the attribute mean and average discrete distance are used to approximately represent the attribute level uncertainty, while the occurrence probability represents the occurrence level uncertainty. A clustering-based method is adopted to extract reliable positive and negative examples.
• We then extend PUU into the PUUS algorithm to classify uncertain data streams with positive and unlabeled examples. In an ensemble model, we calculate the cluster similarities between the current block and historical blocks to detect concept drift. When the cluster similarity exceeds a threshold, set in accordance with the type of concept drift, the corresponding classifier update strategy is adopted.
• Experimental results show that the proposed algorithms can deal with the PU learning problem with both kinds of uncertainty considered simultaneously, and can effectively detect concept drift on uncertain data streams.

The remainder of the paper is organized as follows. In Section 2, we introduce related work and ELM theory. In Section 3, our proposed algorithms are presented in detail. The experimental results and the performance evaluation are discussed in Section 4. Finally, Section 5 concludes the paper.

2. Preliminaries

Broadly speaking, most research works use the possible world model as the uncertain data model [5]. Under the independence assumption, occurrence level uncertainty refers to the occurrence probability of a tuple, and as the number of uncertain tuples increases, the number of possible world instances grows exponentially. Attribute level uncertainty describes the uncertainty information of each dimension of a tuple. In this paper, we focus on the data model with both kinds of uncertainty.

First, we review previous work on classification algorithms over both precise and uncertain data streams, and on PU learning. We then present a brief overview of ELM.
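To make the exponential blow-up concrete, here is a small sketch (ours, not from the paper) that enumerates the possible worlds of independent uncertain tuples from their occurrence probabilities:

```python
from itertools import product

def possible_worlds(probs):
    """Enumerate all possible worlds for independent uncertain tuples.

    probs: occurrence probability of each tuple (occurrence level
    uncertainty). Each world pairs a tuple of 0/1 flags (absent/present)
    with its probability; there are 2**n worlds for n tuples.
    """
    worlds = []
    for flags in product([0, 1], repeat=len(probs)):
        p = 1.0
        for flag, prob in zip(flags, probs):
            p *= prob if flag else (1.0 - prob)
        worlds.append((flags, p))
    return worlds

worlds = possible_worlds([0.9, 0.5, 0.6])
print(len(worlds))  # 2**3 = 8 worlds
```

For the three tuples above there are 2^3 = 8 worlds, and by the independence assumption their probabilities sum to 1.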

2.1. Related work

2.1.1. Classification algorithms over data streams

The data distribution of evolving data streams may change with time. This phenomenon, dubbed concept drift, is an important challenge that clearly distinguishes streams from the traditional static data model. There are two kinds of concept drift: burst concept drift and progressive concept drift. To deal with concept drift, existing algorithms fall into two categories: single classifier approaches and ensemble models.

The single classifier approaches mainly use a sliding window mechanism to select tuples that fit the current concept to train the classifier. In 2000, the very fast decision tree (VFDT) algorithm was proposed to classify data streams [6]. In 2001, the concept-adapting very fast decision tree (CVFDT) presented in [7] aimed at dealing with time-varying concepts; based on a sliding window, CVFDT was updated by removing low-accuracy tree nodes and adding new sub-trees. In 2004, [8] proposed the Ultra-Fast Forest Tree system (UFFT), in which multiple binary trees compose a decision forest and each binary tree corresponds to only two categories; the final prediction is decided by the votes of all members. Fan et al. proposed the random decision tree (RDT) [9] to learn classifiers without training sets.

The ensemble model is composed of multiple independent base classifiers. As streaming data arrives continuously, the amount of data that fits the current window concept differs across data streams, so the single classifier strategy has some limitations. As one of the classic ensemble-based classification methods, the streaming ensemble algorithm (SEA) was presented in [10]; SEA divides the data stream into segments, each of which trains a base classifier. Wang et al. put forward the weighted ensemble classifier (WCE) for data stream classification [11]: when the accuracy of a base classifier falls below a threshold, it is replaced with a new one to adapt to the current data. Since not all features are important for training a classifier on high-dimensional data sets, Nguyen et al. [12] proposed the HEFT-Stream algorithm, in which feature extraction is incorporated into a heterogeneous ensemble model to deal with different types of concept drift. In [13], an incremental weighting function was put forward to evaluate the performance of base classifiers. An on-line weighted ensemble (OWE) regression model was proposed in [14], which can incrementally learn from many changing tuples while retaining previously seen information. Sun et al. proposed a class-based ensemble to handle class evolution [15].

The uncertainty of streaming data increases the complexity of classification algorithms, and so far only a few works are available. Based on CVFDT [7], Liang et al. proposed the


uncertainty-handling and concept-adapting very fast decision tree (UCVFDT) [16], an algorithm dealing with uncertainty and concept drift at the same time. Pan et al. proposed a static classifier ensemble (SCE) and a dynamic classifier ensemble (DCE) to classify uncertain data streams [17]. In [18], an ensemble model, ECluds, was proposed, which applies supervised k-means clustering on uncertain data stream chunks and then extracts sufficient statistics into micro-clusters. Liu et al. proposed a local kernel-density-based method to generate a bound score for each instance [19]. In [20], a weighted ensemble model was proposed in which the base classifiers are trained by ELM. Furthermore, Han et al. proposed a parallel ensemble classifier, UELM-MapReduce, in which the weight of each base classifier can be adjusted according to its mean square error on the up-to-date test chunk [21].

2.1.2. PU learning

In PU learning, the class that users are interested in is called the target/positive class, while all others are called the non-target/negative class. As a kind of semi-supervised learning, PU learning builds two-class classifiers with only labeled positive samples but no negative samples, and then predicts the class labels of test data.

In [22], Denis first proposed the concept of PU learning, called learning from POSitive EXamples (POSEX), and demonstrated the effect of unlabeled samples on classification. As a special two-class classification approach, PU learning is widely applied in anomaly detection [23], automatic image annotation [24], change detection [25], authorship verification [26], and so on. Related algorithms fall into three main categories [27]: those based on the two-phase strategy represented by Liu [28,29], those based on the statistical query learning model of positive examples represented by Denis [30], and the negative example bias methods [31,32].

The above algorithms work on precise data; the literature on PU learning for uncertain data is scarce. Based on the Naïve Bayes method, He et al. first proposed a classifier learned from positive and unlabeled examples with uncertainty [33]. In [34], a decision tree algorithm, DTU-PU, based on the information entropy of POSC4.5, was proposed to deal with uncertain continuous attributes in PU learning. Liang et al. proposed the puuCVFDT algorithm to solve PU learning problems over uncertain data streams [35]. In [36], the UOLCS framework was proposed, which uses an ensemble classifier and selects reliable positive and negative examples by a clustering method. These approaches focus on attribute level or occurrence level uncertainty, respectively.

Until now, there has been no PU learning algorithm that deals with both kinds of uncertainty on data streams. In many real-world problems, it is common to have both. We therefore focus on PU learning over streaming data with both occurrence and attribute uncertainty.

2.2. Review of ELM

If the activation function is non-polynomial, [37] shows that a Single-hidden Layer Feedforward Neural Network (SLFN) can approximate any function. Based on this, Huang et al. proposed ELM [38]. The essence of ELM is that, if the hidden nodes, input weights, and hidden layer biases are chosen randomly, then the output layer weights can be obtained by the least squares method; the whole process is completed in one pass, without iteration.

To solve the problem of unbalanced data distributions, Huang et al. proposed the Weighted ELM algorithm [39], described in detail as follows:

Given a training set ℵ = {(x_i, t_i) | x_i ∈ R^d, t_i ∈ {−1, 1}, i = 1, ..., N}, in which t_i ∈ {−1, 1} indicates whether the sample is labeled positive or negative. To represent the imbalance of the class distribution, Weighted ELM introduces an N × N diagonal matrix W with one diagonal entry per sample x_i. The Weighted ELM can be expressed as:

Minimize: L_P_ELM = (1/2)‖β‖² + CW · (1/2) Σ_{i=1}^{N} ‖ξ_i‖²

Subject to: h(x_i)β = t_i^T − ξ_i^T, i = 1, ..., N

According to the KKT theory, the above optimization problem is equivalent to the dual problem:

L_P_ELM = (1/2)‖β‖² + CW · (1/2) Σ_{i=1}^{N} ξ_i² − Σ_{i=1}^{N} α_i (h(x_i)β − t_i + ξ_i)    (1)

in which each Lagrange multiplier α_i is a variable associated with x_i. Setting the partial derivatives with respect to the variables to zero shows that, when the number of samples N is much larger than the number of hidden layer nodes L:

β = (I/C + H^T W H)^{−1} H^T W T    (2)

After computing the weights from the hidden layer to the output layer, for a new sample x the output function of Weighted ELM is:

f(x) = sign(h(x)β) = sign(h(x)(I/C + H^T W H)^{−1} H^T W T)    (3)
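The training step of Eq. (2) and the prediction step of Eq. (3) can be sketched in NumPy as follows; the tanh activation, the toy data, and the per-class weighting w_i = 1/#(class of i) are our assumptions, not prescribed by the paper:

```python
import numpy as np

# Illustrative sketch of Weighted ELM: random hidden layer, then the
# closed-form output weights of Eq. (2); data and weighting are made up.
rng = np.random.default_rng(0)

def train_weighted_elm(X, t, w, L=20, C=1.0):
    """Return (input weights A, biases b, output weights beta)."""
    N, d = X.shape
    A = rng.normal(size=(d, L))   # random input weights
    b = rng.normal(size=L)        # random hidden layer biases
    H = np.tanh(X @ A + b)        # hidden layer output matrix (N x L)
    W = np.diag(w)                # per-sample weight matrix (N x N)
    # beta = (I/C + H^T W H)^-1 H^T W T   -- Eq. (2)
    beta = np.linalg.solve(np.eye(L) / C + H.T @ W @ H, H.T @ W @ t)
    return A, b, beta

def predict(X, A, b, beta):
    return np.sign(np.tanh(X @ A + b) @ beta)   # Eq. (3)

# toy imbalanced data: 5 positives, 45 negatives, weighted by 1/class size
X = np.vstack([rng.normal(2.0, 0.5, (5, 2)), rng.normal(-2.0, 0.5, (45, 2))])
t = np.array([1.0] * 5 + [-1.0] * 45)
w = np.where(t == 1, 1 / 5, 1 / 45)
A, b, beta = train_weighted_elm(X, t, w)
print(np.mean(predict(X, A, b, beta) == t))   # training accuracy
```

The weighting upweights the minority class, which is why the paper builds its PU classifier on Weighted ELM rather than plain ELM.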

3. Algorithms

Based on Weighted ELM, this paper proposes the PUU algorithm to address the problem of PU learning with both kinds of uncertainty. Furthermore, we propose the PUUS algorithm for classifying positive and unlabeled examples over uncertain data streams, which can detect and handle concept drift. In this section, we present the related definitions and our algorithms in detail.

3.1. Related definitions

Definition 1. Uncertain continuous attribute. Assume that A_i is the i-th attribute of the data set, A_it is the t-th sample on this attribute, and V_it is its corresponding random variable. If A_it = ⟨a_it, b_it, f_it(x)⟩, where V_it = {f_it(x) | x ∈ [a_it, b_it]} and ∫_{a_it}^{b_it} f_it(x) dx = 1, then A_i is called an uncertain continuous attribute, denoted A_i^un; its sample and corresponding random variable are denoted A_it^un and V_it^un, respectively.

Definition 2. Uncertain discrete attribute. Assume that A_i is the i-th attribute of a data set, A_it is the t-th sample on the attribute, and V_it is its corresponding random variable. If A_it = {⟨v_j, p_ij⟩}, where v_j ∈ Dom(V_it), Dom(V_it) = {v_1, v_2, ..., v_n}, P(V_it = v_j) = p_ij, and Σ_{j=1}^{n} p_ij = 1, then A_i is called an uncertain discrete attribute, denoted A_i^uc; its sample and corresponding random variable are denoted A_it^uc and V_it^uc, respectively.

Definition 3. Uncertain data stream S. An uncertain data stream S continuously arrives in blocks, S = ⟨S_1, S_2, ..., S_i, ..., S_cur, S_cur+1, ...⟩, in which S_i is the i-th block and S_cur is the current data block. Each block is composed of n tuples, i.e., S_i = ⟨s_i1, s_i2, ..., s_ij, ..., s_in⟩, where s_ij is the j-th tuple in the i-th data block. Every tuple s_ij is a five-tuple, s_ij = ⟨A_ij, p_ij, c, l, t⟩, where A_ij = ⟨A_ij^U1, A_ij^U2, ..., A_ij^Ud⟩, U ∈ {un, uc}, and each dimension is a sample value of the corresponding attribute of the instance. p_ij is the occurrence probability of s_ij; c is the category of s_ij, c ∈ {+1, −1}, where +1 represents the positive class and −1 the negative class; l shows whether s_ij is labeled; and t is the arrival time of s_ij.
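For illustration only, Definition 3's five-tuple could be encoded as follows (field names and the attribute encoding are ours, not the paper's):

```python
from dataclasses import dataclass

# Hypothetical rendering of the five-tuple s_ij = <A_ij, p_ij, c, l, t>.
@dataclass
class UncertainTuple:
    attributes: list        # A_ij: one uncertain attribute sample per dimension
    occurrence_prob: float  # p_ij: occurrence level uncertainty
    category: int           # c in {+1, -1}
    labeled: bool           # l: whether the tuple carries a label
    arrival_time: int       # t: used to assign the tuple to a data block

# sample 1 of Table 1: a pdf on [17, 18] plus a discrete soil distribution
s = UncertainTuple([("pdf", 17, 18), [("fertile", 0.9), ("poor", 0.1)]],
                   0.9, +1, True, 1)
print(s.occurrence_prob, s.category)
```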

Definition 4. Mean. The average of an uncertain attribute's distribution is called its mean. The mean is used to describe the area where


the attribute values are relatively concentrated, represented by EP(·). If A_i is the uncertain discrete attribute A_i^uc, A_it^uc is a sample of the attribute, and V_it^uc is the random variable of the sample, then the mean of A_it^uc is:

EP(A_it^(uc)) = E(V_it^(uc)) = Σ_{k=1}^{n} v_k p_ik    (4)

If A_i is the uncertain continuous attribute A_i^un, A_it^un is a sample of this attribute, and V_it^un is the random variable of the sample, then the mean of A_it^un is:

EP(A_it^(un)) = E(V_it^(un)) = ∫_{a_it}^{b_it} x f_it(x) dx    (5)

Definition 5. Average discrete distance. This describes the average spread of the attribute value distribution, represented by Var(·). For an uncertain discrete attribute,

Var(A_it^(uc)) = Σ_{k=1}^{n} |v_k − E(V_it^(uc))| p_ik    (6)

For an uncertain continuous attribute,

Var(A_it^(un)) = ∫_{a_it}^{b_it} |x − E(V_it^(un))| f_it(x) dx    (7)
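Eqs. (4)–(7) can be sketched as follows, using sample 1 of Table 1; the uniform temperature pdf, the numeric soil encoding, and the function names are our assumptions:

```python
import numpy as np

# Sketch of the mean EP(.) and average discrete distance Var(.) of a
# single uncertain attribute sample.
def ep_discrete(values, probs):
    return sum(v * p for v, p in zip(values, probs))           # Eq. (4)

def var_discrete(values, probs):
    m = ep_discrete(values, probs)
    return sum(abs(v - m) * p for v, p in zip(values, probs))  # Eq. (6)

def _trapz(y, x):  # trapezoid-rule integral
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x) / 2.0))

def ep_continuous(a, b, pdf, n=10_001):
    x = np.linspace(a, b, n)
    return _trapz(x * pdf(x), x)                               # Eq. (5)

def var_continuous(a, b, pdf, n=10_001):
    x = np.linspace(a, b, n)
    m = ep_continuous(a, b, pdf, n)
    return _trapz(np.abs(x - m) * pdf(x), x)                   # Eq. (7)

# soil fertile ability of sample 1 in Table 1 (fertile -> 1, poor -> 0)
print(ep_discrete([1, 0], [0.9, 0.1]), var_discrete([1, 0], [0.9, 0.1]))
# temperature of sample 1, assuming a uniform pdf on [17, 18]
print(ep_continuous(17, 18, lambda x: np.ones_like(x)))
```

These two statistics are exactly what Algorithm 1 later stores per dimension, so the whole distribution of each attribute collapses to a single (mean, spread) pair.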

Definition 6. Positive cluster. A cluster that contains both positive examples and unlabeled examples is called a positive cluster, represented by CLU+.

Definition 7. Negative cluster. A cluster that contains only unlabeled examples is called a negative cluster, represented by CLU−.

Definition 8. Minimum surrounded ball. Given a set of points P, suppose Hs is a set of hyperspheres in the space and hs is any hypersphere in the set such that all points of P are located in hs, namely ∀p ∈ P, p ∈ hs, hs ∈ Hs. Let r_hs denote the radius of a hypersphere. If r_hs_min ≤ r_hs for every hs ∈ Hs, the hypersphere hs_min is called the minimum surrounded ball of the set P. Its center O(x_1, x_2, ..., x_i, ..., x_d) and radius r_hs_min can be calculated as follows:

x_i = (1/|P|) Σ_{k=1}^{|P|} x_i^(p_k)    (8)

r_hs_min = max_{1≤k≤|P|} Dis(p_k, O) = max_{1≤k≤|P|} sqrt( Σ_{i=1}^{d} (|x_i^(p_k) − x_i| + u_i^(p_k))² )    (9)

Definition 9. Compression cluster. Suppose there is a cluster O in the space whose radius is r. The cluster whose center is O and whose radius is x × r, where 0 < x < 1, is called the x compression cluster of O.

This paper introduces the minimum surrounded ball and the compression cluster in order to filter the instances in a cluster: only the center and radius of a cluster are needed, without a specific density function, which simplifies the calculation.

Definition 10. Reliable positive example. In a positive cluster, the unlabeled examples inside the minimum surrounded ball of the positive examples are called reliable positive examples.

Definition 11. Reliable negative example. In a negative cluster, the samples inside the x compression cluster of this negative cluster are called reliable negative samples.

Definition 12. Distance between an uncertain tuple and a point in d-dimensional space. If p_k is an uncertain tuple and A(x_1, x_2, ..., x_d) is a point in d-dimensional space, the distance Dis(p_k, A) between p_k and A is calculated by:

Dis(p_k, A) = sqrt( Σ_{i=1}^{d} (|x_i^(p_k) − x_i| + u_i^(p_k))² )    (10)

where |x_i^(p_k) − x_i| + u_i^(p_k) indicates the longest distance between p_k and A in the i-th dimension.
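A minimal sketch of Eqs. (8)–(10), assuming the Euclidean-style square-root reading of the distance; tuples are pairs of a mean vector and an average discrete distance vector, and the function names are ours:

```python
import math

# Sketch of Eq. (10) and Eqs. (8)-(9): a tuple t is (mean vector x,
# average discrete distance vector u).
def dis(t, point):                                   # Eq. (10)
    x, u = t
    return math.sqrt(sum((abs(xi - ai) + ui) ** 2
                         for xi, ui, ai in zip(x, u, point)))

def min_surrounded_ball(tuples):
    d = len(tuples[0][0])
    # Eq. (8): the center averages the tuples' mean vectors
    center = [sum(t[0][i] for t in tuples) / len(tuples) for i in range(d)]
    # Eq. (9): the radius is the largest uncertainty-padded distance
    radius = max(dis(t, center) for t in tuples)
    return center, radius

pts = [((0.0, 0.0), (0.1, 0.1)), ((2.0, 0.0), (0.1, 0.1))]
center, radius = min_surrounded_ball(pts)
print(center, radius)   # center [1.0, 0.0]
```

Padding each coordinate difference with u makes the ball conservative: a tuple counts as inside only if even its farthest plausible value is inside.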

3.2. PUU algorithm

We first propose PUU, an algorithm for learning from static positive and unlabeled samples with uncertainty. It adopts a dimensionality reduction approach to deal with uncertainty and a clustering-based extraction of reliable positive and negative examples; finally, a classifier based on Weighted ELM is learned.

3.2.1. Uncertain data processing

If uncertain data were input to the classifier directly, there would be problems due to the large number of dimensions and to each attribute corresponding to a different number of input nodes. These problems all belong to attribute level uncertainty. This paper therefore reduces the dimensionality of the uncertain data: under the premise of not losing important information, the uncertainties in both continuous attributes and discrete attributes are simplified.

The core of uncertain continuous attributes and discrete attributes is continuous random variables and discrete random variables, respectively. Therefore, the key task is to preserve the information of the random variables' distributions. This paper uses the mean (Eqs. (4) and (5)) and the average discrete distance (Eqs. (6) and (7)) to record the characteristics of a distribution; for two different distributions, dimension reduction via the mean and average discrete distance does not lose too much information. Algorithm 1 gives the pseudo-code for processing uncertainty.

Algorithm 1. Uncertain data processing algorithm.

Input:
  S'_i: uncertain data block of five-tuples
Output:
  S_i: uncertain data block of six-tuples

1. for each tuple in S'_i do
2.   for each attribute A do
3.     if A is an uncertain discrete attribute then
4.       calculate the mean and the average discrete distance with (4) and (6);
5.     else
6.       calculate the mean and the average discrete distance with (5) and (7);
7.     end if
8.   end for
9. end for

After dimension reduction, an instance s_ij = ⟨A_ij, p_ij, c, l, t⟩ of data block S'_i is represented by the six-tuple s_ij = ⟨x_ij, u_ij, p_ij, c, l, t⟩. Here, x_ij represents the mean vector of the instance, x_ij = ⟨x_ij^1, x_ij^2, ..., x_ij^k, ..., x_ij^d⟩, where x_ij^k = EP(A_ij^Uk); u_ij represents the average discrete distance vector of the instance, u_ij = ⟨u_ij^1, u_ij^2, ..., u_ij^k, ..., u_ij^d⟩, where u_ij^k = Var(A_ij^Uk), U ∈ {uc, un}.
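Algorithm 1's reduction can be sketched as follows under a simplifying assumption of ours: every attribute arrives as a discrete distribution of (value, probability) pairs, with a continuous pdf discretized beforehand:

```python
# Sketch of Algorithm 1 (helper name is ours): each uncertain attribute is
# reduced to (mean, average discrete distance), turning the five-tuple
# <A, p, c, l, t> into the six-tuple <x, u, p, c, l, t>.
def reduce_tuple(attributes, p, c, l, t):
    x = [sum(v * q for v, q in dist) for dist in attributes]   # Eqs. (4)/(5)
    u = [sum(abs(v - m) * q for v, q in dist)                  # Eqs. (6)/(7)
         for dist, m in zip(attributes, x)]
    return (x, u, p, c, l, t)

# sample 1 of Table 1: a coarse two-point stand-in for the temperature pdf,
# plus the soil distribution (fertile -> 1, poor -> 0)
six = reduce_tuple([[(17.25, 0.5), (17.75, 0.5)], [(1, 0.9), (0, 0.1)]],
                   0.9, +1, True, 1)
print(six[0], six[1])   # per-dimension means and average discrete distances
```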

3.2.2. Extracting reliable positive examples and negative examples

Because unlabeled examples contain useful information for classification, this paper adopts a two-stage strategy to train the classifier. Most traditional two-stage strategies need to iterate repeatedly until convergence, which is time-consuming and may not yield the optimal classifier. Therefore, based on clustering technology, this paper proposes a method to extract reliable positive and


Algorithm 2. Extracting reliable positive and reliable negative examples based on the two-stage strategy.

Input:
  S: data block
  k: the number of clusters
  x: compression cluster ratio
  P_exist: occurrence probability threshold
Output:
  RP: reliable positive examples
  RN: reliable negative examples

1.  select k different cluster centers randomly;
2.  repeat
3.    for each tuple in S do
4.      if p_tuple > P_exist then
5.        for each cluster Cl do
6.          calculate the distance between the tuple and the center of Cl with (10);
7.        end for
8.        put the tuple in the closest cluster;
9.      end if
10.   end for
11.   for each cluster Cl do
12.     recalculate the center of the cluster;
13.   end for
14. until no cluster changes;
15. for each cluster Cl do
16.   if Cl is a positive cluster then
17.     calculate its center and radius;
18.     for each tuple in Cl do
19.       if the tuple is in the minimum surrounded ball then
20.         put the tuple in RP;
21.       end if
22.     end for
23.   else if Cl is a negative cluster then
24.     calculate its center and radius;
25.     for each tuple in Cl do
26.       if the tuple is in the compression cluster then
27.         put the tuple in RN;
28.       end if
29.     end for
30.   end if
31. end for


negative examples. Algorithm 2 shows the pseudo-code of the extraction process.

Algorithm 2 extracts reliable positive and negative examples from the positive and unlabeled examples quickly. Line 1 initializes k clusters. Lines 2-14 cluster the tuples in the data block: lines 5-7 calculate the distance between an uncertain tuple and each cluster center, and lines 4-9 process only the tuples whose occurrence probability is larger than P_exist, filtering out the tuples whose occurrence probability is smaller. Line 12 recalculates the cluster centers. Lines 15-31 extract reliable positive and negative examples from the positive and negative clusters, where lines 16-22 deal with positive clusters: if an unlabeled tuple of a positive cluster falls inside the minimum surrounding ball of the positive examples, the tuple is added to the reliable positive examples set. Lines 23-30 handle the negative clusters: if an unlabeled tuple is located inside the compression cluster, the tuple is added to the reliable negative examples set.

3.2.3. Training classifier and predicting

Next, it is necessary to train a classifier on the training set, consisting of reliable positive examples, reliable negative examples and labeled positive examples. Because PU learning is an unbalanced-data problem, and Weighted ELM can classify unbalanced data effectively, we build the classifier based on Weighted ELM.

For a data model with both uncertainties, we need to combine all the uncertainties to build the classifier. Attribute level uncertainty has already been simplified and preserved in the six-tuple, using the mean and average discrete distance computed at the pre-processing stage. Intuitively, the lower the occurrence probability of a tuple, the less it contributes to training the classifier; tuples whose occurrence probability is below the threshold have already been discarded in Algorithm 2. Based on the principle that labeled positive examples and tuples with higher occurrence probability are more important for classification, we calculate the weight of reliable positive and negative examples with (11):

w_{jj} = \begin{cases} 1, & s_{ij} \in P \\ p_{ij}, & s_{ij} \in RP \cup RN \end{cases}    (11)

where the weight matrix W is the punishment coefficient of Weighted ELM (Section 2.2), $w_{jj}$ is the entry in the jth row and jth column of W, and $p_{ij}$ is the occurrence probability of $s_{ij}$. P, RP and RN represent the sets of labeled positive examples, reliable positive examples and reliable negative examples, respectively.

After training, the task of the classifier is to predict the categories of newly arrived tuples. This paper assumes that occurrence level uncertainty influences building the classifier but has little effect on classification itself; therefore, only attribute level uncertainty needs to be processed when classifying. In the pre-processing phase, we standardize the data to eliminate dimensional effects. The influence of different attributes on the classification result is then weighted by adding an adjustment factor for each attribute, called the attribute weight and represented by aw, whose computation is as follows.

Suppose $s_{ij} = \langle A_{ij}, p_{ij}, c, l, t \rangle$ is a tuple of the data block and $u_{ij} = \langle u^1_{ij}, u^2_{ij}, \ldots, u^k_{ij}, \ldots, u^d_{ij} \rangle$ represents its average discrete distance vector. $aw = \langle aw_1, aw_2, \ldots, aw_k, \ldots, aw_d \rangle$ is the attribute weight vector:

aw'_k = 1 - \frac{u^k_{ij}}{\sum_{l=1}^{d} u^l_{ij}}    (12)

After normalizing $aw'_k$, the final attribute weight is computed as

aw_k = \frac{aw'_k}{\sum_{l=1}^{d} aw'_l}, \quad \text{where} \quad \sum_{l=1}^{d} aw_l = 1    (13)
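Eqs. (12) and (13) amount to down-weighting attributes with a large average discrete distance (i.e., high uncertainty) and normalizing the result to sum to one. A minimal sketch in Python (the function name is hypothetical):

```python
def attribute_weights(u):
    """Eqs. (12)-(13): attributes with a smaller average discrete
    distance (less uncertainty) receive larger weights."""
    total = sum(u)
    raw = [1 - uk / total for uk in u]  # Eq. (12); raw weights sum to d - 1
    s = sum(raw)
    return [r / s for r in raw]         # Eq. (13); normalized to sum to 1
```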

PU learning is a binary classification problem, so the classifier has two outputs, namely $f(x) = [f_1(x), f_2(x)]$. Comparing the values of $f_1(x)$ and $f_2(x)$ determines the category of a test tuple: $f_1(x)$ represents the total contribution value supporting tuple x belonging to the positive class, while $f_2(x)$ is that supporting the negative class. If $f_1(x) > f_2(x)$, the category of x is the positive class; otherwise, it is the negative class.

Assume the activation function is the linear excitation function $g(x) = x$. For a tuple x, the contribution of the mean $x_i$ to the prediction result is $\left(\sum_{k=1}^{L} a_{ki}(\beta_{k1} - \beta_{k2})\right) x_i$. Here, we use the attribute weight aw to weight the contribution degree:

f_{final}(x) = f_1(x) - f_2(x) = \sum_{i=1}^{d} aw_i \left( \sum_{k=1}^{L} a_{ki}(\beta_{k1} - \beta_{k2}) \right) x_i + \sum_{k=1}^{L} b_k \beta_{k1} - \sum_{k=1}^{L} b_k \beta_{k2}    (14)
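Eq. (14) can be transcribed directly, assuming the linear activation g(x) = x and one bias b_k per hidden node; the `f_final` name and argument layout below are assumptions:

```python
def f_final(x, aw, a, b, beta):
    """Eq. (14) with linear activation g(x) = x.

    x      : mean vector of the test tuple (length d)
    aw     : attribute weights from Eq. (13) (length d)
    a[k][i]: input weight of hidden node k for attribute i
    b[k]   : bias of hidden node k
    beta[k]: (beta_k1, beta_k2) output weights of hidden node k
    The tuple is classified as positive iff the returned value is > 0.
    """
    d, L = len(x), len(b)
    s = sum(aw[i] * sum(a[k][i] * (beta[k][0] - beta[k][1]) for k in range(L)) * x[i]
            for i in range(d))
    # Bias terms do not depend on the attributes, so they stay unweighted.
    s += sum(b[k] * (beta[k][0] - beta[k][1]) for k in range(L))
    return s
```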

The pseudo-code of the PUU algorithm is shown in Algorithm 3. For the training data set S_train, line 1 handles uncertainty with Algorithm 1 to transform it into a six-tuple data set. Line 2 extracts reliable positive and negative examples from S_train with Algorithm 2, yielding the sets RP and RN. Lines 3-5 describe the procedure of learning the model from P, RP and RN, and lines 6 and 7 describe the process of classifying the data in the test data set.


Algorithm 3 PUU algorithm.

Input :

S train : training data set

S test : test data set

Output :

the class label of the data in S test

1. Process the train set S train with Algorithm 1;

2. Extract reliable positive examples and negative examples in S train with

Algorithm 2;

3. Randomly generate the parameters of hidden layer nodes;

4. Compute the output matrix H of hidden layer;

5. Compute weight β from hidden layer to output layer with (2);

6. Process the test set S test with Algorithm 1;

7. Classify unknown tuples in S test with (14).


3.3. PUUS algorithm

Different from traditional classification problems, concept drift is the key challenge for PU learning over uncertain data streams. Because ensemble models perform better under concept drift, this paper proposes PUUS based on an ensemble strategy.

3.3.1. Ensemble model

The procedure of the ensemble model is shown in Fig. 1. This paper adopts voting to determine the final class label of the test tuples. $C_i$ is the ith base classifier in the ensemble classifier EC, and $f_{C_i}(x) \in \{+1, -1\}$ is the classification result from $C_i$. $n_{+1}$ represents the number of positive class votes and $n_{-1}$ the number of negative class votes. The final result of the ensemble classifier is:

f_{EC}(x) = \arg\max_{f_{C_i}(x)} n_{f_{C_i}(x)}    (15)
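Eq. (15) is ordinary majority voting over ±1 votes. A minimal sketch follows; tie-breaking toward the positive class is an assumption, since the paper does not specify it:

```python
def ensemble_predict(classifiers, x):
    """Eq. (15): each base classifier votes +1 or -1; the label with
    the most votes wins (ties broken toward +1, an assumption)."""
    votes = sum(c(x) for c in classifiers)
    return 1 if votes >= 0 else -1
```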

3.3.2. Concept drift detection and processing

(1) Detecting concept drift

This paper proposes a novel method to detect concept drift, which applies to settings with multiple clusters of different categories.

Each cluster in the space can be seen as a hypersphere, and there may be overlap between hyperspheres. Since the clusters themselves do not intersect, a hypersphere may be fragmentary. As shown in Fig. 2, because each cluster is determined by a fixed center and radius, clusters such as A and B are both represented by circles in the two-dimensional rectangular coordinate system. In the area where clusters C and D overlap, the class label of a tuple is decided by its distance to the centers of the clusters.

As shown in Fig. 3, the area surrounded by dotted lines is a cluster that exists at time T1 over the data stream, while the region surrounded by a solid line is the one existing at T2. The region these clusters cover varies from T1 to T2. When concept drift happens, an enclosed region may change in different ways. In Fig. 3(a), the region varies only slightly and the category of the cluster does not change, so the change can be ignored. In Fig. 3(b), although the clusters belong to the same category, the region varies greatly and the concept needs a modest update. In Fig. 3(c), not only the region changes, but also the category that the cluster belongs to; in this case, the concept needs to be revised substantially, or even relearned. To deal with this issue, this paper first proposes the concept of cluster similarity.

Definition 13 (The similarity of clusters). If cluster CLU_1 and cluster CLU_2 belong to the same category, the ratio between the volume of their intersection region and the volume of CLU_2 is called the similarity of CLU_2 to CLU_1. If CLU_1 and CLU_2 belong to different categories, the similarity between them is zero. The similarity of CLU_2 to CLU_1 is referred to as Sim(CLU_2, CLU_1), where $C_{CLU_1}$ is the category of CLU_1 and $C_{CLU_2}$ is the category of CLU_2. Therefore,

Sim(CLU_2, CLU_1) = \begin{cases} \frac{V_{CLU_2 \cap CLU_1}}{V_{CLU_2}}, & C_{CLU_2} = C_{CLU_1} \\ 0, & C_{CLU_2} \neq C_{CLU_1} \end{cases}    (16)

Assume the number of instances in cluster CLU is denoted by $|P_{CLU}|$, and an instance is referred to as a point Pt, $Pt \in P_{CLU}$. Because of the occurrence level uncertainty of the instances, counting the instances needs to be weighted by the occurrence probability:

|P_{CLU}| = \sum_{Pt \in P_{CLU}} p_{Pt}    (17)

In general, the larger the scope covered by a cluster, the more points it contains, so $|P_{CLU}|$ and $V_{CLU}$ are positively correlated. If $p \in P_{CLU_1}$ and $p \in P_{CLU_2}$, then $p \in P_{CLU_1 \cap CLU_2}$, and Eq. (16) can be rewritten as:

Sim(CLU_2, CLU_1) = \begin{cases} \frac{|P_{CLU_2 \cap CLU_1}|}{|P_{CLU_2}|}, & C_{CLU_2} = C_{CLU_1} \\ 0, & C_{CLU_2} \neq C_{CLU_1} \end{cases}    (18)
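Eq. (18), together with the weighted count of Eq. (17), can be sketched as follows, representing a cluster as a (category, members) pair where members maps a point id to its occurrence probability (a hypothetical encoding):

```python
def cluster_similarity(clu2, clu1):
    """Eq. (18). The occurrence-probability-weighted point count of
    Eq. (17) stands in for the volume of Eq. (16)."""
    cat2, pts2 = clu2
    cat1, pts1 = clu1
    if cat2 != cat1:  # different categories -> similarity is zero
        return 0.0
    inter = sum(p for pid, p in pts2.items() if pid in pts1)  # |P_{CLU2 ∩ CLU1}|
    total = sum(pts2.values())                                # |P_{CLU2}|
    return inter / total if total else 0.0
```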

Let $CLUs_{new}$ denote the set of clusters in the current data block and $CLUs_{old}$ the set of clusters in the historical data block. The similarity between the cluster sets of the current data block and the historical data block is the weighted sum of the pairwise cluster similarities between the two cluster sets:

Sim(CLUs_{new}, CLUs_{old}) = \sum_{CLU_{new} \in CLUs_{new}} \sum_{CLU_{old} \in CLUs_{old}} Sim(CLU_{new}, CLU_{old})    (19)

Sim(CLUs_new, CLUs_old) is the similarity between the cluster sets of the current data block and the historical data block. Supposing ξ_1 and ξ_2 are thresholds set by the user, when ξ_1 < Sim(CLUs_new, CLUs_old) < ξ_2, we consider that progressive concept drift happens; when Sim(CLUs_new, CLUs_old) < ξ_1, there is burst concept drift.
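The threshold rule above can be sketched as a small helper; the handling of values exactly at ξ_1 or ξ_2 is an assumption, since the paper uses strict inequalities:

```python
def drift_type(sim, xi1=0.3, xi2=0.7):
    """Classify drift from the cluster-set similarity of Eq. (19):
    sim < xi1         -> burst drift (retrain all base classifiers)
    xi1 <= sim < xi2  -> progressive drift (replace the worst classifier)
    otherwise         -> no drift"""
    if sim < xi1:
        return 'burst'
    if sim < xi2:
        return 'progressive'
    return 'none'
```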

(2) Updating classification models

This paper adopts different update strategies for different concept drifts, namely replacing the worst base classifier or retraining all the base classifiers.

The performance of the base classifiers is evaluated by F1, which combines recall and precision. To adapt to uncertainty, precision and recall are extended in this paper. Let p_x be the occurrence probability of tuple x, TP the sample set correctly classified as positive, TN the sample set correctly classified as negative, FP the sample set wrongly classified as positive, and FN the sample set wrongly classified as negative:

precision = \frac{\sum_{x \in TP} p_x}{\sum_{x \in TP} p_x + \sum_{x \in FP} p_x}    (20)

recall = \frac{\sum_{x \in TP} p_x}{\sum_{x \in TP} p_x + \sum_{x \in FN} p_x}    (21)

F1 = \frac{2 \cdot precision \cdot recall}{precision + recall}    (22)
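The probability-weighted metrics of Eqs. (20)-(22) can be computed directly from the occurrence probabilities of the tuples in TP, FP and FN (the function name is hypothetical, and nonempty inputs are assumed):

```python
def weighted_prf(tp, fp, fn):
    """Eqs. (20)-(22). tp, fp, fn are the lists of occurrence
    probabilities p_x of the tuples in TP, FP and FN."""
    precision = sum(tp) / (sum(tp) + sum(fp))
    recall = sum(tp) / (sum(tp) + sum(fn))
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```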

Algorithm 4 gives the pseudo-code of the PUUS algorithm. When the data arrives, the first data block is used to initialize the ensemble model. Then, whenever a new data block arrives, its uncertainty is handled by Algorithm 1. The k clusters in every data block are found by clustering, and the similarity between the current cluster set and the historical one is calculated. According to the similarity, we judge whether concept drift


Fig. 1. Flow chart of ensemble classification strategy.


happens and determine its category. If burst concept drift happens, all the classifiers are deleted and a new ensemble model is trained with the current data block; otherwise, a new base classifier replaces the worst one. After updating the classifiers, the cluster set of the current data block replaces the historical one.

4. Experimental results analysis

In this section, the performance of the proposed algorithm PUUS is demonstrated. Since there is no existing algorithm for PU learning with both uncertainties over data streams, we analyze the performance of our algorithm on its own. Through experiments on real and synthetic data sets, we show that the proposed algorithm can solve the PU learning problem on uncertain data streams effectively. All experiments are run on a computer with an Intel(R) Core(TM) CPU and 4 GB DDR3 RAM. The hard disk is 500 GB (7200 rpm) and the OS is Windows 7 Ultimate 64-bit SP3. The IDE is Eclipse Luna and the programming language is Java.

4.1. Data set

In the existing literature, there is no data stream model with both categories of uncertainty, so we can only use certain data to synthesize uncertain data streams. This paper adopts synthetic data and real-world data to evaluate the performance, respectively.


Algorithm 4 PUUS algorithm.

Input :

S : uncertain data stream

N : the number of base classifiers in EC

ξ 1 : similarity threshold lower boundary

ξ 2 : similarity threshold upper boundary

Output:

ensemble classifier EC

1. Generate N basic classifiers in EC with the first data block;

2. repeat

3. new data block S i arrives;

4. Process uncertainty in S i with Algorithm 1;

5. randomly select k different clustering centers;

6. repeat

7. for each tuple in S i do

8. for each cluster ClU do

9. Calculate the distance from tuple to ClU with (10);

10. end for

11. Put the tuple into the closest cluster;

12. end for

13. for each cluster ClU do

14. Recalculate the clustering center of clusters;

15. end for

16. until each cluster doesn’t vary any more;

17. for each cluster ClU history in history clusters do

18. for each cluster ClU current in current clusters do

19. Calculate Sim ( CLU current, CLU history ) with (18);

20. end for

21. Calculate Sim ( CLUs new , CLUs old ) with (19);

22. end for

23. if Sim ( CLUs new , CLUs old ) < ξ 1

24. Delete all base classifiers and regenerate N basic classifiers;

25. end if

26. if ξ 1 < Sim ( CLUs new , CLUs old ) < ξ 2

27. Eliminate the basic classifier whose F1 value is the lowest;

28. Train a new classifier to replace the worst one;

29. end if

30. Classify unknown tuples with EC ;

31. until stream end.

Fig. 2. An example of regional overlap of hyper spheres.


Fig. 3. Sketch map of concept drift.

(1) Synthetic data: The Moving Hyperplane data set is generally used to simulate time-changing concepts in data stream environments. The experiment uses the Hyperplane generator of MOA to generate Moving Hyperplane data sets and adopts the method proposed by [40] to convert continuous attributes into discrete attributes. A Moving Hyperplane in d-dimensional space can be represented as $\sum_{i=1}^{d} a_i x_i = a_0$. In addition, we use the same method as in [11] to simulate concept drift. When concept drift is generated, the parameter k represents the number of dimensions affected by the drift, t ∈ R represents the changing scope of the weights a_1, ..., a_k after generating N instances, and s_i ∈ {−1, +1} represents the changing direction of weight a_i (1 ≤ i ≤ k). We choose k = 4, t = 0.1, N = 10,000, and the dimension of the generated data set is 10. There are 400,000 instances in total.

(2) Real-world data: KDD CUP 99 has a strongly evolving nature; it was generated to measure and evaluate intrusion detection systems when the MIT Lincoln Laboratory worked on the DARPA intrusion detection evaluation in 1998. There are 42 attributes in each record of the data set: 32 continuous attributes, 9 discrete attributes and 1 attribute for the intrusion type.

The synthetic data above (Moving Hyperplane) and the real-world data (KDD CUP 99) can be regarded as data streams with concept drift, but they are certain data. In order to simulate uncertain data streams, we need to add uncertainty to the data sets.

First, uncertainty is added to discrete and continuous attributes by the method in [41]. $A^{u_c}_i$ represents an uncertain discrete attribute, where $Dom(A^{u_c}_i) = \{v_1, v_2, \ldots, v_n\}$ and the probability vector is $P_i = \{p_{i1}, \ldots, p_{in}\}$. u is used to represent the uncertainty rate; that is to say, if 20% uncertainty is added, u = 20%, which means the probability of the original value of $A^{u_c}_i$ is 0.8. At the same time, by the method in [42], we use the Gauss distribution to add uncertainty to continuous attributes. $A^{u_n}_i$ represents an uncertain continuous attribute, $A^{u_n}_i \sim N(\mu_i, \sigma_i^2)$, in which the mean $\mu_i$ is the initial value of $A^{u_n}_i$ in the data set, and the standard deviation is $\sigma_i = 0.25\,(x_i^{max} - x_i^{min}) \cdot u$, where $x_i^{max}$ and $x_i^{min}$ are the maximum and minimum of attribute i.

Second, occurrence (tuple) level uncertainty is added. Because there is still no precedent for modeling occurrence level uncertainty for PU learning, this paper uses a simple method to simulate it. The probability of each instance's class changing is p_inverse. If the class is unchanged, the occurrence probability is randomly generated between 0 and 1; otherwise, the occurrence probability is randomly generated between 0 and b, where b is called the reversal occurrence probability, b ∈ (0, 1]. In the experiments, p_inverse = 15% and b = 40%.
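The occurrence-level simulation described above can be sketched as follows; the function name and the ±1 label encoding are assumptions:

```python
import random

def add_occurrence_uncertainty(label, p_inverse=0.15, b=0.4, rng=random):
    """With probability p_inverse the class label (encoded as +1/-1,
    an assumption) is flipped and the tuple gets a low occurrence
    probability drawn from (0, b]; otherwise the label is kept and the
    occurrence probability is drawn from (0, 1]."""
    if rng.random() < p_inverse:
        return -label, rng.uniform(0.0, b)  # reversed tuple, low probability
    return label, rng.uniform(0.0, 1.0)     # unchanged tuple
```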

After simulating attribute and occurrence uncertainty, an uncertain data stream with concept drift is obtained. The tuples in these data sets are all labeled. To simulate the situation that contains only positive and unlabeled examples, this paper regards class 1 in the synthetic data set and the Normal class in the real-world data set as the target class. Tuples in these classes are called positive examples; all others are negative examples. In particular, when we simulate the burst concept drift in KDD CUP 99, the target class is reversed as soon as 10 data blocks are classified.


Table 2
Default parameter setting.

Parameter   Meaning                                       Default value
N           The size of the data sets                     400,000
N_Si        The size of the data blocks                   2500
N_EC        The number of base classifiers in EC          5
N_L         The number of hidden layer nodes              100
P_exist     The threshold of occurrence probability       50%
x           The compression cluster ratio                 50%
k           The number of clusters                        5
ξ1          The lower bound of the similarity threshold   30%
ξ2          The upper bound of the similarity threshold   70%
u           The uncertainty ratio                         20%
b           The reversal occurrence probability           40%

Fig. 4. The selection of cluster number.


When the data set consisting of positive and unlabeled examples is constructed, this paper adopts a method widely used in PU learning: randomly selecting a part of all positive examples as labeled positive examples. At this point, the synthetic and real-world data sets are generated.

4.2. Experiment methods and results

In this paper, F1 is used to evaluate the performance of our algorithms, as shown in (22). Accuracy, the traditional evaluation metric, is also used in our experiments:

accuracy = \frac{\sum_{x \in TP} p_x + \sum_{x \in TN} p_x}{\sum_{x \in TP} p_x + \sum_{x \in FP} p_x + \sum_{x \in TN} p_x + \sum_{x \in FN} p_x}    (23)

Table 2 lists the default values of the parameters not otherwise mentioned in the experiments.

4.2.1. The selection of parameters

The parameters k, ξ_1 and ξ_2 cannot be selected empirically, so we choose them experimentally. The goal is to give the classifier better classification ability, namely a higher F1 value.

(1) The number of clusters k

Fig. 4 shows how the F1 value changes with the cluster number. As the number of clusters grows, F1 first increases and then

Fig. 5. The selection of lower bound of similarity.

stabilizes. This paper chooses k = 8 as the number of clusters.

(2) The lower bound of the similarity threshold

Fig. 5 shows the running time and F1 as the lower bound of similarity changes, with the similarity upper bound fixed at ξ_2 = 1. From Fig. 5, we see that as the lower bound of the similarity threshold grows, the running time and F1 on the Moving Hyperplane data hardly change. The reason is that increasing the lower bound of similarity enlarges the processing scope of burst concept drift, while Moving Hyperplane does not contain burst concept drift. KDD CUP 99, in contrast, contains burst concept drift: if the lower bound of similarity is increased, the number of detected burst concept drifts increases and F1 becomes larger. Since the handling strategy for burst concept drift is to reconstruct the N base classifiers, it also consumes more time. This paper selects ξ_1 = 30%.

(3) The upper bound of the similarity threshold

Fig. 6 shows the running time and F1 as the upper bound of similarity changes, with the lower bound fixed at ξ_1 = 30%. From Fig. 6, the trends on Moving Hyperplane and KDD CUP 99 are similar: as the upper bound of similarity increases, the running time increases steadily, while F1 increases at first and then levels off. Fixing the lower bound of similarity means the handling scope of burst concept drift is fixed; increasing the upper bound of similarity enlarges the handling scope of gradual concept drift, so both running time and F1 increase. However, when the upper bound keeps increasing, retraining over a wider scope does not yield a more accurate model, so F1 tends to be stable. This paper selects ξ_2 = 70%, near which F1 on both data sets has stabilized.

4.2.2. Performance analysis

This experiment first studies how the algorithm deals with uncertain data streams, observing the influence of the uncertainty rate, the reverse existence rate and the unlabeled rate on algorithm performance, and then tests whether the algorithm can handle concept drift effectively.

(1) The uncertainty rate u

Fig. 7 shows that as the uncertainty rate increases, the accuracy decreases only slowly, indicating that the uncertainty rate has a small effect on accuracy. As the uncertainty rate keeps increasing, F1 changes little, indicating that our algorithm is robust.

(2) Reverse existence rate b

Fig. 8 shows that the accuracy and F1 decrease steadily as the reverse existence rate increases. Increasing the reverse existence rate enlarges the existence probability of reversed instances (which can be regarded as noise), so their influence on the classifiers grows and the accuracy of the classifiers decreases. Meanwhile, the precision and recall decrease, which leads to a lower F1 as the reverse existence rate increases.


Fig. 6. The selection of upper bound of similarity.

Fig. 7. The influence of uncertain rate.

Fig. 8. The influence of reverse existence rate.

Fig. 9. The influence of unlabeled rate.


This indicates that the algorithm builds different classification models according to different occurrence uncertainties.

(3) Unlabeled rate a

Fig. 9 shows that as the unlabeled rate increases, the accuracy and F1 both change little at first and then decrease. This is because the classifier is built from reliable positive, reliable negative and labeled positive examples together. When there are relatively many labeled examples, the clustering-based extraction of reliable positive and negative examples depends little on the labeled rate, so accuracy and F1 change only slightly. The reason accuracy and F1 eventually decrease is that we need to find the points inside the minimum covering hypersphere: reducing the labeled positive examples shrinks the minimum covering hypersphere of the positive examples. Too few labeled positive examples make the number of reliable positive examples drop quickly, and the accuracy declines. At the same time, due to the smaller range for extracting reliable positive examples, some positive examples are classified as negative ones, which leads the recall and F1 to decline.

(4) Detection of concept drift


Fig. 10. Detection of concept drift.


After confirming that the proposed algorithm can handle PU learning over data streams with both kinds of uncertainty, it is essential to check that it detects concept drift properly. This paper detects concept drift based on the similarity between cluster sets.

Fig. 10 shows the cluster-set similarity for the first 20 data blocks. The left panel shows the Moving Hyperplane data, where the similarity between each data block and the historical one fluctuates constantly. When the similarity of one data block falls below 0.7, the similarity of the next block improves greatly. This is because when the similarity is less than the upper bound ξ_2 = 70%, progressive concept drift is detected, the worst classifier is updated, and the similarity then recovers. The interval between concept drift blocks is close to 4, conforming to the characteristics of Moving Hyperplane. The right panel shows KDD CUP 99, where progressive concept drift is detected in the 14th block too. Different from Moving Hyperplane, the similarity of KDD CUP 99 drops sharply in the 10th block and rises quickly when the next block arrives: burst concept drift occurs, all base classifiers are deleted, and the ensemble model is rebuilt to adapt to the evolving concept. This shows that our algorithm can detect and handle different concept drifts effectively.

5. Conclusions

In view of the data model with two kinds of uncertainty, this paper first proposes the static classification algorithm PUU for uncertain data that contains only positive and unlabeled examples. The algorithm handles uncertainty by reducing dimensions, and then adopts a two-stage strategy in which reliable positive and negative examples are extracted based on clustering. Finally, the classifier is trained with the Weighted ELM technique on a training set consisting of reliable positive examples, reliable negative examples and labeled positive examples. Second, the PUUS algorithm is proposed for PU learning over uncertain data streams, which is the first work to address a streaming data model with both uncertainties. PUUS adopts an ensemble model whose base classifiers are learned by PUU; meanwhile, the similarity between the current and historical data blocks is calculated to detect different concept drifts. Finally, experiments evaluate the performance of the proposed algorithms. The results on synthetic and real-world data sets show that PUUS can deal with the PU learning problem with two types of uncertainty at the same time, and that it can detect and handle concept drift effectively.

Acknowledgment

This work is supported by the National Natural Science Foundation of China (Nos. 61173029, 61672144, 61272182, 61332006). The authors would like to thank the anonymous reviewers and editors for their valuable comments.

References

[1] McKinsey B.D. The next frontier for innovation, competition, and productivity.McKinsey Global Institute Report, 2011.

[2] A. Deshpande , C. Guestrin , S. Madden , et al. , Model-driven data acquisition insensor networks, in: Proceeding of the 30th International Conference on Very

Large Data Bases, Toronto, Canada, 2004, pp. 588–599 .

[3] S.R. Jeffery , M.N. Garofalakis , M.J. Frwanklin , Adaptive cleaning for RFID datastreams, in: Proceeding of the 32nd International Conference on Very Large

Data Bases, Seoul, Korea, 2006, pp. 163–174 . [4] Y. Gu , G. Yu , T.C. Zhang , RFID complex event processing techniques, J. Front.

Comput. Sci. Technol. 1 (3) (2007) 255–267 . [5] A. Zhou , C. Jin , G. Wang , J. Li , A survey on the management of uncertain data,

Chin. J. Comput. 32 (1) (2009) 1–16 .

[6] P. Domingos , G. Hulten , Mining high-speed data streams, in: Proceedings ofthe 6th ACM SIGKDD International Conference on Knowledge Discovery and

Data Mining, 20 0 0, pp. 71–80 . [7] G. Hulten , L. Spencer , P. Domingos , Mining time-changing data streams, in:

Proceedings of the 7th ACM SIGKDD International Conference on KnowledgeDiscovery and Data Mining, 2001, pp. 97–106 .

[8] J. Gama , P. Medas , R. Rocha , Forest trees for on-line data, in: Proceedings of

ACM Symposium on Applied Computing, 2004, pp. 632–636 . [9] W. Fan , H. Wang , P.S. Yu , et al. , Is random model better? on its accuracy and

efficiency, in: Proceedings of the 3rd IEEE International Conference on DataMining, 2003, pp. 51–58 .

[10] W.N. Street , Y. Kim , A streaming ensemble algorithm (SEA) for large-scale clas-sification, in: Proceedings of the 7th ACM SIGKDD International Conference on

Knowledge Discovery and Data Mining, 2001, pp. 377–382 . [11] H. Wang , W. Fan , P.S. Yu , et al. , Mining concept-drifting data streams using en-

semble classifiers, in: Proceedings of the 9th ACM SIGKDD International Con-

ference on Knowledge Discovery and Data Mining, 2003, pp. 226–235 . [12] H.L. Nguyen , Y.K. Woon , W.K. Ng , et al. , Heterogeneous ensemble for feature

drifts in data streams, Advances in Knowledge Discovery and Data Mining,Springer, Berlin, Heidelberg, 2012, pp. 1–12 .

[13] D. Brzezinski , J. Stefanowski , Combining block-based and online methods inlearning ensembles from concept drifting data streams, Inf. Sci.: Int. J. 265 (5)

(2014) 50–67 .

[14] S.G. Soares , A. Rui , An on-line weighted ensemble of regressor models to han-dle concept drifts, Eng. Appl. Artif. Intell. 37 (2015) 392–406 .

[15] Y. Sun , K. Tang , L. Minku , et al. , Online ensemble learning of data streamswith gradually evolved classes, IEEE Trans. Knowl. Data Eng. 28 (6) (2016)

1532–1545 . [16] C. Liang , Y. Zhang , Q. Song , et al. , Decision tree for dynamic and uncertain data

streams, in: Proceedings of the 2nd Asian Conference on Machine Learning,

2010, pp. 209–224 . [17] S. Pan , K. Wu , Y. Zhang , et al. , Classifier ensemble for uncertain data stream

classification, in: Proceedings of the Pacific-Asia Conference, Advances inKnowledge Discovery & Data Mining, PAKDD, Hyderabad, India, June 2010,

2010, pp. 4 88–4 95 . [18] W. Xu , Z. Qin , Y. Chang , A framework for classifying uncertain and evolving

data streams, Inf. Technol. J. 10 (10) (2011) 1926–1933 .

[19] B. Liu , Y. Xiao , P.S. Yu , et al. , Uncertain one-class learning and concept summa-rization learning on uncertain data streams, IEEE Trans. Knowl. Data Eng. 26

(2) (2014) 46 8–4 84 . 20] K. Cao , G. Wang , D. Han , et al. , Classification of uncertain data streams based

on extreme learning machine, Cognit. Comput. 7 (1) (2015) 150–160 . [21] D. Han , C. Giraud-Carrier , S. Li , Efficient mining of high-speed uncertain data

streams, Appl. Intell. 43 (4) (2015) 773–785 .

[22] F. Denis, PAC learning from positive statistical queries, in: Proceedings of the 9th International Conference on Algorithmic Learning Theory, ALT'98, 1998, pp. 112–126.
[23] T. Le, D. Tran, P. Nguyen, et al., Multiple distribution data description learning method for novelty detection, in: Proceedings of the 2011 International Joint Conference on Neural Networks, IJCNN, IEEE, 2011, pp. 2321–2326.
[24] J. Li, L. Su, C. Cheng, Finding pre-images via evolution strategies, Appl. Soft Comput. 11 (6) (2011) 4183–4194.
[25] F. Bovolo, G. Camps-Valls, L. Bruzzone, A support vector domain method for change detection in multitemporal images, Pattern Recognit. Lett. 31 (10) (2010) 1148–1154.

[26] M. Koppel, J. Schler, Authorship verification as a one-class classification problem, in: Proceedings of the 21st International Conference on Machine Learning (ICML), 2004, pp. 52–62.
[27] B. Zhang, A Study on Learning from Positive and Unlabeled Examples, Jilin University, 2009, pp. 31–32.
[28] B. Liu, Y. Dai, X. Li, et al., Building text classifiers using positive and unlabeled examples, in: Proceedings of the IEEE International Conference on Data Mining, IEEE Computer Society, 2003, p. 179.
[29] G.P.C. Fung, J.X. Yu, H. Lu, et al., Text classification without negative examples revisit, IEEE Trans. Knowl. Data Eng. 18 (1) (2006) 6–20.
[30] F. Letouzey, F. Denis, R. Gilleron, Learning from positive and unlabeled examples, Algorithmic Learning Theory, Springer, Berlin, Heidelberg, 2000, pp. 70–83.
[31] D. Zhang, W.S. Lee, A simple probabilistic approach to learning from positive and unlabeled examples, in: Proceedings of the 5th Annual UK Workshop on Computational Intelligence (UKCI), 2005, pp. 83–87.
[32] D. Zhang, W.S. Lee, Learning classifiers without negative examples: a reduction approach, in: Proceedings of the International Conference on Digital Information Management, IEEE, 2008, pp. 638–643.
[33] J. He, Y. Zhang, X. Li, et al., Learning naive Bayes classifiers from positive and unlabeled examples with uncertainty, Int. J. Syst. Sci. 43 (10) (2012) 1805–1825.
[34] X. Zhang, Y. Zhang, M. Liu, et al., DTU-PU: decision tree for uncertain data of PU learning, Comput. Eng. Appl. 49 (9) (2013) 127–133.
[35] C. Liang, Y. Zhang, P. Shi, et al., Learning very fast decision tree from uncertain data streams with positive and unlabeled samples, Inf. Sci. 213 (23) (2012) 50–67.
[36] B. Liu, Y. Xiao, P.S. Yu, et al., Uncertain one-class learning and concept summarization learning on uncertain data streams, IEEE Trans. Knowl. Data Eng. 26 (2) (2014) 468–484.
[37] M. Leshno, V.Y. Lin, A. Pinkus, et al., Multilayer feedforward networks with a nonpolynomial activation function can approximate any function, Neural Networks 6 (6) (1993) 861–867.
[38] G.B. Huang, Q.Y. Zhu, C.K. Siew, Extreme learning machine: theory and applications, Neurocomputing 70 (1–3) (2006) 489–501.
[39] W. Zong, G.B. Huang, Y. Chen, Weighted extreme learning machine for imbalance learning, Neurocomputing 101 (3) (2013) 229–242.
[40] U.M. Fayyad, K.B. Irani, On the handling of continuous-valued attributes in decision tree generation, Mach. Learn. 8 (1) (1992) 87–102.
[41] B. Qin, Y. Xia, S. Prabhakar, et al., A rule-based classification algorithm for uncertain data, in: Proceedings of the International Conference on Data Engineering, ICDE'09, 2009, pp. 1633–1640.
[42] J. Ren, S.D. Lee, X. Chen, et al., Naïve Bayes classification of uncertain data, in: Proceedings of the International Conference on Data Mining, ICDM'09, 2009, pp. 944–949.

Donghong Han was born in 1966. She received her M.S. and Ph.D. degrees from Northeastern University in 2002 and 2007, respectively, both in Computer Science. She is an Associate Professor in the School of Computer Science and Engineering, Northeastern University. Her research interests include data stream management, data mining and social network analysis.

Shuoru Li was born in 1992. He received his B.S. and M.S. degrees in Computer Science from Northeastern University in 2013 and 2015, respectively. His research interests include uncertain data stream management and data mining.

Fulin Wei was born in 1992. He received his B.S. degree in Computer Science and Technology from Shandong University of Science and Technology in 2015. He is currently an M.S. candidate in Computer Application Technology at Northeastern University. His research interest is data mining.

Yuying Tang was born in 1988. She received her B.S. degree in Computer Science and Technology from Northeastern University in 2012. She is currently an M.S. candidate in Computer Application Technology at Northeastern University. Her research interest is Big Data.

Feida Zhu is an Assistant Professor in the School of Information Systems at Singapore Management University (SMU). He holds a Ph.D. degree in Computer Science from the University of Illinois at Urbana-Champaign and obtained his B.S. degree in Computer Science from Fudan University. His current research interests include large-scale constraint-based sequential and graph pattern mining and information/social network analysis, with applications in the web, management information systems and business intelligence.

Guoren Wang was born in 1966. He received his B.Sc., M.Sc. and Ph.D. degrees from Northeastern University in 1988, 1991 and 1996, respectively, all in Computer Science. He is currently a Full Professor and Doctoral Supervisor in the School of Computer Science and Engineering, Northeastern University. His research interests include XML data management, data streaming analysis, high-dimensional indexing and P2P data management.