Using HDDT to Avoid Instance Propagation in Unbalanced and Evolving Data Streams

IJCNN 2014

Andrea Dal Pozzolo, Reid Johnson, Olivier Caelen, Serge Waterschoot, Nitesh V. Chawla and Gianluca Bontempi

07/07/2014

Outline: Introduction, HDDT, Unbalanced and drifting data streams, Experimental results, Conclusion



INTRODUCTION

- In many applications (e.g. fraud detection), we have a stream of observations that we want to classify in real time.

- When one class is rare, the classification problem is said to be unbalanced.

- In this paper we focus on unbalanced data streams with two classes, positive (minority) and negative (majority).

- In unbalanced data streams, state-of-the-art techniques use instance propagation and standard decision trees (e.g. C4.5 [13]).

- However, it is not always possible to revisit or store old instances of a data stream.


INTRODUCTION II

- We use the Hellinger Distance Decision Tree [7] (HDDT) to deal with skewed distributions in data streams.

- We consider data streams where the instances are received over time in batches.

- HDDT allows us to remove instance propagation between batches, with several benefits:
  - improved predictive accuracy
  - speed
  - a single pass through the data.

- An ensemble of HDDTs is used to combat concept drift and increase the accuracy of single classifiers.

- We test our framework on several streaming datasets with unbalanced classes and concept drift.


HELLINGER DISTANCE DECISION TREES

- HDDT [6] uses the Hellinger distance as a splitting criterion in decision trees for unbalanced problems.

- All numerical features are partitioned into p bins, so that the tree works only with categorical variables.

- The Hellinger distance between the positive and negative class is computed for each feature, and the feature with the maximum distance is used to split the tree.

d_H(f_+, f_-) = \sqrt{ \sum_{j=1}^{p} \left( \sqrt{ \frac{|f_{+j}|}{|f_+|} } - \sqrt{ \frac{|f_{-j}|}{|f_-|} } \right)^2 }   (1)

where f_+ denotes the positive instances and f_- the negative instances of feature f, and f_{+j} those falling in bin j. Note that the class priors do not appear explicitly in equation 1.
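As an illustration, the splitting criterion of equation 1 can be sketched in Python. This is a minimal sketch: the function name, the equal-width binning, and the 0/1 label encoding are assumptions, not part of the original HDDT implementation.

```python
import numpy as np

def hellinger_split_value(feature, labels, n_bins=10):
    """Hellinger distance (equation 1) between the positive- and
    negative-class distributions of one numerical feature,
    discretized into n_bins bins."""
    # Partition the feature's range into p = n_bins bins.
    edges = np.histogram_bin_edges(feature, bins=n_bins)
    pos = feature[labels == 1]
    neg = feature[labels == 0]
    pos_counts, _ = np.histogram(pos, bins=edges)
    neg_counts, _ = np.histogram(neg, bins=edges)
    # |f_{+j}| / |f_+| and |f_{-j}| / |f_-| for each bin j.
    pos_frac = pos_counts / max(len(pos), 1)
    neg_frac = neg_counts / max(len(neg), 1)
    return float(np.sqrt(np.sum((np.sqrt(pos_frac) - np.sqrt(neg_frac)) ** 2)))
```

A tree node would compute this value for every feature and split on the feature with the maximum distance.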


CONCEPT DRIFT

- The concept to learn can change due to non-stationary distributions.

- In the presence of concept drift, we can no longer assume that the training and testing batches come from the same distribution.

- Concept drift [11] can occur due to a change in:
  - P(c), the class priors.
  - P(x|c), the distribution of the classes.
  - P(c|x), the posterior distributions of class membership.


CONCEPT DRIFT II

In general we don't know where the changes in distributions come from, but we can measure the distance between two batches to estimate the similarity of their distributions.

- If the batch used for training a model is not similar to the testing batch, we can assume that the model is obsolete.

- In our work we use batch distances to assign low weights to obsolete models.


HELLINGER DISTANCE BETWEEN BATCHES

Let B^t be the batch at time t used for training and B^{t+1} the subsequent testing batch. We compute the distance between B^t and B^{t+1} for a given feature f using the Hellinger distance, as proposed by Lichtenwalter and Chawla [12]:

HD(B^t, B^{t+1}, f) = \sqrt{ \sum_{v \in f} \left( \sqrt{ \frac{|B^t_{f=v}|}{|B^t|} } - \sqrt{ \frac{|B^{t+1}_{f=v}|}{|B^{t+1}|} } \right)^2 }   (2)

where |B^t_{f=v}| is the number of instances in the batch at time t for which feature f takes value v, and |B^t| is the total number of instances in the same batch.


INFORMATION GAIN

- Equation 2 does not account for differences in feature relevance.

- Lichtenwalter and Chawla [12] suggest using the information gain to weight the distances.

For a given feature f of a batch B, the information gain (IG) is defined as the decrease in entropy E of the class c:

IG(B, f) = E(B_c) - E(B_c | B_f)   (3)

where B_c denotes the class of the observations in batch B and B_f the observations of feature f.


HELLINGER DISTANCE AND INFORMATION GAIN

We can make the assumption that feature relevance remains stable over time and define a new distance function that combines IG and HD:

HDIG(B^t, B^{t+1}, f) = HD(B^t, B^{t+1}, f) * (1 + IG(B^t, f))   (4)

HDIG(B^t, B^{t+1}, f) provides a relevance-weighted distance for each single feature. The final distance between two batches is then computed by averaging over all the features:

D_HDIG(B^t, B^{t+1}) = \frac{ \sum_{f \in B^t} HDIG(B^t, B^{t+1}, f) }{ |f \in B^t| }   (5)


MODEL WEIGHTS

In evolving data streams it is important to retain models learnt in the past and to learn new models as soon as new data becomes available.

- We learn a new model as soon as a new batch is available and store all the models.

- The learnt models are then combined into an ensemble in which the weights are inversely proportional to the batches' distances.

- The lower the distance between two batches, the more similar the concept between them.

Lichtenwalter and Chawla [12] suggest using the following transformation:

weights_t = 1 / D_HDIG(B^t, B^{t+1})^b   (6)

where b represents the ensemble size.
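The weighting of equation 6 is a simple transformation; a sketch follows. The eps guard against a zero distance is an addition of ours, not in the original.

```python
def model_weight(distance, b, eps=1e-12):
    """Ensemble weight of a stored model (equation 6): inversely
    proportional to the b-th power of the D_HDIG distance between
    its training batch and the current batch."""
    return 1.0 / (max(distance, eps) ** b)
```

A model trained on a batch closer to the current one therefore receives a larger weight.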


EXPERIMENTAL SETUP

- We compare HDDT with the C4.5 [13] decision tree, which is the standard for unbalanced data streams [10, 9, 12, 8].

- We considered instance propagation methods that assume no sub-concepts within the minority class:
  - SE (Gao's [8] propagation of rare-class instances and undersampling at 40%)
  - BD (Lichtenwalter's Boundary Definition [12]: propagating rare-class instances and the negative-class instances that the current model misclassifies)
  - UNDER (undersampling: no propagation between batches, undersampling at 40%)
  - BL (baseline: no propagation, no sampling)

- For each of the previous sampling strategies we tested:
  - HDIG: D_HDIG weighted ensemble.
  - No ensemble: single classifier.


DATASETS

We used different types of datasets.

- UCI datasets [1] to first study the unbalanced problem without worrying about concept drift.

- Artificial datasets from the MOA [2] framework with drifting features to test the behavior of the algorithms under concept drift.

- A credit card dataset (online payment transactions between the 5th of September and the 25th of September 2013).

Name         Source  Instances  Features  Imbalance Ratio
Adult        UCI        48,842        14    3.2:1
can          UCI       443,872         9   52.1:1
compustat    UCI        13,657        20   27.5:1
covtype      UCI        38,500        10   13.0:1
football     UCI         4,288        13    1.7:1
ozone-8h     UCI         2,534        72   14.8:1
wrds         UCI        99,200        41    1.0:1
text         UCI        11,162    11,465   14.7:1
DriftedLED   MOA     1,000,000        25    9.0:1
DriftedRBF   MOA     1,000,000        11    1.02:1
DriftedWave  MOA     1,000,000        41    2.02:1
Creditcard   FRAUD   3,143,423        36  658.8:1


RESULTS UCI DATASETS

HDDT is better than C4.5 and ensembles have higher accuracy.

Figure: Batch average results in terms of AUROC (higher is better) using different sampling strategies and batch-ensemble weighting methods with C4.5 and HDDT over all UCI datasets.

Figure: Batch average results in terms of computational TIME (lower is better) using different sampling strategies and batch-ensemble weighting methods with C4.5 and HDDT over all UCI datasets.


RESULTS MOA DATASETS

HDIG-based ensembles return better results than single classifiers, and HDDT again gives better accuracy than C4.5.

Figure: Batch average results in terms of AUROC (higher is better) using different sampling strategies and batch-ensemble weighting methods with C4.5 and HDDT over all drifting MOA datasets.


RESULTS CREDIT CARD DATASET

Figure 5 shows the sum of the ranks for each strategy over all the chunks. For each chunk, we assign the highest rank to the most accurate strategy and then sum the ranks over all chunks.

Figure: Batch average results in terms of AUROC (higher is better) using different sampling strategies and batch-ensemble weighting methods with C4.5 and HDDT over the credit card dataset.

Figure: Sum of ranks over all chunks for the credit card dataset in terms of AUROC. In gray are the strategies that are not significantly worse than the best, i.e. the one with the highest sum of ranks.


CONCLUSION

- In unbalanced data streams, state-of-the-art techniques use instance propagation/sampling to deal with skewed batches.

- HDDT without sampling typically leads to better results than C4.5 with sampling.

- Removing the propagation/sampling step from the learning process has several benefits:
  - It allows a single-pass approach (avoiding several passes through the batches for instance propagation).
  - It reduces the computational cost (with massive amounts of data it may no longer be possible to store/retrieve old instances).
  - It avoids the problem of finding previous minority instances from the same concept.


CONCLUSION II

- The HDIG ensemble weighting strategy:
  - performs better than single classifiers.
  - takes drifts in the data stream into account.

- On the credit card dataset, HDDT performs very well when combined with BL (no sampling) and UNDER (undersampling).

- These sampling strategies are the fastest, since no observations are stored from previous chunks.

- Future work will investigate propagation methods such as REA [4], SERA [3] and HUWRS.IP [9], which are able to deal with sub-concepts in the minority class.

THANK YOU FOR YOUR ATTENTION


BIBLIOGRAPHY I

[1] D. N. A. Asuncion. UCI machine learning repository, 2007.

[2] A. Bifet, G. Holmes, R. Kirkby, and B. Pfahringer. MOA: Massive online analysis. The Journal of Machine Learning Research, 99:1601-1604, 2010.

[3] S. Chen and H. He. SERA: Selectively recursive approach towards nonstationary imbalanced stream data mining. In Neural Networks, 2009. IJCNN 2009. International Joint Conference on, pages 522-529. IEEE, 2009.

[4] S. Chen and H. He. Towards incremental learning of nonstationary imbalanced data stream: A multiple selectively recursive approach. Evolving Systems, 2(1):35-50, 2011.

[5] D. A. Cieslak and N. V. Chawla. Detecting fractures in classifier performance. In Data Mining, 2007. ICDM 2007. Seventh IEEE International Conference on, pages 123-132. IEEE, 2007.

[6] D. A. Cieslak and N. V. Chawla. Learning decision trees for unbalanced data. In Machine Learning and Knowledge Discovery in Databases, pages 241-256. Springer, 2008.


BIBLIOGRAPHY II

[7] D. A. Cieslak, T. R. Hoens, N. V. Chawla, and W. P. Kegelmeyer. Hellinger distance decision trees are robust and skew-insensitive. Data Mining and Knowledge Discovery, 24(1):136-158, 2012.

[8] J. Gao, B. Ding, W. Fan, J. Han, and P. S. Yu. Classifying data streams with skewed class distributions and concept drifts. Internet Computing, 12(6):37-49, 2008.

[9] T. R. Hoens and N. V. Chawla. Learning in non-stationary environments with class imbalance. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 168-176. ACM, 2012.

[10] T. R. Hoens, N. V. Chawla, and R. Polikar. Heuristic updatable weighted random subspaces for non-stationary environments. In Data Mining (ICDM), 2011 IEEE 11th International Conference on, pages 241-250. IEEE, 2011.

[11] M. G. Kelly, D. J. Hand, and N. M. Adams. The impact of changing populations on classifier performance. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 367-371. ACM, 1999.

[12] R. N. Lichtenwalter and N. V. Chawla. Adaptive methods for classification in arbitrarily imbalanced and drifting data streams. In New Frontiers in Applied Data Mining, pages 53-75. Springer, 2010.


BIBLIOGRAPHY III

[13] J. R. Quinlan. C4.5: Programs for Machine Learning, volume 1. Morgan Kaufmann, 1993.

[14] C. R. Rao. A review of canonical coordinates and an alternative to correspondence analysis using Hellinger distance. Qüestiió: Quaderns d'Estadística, Sistemes, Informàtica i Investigació Operativa, 19(1):23-63, 1995.


HELLINGER DISTANCE

- Quantifies the similarity between two probability distributions [14].

- Recently proposed as a splitting criterion in decision trees for unbalanced problems [6, 7].

- In data streams, it has been used to detect classifier performance degradation due to concept drift [5, 12].


HELLINGER DISTANCE II

Consider two probability measures P_1, P_2 of discrete distributions over Φ. The Hellinger distance is defined as:

d_H(P_1, P_2) = \sqrt{ \sum_{\phi \in \Phi} \left( \sqrt{P_1(\phi)} - \sqrt{P_2(\phi)} \right)^2 }   (7)

The Hellinger distance has several properties:

- d_H(P_1, P_2) = d_H(P_2, P_1) (symmetric)
- d_H(P_1, P_2) >= 0 (non-negative)
- d_H(P_1, P_2) ∈ [0, √2]

d_H is close to zero when the distributions are similar and close to √2 for distinct distributions.
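Equation 7 translates directly into code; a minimal sketch follows, with distributions represented as dicts mapping outcomes to probabilities (the representation is an assumption):

```python
import math

def hellinger(p1, p2):
    """Hellinger distance (equation 7) between two discrete probability
    distributions given as dicts value -> probability."""
    support = set(p1) | set(p2)
    return math.sqrt(sum(
        (math.sqrt(p1.get(s, 0.0)) - math.sqrt(p2.get(s, 0.0))) ** 2
        for s in support))
```

Identical distributions give 0, disjoint supports give √2, and swapping the arguments leaves the result unchanged, matching the properties listed above.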


CONCEPT DRIFT II

Let X_t = {x_0, x_1, ..., x_t} be the set of labeled observations available at time t. For a new unlabelled instance x_{t+1} we can train a classifier on X_t and predict P(c_i | x_{t+1}). Using Bayes' theorem we can write P(c_i | x_{t+1}) as:

P(c_i | x_{t+1}) = P(c_i) P(x_{t+1} | c_i) / P(x_{t+1})   (8)

Since P(x_{t+1}) is the same for all classes c_i, we can drop it and rank the classes by the unnormalized posterior:

P(c_i | x_{t+1}) ∝ P(c_i) P(x_{t+1} | c_i)   (9)

Kelly [11] argues that concept drift can occur from a change in any of the terms in equation 9, namely:

- P(c_i), the class priors.
- P(x_{t+1} | c_i), the distribution of the classes.
- P(c_i | x_{t+1}), the posterior distributions of class membership.


CONCEPT DRIFT III

The change in posterior distributions is the worst type of drift, because it directly affects the performance of a classifier [9].

In general we don't know where the changes in distributions come from, but we can measure the distances between two batches to estimate the similarities of the distributions.