

[IEEE 2007 IEEE/ACS International Conference on Computer Systems and Applications - Amman, Jordan (2007.05.13-2007.05.16)]

Mining Streaming Emerging Patterns from Streaming Data

Hamad Alhammady Etisalat University College - UAE

[email protected]

Abstract

Mining streaming data is an essential task in many applications such as network intrusion detection, marketing, manufacturing, and others. The main challenge in the streaming data model is its unbounded size, which makes it difficult to run traditional mining techniques on this model. In this paper, we propose a new approach for mining emerging patterns (EPs) in data streams. Our method is based on mining EPs in a selective manner. EPs are itemsets whose frequencies in one class are significantly higher than their frequencies in the other classes. Our experimental evaluation shows that our approach is capable of gaining important knowledge from data streams.

1. Introduction

The unbounded size of data streams is the main obstacle to processing this type of data [1] [2] [3], as it is infeasible to store the entire data on disk. This obstacle causes two problems. Firstly, multi-pass algorithms, which need the entire data to be stored in conventional relations, cannot deal directly with data streams. Secondly, obtaining exact answers from data streams is too expensive [4].

EPs are a kind of pattern introduced recently [5]. They have been shown to have a great impact in many applications [6] [7] [8] [9] [10]. EPs capture significant changes between datasets: they are defined as itemsets whose supports increase significantly from one class to another. The discriminating power of an EP can be measured by its growth rate, the ratio of its support in one class over its support in another class. Usually the discriminating power of an EP is proportional to its growth rate.

For example, the Mushroom dataset, from the UCI Machine Learning Repository [11], contains a large number of EPs between the poisonous and the edible mushroom classes. Table 1 shows two examples of these EPs. These two EPs consist of 3 items each. e1 is an EP from the poisonous mushroom class to the edible mushroom class. It never occurs in the poisonous mushroom class, and occurs in 63.9% of the instances in the edible mushroom class; hence, its growth rate is ∞ (63.9 / 0). It has a very high predictive power to contrast edible mushrooms against poisonous mushrooms. On the other hand, e2 is an EP from the edible mushroom class to the poisonous mushroom class. It occurs in 3.8% of the instances in the edible mushroom class, and in 81.4% of the instances in the poisonous mushroom class; hence, its growth rate is 21.4 (81.4 / 3.8). It has a high predictive power to contrast poisonous mushrooms against edible mushrooms.
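The growth-rate arithmetic in this example can be checked in a few lines. This is an illustrative sketch, not code from the paper; the `growth_rate` helper simply applies the case analysis described above to the support values of Table 1:

```python
def growth_rate(support_from, support_to):
    """Growth rate of an itemset from one class to another.
    Infinite when the itemset never occurs in the source class but
    does occur in the target class."""
    if support_from == 0 and support_to == 0:
        return 0.0
    if support_from == 0:
        return float("inf")
    return support_to / support_from

# e1: support 0% in poisonous mushrooms, 63.9% in edible mushrooms
print(growth_rate(0.0, 0.639))              # inf

# e2: support 3.8% in edible mushrooms, 81.4% in poisonous mushrooms
print(round(growth_rate(0.038, 0.814), 1))  # 21.4
```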

Table 1. Examples of emerging patterns.

Multi-pass algorithms for mining EPs (e.g. [12] and [13]) are not capable of running on streaming data. Work in [4] introduces a new type of EPs, approximate EPs (AEPs), which enables current mining techniques to operate on streaming data. AEPs and the AEP-tree method have shown good accuracy in classifying streaming data. In this paper, we propose another new type of EPs, streaming EPs (SEPs). SEPs have two advantages over AEPs, in mining complexity and in classification accuracy (details are discussed later).

2. Related Work

The main difference between the data stream model and the conventional stored relation model is that data streams are unbounded in size. Most of the instances from a data stream have to be discarded after being processed. However, a certain number of instances can be stored for future analysis. This number is proportional to the available memory. That is, the data stream model does not preclude the presence of some data stored in conventional relations [1].

The idea of storing some instances from a data stream in conventional relations is fundamental to many techniques used in the data stream model. These techniques include sliding windows, sampling, and synopsis data structures. They are the basic features of any Data Stream Management System (DSMS) such as STREAM [14].

EP    Support in poisonous mushrooms    Support in edible mushrooms    Growth rate
e1    0%                                63.9%                          ∞
e2    81.4%                             3.8%                           21.4

e1 = {(ODOR = none), (GILL_SIZE = broad), (RING_NUMBER = one)}
e2 = {(BRUISES = no), (GILL_SPACING = close), (VEIL_COLOR = white)}

1-4244-1031-2/07/$25.00 ©2007 IEEE


Sliding windows [1] have a noticeable power to obtain approximate answers to data stream queries. This technique involves using a sliding window of recent data from the data stream rather than operating over the entire range of data. For example, if an unlabeled instance arrives from a data stream, and it needs to be classified to one of the classes associated with this data stream, then, only a certain number of recent instances (a window) will be used to train the classifier.

Figure 1. Sliding window technique

Figure 1 sketches the idea behind this technique. The sliding window technique has the advantage of being well-defined. In addition, it is a deterministic method, so it avoids the poor approximations that random sampling can produce. Most importantly, it accentuates recent data, which is considered the most interesting data in a large number of real-life applications [15]. However, the sliding window technique loses the important information contained in old (discarded) data. That is, sliding windows do not represent the whole range of knowledge contained in the data, but only a portion of it, proportional to the size of the window. This problem may affect the quality of approximation.
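The sliding window technique can be sketched with a bounded buffer. This is a minimal illustration, not the paper's implementation; the class name and the abstract `training_data` accessor are assumptions:

```python
from collections import deque

class SlidingWindow:
    """Keep only the most recent labeled instances for training.
    Instances older than the window fall off automatically."""

    def __init__(self, window_size):
        self.window = deque(maxlen=window_size)

    def observe(self, instance, label):
        self.window.append((instance, label))

    def training_data(self):
        # Only the recent (windowed) instances are available to a classifier.
        return list(self.window)

w = SlidingWindow(window_size=3)
for i in range(5):
    w.observe({"x": i}, label=i % 2)
print(len(w.training_data()))  # 3 -- instances 0 and 1 were discarded
```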

Sampling [15] is another technique for approximation in the data stream model. In this case, the streaming data is randomly sampled to a certain number of instances. This number is proportional to the available memory. In contrast with the sliding window technique, sampling has the advantage of representing the whole range of old data. The representation level is proportional to the sampling rate. On the other hand, sampling may suffer from problems caused by noisy instances being selected during the random sampling process.
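One classic way to keep a fixed-size uniform random sample of an unbounded stream is reservoir sampling. The paper does not name a specific sampling algorithm, so this sketch is only illustrative of the technique described above:

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Maintain a uniform random sample of k items from a stream of
    unknown length, using O(k) memory (classic reservoir sampling)."""
    rng = rng or random.Random(0)  # seeded for reproducibility
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(item)
        else:
            # Item n replaces a reservoir slot with probability k/n.
            j = rng.randrange(n)
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(10_000), k=5)
print(len(sample))  # 5
```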

Synopsis data structures [1] aim at summarizing the most important characteristics of the whole range of data. These important characteristics play a key role in classifying future unlabeled instances. Synopsis data structures, like the sampling technique, have the advantage of representing the whole range of old data. Moreover, these structures avoid the problem caused by noisy instances. The reason is that they store the important characteristics rather than the data itself. EPs can be thought of as synopsis data structures because they represent the discriminating characteristics of the data they are related to. Approximate emerging patterns (AEPs) adopt approximation to mine EPs from data streams [4].

Mining AEPs is based on mining EPs from blocks of streaming data and merging the resulting EP sets to obtain a fixed number of AEPs. These special EPs are described as approximate because they are not mined from the complete range of data. The AEP tree is a type of decision tree for classifying streaming data. It uses AEPs rather than data instances to make decisions on the classes of unlabeled data.

3. Emerging Patterns and Classification

Let obj = {a1, a2, a3, ..., an} be a data object following the schema {A1, A2, A3, ..., An}. A1, A2, A3, ..., An are called attributes, and a1, a2, a3, ..., an are the values of these attributes. We call each (attribute, value) pair an item.

Let I denote the set of all items in an encoding dataset D. Itemsets are subsets of I. We say an instance Y contains an itemset X, if X ⊆ Y.

Definition 1. Given a dataset D and an itemset X, the support of X in D, s_D(X), is defined as

    s_D(X) = count_D(X) / |D|    (1)

where count_D(X) is the number of instances in D containing X.

Definition 2. Given two different classes of datasets D1 and D2, let s_i(X) denote the support of the itemset X in the dataset D_i. The growth rate of an itemset X from D1 to D2, gr_{D1→D2}(X), is defined as

    gr_{D1→D2}(X) = 0,                  if s_1(X) = 0 and s_2(X) = 0
    gr_{D1→D2}(X) = ∞,                  if s_1(X) = 0 and s_2(X) ≠ 0
    gr_{D1→D2}(X) = s_2(X) / s_1(X),    otherwise    (2)

Definition 3. Given a growth rate threshold ρ > 1, an itemset X is said to be a ρ-emerging pattern (ρ-EP, or simply EP) from D1 to D2 if gr_{D1→D2}(X) ≥ ρ.
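Definitions 1-3 translate directly into code. In this sketch (an illustration, not the paper's implementation) instances and itemsets are modeled as frozensets of (attribute, value) items:

```python
def support(itemset, dataset):
    """Definition 1: fraction of instances in the dataset containing the itemset."""
    contains = sum(1 for instance in dataset if itemset <= instance)
    return contains / len(dataset)

def growth_rate(itemset, d1, d2):
    """Definition 2: growth rate of an itemset from dataset d1 to dataset d2."""
    s1, s2 = support(itemset, d1), support(itemset, d2)
    if s1 == 0 and s2 == 0:
        return 0.0
    if s1 == 0:
        return float("inf")
    return s2 / s1

def is_ep(itemset, d1, d2, rho):
    """Definition 3: X is a rho-EP from d1 to d2 if its growth rate >= rho."""
    return growth_rate(itemset, d1, d2) >= rho

# Toy two-class data (hypothetical values, not the Mushroom dataset):
d1 = [frozenset({("odor", "foul")}), frozenset({("odor", "foul"), ("ring", "one")})]
d2 = [frozenset({("odor", "none"), ("ring", "one")}), frozenset({("odor", "none")})]
x = frozenset({("odor", "none")})
print(is_ep(x, d1, d2, rho=2))  # True: support grows from 0 to 1
```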

Let C = {c1, ..., ck} be a set of class labels. A training dataset is a set of data objects such that, for each object obj, there exists a class label c_obj ∈ C associated with it. A classifier is a function from attributes {A1, A2, A3, ..., An} to class labels {c1, ..., ck} that assigns class labels to unseen examples.

[Figure 1: Past Data (Discarded) | Sliding Window over Recent Data | Future Data]



4. Mining Streaming Emerging Patterns

We adopt the streaming data model presented in [4], shown in Figure 2. Assume that the data stream consists of two classes, C1 and C2. Data is received in blocks of size N, where N is chosen according to the memory available in the system. Bt,j is the block of instances of class j (C1 or C2) at time t.

After receiving and processing a number of data blocks, we need to gain information to classify the future unlabeled instances in the data streams. This information can be expressed as EPs. However, mining EPs from a dataset requires the availability of all instances in this dataset. This is infeasible in data streams as data is arriving continuously.
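The block structure of this model can be sketched as follows. This is an illustrative helper (the function name and the per-class buffering are assumptions, not from the paper) that splits a labeled stream into per-class blocks of size N:

```python
def blocks(stream, n):
    """Split an (instance, class) stream into per-class blocks of size n,
    mirroring the B_{t,j} blocks of the data stream model."""
    buffers = {}  # class label -> pending instances of that class
    for instance, label in stream:
        buf = buffers.setdefault(label, [])
        buf.append(instance)
        if len(buf) == n:
            yield label, buf  # a full block for this class is ready
            buffers[label] = []

stream = [(i, i % 2) for i in range(10)]  # interleaved classes 0 and 1
for label, block in blocks(stream, n=2):
    print(label, block)
```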

Figure 2. Data stream model

Our method is based on mining the strongest EPs from the strongest instances in the streaming blocks of data. The set of EPs is updated according to the strength of both EPs and data instances. The strength of an EP e, strg(e), is defined as follows.

    strg(e) = ( gr(e) / (gr(e) + 1) ) * s(e)    (3)

The strength of an EP is proportional to both its growth rate (discriminating power) and its support. Notice that if an EP has a high growth rate but a low support, its strength might be low; likewise, if it has a low growth rate and a high support, its strength might also be low.
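The strength measure of equation (3) can be sketched directly; the factor gr/(gr+1) maps the growth rate into [0, 1), so neither a huge growth rate nor a large support can compensate alone. The handling of an infinite growth rate (taking the limit, so strength equals support) is an assumption in this sketch:

```python
def strength(gr, s):
    """Strength of an EP (equation 3): strg(e) = gr(e)/(gr(e)+1) * s(e)."""
    if gr == float("inf"):
        return s  # gr/(gr+1) -> 1 as gr -> infinity
    return gr / (gr + 1) * s

print(strength(100.0, 0.01))  # high growth rate, low support -> low strength
print(strength(1.5, 0.9))     # low growth rate, high support -> still modest
```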

The strength of an instance, I, is defined by a fitness function as follows.

    Fitness(I) = ( Σ_{i=1..n} s(a_i) ) / n    (4)

The fitness of a data instance is measured as the average support of the attribute values in this instance. Suppose that we have an instance i = {a1, a2, a3, ..., an}. We first find the supports of all the attribute values (from a1 to an), and then average these supports to obtain a measure of how good the instance is.
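Equation (4) is a one-line average. In this sketch, the `support_of` lookup (mapping an (attribute, value) item to its support in the instance's class) is an assumed helper, and the support values are made up for illustration:

```python
def fitness(instance, support_of):
    """Fitness of an instance (equation 4): the average support of its
    attribute values."""
    items = list(instance)
    return sum(support_of(item) for item in items) / len(items)

# Hypothetical per-item supports:
supports = {("odor", "none"): 0.6, ("ring", "one"): 0.8, ("bruises", "no"): 0.1}
inst = frozenset({("odor", "none"), ("ring", "one")})
print(fitness(inst, supports.get))  # averages 0.6 and 0.8 -> 0.7
```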

As data is streaming in blocks of size N, blocks of all classes related to period t (Bt,1 and Bt,2) are stored, processed and then discarded. EPs for both classes are mined from the strongest A% of data instances in blocks Bt,1 and Bt,2 before discarding them. These EPs are EPt,1 (for C1 ) and EPt,2 (for C2 ). The strongest B EPs (B is the predefined maximum required number of EPs) are moved to new sets of EPs called streaming EPs, SEPs. That is, SEP1 represents the SEPs of class C1 and SEP2 represents the SEPs of class C2 .

In the following stage, new data blocks arrive, Bt+1,1 and Bt+1,2. EPs are mined from these two new blocks according to some conditions. Suppose that the number of current SEPs related to a certain class is D. If D is less than B, then EPs are mined from a proportional percentage of the strongest instances in Bt+1,1 and Bt+1,2. This percentage is equal to (B-D)/B. That is, as the number of current SEPs, D, is smaller than B, EPs will be mined from a larger portion of the new data blocks. (B-D) strongest EPs are added to the SEPs set to fill it.

On the other hand, if the number of current SEPs, D, is equal to the predefined number of required EPs, B, then the set of SEPs is reduced by removing the weakest C% of EPs. The set is then refilled by mining EPs from the strongest C% of data instances in the new blocks and keeping the strongest C% of the resulting EPs. We call C the updating percentage, as it controls both the number of EPs to be mined and the number of instances used in mining.

The previous process of updating the EPs in the set of SEPs is repeated whenever new blocks of data arrive. This process ensures that we have the strongest EPs mined from the strongest data instances. Algorithm 1 explains the idea of mining SEPs.

SEPs are mined from selected portions of the blocks of data rather than the complete range of data. In spite of that, our approach guarantees that these SEPs are inherited from all the old discarded data. That is, for each class, we only have to store its limited set of SEPs rather than its growing number of data instances.

SEPs are motivated by the following points:

1. Only the strongest data instances are used to mine EPs; this prevents noisy EPs from being mined and added to the set of SEPs.

2. SEPs are updated continuously by removing the weakest ones and mining new strong EPs related to the new arriving data.

Algorithm 1. Mining SEPs from streaming data

SEP1 = Ø, SEP2 = Ø, t = 0
A = the percentage of the strongest instances to mine from the first blocks of data
B = the maximum required number of EPs
D = current number of SEPs = 0

[Figure 2: a data stream delivering blocks B1,1, B2,1, B3,1, B4,1 (Class 1) and B1,2, B2,2, B3,2, B4,2 (Class 2), each of size N]



C = the updating percentage
As data is streaming Do
    t = t + 1
    If mining for the first time
        EPt,1 = mined EPs from Bt,1
        SEP1 = SEP1 ∪ strongest EPs in EPt,1
        EPt,2 = mined EPs from Bt,2
        SEP2 = SEP2 ∪ strongest EPs in EPt,2
    Else If D < B
        Mine EPs from ((B-D)/B)% of instances in Bt,1 and Bt,2
        Fill SEP1 and SEP2 with the (B-D) strongest EPs
    Else If D = B
        Remove the weakest C% of EPs from SEP1 and SEP2
        Mine EPs from the strongest C% of instances in Bt,1 and Bt,2
        Fill SEP1 and SEP2 with the strongest C% of EPs
    End If
End Do
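The SEP update step of Algorithm 1 can be sketched for a single class. This is an illustrative simplification, not the paper's code: the actual EP mining is abstracted away, and an EP is modeled as a (pattern, strength) pair:

```python
def update_seps(seps, new_block_eps, b, c_pct):
    """One update step of Algorithm 1 for one class.
    seps, new_block_eps: lists of (pattern, strength) pairs;
    b: maximum number of SEPs; c_pct: updating percentage."""
    d = len(seps)
    if d < b:
        # Fill the SEP set with the (b - d) strongest newly mined EPs.
        take = b - d
    else:
        # Drop the weakest c% of current SEPs, then refill that many slots.
        drop = max(1, int(b * c_pct / 100))
        seps = sorted(seps, key=lambda ep: ep[1], reverse=True)[:b - drop]
        take = drop
    strongest_new = sorted(new_block_eps, key=lambda ep: ep[1], reverse=True)[:take]
    return sorted(seps + strongest_new, key=lambda ep: ep[1], reverse=True)

seps = [("p1", 0.9), ("p2", 0.2)]
mined = [("p3", 0.8), ("p4", 0.5), ("p5", 0.1)]
print(update_seps(seps, mined, b=4, c_pct=25))
```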

The above two points preserve the emphasis on recent data, which is the main advantage of the sliding window technique. At the same time, the SEPs remain related to all the previous data, which is the advantage of the sampling technique. Furthermore, SEPs avoid the cost of mining EPs from all blocks of streaming data, as the AEP method does, by applying a selective approach that chooses certain portions of the data blocks. This ensures that the mining process is conducted only on the necessary data. Our approach ensures that at each period of time we have limited sets of SEPs that best represent all the discarded data. These sets can be used at any time to classify unlabeled data instances using the AEP-tree proposed in [4].

5. Experimental Evaluation

In this section, we apply five techniques to the data streaming model described in section 4. These techniques are the AEP-tree using SEPs, the AEP-tree using AEPs, random sampling, the sliding window, and a traditional classifier (the C4.5 decision tree). Besides being applied alone, the C4.5 decision tree also serves as the base classifier for the random sampling and sliding window techniques.

The testing method is 10-fold-cross validation. This method is adapted to agree with the data stream model adopted in our experiments. The data is divided into ten folds. For each round of the 10-fold-cross validation, one fold is used for testing and the other nine folds are used for training. The training folds act as the blocks of data explained in the data stream model.
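The adapted cross-validation scheme can be sketched as follows. This is only one plausible reading of the setup (the fold-slicing and the replay order of the training folds are assumptions, not specified in the paper):

```python
def stream_folds(data, k=10):
    """Adapt k-fold cross-validation to the stream model: each round,
    one fold is the test set and the remaining k-1 folds are replayed
    in order as the arriving blocks of streaming data."""
    folds = [data[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train_blocks = [folds[j] for j in range(k) if j != i]
        yield train_blocks, test

rounds = list(stream_folds(list(range(20)), k=10))
print(len(rounds))        # 10 rounds
print(len(rounds[0][0]))  # 9 training blocks per round
```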

Table 2. Experimental results

Dataset      C4.5*   Sliding window   Sampling   AEP tree (AEPs)   AEP tree (SEPs)
Breast       94.6    74.1             70.4       83.4              82.9
Cleve        73.8    59.4             55.8       65.1              71.7
Diabetes     73.4    59.8             55.2       62.9              69.3
Heart        80.6    61.7             59.4       71.5              77.8
Labor        76.9    52.3             50.7       63.3              73.2
CC           85.3    70.2             67.6       77.6              82.3
Hayes-roth   70.2    60.3             55.2       66.5              79.2
Hepatitis    81.8    70.4             67.3       75.1              78.8
Horse        85.2    72.6             70.8       81.4              83.6
Segment      93.5    81.9             81.4       87.8              90.1
Average      81.53   66.27            63.38      73.46             78.89

* C4.5 with complete knowledge

Table 2 shows the performance of the previous techniques on 10 real-life datasets from the UCI repository [11]. The last row of the table gives the average accuracy of each technique. We draw the following findings from Table 2:

• The AEP-tree using SEPs has the highest average accuracy among the streaming techniques.

• It outperforms the sliding window and sampling techniques on all datasets.

• It outperforms the AEP-tree using AEPs on 9 datasets.

• The average accuracy of AEP-tree using SEPs is very close to the average accuracy of C4.5 using the complete data space.

These findings indicate that our proposed method for mining SEPs is capable of gaining accurate knowledge from streaming data.

6. Conclusions

In this paper, we introduce a new method for mining emerging patterns in data streams. These patterns are called streaming emerging patterns (SEPs). Our approach is based on mining EPs in a selective manner. That is, EPs are not mined from the complete range of data; instead, they are mined according to their strength as well as the strength of the streaming data instances. Our experiments show that SEPs are capable of gaining important information from streaming data, and that this information affects the accuracy of classification positively. Our future work will focus on applying our technique to mining other types of patterns which are currently infeasible in the data stream model.

References

[1] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and Issues in Data Stream Systems. In Proceedings of the 21st ACM Symposium on Principles of Database Systems (PODS'02), Madison, Wisconsin, USA.
[2] G. Dong, J. Han, L.V.S. Lakshmanan, J. Pei, H. Wang, and P.S. Yu. Online Mining of Changes from Data Streams: Research Problems and Preliminary Results. In Proceedings of the 2003 ACM SIGMOD Workshop on Management and Processing of Data Streams, San Diego, CA, USA.
[3] M. Garofalakis, J. Gehrke, and R. Rastogi. Querying and Mining Data Streams: You Only Get One Look. In Proceedings of the 28th International Conference on Very Large Data Bases (VLDB'02), Hong Kong, China.
[4] H. Alhammady, and K. Ramamohanarao. Mining Emerging Patterns and Classification in Data Streams. In Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence (WI'05), Compiegne, France, pp. 272-275.
[5] G. Dong, and J. Li. Efficient Mining of Emerging Patterns: Discovering Trends and Differences. In Proceedings of the 1999 International Conference on Knowledge Discovery and Data Mining (KDD'99), San Diego, CA, USA.
[6] H. Alhammady, and K. Ramamohanarao. The Application of Emerging Patterns for Improving the Quality of Rare-class Classification. In Proceedings of the 2004 Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD'04), Sydney, Australia.
[7] H. Alhammady, and K. Ramamohanarao. Using Emerging Patterns and Decision Trees in Rare-class Classification. In Proceedings of the 2004 IEEE International Conference on Data Mining (ICDM'04), Brighton, UK.
[8] H. Alhammady, and K. Ramamohanarao. Expanding the Training Data Space Using Emerging Patterns and Genetic Methods. In Proceedings of the 2005 SIAM International Conference on Data Mining (SDM'05), New Port Beach, CA, USA.
[9] H. Fan, and K. Ramamohanarao. A Bayesian Approach to Use Emerging Patterns for Classification. In Proceedings of the 14th Australasian Database Conference (ADC'03), Adelaide, Australia.
[10] G. Dong, X. Zhang, L. Wong, and J. Li. CAEP: Classification by Aggregating Emerging Patterns. In Proceedings of the 2nd International Conference on Discovery Science (DS'99), Tokyo, Japan.
[11] C. Blake, E. Keogh, and C. J. Merz. UCI Repository of Machine Learning Databases. Department of Information and Computer Science, University of California at Irvine, CA, 1999.
[12] H. Fan, and K. Ramamohanarao. An Efficient Single-Scan Algorithm for Mining Essential Jumping Emerging Patterns for Classification. In Proceedings of the 2002 Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD'02), Taipei, Taiwan.
[13] H. Fan, and K. Ramamohanarao. Efficiently Mining Interesting Emerging Patterns. In Proceedings of the 4th International Conference on Web-Age Information Management (WAIM'03), Chengdu, China.
[14] Stanford Stream Data Management (STREAM) Project. http://www-db.stanford.edu/stream
[15] B. Babcock, M. Datar, and R. Motwani. Sampling From a Moving Window Over Streaming Data. In Proceedings of the 2002 Annual ACM-SIAM Symposium on Discrete Algorithms, San Francisco, CA, USA.
