mining data streams presentation

ONLINE DATA STREAM MINING OF RECENT FREQUENT ITEMSETS BASED ON SLIDING WINDOW MODEL

IEEE2008

Conference

Presented by: Baha’ Nawafleh 20093173016

Table of Contents

Introduction
Literature Review
Problem definition
MRFI-SW algorithm
• window initialization phase
• window sliding phase
• mining frequent itemsets phase
Experiment
Conclusion
Questions??

Introduction

A data stream is a massive sequence of data elements generated continuously at a rapid rate. Unlike traditional static datasets, data streams are continuous and unbounded, and their data distribution changes with time.

Many applications generate large volumes of streaming data in real time, such as sensor data from sensor networks, online transaction flows in retail chains, and Web records and click-streams in Web applications.

Data streams can be classified into offline data streams [1] and online data streams [2].

Cont..

[1] The target application domains of offline data streams involve bulk additions of new transactions, as in a data warehouse system.

[2] Online data streams are characterized by data updated in real time. The transactions of an online data stream arrive one by one, such as the continuously generated transactions in a network monitoring system.

Literature Review

Researchers have proposed many algorithms for mining frequent itemsets in data streams.

Research on mining frequent itemsets in data streams can be divided into three categories:

• the landmark window model

• the time-fading model

• the sliding window model

Manku and Motwani developed two single-pass algorithms, Sticky Sampling and Lossy Counting. These algorithms mine frequent items over offline data streams under the landmark window model.

Cont..

SWFI-stream is an algorithm for mining frequent itemsets in online data streams under the transaction-sensitive sliding window model. Other work proposed an incremental mining algorithm to mine frequent itemsets in offline data streams with a time-sensitive sliding window.

The purpose of this paper: MRFI-SW, an algorithm for Mining Recent Frequent Itemsets over online data streams with a Sliding Window.

Problem definition

Let Ψ={i1, i2, …, im} be a set of literals, called items. A transaction is T={id, x1x2…xn}, where id is the transaction identifier and each xj is an item in Ψ. A transaction data stream DS={T1, T2, …, TN} is a continuous sequence of transactions.

A data stream can also be denoted as DS={W1, W2, …, Wm}, where each basic window is a transaction-sensitive sliding window (SW). w is the size of the transaction-sensitive sliding window, i.e. the number of transactions it holds. s is a user-defined minimum support threshold in the range [0,1]. The support of an itemset X over SW, sup(X), is the number of transactions in SW containing X as a subset. If the support of X is at least s*w, X is called a frequent itemset (FI).
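The definitions above can be checked directly against a small window. The following is a minimal sketch, not the MRFI-SW algorithm itself: it counts support by scanning the raw transactions. The function names and the four-transaction stream are assumptions made for illustration.

```python
from collections import deque

def support(window, itemset):
    """Number of transactions in the window that contain every item of itemset."""
    wanted = set(itemset)
    return sum(1 for t in window if wanted <= set(t))

def is_frequent(window, itemset, s):
    """An itemset is frequent if its support reaches s * w (w = window size)."""
    return support(window, itemset) >= s * len(window)

# Hypothetical stream of four transactions; only the latest w = 3 are kept.
stream = [("a", "c", "d"), ("b", "c", "e"), ("a", "b", "c", "e"), ("b", "e")]
w = 3
sw = deque(maxlen=w)  # a deque with maxlen discards the oldest transaction
for t in stream:
    sw.append(t)

print(support(sw, {"b", "e"}))      # -> 3
print(is_frequent(sw, {"a"}, 0.6))  # -> False, since 1 < 0.6 * 3
```

Scanning every transaction for every itemset is what the bit-order representation of the following sections is meant to avoid.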

MRFI-SW algorithm

The proposed MRFI-SW algorithm consists of three phases :

window initialization phase. window sliding phase.

and mining frequent itemsets phase.

window initialization phase.

The window initialization phase is activated when the first transaction arrives. The phase lasts until the transaction-sensitive sliding window is full.

When the sliding window is full, the items of its w transactions are transformed into bit-order representations.

Each item X has a sequence R(X) of w entries, each of the form (bit, order). If item X is in the i-th transaction of the current sliding window, the bit of the i-th entry, R(X)_bit, is set to 1 and the position of X within that transaction is stored in R(X)_order; otherwise the i-th entry is set to 0 (R(X)_bit=R(X)_order=0).

Cont..

For example, there are three transactions in SW1, T1, T2, and T3. The bit-order representations of items in SW1 are shown in Table 1.

Cont..

Table 1. Bit-order of items in window initialization phase
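A minimal sketch of how the bit-order representation could be built for a full window. The function name is hypothetical, and the 0 entry is written as (0, 0). T2 and T3 below are inferred from the sequences quoted later for SW2; T1 = (a, c, d) is an assumption that is consistent with R(a) and R(d) but is not confirmed by the slides.

```python
def build_bit_order(window):
    """Map each item to a list of (bit, order) entries, one per transaction.

    Entry i is (1, position of the item in transaction i, counted from 1)
    when the item occurs there, and (0, 0) otherwise.
    """
    items = {x for t in window for x in t}
    rep = {x: [(0, 0)] * len(window) for x in items}
    for i, t in enumerate(window):
        for pos, x in enumerate(t, start=1):
            rep[x][i] = (1, pos)
    return rep

# SW1 = T1, T2, T3 (T1 is an assumed transaction, see the note above).
sw1 = [("a", "c", "d"), ("b", "c", "e"), ("a", "b", "c", "e")]
r = build_bit_order(sw1)
print(r["a"])  # -> [(1, 1), (0, 0), (1, 1)]
```

The sequence printed for item a matches the R(a) quoted for SW1 in the sliding example.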

window sliding phase

The window sliding phase is activated when the sliding window becomes full. In this phase, a newly arriving transaction is inserted into the sliding window, and the oldest transaction in the current sliding window is removed.

Because each bit-order representation is a sequence with one entry per transaction, the oldest transaction is removed with a left-shift operation on the sequence.

To reduce memory usage, a pruning operation is executed after the window slides. It prunes the entry of an item when its whole bit-order sequence is 0: if item X does not appear in any transaction of the current sliding window, i.e. sup(X)SW=0, the entry R(X) is pruned.

Cont..

For instance, in Table 1, when the fourth transaction T4 arrives, the first transaction T1 must be removed from the current SW. The bit-order sequences of the items in SW1 are left-shifted. R(a) is modified from <(1, 1), 0, (1, 1)> to <0, (1, 1), 0>.

Similarly:

• R(c)=<(1, 2), (1, 3), 0>

• R(d)=<0, 0, 0>

• R(b)=<(1, 1), (1, 2), (1, 1)>

• R(e)=<(1, 3), (1, 4), (1, 2)>

Note that item d is dropped, because R(d)=<0, 0, 0>, i.e. sup(d)SW2=0.
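The slide from SW1 to SW2 can be reproduced with a short sketch. The function name `slide` and the (0, 0) encoding of the 0 entry are assumptions. T4 = (b, e) is inferred from the last entries of the sequences above. The first entries of R(c) and R(d) in SW1 are assumed, but they are discarded by the shift and do not affect the result.

```python
def slide(rep, new_transaction, w):
    """Left-shift every bit-order sequence, record the new transaction, prune.

    rep maps item -> list of w (bit, order) entries. A new dict is returned
    and rep is left unchanged.
    """
    # Left shift: drop the oldest entry and open an empty slot at the end.
    shifted = {x: seq[1:] + [(0, 0)] for x, seq in rep.items()}
    # Record each item of the new transaction in the last slot.
    for pos, x in enumerate(new_transaction, start=1):
        shifted.setdefault(x, [(0, 0)] * w)[-1] = (1, pos)
    # Prune items whose bits sum to 0, i.e. sup(X) = 0 in the new window.
    return {x: seq for x, seq in shifted.items()
            if sum(bit for bit, _ in seq) > 0}

sw1 = {
    "a": [(1, 1), (0, 0), (1, 1)],
    "b": [(0, 0), (1, 1), (1, 2)],
    "c": [(1, 2), (1, 2), (1, 3)],  # first entry assumed
    "d": [(1, 3), (0, 0), (0, 0)],  # first entry assumed
    "e": [(0, 0), (1, 3), (1, 4)],
}
sw2 = slide(sw1, ("b", "e"), w=3)
print(sorted(sw2))  # -> ['a', 'b', 'c', 'e']
print(sw2["a"])     # -> [(0, 0), (1, 1), (0, 0)]
```

Returning a new dictionary rather than shifting in place keeps the sketch easy to test; an implementation concerned with memory would more likely update the sequences in place.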

Algorithm 1: Output: updated bit-order sequences

Initialize the sliding window SW and the bit-order sequences
For each newly arriving transaction Ti do
    If SW is not yet full
        Append Ti to SW
        If SW is now full
            Transform all items in SW into bit-order sequences
        End if
    Else
        Do left_shift operation on the bit-order sequences of all items
        For each item X of Ti arriving in SW
            Transform X into its bit-order sequence representation
        End for
    End if
    For each R(X) in SW
        If SUM(R(X).bit)=0
            Drop X from SW
        End if
    End for
End for

Mining frequent itemsets phase

The mining frequent itemsets phase is activated when the bit-order sequences have been updated and the frequent itemsets are requested.

We proposed a method to generate frequent k-itemsets (itemsets with k items) from the known frequent (k-1)-itemsets.

The method is based on the Apriori property (if a pattern is frequent, all of its sub-patterns are also frequent).

We use the SUM operation on the bits of each item's entries to compute the support of the items, and find the frequent 1-itemsets in the current SW.

Then the proposed algorithm uses the AND operation on the bits of the entries to find candidate 2-itemsets. The support of each 2-itemset is computed, and the itemsets whose support is less than the threshold s*w are pruned.

The process terminates when no new (k+1)-itemsets are generated.

Cont..

For instance, consider the DS in Table 1. Let the minimum support threshold s be 0.6.

Hence, an itemset X is frequent if sup(X)≥0.6*3=1.8. We discuss the steps of mining frequent itemsets in SW2. First, the MRFI-SW algorithm finds the frequent 1-itemsets by computing the support of each item, where

• R(a)=<0, (1, 1), 0>, i.e., sup(a)=1

• R(c)=<(1, 2), (1, 3), 0>, i.e., sup(c)=2

• R(b)=<(1, 1), (1, 2), (1, 1)>, i.e., sup(b)=3

• R(e)=<(1, 3), (1, 4), (1, 2)>, i.e., sup(e)=3

So item a is not frequent, because its support of 1 is below 1.8, while items b, c, and e are frequent.
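The supports listed above follow from summing the bit of each entry. A minimal sketch, with the 0 entry written as (0, 0) and a hypothetical helper name:

```python
# Bit-order sequences of SW2 as listed above; 0 entries written as (0, 0).
sw2 = {
    "a": [(0, 0), (1, 1), (0, 0)],
    "c": [(1, 2), (1, 3), (0, 0)],
    "b": [(1, 1), (1, 2), (1, 1)],
    "e": [(1, 3), (1, 4), (1, 2)],
}

def item_support(seq):
    """SUM operation: add up the bit field of every entry."""
    return sum(bit for bit, _ in seq)

s, w = 0.6, 3
supports = {x: item_support(seq) for x, seq in sw2.items()}
fi1 = sorted(x for x, sup in supports.items() if sup >= s * w)

print(supports)  # -> {'a': 1, 'c': 2, 'b': 3, 'e': 3}
print(fi1)       # -> ['b', 'c', 'e']
```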

Cont..

Algorithm 2: Output: a set of frequent itemsets.

Find frequent 1-itemsets FI1
For (k=2; FIk-1≠null; k++)
    Do AND operation on R(FIk-1).bit to find Candidate FIk
    For each Candidate FIk do
        Do SUM operation on R(Candidate FIk).bit
        If SUM(R(Candidate FIk).bit)≥s*w
            If k=2
                Scan R(Candidate FIk).order
            End if
            Output FIk
        End if
    End for
End for
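The level-wise procedure can be sketched as follows. This is an illustrative reading of the pseudocode, not the authors' C implementation: the function name `mine` is hypothetical, and the order field is ignored because the slides do not say how it is used beyond being scanned for k=2. Candidates are formed by joining two frequent k-itemsets that differ in one item, and the bit vector of the union is the AND of their bit vectors.

```python
from itertools import combinations

def mine(rep, s, w):
    """Return {itemset: support} for all itemsets with support >= s * w."""
    threshold = s * w
    # Bit vector of each single item, taken from its (bit, order) entries.
    bits = {frozenset([x]): tuple(b for b, _ in seq) for x, seq in rep.items()}
    level = {k: v for k, v in bits.items() if sum(v) >= threshold}
    frequent = {}
    while level:
        frequent.update({k: sum(v) for k, v in level.items()})
        nxt = {}
        for (p, pv), (q, qv) in combinations(level.items(), 2):
            cand = p | q
            # Keep only joins that grow the itemset by exactly one item.
            if len(cand) == len(p) + 1 and cand not in nxt:
                anded = tuple(x & y for x, y in zip(pv, qv))  # AND operation
                if sum(anded) >= threshold:                   # SUM operation
                    nxt[cand] = anded
        level = nxt
    return frequent

sw2 = {
    "a": [(0, 0), (1, 1), (0, 0)],
    "c": [(1, 2), (1, 3), (0, 0)],
    "b": [(1, 1), (1, 2), (1, 1)],
    "e": [(1, 3), (1, 4), (1, 2)],
}
result = mine(sw2, s=0.6, w=3)
for itemset, sup in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print("".join(sorted(itemset)), sup)
```

On the SW2 sequences this yields b, c, e, bc, be, ce, and bce. The supports beyond the 1-itemsets are computed from the sequences listed above and are not stated in the slides.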

Experiment

Our algorithm was written in C and compiled using Microsoft Visual C++ 6.0. We generated online data streams using the IBM synthetic data generator.

Figure 1. Memory usage in the window initialization phase

Figure 2. Memory usage in the window sliding phase

Figure 3. Memory usage in the mining frequent itemsets phase

Figure 4. Processing time of the algorithm

Conclusion

Mining online data streams is an interesting and challenging research field.

The characteristics of data streams make many traditional mining algorithms inapplicable.

This paper proposed an efficient three-phase algorithm for mining recent frequent itemsets over online data streams with a transaction-sensitive sliding window.

Experiments show that the proposed algorithm not only attains highly accurate mining results, but also runs significantly faster and consumes less memory than the SWFI-stream algorithm for mining recent frequent itemsets over online data streams.

Questions??
