research issues on association rule mining loo kin kong 26 th february, 2003

Research issues on association rule mining

Loo Kin Kong26th February, 2003

Plan

Recent trends on data mining Association rule interestingness Association rule mining on data streams Research directions Conclusion

Association rules

First proposed in [Agarwal et al. 94] Given a database D of transactions, which

contains only binary attributes For an itemset x, the support of x is defined as

supp(x) = fraction of D containing x An association rule is in the form I J, where:

I J = supp(I J) supp

supp(I J) / supp(I) conf

Recent trends on association rule mining

Association rule interestingness Association rule mining on data streams Privacy preserving [Rizvi el al. 02] New data structures to improve the efficiency

of finding frequent itemsets [Relue et al. 01]

Association rule interestingness – overview

Problem with association rule mining: Too many rules mined Mined rules may contain redundancy or trivial rules

Subjective approaches aim at: Minimizing human effort involved

Objective approaches aim at: Based on some predefined interestingness measure,

filter rules that are uninteresting

Subjective approaches Rule templates [Klemettinen et al. 94]

A rule template specifies what attributes to occur in the LHS and RHS of a rule

e.g., any rule in the form “” & (any number of conditions) “” is uninteresting

By elimination [Sahar 99] For a rule r = A B, r’ = a b is an ancestor rule if a A

and b B. r’ is said to cover r. An ancestor rule can be classified as one of the following:

True-Not-Interesting (TNI) Not-True-Interesting (NTI) Not-True-Not-Interesting (NTNI) True-Interesting (TI)

Objective approaches

Statistical / problem-specific measures Entropy gain, lift, …

Pruning redundant rules by the maximum entropy principle [Jaroszewica 02]

Probability

A finite probability space is a pair (S,P), in which

S is a finite non-empty set P is a mapping P:S [0,1], satisfying sSP(s) = 1

Each sS is called an event P(s), also denoted by ps, is the probability of

the event s The self-information of s is defined as I(s) = –

log P(s)

Entropy A partition U is a collection of mutually exclusive

elements whose union equals S Each element contains one or more events

The measure of uncertainty that any event of a partition U would occur is called the entropy of the partitioning U

H(U) = – p1 log p1 – p2 log p2 – … – pN log pN

Where p1, ... , pN are respectively the probabilities of events a1, ... , aN of U

H(U) is maximum if p1 = p2 = ... = pN = 1/N

The maximum entropy method (MEM)

The MEM determines the probabilities pi of the events in a partition U, subject to various given constraints.

By MEM, when some of the pi’s are unknown, they must be chosen to maximize the entropy of U, subject to the given constraints.

Definitions

A constraint C is a pair C = (I, p), where: I is an itemset p[0,1] is the probability of I occurring in a

transaction The set of constraints generated by an

association rule I J is defined asC(I J) = {(I, supp(I)), (I J, supp(I

J))} A rule K J is a sub-rule of I J if K I

I-nonredundancy

A rule I J is considered I-nonredundant with respect to R, where R is a set of association rules, if:

I = , or I(CI,J(R), I J) is larger than some threshold, where I()

is either Iact() or Ipass(), CI,J(R) is the constraints induced by all sub-rules of I J in R

Pruning redundant association rules

Input: A set R of association rules1. For each singleton Ai in the database2. Ri = { Ai}3. k = 14. For each rule I Ai R, |I|=k, do5. If I Ai is I-nonredundant w.r.t. Ri then6. Ri = Ri {I Ai}7. k = k+18. Goto 49. R = Ri

Association rule interestingness: let’s face it...

“Interesting” is a subjective sense... Domain knowledge is needed at some stage to

determine what is interesting ... in fact, one may argue that there does not

exist a truly objective interestingness measure...

It is because we try to model what is interesting ... but “objective” interestingness measures

are still worth studying Can act as a filter before any human intervention is

required

Interesting or uninteresting?

Consider the association rule:r = I J, supp(r) = 1%, conf(r) = 100%

A question: Do you think whether r is interesting or

uninteresting? Considering the support and/or confidence of

one single rule may not be enough to determine whether a rule is interesting or not

So we try to compare a rule with some other rule(s)

Observation: comparing a family of rules For a maximal frequent itemset I:

The set of rules I’ {i}, where i I, I’ I \ {i}

forms a family of rules For example, for the maximal frequent itemset

{abcde}, abcd e conf =

supp({abcde})/supp({abcd}) abc e conf = supp({abce})/supp({abc}) abd e conf = supp({abde})/supp({abd}) ...

are in a family

abcde

abcd abce abde acde bcde

bcd abe ace ade bce bde cdeabc abd acd

bc bd cd ae be ce deab ac ad

a b c d e

Observation: comparing a family of rules (cont’d) The blue half of the lattice is obtained by

appending the item “e” to each node in the orange half

The family of rules captures how the item “e” affects the support of the orange half of the lattice

Idea: We may compare confidences of rules in a family to find

any “unusually” high or low confidences We can use some statistical tests to perform the

comparison; no need for complicated statistical models (e.g., MEM)

Association rule mining on data streams

In some new applications, data come as a continuous “stream”

The sheer volume of a stream over its lifetime is huge

Queries require timely answer Examples:

Stock ticks Network traffic measurements

A method for finding approximate frequency counts on data streams is proposed in [Manku et al. 02]

Goals of the paper

The algorithm ensures that All itemsets whose true frequency exceeds sN are

reported (i.e., no false negative) No itemset whose true frequency is less than (s-)N is

output Estimated frequencies are less than the true

frequencies by at most N

Some notations: Let N denote the current length of the stream Let s (0,1) denote the support threshold Let (0,1) denote the error tolerance

The simple case: finding frequent items

Each transaction in the stream contains only 1 item

2 algorithms were proposed, namely: Sticky Sampling Algorithm Lossy Counting Algorithm

Features of the algorithms: Sampling techniques are used Frequency counts found are approximate but error is gua

ranteed not to exceed a user-specified tolerance level For Lossy Counting, all frequent items are reported

Lossy Counting Algorithm

Incoming data stream is conceptually divided into buckets of 1/ transactions

Counts are kept in a data structure D Each entry in D is in the form (e, f, ), where:

e is the item f is the frequency of e in the stream since the entry is

inserted in D is the maximum count of e in the stream before e is

added to D

Lossy Counting Algorithm (cont’d)1. D ; N 02. w 1/; b 13. e next transaction; N N + 14. if (e,f,) exists in D do5. f f + 16. else do7. insert (e,1,b-1) to D8. endif9. if N mod w = 0 do10. prune(D, b); b b + 111. endif12. Goto 3;

D: The set of all countsN: Curr. len. of stream

e: Transaction (itemset)w: Bucket width

b: Current bucket id

1. function prune(D, b)2. for each entry (e,f,) in D do3. if f + b do4. remove the entry from D5. endif

Lossy Counting

Lossy Counting guarantees that: When deletion occurs, b N If an entry (e, f, ) is deleted, fe b where fe is the

actual frequency count of e Hence, if an entry (e, f, ) is deleted, fe N

Finally, f fe f + N

The more complex case: finding frequent itemsets

The Lossy Counting algorithm is extended to find frequent itemsets

Transactions in the data stream contains any number of items

Essentially the same as the case for single items, except:

Multiple buckets ( of them say) are processed in a batch

Each entry in D is in the form (set, f, ) Transactions read in are (wisely) expanded to its

subsets

Association rule mining on data streams: food for thought

Challenges to mine from data streams Fast update Data are usually not permanently stored (but may be

buffered) Fast response for queries Minimized resources (e.g. number of counts kept)

Possible interesting problems concerning association rule mining on data streams:

More efficient/accurate algorithms for finding association rules on data streams

Change mining in frequency counts

The lattice structure

A bottleneck in the algorithm proposed in [Manku et al. 02] is that it needs to expand a transaction to its subsets for counting

For example, for a transaction {abcde}, we may need to count the itemsets {a}, {b}, {c}, {d}, {e}, {ab}, {ac}...

Hence updates are expensive (although queries can be fast)

abcde

abcd abce abde acde bcde

acd ace ade bcd bce bde cdeabc abd abe

ae bc bd be cd ce deab ac ad

a b c d e

The lattice structure (cont’d)

Conclusion

Both association rule interestingness and mining on data streams are challenging problems

Research on rule interestingness can make association rule mining a more efficient tool for knowledge discovery

Association rule mining on data streams is an upcoming application and a promising direction for research

References [Agarwal et al. 94] R. Agarwal and R. Srikant. Fast Algorithms for

Mining Association Rules. VLDB94. [Jaroszewica 02] S. Jaroszewica and D.A. Simovici. Pruning

Redundant Association rules Using Maximum Entropy Principle. PAKDD02.

[Klemettinen et al. 94] Mika Klemettinen et al. Finding Interesting Rules from Large Sets of Discovered Association Rules. CIKM94.

[Manku et al. 02] G. S. Manku and R. Motwani. Approximate Frequency Counts over Data Streams. VLDB02.

[Relue et al. 01] R. Relue, X. Wu and H Huang. Efficient Runtime Generation of Association Rules. CIKM01.

[Rizvi el al. 02] S. J. Rizvi and J. R. Haritsa. Maintaining Data Privacy in Association Rule Mining. VLDB02.

[Sahar 99] Sigal Sahar. Interestingness Via What Is Not Interesting. KDD99.

research issues on association rule mining loo kin kong 26 th february, 2003

Documents