Outline Introduction Problem Definition Algorithms Evaluation Conclusion 1

Upload: adela-leonard

Post on 17-Jan-2016


Page 1: Outline  Introduction  Problem Definition  Algorithms  Evaluation  Conclusion 1

Outline
• Introduction
• Problem Definition
• Algorithms
• Evaluation
• Conclusion

Page 2: Outline  Introduction  Problem Definition  Algorithms  Evaluation  Conclusion 1

Mining Probabilistically Frequent Sequential Patterns in Large Uncertain Databases
• Authors: Zhou Zhao, Da Yan and Wilfred Ng
• Source: IEEE Transactions on Knowledge and Data Engineering, Vol. 26, No. 5, May 2014, pp. 1171-1184
• Reporter: Pei Wun Chen
• Keywords: Frequent patterns, Uncertain databases, Approximate algorithm, Possible world semantics

Page 3: Outline  Introduction  Problem Definition  Algorithms  Evaluation  Conclusion 1

Data mining – Frequent Pattern Mining
• Helps reveal collections of popular merchandise items
• Co-occurring objects
• Co-located events

Sequence data
• Market transaction records ordered by time
• Discover whether customers who frequently buy item A will buy item B later (A -> B)

Sequential pattern mining


Page 4: Outline  Introduction  Problem Definition  Algorithms  Evaluation  Conclusion 1

Sequential Pattern Mining
What is sequential pattern mining?
• Given a set of sequences, find the complete set of frequent sequential patterns.

A market-basket sequence database

SID   sequence
1     <a(abc)(ac)d(cf)>
2     <(ad)c(bc)(ae)>
3     <(ef)(ab)(df)cb>
4     <eg(af)cbc>

• <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
• Given a support threshold min_sup = 2, <(ab)c> is a frequent sequential pattern
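The subsequence and support claims on this slide can be checked with a small sketch. The helper names `parse`, `is_subsequence` and `support` are illustrative and do not come from the slides; single-letter items are assumed, as in the example database.

```python
def parse(seq):
    """Parse a string such as 'a(abc)(ac)d(cf)' into a list of itemsets."""
    elements, i = [], 0
    while i < len(seq):
        if seq[i] == "(":
            j = seq.index(")", i)
            elements.append(frozenset(seq[i + 1:j]))
            i = j + 1
        else:
            elements.append(frozenset(seq[i]))
            i += 1
    return elements


def is_subsequence(pattern, sequence):
    """Greedy check: each pattern itemset must be contained in a later element."""
    pos = 0
    for itemset in pattern:
        # Skip elements that do not contain the current pattern itemset.
        while pos < len(sequence) and not itemset <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next pattern itemset must match a strictly later element
    return True


def support(pattern, db):
    """Number of sequences in db that contain the pattern."""
    return sum(is_subsequence(pattern, s) for s in db)


db = [parse(s) for s in ["a(abc)(ac)d(cf)", "(ad)c(bc)(ae)",
                         "(ef)(ab)(df)cb", "eg(af)cbc"]]
print(is_subsequence(parse("a(bc)dc"), db[0]))  # True
print(support(parse("(ab)c"), db))              # 2 (sequences 1 and 3)
```

With min_sup = 2, a support of 2 is enough for <(ab)c> to count as frequent.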

Page 5: Outline  Introduction  Problem Definition  Algorithms  Evaluation  Conclusion 1

Uncertain data
• Many real-life applications are riddled with uncertainty
• RFID sensor networks
• GPS trajectory databases

Reasons for uncertainty
• Sampling and duration errors
• Privacy preservation

Uncertain data are represented by probabilities
• Uncertain data modeling


Page 6: Outline  Introduction  Problem Definition  Algorithms  Evaluation  Conclusion 1

Uncertain Data Models
Sequence-level model
Sensor network dataset

• Pr{sup(AB) = 2} = Pr(pw1) = 0.9
• Pr{sup(AB) = 1} = Pr(pw2) = 0.1
• Pr{sup(AB) = 0} = 0
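The slide's data table did not survive in this transcript, so the following sketch uses a hypothetical two-sequence database, chosen only so that the resulting probabilities match the numbers above. It enumerates possible worlds under the sequence-level model, where each probabilistic sequence takes exactly one of its instances.

```python
from collections import defaultdict
from itertools import product


def contains(seq, pattern):
    """True if pattern is a (not necessarily contiguous) subsequence of seq."""
    it = iter(seq)
    return all(item in it for item in pattern)


def support_distribution(db, pattern):
    """Map each support value to its probability over all possible worlds."""
    dist = defaultdict(float)
    # One possible world = one instance chosen for every probabilistic sequence.
    for world in product(*db):
        prob, sup = 1.0, 0
        for instance, p in world:
            prob *= p
            sup += contains(instance, pattern)
        dist[sup] += prob
    return dict(dist)


# Hypothetical data (an assumption, not the slide's dataset).
db = [
    [("ABC", 1.0)],              # sequence 1: one certain instance
    [("AB", 0.9), ("BC", 0.1)],  # sequence 2: two mutually exclusive instances
]
dist = support_distribution(db, "AB")
print(dist)  # support 2 with probability 0.9, support 1 with probability 0.1
```

Enumeration is exponential in the number of sequences, which is the "possible world explosion" the U-PrefixSpan conclusion slide refers to; the sketch is only meant to make the semantics concrete.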


Page 7: Outline  Introduction  Problem Definition  Algorithms  Evaluation  Conclusion 1

Uncertain Data Models
Element-level model
GPS trajectory dataset

• Pr{sup(AB) = 2} = Pr(pw3) = 0.9025
• Pr{sup(AB) = 1} = Pr(pw1) + Pr(pw2) + Pr(pw4) = 0.0975
• Pr{sup(AB) = 0} = 0
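As a rough illustration of the element-level model, the sketch below lets every element of a sequence be uncertain independently. The database is hypothetical (the slide's table is missing from this transcript) and was picked so that four possible worlds arise and the probabilities agree with the values above.

```python
from itertools import product


def contains(seq, pattern):
    """True if pattern is a (not necessarily contiguous) subsequence of seq."""
    it = iter(seq)
    return all(item in it for item in pattern)


def support_distribution(db, pattern):
    """Support distribution when each element is an independent random choice."""
    # All instantiations of one sequence: one (value, prob) choice per element.
    per_sequence = [list(product(*seq)) for seq in db]
    dist = {}
    for world in product(*per_sequence):
        prob, sup = 1.0, 0
        for choice in world:
            for _, p in choice:
                prob *= p
            sup += contains([v for v, _ in choice], pattern)
        dist[sup] = dist.get(sup, 0.0) + prob
    return dist


# Hypothetical data (an assumption, not the slide's dataset).
db = [
    [[("A", 1.0)], [("B", 1.0)]],                            # certain sequence A B
    [[("A", 0.95), ("C", 0.05)], [("B", 0.95), ("D", 0.05)]],  # two uncertain elements
]
dist = support_distribution(db, ["A", "B"])
print(dist)  # roughly {2: 0.9025, 1: 0.0975}
```

Here 0.9025 = 0.95 × 0.95 is the single world in which both uncertain elements take the values A and B; the remaining three worlds sum to 0.0975.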


Page 8: Outline  Introduction  Problem Definition  Algorithms  Evaluation  Conclusion 1

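Probabilistically Frequent Patterns
Probabilistic frequentness: (τsup, τprob)-frequentness
• Find all patterns p that satisfy Pr{sup(p) ≥ τsup} ≥ τprob

Examples for pattern AB
• Pr{sup(AB) ≥ 1} ≥ 1 holds, since Pr{sup(AB) = 0} = 0
• Pr{sup(AB) ≥ 2} ≥ 0.91 does not hold (⨯), since Pr{sup(AB) = 2} is 0.9 in the sequence-level example and 0.9025 in the element-level example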
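A minimal sketch of the (τsup, τprob)-frequentness test, assuming the support distribution is already available as a dictionary from support value to probability. The function name and the small tolerance `eps` for floating-point round-off are illustrative choices, not part of the slides.

```python
def is_probabilistically_frequent(dist, tau_sup, tau_prob, eps=1e-12):
    """Check Pr{sup >= tau_sup} >= tau_prob for a support distribution."""
    tail = sum(p for sup, p in dist.items() if sup >= tau_sup)
    return tail >= tau_prob - eps


# Support distributions of AB taken from the two data-model slides.
seq_level = {2: 0.9, 1: 0.1}
elem_level = {2: 0.9025, 1: 0.0975}

for name, dist in [("sequence-level", seq_level), ("element-level", elem_level)]:
    print(name,
          is_probabilistically_frequent(dist, 1, 1.0),
          is_probabilistically_frequent(dist, 2, 0.91))
# Both lines print: True False
```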


Page 9: Outline  Introduction  Problem Definition  Algorithms  Evaluation  Conclusion 1

U-PrefixSpan - Sequence-Level Model
Sequence Projection in Sequence-Level Model

Page 10: Outline  Introduction  Problem Definition  Algorithms  Evaluation  Conclusion 1

Element-Level U-PrefixSpan


Page 11: Outline  Introduction  Problem Definition  Algorithms  Evaluation  Conclusion 1

Experiments-Environment

• Windows 7 PC
• Intel® Core(TM) i5 CPU
• 4GB memory
• Coded in C++ and run in Eclipse

Page 12: Outline  Introduction  Problem Definition  Algorithms  Evaluation  Conclusion 1

Experiments-SeqU-PrefixSpan (n/m)


Page 13: Outline  Introduction  Problem Definition  Algorithms  Evaluation  Conclusion 1

Experiments-SeqU-PrefixSpan (l/d)

(The length of a sequence instance is randomly chosen from the range [1, ℓ].)

(Each element in the sequence instance is randomly picked from an element table with d elements.)
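The two generation rules above could be sketched as follows. The function name, the seeded generator and the example element table are assumptions for illustration; the slides do not show the authors' actual generator.

```python
import random


def generate_instance(max_len, element_table, rng):
    """Draw one sequence instance: length from [1, max_len], elements from the table."""
    length = rng.randint(1, max_len)  # randint is inclusive on both ends
    return [rng.choice(element_table) for _ in range(length)]


rng = random.Random(42)                 # fixed seed only for reproducibility
element_table = ["A", "B", "C", "D"]    # d = 4 hypothetical elements
instance = generate_instance(5, element_table, rng)  # l = 5
print(instance)
```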

Page 14: Outline  Introduction  Problem Definition  Algorithms  Evaluation  Conclusion 1

Threshold efficiency


Page 15: Outline  Introduction  Problem Definition  Algorithms  Evaluation  Conclusion 1

Real dataset - RFID datasets


Page 16: Outline  Introduction  Problem Definition  Algorithms  Evaluation  Conclusion 1

Precision of approximation - ElemU


Page 17: Outline  Introduction  Problem Definition  Algorithms  Evaluation  Conclusion 1

Towards Efficient Sequential Pattern Mining in Temporal Uncertain Databases
• Authors: Jiaqi Ge, Yuni Xia and Jian Wang
• Source: Proceedings of the 19th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), pp. 268-279, 2015
• Keywords: Temporal uncertainty, Sequential pattern mining

Page 18: Outline  Introduction  Problem Definition  Algorithms  Evaluation  Conclusion 1

Temporal uncertainty
• Timestamps of events in real applications can be inaccurate or imprecise

Reasons for temporal uncertainty
• The exact time of an event is unavailable
• Aggregation operations on temporal scales
• Protection of privacy and confidentiality

Page 19: Outline  Introduction  Problem Definition  Algorithms  Evaluation  Conclusion 1

Motivation
• A time series model T = {t, (t + 1), . . . , (t + n)} in probabilistic temporal databases
• Possible world semantics in probabilistic databases
• Efficiency and scalability challenges in uncertain SPM
• Propose an efficient SPM algorithm for temporal uncertain sequence databases.

Page 20: Outline  Introduction  Problem Definition  Algorithms  Evaluation  Conclusion 1

Problem Definition
• Uncertain event e = <sid, eid, T, I>
• The event with sid = i and eid = j is denoted by eij
• <sid, eid> denotes the sequence-id and the event-id
• T is an uncertain timestamp modeled by a uniform distribution, T ~ U(t−, t+)
• I is an itemset

An example of an uncertain database
e11 = <1, 1, {[100, 103]}, {A, C}>
e12 = <1, 2, {[102, 105]}, B>

Page 21: Outline  Introduction  Problem Definition  Algorithms  Evaluation  Conclusion 1

Temporal Possible Worlds

The time point of an event is randomly drawn from the corresponding uncertain timestamp.
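One way to picture a temporal possible world is to store each uncertain event with its timestamp interval and then draw one concrete time per event. The class and function names below are illustrative assumptions; the two events are the e11 and e12 examples from the problem-definition slide.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class UncertainEvent:
    sid: int          # sequence id
    eid: int          # event id
    t_lo: float       # lower end of the uncertain timestamp, T ~ U(t_lo, t_hi)
    t_hi: float       # upper end of the uncertain timestamp
    items: frozenset  # itemset I


def sample_world(events, rng):
    """Draw one temporal possible world: a concrete time point for every event."""
    return {(e.sid, e.eid): rng.uniform(e.t_lo, e.t_hi) for e in events}


events = [
    UncertainEvent(1, 1, 100, 103, frozenset("AC")),  # e11
    UncertainEvent(1, 2, 102, 105, frozenset("B")),   # e12
]
world = sample_world(events, random.Random(0))
print(world)
```

Because the intervals [100, 103] and [102, 105] overlap, e12 may fall before e11 in some sampled worlds, which is why the order of events is itself uncertain.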

Page 22: Outline  Introduction  Problem Definition  Algorithms  Evaluation  Conclusion 1

Uncertain SPM Problem

The SPM problem in temporal uncertain databases can be defined as follows:
Given a minimal support ts, a minimal frequentness probability threshold tp, a minimal time gap l and a maximal time gap h, find every probabilistic frequent sequential pattern s in a temporal uncertain database, that is, every s with P(sup(s) ≥ ts) ≥ tp.

• sup(s) is the total number of sequences that support s
• ts is the user-defined minimal support threshold
• w is a possible world in which s is frequent
• f(w) is the pdf of w
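If one assumes that each sequence supports s independently with a known probability, P(sup(s) ≥ ts) can be evaluated with a standard dynamic program over the number of supporting sequences. This is a generic sketch under that independence assumption; the slides do not state that uSPM computes the probability this way, and the function name is hypothetical.

```python
def prob_support_at_least(probs, ts):
    """P(sup >= ts) when sequence i supports the pattern with probability probs[i]."""
    # dist[k] = probability that exactly k of the sequences seen so far support s.
    dist = [1.0]
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for k, q in enumerate(dist):
            nxt[k] += q * (1 - p)    # this sequence does not support s
            nxt[k + 1] += q * p      # this sequence supports s
        dist = nxt
    return sum(dist[ts:])


# Two sequences with hypothetical support probabilities 0.5 each.
print(prob_support_at_least([0.5, 0.5], 1))  # 0.75
```

The pattern is then reported when the returned value reaches the threshold tp.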

Page 23: Outline  Introduction  Problem Definition  Algorithms  Evaluation  Conclusion 1

Probability of Satisfying Time Constraints

Sequential pattern s = <s1, . . . , sn>
os = {ek1, . . . , ekn} is a minimal possible occurrence of s
Ti is the uncertain time of the event eki

The probability that os supports s, denoted by P(<T1 · · · Tn>), is the probability that l ≤ Ti+1 − Ti ≤ h holds for all i ∈ [1, n).

Example: os = {e21, e22}, s = {A, B}
• The probability that {e21, e22} supports {A, B} is P(<T1, T2>), the probability that l ≤ T2 − T1 ≤ h
• But T1 and T2 are uncertain.

Page 24: Outline  Introduction  Problem Definition  Algorithms  Evaluation  Conclusion 1

Probability of Satisfying Time Constraints

Two uncertain timestamps X ~ U(x−, x+) and Y ~ U(y−, y+).
Given the time constraints mingap = l and maxgap = h, P(<XY>) is the probability that l ≤ Y − X ≤ h.

The computation is decomposed into p cases by the endpoints.

Page 25: Outline  Introduction  Problem Definition  Algorithms  Evaluation  Conclusion 1

An example of computing P(<XY>)

l = 0 and h = 5, X ~ U(60, 63), Y ~ U(62, 68)

[62, 68] is divided by the four endpoints {63 + 0, 60 + 0, 63 + 5, 60 + 5} = {63, 60, 68, 65}

Insert the points 63 and 65 into [62, 68] => [62, 63, 65, 68]

Then we have 3 subintervals: [62, 68] = [62, 63] ∪ [63, 65] ∪ [65, 68]

Page 26: Outline  Introduction  Problem Definition  Algorithms  Evaluation  Conclusion 1

An example of computing P(<XY>)

Thereafter, summing over the subintervals Yi, P(<XY>) = Σi P(<XYi> ∩ Yi) = 0.72

P(<XY>) is computed over the region between the two lines y = x + l and y = x + h
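The worked example can be reproduced with a short sketch, assuming X and Y are independent. For a fixed y, the probability that X lies in [y − h, y − l] is piecewise linear in y, with breakpoints at the four shifted endpoints, so a trapezoid sum over the subintervals is exact. The function names are illustrative, and the code assumes l ≤ h and non-degenerate intervals.

```python
def overlap(a, b, c, d):
    """Length of the intersection of the intervals [a, b] and [c, d]."""
    return max(0.0, min(b, d) - max(a, c))


def split_points(x_lo, x_hi, y_lo, y_hi, l, h):
    """Endpoints of the subintervals of [y_lo, y_hi] induced by the shifted endpoints of X."""
    cuts = {x_lo + l, x_hi + l, x_lo + h, x_hi + h}
    return sorted({y_lo, y_hi} | {c for c in cuts if y_lo < c < y_hi})


def prob_gap(x_lo, x_hi, y_lo, y_hi, l, h):
    """P(l <= Y - X <= h) for independent X ~ U(x_lo, x_hi) and Y ~ U(y_lo, y_hi)."""
    def inner(y):
        # P(y - h <= X <= y - l) for a fixed value y of Y
        return overlap(y - h, y - l, x_lo, x_hi) / (x_hi - x_lo)

    pts = split_points(x_lo, x_hi, y_lo, y_hi, l, h)
    # inner() is linear on each subinterval, so the trapezoid rule is exact there.
    area = sum((inner(a) + inner(b)) / 2 * (b - a) for a, b in zip(pts, pts[1:]))
    return area / (y_hi - y_lo)


points = split_points(60, 63, 62, 68, 0, 5)
p = prob_gap(60, 63, 62, 68, 0, 5)
print(points)        # [62, 63, 65, 68]
print(round(p, 2))   # 0.72
```

The exact value is 13/18: the three subintervals contribute 5/6, 2 and 3/2, and their sum 13/3 is divided by the length 6 of [62, 68].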

Page 27: Outline  Introduction  Problem Definition  Algorithms  Evaluation  Conclusion 1

Synthetic Data Generation

• IBM market-basket data generator
• Intel(R) Core(TM) Duo CPU @ 2.33GHz and 4GB memory
• Parameters
  C: number of sequences
  T: average number of transactions/itemsets per data-sequence
  L: average number of items per transaction/itemset per data-sequence
  I: number of different items

Page 28: Outline  Introduction  Problem Definition  Algorithms  Evaluation  Conclusion 1

Add temporal uncertainty

Replace a timestamp t by a uniform distribution over [(1 − r) ∗ t, (1 + r) ∗ t]
• r is randomly drawn from the uniform distribution U(0, 1)

Datasets are named by their parameters
• T4L10I1C10 indicates that T = 4, L = 10, I = 1 ∗ 1000 and C = 10 ∗ 1000
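Both conventions on this slide are simple enough to sketch. The function names are hypothetical, and the regular expression assumes the name always lists T, L, I and C in that order with integer values, as in T4L10I1C10.

```python
import random
import re


def add_temporal_uncertainty(t, rng):
    """Replace an exact timestamp t by the interval [(1 - r) * t, (1 + r) * t]."""
    r = rng.uniform(0.0, 1.0)  # r ~ U(0, 1)
    return ((1 - r) * t, (1 + r) * t)


def parse_dataset_name(name):
    """Decode a dataset name such as 'T4L10I1C10'; I and C are given in thousands."""
    m = re.fullmatch(r"T(\d+)L(\d+)I(\d+)C(\d+)", name)
    if m is None:
        raise ValueError(f"unrecognised dataset name: {name!r}")
    t, l, i, c = map(int, m.groups())
    return {"T": t, "L": l, "I": i * 1000, "C": c * 1000}


lo, hi = add_temporal_uncertainty(100, random.Random(1))
print(lo, hi)                            # an interval centred on 100
print(parse_dataset_name("T4L10I1C10"))  # {'T': 4, 'L': 10, 'I': 1000, 'C': 10000}
```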

Page 29: Outline  Introduction  Problem Definition  Algorithms  Evaluation  Conclusion 1

Scalability (1/2)


minsup = 0.5%, minprob = 0.7, mingap = 1, and maxgap = 10.

C = 10 000, T = 4, I = 10 000 and L = 2.

Page 30: Outline  Introduction  Problem Definition  Algorithms  Evaluation  Conclusion 1

Scalability (2/2)


minsup = 0.5%, minprob = 0.7, mingap = 1, and maxgap = 10.

C = 10 000, T = 4, I = 10 000 and L = 2.

Page 31: Outline  Introduction  Problem Definition  Algorithms  Evaluation  Conclusion 1

Efficiency


minsup = 0.2%, minprob = 0.7, mingap = 1, and maxgap = 10.

C = 10 000, T = 4, I = 10 000 and L = 2.

Page 32: Outline  Introduction  Problem Definition  Algorithms  Evaluation  Conclusion 1

Real world stock market dataset

• Extract the prices of 882 stocks over 16 weeks
• Each stock corresponds to a sequence
• Three events: price going up (+), going down (−) and no change (0)
• For example, if the price goes up at times 1, 2 and 3, then we aggregate them to form an uncertain event ([1,3], +).
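The aggregation step might look like the sketch below, which merges each run of identical price movements into one uncertain event. The price list is made up, the function names are illustrative, and ASCII '-' stands in for the slide's minus sign.

```python
from itertools import groupby


def movements(prices):
    """Turn a price series into '+', '-' or '0' for each consecutive pair."""
    return ["+" if cur > prev else "-" if cur < prev else "0"
            for prev, cur in zip(prices, prices[1:])]


def aggregate(moves, start_time=1):
    """Merge runs of equal movements into uncertain events ((t_first, t_last), symbol)."""
    events, t = [], start_time
    for symbol, run in groupby(moves):
        n = len(list(run))
        events.append(((t, t + n - 1), symbol))
        t += n
    return events


prices = [10, 11, 12, 13, 12, 12, 12]   # hypothetical prices
print(aggregate(movements(prices)))
# [((1, 3), '+'), ((4, 4), '-'), ((5, 6), '0')]
```

The first event corresponds to the slide's ([1,3], +): the price went up at times 1, 2 and 3, and the exact time of the aggregated event is uncertain within that interval.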

Page 33: Outline  Introduction  Problem Definition  Algorithms  Evaluation  Conclusion 1

Performance of uSPM in real stock dataset


minprob = 0.7, mingap = 1 and maxgap = 5

Page 34: Outline  Introduction  Problem Definition  Algorithms  Evaluation  Conclusion 1

Conclusion – U-PrefixSpan
• Study the problem of mining p-FSPs in uncertain databases
• Two new U-PrefixSpan algorithms: avoid the problem of "possible world explosion"
• Three pruning rules and one early validating method: decrease the execution time

Page 35: Outline  Introduction  Problem Definition  Algorithms  Evaluation  Conclusion 1

Conclusion - uSPM

• This paper studies the problem of mining probabilistic frequent sequential patterns in databases with temporal uncertainty.
• It designs an incremental approach to manage temporal uncertainty efficiently and integrates it into the classic pattern-growth SPM algorithm.
• The experimental results show that the algorithm is efficient and scalable.

Page 36: Outline  Introduction  Problem Definition  Algorithms  Evaluation  Conclusion 1

U-PrefixSpan Vs. uSPM

Uncertainty
• U-PrefixSpan: in events
• uSPM: in time

• U-PrefixSpan is more practical
• uSPM is the first work that studies the problem of temporal uncertainty

Page 37: Outline  Introduction  Problem Definition  Algorithms  Evaluation  Conclusion 1

Thanks for listening

Q&A