Mining Frequent Closed Graphs on Evolving Data Streams


DESCRIPTION

Graph mining is a challenging task by itself, and even more so when processing data streams which evolve in real-time. Data stream mining faces hard constraints regarding time and space for processing, and also needs to provide for concept drift detection. In this talk we present a framework for studying graph pattern mining on time-varying streams and large datasets.

TRANSCRIPT

Mining Frequent Closed Graphs on Evolving Data Streams

A. Bifet, G. Holmes, B. Pfahringer and R. Gavalda

University of Waikato, Hamilton, New Zealand

Laboratory for Relational Algorithmics, Complexity and Learning (LARCA), UPC-Barcelona Tech, Catalonia

17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2011), San Diego, 24 August 2011

Mining Evolving Graph Data Streams

Problem: Given a data stream D of graphs, find frequent closed graphs.

Transaction Id   Graph (vertex labels; edge structure lost in transcription)
1                C C S N, O, O
2                C C S N, O, C
3                C C S N, N

We provide three algorithms, of increasing power:
Incremental
Sliding Window
Adaptive

2 / 48

Non-incremental Frequent Closed Graph Mining

CloseGraph (Xifeng Yan, Jiawei Han): right-most extension based on depth-first search; based on gSpan, ICDM'02

MoSS (Christian Borgelt, Michael R. Berthold): breadth-first search; based on MoFa, ICDM'02

3 / 48

Outline

1 Streaming Data

2 Frequent Pattern Mining: Mining Evolving Graph Streams, ADAGRAPHMINER

3 Experimental Evaluation

4 Summary and Future Work

4 / 48


Mining Massive Data

Source: IDC’s Digital Universe Study (EMC), June 2011

6 / 48

Data Streams

Data Streams:
Sequence is potentially infinite
High amount of data, high speed of arrival
Once an element from a data stream has been processed, it is discarded or archived

Tools: approximation; randomization, sampling; sketching

7 / 48

Data Streams Approximation Algorithms

1011000111 1010101

Sliding Window: We can maintain simple statistics over sliding windows, using O((1/ε) log² N) space, where N is the length of the sliding window and ε is the accuracy parameter.

M. Datar, A. Gionis, P. Indyk, and R. Motwani. Maintaining stream statistics over sliding windows. 2002

8 / 48
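The Datar et al. bound above can be illustrated with an exponential-histogram counter in the DGIM style: buckets of power-of-two sizes, at most two per size, giving an approximate count of 1s in the last N bits in polylogarithmic space. This is a minimal sketch, not the paper's exact algorithm; the class name and parameters are my own.

```python
from collections import deque

class ExpHistogram:
    """DGIM-style exponential histogram: approximate count of 1s in the
    last n stream bits, keeping at most `max_per_size` buckets per size."""
    def __init__(self, n, max_per_size=2):
        self.n = n                    # window length
        self.k = max_per_size         # allowed buckets per size before merging
        self.t = 0                    # current timestamp
        self.buckets = deque()        # (end_timestamp, size), newest first

    def add(self, bit):
        self.t += 1
        # expire buckets whose most recent 1 fell out of the window
        while self.buckets and self.buckets[-1][0] <= self.t - self.n:
            self.buckets.pop()
        if bit != 1:
            return
        self.buckets.appendleft((self.t, 1))
        size = 1
        while True:                   # cascade merges: 3 buckets of a size -> merge 2 oldest
            same = [i for i, b in enumerate(self.buckets) if b[1] == size]
            if len(same) <= self.k:
                break
            i, j = same[-2], same[-1]           # two oldest buckets of this size
            ts = self.buckets[i][0]             # keep the newer end-timestamp
            del self.buckets[j]
            self.buckets[i] = (ts, size * 2)
            size *= 2

    def estimate(self):
        if not self.buckets:
            return 0
        total = sum(s for _, s in self.buckets)
        return total - self.buckets[-1][1] // 2  # half the oldest bucket is uncertain
```

The estimate is within roughly 50% of the true count with two buckets per size; more buckets per size trade space for accuracy, matching the 1/ε factor in the bound.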


Outline

1 Streaming Data

2 Frequent Pattern Mining: Mining Evolving Graph Streams, ADAGRAPHMINER

3 Experimental Evaluation

4 Summary and Future Work

9 / 48

Pattern Mining

Dataset Example

Document   Patterns
d1         abce
d2         cde
d3         abce
d4         acde
d5         abcde
d6         bcd

10 / 48

Itemset Mining

d1 abce, d2 cde, d3 abce, d4 acde, d5 abcde, d6 bcd

Support              Frequent
d1,d2,d3,d4,d5,d6    c
d1,d2,d3,d4,d5       e, ce
d1,d3,d4,d5          a, ac, ae, ace
d1,d3,d5,d6          b, bc
d2,d4,d5,d6          d, cd
d1,d3,d5             ab, abc, abe, be, bce, abce
d2,d4,d5             de, cde

11 / 48


Itemset Mining

d1 abce, d2 cde, d3 abce, d4 acde, d5 abcde, d6 bcd

Support   Frequent                        Gen      Closed   Max
6         c                               c        c
5         e, ce                           e        ce
4         a, ac, ae, ace                  a        ace
4         b, bc                           b        bc
4         d, cd                           d        cd
3         ab, abc, abe, be, bce, abce     ab, be   abce     abce
3         de, cde                         de       cde      cde

12 / 48
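The frequent/closed/maximal distinction in the table above can be reproduced by brute force over the toy dataset. This is a sketch for five items only, not how a real miner such as CloseGraph enumerates patterns:

```python
from itertools import combinations

# toy dataset from the slides
docs = {"d1": "abce", "d2": "cde", "d3": "abce",
        "d4": "acde", "d5": "abcde", "d6": "bcd"}
min_sup = 3

# support of every itemset over the alphabet
items = sorted(set("".join(docs.values())))
support = {}
for r in range(1, len(items) + 1):
    for combo in combinations(items, r):
        s = frozenset(combo)
        support[s] = sum(1 for t in docs.values() if s <= set(t))

frequent = {s: n for s, n in support.items() if n >= min_sup}
# closed: no proper superset with the same support
closed = {s: n for s, n in frequent.items()
          if not any(s < t and m == n for t, m in frequent.items())}
# maximal: no frequent proper superset at all
maximal = {s: n for s, n in frequent.items()
           if not any(s < t for t in frequent)}
```

Running this yields exactly the seven closed itemsets (c, ce, ace, bc, cd, abce, cde) and the two maximal ones (abce, cde) shown in the table.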

Outline

1 Streaming Data

2 Frequent Pattern Mining: Mining Evolving Graph Streams, ADAGRAPHMINER

3 Experimental Evaluation

4 Summary and Future Work

13 / 48

Graph Dataset

Transaction Id   Graph (vertex labels)   Weight
1                C C S N, O, O           1
2                C C S N, O, C           1
3                C S N, O, C             1
4                C C S N, N              1

14 / 48

Graph Coresets

Coreset of a set P with respect to some problem: a small subset that approximates the original set P. Solving the problem for the coreset provides an approximate solution for the problem on P.

δ-tolerance Closed Graph: A graph g is δ-tolerance closed if none of its proper frequent supergraphs has a weighted support ≥ (1−δ) · support(g).

Maximal graph: 1-tolerance closed graph. Closed graph: 0-tolerance closed graph.

15 / 48
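The δ-tolerance definition can be sketched directly, with itemsets standing in for graphs so that subgraph containment becomes subset containment. The function name is mine; δ = 1 recovers the maximal patterns and δ = 0 keeps every closed one, as the slide states:

```python
def delta_tolerance_closed(patterns, delta):
    """Keep pattern g unless some proper frequent superpattern h has
    support(h) >= (1 - delta) * support(g). Itemsets stand in for graphs."""
    return {g: s for g, s in patterns.items()
            if not any(g < h and sh >= (1 - delta) * s
                       for h, sh in patterns.items())}

# frequent closed itemsets of the slides' toy dataset (pattern -> support)
closed = {frozenset("c"): 6, frozenset("ce"): 5, frozenset("ace"): 4,
          frozenset("bc"): 4, frozenset("cd"): 4,
          frozenset("abce"): 3, frozenset("cde"): 3}
```

With delta=1.0 the threshold (1−δ)·support is 0, so any pattern with a frequent proper superpattern is dropped, leaving the maximal patterns; with delta=0.0 a superpattern would need equal support, which closedness rules out.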


Graph Coresets

Relative support of a closed graph: the support of the graph minus the relative supports of its closed supergraphs. The sum of a graph's relative support and the relative supports of its closed supergraphs equals its own support.

(s,δ)-coreset for the problem of computing closed graphs: a weighted multiset of frequent δ-tolerance closed graphs with minimum support s, using their relative support as a weight.

16 / 48
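The recursive definition of relative support can be computed largest-pattern-first, again with itemsets standing in for graphs (function name mine). Note that on this toy closure family the recursion assigns some patterns a negative weight; the stated support identity still holds by construction:

```python
def relative_supports(closed):
    """rel(g) = support(g) - sum of rel(h) over closed proper supersets h,
    computed largest-first so every superset is resolved before its subsets."""
    rel = {}
    for g in sorted(closed, key=len, reverse=True):
        rel[g] = closed[g] - sum(rel[h] for h in rel if g < h)
    return rel

# frequent closed itemsets of the slides' toy dataset (pattern -> support)
closed = {frozenset("c"): 6, frozenset("ce"): 5, frozenset("ace"): 4,
          frozenset("bc"): 4, frozenset("cd"): 4,
          frozenset("abce"): 3, frozenset("cde"): 3}
rel = relative_supports(closed)
```

For example rel(abce) = 3 (it has no closed superset) and rel(ace) = 4 − 3 = 1, and for every pattern its relative support plus those of its closed supersets reproduces its support.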

Graph Coresets

Relative support of a closed graphSupport of a graph minus the relative support of its closedsupergraphs.

The sum of the closed supergraphs’ relative supports of agraph and its relative support is equal to its own support.

(s,δ )-coreset for the problem of computing closedgraphsWeighted multiset of frequent δ -tolerance closed graphs withminimum support s using their relative support as a weight.

16 / 48


Graph Coresets

Graph (vertex labels)   Relative Support   Support
C C S N                 3                  3
C S N, O                3                  3
C S, N                  3                  3

Table: Example of a coreset with minimum support 50% and δ = 1

18 / 48

Graph Coresets

Figure: Number of graphs in a (40%,δ )-coreset for NCI.

19 / 48

Outline

1 Streaming Data

2 Frequent Pattern Mining: Mining Evolving Graph Streams, ADAGRAPHMINER

3 Experimental Evaluation

4 Summary and Future Work

20 / 48

INCGRAPHMINER

INCGRAPHMINER(D,min sup)

Input: A graph dataset D, and min sup.
Output: The frequent graph set G.

1 G ← ∅
2 for every batch bt of graphs in D
3   do C ← CORESET(bt, min sup)
4      G ← CORESET(G ∪ C, min sup)
5 return G

21 / 48
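Lines 1-5 can be sketched end-to-end in Python, with frequent closed itemsets standing in for graphs and relative supports as the coreset weights, as the coreset slides describe. Function names are mine, and with min sup = 1 the merged summary reconstructs supports exactly (no coreset approximation in this toy run):

```python
from itertools import combinations

def closed_patterns(wtrans, min_sup):
    """Weighted-support closed patterns; wtrans is a list of
    (frozenset pattern, weight) pairs acting as weighted transactions."""
    items = sorted({i for t, _ in wtrans for i in t})
    sup = {}
    for r in range(1, len(items) + 1):
        for combo in combinations(items, r):
            s = frozenset(combo)
            w = sum(wt for t, wt in wtrans if s <= t)
            if w >= min_sup:
                sup[s] = w
    # closed: no proper superset with the same weighted support
    return {g: w for g, w in sup.items()
            if not any(g < h and wh == w for h, wh in sup.items())}

def coreset(wtrans, min_sup):
    """Frequent closed patterns weighted by their relative support."""
    cl = closed_patterns(wtrans, min_sup)
    rel = {}
    for g in sorted(cl, key=len, reverse=True):   # supersets first
        rel[g] = cl[g] - sum(rel[h] for h in rel if g < h)
    return rel

def inc_graph_miner(batches, min_sup):
    G = {}
    for batch in batches:                                            # line 2
        C = coreset([(frozenset(t), 1) for t in batch], min_sup)     # line 3
        G = coreset(list(G.items()) + list(C.items()), min_sup)      # line 4
    return G                                                         # line 5
```

Because relative supports sum back to plain supports, summing the final weights over all patterns containing g recovers support(g) over the whole stream.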

WINGRAPHMINER

WINGRAPHMINER(D,W ,min sup)

Input: A graph dataset D, a window size W, and min sup.
Output: The frequent graph set G.

1 G ← ∅
2 for every batch bt of graphs in D
3   do C ← CORESET(bt, min sup)
4      Store C in sliding window
5      if sliding window is full
6        then R ← oldest C stored in sliding window, with all support values negated
7        else R ← ∅
8      G ← CORESET(G ∪ C ∪ R, min sup)
9 return G

22 / 48
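The key trick in line 6, retiring the oldest batch by merging it back with negated supports, can be shown on weighted pattern summaries. A minimal sketch with a hypothetical `merge` helper and toy summaries of my own:

```python
from collections import Counter

def merge(*summaries):
    """Combine weighted pattern summaries; entries cancelling to 0 vanish."""
    total = Counter()
    for s in summaries:
        total.update(s)          # Counter.update adds weights
    return {g: w for g, w in total.items() if w != 0}

# two batch summaries (pattern -> weighted support)
b1 = {frozenset("abce"): 2, frozenset("cde"): 1}
b2 = {frozenset("acde"): 1, frozenset("bcd"): 2}
window = merge(b1, b2)

# window full: retire the oldest batch by re-merging it with negated supports
window = merge(window, {g: -w for g, w in b1.items()})
```

After the second merge only b2's contribution remains, which is exactly the effect of dropping the oldest batch from the window.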

Algorithm ADaptive Sliding WINdowExample

W = 101010110111111   W0 = 1

ADWIN: ADAPTIVE WINDOWING ALGORITHM

1 Initialize Window W
2 for each t > 0
3   do W ← W ∪ {xt} (i.e., add xt to the head of W)
4      repeat Drop elements from the tail of W
5      until |µW0 − µW1| < εc holds
6        for every split of W into W = W0 · W1
7      Output µW

Every split W = W0 · W1 is checked as W0 grows: W0 = 1, 10, 101, ..., 10101011, with W1 the rest of W.

W = 101010110111111   |µW0 − µW1| ≥ εc : CHANGE DETECTED!
W0 = 101010110   W1 = 111111

Drop elements from the tail of W:

W = 01010110111111
W0 = 101010110   W1 = 111111

23 / 48

Algorithm ADaptive Sliding WINdow

Theorem. At every time step we have:

1 (False positive rate bound). If µt remains constant within W, the probability that ADWIN shrinks the window at this step is at most δ.

2 (False negative rate bound). Suppose that for some partition of W in two parts W0W1 (where W1 contains the most recent items) we have |µW0 − µW1| > 2εc. Then with probability 1 − δ, ADWIN shrinks W to W1, or shorter.

ADWIN tunes itself to the data stream at hand, with no need for the user to hardwire or precompute parameters.

24 / 48

Algorithm ADaptive Sliding WINdow

ADWIN, using a Data Stream Sliding Window Model:
can provide the exact counts of 1's in O(1) time per point
tries O(log W) cutpoints
uses O((1/ε) log W) memory words
the processing time per example is O(log W) (amortized and worst-case)

Sliding Window Model

Buckets:   1010101   101   11   1   1
Content:   4         2     2    1   1
Capacity:  7         3     2    1   1

25 / 48
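A simplified ADWIN step can be written directly from the pseudocode: append the new element, then cut the oldest part while some split has means differing by at least a Hoeffding-style threshold. This naive version checks every split in O(n) rather than using the bucket structure above, so it illustrates the detection rule, not the O(log W) implementation; the threshold formula with the harmonic mean m of the split sizes follows the ADWIN paper's simpler bound:

```python
import math

def adwin_step(window, x, delta=0.01):
    """One step of a simplified ADWIN: append x, then repeatedly drop the
    oldest part while some split W0 . W1 has |mean(W0) - mean(W1)| >= eps_cut."""
    window.append(x)                       # newest element at the end
    shrank = False
    while True:
        n = len(window)
        total = sum(window)
        cut, s0 = None, 0.0
        for i in range(1, n):              # split: W0 = window[:i], W1 = window[i:]
            s0 += window[i - 1]
            n0, n1 = i, n - i
            mu0, mu1 = s0 / n0, (total - s0) / n1
            m = 1.0 / (1.0 / n0 + 1.0 / n1)             # harmonic mean of n0, n1
            eps_cut = math.sqrt(math.log(4.0 * n / delta) / (2.0 * m))
            if abs(mu0 - mu1) >= eps_cut:
                cut = i                    # remember the last violating split
        if cut is None:
            break
        del window[:cut]                   # drop the tail (oldest elements)
        shrank = True
    return shrank
```

Feeding a stream whose mean jumps from 0 to 1, the window stays intact during the first regime (no false alarm on constant data) and shrinks to recent elements shortly after the change.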

ADAGRAPHMINER

ADAGRAPHMINER(D,Mode,min sup)

 1 G ← ∅
 2 Init ADWIN
 3 for every batch bt of graphs in D
 4   do C ← CORESET(bt, min sup)
 5      R ← ∅
 6      if Mode is Sliding Window
 7        then Store C in sliding window
 8             if ADWIN detected change
 9               then R ← batches to remove in sliding window, with negative support
10      G ← CORESET(G ∪ C ∪ R, min sup)
11      if Mode is Sliding Window
12        then Insert # closed graphs into ADWIN
13        else for every g in G update g's ADWIN
14 return G

26 / 48
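The control flow of the sliding-window mode can be sketched structurally. In this sketch the CORESET call is stubbed as simple pattern counting and ADWIN is replaced by a caller-supplied `detect_change` stand-in monitoring the number of closed patterns, so only the batch/window/negation plumbing of lines 3-12 is real:

```python
from collections import Counter, deque

def ada_graph_miner(batches, min_sup, detect_change):
    """Structural sketch of ADAGRAPHMINER in sliding-window mode.
    detect_change(sizes) is a stand-in for ADWIN over the number of
    closed patterns; coreset is stubbed as weighted pattern counting."""
    def coreset(weighted, threshold):
        c = Counter()
        for pattern, w in weighted:
            c[pattern] += w
        return {g: w for g, w in c.items() if w >= threshold}

    G, window, sizes = {}, deque(), []
    for batch in batches:
        C = coreset([(p, 1) for p in batch], 1)   # summarize the batch
        window.append(C)
        R = {}
        if detect_change(sizes):                  # drift: retire oldest batch
            old = window.popleft()
            R = {g: -w for g, w in old.items()}   # negated supports remove it
        G = coreset(list(G.items()) + list(C.items()) + list(R.items()), min_sup)
        sizes.append(len(G))                      # feed the change detector
    return G
```

With a detector that fires once three batch sizes have been seen, the summary forgets the oldest batches as new ones arrive, behaving like a three-batch sliding window.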


Outline

1 Streaming Data

2 Frequent Pattern Mining: Mining Evolving Graph Streams, ADAGRAPHMINER

3 Experimental Evaluation

4 Summary and Future Work

27 / 48

Experimental Evaluation

ChemDB dataset: public dataset; 4 million molecules; Institute for Genomics and Bioinformatics at the University of California, Irvine

Open NCI Database: public domain; 250,000 structures; National Cancer Institute

28 / 48

What is MOA? {M}assive {O}nline {A}nalysis is a framework for online learning from data streams.

It is closely related to WEKA. It includes a collection of offline and online algorithms, as well as tools for evaluation:
classification
clustering

Easy to extend; easy to design and run experiments.

29 / 48

WEKA: the bird

30 / 48

MOA: the bird

The Moa (another native NZ bird) is not only flightless, like theWeka, but also extinct.

31 / 48


Classification Experimental Setting

32 / 48

Classification Experimental Setting

Classifiers: Naive Bayes, Decision stumps, Hoeffding Tree, Hoeffding Option Tree, Bagging and Boosting, ADWIN Bagging and Leveraging Bagging

Prediction strategies: Majority class, Naive Bayes Leaves, Adaptive Hybrid

33 / 48

Clustering Experimental Setting

34 / 48

Clustering Experimental Setting

Clusterers: StreamKM++, CluStream, ClusTree, Den-Stream, CobWeb

35 / 48

Clustering Experimental Setting

Internal measures: Gamma, C Index, Point-Biserial, Log Likelihood, Dunn's Index, Tau, Tau A, Tau C, Somer's Gamma, Ratio of Repetition, Modified Ratio of Repetition, Adjusted Ratio of Clustering, Fagan's Index, Deviation Index, Z-Score Index, D Index, Silhouette coefficient

External measures: Rand statistic, Jaccard coefficient, Folkes and Mallow Index, Hubert Γ statistics, Minkowski score, Purity, van Dongen criterion, V-measure, Completeness, Homogeneity, Variation of information, Mutual information, Class-based entropy, Cluster-based entropy, Precision, Recall, F-measure

Table: Internal and external clustering evaluation measures.

36 / 48

Cluster Mapping Measure

Hardy Kremer, Philipp Kranen, Timm Jansen, Thomas Seidl, Albert Bifet, Geoff Holmes and Bernhard Pfahringer. An Effective Evaluation Measure for Clustering on Evolving Data Streams. KDD'11

CMM (Cluster Mapping Measure): a novel evaluation measure for stream clustering on evolving streams

CMM(C, CL) = 1 − (Σ_{o∈F} w(o) · pen(o, C)) / (Σ_{o∈F} w(o) · con(o, Cl(o)))

37 / 48
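Given the per-object weights w(o), penalties pen(o, C) and connectivities con(o, Cl(o)), whose definitions are in the KDD'11 paper and assumed precomputed here, the CMM formula itself is a single weighted ratio. A minimal sketch, with parallel lists standing in for the fault set F:

```python
def cmm(weights, penalties, connectivities):
    """CMM(C, CL) = 1 - sum_o w(o)*pen(o, C) / sum_o w(o)*con(o, Cl(o)),
    over objects o in the fault set F (inputs are parallel lists)."""
    num = sum(w * p for w, p in zip(weights, penalties))
    den = sum(w * c for w, c in zip(weights, connectivities))
    return 1.0 - num / den
```

Zero penalties give a perfect score of 1, and larger weighted penalties relative to connectivity pull the score down.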

Extensions of MOA

Multi-label Classification, Active Learning, Regression, Closed Frequent Graph Mining, Twitter Sentiment Analysis

Challenges for bigger data streams: sampling and distributed systems (Map-Reduce, Hadoop, S4)

38 / 48

Open NCI dataset

Figure: Time (seconds) vs. number of instances on the Open NCI dataset, for IncGraphMiner, IncGraphMiner-C, MoSS and closeGraph.

39 / 48

Open NCI dataset

Figure: Memory (megabytes) vs. number of instances on the Open NCI dataset, for IncGraphMiner, IncGraphMiner-C, MoSS and closeGraph.

40 / 48

ChemDB dataset

Figure: Memory (megabytes) vs. number of instances on the ChemDB dataset, for IncGraphMiner, IncGraphMiner-C, MoSS and closeGraph.

41 / 48

ChemDB dataset

Figure: Time (seconds) vs. number of instances on the ChemDB dataset, for IncGraphMiner, IncGraphMiner-C, MoSS and closeGraph.

42 / 48

ADAGRAPHMINER

Figure: Number of closed graphs vs. number of instances, for ADAGRAPHMINER, ADAGRAPHMINER-Window and IncGraphMiner.

43 / 48

Outline

1 Streaming Data

2 Frequent Pattern Mining: Mining Evolving Graph Streams, ADAGRAPHMINER

3 Experimental Evaluation

4 Summary and Future Work

44 / 48

Summary

Mining Evolving Graph Data Streams: Given a data stream D of graphs, find frequent closed graphs.

Transaction Id   Graph (vertex labels)
1                C C S N, O, O
2                C C S N, O, C
3                C C S N, N

We provide three algorithms, of increasing power:
Incremental
Sliding Window
Adaptive

45 / 48

Summary

{M}assive {O}nline {A}nalysis is a framework for online learning from data streams.

http://moa.cs.waikato.ac.nz

It is closely related to WEKA. It includes a collection of offline and online algorithms, as well as tools for evaluation:
classification
clustering
frequent pattern mining

MOA deals with evolving data streams; MOA is easy to use and extend.

46 / 48

Future work

Structured massive data mining: sampling, sketching; parallel techniques

47 / 48

Thanks!

48 / 48
