new directions in statistical graph mining

47
1 New Directions in Statistical Graph Mining Max Planck Institute for Biological Cybernetics Koji Tsuda Joint work with Hiroto Saigo, Sebastian Nowozin, Nicole Kraemer, Takeaki Uno and Taku Kudo

Upload: others

Post on 24-May-2022

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: New Directions in Statistical Graph Mining

1

New Directions in Statistical Graph Mining

Max Planck Institute for Biological Cybernetics

Koji Tsuda

Joint work with Hiroto Saigo, Sebastian Nowozin, Nicole Kraemer, Takeaki Uno and Taku Kudo

Page 2: New Directions in Statistical Graph Mining

2

Graphs and itemsets

(40F,41L,43E,210W,211K,215Y)(43Q,75M,122E,210W,211K)(75I, 77L, 116Y,151M,184V)

Page 3: New Directions in Statistical Graph Mining

3

DNA Sequence

RNA

Texts in literature

Graph Structures in Biology

CC OC

CC

C

H

A C G C

Amitriptyline inhibits adenosine uptake

H

H

H

HH

Compounds

CG

CG

U U U U

UA

Page 4: New Directions in Statistical Graph Mining

4

Structurization of statistical methods

Statistical methods for numerical vectorsSVM, Boosting, Clustering, Regression, PCA, CCA..

Structurization: Modify the algorithm so that it accepts structured data as input

Recipe:If the algorithm has a forward feature selection step, replace it with a top-k pattern mining algorithmIf not, the algorithm has to be engineered to have a forward feature selection step

Page 5: New Directions in Statistical Graph Mining

5

Pattern Mining Algorithms

Pattern: Substructure Subset, subtree, subgraph etc

Combinatorial algorithm to enumerate all patterns satisfying a criterion

Frequency in databaseWeighted frequencyTop-k patterns for a given evaluation function

See a textbook for data mining (e.g., Han and Kamber, 2002)

A B

A B C D A B

A

B C

Page 6: New Directions in Statistical Graph Mining

6

Naïve StructurizationApply frequent pattern mining and create the complete feature spaceThen, apply a statistical method

Too many patterns!It is likely to miss important features if frequency threshold (minimum support) is set highMemory overflowSee tutorial by Xifeng Yan (KDD08)

Page 7: New Directions in Statistical Graph Mining

7

Progressive collection of patterns

Most statistical methods are iterativeBoosting, PLS, Clustering etc

In each iteration, call a pattern mining algorithm to collect a few patterns

Discover the optimal patterns based on the current parameter

Page 8: New Directions in Statistical Graph Mining

8

Feature space is gradually constructed

0/1 vector of pattern indicatorsDimensionality grows slightly in each iteration

patterns

Page 9: New Directions in Statistical Graph Mining

9

Structurization examples from my group

Page 10: New Directions in Statistical Graph Mining

10

gBoost (Saigo et al., 2006, 2008)

Graph Classification: L1-SVM + Graph MiningGraph Regression: LASSO + Graph Mining

Page 11: New Directions in Statistical Graph Mining

11

Page 12: New Directions in Statistical Graph Mining

12

Related papers from other groups

Stream Prediction Using a Generative Model Based on Frequent Episodes in Event Sequences. Srivatsan Laxman, VikramTankasali, Ryen W. White. (KDD 2008)

HMM + Sequence Mining

Fast Logistic Regression for Text Categorization with Variable-Length N-grams. Georgiana Ifrim, Goekhan Bakir, Gerhard Weikum. (KDD 2008)

Logistic regression + Sequence Mining

Page 13: New Directions in Statistical Graph Mining

13

Why not kernels?

Kernels take all features into accountGraph kernels, Tree kernels, Weighted degree kernels

The prediction rules are not accountable

Rescue: Mine the prediction rule of SVMPOIMS by S. Sonnenburg et al. (2008)Also meaningful direction to pursue

Page 14: New Directions in Statistical Graph Mining

14

Three generations of structurization

1G: AdaBoost (2004)

2G: L1 regularization & Column generation (- 2008)

gBoost, gLARS, iBoost etc..

3G: Krylov subspace methodsPartial least squares, Lanczos methods, conjugate gradient

#iter

1000s

100s

10s

Page 15: New Directions in Statistical Graph Mining

15

OverviewQuick Review on Graph MiningGraph Partial Least Squares Regression (gPLS) KDD 2008

Graph Principal Component Analysis (gPCA) ICDM 2008

Page 16: New Directions in Statistical Graph Mining

16

Quick Review of Graph Mining

Page 17: New Directions in Statistical Graph Mining

17

Graph MiningAnalysis of Graph Databases

Find all patterns satisfying predetermined conditionsFrequent Substructure Mining

Combinatorial, ExhaustiveRecently developed

AGM (Inokuchi et al., 2000), gspan (Yan et al., 2002), Gaston (2004)

Page 18: New Directions in Statistical Graph Mining

18

Graph Mining

Frequent Substructure MiningEnumerate all patterns occurred in at least m graphs

:Indicator of pattern k in graph i

Support(k): # of occurrence of pattern k

Page 19: New Directions in Statistical Graph Mining

19

Gspan (Yan and Han, 2002)

Efficient Frequent Substructure Mining MethodDFS Code

Efficient detection of isomorphic patterns

Extend Gspan for our works

Page 20: New Directions in Statistical Graph Mining

20

Enumeration on Tree-shaped Search Space

Each node has a patternGenerate nodes from the root:

Add an edge at each step

Page 21: New Directions in Statistical Graph Mining

21

Tree PruningAnti-monotonicity:

If support(g) < m, stop exploring!

Not generated

Support(g): # of occurrence of pattern g

Page 22: New Directions in Statistical Graph Mining

22

Discriminative patterns:Weighted Substructure Mining

w_i > 0: positive classw_i < 0: negative classWeighted Substructure Mining

Patterns with large frequency differenceNot Anti-Monotonic: Use a bound for pruning

Page 23: New Directions in Statistical Graph Mining

23

Graph PLS

Page 24: New Directions in Statistical Graph Mining

24

Problem:QSAR (Quantitative Structure Activity Relationship)

• Input: Set of chemical compounds• Output: Activity

3

1.2

?

Chemical compounds

Activity

Page 25: New Directions in Statistical Graph Mining

25

Binary representation of feature vector

• Binary vector of pattern indicators• Huge dimensionality

– Too expensive to mine at once– Progressive feature collection

Cl

C C

C

CC

CC

OC

C

C

C

C

C

CC

CC

= ( 1, 0, … 1, 0, … 1, 0, 0, ……….…)

Page 26: New Directions in Statistical Graph Mining

26

PLS (Partial Least Squares Regression)

• Design matrix: X• Response: y• Dimensionality reduction

– Weight matrix W: Project examples to a lower dimensional space

– Latent components T: Projected examples T=WX

• Learn W to maximize the covariance between y and T

• How to perform PLS without accessing all columns of X ?

Page 27: New Directions in Statistical Graph Mining

27

Graph PLS

• PLS calls graph mining repeatedly• In i-th iteration,

– A set of graph patterns corresponding to i-thlatent component ti is collected by weighted subgraph mining

PLS gSpanweights

patterns

654321

Latent components

Page 28: New Directions in Statistical Graph Mining

28

PLS algorithm

Page 29: New Directions in Statistical Graph Mining

29

Idea of structurization

• Pre-weight vector computation (step 2)

• Collect the features with major pre-weights by weighted substructure mining

• Other pre-weights are set to zero

Page 30: New Directions in Statistical Graph Mining

30

Graph PLS

Page 31: New Directions in Statistical Graph Mining

31

Classification/regression performance

• 6 chemical compound datasets, 5 fold cross-validation• Accuracies of gPLS are competitive, but gPLS is

substantially faster. – Numerical computation much simpler than LP

over 24h0.8835095710039965AIDS3

over 24h0.7471675030040939AIDS2

0.7522997830.7730.06522901324AIDS1

0.86739186300.87014.13579437CAS

0.86234422.80.8620.47426.8684CPDB

0.63983.315.60.6470.002516.059EDKB

AUC/Q2Numerical Time

Mining Time

AUC/Q2Numerical TIme

Mining Time

datasize

gBoostgPLS

Page 32: New Directions in Statistical Graph Mining

32

gPLS efficiency• “Frequent mining + PLS” vs gPLS• gPLS scales better in larger maximum

pattern size threshold (maxpat)

Page 33: New Directions in Statistical Graph Mining

33

Latent components by gPLS10987654321

Page 34: New Directions in Statistical Graph Mining

34

Summary (gPLS)

• Introduction of a new PLS algorithm for use with graph mining.

• High efficiency compared with the naïve counterpart.

• Graph mining algorithm can be replaced with other frequent pattern mining algorithms such as itemset mining, sequence mining and tree mining.

Page 35: New Directions in Statistical Graph Mining

35

Graph PCA

Page 36: New Directions in Statistical Graph Mining

36

Graph PCA methodDesign matrix XGram matrixDerive eigenvalue and eigenvector of A

Equivalent to kernel PCA with linear kernelHow to perform eigendecomposition without accessing all columns of X ?

Page 37: New Directions in Statistical Graph Mining

37

Lanczos algorithm

Find an orthonormal matrix Q such that

is a tridiagonal matrixQ =(q1,…,qm): Lanczos vectorsThen, T is easily eigendecomposed

Eigenvalues preserved in TEigenvectors of A are recovered as

Eigenvectors of TEigenvectors of A

Page 38: New Directions in Statistical Graph Mining

38

Lanczos method for computing all eigenvectors

Page 39: New Directions in Statistical Graph Mining

39

Importance of initial Lanczosvector q1

If q1 corresponds to the first eigenvector, Lanczosfinishes in n iterations, creating all eigenvectorsTypically, the first vector is not known, so more iterations are necessary

If one needs only major eigenvectors, early stopping is possible

Major eigenvalues converge fasterNumber of iterations determined by convergence bound (Saad, 1980)

Page 40: New Directions in Statistical Graph Mining

40

Idea of structurization

SubstituteLanczos accesses X only through

Collect the features whose |gki| are large

Implemented by weighted subgraph mining

Page 41: New Directions in Statistical Graph Mining

41

Page 42: New Directions in Statistical Graph Mining

42

Three dimensional plot of EDKB-ER

Page 43: New Directions in Statistical Graph Mining

43

Principal Components of EDKB-ER

Page 44: New Directions in Statistical Graph Mining

44

Principal Components of EDKB-AR

Page 45: New Directions in Statistical Graph Mining

45

Summary (gPCA)

In gPCA, patterns are summarized in several components

Each component characterize different aspects of graph database

Easier to browse than frequent patterns

Page 46: New Directions in Statistical Graph Mining

46

ConclusionCombination of mining and learning algorithms result in a very powerful frameworkInterpretability is the keySuccessful applications to biological and computer vision domainsFurther applications possible to many different fields

Page 47: New Directions in Statistical Graph Mining

47

Publication and software

iboost: itemset boostinghttp://www.kyb.mpg.de/bs/people/hiroto/iboost/

gboost: graph boosting http://www.kyb.mpg.de/bs/people/nowozin/gboost/

pboost: sequence boosting http://www.kyb.mpg.de/bs/people/nowozin/pboost/

H. Saigo et al., KDD 2008H. Saigo and K. Tsuda, ICDM 2008H. Saigo et al., Machine Learning, to appearH. Saigo et al., Bioinformatics, 2007S. Nowozin et al., CVPR 2007S. Nowozin et al., ICCV 2007K. Tsuda, ICML 2006, ICML 2007