new directions in statistical graph mining

1

New Directions in Statistical Graph Mining

Max Planck Institute for Biological Cybernetics

Koji Tsuda

Joint work with Hiroto Saigo, Sebastian Nowozin, Nicole Kraemer, Takeaki Uno and Taku Kudo

2

Graphs and itemsets

(40F,41L,43E,210W,211K,215Y)(43Q,75M,122E,210W,211K)(75I, 77L, 116Y,151M,184V)

3

DNA Sequence

RNA

Texts in literature

Graph Structures in Biology

CC OC

CC

C

H

A C G C

Amitriptyline inhibits adenosine uptake

H

H

H

HH

Compounds

CG

CG

U U U U

UA

4

Structurization of statistical methods

Statistical methods for numerical vectorsSVM, Boosting, Clustering, Regression, PCA, CCA..

Structurization: Modify the algorithm so that it accepts structured data as input

Recipe:If the algorithm has a forward feature selection step, replace it with a top-k pattern mining algorithmIf not, the algorithm has to be engineered to have a forward feature selection step

5

Pattern Mining Algorithms

Pattern: Substructure Subset, subtree, subgraph etc

Combinatorial algorithm to enumerate all patterns satisfying a criterion

Frequency in databaseWeighted frequencyTop-k patterns for a given evaluation function

See a textbook for data mining (e.g., Han and Kamber, 2002)

A B

A B C D A B

A

B C

6

Naïve StructurizationApply frequent pattern mining and create the complete feature spaceThen, apply a statistical method

Too many patterns!It is likely to miss important features if frequency threshold (minimum support) is set highMemory overflowSee tutorial by Xifeng Yan (KDD08)

7

Progressive collection of patterns

Most statistical methods are iterativeBoosting, PLS, Clustering etc

In each iteration, call a pattern mining algorithm to collect a few patterns

Discover the optimal patterns based on the current parameter

8

Feature space is gradually constructed

0/1 vector of pattern indicatorsDimensionality grows slightly in each iteration

patterns

9

Structurization examples from my group

10

gBoost (Saigo et al., 2006, 2008)

Graph Classification: L1-SVM + Graph MiningGraph Regression: LASSO + Graph Mining

12

Related papers from other groups

Stream Prediction Using a Generative Model Based on Frequent Episodes in Event Sequences. Srivatsan Laxman, VikramTankasali, Ryen W. White. (KDD 2008)

HMM + Sequence Mining

Fast Logistic Regression for Text Categorization with Variable-Length N-grams. Georgiana Ifrim, Goekhan Bakir, Gerhard Weikum. (KDD 2008)

Logistic regression + Sequence Mining

13

Why not kernels?

Kernels take all features into accountGraph kernels, Tree kernels, Weighted degree kernels

The prediction rules are not accountable

Rescue: Mine the prediction rule of SVMPOIMS by S. Sonnenburg et al. (2008)Also meaningful direction to pursue

14

Three generations of structurization

1G: AdaBoost (2004)

2G: L1 regularization & Column generation (- 2008)

gBoost, gLARS, iBoost etc..

3G: Krylov subspace methodsPartial least squares, Lanczos methods, conjugate gradient

#iter

1000s

100s

10s

15

OverviewQuick Review on Graph MiningGraph Partial Least Squares Regression (gPLS) KDD 2008

Graph Principal Component Analysis (gPCA) ICDM 2008

16

Quick Review of Graph Mining

17

Graph MiningAnalysis of Graph Databases

Find all patterns satisfying predetermined conditionsFrequent Substructure Mining

Combinatorial, ExhaustiveRecently developed

AGM (Inokuchi et al., 2000), gspan (Yan et al., 2002), Gaston (2004)

18

Graph Mining

Frequent Substructure MiningEnumerate all patterns occurred in at least m graphs

:Indicator of pattern k in graph i

Support(k): # of occurrence of pattern k

19

Gspan (Yan and Han, 2002)

Efficient Frequent Substructure Mining MethodDFS Code

Efficient detection of isomorphic patterns

Extend Gspan for our works

20

Enumeration on Tree-shaped Search Space

Each node has a patternGenerate nodes from the root:

Add an edge at each step

21

Tree PruningAnti-monotonicity:

If support(g) < m, stop exploring!

Not generated

Support(g): # of occurrence of pattern g

22

Discriminative patterns:Weighted Substructure Mining

w_i > 0: positive classw_i < 0: negative classWeighted Substructure Mining

Patterns with large frequency differenceNot Anti-Monotonic: Use a bound for pruning

23

Graph PLS

24

Problem:QSAR (Quantitative Structure Activity Relationship)

• Input: Set of chemical compounds• Output: Activity

3

1.2

?

Chemical compounds

Activity

25

Binary representation of feature vector

• Binary vector of pattern indicators• Huge dimensionality

– Too expensive to mine at once– Progressive feature collection

Cl

C C

C

CC

CC

OC

C

C

C

C

C

CC

CC

= ( 1, 0, … 1, 0, … 1, 0, 0, ……….…)

26

PLS (Partial Least Squares Regression)

• Design matrix: X• Response: y• Dimensionality reduction

– Weight matrix W： Project examples to a lower dimensional space

– Latent components T： Projected examples T=WX

• Learn W to maximize the covariance between y and T

• How to perform PLS without accessing all columns of X ?

27

Graph PLS

• PLS calls graph mining repeatedly• In i-th iteration,

– A set of graph patterns corresponding to i-thlatent component ti is collected by weighted subgraph mining

PLS gSpanweights

patterns

654321

Latent components

28

PLS algorithm

29

Idea of structurization

• Pre-weight vector computation (step 2)

• Collect the features with major pre-weights by weighted substructure mining

• Other pre-weights are set to zero

30

Graph PLS

31

Classification/regression performance

• 6 chemical compound datasets, 5 fold cross-validation• Accuracies of gPLS are competitive, but gPLS is

substantially faster. – Numerical computation much simpler than LP

over 24h0.8835095710039965AIDS3

over 24h0.7471675030040939AIDS2

0.7522997830.7730.06522901324AIDS1

0.86739186300.87014.13579437CAS

0.86234422.80.8620.47426.8684CPDB

0.63983.315.60.6470.002516.059EDKB

AUC/Q2Numerical Time

Mining Time

AUC/Q2Numerical TIme

Mining Time

datasize

gBoostgPLS

32

gPLS efficiency• “Frequent mining + PLS” vs gPLS• gPLS scales better in larger maximum

pattern size threshold (maxpat)

33

Latent components by gPLS10987654321

34

Summary (gPLS)

• Introduction of a new PLS algorithm for use with graph mining.

• High efficiency compared with the naïve counterpart.

• Graph mining algorithm can be replaced with other frequent pattern mining algorithms such as itemset mining, sequence mining and tree mining.

35

Graph PCA

36

Graph PCA methodDesign matrix XGram matrixDerive eigenvalue and eigenvector of A

Equivalent to kernel PCA with linear kernelHow to perform eigendecomposition without accessing all columns of X ?

37

Lanczos algorithm

Find an orthonormal matrix Q such that

is a tridiagonal matrixQ =(q1,…,qm): Lanczos vectorsThen, T is easily eigendecomposed

Eigenvalues preserved in TEigenvectors of A are recovered as

Eigenvectors of TEigenvectors of A

38

Lanczos method for computing all eigenvectors

39

Importance of initial Lanczosvector q1

If q1 corresponds to the first eigenvector, Lanczosfinishes in n iterations, creating all eigenvectorsTypically, the first vector is not known, so more iterations are necessary

If one needs only major eigenvectors, early stopping is possible

Major eigenvalues converge fasterNumber of iterations determined by convergence bound (Saad, 1980)

40

Idea of structurization

SubstituteLanczos accesses X only through

Collect the features whose |gki| are large

Implemented by weighted subgraph mining

42

Three dimensional plot of EDKB-ER

43

Principal Components of EDKB-ER

44

Principal Components of EDKB-AR

45

Summary (gPCA)

In gPCA, patterns are summarized in several components

Each component characterize different aspects of graph database

Easier to browse than frequent patterns

46

ConclusionCombination of mining and learning algorithms result in a very powerful frameworkInterpretability is the keySuccessful applications to biological and computer vision domainsFurther applications possible to many different fields

47

Publication and software

iboost: itemset boostinghttp://www.kyb.mpg.de/bs/people/hiroto/iboost/

gboost: graph boosting http://www.kyb.mpg.de/bs/people/nowozin/gboost/

pboost: sequence boosting http://www.kyb.mpg.de/bs/people/nowozin/pboost/

H. Saigo et al., KDD 2008H. Saigo and K. Tsuda, ICDM 2008H. Saigo et al., Machine Learning, to appearH. Saigo et al., Bioinformatics, 2007S. Nowozin et al., CVPR 2007S. Nowozin et al., ICCV 2007K. Tsuda, ICML 2006, ICML 2007

new directions in statistical graph mining

Documents