1
New Directions in Statistical Graph Mining
Max Planck Institute for Biological Cybernetics
Koji Tsuda
Joint work with Hiroto Saigo, Sebastian Nowozin, Nicole Kraemer, Takeaki Uno and Taku Kudo
2
Graphs and itemsets
(40F,41L,43E,210W,211K,215Y)(43Q,75M,122E,210W,211K)(75I, 77L, 116Y,151M,184V)
3
Graph Structures in Biology
[Figure: graph-structured data in biology: DNA sequence (A C G C), RNA, chemical compounds, and texts in literature ("Amitriptyline inhibits adenosine uptake")]
4
Structurization of statistical methods
• Statistical methods for numerical vectors
– SVM, Boosting, Clustering, Regression, PCA, CCA, ...
• Structurization: modify the algorithm so that it accepts structured data as input
• Recipe:
– If the algorithm has a forward feature selection step, replace it with a top-k pattern mining algorithm
– If not, the algorithm has to be engineered to have a forward feature selection step
5
Pattern Mining Algorithms
• Pattern: substructure
– Subset, subtree, subgraph, etc.
• Combinatorial algorithm to enumerate all patterns satisfying a criterion
– Frequency in the database
– Weighted frequency
– Top-k patterns for a given evaluation function
• See a data mining textbook (e.g., Han and Kamber, 2002)
[Figure: example patterns with node labels A, B, C, D]
6
Naïve Structurization
• Apply frequent pattern mining and create the complete feature space
• Then, apply a statistical method
• Too many patterns!
– Important features are likely to be missed if the frequency threshold (minimum support) is set high
– Memory overflow
– See the tutorial by Xifeng Yan (KDD08)
7
Progressive collection of patterns
• Most statistical methods are iterative
– Boosting, PLS, Clustering, etc.
• In each iteration, call a pattern mining algorithm to collect a few patterns
• Discover the optimal patterns based on the current parameters
8
Feature space is gradually constructed
• 0/1 vector of pattern indicators
• Dimensionality grows slightly in each iteration
[Figure: feature space; axis label: patterns]
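The progressive collection described on slides 7 and 8 can be sketched as a forward stagewise loop. This is only an illustration under stated assumptions: itemsets stand in for graphs, a brute-force search stands in for the pattern miner, and the names `indicator`, `mine_best_pattern` and `progressive_fit` are hypothetical and not taken from the talk.

```python
from itertools import combinations

def indicator(pattern, transaction):
    """1.0 if every item of the pattern occurs in the transaction, else 0.0."""
    return 1.0 if set(pattern) <= transaction else 0.0

def mine_best_pattern(transactions, weights, max_size=2):
    """Stand-in miner (assumption): brute-force search for the itemset with
    the largest absolute weighted frequency |sum_i w_i * x_i|."""
    items = sorted(set().union(*transactions))
    best, best_score = None, 0.0
    for size in range(1, max_size + 1):
        for pattern in combinations(items, size):
            score = abs(sum(w * indicator(pattern, t)
                            for w, t in zip(weights, transactions)))
            if score > best_score:
                best, best_score = pattern, score
    return best

def progressive_fit(transactions, y, n_iter=3):
    """One mined pattern joins the feature space in each iteration."""
    residual = list(y)
    model = []  # growing list of (pattern, coefficient)
    for _ in range(n_iter):
        # The current parameters enter the miner through the residual.
        pattern = mine_best_pattern(transactions, residual)
        if pattern is None:
            break  # no pattern correlates with the residual any more
        column = [indicator(pattern, t) for t in transactions]
        coef = sum(r * x for r, x in zip(residual, column)) / sum(column)
        model.append((pattern, coef))
        residual = [r - coef * x for r, x in zip(residual, column)]
    return model, residual

db = [{"A", "B"}, {"A"}, {"B"}, {"C"}]
y = [2.0, 1.0, 1.0, 0.0]
model, residual = progressive_fit(db, y, n_iter=3)
print(model)
print(residual)
```

Only the columns of the patterns that were actually selected are ever built, which is the point of the progressive scheme.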
9
Structurization examples from my group
10
gBoost (Saigo et al., 2006, 2008)
• Graph Classification: L1-SVM + Graph Mining
• Graph Regression: LASSO + Graph Mining
11
12
Related papers from other groups
Stream Prediction Using a Generative Model Based on Frequent Episodes in Event Sequences. Srivatsan Laxman, Vikram Tankasali, Ryen W. White. (KDD 2008)
HMM + Sequence Mining
Fast Logistic Regression for Text Categorization with Variable-Length N-grams. Georgiana Ifrim, Goekhan Bakir, Gerhard Weikum. (KDD 2008)
Logistic regression + Sequence Mining
13
Why not kernels?
• Kernels take all features into account
– Graph kernels, Tree kernels, Weighted degree kernels
• The prediction rules are not interpretable
• Rescue: mine the prediction rule of the SVM
– POIMs by S. Sonnenburg et al. (2008)
– Also a meaningful direction to pursue
14
Three generations of structurization
• 1G: AdaBoost (2004), #iter: 1000s
• 2G: L1 regularization & column generation (- 2008), #iter: 100s
– gBoost, gLARS, iBoost, etc.
• 3G: Krylov subspace methods, #iter: 10s
– Partial least squares, Lanczos methods, conjugate gradient
15
Overview
• Quick Review on Graph Mining
• Graph Partial Least Squares Regression (gPLS), KDD 2008
• Graph Principal Component Analysis (gPCA), ICDM 2008
16
Quick Review of Graph Mining
17
Graph Mining
• Analysis of graph databases
• Find all patterns satisfying predetermined conditions
– Frequent substructure mining
• Combinatorial, exhaustive
• Recently developed
– AGM (Inokuchi et al., 2000), gSpan (Yan and Han, 2002), Gaston (2004)
18
Graph Mining
• Frequent Substructure Mining
– Enumerate all patterns that occur in at least m graphs
• $x_{ik} \in \{0, 1\}$: indicator of pattern k in graph i
• Support(k) $= \sum_i x_{ik}$: number of occurrences of pattern k
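As a small illustration of the support definition, the sketch below counts occurrences from a 0/1 indicator matrix. The matrix `X` and the function names are made up for the example; in practice the matrix is never fully materialized, which is the point of the mining algorithms that follow.

```python
# Rows are graphs, columns are patterns; entry [i][k] is 1 if
# pattern k occurs in graph i (toy data, assumed for illustration).
X = [
    [1, 0, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 0],
]

def support(X, k):
    """Number of graphs (rows) that contain pattern k."""
    return sum(row[k] for row in X)

def frequent_patterns(X, m):
    """Indices of the patterns that occur in at least m graphs."""
    return [k for k in range(len(X[0])) if support(X, k) >= m]

print([support(X, k) for k in range(3)])  # [3, 2, 1]
print(frequent_patterns(X, 2))            # [0, 1]
```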
19
gSpan (Yan and Han, 2002)
• Efficient frequent substructure mining method
• DFS code
– Efficient detection of isomorphic patterns
• We extend gSpan for our work
20
Enumeration on Tree-shaped Search Space
• Each node has a pattern
• Generate nodes from the root:
– Add an edge at each step
21
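Tree Pruning
• Anti-monotonicity: a pattern can never occur more often than any of its subpatterns
• If support(g) < m, stop exploring!
– The descendants of g are not generated
• Support(g): number of occurrences of pattern g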
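The tree-shaped enumeration with anti-monotone pruning can be sketched on itemsets, where "add an edge" becomes "add an item". This is a simplified stand-in and not gSpan itself: itemsets need no isomorphism check, and the function name `mine_frequent_itemsets` is hypothetical.

```python
def mine_frequent_itemsets(transactions, m):
    """Enumerate all itemsets contained in at least m transactions.

    The search tree grows by appending one item at a time in sorted order,
    so each itemset is generated exactly once. A branch is cut as soon as
    its support drops below m, because no superset can have larger support.
    """
    items = sorted(set().union(*transactions))
    results = {}

    def grow(pattern, covering, start):
        for idx in range(start, len(items)):
            item = items[idx]
            new_cover = [t for t in covering if item in t]
            if len(new_cover) < m:
                continue  # anti-monotone pruning: the subtree is not generated
            new_pattern = pattern + (item,)
            results[new_pattern] = len(new_cover)
            grow(new_pattern, new_cover, idx + 1)

    grow((), [set(t) for t in transactions], 0)
    return results

db = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "D"}]
freq = mine_frequent_itemsets(db, m=2)
for pattern, count in sorted(freq.items()):
    print(pattern, count)
```

With m=2, the item D and every itemset containing it are never expanded, since D alone already falls below the threshold.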
22
Discriminative patterns: Weighted Substructure Mining
• w_i > 0: positive class
• w_i < 0: negative class
• Weighted substructure mining
– Patterns with large frequency difference
– Not anti-monotonic: use a bound for pruning
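One way to prune without anti-monotonicity is to bound the score of every superpattern. The sketch below assumes the score is the absolute weighted frequency |sum_i w_i x_i| and again uses itemsets in place of graphs. The bound max(positive mass, negative mass) over the covered examples is a standard one for this score, but the exact criterion and bound used on the slide are not shown in the transcript, so treat both as assumptions. The name `best_weighted_itemset` is hypothetical.

```python
def best_weighted_itemset(transactions, weights):
    """Find the itemset maximizing |sum_i w_i * x_i|, where x_i is 1 if
    transaction i contains the itemset and 0 otherwise."""
    items = sorted(set().union(*transactions))
    best = {"pattern": None, "gain": 0.0}

    def grow(pattern, cover, start):
        for idx in range(start, len(items)):
            item = items[idx]
            new_cover = [i for i in cover if item in transactions[i]]
            if not new_cover:
                continue
            pos = sum(weights[i] for i in new_cover if weights[i] > 0)
            neg = -sum(weights[i] for i in new_cover if weights[i] < 0)
            gain = abs(pos - neg)
            new_pattern = pattern + (item,)
            if gain > best["gain"]:
                best["pattern"], best["gain"] = new_pattern, gain
            # A superpattern covers a subset of new_cover, so its gain is at
            # most max(pos, neg). Expand only if that could beat the best.
            if max(pos, neg) > best["gain"]:
                grow(new_pattern, new_cover, idx + 1)

    grow((), list(range(len(transactions))), 0)
    return best["pattern"], best["gain"]

db = [{"A", "B"}, {"A", "B", "C"}, {"A", "C"}, {"C"}]
w = [1.0, 1.0, -1.0, -1.0]  # positive class > 0, negative class < 0
print(best_weighted_itemset(db, w))
```

Note that the gain of ("A",) is smaller than that of its superset ("A", "B"), which is exactly why plain support-based pruning cannot be used here.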
23
Graph PLS
24
Problem: QSAR (Quantitative Structure Activity Relationship)
• Input: Set of chemical compounds
• Output: Activity
[Figure: chemical compounds with known activities (3, 1.2) and a compound whose activity is unknown (?)]
25
Binary representation of feature vector
• Binary vector of pattern indicators
• Huge dimensionality
– Too expensive to mine at once
– Progressive feature collection
[Figure: a chemical compound and its subgraph patterns, encoded as the indicator vector (1, 0, …, 1, 0, …, 1, 0, 0, …)]
26
PLS (Partial Least Squares Regression)
• Design matrix: X
• Response: y
• Dimensionality reduction
– Weight matrix W: projects examples to a lower-dimensional space
– Latent components T: projected examples, T = XW
• Learn W to maximize the covariance between y and T
• How to perform PLS without accessing all columns of X?
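For reference, a dense PLS regression with a single response can be sketched as below. This is one common non-deflation formulation, assuming X and y are already centered; it is not necessarily the exact algorithm shown on the "PLS algorithm" slide, whose content is not in the transcript. All function names are illustrative.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(X, v):
    """X v for a row-major list-of-lists matrix."""
    return [dot(row, v) for row in X]

def rmatvec(X, r):
    """X^T r: one entry per feature (column)."""
    return [sum(X[i][j] * r[i] for i in range(len(X))) for j in range(len(X[0]))]

def pls1(X, y, m):
    """Return up to m orthonormal latent components and the fitted values."""
    residual = list(y)
    fitted = [0.0] * len(y)
    components = []
    for _ in range(m):
        w = rmatvec(X, residual)      # pre-weights: covariance with the residual
        t = matvec(X, w)              # candidate latent component
        for prev in components:       # Gram-Schmidt against earlier components
            c = dot(prev, t)
            t = [ti - c * pi for ti, pi in zip(t, prev)]
        norm = math.sqrt(dot(t, t))
        if norm < 1e-12:
            break                     # nothing left to explain
        t = [ti / norm for ti in t]
        coef = dot(y, t)
        fitted = [f + coef * ti for f, ti in zip(fitted, t)]
        residual = [r - coef * ti for r, ti in zip(residual, t)]
        components.append(t)
    return components, fitted

# Toy centered data: y lies in the column space of X.
X = [[2.0, 0.0], [-2.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
y = [1.0, -1.0, 1.0, -1.0]
components, fitted = pls1(X, y, m=2)
print(fitted)
```

The only place where all columns of X are touched is `rmatvec`, which is the step the last bullet asks about.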
27
Graph PLS
• PLS calls graph mining repeatedly
• In the i-th iteration,
– A set of graph patterns corresponding to the i-th latent component t_i is collected by weighted subgraph mining
[Figure: PLS passes weights to gSpan, which returns patterns; latent components 1 to 6]
28
PLS algorithm
29
Idea of structurization
• Pre-weight vector computation (step 2)
• Collect the features with major pre-weights by weighted substructure mining
• Other pre-weights are set to zero
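The three bullets above can be illustrated on a small dense matrix. In this sketch the pre-weights are computed for every column and then truncated, whereas gPLS is described as obtaining the major ones directly by weighted substructure mining without building all columns. The function names and the toy data are assumptions for illustration.

```python
def sparse_preweights(X, residual, k):
    """Pre-weights w = X^T r, keeping only the k largest in magnitude.
    All other pre-weights are treated as zero (they are simply omitted)."""
    n, d = len(X), len(X[0])
    w = [sum(X[i][j] * residual[i] for i in range(n)) for j in range(d)]
    keep = sorted(range(d), key=lambda j: abs(w[j]), reverse=True)[:k]
    return {j: w[j] for j in sorted(keep)}

def latent_component(X, sparse_w):
    """Unnormalized latent component t = X w using only the kept features."""
    return [sum(row[j] * wj for j, wj in sparse_w.items()) for row in X]

# Rows: graphs, columns: pattern indicators (toy data).
X = [
    [1, 0, 1],
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 1],
]
residual = [1.0, 0.5, -1.0, -0.25]
w = sparse_preweights(X, residual, k=2)
t = latent_component(X, w)
print(w)  # {0: 1.5, 2: 0.75}
print(t)  # [2.25, 1.5, 0.0, 0.75]
```

The residual plays the role of the example weights handed to the miner: features whose indicator column correlates strongly with it receive large pre-weights.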
30
Graph PLS
31
Classification/regression performance
• 6 chemical compound datasets, 5-fold cross-validation
• Accuracies of gPLS are competitive, but gPLS is substantially faster.
– Numerical computation is much simpler than LP
              gPLS                                    gBoost
data   size   Mining Time  Numerical Time  AUC/Q2    Mining Time  Numerical Time  AUC/Q2
EDKB   59     16.0         0.0025          0.647     15.6         83.3            0.639
CPDB   684    26.8         0.474           0.862     22.8         344             0.862
CAS    9437   357          14.1            0.870     8630         391             0.867
AIDS1  1324   290          0.0652          0.773     783          299             0.752
AIDS2  40939  50300        167             0.747     over 24h
AIDS3  39965  57100        509             0.883     over 24h
32
gPLS efficiency
• "Frequent mining + PLS" vs gPLS
• gPLS scales better as the maximum pattern size threshold (maxpat) grows
33
Latent components by gPLS
[Figure: latent components 1 to 10]
34
Summary (gPLS)
• Introduction of a new PLS algorithm for use with graph mining.
• High efficiency compared with the naïve counterpart.
• Graph mining algorithm can be replaced with other frequent pattern mining algorithms such as itemset mining, sequence mining and tree mining.
35
Graph PCA
36
Graph PCA method
• Design matrix X
• Gram matrix $A = XX^\top$
• Derive the eigenvalues and eigenvectors of A
• Equivalent to kernel PCA with a linear kernel
• How to perform eigendecomposition without accessing all columns of X?
37
Lanczos algorithm
• Find an orthonormal matrix Q such that $T = Q^\top A Q$ is a tridiagonal matrix
• $Q = (q_1, \ldots, q_m)$: Lanczos vectors
• Then, T is easily eigendecomposed
– Eigenvalues are preserved in T
– Eigenvectors of A are recovered as $u = Qv$, where v is an eigenvector of T and u is the corresponding eigenvector of A
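The tridiagonalization can be sketched with the textbook Lanczos recurrence. This is a minimal dense version with full reorthogonalization, written for a small symmetric matrix A; it is not the implementation behind the slides, and the function name and toy matrix are assumptions. The eigendecomposition of the small matrix T is left to any standard routine.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def lanczos(A, q1, m):
    """Return Lanczos vectors Q plus the diagonal (alphas) and the
    off-diagonal (betas) of the tridiagonal matrix T = Q^T A Q."""
    n = len(A)
    norm = math.sqrt(dot(q1, q1))
    q = [x / norm for x in q1]
    Q, alphas, betas = [q], [], []
    prev, beta = [0.0] * n, 0.0
    for _ in range(m):
        # v = A q_j - beta_{j-1} q_{j-1}
        v = [dot(row, q) - beta * p for row, p in zip(A, prev)]
        alpha = dot(q, v)
        v = [vi - alpha * qi for vi, qi in zip(v, q)]
        for u in Q:  # full reorthogonalization for numerical stability
            c = dot(u, v)
            v = [vi - c * ui for vi, ui in zip(v, u)]
        alphas.append(alpha)
        beta = math.sqrt(dot(v, v))
        if beta < 1e-12 or len(Q) == m:
            break  # invariant subspace found, or m vectors collected
        betas.append(beta)
        prev, q = q, [vi / beta for vi in v]
        Q.append(q)
    return Q, alphas, betas

A = [[2.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 4.0]]
Q, alphas, betas = lanczos(A, [1.0, 1.0, 1.0], m=3)
print(alphas, betas)
```

A only enters through the product of A with a vector, which is the property the structurization relies on.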
38
Lanczos method for computing all eigenvectors
39
Importance of initial Lanczos vector q1
• If q1 corresponds to the first eigenvector, Lanczos finishes in n iterations, creating all eigenvectors
• Typically, the first vector is not known, so more iterations are necessary
• If one needs only major eigenvectors, early stopping is possible
– Major eigenvalues converge faster
– Number of iterations determined by convergence bound (Saad, 1980)
40
Idea of structurization
• Substitute $A = XX^\top$
• Lanczos accesses X only through $g_i = X^\top q_i$
• Collect the features whose $|g_{ki}|$ are large
• Implemented by weighted subgraph mining
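A sketch of how the matrix-vector product inside Lanczos could be approximated from a few features. It assumes the Gram matrix is A = X X^T (linear kernel), so that A q = X (X^T q), and it truncates g = X^T q to its k largest entries in magnitude. Here g is computed densely for illustration, whereas the slides describe obtaining the large entries by weighted subgraph mining with the entries of q as weights. The function name and data are hypothetical.

```python
def approx_gram_matvec(X, q, k):
    """Approximate (X X^T) q using only the k features with the largest |g_j|,
    where g = X^T q. Returns the approximate product and the kept features."""
    n, d = len(X), len(X[0])
    g = [sum(X[i][j] * q[i] for i in range(n)) for j in range(d)]
    keep = sorted(sorted(range(d), key=lambda j: abs(g[j]), reverse=True)[:k])
    product = [sum(X[i][j] * g[j] for j in keep) for i in range(n)]
    return product, keep

X = [
    [1, 0, 1],
    [1, 1, 0],
    [0, 1, 0],
]
q = [1.0, 0.5, -1.0]
approx, kept = approx_gram_matvec(X, q, k=2)
exact, _ = approx_gram_matvec(X, q, k=3)
print(kept)    # [0, 2]
print(approx)  # [2.5, 1.5, 0.0]
print(exact)   # [2.5, 1.0, -0.5]
```

With k equal to the number of features the product is exact; smaller k trades accuracy for never having to touch the remaining columns.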
41
42
Three dimensional plot of EDKB-ER
43
Principal Components of EDKB-ER
44
Principal Components of EDKB-AR
45
Summary (gPCA)
• In gPCA, patterns are summarized in several components
• Each component characterizes a different aspect of the graph database
• Easier to browse than frequent patterns
46
Conclusion
• The combination of mining and learning algorithms results in a very powerful framework
• Interpretability is the key
• Successful applications to biological and computer vision domains
• Further applications are possible in many different fields
47
Publication and software
iboost: itemset boosting http://www.kyb.mpg.de/bs/people/hiroto/iboost/
gboost: graph boosting http://www.kyb.mpg.de/bs/people/nowozin/gboost/
pboost: sequence boosting http://www.kyb.mpg.de/bs/people/nowozin/pboost/
H. Saigo et al., KDD 2008
H. Saigo and K. Tsuda, ICDM 2008
H. Saigo et al., Machine Learning, to appear
H. Saigo et al., Bioinformatics, 2007
S. Nowozin et al., CVPR 2007
S. Nowozin et al., ICCV 2007
K. Tsuda, ICML 2006, ICML 2007