csce822 data mining and warehousing
DESCRIPTION
CSCE822 Data Mining and Warehousing. Lecture 16 Graph Mining MW 4:00PM-5:15PM Dr. Jianjun Hu http://mleg.cse.sc.edu/edu/csce822. University of South Carolina Department of Computer Science and Engineering. Roadmap. What, Why Graph Mining? Methods for Mining Frequent Subgraphs - PowerPoint PPT PresentationTRANSCRIPT
Lecture 16 Graph MiningMW 4:00PM-5:15PM
Dr. Jianjun Huhttp://mleg.cse.sc.edu/edu/csce822
CSCE822 Data Mining and Warehousing
University of South CarolinaDepartment of Computer Science and Engineering
RoadmapWhat, Why Graph Mining?
Methods for Mining Frequent Subgraphs
Mining Variant and Constrained Substructure Patterns
Applications:
Classification and Clustering Graph Indexing Similarity Search
Summary
April 21, 2023Mining and Searching Graphs in Graph Databases
April 21, 2023Mining and Searching Graphs in Graph Databases
Graph, Graph, Everywhere
Aspirin Yeast protein interaction network
fro
m H
. Je
on
g e
t a
l Na
ture
41
1, 4
1 (
200
1)
Internet Co-author network
RelationshipsInteractionsconnections
April 21, 2023Mining and Searching Graphs in Graph Databases
Why Graph Mining?
Graphs are ubiquitous Chemical compounds (Cheminformatics)
Protein structures, biological pathways/networks (Bioinformactics)
Program control flow, traffic flow, and workflow analysis
XML databases, Web, and social network analysis
Graph is a general model Trees, lattices, sequences, and items are degenerated graphs
Diversity of graphs Directed vs. undirected, labeled vs. unlabeled (edges & vertices), weighted,
with angles & geometry (topological vs. 2-D/3-D)
Complexity of algorithms: many problems are of high
complexity
April 21, 2023Mining and Searching Graphs in Graph Databases
Graph Pattern Mining
Frequent subgraphs
A (sub)graph is frequent if its support (occurrence
frequency) in a given dataset is no less than a minimum
support threshold
Applications of graph pattern mining
Mining biochemical structures
Program control flow analysis
Mining XML structures or Web communities
Building blocks for graph classification, clustering,
compression, comparison, and correlation analysis
April 21, 2023Mining and Searching Graphs in Graph Databases
Example: Frequent Subgraphs
GRAPH DATASET
FREQUENT PATTERNS(MIN SUPPORT IS 2)
(A) (B) (C)
(1) (2)
April 21, 2023Mining and Searching Graphs in Graph Databases
Graph Mining Algorithms
Incomplete beam search – Greedy (Subdue 1994)
Inductive logic programming (WARMR 1998)
Graph theory-based approaches
Apriori-based approach
Pattern-growth approach
Graph Definitionsa
b a
c c
b
(a) Labeled Graph
pq
p
p
rs
tr
t
qp
a
a
c
b
(b) Subgraph
p
s
t
p
a
a
c
b
(c) Induced Subgraph
p
rs
tr
p
April 21, 2023Mining and Searching Graphs in Graph Databases
SUBDUE (Holder et al. KDD’94)
Start with single vertices
Expand best substructures with a new edge
Limit the number of best substructures
Substructures are evaluated based on their ability to
compress input graphs Using minimum description length (DL) Best substructure S in graph G minimizes: DL(S) +
DL(G\S)Terminate until no new substructure is discovered
April 21, 2023Mining and Searching Graphs in Graph Databases
Frequent Subgraph Mining Approaches
Apriori-based approach AGM/AcGM: Inokuchi, et al. (PKDD’00) FSG: Kuramochi and Karypis (ICDM’01) PATH#: Vanetik and Gudes (ICDM’02, ICDM’04) FFSM: Huan, et al. (ICDM’03)
Pattern growth approach MoFa, Borgelt and Berthold (ICDM’02) gSpan: Yan and Han (ICDM’02) Gaston: Nijssen and Kok (KDD’04)
Apriori-like AlgorithmFind frequent 1-subgraphsRepeat
Candidate generation Use frequent (k-1)-subgraphs to generate candidate k-subgraph
Candidate pruning Prune candidate subgraphs that contain infrequent
(k-1)-subgraphs Support counting
Count the support of each remaining candidate Eliminate candidate k-subgraphs that are infrequent
In practice, it is not as easy. There are many other issues
April 21, 2023Mining and Searching Graphs in Graph Databases
Apriori-Based Approach
…
G
G1
G2
Gn
k-edge(k+1)-edge
G’
G’’
JOIN
Pruning
G1
G2
SupportCounting
G1
G2
Elimination
G1
Candidate GenerationIn Apriori:
Merging two frequent k-itemsets will produce a candidate (k+1)-itemset
In frequent subgraph mining (vertex/edge growing) Merging two frequent k-subgraphs may produce more
than one candidate (k+1)-subgraph
Vertex Growinga
a
e
a
p
q
r
p
a
a
a
p
rr
d
G1 G2
p
000
00
00
0
1
q
rp
rp
qpp
MG
000
0
00
00
2
r
rrp
rp
pp
MG
a
a
a
p
q
r
ep
0000
0000
00
000
00
3
q
r
rrp
rp
qpp
MG
G3 = join(G1,G2)
dr+
Edge Growing
a
a
f
a
p
q
r
p
a
a
a
p
rr
f
G1 G2
p
a
a
a
p
q
r
fp
G3 = join(G1,G2)
r+
April 21, 2023Mining and Searching Graphs in Graph Databases
FFSM (Huan, et al. ICDM’03)
Represent graphs using canonical adjacency matrix (CAM)Join two CAMs or extend a CAM to generate a new graphStore the embeddings of CAMs
All of the embeddings of a pattern in the database Can derive the embeddings of newly generated CAMs
Example: Dataset
a
b
e
c
p
q
rp
a
b
d
p
r
G1 G2
q
e
c
a
p q
r
b
p
G3
d
rd
r
(a,b,p) (a,b,q) (a,b,r) (b,c,p) (b,c,q) (b,c,r) … (d,e,r)G1 1 0 0 0 0 1 … 0G2 1 0 0 0 0 0 … 0G3 0 0 1 1 0 0 … 0G4 0 0 0 0 0 0 … 0
a eq
c
d
p p
p
G4
r
Example
p
a b c d ek=1FrequentSubgraphs
a b
pc d
pc e
qa e
rb d
pa b
d
r
pd c
e
p
(Pruned candidate)
Minimum support count = 2
k=2FrequentSubgraphs
k=3CandidateSubgraphs
Graph IsomorphismA graph is isomorphic if it is topologically equivalent
to another graph
A
A
A A
B A
B
A
B
B
A
A
B B
B
B
Graph Isomorphism
Graph IsomorphismTest for graph isomorphism is needed:
During candidate generation step, to determine whether a candidate has been generated
During candidate pruning step, to check whether its (k-1)-subgraphs are frequent
During candidate counting, to check whether a candidate is contained within another graph
Graph IsomorphismUse canonical labeling to handle isomorphism
Map each graph into an ordered string representation (known as its code) such that two isomorphic graphs will be mapped to the same canonical encoding
Example: Lexicographically largest adjacency matrixFind the permutations of the vertices so that the adjacency matrix
is lexicographically maximized when read off from left to right, one row at a time.
Graph Isomorphism Example:
Lexicographically largest adjacency matrix
0110
1011
1100
0100
String: 0010001111010110
0001
0011
0101
1110
Canonical: 0111101011001000
Lexicographically largest adjacency matrixFind the permutations of the vertices so that the
adjacency matrix is lexicographically maximized when read off from left to right, one row at a time.
April 21, 2023Mining and Searching Graphs in Graph Databases
Graph Pattern Explosion Problem
If a graph is frequent, all of its subgraphs are frequent ─
the Apriori property
An n-edge frequent graph may have 2n subgraphs
Among 422 chemical compounds which are confirmed to
be active in an AIDS antiviral screen dataset, there are
1,000,000 frequent graph patterns if the minimum support
is 5%
April 21, 2023Mining and Searching Graphs in Graph Databases
Closed Frequent Graphs
Motivation: Handling graph pattern explosion problemClosed frequent graph
A frequent graph G is closed if there exists no supergraph of G that carries the same support as G
If some of G’s subgraphs have the same support, it is unnecessary to output these subgraphs (nonclosed graphs)
Lossless compression: still ensures that the mining result is complete
April 21, 2023Mining and Searching Graphs in Graph Databases
CLOSEGRAPH (Yan & Han, KDD’03)
…
A Pattern-Growth Approach
G
G1
G2
Gn
k-edge
(k+1)-edge
At what condition, can westop searching their children
i.e., early termination?
If G and G’ are frequent, G is a subgraph of G’. If in any part of the graph in the dataset where G occurs, G’ also occurs, then we need not grow G, since none of G’s children will be closed except those of G’.
April 21, 2023Mining and Searching Graphs in Graph Databases
Experimental ResultThe AIDS antiviral screen compound dataset from
NCI/NIH
The dataset contains 43,905 chemical compounds
Among these 43,905 compounds, 423 of them belongs
to CA, 1081 are of CM, and the remaining are in class
CI
April 21, 2023 Mining and Searching Graphs in Graph Databases
Discovered Patterns
20% 10%
5%
April 21, 2023Mining and Searching Graphs in Graph Databases
Graph MiningMethods for Mining Frequent Subgraphs
Mining Variant and Constrained Substructure Patterns
Applications:
Graph Indexing Similarity Search Classification and Clustering
Summary
April 21, 2023Mining and Searching Graphs in Graph Databases
Constrained Patterns
DegreeDensityDiameterConnectivityMin, Max, Avg
April 21, 2023Mining and Searching Graphs in Graph Databases
Pattern-Growth ApproachFind a small frequent candidate graph
Remove vertices (shadow graph) whose degree is less than the connectivity
Decompose it to extract the subgraphs satisfying the connectivity constraint
Stop decomposing when the subgraph has been checked before
Extend this candidate graph by adding new vertices and edges
Repeat
April 21, 2023Mining and Searching Graphs in Graph Databases
Graph MiningMethods for Mining Frequent Subgraphs
Mining Variant and Constrained Substructure Patterns
Applications:
Classification and Clustering Graph Indexing Similarity Search
Summary
April 21, 2023Mining and Searching Graphs in Graph Databases
Graph ClusteringGraph similarity measure
Feature-based similarity measureEach graph is represented as a feature vector
The similarity is defined by the distance of their corresponding vectors
Frequent subgraphs can be used as features
Structure-based similarity measureMaximal common subgraph
Graph edit distance: insertion, deletion, and relabel
Graph alignment distance
April 21, 2023Mining and Searching Graphs in Graph Databases
Graph ClassificationLocal structure based approach
Local structures in a graph, e.g., neighbors surrounding a vertex, paths with fixed length
Graph pattern-based approach Subgraph patterns from domain knowledge Subgraph patterns from data mining
Kernel-based approach Random walk (Gärtner ’02, Kashima et al. ’02, ICML’03,
Mahé et al. ICML’04) Optimal local assignment (Fröhlich et al. ICML’05)
April 21, 2023Mining and Searching Graphs in Graph Databases
Graph Pattern-Based Classification
Subgraph patterns from domain knowledge Molecular descriptors
Subgraph patterns from data mining General idea
Each graph is represented as a feature vector x = {x1,
x2, …, xn}, where xi is the frequency of the i-th pattern
in that graph Each vector is associated with a class label Classify these vectors in a vector space
April 21, 2023Mining and Searching Graphs in Graph Databases
Graph MiningMethods for Mining Frequent Subgraphs
Mining Variant and Constrained Substructure Patterns
Applications:
Classification and Clustering Graph Indexing Similarity Search
Summary
April 21, 2023Mining and Searching Graphs in Graph Databases
Graph Search
Querying graph databases: Given a graph database and a query graph, find all
the graphs containing this query graph
query graph graph database
Scalability IssueSequential scan
Disk I/Os Subgraph isomorphism testing
An indexing mechanism is needed DayLight: Daylight.com (commercial) GraphGrep: Dennis Shasha, et al. PODS'02 Grace: Srinath Srinivasa, et al. ICDE'03
April 21, 2023Mining and Searching Graphs in Graph Databases
April 21, 2023Mining and Searching Graphs in Graph Databases
Indexing Strategy
Graph (G)
Substructure
Query graph (Q)
If graph G contains query graph Q, G should contain any substructure of Q
Remarks Index substructures of a query graph to prune graphs that do not
contain these substructures
April 21, 2023Mining and Searching Graphs in Graph Databases
Indexing Framework
Two steps in processing graph queriesStep 1. Index Construction
Enumerate structures in the graph database, build an inverted index between structures and graphs
Step 2. Query Processing Enumerate structures in the query graph Calculate the candidate graphs containing
these structures Prune the false positive answers by
performing subgraph isomorphism test
April 21, 2023Mining and Searching Graphs in Graph Databases
Graph MiningMethods for Mining Frequent Subgraphs
Mining Variant and Constrained Substructure Patterns
Applications:
Classification and Clustering Graph Indexing Similarity Search
Summary
April 21, 2023 Mining and Searching Graphs in Graph Databases
Structure Similarity Search
(a) caffeine (b) diurobromine (c) viagra
• CHEMICAL COMPOUNDS
• QUERY GRAPH
April 21, 2023 Mining and Searching Graphs in Graph Databases
Feature-Graph Matrix
G1 G2 G3 G4 G5
f10 1 0 1 1
f20 1 0 0 1
f31 0 1 1 1
f41 0 0 0 1
f50 0 1 1 0
Assume a query graph has 5 features and at most 2 features to miss due to the relaxation threshold
graphs in databasefe
atu
res
April 21, 2023Mining and Searching Graphs in Graph Databases
Summary: Graph Mining
Graph mining has wide applications
Frequent and closed subgraph mining methods gSpan and CloseGraph: pattern-growth depth-first search approach
Graph indexing techniques Frequent and discriminative subgraphs are high-quality indexing features
Similarity search in graph databases Indexing and feature-based matching
Further development and application exploration