csce822 data mining and warehousing

Lecture 16 Graph MiningMW 4:00PM-5:15PM

Dr. Jianjun Huhttp://mleg.cse.sc.edu/edu/csce822

CSCE822 Data Mining and Warehousing

University of South CarolinaDepartment of Computer Science and Engineering

RoadmapWhat, Why Graph Mining?

Methods for Mining Frequent Subgraphs

Mining Variant and Constrained Substructure Patterns

Applications:

Classification and Clustering Graph Indexing Similarity Search

Summary

April 21, 2023Mining and Searching Graphs in Graph Databases


Graph, Graph, Everywhere

Aspirin Yeast protein interaction network

fro

m H

. Je

on

g e

t a

l Na

ture

41

1, 4

1 (

200

1)

Internet Co-author network

RelationshipsInteractionsconnections


Why Graph Mining?

Graphs are ubiquitous Chemical compounds (Cheminformatics)

Protein structures, biological pathways/networks (Bioinformactics)

Program control flow, traffic flow, and workflow analysis

XML databases, Web, and social network analysis

Graph is a general model Trees, lattices, sequences, and items are degenerated graphs

Diversity of graphs Directed vs. undirected, labeled vs. unlabeled (edges & vertices), weighted,

with angles & geometry (topological vs. 2-D/3-D)

Complexity of algorithms: many problems are of high

complexity


Graph Pattern Mining

Frequent subgraphs

A (sub)graph is frequent if its support (occurrence

frequency) in a given dataset is no less than a minimum

support threshold

Applications of graph pattern mining

Mining biochemical structures

Program control flow analysis

Mining XML structures or Web communities

Building blocks for graph classification, clustering,

compression, comparison, and correlation analysis


Example: Frequent Subgraphs

GRAPH DATASET

FREQUENT PATTERNS(MIN SUPPORT IS 2)

(A) (B) (C)

(1) (2)


Graph Mining Algorithms

Incomplete beam search – Greedy (Subdue 1994)

Inductive logic programming (WARMR 1998)

Graph theory-based approaches

Apriori-based approach

Pattern-growth approach

Graph Definitionsa

b a

c c

b

(a) Labeled Graph

pq

p

p

rs

tr

t

qp

a

a

c

b

(b) Subgraph

p

s

t

p

a

a

c

b

(c) Induced Subgraph

p

rs

tr

p


SUBDUE (Holder et al. KDD’94)

Start with single vertices

Expand best substructures with a new edge

Limit the number of best substructures

Substructures are evaluated based on their ability to

compress input graphs Using minimum description length (DL) Best substructure S in graph G minimizes: DL(S) +

DL(G\S)Terminate until no new substructure is discovered


Frequent Subgraph Mining Approaches

Apriori-based approach AGM/AcGM: Inokuchi, et al. (PKDD’00) FSG: Kuramochi and Karypis (ICDM’01) PATH#: Vanetik and Gudes (ICDM’02, ICDM’04) FFSM: Huan, et al. (ICDM’03)

Pattern growth approach MoFa, Borgelt and Berthold (ICDM’02) gSpan: Yan and Han (ICDM’02) Gaston: Nijssen and Kok (KDD’04)

Apriori-like AlgorithmFind frequent 1-subgraphsRepeat

Candidate generation Use frequent (k-1)-subgraphs to generate candidate k-subgraph

Candidate pruning Prune candidate subgraphs that contain infrequent

(k-1)-subgraphs Support counting

Count the support of each remaining candidate Eliminate candidate k-subgraphs that are infrequent

In practice, it is not as easy. There are many other issues


Apriori-Based Approach

…

G

G1

G2

Gn

k-edge(k+1)-edge

G’

G’’

JOIN

Pruning

G1

G2

SupportCounting

G1

G2

Elimination

G1

Candidate GenerationIn Apriori:

Merging two frequent k-itemsets will produce a candidate (k+1)-itemset

In frequent subgraph mining (vertex/edge growing) Merging two frequent k-subgraphs may produce more

than one candidate (k+1)-subgraph

Vertex Growinga

a

e

a

p

q

r

p

a

a

a

p

rr

d

G1 G2

p

000

00

00

0

1

q

rp

rp

qpp

MG

000

0

00

00

2

r

rrp

rp

pp

MG

a

a

a

p

q

r

ep

0000

0000

00

000

00

3

q

r

rrp

rp

qpp

MG

G3 = join(G1,G2)

dr+

Edge Growing

a

a

f

a

p

q

r

p

a

a

a

p

rr

f

G1 G2

p

a

a

a

p

q

r

fp

G3 = join(G1,G2)

r+


FFSM (Huan, et al. ICDM’03)

Represent graphs using canonical adjacency matrix (CAM)Join two CAMs or extend a CAM to generate a new graphStore the embeddings of CAMs

All of the embeddings of a pattern in the database Can derive the embeddings of newly generated CAMs

Example: Dataset

a

b

e

c

p

q

rp

a

b

d

p

r

G1 G2

q

e

c

a

p q

r

b

p

G3

d

rd

r

(a,b,p) (a,b,q) (a,b,r) (b,c,p) (b,c,q) (b,c,r) … (d,e,r)G1 1 0 0 0 0 1 … 0G2 1 0 0 0 0 0 … 0G3 0 0 1 1 0 0 … 0G4 0 0 0 0 0 0 … 0

a eq

c

d

p p

p

G4

r

Example

p

a b c d ek=1FrequentSubgraphs

a b

pc d

pc e

qa e

rb d

pa b

d

r

pd c

e

p

(Pruned candidate)

Minimum support count = 2

k=2FrequentSubgraphs

k=3CandidateSubgraphs

Graph IsomorphismA graph is isomorphic if it is topologically equivalent

to another graph

A

A

A A

B A

B

A

B

B

A

A

B B

B

B

Graph Isomorphism

Graph IsomorphismTest for graph isomorphism is needed:

During candidate generation step, to determine whether a candidate has been generated

During candidate pruning step, to check whether its (k-1)-subgraphs are frequent

During candidate counting, to check whether a candidate is contained within another graph

Graph IsomorphismUse canonical labeling to handle isomorphism

Map each graph into an ordered string representation (known as its code) such that two isomorphic graphs will be mapped to the same canonical encoding

Example: Lexicographically largest adjacency matrixFind the permutations of the vertices so that the adjacency matrix

is lexicographically maximized when read off from left to right, one row at a time.

Graph Isomorphism Example:

Lexicographically largest adjacency matrix

0110

1011

1100

0100

String: 0010001111010110

0001

0011

0101

1110

Canonical: 0111101011001000

Lexicographically largest adjacency matrixFind the permutations of the vertices so that the

adjacency matrix is lexicographically maximized when read off from left to right, one row at a time.


Graph Pattern Explosion Problem

If a graph is frequent, all of its subgraphs are frequent ─

the Apriori property

An n-edge frequent graph may have 2n subgraphs

Among 422 chemical compounds which are confirmed to

be active in an AIDS antiviral screen dataset, there are

1,000,000 frequent graph patterns if the minimum support

is 5%


Closed Frequent Graphs

Motivation: Handling graph pattern explosion problemClosed frequent graph

A frequent graph G is closed if there exists no supergraph of G that carries the same support as G

If some of G’s subgraphs have the same support, it is unnecessary to output these subgraphs (nonclosed graphs)

Lossless compression: still ensures that the mining result is complete


CLOSEGRAPH (Yan & Han, KDD’03)

…

A Pattern-Growth Approach

G

G1

G2

Gn

k-edge

(k+1)-edge

At what condition, can westop searching their children

i.e., early termination?

If G and G’ are frequent, G is a subgraph of G’. If in any part of the graph in the dataset where G occurs, G’ also occurs, then we need not grow G, since none of G’s children will be closed except those of G’.


Experimental ResultThe AIDS antiviral screen compound dataset from

NCI/NIH

The dataset contains 43,905 chemical compounds

Among these 43,905 compounds, 423 of them belongs

to CA, 1081 are of CM, and the remaining are in class

CI

April 21, 2023 Mining and Searching Graphs in Graph Databases

Discovered Patterns

20% 10%

5%


Graph MiningMethods for Mining Frequent Subgraphs


Applications:

Graph Indexing Similarity Search Classification and Clustering

Summary


Constrained Patterns

DegreeDensityDiameterConnectivityMin, Max, Avg


Pattern-Growth ApproachFind a small frequent candidate graph

Remove vertices (shadow graph) whose degree is less than the connectivity

Decompose it to extract the subgraphs satisfying the connectivity constraint

Stop decomposing when the subgraph has been checked before

Extend this candidate graph by adding new vertices and edges

Repeat




Applications:


Summary


Graph ClusteringGraph similarity measure

Feature-based similarity measureEach graph is represented as a feature vector

The similarity is defined by the distance of their corresponding vectors

Frequent subgraphs can be used as features

Structure-based similarity measureMaximal common subgraph

Graph edit distance: insertion, deletion, and relabel

Graph alignment distance


Graph ClassificationLocal structure based approach

Local structures in a graph, e.g., neighbors surrounding a vertex, paths with fixed length

Graph pattern-based approach Subgraph patterns from domain knowledge Subgraph patterns from data mining

Kernel-based approach Random walk (Gärtner ’02, Kashima et al. ’02, ICML’03,

Mahé et al. ICML’04) Optimal local assignment (Fröhlich et al. ICML’05)


Graph Pattern-Based Classification

Subgraph patterns from domain knowledge Molecular descriptors

Subgraph patterns from data mining General idea

Each graph is represented as a feature vector x = {x1,

x2, …, xn}, where xi is the frequency of the i-th pattern

in that graph Each vector is associated with a class label Classify these vectors in a vector space




Applications:


Summary


Graph Search

Querying graph databases: Given a graph database and a query graph, find all

the graphs containing this query graph

query graph graph database

Scalability IssueSequential scan

Disk I/Os Subgraph isomorphism testing

An indexing mechanism is needed DayLight: Daylight.com (commercial) GraphGrep: Dennis Shasha, et al. PODS'02 Grace: Srinath Srinivasa, et al. ICDE'03



Indexing Strategy

Graph (G)

Substructure

Query graph (Q)

If graph G contains query graph Q, G should contain any substructure of Q

Remarks Index substructures of a query graph to prune graphs that do not

contain these substructures


Indexing Framework

Two steps in processing graph queriesStep 1. Index Construction

Enumerate structures in the graph database, build an inverted index between structures and graphs

Step 2. Query Processing Enumerate structures in the query graph Calculate the candidate graphs containing

these structures Prune the false positive answers by

performing subgraph isomorphism test




Applications:


Summary


Structure Similarity Search

(a) caffeine (b) diurobromine (c) viagra

• CHEMICAL COMPOUNDS

• QUERY GRAPH


Feature-Graph Matrix

G1 G2 G3 G4 G5

f10 1 0 1 1

f20 1 0 0 1

f31 0 1 1 1

f41 0 0 0 1

f50 0 1 1 0

Assume a query graph has 5 features and at most 2 features to miss due to the relaxation threshold

graphs in databasefe

atu

res


Summary: Graph Mining

Graph mining has wide applications

Frequent and closed subgraph mining methods gSpan and CloseGraph: pattern-growth depth-first search approach

Graph indexing techniques Frequent and discriminative subgraphs are high-quality indexing features

Similarity search in graph databases Indexing and feature-based matching

Further development and application exploration

csce822 data mining and warehousing

Documents

graph classification

graph databasesexample

graph g

graph miningmw

graph databasessubdue

subgraphs support

algorithmfind frequent

prune candidate subgraphs