Graph-Based Data Mining and Applications
István Jónyer
Department of Computer Science, Oklahoma State University
Outline
- Graph-Based Data Mining
  - Applications
- Graph Grammar Induction
  - Applications
- Conclusions
Graph-Based Data Mining
- Graphs are the most expressive data structures in computer science
- Intuitively represent complex domains
- Without repetition of data
- Simple building blocks:
  - Labeled vertices
  - Labeled directed/undirected edges
Subdue
- Subdue finds frequently occurring subgraphs
- Returns the best ones according to the minimum description length (MDL) heuristic
- Features:
  - Discovery, clustering and concept learning
  - Inexact graph matching
  - Parallel/distributed discovery
  - Background knowledge
Graph Representation
- Input is a labeled (vertices and edges) directed graph
- A substructure is a connected subgraph
- An instance of a substructure is an isomorphic subgraph of the input graph
- Input graph is compressed by replacing instances with a vertex representing the substructure
[Figure: Input Database (objects R1, C1, T1 to T4, S1 to S4); Substructure S1 (graph form), built from "object" vertices with "shape" edges to "triangle" and "square" and an "on" edge; Compressed Database (R1, C1 and the S1 vertices)]
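The compression step above can be sketched as follows. This is a minimal illustration, not Subdue's own code: the function name `compress`, the dictionary/tuple graph encoding and the toy input are all assumptions made for the example, and instances are assumed not to overlap.

```python
def compress(vertices, edges, instances, sub_label):
    """Replace each instance with a single vertex labeled sub_label.

    vertices:  dict mapping vertex id -> label
    edges:     list of (source id, target id, edge label)
    instances: list of sets of vertex ids (assumed not to overlap)
    """
    new_vertices = dict(vertices)
    owner = {}  # original vertex id -> id of the vertex that replaces it
    for i, instance in enumerate(instances):
        new_id = f"{sub_label}_{i}"
        new_vertices[new_id] = sub_label
        for v in instance:
            owner[v] = new_id
            del new_vertices[v]
    new_edges = []
    for src, dst, label in edges:
        if src in owner and dst in owner and owner[src] == owner[dst]:
            continue  # edge lies inside one instance, so it disappears
        new_edges.append((owner.get(src, src), owner.get(dst, dst), label))
    return new_vertices, new_edges


# Hypothetical toy input: two triangle-on-square instances resting on a rectangle.
vertices = {"t1": "triangle", "s1": "square", "t2": "triangle",
            "s2": "square", "r1": "rectangle"}
edges = [("t1", "s1", "on"), ("t2", "s2", "on"),
         ("s1", "r1", "on"), ("s2", "r1", "on")]
compressed_vertices, compressed_edges = compress(
    vertices, edges, [{"t1", "s1"}, {"t2", "s2"}], "S1")
print(compressed_vertices)
print(compressed_edges)
```

Edges that cross an instance boundary are kept and re-attached to the new vertex; only edges internal to an instance are dropped.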
Subdue Discovery Algorithm
1. Create substructure for each unique vertex label
[Figure: input graph with vertices labeled circle, rectangle, triangle and square, connected by "on" edges]
Substructures: triangle (4), square (4), circle (1), rectangle (1)
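Step 1 amounts to counting vertex labels. A minimal sketch, assuming the vertex labels of the example graph are available as a flat list (the list below is written out by hand to match the counts on the slide):

```python
from collections import Counter

# Vertex labels of the example graph, written out by hand (an assumption
# chosen to match the counts listed on the slide).
vertex_labels = ["circle", "rectangle"] + ["triangle", "square"] * 4

counts = Counter(vertex_labels)
# One initial substructure per unique label, most frequent first.
initial_substructures = sorted(counts.items(), key=lambda item: -item[1])
print(initial_substructures)
```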
Subdue Discovery Algorithm
2. Expand best substructures by an edge or edge+neighboring vertex
[Figure: the input graph and the substructures obtained by extending single vertices along "on" edges, such as triangle on square]
Subdue Discovery Algorithm
3. Evaluate substructures using MDL
4. Keep only best beam-width substructures on queue
5. Terminate when queue is empty or #discovered substructures >= limit
6. Compress graph and repeat to generate hierarchical description
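Steps 2 to 5 describe a beam search. The sketch below shows only the control flow and is not Subdue's implementation: to stay short it searches substrings of a string instead of subgraphs, and `beam_search`, `expand`, `score` and the toy input are all assumptions. The score is a size-based stand-in for MDL.

```python
def beam_search(initial, expand, score, beam_width=4, limit=100):
    """Generic beam search: keep the best beam_width candidates per round."""
    def rank(candidates):
        # Highest score first; ties broken alphabetically for repeatability.
        return sorted(candidates, key=lambda c: (-score(c), c))

    queue = rank(initial)[:beam_width]
    discovered = []
    while queue and len(discovered) < limit:
        discovered.extend(queue)
        children = {child for parent in queue for child in expand(parent)}
        queue = rank(children)[:beam_width]
    return rank(discovered)[:limit]


text = "abcabdabcabd"  # toy stand-in for the input graph


def expand(pattern):
    """Extend every occurrence of pattern by the character that follows it."""
    children = set()
    start = text.find(pattern)
    while start != -1:
        end = start + len(pattern)
        if end < len(text):
            children.add(pattern + text[end])
        start = text.find(pattern, start + 1)
    return children


def score(pattern):
    """Original size divided by (pattern size + compressed size)."""
    compressed = len(text) - text.count(pattern) * (len(pattern) - 1)
    return len(text) / (len(pattern) + compressed)


best = beam_search(set(text), expand, score)[0]
print(best, score(best))
```

On this toy input the search settles on the repeated unit `abcabd`, which halves the string when each occurrence is replaced by one symbol.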
Graph Compression and MDL
- Minimum Description Length (MDL) principle
  - Best theory minimizes description length of theory and data given theory
- Best substructure is the one with shortest description length of substructure definition DL(S) + compressed graph DL(G|S)
  DL(G,S) = DL(S) + DL(G|S)
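A worked instance of DL(G,S) = DL(S) + DL(G|S), using a crude size measure (vertices plus edges) in place of a true bit-level encoding. The function names and all counts below are hypothetical, and instances are assumed not to overlap.

```python
def graph_size(n_vertices, n_edges):
    """Crude description length: number of vertices plus number of edges."""
    return n_vertices + n_edges


def compressed_counts(graph, sub, n_instances):
    """Vertex and edge counts after replacing each instance by one vertex."""
    graph_v, graph_e = graph
    sub_v, sub_e = sub
    return (graph_v - n_instances * sub_v + n_instances,
            graph_e - n_instances * sub_e)


def description_length(graph, sub, n_instances):
    """DL(G,S) = DL(S) + DL(G|S) under the size measure above."""
    return graph_size(*sub) + graph_size(*compressed_counts(graph, sub, n_instances))


graph = (10, 11)  # hypothetical input graph: 10 vertices, 11 edges
pair = description_length(graph, (2, 1), 4)    # 2-vertex substructure, 4 instances
single = description_length(graph, (1, 0), 4)  # 1-vertex substructure, 4 instances
print(pair, single)
```

The two-vertex substructure gives the shorter total, so it would be preferred; a single-vertex substructure compresses nothing and only adds its own definition.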
Biochemistry Application: Clustering of a DNA sequence
[Figure: substructures of the DNA molecule found by clustering, with coverage of 61%, 68% and 71%]
DNA Sequence
- Four bases constitute a four-letter alphabet that cells use to store genetic information.
- Molecular biologists can break up a DNA molecule and determine its base sequence, which can be stored as a character string in a computer:
TTCAGCCGATATCCTGGTCAGATTCTCTAAGTCGGCTATAGGACCAGTCTAAGAGA
Backbone Representation
- “Base” vertices allow “don’t-care” positions.
- Accounting for overlapping substructures is also possible.
[Figure: chain of "base" vertices joined by "next" edges; each base has a "name" edge to its letter (A, C, T or G)]
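A minimal sketch of how a base string could be turned into this backbone graph. The vertex ids (`base0`, `name0`, ...) and the function name are assumptions; only the labels "base", "next" and "name" come from the slide.

```python
def backbone_graph(sequence):
    """Build a backbone graph: one "base" vertex per position, chained by
    "next" edges, each with a "name" edge to a vertex holding its letter."""
    vertices = {}  # vertex id -> label
    edges = []     # (source id, target id, edge label)
    for i, letter in enumerate(sequence):
        base_id, name_id = f"base{i}", f"name{i}"
        vertices[base_id] = "base"
        vertices[name_id] = letter
        edges.append((base_id, name_id, "name"))
        if i > 0:
            edges.append((f"base{i - 1}", base_id, "next"))
    return vertices, edges


vertices, edges = backbone_graph("TTCAG")
print(len(vertices), "vertices,", len(edges), "edges")
```

Because every position has the same "base" label, a substructure can include a base without its "name" edge, which is what makes a don't-care position possible.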
Represent Web as Graph
- Breadth-first search of domain to generate graph
- Nodes represent pages / documents
- Edges represent hyperlinks
- Additional nodes represent document keywords
[Figure: example Web graph with "page" vertices joined by "hyperlink" edges, and "word" edges to keyword vertices (university, texas, learning, group, projects, subdue, robotics, parallel, work, planning)]
Query: Find all pages which link to a page containing term ‘Subdue’
Subgraph vertices:
  1 page URL: http://cygnus.uta.edu
  7 page URL: http://cygnus.uta.edu/projects.html
  8 Subdue
  [1->7] hyperlink
  [7->8] word
[Figure: query graph: page, hyperlink, page, word, Subdue]
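The query can be read as a two-edge pattern: page, hyperlink, page, word, term. A minimal sketch over the matched subgraph above; the function name and the graph encoding are assumptions, and the match on the term is exact and case-sensitive.

```python
# The matched subgraph from the slide: vertex id -> label, and labeled edges.
vertices = {1: "page", 7: "page", 8: "Subdue"}
edges = [(1, 7, "hyperlink"), (7, 8, "word")]


def pages_linking_to_term(vertices, edges, term):
    """Return ids of pages that hyperlink to a page containing the term."""
    # Pages that contain the term through a "word" edge.
    has_term = {src for src, dst, label in edges
                if label == "word" and vertices[dst] == term}
    # Pages with a "hyperlink" edge into one of those pages.
    return sorted(src for src, dst, label in edges
                  if label == "hyperlink" and dst in has_term)


result = pages_linking_to_term(vertices, edges, "Subdue")
print(result)
```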
Outline
- Graph-Based Data Mining
  - Applications
- Graph Grammar Induction
  - Applications
- Conclusions
Direction of Research
- One direction of graph-based data mining research is towards efficient algorithms
  - AGM
  - FSG
  - gSpan
- We address another need
  - Increasing the expressive power of graph-based theories
- We develop a Graph Grammar Induction algorithm
Related Work: Graph-Based Systems
- AGM: A. Inokuchi, T. Washio and H. Motoda. An Apriori-based Algorithm for Mining Frequent Substructures from Graph Data. Proceedings of the European Conference on Principles and Practice of Knowledge Discovery in Databases, 2000.
- FSG: M. Kuramochi and G. Karypis. An Efficient Algorithm for Discovering Frequent Subgraphs. Technical Report 02-026, Department of Computer Science, University of Minnesota, 2002.
- gSpan: X. Yan and J. Han. gSpan: Graph-Based Substructure Pattern Mining. Proceedings of the International Conference on Data Mining (ICDM), 2002.
- Apriori-based, association rule discovery
- Find all frequent subgraphs with minimum support
- Emphasis is on efficiency
Graph Grammars
- Graph grammar production: S → P
  - S is a non-terminal, single vertex
    - Hence grammar is context-free
  - P is any graph containing terminals and/or non-terminals
Graph Grammars: Recursion
- Recursive production: S → P S | P
- P linked to S via a single edge
- Algorithm exponential in linking edges
[Figure: example recursive production over a graph with vertices a, b and c]
Graph Grammars: Variables
- Variable production:
  - S → P1 | P2 | … | Pn (discrete)
  - S → [Pmin … Pmax] (continuous)
  - P restricted to single terminal vertex
[Figure: example productions, S1 as a graph of a, b and S2, and S2 → c | d | f]
Graph Grammars: Relationships
- Relationships
  - Between continuous values (=, <=)
  - Between discrete values (=)
[Figure: "Air Crash" production S with vertices air speed, visibility, lighting on and landing gear out, variables D1, D2 and C1, the value 220, and "=" and "<=" relationship edges]
Discovering Recursion
- For each discovered substructure
  - Check for subsets of instances P connected by the same single edge e
  - If found, form production S → P S | P, where P is connected to S by edge e
- Abstraction:
  - Each matching chain compressed to single vertex labeled S
- Algorithm is exponential in number of edges considered between instances
  - Note: two edges needed for S → a S b
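The chain check can be sketched as follows. This is a simplified illustration, not SubdueGL's procedure: it assumes the links between instances through the chosen edge e have already been found, that each instance has at most one successor, and the names `find_chains`, `p1` ... `p4` are hypothetical.

```python
def find_chains(links):
    """Group instances into chains.

    links: list of (instance, next instance) pairs, all joined by the same
    edge label e. Each instance is assumed to have at most one successor.
    """
    successor = dict(links)
    has_predecessor = set(successor.values())
    chains = []
    for start in successor:
        if start in has_predecessor:
            continue  # not the head of a chain
        chain = [start]
        # Follow successors; the membership test guards against cycles.
        while chain[-1] in successor and successor[chain[-1]] not in chain:
            chain.append(successor[chain[-1]])
        chains.append(chain)
    return chains


# Four hypothetical instances of P linked one after another.
chains = find_chains([("p1", "p2"), ("p2", "p3"), ("p3", "p4")])
print(chains)
```

Each chain found this way is what the production S → P S | P covers, and it is the unit that gets compressed to a single vertex labeled S.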
Recursion Example
Input:
[Figure: input graph containing four a-b substructures (extended by c, d, f and f), four x-y-z-q substructures, and the vertices r and k]
Recursion Example
Recursive production:
[Figure: recursive production S1 built from the x-y-z-q substructure]
Input graph parsed by production:
[Figure: input graph with the a-b substructures, r and k unchanged, and the x-y-z-q chain replaced by S1]
Discovering Variables
- Variables
  - After extending a substructure’s instances by a single edge,
  - Collect instances extended by the same edge (same direction and label), but possibly to vertices with differing labels Li
  - Form production of the form
    - Discrete/Categorical Variable: S → L1 | L2 | … | Ln
    - Continuous Variable: S → [Pmin … Pmax]
Variable Example
Input:
[Figure: the same input graph as in the recursion example: four a-b substructures (extended by c, d, f and f), four x-y-z-q substructures, and the vertices r and k]
Variable Example
Already identified sub a-b
[Figure: the input graph with the instances of substructure a-b marked]
Variable Example
Resulting production rules
[Figure: production S2 as a graph of a, b and S3, and production S3 → c | d | f]
Input graph parsed by productions
[Figure: parsed input graph showing S2, S1, r and k]
Variable Example
- Extending a-b results
  - a-b-c (1 instance)
  - a-b-d (1 instance)
  - a-b-f (2 instances)
- Collecting all values from the same edge results
  - a-b-{c,d,f} (4 instances)
- Eliminating least frequent value results
  - a-b-{d,f} (3 instances)
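The counts on this slide can be reproduced with a short tally. A minimal sketch; the list of extension labels is written out by hand from the slide, and the variable names are assumptions. Note that c and d are tied as least frequent, so the choice to drop c is taken from the slide rather than computed.

```python
from collections import Counter

# Labels reached when each of the four a-b instances is extended by one edge.
extension_labels = ["c", "d", "f", "f"]

counts = Counter(extension_labels)
variable = " | ".join(sorted(counts))  # right-hand side of the variable production
total = sum(counts.values())           # instances covered by a-b-{c,d,f}

# c and d both occur once; the slide drops c.
remaining = total - counts["c"]        # instances covered by a-b-{d,f}
print(variable, total, remaining)
```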
Discovering Relationships
- Properties of Relationships
  - Established between data points (variables)
  - At least one end of a relationship must be a vertex (otherwise relationship is trivial)
  - Represented by logical edges
- Types
  - Equal (both discrete and continuous)
  - Less-than-or-equal (continuous only)
Sample Relationship
[Figure: "Air Crash" production S with vertices air speed, visibility, lighting on and landing gear out, variables D3, D4, C1 and C2, and "=" and "<=" relationship edges]
Concept Learning
- Graph grammar induction is extended to concept learning
- Input is a positive and a negative graph
- Grammar is to describe positive graph while not describing negative graph
  - I.e., infer model for positive input only
  - Whatever does not fit the model is classified as negative
- Learning is impacted by swapping positive and negative inputs
Outline
- Graph-Based Data Mining
  - Applications
- Graph Grammar Induction
  - Applications
- Conclusions
Experiments
- Sequitur (Nevill-Manning and Witten, 1997)
  - Infers compositional hierarchies (i.e., memorizes input in a hierarchy)
  - Works on strings or sequences
SubdueGL vs. Sequitur
Input 1: abcabdabcabd
Input graph:
[Figure: chain of vertices a b c a b d a b c a b d]
Learned graph grammar:
  S1 → a b S2 S1 | a b S2
  S2 → c | d
Sequitur’s output:
  S → 1 1
  1 → 2 c 2 d
  2 → a b
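The difference between the two outputs can be checked directly. In this sketch the Sequitur rules are expanded back into a string, and the SubdueGL grammar is written as the regular expression it is equivalent to on strings, one or more repetitions of "ab" followed by c or d. The helper `expand` and the second test string are assumptions made for the example.

```python
import re

text = "abcabdabcabd"

# Sequitur's rules from the slide: each symbol maps to its right-hand side.
sequitur = {"S": ["1", "1"], "1": ["2", "c", "2", "d"], "2": ["a", "b"]}


def expand(symbol, rules):
    """Recursively replace non-terminals until only terminals remain."""
    if symbol not in rules:
        return symbol
    return "".join(expand(part, rules) for part in rules[symbol])


reproduced = expand("S", sequitur)

# S1 -> a b S2 S1 | a b S2 with S2 -> c | d generates (ab[cd])+ on strings.
subduegl = re.compile(r"(?:ab[cd])+")
accepts_input = subduegl.fullmatch(text) is not None
accepts_other = subduegl.fullmatch("abdabd") is not None

print(reproduced == text, accepts_input, accepts_other)
```

Sequitur's rules regenerate exactly the input, while the learned graph grammar also accepts other strings with the same repeating structure.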
Biochemistry
- Protein primary and secondary structure
  - Primary structure is sequence of amino acids: VAL LEU SER GLU GLY TRP GLN …
  - Secondary structure is sequence of helices: h_1_19 h_1_8 h_1_18 …
- Using hemoglobin and myoglobin
- Converted to graph: … VAL … LEU SER GLU GLY GLU TRP GLN LEU VAL
Protein Primary Structure
S → S2 – S3 – S4 – S5 – S6 – S7 – S8 – S9 – S10 – S11 – S12 – S13 – S14 – S15 – S16 – S17 – GLU – S
S2 → VAL | SER | HIS | LYS
S3 → LEU | GLY | HIS | PHE | PRO
S4 → GLY | GLN | ALA | ASP | THR
S5 → ALA | ASP | ARG | THR | ASN
S6 → LEU | LYS | ILE | PHE
…
S20 → S21 – S22 – S23 – S24 – S25 – S26 – S27 – S28 – S29 – LEU
S21 → VAL | GLY | ARG | S
S22 → LEU | LYS | PRO | S
S23 → LEU | SER | ASP
S24 → LEU | GLU | ALA | ASP | ILE
Protein Secondary Structure
Hemoglobin
S → S2 – S3 – h_1_6 – S4 – h_1_19 – h_1_8 – h_1_18 – S5
S2 → h_1_14 | h_1_15
S3 → h_1_14 | h_1_15
S4 → h_1_6 | h_1_1
S5 → h_1_20 | h_1_23
Myoglobin
S → h_1_15 – h_1_15 – h_1_6 – h_1_6 – h_1_19 – S2 – h_1_18 – S3
S2 → h_1_9 | h_1_8
S3 → h_1_25 | h_1_23
Common Sequence: h_1_15 – h_1_15 – h_1_6 – h_1_6 – h_1_19 – h_1_8 – h_1_18 – h_1_23
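One way to arrive at the common sequence, assuming it is the position-by-position intersection of the two grammars (the slide does not state how it was derived). Each position is written as the set of helices its terminal or variable allows; the variable names are assumptions.

```python
# Allowed helices at each of the eight positions, with variables expanded.
hemoglobin = [
    {"h_1_14", "h_1_15"}, {"h_1_14", "h_1_15"}, {"h_1_6"}, {"h_1_6", "h_1_1"},
    {"h_1_19"}, {"h_1_8"}, {"h_1_18"}, {"h_1_20", "h_1_23"},
]
myoglobin = [
    {"h_1_15"}, {"h_1_15"}, {"h_1_6"}, {"h_1_6"},
    {"h_1_19"}, {"h_1_9", "h_1_8"}, {"h_1_18"}, {"h_1_25", "h_1_23"},
]

# Keep, at every position, the helices allowed by both grammars.
shared = [sorted(h & m) for h, m in zip(hemoglobin, myoglobin)]
common_sequence = [options[0] for options in shared]
print(" - ".join(common_sequence))
```

Every position has exactly one shared helix, and the result matches the common sequence listed on the slide.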
Common Ancestry?
- Common ancestry between hemoglobin and myoglobin has long been hypothesized
- Common sequence can be further proof
  - Common sequence is only theoretical, not actual sequence
- “Myoglobin-like proteins found; candidates for common ancestry”:
  - Hou, Larsen, Boudko, Riley, Karatan, Zimmer, Ordal and Alam. 2000. Myoglobin-like aerotaxis transducers in Archaea and Bacteria. Nature 403, 540-544.
Counter-Terrorism Domains
- Part of EELD project
  - Contract Killing
  - Gang Wars
  - Industry Takeover
- Using simplified CK domain
- Goal: Identify sequence of events leading up to murder-for-hire
Contract Killing
- Part of Contract Killing domain
- Multiple sequences of events
[Figure: chain of Event vertices joined by nextEvent edges; each event is linked through InformationSource, ReportOnSituation and ContainsInformation to a Meeting (Person 1, Person 2), a PhoneCall (Person 2, Person 3) with Sender and Receiver edges, or a Murder (Killerski, Victimski) with Perpetrator and Victim edges]
Grammar of CK Domain
[Figure: graph grammar learned for the CK domain: productions over Event, nextEvent, InformationSource, ReportOnSituation and ContainsInformation, with non-terminals S and S2 to S6 covering values such as Meeting, PhoneCall, Murder, Person 1, Person 2, Person 3, Killerski and Victimski]
Questions and Answers
- Can show more, comparative experiments if we have time…
Non-Structural Experiments
- Comparison to many approaches reported in the literature
  - Statistical, neural, machine learning, DTL, …
  - Using the Wisconsin Breast Cancer domain
- Comparison to ILP and DTL systems
  - Progol, FOIL, C4.5, Subdue
  - Using Vote, Diabetes, and Credit domains
- Experiments use 10-fold cross validation
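For reference, a minimal sketch of how 10-fold cross validation partitions a data set. The function name and the sample count are hypothetical, and the slides do not say whether folds were shuffled or stratified, so this version simply splits indices in order.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train indices, test indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(25))  # 25 is an arbitrary example size
for train, test in folds:
    print(len(train), len(test))
```

Each example is used for testing exactly once and for training in the other nine folds; the reported accuracy is then averaged over the ten test folds.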
Related Work: ILP Systems
- Inductive logic programming (ILP)
  - Combines inductive methods with FOPC
- Rules
  - Fact(a,b) ← Dec(a,c), Fact(c,d), Mult(a,d,b)
  - Play(a,b,c,false) ← b<=70.
- Example systems:
  - FOIL (Cameron-Jones, R. M., & Quinlan, J. R. 1994. Efficient Top-down Induction of Logic Programs. SIGART Bulletin. Vol. 5, 1:33-42.)
  - Progol (Muggleton, S. 1995. Inverse Entailment and Progol. New Generation Computing Volume 13: 245-86.)
Wisconsin Breast Cancer Domain
- Properties:
  - Continuous attributes only
    - 9 attributes normalized between 1 and 10
  - Class attributes
    - Malignant cases
    - Benign cases
  - Concept learning task
Comparison Using WBC Domain
Rank  Algorithm                            Accuracy     Accuracy reported by
1     Probit                               97.2%        Wen-Hua et al, 2002
2     RLP                                  97.07%       Brodley & Utgoff, 1995
3     Gaussian Process                     97.0%        Wen-Hua et al, 2002
4     Feature Minimization                 96.92%       Brodley & Utgoff, 1995
5     GSBE                                 96.77%       Brodley & Utgoff, 1995
6     SVM                                  96.7%        Wen-Hua et al, 2002
7     Multi-surface separation (3 planes)  95.9%        Wolberg & Mangasarian, 1990
8     SubdueGL                             95.67%       Authors
9     FOIL                                 94.8%        Authors
10    C4.5                                 94.4%        Liu & Setiono, 1996
11    Subdue                               94.37%       Authors
12    Neural Networks                      93.3–95.61%  Taha & Ghosh, 1997
13    1-nearest neighbor                   93.7%        Zhang, 1992
14    Multi-surface separation (1 plane)   93.5%        Wolberg & Mangasarian, 1990
15    Linear Discriminant                  92.9%        Wen-Hua et al, 2002
16    Logistic regression                  92.3%        Wen-Hua et al, 2002
Comparison to ILP and DTL
- Discrete and mixed attribute types
  - Vote: 16 discrete-valued attributes (y, n, u)
  - Diabetes (Pima Indians): 7 continuous-valued attributes
  - Credit (German): 13 discrete-valued attributes, 7 continuous-valued attributes
- Concept learning (all have 2 classes)
Comparison to ILP Systems
          Vote    Diabetes  Credit
FOIL      93.02%  70.66%    68.60%
Progol    94.19%  63.68%    63.20%
Subdue    89.07%  61.71%    70.50%
SubdueGL  94.23%  70.94%    71.30%