TRANSCRIPT
CSE@UTA SRL Workshop 1
Efficient Mining of Graph-Based Data
Jesus Gonzalez, Istvan Jonyer, Larry Holder and Diane Cook
University of Texas at Arlington
Department of Computer Science and Engineering
http://cygnus.uta.edu/subdue
CSE@UTA SRL Workshop 3
Graph-Based Discovery
[Figure: Input Database (objects R1, C1, T1-T4, B1-B4); Substructure S1 in graph form (object vertices with shape edges to triangle and square, joined by an on edge); Compressed Database (R1, C1, and instances of S1)]
CSE@UTA SRL Workshop 4
Algorithm
1. Create substructure for each unique vertex label
Substructures:
triangle (4), square (4), circle (1), rectangle (1)
[Figure: input graph of circle, rectangle, triangle, and square vertices joined by "on" edges]
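As a rough illustration of step 1, the sketch below counts the distinct vertex labels in a graph and treats each one as an initial single-vertex substructure. The vertex list is a hypothetical stand-in that reproduces the label counts on the slide; the variable names are assumptions and are not taken from Subdue.

```python
from collections import Counter

# Hypothetical vertex labels matching the counts listed on the slide.
vertex_labels = ["triangle"] * 4 + ["square"] * 4 + ["circle", "rectangle"]

# Step 1: one candidate substructure per unique vertex label,
# paired with the number of instances of that label in the graph.
initial_substructures = Counter(vertex_labels)

for label, count in initial_substructures.most_common():
    print(f"{label} ({count})")
```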
CSE@UTA SRL Workshop 5
Algorithm
2. Expand best substructure by an edge or edge+neighboring vertex
Substructures:
triangle and square, rectangle and square, rectangle and triangle, rectangle and circle (each pair joined by an "on" edge)
[Figure: input graph of circle, rectangle, triangle, and square vertices joined by "on" edges]
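One way to picture step 2 is to take a single-vertex substructure and list every one-edge pattern that extends it by an edge plus the neighboring vertex. The sketch below does this on a small hypothetical graph; the vertex ids, edge directions, and the `expand_by_edge` helper are assumptions made for illustration, not the slide's actual example or Subdue's code.

```python
from collections import Counter

# Hypothetical graph: vertex id -> label, and directed labeled edges.
vertices = {0: "triangle", 1: "triangle", 2: "triangle", 3: "triangle",
            4: "square", 5: "square", 6: "square", 7: "square",
            8: "circle", 9: "rectangle"}
edges = [(0, 4, "on"), (1, 5, "on"), (2, 6, "on"), (3, 7, "on"),
         (9, 0, "on"), (8, 9, "on")]

def expand_by_edge(label):
    """Count the one-edge patterns that extend a single-vertex substructure."""
    patterns = Counter()
    for src, dst, edge_label in edges:
        # An edge extends the substructure if either endpoint carries its label.
        if label in (vertices[src], vertices[dst]):
            patterns[(vertices[src], edge_label, vertices[dst])] += 1
    return patterns

expansions = expand_by_edge("triangle")
for pattern, count in expansions.most_common():
    print(pattern, count)
```

Each pattern with its instance count is a candidate for the next round of evaluation.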
CSE@UTA SRL Workshop 6
Algorithm
3. Keep only best beam-width substructures on queue
4. Terminate when queue is empty or #discovered substructures >= limit
5. Compress graph and repeat to generate hierarchical description
Note: polynomially constrained
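Steps 3 and 4 can be sketched as a generic beam search. This is a simplified illustration, not Subdue's implementation: `beam_discover`, `expand`, and `score` are hypothetical names, the candidates here are toy strings instead of graphs, and step 5 (compress and repeat) is left out.

```python
def beam_discover(initial, expand, score, beam_width, limit):
    """Simplified beam search: keep the best beam_width candidates on the queue."""
    queue = sorted(initial, key=score, reverse=True)[:beam_width]
    discovered = []
    # Step 4: stop when the queue is empty or enough substructures were found.
    while queue and len(discovered) < limit:
        best = queue.pop(0)
        discovered.append(best)
        queue.extend(expand(best))
        # Step 3: trim the queue back to the beam width.
        queue = sorted(queue, key=score, reverse=True)[:beam_width]
    return sorted(discovered, key=score, reverse=True)

# Toy stand-ins: a candidate grows by one letter and scores by its count of "a".
def toy_expand(s):
    return [s + "a", s + "b"] if len(s) < 3 else []

def toy_score(s):
    return s.count("a")

result = beam_discover(["a", "b"], toy_expand, toy_score, beam_width=2, limit=4)
print(result)
```

Because both the queue length and the number of discovered substructures are capped, the amount of work stays bounded even though the space of possible substructures is very large.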
CSE@UTA SRL Workshop 7
Evaluation Metric
Substructures evaluated based on ability to compress input graph
Compression measured using minimum description length (DL)
Best substructure S in graph G minimizes: DL(S) + DL(G|S)
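To make the quantity DL(S) + DL(G|S) concrete, here is a minimal sketch using a deliberately crude bit count. The encoding in `description_length` is an assumption chosen for simplicity and is not the encoding Subdue uses; the graph sizes (10 vertices, 9 edges, 4 instances of a two-vertex substructure) are likewise hypothetical.

```python
from math import log2

def description_length(num_vertices, num_edges, num_vertex_labels, num_edge_labels):
    """Crude bit count: one label per vertex, two endpoints and one label per edge."""
    if num_vertices == 0:
        return 0.0
    # max(..., 2) charges at least one bit even when only one choice exists.
    vertex_bits = num_vertices * log2(max(num_vertex_labels, 2))
    endpoint_bits = 2 * log2(max(num_vertices, 2))
    edge_bits = num_edges * (endpoint_bits + log2(max(num_edge_labels, 2)))
    return vertex_bits + edge_bits

# Hypothetical input graph G: 10 vertices, 9 edges, 4 vertex labels, 1 edge label.
dl_g = description_length(10, 9, 4, 1)

# Substructure S: 2 vertices joined by 1 edge, drawn from the same label sets.
dl_s = description_length(2, 1, 4, 1)

# G|S: each of 4 instances (2 vertices, 1 edge) collapses to a single S1 vertex,
# leaving 6 vertices, 5 edges, and 3 vertex labels (circle, rectangle, S1).
dl_g_given_s = description_length(6, 5, 3, 1)

print(f"DL(G) = {dl_g:.1f} bits")
print(f"DL(S) + DL(G|S) = {dl_s + dl_g_given_s:.1f} bits")
```

Under this toy encoding the compressed description is shorter than the original, which is the property the metric rewards.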
CSE@UTA SRL Workshop 9
Inexact Graph Match
Some variations may occur between instances
Want to abstract over minor differences
Difference = cost of transforming one graph into an isomorphism of another
Match if cost/size < threshold
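The matching rule can be illustrated with a brute-force sketch for very small graphs. It assumes unit costs for each vertex relabel, insertion, or deletion and for each edge insertion or deletion, and it assumes that "size" means the vertex count plus edge count of the larger graph; neither assumption comes from the slide. The function names are hypothetical, and the exhaustive search over permutations is only practical for a handful of vertices.

```python
from itertools import permutations

def transform_cost(g1, g2):
    """Minimum unit-cost edits turning g1 into a graph isomorphic to g2."""
    (labels1, edges1), (labels2, edges2) = g1, g2
    n = max(len(labels1), len(labels2))
    # Pad the smaller graph with None so vertex insert/delete costs 1.
    a = list(labels1) + [None] * (n - len(labels1))
    b = list(labels2) + [None] * (n - len(labels2))
    best = float("inf")
    for perm in permutations(range(n)):  # perm[i]: vertex of g2 matched to vertex i of g1
        cost = sum(1 for i in range(n) if a[i] != b[perm[i]])
        mapped = {(perm[u], perm[v], lab) for (u, v, lab) in edges1}
        cost += len(mapped ^ set(edges2))  # edges to delete or insert
        best = min(best, cost)
    return best

def size(graph):
    labels, edges = graph
    return len(labels) + len(edges)

def is_match(g1, g2, threshold):
    """Apply the slide's rule: match if cost/size < threshold."""
    return transform_cost(g1, g2) / max(size(g1), size(g2)) < threshold

# A graph is (vertex labels, set of directed edges (src, dst, label)).
triangle_on_square = (["triangle", "square"], {(0, 1, "on")})
triangle_on_circle = (["triangle", "circle"], {(0, 1, "on")})

print(transform_cost(triangle_on_square, triangle_on_circle))
print(is_match(triangle_on_square, triangle_on_circle, threshold=0.5))
```

Here one relabel separates the two graphs, so they match at a threshold of 0.5 but not at 0.2.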
CSE@UTA SRL Workshop 10
Parallel/Distributed Discovery
Divide graph into P partitions using Metis, distribute to P processors
Each processor performs serial Subdue on local partition
Broadcast best substructures, evaluate on other processors
Master processor stores best global substructures
Close to linear speedup
CSE@UTA SRL Workshop 11
Graph-Based Concept Learning
One graph stores positive examples
One graph stores negative examples
Find substructure that compresses positive graph but not negative graph
(PosEgsNotCovered) + (NegEgsCovered)
Multiple iterations implement a set-covering approach
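The set-covering idea can be sketched as follows. As a simplification, each example is a set of feature strings instead of a graph, and a candidate "covers" an example when it is a subset of it; real Subdue would test graph containment. The `error` and `set_cover` functions and the toy data are hypothetical.

```python
def error(candidate, pos, neg):
    """Score from the slide: positives not covered plus negatives covered."""
    pos_not_covered = sum(1 for e in pos if not candidate <= e)
    neg_covered = sum(1 for e in neg if candidate <= e)
    return pos_not_covered + neg_covered

def set_cover(candidates, pos, neg):
    """Repeatedly pick the lowest-error candidate and drop the positives it covers."""
    remaining = list(pos)
    chosen = []
    while remaining:
        best = min(candidates, key=lambda c: error(c, remaining, neg))
        if not any(best <= e for e in remaining):
            break  # no progress possible with the given candidates
        chosen.append(best)
        remaining = [e for e in remaining if not best <= e]
    return chosen

# Toy data: each example is a set of hypothetical feature names.
pos = [frozenset({"triangle-on-square", "circle"}),
       frozenset({"triangle-on-square"}),
       frozenset({"square-on-rectangle"})]
neg = [frozenset({"circle"}), frozenset({"square"})]
candidates = [frozenset({"triangle-on-square"}),
              frozenset({"circle"}),
              frozenset({"square-on-rectangle"})]

rules = set_cover(candidates, pos, neg)
print(rules)
```

Each iteration removes the positive examples already explained, so later substructures are chosen to cover what is left.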
CSE@UTA SRL Workshop 12
Concept-Learning Example
[Figure: graph with three object vertices, triangle and square vertices, two "on" edges, and two "shape" edges]
CSE@UTA SRL Workshop 13
Concept-Learning Results
Chess endgames (19,257 examples)
Black King is (+) or is not (-) in check
99.8% FOIL, 99.21% Subdue
CSE@UTA SRL Workshop 14
More Concept-Learning Results
Tic-Tac-Toe endgames
  + is win for X (958 examples)
  100% Subdue, 92.35% FOIL
Bach chorales
  Musical sequences (20 sequences)
  100% Subdue, 85.71% FOIL
CSE@UTA SRL Workshop 15
Graph-Based Clustering
Iterate Subdue until single vertex
Each cluster (substructure) inserted into a classification lattice
[Figure: classification lattice descending from Root]
CSE@UTA SRL Workshop 16
Clustering Example: Animals
Name       Body Cover      Heart Chamber   Body Temp.   Fertilization
mammal     hair            four            regulated    internal
bird       feathers        four            regulated    internal
reptile    cornified-skin  imperfect-four  unregulated  internal
amphibian  moist-skin      three           unregulated  external
fish       scales          two             unregulated  external

[Figure: graph form of the mammal example: an animal vertex with edges Name (mammal), BodyCover (hair), HeartChamber (four), BodyTemp (regulated), Fertilization (internal)]
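The mapping from a table row to a graph can be sketched as below: one root vertex labeled animal, one vertex per attribute value, and one edge per attribute. The `row_to_graph` function and the tuple layout are hypothetical choices for illustration, not Subdue's input format.

```python
def row_to_graph(row, root_label="animal"):
    """Turn one table row into a star-shaped labeled graph."""
    vertices = [root_label] + list(row.values())
    # Each edge links the root (index 0) to a value vertex, labeled by the attribute.
    edges = [(0, i + 1, attr) for i, attr in enumerate(row)]
    return vertices, edges

mammal = {"Name": "mammal", "BodyCover": "hair", "HeartChamber": "four",
          "BodyTemp": "regulated", "Fertilization": "internal"}

vertices, edges = row_to_graph(mammal)
for src, dst, attr in edges:
    print(f"{vertices[src]} --{attr}--> {vertices[dst]}")
```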
CSE@UTA SRL Workshop 17
Graph-Based Clustering Results
Animals
  HeartChamber: four, BodyTemp: regulated, Fertilization: internal
    Name: mammal, BodyCover: hair
    Name: bird, BodyCover: feathers
  BodyTemp: unregulated
    Name: reptile, BodyCover: cornified-skin, HeartChamber: imperfect-four, Fertilization: internal
    Fertilization: external
      Name: fish, BodyCover: scales, HeartChamber: two
      Name: amphibian, BodyCover: moist-skin, HeartChamber: three
CSE@UTA SRL Workshop 18
Cobweb Results
Comparison of Subdue and Cobweb results
Subdue lattice produced better generalization, resulting in fewer clusters at higher levels
Subdue lattice identifies overlap between (reptile) and (amphibian/fish)
[Figure: Cobweb hierarchy with animals at the root; amphibian/fish, mammal/bird, and reptile below it; leaves mammal, bird, fish, amphibian]
CSE@UTA SRL Workshop 20
Graph-Based Clustering Results
DNA
Coverage: 61%, 68%, 71%
[Figure: discovered DNA substructures built from C, N, O, P, OH, and CH2 groups]
CSE@UTA SRL Workshop 21
Evaluation of Clusterings
Traditional evaluation:
$$\mathit{ClusteringQuality} = \frac{\mathit{InterClusterDistance}}{\mathit{IntraClusterDistance}}$$
Not applicable to hierarchical domains
  Does not make sense to compare clusters in different subtrees
Not applicable to relational clusterings
CSE@UTA SRL Workshop 22
Properties of Good Clusterings
Small number of clusters
  Large coverage, hence good generality
Big cluster descriptions
  More features, hence more inferential power
Minimal or no overlap between clusters
  More distinct clusters, hence better defined concepts
CSE@UTA SRL Workshop 23
New Evaluation Heuristic for Hierarchical Clusterings
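$$CQ_C = \frac{\displaystyle\sum_{i=1}^{c-1}\sum_{j=i+1}^{c} \frac{\displaystyle\sum_{k=1}^{|H_i|}\sum_{l=1}^{|H_j|} \frac{\mathrm{distance}(H_{i,k}, H_{j,l})}{\max(\mathrm{size}(H_{i,k}), \mathrm{size}(H_{j,l}))}}{|H_i|\,|H_j|}}{\displaystyle\sum_{i=1}^{c-1} i} + \sum_{i=1}^{c} CQ_{H_i}$$
Clustering rooted at C with c children H_i, each having |H_i| instances H_{i,k}
distance() measured by inexact graph match
Animals: Subdue CQ = 2.6, Cobweb CQ = 1.7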
CSE@UTA SRL Workshop 24
Graph-Based Data Mining: Application Domains
Biochemical domains
  Protein data
  DNA data
  Toxicology (cancer) data
Spatial-temporal domains
  Earthquake data
  Aircraft Safety and Reporting System
Telecommunications data
Program source code
Web topology
[Figure: web topology graph of web_page vertices joined by hyperlink edges, starting from a home page]
CSE@UTA SRL Workshop 25
Theoretical Analysis
Galois lattice [Liquiere et al.]
Conceptual graphs [Sowa et al.]
PAC analysis [Jappy et al.]
CSE@UTA SRL Workshop 26
Graph-based Data Mining
Pattern (substructure) discovery
Hierarchical discovery
Distributed discovery
Concept learning
Clustering
Compression heuristic based on minimum description length
CSE@UTA SRL Workshop 27
Future Work
Concept learning
  Theoretical analysis
  Comparison to ILP systems
Clustering
  Classification lattice
  Hierarchical relational conceptual clustering evaluation metric
Probabilistic substructures
Domains: WWW, source code