Graph-Based Data Mining and Applications

István Jónyer

Department of Computer Science, Oklahoma State University

Outline

Graph-Based Data Mining
Applications
Graph Grammar Induction
Applications
Conclusions

Graph-Based Data Mining

Graphs are the most expressive data structures in computer science
  Intuitively represent complex domains
  Without repetition of data
Simple building blocks:
  Labeled vertices
  Labeled directed/undirected edges

Subdue

Subdue finds frequently occurring subgraphs
Returns the best ones according to the minimum description length (MDL) heuristic
Features:
  Discovery, clustering, and concept learning
  Inexact graph matching
  Parallel/distributed discovery
  Background knowledge

Graph Representation

Input is a labeled (vertices and edges) directed graph
A substructure is a connected subgraph
An instance of a substructure is an isomorphic subgraph of the input graph
Input graph is compressed by replacing instances with a vertex representing the substructure

[Figure, three panels: "Input Database" (objects R1, C1, T1-T4, S1-S4); "Substructure S1 (graph form)" (vertices object, triangle, square; edges shape, on); "Compressed Database" (R1, C1, and S1 vertices in place of the instances)]
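The compression step can be sketched in a few lines of Python. This is a minimal illustration, not Subdue's implementation: the `Graph` class, the `compress` function, and the small example graph are all hypothetical, and instances are assumed not to overlap.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    vertices: dict = field(default_factory=dict)  # vertex id -> label
    edges: list = field(default_factory=list)     # (source, target, label)


def compress(graph, instances, sub_label):
    """Replace each instance (a set of vertex ids) with one vertex labeled sub_label."""
    owner = {}
    for i, instance in enumerate(instances):
        for v in instance:
            owner[v] = f"{sub_label}#{i}"
    out = Graph()
    for v, label in graph.vertices.items():
        if v in owner:
            out.vertices[owner[v]] = sub_label
        else:
            out.vertices[v] = label
    for src, dst, label in graph.edges:
        s, d = owner.get(src, src), owner.get(dst, dst)
        if s == d and src in owner:
            continue  # edge lies inside an instance, so it disappears
        out.edges.append((s, d, label))
    return out


# Hypothetical input: two triangle-on-square instances resting on a rectangle.
g = Graph(
    vertices={1: "triangle", 2: "square", 3: "triangle", 4: "square", 5: "rectangle"},
    edges=[(1, 2, "on"), (3, 4, "on"), (2, 5, "on"), (4, 5, "on")],
)
c = compress(g, [{1, 2}, {3, 4}], "S1")
```

Edges that cross an instance boundary are re-attached to the new S1 vertex, which is why the two "on" edges to the rectangle survive.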

Subdue Discovery Algorithm

1. Create substructure for each unique vertex label

[Figure: example input graph with circle, rectangle, triangle, and square vertices connected by "on" edges]

Substructures: triangle (4), square (4), circle (1), rectangle (1)

Subdue Discovery Algorithm

2. Expand best substructures by an edge or an edge plus neighboring vertex

[Figure: the example input graph and the candidate substructures obtained by expanding along "on" edges between triangle, square, circle, and rectangle vertices]

Subdue Discovery Algorithm

3. Evaluate substructures using MDL
4. Keep only the best beam-width substructures on the queue
5. Terminate when the queue is empty or #discovered substructures >= limit
6. Compress graph and repeat to generate hierarchical description
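The control flow of steps 1 to 5 is a beam search, sketched below. To keep the example self-contained it runs on a string rather than a graph: "substructures" are substrings, expansion appends one neighboring character, and the score is a made-up stand-in for MDL. All function names and the scoring rule are assumptions for illustration; step 6 (compress and repeat) is not shown.

```python
def beam_discover(initial, expand, score, beam_width=4, limit=10):
    """Generic beam search: higher score is better."""
    queue = sorted(initial, key=score, reverse=True)[:beam_width]
    discovered = []
    while queue and len(discovered) < limit:
        candidates = []
        for sub in queue:
            if len(discovered) >= limit:
                break
            discovered.append(sub)
            candidates.extend(expand(sub))
        unique = list(dict.fromkeys(candidates))  # drop duplicates, keep order
        queue = sorted(unique, key=score, reverse=True)[:beam_width]
    return sorted(discovered, key=score, reverse=True)


TEXT = "abcabdabcabd"  # string analogue of an input graph


def expand(sub):
    """Extend every occurrence of sub by the character that follows it."""
    n = len(sub)
    return sorted({sub + TEXT[i + n]
                   for i in range(len(TEXT) - n)
                   if TEXT.startswith(sub, i)})


def score(sub):
    """Hypothetical stand-in for MDL: rewards long, frequent patterns."""
    return len(sub) * (TEXT.count(sub) - 1)


best = beam_discover(sorted(set(TEXT)), expand, score, beam_width=2, limit=8)
```

With these toy settings the top-ranked pattern is "ab", the unit that repeats four times in the input.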

Graph Compression and MDL

Minimum Description Length (MDL) principle
  Best theory minimizes description length of theory and data given theory
Best substructure is the one with the shortest description length of substructure definition DL(S) + compressed graph DL(G|S)
  DL(G,S) = DL(S) + DL(G|S)
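A small numeric sketch of the formula follows. The size measure here (one unit per vertex and per edge) is an assumption chosen for readability, not the bit-level encoding Subdue uses, and the graph sizes are hypothetical.

```python
def description_length(num_vertices, num_edges):
    """Crude size proxy: one unit per vertex plus one per edge."""
    return num_vertices + num_edges


def total_dl(sub, compressed):
    """DL(G,S) = DL(S) + DL(G|S); arguments are (num_vertices, num_edges) pairs."""
    return description_length(*sub) + description_length(*compressed)


original = (10, 9)   # hypothetical input graph G
sub = (2, 1)         # substructure S: two vertices joined by one edge
# Four instances of S collapse to four vertices and lose their internal edges.
compressed = (10 - 4 * 2 + 4, 9 - 4)

dl_before = description_length(*original)
dl_after = total_dl(sub, compressed)
```

Under this proxy the substructure pays for itself: the description shrinks from 19 units to 14.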

Biochemistry Application: Clustering of a DNA sequence

[Figure: clustering hierarchy rooted at "DNA"; discovered substructures include a phosphate group (O=P-OH), C-N and C-C fragments, and ring fragments. Coverage values shown: 61%, 68%, 71%]

DNA Sequence

Four bases constitute a four-letter alphabet that cells use to store genetic information.

Molecular biologists can break up a DNA molecule and determine its base sequence, which can be stored as a character string in a computer:

TTCAGCCGATATCCTGGTCAGATTCTCTAAGTCGGCTATAGGACCAGTCTAAGAGA


Backbone Representation

“Base” vertices allow “don’t-care” positions.

Accounting for overlapping substructures is also possible.

[Figure: chain of "base" vertices linked by "next" edges; each base vertex has a "name" edge to a vertex labeled with its nucleotide (A, C, T, or G)]
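Building the backbone representation from a sequence string is mechanical. The sketch below assumes the layout described above (one "base" vertex per position, "next" edges along the chain, a "name" edge to the letter); the vertex id scheme and function name are hypothetical.

```python
def backbone_graph(sequence):
    """Return (vertices, edges) for the backbone representation of a DNA string."""
    vertices = {}  # vertex id -> label
    edges = []     # (source, target, label)
    previous = None
    for i, letter in enumerate(sequence):
        base_id, name_id = f"base{i}", f"name{i}"
        vertices[base_id] = "base"
        vertices[name_id] = letter
        edges.append((base_id, name_id, "name"))
        if previous is not None:
            edges.append((previous, base_id, "next"))
        previous = base_id
    return vertices, edges


vertices, edges = backbone_graph("TTCAG")
```

Because the letter sits on its own vertex, a pattern can match the "base" chain while leaving a position's name unconstrained, which is what makes "don't-care" positions possible.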

Represent Web as Graph

Breadth-first search of domain to generate graph
Nodes represent pages / documents
Edges represent hyperlinks
Additional nodes represent document keywords

[Figure: example web graph with "page" vertices joined by "hyperlink" edges and "word" edges to keyword vertices such as university, texas, learning, group, projects, subdue, robotics, parallel, work, planning]

Query: Find all pages which link to a page containing term ‘Subdue’

Subgraph vertices:
  1 page URL: http://cygnus.uta.edu
  7 page URL: http://cygnus.uta.edu/projects.html
  8 Subdue
  [1->7] hyperlink
  [7->8] word

[Figure: query graph: page --hyperlink--> page --word--> Subdue]
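For this particular two-edge query shape, the match can be sketched as a direct filter over an edge list. Vertex ids 1, 7, and 8 follow the result shown above; vertices 2 and 9 and the function name are hypothetical additions so the query has something to reject. This is not Subdue's general subgraph matcher.

```python
labels = {1: "page", 7: "page", 8: "Subdue", 2: "page", 9: "robotics"}
edges = [
    (1, 7, "hyperlink"),
    (7, 8, "word"),
    (1, 2, "hyperlink"),  # hypothetical extra page
    (2, 9, "word"),       # hypothetical extra keyword
]


def pages_linking_to_term(labels, edges, term):
    """Pages with a hyperlink to some page that has a word edge to `term`."""
    containing = {src for src, dst, lab in edges
                  if lab == "word" and labels[dst] == term}
    return sorted({src for src, dst, lab in edges
                   if lab == "hyperlink" and dst in containing})


result = pages_linking_to_term(labels, edges, "Subdue")
```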

Outline

Graph-Based Data Mining
Applications
Graph Grammar Induction
Applications
Conclusions

Direction of Research

One direction of graph-based data mining research is towards efficient algorithms
  AGM
  FSG
  gSpan
We address another need
  Increasing the expressive power of graph-based theories
  We develop a Graph Grammar Induction algorithm

Related Work: Graph-Based Systems

AGM
  A. Inokuchi, T. Washio and H. Motoda, An Apriori-based Algorithm for Mining Frequent Substructures from Graph Data, Proceedings of the European Conference on Principles and Practice of Knowledge Discovery in Databases, 2000.
FSG
  M. Kuramochi and G. Karypis, An Efficient Algorithm for Discovering Frequent Subgraphs, Technical Report 02-026, Department of Computer Science, University of Minnesota, 2002.
gSpan
  X. Yan and J. Han, gSpan: Graph-Based Substructure Pattern Mining, Proceedings of the International Conference on Data Mining (ICDM), 2002.

Apriori-based, association rule discovery
Find all frequent subgraphs with minimum support
Emphasis is on efficiency

Graph Grammars

Graph grammar production: S → P
  S is a non-terminal, single vertex
    Hence grammar is context-free
  P is any graph containing terminals and/or non-terminals

Graph Grammars: Recursion

Recursive production: S → P S | P
  P linked to S via a single edge
  Algorithm exponential in linking edges

[Figure: example recursive production in which P is a graph on vertices a, b, c, shown once linked to S and once alone]

Graph Grammars: Variables

Variable production:
  S → P1 | P2 | … | Pn (discrete)
  S → [Pmin … Pmax] (continuous)
  P restricted to single terminal vertex

[Figure: example productions: S1 → graph on vertices a, b, S2; S2 → c | d | f]

Graph Grammars: Relationships

Relationships
  Between continuous values (=, <=)
  Between discrete values (=)

[Figure: example production S for an "Air Crash" vertex with attributes air speed, visibility, lighting on, landing gear out; values D1, D2, C1, and 220, joined by "=" and "<=" relationship edges]

Example Graph Grammar

[Figure: productions S1 → P S1 | P, where P is the graph on vertices a, b, S2; S2 → c | d | f]

Discovering Recursion

For each discovered substructure
  Check for subsets of instances P connected by the same single edge e
  If found, form production S → P S | P, where P is connected to S by edge e
Abstraction:
  Each matching chain is compressed to a single vertex labeled S
Algorithm is exponential in the number of edges considered between instances
  Note: two edges needed for S → a S b
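The chain check can be sketched as follows, assuming the instances of one substructure have already been found and the edges between them collected. The data layout, the function name, and the simplification that each instance has at most one outgoing link per edge label are assumptions for illustration, not the SubdueGL implementation.

```python
from collections import defaultdict


def find_chains(links):
    """links: (from_instance, to_instance, edge_label) triples.

    Returns {edge_label: [chain, ...]}, each chain a list of instances
    connected head-to-tail by that same edge label.
    """
    by_label = defaultdict(dict)
    for src, dst, label in links:
        by_label[label][src] = dst
    chains = {}
    for label, successor in by_label.items():
        targets = set(successor.values())
        found = []
        for start in successor:
            if start in targets:
                continue  # not the head of a chain
            chain = [start]
            # Follow links, stopping at the chain's end or if a cycle appears.
            while chain[-1] in successor and successor[chain[-1]] not in chain:
                chain.append(successor[chain[-1]])
            found.append(chain)
        chains[label] = found
    return chains


# Hypothetical: four instances P1..P4 of one substructure linked in a row.
links = [("P1", "P2", "next"), ("P2", "P3", "next"), ("P3", "P4", "next")]
chains = find_chains(links)
```

Each chain found this way is what the production S → P S | P would abstract into a single vertex labeled S.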

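Recursion Example

Input:

[Figure: input graph containing four instances of a substructure on vertices a, b, each extended to one of c, d, f, f; four instances of a substructure on vertices x, y, z, q; and the vertices r and k]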

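Recursion Example

Recursive production:

[Figure: production S1 → P S1 | P, where P is the substructure on vertices x, y, z, q]

Input graph parsed by production:

[Figure: the input graph with the x-y-z-q instances replaced by S1; the a-b substructures and the vertices r and k remain]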

Discovering Variables

Variables
  After extending a substructure’s instances by a single edge,
  collect instances extended by the same edge (same direction and label), but possibly to vertices with differing labels Li
  Form production of the form
    Discrete/Categorical Variable: S → L1 | L2 | … | Ln
    Continuous Variable: S → [Pmin … Pmax]

Variable Example

Input:

[Figure: the same input graph as in the recursion example: four a-b instances extended to c, d, f, f; four x-y-z-q instances; and the vertices r and k]

Variable Example

[Figure: the same input graph]

Already identified substructure: a-b

Variable Example

Resulting production rules:

[Figure: productions S2 → P S2 | P, where P is the graph on vertices a, b, S3; S3 → c | d | f]

Input graph parsed by productions:

[Figure: parsed input graph consisting of S2, S1, and the vertices r and k]

Variable Example

Extending a-b results
  a-b-c (1 instance)
  a-b-d (1 instance)
  a-b-f (2 instances)
Collecting all values from the same edge results
  a-b-{c,d,f} (4 instances)
Eliminating least frequent value results
  a-b-{d,f} (3 instances)
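The counts above can be reproduced with a short sketch. The function name is hypothetical, and the tie-breaking rule is an assumption: c and d are equally infrequent, and the slide does not say how the tie is resolved, so this version drops whichever least frequent value was seen first.

```python
from collections import Counter

# Labels reached by extending the four a-b instances along the same edge.
extensions = ["c", "d", "f", "f"]

counts = Counter(extensions)               # candidate variable a-b-{c,d,f}
covered_all = sum(counts.values())         # instances covered by the variable


def drop_least_frequent(counts):
    """Remove one least frequent value (ties: the first one encountered)."""
    least = min(counts, key=counts.get)
    return Counter({k: v for k, v in counts.items() if k != least})


pruned = drop_least_frequent(counts)       # candidate variable a-b-{d,f}
covered_pruned = sum(pruned.values())
```

The trade-off being explored is between a more general variable (more values, more instances covered) and a more specific one.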

Discovering Relationships

Properties of Relationships
  Established between data points (variables)
  At least one end of a relationship must be a vertex (otherwise relationship is trivial)
  Represented by logical edges
Types
  Equal (both discrete and continuous)
  Less-than-or-equal (continuous only)
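One way to picture the search for relationships is to test every pair of variables across all instances of a substructure. The sketch below is an assumption about the general idea, not the algorithm from the talk: the record layout, function name, and values are hypothetical, and only variable-to-variable pairs are compared.

```python
from itertools import combinations


def find_relationships(instances, continuous):
    """instances: list of {variable: value}; continuous: set of numeric variables.

    Returns (a, op, b) triples that hold in every instance.
    """
    found = []
    for a, b in combinations(sorted(instances[0]), 2):
        if all(r[a] == r[b] for r in instances):
            found.append((a, "=", b))
        elif a in continuous and b in continuous:
            if all(r[a] <= r[b] for r in instances):
                found.append((a, "<=", b))
            elif all(r[b] <= r[a] for r in instances):
                found.append((b, "<=", a))
    return found


# Hypothetical values for two continuous (C) and two discrete (D) variables.
instances = [
    {"C1": 180, "C2": 220, "D3": "yes", "D4": "yes"},
    {"C1": 210, "C2": 220, "D3": "no", "D4": "no"},
]
relationships = find_relationships(instances, continuous={"C1", "C2"})
```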

Sample Relationship

[Figure: production S for an "Air Crash" vertex with attributes air speed, visibility, lighting on, landing gear out; variables D3, D4, C1, C2, joined by "=" and "<=" relationship edges]

Concept Learning

Graph grammar induction is extended to concept learning
Input is a positive and a negative graph
Grammar is to describe positive graph while not describing negative graph
  I.e., infer model for positive input only
  Whatever does not fit the model is classified as negative
Learning is impacted by swapping positive and negative inputs

Outline

Graph-Based Data Mining
Applications
Graph Grammar Induction
Applications
Conclusions

Experiments

Sequitur (Nevill-Manning and Witten, 1997)

Infers compositional hierarchies (i.e., memorizes input in a hierarchy)

Works on strings or sequences

SubdueGL vs. Sequitur

Input 1: abcabdabcabd

Input graph:
  [Figure: the input string as a chain of vertices a, b, c, a, b, d, a, b, c, a, b, d]

Learned graph grammar:
  S1 → a b S2 S1 | a b S2
  S2 → c | d

Sequitur’s output:
  S → 1 1
  1 → 2 c 2 d
  2 → a b
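The difference between the two results can be checked in code. The sketch below treats the learned graph grammar as a string grammar (an assumption that fits this chain-shaped input) and writes a small hand-made recognizer for it, then expands the Sequitur rules back into a string. Function names are hypothetical.

```python
def parse_s2(s, i):
    """S2 → c | d. Return the index after the match, or None."""
    return i + 1 if i < len(s) and s[i] in "cd" else None


def parse_s1(s, i=0):
    """S1 → a b S2 S1 | a b S2. Return the index after the longest match, or None."""
    if s[i:i + 2] != "ab":
        return None
    j = parse_s2(s, i + 2)
    if j is None:
        return None
    rest = parse_s1(s, j)
    return rest if rest is not None else j


def accepts(s):
    return parse_s1(s) == len(s)


SEQUITUR = {"S": ["1", "1"], "1": ["2", "c", "2", "d"], "2": ["a", "b"]}


def expand(symbol, rules):
    """Expand a Sequitur symbol into the terminal string it stands for."""
    if symbol not in rules:
        return symbol
    return "".join(expand(x, rules) for x in rules[symbol])
```

Sequitur's rules expand to exactly one string, the input, while the learned grammar also accepts unseen strings such as "abd" or "abdabd"; that is the sense in which Sequitur memorizes and the graph grammar generalizes.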

Biochemistry

Protein primary and secondary structure
  Primary structure is sequence of amino acids: VAL LEU SER GLU GLY TRP GLN …
  Secondary structure is sequence of helices: h_1_19 h_1_8 h_1_18 …
Using hemoglobin and myoglobin
Converted to graph: … VAL … LEU SER GLU GLY GLU TRP GLN LEU VAL

Protein Primary Structure

S → S2 – S3 – S4 – S5 – S6 – S7 – S8 – S9 – S10 – S11 – S12 – S13 – S14 – S15 – S16 – S17 – GLU – S
S2 → VAL | SER | HIS | LYS
S3 → LEU | GLY | HIS | PHE | PRO
S4 → GLY | GLN | ALA | ASP | THR
S5 → ALA | ASP | ARG | THR | ASN
S6 → LEU | LYS | ILE | PHE
…
S20 → S21 – S22 – S23 – S24 – S25 – S26 – S27 – S28 – S29 – LEU
S21 → VAL | GLY | ARG | S
S22 → LEU | LYS | PRO | S
S23 → LEU | SER | ASP
S24 → LEU | GLU | ALA | ASP | ILE

Protein Secondary Structure

Hemoglobin
  S → S2 – S3 – h_1_6 – S4 – h_1_19 – h_1_8 – h_1_18 – S5
  S2 → h_1_14 | h_1_15
  S3 → h_1_14 | h_1_15
  S4 → h_1_6 | h_1_1
  S5 → h_1_20 | h_1_23

Myoglobin
  S → h_1_15 – h_1_15 – h_1_6 – h_1_6 – h_1_19 – S2 – h_1_18 – S3
  S2 → h_1_9 | h_1_8
  S3 → h_1_25 | h_1_23

Common Sequence: h_1_15 – h_1_15 – h_1_6 – h_1_6 – h_1_19 – h_1_8 – h_1_18 – h_1_23
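The common sequence is consistent with intersecting the two grammars position by position, which the following sketch checks. That reading is an assumption; the slides do not state how the common sequence was derived. Each position is written as the set of helices the grammar allows there.

```python
HEMOGLOBIN = [
    {"h_1_14", "h_1_15"}, {"h_1_14", "h_1_15"}, {"h_1_6"}, {"h_1_6", "h_1_1"},
    {"h_1_19"}, {"h_1_8"}, {"h_1_18"}, {"h_1_20", "h_1_23"},
]
MYOGLOBIN = [
    {"h_1_15"}, {"h_1_15"}, {"h_1_6"}, {"h_1_6"},
    {"h_1_19"}, {"h_1_9", "h_1_8"}, {"h_1_18"}, {"h_1_25", "h_1_23"},
]


def common_sequence(a, b):
    """Return the single sequence both grammars allow, or None if not unique."""
    if len(a) != len(b):
        return None
    result = []
    for allowed_a, allowed_b in zip(a, b):
        shared = allowed_a & allowed_b
        if len(shared) != 1:
            return None
        result.append(next(iter(shared)))
    return result


common = common_sequence(HEMOGLOBIN, MYOGLOBIN)
```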

Common Ancestry?

Common ancestry between hemoglobin and myoglobin has long been hypothesized
  Common sequence can be further proof
  Common sequence is only theoretical, not an actual sequence
“Myoglobin-like proteins found; candidates for common ancestry”:
  Hou, Larsen, Boudko, Riley, Karatan, Zimmer, Ordal and Alam. 2000. Myoglobin-like aerotaxis transducers in Archaea and Bacteria. Nature 403, 540-544.

Counter-Terrorism Domains

Part of EELD project
  Contract Killing
  Gang Wars
  Industry Takeover
Using simplified CK domain
Goal: Identify sequence of events leading up to murder-for-hire

Contract Killing

Part of Contract Killing domain
Multiple sequences of events

[Figure: chain of three Event/ReportOnSituation units linked by nextEvent edges. Vertices: Meeting (Person 1, Person 2), PhoneCall (Person 2, Person 3), Murder (Killerski, Victimski). Edge labels: ContainsInformation, InformationSource, Sender, Receiver, Perpetrator, Victim, nextEvent]

Grammar of CK Domain

[Figure: learned graph grammar for the CK domain with non-terminals S, S2, S3, S4, S5, S6. S is recursive through a nextEvent edge and covers ReportOnSituation units with ContainsInformation, InformationSource, Sender, and Receiver edges; a separate unit covers Murder with Killerski, Victimski, Perpetrator, and Victim. Terminal labels appearing as variable values: Meeting, PhoneCall, E-Mail, Person 1, Person 2, Person 3, Killerski, Event]

Questions and Answers

Can show more, comparative experiments if we have time…

Non-Structural Experiments

Comparison to many approaches reported in the literature
  Statistical, neural, machine learning, DTL, …
  Using the Wisconsin Breast Cancer domain
Comparison to ILP and DTL systems
  Progol, FOIL, C4.5, Subdue
  Using Vote, Diabetes, and Credit domains
Experiments use 10-fold cross validation
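For reference, 10-fold cross validation can be sketched in plain Python as below. The function names, the fixed shuffle seed, and the majority-class learner are hypothetical; the learner only serves to make the example runnable and is not one of the systems compared in the talk.

```python
import random
from collections import Counter


def k_fold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists; every example is tested exactly once."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]


def cross_validate(examples, labels, fit, predict, k=10):
    """Return overall accuracy across the k held-out folds."""
    correct = 0
    for train, test in k_fold_indices(len(examples), k):
        model = fit([examples[i] for i in train], [labels[i] for i in train])
        correct += sum(predict(model, examples[i]) == labels[i] for i in test)
    return correct / len(examples)


# Toy learner: always predict the majority class of the training fold.
def fit_majority(xs, ys):
    return Counter(ys).most_common(1)[0][0]


def predict_majority(model, x):
    return model


examples = list(range(20))
labels = [0] * 15 + [1] * 5
accuracy = cross_validate(examples, labels, fit_majority, predict_majority)
```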

Related Work: ILP Systems

Inductive logic programming (ILP)
  Combines inductive methods with FOPC
Rules
  Fact(a,b) ← Dec(a,c), Fact(c,d), Mult(a,d,b)
  Play(a,b,c,false) ← b<=70.
Example systems:
  FOIL (Cameron-Jones, R. M., & Quinlan, J. R. 1994. Efficient Top-down Induction of Logic Programs. SIGART Bulletin. Vol. 5, 1:33-42.)
  Progol (Muggleton, S. 1995. Inverse Entailment and Progol. New Generation Computing. Vol. 13, 245-286.)

Wisconsin Breast Cancer Domain

Properties:
  Continuous attributes only
    9 attributes normalized between 1 and 10
  Class attributes
    Malignant cases
    Benign cases
  Concept learning task

Comparison Using WBC Domain

Rank  Algorithm                            Accuracy       Accuracy Reported by
1     Probit                               97.2%          Wen-Hua et al, 2002
2     RLP                                  97.07%         Brodley & Utgoff, 1995
3     Gaussian Process                     97.0%          Wen-Hua et al, 2002
4     Feature Minimization                 96.92%         Brodley & Utgoff, 1995
5     GSBE                                 96.77%         Brodley & Utgoff, 1995
6     SVM                                  96.7%          Wen-Hua et al, 2002
7     Multi-surface separation (3 planes)  95.9%          Wolberg & Mangasarian, 1990
8     SubdueGL                             95.67%         Authors
9     FOIL                                 94.8%          Authors
10    C4.5                                 94.4%          Liu & Setiono, 1996
11    Subdue                               94.37%         Authors
12    Neural Networks                      93.3–95.61%    Taha & Ghosh, 1997
13    1-nearest neighbor                   93.7%          Zhang, 1992
14    Multi-surface separation (1 plane)   93.5%          Wolberg & Mangasarian, 1990
15    Linear Discriminant                  92.9%          Wen-Hua et al, 2002
16    Logistic regression                  92.3%          Wen-Hua et al, 2002

Comparison to ILP and DTL

Discrete and mixed attribute types
  Vote
    16 discrete-valued attributes (y, n, u)
  Diabetes (Pima Indians)
    7 continuous-valued attributes
  Credit (German)
    13 discrete-valued attributes
    7 continuous-valued attributes
Concept learning (all have 2 classes)

Comparison to ILP Systems

          Vote    Diabetes  Credit
FOIL      93.02%  70.66%    68.60%
Progol    94.19%  63.68%    63.20%
Subdue    89.07%  61.71%    70.50%
SubdueGL  94.23%  70.94%    71.30%

Comparison to DTL

          Vote    Diabetes  Credit
C4.5      94.48%  74.62%    70.90%
Subdue    89.07%  61.71%    70.50%
SubdueGL  94.23%  70.94%    71.30%