dimacs graph mining (via similarity measures) ye zhu stephanie reu-dimacs, july 17, 2009...

21
Dimacs Graph Dimacs Graph Mining Mining (via Similarity (via Similarity Measures) Measures) Ye Zhu Stephanie REU-DIMACS, July 17, 2009 [email protected] Mentor : James Abello

Upload: giles-jones

Post on 27-Dec-2015

219 views

Category:

Documents


3 download

TRANSCRIPT

Dimacs Graph Mining Dimacs Graph Mining (via Similarity Measures)(via Similarity Measures)

Ye Zhu StephanieREU-DIMACS, July 17, [email protected]

Mentor : James Abello

Talk OutlineTalk OutlineI. From Data to Graphs via Similarity

Measure II. Our Research Project Input: REU participants information DIMACS Workshop Data Output: A variety of Graphs

III. Main Questions a. Choose good similarity measures

b. Visualize and detect “interesting” patterns

Original data records for Original data records for building building REU-Participants graphsREU-Participants graphs

Original Data RecordsOriginal Data Records

DIMACS Workshop DIMACS Workshop AbstractsAbstracts

General MethodGeneral MethodStep 1: Compute a similarity measure

among the data records shown above.

Since a record can be viewed as a unweighted/weighted set of attributes we use unweighted/weighted version of an standard metric among finite sets that uses the size of the intersection over the size of the union between two sets

Weighted caseWeighted case

pet

fat

dog cat

ComputationComputation

pet

fat

dog cat

jnm

jnm

nm,attj)),attj),W(w(W(w

,attj)),attj),W(w(W(w),wWJ(w

i)ttj with w(freq of a gw(attj)

) lw(wj,attjgw(attj)W

i)ttj with w(freq of a )lw(wi,attj

max

min

log

log

28.08.079.075.085.07.09.0

06.075.0000

General MethodGeneral MethodStep 1: Compute a similarity

measure among the data records shown above.

Step 2: Deal with different types of data records respectively.

Computing Edge WeightComputing Edge WeightTo deal with different types of

information, we partition the attributes into different classes according to their value types and compute a similarity measure for each class and then combine these values using a convex combination

Eg. Total Weight=0.3*Weighted Coeff+0.7*Unweighted

Coeff

REU participants exampleREU participants exampleHow to calculate the Edge How to calculate the Edge Weight? Weight?

Unweighted

Unweighted

Weighted

REU participants exampleREU participants exampleHow about the Vertices' How about the Vertices' Weight(ball size)Weight(ball size)

We can simply convert these 3 columns to three-digit numbers !!!

General MethodGeneral MethodStep 1: Compute a similarity

measure among the data records shown above.

Step 2: Deal with different types of data records respectively.

Step 3: Build weighted graph where each record is now treated as a vertex and two vertices are joined by an edge with weight equal to their computed similarity

General MethodGeneral MethodStep 1: Compute a similarity measure

among the data records shown above.Step 2: Deal with different types of

data records respectively. Step 3: Build weighted graph where

each record is now treated as a vertex and two vertices are joined by an edge with weight equal to their computed similarity

Step 4: Visualize the graph use GraphView Software and find interesting clusters

REU Participants GraphREU Participants Graph

Workshop Abstract Workshop Abstract ExampleExampleRead in all workshop abstracts

fileDelete stop words ->

unimportant wordsGet a count of number of

appearances (freqency) of ALL words left in All workshop abstracts

Compute Jaccard Coefficient

Dimacs Workshop Abstract Dimacs Workshop Abstract GraphGraph

ConclusionConclusionWe have shown how data set

records can be transformed into a weighted graph by using a similarity measure among records

This methodology allows us to use powerful graph clustering techniques to analyze and visualize data bases.

ReferencesReferences[1] GraphView system[2] C Gasperin, P Gamallo, A Agustini, G Lopes, V

Lima 2001- Using syntactic contexts for measuring word similarity

[3] Resnik, Philip (1999) Semantic Similarity in a Taxonomy: An Information-Based Measure and its Application to Problems of Ambiguity in Natural Language. Journal of Artificial Intelligence Research

Thank you!Thank you!

The end