
Exploiting Relationships for Object Consolidation

Zhaoqi Chen Dmitri V. Kalashnikov Sharad Mehrotra

Computer Science Department
University of California, Irvine

http://www.ics.uci.edu/~dvk/RelDC
http://www.itr-rescue.org (RESCUE)

ACM IQIS 2005

Work supported by NSF Grants IIS-0331707 and IIS-0083489


Talk Overview

• Motivation

• Object consolidation problem

• Proposed approach: RelDC (Relationship-based Data Cleaning)
  – relationship analysis and graph partitioning

• Experiments

Why do we need "Data Cleaning"?

Jane Smith (fresh Ph.D.): "Hi, my name is Jane Smith. I'd like to apply for a faculty position at your university."

Tom (recruiter): "OK, let me check something quickly…"

Tom: "Wow! Unbelievable! Are you sure you will join us even if we do not offer you tenure right away?"

(Figure: Tom looks up Jane's CiteSeer rank and finds two different publication lists for "Jane Smith", leaving him puzzled.)

What is the problem?

• Names often do not uniquely identify people

(Figure: CiteSeer's list of the top-k most cited authors, with corresponding DBLP pages for ambiguous names.)

Comparing raw and cleaned CiteSeer

Rank | Author | Location | # citations
1 (100.00%) | douglas schmidt | cs@wustl | 5608
2 (100.00%) | rakesh agrawal | almaden@ibm | 4209
3 (100.00%) | hector garciamolina | @ | 4167
4 (100.00%) | sally floyd | @aciri | 3902
5 (100.00%) | jennifer widom | @stanford | 3835
6 (100.00%) | david culler | cs@berkeley | 3619
6 (100.00%) | thomas henzinger | eecs@berkeley | 3752
7 (100.00%) | rajeev motwani | @stanford | 3570
8 (100.00%) | willy zwaenepoel | cs@rice | 3624
9 (100.00%) | van jacobson | lbl@gov | 3468
10 (100.00%) | rajeev alur | cis@upenn | 3577
11 (100.00%) | john ousterhout | @pacbell | 3290
12 (100.00%) | joseph halpern | cs@cornell | 3364
13 (100.00%) | andrew kahng | @ucsd | 3288
14 (100.00%) | peter stadler | tbi@univie | 3187
15 (100.00%) | serge abiteboul | @inria | 3060

(Tables: the CiteSeer top-k list compared against the cleaned CiteSeer top-k list.)

Object Consolidation Problem

• Cluster representations that correspond to the same real-world object/entity

• Two instances of the problem: the real-world objects are known, or unknown

(Figure: representations r1, r2, ..., rN in the database mapped to the real objects o1, o2, ..., oM they refer to.)

RelDC Approach

• Exploit relationships among objects to disambiguate when the traditional approach of clustering based on feature similarity does not work

(Figure: the RelDC framework. Traditional methods compare features f1-f4 of records X and Y and can remain undecided ("?"); relationship-based data cleaning additionally analyzes features and context over an attributed relational graph (ARG) that connects X and Y through entities A-F.)

Attributed Relational Graph (ARG)

View the database as an ARG.

Nodes:
– one per cluster of representations (if already resolved by a feature-based approach)
– one per representation (for "tough" cases)

Edges:
– Regular: correspond to relationships between entities
– Similarity: created using feature-based methods on representations

(Figure: a sample ARG with person, publication, department, and organization nodes.)
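For concreteness, here is a minimal sketch of such an ARG built with networkx; the library choice and every node, name, and attribute below are illustrative assumptions, not the paper's implementation.

    import networkx as nx

    G = nx.Graph()

    # Nodes: one per resolved entity, one per "tough" representation.
    G.add_node("r1", kind="representation", name="J. Smith")
    G.add_node("jane", kind="person", name="Jane Smith")
    G.add_node("john", kind="person", name="John Smith")
    G.add_node("cs_dept", kind="department")
    G.add_node("p1", kind="publication")

    # Regular edges correspond to relationships between entities.
    G.add_edge("jane", "cs_dept", etype="regular", rel="affiliated_with")
    G.add_edge("p1", "cs_dept", etype="regular", rel="published_in")

    # Similarity edges link representations that feature-based methods
    # could not resolve; the weight is the feature similarity.
    G.add_edge("r1", "jane", etype="similarity", weight=0.5)
    G.add_edge("r1", "john", etype="similarity", weight=0.5)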

Context Attraction Principle (CAP)

Who is "J. Smith" when merging a new publication: Jane Smith or John Smith?

(Figure: a new publication by "J. Smith" to be merged, with relationship paths leading to both Jane Smith and John Smith.)

Questions to Answer

1. Does the CAP hold over real datasets? That is, if we consolidate objects based on it, will the quality of consolidation improve?

2. Can we design a generic strategy that exploits the CAP for consolidation?

Consolidation Algorithm

1. Construct the ARG and identify all virtual clusters of representations (VCSs)
   – use feature-based similarity (FBS) in constructing the ARG

2. Choose a VCS and compute the connection strength between its nodes
   – for each pair of representations connected via a similarity edge

3. Partition the VCS
   – use a graph partitioning algorithm
   – partitioning is based on connection strength
   – after partitioning, adjust the ARG accordingly
   – go to Step 2 if more potential clusters exist
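Below is a toy, self-contained sketch of Steps 2-3 in Python. The connection_strength here is a crude shared-neighbor stand-in for the path-based model of the next slides, and the greedy partitioner stands in for the normalized-cut step; nothing here is the paper's actual code.

    def connection_strength(u, v, edges):
        # Stand-in: count shared neighbors as a crude proxy for c(u, v).
        nbrs = lambda x: ({b for a, b in edges if a == x} |
                          {a for a, b in edges if b == x})
        return len(nbrs(u) & nbrs(v))

    def partition(vcs, edges, threshold=1):
        # Stand-in partitioner: group representations whose pairwise
        # connection strength exceeds the threshold.
        clusters = []
        for r in vcs:
            for c in clusters:
                if all(connection_strength(r, x, edges) > threshold for x in c):
                    c.append(r)
                    break
            else:
                clusters.append([r])
        return clusters

    edges = [("r1", "p1"), ("r2", "p1"), ("r1", "p2"), ("r2", "p2"), ("r3", "p3")]
    print(partition(["r1", "r2", "r3"], edges))   # -> [['r1', 'r2'], ['r3']]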

Connection Strength c(u,v)

Models for c(u,v):
– many possibilities: diffusion kernels, random walks, etc.
– none is fully adequate: they cannot learn similarity from data

(Figure: a sample graph connecting u and v through nodes A-H and z.)

Diffusion kernels:
– σ(x,y) = σ1(x,y): "base similarity", via direct links (paths of length 1)
– σk(x,y): "indirect similarity", via paths of length k
– B, with Bxy = σ1(x,y): the base similarity matrix
– B^k: the indirect similarity matrix for paths of length k
– K = Σk λ^k B^k: the total similarity matrix, or "kernel"
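A small numpy sketch of this kernel under the definitions above; the decay factor λ, the toy matrix, and the finite path-length cutoff standing in for the infinite sum are all assumed values.

    import numpy as np

    # Base similarity matrix B for a toy 3-node chain: B[x, y] = sigma1(x, y).
    B = np.array([[0.0, 0.8, 0.0],
                  [0.8, 0.0, 0.5],
                  [0.0, 0.5, 0.0]])

    lam, L = 0.5, 7   # decay factor and path-length cutoff (assumed)

    # K = sum_k lam^k * B^k: direct plus length-k indirect similarities.
    K = sum(lam**k * np.linalg.matrix_power(B, k) for k in range(1, L + 1))
    print(K[0, 2])    # nodes 0 and 2 are similar only through node 1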

Connection Strength c(u,v) (cont.)

(Figure: "John Smith" and "Alan White" connected through publication P1, MIT, and a chain of N-2 intermediate nodes, with edges of types T1 and T2.)

Instantiating parameters:
– Determining σ(x,y) for regular edges:
  – regular edges have types T1, ..., Tn
  – types T1, ..., Tn have weights w1, ..., wn
  – σ(x,y) = wi: take the type Ti of the given edge and assign its weight as the base similarity
– Handling similarity edges:
  – σ(x,y) is assigned a value proportional to the similarity (a heuristic)
– An approach to learn σ(x,y) from data is ongoing work

Implementation:
– we do not compute the whole matrix K; we compute one c(u,v) at a time
– we limit path lengths by L

(Figures a-c: sample subgraphs linking representations such as R1:John, R2:J.Smith, A4:Alan, and A1:John through publications P1-P4, MIT, and Stanford.)
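A hedged sketch of the one-c(u,v)-at-a-time computation: enumerate the L-short simple paths between u and v with networkx and sum each path's product of base similarities. The graph and weights are made up for illustration, and the product-of-weights path score is one plausible instantiation, not necessarily the paper's exact formula.

    import math
    import networkx as nx

    G = nx.Graph()
    G.add_weighted_edges_from([("u", "A", 0.5), ("A", "v", 0.5),
                               ("u", "B", 0.3), ("B", "C", 0.3), ("C", "v", 0.3)])

    def connection_strength(G, u, v, L=7):
        total = 0.0
        # Only simple paths of length <= L are considered.
        for path in nx.all_simple_paths(G, u, v, cutoff=L):
            weights = [G[a][b]["weight"] for a, b in zip(path, path[1:])]
            total += math.prod(weights)   # one path's contribution
        return total

    print(connection_strength(G, "u", "v"))   # 0.5*0.5 + 0.3*0.3*0.3 = 0.277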

Consolidation via Partitioning

Observations:
– each VCS contains the representations of at least one object
– if a representation is in a VCS, then the rest of the representations of the same object are in it too

Partitioning, two cases:
– k, the number of entities in the VCS, is known:
  – use any partitioning algorithm that maximizes intra-cluster and minimizes inter-cluster connection strength
  – we use the normalized cut algorithm of [Shi, Malik 2000]
– k is unknown:
  – split into two, just to see the cut
  – compare the cut against a threshold
  – decide "to split" or "not to split", and iterate

(Figure: two VCSs whose representations, labeled 1-5 by entity, are partitioned into per-entity clusters.)
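A self-contained sketch of the k-unknown case, assuming toy connection strengths: it bisects with the Fiedler vector of the graph Laplacian (a simplified spectral cut standing in for normalized cut) and keeps a VCS whole when the cut across the split is strong.

    import numpy as np

    # Pairwise connection strengths for a toy VCS of four representations:
    # 0-1 and 2-3 are strongly connected, with only weak ties across.
    conn = np.array([[0, 9, 1, 0],
                     [9, 0, 0, 1],
                     [1, 0, 0, 8],
                     [0, 1, 8, 0]], dtype=float)

    def split(nodes, threshold):
        if len(nodes) < 2:
            return [nodes]
        W = conn[np.ix_(nodes, nodes)]
        Lap = np.diag(W.sum(axis=1)) - W          # graph Laplacian
        fiedler = np.linalg.eigh(Lap)[1][:, 1]    # second-smallest eigenvector
        a = [n for n, f in zip(nodes, fiedler) if f < 0]
        b = [n for n in nodes if n not in a]
        if not a or not b:
            return [nodes]
        cut = conn[np.ix_(a, b)].sum()            # strength across the split
        if cut >= threshold:                      # strong cut: do not split
            return [nodes]
        return split(a, threshold) + split(b, threshold)

    print(split([0, 1, 2, 3], threshold=3.0))     # e.g. [[0, 1], [2, 3]]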

Measuring Quality of Outcome

– Dispersion: for an entity, into how many clusters its representations are clustered; the ideal is 1
– Diversity: for a cluster, how many distinct entities it covers; the ideal is 1
– Entity uncertainty: for an entity, if out of m representations m1 go to cluster C1, ..., mn to Cn, then H = -Σi (mi/m) log2 (mi/m)
– Cluster uncertainty: if a cluster consists of m1 representations of entity E1, ..., mn of En, then H is defined the same way; the ideal entropy is zero

(Figure: four example clusterings of two entities' representations: Ideal, One Misassigned (Example 1), Half Misassigned, and One Misassigned (Example 2), each scored with per-cluster diversity and entropy H and per-entity dispersion and entropy H. The ideal clustering has diversity = dispersion = 1 and H = 0; the half-misassigned clustering has H = 1; the one-misassigned clusterings have H values of 0.65 and 0.592.)

Dispersion and diversity cannot distinguish the one-misassigned and half-misassigned cases: both score 2.

Entropy can: since 0.65 < 1, the one-misassigned clustering is better.

In Example 2, the average entropy decreases (improves) compared to Example 1.
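As a check, here is the entropy measure in code; base-2 logarithms reproduce the 0.65 and 0.592 values in the examples above.

    import math

    def entropy(counts):
        # H = -sum (m_i / m) * log2(m_i / m) over the membership counts m_i.
        m = sum(counts)
        return -sum(mi / m * math.log2(mi / m) for mi in counts if mi)

    print(round(entropy([5, 1]), 2))    # 0.65: one of six representations misassigned
    print(round(entropy([6, 1]), 3))    # 0.592: one of seven misassigned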

Experimental Setup

Parameters:
– L-short simple paths, with L = 7 (L is the path-length limit)

Note:
– the algorithm is applied to the "tough" cases, after FBS has already successfully consolidated many entries

RealMov dataset:
– movies (12K)
– people (22K): actors, directors, producers
– studios (1K): producing, distributing

Introducing uncertainty:
– d1, d2, ..., dn are the director entities
– pick a fraction d1, d2, ..., dm of them
– group entities into groups of size k, e.g. groups of two: {d1, d2}, ..., {d9, d10}
– make all representations within a group indiscernible by FBS

Baseline 1:
– one cluster per VCS, regardless
– equivalent to using only FBS
– ideal dispersion and H(E)!

Baseline 2:
– knows the grouping statistics
– guesses the number of entities in each VCS
– randomly assigns representations to clusters
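A sketch of Baseline 2 under the stated assumptions: it is told how many entities hide in a VCS and assigns representations to that many clusters at random, with no relationship analysis.

    import random

    def baseline2(vcs, k, seed=0):
        rng = random.Random(seed)
        clusters = [[] for _ in range(k)]
        for r in vcs:
            rng.choice(clusters).append(r)   # random assignment, no analysis
        return clusters

    print(baseline2(["r1", "r2", "r3", "r4"], k=2))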

Sample Movies Data

(Figure: sample records from the RealMov dataset.)

The Effect of L on Quality

(Figures: cluster entropy and diversity; entity entropy and dispersion, as functions of L.)

Effect of Threshold and Scalability

(Figures: quality as a function of the threshold; scalability.)

Summary

RelDC:
– a domain-independent data cleaning framework
– uses relationships for data cleaning
– reference disambiguation [SDM'05]
– object consolidation [IQIS'05]

Ongoing work:
– "learning" the importance of relationships from data
– exploiting relationships among entities for other data cleaning problems

Contact Information

RelDC project: www.ics.uci.edu/~dvk/RelDC
RESCUE: www.itr-rescue.org

Zhaoqi Chen: chenz@ics.uci.edu
Dmitri V. Kalashnikov: www.ics.uci.edu/~dvk, dvk@ics.uci.edu
Sharad Mehrotra: www.ics.uci.edu/~sharad, sharad@ics.uci.edu

Extra slides…

Object Consolidation

Notation:
– O = {o1, ..., o|O|}: the set of entities (unknown in general)
– X = {x1, ..., x|X|}: the set of representations
– d[xi]: the entity xi refers to (unknown in general)
– C[xi]: all representations that refer to d[xi], the "group set" (unknown in general; the goal is to find it for each xi)
– S[xi]: all representations that can be xi, the "consolidation set" (determined by FBS)
– we assume C[xi] ⊆ S[xi]
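A toy rendering of this notation as Python data; all identifiers are hypothetical. Here d holds the hidden ground truth, C[x] the group sets that consolidation must recover, and S[x] the consolidation sets as FBS might output them.

    # Hidden ground truth: which entity each representation refers to.
    d = {"x1": "o1", "x2": "o1", "x3": "o2"}

    # Group sets C[x]: all representations referring to d[x] (derived from d here).
    C = {x: {y for y in d if d[y] == d[x]} for x in d}

    # Consolidation sets S[x], as a feature-based method might output them.
    S = {x: {"x1", "x2", "x3"} for x in d}

    assert all(C[x] <= S[x] for x in d)   # the assumption C[x] ⊆ S[x]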

Object Consolidation Problem

• Let O = {o1, ..., o|O|} be the set of entities (unknown in general)

• Let X = {x1, ..., x|X|} be the set of representations

• The goal is to map each xi to its corresponding entity oj in O:
  – d[xi]: the entity xi refers to (unknown in general)
  – C[xi]: all representations that refer to d[xi], the "group set" (unknown in general; the goal is to find it for each xi)
  – S[xi]: all representations that can be xi, the "consolidation set" (determined by FBS)
  – we assume C[xi] ⊆ S[xi]

RelDC Framework

(Figure: the RelDC pipeline. Raw data is extracted into a representation as tables/ARGs; traditional methods compare features f1-f4 of X and Y and can remain undecided ("?"); relationship-based data cleaning analyzes features and context over the ARG connecting X and Y through entities A-F.)

Connection Strength

Computation of c(u,v):

Phase 1: discover connections
– find all L-short simple paths between u and v
– this phase is the bottleneck (optimizations exist, but are not covered in IQIS'05)

Phase 2: measure the strength
– of the discovered connections
– many c(u,v) models exist; we use a model similar to diffusion kernels

(Figure: a sample graph connecting u and v through nodes A-H and z.)

Our c(u,v) Model

(Figure: "John Smith" and "Alan White" connected through publication P1, MIT, and a chain of N-2 intermediate nodes, with edges of types T1 and T2.)

Our c(u,v) model:
– regular edges have types T1, ..., Tn
– types T1, ..., Tn have weights w1, ..., wn
– σ(x,y) = wi: take the type Ti of the given edge and assign its weight as the base similarity
– paths with similarity edges might not exist; we use heuristics

Our model vs. diffusion kernels:
– virtually identical, but...
– we do not compute the whole matrix K; we compute one c(u,v) at a time
– we limit path lengths by L
– σ(x,y) is unknown in general: the analyst assigns the weights (learning them from data is ongoing work)

(Figures a-c: sample subgraphs linking representations such as R1:John, R2:J.Smith, A4:Alan, and A1:John through publications P1-P4, MIT, and Stanford.)
