Self-tuning in Graph-Based Reference Disambiguation
Rabia Nuray-Turan, Dmitri V. Kalashnikov, Sharad Mehrotra
Computer Science Department, University of California, Irvine
April 19, 2023 DASFAA 2007, Bangkok, Thailand 2
Overview
• Intro to Data Cleaning
  – Entity resolution
• RelDC Framework
  – Past work
• Adapting to data
  – The new part
  – Reduction to an optimization problem
  – Linear programming
• Experiments
Data Cleaning
[Figure: raw datasets (uncertainty, errors, multiple sources) containing references such as "J. Smith", "John Smith", "Jane Smith", "MIT", and "Intel Inc." pass through data cleaning into a regular database that can be analyzed, with rows (John Smith, Intel) and (Jane Smith, MIT).]
Analysis on bad data leads to wrong conclusions.
Example of the problem: CiteSeer top-K
• Suspicious entries
  – Let's go to the DBLP website, which stores bibliographic entries of many CS authors
  – Let's check two people
    – "A. Gupta"
    – "L. Zhang"
[Figure: CiteSeer list of the top-k most cited authors, shown next to two DBLP pages.]
Two Most Common Entity-Resolution Challenges
[Figure: references "J. Smith", "John Smith", and "Jane Smith", with organizations "MIT" and "Intel Inc."]
Fuzzy lookup
– reference disambiguation
– match references to objects
– list of all objects is given
Fuzzy grouping
– group together object representations that correspond to the same object
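Standard Approach to Entity Resolution
[Figure: traditional methods use features and context. Two descriptions, X and Y (labeled "J. Smith" and "Jane Smith"), are compared on their features f2 and f3 to decide whether they match.]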
Overview
• Intro to Data Cleaning
• RelDC Framework
  – Past work
• Adapting to data
  – The new part
  – Reduction to an optimization problem
  – Linear programming
• Experiments
RelDC Framework
[Figure: traditional methods compare the features and context (f1 to f4) of X and Y. The RelDC framework adds relationship analysis on top of them: X and Y are also connected through other entities (A to F) in the ARG.]
RelDC: Relationship-based Data Cleaning
RelDC Framework
• Past work
  – SDM'05, TODS'06
• Domain-independent framework
  – Views the dataset as an entity-relationship graph
  – Analyzes paths in this graph
• Solid theoretic foundation
  – Optimization problem
• Scales to large datasets
• Robust under uncertainty
• High disambiguation quality
• No self-tuning
  – This paper solves this challenge
Entity-Relationship Graph
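[Figure: entity-relationship graph. The choice node of an uncertain reference r (for example, "J. Smith" in r1, with options "John Smith" and "Jane Smith") links the context entity xr to the options yr1, …, yrN through option-edges er1, …, erN, whose weights wr1, …, wrN are unknown. Legend: regular nodes, choice nodes, options of choice r, option-edges, context entity of r.]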
• Choice node
  – For uncertain references
  – To encode options/possibilities yr1, …, yrN
• Among options yr1, …, yrN
  – Pick the most strongly connected one
    – CAP principle
  – Analyze paths in G
    – that exist between xr and yrj, for all j
  – Use a model to measure connection strength
• "Connection strength" model
  – c(u,v), for nodes u and v in G
  – how strongly u and v are connected in G
  – RandomWalk-based
    – Fixed
    – Based on intuition!
  – This paper, instead, learns such a model from data.
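The selection rule on this slide (among the options of a choice node, pick the one most strongly connected to the context entity) can be sketched in a few lines. This is only an illustration: the function name `resolve_reference`, the node labels, and the toy strength values are assumptions, and the connection-strength model c(u,v) is passed in as a black box.

```python
from typing import Callable, Hashable, Sequence


def resolve_reference(
    context: Hashable,
    options: Sequence[Hashable],
    strength: Callable[[Hashable, Hashable], float],
) -> Hashable:
    """Return the option y that maximizes strength(context, y)."""
    if not options:
        raise ValueError("a choice node needs at least one option")
    return max(options, key=lambda y: strength(context, y))


# Toy usage with made-up strengths for the reference "J. Smith".
toy_strength = {
    ("paper_1", "John Smith"): 0.7,
    ("paper_1", "Jane Smith"): 0.2,
}
best = resolve_reference(
    "paper_1",
    ["John Smith", "Jane Smith"],
    lambda x, y: toy_strength[(x, y)],
)
```

Keeping the strength model as a parameter mirrors the slide's point: the selection rule stays the same whether c(u,v) is a fixed RandomWalk-based model or one learned from data.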
Overview
• Intro to Data Cleaning
• RelDC Framework
  – Past work
• Adapting to data
  – The new part
  – Reduction to an optimization problem
  – Linear programming
• Experiments
Adaptive Solution
• Classify the paths found in the graph into a finite set of path types: ST = {T1, T2, …, TN}
• If paths p1 and p2 are of the same type, they are treated as identical.
• The connection between nodes u and v can then be described by a path-type count vector Tuv = {c1, c2, …, cN}, where ci is the number of paths of type Ti between u and v.
• If each path type Ti can be associated with a weight wi, the connection strength is:
$c(u,v) = \sum_{i=1}^{N} w_i \cdot c_i$
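As a small worked illustration of the weighted sum above, the sketch below computes a connection strength from a weight vector and a path-type count vector. The function name and the numbers are made up for the example and are not taken from the paper.

```python
from typing import Sequence


def connection_strength(weights: Sequence[float], counts: Sequence[int]) -> float:
    """Weighted sum of path-type counts: sum over i of w_i * c_i."""
    if len(weights) != len(counts):
        raise ValueError("weights and counts must cover the same path types")
    return sum(w * c for w, c in zip(weights, counts))


# Four hypothetical path types; u and v are joined by 2, 1, 3 and 0 paths of each type.
example = connection_strength([1.0, 0.5, 0.0, 0.0], [2, 1, 3, 0])
```

Here the three paths of the third type contribute nothing because their type has weight 0, so the result is 2 * 1.0 + 1 * 0.5 = 2.5.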
Problems to Answer
• How will we classify the paths?
• How will we associate each path type with a weight?
Classifying Paths
• Path Type Model (PTM):
  – Views each path as a sequence of edges
    – <e1,e2,e3,…,en>
  – Each edge ei has a type Ei associated with it
  – Thus, each path p can be associated with a string
    – <E1,E2,E3,…,En>
  – Different strings correspond to different path types
    – Associate a weight with each string
• Different models are also possible
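The Path Type Model described above can be sketched as follows: each path is mapped to the tuple of its edge types, and paths with the same tuple are counted together. The edge representation (ordered node pairs), the function names, and the toy graph are assumptions made for this illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence, Tuple

Edge = Tuple[str, str]  # hypothetical representation: (source node, target node)


def path_type(path: Sequence[Edge], edge_type: Dict[Edge, str]) -> Tuple[str, ...]:
    """Map a path <e1, ..., en> to its type string <E1, ..., En>."""
    return tuple(edge_type[e] for e in path)


def path_type_counts(
    paths: Iterable[Sequence[Edge]], edge_type: Dict[Edge, str]
) -> Counter:
    """Count how many of the given paths fall into each path type."""
    return Counter(path_type(p, edge_type) for p in paths)


# Toy data: three paths from x to y over edges of types e1 and e3.
toy_edge_type = {
    ("x", "a"): "e1", ("a", "b"): "e3", ("b", "y"): "e1",
    ("x", "c"): "e1", ("c", "d"): "e3", ("d", "y"): "e1",
    ("x", "f"): "e1", ("f", "g"): "e1", ("g", "y"): "e3",
}
toy_paths = [
    [("x", "a"), ("a", "b"), ("b", "y")],
    [("x", "c"), ("c", "d"), ("d", "y")],
    [("x", "f"), ("f", "g"), ("g", "y")],
]
toy_counts = path_type_counts(toy_paths, toy_edge_type)
```

The first two paths share the type e1-e3-e1 and are therefore treated as identical, while the third has type e1-e1-e3.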
Learning Path Weights: Optimization Problem
• CAP principle states that:
  – the right option will be better connected
• Linear programming
• Learn the path-type weights w.
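To make the idea concrete, here is a deliberately simplified sketch, not the paper's exact formulation. It assumes the objective is the total margin between the right option and every wrong option, summed over the training references, and that each weight is only constrained to lie in [0, 1]. A linear objective over such a box is maximized coordinate by coordinate, so no LP solver is needed here; a formulation with additional constraints would require one. All names and counts below are hypothetical.

```python
from typing import List, Sequence, Tuple

Counts = Sequence[int]                 # path-type count vector of one option
Example = Tuple[Counts, Sequence[Counts]]  # (right option, wrong options)


def objective_coefficients(examples: Sequence[Example]) -> List[int]:
    """Coefficient of each w_i in the summed margin c(x, right) - c(x, wrong)."""
    n_types = len(examples[0][0])
    coeffs = [0] * n_types
    for right, wrongs in examples:
        for wrong in wrongs:
            for i in range(n_types):
                coeffs[i] += right[i] - wrong[i]
    return coeffs


def learn_weights(examples: Sequence[Example], free_value: float = 0.5) -> List[float]:
    """Maximize the summed margin subject to 0 <= w_i <= 1 (assumed constraints)."""
    weights = []
    for d in objective_coefficients(examples):
        if d > 0:
            weights.append(1.0)         # this type favors right options
        elif d < 0:
            weights.append(0.0)         # this type favors wrong options
        else:
            weights.append(free_value)  # objective is indifferent to this weight
    return weights


# One made-up training reference with a right option and two wrong options.
toy_examples = [([2, 1, 0, 1], [[0, 1, 1, 1], [1, 1, 0, 2]])]
toy_weights = learn_weights(toy_examples)
```

With these toy counts the coefficients are [3, 0, -1, -1], so the first weight goes to 1, the last two go to 0, and the second can take any value in [0, 1] without changing the objective.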
Final Solution
• The value of c(xr,yrj) - c(xr,yrl), where yrj is the right option for reference r, should be maximized for all r and all l ≠ j
• Then final solution:
Example: Graph
[Figure: example entity-relationship graph with choice nodes r1 and r2, context entities x1 and x2, options y1 to y6 (each marked + or -), edges of types e1, e2, and e3, and paths labeled w1 to w4.]
Path types:
P1 = e1-e3-e1
P2 = e1-e1-e3
P3 = e1-e2-e2-e3
P4 = e1-e2-e3-e2-e3
Example: Solution
• w1 = 1
• w3 = w4 = 0
• w2 can be anything between 0 and 1.
Overview
• Intro to Data Cleaning
• RelDC Framework
  – Past work
• Adapting to data
  – The new part
  – Reduction to an optimization problem
  – Linear programming
• Experiments
Experimental Setup
Parameters
– When looking for L-short simple paths, L = 5
– L is the path-length limit
SynPub datasets:
– many datasets of five different types
– emulation of RealPub
– publications (5K)
– authors (1K)
– organizations (25K)
– departments (125K)
– ground truth is known
RealMov:
– movies (12K)
– people (22K)
  – actors
  – directors
  – producers
– studios (1K)
  – producing
  – distributing
– ground truth is known
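The L-short simple paths named in the parameters are paths that visit no node twice and have at most L edges. A minimal depth-first sketch of how such paths could be enumerated is shown below; the adjacency-list representation, the function name, and the toy graph are assumptions, and the actual implementation used in the experiments may differ.

```python
from typing import Dict, Hashable, Iterator, List

Graph = Dict[Hashable, List[Hashable]]  # hypothetical adjacency-list graph


def l_short_simple_paths(
    graph: Graph, source: Hashable, target: Hashable, max_len: int
) -> Iterator[List[Hashable]]:
    """Yield simple paths from source to target that use at most max_len edges."""
    path = [source]
    visited = {source}

    def extend(node: Hashable) -> Iterator[List[Hashable]]:
        if node == target:
            yield list(path)
            return
        if len(path) - 1 >= max_len:  # edge budget exhausted
            return
        for nxt in graph.get(node, []):
            if nxt not in visited:    # keep the path simple
                visited.add(nxt)
                path.append(nxt)
                yield from extend(nxt)
                path.pop()
                visited.remove(nxt)

    yield from extend(source)


# Toy undirected graph: x-a-y (2 edges) and x-b-c-d-y (4 edges).
toy_graph = {
    "x": ["a", "b"], "a": ["x", "y"], "b": ["x", "c"],
    "c": ["b", "d"], "d": ["c", "y"], "y": ["a", "d"],
}
paths_l5 = list(l_short_simple_paths(toy_graph, "x", "y", 5))
paths_l3 = list(l_short_simple_paths(toy_graph, "x", "y", 3))
```

With L = 5 both paths are found; with L = 3 only the two-edge path x-a-y remains, which shows how the limit bounds the search.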
Experimental Results on Movies
Parameters:
– Fraction: fraction of uncertain references in the dataset
– Each reference has 2 choices
Experimental Results on Movies II
Number of options based on PMF Distribution
Experimental Results on SynPub
• RandomWalk, PTM and the Hybrid Model have the same accuracy
• Is RandomWalk the optimum model for Publications domain?
Hybrid Model:
$c(u,v) = \sum_{p \in P_L(u,v)} w_i \cdot c(p)$
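One hedged reading of the hybrid model is a sum over the paths between u and v in which each path's own strength c(p) is scaled by the learned weight of that path's type. The sketch below follows that reading only; the per-path strengths are supplied as plain numbers, and the names and values are invented for the example.

```python
from typing import Dict, Iterable, Tuple


def hybrid_strength(
    paths: Iterable[Tuple[str, float]], type_weight: Dict[str, float]
) -> float:
    """Sum of w_type(p) * c(p) over paths given as (path type, c(p)) pairs."""
    return sum(type_weight[ptype] * c_p for ptype, c_p in paths)


# Three hypothetical paths between u and v: two of type P1, one of type P2.
toy_paths = [("P1", 0.25), ("P1", 0.125), ("P2", 0.5)]
toy_weights = {"P1": 1.0, "P2": 0.5}
toy_value = hybrid_strength(toy_paths, toy_weights)
```

Under this reading the Path Type Model is the special case in which every path has strength 1, so the sum reduces to the weighted path-type counts.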
Effect of Random Relationships in the Publications Domain
Summary
• Main contribution
  – An adaptive solution for connection strength
  – Model learns the weights of different path types
• Ongoing work
  – Using different models to learn the importance of paths in the connection strength
  – Use of standard machine learning techniques for learning, such as decision trees, etc.
  – Different ways to classify paths
Contact Information
• RelDC project
  – www.ics.uci.edu/~dvk/RelDC
  – www.itr-rescue.org (RESCUE)
• Rabia Nuray-Turan (contact author)
  – www.ics.uci.edu/~rnuray
• Dmitri V. Kalashnikov
  – www.ics.uci.edu/~dvk
• Sharad Mehrotra
  – www.ics.uci.edu/~sharad
Thank you!