Entity Resolution in SERF
Omar Benjelloun
Stanford University
Joint work with: Hector Garcia-Molina, Hideki Kawai, Tait E. Larson, David Menestrina, Qi Su, Sutthipong Thavisomboon, Jennifer Widom
InfoLab workshop March 22nd, 2006
Entity Resolution (ER)
r1: N: a, A: b, CC#: c, Ph: e
r2: N: a, Exp: d, Ph: e
• Many applications:
  – customer files
  – counter-terrorism
  – comparison shopping...
• Aka: deduplication, record linkage, object co-identification, reference reconciliation, …
Challenges (1)
• No keys!
• Value matching
  – “Kaddafi”, “Qaddafi”, “Kadafi”, “Kaddaffi”...
  – Many techniques developed
• Record matching
  Nm: Tom, Ad: 123 Main St, Ph: (650) 555-1212, Ph: (650) 777-7777
  Nm: Thomas, Ad: 132 Main St, Ph: (650) 555-1212
Challenges (2)
• Merging records
  Nm: Tom, Ad: 123 Main St, Ph: (650) 555-1212, Ph: (650) 777-7777
  Nm: Thomas, Ad: 132 Main St, Ph: (650) 555-1212, Zp: 94305
  Merged record: Nm: Tom, Nm: Thomas, Ad: 123 Main St, Ph: (650) 555-1212, Ph: (650) 777-7777, Zp: 94305
Challenges (3)
• Chaining
  Nm: Tom, Wk: IBM, Oc: lawyer, Sal: 500K
  Nm: Tom, Ad: 123 Main, BD: Jan 1, 85, Wk: IBM
  Nm: Thomas, Ad: 132 Main, Oc: lawyer
  Nm: Tom, Nm: Thomas, Ad: 123 Main, BD: Jan 1, 85, Wk: IBM, Oc: lawyer
  Nm: Tom, Ad: 123 Main, BD: Jan 1, 85, Wk: IBM, Oc: lawyer, Sal: 500K
Generic Entity Resolution
• Set of records: R (from a domain of records)
• Match function:
  – M(r1, r2) = true if r1 and r2 represent the same entity
• Merge function:
  – r3 = <r1, r2> (exists if M(r1, r2) = true)
• We view match and merge as black boxes
• Focus on performance rather than accuracy
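The black-box view can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: records are modeled as sets of (attribute, value) pairs, `match` fires on a shared phone number, and `merge` is a plain union. The talk does not prescribe any of these choices.

```python
from typing import FrozenSet, Tuple

# Assumed representation: a record is a set of (attribute, value) pairs.
Record = FrozenSet[Tuple[str, str]]


def match(r1: Record, r2: Record) -> bool:
    """Hypothetical M(r1, r2): identical records, or a shared phone number."""
    phones1 = {v for a, v in r1 if a == "Ph"}
    phones2 = {v for a, v in r2 if a == "Ph"}
    return r1 == r2 or bool(phones1 & phones2)


def merge(r1: Record, r2: Record) -> Record:
    """Hypothetical <r1, r2>: keep every attribute value of both records."""
    return r1 | r2


tom: Record = frozenset({
    ("Nm", "Tom"), ("Ad", "123 Main St"),
    ("Ph", "(650) 555-1212"), ("Ph", "(650) 777-7777"),
})
thomas: Record = frozenset({
    ("Nm", "Thomas"), ("Ad", "132 Main St"),
    ("Ph", "(650) 555-1212"), ("Zp", "94305"),
})

# The merge only exists when the records match.
merged = merge(tom, thomas) if match(tom, thomas) else None
```

Note that a union merge keeps both addresses, whereas the merged record on the "Challenges (2)" slide keeps only one; a real merge function would encode such domain decisions.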
Domination
• Some records are less informative than others
• Record r1 is dominated by record r2 if <r1, r2> = r2
• Dominated records should be discarded
  Nm: Thomas, Ad: 132 Main, Oc: lawyer
  Nm: Tom, Nm: Thomas, Ad: 123 Main, BD: Jan 1, 85, Wk: IBM, Oc: lawyer
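A small sketch of the domination test, assuming (hypothetically) that records are sets of (attribute, value) pairs and that merge is a union; under that assumption, "r1 is dominated by r2" reduces to "r1 is a subset of r2". The records and the helper name `drop_dominated` are made up for the example.

```python
def merge(r1, r2):
    """Hypothetical <r1, r2>: union of (attribute, value) pairs."""
    return r1 | r2


def is_dominated(r1, r2):
    """r1 is dominated by r2 if <r1, r2> = r2 (definition from the slide)."""
    return merge(r1, r2) == r2


def drop_dominated(records):
    """Discard every record that is dominated by a different record."""
    return [
        r for r in records
        if not any(r != other and is_dominated(r, other) for other in records)
    ]


small = frozenset({("Nm", "Tom"), ("Wk", "IBM")})
large = frozenset({("Nm", "Tom"), ("Wk", "IBM"), ("Oc", "lawyer")})
other = frozenset({("Nm", "Ann"), ("Wk", "HP")})

kept = drop_dominated([small, large, other])
```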
The Entity Resolution problem
• Given a set of records R, the Entity Resolution of R:
  – Has only records derived from R
  – Dominates all records derivable from R
  – Contains no matching or dominated records
• We provide simple and natural conditions to
  – Make ER “consistent” (finite and unique)
  – Enable efficient computation strategies
Conditions
• Commutativity:
  – M(r1, r2) = M(r2, r1)
  – <r1, r2> = <r2, r1>
• Idempotence:
  – M(r1, r1) = true
  – <r1, r1> = r1
• Merge associativity:
  – <r1, <r2, r3>> = <<r1, r2>, r3> (if they exist)
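These three conditions can be spot-checked on sample data. The sketch below assumes a hypothetical phone-based `match` and union `merge` over records modeled as sets of (attribute, value) pairs; passing on a sample is evidence, not a proof, that a given pair of functions satisfies the conditions.

```python
from itertools import product


def match(r1, r2):
    """Hypothetical M: identical records, or a shared phone number."""
    phones1 = {v for a, v in r1 if a == "Ph"}
    phones2 = {v for a, v in r2 if a == "Ph"}
    return r1 == r2 or bool(phones1 & phones2)


def merge(r1, r2):
    """Hypothetical <r1, r2>: union of (attribute, value) pairs."""
    return r1 | r2


def satisfies_conditions(records):
    """Check commutativity, idempotence and merge associativity on a sample."""
    for r1, r2 in product(records, repeat=2):
        if match(r1, r2) != match(r2, r1):
            return False
        if match(r1, r2) and merge(r1, r2) != merge(r2, r1):
            return False
    for r in records:
        if not match(r, r) or merge(r, r) != r:
            return False
    for r1, r2, r3 in product(records, repeat=3):
        # Associativity is only required when both derivations exist.
        left_exists = match(r2, r3) and match(r1, merge(r2, r3))
        right_exists = match(r1, r2) and match(merge(r1, r2), r3)
        if left_exists and right_exists:
            if merge(r1, merge(r2, r3)) != merge(merge(r1, r2), r3):
                return False
    return True


sample = [
    frozenset({("Nm", "Tom"), ("Ph", "555-1212")}),
    frozenset({("Nm", "Thomas"), ("Ph", "555-1212")}),
    frozenset({("Nm", "Ann"), ("Ph", "555-0000")}),
]
ok = satisfies_conditions(sample)
```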
Conditions (2)
• Representativity
  – If r3 = <r1, r2>, then for any r4 such that M(r1, r4) = true, we also have M(r3, r4) = true.
[Figure: records r1, r2, r3, r4]
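Representativity constrains the match function more than it may seem. The sketch below checks it on a sample for two hypothetical match functions over records modeled as plain sets of values: one that matches on any shared value (which passes), and one based on a Jaccard similarity threshold of 0.5 (which fails, because merging grows the record and dilutes the similarity). Both functions are illustrative assumptions, not taken from the talk.

```python
from itertools import product


def union_merge(r1, r2):
    """Hypothetical merge: union of values."""
    return r1 | r2


def shares_value(r1, r2):
    """Hypothetical match: the records have at least one value in common."""
    return bool(r1 & r2)


def jaccard_match(r1, r2):
    """Hypothetical match: Jaccard similarity of at least 0.5."""
    return len(r1 & r2) / len(r1 | r2) >= 0.5


def is_representative(records, match, merge):
    """Check: r3 = <r1, r2> and M(r1, r4) imply M(r3, r4), on a sample."""
    for r1, r2 in product(records, repeat=2):
        if not match(r1, r2):
            continue
        r3 = merge(r1, r2)
        for r4 in records:
            if match(r1, r4) and not match(r3, r4):
                return False
    return True


sample = [frozenset("ab"), frozenset("abcd"), frozenset("abef")]

holds = is_representative(sample, shares_value, union_merge)
fails = is_representative(sample, jaccard_match, union_merge)
```

In the failing case, {a, b} matches both {a, b, c, d} and {a, b, e, f}, but the merge {a, b, c, d} no longer matches {a, b, e, f} (similarity 2/6).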
Algorithms
• These conditions enable flexible computation of ER(R)
  – Starting from R…
  – Find matches, add merged records
  – Find and delete dominated records
  – …in any order
• Optimal algorithm: R-Swoosh
  – Merges records and deletes dominated records early
  – No algorithm performs fewer record comparisons in the worst case
R-Swoosh
R: r1, r2, r3, r4, r5, r6
R': (empty)
R-Swoosh
R: r3, r4, r5, r6
R': r1, r2
• M(r3, r1)? M(r3, r2)? → r3
• M(r4, r1)? → r7 = <r4, r1>
R-Swoosh
[Figure: sets R and R' holding r2, r3, r7, r9]
Also F-Swoosh, a variant that efficiently caches results of value comparisons
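The animation above can be read as the following loop, sketched in Python. This is a reconstruction under assumptions, not the authors' code: records are sets of (attribute, value) pairs, `match` and `merge` are hypothetical stand-ins, and the comparison counter is added only to make the cost visible. Each record taken from R is compared against R'; on a match, the partner leaves R' and the merged record goes back into R.

```python
def match(r1, r2):
    """Hypothetical M: identical records, or a shared phone number."""
    phones1 = {v for a, v in r1 if a == "Ph"}
    phones2 = {v for a, v in r2 if a == "Ph"}
    return r1 == r2 or bool(phones1 & phones2)


def merge(r1, r2):
    """Hypothetical <r1, r2>: union of (attribute, value) pairs."""
    return r1 | r2


def r_swoosh(records, match, merge):
    """Sketch of R-Swoosh: returns (R', number of match comparisons)."""
    todo = list(records)  # R: records still to be processed
    done = []             # R': records that do not match each other
    comparisons = 0
    while todo:
        r = todo.pop()
        partner = None
        for candidate in done:
            comparisons += 1
            if match(r, candidate):
                partner = candidate
                break
        if partner is None:
            done.append(r)
        else:
            # Both source records are dropped; only the merge is kept.
            done.remove(partner)
            todo.append(merge(r, partner))
    return done, comparisons


tom = frozenset({("Nm", "Tom"), ("Ph", "555-1212")})
thomas = frozenset({("Nm", "Thomas"), ("Ph", "555-1212")})
ann = frozenset({("Nm", "Ann"), ("Ph", "555-0000")})

result, comparisons = r_swoosh([tom, thomas, ann], match, merge)
```

Because the source records are removed as soon as they are merged, the dominated records never linger in R', which is the "early deletion" mentioned on the Algorithms slide.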
Example
• [a: v1, b: w1]
• [a: v2, b: w2]
• [a: v3, b: w3]
• ...
• [a: vn, b: wn]
Match: M(ri, rj) = true
Merge: union of values
Answer: [a:{v1, ..., vn}, b:{w1, ..., wn}]
Naïve strategy
• [a: v1, b: w1]
• [a: v2, b: w2]
• [a: v3, b: w3]
• [a: v4, b: w4]
• [a:{v1,v2}, b:{w1,w2}]
• [a:{v1,v3}, b:{w1,w3}]
• [a:{v1,v4}, b:{w1,w4}]
• [a:{v2,v3}, b:{w2,w3}]
• [a:{v2,v4}, b:{w2,w4}]
• [a:{v3,v4}, b:{w3,w4}]
Naïve strategy (2)
• [a:{v1,v2}, ...]
• [a:{v1,v3}, ...]
• [a:{v1,v4}, ...]
• [a:{v2,v3}, ...]
• [a:{v2,v4}, ...]
• [a:{v3,v4}, ...]
• [a:{v1,v2,v3}, ...]
• [a:{v1,v2,v4}, ...]
• [a:{v1,v3,v4}, ...]
• [a:{v2,v3,v4}, ...]
• [a:{v1,v2,v3,v4}, ...]
... A lot of useless work!
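The blow-up can be reproduced with a short sketch of one possible naive strategy: compare all pairs, add every new merge, repeat until nothing new appears, then discard dominated records. The function name `naive_er` and the record encoding (sets of (attribute, value) pairs) are assumptions for illustration.

```python
from itertools import combinations


def always_match(r1, r2):
    """Match function of the example: M(ri, rj) = true."""
    return True


def union_merge(r1, r2):
    """Merge function of the example: union of values."""
    return r1 | r2


def naive_er(records, match, merge):
    """Naive fixpoint: returns (result, records generated, comparisons)."""
    current = set(records)
    comparisons = 0
    while True:
        new = set()
        for r1, r2 in combinations(current, 2):
            comparisons += 1
            if match(r1, r2):
                merged = merge(r1, r2)
                if merged not in current:
                    new.add(merged)
        if not new:
            break
        current |= new
    generated = len(current)
    # Keep only records that no other matching record dominates.
    result = {
        r for r in current
        if not any(r != s and match(r, s) and merge(r, s) == s for s in current)
    }
    return result, generated, comparisons


inputs = [frozenset({("a", f"v{i}"), ("b", f"w{i}")}) for i in range(1, 5)]
result, generated, comparisons = naive_er(inputs, always_match, union_merge)
```

On the four input records this sketch materializes 15 records (every non-empty combination of the inputs) and performs 156 comparisons to arrive at a single answer record, whereas the R-Swoosh slide that follows needs three comparisons.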
R-Swoosh
• [a: v1, b: w1]
• [a: v2, b: w2]
• [a: v3, b: w3]
• [a: v4, b: w4]
• M(r1, r2) → [a:{v1,v2}, ...]
• M(r3, r12) → [a:{v1,v2,v3}, ...]
• M(r4, r123) → [a: v1, a: v2, a: v3, a: v4, ...]
Distributed ER
• ER is expensive:
  – Many records
  – Match comparisons are costly
• Distribute the work across multiple processors
  – Make sure no matches are missed
  – Minimize computation, communications and storage
• Use domain knowledge when available
  – E.g., DOB within 5 years, same product category
D-Swoosh
[Figure: processor Pi with local set R'i (r2, r4, …) and incoming record r3. Labels shown: M(r3, r4)?, r6 = <r3, r4>, add(r6), del(r3), del(r4), add(r9), del(r6).]
D-Swoosh
• Where to send records?
  – scope function (e.g., scope(r2) = {P2, P5, P7})
• Who is responsible for comparisons?
  – resp predicate (e.g., resp(P6, r3, r5) = true)
• scope and resp must satisfy a coverage property (related to the mutual exclusion problem: coteries)
• Schemes without domain knowledge
  – Majority, grid
• Schemes with domain knowledge
  – Value equality, linear ordering, hierarchies
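As an illustration of scope and resp, here is a sketch of one possible majority-style scheme. It is an assumption-laden toy, not the scheme from the talk: records are identified by integers, `scope` sends a record to a block of floor(n/2)+1 consecutive processors, and `resp` picks the lowest-numbered shared processor. Coverage is read here as "every pair of records meets on at least one processor that is responsible for comparing them", which is also an interpretation.

```python
from itertools import combinations
from typing import Iterable, Set


def scope(record_id: int, n: int) -> Set[int]:
    """Hypothetical majority scheme: floor(n/2)+1 consecutive processors."""
    size = n // 2 + 1
    return {(record_id + k) % n for k in range(size)}


def resp(p: int, id1: int, id2: int, n: int) -> bool:
    """Hypothetical resp: the lowest shared processor compares the pair."""
    shared = scope(id1, n) & scope(id2, n)
    return bool(shared) and p == min(shared)


def covers(record_ids: Iterable[int], n: int) -> bool:
    """Assumed coverage property: each pair has a responsible processor
    among the processors that receive both records."""
    return all(
        any(resp(p, a, b, n) for p in scope(a, n) & scope(b, n))
        for a, b in combinations(list(record_ids), 2)
    )


n_processors = 10
covered = covers(range(20), n_processors)
```

Two majorities of the same processor set always intersect, which is why no pair can be missed here; the price is that every record is stored on more than half of the processors.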
D-Swoosh performance
• Computation cost per processor (10 processors)
• Experiments on Yahoo! comparison shopping data
ER with confidences
• Each record has a “confidence” (0 ≤ c ≤ 1)
  – Not tied to specific interpretation (e.g., probabilistic)
  – Match function may exploit confidences
  – Merge function propagates confidences
• Some conditions do not hold anymore:
  – Representativity: Confidence decreases with merges
  – Associativity: Different derivations produce different confidences
• More costly algorithm is required (Koosh)
  – Optimizations: early detection of domination, thresholds
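A tiny sketch of why associativity can break once confidences are attached. The propagation rule used here (minimum of the two confidences, times a 0.9 penalty per merge) is purely hypothetical and chosen only because it makes the effect easy to see; the talk does not specify a rule.

```python
def merge_with_confidence(r1, r2):
    """Hypothetical merge: union of values; confidence = 0.9 * min(c1, c2)."""
    values1, c1 = r1
    values2, c2 = r2
    return (values1 | values2, 0.9 * min(c1, c2))


r1 = (frozenset({("Nm", "Tom")}), 0.5)
r2 = (frozenset({("Nm", "Thomas")}), 1.0)
r3 = (frozenset({("Oc", "lawyer")}), 1.0)

# Two derivations of the same merged values.
left = merge_with_confidence(merge_with_confidence(r1, r2), r3)
right = merge_with_confidence(r1, merge_with_confidence(r2, r3))

same_values = left[0] == right[0]
same_confidence = abs(left[1] - right[1]) < 1e-9
```

Both derivations yield the same attribute values but different confidences (about 0.405 versus 0.45 under this rule), so the order of merges can no longer be ignored.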
Summary
• Entity resolution is critical
• Generic approach yields reusable techniques
• Efficient resolution is important
• Currently working on
  – Large scale distributed ER
  – Negative information
  – Uncertainty and lineage in ER
Thank you.