Active Sampling for Entity Matching
Aditya Parameswaran, Stanford University
Jointly with: Kedar Bellare, Suresh Iyengar, Vibhor Rastogi
Yahoo! Research
Entity Matching
Goal: Find duplicate entities in a given data set
Fundamental data cleaning primitive; decades of prior work
Especially important at Yahoo! (and other web companies)
Homma’s Brown Rice Sushi, California Avenue, Palo Alto
Homma’s Sushi, Cal Ave, Palo Alto
Why is it important?
[Diagram: Websites, Databases, Content Providers (Yelp, Zagat, Foursquare) → Dirty Entities → Find Duplicates → Deduplicated Entities]
Applications: Business Listings in Y! Local, Celebrities in Y! Movies, Events in Y! Upcoming, …
How?
Reformulated Goal: Construct a high-quality classifier identifying duplicate entity pairs
Problem: How do we select training data?
Answer: Active Learning with Human Experts!
Reformulated Workflow
[Diagram: Websites, Databases, Content Providers → Dirty Entities → Our Technique → Deduplicated Entities]
Active Learning (AL) Primer
Properties of an AL algorithm:
— Label Complexity
— Time Complexity
— Consistency
Prior work:
— Uncertainty Sampling
— Query by Committee
— …
— Importance Weighted Active Learning (IWAL)
— Online IWAL without Constraints
  • Implemented in Vowpal Wabbit (VW)
  • 0-1 Metric
  • Time and label efficient
  • Provably consistent
  • Works even under noisy settings
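The flavor of importance-weighted active learning can be sketched in a few lines. This is an illustrative toy, not the IWAL algorithm from the talk or Vowpal Wabbit's implementation; the query rule and all names here are made up:

```python
import random

def iwal_round(score, x, p_min, rng, labeled):
    """Maybe query x's label; if queried, store x with importance weight 1/p."""
    margin = abs(score(x))                 # distance from the decision boundary
    p = max(p_min, 1.0 / (1.0 + margin))   # uncertain points queried more often
    if rng.random() < p:
        labeled.append((x, 1.0 / p))       # weight corrects the sampling bias
        return True
    return False

rng = random.Random(0)
labeled = []
score = lambda x: x - 0.5                  # toy linear scorer
for x in [0.5, 0.1, 5.0, 0.49]:
    iwal_round(score, x, 0.01, rng, labeled)
# x = 0.5 sits exactly on the boundary, so it is queried with probability 1
```

Consistency comes from the 1/p importance weights: easy examples are queried rarely, but count proportionally more when they are queried, so the weighted empirical error stays unbiased.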
Problem One: Imbalanced Data
Typical to have 100:1 imbalance even after blocking
[Chart: 100 non-matches for every 1 match]
Trivial solution: label all pairs as non-matches
• Precision 100% (vacuously: no matches predicted)
• 0-1 Error ≈ 0
• But 0% of the correct matches are identified
Solution: Metric from [Arasu11]:
— Maximize Recall (% of correct matches correctly identified)
— Such that Precision > τ
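The failure of plain 0-1 error on imbalanced data is easy to see with the 100:1 numbers above (toy counts, not from the paper's datasets):

```python
# A classifier that predicts "non-match" for every pair.
n_nonmatch, n_match = 10000, 100       # 100:1 imbalance, as on the slide

errors = n_match                       # only the 100 true matches are wrong
zero_one_error = errors / (n_nonmatch + n_match)
recall = 0 / n_match                   # it never identifies a single match

print(zero_one_error)                  # ≈ 0.0099: looks excellent
print(recall)                          # 0.0: completely useless matcher
```

This is why the objective below constrains precision and maximizes recall instead of minimizing raw 0-1 error.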
Problem Two: Guarantees
Prior work on Entity Matching:
— No guarantees on Recall/Precision
— Even when there are guarantees: high time + label complexity
Can we adapt prior work on AL for the new objective:
— Maximize recall, such that precision > τ
With:
— Sub-linear label complexity
— Efficient time complexity
Overview of Our Approach
[Diagram: Recall Optimization with Precision Constraint → (Reduction: Convex-hull Search in Relaxed Lagrangian; this talk) → Weighted 0-1 Error → (Reduction: Rejection Sampling; in the paper) → Active Learning with 0-1 Error]
Objective
Given:
— Hypothesis class H
— Threshold τ in [0,1]
Objective: Find h in H that
— Maximizes recall(h)
— Such that: precision(h) ≥ τ
Equivalently:
— Maximize −falseneg(h)
— Such that: truepos(h) − ε·falsepos(h) ≥ 0
— Where ε = τ/(1−τ)
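The precision constraint rewrites to a linear constraint in the error counts: with ε = τ/(1−τ), precision(h) ≥ τ holds exactly when truepos(h) − ε·falsepos(h) ≥ 0. A quick numeric check (toy counts; Fractions are an implementation choice to keep the boundary case precision = τ exact):

```python
from fractions import Fraction

tau = Fraction(9, 10)                  # precision threshold
eps = tau / (1 - tau)                  # = 9

# (tp, fp) pairs on both sides of the threshold, including the boundary.
for tp, fp in [(90, 10), (89, 11), (9, 1), (1, 1)]:
    precision = Fraction(tp, tp + fp)
    assert (precision >= tau) == (tp - eps * fp >= 0)
```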
Unconstrained Objective
Current formulation:
— Maximize X(h) = −falseneg(h)
— Such that: Y(h) = truepos(h) − ε·falsepos(h) ≥ 0
If we introduce Lagrange multiplier λ:
— Maximize X(h) + λ·Y(h), which can be rewritten as:
— Minimize δ·falseneg(h) + (1 − δ)·falsepos(h), a weighted 0-1 objective
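The step from the Lagrangian to the weighted 0-1 form can be spelled out; the exact normalization of δ below is our reading, and the paper may normalize differently. With P the (fixed) number of true matches, truepos(h) = P − falseneg(h), so:

```latex
\begin{align*}
X(h) + \lambda\, Y(h)
  &= -\mathrm{fn}(h) + \lambda\bigl(P - \mathrm{fn}(h) - \varepsilon\,\mathrm{fp}(h)\bigr) \\
  &= \lambda P - (1+\lambda)\,\mathrm{fn}(h) - \lambda\varepsilon\,\mathrm{fp}(h).
\end{align*}
```

Since λP is constant in h, maximizing this is the same as minimizing (1+λ)·fn(h) + λε·fp(h); dividing through by (1+λ) + λε gives the stated weighted 0-1 objective with δ = (1+λ) / ((1+λ) + λε).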
Convex Hull of Classifiers
[Figure: classifiers plotted in the (X(h), Y(h)) plane, with their convex hull]
Maximize X(h), such that Y(h) ≥ 0: we want a classifier in the region Y(h) ≥ 0 with the largest X(h)
Convex hull: shape formed by joining the classifiers that strictly dominate the others
Can have an exponential number of classifiers inside the hull
Convex Hull of Classifiers
[Figure: convex hull with a supporting line touching the hull at vertices u and v]
Maximize X(h), such that Y(h) ≥ 0
For any λ > 0, there is a point / line with the largest value of X + λ·Y
Plug λ into the weighted objective to get the classifier h with the highest X(h) + λ·Y(h)
If λ = −1/slope of a hull edge, we get a classifier on that edge; otherwise we get a vertex classifier
Convex Hull of Classifiers
[Figure: worst case, the search returns a suboptimal hull vertex]
Maximize X(h), such that Y(h) ≥ 0
Naïve strategy: try all λ (equivalently, try all slopes): too long!
Instead, do binary search for λ
Problem: when to stop?
1) Bounds
2) Discretization of λ
Details in paper!
Algorithm I (Ours Weighted)
Given: AL black box C for weighted 0-1 error
Goal: the precision-constrained objective
Range of λ: [Λmin, Λmax]
— Don't enumerate all candidate λ: too expensive, O(n³)
— Instead, discretize using factor θ (see paper!)
Binary search over the discretized values
— Same complexity as binary search: O(log n)
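A sketch of the binary search in Algorithm I, with a hypothetical interface (not the paper's code): `weighted_oracle(lam)` stands in for the weighted 0-1 AL black box C and returns Y(h_λ) for the classifier it trains at that λ. A larger λ penalizes constraint violation more, so Y(h_λ) is nondecreasing in λ, which is what makes binary search applicable:

```python
def binary_search_lambda(weighted_oracle, lam_min, lam_max, n_steps=30):
    """Find (approximately) the smallest lambda whose classifier has Y(h) >= 0."""
    lo, hi = lam_min, lam_max
    for _ in range(n_steps):              # O(log n) oracle calls suffice
        mid = (lo + hi) / 2
        if weighted_oracle(mid) >= 0:     # feasible: try a smaller lambda
            hi = mid
        else:                             # infeasible: weight Y(h) more heavily
            lo = mid
    return hi

# Toy oracle whose feasibility threshold is lambda = 2 (purely illustrative).
lam = binary_search_lambda(lambda lam: lam - 2.0, 0.0, 8.0)
print(lam)                                # ≈ 2.0
```

The real algorithm searches over the θ-discretized grid of λ values rather than the continuum, which is what bounds the number of oracle calls by O(log n).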
Algorithm II (Weighted 0-1)
Given: AL black box B for 0-1 error
Goal: AL black box C for weighted 0-1 error
Use trick from Supervised Learning [Zadrozny03]:
— Reduces the cost-sensitive (weighted) objective to the binary 0-1 objective
— Reduction works by rejection sampling
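The rejection-sampling reduction of [Zadrozny03] is short enough to sketch (illustrative code with hypothetical names; the paper wraps this around an active learner rather than a fixed dataset): each example is kept with probability cost / max-cost, after which an ordinary unweighted 0-1 learner can be run on the survivors.

```python
import random

def rejection_sample(examples, costs, rng):
    """Keep each example with probability cost / max(costs)."""
    z = max(costs)
    return [ex for ex, c in zip(examples, costs) if rng.random() < c / z]

# Matches get weight delta = 0.9, non-matches 1 - delta = 0.1: after sampling,
# the 10:1 imbalanced stream below comes out roughly balanced.
delta = 0.9
data = [(f"pair{i}", i % 10 == 0) for i in range(10000)]   # 10% are matches
costs = [delta if is_match else 1 - delta for _, is_match in data]
sample = rejection_sample(data, costs, random.Random(0))

kept_matches = sum(1 for _, m in sample if m)
kept_non = len(sample) - kept_matches
```

Matches carry the maximum cost, so all of them survive; non-matches survive with probability (1−δ)/δ ≈ 0.11, which realizes the δ : (1−δ) weighting in expectation while only ever asking the 0-1 black box unweighted questions.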
Overview of Our Approach (recap)
[Diagram: Recall Optimization with Precision Constraint → (Reduction: Convex-hull Search in Relaxed Lagrangian, O(log n); this talk) → Weighted 0-1 Error → (Reduction: Rejection Sampling, O(log n); paper) → Active Learning with 0-1 Error]
Overall: Labels = O(log² n) · L(B), Time = O(log² n) · T(B)
Experiments
Four real-world data sets
All labels known: simulate active learning
Two approaches for AL with Precision Constraint:
— Ours, with Vowpal Wabbit as the 0-1 AL black box
— Monotone [Arasu11]: assumes monotonicity of similarity features; high computational + label complexity

Data Set                  | Size   | Ratio (+/−) | Features
Y! Local Businesses       | 3958   | 0.115       | 5
UCI Person Linkage        | 574913 | 0.004       | 9
DBLP-ACM Bibliography     | 494437 | 0.005       | 7
Scholar-DBLP Bibliography | 589326 | 0.009       | 7
Results I (Runtime with #Features)
[Chart: computational complexity on UCI Person; runtime in seconds vs. number of similarity features (5 to 9), for Ours and Monotone]
Results II (Quality & #Label Queries)
[Charts for Business and Person: F-1 vs. precision threshold (0.5 to 0.95), and label queries vs. threshold, for Ours and Monotone]
Results II (Contd.)
[Charts for DBLP-ACM and Scholar: F-1 vs. precision threshold (0.5 to 0.95), and label queries vs. threshold, for Ours and Monotone]
Results III (0-1 Active Learning)
[Chart: precision constraint satisfaction % of plain 0-1 AL on business, person, dblp-acm, and scholar-dblp, for τ = 0.7, 0.8, 0.9, 0.95]
Conclusion
Active learning for Entity Matching
Can use any 0-1 AL algorithm as a black box
Great real-world performance:
— Computationally efficient (600k examples in 25 seconds)
— Label efficient and better F-1 on four real-world tasks
Guaranteed:
— Precision of matcher
— Time and label complexity