Active Sampling for Entity Matching
Aditya Parameswaran, Stanford University
Jointly with: Kedar Bellare, Suresh Iyengar, Vibhor Rastogi
Yahoo! Research
Entity Matching
Goal: Find duplicate entities in a given data set
Fundamental data cleaning primitive; decades of prior work
Especially important at Yahoo! (and other web companies)
Homma’s Brown Rice Sushi, California Avenue, Palo Alto
Homma’s Sushi, Cal Ave, Palo Alto
Why is it important?
[Diagram: Websites, Databases, Content Providers (Yelp, Zagat, Foursquare) → Dirty Entities → Find Duplicates → Deduplicated Entities]
Applications: Business Listings in Y! Local, Celebrities in Y! Movies, Events in Y! Upcoming, …
How?
Reformulated Goal: Construct a high-quality classifier identifying duplicate entity pairs
Problem: How do we select training data?
Answer: Active Learning with Human Experts!
Reformulated Workflow
[Diagram: Websites, Databases, Content Providers → Dirty Entities → Our Technique → Deduplicated Entities]
Active Learning (AL) Primer
Properties of an AL algorithm:
— Label Complexity
— Time Complexity
— Consistency
Prior work:
— Uncertainty Sampling
— Query by Committee
— …
— Importance Weighted Active Learning (IWAL)
— Online IWAL without Constraints
  • Implemented in Vowpal Wabbit (VW)
  • 0-1 Metric
  • Time and label efficient
  • Provably consistent
  • Works even under noisy settings
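The flavor of importance-weighted active learning can be sketched in a few lines. This is an illustrative toy, not the IWAL algorithm from the talk or Vowpal Wabbit's implementation; the query rule and all names here are made up:

```python
import random

def iwal_round(score, x, p_min, rng, labeled):
    """Maybe query x's label; if queried, store x with importance weight 1/p."""
    margin = abs(score(x))                 # distance from the decision boundary
    p = max(p_min, 1.0 / (1.0 + margin))   # uncertain points queried more often
    if rng.random() < p:
        labeled.append((x, 1.0 / p))       # weight corrects the sampling bias
        return True
    return False

rng = random.Random(0)
labeled = []
score = lambda x: x - 0.5                  # toy linear scorer
for x in [0.5, 0.1, 5.0, 0.49]:
    iwal_round(score, x, 0.01, rng, labeled)
# x = 0.5 sits exactly on the boundary, so it is queried with probability 1
```

Consistency comes from the 1/p importance weights: easy examples are queried rarely, but count proportionally more when they are queried, so the weighted empirical error stays unbiased.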
Problem One: Imbalanced Data
Typical to have 100:1 imbalance even after blocking
[Chart: 100 non-matches for every 1 match]
Trivial solution: label all pairs as non-matches
• Precision 100% (vacuously: no matches predicted)
• 0-1 Error ≈ 0
• But 0% of the correct matches are identified
Solution: Metric from [Arasu11]:
— Maximize Recall (% of correct matches correctly identified)
— Such that Precision > τ
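The failure of plain 0-1 error on imbalanced data is easy to see with the 100:1 numbers above (toy counts, not from the paper's datasets):

```python
# A classifier that predicts "non-match" for every pair.
n_nonmatch, n_match = 10000, 100       # 100:1 imbalance, as on the slide

errors = n_match                       # only the 100 true matches are wrong
zero_one_error = errors / (n_nonmatch + n_match)
recall = 0 / n_match                   # it never identifies a single match

print(zero_one_error)                  # ≈ 0.0099: looks excellent
print(recall)                          # 0.0: completely useless matcher
```

This is why the objective below constrains precision and maximizes recall instead of minimizing raw 0-1 error.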
Problem Two: Guarantees
Prior work on Entity Matching:
— No guarantees on Recall/Precision
— Even when there are guarantees: high time + label complexity
Can we adapt prior work on AL for the new objective:
— Maximize recall, such that precision > τ
With:
— Sub-linear label complexity
— Efficient time complexity
Overview of Our Approach
[Diagram: Recall Optimization with Precision Constraint → (Reduction: Convex-hull Search in Relaxed Lagrangian; this talk) → Weighted 0-1 Error → (Reduction: Rejection Sampling; in the paper) → Active Learning with 0-1 Error]
Objective
Given:
— Hypothesis class H
— Threshold τ in [0,1]
Objective: Find h in H that
— Maximizes recall(h)
— Such that: precision(h) ≥ τ
Equivalently:
— Maximize −falseneg(h)
— Such that: truepos(h) − ε·falsepos(h) ≥ 0
— Where ε = τ/(1−τ)
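The precision constraint rewrites to a linear constraint in the error counts: with ε = τ/(1−τ), precision(h) ≥ τ holds exactly when truepos(h) − ε·falsepos(h) ≥ 0. A quick numeric check (toy counts; Fractions are an implementation choice to keep the boundary case precision = τ exact):

```python
from fractions import Fraction

tau = Fraction(9, 10)                  # precision threshold
eps = tau / (1 - tau)                  # = 9

# (tp, fp) pairs on both sides of the threshold, including the boundary.
for tp, fp in [(90, 10), (89, 11), (9, 1), (1, 1)]:
    precision = Fraction(tp, tp + fp)
    assert (precision >= tau) == (tp - eps * fp >= 0)
```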
Unconstrained Objective
Current formulation:
— Maximize X(h) = −falseneg(h)
— Such that: Y(h) = truepos(h) − ε·falsepos(h) ≥ 0
If we introduce Lagrange multiplier λ:
— Maximize X(h) + λ·Y(h), which can be rewritten as:
— Minimize δ·falseneg(h) + (1 − δ)·falsepos(h), a weighted 0-1 objective
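The step from the Lagrangian to the weighted 0-1 form can be spelled out; the exact normalization of δ below is our reading, and the paper may normalize differently. With P the (fixed) number of true matches, truepos(h) = P − falseneg(h), so:

```latex
\begin{align*}
X(h) + \lambda\, Y(h)
  &= -\mathrm{fn}(h) + \lambda\bigl(P - \mathrm{fn}(h) - \varepsilon\,\mathrm{fp}(h)\bigr) \\
  &= \lambda P - (1+\lambda)\,\mathrm{fn}(h) - \lambda\varepsilon\,\mathrm{fp}(h).
\end{align*}
```

Since λP is constant in h, maximizing this is the same as minimizing (1+λ)·fn(h) + λε·fp(h); dividing through by (1+λ) + λε gives the stated weighted 0-1 objective with δ = (1+λ) / ((1+λ) + λε).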
Convex Hull of Classifiers
[Figure: classifiers plotted in the (X(h), Y(h)) plane, with their convex hull]
Maximize X(h), such that Y(h) ≥ 0: we want a classifier in the region Y(h) ≥ 0 with the largest X(h)
Convex hull: shape formed by joining the classifiers that strictly dominate the others
Can have an exponential number of classifiers inside the hull
Convex Hull of Classifiers
[Figure: convex hull with a supporting line touching the hull at vertices u and v]
Maximize X(h), such that Y(h) ≥ 0
For any λ > 0, there is a point / line with the largest value of X + λ·Y
Plug λ into the weighted objective to get the classifier h with the highest X(h) + λ·Y(h)
If λ = −1/slope of a hull edge, we get a classifier on that edge; otherwise we get a vertex classifier
Convex Hull of Classifiers
[Figure: worst case, the search returns a suboptimal hull vertex]
Maximize X(h), such that Y(h) ≥ 0
Naïve strategy: try all λ (equivalently, try all slopes): too long!
Instead, do binary search for λ
Problem: when to stop?
1) Bounds
2) Discretization of λ
Details in paper!
Algorithm I (Ours Weighted)
Given: AL black box C for weighted 0-1 error
Goal: the precision-constrained objective
Range of λ: [Λmin, Λmax]
— Don't enumerate all candidate λ: too expensive, O(n³)
— Instead, discretize using factor θ (see paper!)
Binary search over the discretized values
— Same complexity as binary search: O(log n)
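A sketch of the binary search in Algorithm I, with a hypothetical interface (not the paper's code): `weighted_oracle(lam)` stands in for the weighted 0-1 AL black box C and returns Y(h_λ) for the classifier it trains at that λ. A larger λ penalizes constraint violation more, so Y(h_λ) is nondecreasing in λ, which is what makes binary search applicable:

```python
def binary_search_lambda(weighted_oracle, lam_min, lam_max, n_steps=30):
    """Find (approximately) the smallest lambda whose classifier has Y(h) >= 0."""
    lo, hi = lam_min, lam_max
    for _ in range(n_steps):              # O(log n) oracle calls suffice
        mid = (lo + hi) / 2
        if weighted_oracle(mid) >= 0:     # feasible: try a smaller lambda
            hi = mid
        else:                             # infeasible: weight Y(h) more heavily
            lo = mid
    return hi

# Toy oracle whose feasibility threshold is lambda = 2 (purely illustrative).
lam = binary_search_lambda(lambda lam: lam - 2.0, 0.0, 8.0)
print(lam)                                # ≈ 2.0
```

The real algorithm searches over the θ-discretized grid of λ values rather than the continuum, which is what bounds the number of oracle calls by O(log n).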
Algorithm II (Weighted 0-1)
Given: AL black box B for 0-1 error
Goal: AL black box C for weighted 0-1 error
Use trick from Supervised Learning [Zadrozny03]:
— Reduces the cost-sensitive (weighted) objective to the binary 0-1 objective
— Reduction works by rejection sampling
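The rejection-sampling reduction of [Zadrozny03] is short enough to sketch (illustrative code with hypothetical names; the paper wraps this around an active learner rather than a fixed dataset): each example is kept with probability cost / max-cost, after which an ordinary unweighted 0-1 learner can be run on the survivors.

```python
import random

def rejection_sample(examples, costs, rng):
    """Keep each example with probability cost / max(costs)."""
    z = max(costs)
    return [ex for ex, c in zip(examples, costs) if rng.random() < c / z]

# Matches get weight delta = 0.9, non-matches 1 - delta = 0.1: after sampling,
# the 10:1 imbalanced stream below comes out roughly balanced.
delta = 0.9
data = [(f"pair{i}", i % 10 == 0) for i in range(10000)]   # 10% are matches
costs = [delta if is_match else 1 - delta for _, is_match in data]
sample = rejection_sample(data, costs, random.Random(0))

kept_matches = sum(1 for _, m in sample if m)
kept_non = len(sample) - kept_matches
```

Matches carry the maximum cost, so all of them survive; non-matches survive with probability (1−δ)/δ ≈ 0.11, which realizes the δ : (1−δ) weighting in expectation while only ever asking the 0-1 black box unweighted questions.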
Overview of Our Approach (recap)
[Diagram: Recall Optimization with Precision Constraint → (Reduction: Convex-hull Search in Relaxed Lagrangian, O(log n); this talk) → Weighted 0-1 Error → (Reduction: Rejection Sampling, O(log n); paper) → Active Learning with 0-1 Error]
Overall: Labels = O(log² n) · L(B), Time = O(log² n) · T(B)
Experiments
Four real-world data sets
All labels known: simulate active learning
Two approaches for AL with Precision Constraint:
— Ours, with Vowpal Wabbit as the 0-1 AL black box
— Monotone [Arasu11]: assumes monotonicity of similarity features; high computational + label complexity

Data Set                  | Size   | Ratio (+/−) | Features
Y! Local Businesses       | 3958   | 0.115       | 5
UCI Person Linkage        | 574913 | 0.004       | 9
DBLP-ACM Bibliography     | 494437 | 0.005       | 7
Scholar-DBLP Bibliography | 589326 | 0.009       | 7
Results I (Runtime with #Features)
[Chart: computational complexity on UCI Person; runtime in seconds vs. number of similarity features (5 to 9), for Ours and Monotone]
Results II (Quality & #Label Queries)
[Charts for Business and Person: F-1 vs. precision threshold (0.5 to 0.95), and label queries vs. threshold, for Ours and Monotone]
Results II (Contd.)
[Charts for DBLP-ACM and Scholar: F-1 vs. precision threshold (0.5 to 0.95), and label queries vs. threshold, for Ours and Monotone]
Results III (0-1 Active Learning)
[Chart: precision constraint satisfaction % of plain 0-1 AL on business, person, dblp-acm, and scholar-dblp, for τ = 0.7, 0.8, 0.9, 0.95]
Conclusion
Active learning for Entity Matching
Can use any 0-1 AL algorithm as a black box
Great real-world performance:
— Computationally efficient (600k examples in 25 seconds)
— Label efficient and better F-1 on four real-world tasks
Guaranteed:
— Precision of matcher
— Time and label complexity