Deduplication
Bouvet BigOne, 2011-04-13
Lars Marius Garshol, <[email protected]>
http://twitter.com/larsga


DESCRIPTION

Describes the basic issues of detecting duplicates in messy data, and presents a proposed open source Java engine for solving the problem.

TRANSCRIPT

Page 1: Deduplication

Deduplication

Bouvet BigOne, 2011-04-13
Lars Marius Garshol, <[email protected]>
http://twitter.com/larsga

Page 2: Deduplication


Getting started

Baby steps

Page 3: Deduplication


The problem

The suppliers table

ID     | NAME              | ASSOC_NO    | ADDRESS         | ZIP
-------|-------------------|-------------|-----------------|-----
413341 | Acme Inc.         | 848233945   | Main St. 4      | 8231
51233  | Acme Incorporated | 848 233 945 | Main Street 4   | 8231
371341 | Acme              |             | Main S 4        | 8231
452349 | Acme (don't use)  | 84823945    | Main 4          | 8231
450235 | Acme Inc.         |             | North X Road 95 | 8227
239423 | Acme Corp.        | NO848233945 | North X Road 95 | 8227
...    | ...               | ...         | ...             | ...

Real-world data is very, very messy.

Page 4: Deduplication


The problem – take 2

[Diagram: ERP (Suppliers, Customers, Companies), CRM (Customers), Billing (Customers)]

Each of these has internal duplicates, plus duplicates across the tables. No easy fix.

Page 5: Deduplication


But ... what about identifiers?

• No, there are no system IDs across these tables

• Yes, there are outside identifiers
  – organization number for companies
  – personal number for people

• But, these are problematic
  – many records don't have them
  – they are inconsistently formatted
  – sometimes they are misspelled
  – some parts of huge organizations have the same org number, but need to be treated as separate

Page 6: Deduplication


First attempt at solution

• I wrote a simple Python script in ~2 hours

• It does the following:
  – load all records
  – normalize the data (see the sketch below)
    • strip extra whitespace, lowercase, remove letters from org codes...
  – use Bayesian inferencing for matching
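The transcript does not include the script itself; here is a minimal sketch of the normalization step, with hypothetical helper names and sample values taken from the suppliers table on page 3:

    import re

    # hypothetical helpers illustrating the normalization described above
    def normalize_name(name):
        # strip extra whitespace and lowercase
        return re.sub(r"\s+", " ", name.strip()).lower()

    def normalize_org_code(code):
        # remove letters and spaces from org codes, so "848 233 945"
        # and "NO848233945" both become "848233945"
        return re.sub(r"[^0-9]", "", code or "")

    print(normalize_name("  Acme   Incorporated "))  # -> "acme incorporated"
    print(normalize_org_code("NO848233945"))         # -> "848233945"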

Page 7: Deduplication


Configuration

Page 8: Deduplication


Matching

This comes out to a 0.93 probability of a match
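The slide's worked example is not reproduced in the transcript; as a rough illustration of how per-field probabilities can be combined under naïve Bayes (the exact combination rule used by the script is an assumption here, and the numbers are made up):

    # naive Bayes combination of per-field match probabilities: each field
    # comparison yields a probability that the two records match, and the
    # combined estimate is prod(p) / (prod(p) + prod(1 - p))
    def combine(probabilities):
        match = 1.0
        non_match = 1.0
        for p in probabilities:
            match *= p
            non_match *= 1.0 - p
        return match / (match + non_match)

    # e.g. strong name and address evidence, weaker org number evidence
    print(combine([0.9, 0.8, 0.6]))  # -> roughly 0.98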

Page 9: Deduplication


Problems

• The functions comparing values are still pretty primitive

• Performance is abysmal
  – 90 minutes to process 14,500 records
  – performance is O(n²) (illustrated below)
  – total number of records is ~2.5 million
  – time to process all records: 1 year 10 months

• Now what?
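For illustration, the quadratic behaviour comes from comparing every record with every other one, roughly like this (the records here are placeholder values):

    # every record is compared with every other record: n*(n-1)/2 comparisons,
    # so doubling the number of records quadruples the running time
    records = ["Acme Inc.", "Acme Incorporated", "Acme Corp.", "Acme (don't use)"]

    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            print("compare %s <-> %s" % (records[i], records[j]))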

Page 10: Deduplication


An idea

• Well, we don't necessarily need to compare each record with all others if we have indexes
  – we can look up the records which have matching values

• Use DBM for the indexes, for example
• Unfortunately, these only allow exact matching
• But, we can break up complex values into tokens, and index those
• Hang on, isn't this rather like a search engine?
• Bing!
• Let's try Lucene!

Page 11: Deduplication


Lucene-based prototype

• I whip out Jython and try it
• New script first builds Lucene index (see the sketch below)
• Then searches all records against the index
• Time to process 14,500 records: 1 minute
• Now we're talking...
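The Jython script itself is not included in the transcript; below is a minimal, hypothetical sketch of the approach against the Lucene 3.x API: index every record, then use each record's own values as a query to retrieve likely candidates. Field names and sample values mirror the suppliers table on page 3.

    from org.apache.lucene.store import RAMDirectory
    from org.apache.lucene.analysis.standard import StandardAnalyzer
    from org.apache.lucene.document import Document, Field
    from org.apache.lucene.index import IndexWriter, IndexWriterConfig
    from org.apache.lucene.queryParser import QueryParser
    from org.apache.lucene.search import IndexSearcher
    from org.apache.lucene.util import Version

    analyzer = StandardAnalyzer(Version.LUCENE_31)
    directory = RAMDirectory()

    # 1. build the index: one Lucene document per supplier record
    writer = IndexWriter(directory, IndexWriterConfig(Version.LUCENE_31, analyzer))
    doc = Document()
    doc.add(Field("ID", "413341", Field.Store.YES, Field.Index.NOT_ANALYZED))
    doc.add(Field("NAME", "Acme Inc.", Field.Store.YES, Field.Index.ANALYZED))
    doc.add(Field("ADDRESS", "Main St. 4", Field.Store.YES, Field.Index.ANALYZED))
    writer.addDocument(doc)
    writer.close()

    # 2. for each record, search the index for candidates; the detailed
    #    probabilistic comparison then runs only on the hits
    searcher = IndexSearcher(directory)
    query = QueryParser(Version.LUCENE_31, "NAME", analyzer).parse("acme incorporated")
    for hit in searcher.search(query, 10).scoreDocs:
        print(searcher.doc(hit.doc).get("ID"))
    searcher.close()

Only the candidates returned by the index lookup are compared in detail, which is what brings the processing time down from 90 minutes to about a minute.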

Page 12: Deduplication


Reality sets in

A splash of cold water to the face

Page 13: Deduplication


Prior art

• It turns out people have been doing this before

• They call it
  – entity resolution
  – identity resolution
  – merge/purge
  – deduplication
  – record linkage
  – ...

• This makes Googling for information an absolute nightmare

Page 14: Deduplication


Existing tools

• Several commercial tools
  – they look big and expensive: we skip those

• Stian found some open source tools
  – Oyster: slow, bad architecture, primitive matching
  – SERF: slow, bad architecture

• I've since found more, but was not impressed

• So, it seems we still have to do it ourselves

Page 15: Deduplication


Finds in the research literature

• General
  – problem is well-understood
  – "naïve Bayes" is naïve
  – lots of interesting work on value comparisons
  – performance problem 'solved' with "blocking" (sketched below)
    • build a key from parts of the data
    • sort records by key
    • compare each record with m nearest neighbours
    • performance goes from O(n²) to O(n m)
  – parallel processing widely used

• Swoosh paper
  – compare and merge should have ICAR¹ properties
  – optimal algorithms for general merge found
  – run-time for 14,000 records ~1.5 hours...

¹ Idempotence, commutativity, associativity, reflexivity
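A minimal sketch of the blocking (sorted-neighbourhood) idea described above; the choice of key (zip code plus a name prefix) and the window size are assumptions for illustration only:

    # sorted-neighbourhood blocking: build a cheap key from parts of each record,
    # sort by it, and compare each record only with its m nearest neighbours,
    # cutting the work from O(n^2) to roughly O(n*m)
    records = [
        {"id": "413341", "name": "Acme Inc.",         "zip": "8231"},
        {"id": "51233",  "name": "Acme Incorporated", "zip": "8231"},
        {"id": "450235", "name": "Acme Inc.",         "zip": "8227"},
    ]
    m = 5  # window size

    def blocking_key(record):
        # hypothetical key: zip code plus the first four letters of the name
        return record["zip"] + record["name"].lower()[:4]

    records.sort(key=blocking_key)
    for i in range(len(records)):
        for j in range(i + 1, min(len(records), i + 1 + m)):
            print("compare %s with %s" % (records[i]["id"], records[j]["id"]))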

Page 16: Deduplication


Good research papers

• Threat and Fraud Intelligence, Las Vegas Style, Jeff Jonas
  – http://jeffjonas.typepad.com/IEEE.Identity.Resolution.pdf

• Real-world data is dirty: Data Cleansing and the Merge/Purge Problem, Hernandez & Stolfo
  – http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.30.3496&rep=rep1&type=pdf

• Swoosh: a generic approach to entity resolution, Benjelloun, Garcia-Molina et al
  – http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.122.5696&rep=rep1&type=pdf

Page 17: Deduplication


DUplicate KillEr

Duke

Page 18: Deduplication


Java deduplication engine

• Work in progress
  – so far spent only ~20 hours on it
  – only command-line batch client built so far

• Based on Lucene 3.1
• Open source (on Google Code)
  – http://code.google.com/p/duke/

• Blazingly fast
  – 960,000 records in 11 minutes on this laptop

Page 19: Deduplication


Architecture

[Diagram: SDshare client (data in) and SDshare server (equivalences out) on top of an RDF frontend and the Datastore API, with the Duke engine backed by Lucene and an H2 database]

Page 20: Deduplication


Architecture #2

[Diagram: command-line client (data in, link file out) on top of a CSV frontend and the Datastore API, with the Duke engine backed by Lucene]

More frontends:
• JDBC
• SPARQL
• RDF file
• ...

Page 21: Deduplication


Architecture #3

[Diagram: REST interface (data in, equivalences out) on top of an X frontend and the Datastore API, with the Duke engine backed by Lucene and an H2 database]

Page 22: Deduplication


Weaknesses

• Tied to naïve Bayes model
  – research shows more sophisticated models perform better
  – non-trivial to reconcile these with index lookup

• Value comparison sophistication limited
  – Lucene does support Levenshtein queries (example below)
  – (these are slow, though; will be fast in 4.x)
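For reference, the Levenshtein support mentioned above is exposed in Lucene 3.x as FuzzyQuery; a minimal Jython sketch, where the field name and similarity threshold are just examples:

    from org.apache.lucene.index import Term
    from org.apache.lucene.search import FuzzyQuery

    # FuzzyQuery matches terms within a Levenshtein edit distance of the given
    # term; in Lucene 3.x the second argument is the minimum similarity (0.0-1.0)
    query = FuzzyQuery(Term("NAME", "acme"), 0.7)
    print(query)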

Page 23: Deduplication


Comments/questions?