Deduplication
DESCRIPTION
Describes the basic issues of detecting duplicates in messy data and a proposed open source Java engine for solving them.

TRANSCRIPT
1
Deduplication
Bouvet BigOne, 2011-04-13
Lars Marius Garshol, <[email protected]>
http://twitter.com/larsga
2
Getting started
Baby steps
3
The problem
The suppliers table
ID      NAME               ASSOC_NO     ADDRESS          ZIP
413341  Acme Inc.          848233945    Main St. 4       8231
51233   Acme Incorporated  848 233 945  Main Street 4    8231
371341  Acme                            Main S 4         8231
452349  Acme (don't use)   84823945     Main 4           8231
450235  Acme Inc.                       North X Road 95  8227
239423  Acme Corp.         NO848233945  North X Road 95  8227
...     ...                ...          ...              ...

Real-world data is very, very messy
4
The problem – take 2
[Diagram: three systems, each with its own tables: an ERP with Suppliers and Customers, a CRM with Companies and Customers, and a Billing system with Customers.]
Each of these has internal duplicates, plus duplicates across the tables. No easy fix.
5
But ... what about identifiers?
• No, there are no system IDs across these tables
• Yes, there are outside identifiers
  – organization number for companies
  – personal number for people
• But, these are problematic
  – many records don't have them
  – they are inconsistently formatted
  – sometimes they are misspelled
  – some parts of huge organizations have the same org number, but need to be treated as separate
6
First attempt at solution
• I wrote a simple Python script in ~2 hours
• It does the following:
  – load all records
  – normalize the data
    • strip extra whitespace, lowercase, remove letters from org codes...
  – use Bayesian inference for matching
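As a rough illustration of the normalization step, a sketch along these lines (the rules shown are assumptions; the original Python script is not reproduced here):

    // Hedged sketch of the normalization step described above; the exact rules
    // in the original Python script are assumptions here.
    public class Normalize {
        // collapse runs of whitespace and lowercase free-text values
        static String normalizeText(String value) {
            return value.trim().replaceAll("\\s+", " ").toLowerCase();
        }

        // strip everything but digits from an org code, e.g. "NO848 233 945" -> "848233945"
        static String normalizeOrgCode(String value) {
            return value.replaceAll("[^0-9]", "");
        }

        public static void main(String[] args) {
            System.out.println(normalizeText("  Main   Street 4 "));  // main street 4
            System.out.println(normalizeOrgCode("NO848 233 945"));    // 848233945
        }
    }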
7
Configuration
8
Matching
This comes out to a combined probability of 0.93
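The per-field numbers behind this figure are not in the transcript, but the naive Bayes combination itself can be reconstructed: each field comparison yields a probability, and Bayes' theorem (assuming the fields are independent) combines them. A sketch with invented field values that happen to combine to roughly 0.93:

    // Hedged reconstruction of the naive Bayes combination step: per-field
    // match probabilities are combined assuming independence. The field values
    // below are invented for illustration.
    public class BayesCombine {
        static double combine(double... fieldProbs) {
            double match = 1.0, nonMatch = 1.0;
            for (double p : fieldProbs) {
                match *= p;        // evidence for "same record"
                nonMatch *= 1 - p; // evidence for "different records"
            }
            return match / (match + nonMatch);
        }

        public static void main(String[] args) {
            // e.g. name 0.8, address 0.7, zip 0.6 (hypothetical)
            System.out.println(combine(0.8, 0.7, 0.6)); // ~0.93
        }
    }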
9
Problems
• The functions comparing values are still pretty primitive
• Performance is abysmal
  – 90 minutes to process 14,500 records
  – performance is O(n²)
  – total number of records is ~2.5 million
  – time to process all records: 1 year 10 months
• Now what?
10
An idea
• Well, we don't necessarily need to compare each record with all others if we have indexes
  – we can look up the records which have matching values
• Use DBM for the indexes, for example
• Unfortunately, these only allow exact matching
• But, we can break up complex values into tokens, and index those
• Hang on, isn't this rather like a search engine?
• Bing!
• Let's try Lucene!
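A minimal sketch of the token-index idea, assuming a simple in-memory map in place of DBM (names are invented):

    import java.util.*;

    // Hedged sketch of the token-index idea: break values into tokens and map
    // each token to the ids of records containing it, so candidate lookup
    // replaces all-pairs comparison. (The talk used DBM files for the indexes;
    // a HashMap stands in for one here.)
    public class TokenIndex {
        private final Map<String, List<Integer>> index = new HashMap<>();

        void add(int recordId, String value) {
            for (String token : value.toLowerCase().split("\\s+"))
                index.computeIfAbsent(token, t -> new ArrayList<>()).add(recordId);
        }

        // all records sharing at least one token with the given value
        Set<Integer> candidates(String value) {
            Set<Integer> result = new HashSet<>();
            for (String token : value.toLowerCase().split("\\s+"))
                result.addAll(index.getOrDefault(token, Collections.<Integer>emptyList()));
            return result;
        }

        public static void main(String[] args) {
            TokenIndex idx = new TokenIndex();
            idx.add(413341, "Main St. 4");
            idx.add(51233, "Main Street 4");
            System.out.println(idx.candidates("Main Street 4")); // both ids
        }
    }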
11
Lucene-based prototype
• I whip out Jython and try it
• New script first builds Lucene index
• Then searches all records against the index
• Time to process 14,500 records: 1 minute
• Now we're talking...
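A hedged Java sketch of the prototype's index-then-search approach against the Lucene 3.x API (the actual prototype was Jython; field names and the query strategy here are assumptions):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    // Hedged sketch: index every record, then query the index with each
    // record's own values to find match candidates. Field names ("NAME",
    // "ADDRESS") are assumptions, not the original script's schema.
    public class LucenePrototype {
        public static void main(String[] args) throws Exception {
            Directory dir = new RAMDirectory();
            StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_31);

            // first pass: build the index, one Document per record
            IndexWriter writer =
                new IndexWriter(dir, new IndexWriterConfig(Version.LUCENE_31, analyzer));
            Document doc = new Document();
            doc.add(new Field("NAME", "Acme Inc.", Field.Store.YES, Field.Index.ANALYZED));
            doc.add(new Field("ADDRESS", "Main St. 4", Field.Store.YES, Field.Index.ANALYZED));
            writer.addDocument(doc);
            writer.close();

            // second pass: search each record's values against the index
            IndexSearcher searcher = new IndexSearcher(dir);
            QueryParser parser = new QueryParser(Version.LUCENE_31, "NAME", analyzer);
            ScoreDoc[] hits = searcher.search(parser.parse("acme incorporated"), 10).scoreDocs;
            for (ScoreDoc hit : hits)
                System.out.println(searcher.doc(hit.doc).get("NAME"));
            searcher.close();
        }
    }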
12
Reality sets in
A splash of cold water to the face
13
Prior art
• It turns out people have been doing this before
• They call it
  – entity resolution
  – identity resolution
  – merge/purge
  – deduplication
  – record linkage
  – ...
• This makes Googling for information an absolute nightmare
14
Existing tools
• Several commercial tools
  – they look big and expensive: we skip those
• Stian found some open source tools
  – Oyster: slow, bad architecture, primitive matching
  – SERF: slow, bad architecture
• I’ve later found more, but was not impressed
• So, it seems we still have to do it ourselves
15
Finds in the research literature
• General
  – problem is well-understood
  – "naïve Bayes" is naïve
  – lots of interesting work on value comparisons
  – performance problem 'solved' with "blocking" (see the sketch below)
    • build a key from parts of the data
    • sort records by key
    • compare each record with m nearest neighbours
    • performance goes from O(n²) to O(n·m)
  – parallel processing widely used
• Swoosh paper
  – compare and merge should have ICAR¹ properties
  – optimal algorithms for general merge found
  – run-time for 14,000 records ~1.5 hours...

¹ Idempotence, commutativity, associativity, reflexivity
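The blocking sketch referenced above: a minimal illustration of the sorted-neighbourhood technique, with a hypothetical record type and key rule (not from the talk):

    import java.util.*;

    // Hedged sketch of blocking / sorted neighbourhood: build a key from parts
    // of the data, sort by it, and compare each record only with its m nearest
    // neighbours, so the work drops from O(n^2) to O(n*m). The Rec class and
    // the key rule are hypothetical stand-ins.
    public class Blocking {
        static class Rec {
            final String id, name, zip;
            Rec(String id, String name, String zip) {
                this.id = id; this.name = name; this.zip = zip;
            }
        }

        // blocking key: zip code plus first three letters of the name (assumption)
        static String key(Rec r) {
            String name = r.name.toLowerCase();
            return r.zip + name.substring(0, Math.min(3, name.length()));
        }

        public static void main(String[] args) {
            List<Rec> records = new ArrayList<>(Arrays.asList(
                new Rec("413341", "Acme Inc.", "8231"),
                new Rec("51233", "Acme Incorporated", "8231"),
                new Rec("450235", "Acme Inc.", "8227")));
            records.sort(Comparator.comparing(Blocking::key));

            int m = 10; // window size
            for (int i = 0; i < records.size(); i++)
                for (int j = i + 1; j <= Math.min(i + m, records.size() - 1); j++)
                    // each neighbour pair becomes a candidate for full comparison
                    System.out.println(records.get(i).id + " vs " + records.get(j).id);
        }
    }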
16
Good research papers
• Threat and Fraud Intelligence, Las Vegas Style, Jeff Jonas
  – http://jeffjonas.typepad.com/IEEE.Identity.Resolution.pdf
• Real-world data is dirty: Data Cleansing and the Merge/Purge Problem, Hernandez & Stolfo
  – http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.30.3496&rep=rep1&type=pdf
• Swoosh: a generic approach to entity resolution, Benjelloun, Garcia-Molina et al.
  – http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.122.5696&rep=rep1&type=pdf
17
DUplicate KillEr
Duke
18
Java deduplication engine
• Work in progress
  – so far spent only ~20 hours on it
  – only command-line batch client built so far
• Based on Lucene 3.1
• Open source (on Google Code)
  – http://code.google.com/p/duke/
• Blazingly fast
  – 960,000 records in 11 minutes on this laptop
19
Architecture
[Diagram: an SDshare client brings data in and an SDshare server sends equivalences out; an RDF frontend sits on the Datastore API, which fronts the Duke engine, backed by Lucene and an H2 database.]
20
Architecture #2
[Diagram: a command-line client takes data in and writes a link file out; a CSV frontend sits on the Datastore API, which fronts the Duke engine, backed by Lucene. More frontends: JDBC, SPARQL, RDF file, ...]
21
Architecture #3
[Diagram: a REST interface takes data in and serves equivalences out; some frontend "X" sits on the Datastore API, which fronts the Duke engine, backed by Lucene and an H2 database.]
22
Weaknesses
• Tied to naïve Bayes model
  – research shows more sophisticated models perform better
  – non-trivial to reconcile these with index lookup
• Value comparison sophistication limited
  – Lucene does support Levenshtein queries
  – (these are slow, though; will be fast in 4.x)
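For reference, a Levenshtein-based Lucene 3.x query looks roughly like this (the field name and similarity threshold are invented):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.FuzzyQuery;
    import org.apache.lucene.search.Query;

    // Hedged example of the Levenshtein queries mentioned above. In Lucene 3.x
    // FuzzyQuery enumerates index terms and edit-distance-checks each one,
    // which is why it is slow; 4.x switched to automaton-based matching. The
    // field name "NAME" and the 0.7 minimum similarity are invented here.
    public class FuzzyExample {
        static Query fuzzyNameQuery(String value) {
            return new FuzzyQuery(new Term("NAME", value), 0.7f);
        }
    }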
23
Comments/questions?