DESCRIPTION

Describes the basic issues in detecting duplicates in messy data, and presents a proposed open source Java engine for solving the problem.

TRANSCRIPT

1

Deduplication

Bouvet BigOne, 2011-04-13
Lars Marius Garshol, <larsga@bouvet.no>
http://twitter.com/larsga

2

Getting started

Baby steps

3

The problem

The suppliers table

ID      NAME               ASSOC_NO     ADDRESS          ZIP
413341  Acme Inc.          848233945    Main St. 4       8231
51233   Acme Incorporated  848 233 945  Main Street 4    8231
371341  Acme                            Main S 4         8231
452349  Acme (don't use)   84823945     Main 4           8231
450235  Acme Inc.                       North X Road 95  8227
239423  Acme Corp.         NO848233945  North X Road 95  8227
...     ...                ...          ...              ...

Real-world data is very, very messy

4

The problem – take 2

[Diagram: three systems with overlapping tables: ERP (Suppliers, Customers, Companies), CRM (Customers), Billing (Customers)]

Each of these has internal duplicates, plus duplicates across the tables. No easy fix.

5

But ... what about identifiers?

• No, there are no system IDs across these tables
• Yes, there are outside identifiers
  – organization number for companies
  – personal number for people
• But these are problematic
  – many records don't have them
  – they are inconsistently formatted
  – sometimes they are misspelled
  – some parts of huge organizations have the same org number, but need to be treated as separate

6

First attempt at solution

• I wrote a simple Python script in ~2 hours
• It does the following:
  – load all records
  – normalize the data (sketched below)
    • strip extra whitespace, lowercase, remove letters from org codes, ...
  – use Bayesian inferencing for matching
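As a rough illustration of the normalization step: the rules and field names below are assumptions based on the bullets above, not the original Python script, and Java is used for consistency with the rest of the talk.

// Sketch of the normalization described above: collapse whitespace,
// lowercase, and strip non-digits from org numbers. Hypothetical code,
// not the original script.
public class Normalizer {

  public static String normalizeName(String name) {
    if (name == null)
      return "";
    // "Acme  Inc. " and "ACME Inc." both become "acme inc."
    return name.trim().replaceAll("\\s+", " ").toLowerCase();
  }

  public static String normalizeOrgNumber(String orgno) {
    if (orgno == null)
      return "";
    // "NO848233945" and "848 233 945" both become "848233945"
    return orgno.replaceAll("[^0-9]", "");
  }
}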

7

Configuration

8

Matching

This sums out to 0.93 probability
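The transcript does not preserve the slide's per-field numbers, but one common way to combine per-field match probabilities under a naive Bayes assumption, which could produce a figure like 0.93, is:

// Combine per-field probabilities into one match probability
// (two-class naive Bayes posterior with a uniform prior).
// e.g. combine(new double[]{0.9, 0.8, 0.6}) comes out around 0.98;
// the slide's actual field probabilities are not in the transcript.
public static double combine(double[] probs) {
  double match = 1.0, nonmatch = 1.0;
  for (double p : probs) {
    match *= p;        // evidence for a match
    nonmatch *= 1 - p; // evidence against a match
  }
  return match / (match + nonmatch);
}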

9

Problems

• The functions comparing values are still pretty primitive
• Performance is abysmal
  – 90 minutes to process 14,500 records
  – performance is O(n²)
  – total number of records is ~2.5 million
  – time to process all records: 1 year 10 months
• Now what?

10

An idea

• Well, we don't necessarily need to compare each record with all others if we have indexes
  – we can look up the records which have matching values
• Use DBM for the indexes, for example
• Unfortunately, these only allow exact matching
• But we can break up complex values into tokens, and index those
• Hang on, isn't this rather like a search engine?
• Bing!
• Let's try Lucene!
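A rough sketch of this idea against the Lucene 3.x API; the field name and sample values are made up, and a real run would index all records before querying.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class LuceneCandidateLookup {
  public static void main(String[] args) throws Exception {
    Directory dir = new RAMDirectory();
    StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_31);
    IndexWriter writer = new IndexWriter(dir,
        new IndexWriterConfig(Version.LUCENE_31, analyzer));

    // index a record: the analyzer tokenizes "Acme Inc." into terms,
    // so later searches match on shared tokens rather than exact values
    Document doc = new Document();
    doc.add(new Field("NAME", "Acme Inc.",
                      Field.Store.YES, Field.Index.ANALYZED));
    writer.addDocument(doc);
    writer.close();

    // query with another record's name: only records sharing at least
    // one token come back, so we compare just these candidates
    IndexSearcher searcher = new IndexSearcher(dir);
    QueryParser parser = new QueryParser(Version.LUCENE_31, "NAME", analyzer);
    // ("acme~" would instead be a Levenshtein fuzzy query, slow in 3.x)
    ScoreDoc[] hits = searcher.search(
        parser.parse("acme incorporated"), 10).scoreDocs;
    for (ScoreDoc hit : hits)
      System.out.println(searcher.doc(hit.doc).get("NAME"));
    searcher.close();
  }
}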

11

Lucene-based prototype

• I whip out Jython and try it
• New script first builds a Lucene index
• Then searches all records against the index
• Time to process 14,500 records: 1 minute
• Now we're talking...

12

Reality sets in

A splash of cold water to the face

13

Prior art

• It turns out people have been doing this before
• They call it
  – entity resolution
  – identity resolution
  – merge/purge
  – deduplication
  – record linkage
  – ...
• This makes Googling for information an absolute nightmare

14

Existing tools

• Several commercial tools
  – they look big and expensive: we skip those
• Stian found some open source tools
  – Oyster: slow, bad architecture, primitive matching
  – SERF: slow, bad architecture
• I've later found more, but was not impressed
• So, it seems we still have to do it ourselves

15

Finds in the research literature

• General
  – problem is well-understood
  – "naïve Bayes" is naïve
  – lots of interesting work on value comparisons
  – performance problem 'solved' with "blocking" (sketched below)
    • build a key from parts of the data
    • sort records by key
    • compare each record with m nearest neighbours
    • performance goes from O(n²) to O(n·m)
  – parallel processing widely used
• Swoosh paper
  – compare and merge should have ICAR¹ properties
  – optimal algorithms for general merge found
  – run-time for 14,000 records ~1.5 hours...

¹ Idempotence, commutativity, associativity, reflexivity
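A minimal sketch of the sorted-neighbourhood flavour of blocking described above; the key function here is an arbitrary made-up choice, and real keys need domain knowledge.

import java.util.Comparator;
import java.util.List;

// Sorted-neighbourhood blocking: build a key from parts of the data,
// sort by it, and compare each record only with its m nearest
// neighbours, turning O(n^2) comparisons into O(n*m).
public class Blocking {
  static class Rec {
    final String id, name, zip;
    Rec(String id, String name, String zip) {
      this.id = id; this.name = name; this.zip = zip;
    }
    // hypothetical blocking key: zip code plus first three name letters
    String key() {
      String n = name.toLowerCase();
      return zip + n.substring(0, Math.min(3, n.length()));
    }
  }

  static void findDuplicates(List<Rec> records, int m) {
    records.sort(Comparator.comparing(Rec::key));
    for (int i = 0; i < records.size(); i++)
      for (int j = i + 1; j < Math.min(i + 1 + m, records.size()); j++)
        compare(records.get(i), records.get(j)); // O(n*m) comparisons
  }

  static void compare(Rec a, Rec b) {
    // the full (expensive) field-by-field comparison would run here
    System.out.println(a.id + " vs " + b.id);
  }
}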

16

Good research papers

• Threat and Fraud Intelligence, Las Vegas Style, Jeff Jonas
  – http://jeffjonas.typepad.com/IEEE.Identity.Resolution.pdf
• Real-world data is dirty: Data Cleansing and the Merge/Purge Problem, Hernandez & Stolfo
  – http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.30.3496&rep=rep1&type=pdf
• Swoosh: a generic approach to entity resolution, Benjelloun, Garcia-Molina et al.
  – http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.122.5696&rep=rep1&type=pdf

17

DUplicate KillEr

Duke

18

Java deduplication engine

• Work in progress
  – so far spent only ~20 hours on it
  – only command-line batch client built so far
• Based on Lucene 3.1
• Open source (on Google Code)
  – http://code.google.com/p/duke/
• Blazingly fast
  – 960,000 records in 11 minutes on this laptop

19

Architecture

[Diagram: an SDshare client feeds data in to an RDF frontend; the frontend talks to the Datastore API, which drives the Duke engine on top of Lucene and an H2 database; equivalences go out through an SDshare server]

20

Architecture #2

[Diagram: a command-line client takes data in and writes a link file out; a CSV frontend talks to the Datastore API, which drives the Duke engine on top of Lucene]

More frontends:
• JDBC
• SPARQL
• RDF file
• ...
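The transcript does not show Duke's actual interfaces, so the following is a purely hypothetical sketch of the frontend/engine split the diagrams imply; none of these names are confirmed by the talk.

// Hypothetical sketch only: NOT Duke's real API.
interface Record {
  // a record is just a bag of named property values
  java.util.Collection<String> getProperties();
  String getValue(String property);
}

// A frontend (CSV, JDBC, SPARQL, RDF file, ...) only has to turn its
// source into a stream of Records for the engine to index and match.
interface DataSource {
  Iterable<Record> getRecords();
}

// The engine reports matches to a listener, so the same core can serve
// the command-line, SDshare and REST fronts.
interface MatchListener {
  void matches(Record r1, Record r2, double confidence);
}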

21

Architecture #3

[Diagram: a REST interface takes data in and sends equivalences out; an X frontend talks to the Datastore API, which drives the Duke engine on top of Lucene and an H2 database]

22

Weaknesses

• Tied to the naïve Bayes model
  – research shows more sophisticated models perform better
  – non-trivial to reconcile these with index lookup
• Value comparison sophistication limited
  – Lucene does support Levenshtein queries
  – (these are slow, though; will be fast in 4.x)

23

Comments/questions?
