Deduplication
DESCRIPTION
Describes the basic issues of detecting duplicates in messy data and a proposed open source Java engine for solving them.

TRANSCRIPT
1
Deduplication
Bouvet BigOne, 2011-04-13
Lars Marius Garshol, <[email protected]>
http://twitter.com/larsga
2
Getting started
Baby steps
3
The problem
The suppliers table
ID      NAME               ASSOC_NO     ADDRESS          ZIP
413341  Acme Inc.          848233945    Main St. 4       8231
51233   Acme Incorporated  848 233 945  Main Street 4    8231
371341  Acme                            Main S 4         8231
452349  Acme (don't use)   84823945     Main 4           8231
450235  Acme Inc.                       North X Road 95  8227
239423  Acme Corp.         NO848233945  North X Road 95  8227
...     ...                ...          ...              ...

Real-world data is very, very messy
4
The problem – take 2
[Diagram: three systems, each with its own tables: an ERP with Suppliers and Customers, a CRM with Companies and Customers, and a Billing system with Customers.]
Each of these has internal duplicates, plus duplicates across the tables. No easy fix.
5
But ... what about identifiers?
• No, there are no system IDs across these tables
• Yes, there are outside identifiers
  – organization number for companies
  – personal number for people
• But, these are problematic
  – many records don't have them
  – they are inconsistently formatted
  – sometimes they are misspelled
  – some parts of huge organizations have the same org number, but need to be treated as separate
6
First attempt at solution
• I wrote a simple Python script in ~2 hours
• It does the following:
  – load all records
  – normalize the data
    • strip extra whitespace, lowercase, remove letters from org codes...
  – use Bayesian inference for matching
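As a rough illustration of the normalization step, a sketch along these lines (the rules shown are assumptions; the original Python script is not reproduced here):

    // Hedged sketch of the normalization step described above; the exact rules
    // in the original Python script are assumptions here.
    public class Normalize {
        // collapse runs of whitespace and lowercase free-text values
        static String normalizeText(String value) {
            return value.trim().replaceAll("\\s+", " ").toLowerCase();
        }

        // strip everything but digits from an org code, e.g. "NO848 233 945" -> "848233945"
        static String normalizeOrgCode(String value) {
            return value.replaceAll("[^0-9]", "");
        }

        public static void main(String[] args) {
            System.out.println(normalizeText("  Main   Street 4 "));  // main street 4
            System.out.println(normalizeOrgCode("NO848 233 945"));    // 848233945
        }
    }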
7
Configuration
8
Matching
This comes out to a combined probability of 0.93
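The per-field numbers behind this figure are not in the transcript, but the naive Bayes combination itself can be reconstructed: each field comparison yields a probability, and Bayes' theorem (assuming the fields are independent) combines them. A sketch with invented field values that happen to combine to roughly 0.93:

    // Hedged reconstruction of the naive Bayes combination step: per-field
    // match probabilities are combined assuming independence. The field values
    // below are invented for illustration.
    public class BayesCombine {
        static double combine(double... fieldProbs) {
            double match = 1.0, nonMatch = 1.0;
            for (double p : fieldProbs) {
                match *= p;        // evidence for "same record"
                nonMatch *= 1 - p; // evidence for "different records"
            }
            return match / (match + nonMatch);
        }

        public static void main(String[] args) {
            // e.g. name 0.8, address 0.7, zip 0.6 (hypothetical)
            System.out.println(combine(0.8, 0.7, 0.6)); // ~0.93
        }
    }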
9
Problems
• The functions comparing values are still pretty primitive
• Performance is abysmal
  – 90 minutes to process 14,500 records
  – performance is O(n²)
  – total number of records is ~2.5 million
  – time to process all records: 1 year 10 months
• Now what?
10
An idea
• Well, we don't necessarily need to compare each record with all others if we have indexes
  – we can look up the records which have matching values
• Use DBM for the indexes, for example
• Unfortunately, these only allow exact matching
• But, we can break up complex values into tokens, and index those
• Hang on, isn't this rather like a search engine?
• Bing!
• Let's try Lucene!
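A minimal sketch of the token-index idea, assuming a simple in-memory map in place of DBM (names are invented):

    import java.util.*;

    // Hedged sketch of the token-index idea: break values into tokens and map
    // each token to the ids of records containing it, so candidate lookup
    // replaces all-pairs comparison. (The talk used DBM files for the indexes;
    // a HashMap stands in for one here.)
    public class TokenIndex {
        private final Map<String, List<Integer>> index = new HashMap<>();

        void add(int recordId, String value) {
            for (String token : value.toLowerCase().split("\\s+"))
                index.computeIfAbsent(token, t -> new ArrayList<>()).add(recordId);
        }

        // all records sharing at least one token with the given value
        Set<Integer> candidates(String value) {
            Set<Integer> result = new HashSet<>();
            for (String token : value.toLowerCase().split("\\s+"))
                result.addAll(index.getOrDefault(token, Collections.<Integer>emptyList()));
            return result;
        }

        public static void main(String[] args) {
            TokenIndex idx = new TokenIndex();
            idx.add(413341, "Main St. 4");
            idx.add(51233, "Main Street 4");
            System.out.println(idx.candidates("Main Street 4")); // both ids
        }
    }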
11
Lucene-based prototype
• I whip out Jython and try it
• New script first builds Lucene index
• Then searches all records against the index
• Time to process 14,500 records: 1 minute
• Now we're talking...
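A hedged Java sketch of the prototype's index-then-search approach against the Lucene 3.x API (the actual prototype was Jython; field names and the query strategy here are assumptions):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    // Hedged sketch: index every record, then query the index with each
    // record's own values to find match candidates. Field names ("NAME",
    // "ADDRESS") are assumptions, not the original script's schema.
    public class LucenePrototype {
        public static void main(String[] args) throws Exception {
            Directory dir = new RAMDirectory();
            StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_31);

            // first pass: build the index, one Document per record
            IndexWriter writer =
                new IndexWriter(dir, new IndexWriterConfig(Version.LUCENE_31, analyzer));
            Document doc = new Document();
            doc.add(new Field("NAME", "Acme Inc.", Field.Store.YES, Field.Index.ANALYZED));
            doc.add(new Field("ADDRESS", "Main St. 4", Field.Store.YES, Field.Index.ANALYZED));
            writer.addDocument(doc);
            writer.close();

            // second pass: search each record's values against the index
            IndexSearcher searcher = new IndexSearcher(dir);
            QueryParser parser = new QueryParser(Version.LUCENE_31, "NAME", analyzer);
            ScoreDoc[] hits = searcher.search(parser.parse("acme incorporated"), 10).scoreDocs;
            for (ScoreDoc hit : hits)
                System.out.println(searcher.doc(hit.doc).get("NAME"));
            searcher.close();
        }
    }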
12
Reality sets in
A splash of cold water to the face
13
Prior art
• It turns out people have been doing this before
• They call it
  – entity resolution
  – identity resolution
  – merge/purge
  – deduplication
  – record linkage
  – ...
• This makes Googling for information an absolute nightmare
14
Existing tools
• Several commercial tools
  – they look big and expensive: we skip those
• Stian found some open source tools
  – Oyster: slow, bad architecture, primitive matching
  – SERF: slow, bad architecture
• I’ve later found more, but was not impressed
• So, it seems we still have to do it ourselves
15
Finds in the research literature
• General
  – problem is well-understood
  – "naïve Bayes" is naïve
  – lots of interesting work on value comparisons
  – performance problem 'solved' with "blocking" (see the sketch below)
    • build a key from parts of the data
    • sort records by key
    • compare each record with m nearest neighbours
    • performance goes from O(n²) to O(n·m)
  – parallel processing widely used
• Swoosh paper
  – compare and merge should have ICAR¹ properties
  – optimal algorithms for general merge found
  – run-time for 14,000 records ~1.5 hours...

¹ Idempotence, commutativity, associativity, reflexivity
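The blocking sketch referenced above: a minimal illustration of the sorted-neighbourhood technique, with a hypothetical record type and key rule (not from the talk):

    import java.util.*;

    // Hedged sketch of blocking / sorted neighbourhood: build a key from parts
    // of the data, sort by it, and compare each record only with its m nearest
    // neighbours, so the work drops from O(n^2) to O(n*m). The Rec class and
    // the key rule are hypothetical stand-ins.
    public class Blocking {
        static class Rec {
            final String id, name, zip;
            Rec(String id, String name, String zip) {
                this.id = id; this.name = name; this.zip = zip;
            }
        }

        // blocking key: zip code plus first three letters of the name (assumption)
        static String key(Rec r) {
            String name = r.name.toLowerCase();
            return r.zip + name.substring(0, Math.min(3, name.length()));
        }

        public static void main(String[] args) {
            List<Rec> records = new ArrayList<>(Arrays.asList(
                new Rec("413341", "Acme Inc.", "8231"),
                new Rec("51233", "Acme Incorporated", "8231"),
                new Rec("450235", "Acme Inc.", "8227")));
            records.sort(Comparator.comparing(Blocking::key));

            int m = 10; // window size
            for (int i = 0; i < records.size(); i++)
                for (int j = i + 1; j <= Math.min(i + m, records.size() - 1); j++)
                    // each neighbour pair becomes a candidate for full comparison
                    System.out.println(records.get(i).id + " vs " + records.get(j).id);
        }
    }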
16
Good research papers
• Threat and Fraud Intelligence, Las Vegas Style, Jeff Jonas
  – http://jeffjonas.typepad.com/IEEE.Identity.Resolution.pdf
• Real-world data is dirty: Data Cleansing and the Merge/Purge Problem, Hernandez & Stolfo
  – http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.30.3496&rep=rep1&type=pdf
• Swoosh: a generic approach to entity resolution, Benjelloun, Garcia-Molina et al.
  – http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.122.5696&rep=rep1&type=pdf
17
DUplicate KillEr
Duke
18
Java deduplication engine
• Work in progress
  – so far spent only ~20 hours on it
  – only command-line batch client built so far
• Based on Lucene 3.1
• Open source (on Google Code)
  – http://code.google.com/p/duke/
• Blazingly fast
  – 960,000 records in 11 minutes on this laptop
19
Architecture
[Diagram: an SDshare client brings data in and an SDshare server sends equivalences out; an RDF frontend sits on the Datastore API, which fronts the Duke engine, backed by Lucene and an H2 database.]
20
Architecture #2
[Diagram: a command-line client takes data in and writes a link file out; a CSV frontend sits on the Datastore API, which fronts the Duke engine, backed by Lucene. More frontends: JDBC, SPARQL, RDF file, ...]
21
Architecture #3
[Diagram: a REST interface takes data in and serves equivalences out; some frontend "X" sits on the Datastore API, which fronts the Duke engine, backed by Lucene and an H2 database.]
22
Weaknesses
• Tied to naïve Bayes model
  – research shows more sophisticated models perform better
  – non-trivial to reconcile these with index lookup
• Value comparison sophistication limited
  – Lucene does support Levenshtein queries
  – (these are slow, though; will be fast in 4.x)
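For reference, a Levenshtein-based Lucene 3.x query looks roughly like this (the field name and similarity threshold are invented):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.FuzzyQuery;
    import org.apache.lucene.search.Query;

    // Hedged example of the Levenshtein queries mentioned above. In Lucene 3.x
    // FuzzyQuery enumerates index terms and edit-distance-checks each one,
    // which is why it is slow; 4.x switched to automaton-based matching. The
    // field name "NAME" and the 0.7 minimum similarity are invented here.
    public class FuzzyExample {
        static Query fuzzyNameQuery(String value) {
            return new FuzzyQuery(new Term("NAME", value), 0.7f);
        }
    }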
23
Comments/questions?