Sharing Solution for Record Linkage: the RELAIS Software
and the Italian and Spanish Experiences
Nicoletta Cibella (1), Gervasio-Luis Fernandez (2), Marco Fortini (1), Miguel Guigò (2), Francisco Hernandez (2), Monica Scannapieco (1), Laura Tosco (1), Tiziana Tuoto (1)
(1) Italian National Statistical Institute – ISTAT – Italy
(2) Spanish National Statistical Institute – INE – Spain
NTTS 2009, Brussels
18-20 February 2009
Outline
1. The Record Linkage
2. The ESSnet on ISAD
3. The Idea and the Features of the RELAIS Software
4. The Italian and Spanish Experiences in using RELAIS
5. Towards RELAIS 2.0
6. Conclusions
Theory and Practice in Developing a Record Linkage
Software
Nicoletta Cibella, Brussels, 19th February 2009
Record Linkage
The purpose of record linkage is to identify records that refer to
the same real-world entity, which may be represented differently
across data sources
Different approaches to deal with record linkage:
Exact RL - Deterministic RL - Probabilistic RL (Fellegi and
Sunter theory) - Bayesian RL - Machine Learning -
Knowledge Representation …
No particular technique has emerged as the best solution
for all cases
(maybe because such a solution does not exist…)
Record Linkage Complexity
Record linkage techniques are a multidisciplinary set of methods and practices
RECORD LINKAGE
PRE-PROCESSING
• Conversion of upper/lower cases
• Replacement of null strings
• Standardization
• Parsing
• …
SEARCH SPACE REDUCTION
• Sorted Neighbourhood Method
• Blocking
• Hierarchical Grouping
• …
COMPARISON FUNCTION CHOICE
• Exact
• Edit distance
• Smith-Waterman
• Q-grams
• Jaro string comparator
• Soundex code
• TF-IDF
• …
DECISION MODEL CHOICE
• Fellegi & Sunter
• Deterministic
• Bayesian
• Knowledge-based
• Mixed
• …
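To make the comparison function choice concrete, here is a minimal Python sketch of three of the functions named above: exact equality, edit (Levenshtein) distance turned into a similarity, and a q-gram similarity. The function names, the normalisation by the longer string and the Dice coefficient on bigrams are assumptions made for illustration; they are not taken from RELAIS.

```python
def equality(a: str, b: str) -> float:
    """Exact comparison: 1.0 if the two values are identical, else 0.0."""
    return 1.0 if a == b else 0.0


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def edit_similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1] using the longer string's length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def qgram_similarity(a: str, b: str, q: int = 2) -> float:
    """Dice coefficient computed on the sets of q-grams of the two strings."""
    grams_a = {a[i:i + q] for i in range(len(a) - q + 1)}
    grams_b = {b[i:i + q] for i in range(len(b) - q + 1)}
    if not grams_a and not grams_b:
        return equality(a, b)
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))
```

With these definitions, a surname with one missing letter ("ROSSI" against "ROSI") is rejected by the equality comparator but still scores high on the other two, which is the practical reason for offering several comparison functions.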
The Record Linkage Phases
Record linkage should be decomposed into its
constituent phases as far as possible
1. Pre-processing of the input files
2. Creation and reduction of the search space of candidate link pairs
3. Choice of the matching variables
4. Choice of the comparison function
5. Choice of the decision model
6. Selection of unique links
7. Record linkage evaluation
The ESSnet ISAD: Integration of Surveys and Administrative Data
The ESSnet and its focus
The aim of the project is to raise, across the whole ESS, knowledge and understanding of the statistical methodologies for the integration of two (or more) data sources.
Partners
The ESSnet ISAD, co-financed by Eurostat, started in December 2006 and ended in June 2008.
The project involved 5 countries:
ISTAT – Italy (scientific coordinator)
STAT – Austria
CZSO – Czech Republic
CBS – Netherlands
INE – Spain
RELAIS: The Idea
• There is no single optimal solution to record linkage problems: for each
phase the most appropriate technique should be chosen, depending on
application and data requirements and not only on the practitioner’s skill
• An ad-hoc record linkage process (workflow) should be built dynamically
RELAIS (REcord Linkage At IStat) is a toolkit serving this purpose
Record Linkage Workflows
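Techniques available for each phase:
Preprocessing: Normalization, UpperLowerCase, Schema reconciliation
Search Space Reduction: Blocking, SNM
Comparison Function: Edit Distance, Jaro, Equality
Decision Model: Probabilistic, Empirical
[Diagram: two example workflows, RecLink WF Appl1 and RecLink WF Appl2, each assembled by selecting techniques from the phases above: Normalization, UpperLowerCase, Blocking, SNM, Jaro, Equality, Empirical, Probabilistic]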
RELAIS Features
- Modular structure: each phase is designed as a module of the toolkit, with an explicit interface to the other modules
- Top-down design: modules (phases) of the record linkage process can be omitted and/or iterated
Advantages:
- dynamic composition of record linkage processes
- parallel development of the various techniques
- design for Web service encapsulation, to permit remote invocation
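The modular structure and dynamic composition described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the shared `State` dictionary, the module names (`upper_case`, `cross_product`, `equality_decision`) and `run_workflow` are hypothetical and do not reflect the actual RELAIS interfaces, which are written in Java and R.

```python
from typing import Callable, Dict, List

# Hypothetical common interface: every module takes the current state of
# the linkage process and returns an updated copy of it.
State = Dict[str, list]
Module = Callable[[State], State]


def upper_case(state: State) -> State:
    """Pre-processing module: convert both files to upper case."""
    return {**state,
            "a": [v.upper() for v in state["a"]],
            "b": [v.upper() for v in state["b"]]}


def cross_product(state: State) -> State:
    """Search space module: keep every possible pair (no reduction)."""
    return {**state, "pairs": [(x, y) for x in state["a"] for y in state["b"]]}


def equality_decision(state: State) -> State:
    """Comparison and decision module: link pairs whose values are equal."""
    return {**state, "links": [(x, y) for x, y in state["pairs"] if x == y]}


def run_workflow(modules: List[Module], state: State) -> State:
    """A workflow is an ordered list of modules applied one after another."""
    for module in modules:
        state = module(state)
    return state


data = {"a": ["Rossi", "bianchi"], "b": ["ROSSI", "Verdi"]}
with_preprocessing = run_workflow([upper_case, cross_product, equality_decision], data)
without_preprocessing = run_workflow([cross_product, equality_decision], data)
```

Because every module shares one interface, omitting a phase (here the pre-processing step) or swapping one technique for another only changes the list passed to `run_workflow`.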
RELAIS: An Open Source Project
• Results produced by the scientific community in recent years can be gathered and made available
– 175 000 papers mentioning “record linkage” (Google Scholar)
• Techniques for each phase can be implemented and maintained very rapidly by relying on a community of developers
• RELAIS Implementation Choices
– Java
– R statistical language
RELAIS: the First Release
RELAIS 1.0
SEARCH SPACE REDUCTION
• Cross Product
• Sorted Neighbourhood Method
• Blocking
COMPARISON FUNCTION CHOICE
• Equality
DECISION MODEL CHOICE
• Fellegi & Sunter
1:1 REDUCTION
• Optimised Transportation Problem
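As a rough illustration of the Fellegi & Sunter decision model combined with an equality comparison, the sketch below computes a composite matching weight from agreement/disagreement patterns and classifies a pair with two thresholds. The m- and u-probabilities, the variable names and the thresholds are hypothetical values chosen for the example; in practice they are estimated from the data, and nothing here reproduces the RELAIS code.

```python
import math

# Hypothetical parameters for three matching variables:
# m = P(values agree | pair is a true match)
# u = P(values agree | pair is not a match)
M_U = {
    "surname": (0.95, 0.01),
    "name": (0.90, 0.05),
    "year_of_birth": (0.98, 0.02),
}


def match_weight(agreement: dict) -> float:
    """Sum of log2 likelihood ratios over the comparison vector."""
    total = 0.0
    for variable, agrees in agreement.items():
        m, u = M_U[variable]
        if agrees:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


def classify(weight: float, upper: float = 8.0, lower: float = 0.0) -> str:
    """Two-threshold rule: link, non-link, or possible link in between."""
    if weight >= upper:
        return "link"
    if weight <= lower:
        return "non-link"
    return "possible link"


all_agree = {"surname": True, "name": True, "year_of_birth": True}
year_differs = {"surname": True, "name": True, "year_of_birth": False}
none_agree = {"surname": False, "name": False, "year_of_birth": False}
```

Pairs that fall between the two thresholds are the ones usually sent to clerical review.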
RELAIS: the First Release
INPUT DATA SET
→ Dataset Acquisition
→ Data Profiling
→ Search Space Creation/Reduction
→ Comparison Function
→ Decision Model
→ MATCHING N:M
→ Reduction to Matching 1:1
→ MATCHING 1:1
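The reduction from N:M to 1:1 matching selects, among the candidate links, a subset in which each record appears at most once and the total matching weight is as large as possible. RELAIS 1.0 formulates this as an optimised transportation problem; the sketch below is not that solver. It solves the same kind of assignment objective by brute-force enumeration, which is only feasible for a handful of records, and the identifiers and weights are hypothetical.

```python
from itertools import permutations


def one_to_one(weights: dict) -> list:
    """Pick the 1:1 subset of candidate links with the largest total weight.

    weights maps (id_in_file_a, id_in_file_b) to a positive matching weight.
    """
    a_ids = sorted({a for a, _ in weights})
    b_ids = sorted({b for _, b in weights})
    # Pad file B with None so that a record of file A may remain unlinked.
    slots = b_ids + [None] * len(a_ids)
    best, best_total = [], float("-inf")
    for assignment in permutations(slots, len(a_ids)):
        # Keep only assignments that correspond to real candidate links.
        pairs = [(a, b) for a, b in zip(a_ids, assignment) if (a, b) in weights]
        total = sum(weights[p] for p in pairs)
        if total > best_total:
            best, best_total = pairs, total
    return best


# Hypothetical N:M output: a1 is linked to two records and b1 to two records.
candidates = {("a1", "b1"): 10.0, ("a1", "b2"): 8.0, ("a2", "b1"): 9.0}
selected = one_to_one(candidates)
```

In this example a greedy rule would keep the single heaviest link (a1, b1) and leave a2 unlinked, whereas the global optimum keeps (a1, b2) and (a2, b1), which is why the reduction is treated as an optimisation problem.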
RELAIS in the Italian and Spanish Experiences
• Common ideas and needs about the software (no ad-hoc solutions)
• Knowledge sharing and cooperation started within the ESSnet
Evaluation of the “adaptability” of RELAIS, so that it can also solve Spanish data integration problems
A Scenario: the Data
Individual-level data from the 2001 Italian Census and the Post Enumeration Survey (PES), about 180 000 records each.
A capture-recapture model is used to estimate the Census coverage rate; it assumes:
- no matching errors in linking Census and PES records.
Linkage was a very complex operation:
- deterministic and probabilistic approaches and clerical review
- almost 15 matching variables
- several working months.
Due to the accuracy of the matching procedures adopted, we know the true linkage status of all candidate pairs.
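To see why matching errors matter for the coverage estimate, consider the simplest capture-recapture estimator (Lincoln-Petersen, also called the dual-system estimator). The slides do not say which capture-recapture model was used, so this choice is an assumption, and all the counts below are hypothetical round numbers rather than Census or PES figures.

```python
def lincoln_petersen(n_census: int, n_pes: int, n_matched: int) -> float:
    """Estimated population size from two lists and their matched records."""
    return n_census * n_pes / n_matched


def coverage_rate(n_census: int, n_pes: int, n_matched: int) -> float:
    """Share of the estimated population that was counted by the census."""
    return n_census / lincoln_petersen(n_census, n_pes, n_matched)


# Hypothetical example: 9 000 census records, 1 000 PES records.
correct = coverage_rate(9_000, 1_000, 950)      # all 950 true links found
with_misses = coverage_rate(9_000, 1_000, 900)  # 50 true links missed
```

Every missed link lowers the estimated coverage rate (here from about 0.95 to about 0.90), which is why a low match rate is a problem in a coverage context.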
RELAIS in the Italian Tests
A focus on Rome
Size of the PES and Census (CEN) files: about 8 000 units each
Cartesian Product CEN x PES: more than 72 250 000 pairs (expected link probability ≈ 0.0001)
1st Linkage Pass
Blocking variable: month of birth of the head of household
Matching Variables: name, surname, gender, day-month-year of birth
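The size of the search space can be checked with a few lines of arithmetic. The sketch assumes 8 500 records per file, a value inferred here only because 8 500 x 8 500 gives exactly the 72 250 000 pairs quoted above; it takes the 6 872 true matches from the results reported for this test, and it assumes that the blocking variable splits both files evenly into 12 blocks, which real data will only approximate.

```python
# Assumption: file sizes inferred from 72 250 000 = 8 500 x 8 500.
n_census = 8_500
n_pes = 8_500
true_links = 6_872  # total number of true matches reported for the test

# Cross product: every census record is compared with every PES record.
pairs = n_census * n_pes

# Share of candidate pairs that are true links.
link_probability = true_links / pairs

# Blocking on a variable with 12 evenly filled categories: only pairs in
# the same block are compared, so the search space shrinks by a factor 12.
blocks = 12
pairs_after_blocking = pairs // blocks
```

Under these assumptions the cross product holds 72 250 000 pairs with a link probability of roughly 0.0001, and blocking brings the number of comparisons down to about 6 million.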
                                  True Linkage Status
Results of the 1st Linkage Pass   Matched   Not Matched   Total
Matched                             6 016            30   6 046
Not Matched                           856
Total                               6 872

Results of the 1st Linkage Pass
Match Rate: 88%
False Match Rate: 0.5%
False Non-Match Rate: 12%
The software also provides results at the block-level
MATCH RATE TOO LOW IN COVERAGE CONTEXT
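The three rates can be recomputed from the counts of the first pass. The slides do not give the formulas, so the definitions below are inferred assumptions: match rate as declared matches over true matches, false match rate as false matches over declared matches, and false non-match rate as missed matches over true matches. With these definitions the rounded figures of the first pass are reproduced.

```python
def linkage_rates(true_pos: int, false_pos: int, false_neg: int) -> dict:
    """Quality rates from the counts of a linkage evaluation table.

    true_pos:  pairs declared matched that are true matches
    false_pos: pairs declared matched that are not true matches
    false_neg: true matches that were not declared matched
    """
    declared = true_pos + false_pos    # all pairs declared as matches
    true_total = true_pos + false_neg  # all true matches
    return {
        "match_rate": declared / true_total,
        "false_match_rate": false_pos / declared,
        "false_non_match_rate": false_neg / true_total,
    }


# Counts of the first linkage pass.
first_pass = linkage_rates(true_pos=6_016, false_pos=30, false_neg=856)
```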
2nd Linkage Pass
Residuals of the 1st pass: about 1 500 units in each file
- mainly records with a missing value in the blocking variable used in the 1st pass; expected link probability ≈ 0.0003
Cartesian Product: again not recommended …
Search space reduction by means of the Sorted Neighbourhood Method
Sorting variable: first letter of surname; window size = 450 (frequency of the most common first letter = 250)
Matching Variables: name, surname, day-month-year of birth
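A minimal sketch of the Sorted Neighbourhood Method may help to read the parameters above: the two files are merged, sorted on a key, and only records that fall within the same sliding window are compared. The function name, the window convention (each record is compared with the next `window - 1` records) and the toy surnames are assumptions for illustration, not the RELAIS implementation.

```python
def sorted_neighbourhood_pairs(records_a, records_b, key, window):
    """Candidate pairs (a, b) that lie within `window` positions of each
    other once the two files are merged and sorted on `key`."""
    tagged = ([(key(r), "A", r) for r in records_a] +
              [(key(r), "B", r) for r in records_b])
    tagged.sort(key=lambda t: (t[0], t[1]))
    pairs = []
    for i, (_, source_i, rec_i) in enumerate(tagged):
        # Compare with the following window - 1 records only.
        for _, source_j, rec_j in tagged[i + 1:i + window]:
            if source_i != source_j:  # keep cross-file pairs only
                pairs.append((rec_i, rec_j) if source_i == "A" else (rec_j, rec_i))
    return pairs


# Toy data: sorting key is the first letter of the surname.
census = ["Rossi", "Bianchi", "Verdi"]
pes = ["Rosi", "Bianco", "Zanetti"]
pairs = sorted_neighbourhood_pairs(census, pes, key=lambda s: s[0], window=2)
```

On this toy input the window yields 5 candidate pairs instead of the 9 of the full cross product. A window larger than the most frequent key value, as with the 450 against 250 quoted above, keeps all records sharing that key within reach of each other.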
                                   True Linkage Status
Results of the Linkage Procedure   Matched   Not Matched   Total
Matched                              6 712            58   6 770
Not Matched                            160
Total                                6 872

Results of the Overall Linkage Procedure (1st plus 2nd pass)
Match Rate: 98.5%
False Match Rate: 0.8%
False Non-Match Rate: 2.3%
Working Time: less than 2 hours
Rome PES Workflow (RELAIS 1.0)
Search Space Reduction: Cross Product, Blocking, SNM
Comparison Function: Edit Distance, Jaro-Winkler, Equality
Decision Model: Probabilistic
Linking Type: 1:1, Many:Many
Step 1: Blocking → Equality → Probabilistic → 1:1
Step 2: SNM → Equality → Probabilistic → 1:1
A Scenario: the Data
Individual-level data from the Living Conditions Survey (LCS) and the Central Population Register (CPR)
1st Main Objective: obtain the ID number for LCS records
2nd Main Objective: compare the RELAIS results with those of ad-hoc procedures
Linkage was a very complex operation:
- only “name” and geographical variables were available
- large amount of data.
Blocking on geographic area variables
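Blocking on a geographic variable can be sketched as follows: records of the second file are indexed by the blocking key, and each record of the first file is compared only with the records that share its key. The field names (`name`, `province`) and the sample records are hypothetical; the slides only state that geographic area variables were used.

```python
from collections import defaultdict


def blocked_pairs(records_a, records_b, block_key):
    """Yield only the pairs whose two records share the same blocking key."""
    index = defaultdict(list)
    for rec_b in records_b:
        index[block_key(rec_b)].append(rec_b)
    for rec_a in records_a:
        key = block_key(rec_a)
        if key is None:
            # A missing blocking value cannot be blocked; such records
            # have to be handled in a separate linkage pass.
            continue
        for rec_b in index.get(key, []):
            yield rec_a, rec_b


# Hypothetical survey and register records.
survey = [
    {"name": "ANA GARCIA", "province": "Madrid"},
    {"name": "LUIS PEREZ", "province": "Sevilla"},
    {"name": "EVA RUIZ", "province": None},
]
register = [
    {"name": "ANA GARCIA LOPEZ", "province": "Madrid"},
    {"name": "LUIS PEREZ", "province": "Madrid"},
    {"name": "L PEREZ", "province": "Sevilla"},
]
pairs = list(blocked_pairs(survey, register, lambda r: r["province"]))
```

Here blocking keeps 3 of the 9 possible pairs. With many geographic areas the number of blocks grows quickly, and each block then needs its own comparison and estimation step.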
RELAIS in the Spanish Tests
Weaknesses of RELAIS 1.0
• difficulties in managing a large number of blocks
• difficulties in dealing with different probability estimates in each block
• difficulties in writing the largest output files
Strengths of RELAIS 1.0
• efficacy of the implemented probabilistic method
• noticeable flexibility in modifying/adapting the implemented functionalities (reduction from M:N to 1:1)
Towards RELAIS 2.0
• A relational database architecture, to optimize performance when managing huge amounts of data throughout the whole record linkage process (input, intermediate phases and output).
• Several distance functions for string and numerical comparisons (not only equality).
• Exact and deterministic decision models, to be used either as alternatives to or in conjunction with the probabilistic model.
• A data profiling phase to help the user in the critical task of choosing the best blocking and matching variables.
• One-shot execution to deal with a large number of blocks.
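As an illustration of what a data profiling phase might report for a candidate blocking or matching variable, the sketch below computes completeness, the number of distinct values, the entropy of the value distribution and the share of the most frequent value. The choice of indicators and the function name `profile` are assumptions made for this example; the slides do not list the metrics computed by RELAIS 2.0.

```python
import math
from collections import Counter


def profile(values: list) -> dict:
    """Simple quality indicators for one variable of an input file."""
    present = [v for v in values if v not in (None, "")]
    counts = Counter(present)
    n_present = len(present)
    # Entropy (in bits) of the observed value distribution.
    entropy = 0.0
    for count in counts.values():
        share = count / n_present
        entropy -= share * math.log2(share)
    return {
        "completeness": n_present / len(values) if values else 0.0,
        "distinct": len(counts),
        "entropy": entropy,
        "largest_share": max(counts.values()) / n_present if present else 0.0,
    }


gender = profile(["F", "M", "F", "M"])
month_of_birth = profile(["01", "02", None, "01"])
```

A good blocking variable is complete and splits the file into many blocks of similar size (high entropy); a variable with missing values, like the second one here, leaves records unblocked and forces a further linkage pass.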
RELAIS 2.0 is now being tested and will be available from May 2009
Concluding Remarks
• Profitable experience of cooperation between NSIs.
• The open-source philosophy and the move away from ad-hoc approaches proved a winning choice.
• NSIs share common problems and needs in data integration projects.
New Challenge:
- Add to RELAIS methods for evaluating record linkage quality.
RELAIS: Availability and Contacts
RELAIS 1.0 is available on the website: www.istat.it
RELAIS 2.0 will be available from May 2009
RELAIS Contacts:
Nicoletta Cibella, Statistician, E-mail: [email protected]
Tiziana Tuoto, Statistician, E-mail: [email protected]