Sharing Solution for Record Linkage: the RELAIS Software and the Italian and Spanish Experiences

Nicoletta Cibella (1), Gervasio-Luis Fernandez (2), Marco Fortini (1), Miguel Guigò (2), Francisco Hernandez (2), Monica Scannapieco (1), Laura Tosco (1), Tiziana Tuoto (1)

(1) Italian National Statistical Institute (ISTAT), Italy
(2) Spanish National Statistical Institute (INE), Spain

NTTS 2009, Brussels, 18-20 February 2009




Outline

1. The Record Linkage

2. The ESSnet on ISAD

3. The Idea and the Features of the RELAIS Software

4. The Italian and Spanish Experiences in using RELAIS

5. Towards RELAIS 2.0

6. Conclusions

Theory and Practice in Developing a Record Linkage Software


Nicoletta Cibella, Brussels, 19th February 2009


Record Linkage

The purpose of record linkage is to identify the same real-world entity, which may be represented differently in different data sources.

Different approaches to record linkage:

Exact RL - Deterministic RL - Probabilistic RL (Fellegi and Sunter theory) - Bayesian RL - Machine Learning - Knowledge Representation …

No particular technique has emerged as the best solution for all cases (maybe because such a solution does not exist…)


Record Linkage Complexity

The record linkage techniques are a multidisciplinary set of methods and practices

RECORD LINKAGE

SEARCH SPACE REDUCTION
• Sorted Neighbourhood Method
• Blocking
• Hierarchical Grouping
• …

DECISION MODEL CHOICE
• Fellegi & Sunter
• Deterministic
• Bayesian
• Knowledge-based
• Mixed
• …

COMPARISON FUNCTION CHOICE
• Exact
• Edit distance
• Smith-Waterman
• Q-grams
• Jaro string comparator
• Soundex code
• TF-IDF
• …

PRE-PROCESSING
• Conversion of upper/lower cases
• Replacement of null strings
• Standardization
• Parsing
• …
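As a rough illustration of two of the comparison functions listed above, the sketch below implements edit (Levenshtein) distance and a q-gram similarity in plain Python. The function names, the default q = 2 and the Jaccard-style overlap score are assumptions made for this example; they are not taken from the RELAIS code.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions that turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def qgrams(s: str, q: int = 2) -> set:
    """Set of overlapping substrings of length q."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def qgram_similarity(a: str, b: str, q: int = 2) -> float:
    """Jaccard overlap of the two q-gram sets, in [0, 1]."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    if not ga and not gb:
        # Both strings are shorter than q: fall back to equality.
        return 1.0 if a == b else 0.0
    return len(ga & gb) / len(ga | gb)


print(edit_distance("ROSSI", "ROSI"))     # 1
print(qgram_similarity("ROSSI", "ROSI"))  # 0.75
```

An equality comparison would treat "ROSSI" and "ROSI" as a plain disagreement, whereas both functions above report them as close.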


The Record Linkage Phases

Record linkage should be decomposed into its constituent phases as far as possible:

1. Pre-processing of the input files

2. Creation and reduction of the search space of candidate link pairs

3. Choice of the matching variables

4. Choice of the comparison function

5. Choice of the decision model

6. Selection of unique links

7. Record linkage evaluation


The ESSnet ISAD: Integration of Surveys and Administrative Data


The ESSnet and its focus

The aim of the project is to raise, across the whole ESS, knowledge and understanding of the statistical methodologies for integrating two (or more) data sources.

The ESSnet ISAD, co-financed by Eurostat, started in December 2006 and ended in June 2008.

Partners

The project involved 5 countries:

ISTAT – Italy (scientific coordinator)

STAT – Austria

CZSO – Czech Republic

CBS – Netherlands

INE – Spain


RELAIS: The Idea

• There is no single optimal solution to record linkage problems: for each phase, the most appropriate technique should be chosen according to application and data requirements, not only the practitioner’s skill

• An ad-hoc record linkage process (workflow) should be built dynamically

RELAIS (REcord Linkage At IStat) is a toolkit serving this purpose


Record Linkage Workflows

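Techniques shown for each phase:

- Preprocessing: Normalization, UpperLowerCase, Schema reconciliation
- Search Space Reduction: Blocking, SNM
- Comparison Function: Edit Distance, Jaro, Equality
- Decision Model: Probabilistic, Empirical

[Figure: two application-specific workflows, RecLink WF Appl1 and RecLink WF Appl2, each assembled from these techniques (Normalization, UpperLowerCase, Blocking, SNM, Jaro, Equality, Empirical, Probabilistic).]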


RELAIS Features

- Modular structure: each phase is planned as a module of the toolkit, with an explicit interface to the other modules

- Top-down design: this makes it possible to omit and/or iterate modules (phases) of the record linkage process

Advantages:

- dynamic composition of record linkage processes

- parallel development of various techniques

- designed for Web service encapsulation, to permit remote invocation
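The idea of composing a workflow dynamically from modules with a common interface can be sketched as follows. This is only an illustration of the design principle: the step functions, the record layout and `run_workflow` are hypothetical names, not the RELAIS module interface (RELAIS itself is written in Java and R).

```python
from typing import Callable, Dict, List

Record = Dict[str, str]
Step = Callable[[List[Record]], List[Record]]


def upper_case(records: List[Record]) -> List[Record]:
    """Convert every field value to upper case."""
    return [{k: v.upper() for k, v in r.items()} for r in records]


def strip_blanks(records: List[Record]) -> List[Record]:
    """Remove leading and trailing blanks from every field value."""
    return [{k: v.strip() for k, v in r.items()} for r in records]


def run_workflow(records: List[Record], steps: List[Step]) -> List[Record]:
    """Apply the chosen modules in order; any module can be omitted or repeated."""
    for step in steps:
        records = step(records)
    return records


data = [{"name": " maria ", "surname": "Rossi"}]
print(run_workflow(data, [strip_blanks, upper_case]))
```

Because every step takes and returns the same kind of object, a different application can pass a different list of steps without changing any of the modules.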


RELAIS: An Open Source Project

• Results produced by the scientific community in recent years can be gathered and made available

– 175 000 papers mentioning “record linkage” (Google Scholar)

• Techniques for each phase can be implemented and maintained very rapidly by relying on a community of developers

• RELAIS Implementation Choices

– Java

– R statistical language


RELAIS: the First Release

RELAIS 1.0

SEARCH SPACE REDUCTION
• Cross Product
• Sorted Neighbourhood Method
• Blocking

DECISION MODEL CHOICE
• Fellegi & Sunter

COMPARISON FUNCTION CHOICE
• Equality

1:1 REDUCTION
• Optimised Transportation Problem
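For readers unfamiliar with the Fellegi & Sunter decision model, the sketch below shows its core computation: each matching variable contributes a log-likelihood ratio to a composite weight, and two thresholds split the pairs into links, possible links and non-links. The m- and u-probabilities and the thresholds are made-up values for illustration (in practice they are estimated from the data), and the function names are hypothetical.

```python
import math


def match_weight(agreements, m, u):
    """Fellegi-Sunter composite weight: sum of log2 likelihood ratios,
    m/u on agreement and (1 - m)/(1 - u) on disagreement."""
    total = 0.0
    for var, agrees in agreements.items():
        if agrees:
            total += math.log2(m[var] / u[var])
        else:
            total += math.log2((1 - m[var]) / (1 - u[var]))
    return total


def classify(weight, lower, upper):
    """Three-way decision rule based on two thresholds."""
    if weight >= upper:
        return "link"
    if weight <= lower:
        return "non-link"
    return "possible link"


# Hypothetical probabilities of agreement among true matches (m)
# and among non-matches (u) for three matching variables.
m = {"name": 0.9, "surname": 0.95, "year_of_birth": 0.98}
u = {"name": 0.05, "surname": 0.01, "year_of_birth": 0.02}

pair = {"name": True, "surname": True, "year_of_birth": False}
w = match_weight(pair, m, u)
print(round(w, 2), classify(w, lower=0.0, upper=8.0))
```

With equality as the only comparison function, as in this release, each variable yields exactly the agree/disagree outcome used here.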


RELAIS: the First Release

[Figure: RELAIS 1.0 process flow. Data states: INPUT DATA SET, MATCHING N:M, MATCHING 1:1. Processing stages: Dataset Acquisition, Data Profiling, Search Space Creation/Reduction, Comparison Function, Decision Model, Reduction to Matching 1:1.]
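The reduction from N:M to 1:1 matching can be read as an assignment problem: keep at most one link per record so that the total matching weight is as large as possible. The sketch below solves a tiny instance by exhaustive search. This is a stand-in chosen for clarity, not the optimised transportation-problem solver that RELAIS uses, and the record identifiers and weights are invented.

```python
from itertools import permutations


def one_to_one(weights):
    """Pick at most one link per record so that the total weight is maximal.

    weights maps (record_a, record_b) -> positive matching weight for every
    candidate pair kept by the decision model (the N:M output).
    Exhaustive search: only suitable for tiny examples.
    """
    a_ids = sorted({a for a, _ in weights})
    b_ids = sorted({b for _, b in weights})
    # Pad the B side with None so that some A records may stay unlinked.
    b_slots = b_ids + [None] * len(a_ids)
    best, best_total = [], float("-inf")
    for perm in permutations(b_slots, len(a_ids)):
        pairs = [(a, b) for a, b in zip(a_ids, perm) if (a, b) in weights]
        total = sum(weights[p] for p in pairs)
        if total > best_total:
            best, best_total = pairs, total
    return best


candidates = {("a1", "b1"): 9.0, ("a1", "b2"): 7.5, ("a2", "b1"): 8.0}
print(one_to_one(candidates))
```

Here a greedy rule that always keeps the heaviest pair first would link a1 to b1 and leave a2 unlinked, while the global optimum links both records; for realistic file sizes a linear-programming or assignment solver would replace the exhaustive search.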


RELAIS in the Italian and Spanish Experiences

• Common ideas and needs about the software (no ad-hoc solutions)

• Knowledge sharing and cooperation started in the ESSnet

Evaluation of the “adaptability” of RELAIS to Spanish data integration problems as well


A Scenario: the Data

Individual-level data from the 2001 Italian Census and the Post Enumeration Survey (PES), about 180 000 each.

A capture-recapture model is used to estimate the Census Coverage Rate; it requires

- no matching errors in linking Census and PES records.

Linkage was a very complex operation:

- deterministic and probabilistic approaches and clerical review

- almost 15 matching variables

- several working months.

Due to the accuracy of the matching procedures adopted, we know the true linkage status of all candidate pairs.
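The slide does not say which capture-recapture model was used. As an illustration only, the simplest one (the Lincoln-Petersen, or dual-system, estimator) shows why matching errors matter. The symbols are assumptions introduced here: n_C is the number of persons counted by the Census, n_P the number counted by the PES, and m the number found in both, i.e. the number of links.

```latex
\hat{N} = \frac{n_{C}\, n_{P}}{m},
\qquad
\hat{c} = \frac{n_{C}}{\hat{N}} = \frac{m}{n_{P}}
```

Under this model the estimated coverage rate is proportional to m, so every missed link lowers the estimate and every false link raises it.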


RELAIS in the Italian Tests

A focus on Rome

Size of the PES and Census (CEN) files: about 8 000 units each

Cartesian product CEN x PES: more than 72 250 000 pairs (expected link probability ≈ 0.0001)

1st Linkage Pass

Blocking variable: month of birth of the household head

Matching variables: name, surname, gender, day-month-year of birth
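Blocking replaces the full Cartesian product with the pairs that agree on a blocking variable. A minimal sketch, assuming records are dictionaries and using the hypothetical field name `birth_month`:

```python
from collections import defaultdict


def blocked_pairs(file_a, file_b, key):
    """Yield only the pairs whose records share the same blocking-key value.
    Records with a missing key are left out and must be handled separately."""
    index = defaultdict(list)
    for rec in file_b:
        if rec.get(key) is not None:
            index[rec[key]].append(rec)
    for rec_a in file_a:
        for rec_b in index.get(rec_a.get(key), []):
            yield rec_a, rec_b


census = [{"id": "c1", "birth_month": 3},
          {"id": "c2", "birth_month": 7},
          {"id": "c3", "birth_month": None}]
pes = [{"id": "p1", "birth_month": 3},
       {"id": "p2", "birth_month": 3},
       {"id": "p3", "birth_month": 7}]

pairs = [(a["id"], b["id"]) for a, b in blocked_pairs(census, pes, "birth_month")]
print(pairs)                              # 3 candidate pairs
print(len(census) * len(pes))             # 9 pairs in the Cartesian product
```

With twelve blocks of roughly equal size, the number of candidate pairs falls to about one twelfth of the Cartesian product; the price is that records with a missing or wrong blocking value (such as c3 here) are never compared in this pass.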


  True Linkage Status  

Rows: results of the 1st linkage pass; columns: true linkage status.

1st Linkage Pass     Matched    Not Matched    Total
Matched                6 016             30    6 046
Not Matched              856
Total                  6 872

Results of the 1st Linkage Pass

Match Rate: 88%

False Match Rate: 0.5%

False Non-Match Rate: 12%

The software also provides results at the block-level

MATCH RATE TOO LOW IN COVERAGE CONTEXT
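The three rates can be recomputed from the counts in the table. The definitions used below are inferred from those counts (they reproduce the reported figures after rounding) and are an assumption rather than definitions stated on the slide; the function and argument names are hypothetical.

```python
def linkage_rates(true_links_found, false_links, true_links_total):
    """Quality rates for a linkage pass whose true status is known."""
    declared = true_links_found + false_links          # pairs declared as matches
    missed = true_links_total - true_links_found       # true matches not found
    return {
        "match_rate": true_links_found / true_links_total,
        "false_match_rate": false_links / declared,
        "false_non_match_rate": missed / true_links_total,
    }


# Counts from the 1st linkage pass: 6 016 true links found,
# 30 false links, 6 872 true links in total.
rates = linkage_rates(6016, 30, 6872)
for name, value in rates.items():
    print(f"{name}: {value:.1%}")
```

With these definitions the match rate and the false non-match rate always sum to 100%, which is why a 12% false non-match rate is too high when the links feed a coverage estimate.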


2nd Linkage Pass

Residuals of the 1st pass: about 1 500 units in each file

- mainly records with a missing value in the blocking variable of the 1st pass; expected link probability ≈ 0.0003

Cartesian product: again not recommended …

Search space reduction by means of the Sorted Neighbourhood Method (SNM)

Sorting variable: first letter of surname; window size = 450 (frequency of the most common first letter = 250)

Matching variables: name, surname, day-month-year of birth
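The Sorted Neighbourhood Method can be sketched as follows: both files are merged and sorted on the sorting variable, a fixed-size window slides over the sorted list, and only records from different files that fall in the same window become candidate pairs. This is the textbook form of the method written for illustration, not the RELAIS implementation; the record layout and names are hypothetical.

```python
def sorted_neighbourhood_pairs(file_a, file_b, key, window):
    """Candidate pairs from a sliding window over the merged, sorted files."""
    tagged = [("A", r) for r in file_a] + [("B", r) for r in file_b]
    tagged.sort(key=lambda t: key(t[1]))
    pairs = []
    for i, (src_i, rec_i) in enumerate(tagged):
        # Compare each record with the next (window - 1) records in sort order.
        for src_j, rec_j in tagged[i + 1:i + window]:
            if src_i != src_j:
                pair = (rec_i, rec_j) if src_i == "A" else (rec_j, rec_i)
                pairs.append(pair)
    return pairs


census = [{"id": "c1", "surname": "BIANCHI"},
          {"id": "c2", "surname": "ROSSI"},
          {"id": "c3", "surname": "VERDI"}]
pes = [{"id": "p1", "surname": "BIANCO"},
       {"id": "p2", "surname": "RUSSO"}]

first_letter = lambda r: r["surname"][0]
result = sorted_neighbourhood_pairs(census, pes, first_letter, window=2)
print([(a["id"], b["id"]) for a, b in result])
```

The window has to be at least as large as the biggest group of records sharing a sort-key value, otherwise some records with the same key are never compared; this is presumably why the window of 450 exceeds the frequency of the most common first letter.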


  True Linkage Status  

Rows: results of the linkage procedure; columns: true linkage status.

Linkage Procedure    Matched    Not Matched    Total
Matched                6 712             58    6 770
Not Matched              160
Total                  6 872

Results of the Overall Linkage Procedure (1st plus 2nd passes)

Match Rate: 98.5%

False Match Rate: 0.8%

False Non-Match Rate: 2.3%

Working Time: less than 2 hours


Rome PES Workflow

Techniques shown for each phase (RELAIS 1.0):

- Search Space Reduction: Cross Product, Blocking, SNM
- Comparison Function: Edit Distance, Jaro-Winkler, Equality
- Decision Model: Probabilistic
- Linking Type: 1:1, Many:Many

Techniques selected in the workflow:

- Step 1: Blocking, Equality, Probabilistic, 1:1
- Step 2: SNM, Equality, Probabilistic, 1:1


A Scenario: the Data

Individual-level data from the Spanish Living Conditions Survey (LCS) and Central Population Register (CPR)

1st Main Objective: obtain an ID number for the LCS records

2nd Main Objective: compare the RELAIS results with ad-hoc procedures

Linkage was a very complex operation:

- only “name” and geographical variables were available

- large amount of data.

Blocking on geographic area variables

RELAIS in the Spanish Tests

Weaknesses of RELAIS 1.0

• difficulties in managing a large number of blocks

• difficulties in dealing with different probability estimates in each block

• difficulties in writing the largest output files

Strengths of RELAIS 1.0

• efficacy of the implemented probabilistic method

• notable flexibility in modifying and adapting the implemented functionalities (reduction from M:N to 1:1)


Towards RELAIS 2.0

• A relational database architecture, to optimize performance when managing huge amounts of data throughout the whole record linkage process (input, intermediate phases and output).

• Several distance functions for string and numerical comparisons (not only equality).

• Exact and deterministic decision models, to be used either as alternatives to or in conjunction with the probabilistic model.

• A data profiling phase to help the user in the critical task of choosing the best blocking or matching variables.

• One-shot execution to deal with a large number of blocks.

RELAIS 2.0 is now being tested and will be available from May 2009
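To give a feel for what a relational database brings to the process, the sketch below stores two small files in an in-memory SQLite database and produces the blocked candidate pairs with a join. SQLite is used here only because it ships with Python; the slide does not say which database RELAIS 2.0 relies on, and the table names, column names and records are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE file_a (id TEXT, surname TEXT, area TEXT);
    CREATE TABLE file_b (id TEXT, surname TEXT, area TEXT);
    CREATE INDEX idx_b_area ON file_b (area);
""")
conn.executemany("INSERT INTO file_a VALUES (?, ?, ?)",
                 [("a1", "GARCIA", "28"), ("a2", "LOPEZ", "08")])
conn.executemany("INSERT INTO file_b VALUES (?, ?, ?)",
                 [("b1", "GARCIA", "28"), ("b2", "GARZIA", "28"),
                  ("b3", "LOPEZ", "41")])

# Candidate pairs come from a join on the blocking variable (area);
# the equality comparison on surname is evaluated in the same query.
rows = conn.execute("""
    SELECT a.id, b.id, a.surname = b.surname AS surname_agrees
    FROM file_a AS a JOIN file_b AS b ON a.area = b.area
    ORDER BY a.id, b.id
""").fetchall()
print(rows)
conn.close()
```

The database keeps the input, the candidate pairs and the output on disk rather than in memory, and a single indexed join covers all blocks at once instead of handling them one by one.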


Concluding Remarks

• Profitable experience of cooperation between NSIs.

• The open-source philosophy and the move away from ad-hoc approaches proved a winning choice.

• Problems and needs in data integration projects are common across NSIs.

New Challenge:

- Add in RELAIS methods for evaluating record linkage quality.


RELAIS: Availability and Contacts

RELAIS 1.0 is available on the website: www.istat.it

RELAIS 2.0 will be available from May 2009

RELAIS Contacts:

Nicoletta Cibella, Statistician. E-mail: [email protected]

Tiziana Tuoto, Statistician. E-mail: [email protected]
