1
Open source software:
a way to enrich local solutions
N. Cibella, M. Scannapieco, T. Tuoto Italian National Statistical Office, Italy
2
Outline
• The record linkage problem and the RELAIS solution
• RELAIS, a shareable tool
• The main features of RELAIS
• International experiences in using RELAIS
3
The problem
Record linkage aims to accurately recognize the same real-world entity at the individual (micro) level, even when it is stored differently in sources of various types.
Examples of applications (in official statistics):
• data integration
• update and de-duplication of a source
• quality improvement of a data source
• measure of population size by capture-recapture
• estimate of the risk of re-identification in public-use microdata
Also known as: Object Identification, Record Matching, …
4
Possible Solutions for Record Linkage
A very fragmented picture, not only at Istat.
Different approaches to deal with record linkage: Exact RL - Deterministic RL - Probabilistic RL (Fellegi and Sunter theory) - Bayesian RL - Machine Learning - Knowledge Representation …
No particular technique has emerged as the best solution for all cases (maybe because such a solution does not exist…)
Several software packages and tools have been proposed, based on different approaches, both free and commercial.
5
RELAIS, a brief history
RELAIS (REcord Linkage At Istat)
• RELAIS is a toolkit for record linkage (RL)
• Istat started developing RELAIS in 2006 and the system is now at its 2.1 release
– release 2.2 is about to be published
6
RELAIS, a brief history
– Istat working group with several collaborations and training courses on probabilistic record linkage
– Enriched experience in Data Integration as coordinator of the ESSnet
• Common nature of problems and needs of NSIs in data integration projects
• Profitable experiences in cooperation with NSIs also in sharing the same software tools (NTTS 2009)
7
RELAIS: a Shareable Tool
• A tool designed to be shared
• It is a toolkit: possibility of adding new techniques to the system, and thus reusing solutions that are already available
• Open source implementation: Java and R as programming languages and MySQL as database management system
8
RELAIS: a Shareable Tool
• Reuse of existing solutions
• Most of the comparison functions are part of the Java package StringMetrics
– (http://www.dcs.shef.ac.uk/~sam/stringmetrics.html)
• 1:1 reduction phase is implemented by making use of the R package lpSolve
– (http://cran.r-project.org/web/packages/lpSolve/index.html)
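The 1:1 reduction can be read as an assignment problem: among the candidate pairs, keep a subset in which every record appears at most once and the total match weight is as large as possible. The following is a minimal sketch under stated assumptions, not the RELAIS implementation and not the lpSolve API: the weights are hypothetical, the function names are invented for illustration, and an exhaustive search stands in for the linear-programming solver. A greedy variant is shown alongside for comparison.

```python
from itertools import permutations

# Hypothetical, non-negative match weights for candidate pairs (A record, B record).
weights = {
    ("a1", "b1"): 9.0,
    ("a1", "b2"): 8.0,
    ("a2", "b1"): 8.5,
    ("a2", "b2"): 1.0,
}

def greedy_one_to_one(weights):
    """Take pairs by decreasing weight, skipping records that are already used."""
    used_a, used_b, chosen = set(), set(), []
    for (a, b), _ in sorted(weights.items(), key=lambda kv: -kv[1]):
        if a not in used_a and b not in used_b:
            chosen.append((a, b))
            used_a.add(a)
            used_b.add(b)
    return sorted(chosen)

def optimal_one_to_one(weights):
    """Exhaustive search for the 1:1 subset with the largest total weight.

    Stand-in for an LP/assignment solver; only feasible for tiny inputs
    and assumes non-negative weights.
    """
    a_ids = sorted({a for a, _ in weights})
    b_ids = sorted({b for _, b in weights})
    n = max(len(a_ids), len(b_ids))
    # Pad the shorter side with None so some records may stay unmatched.
    a_pad = a_ids + [None] * (n - len(a_ids))
    b_pad = b_ids + [None] * (n - len(b_ids))
    best_total, best_pairs = float("-inf"), []
    for perm in permutations(b_pad):
        pairs = [(a, b) for a, b in zip(a_pad, perm) if (a, b) in weights]
        total = sum(weights[p] for p in pairs)
        if total > best_total:
            best_total, best_pairs = total, sorted(pairs)
    return best_pairs

greedy_links = greedy_one_to_one(weights)
optimal_links = optimal_one_to_one(weights)
```

On these illustrative weights the two methods disagree: the greedy choice locks in the single heaviest pair first, while the exhaustive search finds a combination with a larger total weight.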
9
RELAIS: a Shareable Tool
• Sharing of the software
• Both source code and executables of RELAIS have been released on:
– Istat site: http://www.istat.it/strumenti/metodi/software/analisi_dati/relais/
– OSOR site: http://forge.osor.eu/projects/relais/
10
RELAIS: a Shareable Tool
• Licensing problem
• RELAIS was the first system that Istat decided to release as an open system, so no previous experience was available
• Analysis of available licensing solutions
• Choice of EUPL (European Union Public Licence)
– Consistency with the copyright law in the 27 Member States of the European Union
– Compatibility with popular open-source software licences (e.g. GPL)
11
The main ideas of RELAIS
RELAIS main ideas:
- decompose the complex RL project into its constituent phases;
- dynamically choose the most appropriate technique for each phase, depending on application and data requirements, not only on the practitioner’s skill
12
Choose the most appropriate techniques
13
Build ad-hoc RL workflows
[Figure: the techniques available for each phase, combined into two example workflows, RecLink WF Appl1 and RecLink WF Appl2]
• Preprocessing: Normalization, UpperLowerCase
• Search Space Reduction: Blocking, SNM
• Comparison Function: Edit Distance, Jaro, Equality
• Decision Model: Probabilistic, Deterministic
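The idea of assembling a workflow by picking one technique per phase can be sketched as follows. This is an illustrative sketch only: the function names, the record layout and the sample records are assumptions made for the example and do not reflect the RELAIS code base. It combines upper-case normalization, blocking, the equality comparison and a deterministic rule.

```python
from itertools import product

def normalize(record):
    """Preprocessing: trim whitespace and upper-case every field except the id."""
    return {k: (v if k == "id" else v.strip().upper()) for k, v in record.items()}

def blocking(file_a, file_b, key):
    """Search space reduction: keep only pairs that agree on the blocking key."""
    return [(a, b) for a, b in product(file_a, file_b) if a[key] == b[key]]

def equality(a, b, variables):
    """Comparison function: 1 if the two values are identical, else 0."""
    return [int(a[v] == b[v]) for v in variables]

def deterministic(comparison, min_agreements):
    """Decision model: link when enough matching variables agree."""
    return sum(comparison) >= min_agreements

def run_workflow(file_a, file_b, block_key, variables, min_agreements):
    """Chain the four phases and return the linked (A id, B id) pairs."""
    a_norm = [normalize(r) for r in file_a]
    b_norm = [normalize(r) for r in file_b]
    links = []
    for a, b in blocking(a_norm, b_norm, block_key):
        if deterministic(equality(a, b, variables), min_agreements):
            links.append((a["id"], b["id"]))
    return links

# Hypothetical sample records.
file_a = [
    {"id": "a1", "name": "rossi ", "city": "Roma", "year": "1970"},
    {"id": "a2", "name": "Bianchi", "city": "Milano", "year": "1985"},
]
file_b = [
    {"id": "b1", "name": "ROSSI", "city": "roma", "year": "1970"},
    {"id": "b2", "name": "Verdi", "city": "Milano", "year": "1985"},
]
links = run_workflow(file_a, file_b, "city", ["name", "year"], 2)
```

Because each phase is a separate function, swapping blocking for another reduction method, or equality for a string distance, changes one call without touching the rest of the workflow.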
14
RELAIS 2.1 - May 2010
• Relational database support: input of data from an Oracle or MySQL database.
• New default input values for the parameter estimation of the probabilistic model and new definition of the candidate pairs for the optimal 1:1 reduction.
• More than one variable for search space reduction by the sorted neighborhood method.
• Minor bugs have been fixed.
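A sorted neighborhood reduction on more than one variable can be sketched as follows: the two files are pooled, sorted on a composite key, and only records that fall within the same sliding window become candidate pairs. This is a minimal sketch under assumptions, not the RELAIS implementation; the function name, the record layout and the sample data are hypothetical, and the window is taken to mean "each record is compared with the next window - 1 records".

```python
def sorted_neighbourhood(file_a, file_b, sort_keys, window):
    """Return candidate (A id, B id) pairs found inside a sliding window
    over the pooled records sorted on the composite key sort_keys."""
    pooled = [("A", r) for r in file_a] + [("B", r) for r in file_b]
    pooled.sort(key=lambda item: tuple(item[1][k] for k in sort_keys))
    pairs = set()
    for i, (src_i, rec_i) in enumerate(pooled):
        # Compare with the next (window - 1) records in sort order.
        for src_j, rec_j in pooled[i + 1 : i + window]:
            if src_i != src_j:  # only pairs across the two files
                a, b = (rec_i, rec_j) if src_i == "A" else (rec_j, rec_i)
                pairs.add((a["id"], b["id"]))
    return sorted(pairs)

# Hypothetical sample records.
file_a = [
    {"id": "a1", "surname": "ROSSI", "year": "1970"},
    {"id": "a2", "surname": "BIANCHI", "year": "1985"},
]
file_b = [
    {"id": "b1", "surname": "ROSI", "year": "1970"},
    {"id": "b2", "surname": "BIANCO", "year": "1985"},
    {"id": "b3", "surname": "VERDI", "year": "1990"},
]
candidates = sorted_neighbourhood(file_a, file_b, ["surname", "year"], window=2)
```

Unlike blocking, the method tolerates small errors in the key: "ROSI" and "ROSSI" do not share a block on surname, but they sort next to each other and are still compared.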
15
Main features of RELAIS 2.1
• Input files both in text format and from database (MySQL or Oracle) tables;
• Data profiling to guide the choice of matching and blocking variables;
• Creation of the search space of candidate pairs by means of the “cross product”, “blocking” and “sorted neighborhood” methods;
• Choice of matching variables;
• Set of comparison functions (with several string distances);
• Probabilistic record linkage: estimation of the Fellegi-Sunter (F-S) model parameters via the EM algorithm;
• Deterministic record linkage: both exact and rule based;
• Reduction from N:M to 1:1 matching solution with optimal or greedy methods.
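The EM estimation of the Fellegi-Sunter parameters can be illustrated with a small sketch. It assumes binary agree/disagree comparisons and conditional independence between matching variables, which is the textbook setting rather than a description of what RELAIS does internally. The function name, the starting values and the toy comparison patterns are all hypothetical.

```python
import math

def _clamp(value, eps=1e-6):
    """Keep probabilities strictly inside (0, 1) to avoid degenerate likelihoods."""
    return min(max(value, eps), 1 - eps)

def em_fellegi_sunter(patterns, n_iter=100, p=0.1, m0=0.9, u0=0.1):
    """Estimate (p, m, u) from binary comparison vectors, one per candidate pair.

    p: proportion of matches among the pairs
    m[j]: probability of agreement on variable j among matches
    u[j]: probability of agreement on variable j among non-matches
    """
    n, k = len(patterns), len(patterns[0])
    m, u = [m0] * k, [u0] * k
    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a match.
        g = []
        for x in patterns:
            lm = p * math.prod(m[j] if x[j] else 1 - m[j] for j in range(k))
            lu = (1 - p) * math.prod(u[j] if x[j] else 1 - u[j] for j in range(k))
            g.append(lm / (lm + lu))
        # M-step: re-estimate the parameters from the weighted pairs.
        total = sum(g)
        p = _clamp(total / n)
        m = [_clamp(sum(gi * x[j] for gi, x in zip(g, patterns)) / total)
             for j in range(k)]
        u = [_clamp(sum((1 - gi) * x[j] for gi, x in zip(g, patterns)) / (n - total))
             for j in range(k)]
    return p, m, u

# Toy comparison vectors on three matching variables (1 = agree, 0 = disagree).
patterns = (
    [(1, 1, 1)] * 10 + [(1, 1, 0)] * 2 + [(1, 0, 0)] * 10
    + [(0, 1, 0)] * 8 + [(0, 0, 0)] * 70
)
p_hat, m_hat, u_hat = em_fellegi_sunter(patterns)

# Fellegi-Sunter agreement weights, log2(m / u), one per matching variable.
agreement_weights = [math.log2(mj / uj) for mj, uj in zip(m_hat, u_hat)]
```

The agreement weights derived at the end are what a probabilistic decision model would sum over the agreeing variables of a pair before comparing the total with its thresholds.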
16
A glance at RELAIS 2.1
17
RELAIS 2.2 in June 2011
• Explicit application for de-duplication
• Nested blocking methods
• Probabilities set by the user
• Improvement of GUI functionalities for output management and user interaction (manual review)
• Summary output on linkage results
• Batch execution
• Interfaces for clerical review
18
RELAIS and extra-Istat interaction
Spontaneous collaboration among NSIs (Spain, UK, Tunisia, Brazil) was favoured by the open source philosophy adopted in RELAIS
but
even in a statistical system with shared goals and regulations (ESS), different constraints (e.g. language features) may be present and could affect the outcome of the same linkage.
19
RELAIS and extra-Istat interaction
The collaboration among NSIs helped in:
• assessing the capabilities of the various functionalities included in the RELAIS toolkit, e.g. the use of the EM algorithm for record linkage purposes;
• comparing the results achieved by the software with those obtained through alternative ad hoc techniques;
• testing the performance of the methods implemented in RELAIS.
20
RELAIS and extra-Istat interaction
Istat, coordinator of the DI (Data Integration) ESSnet project, conducted on-the-job training on record linkage methods in the U.K. in January 2011.
The on-the-job training had these crucial aspects:
• the combination of the theoretical concepts of record linkage with the solutions proposed in RELAIS;
• the test of the RELAIS toolkit, during the computer session, on the specific record linkage problem faced by ONS on their own data;
• a very interactive way of conducting the lessons by the trainers.
21
Next challenges
• Censuses and post-censual surveys (Population and Agriculture): integration of population registers and auxiliary ones to focus on population register under-coverage, de-duplication also due to multi-channel answers, Post Enumeration Survey.
• Longitudinal study of regular foreign people
• Integration of ICT enterprises
22
Future research projects
• Preprocessing (character conversions, schema reconciliation, standardization, etc.);
• Modification of the probabilistic approach:
– Non-binary comparison vector
– Allowing interactions between matching variables
– Bayesian approach
• Graphical analysis on the model fitting
23
Thanks and Invitation to Cooperation
RELAIS Contacts:
Computer Scientists:
Monica Scannapieco E-mail: [email protected]
Tosco E-mail: [email protected]
Valentino E-mail: [email protected]
Statisticians:
Nicoletta Cibella E-mail: [email protected]
Tuoto E-mail: [email protected]
http://www.istat.it/strumenti/metodi/software/analisi_dati/relais/
http://www.osor.eu/projects/relais