1 Open source software: a way to enrich local solutions N. Cibella, M. Scannapieco, T. Tuoto Italian National Statistical Office, Italy

Post on 05-Jan-2016


Open source software: a way to enrich local solutions

N. Cibella, M. Scannapieco, T. Tuoto Italian National Statistical Office, Italy

Outline

• The record linkage problem and the RELAIS solution

• RELAIS, a shareable tool

• The main features of RELAIS

• International experiences in using RELAIS

The problem

Record linkage aims to accurately recognize the same real-world entity at the individual (micro) level, even when it is stored differently in sources of various types.

Examples of applications (in official statistics):
• data integration
• update and de-duplication of a source
• quality improvement of a data source
• measure of population size by capture-recapture
• estimate of the risk of re-identification in public-use microdata

Also known as: Object Identification, Record Matching, …

Possible Solutions for Record Linkage

A very fragmented picture, not only at Istat.

Different approaches to deal with record linkage:
• Exact RL
• Deterministic RL
• Probabilistic RL (Fellegi and Sunter theory)
• Bayesian RL
• Machine Learning
• Knowledge Representation
• …

No particular technique has emerged as the best solution for all cases (maybe because such a solution does not exist…)

Several software packages and tools have been proposed, based on different approaches, both free and commercial.

RELAIS, a brief history

RELAIS (REcord Linkage At Istat)

• RELAIS is a toolkit for record linkage (RL)

• Istat started developing RELAIS in 2006, and the system is now at release 2.1
  – release 2.2 is about to be published

RELAIS, a brief history

– Istat working group, with several cooperation activities and training courses on probabilistic record linkage

– Experience in data integration enriched as coordinator of the ESSnet

• Common nature of the problems and needs of NSIs in data integration projects

• Profitable experiences of cooperation with NSIs, including sharing the same software tools (NTTS 2009)

RELAIS: a Shareable Tool

• A tool designed to be shared

• It is a toolkit: possibility of adding new techniques to the system, and thus reusing solutions that are already available

• Open source implementation: Java and R as programming languages and MySQL as database management system

RELAIS: a Shareable Tool

• Reuse of existing solutions
  – Most of the comparison functions are part of the Java package StringMetrics (http://www.dcs.shef.ac.uk/~sam/stringmetrics.html)
  – The 1:1 reduction phase is implemented by making use of the R package lpSolve (http://cran.r-project.org/web/packages/lpSolve/index.html)
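
The 1:1 reduction can be read as an assignment problem: among all candidate links, keep at most one per record so that the total matching weight is as large as possible. The Python sketch below is only an illustration under stated assumptions. It is not RELAIS code and it does not call lpSolve; it replaces the linear-programming formulation with brute-force enumeration, which is only workable for tiny inputs. The weight matrix and the function names are hypothetical. A greedy variant is included for contrast.

```python
from itertools import permutations


def optimal_one_to_one(weights):
    """Brute-force optimal 1:1 assignment on a small square weight matrix.

    weights[i][j] is a (hypothetical) matching weight between record i of
    file A and record j of file B. Every permutation is tried, so this is
    only a teaching device, not a substitute for an LP solver.
    """
    n = len(weights)
    best_total, best_perm = float("-inf"), None
    for perm in permutations(range(n)):
        total = sum(weights[i][perm[i]] for i in range(n))
        if total > best_total:
            best_total, best_perm = total, perm
    return list(enumerate(best_perm)), best_total


def greedy_one_to_one(weights):
    """Greedy 1:1 reduction: repeatedly keep the heaviest still-free pair."""
    candidates = sorted(
        ((w, i, j) for i, row in enumerate(weights) for j, w in enumerate(row)),
        reverse=True,
    )
    used_a, used_b, chosen = set(), set(), []
    for w, i, j in candidates:
        if i not in used_a and j not in used_b:
            chosen.append((i, j))
            used_a.add(i)
            used_b.add(j)
    return sorted(chosen), sum(weights[i][j] for i, j in chosen)


# Made-up weights where the greedy choice is not optimal.
example_weights = [[9, 8],
                   [8, 1]]
print(optimal_one_to_one(example_weights))  # ([(0, 1), (1, 0)], 16)
print(greedy_one_to_one(example_weights))   # ([(0, 0), (1, 1)], 10)
```

The greedy method grabs the single heaviest pair first and is then forced into a poor second link, while the optimal solution gives up that pair to obtain a higher total.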

RELAIS: a Shareable Tool

• Sharing of the software
  – Both source code and executables of RELAIS have been released on:
    – Istat site: http://www.istat.it/strumenti/metodi/software/analisi_dati/relais/
    – OSOR site: http://forge.osor.eu/projects/relais/

RELAIS: a Shareable Tool

• Licensing problem
  – RELAIS was the first system that Istat decided to release as an open system, so no previous experience was available
• Analysis of available licensing solutions
• Choice of the EUPL (European Union Public Licence)
  – Consistency with the copyright law in the 27 Member States of the European Union
  – Compatibility with popular open-source software licences (e.g. GPL)

The main ideas of RELAIS

RELAIS main ideas:

- decompose the complex RL project into its constituent phases;

- dynamically choose the most appropriate technique for each phase, depending on application and data requirements, not only on the practitioner’s skill

Choose the most appropriate techniques

Build ad-hoc RL workflows

Phases and available techniques:
• Preprocessing: Normalization, UpperLowerCase
• Search Space Reduction: Blocking, SNM
• Comparison Function: Edit Distance, Jaro, Equality
• Decision Model: Probabilistic, Deterministic

[Figure: two example workflows, RecLink WF Appl1 and RecLink WF Appl2, each combining techniques chosen from the phases above]
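
The idea of building a workflow by choosing one technique per phase can be sketched in a few lines of Python. This is an illustration only, not the RELAIS architecture: the `Workflow` class, the helper functions, the field names and the sample records are all hypothetical, and each technique is reduced to its simplest form (whitespace normalization, upper-casing, blocking on one variable, equality comparison, and a deterministic "all variables agree" rule).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]
Pair = Tuple[Record, Record]


def normalize(rec: Record) -> Record:
    """Preprocessing: trim and collapse whitespace in every field."""
    return {k: " ".join(v.split()) for k, v in rec.items()}


def upper_case(rec: Record) -> Record:
    """Preprocessing: convert every field to upper case."""
    return {k: v.upper() for k, v in rec.items()}


def blocking(key: str) -> Callable[[List[Record], List[Record]], List[Pair]]:
    """Search space reduction: keep only pairs that share the blocking key."""
    def reduce(file_a: List[Record], file_b: List[Record]) -> List[Pair]:
        index: Dict[str, List[Record]] = {}
        for rb in file_b:
            index.setdefault(rb[key], []).append(rb)
        return [(ra, rb) for ra in file_a for rb in index.get(ra[key], [])]
    return reduce


def equality(x: str, y: str) -> float:
    """Comparison function: 1.0 on exact agreement, 0.0 otherwise."""
    return 1.0 if x == y else 0.0


def all_agree(vector: List[float]) -> bool:
    """Deterministic decision rule: link only if every variable agrees."""
    return all(score == 1.0 for score in vector)


@dataclass
class Workflow:
    preprocessing: List[Callable[[Record], Record]]
    search_space: Callable[[List[Record], List[Record]], List[Pair]]
    comparison: Callable[[str, str], float]
    matching_vars: List[str]
    decision: Callable[[List[float]], bool]

    def run(self, file_a: List[Record], file_b: List[Record]) -> List[Pair]:
        for step in self.preprocessing:
            file_a = [step(r) for r in file_a]
            file_b = [step(r) for r in file_b]
        links = []
        for ra, rb in self.search_space(file_a, file_b):
            vector = [self.comparison(ra[v], rb[v]) for v in self.matching_vars]
            if self.decision(vector):
                links.append((ra, rb))
        return links


# Made-up input files.
file_a = [{"id": "A1", "name": "  anna  rossi", "city": "roma"},
          {"id": "A2", "name": "luca bianchi", "city": "milano"}]
file_b = [{"id": "B1", "name": "Anna Rossi", "city": "Roma"},
          {"id": "B2", "name": "Luca Bianco", "city": "Milano"}]

example_workflow = Workflow(
    preprocessing=[normalize, upper_case],
    search_space=blocking("city"),
    comparison=equality,
    matching_vars=["name"],
    decision=all_agree,
)
links = example_workflow.run(file_a, file_b)
print([(a["id"], b["id"]) for a, b in links])  # [('A1', 'B1')]
```

Swapping one argument, for example a different comparison function or decision rule, yields a different workflow without touching the other phases, which is the point the slide makes.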

RELAIS 2.1 - May 2010

• Relational database support: input of data from Oracle or MySQL databases.

• New default input values for the parameter estimation of the probabilistic model and new definition of the candidate pairs for the optimal 1:1 reduction.

• More than one variable for search space reduction by the sorted neighborhood method.

• Minor bugs have been fixed.
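
As an illustration of the sorted neighborhood method with more than one sorting variable, here is a minimal Python sketch. It assumes a generic textbook form of the method (merge the two files, sort on a composite key, slide a fixed-size window, keep only cross-file pairs); it is not taken from the RELAIS source, and the function name, field names and records are hypothetical.

```python
def sorted_neighborhood(file_a, file_b, sort_vars, window):
    """Return candidate (record_a, record_b) pairs.

    Records from both files are merged and sorted on the tuple of values
    of sort_vars; each record is then paired with the next window - 1
    records in that order, keeping only pairs that span the two files.
    """
    tagged = [("A", r) for r in file_a] + [("B", r) for r in file_b]
    tagged.sort(key=lambda t: tuple(t[1][v] for v in sort_vars))
    pairs = []
    for i, (src_i, rec_i) in enumerate(tagged):
        for src_j, rec_j in tagged[i + 1:i + window]:
            if src_i != src_j:
                pairs.append((rec_i, rec_j) if src_i == "A" else (rec_j, rec_i))
    return pairs


# Made-up records; surname and year of birth form the composite sort key.
file_a = [{"id": "A1", "surname": "ROSSI", "year": "1980"},
          {"id": "A2", "surname": "BIANCHI", "year": "1975"},
          {"id": "A3", "surname": "VERDI", "year": "1990"}]
file_b = [{"id": "B1", "surname": "ROSSI", "year": "1981"},
          {"id": "B2", "surname": "BIANCHI", "year": "1975"},
          {"id": "B3", "surname": "ZANETTI", "year": "1960"}]

candidates = sorted_neighborhood(file_a, file_b, ["surname", "year"], window=2)
print([(a["id"], b["id"]) for a, b in candidates])
```

With a window of 2 only neighbouring records are compared, so the search space is smaller than the full cross product of 9 pairs. A wider window trades more comparisons for a lower risk of missing true matches.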

Main features of RELAIS 2.1

• Input files both in text format and from database tables (MySQL or Oracle)
• Data profiling to guide the choice of matching and blocking variables
• Creation of the search space of candidate pairs by means of the “cross product”, “blocking” and “sorted neighborhood” methods
• Choice of matching variables
• Set of comparison functions (with several string distances)
• Probabilistic record linkage: estimation of the Fellegi-Sunter (F-S) model parameters via the EM algorithm
• Deterministic record linkage: both exact and rule-based
• Reduction from N:M to 1:1 matching solution with optimal or greedy methods
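
To make the probabilistic step more concrete, the following Python sketch estimates Fellegi-Sunter parameters with the EM algorithm under common textbook assumptions: binary agree/disagree comparison vectors and conditional independence between matching variables. It is a simplified illustration, not the RELAIS implementation. The function names, starting values and pattern counts are hypothetical, with `m[j]` denoting the probability of agreement on variable j among matches, `u[j]` the same among non-matches, and `p` the proportion of matches.

```python
import math


def em_fellegi_sunter(vectors, n_iter=100, p=0.1, m_init=0.9, u_init=0.1):
    """Estimate (p, m, u) from binary comparison vectors via EM."""
    k = len(vectors[0])
    n = len(vectors)
    m, u = [m_init] * k, [u_init] * k
    eps = 1e-6  # keeps probabilities away from exactly 0 or 1

    def clamp(x):
        return min(max(x, eps), 1 - eps)

    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a match.
        g = []
        for vec in vectors:
            pm, pu = p, 1 - p
            for j, agree in enumerate(vec):
                pm *= m[j] if agree else 1 - m[j]
                pu *= u[j] if agree else 1 - u[j]
            g.append(pm / (pm + pu))
        # M-step: re-estimate the parameters from the posteriors.
        total = sum(g)
        p = clamp(total / n)
        for j in range(k):
            m[j] = clamp(sum(gi * vec[j] for gi, vec in zip(g, vectors)) / total)
            u[j] = clamp(sum((1 - gi) * vec[j] for gi, vec in zip(g, vectors)) / (n - total))
    return p, m, u


def match_weight(vec, m, u):
    """Log2 likelihood-ratio weight of one comparison vector."""
    return sum(
        math.log2(m[j] / u[j]) if agree else math.log2((1 - m[j]) / (1 - u[j]))
        for j, agree in enumerate(vec)
    )


# Made-up counts of agreement patterns on three matching variables.
patterns = {(1, 1, 1): 30, (1, 1, 0): 6, (1, 0, 1): 6, (0, 1, 1): 6,
            (1, 0, 0): 20, (0, 1, 0): 20, (0, 0, 1): 20, (0, 0, 0): 150}
vectors = [vec for vec, count in patterns.items() for _ in range(count)]

p, m, u = em_fellegi_sunter(vectors)
print("p =", round(p, 3))
print("m =", [round(x, 3) for x in m])
print("u =", [round(x, 3) for x in u])
print("weight(1,1,1) =", round(match_weight((1, 1, 1), m, u), 2))
```

Pairs are then ranked by weight, and thresholds on the weight separate links, possible links for clerical review, and non-links. Choosing those thresholds is outside the scope of this sketch.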

A glance at RELAIS 2.1

RELAIS 2.2 in June 2011

• Explicit application for de-duplication
• Nested blocking methods
• Probabilities can be set by users
• Improved GUI functionalities for output management and user interaction (manual review)
• Summary output on linkage results
• Batch execution
• Interfaces for clerical review

RELAIS and extra-Istat interaction

Spontaneous collaboration among NSIs (Spain, UK, Tunisia, Brazil) was favoured by the open source philosophy adopted in RELAIS, but even in a statistical system with shared goals and regulations (the ESS), different constraints (e.g. language features) may be present and could affect the outcome of the same linkage.

RELAIS and extra-Istat interaction

The collaboration among NSIs helped in:

• assessing the capabilities of the various functionalities included in the RELAIS toolkit, e.g. the use of the EM algorithm for record linkage purposes;
• comparing the results achieved by the software with those obtained through alternative ad hoc techniques;
• testing the performance of the methods implemented in RELAIS.

RELAIS and extra-Istat interaction

Istat, coordinator of the DI (Data Integration) ESSnet project, conducted on-the-job training on record linkage methods in the UK in January 2011.

The on-the-job training had these crucial aspects:
• the combination of the theoretical concepts of record linkage with the solutions proposed in RELAIS;
• the test of the RELAIS toolkit, during the computer session, on the specific record linkage problem faced by ONS on its own data;
• a very interactive way of conducting the lessons by the trainers.

Next challenges

• Censuses and post-censual surveys (Population and Agriculture): integration of population registers and auxiliary ones to focus on population register under-coverage, de-duplication also due to multi-channel answers, Post Enumeration Survey.

• Longitudinal study of regular foreign people

• Integration of ICT enterprises

Future research projects

• Preprocessing (character conversions, schema reconciliation, standardization, etc.)
• Modification of the probabilistic approach:
  – Non-binary comparison vectors
  – Allowing interactions between matching variables
  – Bayesian approach
• Graphical analysis of the model fitting

Thanks and Invitation to Cooperation

RELAIS Contacts:

Computer Scientists:

Monica Scannapieco, E-mail: [email protected]
Tosco, E-mail: [email protected]
Valentino, E-mail: [email protected]

Statisticians:

Nicoletta Cibella, E-mail: [email protected]
Tuoto, E-mail: [email protected]

http://www.istat.it/strumenti/metodi/software/analisi_dati/relais/
http://www.osor.eu/projects/relais