A RECORD LINKAGE PROCEDURE FOR THE MANAGEMENT AND THE ANALYSIS OF THE ITALIAN STATISTICAL BUSINESS REGISTER

Giuseppe Garofalo and Caterina Viviano, ISTAT
Adriano Paggiaro and Nicola Torelli, University of Padua

Nicola Torelli, Dipartimento di Statistica, Via S.Francesco 33, 35121 Padova, Italy, [email protected]

ABSTRACT

We consider the application of record linkage to the maintenance of the Italian statistical business register (ASIA), which has been explicitly built up by integrating different administrative sources. The main goal of the record linkage procedure we developed is to avoid duplication of units and false demographic flows of enterprises. The procedure is quite general: it allows us to control the quality of the individual characteristics that help to identify the same business, and to estimate linkage weights under fairly unrestrictive assumptions. Results of a preliminary application of the linkage procedure to an administrative data set are presented and discussed.

Key Words: Unduplication, Continuity, Enterprise demography.

1. INTRODUCTION

Planning and setting up a complete and up-to-date statistical business register must, in order to be economically feasible, exploit all the information on enterprises stored by administrative bodies.

The Italian statistical business register (ASIA) has been built up by integrating data from various administrative archives and large-scale surveys or censuses. The use of administrative sources has obvious advantages in terms of costs, time, data availability and burden on enterprises, but raises some definitional and methodological problems. More specifically, it is worth recalling that the legal population represented by administrative sources typically does not correspond to a meaningful statistical population. Unit definitions are different in administrative and statistical archives. More importantly, the inclusion or exclusion of a unit from administrative archives is frequently due to reasons (e.g. tax evasion and avoidance) not strictly connected to real changes in the economic activity of the enterprise.

The linkage of legal units to statistical ones, in order to avoid duplications and false demographic flows of enterprises, requires a complex strategy involving: (i) a clear definition of continuity criteria to understand when changes in statistical units are statistically relevant; (ii) the use of exact matching procedures to decide when data from two different records actually pertain to the same unit.

In this paper, after a concise presentation of the Italian statistical business register, the relevance of a continuity definition for an enterprise is discussed with reference to the Italian case (section 2). Section 3 presents a short review of problems connected to computer record linkage techniques, and section 4 reports a first application of a fairly general record linkage procedure to the reconstruction of enterprise evolution and the identification of non-demographic flows (spurious demography). The application is limited, at this stage, to data from the Italian fiscal register for a single municipality.

2. CRITERIA FOR ENTERPRISE IDENTIFICATION AND CONTINUITY ANALYSIS

2.1. The Italian Business Register

Since 1995 the Italian National Statistical Institute has been developing a complex project, called ASIA, to set up a statistical business register harmonised with European Community regulations. The first register was completed at the end of 1997, and its quality was checked in 1998 using data from the 1997 intermediate census, which was designed as a sample survey.

The Italian statistical business register has been built up by integrating data from administrative sources. The main archives are: the fiscal register managed by the Ministry of Finance (9,600,000 records), the registers of enterprises managed by the Chambers of Commerce (5,800,000 records), the social security register (1,700,000 records), the register of insurance against accidents at work (3,200,000 records), the electricity users register and the business telephone numbers register. Administrative data are integrated with statistical data taken from surveys carried out by Istat, usually limited to medium and large enterprises. EU regulations and Eurostat recommendations give a clear and exhaustive set of concepts and definitions useful in this specific context (for details, see Garofalo and Viviano 1998). The statistical definition of enterprise as “the smallest combination of legal units that is an organisational unit…” and the concept of statistical continuity give a clear indication that the statistical population corresponds to a subset of the legal one.

2.2. Data and Techniques to Identify Unit Relationships

In archives that are at least partially fed by administrative sources, like ASIA, the definition of statistical continuity requires the identification of dynamic relationships between administrative entities that are apparently different.

The identification of exact relationships can rely either on direct surveys or on linkage techniques based on observable attributes of legal units. Surveys have the usual problems of cost and burden and, when information on connections between legal units is taken directly from administrative files, coverage may be partial.

To be useful, linkage techniques should be designed with attention to the data actually available, and they should be based on a set of rules according to which units having common identification characteristics or relevant attributes are linked to one another. With reference to Italian data, three strategies can be used:

1. Identification of links between legal units through employee flows. This technique is based on the analysis of employee flows between two (or more) units and can be used in the analysis of mergers and demergers. Its use rests on the fact that if enterprise “b” takes over enterprise “a”, a flow of (almost) all employees from “a” towards “b” will be observed, with the implicit assumption that it is possible to discriminate between “physiological” movements, produced by employees' choices, and “spurious” ones caused by transitions between enterprises. To use this technique, individual longitudinal data on employees and cross-section data on relationships between employees and enterprises are needed. In Italy this information is available only in the social security register (Pacelli and Revelli 1995).

2. Identification of links between legal units through the analysis of enterprise ownership. If the same person appears as owner of more than one legal unit, this could imply that two legal units correspond to the same statistical unit, or that a spurious opening/closure of activities has occurred. For Italian data, this solution could be used with information from the Chambers of Commerce registers.

3. Identification of links between legal units through the analysis of "similarity of attributes". With this technique, links between legal units are reconstructed on the basis of the similarity of attributes such as enterprise name, location, economic activity, size and juridical status.

The choice among the techniques listed above depends on purposes and data availability as well as on costs and time for data processing. Whereas the technique based on employee flows has given good results in the analysis of spurious demography for medium and large enterprises, it cannot be used for smaller enterprises and for enterprises without employees, which make up the largest part of the ASIA population. Exact matching techniques can be particularly useful in case 3, and this is the route pursued in the rest of the paper.
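As an illustration of strategy 1, the following is a minimal sketch of how a takeover might be flagged from employee flows. It is not the procedure of Pacelli and Revelli (1995): the function name `takeover_candidates`, the input layout (one employer per employee in each of two periods) and the 0.8 cut-off are assumptions made only for this example.

```python
from collections import Counter


def takeover_candidates(employer_t0, employer_t1, min_share=0.8):
    """Flag origin units whose staff moved almost entirely to one destination.

    employer_t0 / employer_t1 map an employee id to the employing unit in two
    consecutive periods. min_share is a hypothetical cut-off, not a value
    taken from the paper.
    """
    staff = Counter(employer_t0.values())  # head count of each origin unit
    flows = Counter()
    for employee, origin in employer_t0.items():
        destination = employer_t1.get(employee)
        if destination is not None and destination != origin:
            flows[(origin, destination)] += 1

    candidates = {}
    for (origin, destination), moved in flows.items():
        share = moved / staff[origin]
        if share >= min_share:
            candidates[origin] = (destination, share)
    return candidates


# Made-up data: all four employees of "a" move to "b"; only one of two leaves "c".
t0 = {"e1": "a", "e2": "a", "e3": "a", "e4": "a", "e5": "c", "e6": "c"}
t1 = {"e1": "b", "e2": "b", "e3": "b", "e4": "b", "e5": "c", "e6": "d"}
print(takeover_candidates(t0, t1))
```

A flow of all employees from “a” to “b” is flagged as a possible spurious transition, while the single move out of “c” is treated as a physiological movement. The sketch also shows why the technique cannot help for enterprises without employees: they have no head count to compare.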

2.3. The Continuity Criterion

The analysis of links and relationships among units over time is an important issue both in a static context, when data on enterprises are updated with delay, and in a dynamic context, for micro-economic research. Enterprise demography traditionally plays a major role in competition and production theories, in the analysis of entrepreneurship supply, and as a tool in job creation and job turnover studies. Data used to explore such themes are greatly affected by false flows of units (spurious demography).

The definition of a reasonable continuity criterion, that is, the condition under which different units are deemed to be the same over a given time period, is crucial. An enterprise is recognised by a specific set of resources, functions and products, and possible changes in their combinations should be considered when defining a continuity concept.

The continuity concept proposed by Struijs and Willeboordse (1995) has been widely accepted at the international level. Eurostat suggests some practical criteria based on combinations of changes occurring in some characteristics recorded in business registers. According to this concept, an enterprise is considered to be the same over time if it changes without any significant change in its identity in terms of its set of production factors (employment, machines, raw materials, capital management, buildings, etc.).

Measuring the continuity of all production factors and weighting them can be quite difficult and costly. For this reason Eurostat suggests, as a practical criterion to identify the enterprise, using specific characteristics available in the register that can be assumed to be correlated with the most important production factors. The suggested empirical rule is that an enterprise is not considered to be the same if at least two of the following three characteristics change:

a) Legal unit controlling the enterprise: continuity of management of the enterprise may be assumed to be positively correlated with continuity in the control of the legal unit.

b) Economic activities carried out by the enterprise: continuity of the four-digit NACE Rev.1 code of the main activity may be assumed to be positively correlated with the continuity of production factors such as employment, machines and equipment.

c) Locations where activities are carried out: continuity of locations is of course closely linked to the continuity of land and buildings used by the enterprise.

In the suggested rules an element of discontinuity is introduced when changes are “of great extent” and quick. The concrete applicability of such rules must be evaluated according to the economic structure in which they have to operate, because of the peculiarities of each country. For instance, in some domains of study, such as the demography of very small enterprises, it does not make sense to separate the juridical subject (the entrepreneur) from the statistical subject (the enterprise). In such cases a new controlling legal unit becomes a factor producing discontinuity even if it is the only characteristic to change (Garofalo and Viviano 1999).
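The empirical rule above can be written down in a few lines. The sketch below is only illustrative: the field names (`legal_unit`, `nace4`, `location`) and the flag `legal_unit_decisive`, which mimics the variant for very small enterprises, are assumptions introduced for the example and not part of the Eurostat text.

```python
def is_same_enterprise(before, after, legal_unit_decisive=False):
    """Apply the 'at least two of three changes' continuity rule.

    before / after are dicts describing one unit at two points in time.
    With legal_unit_decisive=True a change of controlling legal unit alone
    breaks continuity (the variant discussed for very small enterprises).
    """
    characteristics = ("legal_unit", "nace4", "location")
    changed = [key for key in characteristics if before[key] != after[key]]
    if legal_unit_decisive and "legal_unit" in changed:
        return False
    # Continuity is kept when at most one of the three characteristics changes.
    return len(changed) <= 1


# Made-up register entries for one unit.
before = {"legal_unit": "LU1", "nace4": "1581", "location": "A"}
moved = dict(before, location="B")
sold_and_moved = dict(before, legal_unit="LU2", location="B")
new_owner = dict(before, legal_unit="LU2")

print(is_same_enterprise(before, moved))
print(is_same_enterprise(before, sold_and_moved))
print(is_same_enterprise(before, new_owner, legal_unit_decisive=True))
```

The rule as coded ignores the extent and speed of the changes, which the text indicates should also be taken into account.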

3. RECORD LINKAGE PROCEDURE FOR INTEGRATING DATA FROM BUSINESS REGISTERS

Business data often need statistical methods to decide whether two records contain data on the same unit, and the choice is based on information coming from common identifiers such as name, address or individual codes.

The most general formulation of a record linkage procedure dates back to the seminal paper by Fellegi and Sunter (1969). An excellent review of more recent developments in this field, with reference to their possible use in managing business registers, is in Winkler (1995). For our purposes, a short review of the topic is given here, in order to describe more efficiently the main solutions and criteria adopted in our empirical application.

Let A and B be two files containing records a and b respectively. The set $A \times B = \{(a, b);\ a \in A,\ b \in B\}$ can be partitioned into a set M of pairs representing the same business entity and a set U of pairs representing different entities. If A ≡ B, we have a problem of unduplication and the set M contains all the duplicates in the original file.

The size of the files usually considered often does not allow all pairs of records to be compared explicitly, and usually only pairs with some common characteristics are actually compared, by using blocking criteria. A record linkage procedure is then characterised by a decision rule that assigns all compared pairs either to M or to U (for some pairs no decision is taken, and a set of possible links, usually left to clerical review, is defined).

The link/non-link decision is based on a matching weight, which is assigned to each compared pair according to the result of a comparison among some matching variables present in both records. A crucial choice is the definition of agreement in those comparisons, ranging from a simple agreement-disagreement dichotomy to a complex definition taking into account the specific values of the variables. The results of the comparison can be collected in a vector $\gamma^j$ whose i-th component gives the agreement pattern on the i-th variable for the j-th of the N pairs resulting from the blocking criterion:

$$\gamma^j = [\gamma_1^j, \gamma_2^j, \ldots, \gamma_i^j, \ldots, \gamma_I^j], \qquad j = 1, \ldots, N.$$

A weight w is then associated with every possible outcome $\gamma$, leading to a decision rule depending on two thresholds: if $w_j \ge K_u$ the pair is declared matched, if $w_j < K_l$ the pair is declared non-matched, while if $K_l \le w_j < K_u$ the decision is deferred to further analysis. According to this rule, two kinds of error can occur: (a) false matches, i.e. non-matched pairs erroneously assigned to M; (b) false non-matches, i.e. matched pairs which are assigned to U (or left outside the defined comparison blocks). Estimation of false match and false non-match rates is important in order to define the specific choices in the procedure, such as the blocking criteria and the thresholds in the decision rule.

The crucial step in implementing a record linkage procedure is the estimation of matching weights. A probabilistic procedure can be used to estimate the value of the latent variable G (G=1 if a pair is in M, G=0 if the pair is in U) given some information coming from the comparison on the matching variables.

The estimation of the weights is usually related to the original formulation by Fellegi and Sunter (1969), with a ratio of probabilities of the form:

$$w_j = \ln \frac{P(\gamma^j \mid M)}{P(\gamma^j \mid U)} = \ln \frac{m_j}{u_j}.$$

Fellegi and Sunter showed that these weights lead to an optimal rule, in that for any pair of fixed thresholds the clerical review region is minimised.
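The weight and the two-threshold decision rule can be sketched as follows. The function names and the numerical values of the probabilities and thresholds are assumptions chosen for illustration; the sketch also assumes strictly positive m and u probabilities so that the logarithm is defined.

```python
import math


def match_weight(m_j, u_j):
    """Fellegi-Sunter weight w_j = ln(m_j / u_j) for one agreement pattern."""
    return math.log(m_j / u_j)


def decide(w_j, k_lower, k_upper):
    """Three-way decision: match, non-match, or clerical review in between."""
    if w_j >= k_upper:
        return "match"
    if w_j < k_lower:
        return "non-match"
    return "clerical review"


# Hypothetical thresholds K_l = -2 and K_u = 3, and made-up (m_j, u_j) values.
for m_j, u_j in [(0.9, 0.01), (0.1, 0.5), (0.01, 0.9)]:
    w_j = match_weight(m_j, u_j)
    print(round(w_j, 3), decide(w_j, k_lower=-2.0, k_upper=3.0))
```

Moving the two thresholds closer together shrinks the clerical review region at the price of higher false match and false non-match rates.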


Estimation is difficult in that we usually cannot know which pairs belong to M and U. A possible solution comes from the use of iterative methods like the EM algorithm, by which we can use “imputed” samples to estimate the parameters of interest. Let p be the unknown proportion of matched pairs; the likelihood function is:

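$$L(m, u, p; \gamma) = \prod_{j=1}^{N} \left[ P(\gamma^j \mid M)\, P(M) \right]^{g_j} \left[ P(\gamma^j \mid U)\, P(U) \right]^{1-g_j} = \prod_{j=1}^{N} \left[ p\, m_j \right]^{g_j} \left[ (1-p)\, u_j \right]^{1-g_j}.$$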

Given the values of m, u and p, the E step of the algorithm consists in estimating the latent variable g for each pair, given by its expected value (with a direct logit link to the w weights defined by Fellegi and Sunter):

$$E(g_j \mid m_j, u_j, p) = \frac{p\, m_j}{p\, m_j + (1-p)\, u_j} = \frac{m_j / u_j}{m_j / u_j + (1-p)/p} = \frac{e^{w_j}}{e^{w_j} + (1-p)/p}.$$

In the M step, the likelihood is maximised on m and p given G. Following Jaro (1989), the probabilities u are estimated outside the iterative algorithm on a randomly chosen sample of pairs. In the estimation of the m and u probabilities, independence is often assumed among the probabilities $m_i$ and $u_i$ of observing the single outcome $\gamma_i$ in the comparison between pairs in M and U respectively. Note that in many applications this assumption cannot be considered realistic.
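The E and M steps can be sketched in a few lines for the simplest case. This is a minimal illustration under stated assumptions, not the procedure used in the paper: comparisons are binary (agree/disagree), the components are treated as conditionally independent, and the u probabilities are held fixed as in the Jaro (1989) approach. The function names and the synthetic counts are made up for the example.

```python
def pattern_prob(gamma, probs):
    """P(gamma | class) for binary comparisons under conditional independence."""
    out = 1.0
    for agree, q in zip(gamma, probs):
        out *= q if agree else 1.0 - q
    return out


def em_fit(patterns, counts, u, m_start, p_start, n_iter=200, eps=1e-6):
    """Estimate m and p by EM, with the u probabilities held fixed.

    patterns: distinct agreement vectors; counts: number of pairs per pattern.
    Returns the estimates and the posteriors from the last E step.
    """
    m, p = list(m_start), p_start
    n_pairs = sum(counts)
    for _ in range(n_iter):
        # E step: g_j = p*m_j / (p*m_j + (1-p)*u_j) for each pattern.
        g = []
        for gamma in patterns:
            pm = p * pattern_prob(gamma, m)
            pu = (1.0 - p) * pattern_prob(gamma, u)
            g.append(pm / (pm + pu))
        # M step: update p and each m_i as g-weighted relative frequencies.
        matched = sum(c * gj for c, gj in zip(counts, g))
        p = matched / n_pairs
        m = []
        for i in range(len(u)):
            agree = sum(c * gj * gamma[i] for c, gj, gamma in zip(counts, g, patterns))
            # Clamp away from 0 and 1 to keep later E steps well defined.
            m.append(min(1.0 - eps, max(eps, agree / matched)))
    return m, p, g


# Synthetic counts, roughly what p=0.1, m_i=0.9, u_i=0.1 would give for 1000 pairs.
patterns = [(1, 1, 1), (1, 1, 0), (1, 0, 1), (0, 1, 1),
            (1, 0, 0), (0, 1, 0), (0, 0, 1), (0, 0, 0)]
counts = [74, 16, 16, 16, 74, 74, 74, 656]
m_hat, p_hat, g_hat = em_fit(patterns, counts, u=[0.1, 0.1, 0.1],
                             m_start=[0.8, 0.8, 0.8], p_start=0.2)
print(m_hat, p_hat)
```

Relaxing the independence assumption amounts to replacing `pattern_prob` with probabilities defined on joint outcomes of the dependent variables, at the cost of more parameters to estimate.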

4. DATA AND RESULTS

4.1. Data

An exploratory application of the record linkage procedure has been carried out on a small data set referring to the municipality of Pesaro. The data set contains 9420 records collecting some administrative information on enterprises from the fiscal register in 1997. The archive contains some identifying information that can be used for an unduplication analysis aimed at identifying records that pertain to the same units and at allowing a proper analysis of enterprise demography.

The identifying variables are:

1. An alphanumeric code (CF: codice fiscale) which uniquely identifies enterprises. The form of the code depends on whether it is associated with individual enterprises (alphanumeric 16-character code) or with partnerships and companies (numeric 11-digit code).

2. Full NAME of the enterprise, as a string of up to 40 characters (ragione sociale). The name may consist of one first name and surname (individual enterprise), many names and surnames, or other denominations often related to the actual economic activity of the enterprise. Moreover, other words are present in the name of non-individual enterprises, specifying, for instance, the type of "company".

3. Address, in a 5-digit code (ADD).

4. Economic activity of the enterprise (ATECO), in a 5-digit code (the NACE Rev.1 code plus a fifth digit).

4.2. Agreement Definition and Estimation Strategy

Even with a relatively small data set, a direct comparison between all the possible pairs of records is unfeasible. In this application a blocking strategy has been used that reduces the number of comparisons and is related to the definition of continuity already outlined in the paper. The only pairs of records considered are those with full agreement on at least one of three variables: name of the enterprise, address or economic activity. Moreover, pairs of records which could never be associated with one another are excluded from the analysis: to avoid the wrong linkage of spurious homonyms, no pair is chosen if both records pertain to individual enterprises, with the only exception of pairs with perfect agreement on the CF code. The number of compared pairs following this complex blocking structure is about 399,000.

A first exploratory step was carried out to identify the best way to define different levels of agreement for every variable. A key variable for linkage is NAME, and in order to use it we took into account the results of a first application of record linkage techniques to the same data set (Garofalo and Viviano 1998). Different levels of agreement are defined for the full string of NAME, taking into account the number of words in the string itself and the number of agreements between the single words. After some sensitivity analysis, the final choice was a 6-level definition, also depending on the kind of enterprise. The definition of agreement for the remaining variables is as follows:

- An indicator of agreement on CF with three levels (perfect agreement, disagreement between 11-digit codes, disagreement between an 11-digit and a 16-character code).
- A dichotomous indicator of agreement-disagreement for ADD.
- Three levels of agreement for ATECO (perfect agreement, different but “similar” economic activities, completely different economic activities).

The choice of which pairs are to be considered matched is finally based on the probabilistic procedure described in the preceding section. Specifically, the estimation strategy we adopted is the following:

- The u probabilities are estimated outside the iterative algorithm, by building a data set of randomly matched pairs of records.
- The m probabilities are estimated by the EM algorithm, with a sensitivity analysis using different starting values.
- Using a-priori information, some probabilities are forced to 1 or 0 (as an example, perfect agreement on CF leads to a linked pair with probability 1).
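The blocking strategy described in this section can be sketched by indexing the records on each blocking variable instead of enumerating all pairs. This is one possible reading of the rule, not the authors' code: a pair is kept if it agrees fully on name, address or economic activity, and a pair of individual enterprises is kept only if it also agrees on CF. The record layout, the helper names and the example records are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def is_individual(cf):
    """Individual enterprises carry a 16-character CF, companies an 11-digit one."""
    return len(cf) == 16


def candidate_pairs(records):
    """Return index pairs sharing name, address or activity, minus homonym risks."""
    pairs = set()
    for key in ("name", "add", "ateco"):
        # Index the records by the value of the blocking variable.
        blocks = defaultdict(list)
        for idx, rec in enumerate(records):
            blocks[rec[key]].append(idx)
        for members in blocks.values():
            for i, j in combinations(members, 2):
                cf_i, cf_j = records[i]["cf"], records[j]["cf"]
                both_individual = is_individual(cf_i) and is_individual(cf_j)
                # Two individual enterprises are compared only if CF agrees.
                if both_individual and cf_i != cf_j:
                    continue
                pairs.add((i, j))
    return pairs


# Made-up records; codes and names are invented for the example.
records = [
    {"cf": "01234567890", "name": "ROSSI SRL", "add": "61100", "ateco": "15810"},
    {"cf": "09876543210", "name": "ROSSI MARIO E C SNC", "add": "61100", "ateco": "52110"},
    {"cf": "RSSMRA60A01G479X", "name": "ROSSI MARIO", "add": "61122", "ateco": "52110"},
    {"cf": "BNCLGU55B02G479Y", "name": "BIANCHI LUIGI", "add": "61122", "ateco": "52110"},
]
print(sorted(candidate_pairs(records)))
```

In the example the two individual enterprises share address and activity but are not compared, because their CF codes differ; each of them is still compared with the partnership that shares their activity code.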

4.3. Some Empirical Results

In the previous application by Garofalo and Viviano (1998), the linkage procedure adopted there led to the identification of a high number of “large” clusters containing more than 10 enterprises. On the basis of an analytical review of the single enterprises in the clusters, it appeared quite reasonable that records referring to the same enterprise (which had changed name or street) were in the same cluster. But it was impractical (if not unfeasible) to identify them by a clerical review of all large clusters. In their procedure the estimate of matching weights was obtained by assuming independence among the components $u_i$ and $m_i$, but this assumption does not seem very realistic for our problem. To give a simple example, note that the probability of agreement on address, given that the two records refer to different units, could be related to the probability of agreement on economic activity (very often enterprises with the same economic activity are located in the same area). This lack of independence seems to be a problem for most of the comparison variables used here.

The implemented linkage procedure makes it very easy to overcome this problem and to estimate the probabilities involved while relaxing the independence assumption. In Table 1 we compare results obtained in estimating both the u and the m probabilities by using different levels of dependence among the variables involved in the choice of matching the pairs of records.

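Table 1. Estimated matching probabilities for some selected vectors of agreement under different dependence assumptions

Degree of agreement (first four columns) and estimated matching probabilities (last five columns, labelled by the dependence assumed for M and for U):

| CF | ADD | ATECO | NAME | M: D, U: D | M: I, U: D | M: PD, U: PD | M: I, U: PD | M: I, U: I |
|----|-----|-------|------|------------|------------|--------------|-------------|------------|
| N | T | T | T | 0.99996 | 0.99995 | 0.99326 | 0 | 0 |
| P | N | T | T | 0 | 0.99976 | 0 | 0 | 0.99974 |
| N | T | N | T | 0.89928 | 0.67265 | 0.95956 | 0 | 0 |
| P | T | P | P | 0.99803 | 0.82384 | 0.99978 | 0.99976 | 0.99980 |
| P | T | P | N | 0.51923 | 0 | 0.79341 | 0.76517 | 0.84877 |
| P | T | T | N | 0.77587 | 0 | 0.84532 | 0.85589 | 0.99805 |
| P | N | T | N | 0 | 0 | 0 | 0 | 0.84877 |
| P | N | T | P | 0 | 0.02997 | 0 | 0 | 0.99976 |

- Zeros in the table indicate values < 10^-5
- Degree of agreement: T=Total; P=Partial; N=Null
- Estimated matching probabilities: M=choices in estimating the m probabilities, U=choices in estimating the u probabilities
- Kind of dependence: D=full dependence among CF, NAME and ATECO; PD=partial dependence (only between CF and NAME); I=full independence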

There is a clear indication of the sensitivity of the procedure to alternative assumptions of independence. In the table, the first four rows refer to matching probabilities for vectors of agreement estimated for pairs that we know should be matched. The remaining rows refer to cases where the observed agreement should lead to a non-match. Assuming complete independence gives, as expected, unreasonable results. It should be added that a strategy with complete independence on the u probabilities should be excluded, because dependence is partly forced by the definition of agreement for NAME, which can differ according to the result of the comparison on CF. Nonetheless, this strategy is similar to that adopted in Garofalo and Viviano (1998), and it is reported here to compare our results with those obtained in their application, where the size of clusters of enterprises that could be the same was very large.


Note that dependence among variables, at least for the u probabilities, should be assumed in order to obtain a reasonable separation between probabilities of link and non-link for records belonging to M and to U respectively.

Results reported in Table 2 show the effect of different assumptions of dependence on the number and size of clusters of records that are to be considered as related to the same enterprise according to the results of the matching procedure (with the threshold for declaring a pair matched fixed at 0.9).

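Table 2. Clusters of enterprises obtained under different assumptions on independence

Number of clusters by cluster size, for each dependence assumption on M and on U:

| Cluster size | M: D, U: D | M: I, U: D | M: PD, U: PD | M: I, U: PD | M: I, U: I |
|--------------|------------|------------|--------------|-------------|------------|
| 1 | 9047 | 9023 | 8967 | 9153 | 8100 |
| 2 | 177 | 179 | 208 | 132 | 320 |
| 3 | 5 | 10 | 11 | 1 | 67 |
| 4 | 1 | 1 | 1 | | 26 |
| 5 | | 1 | | | 18 |
| 6-9 | | | | | 19 |
| 10-19 | | | | | 9 |
| >20 | | | | | 1 |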
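A cluster-size distribution like the one in Table 2 can be obtained from the estimated pair probabilities by linking every pair at or above the threshold and counting the resulting groups. The sketch below assumes that clusters are the connected components of the matched pairs, which the paper does not state explicitly; the function name and the example probabilities are made up.

```python
from collections import Counter


def cluster_sizes(n_records, pair_probs, threshold=0.9):
    """Count clusters by size, linking pairs whose probability >= threshold.

    pair_probs maps (i, j) record-index pairs to estimated matching
    probabilities. Clusters are taken to be connected components (an assumption).
    """
    parent = list(range(n_records))

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), prob in pair_probs.items():
        if prob >= threshold:
            parent[find(i)] = find(j)

    members_per_root = Counter(find(x) for x in range(n_records))
    return Counter(members_per_root.values())  # cluster size -> number of clusters


# Six made-up records: 0-1-2 are chained together, the 3-4 pair is too weak.
probs = {(0, 1): 0.95, (1, 2): 0.99, (3, 4): 0.5}
sizes = cluster_sizes(6, probs)
print(dict(sizes))
```

Because linking is transitive under this reading, a few spurious matches are enough to merge many records into one large cluster, which is consistent with the very large clusters observed under the full independence assumption.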

5. CONCLUDING REMARKS

The case study presented here is a first step in a larger project aimed at exploring tools that help in the various phases of building and maintaining Italian statistical business registers and that allow the development of information for analytical purposes such as the study of enterprise demography. More work is required (i) for the re-definition of relevant concepts and conventions to fit the Italian context; (ii) for the implementation, within the record linkage procedure, of specific tools for dealing with data from business archives.

The information usually gathered in different administrative and statistical sources can be organised to help the record linkage (unduplication) process. The main step is in exploiting the potential of the information contained in the crucial variable NAME. This can be done by parsing the string and standardising single words (Winkler 1995). A first application of this idea is in Garofalo and Viviano (1998).

The size of the archives involved in the process could hamper any naive application of record linkage procedures, and the use of sensible blocking criteria cannot be avoided. The link between continuity rules for enterprises and blocking criteria is crucial and needs to be revised when working with larger data sets.

The present version of the record linkage procedure is already quite general, in that it allows various options at the estimation stage and for the output of the results. In future versions, new tools will be added, such as alternative techniques for estimating the false match rate and options for choosing the threshold for link/non-link decisions.

6. REFERENCES

Fellegi, I.P. and A.B. Sunter (1969), “A Theory for Record Linkage”, Journal of the American Statistical Association, 64, pp. 1183-1210.

Garofalo, G. and C. Viviano (1998), “The Problem of Links between Legal Units: Statistical Techniques for the Enterprise Identification and the Analysis of Continuity”, paper presented at the 12th Roundtable on Business Survey Frames, Helsinki, September 1998.

Garofalo, G. and C. Viviano (1999), “Continuity Rules Re-delineation in the Italian Context”, paper presented at the 13th Roundtable on Business Survey Frames, Paris, September 1999.

Jaro, M.A. (1989), “Advances in Record Linkage Methodology as Applied to Matching the 1985 Census of Tampa, Florida”, Journal of the American Statistical Association, 84, pp. 414-420.

Pacelli, L. and R. Revelli (1995), “Trasformazioni societarie, scorpori, fusioni: un metodo di individuazione mediante dati di fonte INPS”, in S. Biffignandi and M. Martini (eds.), Il registro statistico delle imprese, Milan: Franco Angeli.

Struijs, P. and A. Willeboordse (1995), “Changes in Populations of Statistical Units”, in B.G. Cox et al. (eds.), Business Survey Methods, New York: J. Wiley.

Winkler, W.E. (1995), “Matching and Record Linkage”, in B.G. Cox et al. (eds.), Business Survey Methods, New York: J. Wiley.


INFORMATIVE SYSTEMS FOR THE MULTIDIMENSIONAL MANAGEMENT OF BUSINESS DATA

Lucia Buzzigoli, Cristina Martelli and Alessandro Viviani
Lucia Buzzigoli, Dipartimento di Statistica "G. Parenti", Viale Morgagni 59, 50134 Firenze, Italy

[email protected]

ABSTRACT

In business studies the units of analysis are identified on the basis of the problem of interest and of the specific rules defining the target population (e.g. enterprises, establishments, etc.) and, as a consequence, the events of birth, death and state transition of the units may be defined. In this respect, there is a need for an adequate informative system which can override the concept of register as a list of units with attributes and which can be used at different levels. The purpose of this paper is to discuss theoretically the importance of an informative system for the management of business information in a multidimensional framework, which can serve as a flexible instrument of economic analysis.

Keywords: business panels, business demography, statistical units, collective subjects, statistical database

1. INTRODUCTION

Statistical analyses involving business entities are more complex than other kinds of study for several reasons, many of which are strictly related to definitional aspects.

a) The statistical units. First of all, there are various possible statistical units of reference. For instance, EC Council Regulation 696/93 defines eight different statistical units of the production system: the enterprise, the institutional unit, the enterprise group, the kind-of-activity unit, the unit of homogeneous production, the local unit, the local kind-of-activity unit and the local unit of homogeneous production. Statistics Canada uses four different units for business surveys: the enterprise, the company, the establishment and the location (Lavallée, 1998). In any case, the definitions and the classifications are based on the one hand on administrative and legal criteria, which form the substantial prerequisite for recognizing and identifying the units, and on the other hand on activity criteria, which are much more closely linked to economic issues.

In empirical analysis, the choice among the various available units depends on the type of economic problem at hand (Baldwin and Gorecki, 1990): while for industrial economics the best choice could be the enterprise, for job-creation studies the establishment is preferable.

Another problem is that of the threshold values used to categorize units. In longitudinal panels the units are classified according to conventional criteria, which are sometimes decided subjectively: for example, a turnover increase of more than 70% is labeled 'rapidly increased' in the Finnish longitudinal database on the basis of a judgmental approach.

The problem of defining the units of interest assumes a particular relevance in this period of transition towards a unified European statistical system: the need for international comparability of statistics calls for the harmonization of definitions, concepts, norms and standards (Struijs, 1996).

b) The hierarchical aspect. These classifications and definitions often contain hierarchical structures; for instance, Statistics Canada's units actually form four hierarchical levels, from the highest (the enterprise) to the lowest (the location). This sort of 'classification of classifications' reveals a complex frame in which the different statistical units can interact in various ways and can have different evolutionary behaviors.

c) The dynamic aspect. The interest in evolutionary behaviors is obvious, because the study of various economic issues (job turnover, productivity comparison, etc.) involves dynamic aspects of business analysis. The increasing interest in business panel surveys has recently renewed the debate on the definition of birth and death of business entities. The definition matters, in the sense that it can determine the final result of the analysis: definitions can be based exclusively on administrative criteria (changes in the legal form or in the ownership) or on economic criteria, depending on what kind of information we actually use to identify business entry and exit in the data set (Stamas et al. 1997). For instance, a usual criterion to discriminate between administrative and 'economic' births is to use the Employment Statistics (e.g. Mustaniemi, 1996; Egmose, 1998).

Therefore, any definition is arbitrary, but each can highlight a different aspect of analytical interest (Baldwin et al. 1990). The consequence is that longitudinal studies require information that allows for alternative definitions of business births and deaths. This could be of great value, for instance, for organizational theorists, who could have a richer informative basis for their analysis (Brüderl and Schüssler, 1990).

At first sight this kind of problem could seem trivial, but actually the need for flexibility involves the critical review of all the data processing and management phases (McGuckin, 1991).

Solving this kind of problem means that the informative system to be designed must develop a very specific concept of statistical unit that can be translated into a conceptual model for data management requirements (Willeboordse and Struijs, 1999). This is important in order to face the problem systematically and 'scientifically', in the sense of obtaining an experimental environment where the rules can be changed according to the specific statistical interests.

2. SOME MODELLING CONSIDERATIONS

The discussion developed in the previous section about the problems connected with the statistical analysis of business entities has pointed out the need for an adequate informative basis: the data must be organised so as to allow the management of different definition protocols for the statistical units, which are often characterised by inherent hierarchical structures. The source modelling, moreover, must ensure the description of the dynamic characteristics of the business entities under study and, finally, it must be the basis of an information system that allows a multiplicity of data views and organisational modalities.

In this sense, the organisation of a proper information system for the statistical analysis of business data falls into the more general problem of statistical database organisation, even if it shows some peculiar aspects.

As is well known, a correct modelling and structuring of statistical sources improves the informative value of the data (Hinterberger, 1992; Michalewicz, 1990).

The modelling of statistical sources is very peculiar and usually differs deeply from the design of business and operational databases; statistical databases are characterised by a higher degree of complexity and by a richer semantics, oriented to full data handling in order to achieve the best data organisation for statistical analysis.

Another important point that characterises statistical databases is the need to assure a proper description of the temporal dimension of information (Tansel et al., 1993; Martelli, 1996).

Statistical sources, contrary to those used in production, are very often substantially historical; in non-statistical applications it is usually important to know only the most up-to-date level of information, and frequently the user is not at all interested in knowing the history of the values assumed by a certain variable. This is why most conventional databases represent reality only at the current time; the current contents of a database can be viewed, in this sense, as a snapshot of the real world at a single instant of time. As the real world changes, new values are incorporated into the database by replacing the old values.

Statisticians have different needs: the proper management of the temporal component of a source is of fundamental importance for them, as all social, economic and productive activities, which traditionally are the statisticians’ object of research and analysis, always occur in a temporal context. In this regard, it is interesting to recall all the fields focused on the study of evolutionary behaviours or on the evaluation of links between causes and effects; another important point, next to the problems of time series archiving optimisation, is linked to the preparation of integrated sources in which the aim is to bring together, in a context that is temporally homogeneous and therefore usable as a unicum, contexts described with different temporal metrics.

All these general points become particularly critical when we want to organise, as a proper and rich statistical source, the archives involving business entities data.
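
As a minimal sketch of the contrast between a snapshot and a historical source, and not a description of any existing system, the fragment below keeps the whole history of a variable instead of overwriting it. The names Version, HistoricalAttribute, set_value and as_of, as well as the employee counts, are hypothetical and introduced only for illustration; updates are assumed to arrive in chronological order.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Any, List, Optional


@dataclass
class Version:
    value: Any
    valid_from: date
    valid_to: Optional[date] = None  # None means "still current"


@dataclass
class HistoricalAttribute:
    """Keeps every value a variable has assumed, not only the latest one."""
    versions: List[Version] = field(default_factory=list)

    def set_value(self, value: Any, when: date) -> None:
        # Assumption: updates arrive in chronological order.
        if self.versions:
            self.versions[-1].valid_to = when  # close the previous version
        self.versions.append(Version(value, when))

    def current(self) -> Any:
        # What a snapshot database would hold: the latest value only.
        return self.versions[-1].value if self.versions else None

    def as_of(self, when: date) -> Any:
        # Value valid at a given date; None if nothing was recorded yet.
        for v in self.versions:
            if v.valid_from <= when and (v.valid_to is None or when < v.valid_to):
                return v.value
        return None


# Invented example: number of employees of one enterprise.
employees = HistoricalAttribute()
employees.set_value(12, date(1995, 1, 1))
employees.set_value(20, date(1997, 6, 1))
```

A conventional database would answer only employees.current(); a statistical source also needs queries such as employees.as_of(date(1996, 3, 1)).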

3. COLLECTIVE INDIVIDUALS

As previously discussed, in this context we have to manage a further cause of complexity, given by the presence, in the domain of the informative structure, of collective individuals.

In this paper, by the term collective individuals we mean subjects that are identifiable on the basis of classification rules and that are composed of individuals, stricto sensu, or of other collective individuals defined at a lower level of aggregation. Examples of collective subjects are families (which are composed of persons), social classes (composed of families and persons), firms, and so on.

It is very important to guarantee a correct description and an accurate representation of the collective individuals in the informative structure of the sources because, analogously to individuals in the usual meaning of the term, they are, for instance, subjects of evolutionary behaviours, or they can be differently characterised for making comparisons among different economic, spatial or historical environments.

In particular, in all the evolutionary sciences it is normal to conceptualise collective individuals (we could say species) as collections of individuals, assumed to be delimited and discrete evolving units, subject to birth, ageing and disappearance over the course of time; in this sense the process of becoming a species (that is, the one in which a new species emerges) is studied as having characters that are autonomous from, and often divergent from, the process of evolution within a species (Martelli, 1999).

An example may clarify this statement: let us imagine a situation in which the survival probability of a business entity at the lower aggregation level is decreasing, while the same indicator at a higher aggregation level is increasing, just as a consequence of the dynamism and flexibility observed at the bottom of the hierarchy. To properly observe this phenomenon, an information system is needed in which the two different units of analysis can be identified, characterised and described in their dynamic aspects.

The collective individuals may therefore be considered as “second level individuals”, a sort of higher-order systems that integrate and connect individuals of the “first level” and that are the protagonists of the evolutionary process of the species: autonomous historical entities, with their own life trajectories and their own evolutionary patterns, which could even conflict with those of the individuals that belong to them (Eldredge and Gould, 1990; Ceruti, 1995).

But how can we model a source that respects all these informative requirements?

The conceptual modelling of the statistical source is the first thing to do when structuring a source (Batini et al., 1992).
As is well known, the conceptual model is a formal, high-level description of the reality about which we want to collect information; this description, usually expressed in terms of entities and relationships, is independent of the technical characteristics of the information management system and of all the aspects associated with the practical implementation of the database, and it relies only on the informative structure of the source.

The modelling starts with an abstraction process, that is, the mental mechanism that we adopt when we spotlight some characteristics of reality in order to define a new unifying object characterised by the properties considered. These new objects are what we define as entities, i.e. the informative actors of the system, the information carriers.

While it is not very difficult to identify entities when working with individuals in the usual sense of the term, this operation is more difficult when working with collective subjects. When individuals are persons, for instance, in several national contexts they are identified as instances of entities at the very moment of their birth, when they are assigned a unique and unambiguous personal code.

When working with collective individuals the situation is different: the identification of a certain collective subject depends on the rules that have been adopted for defining it; for persons we don’t need any definition rule, but for collective subjects we must know what we mean when we say “family” or “firm”.

4. CLASSIFICATION

The most common types of abstraction adopted when modelling information systems are the classification and generalisation processes.

In particular, classification is the abstraction that leads to the definition of a class starting from a set of objects having common properties.

The collective individuals are the result of a classification operation on individuals at a lower aggregation level and, as discussed in the previous sections, we are far from having unanimity about these definitions. The impact that a change in the definition rules of the collective subject has on the source modelling is therefore very important: such a modification will induce a variation in the identification of entities and, accordingly, in the conceptual model of the source; as a consequence, the type of information that we can obtain from the source may change deeply.

The result of a classification on data will not necessarily turn into the creation of an entity: for a concept to assume the status of entity, it must have an autonomous existence and it must have some properties.

Therefore, when modelling an information system characterised by the presence of collective subjects, we must decide whether they are the simple result of a classification, and in this sense nothing but the output of a query on the database of individuals, or whether they have the status of an entity.

For statisticians, in particular if interested in the evolutionary description of the collective individuals, collective subjects must be entities and not simple queries.
It is important, in fact, to characterise them with attributes that make it possible to follow, describe and interpret their evolutionary behaviours, as is usually done for individuals and as is done, for instance, in the business demography approach.

For the study of the evolutionary behaviour of the collective subject we must be able to trace a sort of biography of it, marked at least by its birth date, by all the events that happen during its life and by a death date.

In this sense the classification of the individuals into subjects at a higher aggregation level will result in the creation of an entity, because each instance will be characterised at least by its dates of birth and death. These considerations will solve, incidentally, the classical ontological question that we have to deal with when speaking about collective individuals (Guarino, 1994).
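
The distinction between a collective subject obtained as a query and one modelled as an entity can be sketched as follows. This is only an illustration under our own assumptions: the function and class names, the field names, the event labels "birth" and "death" and the sample dates are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional, Set, Tuple


def subjects_as_query(individuals: List[dict]) -> dict:
    """Collective subject as a mere query: a grouping with no attributes of its own."""
    groups: dict = {}
    for ind in individuals:
        groups.setdefault(ind["subject"], set()).add(ind["code"])
    return groups


@dataclass
class CollectiveSubject:
    """Collective subject as an entity: it carries its own biography."""
    code: str
    members: Set[str] = field(default_factory=set)
    events: List[Tuple[date, str]] = field(default_factory=list)

    def record(self, when: date, kind: str) -> None:
        self.events.append((when, kind))
        self.events.sort()  # keep the biography in chronological order

    def _first(self, kind: str) -> Optional[date]:
        return next((d for d, k in self.events if k == kind), None)

    @property
    def birth_date(self) -> Optional[date]:
        return self._first("birth")

    @property
    def death_date(self) -> Optional[date]:
        return self._first("death")

    def is_alive(self, when: date) -> bool:
        born, died = self.birth_date, self.death_date
        return born is not None and born <= when and (died is None or when < died)


# Invented data: two individuals belonging to the same firm.
individuals = [{"code": "I1", "subject": "F1"}, {"code": "I2", "subject": "F1"}]
firm = CollectiveSubject("F1", members={"I1", "I2"})
firm.record(date(1999, 1, 1), "death")
firm.record(date(1995, 3, 1), "birth")
```

The query returns only the current composition of the firm, whereas the entity can also say whether the firm existed at a given date.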

5. MODELLING COLLECTIVE SUBJECTS IN THE FRAME OF THE RELATIONAL APPROACH

If the objective is to organise a statistical source that correctly represents the business entities, the adoption of different rules for the identification of the collective subjects present in the informative schema has, as we have seen, immediate consequences on the conceptual modelling.

In this sense, in order to be able to adopt different points of view, the correct approach could be semistructured modelling, in which the user is not tied to a fixed schema but is able to interact dynamically with its creation.

This type of approach is usually developed in an object-oriented modelling context, which several authors consider more suitable than the relational one for the complexity of statistical source design.

In spite of that, the object-oriented approach is less widespread and less used than the relational one, also because it has suffered from the lack of standard management modalities and languages.

In addition to these more technological aspects, we must reflect on the fact that the relational approach is not only more widespread in general: in practice most statistical systems, even at the national level, are organised in a relational way. For this reason we develop some considerations aimed at drawing modelling strategies for business information systems in the context of the relational approach, even if it cannot in practice support a semistructured approach.

Starting from a set of individuals, these individuals can be grouped into several types of collective subjects. Each of these collective subjects may have one or more definitions.

For instance:

[Figure: diagram showing an Individual and the collective subjects A1, A2 and A3]

In the example, A1, A2 and A3 are three collective individuals characterised on the basis of three different classification protocols; several individuals belong to every collective subject, and every individual may be classified following different protocols.

Within this information system we must define the rules according to which a collective subject comes into being, changes its status or dies.

The birth of a collective subject may be linked to events happening at the lower level; for instance, the date of birth of a new enterprise group (the third hierarchical level identified by EC Council Regulation 696/93) is the date on which the enterprises that belong to this entity made a legal and/or financial agreement among themselves.

Analogously, we can define rules for identifying the death date of a higher-level individual.

The information system, therefore, ought to record the events that happen at the different levels and, in addition, it ought to be able to react to particular conditions that arise at the different levels, updating the biography of the collective individuals present in the system.

In conclusion, when modelling the source we must provide for the entities that correspond to all the collective subjects inherent to the system and for all the rules that link together the events happening at the different aggregation levels.

The flexibility offered by a modelling protocol like this contributes to the needs discussed in the previous sections, because the users may organise the data according to different concepts and rules.

Moreover, a structure like this shifts the problem of harmonisation to the level of the least aggregated collective subject only, the others being linked to it through rules among events.
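
A minimal sketch of these ideas follows, under assumptions of our own: the classification rules attached to A1 and A2, the attribute names, the event label "agreement" and the threshold of two signatories are all hypothetical. They serve only to show how one set of individuals can feed several collective subjects and how a lower-level event can date the birth of a higher-level one.

```python
from collections import defaultdict
from datetime import date

# Invented individuals (here, local units) with two attributes.
units = [
    {"code": "U1", "owner": "E1", "activity": "textiles"},
    {"code": "U2", "owner": "E1", "activity": "retail"},
    {"code": "U3", "owner": "E2", "activity": "textiles"},
]

# One classification protocol per type of collective subject (hypothetical rules).
protocols = {
    "A1": lambda u: u["owner"],     # group units by owning enterprise
    "A2": lambda u: u["activity"],  # group units by kind of activity
}


def classify(individuals, rule):
    """Apply a protocol: map each collective subject to the codes of its members."""
    groups = defaultdict(set)
    for ind in individuals:
        groups[rule(ind)].add(ind["code"])
    return dict(groups)


# Lower-level events: (enterprise, date, kind, enterprise group concerned).
events = [
    ("E1", date(1996, 5, 10), "agreement", "G1"),
    ("E2", date(1996, 5, 10), "agreement", "G1"),
    ("E3", date(1998, 2, 1), "agreement", "G1"),
]


def group_birth(events, group, min_members=2):
    """Rule among events: the group is born on the date on which the
    min_members-th enterprise enters the agreement (assumed threshold)."""
    dates = sorted(d for _, d, kind, g in events if kind == "agreement" and g == group)
    return dates[min_members - 1] if len(dates) >= min_members else None


by_owner = classify(units, protocols["A1"])
by_activity = classify(units, protocols["A2"])
g1_birth = group_birth(events, "G1")
```

Changing a protocol changes only the rule passed to classify; the individuals themselves are left untouched, which is the kind of flexibility argued for above.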

6. CONCLUSIONS

The problem of organizing adequate sources for the statistical analysis of business entities is multifaceted.

The problems are linked to the particular typology of the data, which are affected by non-homogeneous definitions and by the presence of collective subjects derived from the application of classification protocols that may differ substantially among users and contexts.

In addition, there is the need to characterize the evolutionary behavior of these collective subjects.

An answer to these needs may come from a conceptual modeling of the sources that, starting from the definition of individuals at a level of aggregation as low as possible, defines the entities corresponding to all the possible collective subjects into which the first-level individuals may be aggregated, and all the rules that link together the events happening at the different aggregation levels.

The goal is to succeed in tracking not only the biography of the low-level individuals but also that of the collective subjects to which they belong.

Some proposals have already been made to "cut the universe of businesses into small pieces that can be used as building blocks for constructing the statistical units and subsequently for the classification of these units according to varying criteria" (Kroese et al. 1999), but they do not pay much attention to the time dimension. In our proposal the description of rules identifies a criterion for following the temporal evolution of the subjects.

REFERENCES

Baldwin, J., R. Dupuy and W. Penner (1992), Development of Longitudinal Panel Data From Business Registers: Canadian Experience, Social and Economic Studies Division, Analytical Studies Branch, Ottawa: Statistics Canada.

Baldwin, J., and P.K. Gorecki (1990), "Measuring Firm Entry and Exit in the Canadian Manufacturing Sector, 1970-1982", Canadian Journal of Economics, pp. 300-323.

Batini, C., S. Ceri and S.B. Navathe (1992), Conceptual Database Design, an Entity-Relationship Approach, Benjamin-Cummings, Menlo Park, California.

Brüderl, J. and R. Schüssler (1990), "Organizational Mortality: The Liabilities of Newness and Adolescence", Administrative Science Quarterly, 35, pp. 530-547.

Ceruti, M. (1995), Evoluzione senza fondamenti, Laterza.

Eldredge, N. and S.J. Gould (1991), “Gli equilibri punteggiati: un’alternativa al gradualismo filetico”, in N. Eldredge, Strutture del tempo, Italian translation, Hopefulmonster, Firenze, pp. 221-260.

Council Regulation (EEC) No 696/93 of 15 March 1993 on the Statistical Units for the Observation and the Analysis of the Production System in the Community.

Egmose, S. (1998), "Following Establishments Over Time", 12th International Roundtable on Business Survey Frames, Statistics Finland.

Eurostat (1998), Use of Administrative Sources for Business Statistics Purposes. Handbook of Good Practices, Luxembourg: Statistical Office of the European Communities.

Guarino, N. (1994), Formal Ontology, Conceptual Analysis and Knowledge Representation.

Herczog, A., H. van Hooff and A. Willeboordse (1998), The Impact of Diverging Interpretations of the Enterprise Concept, RSM-30340, Division Research and Development, Department of Statistical Methods, Voorburg: Statistics Netherlands.

Hinterberger, H. (ed.) (1992), “Statistical and Scientific Database Management”, Proceedings of the VI Intern. Conference on Scientific and Statistical Database Management, ETH Publ., Ascona, Switzerland.

Kroese, B., H. van Hooff and A. Willeboordse (1999), A Formal Description of the Structure and Activities of Businesses, RSM-30480, Division Research and Development, Department of Statistical Methods, Voorburg: Statistics Netherlands.

Lavallée, P. (1998), "Business Panel Surveys: Following Enterprises Versus Following Establishments", Research in Official Statistics, 0, pp. 37-57.

Martelli, C. (1996), “The Temporal Dimension in Data Bases”, Third International Meeting on Quantitative Methods for Applied Sciences, Siena.

Martelli, C. (1999), “Information Systems for a Complex Approach to Demographic Analysis”, in Théories, Paradigmes et Courants Explicatifs en Démographie, Louvain-la-Neuve, L’Harmattan.

Michalewicz, Z. (ed.) (1990), Statistical and Scientific Database Management, Lecture Notes in Computer Science, N. 420, Springer Verlag.

McGuckin, R.H. (1991), "Multiple Classification Systems for Economic Data: Can a Thousand Flowers Bloom? And Should They?", Research Paper CES 91-8, Center for Economic Studies, Washington: Bureau of the Census.

Mustaniemi, T. (1996), "Enterprise Demography as a Method of Studying Real Enterprise Births", CAED 1996, Statistics Finland.

Spletzer, J.R. (1998), "Longitudinal Establishment Microdata at the Bureau of Labor Statistics: Development, Uses and Access", Washington: Bureau of Labor Statistics.

Stamas, G., K. Goldenberg, K. Levin and D. Cantor (1997), "Sampling for Employment at New Establishments in a Monthly Business Survey", Washington: Bureau of Labor Statistics.

Struijs, P. (1996), "International Harmonization of Statistical Units", 10th International Roundtable of Business Survey Frames, Quebec.

Tansel, A.U., J. Clifford, S. Gadia, S. Jajodia, A. Segev and R. Snodgrass (1993), Temporal Databases, Benjamin-Cummings Pub. Comp.

Willeboordse, A. and P. Struijs (1999), "Tracking Real Changes in Business Structures: a Conceptual Framework", Fifth Annual Seminar of the INSEE Directorate of Business Statistics, December 1999, Paris.

FIRMS AND EMPLOYMENT CONTRACTS: PRELIMINARY ANALYSES FROM NETLABOR, AN INTEGRATED DATABASE ON THE ITALIAN LABOUR FORCE AND ENTERPRISES*

Francesca Bassi, University of Padua, Italy, Maurizio Gambuzza and Maurizio Rasera, Labour Agency, Venice, Italy

Francesca Bassi, Statistics Department, University of Padua, Via S.Francesco 33, 35121 Padua, [email protected]

Preliminary version (April 15, 2000)

ABSTRACT

In Italy, each person wishing to be employed in the private sector has to join a list at a local labour exchange office. Registration in the lists gives rise to the collection of workers’ demographic characteristics, educational background and professional expertise. Workers’ labour history is then followed over time. Private enterprises have to communicate to their reference labour exchange office any hiring and any modification of employer-employee relations, together with information on demographic characteristics, economic activity, types of contracts used and employed personnel. Each worker and each enterprise is identified by a unique code. This builds up a linked employer-employee-jobs database containing detailed, continuously updated information. The aim of the paper is to describe the potential of the database for analysing the Italian labour market, specifically at the local level. As an example, some exploratory analyses of enterprises’ use of different types of employment contracts, currently a debated issue in Italy, are presented.

Key Words: employer-employee data, temporary work, firm dynamics

1. INTRODUCTION

Netlabor is software developed to collect and store information at the local offices of the Italian Ministry of Labour. The program deals, in an interactive mode, with the bureaucratic issues implied by Italian legislation on subordinate work. In Italy, at present, each person wishing to be employed in the private sector has to register in an appropriate list at the local labour exchange office (Ufficio di collocamento) of his/her area of residence. On the other hand, private enterprises have to communicate to their reference local office, with a maximum delay of five days, any hiring, dismissal and any other modification of employer-employee relations. This results in a great amount of information on the Italian labour market being collected at the labour exchange offices (for a brief analysis of the evolution of Italian legislation related to labour exchange offices, see Del Boca and Rota, 1998).

The reference populations of the data are (i) all workers enrolled in the lists, (ii) all jobs in the private sector, (iii) all enterprises that have contact with a worker registered at local labour offices.

Workers enrolling can be considered as unemployed and reasonably looking for a job. With regard to them, information is collected on demographic characteristics (such as gender, age, residence, etc.), educational background and professional expertise. Once a worker has enrolled, his/her labour history is followed until he/she leaves the list: the main reasons for leaving are the decision to leave the labour force, obtaining a job in the public sector or starting an entrepreneurial activity. Information on individuals is updated both by direct communication to the office from workers (on education, expertise, etc.) and by notifications sent by enterprises.

Information on firms refers to some demographic characteristics (name, location, size, etc.), economic activity, types of employment contracts used and employed personnel; it is updated mainly through the notifications from firms to the local offices. Firms are obliged to communicate hirings, dismissals and any other modification of the relations between employer and employee, together with their characteristics: duration of employment, type of contract and so on.

Each worker and each enterprise is identified in the archive by a unique code, which allows linkage between information on each worker and the enterprises with which he/she has contacts. The software organises the collected information in an integrated database where separate archives are linked through appropriate procedures using the identification codes of the units (individuals and firms).
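
How unique codes make the separate archives linkable can be illustrated with a small sketch. It does not reproduce Netlabor's actual data structures, which are not described here: the field names, codes and records below are invented, and the two functions are hypothetical helpers.

```python
from collections import Counter
from datetime import date

# Three archives keyed or cross-referenced by unique codes (all values invented).
workers = {
    "W1": {"gender": "F", "birth_year": 1970},
    "W2": {"gender": "M", "birth_year": 1975},
}
firms = {
    "F1": {"sector": "industry"},
    "F2": {"sector": "other"},
}
jobs = [
    {"worker": "W1", "firm": "F2", "contract": "permanent",
     "start": date(1996, 1, 15), "end": None},
    {"worker": "W1", "firm": "F1", "contract": "fixed_term",
     "start": date(1995, 3, 1), "end": date(1995, 9, 30)},
    {"worker": "W2", "firm": "F1", "contract": "apprenticeship",
     "start": date(1996, 6, 1), "end": None},
]


def labour_history(worker_code):
    """All jobs of one worker, in chronological order, with the firm's sector attached."""
    history = sorted((j for j in jobs if j["worker"] == worker_code),
                     key=lambda j: j["start"])
    return [dict(j, sector=firms[j["firm"]]["sector"]) for j in history]


def contracts_by_firm(firm_code):
    """Number of contracts of each type notified by one firm."""
    return Counter(j["contract"] for j in jobs if j["firm"] == firm_code)


history_w1 = labour_history("W1")
contracts_f1 = contracts_by_firm("F1")
```

The same job records thus serve both a worker-level view (labour histories) and a firm-level view (contract use), which is what the linkage by codes is meant to allow.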

Netlabor is thus quite a rich source of data for analysing the Italian labour market, in terms of both stocks and flows. It can be exploited specifically to study (i) workers’ labour histories, entrance into the labour market, unemployment and job duration; (ii) enterprises’ histories and strategies, especially with reference to labour force employment; (iii) the usage and effects of labour policies.

The peculiarities of the database derive mainly from the large amount of linked detailed information on workers, jobs and enterprises (Hamermesh, 1999) and from the continuous-time updating. Although the archive is an administrative one and the main purpose of information collection is not statistical analysis but bureaucratic needs, the quality of the data has been judged good overall (Bassi, Gambuzza, Rasera, 2000). Its main limits are the fact that it monitors only subordinate work in the private sector and that information is collected and stored at local labour exchange offices at an intraregional level, without links on a national basis. Nevertheless, as we will show in this paper, it is a powerful instrument for exploring labour markets, especially at the local level. Given the composite nature of the labour market in Italy, targeted analysis in different areas of the country is needed and useful to understand labour demand and supply mechanisms and to plan specific economic policies.

In this paper we illustrate how to exploit the information collected at the Italian local employment offices to understand local labour market dynamics. We present some preliminary analyses of an issue that is currently raising great debate in our country and elsewhere (see, for example, European Commission, 1997): the use by enterprises of forms of employment contract alternative to the permanent job (fixed term, apprenticeship, etc.). The recent increase in temporary employment in Italy is seen by some experts as a positive trend towards labour market flexibility, by others as a symptom of weakness of the productive system. We start to investigate the mechanism that induces firms to prefer different types of contract; specifically, we explore the relations with economic activity and enterprise dynamics. The analyses have been conducted on data collected at six local employment offices (note 1) located in the Veneto region (North-east of Italy) in the period 1995-1997.

2. TEMPORARY EMPLOYMENT IN ITALY

In 1994 the total number of subordinate workers in Italy was 14.362 million; in 1998 it was 14.458 million (a 0.7% increase). In the same period the number of temporary workers increased from 977,000 to 1.288 million (31.8%) (note 2).

Table 1. Subordinate workers: total and temporary by year and economic sector (x 1000), Veneto & Italy, 1994-1998

Year  Area     AGRICULTURE        INDUSTRY           OTHER              ALL
               Total  Temporary   Total  Temporary   Total  Temporary    Total  Temporary
1994  Veneto      22          5     597         27     628         46    1,246         77
1994  Italy      575        183   5,403        276   8,384        518   14,362        977
1996  Veneto      25          4     620         29     657         50    1,301         83
1996  Italy      523        173   5,256        309   8,523        582   14,301      1,064
1998  Veneto      24          3     650         33     649         57    1,323         93
1998  Italy      497        170   5,252        394   8,709        724   14,458      1,288

Source: National Statistical Institute: Labour Force Survey (RTFL)

In the region we are analysing (Veneto), from 1994 to 1998 short-term employment increased by 20.8% (from 77,000 to 93,000 units), with an increase in the total number of workers employed of only 6% (from 1.246 to 1.323 million units). The relatively low incidence of temporary employment in this region can be explained by the fact that here total employment has increased at a higher rate than average. Table 1 contains some figures on temporary employment in Veneto and Italy disaggregated by year and economic sector.
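
The percentage changes quoted above can be recomputed from the ALL columns of Table 1. The short check below is only an arithmetic illustration; the dictionary layout and the function names are ours.

```python
# (total, temporary) subordinate workers in thousands, ALL columns of Table 1.
table1 = {
    ("Veneto", 1994): (1246, 77),
    ("Veneto", 1998): (1323, 93),
    ("Italy", 1994): (14362, 977),
    ("Italy", 1998): (14458, 1288),
}


def pct_change(old, new):
    """Percentage change from old to new."""
    return 100.0 * (new - old) / old


def temporary_share(area, year):
    """Temporary workers as a percentage of all subordinate workers."""
    total, temporary = table1[(area, year)]
    return 100.0 * temporary / total


for area in ("Veneto", "Italy"):
    total_94, temp_94 = table1[(area, 1994)]
    total_98, temp_98 = table1[(area, 1998)]
    print(
        area,
        f"total {pct_change(total_94, total_98):+.1f}%",
        f"temporary {pct_change(temp_94, temp_98):+.1f}%",
        f"share in 1998 {temporary_share(area, 1998):.1f}%",
    )
# Veneto total +6.2% temporary +20.8% share in 1998 7.0%
# Italy total +0.7% temporary +31.8% share in 1998 8.9%
```

The share of temporary workers in 1998 is lower in Veneto than in Italy as a whole, which is consistent with the relatively low incidence noted in the text.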

A first piece of evidence emerging from Table 1 is that in recent years in Italy permanent jobs have been replaced by short-term ones. Temporary jobs have decreased in agriculture together with total employment (agriculture is the sector where short-term employment is traditionally used); they have increased by 43% in industry, in spite of a decrease in total employment, and by 40% in the other sectors, where total employment has also increased. Another important piece of evidence for Italy, not contained in the table, is that gross worker flows are quite low: the persistence of the employed in the same position (but not necessarily in the same sector or workplace) observed in two consecutive annual surveys is 94%. The massive increase in temporary jobs may then be explained either by a high turnover of those workers who change state or by the fact that enterprises deliberately use short-term contracts to “test” new employees and hire them on a permanent basis at a later stage.

1 The areas covered by the six employment offices can be considered a good sample of the economy of the overall region. They include two industrial areas (Conegliano and Montebelluna); a tourist one (Calalzo); a medium-sized urban site (Belluno); an agricultural area (San Bonifacio) and a local productive system, based on small and medium-sized companies, located in the large metropolitan area characterising the central part of the region (Mira).

2 The increase in temporary employment observed in Italy follows a general trend of European countries: the highest percentage of temporary workers, in 1997, was employed in Spain (30%); Scandinavian countries, The Netherlands, France, Germany and Portugal follow with percentages between 10 and 15; Italy stays in the bottom group (percentage of temporary workers between 5 and 10) together with Ireland, Austria, Belgium and UK.

The above evidence stimulated the analyses presented in the paper. Specifically, we start to investigate the relationships between the use of different types of employment contracts and enterprise characteristics (mainly economic sector and dynamics). The questions to which answers are sought are:
(i) do short-term contracts continue to be used in sectors that traditionally prefer them to permanent employment (mainly those with seasonal production), or are they becoming transversal across sectors and firm size?
(ii) are short-term contracts a need, for example for enterprises or sectors in difficulty, or a choice, either to “test” new workers or to bypass dismissal regulation, which, in our country, is rather strict?

3. TEMPORARY EMPLOYMENT: DIFFUSION AND IMPACT

Our analyses refer to a case study. We selected data collected in labour exchange offices in Veneto; the areas covered by those offices are considered a good sample of the economic situation of the whole region. Table 2 contains some labour market indicators for Veneto and Italy to compare dimensions. From the table it is evident that the employment situation in Veneto is better than the average in our country (employment rates are some 6 to 7 percentage points higher than in Italy and unemployment rates are roughly half); this is, of course, linked to a healthier productive system.

Table 2. Labour market indicators: Veneto and Italy, 1994-1998 (annual averages)

                            VENETO                      ITALY
                            1994    1996    1998        1994     1996     1998
Population 15-70 (x1000)    3.364   3.377   3.384       42.578   42.703   42.711
Labour Force (x1000)        1.904   1.933   1.952       22.680   22.851   23.034
Employment rate             53,1    54,0    54,7        47,3     47,0     47,3
Unemployment rate           6,3     5,6     5,2         11,3     12,1     12,3

Source: National Statistical Institute and Italian Ministry for Labour

In the area considered, more than 156.000 engagements took place over the period 1995-1997. The distribution by contract (Table 3) shows that 63.170 engagements (40,4%) are fixed term ones, while the percentage of permanent jobs is slightly lower, 39,8 (62.122 units). Adding to fixed term jobs ordinary apprenticeship (19.030) and the contratti di formazione e lavoro (CFL), specifically regulated fixed term apprenticeship contracts3 (11.832), over 60% of engagements in the period concern short-term employment contracts.
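The shares quoted above can be checked directly from the four contract totals. The following is a minimal sketch in Python; the variable names are illustrative and not part of the original analysis, and the last decimal may differ slightly from the published figures because of rounding.

```python
# Engagement counts by contract type, sampled area, 1995-1997 (from the text).
engagements = {
    "permanent": 62_122,
    "fixed term": 63_170,
    "apprenticeship": 19_030,
    "CFL": 11_832,
}

total = sum(engagements.values())

# Percentage share of each contract type in total engagements.
shares = {name: 100 * count / total for name, count in engagements.items()}

# Short-term contracts are everything except permanent jobs.
short_term_share = 100 - shares["permanent"]

for name, pct in shares.items():
    print(f"{name:15s} {pct:5.1f}%")
print(f"{'short-term':15s} {short_term_share:5.1f}%")
```

The total reproduces the 156.154 engagements of Table 3, and the short-term share comes out just above 60%, as stated.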

The agricultural and tourist sectors hire mainly on temporary jobs (82% and 62% of total engagements, respectively), and specifically on fixed term ones, as a consolidated practice due to the high seasonality of the productive activity. Trade also has quite a high percentage of temporary jobs (58%), which includes a fair percentage of apprenticeships. 51% of engagements are temporary in the building sector. In the industrial sector the rate of short-term jobs is higher in the mechanical division (61%) than in the others (56%), with a fair proportion of apprenticeship and on the job training. The use of fixed term contracts is quite a new practice for this sector, which deserves further attention.

To understand enterprises’ choice of different employment contracts it may be useful to analyse jointly firms’ characteristics and the frequency with which contracts are adopted. Each enterprise in our sample has been classified into five classes with regard to its usage of each of the four types of employment contract available in Italy at present: permanent, fixed term, apprenticeship and CFL (Table 4). With reference, for example, to fixed term jobs, class 1 identifies all enterprises which never used a fixed term contract to hire workers in the period considered (1995-1997). Class 2 contains firms which stipulated fewer than 20 total engagements in the period, at least one of them on a fixed term basis; this class isolates firms that may behave differently with regard to fixed term contract usage but that, in general, have quite a small job turnover, which does not allow us to discriminate on the existence of real strategies in labour force management. Class 3 contains enterprises which use fixed term contracts “marginally”: less than 25% of total engagements; class 4 those which use them “intensively”: between 25 and 75% of total recruitment; finally, class 5 contains enterprises which hired on a fixed term basis over 75% of the time (“specialised” use).
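The classification rule just described can be sketched as a small function. This is an illustration only: the function name and arguments are hypothetical, and the treatment of the exact 25% and 75% boundaries (both assigned to class 4 here) is an assumption, since the text does not say on which side they fall.

```python
def usage_class(total_engagements: int, type_engagements: int) -> int:
    """Return the usage class (1-5) of a firm for one contract type.

    total_engagements: all engagements of the firm in 1995-1997
    type_engagements:  engagements made with the contract type considered
    """
    if total_engagements <= 0:
        raise ValueError("the firm must have at least one engagement")
    if not 0 <= type_engagements <= total_engagements:
        raise ValueError("type_engagements must lie between 0 and the total")

    if type_engagements == 0:
        return 1                      # no use of this contract type
    if total_engagements < 20:
        return 2                      # small turnover, at least one such contract
    share = type_engagements / total_engagements
    if share < 0.25:
        return 3                      # marginal use
    if share <= 0.75:
        return 4                      # intensive use (boundaries assumed inclusive)
    return 5                          # specialised use
```

For example, a firm with 40 engagements of which 4 are fixed term falls in class 3, while a firm with 10 engagements of which 3 are fixed term falls in class 2 regardless of its share.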

3 The contratti di formazione e lavoro are a form of on the job training. These employment contracts were introduced in 1984, concern young workers aged 16 to 32 and have a maximum length of 24 months. This instrument was introduced both to integrate young people into the labour market (there is a 25% discount in payroll taxes) and to facilitate the school to work transition. Ordinary apprenticeship has its own regulation and concerns young people between 16 and 24.


Table 3. Engagements by type of contract and economic activity, sampled area, 1995-1997

                     PERMANENT  FIXED TERM  APPRENT.     CFL     TOTAL      %
Agriculture              3.138      13.368       749     488    17.743   11,4
Light ind. & oth.       11.023       7.274     4.139   2.626    25.062   16,0
Mechanical ind.         15.212      14.148     5.670   3.747    38.777   24,8
Building                 6.815       3.071     2.002   1.005    12.893    8,3
Trade                    7.340       6.125     2.501   1.510    17.476   11,2
Services                 6.999       3.601     1.093   1.217    12.910    8,3
P.A.4 and other          4.358       6.438       900     598    12.294    7,8
Tourism                  7.237       9.145     1.976     641    18.999   12,2
All                     62.122      63.170    19.030  11.832   156.154
%                         39,8        40,4      12,2     7,6

Table 4. Firms by contract usage and economic activity, sampled area, 1995-1997 (1=No use, 2=less than 20 engagements, 3=marginal use, 4=intensive use, 5=specialised use)

              PERMANENT                        FIXED TERM
                 1     2     3     4     5        1     2     3     4     5     All
Agric.         384   514    58    35    13      424   463    18    33    66    1004
Light Ind.     545  1717    84   157    50     1491   797   103   138    24    2553
Mech. Ind.     613  1866   117   250    63     1654   855   170   191    39    2909
Building       453  1328    30    45    42     1384   419    46    31    18    1898
Trade          731  1685    46    72    27     1704   718    56    49    34    2561
Services       487  1439    32    47    42     1383   565    44    32    23    2047
P.A.           474  1000    45    36    18      927   545    18    35    48    1573
Tourism        399  1229    61    49    57     1148   492    43    35    77    1795

              APPRENTICESHIP                   CFL
                 1     2     3     4     5        1     2     3     4     5     All
Agric.         766   201    25    12     0      822   141    37     4     0    1004
Light Ind.    1360   998   131    64     0     1744   663   112    34     0    2553
Mech. Ind.    1423  1170   211   103     2     1972   738   150    47     2    2909
Building      1132   685    53    28     0     1411   443    30    14     0    1898
Trade         1645   817    67    31     1     1907   586    51    17     0    2561
Services      1616   388    29    14     0     1577   424    30    15     1    2047
P.A.          1226   321    14    10     2     1329   227    13     4     0    1573
Tourism       1107   575    96    15     2     1516   239    32     8     0    1795

Only ¼ of enterprises did not hire workers on a permanent basis in the three years, and this percentage does not vary much across sectors. In general, firms which do not offer permanent jobs show low levels of recruitment: more than 94% of them stipulated fewer than 10 contracts in the period (half of them only one). On the other hand, among the remaining ¾ of firms, even those which hire few workers stipulated at least one contract on a permanent basis. This confirms that the use of permanent contracts is linked to a sort of traditional practice in our country.

The percentage of firms which never used fixed term contracts is rather high (62%); on the other hand, these firms are responsible for only 25% of total recruitment. A modest use of fixed term employment is seen across all sectors, while intensive and specialised use seems more linked to seasonal production.

Apprenticeship and on the job training seem to be used in a targeted way, and in some sectors more than others (mainly in industry).

As an overall impression, figures in Table 4 show some linkage between the economic sector and the frequency of use of the different types of contracts. From other exploratory analyses of the sample, as mentioned above, it emerged that firm size and/or recruitment levels may also play a role in choosing which kind of contract to stipulate. In order to identify patterns of relations among these three phenomena (contract usage, economic activity and firm dynamics) multiple correspondence analyses5 have been performed. In order to summarise firms’ behaviour with regard to employment flows, enterprises have been classified as “growing” if they had an increase in labour force in the period considered (53% of the sample), as “shrinking” if they had a decrease in staff (18%), and as “stable” if the number of workers at the end of the period was equal to that at the beginning (29%).

4 Workers for low skill jobs are employed by the Public Administration among those registered in the local employment offices lists.

With regard to fixed term employment, the first two dimensions of the multiple correspondence analysis, which explain almost 98% of total inertia, point out the following relations:
• Agriculture, Public Administration and Tourism are basically stable sectors, with worker turnover explained by the seasonality of production. Agricultural enterprises are clearly specialised in the use of fixed term contracts, which is to some extent expected. The other two sectors employ a fair proportion of workers on fixed term jobs.
• Building and Trade are also basically stable sectors as far as labour force employment is concerned. Enterprises with this economic activity tend not to use fixed term contracts.
• Services is one of the sectors with the highest proportion of growing firms; nevertheless, fixed term employment is not used.
• The industrial sector is somewhat peculiar and shows pronounced differences between the mechanical division and the others. The mechanical sector is the one with the highest proportion of growing firms; the novelty lies in the fact that many enterprises use fixed term contracts intensively. The other divisions are also using fixed term jobs intensively, although here we find the highest proportion of shrinking enterprises. Fixed term jobs are thus used with a double purpose in the industrial sector: to increase the labour force in the mechanical division, presumably to “test” new workers, and to contain the labour force and possibly replace permanent jobs in the declining divisions.

As a general finding, fixed term employment, though strictly linked to economic activity, cannot be fully explained without firm dynamics. This is also confirmed by estimating multinomial logit models with the proportion of fixed term contracts as the dependent variable: no model other than the saturated one fits the data, which means that the three-variable interaction parameters cannot be omitted.

In the case of CFL use, the first dimension found with multiple correspondence analysis alone explains over 90% of total inertia. The position of the category points in the two-dimensional representation shows that activity sectors tend to be ordered from Tourism, Agriculture and Public Administration (where this type of contract is not used) to sectors which seldom hire on this basis (Building, Trade and Services) to those which apply it intensively (Industrial divisions). It also happens that sectors which do not use CFL, or use it only marginally, are those which did not show a significant increase in personnel over the period, with the only exception of the non-mechanical divisions of Industry; this is also confirmed by the fact that the multinomial logit model with no three-variable interactions fits the data. This evidence raises the question whether firms use on the job training as a structural practice, independently of conditions, or whether its usage is linked to growing phases. A reasonable guess, supported by the evidence that a high percentage of these contracts is transformed into permanent jobs, is that CFL is used in firms where training costs are higher, to prepare workers to be subsequently hired on a permanent basis.

Also in the case of apprenticeship the first dimension alone explains a high percentage of total inertia (85%). In general, the relation between apprenticeship usage and economic activity is stronger than that between contract usage and firm dynamics (confirmed also by the estimation of the multinomial logit model), although both factors must be considered in order to explain hiring through apprenticeship. Sectors which do not have apprentices are Agriculture, Services and the Public Administration. The firms which use this type of contract most are in the mechanical division; the other divisions of industry and the other sectors seldom use it. As a general tendency, small companies6 in the industrial sector prefer apprenticeship to fixed term contracts, while bigger ones prefer to offer fixed term jobs. This tendency is seen even more clearly in the mechanical division and confirms the suspicion that in Italy apprenticeship is used not only as a means to teach young workers a profession but also to organize seasonal production. The tourist sector, which has a typically seasonal activity, also hires a fair percentage of workers as apprentices. The reason for this behaviour is the very low labour cost associated with apprenticeship in comparison with fixed term jobs; moreover, an apprenticeship contract can be interrupted at any time and the worker can eventually complete his/her training in another company at any other time.

Finally, as already said, permanent employment is traditionally quite widespread in Italy. In this case, correspondence analysis shows that the main dimension explaining contract usage is that determined by firm dynamics. Except for the agricultural sector, which traditionally uses fixed term workers, in the other sectors the more hirings, the more permanent jobs are offered. The Public Sector shows atypical behaviour here because only a small proportion of its workers (typically low skilled) is hired through labour exchange offices; these workers are usually employed in temporary jobs and not offered permanent positions.

5 One for each type of contract.

6 Job flows have been used as a proxy of firm size, which, at the present moment, is not registered reliably in the database: small companies are those with fewer than 15 movements in the three years.

4. CONCLUSIONS

The paper deserves concluding remarks along two lines: (i) the potential of the database used for the analyses and (ii) the evidence that emerged on the issue investigated, the usage of different types of contract by firms.

The preliminary analyses presented in the preceding paragraphs are an example of how the data collected at the Italian labour exchange offices can be exploited to analyse local labour markets. Local labour market analysis has recently become an autonomous research field to which policy makers refer. Detailed and disaggregated economic analyses are required, especially in Italy, in order to highlight the peculiarities of each local market.

The main characteristics of our database are the detailed and continuously updated longitudinal information it contains and the possibility of linking employer and employee data. This last feature has not been exploited in the analyses presented here; nevertheless, it will be a matter of future use: it is evident that more information on firms’ behaviour in managing personnel can be gained by investigating the characteristics (demographic and other) of their workers.

In general, in recent years, in Italy, as in other European countries with quite high unemployment rates (France, Spain and Sweden), temporary jobs have replaced permanent ones (OECD, 1996). This trend, though, has developed quite differently across economic sectors in Italy, as our analyses document. Temporary jobs continue to be widely offered by enterprises with seasonal production (Agriculture, Trade and Tourism), with, among these, small firms preferring apprenticeship to fixed term contracts in order to contain labour costs. In Mechanical Industry and Services, both growing sectors, temporary jobs come alongside permanent ones and are used as a means of on the job training; firms in Services seem to prefer CFL to fixed term contracts and, in general, do not show high worker turnover (due to higher training costs?). The very typical case of temporary jobs replacing permanent ones seems to occur in the non-mechanical divisions of Industry, where fixed term contracts are highly used and we observed the highest proportion of shrinking enterprises.

Worker flow analyses should add valuable information on the issue under study and, for this reason, are on our future work agenda. Simple worker flow measures confirm what is written in the preceding paragraph. Sectors with seasonal production show higher percentages of workers hired and subsequently dismissed in the observed period; conversely, in the industrial sector this percentage is lowest: new workers are employed to replace workers who leave.

As a final note: the above evidence on contract usage by firms seems quite sensible and does not contrast with the evidence scattered in the vast literature on worker turnover (Bingley and Westegaard-Nielsen, 1998); this is, in our opinion, another point in favour of the good quality and reliability of the Netlabor database.

REFERENCES

Bassi F., Gambuzza M., Rasera M. (2000), “Struttura e qualità delle informazioni del sistema informatizzato Netlabor. Una verifica sui dati delle Scica di Treviso e Belluno”, Working Paper, Lavoro e Disoccupazione: questioni di misura e di analisi, Statistics Department, University of Padua.

Bingley P., Westegaard-Nielsen N. (1998), “Establishment tenure and worker turnover”, International Symposium on Linked Employer-Employee Data, May 21-22, 1998, Arlington, CD.

Del Boca A., Rota P. (1998), “How much does hiring and firing cost? Survey evidence from a sample of Italian firms”, Labour, 12 (3), pp. 427-449.

European Commission (1997), Joint Employment Report, Brussels.

Hamermesh D.S. (1999), “LEEping into the future of labour economics: the research potential of linking employer and employee data”, Labour Economics, 6, pp. 25-41.

OECD (1996), Employment Outlook, Paris.

* Research for this paper has been supported by grant from MURST n.


SUPPRESSION VERSUS DISCLOSURE
An analyst’s view

Bruno Pépin, Statistics Canada
Manufacturing, Construction and Energy Division

[email protected]

Introduction

Statistical agencies are responsible for collecting, extracting, compiling, analysing and publishing statistics on commercial, industrial, financial and many other activities. We commonly devote much time and effort to data collection and analysis, but what of its publication? One of the main responsibilities of statistical agencies is to release the information collected without thereby violating the confidence of the respondents who provide it. What does data suppression entail for the analyst? What criteria does a specialist employ in deciding which variables will be suppressed? How do we satisfy users while protecting confidentiality? This is what brings me to address you today on the matter of the suppression versus disclosure of data.

In the first part of this paper, I shall briefly explain what makes confidentiality analysis a sine qua non of the publication of information. In the second part, I shall deal with suppression in relation to publication, the role and responsibilities of the confidentiality analyst, and what suppression entails for the analyst. I shall conclude with an outline of the confidentiality analysis done for the Annual Survey of Manufactures (ASM), and a look at the results achieved with CONFID software.

Why is confidentiality analysis essential?

There are a number of reasons. Statistical information helps governments and corporations make decisions. However, it can very easily find its way into the hands of a competitor, or a person who might use it for personal ends. A company could acquire financial information about its competitors, and use it against them. Companies are acutely aware of the danger, and thus very concerned about yielding financial information. This is why most statistical agencies are regulated by legislation, and Statistics Canada is no exception. Section 17 (1) (b) of Canada’s Statistics Act reads:

“no person who has been sworn under section 6 shall disclose or knowingly cause to be disclosed, by any means, any information obtained under this Act in such a manner that it is possible from the disclosure to relate the particulars obtained from any individual return to any identifiable individual person, business or organization.”

The Act is like a contract between the agency and its respondents, protecting the latter against any disclosure that might enable a user of the data to identify them. It empowers the agency to collect statistics and to impose sanctions on companies that refuse to supply information. The Act also makes it possible to reduce the response burden and avoid double collection through the sharing of statistical data with the various provincial agencies with which data-sharing agreements have been concluded. These agencies are also regulated by provincial statutes that in essence provide the same degree of protection of confidentiality, and the same sanctions against improper disclosure. Indeed, protection of confidentiality is a key requirement of a successful statistics program.

Suppression and disclosure

Suppression and disclosure play very different roles in the data distribution process. Suppression is the result of confidentiality analysis. It consists in ensuring that published data does not reveal the identity of an individual or respondent. The purpose of disclosure, on the other hand, is to satisfy the statistical requirements of users.


Disclosure of information may seem a relatively straightforward activity, but it sometimes presents a major dilemma. Once collected, information must first be analysed by a confidentiality specialist. This person will routinely have to make choices – sometimes difficult ones – in order to reconcile the Act’s requirements with the data needs of users. In most cases, choices will be based on the size of the industry at the national and provincial levels, and on what has been published in the past. The confidentiality specialist will sometimes decide to suppress one statistic or another in order to preserve the confidentiality of yet another. This necessarily entails a loss of information. Sometimes, in order to protect one piece of information, it will be necessary to suppress several others. As table 1 shows, confidentiality analysis really has a domino effect. In this example, the analyst has to suppress three other cells in order to protect one confidential value.

Table 1

              Province 1   Province 2   Total
Industry 1        S            X          P
Industry 2        S            S          P
Total             P            P          P
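The domino effect in Table 1 can be illustrated with a small sketch. The cell values below are invented for the purpose, and the reading of the letters (X for the confidential cell, S for cells suppressed to protect it, P for published cells) is an assumption based on the surrounding text. The function simply checks which suppressed cells a user could rebuild by subtraction from the published row and column totals.

```python
# Hypothetical values for a 2 x 2 table laid out like Table 1
# (rows: industries, columns: provinces). The numbers are invented.
values = [[12, 30],
          [25, 18]]
row_totals = [sum(row) for row in values]
col_totals = [sum(col) for col in zip(*values)]


def recoverable(suppressed):
    """Return {cell: value} for suppressed cells a user can rebuild.

    A cell can be rebuilt when it is the only suppressed cell left in its
    row or in its column. Rebuilt cells then count as known, and the check
    is repeated until no further cell can be rebuilt.
    """
    hidden = set(suppressed)
    found = {}
    progress = True
    while progress:
        progress = False
        for i, j in sorted(hidden):
            alone_in_row = sum(1 for c in hidden if c[0] == i) == 1
            alone_in_col = sum(1 for c in hidden if c[1] == j) == 1
            if alone_in_row:
                others = sum(v for k, v in enumerate(values[i]) if k != j)
                found[(i, j)] = row_totals[i] - others
            elif alone_in_col:
                others = sum(r[j] for k, r in enumerate(values) if k != i)
                found[(i, j)] = col_totals[j] - others
            else:
                continue
            hidden.remove((i, j))
            progress = True
            break  # restart the scan with the updated set of hidden cells
    return found
```

With only the confidential cell (row 0, column 1) suppressed, `recoverable({(0, 1)})` returns its value; suppressing two or three cells still lets every one of them be rebuilt in turn. Only when all four interior cells are suppressed does the function return an empty result, which is the pattern shown in Table 1.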

Confidentiality analysis is very demanding and painstaking work. Absolute accuracy is required, because once information has been disclosed in error, correction is not possible. The growth in the number of customised requests and the ease with which users can now match statistical series from one survey or several have greatly increased the complexity of the analysis.

Primary confidentiality – applying the rules of confidentiality to each cell – generally causes no problems. The work is usually done by computer. However, secondary analysis – protecting confidential cells from disclosure through cross-checking – is far more complex. It is a painstaking exercise, usually performed manually. It involves making suppressions to prevent the disclosure of information in cases where a user need only strip away the data overburden to find the confidential figures. Table 2 presents a case in which the confidentiality analysis has been done. Close examination of the table will show that it reveals confidential information.

Table 2

Industry 1   Industry 2   Industry 3   Industry 4   Total
    X1           X2           X3           15          20
    15           X4           X5           20          55
    X6           10           10           X7          25
    X8            6           15           X9          35
    20           30           35           50         135

A few calculations will show that value (X1) at the top left of the table can be deduced. We know that the sum of the five values suppressed (X1, X2, X3, X4, X5) is 25 ((20+55) – (15+15+20)). We can also deduce that the four suppressed values in the middle of the table (X2, X3, X4, X5) total 24 ((30+35) - (10+10+6+15)). Thus, (X1) in the top left cell must be 1.
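The same deduction can be written out as a short computation. This is only a sketch of the arithmetic above, using the published figures of Table 2; `None` marks a suppressed cell and the variable names are illustrative.

```python
# Published cells of Table 2; None marks a suppressed cell (X1 ... X9).
rows = [
    [None, None, None, 15],
    [15,   None, None, 20],
    [None, 10,   10,   None],
    [None, 6,    15,   None],
]
row_totals = [20, 55, 25, 35]
col_totals = [20, 30, 35, 50]


def known_sum(cells):
    """Sum of the published (non-suppressed) cells."""
    return sum(v for v in cells if v is not None)


# X1 + X2 + X3 + X4 + X5: totals of rows 1-2 minus their published cells.
top_rows_hidden = sum(row_totals[:2]) - sum(known_sum(r) for r in rows[:2])

# X2 + X3 + X4 + X5: totals of columns 2-3 minus their published cells.
middle_cols = [[r[j] for r in rows] for j in (1, 2)]
middle_cols_hidden = sum(col_totals[1:3]) - sum(known_sum(c) for c in middle_cols)

# The difference isolates the top left cell.
x1 = top_rows_hidden - middle_cols_hidden
print(top_rows_hidden, middle_cols_hidden, x1)
```

The two intermediate sums reproduce the values 25 and 24 given in the text, so the difference is the confidential value X1.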

As we can see, confidentiality analysis even of a relatively simple 16-cell table represents quite a challenge for the analyst. The level of difficulty rises considerably when the analysis must be carried out on a larger scale – that of the Annual Survey of Manufactures, for example.

A change in the level of priorities in a confidentiality analysis can have major consequences for the user, as well as for the protection of information. Table 3 shows a case in which the pattern changes from year to year. As this example shows, a user could easily estimate the value for industry 1 for 1995 from the published data. On the other hand, missing data can also considerably reduce the value and usefulness of time series. In some cases, the user genuinely needs the real figure, rather than an estimate, if accuracy is very important for the purposes of a study.


Table 3

Year   Industry 1   Industry 2   Industry 3   Industry 4   Total
1995       5            S            X            14         25
1996       S            4            X            13         26
1997       3            X            3            S          24
1998       4            S            X            14         27

In sum, the role of the confidentiality analyst and publication officer consists in:

• Meeting the requirements of the Statistics Act, that is, ensuring that no confidential information is disclosed.
• Taking into consideration users’ needs, priorities and demands. For Statistics Canada this includes our partners in the Canadian statistical system – provincial statistical bureaux.
• Ensuring the quality and reliability of the information published.
• Ensuring continuity in time series.
• Publishing as much information as possible.
• Publishing information within a reasonable time.

Confidentiality analysis – Annual Survey of Manufactures (ASM)

Outline of the ASM

This survey of manufacturing industries in Canada has been carried out every year since 1917. Information is collected for some 35,000 manufacturing establishments in 236 industries based on the SIC (Standard Industrial Classification) and, beginning in 1998, 259 industries based on the NAICS (North American Industry Classification System). The survey collects and publishes financial data for some 15 variables. The main ones include inventory, input costs, value of shipments and employment. The survey also collects data on some 10,000 commodities, including raw materials, fuels, packaging and finished products. The Standard Classification of Goods (the Harmonized System) is used to classify these commodities.

Confidentiality analysis

The ASM uses confidentiality rules originally drafted by Mr. Walter E. Duffet. Additions were made by G.W. Andrews and Dominion Statistician Mr. Berlinquette. The rules are applied at company level to ensure that no published information can be used to identify a company. In most cases, the company and the plant represent the same statistical entity, although some companies have more than one plant. The statistical control used in dominance analysis to determine whether the cell is confidential is the “value of shipments of goods of own manufacture”. The value of shipments was chosen because this is the most sensitive figure for companies: when this variable meets the confidentiality criteria, it is safe to publish the information.
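The paper does not state the ASM rules themselves. Purely as an illustration of what a dominance test on the value of shipments can look like, the sketch below applies a generic (n, k) dominance rule: a cell is flagged when it has too few contributing companies or when its n largest companies account for more than k percent of the cell total. The function name and the parameter values (n = 2, k = 80, at least 3 companies) are hypothetical and are not taken from the ASM.

```python
def is_confidential(shipments, n=2, k=80.0, min_companies=3):
    """Generic (n, k) dominance check on one cell (illustrative only).

    shipments: value of shipments of each company contributing to the cell.
    n, k, min_companies: hypothetical parameters, not the actual ASM rules.
    """
    total = sum(shipments)
    if len(shipments) < min_companies or total <= 0:
        return True  # too few contributors to publish safely
    top_n = sum(sorted(shipments, reverse=True)[:n])
    return 100.0 * top_n / total > k


# A cell dominated by one large company is flagged ...
print(is_confidential([900, 50, 30, 20]))
# ... while a cell with evenly spread shipments is not.
print(is_confidential([30, 25, 25, 20]))
```

Because the rule is applied at company level, the input list would hold one aggregated value per company rather than one per plant.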

Results obtained by using software to analyse confidentiality

Until 1995, secondary analysis of confidentiality for the ASM was performed manually. Software was thereafter used to check whether there was a fault in the confidentiality scheme. It took a person with several years’ experience (of financial analysis only) two months of full-time work to accomplish this. In order to reduce the time needed for confidentiality analysis, for reference year 1996 we used CONFID software to perform it. This software was developed by Gordon Sande in 1984, while he was working for Statistics Canada. While not particularly user-friendly, it is nevertheless indispensable in confidentiality analysis for a program as complex as the ASM. Note that a precondition for the use of software of this type is that the data must be cumulative: the sum of 4-digit SIC codes (SIC 1011, 1012 and so on) must be equal to the 3-digit code (101), and so forth.
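The cumulative precondition can be verified before the data are passed to suppression software. The following is a minimal sketch, not part of CONFID: it assumes that codes are stored as strings and that the parent of a code is obtained by dropping its last digit, as in the SIC 1011 and 101 example; the function name and the sample figures are hypothetical.

```python
from collections import defaultdict


def check_additivity(values, tol=1e-9):
    """Return {parent_code: (published, sum_of_children)} for every parent
    whose published value differs from the sum of its child codes.

    values: dict mapping SIC code strings (e.g. "101", "1011") to a value.
    Assumption: the parent of a code is the code without its last digit.
    """
    child_sums = defaultdict(float)
    for code, value in values.items():
        parent = code[:-1]
        if parent in values:
            child_sums[parent] += value
    return {
        parent: (values[parent], total)
        for parent, total in child_sums.items()
        if abs(values[parent] - total) > tol
    }


# Invented figures: the 4-digit codes under 102 do not add up to 102.
sample = {
    "10": 100.0,
    "101": 60.0, "102": 40.0,
    "1011": 35.0, "1012": 25.0,
    "1021": 30.0, "1022": 15.0,
}
print(check_additivity(sample))
```

An empty result means every aggregate equals the sum of its components at the next level of detail, which is the condition described above.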

The ASM collects and publishes information nationally, regionally and provincially for all manufacturing industries (SIC 2, 3 and 4). For reference year 1997, the potential number of industries to be published by province and region is about 4,000. Of these, almost 35% are confidential after primary analysis, that is, after the confidentiality rules have been applied. A further 21% are then suppressed in order to protect the data found confidential in the primary analysis. Taking into account regions and provinces, only 44% of the data collected can be published for all industries. Nationally – for Canada as a whole – the publication rate is almost 100%, but the percentage varies considerably from province to province. The variations are mainly due to the structure of the economy and the size of the manufacturing industries located in each province.

The main purpose of using CONFID for confidentiality analysis is to ensure that no confidential data is disclosed, while publishing the largest range of data possible. As we have said, the software protects in particular the largest industries, as measured by the value of shipments. Accordingly, some relatively large industries located in the smaller provinces are suppressed to protect the same industries in other provinces where their value of shipments is greater. Given the disparity from one province to another, it is important to set priorities for confidentiality analysis so that the data requirements of our provincial partner agencies can be met.

Conclusion and Recommendation

The use of software for confidentiality analysis or for testing the confidentiality scheme is assuredly necessary for the Annual Survey of Manufactures. The sheer quantity of information to be analysed means that it is practically impossible to do the work manually and still guarantee that the data published will not reveal confidential information. The software certainly offers a number of advantages: speed, the certainty that no confidential information will be disclosed, and maximisation of the number of values that can be published. There are some constraints, however: it is not user-friendly; it cannot give more weight to the larger industries to meet users’ data requirements; and it may break the continuity of time series. Despite these drawbacks, we recommend its use.

We also recommend improvements to the suppression software:

• To make it more user friendly.
• To enable it to improve important time-series.
• To reduce the number of infeasible solutions (e.g. by introducing cell weights which are a surrogate for importance).

For the ASM, we have adopted a dual strategy. First, the largest industry groupings (2-digit SIC code) are analysed manually, and then the software is used to check the manual analysis and to analyse the levels of greater detail (3- and 4-digit SIC codes). This enables us to satisfy our users’ data requirements; to ensure there is some continuity in our distribution of information; and to be confident that none of the data we publish will disclose confidential information.