
Towards a methodological framework for estimating present population density from mobile network operator data

Fabio Ricciato∗, Giampaolo Lanzieri†, and Albrecht Wirthmann‡

European Commission – EUROSTAT, 5, rue A. Weicker, L-1211 Luxembourg

Abstract

The concept of “present population” is gaining increasing attention in official statistics. The (almost) continuous measurement of present population provides a basis to derive indicators of population exposure that are relevant in different application domains. One possible approach to measure present population exploits data from Mobile Network Operators (MNO), including CDR but also more informative (and complex) signalling records. Such data, collected primarily for network operation processes, can be repurposed to infer patterns of human mobility. Two decades of research literature have produced several case studies (mostly limited to CDR data) and a variety of ad-hoc methodologies tailored to specific datasets. Moving beyond the stage of explorative research, towards production of official statistics, requires a more systematic and sustainable approach to methodological development. Towards this aim, EUROSTAT and other members of the European Statistical System (ESS) are working towards the definition of a general Reference Methodological Framework. In this contribution we report on the methodological aspects related to the estimation of present population density, for which we present a general and modular methodological structure. Along the way, we identify a number of specific research (sub)problems requiring further attention by the research community.

1 Measuring population size and exposure

Enumerating the population is one of the oldest statistical activities, but also one with many challenges, starting from the very definition of “population”.

∗ Unit B1 Methodology, Innovation in Official Stat., [email protected]

† Unit F2 Population and migration, [email protected]

‡ Unit B1 Methodology, Innovation in Official Stat., [email protected]


The current international recommendations on population statistics favour the adoption of the concept of “usually resident population”, based on a 12-month period of actual stay in the geographic area of interest. Therefore, observations must last for at least one year before assessing the inclusion or exclusion of an individual in the population of interest, regardless of which method or technology is adopted to perform the observations. However, this is not the only population concept in official statistics (e.g., see [1, 2]), and alternative population concepts exist that can be measured in more timely ways, with shorter observation periods. The most prominent alternative concept is the “present population”, also known as “de facto population”. According to this concept, the target population is composed of all individuals who are physically present in the geographic area of interest at a selected moment in time, the so-called “reference time”. If measured continuously over time, the “present population” concept would be the appropriate basis for the measurement of “population exposure”, an aspect of particular relevance in all those domains where the physical interactions between individuals, or between individuals and external factors, are of interest (e.g., environmental or epidemiological studies). The recent discussions on the population concept to be adopted in future official statistics may lead to increased relevance of the de facto population as a complementary source of information on population dynamics (see [3, 4]). Indeed, the evolution of concepts and data sources used for measuring the population might eventually break the relation between physical presence on a territory and inclusion in its population count. In any case, the observation of physical presence of individuals on the territory could provide valuable inputs to the estimation of population based on different concepts.

The above considerations motivate the attention of statisticians to methods that allow measuring (or at least estimating) the size of the present population at a given reference time, and possibly its exposure, over large territories (a whole country) and in a timely manner. To this aim, mobile network operator (MNO) data represent a promising data source, as evidenced by several research studies and academic literature during the last two decades. Moving from proof-of-concept case studies towards an official statistics production setting requires addressing a number of issues, including privacy protection and sustainability of data provision. It also requires a more systematic methodological approach, calling for the development of a proper reference methodological framework, which is the focus of the present contribution.

2 Why a Reference Methodological Framework?

A reference methodological framework is an abstract organization of the data flow that is logically antecedent to the development of particular methods. A specific method can be directly developed for a specific input data set and for a well understood use-case (desired output). This approach is well suited to case studies, proof-of-concept and academic work. But when designing a statistical production process, we must cope with a number of design challenges calling for a more systematic approach. The design should take into account the following characteristics of the input MNO data:

• Heterogeneity. Data are highly heterogeneous between different MNOs in many respects. First, while most previous studies have considered Call Detail Records (CDR), an increasing number of MNOs are now able to acquire signalling data that are more informative but also more complex than CDR. Furthermore, the data format and the detailed data generation process (hence their information content) depend on the particular configuration and operational conditions of the network infrastructure, all aspects that vary across MNOs;

• Multi-purpose. MNO data can be used to extract information serving multiple use-cases and application domains. Therefore, a modular approach is needed to organize the data analytic flow into processing modules that can be modified and adapted to different purposes.

In order to address these challenges, a general Reference Methodological Framework (RMF for short) is under development in Eurostat with the collaboration of other members of the European Statistical System (ESS). The RMF design follows the principles of functional layering and the so-called hourglass model that lie at the foundation of the Internet. The processing flow, from raw input data up to the desired statistical indicators (output data), is organized into three macro-layers as sketched in Fig. 1. At the bottom, the Data Layer (D-layer) embeds processing modules whose implementation logic is highly specific to the particular MNO infrastructure, to the type and format of the available input data, or otherwise dependent on technology-specific details. For each module in this layer, the implementation logic must be developed in close cooperation with MNO engineers and telecommunications experts in order to maximize the information content and mitigate some sources of error. At the top, the Statistics Layer (S-layer) includes modules that are dependent on the particular statistical application and desired indicators. Between the input-specific D-layer and the output-specific S-layer, the intermediate Convergence Layer (C-layer) is designed to be input-agnostic and output-agnostic. In this way, the C-layer decouples the complexity and heterogeneity of the other domains, enabling independent development, hence evolvability and portability of the processing methods.

For a particular target application (for example, the problem of spatial density estimation that is the focus of this work), establishing a unified modular framework brings several further advantages. It helps the definition and refinement of novel estimation methods or variants thereof. It facilitates the comparison between different methods. By establishing a common terminology, formalism and conceptual frame, it facilitates the discussion and organisation of work among different researchers and development teams.

We remark that the RMF depicted in Fig. 1 is relevant only during the phase of methodological development, i.e., for building the processing methods and their software implementation. The RMF is not intended to prescribe where each function is to be physically executed during the computation phase. In one possible deployment scenario, the lower part of the processing workflow – including the whole D-layer, C-layer and the bottom part of the S-layer – could be physically executed at the MNO premises. The intermediate data that are produced at some logical point within the S-layer are then passed to the statistical office, where the upper part of the S-layer functions is executed. From such a (purely illustrative) example it should be clear that the layer interfaces do not necessarily map to borders between organizations. In other words, the proposed RMF helps to determine (at development phase) what logical function takes place in each module and how such function is implemented, and these aspects should be seen as independent from the issues of where (and by whom) the function is physically executed during the production phase.

The general RMF concept and design principles are elaborated elsewhere (see [5, 6]). In this contribution we focus on a particular use case, namely the estimation of present population density. A modular view of the main processing stages is sketched in Fig. 2. Each module is briefly described in the next section.

Before proceeding further, it is useful to recall that processing MNO data for inferring human mobility (and presence) involves coping with three main dimensions of uncertainty:

• Spatial uncertainty: a generic MNO observation does not refer in general to a point position, but rather to an extended location representing the expected (or assumed) radio coverage area of a particular radio cell¹ or slice thereof;

• Temporal uncertainty: the location of a generic mobile user cannot be observed continuously, but only at discrete event-generation times;

• Population coverage uncertainty²: the target population (humans) does not correspond exactly to the observed population (mobile devices). Consequently, under-coverage, over-coverage and double-counting errors apply, corresponding respectively to people carrying no mobile device, Internet-of-Things (IoT) devices not carried by humans, and people carrying multiple devices or subscriptions.

3 A modular methodology for density estimation

The methodological framework sketched in Fig. 2 is designed to take as input multiple sources of MNO data: event-based records (CDR or, preferably, signalling records), network topology data (position and type of radio cells, radio coverage maps, transmit antenna configuration parameters³) and possibly other auxiliary information about subscriber groups, terminal types, etc.

¹ The “radio cell” represents a fundamental building block of mobile networks, which in fact are also called cellular networks. Every radio cell is uniquely identified by the Cell Global Identity (CGI). Every radio cell is associated with a transmit antenna, but the association is not 1:1, as a single antenna can be used to transmit multiple radio cells (multiplexing).

² Throughout the paper we use the term “coverage” in two distinct and independent ways, in the spatial domain (radio coverage, coverage area) and in terms of population (population coverage, coverage errors). The meaning of each occurrence should be clear from the context.

Figure 1: The layered hourglass model at the foundation of the Reference Methodological Framework under development by Eurostat. (Diagram labels, top to bottom: Statistics S-Layer, with heterogeneity, complexity, multiplicity and variability of statistical indicators across different SO; Convergence C-Layer, with parsimony, stability and few common definitions; Data D-Layer, with heterogeneity, complexity, multiplicity and variability of data sources across different data providers.)

The first processing stage (filtering and transformation box in Fig. 2) aims at excluding IoT devices (and in general mobile devices other than phones) in order to minimize the over-coverage error. The identification of such devices is achieved through a set of filtering rules based on various network-level identifiers⁴. The detailed logic and its practical implementation are highly network-specific. As for all other modules at the D-layer, the RMF specifications should provide a set of guidelines, leaving it to MNO engineers to particularise and adapt the implementation to the particular network configuration and available data.
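The filtering rules described above might look like the following minimal sketch. The record layout, the rule tables `IOT_TACS` and `IOT_APN_SUFFIXES`, and all identifier values are hypothetical and serve only as illustration; real rules would be compiled together with MNO engineers. The only fact relied upon is that the TAC is the leading 8 digits of the IMEI.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EventRecord:
    imsi: str         # subscriber identifier (pseudonymised in practice)
    imei: str         # 15-digit device identifier; its first 8 digits are the TAC
    apn: str          # access point name used by the data session
    cell_id: str      # radio cell that generated the event
    timestamp: float  # event time, seconds since an arbitrary epoch


# Hypothetical rule tables (assumptions, not taken from any real network).
IOT_TACS = {"35999900"}               # TACs assumed to belong to non-phone devices
IOT_APN_SUFFIXES = (".m2m", ".iot")   # APN suffixes assumed reserved for machine traffic


def is_probable_phone(rec: EventRecord) -> bool:
    """Return False when any rule flags the record as a non-phone device."""
    if rec.imei[:8] in IOT_TACS:
        return False
    if rec.apn.lower().endswith(IOT_APN_SUFFIXES):
        return False
    return True


def filter_records(records):
    """Keep only the records that pass all filtering rules."""
    return [r for r in records if is_probable_phone(r)]


records = [
    EventRecord("imsi-1", "353456789012345", "internet.example", "cell-17", 0.0),
    EventRecord("imsi-2", "359999001234567", "internet.example", "cell-17", 5.0),
    EventRecord("imsi-3", "353456789012346", "meter.m2m", "cell-42", 9.0),
]
phones = filter_records(records)
```

Keeping the rules in plain tables, separate from the filtering logic, is one way to let each MNO adapt them without touching the code.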

³ Note that antenna configuration can be static or dynamic (adaptive). In the latter case, the power and direction of transmission (and consequently the size and location of the coverage area) can be varied in time in order to adapt to the current traffic conditions. Antenna steering can be implemented mechanically or electronically (adaptive arrays).

⁴ For example IMSI, IMEI, TAC, APN, as defined in the 3GPP standards (for more details see e.g. http://www.qtc.jp/3GPP/Specs/23003-930.pdf).

Figure 2: Modular structure of the density estimation procedure. (Diagram labels. D-layer, input-specific, infrastructure-specific logic: Mobile Network Infrastructure with event monitoring system, network topology data, config. data, customer data and device type data; raw event records; filtering and transformation; selected event records; event geo-location (spatial mapping). C-layer, input-agnostic and output-agnostic: sequence of geo-located events (C-paths); point marked (*); other use cases. S-layer, output-specific, use case-specific logic: space-time interpolation; reference user location; density inference; estimated spatial density.)

Figure 3: Graphical representation of the C-path components (event locations and bounding areas) and interpolated location. (Axes: x, y, t. Diagram labels: observed event location a1 at t1; observed event location a2 at t2; bounding areas B1 and B2; bounding area in interval [t1, t2]; interpolated event location at t*.)

Figure 4: Relation between the different geo-referenced stages. The geo-location stage maps individual records to event locations: it is logically placed in the D-layer and makes the best possible use of the available information from the MNO infrastructure. The inference stage takes as input the event locations and associated counters, and delivers as output the intermediate estimates at the level of individual tiles (or super-tiles). This stage is logically placed at the S-layer, and might also take as input additional information in the form of constraints or priors (e.g., land use maps, transportation network). In the final stage, the final estimates are obtained at the desired level of administrative units by simple aggregation. (Diagram labels: observed event records; GEOLOCATION METHOD; event locations (e.g. cell coverage areas); INFERENCE; external data, auxiliary maps (e.g. land use, buildings, roads, transport network); reference estimation grid with fixed, regular, small tiles (e.g. INSPIRE 100x100 m); AGGREGATION; reporting grid (administrative units).)

The second stage (event geo-location in Fig. 2) is central to the minimization of spatial uncertainty. Every event record contains information about the radio cell originating the message (call, SMS or signalling exchange). The coverage area of individual radio cells can be determined, at least approximately, based on auxiliary information about radio cell configuration and topology that is normally available in some form to MNO engineers (antenna position, orientation, beamwidth, transmit power, radio cell type, etc.). Therefore, every individual record (event) can be referred to a specific region, called event location hereafter, that basically corresponds to the nominal cell coverage area or part thereof⁵. Several different strategies can be chosen to calculate (or predict) the cell coverage area, depending on which approach is taken for modelling the physical process of device-to-radio cell association, leading to different geo-location options within the proposed framework. The systematic comparison of alternative geo-location methods is still an open research task. At one extreme, the simplest approach is to assume that the mobile device always connects to the closest antenna, leading to the well known approach based on Voronoi tessellation seeded by tower location [8, 9]. Despite its popularity in academic literature, such a modelling approach is overly simplistic: it does not take into account some very basic aspects of mobile network operation, first and foremost the fact that radio cells with very different transmit power and coverage ranges are superimposed in so-called multi-layer deployments, with small cells and large cells co-existing in the same area. The mobile terminal does not always select the strongest signal, nor does the strongest signal always correspond to the closest antenna. Missing such fundamental phenomenological aspects, as the plain Voronoi modelling approach does, represents a clear source of model-mismatch error.
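To make the closest-antenna assumption concrete, the sketch below assigns each grid tile to its nearest tower, which is a discrete approximation of the Voronoi tessellation seeded by tower location. The coordinates and the function names `nearest_tower` and `tiles_per_location` are illustrative assumptions. The code deliberately reproduces the simplistic model criticised above: it ignores transmit power, cell type and multi-layer deployments.

```python
import math


def nearest_tower(tile_centres, towers):
    """For each tile centre, return the index of the closest tower (Euclidean distance)."""
    assignment = []
    for tx, ty in tile_centres:
        distances = [math.hypot(tx - ax, ty - ay) for ax, ay in towers]
        assignment.append(distances.index(min(distances)))
    return assignment


def tiles_per_location(assignment):
    """Group tile indices by the tower they were assigned to."""
    groups = {}
    for tile, tower in enumerate(assignment):
        groups.setdefault(tower, []).append(tile)
    return groups


# Five tile centres on a line and three towers (coordinates in metres, made up).
tiles = [(50, 0), (150, 0), (250, 0), (350, 0), (450, 0)]
towers = [(40, 10), (200, 10), (420, 10)]

assignment = nearest_tower(tiles, towers)
locations = tiles_per_location(assignment)
```

Every tile ends up in exactly one event location, so the resulting locations are mutually disjoint by construction.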

Other geo-location variants mitigate this problem by taking into account, implicitly or explicitly, the multi-layer nature of radio network deployments and the heterogeneity of radio cell size. Examples include the approach proposed in [10] and those based on so-called Best Server Area (BSA) maps⁶ as explored in [11]. Still, such improved approaches are also conceived to build event locations that are mutually disjoint, leading to alternative forms of tessellation of the geographical space.

Only a few pioneering works have started to consider models of overlapping locations that embrace the multi-layer nature of radio cell deployment, and the fact that radio cell coverage areas overlap by design [12, 13, 14, 15]. The choice between overlapping and non-overlapping event locations (tessellations) has important consequences for the choice of the inference method, as discussed below.

⁵ The definition of “event location” depends on which variables can be observed in the MNO data at hand. If only the radio cell identifier is observed, then the event location corresponds to the radio cell coverage area. However, if additional variables can be observed, then the event location can be further narrowed down. For instance, in case of signalling messages extracted from the Radio Access Network (RAN), one may extract the so-called Timing Advance (TA), which provides a direct indication of the distance between the mobile device and the radio antenna. In other cases, proprietary systems are deployed in the network to extract accurate point positions of mobile devices obtained through multilateration methods (e.g. the LocHNESSs platform in [7] or commercial solutions for Location-Based Services (LBS)). However, such precise data are in general available only for selected groups of opt-in users and/or in limited geographical areas.

⁶ BSA maps associate every point in space to the single radio cell with the strongest received signal strength. The latter is typically predicted by ad-hoc software tools based on radio propagation models, terrain data, radio cell configuration parameters, etc. In some cases, field measurements are used to improve the BSA predictions. BSA maps are typically used for radio planning and radio optimisation. An extension of the BSA concept is given by the N-Best Servers Area (N-BSA) maps, where each point is associated to the N ≥ 1 strongest radio cells. BSA can be seen as a special case of N-BSA for N = 1. BSA maps represent tessellations (non-overlapping locations), while N-BSA maps with N ≥ 2 lead to overlapping locations.

For a generic mobile device, the output of the geo-location stage is basically a sequence of event observations referred to discrete points in time (event timestamps). Each observation is associated with a more or less extended area (called event location) as determined by the geo-location module. It is important to consider that we do not know where the mobile device was located (nor whether or where it moved) within the interval between two consecutive observation times. However, if signalling data are available from the D-layer, we can assume that during such an interval the mobile device remained confined within a certain area, called bounding area in our RMF (see Fig. 3), consisting of a predefined set of neighboring radio cells⁷. The collection of event timestamps, event locations and possibly bounding areas for the same mobile device constitutes the so-called C-path (for C-layer path) in the proposed RMF.

To maximize portability and inter-operability, a common C-path format should be adopted to represent data from different MNOs. The definition of a standard format would allow algorithm producers (including statisticians and researchers) to develop software implementations that can run on data from different MNOs.
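As an illustration of what a common C-path representation could contain, the following sketch models a C-path as a time-ordered list of observations. It is not a proposed standard: the class names, the field names and the choice of encoding locations as sets of grid tile identifiers are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Observation:
    timestamp: float                                 # event time
    location: FrozenSet[int]                         # event location, as a set of tile ids
    bounding_area: Optional[FrozenSet[int]] = None   # tiles the device stayed within until
                                                     # the next event, when known


@dataclass
class CPath:
    device_id: str                                   # pseudonymous device identifier
    observations: List[Observation] = field(default_factory=list)

    def add(self, obs: Observation) -> None:
        """Insert an observation and keep the path ordered by timestamp."""
        self.observations.append(obs)
        self.observations.sort(key=lambda o: o.timestamp)


path = CPath("dev-1")
path.add(Observation(20.0, frozenset({3, 4})))
path.add(Observation(10.0, frozenset({1}), frozenset({1, 2, 3, 4})))
```

Because nothing in the structure refers to radio cells or network technology, algorithms written against it would not depend on the MNO that produced the data.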

In order to mitigate double-counting errors, one possibility is to resort to probabilistic matching: heuristic algorithms could be developed to identify pairs of strongly similar C-paths along a sufficiently long observation interval, that are likely to be associated with the same individual (if present, such a module would be placed at the point marked with an asterisk in Fig. 2). These approaches can mitigate, but not completely eliminate, population coverage errors. The statistical model at the upper S-layer should take coverage errors into account, and possibly quantify them in order to adjust the estimates, e.g. by resorting to external reference data from administrative records (as done e.g. in [8, 9]) or ad-hoc surveys.
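One very simple matching heuristic, given here only as a hypothetical example, scores two C-paths by the share of commonly observed time slots in which both devices were seen at the same event location. The slotting of time, the threshold and the minimum number of common slots are arbitrary assumptions; no such rule is prescribed by the framework.

```python
def cooccurrence_score(path_a, path_b):
    """Share of common time slots in which both devices share the same event location.

    Each path is a dict mapping a time slot (e.g. an hour index) to a location label.
    """
    common = set(path_a) & set(path_b)
    if not common:
        return 0.0
    same = sum(1 for slot in common if path_a[slot] == path_b[slot])
    return same / len(common)


def candidate_pairs(paths, threshold=0.9, min_common=24):
    """Return device pairs whose paths overlap long enough and agree often enough."""
    ids = sorted(paths)
    pairs = []
    for k, a in enumerate(ids):
        for b in ids[k + 1:]:
            n_common = len(set(paths[a]) & set(paths[b]))
            if n_common >= min_common and cooccurrence_score(paths[a], paths[b]) >= threshold:
                pairs.append((a, b))
    return pairs


# Toy data: devices "a" and "b" always move together, "c" does not.
paths = {
    "a": {0: "L1", 1: "L2", 2: "L3"},
    "b": {0: "L1", 1: "L2", 2: "L3"},
    "c": {0: "L9", 1: "L2", 2: "L8"},
}
pairs = candidate_pairs(paths, threshold=0.9, min_common=3)
```

A rule of this kind cannot distinguish one person carrying two devices from two people travelling together, which is one reason such matching can only mitigate double counting.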

The next processing task is to determine the user location at the reference time t* from the available observations at neighboring observation times. This function represents a sort of interpolation in the joint space-time domain. Again, different strategies can be considered for this module. At one extreme, we might just pick the event location closest in time to t* (zero-order interpolation) to serve as reference location, and this simplistic approach would probably be sufficient for initial implementations. More sophisticated interpolation methods might take into account external information about e.g. road network maps, urban layout, transportation network schedules, and of course the bounding areas introduced above, if available. If sufficiently frequent event data are available, interpolation might be combined with transport mode inference (as done e.g. in [12] for highway drivers). The elaboration of more advanced interpolation strategies represents an interesting research (sub)problem per se.
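Zero-order interpolation amounts to a nearest-in-time lookup, as in the minimal sketch below. The function name and the representation of a C-path as a list of `(timestamp, location)` pairs are assumptions made for illustration.

```python
def zero_order_location(observations, t_ref):
    """Return the event location whose timestamp is closest to the reference time.

    observations: non-empty list of (timestamp, location) pairs.
    When two observations are equally close, the one listed first is returned.
    """
    if not observations:
        raise ValueError("the C-path contains no observations")
    _, location = min(observations, key=lambda obs: abs(obs[0] - t_ref))
    return location


c_path = [(100.0, "A"), (400.0, "B")]
loc_early = zero_order_location(c_path, 200.0)
loc_late = zero_order_location(c_path, 300.0)
```

A practical implementation would probably also cap the allowed time gap, so that a device last seen many hours before t* is not placed at a stale location. That cap is omitted here.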

⁷ The concept of Bounding Area corresponds to the notions of Location Area, Routing Area, Tracking Area List and Registration Area, respectively, in 2G, 3G, 4G and 5G technology.

From the observed (or interpolated) event locations for the selected set of mobile devices, the next task is to compute an estimate of the (unknown) spatial density. We propose to refer the estimates to a regular grid of small units, or tiles⁸, e.g. the INSPIRE grid at Level 11 with tile size 100 m × 100 m [16]. From fine-grained estimates at the tile level, density estimates at any desired (coarser) level of administrative units can be computed straightforwardly by simple aggregation. The resulting workflow is exemplified in Fig. 4. By splitting the estimation workflow into two stages – namely (i) from event locations to fixed tiles in the estimation grid, and (ii) from tiles to the final desired level of statistical/administrative units – we decouple the most complex estimation task (i) from the particular use case and reporting administrative level, enabling the reuse of per-tile estimates across different application domains.
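Stage (ii) is a plain sum of per-tile estimates within each reporting unit. The sketch below assumes, for simplicity, that every tile lies entirely within one administrative unit; tiles straddling a boundary would need an apportioning rule that is not shown. Tile and unit identifiers are made up.

```python
from collections import defaultdict


def aggregate(tile_estimates, tile_to_unit):
    """Sum per-tile estimates into totals per administrative unit."""
    totals = defaultdict(float)
    for tile, value in tile_estimates.items():
        totals[tile_to_unit[tile]] += value
    return dict(totals)


tile_estimates = {"t1": 2.0, "t2": 3.5, "t3": 1.0}   # estimated devices per tile
tile_to_unit = {"t1": "A", "t2": "A", "t3": "B"}     # tile membership in reporting units
unit_totals = aggregate(tile_estimates, tile_to_unit)
```

The same per-tile estimates can be aggregated again with a different `tile_to_unit` mapping, which is the reuse across reporting levels that the two-stage split is meant to enable.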

The estimation task (i) is mapped to the module “density inference” in Fig. 2. External maps (e.g., buildings, land use, roads) can be used at this stage to increase spatial accuracy, as depicted in Fig. 4. For instance, they can be used as priors in Bayesian inference methods.

4 A semi-formalized view of density estimation

Hereafter we describe a compact model of the data generating process for the problem at hand. Let the $j$th element $u_j$ of the column vector $\mathbf{u}$ denote the unknown number of mobile devices in grid tile $j$ (tile count). Let the $i$th element $c_i$ of the column vector $\mathbf{c}$ denote the observed number of mobile devices associated to event location $i$ (location count). Denote by $p_{ij}$ the probability that a mobile device located in grid tile $j$ will be mapped to (observed in) event location $i$, as sketched in Fig. 5, and gather the assignment probabilities $p_{ij}$ into the matrix $\mathbf{P}$.

The (measured) location count vector $\mathbf{c}$ can be interpreted as the realization of a random vector whose expected value is given by:

$$E(\mathbf{c}) = \mathbf{P}\,\mathbf{u} \qquad (1)$$

The stochastic characterisation of the random vector $\mathbf{c}$ is part of our ongoing work. In the estimation problem we must solve for the estimand $\mathbf{u}$ given the vector of measurement data $\mathbf{c}$ and the model matrix $\mathbf{P}$ (inversion problem). The estimate $\hat{\mathbf{u}}$ can be written in general as:

$$\hat{\mathbf{u}} = g(\mathbf{P}, \mathbf{c}) \qquad (2)$$

where $g(\cdot)$ denotes the estimator of choice. It is important to remark that in cases of practical interest the number of tiles is (much) larger than the number of event locations, and therefore the associated inversion problem is under-determined. Furthermore, it is evident that we must constrain the estimand variables to be non-negative, i.e. $u_j \ge 0, \forall j$.
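The forward model (1) and the under-determination of its inverse can be checked on a toy case. The sketch below uses the equal-probability matrix given as an example in Table 1 (3 event locations, 5 tiles); the tile counts `u_a` and `u_b` are invented for illustration. Exact fractions are used so that the comparison is not affected by rounding.

```python
from fractions import Fraction as F

# Model matrix P: rows are event locations, columns are tiles (Table 1, equal probabilities).
P = [
    [F(0), F(0), F(0), F(1, 3), F(1, 2)],
    [F(1), F(1, 2), F(1, 2), F(1, 3), F(0)],
    [F(0), F(1, 2), F(1, 2), F(1, 3), F(1, 2)],
]


def expected_counts(P, u):
    """Forward model of equation (1): E(c) = P u."""
    return [sum(p_ij * u_j for p_ij, u_j in zip(row, u)) for row in P]


u_a = [10, 4, 6, 9, 8]   # one hypothetical vector of tile counts
u_b = [10, 6, 4, 9, 8]   # same total, but tiles 2 and 3 swapped

c_a = expected_counts(P, u_a)
c_b = expected_counts(P, u_b)
```

Both tile-count vectors yield the same expected location counts, so no estimator can tell them apart from the counts alone. Since every column of this matrix sums to one, the total number of devices is preserved by the mapping.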

Equation (2) shows that establishing an estimation procedure entails two distinct design choices that map to logically sequential sub-problems in the overall framework:

• Construction of the model matrix $\mathbf{P}$ in the geo-location block. This task includes the determination of event location maps and their probabilistic binding to tiles;

• Choice of a particular estimator $g(\cdot)$ in the density inference block.

⁸ We use the term “tile” to refer to the generic grid unit in order to avoid confusion with the term “cell”, which we reserve for radio cells.

Table 1: Examples of model matrix $\mathbf{P}$ corresponding to different geo-location approaches.

| Geo-location approach | Example | Notes |
|---|---|---|
| Non-overlapping event locations (tessellation) | $\mathbf{P} = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 1 \end{pmatrix}$ | Fig. 5(a). Applicable to all Voronoi variants [8, 10] and BSA [11]. |
| Overlapping event locations with equal probabilities | $\mathbf{P} = \begin{pmatrix} 0 & 0 & 0 & 1/3 & 1/2 \\ 1 & 1/2 & 1/2 & 1/3 & 0 \\ 0 & 1/2 & 1/2 & 1/3 & 1/2 \end{pmatrix}$ | Fig. 5(b). Used in [13, 17]. |
| Overlapping event locations with unequal probabilities | $\mathbf{P} = \begin{pmatrix} 0 & 0 & 0 & 0.25 & 0.62 \\ 1 & 0.65 & 0.43 & 0.6 & 0 \\ 0 & 0.35 & 0.57 & 0.15 & 0.38 \end{pmatrix}$ | Fig. 5(b). Used in [15, 14]. |

Different methods can be adopted in each block. In case of non-overlapping locations (tessellation), every tile is assigned to only one single location, as exemplified in Fig. 5(a), therefore the elements of $\mathbf{P}$ are binary. In case of overlapping locations (Fig. 5(b)) the non-zero elements of $\mathbf{P}$ can take any value in the interval $(0, 1]$, depending on the geo-location model of choice. For instance, the method presented in [13, 17] assumes that all radio cells covering a generic tile have the same probability of being selected by a mobile device placed in that tile, resulting in fractional values for the elements of $\mathbf{P}$ that are equal along each column. Instead, in the approach proposed by [15, 14] (and adopted also by [18]) the elements of $\mathbf{P}$ are interpreted as likelihoods and are tied to the radio signal strength of the competing cells, resulting in unequal, non-fractional values. These three options are exemplified in Table 1.
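The equal-probability construction can be sketched as follows: if a tile is covered by k event locations, each of them receives probability 1/k in that tile's column. The coverage sets below are chosen to match the layout of Fig. 5(b), with 0-based indices; the function name and data layout are assumptions made for illustration and are not taken from [13, 17].

```python
from fractions import Fraction


def equal_probability_matrix(coverage, n_locations):
    """Build P from coverage sets, giving equal weight to all locations covering a tile.

    coverage[j] is the set of event-location indices (0-based) covering tile j.
    """
    P = [[Fraction(0)] * len(coverage) for _ in range(n_locations)]
    for j, cells in enumerate(coverage):
        for i in cells:
            P[i][j] = Fraction(1, len(cells))
    return P


# Tile 0 is covered only by location 1; tile 3 by all three locations; and so on.
coverage = [{1}, {1, 2}, {1, 2}, {0, 1, 2}, {0, 2}]
P = equal_probability_matrix(coverage, n_locations=3)
```

The result coincides with the second example in Table 1. A signal-strength-based variant would replace the constant 1/k with weights derived from a propagation model, normalised along each column.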

If two tiles $j_1$ and $j_2$ have equal assignment probabilities $p_{ij_1} = p_{ij_2}\ \forall i$ and the same priors, then they are indistinguishable from each other. In other words, they are perfectly collinear and we cannot identify differences between their respective estimates. In this case it makes sense to merge both tiles into a single super-tile, and then compute a single estimate for the whole super-tile⁹. Analytically, this corresponds to merging together identical columns of matrix $\mathbf{P}$. We refer to this operation by the term consolidation. Note that the consolidation process does not necessarily imply that the resulting (consolidated) matrix is full rank. In other words, it does not guarantee that the resulting problem is fully identifiable.

⁹ In [13] the term section was used in place of super-tile.
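Consolidation can be sketched as grouping the tiles by their column of $\mathbf{P}$. The sketch below ignores priors, i.e. it assumes that tiles with identical columns also share the same prior; the function name and return format are illustrative assumptions.

```python
from fractions import Fraction as F


def consolidate(P):
    """Merge identical columns of P.

    Returns the consolidated matrix and, for each super-tile, the list of
    original tile indices it contains (in order of first appearance).
    """
    n_rows, n_cols = len(P), len(P[0])
    groups = {}  # column (as a tuple) -> indices of the tiles sharing that column
    for j in range(n_cols):
        column = tuple(P[i][j] for i in range(n_rows))
        groups.setdefault(column, []).append(j)
    columns = list(groups)
    P_cons = [[col[i] for col in columns] for i in range(n_rows)]
    return P_cons, list(groups.values())


# Tessellation example of Fig. 5(a): binary matrix.
P_tess = [[1, 0, 0, 0, 0], [0, 1, 1, 0, 0], [0, 0, 0, 1, 1]]
P_tess_cons, tess_groups = consolidate(P_tess)

# Overlapping example with equal probabilities (Table 1).
P_over = [
    [F(0), F(0), F(0), F(1, 3), F(1, 2)],
    [F(1), F(1, 2), F(1, 2), F(1, 3), F(0)],
    [F(0), F(1, 2), F(1, 2), F(1, 3), F(1, 2)],
]
P_over_cons, over_groups = consolidate(P_over)
```

In the overlapping example only tiles 2 and 3 (0-based indices 1 and 2) are merged. The consolidated matrix still has four columns against three rows, so the problem remains under-determined, in line with the remark that consolidation does not guarantee identifiability.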

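If $\mathbf{P}$ has binary elements, the data generating process becomes deterministic and the matrix resulting from consolidation can be reduced to the identity matrix. The estimation problem then becomes trivial: each super-tile coincides with one event location, and its estimate is simply the corresponding location count.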

In case of overlapping locations (Fig. 5(b)) the estimation problem is non-trivial, and exploring various estimation approaches remains an open research problem. One possible estimation method was elaborated in [13, Section 5.3], where a Maximum Likelihood (ML) estimator was developed based on a hierarchical generative model: the (unknown) elements of $\mathbf{u}$ are modelled by random variables with Multinomial distribution. Recently, the authors of [18] have proposed to apply to this problem an ML estimator that was developed earlier in the field of emission tomography [19]: this method is also based on a hierarchical generative model, but here the elements of $\mathbf{u}$ are modelled as Poisson (instead of Multinomial) random variables. Another possible option is to resort to Bayesian approaches, as done recently by the mobloc package [14] (see [15] for further details). All such estimation approaches are fully coherent with the overall methodological workflow described above, and more research is needed to systematically compare these methods and to develop new ones. Also, further work is needed to extend the estimation model(s) to take into account population coverage errors, in addition to spatial uncertainty.
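For orientation, the sketch below implements the well-known expectation-maximisation iteration for Poisson data from emission tomography, often called MLEM. It is written from the textbook form of the update and is not taken from [18] or [19], whose details may differ. The iteration count, the flat starting point and the toy data are assumptions.

```python
def mlem(P, c, n_iter=200):
    """Multiplicative EM update for c_i ~ Poisson(sum_j p_ij u_j).

    P: list of rows (event locations) over columns (tiles); c: observed location counts.
    Returns non-negative tile estimates.
    """
    n_loc, n_tiles = len(P), len(P[0])
    col_sums = [sum(P[i][j] for i in range(n_loc)) for j in range(n_tiles)]
    u = [sum(c) / n_tiles] * n_tiles  # flat, strictly positive starting point
    for _ in range(n_iter):
        # Expected location counts under the current estimate (equation (1)).
        expected = [sum(P[i][j] * u[j] for j in range(n_tiles)) for i in range(n_loc)]
        new_u = []
        for j in range(n_tiles):
            if col_sums[j] == 0:
                new_u.append(0.0)  # tile not covered by any event location
                continue
            # Back-project the ratio between observed and expected counts.
            back = sum(P[i][j] * c[i] / expected[i] for i in range(n_loc) if expected[i] > 0)
            new_u.append(u[j] * back / col_sums[j])
        u = new_u
    return u


# Toy problem: equal-probability matrix of Table 1 and made-up location counts.
P = [
    [0.0, 0.0, 0.0, 1 / 3, 1 / 2],
    [1.0, 1 / 2, 1 / 2, 1 / 3, 0.0],
    [0.0, 1 / 2, 1 / 2, 1 / 3, 1 / 2],
]
c = [7.0, 18.0, 12.0]
u_hat = mlem(P, c)
```

The update is multiplicative, so the estimates stay non-negative without an explicit constraint, and the total count is preserved when every column of the matrix sums to one. Tiles with identical columns receive identical estimates from a flat start, which reflects the collinearity discussed above.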

5 Outlook on future work

In this contribution we have reported on the ongoing activity in EUROSTAT in collaboration with other members of the ESS (in the context of the ESSnet on Big Data 2018-2020) towards the definition of a general Reference Methodological Framework for processing MNO data for official statistics, with a focus on the task of estimating the density of present population. Along the way, we have indicated a number of research sub-problems that deserve further attention by the research community.

In parallel to the methodological work, other lines of activity are addressing issues related to data access, including the exploration of partnership models between MNO and statistical institutes [20]. Another important line of work focuses on the potential adoption of privacy-preserving computation models, and in particular Secure Multi-party Computation, for the fusion of input data from multiple (and possibly competing) MNO.

Acknowledgments

The clarity and readability of this paper have improved thanks to interaction with, and feedback from, colleagues David Salgado, Martijn Tennekes, Benjam Sakarovitch and Roberta Radini.

Figure 5: Assignment probabilities from tiles to event locations in the data generating process. Panels: (a) non-overlapping event locations (tessellation); (b) overlapping event locations. [Figure not reproduced: each panel links tiles (grid cells) 1 to 5 to event locations (e.g., radio cells) 1 to 3 through assignment probabilities p_{i,j}.]

References

[1] G. Lanzieri. Population definitions at the 2010 censuses round in the countries of the UNECE region. In 15th Meeting of the UNECE Group of Experts on Population and Housing Censuses, Geneva, Switzerland, 2013. http://www.unece.org/fileadmin/DAM/stats/documents/ece/ces/ge.41/2013/census_meeting/10_E_x20_Aug_WEB.pdf.

[2] G. Lanzieri. On a new population definition for statistical purposes. In 15th Meeting of the UNECE Group of Experts on Population and Housing Censuses, Geneva, Switzerland, 2013. http://www.unece.org/fileadmin/DAM/stats/documents/ece/ces/ge.41/2013/census_meeting/Eurostat_introductory_paper_on_new_population_definition.pdf.

[3] G. Lanzieri. Alternative definitions of population for future demographic and migration statistics. In 13th Meeting of the Task Force on the future EU censuses of population and housing, Luxembourg, 2018.


[4] G. Lanzieri. Towards a single population concept for international purposes: definitions and statistical architecture. In 16th Meeting of the Task Force on the future EU censuses of population and housing, Luxembourg, 2019.

[5] F. Ricciato. Towards a reference methodological framework for processing MNO data for official statistics. In 15th Global Forum on Tourism Statistics, Cusco, Peru, November 2018. https://tinyurl.com/ycgvx4m6.

[6] F. Ricciato. Towards a reference methodological framework for the processing of mobile network operator data for official statistics. Keynote Talk at Mobile Tartu 2018. https://tinyurl.com/y75psqs4.

[7] F. Calabrese, M. Colonna, P. Lovisolo, D. Parata, and C. Ratti. Real-time urban monitoring using cell phones: A case study in Rome. IEEE Transactions on Intelligent Transportation Systems, 12(1), 2011.

[8] P. Deville et al. Dynamic population mapping using mobile phone data. PNAS, 111(45), 2014. https://doi.org/10.1073/pnas.1408439111.

[9] B. Sakarovitch, M.P. de Bellefon, P. Givord, and M. Vanhoof. Estimating the residential population from mobile phone data, an initial exploration. Economie et Statistique / Economics and Statistics, 505-506, 2018. https://www.persee.fr/docAsPDF/estat_0336-1454_2018_num_505_1_10870.pdf.

[10] F. De Meersman et al. Assessing the quality of mobile phone data as a source of statistics. In European Conference on Quality in Official Statistics, June 2016. https://tinyurl.com/y7pbracn.

[11] F. De Fausti, M. Savarese, F. Fabbri, M. Spada, R. Radini, T. Tuoto, and L. Valentino. Challenges and opportunities with mobile phone data in official statistics. In Conference of European Statistics Stakeholders (CESS), October 2018.

[12] A. Janecek, D. Valerio, K. Hummel, F. Ricciato, and H. Hlavacs. The cellular network as a sensor: From mobile phone data to real-time road traffic monitoring. IEEE Transactions on Intelligent Transportation Systems, 16(5), October 2015.

[13] F. Ricciato, P. Widhalm, M. Craglia, and F. Pantisano. Beyond the “single-operator, CDR-only” paradigm: An interoperable framework for mobile phone network data analyses and population density estimation. Pervasive and Mobile Computing, May 2016.

[14] M. Tennekes. R package for mobile location algorithms and tools, April 2017. https://github.com/MobilePhoneESSnetBigData/mobloc.

[15] M. Tennekes, Yvonne A.P.M. Gootzen, and Shan H. Shah. A Bayesian approach to location estimation of mobile devices from mobile network operator data. CBS discussion paper (to appear), 2019.


[16] INSPIRE Thematic Working Group Coordinate Reference Systems and Geographical Grid Systems. D2.8.I.2 Data Specification on Geographical Grid Systems - Technical Guidelines (v 3.1), April 2017. http://inspire.ec.europa.eu/documents/Data_Specifications/INSPIRE_DataSpecification_GG_v3.1.pdf.

[17] F. Ricciato, P. Widhalm, M. Craglia, and F. Pantisano. Estimating population density distribution from network-based mobile phone data. JRC Technical Report, 2015. https://tinyurl.com/ydz4mgaw.

[18] J. van der Laan and E. de Jongey. Maximum likelihood reconstruction of population densities from mobile signalling data. In NetMob’19, 2019.

[19] L. Shepp and Y. Vardi. Maximum likelihood reconstruction for emission tomography. IEEE Transactions on Medical Imaging, 1982.

[20] F. Ricciato, F. De Meersman, A. Wirthmann, G. Seynaeve, and M. Skaliotis. Processing of mobile network operator data for official statistics: the case for public-private partnerships. In 104th DGINS conference, Oct. 2018. https://tinyurl.com/yahwq3dm.
