Dealing with Big Data for Official Statistics: IT Issues
Giulio Barcaroli, Stefano De Francisci, Monica Scannapieco, Donato Summa
Istat – Italian National Institute of Statistics
Monica Scannapieco, Dealing with Big Data for Official Statistics: IT Issues , MSIS 2014 2
Outline
1. Background: Istat Big Data strategy and experimental projects
2. IT issues in experimental projects
3. Final remarks
Istat Big Data Strategy - 1
Istat (the Italian National Institute of Statistics) set up a technical Commission to guide investment in the adoption of Big Data in statistical production processes
Duration: from February 2013 to February 2015
Members come from different areas: Official Statistics, Academia, Private Sector
Objective of the talk
I will NOT deal with (just) technological issues
I will deal instead (mainly) with IT methodological issues
Example:
• Technology: MapReduce/Hadoop, an open source framework
• IT methodology: the MapReduce programming model. Are Official Statistics (OS) methods mapreduce-able? Current research investigates the mapreduce-ability of (classes of) computational problems
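As one illustration of what "mapreduce-able" could mean for a statistical method, the sketch below computes a mean and a variance by mapping each data partition to a small summary (count, sum, sum of squares) and reducing the summaries with an associative merge. The partitioning, the function names and the sample values are assumptions made for illustration; the code runs in plain Python rather than on Hadoop.

```python
from functools import reduce

def map_partition(values):
    """Map step: summarize one partition as (count, sum, sum of squares)."""
    n = len(values)
    s = sum(values)
    ss = sum(v * v for v in values)
    return (n, s, ss)

def merge(a, b):
    """Reduce step: associative, commutative merge of two summaries."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def mean_and_variance(partitions):
    """Combine per-partition summaries into a global mean and population variance."""
    n, s, ss = reduce(merge, map(map_partition, partitions), (0, 0.0, 0.0))
    mean = s / n
    variance = ss / n - mean * mean
    return mean, variance

# Hypothetical data, split into three partitions as a cluster might hold it.
partitions = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
mean, variance = mean_and_variance(partitions)
```

Because the merge is associative and commutative, partitions can be summarized independently and combined in any order. The sum-of-squares formula is chosen here for brevity; it can lose precision on large values, so a production version would probably prefer a numerically stabler merge.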
Istat Big Data Strategy - 2
The Commission will release a strategy for Big Data adoption
Three experimental projects launched and monitored by the Commission:
• Persons and Places
• Labour Market Estimation based on Google Trends
• ICT Usage in Enterprises based on Internet as a Data Source (IaD)
Status: advanced implementation (first results already available)
Persons and Places
Purpose
• Production of the origin/destination matrix of daily mobility for work and study purposes, at the spatial granularity of municipalities, starting from phone (tracking) data
Actors involved in the project
• Istat
• National Research Council
• University of Pisa
Methodology
• Inference of population mobility profiles from GSM Call Data Records (CDRs)
• Comparison with data derived from administrative sources
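To make the target output concrete, the sketch below builds an origin/destination matrix as a count of (origin municipality, destination municipality) pairs. It assumes that a home and a work/study municipality have already been inferred for each user from the CDRs, which is the hard part and is not shown. The record layout and the sample rows are hypothetical.

```python
from collections import Counter

# Hypothetical, already-inferred profiles:
# (user_id, home_municipality, work_or_study_municipality)
profiles = [
    ("u1", "Roma", "Roma"),
    ("u2", "Pisa", "Firenze"),
    ("u3", "Pisa", "Firenze"),
    ("u4", "Firenze", "Pisa"),
]

def od_matrix(profiles):
    """Count daily flows for each (origin, destination) municipality pair."""
    return Counter((home, dest) for _, home, dest in profiles)

flows = od_matrix(profiles)
```

A sparse mapping keyed by pairs is used instead of a dense matrix because most municipality pairs would be expected to have no flow at all.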
Labour Market Estimation
Purpose
• Test the usage of Google Trends for forecasting and nowcasting purposes in the Labour Force domain
Actors involved in the project
• Istat: Central Methodology Sector and Labour Force Survey
Methodology
• Autoregressive model vs. usage of Google Trends as prediction models
• Comparison extended to macroeconomic prediction models
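The comparison named in the methodology can be sketched, under strong simplifying assumptions, as two one-regressor least-squares fits: one predicting the indicator from its own previous value (a first-order autoregressive model) and one predicting it from a same-period search-volume index. The two series below are made-up toy numbers, not Istat or Google Trends data, and the model forms are assumptions rather than the project's actual specification.

```python
def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

def mse(x, y, a, b):
    """In-sample mean squared error of the fitted line."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / len(y)

# Toy, made-up series (NOT real data): target indicator and search-volume index.
target = [10.0, 10.4, 10.9, 11.1, 11.6, 12.0, 12.1, 12.5]
search_index = [50.0, 53.0, 57.0, 58.0, 62.0, 66.0, 65.0, 70.0]

# Model A: autoregressive, y_t explained by y_{t-1}.
a_ar, b_ar = fit_line(target[:-1], target[1:])
mse_ar = mse(target[:-1], target[1:], a_ar, b_ar)

# Model B: y_t explained by the same-period search index (nowcasting flavour).
a_gt, b_gt = fit_line(search_index[1:], target[1:])
mse_gt = mse(search_index[1:], target[1:], a_gt, b_gt)
```

Both models are scored on the same periods so that the two errors are comparable. A real evaluation would use out-of-sample forecasts rather than in-sample error.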
ICT Usage in Enterprises
Purpose
• Evaluate the possibility of adopting Web scraping and text mining techniques for estimates on the usage of ICT by enterprises and public institutions
Actors involved in the project
• Istat: Survey on the ICT Usage in Enterprises
• Cineca (Consortium of Italian universities, National Research Council and Ministry of Education and Research)
Methodology
• Scraping of Web sites for data extraction
• Supervised classification task
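The data-extraction step of scraping can be sketched with the Python standard library alone: given the HTML of a page, keep the visible text and drop script and style contents. The sample page is invented, fetching is left out, and nothing here is meant to describe the scraper actually used in the project.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def extract_text(html):
    """Return the visible text of an HTML document as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

# Invented sample page standing in for an enterprise Web site.
page = ("<html><head><style>p{color:red}</style></head>"
        "<body><h1>Shop</h1><p>Add to cart</p>"
        "<script>var x=1;</script></body></html>")
text = extract_text(page)
```

The extracted text would then be the input to the supervised classification task.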
Features of Experimental Projects
DATA SOURCE
• 1: Persons & Places: mobility data
• 2: Google Trends: Web search records
• 3: ICT Usage: Web data extraction
SCENARIO (IMPACT ON THE PRODUCTION PROCESS)
• 1: Persons & Places: deep impact, the source replaces traditional sampling and collection
• 2: Google Trends: considerable impact, on the estimation phase
• 3: ICT Usage: limited impact, a subset of data gathered by using IaD
KEY TECHNOLOGIES
• 1: Persons & Places: machine learning libraries, MapReduce/Hadoop (future)
• 2: Google Trends: Google Trends
• 3: ICT Usage: scraping, NoSQL, machine learning libraries, MapReduce/Hadoop (future?)
Statistical Phases for Big Data Management
• Principal selected phases
• Inversion of the two phases: due to the fact that the “traditional” design phase is no longer present for Big Data
• Collapsed phases: due to the fact that the same methods can be used for both phases
• Other phases, e.g. Dissemination, are not (yet) involved in Big Data
Collect: IT issues - 1
Access to Big Data sources can be hindered by:
• Type 1: access control mechanisms deliberately set up by the Big Data provider
and/or
• Type 2: technological barriers
Google Trends:
• Absence of APIs, which prevents a software program from accessing GT data
• It is not possible to foresee the usage of such a facility in production processes
Collect: IT issues - 2
ICT Usage: both Type 1 and Type 2 problems
• 8,647 URLs of enterprises’ Web sites, but only about 5,600 (roughly 65%) were actually accessed
• Type 1: scrapers deliberately blocked, e.g. by mechanisms that verify human access to sites, like CAPTCHA
• Type 2: the usage of some technologies, like Adobe Flash, simply prevented access to the contents
Design: IT Issues - 1
• Even if a traditional survey design cannot take place, the problem of “understanding” the data is still present
• Semantic extraction techniques: knowledge representation and natural language processing
• E.g.: FRED (http://wit.istc.cnr.it/stlab-tools/fred) extracts an ontology from sentences in natural language
Design: IT Issues - 2
ICT Usage: human inspection refined by:
• Some NLP techniques: tokenization, stemming, stopword removal, etc.
• Semantic enrichment by semantic dictionaries (WordNet)
• Images: tag extraction
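The NLP steps named above (tokenization, stopword removal, stemming) can be sketched as a small pipeline. The stopword list is a tiny illustrative assumption, and the stemmer is a deliberately crude suffix stripper rather than Porter or any other published algorithm; a real pipeline would more likely rely on an established NLP library.

```python
import re

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "our", "is", "are"}

def tokenize(text):
    """Lowercase the text and split it on non-letter characters."""
    return re.findall(r"[a-z]+", text.lower())

def naive_stem(token):
    """Crude suffix stripping; NOT the Porter algorithm."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenize, drop stopwords, then stem what remains."""
    return [naive_stem(t) for t in tokenize(text) if t not in STOPWORDS]

tokens = preprocess("Ordering products in our online shops is easy")
```

For the sample sentence this yields the tokens order, product, online, shop and easy, which could then serve as features for classification.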
Process/Analyse: IT Issues - 1
Big size, possibly solvable by Map-Reduce algorithms
Model absence, possibly solvable by learning techniques
Privacy constraints, solvable by privacy-preserving techniques
Process/Analyse: IT Issues - 2
Map-Reduce algorithms: the problem of formulating algorithms that can be implemented according to such a paradigm -> “mapreduce-ability”
Recent state-of-the-art Map-Reduce algorithms exist for:
• Basic graph problems, e.g. minimum spanning trees, triangle counting and matching
• Combinatorial optimization, e.g. maximum coverage, densest subgraph, and k-means clustering
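To show what a MapReduce formulation looks like for one of the problems listed, the sketch below writes a single k-means iteration as a map step (assign each point to its nearest centroid) and a reduce step (average the points of each cluster). It is a textbook-style illustration in plain Python, not one of the state-of-the-art algorithms referred to above; the points and starting centroids are made up.

```python
from collections import defaultdict

def map_point(point, centroids):
    """Map: emit (index of nearest centroid, (point, 1))."""
    distances = [sum((p - c) ** 2 for p, c in zip(point, centroid))
                 for centroid in centroids]
    return distances.index(min(distances)), (point, 1)

def reduce_cluster(values):
    """Reduce: average all points assigned to one centroid."""
    count = sum(n for _, n in values)
    dims = len(values[0][0])
    return tuple(sum(pt[d] for pt, _ in values) / count for d in range(dims))

def kmeans_step(points, centroids):
    """One k-means iteration expressed as map, shuffle and reduce."""
    groups = defaultdict(list)  # stands in for the shuffle phase
    for point in points:
        key, value = map_point(point, centroids)
        groups[key].append(value)
    # Keep the old centroid if no point was assigned to it.
    return [reduce_cluster(groups[i]) if i in groups else centroids[i]
            for i in range(len(centroids))]

# Made-up points and starting centroids.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (12.0, 10.0)]
centroids = kmeans_step(points, [(1.0, 1.0), (9.0, 9.0)])
```

The full algorithm repeats this step until the centroids stop moving, so each iteration would be a separate MapReduce job; that iteration cost is one reason mapreduce-ability is not only a yes/no question.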
Process/Analyse: IT Issues - 3
Persons and Places:
• Match mobility-related data with data stored in Istat archives: a record linkage problem should be solved (future task)
• Model absence: neither survey-based nor “traditional” model-based approaches are directly applicable to Big Data
• Possible usage of machine learning techniques
Process/Analyse: IT Issues - 4
ICT Usage:
• Supervised learning methods used to learn the YES/NO answers to the questionnaire (including classification trees, random forests and adaptive boosting)
Persons and Places:
• Unsupervised learning technique, namely SOM (Self-Organizing Map), to learn mobility profiles
• E.g. “free city users” vs. “embedded city users” (more confidently estimated by deterministic constraints)
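As a rough idea of how a Self-Organizing Map learns profiles without labels, the sketch below trains a small one-dimensional SOM in plain Python. The feature vectors are invented (two numbers per user, scaled to [0, 1]), and the map size, learning rate and neighbourhood schedule are arbitrary assumptions; the project's actual features and SOM configuration are not described in the slides.

```python
import math
import random

def best_matching_unit(weights, x):
    """Index of the unit whose weight vector is closest to x."""
    dists = [sum((wi - xi) ** 2 for wi, xi in zip(w, x)) for w in weights]
    return dists.index(min(dists))

def train_som(data, n_units=4, epochs=50, lr0=0.5, radius0=2.0, seed=0):
    """Train a 1-D SOM; returns one weight vector per unit."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_units)]
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                     # decaying learning rate
        radius = max(radius0 * (1.0 - frac), 0.5)   # shrinking neighbourhood
        for x in rng.sample(data, len(data)):       # shuffled pass over the data
            bmu = best_matching_unit(weights, x)
            for j, w in enumerate(weights):
                # Units near the winner on the map are pulled more strongly.
                h = math.exp(-((j - bmu) ** 2) / (2.0 * radius ** 2))
                for d in range(dim):
                    w[d] += lr * h * (x[d] - w[d])
    return weights

# Invented user features in [0, 1]; NOT real CDR-derived data.
data = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
weights = train_som(data)
labels = [best_matching_unit(weights, x) for x in data]
```

Each user ends up assigned to a map unit, and units can then be inspected and named as profiles by an analyst.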
Process/Analyse: IT Issues - 5
Privacy constraints could potentially be solved in a proactive way by relying on techniques that work on anonymous data
Privacy-preserving data integration, e.g. [DMKM-2004]
Privacy-preserving data mining, e.g. [TKDE - 2004]
Persons and Places: Anonymous matching of CDRs with Istat archives via privacy-preserving record linkage
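One simple form of privacy-preserving record linkage, shown here only as a sketch, has each party replace its identifiers with keyed hashes (HMAC-SHA256) under a shared secret, so that matching is done on pseudonyms only. This is not claimed to be the technique of the works cited above: it supports exact matches only and tolerates no typos. The existence of a common identifier in both sources, the key and all records are hypothetical.

```python
import hashlib
import hmac

def pseudonymize(identifier, secret_key):
    """Keyed hash (HMAC-SHA256) of a normalized identifier."""
    normalized = identifier.strip().lower().encode("utf-8")
    return hmac.new(secret_key, normalized, hashlib.sha256).hexdigest()

def pseudonymize_table(records, secret_key):
    """Run by each data holder: map pseudonym -> payload, dropping the identifier."""
    return {pseudonymize(ident, secret_key): payload for ident, payload in records}

def link(table_a, table_b):
    """Run by the linking party, which sees pseudonyms only."""
    common = sorted(table_a.keys() & table_b.keys())
    return [(table_a[p], table_b[p]) for p in common]

# Hypothetical key and records, for illustration only.
key = b"shared-secret-for-illustration"
cdr_side = [("ID-001", "profile: commuter"), ("ID-002", "profile: resident")]
archive_side = [("id-001 ", "municipality: Pisa"), ("ID-003", "municipality: Roma")]

matches = link(pseudonymize_table(cdr_side, key),
               pseudonymize_table(archive_side, key))
```

A keyed hash is used rather than a plain hash so that a party without the key cannot rebuild the pseudonyms by hashing a list of known identifiers.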
Concluding Remarks
• Illustration of some IT issues considered relevant for Big Data adoption by OS, on the basis of practical experiences
• Technology is probably not an issue, but IT methodology is!
• Some IT issues also share some statistical methodological aspects
• Other relevant IT issues: event data management, on-line analytics
Thank you for the attention!