TRANSCRIPT
1
ESS event: Big Data in Official Statistics
2
METHODOLOGY, QUALITY ISSUES AND ACCREDITATION OF BIG DATA SETS
FACILITATOR: MONICA SCANNAPIECO
Parallel sessions 1A and 1B
3
Session 1: Organization
Scheveningen Memorandum
We will discuss two specific Scheveningen challenges, as follows:
Morning session:
[SCH7] New developments in methodology and quality needed for the use of Big Data
Afternoon session:
[SCH6] Strategic partnerships with academia
4
Session 1: Objective
At the end of the day: Suggestions for a key input to the roadmap for the usage of Big Data in Official Statistics, related to how to address the discussed challenges, in the form of:
List of issues and possible approaches to address them
Key input to the Roadmap: plan to achieve the objective of using Big Data in Official Statistics (i.e., list of suggested actions and priorities to deal with identified issues)
5
Session 1: Introduction
Bob says: «What do you expect to discover from Big Data? You’ll find nothing more than «November is a rainy month»!»
Alice replies: «Well, sometimes you have to start by exploring and trying, even if a model/theory is not there. The first wearable eyeglasses were invented in 1284, while the first results in optical theory were available in the early 17th century, 300 years later!»
6
Session 1: Introduction
Bob’s position: OS can do almost nothing with Big Data
If you don’t control data generation mechanisms by design, you have no control over the results
Big Data can be used just by «statisticians for a day»
Alice’s position: OS can solve all its problems with Big Data: more timely statistics, reduction of the respondent burden, cost reduction, …
7
Session 1: Introduction
Who is right? Probably neither of them! Let Carol embody the middle position between Alice and Bob.
Today we will discuss Carol’s position, by considering the following aspects:
Big Data sources: they can be very different, so we need to differentiate them
Statistical methodological issues: problems and possible solutions
Relationships with academia: Which scientific areas? Which kind of relationships?
8
Big Data Sources
UNECE Classification
Social Networks (human-sourced information)
Traditional Business systems (process-mediated data)
Internet of Things (machine-generated data)
9
Social Networks (human-sourced information)
Interactions with news media and social media, job posting
Humans interacting with devices (also mobile) produce data
Example:
Blog posts
Twitter messages
User-generated maps
...
10
Traditional Business systems (process-mediated data)
Data collected by traditional systems in a passive mode
Example:
Web search logs
Medical records
Commercial transactions
Banking/stock records
...
11
Internet of Things (machine-generated data)
Sensors and machines used to measure and record the events and situations in the physical world.
Example:
Traffic sensors
Environmental Sensors
12
METHODOLOGY AND QUALITY ISSUES
Parallel session 1A
13
Session 1A Methodology & Quality Issues - 1
Representativeness: e.g. Twitter is used by 18% of online adults
Lack of representativeness means biased results for OS (it is different for private companies that e.g. make statistics on their customers – complete population for them)
Statistical inference
Top-down perspective: design-based and model-based
Bottom-up perspective: algorithmic
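The representativeness problem can be illustrated with a small sketch. This is a toy under stated assumptions, not a method proposed in the session: the two groups, their population shares and the observed values are all made up, and the function names `naive_mean` and `poststratified_mean` are hypothetical. It contrasts an unweighted mean over a self-selected sample with a post-stratified mean that reweights each group to a known population share.

```python
from collections import defaultdict

def naive_mean(records):
    """Unweighted mean of the observed values."""
    return sum(value for _, value in records) / len(records)

def poststratified_mean(records, population_shares):
    """Mean of the group means, weighted by known population shares."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for group, value in records:
        sums[group] += value
        counts[group] += 1
    return sum(
        population_shares[group] * sums[group] / counts[group]
        for group in population_shares
    )

# Made-up sample in which one group is over-represented (4 of 5 records).
sample = [
    ("young", 1.0), ("young", 1.0), ("young", 1.0), ("young", 1.0),
    ("old", 0.0),
]
# Made-up population shares, assumed known from a traditional source.
shares = {"young": 0.5, "old": 0.5}

naive = naive_mean(sample)                       # 0.8
adjusted = poststratified_mean(sample, shares)   # 0.5
```

The adjustment only removes the bias if, within each group, the observed units behave like the unobserved ones, which is itself a modelling assumption.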
14
Session 1A Methodology & Quality Issues - 2
Data sources
Integration with traditional data sources (survey data and/or administrative data)
Linking by identifiers (or pseudo-identifiers) often not possible, for instance due to privacy issues
Comparability problems due to harmonization problems:
Between Big sources, e.g. data scraped from Web sites with different «schemas»
Between Big sources and traditional ones
New sources for data which do not exist yet.
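The harmonization problem between scraped sources with different «schemas» can be sketched as a field-mapping step. Everything here is hypothetical: the site names, the field names, the common schema `COMMON_FIELDS` and the helper `harmonize` are invented for illustration and assume that a mapping has been curated by hand for each source.

```python
# Hypothetical common schema for scraped job postings.
COMMON_FIELDS = ("job_title", "location", "salary")

# Hypothetical hand-curated mappings: source field name -> common field name.
FIELD_MAPS = {
    "site_a": {"title": "job_title", "loc": "location", "pay": "salary"},
    "site_b": {"position": "job_title", "city": "location"},  # no salary field
}

def harmonize(source, record):
    """Rename a scraped record's fields to the common schema.

    Unmapped source fields are dropped; common fields that the
    source does not provide are left as None.
    """
    mapping = FIELD_MAPS[source]
    harmonized = dict.fromkeys(COMMON_FIELDS)  # every common field starts as None
    for field_name, value in record.items():
        target = mapping.get(field_name)
        if target is not None:
            harmonized[target] = value
    return harmonized

a = harmonize("site_a", {"title": "Statistician", "loc": "Rome", "pay": "30000"})
b = harmonize("site_b", {"position": "Data analyst", "city": "The Hague", "banner": "ad"})
```

Here `b` ends up with `salary` set to `None`, which makes the sparseness introduced by differing schemas explicit instead of hiding it.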
15
Session 1A Methodology & Quality Issues - 3
Data preparation: data from Big sources are typically event-based rather than unit-based
Data filtering: e.g. a high percentage of Twitter data is simply not useful («pointless babble»)
Sparse/incomplete data: e.g. data scraped from Web sites with different «schemas»
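A minimal sketch of the two preparation steps above, filtering and moving from event-based to unit-based data. The events, the keyword filter `mentions_job` and the function `events_to_units` are assumptions made for illustration; a real filter would be far more elaborate than a keyword match.

```python
from collections import defaultdict

def events_to_units(events, is_useful):
    """Drop events that fail the filter, then summarize the rest per unit."""
    units = defaultdict(lambda: {"n_events": 0, "first": None, "last": None})
    for unit_id, timestamp, payload in events:
        if not is_useful(payload):
            continue  # filtered out as not useful
        unit = units[unit_id]
        unit["n_events"] += 1
        unit["first"] = timestamp if unit["first"] is None else min(unit["first"], timestamp)
        unit["last"] = timestamp if unit["last"] is None else max(unit["last"], timestamp)
    return dict(units)

def mentions_job(text):
    """Hypothetical relevance filter: keep messages containing the word 'job'."""
    return "job" in text

# Made-up events: (unit identifier, timestamp, message text).
events = [
    ("user_1", 3, "looking for a job in Rome"),
    ("user_2", 1, "lol"),
    ("user_1", 1, "job interview tomorrow"),
    ("user_2", 5, "got a new job"),
]

units = events_to_units(events, mentions_job)
```

Each entry of `units` is one statistical unit (here a user) with the number of retained events and the first and last timestamps, which is closer to the unit-based records used in traditional processing.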
16
Session 1A Methodology & Quality Issues - 4
Quality Characterization: definition of the quality of Big data sources
Highly dependent on the specific Big Data source
Example: Quality of sensor data very different from quality of social media data
Sensor data: missing data, faulty data (noise and calibration effects)
Social media data: unstructured, no metadata
17
Session 1A Methodology & Quality Issues - 5
Example: Assessment of the quality of Deep Web data, i.e. Web sites that are interfaces to databases not directly accessible via standard search engines
Stock (55 sources) and Flight (38 sources) domains
Bad quality in terms of inconsistency (for 70% of data items more than one value is provided) and of correctness (only 70% correct values are provided by the majority of the sources)
Gold standards for both domains
[Li-et-al-2013]: Xian Li, Xin Luna Dong, K.B. Lyons, W. Meng, D. Srivastava, Truth Finding on the Deep Web: Is the Problem Solved?, PVLDB 2013
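The two quality measures mentioned in this example can be sketched as follows. This is one plausible reading of "inconsistency" and "correctness", not the procedure of the cited paper: the function `assess`, the item names and all values are made up, and correctness is taken to mean that the most frequently reported value equals the gold standard.

```python
from collections import Counter

def assess(claims, gold):
    """Return (inconsistency, majority_correctness) as shares of data items.

    claims: {item: [value reported by each source]}
    gold:   {item: value taken as true}
    """
    inconsistent = 0
    majority_correct = 0
    for item, values in claims.items():
        counts = Counter(values)
        if len(counts) > 1:          # sources disagree on this item
            inconsistent += 1
        majority_value, _ = counts.most_common(1)[0]
        if majority_value == gold[item]:
            majority_correct += 1
    n_items = len(claims)
    return inconsistent / n_items, majority_correct / n_items

# Made-up claims from three hypothetical sources per item.
claims = {
    "stock_1_close": [10.0, 10.0, 10.5],
    "stock_2_close": [20.0, 20.0, 20.0],
    "flight_1_gate": ["B2", "C4", "C4"],
    "flight_2_gate": ["A1", "A1", "A7"],
}
gold = {
    "stock_1_close": 10.0,
    "stock_2_close": 20.0,
    "flight_1_gate": "B2",
    "flight_2_gate": "A1",
}

inconsistency, correctness = assess(claims, gold)
```

In this toy data three of four items have conflicting values, and the majority value is wrong for one item, which shows why voting alone does not guarantee correct values.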
18
Session 1A Methodology & Quality Issues - 6
A new tool to assess quality: Provenance and Trust
Provenance of Web data: “Provenance of a resource is a record that describes entities and processes involved in producing and delivering or otherwise influencing that resource”
(W3C Working group on provenance)
Need for OS to have a «certification», where possible, of data to be used in order to trust them
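As a rough illustration of how a provenance record could support such a «certification», the sketch below stores, for each resource, the process that produced it, the responsible party and the resources it was derived from. The class `ProvenanceRecord`, the rule in `is_certified` and all names in the data are hypothetical; the fields are loosely inspired by the entity, activity and agent notions of the W3C provenance work, and the derivation chain is assumed to contain no cycles.

```python
from dataclasses import dataclass, field

@dataclass
class ProvenanceRecord:
    entity: str          # the resource being described
    generated_by: str    # the process that produced it
    attributed_to: str   # the party responsible for it
    derived_from: list = field(default_factory=list)  # upstream resources

def is_certified(record, trusted_agents, records_by_entity):
    """Hypothetical rule: a resource is certified when its responsible party
    is trusted and every upstream resource is recorded and itself certified."""
    if record.attributed_to not in trusted_agents:
        return False
    for parent in record.derived_from:
        parent_record = records_by_entity.get(parent)
        if parent_record is None:
            return False  # unknown origin, cannot be trusted
        if not is_certified(parent_record, trusted_agents, records_by_entity):
            return False
    return True

# Made-up provenance records.
raw = ProvenanceRecord("raw_prices", "web_scraping", "scraper_team")
clean = ProvenanceRecord("clean_prices", "deduplication", "methodology_unit", ["raw_prices"])
blog = ProvenanceRecord("blog_figures", "manual_copy", "anonymous")
mixed = ProvenanceRecord("mixed_index", "aggregation", "methodology_unit",
                         ["clean_prices", "blog_figures"])

records = {r.entity: r for r in (raw, clean, blog, mixed)}
trusted = {"scraper_team", "methodology_unit"}

clean_ok = is_certified(clean, trusted, records)
mixed_ok = is_certified(mixed, trusted, records)
```

`clean_prices` is certified because its whole chain is attributed to trusted parties, while `mixed_index` is not, because one of its inputs comes from an untrusted party.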
19
Session 1A Methodology & Quality Issues - 7
On Timeliness: More timely data, but also time-related issues
Time characterization of the data generating mechanism
Social data: updated in an unpredictable way
Dedicated activity for extracting time-related metadata to permit the usage of time-aware techniques
[Temporal record linkage: Pei Li, Xin Luna Dong, Andrea Maurino, Divesh Srivastava: Linking temporal records. Frontiers of Computer Science 6(3): 293-312 (2012)]
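To give a flavour of what a time-aware technique might look like, the sketch below compares two records and tolerates disagreement on an attribute more when the records are further apart in time, on the assumption that attribute values can legitimately change. It is not the algorithm of the cited paper: the function `time_aware_similarity`, the `half_life_years` parameter, the decay formula and the records are all assumptions made for illustration.

```python
def time_aware_similarity(rec_a, rec_b, half_life_years=5.0):
    """Similarity in [0, 1] between two timestamped records.

    An agreeing attribute scores 1. A disagreeing attribute scores a
    tolerance that is 0 for records from the same year and approaches 1
    as the time gap grows (assumed exponential form).
    """
    gap = abs(rec_a["year"] - rec_b["year"])
    tolerance = 1.0 - 0.5 ** (gap / half_life_years)
    attributes = [key for key in rec_a if key != "year"]
    score = 0.0
    for key in attributes:
        score += 1.0 if rec_a[key] == rec_b[key] else tolerance
    return score / len(attributes)

# Made-up records: same name, different affiliation.
old_record = {"name": "J. Doe", "affiliation": "University X", "year": 2000}
same_year = {"name": "J. Doe", "affiliation": "Institute Y", "year": 2000}
five_years_later = {"name": "J. Doe", "affiliation": "Institute Y", "year": 2005}

score_same_year = time_aware_similarity(old_record, same_year)          # 0.5
score_later = time_aware_similarity(old_record, five_years_later)       # 0.75
```

The same disagreement on affiliation counts fully against a match when both records carry the same year, and only half as much five years apart, which is why time-related metadata has to be extracted first.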
20
Session 1A Methodology & Quality Issues - 8
So there are issues on Accuracy and Timeliness on Big sources
But there is also an issue with the trade-off between Accuracy and Timeliness of products resulting from such sources
More timely products
But accuracy is hard to characterize (or at least harder than in the traditional way, i.e. through sampling errors, coverage errors, etc.)
21
Session 1A
Task on Methodology and Quality for the participants
1. Complete the list of suggested issues
2. Detail issues in sub-issues
3. If relevant, characterize each issue per Big Data source
4. If possible, sketch possible solutions to the issue
5. Discuss dependencies among issues
6. Propose a priority-based rating of issues
22
RELATIONSHIPS WITH ACADEMIA
Parallel session 1B
23
Session 1B Relationships with Academia
Considering the methodological and quality issues of Big Data, scientific areas that could contribute to solutions are:
– Statistics: survey sampling, estimation theory, statistical hypothesis testing, statistical decision theory, statistical learning, etc.
– Computer science: data mining, algorithmic and information theory, data visualization and analytics, natural language processing, knowledge representation, etc.
24
Session 1B Data Scientist
“The Sexiest Job of the 21st Century”, Thomas H. Davenport and D.J. Patil, “Harvard Business Review”, October 2012
[Diagram: Data Science, combining Statistics, Computer Science and Application Domain Knowledge]
25
Session 1B Relationships with Academia: Issues - 1
Some topics are still in an early research stage, e.g. semantic extraction from short texts
Need to set up stable rather than occasional relationships. Possible ways:
Institutional forum of exchange on Big Data (‘scientific council’) where academic and official statisticians meet
Temporary secondments of university researchers or PhD students in NSIs supported with fellowships or scholarships
Regular lectures by academics to NSI staff
…
26
Session 1B Relationships with Academia: Issues - 2
Joint research and application projects:
To obtain funds for facilitating collaboration, travel and event organization
Privileged access for academics to statistical data and micro-data from Big Data sources.
27
Session 1B Task on relationships with Academia for the Groups
1. Complete/detail the list of areas and/or topics per area
2. Select those topics that in your opinion would specifically need relationships with academia (still in research stage)
3. Propose a rating of the topics selected in 2
4. Propose possible ways to strengthen relationships with Academia (also by providing examples already in place)