
1

ESS event: Big Data in Official Statistics


2

METHODOLOGY, QUALITY ISSUES AND ACCREDITATION OF BIG DATA SETS

FACILITATOR: MONICA SCANNAPIECO

Parallel sessions 1A and 1B

3

Session 1: Organization

Scheveningen Memorandum

We will discuss two specific Scheveningen challenges, as follows:

Morning session:

[SCH7] New developments in methodology and quality needed for the use of Big Data

Afternoon session

[SCH6] Strategic partnerships with the academia

4

Session 1: Objective

At the end of the day: suggestions for a key input to the roadmap for the usage of Big Data in Official Statistics, related to how to address the discussed challenges, in the form of:

List of issues and possible approaches to address them

Key input to the Roadmap: plan to achieve the objective of using Big Data in Official Statistics (i.e., list of suggested actions and priorities to deal with identified issues)

5

Session 1: Introduction

Bob says: «What do you expect to discover from Big Data? You’ll find nothing more than “November is a rainy month”!»

Alice replies: «Well, sometimes you have to start by exploring and trying, even if a model/theory is not there. The first wearable eyeglasses were invented in 1284, while the first results in optical theory became available in the early 17th century, 300 years later!»

6

Bob’s position: OS can do almost nothing with Big Data

If you don’t control data generation mechanisms by design, you have no control over results

Big Data can be used just by «statisticians for a day»

Alice’s position: OS can solve all its problems with Big Data: more timely statistics, reduction of the respondent burden, cost reduction, …


7

Who is right? Probably neither of them! Let Carol embody the middle position between Alice and Bob.

Today we will discuss Carol’s position, by considering the following aspects:

Big Data sources: they can be very different, so we need to differentiate them

Statistical methodological issues: problems and possible solutions

Relationships with Academia: Which scientific areas? Which kind of relationships?


8

Big Data Sources

UNECE Classification

Social Networks (human-sourced information)

Traditional Business systems (process-mediated data)

Internet of Things (machine-generated data)

9

Social Networks (human-sourced information)

Interactions with news media and social media, job posting

Humans interacting with devices (also mobile) produce data

Example:

Blog posts

Twitter messages

User-generated maps

...

10

Traditional Business systems (process-mediated data)

Data collected by traditional systems in a passive mode

Example:

Web search logs

Medical records

Commercial transactions

Banking/stock records

...

11

Internet of Things (machine-generated data)

Sensors and machines used to measure and record the events and situations in the physical world.

Example:

Traffic sensors

Environmental Sensors

12

METHODOLOGY AND QUALITY ISSUES

Parallel session 1A

13

Session 1A Methodology & Quality Issues - 1

Representativeness

E.g. Twitter is used by 18% of online adults

Lack of representativeness means biased results for OS (this differs for private companies that, e.g., compile statistics on their own customers, who are the complete population for them)

Statistical inference

Top-down perspective: design-based & model-based

Bottom-up perspective: algorithmic
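The representativeness point above can be illustrated with a small sketch. It compares a naive mean over a self-selected sample with a post-stratified mean that reweights groups to known population shares. Post-stratification is used here only as one familiar adjustment, not as a method the slides prescribe; the age groups, shares and observations are all made-up assumptions, and the correction only addresses imbalance on the variable used for stratification.

```python
# Hypothetical toy data: every number below is invented for illustration.
# Share of each age group in the target population (assumed known, e.g. from a census).
population_share = {"18-29": 0.20, "30-49": 0.35, "50+": 0.45}

# Observations from a self-selected Big Data source: (age group, value of interest).
# Young users are heavily over-represented compared with the population.
sample = [
    ("18-29", 1), ("18-29", 1), ("18-29", 0), ("18-29", 1),
    ("30-49", 1), ("30-49", 0),
    ("50+", 0),
]


def naive_mean(records):
    """Unweighted mean: implicitly treats the sample as representative."""
    return sum(value for _, value in records) / len(records)


def poststratified_mean(records, shares):
    """Weight each group's sample mean by its known population share."""
    totals, counts = {}, {}
    for group, value in records:
        totals[group] = totals.get(group, 0) + value
        counts[group] = counts.get(group, 0) + 1
    missing = set(shares) - set(counts)
    if missing:
        # A group with no observations cannot be reweighted.
        raise ValueError(f"no observations for groups: {sorted(missing)}")
    return sum(shares[g] * totals[g] / counts[g] for g in shares)


print(round(naive_mean(sample), 3))
print(round(poststratified_mean(sample, population_share), 3))
```

The gap between the two figures shows how strongly the over-represented group drives the naive estimate; selection effects within each group remain uncorrected.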

14


Session 1A Methodology & Quality Issues - 2

Data sources

Integration with traditional data sources (survey data and/or administrative data)

Linking by identifiers (or pseudo-identifiers) is often not possible, for instance due to privacy issues

Comparability problems due to harmonization problems:

Between Big sources, e.g. data scraped from Web sites with different «schemas»

Between Big sources and traditional ones

New sources for data which do not exist yet.
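The harmonization problem between sources with different «schemas» can be sketched as follows: one small mapping function per source brings records into a common target layout. The two sites, their field names and the price values are hypothetical; real scraped data would need far more cleaning than this.

```python
# Hypothetical records scraped from two web sites that describe the same kind
# of item (a product price) with different "schemas". Field names are invented.
site_a = [{"title": "Milk 1L", "price_eur": "1.09"}]
site_b = [{"name": "milk 1l", "cost": {"amount": 115, "unit": "cent"}}]


def from_site_a(rec):
    """Map a site A record to the common schema (price stored as a string in euro)."""
    return {
        "source": "site_a",
        "product": rec["title"].strip().lower(),
        "price_eur": float(rec["price_eur"]),
    }


def from_site_b(rec):
    """Map a site B record to the common schema (price nested, possibly in cents)."""
    amount = rec["cost"]["amount"]
    if rec["cost"]["unit"] == "cent":
        amount = amount / 100  # convert cents to euro
    return {
        "source": "site_b",
        "product": rec["name"].strip().lower(),
        "price_eur": float(amount),
    }


# All records now share the same fields and units and can be compared.
harmonized = [from_site_a(r) for r in site_a] + [from_site_b(r) for r in site_b]
for row in harmonized:
    print(row)
```

Keeping a `source` field in the common schema preserves where each value came from, which matters when sources later disagree.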

15


Session 1A Methodology & Quality Issues - 3

Data preparation: data from Big sources are typically event-based rather than unit-based

Data filtering

E.g. a high percentage of Twitter data is simply not useful («pointless babble»)

Sparse/incomplete data

E.g. data scraped from Web sites with different «schemas»
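A minimal sketch of the event-based versus unit-based distinction, combined with a filtering step: events are collapsed into one record per unit, and a crude keyword test stands in for a real relevance filter. The `unit_id` field, the keyword list and the sample events are assumptions made for illustration; in practice events often carry no reliable unit identifier.

```python
from collections import defaultdict

# Hypothetical event-based records: one row per event (e.g. one message),
# not one row per statistical unit. All values are made up.
events = [
    {"unit_id": "u1", "timestamp": "2014-03-01T10:00", "text": "looking for a job"},
    {"unit_id": "u2", "timestamp": "2014-03-01T10:05", "text": "lol"},
    {"unit_id": "u1", "timestamp": "2014-03-02T09:30", "text": "job interview today"},
    {"unit_id": "u3", "timestamp": "2014-03-02T11:00", "text": ""},
]


def is_relevant(event, keywords=("job",)):
    """Crude keyword filter standing in for a real relevance classifier."""
    text = event["text"].lower()
    return any(k in text for k in keywords)


def to_unit_records(event_list):
    """Collapse events into one summary record per unit."""
    units = defaultdict(lambda: {"n_events": 0, "n_relevant": 0, "last_seen": None})
    for e in event_list:
        u = units[e["unit_id"]]
        u["n_events"] += 1
        u["n_relevant"] += int(is_relevant(e))
        # ISO 8601 timestamps in a single format compare correctly as strings.
        if u["last_seen"] is None or e["timestamp"] > u["last_seen"]:
            u["last_seen"] = e["timestamp"]
    return dict(units)


unit_records = to_unit_records(events)
for unit_id, record in unit_records.items():
    print(unit_id, record)
```

Counting relevant events per unit, rather than discarding irrelevant ones up front, keeps the share of filtered-out data visible as a quality indicator.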

16

Session 1A Methodology & Quality Issues - 4

Quality Characterization: definition of the quality of Big data sources

Highly dependent on the specific Big Data source

Example: Quality of sensor data very different from quality of social media data

Sensor data: missing data, faulty data (noise and calibration effects)

Social media data: unstructured, no metadata

17


Example: Assessment of the quality of Deep Web data, i.e. Web sites that are interfaces to databases not directly accessible via standard search engines

Stock (55 sources) and Flight (38 sources) domains

Bad quality in terms of inconsistency (for 70% of data items more than one value is provided) and of correctness (only 70% of the correct values are provided by the majority of the sources)

Gold standards for both domains

[Li-et-al-2013]: Xian Li, Xin Luna Dong, K.B. Lyons, W. Meng, D. Srivastava, Truth Finding on the Deep Web: Is the Problem Solved?, PVLDB 2013
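The two quality measures mentioned above can be sketched on toy data: the share of data items with conflicting values across sources, and the share of items for which the majority value matches a gold standard. This is only an illustration of how such figures could be computed, not the procedure of the cited study; the item names, sources and values are hypothetical.

```python
from collections import Counter

# Hypothetical claims: for each data item, the value reported by each source.
claims = {
    "flight_A_departure": {"s1": "10:00", "s2": "10:00", "s3": "10:15"},
    "flight_B_departure": {"s1": "12:30", "s2": "12:30", "s3": "12:30"},
    "flight_C_departure": {"s1": "08:00", "s2": "08:20", "s3": "08:20"},
}

# Hypothetical gold standard: the value assumed to be true for each item.
gold = {
    "flight_A_departure": "10:00",
    "flight_B_departure": "12:30",
    "flight_C_departure": "08:00",
}


def inconsistency_rate(item_claims):
    """Share of data items for which sources report more than one distinct value."""
    conflicting = sum(
        1 for values in item_claims.values() if len(set(values.values())) > 1
    )
    return conflicting / len(item_claims)


def majority_value(values):
    """Most frequently reported value for one data item."""
    return Counter(values.values()).most_common(1)[0][0]


def majority_accuracy(item_claims, truth):
    """Share of items for which the majority value equals the gold standard."""
    correct = sum(
        1 for item, values in item_claims.items()
        if majority_value(values) == truth[item]
    )
    return correct / len(item_claims)


print(round(inconsistency_rate(claims), 2))
print(round(majority_accuracy(claims, gold), 2))
```

In the toy data, item C shows why majority voting alone is not enough: two sources agree on a wrong value, so the majority is consistent but incorrect.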

18


A new tool to assess quality: Provenance and Trust

Provenance of Web data: “Provenance of a resource is a record that describes entities and processes involved in producing and delivering or otherwise influencing that resource”

(W3C Working group on provenance)

Need for OS to have a «certification», where possible, of data to be used in order to trust them

19


On Timeliness: More timely data, but also time-related issues

Time characterization of the data generating mechanism

Social data: updated in an unpredictable way

Dedicated activity for extracting time-related metadata to permit the usage of time-aware techniques

[Temporal record linkage: Pei Li, Xin Luna Dong, Andrea Maurino, Divesh Srivastava: Linking temporal records. Frontiers of Computer Science 6(3): 293-312 (2012)]

20


So there are issues with Accuracy and Timeliness of Big sources

But there is also an issue with the trade-off between Accuracy and Timeliness of products resulting from such sources

More timely products

But accuracy is hard to characterize (or at least harder than in the traditional way, i.e. sampling errors, coverage errors, etc.)

21

Session 1A

Task on Methodology and Quality for the participants

1. Complete the list of suggested issues

2. Break issues down into sub-issues

3. If relevant, characterize each issue per Big Data source

4. If possible, sketch possible solutions to the issue

5. Discuss dependencies among issues

6. Propose a priority-based rating of issues

22

RELATIONSHIPS WITH ACADEMIA

Parallel session 1B

23

Session 1B Relationships with Academia

Considering the methodological and quality issues of Big Data, scientific areas that could contribute to solutions are:

– Statistics: survey sampling, estimation theory, statistical hypothesis testing, statistical decision theory, statistical learning, etc.

– Computer science: data mining, algorithmic and information theory, data visualization and analytics, natural language processing, knowledge representation, etc.

24

Session 1B Data Scientist

“The Sexiest Job of the 21st Century”, Thomas H. Davenport and D.J. Patil, Harvard Business Review, October 2012

[Diagram: Data Science combining Statistics, Computer Science and Application Domain Knowledge]

25

Session 1B Relationships with Academia: Issues - 1

Some topics are still in an early research stage

E.g. semantic extraction from short texts

Need to set up stable rather than occasional relationships. Possible ways:

Institutional forum of exchange on Big Data (‘scientific council’) where academic and official statisticians meet

Temporary secondments of university researchers or PhD students in NSIs supported with fellowships or scholarships

Regular lectures by academics to NSI staff

…

26

Session 1B Relationships with Academia: Issues - 2

Joint research and application projects:

To get funds for facilitating collaborations, travel and event organization

Privileged access for academics to statistical data and micro-data from Big Data sources.

27

Session 1B Task on relationships with Academia for the Groups

1. Complete/detail the list of areas and/or topics per area

2. Select those topics that in your opinion would specifically need relationships with academia (still in research stage)

3. Propose a rating of the topics selected in 2

4. Propose possible ways to strengthen relationships with Academia (also by providing examples already in place)
