
Big Data (and official statistics)

Piet Daas and Mark van der Loo*

Statistics Netherlands

MSIS 2013, April 25, Paris

* With contributions of: Edwin de Jonge and Paul van den Hurk

Overview

• What's Big Data?
  • Definition and the 3 V's
• Can Big Data be used for official statistics?
  • Examples from Statistics Netherlands
• Future challenges
  • What has to change?


• Data, data everywhere!

What is Big Data?

• According to a group of experts
  Big data are data sources that can be, generally, described as: "high volume, velocity and variety of data that demand cost-effective, innovative forms of processing for enhanced insight and decision making."

• According to a user
  "Data so big that it becomes awkward to work with"


The 3 most important characteristics of Big Data

• Amount (volume)
• Rapid availability (velocity)
• Complexity (variety): unstructured data, text

3 Big Data case studies

Can Big Data be used for official statistics?

Examples from Statistics Netherlands

1. Traffic loop detection data (100 million records/day)

• Traffic & transport statistics

2. Mobile phone data (35 million records/day)

• Day time population, tourism


3. Dutch social media messages (1~2 million messages/day)

• Topics and sentiment


1. Traffic loop detection data

• Traffic 'loops'
  • Every minute (24/7) the number of passing vehicles is counted by >10,000 road sensors & cameras in the Netherlands
  • Total vehicles and in different length classes
• Interesting source to produce traffic and transport statistics (and more)
  • Huge amounts of data, about 100 million records a day

[Figure: map of loop locations]

Number of detected vehicles on a single day

[Figure: number of vehicles detected by all loops]

Total = ~295 million


Traffic loop detection activity (only first 10 min.)


Correct for missing data

• ‘Corrected’ data (for blocks of 5 min)

Before: total = ~295 million

After: total = ~330 million (+12%)
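The slides do not say how the missing minutes were filled in. As a minimal sketch of one possible approach (an assumption, not the authors' method), per-minute counts can be summed into 5-minute blocks and each block scaled up by the share of minutes that actually reported. The function name and the sample readings are hypothetical; the last lines only re-check the +12% figure from the totals quoted above.

```python
from typing import List, Optional, Sequence

def corrected_block_totals(minute_counts: Sequence[Optional[int]],
                           block: int = 5) -> List[float]:
    """Sum per-minute counts into blocks, scaling up blocks with missing minutes.

    Assumption for illustration: a missing minute (None) is taken to carry
    the average of the observed minutes in the same block.
    """
    totals = []
    for start in range(0, len(minute_counts), block):
        chunk = minute_counts[start:start + block]
        observed = [c for c in chunk if c is not None]
        if not observed:
            totals.append(0.0)  # nothing to extrapolate from in this sketch
            continue
        totals.append(sum(observed) * len(chunk) / len(observed))
    return totals

# Hypothetical readings from one loop; None marks a minute without a report.
counts = [12, None, 10, 11, None, 8, 9, 10, 7, 6]
raw_total = sum(c for c in counts if c is not None)
corrected_total = sum(corrected_block_totals(counts))

# Check of the figures on the slide: ~295 million before, ~330 million after.
increase_pct = round((330 - 295) / 295 * 100)
```

With these made-up readings the first block is scaled from three observed minutes to five, while the complete second block is left unchanged.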


For different vehicle lengths

1 category: Total
3 categories: Total; <= 5.6 m; > 5.6 & <= 12.2 m; > 12.2 m
5 categories: Total; > 1.85 & <= 2.4 m; > 2.4 & <= 5.6 m; > 5.6 & <= 11.5 m; > 11.5 & <= 12.2 m; > 12.2 m

• Small vehicles: <= 5.6 m
• Medium sized vehicles: > 5.6 m & <= 12.2 m
• Large vehicles: > 12.2 m
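The three-class split above can be written as a small lookup. This is an illustrative sketch only: the function name and the sample lengths are hypothetical, and only the 5.6 m and 12.2 m thresholds come from the slide.

```python
from collections import Counter

def length_class(length_m: float) -> str:
    """Map a vehicle length in metres to the three classes on the slide."""
    if length_m <= 5.6:
        return "small"
    if length_m <= 12.2:
        return "medium"
    return "large"

# Hypothetical vehicle lengths in metres, for illustration only.
sample_lengths = [4.2, 5.6, 7.9, 12.2, 16.5]
class_counts = Counter(length_class(length) for length in sample_lengths)
```

Both boundaries are inclusive on the lower class, matching the "<=" signs in the slide.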

Small vehicles


~75% of total


Small & medium vehicles


Small, medium & large vehicles


2. Mobile phone data

• Nearly every person in the Netherlands has a mobile phone
  • Almost always carried with them and almost always switched on!
  • An increasing number of people have a smartphone
• Use mobile phone data of mobile phone companies
• Ideal source of information on:
  • Travel behaviour ('day time' population)
  • Tourism (new phones that register to network)
  • Crowd info (for example during events)

Travel behaviour of mobile phones

Mobility of very active mobile phone users
- during a 14-day period
- data of a single mobile company

Based on:
- Call- and text-activity multiple times a day
- Location based on phone masts

Clearly selective:
- Includes major cities
- But the North and South-east of the country much less
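To illustrate how call and text events tied to phone masts might be turned into a 'day time' population indicator, here is a minimal sketch. The record layout (phone id, mast area, hour) and all names are assumptions made for the example; the slides do not describe the actual data format or processing.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# Hypothetical record layout: (phone_id, mast_area, hour_of_day).
Event = Tuple[str, str, int]

def active_phones_per_area_hour(events: Iterable[Event]) -> Dict[Tuple[str, int], int]:
    """Count distinct phones seen in each (area, hour) cell."""
    phones: Dict[Tuple[str, int], Set[str]] = defaultdict(set)
    for phone_id, mast_area, hour in events:
        phones[(mast_area, hour)].add(phone_id)
    return {cell: len(ids) for cell, ids in phones.items()}

# Made-up events: phone p1 is active twice in the same cell and counted once.
sample_events = [
    ("p1", "area_A", 9),
    ("p1", "area_A", 9),
    ("p2", "area_A", 9),
    ("p1", "area_B", 14),
]
daytime_counts = active_phones_per_area_hour(sample_events)
```

Counting distinct phones rather than events keeps very active users from dominating a cell, although it does nothing about the selectivity noted on the slide.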


3. Social media messages

• Dutch are very active on social media platforms
  • Almost always carried with them and nearly always switched on
  • More and more people have a smartphone!
• Possible information source for:
  • Which topics are current:
    • Number of messages and the sentiment about them
• Can be used as a measuring instrument for:

Map by Eric Fischer (via Fast Company)

3. Social media messages

• Dutch are very active on social media platforms

• Potential information source for:

  • Topics discussed and sentiment about these topics (quickly available!) and probably more?
• Investigate it to obtain an answer on its potential use

3a. Content:
- Collected Dutch Twitter messages for study: 'selection' of 12 million

3b. Sentiment:
- Sentiment in Dutch social media messages: 'all' ~2 billion


Social media: Dutch Twitter topics

[Figure: topic shares of 12 million messages: 46%, 10%, 7%, 7%, 5%, 3%, 3%, 3%, 3%]

Sentiment in Social media

• Access to Coosto database

  • ~2 billion publicly available messages
  • Twitter, Facebook, Hyves, web forums, blogs etc.
• Sentiment of each message
  • Positive, negative or neutral
• Interesting finding
  • Looked at so-called 'Mood of the nation' compared to Consumer confidence of Statistics Netherlands


Consumer confidence, survey data

Sentiment towards the economic climate

[Figure: y-axis (pos - neg) as % of total; ~1000 respondents/month]

Sentiment in social media messages

Sentiment towards the economic climate & social media message sentiment

[Figure: y-axis (pos - neg) as % of total; ~25 million messages/month]

Corr: 0.88
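The axis label "(pos - neg) as % of total" and the reported correlation can be made concrete with a short sketch. The function names are hypothetical, and how the monthly series were actually aligned and compared is not stated on the slides. Pearson correlation is assumed here as the usual reading of "Corr".

```python
from math import sqrt
from typing import Sequence

def net_sentiment(pos: int, neg: int, total: int) -> float:
    """(pos - neg) as a percentage of all messages (or respondents)."""
    return 100.0 * (pos - neg) / total

def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

# Made-up monthly counts (positive, negative, total), for illustration only.
monthly_counts = [(30, 20, 100), (25, 25, 100), (20, 35, 100)]
social_series = [net_sentiment(p, n, t) for p, n, t in monthly_counts]
```

In use, `pearson(social_series, survey_series)` would compare the social media series with a survey-based series of the same length, where `survey_series` is a placeholder for the monthly consumer confidence values.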


Challenges: Big Data and statistics

• Legal
  • Is access routinely allowed (not only for research)?
• Privacy
  • With more and more data, privacy demands increase
  • We have to be careful here!
• Costs
  • In the Netherlands we don't pay for admin data.
  • Should we pay for Big Data?
• Manage
  • Who owns the data? Stability of delivery/source
  • Because of its volume, run queries in database of data source holder

Challenges: Big Data and statistics (2)

• Methodological
  • Big data sources register events, not units, and they are selective!
  • Methods & models specific for large datasets (fast and 'robust')
  • Try to 'make big data small' ASAP (noise reduction)
• Technological
  • Learn from 'computational statistical' research areas
  • High Performance Computing needs, parallel processing
• People
  • Need 'data scientists' (statistically minded people with programming skills who are curious)
  • Who are able to think outside the traditional sample survey based paradigm!
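One way to read "events, not units" together with "make big data small" is to collapse a stream of event records into one summary per unit as early as possible. The sketch below assumes a hypothetical (unit_id, timestamp) record layout and counts events per unit in a single pass; it is an illustration, not a method described in the slides.

```python
from collections import Counter
from typing import Iterable, Tuple

def events_per_unit(events: Iterable[Tuple[str, str]]) -> Counter:
    """Reduce (unit_id, timestamp) events to an event count per unit.

    The input can be any iterable, including a generator that reads a large
    file line by line, so the raw events never need to fit in memory.
    """
    counts: Counter = Counter()
    for unit_id, _timestamp in events:
        counts[unit_id] += 1
    return counts

# Made-up event stream: three events that belong to two units.
stream = iter([("u1", "09:00"), ("u2", "09:01"), ("u1", "09:05")])
unit_counts = events_per_unit(stream)
```

The result has one entry per unit rather than one per event, which is the form that survey-style methods expect.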


The future of Statistics Netherlands?