Big Data (and official statistics)
Piet Daas and Mark van der Loo*
Statistics Netherlands
MSIS 2013, April 25, Paris
* With contributions of: Edwin de Jonge and Paul van den Hurk
Overview
• What's Big Data?
  • Definition and the 3 V's
• Can Big Data be used for official statistics?
  • Examples from Statistics Netherlands
• Future challenges
  • What has to change?
Data, data everywhere!
What is Big Data?
• According to a group of experts
  • Big data are data sources that can be (generally) described as: “high volume, velocity and variety of data that demand cost-effective, innovative forms of processing for enhanced insight and decision making.”
• According to a user
  • “Data so big that it becomes awkward to work with”
The 3 most important characteristics of Big Data
• Amount
• Rapid availability
• Complexity
  • Unstructured data
  • Text
3 Big Data case studies
Can Big Data be used for official statistics?
Examples from Statistics Netherlands
1. Traffic loop detection data (100 million records/day)
• Traffic & transport statistics
2. Mobile phone data (35 million records/day)
• Daytime population, tourism
3. Dutch social media messages (1-2 million messages/day)
• Topics and sentiment
1. Traffic loop detection data
• Traffic ‘loops’
  • Every minute (24/7) the number of passing vehicles is counted by >10,000 road sensors & cameras in the Netherlands
  • Total vehicles and in different length classes
• Interesting source to produce traffic and transport statistics (and more)
  • Huge amounts of data, about 100 million records a day
[Figure: map of loop locations]
Number of detected vehicles on a single day
[Figure: vehicles detected during the day, by all loops]
Total = ~295 million
Traffic loop detection activity (only first 10 min.)
Correct for missing data
• ‘Corrected’ data (for blocks of 5 min)
  • Before: total = ~295 million
  • After: total = ~330 million (+12%)
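The slide does not say how the correction was carried out. Purely as an illustration, the sketch below fills the missing minute counts of a single loop with the mean of the observed minutes in the same 5-minute block. The function name `correct_blocks`, the use of `None` for a missing minute and the sample counts are assumptions made here, not the method used at Statistics Netherlands.

```python
from typing import List, Optional


def correct_blocks(counts: List[Optional[int]], block: int = 5) -> List[float]:
    """Fill missing minute counts (None) with the mean of the observed
    minutes in the same block. Blocks without any observation stay at 0.0."""
    corrected: List[float] = []
    for start in range(0, len(counts), block):
        chunk = counts[start:start + block]
        observed = [c for c in chunk if c is not None]
        # Assumed rule: a missing minute looks like the average observed minute.
        fill = sum(observed) / len(observed) if observed else 0.0
        corrected.extend(float(c) if c is not None else fill for c in chunk)
    return corrected


# Hypothetical minute counts for one loop; None marks a missing minute.
minute_counts = [12, None, 10, 14, None, 8, 9, None, None, 7]
before = sum(c for c in minute_counts if c is not None)
after = sum(correct_blocks(minute_counts))
print(before, after)
```

On the totals reported on the slide, the same before/after comparison gives (330 - 295) / 295, roughly 11.9%, which matches the rounded +12%.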
For different vehicle lengths
1 category    3 categories         5 categories
Total         Total                Total
              <= 5.6 m             > 1.85 & <= 2.4 m
              > 5.6 & <= 12.2 m    > 2.4 & <= 5.6 m
              > 12.2 m             > 5.6 & <= 11.5 m
                                   > 11.5 & <= 12.2 m
                                   > 12.2 m

Small vehicles: <= 5.6 m
Medium sized vehicles: > 5.6 m & <= 12.2 m
Large vehicles: > 12.2 m
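The three length classes named on this slide can be written down as a small classifier. This is a minimal sketch using the slide's thresholds (5.6 m and 12.2 m); the function name `length_class` and the sample lengths are illustrative assumptions.

```python
def length_class(length_m: float) -> str:
    """Map a vehicle length in metres to the 3-category scheme on the slide."""
    if length_m <= 5.6:
        return "small"
    if length_m <= 12.2:
        return "medium"
    return "large"


# Hypothetical vehicle lengths, including both boundary values.
for length in (4.2, 5.6, 9.0, 12.2, 18.5):
    print(length, length_class(length))
```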
Small vehicles
~75% of total
Small & medium vehicles
Small, medium & large vehicles
2. Mobile phone data
• Nearly every person in the Netherlands has a mobile phone
  • Carried on them and almost always switched on!
  • An increasing number of people have a smartphone
• Ideal source of information
  • Use mobile phone data of mobile phone companies for:
    • Travel behaviour (‘daytime’ population)
    • Tourism (new phones that register to the network)
    • Crowd info (for example during events)
Travel behaviour of mobile phones
Mobility of very active mobile phone users
- during a 14-day period
- data of a single mobile company
Based on:
- Call and text activity multiple times a day
- Location based on phone masts
Clearly selective:
- Includes major cities
- But the North and South-east of the country much less
3. Social media messages
• Dutch are very active on social media platforms
  • Almost always carried and almost always switched on
  • More and more people have a smartphone!
• Possible information source for:
  • Which topics are current
    • Number of messages and the sentiment about them
• Can be used as a measuring instrument for:
  • .
Map by Eric Fischer (via Fast Company)
3. Social media messages
• Dutch are very active on social media platforms
• Potential information source for:
  • Topics discussed and sentiment about these topics (quickly available!) and probably more?
• Investigate it to obtain an answer on its potential use
  3a. Content
    - Collected Dutch Twitter messages for study: ‘selection’ of 12 million
  3b. Sentiment
    - Sentiment in Dutch social media messages: ‘all’ ~2 billion
Social media: Dutch Twitter topics
[Figure: share of topics in 12 million Dutch Twitter messages. Topic labels were lost; the shares shown are 46%, 10%, 7%, 7%, 5%, 3%, 3%, 3% and 3%.]
Sentiment in Social media
• Access to Coosto database
  • ~2 billion publicly available messages
  • Twitter, Facebook, Hyves, web forums, blogs etc.
• Sentiment of each message
  • Positive, negative or neutral
• Interesting finding
  • Looked at so-called ‘Mood of the nation’ compared to Consumer confidence of Statistics Netherlands
Consumer confidence, survey data
Sentiment towards the economic climate
[Figure: monthly time series; y-axis: (pos - neg) as % of total]
~1000 respondents/month
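The y-axis quantity, (pos - neg) as % of total, is a simple balance statistic. A minimal sketch follows, assuming that "total" counts positive, negative and neutral answers together; the function name `sentiment_balance` and the sample counts are hypothetical.

```python
def sentiment_balance(positive: int, negative: int, neutral: int = 0) -> float:
    """Return (positive - negative) as a percentage of all responses."""
    total = positive + negative + neutral
    if total == 0:
        raise ValueError("at least one response is required")
    return 100.0 * (positive - negative) / total


# Hypothetical month with 1000 respondents.
print(sentiment_balance(positive=300, negative=450, neutral=250))  # -15.0
```

The same function applies to message counts when each message is labelled positive, negative or neutral.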
Sentiment in social media messages
Sentiment towards the economic climate & social media message sentiment
[Figure: both monthly time series; y-axis: (pos - neg) as % of total]
Corr: 0.88
~25 million messages/month
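The slide reports only "Corr: 0.88" and does not name the coefficient. Assuming it is the Pearson correlation between the two monthly series, it could be computed along the following lines. The two series below are invented for illustration and are not the data behind the slide.

```python
from math import sqrt
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Hypothetical monthly balances: survey-based versus social media based.
survey = [-30.0, -28.0, -35.0, -38.0, -33.0, -29.0]
social = [-12.0, -10.0, -15.0, -18.0, -16.0, -11.0]
print(round(pearson(survey, social), 2))
```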
Challenges: Big Data and statistics
• Legal
  • Is access routinely allowed (not only for research)?
• Privacy
  • With more and more data, privacy demands increase
  • We have to be careful here!
• Costs
  • In the Netherlands we don’t pay for admin data.
  • Should we pay for Big Data?
• Manage
  • Who owns the data? Stability of delivery/source
  • Because of its volume, run queries in the database of the data source holder
Challenges: Big Data and statistics (2)
• Methodological
  • Big data sources register events, not units, and they are selective!
  • Methods & models specific for large datasets (fast and ‘robust’)
  • Try to ‘make big data small’ ASAP (noise reduction)
• Technological
  • Learn from ‘computational statistical’ research areas
  • High Performance Computing needs, parallel processing
• People
  • Need ‘data scientists’ (statistically minded people with programming skills who are curious)
  • Who are able to think outside the traditional sample-survey-based paradigm!
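One simple reading of ‘make big data small’ is to aggregate event-level records as early as possible. The sketch below reduces minute-level loop records to hourly totals in a single streaming pass, so the full dataset never has to sit in memory. The record layout `(loop_id, minute_of_day, vehicle_count)` and the function name `hourly_totals` are assumptions for illustration; aggregation is only one of several possible reduction steps and is not presented on the slides.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Assumed record layout: (loop_id, minute_of_day, vehicle_count)
Record = Tuple[str, int, int]


def hourly_totals(records: Iterable[Record]) -> Dict[Tuple[str, int], int]:
    """Sum minute-level counts into (loop_id, hour) totals in one pass."""
    totals: Dict[Tuple[str, int], int] = defaultdict(int)
    for loop_id, minute, count in records:
        totals[(loop_id, minute // 60)] += count
    return dict(totals)


# Hypothetical records for two loops.
sample = [("A", 0, 3), ("A", 59, 2), ("A", 60, 4), ("B", 61, 1)]
print(hourly_totals(sample))
```

Because the function accepts any iterable, it can be fed from a generator that reads a file line by line.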
The future of Stat Neth?