WP7 Multi Domains
Anna Nowicka – WP7 leader
Jacek Maślankowski – WP7 Coordinator of methodology
in cooperation with WP7 team
Agenda
• Updates on organizational issues
• Work done
• Pilots: Agriculture, Tourism, Population
• Combining data
• Future work and summary
WP7 Multi Domains, Sofia ESSNet Big Data Workshop, February 23-24, 2017 2
Updates on organizational issues
WP7 team
Apart from GUS (Statistics Poland), which leads WP7, and CBS (Statistics Netherlands), this WP is carried out by two other ESSnet Big Data partners: CSO (Statistics Ireland) and ONS (Statistics United Kingdom).
ESSnet Big Data work packages (diagram):
• WP0: Co-ordination
• WP1: Webscraping / Job Vacancies
• WP2: Webscraping / Enterprise Characteristics
• WP3: Smart Meters
• WP4: AIS Data
• WP5: Mobile Phone Data
• WP6: Early Estimates
• WP7: Multi Domains
• WP8: Methodology
• WP9: Dissemination
From SGA-2 (in March 2017), Portugal will join this team.
The goal
• The aim of WP 7 is to find out how a combination of Big Data sources, administrative data and statistical data may enrich statistical output.
• The WP team will describe the data collection, linkage, processing and methodology used when combining these data in statistical domains. Additional value could be discovered in the linkages between domains.
Detailed description
• Investigate how these data sources can be used to improve current statistics in the domains ‘Population’, ‘Tourism/border crossings’ and ‘Agriculture’.
• Suggest pilots and domains with successful implementation for further elaboration in the second wave of pilots in 2018.
The WP7 team divided the work into four main groups of tasks:
Task 1. Data availability/Data inventory (SGA-1)
Task 2. Data feasibility (SGA-1)
Task 3. Data combination (SGA-2)
Task 4. Summary plus future perspectives (SGA-2)
Similarities and differences between countries, concerning the availability of
registers, and the legality of data linkage will be taken into account when
carrying out the tasks.
Pre-Pilot use cases
Pilot use cases
Final SGA-1 description of use cases
Domain: Population
• Name of the use case: Everyday citizen satisfaction. Opinions about public events with major impact on people's satisfaction
• Big Data source: Social media/blogs/Internet portals
• Responsibility: UK+PL, PT (from SGA-2)
• Brief overview of the methodology: Webscraping; Data/Text/Web mining; Machine learning
Domain: Agriculture
• Name of the use case: Estimation of agricultural statistics – pilot case study on crop types based on satellite data
• Big Data source: Satellite images
• Responsibility: PL+IE
• Brief overview of the methodology: Combinations of data – data fusion on radar and optical remote sensing data; data comparison with traditional surveys, e.g. FSS; combining data – administrative data sources with satellite data
Domain: Tourism/Border Crossing
• Name of the use case: Border movement
• Big Data source: Traffic sensors
• Responsibility: PL+NL, PT (from SGA-2)
• Brief overview of the methodology: Intertemporal disaggregation and interpolation; Latent variable models; Cross entropy econometrics
Work done
General overview
• Brainstorming on data sources
• Questionnaire on different aspects of Big Data implementation, e.g., data access, data quality, combining data, methodology
• Final use cases
• Several videoconferences
• Annotated bibliography
• Pre-Pilot use cases implementation
Annotated bibliography
• Several research papers and statistical reports relevant for WP7 work, e.g.:
• Social Media Sentiment and Consumer Confidence: Piet J.H. Daas and Marco J.H. Puts;
• Experiment report: Social Media - Sentiment Analysis, UNECE, created by: Antonino Virgillito, modified by Steven Vale;
• Twitter Sentiment Classification using Distant Supervision; Alec Go, Richa Bhayani, Lei Huang.
Pilots
• Population
  • pilot use case based on Twitter
• Agriculture
  • based on satellite maps
• Tourism
  • applying advanced methods of data aggregation and disaggregation
Agriculture Use Case Pilot: technical aspects and preliminary results
Data sources already collected
• The Integrated Administration and Control System (IACS) – complex administrative information system
• The Land Parcel Identification System (LPIS) – part of the National Register of Producers
• LUCAS – Land Use and Coverage Area frame Survey
• Copernicus – previously known as GMES – Global Monitoring for Environment and Security programme
Methodology and technical aspects
• Dedicated remote sensing software
• Data calibration
• Data combining
• Machine Learning
• Data volume for the pilot use case:
  • 1.5 TB for 2015 for 1 of 16 voivodships
  • 3.5 TB for 2016 for 2 of 16 voivodships
Detailed methodology:
• time series for classes of spectral reflectance for each crop
• interviewers:
  • analysis of the transport network
  • allocation of network zones to each interviewer
  • directions to points
  • photos (series of geo-tagged images)
Figure: allocation of network zones to interviewers in relation to the transport system in the Warmińsko-Mazurskie Voivodship (Poland); map labels: Ełk, Elbląg, Olsztyn.
The training fields selection criteria
• the parcel size – depending on the field size, can be adopted for Sentinel satellite data and administrative data (including agricultural surveys)
• the distance from the road
• homogeneous crops – enabling unambiguous crop identification
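The three selection criteria can be read as a simple filter over candidate parcels. The sketch below is only an illustration: the `Parcel` fields, the thresholds and the assumption that a field should lie close to a road (so that interviewers can reach it) are hypothetical and do not come from the pilot.

```python
from dataclasses import dataclass


@dataclass
class Parcel:
    parcel_id: str
    area_ha: float           # parcel size in hectares
    road_distance_m: float   # distance from the nearest road in metres
    crop_codes: tuple        # crops observed on the parcel


# Hypothetical thresholds, chosen only for illustration.
MIN_AREA_HA = 1.0
MAX_ROAD_DISTANCE_M = 500.0


def is_training_field(parcel: Parcel) -> bool:
    """Apply the three criteria: size, road distance, homogeneous crop."""
    large_enough = parcel.area_ha >= MIN_AREA_HA
    reachable = parcel.road_distance_m <= MAX_ROAD_DISTANCE_M
    homogeneous = len(set(parcel.crop_codes)) == 1
    return large_enough and reachable and homogeneous


parcels = [
    Parcel("A", 2.5, 120.0, ("wheat",)),
    Parcel("B", 0.3, 50.0, ("rye",)),             # too small
    Parcel("C", 4.0, 900.0, ("maize",)),          # too far from a road
    Parcel("D", 3.0, 200.0, ("wheat", "rape")),   # mixed crops
]
selected = [p.parcel_id for p in parcels if is_training_field(p)]
print(selected)
```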
Example of the training field
Data segmentation
• administrative vector data (LPIS)
• radar satellite images
• optical satellite images
• Depending on the actual data quality and availability, one or several segmentation methods can be used.
• Fitting the type of crop to the satellite pixel.
Data classification
• Support Vector Machine (SVM),
• Decision Trees (DT),
• K-Nearest Neighbours (KNN)
including the following classification parameters:
• Sigma,
• Entropy,
• Alpha,
• multi-temporal indicators,
• Wishart distribution.
Data aggregation
Results assessment – once processed, the images were assessed in terms of their usefulness through:
• analysing the training fields classification error matrix,
• analysing the calculations accuracy,
• making comparisons with administrative data,
• making comparisons with statistical data.
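The first two assessment steps, the classification error matrix and the overall accuracy, can be sketched in a few lines. The crop labels and the reference and predicted values below are invented for illustration; they are not pilot results.

```python
from collections import Counter


def error_matrix(reference, predicted, classes):
    """Rows are reference classes, columns are predicted classes."""
    counts = Counter(zip(reference, predicted))
    return [[counts[(ref, pred)] for pred in classes] for ref in classes]


def overall_accuracy(matrix):
    """Share of samples on the diagonal of the error matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


# Invented training-field labels, for illustration only.
classes = ["wheat", "maize", "grassland"]
reference = ["wheat", "wheat", "maize", "maize",
             "grassland", "grassland", "wheat", "maize"]
predicted = ["wheat", "maize", "maize", "maize",
             "grassland", "wheat", "wheat", "maize"]

matrix = error_matrix(reference, predicted, classes)
accuracy = overall_accuracy(matrix)
for name, row in zip(classes, matrix):
    print(f"{name:10s} {row}")
print(f"overall accuracy: {accuracy:.2%}")
```

Off-diagonal cells show which crops are confused with each other, which a single accuracy figure hides.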
Calculation of the coherence matrix
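HH/HV polarisation
\[
T = \begin{bmatrix}
\langle |S_{HH}|^2 \rangle & \langle S_{HH} S_{HV}^{*} \rangle \\
\langle S_{HV} S_{HH}^{*} \rangle & \langle |S_{HV}|^2 \rangle
\end{bmatrix}
\]
VV/VH polarisation
\[
T = \begin{bmatrix}
\langle |S_{VV}|^2 \rangle & \langle S_{VV} S_{VH}^{*} \rangle \\
\langle S_{VH} S_{VV}^{*} \rangle & \langle |S_{VH}|^2 \rangle
\end{bmatrix}
\]
HH/VV polarisation
\[
k = \frac{1}{\sqrt{2}} \begin{bmatrix} S_{HH} + S_{VV} & S_{HH} - S_{VV} \end{bmatrix}^{\top}
\]
\[
T = \frac{1}{2} \begin{bmatrix}
\langle |S_{HH} + S_{VV}|^2 \rangle & \langle (S_{HH} + S_{VV})(S_{HH} - S_{VV})^{*} \rangle \\
\langle (S_{HH} - S_{VV})(S_{HH} + S_{VV})^{*} \rangle & \langle |S_{HH} - S_{VV}|^2 \rangle
\end{bmatrix}
\]
Here \(^{*}\) denotes the complex conjugate and \(\langle \cdot \rangle\) averaging over neighbouring pixels.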
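For a dual-polarisation pair, the 2x2 matrix T can be estimated by averaging the products of the complex scattering values over a window of neighbouring pixels. The sketch below uses plain Python complex numbers; the function name, the window and the sample values are illustrative assumptions, not the pilot's processing chain.

```python
def coherence_matrix(s_co, s_cross):
    """Estimate T = <k k^H> for k = [S_co, S_cross] over a pixel window.

    s_co and s_cross are equally long sequences of complex scattering
    values, e.g. S_VV and S_VH for the same pixels.
    """
    n = len(s_co)
    t11 = sum((a * a.conjugate()).real for a in s_co) / n
    t22 = sum((b * b.conjugate()).real for b in s_cross) / n
    t12 = sum(a * b.conjugate() for a, b in zip(s_co, s_cross)) / n
    return [[t11, t12], [t12.conjugate(), t22]]


# Invented complex samples for a three-pixel window.
s_vv = [1 + 1j, 2 + 0j, 0 + 1j]
s_vh = [0.5 + 0j, 0 + 0.5j, 0.5 + 0.5j]

T = coherence_matrix(s_vv, s_vh)
print(T)
```

The diagonal holds the averaged intensities, and the off-diagonal elements are complex conjugates of each other, so T is Hermitian.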
Vector fields development layers (based on ARMA, paying agency for agriculture subsidies)
Vector layers Description
Z Woody or bushy land
I Other land not suitable for agricultural activity
P Potentially agricultural land
W Water
U Industrial or urbanised area
T Permanent grassland
S Orchard
L Forest
K Transportation area
D Habitat
Comparison of different Machine Learning algorithms
No. | Method | Classified data | Total accuracy [%]
1. | Support Vector Machine (SVM) | Sigma | 81.0
2. | Support Vector Machine (SVM) | Sigma, Entropy, Alpha | 77.5
3. | Decision Trees (DT) | Sigma | 73.4
4. | Decision Trees (DT) | Sigma, Entropy, Alpha | 73.5
5. | Decision Trees (DT) | Multi-temporal indicators | 72.8
6. | K-Nearest Neighbours (KNN) | Wishart distribution | 81.7
Summary
• The use case explores the use of radar Sentinel-1 and optical Sentinel-2 data for crop identification
• It relies on time series of crop development data, which are processed and then classified
• It is important to integrate reference data from in situ surveys as well as the existing administrative data
• The results can be aggregated in line with the intended spatial division and validated against the reference, administrative and statistical data
Tourism/Border Crossing Use Case Pilot: technical aspects and preliminary results
Methodology – obstacles
• proper definition of the surveyed population
• non-representative sample
• listing all websites relevant to the task (defining the survey frame)
• behaviour on the Internet may differ from behaviour observed in real life
Main findings
• new regulations on toll charges which may change the vehicle traffic density on a given segment
• extension of the number of toll charge points on a given segment
• changes in traffic density resulting from:
  • atmospheric conditions – such changes should affect tourism movement
  • administrative regulations or random events, such as road accidents
Potential data sources
• Flickr
• Google Trends
• Administrative data
Netherlands: AIS data, Call detail records, Road sensor data, Smart city data, Twitter
UK: Flickr, Twitter, Travel smartcards, Credit card transactions, Mobile phone data
Ireland: Flight websites, Hotel websites, Wikipedia, Twitter, Google Trends, ATM data, Credit card data, Traffic loops, Mobile phone data, AIS data
Administrative data sources and additional data
• Generalna Dyrekcja Dróg Krajowych i Autostrad for Poland (GDDKiA)
• Bundesanstalt für Straßenwesen for Germany (BASt)
• Ředitelství Silnic a Dálnic for Czech Republic (RSD)
• Národná diaľničná spoločnosť for Slovakia (NDS)
• Kelių ir transporto tyrimo institutas for Lithuania (KTTI)
Figure: number of measurement points by GDDKiA, 2005-2015 (y-axis: number of measurement points, 0 to 5000; series: General Traffic Measurement, Continuous Traffic Measurement).
Figure: traffic intensity at the Wierzbica-Pułtusk and Stolno-Kończewice points over the period 2006-2015 (y-axis: AADT, 0 to 16000; series: Wierzbica-Pułtusk, Stolno-Kończewice).
Methodology
• data imputation
• connecting traffic intensity variables with external factors
• including distance matrix or adjacency matrix to improve data coherence
• modelling level shifts
• temporal disaggregation of yearly data to quarterly data
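To make the last step concrete, the sketch below splits a yearly total into quarters in proportion to a quarterly indicator series. This pro-rata scheme is the simplest possible stand-in, not the cross-entropy method of the pilot; the function name, the annual total and the indicator values are invented for illustration.

```python
def disaggregate_year(annual_total, quarterly_indicator):
    """Split an annual total into four quarters, pro rata to an indicator."""
    if len(quarterly_indicator) != 4:
        raise ValueError("expected four quarterly indicator values")
    weight_sum = sum(quarterly_indicator)
    if weight_sum <= 0:
        raise ValueError("indicator values must sum to a positive number")
    return [annual_total * w / weight_sum for w in quarterly_indicator]


# Invented figures: a yearly count of vehicles and a seasonal indicator
# (for example, relative quarterly activity taken from another source).
annual_vehicles = 12000
indicator = [2, 3, 4, 3]

quarters = disaggregate_year(annual_vehicles, indicator)
print(quarters)
```

The quarterly values always add back up to the annual total, which is the basic consistency requirement of any temporal disaggregation.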
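The general formulation
\[
\operatorname{Min}\; H_q\left(p /\!/ p^0,\; r /\!/ r^0,\; w /\!/ w^0\right) \equiv
\alpha \sum p_{km} \frac{\left(p_{km}/p_{km}^0\right)^{q-1} - 1}{q-1}
+ \beta \sum r_{nj} \frac{\left(r_{nj}/r_{nj}^0\right)^{q-1} - 1}{q-1}
+ \delta \sum w_{ts} \frac{\left(w_{ts}/w_{ts}^0\right)^{q-1} - 1}{q-1}
\tag{eq.1}
\]
subject to
\[
Y = X \cdot \beta + e = X \cdot \sum_{m=1}^{M} v_m \frac{p_{km}^{q}}{\sum_{m=1}^{M} p_{km}^{q}} + \sum_{j=1}^{J} z_j \frac{r_{nj}^{q}}{\sum_{j=1}^{J} r_{nj}^{q}}
\tag{eq.2}
\]
\[
\sum_{k=1}^{K} \sum_{m>2}^{M} p_{km} = 1
\tag{eq.3}
\]
\[
\sum_{n=1}^{N} \sum_{j>2}^{J} r_{nj} = 1
\tag{eq.4}
\]
\[
\sum_{t=1}^{T} \sum_{s>2}^{S} w_{ts} = 1
\tag{eq.5}
\]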
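Each term of the objective in (eq.1) is a sum of p·((p/p0)^(q−1) − 1)/(q − 1) over the support points, and (eq.2) reweights the probabilities as p^q divided by the sum of p^q. The sketch below evaluates both pieces for a single distribution; the function names and the numbers are illustrative assumptions, and the sketch does not solve the constrained minimisation itself.

```python
import math


def tsallis_relative_entropy(p, p0, q):
    """sum_i p_i * ((p_i / p0_i)**(q - 1) - 1) / (q - 1).

    For q -> 1 this tends to the Kullback-Leibler divergence,
    which is used directly when q equals 1.
    """
    pairs = [(pi, p0i) for pi, p0i in zip(p, p0) if pi > 0]
    if abs(q - 1.0) < 1e-12:
        return sum(pi * math.log(pi / p0i) for pi, p0i in pairs)
    return sum(pi * ((pi / p0i) ** (q - 1) - 1) / (q - 1) for pi, p0i in pairs)


def escort(p, q):
    """Reweighted probabilities p_i**q / sum_j p_j**q, as in (eq.2)."""
    powered = [pi ** q for pi in p]
    total = sum(powered)
    return [value / total for value in powered]


# Invented posterior p and prior p0 on two support points.
p = [0.5, 0.5]
p0 = [0.25, 0.75]

print(tsallis_relative_entropy(p, p0, q=2.0))   # chi-square-like distance
print(tsallis_relative_entropy(p, p0, q=1.0))   # Kullback-Leibler limit
print(escort([0.2, 0.8], q=2.0))
```

The objective is zero when the estimated probabilities equal the priors and grows as they move apart, so minimising it keeps the solution as close to the prior information as the constraints allow.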
Population Use Case Pilot: technical aspects and preliminary results
Population use case
Detailed information on the expected results
1. Daily satisfaction (life satisfaction)
2. The mood of the population associated with public events (e.g., Brexit, voting)
3. Morbidity areas (e.g., flu)
Population indicators will be limited to:
• Resident population
• Number of women per 100 men
• Population structure
The data obtained through the proposed solutions make it possible to:
• extend the scope of the database,
• obtain more recent data,
• add more detailed cross-sections for the population of social media and Internet users (currently there are no such sub-populations in similar studies).
Process of data analysis
• Example of indicators of wellbeing: EU 2013 module for EU-SILC
• Example of questions in the European Social Survey questionnaire as the framework of questions about feelings from last week
The framework of the pilot use case for the Population domain
Data sources:
• DS1 – Twitter
• DS2 – Google Trends
• DS3 – Comments on specific news/events on web portals such as gazeta.pl, bbc.co.uk, irishtimes.com, spiegel.de, guardian.com
Emotional states according to EU-SILC:
• Very upset;
• So deeply depressed that nothing can lift your spirits;
• Quiet and calm;
• Discouraged, dejected or had the blues;
• Happy.
Diagram (labels): data, Tweepy, Sklearn, training dataset, machine learning algorithm, data extracting, predictive model, labels, feature vectors, result set.
Training Dataset
No. | Text | Target | Language | Id
1 | Rousey is gonna quit UFC forever now lmao #SoHappy | Satisfied | EN | F1
2 | And I did absolutely nothing #satisfied 😁🎉❤ | Satisfied | EN | F2
3 | To był cudowny weekend 😊😇💚💛💜 #love #happy #awesome #osom #bestweekend @ Czestochowa | Satisfied | PL | F3
4 | Połączenie nowoczesnego designu z funkcjonalnością sprawi, że osiągniesz jeszcze lepsze wyniki. | Neutral | PL | F4
5 | They want more happiness & more money in 2017 cause they're not satisfied w/the position of each. It don't matter the context. #Unsatisfied | Not satisfied | EN | F5
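A labelled training set of this shape (text plus target) can be turned into a predictive model in a few steps: tokenise, count words per label, and score new texts. The pilot names Tweepy and Sklearn for this; to stay dependency-free, the sketch below swaps in a small hand-written multinomial naive Bayes with add-one smoothing. The class, the tokeniser and the example sentences are invented for illustration and are not the pilot's model or data.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case the text and keep words and hashtags as tokens."""
    return re.findall(r"[#\w]+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocabulary = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.label_counts.items():
            counts = self.word_counts[label]
            denominator = sum(counts.values()) + len(self.vocabulary)
            score = math.log(n_docs / total_docs)
            for word in tokenize(text):
                if word in self.vocabulary:   # ignore unseen words
                    score += math.log((counts[word] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented training sentences in the Text/Target layout of the table.
texts = [
    "what a wonderful weekend #happy",
    "so pleased with the result #satisfied",
    "terrible service again #unsatisfied",
    "really disappointed and angry today",
    "the office opens at nine",
    "new model announced by the company",
]
labels = ["Satisfied", "Satisfied", "Not satisfied",
          "Not satisfied", "Neutral", "Neutral"]

model = NaiveBayes().fit(texts, labels)
print(model.predict("wonderful result #happy"))
print(model.predict("terrible and disappointed"))
```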
Testing Dataset
Results of analysis (using Matplotlib for Python)
• Equal distribution of classes
• Some sentences are easier to predict
• Building the training dataset in cooperation with statisticians involved in the survey
Current work on Population Use Case by ONS UK
• Stage (1) – Data Collection via API/Web Scraping
• Facebook Graph API (Guardian Facebook Page)
• Web Scraping (Guardian Website)
• Stage (2) – Text Mining/Sentiment Analysis
• Lexicon based sentiment analysis (like, love, haha, wow, sad and angry)
• Stage (3) – Quantitative and qualitative classification of reviews
• Pre-defined period of time (e.g., Dec 2016, Jan 2017)
• Granularity daily or weekly
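Stages 2 and 3 can be sketched as a weighted count of reactions per post, aggregated by day or by ISO week. The polarity weights, the function names and the sample posts below are assumptions made for illustration; the ONS lexicon and data are not reproduced here.

```python
from collections import defaultdict
from datetime import date

# Hypothetical polarity weights for the six reaction types.
REACTION_WEIGHTS = {"like": 1, "love": 1, "haha": 1,
                    "wow": 0, "sad": -1, "angry": -1}


def period_key(day, granularity):
    """Return the label of the day or ISO week that contains the date."""
    if granularity == "daily":
        return day.isoformat()
    if granularity == "weekly":
        iso = day.isocalendar()
        return f"{iso[0]}-W{iso[1]:02d}"
    raise ValueError("granularity must be 'daily' or 'weekly'")


def sentiment_by_period(posts, granularity="daily"):
    """Average reaction polarity per period, between -1 and 1."""
    totals = defaultdict(lambda: [0, 0])   # key -> [weighted sum, count]
    for day, reactions in posts:
        key = period_key(day, granularity)
        for name, count in reactions.items():
            totals[key][0] += REACTION_WEIGHTS[name] * count
            totals[key][1] += count
    return {key: s / n for key, (s, n) in totals.items() if n}


# Invented posts: (publication date, reaction counts).
posts = [
    (date(2016, 12, 5), {"like": 30, "angry": 10}),
    (date(2016, 12, 5), {"love": 5, "sad": 5}),
    (date(2016, 12, 7), {"sad": 8, "wow": 2}),
]

daily = sentiment_by_period(posts, "daily")
weekly = sentiment_by_period(posts, "weekly")
print(daily)
print(weekly)
```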
Conclusions
• Population – more recent data
• Not representative – the data will only reflect the feelings of people who are active in social networks
• Data from social networks are generally unstructured and appear irregularly
Combining data within and between domains: SGA-2 perspectives
Agriculture – data combining
• Satellite data directly combined with administrative data
• Benefits:
• Increased accuracy and effectiveness
• Satellite data can also support current agricultural statistics by tracking the progress of plant vegetation at a given time
Population - combining data – three scenarios
• Integrating data from various data sources related to a given topic – e.g., 1.1. Daily Satisfaction
• Supplementing the survey results obtained through traditional questionnaires by adding new and more detailed data to statistical tables
• Adding new data to the output tables compiled for traditional surveys
Population – matching data by attributes
• the topic surveyed
• the sentiment (also life satisfaction)
• geographic location
• other attributes available, such as gender
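Matching on these attributes amounts to joining two tables on a shared key. The sketch below joins survey rows with social media aggregates on topic, location and gender; the field names and all values are invented for illustration.

```python
def match_by_attributes(left, right, keys):
    """Inner-join two lists of dict records on the given attribute names."""
    index = {tuple(row[k] for k in keys): row for row in right}
    matched = []
    for row in left:
        partner = index.get(tuple(row[k] for k in keys))
        if partner is not None:
            matched.append({**row, **partner})
    return matched


# Invented records, for illustration only.
survey = [
    {"topic": "daily satisfaction", "location": "PL", "gender": "F",
     "survey_share": 0.72},
    {"topic": "daily satisfaction", "location": "PL", "gender": "M",
     "survey_share": 0.69},
]
social_media = [
    {"topic": "daily satisfaction", "location": "PL", "gender": "F",
     "tweet_share": 0.64},
    {"topic": "daily satisfaction", "location": "UK", "gender": "F",
     "tweet_share": 0.58},
]

combined = match_by_attributes(
    survey, social_media, keys=("topic", "location", "gender"))
print(combined)
```

Only rows that agree on every key are kept, so the more attributes are used, the fewer but more comparable the matched records become.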
Tourism – combining data
• Administrative data and surveys:
• Big Data from Polish data owner (GDDKiA)
• Surveys conducted by the Statistical Office in Rzeszów (SOR)
• Data from different providers
• Models constraining the cross-entropy objective function will be proposed on the basis of formal relations
Data combining – summary
• Combine in one repository the selected data from all Big Data sources
• Supplement the information gained in traditional surveys
• Compare with the results of traditional surveys to add more detailed information
Data combining between domains
Data are planned to be combined across domains according to the following schemes:
• Population - Tourism / border crossings
• Population - Agriculture
• Agriculture - Tourism / border crossings
• Population - Tourism / border crossings - Agriculture
Conclusions
Summary
• Data sources selected
• Detailed goals for each domain established
• Agreements with data owners done or in progress
• Pre-pilot use cases done
SGA-2 perspectives
• Extending the scope of pilot surveys
• Combining data within domains as well as between domains
• Sharing a general framework
Towards SGA-2