
WP7 Multi Domains

Anna Nowicka – WP7 leader

Jacek Maślankowski – WP7 Coordinator of methodology

in cooperation with WP7 team


Agenda

• Updates on organizational issues

• Work done

• Pilots
  • Agriculture
  • Tourism
  • Population

• Combining data

• Future work and summary

WP7 Multi Domains, Sofia ESSNet Big Data Workshop, February 23-24, 2017 2


Updates on organizational issues


WP7 team

Apart from GUS (Statistics Poland), which leads WP7, and CBS (Statistics Netherlands), this WP is carried out by two other ESSnet Big Data partners: CSO (Statistics Ireland) and ONS (Statistics United Kingdom).


[Diagram: ESSnet Big Data work packages]

• WP0: Co-ordination
• WP1: Webscraping / Job Vacancies
• WP2: Webscraping / Enterprise Characteristics
• WP3: Smart Meters
• WP4: AIS Data
• WP5: Mobile Phone Data
• WP6: Early Estimates
• WP7: Multi Domains
• WP8: Methodology
• WP9: Dissemination

From SGA-2 (in March 2017), Portugal will join this team.


The goal

• The aim of WP 7 is to find out how a combination of Big Data sources, administrative data and statistical data may enrich statistical output.

• The WP team will describe the data collection, linkage, processing and methodology used when combining these data in statistical domains. Additional value could be discovered in the linkages between domains.


Suggest pilots and domains with successful implementation for further elaboration in the second wave of pilots in 2018


Detailed description


Investigate how these data sources can be used to improve current statistics in the domains ‘Population’, ‘Tourism/border crossings’ and ‘Agriculture’.

The WP7 team divided the work into four main groups of tasks:

Task 1. Data availability/Data inventory (SGA-1)

Task 2. Data feasibility (SGA-1)

Task 3. Data combination (SGA-2)

Task 4. Summary plus future perspectives (SGA-2)

Similarities and differences between countries, concerning the availability of registers and the legality of data linkage, will be taken into account when carrying out the tasks.

Pre-Pilot use cases

Pilot use cases


Final SGA-1 description of use cases

| Domain | Population | Agriculture | Tourism/Border Crossing |
|---|---|---|---|
| Name of the use case | Everyday citizen satisfaction. Opinions about public events with major impact on people’s satisfaction | Estimation of agricultural statistics – pilot case study on crop types based on satellite data | Border movement |
| Big Data source | Social media/blogs/Internet portals | Satellite images | Traffic sensors |
| Responsibility | UK+PL, PT (from SGA-2) | PL+IE | PL+NL, PT (from SGA-2) |
| Brief overview of the methodology | Webscraping; Data/Text/Web mining; Machine learning | Combinations of data – data fusion on radar and optical remote sensing data; data comparison with traditional surveys, e.g. FSS; combining data – administrative data sources with satellite data | Intertemporal disaggregation and interpolation; Latent variable models; Cross entropy econometrics |


Work done


General overview

• Brainstorming on data sources

• Questionnaire on different aspects of Big Data implementation
  • e.g., data access, data quality, combining data, methodology

• Final use cases

• Several videoconferences

• Annotated bibliography

• Pre-Pilot use cases implementation


Annotated bibliography

• Several research papers and statistical reports relevant for WP7 work, e.g.:

• Social Media Sentiment and Consumer Confidence: Piet J.H. Daas and Marco J.H. Puts;

• Experiment report: Social Media - Sentiment Analysis, UNECE, created by: Antonino Virgillito, modified by Steven Vale;

• Twitter Sentiment Classification using Distant Supervision; Alec Go, Richa Bhayani, Lei Huang.


Pilots

• Population
  • pilot use case based on Twitter
• Agriculture
  • based on satellite maps
• Tourism
  • applying advanced methods of data aggregation and disaggregation


Agriculture Use Case Pilot technical aspects and preliminary results


Data sources already collected

• The Integrated Administration and Control System (IACS) – complex administrative information system

• The Land Parcel Identification System (LPIS) – part of the National Register of Producers

• LUCAS – Land Use and Coverage Area frame Survey

• Copernicus – previously known as GMES – Global Monitoring for Environment and Security programme


Methodology and technical aspects

• Dedicated remote sensing (teledetection) software
• Data calibration
• Data combining
• Machine Learning
• Data volume for the Pilot Use Case
  • 1.5 TB for 2015 for 1 of 16 voivodships
  • 3.5 TB for 2016 for 2 of 16 voivodships

Detailed methodology:
• time series for classes of spectral reflectance for each crop
• interviewers:
  • analysis of the communication (transport) network
  • allocation of network zones for each interviewer
  • directions to points
  • photos (series of geo-tagged images)


The network zones allocation to interviewers in relation to the transport system in the Warmińsko-Mazurskie Voivodship (Poland)

[Map: network zones in the Warmińsko-Mazurskie Voivodship; labelled cities: Ełk, Elbląg, Olsztyn]


The training fields selection criteria

• the parcel size – depending on the field size, can be adopted for Sentinel satellite data and administrative data (including agricultural surveys)

• the distance from the road

• homogeneous crops – enabling unambiguous crop identification


Example of the training field


Data segmentation

• administrative vector data (LPIS)

• radar satellite images

• optical satellite images

• Depending on the actual data quality and availability, one or several segmentation methods can be used.

• Fitting the type of crop to the satellite pixel.


Data classification

• Support Vector Machine (SVM),

• Decision Trees (DT),

• K-Nearest Neighbours (KNN)

including the following classification parameters:

• Sigma,

• Entropy,

• Alpha,

• multi-temporal indicators,

• Wishart distribution.
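
As a rough illustration of one of the classifiers listed above, the following is a minimal K-Nearest Neighbours sketch in plain Python. The two features per parcel (mean backscatter and entropy), the crop labels and the use of Euclidean distance are assumptions made for the example; the slides do not state the feature layout or the distance measure used in the pilot.

```python
from collections import Counter
import math


def knn_predict(train_x, train_y, sample, k=3):
    """Classify `sample` by majority vote among its k nearest training points."""
    # Euclidean distance from the sample to every training feature vector.
    distances = sorted(
        (math.dist(features, sample), label)
        for features, label in zip(train_x, train_y)
    )
    # Majority vote over the labels of the k closest training points.
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical per-parcel features: (mean backscatter in dB, entropy).
train_x = [(-8.0, 0.30), (-8.5, 0.35), (-7.8, 0.28),
           (-14.0, 0.70), (-13.5, 0.75), (-14.2, 0.68)]
# Hypothetical crop labels for the training fields.
train_y = ["wheat", "wheat", "wheat", "rape", "rape", "rape"]

print(knn_predict(train_x, train_y, (-8.2, 0.32)))
print(knn_predict(train_x, train_y, (-13.8, 0.72)))
```

In practice the training points would come from the training fields, and a library implementation would normally replace this hand-written version.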


Data aggregation

Results assessment – once processed, the images were assessed in terms of their usefulness through:

• analysing the error (confusion) matrix of the training fields classification,

• analysing the accuracy of the calculations,

• making comparisons with administrative data,

• making comparisons with statistical data.
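
The error-matrix step can be sketched as follows. This is a minimal example, assuming rows hold the reference class and columns the predicted class; the matrix values are invented for illustration and are not results from the pilot.

```python
def overall_accuracy(matrix):
    """Share of correctly classified samples: diagonal sum over grand total."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def producers_accuracy(matrix):
    """Per-class recall: diagonal entry over its row (reference) total."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]


def users_accuracy(matrix):
    """Per-class precision: diagonal entry over its column (predicted) total."""
    columns = list(zip(*matrix))
    return [col[i] / sum(col) for i, col in enumerate(columns)]


# Hypothetical error matrix: rows = reference class, columns = predicted class.
error_matrix = [
    [40, 5, 5],
    [4, 30, 6],
    [6, 4, 50],
]

print(overall_accuracy(error_matrix))
print(producers_accuracy(error_matrix))
print(users_accuracy(error_matrix))
```

The overall figure corresponds to a "total accuracy" percentage once multiplied by 100; the per-class values show which crops are confused with each other.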


Calculation of the coherence matrix


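HH/HV polarisation

$$T = \begin{bmatrix} |S_{HH}|^2 & S_{HH} S_{HV}^* \\ S_{HV} S_{HH}^* & |S_{HV}|^2 \end{bmatrix}$$

VV/VH polarisation

$$T = \begin{bmatrix} |S_{VV}|^2 & S_{VV} S_{VH}^* \\ S_{VH} S_{VV}^* & |S_{VH}|^2 \end{bmatrix}$$

HH/VV polarisation

$$k = \frac{1}{\sqrt{2}} \begin{bmatrix} S_{HH} + S_{VV} \\ S_{HH} - S_{VV} \end{bmatrix}$$

$$T = \frac{1}{2} \begin{bmatrix} |S_{HH} + S_{VV}|^2 & (S_{HH} + S_{VV})(S_{HH} - S_{VV})^* \\ (S_{HH} - S_{VV})(S_{HH} + S_{VV})^* & |S_{HH} - S_{VV}|^2 \end{bmatrix}$$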
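
A minimal sketch of how such a 2×2 matrix could be computed from complex scattering values, and how an entropy parameter could be derived from its eigenvalues. Averaging over several samples (for example neighbouring pixels) and the base-2 logarithm are assumptions of this sketch; the slide only gives the matrix entries, and the sample values are invented.

```python
import math


def coherence_matrix(samples):
    """Average k * k^H over samples, where k = (s1, s2) is a two-channel scattering vector."""
    n = len(samples)
    t11 = sum(abs(s1) ** 2 for s1, _ in samples) / n
    t22 = sum(abs(s2) ** 2 for _, s2 in samples) / n
    # Off-diagonal term s1 * conj(s2); the lower-left entry is its conjugate.
    t12 = sum(s1 * s2.conjugate() for s1, s2 in samples) / n
    return [[t11, t12], [t12.conjugate(), t22]]


def entropy_2x2(t):
    """Entropy of the normalised eigenvalues of a 2x2 Hermitian matrix (log base 2)."""
    trace = t[0][0] + t[1][1]
    det = t[0][0] * t[1][1] - abs(t[0][1]) ** 2
    # Closed-form eigenvalues of a 2x2 Hermitian matrix.
    root = math.sqrt(max(trace ** 2 - 4 * det, 0.0))
    eigenvalues = [(trace + root) / 2, (trace - root) / 2]
    probs = [lam / trace for lam in eigenvalues if lam > 0]
    return -sum(p * math.log2(p) for p in probs)


# Hypothetical (S_HH, S_HV) samples for a small pixel neighbourhood.
samples = [(1 + 1j, 0.2 - 0.1j), (0.9 + 1.1j, -0.1 + 0.3j), (1.2 + 0.8j, 0.3 + 0.2j)]
T = coherence_matrix(samples)
print(T)
print(entropy_2x2(T))
```

For the HH/VV case, the two components of the vector k would be passed in place of the raw channel values.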


Vector fields development layers (based on ARMA, paying agency for agriculture subsidies)

| Vector layer | Description |
|---|---|
| Z | Woody or bushy land |
| I | Other land not suitable for agricultural activity |
| P | Potentially agricultural land |
| W | Water |
| U | Industrial or urbanised area |
| T | Permanent grassland |
| S | Orchard |
| L | Forest |
| K | Transportation area |
| D | Habitat |


Comparison of different Machine Learning algorithms

| No. | Method | Classified data | Total accuracy [%] |
|---|---|---|---|
| 1 | Support Vector Machine (SVM) | Sigma | 81.0 |
| 2 | Support Vector Machine (SVM) | Sigma, Entropy, Alpha | 77.5 |
| 3 | Decision Trees (DT) | Sigma | 73.4 |
| 4 | Decision Trees (DT) | Sigma, Entropy, Alpha | 73.5 |
| 5 | Decision Trees (DT) | Multi-temporal indicators | 72.8 |
| 6 | K-Nearest Neighbours (KNN) | Wishart distribution | 81.7 |


Summary

• The use case is a concept of using radar Sentinel-1 and optical Sentinel-2 data for the purpose of crop identification

• The approach uses time series of crop development data, which should be processed and then classified

• It is important to integrate reference data from in situ surveys, as well as the existing administrative data

• The results can be aggregated in line with the intended spatial division and validated in respect of the reference, administrative and statistical data


Tourism/Border Crossing Use Case Pilot technical aspects and preliminary results


Methodology – obstacles

• proper definition of the surveyed population

• non-representative sample

• listing all websites relevant to the task (defining the survey frame)

• behaviour on the Internet may differ from behaviour observed in real life


Main findings

• new regulations on toll charges, which may change the vehicle traffic density on a given segment
• an increase in the number of toll charge points on a given segment
• changes in traffic density resulting from:
  • atmospheric conditions – such changes should impact tourism movement
  • administrative regulations or random events, such as road accidents


Potential data sources

• Twitter

• Flickr

• Google Trends

• Administrative data

Netherlands: AIS data, call detail records, road sensor data, smart city data, Twitter.

UK: Flickr, Twitter, travel smartcards, credit card transactions, mobile phone data.

Ireland: flight websites, hotel websites, Wikipedia, Twitter, Google Trends, ATM data, credit card data, traffic loops, mobile phone data, AIS data.


Administrative data sources and additional data

• Generalna Dyrekcja Dróg Krajowych i Autostrad for Poland (GDDKiA)

• Bundesanstalt für Straßenwesen for Germany (BASt)

• Ředitelství Silnic a Dálnic for Czech Republic (RSD)

• Národná diaľničná spoločnosť for Slovakia (NDS)

• Kelių ir transporto tyrimo institutas for Lithuania (KTTI)


Number of measurement points by GDDKiA

[Chart: number of measurement points (axis 0 to 5000) per year, 2005 to 2015; series: General Traffic Measurement, Continuous Traffic Measurement]


Traffic intensity at Wierzbica-Pułtusk and Stolno-Kończewice points, over the period 2006-2015

[Chart: AADT (axis 0 to 16000) per year, 2006 to 2015; series: Wierzbica-Pułtusk, Stolno-Kończewice]


Methodology

• data imputation

• connecting traffic intensity variables with external factors

• including distance matrix or adjacency matrix to improve data coherence

• modelling level shifts

• temporal disaggregation of yearly data to quarterly data
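
To make the last step concrete, here is a minimal sketch of temporal disaggregation by pro-rata distribution: each yearly total is split across four quarters in proportion to a quarterly indicator series. The slides do not say which disaggregation method is used, so pro-rata is an assumption chosen for brevity, and the numbers are invented; smoother methods exist that avoid jumps between years.

```python
def disaggregate_pro_rata(annual_totals, quarterly_indicator):
    """Split each annual total across four quarters in proportion to an indicator."""
    if len(quarterly_indicator) != 4 * len(annual_totals):
        raise ValueError("need four indicator values per year")
    quarterly = []
    for year, total in enumerate(annual_totals):
        # The four indicator values that belong to this year.
        block = quarterly_indicator[4 * year: 4 * year + 4]
        block_sum = sum(block)
        # Each quarter receives its indicator share of the annual total.
        quarterly.extend(total * value / block_sum for value in block)
    return quarterly


# Hypothetical yearly vehicle counts and a hypothetical quarterly indicator
# (for example, counts from a continuously measuring station).
annual = [400, 800]
indicator = [1, 1, 1, 1, 1, 2, 3, 2]
print(disaggregate_pro_rata(annual, indicator))
```

By construction the four quarterly values of each year add up to the yearly figure, which is the basic consistency requirement of any disaggregation method.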


The general formulation

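(eq.1)

$$\min H_q(p // p^0, r // r^0, w // w^0) \equiv \alpha \sum p_{km} \frac{(p_{km}/p^0_{km})^{q-1} - 1}{q-1} + \beta \sum r_{nj} \frac{(r_{nj}/r^0_{nj})^{q-1} - 1}{q-1} + \delta \sum w_{ts} \frac{(w_{ts}/w^0_{ts})^{q-1} - 1}{q-1}$$

subject to

(eq.2)

$$Y = X \cdot \beta + e = X \cdot \sum_{m=1}^{M} v_m \frac{p_{km}^q}{\sum_{m=1}^{M} p_{km}^q} + \sum_{j=1}^{J} z_j \frac{r_{nj}^q}{\sum_{j=1}^{J} r_{nj}^q}$$

(eq.3)

$$\sum_{k=1}^{K} \sum_{m>2}^{M} p_{km} = 1$$

(eq.4)

$$\sum_{n=1}^{N} \sum_{j>2}^{J} r_{nj} = 1$$

(eq.5)

$$\sum_{t=1}^{T} \sum_{s>2}^{S} w_{ts} = 1$$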
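
Each of the three terms in (eq.1) has the same shape, which can be evaluated with a few lines of Python. This is a sketch of the objective only, not of the constrained optimisation; the function names and the example probabilities are hypothetical, and the q = 1 branch uses the limiting Kullback-Leibler form.

```python
import math


def tsallis_cross_entropy(p, p0, q):
    """Sum of p_i * ((p_i / p0_i)**(q - 1) - 1) / (q - 1) over all i."""
    if q == 1:
        # In the limit q -> 1 the term becomes the Kullback-Leibler divergence.
        return sum(pi * math.log(pi / p0i) for pi, p0i in zip(p, p0) if pi > 0)
    return sum(
        pi * ((pi / p0i) ** (q - 1) - 1) / (q - 1)
        for pi, p0i in zip(p, p0)
        if pi > 0
    )


def objective(p, p0, r, r0, w, w0, q, alpha, beta, delta):
    """Weighted sum of the three cross-entropy terms appearing in (eq.1)."""
    return (alpha * tsallis_cross_entropy(p, p0, q)
            + beta * tsallis_cross_entropy(r, r0, q)
            + delta * tsallis_cross_entropy(w, w0, q))


# Hypothetical posterior and prior probabilities.
p, p0 = [0.5, 0.5], [0.25, 0.75]
print(tsallis_cross_entropy(p, p0, q=2))
print(tsallis_cross_entropy(p, p0, q=1))
```

The value is zero when the posterior equals the prior and grows as the two diverge, which is what the minimisation penalises.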


Population Use Case Pilot technical aspects and preliminary results


Population use case


1. Daily satisfaction (Life satisfaction)
2. The mood of the population associated with public events (e.g., Brexit, voting)
3. The morbidity areas (e.g., flu)

Population indicators will be limited to:
• Resident population
• Number of women per 100 men
• Population structure

Data obtained through the proposed solutions enable:
• extending the scope of the database,
• obtaining more recent data,
• adding more detailed cross-sections for the study population of social media and Internet users (currently there are no such sub-populations in similar studies).


Detailed information on the expected results


Example of indicators of wellbeing Eu2013 module for EU-SILC

Example of questions in the European Social Survey questionnaire as the framework of questions about feelings from last week


Process of data analysis


Data sources:

• DS1 – Twitter

• DS2 – Google Trends

• DS3 – Comments on Specific News/Events on Web Portals such as gazeta.pl, bbc.co.uk, irishtimes.com, spiegel.de, guardian.com

Emotional states according to EU-SILC:

• Very upset;

• So deeply depressed that nothing can lift your spirits;

• Quiet and calm;

• Discouraged, nailed or had the blues;

• Lucky.


The framework of the pilot use case for Population domain


[Diagram: processing flow with the elements Twitter data, data extracting, training dataset, labels, feature vectors, machine learning algorithm, predictive model and result set; tools shown: Tweepy, Sklearn]


Training Dataset

| No. | Text | Target | Language | Id |
|---|---|---|---|---|
| 1 | Rousey is gonna quit UFC forever now lmao #SoHappy | Satisfied | EN | F1 |
| 2 | And I did absolutely nothing #satisfied 😁🎉❤ | Satisfied | EN | F2 |
| 3 | To był cudowny weekend 😊😇💚💛💜 #love #happy #awesome #osom #bestweekend @ Czestochowa | Satisfied | PL | F3 |
| 4 | Połączenie nowoczesnego designu z funkcjonalnością sprawi, że osiągniesz jeszcze lepsze wyniki. | Neutral | PL | F4 |
| 5 | They want more happiness & more money in 2017 cause they're not satisfied w/the position of each. It don't matter the context. #Unsatisfied | Not satisfied | EN | F5 |
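
A labelled dataset of this kind can be turned into a predictive model roughly as sketched below. The pilot framework names Sklearn for this role; to keep the example self-contained, a small bag-of-words naive Bayes classifier written in plain Python stands in for it here. The class name, the tokenisation rule and the short training texts are all hypothetical and are not taken from the pilot data.

```python
from collections import Counter, defaultdict
import math
import re


def tokenize(text):
    """Lowercase the text and keep words and hashtags as features."""
    return re.findall(r"#?\w+", text.lower())


class NaiveBayesText:
    """Multinomial naive Bayes over token counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)
        self.label_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocabulary = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.label_counts.values())
        scores = {}
        for label, doc_count in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocabulary)
            # Log prior plus smoothed log likelihood of every token.
            score = math.log(doc_count / total_docs)
            for token in tokenize(text):
                score += math.log((counts[token] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)


# Hypothetical training texts with the three target labels used in the table.
texts = [
    "what a wonderful weekend #happy",
    "so happy with the result #satisfied",
    "new phone model released today",
    "the office opens at nine",
    "terrible service, really disappointed #unsatisfied",
    "not satisfied with this at all",
]
labels = ["Satisfied", "Satisfied", "Neutral", "Neutral",
          "Not satisfied", "Not satisfied"]

model = NaiveBayesText().fit(texts, labels)
print(model.predict("such a happy weekend"))
print(model.predict("really disappointed"))
```

With a real training set, the feature extraction and the classifier would normally be taken from a library rather than written by hand.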


Testing Dataset


• Equal distribution of classes

• Some sentences are easier to predict

• Building the training dataset in cooperation with statisticians involved in the survey


Results of analysis (using Matplotlib for Python)


Current work on Population Use Case by ONS UK

• Stage (1) – Data Collection via API/Web Scraping
  • Facebook Graph API (Guardian Facebook Page)
  • Web Scraping (Guardian Website)
• Stage (2) – Text Mining/Sentiment Analysis
  • Lexicon based sentiment analysis (like, love, haha, wow, sad and angry)
• Stage (3) – Quantitative and qualitative classification of reviews
  • Pre-defined period of time (e.g., Dec 2016, Jan 2017)
  • Granularity daily or weekly
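
Stages (2) and (3) can be sketched as a small aggregation: each reaction type gets a polarity weight and the weighted mean is computed per day. The weights, the input layout and the example counts are assumptions made for illustration; the slides do not give the lexicon values or the data format, and the collection stage (API calls, scraping) is deliberately left out.

```python
from collections import defaultdict
from datetime import date

# Hypothetical polarity weights for the six reaction types named above.
REACTION_WEIGHTS = {"like": 1, "love": 2, "haha": 1, "wow": 0, "sad": -1, "angry": -2}


def daily_sentiment(posts):
    """Return {day: mean reaction weight} for posts given as (day, {reaction: count})."""
    weighted = defaultdict(float)
    counts = defaultdict(int)
    for day, reactions in posts:
        for reaction, count in reactions.items():
            weighted[day] += REACTION_WEIGHTS[reaction] * count
            counts[day] += count
    # Skip days without any reactions to avoid dividing by zero.
    return {day: weighted[day] / counts[day] for day in weighted if counts[day]}


# Hypothetical reaction counts for three posts on two days.
posts = [
    (date(2016, 12, 1), {"like": 10, "angry": 5}),
    (date(2016, 12, 1), {"love": 5}),
    (date(2016, 12, 2), {"sad": 4}),
]
print(daily_sentiment(posts))
```

Weekly granularity would follow the same pattern with the ISO week number as the grouping key instead of the date.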


Conclusions

• Population – more recent data

• Not representative – the data will only reflect the feelings of people who are active in social networks

• Data from social networks are generally unstructured and appear irregularly


Combining data within and between domains SGA-2 perspectives


Agriculture – data combining

• Satellite data directly combined with administrative data

• Benefits:

• Increased accuracy and effectiveness

• Another advantage of satellite data is that they can support current agricultural statistics, since they reflect the progress of plant vegetation at a given time


Population - combining data – three scenarios

• Integrating data from various data sources related to a given topic – e.g., 1.1. Daily Satisfaction

• Supplementing the survey results obtained through traditional questionnaires by adding new and more detailed data to statistical tables

• Adding new data to the output tables compiled for traditional surveys


Population – matching data by attributes

• the topic surveyed

• the sentiment (also life satisfaction)

• geographic location

• other attributes available, such as gender
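
Matching on shared attributes amounts to a join between survey results and aggregated social media results. The sketch below uses plain dictionaries; the field names (`topic`, `region`, `survey_share`, `tweet_share`) and all values are hypothetical and only illustrate the mechanics.

```python
def match_by_attributes(survey_rows, social_rows, keys=("topic", "region")):
    """Left-join survey rows with social media aggregates on shared attributes."""
    # Index the social media rows by their key attributes for quick lookup.
    index = {tuple(row[k] for k in keys): row for row in social_rows}
    merged = []
    for row in survey_rows:
        match = index.get(tuple(row[k] for k in keys), {})
        # Add only the non-key fields of the matching row, if there is one.
        extra = {k: v for k, v in match.items() if k not in keys}
        merged.append({**row, **extra})
    return merged


# Hypothetical survey results and social media aggregates.
survey = [
    {"topic": "life satisfaction", "region": "PL", "survey_share": 0.71},
    {"topic": "life satisfaction", "region": "IE", "survey_share": 0.78},
]
social = [
    {"topic": "life satisfaction", "region": "PL", "tweet_share": 0.64},
]
for row in match_by_attributes(survey, social):
    print(row)
```

Survey rows without a social media counterpart are kept unchanged, so the traditional output table stays complete and is only supplemented where a match exists.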


Tourism – combining data

• Administrative data and surveys:

• Big Data from Polish data owner (GDDKiA)

• Surveys conducted by the Statistical Office in Rzeszów (SOR)

• Data from different providers

• The models restricting the cross-entropy objective function will be proposed based upon formal relations


Data combining – summary

• Combine in one repository the selected data from all Big Data sources

• Supplement the information gained in traditional surveys

• Compare with the results of traditional surveys to add more detailed information


Data combining between domains

The plan is to cross-combine data according to the following schemes:

• Population - Tourism / border crossings

• Population - Agriculture

• Agriculture - Tourism / border crossings

• Population - Tourism / border crossings - Agriculture


Conclusions


Summary

• Data sources selected

• Detailed goals for each domain established

• Agreements with data owners done or in progress

• Pre-pilot use cases done


SGA-2 perspectives

• Extend the scope of pilot surveys

• Combining data within domain as well as inter-domain data combination

• Sharing general framework


Towards SGA-2
