Evaluation and checking non-response data by soft computing
approaches - case of business and trade statistics
Miroslav Hudec, Jana Juriová INFOSTAT – Institute of Informatics and Statistics
Brussels, 7. March, 2013
Presentation roadmap
1. Introduction
2. Evaluation by fuzzy logic of how current algorithms estimate missing values
3. New approach for estimation of missing values by neural networks
4. Further research topics
Database
1. Evaluation by fuzzy logic of how current algorithms estimate missing values
2. New approach for estimation of missing values by neural networks
company  attr 1   attr 2   attr 3   attr 4   attr 5   ……             attr n
Id 1     5645614  5645614  5645614  5645614  5645614  medium high    5645614
Id 2     5645614  565978   23545    23545    565978   small          5645614
Id 3     23545    23545    missing  23545    565978   not too small  5645614
Id 4     23545    23545    8768556  8768556  8768556  high           8768556
….       457688   565978   8768556  8768556  8768556  missing        8768556
Id m     457688   457688   457688   457688   457688   very high      457688
Administrative data
Surveys
could cope with this data as well
Fuzzy logic- Evaluation …
If the estimated values in the Intrastat database have more or less the same properties as the data received from respondents, then we can say that the algorithms for data imputation do not need improvement.
Intrastat database tables (SK) contain two parts: data
obtained from respondents and data estimated because of missing
values. Both parts share the same table structure, and one column
indicates whether a row (trade) was collected or estimated.
Miroslav Hudec, INFOSTAT Slovakia
A usual (crisp) rule describing the evaluated property is either fully satisfied
or fully rejected. If a rule is rejected, we cannot tell whether the data almost
satisfy the rule or whether they are far away from the rule condition.
Fuzzy rules are different.
A fuzzy rule is able to capture statisticians' knowledge, which is often
expressed with ambiguities and uncertainties (linguistic terms and
quantifiers), and can be applied directly to databases.
A fuzzy rule has a degree of truth, which is a value from the [0, 1]
interval. The truth value indicates how strongly the data meet the rule
condition.
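The difference between a crisp and a fuzzy condition can be illustrated with a minimal sketch. The linguistic term "small number of items" and its parameters (fully small up to 5 items, not small at all from 10 items) are hypothetical and chosen only for illustration; the presentation does not state the membership functions actually used.

```python
def crisp_small(items, limit=5):
    """Crisp rule: either fully satisfied or fully rejected."""
    return items <= limit


def fuzzy_small(items, full=5, zero=10):
    """Fuzzy rule: degree of truth in [0, 1].

    Assumed shape: 1 up to `full`, falling linearly to 0 at `zero`.
    """
    if items <= full:
        return 1.0
    if items >= zero:
        return 0.0
    return (zero - items) / (zero - full)


# A report with 6 items and a report with 30 items are both rejected by the
# crisp rule, but the fuzzy rule shows that the first one is close to "small".
near_miss = (crisp_small(6), fuzzy_small(6))
far_miss = (crisp_small(30), fuzzy_small(30))
```

Here the crisp rule returns the same answer for both reports, while the fuzzy degree of truth separates the near miss from the clear rejection.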
Fuzzy logic - Evaluation
most of (about half, few) responded exports have a small
(medium, high) number of items (goods) in the report
most of (about half, few) non-responded exports have a small
(medium, high) number of items (goods) in the report
Fuzzy logic - Fuzzy rules
Quantifiers: most of, about half, few. Linguistic terms: small, medium, high.
If the truth values of both rules are close to each other, then both parts
of the database have similar properties, and the current algorithm works
properly.
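A quantified rule of this kind can be evaluated as sketched below. The sketch assumes a common construction for linguistic summaries: the truth value is the quantifier's membership degree of the average membership of the rows in the linguistic term. Whether the authors use exactly this construction is not stated in the slides, and all parameters and item counts are hypothetical.

```python
def small(items, full=5, zero=10):
    """Assumed membership function of the linguistic term 'small'."""
    if items <= full:
        return 1.0
    if items >= zero:
        return 0.0
    return (zero - items) / (zero - full)


def most(proportion, lower=0.5, upper=0.85):
    """Assumed membership function of the quantifier 'most of'."""
    if proportion <= lower:
        return 0.0
    if proportion >= upper:
        return 1.0
    return (proportion - lower) / (upper - lower)


def truth_most_are_small(item_counts):
    """Truth value of: 'most of the rows have a small number of items'."""
    proportion = sum(small(n) for n in item_counts) / len(item_counts)
    return most(proportion)


# Hypothetical item counts for the two parts of the database.
collected = [1, 2, 3, 4, 6, 7]
estimated = [2, 4, 6, 8, 9, 12]

truth_collected = truth_most_are_small(collected)
truth_estimated = truth_most_are_small(estimated)
gap = abs(truth_collected - truth_estimated)
```

A small `gap` would suggest that collected and estimated data have similar properties with respect to this rule; how small is small enough is a choice left to the analyst.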
Fuzzy logic - Case study
For the case study, anonymised data on Intra-EU (Intrastat)
trade were provided by the Statistical Office of the Slovak
Republic.
Intrastat survey data for the year 2009 were used, taken from one
detailed Intrastat form: the form for dispatch of goods.
The database contains one attribute which indicates whether a row
describes a collected or an estimated trade. This makes the rules
easier to evaluate, because the structure of the database is the
same for real and estimated values.
Fuzzy logic- Interface
Fuzzy logic - Example 1
most of non-responded exports have a small number
of items in the report
The truth value of the rule is 0.6773
most of responded exports have a small number of
items in the report
The truth value of the rule is 0.9313
We could conclude that the distribution is quite different in the
two cases and that the algorithm should be improved.
If we use this rule in data analysis, we could conclude that most of
our exports have a small number of items in their reports.
Fuzzy logic - Example 2
The second kind of rule concerns the distribution of countries of dispatch.
Rules: export by countries has high (medium, small) number of reports
Countries with high number of reports – surveyed data
Country  High number of reports
AT       1
CZ       1
DE       1
HU       1
PL       1
FR       0.9533
IT       0.777
RO       0.3277
SI       0.1222
NL       0.0449
GB       0.0394
BE       0.0137

Countries with high number of reports – estimated data
Country  High number of reports
AT       1
CZ       1
DE       1
FR       1
GB       1
HU       1
IT       1
PL       1
SI       0.236
ES       0.126
RO       0.0623
Similar distributions – the algorithm works properly.
The strength of the fuzzy rule is obvious in the case of FR, IT and SI.
A crisp rule might lead to the conclusion that the algorithm used for
estimating values should be improved.
Fuzzy logic - Beyond Blue-ETS
Analysing respondents' behaviour in order to find critical groups of
respondents. Revealing dependencies among trade indicators.
Data analysis, rules evaluation
e.g. most companies which belong to branch i (according to the
NACE classification) have small non-response
Dissemination on websites
Providing users with a tool capable of answering their imprecise
questions. Websites could then satisfy more user demands and thereby
improve the image of NSIs.
Which proposition is stronger: "about half of municipalities have an
altitude above sea level of around 700 m and small pollution" or
"few municipalities have an altitude above sea level of around 700 m
and small pollution"?
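A question like the one above could be answered by computing the truth value of each proposition and comparing them. The following is only a sketch: the municipalities, the membership functions for "around 700 m", "small pollution", "about half" and "few", and the use of the minimum for "and" are all assumptions made for illustration.

```python
def around_700(altitude_m):
    """Assumed triangular term: 1 at 700 m, 0 at 600 m and 800 m."""
    return max(0.0, 1.0 - abs(altitude_m - 700) / 100)


def small_pollution(level):
    """Assumed term: fully small up to 10 units, not small from 30 units."""
    if level <= 10:
        return 1.0
    if level >= 30:
        return 0.0
    return (30 - level) / 20


def about_half(proportion):
    """Assumed quantifier: 1 at 0.5, 0 at 0.3 and 0.7."""
    return max(0.0, 1.0 - abs(proportion - 0.5) / 0.2)


def few(proportion):
    """Assumed quantifier: 1 up to 0.1, 0 from 0.3."""
    if proportion <= 0.1:
        return 1.0
    if proportion >= 0.3:
        return 0.0
    return (0.3 - proportion) / 0.2


# Hypothetical municipalities: (altitude in m, pollution level).
municipalities = [(700, 5), (650, 10), (720, 20), (300, 5), (900, 40), (710, 8)]

# "and" is modelled by the minimum of the two membership degrees.
matches = [min(around_700(a), small_pollution(p)) for a, p in municipalities]
proportion = sum(matches) / len(matches)

truth_about_half = about_half(proportion)
truth_few = few(proportion)
stronger = "about half" if truth_about_half > truth_few else "few"
```

For this made-up data the "about half" proposition has the higher truth value, so it would be reported as the stronger one.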
Neural networks - Motivation
Jana Juriová, INFOSTAT Slovakia
Neural networks can deal efficiently with huge databases and are frequently used for classification problems where the borders
of classes are not exactly defined.
Statistical institutes, which have been collecting and storing vast amounts of data, can also take advantage of this
technique.
Neural networks - Approach
An attempt to explore the use of neural networks in official statistics in order to decrease response burden and improve data analysis.
Imputation of missing values in the Intrastat data system: application of the proposed classification approach using several classification items.
Main goal:
To test the ability of neural networks to classify data in cases of incomplete statistical datasets.
Neural networks - Algorithm
• A neural network is a computational model from the category of soft computing
methods, based on the abstraction of biological neural systems.
The steps of the proposed neural network algorithm:
1. Dividing data into training and validating sets.
2. Allocation of the training dataset into 2 classes – 1 means that the unit belongs to the class, 0 means that the unit does not belong to the class.
3. Creating the neural network.
4. Training the neural network with an optimization algorithm.
5. Classification of the validating dataset into classes by means of the trained neural network.
[Figure: feed-forward neural network]
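The five steps listed above can be sketched in plain Python as follows. This is not the authors' implementation: the data are synthetic, the network has one hidden layer with tanh units and a sigmoid output, and plain stochastic gradient descent stands in for the unspecified optimization algorithm. All names and parameter values are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(params, x):
    """Feed-forward pass: inputs -> tanh hidden layer -> sigmoid output."""
    W1, b1, W2, b2 = params
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    output = sigmoid(sum(w * h for w, h in zip(W2, hidden)) + b2)
    return hidden, output


def train_network(data, n_hidden=10, cycles=50, lr=0.1, seed=0):
    """Steps 3 and 4: create the network and train it by gradient descent."""
    rng = random.Random(seed)
    n_in = len(data[0][0])
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    W2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    b2 = 0.0
    for _ in range(cycles):
        for x, y in data:
            hidden, out = forward((W1, b1, W2, b2), x)
            delta_out = out - y  # cross-entropy gradient at the output unit
            for j in range(n_hidden):
                # Back-propagate through the old W2[j] before updating it.
                delta_h = delta_out * W2[j] * (1.0 - hidden[j] ** 2)
                W2[j] -= lr * delta_out * hidden[j]
                b1[j] -= lr * delta_h
                for i in range(n_in):
                    W1[j][i] -= lr * delta_h * x[i]
            b2 -= lr * delta_out
    return W1, b1, W2, b2


# Step 2: units labelled 1 (belongs to the class) or 0 (does not belong).
rng = random.Random(1)
units = ([([rng.gauss(-2, 1), rng.gauss(-2, 1)], 0) for _ in range(150)] +
         [([rng.gauss(2, 1), rng.gauss(2, 1)], 1) for _ in range(150)])

# Step 1: divide the data into training and validating sets.
rng.shuffle(units)
training, validating = units[:200], units[200:]

params = train_network(training)

# Step 5: classify the validating set with the trained network.
probabilities = [forward(params, x)[1] for x, _ in validating]
accuracy = sum((p >= 0.5) == (y == 1)
               for p, (_, y) in zip(probabilities, validating)) / len(validating)
```

The synthetic classes are deliberately well separated, so the sketch only demonstrates the workflow; real declaration data would need encoding of categorical items and far more careful validation.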
Neural networks - Imputation in Intrastat database
Intrastat database – data on foreign trade
anonymised data provided by SO SR
from detailed declarations for dispatches of goods
year 2008
The exemption threshold for dispatches of goods is set to 400 000 EUR; after reaching
this value, the company has to submit a declaration. After exceeding the simplification threshold
of 1 700 000 EUR, the company is obliged to submit a detailed declaration.
Individual business reports contain several items characterising their activity.
In this experiment only the first reports were considered, i.e. revised or
corrected reports that were sent later were not included at all.
The characteristics considered useful were the following 8 items:
time period (month), code of goods (simplified, i.e. three-digit level), invoiced
value, region of dispatch, state of destination, delivery terms, nature of
transaction and mode of transport
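The two thresholds described above can be written as a small decision function. This is a sketch of the rule as worded on the slide ("after reaching" the exemption threshold, "after exceeding" the simplification threshold); the function name, the label "simplified" for the intermediate case and the assumption that the value is an annual total are illustrative only.

```python
EXEMPTION_THRESHOLD_EUR = 400_000
SIMPLIFICATION_THRESHOLD_EUR = 1_700_000


def declaration_obligation(dispatches_eur):
    """Return the assumed reporting obligation for a company's dispatches."""
    if dispatches_eur > SIMPLIFICATION_THRESHOLD_EUR:
        return "detailed"    # above the simplification threshold
    if dispatches_eur >= EXEMPTION_THRESHOLD_EUR:
        return "simplified"  # hypothetical label for the non-detailed declaration
    return "exempt"          # below the exemption threshold


examples = {value: declaration_obligation(value)
            for value in (250_000, 400_000, 1_700_000, 2_500_000)}
```

Only companies in the "detailed" group supply the eight items used in the experiment.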
Neural networks - Imputation in Intrastat database
The main objective is to use classification by means of neural
networks for imputation of missing data in Intrastat data system.
The NN was trained on the complete dataset for classification into classes.
After reaching an acceptable degree of accuracy, the network can be used to
classify the rest of the database, which contains missing values. The NN
identifies the most similar class for each statistical unit, and this enables
imputation of the missing values.
Complete dataset
CN   OB  DD  DP  FS  REGP  STU  DOA
731  1   3   4   1   5     1    4
731  1   3   4   1   5     1    5
621  10  3   4   1   4     1    5
732  2   3   4   1   5     1    4
731  1   3   4   1   5     1    5
621  7   3   4   1   4     1    5
732  1   3   4   1   5     1    4
732  2   3   4   2   5     1    4
731  2   3   4   2   5     1    5
732  2   3   4   2   5     1    4
DOA – Nature of transaction – 2 classes:
Operations with a view to processing under contract (DOA4)
Operations following processing under contract (DOA5)
Neural networks - Imputation in Intrastat database
Evaluation of the learning process
Characteristics of the learning process   Type of transaction   Probability of inclusion into the class (%)   RMSE*
10 hidden neurons, 300 training cycles    DOA4                  59                                            0.41
10 hidden neurons, 300 training cycles    DOA5                  46                                            0.54
10 hidden neurons, 400 training cycles    DOA4                  64                                            0.37
10 hidden neurons, 400 training cycles    DOA5                  57                                            0.43
15 hidden neurons, 1000 training cycles   DOA4                  71                                            0.30
15 hidden neurons, 1000 training cycles   DOA5                  77                                            0.23
* Root Mean Square Error
Results: After training, the best network was used to classify the original data in order to verify the proposed classifier. The validating set consisted of 2000 units from the class DOA5. The probability of inclusion into the class DOA5 proved to be 76.8%. This confirmed that the trained network can be used to suggest missing values.
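The two reported quantities could be computed along the following lines. The slides do not define "probability of inclusion into the class" precisely, so this sketch assumes it is the mean network output over units known to belong to the class, with RMSE measured against the target value 1. The output values are made up.

```python
import math


def inclusion_probability(outputs):
    """Assumed definition: mean network output for units of the class."""
    return sum(outputs) / len(outputs)


def rmse(outputs, target=1.0):
    """Root mean square error between network outputs and the target."""
    return math.sqrt(sum((o - target) ** 2 for o in outputs) / len(outputs))


# Hypothetical network outputs for four units that truly belong to the class.
outputs = [0.9, 0.8, 0.6, 0.7]

probability_pct = 100 * inclusion_probability(outputs)
error = rmse(outputs)
```

Under this reading, a higher inclusion probability goes together with a lower RMSE, which matches the direction of the figures in the table.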
Neural networks - To summarize
• a properly designed neural network enables classification of large
datasets on the basis of similarity and can solve the problem of
missing values;
Neural networks proved to be useful as an alternative approach to imputing missing values in large statistical databases. However, the first experimental
results on the Intrastat database indicate that this approach needs further improvement and testing, with special focus on the searching algorithm, to
increase the classification rate.
Further research
Any introduction of new methods for missing-value imputation at NSIs needs further research into variance estimation of the proposed values.
In the first step, neural networks will estimate the missing values. In the next step, fuzzy rules will evaluate the estimated values. If a significant difference appears, the neural networks will be re-trained for better estimation.
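The proposed integration can be pictured as a feedback loop. The sketch below is purely illustrative: `train`, `impute` and `truth_gap` are hypothetical callables supplied by the caller, the tolerance is arbitrary, and doubling the number of training cycles is just one possible way to re-train.

```python
def impute_with_feedback(train, impute, truth_gap,
                         tolerance=0.15, start_cycles=300, max_rounds=5):
    """Estimate, evaluate with fuzzy rules, and re-train while the gap is large.

    train(cycles) -> model, impute(model) -> estimates,
    truth_gap(estimates) -> difference between the truth values of the same
    fuzzy rule on collected and on estimated data.
    """
    cycles = start_cycles
    estimates = None
    for round_no in range(1, max_rounds + 1):
        model = train(cycles)            # step 1: neural network estimates
        estimates = impute(model)
        if truth_gap(estimates) <= tolerance:  # step 2: fuzzy evaluation
            return estimates, round_no
        cycles *= 2                      # assumed re-training strategy
    return estimates, max_rounds


# Stub callables standing in for the real components.
estimates, rounds = impute_with_feedback(
    train=lambda cycles: cycles,
    impute=lambda model: model,
    truth_gap=lambda est: 60 / est,
)
```

With these stubs the gap is 0.2 after the first round and 0.1 after the second, so the loop stops after two rounds.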
Conclusion
Approaches for evaluation of non-responses by soft computing
could improve the quality of collected data and therefore of the
data released by NSIs.
Additional benefits could be obtained from the integration of these
two approaches.
Without significant modifications, fuzzy logic could also be applied
in other stages of data production (e.g. dissemination).
Thank you for your attention.