
Evaluation and checking non-response data by soft computing approaches - case of business and trade statistics

Miroslav Hudec, Jana Juriová INFOSTAT – Institute of Informatics and Statistics

Brussels, 7 March 2013

Presentation roadmap


1. Introduction

2. Evaluation of how current algorithms estimate missing values, using fuzzy logic

3. New approach for estimation of missing values by neural networks

4. Further research topics

Database

1. Evaluation of how current algorithms estimate missing values, using fuzzy logic

2. New approach for estimation of missing values by neural networks

[Figure: schematic database table. Rows are companies (Id 1, Id 2, …, Id m) and columns are attributes (attr 1, attr 2, …, attr n). Cells hold numeric values, linguistic values (small, not too small, medium, high, very high) or are marked as missing. Data sources: administrative data and surveys. Annotation in the figure: "could cope with this data as well".]

Fuzzy logic - Evaluation …

If estimated values in the Intrastat database have more or less the same properties as data received from respondents, then we could say that the algorithms for data imputation do not need improvement.

Intrastat database tables (SK) contain two parts: data obtained from respondents and data estimated due to missing values. Both parts share the same table structure, and one column indicates whether a row (trade) is collected or estimated.

Miroslav Hudec, INFOSTAT Slovakia

Fuzzy logic - Evaluation

A usual (crisp) rule describing an evaluated property is either fully satisfied or fully rejected. If a rule is rejected, we cannot tell whether the data almost satisfy the rule or are far away from the rule condition.

Fuzzy rules are different.

A fuzzy rule can capture statisticians' knowledge, which is often expressed with ambiguity and uncertainty (linguistic terms and quantifiers), and can be applied directly to databases.

A fuzzy rule has a degree of truth, which is a value from the [0, 1] interval. The truth value indicates how strongly the data meet the rule condition.
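The contrast between a crisp rule and a fuzzy rule can be sketched in a few lines of Python. This is only an illustration: the rule "number of items is small", the limits 10 and 20, and the linear shape of the membership function are assumptions made for the example, not values taken from the presentation.

```python
def crisp_small(items, limit=10):
    """Crisp rule: fully satisfied (1) or fully rejected (0)."""
    return 1.0 if items <= limit else 0.0


def fuzzy_small(items, full=10, zero=20):
    """Fuzzy rule: degree of truth in [0, 1] for 'items is small'.

    Values up to `full` satisfy the rule completely, values from `zero`
    upwards not at all, and values in between decrease linearly.
    Both limits are hypothetical.
    """
    if items <= full:
        return 1.0
    if items >= zero:
        return 0.0
    return (zero - items) / (zero - full)


# A report with 11 items is rejected by the crisp rule just like one with
# 40 items; the fuzzy rule shows that 11 is close to the rule condition.
for n in (8, 11, 15, 40):
    print(n, crisp_small(n), round(fuzzy_small(n), 2))
```

With the crisp rule the values 11 and 40 are indistinguishable, whereas the fuzzy degree of truth separates a near miss from a value far away from the condition.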


Fuzzy logic - Fuzzy rules

Rules are built from quantifiers (most of, about half, few) and linguistic terms (small, medium, high):

most of (about half, few) responded exports have a small (medium, high) number of items (goods) in the report

most of (about half, few) non-responded exports have a small (medium, high) number of items (goods) in the report

If the truth values of both rules gravitate to each other, then both parts of the database have similar properties: the current algorithm works properly.
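The slides do not state how the truth value of a quantified rule is calculated. One common formulation, attributed to Zadeh, averages the membership degrees of all records in the linguistic term and passes that proportion through the membership function of the quantifier. The sketch below follows that formulation; the shapes and parameters of `small` and `most_of`, and the item counts, are invented for illustration.

```python
def small(x, full=10, zero=20):
    """Membership degree of x in the fuzzy set 'small' (hypothetical limits)."""
    if x <= full:
        return 1.0
    if x >= zero:
        return 0.0
    return (zero - x) / (zero - full)


def most_of(p, lower=0.5, upper=0.85):
    """Relative quantifier 'most of' applied to a proportion p in [0, 1].

    The break points 0.5 and 0.85 are assumptions.
    """
    if p <= lower:
        return 0.0
    if p >= upper:
        return 1.0
    return (p - lower) / (upper - lower)


def rule_truth(values, term=small, quantifier=most_of):
    """Truth value of the rule 'QUANTIFIER records have a TERM value'."""
    if not values:
        raise ValueError("at least one record is needed")
    proportion = sum(term(v) for v in values) / len(values)
    return quantifier(proportion)


responded = [3, 5, 8, 9, 12, 14, 30]      # made-up numbers of items
estimated = [6, 11, 16, 18, 19, 25, 40]   # made-up numbers of items
print(rule_truth(responded), rule_truth(estimated))
```

Evaluating the same rule on the responded and the estimated part of the database gives two truth values that can be compared directly, which is the comparison described above.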

Fuzzy logic - Case study

For the case study, anonymised data on Intra-EU (Intrastat) trade were provided by the Statistical Office of the Slovak Republic.

Data of the Intrastat survey for the year 2009 were used, from one detailed Intrastat form: the form for dispatch of goods.

The database contains one attribute which indicates whether a row describes a collected or an estimated trade. This makes the rules easier to evaluate, because the structure of the database is the same for real and estimated values.

Fuzzy logic - Interface

Fuzzy logic - Example 1

Rule: most of non-responded exports have a small number of items in the report.
The truth value of the rule is 0.6773.

Rule: most of responded exports have a small number of items in the report.
The truth value of the rule is 0.9313.

We could conclude that the distribution is quite different for the two cases and that the algorithm should be improved.

If we use this rule in data analysis, we could conclude that most of our exports have a small number of items in their reports.


Fuzzy logic - Example 2

The second kind of rule concerns the distribution of countries of dispatch.

Rules: export by countries has a high (medium, small) number of reports

Countries with high number of reports – surveyed data

Country   High number of reports
AT        1
CZ        1
DE        1
HU        1
PL        1
FR        0.9533
IT        0.777
RO        0.3277
SI        0.1222
NL        0.0449
GB        0.0394
BE        0.0137

Countries with high number of reports – estimated data

Country   High number of reports
AT        1
CZ        1
DE        1
FR        1
GB        1
HU        1
IT        1
PL        1
SI        0.236
ES        0.126
RO        0.0623

Similar distributions – algorithm works properly

The strength of the fuzzy rule is obvious in the cases of FR, IT and SI. The crisp case might lead to the conclusion that the algorithm used for estimating values should be improved.
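The comparison for FR, IT and SI can be reproduced from the membership degrees listed in the two tables. The sketch below makes three assumptions that the slides do not confirm: the first table holds the surveyed data and the second the estimated data (following the order of the captions), a country missing from a table has degree 0, and a crisp rule counts a country as "high" only when the degree is exactly 1.

```python
# Membership degrees in "high number of reports", copied from the tables.
surveyed = {"AT": 1, "CZ": 1, "DE": 1, "HU": 1, "PL": 1, "FR": 0.9533,
            "IT": 0.777, "RO": 0.3277, "SI": 0.1222, "NL": 0.0449,
            "GB": 0.0394, "BE": 0.0137}
estimated = {"AT": 1, "CZ": 1, "DE": 1, "FR": 1, "GB": 1, "HU": 1, "IT": 1,
             "PL": 1, "SI": 0.236, "ES": 0.126, "RO": 0.0623}


def crisp(degree):
    """Assumed crisp reading: 'high' only when the rule is fully satisfied."""
    return 1 if degree >= 1 else 0


def compare(country):
    """Return (fuzzy difference, crisp difference) for one country."""
    s = surveyed.get(country, 0.0)   # assumption: not listed means degree 0
    e = estimated.get(country, 0.0)
    return abs(s - e), abs(crisp(s) - crisp(e))


for c in ("FR", "IT", "SI"):
    fuzzy_diff, crisp_diff = compare(c)
    print(c, round(fuzzy_diff, 4), crisp_diff)
```

Under these assumptions a crisp rule reports a full disagreement wherever one degree is 1 and the other is slightly below it, while the fuzzy degrees show how large the difference actually is.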

Fuzzy logic - Beyond Blue-ETS

Data analysis, rules evaluation

Analysing respondents' behaviour in order to find critical groups of respondents. Revealing dependencies among trade indicators,
e.g. most companies which belong to branch i (according to the NACE classification) have small non-response.

Dissemination on websites

Providing users with a tool capable of answering their imprecise questions. Websites could satisfy more user demands and therefore improve the image of NSIs.

Which proposition is stronger: "about half of municipalities have altitude above sea level around 700 m and small pollution" or "few municipalities have altitude above sea level around 700 m and small pollution"?


Neural networks - Motivation

Jana Juriová, INFOSTAT Slovakia

Neural networks can deal efficiently with huge databases and are frequently used for classification problems in which the borders of classes are not exactly defined.

Statistical institutes, which have been collecting and storing vast amounts of data, can also benefit from the advantages of this technique.

Neural networks - Approach

An attempt to explore the use of the neural network approach in the field of official statistics in order to decrease response burden and improve data analysis.

Imputation of missing values in the Intrastat data system: application of the proposed classification approach using multiple classification items.

Main goal:

To test the ability of neural networks to classify data in cases of incomplete statistical datasets.


Neural networks - Algorithm

• A neural network is a computational model from the category of soft computing methods, based on the abstraction of biological neural systems.

The steps of the proposed neural network algorithm:

1. Dividing the data into training and validating sets.
2. Allocating the training dataset into 2 classes: 1 means that a unit belongs to the class, 0 means that it does not.
3. Creating the neural network.
4. Training the neural network with an optimization algorithm.
5. Classifying the validating dataset into classes by means of the trained neural network.
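The five steps can be walked through with a very small feed-forward network written in plain Python. This is a sketch under stated assumptions, not the authors' implementation: the data are synthetic, the network has one hidden layer with sigmoid units, and the optimization algorithm is stochastic gradient descent on the squared error, since the slides do not name the optimizer. The figures of 10 hidden neurons and 300 training cycles echo one configuration reported later in the presentation.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is reproducible


def sigmoid(z):
    z = max(-60.0, min(60.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def make_unit(label):
    """Invented unit with two numeric features; class 1 lies around (2, 2)."""
    centre = 2.0 if label == 1 else 0.0
    return [random.gauss(centre, 0.5), random.gauss(centre, 0.5)], label


data = [make_unit(i % 2) for i in range(120)]
random.shuffle(data)

# Step 1: divide the data into training and validating sets.
train, valid = data[:90], data[90:]
# Step 2: labels encode class membership (1 = belongs, 0 = does not).

# Step 3: create the network: 2 inputs, H hidden neurons, 1 output.
H = 10
w_hidden = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(H)]
b_hidden = [0.0] * H
w_out = [random.uniform(-1, 1) for _ in range(H)]
b_out = 0.0


def forward(x):
    """Return the hidden activations and the network output for input x."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
              for ws, b in zip(w_hidden, b_hidden)]
    out = sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)
    return hidden, out


# Step 4: train with stochastic gradient descent (assumed optimizer).
RATE, CYCLES = 0.5, 300
for _ in range(CYCLES):
    for x, y in train:
        hidden, out = forward(x)
        delta_out = (out - y) * out * (1.0 - out)
        for j in range(H):
            # Hidden delta uses the output weight before it is updated.
            delta_h = delta_out * w_out[j] * hidden[j] * (1.0 - hidden[j])
            w_out[j] -= RATE * delta_out * hidden[j]
            for i in range(2):
                w_hidden[j][i] -= RATE * delta_h * x[i]
            b_hidden[j] -= RATE * delta_h
        b_out -= RATE * delta_out

# Step 5: classify the validating set with the trained network.
predicted = [1 if forward(x)[1] >= 0.5 else 0 for x, _ in valid]
accuracy = sum(p == y for p, (_, y) in zip(predicted, valid)) / len(valid)
print(f"share of validating units classified correctly: {accuracy:.2f}")
```

The network output lies between 0 and 1, so it can also be read as a degree of inclusion in the class rather than only as a 0/1 decision.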

[Figure: feed-forward neural network]

Neural networks - Imputation in Intrastat database

Intrastat database – data on foreign trade

• anonymised data provided by SO SR
• from detailed declarations for dispatches of goods
• year 2008

The exemption threshold for dispatches of goods is set at 400 000 EUR; after reaching this value, a company has to submit a declaration. After exceeding the simplification threshold of 1 700 000 EUR, the company is obliged to submit a detailed declaration.

Individual business reports contain several items characterising their activity.

In this experiment only the first reports were considered, i.e. revised or corrected reports sent later were not included at all.

The following 8 items were considered useful: time period (month), code of goods (simplified, i.e. three-digit level), invoiced value, region of dispatch, state of destination, delivery terms, nature of transaction and mode of transport.



The main objective is to use classification by means of neural networks for imputation of missing data in the Intrastat data system.

The NN was trained on the complete dataset for classification into classes. After reaching an acceptable degree of accuracy, the network can be used to classify the rest of the database, which has missing values. The NN identifies the most similar class for each statistical unit, and this enables imputation of missing values.

Complete dataset

CN    OB   DD   DP   FS   REGP   STU   DOA
731    1    3    4    1      5     1     4
731    1    3    4    1      5     1     5
621   10    3    4    1      4     1     5
732    2    3    4    1      5     1     4
731    1    3    4    1      5     1     5
621    7    3    4    1      4     1     5
732    1    3    4    1      5     1     4
732    2    3    4    2      5     1     4
731    2    3    4    2      5     1     5
732    2    3    4    2      5     1     4

DOA – Nature of transaction – 2 classes:
• Operations with a view to processing under contract (DOA4)
• Operations following processing under contract (DOA5)



Evaluation of the learning process

Characteristics of the learning process   Type of transaction   Probability of inclusion into the class (%)   RMSE*
10 hidden neurons, 300 training cycles    DOA4                  59                                            0.41
                                          DOA5                  46                                            0.54

                                          DOA4                  64                                            0.37
                                          DOA5                  57                                            0.43

                                          DOA4                  71                                            0.30
                                          DOA5                  77                                            0.23

Results: After the networks had been trained, the best one was used to classify the original data in order to verify the proposed classifier. The validating set consisted of 2000 units from the class DOA5. The probability of inclusion into the class DOA5 proved to be 76.8%. This confirmed that the trained network can be used to suggest missing values.

* Root Mean Square Error
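For reference, the root mean square error is the square root of the mean squared difference between target values and network outputs. The helper below is a generic sketch of that definition; how the authors paired targets and outputs when producing the RMSE column is not stated in the slides, and the example values are invented.

```python
import math


def rmse(targets, outputs):
    """Root mean square error between target values and network outputs."""
    if len(targets) != len(outputs) or not targets:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(t - o) ** 2 for t, o in zip(targets, outputs)]
    return math.sqrt(sum(squared) / len(squared))


# Invented example: four units that all belong to the class (target 1).
print(rmse([1, 1, 1, 1], [0.9, 0.6, 0.8, 0.7]))
```

A lower value means that the outputs are, on average, closer to the targets.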

Neural networks - To summarize

• A properly designed neural network enables classification of large datasets on the basis of similarity and can solve the problem of missing values.

Neural networks proved to be useful as an alternative approach for imputation of missing values in large statistical databases. However, the first experimental results on the Intrastat database indicate that this approach needs further improvement and testing, with a special focus on the searching algorithm, in order to increase the classification rate.


Further research

Any introduction of new methods for missing value imputation at NSIs needs further research on variance estimation of the proposed values.

In the first step, neural networks will estimate missing values. In the next step, fuzzy rules will evaluate the estimated values. If a significant difference appears, the neural networks will be re-trained for better estimation.

Conclusion

Soft computing approaches for the evaluation of non-responses could improve the quality of collected data and therefore of the data released by NSIs.

Additional benefits could be obtained from integrating these two approaches.

Without significant modifications, fuzzy logic could also be applied in other stages of data production (e.g. dissemination).

Thank you for your attention.
