
Social Statistics Integrated Information Architecture and metadata driven services

Antti Santaharju & Toni Räikkönen

COE on S-DWH workshop, Warsaw 22.11.2018

Contents

• Social Statistics Integrated Information Architecture (Stiina)

• Metadata driven software architecture

• Using VTL in Editing and imputation service

• Software demonstrations

• Architectural aspects of Editing and imputation service

• Experiences so far

29.11.2018 Statistics Finland

STIINA: Social Statistics Integrated Information Architecture

• The STIINA program aims to modernize the production of social statistics
• The aim is to build an integrated production system for about 70 official statistics, including both register-based and survey statistics
• Timeframe: 2017–2020
• Work is based on international statistical models and standards:
  • GSBPM
  • GSIM
  • CSPA
  • VTL

STIINA goals

• Emphasize information content instead of organizational silos
• Integrate new data sources into statistical production (daily data deliveries, data from open APIs, web scraping…)
• Build faster, more automated and more transparent production processes (metadata, process control system)
• Find ways to integrate data from different types of statistics
• Ensure the flexibility of statistical production when data and information needs are constantly changing

[Diagram: data repositories by phase of the statistical production process]

• Data collection: Data Repositories for Data Collection
• Data processing: Population and Social Data Repository, Organization Data Repository, Built Environment Data Repository, Commodity Data Repository, Emissions and Energy Data Repository, Geospatial Data Repository
• Analysis and reporting: Micro Data Repository, Macro Data Repository
• Dissemination: Data Repositories for Dissemination
• Metadata Repository

Population and Social Data Repository

[Diagram: contents of the repository and the data linked to it]

• Person
• Education
• Labor market
• Health
• Person Relationships
• Housing
• Justice
• Income information
• Living conditions
• Built Environment
• Organisation
• Geospatial Data
• Metadata

Social Statistics Integrated Information Architecture (STIINA)

[Diagram: data sources, repositories by production phase, and products]

• Data sources: Direct Data Collection, Administrative Data, New Data resources
• Phases: Data collection, Data Processing, Analysis and Reporting, Dissemination
• Data Repositories for Data Collection
• Population and Social Data Repository: Person Data, Housing, Labor Market, Income and Consumption, Education, Living conditions, Health, Justice and Elections
• Built Environment Data, Organization Data, Geospatial Data
• Metadata Repository
• Micro Data and Macro Data Repositories, Data Repositories for Dissemination
• Data Warehouses
• Products: Statistical Releases, Statistical Databases, Research Data, Other Products and Services
• Services for statistical processes, Information services

STIINA projects 2017–2020

[Timeline chart covering 2017, 2018, 2019 and 2020. The legend distinguishes service-oriented projects from data-oriented projects. Projects shown:]

• Population I
• Population II and Housing
• Housing
• Labor market and Income Information I, II and III
• Education I and II
• Justice
• Living conditions
• Health
• Elections
• Editing and imputing, Editing and imputing II
• Data warehouse and data confidentiality
• Data collection processes, Data collection processes II
• Metadata I
• Stiina Services 4: Metadata II
• Statistical methods
• Dissemination, Dissemination II
• Geospatial data and services

Services as a part of statistical production process

[Diagram: the statistical production process uses services from a GSBPM-services toolbox, with metadata underneath. Several services are shown for each of these sub-processes:]

• GSBPM 5.3 Review & validate
• GSBPM 5.4 Edit & impute
• GSBPM 5.5 Derive new variables

Software architecture in Statistics Finland

[Diagram: standards and architectural principles]

• Standards: GSBPM, GSIM, CSPA, VTL
• Microservices
• Metadata driven
• Re-use
• Technology neutrality
• Cloud native (whenever possible)
• Unified methods
• Unified metadata definitions

From data acquisition to data analysis and dissemination – Case STIINA

[Diagram: continuous ETL through the stages below, with metadata spanning all of them]

• Automated Data Acquisition Process
• Raw data warehouse
• Operational data warehouse
• A&R and dissemination data warehouse
• Data marts

The architectural style of the operational environment

Application Layer
• Clients

Data Virtualization Layer
• Isolates the data storage
• Offers services to clients
• GSIM based interfaces

Data Storage Layer
• Located in the on-premises environment or in the cloud
• Accessible only through the virtualization layer
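The layering above can be illustrated with a minimal sketch. All names here (`DataStore`, `InMemoryStore`, `VirtualizationLayer`, `get_dataset`) are hypothetical and chosen for illustration only; they are not Statistics Finland's actual interfaces. The point is that clients only ever hold a reference to the virtualization layer, so the storage backend can sit on premises or in the cloud without the clients noticing.

```python
from typing import Protocol


class DataStore(Protocol):
    """Hypothetical storage backend interface (on-premises or cloud)."""

    def read(self, dataset_urn: str) -> list[dict]: ...


class InMemoryStore:
    """Stand-in backend used only for this illustration."""

    def __init__(self, datasets: dict[str, list[dict]]) -> None:
        self._datasets = datasets

    def read(self, dataset_urn: str) -> list[dict]:
        return self._datasets[dataset_urn]


class VirtualizationLayer:
    """The only entry point clients see; the backend can be swapped freely."""

    def __init__(self, store: DataStore) -> None:
        self._store = store

    def get_dataset(self, dataset_urn: str) -> list[dict]:
        # Return copies so that clients cannot mutate the stored rows.
        return [dict(row) for row in self._store.read(dataset_urn)]


store = InMemoryStore({"urn:x-stat:meta:dataset:y": [{"Var1": 1, "Var2": 2}]})
layer = VirtualizationLayer(store)
rows = layer.get_dataset("urn:x-stat:meta:dataset:y")
```

Replacing `InMemoryStore` with another class that offers the same `read` method would require no change on the client side.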

GSIM modeled metadata architecture

[Diagram: metadata components and the relationships between them]

• Concepts: definitions of the concepts
• Classifications: defines a value domain for a classification
• Variables: definitions of represented variables and variables
• Data Structures: definitions of data structures
• Rules: different kinds of rules (value domains, data types, formation rules, VTL statements)
• Identifiers: administers unique identifiers of objects (URNs, HTTP URIs, DOIs etc.)
• Population: definitions of populations
• Process Output, Process Execution, Process Metrics

Relationship labels in the diagram:
• Links represented variables to a data structure
• Defines the rules for represented variables, e.g. value domains
• Defines the rules for instance variables, e.g. data types, precisions
• Defines a variable
• Defines the formation rule of a population
• Defines a structure of a population

Metadata driven APIs

[Diagram: Data In and Data Out pass through interfaces in front of the Data Marts and the Op. Data Vault. Both the incoming and the outgoing data carry a Data Structure definition of this form:]

ID: URN:x-stat:meta:dataset:y
Var1
Var2
Varx

In order to use the APIs, the corresponding metadata definition must be included with the service call.
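A minimal sketch of what "metadata included with the service call" could look like in code. The `DataStructure` class and the `call_service` function are hypothetical names invented for this example; only the URN and the variable names come from the slide. The service refuses data that does not match the declared structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataStructure:
    """Hypothetical metadata definition sent along with a service call."""

    urn: str
    variables: tuple[str, ...]


def call_service(structure: DataStructure, rows: list[dict]) -> list[dict]:
    """Accept the rows only if every row matches the declared structure."""
    expected = set(structure.variables)
    for index, row in enumerate(rows):
        if set(row) != expected:
            raise ValueError(
                f"row {index} does not match {structure.urn}: "
                f"expected {sorted(expected)}, got {sorted(row)}"
            )
    # A real service would process the data here; the sketch echoes it back.
    return rows


structure = DataStructure("URN:x-stat:meta:dataset:y", ("Var1", "Var2", "Varx"))
accepted = call_service(structure, [{"Var1": 1, "Var2": 2, "Varx": 3}])
```

A call whose rows lack `Varx`, for example, raises `ValueError` instead of being processed.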

Editing and imputation service

• Edit Specification and Analysis
• Edit Summary Statistics Tables
• Error Localization
• Deterministic Imputation
• Donor Imputation
• Imputation Estimators
• Prorating
• Mass Imputation
• Outlier Detection
• Amendment
• Review
• Selection

Source: Generic Statistical Data Editing Models (UNECE)

Editing and imputation service methods

• Current methods
  • If-then rule method (VTL)
  • Banff Outlier Detection
  • Banff Imputation estimators
• Planned Banff methods
  • Donor imputation
  • Error localisation
  • Deterministic imputation
  • Prorating
  • Edit summary statistics

Metadata driven editing service

[Diagram: the editing service with its inputs and outputs]

• Inputs: data, data description, editing rules and parameters, process management
• Outputs: edited data, editing history, frequency reports, impact reports

Example method: Banff outlier

• Input:
  • data to be edited (matrix form)
  • parameters
• Output:
  • status data with flagged cells (name-value form)
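The two data shapes named above can be sketched as follows. The detection rule used here (distance from the column median measured in median absolute deviations) is a stand-in chosen for brevity and is not the method implemented in Banff. The function name, the `multiplier` parameter, the column names and the `OUTLIER` status code are likewise assumptions made for this example.

```python
from statistics import median


def flag_outliers(rows, key, variables, multiplier=3.0):
    """Return name-value status records for cells far from the column median.

    Stand-in rule for illustration only (median absolute deviation),
    not the method implemented in Banff.
    """
    status = []
    for variable in variables:
        values = [row[variable] for row in rows]
        center = median(values)
        spread = median(abs(value - center) for value in values)
        for row in rows:
            if spread > 0 and abs(row[variable] - center) > multiplier * spread:
                status.append(
                    {"id": row[key], "variable": variable, "status": "OUTLIER"}
                )
    return status


# Matrix form: one row per unit, one column per variable.
data = [
    {"id": "u1", "income": 100, "rent": 10},
    {"id": "u2", "income": 110, "rent": 11},
    {"id": "u3", "income": 105, "rent": 12},
    {"id": "u4", "income": 900, "rent": 11},
]

# Name-value form: one record per flagged cell.
flags = flag_outliers(data, key="id", variables=["income", "rent"])
```

With this data only the `income` cell of unit `u4` is flagged. The name-value output stays small however wide the input is, because it holds one record per flagged cell.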


[Software screenshot with annotations: VTL input; SAS-code preview; VTL functions, operators etc.; variable list]

Architectural aspects of Editing and Imputation Service

Martin Fowler

E&I Service – the architectural style

[Diagram components]

• API
• Process Engine
• Staging
• Metadata services
• Metadata editor
• Statistical Libraries: SAS BANFF, Python Pandas, R VIM, Library X, Library Y

E&I Service – process flow

[Diagram components: E&I Service, Process Engine, Method 1 … Method n, Staging, BI Web Services, Data In, Data Out]

1) Invoke the API call with data
2) Load the data to the staging area
3) Signal the start event
4) Invoke method calls
5) Invoke SAS BI WS
6) Signal the end event
7) Load the results from staging
8) Return the results
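The eight steps can be traced in a small in-process sketch. Everything here is an assumption made for illustration: `EditingService`, `ProcessEngine`, the dictionary used as a staging area and the `fill_missing_c` method are hypothetical, and the SAS BI Web Service call of step 5 is replaced by a plain Python function call.

```python
class ProcessEngine:
    """Hypothetical engine that runs the configured methods in order."""

    def __init__(self, methods, log):
        self.methods = methods
        self.log = log

    def run(self, staging, job_id):
        self.log.append("3 start event")
        for method in self.methods:
            self.log.append(f"4 invoke {method.__name__}")
            # 5) On the slide this is a SAS BI Web Service call; here it is
            #    a local function that works on the staged data.
            staging[job_id] = method(staging[job_id])
            self.log.append("5 method executed")
        self.log.append("6 end event")


class EditingService:
    """Hypothetical service facade with a dictionary as staging area."""

    def __init__(self, engine, log):
        self.engine = engine
        self.staging = {}
        self.log = log

    def call(self, job_id, rows):
        self.log.append("1 api call")  # 1) API call with data
        self.staging[job_id] = [dict(row) for row in rows]  # 2) load to staging
        self.log.append("2 staged")
        self.engine.run(self.staging, job_id)  # 3) to 6)
        results = self.staging.pop(job_id)  # 7) load the results from staging
        self.log.append("7 results loaded")
        self.log.append("8 results returned")
        return results  # 8) return the results


def fill_missing_c(rows):
    """Example method: replace a missing c with 100."""
    return [{**row, "c": 100 if row["c"] is None else row["c"]} for row in rows]


log = []
service = EditingService(ProcessEngine([fill_missing_c], log), log)
out = service.call("job-1", [{"c": None}, {"c": 5}])
```

Keeping the data in staging and passing only a job identifier to the engine mirrors the "data by reference" option that the deck later lists among the IT challenges.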

E&I Service - the role of SAS BI WS

[Diagram labels]

• E&I Service: Store data, Invoke SAS service
• Staging
• BI Web Services: Transform, BANFF (shown several times), Internal data area, Transform
• Invoke E&I Service

If-then-rule method using VTL

Editing meta for the if-then-rule method:

Rules: [
  urn:stat-fi:meta:rule:9912,
  urn:stat-fi:meta:rule:9937,
  ..
  ..
]

Id: urn:stat-fi:meta:rule:9912
Type: VTL
Value: error := if(a > b) then error = 1 else error = 0

Id: urn:stat-fi:meta:rule:9937
Type: VTL
Value: c:= if(isnull(c)) then c = 100 else c

[Diagram labels: E&I Service; SAS BI WS; translator call with -in VTL stmts and -out sasds2; VTL Statements; SAS Code]

VTL Translator
• VTL Parser
• SAS Data Step Code Generator
• SAS DS2 Code Generator
• R Code Generator
• [X] Code Generator
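The parser-plus-generators design of the translator can be sketched in a few lines. This is a toy under stated assumptions, not the actual VTL Translator: it accepts only statements of the single form `target := if condition then value else value`, it passes the condition through unchanged (which works only for simple comparisons that read the same in the target language), and all function names are invented for this example.

```python
import re

# Accepts only: <target> := if <condition> then <value> else <value>
RULE = re.compile(
    r"^\s*(?P<target>\w+)\s*:=\s*if\s+(?P<cond>.+?)\s+then\s+(?P<then>\S+)"
    r"\s+else\s+(?P<else>\S+)\s*$"
)


def parse(statement):
    """Split a supported statement into its four parts."""
    match = RULE.match(statement)
    if match is None:
        raise ValueError(f"unsupported statement: {statement!r}")
    return match.groupdict()


def to_sas_data_step(rule):
    """Emit a SAS data step if-then-else statement."""
    return (
        f"if {rule['cond']} then {rule['target']} = {rule['then']}; "
        f"else {rule['target']} = {rule['else']};"
    )


def to_r(rule):
    """Emit R code that derives the target column of a data frame `data`."""
    return (
        f"data <- within(data, {rule['target']} <- "
        f"ifelse({rule['cond']}, {rule['then']}, {rule['else']}))"
    )


# One parser, several interchangeable code generators.
GENERATORS = {"sas": to_sas_data_step, "r": to_r}


def translate(statement, target_language):
    return GENERATORS[target_language](parse(statement))


sas_code = translate("error := if a > b then 1 else 0", "sas")
r_code = translate("error := if a > b then 1 else 0", "r")
```

Adding a further target language means registering one more function in `GENERATORS`; the parser stays untouched, which is the benefit of separating the two stages.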

“Social” challenges

• The change in perspective
  • From customized solutions to unified methods and tools
• Difficult to please all users
  • “My statistics is so special that I really can’t use that tool”
• Sometimes difficult to recognize who is the product owner

IT challenges

• Microservices increase the complexity quite a lot
  • Orchestration / choreography
  • Data by value / data by reference
• Requires a smooth DevOps process
• Performance with really huge datasets still unknown

“Social” success stories

• Appreciation of the GSIM model has increased greatly
• The users understand better why it is important to define the metadata for the data objects
• Generic tools for other projects to use
• E&I service enables a cumulative, standardized audit trail and reports

IT success stories

• Microservices = a really fast track to generic tools
  • 15–20 services already in use / under construction
• GSIM based metamodel enables metadata driven architecture really nicely
• The capabilities in IT have increased a lot
• Requires a smooth DevOps process

Future

• New microservices under development
  • Derivation of new variables
  • Aggregation
• GSIM based cloud native meta system under development
