Handbook to set up a S-DWH 1 version 2.1 / 4 September 2017
Overall handbook to set up a S-DWH
CoE: Deliverable: 4.6
Version: 3.1 Date: 3 November 2017
CoE CENTRE of EXCELLENCE ON DATA WAREHOUSING
Content
1. Introduction
2. The Statistical Data Warehouse
3. The main phases for setting up a S-DWH
4. The 3 tracks within the S-DWH process:
1.1 Metadata
1.2 Methodological aspects
1.3 Technological aspects
5. The Road Map for setting up a S-DWH
5.1 Roadmap S-DWH: General overview
5.2 Roadmap S-DWH: Approved Business Case
5.2 Roadmap S-DWH: Design phase
5.3 Roadmap S-DWH: Build phase
5.4 Roadmap S-DWH: Finalize phase
1. Introduction
In October 2010, the ‘ESSnet on micro data linking and data warehousing in statistical production’
was established to provide assistance in the development of more integrated databases and data
production systems for (business) statistics. In October 2013 the ESSnet evolved into a Centre of Excellence (CoE).
In order to improve and optimise statistical production, ESS Member States are searching for ways to
make optimal use of all available data sources, existing and new. In daily statistical practice this
means supporting and assisting statistical institutes to increase the efficiency of data processing in
statistical production systems and to maximize the reuse of already collected data in the statistical
system. Recently, the CoE Member States have started to evaluate the impact of Big Data
infrastructures on statistical data warehouse (S-DWH) systems. The results will be included in the
S-DWH Manual, available as a CoE deliverable on the S-DWH web page of the ESS CROS Portal.
This modernisation has a significant organisational impact. First, there is the need to develop
and implement a completely new way of organising and operating the statistical production processes.
Second, it comes with higher and stricter demands on data and metadata management.
Both activities are often decentralised and implemented in various ways, depending on the needs of
specific statistical systems, whereas maximum re-use of available statistical data
demands the opposite: a centralised and standardised set of (generic) systems with a flexible and
transparent metadata catalogue that gives insight into and easy access to all available statistical data.
To reach these goals, building a S-DWH is considered to be a crucial instrument. The S-DWH
approach enables NSIs to identify the particular phases and data elements in the various statistical
production processes that need to be common and reusable.
The main focus of the ESSnet was on issues that are common to the majority of the NSIs within the ESS
when applying a data warehousing approach for statistics. A thorough enquiry among the ESS
Member States resulted in a set of deliverables, now reorganized and updated in the Manual and
organised around 3 main topics:
1. Metadata
2. Methodological aspects
3. Technological aspects
In the various workshops held to exchange information interactively and receive feedback, Member States
expressed great demand for a practical handbook that helps and guides in the process of developing
and implementing a S-DWH.
This handbook answers the following questions:
What is a Statistical Data Warehouse (S-DWH)?
How does a S-DWH differ from a traditional ('commercial') DWH?
Why should we build a S-DWH?
Who are the envisaged users of a S-DWH?
What is the road map for designing, building and finalizing the S-DWH?
- What are the prerequisites for implementing a S-DWH?
- What are the phases/steps to take?
- How to prepare for an implementation?
The handbook is set up as a lean quick reference guide around the S-DWH roadmap. The goal is to guide
users through the process of setting up and implementing a S-DWH by indicating which deliverables
of the ESSnet (recommendations, guidelines etc.) to use in which phase of the development process.
2. The Statistical Data Warehouse
This chapter gives a short explanation of the most common terminology used to describe the statistical data warehouse. Chapter 5.1 of the Manual, on Fundamental principles, gives a more detailed explanation of the terminology used in the project.
Data Warehouse
The generic definition of a Data Warehouse (DWH) says that it is “a central repository of data which is created by integrating data from one or more disparate sources”1. In the DWH, current and historical data are stored and organised in ways that facilitate combining data, e.g., to perform analyses and to create reports.
According to broader and perhaps more useful definitions the term DWH should not only be understood as a way of storing data, but it must also include all the functions and tools necessary to extract, transform and load data (ETL tools), to maintain the data structure, and to make data available to end users in ways that suit their tools.
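The extract, transform and load functions mentioned above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Warehouse` class, the two toy sources and the field names (`id`, `unit`, `turnover`) do not come from the Manual, and a real S-DWH would rely on dedicated ETL tooling rather than in-memory lists.

```python
from dataclasses import dataclass, field


@dataclass
class Warehouse:
    """Hypothetical in-memory stand-in for the central repository."""
    rows: list = field(default_factory=list)


def extract(sources):
    """Extract: read raw records from each disparate source."""
    for name, records in sources.items():
        for record in records:
            yield name, record


def transform(name, record):
    """Transform: harmonise field names and types into one common structure."""
    return {
        "unit_id": str(record.get("id") or record.get("unit")),
        "turnover": float(record.get("turnover", 0)),
        "source": name,
    }


def load(warehouse, rows):
    """Load: append the harmonised rows to the warehouse."""
    warehouse.rows.extend(rows)


# Two toy sources that name and type the same information differently.
sources = {
    "survey": [{"id": 1, "turnover": "120.5"}],
    "admin": [{"unit": "2", "turnover": 80}],
}
dwh = Warehouse()
load(dwh, [transform(name, rec) for name, rec in extract(sources)])
```

After the run, `dwh.rows` holds both records in the same structure, which is what makes them combinable for analysis and reporting.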
In terms of role and function, a commercial (or traditional) DWH is mostly set up as a supportive system to the primary process of an organisation. Its main goal is to produce and deliver management information that is used to manage and improve the primary process.
Statistical Data Warehouse
This project uses the term Statistical Data Warehouse (S-DWH) to refer to a DWH that is purpose-built specifically to support the production of national and international statistics. Thus the S-DWH is defined as a central store of statistical data, regardless of their sources, for managing all available data of interest, thereby improving the NSI’s ability to:
- use and reuse data in order to create new data or new outputs;
- create reports;
- execute analyses;
- produce any required information.
According to the role and function, a statistical data warehouse is developed as a crucial element in the primary process, which simply is: to produce statistics.
Business Intelligence (BI)
The general DWH is often considered part of a Business Intelligence (BI) system. BI technology can handle large amounts of historical and current data stored in a DWH. Specialised BI tools let the users analyse the information in the DWH and even make predictions in order to make better business decisions.
Many BI tasks, such as decision support, include quick creation and immediate analysis of statistics based on data from the DWH. Supporting creation and analysis of statistics is the main purpose of the S-DWH, but the demands for quality are generally higher, while the analysis may follow immediately on creation or later.
1 Wikipedia: http://en.wikipedia.org/wiki/Data_warehouse
Metadata
The S-DWH contains only statistical data and is dedicated to supporting efficient production of statistics. Data in the S-DWH may be atomic (micro data) or aggregated (macro data). All data must always be defined and described in accompanying metadata.
Since the data warehouse is not only one single data store, but consists of several parts, or layers, metadata must also describe the processes that move the data through the layers from source to presentation and dissemination (process metadata).
Standards
There are several formal and industry standards that should be considered when building a DWH. The architecture should be supported by well-established data modelling standards.
In addition to the standards and rules that support the design of any DWH, the S-DWH should also be designed and built in accordance with the standards that are used in the statistical community. The process model GSBPM, the information model GSIM, the metadata registry standard ISO/IEC 11179 and the classification model (Neuchâtel model) are examples of important and widely accepted standards that should be taken into account when designing a S-DWH.
Why build and use a S-DWH ?
There are several alternative models that can be used to describe and build statistics production systems, e.g., the traditional stovepipe model and several versions of integrated models. The S-DWH model is generally considered as being the most advantageous one compared to the other models. Some arguments that speak in favour of using a S-DWH include:
Easier to reuse data, “collect once, use many times”;
Facilitates cross-domain analysis;
Well suited for process oriented production systems (even though its data model is not specifically designed for that purpose);
Supports standardisation of tools and methods;
Enables efficient governance and maintenance.
3. The main phases for setting up a S-DWH
From a project management view, the process of setting up and implementing a S-DWH does not essentially differ from other major projects that involve organisational changes in combination with new processes and (IT) systems. Basically 5 more or less generic phases can be distinguished:
Business Case
As for all projects, it is an essential precondition to compose a solid business case that is approved by the responsible authority/management/sponsors. The business case must clearly state the intended goals, describe and explain the expected benefits and give a sound cost-benefit analysis. The Introduction and the first chapter of the Manual 2 can be used as a good foundation when writing the business case.
Design
The first phase in the actual development is the design of the S-DWH, with all its elements. This should cover the following aspects:
What type of S-DWH, active or passive?
The architectural framework for the S-DWH.
A clear description of the functions of the S-DWH.
The necessary metadata designs (metadata model, meta system etc.).
Methodological concepts (role of the Business Register (BR) etc.).
All designs must be approved by the responsible managerial body (e.g. steering group or program management).
Build
In the ‘build’ phase the various elements of the S-DWH need to be realised. For the most part these are strongly IT related components: databases, repository, ETL processes etc. Main milestones in this phase are tool selection, translating design to business rules, testing and documentation. As the development of a S-DWH mostly consists of a complex set of systems, it is recommended to work in small incremental steps.
Finalize
The finalization phase means actually putting the S-DWH to work. After defining a sound implementation strategy, most important milestones in this phase are setting up the governance of the (meta)data management, ensuring confidentiality and training users.
Use & Maintain
After finalizing the S-DWH, the phase of operational use starts. The feedback from daily statistical use also requires a steady process of maintaining 2 main aspects of the S-DWH:
1. The content of the S-DWH (metadata and statistical data)
2. The functional and technical systems
The focus of the ESSnet was on the elements of the design, build and finalization phases. Therefore the roadmap of this handbook concentrates on these 3 phases, in connection with the S-DWH Manual, in which the main information is described and explained.
2 https://ec.europa.eu/eurostat/cros/content/general-introduction_en
4. The 3 tracks within the S-DWH process:
The goal of the statistical data warehouse is to enable NSIs to produce flexible outputs, in an efficient
way, with maximum re-use of data that is already available in the statistical system. Therefore the
ESSnet focused on issues that are common to the majority of the NSIs when applying a data
warehousing approach for statistics, resulting in 3 main tracks:
1.1 Metadata
One of the key factors and drivers in a S-DWH is the information about one or more aspects of
the data itself, usually referred to as "metadata".
‘Metadata is the DNA of the data warehouse, defining its elements and how they work
together. [...] Metadata plays such a critical role in the architecture that it makes sense to
describe the architecture as being metadata driven’.
The metadata provides the access to the data and must enable a clear and unambiguous
description of the data and its elements. All data in the S-DWH must have corresponding
metadata: ‘no data without metadata’. Users must be able to search the entire metadata layer
and, if permitted, to access the physical statistical data via the metadata. Thus, metadata plays a
vital role in the S-DWH, satisfying 2 essential needs:
1. to guide statisticians in processing and controlling the statistical production
2. to inform end users by giving them insight into the exact meaning of statistical data
In order to meet these 2 essential functions, the statistical metadata must be:
correct and reliable (the metadata must give a correct picture of the statistical data),
consistent and coherent (the metadata driving the statistical processes and the reporting
metadata presented to the end users must be compatible with each other),
standardised and coordinated (the data of different statistics are described and
documented in the same standardised way).
Finally, since the different users of the (meta)data have diverse needs, it is essential to ensure an
effective management of the statistical metadata in the S-DWH.
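The principle ‘no data without metadata’, together with searchable metadata and permission-based access to the data, can be sketched as follows. This is an illustrative toy under stated assumptions, not a design taken from the Manual: the class names `VariableMetadata` and `MetadataCatalogue`, their fields and the boolean `permitted` flag are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariableMetadata:
    """Hypothetical minimal description of one statistical variable."""
    name: str
    definition: str
    unit: str


class MetadataCatalogue:
    """Toy catalogue that refuses data lacking a metadata entry."""

    def __init__(self):
        self._metadata = {}
        self._data = {}

    def register(self, meta):
        self._metadata[meta.name] = meta

    def store(self, name, values):
        # 'No data without metadata': reject undescribed variables.
        if name not in self._metadata:
            raise KeyError(f"no metadata registered for {name!r}")
        self._data[name] = list(values)

    def search(self, term):
        # Every user may search the whole metadata layer.
        term = term.lower()
        return [m for m in self._metadata.values()
                if term in m.name.lower() or term in m.definition.lower()]

    def data_for(self, name, permitted):
        # Physical data is reached via the metadata, and only if permitted.
        if not permitted:
            raise PermissionError(f"access to {name!r} not permitted")
        return self._data[name]


catalogue = MetadataCatalogue()
catalogue.register(VariableMetadata("turnover", "Net annual turnover", "EUR"))
catalogue.store("turnover", [120.5, 80.0])
```

Storing a variable that has not been registered raises `KeyError`, so undescribed data never enters the store; a real metadata system would replace the boolean flag with proper authorisation.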
In the metadata track, the first focus was on the identification of the various kinds of essential
metadata and recommendations and guidelines on their use. Further focus was on the use of
metadata models, the required functions of a metadata system and the governance of metadata
in the S-DWH.
In this context, the Manual addresses the following items:
Framework of metadata requirements and roles in the S-DWH gives definitions and
background information on the roles and purposes of metadata in the S-DWH in generic
terms. It is intended to provide a common language.
Recommendations on the impact of (meta)data quality in the S-DWH. This item is about
monitoring the quality of (meta)data in a S-DWH. For data exchange, it is more or less
common to use indicators to measure data quality. The advice is to also define a set of
indicators for metadata quality, following and using the data quality systems.
Overview of and recommendations on the use of metadata models gives an overview of
metadata models and recommendations on their use. The use of a metadata model is a key
element in structuring and standardising the statistical metadata within a NSI in a generic
way. In the context of the S-DWH, a metadata model is a standardized representation used
to define all necessary metadata elements of statistical information systems.
Definition of the functionalities of a metadata system to facilitate and support the operation
of the S-DWH. This item gives a detailed description of the functionalities that are necessary
to facilitate and support the operation of the S-DWH. In order to meet the diverse needs of
different users of the (meta)data, the statistical metadata must be managed and maintained
in a metadata system that covers these functionalities.
Recommendations and guidelines on governance of metadata management in the S-DWH
explains the importance of reliable governance of metadata management in a statistical
organisation when operating a S-DWH. It focuses on the main issues to consider when
establishing, running and maintaining metadata management in a S-DWH. Implementing
good governance for metadata management is highly important for a S-DWH.
The detailed metadata system functionalities are mapped on the layered S-DWH architecture
and the GSBPM workflow.
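The advice above to define indicators for metadata quality can be made concrete with a small sketch. The indicator shown (completeness, the share of variables whose required metadata fields are all filled in) and the list of required fields are assumptions chosen for illustration; the text does not prescribe a particular set of indicators.

```python
# Hypothetical set of fields that every variable description must fill in.
REQUIRED_FIELDS = ("definition", "unit", "source")


def completeness(metadata):
    """Share of variables whose required metadata fields are all non-empty."""
    if not metadata:
        return 0.0
    complete = sum(
        1 for entry in metadata.values()
        if all(entry.get(f) for f in REQUIRED_FIELDS)
    )
    return complete / len(metadata)


# Toy metadata: 'employees' lacks a unit, so only one of two entries is complete.
metadata = {
    "turnover": {"definition": "Net annual turnover", "unit": "EUR", "source": "VAT"},
    "employees": {"definition": "Persons employed", "unit": "", "source": "survey"},
}
score = completeness(metadata)
```

An indicator of this kind can be tracked over time in the same way as data quality indicators.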
1.2 Methodological aspects
A key challenge in the process of designing and implementing a Statistical Data Warehouse is to
match the various statistical requirements that are set by the statistical users of the S-DWH.
The indicated methodological challenges that need to be covered and ensured are about:
Impacts on statistical methods
Which are the methodological advantages and drawbacks ?
Which considerations as to statistical methods are needed ?
How to handle confidentiality issues ?
How to deal with data linking ?
This work package also provided input to actions/deliverables of the other 2 tracks, by reviewing
deliverables and advising from the methodological perspective.
Items addressed in this context are:
Guidelines (including options) on how the BR interacts with the S-DWH. This item is an
essential part of the S-DWH: the role and position of the statistical business register. The
Business Register holds a central role in the S-DWH in order to link different units from
different data sources and to act as a population frame.
Guidelines/recommendations for application within the S-DWH of the data linking
aspects.
This item is addressed in the Manual and gives an overview of data linking aspects in a
S-DWH. It provides information about data linking methods, about useful links, and it
mentions possible problems that can occur when linking data from multiple sources.
Finally it presents guidelines about the methodological challenges of data linking.
Guidelines/recommendations for application in the S-DWH of the confidentiality aspects.
This outlines the options for understanding and dealing with the confidentiality aspects
of combining and re-using data from a Statistical Data Warehouse that comes with an
increased risk for compromising the confidentiality of the data.
Guidelines on editing for the S-DWH. This examines options for efficient editing in a
Statistical Data Warehouse, specifically exploring how selective editing may be used in
this context. Focus is on two widely available selective editing tools, to consider if they
could be used for efficient editing in a S-DWH.
Guidelines on detecting and treating outliers for the S-DWH. This explains the distinction
between outliers and errors and the three possible types of outliers in a S-DWH, and gives
recommendations on how to deal with them.
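The role of the Business Register described in the items above, linking units from different data sources and acting as a population frame, can be sketched as follows. The register layout, the identifiers (`BR001`, `VAT-11`, `S-1`) and the two toy sources are hypothetical, and exact key matching is used only for simplicity; the Manual discusses other linking methods.

```python
# Hypothetical register: one entry per statistical unit, with the keys
# under which that unit is known in each source.
business_register = {
    "BR001": {"name": "Alpha Ltd", "vat_id": "VAT-11", "survey_id": "S-1"},
    "BR002": {"name": "Beta BV", "vat_id": "VAT-22", "survey_id": None},
    "BR003": {"name": "Gamma SA", "vat_id": None, "survey_id": "S-3"},
}
vat_data = {
    "VAT-11": {"turnover": 500.0},
    "VAT-22": {"turnover": 320.0},
    "VAT-99": {"turnover": 75.0},  # not known to the register
}
survey_data = {"S-1": {"employees": 12}, "S-3": {"employees": 4}}


def link_via_register(register, vat, survey):
    """Build one record per register unit; missing links stay None."""
    linked = {}
    for br_id, unit in register.items():
        linked[br_id] = {
            "name": unit["name"],
            "turnover": vat.get(unit["vat_id"], {}).get("turnover"),
            "employees": survey.get(unit["survey_id"], {}).get("employees"),
        }
    return linked


def unmatched(register, source, key):
    """Source records that cannot be tied to any unit in the frame."""
    known = {unit[key] for unit in register.values() if unit[key]}
    return sorted(set(source) - known)


linked = link_via_register(business_register, vat_data, survey_data)
orphans = unmatched(business_register, vat_data, "vat_id")
```

Because the register defines the population, units without a source record remain visible with missing values, and source records without a register unit are reported separately instead of being silently added.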
1.3 Technological aspects
This track covers all essential architectural and technical elements for designing and building the
statistical data warehouse and provides a generic model of the statistical data warehouse:
Management processes to govern S-DWH operations
The S-DWH needs fourteen over-arching statistical processes to support the statistics
production processes. Nine of them are those found in the GSBPM, while the remaining five are a
consequence of a fully active S-DWH approach; these five are:
1. S-DWH Management
2. Data Capturing Management
3. Output Management
4. Web Communication Management
This includes for example management of a thematic web portal.
5. (Business) Register Management (or for institutions or civil registers)
Models & Tools
There is a great variety of models and tools that can be used to support the creation of a S-DWH:
Generic Statistical Business Process Model (GSBPM)
In order to treat and manage all stages of a generic production process it is useful to identify and
locate the different phases of a generic statistics production process by using the Generic
Statistical Business Process Model (GSBPM).
Generic Statistical Information Model (GSIM)
Another model used for describing statistical processes is the Generic Statistical Information
Model (GSIM), a reference framework providing a set of standardized, consistently described
information objects, which are the inputs and outputs in the design and production of statistics.
GSIM is intended to support a common representation of information concepts at a “conceptual”
level.
CORE
There are many software models and approaches available to build modular flows between
layers. One of the approaches is CORE (Common Reference Environment), which is an
environment supporting the definition of statistical processes and their automated execution.
CORE services can be used to move data between S-DWH layers and also inside the layers
between different sub-tasks.
The Integrated Warehouse model
The Integrated Warehouse model combines technical and process integration with the
warehouse approach into one model. To have an integrated warehouse centric statistical
production system, different statistical domains should use a common methodology, share
common tools and have a distributed architecture. Decisions in the design phase, like
questionnaire design, sample selection, imputation method, etc., are made “globally”.
This way, integration of processes provides reusable data in the warehouse. The warehouse
contains each variable only once, making it easier to reuse and manage valuable data.
There is also a big variety of software tools used for statistics production. Which tool to choose
mainly depends on the NSI’s possibilities to adopt a particular technology, what tools are already
used, which skills and experiences are available, as well as other considerations and available
resources. In the interpretation and source layers standard tools can be used out-of-the-box,
even though they are not generally very customizable to adapt to statistical processes. In the
Integration layer, where all operational activities needed for the statistical elaboration processes
are carried out, mainly in-house developed software is used. This is because the needs are very
specific and cannot be covered by standard applications. In these cases sharing of experience
between NSIs is very desirable as it avoids unwanted duplication of work and allows using the
experiences already acquired.
The S-DWH business architecture.
A corporate S-DWH specialized in supporting production must support multiple-purpose
statistical information. Different statistical information on different topics should not be
produced independently from each other but as integrated parts of a comprehensive information
system where statistical concepts, micro data, macro data and metadata are shared.
The S-DWH data model must sustain the ability of realizing data integration at micro and macro
data granularity levels. The model, instead of focusing on a process-oriented design, should focus
on the data inter-relationships that are fundamental for different processes of different statistical
domains.
We identify four functional layers defined as:
IV° - access layer, for the access to the data: selected operational views, final presentation,
dissemination and delivery of the information sought;
III° - interpretation and data analysis layer, enables data analysis or data mining in
support of statistical design;
II° - integration layer, is where all operational activities needed for any statistical production
process are carried out; in this layer data are transformed;
I° - source layer, is the level in which we locate all the activities related to storing and
managing data sources, and where the reconciliation (the mapping) of statistical
definitions from external sources to the internal DWH dictionary is carried out.
The layers can be viewed as grouped in two sub-groups: the first two layers for statistical
operational activities, i.e. where the data are acquired, stored, coded, checked, imputed, edited
and validated; the last two layers are for the effective data warehouse, i.e. levels in which data
are organized for analysis, evaluation, design and for data visualization.
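The four layers and their grouping into two sub-groups can be written down as a small sketch. The enum and helper names are illustrative assumptions; only the layer numbering and the grouping are taken from the description above.

```python
from enum import IntEnum


class Layer(IntEnum):
    """The four functional layers, numbered as in the text (I to IV)."""
    SOURCE = 1
    INTEGRATION = 2
    INTERPRETATION = 3
    ACCESS = 4


def is_operational(layer):
    """First two layers: statistical operational activities."""
    return layer <= Layer.INTEGRATION


def is_data_warehouse(layer):
    """Last two layers: the effective data warehouse."""
    return layer >= Layer.INTERPRETATION


def path(start, end):
    """Layers a dataset passes through when moving from start up to end."""
    return [Layer(i) for i in range(start, end + 1)]


operational = [layer.name for layer in Layer if is_operational(layer)]
warehouse = [layer.name for layer in Layer if is_data_warehouse(layer)]
full_path = [layer.name for layer in path(Layer.SOURCE, Layer.ACCESS)]
```

Such an explicit ordering is also a natural key for process metadata that records which step moved a dataset from one layer to the next.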
Easy and flexible access to the data is a basic requirement for any production based on a large,
changeable, amount of data. The S-DWH architecture could support a conceptual organization in
which we consider the first two levels as pure statistical operational infrastructures, while the
core repository of the S-DWH system is the interpretation and analysis layer, which is the
effective data warehouse, and the final access layer allows the use of specialized statistical tools.
Layers II and III support each other reciprocally. Layer II supports the uploading from raw
data or from any base-phase elaboration output of a production process. Layer III is optimized for
an integrated and effective activity on micro/macro data at any stage of the elaboration process.
This is because, in layer III, methodologists may organize and retrieve the data for analysis or for
creating the input of each base-phase elaboration.
This means that layer II supplies elaborated data for analytical activities, while layer III supplies
concepts usable for the engineering of ETL functions or new production processes, through a
continuous cyclical interaction.
Through the interpretation layer methodologists, or data experts, can easily access all data,
before, during and after the elaboration of a production line to re-design or correct a process.
[Figure: layered S-DWH architecture and operational GSBPM-phases interaction. Labels in the diagram: Sources layer, Integration layer, Interpretation and analysis layer, Access layer; Operational data, Data warehouse, Statistical data warehouse; GSBPM phases Build, Collect, Process, Analyze, Disseminate.]
5. The Road Map for setting up a S-DWH
After illustrating and explaining the 5 phases and the 3 tracks for setting up a S-DWH, in this chapter a roadmap is given, explaining which general steps to take and what chapters of the Manual on S-DWH to use in which step(s). The (approved) ‘business case’ is seen as a required precondition for even starting the actual process whereas the ‘use and maintain’ phase is the actual operational phase.
For this purpose we use a graphical representation comparable to an underground map.
The first map gives a general overview from start to end. The S-DWH development process is represented by 1 single line with the most essential ‘stops’:
1. The approved business case, the official ‘GO’ to start the S-DWH project;
2. The approved designs of the various components of the S-DWH (business architecture, meta model, etc.);
3. A set of tested and approved systems, representing the working S-DWH (but not yet implemented);
4. The operational S-DWH, in use to produce statistics.
These phases are then worked out in detailed maps that show the essential milestones/steps, each represented as a ‘station’ or ‘stop’. The white stops are specific to the S-DWH development process. The grey stops are generic stops, like ‘testing’, ‘training users’ etc.
Each specific S-DWH stop is linked to the part of the Manual to be used at that stage of the S-DWH development process.
In these detailed sub maps the 3 tracks are represented by coloured lines:
the green line represents WP1 – Metadata
the blue line represents WP2 – Methodology
the red line represents WP3 – Technological aspects
5.1 Roadmap S-DWH: General overview
[Figure: general overview roadmap of the S-DWH development process, drawn as an underground map. Stops shown: Start Project, Approved Business Case, Business Requirements, Establish Project Target, Define Project Strategy, Cost-benefit Analysis, Information Architecture, Business Architecture, Approved Design, Metadata Model, Data Linking, Building Blocks, Data Cleaning, Estimation, Metadata System, Working S-DWH, Technology Architecture, Workflow System, Test, Operational S-DWH, Revisions, Metadata Governance, Confidentiality, Analysts, Training.]
5.2 Roadmap S-DWH: Approved Business Case
5.2 Roadmap S-DWH: Design phase
5.3 Roadmap S-DWH: Build phase
5.4 Roadmap S-DWH: Finalize phase