ESSnet

CORE COmmon Reference Environment

Date of dissemination Version Page

February 2012 1.0 1

Partner’s name: Istat

WP number and name: WP3 - Generic interface design for interconnecting GSBPM sub-processes

Deliverable number and name: 3.2 Technical Environment Specification

Technical Environment Specification

Partner in charge Istat

Version 1.0

Date February 2012

Version Changes Changed by Date

0.1 First draft ISTAT 26/07/2011

0.2 Second draft ISTAT 29/02/2012

This document is distributed under Creative Commons licence

"Attribution-Share Alike - 3.0 ", available at the Internet site:

http://creativecommons.org/licenses/by-sa/3.0

Summary

This document presents the design of the CORE environment, by detailing its components.

Keywords: CORA, CORE, data models, process engine, GUI, services, implementation scenario

Contents

1 Introduction

2 Logical Architecture

3 Data Models

3.1 CORE Data Model

3.2 CORE Domain Descriptor

3.3 CORE Mapping

4 Design of Integration APIs

4.1 CORE Service Definition

4.2 Design of CORE Integration APIs

4.2.1 Example of Implementation: CSV Transformation

5 GUI Design

5.1 GUI requirements for CORE

5.1.1 Support to process definition

5.1.2 Support to domain descriptor definition

5.1.3 Support to CORE mapping

5.1.4 Process execution component

5.2 ORYX: a possible technical solution

5.2.1 Stencil sets

6 Design of the Repository

7 Architecture Deployment

8 CORE Process Scenario

8.1 Scenario description

8.1.1 SAMPLE ALLOCATION

8.1.2 SAMPLE SELECTION

8.1.3 ESTIMATION

8.1.4 STORING AND CONVERSION TO SDMX

9 CORE Demo Scenario

10 Appendix 1: CORE Data Model

11 Appendix 2: CORE Domain Descriptor

12 Appendix 3: CORE Mapping

Summary

The purpose of this document is to describe the CORE architecture, in terms of its logical components. Specifically, we will provide a detailed description of each component and of their interactions.

Keywords: CORE, CORA, IT architecture, information model, GSBPM

1 Introduction

The principal aim of the CORE project is the detailed design and prototype implementation of an architecture supporting the execution of statistical business processes. Such processes are defined in terms of services calling each other, according to principles of modern information systems design. Hence, a first important notion of CORE is the definition of statistical processes in terms of identified services. This definition step is supposed to be performed by statisticians and indeed the services to be selected are statistical services abstracting from the related specific IT implementations.

This is a relevant standardization step for NSIs. Indeed, the statistical user of CORE is forced to think of the process in terms of available statistical services. A further significant standardization effort concerns data exchanges. Indeed, CORE aims at standardizing the data flow underlying a statistical process in terms of both (i) a common information model and (ii) a unique technological transport format.

The CORE information model has been released by the CORE ESSnet as a dedicated deliverable (2.2 Generic statistical information model). From a technological perspective, CORE data are represented by means of standard XML technologies.

Technology independence is a basic principle for CORE, motivated by the consideration that each NSI has its own technologies and that no approach imposing a single set of technologies would have been successful. Instead, CORE aims to “wrap” existing NSI services, whatever technology they are written in, and therefore standardizes at the design level (e.g. exchanged data, clear communication interfaces) rather than at the implementation level (i.e. by imposing a specific technology).

This document provides a first set of technical specifications for CORE that derive from the principal ideas summarized above. The level of detail is intended as input for a lower-level design, in which specific technological choices regarding the platform implementation are made. However, as most of the provided specifications have been implemented as demonstration prototypes, some example implementation cases are also illustrated.

2 Logical Architecture

The main components of the CORE logical architecture are represented in Figure 1.

Figure 1: Logical CORE Architecture

The Graphical User Interface (GUI) component serves the purpose of providing a set of GUIs for:

• Process specification, according to a defined process modelling language. The process specification must include the services that realize the process and the definition of the execution flow among such services.

• Definition of the data to be exchanged among services. As the following sections will clarify, data exchanged by CORE-compliant services are XML data and must conform to a set of defined schemata. Some GUIs will support the statistical user in defining the information needed for the automated generation of XML files that conform to the defined schemata.

The Integration APIs have the purpose of converting tool-specific data from/to CORE data.

The Process Engine is in charge of executing the process specified in terms of services. Work package 4 will provide indications concerning the choice of a process engine that satisfies the requirements identified for the CORE environment.

[Figure 1 labels: GUI; Definition Repository; Integration APIs; Process Engine; Runtime; Services]

The Definition Repository stores:

• Process schemata, which record the choices made in the process specification phase.

• Service specifications, stating where the services to be called are located, where the input/output of such services should be taken from, etc.

• Data models, consisting of the defined schemata for data exchanges within the CORE environment.

• Data, comprising all the data that must be passed to CORE services as input/output, as well as housekeeping information useful for process management.

Services and their runtime must also be part of the environment, in order to permit the execution of the overall CORE process composed of the available services.

Each component of the logical architecture will be detailed in the following sections.

3 Data Models

The CORE environment permits the definition of a set of XML schemata whose purpose is to provide a unique model for data exchanges among CORE services. The definition of a unique model for data representation has several advantages, the most significant of which are: (i) overcoming format heterogeneity and (ii) overcoming model heterogeneity. The format heterogeneity issue is addressed quite straightforwardly by using XML as the unique format for data exchanges among CORE services. The model heterogeneity issue is trickier to address. As we will see, the representation of data according to a CORE model on the one side, and the usage of a domain schema for the representation of domain knowledge on the other side, both contribute to addressing it.

In more detail, some XML schemata are defined within CORE acting as a sort of integration glue among CORE services.

More specifically, three XSD schemata have been defined, namely:

• CORE Data Model

• CORE Domain Descriptor

• CORE Mapping

The CORE Data Model contains information about the CORE layer to which a specific data set (or a part of it) belongs. The Domain Descriptor can optionally be specified and reports domain concepts (such as persons, enterprises, etc.) with their related properties. The Mapping contains information about the correspondence between the CORE Data Model and the Domain Descriptor concepts if a Domain Descriptor is present; otherwise it describes the correspondence between the CORE Data Model and the specific format of the tool encapsulated by the CORE service.

The detailed definition of such files is reported in the following sections.

3.1 CORE Data Model

The full CORE Data Model xsd file is reported in Appendix 1.

The CORE Data Model, in conformance to the CORE Information Model, represents data according to a rectangular data structure. We refer to the CORE Information Model for the full motivations underlying this choice. Here, however, we just remark that this assumption has the advantage of permitting an easy management of data structures in the CORE environment while, at the same time, being general enough to represent most of the input/output data managed by tools used by NSIs.

The CORE Data Model provides the definition of a CORE Tag attribute, whose admissible values are controlled by a CORE Tag type defined as follows:

<xs:simpleType name="coreTagType">
  <xs:restriction base="xs:string">
    <xs:enumeration value="Figures" />
    <xs:enumeration value="TimeSeries" />
    <xs:enumeration value="Statistics" />
    <xs:enumeration value="Population" />
    <xs:enumeration value="Unit" />
    <xs:enumeration value="Variable" />
  </xs:restriction>
</xs:simpleType>

The CORE tag is mandatory at the data set level, because it associates a CORE layer with the service producing that data set as output. In addition, the CORE tag can also be specified for each column of the rectangular data set and for the rows as a whole; in both cases the tag is optional. The purpose of these lower-level associations relates to possible subsequent manipulations that split the data set into “smaller” pieces: lower-level tagging ensures that such data partitioning does not lose the information provided by CORE metadata.
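As an illustration of the tagging rule, the following Java fragment is a minimal sketch, not part of the CORE prototype. It mirrors the six coreTagType values in an enum and models a data set whose tag is mandatory while the rows tag and the column tags are optional. The names CoreTag and DataSetTagging, and all their methods, are assumptions introduced only for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;
import java.util.Optional;

/** Java mirror of the coreTagType enumeration; the Java names are hypothetical. */
enum CoreTag {
    FIGURES("Figures"),
    TIME_SERIES("TimeSeries"),
    STATISTICS("Statistics"),
    POPULATION("Population"),
    UNIT("Unit"),
    VARIABLE("Variable");

    private final String xmlValue;

    CoreTag(String xmlValue) {
        this.xmlValue = xmlValue;
    }

    /** Returns the literal used in the XSD enumeration. */
    String xmlValue() {
        return xmlValue;
    }

    /** Parses an XSD literal and rejects anything outside the enumeration. */
    static CoreTag fromXmlValue(String value) {
        for (CoreTag tag : values()) {
            if (tag.xmlValue.equals(value)) {
                return tag;
            }
        }
        throw new IllegalArgumentException("Not a CORE tag: " + value);
    }
}

/** Hypothetical holder: data set tag mandatory, rows and column tags optional. */
final class DataSetTagging {
    private final CoreTag dataSetTag;
    private CoreTag rowsTag; // stays null when no rows tag is given
    private final Map<String, CoreTag> columnTags = new LinkedHashMap<>();

    DataSetTagging(CoreTag dataSetTag) {
        this.dataSetTag = Objects.requireNonNull(dataSetTag, "the data set tag is mandatory");
    }

    CoreTag dataSetTag() {
        return dataSetTag;
    }

    void tagRows(CoreTag tag) {
        rowsTag = tag;
    }

    Optional<CoreTag> rowsTag() {
        return Optional.ofNullable(rowsTag);
    }

    void tagColumn(String column, CoreTag tag) {
        columnTags.put(column, tag);
    }

    Optional<CoreTag> columnTag(String column) {
        return Optional.ofNullable(columnTags.get(column));
    }
}
```

Keeping the column tags in a map keyed by column name is one possible design choice: a column-wise split of the data set can then carry each column's tag along with it.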

Moreover, the Data set kind attribute and the Column kind attribute can also be specified, as defined by the CORE Information Model described in Deliverable 2.2.

3.2 CORE Domain Descriptor

The full CORE Domain Descriptor xsd file is reported in Appendix 2.

The purpose of this schema is to define a very simple meta-model for the representation of domain knowledge. Specifically, the model should permit the definition of domain concepts (e.g. enterprises) with associated properties (e.g. name, address, VAT, etc.). Hence, the meta-model defines entities, which can have associated properties. Notice that, to keep the model as simple as possible, relationships among entities have not been introduced so far.

Notice that the CORE Domain Descriptor is not mandatory: a data exchange between two services does not necessarily have to involve a domain schema. Having a domain knowledge representation can be useful for standardizing data representation, but it can be omitted where it is not available or not necessary.

3.3 CORE Mapping

The CORE mapping file serves the purpose of specifying how to translate data from a tool-specific format to CORE and vice versa. CORE mapping files are intended to be specific to classes of tools, on the basis of the admitted input and output formats. More specifically, considering the principal data formats managed by tools internal to NSIs, CORE Mapping should support at least:

• CSV/CORE transformations

• Relational/CORE transformations

More transformations can of course be defined, but these two meet most of the requirements of the tools used by NSIs.

When specifying the CORE mapping, the CORE tag must be provided.

A full example of CSV/CORE mapping is reported in Appendix 3.

In the example of CSV/CORE mapping, the most notable things are:

• CSV columns can be mapped either to properties of an entity or to properties of auxiliary.

• Specification of the CORE tag. Consistently with the CORE data model specification, this must be specified at the data set level and can be specified at the level of each column or of the whole set of rows.

4 Design of Integration APIs

4.1 CORE Service Definition

One of the principal goals of the CORE architecture is to enable the integration of existing tools into a unique environment. NSIs can indeed have tools developed according to heterogeneous designs and technologies. CORE makes it possible to abstract from technological heterogeneity and to homogenise input and output data according to the CORA model. To this end, dedicated integration APIs have been designed and implemented.

In a general sense, an integration API wraps a tool in order to make it CORE-compliant, i.e. a CORE executable service. As shown in Figure 2, a CORE service is composed of an inner part, which is the tool to be wrapped, and of input and output integration APIs. These APIs translate data between the CORE model and the tool-specific format.

As anticipated in Section 3.3, CORE mappings are designed for classes of tools and hence integration APIs should support the admitted transformations, e.g. CSV-to-CORE and CORE-to-CSV, Relational-to-CORE and CORE-to-Relational, etc.

4.2 Design of CORE Integration APIs

Basically, the integration API consists of a set of transformation components. Each transformation component corresponds to a specific data format, and the principal elements of its design are:

• CORE mapping file

• Domain descriptor file

• Transform-to-CORE operation

• Transform-from-CORE operation

Figure 2: CORE Service

[Figure 2 labels: CORE SERVICE; IAPI; TOOL; IAPI]

In order to provide an input to a tool (inner part of a CORE service) the Transform-from-CORE operation is invoked. Conversely, the output of the tool is converted by Transform-to-CORE operation.

Notice that for each single input or output file a transformation must be launched.
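The two operations can be sketched in Java as a small interface with one implementation per data format. The fragment below is only an illustration under stated assumptions: RectangularData is a hypothetical in-memory stand-in for CORE data (it does not follow the CORE XML schemata), the names TransformationComponent and CsvTransformation are invented for this example, and the CSV handling is deliberately naive (header in the first row, a user-specified separator, no quoting support).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical in-memory stand-in for a CORE rectangular data set (not the CORE XML schema). */
final class RectangularData {
    final List<String> columns;
    final List<List<String>> rows;

    RectangularData(List<String> columns, List<List<String>> rows) {
        this.columns = columns;
        this.rows = rows;
    }
}

/** One transformation component per tool-specific format (assumed shape). */
interface TransformationComponent {
    /** Converts the output of a tool into the CORE representation. */
    RectangularData transformToCore(String toolData);

    /** Converts CORE data into the input format expected by a tool. */
    String transformFromCore(RectangularData coreData);
}

/** CSV variant: the first row holds the variable names, the separator is chosen by the user. */
final class CsvTransformation implements TransformationComponent {
    private final String separator;

    CsvTransformation(String separator) {
        this.separator = separator;
    }

    @Override
    public RectangularData transformToCore(String csv) {
        String[] lines = csv.split("\\R"); // any line terminator
        List<String> header = splitLine(lines[0]);
        List<List<String>> rows = new ArrayList<>();
        for (int i = 1; i < lines.length; i++) {
            if (!lines[i].isEmpty()) { // skip blank lines
                rows.add(splitLine(lines[i]));
            }
        }
        return new RectangularData(header, rows);
    }

    @Override
    public String transformFromCore(RectangularData coreData) {
        List<String> lines = new ArrayList<>();
        lines.add(String.join(separator, coreData.columns));
        for (List<String> row : coreData.rows) {
            lines.add(String.join(separator, row));
        }
        return String.join("\n", lines);
    }

    private List<String> splitLine(String line) {
        // Pattern.quote keeps separators such as "|" from being read as regex syntax.
        return Arrays.asList(line.split(Pattern.quote(separator), -1));
    }
}
```

In this sketch a CORE service would call transformFromCore once for every input file of the wrapped tool and transformToCore once for every output file, in line with the remark that a transformation is launched for each single file.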

The illustrated concepts are summarized in Figure 3.

4.2.1 Example of Implementation: CSV Transformation

We have implemented the integration APIs related to CSV files conversion. In the following, we report the principal choices made with respect to such an implementation for the purpose of showing a concrete example of the above described concepts.

The following assumptions were made on the CSV files to process:

• First row of the CSV containing the names of the variables of the dataset (in any order).

• Presence of a separator, to be user specified.

The current implementation is based essentially on Java and XML.

The files implementing the data models described in Section 3 are realized as XSD schemata.

Figure 3: Data transformations through APIs

[Figure 3 labels: CORE XML DATA; IAPI; Tool-specific format; TOOL; Tool-specific format; IAPI; CORE XML DATA]

The current Java implementation exploits the JAXB API for converting XML files to/from Java objects. The XML files to be parsed must be described with an XSD schema, which is compiled through a JAXB-specific tool. Java classes corresponding to elements in the XML files are generated by the compiler.

5 GUI Design

This section introduces the general requirements that the GUI environment must satisfy in order to support the user both in defining a CORE statistical process (design time) and in executing it (run time).

5.1 GUI requirements for CORE

The GUI is an integrated environment intended to provide support in the different phases of process design. In particular, it provides the following features:

• graphical representation of the process as constituted by interacting CORE services;

• graphical tool for the definition of the domain descriptor;

• support to the XML mapping to/from CORE data domain descriptor;

• execution of the statistical process.

Before presenting the technical details concerning the architecture of the environment, we shall describe the GUI requirements according to the flow chart presented in Figure 4.

The GUI allows the user either to define a new CORE process or to select one of the available CORE processes. It also allows the user to define and modify CORE services. The functionality for defining a CORE service binds the service to a tool; the list of available tools is loaded dynamically from a dedicated tool repository, which contains the tools provided by the National Statistical Institutes. In practical terms, for each pair of services involved in the process, the user has to perform the steps shown in Figure 4:

Step 1: the user chooses a CORE service (Service n) among the available ones.

Step 2: the user chooses another CORE service (Service n+1), which will use the output of Service n as input.

Step 3: the user defines the domain descriptor of the transformation that integrates the selected services.

Step 4: to complete the integration of the selected services, the user specifies both (i) the mapping between the output of Service n and the domain descriptor and (ii) the mapping between the domain descriptor and the input of Service n+1.

Step 5: through the Data upload component of the GUI, the user specifies the data, such as input data and service configuration parameters, needed to complete the process definition.

At the end of the design phase, the GUI allows the user to execute the process.

Figure 4: Process design flow chart

[Figure 4 labels: Select first CORE service; Select second CORE service; Define the domain descriptor; Mapping to/from domain descriptor; Data upload; decision “is process ok?” with yes/no branches]

To support the design process described above, the GUI environment has a modular architecture, which is shown schematically in Figure 5.

Figure 5: GUI environment architecture

[Figure 5 labels: GUI environment; Process Design component; Domain descriptor definition; XML mapping to/from CORE; Data upload; Workflow engine adapter; BPEL/BPMN file; Workflow engine]

The components of the GUI environment are:

• process design component: the graphical part of the GUI, which exposes the set of available CORE processes and CORE services and allows the user to design the process by connecting pairs of services;

• domain descriptor definition component: this element of the environment helps the user define concepts such as the entities and properties related to each transformation; the component automatically creates the domain descriptor XML;

• mapping component: it supports the user in the mapping between the domain descriptor and any input/output format of CORE services, automatically building the corresponding XML files;

• file upload component: it allows the user to upload the data necessary for the process execution. Such data can be either input data for each service in the process or configuration parameters;

• workflow engine interaction component: this component is responsible for creating the BPEL/BPMN file for the designed statistical process, which will be executed by the workflow engine.

The main components of the GUI environment are four: (1) the process design component, (2) the domain descriptor definition component, (3) the mapping component and (4) the process execution component. Each is described in detail in the following sections.

5.1.1 Support to process definition

Component (1), i.e. the process design component, gives graphical support in the design of the statistical process, providing a user-friendly environment. The GUI exposes different functionalities that help the user in the definition of the process. When the user logs in to the application, the GUI presents a form offering two options: 1) choose a process from a list of previously defined processes; 2) define a new process. In the first case, when the user selects a process, the GUI displays a list of the services in the process. In the second case, the user specifies the name of the statistical process to be created and then presses the “Add process” button to create a new empty process.

After choosing a process, the user can view the details of each service related to the process simply by clicking on the service name. The user can also define a new service by pressing the “Add service” button. Once this button has been pressed, the GUI presents a page where the user can define:

• service properties: i.e. the service name, the tool connected to the service, the command line needed to run the tool, the GSBPM tag;

• logical input names: i.e. the names of the logical inputs linked to the service;

• logical output names: i.e. the names of the logical outputs linked to the service.

5.1.2 Support to domain descriptor definition

Component (2), i.e. the domain descriptor definition component, gives graphical support in the creation of the domain descriptor XML. This component implements the following functionalities:

1) Creation of a DD file from scratch

2) Creation of a DD file from a template

3) View and editing of existing DD files

We start by examining the first functionality. In practical terms, given the structure of a DD in terms of entities composed of properties, the GUI permits the user to:

• define the name of the domain descriptor;

• add an entity;

• specify a name for it;

• add properties to an entity;

• specify names for such properties.

When the number of entities and/or properties is high, it can be useful to populate the DD by default starting from a text file. This file should contain (at least) the header of a rectangular dataset, i.e. the names of the columns of the dataset. So, for instance, a file containing rectangular data is valid, provided that its first line is a header of the kind described. Through the second functionality in the list above, the GUI makes it possible to populate the properties of a given entity starting from a file uploaded by the user. In order to create the DD from a file template, the user has to specify the name of the entity and the field separator used in the file.

The environment creates the XML file, which is not visible to the user.
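The template-based population of a DD entity can be sketched as follows. This is a minimal Java illustration, not the GUI implementation: it reads only the header line of the uploaded file, splits it on the user-specified separator and uses the column names as property names. The classes DomainEntity and DomainDescriptorTemplate are hypothetical, and the generation of the domain descriptor XML itself is left out because its element names are defined in Appendix 2.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical in-memory form of a domain descriptor entity: a name plus property names. */
final class DomainEntity {
    final String name;
    final List<String> properties;

    DomainEntity(String name, List<String> properties) {
        this.name = name;
        this.properties = properties;
    }
}

/** Hypothetical helper that populates an entity from the header of a rectangular text file. */
final class DomainDescriptorTemplate {
    private DomainDescriptorTemplate() {
    }

    static DomainEntity entityFromHeader(String entityName, Reader file, String separator) {
        try {
            // Only the first line (the header) is needed; data rows are ignored.
            String header = new BufferedReader(file).readLine();
            if (header == null) {
                throw new IllegalArgumentException("The template file is empty");
            }
            List<String> properties = new ArrayList<>();
            for (String column : header.split(Pattern.quote(separator), -1)) {
                properties.add(column.trim());
            }
            return new DomainEntity(entityName, properties);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The two user inputs of the sketch, the entity name and the field separator, correspond to the two values the user has to specify in the GUI.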

The third functionality implemented by this component gives the user the possibility to view and edit previously created DDs.

5.1.3 Support to CORE mapping

Component (3), i.e. the mapping component, gives graphical support in the mapping between the properties of the domain descriptor entities and any input/output format of CORE services.

Attributes of the datasets could either be edited or uploaded from an existing text file. This file should contain (at least) the header of a rectangular dataset, i.e. the names of the columns of the dataset. Attributes of the DD must instead be loaded from an existing DD file. This means that mapping GUI functionalities must be enabled only when at least one DD file is available. Moreover, a core tag and a core column kind must be specified on such a mapping.

The core tag must be one of the following: figures, timeseries, statistics, population, unit, variable.

The core column kind must be one of the following: variable, dimension, measure, level, loggingInfo, other.

A consistency check verifies that data set kinds and column kinds agree. Specifically:


• if the dataset kind is micro, the column kind must be variable;

• if the dataset kind is dimensional, the column kind must be either dimension or measure;

• if the dataset kind is classification, the column kind must be level;

• if the dataset kind is logging, the column kind must be loggingInfo.
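The rules above can be sketched as a simple lookup. This is a minimal Python illustration, not the check implemented in the environment: the names ALLOWED_COLUMN_KINDS and inconsistent_columns are hypothetical, and treating the column kind other as inconsistent for these four dataset kinds is an assumption, since the rules do not mention it.

```python
# Allowed column kinds for each dataset kind, as listed in the rules above.
ALLOWED_COLUMN_KINDS = {
    "micro": {"variable"},
    "dimensional": {"dimension", "measure"},
    "classification": {"level"},
    "logging": {"loggingInfo"},
}


def inconsistent_columns(dataset_kind, column_kinds):
    """Return the names of the columns whose kind violates the rules.

    column_kinds maps each column name to its core column kind.
    """
    if dataset_kind not in ALLOWED_COLUMN_KINDS:
        raise ValueError(f"unknown dataset kind: {dataset_kind}")
    allowed = ALLOWED_COLUMN_KINDS[dataset_kind]
    return [name for name, kind in column_kinds.items() if kind not in allowed]


# Illustrative column names; an empty result means the mapping is consistent.
columns = {"AREA": "dimension", "VALUE_ADDED": "measure", "IDENTIFIER": "variable"}
print(inconsistent_columns("dimensional", columns))
```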

In practical terms, after defining the domain descriptor, the user has to perform the following steps (as shown in Figure 6):

• mapping between the output of Service n and the domain descriptor: with the assistance of the GUI, each field of the output file can be mapped to an entity property; further global mapping properties, such as dataset tag, dataset kind and rows tag, must also be specified. In the example shown, “STRAT_V” is mapped to “STRATO”, “Count” to “N”, and so on. The component (i) creates the corresponding mapping XML file and (ii) produces the CORE data set (as shown in Figure 6), which is not visible to the user.

• mapping between the domain descriptor and the input of Service n+1: similarly to the previous step, the component allows one to define the mapping from the CORE data set to the Service n+1 input file.
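The two mapping steps can be sketched in Python as follows. This is only an illustration under stated assumptions: the element and attribute names (dataset, row, column, propertyName, propertyValue, entityName) are taken from Figure 6, while the function names to_core and from_core, the dictionary form of the mappings, and the full column-to-property correspondence beyond STRAT_V and COUNT are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping of Service n output columns to DD properties.
OUTPUT_MAPPING = {
    "STRAT_V": "STRATO",
    "COUNT": "N",
    "VAL_ADDED_MEAN": "M1",
    "VAL_ADDED_STD": "S1",
}
# Hypothetical mapping of DD properties to Service n+1 input columns.
INPUT_MAPPING = {"STRATO": "STRATO", "N": "N", "M1": "M1", "S1": "S1"}


def to_core(rows, mapping, entity, name):
    """Convert tool output rows (dicts) into a CORE data set element."""
    dataset = ET.Element(
        "dataset",
        {"name": name, "coreTag": "Statistics", "rowsCoreTag": "Statistics"},
    )
    for row_id, row in enumerate(rows):
        row_el = ET.SubElement(dataset, "row", {"rowId": str(row_id)})
        for column, prop in mapping.items():
            ET.SubElement(
                row_el,
                "column",
                {"entityName": entity, "propertyName": prop,
                 "propertyValue": row[column]},
            )
    return dataset


def from_core(dataset, mapping):
    """Convert a CORE data set element into input rows for the next tool."""
    rows = []
    for row_el in dataset.iter("row"):
        values = {c.get("propertyName"): c.get("propertyValue")
                  for c in row_el.iter("column")}
        rows.append({column: values[prop] for prop, column in mapping.items()})
    return rows


service_n_output = [
    {"STRAT_V": "a", "COUNT": "5158",
     "VAL_ADDED_MEAN": "54641.8534", "VAL_ADDED_STD": "22319.11"},
]
core_dataset = to_core(service_n_output, OUTPUT_MAPPING, "Stats", "ToSample")
service_n1_input = from_core(core_dataset, INPUT_MAPPING)
print(service_n1_input)
```

Keeping the CORE data set as the only exchange format means that each tool needs just one mapping to the domain descriptor, rather than one converter per pair of tools.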

5.1.4 Process execution component

This component is responsible for the execution of the process.


Figure 6: transformation to/from CORE data set

Service n output:

STRAT_V   COUNT   VAL_ADDED_MEAN   VAL_ADDED_STD
a         5158    54641.8534       22319.11
b         2632    54639.58913      24056.21
c         1184    54653.04626      54201.75

CORE data set (excerpt):

<dataset name="ToSample" rowsCoreTag="Statistics" coreTag="Statistics">
  <row rowId="0">
    <column propertyValue="a" propertyName="STRATO" entityName="Stats" coreTag="Variable"/>
    <column propertyValue="5158" propertyName="N" entityName="Stats"/>
    <column propertyValue="54641.8534" propertyName="M1" entityName="Stats"/>
  </row>
  ...
</dataset>

Service n+1 input:

STRATO   N      M1            S1
a        5158   54641.8534    22319.11
b        2632   54639.58913   24056.21
c        1184   54653.04626   54201.75


5.2 ORYX: a possible technical solution

To implement part of the GUI environment presented in the previous section, we have chosen the Oryx project as the technical solution (a screenshot is shown in Figure 7). Oryx is an academic open-source framework for graphical process modelling, mainly driven by the Business Process Technology research group at the Hasso-Plattner-Institute. It is an open platform for developments concerning business process modelling: contributors are invited to add new process modelling languages, features, and knowledge. The project is very active: the number of contributors is growing rapidly and the developers come from all over the world. The project is published under the MIT License.

The most important features of Oryx are the following:

• based on web technology: to start modelling, no software installation is required, all the process design phases can be performed through a web browser;

• extensible via a plug-in mechanism: all the environment functionalities can be extended through plug-ins, which can be implemented both on the client side and on the server side. Client-side plug-ins extend the functionality of the GUI; for example, it is possible to implement an import plug-in for a specific data format. Server-side plug-ins extend the core of the environment;

• stencil sets: easy, declarative definition of new process modelling languages;

• support to BPMN and other process modelling languages.

From a more technical point of view: (i) the server-side code is written in Java; in particular, server-side core functionalities can be implemented through Java Servlets; (ii) the client-side code is written in JavaScript, and the rendering technologies used are XHTML and Scalable Vector Graphics (SVG). An interesting feature of Oryx is the possibility of defining custom stencil sets, which make it possible to implement all the constructs of BPMN. Details concerning stencil sets are provided in the next sub-section.

5.2.1 Stencil sets

A stencil set allows one to define a resource type, i.e. it defines which data has to be added to a resource and what data the editor can expect when loading a resource of a certain type. Thus, through a custom stencil set it is possible to implement all the constructs necessary to define a statistical process. In practical terms, the main characteristics of a stencil set are the following:

• it is a set of graphical objects and rules that specify how to relate graphical objects to others;

• it can be loaded and used in the Oryx editor to build process models;

• not only does it have a graphical representation, but also additional properties that can later be used by other applications or Oryx extensions (e.g. setting element colours and visibility);

• it can be used to build process models.

We have implemented a CORE stencil set. A dedicated graphical object of the stencil set represents a CORE service. Further graphical objects have been introduced to model flow and control constructs as well as starting and ending nodes. It is also possible to define custom properties, such as the GSBPM level of the service or the type of input/output data. Furthermore, since the stencil set is BPMN compliant, the designed process can easily be exported in a format that can be executed by a workflow engine.

Figure 7: Oryx screenshot


6 Design of the Repository

Figure 8: Logical schema of the repository

The central element of the schema represented in Figure 8 is the service. A service is defined within a single process and it is “followed” by a single service. Notice that it is not the purpose of this repository to store the complexity of workflow-managed processes. Instead, a sequence-based interaction among services is deliberately modelled.

Indeed, as shown in Figure 9, the process control can be decoupled into two layers: data flow control system and workflow engine. The data flow control system is in charge of a sequential execution of services, and at the same time of the data transformations ensuring the correct flow of the input/output data passed among the services. This is indeed the sequence of services modelled in our repository.


Figure 9: Decoupling the process control into two layers

The workflow engine, instead, handles more complex flows (see Figure 9), to be understood as activity synchronization constructs.

Each service corresponds to one single tool, while, conversely, a tool can be encapsulated by different services. Each tool has a runtime type that describes how it can be invoked.

Notice that, for the purpose of documentation, GSBPM and CORE tags are attributes of services.

Besides processes, the related process instances are stored in the repository, so a precise log of process execution can be derived from the repository.

A further significant concept modelled in the repository is operational data. These can be of two types: CORE data and non-CORE data. The former are data that must undergo transformations, whereas non-CORE data are passed to services without any transformation. As an example of non-CORE data, consider the data that are input to the whole process: they are passed as they are to the first service of the process.

All the data managed by the CORE environment are stored for the purpose of process reproducibility: the entities DomainDescriptor and MappingFile serve exactly this purpose.

Finally, notice that operational data have logical names that are specified at design time and resolved at runtime depending on the nature of CORE data and non-CORE data. CORE data must be further elaborated by the environment through an ad-hoc design process. Non-CORE data, instead, are simply specified at the runtime stage through the path where they are located. Nevertheless, logical names must be given to both CORE and non-CORE data in order to let the CORE environment understand how to pass data among services: the output of a service is considered the input of another service by matching the logical names of their operational data.
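The matching of logical names can be sketched as follows. This Python fragment is only an illustration: the Service class and the link_services function are hypothetical, and the logical names are borrowed from the demo scenario of Section 9.

```python
from dataclasses import dataclass, field


@dataclass
class Service:
    """Hypothetical, simplified view of a service and its operational data."""
    name: str
    inputs: list = field(default_factory=list)   # logical names of inputs
    outputs: list = field(default_factory=list)  # logical names of outputs


def link_services(services):
    """Return (producer, consumer, logical_name) triples.

    An output is linked to an input when the two share a logical name.
    """
    producers = {}
    for service in services:
        for logical_name in service.outputs:
            producers[logical_name] = service.name
    links = []
    for service in services:
        for logical_name in service.inputs:
            producer = producers.get(logical_name)
            if producer is not None and producer != service.name:
                links.append((producer, service.name, logical_name))
    return links


process = [
    Service("ALLOCATION", inputs=["stratif", "errors"], outputs=["bethel_out"]),
    Service("SELECTION", inputs=["frame", "bethel_out"], outputs=["sample"]),
    Service("ESTIMATION", inputs=["sample"], outputs=["estimates"]),
]
links = link_services(process)
print(links)
```

Inputs that no service produces (here stratif, errors and frame) are left unlinked: they correspond to data supplied to the process from outside.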


7 Architecture Deployment

The CORE architecture has been designed to be deployed as a web-based architecture. Specifically, the web architecture consists of:

• a centralized component, corresponding to the main elements of the CORE logical architecture;

• distributed components for service execution. To make remote execution possible, support for tool execution and data transfer must be explicitly developed.

Each NSI can have more than one deployment of the CORE architecture. Specifically, it may be the case that an internal deployment is required in order to carry out intra-organizational processes, but it may also be the case that a distinct deployment is required for managing calls of CORE services outside NSIs.

The runtime for service execution can be of one of the following types:

• Batch, in which a tool is executed by a command line call. It can be automated.

• Interactive, in which a user can interact with the tool through a tool-provided GUI. It cannot be (fully) automated.

• Web service, in which a procedure is deployed on a web server. It can be automated.

[Figure 10 components. CORE Environment: GUI Definition, Repository, Integration APIs, Process Engine, Runtime (Web service client, Remote activation). Batch-Interactive runtime: Runtime agent. Web service runtime: Web container.]

Figure 10: Types of runtime


In Figure 10, the logical architecture previously described has been extended with a further Runtime component.

In the case either of a batch or of an interactive service execution, the Runtime component is in charge of the remote activation of a runtime agent. The runtime agent runs on the machine on which the tool is deployed and is responsible for: (i) preparing the service input; (ii) gathering the service output; (iii) activating the tool.

In the case of a web service runtime, the Runtime component includes a web service client that performs the remote call of a web service deployed in a web container. The web container basically performs the same functions as the runtime agent. The web service runtime can be used for calling services outside NSI boundaries.
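The three runtime types and the remote component each of them relies on can be summarised in a small Python sketch. The names RuntimeType, can_be_automated and remote_counterpart are hypothetical and serve only to restate the description above in executable form.

```python
from enum import Enum


class RuntimeType(Enum):
    BATCH = "batch"
    INTERACTIVE = "interactive"
    WEB_SERVICE = "web service"


def can_be_automated(runtime_type):
    """Batch and web service runtimes can be automated; interactive cannot (fully)."""
    return runtime_type is not RuntimeType.INTERACTIVE


def remote_counterpart(runtime_type):
    """Name the remote component contacted by the Runtime component."""
    if runtime_type is RuntimeType.WEB_SERVICE:
        # The web service client calls a service deployed in a web container.
        return "web container"
    # Batch and interactive executions go through a remotely activated agent.
    return "runtime agent"


for runtime_type in RuntimeType:
    print(runtime_type.value, can_be_automated(runtime_type),
          remote_counterpart(runtime_type))
```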

In Figure 11, we show the sequence of steps that is performed when calling a service according to batch and interactive runtime types (the web service runtime requires similar steps).

[Figure 11 shows the CORE Environment and the Batch-Interactive runtime (with its Runtime agent), annotated with the following steps.]

1: The process engine signals that a service must be executed.

2: The service definition is extracted from the repository, as well as the required datasets and the corresponding mappings.

3: Datasets are converted according to the mapping.

4: Converted datasets are transferred to the remote runtime.

5: The tool is activated by the runtime agent.

Figure 11: Interactions of components upon a batch/interactive service call


8 CORE Process Scenario

In this section we describe the process scenario we used as an empirical test-bed during the whole implementation cycle of the CORE environment. The considerations that led us to choose this scenario can be summarized as follows:

• The process scenario is easily recognized as a portion of the typical data production process carried out by NSIs for sample surveys, though represented in a quite simplified way.

• The workflow linking the scenario sub-processes is very simple: it involves neither conditional execution nor cycles. We are aware that real-world statistical processes may exhibit quite complex workflows, and indeed this issue has been addressed by WP4 tests on workflow engine solutions; however, we contend that workflow complexity has no impact on the CORE architecture, nor on Data Models or integration APIs (recall also the considerations of Section 6).

• Data exchanged by CORE services belonging to the scenario process are heterogeneous, both from a statistical perspective (micro data and aggregated data are both involved) and with respect to format (both CSV files and relational DB tables are involved). Moreover, the scenario entails some “model” heterogeneity, as only part of the exchanged data refers to ordinary real-world concepts (e.g. enterprises, employees, …), while the rest relies upon concepts directly borrowed from the statistical domain (e.g. strata, variances, sampling weights, …). Recall that data heterogeneity is exactly what the CORE environment is expected to overcome in order to integrate statistical services.

• IT tools implementing scenario sub-processes are technologically heterogeneous. They include: (i) simple SQL statements executed on a relational DB, (ii) batch jobs based on SAS or R scripts, (iii) full-fledged R-based systems requiring a human-computer interaction through a GUI layer. Recall that the CORE environment is expected to overcome the technological heterogeneity of the tools, by wrapping them inside CORE-compliant services via tool-specific integration APIs.

The next Section is devoted to a thorough description of the scenario.


8.1 Scenario description

We report in Figure 12 below a traditional flow chart representation of the process scenario.

Figure 12: A traditional flow chart representation of the scenario process

As anticipated, the process flows in a simple waterfall fashion, with services being activated sequentially. Only the services identified by yellow rectangles are meant to be actually implemented in the scenario, whereas those encapsulated inside the dashed central container have been reported for conceptual completeness only. As a consequence, the scenario actually breaks down into two disconnected sequences, both covering a portion of the typical processing steps performed for sample surveys in the Official Statistics field. The first sequence covers the phases of sample allocation and selection; the second deals with the computation of estimates and sampling errors, as well as with their storage (e.g. for later dissemination) and subsequent conversion to SDMX format (e.g. for later bilateral exchange). In what follows we outline the most relevant aspects of the abstract statistical services involved, and provide some information on the IT tools we plan to wrap as a CORE proof of concept.

[Figure 12 flow: START → Compute Strata Statistics → Allocate the Sample → Select the Sample → Collect Survey Data → Check and Correct Survey Data → Calibrate Survey Data → Compute Estimates and Sampling Errors → Store Estimates and Sampling Errors → Convert to SDMX → STOP. Group labels: ALLOCATION, ESTIMATION.]


8.1.1 SAMPLE ALLOCATION

The ALLOCATION sub-process has the overall goal of determining the minimum number of units to be sampled inside each stratum, when lower bounds are imposed on the expected level of precision of the estimates the survey has to deliver. To obtain this result, two statistical services are needed. The first service, Compute Strata Statistics, is in charge of computing, for each stratum, the mean and the standard deviation of a set of auxiliary variables whose value is known for each unit belonging to the sampling frame (i.e., ideally, for the entire target population). The output of the first service is an input to the second one, Allocate the Sample, which, by solving a constrained optimization problem, finds and returns the optimal sample allocation across strata. On the basis of their output, both services belong to the “Statistic” CORA layer.

NSIs usually maintain sampling frame(s) as relational DB table(s), hence we assume that the IT tool to be wrapped inside the Compute Strata Statistics service is a simple SQL aggregate query with a group-by clause. Making this service CORE-compliant therefore requires implementing an integration API supporting Relational/CORE transformations.
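A query of this kind can be sketched as follows, using Python's built-in sqlite3 module with an in-memory database. The table name frame, its columns and its values are made up for illustration and do not reflect any actual sampling frame. Since SQLite provides no standard deviation aggregate out of the box, the sketch derives the (population) standard deviation from the mean of squares; other DBMSs may offer a dedicated aggregate function.

```python
import math
import sqlite3

# Hypothetical sampling frame: one auxiliary variable per unit.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE frame (id INTEGER, stratum TEXT, value_added REAL)")
conn.executemany(
    "INSERT INTO frame VALUES (?, ?, ?)",
    [(1, "a", 10.0), (2, "a", 14.0), (3, "b", 20.0), (4, "b", 20.0)],
)

# Aggregate query with a group-by clause: size, mean and variance per stratum.
QUERY = """
SELECT stratum,
       COUNT(*) AS n,
       AVG(value_added) AS mean,
       AVG(value_added * value_added)
         - AVG(value_added) * AVG(value_added) AS variance
FROM frame
GROUP BY stratum
ORDER BY stratum
"""

strata_statistics = [
    # max() guards against tiny negative variances due to rounding.
    (stratum, n, mean, math.sqrt(max(variance, 0.0)))
    for stratum, n, mean, variance in conn.execute(QUERY)
]
for row in strata_statistics:
    print(row)
```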

For the Allocate the Sample service, we choose to wrap the MAUSS-R system (see footnote 1). MAUSS-R is implemented in R and Java, and can be run either in batch mode or interactively via a GUI. Since MAUSS-R handles input/output data through CSV files, CORE compliance does not pose any problem, as it can be achieved by means of an integration API supporting CSV/CORE transformations. By contrast, enabling the CORE environment to support human-computer interaction is still an open issue: we plan to use MAUSS-R to test the technical feasibility of any forthcoming solution.

8.1.2 SAMPLE SELECTION

The Select the Sample service is aimed at drawing a stratified random sample of units from the sampling frame, according to the previously computed optimal allocation (i.e. the output of the previous ALLOCATION service). Its output will be a dataset storing the identifiers of the units to be later surveyed, along with basic information needed to contact them. As a consequence, the service lies in the “Population” CORA layer.

We choose to implement this service by wrapping a simple SAS script to be executed in batch mode. Since SAS datasets adhere to a proprietary, closed format, we do not provide any support for direct SAS/CORE transformations. Instead, we implement an integration API that, by exploiting SAS import/export facilities, reuses standard CSV/CORE transformations.
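The logic of the Select the Sample service can be illustrated with a short Python sketch. It is not the SAS script wrapped in the scenario: the function select_sample, the frame layout and the use of simple random sampling without replacement inside each stratum are assumptions made for this example.

```python
import random
from collections import defaultdict


def select_sample(frame, allocation, seed=None):
    """Draw a stratified random sample.

    frame: list of dicts, each with at least a "stratum" key.
    allocation: dict mapping each stratum to the number of units to draw.
    """
    rng = random.Random(seed)
    units_by_stratum = defaultdict(list)
    for unit in frame:
        units_by_stratum[unit["stratum"]].append(unit)

    sample = []
    for stratum, size in allocation.items():
        units = units_by_stratum[stratum]
        if size > len(units):
            raise ValueError(f"stratum {stratum!r} has only {len(units)} units")
        # Simple random sampling without replacement within the stratum.
        sample.extend(rng.sample(units, size))
    return sample


# Made-up frame of ten units in two strata, with a made-up allocation.
frame = [{"id": i, "stratum": "a" if i < 6 else "b"} for i in range(10)]
allocation = {"a": 2, "b": 3}
sample = select_sample(frame, allocation, seed=0)
print(sorted(unit["id"] for unit in sample))
```

Passing a seed makes the selection reproducible, which is in line with the reproducibility requirement stated for the repository.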

The execution of the Select the Sample service ends the first scenario sequence.

1 MAUSS-R software [online] available at https://joinup.ec.europa.eu/software/mauss-r/description


8.1.3 ESTIMATION

The second scenario sequence is supposed to start after the completion of the standard phases of data collection, editing and imputation (represented in Figure 12 by dummy services Collect Survey Data and Check and Correct Survey Data).

The ESTIMATION sub-process has the overall goal of computing the estimates the survey must deliver and of assessing their precision. To obtain this result, two statistical services are needed. The first service, Calibrate Survey Data, is in charge of enabling the usage of Calibration Estimators in the subsequent Compute Estimates and Sampling Errors service.

Roughly speaking and focusing on the output, the Calibrate Survey Data service must provide a new set of weights (the “calibrated weights”) to be used, instead of the original sampling weights (the “direct weights”), for estimation purposes. Therefore, the service has to be placed in the “Variable” CORA layer.

Once the “calibrated weights” are available, the Compute Estimates and Sampling Errors service can exploit them according to standard Model Assisted Sampling Theory methods. The output of such service is precisely the set of estimates the survey was designed to provide (typically computed for different subpopulations of interest), supplied with the corresponding confidence intervals. Hence, the service lies in the “Statistic” CORA layer.

To implement a CORE ESTIMATION sub-process, we choose to wrap the ReGenesees system (see footnote 2). ReGenesees is an R system for Design Based and Model Assisted analysis of complex surveys; it can be used interactively via a GUI or, alternatively, run in batch mode. As ReGenesees can handle a variety of input/output formats (including text files, spreadsheets or DB tables), the implementation of a CORE integration API should not be problematic. A challenge could instead be enabling the CORE environment to handle the sampling design metadata needed to analyze a complex survey, namely information about strata, stages, cluster identifiers, sampling weights, calibration models, and so on. Moreover, as already observed for MAUSS-R, we plan to use ReGenesees to design, build and test some CORE facilities for interactive service execution.

8.1.4 STORING AND CONVERSION TO SDMX

The Store Estimates and Sampling Errors service must simply execute a set of SQL statements, so that the previously computed survey estimates can be persistently stored in a relational DB, e.g. in order to subsequently feed a data warehouse for online publication. The service will be at the “Statistic” CORA layer and will rely on an integration API supporting Relational/CORE transformations.

The last service of the second scenario sequence, Convert to SDMX, should retrieve the aggregated data from the relational DB and directly convert them to SDMX XML format, so that they can later be sent to, e.g., Eurostat. This service, whose position along the CORA data dimension is again “Statistic”, has been deliberately conceived as a test-bed for an integration API based on an SDMX/CORE transformation.

2 ReGenesees system [online] available at http://joinup.ec.europa.eu/software/regenesees/description


9 CORE Demo Scenario

In this section, we show an instance of the data transformations performed by the CORE environment in order to enable runtime data exchange among interacting CORE services. Specifically, we focus on the part of the CORE process scenario described in Section 8 that has actually been used for demo purposes. As can be seen in Figure 13 below, the demo scenario entails the sequential execution of three services:

1. ALLOCATION, encapsulating the MAUSS-R software;

2. SELECTION, encapsulating a SAS script;

3. ESTIMATION, encapsulating the ReGenesees System.

Figure 13: Flow chart representation of the CORE demo scenario

As already pointed out, pairs of sequential CORE services (red rectangles in Figure 13) interact by exchanging CORE XML data only: that is exactly what the black broken arrows in Figure 13 mean. As argued in Section 4, CORE integration APIs take care of translating CORE XML data to/from tool-specific data formats: this way the CORE environment enables the actual execution of IT tools wrapped inside CORE services (yellow rectangles in Figure 13). This mechanism is depicted for our demo process in Figure 14, where sky-blue rectangles represent CORE operational data (i.e. datasets which must undergo transformations in order to be exchanged by pairs of CORE services), whereas green rectangles represent non-CORE ones (i.e. datasets that are dispatched to/from IT tools without any transformations).

Figure 14: Flow chart representation of the CORE demo scenario

The implemented demo process has a common Domain Descriptor file, shown in the following:

<schema name="DEMO_Domain_Descriptor">

<entity name="SamplePlan">

<property name="STRATIFICATION_VAR"/>

<property name="STRATUM_SAMPLE_SIZE"/>

</entity>

<entity name="Enterprise">

<property name="IDENTIFIER"/>

<property name="STRATIFICATION_VAR"/>

<property name="WEIGHT"/>

<property name="SAMPLING_FRACTION"/>

<property name="ENTERPRISE_FLAG"/>

<property name="EMPLOYEES_NUM"/>


<property name="VALUE_ADDED"/>

<property name="AREA"/>

</entity>

</schema>

The file consists of two entities, i.e. SamplePlan and Enterprise, each characterized by specific properties.

The input to the ALLOCATION service consists of two files: stratif.txt and errors.txt. As sketched in Section 8.1.1, file stratif.txt stores the mean and the standard deviation of a set of auxiliary variables (two, in this case) whose value is known for each unit in the sampling frame, while file errors.txt stores the constraints imposed on the expected level of precision of the estimates the survey has to deliver, expressed as upper bounds on the expected sampling errors. Starting from these files, ALLOCATION computes the optimal sample allocation across strata, and returns it as the CAMP column of its output file bethel_out.txt. The header of this file is shown below:

STRATO | N | M1 | M2 | S1 | S2 | cost | cens | DOM1 | DOM2 | CAMP

This header is mapped to the previously shown Domain Descriptor by means of an XML mapping file (Mapping1DD.xml) as follows:

<csvFile name="ALLOCATION" coreTag="Statistics" rowsCoreTag="Statistics">

<column name="STRATO" coreTag="Variable">

<mapped-to-entity entityName= "SamplePlan"

propertyName="STRATIFICATION_VAR"/>

</column>

<column name="CAMP" coreTag="Variable">

<mapped-to-entity entityName= "SamplePlan"

propertyName="STRATUM_SAMPLE_SIZE"/>

</column>

</csvFile>

Notice that: (i) only the STRATO and CAMP fields of the ALLOCATION output have been mapped to the Domain Descriptor, and (ii) a core tag has been specified at both the overall data set level (i.e. Statistics) and the column level (i.e. Variable).

At the same time, SELECTION input is also mapped to the same Domain Descriptor. As sketched in Section 8.1.2, SELECTION has two inputs. The first one, frame.txt, is the sampling frame to be used for drawing samples, in the present case a complete list of real-world enterprises. Notice that this file, not being exchanged by CORE services, does not need to undergo any CORE transformation. The other input to SELECTION is a file providing the necessary information for selecting the sample. Due to the nature of the tool implementing SELECTION, this second input is expected to contain two columns, STRATUM and ALLOC, that we must be able to map, via the Domain Descriptor, to ALLOCATION output columns STRATO and CAMP. In order to make the CORE environment able to “translate” ALLOCATION output (bethel_out.txt) into SELECTION input (call it bethel_out*.txt), it is therefore necessary to specify the following mapping of SELECTION input to the Domain Descriptor (MappingDD2.xml):

<csvFile name="SAMPLE_SIZE" coreTag="Statistics" rowsCoreTag="Statistics">
  <column name="STRATUM" coreTag="Variable">
    <mapped-to-entity entityName="SamplePlan" propertyName="STRATIFICATION_VAR"/>
  </column>
  <column name="ALLOC" coreTag="Variable">
    <mapped-to-entity entityName="SamplePlan" propertyName="STRATUM_SAMPLE_SIZE"/>
  </column>
</csvFile>
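The “translation” described above can be sketched as follows: two columns correspond to each other when their mappings point to the same entity and property of the Domain Descriptor. This is only an illustration in Python under that assumption; the helper names (targets, column_translation) and the sample record are hypothetical, and the actual CORE transformation may work differently.

```python
import xml.etree.ElementTree as ET

# The two mappings, inlined so that the sketch is self-contained.
ALLOCATION_OUTPUT = """
<csvFile name="ALLOCATION" coreTag="Statistics" rowsCoreTag="Statistics">
  <column name="STRATO" coreTag="Variable">
    <mapped-to-entity entityName="SamplePlan" propertyName="STRATIFICATION_VAR"/>
  </column>
  <column name="CAMP" coreTag="Variable">
    <mapped-to-entity entityName="SamplePlan" propertyName="STRATUM_SAMPLE_SIZE"/>
  </column>
</csvFile>
"""

SELECTION_INPUT = """
<csvFile name="SAMPLE_SIZE" coreTag="Statistics" rowsCoreTag="Statistics">
  <column name="STRATUM" coreTag="Variable">
    <mapped-to-entity entityName="SamplePlan" propertyName="STRATIFICATION_VAR"/>
  </column>
  <column name="ALLOC" coreTag="Variable">
    <mapped-to-entity entityName="SamplePlan" propertyName="STRATUM_SAMPLE_SIZE"/>
  </column>
</csvFile>
"""


def targets(xml_text):
    """Map each column name to its (entityName, propertyName) pair."""
    result = {}
    for column in ET.fromstring(xml_text).findall("column"):
        entity = column.find("mapped-to-entity")
        if entity is not None:
            result[column.get("name")] = (
                entity.get("entityName"),
                entity.get("propertyName"),
            )
    return result


def column_translation(source_xml, target_xml):
    """Pair source and target columns mapped to the same descriptor property."""
    by_property = {prop: name for name, prop in targets(target_xml).items()}
    return {
        name: by_property[prop]
        for name, prop in targets(source_xml).items()
        if prop in by_property
    }


renaming = column_translation(ALLOCATION_OUTPUT, SELECTION_INPUT)

# Hypothetical record of bethel_out.txt, renamed for the SELECTION tool.
record = {"STRATO": "A1", "CAMP": "25"}
translated = {renaming[name]: value for name, value in record.items()}
```

The Domain Descriptor acts as the pivot: neither mapping file needs to know the column names used by the other service.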

The SELECTION output is the file storing the selected sample, sample.txt. The header of this file is:

id | public | emp_num | emp_cl | nace5 | nace2 | area | cens | region | va_cl | va | dom1 | nace_macro | dom2 | stratum | va_imp1 | va_imp2 | y | ent

This, in turn, must be mapped to the Domain Descriptor in order to cope with the schema heterogeneity with respect to the ESTIMATION input. Indeed, as sketched in Section 8.1.3, ESTIMATION needs to be fed with (i) all the sampling design metadata needed to analyze the survey (namely information about strata, stages, cluster identifiers and sampling weights), and (ii) the variables of interest and the domains to be used for estimation purposes. This is accomplished by the following mapping (Mapping 2DD.xml):

<csvFile name="SAMPLE" coreTag="Population" rowsCoreTag="Unit">
  <column name="id" coreTag="Variable">
    <mapped-to-entity entityName="Enterprise" propertyName="IDENTIFIER"/>
  </column>
  <column name="stratum" coreTag="Variable">
    <mapped-to-entity entityName="Enterprise" propertyName="STRATIFICATION_VAR"/>
  </column>
  <column name="weight" coreTag="Variable">
    <mapped-to-entity entityName="Enterprise" propertyName="WEIGHT"/>
  </column>
  <column name="fpc" coreTag="Variable">
    <mapped-to-entity entityName="Enterprise" propertyName="SAMPLING_FRACTION"/>
  </column>
  <column name="ent" coreTag="Variable">
    <mapped-to-entity entityName="Enterprise" propertyName="ENTERPRISE_FLAG"/>
  </column>
  <column name="emp_num" coreTag="Variable">
    <mapped-to-entity entityName="Enterprise" propertyName="EMPLOYEES_NUM"/>
  </column>
  <column name="area" coreTag="Variable">
    <mapped-to-entity entityName="Enterprise" propertyName="AREA"/>
  </column>
</csvFile>

Notice that only the columns that are meaningful for the subsequent ESTIMATION service have been mapped.

Finally, the ESTIMATION input must be made CORE compliant, and a mapping specification file is again necessary (MappingDD3.xml):

<csvFile name="SURVEY_DATA" coreTag="Population" rowsCoreTag="Unit">
  <column name="id" coreTag="Variable">
    <mapped-to-entity entityName="Enterprise" propertyName="IDENTIFIER"/>
  </column>
  <column name="stratum" coreTag="Variable">
    <mapped-to-entity entityName="Enterprise" propertyName="STRATIFICATION_VAR"/>
  </column>
  <column name="weight" coreTag="Variable">
    <mapped-to-entity entityName="Enterprise" propertyName="WEIGHT"/>
  </column>
  <column name="fpc" coreTag="Variable">
    <mapped-to-entity entityName="Enterprise" propertyName="SAMPLING_FRACTION"/>
  </column>
  <column name="ent" coreTag="Variable">
    <mapped-to-entity entityName="Enterprise" propertyName="ENTERPRISE_FLAG"/>
  </column>
  <column name="emp_num" coreTag="Variable">
    <mapped-to-entity entityName="Enterprise" propertyName="EMPLOYEES_NUM"/>
  </column>
  <column name="area" coreTag="Variable">
    <mapped-to-entity entityName="Enterprise" propertyName="AREA"/>
  </column>
</csvFile>

It should be noted that the mapping transformations specified via the files Mapping 2DD.xml and MappingDD3.xml have the practical effect of dropping the variable va_imp2 (see the header of the SELECTION output) from the file sample*.txt.

We conclude by providing the header of the output file of ESTIMATION, which ends the demo process illustrated throughout this section. This file stores the estimates, and the related standard errors, of the totals of the requested variables of interest (ent and emp_num) within the requested estimation domains (area):

area | Total.emp_num | SE.Total.emp_num | Total.ent | SE.Total.ent
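To illustrate what the Total columns of such an output contain, the sketch below computes weighted totals by domain, i.e. the sum of weight times value over the sampled units of each area. This is the standard weighted estimator of a total and is shown here only as an assumption: the document does not state which estimator the ESTIMATION tool applies, and the standard errors are omitted because they depend on the sampling design. The function name and the three records are hypothetical.

```python
from collections import defaultdict


def weighted_totals(rows, domain, variables, weight="weight"):
    """Sum weight * value per domain for each requested variable.

    Hypothetical helper: it reproduces only the point estimates of the
    totals, not the standard errors.
    """
    totals = defaultdict(lambda: dict.fromkeys(variables, 0.0))
    for row in rows:
        for variable in variables:
            totals[row[domain]][variable] += row[weight] * row[variable]
    return dict(totals)


# Hypothetical sample records using the column names of the ESTIMATION input.
rows = [
    {"area": "North", "weight": 2.0, "ent": 1, "emp_num": 10},
    {"area": "North", "weight": 3.0, "ent": 1, "emp_num": 4},
    {"area": "South", "weight": 5.0, "ent": 1, "emp_num": 2},
]

totals = weighted_totals(rows, "area", ["emp_num", "ent"])
```

With these records, the North domain has an estimated total of 32 employees over 5 enterprises, and the South domain 10 employees over 5 enterprises.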

10 Appendix 1: CORE Data Model

<?xml version="1.0" encoding="iso-8859-1"?>
<xs:schema
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema"
    xmlns:jaxb="http://java.sun.com/xml/ns/jaxb">
  <xs:simpleType name="coreTagType">
    <xs:restriction base="xs:string">
      <xs:enumeration value="Figures" />
      <xs:enumeration value="TimeSeries" />
      <xs:enumeration value="Statistics" />
      <xs:enumeration value="Population" />
      <xs:enumeration value="Unit" />
      <xs:enumeration value="Variable" />
    </xs:restriction>
  </xs:simpleType>
  <xs:simpleType name="dataSetKindType">
    <xs:restriction base="xs:string">
      <xs:enumeration value="Micro" />
      <xs:enumeration value="Dimensional" />
      <xs:enumeration value="Classification" />
      <xs:enumeration value="Logging" />
      <xs:enumeration value="Other" />
    </xs:restriction>
  </xs:simpleType>
  <xs:attribute name="isIdtype">
    <xs:simpleType>
      <xs:restriction base="xs:string">
        <xs:enumeration value="TRUE" />
        <xs:enumeration value="FALSE" />
        <xs:enumeration value="true" />
        <xs:enumeration value="false" />
        <xs:enumeration value="True" />
        <xs:enumeration value="False" />
        <xs:enumeration value="T" />
        <xs:enumeration value="F" />
      </xs:restriction>
    </xs:simpleType>
  </xs:attribute>
  <xs:simpleType name="columnKindType">
    <xs:restriction base="xs:string">
      <xs:enumeration value="Variable" />
      <xs:enumeration value="Dimension" />
      <xs:enumeration value="Measure" />
      <xs:enumeration value="Level" />
      <xs:enumeration value="LoggingInfo" />
      <xs:enumeration value="Other" />
    </xs:restriction>
  </xs:simpleType>

  <xs:element name="dataSet">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="row" minOccurs="0" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="column" minOccurs="1" maxOccurs="unbounded">
                <xs:complexType>
                  <xs:attribute name="coreTag" type="coreTagType"></xs:attribute>
                  <xs:attribute name="entityName" type="xs:string" use="required"></xs:attribute>
                  <xs:attribute name="propertyName" type="xs:string" use="required"></xs:attribute>
                  <xs:attribute name="propertyValue" type="xs:string" use="required"></xs:attribute>
                  <xs:attribute name="columnKind" type="columnKindType"></xs:attribute>
                  <xs:attribute name="levelValue" type="xs:string"></xs:attribute>
                  <xs:attribute name="loggingInfoValue" type="xs:string"></xs:attribute>
                  <xs:attribute name="otherValue" type="xs:string"></xs:attribute>
                  <xs:attribute name="isId" type="xs:boolean"></xs:attribute>
                </xs:complexType>
              </xs:element>
            </xs:sequence>
            <xs:attribute name="rowId" type="xs:int" use="required"></xs:attribute>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
      <xs:attribute name="coreTag" type="coreTagType"></xs:attribute>
      <xs:attribute name="dataSetKind" type="dataSetKindType"></xs:attribute>
      <xs:attribute name="rowsCoreTag" type="coreTagType"></xs:attribute>
      <xs:attribute name="name" type="xs:string"></xs:attribute>
    </xs:complexType>
  </xs:element>
</xs:schema>
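As an illustration of the kind of instance this schema describes, the following Python sketch builds a dataSet element in which every row carries one column element per mapped field, each with the required entityName, propertyName and propertyValue attributes. The function name, the numbering of rowId from 1 and the two sample records are assumptions made for the example; the document does not specify how the CORE environment itself produces such instances.

```python
import xml.etree.ElementTree as ET


def build_data_set(name, core_tag, rows_core_tag, mapping, records):
    """Build a dataSet element from records and a column mapping.

    Hypothetical helper. `mapping` is {column name: (entityName, propertyName)}.
    """
    data_set = ET.Element(
        "dataSet",
        {"name": name, "coreTag": core_tag, "rowsCoreTag": rows_core_tag},
    )
    for row_id, record in enumerate(records, start=1):  # rowId from 1: an assumption
        row = ET.SubElement(data_set, "row", {"rowId": str(row_id)})
        for column_name, (entity, prop) in mapping.items():
            ET.SubElement(
                row,
                "column",
                {
                    "coreTag": "Variable",
                    "entityName": entity,
                    "propertyName": prop,
                    "propertyValue": str(record[column_name]),
                },
            )
    return data_set


mapping = {
    "STRATO": ("SamplePlan", "STRATIFICATION_VAR"),
    "CAMP": ("SamplePlan", "STRATUM_SAMPLE_SIZE"),
}
# Hypothetical ALLOCATION output records.
records = [{"STRATO": "A1", "CAMP": 25}, {"STRATO": "A2", "CAMP": 40}]

data_set = build_data_set("ALLOCATION", "Statistics", "Statistics", mapping, records)
xml_text = ET.tostring(data_set, encoding="unicode")
```

Note that the original column names (STRATO, CAMP) do not appear in the instance: each value is identified only by the entity and property of the Domain Descriptor.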

11 Appendix 2: CORE Domain Descriptor

<?xml version="1.0" encoding="iso-8859-1"?>
<xs:schema
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    elementFormDefault="qualified" attributeFormDefault="unqualified">
  <xs:attribute name="isId">
    <xs:simpleType>
      <xs:restriction base="xs:string">
        <xs:enumeration value="TRUE" />
        <xs:enumeration value="FALSE" />
        <xs:enumeration value="true" />
        <xs:enumeration value="false" />
        <xs:enumeration value="True" />
        <xs:enumeration value="False" />
        <xs:enumeration value="T" />
        <xs:enumeration value="F" />
      </xs:restriction>
    </xs:simpleType>
  </xs:attribute>
  <xs:element name="schema">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="entity" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="property" minOccurs="1" maxOccurs="unbounded">
                <xs:complexType>
                  <xs:attribute name="name" type="xs:string" use="required"></xs:attribute>
                  <xs:attribute ref="isId"></xs:attribute>
                </xs:complexType>
              </xs:element>
            </xs:sequence>
            <xs:attribute name="name" type="xs:string"></xs:attribute>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
      <xs:attribute name="name" type="xs:string"></xs:attribute>
    </xs:complexType>
  </xs:element>
</xs:schema>
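A Domain Descriptor instance following this schema is a list of entities, each with its properties. The Python sketch below parses such an instance and reports the mapping targets that do not correspond to any declared property. The descriptor content is reconstructed from the entity and property names used in the demo mappings and is not the actual descriptor of the demo; the descriptor name, the isId flag on IDENTIFIER and the helper names are likewise assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical descriptor, reconstructed from the names used in the demo mappings.
DESCRIPTOR_XML = """
<schema name="demo">
  <entity name="SamplePlan">
    <property name="STRATIFICATION_VAR"/>
    <property name="STRATUM_SAMPLE_SIZE"/>
  </entity>
  <entity name="Enterprise">
    <property name="IDENTIFIER" isId="true"/>
    <property name="STRATIFICATION_VAR"/>
    <property name="WEIGHT"/>
    <property name="SAMPLING_FRACTION"/>
    <property name="ENTERPRISE_FLAG"/>
    <property name="EMPLOYEES_NUM"/>
    <property name="AREA"/>
  </entity>
</schema>
"""


def declared_properties(descriptor_xml):
    """Return the set of (entity name, property name) pairs in the descriptor."""
    root = ET.fromstring(descriptor_xml)
    return {
        (entity.get("name"), prop.get("name"))
        for entity in root.findall("entity")
        for prop in entity.findall("property")
    }


def unknown_targets(mapping_targets, descriptor_xml):
    """List the mapping targets that the descriptor does not declare."""
    known = declared_properties(descriptor_xml)
    return [target for target in mapping_targets if target not in known]


# The second target is deliberately misspelled to show the check at work.
missing = unknown_targets(
    [("SamplePlan", "STRATIFICATION_VAR"), ("SamplePlan", "STRATUM_SIZE")],
    DESCRIPTOR_XML,
)
```

A check of this kind catches mapping files that refer to an entity or property name absent from the descriptor.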

12 Appendix 3: CORE Mapping

<?xml version="1.0" encoding="iso-8859-1"?>
<xs:schema
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    elementFormDefault="qualified" attributeFormDefault="unqualified">
  <xs:simpleType name="coreTagType">
    <xs:restriction base="xs:string">
      <xs:enumeration value="Figures" />
      <xs:enumeration value="TimeSeries" />
      <xs:enumeration value="Statistics" />
      <xs:enumeration value="Population" />
      <xs:enumeration value="Unit" />
      <xs:enumeration value="Variable" />
    </xs:restriction>
  </xs:simpleType>
  <xs:simpleType name="dataSetKindType">
    <xs:restriction base="xs:string">
      <xs:enumeration value="Micro" />
      <xs:enumeration value="Dimensional" />
      <xs:enumeration value="Classification" />
      <xs:enumeration value="Logging" />
      <xs:enumeration value="Other" />
    </xs:restriction>
  </xs:simpleType>
  <xs:simpleType name="columnKindType">
    <xs:restriction base="xs:string">
      <xs:enumeration value="Variable" />
      <xs:enumeration value="Dimension" />
      <xs:enumeration value="Measure" />
      <xs:enumeration value="Level" />
      <xs:enumeration value="LoggingInfo" />
      <xs:enumeration value="Other" />
    </xs:restriction>
  </xs:simpleType>

  <xs:element name="csvFile">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="column" minOccurs="1" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="mapped-to-entity" minOccurs="0" maxOccurs="1">
                <xs:complexType>
                  <xs:attribute name="entityName" type="xs:string" use="required"></xs:attribute>
                  <xs:attribute name="propertyName" type="xs:string" use="required"></xs:attribute>
                </xs:complexType>
              </xs:element>
            </xs:sequence>
            <xs:attribute name="coreTag" type="coreTagType"></xs:attribute>
            <xs:attribute name="columnKind" type="columnKindType"></xs:attribute>
            <xs:attribute name="levelValue" type="xs:string"></xs:attribute>
            <xs:attribute name="loggingInfoValue" type="xs:string"></xs:attribute>
            <xs:attribute name="otherValue" type="xs:string"></xs:attribute>
            <xs:attribute name="name" type="xs:string" use="required"></xs:attribute>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
      <xs:attribute name="coreTag" type="coreTagType"></xs:attribute>
      <xs:attribute name="rowsCoreTag" type="coreTagType"></xs:attribute>
      <xs:attribute name="dataSetKind" type="dataSetKindType"></xs:attribute>
      <xs:attribute name="name" type="xs:string"></xs:attribute>
    </xs:complexType>
  </xs:element>
</xs:schema>
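The main constraints expressed by this schema can be checked on a mapping file with a few lines of Python. The sketch below is a lightweight illustration using only the standard library and is not a substitute for validation against the schema itself: it covers the csvFile root, the presence of at least one column, the required name attribute, the core tag enumeration, and the optional mapped-to-entity element with its two required attributes, while the dataSetKind and columnKind enumerations are left out for brevity. The function name and both sample files are hypothetical.

```python
import xml.etree.ElementTree as ET

CORE_TAGS = {"Figures", "TimeSeries", "Statistics", "Population", "Unit", "Variable"}


def check_mapping(xml_text):
    """Return a list of problems; an empty list means all checks passed."""
    root = ET.fromstring(xml_text)
    if root.tag != "csvFile":
        return ["root element must be csvFile, found " + root.tag]
    problems = []
    for attribute in ("coreTag", "rowsCoreTag"):
        value = root.get(attribute)
        if value is not None and value not in CORE_TAGS:
            problems.append(f"csvFile/@{attribute}: unknown core tag {value!r}")
    columns = root.findall("column")
    if not columns:
        problems.append("csvFile must contain at least one column")
    for position, column in enumerate(columns, start=1):
        label = column.get("name") or f"column #{position}"
        if column.get("name") is None:
            problems.append(f"{label}: attribute name is required")
        tag = column.get("coreTag")
        if tag is not None and tag not in CORE_TAGS:
            problems.append(f"{label}: unknown core tag {tag!r}")
        entities = column.findall("mapped-to-entity")
        if len(entities) > 1:
            problems.append(f"{label}: at most one mapped-to-entity is allowed")
        for entity in entities:
            for required in ("entityName", "propertyName"):
                if entity.get(required) is None:
                    problems.append(f"{label}: mapped-to-entity needs {required}")
    return problems


GOOD = """<csvFile name="SAMPLE_SIZE" coreTag="Statistics" rowsCoreTag="Statistics">
  <column name="STRATUM" coreTag="Variable">
    <mapped-to-entity entityName="SamplePlan" propertyName="STRATIFICATION_VAR"/>
  </column>
</csvFile>"""

# Hypothetical faulty file: bad core tag, unnamed column, missing propertyName.
BAD = """<csvFile name="SAMPLE_SIZE" coreTag="Stats">
  <column coreTag="Variable">
    <mapped-to-entity entityName="SamplePlan"/>
  </column>
</csvFile>"""

good_problems = check_mapping(GOOD)
bad_problems = check_mapping(BAD)
```

For a complete check, the mapping file should be validated against the schema with an XML Schema validator.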