Grant Agreement No. 777522
CONN-WP2-D2.1 Page 1 of 38 31/10/2018
CONNECTIVE
D2.1- BIG DATA ARCHITECTURE
C-REL
Due date of deliverable: 30/09/2018
Actual submission date: 31/10/2018
Leader of this Deliverable: INDRA
Document status
Revision Date Description
v0 05/09/2018 Table of Contents
v1 22/10/2018 Consolidated version
v2 30/10/2018 Consolidated version by all partners
v3 30/10/2018 Final consolidated version by all partners
Project funded by the S2R Joint Undertaking under Horizon 2020 research and
innovation programme
Dissemination Level
PU Public
CO Confidential, restricted under conditions set out in Model Grant Agreement X
CI Classified, information as referred to in Commission Decision 2001/844/EC
Reviewed Y
Start date of project: 01/09/2017 Duration: 58 months
This project has received funding from the European Union’s Horizon 2020 research and
innovation programme under grant agreement No 777522.
REPORT CONTRIBUTORS
Name Company Details of Contribution
INDRA (Indra Sistemas SA): Edition of the document.
THALES (Thales Communications & Security SAS): Edition of the document.
NETWORK RAIL (Network Rail Infrastructure Limited): Review of the document.
ANSALDO STS (Ansaldo STS S.p.A.): Edition of the document.
Grant Agreement No. 777522
CONN-WP2-D2.1 Page 3 of 38 31/10/2018
EXECUTIVE SUMMARY
The “Big Data Architecture Document” reports on the activities performed for the C-REL in task 2.2.
This deliverable covers the Big Data architecture choices and implementation, benchmark results and
best development practices.
The deliverable will focus on the following points:
Analysis of the outputs of the IT2Rail architecture
Benchmarks of new Big Data technologies, which will allow adapting the existing architecture
and/or proposing new Big Data architectures.
The deliverable will have different versions (releases) to reflect the progress of the task.
Specifically, “D2.1 Big Data Architecture” illustrates the architecture as it was built in IT2Rail,
describing the limitations encountered and its advantages. These two points shaped the
architecture decisions and the best practices that will be reused within CONNECTIVE.
In addition, the document shows the software used by each partner involved in the IT2Rail project.
It also describes the Sofia2 IoT (Internet of Things) middleware tool. Sofia2 was used during that
stage of the project, but for the C-REL in CONNECTIVE it was decided not to use it, in order to
evaluate other software dedicated to the tasks of each layer.
A preliminary model of the BA system is illustrated with the Capella modelling tool. It first shows
the diagrams for the system analysis, with missions, capabilities and interactions of the system
with external actors, and then the logical architecture with the description of the components.
A second section illustrates the choice of a common architecture across the CONNECTIVE
partners. This choice was made to align the developments and to provide a common foundation
on which to build the S2R-IP4 Business Analytics. The architecture layers agreed
among partners are:
Data Source Layer;
Data Staging Layer;
Data Storage Layer;
Data Analytic Layer;
Data Presentation Layer.
For each layer, several software products are used, depending on the use cases taken into
consideration and the data to be analysed. The document describes:
Hortonworks Data Platform;
Apache Spark;
Talend;
Pivotal Greenplum;
Elasticsearch;
Apache Superset
Etc.
The last section presents benchmark results comparing the chosen software, estimating their
performance and making evident why they were selected.
TABLE OF CONTENTS
REPORT CONTRIBUTORS ................................................................................................................ 2
EXECUTIVE SUMMARY ..................................................................................................................... 3
TABLE OF CONTENTS ....................................................................................................................... 5
LIST OF FIGURES .............................................................................................................................. 6
LIST OF TABLES ................................................................................................................................ 7
1. INTRODUCTION .......................................................................................................................... 8
2. ONTOLOGIES AND SPECIFICATIONS ....................................................................................... 9
2.1 INTRODUCTION ................................................................................................................... 9
2.2 SPECIFICATIONS ................................................................................................................ 9
2.2.1 SYSTEM ANALYSIS ...................................................................................................... 9
2.2.2 LOGICAL ARCHITECTURE ......................................................................................... 10
2.3 DATA STANDARDIZATION AND ONTOLOGIES ............................................................... 15
2.3.1 INTRODUCTION .......................................................................................................... 15
2.3.2 DATASET DESCRIPTION ONTOLOGY ...................................................................... 15
2.3.3 DATA DESCRIPTION ONTOLOGY ............................................................................. 16
3. IT2RAIL BA ARCHITECTURE .................................................................................................... 19
3.1 LESSONS LEARNT FOR CONNECTIVE ............................................................................ 22
4. CONNECTIVE ARCHITECTURE ............................................................................................... 24
4.1 A BIG DATA ARCHITECTURE BASED ON HORTONWORKS DATA PLATFORM
(ANS+THA) .................................................................................................................................... 25
4.2 MAIN COMPONENTS ......................................................................................................... 26
4.2.1 STRENGTHS AND BENEFITS .................................................................................... 28
4.3 IMPLEMENTATION OF HORTONWORKS BASED ON LAMBDA ARCHITECTURE .......... 28
4.4 IMPLEMENTATION BASED ON LAMBDA ARCHITECTURE (IND) .................................... 29
5. BENCHMARKS OF NEW BIG DATA ARCHITECTURES .......................................................... 33
5.1 INTRODUCTION ................................................................................................................. 33
5.2 BENCHMARK SCOPE ........................................................................................................ 33
5.3 MAP-D: AN EXAMPLE OF POWERFUL SQL WITH GPU ARCHITECTURE ...................... 34
5.4 MAP-D: FIRST IMPLEMENTATION .................................................................................... 35
5.5 NEXT STEPS ...................................................................................................................... 37
6. CONCLUSIONS ......................................................................................................................... 38
LIST OF FIGURES
Figure 1: Business Analytics Missions and Capabilities ..................................................................... 10
Figure 2: Hierarchy of Business Analytics components ...................................................................... 11
Figure 3: Business Analytics components .......................................................................................... 12
Figure 4: Sequence diagram for pre-processing ................................................................................ 13
Figure 5: Sequence diagram for Machine learning ............................................................................. 14
Figure 6: Sequence diagram for KPIs ................................................................................................ 15
Figure 7: Dataset description ............................................................................................................. 16
Figure 8: IT2Rail Architecture Layers ................................................................................................. 20
Figure 9: Sofia 2 BA Architecture ....................................................................................................... 22
Figure 10: S2R-IP4 Business Analytic Architecture Layers ................................................................ 24
Figure 4-2: Hortonworks Data Platform (from: https://adtmag.com/articles/2016/06/28/hdp-2-5.aspx)26
Figure 13: Magic Quadrant by Gartner for Data Integration Tool 2018 ............................................... 30
Figure 14: Grafana Dashboard regarding ATVM devices alarms ....................................................... 31
Figure 15: Grafana Dashboard regarding change of state from Normal to Out Of Service and Technical
alarm ATVM ...................................................................................................................................... 32
Figure 16: Apache Superset Dashboard ............................................................................................ 32
Figure 17: MapD dashboard on rail transport network in UK (50,000 rail segment data over 10 years)
.......................................................................................................................................................... 36
Figure 18: MapD dashboard on maritime transport data (1.5 billion data points) ........................... 37
LIST OF TABLES
Table 1: Reference Documents ..........................................................................................................
Table 2: List of Acronyms ..................................................................................................................
Table 3: IT2Rail Architecture Technology .......................................................................................... 21
1. INTRODUCTION
The present deliverable is the first document explaining the architecture regarding Business Analytics
(BA) within the second work package (WP2) of the CONNECTIVE project.
CONNECTIVE follows the lighthouse pilot IT2Rail, where BA work initially started within the
Shift2Rail Innovation Programme 4 (S2R-IP4).
S2R-IP4 is the first rail joint technology initiative focused on accelerating the integration of new and
advanced technologies into innovative rail product solutions. CONNECTIVE aims to be the technical
backbone of S2R´s Innovation Programme 4 (IP4), which addresses the provision of “IT solutions for
attractive Railway services”.
An advance of the CONNECTIVE project with respect to IT2Rail will be the analysis of real data coming
from different sources, whether open or not. This is important to give external sources, like
transportation operators, a better vision of the evolution of European mobility. The consequence
is a modification of the services offered to the traveller, making the travel experience more
attractive.
The analysis of these data sources will involve advanced algorithms applied to business analytics. At
the end of the CONNECTIVE project life cycle, the expected outcome will be the answers to the
three main questions that BA needs to answer:
Descriptive Analytics: It provides insight into the past (What has happened?)
Predictive Analytics: It anticipates the future (What could happen?)
Prescriptive Analytics: It advises on possible outcomes (What should we do?)
Given the above, BA within S2R-IP4 aims to provide the insights required for making better
business decisions and strategic moves.
This document is the first stage towards delivering such a result. At this stage, it defines the
common base architecture and the software of which it is composed. Such an evaluation is
needed to have a solid BA foundation.
2. ONTOLOGIES AND SPECIFICATIONS
2.1 INTRODUCTION
For the core release, initial specifications have been performed to model Business Analytics inside
S2R-IP4 Ecosystem. For this purpose, Capella Model has been adopted in all Shift2Rail IP4 projects
as a solution for model-based systems engineering. It provides a process and tooling for graphical
modelling of systems, hardware or software architecture.
The Capella modelling process is iterative and allows describing the system in different steps:
System analysis with the description of missions and capabilities and interactions of the system
with external actors;
Logical architecture with the description of components and functional scenarios.
2.2 SPECIFICATIONS
2.2.1 SYSTEM ANALYSIS
Mission and capabilities:
This description aims to define the actors and how they interact with Business Analytics. The following
actors have been identified:
Traveller (via the Travel Companion application);
Transport Service Provider (TSP);
Business analyst.
Interactions between Business Analytics and the actors allow defining four main capabilities (see
Figure 1 below):
Understand & synthesise data: Describe the data and extract information from it;
Decision support: Propose optimisation from the current data and help to understand the
impact of a change;
Prediction: Build learning algorithms to predict outcome in a given situation;
Visualise & explore: Present the data and the result of the other capabilities.
Figure 1: Business Analytics Missions and Capabilities
2.2.2 LOGICAL ARCHITECTURE
Once actors, missions and capabilities defining interactions between the actors and the Business
Analytics system are defined, the next step is to define the components of the system.
These components include:
BA Portal: This component is the point of entry for the business analyst. It is the tool used to
enter commands and launch the algorithms and visualisations.
Data Management Engine: This component has all functions dedicated to the pre-processing
of the data and its storage.
ETL: This component is responsible for all functions for dataset manipulation: loading,
filtering, fusion, storing etc…
Data generator: This component is responsible for all functions dedicated to creating new
dataset, mainly from simulators and data generation algorithms.
Anonymiser: This component is responsible for all functions dedicated to anonymization
of personal data.
Analytics Engine: This component has all functions which create new information to a dataset,
extracting it from the current variables.
Predictive Engine: This component is responsible for all functions dedicated to creating
models from a dataset (learning) and applying it to another one (predict).
Prescriptive Engine: This component is responsible for all functions dedicated to decision
support: optimisation algorithms, what-if analysis.
Descriptive Engine: This component is responsible for all functions dedicated to
describing the dataset: KPIs, profiling.
Visualisation Engine: This Component has all functions related to displaying the results of the
analysis.
Dashboard Engine: State-of-the-art visualisations with dashboards.
Virtual Reality Engine: 3D engine that creates 3D views of the data in a virtual environment.
These components are described in Figure 2 and Figure 3 below:
Figure 2: Hierarchy of Business Analytics components
Figure 3: Business Analytics components
The exchange scenarios describe the basic way components interact to perform Business
Analytics work. It is impossible to create a fully realistic exchange scenario of an analysis, as the
workflow is always different: the analyst does not know what to do after each step until seeing the
result of the current one, deciding whether it is good enough, whether it requires different settings
or algorithms, or whether everything must be redone because something is not right. The scenarios
presented here are generic, describing the basic workflow an analyst would follow if everything
went perfectly on an imaginary dataset.
Three scenarios are presented:
Pre-processing scenario;
Machine learning algorithm scenario;
KPI scenario.
Pre-processing scenario
This scenario presents the first steps the analyst performs with a dataset: load the data into the
tool, clean it, fuse it with other data, anonymise it, and then store the result for future use. The
sequence diagram associated with this scenario is presented in Figure 4 below.
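As an illustration, the steps of this scenario can be sketched in Python with pandas. The column names, sample data and hash-based anonymisation rule below are invented for the example; they are not part of any IT2Rail or CONNECTIVE specification.

```python
import hashlib

import pandas as pd


def anonymise(df, id_column):
    """Replace a personal identifier with an irreversible hash (example rule)."""
    out = df.copy()
    out[id_column] = out[id_column].astype(str).map(
        lambda v: hashlib.sha256(v.encode()).hexdigest()[:16])
    return out


def preprocess(transactions, stations):
    """Load -> clean -> fuse -> anonymise, as in the scenario."""
    clean = transactions.dropna(subset=["userID", "station"])   # cleaning
    fused = clean.merge(stations, on="station", how="left")     # fusion
    return anonymise(fused, "userID")                           # anonymisation


transactions = pd.DataFrame({"userID": ["u1", "u2", None],
                             "station": ["Madrid", "Paris", "Rome"]})
stations = pd.DataFrame({"station": ["Madrid", "Paris"],
                         "lat": [40.42, 48.86]})
result = preprocess(transactions, stations)   # ready to store for future use
```

The result keeps the fused columns but no raw traveller identifiers, so it can be stored for later analyses.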
Figure 4: Sequence diagram for pre-processing
Machine learning algorithm scenario
This scenario is a simple way to test a machine learning algorithm. It loads a pre-processed dataset,
learns a model on one sub-part of the dataset, and then tests it on the other part. The results are
displayed in a visualisation for evaluation. The sequence diagram associated with this scenario is
presented in Figure 5 below.
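The learn-on-one-part, test-on-the-other workflow can be sketched with scikit-learn; the dataset here is synthetic and the choice of logistic regression is only an illustrative assumption, not a CONNECTIVE decision.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic "pre-processed dataset": two features and a binary label
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Learn a model on a sub-part of the dataset...
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
model = LogisticRegression().fit(X_train, y_train)

# ...then test it on the other part; the score would feed a visualisation
accuracy = accuracy_score(y_test, model.predict(X_test))
```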
Figure 5: Sequence diagram for Machine learning
KPI scenario
This scenario shows how to load the pre-processed data, compute the KPIs and show them with the
data on a dashboard. The sequence diagram associated with this scenario is presented in Figure 6
below.
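The KPI computation itself can be sketched with pandas; the trip table and the two KPIs (trip count and mean delay per station) are invented for the example.

```python
import pandas as pd

# Pre-processed data loaded from the storage layer (sample values)
trips = pd.DataFrame({"station": ["Madrid", "Madrid", "Paris", "Paris", "Paris"],
                      "delay_min": [2, 8, 0, 5, 10]})

# KPIs per station, in a shape a dashboard component can consume
kpis = trips.groupby("station").agg(
    trip_count=("delay_min", "size"),
    mean_delay=("delay_min", "mean")).reset_index()
```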
Figure 6: Sequence diagram for KPIs
2.3 DATA STANDARDIZATION AND ONTOLOGIES
2.3.1 INTRODUCTION
Before building an ontology for Business Analytics, a first step will be the construction of a
Shift2Rail-IP4 standard for Business Analytics dataset description and data description. The goal
is to share data easily between partners.
This standard may evolve during the project towards a real ontology. An ontology is a representation
of concepts and their relations, using a specific language. This ontology will be used by the
specifications (in particular, interface specifications) for the harmonization of concepts across the
transportation modes.
2.3.2 DATASET DESCRIPTION ONTOLOGY
The data can be stored in many databases and many formats. In the end, a dataset is a matrix:
one observation per line, with a value for each variable (column). As there are many more lines
than variables, it is the lines that are stored in the database. Since some storage formats (such
as CSV files) do not keep enough information about the variables, a “variable description”
ontology will be created, which can be passed along with the lines to propagate this information.
In the end, the dataset is composed of an array of lines and a set of “description headers”, one
per variable.
The dataset description is described in Figure 7 below.
Figure 7: Dataset description
2.3.3 DATA DESCRIPTION ONTOLOGY
The data description will use the JSON format. A first proposal is made for the core release and
will be refined during the project lifetime.
The structure contains:
Root node: array of objects; each object is a variable description.
Variable object description common fields (default value if not present):
o name: string (“”)
o idx: int (0)
o type: string {int, float, string}
o nullable: bool (true)
o null_values: array (empty array)
o unit: string {iso format for measures, "date:DDyymmm" for string-based date, latitude,
longitude, …} (“”)
o is_id: bool (false)
o is_useful: bool (true)
o learning: string {learn, oracle} (null)
o time: string {instant, period_begin, period_end} (null)
o position: string {latitude, longitude, geojson, relative, latitude_start, longitude_start,
relative_start, latitude_end, longitude_end, relative_end} (null)
o relation: string (“”)
o update_interval: int (0)
Variable object description fields for int/float:
o min: float (-oo)
o max: float (+oo)
o scale: {linear, log, exp} (linear)
Variable object description fields for string:
o values: array of strings (empty array)
Explanation of each field:
o name: the label of the variable.
o idx: the index of this variable if the variables are ordered in the dataset.
o type: the type of the field, for parsing it if needed. Int is an arbitrarily large integer
type (long in most languages), float is an arbitrarily large float type (double in most
languages).
o nullable: whether this field can be null / absent / have no value.
o null_values: array which contains the null values. If these values are strings and the
type is not string, the comparison has to be done before parsing.
o unit: the name of the variable’s unit of measure. Standard measure should follow the
iso format for parsing convenience (no ‘°c’ but ‘°C’). For the string-based dates, the
org.joda.time.format.DateTimeFormat is used, with the string ‘date:’ before. For a
timestamp-based date-time, the iso unit: ‘s’ or ‘ms’ is used.
o is_id: true if the field is used as an id for the row, so it should not be used for an
analysis.
o is_useful: false if the field should be ignored (such as a description field, or a field that
should be filtered out).
o learning: ‘learn’ if the field has been identified as a candidate for learning: data that is
easily available and can be used as input. ‘predict’ if it is data we want to predict, as it
is difficult or impossible to obtain this value in the general case.
o time: distinguishes between an instant and a field used to identify a boundary
of a period.
o position: if the variable is used to position the row, how to interpret it, like the ‘time’
field but for a 1D or 2D position instead of a time position.
o relation: what this variable describes. Example: if it is time: instant, the field may
have ‘row’ if it describes the time of the row. If one variable has position: relative_start
and relation: width, and another variable has position: relative_end and relation: width,
the two variables describe the interval of the ‘width’ (whatever it is). The content of this
field is a descriptive string, for displaying and linking intervals together. ‘row’ has a
special meaning, referring to the row/observation where the variable is stored, and so
can be used to place the observation on a timeline or on a map automatically in a
visualisation.
o update_interval: how frequently this variable is updated in the dataset. Example: 3600
for a temperature variable that is updated every hour, even if there is a row for every
check-in. 0 means “real-time”.
o min: for a numeric variable, the minimum value (inclusive) that this field can take.
o max: for a numeric variable, the maximum value (exclusive if it is an int) that this field
can take.
o scale: for a numeric variable, how the value should be understood. Example: the
decibel is on a log scale. A “number of views” variable can be tagged as exp, because
the visualisation should understand that it may be more useful to display it on a log
scale. It can also be used by an analytic algorithm, as some fields may be more
informative on an exp/log scale.
o values: for a string variable, the possible values it can have. If not present or empty,
there are no limits (it is not an enumeration).
An example of this ontology is given in the following JSON:
[{"name": "transacID", "idx": 0, "type": "int", "is_id": true, "min": 0, "nullable": false},
 {"name": "userID", "idx": 1, "type": "int", "learning": "learn", "min": 0, "nullable": false},
 {"name": "station", "idx": 2, "type": "string", "learning": "learn", "null_values": ["ERROR", ""]},
 {"name": "time_transac", "idx": 3, "type": "int", "learning": "learn", "unit": "s", "time": "instant",
  "relation": "row", "null_values": [0, "ERROR"]}]
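A consumer of this standard has to apply the default values listed above for the fields that are absent. A minimal Python sketch of such a reader (the helper name is ours, not part of the standard):

```python
import json

# Defaults for the common fields, as specified above
DEFAULTS = {"name": "", "idx": 0, "nullable": True, "null_values": [],
            "unit": "", "is_id": False, "is_useful": True, "learning": None,
            "time": None, "position": None, "relation": "",
            "update_interval": 0}


def load_descriptions(text):
    """Parse the description JSON, filling defaults for absent fields."""
    return [{**DEFAULTS, **var} for var in json.loads(text)]


doc = '''[{"name": "transacID", "idx": 0, "type": "int",
           "is_id": true, "min": 0, "nullable": false},
          {"name": "station", "idx": 2, "type": "string",
           "learning": "learn", "null_values": ["ERROR", ""]}]'''
headers = load_descriptions(doc)
```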
3. IT2RAIL BA ARCHITECTURE
The IT2Rail architecture for Business Analytics was designed as a distributed infrastructure
environment composed of different layers. Several partners set up their own architecture and
environment, but all of them shared the same approach in terms of the layers, which were
identified by the involved partners and are summarized in the following points:
Data Management: It is the component in charge of defining a set of tools to collect and
integrate data from internal, external and internet data sources. Its main functionality is to
provide the correct way for storing the information.
Big Data Storage: It collects and integrates information into a repository that provides
quality and timeliness to the business analytics process.
Information Management and Analysis: It is the component in charge of computing the
Business Analytics based on the data collected and stored in the repositories. This
component includes processes to move data from multiple sources, reformat and clean
them, and load them into another database, data mart or data warehouse to analyse and
support a business process.
Presentation: It is the component representing the graphical interface that will be used by
operators to visualize the analytics of the IT2Rail Platform. This component offers the ability
to visualize a unified representation of the information related to the indicators and KPIs.
Business Analytics Services: It is the component exporting KPIs to other IT2Rail
modules that need information stored in the analytics repositories.
This component publishes all Business Analytics services in a standard way so that they
can be consumed by other components, leveraging the facilities offered by the IT2Rail
Interoperability Framework.
Moreover, in order to shield final users (both travellers through their TC, and travel experts
through dedicated interfaces) from the existence of the different Business Analytics environments
deployed in IT2Rail, each environment offered its results (KPIs, graphics) through a web service,
allowing them to be presented in an integrated way through a single interface.
One of the biggest limitations faced during the development of IT2Rail was the lack of big volumes
of real data available to test the system. Therefore, the scalability and stability of the system and
its capacity to work with big data could not be guaranteed.
The integration into a single presentation interface also entailed several challenges, as there were
no previous agreements on the technology to be used, which complicated the integration of services
(e.g. whether a web application accepts gadgets that are not responsive, and vice versa).
Figure 8 shows the IT2Rail architecture used by the involved partners. As explained, each of them
had its own infrastructure and presentation layer, but the information could also be presented in an
integrated way in a single web application provided by Leonardo (one for the TSPs and another for
the traveller through the TC).
Figure 8: IT2Rail Architecture Layers
Table 1 gathers different technologies and tools used by IT2Rail partners for the different layers
identified.
Leonardo
• Presentation Layer: EXT JS v6; OpenWeatherMap API v2.5; Tomcat v8.0
• Information Management and Analysis Layer: Pentaho v6.1; Java JDK v1.8; MySQL v5.6; Tomcat v8.0; Python v3.6
• Data Storage: Java JDK v1.8; Tomcat v8.0; MongoDB v3.1; MySQL v5.6
• Data Collection: built with the third-party OpenWeatherMap v2.5 API, which requires an OpenWeatherMap key; required software: Java JDK v1.8; MongoDB 3.1
• Data Retrieval: Java JDK v1.8; Tomcat v8.0; MongoDB v3.1; MySQL v5.6
Indra
• Presentation Layer: Sofia2
• Information Management and Analysis Layer: Java V 1.7.0_67; MongoDB V 3.0.15
• Data Storage: MongoDB V 3.0.15; Apache Tomcat V 7.4.54; MySQL V 5.5
• Data Collection: Sofia2
• Data Retrieval: Java V 1.7.0_67; MongoDB V 3.0.15; Apache Tomcat V 7.4.54; MySQL V 5.5
UPC
• Information Management and Analysis Layer: Java 1.6 or higher; Docker 17.0.3 or higher
• Data Storage: MongoDB 3.0 or higher; MySQL 5.5.2 or higher; Sparksee 5.2.3 or higher
• Data Collection: MongoDB 3.0 or higher; Twitter API streaming service
• Data Retrieval: Sparksee 5.2.3 or higher; MongoDB 3.0 or higher
POLIMI
• Java Runtime Environment (at least version 1.6) and PostgreSQL (at least version 8)
CEA
• Information Management and Analysis Layer: R (v3.4.3); Python (v2.7.12); Node.js (v9.4.0); MongoDB (v3.4.10)
• Data Storage: R (v3.4.3); Python (v2.7.12); Node.js (v9.4.0); MongoDB (v3.4.10)
Table 1: IT2Rail Architecture Technology
Among them, the only partner also involved in CONNECTIVE is Indra. During IT2Rail, Indra relied
mainly on its Sofia2 platform, which offers capabilities for data collection and visualisation
and supports different technologies such as Java or MongoDB. In addition, Sofia2 includes an
analytics layer, giving the developer the advantage of concentrating in a single platform
all the layers needed for extraction, data cleaning, storage, and generation and visualisation of the
defined KPIs. Within CONNECTIVE, Indra will also analyse other technologies that could be
applicable to the different layers.
Figure 9: Sofia 2 BA Architecture
3.1 LESSONS LEARNT FOR CONNECTIVE
CONNECTIVE proposes to follow a similar distributed approach, with each of the main partners
developing a different implementation for each layer. Likewise, the partners plan to align the different
layers of the architecture, in order to have a common approach for a general architecture.
This approach is also followed in other Big Data R&D projects, such as Transforming Transport, in
which Indra, Thales and Network Rail also participate, which provides experience that can be
applied to CONNECTIVE. That project has 13 pilots, and almost all of them, carried out by different
partners, use their own environments instead of a common one. The advantage of this approach is
that it allows different partners to work in parallel, and it also avoids problems with sharing the data
of one entity with the rest of the consortium.
CONNECTIVE plans to go further: following the IT2Rail experience, it will provide users with a
single access point for all results, independently of which partner's platform performs the
analysis. This approach also has the advantage of allowing future entities to offer their BA
services/results for integration into the ecosystem, making the solution scalable and not tied to a
specific technology or provider. Moreover, it will allow testing and comparing different technologies
during the project, which can help identify the most robust and recommended solutions for a global
transport scenario such as the one targeted in IP4.
Moreover, the CONNECTIVE interface for travel experts will be adapted to each travel expert using
the system. A web portal will be provided by CONNECTIVE to allow each travel expert to have its
own access (with user and password) to join the ecosystem, configure business rules, or visualise
BA results. In the same way, travellers will access the different BA results in a unified way,
regardless of the partner providing the analysis.
In order to succeed in this integration, all participants will work aligned and in close collaboration
during the lifetime of the project.
4. CONNECTIVE ARCHITECTURE
Business Analytics within the CONNECTIVE project will have a distributed architecture that differs
depending on the areas of interest that BA takes into consideration. This distributed environment
is necessary for the different use cases on which each of the actors involved will focus during
development. The development of the use cases will not compromise a coherent global vision of the
architecture.
The layers that compose the CONNECTIVE architecture are common to all the infrastructures
involved, even if the software used may differ. The layers are the following:
Data Source Layer
Staging Area Layer
Data Storage Layer
Data Analytic Layer
Presentation Layer
Figure 10: S2R-IP4 Business Analytic Architecture Layers
Data Source Layer:
This first layer represents the different data sources that feed the Business Analytic module. It can
be fed by heterogeneous types of data. The data sources can be of any format: plain text files,
relational databases, non-relational databases, Excel files, etc.
Data Staging Layer:
This layer focuses on three main processes: extraction, transformation and loading. Extraction is the
process of identifying and collecting relevant data from different sources. The extraction process is
needed to select data that are significant in supporting organizational decision making. The extracted
data are then sent to a temporary storage area called the data staging area prior to the transformation
and cleansing process. This is done to avoid having to extract the data again should any problem
occur. After that, the data will go through the transformation and cleansing process.
Transformation is the process of converting data using a set of business rules (such as aggregation
functions) into consistent formats for reporting and analysis.
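As a minimal illustration of this extract-stage-transform flow, here is a pure-Python sketch. All source records, field names and cleansing rules below are invented for illustration; they are not taken from the actual CONNECTIVE data sources.

```python
# Minimal ETL sketch: extract -> staging area -> transform (cleanse + aggregate).
# All data and field names here are illustrative, not real CONNECTIVE data.

def extract(source_rows):
    """Select only the records relevant for decision making."""
    return [r for r in source_rows if r.get("station") is not None]

def transform(staged_rows):
    """Cleanse values and aggregate into a consistent reporting format."""
    totals = {}
    for r in staged_rows:
        station = r["station"].strip().upper()   # cleansing: normalise names
        totals[station] = totals.get(station, 0) + int(r["passengers"])
    return totals

source = [
    {"station": " madrid ", "passengers": "120"},
    {"station": "MADRID", "passengers": "80"},
    {"station": None, "passengers": "5"},        # irrelevant record, dropped
]

staging_area = extract(source)   # kept on disk so extraction need not be redone
report = transform(staging_area)
print(report)                    # {'MADRID': 200}
```

Keeping the extracted rows in the staging area before transforming them is what avoids a second round-trip to the source systems if the transformation fails.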
Data Storage Layer:
This is where the transformed and cleansed data sit. Based on scope and functionality, three types
of entities can be found here: data warehouse, data mart, and operational data store (ODS). In any
given system, you may have just one of the three, two of the three, or all three types.
Data Analytic Layer:
This component analyses the results of the data treated in the historic (or batch) flow. This layer
provides outputs based on the enrichment process and supports the presentation layer by reducing
query response latency.
Data Presentation Layer:
This layer refers to the information that reaches the users. This can take the form of tabular or
graphical reports delivered through a web application (this is how it was done in IT2Rail).
4.1 A BIG DATA ARCHITECTURE BASED ON HORTONWORKS DATA PLATFORM
(THALES+ANSALDO)
One possible solution for a Big Data Architecture is a solution based on Hortonworks Data Platform.
The Hortonworks Data Platform (HDP) is an open source framework for distributed storage and
processing of large, multi-source data sets, based on the Apache Hadoop framework.
Apache Hadoop is a collection of open-source software utilities that facilitate using a network of many
computers to solve problems involving massive amounts of data and computation. The core of
Apache Hadoop consists of a storage part, known as Hadoop Distributed File System (HDFS), and
a processing part which is a MapReduce programming model.
The figure below gives an overview of all the components included in HDP, grouped by their
macro-functionality.
Figure 11: Hortonworks Data Platform (from: https://adtmag.com/articles/2016/06/28/hdp-2-5.aspx)
Main components and their applications are described in the following section.
4.2 MAIN COMPONENTS
- Hadoop Distributed File System (HDFS)
HDFS is an open source distributed file system, designed to run on commodity hardware, which
makes up the primary storage system of the Hadoop ecosystem. HDFS is highly fault tolerant and is
designed to be deployed on low cost hardware, provides high throughput access to application data
and is suitable for applications that have large datasets.
HDFS is a specialized streaming file system that is optimized for reading and writing of large files.
When writing to HDFS, data are “sliced” and replicated across the servers in a Hadoop cluster. The
slicing process creates many small sub-units (blocks) of the larger file and transparently writes them
to the cluster nodes. The various slices can be processed in parallel (at the same time) enabling
faster computation. The user does not see the file slices but interacts with whole files in HDFS like a
normal file system (i.e., files can be moved, copied, deleted, etc.). When transferring files out of
HDFS, the slices are assembled and written as one file on the host file system.
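The slicing and replication behaviour described above can be sketched in a few lines (a didactic model only: block size, replication factor and node names are toy values; real HDFS defaults are 128 MB blocks and 3 replicas, with rack-aware placement):

```python
# Sketch of HDFS-style writing: slice a file into fixed-size blocks and
# replicate each block on several nodes. Sizes are toy values for illustration.
BLOCK_SIZE = 4          # bytes per block (HDFS default is 128 MB)
REPLICATION = 3         # copies of each block (HDFS default)
NODES = ["node1", "node2", "node3", "node4"]

def put(data: bytes):
    """Slice `data` into blocks and assign each block to REPLICATION nodes."""
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    placement = {}
    for idx, _block in enumerate(blocks):
        # round-robin placement; real HDFS placement is rack-aware
        placement[idx] = [NODES[(idx + r) % len(NODES)] for r in range(REPLICATION)]
    return blocks, placement

def get(blocks):
    """Reassemble the slices into the original file, as on read-out of HDFS."""
    return b"".join(blocks)

blocks, placement = put(b"hello hdfs world")
print(len(blocks), placement[0])   # 4 blocks, each on 3 nodes
```

The user-facing `put`/`get` pair is the point: the caller sees whole files, while slicing and replication stay transparent.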
- Apache Hadoop Map Reduce
MapReduce is a programming model and an associated implementation (Apache Hadoop Map
Reduce) for processing and generating big data sets with a parallel, distributed algorithm on a cluster.
A MapReduce program is composed of a map procedure (or method), which performs filtering and
sorting, and a reduce method, which performs a summary operation.
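The classic word-count example illustrates this map/reduce split. The sketch below is plain Python standing in for the Hadoop framework: the map phase emits (key, 1) pairs, the sort stands in for the shuffle, and the reduce phase sums per key.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Map: filtering/decomposition step, emitting (word, 1) pairs."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reduce: summary operation over all values sharing a key."""
    pairs = sorted(pairs, key=itemgetter(0))          # shuffle & sort step
    return {key: sum(v for _, v in group)
            for key, group in groupby(pairs, key=itemgetter(0))}

counts = reduce_phase(map_phase(["to be or not to be"]))
print(counts)   # {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```

In real Hadoop, many map tasks and reduce tasks run in parallel across the cluster; the per-key grouping is what makes that parallelism safe.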
- Apache Hadoop Yarn
Apache YARN is the architectural centre of Hadoop 2.x. The YARN-based architecture of Hadoop 2.x
provides a general-purpose data processing platform that is not limited to MapReduce.
It provides a consistent framework for writing data access applications that run in Hadoop. Moreover,
it provides resource management and a pluggable architecture for a versatile range of processing
engines that can interact with the same data in multiple ways at the same time. This means
applications can interact with the data in the best way: from batch to interactive SQL or low latency
access with NoSQL.
- Apache Spark
It is a fast, in-memory data processing engine with elegant and expressive development APIs that
allow data workers to efficiently execute streaming, machine learning or SQL workloads that require fast
iterative access to datasets. With Spark running on Apache Hadoop YARN, developers everywhere
can now create applications to exploit Spark’s power, derive insights, and enrich their data science
workloads within a single, shared dataset in Hadoop. The Hadoop YARN-based architecture provides
the foundation that enables Spark and other applications to share a common cluster and dataset
while ensuring consistent levels of service and response. Spark is now one of many data access
engines that work with YARN in HDP. Apache Spark consists of Spark Core and a set of libraries.
The core is the distributed execution engine and the Java, Scala, and Python APIs offer a platform
for distributed ETL application development. Additional libraries, built atop the core, allow diverse
workloads for streaming, SQL, and machine learning.
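Spark's programming model, in which lazy transformations are chained and only executed when an action is called, can be mimicked in plain Python. This is a didactic sketch only; the real Python API is `pyspark`, and a real RDD is partitioned across the cluster rather than held in one list.

```python
# Toy RDD-like class mimicking Spark's lazy transformation/action split.
# Transformations (map, filter) only record work; the action (collect)
# actually runs the pipeline. Purely illustrative, not the pyspark API.
class ToyRDD:
    def __init__(self, data, ops=None):
        self._data, self._ops = data, ops or []

    def map(self, f):                      # transformation: lazy
        return ToyRDD(self._data, self._ops + [("map", f)])

    def filter(self, p):                   # transformation: lazy
        return ToyRDD(self._data, self._ops + [("filter", p)])

    def collect(self):                     # action: triggers execution
        out = self._data
        for kind, f in self._ops:
            out = [f(x) for x in out] if kind == "map" else [x for x in out if f(x)]
        return out

rdd = ToyRDD(range(10)).filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(rdd.collect())   # [0, 4, 16, 36, 64]
```

Recording the operation chain before executing it is what lets the real Spark engine optimise the whole pipeline and distribute it over a YARN cluster.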
- Hive
Hive is a data warehouse system built on top of Hadoop, in which tables are similar to tables in a
relational database and data units are organized in a taxonomy from larger to more granular units.
Databases are composed of tables, which are made up of partitions. Data can be accessed via a
simple SQL-like query language (HiveQL), and Hive supports overwriting or appending data. Within a
particular database, data in the tables is serialized and each table has a corresponding Hadoop
Distributed File System (HDFS) directory. Each table can be sub-divided into partitions that
determine how data is distributed within sub-directories of the table directory. Data within partitions
can be further broken down into buckets. Hive supports all the common primitive data types such as
BIGINT, BINARY, BOOLEAN, CHAR, DECIMAL, DOUBLE, FLOAT, INT, SMALLINT, STRING,
TIMESTAMP, and TINYINT. In addition, analysts can combine primitive data types to form complex
data types such as structs, maps and arrays.
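How a row ends up in a partition sub-directory and a bucket can be sketched as follows. The directory layout follows Hive's `key=value` convention, but the table name, partition column and hash function (`zlib.crc32`) are illustrative stand-ins, not Hive's own implementation.

```python
# Sketch of Hive-style physical layout: table directory -> partition
# sub-directories (key=value) -> N buckets chosen by hashing a column.
# zlib.crc32 stands in here for Hive's own bucketing hash.
import zlib

N_BUCKETS = 4

def hive_path(table, partition_col, partition_val, bucket_col_val):
    """Return the storage path a row would land in, given its column values."""
    bucket = zlib.crc32(str(bucket_col_val).encode()) % N_BUCKETS
    return f"/warehouse/{table}/{partition_col}={partition_val}/bucket_{bucket:05d}"

path = hive_path("trips", "travel_date", "2018-10-31", "user_42")
print(path)
```

Partition pruning works precisely because the partition value is encoded in the directory name: a query filtered on `travel_date` never has to read the other sub-directories.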
- Apache HBase
HBase is an open source, non-relational, column-oriented, versioned, distributed database modelled
after Google's BigTable and written in Java.
This module provides random, real time access to data stored in Hadoop. It was created for hosting
very large tables, making it a great choice to store multi-structured or sparse data. Users can query
HBase for a particular point in time, making "flashback" queries possible. These characteristics
make HBase a great choice for storing semi-structured data, such as log data, and then serving that
data very quickly to users or applications integrated with HBase.
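The versioned, "flashback" behaviour can be sketched with a toy in-memory store (illustrative only; the real client API is HBase's Java/REST interface, and real cells are distributed across region servers):

```python
import bisect

# Toy versioned cell store in the HBase spirit: every put is stamped with a
# timestamp, and reads can ask for the value as of any point in time.
class ToyVersionedStore:
    def __init__(self):
        self._cells = {}          # (row, column) -> sorted list of (ts, value)

    def put(self, row, column, value, ts):
        self._cells.setdefault((row, column), []).append((ts, value))
        self._cells[(row, column)].sort()

    def get(self, row, column, as_of):
        """Return the newest value with timestamp <= as_of (flashback query)."""
        versions = self._cells.get((row, column), [])
        i = bisect.bisect_right(versions, (as_of, chr(0x10FFFF)))
        return versions[i - 1][1] if i else None

store = ToyVersionedStore()
store.put("device1", "status", "OK", ts=100)
store.put("device1", "status", "FAILED", ts=200)
print(store.get("device1", "status", as_of=150))   # OK
print(store.get("device1", "status", as_of=250))   # FAILED
```

Because old versions are never overwritten, the same store answers both "what is the status now?" and "what was the status at time t?", which is exactly the flashback capability described above.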
- Zeppelin
Zeppelin is a collaborative data analytics and visualization open source tool for distributed, general-
purpose data processing systems such as Apache Spark, Apache Flink, and many others. Zeppelin
is a modern web-based tool for the data scientists to collaborate over large-scale data exploration
and visualization projects. It’s a notebook-style interpreter that enables collaborative analysis
sessions sharing between users. Zeppelin is independent of the execution framework itself, because
it includes pluggable interpreter APIs to support any data processing systems. Execution frameworks
that currently work with Zeppelin are Spark, Hive, HBase, Flink, and others in the Hadoop ecosystem.
Zeppelin includes a set of classical basic charts such as bar charts, pie charts, tables, line charts,
histograms, and a few others, and new visualization options can be added by developing them in
JavaScript.
These components may be implemented in different manners and combined with other components.
For the CONNECTIVE project, two architectures are presented below that respect the five layers
presented in the previous section. Components that have been added on top of the standard
Hortonworks components will be presented in the layer where they are used.
4.2.1 STRENGTHS AND BENEFITS
A list of possible advantages of adopting the architecture described in previous section follows:
- the Apache Hadoop ecosystem is composed of several tools that enable a complete BDA
- the Apache Hadoop ecosystem is composed of open source tools backed by an active community
- the architecture is 100% free and deployable on any platform (Windows, Linux, cloud, etc.)
- the architecture allows maximum freedom in terms of future developments, since it is not based
on a proprietary solution
- the interfaces for external tools give the flexibility to extend the solution to better fit a specific
scenario not well covered by the current distribution
4.3 IMPLEMENTATION OF HORTONWORKS BASED ON LAMBDA ARCHITECTURE
(THALES+ANSALDO)
Thales and AnsaldoSTS use an implementation based on the Lambda architecture pattern. Lambda
architecture is an effective data-processing architecture designed to handle massive quantities of
data by taking advantage of both batch and streaming processing methods. This approach enables
the system to manage both historical data records and real-time data streams from operators,
according to the detailed requirements.
The Lambda architecture implemented for CONNECTIVE is composed of the following layers:
Data layer with Kafka:
Kafka is publish-subscribe messaging rethought as a distributed commit log. Kafka is used
to collect all the data and send them to the batch and streaming layers for processing. The
data collection layer is the main interface for getting data from external providers or
producers. This layer supports both active and passive ways of obtaining available data.
Different means may be adopted for obtaining data, such as crawling, legacy APIs, and ETL
(Extract, Transform and Load) technology.
The raw data collected are typically stored in an HDFS file system.
Staging layer with Spark (see description above)
Storage layer with Hadoop HDFS file system (see description above)
The storage layer is composed of a batch sub-layer and a speed sub-layer to respect Lambda
pattern:
o Batch Sub-Layer: the batch layer has two functions: (i) managing the master dataset (an
immutable, append-only set of raw data), and (ii) pre-computing the batch views.
o Speed Sub-layer: the speed layer compensates for the high latency of updates and deals
with recent data only.
In addition to HDFS, which is the ideal solution for raw data, outputs from the staging layer
can be stored in different data stores.
In particular, AnsaldoSTS is exploring the use of Apache Hive and Apache HBase for this
purpose.
Data Analytics layer with Spark (see description above) and Spark Streaming
Batch Sub-Layer: Spark is used for the batch treatment of data.
Speed Sub-layer: Spark Streaming is a Spark component for processing data in streaming mode.
Data presentation layer with ElasticSearch+Kibana
ElasticSearch is an open-source search server based on Lucene. It provides a
distributed, multitenant-capable full-text search engine with a RESTful web interface and
schema-free JSON documents. It is one of the most popular enterprise search engines
in the world. It can be used to search all kinds of documents: text-based documents
for enterprise search, but also numerical data for business analytics. It provides
scalable, near real-time search and supports multi-tenancy.
Elasticsearch takes charge of indexing result data so that they can be queried with low
latency. Besides indexing, the technology applied in this layer enables other features
such as high-concurrency querying and fast building of consolidated views.
Kibana is a flexible visualization tool used to display results on dashboards.
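Putting the layers together, the essence of the Lambda pattern is that a query is answered by merging a pre-computed batch view with a low-latency speed view over recent data only. A minimal sketch (event and field names are illustrative, not CONNECTIVE data):

```python
# Essence of the Lambda pattern: a query merges a batch view computed over the
# immutable master dataset with a speed view covering only events that arrived
# since the last batch run. Event and field names are illustrative.

master_dataset = [("station_A", 10), ("station_B", 7), ("station_A", 3)]
recent_events  = [("station_A", 1), ("station_C", 4)]   # not yet batched

def batch_view(events):
    """Slow, complete pre-computation over the immutable master dataset."""
    view = {}
    for station, count in events:
        view[station] = view.get(station, 0) + count
    return view

def speed_view(events):
    """Low-latency incremental view over recent data only."""
    return batch_view(events)    # same logic, far less data

def query(station):
    """Serving layer: merge batch and speed views to answer a query."""
    return (batch_view(master_dataset).get(station, 0)
            + speed_view(recent_events).get(station, 0))

print(query("station_A"))   # 14  (13 from the batch view + 1 from the speed layer)
```

Once the next batch run absorbs the recent events into the master dataset, the speed view for them is discarded, which is what keeps the speed layer small and fast.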
4.4 IMPLEMENTATION BASED ON LAMBDA ARCHITECTURE (INDRA)
The Indra’s implementation choice follows the lambda architecture. For the implementation and for
building a strong BA analysis related with the object of the analysis, Indra has been decided to use
heterogeneous software tools that are described in the following points.
Data Source layer:
The data source layer is the database where the raw data generated by the rail operator are
stored. The database contains a large number of tables storing the information regarding
Access Gate Control Fare Collection, and is implemented on an Oracle database. Oracle
Database (Oracle DB) is a relational database management system (RDBMS). The system is built
around a relational database framework in which data objects may be directly accessed by
users (or an application front end) through structured query language (SQL). Oracle is a fully
scalable relational database architecture and is often used by global enterprises, which
manage and process data across wide and local area networks. The Oracle database has its
own network component to allow communications across networks.
Data Staging Layer:
Talend: Talend is software for data integration. It provides thousands of must-have
productivity features enabling you to quickly connect, transform and move all of your data.
Talend is a leader in cloud integration solutions and in 2018 was once again recognized by
Gartner, Inc. as a leader in data integration in the 2018 "Magic Quadrant for Data Integration
Tools".
This is Talend's third placement in the Leaders quadrant, acknowledging the company's
completeness of vision and ability to execute.
Figure 12: Gartner Magic Quadrant for Data Integration Tools 2018
Data Storage Layer:
PostgreSQL: PostgreSQL is a powerful, open source object-relational database system that
uses and extends the SQL language combined with many features that safely store and scale
the most complicated data workloads.
PostgreSQL has earned a strong reputation for its proven architecture, reliability, data
integrity, robust feature set, extensibility, and the dedication of the open source community
behind the software to consistently deliver performant and innovative solutions. PostgreSQL
runs on all major operating systems, is ACID-compliant, and has powerful add-ons
such as the popular PostGIS geospatial database extender.
Greenplum Pivotal: Greenplum Database is an advanced, fully featured, open source data
platform. It provides powerful and rapid analytics on petabyte scale data volumes. Uniquely
geared toward big data analytics, Greenplum Database is powered by the world’s most
advanced cost-based query optimizer delivering high analytical query performance on large
data volumes. It is based on an architecture providing automatic parallelization of all data and
queries in a scale-out, shared nothing architecture. It can scale interactive and batch mode
analytics to large datasets in the petabytes without degrading query performance and
throughput.
Data Analytic Layer:
Apache Spark: Spark is a general-purpose data processing engine, an API-powered toolkit
which data scientists and application developers incorporate into their applications to rapidly
query, analyse and transform data at scale. Spark’s flexibility makes it well-suited to tackling
a range of use cases, and it is capable of handling several petabytes of data at a time,
distributed across a cluster of thousands of cooperating physical or virtual servers.
Data Presentation Layer:
Grafana: Grafana is a software tool dedicated to the visualization of large-scale
measurement data in an easy, graphical way. It runs as a web application. Grafana can be
used on top of a variety of different data stores. It is built to make metric and function
editing easy, and provides a query editor customized for the features and capabilities of
each data source.
Figure 13: Grafana Dashboard regarding ATVM devices alarms
Figure 14: Grafana Dashboard regarding change of state from Normal to Out Of Service and Technical alarm ATVM
Apache Superset: Superset is another software tool, currently under evaluation, dedicated
to data visualization. Superset is a data exploration and visualization web application.
Figure 15: Apache Superset Dashboard
5. BENCHMARKS OF NEW BIG DATA ARCHITECTURES
5.1 INTRODUCTION
The current architectures deployed by the partners suffer from some drawbacks:
SQL may perform badly on Hadoop platforms. For example, HIVE, which is a SQL engine, is
not very efficient.
Data visualization may be quite slow when the volume of data to be displayed is large. For
example, visualizing data from the Paris area is quite difficult with elasticsearch+kibana.
Indeed, the Paris area represents more than 60,000 stations and bus and tram stops.
Visualizing all these data along different dimensions, such as date, hour or type of travel, is
quite slow.
Big Data domain is also a very fast evolving environment and new promising Big Data architectures
appear regularly on the market.
Benchmarking is therefore very important for the CONNECTIVE project. It will make it possible to
enrich the existing architectures deployed by the partners and/or to propose new ones.
5.2 BENCHMARK SCOPE
For the core release, investigations have been carried out along two axes:
Performance of SQL in Big Data environment
SQL, with its familiar syntax in the IT community, is a very good way to promote Big Data
architectures and to enlarge their audience. Big Data actors are very active in this area. For
example, the new version of elasticsearch (6.3) allows a SQL query syntax to be used to
analyse data. Another example is the MapD solution (recently rebranded as OmniSci): this
solution was architected to run on GPUs and, in particular, development has focused on
enabling common SQL analytic operations such as filtering (WHERE), segmenting (GROUP BY)
and joining (JOIN) to run as fast as possible at native GPU speed.
New architectures based on GPUs
Originally designed to render video games, GPUs have evolved into general-purpose
computational engines that excel at performing tasks in parallel. The GPUs' prodigious
compute capabilities allow them to excel at many machine learning algorithms and, in
particular, at deep learning. Moreover, the graphics pipeline of the cards means they can
also be used for rendering large datasets in milliseconds. Solutions like Brytlyt and MapD
have built their entire architecture on GPUs.
This scope led us to select the MapD solution for the first benchmark. The solution is presented in the
next paragraph, with some preliminary results.
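The filtering, segmenting and joining operations named above are plain SQL; they can be sketched with SQLite as a stand-in engine. Table and column names below are invented for illustration, and SQLite runs on CPU, whereas MapD/OmniSci executes comparable SQL on GPU.

```python
import sqlite3

# The three analytic operations named above (WHERE, GROUP BY, JOIN) on toy
# trip data. Schema and values are invented for illustration.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE trips(stop_id INT, passengers INT);
    CREATE TABLE stops(stop_id INT, zone TEXT);
    INSERT INTO trips VALUES (1, 30), (1, 20), (2, 5), (3, 40);
    INSERT INTO stops VALUES (1, 'Paris'), (2, 'Paris'), (3, 'Lyon');
""")
rows = con.execute("""
    SELECT s.zone, SUM(t.passengers)          -- segmenting (GROUP BY + SUM)
    FROM trips t JOIN stops s USING (stop_id) -- joining (JOIN)
    WHERE t.passengers > 10                   -- filtering (WHERE)
    GROUP BY s.zone ORDER BY s.zone
""").fetchall()
print(rows)   # [('Lyon', 40), ('Paris', 50)]
```

The appeal of GPU engines is that this same query shape stays unchanged while the execution moves to thousands of parallel GPU cores.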
5.3 MAP-D: AN EXAMPLE OF POWERFUL SQL WITH GPU ARCHITECTURE
As seen in the previous section, MapD’s (now named OmniSci) platform leverages the parallel power
of modern graphics processing units (GPUs) and offers:
an efficient way to perform SQL queries
an immersive, instantaneous and interactive way to explore massive datasets in real time.
With MapD’s software and GPU compute power, it is possible to query and visualize billions of records
in tens of milliseconds. This enables the creation of hyper-interactive dashboards in which dozens of
attributes can be correlated and cross-filtered without lag. and even if MapD is a general-purpose
analytics platform, one of the most interesting interests of the platform is its ability to visualize and
explore large geospatial datasets interactively and at the grain-level.
In January 2017, the Big Data consultant Mark Litwintschik published benchmark comparisons on "The
Taxi Dataset": 1.2 billion individual taxi trips made available by the NYC Taxi and Limousine
Commission (TLC). The benchmark results are very impressive and are summarized in the table below:
If we compare the MapD results (highlighted in green in the previous table), it can be seen that it
outperforms solutions like elasticsearch and Spark (highlighted in orange in the previous table) by a
factor of more than 100, elasticsearch and Spark being the software solutions currently selected in
the implementation described in section 4.3.
MapD could be seen as an alternative or a complement to elasticsearch+Kibana. MapD provides a
variety of connectors to move data from data lakes (like Hadoop, Spark, etc.) into the MapD analytics
platform. And just as Spark brought orders-of-magnitude acceleration to Hadoop, MapD brings similar
speedups to Spark and other existing CPU-based analytics systems.
5.4 MAP-D: FIRST IMPLEMENTATION
For CONNECTIVE, a first implementation of MapD has been done on two datasets.
The two datasets are:
Transportation rail network information in the UK (for predictive maintenance purposes): about
50,000 rail segments with 10 years of historical data.
As this volume is significant but not huge, we also benchmarked the MapD platform on a bigger
dataset of maritime transportation with 1.5 billion records to be displayed.
The test platform on which these datasets were used is a server with 8 NVIDIA GTX 1080 Ti GPUs,
each with 11 GB of VRAM. A 1 TB SSD is used to store the datasets before loading. The tests need
all 8 GPUs only for the map rendering in the maritime (AIS) example. Without the cartography, MapD
manages to balance data between VRAM, RAM and the SSD and keeps response times short.
For both cases, response times when the user performs cross-filtering (the paradigm where a click on
any dimension in a chart simultaneously redraws all the other charts in a dashboard) are very
impressive, even for the biggest dataset.
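Cross-filtering amounts to re-aggregating every other dimension under the current selection, as in the toy model below (records and dimension names are invented; MapD performs this per click over billions of rows on GPU):

```python
from collections import Counter

# Toy cross-filtering: a click selects a value on one dimension, and every
# other chart is recomputed over only the matching records.
# Records and dimension names are illustrative.
records = [
    {"mode": "bus",  "hour": 8},
    {"mode": "bus",  "hour": 9},
    {"mode": "tram", "hour": 8},
    {"mode": "rail", "hour": 8},
]

def cross_filter(records, dimension, value):
    """Recompute all other dimensions for records matching the selection."""
    selected = [r for r in records if r[dimension] == value]
    dims = {d for r in selected for d in r} - {dimension}
    return {d: Counter(r[d] for r in selected) for d in sorted(dims)}

# Clicking the "8h" bar redraws the "mode" chart over the 8h records only:
# one bus, one tram and one rail trip remain.
print(cross_filter(records, "hour", 8))
```

The cost of a click is thus a full filter-and-aggregate pass over the dataset, which is why GPU parallelism matters at the billion-row scale described above.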
Some screenshots are displayed below.
Figure 16: MapD dashboard on rail transport network in the UK (50,000 rail segments over 10 years)
Figure 17: MapD dashboard on maritime transport data (1.5 billion records)
5.5 NEXT STEPS
The benchmark will be continued with the data produced by CONNECTIVE. In particular, precise
benchmarks are needed to quantify the performance degradation when the dataset size becomes
bigger than the available VRAM. The benchmark will also be enriched with comparisons against
other promising MPP (Massively Parallel Processing) solutions based on SQL.
6. CONCLUSIONS
The present deliverable illustrated the architecture as implemented in IT2Rail, together with the
software that was used. A comparison between the two architectures, and the software adopted in
each, could thus be made.
The main differences between the two projects concern, firstly, the use of real data (open data in
some cases and data coming from real operators in others). IT2Rail mainly analysed data that were
processed within its own ecosystem, such as preferences stored in the cloud wallet, disruption
events and, in some specific cases, traveller satisfaction gathered through questionnaires or
weather data. In CONNECTIVE the objective goes beyond what IT2Rail achieved: advanced
Descriptive, Predictive and Prescriptive analytics are or will be introduced, leading to changes in
the implementation choices regarding the software.
With these points in mind, evaluating the software needed to provide a solid base on which BA can
rest is fundamental. As a consequence, another stage is introduced before the final decision on the
software composing the architecture: the use of benchmarks to compare the evaluated analytics
tools. Performance must be high and reliability must be strong. The volumes of data that
CONNECTIVE will treat are considerable, and scalability must be ensured in every layer of the
architecture.
In conclusion, CONNECTIVE is a five-year project that is now entering its second year of
development. Work has been done, but more remains to be developed. The foundation, as shown in
the present deliverable, is in place (even if some of the software described in this document might
be changed if needed). The starting point of the analysis is being built up through the study of the
information stored in the different repositories across Europe.
The next deliverables will show the second stage of development, in which new achievements will be
made, new modules will be introduced, and the questions regarding descriptive, predictive and
prescriptive analysis may be answered.