Grant Agreement No. 777522
CONN-WP2-D2.1 Page 1 of 38 31/10/2018
CONNECTIVE
D2.1- BIG DATA ARCHITECTURE
C-REL
Due date of deliverable: 30/09/2018
Actual submission date: 31/10/2018
Leader of this Deliverable: INDRA
Document status
Revision Date Description
v0 05/09/2018 Table of Contents
v1 22/10/2018 Consolidated version
v2 30/10/2018 Consolidated version by all partners
v3 30/10/2018 Final consolidated version by all partners
Project funded by the S2R Joint Undertaking under Horizon 2020 research and
innovation programme
Dissemination Level
PU Public
CO Confidential, restricted under conditions set out in Model Grant Agreement X
CI Classified, information as referred to in Commission Decision 2001/844/EC
Reviewed Y
Start date of project: 01/09/2017 Duration: 58 months
This project has received funding from the European Union’s Horizon 2020 research and
innovation programme under grant agreement No 777522.
REPORT CONTRIBUTORS
Name Company Details of Contribution
INDRA (Indra Sistemas SA): Edition of the document.
THALES (Thales Communications & Security SAS): Edition of the document.
NETWORK RAIL (Network Rail Infrastructure Limited): Review of the document.
ANSALDO STS (Ansaldo STS S.p.A.): Edition of the document.
Grant Agreement No. 777522
CONN-WP2-D2.1 Page 3 of 38 31/10/2018
EXECUTIVE SUMMARY
The “Big Data Architecture Document” reports on the activities performed for the C-REL in task 2.2.
This deliverable covers the Big Data architecture choices and implementation, benchmark results and
best development practices.
The deliverable will focus on the following points:
Analysis of the outputs of the IT2Rail architecture
Benchmarks of new Big Data technologies, which will allow adapting the existing architecture
and/or proposing new Big Data architectures.
The deliverable will have different versions (releases) to reflect the progress of the task.
Specifically, “D2.1 Big Data Architecture” illustrates the architecture as it was built in IT2Rail,
describing the limitations encountered and its advantages. These two points shaped the
architecture decisions and the best practices that will be reused within CONNECTIVE.
In addition, the document shows the software used by each partner involved in the IT2Rail project.
It also describes the Sofia2 IoT (Internet of Things) middleware tool. Sofia2 was used during that
stage of the project, but for the C-REL in CONNECTIVE it was decided not to use it, in order to
evaluate other software dedicated to the tasks of each layer.
A preliminary model of the BA system is illustrated with the Capella modelling tool. It first shows
the diagrams for the system analysis, with missions, capabilities and interactions of the system
with external actors, and then the logical architecture with the description of the components.
A second section illustrates the choice of a common architecture across the CONNECTIVE
partners. This choice was made to align the developments and to provide a common foundation
on which to build the S2R-IP4 Business Analytics. The architecture layers agreed
among partners are:
Data Source Layer;
Data Staging Layer;
Data Storage Layer;
Data Analytic Layer;
Data Presentation Layer.
For each layer, several software products are used, depending on the use cases taken into
consideration and the data to be analysed. The document describes:
Hortonworks Data Platform;
Apache Spark;
Talend;
Pivotal Greenplum;
Elasticsearch;
Apache Superset
Etc.
The last section presents benchmark results comparing the chosen software, estimating their
performance and making evident why they were selected.
TABLE OF CONTENTS
REPORT CONTRIBUTORS ................................................................................................................ 2
EXECUTIVE SUMMARY ..................................................................................................................... 3
TABLE OF CONTENTS ....................................................................................................................... 5
LIST OF FIGURES .............................................................................................................................. 6
LIST OF TABLES ................................................................................................................................ 7
1. INTRODUCTION .......................................................................................................................... 8
2. ONTOLOGIES AND SPECIFICATIONS ....................................................................................... 9
2.1 INTRODUCTION ................................................................................................................... 9
2.2 SPECIFICATIONS ................................................................................................................ 9
2.2.1 SYSTEM ANALYSIS ...................................................................................................... 9
2.2.2 LOGICAL ARCHITECTURE ......................................................................................... 10
2.3 DATA STANDARDIZATION AND ONTOLOGIES ............................................................... 15
2.3.1 INTRODUCTION .......................................................................................................... 15
2.3.2 DATASET DESCRIPTION ONTOLOGY ...................................................................... 15
2.3.3 DATA DESCRIPTION ONTOLOGY ............................................................................. 16
3. IT2RAIL BA ARCHITECTURE .................................................................................................... 19
3.1 LESSONS LEARNT FOR CONNECTIVE ............................................................................ 22
4. CONNECTIVE ARCHITECTURE ............................................................................................... 24
4.1 A BIG DATA ARCHITECTURE BASED ON HORTONWORKS DATA PLATFORM
(ANS+THA) .................................................................................................................................... 25
4.2 MAIN COMPONENTS ......................................................................................................... 26
4.2.1 STRENGTHS AND BENEFITS .................................................................................... 28
4.3 IMPLEMENTATION OF HORTONWORKS BASED ON LAMBDA ARCHITECTURE .......... 28
4.4 IMPLEMENTATION BASED ON LAMBDA ARCHITECTURE (IND) .................................... 29
5. BENCHMARKS OF NEW BIG DATA ARCHITECTURES .......................................................... 33
5.1 INTRODUCTION ................................................................................................................. 33
5.2 BENCHMARK SCOPE ........................................................................................................ 33
5.3 MAP-D: AN EXAMPLE OF POWERFUL SQL WITH GPU ARCHITECTURE ...................... 34
5.4 MAP-D: FIRST IMPLEMENTATION .................................................................................... 35
5.5 NEXT STEPS ...................................................................................................................... 37
6. CONCLUSIONS ......................................................................................................................... 38
LIST OF FIGURES
Figure 1: Business Analytics Missions and Capabilities ..................................................................... 10
Figure 2: Hierarchy of Business Analytics components ...................................................................... 11
Figure 3: Business Analytics components .......................................................................................... 12
Figure 4: Sequence diagram for pre-processing ................................................................................ 13
Figure 5: Sequence diagram for Machine learning ............................................................................. 14
Figure 6: Sequence diagram for KPIs ................................................................................................ 15
Figure 7: Dataset description ............................................................................................................. 16
Figure 8: IT2Rail Architecture Layers ................................................................................................. 20
Figure 9: Sofia 2 BA Architecture ....................................................................................................... 22
Figure 10: S2R-IP4 Business Analytic Architecture Layers ................................................................ 24
Figure 4-2: Hortonworks Data Platform (from: https://adtmag.com/articles/2016/06/28/hdp-2-5.aspx)26
Figure 13: Magic Quadrant by Gartner for Data Integration Tool 2018 ............................................... 30
Figure 14: Grafana Dashboard regarding ATVM devices alarms ....................................................... 31
Figure 15: Grafana Dashboard regarding change of state from Normal to Out Of Service and Technical
alarm ATVM ...................................................................................................................................... 32
Figure 16: Apache Superset Dashboard ............................................................................................ 32
Figure 17: MapD dashboard on rail transport network in UK (50,000 rail segment data over 10 years)
.......................................................................................................................................................... 36
Figure 18: MapD dashboard on maritime transport data (1.5 billion data points) ........................... 37
LIST OF TABLES
Table 1: Reference Documents ..........................................................................................................
Table 2: List of Acronyms ..................................................................................................................
Table 3: IT2Rail Architecture Technology .......................................................................................... 21
1. INTRODUCTION
The present deliverable is the first document explaining the architecture regarding Business Analytics
(BA) within the second work package (WP2) of the CONNECTIVE project.
CONNECTIVE follows the lighthouse pilot IT2Rail, where BA work initially started within the
Shift2Rail Innovation Programme 4 (S2R-IP4).
S2R-IP4 is the first rail joint technology initiative focused on accelerating the integration of new and
advanced technologies into innovative rail product solutions. CONNECTIVE aims to be the technical
backbone of S2R´s Innovation Programme 4 (IP4), which addresses the provision of “IT solutions for
attractive Railway services”.
An advance of the CONNECTIVE project with respect to IT2Rail will be the analysis of real data coming
from different sources, whether open or not. This is important to give external sources, like
transportation operators, a better vision of the evolution of European mobility. The consequence
is a modification of the services offered to the traveller, making the travel experience more
attractive.
The analysis of these data sources will involve advanced algorithms applied to business analytics. At
the end of the CONNECTIVE project life cycle, the expected outcome will be the answers to the
three main questions that BA needs to answer:
Descriptive Analytics: It provides insight into the past (What has happened?)
Predictive Analytics: It anticipates the future (What could happen?)
Prescriptive Analytics: It advises on possible outcomes (What should we do?)
Given the above, BA within S2R-IP4 aims to provide the insights required for making better
business decisions and strategic moves.
This document is the first stage towards delivering such a result. At this stage, it defines the
common base architecture and the software of which it is composed. Such an evaluation is
needed to have a solid BA foundation.
2. ONTOLOGIES AND SPECIFICATIONS
2.1 INTRODUCTION
For the core release, initial specifications have been performed to model Business Analytics inside
S2R-IP4 Ecosystem. For this purpose, Capella Model has been adopted in all Shift2Rail IP4 projects
as a solution for model-based systems engineering. It provides a process and tooling for graphical
modelling of systems, hardware or software architecture.
The Capella modelling process is iterative and allows describing the system in different steps:
System analysis with the description of missions and capabilities and interactions of the system
with external actors;
Logical architecture with the description of components and functional scenarios.
2.2 SPECIFICATIONS
2.2.1 SYSTEM ANALYSIS
Mission and capabilities:
This description aims to define the actors and how they interact with Business Analytics. The following
actors have been identified:
Traveller (via the Travel Companion application);
Transport Service Provider (TSP);
Business analyst.
Interactions between Business Analytics and the actors allow defining four main capabilities (see
Figure 1 below):
Understand & synthesise data: Describe the data and extract information from it;
Decision support: Propose optimisation from the current data and help to understand the
impact of a change;
Prediction: Build learning algorithms to predict outcome in a given situation;
Visualise & explore: Present the data and the result of the other capabilities.
Figure 1: Business Analytics Missions and Capabilities
2.2.2 LOGICAL ARCHITECTURE
Once actors, missions and capabilities defining interactions between the actors and the Business
Analytics system are defined, the next step is to define the components of the system.
These components include:
BA Portal: This component is the point of entry for the business analyst. It is the tool used to
enter commands and launch the algorithms and visualisations.
Data Management Engine: This component has all functions dedicated to the pre-processing
of the data and its storage.
ETL: This component is responsible for all functions for dataset manipulation: loading,
filtering, fusion, storing etc…
Data generator: This component is responsible for all functions dedicated to creating new
dataset, mainly from simulators and data generation algorithms.
Anonymiser: This component is responsible for all functions dedicated to anonymization
of personal data.
Analytics Engine: This component has all functions which create new information to a dataset,
extracting it from the current variables.
Predictive Engine: This component is responsible for all functions dedicated to creating
models from a dataset (learning) and applying it to another one (predict).
Prescriptive Engine: This component is responsible for all functions dedicated to decision
support: optimisation algorithms, what-if analysis.
Descriptive Engine: This component is responsible for all functions dedicated to
describing the dataset: KPIs, profiling.
Visualisation Engine: This Component has all functions related to displaying the results of the
analysis.
Dashboard Engine: State-of-the-art visualisations with dashboards.
Virtual Reality Engine: 3D engine that creates 3D views of the data in a virtual environment.
These components are described in Figure 2 and Figure 3 below:
Figure 2: Hierarchy of Business Analytics components
Figure 3: Business Analytics components
The exchange scenarios describe the basic way components interact to perform Business
Analytics work. It is impossible to create a fully realistic exchange scenario of an analysis, as the
workflow is always different: the analyst does not know what to do after each step until seeing the
result of the current one, deciding whether it is good enough, whether it requires different settings
or algorithms, or whether everything must be redone because something is not right. The scenarios
presented here are generic, describing the basic workflow an analyst would follow if everything
went perfectly on an imaginary dataset.
Three scenarios are presented:
Pre-processing scenario;
Machine learning algorithm scenario;
KPI scenario.
Pre-processing scenario
This scenario presents the first steps the analyst performs with a dataset: load the data into the
tool, clean it, fuse it with other data, anonymise it, and then store the result for future use. The
sequence diagram associated with this scenario is presented in Figure 4 below.
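As an illustration, the steps of this scenario can be sketched in Python with pandas. The column names, sample data and hash-based anonymisation rule below are invented for the example; they are not part of any IT2Rail or CONNECTIVE specification.

```python
import hashlib

import pandas as pd


def anonymise(df, id_column):
    """Replace a personal identifier with an irreversible hash (example rule)."""
    out = df.copy()
    out[id_column] = out[id_column].astype(str).map(
        lambda v: hashlib.sha256(v.encode()).hexdigest()[:16])
    return out


def preprocess(transactions, stations):
    """Load -> clean -> fuse -> anonymise, as in the scenario."""
    clean = transactions.dropna(subset=["userID", "station"])   # cleaning
    fused = clean.merge(stations, on="station", how="left")     # fusion
    return anonymise(fused, "userID")                           # anonymisation


transactions = pd.DataFrame({"userID": ["u1", "u2", None],
                             "station": ["Madrid", "Paris", "Rome"]})
stations = pd.DataFrame({"station": ["Madrid", "Paris"],
                         "lat": [40.42, 48.86]})
result = preprocess(transactions, stations)   # ready to store for future use
```

The result keeps the fused columns but no raw traveller identifiers, so it can be stored for later analyses.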
Figure 4: Sequence diagram for pre-processing
Machine learning algorithm scenario
This scenario is a simple way to test a machine learning algorithm. It loads a pre-processed dataset,
learns a model on one sub-part of the dataset, and then tests it on the other part. The results are
displayed in a visualisation for evaluation. The sequence diagram associated with this scenario is
presented in Figure 5 below.
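The learn-on-one-part, test-on-the-other workflow can be sketched with scikit-learn; the dataset here is synthetic and the choice of logistic regression is only an illustrative assumption, not a CONNECTIVE decision.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic "pre-processed dataset": two features and a binary label
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Learn a model on a sub-part of the dataset...
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
model = LogisticRegression().fit(X_train, y_train)

# ...then test it on the other part; the score would feed a visualisation
accuracy = accuracy_score(y_test, model.predict(X_test))
```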
Figure 5: Sequence diagram for Machine learning
KPI scenario
This scenario shows how to load the pre-processed data, compute the KPIs and show them with the
data on a dashboard. The sequence diagram associated with this scenario is presented in Figure 6
below.
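The KPI computation itself can be sketched with pandas; the trip table and the two KPIs (trip count and mean delay per station) are invented for the example.

```python
import pandas as pd

# Pre-processed data loaded from the storage layer (sample values)
trips = pd.DataFrame({"station": ["Madrid", "Madrid", "Paris", "Paris", "Paris"],
                      "delay_min": [2, 8, 0, 5, 10]})

# KPIs per station, in a shape a dashboard component can consume
kpis = trips.groupby("station").agg(
    trip_count=("delay_min", "size"),
    mean_delay=("delay_min", "mean")).reset_index()
```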
Figure 6: Sequence diagram for KPIs
2.3 DATA STANDARDIZATION AND ONTOLOGIES
2.3.1 INTRODUCTION
Before building an ontology for Business Analytics, a first step will be the construction of a
Shift2Rail-IP4 standard for Business Analytics dataset description and data description. The goal
is to share data easily between partners.
This standard may evolve during the project towards a real ontology. An ontology is a representation
of concepts and their relations, using a specific language. This ontology will be used by the
specifications (in particular, interface specifications) for the harmonization of concepts across the
transportation modes.
2.3.2 DATASET DESCRIPTION ONTOLOGY
The data can be stored in many databases and many formats. In the end, a dataset is a matrix:
one observation per line, with a value for each variable (column). As there are many more lines
than variables, it is the lines that are stored in the database. Since some storage formats (such
as CSV files) do not keep enough information about the variables, a “variable description”
ontology will be created, which can be passed along with the lines to propagate this information.
In the end, the dataset is composed of an array of lines and a set of “description headers”, one
per variable.
The dataset description is described in Figure 7 below.
Figure 7: Dataset description
2.3.3 DATA DESCRIPTION ONTOLOGY
The data description will use the JSON format. A first proposal is made for the core release and
will be refined during the project lifetime.
The structure contains:
Root node: array of objects; each object is a variable description.
Variable object description common fields (default value if not present):
o name: string (“”)
o idx: int (0)
o type: string {int, float, string}
o nullable: bool (true)
o null_values: array (empty array)
o unit: string {iso format for measures, "date:DDyymmm" for string-based date, latitude,
longitude, …} (“”)
o is_id: bool (false)
o is_useful: bool (true)
o learning: string {learn, oracle} (null)
o time: string {instant, period_begin, period_end} (null)
o position: string {latitude, longitude, geojson, relative, latitude_start, longitude_start,
relative_start, latitude_end, longitude_end, relative_end} (null)
o relation: string (“”)
o update_interval: int (0)
Variable object description fields for int/float:
o min: float (-oo)
o max: float (+oo)
o scale: {linear, log, exp} (linear)
Variable object description fields for string:
o values: array of strings (empty array)
Explanation of each field:
o name: the label of the variable.
o idx: the index of this variable if the variables are ordered in the dataset.
o type: the type of the field, for parsing it if needed. Int is an arbitrarily large integer
type (long in most languages), float is an arbitrarily large float type (double in most
languages).
o nullable: whether this field can be null / absent / have no value.
o null_values: array which contains the null values. If these values are strings and the
type is not string, the comparison has to be done before parsing.
o unit: the name of the variable’s unit of measure. Standard measure should follow the
iso format for parsing convenience (no ‘°c’ but ‘°C’). For the string-based dates, the
org.joda.time.format.DateTimeFormat is used, with the string ‘date:’ before. For a
timestamp-based date-time, the iso unit: ‘s’ or ‘ms’ is used.
o is_id: true if the field is used as an id for the row, so it should not be used for an
analysis.
o is_useful: false if the field should be ignored (such as a description field, or a field that
should be filtered out).
o learning: ‘learn’ if the field has been identified as a candidate for learning: data that is
easily available and can be used as input. ‘predict’ if it is data we want to predict, as it
is difficult or impossible to obtain this value in the general case.
o time: distinguishes between an instant and a field used to identify a boundary
of a period.
o position: if the variable is used to position the row, how to interpret it, like the ‘time’
field but for a 1D or 2D position instead of a time position.
o relation: what this variable describes. Example: if it is time: instant, the field may
have ‘row’ if it describes the time of the row. If one variable has position: relative_start
and relation: width, and another variable has position: relative_end and relation: width,
the two variables describe the interval of the ‘width’ (whatever it is). The content of this
field is a descriptive string, for displaying and linking intervals together. ‘row’ has a
special meaning, referring to the row/observation where the variable is stored, and so
can be used to place the observation on a timeline or on a map automatically in a
visualisation.
o update_interval: how frequently this variable is updated in the dataset. Example: 3600
for a temperature variable that is updated every hour, even if there is a row for every
check-in. 0 means “real-time”.
o min: for a numeric variable, the minimum value (inclusive) that this field can take.
o max: for a numeric variable, the maximum value (exclusive if it is an int) that this field
can take.
o scale: for a numeric variable, how the value should be understood. Example: the
decibel is on a log scale. A “number of views” variable can be tagged as exp, because
the visualisation should understand that it may be more useful to display it on a log
scale. It can also be used by an analytic algorithm, as some fields may be more
informative on an exp/log scale.
o values: for a string variable, the possible values it can have. If not present or empty,
there are no limits (it is not an enumeration).
An example of this ontology is given in the following JSON:
[{"name": "transacID", "idx": 0, "type": "int", "is_id": true, "min": 0, "nullable": false},
 {"name": "userID", "idx": 1, "type": "int", "learning": "learn", "min": 0, "nullable": false},
 {"name": "station", "idx": 2, "type": "string", "learning": "learn", "null_values": ["ERROR", ""]},
 {"name": "time_transac", "idx": 3, "type": "int", "learning": "learn", "unit": "s", "time": "instant",
  "relation": "row", "null_values": [0, "ERROR"]}]
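A consumer of this standard has to apply the default values listed above for the fields that are absent. A minimal Python sketch of such a reader (the helper name is ours, not part of the standard):

```python
import json

# Defaults for the common fields, as specified above
DEFAULTS = {"name": "", "idx": 0, "nullable": True, "null_values": [],
            "unit": "", "is_id": False, "is_useful": True, "learning": None,
            "time": None, "position": None, "relation": "",
            "update_interval": 0}


def load_descriptions(text):
    """Parse the description JSON, filling defaults for absent fields."""
    return [{**DEFAULTS, **var} for var in json.loads(text)]


doc = '''[{"name": "transacID", "idx": 0, "type": "int",
           "is_id": true, "min": 0, "nullable": false},
          {"name": "station", "idx": 2, "type": "string",
           "learning": "learn", "null_values": ["ERROR", ""]}]'''
headers = load_descriptions(doc)
```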
3. IT2RAIL BA ARCHITECTURE
The IT2Rail architecture for Business Analytics was designed as a distributed infrastructure
environment composed of different layers. Several partners set up their own architecture and
environment, but all of them shared the same approach in terms of the layers, which were
identified by the involved partners and are summarized in the following points:
Data Management: It is the component in charge of defining a set of tools to collect and
integrate data from internal, external and internet data sources. Its main functionality is to
provide the correct way for storing the information.
Big Data Storage: It collects and integrates information into a repository that provides
quality and timeliness to the business analytics process.
Information Management and Analysis: It is the component in charge of computing the
Business Analytics based on the data collected and stored in the repositories. This
component includes processes to move data from multiple sources, reformat and clean
them, and load them into another database, data mart or data warehouse to analyse and
support a business process.
Presentation: It is the component representing the graphical interface that will be used by
operators to visualize the analytics of the IT2Rail Platform. This component offers the ability
to visualize a unified representation of the information related to the indicators and KPIs.
Business Analytics Services: It is the component exporting KPIs to other IT2Rail
modules that need information stored in the analytics repositories.
This component publishes all Business Analytics services in a standard way so that they
can be consumed by other components, leveraging the facilities offered by the IT2Rail
Interoperability Framework.
Moreover, in order to shield final users (both travellers through their TC, and travel experts
through dedicated interfaces) from the existence of the different Business Analytics environments
deployed in IT2Rail, each environment offered its results (KPIs, graphics) through a web service,
allowing them to be presented in an integrated way through a single interface.
One of the biggest limitations faced during the development of IT2Rail was the lack of big volumes
of real data available to test the system. Therefore, the scalability and stability of the system and
its capacity to work with big data could not be guaranteed.
The integration into a single presentation interface also entailed several challenges, as there were
no previous agreements on the technology to be used, which complicated the integration of services
(e.g. whether a web application accepts gadgets that are not responsive, and vice versa).
Figure 8 shows the IT2Rail architecture used by the involved partners. As explained, each of them
had its own infrastructure and presentation layer, but the information could also be presented in an
integrated way in a single web application provided by Leonardo (one for the TSPs and another for
the traveller through the TC).
Figure 8: IT2Rail Architecture Layers
Table 1 gathers different technologies and tools used by IT2Rail partners for the different layers
identified.
Leonardo
• Presentation Layer: EXT JS v6; OpenWeatherMap API v2.5; Tomcat v8.0
• Information Management and Analysis Layer: Pentaho v6.1; Java JDK v1.8; MySQL v5.6; Tomcat v8.0; Python v3.6
• Data Storage: Java JDK v1.8; Tomcat v8.0; MongoDB v3.1; MySQL v5.6
• Data Collection: built with the third-party OpenWeatherMap v2.5 API, which requires an OpenWeatherMap key; required software: Java JDK v1.8; MongoDB 3.1
• Data Retrieval: Java JDK v1.8; Tomcat v8.0; MongoDB v3.1; MySQL v5.6
Indra
• Presentation Layer: Sofia2
• Information Management and Analysis Layer: Java V 1.7.0_67; MongoDB V 3.0.15
• Data Storage: MongoDB V 3.0.15; Apache Tomcat V 7.4.54; MySQL V 5.5
• Data Collection: Sofia2
• Data Retrieval: Java V 1.7.0_67; MongoDB V 3.0.15; Apache Tomcat V 7.4.54; MySQL V 5.5
UPC
• Information Management and Analysis Layer: Java 1.6 or higher; Docker 17.0.3 or higher
• Data Storage: MongoDB 3.0 or higher; MySQL 5.5.2 or higher; Sparksee 5.2.3 or higher
• Data Collection: MongoDB 3.0 or higher; Twitter API streaming service
• Data Retrieval: Sparksee 5.2.3 or higher; MongoDB 3.0 or higher
POLIMI
• Java Runtime Environment (at least version 1.6) and PostgreSQL (at least version 8)
CEA
• Information Management and Analysis Layer: R (v3.4.3); Python (v2.7.12); Node.js (v9.4.0); MongoDB (v3.4.10)
• Data Storage: R (v3.4.3); Python (v2.7.12); Node.js (v9.4.0); MongoDB (v3.4.10)
Table 1: IT2Rail Architecture Technology
Among them, the only partner also involved in CONNECTIVE is Indra. During IT2Rail, Indra relied
mainly on its Sofia2 platform, which offers capabilities for data collection and visualisation
and supports different technologies such as Java or MongoDB. In addition, Sofia2 includes an
analytics layer, giving the developer the advantage of concentrating in a single platform
all the layers needed for extraction, data cleaning, storage, and generation and visualisation of the
defined KPIs. Within CONNECTIVE, Indra will also analyse other technologies that could be
applicable to the different layers.
Figure 9: Sofia 2 BA Architecture
3.1 LESSONS LEARNT FOR CONNECTIVE
CONNECTIVE proposes to follow a similar distributed approach, with each of the main partners
developing a different implementation for each layer. Likewise, the partners plan to align the different
layers of the architecture, in order to have a common approach for a general architecture.
This approach is also followed in other Big Data R&D projects, such as Transforming Transport, in
which Indra, Thales and Network Rail also participate, which provides experience that can be
applied to CONNECTIVE. That project has 13 pilots, and almost all of them, carried out by different
partners, use their own environments instead of a common one. The advantage of this approach is
that it allows different partners to work in parallel, and it also avoids problems with sharing the data
of one entity with the rest of the consortium.
CONNECTIVE plans to go further: following the IT2Rail experience, it will provide users with a
single access point for all results, independently of which partner's platform performs the
analysis. This approach also has the advantage of allowing future entities to offer their BA
services/results for integration into the ecosystem, making the solution scalable and not tied to a
specific technology or provider. Moreover, it will allow testing and comparing different technologies
during the project, which can help identify the most robust and recommended solutions for a global
transport scenario such as the one targeted in IP4.
Moreover, the CONNECTIVE interface for travel experts will be adapted to each travel expert using
the system. A web portal will be provided by CONNECTIVE to allow each travel expert to have its
own access (with user and password) to join the ecosystem, configure business rules, or visualise
BA results. In the same way, travellers will access the different BA results in a unified way,
regardless of the partner providing the analysis.
In order to succeed in this integration, all participants will work aligned and in close collaboration
during the lifetime of the project.
4. CONNECTIVE ARCHITECTURE
Business Analytics within the CONNECTIVE project will have a distributed architecture that differs
depending on the areas of interest that BA takes into consideration. This distributed environment
is necessary for the different use cases on which each of the actors involved will focus during
development. The development of the use cases will not compromise a coherent global vision of the
architecture.
The layers that compose the CONNECTIVE architecture are common to all the infrastructures
involved, even if the software used may differ. The layers are the following:
Data Source Layer
Staging Area Layer
Data Storage Layer
Data Analytic Layer
Presentation Layer
Figure 10: S2R-IP4 Business Analytic Architecture Layers
Data Source Layer:
This first layer represents the different data sources that feed the Business Analytic module. It can
be fed by heterogeneous types of data. The data sources can be of any format: plain text files,
relational databases, non-relational databases, Excel files, etc.
Data Staging Layer:
This layer focuses on three main processes: extraction, transformation and loading. Extraction is the
process of identifying and collecting relevant data from different sources. The extraction process is
needed to select data that are significant in supporting organizational decision making. The extracted
data are then sent to a temporary storage area called the data staging area prior to the transformation
and cleansing process. This is done to avoid having to extract the data again should any problem
occur. After that, the data will go through the transformation and cleansing process.
Transformation is the process of converting data using a set of business rules (such as aggregation
functions) into consistent formats for reporting and analysis.
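As a minimal illustration of this extract-stage-transform flow, here is a pure-Python sketch. All source records, field names and cleansing rules below are invented for illustration; they are not taken from the actual CONNECTIVE data sources.

```python
# Minimal ETL sketch: extract -> staging area -> transform (cleanse + aggregate).
# All data and field names here are illustrative, not real CONNECTIVE data.

def extract(source_rows):
    """Select only the records relevant for decision making."""
    return [r for r in source_rows if r.get("station") is not None]

def transform(staged_rows):
    """Cleanse values and aggregate into a consistent reporting format."""
    totals = {}
    for r in staged_rows:
        station = r["station"].strip().upper()   # cleansing: normalise names
        totals[station] = totals.get(station, 0) + int(r["passengers"])
    return totals

source = [
    {"station": " madrid ", "passengers": "120"},
    {"station": "MADRID", "passengers": "80"},
    {"station": None, "passengers": "5"},        # irrelevant record, dropped
]

staging_area = extract(source)   # kept on disk so extraction need not be redone
report = transform(staging_area)
print(report)                    # {'MADRID': 200}
```

Keeping the extracted rows in the staging area before transforming them is what avoids a second round-trip to the source systems if the transformation fails.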
Data Storage Layer:
This is where the transformed and cleansed data sit. Based on scope and functionality, three types
of entities can be found here: data warehouse, data mart, and operational data store (ODS). In any
given system, you may have just one of the three, two of the three, or all three types.
Data Analytic Layer:
This component analyses the results of the data treated in the historic (or batch) flow. This layer
provides outputs based on the enrichment process and supports the presentation layer by reducing
query response latency.
Data Presentation Layer:
This layer refers to the information that reaches the users. This can take the form of tabular or
graphical reports delivered through a web application (this is how it was done in IT2Rail).
4.1 A BIG DATA ARCHITECTURE BASED ON HORTONWORKS DATA PLATFORM
(THALES+ANSALDO)
One possible solution for a Big Data Architecture is a solution based on Hortonworks Data Platform.
The Hortonworks Data Platform (HDP) is an open source framework for distributed storage and
processing of large, multi-source data sets, based on the Apache Hadoop framework.
Apache Hadoop is a collection of open-source software utilities that facilitate using a network of many
computers to solve problems involving massive amounts of data and computation. The core of
Apache Hadoop consists of a storage part, known as Hadoop Distributed File System (HDFS), and
a processing part which is a MapReduce programming model.
The figure below gives an overview of all the components included in HDP, grouped by their
macro-functionality.
Figure 11: Hortonworks Data Platform (from: https://adtmag.com/articles/2016/06/28/hdp-2-5.aspx)
Main components and their applications are described in the following section.
4.2 MAIN COMPONENTS
- Hadoop Distributed File System (HDFS)
HDFS is an open source distributed file system, designed to run on commodity hardware, which
makes up the primary storage system of the Hadoop ecosystem. HDFS is highly fault tolerant and is
designed to be deployed on low cost hardware, provides high throughput access to application data
and is suitable for applications that have large datasets.
HDFS is a specialized streaming file system that is optimized for reading and writing of large files.
When writing to HDFS, data are “sliced” and replicated across the servers in a Hadoop cluster. The
slicing process creates many small sub-units (blocks) of the larger file and transparently writes them
to the cluster nodes. The various slices can be processed in parallel (at the same time) enabling
faster computation. The user does not see the file slices but interacts with whole files in HDFS like a
normal file system (i.e., files can be moved, copied, deleted, etc.). When transferring files out of
HDFS, the slices are assembled and written as one file on the host file system.
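The slicing and replication behaviour described above can be sketched in a few lines (a didactic model only: block size, replication factor and node names are toy values; real HDFS defaults are 128 MB blocks and 3 replicas, with rack-aware placement):

```python
# Sketch of HDFS-style writing: slice a file into fixed-size blocks and
# replicate each block on several nodes. Sizes are toy values for illustration.
BLOCK_SIZE = 4          # bytes per block (HDFS default is 128 MB)
REPLICATION = 3         # copies of each block (HDFS default)
NODES = ["node1", "node2", "node3", "node4"]

def put(data: bytes):
    """Slice `data` into blocks and assign each block to REPLICATION nodes."""
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    placement = {}
    for idx, _block in enumerate(blocks):
        # round-robin placement; real HDFS placement is rack-aware
        placement[idx] = [NODES[(idx + r) % len(NODES)] for r in range(REPLICATION)]
    return blocks, placement

def get(blocks):
    """Reassemble the slices into the original file, as on read-out of HDFS."""
    return b"".join(blocks)

blocks, placement = put(b"hello hdfs world")
print(len(blocks), placement[0])   # 4 blocks, each on 3 nodes
```

The user-facing `put`/`get` pair is the point: the caller sees whole files, while slicing and replication stay transparent.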
- Apache Hadoop Map Reduce
MapReduce is a programming model and an associated implementation (Apache Hadoop Map
Reduce) for processing and generating big data sets with a parallel, distributed algorithm on a cluster.
A MapReduce program is composed of a map procedure (or method), which performs filtering and
sorting, and a reduce method, which performs a summary operation.
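The classic word-count example illustrates this map/reduce split. The sketch below is plain Python standing in for the Hadoop framework: the map phase emits (key, 1) pairs, the sort stands in for the shuffle, and the reduce phase sums per key.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Map: filtering/decomposition step, emitting (word, 1) pairs."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reduce: summary operation over all values sharing a key."""
    pairs = sorted(pairs, key=itemgetter(0))          # shuffle & sort step
    return {key: sum(v for _, v in group)
            for key, group in groupby(pairs, key=itemgetter(0))}

counts = reduce_phase(map_phase(["to be or not to be"]))
print(counts)   # {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```

In real Hadoop, many map tasks and reduce tasks run in parallel across the cluster; the per-key grouping is what makes that parallelism safe.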
- Apache Hadoop Yarn
Apache YARN is the architectural centre of Hadoop 2.x. The YARN-based architecture of Hadoop 2.x
provides a general-purpose data processing platform that is not limited to MapReduce.
It provides a consistent framework for writing data access applications that run in Hadoop. Moreover,
it provides resource management and a pluggable architecture for a versatile range of processing
engines that can interact with the same data in multiple ways at the same time. This means
applications can interact with the data in the best way: from batch to interactive SQL or low latency
access with NoSQL.
- Apache Spark
It is a fast, in-memory data processing engine with elegant and expressive development APIs that
allow data workers to efficiently execute streaming, machine learning or SQL workloads that require fast
iterative access to datasets. With Spark running on Apache Hadoop YARN, developers everywhere
can now create applications to exploit Spark’s power, derive insights, and enrich their data science
workloads within a single, shared dataset in Hadoop. The Hadoop YARN-based architecture provides
the foundation that enables Spark and other applications to share a common cluster and dataset
while ensuring consistent levels of service and response. Spark is now one of many data access
engines that work with YARN in HDP. Apache Spark consists of Spark Core and a set of libraries.
The core is the distributed execution engine and the Java, Scala, and Python APIs offer a platform
for distributed ETL application development. Additional libraries, built atop the core, allow diverse
workloads for streaming, SQL, and machine learning.
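Spark's programming model, in which lazy transformations are chained and only executed when an action is called, can be mimicked in plain Python. This is a didactic sketch only; the real Python API is `pyspark`, and a real RDD is partitioned across the cluster rather than held in one list.

```python
# Toy RDD-like class mimicking Spark's lazy transformation/action split.
# Transformations (map, filter) only record work; the action (collect)
# actually runs the pipeline. Purely illustrative, not the pyspark API.
class ToyRDD:
    def __init__(self, data, ops=None):
        self._data, self._ops = data, ops or []

    def map(self, f):                      # transformation: lazy
        return ToyRDD(self._data, self._ops + [("map", f)])

    def filter(self, p):                   # transformation: lazy
        return ToyRDD(self._data, self._ops + [("filter", p)])

    def collect(self):                     # action: triggers execution
        out = self._data
        for kind, f in self._ops:
            out = [f(x) for x in out] if kind == "map" else [x for x in out if f(x)]
        return out

rdd = ToyRDD(range(10)).filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(rdd.collect())   # [0, 4, 16, 36, 64]
```

Recording the operation chain before executing it is what lets the real Spark engine optimise the whole pipeline and distribute it over a YARN cluster.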
- Hive
Hive is a data warehouse system built on top of Hadoop, in which tables are similar to tables in a
relational database and data units are organized in a taxonomy from larger to more granular units.
Databases are composed of tables, which are made up of partitions. Data can be accessed via a
simple SQL-like query language (HiveQL), and Hive supports overwriting or appending data. Within a
particular database, data in the tables is serialized and each table has a corresponding Hadoop
Distributed File System (HDFS) directory. Each table can be sub-divided into partitions that
determine how data is distributed within sub-directories of the table directory. Data within partitions
can be further broken down into buckets. Hive supports all the common primitive data types such as
BIGINT, BINARY, BOOLEAN, CHAR, DECIMAL, DOUBLE, FLOAT, INT, SMALLINT, STRING,
TIMESTAMP, and TINYINT. In addition, analysts can combine primitive data types to form complex
data types such as structs, maps and arrays.
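How a row ends up in a partition sub-directory and a bucket can be sketched as follows. The directory layout follows Hive's `key=value` convention, but the table name, partition column and hash function (`zlib.crc32`) are illustrative stand-ins, not Hive's own implementation.

```python
# Sketch of Hive-style physical layout: table directory -> partition
# sub-directories (key=value) -> N buckets chosen by hashing a column.
# zlib.crc32 stands in here for Hive's own bucketing hash.
import zlib

N_BUCKETS = 4

def hive_path(table, partition_col, partition_val, bucket_col_val):
    """Return the storage path a row would land in, given its column values."""
    bucket = zlib.crc32(str(bucket_col_val).encode()) % N_BUCKETS
    return f"/warehouse/{table}/{partition_col}={partition_val}/bucket_{bucket:05d}"

path = hive_path("trips", "travel_date", "2018-10-31", "user_42")
print(path)
```

Partition pruning works precisely because the partition value is encoded in the directory name: a query filtered on `travel_date` never has to read the other sub-directories.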
- Apache HBase
HBase is an open source, non-relational, column-oriented, versioned, distributed database modelled
after Google's BigTable and written in Java.
This module provides random, real time access to data stored in Hadoop. It was created for hosting
very large tables, making it a great choice to store multi-structured or sparse data. Users can query
HBase for a particular point in time, making "flashback" queries possible. These characteristics
make HBase a great choice for storing semi-structured data, such as log data, and then serving that
data very quickly to users or applications integrated with HBase.
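The versioned, "flashback" behaviour can be sketched with a toy in-memory store (illustrative only; the real client API is HBase's Java/REST interface, and real cells are distributed across region servers):

```python
import bisect

# Toy versioned cell store in the HBase spirit: every put is stamped with a
# timestamp, and reads can ask for the value as of any point in time.
class ToyVersionedStore:
    def __init__(self):
        self._cells = {}          # (row, column) -> sorted list of (ts, value)

    def put(self, row, column, value, ts):
        self._cells.setdefault((row, column), []).append((ts, value))
        self._cells[(row, column)].sort()

    def get(self, row, column, as_of):
        """Return the newest value with timestamp <= as_of (flashback query)."""
        versions = self._cells.get((row, column), [])
        i = bisect.bisect_right(versions, (as_of, chr(0x10FFFF)))
        return versions[i - 1][1] if i else None

store = ToyVersionedStore()
store.put("device1", "status", "OK", ts=100)
store.put("device1", "status", "FAILED", ts=200)
print(store.get("device1", "status", as_of=150))   # OK
print(store.get("device1", "status", as_of=250))   # FAILED
```

Because old versions are never overwritten, the same store answers both "what is the status now?" and "what was the status at time t?", which is exactly the flashback capability described above.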
- Zeppelin
Zeppelin is a collaborative data analytics and visualization open source tool for distributed, general-
purpose data processing systems such as Apache Spark, Apache Flink, and many others. Zeppelin
is a modern web-based tool for the data scientists to collaborate over large-scale data exploration
and visualization projects. It’s a notebook-style interpreter that enables collaborative analysis
sessions sharing between users. Zeppelin is independent of the execution framework itself, because
it includes pluggable interpreter APIs to support any data processing systems. Execution frameworks
that currently work with Zeppelin are Spark, Hive, HBase, Flink, and others in the Hadoop ecosystem.
Zeppelin includes a set of classical basic charts such as bar charts, pie charts, tables, line charts,
histograms, and a few others, and new visualization options can be added by developing them in
JavaScript.
These components may be implemented in different manners and combined with other components.
For the CONNECTIVE project, two architectures are presented below that respect the five layers
presented in the previous section. Components that have been added on top of the standard
Hortonworks components will be presented in the layer where they are used.
4.2.1 STRENGTHS AND BENEFITS
A list of possible advantages of adopting the architecture described in previous section follows:
- the Apache Hadoop ecosystem is composed of several tools that enable a complete BDA
- the Apache Hadoop ecosystem is composed of open source tools backed by an active community
- the architecture is 100% free and deployable on any platform (Windows, Linux, cloud, etc.)
- the architecture allows maximum freedom in terms of future developments, since it is not based
on a proprietary solution
- the interfaces for external tools give the flexibility to extend the solution to better fit a specific
scenario not well covered by the current distribution
4.3 IMPLEMENTATION OF HORTONWORKS BASED ON LAMBDA ARCHITECTURE
(THALES+ANSALDO)
Thales and AnsaldoSTS use an implementation based on the Lambda architecture pattern. Lambda
architecture is an effective data-processing architecture designed to handle massive quantities of
data by taking advantage of both batch and streaming processing methods. This approach enables
the system to manage both historical data records and real-time data streams from operators,
according to the detailed requirements.
The Lambda architecture implemented for CONNECTIVE is composed of the following layers:
Data layer with Kafka:
Kafka is publish-subscribe messaging rethought as a distributed commit log. Kafka is used
to collect all the data and send them to the batch and streaming layers for processing. The
data collection layer is the main interface for getting data from external providers or
producers. This layer supports both active and passive ways of obtaining available data.
Different means may be adopted for obtaining data, such as crawling, legacy APIs, and ETL
(Extract, Transform and Load) technology.
The raw data collected are typically stored in an HDFS file system.
Staging layer with Spark (see description above)
Storage layer with Hadoop HDFS file system (see description above)
The storage layer is composed of a batch sub-layer and a speed sub-layer to respect Lambda
pattern:
o Batch Sub-Layer: the batch layer has two functions: (i) managing the master dataset (an
immutable, append-only set of raw data), and (ii) pre-computing the batch views.
o Speed Sub-layer: the speed layer compensates for the high latency of updates and deals
with recent data only.
In addition to HDFS, which is the ideal solution for raw data, outputs from the staging layer
can be stored in different data stores.
In particular, AnsaldoSTS is exploring the use of Apache Hive and Apache HBase for this
purpose.
Data Analytics layer with Spark (see description above) and Spark Streaming
Batch Sub-Layer: Spark is used for the batch treatment of data.
Speed Sub-layer: Spark Streaming is a Spark component for processing data in streaming mode.
Data presentation layer with ElasticSearch+Kibana
ElasticSearch is an open-source search server based on Lucene. It provides a
distributed, multitenant-capable full-text search engine with a RESTful web interface and
schema-free JSON documents. It is one of the most popular enterprise search engines
in the world. It can be used to search all kinds of documents: text-based documents
for enterprise search, but also numerical data for business analytics. It provides
scalable, near real-time search and supports multi-tenancy.
Elasticsearch takes charge of indexing result data so that they can be queried with low
latency. Besides indexing, the technology applied in this layer enables other features
such as high-concurrency querying and fast building of consolidated views.
Kibana is a flexible visualization tool used to display results on dashboards.
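Putting the layers together, the essence of the Lambda pattern is that a query is answered by merging a pre-computed batch view with a low-latency speed view over recent data only. A minimal sketch (event and field names are illustrative, not CONNECTIVE data):

```python
# Essence of the Lambda pattern: a query merges a batch view computed over the
# immutable master dataset with a speed view covering only events that arrived
# since the last batch run. Event and field names are illustrative.

master_dataset = [("station_A", 10), ("station_B", 7), ("station_A", 3)]
recent_events  = [("station_A", 1), ("station_C", 4)]   # not yet batched

def batch_view(events):
    """Slow, complete pre-computation over the immutable master dataset."""
    view = {}
    for station, count in events:
        view[station] = view.get(station, 0) + count
    return view

def speed_view(events):
    """Low-latency incremental view over recent data only."""
    return batch_view(events)    # same logic, far less data

def query(station):
    """Serving layer: merge batch and speed views to answer a query."""
    return (batch_view(master_dataset).get(station, 0)
            + speed_view(recent_events).get(station, 0))

print(query("station_A"))   # 14  (13 from the batch view + 1 from the speed layer)
```

Once the next batch run absorbs the recent events into the master dataset, the speed view for them is discarded, which is what keeps the speed layer small and fast.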
4.4 IMPLEMENTATION BASED ON LAMBDA ARCHITECTURE (INDRA)
The Indra’s implementation choice follows the lambda architecture. For the implementation and for
building a strong BA analysis related with the object of the analysis, Indra has been decided to use
heterogeneous software tools that are described in the following points.
Data Source layer:
The data source layer is the database where the raw data generated by the rail operator are
stored. The database contains a large number of tables storing the information regarding
Access Gate Control Fare Collection, and is implemented on an Oracle database. Oracle
Database (Oracle DB) is a relational database management system (RDBMS). The system is built
around a relational database framework in which data objects may be directly accessed by
users (or an application front end) through structured query language (SQL). Oracle is a fully
scalable relational database architecture and is often used by global enterprises, which
manage and process data across wide and local area networks. The Oracle database has its
own network component to allow communications across networks.
Data Staging Layer:
Talend: Talend is software for data integration. It provides thousands of must-have
productivity features enabling you to quickly connect, transform and move all of your data.
Talend is a leader in cloud integration solutions and in 2018 was once again recognized by
Gartner, Inc. as a leader in data integration in the 2018 "Magic Quadrant for Data Integration
Tools".
This is Talend's third placement in the Leaders quadrant, acknowledging the company's
completeness of vision and ability to execute.
Figure 12: Gartner Magic Quadrant for Data Integration Tools 2018
Data Storage Layer:
PostgreSQL: PostgreSQL is a powerful, open source object-relational database system that
uses and extends the SQL language combined with many features that safely store and scale
the most complicated data workloads.
PostgreSQL has earned a strong reputation for its proven architecture, reliability, data
integrity, robust feature set, extensibility, and the dedication of the open source community
behind the software to consistently deliver performant and innovative solutions. PostgreSQL
runs on all major operating systems, is ACID-compliant, and has powerful add-ons
such as the popular PostGIS geospatial database extender.
Greenplum Pivotal: Greenplum Database is an advanced, fully featured, open source data
platform. It provides powerful and rapid analytics on petabyte scale data volumes. Uniquely
geared toward big data analytics, Greenplum Database is powered by the world’s most
advanced cost-based query optimizer delivering high analytical query performance on large
data volumes. It is based on an architecture providing automatic parallelization of all data and
queries in a scale-out, shared nothing architecture. It can scale interactive and batch mode
analytics to large datasets in the petabytes without degrading query performance and
throughput.
Data Analytic Layer:
Apache Spark: Spark is a general-purpose data processing engine, an API-powered toolkit
which data scientists and application developers incorporate into their applications to rapidly
query, analyse and transform data at scale. Spark’s flexibility makes it well-suited to tackling
a range of use cases, and it is capable of handling several petabytes of data at a time,
distributed across a cluster of thousands of cooperating physical or virtual servers.
Data Presentation Layer:
Grafana: Grafana is a software tool dedicated to the visualization of large-scale
measurement data in an easy, graphical way. It runs as a web application. Grafana can be
used on top of a variety of different data stores. It is built to make metric and function
editing easy, and provides a query editor customized for the features and capabilities of
each data source.
Figure 13: Grafana Dashboard regarding ATVM devices alarms
Figure 14: Grafana Dashboard regarding change of state from Normal to Out Of Service and Technical alarm ATVM
Apache Superset: Superset is another software tool, currently under evaluation, dedicated
to data visualization. Superset is a data exploration and visualization web application.
Figure 15: Apache Superset Dashboard
5. BENCHMARKS OF NEW BIG DATA ARCHITECTURES
5.1 INTRODUCTION
The current architectures deployed by the partners suffer from some drawbacks:
SQL may perform badly on Hadoop platforms. For example, HIVE, which is a SQL engine, is
not very efficient.
Data visualization may be quite slow when the volume of data to be displayed is large. For
example, visualizing data from the Paris area is quite difficult with elasticsearch+kibana.
Indeed, the Paris area represents more than 60,000 stations and bus and tram stops.
Visualizing all these data along different dimensions, such as date, hour or type of travel, is
quite slow.
Big Data domain is also a very fast evolving environment and new promising Big Data architectures
appear regularly on the market.
Benchmarking is therefore very important for the CONNECTIVE project. It will make it possible to
enrich the existing architectures deployed by the partners and/or to propose new ones.
5.2 BENCHMARK SCOPE
For the core release, investigations have been carried out along two axes:
Performance of SQL in Big Data environment
SQL, with its familiar syntax in the IT community, is a very good way to promote Big Data
architectures and to enlarge their audience. Big Data actors are very active in this area. For
example, the new version of elasticsearch (6.3) allows a SQL query syntax to be used to
analyse data. Another example is the MapD solution (recently rebranded as OmniSci): this
solution was architected to run on GPUs and, in particular, development has focused on
enabling common SQL analytic operations such as filtering (WHERE), segmenting (GROUP BY)
and joining (JOIN) to run as fast as possible at native GPU speed.
New architectures based on GPUs
Originally designed to render video games, GPUs have evolved into general-purpose
computational engines that excel at performing tasks in parallel. The GPUs' prodigious
compute capabilities allow them to excel at many machine learning algorithms and, in
particular, at deep learning. Moreover, the graphics pipeline of the cards means they can
also be used for rendering large datasets in milliseconds. Solutions like Brytlyt and MapD
have built their entire architecture on GPUs.
This scope led us to select the MapD solution for the first benchmark. The solution is presented in the
next paragraph, with some preliminary results.
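The filtering, segmenting and joining operations named above are plain SQL; they can be sketched with SQLite as a stand-in engine. Table and column names below are invented for illustration, and SQLite runs on CPU, whereas MapD/OmniSci executes comparable SQL on GPU.

```python
import sqlite3

# The three analytic operations named above (WHERE, GROUP BY, JOIN) on toy
# trip data. Schema and values are invented for illustration.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE trips(stop_id INT, passengers INT);
    CREATE TABLE stops(stop_id INT, zone TEXT);
    INSERT INTO trips VALUES (1, 30), (1, 20), (2, 5), (3, 40);
    INSERT INTO stops VALUES (1, 'Paris'), (2, 'Paris'), (3, 'Lyon');
""")
rows = con.execute("""
    SELECT s.zone, SUM(t.passengers)          -- segmenting (GROUP BY + SUM)
    FROM trips t JOIN stops s USING (stop_id) -- joining (JOIN)
    WHERE t.passengers > 10                   -- filtering (WHERE)
    GROUP BY s.zone ORDER BY s.zone
""").fetchall()
print(rows)   # [('Lyon', 40), ('Paris', 50)]
```

The appeal of GPU engines is that this same query shape stays unchanged while the execution moves to thousands of parallel GPU cores.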
5.3 MAP-D: AN EXAMPLE OF POWERFUL SQL WITH GPU ARCHITECTURE
As seen in the previous section, MapD’s (now named OmniSci) platform leverages the parallel power
of modern graphics processing units (GPUs) and offers:
an efficient way to perform SQL queries
an immersive, instantaneous and interactive way to explore massive datasets in real time.
With MapD’s software and GPU compute power, it is possible to query and visualize billions of records
in tens of milliseconds. This enables the creation of hyper-interactive dashboards in which dozens of
attributes can be correlated and cross-filtered without lag. and even if MapD is a general-purpose
analytics platform, one of the most interesting interests of the platform is its ability to visualize and
explore large geospatial datasets interactively and at the grain-level.
In January 2017, the Big Data consultant Mark Litwintschik published benchmark comparisons on "The
Taxi Dataset": 1.2 billion individual taxi trips made available by the NYC Taxi and Limousine
Commission (TLC). The benchmark results are very impressive and are summarized in the table below:
If we compare the MapD results (highlighted in green in the previous table), it can be seen that it
outperforms solutions like elasticsearch and Spark (highlighted in orange in the previous table) by a
factor of more than 100, elasticsearch and Spark being the software solutions currently selected in
the implementation described in section 4.3.
MapD could be seen as an alternative or a complement to elasticsearch+Kibana. MapD provides a
variety of connectors to move data from data lakes (like Hadoop, Spark, etc.) into the MapD analytics
platform. And just as Spark brought orders-of-magnitude acceleration to Hadoop, MapD brings similar
speedups to Spark and other existing CPU-based analytics systems.
5.4 MAP-D: FIRST IMPLEMENTATION
For CONNECTIVE, a first implementation of MapD has been done on two datasets.
The two datasets are:
Transportation rail network information in the UK (for predictive maintenance purposes): about
50,000 rail segments with 10 years of historical data.
As this volume is significant but not huge, we also benchmarked the MapD platform on a bigger
dataset of maritime transportation with 1.5 billion records to be displayed.
The test platform on which these datasets were used is a server with 8 NVIDIA GTX 1080 Ti GPUs,
each with 11 GB of VRAM. A 1 TB SSD is used to store the datasets before loading. The tests need
all 8 GPUs only for the map rendering in the maritime (AIS) example. Without the cartography, MapD
manages to balance data between VRAM, RAM and the SSD and keeps response times short.
For both cases, response times when the user performs cross-filtering (the paradigm where a click on
any dimension in a chart simultaneously redraws all the other charts in a dashboard) are very
impressive, even for the biggest dataset.
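Cross-filtering amounts to re-aggregating every other dimension under the current selection, as in the toy model below (records and dimension names are invented; MapD performs this per click over billions of rows on GPU):

```python
from collections import Counter

# Toy cross-filtering: a click selects a value on one dimension, and every
# other chart is recomputed over only the matching records.
# Records and dimension names are illustrative.
records = [
    {"mode": "bus",  "hour": 8},
    {"mode": "bus",  "hour": 9},
    {"mode": "tram", "hour": 8},
    {"mode": "rail", "hour": 8},
]

def cross_filter(records, dimension, value):
    """Recompute all other dimensions for records matching the selection."""
    selected = [r for r in records if r[dimension] == value]
    dims = {d for r in selected for d in r} - {dimension}
    return {d: Counter(r[d] for r in selected) for d in sorted(dims)}

# Clicking the "8h" bar redraws the "mode" chart over the 8h records only:
# one bus, one tram and one rail trip remain.
print(cross_filter(records, "hour", 8))
```

The cost of a click is thus a full filter-and-aggregate pass over the dataset, which is why GPU parallelism matters at the billion-row scale described above.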
Some screenshots are displayed below.
Figure 16: MapD dashboard on rail transport network in the UK (50,000 rail segments over 10 years)
Figure 17: MapD dashboard on maritime transport data (1.5 billion records)
5.5 NEXT STEPS
The benchmark will be continued with the data produced by CONNECTIVE. In particular, precise
benchmarks are needed to quantify the performance degradation when the dataset size becomes
bigger than the available VRAM. The benchmark will also be enriched with comparisons against
other promising MPP (Massively Parallel Processing) solutions based on SQL.
6. CONCLUSIONS
The present deliverable illustrated the architecture as implemented in IT2Rail, together with the
software that was used. A comparison between the two architectures, and the software adopted in
each, could thus be made.
The main differences between the two projects concern, firstly, the use of real data (open data in
some cases and data coming from real operators in others). IT2Rail mainly analysed data that were
processed within its own ecosystem, such as preferences stored in the cloud wallet, disruption
events and, in some specific cases, traveller satisfaction gathered through questionnaires or
weather data. In CONNECTIVE the objective goes beyond what IT2Rail achieved: advanced
Descriptive, Predictive and Prescriptive analytics are or will be introduced, leading to changes in
the implementation choices regarding the software.
With these points in mind, evaluating the software needed to provide a solid base on which BA can
rest is fundamental. As a consequence, another stage is introduced before the final decision on the
software composing the architecture: the use of benchmarks to compare the evaluated analytics
tools. Performance must be high and reliability must be strong. The volumes of data that
CONNECTIVE will treat are considerable, and scalability must be ensured in every layer of the
architecture.
In conclusion, CONNECTIVE is a five-year project that is now entering its second year of
development. Work has been done, but more remains to be developed. The foundation, as shown in
the present deliverable, is in place (even if some of the software described in this document might
be changed if needed). The starting point of the analysis is being built up through the study of the
information stored in the different repositories across Europe.
The next deliverables will show the second stage of development, in which new achievements will be
made, new modules will be introduced, and the questions regarding descriptive, predictive and
prescriptive analysis may be answered.