

A Big Data system supporting Bosch Braga Industry 4.0 strategy

Maribel Yasmina Santos, Jorge Oliveira e Sá, Carina Andrade, Francisca Vale Lima, Eduarda Costa, Carlos Costa, Bruno Martinho, João Galvão⁎

ALGORITMI Research Center, University of Minho, Guimarães, Portugal

ARTICLE INFO

Keywords: Big Data; Industry 4.0; Big Data analytics; Big Data architecture; Bosch

ABSTRACT

People, devices, infrastructures and sensors can constantly communicate, exchanging data and generating new data that trace many of these exchanges. This leads to vast volumes of data, collected at ever-increasing velocities and in growing variety, a phenomenon currently known as Big Data. In particular, recent developments in Information and Communications Technologies are pushing the fourth industrial revolution, Industry 4.0, in which data is generated by several sources such as machine controllers, sensors and manufacturing systems, among others. Joining the volume, variety and velocity of data with Industry 4.0 creates the opportunity to enhance sustainable innovation in the Factories of the Future. Here, the collection, integration, storage, processing and analysis of data is a key challenge, and Big Data systems are needed to link all the entities and data needs of the factory. Thereby, this paper addresses this key challenge, proposing and implementing a Big Data Analytics architecture, using a multinational organisation (Bosch Car Multimedia – Braga) as a case study. In this work, the whole data lifecycle, from collection to analysis, is handled, taking into consideration the different data processing speeds that can exist in the real environment of a factory (batch or stream).

1. Introduction

Nowadays, data is generated at unprecedented rates, mainly due to the advancements in cloud computing, the internet, mobile devices and embedded sensors (Dumbill, 2013; Villars, Olofson, & Eastwood, 2011). The way people interact with organisations, the data produced by organisations' day-by-day activities and the rate at which transactions occur may create unprecedented challenges in data collection, storage, processing and analysis. If organisations find a way to extract business value from this data, they will most likely gain significant competitive advantages (Villars et al., 2011).

Big Data is often seen as a catchword for smarter and more insightful data analysis, but it is more than that: it is about new and challenging data sources that help to understand business at a more granular level, create new products or services, and respond to business changes as they occur (Davenport, Barth, & Bean, 2012). As we live in a world that constantly produces and consumes data, it is a priority to understand the value that can be extracted from it.

Big Data will have a significant impact on value creation and competitive advantage for organisations, enabling, for example, new ways of interacting with customers or developing new products, services and strategies, raising profitability. Another area where the concept of Big Data is of major relevance is the Internet of Things (IoT), seen as a network of sensors embedded into several devices (e.g., applications, smartphones, cars), which is a significant source of data that can bring many organisations, like factories, into the era of Big Data (Chen, Mao, & Liu, 2014). In this context of the factories of the future, the fourth industrial revolution (Industry 4.0) uses technological innovations to enhance productive processes through the integration of more automation, control and information technologies.

To support the data needs of these Factories of the Future, a Big Data Analytics architecture was envisaged and proposed (Santos, Oliveira e Sá et al., 2017), integrating several layers and components for the collection, storage, processing, analysis and distribution of data, making available an integrated environment that supports decision-making at the several levels of the managerial process.

This paper presents a Big Data system that implements and validates a specific set of components of this architecture, using the ongoing work at a multinational organisation (Bosch Car Multimedia – Braga) as a case study. Here, several layers of the architecture, and some of their specific components, are tested, handling the data lifecycle that goes from collection to analysis and visualisation, and also proposing and modelling the data structures needed to support storage and processing in a Big Data Warehouse.

http://dx.doi.org/10.1016/j.ijinfomgt.2017.07.012

⁎ Corresponding author.
E-mail addresses: [email protected] (M.Y. Santos), [email protected] (J. Oliveira e Sá), [email protected] (C. Andrade), [email protected] (F. Vale Lima), [email protected] (E. Costa), [email protected] (C. Costa), [email protected] (B. Martinho), [email protected] (J. Galvão).

International Journal of Information Management 37 (2017) 750–760

Available online 12 August 2017
0268-4012/© 2017 Elsevier Ltd. All rights reserved.

In methodological terms, all this work has been undertaken in order to propose a technological artefact built upon design science (Hevner, March, Park, & Ram, 2004), here made available as a system prototype. The research process uses the Design Science Research Methodology for Information Systems from Peffers, Tuunanen, Rothenberger, and Chatterjee (2007), providing a rigorous way of carrying out design science research.

The rest of the paper is organised as follows. Section 2 summarises the concepts of Industry 4.0 and the Factories of the Future, pointing out the role of Big Data in this fourth industrial revolution. Section 3 describes the evolution of the Business Intelligence and Big Data Analytics area, giving the context for the emergence of the Big Data concept. Section 4 presents the proposed Big Data Analytics architecture, describing its several layers and highlighting the layers and components selected for implementing the proof-of-concept. Section 5 reports the work done to undertake the implementation, addressing the Big Data Warehouse modelling process and all the related tasks, from data integration, cleaning, transformation and loading to data analysis and visualisation. Finally, Section 6 concludes with some remarks and guidelines for future work.

2. Industry 4.0 and the factories of the future

Industry 4.0 is a recent concept that was mentioned for the first time in 2011 at the Hannover Fair in Germany. It involves the main technological innovations applied to production processes in the fields of automation, control and information technologies (Hermann, Pentek, & Otto, 2016). The basic foundation of Industry 4.0 implies that, through the connection of machines, systems and assets, organisations can create smart grids all along the value chain, controlling the production processes autonomously. Within the Industry 4.0 framework, organisations will have the capacity and autonomy to schedule maintenance, predict failures and adapt themselves to new requirements and unplanned changes in the production processes (Jazdi, 2014).

In the context of major industrial revolutions, Industry 4.0 is seen as the fourth industrial revolution. The first industrial revolution, around 1780, essentially consisted in the appearance of the steam engine and the mechanical loom. The second industrial revolution, around 1870, included the use of electric motors and petroleum fuel. The third industrial revolution, around 1970, is recognised in the context of the use of computerised systems and robots in industrial production. Finally, the fourth industrial revolution, occurring now, is where computers and automation come together in an integrated way, i.e., robotics connecting computerised systems equipped with machine learning algorithms, in which the production systems are able to learn from data, enabling increased efficiency and autonomy of the production processes and, also, making them more customisable (Drath & Horch, 2014; Hermann et al., 2016; Jazdi, 2014).

For the development and deployment of Industry 4.0, six principles have been identified that guide the evolution of intelligent production systems for the coming years (Hermann et al., 2016; Kagermann, 2015), namely:

1. Interoperability – systems, people and information transparently intercommunicating in cyber-physical systems (a fusion of the physical and virtual worlds). This allows exchanging information between machines and processes, interfaces and people;

2. Real-time operation capability – instantaneous data acquisition and processing, enabling real-time decision making;

3. Virtualisation – creating smart factories, allowing the remote traceability and monitoring of all processes through the several sensors spread throughout the shop floor;

4. Decentralisation – the cyber-physical systems are spread according to the needs of the production, providing real-time decision-making capabilities. In addition, machines will not only receive commands, but will also be able to provide information about their work cycle. Therefore, the smart manufacturing modules will work in a decentralised way to improve production processes;

5. Service Orientation – use of service-oriented software architectures coupled with the IoT concept;

6. Modularity – production processes that follow demand, with coupling and decoupling of modules in production, giving the flexibility to change machine tasks easily.

Based on the principles described above, Industry 4.0 became possible due to the technological advances of the last decade in the areas of information and engineering.

Fig. 1 shows the key technologies enabling Industry 4.0, namely:

• IoT – consists of networking physical objects, environments, vehicles and machines by means of embedded electronic devices, allowing the collection and exchange of data. Systems that operate on the IoT are endowed with sensors and actuators, the cyber-physical systems, and are the basis of Industry 4.0 (Almada-Lobo, 2016; Hermann et al., 2016; Jazdi, 2014; Kagermann, 2015);

• Big Data – in Industry 4.0 contexts, data is generated by several sources like machine controllers, sensors, manufacturing systems and people, among many others. All this voluminous data, arriving at high velocity and in different formats, is called “Big Data”. The processing of Big Data in order to identify useful insights, patterns or models is the key to sustainable innovation within an Industry 4.0 factory (Lee, Kao, & Yang, 2014);

• Mobile and Augmented Reality – mobile devices with reliable and inexpensive positioning systems allow the representation of real-time positioning on 3D maps, enabling the use of augmented reality scenarios. These are expected to bring tangible gains in areas such as the identification and localisation of materials or containers, or in maintenance-related activities (Almada-Lobo, 2016);

• Additive Manufacturing – technologies like 3D printing will enable more localised, distributed and reconfigurable production, which will completely change supply chains. Also, additive manufacturing is a key enabler of mass customisation, reducing the production time and costs for the creation of unique products (Biller & Annumziata, 2014);

• Cloud – cloud-based manufacturing can be described as a networked manufacturing model with reconfigurable cyber-physical production lines, enhancing efficiency, reducing production costs, and allowing optimal resource allocation in response to customers' variable demand (Almada-Lobo, 2016; Jazdi, 2014; Thames & Schaefer, 2016);

• Cybersecurity – one of the major challenges to the success of Industry 4.0 lies in the security and robustness of Information Systems. Problems such as transmission failures in machine-to-machine communication, or even eventual “gagging” of the system, can cause production disruption. With all this connectivity, systems will also need to protect the organisation's know-how embedded in the processing control files (Sommer, 2015; Thames & Schaefer, 2016).

Fig. 1. Enabling technologies for Industry 4.0.

In this context of Industry 4.0, people need to adapt their skills to the needs of the Factories of the Future. Manual labour will be replaced by specialised labour, raising new opportunities for very well-trained professionals, in an environment of huge technological variety and challenges (Hermann et al., 2016).

Summarising, when implementing an Industry 4.0 scenario, the focus is not on new technologies, but on how to combine them in a new way, considering three levels of integration: the cyber-physical objects level; the (Big) data infrastructure and models of the mentioned cyber-physical objects; and the services based on the available (Big) data (Drath & Horch, 2014).

3. Business intelligence and Big Data analytics evolution

Over the last years, the interest in Big Data has increased considerably (Trends, 2016), particularly after 2012, as can be seen in Fig. 2. It is important now to look back and see the evolution of data analytics in Business Intelligence (BI) systems and, after that, how we arrived at the Big Data era.

Fig. 2. Increased interest in Big Data. Retrieved from Trends (2016).

Looking back to 1958, Hans Peter Luhn, a researcher at IBM, proposed an automatic system for the dissemination of information to the several players of any industrial, scientific or governmental organisation. The system was based on the use of data-processing machines to provide useful information to those who need it. The processing capabilities were based on statistical procedures and complemented with proper communication facilities and input-output equipment, providing a comprehensive system that accommodates all the information needs of an organisation (Luhn, 1958).

The key point in Luhn's proposal was to optimise business using data, a concern that is maintained in more recent definitions of the BI area. Looking at the Gartner IT glossary, BI is nowadays defined as “an umbrella term that includes the applications, infrastructure and tools, and best practices that enable access to and analysis of information to improve and optimise decisions and performance” (Gartner, 2017). Although a broader definition, the focus remains on data processing capabilities that provide useful information and insights for improving business.

Looking in the same glossary for the definition of Big Data, it is defined as “high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation” (Gartner, 2017). Putting aside Big Data characteristics like volume, velocity and variety, the key asset is still information and data processing for supporting the decision-making process.

Given this context, an evolution can be seen from BI to Big Data in terms of the supporting technologies and development frameworks, although the organisational role remains the same: processing capabilities that give decision-makers useful insights into the business.

The evolution from BI, or from Business Intelligence and Analytics (BI & A), to Big Data is addressed in the work of Chen, Chiang, and Storey (2012), which makes a retrospective characterisation of BI & A itself and shows what changes with Big Data. For these authors, the term business analytics was introduced to refer to the key analytical components of BI, whereas Big Data is used to describe datasets that are so large and complex that they require advanced and unique data storage, management, analysis and visualisation technologies. In this context, Big Data analytics offers new research directions for BI & A.

Retrospectively, Chen et al. (2012) propose a framework that characterises BI & A in three eras, BI & A 1.0, BI & A 2.0 and BI & A 3.0, tracing the evolution over the years, the applications and the emerging research areas with different data sources, as can be seen in Fig. 3. From its inception to date, the concept of BI has been used in specific applications such as health, commerce and sales, government, security, and even science and technology.

Fig. 3. BI & A Evolution, applications and research. Adapted from Chen et al. (2012).

In BI & A 1.0, data is mostly structured, distributed across several data sources that include legacy systems, and often stored in Relational Database Management Systems (RDBMS). Data Warehouses (DW) are a foundation of this era, and DW schemas are essential for integrating and consolidating enterprise data, supported by Extraction, Transformation and Loading (ETL) mechanisms. Online Analytical Processing (OLAP) and reporting tools based on intuitive graphics are used to explore data, providing interactive environments for ad-hoc query processing, complemented by statistical methods and data mining algorithms for advanced data analytics.

BI & A 2.0 started to emerge when the Internet and the Web offered new ways of data collection and analytics. In these contexts, detailed and IP-specific user search and interaction logs are collected through cookies and server logs, allowing the exploration of customers' needs and potentiating the identification of new business opportunities. This era is centred on text and web analytics over unstructured data, using data analytics techniques such as web intelligence, web analytics, text mining, web mining, social network analysis or spatial-temporal data analysis (Chen et al., 2012).

BI & A 3.0 emerges with the new role of mobile devices and their increasing use in our modern society. Mobile phones, tablets, sensor-based Internet-enabled devices, barcodes and radio tags, communicating together in the IoT, support mobile, location-aware, person-centred and context-relevant operations (Chen et al., 2012). In the context of vast amounts of web-based, mobile and sensor-generated data arriving at ever-increasing rates, such Big Data will drive the identification of new insights that can be obtained from highly detailed data.

4. Big Data analytics architecture for Industry 4.0

Moving towards Industry 4.0 requires the adoption of proper Big Data technologies that can be integrated in order to fulfil the data collection, storage, processing and analysis needs. Fig. 4 shows the proposed architecture and its main layers and components, following our previous work in Santos, Oliveira e Sá et al. (2017), for Big Data Analytics in Industry 4.0. This proposal benefits from state-of-the-art work, both in the identification of its main components and in the identification of the Big Data technologies to be adopted (Costa & Santos, 2016a). Besides, as can be seen in Fig. 4, some components of this architecture were already tested and included in the proof-of-concept described in this paper, used to validate the proposed architecture. These components are highlighted in the figure and create a data workflow that goes from data collection to data visualisation, a process that will be explained in more detail in Section 5.

Fig. 4. Big Data Architecture for Industry 4.0.

The architecture is divided into seven layers, each layer including components, some of which are already associated with technological tools. While in our previous version several components were instantiated with technologies, now only those implemented in the proof-of-concept are associated with technologies. In Fig. 4, each layer is represented by a rectangle, while dashed rectangles are used to specify the components and associated technologies, when applicable. Data flows between layers are also represented in this figure.

The Entities/Applications layer represents all Big Data producers and consumers, for instance customers, suppliers and managers (at several managerial levels), among others. These entities are usually consumers of raw data, indicators or metrics, like Key Performance Indicators (KPI), available from the Data Storage (through the Raw Data Publisher layer) and Big Data Analytics layers.

The Data Sources layer includes components such as Databases (operational/transactional databases), Dynamic-link libraries (DLL), Files, ERPs, E-Mail, Web Services or Custom Code, among others. These components can generate data with low velocity and concurrency (for instance, data from periodical readings from databases), or data with a high degree of velocity and concurrency (for instance, data streams).

The Data Preparation layer (ETL/ELT) corresponds to the process of extracting data from the data sources to the Data Storage layer. Among the several technologies that can be used to implement the Data Preparation process, Talend is here used for integrating data from multiple data sources. Talend is a data integration platform with several elements used to do data extraction, transformation and loading, making available connectors for the file system in Hadoop, NoSQL databases, among others (Talend, 2016).

The Data Storage layer has different components that will be used in different contexts (a minimal sketch of the two resulting storage paths is given after the list):

• For real-time, data streams will be stored in a real-time fashion in a NoSQL database. There are several NoSQL technologies available, such as column-based, document-based, graph-based, key-value or multi-model, like HBase, Cassandra, MongoDB, CouchDB, DynamoDB, Riak, Redis or Neo4J, among many others. Based on the work of Costa and Santos (2016b), the most adequate NoSQL databases for real-time environments are Cassandra and HBase, Cassandra being selected in this work due to its compatibility with Presto.

• The Staging Area and Big Data Warehouse (BDW) components save data in a more historical perspective. In the Staging Area component, data is stored in the Hadoop Distributed File System (HDFS) and available for further use during a delimited period of time. For the BDW, the data previously loaded into the staging area is extracted, transformed and loaded into the BDW, being available for data analytics through the SQL Query Engine component (of the Big Data Analytics layer). Hive is the technology used for the BDW, an infrastructure similar to a traditional DW, which is built on HDFS, enabling distributed storage and processing for storing and aggregating large volumes of data.
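To make these two storage paths concrete, the following minimal Python sketch writes one stream record into Cassandra and lands one batch extract in the HDFS Staging Area. The hosts, keyspace, table, columns and paths are illustrative assumptions, not the actual configuration of the case study.

```python
# Minimal sketch of the Data Storage layer's two paths; hosts, keyspace,
# table and path names are illustrative assumptions.
from cassandra.cluster import Cluster  # pip install cassandra-driver
from hdfs import InsecureClient        # pip install hdfs

# Real-time path: a data stream record is written straight into Cassandra.
session = Cluster(["cassandra-node"]).connect("factory")
session.execute(
    "INSERT INTO qcs (qc_notification_code, status, quantity) "
    "VALUES (%s, %s, %s)",
    ("QC-0001", "open", 3),
)

# Batch path: a periodic extract lands in the HDFS Staging Area, from which
# it is later transformed and loaded into the Hive-based Big Data Warehouse.
hdfs_client = InsecureClient("http://namenode:50070", user="etl")
hdfs_client.upload("/staging/idcs/idcs_extract.csv", "idcs_extract.csv")
```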

The Raw Data Publisher layer enables downloading data available in the Data Storage layer by using Web Services. This interface is included to avoid direct accesses to the Hadoop cluster (where the data is stored) by other Entities/Applications, while still making available a way to access, share and distribute data to the several users in the factory.

The Big Data Analytics layer includes components that facilitate the analysis of vast amounts of data, making available different data analysis techniques, namely:

• Data Visualisation – a component used for the exploration and analysis of data through intuitive and simple graphs;

• Data Mining (or Knowledge Discovery) – the component responsible for identifying new patterns and insights in data;

• Ad-hoc Querying – a component that allows the interactive definition of queries on data, attending to the users' analytical needs. Queries are defined on-the-fly, mostly depending on the results of previous analyses of the data. This component must ensure an easy-to-use and intuitive querying environment;

• Reporting – the component that organises data into informational summaries in order to monitor how the different areas of a business are performing; and,

• SQL Query Engine – this component provides an interface between the other components in this layer and the Data Storage layer.

In this layer, different technologies can be used, for example R, Weka and Spark, and commercial tools like Tableau, SAS, PowerPivot, QlikView or SPSS, among others. In the Big Data system here presented, Presto was chosen as the SQL Query Engine, due both to its connector to the NoSQL database Cassandra and to its good results in a recently performed benchmark (Santos, Costa et al., 2017). Nevertheless, it is worth mentioning that many other technologies could be used instead of Presto, such as Impala, HAWQ, IBM Big SQL or Drill, among others. For Data Visualisation, Tableau was selected due to its successive good evaluations: for the fourth year, Gartner named Tableau a leader in the Magic Quadrant for BI & A platforms (Tableau, 2016).

Finally, the Security, Administration and Monitoring layer includes components that provide base functionalities needed by the other layers and that ensure the proper functioning of the whole infrastructure. In this layer, the components needed are:

• Cluster Tuning and Monitoring – detects bottlenecks and improves performance by adjusting some parameters of the adopted technologies;

• Metadata Management – the needed metadata can be divided into three categories:
○ Business – describes the data ownership information and business definition;
○ Technical – includes database systems' names, table definitions, and data characterisation like columns' names, sizes, data types and allowed values;
○ Operational – description of the data status (active, archived, or purged), history of the migrated data and the transformations applied to it.

• Authorisation and Auditing – user authorisations, data access policy management and the tracking of users' operations are represented in this component;

• Data Protection – associated with policies for data storage, allowing data to be encrypted or not, depending on how critical or sensitive the data is;

• Authentication – represents the authentication of the users in the Big Data infrastructure, here shortly named the Big Data cluster.

5. Bosch’s Big Data warehouse: a case study

After the proposal of a Big Data Analytics architecture aligned with Industry 4.0 needs, its implementation is now of major relevance, a task that will validate all the work undertaken so far. This validation is being made in an organisation which is aligned with Industry 4.0 concepts, Bosch Car Multimedia in Braga, Portugal.

For this purpose, and due to its complexity, the validation of this architecture needs to be made in phases. This first validation phase is focused on the data workflow previously shown in Fig. 4, which highlights the components that are here used and tested, from data collection to data visualisation.

In this data workflow, data selected mostly from the organisational SAP ERP is regularly extracted and stored as Excel files in specific folders, being available for the needed ETL processes, implemented with the use of Talend. As the available data contains historical and up-to-date transactional data, which can be collected in real time, all components of the Data Storage layer are used, namely Cassandra, HDFS and Hive. For data processing, supporting analytical tasks, Presto runs the required queries, while Tableau allows the visualisation of the obtained results.

The data used to implement the system and validate the architecture comes from Bosch customers' Quality Complaints (QCs) and Internal Defect Costs (IDCs). Due to privacy concerns, all the presented values and descriptions were manipulated to mask or hide the real values and names.

As one of the central components of the Big Data architecture is its DW, which is responsible for the integration and consolidation of data from different business processes, this paper describes how this repository is modelled using a methodological approach for building BDWs (Costa & Santos, 2017; Santos, Martinho, & Costa, 2017). This approach includes prescriptive models and methods that guide the design and implementation of complex analytical systems, and is the basis both for setting the BDW model and for a Big Data Application Provider method, which is divided into three main phases:

1. Collection: refers to data acquisition and metadata creation;
2. Preparation: refers to data validation, cleansing, outlier removal, standardisation and reformatting;
3. Access, Analytics & Visualisation: implements the techniques to extract knowledge from the data, represented in an optimised communication format, which involves the production of reports or graphs for analysis.

5.1. Big Data model

To start the architecture validation, it was necessary to analyse the data that would be used. For that, Bosch provided a list of all the attributes used in the two selected areas (QCs and IDCs) and a data sample for them. All the attributes that seemed relevant for analysis were included in an Entity–Relationship Diagram (ERD), enhancing the understanding of the entities, their attributes and relationships. Taking the IDCs as an example, Fig. 5 presents the obtained ERD (the QCs ERD is not depicted here due to its complexity, as it includes more than 200 attributes distributed over 50 tables).

Fig. 5. IDCs ERD.

Based on the two ERDs, QCs and IDCs, and following two different variants of the methodological approach, one was first transformed into a multidimensional model and then into a Hive data model, for the QCs, while the other was directly transformed into a Hive data model, for the IDCs. Both Hive data models are here named the Big Data Model. To define it, the first step consists in identifying the analytical objects (in this case the QCs and IDCs), which have descriptive and analytical attributes and an associated granularity. The descriptive attributes are those allowing the interpretation of the analytical attributes, considering the possibility of different perspectives, using aggregation or filter operations, for example. The analytical attributes are the ones with numeric values that can be analysed using the different descriptive attributes, as happens in a traditional DW (Kimball & Ross, 2013). In addition, to optimise query performance, the big data models can be materialised as tables.

In the case of the IDCs, the analytical object identified is the IDC, and the granularity of the analytical object is the IDC row, i.e., each IDC identified by an “IDC_Code”. The attributes identified as analytical are the factual ones, namely the quantity of affected units (“Quantity”), the value associated with the IDC (“Currency_Value”) and the informative flags (“Flag_Valuated”, “Flag_Cancellation”). The remaining attributes were identified as descriptive.

There is one fundamental difference between the IDCs and the QCs datasets, which is of major relevance for explaining how the proposed Big Data Analytics architecture deals with data arriving at different speeds. In the Bosch case study, IDCs are directly stored in the BDW component (with the intervention of the Staging Area component), while QCs are stored in the real-time component or in the BDW component, depending on their status (for instance, whether the QC is already finished or not). This design decision is justified because the Hadoop BDW (HDFS and Hive) can lack efficient support for fast and constant random-access inserts/updates. During a certain time frame, QCs are constantly being updated (e.g., status changes), which leads to the need of using a NoSQL database to support analytical tasks, namely Cassandra in this case study.

However, as Costa and Santos (2017) demonstrate, NoSQL databases are OLTP-oriented and can lack support for fast sequential access over large amounts of data, typically required in analytical environments. In contrast, the Hadoop BDW is the main choice for fast sequential access and, therefore, after QCs are marked as “closed”, they are transferred from Cassandra to Hive. This represents a mix between real-time and historical analytics that can be significantly useful for organisations, combining different perspectives into one single picture. Hive tables and Cassandra column families are modelled by organising data into analytical objects, descriptive attributes and analytical attributes. Fig. 6 presents an extract of the data model used in this case study.

Fig. 6. Bosch data model extract.

As can be seen in Fig. 6, this modelling approach considers historical analytical objects stored as Hive tables and real-time analytical objects stored as Cassandra column families. Among the descriptive attributes of an analytical object (top half of the “idcs” and “qcs” in Fig. 6), one can find two other concepts: primary keys, whose main goal is identical to the one present in the primary keys of traditional databases, i.e., to uniquely identify a record; and partition keys, which allow data distribution according to the values of the partition key. Regarding the Bosch case study, the “creation_year” attribute in the “idcs” Hive table is used to fragment it into different folders according to the year in which the IDC was created. This considerably improves query execution times, since users tend to search IDCs for specific years, typically the current year. There is no need to define a primary key in Hive. In contrast, Cassandra requires a primary key, which in this case is the “qc_notification_code” attribute in the “qcs” column family. This attribute is also the partition key, as, in Cassandra, the first part of the primary key is always the partition key, evenly distributing data throughout the nodes in the cluster according to the range of values of the partition key.
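As an illustration of these two structures, the sketch below creates a Hive table partitioned by “creation_year” and a Cassandra column family keyed by “qc_notification_code”. Only these two attributes come from the text; the remaining columns, the keyspace and the connection details are hypothetical.

```python
# Hypothetical DDL for the two analytical objects of Fig. 6; only
# "creation_year" and "qc_notification_code" come from the paper.
from pyhive import hive                # pip install 'pyhive[hive]'
from cassandra.cluster import Cluster  # pip install cassandra-driver

# Historical analytical object: a denormalised Hive table, fragmented into
# one HDFS folder per year by the partition key (no primary key is needed).
hive.connect("hiveserver").cursor().execute("""
    CREATE TABLE idcs (
        idc_code       STRING,
        department     STRING,
        quantity       INT,
        currency_value DOUBLE
    )
    PARTITIONED BY (creation_year INT)
    STORED AS ORC
""")

# Real-time analytical object: a Cassandra column family whose mandatory
# primary key doubles as the partition key, spreading rows across the nodes.
Cluster(["cassandra-node"]).connect("factory").execute("""
    CREATE TABLE qcs (
        qc_notification_code text PRIMARY KEY,
        department           text,
        status               text,
        quantity             int
    )
""")
```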

This data modelling approach provides significant flexibility, because analytical objects are denormalised structures, unlike traditional fact tables in relational DWs, which rely on constant join operations to answer analytical queries. In Big Data environments, using denormalised structures allows faster execution times and simpler collection, preparation and enrichment processes (Jukic, Jukic, Sharma, Nestorov, & Korallus Arnold, 2017), reducing the time between data collection and data analysis. As depicted in Fig. 6, in cases where different subjects are related (e.g., sales and complaints), analytical objects can be joined or united, even if they are stored in different systems, since Presto can retrieve data from both Hive and Cassandra simultaneously, using a single query. Since join operations can be costly in Big Data environments (Chang, 2015; Floratou, Minhas, & Özcan, 2014; Marz & Warren, 2015; Wang, Qin, Zhang, Wang, & Wang, 2011), one must highlight that these join and union operations are entirely optional, and are only needed if certain queries combine different subjects to answer specific business questions. Moreover, as previously mentioned, the results of complex and long-running queries can be materialised into Hive tables, in order to achieve interactive data visualisation mechanisms.
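A single Presto statement can therefore combine both stores. The query below is only a sketch consistent with the hypothetical structures above: the catalog and schema names ("hive.dw", "cassandra.factory"), the join condition and the column names are assumptions.

```python
# Illustrative cross-store Presto query; catalogs, schemas and columns are
# assumptions consistent with the hypothetical DDL shown earlier.
from pyhive import presto  # pip install 'pyhive[presto]'

cursor = presto.connect(host="presto-coordinator", port=8080).cursor()
cursor.execute("""
    SELECT i.department,
           SUM(i.currency_value)         AS defect_cost,
           COUNT(q.qc_notification_code) AS open_complaints
    FROM hive.dw.idcs AS i
    LEFT JOIN cassandra.factory.qcs AS q
           ON q.department = i.department  -- optional join across stores
    WHERE i.creation_year = 2017           -- prunes Hive partitions
    GROUP BY i.department
""")
print(cursor.fetchall())
```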

5.2. Big Data Application Provider method

As already mentioned, the methodological approach followed in this implementation has three phases, as can be seen in Fig. 7: the Collection, Preparation, and Access, Analytics & Visualisation phases.

Fig. 7. Phases of the followed approach.

The Collection phase refers to data acquisition and is where the metadata are created. The Preparation phase includes the data validation, cleansing, outlier removal, standardisation and reformatting tasks, and the storage of the resulting data. The Access, Analytics & Visualisation phase includes the access to the data to organise it into reports and graphs that support the decision-making process.

5.2.1. Collection phase
As already mentioned, in the proof-of-concept there is not (yet) any direct access to the SAP ERP system, which is why temporary Excel files are used. The data files are periodically extracted and made available for the refreshment of the Data Storage, including Cassandra, HDFS and Hive.

The Collection phase is centred on an automatic Talend job that runs periodically and starts with a set of validation tests on the available source files, ensuring that each source file really exists, has a valid structure and is not empty.

These validation tests are divided into five different steps, as can be seen in Fig. 8 (a sketch of checks (2) to (5) is given after the list), including:

(1) monitoring of error occurrences during all the validation tests, which catches any error occurring in any component of the ETL process. If something fails, the process stops, the error is described in the log file and an email is sent to the Data Storage administrator or responsible person. This email aims to keep the administrator or responsible person updated about the ETL process status. This monitoring is transversal to all validation tests;

(2) verifying the existence of the temporary files in the corresponding folder;

(3) checking whether the MD5 hash of the source file is the same as that of previously loaded files;

(4) verifying that the file is not empty; and,
(5) validating the file structure.

Fig. 8. Validation tests on files (Talend job).
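The Python sketch below re-expresses checks (2) to (5); the expected header, the file paths and the set of known hashes are hypothetical. Any exception raised here stands in for the monitoring of step (1), which would log the failure and e-mail the Data Storage administrator.

```python
# Hedged re-reading of validation steps (2)-(5); the header and paths are
# hypothetical. Exceptions are meant to be caught by the monitoring of
# step (1), which logs the failure and e-mails the administrator.
import csv
import hashlib
import os

EXPECTED_HEADER = ["IDC_Code", "Creation_Date", "Quantity", "Currency_Value"]

def validate_source_file(path: str, known_hashes: set) -> str:
    # (2) the temporary file must exist in the corresponding folder
    if not os.path.isfile(path):
        raise FileNotFoundError(f"{path} does not exist")
    # (4) the file must not be empty
    if os.path.getsize(path) == 0:
        raise ValueError(f"{path} is empty")
    # (3) skip files whose MD5 hash matches an already loaded extract
    with open(path, "rb") as f:
        digest = hashlib.md5(f.read()).hexdigest()
    if digest in known_hashes:
        raise ValueError(f"{path} was already loaded (MD5 {digest})")
    # (5) the structure (here, the header row) must match the expected layout
    with open(path, newline="") as f:
        header = next(csv.reader(f))
    if header != EXPECTED_HEADER:
        raise ValueError(f"{path} has an unexpected structure: {header}")
    return digest  # stored so that future runs can detect duplicates
```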

After the validation tests have concluded successfully, the source file is ready to be loaded into the file system in two different ways. The first is a permanent raw data file, which keeps all the original attributes with no treatment; it will always remain available, as it is, in the system. The second is a temporary file that will be used in the next step as a source for data preparation. In this file, the null rows, the attributes with less than 1% of fulfilment and the other attributes in the list of those with no analytical value are excluded. The remaining attributes are loaded into HDFS.
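A possible pandas rendering of this second, temporary path is sketched below; the 1% fulfilment threshold comes from the text, while the file names and the CSV hand-off towards the HDFS staging area are assumptions.

```python
# Sketch of the temporary-file path: drop null rows and near-empty columns
# before staging; file names are assumptions, the 1% threshold is from the text.
import pandas as pd

df = pd.read_excel("idcs_extract.xlsx")           # periodic SAP ERP extract
df = df.dropna(how="all")                         # exclude fully null rows
keep = df.columns[df.notna().mean() >= 0.01]      # keep columns with >= 1% fulfilment
df[keep].to_csv("idcs_staging.csv", index=False)  # next stop: the HDFS staging area
```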

5.2.2. Preparation phase
To guarantee that the available data can be used for decision support, it is important to certify its quality, which is why data cleaning, data conversion and other operations on the data are needed.

To ensure data quality, a preliminary analysis of the available data was done, identifying the attributes that present anomalies, like erroneous values or missing data fields. In this last case, several attributes with a high percentage of missing values were identified, representing data that does not bring any analytical value.

For erroneous data values, all the identified problems were catalogued and proper data transformation tasks were defined to improve data quality. Table 1 presents the main groups of problems identified in the IDCs (and QCs) data and the appropriate transformations defined to mitigate them.

Table 1. Examples of problems and transformations.

Null fields – For fields that correspond to codes in a hierarchy, nulls are replaced with a code that exists (father or child code); in numeric fields, null values remain, to avoid possibly misleading results; in any other fields, nulls are replaced with the expression “Not Applicable” or “Unknown” (depending on the subject).

Numeric values with dots – Since not all numeric values have the dot separating thousands, the dots are deleted.

Dates separated by dots – The dots are deleted and the field is converted to the date format. In some cases, it is necessary to create three more attributes: “day”, “month” and “year”.

Flag treatment – Replace the “X” value with “1” and the null value with “0”.

Split of values – Some attributes have a specific codification that is only known within the organisation. In these cases, Bosch provided the necessary information to split the data. As an example, the attribute “Batch_Manufacturing” has characters with different meanings; based on this attribute, four new attributes are created: Department, Production Line, Production Shift and Workstation.
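The following pandas sketch illustrates four of the transformations in Table 1. The attribute names follow the text where given (“Currency_Value”, “Creation_Date”, “Flag_Valuated”, “Batch_Manufacturing”), but the date format and the character positions used in the split are invented for the example.

```python
# Illustrative versions of four Table 1 transformations; the date format and
# the split positions are invented, while the column names follow the paper.
import pandas as pd

df = pd.read_csv("idcs_staging.csv", dtype=str)

# Numeric values with dots: delete the thousands separator before converting.
df["Currency_Value"] = pd.to_numeric(
    df["Currency_Value"].str.replace(".", "", regex=False))

# Dates separated by dots: convert, then derive "day", "month" and "year".
dates = pd.to_datetime(df["Creation_Date"], format="%d.%m.%Y")
df["Day"], df["Month"], df["Year"] = dates.dt.day, dates.dt.month, dates.dt.year

# Flag treatment: replace "X" with 1 and null with 0.
df["Flag_Valuated"] = (df["Flag_Valuated"] == "X").astype(int)

# Split of values: four new attributes derived from the coded
# "Batch_Manufacturing" field (character positions are hypothetical).
df["Department"]       = df["Batch_Manufacturing"].str[:2]
df["Production_Line"]  = df["Batch_Manufacturing"].str[2:4]
df["Production_Shift"] = df["Batch_Manufacturing"].str[4:5]
df["Workstation"]      = df["Batch_Manufacturing"].str[5:]
```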

Once the file with the transformed data is obtained, the next step is divided into two different approaches (a minimal routing sketch is given after the list):

• For the IDCs data, the transformed file is saved into HDFS, and a connection with Hive is then needed to create the Hive table and move the data from HDFS to Hive. This Hive table contains the data and the corresponding metadata, and is partitioned by the year of the IDC “Creation_Date”, optimising query processing;

• For the QCs data, and due to specific business rules at Bosch (the QC information is handled over months, changing its status several times during that period), the data goes to a Cassandra column family or to a Hive table, depending on the QC status, as already explained in subsection 5.1.
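A minimal sketch of this routing rule is given below; the “closed” status value, the table names and the connection details are assumptions rather than the actual Bosch configuration.

```python
# Hypothetical routing of a QC record between the real-time store and the
# historical store; status value, table names and hosts are assumptions.
from cassandra.cluster import Cluster
from pyhive import hive

cassandra = Cluster(["cassandra-node"]).connect("factory")
hive_cursor = hive.connect("hiveserver").cursor()

def route_qc(qc: dict) -> None:
    if qc["status"] == "closed":
        # Finished QCs move to the historical store (Hive over HDFS).
        hive_cursor.execute(
            "INSERT INTO qcs_history VALUES (%s, %s, %s)",
            (qc["qc_notification_code"], qc["status"], qc["quantity"]),
        )
    else:
        # Open QCs stay in Cassandra, which copes with frequent updates.
        cassandra.execute(
            "INSERT INTO qcs (qc_notification_code, status, quantity) "
            "VALUES (%s, %s, %s)",
            (qc["qc_notification_code"], qc["status"], qc["quantity"]),
        )
```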

5.2.3. Access, analytics & visualisation phase

With the data available in Hive and Cassandra, Tableau is used to visualise and analyse the information in an interactive way. Presto is the technology used in the SQL Query Engine component to connect Tableau to Hive and/or Cassandra, providing the data used to produce dynamic dashboards about IDCs and QCs.

Fig. 9 presents an example of a visualisation in Tableau, showing a storytelling view and the respective selected dashboard. On top, the storytelling view is used to navigate between the several dashboards. Below, a specific dashboard shows, for the years 2014 and 2015, the customer QCs by production date, categorised by Business Unit and coloured by part number (corresponding to specific pieces of produced components).

Fig. 9. Dashboard Example (for QCs).

In this dashboard, it is possible to see different patterns and outliers that could indicate a production problem on specific days. For example, in the PS Business Unit, several dots with the same colour are identified in a delimited area, indicating that the same product type had an abnormal number of complaints during that period, ranging from March to November of 2016. Also, for the other Business Units, several peaks are visible, pointing to possible production problems.

Although the data shown here does not reflect real values, it is possible to see how the proposed system allows data analysis, supporting decision making and improving organisational analytical capabilities.

6. Conclusions

This paper presented the implementation of a Big Data system aimed at validating a Big Data Analytics architecture for Industry 4.0. In this implementation, specific layers of the proposed architecture, and specific components of those layers, were integrated into a data workflow from data collection to data analysis and visualisation.

The presented proof-of-concept showed how these technologies complement each other, pursuing the overall goal of supporting the decision-making process. For this, two specific business processes were selected, QCs and IDCs, and data was modelled, cleaned, transformed and delivered to the data storage components, which are able to deal with historical data and with data that arrives in streams, enabling a just-in-time response from Bosch Braga to possible problems.

At the Big Data Analytics layer, where specific dashboards are made available for data analysis and visualisation, it was possible to see that all the selected technologies worked together and that no integration or interoperability problems were detected.

In the future, it is expected to test other components of the architecture, until all the layers are fully characterised in terms of the technologies that must be used. Some technologies must still be chosen, and others can be replaced over time, if more promising ones emerge. In terms of business processes, others must be integrated to complement the analysis of the organisation and support the decision-making process.

This architecture is prepared to use several data sources to feed the Data Storage layer, one of our priorities being to test streaming data collection. Finally, the Data Mining component should be implemented and tested to push this solution to the next level, i.e., an Adaptive Big Data System that combines prediction and optimisation techniques to assist decision makers in Industry 4.0.

Acknowledgments

This work has been supported by COMPETE: POCI-01-0145-FEDER-007043 and FCT (Fundação para a Ciência e Tecnologia) within the Project Scope: UID/CEC/00319/2013, and by the Portugal Incentive System for Research and Technological Development, Project in co-promotion no. 002814/2015 (iFACTORY 2015–2018). Some of the figures in this paper use icons made by Freepik, from www.flaticon.com.

References

Almada-Lobo, F. (2016). The Industry 4.0 revolution and the future of Manufacturing Execution Systems (MES). Journal of Innovation Management, 3(4), 16–21.

Biller, S., & Annumziata, M. (2014). The future of work starts now – GE reports. Retrieved 21 March 2017, from http://www.gereports.com/post/93343692948/the-future-of-work-starts-now/.

Chang, W. L. (2015). NIST big data interoperability framework: Volume 6, reference architecture. Gaithersburg, MD. http://dx.doi.org/10.6028/NIST.SP.1500-6.

Chen, H., Chiang, R. H., & Storey, V. C. (2012). Business intelligence and analytics: From big data to big impact. MIS Quarterly, 36(4), 1165–1188. http://dx.doi.org/10.1145/2463676.2463712.

Chen, M., Mao, S., & Liu, Y. (2014). Big data: A survey. Mobile Networks and Applications, 19, 171–209. http://dx.doi.org/10.1007/s11036-013-0489-0.

Costa, C., & Santos, M. Y. (2016a). BASIS: A big data architecture for smart cities. SAI Computing Conference (SAI) (pp. 1247–1256). http://dx.doi.org/10.1109/SAI.2016.7556139.

Costa, C., & Santos, M. Y. (2016b). Reinventing the energy bill in smart cities with NoSQL technologies. Transactions on engineering technologies. Singapore: Springer, 383–396. http://dx.doi.org/10.1007/978-981-10-1088-0_29.

Costa, C., & Santos, M. Y. (2017). The SusCity big data warehousing approach for smart cities. In B. C. Desai, J. Hong, & R. McClatchey (Eds.), Proceedings of the 21st International Database Engineering & Applications Symposium (IDEAS 2017) (pp. 264–273). http://dx.doi.org/10.1145/3105831.3105841.

Davenport, T. H., Barth, P., & Bean, R. (2012). How big data is different. MIT Sloan Management Review, 54(1), 43–46.

Drath, R., & Horch, A. (2014). Industrie 4.0 – Hit or hype? IEEE Industrial Electronics Magazine, 8(2), 56–58. http://dx.doi.org/10.1109/mie.2014.2312079.

Dumbill, E. (2013). Making sense of big data. Big Data, 1(1), 1–2. http://dx.doi.org/10.1089/big.2012.1503.

Floratou, A., Minhas, U. F., & Özcan, F. (2014). SQL-on-Hadoop: Full circle back to shared-nothing database architectures. Proceedings of the VLDB Endowment, 7, 1295–1306. http://dx.doi.org/10.14778/2732977.2733002.

Gartner (2017). Gartner IT glossary. Retrieved 23 January 2017, from http://www.gartner.com/technology/it-glossary/.

Hermann, M., Pentek, T., & Otto, B. (2016). Design principles for industrie 4.0 scenarios. 49th Hawaii International Conference on System Sciences (HICSS) (pp. 3928–3937). http://dx.doi.org/10.1109/HICSS.2016.488.

Hevner, A. R., March, S. T., Park, J., & Ram, S. (2004). Design science in information systems research. MIS Quarterly, 28(1), 75–105. http://dx.doi.org/10.2307/25148625.

Jazdi, N. (2014). Cyber physical systems in the context of Industry 4.0. 2014 IEEE Automation, Quality and Testing, Robotics, 2–4. http://dx.doi.org/10.1109/AQTR.2014.6857843.

Jukic, N., Jukic, B., Sharma, A., Nestorov, S., & Korallus Arnold, B. (2017). Expediting analytical databases with columnar approach. Decision Support Systems, 95, 61–81. http://dx.doi.org/10.1016/j.dss.2016.12.002.

Kagermann, H. (2015). Change through digitization – Value creation in the age of Industry 4.0. Management of Permanent Change, 23–45. http://dx.doi.org/10.1007/978-3-658-05014-6_2.

Kimball, R., & Ross, M. (2013). The data warehouse toolkit: The definitive guide to dimensional modeling (3rd ed.). John Wiley & Sons, Inc.

Lee, J., Kao, H. A., & Yang, S. (2014). Service innovation and smart analytics for Industry 4.0 and big data environment. Procedia CIRP, 16, 3–8. http://dx.doi.org/10.1016/j.procir.2014.02.001.

Luhn, H. P. (1958). A business intelligence system. IBM Journal of Research and Development, 2, 314–319. http://dx.doi.org/10.1147/rd.24.0314.

Marz, N., & Warren, J. (2015). Big data: Principles and best practices of scalable real-time data systems. Shelter Island, NY: Manning Publications Co.

Peffers, K., Tuunanen, T., Rothenberger, M. A., & Chatterjee, S. (2007). A design science research methodology for information systems research. Journal of Management Information Systems, 24(3), 45–77. http://dx.doi.org/10.2753/mis0742-1222240302.

Santos, M. Y., Costa, C., Galvão, J., Andrade, C., Martinho, B., Lima, F. V., & Costa, E. (2017). Evaluating SQL-on-Hadoop for big data warehousing on Not-So-Good hardware. Manuscript submitted for publication.

Santos, M. Y., Martinho, B., & Costa, C. (2017). Modelling and implementing big data warehouses for decision support. Journal of Management Analytics, 4(2), 111–129. http://dx.doi.org/10.1080/23270012.2017.130429.

Santos, M. Y., Oliveira e Sá, J., Costa, C., Galvão, J., Andrade, C., Martinho, B., ... Costa, E. (2017). A big data analytics architecture for Industry 4.0. In WorldCIST 2017, Advances in intelligent systems and computing (Vol. 570). Cham: Springer. http://dx.doi.org/10.1007/978-3-319-56538-5_19.

Sommer, L. (2015). Industrial revolution – Industry 4.0: Are German manufacturing SMEs the first victims of this revolution? Journal of Industrial Engineering and Management, 8(5), 1512–1532. http://dx.doi.org/10.3926/jiem.1470.

Tableau (2016). For fourth year, Gartner names Tableau a “leader” in Magic Quadrant | Tableau Software. Retrieved 20 February 2017, from https://www.tableau.com/about/blog/2016/2/fourth-year-gartner-names-tableau-leader-magic-quadrant-49719.

Talend (2016). Big data integration for Spark & Hadoop: Big data system. Retrieved 22 February 2017, from https://www.talend.com/products/big-data/.

Thames, L., & Schaefer, D. (2016). Software-defined cloud manufacturing for Industry 4.0. Procedia CIRP, 52, 12–17. http://dx.doi.org/10.1016/j.procir.2016.07.041.

Trends (2016). Interest in big data over time. Retrieved 15 November 2016, from https://trends.google.pt/trends/explore?date=all&q=big data.

Villars, R. L., Olofson, C. W., & Eastwood, M. (2011). Big data: What it is and why you should care. White Paper, IDC. http://dx.doi.org/10.1080/00049670.2014.974004.

Wang, H., Qin, X., Zhang, Y., Wang, S., & Wang, Z. (2011). LinearDB: A relational approach to make data warehouse scale like MapReduce. International Conference on Database Systems for Advanced Applications, DASFAA 2011 (pp. 306–320). http://dx.doi.org/10.1007/978-3-642-20152-3_23.
