
An XML Interchange Format for ETL Models

Judith Awiti, Esteban Zimányi

Université Libre de Bruxelles, Belgium
judith.awiti,[email protected]

Abstract. ETL tools are responsible for extracting, transforming and loading data from data sources into a data warehouse. Each ETL tool has its own model for specifying ETL processes. This makes it difficult to interchange ETL designs: entire ETL workflows must be redesigned in order to migrate them from one tool to another. It has therefore become increasingly important to have a single conceptual model for ETL processes that is interchangeable between tools. Business Process Model and Notation (BPMN) has been widely accepted as a standard for specifying business processes. For this reason, it has been proposed as an efficient conceptual model of ETL processes. In this paper, we present BEXF, an Extensible Markup Language (XML) interchange format for BPMN4ETL, an extended BPMN model for ETL. It is a format powerful enough to express and interchange BPMN4ETL model information across tools that are compliant with BPMN 2.0. This XML interchange format describes not only the control flow of ETL processes but also the data flow.

Keywords: BEXF, BPMN, BPMN4ETL, ETL, XML

1 Introduction

ETL process development is one of the most complex and costly parts of any data warehouse project. It requires a lot of time and resources since, in practice, ETL processes are designed with a specific tool. SQL Server Integration Services (SSIS), Oracle Warehouse Builder, Talend Open Studio and Pentaho Data Integration (PDI) are examples of such tools. Each ETL tool has its own model for specifying ETL processes, which requires a development team to develop ETL processes according to the capabilities of their chosen tool. This makes it difficult to interchange ETL designs: entire ETL workflows must be redesigned in order to migrate them from one tool to another. For this reason, several conceptual [1,6,7,9,11] and logical models [8,12] have been proposed for ETL process design.

The work in [1,6] proposes BPMN4ETL, a vendor-independent conceptual metamodel for designing ETL processes based on BPMN¹. Using BPMN to specify ETL processes makes them simple and easy to understand. It hides technical details and enables stakeholders to focus on essential characteristics of the ETL processes. BPMN is accepted in the field of modelling business processes and thus provides well-known constructs for any process that has a starting point and an ending point.

¹ https://www.omg.org/spec/BPMN/2.0/About-BPMN/

Recently, relational algebra (RA) has been introduced as a logical model for ETL processes [8]. RA provides a set of operators that manipulate relations in an unambiguous way. It can also be directly translated into SQL to be executed in any Relational Database Management System (RDBMS). The authors of [4] extend RA to model complex ETL scenarios such as Slowly Changing Dimensions with Dependencies, and provide a translation of BPMN4ETL to RA. They propose an ETL development approach which begins with a BPMN4ETL conceptual model that is translated, at the logical level, into RA extended with update operations. This is a Model Driven Architecture (MDA) approach in which platform-independent models (BPMN4ETL and RA) can be implemented as SQL on any RDBMS platform. In view of this, BPMN4ETL diagrams must have an interchangeable format that can be transformed into the above-mentioned extended RA logical model [4].

Unfortunately, no interchange format exists for BPMN4ETL. In this paper, we present BEXF (BPMN4ETL interchange Format), an XML-based model interchange format for BPMN4ETL that expresses and interchanges BPMN4ETL model information across tools. BEXF interchanges not only the graphical design of a BPMN4ETL diagram but also the attributes and manipulations attached to it. This way, a BPMN4ETL diagram can be reproduced in another system with all its hidden details. To the best of our knowledge, BEXF is the first step towards providing an XML interchange format for BPMN4ETL.

The paper is organized as follows. In Section 2 we discuss related work in ETL design. We present our running example in Section 3 and briefly explain the previously proposed BPMN4ETL model in Section 4. In Section 5, we explain the fundamentals of BEXF, and in Section 6 we show by means of an example how a BPMN4ETL model can be translated into BEXF. Lastly, in Section 7 we conclude and mention ways in which this work can be pursued.

2 Related Work

The BPMN4ETL [1,6] conceptual model combines two perspectives, a control process view and a data process view. The control process view consists of all the data processes in the ETL workflow, while the data process view provides more detailed information about the input and output data of each data process in the control process view. BPMN4ETL enables easy communication and validation between an operational database designer, an ETL designer and a business intelligence analyst. It enables stakeholders to see the manipulation of data from one ETL task to the next. Also, BPMN4ETL can be translated directly into relational algebra (RA), XML², Structured Query Language (SQL), or even customized models of vendor tools. In this approach, well-known BPMN operators are customized for ETL design. BPMN gateways are used to control the sequence of activities in the ETL workflow based on conditions. Events show the start and end of the workflow and are also used to handle errors. An activity describes an ETL task that is not further subdivided, whereas a subprocess represents a collection of activities.

² https://www.w3.org/TR/2008/REC-xml-20081126/

Several platform-independent conceptual [1,6,7,9,11] and logical models [8,12] have been proposed. The authors of [3] present a survey of current trends in designing and optimizing ETL processes. On the other hand, each ETL tool has its own specific model for designing ETL processes. Therefore, there is a need to harmonize ETL process development with a common and integrated development strategy. One way to do this is to apply an MDA approach to its development. MDA³ is an approach to software design, development and implementation that separates business and application logic from the underlying platform technology. Applying the MDA approach to ETL process development means developing platform-independent conceptual and logical models and then implementing them with vendor-specific technologies.

In an attempt to provide a single agreed-upon development strategy for ETL processes with BPMN, the authors of [2] introduced a framework for model-driven development (MDD) of ETL processes. In this framework, BPMN4ETL is used to specify ETL processes in a vendor-independent way and is automatically transformed into vendor-specific implementations such as SSIS. Transformations between the vendor-independent model and the vendor-specific code are formally established by using model-to-text transformations⁴, an Object Management Group (OMG) standard for transformations from models to text.

The authors of [5] describe an integration of BPMN and profiled UML languages. They extend UML to accommodate BPMN notations and define transforms between BPMN and UML for the cases that the profiles cannot support. One of the transforms is expressed in XSLT (eXtensible Stylesheet Language Transformations) and operates on BPMN and UML interchange files expressed in XML. The transform files translate BPMN interchange files to UML interchange files and vice versa. The paper focuses on generic BPMN specifications and does not address how the XML interchange file is obtained from the BPMN or UML diagram.

³ https://www.omg.org/mda/
⁴ https://www.omg.org/spec/MOFM2T/1.0

3 Running Example

This section presents an example that is used throughout this paper to illustrate how ETL designs in BPMN4ETL can be translated into BEXF. We reuse a part of the example shown in [10], which describes the ETL process that loads the City dimension table of a data warehouse called NorthwindDW. Fig. 1a shows the schema of a data source text file, TempCities.txt, whereas Fig. 1b shows the schemas of the dimension tables mentioned above. The TempCities.txt file contains three fields, City, State, and Country. A few rows are shown below.

    Aachen - North Rhine-Westphalia - Germany
    Albuquerque - New Mexico - USA
    Sevilla - Madrid - Spain
    Singapore - NULL - Singapore
    Southfield - Michigan - United States of America

For cities located in countries that do not have states, as is the case for Singapore, a null value is found in the second field. A temporary table in the data warehouse, denoted TempCities, is used for storing the contents of this file. The schema of the table is the same as that of the text file given in Fig. 1a. The goal of our ETL process is to load the City dimension table with a StateKey and a CountryKey, one of which is null depending on whether the country is divided into states or not. Note that states and countries come in different forms. For example, the country United States of America can be written as one of its country codes, USA or US, or even in other languages. Therefore, in order to retrieve the CountryKey or StateKey of a particular city, we need to match the different representations of the (State,Country) pairs in TempCities to values in the State and Country tables. Finally, we store city records into the City dimension table and store records for which no state and/or country is found into a text file (BadCities.txt) for future investigation.

(a) TempCities: City, State, Country

(b) State: StateKey, StateName, EnglishStateName, StateType, StateCode, StateCapital, CountryKey
    City: CityKey, CityName, StateKey, CountryKey
    Country: CountryKey, CountryName, CountryCode, CountryCapital

Fig. 1: (a) Schema of the TempCities.txt file; (b) Schema of the City-State-Country dimension tables in NorthwindDW.

4 BPMN4ETL

In this section, we briefly introduce BPMN4ETL by means of our running example. Additional description of the model can be found in [1,6]. An Input Data task inserts records from TempCities into the ETL flow. The first exclusive gateway (G1) tests whether the State attribute is null or not (recall that this is the optional attribute). In the first case, for records with a null value for the State attribute, a lookup obtains the CountryKey. In the second case, we must match (State,Country) pairs in TempCities to values in the State and Country tables. However, as we have explained, states and countries can come in many forms; thus, we need a number of lookup tasks, as shown in the annotations in Fig. 2. Due to space limitations, we only show three lookups, as follows:

– The first lookup processes records where State and Country correspond, respectively, to StateName and CountryName. An example is state Loire and country France.

– The second lookup processes records where State and Country correspond, respectively, to EnglishStateName and CountryName. An example is state Lower Saxony, whose German name is Niedersachsen, together with country Germany.

– Finally, the third lookup processes records where State and Country correspond, respectively, to StateName and CountryCode. An example is state New Mexico and country USA.

The SQL query associated with these lookups is as follows:

    SELECT S.*, CountryName, CountryCode
    FROM State S JOIN Country C ON S.CountryKey = C.CountryKey

A Union task combines the results of the four flows, and the City dimension is populated with an Insert Data task. Recall that in the City table, if a state was not found in the initial lookup (Input1 in Fig. 2), the StateKey attribute will be null; on the other hand, if a state was found, the city has an associated state and, therefore, the CountryKey attribute will be null (Input2, Input3, and Input4 in Fig. 2). Finally, records for which the state and/or country are not found are stored in a BadCities.txt file.

[Figure 2: BPMN4ETL diagram consisting of a Start Event, an Input Data task, exclusive gateways G1 to G5, four Lookup tasks, a Union task, three Insert Data tasks, sequence flows S1 to S16, and an End Event. The annotations recoverable from the figure are listed below.]

Input Data: Database: NorthwindDW; Table: TempCities
Gateway G1: Condition: State Null?
Remaining four gateways: Condition: Found?
Lookup (Input1): Retrieve: CountryKey; Database: NorthwindDW; Table: Country; Where: Country; Matches: CountryName
Lookup: Retrieve: StateKey; Database: NorthwindDW; Query: <SQL Query>; Where: State, Country; Matches: StateName, CountryName
Lookup: Retrieve: StateKey; Database: NorthwindDW; Query: <SQL Query>; Where: State, Country; Matches: EnglishStateName, CountryName
Lookup: Retrieve: StateKey; Database: NorthwindDW; Query: <SQL Query>; Where: State, Country; Matches: StateName, CountryCode
Union: Input1: CityName, NULL, CountryKey; Input2, Input3, Input4: CityName, StateKey, NULL; Output: CityName, StateKey, CountryKey
Insert Data: Database: NorthwindDW; Table: City
Insert Data (two tasks): File: BadCities.txt; Type: Text

Fig. 2: Load of City dimension table

5 BEXF Fundamentals

When translating ETL processes modeled with BPMN 2.0 into an XML format, we consider two types of interchangeable information.

– Semantics information: This comprises the building blocks or objects of the graphical model, that is, the activities, events, control objects, connecting objects and swimlanes, together with their attributes and manipulations.

– Visual appearance information: This comprises information about the layout of the graphical model, that is, the shapes and positions of the graphical elements.

In this paper, we concentrate on the interchange of semantics information from BPMN4ETL to BEXF. Note that visual appearance information can be described by XSL (eXtensible Stylesheet Language), a styling language for XML.

XML is designed to carry data with the use of arbitrary tags. Users are at liberty to define their own tags and document structure. It has therefore been widely applied in different fields; examples include the Mathematical Markup Language (MathML) for describing mathematical notation, and Open Financial Exchange (OFX), a data-stream format for exchanging financial information. XML simplifies data sharing, data transport, platform changes, and data availability, and hence provides an efficient interchange format.

Table 1: BPMN4ETL objects and their BEXF representation

BPMN4ETL Object       Element              BEXF Representation
ETL Process           Process              <ETLProcess>
ETL Activities        Task                 <ETLTask>
                      Subprocess           <ETLSubprocess>
Control Objects       Exclusive Gateway    <ExclusiveGateway>
                      Parallel Gateway     <ParallelGateway>
                      Inclusive Gateway    <InclusiveGateway>
Events                Start Event          <StartEvent>
                      End Event            <EndEvent>
                      Message              <Message>
                      Cancel               <Cancel>
                      Compensate           <Compensate>
                      Terminate            <Terminate>
                      Time                 <Time>
Connecting Objects    Sequence Flow        <SequenceFlow>
                      Conditional Flow     <ConditionalFlow>
                      Default Flow         <DefaultFlow>
                      Message Flow         <MessageFlow>
                      Association          <Association>
Swimlanes             Pool                 <Pool>
                      Lane                 <Lane>

Table 1 describes the basic BPMN4ETL objects and their BEXF representation. Attributes of each BPMN object are mapped to attributes of the corresponding XML element. Each BEXF element is identified by an id and a name attribute. The previous and next elements of each BEXF element in an ETL flow are referenced by their id attributes. The <inRefId> child element contains the id of the previous element, whereas the <outRefId> child element contains the id of the subsequent element in the flow. As a naming convention, all id values begin with _id. Elements at the same level of a BEXF tree do not have to appear in any particular order: the diagram can be reproduced in BPMN4ETL as long as each element contains the ids of its immediately surrounding elements.
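Because neighbouring elements are referenced by id, the order of a flow can be recovered from an unordered BEXF document. The following Python sketch illustrates the idea; it is not part of BEXF itself. The fragment is assembled from the snippets of Section 6 and listed out of order on purpose. The content of S1 is an assumption inferred from its neighbours, and the function name walk is hypothetical.

```python
import xml.etree.ElementTree as ET

# Elements are deliberately listed out of flow order.
# Assumption: S1 is not printed in the paper; its references are inferred.
BEXF = """
<ETLProcess id="_idProcess" name="Load of City dimension table">
  <SequenceFlow id="_idS2" name="S2">
    <inRefId>_idInputData</inRefId>
    <outRefId>_idG1</outRefId>
  </SequenceFlow>
  <ETLTask id="_idInputData" name="Input Data" type="Input Data">
    <inRefId>_idS1</inRefId>
    <outRefId>_idS2</outRefId>
  </ETLTask>
  <StartEvent id="_idStartEvent" name="Start Event">
    <outRefId>_idS1</outRefId>
  </StartEvent>
  <SequenceFlow id="_idS1" name="S1">
    <inRefId>_idStartEvent</inRefId>
    <outRefId>_idInputData</outRefId>
  </SequenceFlow>
</ETLProcess>
"""


def walk(process, start_id):
    """Follow the first <outRefId> of each element, starting at start_id."""
    elements = {child.get("id"): child for child in process}
    order, seen, current = [], set(), start_id
    # Stop when the referenced id is missing from the fragment or repeats.
    while current in elements and current not in seen:
        seen.add(current)
        element = elements[current]
        order.append(element.get("name"))
        ref = element.find("outRefId")
        current = ref.text.strip() if ref is not None and ref.text else None
    return order


process = ET.fromstring(BEXF)
flow = walk(process, "_idStartEvent")
print(" -> ".join(flow))
```

The walk stops at S2 because the gateway _idG1 is not included in the fragment. Elements with several <outRefId> children, such as gateways, would need a graph traversal rather than this linear walk.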

Being an XML-based language, BEXF exposes all the information in a BPMN4ETL object through attributes and child elements, which allows us to capture the data flow of a BPMN4ETL conceptual model. For example, the expression that calculates the values of an added column in a BPMN4ETL task is stored in an attribute of a BEXF element.

6 From BPMN4ETL to BEXF

We describe below BEXF elements corresponding to BPMN4ETL objects foundin Fig. 2.

Process and Subprocess. An ETL process describes a sequence or flow of activities, events, gateways, and sequence flows with the objective of carrying out work. A typical example is shown in Fig. 2, where the main objective is to load the City dimension table. In BPMN4ETL, some activities of an ETL process can be encapsulated in order to hide their details. Such parts are called subprocesses. For instance, Fig. 2 is a subprocess in the overall ETL process that loads NorthwindDW. As shown in Table 1, we represent processes and subprocesses in BEXF as <ETLProcess> and <ETLSubprocess> elements, respectively. The BEXF of the entire process of Fig. 2 is as follows:

<ETLProcess id="_idProcess" name="Load of City dimension table">
    ...
</ETLProcess>

Control Objects. These are gateways, which control the sequence of activities in an ETL flow. As shown in Table 1, an Exclusive Gateway is represented in BEXF by the <ExclusiveGateway> element. It has a child element called <condition> which contains the condition to be checked. An Exclusive Gateway can have several outgoing connecting objects. These links are represented by two or more <outRefId> child elements. Below, we show the BEXF of the exclusive gateway G1 of Fig. 2, where State = NULL is the condition, _idS2 is the id of the incoming sequence flow S2, and _idS3 and _idS4 are the ids of the outgoing sequence flows S3 and S4, respectively.

<ExclusiveGateway id="_idG1" name="G1">
    <condition>State = NULL</condition>
    <inRefId>_idS2</inRefId>
    <outRefId>_idS3</outRefId>
    <outRefId>_idS4</outRefId>
</ExclusiveGateway>
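As an illustration of how a consuming tool might read such an element, the Python sketch below extracts the condition and the branch references of G1 and checks that there are at least two outgoing references, as the text requires. The function name describe_gateway and the returned dictionary layout are hypothetical and not defined by BEXF.

```python
import xml.etree.ElementTree as ET

GATEWAY = """
<ExclusiveGateway id="_idG1" name="G1">
  <condition>State = NULL</condition>
  <inRefId>_idS2</inRefId>
  <outRefId>_idS3</outRefId>
  <outRefId>_idS4</outRefId>
</ExclusiveGateway>
"""


def describe_gateway(xml_text):
    """Return the condition and the incoming/outgoing ids of a gateway."""
    gateway = ET.fromstring(xml_text)
    outgoing = [e.text.strip() for e in gateway.findall("outRefId")]
    if len(outgoing) < 2:
        # The text describes two or more outgoing links per exclusive gateway.
        raise ValueError("an exclusive gateway needs at least two <outRefId>")
    return {
        "id": gateway.get("id"),
        "condition": gateway.findtext("condition", default="").strip(),
        "incoming": [e.text.strip() for e in gateway.findall("inRefId")],
        "outgoing": outgoing,
    }


info = describe_gateway(GATEWAY)
print(info["condition"], "->", info["outgoing"])
```

Note that the element records which flows leave the gateway but not which one corresponds to the Y or N branch of Fig. 2, so this sketch makes no attempt to label them.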

Events. Events are happenings that affect the flow of an ETL process. The <StartEvent> element contains one or more <outRefId> child elements. More than one <outRefId> represents a scenario where, at the start, the ETL flow is divided into different paths. Similarly, the <EndEvent> element contains one or more <inRefId> child elements, which makes it possible to model scenarios where several paths of an ETL flow end at the same event. The BEXF code below shows the Start Event and the End Event of Fig. 2.

<StartEvent id="_idStartEvent" name="Start Event">
    <outRefId>_idS1</outRefId>
</StartEvent>
...
<EndEvent id="_idEndEvent" name="End Event">
    <inRefId>_idS14</inRefId>
    <inRefId>_idS15</inRefId>
    <inRefId>_idS16</inRefId>
</EndEvent>

Connecting Objects. These are mostly arrows representing the links between BPMN4ETL objects. An example is the Sequence Flow, which represents the sequencing constraint between ETL flow objects. The Sequence Flow S2 of Fig. 2 is represented in BEXF, as shown below, by a <SequenceFlow> element that has one <inRefId> and one <outRefId>.

<SequenceFlow id="_idS2" name="S2">
    <inRefId>_idInputData</inRefId>
    <outRefId>_idG1</outRefId>
</SequenceFlow>

ETL Activities and Tasks. An activity is work performed during an ETL process. Activities are either single tasks or subprocesses. An ETL task is a simple, atomic unit of work. ETL processes and subprocesses contain several ETL tasks. As shown in Table 1, we represent an ETL task with the element <ETLTask>. The type of task is specified by its type attribute. Each ETL task has some peculiarities that distinguish it from other tasks. We describe below the BEXF representations of the ETL tasks in Fig. 2 as well as some other common ETL tasks.

Input Data: The Input Data task inserts records into the ETL flow from a data source. The <Database> and <Table> child elements of the Input Data task contain information about incoming data from a database. If the data comes from a file source, both child elements can be replaced by a <file> child element with a type attribute showing the type of file. This is true for the Insert Data task as well. Below is the BEXF representation of the Input Data task of Fig. 2. In this representation, <Database name="NorthwindDW"/> and <Table name="TempCities"/> specify the location of the TempCities table. All input columns (City, State, Country) are specified in the <inputs> child element.

<ETLTask id="_idInputData" name="Input Data" type="Input Data">
    <Database name="NorthwindDW"/>
    <Table name="TempCities"/>
    <inputs>
        <inputColumn name="City"/>
        <inputColumn name="State"/>
        <inputColumn name="Country"/>
    </inputs>
    <inRefId>_idS1</inRefId>
    <outRefId>_idS2</outRefId>
</ETLTask>
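To give a sense of how such a task could be turned into executable code, the Python sketch below derives a SELECT statement from the Input Data task above. This is only an illustration under the assumption that the source is a database table; the paper does not define this translation, and the function name input_data_to_select is hypothetical.

```python
import xml.etree.ElementTree as ET

INPUT_DATA = """
<ETLTask id="_idInputData" name="Input Data" type="Input Data">
  <Database name="NorthwindDW"/>
  <Table name="TempCities"/>
  <inputs>
    <inputColumn name="City"/>
    <inputColumn name="State"/>
    <inputColumn name="Country"/>
  </inputs>
  <inRefId>_idS1</inRefId>
  <outRefId>_idS2</outRefId>
</ETLTask>
"""


def input_data_to_select(task_xml):
    """Build a SELECT statement from an Input Data task that reads a table."""
    task = ET.fromstring(task_xml)
    if task.get("type") != "Input Data":
        raise ValueError("not an Input Data task")
    table = task.find("Table")
    if table is None:
        # File sources (<file type="..."/>) are outside the scope of this sketch.
        raise ValueError("only database table sources are handled")
    columns = [c.get("name") for c in task.find("inputs").findall("inputColumn")]
    return f"SELECT {', '.join(columns)} FROM {table.get('name')}"


sql = input_data_to_select(INPUT_DATA)
print(sql)
```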

Insert Data: Recall that this task inserts records from the ETL flow into a destination file or database table, depending on the child elements available. The DbCol attribute of the <outputColumn> child element provides the destination column name. Note that in our running example (Fig. 2), the column names of the ETL flow and those of the City dimension table are the same. Therefore, the values of the name and DbCol attributes of each <outputColumn> element are the same.

<ETLTask id="_idInsertData3" name="Insert Data" type="Insert Data">
    <Database name="NorthwindDW"/>
    <Table name="City"/>
    <outputs>
        <outputColumn name="City" DbCol="City"/>
        <outputColumn name="StateKey" DbCol="StateKey"/>
        <outputColumn name="CountryKey" DbCol="CountryKey"/>
    </outputs>
    <inRefId>_idS13</inRefId>
    <outRefId>_idS15</outRefId>
</ETLTask>
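The role of the name and DbCol attributes becomes clearer when the two differ. The Python sketch below uses a hypothetical variant of the task in which the flow column City is written to a destination column CityName (the column name shown in Fig. 1b), and builds a parameterized INSERT statement from it. The variant, the function name insert_data_to_sql and the use of ? placeholders are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical variant: the flow column City maps to the table column CityName.
INSERT_DATA = """
<ETLTask id="_idInsertData3" name="Insert Data" type="Insert Data">
  <Database name="NorthwindDW"/>
  <Table name="City"/>
  <outputs>
    <outputColumn name="City" DbCol="CityName"/>
    <outputColumn name="StateKey" DbCol="StateKey"/>
    <outputColumn name="CountryKey" DbCol="CountryKey"/>
  </outputs>
</ETLTask>
"""


def insert_data_to_sql(task_xml):
    """Return an INSERT statement and the flow columns that feed it, in order."""
    task = ET.fromstring(task_xml)
    table = task.find("Table").get("name")
    columns = task.find("outputs").findall("outputColumn")
    flow_columns = [c.get("name") for c in columns]
    db_columns = [c.get("DbCol") for c in columns]
    placeholders = ", ".join("?" for _ in db_columns)
    statement = (
        f"INSERT INTO {table} ({', '.join(db_columns)}) VALUES ({placeholders})"
    )
    return statement, flow_columns


statement, flow_columns = insert_data_to_sql(INSERT_DATA)
print(statement)
print(flow_columns)
```

The returned list of flow columns gives the order in which values of a flow record would have to be bound to the placeholders.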

Lookup: Several types of Lookup tasks exist, depending on where the lookup data comes from. The <Database> and <Query> child elements, the <Database> and <Table> child elements, or a <file> child element are used if the lookup columns come from the results of a query, a database table, or an external file, respectively. The MatchCol attribute of the <inputColumn> child element provides the corresponding column name in the lookup data. This attribute is absent for unmatched columns. The BEXF of the lookup task that obtains the CountryKey in Fig. 2 is shown below. The BEXF code <inputColumn name="Country" MatchCol="CountryName"/> specifies that the Country column is matched with the CountryName column, whereas <outputColumn name="CountryKey"/> specifies a new column CountryKey derived at the output.

<ETLTask id="_idL1" name="Lookup" type="Lookup">
    <Database name="NorthwindDW"/>
    <Table name="Country"/>
    <inputs>
        <inputColumn name="City"/>
        <inputColumn name="State"/>
        <inputColumn name="Country" MatchCol="CountryName"/>
        <inputColumn name="CountryName" MatchCol="Country"/>
    </inputs>
    <outputs>
        <outputColumn name="City"/>
        <outputColumn name="State"/>
        <outputColumn name="Country"/>
        <outputColumn name="CountryKey"/>
    </outputs>
    <inRefId>_idS3</inRefId>
    <outRefId>_idS11</outRefId>
</ETLTask>
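One possible reading of a table-based Lookup task is a left join between the flow and the lookup table. The Python sketch below derives such a join from a simplified version of the task. It assumes that the lookup table is given as a <Table name="..."/> child (as in the Input Data task), keeps only the Country to CountryName match, and treats every output column that is not an input column as retrieved from the lookup table. The relation name flow, the aliases f and l, and the function name lookup_to_join are hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified fragment: a single matched column, table given by name.
LOOKUP = """
<ETLTask id="_idL1" name="Lookup" type="Lookup">
  <Database name="NorthwindDW"/>
  <Table name="Country"/>
  <inputs>
    <inputColumn name="City"/>
    <inputColumn name="State"/>
    <inputColumn name="Country" MatchCol="CountryName"/>
  </inputs>
  <outputs>
    <outputColumn name="City"/>
    <outputColumn name="State"/>
    <outputColumn name="Country"/>
    <outputColumn name="CountryKey"/>
  </outputs>
</ETLTask>
"""


def lookup_to_join(task_xml, flow="flow"):
    """Express a table-based Lookup task as a LEFT JOIN over the flow."""
    task = ET.fromstring(task_xml)
    table = task.find("Table").get("name")
    inputs = task.find("inputs").findall("inputColumn")
    input_names = [c.get("name") for c in inputs]
    # Pairs (flow column, lookup column) taken from the MatchCol attributes.
    matches = [(c.get("name"), c.get("MatchCol")) for c in inputs if c.get("MatchCol")]
    # Output columns not present in the input are retrieved by the lookup.
    retrieved = [
        c.get("name")
        for c in task.find("outputs").findall("outputColumn")
        if c.get("name") not in input_names
    ]
    condition = " AND ".join(f"f.{a} = l.{b}" for a, b in matches)
    columns = ", ".join(["f.*"] + [f"l.{name}" for name in retrieved])
    return f"SELECT {columns} FROM {flow} f LEFT JOIN {table} l ON {condition}"


join_sql = lookup_to_join(LOOKUP)
print(join_sql)
```

With a left join, records without a match carry a null CountryKey, which a subsequent gateway with the condition Found? could test.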

Union: This task combines data from all incoming paths. Each <input> element contains the input columns of one input path. The default value of an input column can be set with the default attribute. The BEXF representation of the Union task of our running example is shown below. Note that in the first <input> element, we set the default value of StateKey to null. This identifies the path with records that have no StateKey value. For all other input paths, the default value of the CountryKey column is set to null to specify the reverse.

<ETLTask id="_idUnion" name="Union" type="Union">
    <inputs>
        <input>
            <inputColumn name="City"/>
            <inputColumn name="State"/>
            <inputColumn name="Country"/>
            <inputColumn name="StateKey" default="NULL"/>
            <inputColumn name="CountryKey"/>
        </input>
        ...
        <input>
            <inputColumn name="City"/>
            <inputColumn name="State"/>
            <inputColumn name="Country"/>
            <inputColumn name="StateKey"/>
            <inputColumn name="CountryKey" default="NULL"/>
        </input>
    </inputs>
    <outputs>
        <output>
            <outputColumn name="City"/>
            <outputColumn name="State"/>
            <outputColumn name="Country"/>
            <outputColumn name="StateKey"/>
            <outputColumn name="CountryKey"/>
        </output>
    </outputs>
    <inRefId>_idInput1</inRefId>
    <inRefId>_idInput2</inRefId>
    <inRefId>_idInput3</inRefId>
    <inRefId>_idInput4</inRefId>
    <outRefId>_idS13</outRefId>
</ETLTask>
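The effect of the default attribute can be sketched in Python as follows. The fragment is reduced to two input paths and three columns, records are modelled as dictionaries, and the key values are made up for illustration; the function name union_rows is hypothetical. The sketch assumes that the order of the <input> elements matches the order in which the paths are supplied.

```python
import xml.etree.ElementTree as ET

# Reduced fragment: two input paths, three columns each.
UNION = """
<ETLTask id="_idUnion" name="Union" type="Union">
  <inputs>
    <input>
      <inputColumn name="City"/>
      <inputColumn name="StateKey" default="NULL"/>
      <inputColumn name="CountryKey"/>
    </input>
    <input>
      <inputColumn name="City"/>
      <inputColumn name="StateKey"/>
      <inputColumn name="CountryKey" default="NULL"/>
    </input>
  </inputs>
</ETLTask>
"""


def union_rows(task_xml, paths):
    """Concatenate the rows of all paths, filling absent columns with defaults."""
    task = ET.fromstring(task_xml)
    specs = task.find("inputs").findall("input")
    result = []
    for spec, rows in zip(specs, paths):
        columns = [(c.get("name"), c.get("default")) for c in spec.findall("inputColumn")]
        for row in rows:
            combined = {}
            for name, default in columns:
                # A default of "NULL" (or no default at all) becomes None.
                fallback = None if default in (None, "NULL") else default
                combined[name] = row.get(name, fallback)
            result.append(combined)
    return result


# Hypothetical key values, for illustration only.
path_without_state = [{"City": "Singapore", "CountryKey": 7}]
path_with_state = [{"City": "Albuquerque", "StateKey": 12}]
rows = union_rows(UNION, [path_without_state, path_with_state])
for row in rows:
    print(row)
```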

Rename Column: This task changes the name of a column in an ETL flow. Assume that in Fig. 2 there exists a Rename Column task that renames the Country column of the ETL flow to Ctry. As shown below in BEXF, the value of the newname attribute of Country is Ctry.

<ETLTask id="_idRenameColumn" name="Rename Country" type="Rename Column">
    <Column name="Country" newname="Ctry"/>
    <inRefId></inRefId>
    <outRefId></outRefId>
</ETLTask>

Aggregate: This task adds new columns to an ETL flow, computed by applying an aggregate function such as Count, Min, Max, Sum, or Avg to input columns. This is done after partitioning the tuples into groups that have the same values in some columns. We show below the BEXF representation of an aggregate task. Assume that there exists an aggregate ETL task that counts the cities of each state in Fig. 2. <AggColumn/> specifies the column to group by. In this example, the grouping column is State. The order attribute specifies the order of the grouping columns. This attribute is needed when there is more than one grouping column. <NewColumn/> specifies the new column CityCount, which is added to the flow as a result of the function count(City).

<ETLTask id="_idAggregate" name="Aggregate" type="Aggregate">
    <AggColumn name="State" order="1"/>
    <NewColumn name="CityCount" function="count(City)"/>
    <inRefId></inRefId>
    <outRefId></outRefId>
</ETLTask>
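The purpose of the order attribute can be illustrated with a Python sketch that turns an Aggregate task into a GROUP BY query. The fragment below is a hypothetical extension of the example with a second grouping column, Country, listed out of order on purpose. The source relation TempCities and the function name aggregate_to_sql are assumptions; the paper does not prescribe this translation.

```python
import xml.etree.ElementTree as ET

# Hypothetical extension: two grouping columns, listed out of order.
AGGREGATE = """
<ETLTask id="_idAggregate" name="Aggregate" type="Aggregate">
  <AggColumn name="State" order="2"/>
  <AggColumn name="Country" order="1"/>
  <NewColumn name="CityCount" function="count(City)"/>
</ETLTask>
"""


def aggregate_to_sql(task_xml, source):
    """Build a GROUP BY query; grouping columns are sorted by their order."""
    task = ET.fromstring(task_xml)
    grouping = sorted(task.findall("AggColumn"), key=lambda c: int(c.get("order")))
    group_names = [c.get("name") for c in grouping]
    new_columns = [
        f"{c.get('function')} AS {c.get('name')}" for c in task.findall("NewColumn")
    ]
    select_list = ", ".join(group_names + new_columns)
    return f"SELECT {select_list} FROM {source} GROUP BY {', '.join(group_names)}"


agg_sql = aggregate_to_sql(AGGREGATE, "TempCities")
print(agg_sql)
```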

Add Column: This task adds new derived columns to an ETL flow. Assume that in Fig. 2 there exists an Add Column task that adds a column called CityStatus with a value of NEW to show that the city was just added to the dimension table. We show the BEXF representation of this task below. The name attribute contains the name of the added column, whereas the expression attribute contains the expression that computes the value of the newly added column.

<ETLTask id="_idAddColumn" name="Add Column" type="Add Column">
    <Column name="CityStatus" expression="CityStatus = NEW"/>
    <inRefId></inRefId>
    <outRefId></outRefId>
</ETLTask>

Convert Column: This task changes the data type of columns in an ETL flow. A task to convert the data type of a column called CityKey to an integer is specified as follows:

<ETLTask id="_idConvertColumn" name="Convert Column" type="Convert Column">
    <Column name="CityKey" DataType="INTEGER"/>
    <inRefId></inRefId>
    <outRefId></outRefId>
</ETLTask>

Update Column: This task replaces column values in the flow. Assume that there exists a column called NetWorth in the ETL flow of Fig. 2 which stores the amount of money each city has. To deduct 10,000 euros from this amount for cities whose NetWorth values are greater than 1,000,000 euros, we need an Update Column task with a condition, as shown in BEXF below.

<ETLTask id="_idUpdateColumn" name="Update Column" type="Update Column">
    <Column name="NetWorth" expression="NetWorth = NetWorth - 10000"/>
    <Condition>NetWorth > 1000000</Condition>
    <inRefId></inRefId>
    <outRefId></outRefId>
</ETLTask>

Another type of Update Column task replaces column values of a database table or file that correspond to the records in the ETL flow. For such tasks, we add either a <file> child element or <Database> and <Table> child elements to specify the location of the column to update.

7 Conclusion and Future Work

BPMN4ETL is a conceptual model for designing ETL processes. This paper presents BEXF, an XML interchange format that can be used to interchange and reuse BPMN4ETL models. BEXF expresses both the control flow and the data flow of ETL processes through attributes and manipulations. Being a platform-independent language, it can be translated into various implementation platforms such as ETL tools (e.g., Microsoft SSIS or Pentaho PDI) or SQL dialects (e.g., PL/pgSQL for PostgreSQL).

One direction for our future work is to use the Model-Driven Architecture for translating BEXF models into such implementation platforms. We envision using Model-to-Model and Model-to-Text transformations to generate, e.g., DTSX (an XML-based file format for SSIS) or PL/pgSQL. Another future direction is to add visual appearance information such as position, size, color and shape to BEXF elements. Finally, we will tackle the issue of evolving ETL flows upon data source schema changes.

Acknowledgements. The work of Judith Awiti is supported by the European Commission through the Erasmus Mundus Joint Doctorate project Information Technologies for Business Intelligence-Doctoral College (IT4BI-DC).

References

1. Akkaoui, Z.E., Zimányi, E.: Defining ETL workflows using BPMN and BPEL. In: Proc. of the 12th ACM International Workshop on Data Warehousing and OLAP. pp. 41–48. ACM, Hong Kong, China (2009)

2. Akkaoui, Z.E., Zimányi, E., Mazón, J., Trujillo, J.: A model-driven framework for ETL process development. In: Proc. of the 14th International Workshop on Data Warehousing and OLAP, DOLAP 2011. pp. 45–52. ACM, Glasgow, UK (2011)

3. Ali, S.M.F., Wrembel, R.: From conceptual design to performance optimization of ETL workflows: current state of research and open problems. VLDB J. 26(6), 777–801 (2017)

4. Awiti, J., Vaisman, A., Zimányi, E.: From conceptual to logical ETL design using BPMN and relational algebra. In: Proc. of the 21st ACM International Conference on Big Data Analytics and Knowledge Discovery, DAWAK 2019. Springer, Linz, Austria (2019), forthcoming

5. Bock, C., Barbau, R., Narayanan, A.: BPMN profile for operational requirements. Journal of Object Technology 13(2), 1–35 (2014)

6. El Akkaoui, Z., Zimányi, E., Mazón, J.N., Trujillo, J.: A BPMN-based design and maintenance framework for ETL processes. International Journal of Data Warehousing and Mining 9(3), 46–72 (2013)

7. Muñoz, L., Mazón, J.N., Pardillo, J., Trujillo, J.: Modelling ETL processes of data warehouses with UML activity diagrams. In: Proc. of the OTM Confederated International Conferences, OTM 2008. pp. 44–53. Springer, Monterrey, Mexico (2008)

8. Santos, V., Belo, O.: Using relational algebra on the specification of real world ETL processes. In: Proc. of the 13th IEEE International Conference on Dependable, Autonomic and Secure Computing, DASC 2015. pp. 861–866. IEEE, Liverpool, UK (2015)

9. Trujillo, J., Luján-Mora, S.: A UML based approach for modeling ETL processes in data warehouses. In: Proc. of the 22nd International Conference on Conceptual Modeling, ER 2003. pp. 307–320. Springer, Chicago, Illinois, USA (2003)

10. Vaisman, A.A., Zimányi, E.: Data Warehouse Systems: Design and Implementation. Springer (2014)

11. Vassiliadis, P., Simitsis, A., Skiadopoulos, S.: Conceptual modeling for ETL processes. In: Proc. of the 5th ACM International Workshop on Data Warehousing and OLAP, DOLAP 2002. pp. 14–21. ACM, McLean, Virginia, USA (2002)

12. Vassiliadis, P., Simitsis, A., Skiadopoulos, S.: Modeling ETL activities as graphs. In: Proc. of the 4th International Workshop on Design and Management of Data Warehouses, DMDW 2002. pp. 52–61. CEUR-WS.org, Toronto, Canada (2002)