
Ellipse ETL Technical Overview


Contents

Commercial In Confidence
Preface
Summary information
Confidentiality
Document Control
Who should use this guide?
How to use this guide
Purpose
Scope
Overview
Highlights & Benefits
General
PDI Use Cases
Business Intelligence and Data Warehousing
Data Migration
Application Consolidation
Data Synchronization
PDI Features
Uses Industry Standards
Leverages Open Source
Reduces Deployment Complexity
ETL view
What functionality is provided?
How do you access the functionality?
How is it used?
Architecture
Standard Components
Jobs
Pre-built Job Entries
Transformations
Pre-built Transformation Steps
Bulk Loading
Tools and Utilities
Integrated Development Environment
Command-line execution tools
Carte (job server)
Logging


Commercial In Confidence

Copyright 2016 ABB

All Rights Reserved

Confidential and Proprietary

Legal Disclaimer

The product described in this documentation may be connected to, and/or communicate information and data via, a network interface, which should be connected to a secure network. It is your sole responsibility to ensure a secure connection to the network and to establish and maintain appropriate measures (such as but not limited to the installation of firewalls, application of authentication measures, encryption of data, installation of antivirus programs, etc.) to protect the product, the network, your systems, and the interface against any kind of security breach, unauthorised access, interference, intrusion, leakage, damage, or corruption or theft of data. We are not liable for damages or losses related to any such security breach, unauthorised access, interference, intrusion, leakage, damage, or corruption or theft of data.


Preface

This document provides an overview of the Integration Platform. Included in the overview is a description of the architecture and the standard capabilities provided by the product.

Summary information

Confidentiality

The contents of this document are confidential between ABB and its customers. The parties must keep the information herein confidential at all times and not disclose it, or permit it to be disclosed, to any third party, apart from any of their officers, employees, agents or advisers who have a specific need to access the information herein and have agreed to be bound by the terms of confidentiality.

Document Control

Once the project is completed or terminated, this document will revert to an uncontrolled document status. No further advice will be provided, and each recipient may either destroy the document or mark it as obsolete and retain it for future personal reference.

All copies of this document will be issued electronically.

Who should use this guide?

This guide provides information on Ellipse 8 ETL for Ellipse Technical Consultants.

How to use this guide

This guide describes:

• The highlights and benefits of the integration platform

• The architecture relating to ETL

Purpose

This document is a guide for Technical Consultants involved with installing Ellipse 8.

Scope

The following topics are in scope and covered in this document:

• The ETL Platform

• Architecture related to the platform


Overview

The Integration ETL Platform is part of the Business Intelligence Infrastructure offered by ABB. The platform is designed to provide ETL services between the ABB suite of products as well as to external 3rd party systems. At a summary level, the following core components or services are provided by the platform.

• Graphical designer to develop ETL job sequences and data flows

• Connectivity to a wide range of data sources including relational databases, spreadsheets and other diverse data sources.

• Enterprise scalability and performance, including in-memory caching

• Modern, open, standards-based architecture

• Based on open source technology Pentaho Data Integration Platform Community Edition (PDI CE)


Highlights & Benefits

The ETL Platform is based on the Pentaho Data Integration Community Edition (PDI CE), also known as Kettle.

General

Pentaho Data Integration delivers powerful Extraction, Transformation and Loading (ETL) capabilities using an innovative, metadata-driven approach. With an intuitive, graphical, drag and drop design environment, and a proven, scalable, standards-based architecture, Pentaho Data Integration is increasingly the choice for organizations over traditional, proprietary ETL or data integration tools.

PDI Use Cases

Business Intelligence and Data Warehousing

ETL is the most critical component of a business intelligence solution. ETL processes retrieve data from operational systems and prepare it for analysis and reporting. The accuracy and timeliness of the entire business intelligence solution relies on this critical step.

Data Migration

When upgrading or migrating to a new version of a database or application, the data within that application needs to move to the new environment. As part of this move, data needs to be transformed into a format suitable for the new system.

Application Consolidation

Acquisitions, or consolidation of different business units within the same enterprise, often require an application consolidation effort. To move all existing applications to a new central one, data from separate instances needs to merge into a single format used by the new application.

Data Synchronization

In environments where multiple copies of data are used by multiple applications, data consistency is critical. In such environments, low latency processes such as change data capture (CDC) or data replication ensure that all copies are kept consistent and identical.

PDI Features

• PDI uses 100% Java APIs, ideal for integrating into existing applications (a minimal embedding sketch follows this list).

• It uses a multi-threaded engine, with options for scaling out ETL jobs across a cluster of PDI servers.

• PDI offers the ability to run a simple HTTP server, Carte, for the remote execution or clustering of ETL jobs.
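The sketch below illustrates the Java API point above: it loads a transformation definition from a file and runs it from a plain Java program. The file path and class name are hypothetical, chosen only for illustration; KettleEnvironment, TransMeta and Trans are the relevant PDI API classes.

```java
// Minimal sketch (assumed file path and class name): embedding PDI in a Java
// application by loading and running a transformation through the Kettle API.
import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;

public class EmbeddedTransformationRunner {
    public static void main(String[] args) throws Exception {
        // Initialise the Kettle environment (registers step and job entry plugins).
        KettleEnvironment.init();

        // Load a transformation definition from a .ktr file (path is a placeholder).
        TransMeta transMeta = new TransMeta("/opt/etl/transformations/load_datamart.ktr");

        // Create and execute the transformation, then wait for all step threads to finish.
        Trans trans = new Trans(transMeta);
        trans.execute(null);          // no extra command-line arguments
        trans.waitUntilFinished();

        if (trans.getErrors() > 0) {
            throw new IllegalStateException("Transformation finished with errors");
        }
    }
}
```

With the PDI libraries on the classpath, a class like this can be embedded wherever a Java application already runs.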

Uses Industry Standards

The ETL Platform is based on industry standards such as XML and Java and includes an open API.

Leverages Open Source

A benefit of using an open source foundation for the ETL Platform is that no additional licensing is required from 3rd parties to use it. The customer incurs no additional costs for developer tools, runtime licenses, or other fees.

Reduces Deployment Complexity

Keeping with the theme of reduced costs, the ETL Platform has been designed to eliminate as much complexity as possible when deploying. It has been designed to be deployed in a virtualized environment and supports a simple installation process. The Integration Platform does not require any additional components to be deployed other than the base Linux operating system. All required components, including the Java runtime, are included in the installers.

One caveat to this goal is the use of Oracle databases. The Integration Platform installer does not come with the Oracle database software. Because most customers prefer a centrally managed Oracle database instance, this is typically not an issue: the DBA group will already have installed a database on a dedicated server.


ETL view

ETL provides the tools and environment to build and execute data integration tasks from which customers can perform Business Intelligence type analytical reporting.

Extract, transform and load (ETL) view

What functionality is provided?

Data Integration tasks connect to the Ellipse database and extract, transform and load the data into the star schema designed datamart tables. Note that the datamart tables are licensed independently of the core ETL server platform.

How do you access the functionality?

The jobs and transformations designed through the Pentaho GUI are executed on a periodic basis to update the datamarts.

How is it used?

The datamarts are a generic way of delivering the ABB Ellipse 8 data, so that the data can be accessed through the customer's corporate reporting solution, for example Cognos, Business Objects, CorVu, or Hyperion.


Architecture

The ETL platform is composed primarily of standalone Java applications.

In a typical PDI install, there is no server that is permanently running, unless the optional Carte server is configured.

PDI jobs are typically run by a shell script, which starts a Java program that runs to completion and then ends.

This results in a very simple installed architecture.
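As a rough sketch of that run-to-completion model, the short Java program below does what such a wrapper script typically triggers: it loads a job definition from a .kjb file, runs it, and exits with a status code. The job file path and class name are assumptions for illustration; KettleEnvironment, JobMeta and Job are the PDI API classes involved.

```java
// Minimal sketch (assumed job file path and class name): a short-lived Java program
// of the kind started by a wrapper script - load a job, run it, and exit.
import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.core.Result;
import org.pentaho.di.job.Job;
import org.pentaho.di.job.JobMeta;

public class RunJobToCompletion {
    public static void main(String[] args) throws Exception {
        // Initialise the Kettle environment (plugin registry, logging, etc.).
        KettleEnvironment.init();

        // Load the job definition from a .kjb file (no repository is used here).
        JobMeta jobMeta = new JobMeta("/opt/etl/jobs/nightly_load.kjb", null);

        // Run the job and wait for it to finish before the JVM exits.
        Job job = new Job(null, jobMeta);
        job.start();
        job.waitUntilFinished();

        // A non-zero exit code signals failure to the calling shell script.
        Result result = job.getResult();
        System.exit(result != null && result.getResult() ? 0 : 1);
    }
}
```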

Standard Components

The PDI architecture contains two primary objects: transformations and jobs. The following diagram represents the conceptual model of jobs and transformations and how they relate.

Jobs and Transformations conceptual model

Jobs

Jobs are used to control the execution sequence of a data transformation process. Jobs consist of one or more job entries that are executed in sequence. The order of execution is determined by the order of the job entries. As well as controlling the execution of data flows, they are typically used to verify that the environment is ready and available for processing and to cater for error handling. Job hops define the execution path between job entries and can have three types of evaluation:

Unconditional: This means that the next job entry is executed no matter what happened in the previous one. This evaluation type is indicated by a lock icon over a black hop arrow, as shown in the following screenshot.

Follow when result is true: This job hop path is followed when the result of the previous job entry execution was true. This typically means that it ran without a problem. This type is indicated with a green success icon drawn over a green hop arrow.

Follow when result is false: This job hop path is followed when the result of the previous job entry execution was false, or unsuccessful. This is indicated by a red stop icon drawn over a red hop arrow.

The following screenshot is an example job.


Example job screenshot

Pre-built Job Entries

Job entries are the building blocks of jobs. Because a job executes job entries sequentially, you must define a starting point. This starting point comes in the form of the special job entry called Start (see the screenshot above). As a consequence, you can only put one Start entry in any one job.

The following table lists the types of pre-built job entries available out of the box. Any additional functionality required can be delivered using the scripting steps listed, or the platform can be extended using the published open APIs.

Bulk loading
BulkLoad from Mysql into file: Load from a MySQL table into a file
BulkLoad into MSSQL: Load data from a file into an MSSQL table
BulkLoad into Mysql: Load data from a file into a MySQL table
MS Access Bulk Load: Load data into a Microsoft Access table from a CSV file. Attention: at the moment only insertion is available; if the target table exists, a new one will be created and the data inserted.

Conditions
Check Db connections: Check if we can connect to one or several databases
Check files locked: Check if one or several files are locked by another process
Check if a folder is empty: Check if a folder is empty
Check webservice availability: Check if a web service is available
Checks if files exist: Checks if files exist
Columns exist in a table: Check if one or several columns exist in a table on a specified connection
Evaluate files metrics: Evaluate file size or file count
Evaluate rows number in a table: Evaluate the content of a table; you can also specify a SQL query
File Exists: Checks if a file exists
Simple evaluation: Evaluate one field or variable
Table exists: Checks if a table exists on a database connection
Wait for: Wait for a delay

File encryption
Decrypt files with PGP: Decrypt files encrypted with PGP (Pretty Good Privacy). This job entry needs GnuPG to work properly.
Encrypt files with PGP: Encrypt files with PGP (Pretty Good Privacy). This job entry needs GnuPG to work properly.
Verify file signature with PGP: Verify a file signature with PGP (Pretty Good Privacy). This job entry needs GnuPG to work properly.

File management
Add filenames to result: Add filenames to result
Compare folders: Compare two folders (or two files)
Convert file between DOS and UNIX: Convert file content between DOS and UNIX. Converting to UNIX will replace CRLF (Carriage Return and Line Feed) by LF (Line Feed).
Copy Files: Copy files
Copy or Move result filenames: Copy or move result filenames
Create a folder: Create a folder
Create file: Create (an empty) file
Delete file: Delete a file
Delete filenames from result: Delete filenames from result
Delete files: Delete files
Delete folders: Delete specified folders. Attention: if a folder contains files, PDI will delete them all!
File Compare: Compare 2 files
HTTP: Gets or uploads a file using HTTP
Move Files: Move files
Unzip file: Unzip a file into a target folder
Wait for file: Wait for a file
Write to file: Write text content to a file
Zip file: Zip files from a directory and process the files

File transfer
FTP Delete: Delete files on a remote host
Get a file with FTP: Get files using FTP (File Transfer Protocol)
Get a file with FTPS: Get a file with FTP secure
Get a file with SFTP: Get files using SFTP (Secure File Transfer Protocol)
Put a file with FTP: Put a file with FTP
Put a file with SFTP: Put files using SFTP (Secure File Transfer Protocol)
SSH2 Get: Get files using SSH2
SSH2 Put: Put files on a remote host using SSH2
Upload files to FTPS: Upload files to an FTP secure server

General
Example plugin: This is an example test job entry for a plugin
Job: Executes a job
Set variables: Set one or several variables
Success: Success
Transformation: Executes a transformation

Hadoop
Hadoop job executor: Execute a map/reduce job contained in a jar file
Pig Script Executor: Execute a Pig script on a Hadoop cluster

Mail
Get mails (POP3/IMAP): Get mails from a POP3/IMAP server and save them into a local folder
Mail: Sends an e-mail
Mail validator: Check the validity of an email address

Palo
Palo Cube Create: Creates a cube on a Palo server
Palo Cube Delete: Deletes a cube on a Palo server

Scripting
JavaScript: Evaluates the result of the execution of a previous job entry
Shell: Executes a shell script
SQL: Executes SQL on a certain database connection

Utility
Abort job: Abort the job
Display Msgbox Info: Display a simple message box with information
Ping a host: Ping a host
Send information using Syslog: Sends information to another server using the Syslog protocol
Send SNMP trap: Send an SNMP trap to a target host
Talend Job Execution: This job entry executes an exported Talend job

XML
Check if XML file is well formed: Check if one or several XML files are well formed
DTD Validator: DTD Validator
XSD Validator: XSD Validator
XSL Transformation: Make an XSL transformation

Table: Types of prebuilt job entries available out of the box

Transformations

A transformation handles the manipulation of rows of data to perform the extraction, transformation, and loading process. It consists of one or more steps that perform core ETL work such as reading data from files, filtering out rows, data cleansing, or loading data into a database. The steps in a transformation are connected by transformation hops. The hops define a stream that allows data to flow between the steps that are connected by the hop.

The following screenshot is an example transformation.

Transformation example

Page 12: Ellipse ETL Technical Overviewec2-54-235-97-194.compute-1.amazonaws.com/documentation/pdf/integration/ETL/E8_ETL...ETL or data integration tools. PDI Use Cases Business Intelligence

Pre-built Transformation Steps

A step is a core building block in a transformation. Steps read data from the prior step and write data to one or more outgoing hops. A step can be configured to either distribute or copy data to its outgoing hops. When distributing data, the step alternates between all outgoing hops for each outbound row (this is known as a round robin). When copying data, each row is sent to all outgoing hops. When running a transformation, one or more copies of each step are started, each running in its own thread. During the run, all step copies run simultaneously, with rows of data constantly flowing through their connecting hops. Beyond these standard capabilities, each step has a distinct functionality that is represented by the step type.

The following sections list the types of pre-built transformation steps available out of the box. Any additional functionality required can be delivered using the scripting steps listed, or the platform can be extended using the published open APIs.

Bulk loading
ElasticSearch Bulk Insert: Performs bulk inserts into ElasticSearch
Greenplum Bulk Loader: Greenplum Bulk Loader
Infobright Loader: Load data into an Infobright database table
Ingres VectorWise Bulk Loader: This step interfaces with the Ingres VectorWise Bulk Loader "COPY TABLE" command
LucidDB Streaming Loader: Load data into LucidDB by using Remote Rows UDX
MonetDB Bulk Loader: Load data into MonetDB by using their bulk load command in streaming mode
MySQL Bulk Loader: MySQL bulk loader step, loading data over a named pipe (not available on MS Windows)
Oracle Bulk Loader: Use the Oracle Bulk Loader to load data
PostgreSQL Bulk Loader: PostgreSQL Bulk Loader
Teradata Fastload Bulk Loader: The Teradata Fastload Bulk Loader

Data Warehouse
Combination lookup/update: Update a junk dimension in a data warehouse. Alternatively, look up information in this dimension. The primary key of a junk dimension is all the fields.
Dimension lookup/update: Update a slowly changing dimension in a data warehouse. Alternatively, look up information in this dimension.

Flow
Abort: Abort a transformation
Append streams: Append 2 streams in an ordered way
Block this step until steps finish: Block this step until selected steps finish
Blocking Step: This step blocks until all incoming rows have been processed. Subsequent steps only receive the last input row to this step.
Detect empty stream: This step will output one empty row if the input stream is empty (i.e. when the input stream does not contain any row)
Dummy (do nothing): This step type doesn't do anything. It's useful however when testing things or in certain situations where you want to split streams.
ETL Metadata Injection: This step allows you to inject metadata into an existing transformation prior to execution. This allows for the creation of dynamic and highly flexible data integration solutions.
Filter rows: Filter rows using simple equations
Identify last row in a stream: The last row will be marked
Java Filter: Filter rows using Java code
Prioritize streams: Prioritize streams in an ordered way
Single Threader: Executes a transformation snippet in a single thread. You need a standard mapping or a transformation with an Injector step where data from the parent transformation will arrive in blocks.
Switch / Case: Switch a row to a certain target step based on the case value in a field
Job Executor: This step executes a Pentaho Data Integration job, passing parameters and rows

Inline
Injector: Injector step to allow rows to be injected into the transformation through the Java API
Socket reader: Socket reader. A socket client that connects to a server (Socket Writer step).
Socket writer: A socket server that can send rows of data to a socket reader

Input
CSV file input: Simple CSV file input
Data Grid: Enter rows of static data in a grid, usually for testing, reference or demo purposes
De-serialize from file: Read rows of data from a data cube
Email messages input: Read a POP3/IMAP server and retrieve messages
ESRI Shapefile Reader: Reads shape file data from an ESRI shape file and linked DBF file
Fixed file input: Fixed file input
Generate random credit card numbers: Generate random valid (Luhn check) credit card numbers
Generate random value: Generate a random value
Generate Rows: Generate a number of empty or equal rows
Get data from XML: Get data from an XML file by using XPath. This step also allows you to parse XML defined in a previous field.
Get File Names: Get file names from the operating system and send them to the next step
Get Files Rows Count: Get files rows count
Get repository names: Lists detailed information about transformations and/or jobs in a repository
Get SubFolder names: Read a parent folder and return all subfolders
Get System Info: Get information from the system like system date, arguments, etc.
Get table names: Get table names from a database connection and send them to the next step
Google Analytics: Fetches data from a Google Analytics account
GZIP CSV Input: Parallel GZIP CSV file input reader
HBase input: Read from an HBase column family
Json Input: Extract relevant portions out of JSON structures (file or incoming field) and output rows
LDAP Input: Read data from an LDAP host
LDIF Input: Read data from LDIF files
Load file content in memory: Load file content in memory
Microsoft Access Input: Read data from a Microsoft Access file
Microsoft Excel Input: Read data from Excel and OpenOffice workbooks (XLS, XLSX, ODS)
Mondrian Input: Execute and retrieve data using an MDX query against a Pentaho Analysis OLAP server (Mondrian)
MongoDB Input: Reads all entries from a MongoDB collection in the specified database
OLAP Input: Execute and retrieve data using an MDX query against any XML/A OLAP datasource using olap4j
OpenERP Object Input: Retrieves data from the OpenERP server using the XMLRPC interface with the 'read' function
Property Input: Read data (key, value) from properties files
RSS Input: Read RSS feeds
S3 CSV Input: S3 CSV input
Salesforce Input: Reads information from Salesforce
SAP Input: Read data from SAP ERP, optionally with parameters
SAS Input: This step reads files in sas7bdat (SAS) native format
Table input: Read information from a database table
Text file input: Read data from a text file in several formats. This data can then be passed on to the next step(s).
XBase input: Reads records from an XBase type of database file (DBF)
XML Input Stream (StAX): This step is capable of processing very large and complex XML files very fast
Yaml Input: Read YAML source (file or stream), parse it, convert it to rows and write these to one or more outputs
Palo Cell Input: Retrieves all cell data from a Palo cube
Palo Dimension Input: Returns elements from a dimension in a Palo database

Job
Copy rows to result: Use this step to write rows to the executing job. The information will then be passed to the next entry in this job.
Get files from result: This step allows you to read filenames used or generated in a previous entry in a job
Get rows from result: This allows you to read rows from a previous entry in a job
Get Variables: Determine the values of certain (environment or Kettle) variables and put them in field values
Set files in result: This step allows you to set filenames in the result of this transformation. Subsequent job entries can then use this information.
Set Variables: Set environment variables based on a single input row

Joins
Join Rows (cartesian product): The output of this step is the cartesian product of the input streams. The number of rows is the multiplication of the number of rows in the input streams.
Merge Join: Joins two streams on a given key and outputs a joined set. The input streams must be sorted on the join key.
Merge Rows (diff): Merge two streams of rows, sorted on a certain key. The two streams are compared and the equal, changed, deleted and new rows are flagged.
Sorted Merge: Sorted merge
XML Join: Joins a stream of XML tags into a target XML string

Lookup
Call DB Procedure: Get back information by calling a database procedure
Check if a column exists: Check if a column exists in a table on a specified connection
Check if file is locked: Check if a file is locked by another process
Check if webservice is available: Check if a web service is available
Database join: Execute a database query using stream values as parameters
Database lookup: Look up values in a database using field values
Dynamic SQL row: Execute a dynamic SQL statement built in a previous field
File exists: Check if a file exists
Fuzzy match: Finding approximate matches to a string using matching algorithms. Read a field from a main stream and output an approximate value from the lookup stream.
HTTP client: Call a web service over HTTP by supplying a base URL and allowing parameters to be set dynamically
HTTP Post: Call a web service request over HTTP by supplying a base URL and allowing parameters to be set dynamically
MaxMind GeoIP Lookup: Look up an IPv4 address in a MaxMind database and add fields such as geography, ISP, or organization
REST Client: Consume RESTful services. REpresentational State Transfer (REST) is a key design idiom that embraces a stateless client-server architecture in which the web services are viewed as resources and can be identified by their URLs.
Stream lookup: Look up values coming from another stream in the transformation
Table exists: Check if a table exists on a specified connection
Web services lookup: Look up information using web services (WSDL)

Mapping
Mapping (sub-transformation): Run a mapping (sub-transformation); use MappingInput and MappingOutput to specify the fields interface
Mapping input specification: Specify the input interface of a mapping
Mapping output specification: Specify the output interface of a mapping

Output
Automatic Documentation Output: This step automatically generates documentation based on input in the form of a list of transformations and jobs
Cassandra output: Write to a Cassandra column family
Delete: Delete data in a database table based upon keys
HBase output: Write to an HBase column family
Insert / Update: Update or insert rows in a database based upon keys
Json output: Create a JSON block and output it in a field or a file
LDAP Output: Perform insert, upsert, update, add or delete operations on records based on their DN (Distinguished Name)
Microsoft Access Output: Stores records into an MS Access database table
Microsoft Excel Output: Stores records into an Excel (XLS) document with formatting information
Microsoft Excel Writer: Writes or appends data to an Excel file
Pentaho Reporting Output: Executes an existing report (PRPT)
Properties Output: Write data to a properties file
RSS Output: Read RSS stream
S3 File Output: Create files in an S3 location
Salesforce Delete: Delete records in a Salesforce module
Salesforce Insert: Insert records in a Salesforce module
Salesforce Update: Update records in a Salesforce module
Salesforce Upsert: Insert or update records in a Salesforce module
Serialize to file: Write rows of data to a data cube
SQL File Output: Output SQL INSERT statements to a file
Synchronize after merge: This step performs insert/update/delete in one go based on the value of a field
Table output: Write information to a database table
Text file output: Write rows to a text file
Update: Update data in a database table based upon keys
XML Output: Write data to an XML file
OpenERP Object Output: Updates data on the OpenERP server using the XMLRPC interface and the 'import' function
Palo Cell Output: Updates cell data in a Palo cube
Palo Dimension Output: Creates/updates dimension elements and element consolidations in a Palo database

Scripting
Execute row SQL script: Execute an SQL script extracted from a field created in a previous step
Execute SQL script: Execute an SQL script, optionally parameterized using input rows
Formula: Calculate a formula using Pentaho's libformula
Modified Java Script Value: This step allows the execution of JavaScript programs (and much more)
Regex Evaluation: Regular expression evaluation. This step uses a regular expression to evaluate a field. It can also extract new fields out of an existing field with capturing groups.
User Defined Java Class: This step allows you to program a step using Java code
User Defined Java Expression: Calculate the result of a Java expression using Janino

Statistics
Analytic Query: Execute analytic queries over a sorted dataset (LEAD/LAG/FIRST/LAST)
Group by: Builds aggregates in a group-by fashion. This works only on a sorted input. If the input is not sorted, only double consecutive rows are handled correctly.
Memory Group by: Builds aggregates in a group-by fashion. This step doesn't require sorted input.
Output steps metrics: Return metrics for one or several steps
Reservoir Sampling: Samples a fixed number of rows from the incoming stream
Sample rows: Filter rows based on the line number
Univariate Statistics: This step computes some simple stats based on a single input field

Transform
Add a checksum: Add a checksum column for each input row
Add constants: Add one or more constants to the input rows
Add sequence: Get the next value from a sequence
Add value fields changing sequence: Add a sequence depending on field value changes. Each time the value of at least one field changes, PDI will reset the sequence.
Add XML: Encode several fields into an XML fragment
Calculator: Create new fields by performing simple calculations
Closure Generator: This step allows you to generate a closure table using parent-child relationships
Example plugin: This is an example for a plugin test step
Get ID from slave server: Retrieves unique IDs in blocks from a slave server. The referenced sequence needs to be configured on the slave server in the XML configuration file.
Number range: Create ranges based on a numeric field
Replace in string: Replace all occurrences of a word in a string with another word
Row denormaliser: Denormalises rows by looking up key-value pairs and by assigning them to new fields in the output rows. This method aggregates and needs the input rows to be sorted on the grouping fields.
Row flattener: Flattens consecutive rows based on the order in which they appear in the input stream
Row Normaliser: De-normalised information can be normalised using this step type
Select values: Select or remove fields in a row. Optionally, set the field meta-data: type, length and precision.
Set field value: Set the value of a field with another value field
Set field value to a constant: Set the value of a field to a constant
Sort rows: Sort rows based upon field values (ascending or descending)
Split field to rows: Splits a single string field by delimiter and creates a new row for each split term
Split Fields: When you want to split a single field into more than one, use this step type
String operations: Apply certain operations like trimming, padding and others to string values
Strings cut: Strings cut (substring)
Unique rows: Remove double rows and leave only unique occurrences. This works only on a sorted input. If the input is not sorted, only double consecutive rows are handled correctly.
Unique rows (HashSet): Remove double rows and leave only unique occurrences by using a HashSet
Value Mapper: Maps values of a certain field from one value to another
XSL Transformation: Transform an XML stream using XSL (eXtensible Stylesheet Language)

Utility
Change file encoding: Change file encoding and create a new file
Clone row: Clone a row as many times as needed
Delay row: Output each input row after a delay
Edi to XML: Converts an Edifact message to XML to simplify data extraction (available in PDI 4.4, already present in CI trunk builds)
Execute a process: Execute a process and return the result
If field value is null: Sets a field value to a constant if it is null
Mail: Send eMail
Metadata structure of stream: This is a step to read the metadata of the incoming stream
Null if...: Sets a field value to null if it is equal to a constant value
Process files: Process one file per row (copy or move or delete). This step only accepts a filename as input.
Run SSH commands: Run SSH commands and return the result
Send message to Syslog: Send a message to a Syslog server
Write to log: Write data to the log

Validation
Credit card validator: The Credit card validator step will help you tell: (1) if a credit card number is valid (uses the LUHN10 (MOD-10) algorithm), (2) which credit card vendor handles that number (VISA, MasterCard, Diners Club, EnRoute, American Express (AMEX), ...)
Data Validator: Validates passing data based on a set of rules
XSD Validator: Validate XML source (files or streams) against an XML Schema Definition
Mail Validator: Check if an email address is valid

Table: Types of pre-built transformation steps available out of the box


Tools and Utilities

Integrated Development Environment

Spoon is Kettle’s integrated development environment. It offers a graphical user interface that allows you to quickly design and manage complex ETL workloads. The following screenshot shows the Spoon IDE:

Spoon IDE screenshot

The three main components of Spoon are the tree-view, the canvas and the execution results pane.

The Tree-view is on the left, where standard components are available for selection. The components can be dragged to the Canvas area and linked up to form job sequences or transformation flows. The execution results can be monitored and viewed in the Execution Results pane at the bottom.

Command-line execution tools

Jobs and transformations can be executed from within the graphical Spoon environment for developing, testing, or debugging. For deployment, you need to be able to invoke jobs and transformations from the command line so they can be integrated with shell scripts and the operating system’s job scheduler. The Kitchen and Pan command-line tools provide this capability. The only difference between these tools is that Kitchen is designed for running jobs, whereas Pan runs transformations.
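As an illustration of that scheduler integration, the sketch below launches Kitchen from a small Java wrapper, much as a scheduler task or shell script would; Pan is invoked the same way with a .ktr transformation file. The install path, job path and wrapper class are assumptions, while -file and -level are standard Kitchen parameters.

```java
// Minimal sketch (assumed install location, job path and class name): launching the
// Kitchen job runner as an external process, as a scheduler or wrapper would.
import java.io.IOException;

public class KitchenLauncher {
    public static void main(String[] args) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(
                "/opt/pentaho/data-integration/kitchen.sh",   // assumed install location
                "-file=/opt/etl/jobs/nightly_load.kjb",       // job definition to run
                "-level=Basic");                              // one of the Kettle logging levels
        pb.inheritIO();                                       // stream Kitchen's log output

        int exitCode = pb.start().waitFor();
        System.out.println("Kitchen finished with exit code " + exitCode);
    }
}
```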

Carte (job server)

Carte is a lightweight web server that enables remote execution of transformations and jobs. A Carte instance also represents a slave server, a key part of Kettle clustering.

The Carte program is a job runner, just like Kitchen. But unlike Kitchen, which runs immediately after invocation on the command line, Carte is started and then continues running in the background as a server (daemon). While Carte is running, it waits and listens for requests on a predefined network port (TCP/IP). Clients on a remote computer can make a request to a machine running Carte, sending a job definition as part of the message that makes up the request. When a running Carte instance receives such a request, it authenticates the request, and then executes the job contained within it. Carte supports a few other types of requests, which can be used to communicate progress and monitoring information.

Carte is a crucial building block in Kettle clustering. Clustering allows a single job or transformation to be divided and executed in parallel by multiple computers that are running the Carte server, thus distributing the workload.
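A minimal sketch of such a client request is shown below: it calls Carte's status page over HTTP with Basic authentication. The host name, port and credentials are placeholders to be replaced with the values configured for the local Carte instance; /kettle/status/ is the standard status endpoint.

```java
// Minimal sketch (host, port and credentials are placeholders): querying a running
// Carte instance for its status page over HTTP with Basic authentication.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class CarteStatusClient {
    public static void main(String[] args) throws Exception {
        // Carte's status servlet; the host name and port below are placeholders.
        URL url = new URL("http://etl-server.example.com:8081/kettle/status/");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();

        // Carte requires HTTP Basic authentication (credentials are site-specific).
        String credentials = Base64.getEncoder()
                .encodeToString("cluster:cluster".getBytes(StandardCharsets.UTF_8));
        conn.setRequestProperty("Authorization", "Basic " + credentials);

        // Read and print the returned status page.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```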


Logging

Kettle provides an inbuilt logging architecture. When you execute a transformation or a job, you can choose the logging level at which you want to run. Depending on the level you pick, more or fewer log lines will be generated. Here are the available logging levels in Kettle:

Rowlevel: Prints all the available logging information in Kettle, including individual rows in a number of more complex steps.
Debugging: Generates a lot of logging information as well, but not on the row level.
Detailed: Allows the user to see a bit more compared to the basic logging level. Examples of extra information generated include SQL queries and DDL in general.
Basic: The default logging level; prints only those messages that reflect execution on a step or job-entry level.
Minimal: Informs you of information on only a job or transformation level.
Error logging only: Shows the error message if there is an error; otherwise, nothing is displayed.
Nothing at all: Does not generate any log lines at all, not even when there is an error.

Table: Available logging levels in Kettle
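When a transformation is executed through the Java API rather than from Spoon, the level is selected programmatically. The sketch below (hypothetical file path and class name) picks the Debugging level via the LogLevel enum before running the transformation.

```java
// Minimal sketch (assumed transformation path and class name): selecting a logging
// level when running a transformation through the Kettle API rather than from Spoon.
import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.core.logging.LogLevel;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;

public class LogLevelExample {
    public static void main(String[] args) throws Exception {
        KettleEnvironment.init();

        TransMeta transMeta = new TransMeta("/opt/etl/transformations/load_datamart.ktr");
        Trans trans = new Trans(transMeta);

        // Pick one of the levels from the table above; DEBUG corresponds to "Debugging".
        trans.setLogLevel(LogLevel.DEBUG);

        trans.execute(null);
        trans.waitUntilFinished();
    }
}
```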

All log lines are kept in a central buffer. In addition to the text of the log lines as described in the preceding section, the following additional pieces are stored and logged into the central logging tables:

The date and time: This component allows the datetime to be colored blue in the logging windows in Spoon.
The logging level: This allows Spoon to show error lines in red in the logging windows.
An incremental unique number: Kettle uses this for incremental updates of the various logging windows or log retrieval during remote execution.
The logging text: The actual textual information that is generated to help the developer see what is going on.
A log channel ID: This is a randomly generated (quasi) unique string that identifies the Kettle component where the log line originated.

Table: Additional pieces stored and logged
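The sketch below shows one way to read those buffered lines back through the API: it runs a transformation and then pulls everything logged under that transformation's log channel ID from the central buffer. The file path and class name are hypothetical, and the KettleLogStore calls are used here as understood from the PDI logging API, so treat the exact method names as an assumption.

```java
// Minimal sketch (assumed file path and class name): reading the lines a transformation
// wrote to Kettle's central log buffer, keyed by the log channel ID mentioned above.
import org.pentaho.di.core.KettleEnvironment;
import org.pentaho.di.core.logging.KettleLogStore;
import org.pentaho.di.trans.Trans;
import org.pentaho.di.trans.TransMeta;

public class CentralLogBufferExample {
    public static void main(String[] args) throws Exception {
        KettleEnvironment.init();

        Trans trans = new Trans(new TransMeta("/opt/etl/transformations/load_datamart.ktr"));
        trans.execute(null);
        trans.waitUntilFinished();

        // Every log line carries the channel ID of the component that produced it;
        // here we pull back everything logged under this transformation's channel.
        String channelId = trans.getLogChannelId();
        StringBuffer logText = KettleLogStore.getAppender().getBuffer(channelId, true);
        System.out.println(logText);
    }
}
```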