Talend ETL Sample Documentation

JasperETL / Talend Created By: Nitin Marwal


Post on 22-Nov-2014


Contents

JasperETL / Talend
Purpose of the Document
Intended Audience
Technology
Reference Project Name
Contributors
  Introduction
    Data Integration
    Data Quality
    Master Data Management
  Talend key features:
  Getting Started with Talend
  Installation:
  1/. Repository
  2/. Palette
  Creating Job:
  Adding Meta Data
  Adding Components to Job
  Mapping of Data:
  Run the Job
Conclusion
Is this the Work Around or Best Solution?
Document / Product / Component Repository Path


Introduction

JasperETL is powered by Talend and uses Talend's data integration and Open Studio features for ETL purposes.

Talend MDM allows organizations to easily model and master any reference data, in any domain without constraints. The unified data management platform unites Data Integration, Data Quality, Master Data and Data Stewardship all through a single Eclipse-based development environment.

Talend's data management solutions cover three key domains:

Data Integration

Data Quality

Master Data Management

All Talend products are built on a unified Eclipse-based development environment, which provides users with consistent ergonomics, a fast learning curve and a high level of reusability. This offers unrivaled benefits in terms of resource optimization and utilization, and project consistency.

Data Integration

Talend's data integration products include:

Talend Open Studio, the community version, provided under the GPL v2 license and freely downloadable

Talend Integration Suite, the enterprise version, provided under a commercial subscription license. Talend Integration Suite exists in 3 editions: Team Edition, Professional Edition and Enterprise Edition

Talend On Demand, the Software as a Service version

Talend Integration Suite MPx, a massively parallel data integration platform

Talend Integration Suite RTx, a real-time data integration platform

Data Quality

Talend's data quality products include:

Talend Open Profiler, an open source data profiling tool provided under the GPL v2 license and freely downloadable


Talend Data Quality, the enterprise data quality platform that includes data profiling and data cleansing features

Master Data Management

Talend's master data management products include:

Talend MDM Community Edition, an open source Master Data Management tool provided under the GPL v2 license and freely downloadable

Talend MDM Enterprise Edition, the enterprise version, provided under a commercial subscription license.

Talend key features:

Active Data Model - Allows organizations to immediately model and master any data domain without a constraining data model, and to conditionally drive integration and synchronization with external systems to reduce system complexity and time to deploy. Talend MDM permits an iterative definition of the data model to gain alignment from business users and ensure adoption upon launch.

Domain Driven Integration - With Talend MDM, the master data drives interactions with external systems. The solution employs a unique event manager to drive when and where data is synchronized, augmented or distributed. A graphical tool provides over 400 proven components and connectors to build and deploy integration jobs with any application, database or system.

Master Data Quality - Talend MDM provides features to validate, resolve, standardize, cleanse and augment master data. The solution delivers a robust data profiling tool. It packages native components for name and address standardization, and provides callouts to external standardization services. Callouts to external sources, including lookups for hierarchies or other reference codes, can be performed based on specific data criteria.

Data Stewardship - The Talend MDM collaborative interface allows users to search and author hub data, and appropriate stewardship tools help manage the process of updating the data. The Ajax-based interface is dynamically driven by Talend's Active Data Model. All validations defined on the model instantiate themselves as validations on Web-based forms. The workflow process is easy to define and provides a strong set of tools for a team to collaborate on and create a trusted and reliable set of master data.

Talend Studio - Talend Studio is an intuitive development environment based on Eclipse that allows building and managing the data model, defining integration jobs, administering data quality and creating stewardship workflows to support the creation of master data, all in a single interface. It also provides unique functions for creating versions of hub data and for hierarchy management.

Getting Started with Talend

Installation:

Requirements:

- Java 1.5 or later

Download the zip from: http://www.talend.com/download.php#mdm

Extract it on your machine:

Here you will get two products: Talend Server and Talend MDM. To run the application, execute TMDMCE-win32-x86.exe under Talend MDM.

Create a local repository and a project based on the language (Java / Perl) that suits you.

The main screen of Talend MDM is:

Here you can see the various windows, such as:

1/. Repository

2/. Palette

3/. Other windows such as Component Properties, Run Job etc.

4/. The middle area is your working zone, where you can create various jobs, Business Models etc.

Now let’s see these in more detail:

1/. Repository

The Repository is the place where all your data is stored, such as your Jobs, Business Models, Metadata information and others.

Here is a screenshot of the Repository:

Under Job Design you create various jobs according to your data transformation requirements.

Under Metadata you can define and create various connections to your source data, which can be a CSV file, a database or any other data format.

2/. Palette

The Palette provides all the components that you can use while preparing your Business Model or Job for transferring data from the source data location to the destination data location.

Here is a screenshot of the Palette window:

Here you have many components available for data extraction, transformation and loading into the target.

Now we will see how to create a new Job in the system:

Creating Job:

Right-click on Job Design under the Repository window and select Create Job; this opens a popup. Here you can provide the basic details of the job, such as Name, Purpose and Description. Talend then creates the new job and opens it in the workspace:

Now you can create various metadata items for your source and destination data.

Adding Meta Data

Expand Metadata under the Repository. Here you can see the various options available, such as DB connections, delimited files, XML files etc. You can create any kind of metadata based on your requirements.

For this demo we will use the File Delimited component, which will read a CSV file as the source data.

For this, right-click on File Delimited and select “Create File Delimited”. This opens a popup where you can provide the basic details like name, purpose etc.

Click Next and browse to the partner CSV; a preview of the data is shown below it:

Now click Next:

Here you can set various parameters for your CSV settings. Now click Next:

Here you will get the description of the schema as the fields of your CSV file. I have selected Website as the key because I do not want partners to be duplicated; duplicate records are avoided based on the website. In general, you can set any number of columns as the key, as per your requirement. Now click Finish and you will get partner_csv under File Delimited as your source data.
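As a rough illustration of what this metadata captures, the Python sketch below reads delimited text using settings similar to those in the wizard (field separator, header row) and records which column acts as the key. The column names (name, website, email), the semicolon separator and the sample rows are all hypothetical, since the actual partner CSV is not reproduced in this document. This is only an analogy for the wizard settings, not the code that Talend generates.

```python
import csv
import io

# Hypothetical sample standing in for the partner CSV used in the demo.
SAMPLE_CSV = """name;website;email
Acme Corp;www.acme.example;info@acme.example
Globex;www.globex.example;sales@globex.example
"""

# Settings comparable to the ones chosen in the File Delimited wizard.
FILE_SETTINGS = {
    "field_separator": ";",      # assumed; set to match your file
    "header_rows": 1,            # first line holds the column names
    "key_columns": ["website"],  # column(s) marked as key in the schema
}


def read_delimited(text, settings):
    """Return (column names, list of row dicts) for a delimited text."""
    reader = csv.reader(io.StringIO(text), delimiter=settings["field_separator"])
    lines = [row for row in reader if row]  # skip blank lines
    columns = lines[settings["header_rows"] - 1]
    data_lines = lines[settings["header_rows"]:]
    rows = [dict(zip(columns, values)) for values in data_lines]
    return columns, rows


columns, rows = read_delimited(SAMPLE_CSV, FILE_SETTINGS)
print(columns)
print(rows[0]["website"])
```

The key column is only recorded here; it becomes useful later, when duplicate rows have to be detected.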

Now we need to set up our destination database; here I am using a PostgreSQL database. Right-click on DBConnections under Metadata and select Create Connection. This opens a popup where you can provide the name of the connection. Click Next to provide the connection details:

Here you can select the target database type and provide the connection settings. After filling in the details, click Finish and you will get the connection under Metadata > DBConnections:
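To make the connection settings more concrete, here is a minimal Python sketch of the kind of information the wizard collects, assuming the connection is made through JDBC, as is usual for Java-based tools. The URL layout jdbc:postgresql://host:port/database is the standard PostgreSQL JDBC form; the class name DbConnection and every value below are placeholders, not the settings used in this demo.

```python
from dataclasses import dataclass


@dataclass
class DbConnection:
    """Fields comparable to those entered in the connection wizard."""
    name: str
    host: str
    port: int
    database: str
    user: str

    def jdbc_url(self) -> str:
        # Standard PostgreSQL JDBC URL layout: jdbc:postgresql://host:port/database
        return f"jdbc:postgresql://{self.host}:{self.port}/{self.database}"


# All values below are placeholders, not the settings used in this demo.
conn = DbConnection(name="demo_pg", host="localhost", port=5432,
                    database="demo_db", user="demo_user")
print(conn.jdbc_url())
```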

Now, to retrieve the table schemas, right-click on your DB connection and select Retrieve Schema:

Here you can select the schema types among TABLE, VIEW and SYNONYM. I have selected only tables, as I require only tables. You can also use SQL queries to fetch your data. Now click Next.

Here it shows all the tables present in the database, so you can select the table you want and click Next:

Here you can select the fields you want for your data process and click Finish.

Now you will get your connection under Metadata:

Adding Components to Job

Now open the test_job that we created earlier by double-clicking on it:

Now we need to add our CSV file to the job, so drag the CSV file into the job workspace. This opens a popup like this:

Select tFileInputDelimited, as we want the CSV as the input source, and click OK. This creates an item in the job workspace:

Now, by double-clicking on the component, you can see the component properties in the Component window:

Here you can view all the settings for your CSV file, and you can also edit them.

Now add the destination for the data, which is the table we set up under DBConnections; just drag the table from there to the job workspace. This opens a popup window:

Here select tPostgresqlOutput, as this is going to be the output of the data flow:

In the same way, you can view or edit the properties of the res_partner component in the Component window.

Now we need to add a tMap component to map the input and output fields in the data flow, so select the tMap component in the Palette window:

Now drag it into the job workspace:

Now, to filter duplicate records, add a tUniqRow component from the Palette window:

Now we need to connect the data flow from partner_csv to tMap, so that it becomes the input for tMap. For this, right-click on partner_csv, select Row > Main and connect it to tMap:

Now we need to take the output from tMap to tUniqRow. For this, right-click on tMap, select Row > New Output (Main) and name the output connection.

It will then ask about matching the target schema; click Yes.

Now we need to take the output from tUniqRow to res_partner. For this, right-click on tUniqRow, select Row > Uniques and connect it to res_partner.

Mapping of Data:

Now double-click on the tMap icon to map the source and target data flows. This opens the map editor like this:

Now we need to map the input fields to the output fields, so drag the related columns to the target columns, or click the Auto Map button at the top right.

You can see that I have used one extra column, active, as it is mandatory in the destination database and is required to make the partner active. It is a Boolean field, so it takes TRUE or FALSE; I have written true as the expression for the active column, as you can see:

Now that we have matched the columns, you can click OK:
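The mapping configured in tMap can be pictured as a function applied to each row: input columns are copied to output columns, and the extra active column is filled with a constant. The Python sketch below is only an analogy under that assumption; the function name map_partner and the column names other than website and active are hypothetical.

```python
def map_partner(row):
    """Map one input row to the output schema, in the spirit of tMap."""
    return {
        "name": row["name"],        # direct column-to-column mapping
        "website": row["website"],  # direct column-to-column mapping
        "active": True,             # constant expression, like typing true in tMap
    }


# Hypothetical input row standing in for one line of the partner CSV.
source_row = {"name": "Acme Corp", "website": "www.acme.example"}
print(map_partner(source_row))
```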

Now double-click on the tUniqRow component to view its properties:

Here you can select website as the key attribute, so that uniqueness is checked on the basis of this column.
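The effect of choosing a key attribute can be sketched in Python as follows. The sketch assumes that the first row seen for each key value is kept and that later rows with the same key are routed to a separate duplicates flow; the function name split_unique and the sample rows are hypothetical.

```python
def split_unique(rows, key_columns):
    """Split rows into (uniques, duplicates) by the given key columns.

    The first row seen for each key is kept; later rows with the same
    key are treated as duplicates.
    """
    seen = set()
    uniques, duplicates = [], []
    for row in rows:
        key = tuple(row[column] for column in key_columns)
        if key in seen:
            duplicates.append(row)
        else:
            seen.add(key)
            uniques.append(row)
    return uniques, duplicates


# Hypothetical input: six rows, two of which repeat an earlier website.
rows = [
    {"name": "Acme Corp", "website": "www.acme.example"},
    {"name": "Globex", "website": "www.globex.example"},
    {"name": "Acme Corporation", "website": "www.acme.example"},
    {"name": "Initech", "website": "www.initech.example"},
    {"name": "Umbrella", "website": "www.umbrella.example"},
    {"name": "Globex Inc", "website": "www.globex.example"},
]

uniques, duplicates = split_unique(rows, ["website"])
print(len(rows), len(uniques), len(duplicates))  # prints: 6 4 2
```

Note that only the key columns are compared: two rows with different names but the same website count as duplicates.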

Now double-click on res_partner to view the properties of this component in the Component window:

Here you can view the basic connection settings. Two fields are important here:

1. Action on table: this defines how your connection will treat your table.

2. Action on data: this defines what operation you are going to perform; it can be insert, update or a combination of both.

Now we are done with all the configuration and ready to run our job. So click on the Run window near the Component window:
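Returning to the Action on data setting described above, the combination of update and insert can be sketched as: try to update the row that matches the key, and insert it if nothing matched. The Python example below uses an in-memory SQLite database purely as a stand-in for the PostgreSQL target; the res_partner column list and the helper name insert_or_update are assumptions, and the SQL that Talend actually issues may differ.

```python
import sqlite3


def insert_or_update(conn, row):
    """Update the row matching the key (website); insert it if none exists."""
    cursor = conn.execute(
        "UPDATE res_partner SET name = ?, active = ? WHERE website = ?",
        (row["name"], row["active"], row["website"]),
    )
    if cursor.rowcount == 0:  # no existing row had this key
        conn.execute(
            "INSERT INTO res_partner (name, website, active) VALUES (?, ?, ?)",
            (row["name"], row["website"], row["active"]),
        )


# In-memory SQLite stands in for the PostgreSQL target; the columns are assumed.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE res_partner (name TEXT, website TEXT PRIMARY KEY, active BOOLEAN)"
)

insert_or_update(conn, {"name": "Acme Corp", "website": "www.acme.example", "active": True})
insert_or_update(conn, {"name": "Acme Corporation", "website": "www.acme.example", "active": True})

stored = conn.execute("SELECT name, website, active FROM res_partner").fetchall()
print(stored)
```

The second call updates the existing row instead of adding a new one, so the table ends up with a single row for that website.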

Before running, check the CSV that we created to import the partners; it contains some duplicate records.

Here you can see that we have some duplicate records, which we will avoid by using the tUniqRow component.

Run the Job

Now we can run our job:

To run the job, click the Run button:

Here you can see that 6 rows flowed from partner_csv but only 4 rows flowed to the destination, as tUniqRow filtered out the duplicate records.

So this is how we use Talend for ETL purposes. For more reference, you can check the detailed help available from the Help menu in Talend.