36017098 Datastage What I Knew
TRANSCRIPT
-
8/8/2019 36017098 Datastage What i Knew
1/34
Day - 1
Datastage
An ETL tool used to extract data from any number or type of databases, transform the data, and load it into a data warehouse.
It has client and server components.
Client components are DS Designer, DS Director, DS Manager and DS Administrator.
Server components are the Repository and DS Server.
Datastage export files have the extension .dsx.
If we run a job in the Designer, we will not be able to view the logs.
Datastage Designer
It is used to create jobs and compile them. Running a job in the Designer is also possible, but it is an optional feature.
Datastage Director
It is used to validate the jobs created using the Designer, then run and monitor them.
Once a job has executed, we can check the logs to get the details of the run.
Datastage Manager
It is used to view and edit the contents of the
repository.
It is used to export, import and manage the metadata.
Datastage Administrator
It is a user interface used to do administrative
tasks.
Development / Debug Stages
Row Generator
Column Generator
Peek
Tail
Head
Sample
Files
Sequential File
Dataset
Fileset
Lookup Fileset
Processing Stage
Sort
Copy
Remove Duplicates
Join
Look Up
Merge
Funnel
Modify
Aggregator
Transformer
Pivot
Filter
Surrogate Key generator
Row Generator (no input -> one output)
Row Generator generates mock data based on the metadata.
It can be used as input data for testing purposes.
By default, the number of rows generated is 10.
Column Generator (1 input -> 1 output)
We must enter the metadata for the column to be generated.
The output depends on the number of input rows.
So if the input has 15 rows then the output will also
have 15 rows.
To set a single value for all the records in the generated column, set the default value of that column to the particular value.
We can also set the start value, end value and increment.
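Datastage configures these stages in the GUI, but the behaviour described above can be sketched in plain Python (the metadata format and function names here are invented for illustration, not Datastage's):

```python
def row_generator(metadata, n=10):
    """Mock Row Generator: emit n rows (default 10) of generated
    values, each column defined by a (start, increment) pair."""
    return [{col: start + i * step for col, (start, step) in metadata.items()}
            for i in range(n)]

def column_generator(rows, column, default):
    """Mock Column Generator: add one generated column with a default
    value to every input row; output row count equals input row count."""
    return [{**row, column: default} for row in rows]

rows = row_generator({"id": (1, 1), "score": (100, 5)}, n=3)
rows = column_generator(rows, "source", "GEN")
```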
Head (1 input -> 1 output)
It is used to pass the first n records of the input file.
By default n is set to 10; it can be changed to any value we need.
If n = 10 and the input has 50 records, only the first 10 records will pass through.
Tail (1 input -> 1 output)
It is similar to Head; the main difference is that it selects the last n records of the input data.
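As a rough Python equivalent of the two stages (a sketch, not Datastage code):

```python
def head(rows, n=10):
    # Pass only the first n records of the input (default n = 10).
    return rows[:n]

def tail(rows, n=10):
    # Pass only the last n records of the input.
    return rows[-n:]

data = list(range(50))  # 50 input records
```

So with 50 input records and n = 10, head passes records 1-10 and tail passes records 41-50.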
Sample (1 input -> n output)
It selects n records, where n is given as a percentage of the full set of input records, in a random manner.
If n = 50% and the input has 150 records, the output will have 75 records selected at random.
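A minimal sketch of that percent-mode sampling in Python (the seed parameter is my addition, for repeatability):

```python
import random

def sample_percent(rows, percent, seed=None):
    # Keep percent% of the rows, chosen at random; the surviving
    # rows stay in their original input order.
    rng = random.Random(seed)
    k = int(len(rows) * percent / 100)
    keep = set(rng.sample(range(len(rows)), k))
    return [row for i, row in enumerate(rows) if i in keep]

out = sample_percent(list(range(150)), 50, seed=1)  # 75 of 150 records
```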
Peek (1 input -> 1 output (optional))
It allows us to view the data in the logs in the Director.
By default the output property is set to Log, so that we can view the records in the Director.
If the output property is set to Output instead, the output of the Peek is stored in a separate output file and cannot be seen in the logs.
The log has the records only if the stage is valid and has executed.
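The Peek behaviour (log the first records, pass everything through) can be mimicked like this; the log callback stands in for the Director log:

```python
def peek(rows, n=10, log=print):
    # Write the first n records to the log (Peek's default mode is
    # Log) and pass all records through on the optional output link.
    for row in rows[:n]:
        log(row)
    return rows
```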
Day - 2
Types of jobs we use in Datastage:
Parallel Job
Parallel Shared Container
Job Sequence
Server jobs are used rarely.
Parallel Processing
There are mainly 2 types of parallel processing:
Partition parallelism
Pipeline parallelism
Partition Parallelism
To process very large quantities of data, the data is split into smaller subsets, and each subset is handled by a separate processor so that the job completes faster.
At the end of the job the data subsets can be collected back together again and written to a single data source.
Pipeline Parallelism
In this case there are multiple stages; each processor starts processing a particular stage and loads its output into the pipeline.
So multiple stages are executed concurrently, and the job completes at a much faster pace.
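A toy model of partition parallelism, just to show the split-process-collect idea (this is not how the Datastage engine is implemented; threads stand in for separate processors):

```python
from concurrent.futures import ThreadPoolExecutor

def run_partitioned(records, transform, partitions=4):
    # Split the data into round-robin subsets, process each subset in
    # its own worker, then collect the subsets back into one output.
    chunks = [records[i::partitions] for i in range(partitions)]
    with ThreadPoolExecutor(max_workers=partitions) as pool:
        results = pool.map(lambda chunk: [transform(r) for r in chunk], chunks)
    return sorted(r for chunk in results for r in chunk)

out = run_partitioned(list(range(100)), lambda x: x * 2)
```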
Sequential File
A sequential file may be a text file, a CSV, a fixed-width file, etc.
The data from a sequential file is transferred in a sequential manner.
So the fan-out/fan-in step is done before the data is transferred from/to parallel processing stages.
Dataset
A dataset has the extension .ds.
It can be viewed only using Datastage, as it is stored in Datastage's internal format.
The data in a dataset can be accessed in parallel, so it is much faster compared to sequential files.
Day - 3
Join [2 input (more than 2 optional) -> 1 output]
It performs a join based on the field that is set as the key.
The 4 types of join that can be performed:
Inner Join
Left Outer Join
Right Outer Join
Full Outer Join
We can explicitly set an input table to be left, right or intermediate.
This is done in the Link Ordering tab.
The data input to the join must be sorted.
It is faster compared to Look Up but slower compared to Merge.
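As an illustration of what the stage computes, an inner join on a key over two lists of dicts can be sketched like this (the other three join types additionally keep unmatched left and/or right rows):

```python
def inner_join(left, right, key):
    # Index the right input by key, then emit one combined record for
    # every left/right pair whose key values match.
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]

left = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
right = [{"id": 2, "city": "x"}, {"id": 3, "city": "y"}]
joined = inner_join(left, right, "id")  # only id 2 appears on both sides
```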
It does not have a reject link.
Pivot (1 Input -> 1 Output)
It is used to convert columns to rows.
It performs, in effect, the transpose function of a matrix.
If there are 4 columns q1, q2, q3, q4 and we want to convert them into rows under a common column name q, the derivation for q is q1,q2,q3,q4.
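That horizontal pivot of q1..q4 into a single column q can be sketched as:

```python
def pivot_columns_to_rows(row, pivot_cols, out_col):
    # Turn the listed columns of one input row into several output
    # rows that share a single output column name.
    fixed = {k: v for k, v in row.items() if k not in pivot_cols}
    return [{**fixed, out_col: row[c]} for c in pivot_cols]

row = {"year": 2019, "q1": 10, "q2": 20, "q3": 30, "q4": 40}
rows = pivot_columns_to_rows(row, ["q1", "q2", "q3", "q4"], "q")
```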
Remove Duplicates (1 Input -> 1 Output)
The pre-requisite for this stage is that the input data must be sorted.
If duplicate values exist in the input, only one record is retained; the others are dropped.
In the properties we can set which record is to be retained: either the first one or the last one.
It compares each record with the very next record; if both are the same, one record is dropped. So if the input is not sorted, duplicate values will remain in the output as well.
We must also give a field (key) based on which the duplicate records are to be removed.
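The adjacent-records comparison is why sorting is a pre-requisite; a sketch:

```python
def remove_duplicates(sorted_rows, key, keep="first"):
    # Compare each record with the previous one on the key; drop
    # consecutive duplicates, retaining the first or the last record.
    # Non-adjacent duplicates survive -- hence the sorted input.
    out = []
    for row in sorted_rows:
        if out and out[-1][key] == row[key]:
            if keep == "last":
                out[-1] = row
        else:
            out.append(row)
    return out

rows = [{"id": 1, "v": "a"}, {"id": 1, "v": "b"}, {"id": 2, "v": "c"}]
```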
Copy ( 1 Input -> n Output)
It is used simply to copy the input data to multiple output links.
It is better to have a Copy between two transformation stages.
Day - 4
Sort (1 Input -> 1 Output)
It is used to sort the input data based on a particular key that we mention explicitly.
We can also mention how to handle null values, with a first or last preference.
We have an option called Allow Duplicates; if this is set to false and duplicate records exist, only the first record is retained and the subsequent records are dropped.
We can also choose the sort order, either ascending or descending.
Sorting can also be done on a link by setting the properties in the link.
We cannot have a reject link for the Sort stage.
It is active if Allow Duplicates is set to true and passive otherwise.
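A rough model of the options above (key, sort order, null placement, Allow Duplicates); the option names are paraphrased, not Datastage's exact property names:

```python
def sort_stage(rows, key, ascending=True, nulls="first", allow_duplicates=True):
    # Sort on the key, place null keys first or last, and with
    # allow_duplicates=False retain only the first record per key.
    null_rows = [r for r in rows if r[key] is None]
    ordered = sorted((r for r in rows if r[key] is not None),
                     key=lambda r: r[key], reverse=not ascending)
    out = null_rows + ordered if nulls == "first" else ordered + null_rows
    if not allow_duplicates:
        seen, unique = set(), []
        for r in out:
            if r[key] not in seen:
                seen.add(r[key])
                unique.append(r)
        out = unique
    return out

rows = [{"k": 2}, {"k": None}, {"k": 1}, {"k": 2}]
```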
Funnel ( n Input -> 1 Output)
It is used to combine multiple input files into a single
output file.
In funnel there are 3 types:
Sort
Sequence
Continuous
By default it is Continuous.
For a Sort funnel we must give the key: the records from the n input files are combined, sorted on that key, and given as a single output file.
A Continuous funnel interleaves the records: the 1st record from the 1st file, then the 1st from the 2nd file, and so on through the subsequent records of all the input files.
In the case of a Sequence funnel, initially all the records from the 1st input appear, then the ones from the 2nd input, and so on for all the input files.
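The three funnel modes, sketched over plain lists:

```python
from itertools import chain, zip_longest

_GAP = object()  # marks exhausted inputs during the interleave

def continuous_funnel(*inputs):
    # 1st record from each input, then 2nd from each, and so on.
    return [r for group in zip_longest(*inputs, fillvalue=_GAP)
            for r in group if r is not _GAP]

def sequence_funnel(*inputs):
    # All records of input 1, then all of input 2, etc.
    return list(chain(*inputs))

def sort_funnel(key, *inputs):
    # Combine all inputs into one output, ordered on the given key.
    return sorted(chain(*inputs), key=key)

a, b = [1, 4, 7], [2, 5]
```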
Filter ( 1 Input -> n Output ,1 Reject (optional))
It is used to filter the records from the input based on a condition and pass the records which satisfy the condition to the output.
We can set multiple conditions and get multiple outputs based on those conditions.
The link ordering can be done in the Link Ordering tab in the properties dialog box.
The records that do not satisfy any of the conditions will either be dropped or passed on to the reject link.
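A sketch of that routing; here each record goes to the first output whose condition it satisfies (the real stage has options controlling whether a record may go to more than one output):

```python
def filter_stage(rows, conditions, use_reject=True):
    # Route each record to the first output whose condition it meets;
    # records matching no condition go to the reject output (or are
    # dropped when use_reject is False).
    outputs = [[] for _ in conditions]
    reject = []
    for row in rows:
        for out, cond in zip(outputs, conditions):
            if cond(row):
                out.append(row)
                break
        else:
            if use_reject:
                reject.append(row)
    return outputs, reject

rows = [{"amt": 5}, {"amt": 50}, {"amt": 500}]
outs, rej = filter_stage(rows, [lambda r: r["amt"] < 10,
                                lambda r: r["amt"] < 100])
```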
Day - 5
Transformer (1 Input -> n Output, 1 Reject)
It must be used only if the process cannot be done using any other stage.
It is an active stage.
It can do filtering, copying, sorting, etc.
It has stage variables, where we can create variables to store values that are to be used frequently among the outputs.
In the properties there is a Constraints option where we can set conditions to be checked against the incoming data.
The Derivations option allows us to perform calculations using arithmetic, string, trigonometric and other functions.
Null handling, conditional statements like if-else, and date functions can also be used in the derivations.
We can derive the outputs from the Transformer to multiple targets based on the derivations.
The stage variables declared and defined are calculated in the order they are derived.
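A toy picture of a stage variable, a constraint and derivations working together (the column names and logic are invented for illustration):

```python
def transformer(rows):
    out = []
    for row in rows:
        # Stage variable: computed once per row, reusable below.
        sv_total = row["qty"] * row["price"]
        # Constraint: only rows passing the condition reach the output.
        if sv_total <= 0:
            continue
        # Derivations: new output columns computed from the input.
        out.append({**row, "total": sv_total,
                    "label": row["name"].upper()})
    return out

rows = [{"name": "pen", "qty": 3, "price": 2.0},
        {"name": "void", "qty": 0, "price": 9.0}]
result = transformer(rows)
```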
Look Up (1 source (primary), n reference -> 1 output, 1 reject)
Look Up is used to join tables based on the key (field) we mention.
There are 2 types of look up:
Normal lookup
Sparse lookup
The reject link gets the records that are rejected from the primary input.
Each record from the primary is looked up in the reference input, and if the keys match, the join is performed.
If there is no record in the reference table matching a primary-table record, that record is either dropped or passed to the reject link, depending on the failure handling.
It is the slowest compared to the Merge and Join stages.
In Look Up we can also set constraints.
There are two types of constraints:
Condition failure
Lookup failure
In case of failure there are 4 ways of handling it:
Continue
Fail
Reject
Drop
Continue just puts a null value in case of failure and continues with the process.
Fail quits the job and comes out of the execution, i.e. it aborts the process.
Drop just drops the record that did not meet the condition and continues processing the next record.
Reject passes the rejected records to the reject link.
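The four lookup-failure actions can be modelled like this (a sketch of a normal in-memory lookup only, with invented names):

```python
def lookup(primary, reference, key, on_failure="continue"):
    # Build an in-memory index of the reference input, then probe it
    # with each primary record, applying the chosen failure action.
    ref_index = {r[key]: r for r in reference}
    ref_cols = [c for c in (reference[0] if reference else {}) if c != key]
    out, reject = [], []
    for row in primary:
        match = ref_index.get(row[key])
        if match is not None:
            out.append({**row, **match})
        elif on_failure == "continue":   # null-fill the reference columns
            out.append({**row, **{c: None for c in ref_cols}})
        elif on_failure == "fail":       # abort the whole job
            raise RuntimeError("lookup failure: job aborted")
        elif on_failure == "reject":     # send to the reject link
            reject.append(row)
        # on_failure == "drop": fall through, record is discarded
    return out, reject

primary = [{"id": 1}, {"id": 2}]
reference = [{"id": 1, "city": "x"}]
```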