36017098 Datastage What I Knew
TRANSCRIPT
-
8/8/2019 36017098 Datastage What i Knew
1/34
Day - 1
Datastage
An ETL tool used to extract data from any number or type of databases, transform the data, and load it into a data warehouse.
It has client and server components.
Client components are DS Designer, DS Director, DS Manager and DS Administrator.
Server components are the Repository and DS Server.
Datastage export files have the extension .dsx.
If we run a job in the Designer, we will not be able to view the logs.
Datastage Designer
It is used to create jobs and compile them. Running a job in the Designer is also possible, but it is an optional feature.
Datastage Director
It is used to validate the jobs created using the Designer, then run and monitor them.
Once a job has executed, we can check the logs to get the details of the run.
Datastage Manager
It is used to view and edit the contents of the
repository.
It is used to export, import and manage the metadata.
Datastage Administrator
It is a user interface used to do administrative
tasks.
Development / Debug Stages
Row Generator
Column Generator
Peek
Tail
Head
Sample
Files
Sequential File
Dataset
Fileset
Lookup Fileset
Processing Stage
Sort
Copy
Remove Duplicates
Join
Look Up
Merge
Funnel
Modify
Aggregator
Transformer
Pivot
Filter
Surrogate Key generator
Row Generator (no input -> one output)
Row Generator generates mock data based on the metadata.
It can be used as input data for testing purposes.
By default, the number of rows generated is 10.
Column Generator (1 input -> 1 output)
We must enter the metadata for the column to be generated.
The output depends on the number of input rows.
So if the input has 15 rows then the output will also
have 15 rows.
To set a single value for all the records in the generated column, set the default value of that column to the particular value.
We can also set the start value, end value and increment.
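Datastage configures these stages in the GUI, but the behaviour described above can be sketched in plain Python (the metadata format and function names here are invented for illustration, not Datastage's):

```python
def row_generator(metadata, n=10):
    """Mock Row Generator: emit n rows (default 10) of generated
    values, each column defined by a (start, increment) pair."""
    return [{col: start + i * step for col, (start, step) in metadata.items()}
            for i in range(n)]

def column_generator(rows, column, default):
    """Mock Column Generator: add one generated column with a default
    value to every input row; output row count equals input row count."""
    return [{**row, column: default} for row in rows]

rows = row_generator({"id": (1, 1), "score": (100, 5)}, n=3)
rows = column_generator(rows, "source", "GEN")
```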
Head (1 input -> 1 output)
It is used to pass the first n records of the input file.
By default n is set to 10; it can be changed to any value we need.
If n = 10 and the input has 50 records, only the first 10 records will pass through.
Tail (1 input -> 1 output)
It is similar to Head; the main difference is that it selects the last n records of the input data.
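As a rough Python equivalent of the two stages (a sketch, not Datastage code):

```python
def head(rows, n=10):
    # Pass only the first n records of the input (default n = 10).
    return rows[:n]

def tail(rows, n=10):
    # Pass only the last n records of the input.
    return rows[-n:]

data = list(range(50))  # 50 input records
```

So with 50 input records and n = 10, head passes records 1-10 and tail passes records 41-50.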
Sample (1 input -> n output)
It selects n records, where n is given as a percentage of the full set of input records, in a random manner.
If n = 50% and the input has 150 records, the output will have 75 records selected at random.
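A minimal sketch of that percent-mode sampling in Python (the seed parameter is my addition, for repeatability):

```python
import random

def sample_percent(rows, percent, seed=None):
    # Keep percent% of the rows, chosen at random; the surviving
    # rows stay in their original input order.
    rng = random.Random(seed)
    k = int(len(rows) * percent / 100)
    keep = set(rng.sample(range(len(rows)), k))
    return [row for i, row in enumerate(rows) if i in keep]

out = sample_percent(list(range(150)), 50, seed=1)  # 75 of 150 records
```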
Peek (1 input -> 1 output (optional))
It allows us to view the data in the logs in the Director.
By default the output property is set to Log, so that we can view the records in the Director.
If the output property is set to Output instead, the output of the Peek is stored in a separate output file and cannot be seen in the logs.
The log has the records only if the stage is valid and has executed.
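The Peek behaviour (log the first records, pass everything through) can be mimicked like this; the log callback stands in for the Director log:

```python
def peek(rows, n=10, log=print):
    # Write the first n records to the log (Peek's default mode is
    # Log) and pass all records through on the optional output link.
    for row in rows[:n]:
        log(row)
    return rows
```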
Day - 2
Types of jobs we use in Datastage:
Parallel Job
Parallel Shared Container
Job Sequence
Server jobs are used rarely.
Parallel Processing
There are mainly 2 types of parallel processing:
Partition parallelism
Pipeline parallelism
Partition Parallelism
To process very large quantities of data, the data is split into smaller subsets, and each subset is handled by a separate processor so that the job completes faster.
At the end of the job the data subsets can be collected back together again and written to a single data source.
Pipeline Parallelism
In this case there are multiple stages; each processor starts processing a particular stage and loads its output into the pipeline.
So multiple stages are executed concurrently, and the job completes at a much faster pace.
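A toy model of partition parallelism, just to show the split-process-collect idea (this is not how the Datastage engine is implemented; threads stand in for separate processors):

```python
from concurrent.futures import ThreadPoolExecutor

def run_partitioned(records, transform, partitions=4):
    # Split the data into round-robin subsets, process each subset in
    # its own worker, then collect the subsets back into one output.
    chunks = [records[i::partitions] for i in range(partitions)]
    with ThreadPoolExecutor(max_workers=partitions) as pool:
        results = pool.map(lambda chunk: [transform(r) for r in chunk], chunks)
    return sorted(r for chunk in results for r in chunk)

out = run_partitioned(list(range(100)), lambda x: x * 2)
```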
Sequential File
A sequential file may be a text file, a CSV, a fixed-width file, etc.
The data from a sequential file is transferred in a sequential manner.
So the fan-out/fan-in step is done before the data is transferred from/to parallel processing stages.
Dataset
A dataset has the extension .ds.
It can be viewed only using Datastage, as it is stored in Datastage's internal format.
The data in a dataset can be accessed in parallel, so it is much faster compared to sequential files.
Day - 3
Join [2 input (more than 2 optional) -> 1 output]
It performs a join based on the field that is set as the key.
The 4 types of join that can be performed:
Inner Join
Left Outer Join
Right Outer Join
Full Outer Join
We can explicitly set an input table to be left, right or intermediate.
This is done in the Link Ordering tab.
The data input to the join must be sorted.
It is faster compared to Look Up but slower compared to Merge.
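As an illustration of what the stage computes, an inner join on a key over two lists of dicts can be sketched like this (the other three join types additionally keep unmatched left and/or right rows):

```python
def inner_join(left, right, key):
    # Index the right input by key, then emit one combined record for
    # every left/right pair whose key values match.
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]

left = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
right = [{"id": 2, "city": "x"}, {"id": 3, "city": "y"}]
joined = inner_join(left, right, "id")  # only id 2 appears on both sides
```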
It does not have a reject link.
Pivot (1 Input -> 1 Output)
It is used to convert columns to rows.
It performs, in effect, the transpose function of a matrix.
If there are 4 columns q1, q2, q3, q4 and we want to convert them into rows under a common column name q, the derivation for q is q1,q2,q3,q4.
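That horizontal pivot of q1..q4 into a single column q can be sketched as:

```python
def pivot_columns_to_rows(row, pivot_cols, out_col):
    # Turn the listed columns of one input row into several output
    # rows that share a single output column name.
    fixed = {k: v for k, v in row.items() if k not in pivot_cols}
    return [{**fixed, out_col: row[c]} for c in pivot_cols]

row = {"year": 2019, "q1": 10, "q2": 20, "q3": 30, "q4": 40}
rows = pivot_columns_to_rows(row, ["q1", "q2", "q3", "q4"], "q")
```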
Remove Duplicates (1 Input -> 1 Output)
The pre-requisite for this stage is that the input data must be sorted.
If duplicate values exist in the input, only one record is retained; the others are dropped.
In the properties we can set which record is to be retained: either the first one or the last one.
It compares each record with the very next record; if both are the same, one record is dropped. So if the input is not sorted, duplicate values will remain in the output as well.
We must also give a field (key) based on which the duplicate records are to be removed.
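The adjacent-records comparison is why sorting is a pre-requisite; a sketch:

```python
def remove_duplicates(sorted_rows, key, keep="first"):
    # Compare each record with the previous one on the key; drop
    # consecutive duplicates, retaining the first or the last record.
    # Non-adjacent duplicates survive -- hence the sorted input.
    out = []
    for row in sorted_rows:
        if out and out[-1][key] == row[key]:
            if keep == "last":
                out[-1] = row
        else:
            out.append(row)
    return out

rows = [{"id": 1, "v": "a"}, {"id": 1, "v": "b"}, {"id": 2, "v": "c"}]
```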
Copy ( 1 Input -> n Output)
It is used simply to copy the input data to multiple output links.
It is better to have a Copy between two transformation stages.
Day - 4
Sort (1 Input -> 1 Output)
It is used to sort the input data based on a particular key that we mention explicitly.
We can also mention how to handle null values, with a first or last preference.
We have an option called Allow Duplicates; if this is set to false and duplicate records exist, only the first record is retained and the subsequent records are dropped.
We can also choose the sort order, either ascending or descending.
Sorting can also be done on a link by setting the properties in the link.
We cannot have a reject link for the Sort stage.
It is active if Allow Duplicates is set to true and passive otherwise.
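A rough model of the options above (key, sort order, null placement, Allow Duplicates); the option names are paraphrased, not Datastage's exact property names:

```python
def sort_stage(rows, key, ascending=True, nulls="first", allow_duplicates=True):
    # Sort on the key, place null keys first or last, and with
    # allow_duplicates=False retain only the first record per key.
    null_rows = [r for r in rows if r[key] is None]
    ordered = sorted((r for r in rows if r[key] is not None),
                     key=lambda r: r[key], reverse=not ascending)
    out = null_rows + ordered if nulls == "first" else ordered + null_rows
    if not allow_duplicates:
        seen, unique = set(), []
        for r in out:
            if r[key] not in seen:
                seen.add(r[key])
                unique.append(r)
        out = unique
    return out

rows = [{"k": 2}, {"k": None}, {"k": 1}, {"k": 2}]
```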
Funnel ( n Input -> 1 Output)
It is used to combine multiple input files into a single
output file.
In funnel there are 3 types:
Sort
Sequence
Continuous
By default it is Continuous.
For a Sort funnel we must give the key: the records from the n input files are combined, sorted on that key, and given as a single output file.
A Continuous funnel interleaves the records: the 1st record from the 1st file, then the 1st from the 2nd file, and so on through the subsequent records of all the input files.
In the case of a Sequence funnel, initially all the records from the 1st input appear, then the ones from the 2nd input, and so on for all the input files.
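The three funnel modes, sketched over plain lists:

```python
from itertools import chain, zip_longest

_GAP = object()  # marks exhausted inputs during the interleave

def continuous_funnel(*inputs):
    # 1st record from each input, then 2nd from each, and so on.
    return [r for group in zip_longest(*inputs, fillvalue=_GAP)
            for r in group if r is not _GAP]

def sequence_funnel(*inputs):
    # All records of input 1, then all of input 2, etc.
    return list(chain(*inputs))

def sort_funnel(key, *inputs):
    # Combine all inputs into one output, ordered on the given key.
    return sorted(chain(*inputs), key=key)

a, b = [1, 4, 7], [2, 5]
```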
Filter ( 1 Input -> n Output ,1 Reject (optional))
It is used to filter the records from the input based on a condition and pass the records which satisfy the condition to the output.
We can set multiple conditions and get multiple outputs based on those conditions.
The link ordering can be done in the Link Ordering tab in the properties dialog box.
The records that do not satisfy any of the conditions will either be dropped or passed on to the reject link.
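A sketch of that routing; here each record goes to the first output whose condition it satisfies (the real stage has options controlling whether a record may go to more than one output):

```python
def filter_stage(rows, conditions, use_reject=True):
    # Route each record to the first output whose condition it meets;
    # records matching no condition go to the reject output (or are
    # dropped when use_reject is False).
    outputs = [[] for _ in conditions]
    reject = []
    for row in rows:
        for out, cond in zip(outputs, conditions):
            if cond(row):
                out.append(row)
                break
        else:
            if use_reject:
                reject.append(row)
    return outputs, reject

rows = [{"amt": 5}, {"amt": 50}, {"amt": 500}]
outs, rej = filter_stage(rows, [lambda r: r["amt"] < 10,
                                lambda r: r["amt"] < 100])
```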
Day - 5
Transformer (1 Input -> n Output, 1 Reject)
It must be used only if the process cannot be done using any other stage.
It is an active stage.
It can do filtering, copying, sorting, etc.
It has stage variables, where we can create variables to store values that are to be used frequently among the outputs.
In the properties there is a Constraints option where we can set conditions to be checked against the incoming data.
The Derivations option allows us to perform calculations using arithmetic, string, trigonometric and other functions.
Null handling, conditional statements like if-else, and date functions can also be used in the derivations.
We can derive the outputs from the Transformer to multiple targets based on the derivations.
The stage variables declared and defined are calculated in the order they are derived.
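A toy picture of a stage variable, a constraint and derivations working together (the column names and logic are invented for illustration):

```python
def transformer(rows):
    out = []
    for row in rows:
        # Stage variable: computed once per row, reusable below.
        sv_total = row["qty"] * row["price"]
        # Constraint: only rows passing the condition reach the output.
        if sv_total <= 0:
            continue
        # Derivations: new output columns computed from the input.
        out.append({**row, "total": sv_total,
                    "label": row["name"].upper()})
    return out

rows = [{"name": "pen", "qty": 3, "price": 2.0},
        {"name": "void", "qty": 0, "price": 9.0}]
result = transformer(rows)
```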
Look Up (1 source (primary), n reference -> 1 output, 1 reject)
Look Up is used to join tables based on the key (field) we mention.
There are 2 types of look up:
Normal lookup
Sparse lookup
The reject link gets the records that are rejected from the primary input.
Each record from the primary is looked up in the reference input, and if the keys match, the join is performed.
If there is no record in the reference table matching a primary-table record, that record is either dropped or passed to the reject link, depending on the failure handling.
It is the slowest compared to the Merge and Join stages.
In Look Up we can also set constraints.
There are two types of constraints:
Condition failure
Lookup failure
In case of failure there are 4 ways of handling it:
Continue
Fail
Reject
Drop
Continue just puts a null value in case of failure and continues with the process.
Fail quits the job and comes out of the execution, i.e. it aborts the process.
Drop just drops the record that did not meet the condition and continues processing the next record.
Reject passes the rejected records to the reject link.
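The four lookup-failure actions can be modelled like this (a sketch of a normal in-memory lookup only, with invented names):

```python
def lookup(primary, reference, key, on_failure="continue"):
    # Build an in-memory index of the reference input, then probe it
    # with each primary record, applying the chosen failure action.
    ref_index = {r[key]: r for r in reference}
    ref_cols = [c for c in (reference[0] if reference else {}) if c != key]
    out, reject = [], []
    for row in primary:
        match = ref_index.get(row[key])
        if match is not None:
            out.append({**row, **match})
        elif on_failure == "continue":   # null-fill the reference columns
            out.append({**row, **{c: None for c in ref_cols}})
        elif on_failure == "fail":       # abort the whole job
            raise RuntimeError("lookup failure: job aborted")
        elif on_failure == "reject":     # send to the reject link
            reject.append(row)
        # on_failure == "drop": fall through, record is discarded
    return out, reject

primary = [{"id": 1}, {"id": 2}]
reference = [{"id": 1, "city": "x"}]
```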