8/12/2019 ETL Testing Simplified
1/13
TCS Public
BFS 2.1
Prasanna Desigan Kesavan
ETL Testing Simplified
Version 1.1
20/01/2012
ETL Testing Simplified
Internal Use 2
Document History
Revision History
Version  Date of Change  Owner of Changes          Description of Change
1.0      13 Jan 2012     Prasanna Desigan Kesavan  Created the document
1.1      20 Jan 2012     Prasanna Desigan Kesavan  Added Revision History, TOC, Page Numbers
Table of Contents
1. Introduction
2. Types of ETL Data movement
   a) Source to Target through Direct Pull
   b) Source to Target through Lookup
   c) Source to Target straight move and Source to Target through Lookup
   d) Source to Target through Direct Pull and Source to Target through Derivation
   e) Source to Target through Direct Pull, through Lookup and through Derivation
3. Different stages of ETL testing
   a) Record Count Verification
   b) Data Completeness Verification
   c) Data Integrity Verification
      i. Data Integrity verification for Direct Pull fields
      ii. Data Integrity verification for Lookup fields
      iii. Data Integrity verification for Derived fields
   d) Data Quality Verification
   e) Delta Load Verification
      i. Truncate and Reload
      ii. Delta Load
1. Introduction
ETL stands for Extract, Transform, Load: data is extracted from a source system, transformed as per
the requirements of the target system, and loaded into the target system. ETL mainly applies to data
movement between tables and databases. The main purpose of ETL testing is to verify that this data
movement has been done properly.
2. Types of ETL Data movement
a) Source to Target through Direct Pull
Data available in Source system is copied as such into Target system without any transformation. The
field names might get abbreviated or expanded, but the values stay intact.
For example, all records in TCS.EMPLOYEE table are moved into TCS_STG.EMP_STG table, where
EMPLOYEE is a table in TCS schema while EMP_STG is a table in TCS_STG schema. TCS.EMPLOYEE has
the fields EMPLOYEE_NUMBER, EMPLOYEE_STATUS, EMPLOYEE_NAME and
EMPLOYEE_JOINING_DATE, while TCS_STG.EMP_STG has the fields EMP_ID, EMP_STA, EMP_NM and
EMP_JOIN_DT respectively.
Source TCS.EMPLOYEE
EMPLOYEE_NUMBER  EMPLOYEE_STATUS  EMPLOYEE_NAME  EMPLOYEE_JOINING_DATE
267523           Active           Prasanna       19-09-1993
SELECT EMPLOYEE_NUMBER as EMP_ID,
       EMPLOYEE_STATUS as EMP_STA,
       EMPLOYEE_NAME as EMP_NM,
       EMPLOYEE_JOINING_DATE as EMP_JOIN_DT
FROM TCS.EMPLOYEE
Target TCS_STG.EMP_STG
EMP_ID  EMP_STA  EMP_NM    EMP_JOIN_DT
267523  Active   Prasanna  19-09-1993
b) Source to Target through Lookup
Data available in Source system is used to pick up values from Lookup system, and the picked up values
are moved directly into Target system without any transformation. Data available in Source system
will not be there in Target system, but the corresponding data from Lookup system will be present in
Target system. The field names might get abbreviated or expanded, but the values stay intact.
For example, all ID information in TCS.EMPLOYEE_DETAIL table is moved as name information into
TCS_STG.EMP_DTL_STG table, where EMPLOYEE_DETAIL is a table in TCS schema while EMP_DTL_STG
is a table in TCS_STG schema. TCS.EMPLOYEE_DETAIL has the fields EMPLOYEE_ID, DEPARTMENT_ID
and BRANCH_ID, while TCS_STG.EMP_DTL_STG has the fields EMP_NM, DEPT_NM and BNCH_NM
respectively. The names of the Employee, Department and Branch are picked up from TCS.EMPLOYEE,
TCS.DEPARTMENT and TCS.BRANCH Lookup tables respectively by using the IDs of Employee,
Department and Branch in TCS.EMPLOYEE_DETAIL.
Source TCS.EMPLOYEE_DETAIL
EMPLOYEE_ID  DEPARTMENT_ID  BRANCH_ID
267523       2              5
Lookup TCS.EMPLOYEE
EMPLOYEE_ID  EMPLOYEE_NAME
267523       Prasanna
Lookup TCS.DEPARTMENT
DEPARTMENT_ID  DEPARTMENT_NAME
2              Testing
Lookup TCS.BRANCH
BRANCH_ID  BRANCH_NAME
5          Chennai
SELECT EMPLOYEE_NAME as EMP_NM,
       DEPARTMENT_NAME as DEPT_NM,
       BRANCH_NAME as BNCH_NM
FROM TCS.EMPLOYEE_DETAIL A,
     TCS.EMPLOYEE B,
     TCS.DEPARTMENT C,
     TCS.BRANCH D
WHERE A.EMPLOYEE_ID = B.EMPLOYEE_ID
AND A.DEPARTMENT_ID = C.DEPARTMENT_ID
AND A.BRANCH_ID = D.BRANCH_ID
Target TCS_STG.EMP_DTL_STG
EMP_NM    DEPT_NM  BNCH_NM
Prasanna  Testing  Chennai
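The lookup movement above can be tried end to end with a small script. This is only a sketch under assumptions: it uses Python's built-in sqlite3 module with an in-memory database instead of the warehouse, drops the TCS and TCS_STG schema prefixes because SQLite has no such schemas here, and the function name load_emp_dtl_stg is made up for illustration.

```python
import sqlite3

# In-memory stand-ins for the source, lookup and target tables (schema prefixes dropped).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE EMPLOYEE_DETAIL (EMPLOYEE_ID INTEGER, DEPARTMENT_ID INTEGER, BRANCH_ID INTEGER);
    CREATE TABLE EMPLOYEE (EMPLOYEE_ID INTEGER, EMPLOYEE_NAME TEXT);
    CREATE TABLE DEPARTMENT (DEPARTMENT_ID INTEGER, DEPARTMENT_NAME TEXT);
    CREATE TABLE BRANCH (BRANCH_ID INTEGER, BRANCH_NAME TEXT);
    CREATE TABLE EMP_DTL_STG (EMP_NM TEXT, DEPT_NM TEXT, BNCH_NM TEXT);
    INSERT INTO EMPLOYEE_DETAIL VALUES (267523, 2, 5);
    INSERT INTO EMPLOYEE VALUES (267523, 'Prasanna');
    INSERT INTO DEPARTMENT VALUES (2, 'Testing');
    INSERT INTO BRANCH VALUES (5, 'Chennai');
""")

def load_emp_dtl_stg(conn):
    """Hypothetical load step: resolve each ID through its lookup table and store the names."""
    conn.execute("""
        INSERT INTO EMP_DTL_STG (EMP_NM, DEPT_NM, BNCH_NM)
        SELECT B.EMPLOYEE_NAME, C.DEPARTMENT_NAME, D.BRANCH_NAME
        FROM EMPLOYEE_DETAIL A
        JOIN EMPLOYEE B ON A.EMPLOYEE_ID = B.EMPLOYEE_ID
        JOIN DEPARTMENT C ON A.DEPARTMENT_ID = C.DEPARTMENT_ID
        JOIN BRANCH D ON A.BRANCH_ID = D.BRANCH_ID
    """)
    return conn.execute("SELECT EMP_NM, DEPT_NM, BNCH_NM FROM EMP_DTL_STG").fetchall()

loaded_rows = load_emp_dtl_stg(conn)
```

Note that an inner join drops a source row whose ID is missing from a lookup table, whereas a LEFT JOIN would keep the row with a NULL name. Which behaviour is wanted depends on the mapping requirements.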
c) Source to Target straight move and Source to Target through Lookup
Some of the data in Source system is copied as such into Target system without any transformation,
while the remaining data is used to pick up values from Lookup system, and the picked up values are
moved directly into Target system without any transformation. Data available in Source system will be
there in Target system, and the corresponding data from Lookup system will also be present in Target
system. The field names might get abbreviated or expanded, but the values stay intact.
Source TCS.EMPLOYEE_DETAIL
EMPLOYEE_ID  DEPARTMENT_NAME  BRANCH_NAME
267523       Testing          Chennai
Lookup TCS.EMPLOYEE
EMPLOYEE_ID  EMPLOYEE_NAME
267523       Prasanna
Target TCS_STG.EMP_DTL_STG
EMP_ID  EMP_NM    DEPT_NM  BNCH_NM
267523  Prasanna  Testing  Chennai
SELECT A.EMPLOYEE_ID as EMP_ID,
       B.EMPLOYEE_NAME as EMP_NM,
       A.DEPARTMENT_NAME as DEPT_NM,
       A.BRANCH_NAME as BNCH_NM
FROM TCS.EMPLOYEE_DETAIL A,
     TCS.EMPLOYEE B
WHERE A.EMPLOYEE_ID = B.EMPLOYEE_ID
d) Source to Target through Direct Pull and Source to Target through Derivation
All or some of the data in Source system is copied as such into Target system without any
transformation, while some data is used to derive values that are not available in Source system, and
the derived values are moved into Target system. Data available in Source system will be there in
Target system, and the derived data will also be present in Target system. The field names might get
abbreviated or expanded.
For example, all records in TCS.EMPLOYEE table are moved into TCS_STG.EMP_STG table, where
EMPLOYEE is a table in TCS schema while EMP_STG is a table in TCS_STG schema. TCS.EMPLOYEE has
the fields EMPLOYEE_NUMBER, EMPLOYEE_STATUS, EMPLOYEE_NAME and
EMPLOYEE_JOINING_DATE, while TCS_STG.EMP_STG has the fields EMP_ID, EMP_STA, EMP_NM,
EMP_JOIN_DT and EMP_GEMS.
EMP_GEMS in TCS_STG.EMP_STG is derived from EMPLOYEE_STATUS and EMPLOYEE_JOINING_DATE in
TCS.EMPLOYEE table (loaded as EMP_STA and EMP_JOIN_DT) using the following business logic. When
Employee Status is Terminated, EMP_GEMS = 0. When Employee Status is Active and Employee has more
than 3 years but not more than 5 years of experience, EMP_GEMS = 1000. When Employee Status is
Active and Employee has more than 5 years of experience, EMP_GEMS = 2000.
Source TCS.EMPLOYEE
EMPLOYEE_NUMBER  EMPLOYEE_STATUS  EMPLOYEE_NAME  EMPLOYEE_JOINING_DATE
267523           Active           Prasanna       19-09-2003
267524           Active           Vivek          19-09-2007
267525           Terminated       Jaikumar       19-09-1993
SELECT EMPLOYEE_NUMBER as EMP_ID,
       EMPLOYEE_STATUS as EMP_STA,
       EMPLOYEE_NAME as EMP_NM,
       EMPLOYEE_JOINING_DATE as EMP_JOIN_DT,
       CASE
         WHEN EMPLOYEE_STATUS = 'Terminated'
           THEN 0
         -- The 5-year rule is tested first; otherwise the 3-year rule would match these rows too
         WHEN EMPLOYEE_STATUS = 'Active' AND EMPLOYEE_JOINING_DATE < TO_DATE('20-JUL-2006', 'DD-MON-YYYY')
           THEN 2000
         WHEN EMPLOYEE_STATUS = 'Active' AND EMPLOYEE_JOINING_DATE < TO_DATE('20-JUL-2008', 'DD-MON-YYYY')
           THEN 1000
       END as EMP_GEMS
FROM TCS.EMPLOYEE
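The derivation rule for EMP_GEMS can also be written as a small function, which makes the order of evaluation explicit. This is a sketch under assumptions, not the actual ETL mapping: the function names are made up, the load date of 20 July 2011 is inferred from the cut-off dates in the query, and the value returned for Active employees with 3 years of experience or less is a placeholder because the business logic does not define it.

```python
from datetime import date

def add_years(d: date, years: int) -> date:
    """Return d shifted by whole years (29 Feb falls back to 28 Feb)."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)

def derive_emp_gems(status: str, join_date: date, as_of: date) -> int:
    """Apply the EMP_GEMS rules, testing the stricter 5-year rule before the 3-year rule."""
    if status == "Terminated":
        return 0
    if status == "Active" and as_of > add_years(join_date, 5):
        return 2000
    if status == "Active" and as_of > add_years(join_date, 3):
        return 1000
    return 0  # assumption: the source does not say what applies at 3 years or less

# Assumed load date, implied by the 20-JUL-2008 and 20-JUL-2006 cut-offs.
AS_OF = date(2011, 7, 20)
```

If the 3-year test came first, every employee with more than 5 years of experience would also satisfy it and would never reach the 2000 branch, so the stricter condition has to be evaluated first.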
e) Source to Target through Direct Pull, through Lookup and through Derivation
All or some of the data in Source system is copied as such into Target system without any
transformation, while some data is used to pick up values from Lookup system, and the picked up values
are moved directly into Target system without any transformation. Also, some data is used to
derive values that are not available in Source system, and the derived values are moved into Target
system. Data available in Source system, corresponding data from Lookup system and derived data
will be present in Target system. The field names might get abbreviated or expanded.
Source TCS.EMPLOYEE
EMPLOYEE_NUMBER  EMPLOYEE_STATUS  BRANCH_NAME  EMPLOYEE_JOINING_DATE
267523           Active           Chennai      19-09-2003
267524           Active           Hyderabad    19-09-2007
267525           Terminated       Bangalore    19-09-1993
Lookup TCS.EMP_DTL
EMPLOYEE_NUMBER  EMPLOYEE_NAME
267523           Prasanna
267524           Vivek
267525           Jaikumar
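SELECT A.EMPLOYEE_NUMBER as EMP_ID,
       A.EMPLOYEE_STATUS as EMP_STA,
       B.EMPLOYEE_NAME as EMP_NM,
       A.EMPLOYEE_JOINING_DATE as EMP_JOIN_DT,
       CASE
         WHEN A.EMPLOYEE_STATUS = 'Terminated'
           THEN 0
         -- The 5-year rule is tested first; otherwise the 3-year rule would match these rows too
         WHEN A.EMPLOYEE_STATUS = 'Active' AND A.EMPLOYEE_JOINING_DATE < TO_DATE('20-JUL-2006', 'DD-MON-YYYY')
           THEN 2000
         WHEN A.EMPLOYEE_STATUS = 'Active' AND A.EMPLOYEE_JOINING_DATE < TO_DATE('20-JUL-2008', 'DD-MON-YYYY')
           THEN 1000
       END as EMP_GEMS
FROM TCS.EMPLOYEE A,
     TCS.EMP_DTL B
WHERE A.EMPLOYEE_NUMBER = B.EMPLOYEE_NUMBER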
Target TCS_STG.EMP_STG
EMP_ID  EMP_STA     EMP_NM    EMP_JOIN_DT  EMP_GEMS
267523  Active      Prasanna  19-09-2003   2000
267524  Active      Vivek     19-09-2007   1000
267525  Terminated  Jaikumar  19-09-1993   0
3. Different stages of ETL testing
a) Record Count Verification
In any type of ETL data movement, Record Count Verification means comparing the number of records in
Source system and Target system. The Source system will mostly be a single table from which all or
the key information is loaded into Target table.
For example, data from TCS.EMPLOYEE_DETAIL, TCS.EMPLOYEE, TCS.DEPARTMENT and TCS.BRANCH
tables is loaded into TCS_STG.EMP_DTL_STG. The Source system here is
TCS.EMPLOYEE_DETAIL table, as any Employee who has a record in this table will have a record in
TCS_STG.EMP_DTL_STG table irrespective of the employee being present in TCS.EMPLOYEE,
TCS.DEPARTMENT and TCS.BRANCH tables.
Here, we will verify whether the number of records or employees in TCS.EMPLOYEE_DETAIL is the same
as the number of records in TCS_STG.EMP_DTL_STG table.
This will be 1 scenario as well as 1 test case.
SELECT 'Source', COUNT(*)
FROM TCS.EMPLOYEE_DETAIL A
UNION
SELECT 'Target', COUNT(*)
FROM TCS_STG.EMP_DTL_STG B
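The count comparison is easy to automate so that the test reports pass or fail instead of two numbers to compare by eye. The following is a minimal sketch, assuming an in-memory SQLite database in place of the warehouse, table names without schema prefixes, and a made-up helper name record_counts.

```python
import sqlite3

def record_counts(conn, source_table, target_table):
    """Return (source_count, target_count) for the two tables."""
    # Table names are interpolated into the SQL text, so pass trusted names only.
    src = conn.execute(f"SELECT COUNT(*) FROM {source_table}").fetchone()[0]
    tgt = conn.execute(f"SELECT COUNT(*) FROM {target_table}").fetchone()[0]
    return src, tgt

# Tiny stand-in data: one source record and the one target record loaded from it.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE EMPLOYEE_DETAIL (EMPLOYEE_ID INTEGER, DEPARTMENT_ID INTEGER, BRANCH_ID INTEGER);
    CREATE TABLE EMP_DTL_STG (EMP_NM TEXT, DEPT_NM TEXT, BNCH_NM TEXT);
    INSERT INTO EMPLOYEE_DETAIL VALUES (267523, 2, 5);
    INSERT INTO EMP_DTL_STG VALUES ('Prasanna', 'Testing', 'Chennai');
""")

source_count, target_count = record_counts(conn, "EMPLOYEE_DETAIL", "EMP_DTL_STG")
counts_match = source_count == target_count  # the single test case of this scenario
```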
b) Data Completeness Verification
In all types of ETL data movement except Source to Target through Lookup, Data Completeness
Verification means comparing the key information in Source system and Target system.
The Source system will mostly be a single table from which all or the key information is loaded into
Target table.
For example, EMP_ID, EMP_STA, EMP_NM and EMP_JOIN_DT, along with EMP_GEMS derived from
TCS.EMPLOYEE table, are loaded into TCS_STG.EMP_STG table.
Here, we will verify whether all EMPLOYEE_NUMBERs in TCS.EMPLOYEE table are present in
TCS_STG.EMP_STG table as EMP_IDs and no additional EMP_IDs are present in TCS_STG.EMP_STG.
Target TCS_STG.EMP_STG
EMP_ID  EMP_STA     EMP_NM    EMP_BNCH_NM  EMP_JOIN_DT  EMP_GEMS
267523  Active      Prasanna  Chennai      19-09-2003   2000
267524  Active      Vivek     Hyderabad    19-09-2007   1000
267525  Terminated  Jaikumar  Bangalore    19-09-1993   0
SELECT EMPLOYEE_NUMBER
FROM TCS.EMPLOYEE A
MINUS
SELECT EMP_ID
FROM TCS_STG.EMP_STG B

SELECT EMP_ID
FROM TCS_STG.EMP_STG B
MINUS
SELECT EMPLOYEE_NUMBER
FROM TCS.EMPLOYEE A
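The two-way key comparison can be run as a pair of set differences. The sketch below assumes SQLite, which spells the Oracle MINUS operator as EXCEPT, an in-memory database without schema prefixes, and a deliberately faulty load so that both checks return something. The helper name key_differences and the extra record 999999 are invented for illustration.

```python
import sqlite3

def key_differences(conn):
    """Return (missing_in_target, extra_in_target) as sorted lists of keys."""
    missing = conn.execute(
        "SELECT EMPLOYEE_NUMBER FROM EMPLOYEE EXCEPT SELECT EMP_ID FROM EMP_STG"
    ).fetchall()
    extra = conn.execute(
        "SELECT EMP_ID FROM EMP_STG EXCEPT SELECT EMPLOYEE_NUMBER FROM EMPLOYEE"
    ).fetchall()
    return sorted(r[0] for r in missing), sorted(r[0] for r in extra)

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE EMPLOYEE (EMPLOYEE_NUMBER INTEGER, EMPLOYEE_NAME TEXT);
    CREATE TABLE EMP_STG (EMP_ID INTEGER, EMP_NM TEXT);
    INSERT INTO EMPLOYEE VALUES (267523, 'Prasanna'), (267524, 'Vivek'), (267525, 'Jaikumar');
    -- Deliberately faulty load: 267525 is dropped and a made-up 999999 is added.
    INSERT INTO EMP_STG VALUES (267523, 'Prasanna'), (267524, 'Vivek'), (999999, 'Ghost');
""")

missing_in_target, extra_in_target = key_differences(conn)
```

Both lists must be empty for the verification to pass. A record count check alone would not catch this load, because source and target each hold three records.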
In Source to Target through Lookup type of ETL data movement, Data Completeness Verification means
comparing the key information in Source system, resolved through the key information in Lookup
system, with the picked up information in Target system. The Source system will be a
combination of one or more tables from which the key information is picked up and loaded into
Target table.
For example, EMP_NM, DEPT_NM and BNCH_NM are picked up from TCS.EMPLOYEE,
TCS.DEPARTMENT and TCS.BRANCH Lookup tables by using the IDs of Employee, Department and
Branch in TCS.EMPLOYEE_DETAIL and loaded into TCS_STG.EMP_DTL_STG table.
Here, we will verify whether the names for all EMPLOYEE_NUMBERs in TCS.EMPLOYEE_DETAIL table are
present as EMP_NMs in TCS_STG.EMP_DTL_STG table and no additional EMP_NMs are present in
TCS_STG.EMP_DTL_STG table.
This will be 1 scenario which has 2 test cases under it: one for verifying whether all key
information in Source system is present in Target system, and the other for verifying whether no
additional key information is present in Target system.
SELECT EMPLOYEE_NAME
FROM TCS.EMPLOYEE_DETAIL A,
     TCS.EMPLOYEE B
WHERE A.EMPLOYEE_NUMBER = B.EMPLOYEE_NUMBER
MINUS
SELECT EMP_NM
FROM TCS_STG.EMP_DTL_STG C

SELECT EMP_NM
FROM TCS_STG.EMP_DTL_STG C
MINUS
SELECT EMPLOYEE_NAME
FROM TCS.EMPLOYEE_DETAIL A,
     TCS.EMPLOYEE B
WHERE A.EMPLOYEE_NUMBER = B.EMPLOYEE_NUMBER
This will be 1 scenario, while the number of test cases will be equal to the number of lookup
systems, which is 3 in our example.
SELECT EMP_ID,
       EMP_NM
FROM TCS_STG.EMP_DTL_STG T
MINUS
SELECT A.EMPLOYEE_NUMBER as EMP_ID,
       B.EMPLOYEE_NAME as EMP_NM
FROM TCS.EMPLOYEE_DETAIL A,
     TCS.EMPLOYEE B
WHERE A.EMPLOYEE_NUMBER = B.EMPLOYEE_NUMBER

SELECT DEPT_ID,
       DEPT_NM
FROM TCS_STG.EMP_DTL_STG T
MINUS
SELECT A.DEPARTMENT_ID as DEPT_ID,
       C.DEPARTMENT_NAME as DEPT_NM
FROM TCS.EMPLOYEE_DETAIL A,
     TCS.DEPARTMENT C
WHERE A.DEPARTMENT_ID = C.DEPARTMENT_ID

SELECT BNCH_ID,
       BNCH_NM
FROM TCS_STG.EMP_DTL_STG T
MINUS
SELECT A.BRANCH_ID as BNCH_ID,
       D.BRANCH_NAME as BNCH_NM
FROM TCS.EMPLOYEE_DETAIL A,
     TCS.BRANCH D
WHERE A.BRANCH_ID = D.BRANCH_ID
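One of the three lookup checks, the employee name check, is sketched below as a runnable script. It assumes SQLite (EXCEPT in place of MINUS), an in-memory database without schema prefixes, and a misspelt name planted in the target to show what a failure looks like. The helper name lookup_mismatches is hypothetical.

```python
import sqlite3

def lookup_mismatches(conn):
    """Target (EMP_ID, EMP_NM) pairs that the source-to-lookup join does not produce."""
    return conn.execute("""
        SELECT EMP_ID, EMP_NM FROM EMP_DTL_STG
        EXCEPT
        SELECT A.EMPLOYEE_NUMBER, B.EMPLOYEE_NAME
        FROM EMPLOYEE_DETAIL A
        JOIN EMPLOYEE B ON A.EMPLOYEE_NUMBER = B.EMPLOYEE_NUMBER
    """).fetchall()

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE EMPLOYEE_DETAIL (EMPLOYEE_NUMBER INTEGER);
    CREATE TABLE EMPLOYEE (EMPLOYEE_NUMBER INTEGER, EMPLOYEE_NAME TEXT);
    CREATE TABLE EMP_DTL_STG (EMP_ID INTEGER, EMP_NM TEXT);
    INSERT INTO EMPLOYEE_DETAIL VALUES (267523), (267524);
    INSERT INTO EMPLOYEE VALUES (267523, 'Prasanna'), (267524, 'Vivek');
    -- Planted defect: the second name was loaded with a typo.
    INSERT INTO EMP_DTL_STG VALUES (267523, 'Prasanna'), (267524, 'Vivke');
""")

mismatched_rows = lookup_mismatches(conn)
```

The department and branch checks follow the same pattern with their own lookup table and column pair, which is why the number of test cases equals the number of lookup systems.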
iii. Data Integrity verification for Derived fields
Here, we will verify whether
- When EMP_STA = 'T' in TCS_STG.EMP_DTL_STG table, EMP_GEMS = 0
- When EMP_STA = 'A' and SYSDATE > (EMP_JOIN_DT + 3 Years) but SYSDATE <= (EMP_JOIN_DT + 5 Years)
in TCS_STG.EMP_DTL_STG table, EMP_GEMS = 1000
- When EMP_STA = 'A' and SYSDATE > (EMP_JOIN_DT + 5 Years) in TCS_STG.EMP_DTL_STG
table, EMP_GEMS = 2000
There will be 1 scenario for each derived field, while the number of test cases will be equal to the
number of possible values of that derived field, which is 3 in our example.
SELECT EMP_ID
FROM TCS_STG.EMP_DTL_STG A
WHERE EMP_STA = 'T'
AND EMP_GEMS <> 0

SELECT EMP_ID
FROM TCS_STG.EMP_DTL_STG A
WHERE EMP_STA = 'A'
AND EMP_JOIN_DT < TO_DATE('20-JUL-2008', 'DD-MON-YYYY')
AND EMP_JOIN_DT >= TO_DATE('20-JUL-2006', 'DD-MON-YYYY')
AND EMP_GEMS <> 1000

SELECT EMP_ID
FROM TCS_STG.EMP_DTL_STG A
WHERE EMP_STA = 'A'
AND EMP_JOIN_DT < TO_DATE('20-JUL-2006', 'DD-MON-YYYY')
AND EMP_GEMS <> 2000
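The three derived-field test cases can also be folded into a single query that recomputes the expected EMP_GEMS and reports rows where the loaded value differs. This is a sketch under assumptions: SQLite in memory, dates stored as ISO text (YYYY-MM-DD) so that string comparison orders them correctly, the same fixed cut-off dates as the queries, and rows covered by no rule expected to hold NULL. The names EXPECTED_GEMS and gems_mismatches are made up.

```python
import sqlite3

# Expected value per the business rules; the stricter 5-year cut-off is tested first.
# Assumption: a row matched by no rule yields NULL here.
EXPECTED_GEMS = """
    CASE
        WHEN EMP_STA = 'T' THEN 0
        WHEN EMP_STA = 'A' AND EMP_JOIN_DT < '2006-07-20' THEN 2000
        WHEN EMP_STA = 'A' AND EMP_JOIN_DT < '2008-07-20' THEN 1000
    END
"""

def gems_mismatches(conn):
    """Return (EMP_ID, loaded EMP_GEMS, expected EMP_GEMS) for every row that disagrees."""
    # IS NOT is SQLite's NULL-safe inequality, so a NULL on either side is still compared.
    return conn.execute(f"""
        SELECT EMP_ID, EMP_GEMS, ({EXPECTED_GEMS}) AS EXPECTED
        FROM EMP_DTL_STG
        WHERE EMP_GEMS IS NOT ({EXPECTED_GEMS})
        ORDER BY EMP_ID
    """).fetchall()

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE EMP_DTL_STG (EMP_ID INTEGER, EMP_STA TEXT, EMP_JOIN_DT TEXT, EMP_GEMS INTEGER);
    INSERT INTO EMP_DTL_STG VALUES (267523, 'A', '2003-09-19', 2000);
    -- Planted defect: joined in 2007, so 1000 is expected, but 2000 was loaded.
    INSERT INTO EMP_DTL_STG VALUES (267524, 'A', '2007-09-19', 2000);
    INSERT INTO EMP_DTL_STG VALUES (267525, 'T', '1993-09-19', 0);
""")

derived_field_defects = gems_mismatches(conn)
```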
d) Data Quality Verification
Data Quality Verification means verifying that the key field in Target system has unique values and
that certain other fields in Target system do not have a value other than the specified ones.
For example, EMP_ID, EMP_STA, EMP_JOIN_DT, EMP_NM, DEPT_NM, BNCH_NM and EMP_GEMS are
present in TCS_STG.EMP_DTL_STG table.
- EMP_ID field must have unique values, as you cannot have more than one record for an employee
SELECT EMP_ID
FROM TCS_STG.EMP_DTL_STG A
GROUP BY EMP_ID
HAVING COUNT(*) > 1
- BNCH_NM field must have a unique value for each employee, as one employee cannot work in
more than one branch
SELECT EMP_ID
FROM TCS_STG.EMP_DTL_STG A
GROUP BY EMP_ID
HAVING COUNT(DISTINCT BNCH_NM) > 1
- EMP_NM field cannot have NULL values, as every employee must have a name
SELECT EMP_ID
FROM TCS_STG.EMP_DTL_STG A
WHERE EMP_NM IS NULL
- EMP_STA cannot have a value other than 'T' for Terminated and 'A' for Active
SELECT EMP_ID
FROM TCS_STG.EMP_DTL_STG A
WHERE EMP_STA NOT IN ('T', 'A')
- EMP_JOIN_DT cannot be greater than SYSDATE
SELECT EMP_ID
FROM TCS_STG.EMP_DTL_STG A
WHERE EMP_JOIN_DT > SYSDATE
There will be 1 scenario for fields that need to have unique values and 1 scenario for fields that
need to have only specific values. The number of test cases will be equal to the number of fields that
need to have unique values or specific values, which is 5 in our example.
Such data quality issues might not have been verified in the Source system and might have been moved
as such into Target system. They can also arise when the Target system gets appended with records
from Source system during every load instead of replacing or updating the existing records, or due to
refresh or truncation issues. Hence Data Quality verification is done in addition to
Data Integrity verification.
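The five data quality checks can be collected into one table-driven script so that each test case reports the offending EMP_IDs. This is a sketch under assumptions: SQLite in memory with DATE('now') standing in for SYSDATE, dates stored as ISO text, and the branch rule read as "one distinct BNCH_NM per EMP_ID", which is an interpretation. The names QUALITY_CHECKS and run_quality_checks and the planted bad rows are invented for illustration.

```python
import sqlite3

# One query per test case; each returns the EMP_IDs that violate the rule.
QUALITY_CHECKS = {
    "duplicate EMP_ID":
        "SELECT EMP_ID FROM EMP_DTL_STG GROUP BY EMP_ID HAVING COUNT(*) > 1",
    "more than one branch":
        "SELECT EMP_ID FROM EMP_DTL_STG GROUP BY EMP_ID HAVING COUNT(DISTINCT BNCH_NM) > 1",
    "missing EMP_NM":
        "SELECT EMP_ID FROM EMP_DTL_STG WHERE EMP_NM IS NULL",
    "invalid EMP_STA":
        "SELECT EMP_ID FROM EMP_DTL_STG WHERE EMP_STA NOT IN ('T', 'A')",
    "future EMP_JOIN_DT":
        "SELECT EMP_ID FROM EMP_DTL_STG WHERE EMP_JOIN_DT > DATE('now')",
}

def run_quality_checks(conn):
    """Return {check name: sorted offending EMP_IDs}; an empty list means the check passed."""
    return {
        name: sorted({row[0] for row in conn.execute(sql)})
        for name, sql in QUALITY_CHECKS.items()
    }

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE EMP_DTL_STG (EMP_ID INTEGER, EMP_NM TEXT, EMP_STA TEXT, BNCH_NM TEXT, EMP_JOIN_DT TEXT);
    -- Planted defects: a repeated employee in two branches, a missing name,
    -- an unknown status code and a joining date far in the future.
    INSERT INTO EMP_DTL_STG VALUES (267523, 'Prasanna', 'A', 'Chennai', '2003-09-19');
    INSERT INTO EMP_DTL_STG VALUES (267523, 'Prasanna', 'A', 'Mumbai', '2003-09-19');
    INSERT INTO EMP_DTL_STG VALUES (267524, NULL, 'A', 'Hyderabad', '2007-09-19');
    INSERT INTO EMP_DTL_STG VALUES (267525, 'Jaikumar', 'X', 'Bangalore', '2999-01-01');
""")

quality_report = run_quality_checks(conn)
```

Keeping the checks in a dictionary means a new rule is one more entry rather than another copy of the query-running code.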
e) Delta Load Verification
In any warehouse there are two different methods by which tables are loaded.
i. Truncate and Reload
The Target table being loaded is completely truncated and fresh data from source tables is
loaded again.
ii. Delta Load
The source tables of the target table being loaded are scanned for any change in records. If
changes are identified in any of the records in source tables, only those records are
updated in Target table. Similarly, if any new records are identified in source table, they are
added to Target table, and records deleted from source table are removed from Target table.
Delta testing is applicable only for Delta load tables. In this type of testing, a change in a record of
Source table is artificially created using UPDATE/INSERT/DELETE SQL commands. Then the workflow
for loading the corresponding Target table is executed. If the changes made are reflected in the Target
at the end of the load, the testing is successful. If not, the reasons have to be analyzed and fixed.
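A delta test can be rehearsed on a small scale as follows. This is a sketch under assumptions: the function delta_load is a made-up stand-in for the real load workflow (which would be run by the ETL tool), the database is SQLite in memory without schema prefixes, and the inserted employee 267526 is invented test data.

```python
import sqlite3

def delta_load(conn):
    """Hypothetical stand-in for the delta load workflow: sync EMP_STG with EMPLOYEE by key."""
    # Remove target rows whose key no longer exists in the source.
    conn.execute(
        "DELETE FROM EMP_STG WHERE EMP_ID NOT IN (SELECT EMPLOYEE_NUMBER FROM EMPLOYEE)"
    )
    # Update only the target rows whose name differs from the source.
    conn.execute("""
        UPDATE EMP_STG
        SET EMP_NM = (SELECT EMPLOYEE_NAME FROM EMPLOYEE WHERE EMPLOYEE_NUMBER = EMP_STG.EMP_ID)
        WHERE EMP_NM IS NOT
              (SELECT EMPLOYEE_NAME FROM EMPLOYEE WHERE EMPLOYEE_NUMBER = EMP_STG.EMP_ID)
    """)
    # Add source rows that are new to the target.
    conn.execute("""
        INSERT INTO EMP_STG (EMP_ID, EMP_NM)
        SELECT EMPLOYEE_NUMBER, EMPLOYEE_NAME FROM EMPLOYEE
        WHERE EMPLOYEE_NUMBER NOT IN (SELECT EMP_ID FROM EMP_STG)
    """)

def target_rows(conn):
    return conn.execute("SELECT EMP_ID, EMP_NM FROM EMP_STG ORDER BY EMP_ID").fetchall()

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE EMPLOYEE (EMPLOYEE_NUMBER INTEGER, EMPLOYEE_NAME TEXT);
    CREATE TABLE EMP_STG (EMP_ID INTEGER, EMP_NM TEXT);
    INSERT INTO EMPLOYEE VALUES (267523, 'Prasanna'), (267524, 'Vivek'), (267525, 'Jaikumar');
    INSERT INTO EMP_STG VALUES (267523, 'Prasanna'), (267524, 'Vivek'), (267525, 'Jaikumar');
""")

# Artificially create one change of each kind in the source table.
conn.execute("UPDATE EMPLOYEE SET EMPLOYEE_NAME = 'Prasanna K' WHERE EMPLOYEE_NUMBER = 267523")
conn.execute("INSERT INTO EMPLOYEE VALUES (267526, 'Asha')")
conn.execute("DELETE FROM EMPLOYEE WHERE EMPLOYEE_NUMBER = 267525")

delta_load(conn)
rows_after_load = target_rows(conn)
```

The test passes when the target reflects all three changes and the untouched record (267524) is left as it was.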