etl testing simplified

Upload: ashok-kumar-k-r

Post on 03-Jun-2018


  • 8/12/2019 ETL Testing Simplified

    1/13

    TCS Public

    BFS 2.1

    Prasanna Desigan Kesavan

    [email protected]

    ETL Testing Simplified

    Version 1.1

    20/01/2012


    Document History

    Revision History

    Version  Date of Change  Owner of Changes          Description of Change
    1.0      13 Jan 2012     Prasanna Desigan Kesavan  Created the document
    1.1      20 Jan 2012     Prasanna Desigan Kesavan  Added Revision History, TOC, Page Numbers


    Table of Contents

    1. Introduction

    2. Types of ETL Data movement
        a) Source to Target through Direct Pull
        b) Source to Target through Lookup
        c) Source to Target straight move and Source to Target through Lookup
        d) Source to Target through Direct Pull and Source to Target through Derivation
        e) Source to Target through Direct Pull, through Lookup and through Derivation

    3. Different stages of ETL testing
        a) Record Count Verification
        b) Data Completeness Verification
        c) Data Integrity Verification
            i. Data Integrity verification for Direct Pull fields
            ii. Data Integrity verification for Lookup fields
            iii. Data Integrity verification for Derived fields
        d) Data Quality Verification
        e) Delta Load Verification
            i. Truncate and Reload
            ii. Delta Load


    1. Introduction

    ETL stands for Extract, Transform, Load: data is extracted from a source system, transformed as per the requirements of the target system, and loaded into the target system. ETL mainly applies to data movement between tables and databases. The main purpose of ETL testing is to verify that this data movement has been done properly.

    2. Types of ETL Data movement

    a) Source to Target through Direct Pull

    Data available in the Source system is copied as is into the Target system without any transformation. The field names might get abbreviated or expanded, but the values stay intact.

    For example, all records in the TCS.EMPLOYEE table are moved into the TCS_STG.EMP_STG table, where EMPLOYEE is a table in the TCS schema and EMP_STG is a table in the TCS_STG schema. TCS.EMPLOYEE has the fields EMPLOYEE_NUMBER, EMPLOYEE_STATUS, EMPLOYEE_NAME and EMPLOYEE_JOINING_DATE, while TCS_STG.EMP_STG has the corresponding fields EMP_ID, EMP_STA, EMP_NM and EMP_JOIN_DT.

    Source TCS.EMPLOYEE

    EMPLOYEE_NUMBER  EMPLOYEE_STATUS  EMPLOYEE_NAME  EMPLOYEE_JOINING_DATE
    267523           Active           Prasanna       19-09-1993

    SELECT EMPLOYEE_NUMBER AS EMP_ID,
           EMPLOYEE_STATUS AS EMP_STA,
           EMPLOYEE_NAME AS EMP_NM,
           EMPLOYEE_JOINING_DATE AS EMP_JOIN_DT
    FROM TCS.EMPLOYEE

    Target TCS_STG.EMP_STG

    EMP_ID  EMP_STA  EMP_NM    EMP_JOIN_DT
    267523  Active   Prasanna  19-09-1993
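
    The direct pull above can be tried out with a small sketch. It assumes Python's built-in sqlite3 module as a stand-in for the real source and staging databases, and it drops the TCS and TCS_STG schema prefixes because the in-memory database has only one schema. It is an illustration, not the actual ETL job.

```python
import sqlite3

# In-memory SQLite stands in for the real databases (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE EMPLOYEE (EMPLOYEE_NUMBER INTEGER, EMPLOYEE_STATUS TEXT,
                           EMPLOYEE_NAME TEXT, EMPLOYEE_JOINING_DATE TEXT);
    CREATE TABLE EMP_STG (EMP_ID INTEGER, EMP_STA TEXT,
                          EMP_NM TEXT, EMP_JOIN_DT TEXT);
    INSERT INTO EMPLOYEE VALUES (267523, 'Active', 'Prasanna', '19-09-1993');
""")

# Direct pull: only the column names change, the values are copied as they are.
conn.execute("""
    INSERT INTO EMP_STG (EMP_ID, EMP_STA, EMP_NM, EMP_JOIN_DT)
    SELECT EMPLOYEE_NUMBER, EMPLOYEE_STATUS, EMPLOYEE_NAME, EMPLOYEE_JOINING_DATE
    FROM EMPLOYEE
""")

target_rows = conn.execute("SELECT * FROM EMP_STG").fetchall()
print(target_rows)
```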

    b) Source to Target through Lookup

    Data available in the Source system is used to pick up values from a Lookup system, and the picked-up values are moved directly into the Target system without any transformation. The Source data itself will not be present in the Target system; only the corresponding data from the Lookup system will be. The field names might get abbreviated or expanded, but the values stay intact.


    For example, all ID information in the TCS.EMPLOYEE_DETAIL table is moved as name information into the TCS_STG.EMP_DTL_STG table, where EMPLOYEE_DETAIL is a table in the TCS schema and EMP_DTL_STG is a table in the TCS_STG schema. TCS.EMPLOYEE_DETAIL has the fields EMPLOYEE_ID, DEPARTMENT_ID and BRANCH_ID, while TCS_STG.EMP_DTL_STG has the fields EMP_NM, DEPT_NM and BNCH_NM respectively. The names of the Employee, Department and Branch are picked up from the TCS.EMPLOYEE, TCS.DEPARTMENT and TCS.BRANCH Lookup tables respectively, using the IDs of the Employee, Department and Branch in TCS.EMPLOYEE_DETAIL.

    Source TCS.EMPLOYEE_DETAIL

    EMPLOYEE_ID  DEPARTMENT_ID  BRANCH_ID
    267523       2              5

    Lookup TCS.EMPLOYEE

    EMPLOYEE_ID  EMPLOYEE_NAME
    267523       Prasanna

    Lookup TCS.DEPARTMENT

    DEPARTMENT_ID  DEPARTMENT_NAME
    2              Testing

    Lookup TCS.BRANCH

    BRANCH_ID  BRANCH_NAME
    5          Chennai

    SELECT EMPLOYEE_NAME AS EMP_NM,
           DEPARTMENT_NAME AS DEPT_NM,
           BRANCH_NAME AS BNCH_NM
    FROM TCS.EMPLOYEE_DETAIL A,
         TCS.EMPLOYEE B,
         TCS.DEPARTMENT C,
         TCS.BRANCH D
    WHERE A.EMPLOYEE_ID = B.EMPLOYEE_ID
    AND A.DEPARTMENT_ID = C.DEPARTMENT_ID
    AND A.BRANCH_ID = D.BRANCH_ID

    Target TCS_STG.EMP_DTL_STG

    EMP_NM    DEPT_NM  BNCH_NM
    Prasanna  Testing  Chennai
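
    As a rough illustration of the lookup movement, the sketch below rebuilds the example tables in an in-memory SQLite database (an assumption; the schema prefixes are dropped) and resolves the three IDs to names. It uses explicit JOIN syntax, which is equivalent to the comma-separated join with WHERE conditions.

```python
import sqlite3

# In-memory SQLite stands in for the real databases (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE EMPLOYEE_DETAIL (EMPLOYEE_ID INTEGER, DEPARTMENT_ID INTEGER, BRANCH_ID INTEGER);
    CREATE TABLE EMPLOYEE (EMPLOYEE_ID INTEGER, EMPLOYEE_NAME TEXT);
    CREATE TABLE DEPARTMENT (DEPARTMENT_ID INTEGER, DEPARTMENT_NAME TEXT);
    CREATE TABLE BRANCH (BRANCH_ID INTEGER, BRANCH_NAME TEXT);
    INSERT INTO EMPLOYEE_DETAIL VALUES (267523, 2, 5);
    INSERT INTO EMPLOYEE VALUES (267523, 'Prasanna');
    INSERT INTO DEPARTMENT VALUES (2, 'Testing');
    INSERT INTO BRANCH VALUES (5, 'Chennai');
""")

# Each ID in the source row is replaced by the name found in its lookup table.
lookup_rows = conn.execute("""
    SELECT B.EMPLOYEE_NAME AS EMP_NM,
           C.DEPARTMENT_NAME AS DEPT_NM,
           D.BRANCH_NAME AS BNCH_NM
    FROM EMPLOYEE_DETAIL A
    JOIN EMPLOYEE B ON A.EMPLOYEE_ID = B.EMPLOYEE_ID
    JOIN DEPARTMENT C ON A.DEPARTMENT_ID = C.DEPARTMENT_ID
    JOIN BRANCH D ON A.BRANCH_ID = D.BRANCH_ID
""").fetchall()
print(lookup_rows)
```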

    c) Source to Target straight move and Source to Target through Lookup

    Some of the data in the Source system is copied as is into the Target system without any transformation, while the remaining data is used to pick up values from a Lookup system, and the picked-up values are moved directly into the Target system without any transformation. Both the Source data and the corresponding Lookup data will be present in the Target system. The field names might get abbreviated or expanded, but the values stay intact.

    Source TCS.EMPLOYEE_DETAIL

    EMPLOYEE_ID  DEPARTMENT_NAME  BRANCH_NAME
    267523       Testing          Chennai

    Lookup TCS.EMPLOYEE

    EMPLOYEE_ID  EMPLOYEE_NAME
    267523       Prasanna

    Target TCS_STG.EMP_DTL_STG

    EMP_ID  EMP_NM    DEPT_NM  BNCH_NM
    267523  Prasanna  Testing  Chennai


    SELECT A.EMPLOYEE_ID AS EMP_ID,
           B.EMPLOYEE_NAME AS EMP_NM,
           A.DEPARTMENT_NAME AS DEPT_NM,
           A.BRANCH_NAME AS BNCH_NM
    FROM TCS.EMPLOYEE_DETAIL A,
         TCS.EMPLOYEE B
    WHERE A.EMPLOYEE_ID = B.EMPLOYEE_ID

    d) Source to Target through Direct Pull and Source to Target through Derivation

    All or some of the data in the Source system is copied as is into the Target system without any transformation, while some data is used to derive values that are not available in the Source system, and the derived values are moved into the Target system. Both the Source data and the derived data will be present in the Target system. The field names might get abbreviated or expanded.

    For example, all records in the TCS.EMPLOYEE table are moved into the TCS_STG.EMP_STG table, where EMPLOYEE is a table in the TCS schema and EMP_STG is a table in the TCS_STG schema. TCS.EMPLOYEE has the fields EMPLOYEE_NUMBER, EMPLOYEE_STATUS, EMPLOYEE_NAME and EMPLOYEE_JOINING_DATE, while TCS_STG.EMP_STG has the fields EMP_ID, EMP_STA, EMP_NM, EMP_JOIN_DT and EMP_GEMS.

    EMP_GEMS in TCS_STG.EMP_STG is derived from EMPLOYEE_STATUS and EMPLOYEE_JOINING_DATE in the TCS.EMPLOYEE table using the following business logic. When Employee Status is Terminated, EMP_GEMS = 0. When Employee Status is Active and the Employee has more than 3 years (but not more than 5 years) of experience, EMP_GEMS = 1000. When Employee Status is Active and the Employee has more than 5 years of experience, EMP_GEMS = 2000.

    Source TCS.EMPLOYEE

    EMPLOYEE_NUMBER  EMPLOYEE_STATUS  EMPLOYEE_NAME  EMPLOYEE_JOINING_DATE
    267523           Active           Prasanna       19-09-2003
    267524           Active           Vivek          19-09-2007
    267525           Terminated       Jaikumar       19-09-1993

    SELECT EMPLOYEE_NUMBER AS EMP_ID,
           EMPLOYEE_STATUS AS EMP_STA,
           EMPLOYEE_NAME AS EMP_NM,
           EMPLOYEE_JOINING_DATE AS EMP_JOIN_DT,
           CASE
               WHEN EMPLOYEE_STATUS = 'Terminated'
                   THEN 0
               -- The 5-year rule is tested first because CASE returns the first matching branch
               WHEN EMPLOYEE_STATUS = 'Active' AND EMPLOYEE_JOINING_DATE < TO_DATE('20-JUL-2006', 'DD-MON-YYYY')
                   THEN 2000
               WHEN EMPLOYEE_STATUS = 'Active' AND EMPLOYEE_JOINING_DATE < TO_DATE('20-JUL-2008', 'DD-MON-YYYY')
                   THEN 1000
           END AS EMP_GEMS
    FROM TCS.EMPLOYEE
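
    The derivation can be sketched as follows, assuming an in-memory SQLite database in place of the real one and joining dates stored as ISO strings (YYYY-MM-DD) so that plain string comparison orders them correctly. The cut-off dates 20-JUL-2006 and 20-JUL-2008 are taken from the example; the five-year condition is tested before the three-year one because CASE returns the first branch that matches.

```python
import sqlite3

# In-memory SQLite stands in for the real database (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE EMPLOYEE (EMPLOYEE_NUMBER INTEGER, EMPLOYEE_STATUS TEXT,
                           EMPLOYEE_NAME TEXT, EMPLOYEE_JOINING_DATE TEXT);
    INSERT INTO EMPLOYEE VALUES (267523, 'Active', 'Prasanna', '2003-09-19');
    INSERT INTO EMPLOYEE VALUES (267524, 'Active', 'Vivek', '2007-09-19');
    INSERT INTO EMPLOYEE VALUES (267525, 'Terminated', 'Jaikumar', '1993-09-19');
""")

# Derive EMP_GEMS from status and joining date; the stricter rule comes first.
derived = conn.execute("""
    SELECT EMPLOYEE_NUMBER AS EMP_ID,
           CASE
               WHEN EMPLOYEE_STATUS = 'Terminated' THEN 0
               WHEN EMPLOYEE_STATUS = 'Active'
                    AND EMPLOYEE_JOINING_DATE < '2006-07-20' THEN 2000
               WHEN EMPLOYEE_STATUS = 'Active'
                    AND EMPLOYEE_JOINING_DATE < '2008-07-20' THEN 1000
           END AS EMP_GEMS
    FROM EMPLOYEE
    ORDER BY EMPLOYEE_NUMBER
""").fetchall()
print(derived)
```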

    e) Source to Target through Direct Pull, through Lookup and through Derivation

    All or some of the data in the Source system is copied as is into the Target system without any transformation, while some data is used to pick up values from a Lookup system, and the picked-up values are moved directly into the Target system without any transformation. In addition, some data is used to derive values that are not available in the Source system, and the derived values are moved into the Target system. The Source data, the corresponding Lookup data and the derived data will all be present in the Target system. The field names might get abbreviated or expanded.

    Source TCS.EMPLOYEE

    EMPLOYEE_NUMBER  EMPLOYEE_STATUS  BRANCH_NAME  EMPLOYEE_JOINING_DATE
    267523           Active           Chennai      19-09-2003
    267524           Active           Hyderabad    19-09-2007
    267525           Terminated       Bangalore    19-09-1993

    Lookup TCS.EMP_DTL

    EMPLOYEE_NUMBER  EMPLOYEE_NAME
    267523           Prasanna
    267524           Vivek
    267525           Jaikumar

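    SELECT A.EMPLOYEE_NUMBER AS EMP_ID,
           A.EMPLOYEE_STATUS AS EMP_STA,
           B.EMPLOYEE_NAME AS EMP_NM,
           A.EMPLOYEE_JOINING_DATE AS EMP_JOIN_DT,
           CASE
               WHEN A.EMPLOYEE_STATUS = 'Terminated'
                   THEN 0
               -- The 5-year rule is tested first because CASE returns the first matching branch
               WHEN A.EMPLOYEE_STATUS = 'Active' AND A.EMPLOYEE_JOINING_DATE < TO_DATE('20-JUL-2006', 'DD-MON-YYYY')
                   THEN 2000
               WHEN A.EMPLOYEE_STATUS = 'Active' AND A.EMPLOYEE_JOINING_DATE < TO_DATE('20-JUL-2008', 'DD-MON-YYYY')
                   THEN 1000
           END AS EMP_GEMS
    FROM TCS.EMPLOYEE A,
         TCS.EMP_DTL B
    WHERE A.EMPLOYEE_NUMBER = B.EMPLOYEE_NUMBER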

    Target TCS_STG.EMP_STG

    EMP_ID  EMP_STA     EMP_NM    EMP_JOIN_DT  EMP_GEMS
    267523  Active      Prasanna  19-09-2003   2000
    267524  Active      Vivek     19-09-2007   1000
    267525  Terminated  Jaikumar  19-09-1993   0


    3. Different stages of ETL testing

    a) Record Count Verification

    Record Count Verification compares the number of records in the Source system with the number of records in the Target system. It applies to every type of ETL data movement. The Source system will usually be a single table from which all or key information is loaded into the Target table.

    For example, data from the TCS.EMPLOYEE_DETAIL, TCS.EMPLOYEE, TCS.DEPARTMENT and TCS.BRANCH tables is loaded into TCS_STG.EMP_DTL_STG. The Source system here is the TCS.EMPLOYEE_DETAIL table, because any Employee who has a record in this table will have a record in the TCS_STG.EMP_DTL_STG table, irrespective of whether the employee is present in the TCS.EMPLOYEE, TCS.DEPARTMENT and TCS.BRANCH tables.

    Here, we verify whether the number of records (employees) in TCS.EMPLOYEE_DETAIL is the same as the number of records in the TCS_STG.EMP_DTL_STG table.

    This will be 1 scenario as well as 1 test case.

    SELECT 'Source', COUNT(*)
    FROM TCS.EMPLOYEE_DETAIL A
    UNION
    SELECT 'Target', COUNT(*)
    FROM TCS_STG.EMP_DTL_STG B
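
    A record count check could be automated along the following lines. This is a sketch that assumes an in-memory SQLite database without schema prefixes; the second source row and the helper name record_counts are hypothetical and only serve to show a failing comparison.

```python
import sqlite3

# In-memory SQLite stands in for the real databases (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE EMPLOYEE_DETAIL (EMPLOYEE_ID INTEGER, DEPARTMENT_ID INTEGER, BRANCH_ID INTEGER);
    CREATE TABLE EMP_DTL_STG (EMP_NM TEXT, DEPT_NM TEXT, BNCH_NM TEXT);
    -- The second source row is hypothetical and was "lost" by the load.
    INSERT INTO EMPLOYEE_DETAIL VALUES (267523, 2, 5), (267524, 2, 6);
    INSERT INTO EMP_DTL_STG VALUES ('Prasanna', 'Testing', 'Chennai');
""")

def record_counts(conn):
    """Return the row counts of the source and target tables, keyed by label."""
    rows = conn.execute("""
        SELECT 'Source', COUNT(*) FROM EMPLOYEE_DETAIL
        UNION ALL
        SELECT 'Target', COUNT(*) FROM EMP_DTL_STG
    """).fetchall()
    return dict(rows)

counts = record_counts(conn)
counts_match = counts["Source"] == counts["Target"]
print(counts, counts_match)
```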

    b) Data Completeness Verification

    In all types of ETL data movement except Source to Target through Lookup, Data Completeness Verification compares the key information in the Source system with the key information in the Target system. The Source system will usually be a single table from which all or key information is loaded into the Target table.

    For example, EMP_ID, EMP_STA, EMP_NM and EMP_JOIN_DT, along with EMP_GEMS derived from the TCS.EMPLOYEE table, are loaded into the TCS_STG.EMP_STG table.

    Here, we verify whether all EMPLOYEE_NUMBERs in the TCS.EMPLOYEE table are present in the TCS_STG.EMP_STG table as EMP_IDs, and that no additional EMP_IDs are present in TCS_STG.EMP_STG.

    Target TCS_STG.EMP_STG

    EMP_ID  EMP_STA     EMP_NM    EMP_BNCH_NM  EMP_JOIN_DT  EMP_GEMS
    267523  Active      Prasanna  Chennai      19-09-2003   2000
    267524  Active      Vivek     Hyderabad    19-09-2007   1000
    267525  Terminated  Jaikumar  Bangalore    19-09-1993   0

    SELECT EMPLOYEE_NUMBER
    FROM TCS.EMPLOYEE A
    MINUS
    SELECT EMP_ID
    FROM TCS_STG.EMP_STG B

    SELECT EMP_ID
    FROM TCS_STG.EMP_STG B
    MINUS
    SELECT EMPLOYEE_NUMBER
    FROM TCS.EMPLOYEE A
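
    The two-way comparison can be sketched as below. The sketch assumes an in-memory SQLite database, where the set operator is spelled EXCEPT rather than Oracle's MINUS, and it uses a hypothetical faulty load (one key missing, one key extra) so that both directions return a row.

```python
import sqlite3

# In-memory SQLite stands in for the real databases (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE EMPLOYEE (EMPLOYEE_NUMBER INTEGER);
    CREATE TABLE EMP_STG (EMP_ID INTEGER);
    INSERT INTO EMPLOYEE VALUES (267523), (267524), (267525);
    -- Hypothetical faulty load: 267525 is missing and 999999 is extra.
    INSERT INTO EMP_STG VALUES (267523), (267524), (999999);
""")

# Test case 1: keys present in the source but missing from the target.
missing_in_target = [row[0] for row in conn.execute(
    "SELECT EMPLOYEE_NUMBER FROM EMPLOYEE EXCEPT SELECT EMP_ID FROM EMP_STG")]

# Test case 2: keys present in the target that do not exist in the source.
extra_in_target = [row[0] for row in conn.execute(
    "SELECT EMP_ID FROM EMP_STG EXCEPT SELECT EMPLOYEE_NUMBER FROM EMPLOYEE")]

print(missing_in_target, extra_in_target)
```

    Both lists are empty when the load is complete; running only one direction would miss either the lost key or the unexpected one.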

    In the Source to Target through Lookup type of ETL data movement, Data Completeness Verification compares the key information in the Source system, resolved through the Lookup system, with the picked-up information in the Target system. The Source system will be a combination of one or more tables from which the key information is picked up and loaded into the Target table.

    For example, EMP_NM, DEPT_NM and BNCH_NM are picked up from the TCS.EMPLOYEE, TCS.DEPARTMENT and TCS.BRANCH Lookup tables using the IDs of the Employee, Department and Branch in TCS.EMPLOYEE_DETAIL, and loaded into the TCS_STG.EMP_DTL_STG table.

    Here, we verify whether all EMPLOYEE_NUMBERs in the TCS.EMPLOYEE_DETAIL table are present as EMP_NMs in the TCS_STG.EMP_DTL_STG table, and that no additional EMP_NMs are present in the TCS_STG.EMP_DTL_STG table.

    This will be 1 scenario with 2 test cases under it: one to verify that all key information in the Source system is present in the Target system, and the other to verify that no additional key information is present in the Target system.

    SELECT EMPLOYEE_NAME
    FROM TCS.EMPLOYEE_DETAIL A,
         TCS.EMPLOYEE B
    WHERE A.EMPLOYEE_NUMBER = B.EMPLOYEE_NUMBER
    MINUS
    SELECT EMP_NM
    FROM TCS_STG.EMP_DTL_STG C

    SELECT EMP_NM
    FROM TCS_STG.EMP_DTL_STG C
    MINUS
    SELECT EMPLOYEE_NAME
    FROM TCS.EMPLOYEE_DETAIL A,
         TCS.EMPLOYEE B
    WHERE A.EMPLOYEE_NUMBER = B.EMPLOYEE_NUMBER


    This will be 1 scenario, while the number of test cases will be equal to the number of lookup systems, which is 3 in our example.

    SELECT EMP_ID,
           EMP_NM
    FROM TCS_STG.EMP_DTL_STG A
    MINUS
    SELECT A.EMPLOYEE_NUMBER AS EMP_ID,
           B.EMPLOYEE_NAME AS EMP_NM
    FROM TCS.EMPLOYEE_DETAIL A,
         TCS.EMPLOYEE B
    WHERE A.EMPLOYEE_NUMBER = B.EMPLOYEE_NUMBER

    SELECT DEPT_ID,
           DEPT_NM
    FROM TCS_STG.EMP_DTL_STG A
    MINUS
    SELECT A.DEPARTMENT_ID AS DEPT_ID,
           C.DEPARTMENT_NAME AS DEPT_NM
    FROM TCS.EMPLOYEE_DETAIL A,
         TCS.DEPARTMENT C
    WHERE A.DEPARTMENT_ID = C.DEPARTMENT_ID

    SELECT BNCH_ID,
           BNCH_NM
    FROM TCS_STG.EMP_DTL_STG A
    MINUS
    SELECT A.BRANCH_ID AS BNCH_ID,
           D.BRANCH_NAME AS BNCH_NM
    FROM TCS.EMPLOYEE_DETAIL A,
         TCS.BRANCH D
    WHERE A.BRANCH_ID = D.BRANCH_ID
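
    One way to run one check per lookup system is to keep the queries in a dictionary and collect the mismatching rows for each. The sketch below assumes an in-memory SQLite database (EXCEPT instead of MINUS, no schema prefixes), covers only two of the three lookups to stay short, and plants a hypothetical wrong department name in the target.

```python
import sqlite3

# In-memory SQLite stands in for the real databases (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE EMPLOYEE_DETAIL (EMPLOYEE_NUMBER INTEGER, DEPARTMENT_ID INTEGER);
    CREATE TABLE EMPLOYEE (EMPLOYEE_NUMBER INTEGER, EMPLOYEE_NAME TEXT);
    CREATE TABLE DEPARTMENT (DEPARTMENT_ID INTEGER, DEPARTMENT_NAME TEXT);
    CREATE TABLE EMP_DTL_STG (EMP_ID INTEGER, EMP_NM TEXT, DEPT_ID INTEGER, DEPT_NM TEXT);
    INSERT INTO EMPLOYEE_DETAIL VALUES (267523, 2);
    INSERT INTO EMPLOYEE VALUES (267523, 'Prasanna');
    INSERT INTO DEPARTMENT VALUES (2, 'Testing');
    -- Hypothetical defect: the department name was loaded incorrectly.
    INSERT INTO EMP_DTL_STG VALUES (267523, 'Prasanna', 2, 'Development');
""")

# One test case per lookup system: target pairs that the lookup cannot explain.
LOOKUP_CHECKS = {
    "employee": """
        SELECT EMP_ID, EMP_NM FROM EMP_DTL_STG
        EXCEPT
        SELECT A.EMPLOYEE_NUMBER, B.EMPLOYEE_NAME
        FROM EMPLOYEE_DETAIL A
        JOIN EMPLOYEE B ON A.EMPLOYEE_NUMBER = B.EMPLOYEE_NUMBER
    """,
    "department": """
        SELECT DEPT_ID, DEPT_NM FROM EMP_DTL_STG
        EXCEPT
        SELECT A.DEPARTMENT_ID, C.DEPARTMENT_NAME
        FROM EMPLOYEE_DETAIL A
        JOIN DEPARTMENT C ON A.DEPARTMENT_ID = C.DEPARTMENT_ID
    """,
}

mismatches = {name: conn.execute(sql).fetchall() for name, sql in LOOKUP_CHECKS.items()}
print(mismatches)
```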

    iii. Data Integrity verification for Derived fields

    Here, we verify whether

    - When EMP_STA = 'T' in the TCS_STG.EMP_DTL_STG table, EMP_GEMS = 0
    - When EMP_STA = 'A' and SYSDATE > (EMP_JOIN_DT + 3 Years), but not > (EMP_JOIN_DT + 5 Years), in the TCS_STG.EMP_DTL_STG table, EMP_GEMS = 1000
    - When EMP_STA = 'A' and SYSDATE > (EMP_JOIN_DT + 5 Years) in the TCS_STG.EMP_DTL_STG table, EMP_GEMS = 2000

    There will be 1 scenario for each derived field, while the number of test cases will be equal to the number of possible values for that derived field, which is 3 in our example.

    SELECT EMP_ID
    FROM TCS_STG.EMP_DTL_STG A
    WHERE EMP_STA = 'T'
    AND EMP_GEMS > 0

    SELECT EMP_ID
    FROM TCS_STG.EMP_DTL_STG A
    WHERE EMP_STA = 'A'
    AND EMP_JOIN_DT < TO_DATE('20-JUL-2008', 'DD-MON-YYYY')
    AND EMP_JOIN_DT >= TO_DATE('20-JUL-2006', 'DD-MON-YYYY')
    AND EMP_GEMS <> 1000

    SELECT EMP_ID
    FROM TCS_STG.EMP_DTL_STG A
    WHERE EMP_STA = 'A'
    AND EMP_JOIN_DT < TO_DATE('20-JUL-2006', 'DD-MON-YYYY')
    AND EMP_GEMS <> 2000
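
    The three derived-field test cases can be expressed as "find the rows that break the rule" conditions and run in a loop. This is a sketch under assumptions: an in-memory SQLite database, joining dates stored as ISO strings, the example's cut-off dates, and one hypothetical defective row (267524 loaded with 2000 instead of 1000).

```python
import sqlite3

# In-memory SQLite stands in for the real staging database (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE EMP_DTL_STG (EMP_ID INTEGER, EMP_STA TEXT, EMP_JOIN_DT TEXT, EMP_GEMS INTEGER);
    INSERT INTO EMP_DTL_STG VALUES (267523, 'A', '2003-09-19', 2000);
    -- Hypothetical defect: this employee should have 1000, not 2000.
    INSERT INTO EMP_DTL_STG VALUES (267524, 'A', '2007-09-19', 2000);
    INSERT INTO EMP_DTL_STG VALUES (267525, 'T', '1993-09-19', 0);
""")

# Each condition selects rows that VIOLATE one business rule.
CHECKS = {
    "terminated_has_zero": "EMP_STA = 'T' AND EMP_GEMS <> 0",
    "three_to_five_years": ("EMP_STA = 'A' AND EMP_JOIN_DT < '2008-07-20' "
                            "AND EMP_JOIN_DT >= '2006-07-20' AND EMP_GEMS <> 1000"),
    "over_five_years": "EMP_STA = 'A' AND EMP_JOIN_DT < '2006-07-20' AND EMP_GEMS <> 2000",
}

failures = {
    name: [row[0] for row in conn.execute(
        f"SELECT EMP_ID FROM EMP_DTL_STG WHERE {condition} ORDER BY EMP_ID")]
    for name, condition in CHECKS.items()
}
print(failures)
```

    The lower bound in the three-to-five-year check keeps employees with more than five years, who correctly hold 2000, from being reported as failures.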

    d) Data Quality Verification

    Data Quality Verification checks that the key field in the Target system has unique values and that certain other fields in the Target system do not hold any value other than the specified ones.

    For example, EMP_ID, EMP_STA, EMP_JOIN_DT, EMP_NM, DEPT_NM, BNCH_NM and EMP_GEMS are present in the TCS_STG.EMP_DTL_STG table.

    - EMP_ID field must have unique values, as there cannot be more than one record for an employee

    SELECT EMP_ID
    FROM TCS_STG.EMP_DTL_STG A
    GROUP BY EMP_ID
    HAVING COUNT(*) > 1

    - BNCH_NM field must have a single value for each employee, as one employee cannot work in more than one branch

    SELECT EMP_ID
    FROM TCS_STG.EMP_DTL_STG A
    GROUP BY EMP_ID
    HAVING COUNT(DISTINCT BNCH_NM) > 1

    - EMP_NM field cannot have NULL values, as every employee must have a name

    SELECT EMP_ID
    FROM TCS_STG.EMP_DTL_STG A
    WHERE EMP_NM IS NULL

    - EMP_STA cannot have a value other than T for Terminated and A for Active

    SELECT EMP_ID
    FROM TCS_STG.EMP_DTL_STG A
    WHERE EMP_STA NOT IN ('T', 'A')

    - EMP_JOIN_DT cannot be greater than SYSDATE

    SELECT EMP_ID
    FROM TCS_STG.EMP_DTL_STG A
    WHERE EMP_JOIN_DT > SYSDATE

    There will be 1 scenario for fields that need to have unique values and 1 scenario for fields that need to have only specific values. The number of test cases will be equal to the number of fields that need to have unique values or specific values, which is 5 in our example.

    Such data quality issues might not have been verified in the Source system and might have been moved as is into the Target system. They can also arise when the Target system gets appended with records from the Source system during every load instead of replacing or updating the existing records, or due to refresh or truncation issues. Hence Data Quality verification is done in addition to Data Integrity verification.
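
    The five data quality test cases lend themselves to a small table-driven runner. The sketch below assumes an in-memory SQLite database, dates stored as ISO strings, DATE('now') in place of Oracle's SYSDATE, and deliberately bad hypothetical rows so that every check reports something.

```python
import sqlite3

# In-memory SQLite stands in for the real staging database (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE EMP_DTL_STG (EMP_ID INTEGER, EMP_STA TEXT, EMP_NM TEXT,
                              BNCH_NM TEXT, EMP_JOIN_DT TEXT);
    -- Hypothetical bad data: a duplicated employee in two branches, and a row
    -- with an invalid status, no name and a joining date in the future.
    INSERT INTO EMP_DTL_STG VALUES (267523, 'A', 'Prasanna', 'Chennai', '2003-09-19');
    INSERT INTO EMP_DTL_STG VALUES (267523, 'A', 'Prasanna', 'Mumbai', '2003-09-19');
    INSERT INTO EMP_DTL_STG VALUES (267524, 'X', NULL, 'Hyderabad', '2999-01-01');
""")

# Each query returns the EMP_IDs that violate one data quality rule.
QUALITY_CHECKS = {
    "duplicate_emp_id":
        "SELECT EMP_ID FROM EMP_DTL_STG GROUP BY EMP_ID HAVING COUNT(*) > 1",
    "multiple_branches":
        "SELECT EMP_ID FROM EMP_DTL_STG GROUP BY EMP_ID HAVING COUNT(DISTINCT BNCH_NM) > 1",
    "missing_name":
        "SELECT EMP_ID FROM EMP_DTL_STG WHERE EMP_NM IS NULL",
    "invalid_status":
        "SELECT EMP_ID FROM EMP_DTL_STG WHERE EMP_STA NOT IN ('T', 'A')",
    "future_join_date":
        "SELECT EMP_ID FROM EMP_DTL_STG WHERE EMP_JOIN_DT > DATE('now')",
}

violations = {name: [row[0] for row in conn.execute(sql)]
              for name, sql in QUALITY_CHECKS.items()}
print(violations)
```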

    e) Delta Load Verification

    In any warehouse, tables are loaded using one of two methods.

    i. Truncate and Reload

    The Target table being loaded is completely truncated, and fresh data from the source tables is loaded again.

    ii. Delta Load

    The source tables of the Target table being loaded are scanned for any change in records. If changes are identified in any records of the source tables, only those records are updated in the Target table. Similarly, any new records identified in the source tables are added to the Target table, and records deleted from the source tables are removed from the Target table.

    Delta testing is applicable only to Delta Load tables. In this type of testing, a change in a record of the Source table is created artificially using UPDATE/INSERT/DELETE SQL commands. Then the workflow for loading the corresponding Target table is executed. If the changes made are reflected in the Target table at the end of the load, the test is successful. If not, the reasons have to be analyzed and fixed.
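
    The delta test procedure can be sketched as follows. Everything here is an assumption made for illustration: an in-memory SQLite database, two-column versions of the example tables, and a hypothetical run_delta_load function that stands in for the real ETL workflow (it simply re-syncs the target, whereas a real delta load would touch only the changed records).

```python
import sqlite3

# In-memory SQLite stands in for the real databases (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE EMPLOYEE (EMPLOYEE_NUMBER INTEGER PRIMARY KEY, EMPLOYEE_STATUS TEXT);
    CREATE TABLE EMP_STG (EMP_ID INTEGER PRIMARY KEY, EMP_STA TEXT);
    INSERT INTO EMPLOYEE VALUES (267523, 'Active'), (267524, 'Active'), (267525, 'Terminated');
    INSERT INTO EMP_STG SELECT EMPLOYEE_NUMBER, EMPLOYEE_STATUS FROM EMPLOYEE;
""")

def run_delta_load(conn):
    """Hypothetical stand-in for the ETL workflow that loads the target table."""
    # Remove target rows whose source record was deleted.
    conn.execute("DELETE FROM EMP_STG WHERE EMP_ID NOT IN (SELECT EMPLOYEE_NUMBER FROM EMPLOYEE)")
    # Add new source rows and overwrite changed ones.
    conn.execute("""
        INSERT OR REPLACE INTO EMP_STG (EMP_ID, EMP_STA)
        SELECT EMPLOYEE_NUMBER, EMPLOYEE_STATUS FROM EMPLOYEE
    """)

# Step 1: artificially create one change of each kind in the source table.
conn.execute("UPDATE EMPLOYEE SET EMPLOYEE_STATUS = 'Terminated' WHERE EMPLOYEE_NUMBER = 267524")
conn.execute("INSERT INTO EMPLOYEE VALUES (267526, 'Active')")
conn.execute("DELETE FROM EMPLOYEE WHERE EMPLOYEE_NUMBER = 267525")

# Step 2: execute the load for the target table.
run_delta_load(conn)

# Step 3: the test passes if the target reflects every change made in the source.
source = conn.execute("SELECT EMPLOYEE_NUMBER, EMPLOYEE_STATUS FROM EMPLOYEE ORDER BY 1").fetchall()
target = conn.execute("SELECT EMP_ID, EMP_STA FROM EMP_STG ORDER BY 1").fetchall()
delta_ok = source == target
print(delta_ok, target)
```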