
Data Warehouse Testing Best Practices to Improve and Sustain Data Quality – Getting Ready for Serious DevOps

Ajay Nalabhatla, QA Lead
Srihari Gopisetty, Technology Manager
Wells Fargo India Solutions

Abstract

In the age of digital disruption, every organization wants to transform its technology arm toward advanced practices such as DevOps to fulfill the continuous demand from the business.

However, all organizations are data driven, and they need to realize that success relies not only on faster throughput and speed but also on the ability to access diverse, large volumes of complex data in real time to make strategic decisions.

The important question is – does the organization have 'Quality' data?

"On average, U.S. organizations believe 32% of their data is inaccurate." – Gartner
"The average organization loses $8.2 million annually due to poor Data Quality." – Experian
"Less than 0.5% of all data is ever analyzed." – Forrester


Even as many organizations are establishing Data Warehouse Testing as a specialized service, recent surveys indicate that much more improvement needs to be done. It is a call to action for organizations to address Data Quality gaps.

Setting the context

Data are of high quality "if they are fit for their intended uses in operations, decision making and planning." – J. M. Juran (Source: dqglossary.com)

Data Quality Dimensions: Validity, Completeness, Timeliness, Integrity, Accuracy, Consistency

Issue Drivers:
Unavailability of Complete Data
ETL Transformation
Delayed Batch SLA
Batch Performance
Obsolete Jobs & Records

QA Key Causes:
No Exhaustive Validation
Missing Defined Test Strategies
Lack of Tools / Accelerators
Incomplete DB Objects Validation
Missing End-to-End QA Framework
Missing Standard Process

REMEMBER: Poor Data Quality = Use of Less Information for Decision Making

Where, What, Why

[Architecture diagram] Heterogeneous sources – internal and external – arrive as XML, ASCII, EBCDIC, database tables and extracts. ETL loads them into a Staging/ODS layer and then into the Data Warehouse (tables, views and other DB objects), from where BI reports, views and extracts feed downstream applications, spanning OLTP and OLAP.

[High-level tests mapped onto this flow, prioritized High / Medium / Low]
Static Testing
ETL Transformation Testing
Staging/ODS Validation
Data Warehouse Validation
Data Quality & Objects Validation
Batch Performance
BI Testing / Extracts

These cover tables, views, reports and applications end to end.

QA Framework – For High Quality

• Exhaustive validation at every intermediate check point
• Data Integrity Validation – RI checks etc. (see the sketch below)
• Heterogeneous sources validation – XML / ASCII / EBCDIC
• Database privileges validation at table/view/report level
• Runbook and scheduler/dependency validations
• Database Objects Validation – partitions, synonyms, flashback etc.
• Batch performance execution
• BI reports – UI, data & performance validation
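As an illustration of the RI checks called out above, a minimal sketch against a MySQL warehouse (as used in the demo later in this deck); the tables fact_txn and dim_account and the connection variables are hypothetical placeholders:

#!/bin/sh
# ri_check.sh - report orphaned foreign-key values between a fact table and a dimension.
# Table, column and connection names are illustrative placeholders.
mysql -h "$DB_HOST" -u "$DB_USER" -p"$DB_PASS" "$DB_NAME" <<'SQL'
SELECT f.account_id, COUNT(*) AS orphan_rows
FROM   fact_txn f
LEFT JOIN dim_account d ON d.account_id = f.account_id
WHERE  d.account_id IS NULL
GROUP BY f.account_id;
SQL

A non-empty result means the fact table carries account_ids with no matching dimension row.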


Specialized Testing

Database Object Validation - Partition

[Diagram: a partitioned database table – user's view vs DBA's view]
Daily loads create day-wise partitions inside the table (Day 2, Day 3, ... Day 31); at month end they are merged into monthly partitions (Jan '17, Feb '17, ... Dec '17, Feb '18), and partitions older than 13 months are purged.

Test strategy / test flow: validate the first-time load from source, regression-test the interval and merge processing, and verify the monthly purge of partitions older than 13 months.
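A hedged sketch of checking the partition layout and the 13-month retention from SQL, assuming a MySQL warehouse as in the later demo; the schema dwh and the table txn_history are illustrative placeholders:

#!/bin/sh
# partition_check.sh - list the partitions of a table so the tester can verify the
# day-wise/monthly layout and confirm that partitions older than 13 months were purged.
mysql -N -h "$DB_HOST" -u "$DB_USER" -p"$DB_PASS" information_schema <<'SQL'
SELECT PARTITION_NAME, PARTITION_DESCRIPTION, TABLE_ROWS
FROM   PARTITIONS
WHERE  TABLE_SCHEMA = 'dwh'
  AND  TABLE_NAME   = 'txn_history'
ORDER  BY PARTITION_ORDINAL_POSITION;
SQL

The returned list can be compared against the expected retention window (current month minus 13) after the monthly merge and purge jobs run.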

Automation Possibility

Parameter | Home-Grown Tools | External Tools
Static Testing | Limited | Limited
Source File – Metadata / Layout / Field Order | Yes – Macro / UNIX shell | Yes
Exhaustive Validation – Different Server DBs | Yes – Macro / UNIX shell | Yes for both
Batch Job Execution in Sequence | Yes – UNIX | No
Heterogeneous File Load & Comparison (ASCII / XML / EBCDIC) | Yes – ASCII / XML only | Yes
Regression Testing – Tables / Extracts | Yes – Macro / UNIX | Yes
Data Quality Checks | Yes – Macro / UNIX | Yes
Table Metadata Validation | Yes – Macro / UNIX | Yes
BI Reports Validation (Data / Graphs) | Yes – data only | Yes
Batch Performance Testing | Yes – UNIX | Yes
Views Validation | Yes – Macro using ODBC / UNIX | Yes
Partition / Index Validation | No | No
Test Case Batch Execution | Yes – UNIX / Excel | Yes
Automated Test Execution Scheduler | Yes – UNIX | Yes

Note: External Tools = market tools (automation possibility).
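As a sketch of the "Exhaustive Validation" and regression rows above, a home-grown UNIX shell comparison of the same extract produced on two servers could look like this (file paths are placeholders):

#!/bin/sh
# extract_diff.sh - compare the same extract produced on two different servers.
# Usage: extract_diff.sh <legacy_extract> <new_extract>
LEGACY=$1
NEW=$2

sort "$LEGACY" > /tmp/legacy.sorted
sort "$NEW"    > /tmp/new.sorted

echo "Row counts: legacy=$(wc -l < /tmp/legacy.sorted)  new=$(wc -l < /tmp/new.sorted)"
if diff -q /tmp/legacy.sorted /tmp/new.sorted > /dev/null
then
    echo "PASS: extracts are identical"
else
    echo "FAIL: differences found"
    diff /tmp/legacy.sorted /tmp/new.sorted | head -20
fi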


Integrate the ETL Testing in DevOps

Usual architecture for ETL: Static Testing → ETL/DW Testing → Batch Performance Testing
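A minimal driver for this three-stage flow, in the spirit of the demo that follows; the per-stage scripts static_tests.sh, etl_dw_tests.sh and batch_perf_tests.sh are hypothetical names:

#!/bin/sh
# run_pipeline.sh - run the three QA stages in sequence and stop at the first failure,
# mirroring how the Jenkins pipeline later chains the same stages.
set -e
for stage in static_tests.sh etl_dw_tests.sh batch_perf_tests.sh
do
    echo ">>> Running $stage"
    ./"$stage"
done
echo "All stages passed"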


Demo Overview


CI / CT / CD stages: Static Testing, ETL/DW Testing, Batch Performance Testing

Sources & ETL Jobs


ETL Jobs

Shell Script

SQL – Test Cases

Data Validation Test Case

MySQL

Test .SQL File

.SQL in Script
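A hedged sketch of the kind of wrapper the demo uses: a shell script that runs every .sql data-validation test case against MySQL and records pass/fail. The tests/ directory, the connection variables and the convention that a test case SELECTs offending rows (empty result = PASS) are assumptions for illustration:

#!/bin/sh
# run_sql_tests.sh - execute every .sql test case in tests/ against MySQL.
# Assumed convention: each test case SELECTs the offending rows, so zero rows means PASS.
RESULTS=results.txt
: > "$RESULTS"

for tc in tests/*.sql
do
    rows=$(mysql -N -h "$DB_HOST" -u "$DB_USER" -p"$DB_PASS" "$DB_NAME" < "$tc" | wc -l)
    if [ "$rows" -eq 0 ]
    then
        echo "PASS  $tc" >> "$RESULTS"
    else
        echo "FAIL  $tc ($rows bad rows)" >> "$RESULTS"
    fi
done

cat "$RESULTS"
grep -q '^FAIL' "$RESULTS" && exit 1
exit 0

The non-zero exit code on failure is what lets Jenkins mark the build red in the pipeline shown next.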

Jenkins Dashboard

Jenkins – First job Creation

Adding Subsequent job(s)

ETL Test case Execution Job

Dependency Scheduling for Pipeline

GitHub – Add WebHook with Jenkins

Jenkins – GitHub for Automatic Triggering

Jenkins Scheduling using ‘Poll’ Feature
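Besides the GitHub webhook and SCM polling shown here, a Jenkins job can also be kicked off remotely from a script. A minimal sketch, assuming the job etl-test-job has "Trigger builds remotely" enabled with a token; the URL, job name, token and credentials are placeholders:

#!/bin/sh
# trigger_etl_tests.sh - trigger the ETL test job from a script (e.g. after a batch load).
JENKINS_URL="http://jenkins.example.com:8080"   # placeholder server URL
JOB="etl-test-job"                              # placeholder job name
TOKEN="my-build-token"                          # token configured on the job

# Depending on the Jenkins security setup, user credentials / an API token may also be required.
curl -fsS -X POST -u "$JENKINS_USER:$JENKINS_API_TOKEN" \
     "$JENKINS_URL/job/$JOB/build?token=$TOKEN"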

GitHub – Add Files to GitHub

Pipeline – For Sequence Execution

Batch Execution Start

Batch Execution Finish

Green = Success

Console Output for Jobs

Final Pipeline Output

Test Results document on Jenkins

Test Results on Linux

Build Execution History – useful for batch performance testing
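For the batch performance angle, a simple sketch of capturing per-job elapsed time so runs can be compared against the SLA or against the Jenkins build history; job names and the SLA value are illustrative:

#!/bin/sh
# batch_perf.sh - time each batch job and flag runs that exceed the SLA (in seconds).
SLA_SECONDS=3600

for job in load_staging.sh transform_dwh.sh refresh_marts.sh
do
    start=$(date +%s)
    ./"$job"
    end=$(date +%s)
    elapsed=$((end - start))
    status="OK"
    [ "$elapsed" -gt "$SLA_SECONDS" ] && status="SLA BREACH"
    echo "$job elapsed=${elapsed}s $status"
done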

DevOps Readiness


Prerequisites

• Thorough knowledge of the 'Line of Business' data flow
• Runbook availability covering predecessors & successors
• Identification of a suitable test approach based on project type
• Availability of test data that represents all needs
• Alternative analysis to avoid table-unusable-state issues
• Ensure table referential integrity is addressed
• Availability of the TDM team to refresh tables to the previous state in case of failures
• Batch performance SLA prediction
• Focus on culture, process and tools

Benefits


• Improved business confidence in quality
• A blueprint that helps companies gear up
• Possible faster iteration, quick feedback, great collaboration
• Low-priced automation possibilities
• Insights on various database object validations
• Early defect detection


Author Biography

Ajay Nalabhatla works as a Data Management QA Lead at Wells Fargo for a Line of Business. He has more than 11 years of experience in quality assurance for ETL, DWT & BI testing. Over the past decade he has been involved in various DWT projects for different banking & securities clients and delivered them successfully. He has also conducted due diligence for various DWT clients and suggested many improvements in both process and technical competencies.

Ajay holds a Bachelor of Engineering in Electronics & Communication from Anna University.

Srihari Gopisetty manages the Data Management and Digital Advisory teams for a Line of Business at Wells Fargo India Solutions. He has more than 17 years of experience leading teams in the BFSI domain and on Microsoft products. Prior to Wells Fargo, he worked with Microsoft and First Advantage.

Srihari holds a Bachelor of Engineering in Mechanical Engineering from Gulbarga University.



APPENDIX


Test Approach - Data Migration


Pre-Migration:
• Analyze the DB and identify the objects – tables / indexes / views
• Segregate a wave-wise plan – forklift, consolidation, static and dynamic tables / views
• Take a snapshot for post-migration comparison
• Prioritize / create batches & collect pre-run stats

Post-Migration:
• Compare all DB objects between legacy & new
• Validate data for all tables identified for migration between legacy & new (see the sketch after this section)
• Validate that transformation rules for new tables are as per specs
• Parallel load comparison for tables between legacy & new
• Execute batch performance testing & compare stats with legacy

Steady State:
• Steady-state validations for monthly loads
• Downstream apps support
• Validate that the purging process works as expected on the new system
• Performance monitoring for data loads
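A hedged sketch of the legacy-versus-new data validation step above, comparing row counts table by table; hosts, schema names and the table list are placeholders, and MySQL is assumed as in the demo:

#!/bin/sh
# migration_rowcount_check.sh - compare row counts between the legacy and migrated databases.
TABLES="dim_account dim_product fact_txn"   # tables identified for migration (illustrative)

for t in $TABLES
do
    old=$(mysql -N -h "$LEGACY_HOST" -u "$DB_USER" -p"$DB_PASS" -D legacy_dwh \
                -e "SELECT COUNT(*) FROM $t")
    new=$(mysql -N -h "$NEW_HOST" -u "$DB_USER" -p"$DB_PASS" -D new_dwh \
                -e "SELECT COUNT(*) FROM $t")
    if [ "$old" = "$new" ]
    then
        echo "PASS  $t  rows=$old"
    else
        echo "FAIL  $t  legacy=$old  new=$new"
    fi
done

Row counts are only a first gate; the same loop can be extended with column-level checksums for the deeper parallel-load comparison.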

Infrastructure Upgrade Test Strategy – ETL tool / scheduler tool upgrade

Job Types:
• Direct SQL processing
• ETL jobs
• Stored procedure jobs
• File watcher / legacy jobs using C/C++

Pre Migration

• Identify the various types of jobs
• Identify the priority jobs for phase-wise execution
• Build a run book for the phases with all dependencies
• Collect the batch statistics
• Take snapshots of relevant table data
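A minimal sketch of the snapshot step, assuming MySQL; the schema dwh and the table names are illustrative placeholders:

#!/bin/sh
# snapshot_tables.sh - keep a pre-migration copy of key tables for post-migration comparison.
STAMP=$(date +%Y%m%d)

for t in dim_account fact_txn
do
    mysql -h "$DB_HOST" -u "$DB_USER" -p"$DB_PASS" -D dwh \
          -e "CREATE TABLE snapshot_${t}_${STAMP} AS SELECT * FROM ${t};"
done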

Post Migration

• Execute the identified jobs on the upgraded system
• Validate the batch / job performance against pre-upgrade stats
• Regression testing for tables
• Parallel load processing & data validation
• Validate job dependencies and predecessors

Steady State

• Batch performance monitoring
• Analyse the failures to understand if they are upgrade related
• Steady-state support for downstream applications


Thank You!!!
