Ron Forino
DAMA - Washington, DC September 1999
Project Driven Data Quality Improvement
Confidential DMR Consulting Group Inc.
Examples
According to DM Review, one European company discovered through an audit that it was not invoicing 4% of its orders. With $2 billion in revenues, that meant $80 million went unpaid.
Electronic data audits show that invalid data values in the typical customer database average around 15 - 20%. Physical audits suggest that this number may be closer to 25 - 30%.
In 1992, 96,000 IRS tax refund checks were returned “undeliverable” due to incorrect addresses.
This year, incorrect price data in retail databases will cost American consumers as much as $2.5 billion in overcharges.
According to organizations like the Data Warehousing Institute, the Gartner Group and META Group, Data Quality is one of the top three success factors for Data Warehousing.
The average mid-sized company may have 30,000 - 50,000 fields in files, tables, screens, reports, etc. [Platinum Technology]
Agenda
Definitions
What is Data Quality?
Tactics and the End Game
Building Blocks to Data Quality
– Tactical Initiatives
– Strategic Initiatives
Tactical Data Quality
– Rule Disclosure
– Data Quality Measurement, Analysis and Certification
– Meta Data Creation
– Validation
– Quality Improvement
Definitions
Data Transformation - Changing data values to a format consistent with integrity and business rules agreed to by data stakeholders.
Data Cleansing - Consolidation of redundant customer records. Term used to describe the process of “merging and purging” of customer lists in an effort to reduce duplicate or inaccurate customer records.
Data Quality Improvement - The process of improving data quality to the level desired to support the enterprise information demand.
Data Quality - definition to follow….
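As a rough illustration of the “merging and purging” described under Data Cleansing, the sketch below standardizes a few customer fields and collapses records that share the same standardized key. The record layout, the `standardize` helper and the choice of matching key (name plus postal code) are assumptions made for illustration only; real matching rules would be agreed with the data stakeholders.

```python
def standardize(value):
    """Collapse whitespace and upper-case a value so trivial variants compare equal."""
    return " ".join(str(value).split()).upper()


def merge_purge(records, key_fields=("name", "postal_code")):
    """Keep the first record seen for each standardized key.

    Returns (kept, purged). The key fields are a hypothetical matching rule.
    """
    seen = {}
    purged = []
    for record in records:
        key = tuple(standardize(record.get(f, "")) for f in key_fields)
        if key in seen:
            purged.append(record)  # duplicate of a record already kept
        else:
            seen[key] = record
    return list(seen.values()), purged


customers = [
    {"name": "Acme  Corp", "postal_code": "20001"},
    {"name": "ACME CORP", "postal_code": "20001"},
    {"name": "Widget Inc", "postal_code": "22030"},
]
kept, purged = merge_purge(customers)
print(len(kept), "kept,", len(purged), "purged")
```

Keeping the first record is the simplest possible survivorship rule; a real merge step would usually combine the best values from each duplicate.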
Data Quality Improvement Decision Tree
[Diagram labels: Data Quality Improvement; Data Cleansing; Transform; Data Reengineering; Match & Dedupe; Process Reengineering; Standardize, Validate, Match, Dedupe, Integrate, Enrich; Conform to Business Rule. Legend: Task, Process.]
Tactics and The End Game
“We need better data quality...”
Enterprise Initiative
Select Project
Data Quality Assessment
Report & Recommendations
Source System Clean-up Initiative
Data Warehouse
Data Quality Assessment
Report
Staging Specifications
Source System Clean-up Initiative
What is [Good] Data Quality?
How Can We Know Good Data Quality?
Column 1
321453
212392
093255
214421
. . .
Is this Good Data Quality?
What can we conclude?
What is Data Quality?
Information Quality = f(Definition + Data + Presentation)
Definition
– Defines Data
– Domain Value Specification
– Business Rules that Govern the Data
– Information Architecture Quality
Data Content
– Completeness
– Validity/Reasonability
Data Presentation
– Accessible
– Timely
– Non-ambiguous
Common Data Quality Problems
Data Content
– Missing Data
– Invalid Data
– Data Outside Legal Domain
– Illogical Combinations of Data
Structural
– Record Key Integrity
– Referential Integrity
– Cardinality Integrity
Migration/Integration
– Rationalization Anomalies
– Duplicate or Lost Entities
Definitions and Standards
– Ambiguous Business Rules
– Multiple Formats for Same Data Elements
– Different Meanings for the Same Code Value
– Multiple Code Values with the Same Meaning
– Field Used for Unintended Data
– Data in Filler
– Y2K Violation
Building Blocks to Data Quality
Building Blocks of a Data Quality Program
[Diagram: building blocks grouped into Tactical and Strategic tiers, plotted against Benefits Realization. Block labels: Rule Disclosure; Measure; Analyze & Certify; Meta Data Creation; Validation; Quality Improvement; Data Stewardship; DQ Requirements; Enterprise Cultural Shift; QC/Process Auditing; Defect Prevention; Quality Reengineering]
Tactical Data Quality
Steps to Tactical Data Quality
Rule Disclosure
Measure Quality
Analyze & Certify
Meta Data Creation
Validation
Quality Improvement
Rule Disclosure
Sources of Meta Data
Legacy Meta Data
– Data Models, Process Models
– Data Dictionary, Definitions, Aliases
– Glossary of Terms
– Database Directory
– Domain Values, Range of Values
– Run Books
Transformation Meta Data
– Data Mapping
– Transformation Rules
– Error Handling Rules
– Derived Data Calculations
– Audit Statistics
Access Meta Data
– Data Directory
– Data Definitions
– Source & Transformation
The Subject Matter Expert
Acquiring Good Meta Data is Essential
Meta Data can be gathered before, during or after the Assessment
[Diagram: three alternative sequences; box labels listed in the order extracted]
Collect Documentation; Report Findings; Validate the Meta Data; Assess the Data
Collect Documentation; Validate Findings; Assess the Data; Report Findings
Preferred: Collect Valid Meta Data; Report Findings; Assess the Data
“You can pay me now, or you can pay me later…”
Measuring Data Quality
Techniques, Tools, Methods
How can Data Quality be Measured?
Customer Complaints
User Interviews & Feedback
Customer Satisfaction Survey
Data Quality Requirements Gathering
Data Quality Assessments
“One accurate measurement is worth a thousand expert opinions” [Grace Hopper, Admiral, US Navy]
Measuring Data Quality - Tools
Analysis Tools
Specifically designed assessment tools
– Quality Manager, Migration Architect
– N & A: Trillium, Group-1, ID Centric, Finalist, etc.
Improvisations
– SAS, Focus, SQL, other query tools
Other Necessary Tools
File Transfer
Data Conversion
Assessment Measurements
Business Rule Integrity: Requiring Meta Data
Field Integrity: Intuitive Integrity Rules
Level 1: Completeness
– Nulls or Blanks
– Misuse (or overuse) of Default Values
Level 2: Validity
– Data Integrity Anomalies
– Invalid Data based on Business Rule
Level 3: Structural Integrity
– Primary Key Uniqueness
– Key Structure (Cardinality, Referential Integrity, Alternate Keys)
Level 4: Business Rule Violations
– Relationship between two or more fields
– Calculations
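The four assessment levels can be sketched as one pass over a set of records. This is a minimal illustration, not the assessment tooling used on the project: the `assess` function, the record layout, the `status` domain and the `qty * price == total` rule are all hypothetical.

```python
def assess(rows, key_field, required, domains, rules):
    """Count problems per assessment level for a list of dict records."""
    counts = {"completeness": 0, "validity": 0, "structural": 0, "business_rule": 0}
    seen_keys = set()
    for row in rows:
        # Level 1: completeness - required fields that are null or blank.
        for field in required:
            value = row.get(field)
            if value is None or str(value).strip() == "":
                counts["completeness"] += 1
        # Level 2: validity - populated values outside the documented domain.
        for field, allowed in domains.items():
            value = row.get(field)
            if value not in (None, "") and value not in allowed:
                counts["validity"] += 1
        # Level 3: structural integrity - the primary key must be unique.
        key = row.get(key_field)
        if key in seen_keys:
            counts["structural"] += 1
        seen_keys.add(key)
        # Level 4: business rules - relationships between two or more fields.
        counts["business_rule"] += sum(1 for rule in rules if not rule(row))
    return counts


rows = [
    {"id": 1, "status": "A", "qty": 2, "price": 5.0, "total": 10.0},
    {"id": 2, "status": "", "qty": 1, "price": 3.0, "total": 3.0},
    {"id": 2, "status": "Z", "qty": 4, "price": 2.0, "total": 9.0},
]
result = assess(
    rows,
    key_field="id",
    required=("status",),
    domains={"status": {"A", "I"}},
    rules=[lambda r: r["qty"] * r["price"] == r["total"]],
)
print(result)
```

Levels 1 and 2 need little more than the data itself, while levels 3 and 4 depend on knowing the keys and the business rules, which is why good Meta Data matters before the assessment starts.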
Analyze and Certify
Identifying Problems
Sizing up Problems
“To Certify or Not to Certify…”
Report Card
Template - field level
Data Quality Report
Value Frequency Percent 88 Info Analysis
• Value - the domain occurrence
• Frequency - the number of occurrences within the data set
• Percent - the % of the whole set
• 88 Info - the copybook definition for the value
• Analysis - comments about our findings
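The first four template columns can be produced mechanically from a field's values; the Analysis column is left for the analyst. The sketch below is one possible way to do it in Python (the project itself used SAS programs); the `frequency_report` function and the sample values are hypothetical.

```python
from collections import Counter


def frequency_report(values, definitions=None):
    """Build Value / Frequency / Percent / 88 Info rows for one field.

    `definitions` maps a value to its copybook (88-level) description.
    Null or whitespace-only values are reported together as 'BLANK'.
    """
    definitions = definitions or {}
    counts = Counter(
        "BLANK" if v is None or str(v).strip() == "" else v for v in values
    )
    total = sum(counts.values())
    if total == 0:
        return []
    return [
        {
            "value": value,
            "frequency": freq,
            "percent": round(100 * freq / total, 1),
            "info_88": definitions.get(value, ""),
        }
        for value, freq in counts.most_common()
    ]


sample = ["BBUY"] * 5 + ["MUNI"] * 3 + ["", " "]
report = frequency_report(sample, {"BBUY": "Best Buy", "MUNI": "Municipal Bond"})
for row in report:
    print(row)
```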
Identifying Problems
Data Quality Report
Value | Frequency | Percent | 88 Info | Analysis
BLANK | 19 | 11.9 | | Is this a required field? If yes, what is the value definition (88 Info) for 'BLANK'?
BBUY | 59 | 36.9 | Best Buy |
ID216 | 53 | 33.1 | | What are the value definitions (88 Info) for all non-blank values?
MUNI | 23 | 14.4 | Municipal Bond |
MLCMO | 2 | 1.3 | CMO Account |
MLMTN | 2 | 1.3 | Maintenance Account |
STANG | 2 | 1.3 | | What are the value definitions (88 Info) for all non-blank values?
Total | 160 | 100 | |
Analysis (and Discovery)
1. Is the field required? If so, blanks indicate an anomaly.
2. Are the values “ID216” and “STANG” allowed? (Is this a problem with the data or the Meta Data?)
3. Some values occur in only 1.3% of the records. Is this telling us there is a problem?
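The three analysis questions can be turned into a mechanical first pass over the frequency counts, leaving the judgement to the analyst and the SME. In the sketch below the counts are taken from the report on this slide; the `review_field` function, the assumption that only values carrying an 88 Info description are "documented", and the 2% rarity threshold are hypothetical.

```python
def review_field(counts, documented, required=True, rare_pct=2.0):
    """Flag blanks, undocumented values and rare values for analyst review."""
    total = sum(counts.values())
    findings = []
    for value, freq in counts.items():
        pct = 100 * freq / total
        if value == "BLANK":
            if required:
                findings.append((value, "blank in a required field"))
        elif value not in documented:
            findings.append((value, "no documented definition"))
        elif pct < rare_pct:
            findings.append((value, "rare value"))
    return findings


counts = {"BLANK": 19, "BBUY": 59, "ID216": 53, "MUNI": 23,
          "MLCMO": 2, "MLMTN": 2, "STANG": 2}
documented = {"BBUY", "MUNI", "MLCMO", "MLMTN"}  # values that have 88 Info
findings = review_field(counts, documented)
for value, reason in findings:
    print(value, "-", reason)
```

A flagged value is a question, not a verdict: it may point to a problem in the data or to a gap in the Meta Data.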
Data Quality Scoring
Scoring Key
Priority (Criticality/Sensitivity) | High | Medium | Low
No Problem Encountered | A | A | A
Less than .1 % had problems | B+ | B+ | B+
Less than .5 % had problems | B | B+ | B+
Less than 1 % had problems | C | B | B+
Less than 2 % had problems | F | C | B
Less than 5 % had problems | F | C | C
Less than 10 % had problems | F | F | C
Less than 50 % had problems | F | F | F
Field is not populated in 100% of the rows | X | X | X
Field cannot be scored because there is not a proper definition or domain description, and requires a SME's consultation | * | * | *
Legend
Grade | Remarks | Action
A Excellent | Pre-Certified. No problems encountered | If findings agree with documented business rules, CERTIFY, otherwise review findings with a SME.
B Good | Problems of small magnitude | Meet with SME to review metadata and report
C Poor | Either has data anomaly or a business definition is inaccurate | Meet with SME to review metadata and report
F Failure | Requires Serious Attention or is an unreliable field | Meet with SME to review metadata and report
X Not Populated | absent in 100% of rows | Verify if there are plans to use the field
* for SME | Not enough meta data to score | SME Review
Prior to SME Review
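The Scoring Key lends itself to a small lookup function. The sketch below transcribes the thresholds and grades as read from the slide; the `score_field` function, its parameter names, the precedence of "X" over "*", and the grade for 50% or more (which the key does not cover) are assumptions.

```python
# (upper bound on % of records with problems, grade by criticality),
# transcribed from the Scoring Key on the slide.
SCORING_KEY = [
    (0.1, {"High": "B+", "Medium": "B+", "Low": "B+"}),
    (0.5, {"High": "B", "Medium": "B+", "Low": "B+"}),
    (1.0, {"High": "C", "Medium": "B", "Low": "B+"}),
    (2.0, {"High": "F", "Medium": "C", "Low": "B"}),
    (5.0, {"High": "F", "Medium": "C", "Low": "C"}),
    (10.0, {"High": "F", "Medium": "F", "Low": "C"}),
    (50.0, {"High": "F", "Medium": "F", "Low": "F"}),
]


def score_field(pct_problems, criticality, populated=True, has_definition=True):
    """Return the grade for a field prior to SME review."""
    if not populated:
        return "X"  # field is not populated in 100% of the rows
    if not has_definition:
        return "*"  # not enough meta data to score; needs a SME
    if pct_problems == 0:
        return "A"  # no problem encountered
    for upper_bound, grades in SCORING_KEY:
        if pct_problems < upper_bound:
            return grades[criticality]
    return "F"  # assumption: 50% or more is treated as a failure


print(score_field(0.3, "High"))    # a high-criticality field with 0.3% problems
print(score_field(0.3, "Medium"))  # the same defect rate on a medium-criticality field
```

The same defect rate earns a lower grade as criticality rises, so the most sensitive fields get attention first.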
Example: Poor Data Quality
File: Customer Master
Field: Customer Code
Criticality: High
Test File Date: 1/19/99
SAS Program(s):
Data Analysis\DQ Programs\cm\Cust Code BlankAN.txt
Data Analysis\DQ Programs\cm\CustCodeLookup3.txt
Report Validation
Definition: Customer code (A/E) identifier.
New Business Rules/Domain: Customer codes are unique within regional office.
Data Cleansing Notes:
Transformation/Edit Recommendations:
Issue Log #:
Comments:
Scores
 | Scores | % w/problems | Comment
Overall Certification | F | 0% |
Completeness | A | 0% | Every Customer Code field contains data.
Validity | F | >2% | 97248 customer codes not found on the validation lookup file.
Structural Integrity | N/A | N/A |
Business Rules | N/A | N/A |
Completeness Report
# Blank Fields | # Low-Value Fields | % Blank/Low-Values | # Populated Fields | Total Records | Analysis
0 | 0 | 0% | 4794726 | 4794726 | All fields contain a data value. No blank fields.
Validity Report
Code Values Not Found (SAMPLE)
Code | Frequency
C2 | 1
71 | 1
.AA | 3
.AB | 1
.AC | 2
# Codes Not Found: 97248
Total Records: 4794726
Analysis:
> 97248 codes could not be found on the validation lookup field T.cu.ccode (a copy of production file PVSAM.CICS.custcc 02/02/99).
> A significant number of data values are not valid cust codes.
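The validity figures in this example come from looking each code up in a validation file and counting the misses. The report above was produced with SAS programs; the Python sketch below only illustrates the idea, and the `validity_report` function and sample codes are hypothetical.

```python
from collections import Counter


def validity_report(codes, valid_codes, sample_size=5):
    """Count codes that are missing from the validation lookup set."""
    not_found = Counter(code for code in codes if code not in valid_codes)
    total = len(codes)
    n_bad = sum(not_found.values())
    return {
        "codes_not_found": n_bad,
        "total_records": total,
        "pct_not_found": 100 * n_bad / total if total else 0.0,
        "sample": not_found.most_common(sample_size),
    }


report = validity_report(["A1", "A2", "C2", ".AA", ".AA", "A1"], {"A1", "A2"})
print(report)

# The slide's own figures: 97248 codes not found out of 4794726 records.
slide_pct = 100 * 97248 / 4794726
print(round(slide_pct, 2))
```

On the slide's figures the miss rate is a little over 2%, which is why the Validity line reports ">2%" and, for a high-criticality field, scores an F.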
Field Analysis
In a range of values, in the absence of domain rules, investigate the first and last 0.2%
[Figure: Bell curve distribution]
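Pulling out the two tails of a sorted field is straightforward. The sketch below expresses the 0.2% cut-off as 2 per thousand so the tail size is computed with integer arithmetic; the `tails_to_investigate` function and the choice to round the tail size up (so there is always at least one value to look at) are assumptions.

```python
def tails_to_investigate(values, per_thousand=2):
    """Return the lowest and highest `per_thousand` per 1000 of the sorted values."""
    ordered = sorted(values)
    # Ceiling division, so small data sets still yield at least one value.
    n = (len(ordered) * per_thousand + 999) // 1000
    n = min(n, len(ordered))
    if n == 0:
        return [], []
    return ordered[:n], ordered[-n:]


low, high = tails_to_investigate(range(1, 1001))
print(low, high)
```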
Management Reporting - Short Engagement
TABLE OF CONTENTS
EXECUTIVE SUMMARY
– LEVEL 1: COMPLETENESS AND VALIDITY
– LEVEL 2: STRUCTURAL INTEGRITY
– LEVEL 3: BUSINESS RULES
– RECOMMENDATIONS
BACKGROUND
– THE CUSTOMER MASTER FILE
– DATA QUALITY ANALYSIS METHODOLOGY
CUSTOMER MASTER FILE SCORE CARD
– OVERALL ASSESSMENT
– LEVEL 1 ANALYSIS: Completeness (Fair); Validity (Very Good)
– LEVEL 2 ANALYSIS: Primary Keys (Very Good); Referential Integrity (N/A)
– LEVEL 3 ANALYSIS: Business Rules and Calculations (Excellent*)
RECOMMENDATIONS
– ESTABLISH THE OFFICE OF CORPORATE DATA QUALITY
– IMPROVEMENT RECOMMENDATIONS
– MANAGE AND DISTRIBUTE THE CORPORATION’S METADATA
– INITIATIVES FOR DATA IMPROVEMENT
– PROPAGATE THE DATA QUALITY ASSESSMENT PROCESS
– SAFEGUARD DATA WAREHOUSE USERS FROM DEFECTIVE DATA
– FACILITATE BEST IN TESTING, QUALITY ASSURANCE AND DATA QUALITY
DETAILED ANALYSIS
Management Reporting - Status
[Bar chart: Field Analysis for Customer and Product. Categories: No Problem; Data Quality Anomalies; Undergoing Validation. Bar values as extracted: 58% and 59%; 31% and 27%; 11% and 14%.]
Management Reporting - Anomalies
Statistic | # | % of Total | % of Anomalies
Fields Completed | 467 | - | -
OPEN Fields | 271 | 58.0% | -
Data Quality Anomalies | 143 | 30.6% |
Completeness | 106 | 22.7% | 74.1%
Validity | 36 | 7.7% | 25.2%
Structural Integrity | 0 | 0.0% | 0.0%
Business Rule | 1 | 0.2% | 0.7%
Pre-certified | 53 | 11.3% | -
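The two percentage columns use different denominators: "% of Total" divides by the 467 fields completed, while "% of Anomalies" divides by the 143 fields with anomalies. The short sketch below reproduces the slide's figures from its counts; the variable and function names are illustrative.

```python
def pct(part, whole):
    """Percentage rounded to one decimal place, as shown on the slide."""
    return round(100 * part / whole, 1)


fields_completed = 467
open_fields = 271
pre_certified = 53
anomalies = {
    "Completeness": 106,
    "Validity": 36,
    "Structural Integrity": 0,
    "Business Rule": 1,
}
total_anomalies = sum(anomalies.values())

print("OPEN Fields", pct(open_fields, fields_completed))
print("Data Quality Anomalies", pct(total_anomalies, fields_completed))
for name, count in anomalies.items():
    print(name, pct(count, fields_completed), pct(count, total_anomalies))
print("Pre-certified", pct(pre_certified, fields_completed))
```

The three status groups (open, anomalies, pre-certified) add up to the 467 fields completed, so the "% of Total" figures for those groups account for every field.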
[Bar chart: Data Quality Anomalies - %. Completeness 74%; Validity 25%; Structural Integrity 0%; Business Rule 1%]
Management Reporting - Productivity
Security
Week ending | 26-Feb | 5-Mar | 12-Mar | 19-Mar | 26-Mar | 2-Apr | 9-Apr | 16-Apr | 23-Apr | 30-Apr | 1-May | 2-May | 3-May
Statistic (week) | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13
Data Analysis Metrics
Field Count: | 997 | 997 | 997 | 997 | 997 | 997 | 997 | 997 | 997 | 997
Fields Eliminated: | 303 | 361 | 361 | 205 | 205 | 205 | 205 | 297 | 297 | 297
Adjusted Field Count: | 694 | 636 | 636 | 792 | 792 | 792 | 792 | 700 | 700 | 700
Work Completed
Tables In Progress: | 2 | 5 | 2 | 1 | 4 | 2 | 1 | 3 | 2 | 0
Tables Completed: | 1 | 1 | 6 | 8 | 9 | 11 | 15 | 16 | 17 | 20
% | 2% | 3% | 15% | 21% | 23% | 28% | 38% | 41% | 44% | 53%
Fields in Progress: | 104 | 120 | 113 | 20 | 123 | 93 | 68 | 49 | 29 | 0
Fields Completed: | 27 | 27 | 82 | 185 | 227 | 257 | 335 | 403 | 423 | 467
Fields Completed (week): | | 0 | 55 | 103 | 42 | 30 | 78 | 68 | 20 | 44
% | 4% | 4% | 13% | 23% | 29% | 32% | 42% | 58% | 60% | 67%
AVG | 27 | 14 | 27 | 46 | 45 | 43 | 48 | 50 | 47 | 47
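Several rows of the productivity table are derived from the others, and the derivations can be checked directly against the slide's numbers. In the sketch below the input lists are copied from the table; the variable names, and the reading of AVG as cumulative fields completed divided by the week number, are inferences that happen to reproduce the published figures.

```python
field_count = 997
fields_eliminated = [303, 361, 361, 205, 205, 205, 205, 297, 297, 297]
fields_completed = [27, 27, 82, 185, 227, 257, 335, 403, 423, 467]

# Adjusted Field Count = Field Count - Fields Eliminated
adjusted = [field_count - e for e in fields_eliminated]

# % complete = cumulative Fields Completed / Adjusted Field Count
pct_complete = [round(100 * done / adj) for done, adj in zip(fields_completed, adjusted)]

# Fields Completed (week) = week-over-week change, starting from week 2
weekly = [b - a for a, b in zip(fields_completed, fields_completed[1:])]

# AVG = cumulative Fields Completed / number of weeks elapsed
running_avg = [round(done / week) for week, done in enumerate(fields_completed, start=1)]

print(adjusted)
print(pct_complete)
print(weekly)
print(running_avg)
```

Because the number of eliminated fields changes from week to week, the percentage complete can move even when no new fields are finished.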
<<<<< PROM only >>>>>
[Chart: PROM Fields Completed, weeks 1 to 10: 27, 27, 82, 185, 227, 257, 335, 403, 423, 467]
[Chart: Prom Progress Report - by Week, weeks 1 to 10: 4%, 4%, 13%, 23%, 29%, 32%, 42%, 58%, 60%, 67%]
[Chart: Weekly Productivity. Series: Fields Completed (week); AVG]
Meta Data Creation
Example: Data Quality Repository
[Screenshot callouts: Newly Discovered Rules]
Meta Data Supply Chain
[Diagram labels: Work Groups; Field Name; Data Inventory Meta Data; Meta Data; Knowledge Management; Transformation & Edit Recommendations; Data Quality Statistical Reports; DQ Assessment; Data Quality & Definition Validation; Data Cleansing Update; SME Validation; Definition & Domain; Meta Data Gathering; Data Requirements]
Results Validation
Report Validation
File: PONACUST
Field: oc801ind
Criticality: Medium
Test File Date: T.TSOSZUC.EDA (Jan.19,1999)
SAS Program(s): T.EDA.SAS.INPUT(CdeIndPO)
J:\EDA\DataAnalysis\DQPrograms\Ponacust\CdeIndPO.txt
RSLTS: J:\EDA\DataAnalysis\DQAssessmentReports\Ponacust\CdeIndPO.xls
Report Validation
Business Name:
Business Rules:
Definition:
Data Cleansing Notes:
Transformation/Edit Recommendations:
Issue Log #:
Comments:
Scores
 | Scores | % w/problems | Comment
Overall Certification | A | 0.00 |
Completeness | N/A | N/A |
Validity | N/A | N/A |
Structural Integrity | N/A | N/A |
Business Rules | N/A | N/A |
Frequency Report
Value | Frequency | Percent | 88 info | Analysis
0 | 3684278 | 76.8 | |
1 | 1110449 | 23.2 | |
Total= | 4794727 | 100 | |
SME validation… an opportunity to improve Meta Data
1. Supply a clear name for the field.
2. Is there a good definition?
3. Make the business rules public?
4. Will the SME initiate a data cleansing initiative?
5. Does the SME recommend edit or data transformation rules?
6. Are the findings consistent with the SME's expectations?
Report Sections
– Identification
– Field Definition & Rules
– Statistical Reports & Analysis
– Score & Explanation
Quality Improvement
Next Steps
[Diagram labels: Information Management Objectives; Metadata, Models, Reports, etc.; Legacy Data Extractions; Perform Baseline Assessment; (Discovered Business Rules); Management Report & Recommendations; Steering Committee; Initiatives; Data Clean-up; Legacy System Enhancements & Re-engineering; Data Migration Transformation & Cleansing Specifications; Continued Monitoring; Monthly Reports]
Lessons Learned - Data Cleanup
[Chart: Completeness versus Accuracy, each axis running to 100%, with a cost marker ($$). Annotations: (More complete, more error prone); (More accurate, less data); (Most complete, most accurate, most costly, most timely)]
Summary
We made the distinction between:
- Data Migration
- Data Quality
- Data Cleansing
We defined what “good” data quality is.
We discussed that there could be 10 or more processes that could take place in building a comprehensive data quality program for the enterprise.
- Tactical should precede the Strategic [or be the 1st step of ]
There are 6 steps to an effective tactical data quality initiative:
- Rule Disclosure
- Quality Measurement
- Analyze and Certify
- Meta Data Creation
- Validation
- Quality Improvement
Reference Material
The Deming Management Method (Total Quality Management), Mary Walton
Data Quality for the Information Age, Tom Redman
The Data Warehouse Challenge: Taming Data Chaos, Michael Brackett
Improving Data Warehouse and Business Information Quality, Larry English
DM Review Magazine, Information Quality series by Larry English