Improving Data Quality: Why is it so difficult?
presented by
Larissa T. Moss
President, Method Focus, Inc.
DAMA
Oakland, CA
May 7, 2003
Copyright 2003, Larissa T. Moss, Method Focus, Inc.
Ms. Moss is founder and president of Method Focus Inc., a company specializing in improving the quality of business information systems. She frequently speaks at Data Warehouse, Business Intelligence, CRM, and Information Quality conferences around the world on the topics of information asset management, data quality, data modeling, project management, and organizational realignment. She lectures worldwide on the BI topics of spiral development methodology, data modeling, data audit and control, project management, and organizational issues. Her articles are frequently published in DM Review, TDWI Journal of Data Warehousing, Cutter IT Journal, Analytic Edge, and The Navigator. She co-authored the books Data Warehouse Project Management (Addison Wesley, 2000), Impossible Data Warehouse Situations (Addison Wesley, 2002), and Business Intelligence Roadmap: The Complete Project Lifecycle for Decision Support Applications (Addison Wesley, 2003). Ms. Moss is a member of the IBM Gold Group, a Friend of Teradata, a senior consultant at the Cutter Consortium, and a contributing member of Ask The Experts on www.dmreview.com. She has been a lecturer at DCI, TDWI, MISTI, and at the Extension of the California Polytechnic University, Pomona. She can be reached at lmoss@methodfocus.com.
Method Focus Inc. www.methodfocus.com (626) 355-8167
Larissa T. Moss
Presentation Outline
• What do we mean by data quality?
Dirty data categories
• How are we addressing it today?
Ineffective technology solutions
• What do we have to change?
Approaches and techniques
• How do we change?
12 steps to [DQ] recovery
What do we mean by data quality?
• Data is correct
• Data is accurate
• Data is consistent
• Data is complete
• Data is integrated
• Data values follow the business rules
• Data corresponds to established domains
• Data is well defined and understood
#1
Symptoms of poor-quality data
• Do your programs abend with data exceptions?
• Are your users confused about the meaning of data?
• Is some of your data too stale for reporting?
• Is your data being shared? Is it sharable?
• Are reports inconsistent?
• Does it take your IT staff or the end users hours to reconcile inconsistent reports?
• Does merging data often cause the system to fail?
• Do beepers go off at night?
Dirty data categories
• Dummy (default) values
• “Intelligent” dummy values
• Missing values
• Multi-purpose fields
• Cryptic values
• Free-form address lines
• Contradicting values
• Violation of business rules
• Reused primary key
• Non-unique primary key
• Missing data relationships
• Inappropriate data relationships
Dummy (default) values
• Defaults for mandatory fields
SSN 999-99-9999
Age 999
Zip 99999
Income 9,999,999.99
Inability to determine customer profiles
Inability to determine customer demographics
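A minimal sketch of how such defaults might be flagged during source data analysis. The field names and the `find_dummy_fields` helper are hypothetical; only the dummy values themselves come from the slide, and a real list would have to be confirmed with the data owners.

```python
# Dummy defaults from the slide; the field names are assumptions for illustration.
DUMMY_VALUES = {
    "ssn": {"999-99-9999"},
    "age": {999},
    "zip": {"99999"},
    "income": {9_999_999.99},
}

def find_dummy_fields(record):
    """Return the sorted names of fields whose value is a known dummy default."""
    return sorted(
        field
        for field, dummies in DUMMY_VALUES.items()
        if record.get(field) in dummies
    )

# Hypothetical sample records.
customers = [
    {"id": 1, "ssn": "123-45-6789", "age": 42, "zip": "94607", "income": 58_000.00},
    {"id": 2, "ssn": "999-99-9999", "age": 999, "zip": "94607", "income": 9_999_999.99},
]

for customer in customers:
    print(customer["id"], find_dummy_fields(customer))
```

Counting the flagged fields per source file gives a first rough measure of how much profile and demographic data is actually unusable.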
“Intelligent” dummy values
• Defaults with meaning
SSN 888-88-8888: Non-resident alien
Income 999,999.99: Employee
Age 000: Corporate customer
Source Code ‘FF’: Account closed prior to 1991
Inability to write straightforward queries without knowing how to filter data
Missing Values
• Operational systems do not always require informational or demographic data
Gender
Ethnicity
Age
Income
Referring Source
Inability to analyze marketing channels
Multi-purpose fields
• ONE field explicitly has MANY meanings
» Which business unit enters the data
» At what time in history it was entered
» A value in one or more other fields
Appraisal Amount redefined as
Advertised Amount redefined as
Sold Date
Loan Type Code redefined as ...
25 redefines = 25 attributes!
Not mutually exclusive!
Only the value of one is known for each record!
Inability to judge product profitability
Cryptic values (1)
• Often found in “Kitchen Sink” fields
» Usually one byte (if not one bit)
» Highly cryptic (A, B, C, 1, 2, 3, ...)
» Non-intelligent, non-intuitive codes
» Often not mutually exclusive
Inability to empower end users to write their own queries
Cryptic values (2)
• ONE field implicitly has MANY meanings
Master_Cd {A, B, C, D, E, F, G, H, I}
{A, B, C}: Type of customer
{D, E, F}: Type of supplier
{G, H, I}: Regional constraints
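A small sketch of how a cryptic field like this could be separated into explicit attributes. It assumes the three code groups map, in the order shown, to type of customer, type of supplier, and regional constraints; the attribute names and the `split_master_cd` helper are hypothetical.

```python
# Assumed grouping, read off the slide in order; real code sets
# would have to be confirmed with the data owners.
MASTER_CD_MEANINGS = {
    "type_of_customer": {"A", "B", "C"},
    "type_of_supplier": {"D", "E", "F"},
    "regional_constraint": {"G", "H", "I"},
}

def split_master_cd(code):
    """Map one Master_Cd value to the single attribute it actually describes."""
    for attribute, codes in MASTER_CD_MEANINGS.items():
        if code in codes:
            return {attribute: code}
    raise ValueError(f"unknown Master_Cd value: {code!r}")

print(split_master_cd("E"))
```

Once the implicit meanings are split into separate, named attributes, end users no longer need to know the code ranges to write a query.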
Free-form address lines
• Unstructured text
» no discernible pattern
» cannot be parsed
address-line-1: ROSENTHAL, LEVITZ, A
address-line-2: TTORNEYS
address-line-3: 10 MARKET, SAN FRANC
address-line-4: ISCO, CA 95111
Inability to perform market analysis
Contradicting values
• Values in one field are inconsistent with values in another related field
1488 Flatbush Avenue, New York, NY 75261 (Texas Zip)
Type of real property: Single Family Residence
Number of rental units: four (Income property)
Inability to make reliable business decisions
Violation of business rules
• Business Rule: Adjustable Rate Mortgages must have
» Maximum Interest Rate (Ceiling)
» Minimum Interest Rate (Floor)
• Business Rule: A Ceiling is always higher than a Floor
ceiling-interest-rate: 8.25
floor-interest-rate: 14.75
switched?
Inability to calculate product profitability
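The two business rules above can be expressed as a simple record check. This is a sketch only; the dictionary keys and the `check_arm_rates` helper are hypothetical names, not part of any source system.

```python
def check_arm_rates(loan):
    """Return a list of rule violations for one adjustable rate mortgage record."""
    violations = []
    ceiling = loan.get("ceiling_interest_rate")
    floor = loan.get("floor_interest_rate")
    # Rule 1: an ARM must have both a ceiling and a floor.
    if ceiling is None:
        violations.append("missing ceiling")
    if floor is None:
        violations.append("missing floor")
    # Rule 2: the ceiling is always higher than the floor.
    if ceiling is not None and floor is not None and ceiling <= floor:
        violations.append("ceiling not higher than floor (switched?)")
    return violations

# The record from the slide.
print(check_arm_rates({"ceiling_interest_rate": 8.25, "floor_interest_rate": 14.75}))
```

The check only reports the violation. Whether the two values were switched or one of them is simply wrong is a question for the data owner.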
Reused primary keys
• Little history, if any, stored in operational files
» primary keys are customarily re-used
» may have a different rollup structure
January ‘94: branch 501 = San Francisco Main, region 1, area SW
August ‘97: branch 501 = San Luis Obispo, region 2, area SW
Inability to evaluate organizational performance
Non-unique primary keys
• Duplicate identification numbers
» Multiple customer numbers
Customer Name        Phone Number    Cust. Number
Philip K. Sherman    818.357.5166    960601
Philip K. Sherman    818.357.7711    960105
Philip K. Sherman    818.357.8911    960003
» Multiple employee numbers
                 Employee Name    Department    Empl. Number
July 1995:       Bob Smith        213 (HR)      21304762
January 1996:    Bob Smith        432 (SRV)     43218221
August 1999:     Bob Smith        206 (MKT)     20684762
Inability to determine customer relationships
Inability to analyze employee benefits trends
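One simple way to surface candidates for this problem is to group identification numbers by name. The sketch below assumes exact name matching after trimming and upper-casing, which is a deliberate simplification; real matching would need additional attributes to avoid false positives. The function name and the second sample customer are hypothetical.

```python
from collections import defaultdict

def find_multiple_numbers(rows):
    """Group (name, number) rows by name and keep names with more than one number."""
    numbers_by_name = defaultdict(set)
    for name, number in rows:
        numbers_by_name[name.strip().upper()].add(number)
    return {
        name: sorted(numbers)
        for name, numbers in numbers_by_name.items()
        if len(numbers) > 1
    }

customers = [
    ("Philip K. Sherman", "960601"),
    ("Philip K. Sherman", "960105"),
    ("Philip K. Sherman", "960003"),
    ("Pat Example", "970001"),  # hypothetical customer with a single number
]

print(find_multiple_numbers(customers))
```

The output is a list of suspects, not a list of confirmed duplicates: two different people can share a name, so each group still needs review.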
Missing data relationships
• Data that should be related to other data in a dependent (parent-child) relationship
» Branch number 0765 does not exist in the BRANCH table
(Diagram: Branch, Employee, and Benefit entities)
Inability to produce accurate rollups
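A missing parent-child relationship such as the branch number example can be found with an orphan check. This is a sketch with hypothetical table contents and a hypothetical `find_orphans` helper; only branch number 0765 comes from the slide.

```python
def find_orphans(child_rows, parent_keys, foreign_key):
    """Return child rows whose foreign key has no matching parent key."""
    return [row for row in child_rows if row[foreign_key] not in parent_keys]

# Hypothetical keys present in the BRANCH table.
branch_numbers = {"0501", "0622"}

# Hypothetical EMPLOYEE rows; the second one points at a branch that does not exist.
employees = [
    {"employee": "E1", "branch_number": "0501"},
    {"employee": "E2", "branch_number": "0765"},
]

orphans = find_orphans(employees, branch_numbers, "branch_number")
print(orphans)
```

Every orphan row is a record that will silently drop out of, or distort, a rollup by branch.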
Inappropriate data relationships
• Data that is inadvertently related, but should not be
» two entity types with the same key values
Purchaser: Jackie Schmidt 837221
Seller: Robert Black 837221
Inability to determine customer or vendor relationships
Impact of erroneous data
• Extra time it takes to correct data problems
• Extra resources needed to correct data problems
• Time and effort required to re-run jobs that abend
• Time wasted arguing over inconsistent reports
• Lost business opportunities due to unavailable data
• Unable to demonstrate business potential in a buyout
• Fines may be paid for noncompliance with government regulations
• Shipping products to the wrong customers
• Bad public relations with customers
– leads to alienated and lost customers
Cost of erroneous data
Direct Costs of Non-Quality Information (© Larry English, Improving DW and BI Quality)
Marketing Campaign | Per Instance | Number of Instances | Total Number Per Year | Total Cost Per Year
Time ($60/hour loaded rate):
Creating redundant occurrence | 2.4 min | 167,141 | 1 | $401,138
Researching correct address | 10 min | 5,000/mo | 12 | $600,000
Correcting address errors | 0.3 min | 6,000/mo | 12 | $21,600
Handling complaints from customers | 5.5 min | 974/yr | 1 | $5,357
Mail preparation | 0.1 min | 393,273 | 4 | $157,309
Materials, Facilities, Equipment:
Marketing brochure | $1.96 | 393,273 | 4 | $3,083,260
Postage | $0.52 | 393,273 | 4 | $818,008
Warehouse storage | $0.01 | 393,273 | 4 | $15,731
Shipping equipment and maintenance | $5,000/yr | 36% | 1 | $1,800
Computing resources:
CPU transactions | $0.02/trans | 393,273 | 4 | $31,462
Data storage | $0.001/mo | 393,273 | 12 | $4,719
Data backup | $0.005/mo | 393,273 | 12 | $23,596
Total Annual Costs | | | | $5,163,980
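The time-based rows of this table appear to follow one formula: minutes per instance × number of instances × times per year × the loaded rate ($60/hour, or $1 per minute). The sketch below recomputes those five rows under that assumption; the variable and function names are illustrative only.

```python
LOADED_RATE_PER_MINUTE = 60 / 60  # $60/hour loaded rate, expressed per minute

# (activity, minutes per instance, number of instances, times per year),
# copied from the "Time" rows of the table.
time_rows = [
    ("Creating redundant occurrence", 2.4, 167_141, 1),
    ("Researching correct address", 10, 5_000, 12),
    ("Correcting address errors", 0.3, 6_000, 12),
    ("Handling complaints from customers", 5.5, 974, 1),
    ("Mail preparation", 0.1, 393_273, 4),
]

def annual_cost(minutes, instances, times_per_year, rate=LOADED_RATE_PER_MINUTE):
    """Annual cost in whole dollars for one activity."""
    return round(minutes * instances * times_per_year * rate)

for activity, minutes, instances, times_per_year in time_rows:
    print(f"{activity}: ${annual_cost(minutes, instances, times_per_year):,}")

time_total = sum(annual_cost(m, n, f) for _, m, n, f in time_rows)
print(f"Time subtotal: ${time_total:,}")
```

The same pattern (unit cost × instances × frequency) reproduces the materials and computing rows, which makes the table easy to rebuild with an organization's own figures.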
Impact of redundant data
• Hardware (CPU, disks) and software (program maintenance) costs incurred as a result of uncontrolled redundant data
• Extra time it takes to reconcile inconsistencies
• Extra resources needed to reconcile inconsistencies
• Unwise business decisions made due to redundant and inconsistent data
• Lost opportunities due to unreliable data
• Overcharging or overpayment for products
• Duplicate shipping of products
• Money wasted on sending redundant marketing material
Cost of redundant data
Information Development Cost Analysis (© Larry English, Improving DW and BI Quality)
Category | Portfolio Total Number | Relative Weight Factor* | Average Unit Dev/Maint Costs | Total Dev/Maint Expenses** | Total Infrastructure, Value-adding, Cost-adding Expenses | % of Budget Expenses
Infrastructure Basis:
Enterprise architected DBs | 200 | 0.75 | $15,000 | $3,000,000 | |
Enterprise reusable create/update programs + | 300 | 1.50 | $30,000 | $9,000,000 | |
Total Infrastructure expenses | | | | | $12,000,000 | 24%
Value Basis:
Total retrieve equivalent pgms + | 300 | 1.00 | $20,000 | $6,000,000 | |
Total value-adding expenses | | | | | $6,000,000 | 12%
Cost-adding Basis:
Redundant create/update pgms | 500 | 1.50 | $30,000 | $15,000,000 | |
Interface/extract programs | 400 | 1.00 | $20,000 | $8,000,000 | |
Redundant database files | 600 | 0.75 | $15,000 | $9,000,000 | |
Total cost-adding expenses | 1,500 | | | | $32,000,000 | 64%
Lifetime Total ** | 3,800 | | | | $50,000,000 | 100%
* Determine relative effort to develop average unit of each category using effort to develop a retrieve program as “1.00”
+ For programs that retrieve some data and create/update other data, determine the percent of retrieve only attributes and percent of create/update attributes (e.g., to retrieve customer data to create an order)
** Based on 3,800 application programs and database files in portfolio and $50 Million in development
Dirty data – How did it happen?
(Organization chart: the Chief Executive Officer sits above a Chief Operating Officer on the Business side and a Chief Information Officer on the Technology side. Each Business Manager is paired with a Technology Manager.)
(Diagram: the Business Units Marketing, Financial (AP & AR), Product Pricing, Customer Support, Distribution, Inventory, and Sales are each the Client of their own IT unit among the Information Technology Units.)
• data redundancy
• process redundancy
• dirty data
Major cause for data deficiencies
Project Constraints, priority from highest to lowest:
1. TIME
2. SCOPE
3. BUDGET
4. PEOPLE
5. QUALITY
Wrong priority on project constraints!
Industrial Age:
• Cheaper, faster, better
• Automate as quickly as possible
Cost-based value proposition
Time is getting shorter – scope is getting bigger
• Everyone on the business side and in IT wants quality, but rarely is the extra time given or taken to achieve it.
Quality and time are polarized constraints.
• The higher the quality the more effort (time) it takes to deliver.
• Companies are driven by shorter and shorter schedules.
(Graphic: SCOPE and TIME)
How are we addressing it today?
• Data Warehousing
• Customer Relationship Management
• Enterprise Resource Planning
• Enterprise Application Integration
• Knowledge Management
Why can’t technology fix this?
Ineffective Technology Solutions
Data Warehousing
The Promise:
• data integration
• no redundancy
• consistency
• historical data
• ad-hoc reporting
• trend analysis reporting
• faster data delivery
• faster data access
The Reality:
• stove pipe marts
• departmental views
• swim lane development approach
• too time consuming to integrate
• too costly to cleanse data
• increased data redundancy
If it sounds too good to be true, it is too good to be true.
DW delivers...
a collection of integrated data used to support the strategic decision making process for the enterprise.
Customer Relationship Management
The Promise:
• data integration
• data quality
• customer intimacy
• customer wallet share
• product pricing customization
• knowing your competition
• geographic market potential
The Reality:
• more stovepipe systems
• departmental views
• dirty customer data
• purchased packages not integrated
• focus is too narrow
• privacy issues
If it sounds too good to be true, it is too good to be true.
CRM delivers …
the organizational lifeline, creating competitive advantage through customer service excellence.
seamless coordination between back-office systems, front-office systems and the Web.
Enterprise Resource Planning
ERP delivers...
a collection of functional modules used to integrate operational data to support seamless operational business processes for the enterprise.
The Promise:
• data integration
• no redundancy
• consistency
• data quality
• easy reporting
• easy maintenance
• Y2K compliance
If it sounds too good to be true, it is too good to be true.
The Reality:
• system conversion, not cross-organizational analysis
• same dirty data
• operational focus
• poor quality (unusable) reports
• one-size-fits-all data warehouse
• too costly
Enterprise Application Integration
EAI delivers ...
integration of disparate applications into a unified set of business processes through centrally managed rules and middleware technologies.
The Promise:
• fast & automated integration
• leverage existing data
• bridge islands of automation
• easy cross-system reporting
• faster data delivery
• faster data access
If it sounds too good to be true, it is too good to be true.
The Reality:
• dirty data
• no true integration
• still data redundancy
• still islands of automation
• easier access to the current data mess
Knowledge Management
KM delivers ...
a process for capturing, editing, verifying (for accuracy), disseminating, and utilizing tacit and explicit information about the organization.
The Promise:
• utilize organizational info
• data integration
• historical data
• faster data delivery
• faster data access
• first & only customer contact
• reduction of customer calls
• less re-solving same problems
Reality of KM:
• too difficult to build
• too time consuming
• too costly
• technology challenges
• non-sharing culture
• isolated applications
• difficult to disseminate information
If it sounds too good to be true, it is too good to be true.
What’s the lesson?
You cannot keep doing what you have always done and expect the results to be different.
“That wouldn’t be logical”
Spock, Star Trek
Not even with new technology.
What do we have to change?
1. Assess the current state of data quality at your company
2. Understand and fix the root causes for data contamination
3. Perform data audits regularly (monthly, quarterly)
4. Stop working in isolated “swim lanes”
> Stop recreating data
5. Centrally manage your data like a business asset (Enterprise Information Management [EIM])
> Assemble data as needed from the data inventory (enterprise data model and meta data)
> Standardize and reconcile data transformations for BI/DW applications (coordinated ETL staging area)
6. Scale down project scopes to incorporate data quality and EIM activities
7. Embed data quality and EIM activities in all projects
Business intelligence …
…is a cross-organizational discipline and an enterprise architecture for an integrated collection of operational as well as decision support applications and databases, which provides the business community easy access to their business data and allows them to make accurate business decisions.
… is not business as usual
BI goals and objectives
Data Management
Get control over the existing data chaos
Data Delivery
Provide intuitive access to business information
Data Reengineering (Enterprise Information Management)
80% 20%
Proliferation of data quality problems
(Diagram: Legacy systems (L) feed a Data Warehouse (DW) and separate Data Marts (DM) for Users in Marketing, Finance, Product Sales, Engineering, and Customer Support. Transformation? Cleansing?)
“LegaMarts” (Doug Hackney)
BI?
Industrial-age mental model
(Diagram: the Business Units Marketing, Financial (AP & AR), Product Pricing, Customer Support, Distribution, Inventory, and Sales are each the Client of their own IT unit among the Information Technology Units.)
Project Constraints, priority from highest to lowest:
1. TIME
2. SCOPE
3. BUDGET
4. PEOPLE
5. QUALITY
Scrap and rework
The game has changed
1. Enormous degree of complexity
2. Extremely high rate of change
Cheaper, faster, better!!! But how?
Don’t scrap and rework. Reuse what you already have. (John Zachman)
…but our mental model has not
Information-age mental model
Project Constraints, priority from highest to lowest:
1. QUALITY
2. BUDGET
3. PEOPLE
4. TIME
5. SCOPE
Reassemble reusable components
Information Age:
• Reassemble the entire enterprise
• Reuse assets from inventory
Investment-based value proposition
Software release concept (1)
(Diagram: Projects deliver a reusable and expanding Application as a series of releases: First Release, Second Release, Third Release, Fourth Release, Fifth Release, Final Release.)
Project = Application //
“Refactoring” - Kent Beck
“Extreme scoping” - Larissa Moss
Software release concept (2)
• Requirements can be tested and implemented in small increments
• Scope is very small and manageable
• Technology infrastructure can be tested and proven
• Data volumes (per release) are relatively small
• Project schedules are easier to estimate because the scope is very small
• Development activities can be iteratively refined, honed, and adapted
AND: The quality of the release deliverables (and ultimately the quality of the applications) will be higher!
Cross-organizational development approach (1)
BI/DW Development Steps
1. Business Case Assessment: Cross-organizational
2.A Enterprise Technical Infrastructure: Cross-organizational
2.B Enterprise Non-Technical Infrastructure: Cross-organizational
3. Project Planning: Project-specific
4. Project Requirements Definition: Project-specific
5. Data Analysis: Cross-organizational
6. Application Prototyping: Project-specific
7. Meta Data Repository Analysis: Cross-organizational
8. Database Design: Cross-organizational
9. ETL Design: Cross-organizational
10. Meta Data Repository Design: Cross-organizational
11. ETL Development: Cross-organizational
12. Application Development: Project-specific
13. Data Mining: Cross-organizational
14. Meta Data Repository Development: Cross-organizational
15. Implementation: Project-specific
16. Release Evaluation: Cross-organizational
Data Quality Touch Points
(© Larissa Moss and Shaku Atre, “Business Intelligence Roadmap”)
Cross-organizational development approach (2)
• Commitment to data quality embedded in the methodology
• Cross-organizational program management
• Enterprise information management group
• Standards that include a common information architecture (enterprise data model)
» Involving down-stream information consumers in the requirements definition step
» Involving data owners in the data analysis step
» Involving business representatives from all business units to ratify the data models and meta data
• Coordinating the development/ETL processes
» Disallowing stovepipe development
» Extracting and cleansing source data only once
» Reconciling data transformations and storing the reconciliation totals as meta data
Enterprise information management
(Diagram: the Business Units Marketing, Financial (AP & AR), Product Pricing, Customer Support, Distribution, Inventory, and Sales are each the Client of their own IT unit among the Information Technology Units. Enterprise Information Management spans the Operational Environment (Operational Systems) and the Decision Support Environment (BI/DW Databases: ODS, OM, EDW, DM).)
Discover, Coordinate, Integrate, Document, Control
EIM responsibilities
• Business architecture inventory
» Process models
» Data models
• Application inventory
» Programs
» Databases
• Meta data inventory
» Business meta data
» Technical meta data
• Policy inventory
» Standards
» Procedures
» Guidelines
» …
Discover, Coordinate, Integrate, Document, Control
(Diagram labels: Architects, Stewards, Managers; IT asset inventory management)
Data stewardship
• Guardians of the data while it is being created or maintained by them
• Create standards and procedures to ensure that policies and business rules are known and followed
• Enforce adherence to policies and business rules that govern the data while the data is in their custody
• Periodically monitor (audit) the quality of the data in their custody
• Also known as custodians
• Can be a business person or an IT person
“One who manages another’s property.”
Data ownership
• Authority to establish policies and set business rules for the data under their control
• Decide what the official enterprise definition and domain is for the data under their control
• Monitor and advise other end users on proper usage of their data
• Frequently, but not always, the data originator
• Can be a person or a committee
“One who has the legal right to the possession of a property.”
Enterprise architecture
1. Data Management
• data integration
• data cleansing
2. Data Delivery
• data access
• data manipulation
Business Architecture: Mission and Objective, Business Principles, Business Functions, Program Management
Information Architecture: Enterprise Data Model (Data Standardization, Data Integration, Data Reconciliation, Data Quality)
Application Architecture: Operational Applications, Data Access Applications, Data Analysis Applications, Application Databases
Technology Architecture: Technology Platform, Network, Middleware, DBMS, Tools
(Diagram labels: Content; Storage & Presentation)
Enterprise data model (data inventory)
Supported by common data definitions, domains, and business rules.
Top-Down
(Entity-relationship diagram with the labels: Salesperson, Commissioned Salesperson, Salaried Salesperson, Org Structure, Org Unit, Product Part, Product Category, Product, Customer Product Order, Potential Customer, Existing Customer, Customer, Account, Account Payment, Payment Method, Part, Supplier Shipment, Warehouse)
Source data analysis
Domain Violations:
• Dummy values
• Intelligent dummy values
• Missing values
• Multi-purpose fields
• Cryptic values
• Free-form address lines
Integrity Violations:
• Contradicting values
• Violation of business rules
• Reused primary keys
• Non-unique primary keys
• Missing data relationships
• Inappropriate data relationships
Bottom-Up
To cleanse or not to cleanse …
…that is the question
• You probably cannot cleanse it all (takes too long)
• It may not be worth the time and money to cleanse every data element
• Not all data is equally significant
• Not all data can be cleansed
• How do you know what to cleanse?
Triaging questions (1)
• Can the data be cleansed?
» Does the correct data exist anywhere?
» Is it easily accessible?
• Should the data be cleansed?
» How extensive is the problem?
» How elaborate will the cleansing process be?
» Is it cost-effective?
Triaging questions (2)
• Why are we building the application?
» What business questions cannot be answered today?
• Why are we not able to answer the business questions?
» Is it because of this dirty data?
» Is it because of these missing relationships?
• Will the benefits of cleansing outweigh the cost of the effort?
Categories of data significance
• Critical data
– Not all data is equally critical to all end users
– All critical data must be cleansed
– Usually includes amount fields
• Important data
– Important to the organization, but not absolutely critical
– Further prioritize important data elements
– Cleanse as many as time allows
– Those that cannot be cleansed should be bumped to critical for the next release
• Insignificant data
– Informational data, which is nice to have
– Cleansing is optional if time allows
Business decision!
Cleansing – repairing – prevention
• Where should the dirty data be cleansed?
» In the staging area of the BI application?
» In the source (legacy) files?
• When should it be cleansed?
» Retroactively?
» At data entry time?
• How should it be cleansed?
» Use data cleansing or ETL tools?
» Write procedural (COBOL/C++) code?
• What will we do to prevent dirty data in the future?
Source Data Reengineering …
Total [Data] Quality Management (TQM)
Coordinated ETL staging
(Diagram: Legacy sources (L), which also produce operational reports, feed coordinated staging areas (a daily staging area, Daily StA, and a monthly staging area, Mo StA) where cleansing and transformations take place. The staging areas load the Operational Data Store (ODS) and Oper Marts (OM) for tactical reports, the Enterprise Data Warehouse (EDW) for strategic reports, the Data Marts (DM) for strategic reports, and an EXW. Clients shown: Finance, Product Pricing, Engineering, Marketing, CRM Analytical, CRM Operational, Customer Support, Legal. An Enterprise Architecture & Meta Data Repository underlies the whole environment.)
ETL process flow
– coordinated –
Stages: Extract, Cleanse, Transform, Prepare, Load
(Diagram: the process steps Extract Accounts, Extract New Sales, Extract Prospects, Filter Accounts, Sort Accts, Sort Customers, Match Accounts, Associate Accounts, Merge Customers, Merge Prospects, and Profile Customers are connected through the files Account Tran File, Customer Info File, Customer Master, Sales File, New Sales, Accounts, New Accounts, Prospects, Customers, All Customers, Sorted Accounts, Sorted Customers, and Account Errors.)
ETL Reconciliation
(Diagram labels: legacy sources (L), Monthly Staging Area, Load Files, ODS (daily), EDW (monthly), data marts (DM, monthly))
ETL tie-outs: record counts
(Diagram: INPUT RECORDS enter a PROCESS MODULE, which produces OUTPUT RECORDS and REJECTED RECORDS)
# Input Records = # Output Records + # Rejected Records
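The record-count tie-out can be sketched as a check that runs after each process module. The `process_module` function below is a stand-in for a real ETL step, and the validity rule and sample records are hypothetical.

```python
def process_module(records, is_valid):
    """Split input records into output and rejected lists (stand-in for an ETL step)."""
    output, rejected = [], []
    for record in records:
        (output if is_valid(record) else rejected).append(record)
    return output, rejected

def tie_out_record_counts(n_input, n_output, n_rejected):
    """Raise if # input records != # output records + # rejected records."""
    if n_input != n_output + n_rejected:
        raise ValueError(
            f"record counts do not tie out: {n_input} != {n_output} + {n_rejected}"
        )
    # The totals could be stored as reconciliation meta data.
    return {"input": n_input, "output": n_output, "rejected": n_rejected}

# Hypothetical input: one record has no amount and is rejected.
records = [{"id": 1, "amount": 10}, {"id": 2, "amount": None}, {"id": 3, "amount": 5}]
output, rejected = process_module(records, lambda r: r["amount"] is not None)
totals = tie_out_record_counts(len(records), len(output), len(rejected))
print(totals)
```

In practice the three counts should come from independent sources (the input file trailer, the load log, the reject file) so that a lost record actually shows up as an imbalance.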
ETL tie-outs: domain counts
(Diagram: INPUT CODES enter a PROCESS MODULE, which produces three sets of OUTPUT CODES and REJECTED CODES)
# Records Per Input Domain = # Records Per First Output Domain + # Records Per Second Output Domain + # Records Per Third Output Domain + # Rejected Data Values
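A sketch of the domain-count tie-out for one input code field that is split into three output code fields. The output domain names, the sample codes, and the counts are hypothetical.

```python
from collections import Counter

def domain_counts_balance(input_count, output_counts, rejected_count):
    """True when one input domain's record count equals its output domains plus rejects."""
    return input_count == sum(output_counts.values()) + rejected_count

# Hypothetical example: one input code field split into three output code fields.
input_codes = ["A", "B", "D", "G", "G", "Z", "A"]
output_counts = Counter(customer_type=3, supplier_type=1, regional_constraint=2)
rejected_count = 1  # the unmapped "Z"

balanced = domain_counts_balance(len(input_codes), output_counts, rejected_count)
print(balanced)
```

A failed balance here usually means a code value was silently dropped or mapped to more than one output domain.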
ETL tie-outs: amount counts
(Diagram: INPUT AMOUNTS enter a PROCESS MODULE, which produces two sets of OUTPUT AMOUNTS and REJECTED AMOUNTS)
Total $ Input Amounts = Total $ Per First Output Amount + Total $ Per Second Output Amount + Total $ Rejected Amounts
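A sketch of the amount tie-out, using `Decimal` so that dollar totals compare exactly. The output target names and all amounts are hypothetical.

```python
from decimal import Decimal

def total(amounts):
    """Sum a list of Decimal amounts, returning Decimal('0') for an empty list."""
    return sum(amounts, Decimal("0"))

def amounts_balance(input_amounts, outputs_by_target, rejected_amounts):
    """True when total input $ equals total output $ (all targets) plus rejected $."""
    total_out = sum((total(a) for a in outputs_by_target.values()), Decimal("0"))
    return total(input_amounts) == total_out + total(rejected_amounts)

# Hypothetical amounts for one process module.
input_amounts = [Decimal("100.10"), Decimal("250.00"), Decimal("75.25"), Decimal("19.99")]
outputs_by_target = {
    "first_output_amount": [Decimal("100.10"), Decimal("75.25")],
    "second_output_amount": [Decimal("250.00")],
}
rejected_amounts = [Decimal("19.99")]

print(amounts_balance(input_amounts, outputs_by_target, rejected_amounts))
```

Binary floating point is avoided on purpose: rounding noise in float sums could make a correct load look out of balance, or hide a small real discrepancy.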
Data quality improvements
• Source data repairs
• Increased program edits
• Enhanced data entry procedures
• Improved data quality training
• Regular data audits
• Data usage monitoring
• Enterprise-wide end user surveys
• Continuous validation of enterprise data model
• Continuous validation of meta data, especially definitions and domains
• Involvement of data owners, information consumers, and business sponsors
Data quality maturity
Scale of 1 .. 5, from short term to long term:
1. Discovery by accident: program abends
2. Limited data analysis: data profiling, data cleansing during ETL
3. Addressing root causes: repairing source data and programs
4. Proactive prevention: enterprise-wide DQ methods & techniques
5. Optimization: continuous DQ process improvements
At what level of DQ maturity is your organization?
DQ capability maturity model (1)
CMM Level 1. Uncertainty - Unconscious and unaware
» Data quality problems are denied.
» No formal data quality processes defined.
» Data quality initiatives are ad hoc and chaotic.
» Any success is dependent on individual efforts.
(Source: Larry English)
CMM Level 2. Awakening - The big Aha! and lip service
» Data quality problems are acknowledged.
» Major problems are attacked as they come up.
» Minimum funding for a formal data quality initiative.
» Capability is a characteristic of the individual rather than the organization.
DQ capability maturity model (2)
CMM Level 3. Enlightenment - Let’s do something
» Data quality initiative takes off.
» Enterprise-wide data quality assessment is performed.
» Data quality problems are corrected at the source (where possible).
» Data quality improvement process is institutionalized.
CMM Level 4. Wisdom - Making a difference
» Management accepts personal responsibility for data quality.
» Data quality group reports to a chief officer (CIO, CKO, COO).
» Data quality correction changes to data defect prevention.
» All business areas are involved.
(Source: Larry English)
DQ capability maturity model (3)
CMM Level 5. Certainty - Nirvana
» Data defect prevention is the main focus.
» Data quality is an integral part of the business processes.
» All business areas are continuously improving the processes.
» The culture of the organization has changed.
(Source: Larry English)
Organizational impact
• Cross-organizational tasks and responsibilities are not well defined
• Data quality responsibility is unclear or ignored
• Value of data is not understood or appreciated
• Projects are often cost justified using the industrial-age mental model
• Resource requirements are not well defined
• Impact on application development empire
• No reward for data sharing
• Resistance to change
Organizational changes
• Business and IT collaboration (“partnership”)
• Business and business collaboration (“partnership”)
• IT and IT collaboration (“partnership”)
• Increased end user involvement
• Cross-organizational activities
• Architecture and standardization
• Software release concept
• New charge-back system
• New incentives
• New leadership
New leadership
[Organization chart. Labels: CEO; CFO, CTO, CKO (Chief Knowledge Officer), COO; LOB Execs, IT Execs, EA, EIM (Enterprise Information Management), ...; DA, DQA, MDA; links marked "collaboration".]
How do we change?
12 steps to [DQ] recovery (1)
1. Become aware
• Every cultural transformation process begins with an “Aha”.
• Understand the root causes for your current data chaos.
2. Accept responsibility
• “Yes, it is our fault” for being in this mess.
• Accepting responsibility is a prerequisite for change.
12 steps to [DQ] recovery (2)
3. Decide to change
• Now that “you know better”, the decision is yours: Stay stuck or change.
• There can be no more false hopes for any silver bullet technology solutions.
4. Identify root causes
• What are the specific root causes for non-quality data in your organization?
• Some root causes are common, some are not.
12 steps to [DQ] recovery (3)
5. Collaborate
• It doesn’t matter “whose fault” it is that the root causes exist.
• IT must collaborate with the business community to effect changes.
• Business areas must also collaborate with one another.
6. Identify change agents
• Who will be the couriers?
• Changes must be systemic and holistic, not isolated and sporadic.
12 steps to [DQ] recovery (4)
7. Spread the word
• To embrace changes, there must be “something in it” for everybody.
• Otherwise, changes trigger anxiety and anxiety results in resistance or rejection.
8. Plan changes
• Big changes do not get implemented in one “Big Bang”.
• Involve people in change planning.
• Cross-organizational changes are phased in.
12 steps to [DQ] recovery (5)
9. Prioritize changes
• Some changes are easier to implement than others.
• Some changes have a higher payback.
10. Implement changes
• Everyone affected by the changes must have an opportunity to review and approve the plan before implementation.
12 steps to [DQ] recovery (6)
11. Measure effectiveness
• Solicit feedback from “the trenches”.
• Are the changes affecting anyone adversely?
12. Refine changes
• Nothing is perfect the first time around.
• What might work in one organization may not work in another.
Bibliography
• Adelman, Sid, and Larissa Terpeluk Moss. Data Warehouse Project Management. Boston, MA: Addison-Wesley, 2000.
• Aiken, Peter H. Data Reverse Engineering: Slaying the Legacy Dragon. New York: McGraw-Hill, 1995.
• Brackett, Michael H. Data Resource Quality: Turning Bad Habits into Good Practices. Boston, MA: Addison-Wesley, 2000.
• Brackett, Michael H. The Data Warehouse Challenge: Taming Data Chaos. New York: John Wiley & Sons, 1996.
• English, Larry P. Improving Data Warehouse and Business Information Quality: Methods for Reducing Costs and Increasing Profits. New York: John Wiley & Sons, 1999.
• Hoberman, Steve. Data Modeler’s Workbench: Tools and Techniques for Analysis and Design. New York: John Wiley & Sons, 2001.
• Huang, Kuan-Tsae, Yang W. Lee, and Richard Y. Wang. Quality Information and Knowledge Management. Upper Saddle River, NJ: Prentice Hall, 1998.
• Marco, David. Building and Managing the Meta Data Repository: A Full Lifecycle Guide. New York: John Wiley & Sons, 2000.
• Moss, Larissa T., and Shaku Atre. Business Intelligence Roadmap: The Complete Project Lifecycle for Decision-Support Applications. Boston, MA: Addison-Wesley, 2003.
• Reingruber, Michael C., and William W. Gregory. The Data Modeling Handbook: A Best-Practice Approach to Building Quality Data Models. New York: John Wiley & Sons, 1994.
• Ross, Ronald G. The Business Rule Concepts. Houston, TX: Business Rule Solutions, Inc., 1998.
• Simsion, Graeme. Data Modeling Essentials: Analysis, Design, and Innovation. Boston, MA: International Thomson Computer Press, 1994.
• Von Halle, Barbara. Business Rules Applied: Building Better Systems Using the Business Rules Approach. New York: John Wiley & Sons, 2001.