Improving Data Quality: Why is it so difficult? presented by Larissa T. Moss President, Method Focus, Inc. DAMA Oakland, CA May 7, 2003 Copyright 2003, Larissa T. Moss, Method Focus, Inc.


Page 1: Improving DQ

Improving Data Quality: Why is it so difficult?

presented by

Larissa T. Moss
President, Method Focus, Inc.

DAMA
Oakland, CA

May 7, 2003

Copyright 2003, Larissa T. Moss, Method Focus, Inc.

Page 2: Improving DQ


Ms. Moss is founder and president of Method Focus Inc., a company specializing in improving the quality of business information systems. She frequently speaks at Data Warehouse, Business Intelligence, CRM, and Information Quality conferences around the world on the topics of information asset management, data quality, data modeling, project management, and organizational realignment. She lectures worldwide on the BI topics of spiral development methodology, data modeling, data audit and control, project management, as well as organizational issues. Her articles are frequently published in DM Review, TDWI Journal of Data Warehousing, Cutter IT Journal, Analytic Edge, and The Navigator. She co-authored the books: Data Warehouse Project Management, Addison Wesley 2000, Impossible Data Warehouse Situations, Addison Wesley 2002, and Business Intelligence Roadmap: The Complete Project Lifecycle for Decision Support Applications, Addison Wesley 2003. Ms. Moss is a member of the IBM Gold Group, a Friend of Teradata, a senior consultant at the Cutter Consortium, and a contributing member of Ask The Experts on www.dmreview.com. She has been a lecturer at DCI, TDWI, MISTI, and at the Extension of the California Polytechnic University, Pomona. She can be reached at lmoss@methodfocus.com.

Method Focus Inc. www.methodfocus.com (626) 355-8167

Larissa T. Moss

Page 3: Improving DQ


Presentation Outline

• What do we mean by data quality?
  Dirty data categories

• How are we addressing it today?
  Ineffective technology solutions

• What do we have to change?
  Approaches and techniques

• How do we change?
  12 steps to [DQ] recovery

Page 4: Improving DQ


What do we mean by data quality?

• Data is correct

• Data is accurate

• Data is consistent

• Data is complete

• Data is integrated

• Data values follow the business rules

• Data corresponds to established domains

• Data is well defined and understood


Page 5: Improving DQ


Symptoms of poor-quality data

• Do your programs abend with data exceptions?
• Are your users confused about the meaning of data?
• Is some of your data too stale for reporting?
• Is your data being shared? Is it sharable?
• Are reports inconsistent?
• Does it take your IT staff or the end users hours to reconcile inconsistent reports?
• Does merging data often cause the system to fail?
• Do beepers go off at night?

Page 6: Improving DQ


Dirty data categories

• Dummy (default) values

• “Intelligent” dummy values

• Missing values

• Multi-purpose fields

• Cryptic values

• Free-form address lines

• Contradicting values

• Violation of business rules

• Reused primary key

• Non-unique primary key

• Missing data relationships

• Inappropriate data relationships

Page 7: Improving DQ


Dummy (default) values

• Defaults for mandatory fields

SSN 999-99-9999
Age 999
Zip 99999
Income 9,999,999.99

Inability to determine customer profiles
Inability to determine customer demographics
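A minimal sketch of how such defaults might be flagged during a data audit. The field names and the `DUMMY_VALUES` lookup are assumptions made for illustration; the sentinel values themselves are the examples from the slide.

```python
# Hypothetical lookup of known dummy defaults per field (values from the slide).
DUMMY_VALUES = {
    "ssn": {"999-99-9999"},
    "age": {999},
    "zip": {"99999"},
    "income": {9_999_999.99},
}


def find_dummy_fields(record):
    """Return the names of fields whose value is a known dummy default."""
    return sorted(
        field
        for field, dummies in DUMMY_VALUES.items()
        if record.get(field) in dummies
    )


# Illustrative record: a real age and income, but defaulted SSN and Zip.
customer = {"ssn": "999-99-9999", "age": 42, "zip": "99999", "income": 58_000.00}
flagged = find_dummy_fields(customer)
print(flagged)
```

A real audit would load the list of defaults from meta data rather than hard-code it, since each source system tends to use its own sentinels.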

Page 8: Improving DQ


“Intelligent” dummy values

• Defaults with meaning

SSN 888-88-8888: Non-resident alien
Income 999,999.99: Employee
Age 000: Corporate customer
Source Code ‘FF’: Account closed prior to 1991

Inability to write straightforward queries without knowing how to filter data

Page 9: Improving DQ


Missing Values

• Operational systems do not always require informational or demographic data

Gender
Ethnicity
Age
Income
Referring Source

Inability to analyze marketing channels

Page 10: Improving DQ


Multi-purpose fields

• ONE field explicitly has MANY meanings

» Which business unit enters the data
» At what time in history it was entered
» A value in one or more other fields

Appraisal Amount redefined as Advertised Amount redefined as Sold Date
Loan Type Code redefined as ...

25 redefines = 25 attributes!

Not mutually exclusive!

Only the value of one is known for each record!

Inability to judge product profitability

Page 11: Improving DQ


Cryptic values (1)

• Often found in “Kitchen Sink” fields

» Usually one byte (if not one bit)
» Highly cryptic (A, B, C, 1, 2, 3, ...)
» Non-intelligent, non-intuitive codes
» Often not mutually exclusive

Inability to empower end users to write their own queries

Page 12: Improving DQ


Cryptic values (2)

• ONE field implicitly has MANY meanings

Master_Cd {A, B, C, D, E, F, G, H, I}

{A, B, C}: Type of customer
{D, E, F}: Type of supplier
{G, H, I}: Regional constraints

Page 13: Improving DQ


Free-form address lines

• Unstructured text

» no discernible pattern
» cannot be parsed

address-line-1: ROSENTHAL, LEVITZ, A
address-line-2: TTORNEYS
address-line-3: 10 MARKET, SAN FRANC
address-line-4: ISCO, CA 95111

Inability to perform market analysis

Page 14: Improving DQ


Contradicting values

• Values in one field are inconsistent with values in another related field

1488 Flatbush Avenue New York, NY 75261 (Texas Zip)

Type of real property: Single Family Residence
Number of rental units: four (Income property)

Inability to make reliable business decisions

Page 15: Improving DQ


Violation of business rules

• Business Rule: Adjustable Rate Mortgages must have

» Maximum Interest Rate (Ceiling)
» Minimum Interest Rate (Floor)

• Business Rule: A Ceiling is always higher than a Floor

ceiling-interest-rate: 8.25
floor-interest-rate: 14.75
switched?

Inability to calculate product profitability
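The two business rules on this slide can be expressed as a simple validation. This is only a sketch: the function name and the record layout (a dictionary keyed by `ceiling_interest_rate` and `floor_interest_rate`) are assumptions for illustration.

```python
def check_arm_rates(loan):
    """Return a list of rule violations for an Adjustable Rate Mortgage record."""
    violations = []
    ceiling = loan.get("ceiling_interest_rate")
    floor = loan.get("floor_interest_rate")
    # Rule 1: an ARM must have both a ceiling and a floor.
    if ceiling is None:
        violations.append("missing ceiling")
    if floor is None:
        violations.append("missing floor")
    # Rule 2: the ceiling is always higher than the floor.
    if ceiling is not None and floor is not None and ceiling <= floor:
        violations.append("ceiling not higher than floor (possibly switched)")
    return violations


# The record from the slide, where the two rates look switched.
loan = {"ceiling_interest_rate": 8.25, "floor_interest_rate": 14.75}
result = check_arm_rates(loan)
print(result)
```

The check only reports the violation. Whether the two values were really switched is a question for the data owner, not something the code can decide.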

Page 16: Improving DQ


Reused primary keys

• Little history, if any, stored in operational files

» primary keys are customarily re-used
» may have a different rollup structure

January ‘94: branch 501 = San Francisco Main, region 1, area SW
August ‘97: branch 501 = San Luis Obispo, region 2, area SW

Inability to evaluate organizational performance

Page 17: Improving DQ


Non-unique primary keys

• Duplicate identification numbers

» Multiple customer numbers

Customer Name        Phone Number    Cust. Number
Philip K. Sherman    818.357.5166    960601
Philip K. Sherman    818.357.7711    960105
Philip K. Sherman    818.357.8911    960003

» Multiple employee numbers

                 Employee Name    Department    Empl. Number
July 1995:       Bob Smith        213 (HR)      21304762
January 1996:    Bob Smith        432 (SRV)     43218221
August 1999:     Bob Smith        206 (MKT)     20684762

Inability to determine customer relationships
Inability to analyze employee benefits trends
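One way to surface candidates for this problem is to group records by a normalized name and report names that carry more than one identification number. The sketch below uses the customer rows from the slide. Matching on name alone is a simplifying assumption; real matching usually needs more attributes (address, phone, date of birth) to avoid false positives.

```python
from collections import defaultdict

# (customer name, phone number, customer number), as shown on the slide.
rows = [
    ("Philip K. Sherman", "818.357.5166", "960601"),
    ("Philip K. Sherman", "818.357.7711", "960105"),
    ("Philip K. Sherman", "818.357.8911", "960003"),
]


def numbers_per_name(rows):
    """Map each normalized name to its customer numbers, keeping only suspects."""
    by_name = defaultdict(set)
    for name, _phone, cust_number in rows:
        by_name[name.strip().lower()].add(cust_number)
    # A name with more than one number may be one customer entered several times.
    return {name: sorted(nums) for name, nums in by_name.items() if len(nums) > 1}


suspects = numbers_per_name(rows)
print(suspects)
```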

Page 18: Improving DQ


Missing data relationships

• Data that should be related to other data in a dependent (parent-child) relationship

» Branch number 0765 does not exist in the BRANCH table

[Diagram: Branch, Employee, Benefit]

Inability to produce accurate rollups
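A missing parent can be found by comparing the foreign key values in the dependent records against the set of existing parent keys. The following is a minimal sketch; the sample rows and the field name `branch` are made up for illustration, apart from branch number 0765 from the slide.

```python
def find_orphans(child_rows, parent_keys, fk_field):
    """Return child rows whose foreign key has no matching parent key."""
    parent_keys = set(parent_keys)
    return [row for row in child_rows if row[fk_field] not in parent_keys]


# Hypothetical BRANCH keys and EMPLOYEE rows; branch 0765 has no parent.
branches = ["0501", "0502"]
employees = [
    {"employee": "E1", "branch": "0501"},
    {"employee": "E2", "branch": "0765"},
]
orphans = find_orphans(employees, branches, "branch")
print(orphans)
```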

Page 19: Improving DQ


Inappropriate data relationships

• Data that is inadvertently related, but should not be

» two entity types with the same key values

Purchaser: Jackie Schmidt 837221
Seller: Robert Black 837221

Inability to determine customer or vendor relationships

Page 20: Improving DQ


Impact of erroneous data

• Extra time it takes to correct data problems
• Extra resources needed to correct data problems
• Time and effort required to re-run jobs that abend
• Time wasted arguing over inconsistent reports
• Lost business opportunities due to unavailable data
• Unable to demonstrate business potential in a buyout
• Fines may be paid for noncompliance with government regulations
• Shipping products to the wrong customers
• Bad public relations with customers
  – leads to alienated and lost customers

Page 21: Improving DQ


Cost of erroneous data

Marketing Campaign                    Per Instance   Number of Instances   Total Number Per Year   Total Cost Per Year

Time ($60/hour loaded rate):
Creating redundant occurrence         2.4 min        167,141               1                       $  401,138
Researching correct address           10 min         5,000/mo              12                      $  600,000
Correcting address errors             0.3 min        6,000/mo              12                      $   21,600
Handling complaints from customers    5.5 min        974/yr                1                       $    5,357
Mail preparation                      0.1 min        393,273               4                       $  157,309

Materials, Facilities, Equipment:
Marketing brochure                    $1.96          393,273               4                       $3,083,260
Postage                               $0.52          393,273               4                       $  818,008
Warehouse storage                     $0.01          393,273               4                       $   15,731
Shipping equipment and maintenance    $5,000/yr      36%                   1                       $    1,800

Computing resources:
CPU transactions                      $0.02/trans    393,273               4                       $   31,462
Data storage                          $0.001/mo      393,273               12                      $    4,719
Data backup                           $0.005/mo      393,273               12                      $   23,596

Total Annual Costs                                                                                 $5,163,980

Direct Costs of Non-Quality Information © Larry English, Improving DW and BI Quality
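The arithmetic behind the time rows of this table appears to be per-instance minutes × number of instances × occurrences per year × the loaded rate ($60/hour, or $1 per minute). The sketch below recomputes those rows under that assumption. The function and variable names are illustrative and not part of the original analysis.

```python
RATE_PER_MINUTE = 60 / 60  # $60/hour loaded rate expressed per minute

time_items = [
    # (activity, minutes per instance, instances, times per year)
    ("Creating redundant occurrence", 2.4, 167_141, 1),
    ("Researching correct address", 10, 5_000, 12),
    ("Correcting address errors", 0.3, 6_000, 12),
    ("Handling complaints from customers", 5.5, 974, 1),
    ("Mail preparation", 0.1, 393_273, 4),
]


def annual_cost(minutes, instances, times_per_year, rate=RATE_PER_MINUTE):
    """Annual cost in whole dollars for one activity."""
    return round(minutes * instances * times_per_year * rate)


costs = {name: annual_cost(m, n, f) for name, m, n, f in time_items}
total_time_cost = sum(costs.values())

for name, cost in costs.items():
    print(f"{name:<36} ${cost:>9,}")
print(f"{'Time subtotal':<36} ${total_time_cost:>9,}")
```

Each computed value agrees with the corresponding "Total Cost Per Year" figure on the slide. The same pattern (unit cost × volume × frequency) applies to the materials and computing rows.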

Page 22: Improving DQ


Impact of redundant data

• Hardware (CPU, disks) and software (program maintenance) costs incurred as a result of uncontrolled redundant data
• Extra time it takes to reconcile inconsistencies
• Extra resources needed to reconcile inconsistencies
• Unwise business decisions made due to redundant and inconsistent data
• Lost opportunities due to unreliable data
• Overcharging or overpayment for products
• Duplicate shipping of products
• Money wasted on sending redundant marketing material

Page 23: Improving DQ


Cost of redundant data

Information Development Cost Analysis

Category                                       Portfolio      Relative Weight   Average Unit       Total Dev/Maint   % of
                                               Total Number   Factor*           Dev/Maint Costs    Expenses**        Budget

Infrastructure Basis:
Enterprise architected DBs                       200          0.75              $ 15,000           $ 3,000,000
Enterprise reusable create/update programs +     300          1.50              $ 30,000           $ 9,000,000
Total infrastructure expenses                                                                      $12,000,000       24%

Value Basis:
Total retrieve equivalent pgms +                 300          1.00              $ 20,000           $ 6,000,000
Total value-adding expenses                                                                        $ 6,000,000       12%

Cost-adding Basis:
Redundant create/update pgms                     500          1.50              $ 30,000           $15,000,000
Interface/extract programs                       400          1.00              $ 20,000           $ 8,000,000
Redundant database files                         600          0.75              $ 15,000           $ 9,000,000
Total cost-adding expenses                     1,500                                               $32,000,000       64%

Lifetime Total **                              3,800                                               $50,000,000       100%

* Determine relative effort to develop average unit of each category using effort to develop a retrieve program as “1.00”
+ For programs that retrieve some data and create/update other data, determine the percent of retrieve only attributes and percent of create/update attributes (e.g., to retrieve customer data to create an order)
** Based on 3,800 application programs and database files in portfolio and $50 Million in development

© Larry English, Improving DW and BI Quality

Page 24: Improving DQ


Dirty data – How did it happen?

[Figure: organization chart. The Chief Executive Officer sits above the Chief Operating Officer (Business) and the Chief Information Officer (Technology). Business Managers are paired with Technology Managers. Each of the Business Units (Marketing, Financial (AP & AR), Product Pricing, Customer Support, Distribution, Inventory, Sales) is a Client served by its own IT unit within the Information Technology Units.]

• data redundancy
• process redundancy
• dirty data

Page 25: Improving DQ


Major cause for data deficiencies

Project constraints, highest to lowest priority:

1. TIME
2. SCOPE
3. BUDGET
4. PEOPLE
5. QUALITY

Wrong priority on project constraints!

Industrial Age:
• Cheaper, faster, better
• Automate as quickly as possible

Cost-based value proposition

Page 26: Improving DQ


Time is getting shorter – scope is getting bigger

• Everyone on the business side and in IT wants quality, but rarely is the extra time given or taken to achieve it.

Quality and time are polarized constraints.

• The higher the quality the more effort (time) it takes to deliver.

• Companies are driven by shorter and shorter schedules.

[Figure: SCOPE versus TIME]

Page 27: Improving DQ


How are we addressing it today?

• Data Warehousing

• Customer Relationship Management

• Enterprise Resource Planning

• Enterprise Application Integration

• Knowledge Management

Why can’t technology fix this?

Ineffective Technology Solutions

Page 28: Improving DQ


Data Warehousing

The Promise:
• data integration
• no redundancy
• consistency
• historical data
• ad-hoc reporting
• trend analysis reporting
• faster data delivery
• faster data access

The Reality:
• stove pipe marts
• departmental views
• swim lane development approach
• too time consuming to integrate
• too costly to cleanse data
• increased data redundancy

If it sounds too good to be true, it is too good to be true.

DW delivers...

a collection of integrated data used to support the strategic decision making process for the enterprise.

Page 29: Improving DQ


Customer Relationship Management

The Promise:
• data integration
• data quality
• customer intimacy
• customer wallet share
• product pricing customization
• knowing your competition
• geographic market potential

The Reality:
• more stovepipe systems
• departmental views
• dirty customer data
• purchased packages not integrated
• focus is too narrow
• privacy issues

If it sounds too good to be true, it is too good to be true.

CRM delivers …

the organizational lifeline, creating competitive advantage through customer service excellence.

seamless coordination between back-office systems, front-office systems and the Web.

Page 30: Improving DQ


Enterprise Resource Planning

ERP delivers...

a collection of functional modules used to integrate operational data to support seamless operational business processes for the enterprise.

The Promise:
• data integration
• no redundancy
• consistency
• data quality
• easy reporting
• easy maintenance
• Y2K compliance

The Reality:
• system conversion
• not cross-organizational analysis
• same dirty data
• operational focus
• poor quality (unusable) reports
• one-size-fits-all data warehouse
• too costly

If it sounds too good to be true, it is too good to be true.

Page 31: Improving DQ


Enterprise Application Integration

EAI delivers ...

integration of disparate applications into a unified set of business processes through centrally managed rules and middleware technologies.

The Promise:
• fast & automated integration
• leverage existing data
• bridge islands of automation
• easy cross-system reporting
• faster data delivery
• faster data access

The Reality:
• dirty data
• no true integration
• still data redundancy
• still islands of automation
• easier access to the current data mess

If it sounds too good to be true, it is too good to be true.

Page 32: Improving DQ


Knowledge Management

KM delivers ...

a process for capturing, editing, verifying (for accuracy), disseminating, and utilizing tacit and explicit information about the organization.

The Promise:
• utilize organizational info
• data integration
• historical data
• faster data delivery
• faster data access
• first & only customer contact
• reduction of customer calls
• less re-solving same problems

Reality of KM:
• too difficult to build
• too time consuming
• too costly
• technology challenges
• non-sharing culture
• isolated applications
• difficult to disseminate information

If it sounds too good to be true, it is too good to be true.

Page 33: Improving DQ


What’s the lesson?

You cannot keep doing what you have always done and expect the results to be different.

“That wouldn’t be logical”
Spock, Star Trek

Not even with new technology.

Page 34: Improving DQ


What do we have to change?

1. Assess the current state of data quality at your company

2. Understand and fix the root causes for data contamination

3. Perform data audits regularly (monthly, quarterly)

4. Stop working in isolated “swim lanes”

> Stop recreating data

5. Centrally manage your data like a business asset (Enterprise Information Management [EIM])

> Assemble data as needed from the data inventory (enterprise data model and meta data)

> Standardize and reconcile data transformations for BI/DW applications (coordinated ETL staging area)

6. Scale down project scopes to incorporate data quality and EIM activities

7. Embed data quality and EIM activities in all projects

Page 35: Improving DQ


Business intelligence …

… is a cross-organizational discipline and an enterprise architecture for an integrated collection of operational as well as decision support applications and databases, which provides the business community easy access to their business data and allows them to make accurate business decisions.

… is not business as usual

Page 36: Improving DQ


BI goals and objectives

Data Management

Get control over the existing data chaos

Data Delivery

Provide intuitive access to business information

Data Reengineering (Enterprise Information Management)

80% 20%

Page 37: Improving DQ


Proliferation of data quality problems

[Figure: Legacy sources (L) feed a Data Warehouse (DW) and several Data Marts (DM) used by Marketing, Finance, Product Sales, Engineering, and Customer Support users. Labels on the data flows: transformation? cleansing?]

“LegaMarts” (Doug Hackney)

BI?

Page 38: Improving DQ


Industrial-age mental model

[Figure: each of the Business Units (Marketing, Financial (AP & AR), Product Pricing, Customer Support, Distribution, Inventory, Sales) is a Client served by its own IT unit within the Information Technology Units.]

Project constraints, highest to lowest priority:

1. TIME
2. SCOPE
3. BUDGET
4. PEOPLE
5. QUALITY

Scrap and rework

Page 39: Improving DQ


The game has changed

1. Enormous degree of complexity

2. Extremely high rate of change

Cheaper, faster, better!!! But how?

Don’t scrap and rework. Reuse what you already have. (John Zachman)

…but our mental model has not

Page 40: Improving DQ


Information-age mental model

Project constraints, highest to lowest priority:

1. QUALITY
2. BUDGET
3. PEOPLE
4. TIME
5. SCOPE

Reassemble reusable components

Information Age:
• Reassemble the entire enterprise
• Reuse assets from inventory

Investment-based value proposition

Page 41: Improving DQ


Software release concept (1)

[Figure: one Application built up through a series of Projects (First Release, Second Release, Third Release, Fourth Release, Fifth Release, ..., Final Release), each release reusable and expanding.]

Project = Application //

“Refactoring” - Kent Beck

“Extreme scoping” - Larissa Moss

Page 42: Improving DQ


Software release concept (2)

• Requirements can be tested and implemented in small increments

• Scope is very small and manageable

• Technology infrastructure can be tested and proven

• Data volumes (per release) are relatively small

• Project schedules are easier to estimate because the scope is very small

• Development activities can be iteratively refined, honed, and adapted

AND: The quality of the release deliverables (and ultimately the quality of the applications) will be higher!

Page 43: Improving DQ


Cross-organizational development approach (1)

BI/DW Development Steps                          Data Quality Touch Points

1. Business Case Assessment                      Cross-organizational
2.A Enterprise Technical Infrastructure          Cross-organizational
2.B Enterprise Non-Technical Infrastructure      Cross-organizational
3. Project Planning                              Project-specific
4. Project Requirements Definition               Project-specific
5. Data Analysis                                 Cross-organizational
6. Application Prototyping                       Project-specific
7. Meta Data Repository Analysis                 Cross-organizational
8. Database Design                               Cross-organizational
9. ETL Design                                    Cross-organizational
10. Meta Data Repository Design                  Cross-organizational
11. ETL Development                              Cross-organizational
12. Application Development                      Project-specific
13. Data Mining                                  Cross-organizational
14. Meta Data Repository Development             Cross-organizational
15. Implementation                               Project-specific
16. Release Evaluation                           Cross-organizational

(© Larissa Moss and Shaku Atre, “Business Intelligence Roadmap”)

Page 44: Improving DQ


Cross-organizational development approach (2)

• Commitment to data quality embedded in the methodology

• Cross-organizational program management

• Enterprise information management group

• Standards that include a common information architecture (enterprise data model)
  Involving down-stream information consumers in the requirements definition step
  Involving data owners in the data analysis step
  Involving business representatives from all business units to ratify the data models and meta data

• Coordinating the development/ETL processes
  Disallowing stovepipe development
  Extracting and cleansing source data only once
  Reconciling data transformations and storing the reconciliation totals as meta data

Page 45: Improving DQ


Enterprise information management

[Figure: the Business Units (Marketing, Financial (AP & AR), Product Pricing, Customer Support, Distribution, Inventory, Sales) are Clients of the Information Technology Units. Enterprise Information Management (Discover, Coordinate, Integrate, Document, Control) spans the Operational Environment (Operational Systems) and the Decision Support Environment (BI/DW Databases: ODS, EDW, OM, DM).]

Page 46: Improving DQ


EIM responsibilities

• Business architecture inventory
  Process models
  Data models

• Application inventory
  Programs
  Databases

• Meta data inventory
  Business meta data
  Technical meta data

• Policy inventory
  Standards
  Procedures
  Guidelines
  …

Discover, Coordinate, Integrate, Document, Control

Architects, Stewards, Managers

IT asset inventory management

Page 47: Improving DQ


Data stewardship

• Guardians of the data while it is being created or maintained by them

• Create standards and procedures to ensure that policies and business rules are known and followed

• Enforce adherence to policies and business rules that govern the data while the data is in their custody

• Periodically monitor (audit) the quality of the data in their custody

• Also known as custodians

• Can be a business person or an IT person

“One who manages another’s property.”

Page 48: Improving DQ


Data ownership

• Authority to establish policies and set business rules for the data under their control

• Decide what the official enterprise definition and domain is for the data under their control

• Monitor and advise other end users on proper usage of their data

• Frequently, but not always, the data originator

• Can be a person or a committee

“One who has the legal right to the possession of a property.”

Page 49: Improving DQ


Enterprise architecture

1. Data Management
• data integration
• data cleansing

2. Data Delivery
• data access
• data manipulation

Business Architecture
  Mission and Objective
  Business Principles
  Business Functions
  Program Management

Information Architecture
  Enterprise Data Model
  - Data Standardization
  - Data Integration
  - Data Reconciliation
  - Data Quality

Application Architecture
  Operational Applications
  Data Access Applications
  Data Analysis Applications
  Application Databases

Technology Architecture
  Technology Platform
  Network
  Middleware
  DBMS, Tools

Content

Storage & Presentation

Page 50: Improving DQ


Enterprise data model (data inventory)

Supported by common data definitions, domains, and business rules.

[Figure: top-down enterprise data model with the entities Salesperson (Commissioned Salesperson, Salaried Salesperson), Org Structure, Org Unit, Product, Product Category, Product Part, Part, Customer (Potential Customer, Existing Customer), Customer Product Order, Account, Account Payment, Payment Method, Supplier, Shipment, Warehouse.]

Top-Down

Page 51: Improving DQ


Source data analysis

Domain Violations:
• Dummy values
• Intelligent dummy values
• Missing values
• Multi-purpose fields
• Cryptic values
• Free-form address lines

Integrity Violations:
• Contradicting values
• Violation of business rules
• Reused primary keys
• Non-unique primary keys
• Missing data relationships
• Inappropriate data relationships

Bottom-Up

Page 52: Improving DQ


To cleanse or not to cleanse …

…that is the question

• You probably cannot cleanse it all (takes too long)

• It may not be worth the time and money to cleanse every data element

• Not all data is equally significant

• Not all data can be cleansed

• How do you know what to cleanse?

Page 53: Improving DQ


Triaging questions (1)

• Can the data be cleansed?
  Does the correct data exist anywhere?
  Is it easily accessible?

• Should the data be cleansed?
  How extensive is the problem?
  How elaborate will the cleansing process be?
  Is it cost-effective?

Page 54: Improving DQ


Triaging questions (2)

• Why are we building the application?

What business questions cannot be answered today?

• Why are we not able to answer the business questions?

Is it because of this dirty data?

Is it because of these missing relationships?

• Will the benefits of cleansing outweigh the cost of the effort?

Page 55: Improving DQ


Categories of data significance

• Critical data
  – Not all data is equally critical to all end users
  – All critical data must be cleansed
  – Usually includes amount fields

• Important data
  – Important to the organization, but not absolutely critical
  – Further prioritize important data elements
  – Cleanse as many as time allows
  – Those that cannot be cleansed should be bumped to critical for the next release

• Insignificant data
  – Informational data, which is nice to have
  – Cleansing is optional if time allows

Business decision!

Page 56: Improving DQ


Cleansing – repairing – prevention

• Where should the dirty data be cleansed?
  In the staging area of the BI application?
  In the source (legacy) files?

• When should it be cleansed?
  Retroactively?
  At data entry time?

• How should it be cleansed?
  Use data cleansing or ETL tools?
  Write procedural (COBOL/C++) code?

• What will we do to prevent dirty data in the future?

Source Data Reengineering … Total [Data] Quality Management (TQM)

Page 57: Improving DQ


Coordinated ETL staging

[Figure: coordinated ETL staging. Legacy sources (L) with their operational reports feed a daily and a monthly staging area (DailyStA, MoStA), where cleansing and transformations take place. The staging areas load the Operational Data Store/Oper Marts (tactical reports), the Enterprise Data Warehouse (strategic reports), and the Data Marts (strategic reports). Labels in the diagram: Clients, ODS, EDW, DM, OM, EXW, Finance, Product Pricing, Engineering, Marketing, CRM Analytical, CRM Operational, Customer Support, Legal. An Enterprise Architecture & Meta Data Repository spans the whole environment.]

Page 58: Improving DQ


ETL process flow

[Figure: coordinated ETL process flow across the stages Extract, Cleanse, Transform, Prepare, and Load. Source files (Account Tran File, Customer Info File, Customer Master, Sales File, Prospects) pass through the modules Extract Accounts, Extract New Sales, Extract Prospects, Filter Accounts, Merge Customers, Merge Prospects, Sort Accts, Sort Customers, Match Accounts (1), Associate Accounts (2), and Profile Customers (3). Intermediate files include Accounts, New Accounts, New Sales, Customers, All Customers, Sorted Accounts, Sorted Customers, and Account Errors.]

– coordinated –

Page 59: Improving DQ


ETL Reconciliation

[Diagram: the Monthly Staging Area produces Load Files; load (L) steps feed the ODS (daily), the EDW (monthly), and four data marts (DM, monthly).]

Page 60: Improving DQ


ETL tie-outs: record counts

[Diagram: INPUT RECORDS flow into a PROCESS MODULE, which produces OUTPUT RECORDS and REJECTED RECORDS.]

# Input Records = # Output Records + # Rejected Records
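The record-count tie-out lends itself to an automated check at the end of each process module. The following is a minimal sketch, not the presenter's implementation; the function name `record_count_tie_out` and the sample counts are hypothetical.

```python
def record_count_tie_out(input_count: int, output_count: int, rejected_count: int) -> bool:
    """Return True when every input record is accounted for.

    Rule from the slide:
    # Input Records = # Output Records + # Rejected Records
    """
    return input_count == output_count + rejected_count


# Hypothetical counts reported by one ETL process module.
balanced = record_count_tie_out(input_count=1000, output_count=970, rejected_count=30)

# Five records are unaccounted for here, so the tie-out fails.
unbalanced = record_count_tie_out(input_count=1000, output_count=970, rejected_count=25)

print(balanced, unbalanced)
```

A failed tie-out means records were lost or duplicated inside the module, so the load should be stopped and investigated before it continues.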

Page 61: Improving DQ


ETL tie-outs: domain counts

[Diagram: INPUT CODES flow into a PROCESS MODULE, which produces three sets of OUTPUT CODES and one set of REJECTED CODES.]

# Records Per Input Domain = # Records Per First Output Domain + # Records Per Second Output Domain + # Records Per Third Output Domain + # Rejected Data Values
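A domain-count tie-out can be sketched the same way: count the records per output domain value, add the rejected values, and compare the sum with the number of input records. The code mapping `CODE_MAP`, the function names, and the sample codes below are assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical mapping from legacy codes to three target domain values.
CODE_MAP = {"1": "RETAIL", "2": "WHOLESALE", "3": "GOVERNMENT"}


def transform_codes(input_codes):
    """Split input codes into mapped output codes and rejected values."""
    output_codes, rejected_codes = [], []
    for code in input_codes:
        if code in CODE_MAP:
            output_codes.append(CODE_MAP[code])
        else:
            rejected_codes.append(code)  # unknown or empty code
    return output_codes, rejected_codes


def domain_count_tie_out(input_codes, output_codes, rejected_codes):
    """Check: # input records = sum of # records per output domain + # rejected values."""
    output_counts = Counter(output_codes)
    balanced = len(input_codes) == sum(output_counts.values()) + len(rejected_codes)
    return balanced, output_counts


input_codes = ["1", "2", "2", "3", "9", "1", ""]
output_codes, rejected_codes = transform_codes(input_codes)
balanced, output_counts = domain_count_tie_out(input_codes, output_codes, rejected_codes)
print(balanced, dict(output_counts), rejected_codes)
```

Keeping the per-domain counts, rather than only the pass/fail result, makes it easier to see which output domain is off when the tie-out fails.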

Page 62: Improving DQ


ETL tie-outs: amount counts

[Diagram: INPUT AMOUNTS flow into a PROCESS MODULE, which produces two sets of OUTPUT AMOUNTS and one set of REJECTED AMOUNTS.]

Total $ Input Amounts = Total $ Per First Output Amount + Total $ Per Second Output Amount + Total $ Rejected Amounts
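For amounts, the tie-out compares the total dollars read with the total dollars written to each output plus the total dollars rejected. The sketch below is one possible way to code it, under stated assumptions: the function name `amount_tie_out` and the sample amounts are hypothetical, and `decimal.Decimal` is used instead of floating point so that cents add up exactly.

```python
from decimal import Decimal


def amount_tie_out(input_amounts, output_groups, rejected_amounts):
    """Check: total $ input = sum of total $ per output + total $ rejected.

    Returns (balanced, difference), where difference is the unexplained amount.
    """
    zero = Decimal("0")
    total_input = sum(input_amounts, zero)
    total_output = sum((sum(group, zero) for group in output_groups), zero)
    total_rejected = sum(rejected_amounts, zero)
    difference = total_input - (total_output + total_rejected)
    return difference == zero, difference


# Hypothetical amounts for one ETL process module with two outputs.
input_amounts = [Decimal("100.00"), Decimal("250.50"), Decimal("75.25"), Decimal("10.00")]
first_output = [Decimal("100.00")]
second_output = [Decimal("250.50"), Decimal("75.25")]
rejected = [Decimal("10.00")]

balanced, difference = amount_tie_out(input_amounts, [first_output, second_output], rejected)

# Same run, but the rejected amounts were not captured: the tie-out fails.
missing_rejects, shortfall = amount_tie_out(input_amounts, [first_output, second_output], [])

print(balanced, difference, missing_rejects, shortfall)
```

Returning the difference alongside the pass/fail flag shows how much money is unaccounted for, which is usually the first question asked when an amount tie-out fails.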

Page 63: Improving DQ


Data quality improvements

• Source data repairs

• Increased program edits

• Enhanced data entry procedures

• Improved data quality training

• Regular data audits

• Data usage monitoring

• Enterprise-wide end user surveys

• Continuous validation of enterprise data model

• Continuous validation of meta data, especially definitions and domains

• Involvement of data owners, information consumers, and business sponsors

Page 64: Improving DQ


Data quality maturity

At what level of DQ maturity is your organization? (Scale of 1 .. 5, from short term to long term)

1. Discovery by accident: program abends
2. Limited data analysis: data profiling, data cleansing during ETL
3. Addressing root causes: repairing source data and programs
4. Proactive prevention: enterprise-wide DQ methods & techniques
5. Optimization: continuous DQ process improvements

Page 65: Improving DQ


DQ capability maturity model (1)

CMM Level 1. Uncertainty - Unconscious and unaware

» Data quality problems are denied.
» No formal data quality processes defined.
» Data quality initiatives are ad hoc and chaotic.
» Any success is dependent on individual efforts.

CMM Level 2. Awakening - The big Aha! and lip service

» Data quality problems are acknowledged.
» Major problems are attacked as they come up.
» Minimum funding for a formal data quality initiative.
» Capability is a characteristic of the individual rather than the organization.

(Source: Larry English)

Page 66: Improving DQ


DQ capability maturity model (2)

CMM Level 3. Enlightenment - Let’s do something

» Data quality initiative takes off.
» Enterprise-wide data quality assessment is performed.
» Data quality problems are corrected at the source (where possible).
» Data quality improvement process is institutionalized.

CMM Level 4. Wisdom - Making a difference

» Management accepts personal responsibility for data quality.
» Data quality group reports to a chief officer (CIO, CKO, COO).
» Data quality correction changes to data defect prevention.
» All business areas are involved.

(Source: Larry English)

Page 67: Improving DQ


DQ capability maturity model (3)

CMM Level 5. Certainty - Nirvana

» Data defect prevention is the main focus.
» Data quality is an integral part of the business processes.
» All business areas are continuously improving the processes.
» The culture of the organization has changed.

(Source: Larry English)

Page 68: Improving DQ


Organizational impact

• Cross-organizational tasks and responsibilities are not well defined

• Data quality responsibility is not clear or ignored

• Value of data is not understood or appreciated

• Projects are often cost justified using the industrial-age mental model

• Resource requirements are not well defined

• Impact on application development empire

• No reward for data sharing

• Resistance to change

Page 69: Improving DQ


Organizational changes

• Business and IT collaboration (“partnership”)

• Business and business collaboration (“partnership”)

• IT and IT collaboration (“partnership”)

• Increased end user involvement

• Cross-organizational activities

• Architecture and standardization

• Software release concept

• New charge-back system

• New incentives

• New leadership

Page 70: Improving DQ


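New leadership

[Org chart: the CEO at the top, with the CFO, COO, CKO (Chief Knowledge Officer), and CTO beneath. Further labeled boxes: LOB Execs, IT Execs, EIM (Enterprise Information Management), EA, and the roles DA, DQA, and MDA. "Collaboration" links are marked between groups.]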

Page 71: Improving DQ


How do we change?
12 steps to [DQ] recovery (1)

1. Become aware
• Every cultural transformation process begins with an “Aha”.
• Understand the root causes for your current data chaos.

2. Accept responsibility
• “Yes, it is our fault” for being in this mess.
• Accepting responsibility is a prerequisite for change.

Page 72: Improving DQ


12 steps to [DQ] recovery (2)

3. Decide to change
• Now that “you know better”, the decision is yours: Stay stuck or change.
• There can be no more false hopes for any silver bullet technology solutions.

4. Identify root causes
• What are the specific root causes for non-quality data in your organization?
• Some root causes are common, some are not.

Page 73: Improving DQ


12 steps to [DQ] recovery (3)

5. Collaborate
• It doesn’t matter “whose fault” it is that the root causes exist.
• IT must collaborate with the business community to effect changes.
• Business areas must also collaborate with one another.

6. Identify change agents
• Who will be the couriers?
• Changes must be systemic and holistic, not isolated and sporadic.

Page 74: Improving DQ


12 steps to [DQ] recovery (4)

7. Spread the word
• To embrace changes, there must be “something in it” for everybody.
• Otherwise, changes trigger anxiety and anxiety results in resistance or rejection.

8. Plan changes
• Big changes do not get implemented in one “Big Bang”.
• Involve people in change planning.
• Cross-organizational changes are phased in.

Page 75: Improving DQ


12 steps to [DQ] recovery (5)

9. Prioritize changes
• Some changes are easier to implement than others.
• Some changes have a higher payback.

10. Implement changes
• Everyone affected by the changes must have an opportunity to review and approve the plan before implementation.

Page 76: Improving DQ


12 steps to [DQ] recovery (6)

11. Measure effectiveness
• Solicit feedback from “the trenches”.
• Are the changes affecting anyone adversely?

12. Refine changes
• Nothing is perfect the first time around.
• What might work in one organization may not work in another.

Page 77: Improving DQ


Bibliography

• Adelman, Sid, and Larissa Terpeluk Moss. Data Warehouse Project Management. Boston, MA: Addison-Wesley, 2000.
• Aiken, Peter H. Data Reverse Engineering: Slaying the Legacy Dragon. New York: McGraw-Hill, 1995.
• Brackett, Michael H. Data Resource Quality: Turning Bad Habits into Good Practices. Boston, MA: Addison-Wesley, 2000.
• Brackett, Michael H. The Data Warehouse Challenge: Taming Data Chaos. New York: John Wiley & Sons, 1996.
• English, Larry P. Improving Data Warehouse and Business Information Quality: Methods for Reducing Costs and Increasing Profits. New York: John Wiley & Sons, 1999.
• Hoberman, Steve. Data Modeler’s Workbench: Tools and Techniques for Analysis and Design. New York: John Wiley & Sons, 2001.
• Huang, Kuan-Tsae, Yang W. Lee, and Richard Y. Wang. Quality Information and Knowledge Management. Upper Saddle River, NJ: Prentice Hall, 1998.
• Marco, David. Building and Managing the Meta Data Repository: A Full Lifecycle Guide. New York: John Wiley & Sons, 2000.
• Moss, Larissa T., and Shaku Atre. Business Intelligence Roadmap: The Complete Project Lifecycle for Decision-Support Applications. Boston, MA: Addison-Wesley, 2003.
• Reingruber, Michael C., and William W. Gregory. The Data Modeling Handbook: A Best-Practice Approach to Building Quality Data Models. New York: John Wiley & Sons, 1994.
• Ross, Ronald G. The Business Rule Concepts. Houston, TX: Business Rule Solutions, Inc., 1998.
• Simsion, Graeme. Data Modeling Essentials: Analysis, Design, and Innovation. Boston, MA: International Thomson Computer Press, 1994.
• Von Halle, Barbara. Business Rules Applied: Building Better Systems Using the Business Rules Approach. New York: John Wiley & Sons, 2001.