
The MIT 2008 Information Quality Industry Symposium

Rapid Corporate Growth and Information

Steve Sarsfield, Trillium Software

Agenda

• How Companies Grow
• The Data Components of Company Value
• Effects of Rogue Data Quality Processes
• Sorting Out a Large Company’s Problems
  – Cleansing and Matching
  – Domain Specific Knowledge
  – Platform Unification

MIT Information Quality Industry Symposium, July 16-17, 2008


How Companies Grow

[Diagram, built up over successive slides: the company grows by adding]

• Satellite Office
• Acquisition
• European Office (shown twice)
• Central & South American Partner
• APAC Manufacturing
• APAC Partners
• Australian Customer Support & Sales

[Diagram: the data held across the company]

• Australian Names/Addresses
• ASCII Pan European Sales Data
• Simplified Chinese Supply & Sales Data
• Unicode ERP Data
• ASCII Supply Chain Data
• ASCII US Sales Data
• Central and South American Sales Data

[Diagram: HQ at the center of these data sources]

THE BASICS
• How much did we sell yesterday?
• What’s the sales pipeline?
• What do we have in inventory worldwide?
• Can I trust these results?

OPPORTUNITY LOST
• Supply Chain/Inventory Problems?
• Can we reach our customers effectively?
• Are we paying too much to suppliers?
• Are we making the right business decisions?
• Are users avoiding new systems because of data?

Vision Blocking Factors
• Typos and Duplicates
• Lack of standards
• Competing Information Quality Processes
• Code pages
  – ASCII
  – Unicode
  – EBCDIC
• Platforms
  – SAP
  – Oracle
  – Siebel
  – Tibco
  – SalesForce
  – Etc.
• Operating Sys.
• Language
• Local Nuances
• Data Age & Reliability
• Unknown Data
  – M&A
  – Suppliers
  – Partner
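The code page factor listed above can be made concrete with a small Python sketch. It is an illustration only: the sample value and the helper `try_decode` are assumptions, and `cp037` is used simply because it is an EBCDIC codec that ships with Python. The point is that the same text becomes different bytes under different code pages, and reading those bytes with the wrong code page either fails or silently yields garbage.

```python
# Hypothetical sample value; any text with a non-ASCII character would do.
text = "Düsseldorf"

ebcdic_bytes = text.encode("cp037")  # one EBCDIC code page available in Python
utf8_bytes = text.encode("utf-8")    # a Unicode encoding


def try_decode(raw: bytes, encoding: str) -> str:
    """Decode raw bytes, reporting a failure as text instead of raising."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return f"<not valid {encoding}>"


# Reading UTF-8 bytes as plain ASCII fails on the non-ASCII character.
ascii_view_of_utf8 = try_decode(utf8_bytes, "ascii")

# Reading EBCDIC bytes with the right code page recovers the text.
round_trip = try_decode(ebcdic_bytes, "cp037")

# Reading EBCDIC bytes as Latin-1 does not fail, but the result is wrong.
wrong_guess = try_decode(ebcdic_bytes, "latin-1")
```

The silent failure is the harder one to catch: nothing raises an error, so the damaged value can travel into downstream systems unnoticed.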

How Company Value is Measured

• Number of Customers = Customer Data
• Hard Assets = ERP and Supply Chain Data
• Number of Valued Employees = Knowledge Management Data
• Sales Channels = Supplier and Partner Data

Differences in Data Quality Processes

Ivan Madar
75 Calle del Norte
Sedona, AZ 86336

Certain solutions can't understand this street address: no BLVD, ST, RD, or AVE.

Debra Shaw
203 Old Meadow Drive
Leave at front door
Greensburg, PA 15601

Certain solutions can't handle delivery info intermingled with name and address.

Gary Wright
C/O Allstate Insurance
1466 S Potomac St
Hagerstown, MD 21740

Certain solutions can't handle the C/O.

Marilyn E Vogt
N105W21040 Parkland
Colgate, WI 53017

Certain solutions can't handle a number- and letter-rich address such as this.

[Diagram: the four records above appear twice unchanged, next to this third version]

Debra Shaw
203 Old Meadow Drive
Greensburg, PA 15601

Gary Wright
Allstate Insurance
1466 S Potomac St
Hagerstown, MD 21740

Marilyn E Vogt
N105 W21040 Parkland
Colgate, WI 53017
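To show why these records trip up some processes, here is a minimal Python sketch that sorts the lines of each sample record into labeled parts. The three patterns and the function `split_record` are assumptions made for illustration, not the rules of any particular product; real address handling needs far richer domain knowledge.

```python
import re

# Assumed patterns, for illustration only.
CARE_OF = re.compile(r"^c/o\s+", re.IGNORECASE)
DELIVERY_NOTE = re.compile(r"^(leave|deliver|ring)\b", re.IGNORECASE)
CITY_STATE_ZIP = re.compile(
    r"^(?P<city>.+),\s*(?P<state>[A-Z]{2})\s+(?P<zip>\d{5})$"
)


def split_record(lines):
    """Sort the lines of one free-form record into labeled parts."""
    parts = {"name": lines[0], "care_of": None, "note": None,
             "street": None, "city": None, "state": None, "zip": None}
    for line in lines[1:]:
        last_line = CITY_STATE_ZIP.match(line)
        if last_line:
            parts.update(last_line.groupdict())
        elif CARE_OF.match(line):
            parts["care_of"] = CARE_OF.sub("", line)
        elif DELIVERY_NOTE.match(line):
            parts["note"] = line
        else:
            # Anything unrecognized is treated as the street line, so no
            # street-type keyword (BLVD, ST, RD, AVE) is required.
            parts["street"] = line
    return parts


records = [
    ["Ivan Madar", "75 Calle del Norte", "Sedona, AZ 86336"],
    ["Debra Shaw", "203 Old Meadow Drive", "Leave at front door",
     "Greensburg, PA 15601"],
    ["Gary Wright", "C/O Allstate Insurance", "1466 S Potomac St",
     "Hagerstown, MD 21740"],
    ["Marilyn E Vogt", "N105W21040 Parkland", "Colgate, WI 53017"],
]
parsed = [split_record(r) for r in records]
```

A process that lacks the C/O or delivery-note rule would file those lines as street data, which is one way two processes can disagree about the same record.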

Dirty Data Strategy

Clean the Lake
• Batch cleansing of existing data

Clean the Rivers
• Real-time cleansing of incoming data
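The two halves of this strategy can be sketched in Python as one cleansing rule applied in two ways. Everything here is hypothetical: the rule inside `cleanse`, the function names, and the sample records are assumptions chosen to keep the example short.

```python
def cleanse(record: dict) -> dict:
    """Hypothetical rule: trim whitespace and uppercase the state code."""
    cleaned = {key: value.strip() for key, value in record.items()}
    cleaned["state"] = cleaned["state"].upper()
    return cleaned


def clean_the_lake(stored_records):
    """Batch: run the rule over data that already exists."""
    return [cleanse(record) for record in stored_records]


def clean_the_river(incoming_record, store):
    """Real time: run the same rule on each record as it arrives."""
    store.append(cleanse(incoming_record))


# Existing data is cleansed once, in bulk.
lake = clean_the_lake([{"name": " Debra Shaw ", "state": "pa"}])

# New data is cleansed on the way in, so the lake stays clean.
clean_the_river({"name": "Gary Wright", "state": " md"}, lake)
```

Sharing one `cleanse` function between both paths is a deliberate choice in this sketch: if the batch and real-time paths used different rules, the stored data would drift apart again.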

Starting Point: Analysis

[Diagram: the regional data sources shown earlier]

Pick High Value Targets and Identify Business Value of Improved Information Quality

Cleansing Is Key to Matching

Original Record 1
Name: Margaret Smith
Address: 345 Avenue of the Americas
City: Manhattan
State: NY
Zip: 1012
Country: USA

Original Record 2
Name: Peggy Smith
Address: 345 6th Ave
City: NY
State: NY
Zip: 01012
Country:

Standardized Record 1
Root First Name: Margaret
Last Name: Smith
Address: 345 Ave of the Americas
City: New York
State: NY
Post Code: 10012
Country: USA

Standardized Record 2
Root First Name: Margaret
Last Name: Smith
Address: 345 Ave of the Americas
City: New York
State: NY
Post Code: 10012
Country: USA
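The effect shown in these records can be reproduced with a minimal Python sketch. The lookup tables and the function `standardize` are assumptions for illustration; a real cleansing tool relies on much larger reference data. Post code repair is left out because it needs postal reference data that this sketch does not have.

```python
# Illustrative lookup tables (assumed, not taken from any real product).
NICKNAMES = {"peggy": "Margaret"}
STREET_ALIASES = {
    "345 6th ave": "345 Ave of the Americas",
    "345 avenue of the americas": "345 Ave of the Americas",
}
CITY_ALIASES = {"manhattan": "New York", "ny": "New York"}


def standardize(record):
    """Map name, street, and city variants to one standard form."""
    first, last = record["name"].split()
    return {
        "first": NICKNAMES.get(first.lower(), first),
        "last": last,
        "address": STREET_ALIASES.get(record["address"].lower(),
                                      record["address"]),
        "city": CITY_ALIASES.get(record["city"].lower(), record["city"]),
        "state": record["state"],
    }


rec1 = {"name": "Margaret Smith", "address": "345 Avenue of the Americas",
        "city": "Manhattan", "state": "NY"}
rec2 = {"name": "Peggy Smith", "address": "345 6th Ave",
        "city": "NY", "state": "NY"}

raw_match = rec1 == rec2                              # compared as entered
clean_match = standardize(rec1) == standardize(rec2)  # compared after cleansing
```

As entered, the two records differ in every field except the state, so an exact comparison fails; after standardization they compare equal.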


Cleanse: Domain-Specific Standardization
How to Repair and Make Sense of Legacy Data

Legacy Record
Name 1: Flugtaggen GMBH
Name 2: rhamer strasse 20
Address: dus
City/Town: 40489
Post Code:
Country:

Standardized Record
Business Name: Flugtaggen GMBH
Personal Name:
Street Name: Rhamer
Street Type: Str.
Street Number: 20
City/Town: Düsseldorf
Post Code: 40489
Country: DE

Value Added for CRM
• Fully automate data cleansing
• Apply country intelligence (names, geographic, etc.)
• Standardize critical data elements
• Context-sensitive data interpretation
• Enrich data (geocoding, etc.)

Increased accuracy = better business processes & better matching

Understanding Global Data (Korean)

[Diagram: a Korean address with its parts labeled]

Address text: 오포읍 광주시 양벌리 94-5 대주파크빌2차 101호 207동

Labels: Level 1, Level 2, Level 3, Level 4, Block Number, Sub-Block Num, Apt. Number, House Number, Postal Code

Understanding Product Data

Product Description
• 12oz D. Pepsi 12pack
• 12pk C Orange Slice
• Mtn Dew 2ltr
• Code Red 24pk Bottles
• 2L Mountain Dew Cs
• D.P. Cans 12p

Problems
• Free-Form Text: No Common Format
• Duplicates
• Multiple Meanings: 12 oz vs. 12 Pack
• Unstandardized

Identify Attributes/Categories

Product Description   | Packaging | Product      | Container Size | Container Type
12oz D. Pepsi 12pack  | 12 PACK   | DIET PEPSI   | 12 OZ          | CANS
12pk C Orange Slice   | 12 PACK   | ORANGE SLICE | 12 OZ          | CANS
Mtn Dew 2ltr          | 8 CASE    | MOUNTAIN DEW | 2 L            | BOTTLES
Code Red 24pk Bottles | 24 PACK   | CODE RED     | 20 OZ          | BOTTLES
2L Mountain Dew Cs    | 8 CASE    | MOUNTAIN DEW | 2 L            | BOTTLES
D.P. Cans 12p         | 12 PACK   | DIET PEPSI   | 12 OZ          | CANS
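A first pass at pulling attributes out of these free-form descriptions might look like the Python sketch below. The alias table, the patterns, and the function `parse_description` are assumptions for illustration. The sketch only extracts what is literally in the text, so it returns `None` where the attribute depends on product knowledge that is not written in the description (for example, a container size for "Code Red 24pk Bottles").

```python
import re

# Assumed abbreviation table, for illustration only.
PRODUCT_ALIASES = {
    "d. pepsi": "DIET PEPSI",
    "d.p.": "DIET PEPSI",
    "mtn dew": "MOUNTAIN DEW",
    "mountain dew": "MOUNTAIN DEW",
    "code red": "CODE RED",
    "orange slice": "ORANGE SLICE",
}
SIZE = re.compile(r"(\d+)\s*(oz|ltr|l)\b", re.IGNORECASE)
PACK = re.compile(r"(\d+)\s*(?:pk|pack|p)\b", re.IGNORECASE)


def parse_description(text):
    """Extract product, container size, pack count, and container type."""
    lowered = text.lower()
    product = next(
        (name for alias, name in PRODUCT_ALIASES.items() if alias in lowered),
        None,
    )
    container_size = None
    size = SIZE.search(text)
    if size:
        unit = "OZ" if size.group(2).lower() == "oz" else "L"
        container_size = f"{size.group(1)} {unit}"
    pack = PACK.search(text)
    container_type = None
    if "cans" in lowered:
        container_type = "CANS"
    elif "bottles" in lowered:
        container_type = "BOTTLES"
    return {
        "product": product,
        "container_size": container_size,
        "pack_count": int(pack.group(1)) if pack else None,
        "container_type": container_type,
    }


first = parse_description("Mtn Dew 2ltr")
second = parse_description("2L Mountain Dew Cs")
same_item = (first["product"], first["container_size"]) == (
    second["product"], second["container_size"])
```

Once "Mtn Dew 2ltr" and "2L Mountain Dew Cs" resolve to the same product and size, the duplicate becomes visible, and "12oz" and "12pack" stop being confused because they land in different attributes.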

Key Take-aways

• Big Company = Big DQ Problems
  – Faster Growth = Bigger DQ Problems
• Unified Process for Data Quality is Key
• Domain Coverage is Important
  – Think enterprise solution, not point solution
• First Steps: Data and Metadata Comprehension

Steve Sarsfield, Trillium Software
[email protected]
(978) 436-8768