data quality and bi

Post on 27-Jan-2015

Page 1: Data quality and bi

DW Part 2 The Twins: Data Quality & Business Intelligence

Denise Jeffries

[email protected]@hotmail.com

205.747.3301

Page 2: Data quality and bi

Star Schema (facts and dimensions)

The facts that the data warehouse helps analyze are classified along different dimensions:

• The FACT table houses the main data
• It includes a large amount of aggregated data (e.g. price, units sold)
• DIMENSION tables off the FACT include attributes that describe the FACT
• Star schemas provide simplicity for users

Page 3: Data quality and bi

Star Schema example (Sales db)

Page 4: Data quality and bi

SnowFlake Schema

• Central FACT connected to multiple DIMENSIONS, which are NORMALIZED into related tables
• Snowflaking affects DIMS and never the FACT
• Used in data warehouses and data marts where speed of data retrieval is more important than the efficiency of data manipulation
• Needed for many BI OLAP tools
• Stores less data

Page 5: Data quality and bi

Snowflake Schema example (Sales db)

Page 6: Data quality and bi

Comparison of SQL Star vs SnowFlake

-- Star schema: the fact table joins directly to each dimension
SELECT Brand, Country, SUM(Units_Sold)
FROM Fact.Sales
JOIN Dim.Date ON Date_FK = Date_PK
JOIN Dim.Store ON Store_FK = Store_PK
JOIN Dim.Product ON Product_FK = Product_PK
WHERE [Year] = 2010 AND Product_Category = 'TV'
GROUP BY Brand, Country

-- Snowflake schema: normalized dimensions need additional joins
SELECT B.Brand, G.Country, SUM(F.Units_Sold)
FROM Fact_Sales F (NOLOCK)
INNER JOIN Dim_Date D (NOLOCK) ON F.Date_Id = D.Id
INNER JOIN Dim_Store S (NOLOCK) ON F.Store_Id = S.Id
INNER JOIN Dim_Geography G (NOLOCK) ON S.Geography_Id = G.Id
INNER JOIN Dim_Product P (NOLOCK) ON F.Product_Id = P.Id
INNER JOIN Dim_Product_Category C (NOLOCK) ON P.Product_Category_Id = C.Id
INNER JOIN Dim_Brand B (NOLOCK) ON P.Brand_Id = B.Id
WHERE D.Year = 2010 AND C.Product_Category = 'TV'
GROUP BY B.Brand, G.Country
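
To see that both layouts answer the same question, here is a minimal sketch using Python's built-in sqlite3 module. The table layout, the lowercase names, and the sample rows are illustrative assumptions, not the Sales db from the slides, and the SQL Server NOLOCK hint is left out because SQLite has no equivalent.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Star layout: one denormalized table per dimension (hypothetical names)
CREATE TABLE dim_date (id INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE dim_store_star (id INTEGER PRIMARY KEY, country TEXT);
CREATE TABLE dim_product_star (id INTEGER PRIMARY KEY, brand TEXT, product_category TEXT);
-- Snowflake layout: the same dimensions normalized into related tables
CREATE TABLE dim_geography (id INTEGER PRIMARY KEY, country TEXT);
CREATE TABLE dim_store (id INTEGER PRIMARY KEY, geography_id INTEGER);
CREATE TABLE dim_brand (id INTEGER PRIMARY KEY, brand TEXT);
CREATE TABLE dim_product_category (id INTEGER PRIMARY KEY, product_category TEXT);
CREATE TABLE dim_product (id INTEGER PRIMARY KEY, brand_id INTEGER, product_category_id INTEGER);
-- One fact table serves both layouts because the keys line up
CREATE TABLE fact_sales (date_id INTEGER, store_id INTEGER, product_id INTEGER, units_sold INTEGER);

INSERT INTO dim_date VALUES (1, 2010), (2, 2011);
INSERT INTO dim_geography VALUES (1, 'USA'), (2, 'Canada');
INSERT INTO dim_store VALUES (1, 1), (2, 2);
INSERT INTO dim_store_star VALUES (1, 'USA'), (2, 'Canada');
INSERT INTO dim_brand VALUES (1, 'Acme');
INSERT INTO dim_product_category VALUES (1, 'TV'), (2, 'Radio');
INSERT INTO dim_product VALUES (1, 1, 1), (2, 1, 2);
INSERT INTO dim_product_star VALUES (1, 'Acme', 'TV'), (2, 'Acme', 'Radio');
INSERT INTO fact_sales VALUES (1, 1, 1, 5), (1, 2, 1, 3), (1, 1, 2, 7), (2, 1, 1, 9);
""")

# Star query: three joins, one per dimension.
STAR_SQL = """
SELECT p.brand, s.country, SUM(f.units_sold)
FROM fact_sales f
JOIN dim_date d ON f.date_id = d.id
JOIN dim_store_star s ON f.store_id = s.id
JOIN dim_product_star p ON f.product_id = p.id
WHERE d.year = 2010 AND p.product_category = 'TV'
GROUP BY p.brand, s.country
ORDER BY s.country
"""

# Snowflake query: six joins, because store and product are split further.
SNOWFLAKE_SQL = """
SELECT b.brand, g.country, SUM(f.units_sold)
FROM fact_sales f
JOIN dim_date d ON f.date_id = d.id
JOIN dim_store s ON f.store_id = s.id
JOIN dim_geography g ON s.geography_id = g.id
JOIN dim_product p ON f.product_id = p.id
JOIN dim_product_category c ON p.product_category_id = c.id
JOIN dim_brand b ON p.brand_id = b.id
WHERE d.year = 2010 AND c.product_category = 'TV'
GROUP BY b.brand, g.country
ORDER BY g.country
"""

star_rows = conn.execute(STAR_SQL).fetchall()
snowflake_rows = conn.execute(SNOWFLAKE_SQL).fetchall()
print(star_rows == snowflake_rows)
```

Both queries return the same totals here; the difference is how many joins the user has to write, which is the simplicity argument made for the star schema.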

Page 7: Data quality and bi

Account, Customer & Address Relationships

[Diagram: entities Account, Party, Address and Account Contact, related through Account Party link and Party Address link]

• Account information is loaded from ALL source systems
• The ETL process builds the relationship between Accounts and Customers (Party) based on the relationship file from the CUSTOMER CRM SYSTEM

Page 8: Data quality and bi

EDW Process State

[Diagram: stages labeled Source System, Cleanse / Pre-process (Data cleansing, Data profiling, Sync & Sort), Staging Area, EDW, DM and BI, with Metadata | Data Governance | Data Management shown across the stages. System labels: CPS, MANTAS, CRDB, MKTG, FIN, SALES, IMP, RMOECALS, AFSST, REDFPSBA, AFSV-PR]

Page 9: Data quality and bi

Explosion in innovation

• BI software can now be deployed on the intranet instead of as hard-to-maintain thick client apps
  • Thick client still used for developers
• Web server, application server, database server
  • Allows offloading of processing to the correct tier
• More power for everyone

Page 10: Data quality and bi

Change in Business

• Global economy changed the needs of organizations worldwide
  • Global markets
  • Mergers and acquisitions
  • All increase data needs
• More tech-savvy end users (demand more data, more tools…)
• More information-demanding executives, which facilitates sponsorship of DW

Page 11: Data quality and bi

Single definition of a data element needed for BI

• DW brings in the data from multiple sources and conforms it so that it can be viewed together
• Multiple systems have individual customers/addresses, but the warehouse gives a single view of the customer and all the systems they are in
• Helps move from product-centric systems to customer-centric systems

Page 12: Data quality and bi

Business view of data

DW is only successful if it provides the view of its data that the business needs.

“A data warehouse is a structured extensible environment designed for the analysis of non-volatile data, logically and physically transformed from multiple source applications to align with business structure, updated and maintained for a long time period, expressed in simple business terms, and summarized for quick analysis.”

Vivek R. Gupta, Senior Consultant, [email protected], System Services Corporation, Chicago, Illinois, http://www.system-services.com

Page 13: Data quality and bi

Example of conforming data for business view:

Figure 8. Physical transformation of application data

• Uniform business terms
• Single physical definition of an attribute
• Consistent use of entity attributes
• Default and missing values

[Figure labels: Operational System A, Operational System B, Transformation, Data Warehouse System with Detailed Data and Summarized Data]

Transformation examples:
cust, cust_id, borrower >> customer ID
“1” >> “M”
“2” >> “F”
Missing >> “……..”

http://www.sserve.com/ftp/dwintro.doc
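
The transformations in the figure can be sketched in a few lines of Python. This is only an illustration: the field name gender, the sample records, and the default value UNKNOWN are assumptions (the figure does not name the coded field and elides the actual default).

```python
# Source-specific column names that all mean "customer ID" (from the figure).
CUSTOMER_ID_ALIASES = ("cust", "cust_id", "borrower")
# Code conversion shown in the figure: "1" becomes "M", "2" becomes "F".
GENDER_CODES = {"1": "M", "2": "F"}
# Assumption: the figure elides the real default, so "UNKNOWN" is a stand-in.
DEFAULT_VALUE = "UNKNOWN"


def conform(record):
    """Return a record that uses warehouse names, codes, and defaults."""
    customer_id = next(
        (record[name] for name in CUSTOMER_ID_ALIASES if record.get(name)),
        DEFAULT_VALUE,
    )
    gender = GENDER_CODES.get(record.get("gender"), DEFAULT_VALUE)
    return {"customer_id": customer_id, "gender": gender}


# Hypothetical rows from two operational systems.
system_a = {"cust": "A-100", "gender": "1"}
system_b = {"borrower": "B-200", "gender": None}
conformed = [conform(system_a), conform(system_b)]
print(conformed)
```

Each source keeps its own names and codes; only the warehouse copy is conformed, which is what lets the business view data from both systems together.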

Page 14: Data quality and bi

Business use of DW

• Business should use a data mart created off the data warehouse
• Business users want to use existing tools/methods (replicated queries, Excel, extracts to Access) against the DW and validate the data between the existing systems and the DW
• Over time the LoB gains confidence in the DW and then begins to explore new possibilities of data use and tool use

Page 15: Data quality and bi

EDW Development Project Cycle (New Source to EDW)

[Chart: major tasks and deliverables, approvals, and groups involved for each project phase]

Phases: Initiation, Requirement Specs, System Design, Build, Test, Implement

Major Tasks and Deliverables for the project:
• Initial Scope
• Source Data Analysis
• Initial Estimate
• Final Scope
• Work Plan
• Architecture & Design
• Source/Target Mapping
• Report/Data Requirements
• Data Modeling
• Test, Integration and Deployment Plan
• Transition and Support Plan
• Finalize Source/Target Mapping
• Build ETL and other processes
• Unit test cases and results
• Defect fixes and support
• Functional test cases and results
• Operations Procedure testing and review
• Data validation
• Load / Performance Testing
• Integration Testing (CA-Scheduler)
• UAT and Sign-off
• Initial Data creation/Setup
• Production Migration
• User Training
• Support Documentation and training

Approvals and reviews: Project Sponsor Approval, Management Approval, Peer review, IT Approval, IT / Business Distribution, Peer / Lead review

Performance, Capacity and Guidelines for project

Groups Involved in various Phases of the project: Business Users, IT Planning & Systems, EDW Development Team, Data Analyst Team, Operations & Support, Project Management

Page 16: Data quality and bi

DW - Roadmap

[Diagram labels: Source System, Multiple Source System, DW (Accounts and Customers), Master Data Management, Financial Mart, Risk Mart, Transaction, Customer Analytics, mart1, mart2, mart3, and Management Architecture (Metadata, Data Security, Systems Management)]

Page 17: Data quality and bi

SECTION 3

• What is Data Quality
  • I can’t tell you what’s important, but your users can.
  • Look for the fields that can identify potential problems with the data
• What is Master Data Management (MDM)

Page 18: Data quality and bi

Data Quality

• Data doesn’t stay the same
  • Sometimes it does
• Considerations:
  • What happens to the warehouse when the data changes
  • When needs change

Page 19: Data quality and bi

Roadmap to DQ

• Data profiling
• Establishing metrics/measures
• Design and implement the rules
• Deploy the plan
• Review errors/exceptions
• Monitor the results

Page 20: Data quality and bi

Data Profiling

• What’s in the data
• Analyze the columns in the tables
• Provides metadata
• Allows for good specifications for programmers
• Reduces project risk (as data is now known)
• How many rows, number of distinct values in a column, how many null, data type identification
• Shows the data pattern
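
The column statistics listed above can be sketched as follows. This is a minimal illustration rather than the output of any profiling tool: the function names are hypothetical, the column is assumed to arrive as a list of strings with None for nulls, and the pattern notation (9 for a digit, A for a letter) is one common convention chosen here for the example.

```python
import re
from collections import Counter


def pattern_of(value):
    """Reduce a value to its shape: digits become 9, letters become A."""
    return re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", value))


def profile_column(values):
    """Summarize one column given as a list of strings (None means null)."""
    present = [v for v in values if v is not None]
    patterns = Counter(pattern_of(v) for v in present)
    return {
        "rows": len(values),
        "nulls": len(values) - len(present),
        "distinct": len(set(present)),
        "looks_numeric": bool(present) and all(v.isdigit() for v in present),
        "patterns": dict(patterns),
    }


# Hypothetical zip code column with one null and one malformed value.
zip_profile = profile_column(["35244", "35244", "3524B", None, "48108"])
print(zip_profile)
```

A pattern that appears only once, such as 9999A in a zip code column, is exactly the kind of finding that turns into a specification item before programmers start building.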

Page 21: Data quality and bi

Data Profiling Example

Page 22: Data quality and bi

Data Quality is measured as the degree of superiority, or excellence, of the various data that we use to create information products.

“Reason #1 for the failure of CRM projects: Data is ignored. Enterprise must have a detailed understanding of the quality of their data. How to clean it up, how to keep it clean, where to source it, and what 3rd-party data is required. Action item: Have a data quality strategy. Devote ½ of the total timeline of the CRM project to data elements.” - Gartner

Page 23: Data quality and bi

Data Quality Tools (Gartner Magic Quadrant)

Page 24: Data quality and bi

Dimensions of Quality

Informatica.com

Page 25: Data quality and bi

Data Quality Measures

• Definition
• Accuracy
• Completeness
• Coverage
• Timeliness
• Validity

Page 26: Data quality and bi

Definition

Conformance: The degree to which data values are consistent with their agreed upon definitions.

A detailed definition must first exist before this can be measured.

Information quality begins with a comprehensive understanding of the data inventory. The information about the data is as important as the data itself.

A Data Dictionary must exist! An organized, authoritative collection of attributes is equivalent to the old “Card Catalog” in a library, or the “Parts and List Description” section of an inventory system. It must contain all the known usage rules and an acceptable list of values. All known caveats and anomalies must be described.

Page 27: Data quality and bi

Accuracy

The degree to which a piece of data is correct and believable.

• The value can be compared to the original source for correctness, but it can still be unbelievable.
• Conformed values can be compared to lists of reference values.
• Zip code 35244 is correct and believable.
• Zip code 3524B is incorrect and unbelievable.
• Zip code 35290 is incorrect but believable (it looks right, but does not exist).
• AL is a correct and believable state code (compared to the list of valid state codes)
• A1 is an incorrect and unbelievable state code (compared to the list of valid state codes)
• AA is an incorrect but believable state code (compared to the list of valid state codes)
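
One way to read the examples above: "correct" means the value appears in a reference list, and "believable" means it at least has the right shape. The sketch below follows that reading. The reference sets are tiny stand-ins and the shape rules (five digits, two capital letters) are assumptions made for the example.

```python
import re

# Assumption: tiny stand-ins for real reference lists of zip and state codes.
KNOWN_ZIPS = {"35244", "48108"}
KNOWN_STATES = {"AL", "MI"}


def assess(value, reference, shape):
    """Return (correct, believable) for one value."""
    correct = value in reference
    believable = re.fullmatch(shape, value) is not None
    return correct, believable


def assess_zip(value):
    return assess(value, KNOWN_ZIPS, r"\d{5}")


def assess_state(value):
    return assess(value, KNOWN_STATES, r"[A-Z]{2}")


for zip_code in ("35244", "3524B", "35290"):
    print(zip_code, assess_zip(zip_code))
for state in ("AL", "A1", "AA"):
    print(state, assess_state(state))
```

The "incorrect but believable" cases are the ones a format check alone will never catch, which is why the reference list matters.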

Page 28: Data quality and bi

Completeness

The degree to which all information expected is received. This is measured in two ways:

• Do we have all the records that were sent to us?
  • Counts from the provider can be compared against counts of data received.
• Did the provider send us all the records that they have or just some of them?
  • This is difficult to measure without auditing and trending the source.
  • How would we know that the provider had a ‘glitch’ in their system and records were missing from our feed?

Page 29: Data quality and bi

Measures of Completeness

The following questions can be answered for counts:

• How many records per batch by provider?
• How do this batch’s counts compare to the previous month’s average?
• How do this batch’s counts compare to the same time period last year?
• How do this batch’s counts compare to a 12 month average?
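
The count comparisons above can be sketched as a single check. The function name, the 10% tolerance, and the monthly counts are all made up for illustration; a real threshold would come from trending the source.

```python
from statistics import mean


def count_checks(received, provider_count, previous_counts, tolerance=0.10):
    """Compare one batch's record count with the provider's count and history."""
    baseline = mean(previous_counts)
    deviation = (received - baseline) / baseline
    return {
        "matches_provider": received == provider_count,
        "deviation_from_average": round(deviation, 3),
        "within_tolerance": abs(deviation) <= tolerance,
    }


# Twelve prior monthly counts for one provider (made-up numbers).
history = [1000, 1020, 980, 1010, 990, 1000, 1005, 995, 1015, 985, 1000, 1000]
result = count_checks(received=700, provider_count=700, previous_counts=history)
print(result)
```

In this example the batch matches the provider's own count yet sits 30% below the 12 month average, which is how a provider-side glitch would show up even though every record sent was received.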

Page 30: Data quality and bi

Coverage

The degree to which all fields are populated with data.

• Columns of data can be measured for % of missing values and compared to expected % missing.
• e.g. Sale Type Code is expected to be populated 100% by all sources for Sales documents.
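
A minimal sketch of that measurement, assuming a column arrives as a Python list and that None or a blank string counts as missing (both are assumptions for the example, as are the sample codes):

```python
def coverage_gap(values, expected_missing_pct=0.0):
    """Return (actual % missing, True if it exceeds the expected % missing)."""
    missing = sum(1 for v in values if v is None or str(v).strip() == "")
    actual_pct = 100.0 * missing / len(values)
    return actual_pct, actual_pct > expected_missing_pct


# Sale Type Code should be 100% populated, so 0% missing is expected.
sale_type_codes = ["R", "W", None, "R", "", "W", "R", "R"]
actual, exceeded = coverage_gap(sale_type_codes)
print(actual, exceeded)
```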

Page 31: Data quality and bi

Timeliness

The degree to which provider files are received, processed and made available for assembly into data marts.

• Expected receipt times are compared to actual receipt times.
• Late or missing files are flagged and reported on.
• Proactive alerts trigger communication with the provider contact.
• Proactive communication can alert the assembly processes.
• Excessive lag times can be reported to providers in order to request delivery sooner.
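
Comparing expected and actual receipt times might look like the sketch below. The 30 minute grace period, the provider names, and the timestamps are hypothetical, and the extra "pending" status (not yet received, but not yet overdue) is an addition made for the example.

```python
from datetime import datetime, timedelta


def receipt_status(expected, actual, now, grace=timedelta(minutes=30)):
    """Classify a provider file as on_time, late, missing, or pending."""
    if actual is None:
        return "missing" if now > expected + grace else "pending"
    return "late" if actual > expected + grace else "on_time"


expected = datetime(2015, 1, 27, 6, 0)
now = datetime(2015, 1, 27, 9, 0)
statuses = {
    "provider_a": receipt_status(expected, datetime(2015, 1, 27, 5, 50), now),
    "provider_b": receipt_status(expected, datetime(2015, 1, 27, 8, 15), now),
    "provider_c": receipt_status(expected, None, now),
}
# Files to flag and report on, and to raise with the provider contact.
flagged = sorted(name for name, s in statuses.items() if s != "on_time")
print(statuses, flagged)
```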

Page 32: Data quality and bi

Validity

The degree to which the relationships between different data are valid.

• Zip code 48108 is accurate.
• State code AL is accurate.
• Zip code 48108 is invalid for the state of AL.
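
Validity is a cross-field check, so it needs a reference that relates the two fields. A minimal sketch, assuming a small zip-to-state lookup stands in for a full reference table:

```python
# Assumption: a tiny stand-in for a real zip-to-state reference table.
ZIP_TO_STATE = {"35244": "AL", "48108": "MI"}


def is_valid_pair(zip_code, state_code):
    """True when the zip code belongs to the given state."""
    return ZIP_TO_STATE.get(zip_code) == state_code


print(is_valid_pair("48108", "AL"))
print(is_valid_pair("48108", "MI"))
```

Both values pass their own accuracy checks; only the combination fails.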

Page 33: Data quality and bi

Data Quality Measures

How do you know if your data is of high quality?

• Agree upon the measures that are important to the organization and consistently report them out.
• Use the data measures to communicate and inform.

Page 34: Data quality and bi

Measurement

Informatica.com

Page 35: Data quality and bi

Exercise: Changing the Data (1 of 2)

• So, you need to add a new source
• Or, you need to receive additional data from an existing source
• Could be the data quality is an issue
• Could be that the business rules weren’t defined adequately

Page 36: Data quality and bi

Brainstorming Group Exercise (2 of 2)

The data changed due to DQ measures. What do we have to do in the DW?

• What has to change
• Estimate the change
• Implement the change
• How do we make sure it doesn’t happen again?
• What DQ measure can help?

Page 37: Data quality and bi

MDM: Master Data Management

• The newest ‘buzz word’
• The recent emphasis on regulatory compliance, SOA, and mergers and acquisitions has made the creating and maintaining of accurate and complete master data a business imperative.

Page 38: Data quality and bi

MDM

The pain that organizations are experiencing around consistent reporting, regulatory compliance, strong interest in Service-Oriented Architecture (SOA), and Software as a Service (SaaS) has prompted a great deal of interest in Master Data Management (MDM).

Page 39: Data quality and bi

What Is Master Data Management?

Master data management is the technology, tools, and processes an organization needs to create and maintain a consistent and accurate inventory of its master data.

Page 40: Data quality and bi

5 Types of Data for MDM

Unstructured—This is data found in e-mail, white papers like this, magazine articles, corporate intranet portals, product specifications, marketing collateral, and PDF files.

Transactional—This is data related to sales, deliveries, invoices, trouble tickets, claims, and other monetary and non-monetary interactions.

Metadata—This is data about other data and may reside in a formal repository or in various other forms such as XML documents, report definitions, column descriptions in a database, log files, connections, and configuration files.

Hierarchical—Hierarchical data stores the relationships between other data. It may be stored as part of an accounting system or separately as descriptions of real-world relationships, such as company organizational structures or product lines. Hierarchical data is sometimes considered a super MDM domain, because it is critical to understanding and sometimes discovering the relationships between master data.

Page 41: Data quality and bi

5 types of data cont’d

Master data are the critical nouns of a business and fall generally into four groupings: people, things, places, and concepts.

Further categorizations within those groupings are called subject areas, domain areas, or entity types.

• Within people, there are customer, employee, and salesperson.
• Within things, there are product, part, store, and asset.
• Within concepts, there are things like contract, warranty, and licenses.
• Within places, there are office locations and geographic divisions.

Some of these domain areas may be further divided. Customer may be further segmented, based on incentives and history. A company may have normal customers, as well as premier and executive customers. Product may be further segmented by sector and industry. (4)

Page 42: Data quality and bi

Exercise:

• What processes need to be put in place for MDM?
• Who needs to be involved?
• Who owns it?

Page 43: Data quality and bi

SECTION 4

• BI Tools
• BICC
• Jobs
• Certifications

Page 44: Data quality and bi

SECTION 4

• What is business intelligence
• What are BI tools
• What is a business intelligence competency center (BICC)
• What jobs are available
• Certifications

Page 45: Data quality and bi

What is business intelligence

• Turning raw data into information.
• Business intelligence (BI) is a broad category of applications and technologies for gathering, storing, analyzing, and providing access to data to help enterprise users make better business decisions. BI applications include the activities of decision support systems, query and reporting, online analytical processing (OLAP), statistical analysis, forecasting, and data mining. (1)

Page 46: Data quality and bi

What is BI

It is about making better business decisions easier and quicker.

Data mining is a BI technique used to extract valid, useful and previously unknown information from a company’s data sources.

Page 47: Data quality and bi

BI solutions examples by industry

Retail
• Forecasting
• Ordering & supply
• Marketing
• Merchandising
• Distribution
• Transportation
• Inventory planning
• Space management
• …..

Insurance
• Claims & premium analysis
• Customer Analytics
• Risk analysis

Banking
• Customer profitability
• Credit Management
• Branch sales

Page 48: Data quality and bi

BI term coined Sept 1996 by Gartner Group in a report

“By 2000, Information Democracy will emerge in forward-thinking enterprises, with Business Intelligence information and applications available broadly to employees, consultants, customers, suppliers, and the public. The key to thriving in a competitive marketplace is staying ahead of the competition. Making sound business decisions based on accurate and current information takes more than intuition. Data analysis, reporting, and query tools can help business users wade through a sea of data to synthesize valuable information from it - today these tools collectively fall into a category called Business Intelligence.” (1)

Page 49: Data quality and bi

Magic Quadrant for BI (Gartner)

Page 50: Data quality and bi

BI

BI is a term categorizing a variety of software applications that are used to analyze a business’ raw data.

It is also a discipline categorizing activities that include data quality, data mining, OLAP (online analytical processing), querying and reporting. (2)

Page 51: Data quality and bi

What kinds of companies use BI

All kinds: restaurants, sports franchises, retailers… any company.

Examples include: New England Patriots, Walmart, Harrah’s, Amazon, Yahoo, Capital One…..

Page 52: Data quality and bi

When are you doing BI?

When looking at your market share or profitability you are doing BI.

When looking for the best area to increase your sales you are doing BI.

Anytime you analyze data and turn it into information you are doing BI.

Page 53: Data quality and bi

How do you know if you are really doing BI?

• Efforts around changing individual and team work practices arise, from the individual and from the teams
• New jobs are posted talking about analyzing data and delivering reports
• Dashboards appear
• The CEO and CIO start talking about it

Page 54: Data quality and bi

BI Tools & What they do

Tools:
• Cognos
• SAS
• Oracle (Siebel & Hyperion)
• MicroStrategy
• Microsoft
• Information Builders
• QlikView
• ….. etc

What they do:
• Querying & Reporting
• OLAP and its sisters: MOLAP, ROLAP, HOLAP
• Data Mining

Page 55: Data quality and bi

BICC

A Business Intelligence Competency Center (BICC) is a cross-functional organizational team that has defined tasks, roles, responsibilities and processes for supporting and promoting the effective use of Business Intelligence (BI) across an organization.

As early as 2001, Gartner, an information technology research and advisory company, started advocating that companies need a BICC to develop and focus resources to be successful using business intelligence. [1] Since then, the BICC concept has been further refined through practical implementations in organizations that have implemented BI and analytical software.

Taken directly from Wikipedia

Page 56: Data quality and bi

BICC

In practice, the term "BICC" is not well integrated into the nomenclature of business or public sector organizations and there are a large degree of variances in the organizational design for BICCs. Nevertheless, the popularity of the BICC concept has caused the creation of units that focus on ensuring the use of the information for decision-making from BI software and increasing the return on investment (ROI) of BI. [2]

A BICC coordinates the activities and resources to ensure that a fact-based approach to decision making is systematically implemented throughout an organization. It has responsibility for the governance structure for BI and analytical programs, projects, practices, software, and architecture. It is responsible for building the plans, priorities, infrastructure, and competencies that the organization needs to take forward-looking strategic decisions by using the BI and analytical software capabilities.

A BICC’s influence transcends that of a typical business unit, playing a crucial central role in the organizational change and strategic process. Accordingly, the BICC’s purpose is to empower the entire organization to coordinate BI from all units. Through centralization, it "…ensures that information and best practices are communicated and shared through the entire organization so that everyone can benefit from successes and lessons learned."[3]

The BICC also plays an important organizational role facilitating interaction among the various cultures and units within the organization. Knowledge transfer, enhancement of analytic skills, coaching and training are central to the mandate of the BICC. A BICC should be pivotal in ensuring a high degree of information consumption and a ROI for BI.

Taken directly from Wikipedia

Page 57: Data quality and bi

Jobs in Business Intelligence

• Business Analyst
• BI Programmer
• BI Architect
• BI Support Engineer
• BI Manager

1000+ jobs on washingtonpost.com

5,357 jobs on indeed.com

Page 58: Data quality and bi

References

Data Management and Integration Topic, Gartner, http://www.gartner.com/it/products/research/asset_137953_2395.jsp

Articles: Key Issues for Implementing an Enterprise wide Data Quality Improvement Project, 2008, Key Issues for Enterprise Information Management Initiatives, 2008, Key Issues for Establishing Information Governance Policies, Processes and Organization, 2008

Data Quality Management, The Most Critical Initiative You Can Implement, J. G. Geiger, http://www2.sas.com/proceedings/sugi29/098-29.pdf

Information Management, How to Measure and Monitor the Quality of Master Data, http://www.information-management.com/issues/2007_58/master_data_management_mdm_quality-10015358-1.html?ET=informationmgmt:e963:2046487a:&st=email

Data Management Assn of Michigan Bits & Bytes, Critical Data Quality Controls, D Jeffries, Fall 2006 http://dama-michigan.org/2%20Newsletter.pdf

(1) http://searchdatamanagement.techtarget.com/definition/business-intelligence

(2) http://www.cio.com/article/40296/Business_Intelligence_Definition_and_Solutions

(3) http://www.redbooks.ibm.com/redbooks/pdfs/sg245747.pdf

(4) http://msdn.microsoft.com/en-us/library/bb190163.aspx

Wikipedia: BICC

Strange, K. H., Hostmann, B. (22 July 2003), BI Competency Center Is Core to BI Success, Gartner Research

Miller, G., Queisser, T. (2008), The Modern BI Organization, Heidelberg, MaxMetrics GmbH

Miller, G., Bräutigam, B., & Gerlach, S. (2006), BICC: A Team Approach Competitive Advantage, Hoboken: Wiley