data quality and bi
1
DW Part 2
The Twins: Data Quality & Business Intelligence
Denise Jeffries
[email protected]@hotmail.com
205.747.3301
2
Star Schema (facts and dimensions)
The facts that the data warehouse helps analyze are classified along different dimensions:
The FACT table houses the main data
Includes a large amount of aggregated data (e.g. price, units sold)
DIMENSION tables off the FACT include attributes that describe the FACT
Star schemas provide simplicity for users
3
Star Schema example (Sales db)
4
SnowFlake Schema
Central FACT connected to multiple DIMENSIONS which are NORMALIZED into related tables
Snowflaking affects DIMS and never FACT
Used in data warehouses and data marts when speed is more important than efficiency/ease of data selection
Needed for many BI OLAP tools
Stores less data
5
Snowflake Schema example (Sales db)
6
Comparison of SQL Star vs SnowFlake
-- Star schema
SELECT Brand, Country, SUM(Units_Sold)
FROM Fact.Sales
JOIN Dim.Date ON Date_FK = Date_PK
JOIN Dim.Store ON Store_FK = Store_PK
JOIN Dim.Product ON Product_FK = Product_PK
WHERE [Year] = 2010 AND Product_Category = 'TV'
GROUP BY Brand, Country

-- Snowflake schema
SELECT B.Brand, G.Country, SUM(F.Units_Sold)
FROM Fact_Sales F (NOLOCK)
INNER JOIN Dim_Date D (NOLOCK) ON F.Date_Id = D.Id
INNER JOIN Dim_Store S (NOLOCK) ON F.Store_Id = S.Id
INNER JOIN Dim_Geography G (NOLOCK) ON S.Geography_Id = G.Id
INNER JOIN Dim_Product P (NOLOCK) ON F.Product_Id = P.Id
INNER JOIN Dim_Product_Category C (NOLOCK) ON P.Product_Category_Id = C.Id
INNER JOIN Dim_Brand B (NOLOCK) ON P.Brand_Id = B.Id
WHERE D.Year = 2010 AND C.Product_Category = 'tv'
GROUP BY B.Brand, G.Country
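To see that the two query shapes answer the same question, here is a minimal sketch that runs both against an in-memory SQLite database. It is an illustration only: the sample rows, the brand names, and the `Star_Dim_Store` / `Star_Dim_Product` tables are invented for the demo, and the `(NOLOCK)` hints are dropped because they are SQL Server specific.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Snowflake schema: dimensions normalized into related tables.
CREATE TABLE Dim_Date (Id INTEGER PRIMARY KEY, Year INTEGER);
CREATE TABLE Dim_Geography (Id INTEGER PRIMARY KEY, Country TEXT);
CREATE TABLE Dim_Store (Id INTEGER PRIMARY KEY, Geography_Id INTEGER);
CREATE TABLE Dim_Brand (Id INTEGER PRIMARY KEY, Brand TEXT);
CREATE TABLE Dim_Product_Category (Id INTEGER PRIMARY KEY, Product_Category TEXT);
CREATE TABLE Dim_Product (Id INTEGER PRIMARY KEY, Brand_Id INTEGER,
                          Product_Category_Id INTEGER);
CREATE TABLE Fact_Sales (Date_Id INTEGER, Store_Id INTEGER,
                         Product_Id INTEGER, Units_Sold INTEGER);

-- Made-up sample data.
INSERT INTO Dim_Date VALUES (1, 2010), (2, 2011);
INSERT INTO Dim_Geography VALUES (1, 'USA'), (2, 'Canada');
INSERT INTO Dim_Store VALUES (1, 1), (2, 2);
INSERT INTO Dim_Brand VALUES (1, 'Acme'), (2, 'Bolt');
INSERT INTO Dim_Product_Category VALUES (1, 'tv'), (2, 'radio');
INSERT INTO Dim_Product VALUES (1, 1, 1), (2, 2, 1), (3, 1, 2);
INSERT INTO Fact_Sales VALUES
  (1, 1, 1, 10), (1, 1, 1, 5), (1, 2, 2, 7), (2, 1, 1, 99), (1, 1, 3, 42);

-- Star schema: the same dimensions flattened into one table each.
CREATE TABLE Star_Dim_Store AS
  SELECT S.Id, G.Country
  FROM Dim_Store S JOIN Dim_Geography G ON S.Geography_Id = G.Id;
CREATE TABLE Star_Dim_Product AS
  SELECT P.Id, B.Brand, C.Product_Category
  FROM Dim_Product P
  JOIN Dim_Brand B ON P.Brand_Id = B.Id
  JOIN Dim_Product_Category C ON P.Product_Category_Id = C.Id;
""")

star_sql = """
SELECT P.Brand, S.Country, SUM(F.Units_Sold)
FROM Fact_Sales F
JOIN Dim_Date D ON F.Date_Id = D.Id
JOIN Star_Dim_Store S ON F.Store_Id = S.Id
JOIN Star_Dim_Product P ON F.Product_Id = P.Id
WHERE D.Year = 2010 AND P.Product_Category = 'tv'
GROUP BY P.Brand, S.Country
ORDER BY P.Brand, S.Country
"""

snowflake_sql = """
SELECT B.Brand, G.Country, SUM(F.Units_Sold)
FROM Fact_Sales F
JOIN Dim_Date D ON F.Date_Id = D.Id
JOIN Dim_Store S ON F.Store_Id = S.Id
JOIN Dim_Geography G ON S.Geography_Id = G.Id
JOIN Dim_Product P ON F.Product_Id = P.Id
JOIN Dim_Product_Category C ON P.Product_Category_Id = C.Id
JOIN Dim_Brand B ON P.Brand_Id = B.Id
WHERE D.Year = 2010 AND C.Product_Category = 'tv'
GROUP BY B.Brand, G.Country
ORDER BY B.Brand, G.Country
"""

star_rows = conn.execute(star_sql).fetchall()
snowflake_rows = conn.execute(snowflake_sql).fetchall()
print(star_rows == snowflake_rows)  # both shapes return the same totals
```

The star query needs three joins and the snowflake query needs six for the same answer, which is the trade-off the two slides describe.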
7
Account, Customer & Address Relationships
Account Contact
Party Address link
Account Party link
Address
Account
Party
Account Information loaded from ALL Source Systems
ETL process builds the relationship between Accounts and Customers (Party) based on the relationship file from CUSTOMER CRM SYSTEM
8
EDW Process State
[Diagram. Stages shown: Source System, Cleanse / Pre-process (data cleansing, data profiling, Sync & Sort), Staging Area, EDW, DM, BI. Other labels: CPS, MANTAS, CRDB, MKTG, FIN, SALES. Supporting layer: Metadata | Data Governance | Data Management.]
9
Explosion in innovation
BI software can now be deployed on the intranet instead of as hard-to-maintain thick client apps
Thick client still used for developers
Web server, application server, database server
Allows offloading of processing to the correct tier
More power for everyone
10
Change in Business
Global economy changed needs of organizations worldwide
Global markets
Mergers and Acquisitions
All increase data needs
More tech savvy end users (demand more data, more tools…)
More information-demanding executives, which facilitates sponsorship of DW
11
Single definition of a data element needed for BI
DW brings in the data from multiple sources and conforms it so that it can be viewed together
Multiple systems have individual customers/addresses, but the warehouse gives a single view of the customer and all the systems they are in
Helping move from product centric systems to customer centric systems
12
Business view of data
DW is only successful if it provides the view of its data that the business needs
"A data warehouse is a structured extensible environment designed for the analysis of non-volatile data, logically and physically transformed from multiple source applications to align with business structure, updated and maintained for a long time period, expressed in simple business terms, and summarized for quick analysis."
Vivek R. Gupta, Senior Consultant, [email protected], System Services Corporation, Chicago, Illinois, http://www.system-services.com
13
Example of conforming data for business view:
Figure 8. Physical transformation of application data
[Diagram: Operational System A and Operational System B feed the Data Warehouse System, which holds Detailed Data and Summarized Data.]
•Uniform business terms
•Single physical definition of an attribute
•Consistent use of entity attributes
•Default and missing values
Transformation examples:
cust, cust_id, borrower >> customer ID
“1” >> “M”
“2” >> “F”
Missing >> “……..”
Source: http://www.sserve.com/ftp/dwintro.doc
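The transformations in the figure can be sketched in a few lines of Python. This is only an illustration under assumptions: the `gender` column name is a guess (the slide shows the "1"/"2" to "M"/"F" mapping without naming the attribute), and the `MISSING` marker stands in for whatever default value the warehouse standard prescribes.

```python
# Source-specific column names mapped to one business term (from the slide).
COLUMN_ALIASES = {"cust": "customer_id", "cust_id": "customer_id", "borrower": "customer_id"}

# Code conversion shown on the slide; the column it applies to is assumed.
GENDER_CODES = {"1": "M", "2": "F"}

# Placeholder default for missing values (assumption, not a standard).
MISSING = "........"


def conform(record):
    """Return a copy of one source record using warehouse names and codes."""
    conformed = {}
    for name, value in record.items():
        name = COLUMN_ALIASES.get(name, name)
        if value is None or str(value).strip() == "":
            value = MISSING
        elif name == "gender":
            value = GENDER_CODES.get(str(value), value)
        conformed[name] = value
    return conformed


system_a_row = {"cust": 42, "gender": "1"}
system_b_row = {"borrower": 7, "gender": ""}
print(conform(system_a_row))
print(conform(system_b_row))
```

Both source rows come out with the same `customer_id` attribute, which is what lets the warehouse present them side by side.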
14
Business use of DW
Business should use a data mart created off the data warehouse
Business users want to use existing tools/methods (replicate queries, Excel, extract to Access) against the DW and validate the data between the existing systems and the DW
Over time the LoB gains confidence in the DW and then begins to explore new possibilities of data use and tool use
15
EDW Development Project Cycle (New Source to EDW)
Phases: Initiation, Requirement Specs, System Design, Build, Test, Implement

Major Tasks and Deliverables for the project:
Initial Scope
Source Data Analysis
Initial Estimate
Final Scope
Work Plan
Architecture & Design
Source/Target Mapping
Report/Data Requirements
Data Modeling
Test, Integration and Deployment Plan
Transition and Support Plan
Finalize Source/Target Mapping
Build ETL and other processes
Unit test cases and results
Defect fixes and support
Functional test cases and results
Operations Procedure testing and review
Data validation
Load / Performance Testing
Integration Testing (CA-Scheduler)
UAT and Sign-off
Initial Data creation/Setup
Production Migration
User Training
Support Documentation and training

Approvals and reviews:
Project Sponsor Approval
Management Approval
Peer review
IT Approval
IT / Business Distribution
Peer / Lead review
Performance, Capacity and Guidelines for project

Groups Involved in various Phases of the project:
Business Users
IT Planning & Systems
EDW Development Team
Data Analyst Team
Operations & Support
Project Management
16
DW - Roadmap
DW(Accounts and Customers)
Multiple Source System Financial Mart
Master Data Management
mart1
Risk Mart
mart2
Transaction
Customer Analytics
Source System
mart3
Management Architecture (Metadata, Data Security, Systems Management)
17
SECTION 3
What is Data Quality
I can’t tell you what’s important, but your users can.
Look for the fields that can identify potential problems with the data
What is Master Data Management (MDM)
18
Data Quality
Data doesn’t stay the same
Sometimes it does
Considerations:
What happens to the warehouse when the data changes
When needs change
19
Roadmap to DQ
Data profiling
Establishing metrics/measures
Design and implement the rules
Deploy the plan
Review errors/exceptions
Monitor the results
20
Data Profiling
What’s in the data
Analyze the columns in the tables
Provides metadata
Allows for good specifications for programmers
Reduces project risk (as data is now known)
How many rows, number of distinct values in a column, how many null, data type identification
Shows the data pattern
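A column profile of the kind listed above can be approximated with the Python standard library. This is a minimal sketch, not a replacement for a profiling tool: `profile_column` and `pattern_of` are hypothetical helper names, and treating empty strings as nulls is an assumption.

```python
import re
from collections import Counter


def pattern_of(value):
    """Reduce a value to its shape: digits become 9, letters become A."""
    text = re.sub(r"[0-9]", "9", str(value))
    return re.sub(r"[A-Za-z]", "A", text)


def profile_column(values):
    """Row count, null count, distinct count, data types and data patterns."""
    non_null = [v for v in values if v is not None and str(v).strip() != ""]
    return {
        "rows": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "types": sorted({type(v).__name__ for v in non_null}),
        "patterns": Counter(pattern_of(v) for v in non_null).most_common(),
    }


zip_codes = ["35244", "35244", "3524B", None, "48108", ""]
report = profile_column(zip_codes)
for measure, result in report.items():
    print(measure, result)
```

The pattern counts are what expose the odd value: three values have the shape `99999` and one has `9999A`, so the programmer's specification can call that case out before the load is built.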
21
Data Profiling Example
22
Data Quality is measured as the degree of superiority, or excellence, of the various data that we use to create information products.
“Reason #1 for the failure of CRM projects: Data is ignored. Enterprise must have a detailed understanding of the quality of their data. How to clean it up, how to keep it clean, where to source it, and what 3rd-party data is required. Action item: Have a data quality strategy. Devote ½ of the total timeline of the CRM project to data elements.” - Gartner
23
Data Quality Tools (Gartner Magic Quadrant)
24
Dimensions of Quality
Informatica.com
25
Data Quality Measures
Definition
Accuracy
Completeness
Coverage
Timeliness
Validity
26
Definition
Conformance: The degree to which data values are consistent with their agreed upon definitions.
A detailed definition must first exist before this can be measured.
Information quality begins with a comprehensive understanding of the data inventory. The information about the data is as important as the data itself.
A Data Dictionary must exist! An organized, authoritative collection of attributes is equivalent to the old “Card Catalog” in a library, or the “Parts and List Description” section of an inventory system. It must contain all the known usage rules and an acceptable list of values. All known caveats and anomalies must be described.
27
Accuracy
The degree to which a piece of data is correct and believable.
The value can be compared to the original source for correctness, but it can still be unbelievable.
Conformed values can be compared to lists of reference values.
Zip code 35244 is correct and believable.
Zip code 3524B is incorrect and unbelievable.
Zip code 35290 is incorrect but believable (it looks right, but does not exist).
AL is a correct and believable state code (compared to the list of valid state codes)
A1 is an incorrect and unbelievable state code (compared to the list of valid state codes)
AA is an incorrect but believable state code (compared to the list of valid state codes)
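The three outcomes above can be reproduced with a format check plus a reference-list lookup. The sketch below is illustrative only: `VALID_ZIPS` and `VALID_STATES` are tiny stand-ins for the complete reference lists a real check would load, and `classify` is a hypothetical helper name.

```python
import re

# Tiny illustrative reference lists (assumption); real lists would be complete.
VALID_ZIPS = {"35244", "48108"}
VALID_STATES = {"AL", "MI"}

ZIP_PATTERN = r"[0-9]{5}"    # looks like a zip code
STATE_PATTERN = r"[A-Z]{2}"  # looks like a state code


def classify(value, pattern, reference):
    """Correct if on the reference list; otherwise believable only if well formed."""
    if value in reference:
        return "correct and believable"
    if re.fullmatch(pattern, value):
        return "incorrect but believable"
    return "incorrect and unbelievable"


for zip_code in ["35244", "3524B", "35290"]:
    print(zip_code, classify(zip_code, ZIP_PATTERN, VALID_ZIPS))
for state in ["AL", "A1", "AA"]:
    print(state, classify(state, STATE_PATTERN, VALID_STATES))
```

Only the "incorrect but believable" values need the reference list to be caught, which is why the list of valid values matters more than the format rule.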
28
Completeness
The degree to which all information expected is received. This is measured in two ways:
Do we have all the records that were sent to us?
Counts from the provider can be compared against counts of data received.
Did the provider send us all the records that they have or just some of them?
This is difficult to measure without auditing and trending the source.
How would we know that the provider had a ‘glitch’ in their system and records were missing from our feed?
29
Measures of Completeness
The following questions can be answered for counts:
How many records per batch by provider?
How do this batch’s counts compare to the previous month’s average?
How do this batch’s counts compare to the same time period last year?
How do this batch’s counts compare to a 12 month average?
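These count comparisons are simple to automate once batch counts are kept as history. The following is a rough sketch under assumptions: the function names, the sample counts, and the 10% tolerance are all invented for illustration and would be agreed with the business in practice.

```python
from statistics import mean


def count_deviation(current, history):
    """Percent difference between the current count and the mean of history."""
    baseline = mean(history)
    return (current - baseline) / baseline * 100


def check_batch(current, last_month, same_period_last_year, last_12_months,
                tolerance_pct=10):
    """Return the comparisons whose deviation exceeds the tolerance."""
    comparisons = {
        "previous month average": count_deviation(current, last_month),
        "same period last year": count_deviation(current, [same_period_last_year]),
        "12 month average": count_deviation(current, last_12_months),
    }
    return {name: round(pct, 1) for name, pct in comparisons.items()
            if abs(pct) > tolerance_pct}


# Hypothetical counts for one provider.
exceptions = check_batch(
    current=800,
    last_month=[1000, 1020, 980],
    same_period_last_year=790,
    last_12_months=[900] * 12,
)
print(exceptions)
```

A drop that looks normal against last year but abnormal against recent months is exactly the kind of provider 'glitch' that trending is meant to surface.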
30
Coverage
The degree to which all fields are populated with data.
Columns of data can be measured for % of missing values and compared to expected % missing.
e.g. Sale Type Code is expected to be populated 100% by all sources for Sales documents.
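Measuring coverage amounts to computing the percentage of missing values per column and comparing it with the expected percentage. A minimal sketch follows, assuming rows arrive as dictionaries; the column names, the sample rows, and the 80% expectation for `promo_code` are made up for the example.

```python
def percent_missing(rows, column):
    """Percentage of rows where the column is absent, None or empty."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(column) in (None, ""))
    return 100.0 * missing / len(rows)


def coverage_exceptions(rows, expected_missing_pct):
    """Return columns whose missing rate exceeds the expected rate."""
    exceptions = {}
    for column, expected in expected_missing_pct.items():
        actual = percent_missing(rows, column)
        if actual > expected:
            exceptions[column] = actual
    return exceptions


# Hypothetical Sales documents.
sales_docs = [
    {"sale_type_code": "R", "promo_code": None},
    {"sale_type_code": "", "promo_code": None},
    {"sale_type_code": "W", "promo_code": "X1"},
    {"sale_type_code": "R", "promo_code": None},
]
# Sale Type Code must always be populated; promo codes are expected to be sparse.
expected = {"sale_type_code": 0.0, "promo_code": 80.0}
print(coverage_exceptions(sales_docs, expected))
```

Comparing against an expected rate rather than against zero keeps legitimately sparse columns from flooding the exception report.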
31
Timeliness
The degree to which provider files are received, processed and made available for assembly into data marts.
Expected receipt times are compared to actual receipt times.
Late or missing files are flagged and reported on.
Proactive alerts trigger communication with the provider contact.
Proactive communication can alert the assembly processes.
Excessive lag times can be reported to providers in order to request delivery sooner.
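The comparison of expected and actual receipt times can be sketched as below. The provider names, timestamps, and the 30-minute grace period are assumptions for illustration; a real process would read them from the scheduler or a file-receipt log.

```python
from datetime import datetime, timedelta


def file_status(expected, actual, now, grace=timedelta(minutes=30)):
    """Classify one provider file by comparing expected and actual receipt times."""
    deadline = expected + grace
    if actual is None:
        return "missing" if now > deadline else "pending"
    return "late" if actual > deadline else "on time"


def lag(expected, actual):
    """How long after the expected time the file arrived (zero if early)."""
    return max(actual - expected, timedelta(0))


# Hypothetical feeds: provider -> (expected receipt, actual receipt or None).
now = datetime(2024, 1, 15, 9, 0)
feeds = {
    "provider_a": (datetime(2024, 1, 15, 6, 0), datetime(2024, 1, 15, 5, 50)),
    "provider_b": (datetime(2024, 1, 15, 6, 0), datetime(2024, 1, 15, 7, 45)),
    "provider_c": (datetime(2024, 1, 15, 6, 0), None),
}

statuses = {name: file_status(exp, act, now) for name, (exp, act) in feeds.items()}
flagged = {name: s for name, s in statuses.items() if s in ("late", "missing")}
print(flagged)
```

The `flagged` entries are the ones that would trigger an alert to the provider contact, and `lag` gives the figure to report back when requesting earlier delivery.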
32
Validity
The degree to which the relationships between different data are valid.
Zip code 48108 is accurate.
State code AL is accurate.
Zip code 48108 is invalid for the state of AL.
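Validity is a cross-field check: each value can be accurate on its own while the combination is wrong. A minimal sketch, assuming a zip-to-state reference table is available; the two-entry `ZIP_TO_STATE` table here is only a stand-in for a full postal reference file.

```python
# Illustrative reference table (assumption); a real one would cover all zip codes.
ZIP_TO_STATE = {"48108": "MI", "35244": "AL"}


def is_valid_pair(zip_code, state):
    """True only when the zip code is known and belongs to the given state."""
    expected_state = ZIP_TO_STATE.get(zip_code)
    return expected_state is not None and expected_state == state


print(is_valid_pair("48108", "AL"))  # the slide's example of an invalid pair
print(is_valid_pair("48108", "MI"))
```

An unknown zip code is treated as invalid here; a stricter design might report "unknown" separately so that gaps in the reference table are not confused with bad data.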
33
Data Quality Measures
How do you know if your data is of high quality?
Agree upon the measures that are important to the organization and consistently report them out.
Use the data measures to communicate and inform.
34
Measurement
Informatica.com
35
Exercise: Changing the Data (1 of 2)
So, you need to add a new source
Or, you need to receive additional data from an existing source
Could be the data quality is an issue
Could be that the business rules weren’t defined adequately
36
Brainstorming Group Exercise (2 of 2)
The data changed due to DQ measures. What do we have to do in the DW?
What has to change
Estimate the change
Implement the change
How do we make sure it doesn’t happen again?
What DQ measure can help?
37
MDM: Master Data Management
The newest ‘buzz word’
The recent emphasis on regulatory compliance, SOA, and mergers and acquisitions has made the creating and maintaining of accurate and complete master data a business imperative.
38
MDM
The pain that organizations are experiencing around consistent reporting, regulatory compliance, strong interest in Service-Oriented Architecture (SOA), and Software as a Service (SaaS) has prompted a great deal of interest in Master Data Management (MDM).
39
What Is Master Data Management?
Master Data Management (MDM) is the technology, tools, and processes an organization needs to create and maintain a consistent and accurate inventory of its master data.
40
5 Types of Data for MDM
Unstructured—This is data found in e-mail, white papers like this, magazine articles, corporate intranet portals, product specifications, marketing collateral, and PDF files.
Transactional—This is data related to sales, deliveries, invoices, trouble tickets, claims, and other monetary and non-monetary interactions.
Metadata—This is data about other data and may reside in a formal repository or in various other forms such as XML documents, report definitions, column descriptions in a database, log files, connections, and configuration files.
Hierarchical—Hierarchical data stores the relationships between other data. It may be stored as part of an accounting system or separately as descriptions of real-world relationships, such as company organizational structures or product lines. Hierarchical data is sometimes considered a super MDM domain, because it is critical to understanding and sometimes discovering the relationships between master data.
41
5 types of data cont’d
Master—Master data are the critical nouns of a business and fall generally into four groupings: people, things, places, and concepts.
Further categorizations within those groupings are called subject areas, domain areas, or entity types.
For example, within people, there are customer, employee, and salesperson.
Within things, there are product, part, store, and asset.
Within concepts, there are things like contract, warranty, and licenses.
Within places, there are office locations and geographic divisions.
Some of these domain areas may be further divided. Customer may be further segmented, based on incentives and history. A company may have normal customers, as well as premiere and executive customers. Product may be further segmented by sector and industry. (4)
42
Exercise:
What processes need to be put in place for MDM?
Who needs to be involved?
Who owns it?
43
SECTION 4
BI Tools
BICC
Jobs
Certifications
44
SECTION 4
What is business intelligence
What are BI tools
What is a business intelligence competency center (BICC)
What jobs are available
Certifications
45
What is business intelligence
Turning raw data into information.
Business intelligence (BI) is a broad category of applications and technologies for gathering, storing, analyzing, and providing access to data to help enterprise users make better business decisions. BI applications include the activities of decision support systems, query and reporting, online analytical processing (OLAP), statistical analysis, forecasting, and data mining. (1)
46
What is BI
It is about making better business decisions easier and quicker.
Data Mining is a BI technique used to extract valid, useful and previously unknown information from a company’s data sources.
47
BI solutions examples by industry
Retail: Forecasting, Ordering & supply, Marketing, Merchandising, Distribution, Transportation, Inventory planning, Space management, …..
Insurance: Claims & premium analysis, Customer Analytics, Risk analysis
Banking: Customer profitability, Credit Management, Branch sales
48
BI term coined Sept 1996 by Gartner Group in a report
“By 2000, Information Democracy will emerge in forward-thinking enterprises, with Business Intelligence information and applications available broadly to employees, consultants, customers, suppliers, and the public. The key to thriving in a competitive marketplace is staying ahead of the competition. Making sound business decisions based on accurate and current information takes more than intuition. Data analysis, reporting, and query tools can help business users wade through a sea of data to synthesize valuable information from it - today these tools collectively fall into a category called Business Intelligence.” (1)
49
Magic Quadrant for BI (Gartner)
50
BI
BI is a term categorizing a variety of software applications that are used to analyze a business’ raw data.
It is also a discipline categorizing activities that include data quality, data mining, OLAP (online analytical processing), querying and reporting. (2)
51
What kinds of companies use BI
All kinds: restaurants, sports franchises, retailers… any company.
Examples include: New England Patriots, Walmart, Harrah’s, Amazon, Yahoo, Capital One…
52
When are you doing BI?
When looking at your market share or profitability, you are doing BI.
When looking for the best area to increase your sales, you are doing BI.
Anytime you analyze data and turn it into information, you are doing BI.
53
How do you know if you are really doing BI?
Efforts around changing individual and team work practices arise, from the individual and from the teams
New jobs are posted talking about analyzing data and delivering reports
Dashboards appear
The CEO and CIO start talking about it
54
BI Tools & What they do
BI tools: Cognos, SAS, Oracle (Siebel & Hyperion), MicroStrategy, Microsoft, Information Builders, QlikView, ….. etc
What they do:
Querying & Reporting
OLAP and its sisters: MOLAP, ROLAP, HOLAP
Data Mining
55
BICC
A Business Intelligence Competency Center (BICC) is a cross-functional organizational team that has defined tasks, roles, responsibilities and processes for supporting and promoting the effective use of Business Intelligence (BI) across an organization.
As early as 2001, Gartner, an information technology research and advisory company, started advocating that companies need a BICC to develop and focus resources to be successful using business intelligence. [1] Since then, the BICC concept has been further refined through practical implementations in organizations that have implemented BI and analytical software.
Taken directly from Wikipedia
56
BICC
In practice, the term "BICC" is not well integrated into the nomenclature of business or public sector organizations and there are a large degree of variances in the organizational design for BICCs. Nevertheless, the popularity of the BICC concept has caused the creation of units that focus on ensuring the use of the information for decision-making from BI software and increasing the return on investment (ROI) of BI. [2]
A BICC coordinates the activities and resources to ensure that a fact-based approach to decision making is systematically implemented throughout an organization. It has responsibility for the governance structure for BI and analytical programs, projects, practices, software, and architecture. It is responsible for building the plans, priorities, infrastructure, and competencies that the organization needs to take forward-looking strategic decisions by using the BI and analytical software capabilities.
A BICC’s influence transcends that of a typical business unit, playing a crucial central role in the organizational change and strategic process. Accordingly, the BICC’s purpose is to empower the entire organization to coordinate BI from all units. Through centralization, it "…ensures that information and best practices are communicated and shared through the entire organization so that everyone can benefit from successes and lessons learned."[3]
The BICC also plays an important organizational role facilitating interaction among the various cultures and units within the organization. Knowledge transfer, enhancement of analytic skills, coaching and training are central to the mandate of the BICC. A BICC should be pivotal in ensuring a high degree of information consumption and a ROI for BI.
Taken directly from Wikipedia
57
Jobs in Business Intelligence
Business Analyst
BI Programmer
BI Architect
BI Support Engineer
BI Manager
1000+ jobs on washingtonpost.com
5,357 jobs on indeed.com
58
References
Data Management and Integration Topic, Gartner, http://www.gartner.com/it/products/research/asset_137953_2395.jsp
Articles: Key Issues for Implementing an Enterprise wide Data Quality Improvement Project, 2008, Key Issues for Enterprise Information Management Initiatives, 2008, Key Issues for Establishing Information Governance Policies, Processes and Organization, 2008
Data Quality Management, The Most Critical Initiative You Can Implement, J. G. Geiger, http://www2.sas.com/proceedings/sugi29/098-29.pdf
Information Management, How to Measure and Monitor the Quality of Master Data, http://www.information-management.com/issues/2007_58/master_data_management_mdm_quality-10015358-1.html?ET=informationmgmt:e963:2046487a:&st=email
Data Management Assn of Michigan Bits & Bytes, Critical Data Quality Controls, D Jeffries, Fall 2006 http://dama-michigan.org/2%20Newsletter.pdf
(1) http://searchdatamanagement.techtarget.com/definition/business-intelligence
(2) http://www.cio.com/article/40296/Business_Intelligence_Definition_and_Solutions
(3) http://www.redbooks.ibm.com/redbooks/pdfs/sg245747.pdf
(4) http://msdn.microsoft.com/en-us/library/bb190163.aspx
Wikipedia: BICC
Strange, K. H., Hostmann, B. (22 July 2003), BI Competency Center Is Core to BI Success, Gartner Research
Miller, G., Queisser, T (2008), The Modern BI Organization, Heidelberg, MaxMetrics GmbH
Miller, G., Bräutigam, B, & Gerlach, S. (2006). BICC: A Team Approach Competitive Advantage. Hoboken: Wiley