Data Warehouse and
Business Intelligence
Dr. Minder Chen
Professor of MIS
Martin V. Smith School of Business and Economics
CSU Channel Islands
DW & BI - 2 © Minder Chen, 2004-2014
BI
Business Intelligence (BI) is the process of gathering meaningful information to answer questions and identify significant trends or patterns, giving key stakeholders the ability to make better business decisions.
“The key in business is to know something that nobody else knows.”
-- Aristotle Onassis
“To understand is to perceive patterns.”
— Sir Isaiah Berlin
"The manager asks how and when, the leader asks what and why."
— “On Becoming a Leader” by Warren Bennis
BI Questions
• What happened?
– What were our total sales this month?
• What’s happening?
– Are our sales going up or down? (trend analysis)
• Why?
– Why have sales gone down?
• What will happen?
– Forecasting and “what if” analysis
• What do I want to happen?
– Planning and targets
Source: Bill Baker, Microsoft
Business Valuation Models for BI
Performance Dashboards for Information Delivery
Scorecards for Information Delivery
Balanced Scorecard
Inmon's Definition of Data Warehouse – Data View
• A warehouse is a
– subject-oriented,
– integrated,
– time-variant, and
– non-volatile
collection of data in support of management's decision-making process.
– Bill Inmon, 1990
Source: http://www.intranetjournal.com/features/datawarehousing.html
Inmon's Definition Explained
• Subject-oriented: Data warehouses are organized around major subjects such as customer, supplier, product, and sales. They focus on modeling and analysis to support planning and management decisions rather than operations and transaction processing.
• Integrated: Data warehouses integrate sources such as relational databases, flat files, and on-line transaction records. Processes such as data cleansing and data scrubbing achieve consistency in naming conventions, encoding structures, and attribute measures.
• Time-variant: Data contained in the warehouse provide information from a historical perspective.
• Nonvolatile: Data contained in the warehouse are physically separate from data in the operational environment and, once loaded, are not changed by operational updates.
Business Intelligence
[Figure: pyramid of BI layers. The potential to support business decisions (MIS) increases from bottom to top.]
Layers, top to bottom:
• Making Decisions
• Data Presentation: Visualization Techniques
• Data Mining: Information Discovery
• Data Exploration: OLAP, MDA, Statistical Analysis, Querying and Reporting
• Data Warehouses / Data Marts
• Data Sources (Paper, Files, Information Providers, Database Systems, OLTP)
Roles, top to bottom: End User, Business Analyst, Data Analyst, DBA
Where is Business Intelligence applied?
Operational Efficiency
• ERP Reporting
• KPI Tracking
• Product Profitability
• Risk Management
• Balanced Scorecard
• Activity Based Costing
• Global Sourcing
• Logistics
Customer Interaction
• Sales Analysis
• Sales Forecasting
• Segmentation
• Cross-selling
• CRM Analytics
• Campaign Planning
• Customer Profitability
OLTP Versus Business Intelligence: Who asks what?
OLTP Questions
• When did that order ship?
• How many units are in inventory?
• Does this customer have unpaid bills?
• Are any of customer X’s line items on backorder?
Analysis Questions
• What factors affect order processing time?
• How did each product line (or product) contribute to profit last quarter?
• Which products have the lowest gross margin?
• What is the value of items on backorder, and is it trending up or down over time?
The Data Warehouse/BI Architecture & Process
[Figure: data flows from Source Systems into the Data Warehouse, on to Data Marts and OLAP Cubes, and finally to Clients (query tools, reporting, analysis, data mining). ETL steps move data between the stages.]
ETL: Extract, Transform, and Load
Process steps:
1. Design the Data Warehouse
2. Populate the Data Warehouse
3. Create OLAP Cubes
4. Query Data
Normalized Database for OLTP
OLTP vs. OLAP
Source: http://datawarehouse4u.info/OLTP-vs-OLAP.html
OLTP System: Online Transaction Processing (Operational System)
OLAP System: Online Analytical Processing (Data Warehouse)
• Source of data
– OLTP: Operational data; OLTPs are the original source of the data.
– OLAP: Consolidation data; OLAP data comes from the various OLTP databases.
• Purpose of data
– OLTP: To control and run fundamental business tasks.
– OLAP: To help with planning, problem solving, and decision support.
• What the data reveals
– OLTP: A snapshot of ongoing business processes.
– OLAP: Multi-dimensional views of various kinds of business activities.
• Inserts and updates
– OLTP: Short and fast inserts and updates initiated by end users.
– OLAP: Periodic long-running batch jobs refresh the data.
• Queries
– OLTP: Relatively standardized and simple queries returning relatively few records.
– OLAP: Often complex queries involving aggregations.
• Processing speed
– OLTP: Typically very fast.
– OLAP: Depends on the amount of data involved; batch data refreshes and complex queries may take many hours; query speed can be improved by creating indexes.
• Space requirements
– OLTP: Can be relatively small if historical data is archived.
– OLAP: Larger due to the existence of aggregation structures and history data; requires more indexes than OLTP.
• Database design
– OLTP: Highly normalized with many tables.
– OLAP: Typically de-normalized with fewer tables; use of star and/or snowflake schemas.
• Backup and recovery
– OLTP: Back up religiously; operational data is critical to run the business, and data loss is likely to entail significant monetary loss and legal liability.
– OLAP: Instead of regular backups, some environments may consider simply reloading the OLTP data as a recovery method.
Measuring Performance
• Real estate consumer services and analysis firm Trulia reports that Oct. 2013 saw only a 0.6% rise in home asking prices compared with Sept. 2013.
• However, the average home asking price rose by 11.7% from Oct. 2012 to Oct. 2013.
• The year-over-year figure is the largest jump since the housing bubble popped back in 2007-08.
Source: http://www.thestreet.com/story/12100873/1/home-sellers-price-hikes-coming-unsustainably-fast.html
Comparison with Last Period vs. Year-over-Year Comparison
• A time series is a sequence of data points, typically measured at successive points in time spaced at uniform intervals. An example of a time series is the daily closing value of the Dow Jones Industrial Average. (Wikipedia, “Time series”)
• Year-over-year data compares a time period (e.g., a month or a quarter) against the same time period last year.
• You can also compare a performance indicator with its value from the last period (quarter, month, week, day).
• One advantage of year-over-year comparisons is that they automatically negate the effect of seasonality, which often makes them a more effective way of looking at performance.
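The two comparisons can be sketched in a few lines of Python. The monthly figures and the helper name pct_change below are invented for illustration; they are not the Trulia numbers.

```python
def pct_change(current, previous):
    """Percentage change from `previous` to `current`."""
    return (current - previous) / previous * 100.0

# Hypothetical monthly sales keyed by (year, month); invented numbers.
sales = {
    (2012, 10): 200.0,
    (2013, 9): 220.0,
    (2013, 10): 222.0,
}

# Period-over-period: Oct. 2013 against Sept. 2013.
mom = pct_change(sales[(2013, 10)], sales[(2013, 9)])

# Year-over-year: Oct. 2013 against Oct. 2012, the same month one year earlier,
# so any seasonal pattern affects both sides of the comparison equally.
yoy = pct_change(sales[(2013, 10)], sales[(2012, 10)])

print(f"month-over-month: {mom:.1f}%")
print(f"year-over-year:   {yoy:.1f}%")
```

The same indicator can look flat against the previous month and strong against the previous year, which is why both views are usually reported together.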
Identifying Measures and Dimensions
Attribute type?
• Measures: The attribute (column) varies continuously. Examples: Units Sold, Cost, Sales, Balance. Measures are the performance measures for KPIs (What?).
• Dimensions: The attribute is perceived as a constant or discrete value. Examples: Name/Description, Location, Color, Size. Dimensions are the performance drivers (Why?).
Together, measures and dimensions provide information for decision making.
Star Schema
[Figure: star schema. The Sales fact table (with measures) sits at the center and is joined to five dimension tables: Customers, Dates, Products, Channels, and Promotions.]
Multi-dimensional Data Model
Snowflake Schema
[Figure: snowflake schema. As in the star schema, the Sales fact table is joined to the dimension tables Customers, Dates, Products, Channels, and Promotions. In addition, dimensions are normalized into further tables: Brands is split out of Products, and Customer type is split out of Customers.]
Source: http://www.diffen.com/difference/Snowflake_Schema_vs_Star_Schema
Designing Data Warehouse: Dimensional Design Process
• Select the business process to model
• Declare the grain of the business process/data in the fact table. (The grain represents the most atomic level by which the facts may be defined. The grain of a SALES fact table might be stated as "Sales volume by Day by Product by Store".)
• Identify the numeric facts/measures that will populate each fact table row
• Choose the dimensions that apply to each fact table row
Inputs to the design process: Business Requirements and Data Realities
Ref: http://en.wikipedia.org/wiki/Fact_table
Select a business process to model
• Not business departments or business functions
• Cross-functional business processes
• Business events
• Examples:
– Raw materials purchasing
– Order fulfillment process
– Shipments
– Invoicing
– Inventory
– General ledger
– Insurance claims
– Class enrollment
– Airline ticket sales
Fact Table
• Dimensions (keys): DateID, ProductID, CustomerID
• Measures: Units, Dollars
The fact table contains keys and units of measure.
Measures are measurements of business events.
Fact Tables
Fact tables have the following characteristics:
• They contain numeric measures (metrics) of the business.
• They may contain summarized (aggregated) data.
• They almost always contain date-stamped data.
• Their measures are typically additive.
• Their key is typically a concatenated key composed of the primary keys of the dimensions.
• They are joined to dimension tables through foreign keys that reference primary keys in the dimension tables.
• They are narrow (few attributes) but long (many records).
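As a minimal sketch of these characteristics, the following uses Python's built-in sqlite3 module to create one fact table and two dimension tables and then joins them. The table names, column names, and sample rows are hypothetical and chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, day TEXT, quarter TEXT);
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, brand TEXT);

-- Narrow fact table: foreign keys to the dimensions plus additive measures.
CREATE TABLE fact_sales (
    date_id    INTEGER REFERENCES dim_date(date_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    units      INTEGER,
    dollars    REAL,
    PRIMARY KEY (date_id, product_id)   -- concatenated key
);

INSERT INTO dim_date VALUES (1, '2014-01-15', 'Q1'), (2, '2014-04-20', 'Q2');
INSERT INTO dim_product VALUES (1, 'Cola', 'BrandA'), (2, 'Juice', 'BrandB');
INSERT INTO fact_sales VALUES (1, 1, 10, 15.0), (1, 2, 4, 12.0), (2, 1, 7, 10.5);
""")

# Join the fact table to its dimensions and sum the additive measures.
rows = conn.execute("""
    SELECT d.quarter, p.name, SUM(f.units), SUM(f.dollars)
    FROM fact_sales f
    JOIN dim_date d    ON d.date_id = f.date_id
    JOIN dim_product p ON p.product_id = f.product_id
    GROUP BY d.quarter, p.name
    ORDER BY d.quarter, p.name
""").fetchall()

for row in rows:
    print(row)
```

The grain of this hypothetical fact table is one row per day per product; a real design would declare the grain first and add the remaining dimensions (customer, store, and so on) to the key.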
A Dimensional Model for a Grocery Store’s Sales
Creating Dimensional Model
• Identify fact tables
• Translate business measures into fact tables
• Analyze information from source systems for additional measures
• Identify base and derived measures
• Document additivity of measures (e.g., non-additive [price], semi-additive [quantity-on-hand is not additive over time], or additive [quantity])
• Identify dimension tables
• Link fact tables to the dimension tables
• Create views for users
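The additivity distinction above can be illustrated with a small sketch. The inventory snapshots are invented, and the variable names are hypothetical.

```python
# Hypothetical daily inventory snapshots: (day, product, quantity_on_hand).
snapshots = [
    ("Mon", "Cola", 100), ("Mon", "Juice", 50),
    ("Tue", "Cola", 90),  ("Tue", "Juice", 40),
]

# Quantity-on-hand is additive across products for a single day.
on_hand_monday = sum(qty for day, _, qty in snapshots if day == "Mon")

# It is NOT additive over time: adding Monday's and Tuesday's Cola levels
# double-counts stock that simply stayed on the shelf.
cola_levels = [qty for _, product, qty in snapshots if product == "Cola"]
naive_sum = sum(cola_levels)                         # meaningless total
average_level = sum(cola_levels) / len(cola_levels)  # a sensible alternative
closing_level = cola_levels[-1]                      # another sensible alternative

print(on_hand_monday, naive_sum, average_level, closing_level)
```

Documenting that a measure is semi-additive tells report builders to average it or take the closing value over time instead of summing it.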
Transaction Level Order Item Fact Table
Inside a Dimension Table
• Dimension table key: Uniquely identifies each row. Use a surrogate key (integer).
• Table is wide: A table may have many attributes (columns).
• Textual attributes: Descriptive attributes in string format. No numerical values for calculation.
• Attributes not directly related: E.g., product color and product package size. No transitive dependency.
• Not normalized (star schema).
• Supports drilling down and rolling up along a dimension.
• One or more hierarchies within a dimension.
• Fewer records than a fact table.
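One way to picture the surrogate key mentioned above is a generator that hands out one integer per distinct natural key from the source system. This is only a sketch: the class name, the SKU values, and the attribute names are hypothetical.

```python
from itertools import count

class SurrogateKeyGenerator:
    """Hands out one integer surrogate key per distinct natural key."""

    def __init__(self):
        self._next_key = count(1)
        self._assigned = {}

    def key_for(self, natural_key):
        # Reuse the existing key if this natural key was seen before.
        if natural_key not in self._assigned:
            self._assigned[natural_key] = next(self._next_key)
        return self._assigned[natural_key]

product_keys = SurrogateKeyGenerator()

# Hypothetical source rows; the SKU is the natural key from the source system.
source_rows = [
    {"sku": "COLA-12OZ", "color": "red", "package_size": "12 oz"},
    {"sku": "JUICE-1L", "color": "orange", "package_size": "1 L"},
    {"sku": "COLA-12OZ", "color": "red", "package_size": "12 oz"},
]

# Build the dimension: one wide, descriptive row per surrogate key.
dim_product = {}
for row in source_rows:
    dim_product[product_keys.key_for(row["sku"])] = row

print(dim_product)
```

Fact rows would then store the small integer key rather than the source system's SKU, which keeps the fact table narrow and insulates it from changes to natural keys.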
OLAP Solutions
• Data Warehouse
• Data Mart
• Cubes
• Dimensions
• Measures
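• Cells
[Figure: a cube with three dimensions: Product (Gadgets, Gizmos, Thingies, Widgets), Time (Q1, Q2, Q3, Q4), and Region (US, Europe, Asia). Each cell holds a measure value. Cell values shown on the visible face:]
Q1  Q2  Q3  Q4
130 135 140 142
205 390 350 475
175 230 190 250
310 340 410 450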
OLAP Server (e.g., Oracle Essbase and SQL Server Analysis Services)
A cube is a collection of data that’s been aggregated to allow queries to return data quickly.
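The idea of pre-aggregation can be sketched with a plain dictionary: every fact is added to each combination of its own coordinates and an "all" marker, so totals are looked up rather than recomputed. The fact rows and the ALL marker are assumptions made for this illustration; this is not how any particular OLAP server stores its cubes.

```python
from collections import defaultdict
from itertools import product as cartesian

# Hypothetical detail facts: (product, quarter, region, units).
facts = [
    ("Gadgets", "Q1", "US", 130),
    ("Gadgets", "Q2", "US", 135),
    ("Gizmos", "Q1", "US", 205),
    ("Gizmos", "Q1", "Europe", 60),
]

ALL = "*"  # marker meaning "aggregated over this dimension"

cube = defaultdict(int)
for prod, quarter, region, units in facts:
    # Each fact contributes to all 2^3 combinations of (own value, ALL).
    for cell in cartesian((prod, ALL), (quarter, ALL), (region, ALL)):
        cube[cell] += units

# Queries are now dictionary lookups instead of scans over the facts.
print(cube[("Gadgets", ALL, ALL)])   # all Gadgets units
print(cube[(ALL, "Q1", ALL)])        # all Q1 units
print(cube[(ALL, ALL, ALL)])         # grand total
```

The trade-off is space for speed: the cube holds more cells than there are facts, which matches the larger space requirements noted for OLAP systems.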
Hierarchy
A Hierarchy in the Product Dimension
• SKU: Stock Keeping Unit
• Hierarchy:
– Department → Category → Subcategory → Brand → Product
Multidimensional Query Techniques
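[Figure: a data cube with dimensions Product, Time, and Geography. Performance measures answer "What?"; performance drivers answer "Why?". Slicing, dicing, and drilling down are used to answer successive "Why?" questions. Along a hierarchy, drilling down moves from aggregated data to detail data, and rolling up moves from detail data to aggregated data.]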
Roll-Up and Drill-Down
Source: http://www.tutorialspoint.com/dwh/dwh_olap.htm
Slice and Dice
Source: http://www.tutorialspoint.com/dwh/dwh_olap.htm
A Visual Operation: Pivot (Rotate)
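[Figure: a cube with dimensions Product (Juice, Cola, Milk, Cream), Region (NY, LA, SF), and Date (3/1, 3/2, 3/3, 3/4, rolling up to Month). Sample cell values shown: 10, 47, 30, 12. Pivoting rotates the cube to present a different pair of dimensions.]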
Operations in Multidimensional Data Model
• Aggregation (roll-up)
– dimension reduction: e.g., total sales by city
– summarization over aggregate hierarchy: e.g., total sales by city and year → total sales by region and by year
• Navigation to detailed data (drill-down)
– e.g., (sales - expense) by city, top 3% of cities by average income
• Selection (slice or dice) defines a subcube
– e.g., sales where city = Palo Alto and date = 1/15/96
• Visualization operations (e.g., Pivot)
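The operations listed above can be sketched over a handful of detail rows. The rows, the city-to-region assignment, and the helper name aggregate are invented for illustration.

```python
from collections import defaultdict

# Hypothetical detail rows: (city, region, year, sales).
rows = [
    ("Palo Alto", "West", 1996, 100),
    ("Palo Alto", "West", 1997, 120),
    ("San Jose",  "West", 1996, 80),
    ("Boston",    "East", 1996, 90),
]

def aggregate(rows, key):
    """Sum sales grouped by whatever `key(row)` returns."""
    totals = defaultdict(int)
    for row in rows:
        totals[key(row)] += row[3]
    return dict(totals)

# Detail level: total sales by city and year.
by_city_year = aggregate(rows, lambda r: (r[0], r[2]))

# Roll-up along the hierarchy city -> region: total sales by region and year.
# Drill-down is the reverse step, from by_region_year back to by_city_year.
by_region_year = aggregate(rows, lambda r: (r[1], r[2]))

# Roll-up by dimension reduction: drop the year, total sales by city.
by_city = aggregate(rows, lambda r: r[0])

# Slice: fix one dimension to a single value.
slice_1996 = [r for r in rows if r[2] == 1996]

# Dice: restrict two or more dimensions, giving a subcube.
dice_west_1996 = [r for r in rows if r[1] == "West" and r[2] == 1996]

# Pivot: same cells, with the axes swapped for presentation.
pivot_year_city = {(year, city): total for (city, year), total in by_city_year.items()}

print(by_region_year)
```

None of these operations changes the underlying facts; each one only chooses a different level of aggregation or a different subset of cells to display.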
Pivot Table in Excel
Date Dimension of the Retail Sales Model
Store Dimension
• It is not uncommon to represent multiple hierarchies in a dimension table. Ideally, the attribute names and values should be unique across the multiple hierarchies.
ETL
ETL = Extract, Transform, Load.
The ETL cycle includes:
• Build reference data (e.g., currency codes)
• Extract (from sources)
• Validate
• Transform (clean, apply business rules, check for data integrity, create aggregates)
• Stage (load into staging tables, if used)
• Audit reports on compliance with business rules.
• Publish/load (to target tables in the data warehouse)
• Clean up
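A compressed sketch of this cycle, using only the Python standard library, might look as follows. The CSV content, the reference list of currency codes, and the table name fact_orders are all hypothetical; staging, audit reports, and clean-up are omitted for brevity.

```python
import csv
import io
import sqlite3

# Hypothetical source extract; rows 3 and 4 are deliberately bad.
SOURCE_CSV = """order_id,currency,amount
1,usd,10.50
2,EUR,7.00
3,XXX,5.00
4,USD,not-a-number
"""

VALID_CURRENCIES = {"USD", "EUR"}  # reference data (assumed)

def extract(text):
    """Extract: read the source rows as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Validate and transform: standardize codes, reject rows that break the rules."""
    clean, rejected = [], []
    for row in rows:
        currency = row["currency"].strip().upper()
        try:
            amount = float(row["amount"])
        except ValueError:
            rejected.append(row)
            continue
        if currency not in VALID_CURRENCIES:
            rejected.append(row)
            continue
        clean.append((int(row["order_id"]), currency, amount))
    return clean, rejected

def load(conn, rows):
    """Load: publish the clean rows to the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_orders "
        "(order_id INTEGER PRIMARY KEY, currency TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
clean, rejected = transform(extract(SOURCE_CSV))
load(conn, clean)
loaded = conn.execute("SELECT COUNT(*) FROM fact_orders").fetchone()[0]
print(f"loaded {loaded} rows, rejected {len(rejected)}")
```

Keeping the rejected rows, rather than silently dropping them, is what makes an audit report on compliance with business rules possible.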
Data Quality Issues
• No common time basis
• Different calculation algorithms
• Different levels of extraction
• Different levels of granularity
• Different data field names
• Different data field meanings
• Missing information
• No data correction rules
• No drill-down capability
Building the Warehouse: Transforming Data
The Anomalies Nightmare
CUST #     NAME                   ADDRESS                            TYPE
90238475   Digital Equipment      187 N. PARK St. Salem NH 01458     OEM
90233479   Digital                187 N. Pk. St. Salem NH 01458      OEM
90233489   Digital Corp           187 N. Park St Salem NH 01458      $#%
90234889   Digital Consulting     187 N. Park Ave. Salem NH 01458    Comp
90345672   Digital Info Service   15 Main Street Andover MA 02341    Consult
90328575   Digital Integration    PO Box 9 Boston MA 02210           Mail List
90328575   DEC                    Park Blvd. Boston MA 04106         SYS INT
Anomalies: No Unique Key; Noise in Blank Fields; Spelling; No Standardization
How does one correctly identify and consolidate anomalies from millions of records?
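One small piece of the consolidation problem, standardizing addresses so that spelling variants collide, can be sketched as follows. The abbreviation table and the function name normalize_address are assumptions made for this example; real cleansing tools use far richer rules and fuzzy matching.

```python
import re
from collections import defaultdict

# Hypothetical standardization rules for street abbreviations.
ABBREVIATIONS = {"PK": "PARK", "ST": "STREET", "AVE": "AVENUE", "BLVD": "BOULEVARD"}

def normalize_address(address):
    """Upper-case, strip punctuation, and expand known abbreviations."""
    cleaned = re.sub(r"[^A-Z0-9 ]", "", address.upper())
    return " ".join(ABBREVIATIONS.get(token, token) for token in cleaned.split())

# Customer numbers and addresses taken from the table above.
records = [
    ("90238475", "187 N. PARK St. Salem NH 01458"),
    ("90233479", "187 N. Pk. St. Salem NH 01458"),
    ("90233489", "187 N. Park St Salem NH 01458"),
    ("90234889", "187 N. Park Ave. Salem NH 01458"),
    ("90345672", "15 Main Street Andover MA 02341"),
]

# Group customer numbers that share the same normalized address.
candidates = defaultdict(list)
for cust_id, address in records:
    candidates[normalize_address(address)].append(cust_id)

duplicates = {addr: ids for addr, ids in candidates.items() if len(ids) > 1}
print(duplicates)
```

The first three records collapse to one address, while the "Park Ave." record stays separate; whether it is a fourth variant or a different location is exactly the kind of question that needs a data correction rule or human review.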
Data Mining & Knowledge Discovery in Databases (KDD) Process
Data mining is the analysis step of the "Knowledge Discovery in Databases" (KDD) process, involving methods from artificial intelligence, machine learning, statistics, and database systems.
Data mining is the practice of searching through large amounts of computerized data to find useful patterns or trends.
Source: http://www2.cs.uregina.ca/~dbd/cs831/notes/kdd/1_kdd.html
Knowledge Discovery
• Knowledge discovery in databases is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.
– Data: A set of facts.
– Pattern: An association, dependence, clusters, etc. among facts (items) in the data set.
– Process: KDD is a multi-step process involving data preparation, pattern searching, knowledge evaluation, and refinement with iteration after modification.
– Valid: Discovered patterns should be true on new data with some degree of certainty. Generalize to the future (other data).
– Novel: Patterns must be novel (should not be previously known).
– Useful: Actionable; patterns should potentially lead to some useful actions.
– Understandable: The process should lead to human insight. Patterns must be made understandable in order to facilitate a better understanding of the underlying data.
Source: http://www2.cs.uregina.ca/~dbd/cs831/notes/kdd/1_kdd.html
Cross Industry Standard Process for Data Mining
Source: http://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining
[Figure: CRISP-DM process diagram, annotated with "Decision" and "Action".]
Data Mining Tasks and Examples
• Classification - Customer profiling into predefined categories via supervised learning using a decision tree or neural network.
• Clustering - Grouping a set of objects in such a way that objects in the same group (cluster) are more similar to each other than to those in other groups (clusters), e.g., market segmentation.
• Summarization - Credit scoring and risk analysis using Bayesian inference. It is considered a structured prediction technique.
• Association - What is the likelihood that a customer will buy a product next month if they buy a related item today? (sequence association)
Source: http://www2.cs.uregina.ca/~dbd/cs831/notes/kdd/2_tasks.html
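To make the clustering task concrete, here is a minimal one-dimensional k-means sketch for market segmentation. The spending figures and starting centroids are invented, and k-means is only one of many clustering methods; the slides do not name a specific algorithm.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Plain k-means on a list of numbers, starting from the given centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical monthly spend per customer.
spend = [10, 12, 11, 95, 100, 105]

centroids, clusters = kmeans_1d(spend, centroids=[0.0, 50.0])
print(centroids)
print(clusters)
```

No categories are supplied in advance: the two segments (low and high spenders) emerge from the data, which is the difference from classification into predefined categories.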
OLAP and Data Mining Address Different Types of Questions
While reporting and OLAP are informative about past facts, only data mining can help you predict the future of your business.
OLAP question, followed by the corresponding data mining question:
• OLAP: What was the response rate to our mailing?
– Data mining: What is the profile of people who are likely to respond to future mailings?
• OLAP: How many units of our new product did we sell to our existing customers?
– Data mining: Which existing customers are likely to buy our next new product?
• OLAP: Who were my 10 best customers last year?
– Data mining: Which 10 customers offer me the greatest profit potential?
• OLAP: Which customers didn't renew their policies last month?
– Data mining: Which customers are likely to switch to the competition in the next six months?
• OLAP: Which customers defaulted on their loans?
– Data mining: Is this customer likely to be a good credit risk?
• OLAP: What were sales by region last quarter?
– Data mining: What are expected sales by region next year?
• OLAP: What percentage of the parts we produced yesterday are defective?
– Data mining: What can I do to improve throughput and reduce scrap?
Source: http://www.dmreview.com/editorial/dmreview/print_action.cfm?articleId=2367
Shopping Basket Analysis
• Which items are purchased in a retail store at the same time?
• Amazon uses collaborative filtering, which draws on shopping basket (sales) data, to make recommendations when you select an item.
Ref: http://en.wikipedia.org/wiki/Collaborative_filtering
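The question "which items are purchased at the same time?" is commonly answered with the support, confidence, and lift of item pairs. The sketch below uses invented baskets; it illustrates these standard measures and is not a description of Amazon's recommendation system.

```python
# Hypothetical shopping baskets, one set of items per transaction.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"butter", "milk"},
    {"bread", "jam"},
    {"milk", "eggs"},
]

def support(*items):
    """Fraction of baskets that contain all of the given items."""
    return sum(1 for basket in baskets if set(items) <= basket) / len(baskets)

def confidence(antecedent, consequent):
    """Of the baskets with the antecedent, the fraction that also have the consequent."""
    return support(antecedent, consequent) / support(antecedent)

def lift(antecedent, consequent):
    """Confidence relative to how common the consequent is overall (1.0 = independent)."""
    return confidence(antecedent, consequent) / support(consequent)

print(f"support(bread, butter)     = {support('bread', 'butter'):.2f}")
print(f"confidence(bread -> butter) = {confidence('bread', 'butter'):.2f}")
print(f"lift(bread -> butter)       = {lift('bread', 'butter'):.2f}")
```

A lift above 1 suggests the two items appear together more often than chance would predict; with only five baskets, as here, such a figure is illustrative and not statistically meaningful.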
Issues on Interpreting Modeling Results
• Housing price: Use factors, such as location, number of bedrooms, and square footage, to determine the market value of a property.
• Beer and diapers
Source: http://dssresources.com/newsletters/66.php
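For the housing price example, a minimal least-squares sketch with a single factor (square footage) shows what such a model produces. The data points are invented, and a realistic model would include the other factors mentioned above, such as location and number of bedrooms.

```python
# Hypothetical observations: square footage and sale price in $1000s.
square_feet = [1000, 1500, 2000, 2500]
prices = [200, 290, 410, 500]

n = len(square_feet)
mean_x = sum(square_feet) / n
mean_y = sum(prices) / n

# Ordinary least squares for one predictor:
# slope = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x) ** 2)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(square_feet, prices))
sxx = sum((x - mean_x) ** 2 for x in square_feet)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

def predict(sqft):
    """Estimated price in $1000s for a property of the given size."""
    return intercept + slope * sqft

print(f"price = {intercept:.1f} + {slope:.3f} * square_feet")
print(f"predicted price for 1800 sq ft: {predict(1800):.1f}")
```

The fitted slope describes an association in the sample, not a guarantee about any individual property, and the negative intercept has no real-world meaning; reading too much into either is one of the interpretation issues this slide warns about.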
The four V's of big data:
• Volume: Scale of data
• Velocity: Analysis of streaming data
• Variety: Different forms of data
• Veracity: Uncertainty of data
Source:
• http://www.ibmbigdatahub.com/infographic/four-vs-big-data
• http://whatsthebigdata.com/2013/07/25/big-data-3-vs-volume-variety-velocity-infographic/
CRISP-DM Methodology
Source: http://lyle.smu.edu/~mhd/8331f03/crisp.pdf
Cross Industry Standard Process for Data Mining Methodology
Data Mining Contexts
Source: http://lyle.smu.edu/~mhd/8331f03/crisp.pdf
Phases and Tasks
Source: http://lyle.smu.edu/~mhd/8331f03/crisp.pdf
• Backup Slides
Key Concepts in BI Development Lifecycle
OLTP Normalized Design
[Figure: normalized OLTP entity-relationship diagram with the entities Ordering Process, Warehouse, POS Process, Chain Retailer, Retailer Returns, Retailer Payments, Store, Product, Brand, GL Account, Clerk, Retail Cust, Cash Register, and Retail Promo.]