mis2502: data analytics - temple mis...mar 09, 2018 · dimensional data modeling ... could you...
TRANSCRIPT
MIS2502:Data AnalyticsDimensional Data Modeling
JaeHwuen [email protected]
http://community.mis.temple.edu/jaejung
Where we are…
Transactional Database
Analytical Data Store
Stores real-time transactional data
Stores historical transactional and
summary data
Data entry
Data transformation
Data analysis
Now we’re here…
Comparing Transactional and Analytical Data Stores
Transactional Database Analytical Data Store
Based on Relationalparadigm
Based on Dimensionalparadigm
Storage of real-timetransactional data
Storage of historical transactional data
Optimized for storage efficiency and data integrity
Optimized for data retrieval and summarization
Supports day-to-dayoperations
Supports periodic and on-demand analysis
Some terminology
• Takes many forms
• Really is just a repository for historical data
Data Warehouse
• Subset of the Data Warehouse
• Designed for specific analysisData Mart
• Organization of data as a “multidimensional matrix”
• Implementation of a Data MartData Cube
The Actual Process
Transactional Database 2
Transactional Database 1
Other Sources
ETL
ETL
ETL
Analytical Data Store
Data Mart(Finance)
Data Mart(Sales)
Data Mart(Inventory)
Data Warehouse
Data Cube
How they all relate
The data in the
transactional database…
…is put into a data
warehouse…
…which feeds the data mart…
…and is analyzed as a
cube.
We’ll start here.
The Data Cube
• Core component of Analytical Data Storeand Multidimensional Data Analysis
• Made up of “facts” and “dimensions”
quantity
& total
price
quantity
& total
price
quantity
& total
price
quantity
& total
price
quantity
& total
price
quantity
& total
price
quantity
& total
price
quantity
& total
price
quantity
& total
price
quantity
& total
price
quantity
& total
price
quantity
& total
price
quantity
& total
price
quantity
& total
price
quantity
& total
price
quantity
& total
price
Product
Store
M&MsDiet
CokeDoritos
Ardmore,
PA
Temple
Main
Cherry Hill,
NJ
King of
Prussia, PAJan. 2013
Feb. 2013
Mar. 2013
Quantity sold and total price are measured facts.Why isn’t product price a measured fact?
Cheetos
The Data Cube
• Fact (What happened)– Quantity
– Total price
– Number of promotions
– Etc.
• Dimension(Attributes that describes what happened)– When/where/what
product was sold
quantity
& total
price
quantity
& total
price
quantity
& total
price
quantity
& total
price
quantity
& total
price
quantity
& total
price
quantity
& total
price
quantity
& total
price
quantity
& total
price
quantity
& total
price
quantity
& total
price
quantity
& total
price
quantity
& total
price
quantity
& total
price
quantity
& total
price
quantity
& total
price
Product
Store
M&MsDiet
CokeDoritos
Ardmore,
PA
Temple
Main
Cherry Hill,
NJ
King of
Prussia, PAJan. 2013
Feb. 2013
Mar. 2013
Cheetos
The Data Cube
quantity quantity quantity quantity
quantity quantity quantity quantity
quantity quantity quantity quantity
quantity quantity quantity quantity
Product
Store
M&MsDiet
CokeDoritos
Ardmore,
PA
Temple
Main
Cherry Hill,
NJ
King of
Prussia, PAJan. 2013
Feb. 2013
Mar. 2013
The highlighted element represents the quantity M&Ms were sold in Ardmore, PA in
January, 2013
A single summary record representing a
business event (monthly sales).
Cheetos
Slicing the Data
quantity
& total
price
quantity
& total
price
quantity
& total
price
quantity
& total
price
quantity
& total
price
quantity
& total
price
quantity
& total
price
quantity
& total
price
quantity
& total
price
quantity
& total
price
quantity
& total
price
quantity
& total
price
quantity
& total
price
quantity
& total
price
quantity
& total
price
quantity
& total
price
Product
Store
M&MsDiet
CokeDoritos
Ardmore,
PA
Temple
Main
Cherry Hill,
NJ
King of
Prussia, PAJan. 2013
Feb. 2013
Mar. 2013
The highlighted elements represent
Cheetos cookies sold on Temple’s Main
campus from January to March, 2013
This is called “slicingthe data.”
Cheetos
quantity
& total
price
quantity
& total
price
quantity
& total
price
quantity
& total
price
quantity
& total
price
quantity
& total
price
quantity
& total
price
quantity
& total
price
quantity
& total
price
quantity
& total
price
quantity
& total
price
quantity
& total
price
quantity
& total
price
quantity
& total
price
quantity
& total
price
quantity
& total
price
Product
Store
M&MsDiet
CokeDoritos Cheetos
Ardmore,
PA
Temple
Main
Cherry Hill,
NJ
King of
Prussia, PAJan. 2013
Feb. 2013
Mar. 2013
What do the orange highlighted elements
represent?
What do the purple highlighted elements
represent?
Slicing the Data
Could you have a data mart with five dimensions?
Then why does our example (and most others) only have three?
Modeling a data cube:
Designing the Star Schema
Kimball’s Four Step Process for Data Cube Design (Kimball et al., 2008)
1. Choose the business process
2. Decide on the level of granularity
3. Identify the dimensions
4. Identify the fact
http://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/four-4-step-design-process/
Choose the business process• Business processes are the operational activities
performed by your organization
• Determined by the questions you want to answer about your organization
Question Business Process
Who is my best customer? Sales
What are my highest selling products? Sales
Which supplier is offering us the best deals?
Purchasing
Which type of promotions work the best? Marketing
Note that a “business process” is not always about business.
Decide on the level of granularity
• Level of detail for each business process event
• Will determine the data in the dimensions
• Example: What are the highest sellingproducts?
– Choices for time: yearly, quarterly, monthly, daily
– Choices for store: store, city, state– Choices for product: product, category
How would you select the right granularity?
Identify the dimensions
• Description of the context of the business process
• The key elements of the process needed to answer the question (“fact”)– who, what, where, when, why, and
how
• Example: Sales transaction– A “sale” is the fact– Dimensions: product, store, and
time– Could this data mart tell you
• What is the best selling product?• Who is the best customer?
Identify the factThe fact table contains data called facts associated with the business process event
Keys
• Primary key for each event
• Foreign keys for the associated dimensions
• Example: Sales has Sales_IDas primary key, and Product_ID, Store_ID, and Time_ID as foreign keys
Facts: Measured, numeric data
• Facts: Quantifiable information for each business event – almost always numeric
• Describes a particular combination of dimensional data
• Example: Sales has quantity_sold and total_price.
Modeling a data cube: The Star Schema
ProductProduct_ID
Product_NameProduct_Price
Product_Weight
StoreStore_ID
Store_AddressStore_City
Store_StateStore_Type
TimeTime_ID
DayMonth
Year
Fact
Dimension
Dimension
Dimension
Transactional databases aren’t built around dimensions
• They don’t map well to cubes
• They aren’t set up for summarization
So we build a star schema
• Built around “dimensions” and “facts”
• Simplified relational model
The star schema facilitates
• Aggregating individual transactions
• Creation of cubes
SalesSales_ID
Product_IDStore_IDTime_ID
Quantity SoldTotal Price
Fact Table
• Contain facts (numeric measurements) associated with a specific business process
• Contain foreign keys that refer to dimension tables
Fact Sales
Sales_IDProduct_ID
Store_IDTime_ID
Quantity SoldTotal Price
Dimension Tables
• Contain text and descriptive information
• Provide the “who, what, where, when, why, and how” context surrounding a business process event
ProductProduct_ID
Product_NameProduct_Price
Product_Weight
StoreStore_ID
Store_AddressStore_City
Store_StateStore_Type
TimeTime_ID
DayMonth
Year
Fact
Dimension
Dimension
Dimension
SalesSales_ID
Product_IDStore_IDTime_ID
Quantity SoldTotal Price
From Star Schema to Data Cube
A Cube typically uses a Star Schema as its source
and stores precomputed summarized (aggregated) data
Much more efficient, but can’t be changed (non-volatile)
quantity
& total
price
quantity
& total
price
quantity
& total
price
quantity
& total
price
quantity
& total
price
quantity
& total
price
quantity
& total
price
quantity
& total
price
quantity
& total
price
quantity
& total
price
quantity
& total
price
quantity
& total
price
quantity
& total
price
quantity
& total
price
quantity
& total
price
quantity
& total
price
Product
Store
M&MsDiet
CokeDoritos
Famous
Amos
Ardmore,
PA
Temple
Main
Cherry Hill,
NJ
King of
Prussia, PAJan. 2012
Feb. 2012
Mar. 2012
Advantages of data cube
• Fast response to give you the information you have previously designed in the cube
Speed
• The data multi-dimensional data structure allows the data to be analyzed in the most logical way.
Analysis
Data cube caveats
• The cube is “non volatile,” so you’re locked in
– Measured facts
– Dimensions
– Granularity
• So choose wisely!
– For example: You can’t track daily sales if “date” is monthly
– So why not include every single sale and do no aggregation?
Pivot tables in Excel
• PivotTable is a data summarization tool in Excel
– the easiest way to learn multidimensional data and generate simple reports
• Data cubes can act as the data source for Pivot Table in Excel
Summary
• Data warehouse vs. data mart vs. data cube
• Star schema
• Kimball’s four step process for data mart design
• Pivot tables in Excel
ICA #9
• In ICA #9, we learned to how to create a pivot table in Excel
– Identify which fields are assigned as VALUES and which ones are assigned as ROWS
– Identify the correct function for aggregation: e.g., SUM, COUNT, AVERAGE, MAX, MIN
The star schema in ICA #9
• Measured Fact: Order amount
• Three dimensions: Salesperson, Country, and Time.
Order
amount
Order
amount
Order
amount
Order
amount
Order
amount
Order
amount
Order
amount
Order
amount
Order
amount
Country
Salesperson
USA
Bob
Buchanan
Martha
Suyama
Paul
Peacock
Susan
Leverling
7/10/2008
Order
amount
UK
7/11/2008
7/12/2008
7/15/2008
7/16/2008
7/17/2008
Susan
Leverling
…more
salespeople
…more time
periods
Pivot Table and Data Cube
• The fields in the ROWS box correspond to dimensions in a data cube
• The fields in the VALUES box correspond to measured facts in a data cube
Example 1
Dimension Measured Fact
Example 2
Dimensions Measured Fact