Contents of this slideshow:
• What is a data warehouse?
• Multi-dimensional data modeling
An example of a data warehouse:
Fact table: Orderdetails (Product#, Order#, Qty, Date#, Salesman#)
Dimension: Products (Product#, Product-name, Price)
Dimension: Orders (Order#, Ordertype)
Dimension: Salesmen (Salesman#, Salesman-name)
Dimension: Time (Date#, Date-Name)
A star schema data warehouse has a central table (the fact table) surrounded by dimension tables, each with a one-to-many relationship towards the fact table.
The fixed database structure implies that application programs (drilling functions/aggregates) can be generated automatically!
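A minimal sketch of why a fixed star structure lets aggregate queries be generated: every dimension is described by the same three facts (table, key, label column), so one generic function can build the GROUP BY query for any combination of dimensions. The snake_case names (`product_no`, `salesman_no`, `aggregate_sql`) are hypothetical stand-ins for the slide's Product#, Salesman# and so on, and SQLite is only an assumed engine for illustration.

```python
import sqlite3

# Hypothetical snake_case names standing in for the slide's Product#, Salesman#, ...
SCHEMA = """
CREATE TABLE products (product_no INTEGER PRIMARY KEY, product_name TEXT, price REAL);
CREATE TABLE salesmen (salesman_no INTEGER PRIMARY KEY, salesman_name TEXT);
CREATE TABLE orderdetails (              -- the fact table
    product_no  INTEGER REFERENCES products(product_no),
    salesman_no INTEGER REFERENCES salesmen(salesman_no),
    qty         INTEGER
);
"""

# Each dimension is described only by its table, its key and a label column.
DIMENSIONS = {
    "product": ("products", "product_no", "product_name"),
    "salesman": ("salesmen", "salesman_no", "salesman_name"),
}

def aggregate_sql(dimension_names):
    """Generate a GROUP BY query for one or more dimensions of the star."""
    joins, labels = [], []
    for name in dimension_names:
        table, key, label = DIMENSIONS[name]
        joins.append(f"JOIN {table} USING ({key})")
        labels.append(f"{table}.{label}")
    cols = ", ".join(labels)
    return (f"SELECT {cols}, SUM(qty) AS total_qty FROM orderdetails "
            f"{' '.join(joins)} GROUP BY {cols} ORDER BY {cols}")

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany("INSERT INTO products VALUES (?, ?, ?)",
                 [(1, "Screw", 2.0), (2, "Bolt", 3.0)])
conn.executemany("INSERT INTO salesmen VALUES (?, ?)", [(1, "Smith"), (2, "Jones")])
conn.executemany("INSERT INTO orderdetails VALUES (?, ?, ?)",
                 [(1, 1, 10), (2, 1, 5), (1, 2, 7)])

per_product = conn.execute(aggregate_sql(["product"])).fetchall()
per_product_salesman = conn.execute(aggregate_sql(["product", "salesman"])).fetchall()
print(per_product)            # [('Bolt', 5), ('Screw', 17)]
print(per_product_salesman)   # [('Bolt', 'Smith', 5), ('Screw', 'Jones', 7), ('Screw', 'Smith', 10)]
```

Adding a new dimension to this sketch only requires a new entry in `DIMENSIONS`; the query generator itself stays unchanged.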
Dimension hierarchies:
A dimension hierarchy is a set of tables connected by one-to-many relationships towards the fact table:
Fact table: Orderdetails (Product#, Order#, Qty, Price)
Dimension hierarchy (from the fact table outwards): Orders (Order#, Customer#, Date), then Customers (Customer#, Customer-name)
In a dimension hierarchy it is possible to aggregate data from the fact table to the different levels of the hierarchy.
Drill-down = “de-aggregate” = break an aggregate into its constituents.
Roll-up = aggregate along one or more dimensions.
Two different types of drilling:
• Drilling in dimension hierarchies.
• Drilling between dimensions.
Which star schemas or data marts can be built by using the illustrated integrated E-commerce/ERP data model?
Which star schema would you recommend to be implemented first?
Entities of the integrated E-commerce/ERP data model (figure):
Location (Location#, Address)
UserSession (Session#, IPaddress, #Click, Timestamp)
Product (Product#, ProductName, Price)
Order (Order#, OrderDate, Balance, State)
Order-DetailHistory (Inv-Item#, Order#, Seq#, State, Timestamp)
UserAccount (Salesman#, PassWord, Timestamp, #visits, #trans, Ttl-tr-amount)
Order-Detail (Product#, Order#, Qty, Price, Timestamp)
Shipping (Shipping#, ShipMethod, ShipCharge, State, ShipDate)
CreditCard (Card#, HolderName, ExpireDate)
Payment (Payment#, Ammount, State, Timestamp)
InvoyceHistory (Invoice#, Timestamp, State, Notes)
Address (Address#, Name, Add1, Add2, City, State, Zip)
Invoice (Invoice#, CreationDate)
Product-Stock (Product#, Location#, Qty)
Customer (Customer#, Kredit-Limit, Balance)
Relationship labels in the figure: Billing, Shipping
Data mart = Kimball's term for any multidimensional database/star schema.
A galaxy is a set of multidimensional databases with conformed (jointly adapted) dimensions:
Fact table: Sale-Orderdetails (Product#, Sale-order#, Qty, Discount, Sale-price, Date#)
Fact table: Storage-per-product (Product#, Date#, End-of-day-storage-qty)
Fact table: Purchase-orderdetails (Product#, Purchase-order#, Purchase-price, Qty, Date#)
Time dimension hierarchy: Day (yy, mm, dd), then Month (yy, mm), then Year (yy)
Product dimension hierarchy: Products (Product#, Product-name), then Productgroups (Product-group#, Product-group-name)
Figure label: The value chain
Suppose an enterprise has a data mart for Purchase and another data mart for Sale as illustrated above. Is it possible to calculate the revenue per month for the last year by using such a galaxy?
Conformed dimensions = dimensions designed to be common for different data marts in order to make drill across operations possible.
Conformed facts = measures with common units of measurement and granularities that make it possible to integrate measures from different fact tables.
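A hedged sketch of a drill-across operation: each fact table is first aggregated to the same level of a conformed time dimension, and only then are the two results joined. The table and column names (`day`, `date_no`, `sales_amount`, `purchase_amount`) and the sample rows are assumptions made for illustration, not part of the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Conformed time dimension shared by both data marts (hypothetical names).
CREATE TABLE day (date_no INTEGER PRIMARY KEY, yy INTEGER, mm INTEGER, dd INTEGER);
CREATE TABLE sale_orderdetails (product_no INTEGER, date_no INTEGER,
                                qty INTEGER, sale_price REAL);
CREATE TABLE purchase_orderdetails (product_no INTEGER, date_no INTEGER,
                                    qty INTEGER, purchase_price REAL);
INSERT INTO day VALUES (1, 2023, 1, 10), (2, 2023, 1, 20), (3, 2023, 2, 5);
INSERT INTO sale_orderdetails VALUES (1, 1, 10, 5.0), (1, 2, 4, 5.0), (1, 3, 6, 5.0);
INSERT INTO purchase_orderdetails VALUES (1, 1, 20, 3.0), (1, 3, 10, 3.0);
""")

# Aggregate each fact table separately to (yy, mm), then join the two results.
DRILL_ACROSS = """
WITH sales AS (
    SELECT yy, mm, SUM(qty * sale_price) AS sales_amount
    FROM sale_orderdetails JOIN day USING (date_no)
    GROUP BY yy, mm
), purchases AS (
    SELECT yy, mm, SUM(qty * purchase_price) AS purchase_amount
    FROM purchase_orderdetails JOIN day USING (date_no)
    GROUP BY yy, mm
)
SELECT sales.yy, sales.mm, sales_amount, purchase_amount
FROM sales JOIN purchases USING (yy, mm)
ORDER BY sales.yy, sales.mm
"""

monthly = conn.execute(DRILL_ACROSS).fetchall()
print(monthly)   # [(2023, 1, 70.0, 60.0), (2023, 2, 30.0, 30.0)]
```

The join on (yy, mm) only works because both marts use the same date keys and the same month definition. The inner join in this sketch also drops months that appear in only one of the two marts.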
Is it possible to calculate the revenue per month for the last year if the datamart for Purchase and the datamart for Sale do not have conformed dimensions or facts?
Datawarehouse aggregating to the product level:
SELECT Product#, SUM(Qty * Price) AS turnover
FROM Orderdetails JOIN Products USING (Product#)
GROUP BY Product#;
Drill down to the Product per Salesman level:
SELECT Product#, Salesman#, SUM(Qty * Price) AS turnover
FROM Orderdetails JOIN Products USING (Product#) JOIN Salesmen USING (Salesman#)
GROUP BY Product#, Salesman#;
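The two queries above can be tried out with a small runnable sketch. It assumes SQLite, snake_case column names in place of Product# and Salesman#, and a handful of made-up rows; the only difference between the two levels is the extra argument in the GROUP BY list.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical sample data; product_no / salesman_no stand in for Product# / Salesman#.
CREATE TABLE products (product_no INTEGER PRIMARY KEY, product_name TEXT, price REAL);
CREATE TABLE orderdetails (product_no INTEGER, salesman_no INTEGER, qty INTEGER);
INSERT INTO products VALUES (1, 'Screw', 2.0), (2, 'Nut', 1.0);
INSERT INTO orderdetails VALUES (1, 10, 5), (1, 20, 3), (2, 10, 4);
""")

# Aggregation to the product level.
product_level = conn.execute("""
    SELECT product_no, SUM(qty * price) AS turnover
    FROM orderdetails JOIN products USING (product_no)
    GROUP BY product_no
    ORDER BY product_no
""").fetchall()

# Drill down: add salesman_no to the GROUP BY list.
product_salesman_level = conn.execute("""
    SELECT product_no, salesman_no, SUM(qty * price) AS turnover
    FROM orderdetails JOIN products USING (product_no)
    GROUP BY product_no, salesman_no
    ORDER BY product_no, salesman_no
""").fetchall()

print(product_level)            # [(1, 16.0), (2, 4.0)]
print(product_salesman_level)   # [(1, 10, 10.0), (1, 20, 6.0), (2, 10, 4.0)]
```

The drilled-down rows for product 1 (10.0 and 6.0) add up to its product-level turnover (16.0), which is what "breaking an aggregate into its constituents" means here.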
Where should the Price be stored?
Dimension hierarchies:
A dimension hierarchy is a set of tables connected by one-to-many relationships towards the fact table:
Fact table: Orderdetails (Product#, Order#, Qty, Price)
Dimension hierarchy (from the fact table outwards): Orders (Order#, Customer#, Date), then Customers (Customer#, Customer-name)
In contrast to a star schema, a snowflake schema may have dimension hierarchies.
Describe the advantages and disadvantages of using dimension hierarchies/snowflake schemas.
Snowflake schema with branches:
A Snowflake schema may have branches in the dimension hierarchies:
Fact table: Orderdetails (Product#, Order#, Qty)
Dimension hierarchy: Orders (Order#, Customer#, Date), then Customers (Customer#, Customer-name)
Dimension hierarchy: Products (Product#, Product-name, Price, Group#), then Productgroups (Group#, Group-name, Department#), then Departments (Department#, Department-name)
Snowflake hierarchy: Salesmen (Salesman#, Salesman-name, Branch-office#), then Branchoffices (Branch-office#, Region#), then Regions (Region#, Region-name)
Are Customers related to the regions?
The aggregation level is given by the arguments of the GROUP BY clause.
Each result row consists of the grouping attributes x1, x2, …, xn, followed by the aggregated data and any non-aggregated data:
Salesman# Productname Turnover Branch-office#
Smith Screw 10,000 LA
Smith Bolt 30,000 LA
Smith Nut 60,000 LA
Jones Screw 20,000 SF
Jones Nut 40,000 SF
. . .
Fact table: Orderdetails (Product#, Order#, Qty, Date#, Salesman#)
Dimension: Products (Product#, Product-name, Price)
Dimension: Orders (Order#, Ordertype)
Dimension: Salesmen (Salesman#, Salesman-name, Branch-Office#)
Dimension: Time (Date#, Date-Name)
Drilling in dimension hierarchies:
Fact table: Orderdetails (Product#, Order#, Qty)
Dimension hierarchy: Orders (Order#, Customer#, Date), then Customers (Customer#, Customer-name)
Dimension hierarchy: Products (Product#, Product-name, Price, Group#), then Product groups (Group#, Group-name, Department#), then Departments
Snowflake hierarchy: Salesmen (Salesman#, Salesman-name, Branch-office#), then Branch offices (Branch-office#, Region#)
Branch-office# Turnover
LA 400,000
SF 200,000
Salesman# Turnover Branch-office#
Smith 100,000 LA
Jones 300,000 LA
Adams 200,000 SF
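The two result tables above can be reproduced with a small sketch of drilling in the Salesmen/Branch offices hierarchy. It assumes SQLite, a simplified fact table called `sales`, and salesman names used directly as keys; the individual fact rows are made up so that they sum to the slide's turnover figures.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical, simplified tables; the salesman name doubles as the key.
CREATE TABLE salesmen (salesman_name TEXT PRIMARY KEY, branch_office TEXT);
CREATE TABLE sales (salesman_name TEXT, turnover INTEGER);  -- simplified fact table
INSERT INTO salesmen VALUES ('Smith', 'LA'), ('Jones', 'LA'), ('Adams', 'SF');
INSERT INTO sales VALUES ('Smith', 40000), ('Smith', 60000),
                         ('Jones', 300000), ('Adams', 200000);
""")

# Upper level of the hierarchy: one row per branch office.
by_branch = conn.execute("""
    SELECT branch_office, SUM(turnover)
    FROM sales JOIN salesmen USING (salesman_name)
    GROUP BY branch_office
    ORDER BY branch_office
""").fetchall()

# Drill down one level: one row per salesman, keeping the branch office
# as a non-aggregated attribute by listing it in the GROUP BY clause too.
by_salesman = conn.execute("""
    SELECT salesman_name, SUM(turnover), branch_office
    FROM sales JOIN salesmen USING (salesman_name)
    GROUP BY salesman_name, branch_office
    ORDER BY salesman_name
""").fetchall()

print(by_branch)     # [('LA', 400000), ('SF', 200000)]
print(by_salesman)   # [('Adams', 200000, 'SF'), ('Jones', 300000, 'LA'), ('Smith', 100000, 'LA')]
```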
Drilling between dimension hierarchies:
Fact table: Orderdetails (Product#, Order#, Qty)
Dimension hierarchy: Orders (Order#, Customer#, Date), then Customers (Customer#, Customer-name)
Dimension: Products
Snowflake hierarchy: Salesmen (Salesman#, Salesman-name, Branch-office#), then Branch offices (Branch-office#, Region#)
Salesman# Turnover Branch-office#
Smith 100,000 LA
Jones 300,000 LA
Adams 200,000 SF
Salesman# Product-name Turnover Branch-office#
Smith Screw 10,000 LA
Smith Bolt 30,000 LA
Smith Nut 60,000 LA
Jones Screw 20,000 SF
Jones Nut 40,000 SF
. . .
Roll up to the top level:
Roll up can be executed by removing one or more arguments from the GROUP BY clause.
Salesman# Product-name Turnover Branch-office#
Smith Screw 10,000 LA
Smith Bolt 30,000 LA
Smith Nut 60,000 LA
Jones Screw 20,000 SF
Jones Nut 40,000 SF
. . .
Roll up to the product level:
Productname Turnover
Screw 100,000
Bolt 200,000
Nut 300,000
Roll up to the top level:
Top level Turnover
600,000
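A minimal sketch of rolling up by dropping GROUP BY arguments, assuming SQLite and a flat `sales` table. It loads only the five detail rows printed on the slide, so its totals are smaller than the slide's 600,000, which also covers the elided rows. The helper `turnover_by` is a hypothetical name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (salesman TEXT, product_name TEXT, turnover INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("Smith", "Screw", 10000), ("Smith", "Bolt", 30000), ("Smith", "Nut", 60000),
    ("Jones", "Screw", 20000), ("Jones", "Nut", 40000),
])

def turnover_by(*columns):
    """Aggregate turnover to the level given by the GROUP BY arguments.

    Column names are interpolated into the SQL text, so they must come
    from trusted code, never from user input.
    """
    if not columns:                      # top level: no GROUP BY at all
        return conn.execute("SELECT SUM(turnover) FROM sales").fetchall()
    group = ", ".join(columns)
    return conn.execute(
        f"SELECT {group}, SUM(turnover) FROM sales GROUP BY {group} ORDER BY {group}"
    ).fetchall()

detail = turnover_by("salesman", "product_name")   # five rows, one per salesman and product
per_product = turnover_by("product_name")          # roll up: salesman removed
top_level = turnover_by()                          # roll up: all arguments removed

print(per_product)   # [('Bolt', 30000), ('Nut', 100000), ('Screw', 30000)]
print(top_level)     # [(160000,)]
```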
Non-linear dimensions, e.g. the Date dimension:
• The granularity is day.
• Many different hierarchies.
• Two major problems:
– Calendar weeks do not aggregate to years.
– Type of Day distinguishes between working days and holidays. However, holidays (e.g. Easter) are independent of the other dimensions.
Levels of the Date dimension (figure), all above the Day level: Day of Week; Type of Day; Calendar Week; Calendar Month, Calendar Quarter, Calendar Year; Fiscal Week, Fiscal Month, Fiscal Quarter, Fiscal Year.
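The calendar week problem can be seen in a small sketch that builds Date dimension rows for one week. It assumes ISO 8601 week numbering (Python's `date.isocalendar()`); other week conventions exist, and the row layout and the name `date_dimension_row` are hypothetical.

```python
from datetime import date, timedelta

def date_dimension_row(d):
    """Build one row of a date dimension (assumed layout, ISO 8601 weeks)."""
    iso_year, iso_week, iso_weekday = d.isocalendar()
    return {
        "day": d.isoformat(),
        "day_of_week": iso_weekday,     # 1 = Monday ... 7 = Sunday
        "calendar_month": d.month,
        "calendar_year": d.year,
        "calendar_week": iso_week,
        "week_year": iso_year,          # the year the ISO week belongs to
    }

# The seven days of the week starting on Monday 2020-12-28.
week = [date_dimension_row(date(2020, 12, 28) + timedelta(days=i)) for i in range(7)]

weeks_seen = sorted({row["calendar_week"] for row in week})
years_seen = sorted({row["calendar_year"] for row in week})
print(weeks_seen)   # [53]
print(years_seen)   # [2020, 2021]
```

All seven days belong to the same calendar week, yet they fall in two calendar years, so week totals cannot simply be summed into year totals.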
What aggregation level would you use to calculate the average sales on non-holiday Mondays per month?
The time dimension:
• The granularity is minute.
• The top level is a whole day.
Levels of the time dimension (figure): Minute, Hour, Day Part, AM/PM Flag.
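A hedged sketch of a minute-granularity time dimension with the levels listed above. The day-part boundaries and the names `day_part`, `build_time_dimension` and `time_no` are assumptions made for illustration. The last lines compare the row counts of separate date and time dimensions with a single combined one for a 365-day year, which is one commonly cited size argument for keeping them apart.

```python
def day_part(hour):
    """Hypothetical day-part boundaries; a real warehouse would define its own."""
    if hour < 6:
        return "night"
    if hour < 12:
        return "morning"
    if hour < 18:
        return "afternoon"
    return "evening"

def build_time_dimension():
    """One row per minute of a whole day (24 * 60 = 1440 rows)."""
    rows = []
    for time_no in range(24 * 60):
        hour, minute = divmod(time_no, 60)
        rows.append({
            "time_no": time_no,
            "minute": minute,
            "hour": hour,
            "day_part": day_part(hour),
            "am_pm": "AM" if hour < 12 else "PM",
        })
    return rows

time_dimension = build_time_dimension()

# Rows needed for one 365-day year of dates plus the minutes of a day.
separate_rows = 365 + len(time_dimension)   # two small dimensions
combined_rows = 365 * len(time_dimension)   # one combined date-time dimension
print(len(time_dimension), separate_rows, combined_rows)   # 1440 1805 525600
```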
Why do you think Kimball recommends to separate the date and time dimensions?
Degenerate dimension = a dimension that is not created because nobody wants to aggregate data to the degenerate level.
Example: the Order dimension should be deleted, while its Time and Customer attributes should be turned into new dimensions to which it is meaningful to aggregate data.
Fact table: Orderdetails (Product#, Order#, Qty, Date#, Salesman#)
Dimension: Products (Product#, Product-name, Price)
Dimension: Orders (Order#, Time, Customer#)
Dimension: Salesmen (Salesman#, Salesman-name)
Exercise:
The figure illustrates an ER-diagram of a car rental company like Hertz or Avis.
Design a snowflake schema, star schema or galaxy for the car rental company!
Entities in the ER diagram: Customers, Car types, Reservations, Orders, Branch offices, Cars, Garages, Garage services, Pick up, Contracts, Car return.
Major problems in data warehouse design:
• Drilling in many-to-many relationships and tree structures.
• Inconsistencies caused by "slowly changing dimensions".
Slowly Changing Dimensions (SCD)
Fact table: Bank accounts (Account#, Interest-last-year, Cost-last-year, Branch#)
Dimension: Branch-offices (Branch#, Branch-name, Branch-size)
If the attributes of a dimension are dynamic (i.e. they may be updated), we say that the dimension is slowly changing.
May the Branch-size of a Branch-office change after e.g. a renovation?
May the Branch-name of a Branch-office change?
Exercise in SCD:
Suppose the attribute Branch-size is dynamic and aggregations are made to the levels (Branch-size, Year) or (Branch-size, Month).
Does this aggregation make sense and how would you solve possible problems?
Exercise:
Is the region of the customer a dynamic attribute of the customer?
Does it make sense to aggregate the rental revenue to the region of the customers?
It is possible to cheat the application generator. That is, special, very complicated data structures may function as many-to-many or network relationships when they are treated as one-to-many relationships.
How would you design a data warehouse in which it is possible to aggregate Sale to the Stock locations used for the sale?
(Figure: the integrated E-commerce/ERP data model illustrated earlier, including the Location, Product, Product-Stock, Order and Order-Detail entities.)
Exercise. Design a data warehouse for a travel agency.
Labels in the figure: Customers, Reservations, Orders, Departures/Hotel rooms/Car rentals/etc., Flight routes/Room types/Car types/service types, Buyer, Bookings, Traveler, Product owners.
Design a data warehouse (or galaxy) for an ERP system with as many meaningful dimensions as possible:
Entities in the figure: Orders, Accounts, Customers, Orderlines, Products, Stocks per product per location, Account items.
Figure label: The sales module.
The account module offers services to the other ERP modules.
End of session
Thank you!
Evaluation criteria per response type:

Response 1, where dimension records are overwritten
  Is historical information preserved: No
  Aggregation performance: In the evaluation, we define this solution to have average performance
  Storage consumption: Only the current dimension record version is stored. No redundant data is stored

Response 2, where new versions are created
  Is historical information preserved: Yes
  Aggregation performance: Version records make performance slower in proportion to the number of changes
  Storage consumption: All old versions of dimension records are stored, often with redundant attributes

Response 3, where only one historical version is saved
  Is historical information preserved: The current version and a single history-destroying version are saved
  Aggregation performance: No performance degradation occurs if either the current or the historical version is used in a query
  Storage consumption: Normally, only a single extra attribute version is stored

Response 4, which uses the top of a dynamic dimension hierarchy as a new static dimension
  Is historical information preserved: Yes
  Aggregation performance: Better or worse depending on whether both dimension tables are used in a query
  Storage consumption: The relatively large fact table must have an extra foreign key attribute

Response 5, with dimension data as fact data
  Is historical information preserved: Yes
  Aggregation performance: Better or worse depending on whether the new fact data are used in a query
  Storage consumption: The relatively large fact table must have an extra attribute for each dynamic dimension attribute

Response 6, which uses fine granularity in combination with response 1 or 3
  Is historical information preserved: The finer the granularity, the more historical state information is preserved
  Aggregation performance: The finer the granularity, the slower the performance
  Storage consumption: The finer the granularity, the more storage consumption

Response 7, which stores dynamic dimension data as static facts in another data mart
  Is historical information preserved: Yes
  Aggregation performance: Better or worse depending on whether both fact tables are used in a drill-across query
  Storage consumption: This is the most storage-consuming solution, as at least a new fact and a foreign key are stored in the new fact table
Where do the SCD responses store historic information?
• Response 1 does not store historic information.
• Response 2 stores historic information in a new record version.
• Response 3 stores one historic value in a new dimension attribute.
• Response 4 stores historic information in a new dimension relationship.
• Response 5 stores historic information in a new fact attribute.
• Response 6 can sometimes diminish the aggregation error of response 1, as finer granularity in a state fact can more accurately be related to the right dimension record.
• Response 7 stores historic information in a new fact table.
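A minimal sketch contrasting Response 1 and Response 2 on the bank example, assuming SQLite and a hypothetical versioned layout (`branch_versions`, `branch_version_no`, `is_current`) with made-up rows. Response 1 is emulated here by attributing every fact to the current version of its branch, which is what overwriting the dimension record amounts to.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Response 2: each change creates a new dimension record version (hypothetical layout).
CREATE TABLE branch_versions (branch_version_no INTEGER PRIMARY KEY,
                              branch_no INTEGER, branch_size TEXT, is_current INTEGER);
CREATE TABLE account_facts (account_no INTEGER, year INTEGER,
                            interest REAL, branch_version_no INTEGER);
-- Branch 7 was 'small' in 2022 and became 'large' in 2023.
INSERT INTO branch_versions VALUES (1, 7, 'small', 0), (2, 7, 'large', 1);
INSERT INTO account_facts VALUES (100, 2022, 50.0, 1), (100, 2023, 80.0, 2);
""")

# Response 2: facts stay linked to the version that was valid when they were recorded.
versioned = conn.execute("""
    SELECT branch_size, SUM(interest)
    FROM account_facts JOIN branch_versions USING (branch_version_no)
    GROUP BY branch_size
    ORDER BY branch_size
""").fetchall()

# Response 1 (emulated): every fact is attributed to the current branch record,
# as if the old Branch-size had been overwritten.
overwritten = conn.execute("""
    SELECT cur.branch_size, SUM(f.interest)
    FROM account_facts AS f
    JOIN branch_versions AS v USING (branch_version_no)
    JOIN branch_versions AS cur
      ON cur.branch_no = v.branch_no AND cur.is_current = 1
    GROUP BY cur.branch_size
""").fetchall()

print(versioned)     # [('large', 80.0), ('small', 50.0)]
print(overwritten)   # [('large', 130.0)]
```

With overwriting, the 2022 interest is reported under 'large' although the branch was small at the time; with versions, each year is reported under the size that applied.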