bi session 6 prof dhruv nath
TRANSCRIPT
-
7/27/2019 BI Session 6 Prof Dhruv Nath
1/58
Dhruv Nath
BITech Session on Data Warehousing
-
Slides on OLAP
-
DW : Contents
ER Model vs Dimensional Model
Designing a Data Warehouse
Starting with the ER Model
Facts and Dimensions
BI Products and Vendors
Data Warehouse Optimisation
OLAP Implementation
-
OLTP Databases use the Entity Relationship Model
Why can't we use the ER Model for Analytics / BI ?
Why no Many-Many relationships ?
-
Problems with using the ER Model / 3NF for Querying
Complex to understand and query
All kinds of tables being joined to all kinds of other tables
Maybe OK for joining a few tables. Not OK when lots of tables are involved
Complex to visualise
The E-R Model is very symmetric
No way to figure out what data is business numbers (changing) and what is constant (eg. Regions, Products)
The E-R Model is designed for capturing / updating detailed data. Not for querying it
Different Model required for querying this data by Management
-
An Easier Model to Query
Facts : Sales, Collections, Complaints
Dimensions : Model, Geography, Dealer, Product, Year
Dimensional Model : Facts and Dimensions
-
Benefits of the Dimensional Model
Simple. Can be used directly by the user
Very clear what data is business numbers (changing - facts) and what is constant (eg. Regions, Products - dimensions)
-
Example : Dimensional Model of Data
Fact : Cust. Id, Month & Yr, Region Code, Balance
Dimension : Cust. Id, Cust Name, Address, Phone
Dimension : Region Code, Region Name, Address, Manager
Dimension : Month & Yr, Quarter
What is the primary key in each dimension ?
What is the primary key in the Fact table ?
What are the foreign keys ? What relationships do they define ?
What do we call this schema ?
Star Schema
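A minimal sketch of this star schema, using Python's built-in sqlite3 module. The table names, column names and sample rows are assumptions made for illustration; only the field lists come from the slide. The query joins the Fact table to each Dimension table it needs, and to nothing else.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer_dim (cust_id TEXT PRIMARY KEY, cust_name TEXT, address TEXT, phone TEXT);
CREATE TABLE region_dim   (region_code TEXT PRIMARY KEY, region_name TEXT, address TEXT, manager TEXT);
CREATE TABLE time_dim     (month_yr TEXT PRIMARY KEY, quarter TEXT);
-- Fact table: one foreign key per dimension; together they form the primary key.
CREATE TABLE balance_fact (
    cust_id     TEXT REFERENCES customer_dim(cust_id),
    month_yr    TEXT REFERENCES time_dim(month_yr),
    region_code TEXT REFERENCES region_dim(region_code),
    balance     REAL,
    PRIMARY KEY (cust_id, month_yr, region_code)
);
""")

# Hypothetical sample data.
conn.executemany("INSERT INTO customer_dim VALUES (?,?,?,?)",
                 [("C1", "Asha", "Delhi", "111"), ("C2", "Ravi", "Chennai", "222"),
                  ("C3", "Meera", "Delhi", "333")])
conn.executemany("INSERT INTO region_dim VALUES (?,?,?,?)",
                 [("N", "North", "Delhi", "Mgr A"), ("S", "South", "Chennai", "Mgr B")])
conn.executemany("INSERT INTO time_dim VALUES (?,?)",
                 [("2019-01", "Q1"), ("2019-02", "Q1")])
conn.executemany("INSERT INTO balance_fact VALUES (?,?,?,?)",
                 [("C1", "2019-01", "N", 100.0), ("C3", "2019-01", "N", 30.0),
                  ("C2", "2019-01", "S", 50.0), ("C1", "2019-02", "N", 120.0),
                  ("C2", "2019-02", "S", 70.0)])

# Total balance by region for each month: the fact joins only to the region dimension.
rows = conn.execute("""
    SELECT r.region_name, f.month_yr, SUM(f.balance)
    FROM balance_fact f
    JOIN region_dim r ON r.region_code = f.region_code
    GROUP BY r.region_name, f.month_yr
    ORDER BY r.region_name, f.month_yr
""").fetchall()
```

Each dimension's own key is its primary key; the fact's primary key is the combination of the foreign keys.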
-
Example : Dimensional Model of Data
Fact : Cust. Id, Month & Yr, Region Code, Balance
Dimension : Cust. Id, Cust Name, Address, Phone
Dimension : Region Code, Region Name, Address, Manager
Dimension : Month & Yr, Quarter
Each Dimension represents an entity (with attributes)
The Star Schema can be visualised as a Data Cube. How ?
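One way to picture the cube, as a sketch only: each Fact row becomes one cell, addressed by one value per dimension. The sample rows and the function names below are hypothetical.

```python
from collections import defaultdict

# Hypothetical fact rows: (cust_id, month_yr, region_code, balance)
fact_rows = [
    ("C1", "2019-01", "N", 100.0),
    ("C2", "2019-01", "S", 50.0),
    ("C1", "2019-02", "N", 120.0),
]

# The cube: three axes (customer, time, region), one cell per fact row.
cube = {(c, m, r): bal for c, m, r, bal in fact_rows}

def slice_cube(cube, month_yr):
    """Fix the time axis at one month and return the remaining 2-D slice."""
    return {(c, r): v for (c, m, r), v in cube.items() if m == month_yr}

def roll_up_customers(cube):
    """Collapse the customer axis: total balance per (month, region)."""
    totals = defaultdict(float)
    for (c, m, r), v in cube.items():
        totals[(m, r)] += v
    return dict(totals)
```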
-
Visualising a Star Schema as a Data Cube
Querying : OLAP (vs OLTP)
-
Dimensional Model
Fact : Cust. Id, Month & Yr, Region Code, Balance
Dimension : Cust. Id, Cust Name, Address, Phone
Dimension : Region Code, Region Name, Address, Manager
Dimension : Month & Yr, Quarter
Can have any number of dimensions
Usually 5 - 15
How are snapshots added on ?
-
Exercise : Compare the ER Model with the Dimensional Model of Data
ER Model
Designed for entering / storing data (transactions)
Optimized for transactions: single row entry and retrieval
Thousands of concurrent users
No way to figure out what data is business numbers (changing) and what is constant / static / near-static (eg. Regions, Products). All of them are fields or relations. Therefore tough to implement a query
JOINs needed between any combination of tables. Therefore tough to implement a query
Dimensional Model
Designed for analysis / querying by the user
Optimized for bulk load and large, complex, unpredictable queries
Few concurrent users
What is constant / static / near-static (dimensions) and what are business numbers (facts) very clear. Therefore easier to implement a query
JOINs only between the Fact Table and each Dimension Table. Therefore easier to implement a query
-
Data Marts
Fact : Cust. Id, Month & Yr, Region Code, Balance
Dimension : Cust. Id, Cust Name, Address, Phone
Dimension : Region Code, Region Name, Address, Manager
Dimension : Month & Yr, Quarter
How would Data Marts created out of such a Data Warehouse look ?
Similar. Some fields may be missing. Examples ?
Corporate customers : No personal details
Retail customers : No Organisational details
Data Cubes usually formed in Data Marts
-
DW : Contents
ER Model vs Dimensional Model
Designing a Data Warehouse
Starting with the ER Model
Facts and Dimensions
BI Products and Vendors
Data Warehouse Optimisation
OLAP Implementation
-
Exercise : ER-Model to Dimensional Model
Entities : SALES_REP, ORDER, CUSTOMER, PRODUCT, Line_Item
Relationships : Places_Order, Sells_to, Contains, Is_Ordered
Exercise : Convert this ER Model into a Dimensional Model (Star Schema)
Print for Students
-
Dimensional Model
SALES_REP : Emp. Id, Name, Qualifications
CUSTOMER : Cust Id, Cust Name, Address
TIME : Date, Quarter
ORDER : Order Num, Credit Terms, Lead Time
PRODUCT : Product Code, Product Name, Brand, Rate
LINE_ITEM (Fact) : Emp Id, Cust Id, Date, Order Num, Product Code, Quantity
Star : Instead of keeping a relationship from Sales_Rep to Customer, the relationship is from both to line item
What are the Foreign Keys in the Fact Table ?
What is the primary key in the Fact Table ?
New Dimension created : Time.
Time will always be a dimension in a Data Warehouse
-
Exercise : Is this a Normalised design ?
SALES_REP : Emp. Id, Name, Qualifications
CUSTOMER : Cust Id, Cust Name, Address
TIME : Date, Quarter
ORDER : Order Num, Credit Terms, Lead Time
PRODUCT : Product Code, Product Name, Brand, Rate
LINE_ITEM (Fact) : Emp Id, Cust Id, Date, Order Num, Product Code, Quantity
Print for Students
-
Exercise : Is this a Normalised design ?
In the Fact Table, Emp Id is functionally dependent on (Cust Id + Date), not on the primary key
Logically, every time Customer P places an order with Salesman Q, we will have one row in the fact table for this Customer, Salesman combination
So redundancy. Cust Id should have been enough.
Therefore anomalies ???
Insert : Cannot insert a Customer Salesman relationship till the Customer places an order
Delete : If an order is cancelled, and this is the only order the salesman has from this Customer, we lose the Salesman Customer relationship
Does this lack of normalisation cause a problem ?
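A small sketch of the delete anomaly described above. The fact rows and the helper name are hypothetical; the point is that the Salesman Customer relationship lives only inside fact rows, so it disappears with the last order.

```python
# Hypothetical fact rows: (emp_id, cust_id, date, order_num, product_code, quantity)
fact = [
    ("Q", "P", "2019-07-01", 1001, "X1", 5),
    ("Q", "P", "2019-07-01", 1001, "X2", 2),
]

def rep_customer_pairs(rows):
    """The Salesman Customer relationship exists only as a by-product of fact rows."""
    return {(emp, cust) for emp, cust, *_ in rows}

before = rep_customer_pairs(fact)

# Delete anomaly: cancel order 1001, the only order Salesman Q has from Customer P.
fact = [row for row in fact if row[3] != 1001]
after = rep_customer_pairs(fact)
```

Before the cancellation the pair (Q, P) is known; afterwards no row mentions it.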
-
Does lack of normalisation cause a problem ?
A Data Warehouse has no updates, deletions or insertions of individual records
Only snapshots get added on with time
So no anomalies. Lack of normalisation is not a problem
The E-R Model tries to remove redundancy completely
The Dimensional model tries to simplify the schema, and therefore brings in redundancy
eg. the relationship between sales_rep and customer is repeated in every line_item where these two are involved
-
Does lack of normalisation cause a problem ? (contd.)
Cannot enter a Salesman Customer relationship till the customer places at least one order
Instead it is shown as a relationship between a customer and a line item, and a salesperson and the same line item. The relationship is only through the line item (Fact)
Is this a problem ?
In a DW we decide what our focus is - those are the facts.
In this case our fact is the line items sold, not the relationship between the salesperson / customer rep and the customer
If the relationship (even without the order) is important to maintain, we create another Star Schema around some other fact (say, Opportunity)
-
Constellation
Multiple STARs
-
Exercise : Implementing Data Marts
-
DW : Contents
ER Model vs Dimensional Model
Designing a Data Warehouse
Starting with the ER Model
Facts and Dimensions
BI Products and Vendors
Data Warehouse Optimisation
OLAP Implementation
-
Which of these can be facts ?
Region, Sales, No. of Complaints, Type of complaint, Outstandings, Premium paid, Salary, Colour, Cash_on_hand, collections, breakages, product, customer, interest
Typical characteristics of facts ??
-
Typical Characteristics of Facts
Numerical
Additive - why ?
Querying involves scanning lots of records
The end result of the query should be short - one or two pages / screens
Additive facts can provide this
Examples ?
Sales, Collections, Revenue, Expenses
Continuously valued (even counts, eg. no. of complaints / no. of transactions, are considered continuously valued)
-
Will Facts always be additive ?
Semi-additive Facts ? Explain
Account Balance - Explain
Can be added across some dimensions, not all
Guidelines : What forms additive facts and what forms semi-additive facts ?
Flows vs Levels (eg. Deposits vs Balance, Collections vs Current outstandings)
Non-additive Facts ? Explain
Interest %age, %age target achievement, %age profit
Cannot be added across any dimension
Can this be converted into an Additive fact ?
Convert interest %age to an absolute value
When is this done ?
ETL (Transform stage)
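A short sketch of the three kinds of fact, under assumed sample data (the account ids, amounts and interest rates are hypothetical). Deposits are a flow and add up everywhere; balance is a level and adds up across accounts but not across months; an interest percentage cannot be added at all until it is converted to an absolute amount.

```python
# Hypothetical monthly rows per account: deposits are a flow, balance is a level.
rows = [
    {"account": "A1", "month": "2019-01", "deposits": 100.0, "balance": 100.0},
    {"account": "A1", "month": "2019-02", "deposits": 50.0,  "balance": 150.0},
    {"account": "A2", "month": "2019-01", "deposits": 200.0, "balance": 200.0},
    {"account": "A2", "month": "2019-02", "deposits": 0.0,   "balance": 200.0},
]

# Additive: deposits can be summed across accounts and across months.
total_deposits = sum(r["deposits"] for r in rows)

# Semi-additive: balances add across accounts within one month ...
feb_balance = sum(r["balance"] for r in rows if r["month"] == "2019-02")
# ... but summing across months counts the same money more than once.
misleading_total = sum(r["balance"] for r in rows)

# Non-additive: an interest %age cannot be summed. In the Transform stage it is
# converted to an absolute monthly amount, which is additive.
rate_pct = {"A1": 4.0, "A2": 6.0}  # assumed annual interest rates
for r in rows:
    r["interest_amount"] = r["balance"] * rate_pct[r["account"]] / 100 / 12

feb_interest = sum(r["interest_amount"] for r in rows if r["month"] == "2019-02")
```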
-
Facts will usually be additive, or semi-additive. Avoid non-additive facts
Additive Facts : Summarise
However, it is possible to have facts without
satisfying some or all of these conditions
Ultimately, the designer decides.
-
Review : Facts - Guidelines
Numerical
Continuously valued
Additive
Semi-additive
Non-additive
-
Dimensions
Determined by what you want as row and column headers in your query reports
Usually :
Textual
Discrete
Could also be numeric. Where ?
Where they form column headers, and no calculations are done on them (eg. Age, Salary). Typically a range
Time is always one dimension. Why ?
Because of snapshots
Dimensions are an entry point into a Data Warehouse
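A sketch of a numeric attribute used as a dimension: Age is turned into a range, and the ranges become the report headers. The band boundaries and the customer list are assumptions for illustration.

```python
from collections import Counter

# Assumed age bands: (label, lower bound inclusive, upper bound exclusive)
AGE_BANDS = [("Under 30", 0, 30), ("30-49", 30, 50), ("50 and over", 50, 200)]

def age_band(age):
    """Map a numeric age to its range label; no arithmetic is done on the age itself."""
    for label, low, high in AGE_BANDS:
        if low <= age < high:
            return label
    raise ValueError(f"age out of range: {age}")

# Hypothetical customers: (cust_id, age)
customers = [("C1", 24), ("C2", 37), ("C3", 45), ("C4", 61)]

# The bands act as column headers; the count is the number reported under each.
customers_per_band = Counter(age_band(age) for _, age in customers)
```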
-
Exercise : Facts or Dimensions ?
Region, Sales, No. of Complaints, Type of complaint, Outstandings, Premium paid, Salary, Colour, Cash_on_hand, collections, breakages, product, customer, interest
The same thing can be modelled as a fact or as a dimension. Depends on the designer
Numeric dimensions are in the form of a range
-
DW : Contents
ER Model vs Dimensional Model
Designing a Data Warehouse
Starting with the ER Model
Facts and Dimensions
BI Products and Vendors
Data Warehouse Optimisation
OLAP Implementation
-
BI Products and Vendors
Layers : Clients, Data Cubes, Data Marts, Data Warehouse, OLTP Databases
DBMS Vendors : Oracle, Microsoft SQL Server, IBM (DB2), ..
-
BI Products and Vendors
Layers : Clients, Data Cubes, Data Marts, Data Warehouse, OLTP Databases
BI Tool Vendors : SAS, Cognos (IBM), Business Objects (SAP), Qlikview, ..
Provide everything except the OLTP DBMS and DW. ETL included
-
Implementing a Data Warehouse : Where should the Pilot be done ?
Four Regions (rep by 4 teams) :
1. Dynamic and keen Regional Manager, very poor historical data
2. Excellent historical data. RM interested but doesn't have much time
3. Recently started Region. Not much historical data, but good current data. RM interested, may spend some time
4. Small, unimportant Region, but good RM, and interested. Good historical data, but not too much of it
-
DW : Contents
ER Model vs Dimensional Model
Designing a Data Warehouse
Starting with the ER Model
Facts and Dimensions
BI Products and Vendors
Data Warehouse Optimisation
OLAP Implementation
-
Exercise : How big are the Fact and Dimension Tables ? a) Number of records b) Size in bytes
Fact : Cust. Id, Month & Yr, Region Code, Balance
Dimension : Cust. Id, Cust Name, Address, Phone
Dimension : Region Code, Region Name, Address, Manager
Dimension : Month & Yr, Quarter
1 lakh customers, 10 regions.
Data stored for the past 10 years
What if we store daily balances, and for each of the 1000 branches ?
Implications ? Space, speed. So what do we do ?
Optimise on Fact table size. Ignore dimension tables !!!
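A rough sizing sketch for this exercise. Two things are assumptions, not given on the slide: that each customer has one balance row per month (in one region), and the field widths in bytes. With daily balances, the worst case of every customer having a balance in every branch is assumed.

```python
customers = 100_000      # 1 lakh
regions = 10
months = 10 * 12         # 10 years of monthly snapshots

# Dimension tables: one row per member.
dimension_rows = {"customer": customers, "region": regions, "time": months}

# Assumption: one fact row per customer per month.
fact_rows_monthly = customers * months

# Assumed field widths in bytes: Cust. Id 10 + Month & Yr 6 + Region Code 4 + Balance 8
FACT_ROW_BYTES = 10 + 6 + 4 + 8
fact_bytes_monthly = fact_rows_monthly * FACT_ROW_BYTES

# Daily balances for each of 1000 branches (worst case: every customer in every branch).
days = 10 * 365
branches = 1000
fact_rows_daily = customers * days * branches
```

Under these assumptions the Fact table has 12 million rows against at most 1 lakh rows in the largest dimension, which is why the optimisation effort goes into the Fact table.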
-
Optimisation : Exercise : Can we modify this Star Schema to cut down space ?
SALES_REP : Emp. Id, Name, Qualifications
CUSTOMER : Cust Id, Cust Name, Address
TIME : Date, Quarter
ORDER : Order Num, Credit Terms, Lead Time
PRODUCT : Product Code, Product Name, Brand, Rate
LINE_ITEM (Fact) : Emp Id, Cust Id, Date, Order Num, Product Code, Quantity
Is the Dimension
-
Star Schema Option 2
SALES_REP : Emp. Id, Name, Qualifications
CUSTOMER : Cust Id, Cust Name, Address
TIME : Date, Quarter
ORDER : Order Num, Credit Terms, Lead Time
PRODUCT : Product Code, Product Name, Brand, Rate
LINE_ITEM (Fact) : Emp Id, Cust Id, Date, Order Num, Product Code, Quantity
More highly Denormalised Dimension Table, extra fields : Emp. Id, Name, Qualifications
Is the Dimension Table Normalised ?
Advantage / Disadvantage ?
Fact Table space vs. Ease of Querying
Which one would you use ?
-
Star Schema Option 3
SALES_REP : Emp. Id, Name, Qualifications
CUSTOMER : Cust Id, Cust Name, Address
TIME : Date, Quarter
ORDER : Order Num, Credit Terms, Lead Time
PRODUCT : Product Code, Product Name, Brand, Rate
LINE_ITEM (Fact) : Emp Id, Cust Id, Date, Order Num, Product Code, Quantity
More highly Denormalised Dimension Table, extra fields : Emp. Id, Name, Qualifications; Cust Id, Cust Name, Address; Emp. Id, Name, Qualifications
Advantage / Disadvantage ?
Fact Table space vs. Ease of Querying
Which one would you use ?
-
Optimisation : What occupies the maximum space in the Fact Table ?
Fact : Cust. Id, Month & Yr, Region Code, Balance
Dimension : Cust. Id, Cust Name, Address, Phone
Dimension : Region Code, Region Name, Address, Manager
Dimension : Month & Yr, Quarter
Keys
How do we reduce the size of the keys ?
Use surrogate keys
-
Optimisation : Use Surrogate keys
Operational Keys - Disadvantage ?
English-like Ids : occupy space
Surrogate Keys - meaningless integers. 2 or 4 byte integers most common. Advantage ?
Much shorter
Disadvantage ?
Processing reqd to transform from operational to surrogate keys
In any case, when the data comes from multiple sources, keys in all but one of the sources need to change
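A minimal sketch of the operational-to-surrogate transformation. The function name and the sample Ids are hypothetical; the mapping would be built once per dimension and reused when loading the Fact table.

```python
def assign_surrogate_keys(operational_ids, start=1):
    """Map each distinct operational key to a small, meaningless integer."""
    mapping = {}
    for op_id in operational_ids:
        if op_id not in mapping:
            mapping[op_id] = start + len(mapping)
    return mapping

# Hypothetical English-like operational Ids arriving from the source system.
cust_keys = assign_surrogate_keys(
    ["CUST-DELHI-0001", "CUST-MUMBAI-0042", "CUST-DELHI-0001"]
)

# Fact rows are rewritten to carry the short integer key instead of the long Id.
fact_rows = [("CUST-MUMBAI-0042", 250.0), ("CUST-DELHI-0001", 90.0)]
fact_with_keys = [(cust_keys[cust_id], amount) for cust_id, amount in fact_rows]
```

The lookup on every load is the processing cost mentioned above; the saving is a few bytes per key on every Fact row.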
-
Exercise : Add surrogate keys to this schema
SALES_REP : Emp Key (PK), Emp. Id, Name, Qualifications
CUSTOMER : Cust Key (PK), Cust Id, Cust Name, Address
TIME : Date, Quarter
ORDER : Order Key, Order Num, Credit Terms, Lead Time
PRODUCT : Product Key, Product Code, Product Name, Brand, Rate
LINE_ITEM (Fact) : Emp Id, Cust Id, Date, Order Num, Product Code, Quantity, Emp Key, Cust Key, Order Key, Product Key
Do we need both the original and the surrogate key in the Dimension Table ? Fact Table ?
-
Designing a Data Warehouse
SALES_REP : Emp Key (PK), Emp. Id, Name, Qualifications
CUSTOMER : Cust Key (PK), Cust Id, Cust Name, Address
TIME : Date Key (PK), Date, Quarter
ORDER : Order Key, Order Num, Credit Terms, Lead Time
PRODUCT : Product Key, Product Code, Product Name, Brand, Rate
LINE_ITEM (Fact) : Emp Id, Cust Id, Date, Order Num, Product Code, Quantity, Emp Key, Cust Key, Date Key, Order Key, Product Key
Based on this exercise, what is the process for converting an ER Model into a Dimensional Model (Data Warehouse) ?
-
The DW Design Process
Identify an association table as the central fact table
Choose the Dimensions
Add date (time) dimension
Replace all operational keys with surrogate keys
Promote foreign keys from each dimension table to the fact table
Choose the Facts
-
Are the Dimensions normalised ?
Fact : Cust. Id, Month, Region Code, Balance
Dimension : Cust. Id, Cust Name, Address, Phone
Dimension : Region Code, Region Name, Address, Manager
Dimension : Month & Yr, Quarter
Add fields to each dimension to make it denormalised
Now, what does the schema look like if we normalise each dimension table ?
Snowflake Schema
Are Snowflake Schemas desirable ? Why ?
Speed of querying. Complexity of querying for the user
Thinking question : Is there any situation where we would normalise a dimension table ?
-
DW : Contents
ER Model vs Dimensional Model
Designing a Data Warehouse
Starting with the ER Model
Facts and Dimensions
BI Products and Vendors
Data Warehouse Optimisation
OLAP Implementation
-
Representing dimensions
Product hierarchy : SKU, Brand, Product, Product Category, Department, All products
Store hierarchy : Store, Locality, PIN Code, City, Region, All
Time hierarchy : Date, Month, Quarter, Year, All
Promotion hierarchy : Promotion, All
How do we represent a query - eg. Get Sales by SKU by Store by Date by Promotion ?
How do we show a Roll-up / Drill-down ?
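A sketch of a roll-up and drill-down along the time hierarchy (Date, Month, Quarter, Year, All). The sales rows and the function name are hypothetical; each level is derived from the date, and the same fact rows are re-summed at whichever level is requested.

```python
from collections import defaultdict
from datetime import date

# Hypothetical fact rows: (sku, store, sale_date, promotion, sales)
sales_fact = [
    ("SKU1", "Store A", date(2019, 1, 15), "None", 100.0),
    ("SKU1", "Store A", date(2019, 2, 10), "None", 150.0),
    ("SKU2", "Store B", date(2019, 4, 5), "Summer", 80.0),
]

# Each level of the time hierarchy is a function of the date.
TIME_LEVELS = {
    "date": lambda d: d.isoformat(),
    "month": lambda d: f"{d.year}-{d.month:02d}",
    "quarter": lambda d: f"{d.year}-Q{(d.month - 1) // 3 + 1}",
    "year": lambda d: str(d.year),
    "all": lambda d: "All",
}

def sales_by_time(level):
    """Sum sales at the requested level of the time hierarchy."""
    totals = defaultdict(float)
    for _sku, _store, sale_date, _promo, sales in sales_fact:
        totals[TIME_LEVELS[level](sale_date)] += sales
    return dict(totals)

by_month = sales_by_time("month")      # drill-down: finer level
by_quarter = sales_by_time("quarter")  # roll-up: coarser level
```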
-
Representing dimensions
Product hierarchy : SKU, Brand, Product, Product Category, Department, All products
Store hierarchy : Store, Locality, PIN Code, City, Region, All
Time hierarchy : Date, Month, Quarter, Year, All
Promotion hierarchy : Promotion, All
For this query, we need to add fields across rows in the Fact Table. How many rows need to be summed ?
Problems ? Speed.
Solution ? Pre-aggregate sums and store
-
Multiple levels of aggregates
Product hierarchy : SKU, Brand, Product, Product Category, Department, All products
Store hierarchy : Store, Locality, PIN Code, City, Region, All
Time hierarchy : Date, Month, Quarter, Year, All
Promotion hierarchy : Promotion, All
Store multiple level aggregates
Redundancy : To speed up querying
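A sketch of storing aggregates at more than one level. The base rows, the SKU-to-Brand mapping and the helper name are assumptions; the aggregates are computed once at load time and then read directly, at the cost of storing the same numbers more than once.

```python
from collections import defaultdict

# Hypothetical base fact rows at the lowest grain: (sku, city, month, sales)
base_fact = [
    ("SKU1", "Delhi", "2019-01", 100.0),
    ("SKU2", "Delhi", "2019-01", 40.0),
    ("SKU1", "Mumbai", "2019-01", 60.0),
    ("SKU1", "Delhi", "2019-02", 70.0),
]
SKU_TO_BRAND = {"SKU1": "BrandX", "SKU2": "BrandY"}  # assumed product hierarchy

def build_aggregate(rows, key_fn):
    """Sum sales over all base rows that share the same aggregate key."""
    totals = defaultdict(float)
    for row in rows:
        totals[key_fn(row)] += row[3]
    return dict(totals)

# Computed once during the load and stored alongside the base fact table.
aggregates = {
    ("brand", "month"): build_aggregate(base_fact, lambda r: (SKU_TO_BRAND[r[0]], r[2])),
    ("city",): build_aggregate(base_fact, lambda r: (r[1],)),
}

# A query at Brand by Month level is now a lookup, not a scan of the base rows.
brandx_jan = aggregates[("brand", "month")][("BrandX", "2019-01")]
```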
-
Aggregation : Issues
When are aggregates computed ?
During every update
How do we decide what aggregates to keep ?
Frequency of usage / repeat queries
Priority of users
Managers / Analysts should figure out the likely frequency, and therefore what aggregates to keep
-
Aggregation : Issues
Where are Aggregations stored ?
Separate Fact table
Families of Stars (Constellations)
When are they computed ?
During every update
How do we decide what aggregates to keep ?
Frequency of usage / repeat queries
Priority of users
Users should not be aware of aggregation. The software automatically uses the aggregate Fact table to answer the query. Why ?
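A sketch of how software might pick the aggregate Fact table without the user knowing. The tables, level names and functions here are hypothetical and do not describe any particular product; the idea is only that the smallest stored table containing every requested level is chosen.

```python
# Hypothetical stored tables: a base fact table and one aggregate fact table.
base_fact = {("SKU1", "2019-01"): 100.0, ("SKU2", "2019-01"): 40.0, ("SKU1", "2019-02"): 70.0}
aggregate_by_month = {("2019-01",): 140.0, ("2019-02",): 70.0}

# (levels available, data); smaller tables are listed first so they are preferred.
TABLES = [
    (("month",), aggregate_by_month),
    (("sku", "month"), base_fact),
]

def choose_table(requested_levels):
    """Return the first (smallest) table that has every requested level."""
    for levels, data in TABLES:
        if set(requested_levels) <= set(levels):
            return levels, data
    raise LookupError(f"no table can answer {requested_levels}")

def total_sales(group_by):
    """Answer the query from whichever table was chosen; the user never names it."""
    levels, data = choose_table(group_by)
    positions = [levels.index(level) for level in group_by]
    totals = {}
    for key, value in data.items():
        group = tuple(key[i] for i in positions)
        totals[group] = totals.get(group, 0.0) + value
    return levels, totals

used_for_month, by_month = total_sales(("month",))
used_for_sku, by_sku = total_sales(("sku",))
```

The query by month is answered from the aggregate table; the query by SKU falls back to the base Fact table.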
-
Implementing OLAP
Relational OLAP
Implemented using a regular Relational DBMS
Linked list structures
Multi-Dimensional OLAP
MDDB created in advance and stored for querying
Array structures
Advantages and Disadvantages ?
-
ROLAP vs MOLAP
ROLAP
Linked List Structure - slow
Space Optimised - only records that have some value are stored
All data is available in the ROLAP. Can handle large DW
No Pre-aggregated data - therefore slow
MOLAP
Array Structure - therefore fast
All cells in the Fact Table are stored whether they exist or not. Therefore huge space (Explain)
eg. (Bank example) A customer does not have any Account in a given branch
A customer does not perform any transaction in most of his accounts on specific days
Therefore only small DW can be handled.
For large DW, summarised data can be kept in the MDDB. Drilling down requires going back to ROLAP (called HOLAP - Hybrid OLAP)
Pre-aggregated data - therefore fast
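A sketch of the space difference, using the bank example. The customers, branches and balances are hypothetical. The ROLAP-style store keeps one row per existing fact; the MOLAP-style store keeps one array cell per combination, whether or not a value exists.

```python
customers = ["C1", "C2", "C3"]
branches = ["B1", "B2", "B3", "B4"]

# Hypothetical balances: most customers have an account in only one branch.
balances = {("C1", "B1"): 100.0, ("C2", "B3"): 50.0, ("C3", "B1"): 75.0}

# ROLAP-style: one stored row per existing fact only.
rolap_rows = [(c, b, v) for (c, b), v in balances.items()]

# MOLAP-style: a dense array with one cell per (customer, branch) combination.
molap_array = [[balances.get((c, b)) for b in branches] for c in customers]
cust_pos = {c: i for i, c in enumerate(customers)}
branch_pos = {b: i for i, b in enumerate(branches)}

def molap_lookup(customer, branch):
    """Go straight to the cell by its position on each axis."""
    return molap_array[cust_pos[customer]][branch_pos[branch]]

cells_stored_rolap = len(rolap_rows)
cells_stored_molap = len(customers) * len(branches)
empty_cells = sum(cell is None for row in molap_array for cell in row)
```

Here 9 of the 12 array cells are empty; with lakhs of customers and 1000 branches the empty share grows much larger.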
-
MOLAP
Sparse Matrix techniques used to optimise space
-
ROLAP vs MOLAP
DBMS vendors started off with ROLAP (knowhow already existed), but are now adding MOLAP
Pure BI vendors largely into MOLAP (proprietary)
-
Role Play : Implementation across Multiple Locations
-
Book
The Data Warehouse Toolkit - Ralph Kimball, Margy Ross - Wiley
-
Dhruv Nath
BITech Session on Data Warehousing