database management systems (l5)

30
CMPUT 391 Database Management Systems Data Warehousing & OLAP Textbook: 17.1 – 17.5 (first edition: 19.1 – 19.5) Based on slides by Lewis, Bernstein and Kifer and other sources University of Alberta Dr. Jörg Sander, 2006 1 CMPUT 391 – Database Management Systems

Upload: others

Post on 11-Mar-2022

1 views

Category:

Documents


0 download

TRANSCRIPT

CMPUT 391Database Management Systems

Data Warehousing & OLAP

Textbook: 17.1 – 17.5(first edition: 19.1 – 19.5)

Based on slides by Lewis, Bernstein and Kiferand other sources

University of AlbertaDr. Jörg Sander, 2006 1CMPUT 391 – Database Management Systems

Why Data Warehouses• Businesses have a lot of data, operational data and facts,

stored in heterogeneous and distributed databases.– in different databases – in different physical locations– in different formats

• Decision makers need fast access to this information in a summarized form, with a focus often on historical data

University of AlbertaDr. Jörg Sander, 2006 2CMPUT 391 – Database Management Systems

What Is Data Warehouse?• consolidates the information from different data sources, enabling

OLAP (online analytical processing), to help decision support. • is maintained separately from an operational database (which is used

for OLTP – online transaction processing).

Corporate data

Data Mart Data Mart Data Mart Data Mart

CorporateData Warehouse

Option 1:Consolidate Data Marts

Option 2:Build from scratch

University of AlbertaDr. Jörg Sander, 2006 3CMPUT 391 – Database Management Systems

OLTP Compared With OLAP• On Line Transaction Processing -- OLTP

– Maintain a database that is an accurate model of some real-world enterprise

• Short simple transactions• Relatively frequent updates• Transactions access only a small fraction of the database

• On Line Analytic Processing -- OLAP– Use information in database to guide strategic decisions

• Complex aggregation queries • Infrequent updates• Transactions access a large fraction of the database

University of AlbertaDr. Jörg Sander, 2006 4CMPUT 391 – Database Management Systems

Why Do We Separate DWs From Operational DBs?

• Performance reasons:– OLAP necessitates special data organization

that supports multidimensional views.– OLAP queries would degrade operational DB.– OLAP is read only.– No concurrency control and recovery.

• Decision support requires historical data.• Decision support requires consolidated data.

University of AlbertaDr. Jörg Sander, 2006 5CMPUT 391 – Database Management Systems

Fact Tables• Many OLAP applications are based on a fact table• For example, a supermarket application might be

based on a tableSales (Market_Id, Product_Id, Time_Id, Sales_Amt)

• The table can be viewed as a multidimensional data cube– The first three columns are the dimensions

representing specific • supermarkets• products • time intervals

– The fourth column, the Sales_Amt, is a function of the other three, called a measure

University of AlbertaDr. Jörg Sander, 2006 6CMPUT 391 – Database Management Systems

Dimension Tables• The dimensions of the fact table can be further

described with dimension tables• Fact table

– Sales (Market_id, Product_Id, Time_Id, Sales_Amt)• Dimension Tables

– Market (Market_Id, City, Province, Region)– Product (Product_Id, Name, Category, Price)– Time (Time_Id, Week, Month, Quarter)

University of AlbertaDr. Jörg Sander, 2006 7CMPUT 391 – Database Management Systems

Star Schema

• The fact and dimension relations can be displayed in an E-R diagram, which suggests a star and is called a star schema

University of AlbertaDr. Jörg Sander, 2006 8CMPUT 391 – Database Management Systems

Table View of a Star Schema

TimeIdDayMonthYear

Time

CustIdCustNameCustCityCustCountry

Cust

Sales Fact Table

Time

Product

Store

Customer

unit_sales

dollar_sales

ProductNoProdNameProdDescCategory

Product

StoreIDCityProvinceCountryRegion

Store

(Source: JH)

Two different measures

University of AlbertaDr. Jörg Sander, 2006 9CMPUT 391 – Database Management Systems

Aggregation• Many OLAP queries involve aggregation of the

data in the fact table• For example, to find the total sales (over time) of

each product in each market, we might use

SELECT S.Market_Id, S.Product_Id, SUM (S.Sales_Amt)FROM SalesSales SGROUP BY S.Market_Id, S.Product_Id

• The aggregation is over the entire time dimension and thus produces a two-dimensional view of the data

University of AlbertaDr. Jörg Sander, 2006 10CMPUT 391 – Database Management Systems

Aggregation over Time

• The output of the previous query

University of AlbertaDr. Jörg Sander, 2006 11CMPUT 391 – Database Management Systems

SUM(Sales_Amt)M1 M2 M3 M4

P1 3003 1503 …P2 6003 2402 …P3 4503 3 …P4 7503 7000 …P5 … … …

Market_Id

Prod

uct_

Id

Concept-Hierarchies

Many dimensions form an aggregation hierarchy(total or partial orders)

Examples:

Markets(Market_Id → City → Province → Country → Region)

weekTime(year → quarter day)

month

University of AlbertaDr. Jörg Sander, 2006 12CMPUT 391 – Database Management Systems

Drilling Down and Rolling Up• Executing a series of queries that moves down a

hierarchy (e.g., from aggregation over regions to that over provinces) is called drilling down– Requires the use of the fact table or information more

specific than the requested aggregation (e.g., cities)• Executing a series of queries that moves up the

hierarchy (e.g., from provinces to regions) is called rolling up– Note: In a rollup, coarser aggregations can be

computed using prior queries for finer aggregationsUniversity of AlbertaDr. Jörg Sander, 2006 13CMPUT 391 – Database Management Systems

Drilling DownDrilling down on market: from Region to Province

SalesSales (Market_Id, Product_Id, Time_Id, Sales_Amt)MarketMarket (Market_Id, City, Province, Region)

1. SELECT S.Product_Id, M.Region, SUM (S.Sales_Amt)FROM SalesSales S, MarketMarket MWHERE M.Market_Id = S.Market_IdGROUP BY S.Product_Id, M.Region

2. SELECT S.Product_Id, M.Province, SUM (S.Sales_Amt)FROM SalesSales S, MarketMarket MWHERE M.Market_Id = S.Market_IdGROUP BY S.Product_Id, M.Province,

University of AlbertaDr. Jörg Sander, 2006 14CMPUT 391 – Database Management Systems

Rolling UpRolling up on market, from Province to Region

If we have already created a table, Province_SalesProvince_Sales, using

1. SELECT S.Product_Id, M.Province, SUM (S.Sales_Amt)INTO Province_SalesProvince_SalesFROM Sales Sales S, MarketMarket MWHERE M.Market_Id = S.Market_IdGROUP BY S.Product_Id, M.Province

then we can roll up from there to:

22. SELECT T.Product_Id, M.Region, SUM (T.Sales_Amt)FROM Province_SalesProvince_Sales T, MarketMarket MWHERE M.Province = T.ProvinceGROUP BY T.Product_Id, M.Region

University of AlbertaDr. Jörg Sander, 2006 15CMPUT 391 – Database Management Systems

Pivoting• When we view the data as a multi-dimensional

cube and group on a subset of the axes, we are said to be performing a pivotpivot on those axes– Pivoting on dimensions D1,…,Dk in a data cube

D1,…,Dk,Dk+1,…,Dn means that we use GROUP BY A1,…,Ak and aggregate over Ak+1,…An, where Ai is an attribute of the dimension Di

– Example: Pivoting on ProductProduct and TimeTime corresponds to grouping on Product_id and Quarter and aggregating Sales_Amt over Market_id:

SELECT S.Product_Id, T.Quarter, SUM (S.Sales_Amt)FROM SalesSales S, TimeTime TWHERE T.Time_Id = S.Time_IdGROUP BY S.Product_Id, T.Quarter Pivot

University of AlbertaDr. Jörg Sander, 2006 16CMPUT 391 – Database Management Systems

Dicing• When we use GROUP BY to specify part of

a hierarchy, we are performing a range selection called a dice– Dicing Sales in the time dimension: total sales

for each product in each quarter.SELECT S.Product_Id, T.Quarter, SUM (Sales_Amt)FROM Sales S, Time TWHERE T.Time_Id = S.Time_IdGROUP BY T.Quarter, S.Product_Id

Dice

University of AlbertaDr. Jörg Sander, 2006 17CMPUT 391 – Database Management Systems

Slicing• When we use WHERE to specify a particular

value for an axis (or several axes), we are performing a slice– Slicing the data cube in the TimeTime dimension

(choosing sales only in week 12) then pivoting to Product_id (aggregating over Market_id)SELECT S.Product_Id, SUM (Sales_Amt)FROM SalesSales S, TimeTime TWHERE T.Time_Id = S.Time_Id AND T.Week = ‘Wk-12’GROUP BY S. Product_Id

Slice

University of AlbertaDr. Jörg Sander, 2006 18CMPUT 391 – Database Management Systems

Slicing-and-Dicing

• Typically slicing and dicing involves several queries to find the “right slice.”For instance, change the slice and the axes:

• Slicing on TimeTime and Market Market dimensions then pivoting to Product_idand Week (in the time dimension)

SELECT S.Product_Id, T.Week, SUM (Sales_Amt)FROM SalesSales S, TimeTime TWHERE T.Time_Id = S.Time_Id

AND T.Quarter = 4AND S.Market_id = ‘M1’

GROUP BY S.Product_Id, T.Week

Slice

Pivot

University of AlbertaDr. Jörg Sander, 2006 19CMPUT 391 – Database Management Systems

The extended Multi-dimensional Data Cube/Fact Table

Sum

1999 2000 2002 SumDrama

… ...

Sum

Comedy

Year

Category

2001EdmontonCalgary

Lethbridge

All YearsDrama, Edmonton

City

Contains all possible aggregates in addition to the facts in the fact table

University of AlbertaDr. Jörg Sander, 2006 20CMPUT 391 – Database Management Systems

The CUBE Operator• To construct the following table, would take 4

queries (next slide)

University of AlbertaDr. Jörg Sander, 2006 21CMPUT 391 – Database Management Systems

SUM(Sales_Amt)M1 M2 M3 Total

P1 3003 1503 … …P2 6003 2402 … …P3 4503 3 … …P4 7503 7000 … …

Total … … … …

Market_Id

Prod

uct_

Id

The Four Queries• For the table entries, without the totals (aggregation on time)

SELECT S.Market_Id, S.Product_Id, SUM (S.Sales_Amt)FROM SalesSales SGROUP BY S.Market_Id, S.Product_Id

• For the row totals (aggregation on time and supermarkets)SELECT S.Product_Id, SUM (S.Sales_Amt)FROM SalesSales SGROUP BY S.Product_Id

• For the column totals (aggregation on time and products)SELECT S.Market_Id, SUM (S.Sales) FROM SalesSales S GROUP BY S.Market_Id

• For global total: SELECT SUM (S.Sales) FROM SalesSales S

University of AlbertaDr. Jörg Sander, 2006 22CMPUT 391 – Database Management Systems

Definition of the CUBE Operator• Doing these four queries is wasteful

– The first does much of the work of the other three: if we could save that result and aggregate over Market_Id and Product_Id, we could compute the other queries more efficiently

• The CUBE clause is part of SQL:1999– GROUP BY CUBE (v1, v2, …, vn)– Equivalent to a collection of GROUP BYs, one for

each of the 2n subsets of v1, v2, …, vn

University of AlbertaDr. Jörg Sander, 2006 23CMPUT 391 – Database Management Systems

Example of CUBE Operator

• The following query returns all the information needed to obtain the previous products/markets table:

SELECT S.Market_Id, S.Product_Id, SUM (S.Sales_Amt)FROM SalesSales SGROUP BY CUBE (S.Market_Id, S.Product_Id)

University of AlbertaDr. Jörg Sander, 2006 24CMPUT 391 – Database Management Systems

ROLLUP• ROLLUP is similar to CUBE except that instead of

aggregating over all subsets of the arguments, it creates subsets moving from right to left

• GROUP BY ROLLUP (A1,A2,…,An) is a series of these aggregations:– GROUP BY A1 ,…, An-1 ,An– GROUP BY A1 ,…, An-1– … … …– GROUP BY A1, A2– GROUP BY A1– No GROUP BY

• ROLLUP is also in SQL:1999University of AlbertaDr. Jörg Sander, 2006 25CMPUT 391 – Database Management Systems

Example of ROLLUP OperatorSELECT S.Market_Id, S.Product_Id, SUM (S.Sales_Amt)FROM SalesSales SGROUP BY ROLLUP (S.Market_Id, S. Product_Id)– first aggregates with the finest granularity:

GROUP BY S.Market_Id, S.Product_Id– then with the next level of granularity:

GROUP BY S.Market_Id– then the grand total is computed with no GROUP

BY clause

University of AlbertaDr. Jörg Sander, 2006 26CMPUT 391 – Database Management Systems

ROLLUP vs. CUBE

• The same query with CUBE:- first aggregates with the finest granularity:

GROUP BY S.Market_Id, S.Product_Id

- then with the next level of granularity (both subsets):

GROUP BY S.Market_IdGROUP BY S.Product_Id

- then the grand total with no GROUP BY

University of AlbertaDr. Jörg Sander, 2006 27CMPUT 391 – Database Management Systems

Materialized Views

The CUBE operator is often used to pre-compute aggregations on all dimensions of a fact table and then save them as a materialized views to speed up future queries

University of AlbertaDr. Jörg Sander, 2006 28CMPUT 391 – Database Management Systems

ROLAP and MOLAP• Relational OLAP: ROLAP

– OLAP data is stored in a relational database as previously described. Data cube is a way to think abouta fact table.

• Multidimensional OLAP: MOLAP– Vendor provides an OLAP server that implements a fact

table as a data cube using some multi-dimensional (non-relational) implementation.

– provide proprietary, perhaps visual, languages that allow unsophisticated users to make queries that involve pivots, drilling down, or rolling up

University of AlbertaDr. Jörg Sander, 2006 29CMPUT 391 – Database Management Systems

Implementation Issues

• OLAP applications are characterized by a very large amount of data that is relatively static, with infrequent updates– Thus, various aggregations can be precomputed

and stored in the database– Star joins, join indices, and bitmap indices can

be used to improve efficiency– Since updates are infrequent, the inefficiencies

associated with updates are minimized

University of AlbertaDr. Jörg Sander, 2006 30CMPUT 391 – Database Management Systems