
04/22/23 1

Lecture 10: Data Warehouses

Introduction
Operational vs. Warehouse
Multidimensional Data
Examples
MOLAP vs ROLAP
Dimensional Hierarchies
OLAP Queries
Demos
Comparison with SQL Queries
CUBE Operator
Multidimensional Design
Star/Snowflake Schemas
Online Aggregation
Implementation Issues
Bitmap Index
Constructing a Data Warehouse
Views
Materialized View Example
Materialized View is an Index
Issues in Materialized Views
Maintaining Materialized Views

Introduction

In the late 80s and early 90s, companies began to use their DBMSs for complex, interactive, exploratory analysis of historical data.

This was called Decision Support (DS), or On-Line Analytic Processing (OLAP).

DS queries slowed down the day-to-day operation of the company, known as On-Line Transaction Processing (OLTP).

This led to the creation of Data Warehouses, separate from the operational databases.

Operational vs Data Warehouse Requirements

OLTP / Operational / Production                 DSS / Warehouse / DataMart
Operate the business / Clerks                   Diagnose the business / Managers
Short queries, small amounts of data            Opposite
Queries change data                             Opposite
Customer inquiry, Order Entry, etc.             OLAP, Statistics, Visualization, Data Mining, etc.
Legacy applications, heterogeneous databases    Opposite
Often distributed                               Often centralized (Warehouse)
Current data                                    Current and historical data

Operational vs Data Warehouse Requirements, ctd

Operational                                              Warehouse
General E-R diagrams                                     Multidimensional data model common
Locks necessary                                          No locks necessary
Crash recovery required                                  Crash recovery optional
Smaller volume of data                                   Huge volume of data
Need indexes designed to access small amounts of data    Need indexes designed to access large volumes of data

Data Warehousing

Integrated data spanning long time periods, often augmented with summary information.

Several terabytes to petabytes common.

Interactive response times expected for complex queries; ad-hoc updates uncommon.

[Figure: Operational Data feeds EXTRACT, TRANSFORM, LOAD, REFRESH into the DATA WAREHOUSE (with its Metadata Repository), which SUPPORTS OLAP and DATA MINING.]

Multidimensional Data

In order to support OLAP, warehouse data is often structured multidimensionally, as measures and dimensions.

Measure: numeric attribute, e.g. sales amount.
Dimension: attribute categorizing the measure, e.g. product, store, date of sale.

The fact table has a foreign key for each dimension, plus an attribute for each measure.

There will also be a dimension table for each dimension.

On the next page, each group lists its fact table first (red on the original slide), followed by its dimension tables (green).

Examples of Multidimensional Data

Purchase(ProductID, StoreID, DateID, Amt)
Product(ID, SKU, size, brand)
Store(ID, Address, Sales District, Region, Manager)
Date(ID, Week, Month, Holiday, Promotion)

Claims(ProvID, MembID, Procedure, DateID, Cost)
Providers(ID, Practice, Address, ZIP, City, State)
Members(ID, Contract, Name, Address)
Procedure(ID, Name, Type)

Telecomm(CustID, SalesRepID, ServiceID, DateID)
SalesRep(ID, Address, Sales District, Region, Manager)
Service(ID, Name, Category)
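As a rough illustration of how a fact table and its dimension tables fit together, the sketch below loads a cut-down version of the Purchase example into an in-memory SQLite database and totals the measure by one dimension attribute. The sample rows, the column types, and the omission of the Date dimension are assumptions made for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables (the Date dimension is left out to keep the sketch short).
    CREATE TABLE Product(ID INTEGER PRIMARY KEY, SKU TEXT, size TEXT, brand TEXT);
    CREATE TABLE Store(ID INTEGER PRIMARY KEY, Address TEXT, SalesDistrict TEXT,
                       Region TEXT, Manager TEXT);
    -- Fact table: one foreign key per dimension plus the measure Amt.
    CREATE TABLE Purchase(
        ProductID INTEGER REFERENCES Product(ID),
        StoreID   INTEGER REFERENCES Store(ID),
        DateID    INTEGER,
        Amt       REAL
    );
    -- Hypothetical sample rows.
    INSERT INTO Product VALUES (1, 'A-100', 'small', 'Acme'), (2, 'B-200', 'large', 'Bolt');
    INSERT INTO Store VALUES (10, '1 Main St', 'D1', 'West', 'Ann'),
                             (20, '9 Oak Ave', 'D2', 'East', 'Bob');
    INSERT INTO Purchase VALUES (1, 10, 100, 5.0), (1, 20, 100, 7.0), (2, 10, 101, 11.0);
""")

# Aggregate the measure over the Product dimension, grouped by brand.
sales_by_brand = conn.execute("""
    SELECT p.brand, SUM(f.Amt)
    FROM Purchase f JOIN Product p ON f.ProductID = p.ID
    GROUP BY p.brand
    ORDER BY p.brand
""").fetchall()
print(sales_by_brand)
```

The fact table stays narrow (keys plus measure), while descriptive attributes such as brand live only in the dimension table.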

MOLAP vs ROLAP

Multidimensional data can be stored physically in a (disk-resident, persistent) array; such systems are called MOLAP systems. Alternatively, it can be stored as a relation; such systems are called ROLAP systems.

The main relation, which relates dimensions to a measure, is called the fact table, e.g., Sales(pid, locid, timeid, amt). Each dimension can have additional attributes and an associated dimension table, e.g., Products(pid, pname, category, price). Fact tables are much larger than dimension tables.

25.2 Multidimensional Data Model

Collection of numeric measures, which depend on a set of dimensions.
E.g., measure Amt, dimensions Product (key: pid), Location (locid), and Time (timeid).

Slice locid=1 is shown:

pid \ timeid     1     2     3
11              25     8    15
12              30    20    50
13               8    10    10

The same data stored as a relation:

pid   timeid   locid   amt
11    1        1       25
11    2        1       8
11    3        1       15
12    1        1       30
12    2        1       20
12    3        1       50
13    1        1       8
13    2        1       10
13    3        1       10
11    1        2       35
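A minimal sketch of the slice operation, using the ten (pid, timeid, locid, amt) tuples from this slide. Storing the cube as a Python dictionary keyed by the dimension values is an assumption made for illustration; a MOLAP system would use a persistent array.

```python
# (pid, timeid, locid) -> amt, copied from the relation on this slide.
sales = {
    (11, 1, 1): 25, (11, 2, 1): 8,  (11, 3, 1): 15,
    (12, 1, 1): 30, (12, 2, 1): 20, (12, 3, 1): 50,
    (13, 1, 1): 8,  (13, 2, 1): 10, (13, 3, 1): 10,
    (11, 1, 2): 35,
}

def slice_by_location(cube, locid):
    """Fix locid and return the remaining 2-D array as {pid: {timeid: amt}}."""
    result = {}
    for (pid, timeid, loc), amt in cube.items():
        if loc == locid:
            result.setdefault(pid, {})[timeid] = amt
    return result

slice1 = slice_by_location(sales, 1)
for pid in sorted(slice1):
    # One row of the slice per product, columns ordered by timeid.
    print(pid, [slice1[pid][t] for t in sorted(slice1[pid])])
```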

Dimension Hierarchies

For each dimension, some of the attributes may be organized in a hierarchy:

[Figure: one hierarchy per dimension, finest level first. PRODUCT: PID, pname, category. TIME: date, week, quarter, year. LOCATION: ZIP, city, state.]

25.3 OLAP Queries

Influenced by SQL and by spreadsheets.

A common operation is to aggregate a measure over one or more dimensions.
  Find total sales.
  Find total sales for each city, or for each state.
  Find top five products ranked by total sales.

Roll-up: Aggregating at different levels of a dimension hierarchy.
  E.g., given total sales by city, we can roll up to get sales by state.
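Roll-up can be sketched as re-aggregating totals from one level of a hierarchy to its parent level. The city names, amounts, and the city-to-state mapping below are hypothetical.

```python
from collections import defaultdict

# Hypothetical totals per city, plus the city -> state level of the LOCATION hierarchy.
sales_by_city = {"Portland": 120, "Eugene": 56, "Los Angeles": 150, "San Diego": 73}
state_of = {"Portland": "OR", "Eugene": "OR", "Los Angeles": "CA", "San Diego": "CA"}

def roll_up(totals, parent_of):
    """Aggregate totals from one hierarchy level to the next coarser level."""
    rolled = defaultdict(int)
    for member, amount in totals.items():
        rolled[parent_of[member]] += amount
    return dict(rolled)

sales_by_state = roll_up(sales_by_city, state_of)
print(sales_by_state)
```

This works for SUM because partial sums can be added together; an average would need the sums and the counts carried separately.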

OLAP Queries

Drill-down: The inverse of roll-up.
  E.g., given total sales by state, can drill down to get total sales by city.
  E.g., can also drill down on a different dimension to get total sales by product for each state.

Pivoting: Aggregation on selected dimensions.
  E.g., pivoting on State and Year yields this cross-tabulation:

          OR     CA    Total
2007      63     81     144
2008      38    107     145
2009      75     35     110
Total    176    223     399

Slicing and Dicing: Equality and range selections on one or more dimensions.

Cognos Demo

Now we watch a demo of Cognos (bought by IBM).
Dimensions: Products…Margin ranges
Measure: Order value (sales)

First pivot from Product dimension to Margin Range.
Notice how quickly the cube changes.

Slice to Low Margin, pivot to Product and Company Region.

Drill down to High Tech, IDES AG.
Now the guilty product is clear.

Tableau Demo

http://www.tableausoftware.com/products/tour2

Note the many measures.
Pivot on sales, date (drill down to month), region as color.
Clear date, pivot on product and drill down on subcategory.
Change region from color to rows.
Move profit into color.
Change bars to circles.
Pivot on dates (columns).

Comparison with SQL Queries

The cross-tabulation obtained by pivoting can also be computed using a collection of SQL queries:

SELECT T.year, L.state, SUM(S.amt)
FROM Sales S, Times T, Locations L
WHERE S.timeid=T.timeid AND S.locid=L.locid
GROUP BY T.year, L.state

SELECT T.year, SUM(S.amt)
FROM Sales S, Times T
WHERE S.timeid=T.timeid
GROUP BY T.year

SELECT L.state, SUM(S.amt)
FROM Sales S, Locations L
WHERE S.locid=L.locid
GROUP BY L.state
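To see the three queries produce the body and margins of a cross-tabulation, the sketch below runs them against an in-memory SQLite database. The table contents are hypothetical, chosen so that each (year, state) cell holds a single sale; the fourth query for the grand total is an addition not shown on the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Times(timeid INTEGER PRIMARY KEY, year INTEGER);
    CREATE TABLE Locations(locid INTEGER PRIMARY KEY, state TEXT);
    CREATE TABLE Sales(pid INTEGER, timeid INTEGER, locid INTEGER, amt INTEGER);
    INSERT INTO Times VALUES (1, 2007), (2, 2008), (3, 2009);
    INSERT INTO Locations VALUES (1, 'OR'), (2, 'CA');
    INSERT INTO Sales VALUES
        (11, 1, 1, 63), (11, 1, 2, 81),
        (11, 2, 1, 38), (11, 2, 2, 107),
        (11, 3, 1, 75), (11, 3, 2, 35);
""")

# Body of the cross-tabulation: one row per (year, state) cell.
cells = conn.execute("""
    SELECT T.year, L.state, SUM(S.amt)
    FROM Sales S, Times T, Locations L
    WHERE S.timeid = T.timeid AND S.locid = L.locid
    GROUP BY T.year, L.state
""").fetchall()

# Row totals (right-hand margin).
by_year = conn.execute("""
    SELECT T.year, SUM(S.amt)
    FROM Sales S, Times T
    WHERE S.timeid = T.timeid
    GROUP BY T.year
""").fetchall()

# Column totals (bottom margin).
by_state = conn.execute("""
    SELECT L.state, SUM(S.amt)
    FROM Sales S, Locations L
    WHERE S.locid = L.locid
    GROUP BY L.state
""").fetchall()

# Grand total (bottom-right corner); not one of the slide's three queries.
grand_total = conn.execute("SELECT SUM(amt) FROM Sales").fetchone()[0]

print(cells, by_year, by_state, grand_total)
```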

The CUBE Operator

Generalizing the previous example, if there are k dimensions, we have 2^k possible SQL GROUP BY queries that can be generated through pivoting on a subset of dimensions.

CUBE pid, locid, timeid BY SUM Sales

Equivalent to rolling up Sales on all eight subsets of the set {pid, locid, timeid}; each roll-up corresponds to an SQL query of the form:

SELECT SUM(S.amt)
FROM Sales S
GROUP BY grouping-list

Lots of work on optimizing the CUBE operator!
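The 2^k group-bys can be enumerated directly. The sketch below is a deliberately naive version that scans the rows once per subset of dimensions; the sample rows and the function name `cube` are hypothetical, and optimized CUBE implementations share work between group-bys instead of rescanning.

```python
from itertools import combinations
from collections import defaultdict

# Hypothetical fact rows.
sales = [
    {"pid": 11, "locid": 1, "timeid": 1, "amt": 25},
    {"pid": 11, "locid": 2, "timeid": 1, "amt": 35},
    {"pid": 12, "locid": 1, "timeid": 1, "amt": 30},
    {"pid": 12, "locid": 1, "timeid": 2, "amt": 20},
]

def cube(rows, dimensions, measure):
    """Return {grouping-list: {group key: SUM(measure)}} for every subset of dimensions."""
    results = {}
    for k in range(len(dimensions) + 1):
        for grouping in combinations(dimensions, k):
            totals = defaultdict(int)
            for row in rows:
                key = tuple(row[d] for d in grouping)
                totals[key] += row[measure]
            results[grouping] = dict(totals)
    return results

result = cube(sales, ("pid", "locid", "timeid"), "amt")
print(len(result))           # number of group-bys for k = 3 dimensions
print(result[("pid",)])      # roll-up on pid only
print(result[()])            # empty grouping list: the grand total
```

Several SQL systems expose the same idea as a GROUP BY CUBE (...) clause, so a single statement can return all of these groupings.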

Example Multidimensional Design

This kind of schema is very common in OLAP applications.

It is called a star schema.
What is “wrong” with it?

SALES (fact table): pid, timeid, locid, amt
PRODUCTS: pid, pname, category, price
LOCATIONS: locid, city, state, country
TIMES: timeid, date, week, month, quarter, year, holiday_flag

Star/Snowflake Schemas

Why normalize?
  Space
  Redundancy, anomalies

Why unnormalize?
  Performance

Which is more important in Data Warehouses?

If normalized, it is a snowflake schema.

Online Aggregation

Consider an aggregate query, e.g., finding the average sales by state.

Can we provide the user with some information before the exact average is computed for all states?

Can show the current “running average” for each state as the computation proceeds.

Even better, if we use statistical techniques and sample tuples to aggregate instead of simply scanning the aggregated table, we can provide bounds such as “the average for Oregon is 2000 ± 102 with 95% probability”.

• Should also use nonblocking algorithms!
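One way to produce a running average with an error bound is sketched below. It assumes the tuples arrive in random order (that is, they form a random sample) and uses a normal approximation with z = 1.96 for the 95% bound; both are assumptions of this sketch, and the function name is hypothetical.

```python
import math

def running_estimates(values, z=1.96):
    """Yield (n, running mean, half-width of the ~95% interval) after each value.

    Uses Welford's one-pass update so nothing has to be rescanned.
    Assumes the values are seen in random order.
    """
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the mean
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n >= 2:
            std_err = math.sqrt(m2 / (n - 1) / n)
            yield n, mean, z * std_err

# Hypothetical sales amounts for one state, in sampled order.
for n, mean, half_width in running_estimates([10, 20, 30]):
    print(f"after {n} tuples: {mean:.1f} ± {half_width:.1f}")
```

Because the generator emits an estimate after every tuple, it never blocks on the full input, which is the property the last bullet asks for.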

25.6 Implementation Issues

New indexing techniques: Bitmap indexes, Join indexes, array representations, compression, precomputation of aggregations, etc.

E.g., Bitmap index:

sex (M F)   custid   name   sex   rating   rating (1 2 3 4 5)
10          112      Joe    M     3        00100
10          115      Ram    M     5        00001
01          119      Sue    F     5        00001
10          112      Woo    M     4        00010

Bit-vector: 1 bit for each possible value.


Bitmap Indexes

Work when an attribute has few values, e.g. gender or rating

Advantage: Small enough to fit in memory

Many queries can be answered by bit-vector ops, e.g. females with rating = 3.
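A small sketch of answering a conjunctive query with bit-vector operations, using the four customer rows from the previous slide. Here each distinct value gets one Python integer whose bit i is set when row i has that value; this column-per-value layout and the helper names are assumptions of the sketch.

```python
# (custid, name, sex, rating), as on the bitmap index slide.
rows = [
    (112, "Joe", "M", 3),
    (115, "Ram", "M", 5),
    (119, "Sue", "F", 5),
    (112, "Woo", "M", 4),
]

def build_bitmap_index(rows, column):
    """Map each distinct value of a column to an int with bit i set for row i."""
    index = {}
    for i, row in enumerate(rows):
        index[row[column]] = index.get(row[column], 0) | (1 << i)
    return index

def matching_rows(bitmap):
    """Return the row positions whose bit is set."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

sex_index = build_bitmap_index(rows, 2)
rating_index = build_bitmap_index(rows, 3)

# "Females with rating = 5": AND the two bit-vectors, then read off the rows.
females_rating_5 = sex_index.get("F", 0) & rating_index.get(5, 0)
print([rows[i][1] for i in matching_rows(females_rating_5)])

# "Females with rating = 3" has no match in this data, so the AND is zero.
females_rating_3 = sex_index.get("F", 0) & rating_index.get(3, 0)
print(matching_rows(females_rating_3))
```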

25.7 Constructing a Data Warehouse

Extract
  Is the data in native format?
Clean
  How many ways can you spell Mr.?
  Errors, missing information
Transform
  Fix semantic mismatches.
  • E.g. Last+first vs. Name
Load
  Do it in parallel or else….
Refresh
  Both data and indexes
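The Clean and Transform steps can be sketched on the two examples mentioned above: variant spellings of "Mr." and a source that stores last and first name separately. The record layouts, the list of title variants, and the function names are all hypothetical.

```python
import re

# Hypothetical spellings of "Mr." that might appear in source data.
TITLE_VARIANTS = re.compile(r"^(mr|mister)\.?\s+", re.IGNORECASE)

def clean_name(raw):
    """Clean: trim whitespace and drop a leading 'Mr.'-style title."""
    return TITLE_VARIANTS.sub("", raw.strip())

def transform(record):
    """Transform: map both source layouts onto a single 'name' attribute."""
    if "name" in record:
        return {"name": clean_name(record["name"])}
    # Source with separate last and first names.
    return {"name": f"{record['first'].strip()} {record['last'].strip()}"}

source_a = [{"name": "Mr. John Smith"}, {"name": "  MISTER John Smith"}]
source_b = [{"last": "Smith", "first": "John"}]

warehouse_rows = [transform(r) for r in source_a + source_b]
print(warehouse_rows)
```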

25.8, 25.9 Views and Decision Support

In large databases, precomputation is necessary for decent response times.
  Examples: the brain, Google

Example: Precompute daily sums for the cube.
  What can be derived from those precomputations?

These precomputed queries are called Materialized Views (SQL Server: Indexed views).

Materialized View Example

Materialized view:

CREATE VIEW DailySum(date, sumamt) AS
SELECT date, SUM(amt)
FROM Times JOIN Sales USING (timeid)
GROUP BY date

Query:

SELECT week, SUM(amt)
FROM Times JOIN Sales USING (timeid)
GROUP BY week

Modified query (joins on date, the only column DailySum shares with Times; assumes Times has one row per date):

SELECT week, SUM(sumamt)
FROM Times JOIN DailySum USING (date)
GROUP BY week
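The sketch below checks, on a tiny hypothetical data set, that the weekly query gives the same answer whether it runs against Sales or against the precomputed daily sums. SQLite has no materialized views, so a table created from the view's query stands in for one. The rewritten query joins on date, since DailySum carries only date and sumamt, and it assumes Times has one row per date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Assumption: Times has one row per date, so timeid identifies a day.
    CREATE TABLE Times(timeid INTEGER PRIMARY KEY, date TEXT, week INTEGER);
    CREATE TABLE Sales(timeid INTEGER, amt INTEGER);
    INSERT INTO Times VALUES (1, '2009-01-05', 2), (2, '2009-01-06', 2), (3, '2009-01-12', 3);
    INSERT INTO Sales VALUES (1, 10), (1, 5), (2, 20), (3, 7), (3, 3);

    -- Stand-in for the materialized view: the view's query stored as a table.
    CREATE TABLE DailySum AS
        SELECT date, SUM(amt) AS sumamt
        FROM Times JOIN Sales USING (timeid)
        GROUP BY date;
""")

# Original query: joins Times with the large fact table.
original = conn.execute("""
    SELECT week, SUM(amt)
    FROM Times JOIN Sales USING (timeid)
    GROUP BY week
    ORDER BY week
""").fetchall()

# Rewritten query: joins Times with the much smaller DailySum.
rewritten = conn.execute("""
    SELECT week, SUM(sumamt)
    FROM Times JOIN DailySum USING (date)
    GROUP BY week
    ORDER BY week
""").fetchall()

print(original, rewritten)
```

Because DailySum here is an ordinary table, it goes stale as soon as Sales changes, which is the maintenance cost a materialized view carries.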

Pros and Cons of Materialized Views

Pro: Modified query is a join of two small tables; original query is a join with one huge table.

Con: Materialized views take up space, need to be updated.


A Materialized View is an Index

Recall the definition of an index Data structure that provides fast access to data

Table indexes were of the form {(value, pointer)}, perhaps at leaf level of a search structure. This is different.

Needs to be maintained as underlying tables change.

Ideally, we want incremental view maintenance algorithms.


What views should we materialize?

Remember the software that automatically chooses optimal index configurations?

The same software will choose optimal materialized views, given a workload and available space.


What about the optimizer?

Given a query and a set of materialized views, can we use the materialized views to answer the query?

This is tricky. Best reference is [348]

Refreshing Materialized Views

How often should we refresh the materialized view?

Many enterprises refresh warehouse data only weekly/nightly, so can afford to completely rebuild their materialized views.

Others want their warehouses to be current, so materialized views must be updated incrementally if possible.

Let's look at some simple examples.

25.10 Maintaining Materialized Views*

Incremental view maintenance
  Defn: make changes in the view that correspond to changes in the base tables.

Example: V = SELECT a FROM R

How is V modified if r is inserted into R?

How is V modified if r is deleted from R?

Maintaining Materialized Views*

Consider V = R ⋈ S

How is V modified if r is inserted into R?
How is V modified if r is deleted from R?

Consider V = SELECT g, COUNT(*) FROM R GROUP BY g

How is V modified if r is inserted into R?

How is V modified if r is deleted from R?

For more general cases, see [348]
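For the COUNT(*) view, one possible answer is sketched below: an insert of r increments the count for r.g, and a delete decrements it, dropping the group when the count reaches zero. The class name and its interface are hypothetical, and the sketch covers only this one view shape.

```python
from collections import Counter

class CountView:
    """V = SELECT g, COUNT(*) FROM R GROUP BY g, maintained incrementally."""

    def __init__(self):
        self.counts = Counter()

    def insert(self, g):
        """A tuple with group value g was inserted into R."""
        self.counts[g] += 1

    def delete(self, g):
        """A tuple with group value g was deleted from R."""
        if g not in self.counts:
            raise KeyError(f"no tuple with g={g!r} in R")
        self.counts[g] -= 1
        if self.counts[g] == 0:
            # The group has no tuples left, so it disappears from V.
            del self.counts[g]

view = CountView()
for g in ["a", "a", "b"]:
    view.insert(g)
view.delete("b")
print(dict(view.counts))
```

Neither operation rereads R, so the cost of maintenance depends on the size of the change, not on the size of the base table.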
