
Decision Support, Data Warehousing, and OLAP

Anindya Datta

Director, iXL Center for E-Commerce

Georgia Institute of Technology

[email protected]

Outline

Terminology: OLTP vs. OLAP
Data Warehousing Architecture
Technologies
Products
Research Issues
References

Data Warehouse

A decision support database that is maintained separately from the organization’s operational databases.

A data warehouse is a subject-oriented, integrated, time-varying, non-volatile collection of data that is used primarily in organizational decision making.

Decision support system

OLTP

On-Line Transaction Processing (OLTP)
What we have been talking about in class
For day-to-day operations
Involves the operational database
Queries to retrieve information from specific tuples

OLAP and Decision Support

On-Line Analytical Processing (OLAP)
OLAP is an element of decision support systems (DSS).
Information technology to help the knowledge worker (executive, manager, analyst) make faster and better decisions.
  What were the sales volumes by region and product category for the last year?
  Which orders should we fill to maximize revenues?
  Will a 10% discount increase sales volume sufficiently?

Three-Tier Architecture

Warehouse database server
  Almost always a relational DBMS
OLAP servers
  Relational OLAP (ROLAP): extended relational DBMS that maps operations on multidimensional data to standard relational operations.
  Multidimensional OLAP (MOLAP): special purpose server that directly implements multidimensional data and operations.
Clients
  Query and reporting tools
  Analysis tools
  Data mining tools (e.g., trend analysis, prediction)

Data Warehouse vs. Data Marts

Enterprise warehouse: collects all information about subjects for the entire organization.
  Requires extensive business modeling
  May take years to design and build
Data marts: departmental subsets that focus on selected subjects, e.g., a marketing data mart.
  Faster roll out, but complex integration in the long run
Virtual warehouse: views over operational dbs
  Materialize some summary views for efficient query processing
  Easier to build
  Requires excess capacity on operational db servers

Data Warehousing Architecture

[Figure: data warehousing architecture. Labeled components: external sources and operational dbs; extract, transform, load, refresh; metadata repository; monitoring & administration; data marts; serve; OLAP servers; analysis, query/reporting and data mining tools.]

OLTP vs. OLAP

|  | OLTP | OLAP |
| --- | --- | --- |
| User | Clerk, IT professional | Knowledge worker |
| Function | Day-to-day operations | Decision support |
| DB design | Application-oriented (E-R based) | Subject-oriented (star, snowflake) |
| Data | Current, isolated | Historical, consolidated |
| View | Detailed, flat relational | Summarized, multidimensional |
| Usage | Structured, repetitive | Ad hoc |
| Unit of work | Short, simple transaction | Complex query |
| Access | Read/write | Read mostly |
| Operations | Index/hash on primary key | Lots of scans |
| # Records accessed | Tens | Millions |
| # Users | Thousands | Hundreds |
| DB size | 100 MB-GB | 100 GB-TB |
| Metric | Transaction throughput | Query throughput, response |

OLAP for Decision Support

Goal of OLAP is to support ad-hoc querying for the business analyst
Business analysts are familiar with spreadsheets
Extend the spreadsheet analysis model to work with warehouse data
Multidimensional view of data is the foundation of OLAP

Relational DBMS as Warehouse Server - ROLAP

Schema design
Specialized scan, indexing and join techniques
Handling of aggregate views (querying and materialization)
Supporting query language extensions beyond SQL
Complex query processing and optimization
Data partitioning and parallelism

Warehouse Database Schema

ER design techniques not appropriate
Design should reflect multidimensional view
  Star Schema
  Snowflake Schema
  Fact Constellation Schema

Star Schema

A single fact (subject) table and a single table for each dimension
Every fact points to one tuple in each of the dimensions and has additional attributes
Does not capture hierarchies directly
Generated keys are used for performance and maintenance reasons

Example of a Star Schema

[Figure: star schema with a central fact table linked to six dimension tables.]
Fact Table: OrderNO, SalespersonID, CustomerNO, ProdNo, DateKey, CityName, Quantity, Total Price
Order: Order No, Order Date
Customer: Customer No, Customer Name, Customer Address, City
Salesperson: SalespersonID, SalespersonName, City, Quota
Product: ProductNO, ProdName, ProdDescr, Category, CategoryDescription, UnitPrice
Date: DateKey, Date
City: CityName, State, Country

Types of tables

Fact table
  (OrderNO, SalespersonID, CustomerNo, ProdNo, Total Price)
Dimension tables
  Order (Order No, Order Date)
  Date (DateKey, Date)

[Figure: ER diagram of the operational retail database. Entities: SALES TRANSACTION (Tid, Date, Time), CUSTOMER (CustomerID, CustomerName, CustomerZip), STORE (StoreID, StoreZip), REGION (RegionID, RegionName), PRODUCT (ProductID, ProductName, Price), VENDOR (VendorID, VendorName), CATEGORY (CategoryID, CategoryName). Relationships: SoldVia (NoOfItems), OccuredAt, IsPartOf, BelongsTo, SuppliedBy, BuysVia.]

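Category Table

| CategoryID | CategoryName |
| --- | --- |
| CP | Camping |
| FW | Footwear |

Product Table

| ProductID | Name | Price | VendorID | CategoryID |
| --- | --- | --- | --- | --- |
| 1X1 | Zzz Bag | $100 | PG | CP |
| 2X2 | Easy Boot | $70 | MK | FW |
| 3X3 | Cosy Sock | $15 | MK | FW |

Vendor Table

| VendorID | VendorName |
| --- | --- |
| PG | Pacifica Gear |
| MK | Mountain King |

Region Table

| RegionID | RegionName |
| --- | --- |
| C | Chicagoland |
| T | Tristate |

Store Table

| StoreID | StoreZip | RegionID |
| --- | --- | --- |
| S1 | 60600 | C |
| S2 | 60605 | C |
| S3 | 35400 | T |

Customer Table

| CustomerID | CustomerName | CustomerZip |
| --- | --- | --- |
| 1-2-333 | Tina | 60137 |
| 2-3-444 | Tony | 60611 |
| 4-5-666 | Pam | 35401 |

Sales Transaction Table

| Tid | CustomerID | StoreID | Date | Time |
| --- | --- | --- | --- | --- |
| T111 | 1-2-333 | S1 | 1/1/2011 | 8:23:59 AM |
| T222 | 2-3-444 | S2 | 1/1/2011 | 8:24:30 AM |
| T333 | 1-2-333 | S3 | 1/2/2011 | 8:15:08 AM |
| T444 | 4-5-666 | S4 | 1/2/2011 | 8:20:33 AM |

Sold Within Table

| ProductID | Tid | NoOfItems |
| --- | --- | --- |
| 1X1 | T111 | 1 |
| 2X2 | T222 | 1 |
| 3X3 | T333 | 5 |
| 1X1 | T333 | 1 |
| 2X2 | T444 | 1 |
| 3X3 | T444 | 2 |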

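Calendar Dimension

| CalendarKey | FullDate | Day of Week | Day of Month | Month | Qtr | Year |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 1/1/2011 | Saturday | 1 | January | Q1 | 2011 |
| 2 | 1/2/2011 | Sunday | 2 | January | Q1 | 2011 |

Store Dimension

| Store Key | StoreID | StoreZip | RegionID |
| --- | --- | --- | --- |
| 1 | S1 | 60600 | Chicagoland |
| 2 | S2 | 60605 | Chicagoland |
| 3 | S3 | 35400 | Tristate |

Product Dimension

| Product Key | ProductID | Name | Price | VendorID | CategoryID |
| --- | --- | --- | --- | --- | --- |
| 1 | 1X1 | Zzz Bag | $100 | Pacifica Gear | Camping |
| 2 | 2X2 | Easy Boot | $70 | Mountain King | Footwear |
| 3 | 3X3 | Cosy Sock | $15 | Mountain King | Footwear |

Customer Dimension

| CustomerKey | CustomerID | CustomerName | CustomerZip |
| --- | --- | --- | --- |
| 1 | 1-2-333 | Tina | 60137 |
| 2 | 2-3-444 | Tony | 60611 |
| 3 | 4-5-666 | Pam | 35401 |

SALES Fact table

| CalendarKey | Store Key | CustomerKey | Product Key | Dollars Sold | Units Sold |
| --- | --- | --- | --- | --- | --- |
| 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | 2 | 2 | 2 | 1 | 1 |
| 2 | 1 | 3 | 3 | 5 | 5 |
| 2 | 1 | 3 | 1 | 1 | 1 |
| 2 | 3 | 1 | 2 | 1 | 1 |
| 2 | 3 | 3 | 3 | 2 | 2 |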

Dimensional SQL version:

SELECT SUM(SA.NoOfItems), P.Category, P.Vendor, C.Qtr
FROM Calendar C, Store S, Product P, Sales SA
WHERE C.CalendarKey = SA.CalendarKey
  AND S.StoreKey = SA.StoreKey
  AND P.ProductKey = SA.ProductKey
  AND P.Category = 'Pacifica'
  AND S.RegionName = 'Midwest'
  AND C.DayofWeek = 'Saturday'
  AND C.Year = 2010
GROUP BY P.Category, P.Vendor, C.Qtr
HAVING C.Qtr IN (1, 2)

Non-dimensional SQL version:

SELECT SUM(SW.NoOfItems), C.CategoryName, V.VendorName, EXTRACTQUARTER(ST.Date)
FROM Region R, Store S, SalesTransaction ST, SoldWithin SW, Product P, Vendor V, Category C
WHERE R.RegionID = S.RegionID
  AND S.StoreID = ST.StoreID
  AND ST.Tid = SW.Tid
  AND SW.ProductID = P.ProductID
  AND P.VendorID = V.VendorID
  AND P.CategoryID = C.CategoryID
  AND C.CategoryName = 'Pacifica'
  AND R.RegionName = 'Midwest'
  AND EXTRACTWEEKDAY(ST.Date) = 'Saturday'
  AND EXTRACTYEAR(ST.Date) = 2010
GROUP BY C.CategoryName, V.VendorName, EXTRACTQUARTER(ST.Date)
HAVING EXTRACTQUARTER(ST.Date) IN (1, 2)
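The dimensional form of the query can be tried against the sample star schema. The following is a minimal sketch using Python's built-in sqlite3 module. The column names (RegionName, Vendor, Category, UnitsSold), the helper function names, and the Chicagoland filter are assumptions chosen to fit the sample dimension and fact tables, not the exact names used in the queries above. The Customer dimension is left out because the query does not touch it.

```python
import sqlite3


def build_star(conn):
    """Create a cut-down copy of the sample star schema (assumed column names)."""
    conn.executescript("""
        CREATE TABLE Calendar (CalendarKey INTEGER PRIMARY KEY, FullDate TEXT,
                               DayOfWeek TEXT, Qtr TEXT, Year INTEGER);
        CREATE TABLE Store (StoreKey INTEGER PRIMARY KEY, StoreID TEXT,
                            RegionName TEXT);
        CREATE TABLE Product (ProductKey INTEGER PRIMARY KEY, ProductID TEXT,
                              Name TEXT, Vendor TEXT, Category TEXT);
        CREATE TABLE Sales (CalendarKey INTEGER, StoreKey INTEGER,
                            CustomerKey INTEGER, ProductKey INTEGER,
                            DollarsSold INTEGER, UnitsSold INTEGER);
    """)
    conn.executemany("INSERT INTO Calendar VALUES (?, ?, ?, ?, ?)", [
        (1, "1/1/2011", "Saturday", "Q1", 2011),
        (2, "1/2/2011", "Sunday", "Q1", 2011),
    ])
    conn.executemany("INSERT INTO Store VALUES (?, ?, ?)", [
        (1, "S1", "Chicagoland"),
        (2, "S2", "Chicagoland"),
        (3, "S3", "Tristate"),
    ])
    conn.executemany("INSERT INTO Product VALUES (?, ?, ?, ?, ?)", [
        (1, "1X1", "Zzz Bag", "Pacifica Gear", "Camping"),
        (2, "2X2", "Easy Boot", "Mountain King", "Footwear"),
        (3, "3X3", "Cosy Sock", "Mountain King", "Footwear"),
    ])
    # Fact rows copied from the SALES fact table, in its column order.
    conn.executemany("INSERT INTO Sales VALUES (?, ?, ?, ?, ?, ?)", [
        (1, 1, 1, 1, 1, 1),
        (1, 2, 2, 2, 1, 1),
        (2, 1, 3, 3, 5, 5),
        (2, 1, 3, 1, 1, 1),
        (2, 3, 1, 2, 1, 1),
        (2, 3, 3, 3, 2, 2),
    ])


def units_by_category_and_quarter(conn, region):
    """Star join: the fact table joined to each dimension it needs, then grouped."""
    return conn.execute("""
        SELECT P.Category, C.Qtr, SUM(SA.UnitsSold)
        FROM Calendar C, Store S, Product P, Sales SA
        WHERE C.CalendarKey = SA.CalendarKey
          AND S.StoreKey = SA.StoreKey
          AND P.ProductKey = SA.ProductKey
          AND S.RegionName = ?
        GROUP BY P.Category, C.Qtr
        ORDER BY P.Category, C.Qtr
    """, (region,)).fetchall()


conn = sqlite3.connect(":memory:")
build_star(conn)
for row in units_by_category_and_quarter(conn, "Chicagoland"):
    print(row)
```

Every join in the sketch goes from the fact table to one dimension, which is why the dimensional version needs fewer joins than the normalized one.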

Snowflake Schema

Represent dimensional hierarchy directly by normalizing the dimension tables
Easy to maintain
Saves storage, but allegedly reduces the effectiveness of browsing (Kimball)

Example of a Snowflake Schema

[Figure: snowflake schema. The star schema example with its dimension hierarchies normalized into separate tables.]
Fact Table: OrderNO, SalespersonID, CustomerNO, ProdNo, DateKey, CityName, Quantity, Total Price
Order: Order No, Order Date
Customer: Customer No, Customer Name, Customer Address, City
Salesperson: SalespersonID, SalespersonName, City, Quota
Product: ProductNO, ProdName, ProdDescr, Category, UnitPrice
Category: CategoryName, CategoryDescr
Date: DateKey, Date, Month
Month: Month, Year
Year: Year
City: CityName, State, Country
State: StateName, Country

Multidimensional Data Model

Database is a set of facts (points) in a multidimensional space
A fact has a measure dimension
  quantity that is analyzed, e.g., sale, budget
A set of dimensions on which data is analyzed
  e.g., store, product, date associated with a sale amount
Each dimension has a set of attributes
  e.g., owner, city and county of store
Attributes of a dimension may be related by partial order
  Hierarchy: e.g., county > city > street
  Lattice: e.g., year > month > date, year > week > date

Multidimensional Data Model

[Figure: sales volume as a function of time, city and product. A cube with products (Juice, Cola, Milk, Cream), cities (NYC, LA, SF) and dates (3/1, 3/2, 3/3, 3/4); visible cell values include 10, 47, 30, 12, 22 and 18.]

Operations in Multidimensional Data Model – Roll-up

Aggregation (roll-up)
  summarization over an aggregate hierarchy: higher levels of summarization
  e.g., total sales by city rolled up to total sales by state, or daily sales totals rolled up to monthly sales totals
  Subtracting group by columns
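Rolling up along a hierarchy amounts to mapping each member to its parent and summing. The sketch below assumes a hypothetical city-to-state mapping and illustrative sales figures; neither is taken from the slides.

```python
from collections import defaultdict

# Hypothetical dimension hierarchy: each city belongs to exactly one state.
CITY_TO_STATE = {"NYC": "NY", "LA": "CA", "SF": "CA"}


def roll_up(totals_by_member, hierarchy):
    """Aggregate totals from one hierarchy level to the level above it."""
    totals_by_parent = defaultdict(int)
    for member, amount in totals_by_member.items():
        totals_by_parent[hierarchy[member]] += amount
    return dict(totals_by_parent)


sales_by_city = {"NYC": 10, "LA": 47, "SF": 12}  # illustrative totals
sales_by_state = roll_up(sales_by_city, CITY_TO_STATE)
print(sales_by_state)
```

The same function rolls daily totals up to monthly totals if it is given a date-to-month mapping instead.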

[Figure: Roll-Up. The sales cube aggregated from cities to states (NY, AL, CA), with products Juice, Cola, Milk, Cream and dates 3/1 to 3/4.]

Drill Down

Navigation to detailed data (drill-down)
  Used to adjust the level of detail: getting more detail, finer data
  e.g., total sales by region and drill down to total sales by city or top 3% of cities
  Adding group by columns
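Roll-up and drill-down can both be viewed as changing the set of group-by columns: dropping a column rolls up, adding one drills down. A minimal sketch, assuming a small illustrative fact list with hypothetical region, city and product columns:

```python
from collections import defaultdict

# Illustrative fact rows; the values are not taken from the slides.
FACTS = [
    {"region": "West", "city": "LA", "product": "Juice", "sales": 10},
    {"region": "West", "city": "SF", "product": "Juice", "sales": 5},
    {"region": "West", "city": "SF", "product": "Cola", "sales": 7},
    {"region": "East", "city": "NYC", "product": "Cola", "sales": 20},
]


def total_sales(facts, group_by):
    """Sum the sales measure for each combination of the group-by columns."""
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[column] for column in group_by)
        totals[key] += row["sales"]
    return dict(totals)


by_region = total_sales(FACTS, ["region"])               # coarse view
by_region_city = total_sales(FACTS, ["region", "city"])  # drill down: add "city"
print(by_region)
print(by_region_city)
```

Calling `total_sales(FACTS, [])` drops every group-by column and returns the grand total under the empty key.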

Drill Down

[Figure: drill-down on the date dimension. The sales cube for products Juice, Cola, Milk, Cream and cities NYC, LA, SF, with dates split into half days (3/1 am, 3/1 pm, 3/2 am, 3/2 pm); visible cell values include 5, 20, 20, 7 and 5, 27, 10, 5.]

Operations in Multidimensional Data Model

Selection (slice) defines a subcube
  e.g., sales where city = SF and product = Cream
Visualization Operations (e.g., Pivot)
  Rotating summary tables (next page)

Selection - Slice

| Product | City | 3/1 | 3/2 | 3/3 | 3/4 |
| --- | --- | --- | --- | --- | --- |
| Cream | SF | 12 | 22 | 18 | 30 |
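A slice can be sketched as a filter that fixes some dimensions and keeps the rest free. The example below stores the cube as a dictionary keyed by (product, city, date). The four Cream/SF cells come from the slice shown above; the other two cells and the function name are illustrative assumptions.

```python
DIMENSIONS = ("product", "city", "date")

cube = {
    ("Cream", "SF", "3/1"): 12,
    ("Cream", "SF", "3/2"): 22,
    ("Cream", "SF", "3/3"): 18,
    ("Cream", "SF", "3/4"): 30,
    ("Juice", "NYC", "3/1"): 10,  # illustrative extra cell
    ("Cola", "LA", "3/2"): 47,    # illustrative extra cell
}


def slice_cube(cube, **fixed):
    """Keep only the cells whose coordinates match every fixed dimension value."""
    positions = {DIMENSIONS.index(name): value for name, value in fixed.items()}
    return {
        cell: measure
        for cell, measure in cube.items()
        if all(cell[i] == value for i, value in positions.items())
    }


cream_in_sf = slice_cube(cube, city="SF", product="Cream")
print(cream_in_sf)
```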

A Visual Operation: Pivot (Rotate)

[Figure: pivot. The sales cube (products Juice, Cola, Milk, Cream; cities NYC, LA, SF; dates 3/1 to 3/4; visible cell values 10, 47, 30, 12) rotated into summary views with axes Month, Region and Product.]

Indexing Techniques

Exploiting indexes to reduce scanning of data is of crucial importance
  Bitmap Indexes
  Join Indexes

Bit Map Index

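Base Table

| Cust | Region | Rating |
| --- | --- | --- |
| C1 | N | H |
| C2 | S | M |
| C3 | W | L |
| C4 | W | H |
| C5 | S | L |
| C6 | W | L |
| C7 | N | H |

Region Index

| Row ID | N | S | E | W |
| --- | --- | --- | --- | --- |
| 1 | 1 | 0 | 0 | 0 |
| 2 | 0 | 1 | 0 | 0 |
| 3 | 0 | 0 | 0 | 1 |
| 4 | 0 | 0 | 0 | 1 |
| 5 | 0 | 1 | 0 | 0 |
| 6 | 0 | 0 | 0 | 1 |
| 7 | 1 | 0 | 0 | 0 |

Rating Index

| Row ID | H | M | L |
| --- | --- | --- | --- |
| 1 | 1 | 0 | 0 |
| 2 | 0 | 1 | 0 |
| 3 | 0 | 0 | 1 |
| 4 | 1 | 0 | 0 |
| 5 | 0 | 0 | 1 |
| 6 | 0 | 0 | 1 |
| 7 | 1 | 0 | 0 |

Query: customers where Region = W and Rating = L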
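The query on this slide (Region = W and Rating = L) can be answered by ANDing two bitmaps. The sketch below uses the base table from the slide and stores each bitmap in a Python integer, with bit i standing for row i (counting from zero); the function names are assumptions for illustration.

```python
BASE_TABLE = [
    ("C1", "N", "H"),
    ("C2", "S", "M"),
    ("C3", "W", "L"),
    ("C4", "W", "H"),
    ("C5", "S", "L"),
    ("C6", "W", "L"),
    ("C7", "N", "H"),
]


def build_bitmap_index(rows, column):
    """Map each distinct value in the column to a bitmap of the rows holding it."""
    index = {}
    for row_number, row in enumerate(rows):
        value = row[column]
        index[value] = index.get(value, 0) | (1 << row_number)
    return index


def matching_customers(bitmap, rows):
    """List the customer ids of the rows whose bit is set."""
    return [row[0] for i, row in enumerate(rows) if (bitmap >> i) & 1]


region_index = build_bitmap_index(BASE_TABLE, 1)
rating_index = build_bitmap_index(BASE_TABLE, 2)

# The conjunction of the two predicates is a single bitwise AND.
hits = region_index["W"] & rating_index["L"]
print(matching_customers(hits, BASE_TABLE))
```

The base table is only read at the end, to turn the surviving bits back into customer ids.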

BitMap Indexes

Comparison, join and aggregation operations are reduced to bit arithmetic with dramatic improvement in processing time
Significant reduction in space and I/O (30:1)
Adapted for higher cardinality domains as well
Compression (e.g., run-length encoding) exploited
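Run-length encoding works well on bitmaps because a bitmap for a high-cardinality column is mostly long runs of zeros. The following is a minimal sketch of the idea on a bit string, not the encoding used by any particular product:

```python
from itertools import groupby


def rle_encode(bits):
    """Collapse a bit string into (bit, run_length) pairs."""
    return [(bit, len(list(run))) for bit, run in groupby(bits)]


def rle_decode(runs):
    """Expand (bit, run_length) pairs back into the original bit string."""
    return "".join(bit * length for bit, length in runs)


bitmap = "0000000000001000000000000000000100000000"
runs = rle_encode(bitmap)
print(runs)
```

Here the 40-bit string collapses to five runs, and `rle_decode(runs)` restores it exactly.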

Issues in Handling of Aggregate Views (summary info)

Important component for ROLAP servers
Logic for aggregation navigation
  make optimum use of materialized aggregates to answer a query
Choice of aggregate views to materialize
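One simple reading of aggregate navigation is: among the materialized views that contain every column a query groups by, pick the smallest. The sketch below makes that simplification explicit. It ignores dimension hierarchies (for example, deriving state from city), and the view names, columns and row counts are hypothetical.

```python
# Hypothetical catalog: view name -> (group-by columns, approximate row count).
MATERIALIZED_VIEWS = {
    "sales_by_city_product_day": ({"city", "product", "day"}, 1_000_000),
    "sales_by_city_month": ({"city", "month"}, 5_000),
    "sales_by_state_month": ({"state", "month"}, 600),
}


def pick_aggregate(query_columns, views):
    """Return the smallest view that covers the query's columns, or None."""
    candidates = [
        (row_count, name)
        for name, (columns, row_count) in views.items()
        if set(query_columns) <= columns
    ]
    return min(candidates)[1] if candidates else None


print(pick_aggregate({"city"}, MATERIALIZED_VIEWS))
print(pick_aggregate({"year"}, MATERIALIZED_VIEWS))
```

A result of `None` means no materialized aggregate applies and the query has to fall back to the base fact table.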

SQL Extensions for Front End Tools

Extended family of aggregate functions
  rank (top 10) and N-tile (“top 30%” of all products)
  median, mode, ...
Reporting features
  running total, cumulative totals
Results of multiple group by
  total sales by month and total sales by product
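To make the semantics of these extensions concrete, the sketch below computes a rank, a top-percent cut and a running total in plain Python. The product sales and monthly totals are illustrative values, and the function names are assumptions; a warehouse server would expose these as SQL functions instead.

```python
from itertools import accumulate

# Illustrative sales totals per product (not from the slides).
SALES = {
    "Zzz Bag": 300,
    "Easy Boot": 210,
    "Cosy Sock": 90,
    "Trail Map": 40,
    "Tent Peg": 10,
}


def rank_products(sales):
    """Products ordered from highest to lowest sales."""
    return sorted(sales, key=sales.get, reverse=True)


def top_n(sales, n):
    """Rank query: the n best-selling products."""
    return rank_products(sales)[:n]


def top_percent(sales, percent):
    """N-tile style query: the best-selling `percent` of products, rounded up."""
    ranked = rank_products(sales)
    count = (len(ranked) * percent + 99) // 100
    return ranked[:count]


def running_total(values):
    """Cumulative sum, as in a running-total report column."""
    return list(accumulate(values))


print(top_n(SALES, 3))
print(top_percent(SALES, 30))
print(running_total([100, 150, 50]))
```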

Approaches to OLAP Servers

Relational OLAP (ROLAP)
  Relational and specialized relational DBMS to store and manage warehouse data
  Example: Oracle Warehouse Builder, MetaCube (Informix)
Multidimensional OLAP (MOLAP)
  Array-based storage structures
  Direct access to array data structures
  Example: Essbase (Arbor, then Hyperion, then Oracle)


ROLAP: data stored in relational tables
MOLAP: data stored in a multidimensional array
  The storage model is an n-dimensional array
  Cube(1,1,1) holds the value stored for juice, 3/1, SF
  (1,1,1) = 10, (1,2,1) = 37, (1,3,1) = 40, (1,4,1) = 12
Array representation has good indexing properties but very poor storage utilization when data is sparse
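Both properties can be seen in a small sketch. A cell's position in the array is computed arithmetically from its coordinates, so no index lookup is needed, but every possible cell occupies a slot whether or not it holds a value. The four cell values are the ones listed above; the dimension sizes (4 products, 4 dates, 3 cities) and the row-major layout are assumptions.

```python
from math import prod

# Assumed dimension sizes: 4 products x 4 dates x 3 cities.
DIMENSION_SIZES = (4, 4, 3)


def offset(coords, sizes):
    """Row-major position of a cell, for 1-based coordinates as on the slide."""
    position = 0
    for coord, size in zip(coords, sizes):
        position = position * size + (coord - 1)
    return position


cells = {(1, 1, 1): 10, (1, 2, 1): 37, (1, 3, 1): 40, (1, 4, 1): 12}

# The dense array reserves one slot for every possible coordinate combination.
dense = [None] * prod(DIMENSION_SIZES)
for coords, value in cells.items():
    dense[offset(coords, DIMENSION_SIZES)] = value

utilization = len(cells) / len(dense)
print(dense[offset((1, 2, 1), DIMENSION_SIZES)], utilization)
```

With only four populated cells out of 48 slots, most of the array is empty, which is the sparsity problem the slide refers to.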

ROLAP vs MOLAP

MOLAP: fast response times (data ready to display); cheaper if you must start from scratch

ROLAP: increased flexibility in choice of queries; cheaper if you already have a relational DB

ETL

Data Extraction
Data Cleaning
  Data may have errors from multiple sources
Data Transformation
  Convert from legacy/host format to warehouse format
Data Load
  Sort, summarize, consolidate, compute views, check integrity, build indexes, partition

Data Cleaning/Transformation Techniques

Extraction: decide which data
Uses domain-specific knowledge to do scrubbing
Discover facts that flag unusual patterns (auditing)
  e.g., some dealer has never received a single complaint
  Products: QDB, SBStar, WizRule
Transformation Rules
  Example: translate “gender” to “sex”
  Products: Warehouse Manager (Prism), Extract (ETI)
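A transformation rule such as the “gender” to “sex” example can be sketched as a field rename plus a value mapping. The rule tables and the M/F target codes below are hypothetical, chosen only to illustrate the idea:

```python
# Hypothetical rules: source field name -> warehouse field name.
FIELD_RENAMES = {"gender": "sex"}

# Hypothetical value codes for the warehouse field, keyed by lowercased input.
VALUE_CODES = {"sex": {"m": "M", "male": "M", "f": "F", "female": "F"}}


def transform(record):
    """Apply the rename and value-code rules to one source record."""
    out = {}
    for field, value in record.items():
        target = FIELD_RENAMES.get(field, field)
        codes = VALUE_CODES.get(target)
        if codes is not None and isinstance(value, str):
            # Unknown values pass through unchanged so they can be audited later.
            value = codes.get(value.strip().lower(), value)
        out[target] = value
    return out


print(transform({"name": "Tina", "gender": "Female"}))
```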

Data Load

Issues:
  huge volumes of data to be loaded
  small time window (usually at night) when the warehouse can be taken off-line
  when to build indexes and summary tables
  allow system administrator to monitor status, cancel, suspend, resume load, or change load rate
  restart after failure with no loss of data integrity
Techniques:
  sort input records on clustering key and use sequential I/O; build indexes and derived tables
  sequential loads still too long
  use parallelism and incremental techniques

Data Refresh

Issues: when to refresh
  on every update: too expensive, only necessary if OLAP queries need current data (e.g., up-to-the-minute stock quotes)
  periodically (e.g., every 24 hours, every week) or after “significant” events
  refresh policy set by administrator based on user needs and traffic
Techniques:
  Full extract from base tables
    read entire source table or database: expensive
  Incremental techniques (related to work on active DBS)
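The difference between the two techniques can be sketched on a units-per-product summary table. A full refresh rescans every base row, while an incremental refresh touches only the rows added since the last refresh. The sketch assumes insert-only changes, and the row values and function names are illustrative.

```python
def full_refresh(base_rows):
    """Recompute the summary by scanning every (product, units) base row."""
    summary = {}
    for product, units in base_rows:
        summary[product] = summary.get(product, 0) + units
    return summary


def incremental_refresh(summary, inserted_rows):
    """Fold only the newly inserted rows into a copy of the existing summary."""
    refreshed = dict(summary)
    for product, units in inserted_rows:
        refreshed[product] = refreshed.get(product, 0) + units
    return refreshed


base = [("1X1", 1), ("2X2", 1)]
delta = [("3X3", 5), ("1X1", 1)]  # rows that arrived since the last refresh

summary = full_refresh(base)
summary_after = incremental_refresh(summary, delta)
print(summary_after)
```

Both routes give the same summary, but the incremental one reads two rows instead of four; that gap grows with the size of the base tables.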

Issues to Consider

Physical Design
  design of summary tables, partitions, indexes
Query processing
  selecting appropriate summary tables
  approximate answers
  runaway queries
Warehouse Management
  detecting runaway queries
  resource management
  incremental refresh techniques
  computing summary tables during load

Research Issues

Data warehouses in the cloud
XML and data warehouses