Decision Support, Data Warehousing, and OLAP
Anindya Datta
Director, iXL Center for E-Commerce
Georgia Institute of Technology
Outline
Data Warehousing Architecture
Terminology: OLTP vs. OLAP
Technologies
Products
Research Issues
References
Data Warehouse
A decision support database that is maintained separately from the organization’s operational databases.
A data warehouse is a subject-oriented, integrated, time-varying, non-volatile collection of data that is used primarily in organizational decision making.
Decision support system
OLTP
On-Line Transaction Processing (OLTP)
  What we have been talking about in class
  For day-to-day operations
  Involves the operational database
  Queries retrieve information from specific tuples
OLAP and Decision Support
On-Line Analytical Processing (OLAP) is an element of decision support systems (DSS).
Information technology to help the knowledge worker (executive, manager, analyst) make faster and better decisions.
  What were the sales volumes by region and product category for the last year?
  Which orders should we fill to maximize revenues?
  Will a 10% discount increase sales volume sufficiently?
Three-Tier Architecture
Warehouse database server
  Almost always a relational DBMS
OLAP servers
  Relational OLAP (ROLAP): extended relational DBMS that maps operations on multidimensional data to standard relational operations.
  Multidimensional OLAP (MOLAP): special purpose server that directly implements multidimensional data and operations.
Clients
  Query and reporting tools
  Analysis tools
  Data mining tools (e.g., trend analysis, prediction)
Data Warehouse vs. Data Marts
Enterprise warehouse: collects all information about subjects for the entire organization.
  Requires extensive business modeling
  May take years to design and build
Data marts: departmental subsets that focus on selected subjects (e.g., a marketing data mart).
  Faster roll out, but complex integration in the long run
Virtual warehouse: views over operational dbs
  Materialize some summary views for efficient query processing
  Easier to build
  Requires excess capacity on operational db servers
Data Warehousing Architecture
[Figure: data warehousing architecture. Labels: External Sources and Operational dbs; Extract, Transform, Load, Refresh; Metadata Repository; Monitoring & Administration; Data Marts; Serve; OLAP servers; Analysis, Query/Reporting, and Data Mining tools.]
OLTP vs. OLAP
                    OLTP                               OLAP
User                Clerk, IT professional             Knowledge worker
Function            Day to day operations              Decision support
DB Design           Application-oriented (E-R based)   Subject-oriented (star, snowflake)
Data                Current, isolated                  Historical, consolidated
View                Detailed, flat relational          Summarized, multidimensional
Usage               Structured, repetitive             Ad hoc
Unit of work        Short, simple transaction          Complex query
Access              Read/write                         Read mostly
Operations          Index/hash on prim. key            Lots of scans
# Records accessed  Tens                               Millions
# Users             Thousands                          Hundreds
Db size             100 MB-GB                          100 GB-TB
Metric              Trans. throughput                  Query throughput, response
OLAP for Decision Support
Goal of OLAP is to support ad-hoc querying for the business analyst
Business analysts are familiar with spreadsheets
Extend the spreadsheet analysis model to work with warehouse data
Multidimensional view of data is the foundation of OLAP
Relational DBMS as Warehouse Server - ROLAP
  Schema design
  Specialized scan, indexing and join techniques
  Handling of aggregate views (querying and materialization)
  Supporting query language extensions beyond SQL
  Complex query processing and optimization
  Data partitioning and parallelism
Warehouse Database Schema
ER design techniques are not appropriate
Design should reflect the multidimensional view
  Star Schema
  Snowflake Schema
  Fact Constellation Schema
Star Schema
A single fact (subject) table and a single table for each dimension
Every fact points to one tuple in each of the dimensions and has additional attributes
Does not capture hierarchies directly
Generated keys are used for performance and maintenance reasons
Example of a Star Schema
[Figure: star schema. A central fact table linked to six dimension tables.]
Fact Table:   OrderNO, SalespersonID, CustomerNO, ProdNo, DateKey, CityName, Quantity, Total Price
Order:        Order No, Order Date
Customer:     Customer No, Customer Name, Customer Address, City
Salesperson:  SalespersonID, SalespersonName, City, Quota
Product:      ProductNO, ProdName, ProdDescr, Category, CategoryDescription, UnitPrice
Date:         DateKey, Date
City:         CityName, State, Country
Types of tables
Fact table
  (OrderNO, SalespersonID, CustomerNo, ProdNo, Total Price)
Dimension tables
  Order (Order No, Order Date)
  Date (DateKey, Date)
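[Figure: ER diagram of the operational (non-dimensional) database.]
Entities:
  SALES TRANSACTION (Tid, Date, Time)
  CUSTOMER (CustomerID, CustomerName, CustomerZip)
  STORE (StoreID, StoreZip)
  REGION (RegionID, RegionName)
  PRODUCT (ProductID, ProductName, Price)
  VENDOR (VendorID, VendorName)
  CATEGORY (CategoryID, CategoryName)
Relationships: SoldVia (with attribute NoOfItems), OccurredAt, IsPartOf, BelongsTo, SuppliedBy, BuysVia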
Category Table
CategoryID  CategoryName
CP          Camping
FW          Footwear

Product Table
ProductID  Name       Price  VendorID  CategoryID
1X1        Zzz Bag    $100   PG        CP
2X2        Easy Boot  $70    MK        FW
3X3        Cosy Sock  $15    MK        FW

Vendor Table
VendorID  VendorName
PG        Pacifica Gear
MK        Mountain King

Region Table
RegionID  RegionName
C         Chicagoland
T         Tristate

Store Table
StoreID  StoreZip  RegionID
S1       60600     C
S2       60605     C
S3       35400     T

Customer Table
CustomerID  CustomerName  CustomerZip
1-2-333     Tina          60137
2-3-444     Tony          60611
4-5-666     Pam           35401

Sales Transaction Table
Tid   CustomerID  StoreID  Date      Time
T111  1-2-333     S1       1/1/2011  8:23:59 AM
T222  2-3-444     S2       1/1/2011  8:24:30 AM
T333  1-2-333     S3       1/2/2011  8:15:08 AM
T444  4-5-666     S4       1/2/2011  8:20:33 AM

Sold Within Table
ProductID  Tid   NoOfItems
1X1        T111  1
2X2        T222  1
3X3        T333  5
1X1        T333  1
2X2        T444  1
3X3        T444  2
Calendar Dimension
CalendarKey  FullDate  Day of Week  Day of Month  Month    Qtr  Year
1            1/1/2011  Saturday     1             January  Q1   2011
2            1/2/2011  Sunday       2             January  Q1   2011

Store Dimension
Store Key  StoreID  StoreZip  RegionID
1          S1       60600     Chicagoland
2          S2       60605     Chicagoland
3          S3       35400     Tristate

Product Dimension
Product Key  ProductID  Name       Price  VendorID       CategoryID
1            1X1        Zzz Bag    $100   Pacifica Gear  Camping
2            2X2        Easy Boot  $70    Mountain King  Footwear
3            3X3        Cosy Sock  $15    Mountain King  Footwear

Customer Dimension
CustomerKey  CustomerID  CustomerName  CustomerZip
1            1-2-333     Tina          60137
2            2-3-444     Tony          60611
3            4-5-666     Pam           35401

SALES Fact table
CalendarKey  Store Key  CustomerKey  Product Key  Dollars Sold  Units Sold
1            1          1            1            1             1
1            2          2            2            1             1
2            1          3            3            5             5
2            1          3            1            1             1
2            3          1            2            1             1
2            3          3            3            2             2
Dimensional SQL version:
  SELECT   SUM(SA.NoOfItems), P.Category, P.Vendor, C.Qtr
  FROM     Calendar C, Store S, Product P, Sales SA
  WHERE    C.CalendarKey = SA.CalendarKey
    AND    S.StoreKey = SA.StoreKey
    AND    P.ProductKey = SA.ProductKey
    AND    P.Category = 'Pacifica'
    AND    S.RegionName = 'Midwest'
    AND    C.DayOfWeek = 'Saturday'
    AND    C.Year = 2010
  GROUP BY P.Category, P.Vendor, C.Qtr
  HAVING   C.Qtr IN (1, 2)
Non-dimensional SQL version:
  SELECT   SUM(SW.NoOfItems), C.CategoryName, V.VendorName, EXTRACTQUARTER(ST.Date)
  FROM     Region R, Store S, SalesTransaction ST, SoldWithin SW, Product P, Vendor V, Category C
  WHERE    R.RegionID = S.RegionID
    AND    S.StoreID = ST.StoreID
    AND    ST.Tid = SW.Tid
    AND    SW.ProductID = P.ProductID
    AND    P.VendorID = V.VendorID
    AND    P.CategoryID = C.CategoryID
    AND    C.CategoryName = 'Pacifica'
    AND    R.RegionName = 'Midwest'
    AND    EXTRACTWEEKDAY(ST.Date) = 'Saturday'
    AND    EXTRACTYEAR(ST.Date) = 2010
  GROUP BY C.CategoryName, V.VendorName, EXTRACTQUARTER(ST.Date)
  HAVING   EXTRACTQUARTER(ST.Date) IN (1, 2)
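To make the dimensional query pattern concrete, here is a minimal sketch that loads a cut-down version of the sample star schema into an in-memory SQLite database and runs one star join. The lowercase table and column names, and the choice to keep only the calendar and product dimensions, are assumptions made for brevity; the rows come from the sample tables above.

```python
import sqlite3

# Hypothetical in-memory star schema; rows are taken from the sample tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE calendar (calendar_key INTEGER PRIMARY KEY, full_date TEXT, day_of_week TEXT);
CREATE TABLE product  (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE sales    (calendar_key INTEGER, product_key INTEGER, units_sold INTEGER);
""")
conn.executemany("INSERT INTO calendar VALUES (?, ?, ?)",
                 [(1, "1/1/2011", "Saturday"), (2, "1/2/2011", "Sunday")])
conn.executemany("INSERT INTO product VALUES (?, ?, ?)",
                 [(1, "Zzz Bag", "Camping"), (2, "Easy Boot", "Footwear"),
                  (3, "Cosy Sock", "Footwear")])
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [(1, 1, 1), (1, 2, 1), (2, 3, 5), (2, 1, 1), (2, 2, 1), (2, 3, 2)])


def units_by_category_and_day(connection):
    """One join per dimension, then group by the dimension attributes."""
    query = """
        SELECT p.category, c.day_of_week, SUM(s.units_sold)
        FROM sales s
        JOIN calendar c ON c.calendar_key = s.calendar_key
        JOIN product  p ON p.product_key  = s.product_key
        GROUP BY p.category, c.day_of_week
        ORDER BY p.category, c.day_of_week
    """
    return connection.execute(query).fetchall()


star_result = units_by_category_and_day(conn)
```

Each dimension contributes exactly one join against the fact table, which is why the dimensional version needs fewer joins than the non-dimensional one.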
Snowflake Schema
Represents the dimensional hierarchy directly by normalizing the dimension tables
Easy to maintain
Saves storage, but it is alleged that it reduces the effectiveness of browsing (Kimball)
Example of a Snowflake Schema
[Figure: snowflake schema. The star schema with the Product, Date and City dimensions normalized into further tables.]
Fact Table:   OrderNO, SalespersonID, CustomerNO, ProdNo, DateKey, CityName, Quantity, Total Price
Order:        Order No, Order Date
Customer:     Customer No, Customer Name, Customer Address, City
Salesperson:  SalespersonID, SalespersonName, City, Quota
Product:      ProductNO, ProdName, ProdDescr, Category, UnitPrice
Category:     CategoryName, CategoryDescr
Date:         DateKey, Date, Month
Month:        Month, Year
Year:         Year
City:         CityName, State, Country
State:        StateName, Country
Multidimensional Data Model
Database is a set of facts (points) in a multidimensional space
A fact has a measure dimension
  quantity that is analyzed, e.g., sale, budget
A set of dimensions on which data is analyzed
  e.g., store, product, date associated with a sale amount
Each dimension has a set of attributes
  e.g., owner, city and county of store
Attributes of a dimension may be related by partial order
  Hierarchy: e.g., county > city > street
  Lattice: e.g., year > month > date, year > week > date
Multidimensional Data Model
[Figure: sales volume as a function of time, city and product. Cube axes: product (Juice, Cola, Milk, Cream), city (NYC, LA, SF), date (3/1, 3/2, 3/3, 3/4); visible cell values include 10, 47, 30, 12.]
Operations in Multidimensional Data Model – Roll-up
Aggregation (roll-up)
  Summarization over an aggregate hierarchy: higher levels of summarization
  e.g., total sales by city rolled up to total sales by state, or daily sales totals rolled up to monthly sales totals
  Subtracting group by columns
[Figure: Roll-Up. The sales cube with the city axis rolled up to states (NY, AL, CA); product axis Juice, Cola, Milk, Cream; date axis 3/1, 3/2, 3/3, 3/4.]
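The two forms of roll-up described on this slide, climbing a hierarchy and subtracting a group-by column, can be sketched over a small cube held as a dictionary. The sales values and the city-to-state mapping are made up for illustration; only the dimension names come from the slides.

```python
from collections import defaultdict

# Hypothetical facts: (product, city, date) -> sales volume.
SALES = {
    ("Juice", "NYC", "3/1"): 10,
    ("Juice", "LA", "3/1"): 4,
    ("Juice", "SF", "3/1"): 6,
    ("Cola", "NYC", "3/1"): 47,
    ("Cola", "NYC", "3/2"): 3,
}

# Assumed city -> state hierarchy used for the roll-up.
CITY_TO_STATE = {"NYC": "NY", "LA": "CA", "SF": "CA"}


def roll_up_city_to_state(facts):
    """Climb the location hierarchy: replace each city with its state and sum."""
    totals = defaultdict(int)
    for (product, city, date), volume in facts.items():
        totals[(product, CITY_TO_STATE[city], date)] += volume
    return dict(totals)


def roll_up_drop_date(facts):
    """Roll up by removing a group-by column (here: date)."""
    totals = defaultdict(int)
    for (product, city, _date), volume in facts.items():
        totals[(product, city)] += volume
    return dict(totals)


by_state = roll_up_city_to_state(SALES)
by_product_city = roll_up_drop_date(SALES)
```

Drill-down is the inverse navigation: it needs the finer-grained facts to still be available, since a rolled-up total cannot be split back into its parts.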
Drill Down
Navigation to detailed data (drill-down)
  Used to adjust the level of detail: getting more detail, finer data
  e.g., total sales by region, then drill down to total sales by city or top 3% of cities
  Adding group by columns
[Figure: Drill Down. The sales cube (products Juice, Cola, Milk, Cream; cities NYC, LA, SF) with the date axis refined to 3/1 am, 3/1 pm, 3/2 am, 3/2 pm; visible cell values include 5, 20, 20, 7 and 5, 27, 10, 5.]
Operations in Multidimensional Data Model
Selection (slice) defines a subcube
  e.g., sales where city = SF and product = Cream
Visualization operations (e.g., pivot)
  Rotating summary tables (next slide)
A Visual Operation: Pivot (Rotate)
[Figure: Pivot. The sales cube (products Juice, Cola, Milk, Cream; cities NYC, LA, SF; dates 3/1 to 3/4; visible cell values 10, 47, 30, 12) rotated into summary tables with Month, Region and Product axes.]
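Slice and pivot can be sketched over the same kind of cube. In this minimal example the cell values are hypothetical, and the key layout (product, city, date) is an assumption of the sketch.

```python
# Hypothetical cube: (product, city, date) -> sales volume.
CUBE = {
    ("Juice", "NYC", "3/1"): 10,
    ("Cola", "NYC", "3/1"): 47,
    ("Cream", "SF", "3/1"): 12,
    ("Cream", "SF", "3/2"): 5,
    ("Milk", "SF", "3/2"): 30,
}


def slice_cube(cube, city=None, product=None):
    """Selection (slice): keep only cells matching the fixed dimension values."""
    return {
        key: value
        for key, value in cube.items()
        if (product is None or key[0] == product)
        and (city is None or key[1] == city)
    }


def pivot(cube, row_dim, col_dim):
    """Build a 2-D summary table (dict of dicts) over two chosen dimensions.

    row_dim and col_dim are positions in the key: 0=product, 1=city, 2=date.
    Swapping them rotates the table.
    """
    table = {}
    for key, value in cube.items():
        row = table.setdefault(key[row_dim], {})
        row[key[col_dim]] = row.get(key[col_dim], 0) + value
    return table


cream_in_sf = slice_cube(CUBE, city="SF", product="Cream")
product_by_date = pivot(CUBE, row_dim=0, col_dim=2)
date_by_product = pivot(CUBE, row_dim=2, col_dim=0)
```

The pivot does not change any value; it only changes which dimension labels the rows and which labels the columns.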
Indexing Techniques
Exploiting indexes to reduce scanning of data is of crucial importance
  Bitmap indexes
  Join indexes
Bit Map Index
Base Table
Cust  Region  Rating
C1    N       H
C2    S       M
C3    W       L
C4    W       H
C5    S       L
C6    W       L
C7    N       H

Region Index
Row ID  N  S  E  W
1       1  0  0  0
2       0  1  0  0
3       0  0  0  1
4       0  0  0  1
5       0  1  0  0
6       0  0  0  1
7       1  0  0  0

Rating Index
Row ID  H  M  L
1       1  0  0
2       0  1  0
3       0  0  1
4       1  0  0
5       0  0  1
6       0  0  1
7       1  0  0

Query: customers where Region = W and Rating = L
BitMap Indexes
Comparison, join and aggregation operations are reduced to bit arithmetic, with dramatic improvement in processing time
Significant reduction in space and I/O (30:1)
Adapted for higher cardinality domains as well
Compression (e.g., run-length encoding) can be exploited
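The query from the Bit Map Index slide (Region = W and Rating = L) can be sketched with Python integers acting as bitmaps. Using one integer per distinct value, with bit i standing for row i, is a simplification chosen for this sketch; production systems use compressed bit vectors.

```python
# Base table from the slide: (customer, region, rating).
BASE_TABLE = [
    ("C1", "N", "H"), ("C2", "S", "M"), ("C3", "W", "L"), ("C4", "W", "H"),
    ("C5", "S", "L"), ("C6", "W", "L"), ("C7", "N", "H"),
]


def build_bitmap_index(rows, column):
    """Return {value: int bitmap}; bit i is set when row i holds that value."""
    index = {}
    for row_number, row in enumerate(rows):
        value = row[column]
        index[value] = index.get(value, 0) | (1 << row_number)
    return index


def rows_from_bitmap(rows, bitmap):
    """Decode a bitmap back into the matching rows."""
    return [row for i, row in enumerate(rows) if (bitmap >> i) & 1]


region_index = build_bitmap_index(BASE_TABLE, column=1)
rating_index = build_bitmap_index(BASE_TABLE, column=2)

# Region = 'W' AND Rating = 'L' becomes a single bitwise AND.
matches = rows_from_bitmap(BASE_TABLE, region_index["W"] & rating_index["L"])
matching_customers = [customer for customer, _region, _rating in matches]
```

The base table is touched only to decode the final bitmap; the selection itself is pure bit arithmetic on the two indexes.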
Issues in Handling of Aggregate Views (summary info)
Important component for ROLAP servers
Logic for aggregation navigation
  make optimum use of materialized aggregates to answer a query
Choice of aggregate views to materialize
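One simple form of aggregation navigation is to answer a query from the smallest materialized view that still contains every column the query groups on. The sketch below assumes a hypothetical catalog of views with made-up row counts; real servers use richer cost models.

```python
# Hypothetical catalog: view name -> (group-by columns, row count).
MATERIALIZED_VIEWS = {
    "sales_by_product_city_date": (frozenset({"product", "city", "date"}), 1_000_000),
    "sales_by_product_city": (frozenset({"product", "city"}), 50_000),
    "sales_by_city": (frozenset({"city"}), 200),
}


def choose_view(query_columns, views=MATERIALIZED_VIEWS):
    """Pick the smallest view whose group-by columns cover the query's columns.

    A view grouped on a superset of the requested columns can answer the query
    by further aggregation (valid for distributive measures such as SUM).
    Returns None when no view qualifies and the base fact table must be scanned.
    """
    needed = frozenset(query_columns)
    candidates = [
        (row_count, name)
        for name, (columns, row_count) in views.items()
        if needed <= columns
    ]
    return min(candidates)[1] if candidates else None
```

For example, a query grouped only by product would be routed to `sales_by_product_city` rather than the much larger three-column view.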
SQL Extensions for Front End Tools
Extended family of aggregate functions
  rank (top 10) and N-tile (“top 30%” of all products)
  median, mode, ...
Reporting features
  running total, cumulative totals
Results of multiple group by
  total sales by month and total sales by product
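To show what these extensions compute, here is a plain Python sketch of a top-N ranking, an N-tile style cut, and a running total. The sales figures and product names beyond the slide samples are hypothetical, and the function names are inventions of this sketch rather than any product's syntax.

```python
from itertools import accumulate

# Hypothetical figures, for illustration only.
MONTHLY_SALES = [("Jan", 120), ("Feb", 80), ("Mar", 150), ("Apr", 100)]
PRODUCT_SALES = {"Zzz Bag": 300, "Easy Boot": 210, "Cosy Sock": 45,
                 "Tent": 500, "Stove": 90}


def top_n(sales, n):
    """Rank items by sales (descending) and keep the first n."""
    return sorted(sales, key=sales.get, reverse=True)[:n]


def top_fraction(sales, fraction):
    """N-tile style cut: the best `fraction` of items, at least one."""
    count = max(1, int(len(sales) * fraction))
    return top_n(sales, count)


def running_total(series):
    """Cumulative total over an ordered (label, amount) series."""
    labels = [label for label, _ in series]
    totals = accumulate(amount for _, amount in series)
    return list(zip(labels, totals))
```

All three depend on an ordering of the rows, which is why they fit awkwardly into plain GROUP BY aggregation and were added as extensions.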
Approaches to OLAP Servers
Relational OLAP (ROLAP)
  Relational and specialized relational DBMS to store and manage warehouse data
  Example: Oracle Warehouse Builder, MetaCube (Informix)
Multidimensional OLAP (MOLAP)
  Array-based storage structures
  Direct access to array data structures
  Example: Essbase (Arbor, then Hyperion, then Oracle)
ROLAP: data stored in relational tables
MOLAP: data stored in a multidimensional array
  The storage model is an n-dimensional array
  Cube(1,1,1) is the value stored for juice, 3/1, SF
  1,1,1 = 10   1,2,1 = 37   1,3,1 = 40   1,4,1 = 12
  Array representation has good indexing properties but very poor storage utilization when data is sparse
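The trade-off between positional access and storage utilization can be sketched with a nested-list cube. The dimension orderings and the third fact are assumptions of this sketch; the first two values follow the Cube(1,1,1) and Cube(1,2,1) entries on the slide.

```python
# Dimension members; positions in these lists are the array coordinates.
PRODUCTS = ["Juice", "Cola", "Milk", "Cream"]
DATES = ["3/1", "3/2", "3/3", "3/4"]
CITIES = ["SF", "LA", "NYC"]

# Sparse facts: only a few (product, date, city) cells hold a value.
FACTS = {
    ("Juice", "3/1", "SF"): 10,
    ("Juice", "3/2", "SF"): 37,
    ("Cola", "3/4", "NYC"): 5,  # hypothetical value
}


def build_dense_cube(facts):
    """Allocate every cell of the n-dimensional array, filled or not."""
    cube = [[[None for _ in CITIES] for _ in DATES] for _ in PRODUCTS]
    for (product, date, city), value in facts.items():
        cube[PRODUCTS.index(product)][DATES.index(date)][CITIES.index(city)] = value
    return cube


def lookup(cube, product, date, city):
    """Direct positional access: no search over stored facts is needed."""
    return cube[PRODUCTS.index(product)][DATES.index(date)][CITIES.index(city)]


def utilization(facts):
    """Fraction of allocated cells that actually hold data."""
    total_cells = len(PRODUCTS) * len(DATES) * len(CITIES)
    return len(facts) / total_cells


dense_cube = build_dense_cube(FACTS)
```

Here 3 of 48 cells are populated, so the dense array spends most of its space on empty cells even though any populated cell is reachable by coordinates alone.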
ROLAP vs MOLAP
MOLAP: fast response times (data is ready to display); cheaper if starting from scratch
ROLAP: increased flexibility in choice of queries; cheaper if a relational DB is already in place
ETL
Data Extraction
Data Cleaning
  Data may have errors from multiple sources
Data Transformation
  Convert from legacy/host format to warehouse format
Data Load
  Sort, summarize, consolidate, compute views, check integrity, build indexes, partition
Data Cleaning/Transformation Techniques
Extraction: decide which data
Uses domain-specific knowledge to do scrubbing
Discover facts that flag unusual patterns (auditing)
  e.g., some dealer has never received a single complaint
  Products: QDB, SBStar, WizRule
Transformation rules
  Example: translate “gender” to “sex”
  Products: Warehouse Manager (Prism), Extract (ETI)
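A transformation rule such as translating "gender" to "sex" can be sketched as a field rename followed by value normalization. The code table and the choice to map unknown codes to None are assumptions of this sketch, not behavior of the products named above.

```python
# Hypothetical transformation rules: source field name -> warehouse field name.
FIELD_RENAMES = {"gender": "sex"}

# Hypothetical value normalization applied to the renamed field.
SEX_CODES = {"m": "M", "male": "M", "f": "F", "female": "F"}


def transform_record(record):
    """Rename legacy fields and normalize codes; unknown codes become None."""
    out = {FIELD_RENAMES.get(field, field): value for field, value in record.items()}
    if "sex" in out:
        out["sex"] = SEX_CODES.get(str(out["sex"]).strip().lower())
    return out
```

Records whose code normalizes to None are natural candidates for the scrubbing and auditing steps listed above.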
Data Load
Issues:
  huge volumes of data to be loaded
  small time window (usually at night) when the warehouse can be taken off-line
  when to build indexes and summary tables
  allow system administrator to monitor status, cancel, suspend, resume load, or change load rate
  restart after failure with no loss of data integrity
Techniques:
  sort input records on clustering key and use sequential I/O; build indexes and derived tables
  sequential loads still take too long
  use parallelism and incremental techniques
Data Refresh
Issues: when to refresh
  on every update: too expensive, only necessary if OLAP queries need current data (e.g., up-to-the-minute stock quotes)
  periodically (e.g., every 24 hours, every week) or after “significant” events
  refresh policy set by administrator based on user needs and traffic
Techniques:
  Full extract from base tables: read entire source table or database (expensive)
  Incremental techniques (related to work on active DBS)
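The difference between a full extract and an incremental refresh can be sketched for a SUM summary table. The row format (product, amount) and the function names are assumptions of this sketch; it presumes the source can report which rows were inserted or deleted since the last refresh.

```python
def full_refresh(source_rows):
    """Full extract: rebuild the summary by reading every source row."""
    totals = {}
    for product, amount in source_rows:
        totals[product] = totals.get(product, 0) + amount
    return totals


def incremental_refresh(summary, inserted_rows, deleted_rows=()):
    """Apply only the changes since the last refresh to a SUM summary."""
    updated = dict(summary)
    for product, amount in inserted_rows:
        updated[product] = updated.get(product, 0) + amount
    for product, amount in deleted_rows:
        updated[product] = updated.get(product, 0) - amount
    return updated
```

The incremental version does work proportional to the number of changed rows rather than the size of the source. It works here because SUM can be adjusted from deltas alone; aggregates such as MIN or MAX need extra bookkeeping when rows are deleted.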
Issues to Consider
Physical Design
  design of summary tables, partitions, indexes
Query processing
  selecting appropriate summary tables
  approximate answers
  runaway queries
Warehouse Management
  detecting runaway queries
  resource management
  incremental refresh techniques
  computing summary tables during load