Roadmap
1. What is the data warehouse, data mart
2. Multi-dimensional data modeling
3. Data warehouse design – schemas, indices
4. The Data Cube operator – semantics and computation
5. Aggregate View Selection
Why Not Use the Existing DB?
• A DBMS is for On-Line Transaction Processing (OLTP)
– automates day-to-day operations (purchasing, banking, etc.)
• A Data Warehouse is for On-Line Analytical Processing (OLAP)
– needs historical data for trend analysis
OLTP vs. OLAP
                    OLTP                                    OLAP
users               clerk, IT professional                  knowledge worker
function            day-to-day operations                   decision support
DB design           application-oriented                    subject-oriented
data                current, up-to-date;                    historical, summarized,
                    detailed, flat relational;              multidimensional;
                    isolated                                integrated, consolidated
usage               repetitive                              ad hoc
access              read/write;                             lots of scans
                    index/hash on prim. key
unit of work        short, simple transaction               complex query
# records accessed  tens                                    millions
# users             thousands                               hundreds
DB size             100MB-GB                                100GB-TB
metric              transaction throughput                  query throughput, response
Examples of OLAP
• Comparisons (this period vs. last period)
– Show me the sales per store for this year and compare them to those of the previous year to identify discrepancies
• Ranking and statistical profiles (top N/bottom N)
– Show me sales, profit and average call volume per day for my 10 most profitable salespeople
• Custom consolidation (market segments, ad hoc groups)
– Show me an abbreviated income statement by quarter for the last four quarters for my northeast region operations
Multidimensional Modeling
• Example: compute total sales volume per product and store
Store  Product  Total Sales
1      1        454
1      4        925
2      1        468
2      2        800
Etc.
The same totals as a cross-tabulation (rows: Store, columns: Product):
Total Sales     Product
Store         1     2     3     4
1           454     -     -   925
2           468   800     -     -
3           296     -   240     -
4           652     -   540   745
[Figure: Product × Store grid; the cell for store 2, product 2 holds 800]
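The move from the flat (store, product, total) table to the store-by-product grid can be sketched in a few lines of Python. This is only an illustration: the rows are read off the cross-tabulation above (the flat table on the slide stops at "Etc."), and the helper name `to_crosstab` is made up for this example.

```python
# Fact rows (store, product, total_sales), read off the cross-tabulation above.
facts = [
    (1, 1, 454), (1, 4, 925),
    (2, 1, 468), (2, 2, 800),
    (3, 1, 296), (3, 3, 240),
    (4, 1, 652), (4, 3, 540), (4, 4, 745),
]

def to_crosstab(rows):
    """Return {store: {product: total}} built from (store, product, total) rows."""
    table = {}
    for store, product, total in rows:
        row = table.setdefault(store, {})
        row[product] = row.get(product, 0) + total
    return table

crosstab = to_crosstab(facts)
products = sorted({product for _, product, _ in facts})

# Print the grid; combinations with no sales are shown as "-".
print("Store  " + "".join(f"{p:>6}" for p in products))
for store in sorted(crosstab):
    cells = "".join(f"{crosstab[store].get(p, '-'):>6}" for p in products)
    print(f"{store:<7}{cells}")
```

Each fact row becomes one cell of the grid, and the "-" cells are simply combinations that never occur in the fact table.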
From Tables and Spreadsheets to Data Cubes
• In general, a multidimensional data model views data in the form of a data cube
• A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions
– Dimension tables, such as item (item_name, brand, type), or time(day, week, month, quarter, year)
– Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables
• In the data warehousing literature, an n-D base cube is called a base cuboid. The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.
Cube: A Lattice of Cuboids
0-D (apex) cuboid: all
1-D cuboids: time; item; location; supplier
2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
4-D (base) cuboid: time,item,location,supplier
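The lattice above is just the set of all subsets of the dimension list, grouped by size. A minimal Python sketch, assuming no dimension hierarchies (so n dimensions give 2^n cuboids); the function name `cuboids_by_level` is made up for this example:

```python
from itertools import combinations

dimensions = ("time", "item", "location", "supplier")

def cuboids_by_level(dims):
    """Map k to the list of k-D cuboids, each a tuple of dimension names."""
    return {k: list(combinations(dims, k)) for k in range(len(dims) + 1)}

lattice = cuboids_by_level(dimensions)
for k, cuboids in lattice.items():
    # The empty tuple is the apex cuboid, written "all" on the slide.
    names = [",".join(c) if c else "all" for c in cuboids]
    print(f"{k}-D: {names}")

total = sum(len(cuboids) for cuboids in lattice.values())
print("number of cuboids:", total)
```

For the four dimensions on the slide this gives 1 + 4 + 6 + 4 + 1 = 16 cuboids.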
Dimensions and Hierarchies
[Figure: a cube with dimensions product, city, and month; the highlighted cell holds the sales of DVDs in Hyd in August]
Hierarchies (coarsest level at the top):
PRODUCT     LOCATION    TIME
category    region      year
product     country     quarter
            state       month   week
            city        day
            store
• A cell in the cube may store values (measurements) relative to the combination of the labeled dimensions
Common OLAP Operations
• Roll-up: move up the hierarchy
– e.g., given total sales per city, we can roll up to get sales per state
• Drill-down: move down the hierarchy
– more fine-grained aggregation
PRODUCT     LOCATION    TIME
category    region      year
product     country     quarter
            state       month   week
            city        day
            store
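Roll-up along the LOCATION hierarchy amounts to summing each child value into its parent. The sketch below is illustrative only: the cities, states, and sales figures are hypothetical (they do not come from the slides), and `roll_up` is a made-up helper name.

```python
from collections import defaultdict

# Hypothetical data, for illustration only (not from the slides).
sales_per_city = {"Hyderabad": 120, "Warangal": 30, "Chennai": 200, "Madurai": 50}
city_to_state = {
    "Hyderabad": "Telangana",
    "Warangal": "Telangana",
    "Chennai": "Tamil Nadu",
    "Madurai": "Tamil Nadu",
}

def roll_up(values, parent_of):
    """Sum each child's value into its parent level of the hierarchy."""
    totals = defaultdict(int)
    for child, value in values.items():
        totals[parent_of[child]] += value
    return dict(totals)

sales_per_state = roll_up(sales_per_city, city_to_state)
print(sales_per_state)
```

Drill-down is the reverse navigation: it cannot be computed from `sales_per_state` alone, so the finer-grained data (here `sales_per_city`) has to be available.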
Pivoting
• Pivoting: aggregate on selected dimensions
– usually 2 dims (cross-tabulation)
Sales           Product
Store         1     2     3     4   ALL
1           454     -     -   925  1379
2           468   800     -     -  1268
3           296     -   240     -   536
4           652     -   540   745  1937
ALL        1870   800   780  1670  5120
Slice and Dice Queries
• Slice and Dice: select and project on one or more dimensions
[Figure: a cube with dimensions product, customers, and store, sliced at customer = “Kalam”]
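One way to read "select and project" on a cube is sketched below. The cell values and the customer "Rao" are hypothetical; only the dimension names and customer = "Kalam" come from the slide. The function names `slice_cube` and `dice_cube` are made up, and the slice/dice distinction used here (slice fixes one dimension to one value, dice restricts several dimensions to sets of values) is a common convention rather than something the slide spells out.

```python
# Hypothetical cells keyed by (product, customer, store); values are sales.
DIMS = ("product", "customer", "store")
cube = {
    ("DVD", "Kalam", 1): 10,
    ("DVD", "Rao", 1): 7,
    ("PC", "Kalam", 2): 25,
    ("VCR", "Kalam", 1): 4,
    ("PC", "Rao", 2): 12,
}

def slice_cube(cells, dim, value):
    """Select cells where dim == value, then project that dimension away."""
    i = DIMS.index(dim)
    return {key[:i] + key[i + 1:]: v for key, v in cells.items() if key[i] == value}

def dice_cube(cells, allowed):
    """Keep cells whose coordinates lie in the allowed value sets; keys are unchanged."""
    return {
        key: v
        for key, v in cells.items()
        if all(key[DIMS.index(d)] in values for d, values in allowed.items())
    }

kalam = slice_cube(cube, "customer", "Kalam")
diced = dice_cube(cube, {"product": {"DVD", "PC"}, "store": {1}})
print(kalam)
print(diced)
```

The slice leaves a 2-D (product, store) table for one customer, while the dice keeps all three dimensions but over a smaller sub-cube.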
The Data Cube Operator (Gray et al)
• All previous aggregates in a single query:
SELECT LOCATION.store, SALES.product_key, SUM (amount)
FROM SALES, LOCATION
WHERE SALES.location_key=LOCATION.location_key
CUBE BY SALES.product_key, LOCATION.store
OR
CUBE product_key, store BY SUM(SALES.amount)
Challenge: Optimize Aggregate Computation
Store  Product_key  sum(amount)
1      1            454
1      4            925
2      1            468
2      2            800
3      1            296
3      3            240
4      1            652
4      3            540
4      4            745
1      ALL          1379
2      ALL          1268
3      ALL          536
4      ALL          1937
ALL    1            1870
ALL    2            800
ALL    3            780
ALL    4            1670
ALL    ALL          5120
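To make the semantics of the result table concrete, here is a deliberately naive Python sketch of the cube computation: one grouping pass per subset of the dimensions, with dropped dimensions replaced by ALL. The fact rows are read off the cross-tabulation shown earlier, the name `cube_by` is made up, and this is not presented as the algorithm from Gray et al.

```python
from collections import defaultdict
from itertools import combinations

ALL = "ALL"

# Fact rows (store, product_key, amount), read off the cross-tabulation.
facts = [
    (1, 1, 454), (1, 4, 925),
    (2, 1, 468), (2, 2, 800),
    (3, 1, 296), (3, 3, 240),
    (4, 1, 652), (4, 3, 540), (4, 4, 745),
]

def cube_by(rows, n_dims):
    """Naive CUBE: one pass over the rows for every subset of the grouping dimensions."""
    result = {}
    for k in range(n_dims + 1):
        for kept in combinations(range(n_dims), k):
            groups = defaultdict(int)
            for *dims, amount in rows:
                # Dimensions outside the kept subset collapse to ALL.
                key = tuple(dims[i] if i in kept else ALL for i in range(n_dims))
                groups[key] += amount
            result.update(groups)
    return result

cube = cube_by(facts, 2)
for key in sorted(cube, key=str):
    print(key, cube[key])
```

This version scans the fact rows 2^n times for n dimensions, which is exactly the cost the "challenge" refers to; a better plan can compute coarser aggregates (such as the ALL rows) from finer ones that are already available.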
Relational View of Data Cube
Sales           Product
Store         1     2     3     4   ALL
1           454     -     -   925  1379
2           468   800     -     -  1268
3           296     -   240     -   536
4           652     -   540   745  1937
ALL        1870   800   780  1670  5120
SELECT LOCATION.store, SALES.product_key, SUM (amount)
FROM SALES, LOCATION
WHERE SALES.location_key=LOCATION.location_key
CUBE BY SALES.product_key, LOCATION.store
Data Cube: Multidimensional View
[Figure: a 3-D sales cube with axes Quarter (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (DVD, VCR, PC, sum), and Region (America, Europe, Asia, sum); the highlighted cell is the total annual sales of DVDs in America]