dw e modelo multidimensional (baseado nos slides do livro: data mining: c & t)

65
DW e Modelo DW e Modelo multidimensional multidimensional (baseado nos slides do (baseado nos slides do livro: Data Mining: C & T) livro: Data Mining: C & T)

Post on 20-Dec-2015

230 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: DW e Modelo multidimensional (baseado nos slides do livro: Data Mining: C & T)

DW e Modelo DW e Modelo multidimensionalmultidimensional

(baseado nos slides do livro: Data (baseado nos slides do livro: Data Mining: C & T)Mining: C & T)

Page 2: DW e Modelo multidimensional (baseado nos slides do livro: Data Mining: C & T)

SAD Tagus 2004/05 H. Galhardas

What is a data warehouse?What is a data warehouse?

A multi-dimensional data modelA multi-dimensional data model

OLAP operationsOLAP operations

OutlineOutline

Page 3: DW e Modelo multidimensional (baseado nos slides do livro: Data Mining: C & T)

SAD Tagus 2004/05 H. Galhardas

What is Data What is Data Warehouse?Warehouse?

Defined in many different ways, but not rigorously.Defined in many different ways, but not rigorously.

A decision support database that is maintained separately from

the organization’s operational database

Support information processing by providing a solid platform of

consolidated, historical data for analysis.

““A data warehouse is a A data warehouse is a subject-orientedsubject-oriented, , integratedintegrated, , time-time-

variantvariant, and , and nonvolatilenonvolatile collection of data in support of collection of data in support of

management’s decision-making process.”—W. H. Inmonmanagement’s decision-making process.”—W. H. Inmon

Data warehousingData warehousing::

The process of constructing and using data warehouses

Page 4: DW e Modelo multidimensional (baseado nos slides do livro: Data Mining: C & T)

SAD Tagus 2004/05 H. Galhardas

Data Warehouse—Data Warehouse—Subject-OrientedSubject-Oriented

Organized around major subjects, such as Organized around major subjects, such as customer, customer,

product, salesproduct, sales..

Focusing on the modeling and analysis of data for Focusing on the modeling and analysis of data for

decision makers, not on daily operations or decision makers, not on daily operations or

transaction processing.transaction processing.

Provide Provide a simple and concisea simple and concise view around particular view around particular

subject issues by subject issues by excluding data that are not useful in excluding data that are not useful in

the decision support processthe decision support process..

Page 5: DW e Modelo multidimensional (baseado nos slides do livro: Data Mining: C & T)

SAD Tagus 2004/05 H. Galhardas

Data Warehouse—Data Warehouse—IntegratedIntegrated

Constructed by Constructed by integratingintegrating multiple, heterogeneous multiple, heterogeneous data sourcesdata sources relational databases, flat files, on-line transaction records

Data cleaning and data integrationData cleaning and data integration techniques are techniques are applied.applied. Ensure consistency in naming conventions, encoding

structures, attribute measures, etc. among different data sources

E.g., Hotel price: currency, tax, breakfast covered, etc.

When data is moved to the warehouse, it is converted.

Page 6: DW e Modelo multidimensional (baseado nos slides do livro: Data Mining: C & T)

SAD Tagus 2004/05 H. Galhardas

Data Warehouse—Time Data Warehouse—Time VariantVariant

The time horizon for the data warehouse is The time horizon for the data warehouse is

significantly longer than that of operational systems.significantly longer than that of operational systems. Operational database: current value data.

Data warehouse data: provide information from a historical

perspective (e.g., past 5-10 years)

Every key structure in the data warehouse contains Every key structure in the data warehouse contains

an an element of timeelement of time, explicitly or implicitly, explicitly or implicitly But the key of operational data may or may not contain

“time element”.

Page 7: DW e Modelo multidimensional (baseado nos slides do livro: Data Mining: C & T)

SAD Tagus 2004/05 H. Galhardas

Data Warehouse—Non-Data Warehouse—Non-VolatileVolatile

A A physically separate storephysically separate store of data transformed of data transformed

from the operational environment.from the operational environment.

Operational Operational update of data does not occurupdate of data does not occur in the in the

data warehouse environment.data warehouse environment. Does not require transaction processing, recovery,

and concurrency control mechanisms

Requires only two operations in data accessing:

initial loading of data and access of data.

Page 8: DW e Modelo multidimensional (baseado nos slides do livro: Data Mining: C & T)

SAD Tagus 2004/05 H. Galhardas

Data Warehouse vs. Data Warehouse vs. Heterogeneous DBMSHeterogeneous DBMS

Traditional heterogeneous DB integration: Traditional heterogeneous DB integration: Build wrappers/mediators on top of heterogeneous databases

Query driven approach

When a query is posed to a client site, a meta-dictionary is

used to translate the query into queries appropriate for

individual heterogeneous sites involved, and the results are

integrated into a global answer set

Complex information filtering, compete for resources

Data warehouse: Data warehouse: update-drivenupdate-driven, high performance, high performance Information from heterogeneous sources is integrated in advance

and stored in warehouses for direct query and analysis

Page 9: DW e Modelo multidimensional (baseado nos slides do livro: Data Mining: C & T)

SAD Tagus 2004/05 H. Galhardas

Data Warehouse vs. Data Warehouse vs. Operational DBMSOperational DBMS

OLTP (on-line transaction processing)OLTP (on-line transaction processing) Major task of traditional relational DBMS Day-to-day operations: purchasing, inventory, banking,

manufacturing, payroll, registration, accounting, etc.

OLAP (on-line analytical processing)OLAP (on-line analytical processing) Major task of data warehouse system Data analysis and decision making

Distinct features (OLTP vs. OLAP):Distinct features (OLTP vs. OLAP): User and system orientation: customer vs. market Data contents: current, detailed vs. historical, consolidated Database design: ER + application vs. star + subject View: current, local vs. evolutionary, integrated Access patterns: update vs. read-only but complex queries

Page 10: DW e Modelo multidimensional (baseado nos slides do livro: Data Mining: C & T)

SAD Tagus 2004/05 H. Galhardas

OLTP vs. OLAPOLTP vs. OLAP OLTP OLAP

users clerk, IT professional knowledge worker

function day to day operations decision support

DB design application-oriented subject-oriented

data current, up-to-date detailed, flat relational isolated

historical, summarized, multidimensional integrated, consolidated

usage repetitive ad-hoc

access read/write index/hash on prim. key

lots of scans

unit of work short, simple transaction complex query

# records accessed tens millions

#users thousands hundreds

DB size 100MB-GB 100GB-TB

metric transaction throughput query throughput, response

Page 11: DW e Modelo multidimensional (baseado nos slides do livro: Data Mining: C & T)

SAD Tagus 2004/05 H. Galhardas

Why Separate Data Why Separate Data Warehouse?Warehouse?

High performance for both systemsHigh performance for both systems DBMS— tuned for OLTP: access methods, indexing, concurrency control,

recovery

Warehouse—tuned for OLAP: complex OLAP queries, multidimensional view,

consolidation

Different functions and different data:Different functions and different data: missing data: Decision support requires historical data which operational DBs do

not typically maintain

data consolidation: DS requires consolidation (aggregation, summarization) of

data from heterogeneous sources

data quality: different sources typically use inconsistent data representations,

codes and formats which have to be reconciled

Note: There are more and more systems which perform OLAP analysis Note: There are more and more systems which perform OLAP analysis

directly on relational databasesdirectly on relational databases

Page 12: DW e Modelo multidimensional (baseado nos slides do livro: Data Mining: C & T)

SAD Tagus 2004/05 H. Galhardas

What is a data warehouse?What is a data warehouse?

A multi-dimensional data modelA multi-dimensional data model

OLAP operationsOLAP operations

OutlineOutline

Page 13: DW e Modelo multidimensional (baseado nos slides do livro: Data Mining: C & T)

SAD Tagus 2004/05 H. Galhardas

From Tables and From Tables and Spreadsheets to Data Spreadsheets to Data

CubesCubes A data warehouse is based on a A data warehouse is based on a multidimensional data multidimensional data

modelmodel which views data in the form of a data cube which views data in the form of a data cube

A A data cubedata cube, such as sales, allows data to be modeled , such as sales, allows data to be modeled

and viewed in multiple dimensionsand viewed in multiple dimensions

A data cube is defined by:A data cube is defined by: Dimension tables, such as item (item_name, brand, type), or

time(day, week, month, quarter, year)

Fact table contains (usually, numerical) measures or facts (such

as dollars_sold) and keys to each of the related dimension

tables

Page 14: DW e Modelo multidimensional (baseado nos slides do livro: Data Mining: C & T)

SAD Tagus 2004/05 H. Galhardas

2-D view of sales data 2-D view of sales data according to dimensions according to dimensions

time and itemtime and item

Page 15: DW e Modelo multidimensional (baseado nos slides do livro: Data Mining: C & T)

SAD Tagus 2004/05 H. Galhardas

3-D view of sales data 3-D view of sales data according to dimensions according to dimensions

time, item, locationtime, item, location

Page 16: DW e Modelo multidimensional (baseado nos slides do livro: Data Mining: C & T)

SAD Tagus 2004/05 H. Galhardas

3-D data cube 3-D data cube representationrepresentation

Page 17: DW e Modelo multidimensional (baseado nos slides do livro: Data Mining: C & T)

SAD Tagus 2004/05 H. Galhardas

4-D view of sales data 4-D view of sales data according to dimensions according to dimensions

time, item, location, time, item, location, suppliersupplier

Page 18: DW e Modelo multidimensional (baseado nos slides do livro: Data Mining: C & T)

SAD Tagus 2004/05 H. Galhardas

Data CubeData Cube

In data warehousing literature, an n-D base cube is In data warehousing literature, an n-D base cube is

called a called a base cuboidbase cuboid. The top most 0-D cuboid, which . The top most 0-D cuboid, which

holds the highest-level of summarization, is called the holds the highest-level of summarization, is called the

apex cuboidapex cuboid. The lattice of cuboids forms a . The lattice of cuboids forms a data cube.data cube.

Any n-D data can be displayed as a series of Any n-D data can be displayed as a series of (n-1)-D(n-1)-D

cubescubes

Data cube is a logical representation; its physical Data cube is a logical representation; its physical

storage may be differentstorage may be different

Page 19: DW e Modelo multidimensional (baseado nos slides do livro: Data Mining: C & T)

SAD Tagus 2004/05 H. Galhardas

Cube: A Lattice of Cube: A Lattice of CuboidsCuboids

time,item

time,item,location

time, item, location, supplier

all

time item location supplier

time,location

time,supplier

item,location

item,supplier

location,supplier

time,item,supplier

time,location,supplier

item,location,supplier

0-D(apex) cuboid

1-D cuboids

2-D cuboids

3-D cuboids

4-D(base) cuboid

Page 20: DW e Modelo multidimensional (baseado nos slides do livro: Data Mining: C & T)

SAD Tagus 2004/05 H. Galhardas

Conceptual Modeling of Conceptual Modeling of DWDW

A multidimensional model can exist in the following A multidimensional model can exist in the following

forms:forms: Star schema: A fact table without redundancy in the middle

connected to a set of dimension tables

Snowflake schema: A refinement of star schema where

some dimensions are normalized therefore splitting the

data into additional tables. The resulting schema graph

forms a shape similar to snowflake

Fact constellations or galaxy schema: Multiple fact tables

share dimension tables, viewed as a collection of stars

Page 21: DW e Modelo multidimensional (baseado nos slides do livro: Data Mining: C & T)

SAD Tagus 2004/05 H. Galhardas

Example of a Star SchemaExample of a Star Schema

time_keydayday_of_the_weekmonthquarteryear

time

location_keystreetcityprovince_or_streetcountry

location

Sales Fact Table

time_key

item_key

branch_key

location_key

units_sold

dollars_sold

avg_sales

Measures

item_keyitem_namebrandtypesupplier_type

item

branch_keybranch_namebranch_type

branch

Page 22: DW e Modelo multidimensional (baseado nos slides do livro: Data Mining: C & T)

SAD Tagus 2004/05 H. Galhardas

Example of Snowflake Example of Snowflake SchemaSchema

time_keydayday_of_the_weekmonthquarteryear

time

location_keystreetcity_key

location

Sales Fact Table

time_key

item_key

branch_key

location_key

units_sold

dollars_sold

avg_sales

Measures

item_keyitem_namebrandtypesupplier_key

item

branch_keybranch_namebranch_type

branch

supplier_keysupplier_type

supplier

city_keycityprovince_or_streetcountry

city

Page 23: DW e Modelo multidimensional (baseado nos slides do livro: Data Mining: C & T)

SAD Tagus 2004/05 H. Galhardas

Example of Fact Example of Fact ConstellationConstellation

time_keydayday_of_the_weekmonthquarteryear

time

location_keystreetcityprovince_or_streetcountry

location

Sales Fact Table

time_key

item_key

branch_key

location_key

units_sold

dollars_sold

avg_sales

Measures

item_keyitem_namebrandtypesupplier_type

item

branch_keybranch_namebranch_type

branch

Shipping Fact Table

time_key

item_key

shipper_key

from_location

to_location

dollars_cost

units_shipped

shipper_keyshipper_namelocation_keyshipper_type

shipper

Page 24: DW e Modelo multidimensional (baseado nos slides do livro: Data Mining: C & T)

SAD Tagus 2004/05 H. Galhardas

Escolha das tabelas de Escolha das tabelas de factos e de dimensõesfactos e de dimensões

Análise das interrogaçõesAnálise das interrogaçõesatributos group-by indicam as dimensõesatributos agregados referem as medidas ou factosatributos where são os atributos da tabelas

de dimensões Exemplo :Exemplo :

select s s.time_key, s.location_key, sum(s.units_sold)from item i, sales swhere i.item_key=sales.item_key and i.item_name = ‘clothes’group by s.time_key, s.location_key

Page 25: DW e Modelo multidimensional (baseado nos slides do livro: Data Mining: C & T)

SAD Tagus 2004/05 H. Galhardas

Categories of measures Categories of measures

DistributiveDistributive: if the result derived by applying the function to : if the result derived by applying the function to n n aggregate values is the same as that derived by applying aggregate values is the same as that derived by applying the function on all the data without partitioning.the function on all the data without partitioning.

Ex: count(), sum(), min(), max().

AlgebraicAlgebraic:: if it can be computed by an algebraic function with if it can be computed by an algebraic function with MM arguments (where arguments (where M M is a constant), each of which is is a constant), each of which is obtained by applying a distributive aggregate function.obtained by applying a distributive aggregate function.

Ex: avg(), min_N(), standard_deviation().

HolisticHolistic:: if there is no algebraic function with M arguments if there is no algebraic function with M arguments that characterizes the computation.that characterizes the computation.

Ex: median(), mode(), rank().

Page 26: DW e Modelo multidimensional (baseado nos slides do livro: Data Mining: C & T)

SAD Tagus 2004/05 H. Galhardas

Hierarquia de conceitosHierarquia de conceitos

Define uma sequência de correspondências Define uma sequência de correspondências entre um conjunto de conceitos de baixo nível e entre um conjunto de conceitos de baixo nível e um conjunto de conceitos de alto nível.um conjunto de conceitos de alto nível.

Exemplos: dimensão localização e dimensão Exemplos: dimensão localização e dimensão tempotempo

Page 27: DW e Modelo multidimensional (baseado nos slides do livro: Data Mining: C & T)

SAD Tagus 2004/05 H. Galhardas

Dimensão Dimensão locationlocation

all

Europe North_America

MexicoCanadaSpainGermany

Vancouver

M. WindL. Chan

...

......

... ...

...

all

region

office

country

TorontoFrankfurtcity

Page 28: DW e Modelo multidimensional (baseado nos slides do livro: Data Mining: C & T)

SAD Tagus 2004/05 H. Galhardas

Dimensão Dimensão TimeTime

year

quarter

month

day

week

Page 29: DW e Modelo multidimensional (baseado nos slides do livro: Data Mining: C & T)

SAD Tagus 2004/05 H. Galhardas

Multidimensional DataMultidimensional Data Sales volume as a function of product, Sales volume as a function of product,

month, and regionmonth, and region

Pro

duct

Regio

n

Month

Dimensions: Product, Location, TimeHierarchical summarization paths

Industry Region Year

Category Country Quarter

Product City Month Week

Office Day

Page 30: DW e Modelo multidimensional (baseado nos slides do livro: Data Mining: C & T)

SAD Tagus 2004/05 H. Galhardas

A Star-Net Query A Star-Net Query ModelModel

Shipping Method

AIR-EXPRESS

TRUCKORDER

Customer Orders

CONTRACTS

Customer

Product

PRODUCT GROUP

PRODUCT LINE

PRODUCT ITEM

SALES PERSON

DISTRICT

DIVISION

OrganizationPromotion

CITY

COUNTRY

REGION

Location

DAILYQTRLYANNUALYTime

Each circle is called a footprint

Page 31: DW e Modelo multidimensional (baseado nos slides do livro: Data Mining: C & T)

SAD Tagus 2004/05 H. Galhardas

What is a data warehouse?What is a data warehouse?

A multi-dimensional data modelA multi-dimensional data model

OLAP operationsOLAP operations

OutlineOutline

Page 32: DW e Modelo multidimensional (baseado nos slides do livro: Data Mining: C & T)

SAD Tagus 2004/05 H. Galhardas

OLAP queries (1)OLAP queries (1)

Pivoting:Pivoting: Aggregation on selected Aggregation on selected dimensions.dimensions.E.g., Pivoting on Location and

Time yields this cross-tabulation:

63 81 144

38 107 145

75 35 110

WI CA Total

1995

1996

1997

176 223 339Total

price

category

pname

pid country

statecitylocid

sales

locidtimeid

pid

holiday_flag

weekdate

timeid month

quarter

year

(Fact table)SALES

TIMES

PRODUCTSLOCATIONS

Page 33: DW e Modelo multidimensional (baseado nos slides do livro: Data Mining: C & T)

SAD Tagus 2004/05 H. Galhardas

OLAP Queries (2)OLAP Queries (2)

Roll-up:Roll-up: Aggregating at different levels of Aggregating at different levels of a dimension hierarchy. a dimension hierarchy. Given total sales by city, we can roll-up to get

sales by state.

Page 34: DW e Modelo multidimensional (baseado nos slides do livro: Data Mining: C & T)

SAD Tagus 2004/05 H. Galhardas

SQL Group By operatorSQL Group By operator

Grouping Values

Partitioned Table

Sum()

Aggregate Values

The GROUP BY relational operator partitions a table into groups. Each group is then aggregated by a function. The aggregation function summarizes some column of groups returning a value for each group.

Page 35: DW e Modelo multidimensional (baseado nos slides do livro: Data Mining: C & T)

SAD Tagus 2004/05 H. Galhardas

SQL to express OLAP SQL to express OLAP queries queries

The cross-tabulation can be computed using a The cross-tabulation can be computed using a collection of SQL queries, e.g.:collection of SQL queries, e.g.:

SELECT SUM(S.sales)FROM Sales S, Times T, Locations LWHERE S.timeid=T.timeid AND S.timeid=L.timeidGROUP BY T.year, L.state

SELECT SUM(S.sales)FROM Sales S, Times TWHERE S.timeid=T.timeidGROUP BY T.year

SELECT SUM(S.sales)FROM Sales S, Location LWHERE S.timeid=L.timeidGROUP BY L.state

Page 36: DW e Modelo multidimensional (baseado nos slides do livro: Data Mining: C & T)

SAD Tagus 2004/05 H. Galhardas

Problems with Group ByProblems with Group By

HistogramsHistogramsEx: SELECT day, nation, MAX(Temp)FROM WeatherGROUP BY Day(Time) AS day, Nation(Latitude, Longitude) AS nation;

Roll-up totals and sub-totals for drill-downsRoll-up totals and sub-totals for drill-downs Cross-tabulationsCross-tabulations

Page 37: DW e Modelo multidimensional (baseado nos slides do livro: Data Mining: C & T)

SAD Tagus 2004/05 H. Galhardas

Roll-up totals and sub-Roll-up totals and sub-totals for drill-downs totals for drill-downs

SALES

Model Year Color Sales

Chevy 1990 red 5 Chevy 1990 white 87 Chevy 1990 blue 62 Chevy 1991 red 54 Chevy 1991 white 95 Chevy 1991 blue 49 Chevy 1992 red 31 Chevy 1992 white 54 Chevy 1992 blue 71 Ford 1990 red 64 Ford 1990 white 62 Ford 1990 blue 63 Ford 1991 red 52 Ford 1991 white 9 Ford 1991 blue 55 Ford 1992 red 27 Ford 1992 white 62 Ford 1992 blue 39

Page 38: DW e Modelo multidimensional (baseado nos slides do livro: Data Mining: C & T)

SAD Tagus 2004/05 H. Galhardas

Sales Roll Up by Model by Year by Color

Sales Roll Up by Model by Year by Color

Model Year Color Sales by Modelby Year

by Color

Salesby Modelby Year

Salesby Model

Chevy

1994 black 50

white 40

90

1995 black 85

white 115

200

290

Page 39: DW e Modelo multidimensional (baseado nos slides do livro: Data Mining: C & T)

SAD Tagus 2004/05 H. Galhardas

Sales Roll-Up by Model by Year by Color – relational

representation

Model Year Color Sales Salesby Modelby Year

Salesby Model

Chevy

1994 black 50 90 290

Chevy 1994 white 40 90 290

Chevy 1995 black 85 200 290

Chevy 1995 white 115 200 290

Page 40: DW e Modelo multidimensional (baseado nos slides do livro: Data Mining: C & T)

SAD Tagus 2004/05 H. Galhardas

An Excel pivot table representation with Ford

sales data included

Sum Year Color

Sales 1994 1994 Total

1995 1995 Total

Grand Total

Model black white black white

Chevy 50 40 90 85 115 200 290

Ford 50 10 60 85 75 160 220

Grand Total 100 50 150 170 190 360 510

Page 41: DW e Modelo multidimensional (baseado nos slides do livro: Data Mining: C & T)

SAD Tagus 2004/05 H. Galhardas

Sales Summary – relational representation of the pivot table

Model Year Color Units

Chevy 1994 black 50

Chevy 1994 white 40

Chevy 1994 ALL 90

Chevy 1995 black 85

Chevy 1995 white 115

Chevy 1995 ALL 200

Chevy ALL ALL 290

Page 42: DW e Modelo multidimensional (baseado nos slides do livro: Data Mining: C & T)

SAD Tagus 2004/05 H. Galhardas

SQL code SQL code SELECT ‘ALL’, ‘ALL’, ‘ALL’, SUM(Sales)FROM SalesWHERE Model = 'Chevy'UNION SELECT Model, ‘ALL’, ‘ALL’, SUM(Sales)FROM SalesWHERE Model = 'Chevy'GROUP BY ModelUNION SELECT Model, Year, ‘ALL’, SUM(Sales)FROM SalesWHERE Model = 'Chevy'GROUP BY Model, YearUNIONSELECT Model, Year, Color, SUM(Sales)FROM SalesWHERE Model = 'Chevy'GROUP BY Model, Year, Color;

Page 43: DW e Modelo multidimensional (baseado nos slides do livro: Data Mining: C & T)

SAD Tagus 2004/05 H. Galhardas

Cross tab – symmetric 2D aggregation

Table 6.a: Chevy Sales Cross Tab

Chevy 1994 1995 total (ALL)

black 50 85 135

white 40 115 155

total (ALL) 90 200 290

Page 44: DW e Modelo multidimensional (baseado nos slides do livro: Data Mining: C & T)

SAD Tagus 2004/05 H. Galhardas

Problems with SQL Problems with SQL representationrepresentation

Expressing roll-ups and cross-tabs queries Expressing roll-ups and cross-tabs queries with SQL is dauntingwith SQL is daunting

Too complex to analyze for optimizationToo complex to analyze for optimization

Propose an extension of the relational Propose an extension of the relational group by – the group by – the cube operatorcube operator

Page 45: DW e Modelo multidimensional (baseado nos slides do livro: Data Mining: C & T)

SAD Tagus 2004/05 H. Galhardas

Cube OperatorCube Operator

REDWHITEBLUE

By Color

By Make & Color

By Make & Year

By Color & Year

By MakeBy Year

Sum

The Data Cube and The Sub-Space AggregatesSum

REDWHITEBLUE

Chevy Ford

By Make

By ColorCross Tab

REDWHITEBLUE

By Color

Sum

Group By (with total)

Sum

Aggregate

Page 46: DW e Modelo multidimensional (baseado nos slides do livro: Data Mining: C & T)

SAD Tagus 2004/05 H. Galhardas

Sales Sales SALES Model Year Color Sales

Chevy 1990 red 5 Chevy 1990 white 87 Chevy 1990 blue 62 Chevy 1991 red 54 Chevy 1991 white 95 Chevy 1991 blue 49 Chevy 1992 red 31 Chevy 1992 white 54 Chevy 1992 blue 71 Ford 1990 red 64 Ford 1990 white 62 Ford 1990 blue 63 Ford 1991 red 52 Ford 1991 white 9 Ford 1991 blue 55 Ford 1992 red 27 Ford 1992 white 62 Ford 1992 blue 39

Page 47: DW e Modelo multidimensional (baseado nos slides do livro: Data Mining: C & T)

SAD Tagus 2004/05 H. Galhardas

Data cube for salesData cube for sales

Chevy 1990 blue 62

Chevy 1990 red 5 Chevy 1990 white 95

Chevy 1990 ALL 154 Chevy 1991 blue 49 Chevy 1991 red 54 Chevy 1991 white 95 Chevy 1991 ALL 198 Chevy 1992 blue 71 Chevy 1992 red 31 Chevy 1992 white 54 Chevy 1992 ALL 156 Chevy ALL blue 182

Model Year Color Sales

...

Page 48: DW e Modelo multidimensional (baseado nos slides do livro: Data Mining: C & T)

SAD Tagus 2004/05 H. Galhardas

Semantics of the cube Semantics of the cube operatoroperator

Aggregates over all the <select list > attributes in the Aggregates over all the <select list > attributes in the Group By clause Group By clause

Unions in each super-aggregate of the global cube, Unions in each super-aggregate of the global cube, substituting ALL for the agg. columnssubstituting ALL for the agg. columns

Ex for Sales:Ex for Sales:select model, year, color, sum(sales) select model, year, color, sum(sales) as salesas salesfrom salesfrom saleswhere model in (‘Ford’,’Chevy’)where model in (‘Ford’,’Chevy’)

and year between 1990 and 1992and year between 1990 and 1992group by cube model, year, colorgroup by cube model, year, color

Page 49: DW e Modelo multidimensional (baseado nos slides do livro: Data Mining: C & T)

SAD Tagus 2004/05 H. Galhardas

SQL to express OLAP SQL to express OLAP queries queries

The cross-tabulation can be computed using a The cross-tabulation can be computed using a collection of SQL queries, e.g.:collection of SQL queries, e.g.:

SELECT SUM(S.sales)FROM Sales S, Times T, Locations LWHERE S.timeid=T.timeid AND S.timeid=L.timeidGROUP BY T.year, L.state

SELECT SUM(S.sales)FROM Sales S, Times TWHERE S.timeid=T.timeidGROUP BY T.year

SELECT SUM(S.sales)FROM Sales S, Location LWHERE S.timeid=L.timeidGROUP BY L.state

Page 50: DW e Modelo multidimensional (baseado nos slides do livro: Data Mining: C & T)

SAD Tagus 2004/05 H. Galhardas

The CUBE Operator The CUBE Operator (SQL:1999)(SQL:1999)

Generalizing the previous example, if there Generalizing the previous example, if there are k dimensions, we have 2^k possible SQL are k dimensions, we have 2^k possible SQL GROUP BY GROUP BY queries on a subset of dimensions.queries on a subset of dimensions.

CUBE pid, locid, timeid BY SUM SalesCUBE pid, locid, timeid BY SUM SalesEquivalent to rolling up Sales on all eight subsets

of the set {pid, locid, timeid}; each roll-up corresponds to an SQL query of the form:

SELECT SUM(S.sales)FROM Sales SGROUP BY grouping-list

Lots of work on optimizing the CUBE operator!

Page 51: DW e Modelo multidimensional (baseado nos slides do livro: Data Mining: C & T)

SAD Tagus 2004/05 H. Galhardas

A Sample Data CubeA Sample Data CubeTotal annual salesof TV in U.S.A.Date

Produ

ct

Cou

ntr

ysum

sum TV

VCRPC

1Qtr 2Qtr 3Qtr 4Qtr

U.S.A

Canada

Mexico

sum

Page 52: DW e Modelo multidimensional (baseado nos slides do livro: Data Mining: C & T)

SAD Tagus 2004/05 H. Galhardas

Cuboids Corresponding Cuboids Corresponding to the Cubeto the Cube

all

product date country

product,date product,country date, country

product, date, country

0-D(apex) cuboid

1-D cuboids

2-D cuboids

3-D(base) cuboid

Page 53: DW e Modelo multidimensional (baseado nos slides do livro: Data Mining: C & T)

SAD Tagus 2004/05 H. Galhardas

Typical OLAP OperationsTypical OLAP Operations Roll up (drill-up):Roll up (drill-up): summarize data by climbing up hierarchy (or summarize data by climbing up hierarchy (or

by dimension reduction)by dimension reduction) Drill down (roll down):Drill down (roll down): reverse of roll-up; from higher level reverse of roll-up; from higher level

summary to lower level summary or detailed data, (or summary to lower level summary or detailed data, (or introducing new dimensions)introducing new dimensions)

Slice and dice:Slice and dice: slice performs a selection on one dimension of slice performs a selection on one dimension of the cube; dice defines a subcube by performing a selection on the cube; dice defines a subcube by performing a selection on two or more dimensionstwo or more dimensions

Pivot (rotate):Pivot (rotate): reorient the cube, visualization, 3D to series of reorient the cube, visualization, 3D to series of 2D planes2D planes

Drill across:Drill across: involving (across) more than one fact table involving (across) more than one fact table Drill through:Drill through: through the bottom level of the cube to its back- through the bottom level of the cube to its back-

end relational tables (using SQL)end relational tables (using SQL)

Page 54: DW e Modelo multidimensional (baseado nos slides do livro: Data Mining: C & T)

SAD Tagus 2004/05 H. Galhardas

Roll-up vs Cube operatorRoll-up vs Cube operator

Fact tableFact table Sales (Market_id, Product_Id, Time_Id, Sales_Amt)

Dimension TablesDimension Tables Market (Market_Id, City, State, Region)

Product (Product_Id, Name, Category, Price)

Time (Time_Id, Week, Month, Quarter)

Page 55: DW e Modelo multidimensional (baseado nos slides do livro: Data Mining: C & T)

SAD Tagus 2004/05 H. Galhardas

The fact and dimension relations can be The fact and dimension relations can be displayed in an E-R diagram, which displayed in an E-R diagram, which suggests a star and is called a suggests a star and is called a star star schemaschema

Star SchemaStar Schema

Page 56: DW e Modelo multidimensional (baseado nos slides do livro: Data Mining: C & T)

SAD Tagus 2004/05 H. Galhardas

AggregationAggregation

Many OLAP queries involve Many OLAP queries involve aggregationaggregation of the data of the data in the fact tablein the fact table

For example, to find the total sales (over time) of each For example, to find the total sales (over time) of each product in each market, we might useproduct in each market, we might use SELECT S.Market_Id, S.Product_Id, SUM (S.Sales_Amt) FROM Sales S GROUP BY S.Market_Id, S.Product_Id

The aggregation is over the entire time dimension and The aggregation is over the entire time dimension and thus produces a two-dimensional view of the datathus produces a two-dimensional view of the data

Page 57: DW e Modelo multidimensional (baseado nos slides do livro: Data Mining: C & T)

SAD Tagus 2004/05 H. Galhardas

Aggregation over TimeAggregation over Time The output of the previous queryThe output of the previous query

SUM(Sales_Amt)SUM(Sales_Amt) M1M1 M2M2 M3M3 M4M4

P1P1

P2P2

P3P3

P4P4

P5P5

Market_Id

Product_Id

Page 58: DW e Modelo multidimensional (baseado nos slides do livro: Data Mining: C & T)

SAD Tagus 2004/05 H. Galhardas

Drilling Down and Drilling Down and Rolling UpRolling Up

Some dimension tables represent an Some dimension tables represent an aggregation aggregation hierarchyhierarchy Market_Id City State Region

When we execute a series of queries that moves When we execute a series of queries that moves down a hierarchy (down a hierarchy (e.g., e.g., from aggregation over from aggregation over regions to aggregation over states) we are said to regions to aggregation over states) we are said to be be drilling down.drilling down. Requires use of the fact table or information more

specific than the requested aggregation (e.g., cities) When we move up the hierarchy – from states to When we move up the hierarchy – from states to

regions – we are said to be regions – we are said to be rolling uprolling up Agregates can be calculated from a prior query

Page 59: DW e Modelo multidimensional (baseado nos slides do livro: Data Mining: C & T)

SAD Tagus 2004/05 H. Galhardas

Drilling DownDrilling Down Drilling down on regionsDrilling down on regions

SELECT S.Product_Id, M.Region, SUM (S.Sales_Amt) FROM Sales S, Market M WHERE M.Market_Id = S.Market_Id GROUP BY M.Region, S.Product_Id

SELECT S.Product_Id, M.State, SUM (S.Sales_Amt) FROM Sales S, Market M WHERE M.Market_Id = S.Market_Id GROUP BY M.State, S.Product_Id

Sales (Market_Id, Product_Id, Time_Id, Sales_Amt)

Market (Market_Id, City, State, Region)

Page 60: DW e Modelo multidimensional (baseado nos slides do livro: Data Mining: C & T)

SAD Tagus 2004/05 H. Galhardas

Rolling UpRolling Up Rolling up on regionsRolling up on regions

If we have already created a table, State_Sales, containing the result of

SELECT S.Product_Id, M.State, SUM (S.Sales_Amt) FROM Sales S, Market M WHERE M.Market_Id = S.Market_Id GROUP BY M.State, S.Product_Id

then we can roll up from there to:

SELECT T.Product_Id, M.Region, SUM (T.Sales_Amt)FROM State_Sales T, (SELECT DISTINCT M.Region, M.State

FROM Market M) AS R (Region, State)WHERE R.State = T.StateGROUP BY R.Region, T.Product_Id

Page 61: DW e Modelo multidimensional (baseado nos slides do livro: Data Mining: C & T)

SAD Tagus 2004/05 H. Galhardas

PivotingPivoting When we view the data as a multi-dimensional When we view the data as a multi-dimensional

cube and group on a subset of the axes, we are cube and group on a subset of the axes, we are said to be performing a pivot on those axessaid to be performing a pivot on those axes Pivoting uses GROUP BY; aggregation is used on the

remaining attributes Example: Pivot on the product and time and aggregate

on the Market_Id

Sales (Market_id, Product_Id, Time_Id, Sales_Amt)

Time (Time_Id, Week, Month, Quarter)Time (Time_Id, Week, Month, Quarter) SELECT S.Product_Id, T.Quarter, SUM (Sales_Amt) FROM Sales S, Time T WHERE T.Time_Id = S.Time_Id GROUP BY T.Quarter, S.Product_Id

Page 62: DW e Modelo multidimensional (baseado nos slides do livro: Data Mining: C & T)

SAD Tagus 2004/05 H. Galhardas

ROLLUPROLLUP

ROLLUPROLLUP is similar to is similar to CUBECUBE except that except that instead of aggregating all subsets of the instead of aggregating all subsets of the arguments, it creates subsets moving from arguments, it creates subsets moving from right to leftright to leftROLLUP is also in SQL:1999

Page 63: DW e Modelo multidimensional (baseado nos slides do livro: Data Mining: C & T)

SAD Tagus 2004/05 H. Galhardas

Example of Example of ROLLUPROLLUP OperatorOperator

SELECT S.Market_Id, S.Product_Id, SUM (S.Sales_Amt)

FROM Sales SGROUP BY ROLLUP (S.Market_Id, S. Product_Id)

- first aggregates with the finest granularityGROUP BY S.Market_Id, S.Product_Id

- then with the next level of granularityGROUP BY S.Market_Id

- then the grand total is computed with the empty GROUP BY clause GROUP BY

Page 64: DW e Modelo multidimensional (baseado nos slides do livro: Data Mining: C & T)

SAD Tagus 2004/05 H. Galhardas

ROLLUPROLLUP Vs. Vs. CUBECUBE

By contrast, the same query with By contrast, the same query with CUBECUBE - first aggregates with the finest granularity

GROUP BY S.Market_Id, S.Product_Id

- then with the next level of granularity (both subsets)

GROUP BY S.Market_IdGROUP BY S.Product_Id

- then the grand total with

GROUP BY

Page 65: DW e Modelo multidimensional (baseado nos slides do livro: Data Mining: C & T)

SAD Tagus 2004/05 H. Galhardas

BibliografiaBibliografia

(Livro) (Livro) Data Mining: Concepts and TechniquesData Mining: Concepts and Techniques, J. Han & M. , J. Han & M. Kamber, Morgan Kaufmann, 2001 (Secções 2.1 e 2.2)Kamber, Morgan Kaufmann, 2001 (Secções 2.1 e 2.2)

(Livro) Database Management Systems, R. Ramakrishnan & (Livro) Database Management Systems, R. Ramakrishnan & J. Gehrke, 3rd Ed. (Cap. 25)J. Gehrke, 3rd Ed. (Cap. 25)

(Artigo) (Artigo) An Overview of Data Warehousing and OLAP An Overview of Data Warehousing and OLAP TechnologyTechnology, S. Chaudhuri & U. Dayal, SIGMOD Record, , S. Chaudhuri & U. Dayal, SIGMOD Record, March 1997March 1997

(Artigo) (Artigo) Data Cube: A Relational Aggregation Operator Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-TotalsGeneralizing Group-By, Cross-Tab, and Sub-Totals, J. Gray , J. Gray et al, Tech. Report MSR-TR-97-32, 1997et al, Tech. Report MSR-TR-97-32, 1997