CUBE: A Relational Aggregate Operator Generalizing Group By, by Ata İsmet Özçelik


Upload: jack-cameron

Post on 11-Jan-2016


Page 1

CUBE: A Relational Aggregate Operator Generalizing Group By

[Figure: The Data Cube and The Sub-Space Aggregates. Panels: Aggregate (Sum); Group By (with total), by Color (RED, WHITE, BLUE); Cross Tab, by Make (Chevy, Ford) and by Color; the Data Cube over Make (CHEVY, FORD), Year (1990 to 1993), and Color, with sub-space aggregates By Make, By Year, By Color, By Make & Color, By Make & Year, By Color & Year, and Sum.]

By Ata İsmet Özçelik

Page 2

The Data Analysis Cycle

• User extracts data from the database with a query

• Then visualizes and analyzes the data with desktop tools

[Figure: the extract, visualize, analyze cycle, going from a database table to a spreadsheet and charts. Example charts: "Size vs Speed" (Size (B) against Access Time (seconds)) and "Price vs Speed" ($/MB against Access Time (seconds)), each plotting Cache, Main, Secondary Disc, Online Tape, Nearline Tape, and Offline Tape.]

Page 3

N-Dimensional data

• What exactly is N-dimensional data?

– A relation with N attribute domains.

– Each dimension in the main table could have its own domain table.

• Why is just this not enough?

– We need aggregation of various kinds to make the data representation humanly readable.

Page 4

Relational Representation of a 3-D Data Model

[Figure: the Sales Fact Table with columns model_key, year_key, color_key, and sales (the measure); the keys point to dimension tables such as Year and Color.]

Page 5

Aggregate Functions

• Aggregation functions:

– SQL standard: SUM(), COUNT(), MIN(), MAX(), and AVG().

– Many systems provide their own custom aggregate functions, and some even let users define custom functions.

• The basic idea is: combine all values in a column into a single scalar value.

[Figure: SUM() reducing a column to a single value.]

Page 6

Relational Group By Operator

• Group By allows aggregates over table sub-groups

• Result is a new table

• Syntax:
select location, sum(units)
from inventory
where nation = 'USA'
group by location;

[Figure: a table partitioned by its grouping values; Sum() over each partition yields the aggregate values.]
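The partition-then-aggregate idea on this slide can be sketched in a few lines of Python. The `inventory` rows and the `group_by_sum` helper are made up for illustration; only the column names (location, nation, units) come from the slide.

```python
from collections import defaultdict

# Hypothetical inventory rows: (location, nation, units).
inventory = [
    ("Seattle", "USA", 10),
    ("Seattle", "USA", 5),
    ("Boston", "USA", 7),
    ("Paris", "France", 4),
]

def group_by_sum(rows):
    """Partition the USA rows by location, then sum units per partition."""
    totals = defaultdict(int)
    for location, nation, units in rows:
        if nation == "USA":           # keep only USA rows
            totals[location] += units  # one running sum per grouping value
    return dict(totals)

result = group_by_sum(inventory)
print(result)
```

Each distinct location becomes one row of the new table, which is the "result is a new table" point above.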

Page 7

Problems with GROUP BY

• Histograms

– In standard SQL, histograms are computed indirectly from a table-valued expression, which is then aggregated.

• Roll-up totals and sub-totals for drill-downs.

– Reports commonly aggregate data at a coarse level, and then at successively finer levels.

• Roll-up: going up levels.

• Drill-down: going down levels.

• Cross-tabulation (cross-tab for short).

– Symmetric aggregation table.

• The problem, hence, is a 2^N-way union for every roll-up or cross-tab when using GROUP BY.
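To make the size of that union concrete, the sketch below lists every GROUP BY that a symmetric aggregation over three columns would need. The helper name `grouping_sets` is an assumption for illustration; the column names are the ones used later in the deck.

```python
from itertools import combinations

def grouping_sets(columns):
    """Return every subset of the grouping columns, from all of them to none."""
    sets = []
    for k in range(len(columns), -1, -1):
        sets.extend(combinations(columns, k))
    return sets

sets_3d = grouping_sets(["Model", "Year", "Color"])
for s in sets_3d:
    print("GROUP BY", ", ".join(s) if s else "(nothing: grand total)")
```

With N columns there are 2^N subsets, so three columns already mean eight separate GROUP BY queries to union together.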

Page 8

An example approach

• Not relational

• Not convenient

Page 9

‘ALL’

• Dummy value to fill all the super-aggregation items.

• Is actually a set representing all the values that are present for the corresponding dimension.

• There are two ways of dealing with it.

– Define a new keyword ALL in SQL

• ALL() function is defined to enumerate the set that ALL represents.

• ALL [NOT] ALLOWED is added to column definition syntax

• Set interpretation guides relational operators {=, IN} for ALL

– Avoiding the ALL keyword.

• NULL is used instead of ALL.

• GROUPING() function to discriminate between ALL and NULL
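The second option can be sketched as follows: the super-aggregate row carries NULL (Python `None`) in the color column, and a separate flag plays the role of GROUPING() so that it can be told apart from a row whose color is genuinely unknown. The data and the function name `by_color_with_total` are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows: (color, sales). One row has a genuinely unknown color.
rows = [("red", 5), ("blue", 3), (None, 2)]

def by_color_with_total(rows):
    """Return (color, grouping_flag, sum) tuples; flag 1 marks the ALL row."""
    sums = defaultdict(int)
    for color, sales in rows:
        sums[color] += sales
    out = [(color, 0, total) for color, total in sums.items()]
    # Super-aggregate row: color is NULL (None) here but stands for ALL.
    out.append((None, 1, sum(sales for _, sales in rows)))
    return out

result = by_color_with_total(rows)
for row in result:
    print(row)
```

Both `(None, 0, 2)` and `(None, 1, 10)` appear in the result; only the flag says which one is the total.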

Page 10

This is a simple 3-dimensional roll-up. Aggregating over N dimensions requires N such unions.

3D Roll-Up
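A roll-up of this kind can be imitated by unioning one GROUP BY per prefix of the dimension list. The sketch below assumes a tiny hypothetical sales table and uses the string "ALL" as the placeholder for rolled-up columns; `rollup` is an illustrative helper, not an existing API.

```python
from collections import defaultdict

ALL = "ALL"  # stand-in for the super-aggregate marker

# Hypothetical sales rows: (model, year, color, sales).
sales = [
    ("Chevy", 1994, "black", 50),
    ("Chevy", 1994, "white", 40),
    ("Chevy", 1995, "black", 85),
    ("Chevy", 1995, "white", 115),
]

def rollup(rows, n_dims):
    """Union of GROUP BYs over each prefix of the dimensions, longest first."""
    out = {}
    for keep in range(n_dims, -1, -1):
        level = defaultdict(int)
        for row in rows:
            # Keep the first `keep` dimensions, replace the rest with ALL.
            key = row[:keep] + (ALL,) * (n_dims - keep)
            level[key] += row[n_dims]
        out.update(level)
    return out

result = rollup(sales, 3)
for key, total in result.items():
    print(key, total)
```

Three dimensions give four grouping levels (the detail rows plus three unions), matching the count stated above.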

Page 11

Cross Tabs

• The result of a symmetric aggregation is a table called a cross-tabulation.

Page 12

Data Cube Relational Operator

[Figure: The Data Cube and The Sub-Space Aggregates, showing the progression from Aggregate (Sum), to Group By (with total), to Cross Tab, to the Data Cube with its By Make, By Year, By Color, By Make & Color, By Make & Year, and By Color & Year aggregates.]

Page 13

N-dimensional Cube: Each Attribute is a Dimension

• N-dimensional Aggregate (sum(), max(),...)

– fits relational model exactly:

• a1, a2, ...., aN, f()

• Super-aggregate over N-1 Dimensional sub-cubes

• ALL, a2, ...., aN , f()

• a1 , ALL, a3, ...., aN , f()

• ...

• a1, a2, ...., ALL, f()

– this is the N-1 Dimensional cross-tab.

• Super-aggregate over N-2 Dimensional sub-cubes

• ALL, ALL, a3, ...., aN , f()

• ...

• a1, a2 ,...., ALL, ALL, f()

Page 14

CUBE Operator

• Syntax:
SELECT Model, Year, Color, SUM(sales) AS Sales
FROM Sales
WHERE Model IN ('Ford', 'Chevy')
AND Year BETWEEN 1990 AND 1992
GROUP BY CUBE (Model, Year, Color)

• Semantics:

Page 15

CUBE

Result of a Cube Operator
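The shape of a CUBE result can be imitated with a brute-force sketch: every input row adds its measure to each cell obtained by replacing any subset of its dimensions with ALL. The rows, the "ALL" string, and the `cube` helper are assumptions for illustration, and a database engine would not necessarily compute it this way.

```python
from collections import defaultdict
from itertools import product

ALL = "ALL"

# Hypothetical rows: (model, year, color, sales).
sales = [
    ("Chevy", 1990, "red", 5),
    ("Chevy", 1990, "blue", 62),
    ("Ford", 1990, "red", 64),
    ("Ford", 1991, "blue", 9),
]

def cube(rows, n_dims):
    """Sum the measure into every cell reachable by replacing dims with ALL."""
    cells = defaultdict(int)
    for row in rows:
        dims, measure = row[:n_dims], row[n_dims]
        # Each row contributes to 2**n_dims cells: keep or generalize each dim.
        for mask in product((False, True), repeat=n_dims):
            key = tuple(ALL if hide else d for d, hide in zip(dims, mask))
            cells[key] += measure
    return dict(cells)

result = cube(sales, 3)
print(result[(ALL, ALL, ALL)])      # grand total
print(result[("Chevy", ALL, ALL)])  # all Chevy sales
```

The cell `(ALL, ALL, ALL)` is the grand total, and cells with a single ALL form the two-dimensional cross-tabs.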

Page 16

ROLL UP Operator

• Syntax:
SELECT Manufacturer, Year, Color, Model, SUM(price) AS Revenue
FROM Weather
GROUP BY Manufacturer
ROLLUP Year(Time) AS Year,
Month(Time) AS Month,
Day(Time) AS Day

• Semantics:

[Figure: roll-up over Manufacturer, Year, Mo, Day, with Model x Color cubes.]

Page 17

Snowflake Schema

[Figure: the fact table columns are Product, Seller, Buyer, Units, Price, Office, and Date. Dimension levels shown include Unit, Group, Division; District, Region, Geography; Week, Month, Quarter, Year; Channel, Discount; and Cust Type, with ALL at the top of the hierarchies.]

A snowflake schema showing the core fact table and some of the many aggregation granularities of the core dimensions.

Page 18

Addressing Data Cube

• SQL3 defines a Turing-complete procedural programming language.

SELECT Year, Color, Model, SUM(sales) AS total,
SUM(Sales) / total(ALL, ALL, ALL)
FROM Sales
WHERE Model IN {'Ford', 'Chevy'}
AND Year BETWEEN 1990 AND 1992
GROUP BY CUBE Model, Year, Color
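The expression total(ALL, ALL, ALL) addresses one cell of the cube, here the grand total, so each cell can be shown as a share of it. A minimal sketch, assuming hypothetical rows and a dictionary keyed by cell coordinates; the `total` function below only mimics the addressing notation on the slide.

```python
from collections import defaultdict
from itertools import product

ALL = "ALL"

# Hypothetical rows: (model, year, color, sales).
sales = [
    ("Chevy", 1990, "red", 30),
    ("Ford", 1990, "red", 50),
    ("Ford", 1991, "blue", 20),
]

# Build the cube cells by brute force (every keep-or-ALL combination).
cells = defaultdict(int)
for *dims, amount in sales:
    for mask in product((False, True), repeat=3):
        cells[tuple(ALL if hide else d for d, hide in zip(dims, mask))] += amount

def total(*address):
    """Look up one cube cell by its coordinates, e.g. total(ALL, ALL, ALL)."""
    return cells[tuple(address)]

grand = total(ALL, ALL, ALL)
share = {key: value / grand for key, value in cells.items()}
print(share[("Ford", ALL, ALL)])
```

Addressing cells by coordinates is what lets a query mix a detail value and a super-aggregate in one expression.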

Page 19

Computing Data Cubes

• If each attribute has N_i values, CUBE has Π (N_i + 1) values

• Compute N-D cube with hash if fits in RAM

• Compute N-D cube with sort if overflows RAM

• Same comments apply to subcubes:

– compute (N-1)-D subcube from N-D cube.

– Aggregate on “biggest” domain first when >1 deep

– Aggregate functions need hidden variables:

• e.g. average needs sum and count.

• Use standard techniques from query processing

– arrays, hashing, hybrid hashing

– fall back on sorting.
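The size estimate at the top of this slide is easy to check numerically: each dimension contributes its own values plus one extra value, ALL. The cardinalities below are an assumed example, and `cube_size` is an illustrative helper.

```python
from math import prod

def cube_size(cardinalities):
    """Upper bound on cube cells: each dimension gains one extra value, ALL."""
    return prod(n + 1 for n in cardinalities)

# Assumed example: 2 models, 3 years, 3 colors.
size = cube_size([2, 3, 3])
print(size)  # 3 * 4 * 4 = 48
```

This is an upper bound for a dense cube; sparse data fills fewer cells. It can also guide the choice between the hash and sort strategies, depending on whether that many cells fit in RAM.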

Page 20

Computing Data Cubes

• 2^N algorithm for cube computation.

– The simplest algorithm to compute the cube is to allocate a handle for each cube cell.

• Categorization of aggregation functions.

– Distributive

• A function is distributive if it can be calculated in the following distributed manner:

– Partition the data into n sets.

– Compute the aggregation function on each partition to get an aggregate value.

– Apply a function g() to the n aggregates to get a final aggregate.

– This aggregate is the same as it would have been if the whole data set had been aggregated at once.

• COUNT(), SUM(), MIN(), MAX().

• Can be calculated more efficiently than by the 2^N algorithm.
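The partition, aggregate, and combine steps for distributive functions can be sketched directly. The helper name `distributive` and the sample data are assumptions; the point to notice is that g() equals the function itself for SUM, MIN, and MAX, but g() is SUM for COUNT.

```python
def distributive(f, g, data, n):
    """Apply f to each of n partitions, then combine the partials with g."""
    parts = [data[i::n] for i in range(n)]
    return g([f(p) for p in parts if p])

data = [4, 8, 15, 16, 23, 42]

total = distributive(sum, sum, data, 3)  # SUM: g is SUM
count = distributive(len, sum, data, 3)  # COUNT: g is SUM, not COUNT
low = distributive(min, min, data, 3)    # MIN: g is MIN
high = distributive(max, max, data, 3)   # MAX: g is MAX
print(total, count, low, high)
```

Because the partial results combine exactly, a super-aggregate can be built from finer aggregates instead of rescanning the base data.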

Page 21

Computing Data Cubes, continued

– Algebraic

• A function is algebraic if it can be calculated by an algebraic function of M arguments (M a bounded positive integer), each of which is the result of a distributive function.

• min_N(), max_N(), standard_deviation(), avg()

• Can also be calculated in a more efficient way.

– Holistic

• A function is holistic if there is no constant bound on the storage size needed to describe a subaggregate.

• rank(), median(), mode() (need the base data)

• The 2^N algorithm is the fastest known for exact results, but there are better algorithms for approximate results.
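The difference between the two categories shows up in a small sketch: AVG can be rebuilt from a fixed-size (sum, count) scratchpad per partition, while the median of partition medians generally does not equal the true median. The partitions and helper names are made up for illustration.

```python
from statistics import median

def avg_partial(values):
    """Scratchpad for AVG: a fixed-size (sum, count) pair."""
    return (sum(values), len(values))

def avg_merge(partials):
    """Combine the scratchpads, then finish with one division."""
    s = sum(p[0] for p in partials)
    c = sum(p[1] for p in partials)
    return s / c

parts = [[1, 2, 3], [10], [20, 30]]
flat = [v for p in parts for v in p]

avg = avg_merge([avg_partial(p) for p in parts])

# Median is holistic: combining per-partition medians is not reliable.
median_of_medians = median(median(p) for p in parts)
true_median = median(flat)
print(avg, median_of_medians, true_median)
```

Here the average comes out right from two numbers per partition, whereas the combined medians give 10 against a true median of 6.5.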

Page 22

Example

Compute 2D core of 2 x 3 cube

Then compute 1D edges

Then compute 0D points

Works for algebraic and distributive functions

Saves "lots" of calls
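The core, edges, points order can be sketched for SUM on a 2 x 3 cube. The cell values are hypothetical, and the dictionaries are just one convenient representation.

```python
# Hypothetical 2 x 3 core: SUM(sales) by (make, color), already aggregated.
core = {
    ("Chevy", "red"): 5, ("Chevy", "white"): 87, ("Chevy", "blue"): 62,
    ("Ford", "red"): 64, ("Ford", "white"): 62, ("Ford", "blue"): 63,
}

# 1D edges: fold the core along one dimension at a time.
by_make, by_color = {}, {}
for (make, color), s in core.items():
    by_make[make] = by_make.get(make, 0) + s
    by_color[color] = by_color.get(color, 0) + s

# 0D point: fold the smaller edge (2 values instead of 6 core cells).
grand_total = sum(by_make.values())
print(by_make, by_color, grand_total)
```

Each higher-level aggregate reads the level just below it rather than the base rows, which is where the saved calls come from.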

Page 23

Maintaining a Data Cube

– Up until now we have been discussing only SELECT statements.

– Now we have to accommodate INSERT, DELETE, and UPDATE.

– Example: the max() function

• Distributive for SELECT and INSERT, but holistic for DELETE

– If a function is algebraic for INSERT, UPDATE, and DELETE, it is easy to maintain the cube.

– If it is distributive, it is fairly inexpensive (using scratchpads).

– If it is holistic, it is expensive to maintain the cube.
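The max() example can be sketched with a single cube cell. The class `MaxCell` is hypothetical; its `base` list stands in for the base table, and `rescans` counts how often a delete forces the cell to be recomputed from it.

```python
class MaxCell:
    """One cube cell holding MAX over its base values (kept only for rescans)."""

    def __init__(self, values):
        self.base = list(values)  # stands in for the base table
        self.current = max(self.base)
        self.rescans = 0

    def insert(self, v):
        # Distributive for INSERT: the old max and the new value suffice.
        self.base.append(v)
        self.current = max(self.current, v)

    def delete(self, v):
        self.base.remove(v)
        # Holistic for DELETE: losing the max forces a rescan of base data.
        if v == self.current:
            self.current = max(self.base)
            self.rescans += 1

cell = MaxCell([3, 9, 4])
cell.insert(7)  # no rescan needed
cell.delete(4)  # not the max, no rescan
cell.delete(9)  # removes the max, rescan required
print(cell.current, cell.rescans)
```

An insert only compares against the stored maximum, but deleting the maximum leaves no way to find its replacement without going back to the base values.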

Page 24

Summary

• CUBE operator generalizes relational aggregates

• Needs ALL value to denote sub-cubes

– ALL values represent aggregation sets

• Needs generalization of user-defined aggregates

• Decorations and abstractions are interesting

• Computation has interesting optimizations

• Relationship to "rest of SQL" not fully worked out.