CUBE: A Relational Aggregate Operator Generalizing Group By, by Ata İsmet Özçelik

Post on 11-Jan-2016


1

CUBE: A Relational Aggregate Operator

Generalizing Group By

[Figure: The Data Cube and the Sub-Space Aggregates. Four panels: Aggregate (Sum); Group By with total (by Color: red, white, blue); Cross Tab (by Make: Chevy, Ford; by Color); and the Data Cube (by Make, by Year 1990-1993, by Color, by Make & Color, by Make & Year, by Color & Year).]

By Ata İsmet Özçelik

2

The Data Analysis Cycle

• User extracts data from the database with a query

• Then visualizes and analyzes the data with desktop tools

[Figure: the extract, visualize, analyze cycle, moving data from a table to a spreadsheet. Two storage-hierarchy plots accompany it: Size vs Speed (size in bytes against access time in seconds) and Price vs Speed ($/MB against access time in seconds), each covering cache, main memory, secondary storage (disc), online tape, nearline tape, and offline tape.]

3

N-Dimensional data

• What exactly is N-dimensional data?

– A relation with N attribute domains.

– Could have domain tables for each dimension in the main table.

• Why is just this not enough?

– We need aggregation of various kinds to make the data representation humanly readable.

4

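Relational Representation of a 3-D Data Model

[Figure: Sales Fact Table with columns model_key, year_key, color_key, and sales (the measure), linked to dimension tables; the labels Year and Color survive from the figure.]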

5

Aggregate Functions

• Aggregation functions:

– SQL standard: SUM(), COUNT(), MIN(), MAX(), and AVG().

– Many systems provide their own custom aggregate functions, and some even let users define their own.

• The basic idea is: combine all values in a column into a single scalar value.

[Figure: SUM() reducing a column to one value.]

6

Relational Group By Operator

• Group By allows aggregates over table sub-groups

• Result is a new table

• Syntax:

    select location, sum(units)
    from inventory
    where nation = 'USA'
    group by location;

[Figure: a table partitioned by its grouping values; Sum() reduces each partition to one aggregate value.]
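A minimal sketch of the GROUP BY behaviour described on this slide, run against an in-memory SQLite database. The inventory rows are invented for illustration, and the nation filter is applied to rows before grouping.

```python
import sqlite3

# Hypothetical inventory rows: (location, nation, units). Values are invented.
rows = [
    ("Seattle", "USA", 10),
    ("Seattle", "USA", 5),
    ("Boston", "USA", 7),
    ("Toronto", "Canada", 4),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (location TEXT, nation TEXT, units INTEGER)")
conn.executemany("INSERT INTO inventory VALUES (?, ?, ?)", rows)

# WHERE filters rows first; GROUP BY then partitions the remaining rows by
# location, and SUM() collapses each partition into one scalar.
result = conn.execute(
    "SELECT location, SUM(units) FROM inventory "
    "WHERE nation = 'USA' GROUP BY location ORDER BY location"
).fetchall()

print(result)
```

The result is itself a table, one row per grouping value, which is what lets GROUP BY output feed further relational operators.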

7

Problems with GROUP BY

• Histogram

– In standard SQL, histograms are computed indirectly from a table-valued expression, which is then aggregated.

• Roll-up totals and sub-totals for drill-downs.

– Reports commonly aggregate data at a coarse level, and then at successively finer levels.

– Roll-up: going up levels.

– Drill-down: going down levels.

• Cross-tabulation (cross-tab for short).

– Symmetric aggregation table.

• The problem hence is a large union of GROUP BY queries: N unions for an N-dimensional roll-up, and a 2^N-way union for an N-dimensional cross-tab.
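To make the cost concrete, here is a sketch of a 2-dimensional cross-tab built the GROUP BY way: 2**2 = 4 queries glued together with UNION ALL. The sales rows are invented, and the string 'ALL' is used only as a placeholder for a column that has been aggregated away.

```python
import sqlite3

# Invented sales rows: (model, color, sales).
rows = [
    ("Chevy", "red", 5), ("Chevy", "white", 3),
    ("Ford", "red", 2), ("Ford", "white", 4),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (model TEXT, color TEXT, sales INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

# One GROUP BY per subset of the grouping columns {model, color}.
query = """
SELECT model, color, SUM(sales) FROM sales GROUP BY model, color
UNION ALL
SELECT model, 'ALL', SUM(sales) FROM sales GROUP BY model
UNION ALL
SELECT 'ALL', color, SUM(sales) FROM sales GROUP BY color
UNION ALL
SELECT 'ALL', 'ALL', SUM(sales) FROM sales
"""
crosstab = {(model, color): total for model, color, total in conn.execute(query)}

for cell, total in sorted(crosstab.items()):
    print(cell, total)
```

Each extra dimension doubles the number of SELECT branches, and each branch rescans the base table, which is the inconvenience the slide points at.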

8

An example approach

• Not relational

• Not convenient

9

‘ALL’

• Dummy value to fill all the super-aggregation items.

• Is actually a set representing all the values that are present for the corresponding dimension.

• There are two ways of dealing with it.

– Define a new keyword ALL in SQL.

  • An ALL() function is defined to enumerate the set that ALL represents.

  • ALL [NOT] ALLOWED is added to the column definition syntax.

  • The set interpretation guides the relational operators {=, IN} for ALL.

– Avoid the ALL keyword.

  • NULL is used instead of ALL.

  • A GROUPING() function discriminates between ALL and NULL.

10

This is a simple 3-dimensional roll-up. Aggregating over N dimensions requires N such unions.

3D Roll-Up

11

Cross Tabs

• The symmetric aggregation result is a table called cross-tabulation.

12

Data Cube Relational Operator

[Figure: The Data Cube and the Sub-Space Aggregates, with panels Aggregate, Group By (with total), Cross Tab, and Data Cube.]

13

N-dimensional Cube: Each Attribute is a Dimension

• N-dimensional aggregate (sum(), max(), ...)

– fits the relational model exactly:

  • a1, a2, ..., aN, f()

• Super-aggregate over N-1 dimensional sub-cubes

  • ALL, a2, ..., aN, f()

  • a1, ALL, a3, ..., aN, f()

  • ...

  • a1, a2, ..., ALL, f()

– this is the N-1 dimensional cross-tab.

• Super-aggregate over N-2 dimensional sub-cubes

  • ALL, ALL, a3, ..., aN, f()

  • ...

  • a1, a2, ..., ALL, ALL, f()
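The enumeration above can be sketched in a few lines of Python. This is an illustration only, assuming SUM as the aggregate f() and a string sentinel for ALL; the function name `cube` and the sample rows are hypothetical.

```python
from collections import defaultdict
from itertools import product

ALL = "ALL"  # sentinel standing in for the ALL value


def cube(rows, n_dims):
    """Sum the last field of each row over every combination of ALL.

    rows: tuples holding n_dims dimension values followed by one measure.
    """
    totals = defaultdict(int)
    for row in rows:
        dims, measure = row[:n_dims], row[n_dims]
        # A row contributes to 2**n_dims cells: every way of replacing a
        # subset of its dimension values with ALL.
        for mask in product((False, True), repeat=n_dims):
            cell = tuple(ALL if hide else d for d, hide in zip(dims, mask))
            totals[cell] += measure
    return dict(totals)


# Invented rows: (model, year, color, sales).
sales = [
    ("Chevy", 1990, "red", 5),
    ("Chevy", 1990, "blue", 3),
    ("Ford", 1990, "red", 2),
    ("Ford", 1991, "red", 4),
]
result = cube(sales, 3)

print(result[(ALL, ALL, ALL)])      # grand total
print(result[("Chevy", ALL, ALL)])  # an N-2 dimensional super-aggregate
```

Cells with no ALL are the core aggregate, cells with one ALL are the N-1 dimensional sub-cubes, and so on down to the single (ALL, ALL, ALL) cell.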

14

CUBE Operator

• Syntax:

    SELECT Model, Year, Color, SUM(sales) AS Sales
    FROM Sales
    WHERE Model IN ('Ford', 'Chevy')
    AND Year BETWEEN 1990 AND 1992
    GROUP BY CUBE (Model, Year, Color)

• Semantics:

15

CUBE

Result of a Cube Operator

16

ROLL UP Operator

• Syntax:

    SELECT Manufacturer, Year, Color, Model, SUM(price) AS Revenue
    FROM Weather
    GROUP BY Manufacturer,
    ROLLUP Year(Time) AS Year,
    Month(Time) AS Month,
    Day(Time) AS Day

• Semantics:

[Figure: grouping by Manufacturer; roll-up over Year, Mo, Day; Model x Color cubes.]
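A small sketch of what a roll-up over an ordered list of dimensions produces, assuming SUM as the aggregate. Unlike a full cube, only trailing dimensions are replaced by ALL, so N dimensions give N+1 aggregation levels. The function name `rollup` and the rows are hypothetical.

```python
from collections import defaultdict

ALL = "ALL"  # sentinel standing in for the ALL value


def rollup(rows, n_dims):
    """Sum the last field of each row at every roll-up level."""
    totals = defaultdict(int)
    for row in rows:
        dims, measure = row[:n_dims], row[n_dims]
        # Keep the first k dimensions and replace the rest with ALL,
        # for k = n_dims (finest level) down to 0 (grand total).
        for k in range(n_dims, -1, -1):
            cell = dims[:k] + (ALL,) * (n_dims - k)
            totals[cell] += measure
    return dict(totals)


# Invented rows: (year, month, day, revenue).
revenue = [
    (1994, 1, 1, 10),
    (1994, 1, 2, 20),
    (1994, 2, 1, 5),
    (1995, 1, 1, 7),
]
levels = rollup(revenue, 3)

print(levels[(1994, 1, ALL)])      # month sub-total
print(levels[(1994, ALL, ALL)])    # year sub-total
print(levels[(ALL, ALL, ALL)])     # grand total
```

A cell such as (ALL, 1, ALL), "all Januaries", never appears, which is the difference between ROLLUP and CUBE.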

17

Snowflake Schema

[Figure: a snowflake schema showing the core fact table and some of the many aggregation granularities of the core dimensions. Fact table columns: Product, Seller, Buyer, Units, Price, Office, Date. Granularity labels, each hierarchy topped by ALL: Unit, Group, Division; Channel, Discount; District, Region, Geography; Week, Month, Quarter, Year; Cust Type.]

18

Addressing Data Cube

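• SQL3 defines a Turing-complete procedural programming language.

• A query can address a cell of the cube directly; here total(ALL, ALL, ALL) names the grand-total cell, giving each row's share of total sales:

    SELECT Year, Color, Model, SUM(sales) AS total,
    SUM(sales) / total(ALL, ALL, ALL)
    FROM Sales
    WHERE Model IN ('Ford', 'Chevy')
    AND Year BETWEEN 1990 AND 1992
    GROUP BY CUBE (Model, Year, Color)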

19

Computing Data Cubes

• If each attribute has $N_i$ values, the CUBE has $\prod_i (N_i + 1)$ values

• Compute the N-D cube with hashing if it fits in RAM

• Compute the N-D cube with sorting if it overflows RAM

• Same comments apply to subcubes:

– compute the (N-1)-D subcube from the N-D cube.

– Aggregate on the "biggest" domain first when more than one level deep

– Aggregate functions need hidden variables:

  • e.g. average needs sum and count.

• Use standard techniques from query processing

– arrays, hashing, hybrid hashing

– fall back on sorting.
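As a quick check of the size formula, the sketch below evaluates it for the running example of 2 models, 3 years (1990 to 1992), and 3 colors. The helper name `cube_size` is hypothetical, and the count is for a dense cube in which every combination of values occurs.

```python
from math import prod


def cube_size(cardinalities):
    """Cells in a dense cube: each dimension gains one extra value, ALL."""
    return prod(n + 1 for n in cardinalities)


core_cells = prod([2, 3, 3])        # cells in the plain GROUP BY result
total_cells = cube_size([2, 3, 3])  # cells once ALL is added to each dimension

print(core_cells, total_cells)
```

The gap between the two numbers is the number of super-aggregate cells that the cube adds on top of the core GROUP BY.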

20

Computing Data Cubes

• The 2^N algorithm for cube computation.

– The simplest algorithm to compute the cube is to allocate a handle for each cube cell and, for each input row, update every one of the 2^N cells that the row matches.

• Categorization of aggregation functions.

– Distributive

  • A function is distributive if it can be calculated in the following distributed manner:

    – Partition the data into n sets.

    – Compute the aggregation function on each partition to get an aggregate value.

    – Apply a function g() to the n aggregates to get a final aggregate.

    – This aggregate is the same as it would have been had the whole data set been aggregated at once.

  • Examples: COUNT(), MIN(), MAX(), SUM().

  • Can be calculated more efficiently than by the 2^N algorithm.
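The partition-then-combine property can be checked directly. In this sketch the values and the helper `partition` are invented for illustration; note that the combining function g() is the aggregate itself for SUM, MIN, and MAX, but is SUM for COUNT.

```python
def partition(values, n):
    """Split values into n disjoint lists (round-robin)."""
    return [values[i::n] for i in range(n)]


values = [4, 9, 1, 7, 3, 8]
parts = partition(values, 3)

# Aggregate each partition, then apply g() to the partial results.
sum_of_parts = sum(sum(p) for p in parts)
min_of_parts = min(min(p) for p in parts)
max_of_parts = max(max(p) for p in parts)
count_of_parts = sum(len(p) for p in parts)  # g() for COUNT is SUM

print(sum_of_parts, min_of_parts, max_of_parts, count_of_parts)
```

This is why a distributive super-aggregate can be computed from already-aggregated cells rather than from the base rows.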

21

Computing Data Cubes continued..

– Algebraic

  • A function is algebraic if it can be calculated by an algebraic function of M arguments (M a bounded positive integer), each of which is the result of a distributive function.

  • Examples: MinN(), MaxN(), standard_deviation(), avg().

  • Can also be calculated in a more efficient way.

– Holistic

  • A function is holistic if there is no constant bound on the storage size needed to describe a sub-aggregate.

  • Examples: rank(), median(), mode() (these need the base data).

  • The 2^N algorithm is the fastest for an exact result, but better algorithms exist for approximate results.
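A sketch of why avg() is algebraic rather than distributive: it needs a fixed-size scratchpad of M = 2 distributive values (sum and count). The function names here are hypothetical.

```python
def avg_partial(values):
    """Scratchpad for one partition: (sum, count), both distributive."""
    return (sum(values), len(values))


def avg_merge(a, b):
    """Combine two scratchpads by adding them component-wise."""
    return (a[0] + b[0], a[1] + b[1])


def avg_final(state):
    """Turn the scratchpad into the final average."""
    total, count = state
    return total / count


left, right = [2, 4], [6, 8, 10]
merged = avg_merge(avg_partial(left), avg_partial(right))
correct = avg_final(merged)

# Averaging the two partial averages is NOT the same thing.
naive = (avg_final(avg_partial(left)) + avg_final(avg_partial(right))) / 2

print(correct, naive)
```

A holistic function such as median() has no such fixed-size scratchpad: the partial state would have to grow with the data.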

22

Example

• Compute the 2D core of a 2 x 3 cube

• Then compute the 1D edges

• Then compute the 0D point

• Works for algebraic and distributive functions

• Saves "lots" of calls
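The core-then-edges-then-point order can be sketched for a 2 x 3 cube with SUM as the aggregate. The cell values are invented. The edges are computed from the 6 core cells, and the 0D point from the shorter edge, without going back to the base rows.

```python
from collections import defaultdict

# 2D core of a 2 x 3 cube: (make, color) -> already-aggregated sales.
core = {
    ("Chevy", "red"): 5, ("Chevy", "white"): 3, ("Chevy", "blue"): 1,
    ("Ford", "red"): 2, ("Ford", "white"): 4, ("Ford", "blue"): 6,
}

# 1D edges: collapse one dimension of the core at a time.
by_make = defaultdict(int)
by_color = defaultdict(int)
for (make, color), total in core.items():
    by_make[make] += total
    by_color[color] += total

# 0D point: collapse the smaller edge (2 entries instead of 3 or 6).
grand_total = sum(by_make.values())

print(dict(by_make), dict(by_color), grand_total)
```

Summing two edge values instead of rescanning every base row is where the saved calls come from; the same idea carries over to algebraic functions by combining scratchpads instead of plain sums.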

23

Maintaining a Data Cube

– Up until now we have been discussing only SELECT statements.

– Now we have to accommodate INSERT, DELETE, and UPDATE.

– Example: the max() function

  • Distributive for SELECT and INSERT, but holistic for DELETE

– If a function is algebraic for INSERT, UPDATE, and DELETE, it is easy to maintain the cube.

– If it is distributive, it is fairly inexpensive (using scratchpads).

– If it is holistic, it is expensive to maintain the cube.
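The max() example can be made concrete with a single cube cell. This is a sketch with a hypothetical class `MaxCell`; it keeps the base values only so that DELETE can be shown, and it assumes the cell never becomes empty.

```python
class MaxCell:
    """One cube cell holding max() over its base values."""

    def __init__(self, values):
        self.values = list(values)      # base data, needed only for DELETE
        self.current = max(self.values)
        self.rescans = 0                # how often the base data was re-read

    def insert(self, v):
        # Distributive case: the new max depends only on the old max and v.
        self.values.append(v)
        if v > self.current:
            self.current = v

    def delete(self, v):
        # Holistic case: removing the current max forces a rescan.
        self.values.remove(v)
        if v == self.current:
            self.current = max(self.values)
            self.rescans += 1


cell = MaxCell([3, 9, 5])
cell.insert(7)    # max unchanged, no rescan
cell.insert(11)   # max updated in constant time
cell.delete(3)    # not the max, no rescan
cell.delete(11)   # the max itself: base data must be re-read

print(cell.current, cell.rescans)
```

In a cube, a deleted maximum may have to be recomputed in every cell that covered the row, which is why delete-holistic functions are expensive to maintain.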

24

Summary

• CUBE operator generalizes relational aggregates

• Needs ALL value to denote sub-cubes

– ALL values represent aggregation sets

• Needs generalization of user-defined aggregates

• Decorations and abstractions are interesting

• Computation has interesting optimizations

• Relationship to "rest of SQL" not fully worked out.
