CUBE: A Relational Aggregate Operator Generalizing Group By

[Figure: The Data Cube and the Sub-Space Aggregates. Panels: Aggregate (Sum); Group By (with total), by Color; Cross Tab, by Make and by Color; Data Cube, by Make, by Year, by Color, by Make & Color, by Make & Year, by Color & Year. Sample values: Chevy, Ford; 1990 to 1993; Red, White, Blue.]

Ata İsmet Özçelik

The Data Analysis Cycle

• User extracts data from the database with a query
• Then visualizes and analyzes the data with desktop tools

[Figure: the extract, visualize, analyze cycle (label: Spread Sheet), with two storage-hierarchy charts, Size vs Speed (Size in bytes) and Price vs Speed ($/MB), each plotted against Access Time in seconds from 10^-9 to 10^3 for main, secondary, online tape, nearline tape, and offline tape storage.]

N-Dimensional Data

• What exactly is N-dimensional data?
  – A relation with N attribute domains.
  – Each dimension in the main table could have its own domain table.
• Why is this alone not enough?
  – We need aggregation of various kinds to make the data representation humanly readable.

Relational Representation of 3-D Data

Sales Fact Table: model_key, year_key, color_key, measures

Aggregate Functions

• Aggregation functions:
  – The SQL standard provides SUM(), COUNT(), MIN(), MAX(), and AVG().
  – Many systems provide their own aggregate functions, and some even let users define custom ones.
• The basic idea is to combine all values in a column into a single scalar value.

Relational Group By Operator

• Group By allows aggregates over table sub-groups
• Result is a new table
• Syntax:
    SELECT location, SUM(units)
    FROM inventory
    WHERE nation = 'USA'
    GROUP BY location;

[Figure: Group By illustrated with the labels Grouping Values, Partitioned Table, and Aggregate Values.]
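The partition-then-aggregate behaviour of Group By can be sketched in a few lines of Python. This is only an illustration: the `inventory` rows and the `group_by_sum` helper are hypothetical, not part of the slides.

```python
from collections import defaultdict

# Hypothetical inventory rows: (location, nation, units).
inventory = [
    ("Seattle", "USA", 10),
    ("Seattle", "USA", 5),
    ("Boston", "USA", 7),
    ("Paris", "France", 3),
]

def group_by_sum(rows, nation):
    """Partition the rows by location, then sum units within each partition."""
    totals = defaultdict(int)
    for location, row_nation, units in rows:
        if row_nation == nation:       # keep only rows for the requested nation
            totals[location] += units  # one running aggregate per grouping value
    return dict(totals)

usa_totals = group_by_sum(inventory, "USA")
```

Each distinct grouping value ends up with exactly one aggregate value, which is why the result is itself a table.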

Problems with GROUP BY

• Histograms
  – In standard SQL, histograms are computed indirectly from a table-valued expression, which is then aggregated.
• Roll-up totals and sub-totals for drill-downs
  – Reports commonly aggregate data at a coarse level, and then at successively finer levels.
    • Roll-up: going up levels.
    • Drill-down: going down levels.
• Cross-tabulation (cross-tab for short)
  – A symmetric aggregation table.
• The problem, hence, is that with GROUP BY alone an N-dimensional cross-tab needs a $2^N$-way union, and a roll-up needs one union per level
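To see where the exponential count comes from, the sketch below enumerates every subset of the grouping columns; each subset would need its own GROUP BY query in the union. The `grouping_sets` helper and the column names are illustrative assumptions.

```python
from itertools import combinations

def grouping_sets(dimensions):
    """List every subset of the dimensions, from the finest to the grand total."""
    return [
        subset
        for size in range(len(dimensions), -1, -1)
        for subset in combinations(dimensions, size)
    ]

sets_2d = grouping_sets(["make", "color"])
sets_3d = grouping_sets(["make", "year", "color"])
```

With two dimensions there are four grouping sets, with three there are eight, and the count doubles with every added dimension.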

An example approach

• Not relational

• Not convenient

‘ALL’

• A dummy value that fills the super-aggregation items.
• It is actually a set representing all the values present for the corresponding dimension.
• There are two ways of dealing with it.
  – Define a new keyword ALL in SQL.
    • The ALL() function is defined to enumerate the set that ALL represents.
    • ALL [NOT] ALLOWED is added to the column definition syntax.
    • The set interpretation guides the relational operators {=, IN} for ALL.
  – Avoid the ALL keyword.
    • NULL is used instead of ALL.
    • The GROUPING() function discriminates between ALL and NULL.
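A minimal sketch of the second option, assuming `None` plays the role of NULL and an extra 0/1 flag plays the role of GROUPING(). The `sales` rows and the `by_color_with_total` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows: (color, units). One row has a genuinely missing color.
sales = [("red", 5), (None, 2), ("blue", 4)]

def by_color_with_total(rows):
    """Return (color, grouping_flag, total) rows; flag 1 marks the ALL row."""
    totals = defaultdict(int)
    for color, units in rows:
        totals[color] += units
    result = [(color, 0, total) for color, total in totals.items()]
    # The super-aggregate row also stores None, so only the flag identifies it.
    result.append((None, 1, sum(units for _, units in rows)))
    return result

rows = by_color_with_total(sales)
```

Two rows carry `None` in the color column, and only the flag tells the real missing value apart from the super-aggregate.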

3D Roll-Up

This is a simple 3-dimensional roll-up. Aggregating over N dimensions requires N such unions.

Cross Tabs

• The symmetric aggregation result is a table called a cross-tabulation.


Data Cube Relational Operator

N-dimensional cube: each attribute is a dimension

• N-dimensional aggregate (sum(), max(), ...)
  – fits the relational model exactly:
    • a1, a2, ...., aN, f()
• Super-aggregate over N-1 dimensional sub-cubes
    • ALL, a2, ...., aN, f()
    • a1, ALL, a3, ...., aN, f()
    • ...
    • a1, a2, ...., ALL, f()
  – this is the N-1 dimensional cross-tab.
• Super-aggregate over N-2 dimensional sub-cubes
    • ALL, ALL, a3, ...., aN, f()
    • ...
    • a1, a2, ...., ALL, ALL, f()

CUBE Operator

• Syntax:
    SELECT Model, Year, Color, SUM(sales) AS Sales
    FROM Sales
    WHERE Model IN ('Ford', 'Chevy')
      AND Year BETWEEN 1990 AND 1992
    GROUP BY CUBE (Model, Year, Color)

• Semantics:

Result of a Cube Operator
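The semantics can be approximated in Python as follows: every input row contributes to each combination of kept and ALL-ed dimensions. This is a sketch under assumptions, not the operator's implementation; the sample `sales` rows, the `ALL` string marker, and `cube_sum` are all illustrative.

```python
from collections import defaultdict
from itertools import product

ALL = "ALL"  # stand-in for the ALL value

# Hypothetical sales rows: (model, year, color, sales).
sales = [
    ("Chevy", 1990, "red", 5),
    ("Chevy", 1990, "blue", 87),
    ("Ford", 1990, "red", 64),
    ("Ford", 1991, "blue", 8),
]

def cube_sum(rows, n_dims):
    """Sum the measure for every combination of kept and ALL-ed dimensions."""
    totals = defaultdict(int)
    for row in rows:
        dims, measure = row[:n_dims], row[n_dims]
        # Each row falls into 2**n_dims cells, one per choice of hidden dimensions.
        for mask in product((False, True), repeat=n_dims):
            key = tuple(ALL if hide else d for d, hide in zip(dims, mask))
            totals[key] += measure
    return dict(totals)

cube = cube_sum(sales, 3)
```

The cell `(ALL, ALL, ALL)` holds the grand total, while a cell such as `("Ford", ALL, ALL)` holds the total for one model across all years and colors.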

ROLL UP Operator

• Syntax:
    SELECT Manufacturer, Year, Color, Model, SUM(price) AS Revenue
    FROM Sales
    GROUP BY Manufacturer,
      ROLLUP Year(Time) AS Year,
             Month(Time) AS Month,
             Day(Time) AS Day

• Semantics:

Manufacturer Year, Mo, Day
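Unlike CUBE, a roll-up only collapses dimensions from the right, producing one aggregation level per prefix of the rolled-up columns. The sketch below assumes hypothetical rows of the form (manufacturer, year, month, day, price); `rollup_sum` and its parameters are illustrative names.

```python
from collections import defaultdict

ALL = "ALL"  # stand-in for the ALL value

# Hypothetical rows: (manufacturer, year, month, day, price).
sales = [
    ("Ford", 1994, 1, 10, 100),
    ("Ford", 1994, 1, 11, 50),
    ("Ford", 1994, 2, 1, 25),
    ("Chevy", 1995, 3, 5, 70),
]

def rollup_sum(rows, fixed, rolled):
    """Group by the first `fixed` columns and roll up the next `rolled` columns."""
    totals = defaultdict(int)
    for row in rows:
        dims, measure = row[:-1], row[-1]
        # Keep a shrinking prefix of the rolled-up columns; pad the rest with ALL.
        for keep in range(rolled, -1, -1):
            key = dims[:fixed + keep] + (ALL,) * (rolled - keep)
            totals[key] += measure
    return dict(totals)

rollup = rollup_sum(sales, fixed=1, rolled=3)
```

Only keys of the shapes (m, y, mo, d), (m, y, mo, ALL), (m, y, ALL, ALL), and (m, ALL, ALL, ALL) appear, so a combination such as month without year is never produced.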

Snowflake Schema

[Figure: fact table columns Product, Seller, Buyer, Units, Price, Office, Date, with dimension labels Division, Group, Channel, Discount, Cust Type, District, Region, Geography, Week, Month, Quarter, Year.]

A snowflake schema showing the core fact table and some of the many aggregation granularities of the core dimensions.

Addressing Data Cube

• SQL3 defines a Turing-complete procedural programming language.
• Example:
    SELECT Year, Color, Model, SUM(Sales) AS total,
           SUM(Sales) / total(ALL, ALL, ALL)
    FROM Sales
    WHERE Model IN ('Ford', 'Chevy')
      AND Year BETWEEN 1990 AND 1992
    GROUP BY CUBE (Model, Year, Color)
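Dividing each cell by the `(ALL, ALL, ...)` cell gives its share of the grand total, which is what the query above expresses. A small two-dimensional sketch, with hypothetical data and an `ALL` string marker as assumptions:

```python
from collections import defaultdict
from itertools import product

ALL = "ALL"  # stand-in for the ALL value

# Hypothetical rows: (model, color, sales).
sales = [("Chevy", "red", 5), ("Chevy", "blue", 15), ("Ford", "red", 30)]

# Build the 2-D cube: each row lands in four cells.
totals = defaultdict(int)
for model, color, amount in sales:
    for m, c in product((model, ALL), (color, ALL)):
        totals[(m, c)] += amount

# Address the grand-total cell directly and express every cell relative to it.
grand_total = totals[(ALL, ALL)]
share = {cell: value / grand_total for cell, value in totals.items()}
```

Because the super-aggregates live in the same table as the base cells, the percent-of-total lookup is a plain key access.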

Computing Data Cubes

• If attribute i has $N_i$ values, the CUBE has $\prod_i (N_i + 1)$ values
• Compute the N-D cube with hashing if it fits in RAM
• Compute the N-D cube with sorting if it overflows RAM
• The same comments apply to sub-cubes:
  – compute the (N-1)-D sub-cube from the N-D cube.
  – aggregate on the “biggest” domain first when more than one level deep
  – aggregate functions need hidden variables:
    • e.g. average needs sum and count.
• Use standard techniques from query processing
  – arrays, hashing, hybrid hashing
  – fall back on sorting.
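The size formula is easy to check numerically: each dimension contributes its own values plus one extra slot for ALL. `cube_cell_count` is a hypothetical helper name.

```python
from math import prod

def cube_cell_count(cardinalities):
    """Upper bound on cube cells: each dimension gains one extra ALL value."""
    return prod(n + 1 for n in cardinalities)

cells_2x3 = cube_cell_count([2, 3])       # (2 + 1) * (3 + 1)
cells_2x3x3 = cube_cell_count([2, 3, 3])  # (2 + 1) * (3 + 1) * (3 + 1)
```

For a 2 x 3 table the cube has at most 12 cells: 6 core cells, 5 one-dimensional totals, and 1 grand total.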

Computing Data Cubes

• $2^N$ algorithm for cube computation.
  – The simplest algorithm to compute the cube is to allocate a handle for each cube cell.
• Categorization of aggregation functions.
  – Distributive
    • A function is distributive if it can be calculated in the following distributed manner:
      – Partition the data into n sets.
      – Compute the aggregation function on each partition to get an aggregate value.
      – Apply a function g() to the n aggregates to get a final aggregate.
      – This aggregate is the same as if the whole data set had been aggregated at once.
    • COUNT(), MIN(), MAX(), SUM().
    • Can be calculated more efficiently than by the $2^N$ algorithm.
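The distributive property can be demonstrated directly: aggregate each partition, then combine the sub-aggregates with g(). The `combine_distributive` helper and the sample partitions are illustrative assumptions.

```python
def combine_distributive(partitions, aggregate, g):
    """Aggregate each partition, then apply g() to the n sub-aggregates."""
    return g([aggregate(p) for p in partitions])

partitions = [[1, 2, 3], [4, 5], [6]]
flat = [x for p in partitions for x in p]  # the same data, unpartitioned

total = combine_distributive(partitions, sum, sum)     # g is SUM for SUM
count = combine_distributive(partitions, len, sum)     # g is SUM for COUNT
largest = combine_distributive(partitions, max, max)   # g is MAX for MAX
smallest = combine_distributive(partitions, min, min)  # g is MIN for MIN
```

Note that g() is not always the function itself: COUNT is combined by summing the per-partition counts.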

Computing Data Cubes (continued)

  – Algebraic
    • A function is algebraic if it can be calculated by an algebraic function of M arguments (M a bounded positive integer), each of which is the result of a distributive function.
    • min_N(), max_N(), standard_deviation(), avg()
    • Can also be calculated more efficiently.
  – Holistic
    • A function is holistic if there is no constant bound on the storage size needed to describe a sub-aggregate.
    • rank(), median(), mode() (need the base data)
    • The $2^N$ algorithm is the fastest known for an exact result, but better algorithms exist for approximate results.
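The contrast between the two categories shows up in a small experiment: the average can be rebuilt from a fixed-size (sum, count) pair per partition, whereas combining per-partition medians does not give the true median. The sample partitions are hypothetical.

```python
from statistics import median

partitions = [[1, 2, 3], [4, 100]]

# Algebraic: AVG needs only a two-value scratchpad (sum, count) per partition.
scratchpads = [(sum(p), len(p)) for p in partitions]
total, count = map(sum, zip(*scratchpads))
average = total / count

# Holistic: no fixed-size summary works, so combining medians goes wrong.
median_of_medians = median(median(p) for p in partitions)
true_median = median(x for p in partitions for x in p)
```

Here the scratchpad stays at two numbers no matter how large the partitions grow, while the exact median needs the base data.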

Example

Compute the 2D core of a 2 x 3 cube

Then compute the 1D edges

Then compute the 0D points

Works for algebraic and distributive functions

Saves “lots” of calls
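A sketch of this bottom-up order for SUM, assuming hypothetical (make, color, amount) rows: the base data is scanned once for the core, and each lower-dimensional aggregate is derived from the level above it rather than from the base rows.

```python
from collections import defaultdict

ALL = "ALL"  # stand-in for the ALL value

# Hypothetical rows for a 2 x 3 cube: (make, color, amount).
sales = [
    ("Chevy", "red", 5), ("Chevy", "blue", 15),
    ("Ford", "red", 30), ("Ford", "white", 10), ("Ford", "red", 2),
]

# Step 1: the 2D core is the only step that scans the base data.
core = defaultdict(int)
for make, color, amount in sales:
    core[(make, color)] += amount

# Step 2: the 1D edges are computed from the core cells.
by_make, by_color = defaultdict(int), defaultdict(int)
for (make, color), subtotal in core.items():
    by_make[(make, ALL)] += subtotal
    by_color[(ALL, color)] += subtotal

# Step 3: the 0D point is computed from the smaller edge (two makes).
grand_total = sum(by_make.values())
```

The savings come from steps 2 and 3 touching only the already-aggregated cells, which is valid because SUM is distributive.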

Maintaining a Data Cube

– Up until now we have been discussing only SELECT statements.
– Now we have to accommodate INSERT, DELETE, and UPDATE.
– Example: the max() function
  • Distributive for SELECT and INSERT, but holistic for DELETE
– If a function is algebraic for INSERT, UPDATE, and DELETE, it is easy to maintain the cube.
– If it is distributive, it is fairly inexpensive (using scratchpads).
– If it is holistic, it is expensive to maintain the cube.
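The max() example can be made concrete. On insert, the stored maximum and the new value are enough; on delete, removing the current maximum forces a rescan of the remaining base data. Both helper functions below are hypothetical illustrations.

```python
def max_after_insert(current_max, new_value):
    """INSERT: the old aggregate and the new value are all that is needed."""
    return max(current_max, new_value)

def max_after_delete(remaining_values, current_max, deleted_value):
    """DELETE: if the maximum itself is removed, the base data must be rescanned."""
    if deleted_value < current_max:
        return current_max        # the stored aggregate is still valid
    return max(remaining_values)  # expensive path: recompute from base data

after_insert = max_after_insert(10, 12)
after_cheap_delete = max_after_delete([3, 10], current_max=10, deleted_value=7)
after_costly_delete = max_after_delete([3, 7], current_max=10, deleted_value=10)
```

The costly path is what makes max() behave like a holistic function under DELETE.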

Summary

• The CUBE operator generalizes relational aggregates
• Needs the ALL value to denote sub-cubes
  – ALL values represent aggregation sets
• Needs generalization of user-defined aggregates
• Decorations and abstractions are interesting
• Computation has interesting optimizations
• Relationship to the “rest of SQL” is not fully worked out
