Implementing Data Cube Construction Using a Cluster Middleware: Algorithms, Implementation Experience, and Performance

Ge Yang, Ruoming Jin, Gagan Agrawal
Department of Computer and Information Sciences, Ohio State University
Motivation
A lot of effort has gone into developing cluster computing tools targeting scientific applications
There is an emerging class of commercial applications that are well suited for cluster environments: OnLine Analytical Processing (OLAP) and Data Mining
Can we successfully use cluster tools developed for scientific applications on commercial applications?
Overview
Focus on: data cube construction, which is an OLAP problem
Both compute and data intensive
Frequently used in data warehouses
Use of the Active Data Repository (ADR), developed for scientific data intensive applications
Questions:
Are new algorithms / variations to existing algorithms required?
Implementation experience?
Performance?
Outline
Data cube construction: problem definition and challenges
Active Data Repository (ADR)
Scalable data cube construction algorithms targeting ADR
Implementation experience
Performance evaluation
Summary
Data Cube Construction
Context: data warehouses frequently store (possibly sparse) multidimensional datasets
Example: sale information for a chain of stores; time, item, and location can be the three dimensions
Frequently asked queries: aggregate along one or more dimensions
Data cube construction: perform all aggregations in advance to facilitate rapid response to all queries
For the original n-dimensional array, construct C(n, m) arrays of m dimensions, for each 0 <= m <= n
Data Cube Construction Example:
Consider the original 3-dimensional array ABC. The data cube comprises:
3 two-dimensional arrays: AB, BC, AC
3 one-dimensional arrays: A, B, and C
A scalar value: all
Some observations:
Large input size: data warehouses can hold a lot of data
The total amount of output could be quite large
A lot of computation is involved
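As a point of reference, the sketch below (not the paper's ADR-based implementation) materializes the whole cube of a small dense 3-D array in a single pass, with sum as the aggregation operator; the dimension sizes and values are illustrative. Note that it computes every aggregate directly from ABC, which the lattice on the next slide improves on.

// A minimal single-machine sketch: computing the full data cube of a
// dense 3-D array ABC by direct aggregation with sum.
#include <cstdio>
#include <vector>

int main() {
    const int NA = 2, NB = 3, NC = 4;           // illustrative dimension sizes
    std::vector<double> ABC(NA * NB * NC, 1.0); // input array, all ones here

    std::vector<double> AB(NA * NB, 0), AC(NA * NC, 0), BC(NB * NC, 0);
    std::vector<double> A(NA, 0), B(NB, 0), C(NC, 0);
    double all = 0;

    // One pass over ABC updates every aggregate; sum is associative
    // and commutative, so the update order does not matter.
    for (int a = 0; a < NA; ++a)
        for (int b = 0; b < NB; ++b)
            for (int c = 0; c < NC; ++c) {
                double v = ABC[(a * NB + b) * NC + c];
                AB[a * NB + b] += v;
                AC[a * NC + c] += v;
                BC[b * NC + c] += v;
                A[a] += v; B[b] += v; C[c] += v;
                all += v;
            }
    std::printf("all = %g\n", all);  // 2*3*4 = 24 for the all-ones input
    return 0;
}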
Lattice for Data Cube Construction
Options for computing the different output arrays can be represented by a lattice
If A is the shortest dimension and C is the largest, the arrows represent the minimal spanning tree of the lattice
AB is considered the smallest parent of A and B
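Assuming AB has already been materialized, the smallest-parent rule means A and B are obtained with |A||B| updates instead of the |A||B||C| updates a scan of ABC would take. A minimal sketch of that step:

// Smallest-parent rule: aggregate A and B from AB rather than from ABC.
#include <vector>

void computeAandBfromAB(const std::vector<double>& AB, int NA, int NB,
                        std::vector<double>& A, std::vector<double>& B) {
    A.assign(NA, 0.0);
    B.assign(NB, 0.0);
    for (int a = 0; a < NA; ++a)
        for (int b = 0; b < NB; ++b) {
            A[a] += AB[a * NB + b];
            B[b] += AB[a * NB + b];
        }
}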
Active Data Repository
Developed at University of Maryland (Chang, Kurc, Sussman, Saltz)
Targeted scientific data intensive applications
Execution model (sketched after this list):
Divide output dataset(s) into tiles; allocate one tile at a time
Fetch the input dataset one chunk at a time to compute the tile
Decide on a plan or schedule for fetching the chunks that contribute to a tile
Operations involved in computing an output element must be associative and commutative
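A hedged sketch of this execution model; the type and function names (Tile, Chunk, fetchChunksFor, localReduce) are illustrative placeholders, not ADR's actual API:

#include <vector>

struct Chunk { std::vector<double> values; };  // one piece of the input dataset
struct Tile  { double sum = 0.0; };            // output piece that fits in memory

// In ADR the runtime plans which input chunks contribute to each tile;
// this stand-in simply returns two fixed chunks.
std::vector<Chunk> fetchChunksFor(const Tile&) {
    return { Chunk{{1.0, 2.0}}, Chunk{{3.0}} };
}

// User-supplied reduction: because it is associative and commutative,
// chunks can be applied in any order the schedule produces.
void localReduce(Tile& t, const Chunk& c) {
    for (double v : c.values) t.sum += v;
}

void processAllTiles(std::vector<Tile>& tiles) {
    for (Tile& tile : tiles)                         // one tile at a time
        for (const Chunk& chunk : fetchChunksFor(tile))
            localReduce(tile, chunk);                // all work for a chunk
                                                     // before it is discarded
}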
Goals In Algorithm Design
Must use smallest parents / minimal spanning tree
Maximal cache and memory reuse: perform all computations associated with an input chunk before it is discarded from memory
Minimize interprocessor communication volume
Minimize the amount of memory that needs to be allocated across the tiles
Fit into ADR’s computation model
Approach
Currently consider data cube construction starting from a three-dimensional array only
Partition and tile along a single dimension only
If the sizes along the dimensions A, B, and C are |A|, |B|, and |C|, assume that
|A| <= |B| <= |C|
(no loss of generality)
Partitioning and Tiling
Always partition along the dimension C
Minimizes communication volume: if |A| <= |B| <= |C|, then |A||B| <= |A||C| <= |B||C|, so the array that must be combined across processors, AB, is the smallest of the three two-dimensional arrays
Let the size of the dimension C on each processor be |C'|
Three separate cases for tiling (a selection sketch follows):
Case I: |A| <= |B| <= |C'|
Case II: |A| <= |C'| <= |B|
Case III: |C'| <= |A| <= |B|
Focus on the first and second cases; the third is almost identical to the second
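Under the single-dimension partitioning above, |C'| is |C| divided by the number of processors, so picking the case reduces to two comparisons. A small sketch (the function name tilingCase is illustrative):

#include <cstdio>

int tilingCase(long A, long B, long Cprime) {
    if (A <= B && B <= Cprime)    return 1;  // Case I:   |A| <= |B| <= |C'|
    if (A <= Cprime && Cprime <= B) return 2;  // Case II:  |A| <= |C'| <= |B|
    return 3;                                  // Case III: |C'| <= |A| <= |B|
}

int main() {
    // |A| <= |B| <= |C| is assumed; with 4 processors, |C'| = |C| / 4.
    std::printf("case %d\n", tilingCase(100, 200, 400 / 4));  // prints: case 2
    return 0;
}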
First Case
Tile along the dimension C on each processor
Hold AB in memory through the processing of all tiles
AC and BC are allocated separately for each tile
Algorithm for Case I
Allocate AB
Foreach tile:
    Allocate AC and BC
    Foreach input chunk to be read:
        Update AB, AC, and BC
    Compute C from AC
    Write back AC, BC, and C
    If last tile:
        Perform global reduction to obtain AB
        If (proc_id == 0):
            Compute A and B from AB
            Compute all from A
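A single-process sketch of this loop structure, assuming the input ABC fits in one in-memory array; in the actual implementation, chunked I/O, scheduling, and the cross-processor reduction on AB are handled by ADR and appear only as comments here:

#include <algorithm>
#include <vector>

void caseI(const std::vector<double>& ABC, int NA, int NB, int NC, int tileC) {
    std::vector<double> AB(NA * NB, 0.0);            // held across all tiles
    for (int c0 = 0; c0 < NC; c0 += tileC) {         // tile along dimension C
        int cn = std::min(c0 + tileC, NC);
        std::vector<double> AC(NA * (cn - c0), 0.0); // allocated per tile
        std::vector<double> BC(NB * (cn - c0), 0.0);
        for (int a = 0; a < NA; ++a)                 // stands in for the
            for (int b = 0; b < NB; ++b)             // per-chunk updates
                for (int c = c0; c < cn; ++c) {
                    double v = ABC[(a * NB + b) * NC + c];
                    AB[a * NB + b] += v;
                    AC[a * (cn - c0) + (c - c0)] += v;
                    BC[b * (cn - c0) + (c - c0)] += v;
                }
        std::vector<double> C(cn - c0, 0.0);         // C from its smallest
        for (int a = 0; a < NA; ++a)                 // parent, AC
            for (int c = 0; c < cn - c0; ++c)
                C[c] += AC[a * (cn - c0) + c];
        // write back AC, BC, and this slice of C
    }
    // after the last tile: global reduction on AB across processors; on
    // processor 0, compute A and B from AB and "all" from A
}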
Properties of the Algorithm
All arrays are computed from their smallest parents
Maximal cache and memory reuse
Minimal interprocessor communication volume among all single dimensional partitions
The portion of the output arrays that needs to be kept in main memory for the entire computation is the minimum over all single dimensional tiling possibilities
Second Case
Tile along the dimension B
Hold AC in main memory for the entire computation
Algorithm for Case II
Allocate AC and A
Foreach tile:
    Allocate AB and BC
    Foreach input chunk to be read:
        Update AB, AC, and BC
    Perform global reduction to obtain final AB
    If (proc_id == 0):
        Compute B from AB
        Update A using AB
    Write back AB, BC, and B
    If last tile:
        Finish AC
        Compute C from AC
        If (proc_id == 0):
            Finish A
            Compute all from A
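A matching single-process sketch of the Case II loop structure; as in the Case I sketch, the steps that belong to ADR and to the cross-processor reduction appear only as comments. The main structural differences are the per-tile global reduction on AB, and AC and A persisting across tiles:

#include <algorithm>
#include <vector>

void caseII(const std::vector<double>& ABC, int NA, int NB, int NC, int tileB) {
    std::vector<double> AC(NA * NC, 0.0);            // held across all tiles
    std::vector<double> A(NA, 0.0);
    for (int b0 = 0; b0 < NB; b0 += tileB) {         // tile along dimension B
        int bn = std::min(b0 + tileB, NB);
        std::vector<double> AB(NA * (bn - b0), 0.0); // allocated per tile
        std::vector<double> BC((bn - b0) * NC, 0.0);
        for (int a = 0; a < NA; ++a)
            for (int b = b0; b < bn; ++b)
                for (int c = 0; c < NC; ++c) {
                    double v = ABC[(a * NB + b) * NC + c];
                    AB[a * (bn - b0) + (b - b0)] += v;
                    AC[a * NC + c] += v;
                    BC[(b - b0) * NC + c] += v;
                }
        // global reduction on this tile's AB across processors; then, on
        // processor 0, compute this slice of B from AB and update A using AB
        // write back AB, BC, and the slice of B
    }
    // after the last tile: finish AC and compute C from AC; on processor 0,
    // finish A and compute "all" from A
}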
Implementation Experience Using ADR
Had to supply (see the interface sketch after this list):
A local reduction function: the processing for each chunk
A global reduction function: invoked after local reduction on each tile
A finalize function: invoked after processing all tiles
A specification of the tiling desired
ADR's runtime support offered:
Fetching of the input chunks corresponding to each tile
Scheduling of asynchronous operations
Details of interprocessor communication
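One way to picture this division of labor is as a callback interface; the class and method names below are hypothetical, not ADR's actual C++ API:

// Hypothetical rendering of the user-supplied hooks; ADR's real interface
// differs, but the division of labor is the one listed above.
struct CubeTile   { /* per-tile output arrays, e.g. the AC and BC slices */ };
struct InputChunk { /* one chunk of the input array ABC */ };

class DataCubeOperation {
public:
    // local reduction: apply one input chunk to the current tile
    virtual void localReduce(CubeTile& tile, const InputChunk& chunk) = 0;
    // global reduction: combine per-processor partial results for a tile
    virtual void globalReduce(CubeTile& tile) = 0;
    // finalize: runs after all tiles, e.g. compute A, B, and "all" from AB
    virtual void finalize() = 0;
    virtual ~DataCubeOperation() = default;
};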
Experimental Evaluation
Goals:
Speedups on sparse and dense datasets
Scaling of performance with respect to dataset sizes
Scaling of performance with respect to the number of tiles
Evaluating the impact of sparsity
Experimental platform: 8 250 MHz Ultra-II processors, 1 GB of main memory on each, Myrinet interconnect
Scaling Input Datasets - Dense Arrays
[Chart: execution time on 1, 2, 4, and 8 processors for 1 GB, 2 GB, and 4 GB dense datasets]
Almost linear speedups up to 8 processors
Performance per element increases linearly with the increase in dataset size
Scaling Dataset Sizes: Sparse Dataset
[Chart: execution time on 1, 2, 4, and 8 processors for 0.5 GB, 1 GB, and 2 GB sparse datasets]
25% sparsity level
Slightly lower speedups than dense datasets: higher communication to computation ratio
Execution time stays proportional to the amount of computation
Increasing Number of Tiles
[Chart: execution time on 2 nodes for 1, 2, 4, and 8 tiles, with a fixed amount of computation per tile]
Execution time stays proportional to the amount of computation
Impact of Sparsity
[Chart: execution time on 1, 2, 4, and 8 processors at sparsity levels of 25%, 10%, 5%, and 1%]
Same number of non-zero elements in each dataset
Good speedups in all cases
Some reduction in sequential performance as sparsity increases, particularly for the 1% case
Summary
Considered data cube construction on clusters
Used a runtime system developed for scientific data intensive applications
New algorithms to combine tiling and interprocessor communication
Observations:
Code writing is simplified because of the use of the runtime system
High speedups
Performance scales well as dataset sizes are increased