[ieee 2008 international conference on computer and communication engineering (iccce) - kuala...

Proceedings of the International Conference on Computer and Communication Engineering 2008 May 13-15, 2008 Kuala Lumpur, Malaysia

978-1-4244-1692-9/08/$25.00 ©2008 IEEE

Multi Level Exceptions Mining in OLAP Data Cubes

M. Naderi Dehkordi1, M. H. Shenassa2, Kambiz Badie3 1Science & Research Branch, Islamic Azad University (IAU)

Department of Computer Engineering, Tehran, Iran E-mail: [email protected]

2Department of Electrical Engineering,K. N. Toosi University, Tehran, Iran 3IT Research Faculty, Iran Telecom Research Center, Tehran, Iran

(Email: [email protected])

Abstract

People nowadays are relying more and more on OLAP data to find business solutions. A typical OLAP data cube usually contains four to eight dimensions, with two to six hierarchical levels and tens to hundreds of categories for each dimension. It is often too large and has too many levels for users to browse it effectively. In this paper we propose a new definition of exception. This integrated system prototype will guide users to efficiently explore exceptions in data cubes. It automatically computes the degree of exceptions for cube cells at different aggregation levels. Different statistical methods such as log-linear model, adapted linear model and Z-tests are used to compute the degree of exceptions. We present algorithms and address the issue of improving the performance on large data sets.

I. INTRODUCTION Exception mining and association rule mining are

two important data mining issues and techniques. People nowadays are relying more and more on OLAP data to find business solutions. But OLAP data cubes are often too large and have too many levels for users to browse effectively. Therefore the so-called "discovery driven" exploration of OLAP data [1,6,8], which focus on partial or full automation of data exploration to help users discover the interesting part of the cube data, is becoming a hot topic among OLAP tool developers [2,5].

Exceptions are a very small portion of data that are often treated as noise. However, "one person's noise is another person's signal" [3,7]. Sometimes rare events are more interesting than the average ones. Business analysts who are browsing the OLAP data cubes are

often looking for exceptions, because exceptions often lead to identification of problem areas or new opportunities. For example, the chain store managers often pay special attention to areas with unusually high or low sales. Analysts from credit card companies would like to find anomalous transactions for either fraud detection or marketing reasons.

Therefore, we would like to write an algorithm to mark them automatically so that the users could easily identify them even when the data cube is very large. Intuitively an exception in a data cube is a cell with a value significantly different from the value we anticipated for this cell [4]. The anticipation is based on some statistical model or statistical computation. It should estimate a cell value in context of its position in the data cube and consider the value variation patterns over all dimensions and the group-by of dimensions the cell belongs to.

II. General Framework We need to first introduce some notations to

describe the data cube. Suppose that in an n-dimensional cube, the number of concept hierarchy levels, excluding level ALL, of the r-th dimension is , then we can use an integer number to denote the current level of . Let =0 if is at the highest level ALL, and if it is at the most detailed level. The hierarchy levels of , from the highest one down to the most detailed one, will be 0 1

. A dimension is present in the cuboid if it is rolled-up to level ALL. Definition 1: The hierarchical level of a cuboid in an n-dimensional cube is , , , … , where is the concept hierarchical level of dimension Dr. 0 if

is rolled up to level ALL. We know that in a data cube, there is one cuboid for

each different hierarchical level. Suppose that there are

747

n dimensions and each dimension r has hierarchical levels (excluding ALL), then the total number of cuboids, or the number of different hierarchical levels of aggregation will be ∏1 .

III. Basic Idea The basic idea is to add automatically computed

indicators of exceptions at all levels in the data cube to help users efficiently identify and explore exceptional cells. This is achieved by first computing the quantitative measures of exceptions for all cells and paths and then visualizing the results by highlighting interesting cells and paths. There are four types of measures of exceptions to compute: 1. : measure representing the degree of

exception for cell c, where c denotes a cell at some hierarchical level l as introduced before. It is computed for each cell by comparing it with all other cells of the same aggregation groups.

2. : measure representing the degree of exceptions that could be found beneath cell c if we drill down along some dimensions. Some cells may or may not be exceptional at the current level, but if we drill down along some dimensions, we may find exceptions underneath.

3. , : measure representing the degree of exceptions that could be found if we drill down along dimension k, where l defines the current

hierarchical level, and k is the dimension number. When a user wants to drill down and explore lower level exceptions, he or she often faces choices of along which dimension to go on.

4. , : measure representing the degree of exceptions that could be found beneath cell c when drilling down along dimension k. c defines a cell, and k is the dimension number. Sometimes a user may be interested in a cell with a colored border and wants to drill down to see the exceptions beneath this cell.

IV. Measures of Exceptions Now the problem remains as to compute the

following four measures of exceptions at each level of aggregation l: • for each cell , ; • for each cell ; • , for each dimension ; • , for each cell along

each dimension . Among them the measure for cell

exception is the essential one. Generally, an exception in a data cube is a cell whose value is significantly different from what we anticipated for this cell based on some statistical model or statistical computation. The model or computation should

take into consideration the position of the cell in the data cube and the variation pattern over every aggregation (rows and planes as in a 3-dimensional cube) the cell belongs to. One possible approach is to use a statistical model to estimate the cell values. Now we use a simple model which does not consider concept hierarchies to illustrate our idea. We first look at the base cuboids where all the n dimensions

, , … , are present. Let be the value of the cell at position , , … , . The anticipated value of the model should be a function f of contributions from various possible aggregations (or group-by's as commonly used in SQL term) over all dimensions as

| | , , … ,

The terms are called coefficient of the modem equation. is the coefficient from group-by . Example 2: This example shows the model terms for a 3-dimensional cube. The anticipated value for a cell at position , , of dimension , , will be

, , , , , , , , , , ,

(1)

is the overall factor. , and are factor from group-by , and respectively, or we can say they are for the effects from each individual dimension. We can say they are for the effects from the interactions between each pair of two dimensions. The residual of a model is defined as

(2) Normalizing it by the standard deviation of the cell, we get the normalized residual

(3)

The problem of looking for exceptional cell values is equivalent to the problem of looking for exceptional normalized residuals. The larger the absolute value of the normalized residual, the more exceptional a cell is. Choosing a threshold 0, define exceptions as cells satisfying | | . Define

, , 0 , | |

(4)

When 0 the cell is a high-end exception, and when 0 it is a low-end exception.

748

Now we have found exceptions in the base cuboid. To find exceptions at other levels, we just fit equation 1 separately for each cuboid, where some of the dimensions are rolled up to ALL so the number of dimensions n will be different. When level is taken into account, we use instead of

to denote the degree of exception of a cell. Algorithm 1: Find Exceptions Compute the measures of exceptions at each level of the data cube. Input: 1-The normalized residual for each cube cell

, . 2- A threshold of exception . 3- An exception summarizing function . Output: For each cell , and each dimension the following measures of exceptions: 1- 2- 3- , 4- , Method: Starting from the base cuboid level up to the apex level ALL, for each level of aggregation

, , … , do the following: 1- For each i, compute , directly from

, and using equation 4. 2- if l is the base level, then for all i and k,

, , 0 , 0

, 0 Otherwise (a) For each i and k, find all its child cells c' at level

, , … , 1, … , . Let , , | ;

, | ; (b) For each i and k compute

,, , , ;

(c) For each i compute , | ;

(d) For each k compute , , | ;

V. An Alternative Approach In the previous sections, we combine the effects of all dimensions together and define an exception as a cell which behaves differently in conjuncture with its

position in the cube. For example, in figure 1, the average profit from Far East on Sports Goods is much higher than that of other products. But it was not marked as an exception because all other values in the "Sports Goods" row are very high, too. Therefore, this cell looks pretty normal in its position. However, sometimes users may not be satisfied by this definition. They might think <"Far East", "Sports Goods"> is indeed an exception because it is significantly larger than the overall average, or the average profits of <"Far East"> among all product categories. So we could define an exception as a cell which is anomalous in any categorical group it belongs to. For example, <"Far East", "Sports Goods"> is an exception if it is anomalous among all the cube cells, or if it is anomalous among the values of the <"Far East"> column, or if it is anomalous among the values of the <"Sports Goods"> row. We will call this kind of exceptions "separate effect" exceptions, and the previous one "combined effect" exceptions.

VI. Measure of Exceptions In previous sections when we try to find the

combined effect exceptions, we build a model which takes into account influences from all dimensions. But now things are simpler. We only need to compare a cell with one group of values at each time. We use Z-test to measure whether a cell is remarkably different from cells in the same group-by. When the aggregate function of the cube is or the problem is to simply pick up exceptions from a group of cell values. When the number of cells in the group-by is no less than 30, we can assume that they are normally distributed. Let denote a cell at level and position , and the ancestor of at level . Then

(5)

where which are the current level cells that share the same ancestor with at level .

is the variance of these cell values.

Equation 5 measures how different y(c) is from the average value of cells belonging to the same group-by at level t. 2.57 2.57 covers a probability of 99% of the data.

The computation is quite simple. Both avg and can be computed incrementally. We can start from the lowest level group-by's up to the least detailed group-by's. It is similar to the process of building cubes.

749

When the number of cells in a group is less than 30, (which is possible when the hierarchical level is high) we can adapt some rules or conventions on how to define exceptions. For example, we can ignore exceptions in small groups and do not mark them, because it is not worthwhile to compute exceptions in small group. Human eyes can identify exceptions quite well in such a case. Also there is no standard definition on exceptions when data size is very small, and in such a case human eyes make better judgments than computers. Therefore, when the number of cells is less than 30 we simply set 0.

However we provide the above method to find "separate effect" exceptions when the cube aggregate function is or our focus of this section will be on the cases when the aggregate function is

. We believe when cell values are or user more likely want to find the "combined effect" exceptions by the log-linear model, as each cell contains different count and a simple comparison of a at set of "sum" values may not be that useful. When the aggregate function of the cube is

, the problem of finding exceptions is different from the and case. Now we are supposed to find out whether a cell mean is significantly different from the mean of any of the groups it belongs to. For example, in the <"Far East", "Sports Goods"> case, we want to judge separately whether the average profits of <"Far East", "Sports Goods"> is significantly different from the average profits of the three groups it belongs to (if not consider concept hierarchies of dimensions): the group of all cells in the cube, the group of cells in the <"Far East"> column, and the group of cells in the <"Sports Goods"> row. This is exactly the problem of comparing sample mean and the population mean in statistic literature.

VII. The Algorithm Algorithm 2: Alternative Exceptions Compute the measures of exceptions at each level of the data cube. Input: 1- A data cube with n dimensions , ,…, . More specifically, the following parameters are known for the cube: (a) The concept hierarchy levels for each dimension (b) The cell values , of the cuboid at each level

, , , … , where 0 2- A threshold of exception . 3- An exception summarizing function . Output:

For each cell , and each dimension the following measures of exceptions: 1- 2- 3- , 4- , Method: 1. if the cube is not fully materialized then

starting from the base cuboid up to the apex cuboid compute the mean and standard deviation of the

explored measurement. 2. Starting from the base cuboid up to the apex cuboid for each cell ,

compute , for all higher-level group-by's .

compute , from all the , values. compute , and

in the same way as in algorithm 1.

Analysis: Compare this algorithm with the one for the weighted linear-model, we can see that the complexities of the two are the same. In the model fitting algorithm 1 to compute the coefficient , , we need to subtract all the coefficients from the higher level group-by's , belongs to, namely , where . Here for each cube cell , , we need to compare it with the average values of all the higher level group-by's the cell belongs to, namely , , where . The techniques for reducing I/O costs, such as simultaneously computing some of the Z values can also be used.

VIII. Conclusion In this paper, we proposed a method of automatically detecting and highlighting interesting cells and paths in OLAP cubes. The first method uses an adapted linear model to predict a cell value based on the average values of the different group-by's (also at different levels, the cell belongs to) and then uses the difference between the anticipated value and the real value to represent the degree of exception. The second method compares a cell value (which is the average value of all tuples belonging to this cell) with the average value of each group-by it belongs to. When a cell value is remarkably different from one of the larger group it belongs to, we consider it as an exception. Z-test is used as the means of comparison. We present algorithms which can pre-compute the exceptions in reasonable amount of time. The measures of exceptions can be stored and retrieved like all other pre-computed measurements in the cube. When the cube is not fully materialized, these

750

measures can be computed on the fly in a similar way we aggregate other measurements.

References [1] R. Agrawal, T. Imielinski, and A. Swami. Mining associations

between sets of items in massive databases. In Proc. of the ACM SIGMOD Int'l Conference on Management of Data, pages 207-216, 1993.

[2] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In Proceedings of the Twentieth International Conference on Very Large Databases, pages 487-499, Santiago, Chile, 1994.

[3] E. A. Bender. Mathematical Methods in Artificial Intelligence. IEEE Computer Society Press, 1996.

[4] E. K. C. Blake and C. J. Merz. UCI repository of machine learning databases, http://www.ics.uci.edu/mlearn/MLRepository.html, 1998.

[5] L. Breiman. Bagging predictors. Machine Learning,24:123-140,1996.

[6] P. Clark and R. Boswell. Rule induction with CN2: Some recent improvements. In Machine Learning, EWSL91, pages 151-163, 1991.

[7] G. Dong, X. Zhang, L. Wong, and J. Li. CAEP: Classificationby aggregating emerging patterns. In Proceedings of the 2nd International Conference on Discovery Science (DS99), volume 1721 of LNAI, pages 30-42, Berlin, 1999. Springer.

[8] M. Naderi Dehkordi, M. H. Shenassa. CLoPAR: Classification based on Predictive Association Rules. In Proc. of 3rd International IEEE Conference in "Intelligent Systems", London, September 2006.

751

[ieee 2008 international conference on computer and communication engineering (iccce) - kuala...

Documents