
XML for Data Warehousing

Chances and Challenges

(Extended Abstract)

Peter Fankhauser and Thomas Klement

Fraunhofer IPSI, Integrated Publication and Information Systems Institute
Dolivostr. 15, 64293 Darmstadt, Germany
{fankhaus,klement}@fraunhofer.ipsi.de

http://www.ipsi.fraunhofer.de

The prospects of XML for data warehousing are staggering. Since a primary purpose of data warehouses is to store non-operational data in the long term, i.e., to exchange them over time, the key reasons for the overwhelming success of XML as an exchange format also hold for data warehouses.

– Expressive power: XML can represent relational data, EDI messages, report formats, and structured documents directly, without information loss, and with uniform syntax.

– Self-describing: XML combines data and metadata. Thereby, heterogeneous and even irregular data can be represented and processed without a fixed schema, which may become obsolete or simply get lost.

– Openness: As a text format with full support for Unicode, XML is not tied to a particular hardware or software platform, which makes it ideally suited for future-proof long-term archival.

But what can we do with an XML data warehouse beyond long-term archival? How can we make sense of these data? How can we cleanse them, validate them, aggregate them, and ultimately discover useful patterns in XML data?

A natural first step is to bring the power of OLAP to XML. Unfortunately, even though in principle XML is well suited to represent multidimensional data cubes, there is not yet a widely agreed-upon standard either for representing data cubes or for querying them. XQuery 1.0 has resisted standardizing even basic OLAP features. Grouping and aggregation require nested for-loops, which are difficult to optimize. XSLT 2.0 (XSL Transformations) has introduced basic grouping mechanisms. However, these mechanisms make it difficult to take hierarchical dimensions into account and, accordingly, to compute derived aggregations at different levels. In the first part of the talk we will introduce a small XML vocabulary for expressing OLAP queries that allows aggregation at different levels of granularity and can fully exploit the document order and nested structure of XML. Moreover, we will illustrate the main optimization and processing techniques for such queries.
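For illustration only (this is not the XML vocabulary introduced in the talk, and the element and attribute names are hypothetical), a minimal Python sketch of aggregating an XML fact list at two levels of a hierarchical dimension:

# Minimal sketch: aggregate an XML fact list at the city level and at the
# coarser region level. Element and attribute names are hypothetical.
import xml.etree.ElementTree as ET
from collections import defaultdict

doc = """
<sales>
  <sale region="USA_N" city="Seattle" amount="120"/>
  <sale region="USA_N" city="Boston"  amount="80"/>
  <sale region="USA_S" city="Austin"  amount="50"/>
</sales>
"""

root = ET.fromstring(doc)
by_city, by_region = defaultdict(float), defaultdict(float)
for sale in root.iter("sale"):
    amount = float(sale.get("amount"))
    by_city[(sale.get("region"), sale.get("city"))] += amount   # fine granularity
    by_region[sale.get("region")] += amount                     # derived aggregation at a coarser level

print(dict(by_city))     # {('USA_N', 'Seattle'): 120.0, ('USA_N', 'Boston'): 80.0, ('USA_S', 'Austin'): 50.0}
print(dict(by_region))   # {'USA_N': 200.0, 'USA_S': 50.0}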



Data cubes constitute only one possible device to deal with the key challenge of XML data warehouses. XML data are notoriously noisy. They often come without a schema or with highly heterogeneous schemas, they rarely explicate dependencies and therefore are often redundant, and they can contain missing and inconsistent values. Data mining provides a wealth of established methods to deal with this situation.

In the second part of the talk, we will illustrate, by way of a simple experiment, how data mining techniques can help in combining multiple data sources and bringing them to effective use. We explore to which extent stable XML technology can be used to implement these techniques. The experiment deliberately focuses on data and mining techniques that cannot be readily represented and realized with standard relational technology. It combines a bilingual dictionary, a thesaurus, and a text corpus (altogether about 150 MB of data) in order to support bilingual search and thesaurus-based analysis of the text corpus. We proceeded in three steps:

None of the data sources was in XML form; therefore they needed to be structurally enriched to XML with a variety of tools. State-of-the-art schema mining combined with an off-the-shelf XML-Schema validator has proven to be very helpful to ensure quality for this initial step by ruling out enrichment errors and spurious structural variations in the initial data.

In the next step, the data were cleansed. The thesaurus contained spurious cycles and missing relationships, and the dictionary suffered from incomplete definitions. These inconsistencies significantly impeded further analysis steps. XSLT, extended with appropriate means to efficiently realize fixpoint queries guided by regular path expressions, turned out to be a quick and dirty means for this step. However, even though cleansing did not go very far, the developed stylesheets reached a considerable level of complexity, indicating the need for better models to express and detect such inconsistencies.
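The authors realized this step with XSLT; purely as an illustration of the kind of check involved, here is a small Python sketch that finds spurious cycles in a (hypothetical) broader-term relation by computing a fixpoint of reachable terms:

def find_cycles(broader):
    """Return the terms that can reach themselves via the broader-term relation."""
    cycles = set()
    for start in broader:
        frontier, seen = {start}, set()
        while frontier:                                     # fixpoint: stop when nothing new is reached
            frontier = {b for t in frontier for b in broader.get(t, ())} - seen
            seen |= frontier
        if start in seen:
            cycles.add(start)
    return cycles

broader = {"poodle": ["dog"], "dog": ["animal"], "animal": ["dog"]}   # 'dog' <-> 'animal' is spurious
print(find_cycles(broader))   # {'dog', 'animal'}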

In the final step, the thesaurus was used to enrich the text corpus with so-called lexical chains, which cluster a text into sentence groups that contain words in a sufficiently close semantic neighborhood. These chains can be used to understand the role of lexical cohesion for text structure, to deploy this structure for finer-grained document retrieval and clustering, and ultimately to enhance the thesaurus with additional relationships. Again, XSLT turned out to be a suitable means to implement the enrichment logic in an ad-hoc fashion, but the lack of higher-level abstractions for both the data structures and the analysis rules resulted in fairly complex stylesheets. On the other hand, XSLT's versatility w.r.t. expressing different structural views on XML turned out to be extremely helpful to flexibly visualize lexical chains.
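The enrichment in the experiment was implemented in XSLT; as a rough illustration of the idea only, a Python sketch that chains consecutive sentences whose word sets lie in a sufficiently close (hypothetical) thesaurus neighborhood:

related = {"bank": {"money", "loan"}, "loan": {"money", "bank"}, "river": {"water"}}

def neighborhood(words):
    # a sentence's word set together with its thesaurus neighbours
    out = set(words)
    for w in words:
        out |= related.get(w, set())
    return out

def lexical_chains(sentences):
    chains, current = [], [sentences[0]]
    for prev, sent in zip(sentences, sentences[1:]):
        if neighborhood(prev) & neighborhood(sent):   # close enough: extend the chain
            current.append(sent)
        else:                                         # cohesion breaks: start a new chain
            chains.append(current)
            current = [sent]
    chains.append(current)
    return chains

sentences = [{"bank", "loan"}, {"money"}, {"river", "water"}]
print(lexical_chains(sentences))   # two chains: the first two sentences cohere, the third starts a new chain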

The main lessons learned from this small experiment are that state-of-the-art XML technology is mature and scalable enough to realize a fairly challenging text mining application. The main benefits of XML show especially in the early steps of data cleansing and enrichment, and in the late steps of interactive analysis. These steps are arguably much harder to realize with traditional data warehouse technology, which requires significantly more data cleansing and restructuring as a prerequisite. On the other hand, the thesaurus-based analysis in Step 3 suffers from the lack of XML-based interfaces to mining methods and tools. Realizing these in XSLT, which has some deficiencies w.r.t. compositionality and expressive power, turns out to be unnecessarily complex.


CPM: A Cube Presentation Model for OLAP

Andreas Maniatis1, Panos Vassiliadis2, Spiros Skiadopoulos1, Yannis Vassiliou1

1 National Technical Univ. of Athens, Dept. of Elec. and Computer Eng., 15780 Athens, Hellas
{andreas,spiros,yv}@dblab.ece.ntua.gr

2 University of Ioannina, Dept. of Computer Science, 45110 Ioannina, Hellas
[email protected]

Abstract. On-Line Analytical Processing (OLAP) is a trend in database technology, based on the multidimensional view of data. In this paper we introduce the Cube Presentation Model (CPM), a presentational model for OLAP data which, to the best of our knowledge, is the only formal presentational model for OLAP found in the literature until today. First, our proposal extends a previous logical model for cubes, to handle more complex cases. Then, we present a novel presentational model for OLAP screens, intuitively based on the geometrical representation of a cube and its human perception in the space. Moreover, we show how the logical and the presentational models are integrated smoothly. Finally, we describe how typical OLAP operations can be easily mapped to the CPM.

1. Introduction

In the last years, On-Line Analytical Processing (OLAP) and data warehousing have become a major research area in the database community [1, 2]. An important issue faced by vendors, researchers and - mainly - users of OLAP applications is the visualization of data. Presentational models are not really a part of the classical conceptual-logical-physical hierarchy of database models; nevertheless, since OLAP is a technology facilitating decision-making, the presentation of data is of major importance. Research-wise, data visualization is presently a quickly evolving field dealing with the presentation of vast amounts of data to the users [3, 4, 5].

In the OLAP field, though, we are aware of only two approaches towards a discrete and autonomous presentation model for OLAP. In the industrial field, Microsoft has already issued a commercial standard for multidimensional databases, where the presentational issues form a major part [6]. In this approach, a powerful query language is used to provide the user with complex reports, created from several cubes (or actually subsets of existing cubes). An example is depicted in Fig. 1. The Microsoft standard, however, suffers from several problems, with two of them being the most prominent ones: First, the logical and presentational models are mixed, resulting in a complex language which is difficult to use (although powerful enough).



Secondly, the model is formalized but not thoroughly: for instance, to our knowledge, there is no definition for the schema of a multicube.

SELECT CROSSJOIN({Venk,Netz},{USA_N.Children,USA_S,Japan}) ON COLUMNS
       {Qtr1.CHILDREN,Qtr2,Qtr3,Qtr4.CHILDREN} ON ROWS
FROM SalesCube
WHERE (Sales,[1991],Products.ALL)

[Fig. 1 shows the resulting cross-tab report for Year = 1991 and Product = ALL: the columns combine the salesmen Venk and Netz with the geography hierarchy (the USA_N cities Seattle and Boston, USA_S, Japan), and the rows combine the quarters Qtr1-Qtr4 with their months, giving the vertical tapes C1-C6 and the horizontal tapes R1-R4.]

Fig. 1: Motivating example for the cube model (taken from [6]).

Apart from the industrial proposal of Microsoft, an academic approach has also been proposed [5]. However, the proposed Tape model seems to be limited in its expressive power (with respect to the Microsoft proposal) and its formal aspects are not yet publicly available.

In this paper we introduce the Cube Presentation Model (CPM). The main idea behind CPM lies in the separation of logical data retrieval (which we encapsulate in the logical layer of CPM) and data presentation (captured by the presentational layer of CPM). The logical layer that we propose is based on an extension of a previous proposal [8] to incorporate more complex cubes. Replacing the logical layer with any other model compatible with classical OLAP notions (like dimensions, hierarchies and cubes) can be easily performed. The presentational layer, at the same time, provides a formal model for OLAP screens. To our knowledge, there is no such result in the related literature. Finally, we show how typical OLAP operations like roll-up and drill-down are mapped to simple operations over the underlying presentational model.

The remainder of this paper is structured as follows. In Section 2, we present the logical layer underlying CPM. In Section 3, we introduce the presentational layer of the CPM model. In Section 4, we present a mapping from the logical to the presentational model and, finally, in Section 5 we conclude our results and present topics for future work. Due to space limitations, we refer the interested reader to the long version of this report for more intuition and rigorous definitions [7].

2. The logical layer of the Cube Presentation Model

The Cube Presentation Model (CPM) is composed of two parts: (a) a logical layer, which involves the formulation of cubes, and (b) a presentational layer, which involves the presentation of these cubes (normally, on a 2D screen). In this section, we present the logical layer of CPM; to this end, we extend a logical model [8] in order to compute more complex cubes. We briefly repeat the basic constructs of the logical model and refer the interested reader to [8] for a detailed presentation of this part of the model. The most basic constructs are:

− A dimension is a lattice of dimension levels (L, ≺), where ≺ is a partial order defined among the levels of L.

− A family of monotone, pairwise consistent ancestor functions anc_{L1}^{L2} is defined, such that for each pair of levels L1 and L2 with L1 ≺ L2, the function anc_{L1}^{L2} maps each element of dom(L1) to an element of dom(L2).

− A data set DS over a schema S = [L1,…,Ln,A1,…,Am] is a finite set of tuples over S such that [L1,…,Ln] are levels, the rest of the attributes are measures, and [L1,…,Ln] is a primary key. A detailed data set DS0 is a data set where all levels are at the bottom of their hierarchies.

− A selection condition φ is a formula involving atoms and the logical connectives ∧, ∨ and ¬. The atoms involve levels, values and ancestor functions, in clauses of the form x ∂ y. A detailed selection condition involves levels at the bottom of their hierarchies.

− A primary cube c (over the schema [L1,…,Ln,M1,…,Mm]) is an expression of the form c = (DS0, φ, [L1,…,Ln,M1,…,Mm], [agg1(M1^0),…,aggm(Mm^0)]), where:
  DS0 is a detailed data set over the schema S = [L1^0,…,Ln^0,M1^0,…,Mk^0], m ≤ k;
  φ is a detailed selection condition;
  M1,…,Mm are measures;
  Li^0 and Li are levels such that Li^0 ≺ Li, 1 ≤ i ≤ n;
  aggi ∈ {sum,min,max,count}, 1 ≤ i ≤ m.

The limitation of primary cubes is that, although they accurately model SELECT-FROM-WHERE-GROUPBY queries, they fail to model (a) ordering, (b) computation of values through functions and (c) selection over computed or aggregate values (i.e., the HAVING clause of a SQL query). To compensate for this shortcoming, we extend the aforementioned model with the following entities:

− Let F be a set of functions mapping sets of attributes to attributes. We distinguish the following major categories of functions: property functions, arithmetic functions and control functions. For example, for the level Day, we can have the property function holiday(Day) indicating whether a day is a holiday or not. An arithmetic function is, for example, Profit=(Price-Cost)*Sold_Items.

− A secondary selection condition ψ is a formula in disjunctive normal form. An atom of the secondary selection condition is true, false or an expression of the form x θ y, where x and y can be one of the following: (a) an attribute Ai (including RANK), (b) a value l or an expression of the form fi(Ai), where Ai is a set of attributes (levels and measures), and (c) θ is an operator from the set {>, <, =, ≥, ≤, ≠}. With this kind of formulae, we can compute relationships between measures (Cost>Price), ranking and range selections (ORDER BY…; STOP after 200, RANK[20:30]), measure selections (sales>3000), and property-based selections (Color(Product)='Green').


− Assume a data set DS over the schema [A1,A2,…,Az]. Without loss of generality, suppose a non-empty subset of the schema S = A1,…,Ak, k ≤ z. Then, there is a set of ordering operations O_S^θ, used to sort the values of the data set with respect to the set of attributes participating in S. θ belongs to the set {<, >, ∅} in order to denote ascending, descending and no order, respectively. An ordering operation is applied over a data set and returns another data set which obligatorily encompasses the measure RANK.

− A secondary cube over the schema S = [L1,…,Ln,M1,…,Mm,Am+1,…,Am+p,RANK] is an expression of the form s = [c, [Am+1:fm+1(Am+1),…,Am+p:fm+p(Am+p)], O_A^θ, ψ], where c = (DS0, φ, [L1,…,Ln,M1,…,Mm], [agg1(M1^0),…,aggm(Mm^0)]) is a primary cube, [Am+1,…,Am+p] ⊆ [L1,…,Ln,M1,…,Mm], A ⊆ S - {RANK}, fm+1,…,fm+p are functions belonging to F and ψ is a secondary selection condition.

With these additions, primary cubes are extended to secondary cubes that incorporate: (a) computation of new attributes (Am+i) through the respective functions (fm+i), (b) ordering (O_A^θ) and (c) the HAVING clause, through the secondary selection condition ψ.
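To make the distinction concrete, here is a small illustrative Python sketch (hypothetical data and column names, not part of the CPM formalism): a primary cube corresponds to a filter plus group-by plus aggregation, and a secondary cube adds a derived attribute, an ordering and a HAVING-like secondary selection.

from itertools import groupby

detailed = [  # detailed data set DS0: (region, month, price, cost) -- hypothetical
    ("USA_N", "Jan", 100, 60), ("USA_N", "Feb", 80, 50), ("Japan", "Jan", 120, 90),
]

def primary_cube(ds, phi, levels, agg):
    # filter by the detailed selection condition, group by the chosen levels, aggregate
    rows = sorted((r for r in ds if phi(r)), key=levels)
    return [(key, agg(list(grp))) for key, grp in groupby(rows, key=levels)]

c = primary_cube(detailed,
                 phi=lambda r: r[3] < 100,                  # detailed selection condition
                 levels=lambda r: r[0],                     # group by region
                 agg=lambda rs: sum(r[2] for r in rs))      # sum(price)

# secondary cube: derived attribute, HAVING-like condition (total > 150), descending order
s = sorted(((region, total, round(total * 0.1, 1)) for region, total in c if total > 150),
           key=lambda row: row[1], reverse=True)

print(c)   # [('Japan', 120), ('USA_N', 180)]
print(s)   # [('USA_N', 180, 18.0)]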

3. The presentational layer of the Cube Presentation Model

In this section, we present the presentation layer of CPM. First, we will give an intuitive, informal description of the model; then we will present its formal definition. Throughout the paper, we will use the example of Fig. 1 as our reference example.

The most important entities of the presentational layer of CPM include:

− Points: A point over an axis resembles the classical notion of points over axes in mathematics. Still, since we group more than one attribute per axis (in order to make things presentable on a 2D screen), formally, a point is a pair comprising a set of attribute groups (with one of them acting as primary key) and a set of equality selection conditions for each of the keys.

− Axis: An axis can be viewed as a set of points. We introduce two special-purpose axes, Invisible and Content. The Invisible axis is a placeholder for the levels of the data set which are not found in the "normal" axes defining the multicube. The Content axis has a more elaborate role: in the case where no measure is found in any axis, the measure which will fill the content of the multicube is placed there.

− Multicubes: A multicube is a set of axes, such that (a) all the levels of the same dimensions are found in the same axis, (b) the Invisible and Content axes are taken into account, (c) all the measures involved are tagged with an aggregate function and (d) all the dimensions of the underlying data set are present in the multicube definition. In our motivating example, the multicube MC is defined as MC = {Rows, Columns, Sections, Invisible, Content}.

− 2D-slice: Consider a multicube MC, composed of K axes. A 2D-slice over MC is a set of (K-2) points, each from a separate axis. Intuitively, a 2D-slice pins the axes of the multicube to specific points, except for 2 axes, which will be presented on the screen (or a printout). In Fig. 2, we depict such a 2D-slice over a multicube.

− Tape: Consider a 2D-slice SL over a multicube MC, composed of K axes. A tape over SL is a set of (K-1) points, where the (K-2) points are the points of SL. A tape is always parallel to a specific axis: out of the two "free" axes of the 2D-slice, we pin one of them to a specific point, which distinguishes the tape from the 2D-slice.

− Cross-join: Consider a 2D-slice SL over a multicube MC, composed of K axes, and two tapes t1 and t2 which are not parallel to the same axis. A cross-join over t1 and t2 is a set of K points, where the (K-2) points are the points of SL and each of the two remaining points is a point on a different one of the remaining axes of the slice.

The query of Fig. 1 is a 2D-slice, say SL. In SL one can identify 4 horizontal tapes (denoted as R1, R2, R3 and R4 in Fig. 1) and 6 vertical tapes (numbered from C1 to C6). The meaning of the horizontal tapes is straightforward: they represent the Quarter dimension, expressed either as quarters or as months. The meaning of the vertical tapes is somewhat more complex: they represent the combination of the dimensions Salesman and Geography, with the latter expressed at the City, Region and Country level. Moreover, two constraints are superimposed over these tapes: the Year dimension is pinned to a specific value and the Product dimension is ignored. In this multidimensional world of 5 axes, the tapes C1 and R1 are defined as:

C1 = [(Salesman='Venk' ∧ anc_city^region(City)='USA_N'), (Year='1991'), (anc_item^ALL(Products)='all'), (Sales, sum(Sales))]
R1 = [(anc_day^month(Month)='Qtr1' ∧ Year='1991'), (Year='1991'), (anc_item^ALL(Products)='all'), (Sales, sum(Sales))]

One can also consider the cross-join t1 defined by the common cells of the tapes R1 and C1. Remember that City defines an attribute group along with [Size(City)].

t1 = ([SalesCube, (Salesman='Venk' ∧ anc_city^region(City)='USA_N' ∧ anc_day^month(Month)='Qtr1' ∧ Year='1991' ∧ anc_item^ALL(Products)='all'), [Salesman, City, Month, Year, Products.ALL, Sales], sum], [Size(City)], true)

In the rest of this section, we describe the presentation layer of CPM formally. First, we extend the notion of dimension to incorporate any kind of attribute (i.e., results of functions, measures, etc.). Consequently, we consider every attribute not already belonging to some dimension to belong to a single-level dimension (with the same name as the attribute), with no ancestor functions or properties defined over it. We will distinguish between the dimensions comprising levels and functionally dependent attributes through the terms level dimensions and attribute dimensions, wherever necessary. The dimensions involving arithmetic measures will be called measure dimensions.

An attribute group AG over a data set DS is a pair [A,DA], where A is a list of attributes belonging to DS (called the key of the group) and DA is a list of attributes dependent on the attributes of A. With the term dependent we mean (a) measures dependent over the respective levels of the data set and (b) function results depending on the arguments of the function. One can consider examples of attribute groups such as ag1 = ([City],[Size(City)]) and ag2 = ([Sales,Expenses],[Profit]).

[Fig. 2 depicts the 2D-slice SL as a cube: the Rows axis holds the four quarter points (anc_day^month(Month)=Qtr1, Quarter=Qtr2, Quarter=Qtr3, anc_day^month(Month)=Qtr4), the Columns axis the six salesman/geography points (1)-(6) combining Salesman='Venk'/'Netz' with anc_city^region(City)='USA_N', Region='USA_S' and Country='Japan', the Sections axis Year=1991 and Year=1992, the Invisible axis Products.ALL='all', and the Content axis Sales, sum(Sales0), true.]

Fig. 2: The 2D-Slice SL for the example of Fig. 1.

A dimension group DG over a data set DS is a pair [D,DD], where D is a list of dimensions over DS (called the key of the dimension group) and DD is a list of dimensions dependent on the dimensions of D. With the term dependent we simply extend the respective definition of attribute groups to cover also the respective dimensions. For reasons of brevity, wherever possible, we will denote an attribute/dimension group comprising only its key simply by the respective attribute/dimension.

An axis schema is a pair [DG,AG], where DG is a list of K dimension groups and AG is an ordered list of K finite ordered lists of attribute groups, where the keys of each (inner) list belong to the same dimension, found in the same position in DG, with K > 0. The members of each ordered list are not necessarily different. We denote an axis schema as a pair ASK = ([DG1×DG2×…×DGK], [[ag1^1,ag1^2,…,ag1^{k1}]×[ag2^1,ag2^2,…,ag2^{k2}]×…×[agK^1,agK^2,…,agK^{kK}]]).

In other words, one can consider an axis schema as the Cartesian product of the respective dimension groups, instantiated at a finite number of attribute groups. For instance, in the example of Fig. 1, we can observe two axis schemata, having the following definitions:

Row_S = {[Quarter],[Month,Quarter,Quarter,Month]}
Column_S = {[Salesman×Geography],[Salesman]×[[City,Size(City)],Region,Country]}

Consider a detailed data set DS. An axis over DS is a pair comprising an axis schema over K dimension groups, where all the keys of its attribute groups belong to DS, and an ordered list of K finite ordered lists of selection conditions (primary or secondary), where each member of the inner lists involves only the respective key of the attribute group:

a = (ASK, [φ1,φ2,...,φK]), K ≤ N, or
a = {[DG1×DG2×…×DGK], [[ag1^1,ag1^2,…,ag1^{k1}]×[ag2^1,ag2^2,…,ag2^{k2}]×…×[agK^1,agK^2,…,agK^{kK}]], [[φ1^1,φ1^2,…,φ1^{k1}]×[φ2^1,φ2^2,…,φ2^{k2}]×…×[φK^1,φK^2,…,φK^{kK}]]}


Practically, an axis is a restriction of an axis schema to specific values, through the introduction of specific constraints for each occurrence of a level. In our motivating example, we have two axes:

Rows = {Row_S, [anc_day^month(Month)=Qtr1, Quarter=Qtr2, Quarter=Qtr3, anc_day^month(Month)=Qtr4]}
Columns = {Column_S, {[Salesman='Venk', Salesman='Netz'], [anc_city^region(City)='USA_N', Region='USA_S', Country='Japan']}}

We will denote the set of dimension groups of each axis a by dim(a).

A point over an axis is a pair comprising a set of attribute groups and a set of equality selection conditions for each one of their keys. For example:

p1 = ([Salesman, [City,Size(City)]], [Salesman='Venk', anc_city^region(City)='USA_N'])

An axis can be reduced to a set of points if one calculates the Cartesian products of the attribute groups and their respective selection conditions. In other words, a = ([DG1×DG2×...×DGK], [p1,p2,…,pl]), with l = k1×k2×…×kK.
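As an illustration of this reduction (a sketch only, with group and condition names copied from the running example), the six points of the Columns axis arise as the Cartesian product of its two inner lists:

from itertools import product

# One list of (attribute group, selection condition) pairs per dimension group
# of the Columns axis; the entries follow the running example.
per_group = [
    [("Salesman", "Salesman='Venk'"), ("Salesman", "Salesman='Netz'")],
    [("[City,Size(City)]", "anc_city^region(City)='USA_N'"),
     ("Region", "Region='USA_S'"),
     ("Country", "Country='Japan'")],
]

points = list(product(*per_group))   # l = 2 * 3 = 6 points, one per vertical tape C1..C6
print(len(points))                   # 6
print(points[0])                     # the point behind tape C1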

Two axis schemata are joinable over a data set if their key dimensions (a) belong to the set of dimensions of the data set and (b) are disjoint. For instance, Row_S and Column_S are joinable.

A multicube schema over a detailed data set is a finite set of axis schemata fulfilling the following constraints:

1. All the axis schemata are pair-wise joinable over the data set.
2. The key of each dimension group belongs to only one axis.
3. Similarly, from the definition of the axis schema, the attributes belonging to a dimension group are all found in the same axis.
4. Two special-purpose axes called Invisible and Content exist. The Content axis can take only measure dimensions.
5. All the measure dimensions of the multicube are found in the same axis. If more than one measure exists, they cannot be found in the Content axis.
6. If no measure is found in any of the "normal" axes, then a single measure must be found in the axis Content.
7. Each key measure is tagged with an aggregate function over a measure of the data set.
8. For each attribute participating in a group, all the members of the group are found in the same axis.
9. All the level dimensions of the data set are found in the union of the axis schemata (if some dimensions are not found in the "normal" axes, they must be found in the Invisible axis).

The role of the Invisible axis follows: it is a placeholder for the levels of the data set which are not to be taken into account in the multicube. The Content axis has a more elaborate role: in the case where no measure is found in any axis (like in the example of Fig. 1), the measure which will fill the content of the multicube is placed there. If more than one measure is found, they must all be placed in the same "normal" axis (not Content), as placing them in Content would cause a problem of presentation on a two-dimensional space.

A multicube over a data set is defined as a finite set of axes whose schemata define a multicube schema. The following constraints must be met:

1. Each point from a level dimension, not in the Invisible axis, must have an equality selection condition, returning a finite number of values.
2. The rest of the points can have arbitrary selection conditions (including "true" - for the measure dimensions, for example).

For example, suppose a detailed data set SalesCube under the schema

S = [Quarter.Day, Salesman.Salesman, Geography.City, Time.Day, Product.Item, Sales, PercentChange, BudgetedSales]

Suppose also the following axis schemata over DS0:

Row_S = {[Quarter],[Month,Quarter,Quarter,Month]}
Column_S = {[Salesman×Geography],[Salesman]×[[City,Size(City)],Region,Country]}
Section_S = {[Time],[Year]}
Invisible_S = {[Product],[Product.ALL]}
Content_S = {[Sales],[sum(Sales0)]}

and their respective axes:

Rows = {Row_S, [anc_day^month(Month)=Qtr1, Quarter=Qtr2, Quarter=Qtr3, anc_day^month(Month)=Qtr4]}
Columns = {Column_S, {[Salesman='Venk', Salesman='Netz'], [anc_city^region(City)='USA_N', Region='USA_S', Country='Japan']}}
Sections = {Section_S, [Year=1991, Year=1992]}
Invisible = {Invisible_S, [ALL='all']}
Content = {Content_S, [true]}

Then, a multicube MC can be defined as MC = {Rows, Columns, Sections, Invisible, Content}.

Consider a multicube MC, composed of K axes. A 2D-slice over MC is a set of (K-2) points, each from a separate axis, where the points of the Invisible and the Content axes are comprised within the points of the 2D-slice. Intuitively, a 2D-slice pins the axes of the multicube to specific points, except for 2 axes, which will be presented on a screen (or a printout).

Consider a 2D-slice SL over a multicube MC, composed of K axes. A tape over SL is a set of (K-1) points, where the (K-2) points are the points of SL. A tape is always parallel to a specific axis: out of the two "free" axes of the 2D-slice, we pin one of them to a specific point, which distinguishes the tape from the 2D-slice. A tape is thus defined, more restrictively than the 2D-slice, by a single additional point: we will call this point the key of the tape with respect to its 2D-slice. Moreover, if a 2D-slice has two free axes a1, a2 with size(a1) and size(a2) points each, then one can define size(a1)+size(a2) tapes over this 2D-slice.

Consider a 2D-slice SL over a multicube MC, composed of K axes. Consider also two tapes t1 and t2 which are not parallel to the same axis. A cross-join over t1 and t2 is a set of K points, where the (K-2) points are the points of SL and each of the two remaining points is a point on a different one of the remaining axes of the slice. Two tapes are joinable if they can produce a cross-join.
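The following small Python sketch (hypothetical point labels, following the running example) spells out the counting: each point on a free axis yields a tape, and each pair of non-parallel tapes yields a cross-join.

from itertools import product

# The 2D-slice pins Sections, Invisible and Content; Rows and Columns stay free.
pinned  = {"Sections": "Year=1991", "Invisible": "Products.ALL='all'", "Content": "sum(Sales)"}
rows    = ["Qtr1 (months)", "Qtr2", "Qtr3", "Qtr4 (months)"]                  # 4 points
columns = ["Venk/USA_N", "Venk/USA_S", "Venk/Japan",
           "Netz/USA_N", "Netz/USA_S", "Netz/Japan"]                          # 6 points

tapes = [("Rows", r) for r in rows] + [("Columns", c) for c in columns]
cross_joins = [dict(pinned, Rows=r, Columns=c) for r, c in product(rows, columns)]

print(len(tapes))        # 4 + 6 = 10 tapes over the 2D-slice
print(len(cross_joins))  # 4 * 6 = 24 cross-joins, each one a fully pinned (secondary) cube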


4. Bridging the presentation and the logical layers of CPM

Cross-joins form the bridge between the logical and the presentational model. In this section we provide a theorem proving that a cross-join is a secondary cube. Then, we show how common OLAP operations can be performed on the basis of our model. The proofs can be found in [7].

Theorem 1. A cross-join is equivalent to a secondary cube.

The only difference between a tape and a cross-join is that the cross-join restricts all of its dimensions with equality constraints, whereas the tape constrains only a subset of them. Moreover, from the definition of joinable tapes it follows that a 2D-slice contains as many cross-joins as the number of pairs of joinable tapes belonging to this particular slice. This observation also helps us to understand why a tape can also be viewed as a collection of cross-joins (or cubes). Each of these cross-joins is defined by the (K-1) points of the tape and one point taken from one of its joinable tapes; this point belongs to the points of the axis the tape is parallel to. Consequently, we are allowed to treat a tape as a set of cubes: t = [c1,…,ck]. Thus we have the following lemma.

Lemma 1. A tape is a finite set of secondary cubes.

We briefly describe how the usual operations of OLAP tools, such as roll-up, drill-down, pivot, etc., can be mapped to operations over 2D-slices and tapes.

− Roll-up. Roll-up is performed over a set of tapes. Initially, the key points of these tapes are eliminated and replaced by their ancestor values. Then the tapes are also eliminated and replaced by tapes defined by the respective keys of these ancestor values. The cross-joins that emerge can be computed through the appropriate aggregation of the underlying data.

− Drill-down. Drill-down is exactly the opposite of the roll-up operation. The only difference is that, normally, the existing tapes are not removed, but rather complemented by the tapes of the lower-level values.

− Pivot. Pivot means moving one dimension from an axis to another. The contents of the 2D-slice over which pivot is performed are not recomputed; instead, they are just reorganized in their presentation.

− Selection. A selection condition (primary or secondary) is evaluated against the points of the axes, or the content of the 2D-slice. In every case, the calculation of the new 2D-slice is based on the propagation of the selection to the already computed cubes.

− Slice. Slice is a special form of roll-up, where a dimension is rolled up to the level ALL. In other words, the dimension is not taken into account any more in the groupings over the underlying data set. Slicing can also mean the reconstruction of the multicube by moving the sliced dimension to the Invisible axis.

− ROLLUP [9]. In the relational context, the ROLLUP operator takes all combinations of attributes participating in the grouping of a fact table and produces all the possible tables, with these marginal aggregations, out of the original query. In our context, this can be done by producing all combinations of Slice operations over the levels of the underlying data set. One can even go further by combining roll-ups to all the combinations of levels in a hierarchy.

5. Conclusions and Future Work

In this paper we have introduced the Cube Presentation Model, a presentation model for OLAP data which formalizes previously proposed standards for a presentation layer and which, to the best of our knowledge, is the only formal presentational model for OLAP in the literature. Our contributions can be listed as follows: (a) we have presented an extension of a previous logical model for cubes, to handle more complex cases; (b) we have introduced a novel presentational model for OLAP screens, intuitively based on the geometrical representation of a cube and its human perception in the space; (c) we have discussed how these two models can be smoothly integrated; and (d) we have suggested how typical OLAP operations can be easily mapped to the proposed presentational model.

Next steps in our research include the introduction of suitable visualization techniques for CPM, complying with current standards and recommendations as far as usability and user interface design are concerned, and its extension to address the specific visualization requirements of mobile devices.

References

[1] S. Chaudhuri, U. Dayal: An Overview of Data Warehousing and OLAP Technology. ACM SIGMOD Record, 26(1), March 1997.
[2] P. Vassiliadis, T. Sellis: A Survey of Logical Models for OLAP Databases. ACM SIGMOD Record, 28(4), Dec. 1999.
[3] D. A. Keim: Visual Data Mining. Tutorials of the 23rd International Conference on Very Large Data Bases, Athens, Greece, 1997.
[4] A. Inselberg: Visualization and Knowledge Discovery for High Dimensional Data. 2nd Workshop Proceedings UIDIS, IEEE, 2001.
[5] M. Gebhardt, M. Jarke, S. Jacobs: A Toolkit for Negotiation Support Interfaces to Multi-Dimensional Data. ACM SIGMOD 1997, pp. 348-356.
[6] Microsoft Corp.: OLEDB for OLAP, February 1998. Available at: http://www.microsoft.com/data/oledb/olap/.
[7] A. Maniatis, P. Vassiliadis, S. Skiadopoulos, Y. Vassiliou: CPM: A Cube Presentation Model (long version). http://www.dblab.ece.ntua.gr/~andreas/publications/CPM_dawak03.pdf
[8] P. Vassiliadis, S. Skiadopoulos: Modeling and Optimization Issues for Multidimensional Databases. Proc. of CAiSE 2000, Stockholm, Sweden, 2000.
[9] J. Gray et al.: Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab and Sub-Totals. Proc. of ICDE 1996.


Computation of Sparse Data Cubes with Constraints

Changqing Chen1, Jianlin Feng2, and Longgang Xiang3

1 School of Software, Huazhong Univ. of Sci. & Tech., Wuhan 430074, Hubei, China [email protected]

2 School of Computer Science, Huazhong Univ. of Sci. & Tech., Wuhan 430074, Hubei, China [email protected]

3 School of Computer Science, Huazhong Univ. of Sci. & Tech., Wuhan 430074, Hubei, China

[email protected]

Abstract. For a data cube there are always constraints between dimensions or between attributes in a dimension, such as functional dependencies. We introduce the problem of how to use such functional dependencies, when they exist, to speed up the computation of sparse data cubes. A new algorithm, CFD, is presented to satisfy this demand. CFD determines the order of dimensions by considering their cardinalities and the functional dependencies between them together. It makes dimensions with functional dependencies adjacent and lets their codes satisfy a monotonic mapping, and thus reduces the number of partitions for such dimensions. It also combines bottom-up partitioning with top-down aggregate computation to speed up the computation further. In addition, CFD can efficiently compute a data cube with hierarchies from the smallest granularity to the coarsest one, with at most one attribute in a dimension taking part in the computation each time. Experiments show that CFD achieves a significant performance improvement.

1 Introduction

OLAP often pre-computes a large number of aggregates to improve the performance of aggregation queries. A new operator, CUBE BY [5], was introduced to represent a set of group-by operations, i.e., to compute aggregates for all possible combinations of attributes in the CUBE BY clause. The following Example 1 shows a cube computation query on a relation SALES (employee, product, customer, quantity).

Example 1:
SELECT employee, product, customer, SUM (quantity)
FROM SALES
CUBE BY employee, product, customer

It will compute group-bys for (employee, product, customer), (employee, product), (employee, customer), (product, customer), (employee), (product), (customer) and ALL (no GROUP BY). The attributes in the CUBE BY clause are called dimensions and the attributes aggregated are called measures. For n dimensions, 2^n group-bys are computed. The number of distinct values of a dimension is its cardinality. Each combination of attribute values from different dimensions constitutes a cell. If empty cells are a majority of the whole cube, then the cube is sparse.
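As a quick illustration of the 2^n group-bys (a sketch, using the dimensions of Example 1):

from itertools import combinations

dims = ("employee", "product", "customer")
group_bys = [combo for r in range(len(dims), -1, -1)
             for combo in combinations(dims, r)]
print(group_bys)        # from ('employee', 'product', 'customer') down to () -- the ALL group-by
print(len(group_bys))   # 8 = 2**3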

Relational normal forms are hardly suitable for OLAP cubes because of the different goals of operational and OLAP databases. The main goal of operational databases is to avoid update anomalies, and the relational normal forms are very well adapted to this goal. But for OLAP databases the efficiency of queries is the most important issue. So there are always constraints between dimensions or between attributes in a dimension of a cube, such as functional dependencies. Sparsity clearly depends on the actual data. However, functional dependencies between dimensions may imply potential sparsity [4]. A tacit assumption of all previous algorithms is that dimensions are independent of each other, and so none of these algorithms considered the effect of functional dependencies on computing cubes.

Algebraic functions COUNT, SUM, MIN and MAX have the key property that more detailed aggregates (i.e., more dimensions) can be used to compute less detailed aggregates (i.e., fewer dimensions). This property induces a partial ordering (i.e., a lattice) on all group-bys of the CUBE. A group-by is called a child of some parent group-by if the parent can be used to compute the child (and no other group-by is between the parent and the child). The algorithms [1, 2, 3, 6] recognize that group-bys with common attributes can share partitions, sorts, or partial sorts. The difference between them is how they exploit such properties. Among these algorithms, BUC [1] computes bottom-up, while the others compute top-down.

This paper addresses full cube computation over sparse data cubes and makes the following contributions:

1. We introduce the problem of computation of sparse data cubes with constraints, which allows us to use such constraints to speed up the computation. A new algorithm CFD (Computation by Functional Dependencies) is presented to satisfy this demand. CFD determines the partitioning order of dimensions by considering their cardinalities and the functional dependencies between them together. Therefore the correlated dimensions can share sorts.

2. CFD partitions the group-bys of a data cube bottom-up, and at the same time it computes aggregate values top-down by summing up the return values of smaller partitions. Even if all the dimensions are independent of each other, CFD is still faster than BUC at computing full cubes.

3. Few algorithms deal with hierarchies in dimensions. CFD can compute a sparse data cube with hierarchies in dimensions. In this situation, CFD efficiently computes from the smallest granularity to the coarsest one.

The rest of this paper is organized as follows: Section 2 presents the problem of sparse cubes with constraints. Section 3 illustrates how to decide the partitioning order of dimensions. Section 4 presents a new algorithm, called CFD, for the computation of sparse cubes. Our performance analysis is described in Section 5. Related work is discussed in Section 6. Section 7 contains conclusions.


2 The Problem

Let C = D ∪ M be an OLAP cube schema, where D is the set of dimensions and M the set of measures. Two attributes X and Y with a one-to-one or many-to-one relation have a functional dependency X → Y, where X is called a determining attribute and Y a depending attribute. Such a functional dependency can exist between two dimensions or between two attributes in a dimension. The problem is, when there are such constraints (functional dependencies), how to use them to speed up the computation of sparse cubes. The dependencies considered in CFD are only those whose left and right sides each contain a single attribute. Such functional dependencies will help in data pre-processing (see Section 3.2) and in partitioning dimensions (see Section 4.1).

Functional dependencies between dimensions imply the structural sparsity of a cube [4]. With no functional dependencies, the structural sparsity is zero. Considering the cube in Example 1, if we know that one employee sells only one product, we get a functional dependency employee → product. Assume we have 6 employees, 4 customers, and 3 different products; then the size of the cube is 72 cells. Furthermore, the total number of occupied cells in the whole cube is at most 6 × 4 = 24, thus the structural sparsity is 67%.
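A quick check of this arithmetic (a sketch of the bound, not part of the paper's algorithm):

# With employee -> product, an (employee, product, customer) cell can only be
# occupied when the product is the one determined by the employee, so at most
# |employee| * |customer| cells can ever hold data.
n_employee, n_customer, n_product = 6, 4, 3

total_cells    = n_employee * n_product * n_customer   # 72
occupied_bound = n_employee * n_customer               # 24
print(f"structural sparsity >= {1 - occupied_bound / total_cells:.0%}")   # 67%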

3 Data Preprocessing

CFD partitions bottom-up just like BUC, i.e., it first partitions on one dimension, then on two dimensions, and so on. One difference between CFD and BUC is that CFD chooses the order of dimensions by functional dependencies and cardinalities together.

3.1 Deciding the Order of Dimensions

First we build a directed graph from the functional dependencies between dimensions, called the FD graph. The graph ignores all transitive dependencies (i.e., dependencies that can be deduced from other dependencies). A node in the graph is a dimension. Once the graph has been built, we try to classify the nodes. We find the longest path in the graph in order to make the most of the dependencies. The nodes in such a path form a dependency set and are deleted from the graph. This process is repeated until the graph is empty. The time complexity of this process is O(n²), where n is the number of dimensions.

Example 2: A cube has six dimensions from A to F, with cardinalities in descending order and functional dependencies A → C, A → D, C → E, B → F. Figure 1 is the corresponding FD graph. From Figure 1, we first get the dependency set {A, C, E}, since these nodes form the longest path, then {B, F} and at last {D}. The elements in each set are ordered by the dependencies. Although there is a functional dependency between A and D, it is not considered, so the dependency set {D} contains only the dimension D itself.

After getting the dependency sets, CFD sorts them in descending order of the biggest cardinality of a dimension in each set. Then we merge the sets sequentially to determine the order of dimensions. By this approach, CFD can make the depending dimension share the sort of the determining dimension, because such two dimensions are placed next to each other. If there is no functional dependency, the partitioning order of CFD is just the same as that of BUC.
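An illustrative Python sketch of this ordering step (not the authors' implementation; the dimension names follow Example 2 and the cardinality values are hypothetical, chosen only to be descending from A to F):

def longest_path(graph):
    # depth-first search for the longest simple path in the (small) FD graph
    def walk(node, seen):
        best = [node]
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                cand = [node] + walk(nxt, seen | {nxt})
                if len(cand) > len(best):
                    best = cand
        return best
    return max((walk(n, {n}) for n in graph), key=len)

def dimension_order(fds, cardinality):
    graph = {d: [] for d in cardinality}
    for x, y in fds:
        graph[x].append(y)
    dependency_sets = []
    while graph:
        path = longest_path(graph)
        dependency_sets.append(path)
        for n in path:                                  # remove the path's nodes ...
            graph.pop(n, None)
        for node in graph:                              # ... and the edges into them
            graph[node] = [n for n in graph[node] if n not in path]
    dependency_sets.sort(key=lambda s: max(cardinality[d] for d in s), reverse=True)
    return [d for s in dependency_sets for d in s]

# Example 2: cardinalities descending from A to F, FDs A->C, A->D, C->E, B->F
card = {"A": 60, "B": 50, "C": 40, "D": 30, "E": 20, "F": 10}
print(dimension_order([("A", "C"), ("A", "D"), ("C", "E"), ("B", "F")], card))
# ['A', 'C', 'E', 'B', 'F', 'D']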

[Fig. 1 shows the FD graph of Example 2: nodes A–F with edges A→C, A→D, C→E and B→F. Fig. 2 shows the encoding of the Example 1 pair employee→product: the six (employee, product) rows (Tom/towel, Bob/soap, Smith/soap, White/shaver, Louis/soap, Ross/towel) are first sorted on product (towel: Tom, Ross; soap: Bob, Smith, Louis; shaver: White) and then encoded, giving employee codes 0–5 and product codes 0, 0, 1, 1, 1, 2.]

Fig. 1. FD graph. Fig. 2. The encoding of two dimensions with a functional dependency.

3.2 Data Encoding

Like other algorithms for computing a data cube, CFD assumes that each dimension value is an integer between zero and the dimension's cardinality, and that the cardinality is known in advance. A usual data encoding does not consider the correlations between dimensions and simply maps each dimension value to an integer between zero and the cardinality. This operation is similar to sorting on the values of a dimension.

In order to share sorts, CFD encodes adjacent dimensions with functional dependencies jointly, so that their codes satisfy a monotonic mapping. For example, let X and Y be two dimensions and f a functional dependency from X to Y. Assume xi and xj are two arbitrary values on dimension X, and yi = f(xi) and yj = f(xj) are the corresponding values on dimension Y. If xi > xj implies yi ≥ yj, then y = f(x) is monotonic. Due to the functional dependency between X and Y, the encoding approach is to sort on dimension Y first; then the values of X and Y can be mapped sequentially to the ranges from zero to their respective cardinalities. Figure 2 shows the encoding of two dimensions with the functional dependency employee → product from Example 1. Obviously, if the left or right side of a functional dependency has more than one attribute, it is difficult to encode in this way. Note that the mapping relations can be reflected in the fact table for correlated dimensions, but for hierarchies in a dimension the mapping relations should be stored separately.
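A minimal sketch of such a joint encoding (hypothetical helper; it sorts alphabetically on the depending attribute, so the concrete codes differ from Fig. 2, but the monotonic-mapping property is the same):

def encode_jointly(rows):
    """rows: (x, y) pairs with a functional dependency x -> y; returns integer codes."""
    rows = sorted(rows, key=lambda r: r[1])            # sort on the depending dimension Y first
    x_code, y_code, encoded = {}, {}, []
    for x, y in rows:
        y_code.setdefault(y, len(y_code))              # number Y values in sort order
        x_code.setdefault(x, len(x_code))              # number X values in the same pass
        encoded.append((x_code[x], y_code[y]))
    return encoded

facts = [("Tom", "towel"), ("Bob", "soap"), ("Smith", "soap"),
         ("White", "shaver"), ("Louis", "soap"), ("Ross", "towel")]
print(encode_jointly(facts))
# [(0, 0), (1, 1), (2, 1), (3, 1), (4, 2), (5, 2)]
# employee codes increase while product codes never decrease (monotonic mapping)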

4 Algorithm CFD

We propose a new algorithm called CFD for the computation of full sparse data cubes with constraints. The idea of CFD is to take advantage of functional dependencies to share partitions, and to make use of the property of algebraic functions to reduce aggregation costs. CFD was inspired by the BUC algorithm and is similar to a version of BUC except for the aggregation computation and the partition function. After data preprocessing, we can now compute a sparse data cube.



The details of CFD are given in Figure 3. The first step is to aggregate the entire input into aggval[numdims] when arriving at the smallest partition. For each dimension d between dim and numdims, the input is partitioned on dimension d (line 6). On return from Partition(), dataCount contains the number of records for each distinct value of the d-th dimension. Line 8 iterates through the partitions (i.e., each distinct value). The partition becomes the input relation in the next recursive call to CFD, which computes the cube on the partition for dimensions (d+1) to numdims. Upon return from the recursive call, we sum up the aggregation results of the smaller partitions (line 14) and continue with the next partition of dimension d. Once all the partitions are processed, we repeat the whole process for the next dimension.

We use the same optimization WriteAncestors() as in BUC (line 11). WriteAncestors() simply outputs each of the ancestor group-bys, to avoid fruitless partitioning and computation when the partition contains only one tuple. For a data cube with dimensions A, B, C, D and a single-tuple partition (a1, b1), WriteAncestors() directly outputs (a1, b1), (a1, b1, c), (a1, b1, c, d) and (a1, b1, d).

CFD(input, dim)
Inputs:
  input: The relation to aggregate.
  dim: The starting dimension to partition.
Globals:
  numdims: the total number of dimensions.
  dependentable[]: the dependency sets obtained in Section 3.1.
  hierarchy[numdims]: the height of the hierarchies in each dimension.
  cardinality[numdims][]: the cardinality of each dimension (per hierarchy level).
  dataCount[numdims]: the size of each partition.
  aggval[numdims]: sums of the results of smaller partitions.

1:  IF dim == numdims THEN aggval[dim] = Aggregate(input);   // the result of a thinnest partition
2:  FOR d = dim; d < numdims; d++ DO
3:    FOR h = 0; h < hierarchy[d]; h++ DO
4:      aggval[d] = 0;
5:      C = cardinality[d][h];
6:      Partition(input, d, dependentable[], C, dataCount[d]);   // partition by constraints
7:      k = 0;
8:      FOR i = 0; i < C; i++ DO
9:        n = dataCount[d][i];
10:       IF n == 1 THEN
11:         WriteAncestors(input[0], dim);   // single-tuple optimization
12:       ELSE
13:         CFD(input[k, k+1, …, k+n-1], d+1);
14:         aggval[d] += aggval[d+1];   // sum up the results of smaller partitions
15:       END IF
16:       k += n;
17:     END FOR
18:   END FOR
19: END FOR

Fig. 3. Algorithm Computation with Functional Dependencies (CFD)


4.1 Partitioning

CFD partitions from the bottom of the lattice and works its way up towards the larger, less detailed aggregates. When CFD partitions, Partition() determines whether two dimensions are dependent using dependentable[], obtained in Section 3.1.

[Fig. 4 shows the six example tuples over the dimensions A, C, B, D and how they are encoded and partitioned; Fig. 5 shows the CFD processing tree, whose numbering visits the 16 group-bys in the order A, AC, ACB, ACBD, ACD, AB, ABD, AD, C, CB, CBD, CD, B, BD, D, ALL.]

Fig. 4. Encoding and partitioning of CFD. Fig. 5. CFD processing tree.

Example 3: A data cube has four dimensions from A to D, with cardinalities in descending order and a dependency A → C. The order of dimensions is A, C, B and D.

Figure 4 illustrates how the input of Example 3 is partitioned during the first calls of CFD. First CFD partitions on dimension A, producing partitions a1 to a4; then it recursively partitions the partition a1 on dimension C, then on dimension B. Because of the dependency A → C and the monotonic mapping, CFD will not sort on dimension C.

This is one key optimization of CFD, and it does not affect the efficiency of other optimizations such as finding single tuples. For independent dimensions, CountingSort, QuickSort and InsertSort can be used by CFD just as in BUC.

4.2 Data Cube Computation

Another key factor in the success of CFD is that it takes advantage of algebraic functions by summing up the results of the recursive calls from the smallest partitions (lines 1 and 14 of Figure 3). By this approach, CFD can save about half of the time spent scanning the whole relation, and it is faster than BUC even when dimensions are independent. Figure 5 shows the CFD processing tree for Example 3. The numbers indicate the order in which CFD visits the group-bys, and CFD produces the empty group-by ALL last.

We found that if we directly use the return value of the recursive call to aggregate the results, CFD actually runs more slowly than BUC, just as reported in [1]. This may be a side effect of stack operations. Instead, we use the array aggval[numdims] to record and aggregate the results, and with this change CFD really does run faster than BUC. On the negative side, CFD cannot efficiently compute holistic functions.

4.3 Data Cube with Hierarchies

Some approaches have been proposed to reduce the risk of structural sparsity, i.e., dimensions with functional dependencies are decomposed into hierarchical dimensions [4, 11]. Hierarchical dimensions, which enable the user to analyze data at different levels of aggregation, are essential for OLAP.


CFD can also efficiently compute a cube with hierarchies in dimensions (line 3 of Figure 3). Because the attributes in a dimension have functional dependencies from the smaller granularity (the smallest one is the key) to the coarser one, this computation is similar to that of a cube with constraints between dimensions.

4.4 Memory Requirements

The memory requirement of CFD is slightly more than that of BUC. CFD tries to fit a partition of tuples in memory as soon as possible. Say that a partition has X tuples and each tuple requires N bytes. Our implementation also uses pointers to the tuples, and CountingSort requires a second set of pointers for temporary use. Let d0, …, dn-1 be the cardinalities of the dimensions of a data cube, and dmax the maximum cardinality. CountingSort uses dmax counters, and CFD uses Σ_{i=0}^{n-1} di counters. In order to aggregate results from the smallest partitions, an array of size n is needed for an aggregation function. If the counters and pointers are each four bytes and every element in the above array is also four bytes, the total memory requirement in bytes for CFD is:

(N + 8)·X + 4·Σ_{i=0}^{n-1} di + 4·dmax + 4·n
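To make the bound concrete, a small sketch evaluating it for hypothetical parameter values:

def cfd_memory_bytes(X, N, cardinalities):
    """Memory bound above: X tuples of N bytes, dimension cardinalities d_0..d_{n-1}."""
    n, dmax = len(cardinalities), max(cardinalities)
    return (N + 8) * X + 4 * sum(cardinalities) + 4 * dmax + 4 * n

# e.g. one million 32-byte tuples over eight dimensions of cardinality 10
print(cfd_memory_bytes(1_000_000, 32, [10] * 8))   # 40000392 bytes, roughly 40 MB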

5 Performance Analysis

Since MemoryCube was the first algorithm to compute sparse cubes and BUC is faster than MemoryCube, we concentrate on comparing CFD with BUC for computing full cubes. We implemented CFD for main memory only; the implementation of BUC had the same restriction. CFD can be smoothly extended to perform external partitioning. We did not count the time to read the file and write the results. We only measured running time as the total cost, including partitioning time and aggregation time. We also neglected the cost of determining the order of dimensions.

5.1 Qualitative Analysis

First we analyze the time for aggregation computation, assuming the dimensions are independent of each other. For a cube with n dimensions, 2^n group-bys need to be computed. BUC needs to scan the whole relation 2^n times. CFD uses the results of the smallest partitions to aggregate the results of coarser partitions. This means CFD can save about half of the aggregation-computation time. Due to the single-tuple optimization, CFD actually saves 3-5% of the total cost compared to BUC when dimensions are independent.


Next we analyze the partitioning time. For a cube with n dimensions in the order 0, 1, …, n-1, the numbers of partitions for the dimensions are 2^0, 2^1, …, 2^(n-1), respectively. Say that the partitioning time for each dimension is one unit. Considering a functional dependency di → di+1, CFD saves 2^(i+1) units by sharing the sort of the i-th dimension when partitioning the (i+1)-th dimension. If k functional dependencies are considered and the depending dimensions are i1, …, ik, then the total time saved is Σ_{m=1}^{k} 2^(i_m), with k < n. So Partition() of CFD is further optimized compared to that of BUC.

It seems that if we push the dimensions with dependencies further backward, we will save more partitioning time. But such an approach neglects the optimization of finding single tuples as early as possible. Experiments have shown that the performance may decrease if we do so.

5.2 Full Cube Computation

All tests were run on a machine with 128 MB of memory and a 500 MHz Pentium processor. The data were randomly generated with a uniform distribution. The sparse data cube used in our experiments had eight dimensions from A to H, and each dimension had a cardinality of 10. The whole data set could be loaded into memory.

Figure 6 compares CFD with BUC for computing a full cube with five sum functions and four functional dependencies, A → E, B → F, C → G, and D → H, varying the number of tuples. For 1.5 million tuples, CFD saved about 25% of the time compared to BUC.

The time saved above comes from two factors: the time for partitioning and the time for aggregation computation. Figure 7 shows the computation time for one million tuples with independent dimensions, varying the number of aggregation functions. For five sum functions, CFD saved 15% compared to BUC.

Figure 8 compares CFD with BUC when functional dependencies are considered. The data for this test were one million tuples. We considered one of the dependencies from Figure 6 at a time. As the position of the dependency moves backwards, the running time decreases quickly.

[Figures 6-8 plot running time in seconds for BUC and CFD: Fig. 6 against the number of tuples (0.5-1.5 million), Fig. 7 against the number of sum functions (1-5), and Fig. 8 (CFD only) against the position of a dependency (0-4).]

Fig. 6. Computation of a cube. Fig. 7. Aggregation functions. Fig. 8. Different position.

Figure 9 compares CFD with BUC for just one sum function, assuming the dimensions are independent of each other. The data varied from 0.5 million tuples to 1.5 million tuples. As the number of tuples increased, single tuples decreased and CFD saved more aggregation-computation time than BUC. We note that CFD still saved 3% of the total time on the sparsest cube, with 0.5 million tuples. This experiment shows that CFD is more adaptive than BUC for computing data cubes with different sparsities.


[Figures 9 and 10 plot running time in seconds against the number of tuples (million): Fig. 9 compares BUC and CFD with independent dimensions (0.5-1.5 million tuples); Fig. 10 compares CFD with and without functional dependencies for a cube with hierarchies (0.2-1 million tuples).]

Fig. 9. Dimensions independent. Fig. 10. Hierarchies in dimensions.

5.3 Hierarchies

This experiment explores the computation of a data cube with hierarchies, varying the number of tuples. The dimensions were ordered by cardinality: 500, 200, 100, 80, 60 and 50. Within each dimension, the cardinality of a coarser level was one third of that of the adjacent finer level. There were three hierarchy levels in the first two dimensions and two in the other dimensions.

The results are shown in Figure 10. The bottom line shows the computation considering functional dependencies between attributes within a dimension, and the upper line shows the computation without considering those correlations. The running time increased by 30% when the functional dependencies were not considered.

5.4 Weather Data

We compared CFD with BUC on a real nine-dimensional dataset containing weather conditions at various land weather stations for September 1985 [12]. The dataset contained 1,015,367 tuples. The attributes, ordered by cardinality, were: station (7037), longitude (352), solar-altitude (179), latitude (152), present-weather (101), day (30), weather-change-code (10), hour (8), and brightness (2). Some of the attributes were significantly correlated (e.g., only one station is located at a given (latitude, longitude) pair).

The experiment shows that CFD is effective on a real dataset. The running time of CFD showed a 3% improvement over BUC when the functional dependency station→latitude was considered. When (latitude, longitude) was treated as a hierarchy of the dimension station, CFD showed a 5% improvement over BUC, which does not consider dependencies.

6 Related Work

Since the introduction of the concept of the data cube [5], efficient computation of data cubes has been a theme of active research, such as [1, 2, 3, 7, 8, 9, 10]. None of these earlier algorithms considered the effect of constraints on the computation of a data cube. Following Lehner et al., a general principle in OLAP design is that dimensions should be independent, and inside a dimension there should be a single attribute key for the dimension [11]. But it is not always suitable to put all correlated attributes in one dimension to satisfy such a principle, as the weather data in [12] showed.

While some work concentrated on computing a data cube quickly, other work dealt with the size problem of a cube [7, 8, 10]. Computing iceberg cubes rather than complete cubes was also proposed in [2, 9].

7 Conclusions

We introduce the problem of computing sparse data cubes with constraints and present a novel algorithm, CFD, for this problem. CFD decides the order of dimensions by taking cardinalities and functional dependencies into account together. In the meantime, CFD partitions group-bys from the bottom up and aggregates the results of partitions from the top down. CFD thus combines the advantage of bottom-up partitioning, as in BUC, with the advantage of top-down aggregation computation, as in PipeSort and related algorithms. Our approach can efficiently speed up the computation of sparse data cubes as well as cubes with hierarchies.

References

1. K. Beyer, R. Ramakrishnan. Bottom-Up Computation of Sparse and Iceberg CUBEs. SIGMOD'99, pages 359-370
2. K. A. Ross, D. Srivastava. Fast computation of sparse data cubes. In Proc. of the 23rd VLDB Conf., pages 116-125, Athens, Greece, 1997
3. S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan, and S. Sarawagi. On the computation of multidimensional aggregates. In Proc. of the 22nd VLDB Conf., pages 506-521, 1996
4. T. Niemi, J. Nummenmaa, P. Thanisch. Constructing OLAP Cubes Based on Queries. DOLAP 2001, pages 1-8
5. J. Gray, A. Bosworth, A. Layman and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. ICDE'96, pages 152-159
6. Y. Zhao, P. M. Deshpande, and J. F. Naughton. An array-based algorithm for simultaneous multidimensional aggregates. SIGMOD'97, pages 159-170
7. W. Wang, J. Feng, H. Lu, J. X. Yu. Condensed Cube: An Effective Approach to Reducing Data Cube Size. Proc. of the 18th Int. Conf. on Data Engineering, 2002, pages 155-165
8. Y. Sismanis, A. Deligiannakis, N. Roussopoulos, Y. Kotidis. Dwarf: Shrinking the PetaCube. SIGMOD'02
9. J. Han, J. Pei, G. Dong and K. Wang. Efficient Computation of Iceberg Cubes with Complex Measures. SIGMOD'01
10. L. Lakshmanan, J. Pei, and J. Han. Quotient Cube: How to Summarize the Semantics of a Data Cube. Proc. of the 28th VLDB Conference, Hong Kong, China, 2002
11. W. Lehner, J. Albrecht and H. Wedekind. Normal forms for multidimensional databases. SSDBM'98, pages 63-72
12. C. Hahn, S. Warren, and J. London. Edited synoptic cloud reports from ships and land stations over the globe, 1982-1991. http://cdiac.esd.ornl.gov/-cdiac/ndps/ndp026b.html, http://cdiac.esd.ornl.gov/-ftp/ndp026b/SEP85L.Z, 1994


Answering Joint Queries from Multiple Aggregate OLAP Databases

Elaheh Pourabbas1 and Arie Shoshani2

1 Istituto di Analisi dei Sistemi ed Informatica "Antonio Ruberti" - CNR, Viale Manzoni 30, 00185 Rome, Italy
[email protected]
2 Lawrence Berkeley National Laboratory, Mailstop 50B-3238, 1 Cyclotron Road, Berkeley, CA 94720, USA
[email protected]

Abstract. Given an OLAP query expressed over multiple source OLAP databases, we study the problem of evaluating the result OLAP target database. The problem arises when it is not possible to derive the result from a single database. The method we use is the linear indirect estimator, commonly used for statistical estimation. We examine two obvious computational methods for computing such a target database, called the "Full-cross-product" (F) and the "Pre-aggregation" (P) methods. We study the accuracy and computational complexity of these methods. While the method F provides a more accurate estimate, it is more expensive computationally than P. Our contribution is in proposing a third new method, called the "Partial-Pre-aggregation" method (PP), which is significantly less expensive than F, but is just as accurate.

1 Introduction

Similar to Statistical Databases that were introduced in the 1980's [1], OLAP databases have a data model that represents one or more "measures" over a multidimensional space of "dimensions", where each dimension can be defined over a hierarchy of "category attributes" [3],[7]. In many socio-economic type applications only summarized data is available because the base data for the summaries (called "microdata") are not kept or are unavailable for reasons of privacy [7]. We will refer to Statistical Databases or OLAP databases that contain summarized data as "summary databases", and the measures associated with them as "summary measures". Each summary measure must have a "summary operator" associated with it, such as "sum" or "average". In this paper, we address the problem of evaluating queries expressed over multiple summary databases [6]. That is, given that the base data is not available and that a query cannot be derived from a single summary database, we examine the process of estimating the desired result from multiple summary databases by a method of interpolation common in statistical estimation, called the "linear indirect estimator". Essentially, this method takes advantage of the correlation between measures to perform the estimation. For example, suppose that we have a summary database of "total-income by race, and sex" and another summary database of "population by state, age, and sex". If we know that there is a strong correlation between "population" and "income" of states, we can infer the result "total-income by state". We say that in this case the "population" was used as a proxy measure to estimate "total-income by state". Similarly, the summary database "population by state, age, and sex" is referred to as the proxy database. The problem we are addressing is to answer the query "total-income by state" over the two summary databases. Furthermore, the fact that there is a common category attribute to both databases (Sex) can be used to achieve more accurate results, as we'll show later. Consider the above mentioned databases written in the following notation: "summary-measure (category-attribute, ..., category-attribute)". One obvious method of estimating this result is to aggregate each of the source databases to the maximum level. We call this the Pre-aggregation (P) method. In this case, we can aggregate Population(State, Sex, Age) over sex and age to produce Population(State)¹ and Total-Income(Race,Sex) over race and sex to produce Total-Income(•), where the symbol "•" indicates aggregation over all the category attributes. Then, we can calculate the proportional estimation using linear indirect estimation (see section 3) to produce Total-Income(State). Another possibility is to produce the full cross product: Total-Income(State,Sex,Race,Age) using Population as a proxy summary measure. Then, from this result we can aggregate over Sex, Race, Age to get Total-Income(State). We call this the Full-cross-product (F) method. This is the most accurate result that can be produced since it performs the linear indirect estimation on the most detailed cells. We use the Average Relative Error (ARE) to measure the accuracy (see section 5) of these estimation methods. Using this measure we show that method P is less accurate than method F. Our main result is in proposing a method, called the Partial-pre-aggregation (PP) method, that achieves the same accuracy as the F method but at a much lower computation cost. This is achieved by noticing that it is possible to pre-aggregate over all the category attributes that are not in common to the two source databases before we perform the cross product, and still achieve the same accuracy as the full cross product computation. According to this method, in our example, we first aggregate over Race in the Total-Income(Race,Sex) database to produce Total-Income(Sex) and aggregate over Age in Population(State,Sex,Age) to produce Population(State,Sex). Then, we use the linear indirect estimator to produce Total-Income(State,Sex). Finally, we aggregate over Sex in Total-Income(State,Sex) to produce Total-Income(State). We show that this result is as accurate as F, but by performing the aggregations over the non-common attributes early we minimize the computation needed.

The paper is structured as follows. Section 2 introduces syntax for defining joint queries that provides the basis for a formal analysis of the accuracy of the results. Section 3 discusses the underlying methodology for the estimation of query results. Section 4 describes the three methods for estimating the query results. The accuracy of these methods is investigated in Section 5, whereas the results of the performance evaluation are given in Section 6. Finally, Section 7 concludes.

¹ Population(State) is equivalent to Population(State, ALL, ALL), where ALL indicates the construct introduced in [2]. For the sake of brevity, we use the first notation.

2 The Joint Query Syntax

We define the syntax of a joint query on two summary databases in terms of the common and non-common category attributes of the databases. Let $M(C_{Mi})$, $N(C_{Nj})$ with $1 \le i \le p$ and $1 \le j \le q$ be summary databases, where $M$ and $N$ are summary measures, $C_{Mi}$ and $C_{Nj}$ are category attributes, and $p$ and $q$ each represent a finite number. A joint query formulated on these summary databases is a triple defined by $M^T(C^C, \bar{C}^C, C^T)$ where: $M^T$ is a selected target summary database that can be one of $M$ or $N$. Without loss of generality, suppose that $M$ was selected. Then, $C^C$ is a set of $R$ common category attributes, where $C^C_r = C_{Mi} = C_{Nj}$, with $1 \le r \le R$. $\bar{C}^C$ is a set of $S$ non-common category attributes, where $\bar{C}^C_s = C_{Mi}$ or $\bar{C}^C_s = C_{Nj}$, with $C_{Mi} \ne C_{Nj}$, $1 \le s \le S$. $C^T$ represents a set of target category attributes, which includes at least one $C_{Nj}$, and at least one of the target category attributes is not in the target summary database. According to these distinctions, the summary databases can be represented by $M(C^C_M, \bar{C}^C_M)$ and $N(C^C_N, \bar{C}^C_N, C^T_N)$, where $C^T_N$ are called the target category attributes of the joint query, $N$ is called the proxy measure, $M$ is the target measure ($M^T$), and $N(C^C_N, \bar{C}^C_N, C^T_N)$ is called the proxy database. Note that the set of category attributes of the summary databases, as well as the category attributes of a joint query, can be null.

3 The Linear Indirect Estimator Method

The main idea of such an approach, known in the literature as Small Area Estimation, is to use data from surveys designed to produce estimates of variables of interest at the national or regional level, and to obtain comparable estimates at more geographically disaggregated levels such as counties. This approach is characterized by indirect estimator techniques. An indirect estimator uses values of the variable of interest from available auxiliary (called predictor or proxy) data at the local level that are related to the variable of interest [4],[5]. In this model, the population is partitioned into a large number of domains formed by cross classification of demographic variables such as age, sex, race. Let $i$ denote a small area. For each domain $d$, the only available variable of interest $Y$ (e.g., Income), denoted by $Y(d) = \sum_i Y(i,d)$, is calculated from the survey data. Furthermore, it is assumed that auxiliary information (e.g., Population) in the form of $X(i,d)$ is also available. An estimator of $Y$ for small area $i$ is given by $\hat{Y}(i) = \sum_d (X(i,d)/X(d))\,Y(d) = \sum_d \hat{Y}(i,d)$, where $X(d) = \sum_i X(i,d)$, $X(i,d)/X(d)$ represents the proportion of the population of small area $i$ that is in each domain $d$, and $\sum_i \hat{Y}(i)$ equals the direct estimator $Y = \sum_d Y(d)$. The estimate is subject to error. The error is computed using the true values. In our examples, we assume knowledge of the true values in order to evaluate the error. This provides us with the means of comparing the accuracy of the results using different computational methods. This estimation method applies to any summary database where the assumption is that the characteristics of small areas are sufficiently close to the characteristics of the large areas.
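To make the estimator concrete, the following short Python sketch (ours, not from the paper) computes $\hat{Y}(i) = \sum_d (X(i,d)/X(d))\,Y(d)$; the example data are the Table 1 values with Sex as the domain $d$ and State as the small area $i$, with Population summed over Age.

from collections import defaultdict

def linear_indirect_estimate(Y_by_domain, X):
    """Y_by_domain: dict d -> Y(d); X: dict (i, d) -> X(i, d).
    Returns dict i -> Y_hat(i) = sum_d (X(i,d)/X(d)) * Y(d)."""
    X_d = defaultdict(float)                 # X(d) = sum_i X(i, d)
    for (i, d), x in X.items():
        X_d[d] += x
    Y_hat = defaultdict(float)
    for (i, d), x in X.items():
        Y_hat[i] += (x / X_d[d]) * Y_by_domain[d]
    return dict(Y_hat)

income_by_sex = {"Male": 1585082, "Female": 1037799}      # Y(d), Table 1 totals
population = {("AL", "Male"): 9, ("AL", "Female"): 7,      # X(i, d), Table 1 summed over Age
              ("CA", "Male"): 10, ("CA", "Female"): 10,
              ("FL", "Male"): 14, ("FL", "Female"): 9,
              ("NE", "Male"): 6, ("NE", "Female"): 12,
              ("TE", "Male"): 19, ("TE", "Female"): 11}
print(linear_indirect_estimate(income_by_sex, population))

Running this reproduces, for example, the value 394218 for AL, which is the F/PP estimate reported in Table 3 below.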

4 Computational Methods for Answering Joint Query

In this section, we propose three methods to compute joint queries expressed on two summary databases. They are called the Full-cross-product, Pre-aggregation, and Partial-Pre-aggregation methods. We first define them, and then we discuss their relative computational complexity. For the definition of these methods, we use the formalism defined for a joint query in Section 2.

4.1 The Full Cross Product Method (F)

The next theorem provides the estimation of the full cross product over summary databases that represent different measures.

Theorem 1. Let $M(C^C_M, \bar{C}^C_M)$ and $N(C^C_N, \bar{C}^C_N)$ be two summary databases. The full-cross-product summary database is obtained as follows:

$M(C^C_M, \bar{C}^C_M, \bar{C}^C_N) = M(C^C_M, \bar{C}^C_M) \cdot \dfrac{N(C^C_N, \bar{C}^C_N)}{\sum_{\bar{C}^C_N} N(C^C_N, \bar{C}^C_N)}$   (1)

If we assume as proxy the summary database $N(C^C_N, \bar{C}^C_N)$, then by linear indirect estimation the proof of the above theorem is straightforward.

Example 1. Table 1 represents the data in the summary databases Income(Race,Sex) and Population(State,Sex,Age), where by "Income" we mean "Total-Income" in the rest of the paper. Let us consider obtaining the full-cross-product summary database by Eq. 1:

$Income(State,Sex,Age,Race) = Income(Race,Sex) \cdot \dfrac{Population(State,Age,Sex)}{\sum_{State,Age} Population(State,Age,Sex)}$

According to the syntax introduced in Section 2, Sex is a common category attribute, while Race, Age, and State are non-common category attributes. The summary database Population(State,Sex,Age) is the proxy database. For instance, if Income(State) is the target summary database, applying the F method we first obtain the full cross product summary database shown in Table 2 (for space limitations, only one state is shown) and then we summarize over all category attributes except the target attribute. The result is shown in Table 3, third column.


4.2 The Pre-aggregation (P) Method

The pre-aggregation method is based on summarizing the summary databases over all common and non-common category attributes before the application of the linear indirect estimator method.

Definition 1. Let $M(C^C_M, \bar{C}^C_M)$ and $N(C^C_N, \bar{C}^C_N, C^T_N)$ be summary databases. The target summary database $M^T(C^T_N)$ is estimated by pre-summarizing all common and non-common category attributes in the summary databases as follows: $M(\bullet) = \sum_{C^C_M, \bar{C}^C_M} M(C^C_M, \bar{C}^C_M)$, $N(C^T_N) = \sum_{C^C_N, \bar{C}^C_N} N(C^C_N, \bar{C}^C_N, C^T_N)$, and then applying the linear indirect estimation: $M^T(C^T_N) = M(\bullet) \cdot \dfrac{N(C^T_N)}{\sum_{C^T_N} N(C^T_N)}$.

Example 2. Consider Table 1. To apply the P method, we first summarize all common and non-common category attributes in the summary databases Income and Population: $\sum_{Age,Sex} Population(State,Age,Sex) = Population(State)$, $\sum_{Race,Sex} Income(Race,Sex) = Income(\bullet)$. Then, by applying linear indirect estimation, we obtain $\widehat{Income}(State) = Income(\bullet) \cdot \dfrac{Population(State)}{\sum_{State} Population(State)}$. The result of the target summary database $\widehat{Income}(State)$ is shown in Table 3, fourth column. We observe that $\widehat{Income}(State)_F$ and $\widehat{Income}(State)_P$ are different.

4.3 The Partial-Pre-aggregation (PP) Method

This method was devised to yield the same accuracy as method F but with a lower computational complexity. The main idea is to summarize the summary databases only over non-common category attributes, and then estimate the target summary database with the common and target category attributes.

Definition 2. Let $M(C^C_M, \bar{C}^C_M)$ and $N(C^C_N, \bar{C}^C_N, C^T_N)$ be summary databases. The target summary database $M^T(C^T_N)$ is estimated by pre-summarizing all the non-common category attributes in the summary databases as follows: $M(C^C_M) = \sum_{\bar{C}^C_M} M(C^C_M, \bar{C}^C_M)$, $N(C^C_N, C^T_N) = \sum_{\bar{C}^C_N} N(C^C_N, \bar{C}^C_N, C^T_N)$, and then estimating the cross product and summarizing over the common attributes as follows: $M^T(C^T_N) = \sum_{C^C_N} \hat{M}(C^C_N, C^T_N) = \sum_{C^C_N} \left( M(C^C_M) \cdot \dfrac{N(C^C_N, C^T_N)}{\sum_{C^T_N} N(C^C_N, C^T_N)} \right)$.

Example 3. Consider again Table 1. First, we summarize the summary databases over non-common category attributes as follows: $\sum_{Age} Population(State,Age,Sex) = Population(State,Sex)$, $\sum_{Race} Income(Race,Sex) = Income(Sex)$. Then, by applying linear indirect estimation, we obtain $\widehat{Income}(State) = \sum_{Sex} \widehat{Income}(State,Sex) = \sum_{Sex} Income(Sex) \cdot \dfrac{Population(State,Sex)}{\sum_{State} Population(State,Sex)}$. We note that the results obtained by $\widehat{Income}(State)_{PP}$ are identical to $\widehat{Income}(State)_F$ shown in Table 3.
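The three methods can be checked numerically on the Table 1 data. The following Python sketch (ours, not from the paper) implements F literally as a full cross product, PP as in Definition 2, and P as in Definition 1; the assertion illustrates Theorem 3, and the printed values match Table 3.

import math
from itertools import product

income = {("White","Male"):629907, ("White","Female"):330567,
          ("Black","Male"):311121, ("Black","Female"):241835,
          ("Hispanic","Male"):312800, ("Hispanic","Female"):192632,
          ("Non-Hispanic","Male"):331254, ("Non-Hispanic","Female"):272765}
pop = {("AL","<65","Male"):6, ("AL","<65","Female"):4, ("AL",">=65","Male"):3, ("AL",">=65","Female"):3,
       ("CA","<65","Male"):7, ("CA","<65","Female"):5, ("CA",">=65","Male"):3, ("CA",">=65","Female"):5,
       ("FL","<65","Male"):9, ("FL","<65","Female"):6, ("FL",">=65","Male"):5, ("FL",">=65","Female"):3,
       ("NE","<65","Male"):3, ("NE","<65","Female"):5, ("NE",">=65","Male"):3, ("NE",">=65","Female"):7,
       ("TE","<65","Male"):12, ("TE","<65","Female"):6, ("TE",">=65","Male"):7, ("TE",">=65","Female"):5}
states = sorted({s for s, _, _ in pop}); ages = sorted({a for _, a, _ in pop})
races = sorted({r for r, _ in income}); sexes = ["Male", "Female"]

def method_F():   # full cross product, then sum out everything except State
    out = {s: 0.0 for s in states}
    for s, r, a, x in product(states, races, ages, sexes):
        denom = sum(pop[(s2, a2, x)] for s2, a2 in product(states, ages))
        out[s] += income[(r, x)] * pop[(s, a, x)] / denom
    return out

def method_PP():  # pre-aggregate Race and Age, estimate on (State, Sex), sum over Sex
    inc_sex = {x: sum(income[(r, x)] for r in races) for x in sexes}
    pop_sx = {(s, x): sum(pop[(s, a, x)] for a in ages) for s in states for x in sexes}
    return {s: sum(inc_sex[x] * pop_sx[(s, x)] / sum(pop_sx[(s2, x)] for s2 in states)
                   for x in sexes) for s in states}

def method_P():   # pre-aggregate everything, estimate on State only
    inc_tot = sum(income.values())
    pop_s = {s: sum(pop[(s, a, x)] for a in ages for x in sexes) for s in states}
    return {s: inc_tot * pop_s[s] / sum(pop_s.values()) for s in states}

f, pp, p = method_F(), method_PP(), method_P()
assert all(math.isclose(f[s], pp[s], rel_tol=1e-9) for s in states)  # F and PP agree (Theorem 3)
print({s: round(pp[s], 2) for s in states})  # e.g. AL -> 394218.0 (Table 3, F/PP column)
print({s: round(p[s], 2) for s in states})   # P differs, e.g. AL -> 392206.5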


Table 1. Income(Race,Sex) and Population(State,Sex,Age)

Income          Male      Female
White           629907    330567
Black           311121    241835
Hispanic        312800    192632
Non-Hispanic    331254    272765
Total           1585082   1037799

Population               Male   Female
AL     <65 years         6      4
AL     >=65 years        3      3
CA     <65 years         7      5
CA     >=65 years        3      5
FL     <65 years         9      6
FL     >=65 years        5      3
NE     <65 years         3      5
NE     >=65 years        3      7
TE     <65 years         12     6
TE     >=65 years        7      5
Total                    58     49

Table 2. Income(State,Race,Age,Sex)

State  Race          Age          Male           Female
AL     White         <65 years    65162.7931     26985.06122
AL     White         >=65 years   32581.396552   20238.795918
AL     Black         <65 years    32184.93103    19741.632653
AL     Black         >=65 years   16092.46552    14806.22449
AL     Hispanic      <65 years    32358.62069    15725.061224
AL     Hispanic      >=65 years   16179.31035    11793.79592
AL     Non-Hispanic  <65 years    34267.65517    22266.530612
AL     Non-Hispanic  >=65 years   17133.82759    16699.897959
...    ...           ...          ...            ...

5 Accuracy Analysis of Methods

The previous definitions can be used to estimate joint queries with any number of (common, non-common) category attributes. As seen from the previous examples, the application of the F and P methods on the same data yields different results. This difference is formalized by the next theorem.


Theorem 2. The estimation of any joint query $M^T(C^T_N)$ over $M(C^C_M, \bar{C}^C_M)$ and $N(C^C_N, \bar{C}^C_N, C^T_N)$ using the methods F and P gives different results.

Proof. (sketch) We prove this by negation. Suppose methods F and P give the same result $M^T(C^T_N)$. According to this assumption, (i) and (ii) in Eq. 2 must be equal. $C^C_M = C^C_N$, but the proportions are not equal. This contradicts the assumption.

(i) $\sum_{C^C_N} \left[ \left( \sum_{\bar{C}^C_M} M(C^C_M, \bar{C}^C_M) \right) \left( \dfrac{\sum_{\bar{C}^C_N} N(C^C_N, \bar{C}^C_N, C^T_N)}{\sum_{\bar{C}^C_N, C^T_N} N(C^C_N, \bar{C}^C_N, C^T_N)} \right) \right]$

(ii) $\sum_{C^C_M} \left[ \left( \sum_{\bar{C}^C_M} M(C^C_M, \bar{C}^C_M) \right) \left( \dfrac{\sum_{C^C_N} \sum_{\bar{C}^C_N} N(C^C_N, \bar{C}^C_N, C^T_N)}{\sum_{C^C_N} \sum_{\bar{C}^C_N, C^T_N} N(C^C_N, \bar{C}^C_N, C^T_N)} \right) \right]$   (2)

Since the methods F and P yield different results, we need a way of evaluating the accuracy of these results. A common approach to determine the accuracy is based on the calculation of their average relative errors (ARE) from the "true" base values ($v$) as defined in [4]: $ARE = \frac{1}{m} \sum_{i=1}^{m} \frac{|\hat{v}_i - v_i|}{v_i}$. By the linear indirect estimator method, we obtain an estimate of a given target measure for small area $i$, and then we calculate the ARE from the true value of small area $i$. In Table 3, the true value for each State as well as the values estimated by methods F (or PP) and P are shown. Note that the true values are calculated from the base data in order to obtain the error between the true values and the data estimated by each of the F, PP, and P methods. In the same table, the relative errors and the ARE are shown. As can be seen, while method P is less expensive in terms of computation than method F, method F is more precise w.r.t. method P. An intermediate solution in terms of computational complexity of these two methods is method PP. As we saw from examples 1 and 2, and on the basis of the next theorem, the methods F and PP give the same results.

Theorem 3. The estimation of any joint query $M^T(C^T_N)$ over $M(C^C_M, \bar{C}^C_M)$ and $N(C^C_N, \bar{C}^C_N, C^T_N)$ using the methods F and PP gives the same results.

Proof. (sketch) It is easy to show the F ⇒ PP direction by simple algebraic manipulation [6]. For the PP ⇒ F direction, we consider a simplified representation of the $M^T(C^T_N)$ obtained by the PP method as follows:

(i) For any sets of category attributes $C_p$ and $C_q$ with $p \ne q$, the above summary database can be the result of summarization over $C_p$ and $C_q$ as follows: $\sum_{C^C_N} \left[ \left( \sum_{C_p} M(C^C_M, C_p) \right) \left( \dfrac{\sum_{C_q} N(C^C_N, C_q, C^T_N)}{\sum_{C_q} N(C^C_N, C_q)} \right) \right]$, where the inner term is equivalent to

(ii) $\sum_{C_p, C_q} \left[ M(C^C_M, C_p) \cdot \dfrac{N(C^C_N, C_q, C^T_N)}{N(C^C_N)} \right] = \sum_{C_p, C_q} \hat{M}(C^C_N, C_p, C_q, C^T_N)$.

As can be easily seen, the summarization of the full cross product in (ii) over $C_p$ and $C_q$ gives $\hat{M}(C^C_N, C^T_N)$, which is equal to the inner term of (i).


Table 3. Results obtained by methods F (or PP) and P, and their ARE

State   Income    Income_F(PP)    Income_P        |Income_F - Income|/Income   |Income_P - Income|/Income
AL      265039    394218.0000     392206.50467    0.487396195                  0.479806763
CA      495302    485085.7143     490258.13084    0.020626377                  0.010183422
FL      806255    573222.14286    563796.85047    0.289031209                  0.300721421
NE      393694    418128.85714    441232.31776    0.062065607                  0.120749409
TE      662591    752226.28571    735387.19626    0.135279963                  0.109865960
ARE                                               0.198879870                  0.204265395
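As a quick check of the ARE formula against the table: averaging the five relative errors in the F(PP) column gives $(0.4874 + 0.0206 + 0.2890 + 0.0621 + 0.1353)/5 \approx 0.1989$, matching the reported ARE of 0.198879870; the P column similarly averages to $\approx 0.2043$.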

5.1 Pre-aggregation on Category Hierarchies

Let us consider an example of a category hierarchy State → Region → Country. In the case that the result database needs to be evaluated over a higher-level category, such as Income(Region), it is more efficient to aggregate (or roll up, in OLAP terminology) the source database Population(State,Age,Sex) to the region level to produce Population(Region,Age,Sex) before applying the PP method. Intuitively, we expect to get the same level of accuracy. We verified that indeed aggregating (rolling up) first to the higher category level produced the same accuracy as aggregating after applying the PP method [6]. Thus, we consider aggregation to the desired level of the category hierarchies as a first step of the Partial-Pre-Aggregation (PP) method.

6 Performance Evaluation

In this section, we illustrate the experimental results of the performance evaluation of the methods F and PP. We focus our attention on these methods because of their higher accuracy levels w.r.t. method P. The performance evaluation of the methods is described over two and then over more than two summary databases through some examples.

6.1 Performance Evaluation over Two Summary Databases

Let the number of cells of a given summary database be defined by $X_M = \prod_{i=1,\dots,n} |C_i|$, where $|C_i|$ represents the cardinality of the domain values of the i-th category attribute. For instance, the total number of cells of the summary database Population(State,Age,Sex) is $X_{Population} = 20$, given that the cardinalities of the domain values of the category attributes (State, Age, Sex) are respectively 5, 2, and 2. Note that each cell requires the same space for storing the data value, and therefore the space complexity for each cell is assumed to be the same. Thus counting the number of cells is a good measure of the space required.


Definition 3. Let $M(C^C_M, \bar{C}^C_M)$ and $N(C^C_N, \bar{C}^C_N, C^T_N)$ be summary databases, and let $M$ be the target summary database. The total number of cells of the target summary database $M(C^C_M, \bar{C}^C_M, \bar{C}^C_N, C^T_N)$ is defined as follows. In the case of method F: $X_{MN} = \dfrac{X_M \, X_N}{\prod_{q=1,\dots,n} |C^C_q|}$, where $|C^C_q|$ represents the cardinality of the domain values of a category attribute which is common between the target and proxy summary databases. In the case of method PP: $X_{MN} = \dfrac{(X_M / |\bar{C}^C_M|)\,(X_N / |\bar{C}^C_N|)}{\prod_{q=1,\dots,n} |C^C_q|}$.

Each cell in the target summary database shows the cost in terms of the number of arithmetic operations performed (time) and the number of bytes (space) needed. The space cost is important in the intermediate steps of processing when methods F and PP are applied to achieve the final result. For instance, let us consider the summary databases shown in Table 1. Applying the F method, the total number of cells of the summary database Income(State,Race,Age,Sex) is 80, while with the method PP the total number of cells of the summary database Income(State,Sex) is 10. Concerning the arithmetic operations, we assume that the multiplication and division operations in each cell need some fixed amount of time, such as 3 μs. Therefore, in our example, the time cost for estimating the target summary database Income(State) by F is 255 μs, while by PP it is 45 μs.
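Plugging the Table 1 cardinalities into Definition 3 reproduces these cell counts: $X_{Income} = 4 \cdot 2 = 8$, $X_{Population} = 5 \cdot 2 \cdot 2 = 20$, and the only common attribute, Sex, has cardinality 2, so method F yields $8 \cdot 20 / 2 = 80$ cells, while method PP yields $(8/4) \cdot (20/2) / 2 = 10$ cells.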

6.2 Performance Evaluation over Three Summary Databases

The evaluation of a joint query over more than two summary databases is performed by selecting the target summary measure of one database and using all the other databases as proxy databases. Consequently, there are many combinations for estimating the result of such a joint query, depending on the order of evaluation. As examples of our methodology for performance evaluation, we use the summary databases Population(State,Age,Sex), Income(Race,Sex,Profession), and PopulationE(Race,Sex,Education-level), which are labelled a, b, and c in Figure 1. The joint query is Income(State,Education-level). In this case, the target summary measure is from b, and the target category attributes belong to a and c. In Figure 1, for example, $b_c$ indicates that between b and c, the former is the target database and the latter is the proxy database. For answering this query, we use the methods F and PP. Note that for all the cases reported in Figure 1, the solutions yield the same total number of cells for the target summary database using the methods F and PP.

The target summary database Income(State,Education-level) is the result of: (a) using method F, the aggregation over Race, Sex, Age, and Profession in Income(State,Race,Sex,Age,Profession,Education-level), which has 320 cells; (b) using method PP, the aggregation over Race and Sex in Income(State,Race,Sex,Education-level), which has 80 cells. Table 4 shows the cost of processing the intermediate steps in each solution. As can be seen from the table, the minimum space cost is achieved with solution B for both methods F and PP. The same table shows the total time cost (in μs) for estimating the target summary database. Overall, solution B with method PP provides the best performance (least amount of space and least amount of computation).


[Diagram of six evaluation orders (A)-(F) over the summary databases a, b, and c, using intermediate results such as $b_a$, $b_c$, and $b_{a,c}$.]

Fig. 1. Solutions of joint query on multiple summary databases

Table 4. Cost of space and time of intermediate summary databases for each solution

Space (bytes)              Time (μs)
Solution   F      PP       Solution   F       PP
A          160    40       A          1440    360
B          32     16       B          1056    288
C          160    80       C          1440    480
D,E,F      192    40       D,E,F      1536    360

7 Conclusions

In this paper, we proposed a method, called the Partial-Pre-aggregation method, for estimating the results of a joint query over two source databases. This method is based on partitioning the category attributes of the source databases into "common", "non-common", and "target" attributes. By summarizing on the non-common attributes first, we reduce the computational and space complexity of applying the linear indirect estimator method. We have shown that the Partial-Pre-aggregation method is, in general, significantly more efficient than the Full-cross-product method commonly used by statistical software. Furthermore, we provided a way to evaluate the optimal order of pairing databases for queries over more than two source summary databases.

References

[1] Chan, P., Shoshani, A.: SUBJECT: A Directory Driven System for Organizing and Accessing Large Statistical Databases. Conference on Very Large Data Bases, (1981) 553-563
[2] Gray, J., Bosworth, A., Layman, A., Pirahesh, H.: Data Cube: a Relational Aggregation Operator Generalizing Group-by, Cross-tabs and Subtotals. 12th IEEE Int. Conf. on Data Engineering, (1996) 152-159
[3] Codd, E. F., Codd, S. B., Salley, C. T.: Providing OLAP (On-Line Analytical Processing) to User-Analysts: An IT Mandate. Technical report (1993)
[4] Ghosh, M., Rao, J. N. K.: Small Area Estimation: An Appraisal. Journal of Statistical Science. 9 (1994) 55-93
[5] Pfeffermann, D.: Small Area Estimation - New Developments and Directions. International Statistical Review, 70 (2002)
[6] Pourabbas, E., Shoshani, A.: Joint Queries Estimation from Aggregate OLAP Databases. LBNL Technical Report, (2001) LBNL-48750
[7] Shoshani, A.: OLAP and Statistical Databases: Similarities and Differences. 16th ACM Symposium on Principles of Database Systems, (1997) 185-196


An Approach to Enabling Spatial OLAP by Aggregating on Spatial Hierarchy

Long Zhang, Ying Li, Fangyan Rao, Xiulan Yu, Ying Chen, and Dong Liu

IBM China Research Laboratory, Beijing 100085, P. R. China
{longzh,lying,raofy,yuxl,yingch,liudong}@cn.ibm.com

Abstract. Investigation shows that a huge amount of spatial data exists in current business databases. Traditional data warehousing and OLAP, however, cannot exploit this spatial information to get deep insight into the business data in decision making. In this paper, we propose a novel approach to enabling spatial OLAP by aggregating on the spatial hierarchy. A spatial index mechanism is employed to derive the spatial hierarchy for pre-aggregation and materialization, which in turn are leveraged by the OLAP system to efficiently answer spatial OLAP queries. Our prototype system shows that the proposed approach can be integrated easily into existing data warehouse and OLAP systems to support spatial analysis. Preliminary experiment results are also presented.

1 Introduction

OLAP systems provide architectures and tools for knowledge workers (executives, managers, analysts) to systematically organize, understand, and use their data to make strategic decisions. It is claimed that 80% of the overall information stored in computers is geo-spatial related, either explicitly or implicitly [4]. Currently a large amount of spatial data has been accumulated in business information systems. How to analyze such business data associated with spatial information presents a challenge to data warehousing and OLAP systems. For example, business data, such as the growth of the surrounding neighborhood and proximate competitors, and geographical data, such as the distance to the nearest highways, can be used to choose a location for a new store.

To address the above issues, two methods are commonly used to support management of business data with spatial information: (1) GIS + DBMS. In this approach, a geographical information system (GIS) is used to model, manipulate and analyze the spatial data, and a DBMS is used to handle business data. The disadvantage of this method is that the business data and spatial data are maintained separately and it is hard to provide a uniform view for users; (2) DBMS with spatial extensions. Some commercial database vendors, such as IBM and Oracle, provide spatial extensions in their database systems [3,11]. These spatial extensions provide methods to execute spatial queries, namely, to select spatial objects in specified areas. However, they do not provide spatial data analysis functionality in the OLAP tools.

The data in the warehouse are often modelled as multidimensional cubes [5], which allow the data to be organized around natural business concepts called measures and dimensions. We adopt Han's definition of a spatial-to-spatial dimension as the spatial dimension, whose primitive level and all of its high-level generalized data are spatial [9]. The spatial dimension differs from the non-spatial dimension in that there is little a priori knowledge about the grouping hierarchy [12, 13]. In addition to some predefined regions, the user may request groupings based on a map which are computed on the fly, or which may be arbitrarily created (e.g., an arbitrary grid in a selected window). Thus the well-known pre-aggregation methods which are used to enhance performance cannot be applied directly.

In this paper, we present an approach to enabling spatial data manipulation in OLAP systems. We extend the cube query by introducing spatial predicates and functions which explicitly express the spatial relationships among data in fact tables and dimension tables. A spatial index mechanism is employed to derive the spatial hierarchy for pre-aggregation and materialization, which in turn is leveraged by the OLAP system to efficiently answer spatial OLAP queries. Our prototype system shows that the proposed approach can be integrated easily into existing data warehousing and OLAP systems to support spatial analysis. Preliminary studies of the performance of the system are presented.

The remainder of this paper is organized as follows. A motivating example is given in Section 2. The spatial index mechanism is described in Section 3. The system architecture is depicted in Section 4. Section 5 gives a detailed explanation of query processing. Preliminary experiments are given in Section 6. Related work is discussed in Section 7, and the conclusion is drawn in Section 8.

2 Motivating Example

In our work, the star schema is used to map multi-dimensional data onto a relational database. To focus on spatial dimensions, we employ a simple data warehouse of sales from thousands of gas stations as the motivating example. The sales concerning different gas stations, gas types, and customers at different times are analyzed. The schema of the dimension tables station, gas, customer, time and the fact table transaction is as follows:

station (station ID, location)
gas (gas ID, unit price)
customer (customer ID, type)
time (time ID, day, month, year)
transaction (station ID, gas ID, customer ID, time ID, sales, quantity)

Each tuple in the fact table is a transaction item indicating the sales, customer ID, gas type, gas station, and time involved in the transaction. Here, station is a spatial dimension. Its attribute location gives the spatial location of a gas station. A location is a point such as "(332, 5587)". A typical OLAP query may be: "What are the total sales of each type of gas for 'taxi' customers at each gas station in Oct. 2002?" An OLAP query involving spatial information may be: "What are the total sales of each type of gas for customer 'taxi' at gas stations within the query window in Oct. 2002?" Here, a query window is a rectangle that the user draws on the map.

3 The Spatial Hierarchy and Pre-Aggregation

In an OLAP system, the concept hierarchy is the basis for two commonly used OLAP operators: drill-down and roll-up [8]. A concept hierarchy defines a sequence of mappings from a set of low level concepts to higher level, more general concepts. For example, the hierarchy on the time dimension is day, month and year. Traditionally, users, field experts and knowledge engineers manually define the concept hierarchies of dimensions. But for the spatial dimension, there is little a priori knowledge available on how to define an appropriate hierarchy in advance.

Spatial indexing has been one of the active focus areas in recent database research. Many data structures have been proposed for indexing spatial objects [6], such as the Quadtree, the R-tree (and its derivatives such as the R*-tree and R+-tree), and the grid file. Among them, the indexes with tree structures provide a nesting relationship between high-level nodes and low-level nodes. This relationship is an inspiring candidate for building the spatial hierarchy.

Each node in an R-tree stores a minimum bounding rectangle (MBR), and the MBR of a higher level node encloses all MBRs of its descendants. Thus the R-tree provides a naturally nested hierarchy based on the spatial shapes and placement of the indexed objects. This R-tree derived spatial hierarchy plays the same role as traditional concept hierarchies. It can be used for pre-aggregation on the spatial dimension. Other tree-like indexes can also be used to derive the spatial hierarchy. Fig. 1 shows part of a sample R-tree for the motivating example and the spatial layout of the objects. The leaf entries are gas station IDs. Each intermediate R-tree entry has an associated MBR indicating its spatial scope.

[Diagram: an R-tree with root r1, intermediate nodes r2-r11, and leaf entries s1-s5 (gas stations), together with the spatial layout of their MBRs and a bold query window.]

Fig. 1. A part of a sample R-tree and the spatial layout of the objects

Assuming that a user draws a query window, indicated as the bold rectangle in Fig. 1, our Spatial Index Engine (introduced in Section 4) uses the R-tree to compute the query. It must return exactly the objects within the query window. The index searching algorithm tries to find the highest level nodes in the tree that satisfy the spatial query predicate, more specifically, the within predicate. In some cases, some gas stations are contained in the query window, but the MBRs of their ancestor nodes are not completely enclosed by the query window. These gas stations should then be fetched instead of their ancestor nodes. Accordingly, for the query window in Fig. 1, the intermediate R-tree entries {r2, r7} and the leaf entries {s2, s3, s4} are returned.
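The containment search described above can be sketched as follows (our own Python sketch, not the paper's implementation; it assumes a simple in-memory node structure rather than a real R-tree library).

from dataclasses import dataclass, field
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)

@dataclass
class Node:
    mbr: Rect
    nid: str                                              # e.g. "r9" or "s2"
    children: List["Node"] = field(default_factory=list)  # empty => leaf entry (gas station)

def contains(outer: Rect, inner: Rect) -> bool:
    return (outer[0] <= inner[0] and outer[1] <= inner[1] and
            outer[2] >= inner[2] and outer[3] >= inner[3])

def intersects(a: Rect, b: Rect) -> bool:
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

def within_search(node: Node, window: Rect, nids: List[str], stations: List[str]) -> None:
    """Collect the highest-level nodes fully inside the window, plus individual
    leaf entries inside the window whose ancestors are not fully enclosed."""
    if contains(window, node.mbr):
        (stations if not node.children else nids).append(node.nid)
        return                      # whole subtree qualifies, no need to descend
    if not intersects(window, node.mbr):
        return                      # prune: nothing below can be in the window
    for child in node.children:     # partial overlap: descend
        within_search(child, window, nids, stations)

For the window of Fig. 1, such a search would return intermediate nodes like ["r2", "r7"] and stations like ["s2", "s3", "s4"], which drive the query rewriting in Section 5.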

With the spatial hierarchy, pre-aggregation and materialization can be used to answer OLAP queries. Usually, the result of materialization is stored in a summary table. For our motivating example, the summary table concerning spatial dimensions is as follows:

spatial_sum (nid, customer, gas, month, year, sales)

NIDs are IDs of R-tree nodes. Currently, we materialize the whole cube; that is, all nodes in the spatial hierarchy (R-tree) will be computed and the results will be inserted into the summary table. The summary table is built up by traversing aggregation paths. For example, in Fig. 1, gas stations s1, s2 and s3 are grouped into node r9, and in turn, r7, r8 and r9 are grouped into r3, and so on.

For each intermediate node in the spatial hierarchy, all other non-spatial dimensions will be aggregated and materialized. DB2 provides OLAP extensions [2] to SQL, and we use GROUP BY CUBE to aggregate and materialize the non-spatial dimensions. For example, the corresponding pre-aggregation and materialization SQL statement for r9 is:

INSERT INTO spatial_sum
SELECT 'r9' AS nid, c.type AS customer, g.gas_id AS gas, t.month,
       t.year, sum(tr.sales) AS sales
FROM station s, customer c, gas g, transaction tr, time t
WHERE tr.station_id IN ('s1', 's2', 's3')
  AND tr.station_id = s.station_id AND tr.customer_id = c.customer_id
  AND tr.gas_id = g.gas_id AND tr.time_id = t.time_id
GROUP BY CUBE (c.type, g.gas_id, t.month, t.year)
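To make the materialization step concrete, the following Python sketch (ours, not from the paper) generates and executes one such statement per node of the spatial hierarchy; the hierarchy mapping and the DB-API connection are assumptions for illustration.

def materialize_spatial_cube(conn, hierarchy):
    """hierarchy: dict mapping a node id (e.g. 'r9') to its descendant station ids;
    conn: any Python DB-API connection to the warehouse."""
    cur = conn.cursor()
    for nid, stations in hierarchy.items():
        id_list = ", ".join("'%s'" % s for s in stations)
        cur.execute(
            "INSERT INTO spatial_sum "
            "SELECT '%s' AS nid, c.type, g.gas_id, t.month, t.year, sum(tr.sales) "
            "FROM station s, customer c, gas g, transaction tr, time t "
            "WHERE tr.station_id IN (%s) "
            "AND tr.station_id = s.station_id AND tr.customer_id = c.customer_id "
            "AND tr.gas_id = g.gas_id AND tr.time_id = t.time_id "
            "GROUP BY CUBE (c.type, g.gas_id, t.month, t.year)" % (nid, id_list)
        )
    conn.commit()

# e.g. materialize_spatial_cube(conn, {"r9": ["s1", "s2", "s3"], "r11": ["s4", "s5"]})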

Using this method, all nodes in the spatial hierarchy are aggregated and materialized into the table spatial_sum.

4 System Architecture

Our prototype system executes spatial OLAP queries, accessing multidimensional data cubes stored in a relational database, and presents the results in a multidimensional format. Fig. 2 depicts the architecture of our system.

The Warehouse Builder extracts, transforms and loads raw data from operational database systems (ODS) to build the fact table and dimension tables. Before summary tables are created, the Spatial Index Engine uses the dimension tables containing spatial data to build the spatial hierarchy. The hierarchy is then used by the Warehouse Builder when generating summary tables.

[Fig. 2. System architecture: a GUI on the client/browser side exchanges XML queries and results with the server-side Spatial OLAP Query Processor, which uses the Spatial Index Engine (spatial index) and, through a Java OLAP API, a ROLAP Engine over the summary, fact and dimension tables built by the Warehouse Builder from the ODS.]

A user query is generated in the Graphic User Interface (GUI) on the client/browser side, encoded in XML, and then transferred to the server side. The Spatial OLAP Query Processor extracts all spatial predicates from the query and dispatches them to the Spatial Index Engine. By investigating the Spatial Index, the engine fetches all intermediate R-tree nodes whose MBRs satisfy the predicates and individual leaf nodes (i.e., gas station IDs in our motivating example) whose spatial locations satisfy the predicates. These nodes are returned to the Spatial OLAP Query Processor. With these data, the processor rewrites the original query and sends the rewritten query to the ROLAP Engine, which is a relational OLAP engine. The ROLAP Engine processes the query using summary tables and returns the query results to the query processor. The query results are further reconstructed as an XML document and sent to the client/browser side. The architecture takes advantage of an independent Spatial Index Engine, so the system could employ other index structures, such as a Quadtree, with few modifications to other modules. The interface between the query processor and the ROLAP Engine is a Java OLAP API following the JOLAP API specification [10]. With this API, any OLAP engine complying with it can be easily adopted.

5 Spatial OLAP Query Processing

We use the most common query region shape, a rectangle drawn on a map, to analyze spatial OLAP query processing. There are two types of queries:

– Summary query (SQ). It requests summary information over all selected spatial objects as a whole, e.g., "what are the total sales of each customer at ALL gas stations within my indicated query window during each month in 2002?"

– Individual query (IQ). It asks for individual information on each selected spatial object, e.g., "what are the total sales of each customer at EACH gas station within my indicated query window during each month in 2002?"

Obviously, SQ queries can take advantage of the data already summarized on the spatial dimension, while IQ queries cannot exploit the summary information and must be processed as traditional OLAP queries. To deal with these two types separately, we built different cubes for them.

The schema of the summary table concerning spatial information was introduced in Section 3. For IQ queries, we established the following traditional cube based on individual gas stations:


nonspatial_sum (station_id, customer, gas, month, year, sales)

Take a query as an example: "what are the total sales of each customer at all gas stations within my indicated query window during each month in 2002?". The indicated rectangular query window with upper left corner (122, 220) and lower right corner (500, 768) is depicted in Fig. 1. The SQL statement is:

Q1:
SELECT c.type AS customer, month, sum(sales)
FROM customer c, gas g, time t, station s, transaction tr
WHERE s.location WITHIN query_window(122,220,500,768) AND t.year=2002
  AND tr.customer_id=c.customer_id AND tr.gas_id=g.gas_id
  AND tr.station_id=s.station_id AND tr.time_id=t.time_id
GROUP BY customer, month

When the query is processed, all the gas stations located within the window will be selected and their sales information will contribute to the query result.

After the query is generated in the GUI and transferred to the server side, the Spatial OLAP Query Processor extracts the spatial predicates in the WHERE clause. Then the query processor dispatches each predicate to the Spatial Index Engine. The engine evaluates the predicate and returns an NID list containing the IDs of all R-tree nodes which satisfy the predicate and a gas station ID list containing the IDs of separate gas stations. For example, for the predicate query_window(122, 220, 500, 768), the Spatial Index Engine returns two lists: one NID list containing all intermediate R-tree nodes whose MBRs are within the specified query window (122, 220, 500, 768), and the other a list of gas station IDs containing all gas stations whose locations are also within the query window but whose ancestors' MBRs are not completely enclosed by the query window. Let the NID list be {23, 179, 255, 88} and the gas station ID list be {868, 3234, 843, 65, 7665}. Then Q1 can be rewritten into:

Q2:
WITH summary AS (
  SELECT nid AS id, customer, month, sum(sales) AS sales
  FROM spatial_sum
  WHERE nid IN (23,179,255,88) AND year=2002 AND gas IS NULL
  GROUP BY nid, customer, month
  UNION ALL
  SELECT station_id AS id, customer, month, sum(sales) AS sales
  FROM nonspatial_sum
  WHERE station_id IN (868,3234,843,65,7665) AND year=2002 AND gas IS NULL
  GROUP BY station_id, customer, month
)
SELECT customer, month, sum(sales) AS sales
FROM summary
GROUP BY customer, month

The processing of IQ queries does not use the summary data based on spatial information. The user is not interested in the total information over all gas stations as one, but in each individual gas station in the query window; that is, "what are the total sales of each customer at EACH gas station within my indicated query window during each month in 2002?" The Spatial Index Engine will only return one list containing all individual gas stations whose locations conform to the specified spatial predicate. The corresponding query is:

Q3:
SELECT c.type AS customer, tr.station_id AS station, t.month, sum(sales)
FROM customer c, station s, time t, transaction tr
WHERE s.location WITHIN query_window(300,220,990,882)
  AND t.year=2002 AND tr.station_id=s.station_id
  AND tr.time_id=t.time_id AND tr.customer_id=c.customer_id
GROUP BY customer, station, month

Assuming the gas station IDs returned by the index engine are {868, 3234, 843, 65, 7665}, Q3 can be rewritten as Q4, where the non-spatial summary table is used.

Q4:
SELECT customer, station_id AS station, month, sum(sales)
FROM nonspatial_sum
WHERE station_id IN (868, 3234, 843, 65, 7665) AND year=2002
  AND gas IS NULL
GROUP BY customer, station_id, month

6 Experiments

Preliminary experiments were conducted on the simplest shape: the point. Our aim is to find out whether our approach is effective for spatial OLAP queries; more complex shapes such as lines and polygons will be discussed in future work.

Two data sets were generated according to the motivating example. The locations of the gas stations in each data set follow a Gaussian distribution. By this, we attempted to reflect the fact that gas stations tend to be clustered around areas with a dense road network. Table 1 gives the statistics of these data sets:

Table 1. The statistic data of the base tables

Set   # of gas stations   # of customers   # of transactions
1     1,000               200              8,663
2     10,000              800              34,685

A spatial index was built on each data set. We computed and materialized the whole cube using the spatial hierarchy. The materialization result is stored in the table spatial_sum. The materialized cube for the other non-spatial dimensions is stored in the table nonspatial_sum. The statistics are given in Table 2.

DB2 UDB and its Spatial Extender were used as the database server to support storing spatial objects. Our system runs on a desktop machine with a Pentium III 750 MHz CPU and 512 MB of memory.


Table 2. The statistic data of the spatial index and summary tables

Set   R-tree size   R-tree depth   Size of nonspatial_sum   Size of spatial_sum
1     1,424         6              25,120                   30,490
2     14,182        9              126,176                  195,616

We compare our approach with one in which the spatial dimension is not aggregated. For simplicity, our approach is called the spatial aggregation (SA) approach and the other the non-spatial aggregation (NSA) approach. In order to restrict the discussion to the main concerns, we use the following typical query in the experiments.

Find out the total sales for every gas type in every month for all gas stations located in the specified query window.

Obviously, with the spatial hierarchy not aggregated, to answer the spatial query the base table station must be searched in order to fetch the IDs of all gas stations located within the query window. The query for NSA is

SELECT gas, month, sum(sales)
FROM nonspatial_sum
WHERE gas IS NOT NULL AND month IS NOT NULL AND
  station_id IN
    (SELECT station_id
     FROM station
     WHERE location..xmin > $left AND location..xmin < $right
       AND location..ymin > $bottom AND location..ymin < $top)
GROUP BY gas, month

Here, $left, $right, $bottom and $top indicate the borders of the specified query window. It should be mentioned that if the objects have more complex shapes, such as lines and polygons, searching for objects whose locations are within a query window is a time-consuming task. In our approach, the better part of this task has already been done when building the spatial index: the objects have been grouped into the nodes of the spatial hierarchy. The searching cost is thus reduced dramatically.

[Plots of the number of tuple accesses versus query-window area (%) for SA and NSA: (a) 1,000 gas stations; (b) 10,000 gas stations.]

Fig. 3. Aggregation degree

In order to investigate the aggregation ratio of our spatial hierarchy, for each data set we generated 9 or more query sets, each containing 100 queries. The query windows in every query set have equal area, and the centers of the windows follow a Gaussian distribution. The proportions of the window area to the whole space are predefined as 10, 20, 30, ..., 90% for these query sets. Fig. 3 depicts the average numbers of tuple accesses by the NSA and SA methods on each set.

It can be seen from Fig. 3 that as the area of the query window grows, the number of tuple accesses by the SA approach drops dramatically, while that of the NSA approach increases quickly. At any area ratio, the number of tuples accessed by SA is much smaller than that by NSA, usually by more than an order of magnitude. Fig. 4 displays the performance comparison of NSA and SA on the above query sets. It can be seen from the figure that as the window area grows, SA always performs much better than NSA. This result also complies with the trend of the aggregation degree presented in Fig. 3.

[Plots of elapsed time (s) versus query-window area (%) for SA and NSA: (a) 1,000 gas stations; (b) 10,000 gas stations.]

Fig. 4. The elapsed time of SA and NSA

7 Related Works

Although advances in OLAP technology have led to successful industrial decision support systems, the integration of spatial data into data warehouses to support OLAP has only recently become a topic of active research.

Han et al. [9, 14] were the first to propose a framework for spatial data warehouses. They proposed an extension of the star schema, focused on spatial measures, and proposed a method for selecting spatial objects for materialization. Their work differs from ours in that they focused on spatial measures while we concentrate on the spatial relationships and spatial query processing. Papadias et al. [12, 13] gave an approach to the problem of providing OLAP operations in a spatial data warehouse, i.e., to allow the user to execute aggregation queries in groups based on the position of objects in space. Both Han's and Papadias' approaches modify the classical star schema of existing data warehouses. The work in this paper manages to preserve the star schema while representing the spatial data.

Pre-aggregation and materialization are the traditional methods to enhance the performance of OLAP systems [1, 7]. Due to the lack of a priori knowledge about the spatial hierarchy, these well-known methods cannot be applied directly. Papadias' method stores aggregation results in the index. In our method, the results are simply stored in relational tables which are separate from the spatial index.


8 Conclusion

Currently, a large amount of spatial data is stored in business information systems. In order to analyze these data, traditional OLAP systems must be adjusted to handle them. From a user's perspective, it is not convenient to process spatial data separately before OLAP analysis, which is how current OLAP systems deal with spatial information.

In this paper, we discussed the feasibility of enabling spatial OLAP in traditional OLAP systems by aggregating on the spatial hierarchy. In our proposed approach, the spatial hierarchy can be established automatically. The user issues queries involving spatial predicates, and the system computes these predicates automatically using the spatial index. Pre-aggregation and materialization techniques are employed with the help of the spatial hierarchy.

References

1. Baralis, E., Paraboschi, S., Teniente, E.: Materialized View Selection in a Multidimensional Database. Proceedings of VLDB Conference, 1997.
2. Colossi, N., Malloy, W., Reinwald, B.: Relational extensions for OLAP. IBM Systems Journal, Vol 41, No 4, 2002.
3. Adler, D. W.: DB2 Spatial Extender - Spatial data within the RDBMS. Proceedings of VLDB Conference, pp 687-690, 2001.
4. Daratech: Geographic Information Systems Markets and Opportunities. Daratech, Inc. 2000.
5. Gray, J., Chaudhuri, S., Bosworth, A., et al.: Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals. Data Mining and Knowledge Discovery Journal, Vol 1, 29-53, 1997.
6. Gaede, V., Günther, O.: Multidimensional Access Methods. ACM Computing Surveys, 1997.
7. Gupta, H.: Selection of Views to Materialize in a Data Warehouse. Proceedings of International Conference on Database Theory, 1997.
8. Han, J., Kamber, M.: Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, Inc. 2001.
9. Han, J., Stefanovic, N., Koperski, K.: Selective Materialization: An Efficient Method for Spatial Data Cube Construction. Proceedings of PAKDD Conference, 1998.
10. Java OLAP Interface (JOLAP), version 0.85. Java Community Process, May 2002. Available at http://jcp.org/en/jsr/detail?id=069.
11. Kothuri, R. K. V., Ravada, S., Abugov, D.: Quadtree and R-tree Indexes in Oracle Spatial: A Comparison using GIS Data. ACM SIGMOD Conference, 2002.
12. Papadias, D., Kalnis, P., Zhang, J., Tao, Y.: Efficient OLAP operations in spatial data warehouses. Technical Report HKUST-CS01-01, Jan. 2001.
13. Papadias, D., Tao, Y., Kalnis, P., Zhang, J.: Indexing Spatio-Temporal Data Warehouses. Proceedings of International Conference on Data Engineering, 2002.
14. Stefanovic, N., Han, J., Koperski, K.: Object-Based Selective Materialization for Efficient Implementation of Spatial Data Cubes. TKDE 12(6): 938-958 (2000)


Y. Kambayashi, M. Mohania, W. Woß (Eds.): DaWaK 2003, LNCS 2737, pp. 45-54, 2003. c Springer-Verlag Berlin Heidelberg 2003

A Multidimensional Aggregation Object (MAO) Framework (Meng-Feng Tsai and Wesley Chu)

[Figure: lattice of aggregation scopes. Legend: / => root; WD => weekday; Date => day in 1 month; AC => aircraft; TB => time block; f# => flight number; [ ] = number of possible entries for each scope.]

(/, /, /)
(WD, /, /) [7]   (/, AC, /) [14]   (/, /, TB) [4]
(Date, /, /) [30]   (WD, AC, /) [98]   (WD, /, TB) [28]   (/, AC, TB) [56]   (/, /, f#) [53]
(Date, AC, /) [420]   (Date, /, TB) [120]   (WD, AC, TB) [392]   (WD, /, f#) [371]   (/, AC, f#) [742]
(Date, AC, TB) [1680]   (Date, /, f#) [1590]   (WD, AC, f#) [5194]
(Date, AC, f#)
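The entry counts in the lattice above are consistent with the product of the cardinalities of the chosen levels (7 weekdays, 30 days, 14 aircraft, 4 time blocks, 53 flight numbers, and 1 for the root level /). The following Python sketch (ours, not from the paper) reproduces that computation; the dimension names "time", "plane", and "flight" are our own labels.

# Cardinality of each level per dimension; "/" denotes the root (all) level.
CARD = {
    "time":   {"/": 1, "WD": 7, "Date": 30},
    "plane":  {"/": 1, "AC": 14},
    "flight": {"/": 1, "TB": 4, "f#": 53},
}

def scope_entries(time_level, plane_level, flight_level):
    """Number of possible entries of an aggregation scope = product of level cardinalities."""
    return (CARD["time"][time_level]
            * CARD["plane"][plane_level]
            * CARD["flight"][flight_level])

# scope_entries("WD", "AC", "/") == 98
# scope_entries("Date", "AC", "TB") == 1680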

[Figure: derivation relationships among MAOs for the measure "number of passengers" (#p). Legend: / => root; WD => weekday; Date => day in 1 month; AC => aircraft; TB => time block; f# => flight number. The variance and sum MAOs #p,(WD,AC,/),Var, #p,(WD,AC,/),Sum, #p,(Date,AC,/),Var, #p,(Date,AC,/),Sum, #p,(WD,AC,TB),Var and #p,(WD,AC,TB),Sum are derived from the cached pairs #p,(WD,AC,/),{SqrSum,Sum}, #p,(Date,AC,/),{SqrSum,Sum} and #p,(WD,AC,TB),{SqrSum,Sum}.]

Variance(X) := Sum[sqr(xi - Avg(X))] = Sum[sqr(xi)] - Sum(xi) * Avg(X)

Note: the variance (Var) can be derived from SqrSum and Sum (assuming that the function "count" is given).
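A minimal sketch (ours, not from the paper) of the derivation shown above: keeping Sum, SqrSum and a count per scope is enough both to merge entries from a finer scope into a coarser one and to derive the variance at any scope.

def merge(partials):
    """Combine (sum, sqrsum, count) triples from a finer scope into one coarser-scope entry."""
    s = sum(p[0] for p in partials)
    sq = sum(p[1] for p in partials)
    n = sum(p[2] for p in partials)
    return s, sq, n

def variance(s, sq, n):
    """Sum of squared deviations derived from SqrSum and Sum, as in the figure:
    Sum[sqr(xi - Avg)] = SqrSum - Sum * Avg."""
    avg = s / n
    return sq - s * avg

# Example: merge([(10, 100, 1), (20, 400, 1)]) == (30, 500, 2)
#          variance(30, 500, 2) == 500 - 30 * 15 == 50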


[Figure: performance improvements versus the number of cached MAOs or scopes, for two query workloads (a) and (b), comparing "using MAO" against "without MAO (scope)".]


<Execution_Plans> -> <Derivation_Relationship> <Cost_estimation> <Synchronization_scheme>

<Derivation_Relationship> -> <Sources> <Relation> <Targets>
<Sources> -> mao-list
<Targets> -> mao-list
<Relation> -> <Distributive_aggregation> | <Algebraic_mapping>
<Distributive_aggregation> -> current_partial_sum, binary_op, cursor_on_sources
<Algebraic_mapping> -> mapping_function, cursor_on_sources

<Cost_estimation> -> <For_caching> <For_synchronization>
<For_caching> -> <Caching_computation> <Caching_retrieval>
<Caching_computation> -> function1, size_of_sources
<Caching_retrieval> -> function2, size_of_sources
<For_synchronization> -> <Compensation_cost> | <Recomputation_cost>
<Compensation_cost> -> accessing_cost, computing_cost
<Recomputation_cost> -> <Recomputation_compute> <Recomputation_access>
<Recomputation_compute> -> function1, size_updated_sources
<Recomputation_access> -> function2, size_updated_sources

<Synchronization_scheme> -> <Compensation> <Recomputation>
<Compensation> -> <direct> <indirect>
<direct> -> binary_op, insertion_data, inverse_op, deletion_data
<indirect> -> search_for_indirect_sources
<Recomputation> -> <Distributive_ag> | <Algebraic_mf>
<Distributive_ag> -> current_partial_sum2, binary_op, cursor_on_updated_sources
<Algebraic_mf> -> mapping_function, cursors_on_updated_sources


Derivation_Relationship:
  ((Date,AC,TB), SqrSum, # of passengers) -> ((WD,AC,TB), SqrSum, # of passengers)
  Source: ((Date,AC,TB), SqrSum, # of passengers)
  Target: ((WD,AC,TB), SqrSum, # of passengers)
  Distributive_Aggregation: if sources are raw data (most detailed fact) => W = X*X + S
                            else W = Y + S
  ## W: cursor on target entry; X, Y: cursors on source entries; S: accumulating result.

Cost_Estimation:
  For_caching: Caching_computation: f1(size_of_source)   Caching_retrieval: f2(size_of_source)
  For_synchronization:
    Compensation_cost: computation_cost: f1(size_of_inserted) + f3(size_of_deleted)
                       access_cost: f2(size_of_inserted + size_of_deleted)
    Recomputation_cost: compute: f1(size_of_updated_source)
                        access: f2(size_of_updated_source)

Synchronization_scheme:
  direct_compensation: if sources are raw data => W = W + I*I - D*D
                       else W = W + I - D
  indirect_compensation: *pointer to the source's compensation plan
  ## The recomputation scheme is the same as in Distributive_Aggregation, except that the source's cursors are set on the updated source.
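A sketch (ours) of the direct compensation rule in the plan above for a SqrSum target: when the source holds raw (most detailed) data, inserted values contribute their squares and deleted values subtract theirs; when the source already stores squared sums, the increments are applied directly.

def compensate_sqrsum(current, inserted, deleted, source_is_raw):
    """Update a cached SqrSum entry W after insertions/deletions in its source."""
    if source_is_raw:
        # W = W + I*I - D*D, applied value by value
        return current + sum(i * i for i in inserted) - sum(d * d for d in deleted)
    # source already stores squared sums: W = W + I - D
    return current + sum(inserted) - sum(deleted)

# compensate_sqrsum(500, inserted=[3], deleted=[10], source_is_raw=True) == 500 + 9 - 100 == 409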


The GMD Data Model for Multidimensional Information: A Brief Introduction

Enrico Franconi and Anand Kamble

Faculty of Computer Science, Free Univ. of Bozen-Bolzano, Italy

[email protected]

[email protected]

Abstract. In this paper we introduce a novel data model for multidimensional information, GMD, generalising the MD data model first proposed in Cabibbo et al. (EDBT-98). The aim of this work is not to propose yet another multidimensional data model, but to find the general precise formalism encompassing all the proposals for a logical data model in the data warehouse field. Our proposal is compatible with all these proposals, therefore making possible a formal comparison of the differences of the models in the literature, and the study of formal properties or extensions of such data models. Starting with a logic-based definition of the semantics of the GMD data model and of the basic algebraic operations over it, we show how the most important approaches in DW modelling can be captured by it. The star and the snowflake schemas, Gray's cube, Agrawal's and Vassiliadis' models, MD and other multidimensional conceptual data models can be captured uniformly by GMD. In this way it is possible to formally understand the real differences in expressivity of the various models, their limits, and their potentials.

1 Introduction

In this short paper we introduce a novel data model for multidimensional information, GMD, generalising the MD data model first proposed in [2]. The aim of this work is not to propose yet another data model, but to find the most general formalism encompassing all the proposals for a logical data model in the data warehouse field, as for example summarised in [10]. Our proposal is compatible with all these proposals, therefore making possible a formal comparison of the different expressivities of the models in the literature. We believe that the GMD data model is already very useful since it provides a very precise and, we believe, very elegant and uniform way to model multidimensional information. It turns out that most of the proposals in the literature make many hidden assumptions which may harm the understanding of the advantages or disadvantages of the proposal itself. An embedding in our model would make all these assumptions explicit.

Y. Kambayashi, M. Mohania, W. Woß (Eds.): DaWaK 2003, LNCS 2737, pp. 55–65, 2003.c© Springer-Verlag Berlin Heidelberg 2003


So far, we have considered, together with the classical basic star and snowflake ER-based models and multidimensional cubes, the logical data models introduced in [2, 5, 1, 6, 9, 11, 3, 7, 8]. A complete account of both the GMD data model (including an extended algebra) and of the various encodings can be found in [4]; in this paper we just give a brief introduction to the basic principles of the data model.

GMD is completely defined using a logic-based approach. We start by introducing a data warehouse schema, which is nothing else than a set of fact definitions which restricts (i.e., constrains) the set of legal data warehouse states associated to the schema. By systematically defining how the various operators used in a fact definition constrain the legal data warehouse states, we give a formal logic-based account of the GMD data model.

2 The Syntax of the GMD Data Model

We introduce in this section the notion of a data warehouse schema. A data warehouse schema basically introduces the structures of the cubes that will populate the warehouse, together with the types allowed for the components of the structures. The definition of a GMD schema that follows is explained step by step.

Definition 1 (GMD schema). Consider the signature <F, D, L, M, V, A>, where F is a finite set of fact names, D is a finite set of dimension names, L is a finite set of level names – each one associated to a finite set of level element names, M is a finite set of measure names, V is a finite set of domain names – each one associated to a finite set of values, and A is a finite set of level attributes.

➽ We have just defined the alphabet of a data warehouse: we may have fact names (like SALES, PURCHASES), dimension names (like Date, Product), level names (like year, month, product-brand, product-category) and their level elements (like 2003, 2004, heineken, drink), measure names (like Price, UnitSales), domain names (like integers, strings), and level attributes (like is-leap, country-of-origin).

A GMD schema includes:

– a finite set of fact definitions of the form

F .= E {D1|L1, . . . , Dn|Ln} : {M1|V1, . . . , Mm|Vm},

where E, F ∈ F, Di ∈ D, Li ∈ L, Mj ∈ M, Vj ∈ V. We call the fact name F a defined fact, and we say that F is based on E. A fact name not appearing at the left hand side of a definition is called an undefined fact. We will generally call fact either a defined fact or an undefined fact. A fact based on an undefined fact is called a basic fact. A fact based on a defined fact is called an aggregated fact. A fact is dimensionless if n = 0; it is measureless if m = 0. The orderings in a defined fact among dimensions and among measures are irrelevant.


➽ We have here introduced the building block of a GMD schema: the fact definition. A basic fact corresponds to the base data of any data warehouse: it is the cube structure that contains all the data on which any other cube will be built upon. In the following example, BASIC-SALES is a basic fact, including base data about sale transactions, organised by date, product, and store (which are the dimensions of the fact), which are respectively restricted to the levels day, product, and store, and with unit sales and sale price as measures:

BASIC-SALES .= SALES {Date|day, Product|product, Store|store} : {UnitSales|int, SalePrice|int}

– a partial order (L, ≤) on the levels in L. We call ≺ the immediate predecessor relation on L induced by ≤.

➽ The partial order defines the taxonomy of levels. For example, day ≺ month ≺ quarter and day ≺ week; product ≺ type ≺ category.

– a finite set of roll-up partial functions between level elements

ρLi,Lj : Li ↦ Lj

for each Li, Lj such that Li ≺ Lj. We call ρ*Li,Lj the reflexive transitive closure of the roll-up functions, inductively defined as follows:

ρ*Li,Li = id
ρ*Li,Lj = ⋃k ρLi,Lk ◦ ρ*Lk,Lj   for each k such that Li ≺ Lk

where

(ρLp,Lq ∪ ρLr,Ls)(x) = y  iff  ρLp,Lq(x) = ρLr,Ls(x) = y, or
                              ρLp,Lq(x) = y and ρLr,Ls(x) = ⊥, or
                              ρLp,Lq(x) = ⊥ and ρLr,Ls(x) = y

➽ When in a schema various levels are introduced for a dimension, it is also necessary to introduce a roll-up function for them. A roll-up function defines how elements of one level map to elements of a superior level. Since we only require the roll-up function to be partial, it is possible to have elements of a level which roll up to an upper level, while other elements may skip that upper level and be mapped directly to a superior one. For example, ρday,month(1/1/01) = Jan-01, ρday,month(2/1/01) = Jan-01, . . . , ρquarter,year(Qtr1-01) = 2001, ρquarter,year(Qtr2-01) = 2001, . . .


– a finite set of level attribute definitions:

L .= {A1|V1, . . . , An|Vn}

where L ∈ L, Ai ∈ A and Vi ∈ V for each i, 1 ≤ i ≤ n.

➽ Level attributes are properties associated to levels. For example, product .= {prodname|string, prodnum|int, prodsize|int, prodweight|int}

– a finite set of measure definitions of the form

N .= f(M)

where N, M ∈ M, and f is an aggregation function f : B(V) ↦ W, for some V, W ∈ V. B(V) is the finite set of all bags obtainable from values in V whose cardinality is bound by some finite integer Ω.

➽ Measure definitions are used to compute values of measures in an aggregated fact from values of the fact it is based on. For example: Total-UnitSales .= sum(UnitSales) and Avg-SalePrice .= average(SalePrice)

Levels and facts are subject to additional syntactical well-foundedness conditions:

– The connected components of (L, ≤) must have a unique least element each, which is called basic level.

➽ The basic level contains the finest grained level elements, on top of which all the facts are identified. For example, store ≺ city ≺ country; store is a basic level.

– For each undefined fact there can be at most one basic fact based on it.

➽ This allows us to disregard undefined facts, which are in one-to-one correspondence with basic facts.

– Each aggregated fact must be congruent with the defined fact it is based on, i.e., for each aggregated fact G and for the defined fact F it is based on such that

F .= E {D1|L1, . . . , Dn|Ln} : {M1|V1, . . . , Mm|Vm}
G .= F {D1|R1, . . . , Dp|Rp} : {N1|W1, . . . , Nq|Wq}

the following must hold (for some reordering on the dimensions):
• the dimensions in the aggregated fact G are among the dimensions of the fact F it is based on:
  p ≤ n
• the level of a dimension in the aggregated fact G is above the level of the corresponding dimension in the fact F it is based on:
  Li ≤ Ri for each i ≤ p


• each measure in the aggregated fact G is computed via an aggregation function from some measure of the defined fact F it is based on:

  N1 .= f1(Mj(1))   . . .   Nq .= fq(Mj(q))

Moreover, the range and the domain of the aggregation function should be in agreement with the domains specified respectively in the aggregated fact G and in the fact F it is based on.

➽ Here we give a more precise characterisation of an aggregated fact: its dimensions should be among the dimensions of the fact it is based on, its levels should be generalised from the corresponding ones in the fact it is based on, and its measures should all be computed from the fact it is based on. For example, given the basic fact BASIC-SALES:

BASIC-SALES .= SALES {Date|day, Product|product, Store|store} : {UnitSales|int, SalePrice|int}

the following SALES-BY-MONTH-AND-TYPE is an aggregated fact computed from the BASIC-SALES fact:

SALES-BY-MONTH-AND-TYPE .= BASIC-SALES {Date|month, Product|type} : {Total-UnitSales|int, Avg-SalePrice|real}

with the following aggregated measures:

Total-UnitSales .= sum(UnitSales)
Avg-SalePrice .= average(SalePrice)

2.1 Example

The following GMD schema summarises the examples shown in the previous section:

– Signature:

• F = {SALES, BASIC-SALES, SALES-BY-MONTH-AND-TYPE, PURCHASES}
• M = {UnitSales, Price, Total-UnitSales, Avg-Price}
• D = {Date, Product, Store}
• L = {day, week, month, quarter, year, product, type, category, brand, store, city, country}
  day = {1/1/01, 2/1/01, . . . , 1/1/02, 2/1/02, . . . }
  month = {Jan-01, Feb-01, . . . , Jan-02, Feb-02, . . . }
  quarter = {Qtr1-01, Qtr2-01, . . . , Qtr1-02, Qtr2-02, . . . }
  year = {2001, 2002}
  · · ·


• V = {int, real, string}
• A = {dayname, prodname, prodsize, prodweight, storenumb}

– Partial order over levels:

• day ≺ month ≺ quarter ≺ year, day ≺ week; day is a basic level
• product ≺ type ≺ category, product ≺ brand; product is a basic level
• store ≺ city ≺ country; store is a basic level

– Roll-up functions:
  ρday,month(1/1/01) = Jan-01, ρday,month(2/1/01) = Jan-01, . . .
  ρmonth,quarter(Jan-01) = Qtr1-01, ρmonth,quarter(Feb-01) = Qtr1-01, . . .
  ρquarter,year(Qtr1-01) = 2001, ρquarter,year(Qtr2-01) = 2001, . . .
  ρ*day,year(1/1/01) = 2001, ρ*day,year(2/1/01) = 2001, . . .
  · · ·

– Level Attributes:
  day .= {dayname|string, daynum|int}
  product .= {prodname|string, prodnum|int, prodsize|int, prodweight|int}
  store .= {storename|string, storenum|int, address|string}

– Facts:
  BASIC-SALES .= SALES {Date|day, Product|product, Store|store} : {UnitSales|int, SalePrice|int}
  SALES-BY-MONTH-AND-TYPE .= BASIC-SALES {Date|month, Product|type} : {Total-UnitSales|int, Avg-SalePrice|real}

– Measures:
  Total-UnitSales .= sum(UnitSales)
  Avg-SalePrice .= average(SalePrice)
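To make the shape of a GMD schema concrete, the following is a minimal Python sketch (ours, not part of the GMD proposal) of how the example signature, fact definitions, level ordering, and roll-up functions could be represented as plain data structures; all names mirror the example above, and the first-match recursion is only an approximation of the union used in the definition of ρ*.

# Illustrative encoding of the example GMD schema as Python data.
levels_order = {            # immediate predecessor relation on levels
    "day": ["month", "week"], "month": ["quarter"], "quarter": ["year"],
    "product": ["type", "brand"], "type": ["category"],
    "store": ["city"], "city": ["country"],
}

rollup = {                  # partial roll-up functions between adjacent levels
    ("day", "month"): {"1/1/01": "Jan-01", "2/1/01": "Jan-01"},
    ("month", "quarter"): {"Jan-01": "Qtr1-01", "Feb-01": "Qtr1-01"},
    ("quarter", "year"): {"Qtr1-01": "2001", "Qtr2-01": "2001"},
}

facts = {
    "BASIC-SALES": {
        "based_on": "SALES",
        "dimensions": {"Date": "day", "Product": "product", "Store": "store"},
        "measures": {"UnitSales": "int", "SalePrice": "int"},
    },
    "SALES-BY-MONTH-AND-TYPE": {
        "based_on": "BASIC-SALES",
        "dimensions": {"Date": "month", "Product": "type"},
        "measures": {"Total-UnitSales": "int", "Avg-SalePrice": "real"},
    },
}

measure_defs = {"Total-UnitSales": ("sum", "UnitSales"),
                "Avg-SalePrice": ("average", "SalePrice")}

def rollup_star(level_from, level_to, element, order=levels_order, maps=rollup):
    """Sketch of rho*: reflexive transitive closure of the roll-up functions."""
    if level_from == level_to:
        return element
    for nxt in order.get(level_from, []):
        step = maps.get((level_from, nxt), {}).get(element)
        if step is not None:
            result = rollup_star(nxt, level_to, step)
            if result is not None:
                return result
    return None  # undefined: the roll-up is partial

# Example: rollup_star("day", "year", "1/1/01") == "2001"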

3 GMD Semantics

Having just defined the syntax of GMD schemas, we now introduce their semantics through a well-founded model theory. We define the notion of a data warehouse state, namely a specific data warehouse, and we formalise when a data warehouse state is actually in agreement with the constraints imposed by a GMD schema.

Definition 2 (Data Warehouse State). A data warehouse state over a schema with the signature <F, D, L, M, V, A> is a tuple I = <Δ, Λ, Γ, ·I>, where

– Δ is a non-empty finite set of individual facts (or cells) of cardinality smaller than Ω;

➽ Elements in Δ are the object identifiers for the cells in a multidimensional cube; we call them individual facts.

– Λ is a finite set of level elements;
– Γ is a finite set of domain elements;


– ·I is a function (the interpretation function) such that

FI ⊆ Δ for each F ∈ F, where FI is disjoint from any other EI such that E ∈ F
LI ⊆ Λ for each L ∈ L, where LI is disjoint from any other HI such that H ∈ L
VI ⊆ Γ for each V ∈ V, where VI is disjoint from any other WI such that W ∈ V
DI = Δ ↦ Λ for each D ∈ D
MI = Δ ↦ Γ for each M ∈ M
(ALi)I = L ↦ Γ for each L ∈ L and ALi ∈ A for some i

(Note: in the paper we will omit the ·I interpretation function applied to some symbol whenever this is unambiguous.)

➽ The interpretation function defines a specific data warehouse state given a GMD signature, regardless of any fact definition. It associates to a fact name a set of cells (individual facts), which are meant to form a cube. To each cell corresponds a level element for some dimension name: the sequence of these level elements is meant to be the "coordinate" of the cell. Moreover, to each cell corresponds a value for some measure name. Since fact definitions in the schema are not considered yet at this stage, the dimensions and the measures associated to cells are still arbitrary. In the following, we will introduce the notion of legal data warehouse state, which is the data warehouse state which conforms to the constraints imposed by the fact definitions. A data warehouse state will be called legal for a given GMD schema if it is a data warehouse state in the signature of the GMD schema and it satisfies the additional conditions found in the GMD schema.

A data warehouse state is legal with respect to a GMD schema if:

– for each fact F .= E {D1|L1, . . . , Dn|Ln} : {M1|V1, . . . , Mm|Vm} in the schema:
• the function associated to a dimension which does not appear in a fact is undefined for its cells:
  ∀f. F(f) → f ∉ dom(D)   for each D ∈ D such that D ≠ Di for each i ≤ n

➽ This condition states that the level elements associated to a cell of a fact should correspond only to the dimensions declared in the fact definition of the schema. That is, a cell has only the declared dimensions in any legal data warehouse state.
• each cell of a fact has a unique set of dimension values at the appropriate level:
  ∀f. F(f) → ∃l1, . . . , ln. D1(f) = l1 ∧ L1(l1) ∧ . . . ∧ Dn(f) = ln ∧ Ln(ln)

➽ This condition states that the level elements associated to a cell of a fact are unique for each dimension declared for the fact in the schema. So, a cell has a unique value for each declared dimension in any legal data warehouse state.


• a set of dimension values identifies a unique cell within a fact:
  ∀f, f′, l1, . . . , ln. F(f) ∧ F(f′) ∧ D1(f) = l1 ∧ D1(f′) = l1 ∧ . . . ∧ Dn(f) = ln ∧ Dn(f′) = ln → f = f′

➽ This condition states that a sequence of level elements associated to a cell of a fact is associated only to that cell. Therefore, the sequence of dimension values can really be seen as an identifying coordinate for the cell. In other words, these conditions enforce the legal data warehouse state to really model a cube according to the specification given in the schema.
• the function associated to a measure which does not appear in a fact is undefined for its cells:
  ∀f. F(f) → f ∉ dom(M)   for each M ∈ M such that M ≠ Mi for each i ≤ m

➽ This condition states that the measure values associated to a cell of a fact in a legal data warehouse state should correspond only to the measures explicitly declared in the fact definition of the schema.
• each cell of a fact has a unique set of measures:
  ∀f. F(f) → ∃m1, . . . , mm. M1(f) = m1 ∧ V1(m1) ∧ . . . ∧ Mm(f) = mm ∧ Vm(mm)

➽ This condition states that the measure values associated to a cell of a fact are unique for each measure explicitly declared for the fact in the schema. So, a cell has a unique measure value for each declared measure in any legal data warehouse state.

– for each aggregated fact and for the defined fact it is based on in the schema:

F .= E {D1|L1, . . . , Dn|Ln} : {M1|V1, . . . , Mm|Vm}
G .= F {D1|R1, . . . , Dp|Rp} : {N1|W1, . . . , Nq|Wq}
N1 .= f1(Mj(1))   . . .   Nq .= fq(Mj(q))

each aggregated measure function should actually compute the aggregation of the values in the corresponding measure of the fact the aggregation is based on:

∀g, v. Ni(g) = v ↔ ∃r1, . . . , rp. G(g) ∧ D1(g) = r1 ∧ . . . ∧ Dp(g) = rp ∧
  v = fi({| Mj(i)(f) | ∃l1, . . . , lp. F(f) ∧ D1(f) = l1 ∧ . . . ∧ Dp(f) = lp ∧
           ρ*L1,R1(l1) = r1 ∧ . . . ∧ ρ*Lp,Rp(lp) = rp |})

for each i ≤ q, where {| · |} denotes a bag.

➽ This condition guarantees that if a fact is the aggregation of another fact, then in a legal data warehouse state the measures associated to the cells of the aggregated cube should actually be computed by applying the aggregation function to the measures of the corresponding cells in the original cube. The correspondence between a cell in the aggregated cube and a set of cells in the original cube is found by looking at how their coordinates – which are level elements – are mapped through the roll-up function, dimension by dimension.
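As a rough operational reading of this condition (ours, not part of the GMD definition), the following Python sketch groups the cells of a base fact by their rolled-up coordinates and applies the aggregation functions to the resulting bags; it relies on a rollup_star function such as the one sketched earlier, and the data layout is assumed.

from collections import defaultdict

def aggregate_fact(base_cells, dims, target_levels, source_levels, measure_defs, rollup_star):
    """base_cells: list of (coord: dict dim -> level element, meas: dict name -> value).
    measure_defs: {new_measure: (agg_fn, source_measure)} with agg_fn over a list (bag)."""
    groups = defaultdict(list)
    for coord, meas in base_cells:
        # map the coordinate of the base cell to the coarser levels via rho*
        key = tuple(rollup_star(source_levels[d], target_levels[d], coord[d]) for d in dims)
        groups[key].append(meas)
    result = []
    for key, bag in groups.items():
        new_coord = dict(zip(dims, key))
        new_meas = {n: fn([m[src] for m in bag]) for n, (fn, src) in measure_defs.items()}
        result.append((new_coord, new_meas))
    return result

# Example use, mirroring SALES-BY-MONTH-AND-TYPE (assuming the roll-up maps also
# cover day -> month and product -> type):
# aggregate_fact(basic_sales_cells, ["Date", "Product"],
#                {"Date": "month", "Product": "type"},
#                {"Date": "day", "Product": "product"},
#                {"Total-UnitSales": (sum, "UnitSales"),
#                 "Avg-SalePrice": (lambda xs: sum(xs) / len(xs), "SalePrice")},
#                rollup_star)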


According to the definition, a legal data warehouse state for a GMD schema is a bunch of multidimensional cubes, whose cells carry measure values. Each cube conforms to the fact definition given in the GMD schema, i.e., the coordinates are in agreement with the dimensions and the levels specified, and the measures are of the correct type. If a cube is the aggregation of another cube, in a legal data warehouse state it is enforced that the measures of the aggregated cube are correctly computed from the measures of the original cube.

3.1 Example

A possible legal data warehouse state for (part of) the previous example GMD schema is shown in the following.

BASIC-SALESI = {s1, s2, s3, s4, s5, s6, s7}
SALES-BY-MONTH-AND-TYPEI = {g1, g2, g3, g4, g5, g6}

Date(s1) = 1/1/01    Product(s1) = Organic-milk-1l     Store(s1) = Fair-trade-central   UnitSales(s1) = 100   EuroSalePrice(s1) = 71,00
Date(s2) = 7/1/01    Product(s2) = Organic-yogh-125g   Store(s2) = Fair-trade-central   UnitSales(s2) = 500   EuroSalePrice(s2) = 250,00
Date(s3) = 7/1/01    Product(s3) = Organic-milk-1l     Store(s3) = Ali-grocery          UnitSales(s3) = 230   EuroSalePrice(s3) = 138,00
Date(s4) = 10/2/01   Product(s4) = Organic-milk-1l     Store(s4) = Barbacan-store       UnitSales(s4) = 300   EuroSalePrice(s4) = 210,00
Date(s5) = 28/2/01   Product(s5) = Organic-beer-6pack  Store(s5) = Fair-trade-central   UnitSales(s5) = 210   EuroSalePrice(s5) = 420,00
Date(s6) = 2/3/01    Product(s6) = Organic-milk-1l     Store(s6) = Fair-trade-central   UnitSales(s6) = 150   EuroSalePrice(s6) = 105,00
Date(s7) = 12/3/01   Product(s7) = Organic-beer-6pack  Store(s7) = Ali-grocery          UnitSales(s7) = 100   EuroSalePrice(s7) = 200,00

Date(g1) = Jan-01   Product(g1) = Dairy   Total-UnitSales(g1) = 830   Avg-EuroSalePrice(g1) = 153,00
Date(g2) = Feb-01   Product(g2) = Dairy   Total-UnitSales(g2) = 300   Avg-EuroSalePrice(g2) = 210,00
Date(g3) = Jan-01   Product(g3) = Drink   Total-UnitSales(g3) = 0     Avg-EuroSalePrice(g3) = 0,00
Date(g4) = Feb-01   Product(g4) = Drink   Total-UnitSales(g4) = 210   Avg-EuroSalePrice(g4) = 420,00
Date(g5) = Mar-01   Product(g5) = Dairy   Total-UnitSales(g5) = 150   Avg-EuroSalePrice(g5) = 105,00
Date(g6) = Mar-01   Product(g6) = Drink   Total-UnitSales(g6) = 100   Avg-EuroSalePrice(g6) = 200,00

daynum(day) = 1   prodweight(product) = 100gm   storenum(store) = S101

4 GMD Extensions

For lack of space, in this brief report it is impossible to introduce the full GMD framework [4], which includes a full algebra in addition to the basic aggregation operation introduced in this paper. We will just mention the main extensions with respect to what has been presented here, and the main results.

The full GMD schema language also includes the possibility to define aggregated measures with respect to the application of a function to a set of original measures, pretty much like in SQL. For example, it is possible to have an aggregated cube with a measure total-profit being the sum of the differences between the cost and the price in the original cube; the difference is applied cell by cell in the original cube (generating a profit virtual measure), and then the aggregation computes the sum of all the profits.

Two selection operators are also in the full GMD language. The slice operation simply selects the cells of a cube corresponding to a specific value for a dimension, resulting in a cube which contains a subset of the cells of the original one and one less dimension. The multislice allows for the selection of ranges of values for a dimension, so that the resulting cube will contain a subset of the cells of the original one but retains the selected dimension.

A fact-join operation is defined only between cubes sharing the same dimensions and the same levels. We argue that a more general join operation is meaningless in a cube algebra, since it may lead to cubes whose measures are no longer understandable. For similar reasons we do not allow a general union operator (like the one proposed in [6]).

As mentioned in the introduction, one main result is the full encoding of many data warehouse logical data models as GMD schemas. We are able in this way to give a homogeneous semantics (in terms of legal data warehouse states) to the logical models and the algebras proposed in all these different approaches, we are able to clarify ambiguous parts, and we argue about the utility of some of the operators presented in the literature.

The other main result is the proposal of a novel conceptual data model for multidimensional information, which extends and clarifies the one presented in [3].

References

[1] R. Agrawal, A. Gupta, and S. Sarawagi. Modeling multidimensional databases. In Proc. of ICDE-97, 1997.
[2] Luca Cabibbo and Riccardo Torlone. A logical approach to multidimensional databases. In Proc. of EDBT-98, 1998.
[3] E. Franconi and U. Sattler. A data warehouse conceptual data model for multidimensional aggregation. In Proc. of the Workshop on Design and Management of Data Warehouses (DMDW-99), 1999.
[4] Enrico Franconi and Anand S. Kamble. The GMD data model for multidimensional information. Technical report, Free University of Bozen-Bolzano, Italy, 2003. Forthcoming.
[5] M. Golfarelli, D. Maio, and S. Rizzi. The dimensional fact model: a conceptual model for data warehouses. IJCIS, 7(2-3):215-247, 1998.
[6] J. Gray, A. Bosworth, A. Layman, and H. Pirahesh. Data cube: a relational aggregation operator generalizing group-by, cross-tabs and subtotals. In Proc. of ICDE-96, 1996.
[7] M. Gyssens and L. V. S. Lakshmanan. A foundation for multi-dimensional databases. In Proc. of VLDB-97, pages 106-115, 1997.
[8] A. Tsois, N. Karayiannidis, and T. Sellis. MAC: Conceptual data modelling for OLAP. In Proc. of the International Workshop on Design and Management of Data Warehouses (DMDW-2001), pages 5-1 to 5-13, 2001.
[9] P. Vassiliadis. Modeling multidimensional databases, cubes and cube operations. In Proc. of the 10th SSDBM Conference, Capri, Italy, July 1998.
[10] P. Vassiliadis and T. Sellis. A survey of logical models for OLAP databases. In SIGMOD Record, volume 28, pages 64-69, December 1999.
[11] P. Vassiliadis and S. Skiadopoulos. Modelling and optimisation issues for multidimensional databases. In Proc. of CAiSE-2000, pages 482-497, 2000.


An Application of Case-Based Reasoning in Multidimensional Database Architecture*

Dragan Simić1, Vladimir Kurbalija2, Zoran Budimac2

1 Novi Sad Fair, Hajduk Veljkova 11, 21000 Novi Sad, [email protected]

2 Department of Mathematics and Informatics, Fac. of Science, Univ. of Novi Sad, Trg D. Obradovića 4, 21000 Novi Sad, [email protected], [email protected]

ABSTRACT. A concept of a decision support system is considered in this paper. It provides the data needed for fast, precise and good business decision making to all levels of management. The aim of the project is the development of a new online analytical processing oriented on case-based reasoning (CBR), where previous experience is taken into account for every new problem. Methodological aspects have been tested in practice as a part of the management information system development project of "Novi Sad Fair". A case study of an application of CBR in the prediction of future payments is discussed in the paper.

1 Introduction

In recent years, there has been an explosive growth in the use of databases for decision support systems. This phenomenon is a result of the increased availability of new technologies to support efficient storage and retrieval of large volumes of data: data warehouse and online analytical processing (OLAP) products. A data warehouse can be defined as an online repository of historical enterprise data that is used to support decision-making. OLAP refers to technologies that allow users to efficiently retrieve data from the data warehouse.

In order to help an analyst focus on important data and make better decisions, case-based reasoning (CBR - an artificial intelligence technology) is introduced for making predictions based on previous cases. CBR will automatically generate an answer to the problem using stored experience, thus freeing the human expert of obligations to analyse numerical or graphical data.

The use of CBR in predicting the rhythm of issuing invoices and receiving actual payments, based on the experience stored in the data warehouse, is presented in this paper. Predictions obtained in this manner are important for the future planning of a company such as the "Novi Sad Fair", because achievement of sales plans, revenue, and company liquidity are measures of success in business. Performed simulations show that predictions made by CBR differ by only 8% from what actually happened. With the inclusion of more historical data in the warehouse, the system gets better at prediction. Furthermore, the system uses not only the data warehouse but also previous cases and previous predictions when making future predictions, thus learning during the operating process.

* Research was partially supported by the Ministry of Science, Technologies and Development of the Republic of Serbia, project no. 1844: "Development of (intelligent) techniques based on software agents for application in information retrieval and workflow".

Y. Kambayashi, M. Mohania, W. Woß (Eds.): DaWaK 2003, LNCS 2737, pp. 66-75, 2003. c Springer-Verlag Berlin Heidelberg 2003

The combination of CBR and data warehousing, i.e. making an OLAP intelligent by the use of CBR, is a rarely used approach, if used at all. The system also uses a novel CBR technique to compare graphical representations of data, which greatly simplifies the explanation of the prediction process to the end-user [3].

The rest of the paper is organized as follows. The following section elaborates more on motivations and reasons for the inclusion of CBR in the decision support system. This section also introduces the case study on which we shall describe the usage of our system. Section three overviews the case-based reasoning technique, while section four describes the original algorithm for searching the previous cases (curves) looking for the most similar one. The fifth section describes the actual application of our technique to the given problem. Section six presents the related work, while the seventh section concludes the paper.

2 User requirements for decision support system

“Novi Sad Fair” represents a complex organization considering the fact that it is engaged in a multitude of activities. The basic Fair activity is organizing fair exhibitions, although it has particular activities throughout the year. Ten times a year, 27 fair exhibitions are organized where nearly 4000 exhibitors take part, both from the country and abroad.

Besides designing a ‘classical’ decision support system based on a data warehouse and OLAP, the requirements of the company management clearly showed that it would not be enough for good decision making. The decision to include artificial intelligence methods in general, and CBR in particular, into the whole system was driven by the results of a survey. The survey was made on a sample of 42 individuals (users of the current management information system) divided into three groups: strategic-tactical management (9 people), operational managers (15 people), and transactional users (18 people).

After a statistical evaluation of the survey [5], the following conclusions (among others) were drawn:

– Development of the decision support system should be focussed on problems closely related to financial estimates and the tracking of financial market trends spanning several years.
– The key influences on business (management) are the political and economic environment of the country and region, which induces the necessity of exact implementation of those influences in the observed model (problem). It is also necessary to take them into account in estimations of future events.


– The behavior of the observed case does not depend on its pre-history but only on its initial state.

Implementation of this non-exact mathematical model is a very complex problem. As an example, let us take a look at the problem pointed out to us by company managers.

During any fair exhibition, the total actual income is only 30% to 50% of the total invoice value. Therefore, managers want to know how high the payment for some fair services would be at some future time, with respect to invoicing. If they could predict reliably enough what will happen in the future, they could undertake important business activities to ensure faster arrival of invoiced payments and plan future activities and exhibitions better.

The classical methods cannot explain influences on business and management well enough. There are political and economic environments of the country and region that cannot be successfully explained and used with classical methods: war in Iraq, oil deficiency, political assassinations, terrorism, spiral growth in the mobile telecommunication industry, general human occupation and motivation. And this is even more true in an enterprise such as the Fair, whose success depends on many external factors.

One possible approach in dealing with external influences is observing the case histories of similar problems (cases) for a longer period of time, and making estimations according to that observation. This approach, generally speaking, represents intelligent search which is applied to solving new problems by adapting solutions that worked for similar problems in the past - case-based reasoning.

3 Case based reasoning

Case-Based Reasoning is a relatively new and promising area of artificial intelligence and it is also considered a problem solving technology (or technique). This technology is used for solving problems in domains where experience plays an important role [2].

Generally speaking, case-based reasoning is applied to solving new problems by adapting solutions that worked for similar problems in the past. The main supposition here is that similar problems have similar solutions. The basic scenario for almost all CBR applications looks as follows. In order to find a solution to an actual problem, one looks for a similar problem in an experience base, takes the solution from the past and uses it as a starting point to find a solution to the actual problem. In CBR systems experience is stored in the form of cases. A case is a recorded situation where a problem was totally or partially solved, and it can be represented as an ordered pair (problem, solution). The whole experience is stored in the case base, which is a set of cases, each representing some previous episode where a problem was successfully solved.

The main problem in CBR is to find a good similarity measure – the measure that can tell to what extent two problems are similar. In the functional way, similarity can be defined as a function sim : U × CB → [0, 1], where U refers to the universe of all objects (from a given domain), while CB refers to the case base (just those objects which were examined in the past and saved in the case memory). The higher value of the similarity function means that these objects are more similar [1].
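A minimal Python sketch (ours, not the authors' code) of such a distance-based similarity measure and the associated retrieval step; distance_fn, case_base, and the helper names are assumptions, and the sim = 1/(1+d) conversion is the one mentioned in the paper's later footnote.

def similarity(distance):
    """Turn a distance into a similarity in [0, 1]: sim = 1 / (1 + d)."""
    return 1.0 / (1.0 + distance)

def retrieve(problem, case_base, distance_fn, k=1):
    """Return the k cases most similar to `problem`; distance_fn is domain-specific."""
    scored = [(similarity(distance_fn(problem, case_problem)), case_problem, solution)
              for case_problem, solution in case_base]
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored[:k]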

A case-based reasoning system does not have the sole goal of providing solutions to problems; it also takes care of other tasks occurring when used in practice. The main phases of the case-based reasoning activities are described in the CBR cycle (fig. 1) [1].

Fig. 1. The CBR-Cycle after Aamodt and Plaza (1994)

In the retrieve phase the most similar case (or the k most similar cases) to the problem case is retrieved from the case memory, while in the reuse phase some modifications to the retrieved case are made in order to provide a better solution to the problem (case adaptation). As case-based reasoning only suggests solutions, there may be a need for a correctness proof or an external validation. That is the task of the revise phase. In the retain phase the knowledge learned from this problem is integrated in the system by modifying some knowledge containers.
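A compact sketch (ours) of the four-phase cycle just described, reusing the retrieve helper from the sketch above; the adapt and validate hooks and the case base layout are illustrative placeholders.

def cbr_cycle(problem, case_base, distance_fn, adapt, validate):
    best = retrieve(problem, case_base, distance_fn, k=1)[0]   # retrieve
    _, similar_problem, past_solution = best
    proposed = adapt(problem, similar_problem, past_solution)  # reuse (case adaptation)
    confirmed = validate(problem, proposed)                    # revise (external validation)
    case_base.append((problem, confirmed))                     # retain (learn the new case)
    return confirmed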

The main advantage of this technology is that it can be applied to almost any domain. A CBR system does not try to find rules between parameters of the problem; it just tries to find similar problems (from the past) and to use the solutions of those similar problems as a solution to the actual problem. So, this approach is extremely suitable for less examined domains – for domains where rules and connections between parameters are not known. The second very important advantage is that the CBR approach to learning and problem solving is very similar to human cognitive processes – people take into account and use past experiences to make future decisions.


4 CBR for predicting curves behaviour

The CBR system uses graphics in presenting both the problem and the cases [3]. The reason is that in many practical domains some decisions depend on the behaviour of time diagrams, charts and curves. The system therefore analyses curves, compares them to similar curves from the past and predicts the future behaviour of the current curve on the basis of the most similar curves from the past.

The main problem here, as in almost every CBR system, was to create a good similarity measure for curves, i.e. what is the function that can tell to what extent two curves are similar. In many practical domains data are represented as a set of points, where a point is an ordered pair (x, y). Very often the pairs are (t, v), where t represents time and v represents some value at time t. When the data is given in this way (as a set of points) then it can be graphically represented. When the points are connected, they represent some kind of a curve. If the points are connected only with straight lines then this is linear interpolation, but if one wants smoother curves then some other kind of interpolation with polynomials must be used. There was a choice between a classical interpolating polynomial and a cubic spline. The cubic spline was chosen for two main reasons:

– Power: for n+1 points the classical interpolating polynomial has degree n, while a cubic spline always has order 4 (i.e., it is piecewise cubic).
– Oscillation: if only one point is moved (which can be the result of a bad experiment or measurement), the classical interpolating polynomial changes significantly (oscillates), while the cubic spline only changes locally (which is more appropriate for real-world domains).

Fig. 2. Surface between two curves

When the cubic spline is calculated for the curves, then one very intuitive and simple similarity (or distance – which is the dual notion of similarity1) measure can be used.

1 When the distance d is known, the similarity sim can easily be computed using, for example, the function sim = 1/(1+d).


The distance between two curves can be represented as the surface between these curves, as seen in Fig. 2. This surface can be easily calculated using the definite integral. Furthermore, the calculation of the definite integral for polynomials is a very simple and efficient operation.
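As a rough illustration of this curve distance (not the authors' implementation), the following Python sketch interpolates two point series with cubic splines, using SciPy's CubicSpline as one possible spline routine, and approximates the area between them by sampling the absolute difference on a dense grid; all function and variable names are ours.

import numpy as np
from scipy.interpolate import CubicSpline

def curve_distance(points_a, points_b, samples=1000):
    """Approximate the area between two curves given as lists of (t, value) points,
    compared on their common time range."""
    ta, va = zip(*sorted(points_a))
    tb, vb = zip(*sorted(points_b))
    spline_a, spline_b = CubicSpline(ta, va), CubicSpline(tb, vb)
    lo, hi = max(ta[0], tb[0]), min(ta[-1], tb[-1])   # common support
    ts = np.linspace(lo, hi, samples)
    diff = np.abs(spline_a(ts) - spline_b(ts))
    # trapezoidal rule for the area between the curves
    return float(np.sum(0.5 * (diff[1:] + diff[:-1]) * np.diff(ts)))

def curve_similarity(points_a, points_b):
    return 1.0 / (1.0 + curve_distance(points_a, points_b))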

5 Application of the system

A data warehouse of “Novi Sad Fair” contains data about payment and invoicing processes from the last 3 years for every exhibition - containing between 25 and 30 exhibitions every year. Processes are presented as sets of points where every point is given with the time of the measuring (day from the beginning of the process) and the value of payment or invoicing on that day. It can be concluded that these processes can be represented as curves. Note that the case base consists of cases of all exhibitions and that such a case base is used in solving concrete problems for concrete exhibitions. The reason for this is that environmental and external factors influence business processes of the fair to a high extent.

The measurement of the payment and invoicing values was done every 4 days from the beginning of the invoice process over a duration of 400 days; therefore every curve consists of approximately 100 points. By analysing these curves, it can be seen that the process of invoicing usually starts several months before the exhibition and that the value of invoicing grows rapidly up to approximately the time of the beginning of the exhibition. After that time the value of invoicing remains approximately the same till the end of the process. That moment, when the value of invoicing reaches some constant value and stays the same to the end, is called the time of saturation for the invoicing process, and the corresponding value – the value of saturation.

The process of payment starts several days after the corresponding process of invoicing (the processes of payment and invoicing for the same exhibition). After that the value of payment grows, but not as rapidly as the value of invoicing. At the moment of the exhibition the value of payment is between 30% and 50% of the value of invoicing. After that, the value of payment continues to grow up to some moment when it reaches a constant value and stays approximately constant till the end of the process. That moment is called the time of saturation for the payment process, and the corresponding value – the value of saturation. The payment time of saturation is usually a couple of months after the invoice time of saturation, and the payment value of saturation is always less than or equal to the invoice value of saturation. The analysis shows that the payment value of saturation is between 80% and 100% of the invoice value of saturation. The maximum represents the total of services invoiced, and that amount is to be paid. The same stands for the payment curve, where the maximum represents the amount paid by regular means. The rest will be paid later by court order, other special business agreements or, perhaps, will not be paid at all (debtor bankruptcy).


Fig. 3. The curves from the data mart, as the "Old payment curve" and the "Old invoice curve"

One characteristic invoice curve and the corresponding payment curve, shown as the "Old payment curve" and "Old invoice curve" from the "curve base", are displayed in fig. 3. The points of saturation (time and value) are represented by the emphasised points on the curves.

At the beginning the system reads the input data from two data marts: one data mart contains the information about all invoice processes for every exhibition in the past 3 years, while the other data mart contains the information about the corresponding payment processes. After that, the system creates splines for every curve (invoice and payment) and internally stores the curves in a list of pairs containing the invoice curve and the corresponding payment curve.

In the same way the system reads the problem curves from the third data mart. The problem consists of an invoice curve and a corresponding payment curve at the moment of the exhibition. At that moment, the invoice curve reaches its saturation point, while the payment curve is still far away from its saturation point. These curves are shown as the "Actual payment curve" and the "Actual invoice curve" (fig. 4).

The solution of this problem would be the saturation point for the payment curve. This means that the system helps experts by suggesting and predicting the level of future payments. At the end of the total invoicing for a selected fair exposition, the operational exposition manager can get a prediction from the CBR system of a) the time period when payment of a debt will be made and b) the amount paid regularly.


Fig. 4. Problem payment and invoice curves, as the "Actual payment curve" and the "Actual invoice curve", and prediction for the future payments

The time point and the amount of payment of a debt are marked on the graphic by a big red dot (fig. 4). When used with subsets of already known values, CBR predicts results that differ by around 10% in time and 2% in value from what actually happened.

5.1 Calculation of saturation points and system learning

The saturation point for one prediction is calculated by using the 10% most similar payment curves from the database of previous payment processes. The similarity is calculated by using the previously described algorithm. Since the values of saturation are different for each exhibition, every curve from the database must be scaled with a particular factor so that the invoice values of saturation of the old curve and the actual curve are the same. That factor is easily calculated as:

Factor = actual_value_of_saturation / old_value_of_saturation

The final solution is then calculated by using payment saturation points of the 10%most similar payment curves. Saturation points of the similar curves are multipliedwith the appropriate type of goodness and then summed. The values of goodness aredirectly proportional to the similarity between old and actual curves, but the sum of allgoodnesses must be 1. Since the system calculates the distance, the similarity iscalculated as:

73An Application of Case-Based Reasoning in Multidimensional Database Architecture

Page 87: document

distsim

1

1

The goodness for every old payment curve is calculated as:

jallj

ii sim

simgoodness

_

At the end, the final solution – payment saturation point is calculated as:

iallii tpoinsatgoodnesstpoinsat

_

__

The system draws the solution point at the diagram combined with the saturation timeand value. The system also supports solution revising and retaining (fig. 1). By memo-rizing a) the problem, b) suggested solution, c) the number of similar curves used forobtaining the suggestion and d) the real solution (obtained later), the system uses thisinformation in the phase of reusing the solution for future problems. The system willthen use not only 10% percent of the most similar curves but will also inspect theprevious decisions in order to find ‘better’ number of similar curves that would lead tothe better prediction.

6 Related work

6 Related work

The system presented in the paper represents a useful coexistence of a data warehouse and case-based reasoning, resulting in a decision support system. The data warehouse (part of the described system) has been in operation in "Novi Sad Fair" since 2001 and is described in more detail in [5], [6], [7]. The part of the system that uses CBR in comparing curves was done during the stay of the second author at Humboldt University in Berlin and is described in [3] in more detail.

Although CBR is successfully used in many areas (aircraft conflict resolution in air traffic control, optimizing rail transport, subway maintenance, optimal job search, support to help-desks, intelligent search on the internet) [4], it is not very often used in combination with a data warehouse and in collaboration with classical OLAP, probably due to the novelty of this technique. CBR does not require a causal model or deep understanding of a domain and therefore it can be used in domains that are poorly defined, where information is incomplete, contradictory, or where it is difficult to get sufficient domain knowledge. All this is typical for business processing.

Besides CBR, other possibilities are rule-based knowledge or knowledge discovery in databases where knowledge evaluation is based on rules [1]. The rules are usually generated by combining propositions. As the complexity of the knowledge base increases, maintenance becomes problematic because changing rules often implies a lot of reorganization in a rule-based system. On the other side, it is easier to add or delete a case in a CBR system, which finally provides advantages in terms of learning and explicability.

Applying CBR to curves and its usage in decision making is also a novel approach. According to the authors' findings, the usage of CBR, looking for similarities in curves and predicting future trends, is by far superior to other currently used techniques.


7 Conclusion

The paper presented a decision support system that uses CBR as an OLAP layer on top of the data warehouse. The paper has described the CBR part of the system in greater detail, giving a thorough explanation of one case study.

There are numerous advantages of this system. For instance, based on CBR predictions, operational managers can undertake important business activities, so they would: a) make payment delays shorter, b) make the total payment amount bigger, c) secure payment guarantees on time, d) reduce the risk of payment cancellation and e) inform senior managers on time. By combining the graphical representation of predicted values with the most similar curves from the past, the system enables a better and more focussed understanding of predictions with respect to real data from the past.

Senior managers can use these predictions to better plan possible investments and new exhibitions, based on the amount of funds and the time of their availability, as predicted by the CBR system.

The presented system is not limited to this case study; it can be applied to other business values as well (expenses, investments, profit) and it guarantees the same level of success.

Acknowledgement

The CBR system that uses graphical representation of problems and cases [3] was implemented by V. Kurbalija at Humboldt University, Berlin (AI Lab) under the leadership of Hans-Dieter Burkhard and the sponsorship of DAAD (German academic exchange service). The authors of this paper are grateful to Prof. Burkhard and his team for their unselfish support, without which none of this would be possible.

References

1. Aamodt, A., Plaza, E.: Case-Based Reasoning: Foundational Issues, Methodological Variations and System Approaches. AI Communications, pp. 39-58, 1994.
2. Zoran Budimac, Vladimir Kurbalija: Case-based Reasoning – A Short Overview. Conference of Informatics and IT, Bitola, 2001.
3. Vladimir Kurbalija: On Similarity of Curves – project report. Humboldt University, AI Lab, Berlin, 2003.
4. Mario Lenz, Brigitte Bartsch-Spörl, Hans-Dieter Burkhard, Stefan Wess, G. Goos, J. Van Leeuwen, B. Bartsch: Case-Based Reasoning Technology: From Foundations to Applications. Springer Verlag, October 1998.
5. Dragan Simić: Financial Prediction and Decision Support System Based on Artificial Intelligence Technology. Ph.D. thesis, draft text – manuscript, Novi Sad, 2003.
6. Dragan Simić: Reengineering Management Information Systems, Contemporary Information Technologies Perspective. Master thesis, Novi Sad, 2001.
7. Dragan Simić: Data Warehouse and Strategic Management. Strategic Management and Decision Support Systems, Palić, 1999.










let $x := for $c in $retValue
          where not(deep-equal($c/first/content, $c/second/content))
          return $c
return count($x)


max( ...
  for $c in distinct-values($retValue/child)
  let $p := for $exp in $retValue
            where deep-equal($exp/child, $c)
            return $exp/parent
  return count(distinct-values($p))
)


Building XML Data Warehouse Based on Frequent Patterns in User Queries

Ji Zhang1, Tok Wang Ling1, Robert M. Bruckner2, A Min Tjoa2

1 Department of Computer Science, National University of Singapore, Singapore 117543
  {zhangji, lingtw}@comp.nus.edu.sg
2 Institute of Software Technology, Vienna University of Technology, Favoritenstr. 9/188, A-1040 Vienna, Austria
  {bruckner, tjoa}@ifs.tuwien.ac.at

Abstract. With the proliferation of XML-based data sources available across the Internet, it is increasingly important to provide users with a data warehouse of XML data sources to facilitate decision-making processes. Due to the extremely large amount of XML data available on the web, unguided warehousing of XML data turns out to be highly costly and usually cannot well accommodate the users’ needs in XML data acquirement. In this paper, we propose an approach to materialize XML data warehouses based on frequent query patterns discovered from historical queries issued by users. The schemas of integrated XML documents in the warehouse are built using these frequent query patterns, represented as Frequent Query Pattern Trees (FreqQPTs). Using a hierarchical clustering technique, the integration approach in the data warehouse is flexible with respect to obtaining and maintaining XML documents. Experiments show that the overall processing of the same queries issued against the global schema becomes much more efficient by using the XML data warehouse built than by directly searching the multiple data sources.

1. Introduction

A data warehouse (DWH) is a repository of data that has been extracted, transformed, and integrated from multiple and independent data sources like operational databases and external systems [1]. A data warehouse system, together with its associated technologies and tools, enables knowledge workers to acquire, integrate, and analyze information from different data sources. Recently, XML has rapidly emerged as a standardized data format to represent and exchange data on the web. The traditional DWH has gradually given way to the XML-based DWH, which is becoming the mainstream framework.

Building an XML data warehouse is appealing since it provides users with a collection of semantically consistent, clean, and concrete XML-based data that are suitable for efficient query and analysis purposes. However, the major drawback of building an enterprise-wide XML data warehouse system is that it is usually so time- and cost-consuming that it is unlikely to be successful [10]. Furthermore, without proper guidance on which information is to be stored, the resulting data warehouse cannot really accommodate the users’ needs in XML data acquirement well.

In order to overcome this problem, we propose a novel XML data warehouse

approach by taking advantage of the underlying frequent patterns existing in the query

Y. Kambayashi, M. Mohania, W. Woß (Eds.): DaWaK 2003, LNCS 2737, pp. 99-108, 2003. c Springer-Verlag Berlin Heidelberg 2003


history of users. The historical user queries can ideally provide us with guidance

regarding which XML data sources are more frequently accessed by users, compared to

others. The general idea of our approach is: Given multiple distributed XML data

sources and their globally integrated schema represented as a DTD (document type definition) tree, we will build an XML data warehouse based on the method of revealing

frequent query patterns. In doing so, the frequent query patterns, each represented as a

Frequent Query Pattern Tree (FreqQPT), are discovered by applying a rule-mining

algorithm. Then, FreqQPTs are clustered and merged to generate a specified number of

integrated XML documents.

Apparently, the schema of integrated XML documents in the warehouse is only a

subset of the global schema and the size of this warehouse is usually much smaller than

the total size of all distributed data sources. A smaller sized data warehouse can not

only save storage space but also enable query processing to be performed more

efficiently. Furthermore, this approach is more user-oriented and is better tailored to

the user’s needs and interests.

There has been some research in the field of building and managing XML data

warehouses. The authors of [2] present a semi-automated approach to building a

conceptual schema for a data mart starting from XML sources. The work in [3] uses

XML to establish an Internet-based data warehouse system to solve the defects of

client/server data warehouse systems. [4] presents a framework for supporting

interoperability among data warehouse islands for federated environments based on

XML. A change-centric method to manage versions in a web warehouse of XML data

is published in [5]. Integration strategies and their application to XML Schema

integration has been discussed in [6]. The author of [8] introduces a dynamic

warehouse, which supports evaluation, change control and data integration of XML

data.

The remainder of this paper is organized as follows. Section 2 discusses the

generation of XML data warehouses based on frequent query patterns of users’ queries.

In Section 3, query processing using the data warehouse is discussed. Experimental results are reported in Section 4. The final section concludes this paper.

2. Building an XML DWH Based on Frequent Query Patterns

2.1. Transforming Users’ Queries into Query Path Transactions

XQuery is a flexible language commonly used to query a broad spectrum of XML

information sources, including both databases and documents [7]. The following

XQuery-formatted query aims to extract the ISBN, Title, Author and Price of books

with a price over 20 dollars from a set of XML documents about book-related

information. The global DTD tree is shown in Figure 1.

FOR $a IN DOCUMENT (book XML documents)/Book
WHERE $a/Price/data() > 20
RETURN <QueryResult>
  <Book>{ $a/ISBN, $a/Title, $a/Author, $a/Price }</Book>
</QueryResult>


[Figure 1: the global DTD tree. Book is the root; its children include ISBN, Title, Author+, Price, Publisher, Year, and Section+. Author has the children Name and Affiliation; Section has the children Title, Para*, and Figure*; Figure has the children Title and Image.]

QP1: Book/ISBN

QP2: Book/Title

QP3: Book/Author/Name

QP4: Book/Author/Affiliation

QP5: Book/Price

Fig. 1. Global DTD Tree of multiple XML documents. Fig. 2. QPs of the XQuery sample.

A Query Path (QP) is a path expression over the DTD tree that starts at the root of the tree. QPs can be obtained from the query script expressed using XQuery statements. The sample

query above can be decomposed into five QPs, as shown in Figure 2. The root of a QP

is denoted as Root(QP) and all QPs in a query have the same root.

Please note that two QPs with different roots are regarded as different QPs,

although these two paths may have some common nodes. This is because different

roots of paths often indicate dissimilar contexts of the queries. For example, two

queries Author/Name and Book/Author/Name are different because

Root(Author/Name) = Author ≠ Root(Book/Author/Name) = Book.

A query can be expressed using a set of QPs which includes all the QPs that the query consists of. For example, the above sample query, denoted as Q, can be expressed using the QP set Q = {QP1, QP2, QP3, QP4, QP5}. By transforming all the queries into QP sets, we obtain a database containing all these QP sets, denoted as DQPS. We will then apply a rule-mining technique to discover significant rules among the users' query patterns.

2.2. Discovering Frequent Query Path Sets in DQPS

The aim of applying a rule mining technique to DQPS is to discover Frequent Query Path Sets (FreqQPSs) in DQPS. A FreqQPS contains frequent QPs that jointly

occur in DQPS. Frequent Query Pattern Trees (FreqQPTs) are built from these

FreqQPSs and serve as building blocks of schemas of the integrated XML documents

in the data warehouse. The formal definition of a FreqQPS is given as follows.

Definition 1. Frequent Query Path Set (FreqQPS): From all the QPs occurring in DQPS, transformed from users' queries, a Frequent Query Path Set (FreqQPS) is a set of QPs {QP1, QP2, ..., QPn} that satisfies the following two requirements:

(1) Support requirement: Support({QP1, QP2, ..., QPn}) ≥ minsup;
(2) Confidence requirement: for each QPi, Freq({QP1, QP2, ..., QPn}) / Freq(QPi) ≥ minconf,

where Freq(s) counts the occurrences of the set s in DQPS. In (1), Support({QP1, QP2, ..., QPn}) = Freq({QP1, QP2, ..., QPn}) / N(DQPS), where N(DQPS) is the total number of QP sets (queries) in DQPS. The constants minsup and minconf are the minimum support and confidence thresholds, specified by the user. A FreqQPS that consists of n QPs is termed an n-itemset FreqQPS.
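For illustration only, a minimal Python sketch (with a toy DQPS that is not taken from the paper) of how the two requirements can be checked for a candidate QP set:

def freq(qps, dqps):
    # Number of query transactions in DQPS that contain all QPs of the candidate set.
    return sum(1 for trans in dqps if qps <= trans)

def is_freq_qps(qps, dqps, minsup, minconf):
    n = len(dqps)                     # N(DQPS): total number of QP sets (queries)
    joint = freq(qps, dqps)
    if joint / n < minsup:            # support requirement
        return False
    # Confidence requirement must hold for every individual QP of the candidate.
    return all(joint / freq({qp}, dqps) >= minconf for qp in qps)

# Toy DQPS: each query is represented by its set of QPs.
dqps = [{"QP1", "QP2", "QP5"}, {"QP1", "QP2"}, {"QP3"}, {"QP1", "QP2", "QP3"}]
print(is_freq_qps({"QP1", "QP2"}, dqps, minsup=0.5, minconf=0.6))   # True (support 3/4, confidence 1.0)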

The definition of a FreqQPS is similar to that of association rules. The support requirement is identical to the traditional definition of large (frequent) itemsets. The confidence requirement is, however, more rigid than the traditional definition. Setting a more rigid confidence requirement ensures that the joint occurrence of the QPs in a FreqQPS is significant enough with respect to the individual occurrence of any


QP. Since the number of QPs in the FreqQPS is unknown in advance, we will mine all

FreqQPSs containing various numbers of itemsets. The FreqQPS mining algorithm is

presented in Figure 3.

The n-itemset QPS candidates are generated by joining (n-1)-itemset FreqQPSs. A

pruning mechanism is devised to delete those candidates of the n-itemset QPSs that do

not have n (n-1)-itemset subsets in the (n-1)-itemset FreqQPS list. The reason is that if

one or more (n-1)-itemset subsets of an n-itemset QPS candidate are missing in the (n-1)-itemset FreqQPS list, this n-itemset QPS cannot become a FreqQPS. This is obviously more rigid than the pruning mechanism used in conventional association rule mining.

For example, if one or more of the 2-itemset QPSs {QP1, QP2}, {QP1, QP3} and

{QP2, QP3} are not frequent, then the 3-itemset QPS {QP1, QP2, QP3} cannot become a

frequent QPS. The proof of this pruning mechanism is given below. After pruning, the n-itemset QPS candidates are evaluated in terms of the support and confidence requirements to decide whether or not they are FreqQPSs. The (n-1)-itemset FreqQPSs are finally deleted if they are subsets of some n-itemset FreqQPS. For example, the 2-itemset FreqQPS {QP1, QP2} will be deleted from the 2-itemset FreqQPS list if the 3-itemset {QP1, QP2, QP3} exists in the 3-itemset FreqQPS list.

Algorithm MineFreqQPS
Input: DQPS, minsup, minconf.
Output: FreqQPSs of varied numbers of itemsets.

FreqQPS1 = {QP in DQPS | SatisfySup(QP) = true};
i = 2;
WHILE (FreqQPSi-1 is not empty) {
  CanQPSi = CanQPSGen(FreqQPSi-1);
  CanQPSi = CanQPSi - {QPSi | NoSubSet(QPSi, FreqQPSi-1) < i};
  FreqQPSi = {QPSi in CanQPSi | SatisfySup(QPSi) = true AND SatisfyConf(QPSi) = true};
  FreqQPSi-1 = FreqQPSi-1 - {QPSi-1 | QPSi-1 ⊆ QPSi, QPSi-1 in FreqQPSi-1, QPSi in FreqQPSi};
  i++;
}
MaxItemset = i - 2;
IF (MaxItemset > 0) THEN
  FOR (i = 1; i <= MaxItemset; i++) Return(FreqQPSi);

Fig. 3. Algorithm for mining FreqQPSs.
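The level-wise loop of Figure 3 can be sketched compactly in Python; the function and variable names below (e.g., mine_freq_qps) are illustrative and not from the paper, and each query transaction is assumed to be given as a set of QP labels:

from itertools import combinations

def mine_freq_qps(dqps, minsup, minconf):
    def freq(s):
        return sum(1 for t in dqps if s <= t)
    def is_freq(s):
        f = freq(s)
        return (f / len(dqps) >= minsup and
                all(f / freq(frozenset([qp])) >= minconf for qp in s))

    levels = {}
    current = [frozenset([qp]) for qp in set().union(*dqps) if is_freq(frozenset([qp]))]
    i = 1
    while current:
        levels[i] = current
        # Join i-itemset FreqQPSs to build (i+1)-itemset candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == i + 1}
        # Prune candidates with a missing i-itemset subset, then apply support/confidence.
        nxt = [c for c in candidates
               if all(frozenset(sub) in set(current) for sub in combinations(c, i))
               and is_freq(c)]
        # Remove i-itemset FreqQPSs subsumed by some (i+1)-itemset FreqQPS.
        levels[i] = [s for s in levels[i] if not any(s < c for c in nxt)]
        current = nxt
        i += 1
    return levels

# Example: mine_freq_qps([frozenset(q) for q in dqps], minsup=0.5, minconf=0.6)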

Proof: Suppose an n-itemset QPS has only p (n-1)-itemset subsets QPSn-1,i (1 ≤ i ≤ p), meaning that (n-p) subsets of QPSn are missing in the (n-1)-itemset FreqQPS list. These missing (n-p) subsets of QPSn, denoted as QPSn-1,i ((p+1) ≤ i ≤ n), are definitely not FreqQPSs, since they fail to satisfy the support requirement, the confidence requirement, or both. Specifically,

(1) If QPSn-1,i ((p+1) ≤ i ≤ n) does not satisfy the support requirement, then Support(QP1, QP2, ..., QPn-1) < minsup. Because Support(QP1, QP2, ..., QPn) ≤ Support(QP1, QP2, ..., QPn-1), it follows that Support(QP1, QP2, ..., QPn) < minsup, meaning QPSn cannot become an n-itemset FreqQPS;

(2) If QPSn-1,i ((p+1) ≤ i ≤ n) does not satisfy the confidence requirement, then for a certain QPi, Freq(QP1, QP2, ..., QPn-1) / Freq(QPi) < minconf. Because Freq(QP1, QP2, ..., QPn) ≤ Freq(QP1, QP2, ..., QPn-1), it follows that Freq(QP1, QP2, ..., QPn) / Freq(QPi) < minconf, meaning that QPSn cannot become an n-itemset FreqQPS.


After we have obtained a number of FreqQPSs, their corresponding Frequent

Query Pattern Trees (FreqQPTs) will be built.

Definition 2. Frequent Query Pattern Tree (FreqQPT): Given a FreqQPS, its

corresponding Frequent Query Pattern Tree (FreqQPT) is a rooted tree FreqQPT=<V,

E>, where V and E denote its vertex and edge sets, which are the union of the vertices

and edges of QPs in this FreqQPS, respectively. The root of a FreqQPT, denoted as

Root(FreqQPT), is the root of its constituting QPs.

For example, suppose a FreqQPS has two QPs: Book/Title and

Book/Author/Name. The resulting FreqQPT, shown in Figure 4, has the root Book with children Title and Author, and Author has the child Name.

Fig. 4. Building a FreqQPT for the FreqQPS {Book/Title, Book/Author/Name}.

2.3. Generating Schemas of Integrated XML Documents

When all FreqQPTs have been mined, the schema of the integrated XML

document will be built. We have noticed that a larger integrated XML document

usually requires larger space when it is loaded into main memory. In order to solve this

problem, we alternatively choose to build a few, rather than only one, integrated XML

documents from the FreqQPTs mined, making the integration more flexible. The exact

number of integrated XML documents to be obtained is user-specified. The basic idea

is to use a clustering technique to find a pre-specified number of clusters of FreqQPTs.

The integration of the FreqQPTs is performed within each of the clusters.

Similarity measurement of FreqQPTs

We need to measure the similarity between two FreqQPTs in order to find the

closest pair in each step of the clustering process. It is noticed that the complexity of

merging two FreqQPTs is dependent on the distance between the roots of the FreqQPTs

involved, rather than on the other nodes in the FreqQPTs. Intuitively, the closer the two

roots are to each other, the easier the merging can be done and vice versa. To measure

the similarity between the roots of two FreqQPTs, we have to first discuss the

similarity between two nodes in the hierarchy of a global schema.

In our work, the similarity computation between two nodes in the hierarchy is

based on the edge counting method. We measure the similarity of nodes by first

computing the distance between two nodes, since the distance can be easily obtained by

edge counting. Naturally, the larger the number of edges between two nodes, the

further apart the two nodes are. The distance between two nodes n1 and n2, denoted as

NodeDist(n1, n2), is computed as NodeDist(n1, n2)= Nedge(n1, n2), where Nedge() returns

the number of edges between n1 and n2. This distance can be normalized by dividing it by the maximum possible distance between two nodes in the hierarchy, denoted by LongestDist. The normalized distance between n1 and n2, denoted as NodeDistN(n1, n2), is computed as follows:

NodeDistN(n1, n2) = Nedge(n1, n2) / LongestDist

Thus the similarity between n1 and n2 is computed as:

NodeSimN(n1, n2) = 1 - NodeDistN(n1, n2)

We now give an example to show how the similarity between two roots of

FreqQPTs is computed. Suppose there are two QPs, QP1: Book/ Price and QP2:


Section/ Figure/ Image as shown in Figure 5. What we should do is to compute the

similarity between the roots of these two QPs, namely Book and Section. The

maximum length between two nodes in the hierarchy as shown in Figure 1 is 5 (from

Name or Affiliation to Title or Image). Thus NodeSimN(Book, Section) = 1 –

NodeDistN(Book, Section) = 1–1/5 = 4/5 = 0.8.
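For concreteness, a small Python sketch of the edge-counting measure, assuming the global DTD tree reconstructed from Figure 1 (the Title nodes below Section and Figure are given distinct labels here only to keep the toy adjacency simple):

from collections import deque

# Parent -> children edges of the global DTD tree (reconstruction of Figure 1).
EDGES = {"Book": ["ISBN", "Title", "Author", "Price", "Publisher", "Year", "Section"],
         "Author": ["Name", "Affiliation"],
         "Section": ["SecTitle", "Para", "Figure"],
         "Figure": ["FigTitle", "Image"]}

def adjacency(edges):
    adj = {}
    for parent, children in edges.items():
        for child in children:
            adj.setdefault(parent, set()).add(child)
            adj.setdefault(child, set()).add(parent)
    return adj

def node_dist(adj, start, goal):
    # Breadth-first search counts the edges between two nodes of the tree.
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, d = queue.popleft()
        if node == goal:
            return d
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    raise ValueError("nodes are not connected")

adj = adjacency(EDGES)
longest = max(node_dist(adj, a, b) for a in adj for b in adj)    # LongestDist = 5
print(1 - node_dist(adj, "Book", "Section") / longest)           # NodeSimN(Book, Section) = 0.8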


Fig. 5. Similarity between two QPs.

Merging of FreqQPTs

When a nearest pair of FreqQPTs is found in each step of the clustering, merging

of these two FreqQPTs is performed. Let FreqQPT1=<V1, E1>, FreqQPT2=<V2, E2>,

Root(FreqQPT1) = root1, Root(FreqQPT2) = root2, and FreqQPTM be the new FreqQPT

merged from FreqQPT1 and FreqQPT2. We will now present the definition of Nearest

Common Ancestor Node (NCAN) of two nodes in the DTD tree before we give details

of FreqQPT merging.

Definition 3. Nearest Common Ancestor Node (NCAN): The NCAN of root nodes of

two FreqQPTs root1 and root2 in the hierarchical structure of a global DTD tree H,

denoted as NCANH(root1, root2), is the common ancestor node in H that is closest to

both root1 and root2.

To merge two closest FreqQPTs, the Nearest Common Ancestor Node (NCAN) of

root1 and root2 has to be found, thereby these two FreqQPTs can be connected.

We denote the vertex and edge sets of the path between NCANH(root1, root2) and root1 as VNCAN-root1 and ENCAN-root1, and those of the path between NCANH(root1, root2) and root2 as VNCAN-root2 and ENCAN-root2. The merged FreqQPTM can then be expressed as FreqQPTM = <Union(V1, V2, VNCAN-root1, VNCAN-root2), Union(E1, E2, ENCAN-root1, ENCAN-root2)>, with Root(FreqQPTM) = NCANH(root1, root2).
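A minimal Python sketch of this merge step, assuming each FreqQPT is represented by its set of parent-child edges and using a simplified, partly hypothetical parent map of the global DTD tree (duplicate Title nodes are omitted):

# Simplified parent map of the global DTD tree (child -> parent); the root Book has no entry.
PARENT = {"ISBN": "Book", "Title": "Book", "Author": "Book", "Price": "Book",
          "Publisher": "Book", "Year": "Book", "Section": "Book",
          "Name": "Author", "Affiliation": "Author",
          "Para": "Section", "Figure": "Section", "Image": "Figure"}

def path_to_root(node):
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def ncan(root1, root2):
    # Nearest common ancestor: first node on root1's path that is also on root2's path.
    ancestors2 = set(path_to_root(root2))
    for node in path_to_root(root1):
        if node in ancestors2:
            return node

def connecting_edges(ancestor, node):
    # Edges of the path from the NCAN down to the given root.
    edges = set()
    while node != ancestor:
        edges.add((PARENT[node], node))
        node = PARENT[node]
    return edges

def merge_freqqpts(edges1, edges2, root1, root2):
    a = ncan(root1, root2)
    merged = edges1 | edges2 | connecting_edges(a, root1) | connecting_edges(a, root2)
    return a, merged     # the new root and the merged edge set (vertices are implied)

# Case 3 of the text: FreqQPTs rooted at Author and Figure are connected through Book.
print(merge_freqqpts({("Author", "Name"), ("Author", "Affiliation")},
                     {("Figure", "Image")}, "Author", "Figure"))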

Specifically, there are three scenarios in merging two FreqQPTs, namely, (1) the

two FreqQPTs have the same root; (2) The root of one FreqQPT is an ancestor node of

another FreqQPT’s root; (3) cases other than (1) and (2). Figure 6 (a)-(c) give examples

for each of the cases of FreqQPT merging discussed above. The dot-lined edges in the

integrated schema, if any, are the extra edges that have to be included into the

integrated schema in merging the two separate FreqQPTs.

(a) Example for Case 1: two FreqQPTs sharing the root Book (one containing Title and Price, the other Author with Name) are merged into a single FreqQPT rooted at Book.

(b) Example for Case 2: a FreqQPT rooted at Book (containing Title) is merged with a FreqQPT rooted at its descendant Section (containing Figure and Image); the connecting edge from Book to Section is added.

(c) Example for Case 3: FreqQPTs rooted at Author (with Name and Affiliation) and at Figure (with Image) are merged by connecting both roots through their nearest common ancestor Book.

Fig. 6 (a) – (c). Examples of FreqQPT merging.

Clustering of FreqQPTs

The aim of clustering FreqQPTs is to group similar FreqQPTs together for further

integration. Merging two closer FreqQPTs is cheaper and requires fewer re-structuring

operations compared to merging two FreqQPTs far apart from each other. In our work,

we utilize the agglomerative hierarchical clustering paradigm. The basic idea of

agglomerative hierarchical clustering is to begin with each FreqQPT as a distinct

cluster and merge the two closest clusters in each subsequent step until a stopping

criterion is met. The stopping criterion of the clustering is typically either a similarity

threshold or the number of clusters to be obtained. We choose to specify the number of

clusters since it is more intuitive and easy to specify, compared to the similarity

threshold that is typically not known before the clustering process.

Please note that k, the specified number of clusters to be obtained, should not be larger than the number of FreqQPTs; otherwise an error message will be returned. This is because the QPs in the same FreqQPT are not allowed to be further split. In each step, the closest pair of FreqQPTs will be found and merged into one FreqQPT, and the

number of current clusters will be decreased by 1 accordingly. This clustering process

is terminated when k clusters are obtained.
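A compact Python sketch of this agglomerative loop (names are illustrative; root_similarity and merge_freqqpts stand for the similarity measure and the merge step described above):

def cluster_freqqpts(freqqpts, roots, k, root_similarity, merge_freqqpts):
    # freqqpts: list of FreqQPT edge sets; roots: their roots. Copies are modified locally.
    if k > len(freqqpts):
        raise ValueError("k must not exceed the number of FreqQPTs")
    trees, tree_roots = list(freqqpts), list(roots)
    while len(trees) > k:
        # Find the pair of FreqQPTs whose roots are most similar.
        i, j = max(((a, b) for a in range(len(trees)) for b in range(a + 1, len(trees))),
                   key=lambda ab: root_similarity(tree_roots[ab[0]], tree_roots[ab[1]]))
        new_root, new_edges = merge_freqqpts(trees[i], trees[j], tree_roots[i], tree_roots[j])
        # Replace the merged pair by the new FreqQPT; the cluster count drops by one.
        for idx in (j, i):
            del trees[idx]; del tree_roots[idx]
        trees.append(new_edges); tree_roots.append(new_root)
    return trees, tree_roots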

2.4. Acquiring Data to Feed the Warehouse

The last step of building the XML data warehouse is to read data from XML data

sources when the schemas of the integrated XML documents are ready. Coming from

different data sources across the Internet, these data may be incomplete, noisy, inconsistent, and duplicated. Processing efforts such as standardization, data cleaning

and conflict solving need to be performed to make the data in the warehouse more

consistent, clean, and concrete.

3. Processing of Queries Using the Data Warehouse

One of the main purposes of building a data warehouse is to facilitate query processing. When there is no data warehouse, queries are processed using a single-mediator architecture (shown in Figure 7), in which all queries are processed by this mediator and directed to the multiple XML data sources. When the data warehouse has been built, a dual-mediator architecture is adopted (shown in Figure 8). Mediator 1 processes all incoming queries from users, and each query will be directed to the data warehouse, to mediator 2 (which is responsible for further directing queries to the XML data sources), or to both.


Specifically, let QPSdwh be the QP set of the integrated XML documents in the data warehouse and QPTra(q) be the QP transaction of a query q (a small sketch of the resulting routing decision follows the three cases below).

(i) If QPTra(q) ⊆ QPSdwh, meaning that all the QPs of q can be found in the schemas of the integrated XML documents in the data warehouse and the query can be answered by using the data warehouse alone, then q will only be directed by mediator 1 to the XML data warehouse;

(ii) if QPTra(q) ⊄ QPSdwh and QPTra(q) ∩ QPSdwh is not empty, meaning that not all QPs of q can be found in the schemas of the integrated XML documents in the data warehouse and the data warehouse does not contain enough information to answer q, then q will be directed by mediator 1 to both the data warehouse and mediator 2;

(iii) if QPTra(q) ∩ QPSdwh is empty, indicating that the information needed to answer q is not contained in the warehouse, then q will only be directed by mediator 1 to mediator 2.
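As a minimal illustration (not the paper's implementation), the decision reduces to two set tests:

def route(query_qps, qps_dwh):
    # query_qps: QP transaction of the query; qps_dwh: QP set covered by the warehouse.
    if query_qps <= qps_dwh:
        return ["data warehouse"]                    # case (i)
    if query_qps & qps_dwh:
        return ["data warehouse", "mediator 2"]      # case (ii)
    return ["mediator 2"]                            # case (iii)

print(route({"Book/ISBN", "Book/Title"}, {"Book/ISBN", "Book/Title", "Book/Price"}))   # case (i)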

Fig. 7. Query processing without a data warehouse: users' queries pass through a single mediator to the multiple XML data sources.

Fig. 8. Query processing with a data warehouse: users' queries pass through mediator 1, which directs them to the XML data warehouse and/or to mediator 2 and the XML data sources.

4. Experimental Results

In this section, we conduct experiments to evaluate the efficiency of constructing the schema of the integrated XML data and the speedup of query processing by means of the data warehouse we have built. We use a set of 50 XML documents about book information and generate their global DTD tree. A Zipfian distribution is employed to produce the transaction file of queries, because web queries and surfing patterns typically conform to Zipf's law [9]. In our work, the query transaction file contains 500 such synthetic queries, based on which the data warehouse is built. All experiments are carried out on a 900 MHz PC with 256 megabytes of main memory running Windows 2000.

4.1 Construction of the Data Warehouse Schema under Varying Number of

Queries


Fig. 9. Efficiency of constructing the data warehouse schema under varying number of queries

Fig. 10. Comparative study on query answering time

First, we evaluate the time spent constructing the schema of the integrated XML data of the data warehouse under a varying number of queries from which frequent query patterns are extracted. The number of queries used is varied from 100 to 1,000. As shown in Figure 9, the time increases at an approximately exponential rate, since the number of FreqQPS candidates generated increases exponentially as the number of queries goes up.

4.2 Speedup of Query Processing Using Data Warehouse

The major benefits of building a data warehouse system based on frequent query patterns are that it not only yields a smaller but more concrete and clean subset of the original XML data sources but also helps speed up query processing. In this experiment, we measure the response time for answering queries with and without the aid of the data warehouse, respectively. The number of queries to be answered ranges from 100 to 1,000. The results shown in Figure 10 confirm that, by using the data warehouse we have built, query answering is faster than in the case where there is no such data warehouse. This is because the portion of information contained in the data


warehouse is smaller in size than that stored in the original data sources, reducing the volume of data that needs to be scanned during query answering. In addition, the data have undergone processing such as standardization, data cleaning and conflict solving, so the duplication of data is lower. The smaller size and lower duplication of

the data in the warehouse contribute to the higher efficiency in query answering.

5 Conclusions

In this paper, we propose a novel approach to perform XML data warehousing

based on the frequent query patterns discovered from historical users' queries. A specific rule mining technique is employed to discover these frequent query patterns, based on which the schemas of integrated XML documents are built. Frequent query patterns are represented using Frequent Query Pattern Trees (FreqQPTs) that are clustered using a hierarchical clustering technique according to the integration specification to build the schemas of integrated XML documents. Experimental results show that query answering time is reduced compared to the case where there is no such data

warehouse.

References

[1] H. Garcia-Molina, W. Labio, J. L. Wiener, and Y. Zhuge: Distributed and Parallel

Computing Issues in Data Warehousing. In Proc. of ACM Principles of Distributed

Computing Conference (PODS), Puerto Vallarta, Mexico 1998.

[2] M. Golfarelli, S. Rizzi, and B. Vrdoljak: Data Warehouse Design from XML Sources. In

Proc. of ACM DOLAP’01, Atlanta, Georgia, USA, Nov. 2001.

[3] S. M. Huang and C.H. Su: The Development of an XML-based Data Warehouse System. In

Proc. of 3rd Intl. Conf. of Intelligent Data Engineering and Automated Learning

(IDEAL’02), Springer LNCS 2412, pp. 206-212, Manchester, UK, Aug. 2002.

[4] O. Mangisengi, J. Huber, C. Hawel and W. Essmayr: A Framework for Supporting

Interoperability of Data Warehouse Islands using XML. In Proc. of 3rd Intl. Conf.

DaWaK’01, Springer LNCS 2114, pp. 328-338, Munich, Germany, Sept. 2001.

[5] A. Marian, S. Abiteboul, G. Cobena, and L. Mignet: Change-centric Management of

Versions in an XML Warehouse. In Proc. of Intl. Conf. on Very Large Data Bases

(VLDB’01), pp. 581-590, Roma, Italy, Sept. 2001.

[6] K. Passi, L. Lane, S. Madria, B.C. Sakamuri, M. Mohania and S. Bhowmick: A Model for

XML Schema Integration. In Proc. of 3rd Intl. Conf. EC-Web, Springer LNCS 2455, pp.

193-202, Aix-en-Provence, France, Sept. 2002.

[7] XQuery Language 1.0. http://www.w3.org/TR/xquery/.

[8] L. Xyleme. A Dynamic Warehouse for XML Data of the Web. IEEE Data Engineering

Bulletin, Vol. 24(2), pp. 40-47, 2001.

[9] L. H. Yang, M. L. Lee, W. Hsu, S. Acharya. Mining Frequent Query Patterns from XML

Queries. In Proc. of 8th Intl. Symp. on Database Systems for Advanced Applications

(DASFAA’03), Kyoto, Japan, March 2003.

[10] L. Garber. Michael Stonebraker on the Importance of Data Integration. IT Professional, Vol.

1, No.3, pp 80, 77-79, 1999.


Automatic Detection of Structural Changes inData Warehouses

Johann Eder, Christian Koncilia, and Dieter Mitsche

University of Klagenfurt, Dep. of Informatics-Systems
{eder,koncilia}@isys.uni-klu.ac.at, [email protected]

Abstract. Data Warehouses provide sophisticated tools for analyzing complex data online, in particular by aggregating data along dimensions spanned by master data. Changes to these master data are a frequent threat to the correctness of OLAP results, in particular for multi-period data analysis, trend calculations, etc. As dimension data might change in underlying data sources without notifying the data warehouse, we explore the application of data mining techniques for detecting such changes, contributing to avoiding incorrect results of OLAP queries.

1 Introduction and Motivation

A data warehouse is a collection of data stemming from different, frequently heterogeneous data sources and is optimized for complex data analysis operations rather than for transaction processing. The most popular architectures are multidimensional data warehouses (data cubes) where facts (transaction data) are "indexed" by several orthogonal dimensions representing a hierarchical organization of master data. OLAP (on-line analytical processing) tools allow the analysis of this data, in particular by aggregating data along the dimensions with different consolidation functions.

Although data warehouses are typically deployed to analyse data from a longer time period than transactional databases, they are not well prepared for changes in the structure of the dimension data. This surprising observation originates in the (implicit) assumption that the dimensions of data warehouses ought to be orthogonal, which, in the case of the time dimension, means that all other dimensions ought to be time-invariant.

In this paper we address another important issue: how can such structural changes be recognized, even if the sources do not notify the data warehouse about the changes?

Such "hidden" changes are a problem, because (usually) such changes are not modifications of the schema of the data source. E.g., inserting the data of a new product or a new employee is a modification on the instance level. However, in the data warehouse such changes result in a modification of its structure.

Of course, this defensive strategy of recognizing structural changes can only be an aid to avoid some problems; it is not a replacement for adequate means for


Y. Kambayashi, M. Mohania, W. Woß (Eds.): DaWaK 2003, LNCS 2737, pp. 119-128, 2003. c Springer-Verlag Berlin Heidelberg 2003


managing knowledge about changes. Nevertheless, in several practical situations we were able to trace erroneous results of OLAP queries back to structural changes not known by the analysts and the data warehouse operators, erroneous in the sense that the resulting data did not correctly represent the state of affairs in the real world.

As a means for detecting such changes we propose the use of data mining techniques. In a nutshell, the problem can be described as a multidimensional outlier detection problem.

In this paper we report on some experiments we conducted to analyse

– which data mining techniques might be applied for the detection of structural changes
– how these techniques are best applied
– whether these techniques effectively detect structural changes
– whether these techniques scale up to large data warehouses.

To the best of our knowledge, this is the first time that the problem of changes in dimensions of data warehouses is addressed with data mining techniques. The problems related to the effects of structural changes in data warehouses and approaches to overcome the problems they cause were the subject of several projects [Yan01, BSH99, Vai01, CS99], including our own efforts [EK01, EKM02] to build a temporal data warehouse structure with means to transform data between structural versions such that OLAP tools work on data cleaned of the effects of structural changes.

The remainder of this paper is organized as follows: in section 2 we give basic definitions and discuss the notion of structural changes in data warehouses. In section 3 we briefly introduce the data mining techniques we analyzed for the detection of structural changes and we introduce a procedure for applying these techniques. In section 4 we briefly discuss the experiments we conducted as a proof of concept. Finally, in section 5 we draw some conclusions.

2 Structural Changes

We will now briefly discuss different types of structural changes. Furthermore, we will argue why some of these structural changes do not need to be detected automatically.

In [EK01] we showed how the basic operations INSERT, UPDATE and DELETE have to be adopted for a temporal data warehouse. With respect to dimension members, i.e., the instances of dimensions, these basic operations may be combined to represent the following complex operations (where Q is the chronon, i.e. "a non-decomposable time interval of some fixed, minimal duration" [JD98]):

i.) SPLIT: One dimension member M splits into n dimension members M1, ..., Mn. This operation translates into a DELETE(M, Ts − Q) and a set of insert operations INSERT(Mi, Ts) (a small sketch of this translation follows the list of operations below).


[Figure 1 sketches an example of structural changes across four structure versions SV1-SV4, together with mapping functions relating the facts F1, ..., Fn between the versions: in SV1, Div. E contains Subdiv. C and Subdiv. D; in SV2, a new Div. A is inserted and Subdiv. C is renamed to Subdiv. X; in SV3, Div. A is split into Div. A1 and Div. A2; in SV4, Div. A1 and Div. A2 are merged into Div. F and Subdiv. X is deleted.]

Fig. 1. An example of structural changes

For instance, Figure 1 shows a split operation between the structure versions SV2 and SV3 where a division "Div.A" splits up into two divisions "Div.A1" and "Div.A2". We would need one delete operation ("Div.A") and two inserts ("Div.A1" and "Div.A2") to cope with this.

ii.) MERGE: n dimension members M1, ..., Mn are merged into one dimension member M. This operation translates into a set of delete operations DELETE(Mi, Ts − Q) and an insert operation INSERT(M, Ts). A merge is the opposite of a split, i.e. a split in one direction of time is always a merge in the opposite direction of time. Consider, for the example given, that these modifications occur at the timepoint T. For each analysis that requires measures from a timepoint before T for the structure version which is valid at timepoint T we would call these modifications "a split". For each analysis that requires measures from timepoint T for a structure version valid before timepoint T these modifications would be called "a merge".

iii.) CHANGE: An attribute of a dimension member changes, for example, if the product number (Key) or the name of a department (user-defined attribute) changes. Such a modification can be carried out by using the update operation defined above. With respect to dimension members representing measures, CHANGE could mean that the way the measure is computed changes (for example, the way the unemployment rate is computed changed in Austria because the country joined the European Union in 1995) or that the unit of the facts changes (for instance, from Austrian Schillings to EURO).

iv.) MOVE: A dimension member moves from one parent to another, i.e., we modify the hierarchical position of a dimension member. For instance, if a product P no longer belongs to product group GA but to product group GB. This can be done by changing the DMPid (parent ID) of the corresponding dimension member with an update operation.

v.) NEW-MEMBER: A new dimension member is inserted. For example, if a new product becomes part of the product spectrum. This modification can be done by using an insert operation.


vi.) DELETE-MEMBER: A dimension member is deleted. For instance, if a branch disbands. Just as a merge and a split are related depending on the direction of time, this is also applicable for the operations NEW-MEMBER and DELETE-MEMBER. In contrast to a NEW-MEMBER operation, there is a relation between the deleted dimension member and the following structure version. Consider, for example, that we delete a dimension member "Subdivision B" at timepoint T. If for the structure version valid at timepoint T we would request measures from a timepoint before T, we could still get valid results by simply subtracting the measures for "Subdivision B" from its parent.
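The sketch referred to in operation i.) above: a rough Python illustration (not the paper's implementation) of how SPLIT and MERGE reduce to the basic operations, with hypothetical delete/insert callbacks, the timestamp Ts of the change and the chronon Q:

def split(member, new_members, ts, q, delete, insert):
    # SPLIT: M is closed one chronon before Ts; its successors M1..Mn start at Ts.
    delete(member, ts - q)
    for m in new_members:
        insert(m, ts)

def merge(members, new_member, ts, q, delete, insert):
    # MERGE is the mirror image of SPLIT.
    for m in members:
        delete(m, ts - q)
    insert(new_member, ts)

# Example with callbacks that merely log the generated basic operations:
log = []
split("Div.A", ["Div.A1", "Div.A2"], ts=3, q=1,
      delete=lambda m, t: log.append(("DELETE", m, t)),
      insert=lambda m, t: log.append(("INSERT", m, t)))
print(log)   # [('DELETE', 'Div.A', 2), ('INSERT', 'Div.A1', 3), ('INSERT', 'Div.A2', 3)]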

For two of these operations, namely NEW-MEMBER and DELETE-MEMBER, there is no need to use data mining techniques to automatically detect these modifications.

When loading data from the data sources for a dimension member which is new in a data source but does not yet exist in the warehouse, the NEW-MEMBER operation is detected automatically by the ETL-Tool (extraction, transformation and loading tool). On the other hand, the ETL-Tool automatically detects when no fact values are available in the data source for deleted dimension members.

3 Data Mining Techniques

In this section a selection of different data mining techniques for the automatic detection of structural changes is presented. Whereas the first subsection gives an overview of some possible data mining techniques, the second subsection focuses on multidimensional extensions of the methods. Finally, a stepwise approach to detect structural changes at different layers is proposed.

3.1 Possible Data Mining Methods

The simplest method for detecting structural changes is the calculation of deviation matrices. Absolute and relative differences between consecutive values, and differences in the shares of each dimension member between two chronons, can be easily computed; the runtime of this approach is clearly linear in the number of analyzed values. Since this method is very fast, it should be used as a first sieve.
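A minimal Python sketch of such deviation matrices (illustrative names; values holds one measure series per dimension member, one entry per chronon):

def deviation_matrices(values):
    chronons = len(next(iter(values.values())))
    totals = [sum(series[t] for series in values.values()) for t in range(chronons)]
    # Absolute differences between consecutive chronons, per dimension member.
    abs_diff = {m: [s[t + 1] - s[t] for t in range(chronons - 1)] for m, s in values.items()}
    # Differences in the member's share of the total between consecutive chronons.
    share_diff = {m: [s[t + 1] / totals[t + 1] - s[t] / totals[t] for t in range(chronons - 1)]
                  for m, s in values.items()}
    return abs_diff, share_diff

# Example: deviation_matrices({"Div.A": [100, 20, 60], "Div.E": [500, 500, 700]})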

A second approach, whose runtime complexity is of the same order as the calculation of traditional deviation matrices, is the attempt to model a given data set with a stepwise constant differential equation (perhaps with a simple functional equation). This model, however, only makes sense if there exists some rudimentary, basic knowledge about factors that could have caused the development of certain members (but not exact knowledge, since in that case no data mining would have to be done anymore). After having solved the equation (for solution techniques of stepwise differential equations refer to [Dia00]), the relative and absolute differences between the predicted value and the actual value can be considered to detect structural changes.


Other techniques that can be used for detecting structural changes are mostly techniques that are also used for time-series analysis:

– autoregression - a significantly high absolute and relative difference between a dimension member's actual value and its value predicted via a simple ARMA (AutoRegressive Moving Average)(p,q) model (or, if necessary, an ARIMA (AutoRegressive Integrated Moving Average)(p,d,q) model, perhaps even with extensions for seasonal periods) is an indicator for a structural change of that dimension member.

– autocorrelation - the usage of this method is similar to the method of autoregression. The results of this method, however, can be easily visualized with the help of correlograms.

– crosscorrelation and regression - these methods can be used to detect significant dependencies between two different members. Especially a very low correlation coefficient (a very inaccurate prediction with a simple regression model, respectively) could lead to the roots of a structural change (a toy crosscorrelation check is sketched after this list).

– discrete fourier transform (DFT), discrete cosine transform (DCT), different types of discrete wavelet transforms - the maximum difference (scaled by the mean of the vector) as well as the overall difference (scaled by the mean and length of the vector) of the coefficients of the transforms of two dimension members can be used to detect structural changes.

– singular value decomposition (SVD) - unusually high differences in singular values can be used for detecting changes in the measure dimension when analyzing the whole data matrix. If single dimension members are compared, the differences of the eigenvalues of the covariance matrices of the dimension members (= principal component analysis) can be used in the same way.

In this paper, due to lack of space, no detailed explanation of these methods is given; for details refer to [Atk89] (fourier transform), [Vid99] (wavelet transforms), [BD02] (autoregression and -correlation), [Hol02] (SVD and principal component analysis), [Wei85] (linear regression and crosscorrelation).
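As a toy illustration of the crosscorrelation idea mentioned above (not the authors' implementation; the threshold is arbitrary), one can flag a member whose values correlate poorly with a stable reference member:

import numpy as np

def correlation_flag(reference, candidate, threshold=0.5):
    # A very low (or negative) correlation coefficient between the stable reference
    # member and the candidate member hints at a possible structural change.
    r = np.corrcoef(np.asarray(reference, float), np.asarray(candidate, float))[0, 1]
    return r < threshold, r

print(correlation_flag([100, 110, 120, 130], [50, 55, 12, 13]))   # flagged: the series diverge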

3.2 Multidimensional Structures

Since in data warehouses there is usually a multidimensional view on the data, the techniques shown in the previous section have to be applied carefully. If all structure dimensions are considered simultaneously and a structural change occurred in one structure dimension, it is impossible to detect the dimension that was responsible for this change. Therefore the values of the data warehouse have to be grouped along one structure dimension (we considered only fully-additive measures that can be summed along each structure dimension; it is one aspect of further work to check the approach for semi-additive and non-additive measures). On this two-dimensional view the methods of the previous section can then be applied to detect the changed structure dimension. If, however, a change happens in two or more structure dimensions at the same time, the analysis of the values grouped along one structure dimension will not be successful - either a few (or even all) structure dimensions will show big volatilities in their data, or not a single structure dimension will show significant changes. Hence, if a change in the structure dimensions is assumed, and none can be detected with the methods described above, the values have to be grouped along two structure dimensions. The methods can be applied on this view; if changes still cannot be detected, the values are grouped by three structure dimensions, and so on. The approach of analyzing structural changes in data warehouses by grouping values by just one structure dimension in the initial step was chosen for two reasons:

1.) In the vast majority of all structure changes in structure dimensions only one structure dimension will be affected.

2.) The runtime and memory complexity of the analysis is much smaller when values are grouped by just one structure dimension: let D1, D2, ..., Dn denote the numbers of elements in the structure dimensions (i = 1 ... n; D1 ≥ D2 ≥ ... ≥ Dn); then in the first step only D1 + D2 + ... + Dn = O(D1) different values have to be analyzed, in the second step already D1D2 + ... + D1Dn + ... + Dn-1Dn = O(D1^2), and in the i-th step therefore O(D1^i), i = 1 ... n.

3.3 Stepwise Approach

As a conclusion of the previous sections we propose a stepwise approach to detect different types of structure changes at different layers:

1.) In the first step the whole data matrix of each measure in the data warehouse is analyzed to detect changes in the measure dimension (change of the calculation formula of a measure, change of the metric unit of a measure). Primarily, a simple deviation matrix that calculates the differences of the sums of all absolute values between two consecutive chronons can be applied here. If these differences between two chronons are substantially bigger than those between other chronons, then this is an indicator for a change in the measure dimension. If the runtime performance is not too critical, SVD and DCT can also be carried out to detect changes. Changes at this level that are detected must be either corrected or eliminated - otherwise the results in the following steps will be biased by these errors.

2.) In the next step the data are grouped by one structure dimension. The deviation matrices that were described above can be applied here to detect dimension members that were affected by structural changes.

3.) If the data grouped by one structure dimension can be adequately modelled with a stepwise constant differential equation (or a simple functional equation), then also the deviation matrices that calculate the absolute and relative difference between the model-estimated value and the actual value should be used.

4.) In each structure dimension where one dimension member is known that definitely remained unchanged throughout all chronons (fairly stable so that it can be considered as a dimension member with an average development, mostly a dimension member with rather big absolute values), other data mining techniques such as autocorrelation, autoregression, discrete fourier transform, discrete wavelet transform, principal component analysis, crosscorrelation and linear regression can be used to compare this 'average' dimension member with any other dimension member detected in steps 2 and 3. If one of the methods shows big differences between the average dimension member and the previously detected dimension member, then this is an indicator for a structural change of the latter one. Hence, these methods are on the one hand used to make the selection of detected dimension members smaller; on the other hand they are also used to 'prove' the results of the previous steps. However, all these methods should not be applied to a dimension member that is lacking values, whose data are too volatile or whose values are often zero. If no 'average' dimension member is known, the dimension members that were detected in previous steps can also be compared with the sum of the absolute values of all dimension members. In any case, for performance reasons it is recommended to use the method of autocorrelation first; among all wavelet transforms the Haar method is the fastest.

5.) If in steps 2, 3 and 4 no (or not all) structural changes are detected and one still assumes structural changes, then the values are grouped by i+1 structure dimensions, where i (i = 1 ... n − 1, n = number of structure dimensions) is the number of structure dimensions that were used for grouping values in the current step. Again, steps 2, 3 and 4 can be applied.

4 Experiments

The stepwise approach proposed in the previous section was tested on many small data sets and one larger sample data set. Here one small example is given to show the usefulness of the stepwise approach.

Consider the data warehouse with one measure dimension and four structure dimensions given in table 1, where three structural changes and one measure change are hidden: between year1 and year2, dimension member SD21 is reduced to 20% of its original value (UPDATE, MOVE or SPLIT); between year2 and year3, dimension members SD31 and SD32 swap their structure (UPDATE or MOVE); between year3 and year4, dimension member SD41 loses 70% of its value to SD42 (MOVE); and between year3 and year4 the currency of the measure changes (from EURO to ATS, values multiplied by 14). In this case, one might probably detect the measure change in the last chronon simply by visual inspection of all values, without calculating deviation matrices; in large data warehouses, however, this may become infeasible. In the first step the data matrix is checked for changes in the measure dimension: the differences of the sums of all absolute values of two consecutive chronons are calculated. As can be seen from iteration one of table 2, the difference between year3 and year4 (122,720) is by far the biggest - a very strong indicator for a measure change between these years. When asking for possible explanations, domain experts should recognize the fact of a new currency. To be able to continue the analysis, all values in the


SD1 SD2 SD3 SD4 year1 year2 year3 year4

SD11 SD21 SD31 SD41 100 20 60 252

SD11 SD21 SD31 SD42 200 40 80 1,708

SD11 SD21 SD32 SD41 300 60 20 84

SD11 SD21 SD32 SD42 400 80 40 756

SD11 SD22 SD31 SD41 500 500 700 2,940

SD11 SD22 SD31 SD42 600 600 800 18,060

SD11 SD22 SD32 SD41 700 700 500 2,100

SD11 SD22 SD32 SD42 800 800 600 13,300

SD12 SD21 SD31 SD41 900 180 220 924

SD12 SD21 SD31 SD42 1,000 200 240 5,516

SD12 SD21 SD32 SD41 1,100 220 180 756

SD12 SD21 SD32 SD42 1,200 240 200 4,564

SD12 SD22 SD31 SD41 1,300 1,300 1,500 6,300

SD12 SD22 SD31 SD42 1,400 1,400 1,600 37,100

SD12 SD22 SD32 SD41 1,500 1,500 1,300 5,460

SD12 SD22 SD32 SD42 1,600 1,600 1,400 32,340

SD=structure dimension, SDij=j-th dimension member in structure dimension i

Table 1. Structural changes in a data warehouse with four structure dimensions

Diff year12 year23 year34

iteration 1 -4,160 0 122,720

iteration 2 -4,160 0 0

Diff=absolute difference, yearmn=comparison of year m with year n

Table 2. Detection of changes in the measure dimension

data warehouse have to be noted in the same currency. Therefore, all values in the last column (year4) are divided by 14. Having corrected the problem of different currencies, the biggest remaining difference is -4,160 between year1 and year2 (see line 'iteration 2' in table 2). According to domain experts, this difference cannot be explained by changes in the measure dimension, and hence the approach can be continued with the analysis of changes in the structure dimensions.

In the next step the values in the data warehouse are grouped by one structure dimension; on the resulting view the differences of the shares of the dimension members are calculated (this deviation matrix was chosen because it shows the outliers most clearly in this case). As can be seen from table 3, the formerly 'hidden' structural changes become obvious: all three structural changes are detected (in this example with just two dimension members per structure dimension, the changes in the one member have to be counted up in the other - it is therefore not known whether between year1 and year2 dimension member SD21 or SD22 changed; in real-world data warehouses with many more dimension members, however, it usually is clear which dimension member changed). Here, due to lack of space, steps 3 and 4 are omitted; if one assumes further structural changes, the detected structural changes have to be corrected, and the above deviation


Δ(%) year12 year23 year34

SD11 3.19% 0% 0%

SD12 -3.19% 0% 0%

SD21 -27.22% 0% 0%

SD22 27.22% 0% 0%

SD31 0.8% 10.17% 0%

SD32 -0.8% -10.17% 0%

SD41 0.4% 0% -33.22%

SD42 -0.4% 0% 33.22%

SD=structure dimension, SDij=j-th dimension member in structure dimension i,Δ(%)=change in share of a dimension member between two consecutive years,

yearmn=comparison of shares of different dimension members between year m and year n

Table 3. Detection of changes in the structure dimension

matrix can be calculated once again. In this case, however, all differences of all dimension members between all years are zero - all dimension members stay unchanged throughout the four years. Hence, a further analysis of combined structural changes is useless.
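The numbers reported in Tables 2 and 3 can be reproduced directly from Table 1; a short Python sketch for illustration (data transcribed from Table 1):

rows = [  # (SD1, SD2, SD3, SD4, [year1, year2, year3, year4])
    ("SD11", "SD21", "SD31", "SD41", [100, 20, 60, 252]),
    ("SD11", "SD21", "SD31", "SD42", [200, 40, 80, 1708]),
    ("SD11", "SD21", "SD32", "SD41", [300, 60, 20, 84]),
    ("SD11", "SD21", "SD32", "SD42", [400, 80, 40, 756]),
    ("SD11", "SD22", "SD31", "SD41", [500, 500, 700, 2940]),
    ("SD11", "SD22", "SD31", "SD42", [600, 600, 800, 18060]),
    ("SD11", "SD22", "SD32", "SD41", [700, 700, 500, 2100]),
    ("SD11", "SD22", "SD32", "SD42", [800, 800, 600, 13300]),
    ("SD12", "SD21", "SD31", "SD41", [900, 180, 220, 924]),
    ("SD12", "SD21", "SD31", "SD42", [1000, 200, 240, 5516]),
    ("SD12", "SD21", "SD32", "SD41", [1100, 220, 180, 756]),
    ("SD12", "SD21", "SD32", "SD42", [1200, 240, 200, 4564]),
    ("SD12", "SD22", "SD31", "SD41", [1300, 1300, 1500, 6300]),
    ("SD12", "SD22", "SD31", "SD42", [1400, 1400, 1600, 37100]),
    ("SD12", "SD22", "SD32", "SD41", [1500, 1500, 1300, 5460]),
    ("SD12", "SD22", "SD32", "SD42", [1600, 1600, 1400, 32340]),
]

# Step 1: differences of the sums of all absolute values between consecutive chronons.
totals = [sum(abs(r[4][t]) for r in rows) for t in range(4)]
print([totals[t + 1] - totals[t] for t in range(3)])    # [-4160, 0, 122720], as in Table 2

# Step 2: correct the currency change (year4 / 14), group by one structure dimension
# at a time and compare each member's share of the total (Table 3).
corrected = [r[4][:3] + [r[4][3] / 14] for r in rows]
totals = [sum(v[t] for v in corrected) for t in range(4)]
for dim in range(4):
    for m in sorted({r[dim] for r in rows}):
        share = [sum(v[t] for r, v in zip(rows, corrected) if r[dim] == m) / totals[t]
                 for t in range(4)]
        print(m, [round(100 * (share[t + 1] - share[t]), 2) for t in range(3)])
# e.g. SD21 [-27.22, 0.0, 0.0], SD31 [0.8, 10.17, 0.0], SD41 [0.4, 0.0, -33.22]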

On a large sample data set (40 GB) we tested the performance of our proposed approach: good scalability of the methods was shown - all methods (excluding SVD and DCT on the whole data matrix, which took six minutes) took less than six seconds (Pentium III, 866 MHz, 128 MB SDRAM). This example, however, also showed that the quality of the results of the different methods very much depends on the quality and the volatility of the original data.

5 Conclusions

Unknown structural changes lead to incorrect results of OLAP queries, to analysis reports with wrong data, and probably to suboptimal decisions based on these data. Since analysts need not see such changes in the data, the changes might be hidden in the lower levels of the dimension hierarchies, which are typically only looked at in drill-down operations, but of course influence the data derived on higher levels. Pro-active steps are necessary to avoid incorrect results due to neglected structural changes.

Some changes might be detected when data is loaded into the database, or when change-logs of the sources are forwarded to the data warehouse. Nevertheless, changes stemming from different sources might appear unnoticed in data warehouses. Therefore, we propose to apply data mining techniques for detecting such changes, or more precisely for detecting suspicious, unexpected data characteristics which might originate from unknown structural changes. It is clear that any such technique will face the problem that it might not be able to detect all such changes, in particular when the data looks perfectly feasible. On the other hand, the techniques might indicate a change due to characteristics of data which, however, correctly represent reality, i.e. no change in dimension data took place.

We showed that several data mining techniques might be used for this analysis. We propose a procedure which uses several techniques in turn and in our opinion is a good combination of efficiency and effectiveness. We were able to show that the techniques we propose actually detect structural changes in data warehouses and that these techniques also scale up for large data warehouses. The application of the data mining techniques, however, requires good quality of the data in the data warehouse, because otherwise errors of the first and second kind rise. It is also necessary to fine-tune the parameters for the data mining techniques, in particular to take the volatility of the data in the data warehouse into account. Here, further research is expected to lead to self-adaptive methods.

References

[Atk89] K. E. Atkinson. An Introduction to Numerical Analysis. John Wiley, New York, USA, 1989.

[BD02] P. J. Brockwell and R. A. Davis. Introduction to Time Series Forecasting. Springer Verlag, New York, USA, 2002.

[BSH99] M. Blaschka, C. Sapia, and G. Hofling. On Schema Evolution in Multidimensional Databases. In Proceedings of the DaWaK99 Conference, Florence, Italy, 1999.

[CS99] P. Chamoni and S. Stock. Temporal Structures in Data Warehousing. In Proceedings of the 1st International Conference on Data Warehousing and Knowledge Discovery (DaWaK'99), pages 353-358, Florence, Italy, 1999.

[Dia00] F. Diacu. An Introduction to Differential Equations - Order and Chaos. W. H. Freeman, New York, USA, 2000.

[EK01] J. Eder and C. Koncilia. Changes of Dimension Data in Temporal Data Warehouses. In Proceedings of the 3rd International Conference on Data Warehousing and Knowledge Discovery (DaWaK'01), Munich, Germany, 2001. Springer Verlag (LNCS 2114).

[EKM02] J. Eder, C. Koncilia, and T. Morzy. The COMET Metamodel for Temporal Data Warehouses. In Proceedings of the 14th International Conference on Advanced Information Systems Engineering (CAISE'02), Toronto, Canada, 2002. Springer Verlag (LNCS 2348).

[Hol02] J. Hollmen. Principal component analysis, 2002. URL: http://www.cis.hut.fi/ jhollmen/dippa/node30.html.

[JD98] C. S. Jensen and C. E. Dyreson, editors. A Consensus Glossary of Temporal Database Concepts - Feb. 1998 Version, pages 367-405. Springer-Verlag, 1998. In [EJS98].

[Vai01] A. Vaisman. Updates, View Maintenance and Time Management in Multidimensional Databases. Universidad de Buenos Aires, 2001. Ph.D. Thesis.

[Vid99] B. Vidakovic. Statistical Modeling by Wavelets. John Wiley, New York, USA, 1999.

[Wei85] S. Weisberg. Applied Linear Regression. John Wiley, New York, USA, 1985.

[Yan01] J. Yang. Temporal Data Warehousing. Stanford University, June 2001. Ph.D. Thesis.


SELECT count(C.id)
FROM table_current AS C, table_previous AS P
WHERE (C.id = P.id)
  AND ((C.field1 <> P.field1) OR (C.field2 <> P.field2) OR (C.fieldN <> P.fieldN));

SELECT count(C.id)
FROM table_current AS C, table_previous AS P
WHERE (C.id = P.id)
  AND (P.signature <> F_CalculateSignature(C.Field1, .., C.FieldN));


Recent Developments in

Web Usage Mining Research

Federico Michele Facca and Pier Luca Lanzi*

Artificial Intelligence and Robotics Laboratory, Dipartimento di Elettronica e Informazione, Politecnico di Milano

Abstract. Web Usage Mining is that area of Web Mining which deals with the extraction of interesting knowledge from logging information produced by web servers. In this paper, we present a survey of the recent developments in this area, which is receiving increasing attention from the Data Mining community.

1 Introduction

Web Mining [29] is that area of Data Mining which deals with the extraction of interesting knowledge from the World Wide Web. More precisely [40], Web Content Mining is that part of Web Mining which focuses on the raw information available in web pages; source data mainly consist of textual data in web pages (e.g., words, but also tags); typical applications are content-based categorization and content-based ranking of web pages. Web Structure Mining is that part of Web Mining which focuses on the structure of web sites; source data mainly consist of the structural information in web pages (e.g., links to other pages); typical applications are link-based categorization of web pages, ranking of web pages through a combination of content and structure (e.g. [20]), and reverse engineering of web site models. Web Usage Mining is that part of Web Mining which deals with the extraction of knowledge from server log files; source data mainly consist of the (textual) logs that are collected when users access web servers and might be represented in standard formats; typical applications are those based on user modeling techniques, such as web personalization, adaptive web sites, and user modeling. The recent years have seen the flourishing of research in the area of Web Mining and specifically of Web Usage Mining. Since the early papers published in the mid 1990s, more than 400 papers on Web Mining have been published; roughly 150 of the overall 400 papers were published before 2001, and around 50% of these papers regarded Web Usage Mining. The first workshop entirely on this topic, WebKDD, was held in 1999. Since 2000, more than 150 papers on Web Usage Mining have been published, showing a dramatic increase of interest in this area. This paper is a survey of the recent developments in the area of Web Usage Mining. It is based on the more than 150 papers published since 2000 on the topic of Web Usage Mining; see the on-line bibliography on the web site of the cInQ project [1].

* Contact Author: Pier Luca Lanzi, [email protected].

Y. Kambayashi, M. Mohania, W. Woß (Eds.): DaWaK 2003, LNCS 2737, pp. 140–150, 2003.c© Springer-Verlag Berlin Heidelberg 2003


2 Data Sources

Web Usage Mining applications are based on data collected from three main sources [58]: (i) web servers, (ii) proxy servers, and (iii) web clients.

The Server Side. Web servers are surely the richest and the most common source of data. They can collect large amounts of information in their log files and in the log files of the databases they use. These logs usually contain basic information, e.g.: name and IP of the remote host, date and time of the request, the request line exactly as it came from the client, etc. This information is usually represented in standard formats, e.g.: Common Log Format [2], Extended Log Format [3], LogML [53]. When exploiting log information from web servers, the major issue is the identification of users' sessions (see Section 3).

Apart from web logs, users' behavior can also be tracked down on the server side by means of TCP/IP packet sniffers. Even in this case the identification of users' sessions is still an issue, but the use of packet sniffers provides some advantages [52]. In fact: (i) data are collected in real time; (ii) information coming from different web servers can be easily merged together into a unique log; (iii) the use of special buttons (e.g., the stop button) can be detected so as to collect information usually unavailable in log files. Packet sniffers are rarely used in practice because they raise scalability issues on web servers with high traffic [52], and because of the impossibility of accessing encrypted packets like those used in secure commercial transactions, a quite severe limitation when applying web usage mining to e-businesses [13]. Probably, the best approach for tracking web usage consists of directly accessing the server application layer, as proposed in [14]. Unfortunately, this is not always possible.

The Proxy Side. Many internet service providers (ISPs) provide their customers with proxy server services to improve navigation speed through caching. In many respects, collecting navigation data at the proxy level is basically the same as collecting data at the server level. The main difference in this case is that proxy servers collect data of groups of users accessing huge groups of web servers.

The Client Side. Usage data can also be tracked on the client side by using JavaScript, Java applets [56], or even modified browsers [22]. These techniques avoid the problems of users' session identification and the problems caused by caching (like the use of the back button). In addition, they provide detailed information about actual user behaviors [30]. However, these approaches rely heavily on the users' cooperation and raise many issues concerning privacy laws, which are quite strict.


3 Preprocessing

Data preprocessing has a fundamental role in Web Usage Mining applications. The preprocessing of web logs is usually complex and time-consuming. It comprises four different tasks: (i) data cleaning, (ii) the identification and reconstruction of users' sessions, (iii) the retrieval of information about page content and structure, and (iv) data formatting.

Data Cleaning. This step consists of removing all the data tracked in web logs that are useless for mining purposes [27, 12], e.g., requests for graphical page content (e.g., jpg and gif images), requests for any other file which might be included in a web page, or even navigation sessions performed by robots and web spiders. While requests for graphical contents and files are easy to eliminate, robots' and web spiders' navigation patterns must be explicitly identified. This is usually done, for instance, by referring to the remote hostname, by referring to the user agent, or by checking the access to the robots.txt file. However, some robots actually send a false user agent in the HTTP request. In these cases, a heuristic based on navigational behavior can be used to separate robot sessions from actual users' sessions (see [60, 61]).
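As a rough illustration of this cleaning step, the sketch below drops requests for graphical content and discards hosts that touch robots.txt as likely robots; it reuses the dictionaries of the previous sketch, and the suffix list and field names are assumptions made for the example.

GRAPHIC_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png", ".css", ".ico")

def clean_log(entries):
    """Remove requests for graphical content and requests made by likely robots."""
    # Hosts that request robots.txt are treated as crawlers (a common heuristic).
    robot_hosts = {e["host"] for e in entries if e["url"].endswith("/robots.txt")}
    cleaned = []
    for e in entries:
        if e["host"] in robot_hosts:
            continue                      # drop the whole robot "session"
        if e["url"].lower().endswith(GRAPHIC_SUFFIXES):
            continue                      # drop embedded graphical content
        cleaned.append(e)
    return cleaned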

Session Identification and Reconstruction. This step consists of (i) identifying the different users' sessions from the usually very poor information available in log files and (ii) reconstructing the users' navigation paths within the identified sessions. Most of the problems encountered in this phase are caused by the caching performed either by proxy servers or by browsers. Proxy caching causes a single IP address (the one belonging to the proxy server) to be associated with different users' sessions, so that it becomes impossible to use IP addresses as user identifiers. This problem can be partially solved by the use of cookies [25], by URL rewriting, or by requiring the user to log in when entering the web site [12]. Web browser caching is a more complex issue. Logs from web servers cannot include any information about the use of the back button. This can generate inconsistent navigation paths in the users' sessions. However, by using additional information about the web site structure, it is still possible to reconstruct a consistent path by means of heuristics. Because the HTTP protocol is stateless, it is virtually impossible to determine when a user actually leaves the web site, and therefore when a session should be considered finished. This problem is referred to as sessionization. [17] described and compared three heuristics for the identification of session termination; two were based on the time between users' page requests, one was based on information about the referrer. [24] proposed an adaptive timeout heuristic. [26] proposed a technique to infer the timeout threshold for a specific web site. Other authors proposed different thresholds for time-oriented heuristics based on empirical experiments.
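A minimal sketch of a time-oriented sessionization heuristic follows, assuming the cleaned entries of the previous sketch and a fixed 30-minute inactivity threshold (one of the empirically chosen values mentioned above); it groups requests by host only, so it deliberately ignores the proxy and cookie issues discussed in the text.

from collections import defaultdict
from datetime import timedelta

def sessionize(entries, timeout=timedelta(minutes=30)):
    """Split each host's requests into sessions using an inactivity timeout."""
    by_host = defaultdict(list)
    for e in sorted(entries, key=lambda e: e["time"]):
        by_host[e["host"]].append(e)

    sessions = []
    for host, requests in by_host.items():
        current = [requests[0]]
        for prev, curr in zip(requests, requests[1:]):
            if curr["time"] - prev["time"] > timeout:
                sessions.append(current)   # inactivity gap: close the session
                current = []
            current.append(curr)
        sessions.append(current)
    return sessions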

Content and Structure Retrieving. The vast majority of Web Usage Mining applications use the visited URLs as the main source of information for mining purposes. URLs are however a poor source of information since, for instance, they do not convey any information about the actual page content. [26] was the first to employ content-based information to enrich the web log data. If an adequate classification is not known in advance, Web Structure Mining techniques can be employed to develop one. As in search engines, web pages are classified according to their semantic areas by means of Web Content Mining techniques; this classification information can then be used to enrich the information extracted from logs. For instance, [59] proposes to use the Semantic Web for Web Usage Mining: web pages are mapped into ontologies to add meaning to the observed frequent paths. [15] introduces concept-based paths as an alternative to the usual user navigation paths; concept-based paths are a high-level generalization of the usual paths, in which common concepts are extracted by means of intersection of raw user paths and similarity measures.

Data Formatting. This is the final step of preprocessing. Once the previous phases have been completed, data are properly formatted before applying mining techniques. [11] stores data extracted from web logs into a relational database using a click fact schema, so as to provide better support to log querying aimed at frequent pattern mining. [47] introduces a method based on signature trees to index logs stored in databases for efficient pattern queries. A tree structure, the WAP-tree, is also introduced in [51] to register access sequences to web pages; this structure is optimized to exploit the sequence mining algorithm developed by the same authors [51].

4 Techniques

Most of the commercial applications of Web Usage Mining exploit consolidated statistical analysis techniques. In contrast, research in this area is mainly focused on the development of knowledge discovery techniques specifically designed for the analysis of web usage data. Most of this research effort focuses on three main paradigms: association rules, sequential patterns, and clustering (see [32] for a detailed description of these techniques).

Association Rules are probably the most elementary data mining technique and, at the same time, the most used technique in Web Usage Mining. When applied to Web Usage Mining, association rules are used to find associations among web pages that frequently appear together in users' sessions. The typical result has the form "A.html, B.html ⇒ C.html", which states that if a user has visited page A.html and page B.html, it is very likely that, in the same session, the same user has also visited page C.html. This type of result is for instance produced by [38] and [46] by using a modification of the Apriori algorithm [32]. [37] proposes and evaluates some interestingness measures to evaluate the association rules mined from web usage data. [21] exploits a mixed technique of association rules and fuzzy logic to extract fuzzy association rules from web logs.
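To make the rule form concrete, the following sketch counts page sets in the sessions of the previous sketches and derives rules of the form {A, B} ⇒ C by exhaustive enumeration; it is a brute-force illustration of the idea, not the Apriori algorithm of [32], and the thresholds are arbitrary.

from itertools import combinations

def mine_page_rules(sessions, min_support=0.01, min_confidence=0.6):
    """Enumerate rules {A, B} => C over the pages visited in each session."""
    baskets = [set(e["url"] for e in s) for s in sessions]
    n = len(baskets)

    def support(pages):
        return sum(1 for b in baskets if pages <= b) / n

    pages = set().union(*baskets)           # assumes at least one session
    rules = []
    for antecedent in combinations(sorted(pages), 2):
        ant_support = support(set(antecedent))
        if ant_support < min_support:
            continue
        for consequent in pages - set(antecedent):
            conf = support(set(antecedent) | {consequent}) / ant_support
            if conf >= min_confidence:
                rules.append((antecedent, consequent, conf))
    return rules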


Sequential Patterns are used to discover frequent subsequences among large amounts of sequential data. In Web Usage Mining, sequential patterns are exploited to find sequential navigation patterns that appear frequently in users' sessions. The typical sequential pattern has the form [45]: 70% of the users who first visited A.html and then visited B.html afterwards have, in the same session, also accessed page C.html. Sequential patterns might appear syntactically similar to association rules; in fact, algorithms to extract association rules can also be used for sequential pattern mining. There are essentially two classes of algorithms used to extract sequential patterns: one includes methods based on association rule mining; the other includes methods based on the use of tree structures, data projection techniques, and Markov chains to mine navigation patterns. Some well-known algorithms for mining association rules have been modified to extract sequential patterns. [44] presents a comparison of different sequential pattern algorithms applied to Web Usage Mining. The comparison includes PSP+, FreeSpan, and PrefixSpan. While PSP+ is an evolution of GSP, based on a candidate generation and test heuristic, FreeSpan and the newly proposed PrefixSpan use a data projection based approach. According to [44], PrefixSpan outperforms the other two algorithms and offers very good performance even on long sequences. [54] proposes a hybrid method: data are stored in a database according to a so-called Click Fact Schema; a Hypertext Probabilistic Grammar (HPG) is generated by querying the database; HPGs represent transitions among web pages through a model which bears many similarities to Markov chains. The frequent sequential patterns are mined through a breadth-first search over the hypertext probabilistic grammar. HPGs were first proposed in [18], and later improved in [54], where some scalability issues of the original proposal were solved.
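The difference from association rules is that order matters. The sketch below checks whether a session contains a given page sequence in order (not necessarily contiguously) and computes its support; this is only the basic containment test behind candidate counting, not an implementation of PSP+, FreeSpan, or PrefixSpan, and the session representation is the one assumed in the earlier sketches.

def contains_sequence(session_urls, pattern):
    """True if the pages of `pattern` occur in `session_urls` in this order."""
    it = iter(session_urls)
    return all(page in it for page in pattern)

def sequence_support(sessions, pattern):
    """Fraction of sessions containing `pattern` as an ordered subsequence."""
    ordered = [[e["url"] for e in s] for s in sessions]
    return sum(contains_sequence(s, pattern) for s in ordered) / len(ordered)

# e.g. sequence_support(sessions, ["/a.html", "/b.html", "/c.html"])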

Clustering techniques look for groups of similar items among large amounts of data, based on the general idea of a distance function which computes the similarity between groups. Clustering has been widely used in Web Usage Mining to group together similar sessions [56, 34, 36, 15]. [65] was the first to suggest that the focus of web usage mining should be shifted from single user sessions to groups of user sessions; [65] was also the first to apply clustering for identifying such clusters of similar sessions. [15] proposes similarity graphs in conjunction with the time spent on web pages to estimate group similarity in concept-based clustering. [33] uses sequence alignment to measure similarity, while [65] exploits belief functions. [57] uses Genetic Algorithms [35] to improve the results of clustering through user feedback. [48] couples a Fuzzy Artificial Immune System and clustering techniques to improve the users' profiles obtained through clustering. [34] applies multi-modal clustering, a technique which builds clusters by using multiple information data features. [49] presents an application of matrix clustering to web usage data.
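As a minimal illustration of distance-based session clustering, the sketch below measures session similarity with the Jaccard coefficient over visited pages and applies a naive greedy leader clustering; real systems use the richer similarity measures cited above (sequence alignment, belief functions, multi-modal features), so this is only a toy baseline.

def jaccard(a, b):
    """Jaccard similarity between two sets of visited pages."""
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster_sessions(sessions, threshold=0.5):
    """Greedy leader clustering: assign each session to the first close cluster."""
    baskets = [set(e["url"] for e in s) for s in sessions]
    clusters = []          # each cluster is (representative_set, member_indices)
    for i, basket in enumerate(baskets):
        for representative, members in clusters:
            if jaccard(basket, representative) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((basket, [i]))
    return clusters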


5 Applications

The general goal of Web Usage Mining is to gather interesting information about users' navigation patterns (i.e., to characterize web users). This information can later be exploited to improve the web site from the users' viewpoint. The results produced by the mining of web logs can be used for various purposes [58]: (i) to personalize the delivery of web content; (ii) to improve user navigation through prefetching and caching; (iii) to improve web design; or, in e-commerce sites, (iv) to improve customer satisfaction.

Personalization of Web Content. Web Usage Mining techniques can be used to provide a personalized web user experience. For instance, it is possible to anticipate, in real time, the user behavior by comparing the current navigation pattern with typical patterns extracted from past web logs. In this area, recommendation systems are the most common application; their aim is to recommend links or products which could be interesting to users [10, 21, 63, 43]. Personalized Site Maps [62] are an example of a recommendation system for links (see also [45]). [50] proposed an adaptive technique to reorganize the product catalog according to the forecasted user profile. A survey of existing commercial recommendation systems implemented in e-commerce web sites is presented in [55].

Prefetching and Caching. The results produced by Web Usage Mining can be exploited to improve the performance of web servers and web-based applications. Typically, Web Usage Mining can be used to develop proper prefetching and caching strategies so as to reduce the server response time, as done in [23, 41, 42, 46, 64].

Support to the Design. Usability is one of the major issues in the design and implementation of web sites. The results produced by Web Usage Mining techniques can provide guidelines for improving the design of web applications. [16] uses stratograms to evaluate the organization and the efficiency of web sites from the users' viewpoint. [31] exploits Web Usage Mining techniques to suggest proper modifications to web sites. Adaptive web sites represent a further step. In this case, the content and the structure of the web site can be dynamically reorganized according to the data mined from the users' behavior [39, 66].

E-commerce. Mining business intelligence from web usage data is dramatically important for e-commerce web-based companies. Customer Relationship Management (CRM) can gain an effective advantage from the use of Web Usage Mining techniques. In this case, the focus is on business-specific issues such as customer attraction, customer retention, cross sales, and customer departure [19, 14, 28].


6 Software

There are many commercial tools which perform analysis on log data collected from web servers. Most of these tools are based on statistical analysis techniques, while only a few products exploit Data Mining techniques. [28] provides an up-to-date review of available commercial tools for web usage mining. With respect to Web Mining commercial tools, it is worth noting that since the review made in [58], the number of existing products has almost doubled. Some companies which sold Web Usage Mining products in the past have disappeared (e.g., Andromeda's Aria); others have been bought by other companies. In most cases, Web Usage Mining tools are part of integrated Customer Relationship Management (CRM) solutions for e-commerce (e.g., [8] and [4]). Sometimes, these tools are simple web log analyzers (e.g., [6, 7, 5]). One tool developed in a research environment, WUM [9], appears to have reached an interesting level of maturity; WUM has currently reached version 7.0.

We presented a survey of the recent developments in the area of Web Usage Mining, based on the more than 150 papers published since 2000 on this topic. Because it was not possible to cite all the papers here, we refer the interested reader to the on-line bibliography provided on the web site of the cInQ project [1].

Acknowledgements

This work has been supported by the consortium on discovering knowledge with Inductive Queries (cInQ) [1], a project funded by the Future and Emerging Technologies arm of the IST Programme (Contr. no. IST-2000-26469). The authors wish to thank Maristella Matera for discussions.

References

[1] consortium on discovering knowledge with Inductive Queries (cInQ). Projectfunded by the European Commission under the Information Society TechnologiesProgramme (1998-2002) Future and Emerging Technologies arm. Contract no.IST-2000-26469. http://www.cinq-project.org. Bibliography on Web UsageMining available at http://www.cinq-project.org/intranet/polimi/. 140,146

[2] Configuration File of W3C httpd, 1995.http://www.w3.org/Daemon/User/Config/. 141

[3] W3C Extended Log File Format, 1996.http://www.w3.org/TR/WD-logfile.html. 141

[4] Accrue, 2003. http://www.accrue.com. 146
[5] Funnel Web Analyzer, 2003. http://www.quest.com. 146
[6] NetIQ WebTrends Log Analyzer, 2003. http://www.netiq.com. 146
[7] Sane NetTracker, 2003. http://www.sane.com/products/NetTracker. 146
[8] WebSideStory HitBox, 2003. http://www.websidestory.com. 146
[9] WUM: A Web Utilization Miner, 2003. http://wum.wiwi.hu-berlin.de. 146


[10] Gediminas Adomavicius and Alexander Tuzhilin. Extending recommender sys-tems: A multidimensional approach. 145

[11] Jesper Andersen, Anders Giversen, Allan H. Jensen, Rune S. Larsen, Tor-ben Bach Pedersen, and Janne Skyt. Analyzing clickstreams using subsessions.In International Workshop on Data Warehousing and OLAP (DOLAP 2000),2000. 143

[12] Corin R. Anderson. A Machine Learning Approach to Web Personalization. PhDthesis, University of Washington, 2002. 142

[13] Suhail Ansari, Ron Kohavi, Llew Mason, and Zijian Zheng. Integrating e-commerce and data mining: Architecture and challenges. In WEBKDD 2000- Web Mining for E-Commerce – Challenges and Opportunities, Second Inter-national Workshop, August 2000. 141

[14] Suhail Ansari, Ron Kohavi, Llew Mason, and Zijian Zheng. Integrating e-commerce and data mining: Architecture and challenges. In Nick Cercone,Tsau Young Lin, and Xindong Wu, editors, Proceedings of the 2001 IEEE In-ternational Conference on Data Mining (ICDM 2001). IEEE Computer Society,2001. 141, 145

[15] A. Banerjee and J. Ghosh. Clickstream clustering using weighted longest com-mon subsequences. In Proceedings of the Web Mining Workshop at the 1st SIAMConference on Data Mining, 2001. 143, 144

[16] Bettina Berendt. Using site semantics to analyze, visualize, and support navi-gation. Data Mining and Knowledge Discovery, 6(1):37–59, 2002. 145

[17] Bettina Berendt, Bamshad Mobasher, Miki Nakagawa, and Myra Spiliopoulou.The impact of site structure and user environment on session reconstruction inweb usage analysis. In Proceedings of the 4th WebKDD 2002 Workshop, at theACM-SIGKDD Conference on Knowledge Discovery in Databases (KDD’2002),2002. 142

[18] Jose Borges. A Data Mining Model to Capture UserWeb Navigation Patterns.PhD thesis, Department of Computer Science University College London, 2000.144

[19] Catherine Bounsaythip and Esa Rinta-Runsala. Overview of data mining forcustomer behavior modeling. Technical Report TTE1-2001-18, VTT InformationTechnology, 2001. 145

[20] Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual Websearch engine. Computer Networks and ISDN Systems, 30(1–7):107–117, 1998.140

[21] S. Shiu C. Wong and S. Pal. Mining fuzzy association rules for web access caseadaptation. In Case-Based Reasoning Research and Development: Proceedings ofthe Fourth International Conference on Case-Based Reasoning, pages ?–?, 2001.143, 145

[22] Lara D. Catledge and James E. Pitkow. Characterizing browsing strategies inthe World-Wide Web. Computer Networks and ISDN Systems, 27(6):1065–1073,1995. 141

[23] Cheng-Yue Chang and Ming-Syan Chen. A new cache replacement algorithm forthe integration of web caching and prefectching. In Proceedings of the eleventhinternational conference on Information and knowledge management, pages 632–634. ACM Press, 2002. 145

[24] Mao Chen, Andrea S. LaPaugh, and Jaswinder Pal Singh. Predicting categoryaccesses for a user in a structured information space. In Proceedings of the 25thannual international ACM SIGIR conference on Research and development ininformation retrieval, pages 65–72, 2002. 142


[25] R. Cooley. Web Usage Mining: Discovery and Application of Interesting Patternsfrom Web Data. PhD thesis, University of Minnesota, 2000. 142

[26] Robert Cooley, Bamshad Mobasher, and Jaideep Srivastava. Data preparationfor mining world wide web browsing patterns. Knowledge and Information Sys-tems, 1(1):5–32, 1999. 142, 143

[27] Boris Diebold and Michael Kaufmann. Usage-based visualization of web locali-ties. In Australian symposium on Information visualisation, pages 159–164, 2001.142

[28] Magdalini Eirinaki and Michalis Vazirgiannis. Web mining for web personaliza-tion. ACM Transactions on Internet Technology (TOIT), 3(1):1–27, 2003. 145,146

[29] Oren Etzioni. The world-wide web: Quagmire or gold mine? Communicationsof the ACM, 39(11):65–68, 1996. 140

[30] Kurt D. Fenstermacher and Mark Ginsburg. Mining client-side activity for per-sonalization. In Fourth IEEE International Workshop on Advanced Issues ofE-Commerce and Web-Based Information Systems (WECWIS’02), pages 205–212, 2002. 141

[31] Yongjian Fu, Mario Creado, and Chunhua Ju. Reorganizing web sites basedon user access patterns. In Proceedings of the tenth international conference onInformation and knowledge management, pages 583–585. ACM Press, 2001. 145

[32] Jiawei Han and Micheline Kamber. Data Mining Concepts and Techniques.Morgan Kaufmann, 2001. 143

[33] Birgit Hay, Geert Wets, and Koen Vanhoof. Clustering navigation patterns ona website using a sequence alignment method. 144

[34] Jeffrey Heer and Ed H. Chi. Mining the structure of user activity using clus-ter stability. In Proceedings of the Workshop on Web Analytics, Second SIAMConference on Data Mining. ACM Press, 2002. 144

[35] John H. Holland. Adaptation in Natural and Artificial Systems. University ofMichigan Press, Ann Arbor, 1975. Republished by the MIT press, 1992. 144

[36] Joshua Zhexue Huang, Michael Ng, Wai-Ki Ching, Joe Ng, and David Che-ung. A cube model and cluster analysis for web access sessions. In R. Kohavi,B. Masand, M. Spiliopoulou, and J. Srivastava, editors, WEBKDD 2001 - MiningWeb Log Data Across All Customers Touch Points, Third International Work-shop, San Francisco, CA, USA, August 26, 2001. Revised Papers, volume 2356of Lecture Notes in Computer Science, pages 48–67. Springer, 2002. 144

[37] Xiangji Huang, Nick Cercone, and Aijun An. Comparison of interestingnessfunctions for learning web usage patterns. In Proceedings of the eleventh inter-national conference on Information and knowledge management, pages 617–620.ACM Press, 2002. 143

[38] Karuna P. Joshi, Anupam Joshi, and Yelena Yesha. On using a warehouse toanalyze web logs. Distributed and Parallel Databases, 13(2):161–180, 2003. 143

[39] Tapan Kamdar. Creating adaptive web servers using incremental web log min-ing. Master’s thesis, Computer Science Department, University of Maryland,Baltimore County, 2001. 145

[40] Kosala and Blockeel. Web mining research: A survey. SIGKDD: SIGKDD Explo-rations: Newsletter of the Special Interest Group (SIG) on Knowledge Discovery& Data Mining, ACM, 2(1):1–15, 2000. 140

[41] Bin Lan, Stephane Bressan, Beng Chin Ooi, and Kian-Lee Tan. Rule-assistedprefetching in web-server caching. In Proceedings of the ninth international con-ference on Information and knowledge management (CIKM 2000), pages 504–511. ACM Press, 2000. 145


[42] Tianyi Li. Web-document prediction and presending using association rule se-quential classifiers. Master’s thesis, Simon Fraser University, 2001. 145

[43] Bamshad Mobasher, Honghua Dai, Tao Luo, and Miki Nakagawa. Effectivepersonalization based on association rule discovery from web usage data. InWeb Information and Data Management, pages 9–15, 2001. 145

[44] Behzad Mortazavi-Asl. Discovering and mining user web-page traversal patterns.Master’s thesis, Simon Fraser University, 2001. 144

[45] Eleni Stroulia Nan Niu and Mohammad El-Ramly. Understanding web usagefor dynamic web-site adaptation: A case study. In Proceedings of the FourthInternational Workshop on Web Site Evolution (WSE’02), pages 53–64. IEEE,2002. 144, 145

[46] Alexandros Nanopoulos, Dimitrios Katsaros, and Yannis Manolopoulos. Ex-ploiting web log mining for web cache enhancement. In R. Kohavi, B. Masand,M. Spiliopoulou, and J. Srivastava, editors, WEBKDD 2001 - Mining Web LogData Across All Customers Touch Points, Third International Workshop, SanFrancisco, CA, USA, August 26, 2001. Revised Papers, volume 2356 of LectureNotes in Computer Science, pages 68–87. Springer, 2002. 143, 145

[47] Alexandros Nanopoulos, Maciej Zakrzewicz, Tadeusz Morzy, and YannisManolopoulos. Indexing web access-logs for pattern queries. In Fourth ACMCIKM International Workshop on Web Information and Data Management(WIDM’02), 2002. 143

[48] O. Nasraoui, F. Gonzalez, and D. Dasgupta. The fuzzy artificial immune system:Motivations, basic concepts, and application to clustering and web profiling. InProceedings of the World Congress on Computational Intelligence (WCCI) andIEEE International Conference on Fuzzy Systems, pages 711–716, 2002. 144

[49] Shigeru Oyanagi, Kazuto Kubota, and Akihiko Nakase. Application of matrixclustering to web log analysis and access prediction. In WEBKDD 2001 - MiningWeb Log Data Across All Customers Touch Points, Third International Work-shop, 2001. 144

[50] Hye-Young Paik, Boualem Benatallah, and Rachid Hamadi. Dynamic restruc-turing of e-catalog communities based on user interaction patterns. World WideWeb, 5(4):325–366, 2002. 145

[51] Jian Pei, Jiawei Han, Behzad Mortazavi-asl, and Hua Zhu. Mining access pat-terns efficiently from web logs. In Pacific-Asia Conference on Knowledge Dis-covery and Data Mining, pages 396–407, 2000. 143

[52] Pilot Software. Web Site Analysis, Going Beyond Traffic Analysis. http://www.marketwave.com/products solutions/hitlist.html, 2002. 141

[53] John R. Punin, Mukkai S. Krishnamoorthy, and Mohammed J. Zaki. Logml:Log markup language for web usage mining. In R. Kohavi, B. Masand,M. Spiliopoulou, and J. Srivastava, editors, WEBKDD 2001 - Mining Web LogData Across All Customers Touch Points, Third International Workshop, SanFrancisco, CA, USA, August 26, 2001. Revised Papers, volume 2356 of LectureNotes in Computer Science, pages 88–112. Springer, 2002. 141

[54] T.B. Pedersen S. Jespersen and J. Thorhauge. A hybrid approach to web usagemining. Technical Report R02-5002, Department of Computer Science AalborgUniversity, 2002. 144

[55] J. Ben Schafer, Joseph A. Konstan, and John Riedl. E-commerce recommenda-tion applications. Data Mining and Knowledge Discovery, 5(1-2):115–153, 2001.145


[56] Cyrus Shahabi and Farnoush Banaei-Kashani. A framework for efficient andanonymous web usage mining based on client-side tracking. In R. Kohavi,B. Masand, M. Spiliopoulou, and J. Srivastava, editors, WEBKDD 2001 - MiningWeb Log Data Across All Customers Touch Points, Third International Work-shop, San Francisco, CA, USA, August 26, 2001. Revised Papers, volume 2356of Lecture Notes in Computer Science, pages 113–144. Springer, 2002. 141, 144

[57] Cyrus Shahabi and Yi-Shin Chen. Improving user profiles for e-commerce bygenetic algorithms. E-Commerce and Intelligent Methods Studies in Fuzzinessand Soft Computing, 105(8), 2002. 144

[58] Jaideep Srivastava, Robert Cooley, Mukund Deshpande, and Pang-Ning Tan.Web usage mining: Discovery and applications of usage patterns from web data.SIGKDD Explorations, 1(2):12–23, 2000. 141, 145, 146

[59] G. Stumme, A. Hotho, and B. Berendt. Usage mining for and on the seman-tic web. In National Science Foundation Workshop on Next Generation DataMining, 2002. 143

[60] Pang-Ning Tan and Vipin Kumar. Modeling of web robot navigational patterns.In WEBKDD 2000 - Web Mining for E-Commerce – Challenges and Opportu-nities, Second International Workshop, August 2000. 142

[61] Pang-Ning Tan and Vipin Kumar. Discovery of web robot sessions based ontheir navigational patterns. Data Mining and Knowledge Discovery, 6(1):9–35,2002. 142

[62] Fergus Toolan and Nicholas Kushmerick. Mining web logs for personalized sitemaps. 145

[63] Debra VanderMeer, Kaushik Dutta, and Anindya Datta. Enabling scalable on-line personalization on the web. In Proceedings of the 2nd ACM E-CommerceConference (EC’00), pages 185–196. ACM Press, 2000. 145

[64] Yi-Hung Wu and Arbee L. P. Chen. Prediction of web page accesses by proxyserver log. World Wide Web, 5(1):67–88, 2002. 145

[65] Yunjuan Xie and Vir V. Phoha. Web user clustering from access log using belieffunction. In Proceedings of the First International Conference on KnowledgeCapture (K-CAP 2001), pages 202–208. ACM Press, 2001. 144

[66] Osmar R. Zaıane. Web usage mining for a better web-based learning environ-ment. In Proceedings of Conference on Advanced Technology for Education, pages450–455, 2001. 145


Parallel Vector Computing Technique for Discovering Communities on the Very Large Scale Web Graph

Kikuko Kawase1, Minoru Kawahara2, Takeshi Iwashita2, Hiroyuki Kawano1, and Masanori Kawazawa2

1 Department of Systems Science, Graduate School of Informatics, Kyoto University
2 Academic Center for Computing and Media Studies, Kyoto University

Abstract. The study of authoritative pages and community discovery from enormous Web content has attracted many researchers. One link-based analysis method, the HITS algorithm, calculates authority scores as the eigenvector of an adjacency matrix created from the Web graph. It was considered impossible to compute this eigenvector for a very large scale Web graph using previous techniques, because the calculation requires an enormous amount of memory. We make it possible using data compression and parallel computation.

1 Introduction

ISC (Internet Software Consortium) [10] reported the existence of over 160 million Web servers constituting the Web (World Wide Web) on the Internet as of July 2002, from which it is easy to guess that there is a huge number of Web pages. Some Web search engines collect many Web pages. For example, Google [8] in the United States has 2,500 million pages as of November 2002, and AlltheWeb.com [1] of Norway has 2,100 million pages for retrieval. Additionally, Openfind [17], a Web search engine in Taiwan, is running a beta test to retrieve over 3,500 million Web pages.

However, it is difficult to find useful Web pages using only standard retrieval techniques such as full-text search, because of the huge number of Web pages and of retrieval results. Many studies have tried to find useful Web pages among the retrieval results [19]. In particular, many studies have used the link structure of the Web to evaluate the importance of Web pages and to find strongly connected Web page communities (Web communities) [9, 12–15].

There is a popular algorithm called the "HITS (Hyperlink-Induced Topic Search) algorithm" [12, 13], which makes use of the link structure of the Web and evaluates the importance of each Web page, and there have been many studies of Web community discovery using this algorithm [3, 6]. For example, PageRank [18], which is used to rank Web pages in the Web search engine Google, is an evaluation method based on the HITS algorithm [8, 11]. In the HITS algorithm, which is described in detail in Section 2, a link between two Web pages in the Web structure is regarded as an edge of a directed graph, and the Web graph can then be denoted as an adjacency matrix. When the Web structure is denoted by an adjacency matrix, the evaluation of the importance of Web pages resolves itself into an eigenvalue problem of the matrix. But there are two problems with this calculation. One is that a huge memory area is required to hold the huge adjacency matrix corresponding to a large-scale set of Web pages. The other is that high calculation performance is required to compute the huge matrix. A simple calculation shows that for n Web pages at least 8×n^2 bytes of memory are required for a square matrix of order n, when each edge holds a double-precision floating-point value occupying 8 bytes. For example, a memory area of 8×10^16 bytes = 80 PB is required to calculate 100 million (= 10^8) Web pages. Thus it is impossible to evaluate such a huge number of Web pages directly.

We note that almost all Web pages hold few links compared with the number of Web pages, so the adjacency matrix becomes a huge sparse matrix. In this paper, we approach high-speed calculation of the importance of a huge number of Web pages through a sparse matrix storing method [7], which compresses the data and stores it in main memory. We also apply the high-speed calculation techniques of vectorization and parallelization on a parallel vector computer.

The paper is organized as follows. Section 2 describes a calculation method for the importance of Web pages. Section 3 describes a compression method for the data and a high-speed calculation method. Section 4 evaluates our proposed method. Finally, Section 5 makes concluding remarks and describes future subjects.

2 Calculation method for the importance of Web pages

2.1 Authority and hub

A graph consists of points (nodes) and lines (edges) which connect two nodes. If Web pages (p1, p2, ...) are regarded as nodes, with the set of nodes defined as V = {p1, p2, ...}, and a directed link (pi, pj) from a Web page pi to a Web page pj is regarded as an edge, with the set of links defined as E, then the Web structure can be denoted by a directed graph G = (V, E). The graph can be translated into an adjacency matrix A by setting the element aij of A to 1 if a directed link (pi, pj) exists and to 0 otherwise, where i denotes a row and j denotes a column.

An authority is a page which is considered to be useful, and a hub is a page which has many hyperlinks to valuable pages. A good hub links to many good authorities, and a good authority is linked from many good hubs. HITS is one of the algorithms based on this idea. Useful pages can be found as follows [5].

Assume that p and q are Web pages; then the authority weight x_p of p and the hub weight y_p of p are defined as follows:

$$x_p = \sum_{q:(q,p)\in E} y_q, \qquad y_p = \sum_{q:(p,q)\in E} x_q,$$

where

$$\sum_p (x_p)^2 = 1, \qquad \sum_p (y_p)^2 = 1.$$

The hub weights and the authority weights are initialized to the same non-negative value, and x and y converge to x* and y* respectively by iterative updating of these weights for each page. It is thought that x* and y* show the usefulness of each Web page. Furthermore, one iteration of the calculation is x ← A^t y, y ← A x using the adjacency matrix A. Suppose the initial value is z; then the authority weight x_k and hub weight y_k after k iterations are expressed as:

$$x_k = (A^t A)^{k-1} A^t z, \qquad y_k = (A A^t)^k z.$$

When x_k and y_k have converged, we obtain x* and y*, where x* is the eigenvector of A^t A and y* is the eigenvector of A A^t.

Although all of the eigenvectors and eigenvalues of A A^t and A^t A can be calculated by the singular value decomposition of A, it is impossible to use that method here, since it requires too much memory for the large-scale Web graph treated in this paper. Meanwhile, the retrieval study [13] says that it can be assumed that there exists only one maximum eigenvalue for a Web graph, and hence only one maximum eigenvalue of A A^t and of A^t A. Therefore, according to this assumption, we calculate the principal eigenvector of A A^t and A^t A using the power method for discovering authorities and hubs in the whole Web.
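For illustration, a minimal dense NumPy sketch of the iteration just described follows, assuming a small adjacency matrix A; it normalizes with the Euclidean norm as in the constraint above and converges to the principal eigenvectors of A^t A and A A^t. It is a toy version, not the memory-efficient method developed in this paper.

import numpy as np

def hits(A, iterations=50):
    """Iterate x <- A^T y, y <- A x with L2 normalization (basic HITS)."""
    n = A.shape[0]
    x = np.ones(n) / np.sqrt(n)   # authority weights
    y = np.ones(n) / np.sqrt(n)   # hub weights
    for _ in range(iterations):
        x = A.T @ y
        y = A @ x
        x /= np.linalg.norm(x)
        y /= np.linalg.norm(y)
    return x, y

# Tiny example: pages 0 and 1 both link to page 2.
A = np.array([[0, 0, 1],
              [0, 0, 1],
              [0, 0, 0]], dtype=float)
authority, hub = hits(A)
print(authority.round(3), hub.round(3))  # page 2 has the highest authority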

2.2 Power Method

The power method is an effective technique for calculating the maximum eigenvalue and the corresponding eigenvector of a matrix. We need only the principal eigenvector, so this method is very suitable.

Suppose matrix A has eigenvalues $\lambda_1, \lambda_2, \cdots, \lambda_n$ ($|\lambda_1| > |\lambda_2| \ge \cdots \ge |\lambda_n|$) and the corresponding eigenvectors are $\xi_1, \xi_2, \cdots, \xi_n$. Any vector $u_0$ can be written as a linear combination of $\xi_1, \xi_2, \cdots, \xi_n$, where $\alpha_1, \alpha_2, \cdots, \alpha_n$ are real numbers:

$$u_0 = \sum_{i=1}^{n} \alpha_i \xi_i.$$

Now calculate the value $A^S u_0$, obtained by multiplying $u_0$ by A S times. Using the above formula and $A\xi_i = \lambda_i \xi_i$,

$$A^S u_0 = A^S \left( \sum_{i=1}^{n} \alpha_i \xi_i \right) = A^{S-1} \left( \sum_{i=1}^{n} \alpha_i \lambda_i \xi_i \right) = \cdots = \sum_{i=1}^{n} \alpha_i \lambda_i^S \xi_i = \lambda_1^S \left[ \alpha_1 \xi_1 + \sum_{i=2}^{n} \alpha_i \left( \frac{\lambda_i}{\lambda_1} \right)^S \xi_i \right].$$

As S gets large enough, the second term of the above expression approaches zero. Thus $A^S u_0 \simeq \alpha_1 \lambda_1^S \xi_1$, so we can obtain the principal eigenvalue $\lambda_1$ from the ratio of $A^{S+1} u_0$ and $A^S u_0$. It is calculated as follows:

$$v_{S+1} = A u_S, \qquad u_{S+1} = \frac{v_{S+1}}{b_{S+1}} \qquad (S = 0, 1, 2, \cdots),$$


where $b_{S+1}$ is the element of maximum absolute value of the vector $v_{S+1}$. Therefore

$$u_{S+1} = \frac{v_{S+1}}{b_{S+1}} = \frac{A u_S}{b_{S+1}} = \frac{A v_S}{b_{S+1} b_S} = \cdots = \frac{A^{S+1} u_0}{b_{S+1} \cdots b_1}.$$

Although some matrices may converge very slowly, the principal eigenvalue for the adjacency matrix of the Web graph converges within a few iterations, because it is much larger than the next eigenvalue [9].

In this paper, the matrix A above corresponds to the product of the adjacency matrix and its transpose. With A now denoting the adjacency matrix, the formula is

$$v_{S+1} = A A^t u_S, \qquad u_{S+1} = \frac{v_{S+1}}{b_{S+1}} \qquad (S = 0, 1, 2, \cdots).$$

There are two ways to perform this calculation. One is to calculate $A^t u_S$ first [Method 1], and the other is to calculate $A A^t$ first [Method 2]. When there are many iterations and the per-iteration calculation cost is small, [Method 2] has merit.
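A sketch of the power iteration above using SciPy's sparse CSR matrices is shown below, assuming the adjacency matrix fits in memory; it follows [Method 1], which keeps A compressed and applies it twice per step, whereas [Method 2] would form A·A^t explicitly, which may no longer be sparse. The tolerance and example graph are arbitrary.

import numpy as np
import scipy.sparse as sp

def principal_eigenvector(A, tol=1e-5, max_iter=1000):
    """Power method on A A^t without forming A A^t ([Method 1])."""
    n = A.shape[0]
    u = np.ones(n)
    for _ in range(max_iter):
        v = A @ (A.T @ u)            # Method 1: A^t u first, then A times the result
        b = v[np.argmax(np.abs(v))]  # element of maximum absolute value
        u_next = v / b
        if np.linalg.norm(u_next - u, np.inf) < tol:
            return u_next, b
        u = u_next
    return u, b

# Tiny example graph stored in CRS (CSR) form.
A = sp.csr_matrix(np.array([[0, 1, 1],
                            [0, 0, 1],
                            [1, 0, 0]], dtype=float))
hub, eigenvalue = principal_eigenvector(A)   # principal eigenvector of A A^t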

2.3 Data compression storing method for large-scale matrix

The adjacency matrix of the Web graph is a sparse matrix with entries 0 and 1, so we use the Compressed Row Storage format (CRS format) [2] for data compression. This format can be used for any matrix structure, and it does not hold any 0 elements. It puts the consecutive nonzero elements of matrix rows in adjacent memory locations. Assuming we have a sparse matrix A, we need three vectors: one for floating-point numbers (val), and the other two for integers (col_ind, row_ptr). Although the val vector holds the values of the nonzero elements of the matrix A as they are traversed in a row-wise fashion, we can omit this vector in this paper because all nonzero elements are 1. The col_ind vector holds the column indexes of the nonzero elements. The row_ptr vector holds the row start positions in the col_ind vector. By convention, we define row_ptr(n+1) = e+1, where e is the number of nonzero elements in the matrix A. Therefore this storing method requires only e + n + 1 storage locations instead of storing n×n elements.

Besides the CRS format, there is a similar data compression method, the Compressed Column Storage (CCS) format. The CCS format is the same as the CRS format except that the columns of A are stored. In other words, the CCS format can be regarded as the CRS format for A^t [2]. In [Method 2], it is required to store the calculation result of A A^t in main memory in addition to the area for A. Although A A^t is a symmetric matrix, and a symmetric matrix is suitable for parallel processing, there is no guarantee that A A^t remains sparse, so a huge memory area may be required.
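As a concrete illustration of the CRS layout (SciPy calls it CSR and names the two integer vectors indices and indptr), the sketch below stores a small 0/1 matrix; as noted above, the val vector could be omitted because every stored entry is 1. The 1-based convention of the text appears as 0-based indices here, and the example matrix is made up.

import numpy as np
import scipy.sparse as sp

dense = np.array([[0, 1, 0, 1],
                  [0, 0, 1, 0],
                  [1, 0, 0, 1]], dtype=float)

A = sp.csr_matrix(dense)
# col_ind in the text corresponds to A.indices, row_ptr to A.indptr (0-based here).
print(A.indices)   # [1 3 2 0 3]  column indexes of the nonzero elements, row by row
print(A.indptr)    # [0 2 3 5]    row i occupies positions indptr[i]:indptr[i+1]
print(A.data)      # [1. 1. 1. 1. 1.]  all ones, so this vector could be omitted

# The CCS (CSC) format of A is the CRS format of A^t:
At_as_ccs = sp.csc_matrix(dense)   # same data, stored column-wise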


Fig. 1. The parallel calculation of b ← A^t u with 4 PEs (r0, r1, r2, r3)

Fig. 2. The parallel calculation of v ← Ab with 4 PEs (r0, r1, r2, r3)

3 Implementation on a distributed memory parallel vector computer

In order to evaluate our method, we implemented it on the parallel vector computer Fujitsu VPP800/63 of the Academic Center for Computing and Media Studies, Kyoto University, which includes 63 PEs (Processing Elements), each with 8 GFLOPS of computing power and 8 GB of memory.

3.1 Parallel processing with MPI

We use MPI (Message Passing Interface) for the implementation. MPI is one of the most general programming interfaces for parallel computing on a distributed memory parallel computer, transmitting and receiving messages between processors.

Here we explain the distributed parallel processing with [Method 1]. [Method 1] does not calculate A A^t, in order to save storage area. In this implementation, all the one-dimensional arrays for the matrix A, the eigenvector, and the work areas are distributed over the main memory. Fig. 1 and Fig. 2 show the distribution of data using 4 processors (r0, r1, r2, r3). Since we keep only the matrix A in CRS format in main memory, A^t is stored in CCS format as in Fig. 1. More specifically, we explain the calculation of

$$v_{S+1} = A (A^t u_S)$$

using 4 processors. First, to calculate b ← A^t u_S of Fig. 1, each processor computes the sum of products of the part of A^t and u_S that it holds. As a result, each processor obtains an n-dimensional vector b_i, and b is obtained by the calculation Σ b_i = b with communication between processors. Then, to calculate v_{S+1} ← A b, each processor communicates its part of the vector b using MPI and computes the sum of products with the matrix A, storing the result as its part of the vector v_{S+1}.

Meanwhile, [Method 2] lends itself easily to parallel processing of A A^t using this scheme of matrix storage on the distributed memory.
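A simplified mpi4py sketch of one [Method 1] step is given below, under the assumption that each processor owns a block of rows of A (and the matching block of u); it sums the partial vectors b_i with Allreduce, which plays the role of the inter-processor communication described above rather than reproducing the exact VPP800 communication pattern. The problem size and random matrix are placeholders.

import numpy as np
import scipy.sparse as sp
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

n = 1000                                   # total number of pages (toy size)
rows = np.array_split(np.arange(n), size)[rank]

# Each processor builds (or loads) only its own block of rows of A.
local_A = sp.random(len(rows), n, density=0.01, format="csr", random_state=rank)

u_local = np.ones(len(rows))               # the block of u owned by this processor

# b <- A^t u : every processor forms a full-length partial vector b_i ...
b_partial = local_A.T @ u_local
b = np.empty(n)
comm.Allreduce(b_partial, b, op=MPI.SUM)   # ... and b = sum_i b_i via communication

# v <- A b : each processor computes its own block of v from the full b.
v_local = local_A @ b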


Fig. 3. The product of matrices compressed in CRS format and CCS format

Fig. 4. Distribution of hub weights and authority weights (x-axis: weight value; y-axis: number of pages)

3.2 Vector calculation

When operations can be carried out independently, high-speed processing of array operations is possible with vector calculation. In order to increase the efficiency of vector calculation over the array col_ind stored in CRS format in [Method 1], the computation proceeds in the order of the values stored in the array row_ptr. That is, each row can be calculated independently, and vectorization becomes possible by computing sequentially from the leftmost element of each row. Even in [Method 2], when calculating the product of a matrix stored in CRS format and a matrix stored in CCS format, vector calculation of the matrix multiplication is possible in compressed form by using two additional vectors.

For example, suppose that the inner product of column 1 of A^t and row c of A is calculated as shown in Fig. 3. The nonzero elements are at columns n1, n2, n3, n4 in row c of A and at rows m1, m2, m3 in column 1 of A^t, and they are stored in CRS format and CCS format respectively. We introduce two pointers, check_A and check_At, which point to positions in the compressed vectors. They are initialized to 1 and point to a_{c,n1} and a_{m1,1}. If m1 is smaller than n1, then check_At is incremented and points to a_{m2,1}. If m1 is larger than n1, then check_A is incremented and points to a_{c,n2}. If m1 is equal to n1, then the product of a_{c,n1} and a_{m1,1} is calculated, and check_A and check_At are incremented to point to a_{c,n2} and a_{m2,1} respectively. The calculation is complete when either pointer runs out of elements. This calculation can be done independently for each different value of c, so vector calculation is possible. Furthermore, it can be carried out without decompressing the data, so the main memory can be used efficiently.
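The pointer-chasing procedure described above is essentially a merge of two sorted index lists. The sketch below computes the inner product of row c of A (CRS) and column j of A (CCS) directly on the compressed arrays; all array and argument names are illustrative, and indices are 0-based.

def row_col_inner_product(col_ind, row_ptr, c, row_ind, col_ptr, j,
                          val_a=None, val_b=None):
    """Inner product of CRS row c and CCS column j by merging sorted index lists."""
    a = row_ptr[c]          # position in col_ind (plays the role of check_A)
    b = col_ptr[j]          # position in row_ind (plays the role of check_At)
    total = 0.0
    while a < row_ptr[c + 1] and b < col_ptr[j + 1]:
        if col_ind[a] < row_ind[b]:
            a += 1                         # advance the row-side pointer
        elif col_ind[a] > row_ind[b]:
            b += 1                         # advance the column-side pointer
        else:                              # matching index: multiply and advance both
            va = 1.0 if val_a is None else val_a[a]   # all-ones case omits val
            vb = 1.0 if val_b is None else val_b[b]
            total += va * vb
            a += 1
            b += 1
    return total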

4 Performance evaluation

As a result of a preliminary experiment using [Method 1] and [Method 2], [Method 1] turned out to be about 100 times better than [Method 2] in terms of speed and memory occupancy, so we use [Method 1] for the evaluation.


Fig. 5. The links of real data and test data (x-axis: number of links; y-axis: number of pages), based on "Graph Structure in the Web" of the IBM Almaden Research Institute, October 1999

Table 1. The relation between pages and links

Number of pages:           1×10^4   1×10^5   1×10^6   1×10^7   1×10^8   1×10^9   2×10^9
Maximum number of links:   68       160      374      873      2,036    4,747    6,125

4.1 Evaluation using actual data

As real data for the performance evaluation, we used the link data of about 15 million Web pages from "jp" domains, collected for the National Institute of Informatics NTCIR-3 Web retrieval tasks [16]. In this link data the average number of links is 7.2, and the maximum number of links is 1,258. In the experiment using 10 PEs of the VPP800, the elapsed time was 550 seconds and the memory occupancy was 757 MB.

Fig. 4 shows the calculated authority weights and hub weights. In Fig. 4 there are some pages whose authority weight and hub weight are almost 1. On inspection, these are Web pages that have many self links. Although this phenomenon in the HITS algorithm was pointed out in past research [3], it can be avoided by adding some improvements to the algorithm [3, 18]. When the method is actually applied to Web link analysis, it is necessary to use the improved algorithm.

4.2 Evaluation using test data

According to "Graph Structure in the Web" of the IBM Almaden Research Institute [4], the average number of links for the 200 million Web pages as of October 1999 was 16.1. We therefore assume an average of 20 links, estimated a little higher. We also assume, based on Fig. 5, that the number of links of the Web page with the most links equals the total number of pages × 10 raised to the power 1/2.72. Table 1 shows the relation between the assumed maximum number of links and the total number of pages.


Fig. 6. The relation between the number of pages and the elapsed time (x-axis: number of pages, from 10 thousand to 100 million; y-axis: time in seconds; curves for 2 PEs and 10 PEs)

Fig. 7. The speed-up effect of the number of PEs (x-axis: number of PEs; y-axis: speed-up factor; curves for 100 thousand, 1 million and 10 million pages)

Fig. 8. The relation between the elapsed time and the convergence condition (curves for 2 PEs with 1 million pages and 10 PEs with 1 million pages)

Fig. 9. The relation between the average number of links and the elapsed time (curves for 2 PEs with 1 million pages and 10 PEs with 10 million pages)

Moreover, considering that the average number of links may increase in the future, we measured the calculation time as the average number of links increases. We generated a Web graph using random numbers so that the average number of links is 20. The convergence condition for the eigenvector was set to 10^-5.

In Fig. 6 we varied the number of pages from 10,000 to 100 million and measured the elapsed time using 2 PEs and 10 PEs. Fig. 6 shows that the elapsed time is proportional to the number of pages raised to the power 1.3.

Fig. 7 shows the speed-up effect of the number of PEs. We evaluated runs with 2, 5, 10, 20 and 40 PEs, measured relative to the time required when calculating with one PE. Fig. 7 shows that the speed-up effect is larger for larger data.

Fig. 8 shows the elapsed time as the convergence criterion of the power method is made stricter. Fig. 8 shows that the elapsed time is roughly proportional to the number of decimal places of the convergence condition.

In Fig. 9 we measured the elapsed time while changing the average number of links among 20, 40, 60, 80 and 100. Even though the average number of links increased 5 times, the elapsed time increased only about 2.8 times in the case of 1 million pages using 2 PEs, and about 1.8 times in the case of 10 million pages using 10 PEs. Moreover, in some cases the elapsed time decreased although the average number of links increased. Therefore, even if the average number of links increases in the future, the required elapsed time does not grow much.

Table 2. The computer resources used

Number of PEs: 40
Number of pages: 2 billion
Elapsed time: 12 hours 2 minutes
Used memory: 280 GB
Communication time: 20 minutes

The variables used by the program in the experiment need 120 B of main memory per Web page. On the current system, since the user area of the main memory is limited to 7 GB per PE, calculation of about 50 million Web pages is possible using 1 PE. We therefore calculated 2 billion Web pages using 40 PEs, which is the maximum number that a user can use. Table 2 shows the elapsed time, the used main memory, and the communication time. In addition, since the elapsed time using real data is longer than the elapsed time using test data at the same accuracy, we compensated the number of power-method iterations for the test data by the number of iterations observed for the real data to obtain the elapsed time and communication time. The communication time is the time spent in message communication by MPI.

In addition, in this paper 32-bit integer variables are used to store the links; as long as the main memory space for storing the matrix data can be secured, it should be possible to treat up to about 4 billion Web pages. Integer variables with a longer bit length are necessary to treat more Web pages than that; however, when 64-bit integers are used, the memory requirement becomes 2×8×n bytes compared with 2×4×n bytes when using 32-bit integers, which is not such a big increase.

5 Conclusion

In this paper, in order to analyze the Web link structure, we proposed a technique for calculating the importance of Web pages by replacing a large-scale Web graph with a huge sparse adjacency matrix and solving the eigenvalue problem in the main memory of the computer using the CRS format. We confirmed that the calculation finishes in realistic time on the currently available computer system, using the vectorization and parallel computing methods. This shows that the analysis of the link structure of Web pages on a world scale is possible by applying the technique of this paper. Although this paper described how to calculate the importance of Web pages using the HITS algorithm, the technique is applicable to other Web graph analysis techniques [3, 18]. It is also possible to develop new algorithms that exploit the large-scale calculation enabled by the sparse matrix storing method for the Web graph and the high-speed calculation techniques shown in this paper.

Although the experiment on actual Web pages in this paper is a small-scale one of about 15 million pages, we think it is necessary to conduct an evaluation on large-scale actual Web pages. Moreover, it is necessary to clarify the relation between the memory and calculation time of various algorithms.

Acknowledgments

A part of this work was supported by grants for Scientific Research (13680482, 14213101, 15017248, 15500065) from the Ministry of Education, Science, Sports and Culture of Japan. A part of this work was supported by a grant from the Mazda Foundation. We would like to thank the NTCIR (NII-NACSIS Test Collection for IR Systems) Project of the National Institute of Informatics in Japan.

References

1. AlltheWeb.com: http://www.alltheweb.com/.
2. Barrett, R., Chan, T., Donato, J., Berry, M. and Demmel, J.: Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, 2nd Edition, SIAM (1994).
3. Bharat, K. and Henzinger, M.: Improved Algorithm for Topic Distillation in a Hyperlinked Environment, Proc. of ACM SIGIR'98, Melbourne, Australia, pp. 104–111 (1998).
4. Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., Tomkins, A. and Wiener, J.: Graph Structure in the Web, Computer Networks, pp. 309–320 (2000).
5. Chakrabarti, S., Dom, B., Kumar, S., Raghavan, P., Rajagopalan, S., Tomkins, A., Gibson, D. and Kleinberg, J.: Mining the Web's Link Structure, Computer, pp. 60–67 (1999).
6. Dean, J. and Henzinger, M.: Finding Related Pages in the World Wide Web, Proc. of the 8th World-Wide Web Conference, Amsterdam, Netherlands (1999).
7. Fuller, L. and Bechtel, R.: Introduction to matrix algebra, Dickenson Pub. Co (1967).
8. Google: http://www.google.com/.
9. Hirokawa, S. and Ikeda, D.: Structural Analysis of Web Graph, Transactions of the Japanese Society for Artificial Intelligence, Vol. 16, No. 4, pp. 525–529 (2001).
10. Internet Software Consortium: http://www.isc.org/.
11. Kazama, K. and Harada, M.: Advanced Web Search Engine Technologies, Transactions of the Japanese Society for Artificial Intelligence, Vol. 16, No. 4, pp. 503–508 (2001).
12. Kleinberg, J.: Authoritative Sources in a Hyperlinked Environment, Journal of the ACM, pp. 604–632 (1999).
13. Kleinberg, J., Kumar, R., Raghavan, P., Rajagopalan, S. and Tomkins, A.: The Web as a Graph: Measurements, Models, and Methods, Computing and Combinatorics, 5th Annual International Conference, COCOON'99, pp. 1–17 (1999).
14. Kumar, R., Raghavan, P., Rajagopalan, S. and Tomkins, A.: Extracting Large-scale Knowledge Bases from the Web, Proceedings of the 25th International Conference on Very Large Databases, Edinburgh, UK, pp. 639–650 (1999).
15. Murata, T.: Discovery of Web Communities Based on Co-occurrence of References, Transactions of the Japanese Society for Artificial Intelligence, Vol. 16, No. 3, pp. 316–323 (2001).
16. NII-NACSIS Test Collection for IR Systems Project: http://research.nii.ac.jp/ ntcadm/index-en.html.
17. Openfind: http://www.openfind.com/.
18. Page, L., Brin, S., Motwani, R. and Winograd, T.: The PageRank Citation Ranking: Bringing Order to the Web, Stanford Digital Library Project, http://dbpubs.stanford.edu/pub/1999-66, No. 1999-66 (1999).
19. Yamada, S., Murata, T. and Kitamura, Y.: Intelligent Web Information System, Transactions of the Japanese Society for Artificial Intelligence, Vol. 16, No. 4, pp. 495–502 (2001).



input : Set RAO of ordinal association rules
RAOS = ∅    % initialization of the set of specific ordinal association rules
for each ordinal association rule of RAO
    Calculation of the contingency table of differences
    Search for m zones Zk
    for each zone Zk
        Search for the rectangle Rk [Pg(X=x(i1), Y=y(j1)), Pd(X=x(i2), Y=y(j2))]
        if Pep ≥ minep
            RAOS = [RAOS ; X=[x(i1)..x(i2)] → Y=[y(j1)..y(j2)] Pep Tc]
        end if    % Pep ≥ minep
    end for    % each zone Zk
end for    % each ordinal association rule of RAO
output : Set RAOS of specific ordinal association rules


input : Set RAOS of specific ordinal association rules
RA = ∅    % initialization of the set of association rules
for each rule of RAOS, where II1 represents the intensity of inclination
    Calculation of the expanded contingency table
    for each line X=x(i)
        for each combination of the line X=x(i)
            Removal of the combination from the contingency table
            Calculation of the intensity of inclination II2
            if II1/II2 ≥ minr
                RA = [RA ; X1=x1(i1), .., Xp=xp(ip) → Y=[y(j1)..y(j2)] Tc]
            end if    % II1/II2 ≥ minr
        end for    % each combination of the line X=x(i)
    end for    % each line X=x(i)
end for    % each specific ordinal association rule of RAOS
output : Set RA of association rules


Rough Set based Decision Tree Model for Classification

Sonajharia Minz 1 and Rajni Jain 1, 2

1 School of Computers and Systems Sciences, Jawaharlal Nehru University, New Delhi, India
[email protected]
2 National Centre for Agricultural Economics and Policy Research, Library Avenue, Pusa, New Delhi, India 110012
[email protected], [email protected]

Abstract. The decision tree, a commonly used classification model, is constructed recursively following a top-down approach (from general concepts to particular examples) by repeatedly splitting the training data set. ID3 is a greedy algorithm that considers one attribute at a time for splitting at a node. In C4.5, all attributes, barring the nominal attributes used at the parent nodes, are retained for further computation. This leads to extra overheads of memory and computational effort. Rough Set theory (RS) simplifies the search for dominant attributes in information systems. In this paper, a Rough set based Decision Tree (RDT) model, combining RS tools with classical DT capabilities, is proposed to address the issue of computational overheads. The experiments compare the performance of RDT with the RS approach and the ID3 algorithm. RDT is observed to be better than the RS approach in accuracy and rule complexity, while RDT and ID3 are comparable.

Keywords: Rough set, supervised learning, decision tree, feature selection, classification, data mining.

1 Introduction

The Decision Tree (DT) has increasingly gained popularity and is a commonly used classification model. Following a top-down approach, a DT is constructed recursively by splitting the given set of examples [7]. In many real-life situations, there are far too many attributes to be handled by learning schemes, and a majority of them are redundant. Reducing the dimension of the data reduces the size of the hypothesis space and allows algorithms to operate faster and more effectively. In some cases, the accuracy of future classification can be improved, while in others the target concept is more compact and easily interpreted [3]. For DT induction, ID3 and its successor C4.5 [7] by Quinlan are widely used. Although both attempt to select attributes appropriately, all attributes, barring the nominal attributes that have been used at the parent nodes, are retained for the further computation involved in the splitting criteria. Irrelevant attributes are not filtered until the example reaches the leaf of the decision tree. This leads to extra overhead in terms of memory and computational effort. Performance can be improved by prior selection of attributes [3]. We propose to use Rough Set theory (RS) [6], introduced by Pawlak in the early 80's, for attribute subset selection. The intent of RS is based on the fact that in real life, while dealing with sets, due to the limited resolution of our perception mechanism we can distinguish only classes of elements rather than individuals. Elements within classes are indistinguishable. RS offers a simplified search for dominant attributes in data sets [6, 11]. This is used in the proposed RS based Decision Tree (RDT) model for classification. RDT combines the merits of both RS and the DT induction algorithm. Thus, it aims to improve the efficiency, simplicity and generalization capability of both base algorithms.

The paper is organized as follows. In section 2, the relevant concepts of rough settheory are illustrated. The proposed RDT model is described in section 3. Section 4formulates the performance evaluation criteria. After discussing experimental resultsin section 5, we conclude with directions for future work in section 6.

2 Rough Set Theory

2.1 Concepts

A brief review of some concepts of RS [6,11], used for mining classification rules arepresented in this section.

2.1.1 Information System and Decision Table. In RS, knowledge is a collection offacts expressed in terms of the values of attributes that describe the objects. Thesefacts are represented in the form of a data table. Entries in a row represent an object.A data table described as above is called an information system.

Formally, an information system S is a 4-tuple, S = (U, Q, V, f) where, U a non-empty, finite set of objects is called the universe; Q a finite set of attributes; V= ∪Vq,∀q ∈ Q and Vq being the domain of attribute q; and f : U × Q →V, f be theinformation function assigning values from the universe U to each of the attributes qfor every object in the set of examples.

A decision table A = (U, C ∪ D), is an information system where Q = (C ∪ D). Cis the set of categorical attributes and D is a set of decision attributes. In RS, thedecision table represents either a full or partial dependency occurring in data.

2.1.2 Indiscernibility Relation. For a subset P ⊆ Q of attributes of an informationsystem S, a relation called indiscernibility relation denoted by IND is defined as,

INDs (P)={ (x, y) ∈ U × U : f(x, a)=f(y, a) ∀ a∈ P}.If (x, y) ∈ INDs(P) then objects x and y are called indiscernible with respect to P. Thesubscript s may be omitted if information system is implied from the context. IND(P)is an equivalence relation that partitions U into equivalence classes, the sets of objectsindiscernible with respect to P. Set of such partitions are denoted by U/IND(P).

2.1.3 Approximation of Sets. Let X ⊆ U be a subset of the universe. A descriptionfor X is desired that can determine the membership status of each object in U withrespect to X. Indiscernibility relation is used for this purpose. If a partition defined byIND(P) partially overlaps with the set X, the objects in such an equivalence class cannot be determined without ambiguity. Consequently, description of such a set X may

173Rough Set Based Decision Tree Model for Classification

Page 187: document

not be possible. Therefore, the description of X is defined in terms of P-lower

approximation (denoted as P ) and P-upper approximation (denoted as P ), for P ⊆ Q

}:)(/{ XYPINDUYXP ⊆∈∪= (1)

}:)(/{ φ≠∩∈∪= XYPINDUYXP (2)

A set X for which XPXP = is called as exact set otherwise it is called rough set

with respect to P.

2.1.4 Dependency of Attributes. RS introduces a measure of dependency of twosubsets of attributes P, R ⊆ Q. The measure is called a degree of dependency of P onR, denoted byγ R(P). It is defined as,

)())(()(

UcardPPOScard

RRP =γ , where XRPPOS

PINDUXR

)(/)(

∈∪= (3)

The set POSR(P), positive region, is the set of all the elements of U that can beuniquely classified into partitions U/IND(P) by R. The coefficient )(PRγ represents

the fraction of the number of objects in the universe which can be properly classified.If P totally depends on R then )(PRγ = 1, else )(PRγ < 1.

2.1.6 Reduct. The minimum set of attributes that preserves the indiscernibilityrelation is called reduct. The relative reduct of the attribute set P, P ⊂ Q, with respectto the dependency )(QPγ is defined as a subset RED(P,Q) ⊆ P such that:

1. )()( P),( QQQPRED γγ = , i.e. relative reduct preserves the degree of inter

attribute dependency,2. For any attribute a ∈ RED(P,Q), )()( P}{),( QQaQPRED γγ <− , i.e. the relative

reduct is a minimal subset with respect to property 1.A single relative reduct can be computed in linear time. Genetic algorithms are alsoused for simultaneous computation of many reducts [1,8,10].

2.1.7 Rule Discovery. Rules can be perceived as data patterns that representrelationships between attribute values. RS theory provides mechanism to generaterules directly from examples [11]. In this paper, rules are produced by reading theattribute values from the reduced decision table.

2.2 Illustrations

Example 2.2.1. Using Table 1 [9] some concepts of the information system asdescribed in 2.1 are:U={X1, X2, X3, X4, X5, X6, X7, X8}Q={Hair, Height, Weight, Lotion, Sunburn}VHair={blonde, brown, red}, VHeight={tall, average, short},VWeight={light, average, heavy}, VLotion={no, yes}f(X1, Hair)=blonde, i.e. value of the attribute Hair for object X1 is blondeFor R={Lotion} ⊆ Q, U/IND(R)={{X1, X4, X5, X6, X7},{X2,X3,X8}}

174 Sonajharia Minz and Rajni Jain

Page 188: document

The lower and upper approximation with reference to R={Lotion} for objects withdecision attribute Sunburn = yes, i.e. X={X1, X4, X5}

Using equation (1) and (2), R X=φ and R X={X1, X4, X5, X6, X7}

Table 1. Sunburn data set

ID Hair Height Weight Lotion Sunburn

X1 blonde average light no yesX2 blonde tall average yes no

X3 brown short average yes no

X4 blonde short average no yesX5 red average heavy no yes

X6 brown tall heavy no no

X7 brown average heavy no noX8 blonde short light yes no

Table 2. Weather data set

ID Outlook Temperature Humidity Wind PlayX1 sunny Hot High weak noX2 sunny Hot High strong noX3 overcast Hot High weak yesX4 rain mild High weak yesX5 rain cool Normal weak yesX6 rain cool Normal strong noX7 overcast cool Normal strong yesX8 sunny mild High weak noX9 sunny cool Normal weak yesX10 rain mild Normal weak yesX11 sunny mild Normal strong yesX12 overcast mild High strong yesX13 overcast Hot Normal weak yesX14 rain mild High strong no

Example 2.2.2. Let P={Sunburn} and R={Lotion} then U/IND(P)={{X1, X4, X5},{X2, X3, X6, X7, X8}}, POSR(P)= φ ∪ {X2, X3, X8}, the dependency of Sunburnon Lotion i.e. γ R(P)=3/8.

Example 2.2.3. For Table 2, two reducts R1 ={Outlook, Temperature, Wind} andR2={Outlook, Humidity, Wind} exist. One sample of the twelve rules generatedusing R1 is ,

If Outlook=sunny and Temperature=mild and Wind=weak then Play=No;On closer observation of the rules generated in the above case there exist scope for

improvement on the number of selectors and the number of rules. This issue isaddressed in the next section.

175Rough Set Based Decision Tree Model for Classification

Page 189: document

3 Rough Decision Trees (RDT) Model

The proposed RDT model combines the RS-based approaches [11] and decision treelearning capabilities [7,9]. The issues related to the greediness of the ID3 algorithmand complexity of rules in RS approach, are addressed by the proposed model. Thearchitecture of RDT model is presented in Fig. 1. The implementation of thearchitecture is presented by algorithm RDT.Algorithm RDT: Rough Decision Tree Induction

1. Input the training data set T1.2. Discretize the numeric or continuous attributes if any, and label the modified

data set as T2.3. Obtain the minimal decision relative reduct of T2, say R.4. Reduce T2 based on reduct R and label reduced data set as T3.5. Apply ID3 algorithm on T3 to induce decision tree.6. If needed, convert decision tree to rules by traversing all the possible paths

from root to each leaf node.The training data is a collection of examples used for supervised learning to

develop the classification model. In step 2, continuous attributes in data set bediscretized. A number of algorithms are available in the literature for discretization[2,3]. Any local discretization algorithm may be used as per the requirement. The twodata sets used in this paper have nominal attributes. The next step involvescomputation of reduct. The reduct distinguishes between examples belonging to

Decision Table

Reduct Computation Algorithm

Reduct

Remove attributes absent in reduct

Reduced Training Data

ID3 algorithm

Decision Tree

Fig. 1. The Architecture of RDT model

176 Sonajharia Minz and Rajni Jain

Page 190: document

different decision classes. The reduct also assists in reducing the training data, whichis finally used for decision tree induction. In this paper, Johnson’s algorithm (asimplemented in Rosetta software [8]) is used for computation of a single reduct.

GA based algorithms have also been reported in the literature for computing thepopulation of reducts [1,10]. This provides flexibility to the data miner for choosingthe desired set of attributes in the induction of the decision tree. For example, thereducts could be ranked in terms of the cost of obtaining the values of the requiredattributes. A reduct score can also be computed based on the cardinality of eachattribute of the reduct. The reduct with the minimum score could be preferred forfurther steps [11].

In step 4, by removing columns of attributes, not present in the reduct, thedimension of the learning examples are reduced. DT learning algorithm is thenapplied to the reduced examples in step 5. In this paper, an implementation of ID3algorithm as proposed by Quinlan is used for inducing decision tree. Step 6 maps thetree to the rules.

Example 3.1 For Table 2 the RDT produces the reduct R={Outlook, Temperature,Wind}. It generates a decision tree (Fig. 2), which is mapped to the followingdecision rules,

1. If Outlook=sunny and Temperature=hot then Play=no;2. If Outlook=sunny and Temperature=mild and Wind =weak then Play =no;3. If Outlook=sunny and Temp=mild and Wind =strong then Play =yes;4. If Outlook=sunny and Temperature=cool then Play =yes;5. If Outlook=overcast then Play =yes;6. If Outlook=rain and Wind = weak then Play =yes;7. If Outlook=rain and Wind=strong then Play =no;

Outlook

sunny overcast rain

Temperature Wind

coolweak strong

Wind

mild

weakstrong

Fig. 2. RDT classifier for Weather data set

hot

yes

yes

yes

yes no

no

no

177Rough Set Based Decision Tree Model for Classification

Page 191: document

It is observed that rules are less in number as well as more generalized compared tothose generated in Section 2. On using GA-based algorithm two reducts (Example2.2.3), R1={Outlook, Temperature, Wind} and R2={Outlook, Humidity, Wind}would provide different trees. R2 generates simpler tree with better accuracy than thatof R1. Issue of relevant reduct selection is not addressed in this paper.

4 Evaluation Criteria for RDT

To evaluate the performance of the RDT model, classification accuracy and rulecomplexity are considered. Using these two measures the behavior and theperformances of the three models namely RDT, ID3 algorithm and RS approach, arecompared.

Classification accuracy is assessed by applying the algorithms to the examplesnot used for rule induction and is measured by the fraction of the examples for whichdecision class is correctly predicted by the model [4]. Fraction of instances for whichno prediction can be made by the model is called uncertainty. Classification errorrefers to fraction of test examples, which are misclassified by the system.

A set of rules induced for classification is called rule-set. A condition of the formattribute=value is called a selector. The size of rule-set may not be appropriatecriterion for evaluation hence total number of selectors in a rule-set is used as ameasure of complexity of the rule-set [4].

5 Results and Discussion

Complexity

0

10

20

30

40

Sunburn Weather

Num. of Rules

0

5

10

Sunburn Weather

RS ID3 RDT

Fig.3. Comparison of RS, ID3, RDT algorithms w.r.t. complexity and number of rules fortraining dataset of Sunburn and Weather

The aim of this paper is to introduce the RDT model. The model has beenimplemented on very small sample data sets. The results from these pilot data sets canneither be used to claim nor disprove the validity and usefulness of the model overother approaches. However, the results could indicate whether or not to explore itfurther for data mining. Subsequent sections 5.1 and 5.2 discuss the results of the

178 Sonajharia Minz and Rajni Jain

Page 192: document

three approaches mentioned in the paper for training and test data.

Num. of Rules (Sunburn)

0

4

8

1 2 3 4 5 Avg.

Num. of Rules (Weather)

0

4

8

12

1 2 3 4 5 Avg.

Complexity (Sunburn)

0

4

8

12

1 2 3 4 5 Avg.

Complexity (Weather)

0

10

20

30

1 2 3 4 5 Avg.

RS ID3 RDT

Fig. 4. Comparison of RS, ID3, RDT algorithms w.r.t. number of rules, complexity fortraining:test::70:30 of Sunburn and Weather

5.1 Training Data

Each of the three algorithms RS, ID3 and RDT were applied to the data setsmentioned in Table 1 and Table 2. It was observed that for each of the data setsaccuracy is 1, thus uncertainty and error are 0. The results regarding number of rulesand the complexity of rule-set are presented in Fig. 3. The size and the complexity ofrule-set induced by RDT model is significantly less as compared to that of RS forboth data sets. On comparing RDT with ID3, it was observed that for Sunburn data setthe complexity of rules induced by the two are equal however, for Weather data set,complexity of rule-set is greater for RDT. This is attributed to the computation of asingle reduct. As mentioned in examples in earlier sections, there are two reductsnamely R1:{Outlook, Temperature, Wind} and R2:{Outlook, Humidity, Wind} butonly a single reduct is used in the model. For a system, if only one reduct is filteredout, it is possible that alternate reduct, if any, could generate less complex rule-setunder RDT. This issue may be addressed by using some measure of reduct fitness orranking of reducts in the reduct population obtained by GA based tools for reductcomputation. However, this is not dealt with in this paper.

179Rough Set Based Decision Tree Model for Classification

Page 193: document

Accuracy (Sunburn)

0

0.5

1

1 2 3 4 5 Avg.

Accuracy (Weather)

0

0.5

1

1 2 3 4 5 Avg.

Error (Sunburn)

0

0.3

0.6

1 2 3 4 5 Avg.

Error (Weather)

0

0.5

1

1 2 3 4 5 Avg.

Uncertainty (Sunburn)

0

0.2

0.4

0.6

1 2 3 4 5 Avg.

Uncertainty (Weather)

0

0.4

0.8

1 2 3 4 5 Avg.

RS ID3 RDT

Fig. 5. Comparison of RS, Id3, RDT algorithms w.r.t accuracy, error, uncertainty forTraining:Test::70:30 of Sunburn and Weather

5.2 Test Data

For each of the three algorithms, the results over five trials were averaged for the twodomains. In each trial, 70% of training examples were selected at random from entiredata set and the remaining were used for testing. The training data, is used forinduction of classification rules or tree while the test data is used for evaluating the

180 Sonajharia Minz and Rajni Jain

Page 194: document

performance of the induced model. Each of the algorithms was implemented on thesame training-test partition. These results are presented in Fig. 4 and Fig. 5. It isobserved that for Sunburn data set, RDT performs better than RS in terms ofcomplexity of rule-set, accuracy and other performance parameters while RDT modelis comparable to ID3. For Weather data set, average accuracy of rules generated fromRDT model has improved over RS as well as ID3 while average complexity of rulesfor RDT is improved over RS approach but is little more than that of ID3.

6 Conclusions

The results from the experiments on the small data sets neither claim nor disprove theusefulness of the model as compared to other approaches. However, they suggest thatRDT can serve as a model for classification as it generates simpler rules and removesirrelevant attributes at a stage prior to tree induction. This facilitates less memoryrequirements for the subsequent steps while executing the model and for classifyingthe test data as well as actual examples. For real data sets, at times number of reducts(may be hundreds) exist and at some other times no reduct may exist. This providespotential for further refinement of RDT model. Availability of many reducts offersscope to generate a tree avoiding evaluation of an attribute that is difficult orimpossible to measure. It also offers options of using low cost decision trees. Problemrelated to absence of reduct for noisy domains or inconsistent data sets can be handledby computing approximate reducts using variable precision RS model [12]. Furtherresearch is being pursued to handle such real time data sets.

References

1. Bjorvand, A. T., Komorowski, J.: Practical Applications of Genetic Algorithms forEfficient Reduct Computation., Wissenschaft & Technik Verlag, 4 (1997) 601-606.

2. Grzymala-Busse, J. W., Stefanowski, Jerzy: Three Discretization Methods for RuleInduction. IJIS, 16(1) (2001) 29-38

3. Hall, Mark A., Holmes, G.: Benchmarking Attribute Selection Techniques for DiscreteClass Data Mining. IEEE TKDE 20 (2002) 1-16

4. Han, J., Kamber, M.: Data Mining: Concepts and Techniques. Morgan Kaufmann, (2001),279-325

5. Murthy, S.K.: Automatic Construction of decision trees from Data: A MultidisciplinarySurvey. Data Mining and Knowledge Discovery 2 (1998) 345-389

6. Pawlak, Z.: Drawing Conclusions from Data-The Rough Set Way. IJIS, 16 (2001) 3-117. Quinlan, J. R.: C4.5: Programs for Machine Learning. Morgan Kauffman (1993)8. Rosetta, Rough set toolkit for analysis of data available at

http://www.idi.ntnu.no/~aleks/rosetta/.9. Winston, P.H.: Artificial Intelligence. Addison-Wesley Third Edition (1992)10. Wroblewski, J.: Finding Minimal Reduct Using Genetic Algorithms. Warsaw University

of Technology- Institute of Computer Science- Reports – 16/95 (1995)11. Ziarko, W.: Discovery through Rough Set Theory. Comm. of ACM, 42(11) (1999) 55-5712. Ziarko, W.: Variable Precision Rough Set Model. Jr. of Computer and System Sciences,

46(1), Feb, (1993) 39-59b

181Rough Set Based Decision Tree Model for Classification

Page 195: document

Inference Based Classifier: E cient Constructionof Decision Trees for Sparse Categorical

Attributes

Shih-Hsiang Lo, Jian-Chih Ou, and Ming-Syan Chen

Department of Electrical Engineering, National Taiwan University,Taipei, Taiwan, ROC

{lodavid,alex}@arbor.ee.ntu.edu.tw, [email protected]

Abstract. Classification is an important problem in data mining andmachine learning, and the decision tree approach has been identified asan e cient means for classification. According to our observation on realdata, the distribution of attributes with respect to information gain isvery sparse because only a few attributes are major discriminating at-tributes where a discriminating attribute is an attribute, by whose valuewe are likely to distinguish one tuple from another. In this paper, we pro-pose an e cient decision tree classifier for categorical attribute of sparsedistribution. In essence, the proposed Inference Based Classifier (abbre-viated as IBC ) can alleviate the “overfitting” problem of conventionaldecision tree classifiers. Also, IBC has the advantage of deciding thesplitting number automatically based on the generated partitions.is empirically compared to 4 5, and K-means based classifiers.The experimental results show that significantly outperforms thecompanion methods in execution e ciency for dataset with categoricalattributes of sparse distribution while attaining approximately the sameclassification accuracies. Consequently, is considered as an accurateand e cient classifier for sparse categorical attributes.

1 Introduction

Classification is an important issue both in data mining and machine learning,with such important techniques as Bayesian classification [3], neural networks[16], genetic algorithms [7] and decision trees [1][12]. Decision tree classifiers havebeen identified as e cient methods for classification. In addition, it was proventhat a decision tree with scale-up and parallel capability is very e cient andsuitable for large training sets [9][17]. Also, decision tree generation algorithmsdo not require additional information than that already contained in the train-ing data [5][12]. Eventually, decision trees earn similar and sometimes betteraccuracy compared to other classification methods [11].Numerous decision tree algorithms have been developed over the years, e.g.,

ID3 [13], C4.5 [12], CART [1], [9], SPRINT [17]. However, even beingcapable of processing both numerical and categorical attributes, most exist-ing decision tree classifiers are not explicitly designed for categorical attributes

Y. Kambayashi, M. Mohania, W. Woß (Eds.): DaWaK 2003, LNCS 2737, pp. 182-191, 2003. c Springer-Verlag Berlin Heidelberg 2003

Page 196: document

fairlow<=307

excellenthigh31…406

fairmed>405

fairmed>404

excellenthigh31...403

excellentlow<=302

fairlow<=301

credit-ratingincomeageID

fairlow<=307

excellenthigh31…406

fairmed>405

fairmed>404

excellenthigh31...403

excellentlow<=302

fairlow<=301

credit-ratingincomeageID

Fig. 1. A small credit-rating dataset

[9][10][17]. Thus, their performance on categorical data is not particularly opti-mized. Consequently, we propose in this paper an e cient decision tree classifierwhich not only considers on the categorical attribute’s characteristics in practi-cal datasets but also alleviates the “overfitting” problem of traditional decisiontree classifiers.According to our observation on real data, the distribution of attributes with

respect to information gain is very sparse because only a few attributes are ma-jor discriminating attributes where a discriminating attribute is an attribute, bywhose value we are likely to distinguish one tuple from another. As shown by ourexperiments without dealing with the descriminating attribute separately, priorworks are not suitable for processing data set with sparse attributes. We callthe attribute corresponding to the target label to classify the target attribute.An attribute which is not a target attribute is called an ordinary attribute. Forexample, the target attribute in Figure 1 is “credit-rateing,” and two ordinaryattributes are “age” and “income.” It is first observed that in many real-lifedatasets, such as customers’ credit-rating data of banks and credit-card compa-nies, medical diagnosis data and document categorization data, the attributesare mostly categorical attributes and the value of an attribute usually impliesone target class. An inference class is defined as the target class to which the ma-jority of an attribute value belongs. For example in Figure 1, the target classesrefer to the values of attribute “credit-rating.” The value “ = 30” of attribute“age” has the inference class “fair.” Then, note that after mapping each ordinaryattribute value to its influence class, it would be better to divide the ordinaryattributes according to their inference classes instead of their original values be-fore proceeding to perform the goodness function, e.g., GINI index, computationfor node splitting. As will be shown later, by doing so the execution e ciency isimproved and the overfitting problem can be alleviated. A detailed example willbe given in Section 3.1 later. Note that the number of in real-lifedatasets is usually less than that of an attribute. For the example in Figure 1,the number of , 2, is less than that of attribute “age,” 3 and alsothat of attribute “income,” 3. This fact is indeed instrumental to the e cientexecution of the proposed algorithm as will be validated by our experimentalstudies.

183Inference Based Classifier

Page 197: document

Explicitly, the decision tree classifier, Inference Based Classifier (abbrevi-ated as IBC ) proposed, is a two phases decision tree classifier for categoricalattributes. In the first phase, IBC partitions each attribute’s values accordingto their inference classes. By partitioning the attribute’s value based on its in-ference class, IBC can identify the major discriminating attribute from sparsecategorical attributes and also alleviate the overfitting problem which most con-ventional decision tree classifiers su er. Recall that some “overfitting” problemmight be induced by small data in the training dataset which do not appearin the testing dataset at all. This phenomenon can be alleviated by using theinference class to do the classification. In the second phase of , by evalu-ating the goodness function for each attribute, selects the best splittingattribute from all attributes as the splitting node of a decision tree. In additionto alleviating the overfitting problem, has the advantage of deciding thesplitting number automatically based on the generated partitions, since unlikeprior work [4] no additional clustering on the attributes is needed to determinethe splitting by IBC. In the experimental study, we compare with C4.5,

and K-mean based classifiers for categorical attributes. The experimentalresults show that significantly outperforms companion schemes significantlyin execution e ciency while attaining approximately the same classification ac-curacy. Consequently, is considered as a practical and e cient categoricalattribute classifier.The remainder of the paper is organized as follows. Preliminaries are given in

Section 2. In Section 3, we present the new developed decision tree for categoricalattributes. In Section 4, we conduct the experiments to access the performanceof . Finally, we give a conclusion in Section 5.

2 Preliminaries

Suppose is an attribute and { 1, 2 , } are possible values of attribute. The domain of the target classes is represented by domain( )= { 1, 2 ,

| ( )|}. The inference class for a value of attribute , denoted by , isthe target class to which most tuples with their attribute = imply. Explicitly,use ( ) to denote the number of tuples which imply to and have a valueof in their attribute . Then, we have

( ) = max( ){ ( )} (1)

The inference class for each value of attribute can hence be obtained.For the example profile in Figure 1, if is “age” with value “ =30,” thendomain( )= {fair, excellent}, and ( =30, fair)=2, and ( =30, excellent)=1.“fair” is therefore the inference class for the value “ =30” of the attribute“age.”

Definition 1: The unique target class to which most tuples with their attribute= imply is called the for a value of attribute .

184 Shih-Hsiang Lo et al.

Page 198: document

If the target class to which most tuples with their attribute = imply isnot unique, we say attribute value is associated with a neutral class. Also, wecall that value is a neural attribute value. As will be seen later, by replacing theoriginal attribute value with its inference class in performing the node-splitting,IBC is able to build the decision tree very e ciently without comprising thequality of classification.

3 Inference Based Classifier

In essence, is a decision tree classifier that refines the splitting criteria forcategorical attributes in the building phase in order to reveal the major discrim-inating attribute from sparse attributes. Also, IBC can alleviate the overfittingproblem and improve the overall execution e ciency. Note that informationgain and GINI index are common measurements for selecting the best splitnode. Without loss of generality, we adopt information gain as a measurementto identify the sparsity of attributes and GINI index as the measurement fornode splitting criterion.

3.1 Algorithm of IBC

As described earlier, is divided into two major phases, i.e., partitioningvalues of an attribute according to their inference class, to be presented in Section3.2, and selecting the best splitting attribute with the lowest index valuefrom these attributes, to be presented in Section 3.3.

3.2 Inference Class Identification Phase

Algorithm : Inference Based Classifier: MakeDecisionTree(Training Data )

1. BuildNode( )

:EvaluateSplit(Data )

1. begin for each attribute in2. begin for value in do3. if is a neural attribute value then4. Categorize to the partition of NEURAL5. else6. Categorize to the partition of ’s inference class7. end8. if there are two or more partitions then9. Compute the gini index for these partitions10. else11. Return no gini index12. end

185Inference Based Classifier

Page 199: document

In the Initial Phase of IBC, the training dataset is input for the tree building.Before the evaluation for the best split, the first phase of IBC, Inference ClassIdentification Phase, scans each attribute in data from Step 3 to Step 12. FromStep 2 to Step 7, IBC first assigns an inference class to each attribute value andgroups the attribute’s values into partitions according to their inference classes.If there are two or more partitions, the index is calculated and returnedin Step 9. Otherwise, IBC returns no index in Step 11.

No Outlook Temperature Humidity Windy Class

1 Sunny Hot High False N

2 Sunny Hot High True N

3 Overcast Hot High False P

4 Rain Mild High False P

5 Rain Cool Normal False P

6 Rain Cool Normal True N

7 Overcast Cool Normal True P

8 Sunny Mild High False N

9 Sunny Cool Normal False P

10 Rain Mild Normal False P

11 Sunny Mild Normal True P

12 Overcast Mild High True P

13 Overcast Hot Normal False P

14 Rain Mild High True N

Table 1. A training dataset

For illustrative purposes, the following example uses the training data in Ta-ble 1 which has four categorical attributes, i.e., Outlook, Temperature, Humidityand Windy, and one target attribute, i.e., Class. For each node of a decision tree,

first counts the target class occurrences of each attribute value and assignsan to each attribute value. Then, partitions the valuesof an attribute into groups according to their inference classes and calculatesthe index for each attributes. Finally, selects the attribute with thelowest index as the splitting attribute for the decision tree node. For at-tribute Outlook in Table 1, evaluates the counts of target classes for eachattribute value shown in Table 2. Next, partitions these attributes intotwo groups according to their inference classes and calculates the corresponding

index of attribute Outlook. For other attributes, Temperature, Humidityand Windy, the counts of target classes and inference class of attribute valuesare listed in Table 3, 4 and 5 accordingly.

= (9

14)(1 (

2

9)2 (

7

9)2) + (

5

14)(1 (

3

5)2 (

2

5)2) = 0 3937 (2)

For other attributes, Temperature, Humidity and Windy, the counts of targetclasses and inference class of attribute values are listed in Table 3, 4 and 5

186 Shih-Hsiang Lo et al.

Page 200: document

accordingly. Note that the attribute value, Hot, of attribute, Temperature, is aneural attribute value whose is NEURAL. Further, theindexes of these attributes are calculated as follows.

= (4

14)(1 (

2

4)2 (

2

4)2)+(

10

14)(1 (

3

10)2 (

7

10)2) = 0 4429 (3)

= (7

14)(1 (

4

7)2 (

3

7)2) + (

7

14)(1 (

1

7)2 (

6

7)2) = 0 3673 (4)

= (6

14)(1 (

3

6)2 (

3

6)2) + (

8

14)(1 (

2

8)2 (

6

8)2) = 0 4286 (5)

min = min( ) =(6)

attribute class countsvalues N PSunny 3 2 NOvercast 0 4 PRain 2 3 P

Table 2 Attribute Outlook

attribute class countsvalues N PHot 2 2 NEURALMild 2 4 PCool 1 3 P

Table 3 Attribute Temperature

attribute class counts

values N P

High 4 3 N

Normal 1 6 P

Table 4 Attribute Humidity

attribute class counts

values N P

True 3 3 NEURAL

False 2 6 P

Table 5 Attribute Windy

3.3 Node Split Phase

:BuildNode(Data )

1. If (all tuples in are in the same class) then return2. Call EvaluateSplit( ) function to evaluate splits for each attribute in3. Use the best split found to partition into 1 where is the numberof partitions4. begin for each partition in where =1 to5. BuildNode( )

187Inference Based Classifier

Page 201: document

Humidity

Outlook

WindyN(1) P(1)

P(1) N(1)

P(0.86)

NormalHigh

Sunny Overcast

False True

Rain

Fig. 2. The decision tree for Table 1 by IBC

6. end

The second phase of , i.e., , checks if all tuples inare in the same target class in Step 1. If all tuples are in the same class,does not split these tuples and returns the target class. Otherwise, usesattributes’ evaluation values returned by first phase to select the best splittingattribute in and partitions into subpartitions in Step 2 and Step 3. Then,the child nodes of decision trees grow accordingly from Step 4 to Step 6.For the example in Section 3.1.1, the Node Split Phase of chooses the

attribute Humidity with the lowest index value as the best splitting at-tribute for the decision tree node. Then, IBC partitions Table 1 into two sub-tables which consist of one table where the value of attribute Humidity is Highand the other one where the value of attribute Humidity is Normal. Following asimilar procedure of for these subtables, the whole decision tree is built asdepicted in Figure 2 where the purity is also examined in each leaf.

4 Performance Studies

To assess the performance of algorithms proposed, we perform several experi-ments on a computer with a clock rate of 700 MHz and 256 MB of mainmemory. The characteristics of the real-life datasets are described in Section 4.1.Experimental studies on and others schemes are conducted in Section 4.2.Results on execution e ciency are presented in Section 4.3.

4.1 Real-life Datasets

We experimented with three real-life datasets from the UCI Machine LearningRepository. These datasets are used by the machine learning community for theempirical analysis of machine learning algorithms. We use a small portion of data

188 Shih-Hsiang Lo et al.

Page 202: document

as the training dataset and the rest of the data is used as the testing dataset.Note that the attributes in these selected data belongs to categorical attributes.In addition, we calculate the information gain of categorical attributes for alldatasets. Based these information gain, we further obtain the mean, varianceand standard deviation in order to understand the distribution of categoricalattributes among datasets. Table 6 lists the characteristics of our training andtesting datasets which are sorted by their standard variance with respect toinformation gain of attributes. Note that Credit card, Breast-cancer and Liver-discorders are considered as the data sets with sparse attributes and others arenot.

Data set CreditCard BreastCancer LiverDisorders Heart Mushroom SoyBean

No. of Attributes 11 9 6 13 22 35

No. of Classes 2 2 2 2 2 19

No. of Training set 8745 500 230 180 5416 450

No. of Testing set 4373 199 115 90 2708 233

Info. Gain 1.83 4.61 1.34 2.72 4.61 27.97

Mean 0.1661 0.5122 0.2226 0.2091 0.2094 0.7992

Variance 0.0163 0.0169 0.0186 0.0329 0.0467 0.1078

Stand. Deviation 0.1278 0.1301 0.1364 0.1813 0.2160 0.3282

Table 6 The characteristics of data sets

4.2 Experiment One: Classification Accuracy

In this experiment, we compare accuracy results with tree pruning and withouttree pruning where the MDL pruning technique was applied in the tree pruningphase. From the results shown in Table 7 without tree pruning, is a clearwinner and has the highest accuracy in 4 cases because IBC is designed for dataset with sparse categorical attributes. In Table 8, the overall accuracies wereimproved by applying tree pruning to alleviate overfitting problem. However,the accuracies were a little lower than others after doing tree pruning. Thereason is that considers and alleviates the overfitting problem in the treebuilding phase whereas others do not. So, other methods require tree pruningtechniques to alleviate the overfitting problem and improve the accuracy.

Data set CreditCard BreastCancer LiverDisorders Heart Mushroom SoyBean

0.8218 0.9748 0.6522 0.4667 0.9435 0.9055

0.7567 0.9296 0.5621 0.5146 0.9213 0.8645

0.5988 0.9447 0.5913 0.6633 0.9435 0.8927

4 5 0.7070 0.92017 0.6184 0.7713 1.00 0.8841

Table 7 Accuracy comparison on real-life data sets(without tree pruning)

189Inference Based Classifier

Page 203: document

Data set CreditCard BreastCancer LiverDisorders Heart Mushroom SoyBean

0.8278 0.9397 0.6434 0.5778 0.9321 0.8969

0.7793 0.9246 0.5478 0.5333 0.9439 0.8841

0.7001 0.9347 0.6434 0.6667 0.9468 0.9098

4 5 0.7390 0.9327 0.6184 0.7713 1.00 0.9013

Table 8 Accuracy comparison on real-life data sets(with tree pruning)

4.3 Experiment Two: Execution Time in Scale-Up Experiments fordata set of sparse categorical attributes

Because SLIQ was shown to outperform C4.5 in [9], so we only compare SLIQ,K-mean based and IBC in scale-up experiments. Before scale-up experiments, webriefly explain the complexity of three methods. In general case, the complexityof SLIQ is O( 2)[9], the complexity of K-mean based is O( ( )2)[8]and the complexity of IBC is O( ) where is the number of attributes andis the size of data set.For scale-up experiments, we selected the credit-card dataset and divided

it into di erent sizes of training set to show execution e ciency for data setof sparse categorical attributes. The size of the dataset increases from 1,000to 12,000. The results in scale-up experiments are shown in Fig. 3 and Fig. 4.In Fig. 3, the value in -axis corresponds to the ratio of the execution time of

to that of (presented in log scale), showing outperformssignificantly in execution e ciency. In addition, the is approximately twicefaster than the K-mean based algorithm as the size of datasets increases in Fig.4. These results in fact agree with their time complexities pointed out earlier.Consequently, the experimental results indicate that IBC is an e cient decisiontree classifier with good classification quality for sparse categorical attributes.

0

0.5

1

1.5

2

2.5

1 2 3 4 5 6 7 8 9 10 11 12

Dataset size (K)

Ex

ecu

tion

Tim

e R

atio

in L

og

IBC

SLIQ

Fig. 2: SLIQ and IBC

0

2

4

6

8

10

12

1 2 3 4 5 6 7 8 9 10 11 12

Dataset size (K)

Tim

e (s

eco

nd

)

IBC

K-mean based

Fig. 3: K-mean based and IBC

5 Conclusion

According to our observation on real data, the distribution of attributes withrespect to information gain was very sparse because only a few attributes are

190 Shih-Hsiang Lo et al.

Page 204: document

major discriminating attributes where a discriminating attribute is an attribute,by whose value we are likely to distinguish one tuple from another. In this paper,we proposed an e cient decision tree classifier for categorical attribute of sparsedistribution. In essence, the proposed Inference Based Classifier can alleviatethe “overfitting” problem of conventional decision tree classifiers. Also, IBC

had the advantage of deciding the splitting number automatically based on thegenerated partitions. was empirically compared to 4 5, and K-means based classifiers. The experimental results showed that significantlyoutperformed the companion methods in execution e ciency for dataset withcategorical attributes of sparse distribution while attaining approximately thesame classification accuracies. Consequently, was considered as an accurateand e cient classifier for sparse categorical attributes.

References

1. L. Breiman, J. H. Friedman, R.A. Olshen, and C.J. Sotne. Classification andRegression Trees. Wadsworth, Belmont, 1984.

2. NASA Ames Research Center. Introduction to IND Version 2.1. GA23-2475-02edition, 1992.

3. P. Chesseman, J. Kelly, M. Self, and et al. AutoClass: A Bayesian classificationsystem. In 5th Int’l Conf. on Machine Learning. Morgan Kaufman, 1988.

4. P. A. Chou. Optimal Partitioning for Classification and Resgression Trees. IEEETransactions on Pattern Analysis and Machine Intelligence, Vol 13, No 4, 1991.

5. U. Fayyad. On the Induction of Decision Trees for multiple Concept Learning.PhD thesis, The University of Michigan, Ann arbor, 1991.

6. U. Fayyad and K. B. Irani. Multi-interval discretization of continuous-valued at-tributes for classification learning. In Proc. of the 13th International Joint Con-ference on Artificial Intelligence, 1993.

7. D. E. Goldberg. Genetic Algorithms in Search, Optimization and Machine Learn-ing. Morgan Kaufmann, 1989.

8. J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kauf-mann Publishers, 2000.

9. M. Mehta, R. Agrawal, and J. Rissanen. SLIQ: A fast scalable classifier for datamining. In EDBT 96, Avignonm, France, 1996.

10. M. Mehta, J. Rissanen, and R. Agrawal. MDL-based decision tree pruning. InInt’l Conference on Knowledge Discovery in Databases and Data Mining, 1995.

11. D. Michie, D.J. Spiegelhalter, and C.C. Taylor. Machine Learning, Neural andStatistical Classification. Ellis Horwood, 1994.

12. J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.13. J.R. Quinlan. Induction of decision trees. Machine Learning, 1986.14. J.R. Quinlan and R. L. Rivest. Inferring decision trees using minimum description

length principle. Information and Computtation, 1989.15. R. Rastogi and K. Shim. PUBLIC: A Decision Tree Classifier that Integrates

Building and Pruning. Proceedings of 24rd International Conference on Very LargeData Bases, August 24-27, 1998, New York City, New York, USA, 1998.

16. B. D. Ripley. Pattern Recognition and Neural Networks. Cambridge UniversityPress, Cambridge, 1996.

17. J. Shafer, R. Agrawal, and M. Mehta. SPRINT: A scalable parallel classifier fordata mining. In Proc. of the VLDB Conference, Bombay, India, 1996.

191Inference Based Classifier

Page 205: document

Generating Effective Classifiers with Supervised Learning of Genetic Programming

Been-Chian Chien1, Jui-Hsiang Yang1, and Wen-Yang Lin2

1 Institute of Information Engineering I-Shou University

1, Section 1, Hsueh-Cheng Rd., Ta-Hsu Hsiang, Kaohsiung County Taiwan, 840, R.O.C.

{cbc, m9003012}@isu.edu.tw 2 Department of Information Management

I-Shou University 1, Section 1, Hsueh-Cheng Rd., Ta-Hsu Hsiang, Kaohsiung County

Taiwan, 840, R.O.C. [email protected]

Abstract. A new approach of learning classifiers using genetic programming has been developed recently. Most of the previous researches generate classification rules to classify data. However, the generation of rules is time consuming and the recognition accuracy is limited. In this paper, an approach of learning classification functions by genetic programming is proposed for classification. Since a classification function deals with numerical attributes only, the proposed scheme first transforms the nominal data into numerical values by rough membership functions. Then, the learning technique of genetic programming is used to generate classification functions. For the purpose of improving the accuracy of classification, we proposed an adaptive interval fitness function. Combining the learned classification functions with training samples, an effective classification method is presented. Numbers of data sets selected from UCI Machine Learning repository are used to show the effectiveness of the proposed method and compare with other classifiers.

1 Introduction

Classification is one of the important tasks in machine learning. A classification problem is a supervised learning that is given a data set with pre-defined classes referred as training samples, then the classification rules, decision trees, or mathematical functions are learned from the training samples to classify future data with unknown class. Owing to the versatility of human activities and unpredictability of data, such a mission is a challenge. For solving classification problem, many different methods have been proposed. Most of the previous classification methods are based on mathematical models or theories. For example, the probability-based classification methods are built on Bayesian decision theory [5][7]. The Bayesian network is one of the important classification methods based on statistical model.

Y. Kambayashi, M. Mohania, W. Woß (Eds.): DaWaK 2003, LNCS 2737, pp. 192-201, 2003. c Springer-Verlag Berlin Heidelberg 2003

Page 206: document

Many improvements of Naïve Bayes like NBTree [7] and SNNB [16] also provide good classification results. Another well-known approach is neural network [17]. In the approach of neural network, a multi-layered network with m inputs and n outputs is trained with a given training set. We give an input vector to the network, and an n-dimensional output vector is obtained from the outputs of the network. Then the given vector is assigned to the class with the maximum output. The other type of classification approach uses the decision tree, such as ID3 and C4.5 [14]. A decision tree is a flow-chart-like tree structure, which each internal node denotes a decision on an attribute, each branch represents an outcome of the decision and leaf nodes represent classes. Generally, a classification problem can be represented in a decision tree clearly.

Recently, some modern computational techniques start to be applied by few researchers to develop new classifiers. As an example, CBA [10] employs data mining techniques to develop a hybrid rule-based classification approach by integrating classification rules mining with association rules mining. The evolutionary computation is the other one interesting technique. The most common techniques of evolutionary computing are the genetic algorithm and the genetic programming [4][6]. For solving a classification problem, the genetic algorithm first encodes a random set of classification rules to a sequence of bit strings. Then the bit strings will be replaced by new bit strings after applying the evolution operators such as reproduction, crossover and mutation. After a number of generations are evolved, the bit strings with good fitness will be generated. Thus a set of effective classification rules can be obtained from the final set of bit strings satisfying the fitness function. For the genetic programming, either classification rules [4] or classification functions [6] can be learned to accomplish the task of classification. The main advantage of classifying by functions instead of rules is concise and efficient, because computation of functions is easier than rules induction.

The technique of genetic programming (GP) was proposed by Koza [8][9] in 1987. The genetic programming has been applied to several applications like symbolic regression, the robot control programs, and classification, etc. Genetic programming can discover underlying data relationships and presents these relationships by expressions. The algorithm of genetic programming begins with a population that is a set of randomly created individuals. Each individual represents a potential solution that is represented as a binary tree. Each binary tree is constructed by all possible compositions of the sets of functions and terminals. A fitness value of each tree is calculated by a suitable fitness function. According to the fitness value, a set of individuals having better fitness will be selected. These individuals are used to generate new population in next generation with genetic operators. Genetic operators generally include reproduction, crossover and mutation [8]. After the evolution of a number of generations, we can obtain an individual with good fitness value.

The previous researches on classification using genetic programming have shown the feasibility of learning classification functions by designing an accuracy-based fitness function [6][11] and special evolution operations [2]. However, there are two main disadvantages in the previous work. First, only numerical attributes can be calculated in classification functions. It is difficult for genetic programming to handle the cases with nominal attributes containing categorical data. The second

193Generating Effective Classifiers with Supervised Learning of Genetic Programming

Page 207: document

drawback is that classification functions may conflict one another. The result of conflicting will decrease the accuracy of classification. In this paper, we propose a new learning scheme that defines a rough attribute membership function to solve the problem of nominal attributes and gives a new fitness function for genetic programming to generate a function-based classifier. Fifteen data sets are selected from UCI data repository to show the performance of the proposed scheme and compare the results with other approaches.

This paper is organized as follows: Section 2 introduces the concepts of rough set theory and rough membership functions. In Section 3, we discuss the proposed learning algorithm based on rough attribute membership and genetic programming. Section 4 presents the classification algorithm. The experimental results is shown and compared with other classifiers in Section 5. Conclusions are made in Section 6.

2 Rough Membership Functions

Rough set introduced by Pawlak [12] is a powerful tool for the identification of common attributes in data sets. The mathematical foundation of rough set theory is based on the set approximation of partition space on sets. The rough sets theory has been successfully applied to knowledge discovery in databases. This theory provides a powerful foundation to reveal and discover important structures in data and to classify complex objects. An attribute-oriented rough sets technique can reduce the computational complexity of learning processes and eliminate the unimportant or irrelevant attributes so that the knowledge can be learned from large databases efficiently.

The idea of rough sets is based on the establishment of equivalence classes on the given data set S and supports two approximations called lower approximation and upper approximation. The lower approximation of a concept X contains the equivalence classes that are certain to belong to X without ambiguity. The upper approximation of a concept X contains the equivalence classes that cannot be described as not belonging to X.

A vague concept description can contain boundary-line objects from the universe, which cannot be with absolute certainty classified as satisfying the description of a concept. Such uncertainty is related to the idea of membership of an element to a concept X. We use the following definitions to describe the membership of a concept X on a specified set of attributes B [13]. Definition 1: Let U = (S, A) be an information system where S is a non-empty, finite set of objects and A is a non-empty, finite set of attributes. For each B A, a A, there is an equivalence relation EA(B) such that EA(B) = {(x, x ) S2 a B, a(x) = a(x )}. If (x,x ) EA(B), we say that objects x and x are indiscernible. Definition 2: apr = (S, E), is called an approximation space. The object x S belongs to one and only one equivalence class. Let

[x]B = { y | x EA(B) y,x, y S }, [S]B = {[x]B | x S }. The notation [x]B denotes equivalence classes of EA(B) and [S]B denotes the set of

all equivalence classes [x]B for x S.

194 Been-Chian Chien et al.

Page 208: document

Definition 3: For a given concept X S, a rough attribute membership function of X on the set of attributes B is defined as

|][|

|][|)(

B

BXB x

Xxxμ ,

where |[x]B| denotes the cardinality of equivalence classes of [x]B and |[x]B X| denotes the cardinality of the set [x]B X. The value of )(xX

B is in the range of [0, 1].

3 The Learning Algorithm of Classification Functions

3.1 The Classification Functions

Consider a given data set S has n attributes A1, A2, …, An. Let Di be the domain of Ai, Di R for 1 i n and A = {A1, A2,…, An}. For each data xj in S, xj = (vj1, vj2, ... , vjn), where vjt Dt stands for the value of attribute At in data xj. Let C = {C1, C2, …, CK} be the set of K predefined classes. We say that <xj, cj> is a sample if the data xj belongs to class cj, cj C. We define a training set (TS) to be a set of samples, TS = {<xj, cj> | xj S , cj C, 1 j m}, where m = |TS| is the number of samples in TS. Let mi be the number of samples belonging to the class Ci, we have m = m1 + m2 + + mK, 1 i K. A classification function for the class Ci is defined as fi : R

n R, where fi is a function that can determine whether a data xj belongs to the class Ci or not. Since the image of a classification function is in real number, we can decide if a data xj belongs to the specified class by means of a specific range where xj is mapped. If we find the set of functions that can recognize all K classes, a classifier is constructed. We define a classifier F for the set of K predefined classes C as follows,

F = { fi | fi : Rn R, 1 i K}.

3.2 The Transformation of Rough Attributes Membership

The classification function defined in Section 3.1 has a limitation on attributes: Since the calculation of functions accepts only numerical values, classification functions cannot work if datasets contained nominal attributes. In order to apply the genetic programming to be able to train all data sets, we make use of rough attribute membership as the definitions in Section 2 to transform the nominal attributes into a set of numerical attributes.

Let A = {A1, A2, …, An}, for a data set S has the set of attributes A containing n attributes and Di is a domain of Ai, a data xj S, xj = (vj1, vj2, ... , vjn), vji Di. If Di R, Ai is a numerical attribute, we have Ãi = Ai, let wjk be the value of xj in Ãi, wjk = vji. If Ai is a nominal attribute, we assume that S is partitioned into pi equivalence classes by attribute Ai. Let [sl]Ai denote the l-th equivalence class partitioned by attribute Ai, pi is the number of equivalence classes partitioned by the attribute Ai. Thus, we have

.|][|,][][1

i

i

ii Ai

p

lAlA Sp wheresS

195Generating Effective Classifiers with Supervised Learning of Genetic Programming

Page 209: document

We transform the original nominal attribute Ai into K numerical attributes Ãi. Let Ãi = (Ai1, Ai2, ... , AiK), where K is the number of predefined classes C as defined in Section 3.1 and the domains of Ai1, Ai2, ... , AiK are in [0, 1]. The method of transformation is as follows: For a data xj S, xj = (vj1, vj2, ... , vjn), if vji Di and Ai is a nominal attribute, the vji will be transformed into (wjk, wj(k+1), ... , wj(k+K-1) ), wik [0, 1] and wjk = )(1

jCA x

i, wj(k+1) = )(2

jCA x

i, ... , wj(k+K-1)= )( j

CA xK

i, where

|][|

|][][|)(

i

kik

iAl

CjAl

jCA s

xsxμ , if xji [sl]Ai.

After the transformation, we get the new data set S with attributes à = {Ã1, Ã2, ... , Ãn} and a data xj S, xj = (vj1, vj2, ... , vjn) will be transformed into yj, yj = (wj1, wj2, …, wjn ), where n = (n - q) + qK, q is the number of nominal attributes in A. Thus, the new training set becomes TS = {<yj, cj> | yj S , cj C, 1 j m}.

3.3 The Adaptive Interval Fitness Function

The fitness function is important for genetic programming to generate effective solutions. As descriptions in Section 3.1, a classification function fi for the class Ci should be able to distinguish positive instances from a set of data by mapping the data yj to a specific range. To find such a function, we define an adaptive interval fitness function. The mapping interval for positive instances is defined in the following.

Let <yj, cj> be an positive instance of the training set TS for the class cj = Ci. We urge the values of fi(yj) for positive instances to fall in the interval [ i

geni rX

)(,

i

geni rX

)(]. At the same time, we also wish that the values of fi(yj) for negative

instances are mapped out of the interval [ i

geni rX

)(, i

geni rX

)(]. The

)( geniX is the

mean value of an individual fi(yj) in the gen-th generation of evolution,

.1 ,1,

)(,',

)(Ki mj

m

yf

X ii

CcTScy

ji

geni

ij

jj

Let ri be the maximum distance between )( gen

iX and positive instances <yj, cj> for 1 j mi. That is, Ki yfXr ji

geni

mji

i

1|},)({|max)(

1.

We measure the error of a positive instance <yj, cj> for (gen+1)-th generation by

i

genijiij

i

genijiij

prXyfandCcif

rXyfandCcifD

|)(|1

|)(|0)(

)(

,

and the error of a negative instance by

i

genijiij

i

genijiij

nrXyfandCcif

rXyfandCcifD

|)(|0

|)(|1)(

)(

.

The fitness value of an individual fi is then evaluated by the following fitness function:

fitness(fi, TS ) =m

jnp DD

1

)( .

196 Been-Chian Chien et al.

Page 210: document

The fitness value of an individual represents the degree of error between the target function and the individual, we should have the fitness value be as small as possible.

3.4 The Learning Algorithm

The learning algorithm for the classification functions using genetic programming is described in detail as follows.

Algorithm: The learning of classification functions
Input: The training set TS.
Output: A classification function.
Step 1: Initialize i = 1, k = 1.
Step 2: Transform nominal attributes into rough attribute membership values.
For a data <xj, cj> ∈ TS, xj = (vj1, vj2, …, vjn), 1 ≤ j ≤ m: if Ai is a numerical attribute, wjk = vji, k = k + 1; if Ai is a nominal attribute,
wjk = μAiC1(xj), wj(k+1) = μAiC2(xj), ... , wj(k+K-1) = μAiCK(xj),
k = k + K. Repeat Step 2 until yj = (wj1, wj2, …, wjn′) is generated, where n′ = (n − q) + qK and q is the number of nominal attributes in A.
Step 3: j = j + 1; if j ≤ m, go to Step 1.
Step 4: Generate the new training set TS′:
TS′ = {<yj, cj> | yj = (wj1, wj2, …, wjn′), cj ∈ C, 1 ≤ j ≤ m}.
Step 5: Initialize the population.
Let gen = 1 and generate the initial set of individuals Π(1) = {h1(1), h2(1), …, hp(1)}, where Π(gen) is the population in generation gen and hi(gen) stands for the i-th individual of generation gen.
Step 6: Evaluate the fitness value of each individual on the training set.
For all hi(gen) ∈ Π(gen), compute the fitness values Ei(gen) = fitness(hi(gen), TS′), where the fitness evaluation function fitness() is defined as in Section 3.3.
Step 7: Check the conditions of termination.
If the best fitness value Ei(gen) satisfies the condition of termination (Ei(gen) = 0) or gen is equal to the specified maximum generation, the hi(gen) with the best fitness value is returned and the algorithm halts; otherwise, gen = gen + 1.
Step 8: Generate the next generation of individuals and go to Step 6. The new population Π(gen) of the next generation is generated according to the ratios Pr, Pc and Pm, where Pr, Pc and Pm represent the probabilities of the reproduction, crossover and mutation operations, respectively.

4 The Classification Algorithm

After the learning phase, we obtain a set of classification functions F that can recognize the classes in TS′. However, these functions may still conflict with each other in practical cases. To avoid the situations of conflict and rejection, we propose a scheme based on the Z-score of a statistical test. In the classification phase, we calculate the Z-values of every classification function for the unknown-class data and compare these Z-values. If the Z-value of an unknown-class object yj for the classification function fi is minimum, then yj belongs to the class Ci. We present the classification approach in the following.

For a classification function fi ∈ F corresponding to the class Ci, consider the positive instances <yj, cj> ∈ TS′ with cj = Ci. Let X̄i be the mean of the values of fi(yj) as defined in Section 3.3. The standard deviation σi of the values of fi(yj), 1 ≤ j ≤ mi, is defined as

σi = √( Σ (<yj, cj> ∈ TS′, cj = Ci) ( fi(yj) − X̄i )² / mi ),  1 ≤ j ≤ mi, 1 ≤ i ≤ K.

For a data x ∈ S and a classification function fi, let y ∈ S′ be the data with all numerical values transformed from x using the rough attribute membership. The Z-value of the data y for fi is defined as

Zi(y) = |fi(y) − X̄i| / σi,

where 1 ≤ i ≤ K. We use the Z-value to determine the class to which the data should be assigned. The detailed classification algorithm is listed as follows.

Algorithm: The classification algorithm
Input: A data x.
Output: The class Ck to which x is assigned.
Step 1: Initialize k = 1.
Step 2: Transform the nominal attributes of x into numerical attributes.
Assume that the data x ∈ S, x = (v1, v2, …, vn). If Ai is a numerical attribute, wk = vi, k = k + 1. If Ai is a nominal attribute,
wk = μAiC1(x), w(k+1) = μAiC2(x), ... , w(k+K-1) = μAiCK(x),
k = k + K. Repeat Step 2 until y = (w1, w2, …, wn′) is generated, where n′ = (n − q) + qK and q is the number of nominal attributes in A.
Step 3: Initially, i = 1.
Step 4: Compute Zi(y) with the classification function fi(y).
Step 5: If i < K, then i = i + 1 and go to Step 4. Otherwise, go to Step 6.
Step 6: Find k = Arg min (1 ≤ i ≤ K) { Zi(y) }; the data x is assigned to the class Ck.
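A small Python sketch of this conflict-resolution step is given below; classify_by_z and the tuple layout of the classifiers are our own naming, and each classification function is assumed to carry the mean and standard deviation of its positive training outputs from Section 3.3.

```python
def z_value(f_i, mean_i, std_i, y):
    """Z_i(y) = |f_i(y) - X̄_i| / σ_i."""
    return abs(f_i(y) - mean_i) / std_i

def classify_by_z(classifiers, y):
    """classifiers: list of (class_label, f_i, mean_i, std_i).
    Assign y to the class whose classification function yields the
    smallest Z-value."""
    return min(classifiers, key=lambda c: z_value(c[1], c[2], c[3], y))[0]

# Toy usage with two hand-made classification functions.
classifiers = [
    ("C1", lambda y: y[0] - y[1], 0.8, 0.1),
    ("C2", lambda y: y[1] - y[0], 0.8, 0.1),
]
print(classify_by_z(classifiers, (0.9, 0.1)))   # -> "C1"
```

Because every function is reduced to a dimensionless Z-value, the comparison stays meaningful even when the functions map their positive instances to intervals of very different scales.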

Table 1. The parameters of GPQuick used in the experiments

Node mutate weight          43.5%
Mutate constant weight      43.5%
Mutate shrink weight        13%
Selection method            Tournament
Tournament size             7
Crossover weight            28%
Crossover weight annealing  20%
Mutation weight             8%
Mutation weight annealing   40%
Population size             1000
Set of functions            {+, -, ×, ÷}
Initial value of X̄i         0
Initial value of ri         10
Generations                 10000


5 The Experimental Results

The proposed learning algorithm based on genetic programming is implemented by modifying GPQuick 2.1 [15]. Since GPQuick is open source on the Internet, it gives us more confidence to build and evaluate the algorithm for learning classifiers. The parameters used in our experiments are listed in Table 1. We define only the four basic operations {+, -, ×, ÷} for the final functions; that is, each classification function contains only these four basic operations. The population size is set to 1000 and the maximum number of generations of evolution is set to 10000 for all data sets. Although the number of generations is high, GPQuick still has good performance in computation time. The experimental data sets are selected from the UCI Machine Learning repository [1]. We take 15 data sets from the repository in total, including three nominal data sets, four composite data sets (with nominal and numeric attributes), and eight numeric data sets. The size of the data and the number of attributes in the data sets are quite diverse. The related information on the selected data sets is summarized in Table 2. Since GPQuick is fast in the process of evolving, the training of each classification function over 10000 generations can be done in a few seconds or minutes depending on the number of cases in the training data set. The proposed learning algorithm is efficient when compared with the training time reported in [11], which is more than an hour. We do not know why GPQuick is so powerful in evolving speed, but the source [15] is easy to obtain, and anyone can modify the problem class to reproduce the results.

The performance of the proposed classification scheme is evaluated by the average classification error rate of 10-fold cross validation over 10 runs. We report the experimental results and compare the effectiveness with different classification models in Table 3. These models include the statistical model Naïve Bayes [3], NBTree [7], SNNB [16], the decision-tree-based classifier C4.5 [14] and the association-rule-based classifier CBA [10]. The error rates in Table 3 are cited from [16] except for the GP-based classifier. Since the proposed GP-based classifier is randomized, we also show the standard deviations in the table for the readers' reference. From the experimental results, we observe that the proposed method obtains lower error rates than CBA in 12 out of the 15 domains, and higher error rates in three domains. It obtains lower error rates than C4.5 rules in 13 domains; only one domain has a higher error rate and the other one results in a draw. Compared with NBTree and SNNB, the results are also better in most cases. Compared with Naïve Bayes, the proposed method wins in 13 domains and loses in 2 domains. Generally, the classification results of the proposed method are better than the others on average. However, on some data sets the results of the GP-based classifier are much worse than the others; for example, on the "labor" data we found that the average error rate is 20.1%. The main reason for the terribly high error rate in this case is the small sample size of the data set. The "labor" data set contains only 57 instances in total and is divided into two classes. When such a small data set is tested with 10-fold cross validation, overfitting occurs in both of the two classification functions. That is also why the rule-based classifiers C4.5 and CBA have classification results similar to ours on the labor data set.


6 Conclusions

Classification is an important task in many applications. The technique of classification using genetic programming is a new classification approach developed recently. However, how to handle nominal attributes in genetic programming is a difficult problem. In this paper we proposed a scheme based on the rough membership function to classify data with nominal attributes using genetic programming. Further, to improve the accuracy of classification, we proposed an adaptive interval fitness function and use the minimum Z-value to determine the class to which the data should be assigned. The experimental results demonstrate that the proposed scheme is feasible and effective. In the future we will try to reduce the dimensionality of the attributes for the data sets and to cope with data having missing values.

Table 2. The information of the data sets

Data set     Nominal attrs  Numeric attrs  Classes  Cases
australian   8              6              2        690
german       13             7              2        1000
glass        0              9              7        214
heart        7              6              2        270
ionosphere   0              34             2        351
iris         0              4              3        150
labor        8              8              2        57
led7         7              0              10       3200
lymph        18             0              4        148
pima         0              8              2        768
sonar        0              60             2        208
tic-tac-toe  9              0              2        958
vehicle      0              18             4        846
waveform     0              21             3        5000
wine         0              13             3        178

Table 3. The average error rates (%) for compared classifiers

Data sets    NB    NBTree  SNNB  C4.5  CBA   GP-Ave.  S.D.
australian   14.1  14.5    14.8  15.3  14.6  9.5      1.2
german       24.5  24.5    26.2  27.7  26.5  16.7     2.2

glass 28.5 28.0 28.0 31.3 26.1 22.1 2.9

heart 18.1 17.4 18.9 19.2 18.1 11.9 2.7

ionosphere 10.5 12.0 10.5 10.0 7.7 7.2 2.3

iris 5.3 7.3 5.3 4.7 5.3 4.7 1.0

labor 5.0 12.3 3.3 20.7 13.7 20.1 3.0

led7 26.7 26.7 26.5 26.5 28.1 18.7 2.7

lymph 19.0 17.6 17.0 26.5 22.1 13.7 1.5

pima 24.5 24.9 25.1 24.5 27.1 18.3 2.8

sonar 21.6 22.6 16.8 29.8 22.5 5.6 1.9

tic-tac-toe 30.1 17.0 15.4 0.6 0.4 5.2 1.6

vehicle 40.0 29.5 28.4 27.4 31 24.7 2.4

waveform 19.3 16.1 17.4 21.9 20.3 11.7 1.8

wine 1.7 2.8 1.7 7.3 5.0 4.5 0.7


References

1. Blake, C., Keogh, E. and Merz, C. J.: UCI repository of machine learning database, http://www.ics.uci.edu/~mlearn/MLRepository.html, Irvine, University of California, Department of Information and Computer Science (1998)

2. Bramrier, M. and Banzhaf, W.: A Comparison of Linear Genetic Programming and Neural Networks in Medical Data Mining, IEEE Transaction on Evolutionary Computation, 5, 1 (2001) 17-26

3. Duda, R. O. and Hart, P. E.: Pattern Classification and Scene Analysis, New York: Wiley, John and Sons Incorporated Publishers (1973)

4. Freitas, A. A.: A Genetic Programming Framework for Two Data Mining Tasks: Classification and Generalized Rule Induction, Proc. the 2nd Annual Conf. Genetic Programming. Stanford University, CA, USA: Morgan Kaufmann Publishers (1997) 96-101

5. Heckerman, D. M. and Wellman, P.: Bayesian Networks, Communications of the ACM, 38, 3 (1995) 27-30

6. Kishore, J. K., Patnaik, L. M., and Agrawal, V. K.: Application of Genetic Programming for Multicategory Pattern Classification, IEEE Transactions on Evolutionary Computation, 4, 3 (2000) 242-258

7. Kohavi, R.: Scaling Up the Accuracy of Naïve-Bayes Classifiers: a Decision-Tree Hybrid. Proc. Int. Conf. Knowledge Discovery & Data Mining. Cambridge/Menlo Park: AAAI Press/MIT Press Publishers (1996) 202-207

8. Koza, J. R.: Genetic Programming: On the programming of computers by means of Natural Selection, MIT Press Publishers (1992)

9. Koza, J. R.: Introductory Genetic Programming Tutorial, Proc. the First Annual Conf. Genetic Programming, Stanford University. Cambridge, MA: MIT Press Publishers (1996)

10. Liu, B., Hsu, W., and Ma, Y.: Integrating Classification and Association Rule Mining. Proc. the Fourth Int. Conf. Knowledge Discovery and Data Mining. Menlo Park, CA, AAAI Press Publishers (1998) 443-447

11. Loveard, T. and Ciesielski, V.: Representing Classification Problems in Genetic Programming, Proc. the Congress on Evolutionary Computation. COEX Center, Seoul, Korea (2001) 1070-1077

12. Pawlak, Z.: Rough Sets, International Journal of Computer and Information Sciences, 11 (1982) 341-356

13. Pawlak, Z. and Skowron, A.: Rough Membership Functions, in: R.R. Yager and M. Fedrizzi and J. Kacprzyk (Eds.), Advances in the Dempster-Shafer Theory of Evidence (1994) 251-271

14. Quinlan, J. R.: C4.5: Programs for Machine Learning, San Mateo, California, Morgan Kaufmann Publishers (1993)

15. Singleton A. Genetic Programming with C++, http://www.byte.com/art/9402/sec10/ar-t1.htm, Byte, Feb., (1994) 171-176

16. Xie, Z., Hsu, W., Liu, Z., and Lee, M. L.: SNNB: A Selective Neighborhood Based Naïve Bayes for Lazy Learning, Proc. the sixth Pacific-Asia Conf. on Advances in Knowledge Discovery and Data Mining (2002) 104-114

17. Zhang, G. P.: Neural Networks for Classification: a Survey, IEEE Transaction on Systems, Man, And Cybernetics-Part C: Applications and Reviews, 30, 4 (2000) 451-462


Clustering by Regression Analysis

Masahiro Motoyoshi1, Takao Miura1, and Isamu Shioya2

1 Dept. of Elect. & Elect. Engr., HOSEI University, 3-7-2 KajinoCho, Koganei, Tokyo, 184–8584 Japan
{i02r3243,miurat}@k.hosei.ac.jp
2 Dept. of Management and Informatics, SANNO University, 1573 Kamikasuya, Isehara, Kanagawa 259–1197 Japan
[email protected]

Abstract. In data clustering, many approaches have been proposed such as the K-means method and hierarchical methods. One of the problems is that the results depend heavily on initial values and on the criterion to combine clusters. In this investigation, we propose a new method to avoid this deficiency. Here we assume there exist aspects of local regression in the data. Then we develop our theory to combine clusters using F values obtained by regression analysis as the criterion. We examine experiments and show how well the theory works.

Keywords: Data Mining, Multivariable Analysis, Regression Analysis, Clustering

1 Introduction

It is well-known that stocks in a securities market are properly classified according to industry genre (classification of industries). Such genres appear very often in security investment. The movements of stocks in the same genre would be similar to each other, but this classification should be maintained according to the economic situation, trends and activities in our society, and regulations. Sometimes we see some mismatch between the classification and real society. When an analyst tries classifying using a more effective criterion, she/he will try a quantitative classification. Cluster analysis is one of the methods based on multivariate analysis which perform a quantitative classification.

Cluster analysis is a general term for algorithms that classify similar objects into groups (clusters) where the objects in one cluster share common features. We can say that, in every research activity, a researcher is faced with the problem of how observed data should be systematically organized, that is, how to classify.

Generally, the higher the similarity of objects within a cluster and the lower the similarity between clusters, the better the clustering. This means the quality of clustering depends on the definition of similarity and on the calculation complexity. There is no guarantee that we can interpret the similarity easily.

Y. Kambayashi, M. Mohania, W. Woß (Eds.): DaWaK 2003, LNCS 2737, pp. 202–211, 2003.c© Springer-Verlag Berlin Heidelberg 2003


The same is true for similarity from the viewpoint of analysts. It is an analyst's responsibility to apply methods accurately to specific applications. The point is how to find out hidden patterns.

Roughly speaking, cluster analysis is divided into hierarchical methods and non-hierarchical methods [4]. In non-hierarchical methods, data are decomposed into k clusters, each of which best satisfies an evaluation criterion. To obtain the best solutions we would have to examine all the possible partitions, which takes too much time, so heuristic methods have been investigated. The K-means method [6] generates clusters based on their centers. The fuzzy K-means method [1] takes an approach based on fuzzy classification. AutoClass [3] automatically determines the number of clusters and classifies the data stochastically.

Recently an interesting approach [2] has been proposed, called "Local Dimensionality Reduction". In this approach, data are assumed to be locally correlated, as in our case. However, the clustering technique is based on Principal Component Analysis (PCA), and the authors propose a completely different algorithm from ours.

In this investigation, we assume there exist aspects of local regression in the data, i.e., we assume the observed data are structured as local sub-linear spaces. We propose a new clustering method using variance and the F value obtained by regression analysis as criteria to make suitable clusters.

In the next section we discuss the reasons why conventional approaches are not suitable for our situation. In section 3 we give some definitions and discuss preliminary processing of the data. Section 4 contains the method to combine clusters and the criterion. In section 5, we examine experimental results and comparative experiments with the K-means method. After reviewing related work in section 6, we conclude our work in section 7.

2 Clustering Data with Local Trends

As previously mentioned, we assume a collection of data in which we see several trends at the same time. Such data could be regressed locally by using partial linear functions, and the result forms elliptic clusters in multi-dimensional space. Of course, these clusters may cross each other.

The naive solution is to put clusters together by using the nearest neighbor method found in hierarchical approaches. However, when clusters cross, the result is not good; they will be divided at the crossing. If the clusters have different trends but are close to each other, they could be combined. Similarly, an approach based on a general Minkowski distance has the same problem.

In the K-means method, a collection of objects is represented by its center of gravity. Thus every textbook says that it is not suitable for non-convex clusters. The notion of center comes from a notion of variance, but if we look for points that improve the linearity of two clusters by moving objects, we cannot always obtain such a point. More seriously, we have to decide the number of clusters in advance.


These two approaches share a common problem: how to define similarity between clusters. In our case, we want to capture local aspects of sub-linearity, thus new techniques should satisfy the following properties:

(1) similarity suitable for classifying sub-linear spaces;
(2) convergence at a suitable level (i.e., number of clusters) which can be interpreted easily.

Regression analysis is one of the techniques of multivariable analysis by which we can predict future phenomena in the form of mathematical functions.

We introduce the F value as a criterion of similarity for combining clusters, so that we can consider a cluster as a line; that is, our approach is clustering by line while the K-means method is clustering by point.

In this investigation, by examining the F value (as a similarity measure), we combine linear clusters one by one; in fact, we take an approach of restoring the target clusters.

3 The Choice of Initial Clusters

In this section, let us explain the difference between our approach and hierarchical clustering by existing agglomerative nesting.

An object is a thing in the world of interest. Data is a set of objects which have several variables. We assume that all variables are inputs given from the surroundings and that there are no other external criteria for classification. There are two kinds of variables: a criterion variable is an attribute which plays the role of the criterion of the regression analysis, given by the analyst, and the others are called explanatory variables. In this investigation, we discuss only numerical data. As for categorical data, readers could think of quantification theory or dummy variables.

We deal with data as a matrix (X | Y). An object is represented as a row of the matrix while criterion/explanatory variables are described as columns. We denote the explanatory variables and the criterion variable by x1, x2, . . . , xm and y respectively, and the number of objects by n:

          ⎛ x11 · · · x1m  y1 ⎞
          ⎜  ⋮    ⋱   ⋮    ⋮  ⎟
(X | Y) = ⎜ xk1 · · · xkm  yk ⎟  ∈ R^(n×(m+1))                              (1)
          ⎜  ⋮    ⋱   ⋮    ⋮  ⎟
          ⎝ xn1 · · · xnm  yn ⎠

where X denotes the explanatory variables and Y denotes the criterion variable. Each variable is assumed to be normalized (called the Z score).

An initial cluster is a set of objects where each object is exclusively contained in that initial cluster.


In the first stage of agglomerative nesting, each object represents its own cluster, and similarity is defined as the distance between objects. However, we assume that every cluster has to have a variance because we deal with "data as a line". We pose this assumption on the initial clusters. To make "our" initial clusters, we divide the objects into small groups. We obtain initial clusters dynamically by an inner product (cosine), which measures the difference of angle between two vectors, as the following algorithm shows:

0. Let the input vectors be s1, s2, ..., sn.
1. Let the first input vector s1 be the center of cluster C1 and let s1 be a member of C1.
2. Calculate the similarity between sk and the existing clusters C1 . . . Ci by (2). If every similarity is below a given threshold Θ, generate a new cluster and let sk be its center. Otherwise, let sk be a member of the cluster which has the highest similarity. By using (3), recalculate the center of the cluster to which the member was added.
3. Repeat until all the assignments are completed.
4. Remove clusters which have no F value and fewer than m + 2 members.

where

cos(k, j) = (sk · cj) / (|sk| |cj|)                                         (2)

cj = ( Σ (sk ∈ Cj) sk ) / Mj                                                (3)

Note Mj means the number of members in Cj and m means the number of explanatory variables.
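A minimal Python/NumPy sketch of this initialization is shown below; the function name initial_clusters and the arguments theta and min_members are our own, and clusters are kept simply as lists of member vectors.

```python
import numpy as np

def initial_clusters(vectors, theta, min_members):
    """Assign each input vector to the existing cluster whose center has the
    highest cosine similarity (Eq. 2), or open a new cluster when every
    similarity is below theta; centers are recomputed as the mean of their
    members (Eq. 3).  Clusters that end up too small are dropped."""
    centers, members = [], []
    for s in vectors:
        s = np.asarray(s, dtype=float)
        if centers:
            sims = [float(np.dot(s, c) / (np.linalg.norm(s) * np.linalg.norm(c)))
                    for c in centers]
            best = int(np.argmax(sims))
        if not centers or max(sims) < theta:
            centers.append(s)                 # s becomes the center of a new cluster
            members.append([s])
        else:
            members[best].append(s)
            centers[best] = np.mean(members[best], axis=0)   # recompute the center
    return [m for m in members if len(m) >= min_members]

# Toy usage on 2-dimensional standardized data; the paper drops clusters
# with fewer than m + 2 members so that a regression (and F value) exists.
data = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.2, 0.9), (0.15, 0.95)]
print([len(c) for c in initial_clusters(data, theta=0.9, min_members=2)])
```

The cosine criterion groups vectors by direction rather than by distance, which matches the goal of collecting objects that share the same local linear trend.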

4 Combining Clusters

Now let us define similarities between clusters and describe how the similarity criterion relates to combining. We define the similarity between clusters from two aspects. One of the aspects is the distance between the centers of the clusters. We simply take the Euclidean distance as the distance measure (in line with the least squares method used by regression analysis).

d(i, j) = √( |xi1 − xj1|² + . . . + |xim − xjm|² + |yi − yj|² )             (4)

Then we define the non-similarity matrix as follows:

⎛ 0                                      ⎞
⎜ d(2, 1)  0                             ⎟
⎜ d(3, 1)  d(3, 2)  0                    ⎟  ∈ R^(n×n)                      (5)
⎜    ⋮        ⋮        ⋱                 ⎟
⎝ d(n, 1)  d(n, 2)  · · ·  d(n, n−1)  0  ⎠


Clearly one of the candidate clusters to combine is the one with the smallest distance, and we have to examine whether it is suitable or not in our case. For this purpose we define a new similarity by the F value of the regression, to keep effectiveness.

In the following, let us quickly review the F test and estimation by regression based on the least squares method in multiple regression analysis.

Given clusters represented by a data matrix like (1), we define a model of multiple regression analysis corresponding to the clusters as follows:

y = b1x1 + b2x2 + . . . + bmxm + ei (6)

The least squares estimator of bi is given by

B = (b1, b2, . . . , bm) = (XT X)−1XT Y (7)

This is called the regression coefficient. Actually it is a standardised partial regression coefficient, because it is based on z-scores.

Let y be an observed value and Y be a predicted value based on the regression coefficients. Then, for the variation due to regression, the sum of squares SR and the mean square VR are defined as

SR = Σ (k = 1 .. n) (Yk − Ȳ)²;   VR = SR / m                                (8)

For the variation due to residuals, the sum of squares SE and the mean square VE are

SE = Σ (k = 1 .. n) (yk − Yk)²;   VE = SE / (n − m − 1)                     (9)

Then we define F value F0 by:

F0 = VR / VE                                                                (10)

It is well known that F0 obeys F distribution where the first and second degrees of freedom are m and n − m − 1 respectively.
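For concreteness, this computation can be sketched in Python/NumPy as follows; regression_f_value is our own name, X is the n×m matrix of standardized explanatory variables and y the standardized criterion variable.

```python
import numpy as np

def regression_f_value(X, y):
    """Least-squares fit y ≈ X b (Eq. 7) and the F value F0 = VR / VE
    (Eq. 10) with degrees of freedom m and n - m - 1."""
    n, m = X.shape
    b, *_ = np.linalg.lstsq(X, y, rcond=None)   # standardized regression coefficients
    y_hat = X @ b
    s_r = np.sum((y_hat - y.mean()) ** 2)       # sum of squares by regression (Eq. 8)
    s_e = np.sum((y - y_hat) ** 2)              # sum of squares by residual  (Eq. 9)
    return b, (s_r / m) / (s_e / (n - m - 1))

# Toy usage: 10 points lying almost exactly on y = 2*x1 - x2.
rng = np.random.default_rng(0)
X = rng.standard_normal((10, 2))
y = 2 * X[:, 0] - X[:, 1] + 0.01 * rng.standard_normal(10)
coefficients, f0 = regression_f_value(X, y)
print(coefficients, f0)     # a large F0 indicates a strong linear trend
```

The larger F0 is, the better the cluster is explained by a single linear regression, which is exactly the property used below to decide whether two clusters should be merged.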

Given clusters A and B whose numbers of members are a and b respectively, the data matrix of the combined cluster A ∪ B is described as follows.

          ⎛ xA11 · · · xA1m  yA1 ⎞
          ⎜   ⋮    ⋱    ⋮     ⋮  ⎟
          ⎜ xAa1 · · · xAam  yAa ⎟
(X | Y) = ⎜ xB11 · · · xB1m  yB1 ⎟  ∈ R^(n×(m+1))                           (11)
          ⎜   ⋮    ⋱    ⋮     ⋮  ⎟
          ⎝ xBb1 · · · xBbm  yBb ⎠


where n = a + b. As previously mentioned, we can calculate the regression by (7) and the F value by (10).

Let us examine the relationship between the two F values of clusters before and after combining. Let A, B be two clusters, FA, FB their two F values and F the F value after combining A and B.

Then we have some interesting properties, as shown in the following.

Property 1. FA > F, FB > F. When F decreases, the gradients are significantly different. Thus we can say that the similarity between A and B is low and the linearity of the combined cluster decreases. In the case FA = FB, F = 0, both A and B have the same number of objects and the same coordinates and the regressions are orthogonal at the center of gravity.

Property 2. FA ≤ F, FB ≤ F. When F increases, the gradients are not significantly different and the similarity between A and B is high; the linearity of the combined cluster increases. When FA = FB and F = 2 × FA, we see that A and B have the same number of objects and the same coordinates.

Property 3. FA ≤ F, FB > F, or FB ≤ F, FA > F. One of FA, FB increases while the other decreases when there is a big difference between the variances of A and B, or between FA and FB. In this case we cannot say anything about combining.

Thus we can say we had better combine clusters if F is bigger than both FA and FB.

Non-similarity using the Euclidean distance is a nice way to prohibit combining clusters whose distance is bigger than local ones. Since our algorithm proceeds based on a criterion using F values, the process continues to look for candidate clusters, relaxing the distance criterion, until it satisfies our F value criterion. But we may have difficulties in the case of defective initial clusters, or in the case of no cluster to regress locally; the process might combine clusters that should not be combined.

To overcome this problem, we introduce a threshold Δ on the distance. That is, we use Δ as a criterion on the variance:

Δ > (Var(A) + Var(B)) × D

where Var(A), Var(B) mean the variances of A and B respectively and D means the distance between their centers of gravity.

When A and B satisfy both the criterion on the F value and Δ, we can combine the two clusters. In our experiments, we give Δ the average of the internal variances of the initial clusters as a default. By Δ we manage the internal variances of clusters to avoid combining far clusters.


Now our algorithm is given as follows.

1. Standardize the data.
2. Calculate initial clusters that satisfy Θ. Remove clusters in which the number of members does not reach the number of explanatory variables.
3. Calculate the center of gravity, variance, regression coefficients and F value of each cluster, and the distances between their centers of gravity.
4. Choose close clusters as candidates for combining. Standardize the pair. Calculate the regression coefficients and the F value again.
5. Combine the pair if the F value of the combined cluster is bigger than the F value of each cluster and if it satisfies Δ. Otherwise, go to step 4 to choose other candidates. If there are no candidates any more, then stop.
6. Calculate the center of gravity of each cluster and the distances between their centers of gravity again, and go to step 4.

5 Experiments

In this section, let us show some experiments to demonstrate the feasibility of our theory. We use Weather Data in Japan [5]: the data of two meteorological observatories, Wakkanai in Hokkaido (northern part of Japan) and Niigata in Honshu (middle part of Japan), measured in January 1997. Each meteorological observatory contributes 744 records. To apply our method under the assumption that there are clusters to regress locally, we simply joined them, obtaining 1488 records of 180 KB.

Each data instance contains 22 attributes observed every hour. We utilize "day" (day), "hour" (hour), "pressure" (hPa), "sea-level pressure" (hPa), "air temperature" (°C), "dew point" (°C), "steam pressure" (hPa) and "relative humidity" (%) as candidate variables among the 22 attributes. All of them are numerical without any missing value. We use "observation point number" additionally, only for the purpose of evaluation. Table 1 contains examples of the data.

Before processing, we standardized all variables to be analyzed by our algorithm. We take "air temperature" as the criterion variable and the other values as explanatory variables.

Table 1. Weather Data

Point  Day  Hour  Pressure  Sea pressure  Temperature  . . .
604    1    1     1019.2    1020          5            . . .
604    1    2     1018.6    1019.4        5.2          . . .
604    1    3     1018.3    1019.1        5.4          . . .
 ⋮      ⋮    ⋮       ⋮          ⋮             ⋮
401    31   24    1014.6    1016          -5.8         . . .


Let Θ = 0.8 and Δ = 15. Table 2 shows the results. In this experiment, we obtained 40 initial clusters from 1364 objects by using the inner product. We excluded the other 124 objects because they were classified into small clusters. It took 35 loops to converge, and eventually we got 5 clusters.

Let us go into more detail on our results. Cluster 1 has been obtained by combining 19 initial clusters. On the other hand, clusters 2 and 3 involve no combining. Clusters 4 and 5 contain 10 and 9 initial clusters respectively.

Generally the results seem to reflect features of the observation points. In fact, cluster 1 contains 519 objects (69.8%) of the 744 objects of the Niigata point, and cluster 5 holds 469 objects (63.0%) of the 744 objects of the Wakkanai point. Thus we can say cluster 1 reflects the peculiar trends of the Niigata point well, and cluster 5 reflects the peculiar trends of the Wakkanai point well.

For example, in Table 3, we see that both "pressure" and "temperature" of cluster 1 are higher than in the other clusters. Thus the cluster contains objects that were observed in a region of high pressure and high temperature. Also, "temperature" and "humidity" in cluster 5 are relatively low, and we see the cluster contains objects observed in a region of low precipitation and low temperature. In the case of cluster 4, the "day" value is high since it was observed in January. The "pressure" is low, and the "humidity" is high. We can say that cluster 4 is unrelated to the observation region. We might be able to characterize cluster 4 by a state of weather such as low pressure. In fact, the cluster contains almost the same number of objects from the Niigata and Wakkanai points.

In Table 4, the absolute values of the regression coefficients in cluster 5 are overall high compared with clusters 1 and 4. Compared to the change of weather, the change of temperature is large; that is, temperature varies over a wide range. Since Hokkaido is the region with the maximum annual difference of temperature in Japan¹, our results agree well with the actual classification.

Also, cluster 1 is similar to cluster 5, but the gradient is smaller. Thus, temperature in cluster 1 does not vary very much. Cluster 4 is clearly different from the other clusters; the cluster has correlation only with dew point and relative humidity.

Let us summarize our experiment. We obtained 5 clusters. In particular, we extracted regional features from clusters 1 and 5. From the information on observation points in Table 2 it is evident that the clustering has classified the objects very well. This means that the results of our experiment satisfy the initial assumption.

Let us discuss some more aspects to compare our technique with others. Wehave analyzed the same data by using K-means method with statistics appli-cation tool SPSS. We gave centers of initial clusters by random numbers. Wespecified 10 as the maximum number of iteration. Then we have analyzed twocases of k = 2 and 3. Let us show the results of k = 2 in a table 5, and theresults of k = 3 in a table 6.1 In Hokkaido area, the lowest temperature decreases to about -20uA in winter time,

and the maximum air temperature exceeds 30 C in summer time.


Table 2. Final clusters (Θ = 0.8, Δ = 15)

          Variance  F-value  Contained clusters  Niigata  Wakkanai
Cluster1  4.958     8613.28  19                  519      74
Cluster2  2.12926   2043.62  1                   29       1
Cluster3  2.42196   78.2235  1                   45       0
Cluster4  5.1034    85603.6  10                  135      189
Cluster5  5.50085   17964.9  9                   11       469

Table 3. Center of gravity for cluster

                    Cluster1    Cluster4    Cluster5
Day                 -0.0103696  0.487242    -0.0987597
Hour                0.0622433   -0.148476   0.0358148
Pressure            0.599712    -0.899464   -0.0542843
Sea-level pressure  0.580779    -0.902211   -0.023735
Dew point           0.512113    0.294745    -1.05024
Steam pressure      0.468126    0.245389    -0.993099
Relative humidity   -0.179054   0.975926    -0.392923
Air temperature     0.692525    -0.234597   -0.976859

Table 4. Standardised regression coefficient of clusters

                    Cluster1     Cluster4     Cluster5
Day                 -0.0163907   -0.0029574   0.0108929
Hour                -0.00316974  -0.00182585  -0.00965708
Pressure            1.18154      0.0357092    1.77679
Sea-level pressure  -1.15683     -0.0393822   -1.77494
Dew point           0.909799     1.103        1.22524
Steam pressure      0.421526     -0.016361    0.212142
Relative humidity   -1.25678     -0.36817     -0.804252

Table 5. Clustering by K-means method (k=2)

          Niigata  Wakkanai
Cluster1  496      359
Cluster2  248      385

In the case of k = 2, it seems hard for readers (and for us) to extract significant differences between the two final clusters with respect to observation points.

Similarly, in the case of k = 3, we cannot extract sharp features from the results. Thus, our technique can be an alternative when it is not possible to cluster well by the K-means method.


Table 6. Clustering by K-means method (k=3)

          Niigata  Wakkanai
Cluster1  293      149
Cluster2  158      353
Cluster3  293      243

6 Conclusion

In this investigation, we have discussed clustering for data where objects with different local trends exist together. We have proposed how to extract the trend of a cluster by using regression analysis, and the similarity of clusters by the F value of a regression. We have introduced a threshold on the distance between clusters to keep the precision of the clusters. By examining the data, we have shown that we can extract a moderate number of clusters that are easy to interpret, and their features, by the center of gravity and the regression coefficients. We have examined some experimental results and compared our method with other methods to show the feasibility of our approach.

We had already discussed how to mine Temporal Class Schemes to model a collection of time series data [7], and we are now developing integrated methodologies for time series data and stream data.

Acknowledgements

We would like to acknowledge the financial support by Grant-in-Aid for Scientific Research (C)(2) (No. 14580392).

References

[1] Bezdek, J. C.: "Numerical taxonomy with fuzzy sets", Journal of Mathematical Biology, Vol. 1, pp. 57-71, 1974.
[2] Chakrabarti, K. and Mehrotra, S.: "Local Dimensionality Reduction: A New Approach to Indexing High Dimensional Spaces", proc. VLDB, 2000.
[3] Cheeseman, P., et al.: "Bayesian classification", proc. ACM Artificial Intelligence, 1988, pp. 607-611.
[4] Jain, A. K., Murty, M. N. and Flynn, P. J.: "Data Clustering – A Review", ACM Computing Surveys, Vol. 31-3, 1999, pp. 264-323.
[5] Japan Weather Association: "Weather Data HIMAWARI", Maruzen, 1998.
[6] MacQueen, J. B.: "Some methods for classification and analysis of multivariate observations", proc. Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, University of California Press, 1967.
[7] Motoyoshi, M., Miura, T., Watanabe, K., Shioya, I.: "Mining Temporal Classes from Time Series Data", proc. ACM Conf. on Information and Knowledge Management (CIKM), 2002.
[8] Wallace, C. S. and Dowe, D. L.: "Intrinsic classification by MML - the Snob program", proc. 7th Australian Joint Conference on Artificial Intelligence, 1994, pp. 37-44.


Handling Large Workloads by Profiling and Clustering

Matteo Golfarelli

DEIS - University of Bologna, 40136 Bologna, Italy
[email protected]

Abstract. View materialization is recognized to be one of the most effective ways to increase the Data Warehouse performance; nevertheless, due to the computational complexity of the techniques aimed at choosing the best set of views to be materialized, this task is mainly carried out manually when large workloads are involved. In this paper we propose a set of statistical indicators that can be used by the designer to characterize the workload of the Data Warehouse, thus driving the logical and physical optimization tasks; furthermore we propose a clustering algorithm that allows the cardinality of the workload to be reduced and uses these indicators for measuring the quality of the reduced workload. Using the reduced workload as the input to a view materialization algorithm allows large workloads to be efficiently handled.

1 Introduction

During the design of a data warehouse (DW), the phases aimed at improving the system performance are mainly the logical and physical ones. One of the most effective ways to achieve this goal during logical design is view materialization. The so-called view materialization problem consists of choosing the best subset of the possible (candidate) views to be precomputed and stored in the database while respecting a set of system and user constraints (see [8] for a survey). Even if the most important constraint is the disk space available for storing aggregated data, the quality of the result is usually measured in terms of the number of disk pages necessary to answer a given workload.

Despite the efforts made by research in the last years, view materialization remains a task whose success depends on the experience of the designer that, adopting rules of thumb and applying the trial-and-error approach, may lead to acceptable solutions. Unlike other issues in the Data Warehouse (DW) field, understanding why the large set of techniques available in the literature have not been engineered and included in some commercial tools is fundamental to solving the problem. Of course the main reason is the computational complexity of view materialization that makes all the approaches proposed unsuitable for workloads larger than about forty queries. Unfortunately, real workloads are much larger and are not usually available during the DW design but only when the system is on-line. Nevertheless, the designer can estimate the core of the workload at design phase but such a rough approximation will lead to a largely sub-optimal solution.


Y. Kambayashi, M. Mohania, W. Woß (Eds.): DaWaK 2003, LNCS 2737, pp. 212-223, 2003. c Springer-Verlag Berlin Heidelberg 2003


We believe that the best solution is to carry out a rough optimization at design time and to refine the solution by tuning it, manually or automatically, when the system is on-line, on the basis of the real workload. The main difficulty with this approach is the huge size of the workload, which cannot be handled by the algorithms known in the literature. In this context the contribution of the paper is twofold: firstly, we propose a technique for profiling large workloads that can be obtained from the log file produced by the DBMS when the DW is on line. The statistical indicators obtained can be used by the designer to characterize the DW workload, thus driving the logical and physical optimization tasks. The second contribution concerns a clustering algorithm that allows the cardinality of the workload to be reduced and that uses the indicators in order to measure the quality of the reduced workload. Using the reduced workload as the input to a view materialization algorithm allows large workloads to be efficiently handled. Since clustering is an independent preprocessing step, all the algorithms presented in the literature can be adopted during the view selection phase. Figure 1 shows the framework we assume for our approach: OLAP applications generate SQL queries whose logs are periodically elaborated to determine the statistical indicators and a clustered workload that can be handled by a view materialization algorithm that produces new patterns to be materialized.

Fig. 1. Overall framework for the view materialization process

To the best of our knowledge only a few works have directly faced the workload size problem; in particular, in [5] the authors proposed a polynomial time algorithm that explores only a subset of the candidate views and delivers a solution whose quality is comparable with other techniques that run in exponential time. In [1] the authors propose a heuristic reduction technique that is based on the functional dependencies between attributes and excludes from the search space those views that are "similar" to other ones already considered. With respect to ours, this approach does not produce any representative workload to be used for further optimizations.

Clustering of queries in the field of DWs has recently been used to reduce the complexity of the plan selection task [2]: each cluster has a representative for which the execution plan, as determined by the optimizer, is persistently stored. Here the concept of similarity is based on a complex set of features that encode when different queries can be efficiently solved using the same execution plan. This idea has been implicitly used in several previous works where a global optimization plan was obtained given a set of queries [7].


The rest of the paper is organized as follows: Section 2 presents the necessary background, Section 3 defines the statistical indicators for workload profiling; Section 4 presents the algorithm for query clustering while in Section 5 a set of experiments, aimed at proving its effectiveness, are reported. Finally in Section 6 the conclusions are drawn.

2 Background

It is recognized that DWs lean on the multidimensional model to represent data, meaning that the indicators that measure a fact of interest are organized according to a set of dimensions of analysis; for example, sales can be measured by the quantity sold and the price of each sale of a given product that took place in a given store and on a given day. Each dimension is usually related to a set of attributes describing it at different aggregation levels; the attributes are organized in a hierarchy defined according to a set of functional dependencies. For example a product can be characterized by the attributes PName, Type, Category and Brand, among which the following functional dependencies are defined: PName→Type, Type→Category and PName→Brand; on the other hand, stores can be described by their geographical and commercial location: SName→City, City→Country, SName→CommArea, CommArea→CommZone.

In relational solutions, the multidimensional nature of data is implemented on the logical model by adopting the so-called star scheme, composed of a set of fully denormalized dimension tables, one for each dimension of analysis, and a fact table whose primary key is obtained by composing the foreign keys referencing the dimension tables. The most common class of queries used to extract information from a star schema are GPSJ queries [3], which consist of a selection over a generalized projection over a selection over a join between the fact table and the dimension tables involved.

It is easy to understand that grouping heavily contributes to the global query cost and that such a cost can be reduced by precomputing (materializing) the aggregated information that is useful to answer a given workload. Unfortunately, in real applications, the size of such views never fits the constraint given by the available disk space and it is very hard to choose the best subset to be actually materialized. When working on a single fact scheme, and assuming that all the measures contained in the elemental fact table are replicated in the views, a view is completely defined by its aggregation level.

Definition 1 The pattern of a view consists of a set of dimension table attributes such that no functional dependency exists between attributes in the same pattern.

Possible patterns for the sales fact are: P1 = {Month, Country, Category}, P2 = {Year, SName}, P3 = {Brand}. In the following we will use the terms pattern and view interchangeably, and we will refer to the query pattern as the coarsest pattern that can be used to answer the query.

Definition 2 Given two views Vi, Vj with patterns Pi, Pj respectively, we say that Vi

can be derived from Vj (Vi ≤ Vj) if the data in Vi can be calculated from the data in Vj.


Derivability determines a partial-order relationship between the views, and thus between patterns, of a fact scheme. Such partial-order can be represented by the so-called multidimensional lattice [1] whose nodes are the patterns and whose arcs show a direct derivability relationship between patterns.

Definition 3 We denote with Pi ⊕ Pj the least upper bound (ancestor) of two patterns in the multidimensional lattice.

In other words the ancestor of two patterns corresponds to the coarsest one from which both can be derived.

Given a set of queries the ancestor operator can be used to determine the set of views that are potentially useful to reduce the workload cost (candidate views). The candidate set can be obtained, starting from the workload queries, by iteratively adding to the set the ancestors of each couple of patterns until a fixed point is reached. Most of the approaches to view materialization try to determine first the candidate views, and to choose the best subset that fits the constraints later. Both problems have an exponential complexity.
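The closure under the ancestor operator can be sketched in a few lines of Python. The sketch below assumes a deliberately simplified model in which every dimension has a linear hierarchy and a pattern is a tuple of level indices (None meaning the dimension is fully aggregated away); the real multidimensional lattice of the paper is induced by arbitrary functional dependencies, so this is only an illustration.

```python
from itertools import combinations

def ancestor(p1, p2):
    """Coarsest pattern from which both p1 and p2 can be derived: for each
    dimension keep the finer (deeper) of the two levels."""
    finer = lambda a, b: b if a is None else a if b is None else max(a, b)
    return tuple(finer(a, b) for a, b in zip(p1, p2))

def candidate_views(query_patterns):
    """Close the set of query patterns under the ancestor operator,
    iterating until a fixed point is reached."""
    views = set(query_patterns)
    while True:
        new = {ancestor(p, q) for p, q in combinations(views, 2)} - views
        if not new:
            return views
        views |= new

# Two dimensions with linear hierarchies, levels numbered from coarse to fine,
# e.g. product: 0=Category, 1=Type, 2=PName; store: 0=Country, 1=City, 2=SName.
q1, q2, q3 = (1, 1), (0, 0), (2, None)
print(candidate_views([q1, q2, q3]))
```

Even this toy example shows why the candidate set grows quickly: every pair of patterns may introduce a new ancestor, which in turn pairs with all the others.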

3 Profiling the workload

Profiling means determining a set of indicators that captures the workload features that have an impact on the effectiveness of different optimization techniques. In particular, we are interested in those relevant to the problem of view materialization and that help the designer to answer queries like: “How suitable to materialization is the workload ?”, “How much space do I need to obtain good results ?”.

In the following we propose four indicators that have proved to properly capture all the relevant aspects and that can be used as guidance by the designer who manually tunes the DW, or as input to an optimization algorithm for materialized view selection. All the indicators are based on the concept of cardinality of the view associated to a given pattern, which can be estimated knowing the data volume of the fact scheme, which we assume to contain the cardinality of the base fact table and the number of distinct values of each attribute in the dimension tables. The cardinality of an aggregate view can be estimated using Cardenas' formula. In our case the objects are the tuples in the elemental fact table with pattern P0 (whose number |P0| is assumed to be known) while the number of buckets is the maximum number of tuples, |P|Max, that can be stored in a view with pattern P, which can be easily calculated given the cardinalities of the attributes belonging to the pattern, thus

Card(P) = Φ(|P|Max, |P0|)                                                   (1)
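Cardenas' formula is not spelled out in the text; it is commonly stated as Φ(b, r) = b·(1 − (1 − 1/b)^r), the expected number of distinct buckets hit when r objects fall uniformly into b buckets. Under that assumption, a small Python sketch of the estimate (with our own function names) is:

```python
def cardenas(buckets, objects):
    """Expected number of distinct buckets hit by `objects` uniform draws."""
    return buckets * (1.0 - (1.0 - 1.0 / buckets) ** objects)

def card(attribute_cardinalities, base_rows):
    """Card(P) = Φ(|P|Max, |P0|), with |P|Max the product of the cardinalities
    of the attributes in the pattern and |P0| the base fact table size."""
    p_max = 1
    for c in attribute_cardinalities:
        p_max *= c
    return cardenas(p_max, base_rows)

# e.g. pattern {Month, Country, Category} over a 10-million-tuple fact table.
print(round(card([12, 25, 40], 10_000_000)))   # close to |P|Max = 12000
```

For coarse patterns the estimate saturates near |P|Max, while for patterns close to P0 it approaches the base cardinality, which is exactly the behaviour the aggregation indicator below relies on.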

3.1 Aggregation level of the workload

The aggregation level of a pattern P is calculated as:

Agg(P) = 1 − Card(P) / |P0|                                                 (2)


Agg(P) ranges between [0,1[, the higher the values the coarser the pattern. The average aggregation level (AAL) of the full workload W ={Q1,…Qn} can be calculated as

AAL = (1/n) Σ (i = 1 .. n) Agg(Pi)                                          (3)

where Pi is the pattern of query Qi. In order to partially capture how the queries are distributed at different aggregation levels we also include the aggregation level standard deviation (ALSD), which is the standard deviation for AAL:

ALSD = √( Σ (i = 1 .. n) ( Agg(Pi) − AAL )² / n )                           (4)

AAL and ALSD characterize to what extent the information required by the users is aggregated and express the willingness of the workload to be optimized using materialized views. Intuitively, workloads with high values of AAL will be efficiently optimized using materialized views since they determine a strong reduction of the number of tuples to be read. Furthermore, the limited size of such tables allows a higher number of views to be materialized. On the other hand, a low value for ALSD denotes that most of the views share the same aggregation level further improving the usefulness of view materialization.
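A minimal Python sketch of these two indicators follows; profile_aggregation is our own name, and the per-query cardinalities are assumed to be estimated beforehand (for instance with the Cardenas-based estimate sketched above).

```python
from math import sqrt

def aggregation_level(card_p, card_p0):
    """Agg(P) = 1 - Card(P) / |P0|   (Eq. 2)."""
    return 1.0 - card_p / card_p0

def profile_aggregation(query_cardinalities, card_p0):
    """AAL (Eq. 3) and ALSD (Eq. 4) of a workload, given the estimated
    cardinality of the pattern of each query."""
    aggs = [aggregation_level(c, card_p0) for c in query_cardinalities]
    aal = sum(aggs) / len(aggs)
    alsd = sqrt(sum((a - aal) ** 2 for a in aggs) / len(aggs))
    return aal, alsd

# Four queries over a fact table of 10 million tuples.
print(profile_aggregation([2100, 1450, 380, 680], 10_000_000))
```

High AAL with low ALSD signals a workload made of uniformly coarse queries, the most favourable case for view materialization according to the discussion above.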

3.2 Skewness of the workload

Measuring the aggregation level is not sufficient to characterize the workload; in fact workloads with similar values of AAL and ALSD can behave differently, with respect to materialization, depending on the attributes involved in the queries. Consider for example two workloads W1 ={Q1, Q2} and W2 ={Q3, Q4} formulated on the Sales fact and the pattern of their queries:

− P1 = {Category, City}, Card(P1) = 2100
− P2 = {Type, Country}, Card(P2) = 1450
− P3 = {Category, Country}, Card(P3) = 380
− P4 = {Brand, CommZone}, Card(P4) = 680

Materializing a single view to answer both the queries in the workload is much more useful for W1, than for W2 since in the first case the ancestor is very “close” to the queries (P1⊕ P2={Type, City}) and still coarse, while in the second case it is “far” and fine (P3⊕ P4={SName, PName}).

This difference is captured by the distance between the two patterns that we calculate as:

Dist(Pi, Pj) = Agg(Pi) + Agg(Pj) - 2 Agg(Pi ⊕ Pj) (5)


Dist(Pi, Pj) is calculated in terms of the distance of Pi and Pj from their ancestor, which is the point of the multidimensional lattice closest to both views. Figure 2 shows two different situations on the same multidimensional lattice: even if the aggregation levels of the patterns are similar, the distance between each couple changes significantly. The average skewness (ASK) of the full workload W = {Q1,…,Qn} can be calculated as

ASK = ( 2 / (n·(n−1)) ) Σ (i = 1 .. n−1) Σ (j = i+1 .. n) Dist(Pi, Pj)      (6)

where Pz is the pattern of query Qz. ASK ranges in [0,2[ (the maximum value for ASK depends on the cardinalities of the attributes and on the functional dependencies defined on the hierarchies, thus it cannot be defined without considering the specific star schema). Also for the skewness indicator it is useful to calculate the standard deviation (Skewness Standard Deviation, SKSD) in order to evaluate how the distances between queries are distributed with respect to their mean value:

SKSD = √( ( 2 / (n·(n−1)) ) Σ (i = 1 .. n−1) Σ (j = i+1 .. n) ( Dist(Pi, Pj) − ASK )² )      (7)

Intuitively, workloads with low values for ASK will be efficiently optimized using materialized views since the similarity of the query patterns makes it possible to materialize few views to optimize several queries.
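The pairwise machinery can be sketched in Python as follows; skewness_profile and the callable arguments agg and ancestor are our own simplification, reusing the linear-hierarchy toy model introduced earlier for the candidate views.

```python
from itertools import combinations
from math import sqrt

def distance(p_i, p_j, agg, ancestor):
    """Dist(Pi, Pj) = Agg(Pi) + Agg(Pj) - 2 Agg(Pi ⊕ Pj)   (Eq. 5)."""
    return agg(p_i) + agg(p_j) - 2.0 * agg(ancestor(p_i, p_j))

def skewness_profile(patterns, agg, ancestor):
    """ASK (Eq. 6) and SKSD (Eq. 7) over all query pairs of the workload."""
    dists = [distance(p, q, agg, ancestor) for p, q in combinations(patterns, 2)]
    ask = sum(dists) / len(dists)
    sksd = sqrt(sum((d - ask) ** 2 for d in dists) / len(dists))
    return ask, sksd

# Toy usage: two dimensions with linear hierarchies; a pattern is a tuple of
# level indices and the per-level cardinalities below are made-up numbers.
levels_card = [(40, 150, 5000), (25, 500, 3000)]
agg = lambda p: 1 - min(1.0,
        (1 if p[0] is None else levels_card[0][p[0]]) *
        (1 if p[1] is None else levels_card[1][p[1]]) / 10_000_000)
finer = lambda a, b: b if a is None else a if b is None else max(a, b)
ancestor = lambda p, q: tuple(finer(a, b) for a, b in zip(p, q))
print(skewness_profile([(0, 1), (1, 0), (2, None)], agg, ancestor))
```

Two queries whose ancestor is nearly as coarse as the queries themselves contribute a small Dist, so a workload with low ASK is one where a few materialized ancestors can serve many queries.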

Fig. 2. Distance between close and far patterns

4 Clustering of queries

Clustering is one of the most common techniques for classifying features into groups. Several algorithms have been proposed in the literature (see [4] for a survey), each suitable for a specific class of problems. In this paper we adopt the hierarchical approach, which recursively agglomerates the two most similar clusters, forming a dendrogram whose creation can be stopped at different levels to yield different clusterings of the data, each related to a different level of similarity that will be evaluated using the statistical indicators introduced so far. The initial clusters contain a single query of the workload, which represents them. At each step the algorithm looks for the two most similar clusters, which are collapsed forming a new one that is represented by the query whose pattern is the ancestor of their representatives. Figure 3 shows the output of this process. With a little abuse of terminology we write qx⊕qy

meaning that the ancestor operator is applied to the pattern of the queries.

Fig. 3. A possible dendrogram for a workload with 6 queries

Similarity between clusters is expressed in terms of the distance, as defined in Section 3.2, between the patterns of their representatives. Each cluster is represented by the ancestor of all the queries belonging to it and is labeled with the sum of the frequencies of its queries. This simple, but effective, solution reflects the criteria adopted by the view materialization algorithms that rely on the ancestor concept when choosing one view to answer several queries. The main drawback here is that the value of AAL tends to decrease when the initial workload is strongly aggregated. Nevertheless the ancestor solution is the only one ensuring that the cluster representative effectively characterizes its queries with respect to materialization (i.e. all the queries in the cluster can be answered on a view on which the representative can also be answered). Adding new queries to a cluster inevitably induces heterogeneity in the aggregation level of its queries thus reducing its capability to represent all of them. Given a clustering Wc ={C1,…Cm}, we measure the compactness of the clusters in terms of similarity of the aggregation levels of the queries in each cluster as:

IntraALSD = (1/m) Σ (i = 1 .. m) ALSDi                                      (8)

where ALSDi is the standard deviation of the aggregation level for queries in the cluster Ci. The lower IntraALSD the closer the queries in the clusters.

As to the behavior of ASK, it tends to increase when the number of clusters decreases, since the closer queries are collapsed earlier than the far ones. While this is an obvious effect of clustering, a second relevant measure of the compactness of the clusters in Wc = {C1,…,Cm} can be expressed in terms of internal skewness:


IntraASK = (1/m) Σ (i = 1 .. m) ASKi                                        (9)

where ASKi is the skewness of the queries in the cluster Ci. The lower IntraASK the closer the queries in the clusters.

The ratio between the statistical indicators and the corresponding intra cluster ones can be used to evaluate how well the clustering models the original workload; in particular we adopted this technique to define when the clustering process must be stopped; the stop rule we adopt is as follows:

Stop if  AAL / IntraAAL > TAL  ∨  ASK / IntraASK > TSK.

In our tests both TAL and TSK have been set to 5.
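A Python sketch of the agglomeration loop is given below; cluster_workload and its arguments are our own naming, and stop_test stands in for the threshold rule above, whose exact form we keep as a pluggable callback.

```python
from itertools import combinations

def cluster_workload(queries, distance, ancestor, stop_test):
    """Greedy agglomeration of a workload.  `queries` is a list of
    (pattern, frequency) pairs.  At every step the two clusters whose
    representatives are closest are merged; the merged cluster is
    represented by the ancestor of the two representatives and labelled
    with the sum of the frequencies.  `stop_test(clusters)` decides when
    the dendrogram construction stops (in the paper, by comparing AAL and
    ASK with their intra-cluster counterparts against thresholds)."""
    clusters = [(p, f, [p]) for p, f in queries]      # (representative, frequency, members)
    while len(clusters) > 1 and not stop_test(clusters):
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: distance(clusters[ij[0]][0], clusters[ij[1]][0]))
        pi, fi, mi = clusters[i]
        pj, fj, mj = clusters[j]
        merged = (ancestor(pi, pj), fi + fj, mi + mj)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters
```

The representatives returned, together with their cumulated frequencies, form the clustered workload that is handed to the view materialization algorithm of choice.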

5 Tests and discussion

In this section we present four different tests aimed at proving the effectiveness of both profiling and clustering. The tests have been carried out on the LINEITEM fact scheme described in the TPC-H/R benchmark [9] using a set of generated workloads. Since selections are rarely taken into account by view materialization algorithms, our queries do not present any selection clause. As to the materialization algorithm, we adopted the classic one proposed in the literature by Baralis et al. [1]; the algorithm first determines the set of candidate views and then heuristically chooses the best subset that fits given space constraints. Splitting the process into two phases allows us to estimate both the difficulty of the problem, which we measure in terms of the number of candidate views, and the effectiveness of materialization, which is calculated in terms of the number of disk pages saved by materialization. The cost function we adopted computes the cost of a query Q on a star schema S composed of a fact table FT and a set {DT1,…, DTn} of dimension tables as

Cost(Q, S) = Size(FT) + Σ (i ∈ Dim(Q)) ( Size(DTi) + Size(PKi) )            (10)

where the Size() function returns the size of a table/index expressed in disk pages, Dim(Q) returns the indexes of the dimension tables involved in Q and PKi is the primary index on DTi. This cost function assumes the execution plan that is adopted by Redbrick 6.0 when no select conditions are present in a query on a star schema.
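A direct Python transcription of this cost model (with our own argument names) is tiny:

```python
def query_cost(fact_table_pages, dim_table_pages, pk_index_pages, dims_in_query):
    """Cost(Q, S) = Size(FT) + Σ_{i ∈ Dim(Q)} (Size(DT_i) + Size(PK_i)),
    all sizes expressed in disk pages (Eq. 10)."""
    return fact_table_pages + sum(dim_table_pages[i] + pk_index_pages[i]
                                  for i in dims_in_query)

# e.g. a query touching only the product and store dimensions.
print(query_cost(120_000,
                 {"product": 800, "store": 300, "date": 50},
                 {"product": 40, "store": 20, "date": 5},
                 ["product", "store"]))
```

When a query can be answered on a much smaller materialized view, applying the same page-based accounting to that view is, intuitively, where the saving measured in the experiments comes from.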

5.1 Workload features fitting

The first test shows that the statistical indicators proposed in Section 3 effectively summarize the features of a workload. Four workloads, each made up of 20 queries, have been generated with different values for the indicators. Table 1 reports the values of the parameters and the resulting number of candidate views, which confirms the considerations made in Section 3. The complexity of the problem mainly depends on the value of ASK and is only slightly influenced by AAL. The simplest workloads to be elaborated will be those with highly aggregated queries with similar patterns, while the most complex will be those with very different patterns and a low aggregation level. It should be noted that on increasing the size of the workloads, those with a "nice" profile still perform well, while the others quickly become too complex. For example workloads WKL5 and WKL6, whose profiles follow those of WKL1 and WKL4 respectively in Table 1, contain 30 queries: while the number of candidate views remains low for WKL5, it explodes for WKL6. Actually, we stopped the algorithm after two days of computation on a PENTIUM IV CPU (1GHz). The profile is also useful to evaluate how well the workload will behave with respect to view materialization. Figure 4.a shows that, regardless of the difficulty of the problems, workloads with high values of AAL are strongly optimized even when a limited disk space is available for storing materialized views. This behavior is induced by the dimension, and thus by the number, of the materialized views that fit the space constraint, as can be verified in Figure 4.b.

Table 1. Number of candidate views for workloads with different profiles

Name   AAL    ALSD   ASK    SKSD   N. Candidate views
WKL1   0.835  0.307  0.348  0.393  97
WKL2   0.186  0.245  0.327  0.269  124
WKL3   0.790  0.278  0.810  0.391  596
WKL4   0.384  0.153  0.751  0.216  868
WKL5   0.884  0.297  0.316  0.371  99
WKL6   0.352  0.276  0.668  0.354  > 36158

[Figure 4: two charts; x-axis: disk space constraint (GB), from 1.1 to 2.9; y-axes: millions of disk pages (a) and number of materialized views (b); series: WKL1-WKL4.]

Fig. 4. Cost of the workloads (a) and number of materialized views (b) on varying the disk space constraint for the workloads in Table 1

5.2 Clustering suboptimality

The second test is aimed at proving that clustering produces a good approximation of the input workload, meaning that applying view materialization to the original and to the clustered workload does not induce a too heavy suboptimality. With reference to the workloads in Table 1, Table 2 shows how the behavior and the effectiveness of the view materialization algorithm change for an increasing level of clustering. It


should be noted that the number of candidate views can be strongly reduced inducing, in most cases, a limited suboptimality. By comparing the suboptimality percentages with the statistical indicator trends presented in Figure 5, it is clear that suboptimality arises earlier for workloads where IntraAAL and IntraASK increase earlier.

5.3 Handling large workloads

When workloads with hundreds of queries are considered, it is not possible to measure the suboptimality induced by the clustered solution since the original workloads cannot be directly optimized. On the other hand, it is still possible to compare the performance increase with respect to the case with no materialized views, and it is also interesting to show how the workload costs change depending on the number of queries included in the clustered workload and how the cost is related to the statistical indicators.

Table 2. Effects of clustering on the view materialization algorithm applied to the workloads in Table 1

WKL    # Cluster  # Cand. Views  # Mat. Views  % SubOpt  Stop rule at
WKL1   15         90             12            0.001     3
       10         68             7             0.308
       5          25             3             40.511
WKL2   15         79             2             0.000     6
       10         38             2             2.561
       5          6              2             4.564
WKL3   15         549            10            1.186     7
       10         156            7             22.146
       5          16             4             65.407
WKL4   15         321            2             0.0       4
       10         129            2             0.0
       5          17             2             0.0

Table 3 reports the view materialization results for two workloads containing 200 queries each: WKL7 (AAL: 0.915, ALSD: 0.266, ASK: 0.209, SKSD: 0.398) and WKL8 (AAL: 0.377, ALSD: 0.250, ASK: 0.738, SKSD: 0.345). The data in the table and the graphs in Figure 6 confirm the behaviors deduced from the previous tests: the effectiveness of view materialization is higher for workloads with high values of AAL and low values of ASK. Also the capability of the clustering algorithm to capture the features of the original workload depends on its profile; in fact, workloads with higher values of ASK require more queries in the clustered workload (7 for WKL7 vs. 20 for WKL8) to effectively model the original one. On the other hand, it is not useful to excessively increase the clustered workload cardinality, since the performance improvement is much lower than the increase in computation time.


[Figure 5: four panels (WKL1-WKL4), each plotting AAL, ASK, IntraAAL, and IntraASK against the number of clusters, from 20 down to 2.]

Fig. 5. Trends of the statistical indicators for increasing levels of clustering and for different workloads.

Table 3. Effects of clustering on the view materialization algorithm applied to workloads WKL7 and WKL8

WKL    # Cluster  # Cand. Views  # Mat. Views  % Cost Reduction  Comp. Time (sec.)  Stop rule at
WKL7   30         12506          17            90.6              43984              6
       20         4744           15            89.0              439
       10         384            9             83.3              39
       7          64             6             38.9              24
WKL8   30         17579          5             19.1              78427              19
       20         2125           5             17.8              304
       10         129            2             2.4               25

6 Conclusions

In this paper we have discussed two techniques that make it possible to carry out view materialization when the high cardinality of the workload does not allow the problem to be faced directly. In particular, the set of statistical indicators proposed has proved to capture those workload features that are relevant to the view materialization problem, thus driving the designer's choices. The clustering algorithm allows large workloads to be handled by automatic techniques for view materialization, since it reduces their cardinality while only slightly corrupting the original characteristics. We believe that the use of the information carried by the statistical indicators we proposed can be


profitably used to increase the effectiveness of the optimization algorithms used in both logical and physical design. For example, in [6] the authors propose a technique for splitting a given quantity of disk space into two parts used for creating views and indexes respectively. Since that technique takes into account only information relative to a single query, our indicators can improve the solution by indicating whether the workload is better optimized by indexing or by view materialization.

[Figure 6: two panels (WKL7 and WKL8) plotting AAL, ASK, IntraAAL, and IntraASK against the number of clusters, from 200 down to 20.]

Fig. 6. Trends of the statistical indicators for increasing levels of clustering and for different workloads.

References

[1] E. Baralis, S. Paraboschi and E. Teniente. Materialized view selection in a multidimensional database. In Proc. 23rd VLDB, Greece, 1997.

[2] A. Ghosh, J. Parikh, V.S. Sengar and J. R. Haritsa. Plan Selection Based on Query Clustering, In Proc. 28th VLDB, Hong Kong, China, 2002.

[3] A. Gupta, V. Harinarayan and D. Quass. Aggregate-query processing in data-warehousing environments. In Proc. 21st VLDB, Switzerland, 1995.

[4] A.K. Jain, M.N. Murty and P.J. Flynn. Data Clustering: A Review. ACM Computing Surveys, Vol. 31, N. 3, September 1999.

[5] T. P. Nadeau and T. J. Teorey. Achieving scalability in OLAP materialized view selection. In Proc. DOLAP’02, Virginia USA, 2002.

[6] S. Rizzi and E. Saltarelli. View materialization vs. Indexing: balancing space constraints in Data Warehouse Design. To appear in Proc. CAISE’03, Austria, 2003.

[7] T. K. Sellis. Global Query Optimization. In Proc. SIGMOD Conference, Washington D.C., 1986, pp. 191-205.

[8] D. Theodoratos, M. Bouzeghoub. A General Framework for the View Selection Problem for Data Warehouse Design and Evolution. In Proc. DOLAP’00, Washington D.C. USA, 2000.

[9] Transaction Processing Performance Council. TPC Benchmark H (Decision Support) Standard Specification, Revision 1.1.0, 1998, http://www.tpc.org.


Incremental OPTICS: Efficient Computation of Updates in a Hierarchical Cluster Ordering

Hans-Peter Kriegel, Peer Kroger, and Irina Gotlibovich

Institute for Computer Science, University of Munich, Germany

{kriegel,kroegerp,gotlibov}@dbs.informatik.uni-muenchen.de

Abstract. Data warehouses are a challenging field of application for data mining tasks such as clustering. Usually, updates are collected and applied to the data warehouse periodically in a batch mode. As a consequence, all mined patterns discovered in the data warehouse (e.g. clustering structures) have to be updated as well. In this paper, we present a method for incrementally updating the clustering structure computed by the hierarchical clustering algorithm OPTICS. We determine the parts of the cluster ordering that are affected by update operations and develop efficient algorithms that incrementally update an existing cluster ordering. A performance evaluation of incremental OPTICS based on synthetic datasets as well as on a real-world dataset demonstrates that incremental OPTICS gains significant speed-up factors over OPTICS for update operations.

1 Introduction

Many companies gather a vast amount of corporate data. This data is typically distributed over several local databases. Since the knowledge hidden in this data is usually of great strategic importance, more and more companies integrate their corporate data into a common data warehouse. In this paper, we do not anticipate any special warehousing architecture but simply address an environment which provides derived information for the purpose of analysis and which is dynamic, i.e. many updates occur.

Usually manual or semi-automatic analysis such as OLAP cannot make use of the entire information stored in a data warehouse. Automatic data mining techniques are more appropriate to fully exploit the knowledge hidden in the data.

In this paper, we focus on clustering, which is the data mining task of grouping the objects of a database into classes such that objects within one class are similar and objects from different classes are not (according to an appropriate similarity measure). In recent years, several clustering algorithms have been proposed [1,2,3,4,5].

A data warehouse is typically not updated immediately when insertions or deletions on a member database occur. Usually updates are collected locally and applied to the common data warehouse periodically in a batch mode, e.g. each night. As a consequence, all clusters explored by clustering methods have to be updated as well. The update of the mined patterns has to be efficient because it should be finished when the warehouse has to be available for its users again, e.g. in the next morning. Since a warehouse usually


stores a large amount of data, it is highly desirable to perform updates incrementally [6]. Instead of recomputing the clusters by applying the algorithm to the entire (very large) updated database, only the old clusters and the objects inserted or deleted during a given period are considered.

In this paper, we present an incremental version of OPTICS [5] which is an efficient clustering algorithm for metric databases. OPTICS combines a density-based clustering notion with the advantages of hierarchical approaches. Due to the density-based nature of OPTICS, the insertion or deletion of an object usually causes expensive computations only in the neighborhood of this object. A reorganization of the cluster structure thus affects only a limited set of database objects. We demonstrate the advantage of the incremental version of OPTICS based on a thorough performance evaluation using several synthetic and a real-world dataset.

The remainder of this paper is organized as follows. We review related work in Section 2. Section 3 briefly introduces the clustering algorithm OPTICS. The incremental algorithms for insertions and deletions are presented in Section 4. In Section 5, the results of our performance evaluation are reported. Conclusions are presented in Section 6.

2 Related Work

Besides the tremendous number of clustering algorithms (e.g. [1,2,3,4,5]), the problem of incrementally updating mined patterns is a rather new area of research. Most work has been done in the area of developing incremental algorithms for the task of mining association rules, e.g. [7]. In [8] algorithms for incremental attribute-oriented generalization are presented.

The only algorithm for incrementally updating clusters detected by a clustering algorithm is IncrementalDBSCAN proposed in [6]. It is based on the algorithm DBSCAN [4] which models clusters as density-connected sets. Due to the density-based nature of DBSCAN, the insertion or deletion of an object affects the current clustering only in the neighborhood of this object. Based on these observations, IncrementalDBSCAN yields a significant speed-up over DBSCAN [6].

In this paper, we propose IncOPTICS, an incremental version of OPTICS [5] which combines the density-based clustering notion of DBSCAN with the advantages of hierarchical clustering concepts. Since OPTICS is an extension to DBSCAN and yields much more information about the clustering structure of a database, IncOPTICS is much more complex than IncrementalDBSCAN. However, IncOPTICS yields a considerable speed-up over OPTICS without any loss of effectiveness, i.e. quality.

3 Density-Based Hierarchical Clustering

In the following, we assume that D is a database of n objects, dist : D × D → ℝ is a metric distance function on objects in D, and Nε(p) := {q ∈ D | dist(p, q) ≤ ε} denotes the ε-neighborhood of p ∈ D, where ε ∈ ℝ.

OPTICS extends the density-connected clustering notion of DBSCAN [4] by hierarchical concepts. In contrast to DBSCAN, OPTICS does not assign cluster memberships


but computes a cluster order in which the objects are processed and additionally generates the information which would be used by an extended DBSCAN algorithm to assign cluster memberships. This information consists of only two values for each object, the core-level and the reachability-distance (or short: reachability).

Definition 1 (core-level). Let p ∈ D, MinPts ∈ ℕ, ε ∈ ℝ, and MinPts-dist(p) be the distance from p to its MinPts-nearest neighbor. The core-level of p is defined as follows:

CLev(p) := ∞ if |Nε(p)| < MinPts, and CLev(p) := MinPts-dist(p) otherwise.

Definition 2 (reachability). Let p, q ∈ D, MinPts ∈ ℕ, and ε ∈ ℝ. The reachability of p wrt. q is defined as RDist(p, q) := max{CLev(q), dist(q, p)}.
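A minimal sketch of Definitions 1 and 2 for small in-memory data; a real implementation would use a spatial index for the range queries. The brute-force neighborhood search and all names below are illustrative assumptions, not the authors' code.

    import math

    def eps_neighborhood(db, p, eps, dist):
        # N_eps(p): all objects within distance eps of p (p itself included).
        return [q for q in db if dist(p, q) <= eps]

    def core_level(db, p, eps, min_pts, dist):
        # CLev(p): MinPts-distance of p, or infinity if |N_eps(p)| < MinPts.
        neighbors = eps_neighborhood(db, p, eps, dist)
        if len(neighbors) < min_pts:
            return math.inf
        dists = sorted(dist(p, q) for q in neighbors)
        # Here p counts as its own 0-distance neighbor, matching |N_eps(p)| in Def. 1.
        return dists[min_pts - 1]

    def reachability(db, p, q, eps, min_pts, dist):
        # RDist(p, q) = max{CLev(q), dist(q, p)}.
        return max(core_level(db, q, eps, min_pts, dist), dist(q, p))

    db = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
    d = lambda a, b: math.dist(a, b)
    print(core_level(db, (0.0, 0.0), eps=1.0, min_pts=3, dist=d))   # ≈ 0.1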

Definition 3 (cluster ordering). Let MinPts ∈ ℕ, ε ∈ ℝ, and CO be a totally ordered permutation of the n objects of D. Each o ∈ D has additional attributes Pos(o), Core(o) and Reach(o), where Pos(o) symbolizes the position of o in CO. We call CO a cluster ordering wrt. ε and MinPts if the following three conditions hold:
(1) ∀p ∈ CO : Core(p) = CLev(p)
(2) ∀o, x, y ∈ CO : Pos(x) < Pos(o) ∧ Pos(y) > Pos(o) ⇒ RDist(y, x) ≥ RDist(o, x)
(3) ∀p, o ∈ CO : Reach(p) = min{RDist(p, o) | Pos(o) < Pos(p)}, where min ∅ = ∞.

Intuitively, Condition (2) states that the order is built on selecting at each position i in CO that object o having the minimum reachability to any object before i.

A cluster ordering is a powerful tool to extract flat, density-based decompositions for any ε′ ≤ ε. It is also useful to analyze the hierarchical clustering structure when plotting the reachability values for each object in the cluster ordering (cf. Fig. 1(a)).

Like DBSCAN, OPTICS uses one pass over D and computes the ε-neighborhood for each object of D to determine the core-levels and reachabilities and to compute the cluster ordering. The choice of the starting object does not affect the quality of the result [5]. The runtime of OPTICS is actually higher than that of DBSCAN because the computation of a cluster ordering is more complex than simply assigning cluster memberships and the choice of the parameter ε affects the runtime of the range queries (for OPTICS, ε has typically to be chosen significantly higher than for DBSCAN).

4 Incremental OPTICS

The key observation is that the core-level of some objects may change due to an update. As a consequence, the reachability values of some objects have to be updated as well. Therefore, condition (2) of Def. 3 may be violated, i.e. an object may have to move to another position in the cluster ordering. We will have to reorganize the cluster ordering such that condition (2) of Def. 3 is re-established. The general idea for an incremental version of OPTICS is not to recompute the ε-neighborhood for each object in D but to restrict the reorganization to a limited subset of the objects (cf. Fig. 1(b)).

Although it cannot be ensured in general, it is very likely that the reorganization is bounded to a limited part of the cluster ordering due to the density-based nature of


Fig. 1. (a) Visual analysis of the cluster ordering: clusters are valleys in the corresponding reachability plot. (b) Schema of the reorganization procedure.

OPTICS. IncOPTICS therefore proceeds in two major steps. First, the starting point for the reorganization is determined. Second, the reorganization of the cluster ordering is worked out until a valid cluster ordering is re-established. In the following, we will first discuss how to determine the frontiers of the reorganization, i.e. the starting point and the criteria for termination. We will determine two sets of objects affected by an update operation. One set, called mutating objects, contains objects that may change their core-level due to an update operation. The second set of affected objects contains objects that move forward/backwards in the cluster ordering to re-establish condition (2) of Def. 3. Movement of objects may be caused by changing reachabilities (as an effect of changing core-levels) or by moving predecessors/successors in the cluster ordering. Since we can easily compute a set of all objects possibly moving, we call this set moving objects, containing all objects that may move forward/backwards in the cluster ordering due to an update.

4.1 Mutating Objects

Obviously, an object o may change its core-level only if the update operation affects the ε-neighborhood of o. From Def. 1 it follows that if the inserted/deleted object is one of o's MinPts-nearest neighbors, Core(o) increases in case of a deletion and decreases in case of an insertion. This observation led us to the definition of the set MUTATING(p) of mutating objects:

Definition 4 (mutating objects). Let p be an arbitrary object either in or not in the cluster ordering CO. The set of objects in CO possibly mutating their core-level after the insertion/deletion of p is defined as: MUTATING(p) := {q | p ∈ Nε(q)}.

Let us note that p ∈ MUTATING(p) since p ∈ Nε(p). In fact, MUTATING(p) can be computed rather easily.

Lemma 1. ∀p ∈ D : MUTATING(p) = Nε(p).


Proof. Since dist is a metric, the following conclusions hold:
∀q ∈ Nε(p) : dist(q, p) ≤ ε ⇔ dist(p, q) ≤ ε ⇔ p ∈ Nε(q) ⇔ q ∈ MUTATING(p).

Lemma 2. Let CO be a cluster ordering and p ∈ CO. MUTATING(p) is a superset of the objects that change their core-level due to an insertion/deletion of p into/from CO.

Proof. (Sketch)
Let q ∈ MUTATING(p): Core(q) changes if p is one of q's MinPts-nearest neighbors.
Let q ∉ MUTATING(p): According to Lemma 1, p ∉ Nε(q) and thus p either cannot be any of q's MinPts-nearest neighbors or Core(q) = ∞ remains due to Def. 1.

Due to Lemma 2, we have to test for each object o ∈ MUTATING(p) whether Core(o) increases/decreases or not by computing Nε(o) (one range query).
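A minimal sketch of this step: by Lemma 1, MUTATING(p) is a single range query around p, and one more range query per member decides whether its core-level changed. It reuses eps_neighborhood and core_level from the sketch in Section 3; the driver functions below are illustrative assumptions.

    def mutating(db, p, eps, dist):
        # Lemma 1: MUTATING(p) = N_eps(p), i.e. a single range query around p.
        return eps_neighborhood(db, p, eps, dist)

    def core_changed(db_before, db_after, q, eps, min_pts, dist):
        # One range query per database state decides whether CLev(q) changed.
        return core_level(db_before, q, eps, min_pts, dist) != \
               core_level(db_after, q, eps, min_pts, dist)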

4.2 Moving Objects

The second type of affected objects move forward or backwards in the cluster ordering after an update operation. In order to determine the objects that may move forward or backwards after an update operation occurs, we first define the predecessor and the set of successors of an object:

Definition 5 (predecessor). Let CO be a cluster ordering and o ∈ CO. For each entry p ∈ CO the predecessor is defined as

Pre(p) := o if Reach(p) = RDist(o, p), and Pre(p) := UNDEFINED if Reach(p) = ∞.

Intuitively, Pre(p) is the object in CO from which p has been reached.

Definition 6 (successors). Let CO be a cluster ordering. For each object p ∈ CO the set of successors is defined as Suc(p) := {q ∈ CO | Pre(q) = p}.

Lemma 3. Let CO be a cluster ordering and p ∈ CO. If Core(p) changes due to an update operation, then each object o ∈ Suc(p) may change its reachability value.

Proof. ∀o ∈ CO: o ∈ Suc(p) ⇒ (Def. 6) Pre(o) = p ⇒ (Def. 5) Reach(o) = RDist(o, p) ⇒ (Def. 2) Reach(o) = max{Core(p), dist(p, o)}. Since the value Core(p) has changed, Reach(o) may also have changed.

As a consequence of a changed reachability value, objects may move in the cluster ordering. If the reachability-distance of an object decreases, this object may move forward such that Condition (2) of Def. 3 is not violated. On the other hand, if the reachability-distance of an object increases, this object may move backwards for the same reason. In addition, if an object has moved, all successors of this object may also move although their reachabilities remain unchanged. All such objects that may move after an insertion or deletion of p are called moving objects:


Definition 7 (moving objects). Let p be an arbitrary object either in or not in the cluster ordering CO. The set of objects possibly moving forward/backwards in CO after insertion/deletion of p is defined recursively:

(1) If x ∈ MUTATING(p) and q ∈ Suc(x) then q ∈ MOVING(p).
(2) If y ∈ MOVING(p) and q ∈ Suc(y) then q ∈ MOVING(p).
(3) If y ∈ MOVING(p) and q ∈ Pre(y) then q ∈ MOVING(p).

Case (1) states that if an object is a successor of a mutating object, it is a moving object. The other two cases state that if an object is a successor or predecessor of a moving object, it is also a moving object. Case (3) is needed if a successor of an object o is moved to a position before o during reorganization.

For the reorganization of moving objects we do not have to compute range queries. We solely need to compare the old reachability values to decide whether these objects have to move or not.
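A minimal sketch of Definition 7, assuming the cluster ordering stores, for each object, its predecessor Pre(o) and successor set Suc(o); the closure traversal below is an illustrative assumption, not the authors' code.

    from collections import deque

    def moving(mutating_set, suc, pre):
        # mutating_set : iterable of objects in MUTATING(p)
        # suc          : dict object -> set of successors Suc(o)
        # pre          : dict object -> predecessor Pre(o) or None
        moving_set = set()
        queue = deque()
        # Case (1): successors of mutating objects are moving.
        for x in mutating_set:
            for q in suc.get(x, ()):
                if q not in moving_set:
                    moving_set.add(q)
                    queue.append(q)
        # Cases (2) and (3): successors and the predecessor of moving objects.
        while queue:
            y = queue.popleft()
            neighbors = set(suc.get(y, ()))
            if pre.get(y) is not None:
                neighbors.add(pre[y])
            for q in neighbors:
                if q not in moving_set:
                    moving_set.add(q)
                    queue.append(q)
        return moving_set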

4.3 Limits of Reorganization

We are now able to determine between which bounds the cluster ordering must be reorganized to re-establish a valid cluster ordering according to Def. 3.

Lemma 4. Let CO be a cluster ordering and p be an object either in or not in CO. The set of objects that have to be reorganized due to an insertion or deletion of p is a subset of MUTATING(p) ∪ MOVING(p).

Proof. (Sketch)
Let o be an object which has to be reorganized. If o has to be reorganized due to a change of Core(o), then o ∈ MUTATING(p). Else o has to be reorganized due to a changed reachability or due to moving predecessors/successors. Then o ∈ MOVING(p).

Since OPTICS is based on the formalisms of DBSCAN, the determination of the start position for reorganization is rather easy. We simply have to determine the first object in the cluster ordering whose core-level changes after the insertion or deletion, because reorganization is only initiated by changing core-levels.

Lemma 5. Let CO be a cluster ordering which is updated by an insertion or deletion of object p. The object o ∈ D is the start object where reorganization starts if the following conditions hold:

(1) o ∈ MUTATING(p)
(2) ∀q ∈ MUTATING(p), o ≠ q : Pos(o) ≤ Pos(q).

Proof. Since reorganization is caused by changing core-levels, the start object must change its core-level due to the update. (1) follows from Def. 4. According to Def. 7, each q ∈ Suc(p) can be affected by the reorganization. To ensure that no object is lost by the reorganization procedure, o has to be the first object whose core-level has changed (⇒ (2)). In addition, all objects before o are neither elements of MUTATING(p) nor of MOVING(p). Therefore, they do not have to be reorganized.


WHILE NOT Seeds.isEmpty() DO
  // Decide which object is added next to COnew
  IF currObj.reach > Seeds.first().reach THEN
    COnew.add(Seeds.first());
    Seeds.removeFirst();
  ELSE
    COnew.add(currObj);
    currObj = next object in COold which has not yet been inserted into COnew;

  // Decide which objects are inserted into Seeds
  q = COnew.lastInsertedObject();
  IF q ∈ MUTATING(p) THEN
    FOR EACH o ∈ Nε(q) which has not yet been inserted into COnew DO
      Seeds.insert(o, max{q.core, dist(q, o)});
  ELSE IF q ∈ MOVING(p) THEN
    FOR EACH o ∈ Pre(q) OR o ∈ Suc(q) which has not yet been inserted into COnew DO
      Seeds.insert(o, o.reach);

Fig. 2. IncOPTICS: Reorganization of the cluster ordering

4.4 Reorganizing a Cluster Ordering

In the following, COold denotes the old cluster ordering before the update and COnew denotes the updated cluster ordering which is computed by IncOPTICS. After the start object so has been determined according to Lemma 5, all objects q ∈ COold with Pos(q) < Pos(so) can be copied into COnew (cf. Fig. 1(b)) because up to the position of so, COold is a valid cluster ordering.

The reorganization of CO begins at so and imitates OPTICS. The pseudo-code of the procedure is depicted in Fig. 2. It is assumed that each not yet handled o ∈ Nε(so) is inserted into the priority queue Seeds which manages all not yet handled objects from MOVING(p) ∪ MUTATING(p) (i.e. all o ∈ MOVING(p) ∪ MUTATING(p) with Pos(o) ≥ Pos(so)) sorted in the order of ascending reachabilities.

In each step of the reorganization loop, the reachability of the first object in Seeds is compared with the reachability of the current object in COold. The entry with the smallest reachability is inserted into the next free position of COnew. In case of a delete operation, this step is skipped if the considered object is the update object. After this insertion, Seeds has to be updated depending on which object has recently been inserted. If the inserted object is an element of MUTATING(p), all neighbors that are currently not yet handled may change their reachabilities. If the inserted object is an element of MOVING(p), all predecessors and successors that are currently not yet handled may move. In both cases, the corresponding objects are inserted into Seeds using the method Seeds::insert which inserts an object with its current reachability or updates the reachability of an object if it is already in the priority queue. If a predecessor is inserted into Seeds, its reachability has to be recomputed (which means a distance calculation in the worst case) because RDist(., .) is not symmetric.

According to Lemma 4, the reorganization terminates if there are no more objects in Seeds, i.e. all objects in MOVING(p) ∪ MUTATING(p) that have to be processed are


Fig. 3. Runtime of OPTICS vs. average and maximum runtime of IncOPTICS: (a) insertion, (b) deletion.

handled. COnew is filled with all objects from COold which are not yet handled (and thus need not be considered by the reorganization), maintaining the order determined by COold (cf. Fig. 1(b)). The resulting COnew is valid according to Def. 3.

5 Experimental Evaluation

We evaluated IncOPTICS using four synthetic datasets consisting of 100,000, 200,000, 300,000, and 500,000 2-dimensional points and a real-world dataset consisting of 112,361 TV snapshots encoded as 64-dimensional color histograms. All experiments were run on a workstation featuring a 2 GHz CPU and 3.5 GB RAM. An X-Tree was used to speed up the range queries computed by OPTICS and IncOPTICS.

We performed 100 random updates (insertions and deletions) on each of the synthetic datasets and compared the runtime of OPTICS with the maximum and average runtimes of IncOPTICS (insert/delete) on the random updates. The results are depicted in Fig. 3. We observed average speed-up factors of about 45 and 25 and worst-case speed-up factors of about 20 and 17 in case of insertion and deletion, respectively. A similar observation, but on a lower level, can be made when evaluating the performance of OPTICS and IncOPTICS applied to the real-world dataset. The worst ever observed speed-up factor for the real-world dataset was 3. In Fig. 5(a) the average runtimes of IncOPTICS for the best 10 inserted and deleted objects are compared with the runtime of OPTICS using the TV dataset.

A possible reason for the large speed-up is that IncOPTICS saves a lot of range queries. This is shown in Fig. 4(a) and 4(b) where we compared the average and maximum number of range queries and moved objects, respectively. The cardinality of the set MUTATING(p) is depicted as “RQ” and the cardinality of the set MOVING(p) is depicted as “MO” in the figures. It can be seen that IncOPTICS saves a lot of range queries compared to OPTICS. For high dimensional data this observation is even more important since the logarithmic runtime of most index structures for a single range query degenerates to a linear runtime. Fig. 5(b), presenting the average cardinality of


Fig. 4. Comparison of average and maximum cardinalities of MOVING(p) vs. MUTATING(p): (a) insertion, (b) deletion.

the sets of mutating objects and moving objects for incremental insertion/deletion, illustrates this effect. Since the number of objects which have to be reorganized is rather high in case of insertion or deletion, the runtime speed-up is caused by the strong reduction of range queries (cf. bars “IncInsert RQ” and “IncDelete RQ” in Fig. 5(b)).

We separately analyzed the objects o whose insertions/deletions caused the highest runtime. Thereby, we found out that the biggest part of the high runtimes originated from the reorganization step due to a high cardinality of the set MOVING(o). We further observed that these objects causing high update runtimes are usually located between two clusters and objects in MUTATING(o) belong to more than one cluster. Since spatially neighboring clusters need not be adjacent in the cluster ordering, the reorganization affects a lot more objects. This observation is important because it indicates that the runtimes are more likely near the average case than near the worst case, especially for insert operations, since most inserted objects will probably reproduce the distribution of the already existing data. Let us note that, since the tests on the TV dataset were run using unfavourable objects, the performance results are less impressive than the results on the synthetic datasets.

6 Conclusions

In this paper, we proposed an incremental algorithm for mining hierarchical clustering structures based on OPTICS. Due to the density-based notion of OPTICS, insertions and deletions affect only a limited subset of objects directly, i.e. a change of their core-level may occur. We identified a second set of objects which are indirectly affected by update operations and thus may move forward or backwards in the cluster ordering. Based on these considerations, efficient algorithms for incremental insertions and deletions of a cluster ordering were suggested.

A performance evaluation of IncOPTICS using synthetic as well as real-world databases demonstrated the efficiency of the proposed algorithm.


Fig. 5. Runtimes and affected objects of IncOPTICS vs. OPTICS applied on the TV data: (a) runtimes, (b) affected objects.

Comparing these results to the performance of IncrementalDBSCAN, which achieves much higher speed-up factors over DBSCAN, it should be mentioned that incremental hierarchical clustering is much more complex than incremental “flat” clustering. In fact, OPTICS generates considerably more information than DBSCAN and thus IncOPTICS is suitable for a much broader range of applications compared to IncrementalDBSCAN.

References

1. MacQueen, J.: ”Some Methods for Classification and Analysis of Multivariate Observations”. In: 5th Berkeley Symp. Math. Statist. Prob. Volume 1. (1967) 281–297

2. Ng, R., Han, J.: ”Efficient and Effective Clustering Methods for Spatial Data Mining”. In: Proc. 20th Int. Conf. on Very Large Databases (VLDB’94), Santiago, Chile. (1994) 144–155

3. Zhang, T., Ramakrishnan, R., Livny, M.: ”BIRCH: An Efficient Data Clustering Method for Very Large Databases”. In: Proc. ACM SIGMOD Int. Conf. on Management of Data (SIGMOD’96), Montreal, Canada. (1996) 103–114

4. Ester, M., Kriegel, H.P., Sander, J., Xu, X.: ”A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise”. In: Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining (KDD’96), Portland, OR, AAAI Press (1996) 291–316

5. Ankerst, M., Breunig, M.M., Kriegel, H.P., Sander, J.: ”OPTICS: Ordering Points to Identify the Clustering Structure”. In: Proc. ACM SIGMOD Int. Conf. on Management of Data (SIGMOD’99), Philadelphia, PA. (1999) 49–60

6. Ester, M., Kriegel, H.P., Sander, J., Wimmer, M., Xu, X.: ”Incremental Clustering for Mining in a Data Warehousing Environment”. In: Proc. 24th Int. Conf. on Very Large Databases (VLDB’98). (1998) 323–333

7. Feldman, R., Aumann, Y., Amir, A., Mannila, H.: ”Efficient Algorithms for Discovering Frequent Sets in Incremental Databases”. In: Proc. ACM SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, Tucson, AZ. (1997) 59–66

8. Ester, M., Wittman, R.: ”Incremental Generalization for Mining in a Data Warehousing Environment”. In: Proc. 6th Int. Conf. on Extending Database Technology, Valencia, Spain. Volume 1377 of Lecture Notes in Computer Science (LNCS), Springer (1998) 135–152



Cluster Validity Using Support Vector Machines

Vladimir Estivill-Castro1 and Jianhua Yang2

1 Griffith University, Brisbane QLD 4111, Australia
2 The University of Western Sydney, Campbelltown, NSW 2560, Australia

Abstract. Gaining confidence that a clustering algorithm has produced meaningful results and not an accident of its usually heuristic optimization is central to data analysis. This is the issue of validity and we propose here a method by which Support Vector Machines are used to evaluate the separation in the clustering results. However, we not only obtain a method to compare clustering results from different algorithms or different runs of the same algorithm, but we can also filter noise and outliers. Thus, for a fixed data set we can identify what is the most robust and potentially meaningful clustering result. A set of experiments illustrates the steps of our approach.

1 Introduction

Clustering is challenging because normally there is no a priori information about structure in the data or about potential parameters, like the number of clusters. Thus, assumptions make it possible to select a model to fit to the data. For instance, k-Means fits mixture models of normals with covariance matrices set to the identity matrix. k-Means is widely applied because of its speed; but, because of its simplicity, it is statistically biased and statistically inconsistent, and thus it may produce poor (invalid) results. Hence, clustering depends significantly on the data and the way the algorithm represents (models) structure for the data [8]. The purpose of clustering validity is to increase the confidence about groups proposed by a clustering algorithm. The validity of results is of utmost importance, since patterns in data will be far from useful if they were invalid [7].

Validity is a certain amount of confidence that the clusters found are actually somehow significant [6]. That is, the hypothetical structure postulated as the result of a clustering algorithm must be tested to gain confidence that it actually exists in the data. A fundamental way is to measure how “natural” the resulting clusters are. Here, formalizing how “natural” a partition is implies fitting metrics between the clusters and the data structure [8]. Compactness and separation are two main criteria proposed for comparing clustering schemes [17]. Compactness means the members of each cluster should be as close to each other as possible. Separation means the clusters themselves should be widely spaced.

Novelty detection and concepts of maximizing margins based on Support Vector Machines (SVMs) and related kernel methods make them favorable for verifying that there is a separation (a margin) between the clusters of an algorithm's output. In this sense, we propose to use SVMs for validating data models,


and confirm that the structure revealed in clustering results is indeed of some significance. We propose that an analysis of the magnitude of margins and the (relative) number of Support Vectors increases the confidence that a clustering output does separate clusters and creates meaningful groups. The confirmation of separation in the results can be gradually realized by controlling training parameters. At a minimum, our approach is able to discriminate between two outputs of two clustering algorithms and identify the more significant one.

Section 2 presents relevant aspects of Support Vector Machines for our clustering validity approach. Section 3 presents our techniques. Section 4 presents experimental results. We then conclude with a discussion of related work.

2 Support Vector Machines

Our cluster validity method measures margins and analyzes the number of Support Vectors. Thus, a summary of Support Vector Machines (SVMs) is necessary. The foundations of SVMs were developed by Vapnik [16] and are gaining popularity due to promising empirical performance [9]. The approach is systematic, reproducible, and motivated by statistical learning theory. The training formulation embodies optimization of a convex cost function, thus all local minima are global minima in the learning process [1]. SVMs can provide good generalization performance on data mining tasks without incorporating problem domain knowledge. SVMs have been successfully extended from basic classification tasks to handle regression, operator inversion, density estimation, novelty detection, clustering and to include other desirable properties, such as invariance under symmetries and robustness in the presence of noise [15, 1, 16]. In addition to their accuracy, a key characteristic of SVMs is their mathematical tractability and geometric interpretation.

Consider the supervised problem of finding a separator for a set of training samples {(xi, yi) | i = 1, ..., l} belonging to two classes, where xi is the input vector for the ith example and yi is the target output. We assume that for the positive subset yi = +1 while for the negative subset yi = −1. If positive and negative examples are “linearly separable”, the convex hulls of positive and negative examples are disjoint. The closest pair of points in the respective convex hulls lie on the hyper-planes wT x + b = ±1. The separation between the hyper-plane and the closest data point is called the margin of separation and is denoted by γ. The goal of SVMs is to choose the hyper-plane whose parameters w and b maximize γ = 1/‖w‖; essentially a quadratic minimization problem (minimize ‖w‖).

Under these conditions, the decision surface wT x + b is referred to as the optimal hyper-plane. The particular data points (xi, yi) that satisfy yi[wT xi + b] = 1 are called Support Vectors, hence the name “Support Vector Machines”. In conceptual terms, the Support Vectors are those data points that lie closest to the decision surface and are the most difficult to classify. As such, they directly influence the location of the decision surface [10].


If the two classes are nonlinearly separable, the variants called φ-machines map the input space S = {x1, . . . , xl} into a high-dimensional feature space F = {φ(xi) | i = 1, . . . , l}. By choosing an adequate mapping φ, the input samples become linearly or mostly linearly separable in feature space. SVMs are capable of providing good generalization for high dimensional training data, since the complexity of the optimal hyper-plane can be carefully controlled independently of the number of dimensions [5]. SVMs can deal with arbitrary boundaries in data space, and are not limited to linear discriminants. For our cluster validity, we make use of the features of the ν-Support Vector Machine (ν-SVM). The ν-SVM is a new class of SVMs that has the advantage of using a parameter ν for effectively controlling the number of Support Vectors [14, 18, 4]. Again consider training vectors xi ∈ ℝd, i = 1, ..., l labeled in two classes by a label vector y ∈ ℝl such that yi ∈ {1, −1}. As a primal problem for ν-Support Vector Classification (ν-SVC), we consider the following minimization:

Minimize    (1/2)‖w‖² − νρ + (1/l) Σi ξi
subject to  yi(wT φ(xi) + b) ≥ ρ − ξi,
            ξi ≥ 0, i = 1, ..., l,  ρ ≥ 0,        (1)

where

1. Training vectors xi are mapped into a higher dimensional feature space through the function φ, and

2. Non-negative slack variables ξi for soft margin control are penalized in the objective function.

The parameter ρ is such that when ξT = (ξ1, · · · , ξl) = 0, the margin of separation is γ = ρ/‖w‖. The parameter ν ∈ [0, 1] has been shown to be an upper bound of the fraction of margin errors and a lower bound of the fraction of Support Vectors [18, 4]. In practice, the above primal problem is usually solved through its dual by introducing Lagrangian multipliers and incorporating kernels:

Minimize    (1/2) αT (Q + yyT) α
subject to  0 ≤ αi ≤ 1/l, i = 1, ..., l,
            eT α ≥ ν        (2)

where Q is a positive semidefinite matrix with Qij ≡ yi yj k(xi, xj), k(xi, xj) = φ(xi)T · φ(xj) is a kernel, and e is a vector of all ones.
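A minimal sketch of how the matrix Q can be assembled, assuming the Gaussian kernel adopted later in this section (written with the paper's convention q < 0); NumPy and the example data are illustrative assumptions.

    import numpy as np

    def rbf_kernel_matrix(X, q=-0.005):
        # k_q(x, x') = exp(q * ||x - x'||^2), with q = -1/(2*sigma^2), q < 0.
        sq_norms = np.sum(X**2, axis=1)
        sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T
        return np.exp(q * sq_dists)

    def dual_matrix(X, y, q=-0.005):
        # Q_ij = y_i * y_j * k(x_i, x_j); positive definite for the RBF kernel.
        K = rbf_kernel_matrix(X, q)
        return (y[:, None] * y[None, :]) * K

    X = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0], [5.0, 5.0]])
    y = np.array([1, 1, -1, -1])
    print(dual_matrix(X, y).shape)   # (4, 4)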

The context for solving this dual problem is presented in [18, 4]; some conclusions are useful for our cluster validity approach.

Proposition 1. Suppose ν-SVC leads to ρ > 0, then regular C-SVC with parameter C set a priori to 1/ρ leads to the same decision function.

Lemma 1. Optimization problem (2) is feasible if and only if ν ≤ νmax, where νmax = 2 min(#{yi = 1}, #{yi = −1})/l, and #{yi = 1}, #{yi = −1} denote the number of elements in the first and second classes respectively.

Corollary 1. If Q is positive definite, then the training data are separable.


Thus, we note that νl is a lower bound of the number of Support Vectors (SVs) and an upper bound of the number of misclassified training data. These misclassified data are treated as outliers and called Bounded Support Vectors (BSVs). The larger we select ν, the more points are allowed to lie inside the margin; if ν is smaller, the total number of Support Vectors decreases accordingly. Proposition 1 describes the relation between standard C-SVC and ν-SVC, and an interesting interpretation of the regularization parameter C. The increase of C in C-SVC is like the decrease of ν in ν-SVC. Lemma 1 shows that the size of νmax depends on how balanced the training set is. If the numbers of positive and negative examples match, then νmax = 1. Corollary 1 helps us verify whether a training problem under a given kernel is separable.

We do not assume the original cluster results are separable, but it is favorable to use balls to describe the data in feature space by choosing RBF kernels. If the RBF kernel is used, Q is positive definite [4]. Also, RBF kernels yield appropriately tight contour representations of a cluster [15]. Again, we can try to put most of the data into a small ball to maximize the classification problem, and the bound of the probability of points falling outside the ball can be controlled by the parameter ν. For a kernel k(x, x′) that only depends on x − x′, k(x, x) is constant, so the linear term in the dual target function is constant. This simplifies computation. So in our cluster validity approach, we will use the Gaussian kernels kq(x, x′) = exp(q‖x − x′‖²) with width parameter q = −1/(2σ²) (note q < 0). In this situation, the number of Support Vectors depends on both ν and q. When q's magnitude increases, boundaries become rough (the derivative oscillates more), since a large fraction of the data turns into SVs, especially those potential outliers that are broken off from core data points in the form of SVs. But no outliers will be allowed if ν = 0. By increasing ν, more SVs will be turned into outliers or BSVs. Parameters ν and q will be used alternatively in the following sections.

3 Cluster Validity Using SVMs

We apply SVMs to the output of clustering algorithms, and show they learn the structure inherent in clustering results. By checking the complexity of boundaries, we are able to verify if there are significant “valleys” between data clusters and how outliers are distributed. All these are readily computable from the data in a supervised manner through SVMs training.

Our approach is based on three properties of clustering results. First, good clustering results should separate clusters well; thus in good clustering results we should find separation (relatively large margins between clusters). Second, there should be high density concentration in the core of the cluster (what has been named compactness). Third, removing a few points in the core shall not affect their shape. However, points in cluster boundaries are in sparse regions and perturbing them does change the shape of boundaries.


To verify separation pairwise, we learn the margin γ from SVMs training; then we choose the top ranked SVs (we propose 5) from a pair of clusters and their k (also 5) nearest neighbors. We measure the average distance of these SVs from their projected neighbors from each cluster (projected along the normal of the optimal hyper-plane). We let this average be γ1 for the first cluster in a pair and we denote it as γ2 for the second cluster. We compare γ with γi. Given scalars t1 and t2, the relation between the local measures and the margin is evaluated by analyzing if any of the following conditions holds:

Condition 1: γ1 < t1 · γ or γ2 < t1 · γ;   Condition 2: γ1 > t2 · γ or γ2 > t2 · γ.        (3)

If either of them holds for carefully selected control parameters t1 and t2, the clusters are separable; otherwise they are not separable (we recommend t1 = 0.5 and t2 = 2). This separation test can discriminate between two results of a clustering algorithm. That is, when facing two results, maybe because the algorithm is randomized or because two clustering methods are applied, we increase the confidence (and thus the preference to believe one is more valid than the other) by selecting the clustering result that shows fewer pairs of non-separable classes.
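A minimal sketch of the pairwise separation test in Eq. (3); γ is the margin learned by the SVM, and γ1/γ2 the average local distances around the top-ranked SVs of the two clusters, all assumed to be precomputed. The function name is an illustrative assumption.

    def separable(gamma, gamma1, gamma2, t1=0.5, t2=2.0):
        cond1 = gamma1 < t1 * gamma or gamma2 < t1 * gamma
        cond2 = gamma1 > t2 * gamma or gamma2 > t2 * gamma
        return cond1 or cond2

    # Values from the first round of the 2D example (Fig. 3(b)):
    print(separable(0.019004, 0.038670, 0.055341))   # True: both local measures exceed 2*gamma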

To verify the compactness of each cluster, we control the number of SVs and BSVs. As mentioned before, the parameter q of the Gaussian kernel determines the scale at which the data is probed, and as its magnitude increases, more SVs result; in particular, potential outliers tend to appear isolated as BSVs. However, to allow for BSVs, the parameter ν should be greater than 0. This parameter enables analyzing points that are hard to assign to a class because they are away from high density areas. We refer to these as noise or outliers, and they will usually host BSVs. As shown by the theorems cited above, controlling q and ν provides us a mechanism for verifying compactness of clusters.

We verify robustness by checking the stability of the cluster assignment. After removing a fraction of BSVs, if reclustering results in repeatable assignments, we conclude that the cores of classes exist and outliers have been detected.

We test the confidence of the result of applying an arbitrary clustering algorithm A to a data set as follows. If the clustering result is repeatable (compact and robust to our removal of BSVs and their nearest neighbors) and separable (in the sense of having a margin a fraction larger than the average distance between SVs), this maximizes our confidence that the data does reflect this clustering and is not an artifact of the clustering algorithm. We say the clustering result has a maximum sense of validity. On the other hand, if reclustering results are not quite repeatable but well separable, or repeatable but not quite separable, we still call the current run a valid run. Our approach may still find valid clusters. However, if reclustering shows output that is neither separable nor repeatable, we call the current run an invalid run. In this case, the BSVs removed in the last run may not be outliers, and they should be recovered for a reclustering.

We discriminate runs further by repeating the above validity test for several rounds. If consecutive clustering results converge to a stable assignment (i.e. the result from each run is repeatable and separable), we claim that potential outliers have been removed, and cores of clusters have emerged. If repetition of


the analysis still produces invalid runs (clustering solutions differ across runs without good separation), the clustering results are not interesting.

In order to set the parameters of our method we conducted a series of experiments we summarize here¹. We determined parameters for separation and compactness checking first. The data sets used were in different shapes to ensure generality. The LibSVM [3] SVMs library has been used in our implementation of our cluster validity scheme.

The first evaluation of separation accurately measured the margin between two clusters. To ensure the lower error bound, we use a hard margin training strategy by setting ν = 0.01 and q = 0.001. This allows for few BSVs. In this evaluation, six data sets, each with 972 points uniformly and randomly generated in 2 boxes, were used. The margin between the boxes is decreasing across data sets. To verify the separation of a pair of clusters, we calculated the values of γ1 and γ2. Our process compared them with the margin γ and inspected the difference. The experiment showed that the larger the discrepancies between γ1 and γ (or γ2 and γ), the more separable the clusters are. In general, if γ1 < 0.5γ or γ2 < 0.5γ, the two clusters are separable. Thus, the choice of value for t1.

Secondly, we analyzed other possible cases of the separation test. This included (a) both γ1 and γ2 much larger than γ; (b) a small difference between γ1 and γ, but the difference between γ2 and γ is significant; (c) a significant difference between γ1 and γ, although there is not much difference between γ2 and γ. Again, we set t1 = 0.5 and t2 = 2 for this test. Then, according to the verification rules of separation (in Equation (3)), all of these examples were declared separable, coinciding with our expectation.

Third, we tested noisy situations and non-convex clusters. Occasionally clustering results might not accurately describe the groups in the data or are hard to interpret because noise is present and outliers may mask data models. When these potential outliers are tested and removed, the cores of clusters appear. We performed a test that showed that, in the presence of noise, our approach works as a filter and the structure or model fit to the data becomes clearer. A ring-shaped cluster with 558 points surrounded by noise and another spherical cluster were in the dataset. A ν-SVC trained with ν = 0.1 and q = 0.001 results in 51 BSVs. After filtering these BSVs (outliers are more likely to become BSVs), our method showed a clear data model that has two significantly isolated dense clusters. Moreover, if a ν-SVC is trained again with ν = 0.05 and q = 0.001 on the clearer model, fewer BSVs (17) are generated (see Fig. 1)³.

As we discussed, the existence of outliers complicates clustering results. Thesemay be valid, but separation and compactness are also distorted. The repeatedperformance of a clustering algorithm depends on the previous clustering results.If these results have recognized compact clusters with cores, then they becomerobust to our removal of BSVs. There are two cases. In the first case, the last twoconsecutive runs of algorithm A (separated by an application of BSVs removal)are consistent. That is, the clustering results are repeatable. The alternative1 The reader can obtain an extended version of this submission with large figures in

www.cit.gu.edu.au/˜s2130677/publications.html


Fig. 1. Illustration of outlier checking (panels (a)-(c)). Circled points are SVs.

Fig. 2. For an initial clustering (produced by k-Means) that gives non-compact classes, reclustering results are not repeated when outliers are removed. 2(a) Results of the original first run (clustering structure C1). 2(b) Test for outliers (SVs in circles). 2(c) Reclustering results (clustering structure C2); R = 0.5077, J = 0.3924, FM = 0.5637.

case is that reclustering with A after BSVs removal is not concordant with the previous result. Our check for repeated performance of clustering results verifies this. We experimented with 1000 points drawn from a mixture data model³; with training parameters for ν-SVC set to ν = 0.05 and q = 0.005, we showed that the reclustering results can become repeatable, leading to valid results (see Figs. 3(a), 3(c) and 3(d))³. However, we also showed cases where an initial invalid clustering does not lead to repeatable results (see Figs. 2(a), 2(b) and 2(c))³. To measure the degree of repeated performance between clustering results of two different runs, we adopt indexes of external criteria used in cluster validity. External criteria are usually used for comparing a clustering structure C with a predetermined partition P for a given data set X. Instead of referring to a predetermined partition P of X, we measure the degree of agreement between two consecutively produced clustering structures C1 and C2. The indexes we use are the Rand statistic R, the Jaccard coefficient J and the Fowlkes-Mallows index FM [12]. The values of these three statistics are between 0 and 1. The larger their value, the higher the degree to which C1 matches C2.
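A minimal sketch of the three external indices computed by pair counting over two labelings C1 and C2 of the same points; the implementation below is an illustrative assumption, not the authors' code.

    from itertools import combinations
    from math import sqrt

    def external_indices(labels1, labels2):
        # Count point pairs by whether they share a cluster in C1 and in C2.
        ss = sd = ds = dd = 0
        for i, j in combinations(range(len(labels1)), 2):
            same1 = labels1[i] == labels1[j]
            same2 = labels2[i] == labels2[j]
            if same1 and same2:        ss += 1
            elif same1 and not same2:  sd += 1
            elif not same1 and same2:  ds += 1
            else:                      dd += 1
        total = ss + sd + ds + dd
        rand = (ss + dd) / total
        jaccard = ss / (ss + sd + ds) if ss + sd + ds else 1.0
        fm = ss / sqrt((ss + sd) * (ss + ds)) if (ss + sd) and (ss + ds) else 1.0
        return rand, jaccard, fm

    print(external_indices([0, 0, 1, 1], [0, 0, 1, 1]))   # (1.0, 1.0, 1.0)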


4 Experimental Results

First, we use a 2D dataset for a detailed illustration of our cluster validity testing using SVMs (Fig. 3). The 2D data set is from a mixture model and consists of 1000 points. The k-medoids algorithm assigns two clusters. The validity process will be conducted in several rounds. Each round consists of reclustering and our SVMs analysis (compactness checking, separation verification, and outlier splitting and filtering). The process stops when a clear clustering structure appears (this is identified because it is separable and repeatable), or after several rounds (we recommend six). Several runs that do not suggest a valid result indicate the clustering method is not finding reasonable clusters in the data. For the separation test in this example, we train ν-SVC with parameters ν = 0.01 and q = 0.0005. To filter potential outliers, we conduct ν-SVC with ν = 0.05 but different q in every round. The first round starts with q = 0.005, and q will be doubled in each following round.

Fig. 3(b) and Fig. 3(c)³ show the separation test and compactness evaluation, respectively, corresponding to the first round. We observed that the cluster results are separable. Fig. 3(b) indicates γ1 > 2γ and γ2 > 2γ. Fig. 3(c) shows the SVs generated, where 39 BSVs will be filtered as potential outliers. We perform reclustering after filtering outliers, and match the current cluster structure to the previous clustering structure. The values of the indexes R = 1 (J = 1 and FM = 1) indicate compactness. Similarly, the second round up to the fourth round also show a repeatable and separable clustering structure. We conclude that the original cluster results can be considered valid.

We now show our cluster validity testing using SVMs on a 3D data set (see Fig. 4)³. The data set is from a mixture model and consists of 2000 points. The algorithm k-Means assigns three clusters. The validity process is similar to that in the 2D example. After five rounds of reclustering and SVMs analysis, the validity process stops, and a clear clustering structure appears. For the separation test in this example, we train ν-SVC with parameters ν = 0.01 and q = 0.0005. To filter potential outliers, we conduct ν-SVC with ν = 0.05 but different q in every round. The first round starts with q = 0.005, and q will be doubled in each following round.

In the figure, we show the effect of a round with a 3D view of the data followed by the separation test and the compactness verification. To give a 3D view effect, we construct convex hulls of clusters. For the separation and the compactness checking, we use projections along the z axis. Because of the pairwise analysis, we denote by γi,j the margin between clusters i and j, while γi(i,j) is the neighborhood dispersion measure of SVs in cluster i with respect to the pair of clusters i and j. Fig. 4(a) illustrates a 3D view of the original clustering result. Fig. 4(b) and Fig. 4(c)³ show the separation test and compactness evaluation, respectively, corresponding to the first round. Fig. 4(b) indicates γ1(1,2)/γ1,2 = 6.8, γ1(1,3)/γ1,3 = 11.2 and γ2(2,3)/γ2,3 = 21.2. Thus, we conclude that the cluster results are separable in the first run. Fig. 4(c) shows the SVs generated, where 63 BSVs will be filtered as potential outliers. We perform reclustering after filtering outliers, and match the current cluster structure to the previous clustering structure. Index values



[Figure 3 panels: (a) original clustering structure C1; (b) γ = 0.019004, γ1 = 0.038670, γ2 = 0.055341; (c) SVs in circles, BSVs = 39, R = J = FM = 1; (d) structure C2 from reclustering; (e) γ = 0.062401, γ1 = 0.002313, γ2 = 0.003085; (f) BSVs = 39, R = J = FM = 1; (g) γ = 0.070210, γ1 = 0.002349, γ2 = 0.002081; (h) BSVs = 41, R = J = FM = 1; (i) γ = 0.071086, γ1 = 0.005766, γ2 = 0.004546; (j) BSVs = 41, R = J = FM = 1; (k) γ = 0.071159, γ1 = 0.002585, γ2 = 0.003663]

Fig. 3. A 2D example of cluster validity through the SVMs approach. Circled points are SVs. The original first run results in compact classes (3(a)). 3(c): test for outliers. 3(d): reclustering results; R = 1.0, J = 1.0, FM = 1.0. 3(b) and 3(c): separation check and compactness verification of the first round. 3(e) and 3(f): separation check and compactness verification of the second round. 3(g) and 3(h): separation check and compactness verification of the third round. 3(i) and 3(j): separation check and compactness verification of the fourth round. 3(k): clearly separable and repeatable clustering structure.



Index values of R = 1 indicate the compactness of the result in the previous run. Similarly, the second round up to the fifth round also show a repeatable and separable clustering structure. Thus, the original cluster results can be considered valid.

5 Related Work and Discussion

Various methods have been proposed for clustering validity. The most common approaches are formal indexes of cohesion or separation (and their distribution with respect to a null hypothesis). A clear and comprehensive description of these statistical tools is available in [11, 17]. These tools have been designed to carry out hypothesis testing to increase the confidence that the results of clustering algorithms reflect actual structure in the data (structure understood as discrepancy from the null hypothesis). However, even these mathematically defined indexes face many difficulties. In almost all practical settings, this statistic-based methodology for validity faces the challenging computation of the probability density function of the indexes, which complicates the hypothesis testing approach around the null hypothesis [17].

Bezdek [2] realized that it seemed impossible to formulate a theoretical null hypothesis that could be used to substantiate or repudiate the validity of algorithmically suggested clusters. The information contained in data models can also be captured using concepts from information theory [8]. In specialized cases, like conceptual schema clustering, formal validation has been used for suggesting and verifying certain properties [19]. While formal validity guarantees the consistency of clustering operations in some special cases like information system modeling, it is not a general-purpose method. On the other hand, if the use of more sophisticated mathematics requires more specific assumptions about the model, and if these assumptions are not satisfied by the application, the performance of such a validity test could degrade beyond usefulness.

In addition to theoretical indexes, empirical evaluation methods [13] are also used in some cases where sample datasets with similar known patterns are available. The major drawback of empirical evaluation is the lack of benchmarks and a unified methodology. In addition, in practice it is sometimes not so simple to obtain reliable and accurate ground truth. External validity [17] is common practice amongst researchers, but it is hard to contrast algorithms whose results are produced on different data sets from different applications.

The nature of clustering is exploratory, rather than confirmatory. The task of data mining is to find novel patterns. Intuitively, if clusters are isolated from each other and each cluster is compact, the clustering results are somehow natural. Cluster validity is a certain amount of confidence that the cluster structure found is significant. In this paper, we have applied Support Vector Machines and related kernel methods to cluster validity. SVM training based on clustering results can provide insight into the structure inherent in the data. By analyzing the complexity of boundaries through support information, we can verify separation performance and identify potential outliers.



[Figure 4 panels: (a) original clustering result; (b) γ1(1,2)/γ1,2 = 6.8, γ1(1,3)/γ1,3 = 11.2, γ2(2,3)/γ2,3 = 21.2; (c) SVs = 184, BSVs = 63; (d) reclustering, R = 1; (e) γ1(1,2)/γ1,2 = 0.47, γ1(1,3)/γ1,3 = 0.25, γ2(2,3)/γ2,3 = 0.17; (f) SVs = 155, BSVs = 57; (g) reclustering, R = 1; (h) γ1(1,2)/γ1,2 = 0.12, γ1(1,3)/γ1,3 = 0.02, γ2(2,3)/γ2,3 = 0.01; (i) SVs = 125, BSVs = 44; (j) reclustering, R = 1; (k) γ1(1,2)/γ1,2 = 0.06, γ1(1,3)/γ1,3 = 0.09, γ2(2,3)/γ2,3 = 0.31; (l) SVs = 105, BSVs = 36; (m) reclustering, R = 1; (n) γ1(1,2)/γ1,2 = 0.02, γ1(1,3)/γ1,3 = 0.08, γ2(2,3)/γ2,3 = 0.18; (o) SVs = 98, BSVs = 26; (p) reclustering, R = 1]

Fig. 4. 3D example of cluster validity through SVMs. SVs are shown as circled points. 4(a): 3D view of the original clustering result. 4(b), 4(c) and 4(d): 1st run. 4(e), 4(f) and 4(g): 2nd run. 4(h), 4(i) and 4(j): 3rd run. 4(k), 4(l) and 4(m): 4th run. 4(n), 4(o) and 4(p): 5th run, arriving at a clearly separable and repeatable clustering structure. Separation tests in 4(b), 4(e), 4(h), 4(k) and 4(n). Compactness verification in 4(c), 4(f), 4(i), 4(l) and 4(o). 3D view of the reclustering result in 4(d), 4(g), 4(j) and 4(m).



After several rounds of reclustering and outlier filtering, we confirm clearer clustering structures when we observe that they are repeatable and compact. Counting the number of valid runs and matching results from different rounds in our process contributes to verifying the goodness of a clustering result. This provides a novel mechanism for cluster evaluation.

Our approach provides a novel mechanism to address cluster validity problems that require more elaborate analysis, as demanded by a number of clustering applications. The intuitive interpretability of support information and boundary complexity makes practical cluster validity easy to carry out.

References

[1] K. P. Bennett and C. Campbell. Support vector machines: Hype or hallelujah? SIGKDD Explorations, 2(2):1–13, 2000.
[2] J. C. Bezdek. Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum, NY, 1981.
[3] C. C. Chang and C. J. Lin. LIBSVM: a library for support vector machines. http://www.csie.ntu.edu.tw/~cjlin/libsvm, 2001.
[4] C. C. Chang and C. J. Lin. Training ν-support vector classifiers: Theory and algorithms. Neural Computation, 13(9):2119–2147, 2001.
[5] V. Cherkassky and F. Mulier. Learning from Data — Concepts, Theory and Methods. Wiley, NY, USA, 1998.
[6] R. C. Dubes. Cluster analysis and related issues. In C. H. Chen, L. F. Pau, and P. S. P. Wang, eds., Handbook of Pattern Recognition and Computer Vision, pages 3–32, NJ, 1993. World Scientific. Chapter 1.1.
[7] V. Estivill-Castro. Why so many clustering algorithms - a position paper. SIGKDD Explorations, 4(1):65–75, June 2002.
[8] E. Gokcay and J. Principe. A new clustering evaluation function using Renyi's information potential. In R. O. Wells, J. Tian, R. G. Baraniuk, D. M. Tan, and H. R. Wu, eds., Proc. of IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP 2000), pages 3490–3493, Istanbul, 2000.
[9] S. Gunn. Support vector machines for classification and regression. Tech. Report ISIS-1-98, Univ. of Southampton, Dept. of Electronics and Computer Science, 1998.
[10] S. S. Haykin. Neural networks: a comprehensive foundation. Prentice Hall, NJ, 1999.
[11] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, NJ, 1998.
[12] R. Koschke and T. Eisenbarth. A framework for experimental evaluation of clustering techniques. In Proc. Int. Workshop on Program Comprehension, 2000.
[13] A. Rauber, J. Paralic, and E. Pampalk. Empirical evaluation of clustering algorithms. In M. Malekovic and A. Lorencic, eds., 11th Int. Conf. on Information and Intelligent Systems (IIS'2000), Varazdin, Croatia, Sep. 20–22, 2000. Univ. of Zagreb.
[14] B. Scholkopf, R. C. Williamson, A. J. Smola, and J. Shawe-Taylor. SV estimation of a distribution's support. In T. K. Leen, S. A. Solla, and K. R. Muller, eds., Advances in Neural Information Processing Systems 12. MIT Press, forthcoming. mlg.anu.edu.au/~smola/publications.html.
[15] H. Siegelmann, A. Ben-Hur, D. Horn, and V. Vapnik. Support vector clustering. J. Machine Learning Research, 2:125–137, 2001.



[16] V. N. Vapnik. The nature of statistical learning theory. Springer Verlag, Heidelberg, 1995.
[17] M. Vazirgiannis, M. Halkidi, and Y. Batistakis. On clustering validation techniques. Intelligent Information Systems J., 17(2):107–145, 2001.
[18] R. Williamson, B. Scholkopf, A. Smola, and P. Bartlett. New support vector algorithms. Neural Computation, 12(5):1207–1245, 2000.
[19] R. Winter. Formal validation of schema clustering for large information systems. In Proc. First American Conference on Information Systems, 1995.


FSSM: Fast Construction of the Optimized Segment Support Map*

Kok-Leong Ong, Wee-Keong Ng, and Ee-Peng Lim

Centre for Advanced Information Systems, Nanyang Technological University, Nanyang Avenue, N4-B3C-13, Singapore 639798, SINGAPORE

[email protected]

Abstract. Computing the frequency of a pattern is one of the key operations in data mining algorithms. Recently, the Optimized Segment Support Map (OSSM) was introduced as a simple but powerful way of speeding up any form of frequency counting satisfying the monotonicity condition. However, the construction cost to obtain the ideal OSSM is high, which makes it less attractive in practice. In this paper, we propose the FSSM, a novel algorithm that constructs the OSSM quickly using an FP-Tree. Given a user-defined segment size, the FSSM is able to construct the OSSM at a fraction of the time required by the algorithm previously proposed. More importantly, this fast construction time is achieved without compromising the quality of the OSSM. Our experimental results confirm that the FSSM is a promising solution for constructing the best OSSM within user-given constraints.

1 Introduction

Frequent set (or pattern) mining plays a pivotal role in many data mining tasks including associations [1] and its variants [2, 4, 7, 13], sequential patterns [12] and episodes [9], constrained frequent sets [11], emerging patterns [3], and many others. At the core of discovering frequent sets is the task of computing the frequency (or support) of a given pattern. In all cases above, we have the following abstract problem for computing support. Given a collection I of atomic patterns or conditions, compute for collections C ⊆ I the support σ(C) of C, where the monotonicity condition σ(C) ≤ σ({c}) holds for all c ∈ C.

Typically, the frequencies of patterns are computed in a collection of transactions, i.e., D = {T1, . . . , Ti}, where a transaction can be a set of items, a sequence of events in a sliding time window, or a collection of spatial objects. One class of algorithms finds the above patterns by generating candidate patterns C1, . . . , Cj, and then checking them against D. This process is known to be tedious and time-consuming. Thus, novel algorithms and data structures were proposed to improve the efficiency of frequency counting. However, most solutions do not address the problem in a holistic manner. As a result, extensive efforts are often needed to incorporate a particular solution into an existing algorithm.

* This work was supported by SingAREN under Project M48020004.



Recently, the Optimized Segment Support Map (OSSM) [8, 10] was introduced as a simple yet powerful way of speeding up any form of frequency counting satisfying the monotonicity condition. It is a light-weight, easy to compute structure that partitions D into n segments, i.e., D = S1 ∪ . . . ∪ Sn and Sp ∩ Sq = ∅, with the goal of reducing the number of candidate patterns for which frequency counting is required. The idea of the OSSM is simple: the frequencies of patterns in different parts of the data are different. Therefore, computing the frequencies separately in different parts of the data makes it possible to obtain tighter support bounds for the frequencies of the collections of patterns. This enables one to prune more effectively, thus improving the speed of counting.

Although the OSSM is an attractive solution for a large class of algorithms, it suffers from one major problem: the construction cost to obtain the best OSSM of a user-defined segment size for a given large collection is high. This makes the OSSM much less attractive in practice. For practicality, the authors proposed hybrid algorithms that use heuristics to contain the runtime and to construct the "next best" OSSM. Although the solution guarantees an OSSM that improves performance, the quality of estimation is sub-optimal. This translates to a weaker support bound estimated for a given pattern and hence reduces the probability of pruning an infrequent pattern.

Our contribution to the above is to show the possibility of constructing the best OSSM within limited time for a given segment size and a large collection. Our proposal, called the FSSM, is an algorithm that constructs the OSSM from the FP-Tree. With the FSSM, we need not compromise the quality of estimation in favor of a shorter construction time. The FSSM may therefore make obsolete the sub-optimal algorithms originally proposed. Our experimental results support these claims.

2 Background

The OSSM is a light-weight structure that holds the support of all singleton itemsets in each segment of the database D. A segment in D is a partition containing a set of transactions such that D = S1 ∪ . . . ∪ Sn and Sp ∩ Sq = ∅. In each segment, the support of each singleton itemset is registered and thus, the support of an item 'c' can be obtained by ∑_{i=1}^{n} σi({c}). While the OSSM contains only segment supports of singleton itemsets, it can be used to give an upper bound on the support (σ) of any itemset C using the formula given below, where On is the OSSM constructed with n segments.

σ(C, On) = ∑_{i=1}^{n} min({σi({c}) | c ∈ C})

Let us consider the example in Figure 1. Assume in this configuration, each segment has exactly two transactions. Then, we have the OSSM (right table) where the frequency of each item in each segment is registered. By the equation above, the estimated support of the itemset C = {a, b} would be σ(C, On) = min(2, 1) + min(2, 0) + min(0, 2) = 1.



TID  Contents  Segment
 1   {a}       1
 2   {a, b}    1
 3   {a}       2
 4   {a}       2
 5   {b}       3
 6   {b}       3

      S1  S2  S3  D = S1 ∪ S2 ∪ S3
{a}    2   2   0   4
{b}    1   0   2   3

Fig. 1. A collection of transactions (left) and its corresponding OSSM (right). The OSSM is constructed with a user-defined segment size of n = 3.

Although this estimate is only a support bound of C, it turns out to be the actual support of C for this particular configuration of segments. Suppose we now switch T1 and T5 in the OSSM, i.e., S1 = {T2, T5} and S3 = {T1, T6}; then σ(C, On) = 2! This observation suggests that the way transactions are selected into a segment can affect the quality of the estimation.
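To make the bound concrete, the following small Python sketch (ours, not from the paper) stores an OSSM as one item-count dictionary per segment and evaluates the upper bound of the equation above; it reproduces the Figure 1 example.

def ossm_upper_bound(ossm, itemset):
    # ossm: list of dictionaries, one per segment, mapping item -> segment support.
    # Returns sum_i min{ sigma_i({c}) : c in itemset }.
    return sum(min(segment.get(c, 0) for c in itemset) for segment in ossm)

# The OSSM of Figure 1: three segments of two transactions each.
ossm = [{'a': 2, 'b': 1}, {'a': 2, 'b': 0}, {'a': 0, 'b': 2}]
print(ossm_upper_bound(ossm, {'a', 'b'}))    # min(2,1) + min(2,0) + min(0,2) = 1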

Clearly, if each segment contains only one transaction, then the estimate will be optimal and equal to the actual support. However, this number of segments will be practically infeasible. The ideal alternative is to use the minimum number of segments that maintains the optimality of our estimate. This leads to the following problem formulation.

Definition 1. Given a collection of transactions, the segment minimization problem is to determine the minimum value nm for the number of segments in the OSSM Onm, such that σ(C, Onm) = σ(C) for all itemsets C, i.e., the upper bound on the support for any itemset C is exactly its actual support.

With the FSSM, the minimum number of segments can be obtained quickly in two passes of the database. However, knowing the minimum number of segments is at best a problem of academic interest. In practice, this number is still too large to consider the OSSM as light-weight. It is thus desirable to construct the OSSM based on a user-defined segment size nu. And since nu ≤ nm, we expect a drop in the accuracy of the estimate. The goal then is to find the best configuration of segments, such that the quality of every estimate is the best within the bounds of nu. This problem is formally stated as follows.

Definition 2. Given a collection of transactions and a user-defined segment size nu ≤ nm to be formed, the constrained segmentation problem is to determine the best composition of the nu segments that minimizes the loss of accuracy in the estimate.

3 FSSM: Algorithm for Fast OSSM Construction

In this section, we present our solutions to the above problems. For the ease of discussion, we assume the reader is familiar with the FP-Tree and the OSSM. If not, a proper treatment can be obtained in [5, 10].



3.1 Constructing the Ideal OSSM

Earlier, we mentioned that the FSSM constructs the optimal OSSM from the FP-Tree. Therefore, we begin by showing the relationship between the two.

Lemma 1. Let Si and Sj be two segments of the same configuration from a collection of transactions. If we merge Si and Sj into one segment Sm, then Sm is of the same configuration, and σ(C, Sm) = σ(C, Si) + σ(C, Sj).

The term configuration refers to the characteristic of a segment that is described by the descending frequency order of the singleton itemsets. As an example, suppose the database has three unique items and two segments, i.e., S1 = {b(4), a(1), c(0)} and S2 = {b(3), a(2), c(2)}, where the number in the parentheses is the frequency of each item in the segment. In this case, both segments are described by the same configuration 〈σ({b}) ≥ σ({a}) ≥ σ({c})〉, and therefore can be merged (by Lemma 1) without losing accuracy.

In a more general case, the lemma solves the segment minimization problem. Suppose each segment begins with a single transaction, i.e., the singleton frequency registered in each segment is either '1' or '0'. We begin by merging two single-transaction segments of the same configuration. From this merged segment, we continue merging other single-transaction segments as long as the configuration is not altered. When no other single-transaction segments can be merged without losing accuracy, we repeat the process on another configuration. The number of segments found after processing all distinct configurations is the minimum number of segments required to build the optimal OSSM.

Theorem 1. The minimum number of segments required for the upper bound on σ(C) to be exact for all C is the number of segments with distinct configurations.

Proof: As shown in [10].

Notice that the process of merging two segments is very similar to the process of FP-Tree construction. First, the criterion to order items in a transaction is the same as that to determine the configuration of a segment (specifically a single-transaction segment). Second, the merging criterion of two segments is implicitly carried out by the overlaying of a transaction on an existing unique path¹ in the FP-Tree. An example will illustrate this observation. Let T1 = {f, a, m, p}, T2 = {f, a, m} and T3 = {f, b, m} such that the transactions are already ordered, and σ({b}) ≤ σ({a}). Based on FP-Tree characteristics, T1 and T2 will share the same path in the FP-Tree, while T3 will have a path of its own. The two transactions overlaid on the same path in the FP-Tree actually have the same configuration: 〈σ({f}) ≥ σ({a}) ≥ σ({m}) ≥ σ({p}) ≥ σ({b}) ≥ . . .〉, since σ({b}) = 0 in both T1 and T2 and σ({p}) = 0 for T2. For T3, the configuration is 〈σ({f}) ≥ σ({b}) ≥ σ({m}) ≥ σ({a}) ≥ σ({p}) ≥ . . .〉, where σ({a}) = σ({p}) = 0. Clearly, this is a different configuration from that of T1 and T2 and hence, a different path in the FP-Tree.

¹ A unique path in the FP-Tree is a distinct path that starts from the root node and ends at one of the leaf nodes in the FP-Tree.



Theorem 2. Given an FP-Tree constructed from some collection, the number of unique paths (or leaf nodes) in the FP-Tree is the minimum number of segments achievable without compromising the accuracy of the OSSM.

Proof: Suppose the number of unique paths in the FP-Tree is not the minimum number of segments required to build the optimal OSSM. Then, there will be at least one unique path that has the same configuration as another path in the FP-Tree. However, two paths Pi and Pj in the FP-Tree can have the same configuration if and only if there exist transactions in both paths that have the same configuration. If Ti ∈ Pi and Tj ∈ Pj are of the same configuration, they must satisfy the condition Ti ⊆ Tj and ∀c ∈ Tj − Ti, σ({c}) ≤ σ({x|Ti|}), where x|Ti| ∈ Ti is the last (least frequent) item of Ti, or vice versa. However, by the principle of FP-Tree construction, if Ti and Tj satisfy the above condition, then they must be overlaid on the same path. Therefore, each unique path in the FP-Tree must be of a distinct configuration. Hence, we may now apply Theorem 1 to complete the proof of Theorem 2.

Corollary 1. The transactions that are fully contained in each unique path of the FP-Tree are the set of transactions that constitutes a distinct segment in the optimal OSSM.

Proof: By Theorem 2, every unique path in the FP-Tree must have a distinct configuration, and all transactions contained in a unique path are transactions with the same configuration. In addition, since every transaction in the collection must lie completely along one of the paths in the FP-Tree, it follows that there is an implicit and complete partition of the collection by the unique path to which each transaction belongs. By this observation, every unique path and its set of transactions must therefore correspond to a distinct segment in the optimal OSSM. Hence, we have the above corollary of Theorem 2.

From Theorem 1, we shall give an algorithmic sketch of the construction algorithm for the optimal OSSM. Although this has little practical utility, its result is an intermediate step towards the goal of finding the optimal OSSM within the bounds of the user-defined segment size. Hence, its efficient construction is still important. The algorithm to construct the optimal OSSM is given in Figure 2. Notice that the process is very much based on the FP-Tree construction. In fact, the entire FP-Tree is constructed along with the optimal OSSM. Therefore, the efficiency of the algorithm is bounded by the time needed to construct the FP-Tree, i.e., within two scans of the database.

The results above are important for solving the constrained segmentation problem. As we will show in the next subsection, the overlapping of unique paths in the FP-Tree has an important property that will allow us to construct the best OSSM within the bounds of the user-defined segment size. As before, we shall present the formal discussion that leads to the algorithm.

3.2 Constructing OSSM with User-Defined Segment Size

Essentially, Theorem 1 states the lower bound nm on the number of segments allowable before the OSSM becomes sub-optimal in its estimation.



Algorithm BuildOptimalOSSM(Set of transactions D)
begin
    Find the singleton frequency of each item in D;                              // Pass 1
    foreach transaction T ∈ D do                                                 // Pass 2
        Sort T according to descending frequency order;
        if (T can be inserted completely along an existing path Pi in the FP-Tree) then
            Increment the counter in segment Si for each item in T;
        else
            Create the new path Pj in the FP-Tree, and the new segment Sj;
            Initialize the counter in segment Sj for each item in T to 1;
        endif
    endfor
    return optimal OSSM and FP-Tree;
end

Fig. 2. Algorithm to construct the optimal OSSM via FP-Tree construction.
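As a hedged illustration of the same idea in plain Python (ours; instead of materializing an FP-Tree it groups transactions directly by the unique path they would fall on, processing longer transactions first so that shorter ones join an existing path, which by Theorem 2 yields the same segments), the optimal OSSM could be built roughly as follows.

from collections import Counter

def build_optimal_ossm(transactions):
    # One segment per unique path: a transaction joins a path if its sorted item
    # list is a prefix of that path; otherwise it founds a new path (cf. Theorem 2).
    freq = Counter(item for t in transactions for item in t)               # pass 1
    ordered = [tuple(sorted(t, key=lambda i: (-freq[i], i))) for t in transactions]
    paths, segments = [], []
    for t in sorted(ordered, key=len, reverse=True):                       # pass 2
        for k, p in enumerate(paths):
            if p[:len(t)] == t:                   # t lies completely along path p
                segments[k].update(t)
                break
        else:
            paths.append(t)                       # a new unique path ...
            segments.append(Counter(t))           # ... and its new segment
    return [dict(s) for s in segments]

# The Figure 1 transactions collapse to two segments (two distinct configurations).
print(build_optimal_ossm([{'a'}, {'a', 'b'}, {'a'}, {'a'}, {'b'}, {'b'}]))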

Also mentioned is that the value of nm is too high to construct the OSSM as a light-weight and easy to compute structure. The alternative, as proposed, is to introduce a user-defined segment size nu where nu ≤ nm. Clearly, when nu < nm, the accuracy can no longer be maintained. This means merging segments of different configurations so as to reach the user-defined segment size. Of course, the simplest approach is to randomly merge any distinct configurations. However, this will result in an OSSM with poor pattern pruning efficiency. As such, we are interested in constructing the best OSSM within the bounds of the user-defined segment size. Towards this goal, the following measure was proposed.

SubOp(S) = ∑_{ci,cj} [σ({ci, cj}, O1) − σ({ci, cj}, Ov)]

In the equation, S = {S1, . . . , Sv} is a set of segments with v ≥ 2. The first term is the upper bound on σ({ci, cj}) based on O1, which consists of one combined segment formed by merging all v segments in S. The second term is the upper bound based on Ov, which keeps the v segments separated. The difference between the two terms quantifies the amount of sub-optimality in the estimation on the set {ci, cj} incurred by having the v segments merged, and the sum over all pairs of items measures the total loss. Generally, if all v segments are of the same configuration, then SubOp(S) = 0, and if there are at least two segments with different configurations, then SubOp(S) > 0.
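The measure can be evaluated directly from per-segment singleton counts. The following Python sketch (ours) reuses the upper-bound idea from the earlier sketch and reproduces the small example discussed later in this section (SubOp(T1, T2) = 2 and SubOp(T2, T3) = 1).

from itertools import combinations

def pair_bound(segments, pair):
    # Upper bound on sigma(pair) when the given segments are kept separate.
    return sum(min(seg.get(c, 0) for c in pair) for seg in segments)

def sub_optimality(segments):
    # SubOp(S): summed loss over all item pairs of merging the segments in S into one.
    merged = {}
    for seg in segments:
        for item, count in seg.items():
            merged[item] = merged.get(item, 0) + count
    return sum(pair_bound([merged], pair) - pair_bound(segments, pair)
               for pair in combinations(sorted(merged), 2))

# Single-transaction segments T1 = {f,a,m}, T2 = {f,a,c,p}, T3 = {f,a,c,q}:
T1 = {'f': 1, 'a': 1, 'm': 1}
T2 = {'f': 1, 'a': 1, 'c': 1, 'p': 1}
T3 = {'f': 1, 'a': 1, 'c': 1, 'q': 1}
print(sub_optimality([T1, T2]), sub_optimality([T2, T3]))    # 2 1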

What this means is that we would like to merge segments having smaller sub-optimality values, i.e., there is a reduced loss when the v segments are merged. This measure is the basis of operation for the algorithms proposed by the authors. Clearly, this approach is expensive. First, computing a single sub-optimality value requires a sum over all pairs of items in the segment. If there are k items, then there are k·(k−1)/2 terms to be summed. Second, the number of distinct segments for which the sub-optimality is to be computed is also very large. As a result, the runtime to construct the best OSSM within the bounds of the user-defined segment size becomes very high.



To contain the runtime, hybrid algorithms were proposed. These algorithms first create segments of larger granularity by randomly merging existing segments before the sub-optimality measure is used to reach the user-defined segment size. The consequence is an OSSM with an estimation accuracy that cannot be predetermined, and that is often not the best OSSM possible for the given user-defined segment size.

With regards to the above, the FP-Tree has some interesting characteristics. Recall from Theorem 2 that segments having the same configuration share the same unique path. Likewise, it is not difficult to observe that two unique paths are similar in configuration if they have a high degree of overlapping (i.e., sharing of prefixes). In other words, as the overlapping increases, the sub-optimality value approaches zero. To illustrate this, suppose T1 = {f, a, m}, T2 = {f, a, c, p} and T3 = {f, a, c, q}. An FP-Tree constructed over these transactions will have three unique paths due to their distinct configurations. Assuming that T2 is to be merged with either T1 or T3, we observe that T2 should be merged with T3. This is because T3 has a longer shared prefix with T2 than T1, i.e., more overlapping in the two paths. This can be confirmed by calculating the sub-optimality, i.e., SubOp(T1, T2) = 2 and SubOp(T2, T3) = 1.

Lemma 2. Given a segment Si and its corresponding unique path Pi in the FP-Tree, the segment(s) that have the lowest sub-optimality value (i.e., the most similar configuration) with respect to Si are the segment(s) whose unique path has the most overlap with Pi in the FP-Tree.

Proof: Let Pj be a unique path with a distinct configuration from Pi. Without loss of distinction in the configuration, let the first k items in both configurations share the same item and frequency ordering. Then, the sub-optimality computed with or without the k items will be the same, since computing all pairs of the first k items (of the same configuration) contributes a zero result. Furthermore, the sub-optimality of Pi and Pj has to be non-zero. Therefore, a non-zero sub-optimality depends on the remaining L = max(|Pi|, |Pj|) − k items, where each pair (formed from the L items) contributes a non-zero partial sum. As k tends towards L, the number of pairs that can be formed from the L items reduces, and the sub-optimality thus approaches zero. Clearly, max(|Pi|, |Pj|) > k > 0 when Pi and Pj in the FP-Tree partially overlap one another, and k = 0 when they do not overlap at all. Hence, with more overlapping between the two paths, i.e., a larger k, there is less overall loss in accuracy; hence Lemma 2.

Figure 3 shows the FSSM algorithm that constructs the best OSSM based on the user-defined segment size nu. Instead of creating segments of larger granularity by randomly merging existing ones, we begin with the nm segments in the optimal OSSM constructed earlier. From these nm segments, we merge two segments at a time such that the loss of accuracy is minimized. Clearly, this is costly if we compare each segment against every other as proposed in [10]. Rather, we utilize Lemma 2 to cut the search space down to comparing only a few segments.



Algorithm BuildBestOSSM(FP-Tree T, Segment Size nu, Optimal OSSM Om)
begin
    while (number of segments in Om > nu) do
        select node N from lookup table H
            where N is the next furthest from the root of T and has > 1 child nodes;
        foreach possible pair of direct child nodes (ci, cj) of N do
            Let Si/Sj be the segment for path Pi/Pj containing ci/cj respectively;
            Compute the sub-optimality as a result of merging Si and Sj;
        endfor
        Merge the pair Sp and Sq whose sub-optimality value is smallest;
        Create the unique path Pp,q in T by merging Pp and Pq;
    endwhile
    return best OSSM with nu segments;
end

Fig. 3. FSSM: algorithm to build the best OSSM for any given segment size nu < nm.

More importantly, the FSSM begins with the optimal OSSM and will always merge segments with the minimum loss of accuracy. This ensures that the best OSSM is always constructed for any value of nu.

Each pass through the while-loop merges two segments at a time, and this continues until the OSSM of nm segments reduces to nu segments. At the start of each pass, we first find the set of unique paths having the longest common prefix (i.e., the biggest k value). This is satisfied by the condition in the select-where statement which returns N, the last node in the common prefix. This node is important because, together with its direct children, we can derive the set of all unique paths sharing this common prefix. The for-loop then computes the sub-optimality for each pair of segments in this set of unique paths. Instead of searching the FP-Tree (which would be inefficient), our implementation uses a lookup table H to find N. Each entry in H records the distance of a node having more than one child, and a reference to the actual node in the FP-Tree. All entries in H are then ordered by their distance so that the select-where statement can find the next furthest node by iterating through H.

Although the number of segment pairs to process is substantially reduced, the efficiency of the for-loop can be further enhanced with a more efficient method of computing the sub-optimality. As shown in the proof of Lemma 2, the first k items in the common prefix do not contribute to a non-zero sub-optimality. By the same rationale, we can also exclude all the h items whose singleton frequencies are zero in both segments. Hence, the sub-optimality can be computed by considering only the remaining |I| − k − h or max(|Pi|, |Pj|) − k items.

After comparing all segments under N, we merge the two segments represented by the two unique paths with the least loss in accuracy. Finally, we merge the two unique paths whose segments were combined earlier. This new path will then correspond to the merged segment in the OSSM, where all nodes in the path are arranged according to their descending singleton frequency. The rationale for merging the two paths is to consistently reflect the state of the OSSM required for the subsequent passes.
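A single merge step can be sketched as follows (ours, assuming the dictionary-per-segment representation used in the earlier sketches): the singleton counts are added and the merged path is re-ordered by descending merged frequency.

def merge_segments(seg_p, seg_q):
    # Merge two segments: add singleton counts and rebuild the representative
    # path ordered by descending merged frequency (ties broken alphabetically).
    merged = dict(seg_p)
    for item, count in seg_q.items():
        merged[item] = merged.get(item, 0) + count
    path = tuple(sorted(merged, key=lambda i: (-merged[i], i)))
    return merged, path

print(merge_segments({'f': 2, 'a': 2, 'm': 1}, {'f': 1, 'b': 1, 'm': 1}))
# ({'f': 3, 'a': 2, 'm': 2, 'b': 1}, ('f', 'a', 'm', 'b'))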



[Figure 4 plots: (a) runtime in seconds (logarithmic scale, 1 to 100,000) versus the number of segments (20 to 300) for FSSM, Random-RC and Greedy; (b) speedup relative to Apriori without the OSSM (0 to 60) versus the number of segments for FSSM/Greedy and Random-RC.]

Fig. 4. (a) Runtime performance comparison for constructing the OSSM based on a number of given segment sizes. (b) Corresponding speedup achieved for Apriori using the OSSMs constructed in the first set of experiments.

4 Experimental Results

The objective of our experiment is to evaluate the cost effectiveness of our approach against the Greedy and Random-RC algorithms proposed in [10]. We conducted two sets of experiments using a real data set, BMS-POS [6], which has 515,597 transactions. In the first set of experiments, we compare the FSSM against the Greedy and Random-RC in terms of their performance to construct the OSSM based on different user-defined segment sizes. In the second set of experiments, we compare their speedup contribution to Apriori using the OSSMs constructed by the three algorithms at varying segment sizes.

Figure 4(a) shows the results of the first set of experiments. As we expected from our previous discussion, the Greedy algorithm experiences extremely poor runtime when it comes to constructing the best OSSM within the bounds of the given segment size. Compared to the Greedy algorithm, the FSSM produces the same results in significantly less time, showing the feasibility of pursuing the best OSSM in a practical context. Interestingly, our algorithm is even able to outperform the Random-RC on larger user-defined segment sizes. This can be explained by observing the fact that the Random-RC first randomly merges segments into larger-granularity segments before constructing the OSSM based on the sub-optimality measure. As the user-defined segment size becomes larger, the granularity of each segment formed from random merging becomes finer. With more combinations of segments, the cost to find the best segments to merge in turn becomes higher.

Although we are able to construct the OSSM at the performance level of the Random-RC algorithm, this does not mean that the OSSM produced is of poor quality. As a matter of fact, the FSSM guarantees the best OSSM by the same principle that the Greedy algorithm uses to build the best OSSM for the given user-defined segment size. Having shown this by a theoretical discussion, our experimental results in Figure 4(b) provide the empirical evidence.



While the Random-RC takes approximately the same amount of time as the FSSM during construction, it fails to deliver the same level of speedup as the FSSM in all cases of our experiments. On the other hand, our FSSM is able to construct the OSSM very quickly, and yet deliver the same level of speedup as the OSSM produced by the Greedy algorithm.

5 Conclusions

In this paper, we present an important observation about the construction of an optimal OSSM with respect to the FP-Tree. We show, by means of formal analysis, the relationship between the two, and how the characteristics of the FP-Tree can be exploited to construct high-quality OSSMs. We demonstrated, both theoretically and empirically, that our proposal is able to consistently produce the best OSSM within limited time for any given segment size. More importantly, with the best within reach, the various compromises suggested to balance construction time and speedup become unnecessary.

References

1. R. Agrawal and R. Srikant. Fast Algorithm for Mining Association Rules. In Proc. of VLDB, pages 487–499, Santiago, Chile, August 1994.
2. C. H. Cai, Ada W. C. Fu, C. H. Cheng, and W. W. Kwong. Mining Association Rules with Weighted Items. In Proc. of IDEAS Symp., August 1998.
3. G. Dong and J. Li. Efficient Mining of Emerging Patterns: Discovering Trends and Differences. In Proc. of ACM SIGKDD, San Diego, CA, USA, August 1999.
4. J. Han and Y. Fu. Discovery of Multiple-Level Association Rules from Large Databases. In Proc. of VLDB, Zurich, Switzerland, 1995.
5. J. Han, J. Pei, Y. Yin, and R. Mao. Mining Frequent Patterns without Candidate Generation: A Frequent-pattern Tree Approach. J. of Data Mining and Knowledge Discovery, 7(3/4), 2003.
6. R. Kohavi, C. Brodley, B. Frasca, L. Mason, and Z. Zheng. KDD-Cup 2000 organizers' report: Peeling the onion. SIGKDD Explorations, 2(2):86–98, 2000.
7. K. Koperski and J. Han. Discovery of Spatial Association Rules in Geographic Information Databases. In Proc. of the 4th Int. Symp. on Large Spatial Databases, Maine, August 1995.
8. L. Lakshmanan, K-S. Leung, and R. T. Ng. The Segment Support Map: Scalable Mining of Frequent Itemsets. SIGKDD Explorations, 2:21–27, December 2000.
9. H. Mannila, H. Toivonen, and A. I. Verkamo. Discovering Frequent Episodes in Sequences. In Proc. of ACM SIGKDD, Montreal, Canada, August 1995.
10. C. K-S. Leung, R. T. Ng, and H. Mannila. OSSM: A Segmentation Approach to Optimize Frequency Counting. In Proc. of IEEE Int. Conf. on Data Engineering, pages 583–592, San Jose, CA, USA, February 2002.
11. R. T. Ng, L. V. S. Lakshmanan, and J. Han. Exploratory Mining and Pruning Optimizations of Constrained Association Rules. In Proc. of SIGMOD, Washington, USA, June 1998.
12. R. Srikant and R. Agrawal. Mining Sequential Patterns: Generalizations and Performance Improvements. In Proc. of the 5th Int. Conf. on Extending Database Technology, Avignon, France, March 1996.
13. O. R. Zaiane, J. Han, and H. Zhu. Mining Recurrent Items in Multimedia with Progressive Resolution Refinement. In Proc. of ICDE, San Diego, March 2000.



Using a Connectionist Approach for Enhancing Domain Ontologies:
Self-Organizing Word Category Maps Revisited

Michael Dittenbach1, Dieter Merkl1,2, and Helmut Berger1

1 E-Commerce Competence Center – EC3, Donau-City-Straße 1, A–1220 Wien, Austria

2 Institut für Softwaretechnik, Technische Universität Wien, Favoritenstraße 9–11/188, A–1040 Wien, Austria

{michael.dittenbach,dieter.merkl,helmut.berger}@ec3.at

Abstract. In this paper, we present an approach based on neural networks for organizing words of a specific domain according to their semantic relations. The terms, which are extracted from domain-specific text documents, are mapped onto a two-dimensional map to provide an intuitive interface displaying semantically similar words in spatially similar regions. This representation of a domain vocabulary supports the construction and enrichment of domain ontologies by making relevant concepts and their relations evident.

1 Introduction

Ontologies have gained increasing importance in many fields of computer science. Especially for information retrieval systems, ontologies can be a valuable means of representing and modeling domain knowledge to deliver search results of a higher quality. However, a crucial problem is an ontology's increasing complexity with the growing size of the application domain. In this paper, we present an approach based on a neural network to assist domain engineers in creating or enhancing ontologies for information retrieval systems.

We show an example from the tourism domain, where free-form text descriptions of accommodations are used as a basis to enrich the ontology of a tourism information retrieval system with highly specialized terms that are hardly found in general-purpose thesauri or dictionaries. We exploit information inherent in the textual descriptions that is accessible but separated from the structured information the search engine operates on. The vector representations of the terms are created by generating statistics about the local contexts of the words occurring in natural language descriptions of accommodations. These descriptions have in common that words belonging together with respect to their semantics are found spatially close together regarding their position in the text, even though the descriptions are written by different authors, i.e. the accommodation providers themselves in the case of our application.




Therefore, we think that the approach presented in this paper can be applied to a variety of domains, since product descriptions, for instance, generally have similarly structured content. Consider, for example, typical computer hardware descriptions, where information about, say, storage devices is normally grouped together rather than being intertwined with input and display devices.

More specifically, we use the self-organizing map to cluster terms relevant to the application domain to provide an intuitive representation of their semantic relations. With this kind of representation at hand, finding synonyms, adding new relations between concepts or detecting new concepts which would be important to add to the ontology is facilitated. More traditional clustering techniques are used in the DARE system [3] as methods supporting combined top-down and bottom-up ontology engineering [11].

The remainder of the paper is structured as follows. In Section 2 we provide a brief review of our natural language tourism information retrieval system along with some results of a field trial in which the interface has been made publicly accessible. Section 3 gives an overview of the SOM and how it can be used to create a word category map. Following a description of our experiments in Section 4, we provide some concluding remarks in Section 5.

2 A Tourism Information Retrieval System

2.1 System Architecture

We have developed a natural language interface for the largest Austrian web-based tourism platform Tiscover (http://www.tiscover.com) [12]. Tiscover is a well-known tourism information system and booking service in Europe that already covers more than 50,000 accommodations in Austria, Germany, Liechtenstein, Switzerland and Italy. Contrary to the original form-based Tiscover interface, our natural language interface allows users to search for accommodations throughout Austria by formulating the query in natural language sentences, either in German or English. The language of the query is automatically detected and the result is presented accordingly. For the task of natural language query analysis we followed the assumption that shallow natural language processing is sufficient in restricted and well-defined domains. In particular, our approach relies on the selection of query concepts, which are modeled in a domain ontology, followed by syntactic and semantic analysis of the parts of the query where the concepts appear.

To improve the retrieval performance, we used a phonetic algorithm to find and correct orthographic errors and misspellings. Furthermore, it is an important issue to automatically identify proper names consisting of more than one word, e.g. "Gries am Brenner", without requiring the user to enclose them in quotes. This also applies to phrases and multi-word denominations like "city center" or "youth hostel", to name but a few. In the next query processing step, the relevant concepts and modifiers are tagged. For this purpose, we have developed an XML-based ontology covering the semantics of domain-specific concepts and modifiers and describing linguistic concepts like synonymy. Additionally,



a lightweight grammar describes how particular concepts may be modified by prepositions and adverbial or adjectival structures that are also specified in the ontology. Finally, the query is transformed into an SQL statement to retrieve information from the database. The tagged concepts and modifiers, together with the rule set and parameterized SQL fragments also defined in the knowledge base, are used to create the complete SQL statement reflecting the natural language query. A generic XML description of the matching accommodations is transformed into a device-dependent output, customized to fit screen size and bandwidth.
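Purely for illustration (none of the table names, concept names or SQL fragments below come from the paper; they are hypothetical stand-ins), the mapping from tagged concepts to parameterized SQL fragments could look roughly like this in Python.

# Hypothetical knowledge-base entries: concept -> parameterized SQL fragment.
SQL_FRAGMENTS = {
    "sauna": "a.id IN (SELECT accommodation_id FROM facility WHERE name = 'sauna')",
    "town": "a.town_id IN (SELECT id FROM town WHERE name = :town)",
    "max_price": "a.price_per_night <= :max_price",
}

def build_query(tagged_concepts):
    # Combine the fragments of the tagged concepts into one WHERE clause.
    clauses = [SQL_FRAGMENTS[c] for c, _ in tagged_concepts if c in SQL_FRAGMENTS]
    params = {c: v for c, v in tagged_concepts if v is not None}
    return "SELECT a.* FROM accommodation a WHERE " + " AND ".join(clauses), params

print(build_query([("sauna", None), ("town", "Innsbruck"), ("max_price", 70)]))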

Our information retrieval system covers a part of the Tiscover database that, as of October 2001, provides access to information about 13,117 Austrian accommodations. These are described by a large number of characteristics including the respective numbers of various room types, different facilities and services provided in the accommodation, or the type of food. The accommodations are located in 1,923 towns and cities that are again described by various features, mainly information about sports activities offered, e.g. mountain biking or skiing, but also the number of inhabitants or the sea level. The federal states of Austria are the higher-level geographical units. For a more detailed report on the system we refer to [2].

2.2 A Field Trial and Its Implications

The field trial was carried out during ten days in March 2002. During this time our natural language interface was promoted on and linked from the main Tiscover page. We obtained 1,425 unique queries through our interface, i.e. equal queries from the same client host have been reduced to one entry in the query log to eliminate a possible bias for our evaluation of the query complexity.

In more than half of the queries, users formulated complete, grammatically correct sentences, about one fifth were partial sentences, and the remaining set were keyword-type queries. Several of the queries consisted of more than one sentence. This confirms our assumption that users accept the natural language interface and are willing to type more than just a few keywords to search for information. More than this, a substantial portion of the users typed complete sentences to express their information needs.

To inspect the complexity of the queries, we considered the number of concepts and the usage of modifiers like "and", "or", "not", "near" and some combinations of those as quantitative measures. We found that the level of sentence complexity is not very high. This confirms our assumption that shallow text parsing is sufficient to analyze the queries emerging in a limited domain like tourism. Even more important for the research described in this paper, we found that regions or local attractions are indispensable information that has to be integrated in such systems. We also noticed that users' queries contained vague or highly subjective criteria like "romantic", "cheap" or "within walking distance to". Even "wellness", a term broadly used in tourism nowadays, is far from being exactly defined.



A more detailed evaluation of the results of the field trial can be found in [1]. It furthermore turned out that a deficiency of our ontology was the lack of diversity of the terminology. To provide better quality search results, it is necessary to enrich the ontology with additional synonyms. Besides the structured information about the accommodations, the web pages describing the accommodations offer a lot more information in the form of natural language descriptions. Hence, the words occurring in these texts constitute a very specialized vocabulary for this domain. The next obvious step is to exploit this information to enhance the domain ontology for the information retrieval system. Due to the size of this vocabulary, some intelligent form of representation is necessary to express the semantic relations between the words.

3 Word Categorization

3.1 Encoding the Semantic Contexts

Ritter and Kohonen [13] have shown that it is possible to cluster terms according to their syntactic category by encoding word contexts of terms in an artificial data set of three-word sentences that consist of nouns, verbs and adverbs, such as, e.g. "Jim speaks often" and "Mary buys meat". The resulting maps clearly showed three main clusters corresponding to the three word classes. It should furthermore be noted that within each cluster, the words of a class were arranged according to their semantic relation. For example, the adverbs poorly and well were located closer together on the map than poorly and much, the latter being located spatially close to little. An example from a different cluster would be the verbs likes and hates.

Other experiments using a collection of fairy tales by the Grimm Brothers have shown that this method works well with real-world text documents [5]. The terms on the SOM were divided into three clusters, namely nouns, verbs and all other word classes. Again, inside these clusters, semantic similarities between words were mirrored. The results of these experiments were later elaborated to reduce the vector dimensionality for document clustering in the WEBSOM project [6]. Here, a word category map has been trained with the terms occurring in the document collection to subsume words with similar contexts into one semantic category. These categories, obviously fewer than the number of all words of the document collection, have then been used to create document vectors for clustering. Since new methods of dimensionality reduction have been developed, the word category map has been dropped for this particular purpose [9].

Nevertheless, since our objective is to disclose semantic relations between words, we decided to use word category maps. For training a self-organizing map in order to organize terms according to their semantic similarity, these terms have to be encoded as n-dimensional numerical vectors. As shown in [4], random vectors are quasi-orthogonal if n is large enough. Thus, unwanted geometrical dependence of the word representation can be avoided. This is a necessary condition, because otherwise the clustering result could be dominated by random effects overriding the semantic similarity of words.



We assume that, in textual descriptions dominated by enumerations, semantic similarity is captured by contextual closeness within the description. For example, when arguing about the attractions offered for children, things like a playground, a sandbox or the availability of a baby-sitter will be enumerated together. Analogously, the same is true for recreation equipment like a sauna, a steam bath or an infrared cabin. To capture this contextual closeness, we use word windows where a particular word i is described by the set of words that appear a fixed number of words before and after word i in the textual description. Given that every word is represented by a unique n-dimensional random vector, the context vector of a word i is built as the concatenation of the averages of all words preceding as well as succeeding word i. Technically speaking, an n × N-dimensional vector xi representing word i is a concatenation of the vectors xi(dj) denoting the mean vectors of terms occurring at the set of displacements {d1, . . . , dN} of the term, as given in Equation 1. Consequently, the dimensionality of xi is n × N. This kind of representation has the effect that words appearing in similar contexts are represented by similar vectors in a high-dimensional space.

xi = [xi(d1), . . . , xi(dN)]    (1)

With this method, a statistical model of word contexts is created. Consider, for example, the term Skifahren (skiing). The set of words occurring directly before the term at displacement −1 consists of words like Langlaufen (cross-country skiing), Rodeln (toboggan), Pulverschnee (powder snow) or Winter, to name but a few. By averaging the respective vectors representing these terms, a statistical model of the word's context at this displacement is obtained.
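A minimal Python/NumPy sketch of this encoding (ours; the function and parameter names are not from the paper): every word receives a fixed random vector, and its context vector concatenates the mean random vector of the words observed before it with the mean random vector of the words observed after it.

import numpy as np

def context_vectors(sentences, n=90, window=(-2, -1, 1, 2), seed=0):
    # Random word vectors plus concatenated mean context vectors; displacements
    # before and after the word are pooled into two blocks of dimension n each.
    rng = np.random.default_rng(seed)
    vocab = sorted({w for s in sentences for w in s})
    rand = {w: rng.standard_normal(n) for w in vocab}
    before = {w: [] for w in vocab}
    after = {w: [] for w in vocab}
    for s in sentences:
        for pos, w in enumerate(s):
            for d in window:
                if 0 <= pos + d < len(s):
                    (before if d < 0 else after)[w].append(rand[s[pos + d]])
    mean = lambda vs: np.mean(vs, axis=0) if vs else np.zeros(n)
    return {w: np.concatenate([mean(before[w]), mean(after[w])]) for w in vocab}

vecs = context_vectors([["Rodeln", "Skifahren", "Winter"],
                        ["Langlaufen", "Skifahren", "Sauna"]])
print(vecs["Skifahren"].shape)    # (180,)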

3.2 Self-Organizing Map Algorithm

The self-organizing map (SOM) [7, 8] is an unsupervised neural network providing a mapping from a high-dimensional input space to a usually two-dimensional output space while preserving topological relations as faithfully as possible. The SOM consists of a set of units arranged in a two-dimensional grid, with a weight vector mi ∈ ℝⁿ attached to each unit i. Data from the high-dimensional input space, referred to as input vectors x ∈ ℝⁿ, are presented to the SOM, and the activation of each unit for the presented input vector is calculated using an activation function. Commonly, the Euclidean distance between the weight vector of the unit and the input vector serves as the activation function, i.e. the smaller the Euclidean distance, the higher the activation.

In the next step, the weight vector of the unit showing the highest activation is selected as the winner and is modified so as to more closely resemble the presented input vector. Pragmatically speaking, the weight vector of the winner is moved towards the presented input by a certain fraction of the Euclidean distance, as indicated by a time-decreasing learning rate α(t), as shown in Equation 2.



mi(t + 1) = mi(t) + α(t) · hci(t) · [x(t) − mi(t)] (2)

Thus, this unit’s activation will be even higher the next time the same inputsignal is presented. Furthermore, the weight vectors of units in the neighbor-hood of the winner are modified accordingly as described by a neighborhoodfunction hci(t) (cf. Equation 3), yet to a less strong amount as compared to thewinner. The strenght of adaptation depends on the Euclidean distance ||rc − ri||between the winner c and a unit i regarding their respective locations rc, ri ∈ �2

on the 2-dimensional map and a time-decreasing parameter σ.

hci(t) = exp(−||rc − ri||² / (2 · σ²(t)))    (3)

Starting with a rather large neighborhood for a general organization of the weight vectors, this learning procedure finally leads to a fine-grained, topologically ordered mapping of the presented input signals. Similar input data are mapped onto neighboring regions of the map.
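The training loop can be sketched compactly with NumPy (ours; the linear decay of α(t) and σ(t) is one simple choice of schedule, not prescribed by the paper).

import numpy as np

def train_som(data, rows=20, cols=20, epochs=10, alpha0=0.5, sigma0=5.0, seed=0):
    # Basic SOM: Euclidean winner search and Gaussian neighborhood update (Eqs. 2 and 3).
    rng = np.random.default_rng(seed)
    weights = rng.standard_normal((rows * cols, data.shape[1]))
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    steps, t = epochs * len(data), 0
    for _ in range(epochs):
        for x in rng.permutation(data):
            alpha = alpha0 * (1.0 - t / steps)              # time-decreasing learning rate
            sigma = max(sigma0 * (1.0 - t / steps), 0.5)    # time-decreasing neighborhood
            winner = np.argmin(np.linalg.norm(weights - x, axis=1))
            h = np.exp(-np.sum((grid - grid[winner]) ** 2, axis=1) / (2.0 * sigma ** 2))
            weights += alpha * h[:, None] * (x - weights)
            t += 1
    return weights.reshape(rows, cols, -1)

som = train_som(np.random.default_rng(1).standard_normal((200, 180)))
print(som.shape)    # (20, 20, 180)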

4 Experiments

4.1 Data

The data provided by Tiscover consist, on the one hand, of structured information as described in Section 2, and, on the other hand, of free-form texts describing the accommodations. Because the accommodation providers themselves enter the data into the system, the descriptions vary in length and style and are not uniform or even quality controlled regarding spelling. HTML tags, which are allowed to format the descriptions, had to be removed to obtain plain-text files for further processing. For the experiments presented hereafter, we used the German descriptions of the accommodations since they are more comprehensive than the English ones. Especially small and medium-sized accommodations provide only a very rudimentary English description, many being far from correctly spelled.

It has been shown with a text collection consisting of fairy tales that, with free-form text documents, the word categories dominate the cluster structure of such a map [5]. To create semantic maps primarily reflecting the semantic similarity of words rather than categorizing word classes, we removed words other than nouns and proper names.

Therefore, we used the characteristic, unique to the German language, of nouns starting with a capital letter to filter the nouns and proper names occurring in the texts. Obviously, using this method, some other words like adjectives, verbs or adverbs at the beginning of sentences or in improperly written documents are also filtered. Conversely, some nouns can be missed, too. A different method of determining nouns or other relevant word classes, especially for languages other than German, would be part-of-speech (POS) tagging.



Die Ferienwohnung Lage Stadtrand Wien Bezirk Mauer In Gehminuten Schnellbahn Fahrminuten Wien Mitte Stadt Die Wohnung Wohn Eßraum Kamin SAT TV Küche Geschirrspüler Schlafzimmer Einzelbetten Einbettzimmer Badezimmer Wanne Doppelwaschbecken Dusche Extra WC Terrasse Sitzgarnitur Ruhebetten Die Ferienwohnung Aufenthalt Wünsche

Fig. 1. A sample description of a holiday flat in a suburb of Vienna after removing almost all words not being nouns or proper names

the(fem.), holiday flat, location, outskirts, Vienna, district, Mauer,

in, minutes to walk, urban railway, minutes to drive, Wien Mitte

(station name), city, the(fem.), flat, living, dining room, fireplace,

satellite tv, kitchen, dishwasher, sleeping room, single beds,

single-bed room, bathroom, bathtub, double washbasin, shower,

separate toilet, terrace, chairs and table, couches, the(fem.), holiday

flat, stay, wishes

Fig. 2. English translation of the description shown in Figure 1

state-of-the-art POS taggers do not reach an accuracy of 100% [10]. For the restof this section, the numbers and figures presented, refer to the already prepro-cessed documents, if not stated otherwise.

The collection consists of 12,471 documents with a total number of 481,580 words, i.e. on average, a description contains about 39 words. For the curious reader we shall note that not all of the 13,117 accommodations in the database provide a textual description. The vocabulary of the document collection comprises 35,873 unique terms, but for the sake of readability of the maps we reduced the number of terms by excluding those occurring less than ten times in the whole collection. Consequently, we used 3,662 terms for creating the semantic maps.

In Figure 1, a natural language description of a holiday flat in Vienna is shown. Beginning with the location of the flat, the accessibility by public transport is mentioned, followed by some terms describing the dining and living room together with enumerations of the respective furniture and fixtures. Other parts of the flat are the sleeping room, a single bedroom and the bathroom. In this particular example, the only words not being nouns or proper names are the determiner Die and the preposition In at the beginning of sentences. For the sake of convenience, we have provided an English translation in Figure 2.

4.2 Semantic Map

For encoding the terms we have chosen 90-dimensional random vectors. The vectors used for training the semantic map depicted in Figure 3 were created by using a context window of length four, i.e. two words before and two words after a term. But instead of treating all four sets of context terms separately, we have put terms at displacements −2 and −1 as well as those at displacements +1 and +2 together. Then the average vectors of both sets were calculated and finally concatenated to create the 180-dimensional context vectors. Further experiments have shown that this setting yielded the best result. For example, using a context window of length four but considering all displacements separately, i.e. the final context vector has length 360, has led to a map where the clusters were not as coherent as on the map shown below. A smaller context window of length two, taking only the surrounding words at displacements −1 and +1 into account, had a similar effect. This indicates that the amount of text available for creating such a statistical model is crucial for the quality of the resulting map. By subsuming the context words at displacements before as well as after the word, the disadvantage of having an insufficient amount of text can be alleviated, because having twice the number of contexts with displacements −1 and +1 is simulated. Due to the enumerative nature of the accommodation descriptions, the exact position of the context terms can be disregarded.
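A rough sketch of how such context vectors could be assembled (hypothetical code, not from the paper; the random 90-dimensional term codes and the pooling of the ±2 window into a left and a right average follow the description above, everything else is an assumption):

```python
import numpy as np

def build_context_vectors(docs, vocab, dim=90, seed=0):
    """docs: list of token lists; vocab: list of terms to encode.
    Returns a dict term -> 180-dimensional context vector (left/right averages concatenated)."""
    rng = np.random.default_rng(seed)
    code = {t: rng.standard_normal(dim) for t in vocab}   # random term codes
    left_sum = {t: np.zeros(dim) for t in vocab}
    right_sum = {t: np.zeros(dim) for t in vocab}
    count = {t: 0 for t in vocab}

    for doc in docs:
        for i, term in enumerate(doc):
            if term not in code:
                continue
            # displacements -2, -1 pooled into one left context, +1, +2 into one right context
            left = [doc[j] for j in (i - 2, i - 1) if 0 <= j < len(doc) and doc[j] in code]
            right = [doc[j] for j in (i + 1, i + 2) if 0 <= j < len(doc) and doc[j] in code]
            if left:
                left_sum[term] += np.mean([code[w] for w in left], axis=0)
            if right:
                right_sum[term] += np.mean([code[w] for w in right], axis=0)
            count[term] += 1

    return {t: np.concatenate([left_sum[t] / max(count[t], 1),
                               right_sum[t] / max(count[t], 1)])
            for t in vocab}
```

The resulting 180-dimensional vectors would then serve as the input vectors x for SOM training.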

The self-organizing map depicted in Figure 3 consists of 20 × 20 units. Due to space considerations, only a few clusters can be detailed in this description, and enumerations of terms in a cluster will only be exemplary. The semantic clusters shaded gray have been determined by manual inspection. They consist of very homogeneous sets of terms related to distinct aspects of the domain. The parts of the right half of the map that have not been shaded mainly contain proper names of places, lakes, mountains, cities or accommodations. However, it shall be noted that, e.g., names of lakes or mountains are homogeneously grouped in separate clusters.

In the upper left corner, mostly verbs, adverbs, adjectives or conjunctions are located. These are terms that have been inadvertently included in the set of relevant terms as described in the previous subsection. In the upper part of the map, a cluster containing terms related to pricing, fees and reductions can be found. Other clusters in this area predominantly deal with words describing types of accommodation and, in the top-right corner, a strong cluster of accommodation names can be found. On the right-hand border of the map, geographical locations, such as central, outskirts, or close to a forest, have been mapped, and a cluster containing skiing- and mountaineering-related terms is also located there.

A dominant cluster containing words that describe room types, furnishing and fixtures can be found in the lower left corner of the map. The cluster labeled advertising terms in the bottom-right corner of the map predominantly contains words that are found at the beginning of the documents, where the pleasures awaiting the potential customer are described.

[Figure 3 shows the 20 × 20 semantic map; labeled clusters include skiing, games/children, farm, room types/furnishing and fixtures, verbs, kitchen, outdoor sports, proper names (cities), animals, mountaineering, location, view, proper names (farms), types of prices/rates/fees/reductions, types of private accommodation, group travel, food, types of travelers, advertising terms, wellness, swimming, adjectives/adverbs/conjunctions/determiners, proper names (accomm.), sports.]

Fig. 3. A self-organizing semantic map of terms in the tourism domain with labels denoting general semantic clusters. The cluster boundaries have been drawn manually

Interesting inter-cluster relations showing the semantic ordering of the terms can be found in the bottom part of the map. The cluster labeled farm contains terms describing, amongst other things, typical goods produced on farms like organic products, jam, grape juice or schnaps. In the upper left corner of the cluster, names of farm animals (e.g. pig, cow, chicken) as well as animals usually found in a petting zoo (e.g. donkey, dwarf goats, cats, calves) are located. This cluster describing animals adjoins a cluster primarily containing terms related to children, toys and games. Some terms are playroom, tabletop soccer, sandbox and volleyball, to name but a few.

This representation of a domain vocabulary supports the construction and enrichment of domain ontologies by making relevant concepts and their relations evident. To provide an example, we found a wealth of terms describing sauna-like recreational facilities having in common that the vacationer sojourns in a closed room with a well-tempered atmosphere, e.g. sauna, tepidarium, bio sauna, herbal sauna, Finnish sauna, steam sauna, thermarium or infrared cabin. On the one hand, major semantic categories identified by inspecting and evaluating the semantic map can be used as a basis for a top-down ontology engineering approach. On the other hand, the clustered terms, extracted from domain-relevant documents, can be used for bottom-up engineering of an existing ontology.


5 Conclusions

In this paper, we have presented a method, based on the self-organizing map, to support the construction and enrichment of domain ontologies. The words occurring in free-form text documents from the application domain are clustered according to their semantic similarity based on statistical context analysis. More precisely, we have shown that when a word is described by words that appear within a fixed-size context window, semantic relations of words unfold in the self-organizing map. Thus, words that refer to similar objects can be found in neighboring parts of the map. The two-dimensional map representation provides an intuitive interface for browsing through the vocabulary to discover new concepts or relations between concepts that are still missing in the ontology. We illustrated this approach with an example from the tourism domain. The clustering results revealed a number of relevant tourism-related terms that can now be integrated into the ontology to provide better retrieval results when searching for accommodations. We achieved this by analysis of self-descriptions written by accommodation providers, thus substantially assisting the costly and time-consuming process of ontology engineering.

References

[1] M. Dittenbach, D. Merkl, and H. Berger. What customers really want from tourism information systems but never dared to ask. In Proc. of the 5th Int'l Conference on Electronic Commerce Research (ICECR-5), Montreal, Canada, 2002.

[2] M. Dittenbach, D. Merkl, and H. Berger. A natural language query interface for tourism information. In A. J. Frew, M. Hitz, and P. O'Connor, editors, Proceedings of the 10th International Conference on Information Technologies in Tourism (ENTER 2003), pages 152–162, Helsinki, Finland, 2003. Springer-Verlag.

[3] W. Frakes, R. Prieto-Díaz, and C. Fox. DARE: Domain analysis and reuse environment. Annals of Software Engineering, Kluwer, 5:125–141, 1998.

[4] T. Honkela. Self-Organizing Maps in Natural Language Processing. PhD thesis, Helsinki University of Technology, 1997.

[5] T. Honkela, V. Pulkki, and T. Kohonen. Contextual relations of words in Grimm tales, analyzed by self-organizing map. In F. Fogelman-Soulié and P. Gallinari, editors, Proceedings of the International Conference on Artificial Neural Networks (ICANN 1995), pages 3–7, Paris, France, 1995. EC2 et Cie.

[6] S. Kaski, T. Honkela, K. Lagus, and T. Kohonen. WEBSOM – self-organizing maps of document collections. Neurocomputing, Elsevier, 21:101–117, November 1998.

[7] T. Kohonen. Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43, 1982.

[8] T. Kohonen. Self-organizing maps. Springer-Verlag, Berlin, 1995.

[9] T. Kohonen, S. Kaski, K. Lagus, J. Salojärvi, J. Honkela, V. Paatero, and A. Saarela. Self organization of a massive document collection. IEEE Transactions on Neural Networks, 11(3):574–585, May 2000.

[10] C. Manning and H. Schütze. Foundations of statistical natural language processing. MIT Press, 2000.

[11] R. Prieto-Díaz. A faceted approach to building ontologies. In S. Spaccapietra, S. T. March, and Y. Kambayashi, editors, Proc. of the 21st Int'l Conf. on Conceptual Modeling (ER 2002), LNCS, Tampere, Finland, 2002. Springer-Verlag.

[12] B. Pröll, W. Retschitzegger, R. Wagner, and A. Ebner. Beyond traditional tourism information systems – TIScover. Information Technology and Tourism, 1, 1998.

[13] H. Ritter and T. Kohonen. Self-organizing semantic maps. Biological Cybernetics, 61(4):241–254, 1989.


Parameterless Data Compression and Noise Filtering Using Association Rule Mining

Yew-Kwong Woon1, Xiang Li2, Wee-Keong Ng1, and Wen-Feng Lu2,3

1 Nanyang Technological University, Nanyang Avenue, Singapore 639798, SINGAPORE

2 Singapore Institute of Manufacturing Technology, 71 Nanyang Drive, Singapore 638075, SINGAPORE

3 Singapore-MIT Alliance

Abstract. The explosion of raw data in our information age necessitates the use of unsupervised knowledge discovery techniques to understand mountains of data. Cluster analysis is suitable for this task because of its ability to discover natural groupings of objects without human intervention. However, noise in the data greatly affects clustering results. Existing clustering techniques use density-based, grid-based or resolution-based methods to handle noise but they require the fine-tuning of complex parameters. Moreover, for high-dimensional data that cannot be visualized by humans, this fine-tuning process is greatly impaired. There are several noise/outlier detection techniques but they too need suitable parameters. In this paper, we present a novel parameterless method of filtering noise using ideas borrowed from association rule mining. We term our technique FLUID (Filtering Using Itemset Discovery). FLUID automatically discovers representative points in the dataset without any input parameter by mapping the dataset into a form suitable for frequent itemset discovery. After frequent itemsets are discovered, they are mapped back to their original form and become representative points of the original dataset. As such, FLUID accomplishes both data and noise reduction simultaneously, making it an ideal preprocessing step for cluster analysis. Experiments involving a prominent synthetic dataset prove the effectiveness and efficiency of FLUID.

1 Introduction

The information age was hastily ushered in by the birth of the World Wide Web (Web) in 1990. All of a sudden, an abundance of information, in the form of web pages and digital libraries, was available at the fingertips of anyone who was connected to the Web. Researchers from the Online Computer Library Center found that there were 7 million unique sites in the year 2000 and the Web was predicted to continue its fast expansion [1]. Data mining becomes important because traditional statistical techniques are no longer feasible to handle such immense data. Cluster analysis, or clustering, becomes the data mining technique of choice because of its ability to function with little human supervision. Clustering is the process of grouping a set of physical/abstract objects



into classes of similar objects. It has been found to be useful for a wide variety of applications such as web usage mining [2], manufacturing [3], personalization of web pages [4] and digital libraries [5].

Researchers begin to analyze traditional clustering techniques in an attempt to adapt them to current needs. One such technique is the classic k-means algorithm [6]. It is fast but is very sensitive to the parameter k and noise. Recent clustering techniques that attempt to handle noise more effectively include density-based techniques [7], grid-based techniques [8] and resolution-based techniques [9, 10]. However, all of them require the fine-tuning of complex parameters to remove the adverse effects of noise. Empirical studies show that many adjustments need to be made and an optimal solution is not always guaranteed [10]. Moreover, for high-dimensional data that cannot be visualized by humans, this fine-tuning process is greatly impaired. Since most data, such as digital library documents, web logs and manufacturing specifications, have many features or dimensions, this shortcoming is unacceptable. There are also several works on outlier/noise detection but they too require the setting of non-intuitive parameters [11, 12].

In this paper, we present a novel unsupervised method of filtering noise using ideas borrowed from association rule mining (ARM) [13]. We term our technique FLUID (FiLtering Using Itemset Discovery). FLUID first maps the dataset into a set of items using binning. Next, ARM is applied to it to discover frequent itemsets. As there has been sustained intense interest in ARM since its conception in 1993, ARM algorithms have improved by leaps and bounds. Any ARM algorithm can be used by FLUID and this allows the leveraging of the efficiency of the latest ARM methods. After frequent itemsets are found, they are mapped back to become representative points of the original dataset. This capability of FLUID not only eliminates the problematic need for noise removal in existing clustering algorithms but also improves their efficiency and scalability because the size of the dataset is significantly reduced. Experiments involving a prominent synthetic dataset prove the effectiveness and efficiency of FLUID.

The rest of the paper is organized as follows. The next section reviews related work in the areas of clustering, outlier detection and ARM, while Section 3 presents the FLUID algorithm. Experiments are conducted on both real and synthetic datasets to assess the feasibility of FLUID in Section 4. Finally, the paper is concluded in Section 5.

2 Related Work

In this section, we review prominent works in the areas of clustering and outlier detection. The problem of ARM and its representative algorithms are discussed as well.

2.1 Clustering and Outlier Detection

The k-means algorithm is the pioneering algorithm in clustering [6]. It begins by randomly generating k cluster centers known as centroids. Objects are iteratively assigned to the cluster where the distance between itself and the cluster's centroid is the shortest. It is fast but sensitive to the parameter k and noise. Density-based methods are more noise-resistant and are based on the notion that dense regions are interesting regions. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is the pioneering density-based technique [7]. It uses two input parameters to define what constitutes the neighborhood of an object and whether its neighborhood is dense enough to be considered. Grid-based techniques can also handle noise. They partition the search space into a number of cells/units and perform clustering on such units. CLIQUE (CLustering In QUEst) considers a unit to be dense if the number of objects in it exceeds a density threshold and uses an apriori-like technique to iteratively derive higher-dimensional dense units. CLIQUE requires the user to specify a density threshold and the size of grids.

Recently, resolution-based techniques have been proposed and applied successfully on noisy datasets. The basic idea is that, when viewed at different resolutions, the dataset reveals different clusters and, by visualization or change detection of certain statistics, the correct resolution at which noise is minimal can be chosen. WaveCluster is a resolution-based algorithm that uses wavelet transformation to distinguish clusters from noise [9]. Users must first determine the best quantization scheme for the dataset and then decide on the number of times to apply the wavelet transform. The TURN* algorithm is another recent resolution-based algorithm [10]. It iteratively scales the data to various resolutions. To determine the ideal resolution, it uses the third differential of the series of cluster feature statistics to detect an abrupt change in the trend. However, it is unclear how certain parameters such as the closeness threshold and the step size of resolution scaling are chosen. Outlier detection is another means of tackling noise. One classic notion is that of DB(Distance-Based)-outliers [11]. An object is considered to be a DB-outlier if a certain fraction f of the dataset lies greater than a distance D from it. A recent enhancement of it involves the use of the concept of k-nearest neighbors [12]; the top n points with the largest Dk (the distance of the kth nearest neighbor of a point) are treated as outliers. The parameters f, D, k and n must be supplied by the user.

In summary, there is currently no ideal solution to the problem of noise, and existing clustering algorithms require much parameter tweaking, which becomes difficult for high-dimensional datasets. Even if somehow their parameters can be optimally set for a particular dataset, there is no guarantee that the same settings will work for other datasets. The problem is similar in the area of outlier detection.

2.2 Association Rule Mining

Since the concept of ARM is central to FLUID, we formally define ARM and then survey existing ARM algorithms in this section. A formal description of ARM is as follows. Let the universal itemset I = {a1, a2, ..., aU} be a set of literals called items. Let Dt be a database of transactions, where each transaction T contains a set of items such that T ⊆ I. A j-itemset is a set of j unique items. For a given itemset X ⊆ I and a given transaction T, T contains X if and only if X ⊆ T. Let ψX be the support count of an itemset X, which is the number of transactions in Dt that contain X. Let s be the support threshold and |Dt| be the number of transactions in Dt. An itemset X is frequent if ψX ≥ |Dt| × s%. An association rule is an implication of the form X =⇒ Y, where X ⊆ I, Y ⊆ I and X ∩ Y = ∅. The association rule X =⇒ Y holds in Dt with confidence c% if no less than c% of the transactions in Dt that contain X also contain Y. The association rule X =⇒ Y has support s% in Dt if ψX∪Y = |Dt| × s%.
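To make the definitions concrete, here is a small hand-made example (the transactions and thresholds are invented purely for illustration and are not from the paper):

```python
# Toy transaction database over the universal itemset I = {a, b, c, d}
Dt = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c", "d"}]

def support_count(S):            # psi_S: number of transactions containing S
    return sum(1 for T in Dt if S <= T)

X, Y = {"a"}, {"b"}
support = support_count(X | Y) / len(Dt)                # 3/5 = 60%
confidence = support_count(X | Y) / support_count(X)    # 3/4 = 75%

# With s = 50%, {a, b} is frequent (support 60% >= 50%),
# and the rule a => b holds with confidence 75%.
print(support, confidence)
```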

The problem of mining association rules is to discover rules that have confidence and support greater than the thresholds. It consists of two main tasks: the discovery of frequent itemsets and the generation of association rules from frequent itemsets. Researchers usually tackle the first task only because it is more computationally expensive. Hence, current algorithms are designed to efficiently discover frequent itemsets. We will leverage the ability of ARM algorithms to rapidly discover frequent itemsets in FLUID. Introduced in 1994, the Apriori algorithm is the first successful algorithm for mining association rules [13]. Since its introduction, it has popularized ARM. It introduces a method to generate candidate itemsets in a pass using only frequent itemsets from the previous pass. The idea, known as the apriori property, rests on the fact that any subset of a frequent itemset must be frequent as well.

The FP-growth (Frequent Pattern-growth) algorithm is a recent ARM algorithm that achieves impressive results by removing the need to generate candidate itemsets, which is the main bottleneck in Apriori [14]. It uses a compact tree structure called a Frequent Pattern-tree (FP-tree) to store information about frequent itemsets. This compact structure also removes the need for multiple database scans and it is constructed using only two scans. The items in the transactions are first sorted and then used to construct the FP-tree. Next, FP-growth proceeds to recursively mine FP-trees of decreasing size to generate frequent itemsets. Recently, we presented a novel trie-based data structure known as the Support-Ordered Trie ITemset (SOTrieIT) to store support counts of 1-itemsets and 2-itemsets [15, 16]. The SOTrieIT is designed to be used efficiently by our algorithm, FOLDARM (Fast OnLine Dynamic Association Rule Mining) [16]. In our recent work on ARM, we propose a new algorithm, FOLD-growth (Fast OnLine Dynamic-growth), which is an optimized hybrid version of FOLDARM and FP-growth [17]. FOLD-growth first builds a set of SOTrieITs from the database and uses them to prune the database before building FP-trees. FOLD-growth is shown to outperform FP-growth by up to two orders of magnitude.

3 Filtering Using Itemset Discovery (FLUID)

3.1 Algorithm

Given a d-dimensional dataset Do consisting of n objects o1, o2, . . . , on, FLUID discovers a set of representative objects O1, O2, . . . , Om, where m ≪ n, in three main steps:

1. Convert dataset Do into a transactional database Dt using procedure MapDB

2. Mine Dt for frequent itemsets using procedure MineDB

3. Convert the discovered frequent itemsets back to their original object form using procedure MapItemset

Procedure MapDB

1 Sort each dimension of Do in ascending order
2 Compute mean μx and standard deviation σx of the nearest object distance in each dimension x by checking the left and right neighbors of each object
3 Find range of values rx for each dimension x
4 Compute number of bins βx for each dimension x: βx = rx / ((μx + 3 × σx) × 0.005 × n)
5 Map each bin to a unique item a ∈ I
6 Convert each object oi in Do into a transaction Ti with exactly d items by binning its feature values, yielding a transactional database Dt

Procedure MapDB tries to discretize the features of dataset Do in a way that minimizes the number of required bins without losing the pertinent structural information of Do. Every dimension has its own distribution of values and thus, it is necessary to compute the bin sizes of each dimension/feature separately. Discretization is itself a massive area but experiments reveal that MapDB is good enough to remove noise efficiently and effectively.

To understand the data distribution in each dimension, the mean and standard deviation of the closest neighbor distance of every object in every dimension are computed. Assuming that all dimensions follow a normal distribution, an object should have one neighboring object within three standard deviations of the mean nearest neighbor distance. To avoid having too many bins, there is a need to ensure that each bin would contain a certain number of objects (0.5% of the dataset size) and this is accomplished in step 4. In the event that the values are spread out too widely, i.e. the standard deviation is much larger than the mean, the number of standard deviations used in step 4 is reduced to 1 instead of 3. Note that if a particular dimension has fewer than 100 unique values, steps 2–4 would be unnecessary and the number of bins would be the number of unique values. As mentioned in step 6, each object becomes a transaction with exactly d items because each item represents one feature of the object. The transactions do not have duplicated items because every feature has its own unique set of bins. Once Do is mapped into transactions with unique items, it is now in a form that can be mined by any association rule mining algorithm.
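A minimal sketch of this binning step (Python/NumPy, our own illustration of the procedure above; the constants 3, 0.005 and the 100-unique-value shortcut follow the description, while the fallback rule for widely spread values and the tie-breaking details are simplifying assumptions):

```python
import numpy as np

def map_db(Do):
    """Do: (n, d) array of objects. Returns a list of n transactions,
    each a tuple of d globally unique item ids (one id per feature bin)."""
    n, d = Do.shape
    transactions = [[] for _ in range(n)]
    item_offset = 0
    for x in range(d):
        col = np.sort(Do[:, x])                              # step 1: sort the dimension
        uniques = np.unique(col)
        if len(uniques) < 100:                               # shortcut: one bin per unique value
            edges = np.append(uniques, uniques[-1] + 1.0)
        else:
            gaps = np.diff(col)                              # step 2: nearest-neighbor distances
            nn = np.minimum(np.append(gaps, np.inf), np.insert(gaps, 0, np.inf))
            mu, sigma = nn.mean(), nn.std()
            k = 3 if sigma <= mu else 1                      # simplified "values too spread out" rule
            r = col[-1] - col[0]                             # step 3: range of the dimension
            scale = (mu + k * sigma) * 0.005 * n
            beta = max(1, int(r / scale)) if scale > 0 else len(uniques)   # step 4: number of bins
            edges = np.linspace(col[0], col[-1], beta + 1)
        # steps 5-6: each (dimension, bin) pair becomes a globally unique item
        bins = np.clip(np.digitize(Do[:, x], edges[1:-1]), 0, len(edges) - 2)
        for i in range(n):
            transactions[i].append(item_offset + int(bins[i]))
        item_offset += len(edges) - 1
    return [tuple(t) for t in transactions]
```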

Procedure MineDB

1 Set support threshold s = 0.1 (10%)


2 Set number of required frequent d-itemsets k = n
3 Let δ(A,B) be the distance between two j-itemsets A(a1, . . . , aj) and B(b1, . . . , bj): δ(A,B) = Σi=1..j |ai − bi|
4 An itemset A is a loner itemset if δ(A,Z) > 1, ∀Z ∈ L ∧ Z ≠ A
5 Repeat
6   Repeat
7     Use an association rule mining algorithm to discover a set of frequent itemsets L from Dt
8     Remove itemsets with less than d items from L
9     Adjust s using a variable step size to bring |L| closer to k
10  Until |L| = k or |L| stabilizes
11  Set k = |L|/2
12  Set s = 0.1
13  Remove loner itemsets from L
14 Until abrupt change in number of loner itemsets

MineDB is the most time-consuming and complex step of FLUID. The key idea here is to discover the optimal set of frequent itemsets that represents the important characteristics of the original dataset; we consider important characteristics to be dense regions in the original dataset. In this case, the support threshold s is akin to the density threshold used by density-based clustering algorithms and thus, it can be used to remove regions with low density (itemsets with low support counts). The crucial point here is how to automate the fine-tuning of s. This is done by checking the number of loner itemsets after each iteration (steps 6–14). Loner itemsets represent points with no neighboring points in the discretized d-dimensional feature space. Therefore, an abrupt change in the number of loner itemsets indicates that the current support threshold value has been reduced to a point where dense regions in the original dataset are being divided into too many sparse regions. This point is made more evident in Section 4, where its effect can be visually observed.

The number of desired frequent d-itemsets (frequent itemsets with exactly d items), k, is initially set to the size of the original dataset as seen in step 2. The goal is to obtain the finest resolution of the dataset that is attainable after its transformation. The algorithm then proceeds to derive coarser resolutions in an exponential fashion in order to quickly discover a good representation of the original dataset. This is done at step 11, where k is reduced to half of |L|. The amount of reduction can certainly be lowered to get more resolutions but this will incur longer processing time and may not be necessary. Experiments have revealed that our choice suffices for a good approximation of the representative points of a dataset.

In step 8, notice that itemsets with less than d items are removed. This is because association rule mining discovers frequent itemsets of various sizes, but we are only interested in frequent itemsets containing items that represent all the features of the dataset. In step 9, the support threshold s is incremented/decremented by a variable step size. The step size is variable as it must be made smaller in order to zoom in on the best possible s to obtain the required number of frequent d-itemsets, k. In most situations, it is quite unlikely that |L| can be adjusted to equal k exactly and thus, if |L| stabilizes or fluctuates between similar values, its closest approximation to k is considered as the best solution, as seen in step 10.
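The support-tuning loop could look roughly like the following sketch (our own illustration, not the authors' code; `mine_frequent_itemsets` is a hypothetical stand-in for any ARM algorithm such as Apriori or FP-growth returning sorted tuples of item ids, and the step-size halving and the "abrupt change" test are assumed implementations of steps 9 and 14):

```python
def mine_db(Dt, d, mine_frequent_itemsets, max_rounds=20):
    """Dt: list of transactions (sorted tuples of d item ids).
    mine_frequent_itemsets(Dt, s): hypothetical ARM call returning the itemsets
    that are frequent at support threshold s."""
    def loners(L):
        # itemsets with no neighbor within distance 1 in the discretized feature space
        return [A for A in L
                if all(sum(abs(a - b) for a, b in zip(A, B)) > 1 for B in L if B != A)]

    k, previous, prev_loners, L = len(Dt), None, None, []
    for _resolution in range(60):                 # safety bound on the number of resolutions
        s, step = 0.1, 0.05
        for _ in range(max_rounds):               # inner loop: steer |L| towards k
            L = [A for A in mine_frequent_itemsets(Dt, s) if len(A) == d]
            if len(L) == k:
                break
            s = max(1e-4, s - step) if len(L) < k else s + step
            step /= 2.0                           # variable step size
        n_loners = len(loners(L))
        if prev_loners is not None and abs(n_loners - prev_loners) > prev_loners:
            return previous                       # crude abrupt-change test: keep previous resolution
        if len(L) <= 1:
            break                                 # nothing left to coarsen
        previous, prev_loners, k = L, n_loners, max(1, len(L) // 2)
    return previous if previous is not None else L
```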

Procedure MapItemset

1 for each frequent itemset A ∈ L do
2   for each item i ∈ A do
3     Assign the center of the bin represented by i as its new value
4   end for
5 end for

The final step of FLUID is the simplest: it involves mapping the frequent itemsets back to their original object form. The filtered dataset now contains representative points of the original dataset, excluding most of the noise. Note that the filtering is only an approximation, but it is sufficient to remove most of the noise in the data and retain the pertinent structural characteristics of the data. Subsequent data mining tasks such as clustering can then be used to extract knowledge from the filtered and compressed dataset efficiently, with few complications from noise. Note also that the types of clusters discovered depend mainly on the clustering algorithm used and not on FLUID.
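A literal transcription of this step (our own sketch; `bin_centers` is an assumed mapping from each item id back to the center of the bin it encodes):

```python
def map_itemsets(L, bin_centers):
    """L: frequent d-itemsets (tuples of item ids). Returns representative points."""
    return [tuple(bin_centers[i] for i in A) for A in L]
```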

3.2 Complexity Analysis

The following are the time complexities of the three main steps of FLUID:

1. MapDB: The main processing time is taken by step 1 and hence, its time complexity is O(n log n).

2. MineDB: As the total number of iterations used by the loops in the procedure is very small, the bulk of the processing time is attributed to the time to perform association rule mining, given by TA.

3. MapItemset: The processing time is dependent on the number of resultant representative points |L| and thus, it has a time complexity of O(n).

Hence, the overall time complexity of FLUID is O(n log n + TA + n).

3.3 Strengths and Weaknesses

The main strength of FLUID is its independence from user-supplied parameters. Unlike its predecessors, FLUID does not require any human supervision. Not only does it remove noise/outliers, it compresses the dataset into a set of representative points without any loss of pertinent structural information of the original dataset. In addition, it is reasonably scalable with respect to both the size and dimensionality of the dataset, as it inherits the efficient characteristics of existing association rule mining algorithms. Hence, it is an attractive preprocessing tool for clustering or other data mining tasks.

[Figure 1 spans four panels (a)–(d), each plotted on axes ranging from 0 to 700 (x) and 0 to 500 (y).]

Fig. 1. Results of executing FLUID on a synthetic dataset.

Ironically, its weakness also stems from its use of association rule mining techniques. This is because association rule mining algorithms do not scale as well as resolution-based algorithms in terms of dataset dimensionality. Fortunately, since ARM is still receiving much attention from the research community, it is possible that more efficient ARM algorithms will become available to FLUID. Another weakness is that FLUID spends much redundant processing time in finding and storing frequent itemsets that have less than d items. This problem is inherent in association rule mining because larger frequent itemsets are usually formed from smaller frequent itemsets. Efficiency and scalability can certainly be improved greatly if there is a way to directly discover frequent d-itemsets.

4 Experiments

This section evaluates the viability of FLUID by conducting experiments on a Pentium 4 machine with a CPU clock rate of 2 GHz and 1 GB of main memory. We shall use FOLD-growth as our ARM algorithm in our experiments as it is fast, incremental and scalable [17]. All algorithms are implemented in Java.


The synthetic dataset (named t7.10k.dat) used here tests the ability of FLUID to discover clusters of various sizes and shapes amidst much noise; it has been used as a benchmarking test for several clustering algorithms [10]. It has been shown that prominent algorithms like k-means [6], DBSCAN [7], CHAMELEON [18] and WaveCluster [9] are unable to properly find the nine visually obvious clusters and remove noise even with exhaustive parameter adjustments [10]. Only TURN* [10] manages to find the correct clusters but it requires user-supplied parameters, as mentioned in Section 2.1. Figure 1(a) shows the dataset with 10,000 points in nine arbitrarily shaped clusters interspersed with random noise.

Figure 1 shows the results of running FLUID on the dataset. FLUID stops at the iteration when Figure 1(c) is obtained, but we show the rest of the results to illustrate the effect of loner itemsets. It is clear that Figure 1(c) is the optimal result, as most of the noise is removed while the nine clusters remain intact. Figure 1(d) loses much of the pertinent information of the dataset. The number of loner itemsets for Figures 1(b), (c) and (d) is 155, 55 and 136, respectively. Figure 1(b) has the most loner itemsets because of the presence of noise in the original dataset. It is the finest representation of the dataset in terms of resolution. There is a sharp drop in the number of loner itemsets in Figure 1(c), followed by a sharp increase in the number of loner itemsets in Figure 1(d). The sharp drop can be explained by the fact that most noise is removed, leaving behind objects that are closely grouped together. In contrast, the sharp increase in loner itemsets is caused by too low a support threshold. This means that only very dense regions are captured and this causes the disintegration of the nine clusters as seen in Figure 1(d). Hence, a change in the trend of the number of loner itemsets is indicative that the structural characteristics of the dataset have changed. FLUID took a mere 6 s to compress the dataset into 1,650 representative points with much of the noise removed. The dataset is reduced by more than 80% without affecting its inherent structure, that is, the shapes of its nine clusters are retained. Therefore, this experiment proves that FLUID can filter away noise even in a noisy dataset with sophisticated clusters, without any user parameters and with impressive efficiency.

5 Conclusions

Clustering is an important data mining task, especially in our information age where raw data is abundant. Several existing clustering methods cannot handle noise effectively because they require the user to set complex parameters properly. We propose FLUID, a noise-filtering and parameterless algorithm based on association rule mining, to overcome the problem of noise as well as to compress the dataset. Experiments on a benchmarking synthetic dataset show the effectiveness of our approach. In our future work, we will improve and provide rigorous proofs of our approach and design a clustering algorithm that can integrate efficiently with FLUID. In addition, the problem of handling high-dimensional datasets will be addressed. Finally, more experiments involving larger datasets with more dimensions will be conducted to affirm the practicality of FLUID.


References

1. Dean, N., ed.: OCLC Researchers Measure the World Wide Web. Number 248. Online Computer Library Center (OCLC) Newsletter (2000)

2. Srivastava, J., Cooley, R., Deshpande, M., Tan, P.N.: Web usage mining: Discovery and applications of usage patterns from web data. SIGKDD Explorations 1 (2000) 12–23

3. Gardner, M., Bieker, J.: Data mining solves tough semiconductor manufacturing problems. In: Proc. 6th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, Boston, Massachusetts, United States (2000) 376–383

4. Mobasher, B., Dai, H., Luo, T., Nakagawa, M., Sun, Y., Wiltshire, J.: Discovery of aggregate usage profiles for web personalization. In: Proc. Workshop on Web Mining for E-Commerce – Challenges and Opportunities, Boston, MA, USA (2000)

5. Sun, A., Lim, E.P., Ng, W.K.: Personalized classification for keyword-based category profiles. In: Proc. 6th European Conf. on Research and Advanced Technology for Digital Libraries, Rome, Italy (2002) 61–74

6. MacQueen, J.B.: Some methods for classification and analysis of multivariate observations. In: Proc. 5th Berkeley Symp. on Mathematical Statistics and Probability. (1967) 281–297

7. Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining, Portland, Oregon (1996) 226–231

8. Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P.: Automatic subspace clustering of high dimensional data for data mining applications. In: Proc. ACM SIGMOD Conf., Seattle, WA (1998) 94–105

9. Sheikholeslami, G., Chatterjee, S., Zhang, A.: WaveCluster: A wavelet-based clustering approach for spatial data in very large databases. VLDB Journal 8 (2000) 289–304

10. Foss, A., Zaïane, O.R.: A parameterless method for efficiently discovering clusters of arbitrary shape in large datasets. In: Proc. Int. Conf. on Data Mining, Maebashi City, Japan (2002) 179–186

11. Knorr, E.M., Ng, R.T.: Algorithms for mining distance-based outliers in large datasets. In: Proc. 24th Int. Conf. on Very Large Data Bases. (1998) 392–403

12. Ramaswamy, S., Rastogi, R., Shim, K.: Efficient algorithms for mining outliers from large data sets. In: Proc. ACM SIGMOD Conf., Dallas, Texas (2000) 427–438

13. Agrawal, R., Srikant, R.: Fast algorithms for mining association rules. In: Proc. 20th Int. Conf. on Very Large Databases, Santiago, Chile (1994) 487–499

14. Han, J., Pei, J., Yin, Y.: Mining frequent patterns without candidate generation. In: Proc. ACM SIGMOD Conf., Dallas, Texas (2000) 1–12

15. Das, A., Ng, W.K., Woon, Y.K.: Rapid association rule mining. In: Proc. 10th Int. Conf. on Information and Knowledge Management, Atlanta, Georgia (2001) 474–481

16. Woon, Y.K., Ng, W.K., Das, A.: Fast online dynamic association rule mining. In: Proc. 2nd Int. Conf. on Web Information Systems Engineering, Kyoto, Japan (2001) 278–287

17. Woon, Y.K., Ng, W.K., Lim, E.P.: Preprocessing optimization structures for association rule mining. Technical Report CAIS-TR-02-48, School of Computer Engineering, Nanyang Technological University, Singapore (2002)

18. Karypis, G., Han, E.H., Kumar, V.: Chameleon: Hierarchical clustering using dynamic modeling. Computer 32 (1999) 68–75



Similarity Search in Structured Data

Hans-Peter Kriegel and Stefan Schonauer

University of Munich, Institute for Computer Science

{kriegel, schoenauer}@informatik.uni-muenchen.de

Abstract. Recently, structured data is getting more and more important in database applications, such as molecular biology, image retrieval or XML document retrieval. Attributed graphs are a natural model for the structured data in those applications. For the clustering and classification of such structured data, a similarity measure for attributed graphs is necessary. All known similarity measures for attributed graphs are either limited to a special type of graph or computationally extremely complex, i.e. NP-complete, and are, therefore, unsuitable for data mining in large databases. In this paper, we present a new similarity measure for attributed graphs, called the matching distance. We demonstrate how the matching distance can be used for efficient similarity search in attributed graphs. Furthermore, we propose a filter-refinement architecture and an accompanying set of filter methods to reduce the number of necessary distance calculations during similarity search. Our experiments show that the matching distance is a meaningful similarity measure for attributed graphs and that it enables efficient clustering of structured data.

1 Introduction

Modern database applications, like molecular biology, image retrieval or XML document retrieval, are mainly based on complex structured objects. Those objects have an internal structure that is usually modeled using graphs or trees, which are then enriched with attribute information (cf. Figure 1). In addition to the data objects, those modern database applications can also be characterized by their most important operations, which are extracting new knowledge from the database, or in other words, data mining. The data mining tasks in this context require some notion of similarity or dissimilarity of objects in the database.

A common approach is to extract a vector of features from the database objects and then use the Euclidean distance or some other Lp-norm between those feature vectors as a similarity measure. But often this results in very high-dimensional feature vectors, which even index structures for high-dimensional feature vectors, like the X-tree [1], the IQ-tree [2] or the VA-file [3], can no longer handle efficiently due to a number of effects usually described by the term 'curse of dimensionality'.


[Figure: an image with its graph, and the graph of a molecule with C, O and H vertices.]

Fig. 1. Examples of attributed graphs: an image together with its graph and the graph of a molecule.

Especially for graph-modeled data, the additional problem arises of how to include the structural information in the feature vector. As the structure of a graph cannot be modeled by a low-dimensional feature vector, the dimensionality problem gets even worse. A way out of this dilemma is to define similarity directly for attributed graphs. Consequently, there is a strong need for similarity measures for attributed graphs. Several approaches to this problem have been proposed in recent time. Unfortunately, all of them have certain drawbacks, like being restricted to special graph types or having NP-complete time complexity, which makes them unusable for data mining applications. Therefore, we present a new similarity measure for attributed graphs, called the edge matching distance, which is not restricted to special graph types and can be evaluated efficiently. Additionally, we propose a filter-refinement architecture for efficient query processing and provide a set of filter methods for the edge matching distance.

The paper is organized as follows: In the next section, we describe the existing similarity measures for attributed graphs and discuss their strengths and weaknesses. The edge matching distance and its properties are presented in Section 3, before the query architecture and the filter methods are introduced in Section 4. In Section 5, the effectiveness and efficiency of our methods are demonstrated in experiments with real data from the domain of image retrieval, before we finish with a short conclusion.

2 Related Work

As graphs are a very general object model, graph similarity has been studied in many fields. Similarity measures for graphs have been used in systems for shape retrieval [4], object recognition [5] or face recognition [6]. For all those measures, graph features specific to the graphs in the application are exploited in order to define graph similarity. Examples of such features are a given one-to-one mapping between the vertices of different graphs or the requirement that all graphs are of the same order.

A very common similarity measure for graphs is the edit distance. It uses the same principle as the well-known edit distance for strings [7, 8]. The idea is to determine the minimal number of insertions and deletions of vertices and edges needed to make the compared graphs isomorphic. In [9], Sanfeliu and Fu extended this principle to attributed graphs by introducing vertex relabeling as a third basic operation beside insertions and deletions. In [10], this measure is used for data mining in a graph.

Unfortunately, the edit distance is a very time-complex measure. Zhang, Statman and Shasha proved in [11] that the edit distance for unordered labeled trees is NP-complete. Consequently, in [12] a restricted edit distance for connected acyclic graphs, i.e. trees, was introduced.

Papadopoulos and Manolopoulos presented another similarity measure for graphs in [13]. Their measure is based on histograms of the degree sequence of graphs and can be computed in linear time, but it does not take the attribute information of vertices and edges into account.

In the field of image retrieval, similarity of attributed graphs is sometimes described as an assignment problem [14], where the similarity distance between two graphs is defined as the minimal cost for mapping the vertices of one graph to those of another graph. With an appropriate cost function for the assignment of vertices, this measure takes the vertex attributes into account and can be evaluated in polynomial time. This assignment measure, which we will call the vertex matching distance in the rest of the paper, obviously completely ignores the structure of the graphs, i.e. they are just treated as sets of vertices.

3 The Edge Matching Distance

As we just described, all the known similarity measures for attributed graphs have certain drawbacks. Starting from the edit distance and the vertex matching distance, we propose a new method to measure the similarity of attributed graphs. This method solves the problems mentioned above and is useful in the context of large databases of structured objects.

3.1 Similarity of Structured Data

The similarity of attributed graphs has several major aspects. The first one is the structural similarity of the graphs and the second one is the similarity of the attributes. Additionally, the weighting of the two just mentioned aspects is significant, because it is highly application dependent to what extent the structural similarity determines the object similarity and to what extent the attribute similarity has to be considered.

With the edit distance between attributed graphs, there exists a similarity measure that fulfills all those conditions. Unfortunately, the computational complexity of this measure is too high to use it for clustering databases of arbitrary size. The vertex matching distance, on the other hand, can be evaluated in polynomial time, but this similarity measure does not take the structural relationships between the vertices into account, which results in a too coarse model for the similarity of attributed graphs. For our similarity measure, called the edge matching distance, we also rely on the principle of graph matching. But instead of matching the vertices of two graphs, we propose a cost function for the matching of edges and then derive a minimal-weight maximal matching between the edge sets of two graphs. This way, not only the attribute distribution but also the structural relationships of the vertices are taken into account. Figure 2 illustrates the idea behind our measure, while the formal definition of the edge matching distance is as follows:

Fig. 2. An example of an edge matching between the graphs G1 and G2.

Definition 1 (edge matching, edge matching distance). Let G1(V1, E1) and G2(V2, E2) be two attributed graphs. Without loss of generality, we assume that |E1| ≥ |E2|. The complete bipartite graph Gem(Vem = E1 ∪ E2 ∪ Δ, E1 × (E2 ∪ Δ)), where Δ represents an empty dummy edge, is called the edge matching graph of G1 and G2. An edge matching between G1 and G2 is defined as a maximal matching in Gem. Let there be a non-negative metric cost function c : E1 × (E2 ∪ Δ) → ℝ+0. We define the matching distance between G1 and G2, denoted by dmatch(G1, G2), as the cost of the minimum-weight edge matching between G1 and G2 with respect to the cost function c.

Through the use of an appropriate cost function, it is possible to adapt the edge matching distance to the particular application needs. This includes how individual attributes are weighted and how the structural similarity is weighted relative to the attribute similarity.

3.2 Properties of the Edge Matching Distance

In order to use the edge matching distance for the clustering of attributed graphs, we need to investigate a few of the properties of this measure. The time complexity of the measure is of great importance for the applicability of the measure in data mining applications. Additionally, the proof of the following theorem also provides an algorithm for computing the matching distance efficiently.

Theorem 1. The matching distance can be calculated in O(n³) time in the worst case.

Proof. To calculate the matching distance between two attributed graphs G1 and G2, a minimum-weight edge matching between the two graphs has to be determined. This is equivalent to determining a minimum-weight maximal matching in the edge matching graph of G1 and G2. To achieve this, the method of Kuhn [15] and Munkres [16] can be used. This algorithm, also known as the Hungarian method, has a worst-case complexity of O(n³), where n is the number of edges in the larger one of the two graphs. □
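A small sketch of this computation using the Hungarian method as implemented in SciPy (`scipy.optimize.linear_sum_assignment`); the edge-list representation, the user-supplied `edge_cost` function and the `dummy_cost` for matching against the empty edge Δ are our own illustrative assumptions, not code from the paper:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def edge_matching_distance(edges1, edges2, edge_cost, dummy_cost):
    """Minimum-weight maximal matching between the two edge sets,
    padding the smaller set with dummy edges (Delta)."""
    if len(edges1) < len(edges2):                 # w.l.o.g. |E1| >= |E2|
        edges1, edges2 = edges2, edges1
    n, m = len(edges1), len(edges2)
    cost = np.full((n, n), float(dummy_cost))     # columns m..n-1 represent Delta
    for i, e1 in enumerate(edges1):
        for j, e2 in enumerate(edges2):
            cost[i, j] = edge_cost(e1, e2)
    rows, cols = linear_sum_assignment(cost)      # Hungarian method, O(n^3)
    return cost[rows, cols].sum()
```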

Apart from the complexity of the edge matching distance itself, it is also important that there are efficient search algorithms and index structures to support its use in large databases. In the context of similarity search, two query types are most important: range queries and (k-)nearest-neighbor queries. Especially for k-nearest-neighbor search, Roussopoulos, Kelley and Vincent [17] and Hjaltason and Samet [18] proposed efficient algorithms. Both of these require that the similarity measure is a metric. Additionally, those algorithms rely on an index structure for the metric objects, such as the M-tree [19]. Therefore, the following theorem is of great importance for the practical application of the edge matching distance.

Theorem 2. The edge matching distance for attributed graphs is a metric.

Proof. To show that the edge matching distance is a metric, we have to prove the three metric properties for this similarity measure.

1. dmatch(G1, G2) ≥ 0: The edge matching distance between two graphs is the sum of the cost for each edge matching. As the cost function is non-negative, any sum of cost values is also non-negative.

2. dmatch(G1, G2) = dmatch(G2, G1): The minimum-weight maximal matching in a bipartite graph is symmetric if the edges in the bipartite graph are undirected. This is equivalent to the cost function being symmetric. As the cost function is a metric, the cost for matching two edges is symmetric. Therefore, the edge matching distance is symmetric.

3. dmatch(G1, G3) ≤ dmatch(G1, G2) + dmatch(G2, G3): As the cost function is a metric, the triangle inequality holds for each triple of edges in G1, G2 and G3 and for those edges that are mapped to an empty edge. The edge matching distance is the sum of the cost of the matching of individual edges. Therefore, the triangle inequality also holds for the edge matching distance. □

Definition 1 does not require that the two graphs be isomorphic in order to have a matching distance of zero. But the matching of the edges together with an appropriate cost function ensures that graphs with a matching distance of zero have a very high structural similarity. But even if the application requires that only isomorphic graphs are considered identical, the matching distance is still of great use. The following lemma allows us to use the matching distance between two graphs as a filter for the edit distance in a filter-refinement architecture, as will be described in Section 4.1. This way, the number of expensive edit distance calculations during query processing can be greatly reduced.


Lemma 1. Given a cost function for the edge matching which is always less than or equal to the cost for editing an edge, the matching distance between attributed graphs is a lower bound for the edit distance between attributed graphs:

∀G1, G2 : dmatch(G1, G2) ≤ dED(G1, G2)

Proof. The edit distance between two graphs is the number of edit operations which are necessary to make those graphs isomorphic. To be isomorphic, the two graphs have to have identical edge sets. Additionally, the vertex sets have to be identical, too. As the cost function for the edge matching distance is always less than or equal to the cost to transform two edges into each other through an edit operation, the edge matching distance is a lower bound for the number of edit operations which are necessary to make the two edge sets identical. As the cost for making the vertex sets identical is not covered by the edge matching distance, it follows that the edge matching distance is a lower bound for the edit distance between attributed graphs. □

4 Efficient Query Processing Using the Edge Matching Distance

While the edge matching distance already has polynomial time complexity, as compared to the exponential time complexity of the edit distance, a matching distance calculation is still a complex operation. Therefore, it makes sense to try to reduce the number of distance calculations during query processing. This goal can be achieved by using a filter-refinement architecture.

4.1 Multi-Step Query Processing

Query processing in a filter-refinement architecture is performed in two or more steps, where the first steps are filter steps that return a number of candidate objects from the database. For those candidate objects, the exact similarity distance is determined in the refinement step and the objects fulfilling the query predicate are reported. To reduce the overall search time, the filter steps have to be easy to perform and a substantial part of the database objects has to be filtered out.

Additionally, the completeness of the filter step is essential, i.e. there must be no false drops during the filter steps. Available similarity search algorithms guarantee completeness if the distance function in the filter step fulfills the lower-bounding property. This means that the filter distance between two objects must always be less than or equal to their exact distance.

Using a multi-step query architecture requires efficient algorithms which actually make use of the filter step. Agrawal, Faloutsos and Swami proposed such an algorithm for range search [20]. In [21] and [22], multi-step algorithms for k-nearest-neighbor search were presented, which are optimal in the number of exact distance calculations necessary during query processing. Therefore, we employ the latter algorithms in our experiments.
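For range queries, the filter-refinement idea can be sketched as follows (our own illustration, not the algorithm of [20]; `d_filter` is any lower-bounding filter distance and `d_exact` the edge matching distance):

```python
def range_query(db, q, eps, d_filter, d_exact):
    """Return all objects o in db with d_exact(q, o) <= eps.
    Completeness relies on d_filter(q, o) <= d_exact(q, o) for all o."""
    candidates = [o for o in db if d_filter(q, o) <= eps]    # cheap filter step
    return [o for o in candidates if d_exact(q, o) <= eps]   # expensive refinement step
```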


4.2 A Filter for the Edge Matching Distance

To employ a filter-refinement architecture, we need filters for the edge matching distance which cover the structural as well as the attribute properties of the graphs in order to be effective.

A way to derive a filter for a similarity measure is to approximate the database objects and then determine the similarity of those approximations. As an approximation for the structure of a graph G, we use the size of that graph, denoted by s(G), i.e. the number of edges in the graph. We define the following similarity measure for our structural approximation of attributed graphs:

dstruct(G1, G2) = |s(G1) − s(G2)| · wmismatch

Here wmismatch is the cost for matching an edge with the empty edge Δ. When the edge matching distance between two graphs is determined, all edges of the larger graph which are not mapped onto an edge of the smaller graph are mapped onto an empty dummy edge Δ. Therefore, the above measure fulfills the lower-bounding property, i.e. ∀G1, G2 : dstruct(G1, G2) ≤ dmatch(G1, G2).

Our filters for the attribute part of graphs are based on the observationthat the difference between the attribute distributions of two graphs influencestheir edge matching distance. This is due to the fact, that during the distancecalculation, edges of the two graphs are assigned to each other. Consequently,the edge matching distance between two graphs is the smaller, the more edgeswith the same attribute values the two graphs have, i.e. the more similar theirattribute value distributions are. Obviously, it is too complex to determine theexact difference of the attribute distributions of two graphs in order to use thisas a filter and an approximation of those distributions is, therefore, needed.

We propose a filter for the attribute part of graphs, which exploits the factthat |x− y| ≥ ||x| − |y||. For attributes which are associated with edges, we addall the absolute values for an attribute in a graph. For two graphs G1 and G2

with s(G1) = s(G2), the difference between those sums, denoted by da(G1, G2),is the minimum total difference between G1 and G2 for the respective attribute.Weighted appropriately according to the cost function that is used, this is alower bound for the edge matching distance. For graphs of different size, this isno longer true, as an edge causing the attribute difference could also be assignedto an empty edge. Therefore, the difference in size of the graphs multiplied withthe maximum cost for this attribute has to be substracted from da(G1, G2) inorder to be lower bounding in all cases.

When considering attributes that are associated with vertices in the graphs,wehave to take into account that during the distance calculation a vertex v is com-pared with several vertices of the second graph, namely exactly degree(v) manyvertices. To take care of this effect, the absolute attribute value for a vertexattribute has to be multiplied with the degree of the vertex, which carries thisattribute value, before the attribute values are added in the same manner as foredge attributes. Obviously, the appropriately weighted size difference has to besubstracted in order to achieve a lower bounding filter value for a node attribute.

315Similarity Search in Structured Data

Page 329: document

Fig. 3. Result of a 10-nearest-neighbor query for the pictograph dataset. Thequery object is shown on top, the result for the vertex matching distanceis in the middle row and the result for the edge matching distance is in thebottom row.

With the above methods it is ensured that the sum of the structural filterdistance plus all attribute filter distances is still a lower bound for the edgematching distance between two graphs. Furthermore, it is possible to precomputethe structural and all attribute filter values and store them in a single vector.This supports efficient filtering during query processing.

5 Experimental Evaluation

To evaluate our new methods, we chose an image retrieval application and rantests on a number of real world data sets:

– 705 black-and-white pictographs– 9818 full-color TV images

To extract graphs from the images, they were segmented with a region grow-ing technique and neighboring segments were connected by edges to representthe neighborhood relationship. Each segment was assigned four attribute values,which are the size, the height and width of the bounding box and the color ofthe segment. The values of the first three attributes were expressed as a percent-age relative to the image size, height and width in order to make the measureinvariant to scaling. We implemented all methods in Java 1.4 and performed ourtests on a workstation with a 2.4GHz Xeon processor and 4GB RAM.

To calculate the cost for matching two edges, we add the difference betweenthe values of the attributes of the corresponding terminal vertices of the twoedges divided by the maximal possible difference for the respective attribute.This way, relatively small differences in the attribute values of the vertices resultin a small matching cost for the compared edges. The cost for matching an edgewith an empty edge is equal to the maximal cost for matching two edges. Thisresults in a cost function, which fulfills the metric properties.

316 Hans-Peter Kriegel and Stefan Schonauer

Page 330: document

Fig. 4. A cluster of portraits in the TV-images.

Figure 3 shows a comparison between the results of a 10-nearest-neighborquery in the pictograph dataset with the edge matching distance and the vertexmatching distance. As one can see, the result obtained with the edge matchingdistance contains less false positives due to the fact that the structural propertiesof the images are considered more using this measure. It is important to note thatthis better result was obtained, even though the runtime of the query processingincreases by as little as 5%.

To demonstrate the usefullness of the edge matching distance for data miningtasks, we determined clusterings of the TV-images by using the density-basedclustering algorithm DBSCAN [23]. In figure 4 one cluster found with the edgematching distance is depicted. Although, the cluster contains some other objects,it clearly consist mainly of portraits. When clustering with the vertex matchingdistance, we found no comparable cluster, i.e. this cluster could only be foundwith the edge matching distance as similarity measure.

To measure the selectivity of our filter method, we implemented a filter re-finement architecture as described in [21]. For each of our datasets, we measuredthe average filter selectivity for 100 queries which retrieved various fractions ofthe database. The results for the experiment when using the full-color TV-imagesare depicted in figure 5(a). It shows that the selectivity of our filter is very good,as e.g. for a query result which is 5% of the database size, more than 87% ofthe database objects are filtered out. The results for the pictograph dataset, asshown in figure 5(b), underline the good selectivity of the filter method. Evenfor a quite large result size of 10%, more than 82% of the database objects areremoved by the filter. As the calculation of the edge matching distance is farmore complex than that of the filter distance, it is not surprising that the re-duction in runtime resulting from filter use was proportional to the number ofdatabase objects, which were filtered out.

6 Conclusions

In this paper, we presented a new similarity measure for data modeled as at-tributed graphs. Starting from the vertex matching distance, well known from thefield of image retrieval, we developed the so called edge matching distance, which

317Similarity Search in Structured Data

Page 331: document

(a) (b)

Fig. 5. Average filter selectivity for the TV-image dataset (a) and the pic-tograph dataset (b).

is based on minimum-weight maximum matching of the edge sets of graphs.This measure takes the structural and the attribute properties of the attributedgraphs into account and can be calculated in O(n3) time in the worst case, whichallows to use it in data mining applications, unlike the common edit distance.In our experiments, we demonstrate that the edge matching distance reflectsthe similarity of graph modeled objects better than the similar vertex matchingdistance, while having an almost identical runtime. Furthermore, we devised afilter refinement architecture and a filter method for the edge matching distance.Our experiments show that this architecture reduces the number of necessarydistance calculations during query processing between 87% and 93%.

In our future work, we will investigate different cost functions for the edgematching distance as well as their usefullness for different applications. Thisincludes especially, the field of molecular biology, where we plan to apply ourmethods to the problem of similarity search in protein databases.

7 Acknowledgement

Finally let us acknowledge the help of Stefan Brecheisen, who implemented partof our code.

References

1. Berchtold, S., Keim, D., Kriegel, H.P.: The X-tree: An index structure for high-dimensional data. In: Proc. 22nd VLDB Conf., Bombay, India (1996) 28–39

2. Berchtold, S., Bohm, C., Jagadish, H., Kriegel, H.P., Sander, J.: Independentquantization: An index compression technique for high-dimensional data spaces.In: Proc. of the 16th ICDE. (2000) 577–588

318 Hans-Peter Kriegel and Stefan Schonauer

Page 332: document

3. Weber, R., Schek, H.J., Blott, S.: A quantitative analysis and performance studyfor similarity-search methods in high-dimensional spaces. In: Proc. 24th VLDBConf. (1998) 194–205

4. Huet, B., Cross, A., Hancock, E.: Shape retrieval by inexact graph matching.In: Proc. IEEE Int. Conf. on Multimedia Computing Systems. Volume 2., IEEEComputer Society Press (1999) 40–44

5. Kubicka, E., Kubicki, G., Vakalis, I.: Using graph distance in object recognition.In: Proc. ACM Computer Science Conference. (1990) 43–48

6. Wiskott, L., Fellous, J.M., Kruger, N., von der Malsburg, C.: Face recognition byelastic bunch graph matching. IEEE PAMI 19 (1997) 775–779

7. Levenshtein, V.: Binary codes capable of correcting deletions, insertions and re-versals. Soviet Physics-Doklady 10 (1966) 707–710

8. Wagner, R.A., Fisher, M.J.: The string-to-string correction problem. Journal ofthe ACM 21 (1974) 168–173

9. Sanfeliu, A., Fu, K.S.: A distance measure between attributed relational graphsfor pattern recognition. IEEE Transactions on Systems, Man and Cybernetics 13(1983) 353–362

10. Cook, D.J., Holder, L.B.: Graph-based data mining. IEEE Intelligent Systems 15(2000) 32–41

11. Zhang, K., Statman, R., Shasha, D.: On the editing distance between unorderedlabeled trees. Information Processing Letters 42 (1992) 133–139

12. Zhang, K., Wang, J., Shasha, D.: On the editing distance between undirectedacyclic graphs. International Journal of Foundations of Computer Science 7 (1996)43–57

13. Papadopoulos, A., Manolopoulos, Y.: Structure-based similarity search with graphhistograms. In: Proc. DEXA/IWOSS Int. Workshop on Similarity Search, IEEEComputer Society Press (1999) 174–178

14. Petrakis, E.: Design and evaluation of spatial similarity approaches for imageretrieval. Image and Vision Computing 20 (2002) 59–76

15. Kuhn, H.: The hungarian method for the assignment problem. Nval ResearchLogistics Quarterly 2 (1955) 83–97

16. Munkres, J.: Algorithms for the assignment and transportation problems. Journalof the SIAM 6 (1957) 32–38

17. Roussopoulos, N., Kelley, S., Vincent, F.: Nearest neighbor queries. In: Proc. ACMSIGMOD, ACM Press (1995) 71–79

18. Hjaltason, G.R., Samet, H.: Ranking in spatial databases. In: Advances in SpatialDatabases, 4th International Symposium, SSD’95, Portland, Maine. Volume 951of Lecture Notes in Computer Science., Springer (1995) 83–95

19. Ciaccia, P., Patella, M., Zezula, P.: M-tree: An efficient access method for similaritysearch in metric spaces. In: Proc. of 23rd VLDB Conf. (1997) 426–435

20. Agrawal, R., Faloutsos, C., Swami, A.N.: Efficient similarity search in sequencedatabases. In: Proc. of the 4th Int. Conf. of Foundations of Data Organization andAlgorithms (FODO), Springer Verlag (1993) 69–84

21. Seidl, T., Kriegel, H.P.: Optimal multi-step k-nearest neighbor search. In: Proc.ACM SIGMOD, ACM Press (1998) 154–165

22. Korn, F., Sidiropoulos, N., Faloutsos, C., Siegel, E., Protopapas, Z.: Fast andeffective retrieval of medical tumor shapes. IEEE TKDE 10 (1998) 889–904

23. Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A density-based algorithm for discov-ering clusters in large spatial databases with noise. In: 2nd Int. Conf. on KnowledgeDiscovery and Data Mining, Portland, Oregon, AAAI Press (1996) 226–231

319Similarity Search in Structured Data

Page 333: document

Using an Interest Ontology for ImprovedSupport in Rule Mining

Xiaoming Chen1, Xuan Zhou1, Richard Scherl2, and James Geller1,�

1 CS Dept., New Jersey Institute of Technology, Newark, NJ 071022 Monmouth University, West Long Branch, New Jersey 07764

Abstract. This paper describes the use of a concept hierarchy for im-proving the results of association rule mining. Given a large set of tupleswith demographic information and personal interest information, asso-ciation rules can be derived, that associate ages and gender with inter-ests. However, there are two problems. Some data sets are too sparse forcoming up with rules with high support. Secondly, some data sets withabstract interests do not represent the actual interests well. To overcomethese problems, we are preprocessing the data tuples using an ontology ofinterests. Thus, interests within tuples that are very specific are replacedby more general interests retrieved from the interest ontology. This re-sults in many more tuples at a more general level. Feeding those tuplesto an association rule miner results in rules that have better support andthat better represent the reality.3

1 Introduction

Data mining has become an important research tool for the purpose of market-ing. It makes it possible to draw far-reaching conclusions from existing customerdatabases about connections between different products purchased. If demo-graphic data are available, data mining also allows the generation of rules thatconnect them with products. However, companies are not just interested in thebehavior of their existing customers, they would like to find out about potentialcustomers. Typically, there is no information about potential customers availablein a company database, that can be used for data mining.

It is possible to perform data mining on potential customers, if one makes thefollowing two adjustments: (1) Instead of looking at products already purchased,we may look at interests of a customer. (2) Many people express their interestsfreely and explicitly on their Web home pages. The process of mining data ofpotential customers becomes a process of Web Mining. In this project, we areextracting raw data from home pages on the Web. In the second stage, we raisespecific but sparse data to higher levels, to make it denser. In the third stage weapply traditional rule mining algorithms to the data.

When mining real data, what is available is often too sparse to producerules with reasonable support. In this paper we are describing a method how to� This research was supported by the NJ Commission for Science and Technology3 Contact Author: James Geller, [email protected]

Y. Kambayashi, M. Mohania, W. Woß (Eds.): DaWaK 2003, LNCS 2737, pp. 320-329, 2003. c Springer-Verlag Berlin Heidelberg 2003

Page 334: document

improve the support of mined rules by using a large ontology of interests thatare related to the extracted raw data.

2 Description of Project, Data and Mining

Our Web Marketing system consists of six modules.(1) The Web search module extracts home pages of users from several portalsites. Currently, the following portal sites are used: LiveJournal, ICQ and Yahoo,as well as a few major universities.(2) The Object-Relational database stores the cleaned results of this search.(3) The data mining module uses the WEKA [13] package for extracting asso-ciation rules from the table data.(4) The ontology is the main knowledge representation of this project [4, 11]. Itconsists of interest hierarchies based on Yahoo and ICQ.(5) The advanced extraction component processes Web pages which do not fol-low simple structure rules.(6) The front end is a user-friendly, Web-based GUI that allows users with noknowledge of SQL to query both the raw data in the tables and the derivedrules.

The data that we are using for data mining consists of records of real personaldata that contain either demographic data and expressed interest data or two dif-ferent items of interest data. In most cases, we are using triples of age, gender andone interest as input for data mining. In other cases we are using pairs of inter-ests. Interests are derived from one of sixteen top level interest categories. Theseinterest categories are called interests at level 1. Examples of level 1 interests(according to Yahoo) include RECREATION SPORTS, HEALTH WELLNESS,GOVERNMENT POLITICS, etc. Interests are organized as a DAG (DirectedAcyclic Graph) hierarchy.

As a result of the large size of the database, the available data goes well be-yond the capacity of the data mining program. Thus, the data sets had to be bro-ken into smaller data sets. A convenient way to do this is to perform data miningon the categories divided at level 1 (top level) or the children of level 1. Thus thereare 16 interest categories at level 1, and the interest GOVERNMENT POLITICShas 20 children, including LAW, MILITARY, ETHICS, TAXES, etc. At the timewhen we extracted the data, ENTERTAINMENT ARTS was the largest datafile at level 1. It had 176218 data items, which is not too large to be handled bythe data mining program.

WEKA generates association rules [1] using the Apriori algorithm first pre-sented by [2]. Since WEKA only works with clean data converted to a fixedformat, called .arff format, we have created customized programs to do dataselection and data cleaning.

321Using an Interest Ontology for Improved Support in Rule Mining

Page 335: document

3 Using Raising for Improved Support

A concept hierarchy is present in many databases either explicitly or implicitly.Some previous work utilizes a hierarchy for data mining. Han [5] discusses datamining at multiple concept levels. His approach is to use discovered associationsat one level (e.g., milk → bread) to direct the search for associations at a differentlevel (e.g., milk of brand X → bread of brand Y). As most of our data mininginvolves only one interest, our problem setting is quite different. Han et al. [6]introduce a top-down progressive deepening method for mining multiple-levelassociation rules. They utilize the hierarchy to collect large item sets at differentconcept levels. Our approach utilizes an interest ontology to improve support inrule mining by means of concept raising.

Fortin et al. [3] use an object-oriented representation for data mining. Theirinterest is in deriving multi-level association rules. As we are typically usingonly one data item in each tuple for raising, the possibility of multi-level rulesdoes not arise in our problem setting. Srikant et al. [12] present Cumulative andEstMerge algorithms to find associations between items at any level by adding allancestors of each item to the transaction. In our work, items of different levelsdo not coexist in any step of mining. Psaila et al. [9] describe a method howto improve association rule mining by using a generalization hierarchy. Theirhierarchy is extracted from the schema of the database and used together withmining queries [7]. In our approach, we are making use of a large pre-existingconcept hierarchy, which contains concepts from the data tuples. Pairceir et al.also differ from our work in that they are mining multi-level rules that associateitems spanning several levels of a concept hierarchy [10]. Joshi et al. [8] areinterested in situations where rare instances are really the most interesting ones,e.g., in intrusion detection. They present a two-phase data mining method with agood balance of precision and recall. For us, rare instances are not by themselvesimportant, they are only important because they contribute with other rareinstances to result in frequently occurring instances for data mining.

There are 11 levels in the Yahoo interest hierarchy. Every extracted interestbelongs somewhere in the hierarchy, and is at a certain level. The lower the levelvalue, the higher up it is in the hierarchy. Level 0 is the root. Level 1 is the toplevel, which includes 16 interests. For example, FAMILY HOME is an interest atlevel 1. PARENTING is an interest at level 2. PARENTING is a child of FAM-ILY HOME in the hierarchy. If a person expressed an interest in PARENTING,it is common sense that he or she is interested in FAMILY HOME. Therefore,at level 1, when we count those who are interested in FAMILY HOME, it isreasonable to count those who are interested in PARENTING. This idea appliesin the same way to lower levels.

A big problem in the derivation of association rules is that available datais sometimes very sparse and biased as a result of the interest hierarchy. Forexample, among over a million of interest records in our database only 11 peo-ple expressed an interest in RECREATION SPORTS, and nobody expressedan interest in SCIENCE. The fact that people did not express interests withmore general terms does not mean they are not interested. The data file of

322 Xiaoming Chen et al.

Page 336: document

RECREATION SPORTS has 62734 data items. In other words, 62734 interestexpressions of individuals are in the category of RECREATION SPORTS. In-stead of saying “I’m interested in Recreation and Sports,” people prefer saying“I’m interested in basketball and fishing.” They tend to be more specific withtheir interests. We analyzed the 16 top level categories of the interest hierar-chy. We found users expressing interests at the top level only in two categories,MUSIC and RECREATION SPORTS. When mining data at higher levels, it isimportant to include data at lower levels, in order to gain data accuracy andhigher support.

In the following examples, the first letter stands for an age range. The agerange from 10 to 19 is represented by A, 20 to 29 is B, 30 to 39 is C, 40 to 49 isD, etc. The second letter stands for Male or Female. Text after a double slash(//) is not part of the data. It contains explanatory remarks.

Original Data File:B,M,BUSINESS FINANCE //level=1D,F,METRICOM INC //level=7E,M,BUSINESS SCHOOLS //level=2C,F,ALUMNI //level=3B,M,MAKERS //level=4B,F,INDUSTRY ASSOCIATIONS //level=2C,M,AOL INSTANT MESSENGER //level=6D,M,INTRACOMPANY GROUPS //level=3C,M,MORE ABOUT ME //wrong data

The levels below 7 do not have any data in this example. Raising will processthe data level-by-level starting at level 1. It is easiest to see what happens if welook at the processing of level 3. First the result is initialized with the data atlevel 3 contained in the source file. With our data shown above, that means thatthe result is initialized with the following two lines.

C,F,ALUMNID,M,INTRACOMPANY GROUPS

In order to perform the raising we need to find ancestors at level 3 of theinterests in our data. Table 1 shows all ancestors of our interests from levels 4,5, 6, 7, such that the ancestors are at level 3. The following lines are now addedto our result.

D,F,COMMUNICATIONS AND NETWORKING // raised from level=7 (1stancestor)D,F,COMPUTERS // raised from level=7 (2nd ancestor)B,M,ELECTRONICS // raised from level=4C,M,COMPUTERS // raised from level=6

That means, after raising we have the following occurrence counts at level 3.

323Using an Interest Ontology for Improved Support in Rule Mining

Page 337: document

ALUMNI: 1INTRACOMPANY GROUPS: 1COMMUNICATIONS AND NETWORKING: 1COMPUTERS: 2ELECTRONICS: 1

Before raising, we only had two items at level 3. Now, we have six items atlevel 3. That means that we now have more data as input for data mining thanbefore raising. Thus, the results of data mining will have better support and willmuch better reflect the actual interests of people.

Table 1. Relevant Ancestors

Interest Name Its Ancestor(s) at Level 3

METRICOM INC COMMUNICATIONS AND NETWORKINGMETRICOM INC COMPUTERSMAKERS ELECTRONICSAOL INSTANT MESSENGER COMPUTERS

Due to the existence of multiple parents and common ancestors, the precisemethod of raising is very important. There are different ways to raise a data file.One way is to get the data file of the lowest level, and raise interests bottom-up,one level at a time, until we finish at level 1. The data raised from lower levelsis combined with the original data from the given level to form the data file atthat level. If an interest has multiple parents, we include these different parentsin the raised data. However, if those parents have the same ancestor at somehigher level, duplicates of data appear at the level of common ancestors.

This problem is solved by adopting a different method: we are raising directlyto the target level, without raising to any intermediate level. After raising to acertain level, all data at this level can be deleted and never have to be consideredagain for lower levels. This method solves the problem of duplicates caused bymultiple parents and common ancestors. The data file also becomes smaller whenthe destination level becomes lower.

In summary, the raising algorithm is implemented as follows: Raise the orig-inal data to level 1. Do data mining. Delete all data at level 1 from the originaldata file. Raise the remaining data file to level 2. Do data mining. Delete all dataat level 2 from the data file, etc. Continue until there’s no more valid data. Theremaining data in the data file are wrong data.

4 Results

The quality of association rules is normally measured by specifying support andconfidence. Support may be given in two different ways [13], as absolute supportand as relative support. Witten et al. write:

324 Xiaoming Chen et al.

Page 338: document

The coverage of an association rule is the number of instances for whichit predicts correctly – this is often called its support. . . . It may also beconvenient to specify coverage as a percentage of the total number ofinstances instead. (p. 64)

For our purposes, we are most interested in the total number of tuples thatcan be used for deriving association rules, thus we will use the absolute numberof support only. The data support is substantially improved by means of raising.Following are two rules from RECREATION SPORTS at level 2 without raising:

age=B interest=AVIATION 70 ⇒ gender=M 55 conf:(0.79) (1)age=C interest=OUTDOORS 370 ⇒ gender=M 228 conf:(0.62) (2)

Following are two rules from RECREATION SPORTS at level 2 with raising.age=A gender=F 13773 ⇒ interest=SPORTS 10834 conf:(0.79) (3)age=C interest=OUTDOORS 8284 ⇒ gender=M 5598 conf:(0.68) (4)

Rule (2) and Rule (4) have the same attributes and rule structure. Withoutraising, the absolute support is 228, while with raising it becomes 5598. Theimprovement of the absolute support of this rule is 2355%.

Not all rules for the same category and level have the same attributes andstructure. For example, rule (1) appeared in the rules without raising, but notin the rules with raising. Without raising, 70 people are of age category B andchoose AVIATION as their interest. Among them, 55 are male. The confidencefor this rule is 0.79. After raising, there is no rule about AVIATION, becausethe support is too small compared with other interests such as SPORTS andOUTDOORS. In other words, one effect of raising is that rules that appear inthe result of WEKA before raising might not appear after raising and vice versa.

There is a combination of two factors why rules may disappear after raising.First, this may be a result of how WEKA orders the rules that it finds byconfidence and support. WEKA primarily uses confidence for ordering the rules.There is a cut off parameter, so that only the top N rules are returned. Thus,by raising, a rule in the top N might drop below the top N.

There is a second factor that affects the change of order of the mined rules.Although the Yahoo ontology ranks both AVIATION and SPORTS as level-2interests, the hierarchy structure underneath them is not balanced. According tothe hierarchy, AVIATION has 21 descendents, while SPORTS has 2120 descen-dents, which is about 100 times more. After raising to level 2, all nodes belowlevel 2 are replaced by their ancestors at level 2. As a result, SPORTS becomesan interest with overwhelmingly high support, whereas the improvement rate forAVIATION is so small that it disappeared from the rule set after raising.

There is another positive effect of raising. Rule (3) above appeared in therules with raising. After raising, 13773 people are of age category A and gendercategory F. Among them, 10834 are interested in SPORTS. The confidence is0.79. These data look good enough to generate a convincing rule. However, therewere no rules about SPORTS before raising. Thus, we have uncovered a rule withstrong support that also agrees with our intuition. However, without raising, this

325Using an Interest Ontology for Improved Support in Rule Mining

Page 339: document

rule was not in the result of WEKA. Thus, raising can uncover new rules thatagree well with our intuition and that also have better absolute support.

To evaluate our method, we compared the support and confidence of raisedand unraised rules. The improvement of support is substantial. Table 2 comparessupport and confidence for the same rules before and after raising for RECRE-ATION SPORTS at level 2. There are 58 3-attribute rules without raising, and55 3-attribute rules with raising. 18 rules are the same in both results. Theirsupport and confidence are compared in the table. The average support is 170before raising, and 4527 after raising. The average improvement is 2898%. Thus,there is a substantial improvement in absolute support. After raising, the loweraverage confidence is a result of expanded data. Raising effects not only the datathat contributes to a rule, but all other data as well. Thus, confidence was ex-pected to drop. Even though the confidence is lower, the improvement in supportby far outstrips this unwanted effect.

Table 2. Support and Confidence Before and After Raising

Supp. Supp. Improv. Conf. Conf. Improv.Rule (int = interest, gen = gender) w/o w/ of w/o w/ of

rais. rais. supp. rais. rais. Conf.

age=C int=AUTOMOTIVE ⇒ gen=M 57 3183 5484% 80 73 -7%age=B int=AUTOMOTIVE ⇒ gen=M 124 4140 3238% 73 65 -8%age=C int=OUTDOORS ⇒ gen=M 228 5598 2355% 62 68 6%age=D int=OUTDOORS ⇒ gen=M 100 3274 3174% 58 67 9%age=B int=OUTDOORS ⇒ gen=M 242 5792 2293% 54 61 7%age=C gen=M ⇒ int=OUTDOORS 228 5598 2355% 51 23 -28%gen=M int=AUTOMOTIVE ⇒ age=B 124 4140 3238% 47 37 -10%age=D gen=M ⇒ int=OUTDOORS 100 3274 3174% 46 27 -19%age=B int=OUTDOORS ⇒ gen=F 205 3660 1685% 46 39 -7%age=B gen=M ⇒ int=OUTDOORS 242 5792 2293% 44 18 -26%gen=F int=OUTDOORS ⇒ age=B 205 3660 1685% 42 39 -3%gen=M int=OUTDOORS ⇒ age=B 242 5792 2293% 38 34 -4%int=AUTOMOTIVE ⇒ age=B gen=M 124 4140 3238% 35 25 -10%gen=M int=OUTDOORS ⇒ age=C 228 5598 2355% 35 33 -2%age=D ⇒ gen=M int=OUTDOORS 100 3274 3174% 29 19 -10%gen=M int=AUTOMOTIVE ⇒ age=C 57 3183 5484% 22 28 6%int=OUTDOORS ⇒ age=B gen=M 242 5792 2293% 21 22 1%int=OUTDOORS ⇒ age=C gen=M 228 5598 2355% 20 21 1%

Table 3 shows the comparison of all rules that are the same before and afterraising. The average improvement of support is calculated at level 2, level 3,level 4, level 5 and level 6 for each of the 16 categories. As explained in Sect.3, few people expressed an interest at level 1, because these interest names aretoo general. Before raising, there are only 11 level-1 tuples with the interestRECREATION SPORTS and 278 tuples with the interest MUSIC. In the other

326 Xiaoming Chen et al.

Page 340: document

14 categories, there are no tuples at level 1 at all. However, after raising, thereare 6,119 to 174,916 tuples at level 1, because each valid interest in the originaldata can be represented by its ancestor at level 1, no matter how low the interestis in the hierarchy.

All the 16 categories have data down to level 6. However, COMPUTERSINTERNET, FAMILY HOME and HEALTH WELLNESS have no data at level7. In general, data below level 6 is very sparse and does not contribute a greatdeal to the results. Therefore, we present the comparison of rules from level 2through level 5 only. Some rules generated by WEKA are the same with andwithout raising. Some are different. In some cases, there is not a single rulein common between the rule sets with and without raising. The comparison istherefore not applicable. Those conditions are denoted by “N/A” in the table.

Table 3. Support Improvement Rate of Common Rules

Category Level2 Level3 Level4 Level5

BUSINESS FINANCE 122% 284% 0% 409%COMPUTERS INTERNET 363% 121% 11% 0%CULTURES COMMUNITY N/A 439% N/A 435%ENTERTAINMENT ARTS N/A N/A N/A N/AFAMILY HOME 148% 33% 0% 0%GAMES 488% N/A 108% 0%GOVERNMENT POLITICS 333% 586% 0% N/AHEALTH WELLNESS 472% 275% 100% 277%HOBBIES CRAFTS N/A 0% 0% 0%MUSIC N/A 2852% N/A 0%RECREATION SPORTS 2898% N/A 76% N/AREGIONAL 6196% 123% N/A 0%RELIGION BELIEFS 270% 88% 634% 0%ROMANCE RELATIONSHIPS 224% 246% N/A 17%SCHOOLS EDUCATION 295% 578% N/A 297%SCIENCE 1231% 0% 111% 284%

Average Improvement 1086% 432% 104% 132%

Table 4 shows the average improvement of support of all rules after raisingto level 2, level 3, level 4 and level 5 within the 16 interest categories. Thisis computed as follows. We sum the support values for all rules before raisingand divide them by the number of rules, i.e., we compute the average supportbefore raising, Sb. Similarly, we compute the average support of all the rulesafter raising. Then the improvement rate R is computed as:

R =Sa − Sb

Sb∗ 100 [percent] (1)

327Using an Interest Ontology for Improved Support in Rule Mining

Page 341: document

The average improvement rate for level 2 through level 5 is, respectively,279%, 152%, 68% and 20%. WEKA ranks the rules according to the confidence,and discards rules with lower confidence even though the support may be higher.

In Tab. 4 there are three values where the improvement rate R is negative.This may happen if the total average relative support becomes lower after raising.That in turn can happen, because, as mentioned before, the rules before and afterraising may be different rules. The choice of rules by WEKA is primarily madebased on relative support and confidence values.

Table 4. Support Improvement Rate of All Rules

Category Level2 Level3 Level4 Level5

BUSINESS FINANCE 231% 574% -26% 228%COMPUTERS INTERNET 361% 195% 74% -59%CULTURES COMMUNITY 1751% 444% 254% 798%ENTERTAINMENT ARTS 4471% 2438% 1101% 332%FAMILY HOME 77% 26% 56% 57%GAMES 551% 1057% 188% 208%GOVERNMENT POLITICS 622% 495% 167% 1400%HEALTH WELLNESS 526% 383% 515% 229%HOBBIES CRAFTS 13266% 2% 7% 60%MUSIC 13576% 3514% 97% 62%RECREATION SPORTS 6717% 314% 85% 222%REGIONAL 7484% 170% 242% -50%RELIGION BELIEFS 285% 86% 627% 383%ROMANCE RELATIONSHIPS 173% 145% 2861% 87%SCHOOLS EDUCATION 225% 550% 1925% 156%SCIENCE 890% 925% 302% 317%

Average Improvement 279% 152% 68% 20%

5 Conclusions and Future Work

In this paper, we showed that the combination of an ontology of the minedconcepts with a standard rule mining algorithm can be used to generate datasets with orders of magnitude more tuples at higher levels. Generating rulesfrom these tuples results in much larger (absolute) support values. In addition,raising often produces rules that, according to our intuition, better represent thedomain than rules found without raising. Formalizing this intuition is a subjectof future work.

According to our extensive experiments with tuples derived from Yahoo in-terest data, data mining with raising can improve absolute support for rules upto over 6000% (averaged over all common rules in one interest category). Im-provements in support may be even larger for individual rules. When averaging

328 Xiaoming Chen et al.

Page 342: document

over all support improvements for all 16 top level categories and levels 2 to 5,we get a value of 438%.

Future work includes using other data mining algorithms, and integratingthe raising process directly into the rule mining algorithm. Besides mining forassociation rules, we can also perform classification and clustering at differentlevels of the raised data. The rule mining algorithm itself needs adaptation toour domain. For instance, there are over 31,000 interests in our version of theinterest hierarchy. Yahoo has meanwhile added many more interests. Findinginterest – interest associations becomes difficult using WEKA, as interests ofpersons appear as sets, which are hard to map onto the .arff format.

References

1. R. Agrawal, T. Imielinski, and A. N. Swami. Mining association rules betweensets of items in large databases. In Peter Buneman and Sushil Jajodia, editors,Proceedings of the 1993 ACM SIGMOD International Conference on Managementof Data, pages 207–216, Washington, D.C., 1993.

2. R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Jorge B.Bocca, Matthias Jarke, and Carlo Zaniolo, editors, Proc. 20th Int. Conf. Very LargeData Bases, VLDB, pages 487–499. Morgan Kaufmann, 1994.

3. S. Fortin and L. Liu. An object-oriented approach to multi-level association rulemining. In Proceedings of the fifth international conference on Information andknowledge management, pages 65–72. ACM Press, 1996.

4. J. Geller, R. Scherl, and Y. Perl. Mining the web for target marketing information.In Proceedings of CollECTeR, Toulouse, France, 2002.

5. J. Han. Mining knowledge at multiple concept levels. In CIKM, pages 19–24, 1995.6. J. Han and Y. Fu. Discovery of multiple-level association rules from large

databases. In Proc. of 1995 Int’l Conf. on Very Large Data Bases (VLDB’95),Zurich, Switzerland, September 1995, pages 420–431, 1995.

7. J. Han, Y. Fu, W. Wang, K. Koperski, and O. Zaiane. DMQL: A data miningquery language for relational databases, 1996.

8. M. V. Joshi, R. C. Agarwal, and V. Kumar. Mining needle in a haystack: classifyingrare classes via two-phase rule induction. SIGMOD Record (ACM Special InterestGroup on Management of Data), 30(2):91–102, 2001.

9. G. P. and P. L. Lanzi. Hierarchy-based mining of association rules in data ware-houses. In Proceedings of the 2000 ACM symposium on Applied computing 2000,pages 307–312. ACM Press, 2000.

10. R. Pairceir, S. McClean, and B. Scotney. Discovery of multi-level rules and ex-ceptions from a distributed database. In Proceedings of the sixth ACM SIGKDDinternational conference on Knowledge discovery and data mining, pages 523–532.ACM Press, 2000.

11. R. Scherl and J. Geller. Global communities, marketing and web min-ing,. Journal of Doing Business Across Borders, 1(2):141–150, 2002.http://www.newcastle.edu.au/journal/dbab/images/dbab 1(2).pdf.

12. R. Srikant and R. Agrawal. Mining generalized association rules. In Proc. of 1995Int’l Conf. on Very Large Data Bases (VLDB’95), Zurich, Switzerland, September1995, pages 407–419, 1995.

13. I. H. Witten and E. Frank. Data Mining. Morgan Kaufmann Publishers, SanFrancisco, 2000.

329Using an Interest Ontology for Improved Support in Rule Mining

Page 343: document

Fraud Formalization and Detection�

Bharat Bhargava, Yuhui Zhong, and Yunhua Lu

Center for Education and Research in Information Assurance and Security (CERIAS)and Department of Computer Sciences

Purdue University, West Lafayette, IN 47907, USA{bb,zhong,luy}@cs.purdue.edu

Abstract. A fraudster can be an impersonator or a swindler. An imper-sonator is an illegitimate user who steals resources from the victims by“taking over” their accounts. A swindler is a legitimate user who inten-tionally harms the system or other users by deception. Previous researchefforts in fraud detection concentrate on identifying frauds caused byimpersonators. Detecting frauds conducted by swindlers is a challengingissue. We propose an architecture to catch swindlers. It consists of fourcomponents: profile-based anomaly detector, state transition analysis,deceiving intention predictor, and decision-making component. Profile-based anomaly detector outputs fraud confidence indicating the possibil-ity of fraud when there is a sharp deviation from usual patterns. Statetransition analysis provides state description to users when an activityresults in entering a dangerous state leading to fraud. Deceiving inten-tion predictor discovers malicious intentions. Three types of deceiving in-tentions, namely uncovered deceiving intention, trapping intention, andillusive intention, are defined. A deceiving intention prediction algorithmis developed. A user-configurable risk evaluation function is used for de-cision making. A fraud alarm is raised when the expected risk is greaterthan the fraud investigation cost.

1 Introduction

Fraudsters can be classified into two categories: impersonators and swindlers.An impersonator is an illegitimate user who steals resources from the victims by“taking over” their accounts. A swindler, on the other hand, is a legitimate userwho intentionally harms the system or other users by deception. Taking super-imposition fraud in telecommunication [7] as an example, impersonators imposetheir usage on the accounts of legitimate users by using cloned phones with Mo-bile Identification Numbers (MIN) and Equipment Serial Numbers (ESN) stolenfrom the victims. Swindlers obtain legitimate accounts and use the services with-out the intention to pay bills, which is called subscription fraud.

Impersonators can be forestalled by utilizing cryptographic technologies thatprovide strong protection to users’ authentication information. The idea of sep-aration of duty may be applied to reduce the impact of a swindler. The essence� This research is supported by NSF grant IIS-0209059.

Y. Kambayashi, M. Mohania, W. Woß (Eds.): DaWaK 2003, LNCS 2737, pp. 330–339, 2003.c© Springer-Verlag Berlin Heidelberg 2003

Page 344: document

Fraud Formalization and Detection 331

is to restrict the power an entity (e.g., a transaction partner) can have to pre-vent him from abusing it. An empirical example of this idea is that laws areset, enforced and interpreted by different parties. Separation of duty can be im-plemented by using access control mechanisms such as role based access controlmechanism, or lattice-based access control model [8]. Separation of duty policiesand other mechanisms, like dual-log bookkeeping [8] reduce frauds but cannoteliminate them. For example, for online auctions, such as eBay, sellers and buy-ers have restricted knowledge about the other side. Although eBay, as a trustedthird party, has authentication services to check the information provided bysellers and buyers (e.g. phone numbers), it is impossible to verify all of them dueto the high quantities of online transactions. Fraud is a persistent issue undersuch an environment.

In this paper, we concentrate on swindler detection. Three approaches areconsidered: (a) detecting an entity’s activities that deviate from normal patterns,which may imply the existence of a fraud; (b) constructing state transition graphsfor existing fraud scenarios and detecting fraud attempts similar to the knownones; and (c) discovering an entity’s intention based on his behavior. The firsttwo approaches can also be used to detect frauds conducted by impersonators.The last one is applicable only for swindler detection.

The rest of this paper is organized as the follows. Section 2 introduces therelated work. Definitions for fraud and deceiving intentions are presented inSection 3. An architecture for swindler detection is proposed in Section 4. Itconsists of a profile-based anomaly detector, a state transition analysis compo-nent, a deceiving intention predictor, and a decision-making component. Thefunctionalities and design considerations for each component are discussed. Analgorithm for predicting deceiving intentions is designed and studied via exper-iments. Section 5 concludes the paper.

2 Related Work

Fraud detection systems are widely used in telecommunications, online trans-actions, the insurance industry, computer and network security [1, 3, 6, 11].The majority of research efforts addresses detecting impersonators (e.g. detect-ing superimposition fraud in telecommunications). Effective fraud detection usesboth fraud rules and pattern analysis. Fawcett and Provost proposed an adap-tive rule-based detection framework [4]. Rosset et al. pointed out that standardclassification and rule generation were not appropriate for fraud detection [7].The generation and selection of a rule set should combine both user-level andbehavior-level attributes. Burge and Shawe-Taylor developed a neural networktechnique [2]. The probability distributions for current behavior profiles and be-havior profile histories are compared using Hellinger distances. Larger distancesindicate more suspicion of fraud.

Several criteria exist to evaluate the performance of fraud detection engines.ROC (Receiver Operating Characteristics) is a widely used one [10, 5]. Rossetet al. use accuracy and fraud coverage as criteria [7]. Accuracy is the number

Page 345: document

332 Bharat Bhargava et al.

of detected instances of fraud over the total number of classified frauds. Fraudcoverage is the number of detected frauds over the total number of frauds. Stolfoet al. use a cost-based metric in commercial fraud detection systems [9]. If theloss resulting from a fraud is smaller than the investigation cost, this fraud isignored. This metric is not suitable in circumstances where such a fraud happensfrequently and causes a significant accumulative loss.

3 Formal Definitions

Frauds by swindlers occur in cooperations where each entity makes a commit-ment. A swindler is an entity that has no intention to keep his commitment.

Commitment is the integrity constraints, assumptions, and conditions an en-tity promises to satisfy in a process of cooperation. Commitment is describedby using conjunction of expressions. An expression is (a) an equality with anattribute variable on the left hand side and a constant representing the expectedvalue on the right hand side, or (b) a user-defined predicate that represents cer-tain complex constraints, assumptions and conditions. A user-defined Booleanfunction is associated with the predicate to check whether the constraints, as-sumptions and conditions hold.

Outcome is the actual results of a cooperation. Each expression in a com-mitment has a corresponding one in the outcome. For an equality expression,the actual value of the attribute is on the right hand side. For a predicate ex-pression, if the use-define function is true, the predicate itself is in the outcome.Otherwise, the negation of the predicate is included.

Example: A commitment of a seller for selling a vase is (Received by = 04/01)∧ (Prize = $1000) ∧ (Quality = A) ∧ ReturnIfAnyQualityProblem. This com-mitment says that the seller promises to send out one “A” quality vase at theprice of $1000. The vase should be received by April 1st. If there is a qualityproblem, the buyer can return the vase. An possible outcome is (Received by= 04/05) ∧ (Prize = $1000) ∧ (Quality = B) ∧ ¬ReturnIfAnyQualityProblem.This outcome shows that the vase of quality “B”, was received on April 5th. Thereturn request was refused. We may conclude that the seller is a swindler.

Predicates or attribute variables play different roles in detecting a swindler.We define two properties, namely intention-testifying and intention-dependent.

Intention-testifying: A predicate P is intention-testifying if the presence of ¬Pin an outcome leads to the conclusion that a partner is a swindler. An attributevariable V is intention-testifying if one can conclude that a partner is a swindlerwhen V’s expected value is more desirable than the actual value.

Intention-dependent: A predicate P is intention-dependent if it is possiblethat a partner is a swindler when ¬P appears in an outcome. An attributevariable V is intention-dependent if it is possible that a partner is a swindlerwhen its expected value is more desirable than the actual value.

An intention-testifying variable or predicate is intention-dependent. The op-posite direction is not necessarily true.

Page 346: document

Fraud Formalization and Detection 333

0 50 100 1500

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Sat

isfa

ctio

n R

atin

g

Number of Observations

(a) Uncovered deceiving in-tention

0 50 100 1500

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Sat

isfa

ctio

n R

atin

g

Number of Observations

(b) Trapping intention

0 50 100 1500

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Sat

isfa

ctio

n R

atin

g

Number of Observations

(c) Illusive intention

Fig. 1. Deceiving intention

In the above example, ReturnIfAnyQualityProblem can be intention-testifying or intention-dependent. The decision is up to the user. Prize isintention-testifying since if the seller charges more money, we believe that heis a swindler. Quality and received by are defined as intention-dependent vari-ables considering that a seller may not have full control on them.

3.1 Deceiving Intentions

Since the intention-testifying property is usually too strong in real applications,variables and predicates are specified as intention-dependent. A conclusion thata partner is a swindler cannot be drawn with 100% certainty based on oneintention-dependent variable or predicate in one outcome. Two approaches canbe used to increase the confidence: (a) consider multiple variables or predicatesin one outcome; and (b) consider one variable or predicate in multiple outcomes.The second approach is applied in this paper.

Assume a satisfaction rating ranging from 0 to 1 is given for the actual valueof each intention-dependent variable in an outcome. The higher the rating is,the more satisfied the user is. The value of 0 means totally unacceptable and thevalue of 1 indicates that actual value is not worse than the expected value. Forexample, if the quality of received vase is B, the rating is 0.5. If the quality is C,the rating drops to 0.2. For each intention-dependent predicate P, the rating is0 if ¬P appears. Otherwise, the rating is 1. A satisfaction rating is related to anentity’s deceiving intention as well as some unpredictable factors. It is modelledby using random variables with normal distribution. The mean function fm(n)determines the mean value of the normal distribution at the the nth rating.

Three types of deceiving intentions are identified.Uncovered deceiving intention: The satisfaction ratings associated with

a swindler having uncovered deceiving intention are stably low. The ratingsvary in a small range over time. The mean function is defined as fm(n) = M,where M is a constant. Figure 1a shows satisfaction ratings with fm(n)=0.2. Thefluctuation of ratings results from the unpredictable factors.

Page 347: document

334 Bharat Bhargava et al.

Trapping intention: The rating sequence can be divided into two phases:preparing and trapping. A swindler behaves well to achieve a trustworthy imagebefore he conducts frauds. The mean function can be defined as:

fm(n) ={

mhigh, n≤ n0;mhigh, otherwise. Where n0 is the turning point.

Figure 1b shows satisfaction ratings for a swindler with trapping intention.Fm(n) is 0.8 for the first 50 interactions and 0.2 afterwards.

Illusive intention: A smart swindler with illusive intention, instead of mis-behaving continuously, attempts to cover the bad effects by intentionally doingsomething good after misbehaviors. He repeats the process of preparing andtrapping. fm(n) is a periodic function. For simplicity, we assume the period is N,the mean function is defined as:

fm(n) ={

mhigh, (n mod N) < n0;mhigh, otherwise.

Figure 1c shows satisfaction ratings with period of 20. In each period, fm(n)is 0.8 for the first 15 interactions and 0.2 for the last five.

4 Architecture for Swindler Detection

Swindler detection consists of profile-based anomaly detector, state transi-tion analysis, deceiving intention predictor, and decision-making. Profile-basedanomaly detector monitors suspicious actions based upon the established pat-terns of an entity. It outputs fraud confidence indicating the possibility of a fraud.State transition analysis builds a state transition graph that provides state de-scription to users when an activity results in entering a dangerous state leading

Satisfied ratings

Fraud Confidence

DI-Confidence State Description

Profile-based Anomaly Detector

State Transition Analysis

Record Preprocessor

Deceiving Intention predictor

Decision Making

Architecture boundary

Fig. 2. Architecture for swindler detection

Page 348: document

Fraud Formalization and Detection 335

to a fraud. Deceiving intention predictor discovers deceiving intention based onsatisfaction ratings. It outputs DI-confidence to characterize the belief that thetarget entity has a deceiving intention. DI-confidence is a real number rangingover [0,1]. The higher the value is, the greater the belief is.

Outputs of these components are feed into decision-making component thatassists users to reach decisions based on predefined policies. Decision-makingcomponent passes warnings from state transition analysis to user and displaythe description of next potential state in a readable format. The expected riskis computed as follows.

f(fraud confidence, DI-confidence, estimated cost) = max(fraud confidence,DI-confidence) × estimated cost

Users can replace this function according to their specific requirements.A fraud alarm will arise when expected risk is greater than fraud-investigatingcost. In the rest of this section, we concentrate on the other three components.

4.1 Profile-Based Anomaly Detector

As illustrated in fig. 3, profile-based anomaly detector consists of rule generationand weighting, user profiling, and online detection.

Rule generation and weighting: Data mining techniques such as associationrule mining are applied to generate fraud rules. The generated rules are as-signed weights according to their frequency of occurrence. Both entity-level andbehavior-level attributes are used in mining fraud rules and weighting. Normally,a large volume of rules will be generated.

User profiling: Profile information characterizes both the entity-level infor-mation (e.g. financial status) and an entity’s behavior patterns (e.g. interestedproducts). There are two sets of profiling data, one for history profiles and theother for current profiles. Two steps, variable selection followed by data filtering,are used for user profiling. The first step chooses variables characterizing the nor-mal behavior. Selected variables need to be comparable among different entities.

Rule Generation and Weighting

Record Preprocessor

User Profiling

Online Detection

Case selection

Rules selection

Rules and patterns retrieving

Fraud confidence

Profile-based anomaly detector boundary

Fig. 3. Profile-based anomaly detector

Page 349: document

336 Bharat Bhargava et al.

Profile of the selected variable must show a pattern under normal conditions.These variables need to be sensitive to anomaly (i.e., at least one of these pat-terns is not matched in occurrence of anomaly). The objective of data filteringfor history profiles is data homogenization (i.e. grouping similar entities). Thecurrent profile set will be dynamically updated according to behaviors. As behav-ior level data is large, decay is needed to reduce the data volume. This part alsoinvolves rule selection for a specific entity based on profiling results and rules.The rule selection triggers the measurements of normal behaviors regarding therules. These statistics are stored in history profiles for online detection.

Online detection: The detection engine retrieves the related rules from theprofiling component when an activity occurs. It may retrieve the entity’s currentbehavior patterns and behavior pattern history as well. Analysis methods suchas Hellinger distance can be used to calculate the deviation of current profilepatterns to profile history patterns. These results are combined to determinefraud confidence.

4.2 State Transition Analysis

State transition analysis models fraud scenarios as series of states changing froman initial secure state to a final compromised state. The initial state is the startstate prior to actions that lead to a fraud. The final state is the resulting stateof completion of the fraud. There may be several intermediate states betweenthem. The action, which causes one state to transit to another, is called thesignature action. Signature actions are the minimum actions that lead to thefinal state. Without such actions, this fraud scenario will not be completed.

This model requires collecting fraud scenarios and identifying the initialstates and the final states. The signature actions for that scenario are identifiedin backward direction. The fraud scenario is represented as a state transitiongraph by the states and signature actions.

A danger factor is associated with each state. It is defined by the distancefrom the current state to a final state. If one state leads to several final states,the minimum distance is used. For each activity, state transition analysis checksthe potential next states. If the maximum value of the danger factors associatedwith the potential states exceeds a threshold, a warning is raised and detailedstate description is sent to the decision-making component.

4.3 Deceiving Intention Predictor

The kernel of the predictor is the deceiving intention prediction (DIP) algorithm. DIP views the belief of deceiving intention as the complement of the trust belief. The trust belief about an entity is evaluated based on the satisfaction sequence <R1, R2, . . . , Rn>, where Rn is the most recent rating and contributes a portion α to the trust belief. The remaining portion comes from the previous trust belief, which is determined recursively. For each entity, DIP maintains a pair of factors, the current construction factor Wc and the current destruction factor Wd. If integrating Rn will increase the trust belief, α = Wc; otherwise, α = Wd. Wc and Wd satisfy the constraint Wc < Wd, which implies that more effort is needed to gain a given amount of trust than to lose it [12]. Wc and Wd are modified when a foul event is triggered, i.e., when the incoming satisfaction rating is lower than a user-defined threshold. Upon a foul event, the target entity is put under supervision: its Wc is decreased and its Wd is increased. If the entity does not conduct any foul event during the supervision period, Wc and Wd are reset to their initial values; otherwise, they are further decreased and increased, respectively. The current supervision period of an entity increases each time it conducts a foul event, so that it will be punished longer the next time; an entity with a worse history is thus treated more harshly. The DI-confidence is computed as 1 − current trust belief.

The DIP algorithm accepts seven input parameters: the initial construction factor Wc and destruction factor Wd; the initial supervision period p; the initial penalty ratios for the construction factor, the destruction factor and the supervision period, r1, r2 and r3, such that r1, r2 ∈ (0, 1) and r3 > 1; and the foul event threshold fThreshold. For each entity k, we maintain a profile P(k) consisting of five fields: current trust value tValue, current construction factor Wc, current destruction factor Wd, current supervision period cPeriod, and remaining supervision period sRest.

DIP algorithm (input parameters: Wc, Wd, r1, r2, r3, p, fThreshold; output: DI-confidence)

initialize P(k) with the input parameters
while there is a new rating R
    if R <= fThreshold then                  // put under supervision
        P(k).Wd = P(k).Wd + r1 * (1 - P(k).Wd)
        P(k).Wc = r2 * P(k).Wc
        P(k).sRest = P(k).sRest + P(k).cPeriod
        P(k).cPeriod = r3 * P(k).cPeriod
    end if
    if R <= P(k).tValue then                 // choose the weight for the trust update
        W = P(k).Wd
    else
        W = P(k).Wc
    end if
    P(k).tValue = P(k).tValue * (1 - W) + R * W
    if P(k).sRest > 0 and R > fThreshold then
        P(k).sRest = P(k).sRest - 1
        if P(k).sRest = 0 then               // restore Wc and Wd to their initial values
            P(k).Wd = Wd and P(k).Wc = Wc
        end if
    end if
    return (1 - P(k).tValue)                 // report the current DI-confidence
end while
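A minimal Python transcription of the pseudocode above, for illustration only (the initial trust value, the initial supervision period p and the example rating sequence are assumptions of ours, as the paper does not state them):

class Profile:
    """Per-entity state P(k) used by DIP."""
    def __init__(self, Wc, Wd, p, tValue=1.0):
        self.tValue = tValue   # current trust value (initial value assumed)
        self.Wc = Wc           # current construction factor
        self.Wd = Wd           # current destruction factor
        self.cPeriod = p       # current supervision period
        self.sRest = 0         # remaining supervision period

def dip_step(P, R, Wc0, Wd0, r1, r2, r3, fThreshold):
    """Process one satisfaction rating R and return the DI-confidence."""
    if R <= fThreshold:                        # foul event: put under supervision
        P.Wd = P.Wd + r1 * (1 - P.Wd)
        P.Wc = r2 * P.Wc
        P.sRest += P.cPeriod
        P.cPeriod *= r3
    W = P.Wd if R <= P.tValue else P.Wc        # weight for the recursive trust update
    P.tValue = P.tValue * (1 - W) + R * W
    if P.sRest > 0 and R > fThreshold:
        P.sRest -= 1
        if P.sRest == 0:                       # good behavior throughout supervision
            P.Wd, P.Wc = Wd0, Wc0
    return 1 - P.tValue

Wc0, Wd0, p = 0.05, 0.1, 10                    # p is an assumed initial supervision period
prof = Profile(Wc0, Wd0, p)
for R in [0.8, 0.8, 0.1, 0.1, 0.9]:            # hypothetical rating sequence
    print(round(dip_step(prof, R, Wc0, Wd0, r1=0.9, r2=0.1, r3=2, fThreshold=0.18), 3))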

Experimental Study. DIP's capability of discovering the deceiving intentions defined in Section 3.1 is investigated through experiments.

Fig. 4. Experiments to discover deceiving intentions: DI-confidence (y-axis, 0 to 1) versus number of ratings (x-axis, 0 to 150) for (a) uncovered deceiving intention, (b) trapping intention, and (c) illusive intention

The initial construction factor is 0.05 and the initial destruction factor is 0.1. The penalty ratios for the construction factor, the destruction factor and the supervision period are 0.9, 0.1 and 2, respectively. The threshold for a foul event is 0.18. The results are shown in Fig. 4. The x-axis of each figure is the number of ratings; the y-axis is the DI-confidence.

Swindler with uncovered deceiving intention: The satisfaction rating sequence of the generated swindler is shown in Fig. 1a. The result is illustrated in Fig. 4a. Since the probability that the swindler conducts foul events is high, he is under supervision most of the time. The construction and destruction factors become close to 0 and 1, respectively, because of the punishment for foul events. The trust values are close to the minimum interaction rating of 0.1, and the DI-confidence is around 0.9.

Swindler with trapping intention: The satisfaction rating sequence of the generated swindler is shown in Fig. 1b. As illustrated in Fig. 4b, DIP responds to the sharp drop of fm(n) very quickly. After fm(n) changes from 0.8 to 0.2, it takes only 6 interactions for the DI-confidence to increase from 0.2239 to 0.7592.

Swindler with illusive intention: The satisfaction rating sequence of the generated swindler is shown in Fig. 1c. As illustrated in Fig. 4c, when the mean function fm(n) changes from 0.8 to 0.2, the DI-confidence increases. When fm(n) changes back from 0.2 to 0.8, the DI-confidence decreases. DIP is able to catch this smart swindler in the sense that his DI-confidence eventually increases to about 0.9. The swindler's effort to cover a fraud with good behavior has less and less effect as the number of frauds grows.

5 Conclusions

In this paper, we classify fraudsters as impersonators and swindlers and present a mechanism to detect swindlers. The concepts relevant to frauds conducted by swindlers are formally defined. Uncovered deceiving intention, trapping intention, and illusive intention are identified. We propose an approach for swindler detection, which integrates the ideas of anomaly detection, state transition analysis, and history-based intention prediction. An architecture that realizes this approach is presented. The experiment results show that the proposed deceiving intention prediction (DIP) algorithm accurately detects the uncovered deceiving intention. Trapping intention is captured promptly, in about 6 interactions after a swindler enters the trapping phase. The illusive intention of a swindler who attempts to cover frauds with good behavior can also be caught by DIP.

References

[1] R. J. Bolton and D. J. Hand. Statistical fraud detection: A review. Statistical Science, 17(3):235–255, 2002.

[2] P. Burge and J. Shawe-Taylor. Detecting cellular fraud using adaptive prototypes. In Proceedings of the AAAI-97 Workshop on AI Approaches to Fraud Detection and Risk Management, 1997.

[3] M. Cahill, F. Chen, D. Lambert, J. Pinheiro, and D. Sun. Detecting fraud in the real world. In Handbook of Massive Datasets, pages 911–930. Kluwer Academic Publishers, 2002.

[4] T. Fawcett and F. Provost. Adaptive fraud detection. Data Mining and Knowledge Discovery, 1997.

[5] J. Hollmen and V. Tresp. Call-based fraud detection in mobile communication networks using a hierarchical regime-switching model. In Proceedings of Advances in Neural Information Processing Systems (NIPS'11), 1998.

[6] Bertis B. Little, Walter L. Johnston, Ashley C. Lovell, Roderick M. Rejesus, and Steve A. Steed. Collusion in the U.S. crop insurance program: applied data mining. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 594–598. ACM Press, 2002.

[7] Saharon Rosset, Uzi Murad, Einat Neumann, Yizhak Idan, and Gadi Pinkas. Discovery of fraud rules for telecommunications - challenges and solutions. In Proceedings of the fifth ACM SIGKDD, pages 409–413. ACM Press, 1999.

[8] Ravi Sandhu. Lattice-based access control models. IEEE Computer, 26(11):9–19, 1993.

[9] Salvatore J. Stolfo, Wenke Lee, Philip K. Chan, Wei Fan, and Eleazar Eskin. Data mining-based intrusion detectors: an overview of the Columbia IDS project. ACM SIGMOD Record, 30(4):5–14, 2001.

[10] M. Taniguchi, J. Hollmen, M. Haft, and V. Tresp. Fraud detection in communications networks using neural and probabilistic methods. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 1998.

[11] David Wagner and Paolo Soto. Mimicry attacks on host-based intrusion detection systems. In Proceedings of the 9th ACM conference on Computer and communications security, pages 255–264. ACM Press, 2002.

[12] Y. Zhong, Y. Lu, and B. Bhargava. Dynamic trust production based on interaction sequence. Technical Report CSD-TR 03-006, Department of Computer Sciences, Purdue University, 2003.

Combining Noise Correction with Feature Selection

Choh Man Teng

Y. Kambayashi, M. Mohania, W. Woß (Eds.): DaWaK 2003, LNCS 2737, pp. 340-349, 2003. c Springer-Verlag Berlin Heidelberg 2003

Pre-computing Approximate Hierarchical

Range Queries in a Tree-Like Histogram

Francesco Buccafurri and Gianluca Lax

DIMET, Universita degli Studi Mediterranea di Reggio Calabria, Via Graziella, Localita Feo di Vito, 89060 Reggio Calabria, Italy

{bucca,lax}@ing.unirc.it

Abstract. Histograms are a lossy compression technique widely applied in various application contexts, like query optimization, statistical and temporal databases, OLAP applications, and so on. This paper presents a new histogram based on a hierarchical decomposition of the original data distribution kept in a complete binary tree. This tree, thus containing a set of pre-computed hierarchical queries, is encoded in a compressed form using bit saving in representing integer numbers. The approach, extending a recently proposed technique based on the application of such a decomposition to the buckets of a pre-existing histogram, is shown by several experiments to improve the accuracy of the state-of-the-art histograms.

1 Introduction

Histograms are a lossy compression technique widely applied in various application contexts, like query optimization [9], statistical [5] and temporal databases [12], and, more recently, OLAP applications [4, 10]. In OLAP, compression allows us to obtain fast approximate answers by evaluating queries on reduced data in place of the original data. Histograms are well suited to this purpose, especially in the case of range queries. Indeed, the buckets of a histogram basically correspond to a set of pre-computed range queries, allowing us to estimate the remaining possible range queries. Estimation is needed when a range query partially overlaps a bucket. As a consequence, the problem of minimizing the estimation error becomes crucial in the context of OLAP applications.

In this work we propose a new histogram, extending the approach used in [2] for the estimation inside the bucket. The histogram, called nLT, consists of a tree-like index, with a number of levels depending on the fixed compression ratio. Nodes of the index contain, hierarchically, pre-computed range queries, stored by an approximate (via bit saving) encoding. Compression derives both from the aggregation implemented by the leaves of the tree, and from the saving of bits obtained by representing range queries with fewer than 32 bits (assumed enough for an exact representation). The number of bits used for representing range queries decreases as the level of the tree increases. Peculiar characteristics of our histogram are the following:

Y. Kambayashi, M. Mohania, W. Woß (Eds.): DaWaK 2003, LNCS 2737, pp. 350–359, 2003.c© Springer-Verlag Berlin Heidelberg 2003


1. Due to bit saving, the number of pre-computed range queries embedded in our histogram is larger than in a bucket-based histogram occupying the same storage space. Observe that such queries are stored in an approximate form. However, the hierarchical organization of the index allows us to express the value of a range query as a fraction of the range query including it (i.e., corresponding to the parent node in the tree), and this allows us to maintain a low numeric approximation. In the absence of the tree, the values of range queries would be expressed as a fraction of the maximum value (i.e., the query involving the entire domain).

2. The histogram directly supports hierarchical range queries, a meaningful type of query in the OLAP context [4].

3. The evaluation of a range query can be executed by visiting the tree from the root to a leaf (in the worst case), thus with a cost logarithmic in the number of smallest pre-computed range queries (this number is the counterpart of the number of buckets of a classic histogram, on which the cost of query evaluation depends linearly).

4. The update of the histogram (we refer here to the case of the change of a single occurrence frequency) can be performed without reconstructing the entire tree, but only by updating the nodes on the path connecting the leaf involved by the change with the root of the tree. This task is hence also feasible in logarithmic time.

While the last three points above describe evidently positive characteristics of the proposed method, the first point needs some kind of validation to be considered effectively a point in favor of our proposal. Indeed, it is not a priori clear whether having a larger set of approximate pre-computed queries (even if this approximation is reduced by the hierarchical organization) is better than having a smaller set of exact pre-computed range queries. In this work we try to give an answer to this question through experimental comparison with the most relevant histograms proposed in the literature. Thus, the main contribution of the paper is to conclude that keeping pre-computed hierarchical range queries (with a suitable numerical approximation done by bit saving) advances the accuracy of histograms, not only when the hierarchical decomposition is applied to the buckets of pre-existing histograms (as shown in [2]), but also when the technique is applied to the entire data distribution. The paper is organized as follows. In the next section we illustrate histograms. Our histogram is presented in Section 3. Section 4 reports the results of experiments conducted on our histogram and several other ones. Finally, we give conclusions in Section 5.

2 Histograms

Histograms are used for reducing relations in order to give approximate answers to range queries on such relations. Let X be an attribute of a relation R. W.l.o.g., we assume that the domain U of the attribute X is the interval of integer numbers from 1 to |U| (where |U| denotes the cardinality of the set U). The set of frequencies is the set F = {f(1), ..., f(|U|)}, where f(i) is the number of occurrences of the value i in the relation R, for each 1 ≤ i ≤ |U|. The set of values is V = {i ∈ U such that f(i) > 0}.

From now on, consider given R, X, F and V. A bucket B on X is a 4-tuple 〈lb, ub, t, c〉, with 1 ≤ lb < ub ≤ |U|, t = |{i ∈ V : lb ≤ i ≤ ub}| and c = ∑_{i=lb}^{ub} f(i). lb and ub are said, respectively, lower bound and upper bound of B, t is said number of non-null values of B and c is the sum of frequencies of B.

A histogram H on X is an h-tuple 〈B1, ..., Bh〉 of buckets such that (1) ∀ 1 ≤ i < h, the upper bound of Bi precedes the lower bound of Bi+1, and (2) ∀ j with 1 ≤ j ≤ |U|, f(j) > 0 ⇒ ∃ i ∈ [1, h] such that j ∈ Bi.

Given a histogram H and a range query Q, it is possible to return an estimation of the answer to Q using the information contained in H. At this point the following problem arises: how to partition the domain U into b buckets in order to minimize the estimation error? According to the criterion used for partitioning the domain, there are different classes of histograms (we report here only the most important ones):

1. Equi-sum Histograms [9]: buckets are obtained in such a way that the sum of occurrences in each bucket is equal to 1/b times the total sum of occurrences.

2. MaxDiff Histograms [9, 8]: each bucket has its upper bound in Vi ∈ V (the set of attribute values actually appearing in the relation R) if |φ(Vi) − φ(Vi+1)| is one of the b − 1 highest computed values, for each i. φ(Vi) is said area and is obtained as f(Vi) · (Vi+1 − Vi).

3. V-Optimal Histograms [6]: the boundaries of each bucket, say lb_i and ub_i (with 1 ≤ i ≤ b), are fixed in such a way that ∑_{i=1}^{b} SSE_i is minimum, where SSE_i = ∑_{j=lb_i}^{ub_i} (f(j) − avg_i)^2 and avg_i is the average of the frequencies occurring in the i-th bucket.

In the part of the work devoted to experiments (see Section 4), among the above presented bucket-based histograms, we have considered only MaxDiff and V-Optimal histograms, as it was shown in the literature that they have the best performances in terms of accuracy. In addition, we will also consider two further bucket-based histograms, called MaxDiff4LT and V-Optimal4LT. Such methods have been proposed in [2], and consist of adding a 32-bit tree-like index, called 4LT, to each bucket of either a MaxDiff or a V-Optimal histogram. The 4LT is used for computing, in an approximate way, the frequency sums of 8 non-overlapping sub-ranges of the bucket. We observe that the idea underlying the proposal presented in this paper takes its origin just from the 4LT method, extending the application of such an approach to the construction of the entire histogram instead of single buckets.

There are other kinds of histograms whose construction is not driven by the search for a suitable partition of the attribute domain and, further, whose structure is more complex than simply a set of buckets. We call such histograms non bucket-based. Two important examples of histograms of this type are wavelet-based and binary-tree histograms. Wavelets are mathematical transformations implementing a hierarchical decomposition of functions, originally used in different research and application contexts, like image and signal processing [7, 13]. Recent studies have shown the applicability of wavelets to selectivity estimation [6] as well as to the approximation of OLAP range queries over datacubes [14, 15]. A wavelet-based histogram is not a set of buckets; it consists of a set of wavelet coefficients and a set of indices by which the original frequency set can be reconstructed. Histograms are obtained by applying one of these transformations to the original cumulative frequency set (extended over the entire attribute domain) and selecting, among the N wavelet coefficients, the m < N most significant coefficients, with m corresponding to the desired storage usage.

The binary-tree histogram [1] is also based on a hierarchical multiresolution decomposition of the data distribution operating in a quad-tree fashion, adapted to the mono-dimensional case.

Besides the bucket-based histograms, both of the above types of histograms are compared experimentally in this paper with our histogram, which is a non bucket-based histogram too.

3 The nLT Histogram

In this section we describe the proposed histogram, called nLT. Like the wavelet and binary-tree histograms, the nLT is a non bucket-based histogram.

Given a positive integer n, an nLT histogram on the attribute X is a full binary tree with n levels such that each node N is a 3-tuple 〈l(N), u(N), val(N)〉, where 1 ≤ l(N) < u(N) ≤ |U| and val(N) = ∑_{i=l(N)}^{u(N)} f(i). l(N) and u(N) are said, respectively, lower bound and upper bound of N, and val(N) is said value of N. Observe that the interval of the domain of X with boundaries l(N) and u(N) is associated to N. We denote by r(N) such an interval. Moreover, val(N) is the sum of the occurrence frequencies of X within such an interval. The root node, denoted by N0, is such that l(N0) = 1 and u(N0) = |U|. Given a non-leaf node N, the left-hand child node, say Nfs, is such that l(Nfs) = l(N) and u(Nfs) = ⌊(u(N) + l(N))/2⌋, while the right-hand child node, say Nfd, is such that l(Nfd) = ⌊(u(N) + l(N))/2⌋ + 1 and u(Nfd) = u(N) (⌊x⌋ denotes the application of the floor operator to x).

Concerning the implementation of the nLT, we observe that it is not needed to keep the lower and upper bounds of the nodes, since they can be derived from the knowledge of n and the position of the node in the tree. Moreover, we do not have to keep the value of any right-hand child node either, since such a value can be obtained as the difference between the value of the parent node and the value of the sibling node.

In Figure 1 an example of nLT with n = 3 is reported. The nLT of this example refers to a domain of size 12 with 3 null elements. For each node (represented as a box), we have reported the boundaries of the associated interval (on the left side and on the right side, respectively) and the value of the node (inside the box). Grey nodes can be derived from white nodes; thus, they are not stored.


Fig. 1. Example of nLT

The storage space required by the nLT, in case integers are encoded using t bits, is t · 2^(n−1). We assume that t = 32 is enough for representing integer values with no scaling approximation. In the following we will refer to this kind of nLT implementation as the exact implementation of the nLT or, for short, exact nLT. In the next section, we will illustrate how to reduce the storage space by varying the number of bits used for encoding the value of the nodes. Of course, to the lossy compression due to the linear interpolation needed for retrieving all the non pre-computed range queries, we add another lossy compression given by the bit saving.
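As a concrete illustration, a minimal sketch of how an exact nLT could be materialized from a frequency vector follows (our own helper, not the authors' implementation; it stores every node, whereas the paper's encoding omits the bounds and the right-hand children):

def build_nlt(freq, n):
    """Exact nLT over the 1-based frequency vector `freq`, returned as a list of
    (lower, upper, value) triples in breadth-first order; the root covers the
    whole domain and each interval is halved (floor) to obtain its children."""
    assert len(freq) >= 2 ** (n - 1), "domain too small for n levels"
    prefix = [0]
    for f in freq:                            # prefix sums give O(1) range sums
        prefix.append(prefix[-1] + f)

    def range_sum(l, u):
        return prefix[u] - prefix[l - 1]

    nodes = [(1, len(freq), range_sum(1, len(freq)))]     # root N0
    level = [(1, len(freq))]
    for _ in range(n - 1):                    # add the remaining n - 1 levels
        nxt = []
        for l, u in level:
            m = (l + u) // 2                  # floor((u(N) + l(N)) / 2)
            nxt.extend([(l, m), (m + 1, u)])
        nodes.extend((l, u, range_sum(l, u)) for l, u in nxt)
        level = nxt
    return nodes

freq = [3, 0, 5, 2, 0, 7, 4, 1, 0, 2, 6, 1]   # toy domain of size 12
for node in build_nlt(freq, 3):
    print(node)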

3.1 Approximate nLT

In this section we describe the approximate nLT, that is, an implementation of the nLT which uses a variable-length encoding of integer numbers. In particular, all nodes belonging to the same level of the tree are represented with the same number of bits. When we go down to the next lower level, we reduce by 1 the number of bits used for representing the nodes of that level. This bit saving allows us to increase the nLT depth (w.r.t. the exact nLT), once the total storage space is fixed, and thus to have a larger set of pre-computed range queries and a higher resolution.

Substantially, the approach is based on the assumption that, on average, the sum of the occurrences of a given interval of the frequency vector is twice the sum of the occurrences of each half of such an interval. This assumption is chosen as the heuristic criterion for designing the approximate nLT, and it explains the choice of reducing by 1 per level the number of bits used for representing numbers. Clearly, the sum contained in a given node is represented as a fraction of the sum contained in the parent node. Observe that, in principle, a representation allowing possibly different numbers of bits for nodes belonging to the same level, depending on the actual values contained in the nodes, could also be used. However, we would then have to deal with the spatial overhead due to these variable codes. The reduction of 1 bit per level appears to be a reasonable compromise.

We now describe in more detail how to encode with a certain number of bits, say k, the value of a given node N, denoting by P the parent node of N.


With such a representation, the value val(N) of the node will in general not be recovered exactly: it will be affected by a certain scaling approximation. We denote by val_k(N) the encoding of val(N) done with k bits and by val~_k(N) the approximation of val(N) obtained from val_k(N).

We have that:

val_k(N) = Round((val(N) / val(P)) × (2^k − 1))

Clearly, 0 ≤ val_k(N) ≤ 2^k − 1. Concerning the approximation of val(N), it results:

val~_k(N) = (val_k(N) / (2^k − 1)) × val(P)

The absolute error due to the k-bit encoding of the node N, with parent node P, is:

ε_a(val(N), val(P), k) = |val(N) − val~_k(N)|.

It can be easily verified that 0 ≤ ε_a(val(N), val(P), k) ≤ val(P) / 2^(k+1). The relative error is defined as:

ε_r(val(N), val(P), k) = ε_a(val(N), val(P), k) / val(N).

Define now the average relative error (for variable value of the node N) as:

‖ε_r(val(N), val(P), k)‖ = (1 / val(P)) × ∑_{i=1}^{val(P)} ε_r(i, val(P), k).

We observe that, for the root node N0, we use 32 bits. Thus, no scaling error arises for such a node, i.e., val(N0) = val~_k(N0).

It can be proven that the average relative error is null until val(P) reaches the value 2^k and then, after a number of decreasing oscillations, converges to a value independent of val(P) and depending only on k.

Before proceeding to the implementation of an nLT, we have to set the two parameters n and k, that are, we recall, the number of levels of the nLT and the number of bits used for encoding each child node of the root (for the successive levels, as already mentioned, we drop 1 bit per level). Observe that, according to the above observation about the average relative error, setting the parameter k also means fixing the average relative error due to the scaling approximation. Thus, in order to reduce such an error, we should set k to a value as large as possible. However, for a fixed compression ratio, this may limit the depth of the tree and, thus, the resolution of the leaves. As a consequence, the error arising from the linear interpolation done inside the leaf nodes increases. The choice of k has thus to solve the above trade-off. The size of an approximate nLT is thus:

size(nLT) = 32 + ∑_{h=0}^{n−2} (k − h) × 2^h     (1)

recalling that the root node is encoded with 32 bits. For instance, an nLT with n = 4 and k = 11 uses 32 + 2^0 · 11 + 2^1 · 10 + 2^2 · 9 = 99 bits for representing its nodes.
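A small sketch of the bit-saving encoding and of formula (1), for illustration (Round is ordinary rounding, as in the paper; the function names are ours):

def encode(val_n, val_p, k):
    """k-bit code of a node value expressed as a fraction of its parent's value."""
    return round(val_n / val_p * (2 ** k - 1))

def decode(code, val_p, k):
    """Approximate node value recovered from its k-bit code and the parent value."""
    return code / (2 ** k - 1) * val_p

def nlt_size(n, k):
    """Size in bits of an approximate nLT with n levels and k bits at level 1 (Eq. 1)."""
    return 32 + sum((k - h) * 2 ** h for h in range(n - 1))

print(nlt_size(4, 11))                        # 99 bits, the example given above
code = encode(3170, 10000, 11)                # child holding 3170 out of a parent total of 10000
print(code, round(decode(code, 10000, 11)))   # recovered value differs only by a small scaling error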


4 Experiments on Histograms

In this section we conduct several experiments on synthetic data in order to compare the effectiveness of several histograms in estimating range queries.

Available Storage: For our experiments, we use a storage space of 42 four-byte numbers, to be in line with the experiments reported in [9], which we replicate.

Techniques: We compare nLT with 6 new and old histograms, fixing the total space required by each technique:

– MaxDiff (MD) and V-Optimal (VO) produce 21 buckets; for each bucket both the upper bound and the value are stored.

– MaxDiff with 4LT (MD4LT) and V-Optimal with 4LT (VO4LT) produce 14 buckets; for each bucket the upper bound, the value and the 4LT index are stored.

– Wavelet (WA) histograms are constructed using the bi-orthogonal 2.2 decomposition of the MATLAB 5.2 wavelet toolbox. The wavelet approach needs 21 four-byte wavelet coefficients plus another 21 four-byte numbers for storing the coefficient positions. We have stored the 21 largest (in absolute value) wavelet coefficients and, in the reconstruction phase, we have set the remaining coefficients to 0.

– Binary-Tree (BT) produces 19 terminal buckets (reproducing the experiments reported in [1]).

– nLT is obtained fixing n = 9 and k = 11. Using (1) shown in Section 3.1, the storage space is about 41 four-byte numbers. The choice of k = 11 and, consequently, of n = 9 is done by fixing the average relative error of the highest level of the tree to about 0.15%.

Data Distributions: A data distribution is characterized by a distribution for frequencies and a distribution for spreads. The frequency set and the value set are generated independently, then frequencies are randomly assigned to the elements of the value set. We consider 3 data distributions: (1) D1 = Zipf-cusp max(0.5,1.0). (2) D2 = Zipf-zrand(0.5,1.0): frequencies are distributed according to a Zipf distribution with the z parameter equal to 0.5; spreads follow a ZRand distribution [8] with z parameter equal to 1.0 (i.e., spreads following a Zipf distribution with z parameter equal to 1.0 are randomly assigned to attribute values). (3) D3 = Gauss-rand: frequencies are distributed according to a Gauss distribution; spreads are randomly distributed.
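For instance, a Zipf-distributed frequency set with randomly assigned spreads (the simplest of the combinations above) could be generated along the following lines; this is a rough sketch and does not reproduce the exact cusp/zrand spread assignments of the benchmark:

import random

def zipf_frequencies(t, total, z):
    """t frequencies following a Zipf law with parameter z, scaled to sum to about `total`."""
    weights = [1.0 / (rank ** z) for rank in range(1, t + 1)]
    scale = total / sum(weights)
    return [max(1, round(w * scale)) for w in weights]

def random_spreads(t, domain_size):
    """Assign the t non-null values to random positions of the attribute domain."""
    return sorted(random.sample(range(1, domain_size + 1), t))

freqs = zipf_frequencies(t=500, total=100000, z=0.5)      # population P1 parameters
positions = random_spreads(t=500, domain_size=4100)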

Histogram Populations: A population is characterized by the values of three parameters, T, D and t, and represents the set of histograms storing a relation of cardinality T, attribute domain size D and value set size t (i.e., number of non-null attribute values).


(a)
method/popul.   P1     P2     P3     avg
WA              3.50   3.42   2.99   3.30
MD              4.30   5.78   8.37   6.15
VO              1.43   1.68   1.77   1.63
MD4LT           0.70   0.80   0.70   0.73
VO4LT           0.29   0.32   0.32   0.31
BT              0.26   0.27   0.27   0.27
nLT             0.24   0.24   0.22   0.23

(b)
method/popul.   P1     P2     P3     avg
WA             13.09  13.06   6.08  10.71
MD             19.35  16.04   2.89  12.76
VO              5.55   5.96   2.16   4.56
MD4LT           1.57   1.60   0.59   1.25
VO4LT           1.33   1.41   0.56   1.10
BT              1.12   1.15   0.44   0.90
nLT             0.63   0.69   0.26   0.53

Fig. 2. (a): Errors for distribution 1. (b): Errors for distribution 2

Population P1 is characterized by the following values of the parameters: D = 4100, t = 500 and T = 100000.
Population P2 is characterized by: D = 4100, t = 500 and T = 500000.
Population P3 is characterized by: D = 4100, t = 1000 and T = 500000.

Data Sets: Each data set included in the experiments is obtained by generating, under one of the above described data distributions, 10 histograms belonging to one of the populations specified above. We consider the 9 data sets that are generated by combining all data distributions and all populations.

All queries belonging to the query set below are evaluated over the histograms of each data set:

Query Set and Error Metric: In our experiments, we use the query set {X ≤ d : 1 ≤ d ≤ D} (recall that X is the histogram attribute and 1..D is its domain) for evaluating the effectiveness of the various methods. We measure the error of approximation made by the histograms on the above query set by using the average relative error (1/Q) × ∑_{i=1}^{Q} e_i^rel, where Q is the cardinality of the query set and e_i^rel is the relative error of the i-th query, i.e., e_i^rel = |S_i − S~_i| / S_i, where S_i and S~_i are the actual answer and the estimated answer of the i-th query of the query set.

For each population and distribution we have calculated the average relative error.

The table in Figure 2(a) shows good accuracy of all index-based methods on the distribution Zipf max. In particular, nLT has the best performance, even if the gap w.r.t. the other methods is not high. The error is considerably low for nLT (less than 0.25%) although the compression ratio is very high (about 100). With the second distribution, Zipf rand (see Figure 2(b)), the behavior of the methods becomes more diverse: Wavelet and MaxDiff show an unsatisfactory accuracy, V-Optimal has better performance but the errors are still high, while the index-based methods show very low errors. Once again, nLT reports the minimum error.


method/popul.   P1     P2     P3     avg
WA             14.53   5.55   5.06   8.38
MD             11.65   6.65   3.30   7.20
VO             10.60   6.16   2.82   6.53
MD4LT           3.14   2.32   1.33   2.26
VO4LT           2.32   4.85   1.24   2.80
BT              1.51   3.50   0.91   1.97
nLT             1.38   0.87   0.70   0.99

Fig. 3. Errors for distribution 3

Fig. 4. Experimental results: average relative error (%) of Wavelet, MaxDiff, V-Optimal and nLT plotted versus data density (%) in the left-hand graph and versus storage space in the right-hand graph

In Figure 3 we report the results of the experiments performed on Gauss data. Due to the high variance, all methods become worse. nLT also presents a slightly higher error w.r.t. Zipf data, but still less than 1% (on average), and still less than the error of the other methods.

In Figure 4, the average relative error versus data density and versus histogram size is plotted (in the left-hand graph and in the right-hand graph, respectively). By data density we mean the ratio |V|/|U| between the cardinality of the non-null value set and the cardinality of the attribute domain. By histogram size we mean the number of 4-byte numbers used for storing the histogram; this measure is hence related to the compression ratio. In both cases nLT, compared with the classical bucket-based histograms, shows the best performance, with a considerable improvement gap.

5 Conclusion

In this paper we have presented a new non bucket-based histogram, which we have called nLT. It is based on a hierarchical decomposition of the data distribution kept in a complete n-level binary tree. Nodes of the tree store, in an approximate form (via bit saving), pre-computed range queries on the original data distribution. Besides the capability of the histogram to directly support hierarchical range queries and efficient updating and query answering, we have shown experimentally that it improves significantly the state of the art in terms of accuracy in estimating range queries.

References

[1] F. Buccafurri, L. Pontieri, D. Rosaci, D. Sacca. Binary-tree Histograms with Tree Indices. DEXA 2002, Aix-en-Provence, France.

[2] F. Buccafurri, L. Pontieri, D. Rosaci, D. Sacca. Improving Range Query Estimation on Histograms. ICDE 2002, San Jose (CA), USA.

[3] Buccafurri, F., Rosaci, D., Sacca, D., Compressed datacubes for fast OLAP applications, DaWaK 1999, Florence, 65-77.

[4] Koudas, N., Muthukrishnan, S., Srivastava, D., Optimal Histograms for Hierarchical Range Queries, Proc. of Symposium on Principles of Database Systems - PODS, pp. 196-204, Dallas, Texas, 2000.

[5] Malvestuto, F., A Universal-Scheme Approach to Statistical Databases Containing Homogeneous Summary Tables, ACM TODS, 18(4), 678–708, December 1993.

[6] Y. Matias, J. S. Vitter, M. Wang. Wavelet-based histograms for selectivity estimation. In Proceedings of the 1998 ACM SIGMOD Conference on Management of Data, Seattle, Washington, June 1998.

[7] Natsev, A., Rastogi, R., Shim, K., WALRUS: A Similarity Retrieval Algorithm for Image Databases, In Proceedings of the 1999 ACM SIGMOD Conference on Management of Data, 1999.

[8] V. Poosala. Histogram-based Estimation Techniques in Database Systems. PhD dissertation, University of Wisconsin-Madison, 1997.

[9] V. Poosala, Y. E. Ioannidis, P. J. Haas, E. J. Shekita. Improved histograms for selectivity estimation of range predicates. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, pages 294-305, 1996.

[10] Poosala, V., Ganti, V., Ioannidis, Y. E., Approximate Query Answering using Histograms, IEEE Data Engineering Bulletin Vol. 22, March 1999.

[11] P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. T. Price. Access path selection in a relational database management system. In Proc. of ACM SIGMOD International Conference, pages 23-34, 1979.

[12] Sitzmann, I., Stuckey, P. J., Improving Temporal Joins Using Histograms, Proc. of the Int. Conference on Database and Expert Systems Applications - DEXA 2000.

[13] E. J. Stollnitz, T. D. Derose, and D. H. Salesin. Wavelets for Computer Graphics. Morgan Kaufmann, 1996.

[14] J. S. Vitter, M. Wang, B. Iyer. Data Cube Approximation and Histograms via Wavelets. In Proceedings of the 1998 CIKM International Conference on Information and Knowledge Management, Washington, 1998.

[15] J. S. Vitter, M. Wang, Approximate Computation of Multidimensional Aggregates of Sparse Data using Wavelets, In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, 1999.


Comprehensive Log Compression

with Frequent Patterns

Kimmo Hatonen1, Jean Francois Boulicaut2, Mika Klemettinen1, Markus Miettinen1, and Cyrille Masson2

1 Nokia Research Center, P.O. Box 407, FIN-00045 Nokia Group, Finland
{kimmo.hatonen,mika.klemettinen,markus.miettinen}@nokia.com
2 INSA de Lyon, LIRIS CNRS FRE 2672, F-69621 Villeurbanne, France
{Jean-Francois.Boulicaut,Cyrille.Masson}@insa-lyon.fr

Abstract. In this paper we present a comprehensive log compression (CLC) method that uses frequent patterns and their condensed representations to identify repetitive information from large log files generated by communications networks. We also show how the identified information can be used to separate and filter out frequently occurring events that hide other events which are unique or occur only a few times. The identification can be done without any prior knowledge about the domain or the events. For example, no pre-defined patterns or value combinations are needed. This separation makes it easier for a human observer to perceive and analyse large amounts of log data. The applicability of the CLC method is demonstrated with real-world examples from data communication networks.

1 Introduction

In the near future telecommunication networks will deploy an open packet-based infrastructure which has originally been developed for data communication networks. The monitoring of this new packet-based infrastructure will be a challenge for operators. The old networks will remain up and running for still some time. At the same time the rollout of the new infrastructure will take place, introducing many new information sources, between which the information needed in, e.g., security monitoring and fault analysis will be scattered. These sources can include different kinds of event logs, e.g., firewall logs, operating systems' system logs and different application server logs, to name a few. The problem is becoming worse every day as operators are adding new tools for logging and monitoring their networks. As the requirements for the quality of service perceived by customers gain more importance, the operators are starting to seriously utilise the information that is hidden in these logs. Their interest towards analysing their own processes and the operation of their network increases concurrently.

Data mining and knowledge discovery methods are a promising alternative for operators to gain more out of their data. Based on our experience, however,

Y. Kambayashi, M. Mohania, W. Woß (Eds.): DaWaK 2003, LNCS 2737, pp. 360–370, 2003.c© Springer-Verlag Berlin Heidelberg 2003


simple-minded use of discovery algorithms in network analysis poses problems with the amount of generated information and its relevance. In the KDD process [6, 10, 9], it is often reasonable or even necessary to constrain the discovery using background knowledge. If no constraints are applied, the discovered result set of, say, association rules [1, 2] might become huge and contain mostly trivial and uninteresting rules. Also, association and episode rule mining techniques can only capture frequently recurring events according to some frequency and confidence thresholds. This is needed to restrict the search space and thus for computational tractability. Clearly, the thresholds that can be used are not necessarily the ones that denote objective interestingness from the user's point of view. Indeed, rare combinations can be extremely interesting. When considering previously unknown domains, explicit background knowledge is missing, e.g., about the possible or reasonable values of attributes and their relationships.

When it is difficult or impossible to define and maintain a priori knowledge about the system, there is still a possibility to use meta information that can be extracted from the logs. Meta information characterizes different types of log entries and log entry combinations. It can not only be used to help an expert in filtering and browsing the logs manually but also to automatically identify and filter out insignificant log entries. It is possible to reduce the size of an analysed data set to a fraction of its original size without losing any critical information.

One type of meta information is frequent patterns. They capture the common value combinations that occur in the logs. Furthermore, such meta information can be condensed by means of, e.g., the closed frequent itemsets [12, 3]. Closed sets form natural inclusion graphs between different covering sets. This type of presentation is quite understandable for an expert and can be used to create hierarchical views. These condensed representations can be extracted directly from highly correlated and/or dense data, i.e., in contexts where the approaches that compute the whole collection of the frequent patterns FS are intractable [12, 3, 17, 13]. They can also be used to regenerate efficiently the whole FS collection, possibly partially and on the fly.

We propose here our Comprehensive Log Compression (CLC) method. It is based on the computation of frequent pattern condensed representations, and we use this presentation as an entry point to the data. The method provides a way to dynamically characterize and combine log data entries before they are shown to a human observer. It finds frequently occurring patterns from dense log data and links the patterns to the data as a data directory. It is also possible to separate recurring data and analyse it separately. In most cases, this reduces the amount of data that needs to be evaluated by an expert to a fraction of the original volume.

This type of representation is general w.r.t. different log types. Frequent sets can be generated from most logs that have structure and contain repeating symbolic values in their fields, e.g., in Web Usage Mining applications [11, 16]. The main difference between the proposed method and those applications is the objective setting of the mining task. Most of the web usage applications try to identify and somehow validate common access patterns in web sites. These patterns are then used to do some sort of optimization of the site.


...
777;11May2000; 0:00:23;a_daemon;B1;12.12.123.12;tcp;;
778;11May2000; 0:00:31;a_daemon;B1;12.12.123.12;tcp;;
779;11May2000; 0:00:32;1234;B1;255.255.255.255;udp;;
781;11May2000; 0:00:43;a_daemon;B1;12.12.123.12;tcp;;
782;11May2000; 0:00:51;a_daemon;B1;12.12.123.12;tcp;;
...

Fig. 1. An example of a firewall log

The proposed method, however, does not say anything about semantic correctness of or relations between the found frequent patterns. It only summarizes the most frequent value combinations in the entries. This gives either a human expert or computationally more intensive algorithms a chance to continue with data that does not contain the overly common and trivial entries. Based on our experience with real-life log data, e.g., large application and firewall logs, an original data set of tens of thousands of rows can often be represented by just a couple of identified patterns and the exceptions not matching these patterns.

2 Log Data and Log Data Analysis

A log data set consists of entries that represent a specific condition or an event that has occurred somewhere in the system. The entries have several fields, which are called variables from now on. The structure of the entries might change over time from one entry to another, although some variables are common to all of them. Each variable has a set of possible values called a value space. The values of one value space can be considered as binary attributes. Variable value spaces are separate. A small example of log data is given in Figure 1. It shows a sample from a log file produced by CheckPoint's Firewall-1.

In a data set the value range of a variable value space might be very large or very limited. For example, there may be only a few firewalls in an enterprise, but every IP address on the internet might try to contact the enterprise domain. There are also several variables that have such a large value space but contain only a fraction of the possible values. Therefore, it is impractical and almost impossible to fix the size of the value spaces as a priori knowledge.

A log file may be very large. During one day, millions of lines might accumulate in a log file. A solution for browsing the data is either to search for patterns that are known to be interesting with high probability or to filter out patterns that most probably are uninteresting. A system can assist in this, but the evaluation of interestingness is left to an expert. To be able to make the evaluation, an expert has to check the found log entries by hand. He has to return to the original log file and iteratively check all those probably interesting entries and their surroundings. Many of the most dangerous attacks are new and unseen to an enterprise defense system. Therefore, when the data exploration is limited only to known patterns, it may be impossible to find the new attacks.

Comprehensive Log Compression (CLC) is an operation where meta information is extracted from the log entries and used to summarize redundant entries


{Proto:tcp, Service:a_daemon, Src:B1} 11161
{Proto:tcp, SPort:, Src:B1} 11161
{Proto:tcp, SPort:, Service:a_daemon} 11161
{SPort:, Service:a_daemon, Src:B1} 11161
...
{Destination:123.12.123.12, SPort:, Service:a_daemon, Src:B1} 10283
{Destination:123.12.123.12, Proto:tcp, Service:a_daemon, Src:B1} 10283
{Destination:123.12.123.12, Proto:tcp, SPort:, Src:B1} 10283
{Destination:123.12.123.12, Proto:tcp, SPort:, Service:a_daemon} 10283
{Proto:tcp, SPort:, Service:a_daemon, Src:B1} 11161
...
{Destination:123.12.123.12, Proto:tcp, SPort:, Service:a_daemon, Src:B1} 10283

Fig. 2. A sample of frequent sets extracted from a firewall log

without losing any important information. By combining log entries with their frequencies and identifying recurring patterns, we are able to separate correlating entries from infrequent ones and display them with accompanying information. Thus, an expert has a more covering overview of the logged system and can identify interesting phenomena and concentrate on his analysis.

The summary has to be understandable for an expert and must contain all the relevant information that is available in the original log. The presentation also has to provide a mechanism to move back and forth between the summary and the original logs.

Summarization can be done by finding correlating value combinations from a large number of log entries. Due to the nature of the logging mechanism, there are always several value combinations that are common to a large number of the entries. When these patterns are combined with information about how the uncorrelated values change w.r.t. these correlating patterns, this gives a comprehensive description of the contents of the logs. In many cases it is possible to detect such patterns by browsing the log data, but unfortunately it is also tedious. E.g., a clever attack against a firewall cluster of an enterprise is scattered over all of its firewalls and executed slowly from several different IP addresses, using all the possible protocols alternately.

Figure 2 provides a sample of frequent sets extracted from the data introduced in Figure 1. In Figure 2, the last pattern, which contains five attributes, has five subpatterns, out of which four have the same frequency as the longer pattern and only one has a larger frequency. In fact, many frequent patterns have the same frequency, and it is the key idea of the frequent closed set mining technique to consider only some representative patterns, i.e., the frequent closed itemsets (see the next section for a formalization). Figure 3 gives a sample of frequent closed sets that correspond to the frequent patterns shown in Figure 2.

An example of the results of applying the CLC method to a firewall log data set can be seen in Table 1. It shows the three patterns with the highest coverage values found from the firewall log introduced in Figure 1. If the supports of these patterns are combined, then 91% of the data in the log is covered. The blank fields in the table are intentionally left empty in the original log data. The fields marked with '*' can have varying values.


{Proto:tcp, SPort:, Service:a_daemon, Src:B1} 11161
{Destination:123.12.123.12, Proto:tcp, SPort:, Service:a_daemon, Src:B1} 10283
{Destination:123.12.123.13, Proto:tcp, SPort:, Service:a_daemon, Src:B1} 878

Fig. 3. A sample of closed sets extracted from a firewall log

Table 1. The three most frequent patterns found from a firewall log

No  Destination      Proto  SPort  Service   Src  Count
1.  *                tcp           A_daemon  B1   11161
2.  255.255.255.255  udp           1234      *    1437
3.  123.12.123.12    udp           B-dgm     *    1607

For example, in pattern 1 the field 'Destination' gets two different values on the lines matched by it, as is shown in Figure 3.

3 Formalization

The definition of a LOG pattern domain is made of the definition of a language of patterns L, of evaluation functions that assign a description to each pattern in a given log r, and of languages for primitive constraints that specify the desired patterns. We introduce some notations that are used for defining the LOG pattern domain. A so-called log contains the data in the form of log entries, and patterns are the so-called itemsets, which are sets of (field, value) pairs of log entries.

Definition 1 (Log). Assume that Items is a finite set of (field, value) pairs denoted by the field name combined with the value, e.g., Items = {A : ai, B : bj, C : ck, . . .}. A log entry e is a subset of Items. A log r is a finite and non-empty multiset r = {e1, e2, . . . , en} of log entries.

Definition 2 (Itemsets). An itemset is a subset of Items. The language of patterns for itemsets is L = 2^Items.

Definition 3 (Constraint). If T denotes the set of all logs and 2^Items the set of all itemsets, an itemset constraint C is a predicate over 2^Items × T. An itemset S ∈ 2^Items satisfies a constraint C in the database r ∈ T iff C(S, r) = true. When it is clear from the context, we write C(S).

Evaluation functions return information about the properties of a given itemset in a given log. These functions provide an expert with information about the events and conditions in the network. They also form a basis for summary creation. They are used to select the proper entry points to the log data.

Definition 4 (Support for Itemsets). A log entry e supports an itemset S if every item in S belongs to e, i.e., S ⊆ e. The support (denoted support(S, r)) of an itemset S is the multiset of all log entries of r that support S (e.g., support(∅) = r).


Definition 5 (Frequency). The frequency of an itemset S in a log r is defined by F(S, r) = |support(S)|, where |.| denotes the cardinality of the multiset.

Definition 6 (Coverage). The coverage of an itemset S in a log r is defined by Cov(S, r) = F(S, r) · |S|, where |.| denotes the cardinality of the itemset S.

Definition 7 (Perfectness). The perfectness of an itemset S in a log r is defined by Perf(S, r) = Cov(S, r) / ∑_{i=1}^{F(S,r)} |e_i|, where ∀ e_i : e_i ∈ support(S, r) and |e_i| denotes the cardinality of the log entry e_i. Notice that if the cardinality of all the log entries is constant, then Perf(S, r) = Cov(S, r) / (F(S, r) · |e|), where e is an arbitrary log entry.
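These evaluation functions are straightforward to compute over a log held in memory; a minimal sketch follows (the representation of entries as sets of (field, value) pairs is ours, chosen to mirror Figure 1):

def support(itemset, log):
    """Multiset of log entries containing every (field, value) pair of `itemset`."""
    return [entry for entry in log if itemset <= entry]

def frequency(itemset, log):
    return len(support(itemset, log))

def coverage(itemset, log):
    return frequency(itemset, log) * len(itemset)

def perfectness(itemset, log):
    matched = support(itemset, log)
    total_items = sum(len(entry) for entry in matched)
    return coverage(itemset, log) / total_items if total_items else 0.0

log = [
    {("Service", "a_daemon"), ("Src", "B1"), ("Proto", "tcp"), ("SPort", "")},
    {("Service", "a_daemon"), ("Src", "B1"), ("Proto", "tcp"), ("SPort", "")},
    {("Service", "1234"), ("Src", "B1"), ("Proto", "udp"), ("SPort", "")},
]
pattern = {("Service", "a_daemon"), ("Src", "B1")}
print(frequency(pattern, log), coverage(pattern, log), perfectness(pattern, log))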

Primitive constraints are a tool set that is used to create and control summaries. For instance, the summaries are composed by using the frequent (closed) sets, i.e., the sets that satisfy a conjunction of a minimal frequency constraint and the closeness constraint, plus the original data.

Definition 8 (Minimal Frequency). Given an itemset S, a log r, and a frequency threshold γ ∈ [1, |r|], Cminfreq(S, r) ≡ F(S, r) ≥ γ. Itemsets that satisfy Cminfreq are called γ-frequent or frequent in r.

Definition 9 (Minimal Perfectness). Given an itemset S, a log r, and a perfectness threshold π ∈ [0, 1], Cminperf(S, r) ≡ Perf(S, r) ≥ π. Itemsets that satisfy Cminperf are called π-perfect or perfect in r.

Definition 10 (Closures, Closed Itemsets and Constraint Cclose). The closure of an itemset S in r (denoted by closure(S, r)) is the maximal (for set inclusion) superset of S which has the same support as S. In other terms, the closure of S is the set of items that are common to all the log entries which support S. A closed itemset is an itemset that is equal to its closure in r, i.e., we define Cclose(S, r) ≡ closure(S, r) = S. Closed itemsets are maximal sets of items that are supported by a multiset of log entries.

If we consider the equivalence classes that group all the itemsets having the same closure (and thus the same frequency), the closed sets are the maximal elements of each equivalence class. Thus, when the collection of the frequent itemsets FS is available, a simple post-processing technique can be applied to compute only the frequent closed itemsets. When the data is sparse, it is possible to compute FS, e.g., by using Apriori-like algorithms [2]. However, the number of frequent itemsets can be extremely large, especially in dense logs that contain many highly correlated field values. In that case, computing FS might not be feasible, while the frequent closed sets CFS can often be computed for the same frequency threshold or even a lower one. CFS = {φ ∈ L | Cminfreq(φ, r) ∧ Cclose(φ, r) satisfied}. On one hand, FS can be efficiently derived from CFS without scanning the data again [12, 3]. On the other hand, CFS is a compact representation of the information about every frequent set and its frequency and thus fulfills the needs of CLC. Several algorithms can compute the frequent closed sets efficiently. In this work, we compute the frequent closed sets by computing the frequent free sets and providing their closures [4, 5]. This is efficient since the freeness property is anti-monotonic, i.e., a key property for an efficient processing of the search space.
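As a rough illustration of the closure operation (not the free-set algorithm of [4, 5]), the closure of an itemset can be obtained by intersecting all the entries that support it, and a naive post-processing pass can then filter a frequent-set collection down to its closed members:

def closure(itemset, log):
    """Items common to all entries supporting `itemset` (its closure in the log)."""
    supporting = [entry for entry in log if itemset <= entry]
    if not supporting:
        return set(itemset)
    common = set(supporting[0])
    for entry in supporting[1:]:
        common &= entry
    return common

def closed_sets(frequent_sets, log):
    """Naive post-processing: keep only the itemsets that equal their closure."""
    return [s for s in frequent_sets if closure(s, log) == set(s)]

For dense logs this post-processing is only practical when FS itself is small; the point of the free-set based extraction is precisely to avoid materializing the whole of FS.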

For a user, displaying the adequate information is the most important phase of the CLC method. This phase gets the original log file and a condensed set of frequent patterns as input. An objective of the method is to select the most informative patterns as starting points for navigating the condensed set of patterns and the data. As has been shown [12], the frequent closed sets give rise to a lattice structure, ordered by set inclusion. These inclusion relations between patterns can be used as navigational links.

What the most informative patterns are depends on the application and the task at hand. There are at least three possible measures that can be used to sort the patterns: frequency, i.e., on how many lines the pattern occurs in a data set; perfectness, i.e., how large a part of a line has been fixed in the pattern; and coverage of the pattern, i.e., how large a part of the database is covered by the pattern. Coverage is a measure which balances the trade-off between patterns that are short but whose frequency is high and patterns that are long but whose frequency is lower. Selection of the most informative patterns can also be based on optimality w.r.t. coverage. It is possible that an expert wishes to see only the n most covering patterns, or the most covering patterns that together cover more than m% of the data. Examples of optimality constraints are considered in [14, 15].
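A simple selection step along these lines could rank the closed sets by coverage and keep a prefix of the ranking until a target share of the log is reached; the sketch below uses our own (itemset, frequency) pair representation and ignores overlaps between supports, so it only approximates the covered share:

def select_by_coverage(closed, log_size, entry_len, target=0.9, max_patterns=None):
    """Pick the most covering patterns until roughly `target` of the log items is covered."""
    ranked = sorted(closed, key=lambda p: p[1] * len(p[0]), reverse=True)
    total_items = log_size * entry_len
    chosen, covered = [], 0
    for itemset, freq in ranked:
        if max_patterns is not None and len(chosen) >= max_patterns:
            break
        chosen.append((itemset, freq))
        covered += freq * len(itemset)      # coverage of this pattern, overlaps ignored
        if covered / total_items >= target:
            break
    return chosen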

An interesting issue is the treatment of the patterns whose perfectness is close to zero. It is often the case that the support of such a small pattern is almost entirely covered by the supports of larger patterns of which the small pattern is a subset. The most interesting property of this kind of lines is the possibility to find those rare and exceptional entries that are not covered by any of the frequent patterns.

In the domain that we are working on, log entries of telecommunication applications, we have found out that coverage and perfectness are very good measures for finding good and informative starting points for pattern and data browsing. This is probably due to the fact that if there are too many fields whose values are not fixed, then the meaning of the entry is not clear and such patterns are not understandable for an expert. On the other hand, in those logs there are a lot of repeating patterns whose coverage is high and whose perfectness is close to 100 percent.

4 Experiments

Our experiments were done with two separate log sets. The first of them was a firewall log that was divided into several files so that each file contained the entries logged during one day. From this collection we selected the logs of four days, on which we executed the CLC method with different frequency thresholds. The purpose of this test was to find out how large a portion of the original log it is possible to cover with the patterns found and what the optimal value for the frequency threshold would be.

Table 2. Summary of the CLC experiments with firewall data

            Day 1                        Day 2                        Day 3                        Day 4
Sup    Freq Clsd Sel Lines    %     Freq Clsd Sel Lines    %     Freq Clsd Sel Lines    %     Freq Clsd Sel Lines    %
100    8655   48   5  5162  96.3    9151   54   5 15366  98.6   10572   82   7 12287  97.1    8001   37   4  4902  97.3
50     9213   55   6  5224  97.5    9771   66   7 15457  99.2   11880   95  11 12427  98.2    8315   42   5  4911  97.5
10    11381   74  12  5347  99.8   12580   88  12 15537  99.7   19897  155  19 12552  99.2   10079   58   8  4999  99.2
5     13013   82  13  5351  99.9   14346  104  14 15569  99.9   22887  208  20 12573  99.3   12183   69  10  5036  99.9
Tot                   5358                        15588                        12656                         5039

frequency threshold would be. In Table 2, a summary of the experiment resultsis presented.

Table 2 shows, for each firewall daily log file, the number of frequent sets(Freq), closed sets (Clsd) derived from those, selected closed sets (Sel), the num-ber of lines that the selected sets cover (Lines) and how big part of the log theselines are covering (%). The tests were executed with several frequency thresholds(Sup). The pattern selection was based on the coverage of each pattern.

As can be seen from the results, the coverage percentage is high already with the rather high frequency threshold of 50 lines. With this threshold there were, e.g., only 229 (1.8%) lines not covered in the log file of day 3. This was basically because there was an exceptionally well distributed port scan during that day. Those entries were so fragmented that they escaped the CLC algorithm, but they were clearly visible when all the other information was taken away.

In Table 2, we also show the sizes of the different representations compared to each other. As can be seen, the reduction from the number of frequent sets to the number of closed sets is remarkable. Furthermore, by selecting the most covering patterns, it is possible to reduce the number of shown patterns to very few without losing the descriptive power of the representation.

Another data set that was used to test our method was an application log of a large software system. The log contains information about the execution of different application modules. The main purpose of the log is to provide information for system operation, maintenance and debugging. The log entries provide a continuous flow of data, not the occasional bursts that are typical for firewall entries. The interesting things in the flow are the possible error messages, which are rare and often hidden in the mass.

The size of the application log was more than 105 000 lines, which were collected during a period of 42 days. From these entries, with a frequency threshold of 1000 lines (about 1%), the CLC method was able to identify 13 interesting patterns that covered 91.5% of the data. When the frequency threshold was lowered further to 50 lines, the coverage rose to 95.8%. With that threshold value, 33 patterns were found. The resulting patterns, however, started to be so fragmented that they were not very useful anymore.

These experiments show the usefulness of the condensed representation of the frequent itemsets by means of the frequent closed itemsets. In a data set like a firewall log, it is possible to select only a few of the most covering frequent closed sets found and still cover the majority of the data. After this bulk has been removed from the log, it is much easier for any human expert to inspect the rest of the log, even manually.

Notice also that the computation of our results has been easy. This is partly because the test data sets reported here are not very large, the largest being a little over 100 000 lines. However, in a real environment of a large corporation, the daily firewall logs might contain millions of lines and many more variables. The amount of data, i.e., the number of lines and the number of variables, will continue to grow in the future, as the number of service types, different services and their use grows. The scalability of the algorithms that compute the frequent closed sets is quite good compared to the Apriori approach: fewer data scans are needed and the search space can be drastically reduced in the case of dense data [12, 3, 5]. In particular, we have done preliminary testing with ac-miner, designed by A. Bykowski [5]. It discovers free sets, from which it is straightforward to compute closed sets. These tests have shown promising results w.r.t. execution times. This approach seems to scale up more easily than the search for the whole set of frequent sets.

Other condensed representations have also been proposed recently, such as the δ-free sets, the ∨-free sets and the Non-Derivable Itemsets [5, 7, 8]. They could be used in even more difficult contexts (very dense and highly correlated data). Notice, however, that from the end user's point of view these representations do not have the intuitive semantics of the closed itemsets.

5 Conclusions and Future Work

The Comprehensive Log Compression (CLC) method provides a powerful tool for any analysis that inspects data with a lot of redundancy. Only very little a priori knowledge is needed to perform the analysis: a minimum frequency threshold for the discovery of closed sets and, e.g., the number of displayed patterns to guide the selection of the most covering patterns.

The method provides a mechanism to separate different information types from each other. The CLC method identifies frequent repetitive patterns in a log database and can be used to emphasize either the normal course of actions or exceptional log entries and events within the normal course of actions. This is especially useful for getting knowledge out of previously unknown domains or for analyzing logs that are used to record unstructured and unclassified information.

In the future we are interested in generalizing and testing the described method with frequent episodes: how to utilize relations between selected closed sets. Other interesting issues concern the theoretical foundations of the CLC method as well as ways to utilize the method in different real-world applications.

Acknowledgements

The authors have partly been supported by the Nokia Foundation and the consortium on discovering knowledge with Inductive Queries (cInQ), a project funded by the Future and Emerging Technologies arm of the IST Programme (Contract no. IST-2000-26469).

References

[1] Rakesh Agrawal, Tomasz Imielinski, and Arun Swami. Mining association rules between sets of items in large databases. In SIGMOD'93, pages 207-216, Washington, USA, May 1993. ACM Press.

[2] Rakesh Agrawal, Heikki Mannila, Ramakrishnan Srikant, Hannu Toivonen, and A. Inkeri Verkamo. Fast discovery of association rules. In Advances in Knowledge Discovery and Data Mining, pages 307-328. AAAI Press, 1996.

[3] Jean-Francois Boulicaut and Artur Bykowski. Frequent closures as a concise representation for binary data mining. In PAKDD'00, volume 1805 of LNAI, pages 62-73, Kyoto, JP, April 2000. Springer-Verlag.

[4] Jean-Francois Boulicaut, Artur Bykowski, and Christophe Rigotti. Approximation of frequency queries by means of free-sets. In PKDD'00, volume 1910 of LNAI, pages 75-85, Lyon, F, September 2000. Springer-Verlag.

[5] Jean-Francois Boulicaut, Artur Bykowski, and Christophe Rigotti. Free-sets: a condensed representation of boolean data for the approximation of frequency queries. Data Mining and Knowledge Discovery journal, 7(1):5-22, 2003.

[6] Ronald J. Brachman and Tej Anand. The process of knowledge discovery in databases: A first sketch. In Advances in Knowledge Discovery and Data Mining, July 1994.

[7] Artur Bykowski and Christophe Rigotti. A condensed representation to find frequent patterns. In PODS'01, pages 267-273. ACM Press, May 2001.

[8] Toon Calders and Bart Goethals. Mining all non-derivable frequent itemsets. In PKDD'02, volume 2431 of LNAI, pages 74-83, Helsinki, FIN, August 2002. Springer-Verlag.

[9] Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth. The KDD process for extracting useful knowledge from volumes of data. Communications of the ACM, 39(11):27-34, November 1996.

[10] Usama M. Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth. From data mining to knowledge discovery: An overview. In Advances in Knowledge Discovery and Data Mining, pages 1-34. AAAI Press, Menlo Park, CA, 1996.

[11] R. Kosala and H. Blockeel. Web mining research: A survey. SIGKDD Explorations: Newsletter of the Special Interest Group (SIG) on Knowledge Discovery & Data Mining, ACM, 2(1):1-15, 2000.

[12] Nicolas Pasquier, Yves Bastide, Rafik Taouil, and Lotfi Lakhal. Efficient mining of association rules using closed itemset lattices. Information Systems, 24(1):25-46, January 1999.

[13] Jian Pei, Jiawei Han, and Runying Mao. CLOSET: An efficient algorithm for mining frequent closed itemsets. In SIGMOD Workshop DMKD'00, Dallas, USA, May 2000.

[14] Tobias Scheffer. Finding association rules that trade support optimally against confidence. In PKDD'01, volume 2168 of LNCS, pages 424-435, Freiburg, D, September 2001. Springer-Verlag.

[15] Jun Sese and Shinichi Morishita. Answering the most correlated N association rules efficiently. In PKDD'02, volume 2431 of LNAI, pages 410-422, Helsinki, FIN, August 2002. Springer-Verlag.

[16] Jaideep Srivastava, Robert Cooley, Mukund Deshpande, and Pang-Ning Tan. Web usage mining: Discovery and applications of usage patterns from web data. SIGKDD Explorations, 1(2):12-23, 2000.

[17] Mohammed Javeed Zaki. Generating non-redundant association rules. In SIGKDD'00, pages 34-43, Boston, USA, August 2000. ACM Press.


Non Recursive Generation of Frequent K-itemsets from Frequent Pattern Tree Representations

Mohammad El-Hajj and Osmar R. Zaïane

Department of Computing Science, University of Alberta, Edmonton AB, Canada
{mohammad, zaiane}@cs.ualberta.ca

Abstract. Existing association rule mining algorithms suffer from many problems when mining massive transactional datasets. One major problem is the high memory dependency: the gigantic data structures built are assumed to fit in main memory, and the recursive mining process used to mine these structures is also too voracious in memory resources. This paper proposes a new association rule-mining algorithm based on the frequent pattern tree data structure. Our algorithm does not use much memory over and above the memory used by the data structure. For each frequent item, a relatively small independent tree called a COFI-tree is built summarizing co-occurrences. Finally, a simple and non-recursive mining process mines the COFI-trees. Experimental studies reveal that our approach is efficient and allows the mining of larger datasets than those limited by FP-Tree.

1 Introduction

Recent years have witnessed an explosive growth in data generation in all fields of science, business, medicine, the military, etc. The processing power for evaluating and analyzing the data has not followed the same rate of growth. Due to this phenomenon, a tremendous volume of data is still kept without being studied. Data mining, a research field that tries to ease this problem, proposes solutions for the extraction of significant and potentially useful patterns from these large collections of data. One of the canonical tasks in data mining is the discovery of association rules. Discovering association rules, considered one of the most important tasks, has been the focus of many studies in the last few years. Many solutions have been proposed using sequential or parallel paradigms. However, the existing algorithms depend heavily on massive computation that might cause high dependency on the memory size or repeated I/O scans of the data sets. Association rule mining algorithms currently proposed in the literature are not sufficient for extremely large datasets, and new solutions, especially ones less reliant on memory size, still have to be found.

1.1 Problem Statement

The problem consists of finding associations between items or itemsets in transactional data. The data could be retail sales in the form of customer transactions or any collection of sets of observations. Formally, as defined in [2], the problem is stated as follows: Let I = {i1, i2, ..., im} be a set of literals, called items. m is considered the dimensionality of the problem. Let D be a set of transactions, where each transaction T is a set of items such that T ⊆ I. A unique identifier TID is given to each transaction. A transaction T is said to contain X, a set of items in I, if X ⊆ T. An association rule is an implication of the form "X ⇒ Y", where X ⊆ I, Y ⊆ I, and X ∩ Y = ∅. An itemset X is said to be large or frequent if its support s is greater than or equal to a given minimum support threshold σ. The rule X ⇒ Y has a support s in the transaction set D if s% of the transactions in D contain X ∪ Y. In other words, the support of the rule is the probability that X and Y hold together among all the possible presented cases. It is said that the rule X ⇒ Y holds in the transaction set D with confidence c if c% of the transactions in D that contain X also contain Y. In other words, the confidence of the rule is the conditional probability that the consequent Y is true under the condition of the antecedent X. The problem of discovering all association rules from a set of transactions D consists of generating the rules that have a support and confidence greater than a given threshold. These rules are called strong rules. This association-mining task can be broken into two steps: 1. A step for finding all frequent k-itemsets, known for its extreme I/O scan expense and massive computational costs; 2. A straightforward step for generating strong rules. In this paper, we are mainly interested in the first step.
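As a concrete illustration of these definitions, the short Python sketch below counts support and confidence for a rule over a toy transaction set; the function names and the toy data are our own, chosen only for illustration.

# Minimal sketch: support and confidence of an association rule X => Y
transactions = [
    {"A", "B", "C"},
    {"A", "C"},
    {"A", "D"},
    {"B", "C"},
]

def support(itemset, transactions):
    # fraction of transactions that contain every item of the itemset
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(x, y, transactions):
    # conditional probability of Y given X: support(X u Y) / support(X)
    return support(x | y, transactions) / support(x, transactions)

x, y = {"A"}, {"C"}
print(support(x | y, transactions))   # 0.5   -> rule A => C has 50% support
print(confidence(x, y, transactions)) # 0.666... -> and about 67% confidence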

1.2 Related Work

Several algorithms have been proposed in the literature to address the problem of mining association rules [2, 6]. One of the key algorithms, which seems to be the most popular in many applications for enumerating frequent itemsets, is the Apriori algorithm [2]. This Apriori algorithm also forms the foundation of most known algorithms. It uses a monotone property stating that for a k-itemset to be frequent, all its (k-1)-itemsets have to be frequent. The use of this fundamental property reduces the computational cost of candidate frequent itemset generation. However, in the case of extremely large input sets with a big frequent 1-itemset, the Apriori algorithm still suffers from two main problems: repeated I/O scanning and high computational cost. One major hurdle observed with most real datasets is the sheer size of the candidate frequent 2-itemsets and 3-itemsets. TreeProjection is an efficient algorithm presented in [1]. This algorithm builds a lexicographic tree in which each node of the tree presents a frequent pattern. The authors of this algorithm report that it is one order of magnitude faster than the existing techniques in the literature. Another innovative approach for discovering frequent patterns in transactional databases, FP-Growth, was proposed by Han et al. in [6]. This algorithm creates a compact tree structure, the FP-Tree, representing frequent patterns, which alleviates the multi-scan problem and improves candidate itemset generation. The algorithm requires only two full I/O scans of the dataset to build the prefix tree in main memory and then mines this structure directly. The authors of this algorithm report that it is faster than the Apriori and TreeProjection algorithms. Mining the FP-tree structure is done recursively by building conditional trees that are of the same order of magnitude in number as the frequent patterns. This massive creation of conditional trees makes this algorithm not scalable to mine large datasets beyond a few million transactions. [7] proposes a new algorithm, H-mine, that invokes FP-Tree to mine condensed data. This algorithm is still not scalable, as reported by its authors in [8].

1.3 Preliminaries, Motivations and Contributions

The Co-Occurrence Frequent Item Tree (or COFI-tree for short) algorithm that we present in this paper is based on the core idea of the FP-Growth algorithm proposed by Han et al. in [6]. A compact tree structure, the FP-Tree, is built based on an ordered list of the frequent 1-itemsets present in the transactional database. However, rather than using FP-Growth, which recursively builds a large number of relatively large trees called conditional trees [6] from the built FP-tree, we successively build one small tree (called a COFI-tree) for each frequent 1-itemset and mine the trees with simple non-recursive traversals. We keep only one such COFI-tree in main memory at a time.

The COFI-tree approach is a divide-and-conquer approach, in which we do not seek to find all frequent patterns at once, but independently find all frequent patterns related to each frequent item in the frequent 1-itemset. The main differences between our approach and the FP-growth approach are the following: (1) we build only one COFI-tree for each frequent item A; this COFI-tree is non-recursively traversed to generate all frequent patterns related to item A. (2) Only one COFI-tree resides in memory at any one time, and it is discarded as soon as it is mined to make room for the next COFI-tree.

FP-Tree-based algorithms depend heavily on the memory size, as the memory size plays an important role in defining the size of the problem that can be handled. Memory is not only needed to store the data structure itself, but also to generate, recursively during the mining process, the set of conditional trees. This phenomenon is often overlooked. As argued by the authors of the algorithm, this is a serious constraint [8]. Other approaches, such as [7], build yet another data structure from which the FP-Tree is generated, thus doubling the need for main memory.

The current association rule mining algorithms handle only relatively small sizes with low dimensions. Most of them scale up to only a couple of million transactions and a few thousand dimensions [8, 5]. None of the existing algorithms scales beyond 15 million transactions and hundreds of thousands of dimensions, where each transaction has an average of at least a couple of dozen items.

The remainder of this paper is organized as follows: Section 2 describes the design and construction of the Frequent Pattern tree. Section 3 illustrates the design, construction and mining of the Co-Occurrence Frequent Item trees. Experimental results are given in Section 4. Finally, Section 5 concludes by discussing some issues and highlights our future work.


2 Frequent Pattern Tree: Design and Construction

The COFI-tree approach we propose consists of two main stages. Stage one is the construction of the Frequent Pattern tree and stage two is the actual mining of this data structure, much like in the FP-growth algorithm.

2.1 Construction of the Frequent Pattern Tree

The goal of this stage is to build the compact data structure called the Frequent Pattern Tree [6]. The construction is done in two phases, where each phase requires a full I/O scan of the dataset. A first initial scan of the database identifies the frequent 1-itemsets. The goal is to generate an ordered list of frequent items that will be used when building the tree in the second phase.

This phase starts by enumerating the items appearing in the transactions. After enumerating these items (i.e., after reading the whole dataset), infrequent items with a support less than the support threshold are weeded out and the remaining frequent items are sorted by their frequency. This list is organized in a table, called the header table, where the items and their respective supports are stored along with pointers to the first occurrence of the item in the frequent pattern tree. Phase 2 then constructs the frequent pattern tree.
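A minimal sketch of this first phase is given below, assuming transactions are simple item lists; the function and variable names are ours, not those of the authors' implementation.

from collections import Counter

def build_header_table(transactions, min_support):
    # Phase 1 sketch: count item supports, drop infrequent items and
    # return the remaining items sorted by descending support.
    counts = Counter(item for t in transactions for item in t)
    frequent = {i: c for i, c in counts.items() if c >= min_support}
    # header table as an ordered list of (item, support); the pointer to the
    # first tree occurrence of each item is filled in during phase 2
    return sorted(frequent.items(), key=lambda ic: ic[1], reverse=True)

transactions = [list("AGDCB"), list("BCHED"), list("BDEAM"), list("CEFAN")]
print(build_header_table(transactions, min_support=2))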

Table 1. Transactional database

T.No.  Items          T.No.  Items          T.No.  Items          T.No.  Items
T1     A G D C B      T2     B C H E D      T3     B D E A M      T4     C E F A N
T5     A B N O P      T6     A C Q R G      T7     A C H I G      T8     L E F K B
T9     A F M N O      T10    C F P G R      T11    A D B H I      T12    D E B K L
T13    M D C G O      T14    C F P Q J      T15    B D E F I      T16    J E B A D
T17    A K E F C      T18    C D L B A

Fig. 1. Steps of phase 1: Step 1 counts the support of every item, Step 2 removes the infrequent items, and Step 3 sorts the remaining frequent items by support (A:11, B:10, C:10, D:9, E:8, F:7).

Phase 2 of constructing the Frequent Pattern tree structure is the actual building of this compact tree. This phase requires a second complete I/O scan of the dataset. For each transaction read, only the set of frequent items present in the header table is collected and sorted in descending order according to their frequency. These sorted transaction items are used in constructing the FP-Tree as follows: for the first item of the sorted transaction, check if it exists as one of the children of the root. If it exists, then increment the support of this node. Otherwise, add a new node for this item as a child of the root node with a support of 1. Then, consider the current item node as the new temporary root and repeat the same procedure with the next item of the sorted transaction. During the process of adding any new item-node to the FP-Tree, a link is maintained between this item-node in the tree and its entry in the header table. The header table holds one pointer per item that points to the first occurrence of this item in the FP-Tree structure.
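The insertion procedure for phase 2 can be sketched as follows; this is a simplified illustration under our own naming (FPNode, insert_transaction), not the authors' code, and it assumes each transaction has already been filtered and sorted by descending support.

class FPNode:
    # One FP-Tree node: an item label, a support counter, and child links.
    def __init__(self, item=None, parent=None):
        self.item, self.support = item, 0
        self.parent, self.children = parent, {}

def insert_transaction(root, sorted_items, header_links):
    # Walk down from the root, reusing an existing child when the item
    # matches and creating a new child otherwise; every visited node's
    # support is incremented, and new nodes are recorded in a per-item
    # list that stands in for the header-table chain.
    node = root
    for item in sorted_items:
        child = node.children.get(item)
        if child is None:
            child = FPNode(item, parent=node)
            node.children[item] = child
            header_links.setdefault(item, []).append(child)
        child.support += 1
        node = child

root, header_links = FPNode(), {}
insert_transaction(root, ["A", "B", "C", "D"], header_links)  # T1 (filtered, sorted)
insert_transaction(root, ["B", "C", "D", "E"], header_links)  # T2
insert_transaction(root, ["A", "B", "D", "E"], header_links)  # T3
print(root.children["A"].support)  # 2: prefix A shared by T1 and T3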

For illustration, we use an example with the transactions shown in Table 1. Let the minimum support threshold be set to 4. Phase 1 starts by accumulating the support of all items that occur in the transactions. Step 2 of phase 1 removes all non-frequent items, in our example (G, H, I, J, K, L, M, N, O, P, Q and R), leaving only the frequent items (A, B, C, D, E, and F). Finally, all frequent items are sorted according to their support to generate the sorted frequent 1-itemset. This last step ends phase 1 of the COFI-tree algorithm and starts the second phase. In phase 2, the first transaction read, (A, G, D, C, B), is filtered to consider only the frequent items that occur in the header table (i.e., A, D, C and B). This frequent list is sorted according to the items' supports (A, B, C and D). This ordered transaction generates the first path of the FP-Tree, with all item-node supports initially equal to 1. A link is established between each item-node in the tree and its corresponding item entry in the header table. The same procedure is executed for the second transaction (B, C, H, E, and D), which yields the sorted frequent item list (B, C, D, E) that forms the second path of the FP-Tree. Transaction 3 (B, D, E, A, and M) yields the sorted frequent item list (A, B, D, E), which shares the prefix (A, B) with an existing path in the tree. The supports of the item-nodes A and B are incremented by 1, making the support of A and of B equal to 2, and a new sub-path is created with the remaining items on the list (D, E), both with support equal to 1. The same process is repeated for all transactions until the FP-Tree for the transactions given in Table 1 is built. Figure 2 shows the result of the tree building process.

Fig. 2. Frequent Pattern Tree built from the transactions in Table 1.


3 Co-Occurrence Frequent-Item-trees: Construction and Mining

Our approach for computing frequencies relies first on building independent, relatively small trees, called COFI-trees, for each frequent item in the header table of the FP-Tree. Then we mine each one of these trees separately as soon as it is built, minimizing candidacy generation and without recursively building conditional sub-trees. The trees are discarded as soon as they are mined. At any given time, only one COFI-tree is present in main memory.

3.1 Construction of the Co-Occurrence Frequent-Item-trees

The small COFI-trees we build are similar to the conditional FP-trees in general, in the sense that they have a header with ordered frequent items and horizontal pointers pointing to a succession of nodes containing the same frequent item, and the prefix tree per se with paths representing sub-transactions. However, the COFI-trees have bidirectional links in the tree, allowing bottom-up scanning as well, and the nodes contain not only the item label and a frequency counter, but also a participation counter, as explained later in this section. The COFI-tree for a given frequent item x contains only nodes labeled with items that are more frequent than or as frequent as x.

To illustrate the idea of the COFI-trees, we will explain step by step the process of creating COFI-trees for the FP-Tree of Figure 2. In our example, the first Co-Occurrence Frequent Item tree is built for item F, as it is the least frequent item in the header table. In this tree for F, all frequent items which are more frequent than F and share transactions with F participate in building the tree. They can be found by following the chain of item F in the FP-Tree structure. The F-COFI-tree starts with a root node containing the item in question, F. For each sub-transaction or branch in the FP-Tree containing item F together with other frequent items that are more frequent than F (and are thus parent nodes of F), a branch is formed starting from the root node F. The support of this branch is equal to the support of the F node in its corresponding branch in the FP-Tree. If multiple frequent items share the same prefix, they are merged into one branch and a counter for each node of the tree is adjusted accordingly. Figure 3 illustrates all COFI-trees for the frequent items of Figure 2. In Figure 3, the rectangular nodes are tree nodes with an item label and two counters. The first counter is a support count for that node, while the second counter, called the participation count, is initialized to 0 and is used by the mining algorithm discussed later. Each node also has a horizontal link that points to the next node with the same item name in the tree, and a bidirectional vertical link that links a child node with its parent and a parent with its child. The bidirectional pointers facilitate the mining process by making the traversal of the tree easier. The squares are cells from the header table, as with the FP-Tree. This is a list made of all frequent items that participate in building the tree structure, sorted in ascending order of their global support. Each entry in this list contains the item name, an item counter, and a pointer to the first node in the tree that has the same item name.

Fig. 3. COFI-trees for the frequent items of Figure 2 (F-, E-, D-, C- and B-COFI-trees).

To explain the COFI-tree building process, we will highlight the building steps for the F-COFI-tree in Figure 3. Frequent item F is read from the header table and its first location in the FP-Tree is found using the pointer in the header table. The first location of item F indicates that it shares a branch with item A, with support = 1 for this branch, as the support of the F node is considered the support of the branch (following the upper links for this item). Two nodes are created, for FA:1. The second location of F indicates a new branch FECA:2, as the support of F is 2. Three nodes are created for items E, C and A with support = 2, and the support of the F node is incremented by 2. The third location indicates the sub-transaction FEB:1. Nodes for F and E already exist and only a new node for B is created as another child of E. The supports of all these nodes are incremented by 1: B becomes 1, E becomes 3 and F becomes 4. FEDB:1 is read after that; the FE branch already exists and a new child branch for DB is created as a child of E with support = 1. The support of the E node becomes 4 and F becomes 5. Finally FC:2 is read, and a new node for item C is created with support = 2, and the F support becomes 7. As with FP-Trees, the header constitutes a list of all frequent items and maintains the location of the first entry for each item in the COFI-tree. A link is also made from each node in the tree to the next location of the same item in the tree, if it exists. The mining process is the last step performed on the F-COFI-tree before removing it and creating the next COFI-tree for the next item in the header table.
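The construction just described can be sketched in a few lines of Python. The sketch below is our own simplification: it assumes the conditional branches for the item (the ancestors of each of its FP-Tree occurrences, nearest first, together with the occurrence's count) have already been extracted from the FP-Tree, and the class and function names are illustrative, not the authors' code.

class COFINode:
    # COFI-tree node: item label, support count, participation count,
    # parent pointer and children; participation is used only while mining.
    def __init__(self, item, parent=None):
        self.item, self.support, self.participation = item, 0, 0
        self.parent, self.children = parent, {}

def build_cofi_tree(root_item, branches):
    # branches: list of (path, count) pairs, where path lists the items that
    # co-occur with root_item in one FP-Tree branch, nearest ancestor first,
    # and count is the support of the root_item node in that branch.
    root = COFINode(root_item)
    local_support = {}                      # simplified, unsorted header table
    for path, count in branches:
        root.support += count
        node = root
        for item in path:
            child = node.children.get(item)
            if child is None:
                child = COFINode(item, parent=node)
                node.children[item] = child
            child.support += count
            local_support[item] = local_support.get(item, 0) + count
            node = child
    return root, local_support

# Branches for item F taken from the walkthrough above: FA:1, FECA:2,
# FEB:1, FEDB:1 and FC:2.
f_branches = [(["A"], 1), (["E", "C", "A"], 2), (["E", "B"], 1),
              (["E", "D", "B"], 1), (["C"], 2)]
root, header = build_cofi_tree("F", f_branches)
print(root.support, header)   # 7 {'A': 3, 'E': 4, 'C': 4, 'B': 2, 'D': 1}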

3.2 Mining the COFI-trees

The COFI-trees of all frequent items are not constructed together. Each tree is built, mined, and then discarded before the next COFI-tree is built. The mining process is done for each tree independently, with the purpose of finding all frequent k-itemset patterns in which the item at the root of the tree participates.

Fig. 4. Steps needed to generate the frequent patterns related to item E.

The steps to produce the frequent patterns related to item E, for example, are illustrated in Figure 4. From each branch of the tree, using the support count and the participation count, candidate frequent patterns are identified and stored temporarily in a list. The non-frequent ones are discarded at the end, when all branches have been processed. The mining process for the E-COFI-tree starts from the most locally frequent item in the header table of the tree, which is item B. Item B exists in three branches of the E-COFI-tree, which are (B:1, C:1, D:5 and E:8), (B:4, D:5, and E:8) and (B:1, and E:8). The frequency of each branch is the frequency of the first item in the branch minus the participation value of the same node. Item B in the first branch has a frequency value of 1 and a participation value of 0, which makes the frequency of the first pattern, EDB, equal to 1. The participation values for all nodes in this branch are incremented by 1, which is the frequency of this pattern. From the first pattern EDB:1, we need to generate all sub-patterns in which item E participates, which are ED:1, EB:1 and EDB:1. The second branch containing B generates the pattern EDB:4, as the frequency of B on this branch is 4 and its participation value is equal to 0. All participation values on these nodes are incremented by 4. Sub-patterns are also generated from the EDB pattern, namely ED:4, EB:4, and EDB:4. All these patterns already exist with a support value of 1, and only their support value needs to be updated to make it equal to 5. The last branch, EB:1, generates only one pattern, EB:1, and consequently its value is updated to become 6. The second locally frequent item in this tree, D, exists in one branch (D:5 and E:8) with a participation value of 5 for the D node. Since the participation value for this node equals its support value, no patterns can be generated from this node. Finally, all non-frequent patterns are omitted, leaving us with only the frequent patterns in which item E participates, which are ED:5, EB:6 and EBD:5. The COFI-tree of item E can be removed at this time and another tree can be generated and tested to produce all the frequent patterns related to its root node. The same process is executed to generate the remaining frequent patterns. The D-COFI-tree is created after the E-COFI-tree. Mining this tree generates the following frequent patterns: DB:8, DA:5, and DBA:5. The C-COFI-tree generates one frequent pattern, which is CA:6. Finally, the B-COFI-tree is created and the frequent pattern BA:6 is generated.
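A compact sketch of the candidate generation performed on one COFI-tree is given below. It is a simplification under our own naming: the branches for a header item are assumed to be given as (path, effective count) pairs, where the effective count is the node's support minus its participation and locally infrequent items have already been dropped from the paths; the participation bookkeeping that prevents re-counting across header items is not modeled here.

from collections import defaultdict
from itertools import combinations

def mine_branches(root_item, branches, min_support):
    # Every branch yields the pattern {root_item} plus its path items and all
    # sub-patterns that contain root_item; the branch count is added to each.
    counts = defaultdict(int)
    for path, count in branches:
        for k in range(1, len(path) + 1):
            for combo in combinations(path, k):
                counts[frozenset((root_item,) + combo)] += count
    return {pat: c for pat, c in counts.items() if c >= min_support}

# Branches anchored at item B in the E-COFI-tree of the example
# (support minus participation): EDB:1, EDB:4 and EB:1; the D node
# contributes nothing because its participation equals its support.
e_branches = [(("B", "D"), 1), (("B", "D"), 4), (("B",), 1)]
for pattern, support in mine_branches("E", e_branches, min_support=4).items():
    print(sorted(pattern), support)
# prints ['B', 'E'] 6, then ['D', 'E'] 5, then ['B', 'D', 'E'] 5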

4 Experimental Evaluations and Performance Study

To test the efficiency of the COFI-tree approach, we conducted experiments comparing our approach with two well-known algorithms, namely Apriori and FP-Growth. To avoid implementation bias, a third-party Apriori implementation, by Christian Borgelt [4], and the FP-Growth implementation [6] written by its original authors are used. The experiments were run on a 733-MHz machine with a relatively small RAM of 256 MB.

Transactions were generated using the IBM synthetic data generator [3]. We conducted different experiments to test the COFI-tree algorithm when mining extremely large transactional databases, testing both the applicability and the scalability of the COFI-tree algorithm. In one of these experiments we mined, using a support threshold of 0.01%, transactional databases of sizes ranging from 1 million to 25 million transactions with an average transaction length of 24 items. The dimensionality of the 1 and 2 million transaction datasets was 10,000 items, while the datasets ranging from 5 million to 25 million transactions had a dimensionality of 100,000 unique items. Figure 5A illustrates the comparative results obtained with Apriori, FP-Growth and the COFI-tree. Apriori failed to mine the 5 million transaction database, and FP-Growth could not mine beyond the 5 million transaction mark. The COFI-tree, however, demonstrates good scalability, as this algorithm mines 25 million transactions in 2921 s (about 48 minutes). None of the tested algorithms, or reported results in the literature, reaches such a big size. To test the behavior of the COFI-tree vis-a-vis different support thresholds, a set of experiments was conducted on a database of one million transactions, with 10,000 items and an average transaction length of 24 items. The mining process tested different support levels: 0.0025%, which revealed almost 125K frequent patterns, 0.005%, which revealed nearly 70K frequent patterns, 0.0075%, which generated 32K frequent patterns, and 0.01%, which returned 17K frequent patterns. Figure 5B depicts the time needed in seconds for each one of these runs. The results show that the COFI-tree algorithm outperforms both the Apriori and FP-Growth algorithms in all cases.

Fig. 5. Computational performance and scalability of Apriori, FP-Growth and the COFI-tree: time in seconds for database sizes from 1 to 25 million transactions and for support thresholds from 0.0025% to 0.01%.

5 Discussion and Future Work

Finding scalable algorithms for association rule mining in extremely large databases is the main goal of our research. To reach this goal, we propose a new FP-Tree-based algorithm. This algorithm identifies the main problem of the FP-Growth algorithm, which is the recursive creation and mining of many conditional pattern trees, which are equal in number to the distinct frequent patterns generated. We have replaced this step by creating one COFI-tree for each frequent item. A simple non-recursive mining process is applied to generate all frequent patterns related to the tested COFI-tree. The experiments we conducted showed that our algorithm is scalable to mine tens of millions of transactions, if not more. We are currently studying the possibility of parallelizing the COFI-tree algorithm to investigate the opportunity of mining hundreds of millions of transactions in a reasonable time and with acceptable resources.

References

1. R. Agarwal, C. Aggarwal, and V. Prasad. A tree projection algorithm for generation of frequent itemsets. Parallel and Distributed Computing, 2000.

2. R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. 1994 Int. Conf. Very Large Data Bases, pages 487-499, Santiago, Chile, September 1994.

3. IBM Almaden. Quest synthetic data generation code. http://www.almaden.ibm.com/cs/quest/syndata.html.

4. C. Borgelt. Apriori implementation. http://fuzzy.cs.uni-magdeburg.de/~borgelt/apriori/apriori.html.

5. E.-H. Han, G. Karypis, and V. Kumar. Scalable parallel data mining for association rules. Transactions on Knowledge and Data Engineering, 12(3):337-352, May-June 2000.

6. J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In ACM-SIGMOD, Dallas, 2000.

7. H. Huang, X. Wu, and R. Relue. Association analysis with one scan of databases. In IEEE International Conference on Data Mining, pages 629-636, December 2002.

8. J. Liu, Y. Pan, K. Wang, and J. Han. Mining frequent item sets by opportunistic projection. In Eighth ACM SIGKDD International Conf. on Knowledge Discovery and Data Mining, pages 229-238, Edmonton, Alberta, August 2002.


A New Computation Model for Rough Set Theory Based on Database Systems

Jianchao Han et al.

Y. Kambayashi, M. Mohania, W. Woß (Eds.): DaWaK 2003, LNCS 2737, pp. 381-390, 2003.
c Springer-Verlag Berlin Heidelberg 2003

Computing SQL Queries with Boolean Aggregates

Antonio Badia*

Computer Engineering and Computer Science Department, University of Louisville

Abstract. We introduce a new method for optimization of SQL queries with nested subqueries. The method is based on the idea of Boolean aggregates, aggregates that compute the conjunction or disjunction of a set of conditions. When combined with grouping, Boolean aggregates allow us to compute all types of non-aggregated subqueries in a uniform manner. The resulting query trees are simple and amenable to further optimization. Our approach can be combined with other optimization techniques and can be implemented with a minimum of changes in any cost-based optimizer.

1 Introduction

Due to the importance of query optimization, there exists a large body of research on the subject, especially for the case of nested subqueries ([10, 5, 13, 7, 8, 17]). It is considered nowadays that existing approaches can deal with all types of SQL subqueries through unnesting. However, practical implementation lags behind the theory, since some transformations are quite complex to implement. In particular, subqueries where the linking condition (the condition connecting query and subquery) is one of NOT IN, NOT EXISTS or a comparison with ALL seem to present problems to current optimizers. These cases are assumed to be translated away, or are dealt with using antijoins. However, the usual translation does not work in the presence of nulls, and even when fixed it adds some overhead to the original query. On the other hand, antijoins introduce yet another operator that cannot be moved in the query tree, thus making the job of the optimizer more difficult. When a query has several levels, the complexity grows rapidly (an example is given below).

In this paper we introduce a variant of traditional unnesting methods that deals with all types of linking conditions in a simple, uniform manner. The query tree created is simple, and the approach extends neatly to several levels of nesting and several subqueries at the same level. The approach is based on the concept of Boolean aggregates, which are an extension of the idea of aggregate functions in SQL ([12]). Intuitively, Boolean aggregates are applied to a set of predicates and combine the truth values resulting from the evaluation of the predicates. We show how two simple Boolean aggregates can take care of any type of SQL subquery in a uniform manner. The resulting query trees are simple and amenable to further optimization. Our approach can be combined with other optimization techniques and can be implemented with a minimum of changes in any cost-based optimizer.

* This research was sponsored by NSF under grant IIS-0091928.

In section 2 we describe in more detail related research on query optimization and motivate our approach with an example. In section 3 we introduce the concept of Boolean aggregates and show its use in query unnesting. We then apply our approach to the example and discuss the differences with standard unnesting. Finally, in section 4 we offer some preliminary conclusions and discuss further research.

2 Related Research and Motivation

We study SQL queries that contain correlated subqueries¹. Such subqueries contain a correlated predicate, a condition in their WHERE clause introducing the correlation. The attribute in the correlated predicate provided by a relation in an outer block is called the correlation attribute; the other attribute is called the correlated attribute. The condition connecting query and subquery is called the linking condition. There are basically four types of linking condition in SQL: comparisons between an attribute and an aggregation (called the linking aggregate); IN and NOT IN comparisons; EXISTS and NOT EXISTS comparisons; and quantified comparisons between an attribute and a set of attributes through the use of SOME and ALL. We call linking conditions involving an aggregate, IN, EXISTS, and comparisons with SOME positive linking conditions, and the rest (those involving NOT IN, NOT EXISTS, and comparisons with ALL) negative linking conditions. All nested correlated subqueries are nowadays executed by some variation of unnesting. In its original approach ([10]), the correlation predicate is seen as a join; if the subquery is aggregated, the aggregate is computed in advance and then the join is used. Kim's approach had a number of shortcomings; among them, it assumed that the correlation predicate always used equality and that the linking condition was a positive one. Dayal's ([5]) and Muralikrishna's ([13]) work solved these shortcomings; Dayal introduced the idea of using an outerjoin instead of a join (so values with no match would not be lost) and proceeds with the aggregate computation after the outerjoin. Muralikrishna generalizes the approach and points out that negative linking conditions can be dealt with using antijoins or by translating them to other, positive linking conditions. These approaches also introduce some shortcomings. First, outerjoins and antijoins do not commute with regular joins or selections; therefore, a query tree with all these operators does not offer many degrees of freedom to the optimizer. The work of [6] and [16] has studied conditions under which outerjoins and antijoins can be moved, alleviating this problem partially. Another problem with this approach is that by carrying out the (outer)join corresponding to the correlation predicate, other predicates in the WHERE clause of the main query, which may restrict the total computation to be carried out, are postponed. The magic sets approach ([17, 18, 20]) pushes these predicates down past the (outer)join by identifying the minimal set of values that the correlating attributes can take (the magic set) and computing it in advance. This minimizes the size of other computations but comes at the cost of building the magic set in advance.

1 The approach is applicable to non-correlated subqueries as well, but does not provide any substantial gains in that case.

However, all approaches in the literature assume positive linking conditions (and all examples shown in [5, 13, 19, 20, 18] involve positive linking conditions). Negative linking conditions are not given much attention; it is considered that queries can be rewritten to avoid them, or that they can be dealt with directly using antijoins. But both approaches are problematic. About the former, we point out that the standard translation does not work if nulls are present. Assume, for instance, the condition attr > ALL Q, where Q is a subquery, with attr2 the linked attribute. It is usually assumed that a (left) antijoin with condition attr ≤ attr2 is a correct translation of this condition, since for a tuple t to be in the antijoin, it cannot be the case that t.attr ≤ attr2, for any value of attr2 (or any value in a given group, if the subquery is correlated). Unfortunately, this equivalence is only true for 2-valued logics, not for the 3-valued logic that SQL uses to evaluate predicates when null is present. The condition attr ≤ attr2 will fail if attr is not null and no value of attr2 is greater than or equal to attr, which may happen because attr2 is the right value or because attr2 is null. Hence, a tuple t will be in the antijoin in the last case above, and t will qualify for the result. Even though one could argue that this can be solved by changing the condition in the antijoin (and indeed, a correct rewrite is possible, but more complex than usually considered ([1])), a larger problem with this approach is that it produces plans with outerjoins and antijoins, which are very difficult to move around in the query tree; even though recent research has shown that outerjoins ([6]) and antijoins ([16]) can be moved under limited circumstances, this still poses a constraint on the alternatives that can be generated for a given query plan, and it is up to the optimizer to check that the necessary conditions are met. Hence, proliferation of these operations makes the task of the query optimizer difficult. As an example of the problems of the traditional approach, assume tables R(A,B,C,D), S(E,F,G,H,I), U(J,K,L), and consider the query

Select *
From R
Where R.A > 10 and
      R.B NOT IN (Select S.E
                  From S
                  Where S.F = 5 and
                        R.D = S.G and
                        S.H > ALL (Select U.J
                                   From U
                                   Where U.K = R.C and
                                         U.L != S.I))

Unnesting this query with the traditional approach has the problem of introducing several outerjoins and antijoins that cannot be moved, as well as extra operations. To see why, note that we must outerjoin U with S and R, and then group by the keys of R and S, to determine which tuples of U must be tested for the ALL linking condition. However, should the set of tuples of U in a group fail the test, we cannot throw the whole group away: that means that some tuples of S fail to qualify for an answer, making the NOT IN linking condition true, and hence qualifying the R tuple. Thus, tuples in S and U should be antijoined separately to determine which tuples of S pass or fail the ALL test. Then the result should be separately antijoined with R to determine which tuples of R pass or fail the NOT IN test. The result is shown in Figure 1, with LOJ denoting a left outer join and AJ denoting an antijoin (note that the tree is actually a graph!). Even though Muralikrishna ([13]) proposes to extract (left) antijoins from (left) outerjoins, we note that in general such reuse may not be possible: here, the outerjoin is introduced to deal with the correlation and the antijoin with the linking, and therefore they have distinct, independent conditions attached to them (and such approaches transform the query tree into a query graph, making it harder for the optimizer to consider alternatives). Also, magic sets would be able to improve on the above plan by pushing selections down to the relations; however, this approach does not improve the overall situation, with outerjoins and antijoins still present. Clearly, what is called for is an approach which uniformly deals with all types of linking conditions without introducing undue complexity.

Fig. 1. Standard unnesting approach applied to the example (a plan built from left outerjoins and antijoins over R, S and U).

3 Boolean Aggregates

We seek a uniform method that will work for all linking conditions. In order to achieve this, we define Boolean aggregates AND and OR, which take as input a comparison and a set of values (or tuples), and return a Boolean (true or false) as output. Let attr be an attribute, θ a comparison operator and S a set of values.


Then

AND(S, attr, θ) = ∧_{attr2 ∈ S} (attr θ attr2).

We define AND(∅, att, θ) to be true for any att, θ. Also,

OR(S, attr, θ) = ∨_{attr2 ∈ S} (attr θ attr2).

We define OR(∅, att, θ) to be false for any att, θ.

It is important to point out that each individual comparison is subject to the semantics of SQL's WHERE clause; in particular, comparisons with null values return unknown. The usual behavior of unknown with respect to conjunction and disjunction is followed ([12]). Note also that the set S will be implicit in normal use. When the Boolean aggregates are used alone, S will be the input relation to the aggregate; when used in conjunction with a GROUP-BY operator, each group will provide the input set. Thus, we will write GB_{A, AND(B, θ)}(R), where A is a subset of attributes of the schema of R, B is an attribute from the schema of R, and θ is a comparison operator; and similarly for OR. The intended meaning is that, similar to other aggregates, AND is applied to each group created by the grouping.

We use Boolean aggregates to compute any linking condition which does not use a (regular) aggregate, as follows: after a join or outerjoin connecting query and subquery is introduced by the unnesting, a group by is executed. The grouping attributes are any key of the relation from the outer block; the Boolean aggregate used depends on the linking condition. For attr θ SOME Q, where Q is a correlated subquery, the aggregate used is OR(attr, θ). For attr IN Q, the linking condition is treated as attr = SOME Q. For EXISTS Q, the aggregate used is OR(1, 1, =)². For attr θ ALL Q, where Q is a correlated subquery, the aggregate used is AND(attr, θ). For attr NOT IN Q, the linking condition is treated as attr ≠ ALL Q. Finally, for NOT EXISTS Q, the aggregate used is AND(1, 1, ≠). After the grouping and aggregation, the Boolean aggregates leave a truth value in each group of the grouped relation. A selection then must be used to pick up those tuples where the Boolean is set to true. Note that most of this work can be optimized in the implementation, an issue that we discuss in the next subsection.

2 Note that technically this formulation is not correct since we are using a constant instead of attr, but the meaning is clear.
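The following Python sketch illustrates how the two aggregates behave over a group, including SQL's three-valued comparison semantics (None stands for NULL and for the truth value unknown); it is an illustration under our own naming, not part of the paper's formalism.

import operator

def sql_compare(a, b, op):
    # SQL-style comparison: any comparison with NULL yields unknown (None)
    if a is None or b is None:
        return None
    return op(a, b)

def and_aggregate(attr, group, op):
    # AND(S, attr, op): true on the empty set, 3-valued conjunction otherwise
    result = True
    for attr2 in group:
        c = sql_compare(attr, attr2, op)
        result = False if (result is False or c is False) else (
            None if (result is None or c is None) else True)
    return result

def or_aggregate(attr, group, op):
    # OR(S, attr, op): false on the empty set, 3-valued disjunction otherwise
    result = False
    for attr2 in group:
        c = sql_compare(attr, attr2, op)
        result = True if (result is True or c is True) else (
            None if (result is None or c is None) else False)
    return result

# attr > ALL of the group, and attr = SOME of the group
print(and_aggregate(10, [3, 7, 9], operator.gt))     # True
print(and_aggregate(10, [3, None, 9], operator.gt))  # None (unknown), not True
print(or_aggregate(5, [], operator.eq))              # False on the empty group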

Clearly, implementing a Boolean aggregate is very similar to implementing a regular aggregate. The usual way to compute the traditional SQL aggregates (min, max, sum, count, avg) is to use an accumulator variable in which to store temporary results, and to update it as more values arrive. For min and max, for instance, any new value is compared to the value in the accumulator and replaces it if it is smaller (respectively larger). Sum and count initialize the accumulator to 0 and increase the accumulator with each new value (using the value for sum, and 1 for count). Likewise, a Boolean accumulator is used for Boolean aggregates. For ALL, the accumulator is started as true; for SOME, as false. As new values arrive, a comparison is carried out, and the result is ANDed (for AND) or ORed (for OR) with the accumulator. There is, however, a problem with this straightforward approach. When an outerjoin is used to deal with the correlation, tuples in the outer block that have no match appear in the result exactly once, padded with nulls on the attributes of the inner block. Thus, when a group by is done, these tuples become their own group. Hence, tuples with no match actually have one (null) match in the outer join. The Boolean aggregate will then iterate over this single tuple and, finding a null value in it, will deposit a value of unknown in the accumulator. But when a tuple has no matches, the ALL test should be considered successful. The problem is that the outer join marks no matches with a null; while this null is meant to mean "no value occurs", SQL is incapable of distinguishing this interpretation from others, like "value unknown" (for which the 3-valued semantics makes sense). Note also that the value of attr2 may genuinely be a null, if such a null existed in the original data. Thus, what is needed is a way to distinguish the tuples that have been added as padding by the outer join. We stipulate that outer joins will pad tuples without a match not with nulls, but with a different marker, called an empty marker, which is different from any possible value and from the null marker itself. Then a program like the following can be used to implement the AND aggregate:

acc = True;
while (not empty(S)) {
    t = first(S);
    if (t.attr2 != emptymark) acc = acc AND (attr comp t.attr2);
    S = rest(S);
}

Note that this program implements the semantics given for the operator, since a single tuple with the empty marker represents the empty set in the relational framework³.

3.1 Query Unnesting

We unnest using an approach that we call quasi-magic. First, at every query level the WHERE clause, with the exception of any linking condition(s), is transformed into a query tree. This allows us to push selections before any unnesting, as in the magic approach, but we do not compute the magic set, just the complementary set ([17, 18, 20]). This way, we avoid the overhead associated with the magic method. Then, correlated queries are treated as in Dayal's approach, by adding a join (or outerjoin, if necessary), followed by a group by on key attributes of the outer relation. At this point, we apply Boolean aggregates using the linking condition, as outlined above.

3 The change of padding in the outer join should be of no consequence to the rest of query processing. Right after the application of the Boolean aggregate, a selection will pick up only those tuples with a value of true in the accumulator. This includes tuples with the marker; however, no other operator up the query tree operates on the values with the marker; in the standard setting, they would contain nulls, and hence no useful operation can be carried out on these values.

In our previous example, a tree (call it T1) is formed to deal with the outer block: σ_{A>10}(R). A second tree (call it T2) is formed for the nested query block at the first level: σ_{F=5}(S). Finally, a third tree is formed for the innermost block: U (note that this is a trivial tree because, at every level, we are excluding linking conditions, and there is nothing but linking conditions in the WHERE clause of the innermost block of our example). Using these trees as building blocks, a tree for the whole query is built as follows:

1. First, construct a graph where each tree formed so far is a node and there is a direct link from node Ti to node Tj if there is a correlation in the Tj block with the value of the correlation coming from a relation in the Ti block; the link is annotated with the correlation predicate. Then, we start our tree by left outerjoining any two nodes that have a link between them (the left input corresponding to the block in the outer query), using the condition in the annotation of the link, and starting with graph sources (because of SQL semantics, this will correspond to outermost blocks that are not correlated) and finishing with sinks (because of SQL semantics, this will correspond to innermost blocks that are correlated). Thus, we outerjoin from the outside in. An exception is made for a link between Ti and Tj if there is a path in the graph between Ti and Tj of length ≥ 1. In the example above, our graph has three nodes, T1, T2 and T3, with links from T1 to T2, T1 to T3 and T2 to T3. We create a left outerjoin between T2 and T3 first, and then another left outerjoin of T1 with the previous result. In a situation like this, the link from T1 to T3 becomes just another condition when we outerjoin T1 with the result of the previous outerjoin.

2. On top of the tree obtained in the previous step, we add GROUP BY nodes, with the grouping attributes corresponding to keys of relations in the left argument of the left outerjoins. On each GROUP BY, the appropriate (Boolean) aggregate is used, followed by a SELECT looking for tuples with true (for Boolean aggregates) or applying the linking condition (for regular aggregates). Note that these nodes are applied from the inside out, i.e., the first (bottom) one corresponds to the innermost linking condition, and so on.

3. A projection, if needed, is placed on top of the tree. The following optimization is applied automatically: every outerjoin is considered to see if it can be transformed into a join. This is not possible for negative linking conditions (NOT IN, NOT EXISTS, ALL), but it is possible for positive linking conditions and all aggregates except COUNT(*)⁴.

4 This rule coincides with some of Galindo-Legaria's rules ([6]), in that we know that in positive linking conditions and aggregates we are going to have selections that are null-intolerant and, therefore, the outerjoin is equivalent to a join.
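As an illustration of step 1, here is a minimal Python sketch (our own reading of the procedure, not code from the paper) that folds the per-block trees into left outerjoins from the innermost correlated block outwards; the block names and correlation predicates are hypothetical placeholders.

def outerjoin_plan(nodes, links):
    # nodes: block names ordered from outermost to innermost, e.g. ["T1", "T2", "T3"]
    # links: dict (outer_block, inner_block) -> correlation predicate string
    # Folds blocks starting from the innermost one; when a block is outerjoined
    # with the subplan built so far, every predicate linking it to a block already
    # in the subplan is attached to that outerjoin, so a "shortcut" link such as
    # T1 -> T3 simply becomes another condition of the T1 outerjoin.
    plan, folded = nodes[-1], {nodes[-1]}
    for outer in reversed(nodes[:-1]):
        conds = [p for (a, b), p in links.items() if a == outer and b in folded]
        plan = ("LOJ", " and ".join(conds), outer, plan)
        folded.add(outer)
    return plan

# Hypothetical predicates for the three-block example:
# outerjoin_plan(["T1", "T2", "T3"],
#                {("T1", "T2"): "pred12", ("T1", "T3"): "pred13", ("T2", "T3"): "pred23"})
# builds T2 LOJ T3 first, then T1 LOJ (T2 LOJ T3) with pred12 and pred13 as its condition.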


[Figure 2 shows the resulting query tree: selections SELECT(A>10) on R and Select(F=5) on S, the relation T, left outerjoins LOJ(L = I) and LOJ(K = C and D = G), group-bys GB(Rkey, Skey, AND(S.H > T.J)) and GB(Rkey, AND(R.B != S.E)) each followed by a SELECT(Bool=True), and a final PROJECT(R.*).]

Fig. 2. Our approach applied to the example

After this process, the tree is passed on to the query optimizer to see if further optimization is possible. Note that inside each subtree Ti there may be some optimization work to do; note also that, since all operators in the tree are joins and outerjoins, the optimizer may be able to move around some operators. Also, some GROUP BY nodes may be pulled up or pushed down ([2, 3, 8, 9]). We show the final result applied to our example above in figure 2. Note that in our example the outerjoins cannot be transformed into joins; however, the group bys may be pushed down depending on the keys of the relation (which we did not specify). Also, even if groupings cannot be pushed down, note that the first one groups the temporary relation by the keys of R and S, while the second one groups by the keys of R alone. Clearly, this second grouping is trivial; the whole operation (grouping and aggregate) can be done in one scan of the input. Compare this tree with the one that is achieved by standard unnesting (shown in figure 1), and it is clear that our approach is more uniform and simple, while using to its advantage the ideas behind standard unnesting. Again, magic sets could be applied to Dayal's approach, to push down the selections in R and S like we did. However, in this case additional steps would be needed (for the creation of the complementary and magic sets), and the need for outerjoins and antijoins does not disappear. In our approach, the complementary set is always produced by our decision to first process operations at the same level, collapsing each query block (with the exception of linking conditions) to one relation (this is the reason we call our approach a quasi-magic strategy). As more levels and more subqueries with more correlations are added, the simplicity and clarity of our approach become more evident.


3.2 Optimizations

Besides algebraic optimizations, there are some particular optimizations that can be applied to Boolean aggregates. Obviously, AND evaluation can stop as soon as some predicate evaluates to false (with final result false); and OR evaluation can stop as soon as some predicate evaluates to true (with final result true). The later selection on Boolean values can be done on the fly: since we know that the selection condition is going to be looking for groups with a value of true, groups with a value of false can be thrown away directly, in essence pipelining the selection in the GROUP-BY. Note also that by pipelining the selection, we eliminate the need for a Boolean attribute! In our example, once both left outer joins have been carried out, the first GROUP-BY is executed by using either sorting or hashing by the keys of R and S. On each group, the Boolean aggregate AND is computed as tuples come. As soon as a comparison returns false, computation of the Boolean aggregate is stopped, and the group is marked so that any further tuples belonging to the group are ignored; no output is produced for that group. Groups that do not fail the test are added to the output. Once this temporary result is created, it is read again and scanned looking only at values of the keys of R to create the groups; the second Boolean aggregate is computed as before. Also as before, as soon as a comparison returns false, the group is flagged for dismissal. Output is composed of groups that were not flagged when input was exhausted. Therefore, the cost of our plan, considering only operations above the second left outer join, is that of grouping the temporary relation by the keys of R and S, writing the output to disk and reading this output into memory again. In traditional unnesting, the cost after the second left outer join is that of executing two antijoins, which is in the order of executing two joins.
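To make the pipelined evaluation concrete, the following Python sketch (ours, not the paper's implementation) evaluates a grouped Boolean AND aggregate while discarding failed groups on the fly; the tuple representation, grouping key and predicate are hypothetical.

def boolean_and_groupby(tuples, group_key, predicate):
    # Keep only the groups in which predicate holds for every tuple (Boolean AND).
    # A group is flagged as failed as soon as one tuple violates the predicate, so
    # no Boolean attribute ever needs to be materialized and no output is produced
    # for failed groups.
    failed = set()   # groups already known to violate the predicate
    passed = {}      # group key -> tuples of groups still passing
    for t in tuples:
        key = group_key(t)
        if key in failed:
            continue                      # group already dismissed: ignore tuple
        if predicate(t):
            passed.setdefault(key, []).append(t)
        else:
            failed.add(key)               # early termination for this group
            passed.pop(key, None)         # drop any partial output for it
    return passed

# Hypothetical usage for the first GROUP-BY of the example:
# boolean_and_groupby(joined, lambda t: (t["Rkey"], t["Skey"]), lambda t: t["S_H"] > t["T_J"])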

4 Conclusion and Further Work

We have proposed an approach to unnesting SQL subqueries which builds on top of existing approaches. Therefore, our proposal is very easy to implement in existing query optimization and query execution engines, as it requires very little in the way of new operations, cost calculations, or implementation in the back-end. The approach allows us to treat all SQL subqueries in a uniform and simplified manner, and meshes well with existing approaches, letting the optimizer move operators around and apply advanced optimization techniques (like outerjoin reduction and push down/pull up of GROUP BY nodes). Further, because it extends to several levels easily, it simplifies the resulting query trees. Optimizers are becoming quite sophisticated and complex; a simple and uniform treatment of all queries is certainly worth examining.

We have argued that our approach yields better performance than traditional approaches when negative linking conditions are present. We plan to analyze the performance of our approach by implementing Boolean attributes on a DBMS and/or developing a detailed cost model, to offer further support for the conclusions reached in this paper.


References

[1] Cao, Bin and Badia, A. Subquery Rewriting for Optimization of SQL Queries, submitted for publication.
[2] Chaudhuri, S. and Shim, K. Including Group-By in Query Optimization, in Proceedings of the VLDB Conference, 1994.
[3] Chaudhuri, S. and Shim, K. An Overview of Cost-Based Optimization of Queries with Aggregates, Data Engineering Bulletin, 18(3), 1995.
[4] Cohen, S., Nutt, W. and Serebrenik, A. Algorithms for Rewriting Aggregate Queries using Views, Proceedings of the Design and Management of Data Warehouses Conference, 1999.
[5] Dayal, U. Of Nests and Trees: A Unified Approach to Processing Queries That Contain Nested Subqueries, Aggregates, and Quantifiers, in Proceedings of the VLDB Conference, 1987.
[6] Galindo-Legaria, C. and Rosenthal, A. Outerjoin Simplification and Reordering for Query Optimization, ACM TODS, vol. 22, n. 1, 1997.
[7] Ganski, R. and Wong, H. Optimization of Nested SQL Queries Revisited, in Proceedings of the ACM SIGMOD Conference, 1987.
[8] Goel, P. and Iyer, B. SQL Query Optimization: Reordering for a General Class of Queries, in Proceedings of the 1996 ACM SIGMOD Conference.
[9] Gupta, A., Harinarayan, V. and Quass, D. Aggregate-Query Processing in Data Warehousing Environments, in Proceedings of the VLDB Conference, 1995.
[10] Kim, W. On Optimizing an SQL-Like Nested Query, ACM Transactions on Database Systems, vol. 7, n. 3, September 1982.
[11] Materialized Views: Techniques, Implementations and Applications, A. Gupta and I. S. Mumick, eds., MIT Press, 1999.
[12] Melton, J. Advanced SQL: 1999, Understanding Object-Relational and Other Advanced Features, Morgan Kaufmann, 2003.
[13] Muralikrishna, M. Improving Unnesting Algorithms for Join Aggregate Queries in SQL, in Proceedings of the VLDB Conference, 1992.
[14] Ross, K. and Rao, J. Reusing Invariants: A New Strategy for Correlated Queries, in Proceedings of the ACM SIGMOD Conference, 1998.
[15] Ross, K. and Chatziantoniou, D. Groupwise Processing of Relational Queries, in Proceedings of the 23rd VLDB Conference, 1997.
[16] Jun Rao, Bruce Lindsay, Guy Lohman, Hamid Pirahesh and David Simmen. Using EELs, a Practical Approach to Outerjoin and Antijoin Reordering, in Proceedings of ICDE 2001.
[17] Praveen Seshadri, Hamid Pirahesh, T.Y. Cliff Leung. Complex Query Decorrelation, in Proceedings of ICDE 1996, pages 450-458.
[18] Praveen Seshadri, Joseph M. Hellerstein, Hamid Pirahesh, T.Y. Cliff Leung, Raghu Ramakrishnan, Divesh Srivastava, Peter J. Stuckey, and S. Sudarshan. Cost-Based Optimization for Magic: Algebra and Implementation, in Proceedings of the SIGMOD Conference, 1996, pages 435-446.
[19] Inderpal Singh Mumick and Hamid Pirahesh. Implementation of Magic-sets in a Relational Database System, in Proceedings of the SIGMOD Conference 1994, pages 103-114.
[20] Inderpal Singh Mumick, Sheldon J. Finkelstein, Hamid Pirahesh and Raghu Ramakrishnan. Magic is Relevant, in Proceedings of the SIGMOD Conference, 1990, pages 247-258.


Fighting Redundancy in SQL*

Antonio Badia and Dev Anand

Computer Engineering and Computer Science Department, University of Louisville, Louisville, KY 40292

Abstract. Many SQL queries with aggregated subqueries exhibit redundancy (overlap in FROM and WHERE clauses). We propose a method, called the for-loop, to optimize such queries by ensuring that redundant computations are done only once. We specify a procedure to build a query plan implementing our method, give an example of its use and argue that it offers performance advantages over traditional approaches.

1 Introduction

In this paper, we study a class of Decision-Support SQL queries, characterize them and show how to process them in an improved manner. In particular, we analyze queries containing subqueries, where the subquery is aggregated (type-A and type-JA in [8]). In many of these queries, SQL exhibits redundancy in that the FROM and WHERE clauses of query and subquery show a great deal of overlap. We argue that these patterns are currently not well supported by relational query processors. The following example gives some intuition about the problem; the query used is Query 2 from the TPC-H benchmark ([18]); we will refer to it as query TPCH2:

select s_acctbal, s_name, n_name, p_partkey, p_mfgr, s_address, s_phone, s_comment
from part, supplier, partsupp, nation, region
where p_partkey = ps_partkey and s_suppkey = ps_suppkey
  and p_size = 15 and p_type like '%BRASS' and r_name = 'EUROPE'
  and s_nationkey = n_nationkey and n_regionkey = r_regionkey
  and ps_supplycost = (select min(ps_supplycost)
                       from partsupp, supplier, nation, region
                       where p_partkey = ps_partkey
                         and s_suppkey = ps_suppkey
                         and s_nationkey = n_nationkey
                         and n_regionkey = r_regionkey
                         and r_name = 'EUROPE')
order by s_acctbal desc, n_name, s_name, p_partkey;

* This research was sponsored by NSF under grant IIS-0091928.

Y. Kambayashi, M. Mohania, W. Woß (Eds.): DaWaK 2003, LNCS 2737, pp. 401–411, 2003. © Springer-Verlag Berlin Heidelberg 2003


This query is executed in most systems by using unnesting techniques. However, the commonality between query and subquery will not be detected, and all operations (including common joins and selections) will be repeated (see an in-depth discussion of this example in subsection 2.3).

Our goal is to avoid duplication of effort. For lack of space, we will not discuss related research in query optimization ([3, 11, 6, 7, 8, 15]); we point out that detecting and dealing with redundancy is not attempted in this body of work. Our method applies only to aggregated subqueries that contain WHERE clauses overlapping with the main query's WHERE clause. This may seem a very narrow type of queries until one realizes that all types of SQL subqueries can be rewritten as aggregated subqueries (EXISTS, for instance, can be rewritten as a subquery with COUNT; all other types of subqueries can be rewritten similarly ([2])). Therefore, the approach is potentially applicable to any SQL query with subqueries. Also, it is important to point out that the redundancy is present because of the structure of SQL, which necessitates a subquery in order to declaratively state the aggregation to be computed. Thus, we argue that such redundancy is not infrequent ([10]). We describe an optimization method geared towards detecting and optimizing this redundancy. Our method not only computes the redundant part only once, but also proposes a new special operator to compute the rest of the query very effectively. In section 2 we describe our approach and the new operator in more detail. We formally describe the operator (subsection 2.1), show how query trees with the operator can be generated for a given SQL query (subsection 2.2), and describe an experiment run in the context of the TPC-H benchmark ([18]) (subsection 2.3). Finally, in section 3 we propose some further research.

2 Optimization of Redundancy

In this section we define patterns which detect redundancy in SQL queries. We then show how to use the matching of patterns and SQL queries to produce a query plan which avoids repeating computations. We represent SQL queries in a schematic form or pattern. With the keywords SELECT ... FROM ... WHERE we will use L, L1, L2, . . . as variables over a list of attributes; T, T1, T2, . . . as variables over a list of relations; F, F1, F2, . . . as variables over aggregate functions; and Δ, Δ1, Δ2, . . . as variables over (complex) conditions. Attributes will be represented by attr, attr1, attr2, . . .. If there is a condition in the WHERE clause of the subquery which introduces correlation, it will be shown explicitly; this is called the correlation condition. The table to which the correlated attribute belongs is called the correlation table, and is said to introduce the correlation; the attribute compared to the correlated attribute is called the correlating attribute. Also, the condition that connects query and subquery (called a linking condition) is also shown explicitly. The operator in the linking condition is called the linking operator, the attributes the linking attributes, and the aggregate function on the subquery side is called the linking aggregate. We will say that a pattern


matches an SQL query when there is a correspondence g between the variables in the pattern and the elements of the query. As an example, the pattern

SELECT L FROM T WHERE Δ1 AND attr1 θ (SELECT F(attr2) FROM T WHERE Δ2)

would match query TPCH2 by setting

g(Δ1) = {p_partkey = ps_partkey and s_suppkey = ps_suppkey and p_size = 15 and p_type like '%BRASS' and r_name = 'EUROPE' and s_nationkey = n_nationkey and n_regionkey = r_regionkey},
g(Δ2) = {p_partkey = ps_partkey and s_suppkey = ps_suppkey and r_name = 'EUROPE' and s_nationkey = n_nationkey and n_regionkey = r_regionkey},
g(T) = {part, supplier, partsupp, nation, region},
g(F) = min and g(attr1) = g(attr2) = ps_supplycost.

Note that the T symbol appears twice, so the pattern forces the query to have the same FROM clauses in the main query and in the subquery1. The correlation condition is p_partkey = ps_partkey; the correlation table is part, and ps_partkey is the correlating attribute. The linking condition here is ps_supplycost = min(ps_supplycost); thus ps_supplycost is the linking attribute, '=' the linking operator and min the linking aggregate.

The basic idea of our approach is to divide the work to be done in three parts: one that is common to query and subquery, one that belongs only to the subquery, and one that belongs only to the main query2. The part that is common to both query and subquery can be done only once; however, as we argue in subsection 2.3, in most systems today it would be done twice. We calculate the three parts above as follows: the common part is g(Δ1) ∩ g(Δ2); the part proper to the main query is g(Δ1) − g(Δ2); and the part proper to the subquery is g(Δ2) − g(Δ1). For query TPCH2, this yields {p_partkey = ps_partkey and s_suppkey = ps_suppkey and r_name = 'EUROPE' and s_nationkey = n_nationkey and n_regionkey = r_regionkey}, {p_size = 15 and p_type like '%BRASS'} and ∅, respectively. We use this matching in constructing a program to compute this query. The process is explained in the next subsection.
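A minimal Python sketch of this three-way split (ours, not the authors' implementation), treating each WHERE clause as a set of conjuncts; the literal condition strings below simply restate g(Δ1) and g(Δ2) from the example.

def split_conditions(outer_where, sub_where):
    # Partition the conjuncts (excluding the linking condition) into the part
    # common to query and subquery, the part proper to the main query, and the
    # part proper to the subquery.
    common = outer_where & sub_where      # g(D1) n g(D2): the base relation
    main_only = outer_where - sub_where   # g(D1) - g(D2)
    sub_only = sub_where - outer_where    # g(D2) - g(D1)
    return common, main_only, sub_only

outer = {"p_partkey = ps_partkey", "s_suppkey = ps_suppkey", "p_size = 15",
         "p_type like '%BRASS'", "r_name = 'EUROPE'",
         "s_nationkey = n_nationkey", "n_regionkey = r_regionkey"}
sub = {"p_partkey = ps_partkey", "s_suppkey = ps_suppkey", "r_name = 'EUROPE'",
       "s_nationkey = n_nationkey", "n_regionkey = r_regionkey"}
common, main_only, sub_only = split_conditions(outer, sub)
# common: the five shared joins/selections; main_only: {p_size = 15, p_type like '%BRASS'};
# sub_only: the empty set, as computed for TPCH2 above.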

2.1 The For-Loop Operator

We start out with the common part, called the base relation, in order to ensure that it is not done twice. The base relation can be expressed as an SPJ query. Our strategy is to compute the rest of the query starting from this base relation. This strategy faces two difficulties. First, if we simply divide the query based on common parts we obtain a plan where redundancy is eliminated at the price of fixing the order of some operations. In particular, some selections not in the common part wouldn't be pushed down. Hence, it is unclear whether this strategy will provide significant improvements by itself (this situation is similar to that of [13]). Second, when starting from the base relation, we face a problem in that this relation has to be used for two different purposes: it must be used to compute an aggregate after finishing up the WHERE clause in the subquery (i.e. after computing g(Δ2) − g(Δ1)); and it must be used to finish up the WHERE clause in the main query (i.e. to compute g(Δ1) − g(Δ2)) and then, using the result of the previous step, compute the final answer to the query. However, it is extremely hard in relational algebra to combine the operators involved. For instance, the computation of an aggregate must be done before the aggregate can be used in a selection condition.

1 For correlated subqueries, the correlation table is counted as present in the FROM clause of the subquery.
2 We are assuming that all relations mentioned in a query are connected, i.e. that there are no Cartesian products present, only joins. Therefore, when there is overlap between query and subquery FROM clauses, we are very likely to find common conditions in both WHERE clauses (at least the joins).

In order to solve this problem, we define a new operator, called the for-loop, which combines several relational operators into a new one (i.e. a macro-operator). The approach is based on the observation that some basic operations appear frequently together and they could be more efficiently implemented as a whole. In our particular case, we show in the next subsection that there is an efficient implementation of the for-loop operator which allows it, in some cases, to compute several basic operators with one pass over the data, thus saving considerable disk I/O.

Definition 1. Let R be a relation, sch(R) the schema of R, L ⊆ sch(R), A ∈ sch(R), F an aggregate function, α a condition on R (i.e. involving only attributes of sch(R)) and β a condition on sch(R) ∪ {F(A)} (i.e. involving attributes of sch(R) and possibly F(A)). Then the for-loop operator is defined as either one of the following:

1. FL_{L,F(A),α,β}(R). The meaning of the operator is defined as follows: let Temp be the relation GB_{L,F(A)}(σ_α(R)) (GB is used to indicate a group-by operation). Then the for-loop yields the relation σ_β(R ⋈_{R.L=Temp.L} Temp), where the condition of the join is understood as the pairwise equality of each attribute in L. This is called a grouped for-loop.

2. FL_{F(A),α,β}(R). The meaning of the operator is given by σ_β(AGG_{F(A)}(σ_α(R)) × R), where AGG_{F(A)}(R) indicates the aggregate F computed over all A values of R. This is called a flat for-loop.

Note that β may contain aggregated attributes as part of a condition. In fact, in the typical use in our approach, it does contain an aggregation. The main use of a for-loop is to calculate the linking condition of a query with an aggregated subquery on the fly, possibly with additional selections. Thus, for instance, for query TPCH2, the for-loop would take the grouped form FL_{p_partkey, min(ps_supplycost), ∅, p_size=15 ∧ p_type LIKE '%BRASS' ∧ ps_supplycost=min(ps_supplycost)}(R), where R is the relation obtained by computing the base relation3. The for-loop is equivalent to the relational expression σ_{p_size=15 ∧ p_type LIKE '%BRASS' ∧ ps_supplycost=min(ps_supplycost)}(AGG_{min(ps_supplycost)}(R) × R).

3 Again, note that the base relation contains the correlation as a join.


It can be seen that this expression will compute the original SQL query; the aggregation will compute the aggregate function of the subquery (the conditions in the WHERE clause of the subquery have already been computed in R, since in this case Δ2 ⊆ Δ1 and hence Δ2 − Δ1 = ∅), and the Cartesian product will put a copy of this aggregate on each tuple, allowing the linking condition to be stated as a regular condition over the resulting relation.

Note that this expression may not be better, from a cost point of view, than other plans produced by standard optimization. What makes this plan attractive is that the for-loop operator can be implemented in such a way that it computes its output with one pass over the data. In particular, the implementation will not carry out any Cartesian product, which is used only to explain the semantics of the operator. The operator is written as an iterator that loops over the input implementing a simple program (hence the name). The basic idea is simple: in some cases, computing an aggregation and using the aggregate result in a selection can be done at the same time. This is due to the behavior of some aggregates and the semantics of the conditions involved. Assume, for instance, that we have a comparison of the type attr = min(attr2), where both attr and attr2 are attributes of some table R. In this case, as we go on computing the minimum for a series of values, we can actually decide, as we iterate over R, whether some tuples will make the condition true or not ever. This is due to the fact that min is monotonically non-increasing, i.e. as we iterate over R and we carry a current minimum, this value will always stay the same or decrease, never increase. Since equality imposes a very strict constraint, we can take a decision on the current tuple t based on the values of t.attr and the current minimum, as follows: if t.attr is greater than the current minimum, we can safely get rid of it. If t.attr is equal to the current minimum, we should keep it, at least for now, in a temporary result temp1. If t.attr is less than the current minimum, we should keep it, in case our current minimum changes, in a temporary result temp2. Whenever the current minimum changes, we know that temp1 should be deleted, i.e. tuples there cannot be part of a solution. On the other hand, temp2 should be filtered: some tuples there may be thrown away, some may be in a new temp1, some may remain in temp2. At the end of the iteration, the set temp1 gives us the correct solution. Of course, as we go over the tuples in R we may keep some tuples that we need to get rid of later on; but the important point is that we never have to get back and recover a tuple that we dismissed, thanks to the monotonic behavior of min. This behavior does generalize to max, sum, count, since they are all monotonically non-decreasing (for sum, it is assumed that all values in the domain are positive numbers); however, average is not monotonic (either in an increasing or decreasing manner). For this reason, our approach does not apply to average. For the other aggregates, though, we argue that we can successfully take decisions on the fly without having to recover discarded tuples later on.
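The following Python sketch is our reading of this one-pass scheme for a linking condition attr = min(attr2) (it is not the paper's for-loop program); tuples are assumed to be dicts, alpha stands for the subquery-only conditions feeding the aggregate, and beta for the remaining selections applied together with the linking condition.

def flat_forloop_min_eq(tuples, attr, attr2, alpha=lambda t: True, beta=lambda t: True):
    # One-pass flat for-loop for the linking condition attr = min(attr2):
    # only alpha-tuples feed the minimum, every tuple is classified against it.
    current_min = None
    temp1 = []   # tuples whose attr equals the current minimum
    temp2 = []   # tuples whose attr is below the current minimum (may still qualify)
    for t in tuples:
        if alpha(t) and (current_min is None or t[attr2] < current_min):
            current_min = t[attr2]
            # the minimum dropped: the old temp1 can never qualify again,
            # and temp2 must be re-filtered against the new minimum
            temp1 = [s for s in temp2 if s[attr] == current_min]
            temp2 = [s for s in temp2 if s[attr] < current_min]
        if current_min is None or t[attr] < current_min:
            temp2.append(t)        # might equal a future, smaller minimum
        elif t[attr] == current_min:
            temp1.append(t)        # qualifies unless the minimum drops later
        # if t[attr] > current_min the tuple is dismissed for good: min never increases
    return [t for t in temp1 if beta(t)]

# For TPCH2, attr = attr2 = 'ps_supplycost', alpha is always true (D2 - D1 is empty) and
# beta would test p_size = 15 and p_type LIKE '%BRASS'.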


2.2 Query Transformation

The general strategy to produce a query plan with for-loops for a given SQL query q is as follows: we classify q into one of two categories, according to q's structure. For each category, a pattern p is given. As before, if q fits into p there is a mapping g between constants in q and variables in p. Associated with each pattern there is a for-loop program template t. A template is different from a program in that it has variables and options. Using the information on the mapping g (including the particular linking aggregate and linking condition in q), a concrete for-loop program is generated from t. The process to produce a query tree containing a for-loop operator is then simple: our patterns allow us to identify the part common to query and subquery (i.e. the base relation), which is used to start the query tree. Standard relational optimization techniques can be applied to this part. Then a for-loop operator which takes the base relation as input is added to the query tree, and its parameters determined. We describe each step separately.

We distinguish between two types of queries: type A queries, in which the subquery is not correlated (this corresponds to type J in [8]); and type B queries, where the subquery is correlated (this corresponds to type JA in [8]). Queries of type A are interesting in that usual optimization techniques cannot do anything to improve them (obviously, unnesting does not apply to them). Thus, our approach, whenever applicable, offers a chance to create an improved query plan. In contrast, queries of type B have been dealt with extensively in the literature ([8, 3, 6, 11, 17, 16, 15]). As we will see, our approach is closely related to other unnesting techniques, but it is the only one that considers redundancy between query and subquery and its optimization.

The general pattern a type A query must fit is given below:

SELECT L
FROM T
WHERE Δ1 AND attr1 θ (SELECT F(attr2)
                      FROM T
                      WHERE Δ2)
{GROUP BY L2}

The braces around the GROUP BY clause indicate that such a clause is optional4. We create a query plan for this query in two steps:

1. A base relation is defined by g(Δ1) ∩ g(Δ2)(g(T)). Note that this is an SPJ query, which can be optimized by standard techniques.

2. We apply a forloop operator defined by FL(g(F(attr2)), g(Δ2) − g(Δ1), g(Δ1) − g(Δ2) ∧ g(attr3 θ F2(attr4)))

It can be seen that this query plan computes the correct result for this query by using the definition of the for-loop operator. Here, the aggregate is F(attr2),

4 Obviously, SQL syntax requires that L2 ⊆ L, where L and L2 are lists of attributes. In the following, we assume that queries are well formed.


α is g(Δ2 − Δ1) and β is g(Δ1) − g(Δ2) ∧ g(attr θ F(attr2)). Thus, this plan will first apply Δ1 ∩ Δ2 to T, in order to generate the base relation. Then, the for-loop will compute the aggregate F(attr2) on the result of selecting g(Δ2 − Δ1) on the base relation. Note that (Δ2 − Δ1) ∪ (Δ1 ∩ Δ2) = Δ2, and hence the aggregate is computed over the conditions in the subquery only, as it should. The result of this aggregate is then "appended" to every tuple in the base relation by the Cartesian product (again, note that this description is purely conceptual). After that, the selection on g(Δ1) − g(Δ2) ∧ g(attr3 θ F2(attr4)) is applied. Here we have that (Δ1 − Δ2) ∪ (Δ1 ∩ Δ2) = Δ1, and hence we are applying all the conditions in the main clause. We are also applying the linking condition attr3 θ F(attr2), which can be considered a regular condition now because F(attr2) is present in every tuple. Thus, the forloop operator computes the query correctly. This forloop operator will be implemented by a program that will carry out all needed operators with one scan of the input relation. Clearly, the concrete program is going to depend on the linking operator (θ, assumed to be one of {=, <=, >=, <, >}) and the aggregate function (F, assumed to be one of min, max, sum, count, avg).

The general pattern for type B queries is given next.

SELECT L
FROM T1
WHERE Δ1 AND attr1 θ (SELECT F1(attr2)
                      FROM T2
                      WHERE Δ2 AND S.attr3 θ R.attr4)
{GROUP BY L2}

where R ∈ T1 − T2, S ∈ T2, and we are assuming that T1 − {R} = T2 − {S} (i.e. the FROM clauses contain the same relations except the one introducing the correlated attribute, called R, and the one introducing the correlation attribute, called S). We call T = T1 − {R}. As before, a group by clause is optional. In our approach, we consider the table containing the correlated attribute as part of the FROM clause of the subquery too (i.e. we effectively decorrelate the subquery). Thus, the outer join is always part of our common part. In our plan, there are two steps:

1. Compute the base relation, given by g(Δ1 ∩ Δ2)(T ∪ {R, S}). This includes the outer join of R and S.

2. Compute a grouped forloop defined by FL(attr6, F(attr2), Δ2 − Δ1, Δ1 − Δ2 ∧ attr1 θ F(attr2)), which computes the rest of the query.

Our plan has two main differences with traditional unnesting: the parts common to query and subquery are computed only once, at the beginning of the plan, and computing the aggregate, the linking predicate, and possibly some selections is carried out by the forloop operator in one step. Thus, we potentially deal with larger temporary results, as some selections (those not in Δ1 ∩ Δ2) are not pushed down, but may be able to effect several computations at once (and do not repeat any computation). Clearly, which plan is better depends on the amount of redundancy between query and subquery, the linking condition (which determines how efficient the for-loop operator is), and traditional optimization parameters, like the size of the input relations and the selectivity of the different conditions.

[Figure 1 shows the standard query plan produced for TPCH2: a join tree over Part (with the selection size=15 & type LIKE %BRASS), PartSupp, Supplier, Nation and Region (with the selection name='Europe'); a group-by GB_{ps_partkey, min(ps_supplycost)} over a second join tree on PartSupp, Supplier, Nation and Region (name='Europe'); and a join of the two followed by the selection ps_supplycost = min(ps_supplycost).]

Fig. 1. Standard query plan

2.3 Example and Analytical Comparison

We apply our approach to query TPCH2; this is a typical type B query. For our experiment we created a TPC-H benchmark of the smallest size (1 GB) using two leading commercial DBMSs. We created indices on all primary and foreign keys, updated system statistics, and captured the query plan for query 2 on each system. Both query plans were very similar, and they are represented by the query tree in figure 1. Note that the query is unnested based on Kim's approach (i.e. first group and then join). Note also that all selections are pushed all the way down; they were executed by pipelining with the joins. The main differences between the two systems were the choices of implementations for the joins and different join orderings5. For our concern, the main observation about this query plan is that operations in query and subquery are repeated, even though there clearly is a large amount of repetition6.

We created a query plan for this query, based on our approach (shown in figure 2). Note that our approach does not dictate how the base relation is optimized; the particular plan shown uses the same tree as the original query tree to facilitate comparisons.

[Figure 2 shows the for-loop query plan: a single join tree over PartSupp, Supplier, Nation, Region (with the selection name='Europe') and Part, feeding the operator FL(p_partkey, min(ps_supplycost), (p_size=15 & p_type LIKE %BRASS & ps_supplycost=min(ps_supplycost))).]

Fig. 2. For-loop query plan

It is easy to see that our approach avoids any duplication of work. However, this comes at the cost of fixing the order of some operations (i.e. operations in Δ1 ∩ Δ2 must be done before other operations). In particular, some selections get pushed up because they do not belong into the common part, which increases the size of the relation created as input for the for-loop. Here, TPCH2 returns 460 rows, while the intermediate relation that the for-loop takes as input has 158,960 tuples. Thus, the cost of executing the for-loop may add more than other operations because of a larger input. However, grouping and aggregating took both systems about 10% of the total time7. Another observation is that the duplicated operations do not take double the time, because of cache usage. But this can be attributed to the excellent main memory/database size ratio in our setup; with a more realistic setup this effect is likely to be diminished. Nevertheless, our approach avoids duplicated computation and does result in some time improvement (it takes about 70% of the time of the standard approach). In any case, it is clear that a plan using the for-loop is not guaranteed to be superior to traditional plans under all circumstances. Thus, it is very important to note that we assume a cost-based optimizer which will generate a for-loop plan if at least some amount of redundancy is detected, and will compare the for-loop plan to others based on cost.

5 To make sure that the particular linking condition was not an issue, the query was changed to use different linking aggregates and linking operators; the query plan remained the same (except that for operators other than equality Dayal's approach was used instead of Kim's). Also, memory size was varied from a minimum of 64 M to a maximum of 512 M, to determine if memory size was an issue. Again, the query plan remained the same through all memory sizes.
6 We have disregarded the final Sort needed to complete the query, as this would be necessary in any approach, including ours.
7 This and all other data about time come from measuring performance of appropriate SQL queries executed against the TPC-H database on both systems. Details are left out for lack of space.

3 Conclusions and Further Research

We have argued that Decision-support SQL queries tend to contain redundancy between query and subquery, and this redundancy is not detected and optimized by relational processors. We have introduced a new optimization mechanism to deal with this redundancy, the for-loop operator, and an implementation for it, the for-loop program. We developed a transformation process that takes us from SQL queries to for-loop programs. A comparative analysis with standard relational optimization was shown. The for-loop approach promises a more efficient implementation for queries falling in the patterns given. For simplicity and lack of space, the approach is introduced here applied to a very restricted class of queries. However, we have already worked out extensions to widen its scope (mainly, the approach can work with overlapping (not just identical) FROM clauses in query and subquery, and with different classes of linking conditions). We are currently developing a precise cost model, in order to compare the approach with traditional query optimization using different degrees of overlap, different linking conditions, and different data distributions as parameters. We are also working on extending the approach to several levels of nesting, and studying its applicability to OQL.

References

[1] Badia, A. and Niehues, M. Optimization of Sequences of Relational Queries in Decision-Support Environments, in Proceedings of DAWAK'99, LNCS n. 1676, Springer-Verlag.
[2] Cao, Bin and Badia, A. Subquery Rewriting for Optimization of SQL Queries, submitted for publication.
[3] Dayal, U. Of Nests and Trees: A Unified Approach to Processing Queries That Contain Nested Subqueries, Aggregates, and Quantifiers, in Proceedings of the VLDB Conference, 1987.
[4] Fegaras, L. and Maier, D. Optimizing Queries Using an Effective Calculus, ACM TODS, vol. 25, n. 4, 2000.
[5] Freytag, J. and Goodman, N. On the Translation of Relational Queries into Iterative Programs, ACM Transactions on Database Systems, vol. 14, no. 1, March 1989.
[6] Ganski, R. and Wong, H. Optimization of Nested SQL Queries Revisited, in Proceedings of the ACM SIGMOD Conference, 1987.
[7] Goel, P. and Iyer, B. SQL Query Optimization: Reordering for a General Class of Queries, in Proceedings of the 1996 ACM SIGMOD Conference.
[8] Kim, W. On Optimizing an SQL-Like Nested Query, ACM Transactions on Database Systems, vol. 7, n. 3, September 1982.
[9] Lieuwen, D. and DeWitt, D. A Transformation-Based Approach to Optimizing Loops in Database Programming Languages, in Proceedings of the ACM SIGMOD Conference, 1992.
[10] Lu, H., Chan, H.C. and Wei, K.K. A Survey on Usage of SQL, SIGMOD Record, 1993.
[11] Muralikrishna, M. Improving Unnesting Algorithms for Join Aggregate Queries in SQL, in Proceedings of the VLDB Conference, 1992.
[12] Park, J. and Segev, A. Using Common Subexpressions to Optimize Multiple Queries, in Proceedings of the 1988 IEEE CS ICDE.
[13] Ross, K. and Rao, J. Reusing Invariants: A New Strategy for Correlated Queries, in Proceedings of the ACM SIGMOD Conference, 1998.
[14] Jun Rao, Bruce Lindsay, Guy Lohman, Hamid Pirahesh and David Simmen. Using EELs, a Practical Approach to Outerjoin and Antijoin Reordering, in Proceedings of ICDE 2001.
[15] Praveen Seshadri, Hamid Pirahesh, T.Y. Cliff Leung. Complex Query Decorrelation, in Proceedings of ICDE 1996.
[16] Praveen Seshadri, Joseph M. Hellerstein, Hamid Pirahesh, T.Y. Cliff Leung, Raghu Ramakrishnan, Divesh Srivastava, Peter J. Stuckey, and S. Sudarshan. Cost-Based Optimization for Magic: Algebra and Implementation, in Proceedings of the SIGMOD Conference, 1996.
[17] Inderpal Singh Mumick and Hamid Pirahesh. Implementation of Magic-sets in a Relational Database System, in Proceedings of the SIGMOD Conference 1994.
[18] TPC-H Benchmark, TPC Council, http://www.tpc.org/home.page.html.


Incremental and Decremental Proximal Support Vector Classification using Decay Coefficients

Amund Tveit, Magnus Lie Hetland and Havard Engum

Department of Computer and Information Science, Norwegian University of Science and Technology, N-7491 Trondheim, Norway
{amundt,mlh,havare}@idi.ntnu.no

Abstract. This paper presents an efficient approach for supporting decremental learning for incremental proximal support vector machines (SVM). The presented decremental algorithm based on decay coefficients is compared with an existing window-based decremental algorithm, and is shown to perform at a similar level in accuracy, but providing significantly better computational performance.

1 Introduction

Support Vector Machines (SVMs) are an exceptionally efficient data mining approach for classification, clustering and time series analysis [5, 12, 4]. This is primarily due to SVMs' highly accurate results, which are competitive with other data mining approaches, e.g. artificial neural networks (ANNs) and evolutionary algorithms (EAs). In recent years, tremendous growth in the amount of data gathered (e.g. user clickstreams on the web, in e-commerce and in intrusion detection systems) has changed the focus of SVM classifier algorithms to not only provide accurate results, but to also enable online learning, i.e. incremental and decremental learning, in order to handle concept drift of classes [2, 13].

Fung and Mangasarian introduced the Incremental and Decremental Linear Proximal Support Vector Machine (PSVM) for binary classification [10], and showed that it could be trained extremely fast, i.e. with 1 billion examples (500 increments of 2 million) in 2 hours and 26 minutes on relatively low-end hardware (400 MHz Pentium II). This has later been extended to support efficient incremental multicategorical classification [16]. Proximal SVMs have also been shown to perform at a similar level of accuracy as regular SVMs while being significantly faster [9].

In this paper we propose a computationally efficient algorithm that enables decremental support for Incremental PSVMs using a weight decay coefficient. The suggested approach is compared with the current time-window based approach proposed by Fung and Mangasarian [10].

Y. Kambayashi, M. Mohania, W. Woß (Eds.): DaWaK 2003, LNCS 2737, pp. 422–429, 2003. © Springer-Verlag Berlin Heidelberg 2003


2 Background Theory

The basic idea of Support Vector Machine classification is to find an optimal maximal margin separating hyperplane between two classes. Support Vector Machines use an implicit nonlinear mapping from input space to a higher dimensional feature space using kernel functions, in order to find a separating hyperplane for problems which are not linearly separable in input space [7, 18]. Classifying multiple classes is commonly performed by combining several binary SVM classifiers in a tournament manner, either one-against-all or one-against-one, the latter approach requiring substantially more computational effort [11].

The standard binary SVM classification problem with soft margin (allowing some errors) is shown visually in Fig. 1(a). Intuitively, the problem is to maximize the margin between the solid planes and at the same time permit as few errors as possible, errors being positive class points on the negative side (of the solid line) or vice versa.

[Figure 1: (a) the standard SVM classifier, with bounding planes x'w = γ ± 1 around the separating plane x'w = γ and margin 2/‖w‖, the points of classes A+ and A− lying on or outside the bounding planes; (b) the proximal SVM classifier, where the points of the two classes are instead clustered around the planes x'w = γ ± 1.]

Fig. 1. SVM and PSVM

The standard SVM problem can be stated as a quadratic optimization problem with constraints, as shown in (1).

$$\min_{(w,\gamma,y)\,\in\,\mathbb{R}^{n+1+m}} \;\; \nu e'y + \tfrac{1}{2}\,w'w
\quad \text{s.t.} \quad D(Aw - e\gamma) + y \ge e, \;\; y \ge 0
\tag{1}$$

where $A \in \mathbb{R}^{m\times n}$ holds the training examples, $D$ is a diagonal matrix whose diagonal entries are the class labels in $\{-1,+1\}$, and $e = 1^{m\times 1}$ is a column vector of ones.

Fung and Mangasarian [8] replaced the inequality constraint in (1) with an equality constraint. This changed the binary classification problem, because the points in Fig. 1(b) are no longer bounded by the planes, but are clustered around them. By solving the equation for y and inserting the result into the expression to be minimized, one gets the following unconstrained optimization problem:

$$\min_{(w,\gamma)\,\in\,\mathbb{R}^{n+1}} \; f(w,\gamma) = \tfrac{\nu}{2}\,\|D(Aw - e\gamma) - e\|^2 + \tfrac{1}{2}\,(w'w + \gamma^2)
\tag{2}$$

Setting $\nabla f = \left(\frac{\partial f}{\partial w}, \frac{\partial f}{\partial \gamma}\right) = 0$ one gets:

$$\underbrace{\begin{pmatrix} w \\ \gamma \end{pmatrix}}_{X}
= \begin{pmatrix} A'A + \tfrac{I}{\nu} & -A'e \\ -e'A & \tfrac{1}{\nu} + m \end{pmatrix}^{-1}
\begin{pmatrix} A'De \\ -e'De \end{pmatrix}
= \underbrace{\left(\tfrac{I}{\nu} + E'E\right)^{-1}}_{A^{-1}} \; \underbrace{E'De}_{B}
\tag{3}$$

$$E = [A \;\; -e], \quad E \in \mathbb{R}^{m\times(n+1)}$$
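For concreteness, a small NumPy sketch (ours, not the authors' implementation) of the closed-form solution (3); the inputs A, d and nu are hypothetical.

import numpy as np

def psvm_train(A, d, nu):
    # Solve (I/nu + E'E) x = E'De for x = [w; gamma], with E = [A, -e]
    # and D = diag(d), d being the vector of class labels in {-1, +1}.
    m, n = A.shape
    E = np.hstack([A, -np.ones((m, 1))])        # m x (n+1)
    lhs = np.eye(n + 1) / nu + E.T @ E          # small (n+1) x (n+1) system
    rhs = E.T @ d.reshape(-1, 1)                # E'De = E'd since De = d
    x = np.linalg.solve(lhs, rhs)
    return x[:n], float(x[n])                   # w, gamma

def psvm_predict(A_new, w, gamma):
    return np.sign(A_new @ w - gamma)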

Agarwal has shown that the Proximal SVM is directly transferable to a ridge regression expression [1]. Fung and Mangasarian [10] later showed that (3) can be rewritten to handle increments $(E^i, d^i)$ and decrements $(E^d, d^d)$, as shown in (4). This decremental approach is based on time windows.

$$X = \begin{pmatrix} w \\ \gamma \end{pmatrix}
= \left(\tfrac{I}{\nu} + E'E + (E^i)'E^i - (E^d)'E^d\right)^{-1}
\left(E'd + (E^i)'d^i - (E^d)'d^d\right),
\tag{4}$$

where $d = De$.

3 PSVM Decremental Learning using Weight Decay Coefficient

The basic idea is to reduce the effect of the existing (old) accumulated training knowledge E′E with an exponential weight decay coefficient α.

$$\begin{pmatrix} w \\ \gamma \end{pmatrix}
= \left(\tfrac{I}{\nu} + \alpha\,E'E + (E^i)'E^i\right)^{-1}
\left(\alpha\,E'd + (E^i)'d^i\right); \quad \alpha \in (0, 1]
\tag{5}$$

As opposed to the decremental approach in expression (4), the presented weight decay approach does not require storage of increments $((E^i)'E^i,\,(E^i)'d^i)$ later to be retrieved as decrements $((E^d)'E^d,\,(E^d)'d^d)$.

A hybrid approach is shown in expression (6), where one has both a soft decremental effect using the weight decay coefficient α as well as a hard decremental effect using a fixed window of size W increments.


$$\begin{pmatrix} w \\ \gamma \end{pmatrix}
= \left(\tfrac{I}{\nu} + \alpha\,E'E + (E^i)'E^i - \alpha^{W}\,(E^d)'E^d\right)^{-1}
\left(\alpha\,E'd + (E^i)'d^i - \alpha^{W}\,(E^d)'d^d\right); \quad \alpha \in (0, 1]
\tag{6}$$

4 Related Work

Syed et al. presented an approach for handling concept drift with SVM [2]. Their approach trains on data, and keeps only the support vectors representing the data before (exact) training with new data and the previous support vectors. Klinkenberg and Joachims presented a window adjustment based SVM method for detecting and handling concept drift [13]. Cauwenberghs and Poggio proposed an incremental and decremental SVM method based on a different approximation than used by us [6].

5 Empirical results

In order to test and compare our suggested decremental PSVM learning approach with the existing window-based approach, we created synthetic binary classification data sets with simulated concept drift. These were created by sampling feature values from a multivariate normal distribution where the covariance matrix Ω = I (identity matrix) and the mean vector μ was sliding linearly from only +1 values to −1 values for the positive class, and vice versa for the negative class [14], as shown in Algorithm 1.

Algorithm 1 simConceptDrift(nFeat, nSteps, nExPerStep, start)
Require: nFeat, nSteps, nExPerStep ∈ N and start ∈ R
Ensure: Linear stochastic drift in nSteps steps from start to −start
  center = [start, . . . , start]                          {vector of length nFeat}
  origcenter = center
  for all step in {0, . . . , nSteps − 1} do
    for all synthExampleCount in {0, . . . , nExPerStep − 1} do
      sample example from multivariate Gaussian distribution with μ = center and σ²'s = 1
    end for
    center = origcenter · (1 − 2 · (step + 1)/(nSteps − 1))  {concept drift}
  end for
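A NumPy sketch of this generator under our reading of Algorithm 1 (class means drifting linearly from start to −start with unit variances); the function name and the way labels are attached are our own choices.

import numpy as np

def sim_concept_drift(n_feat, n_steps, n_ex_per_step, start, label, rng=None):
    # Examples are drawn from a multivariate normal with identity covariance;
    # the mean vector slides linearly from [start]*n_feat to [-start]*n_feat.
    rng = np.random.default_rng() if rng is None else rng
    orig_center = np.full(n_feat, float(start))
    X, y = [], []
    for step in range(n_steps):
        center = orig_center * (1 - 2 * step / (n_steps - 1))   # concept drift
        X.append(rng.normal(loc=center, scale=1.0, size=(n_ex_per_step, n_feat)))
        y.append(np.full(n_ex_per_step, label))
    return np.vstack(X), np.concatenate(y)

# Positive class drifts from +1 to -1, negative class from -1 to +1:
# Xp, yp = sim_concept_drift(10, 40, 250, +1.0, +1)
# Xn, yn = sim_concept_drift(10, 40, 250, -1.0, -1)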

5.1 Classification Accuracy

For the small concept drift test (20000 examples with 10 features and 40 increments of 500 examples, figure 2(a)), the weight decay of α = 0.1 performs slightly better in terms of unlearning than a window size of W = 5, and a weight decay of α = 0.9 performs between unlearning with W = 10 and W = 20; the unlearning performance varies quite a bit with α.

For the medium concept drift test (200000 examples with 10 features and 400 increments of 500 examples, figure 2(b)), the value of α matters less, due to more increments being shown and the faster exponential effect of the weight decay coefficient than in the small concept drift test.

As seen in both figures 2(a) and 2(b), there is a "dip" in classification performance around their respective center points (increment number ≈ 20 and 200). This is caused by concept drift, i.e. the features of the positive and negative class are indiscernible.

[Figure 2 plots classification accuracy (%) against increment number for (a) a short timespan (20000 examples) and (b) a medium timespan (200 000 examples), comparing the non-decremental learner, window sizes W = 5, 10, 20 (short) and W = 10, 100 (medium), and decay coefficients α = 0.1 and α = 0.9.]

Fig. 2. Classification Accuracy under Concept Drift

5.2 Computational Performance

As shown in figure 3, the computational performance (measured in wallclock time) of the weight decay based approach is almost twice as fast as the window-based approach, except for large windows (e.g. W = 1000). The performance difference seems to decrease with increasing increment size; this is supported by the P-values from T-test comparisons. 21 out of 27 T-tests (tables 1-3) showed a significant difference in favor of the weight decay based approach over the window-based approach. The T-tests were based on timings of ten repeated runs of each presented configuration of α, W and increment size.


[Figure 3 plots average time (seconds) against increment size (50 to 5000), for 2 000 000 examples, comparing decay coefficients α = 0.1, 0.5, 0.9 with window sizes W = 10, 100, 1000.]

Fig. 3. Computational Performance (Long timespan)

        w=10   w=100  w=1000
α=0.1   0.00   0.00   0.01
α=0.5   0.00   0.00   0.00
α=0.9   0.00   0.00   0.00

Table 1. P-values for increment size 50 (Comp. Perf.)

        w=10   w=100  w=1000
α=0.1   0.00   0.00   0.25
α=0.5   0.00   0.00   0.39
α=0.9   0.00   0.00   0.67

Table 2. P-values for increment size 500 (Comp. Perf.)

        w=10   w=100  w=1000
α=0.1   0.00   0.00   0.67
α=0.5   0.00   0.00   0.79
α=0.9   0.00   0.00   0.57

Table 3. P-values for increment size 5000 (Comp. Perf.)

5.3 Implementation and Test environment

The incremental and decremental proximal SVM has been implemented in C++ using the CLapack and ATLAS libraries [3, 19]. Support for Python and Java interfaces to the library is currently under development using the "Simplified Wrapper and Interface Generator" [15]. A Linux cluster (Athlon 1.4-1.66 GHz nodes, Sorceror Linux) has served as the test environment.

Acknowledgements

We would like to thank Professor Mihhail Matskin and Professor Arne Halaas. This work is supported by the Norwegian Research Council.

6 Conclusion and Future Work

We have introduced a weight decay based decremental approach for proximal SVMs and shown that it can replace the current window-based approach. The weight decay based approach is significantly faster than the window-based approach (due to fewer I/O requirements) for small-to-medium increment and window sizes; this is supported by simulation and p-values from T-tests.

Future work includes applying the approach to demanding incremental classification and prediction problems, e.g. game usage mining [17]. Algorithmic improvements that need to be done include 1) developing incremental multiclass balancing mechanisms, 2) investigating the appropriateness of parallelized incremental proximal SVMs, and 3) strengthening the implementation with support for tuning sets and kernels.

References

1. Deepak K. Agarwal. Shrinkage Estimator Generalizations of Proximal Support Vector Machines. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 173–182. ACM Press, 2002.
2. Nadeem Ahmed, Huan Liu, and Kah Kay Sung. Handling Concept Drifts in Incremental Learning with Support Vector Machines. In Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining, pages 317–321. ACM Press, 1999.
3. E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen. LAPACK Users' Guide. Society for Industrial and Applied Mathematics, Philadelphia, PA, third edition, 1999.
4. Asa Ben-Hur, David Horn, Hava T. Siegelmann, and Vladimir Vapnik. Support Vector Clustering. Journal of Machine Learning Research, 2:125–137, 2001.
5. Robert Burbidge and Bernhard F. Buxton. An introduction to support vector machines for data mining. In M. Sheppee, editor, Keynote Papers, Young OR12, pages 3–15, University of Nottingham, March 2001. Operational Research Society.
6. Gert Cauwenberghs and Tomaso Poggio. Incremental and Decremental Support Vector Machine Learning. In Advances in Neural Information Processing Systems (NIPS'2000), volume 13, pages 409–415. MIT Press, 2001.
7. Nello Cristianini and John Shawe-Taylor. An Introduction to Support Vector Machines and other kernel-based learning methods, chapter 6, pages 93–111. Cambridge University Press, 1st edition, 2000.
8. Glenn Fung and Olvi L. Mangasarian. Multicategory Proximal Support Vector Classifiers. Submitted to Machine Learning Journal, 2001.
9. Glenn Fung and Olvi L. Mangasarian. Proximal support vector machine classifiers. In Proceedings of the 7th ACM Conference on Knowledge Discovery and Data Mining, pages 77–86. ACM, 2001.
10. Glenn Fung and Olvi L. Mangasarian. Incremental Support Vector Machine Classification. In R. Grossman, H. Mannila, and R. Motwani, editors, Proceedings of the Second SIAM International Conference on Data Mining, pages 247–260. SIAM, April 2002.
11. Chih-Wei Hsu and Chih-Jen Lin. A Comparison of Methods for Multi-class Support Vector Machines. IEEE Transactions on Neural Networks, 13(2):415–425, 2002.
12. Jeffrey Huang, Xuhui Shao, and Harry Wechsler. Face pose discrimination using support vector machines (SVM). In Proceedings of the 14th Int'l Conf. on Pattern Recognition (ICPR'98), pages 154–156. IEEE, 1998.
13. Ralf Klinkenberg and Thorsten Joachims. Detecting Concept Drift with Support Vector Machines. In Pat Langley, editor, Proceedings of the Seventeenth International Conference on Machine Learning (ICML). Morgan Kaufmann, 2000.
14. Kenneth Lange. Numerical Analysis for Statisticians, chapter 7.3, pages 80–81. Springer-Verlag, 1999.
15. Simplified Wrapper and Interface Generator. Online, http://www.swig.org/, March 2003.
16. Amund Tveit and Magnus Lie Hetland. Multicategory Incremental Proximal Support Vector Classifiers. In Proceedings of the 7th International Conference on Knowledge-Based Information & Engineering Systems (forthcoming), Lecture Notes in Artificial Intelligence. Springer-Verlag, 2003.
17. Amund Tveit and Gisle B. Tveit. Game Usage Mining: Information Gathering for Knowledge Discovery in Massive Multiplayer Games. In Proceedings of the International Conference on Internet Computing (IC'2002), session on Web Mining. CSREA Press, June 2002.
18. Vladimir N. Vapnik. The Nature of Statistical Learning Theory, chapter 5, pages 138–146. Springer-Verlag, 2nd edition, 1999.
19. Richard C. Whaley, Antoine Petitet, and Jack J. Dongarra. Automated Empirical Optimization of Software and the ATLAS Project. Parallel Computing, 27(1-2):3–25, 2001.


Page 444: document